What is Text Mining? Text Databases and Information Retrieval

Text Mining

In simple terms, text mining refers to extracting useful information from a large collection of data. It is also known as text data mining, which means deriving high-quality information from existing text.



 

Text Databases and IR

Text databases (document databases)

  • Large collections of documents from various sources: news articles, research papers, books, digital libraries, e-mail messages, Web pages, library databases, etc.
  • Data stored is usually semi-structured
  • Traditional information retrieval techniques become inadequate for the increasingly vast amounts of text data

Information retrieval

  • A field developed in parallel with database systems
  • Information is organized into (a large number of) documents
  • Information retrieval problem: locating relevant documents based on user input, such as keywords or example documents




 

Information Retrieval

Typical IR systems

  • Online library catalogs
  • Online document management systems

Information retrieval vs. database systems

  • Some DB problems are not present in IR, e.g., update, transaction management, complex objects
  • Some IR problems are not addressed well in DBMS, e.g., unstructured documents, approximate search using keywords and relevance

 


Basic Measures for Text Retrieval

 

  • Precision: the percentage of retrieved documents that are in fact relevant to the query (i.e., “correct” responses).

    precision = |{Relevant} ∩ {Retrieved}| / |{Retrieved}|

  • Recall: the percentage of documents that are relevant to the query and were, in fact, retrieved.

    recall = |{Relevant} ∩ {Retrieved}| / |{Relevant}|

  • An information retrieval system often needs to trade off recall for precision or vice versa.
  • A commonly used trade-off is the F-score, which is defined as the harmonic mean of recall and precision:

    F-score = (2 × precision × recall) / (precision + recall)


The harmonic mean discourages a system that sacrifices one measure for another too drastically.
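
As a rough illustration of the three measures defined above, the following Python sketch computes them from two sets of document IDs. The function name `precision_recall_f` and the sample IDs are made up for this example; they do not come from any particular IR system.

```python
def precision_recall_f(relevant, retrieved):
    """Return (precision, recall, F-score) for two collections of document IDs."""
    relevant, retrieved = set(relevant), set(retrieved)
    hits = len(relevant & retrieved)  # relevant documents that were retrieved

    # Guard against empty sets so the ratios are always defined.
    precision = hits / len(retrieved) if retrieved else 0.0
    recall = hits / len(relevant) if relevant else 0.0

    if precision + recall == 0:
        f_score = 0.0
    else:
        # Harmonic mean of precision and recall.
        f_score = 2 * precision * recall / (precision + recall)
    return precision, recall, f_score


# Hypothetical data: 4 relevant documents, 5 retrieved, 3 in common.
relevant = {"d1", "d2", "d3", "d4"}
retrieved = {"d2", "d3", "d4", "d7", "d9"}
p, r, f = precision_recall_f(relevant, retrieved)
print(f"precision={p:.2f} recall={r:.2f} F={f:.2f}")
# prints: precision=0.60 recall=0.75 F=0.67
```

Because the harmonic mean is dominated by the smaller value, a system with precision 1.0 and recall 0.1 gets an F-score of only about 0.18, well below the arithmetic mean of 0.55.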
 


Information Retrieval Techniques

Basic Concepts

  • A document can be described by a set of representative keywords called index terms.
  • Different index terms have varying relevance when used to describe document contents.
  • This effect is captured through the assignment of numerical weights to each index term of a document (e.g., frequency, tf-idf).
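
To make the idea of index term weights concrete, here is a minimal tf-idf sketch. It assumes one common variant (raw term count multiplied by log(N / df)); other formulations exist, and the function name `tf_idf` and the sample documents are illustrative only.

```python
import math
from collections import Counter


def tf_idf(docs):
    """Return one {term: weight} dict per document.

    tf  = raw count of the term in the document
    idf = log(N / df), where df is the number of documents containing the term
    """
    tokenized = [doc.lower().split() for doc in docs]
    n_docs = len(tokenized)

    # Document frequency: in how many documents does each term occur?
    df = Counter()
    for tokens in tokenized:
        df.update(set(tokens))

    weights = []
    for tokens in tokenized:
        tf = Counter(tokens)
        weights.append({t: tf[t] * math.log(n_docs / df[t]) for t in tf})
    return weights


docs = ["car repair shop", "car insurance", "airplane repair manual"]
for doc, w in zip(docs, tf_idf(docs)):
    print(doc, {t: round(v, 3) for t, v in w.items()})
```

In this variant, a term that occurs in every document gets weight 0, while a term confined to a single document (such as "shop") is weighted more heavily than one shared by several documents (such as "car").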

DBMS Analogy

  • Index Terms -> Attributes
  • Weights -> Attribute Values

Index Terms (Attribute) Selection:

  • Stop list
  • Word stem
  • Index terms weighting methods
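
The selection steps listed above can be sketched as follows: drop stop words, reduce words to a stem, then count the remaining index terms per document to form a terms × documents frequency matrix. The stop list and the suffix-stripping `stem` function below are deliberately crude stand-ins chosen for this example; a real system would typically use a proper stemmer such as Porter's.

```python
from collections import Counter

STOP_LIST = {"a", "the", "of", "for", "to", "with", "and", "is", "are"}  # illustrative
SUFFIXES = ("ing", "ed", "s")  # crude stand-in for a real stemming algorithm


def stem(word):
    """Strip one common suffix, keeping at least three characters."""
    for suffix in SUFFIXES:
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word


def index_terms(text):
    """Lowercase, strip punctuation, remove stop words, and stem."""
    words = [w.strip(".,;:!?").lower() for w in text.split()]
    return [stem(w) for w in words if w and w not in STOP_LIST]


def term_document_matrix(docs):
    """Return (terms, matrix) where matrix[i][j] counts term i in document j."""
    counts = [Counter(index_terms(d)) for d in docs]
    terms = sorted(set().union(*counts))
    matrix = [[c[t] for c in counts] for t in terms]
    return terms, matrix


docs = ["Repairing the car", "Cars are repaired with tools"]
terms, matrix = term_document_matrix(docs)
for term, row in zip(terms, matrix):
    print(f"{term:8s} {row}")
# prints:
# car      [1, 1]
# repair   [1, 1]
# tool     [0, 1]
```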

Terms × Documents Frequency Matrices

Information Retrieval Models:

  • Boolean Model
  • Vector Model
  • Probabilistic Model

Boolean Model

  • Index terms are considered to be either present or absent in a document
  • As a result, the index term weights are all binary
  • A query is composed of index terms linked by three connectives: not, and, or
  • E.g., car and repair, plane or airplane
  • The Boolean model predicts that each document is either relevant or non-relevant, based on whether the document matches the query
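
A small sketch of the Boolean model, assuming queries are written as nested tuples rather than parsed from text: each term maps to the set of documents that contain it, and the three connectives become set operations. The names `build_index` and `evaluate` and the sample documents are hypothetical.

```python
def build_index(docs):
    """Map each term to the set of document IDs that contain it (binary weights)."""
    index = {}
    for doc_id, text in docs.items():
        for term in set(text.lower().split()):
            index.setdefault(term, set()).add(doc_id)
    return index


def evaluate(query, index, all_ids):
    """Evaluate a query: a term, ("not", q), ("and", q1, q2), or ("or", q1, q2)."""
    if isinstance(query, str):
        return index.get(query, set())
    op = query[0]
    if op == "not":
        return all_ids - evaluate(query[1], index, all_ids)
    left = evaluate(query[1], index, all_ids)
    right = evaluate(query[2], index, all_ids)
    if op == "and":
        return left & right
    if op == "or":
        return left | right
    raise ValueError(f"unknown connective: {op}")


docs = {
    1: "car repair guide",
    2: "plane tickets",
    3: "airplane repair manual",
    4: "car rental",
}
index = build_index(docs)
all_ids = set(docs)

print(sorted(evaluate(("and", "car", "repair"), index, all_ids)))            # [1]
print(sorted(evaluate(("or", "plane", "airplane"), index, all_ids)))         # [2, 3]
print(sorted(evaluate(("and", "repair", ("not", "car")), index, all_ids)))   # [3]
```

Every document is either in the result set or not; the model provides no ranking among the matching documents.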

Keyword-Based Retrieval

  • A document is represented by a string, which can be identified by a set of keywords
  • Queries may use expressions of keywords
    • E.g., car and repair shop, tea or coffee, DBMS but not Oracle
    • Queries and retrieval should consider synonyms, e.g., repair and maintenance
  • Major difficulties of the model
    • Synonymy: A keyword T does not appear anywhere in the document, even though the document is closely related to T, e.g., data mining
    • Polysemy: The same keyword may mean different things in different contexts, e.g., mining
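
One simple way to soften the synonymy problem is to expand each query keyword with known synonyms before matching. The sketch below assumes a tiny hand-written synonym table (`SYNONYMS`) and a hypothetical `search` function; it does nothing about polysemy, which needs context rather than a lookup table.

```python
# Hypothetical synonym table; a real system might draw on a thesaurus.
SYNONYMS = {"repair": {"maintenance"}, "maintenance": {"repair"}}


def expand(keyword):
    """Return the keyword together with its known synonyms."""
    return {keyword} | SYNONYMS.get(keyword, set())


def search(docs, required, excluded=()):
    """Return IDs of documents that contain every required keyword
    (or one of its synonyms) and none of the excluded keywords."""
    results = []
    for doc_id, text in docs.items():
        words = set(text.lower().split())
        has_required = all(words & expand(k) for k in required)
        has_excluded = any(words & expand(k) for k in excluded)
        if has_required and not has_excluded:
            results.append(doc_id)
    return results


docs = {
    1: "car repair shop",
    2: "car maintenance tips",
    3: "oracle dbms tuning",
    4: "open source dbms",
}
print(search(docs, ["car", "repair"]))             # [1, 2]
print(search(docs, ["dbms"], excluded=["oracle"])) # [4]
```

Document 2 never mentions "repair", yet it is returned for the first query because "maintenance" is treated as a synonym.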

Similarity-Based Retrieval in Text Data

  • Finds similar documents based on a set of common keywords
  • The answer should reflect the degree of relevance, judged by the nearness of the keywords, the relative frequency of the keywords, etc.
  • Stop list
    • Set of words that are deemed “irrelevant”, even though they may appear frequently
    • E.g., a, the, of, for, to, with, etc.
    • Stop lists may vary when the document set varies
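
As a sketch of similarity-based retrieval, the code below removes stop words, represents each text as a vector of keyword frequencies, and ranks documents by cosine similarity to the query. Cosine similarity is one common choice of measure, assumed here for illustration; the text above does not prescribe a specific formula, and the names `term_vector`, `cosine_similarity`, and `rank` are made up for this example.

```python
import math
from collections import Counter

STOP_LIST = {"a", "the", "of", "for", "to", "with"}  # taken from the examples above


def term_vector(text):
    """Count keyword frequencies, ignoring stop words."""
    return Counter(w for w in text.lower().split() if w not in STOP_LIST)


def cosine_similarity(a, b):
    """Cosine of the angle between two term-frequency vectors."""
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def rank(query, docs):
    """Return (score, document) pairs, most similar first."""
    q = term_vector(query)
    scored = [(cosine_similarity(q, term_vector(d)), d) for d in docs]
    return sorted(scored, reverse=True)


docs = ["the repair of a car", "car rental", "tea with coffee"]
for score, doc in rank("car repair", docs):
    print(f"{score:.2f}  {doc}")
# prints:
# 1.00  the repair of a car
# 0.50  car rental
# 0.00  tea with coffee
```

Unlike the Boolean model, this yields a graded ranking: "car rental" shares only one keyword with the query and scores lower than the document that shares both.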