Text Mining

This article gives an overview of Text Mining and describes an application the author is developing for it.

By Hari Mailvaganam

Text mining has been on the radar screen of corporate users since the mid-eighties. Technical limitations and the overall complexity of utilizing data mining have been hurdles to text mining that very few organizations have surmounted. The early proponents of text mining were surveillance agencies such as the Central Intelligence Agency (CIA) and Britain's MI6.

The security agencies have a government mandate to intercept data traffic and evaluate it for items of interest. This also involves intercepting international fax transmissions electronically and mining them for patterns of interest.

With a covert start, text mining is coming out into the open. Some of the reasons are:

  • Storage cost reduction - data are more likely to be stored in an electronic medium even after being declared non-active.
  • Data volume increase - the exponential growth of data with the lowering of data transmission cost and increasing usage of the Internet.
  • Fraud detection and analysis - there are compelling reasons for organizations to redress fraud. The federal government has mandated new laws to curtail corporate fraud - HIPAA, GLBA, Sarbanes-Oxley, SEC and NASDAQ compliance.
  • Competitive advantage - text mining is used to better understand the realms of data in an organization. An example: customers contact a company via the call center and by email. The notes of the call center reps are archived along with the emails. Text mining can be used to discover clusters of interesting patterns in customer interactions - "Are Ford Explorers more likely to roll over when used with Firestone tires?" or "Are errors more likely to occur on the Windows XP platform?"

Text data is so-called unstructured data. Unstructured data implies that the data are freely stored in flat files (e.g. Microsoft Word documents) and are not classified.

Structured data are found in well-designed data warehouses. The meaning of the data is well-known, usually through meta-data description, and the analysis can be performed directly on the data.

Unstructured data has to jump through an additional hoop before it can be meaningfully analyzed - information will need to be extracted from the raw text.


Extraction involves grabbing the data from the medium in which it is stored. This can be file directories, Storage Area Networks, microfilm, data warehouses or data marts. The data can be stored in any number of formats: ASCII, Doc, PDF, database records, flat files.

There are tools that can convert unstructured data stored in PDF, Word, Text files to XML. This can greatly help the data mining process and is termed text augmentation. The term text augmentation emphasizes the fact that the inferred information is not separated from the stream of text but embedded into it in XML tags. 
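As a minimal sketch of text augmentation, the Python snippet below wraps known terms in XML tags while leaving the surrounding text in place. The lookup table, the `entity` tag name and the `augment` function are hypothetical names chosen for illustration; they are not taken from any particular conversion tool.

```python
import re
from xml.sax.saxutils import escape

# Hypothetical lookup table: surface term -> inferred category.
ENTITY_TYPES = {
    "Ford Explorer": "vehicle",
    "Firestone": "manufacturer",
}

def augment(text, entity_types=ENTITY_TYPES):
    """Wrap each known term in an XML tag, keeping the rest of the text in place."""
    # Try longer terms first so that a phrase wins over any shorter term inside it.
    terms = sorted(entity_types, key=len, reverse=True)
    pattern = re.compile("|".join(re.escape(t) for t in terms))
    parts = []
    last = 0
    for match in pattern.finditer(text):
        # Escape the untouched text so the result stays well-formed XML.
        parts.append(escape(text[last:match.start()]))
        term = match.group(0)
        parts.append('<entity type="%s">%s</entity>' % (entity_types[term], escape(term)))
        last = match.end()
    parts.append(escape(text[last:]))
    return "<doc>" + "".join(parts) + "</doc>"

note = "Customer reports a Ford Explorer with Firestone tires & a flat."
print(augment(note))
```

The inferred category travels with the text as an attribute, so a later mining step can read both the original wording and the annotation from the same stream.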

Making Sense of Textual Data

Having human analysts go through every byte of data would be cost-prohibitive. There are document handling technologies that can automate the process. The process most often used in text mining applications for analyzing textual data is Natural Language Processing (NLP). There is an excellent book, Foundations of Statistical Natural Language Processing, which gives an overview of NLP.

NLP is a stepwise process that can lead towards structuring of text data. There are a number of software products that have incorporated NLP for performing Text Mining.

There are also a number of other freely available algorithms to extract information from raw text.

Characterization of Data

The first step is setting up filters and referring to a semantic dictionary in order to characterize the text in a meaningful manner.
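One way to picture this step is the small Python sketch below. It assumes the filter is a stop-word list and the semantic dictionary is a plain mapping from term to concept; both tables and the `characterize` function are made up for illustration, and a real application would use much larger resources.

```python
import re

# Hypothetical filter: common words that carry little meaning on their own.
STOP_WORDS = {"the", "a", "an", "is", "on", "and", "of", "to", "with", "when"}

# Hypothetical semantic dictionary: term -> concept.
SEMANTIC_DICTIONARY = {
    "rollover": "incident",
    "crash": "incident",
    "tire": "component",
    "tires": "component",
    "brake": "component",
}

def characterize(text):
    """Return (token, concept) pairs for the tokens that survive the filter."""
    tokens = re.findall(r"[a-z0-9]+", text.lower())
    kept = [t for t in tokens if t not in STOP_WORDS]
    # Terms missing from the dictionary are labelled "unknown" rather than dropped.
    return [(t, SEMANTIC_DICTIONARY.get(t, "unknown")) for t in kept]

print(characterize("The tires failed and a rollover occurred"))
```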



Data Mining

Data mining technology is applied to the text data to discover association rules and patterns.
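To illustrate what an association rule looks like on text, the sketch below treats each document as a set of terms and reports rules of the form "A implies B" together with their support (the share of documents containing both terms) and confidence (the share of documents containing A that also contain B). The function name, the thresholds and the sample documents are assumptions made for this example; it is a brute-force pairwise count, not a full Apriori implementation.

```python
from itertools import combinations

def association_rules(documents, min_support=0.5, min_confidence=0.7):
    """Find single-term rules (lhs, rhs, support, confidence) across documents."""
    n = len(documents)
    term_count = {}
    pair_count = {}
    for doc in documents:
        terms = set(doc)
        for t in terms:
            term_count[t] = term_count.get(t, 0) + 1
        # Sorting gives every pair one canonical order.
        for pair in combinations(sorted(terms), 2):
            pair_count[pair] = pair_count.get(pair, 0) + 1
    rules = []
    for (a, b), count in pair_count.items():
        support = count / n
        if support < min_support:
            continue
        # Test the rule in both directions, since confidence is not symmetric.
        for lhs, rhs in ((a, b), (b, a)):
            confidence = count / term_count[lhs]
            if confidence >= min_confidence:
                rules.append((lhs, rhs, support, confidence))
    return sorted(rules)

# Made-up documents, already reduced to their key terms.
docs = [
    ["explorer", "firestone", "rollover"],
    ["explorer", "firestone"],
    ["explorer", "brake"],
    ["firestone", "rollover"],
]
print(association_rules(docs))
```

With these sample documents, the only rule that clears both thresholds is "rollover implies firestone": every document that mentions a rollover also mentions Firestone, while the reverse holds in only two of three cases.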

This stage usually involves considering various models and choosing the best one based on predictive performance. For text mining, the models will not differ as much as the constraints placed on searches. The tighter the constraints, the higher the likelihood that the results will contain fewer false positives. The drawback is that the rate of false negatives will be higher. In everyday business use cases it is desirable to mask this complexity by letting end-users select from pre-determined constraint levels.
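The trade-off can be sketched in a few lines of Python. Here a "constraint level" is assumed to be the minimum fraction of query terms a document must contain; the level names, the thresholds and the sample data are hypothetical and only meant to show how a tighter setting trades false positives for false negatives.

```python
# Hypothetical presets: the fraction of query terms a document must contain.
CONSTRAINT_LEVELS = {"loose": 0.3, "balanced": 0.6, "strict": 1.0}

def search(documents, query_terms, level="balanced"):
    """Return the indices of documents that match enough of the query terms."""
    threshold = CONSTRAINT_LEVELS[level]
    query = set(query_terms)
    hits = []
    for i, doc in enumerate(documents):
        matched = len(query & set(doc)) / len(query)
        if matched >= threshold:
            hits.append(i)
    return hits

def errors(hits, relevant):
    """Count false positives and false negatives against a hand-labelled set."""
    hits, relevant = set(hits), set(relevant)
    return {
        "false_positives": len(hits - relevant),
        "false_negatives": len(relevant - hits),
    }

# Made-up documents; indices 0 and 1 are labelled relevant by hand.
docs = [
    ["explorer", "firestone", "rollover"],
    ["explorer", "rollover"],
    ["explorer", "brake"],
    ["firestone", "invoice"],
]
query = ["explorer", "firestone", "rollover"]
relevant = [0, 1]

for level in CONSTRAINT_LEVELS:
    print(level, errors(search(docs, query, level), relevant))
```

On this sample, the loose level returns two false positives and no false negatives, while the strict level returns no false positives but misses one relevant document. The end-user only picks a level name and never sees the threshold behind it.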

Data Visualization

Data visualization helps end-users analyze data from various perspectives. The data obtained from the text can include word frequency, relative frequency, area dependence, time sequence and so on.
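As a rough sketch of the simplest of these measures, the Python snippet below computes word frequency and relative frequency and renders them as a plain-text bar chart. The function names and the sample sentence are illustrative assumptions; a real visualization tool would draw interactive charts rather than print characters.

```python
import re
from collections import Counter

def word_frequencies(text):
    """Map each word to (count, relative frequency) within the text."""
    tokens = re.findall(r"[a-z0-9]+", text.lower())
    counts = Counter(tokens)
    total = len(tokens)
    return {word: (count, count / total) for word, count in counts.items()}

def text_bars(freqs, width=20):
    """Render one line per word, most frequent first, with a bar scaled to width."""
    ordered = sorted(freqs.items(), key=lambda kv: (-kv[1][0], kv[0]))
    lines = []
    for word, (count, relative) in ordered:
        bar = "#" * round(relative * width)
        lines.append("%-10s %s %d" % (word, bar, count))
    return lines

freqs = word_frequencies("tire tire tire brake brake engine")
for line in text_bars(freqs):
    print(line)
```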

The interactive data visualization process helps to find useful facts from the data mining results. The visualization tool can be used to manipulate the mining database results to optimize the required data patterns.