By Hari Mailvaganam
There is nothing either good or bad,
But thinking makes it so.
--William Shakespeare, Hamlet, II:2
Text mining is becoming more prevalent with projects such as WebFountain from IBM and Project Aura from Microsoft. Both are used by large corporations, such as BP, to track trends in the company's reputation. These are large text mining projects, however, and remain remote from the needs of most businesses.
Until recently, the cost of text mining was prohibitive in most corporate environments. Text mining comes in many hues and shapes; the larger vendors are SAS and SPSS. DWreview was approached by a major North American insurance company to evaluate the feasibility of using text mining for fraud detection.
DWreview had recently implemented a data warehousing solution for the client, which included implementing a metadata repository and a predictive data mining solution for market intelligence.
For fraud detection, the client had a separate data repository where model scoring was performed. Based on the model score, reports would be queried against the data warehouse to produce the claims that were suspect. This process was inefficient for a number of reasons:
- Fraud detection analysis had to be conducted by a specialist, who passed the scores on to the fraud detection team. The team performed the investigation and evaluated the merits of the claims. This was a disjointed process, in which the data mining scores were never improved based on investigation results.
- Fraud detection was not responsive to sudden changes in claim patterns. Natural disasters, such as hurricanes, produce a spike in similar claims, and the data mining score could not adapt to such scenarios.
- Data mining was confined to actuarial specialists rather than day-to-day managers.
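The score-then-query workflow described above can be sketched as follows. This is a purely illustrative toy, not the client's system: the field names, weights, and the 0.8 investigation threshold are all assumptions for the sake of the example.

```python
# Hypothetical sketch of a score-then-query fraud workflow.
# All field names, weights, and the threshold are illustrative assumptions.

def score_claim(claim):
    """Toy scoring model: weight a few illustrative risk indicators."""
    score = 0.0
    if claim["amount"] > 50_000:
        score += 0.4  # unusually large claim
    if claim["days_since_policy_start"] < 30:
        score += 0.3  # claim filed soon after policy inception
    if claim["prior_claims"] >= 3:
        score += 0.3  # history of frequent claims
    return min(score, 1.0)

def suspect_claims(claims, threshold=0.8):
    """Return the claims whose model score meets the investigation threshold."""
    return [c for c in claims if score_claim(c) >= threshold]

claims = [
    {"id": 1, "amount": 60_000, "days_since_policy_start": 10, "prior_claims": 4},
    {"id": 2, "amount": 5_000, "days_since_policy_start": 400, "prior_claims": 0},
]
print([c["id"] for c in suspect_claims(claims)])  # only claim 1 is flagged
```

The static weights in this sketch also illustrate the second inefficiency above: with no feedback loop from investigators and no mechanism to adjust for events such as hurricanes, the scores cannot adapt.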
Initially, DWreview examined the text mining solutions offered by the major data mining vendors. While these proved rich in features, they were cumbersome to use and lacked the finesse needed for ongoing daily use. It soon became apparent that implementing such a solution would not be cost effective for the client's needs.
Developing a Customized Text Mining Solution
The customized solution was developed in three modules. A scripting engine was designed and built for the data extraction layer, which pulls reports, either manually or as a scheduled task, from the data warehouse repositories. The reports created by the claims examiners are stored in Microsoft Word 2000 and follow the guidelines set by the metadata repository. The extraction module is written in Perl, which makes it easy for end users to modify.
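The heart of such an extraction layer is parsing tagged fields out of the examiners' reports according to the metadata guidelines. The client's module was written in Perl; the sketch below uses Python for illustration, and the field tags and report layout are assumptions, not the actual metadata conventions.

```python
# Illustrative sketch of the extraction step (the client's version was in Perl;
# the field tags and layout here are assumptions, not the real guidelines).
import re

REPORT = """CLAIM_ID: 10-4432
EXAMINER: J. Smith
NARRATIVE: Vehicle reported stolen two days after policy inception.
"""

def extract_fields(text):
    """Pull tagged fields of the form 'NAME: value' into a dict."""
    fields = {}
    for match in re.finditer(r"^([A-Z_]+):\s*(.+)$", text, re.MULTILINE):
        fields[match.group(1)] = match.group(2).strip()
    return fields

print(extract_fields(REPORT)["CLAIM_ID"])  # prints "10-4432"
```

Because the logic reduces to a handful of regular expressions, end users can adjust the extraction rules without touching the rest of the pipeline, which is the accessibility benefit noted above.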
Figure 1. Text Mining Modules
The text mining module contains the data mining scores, based on historical analysis of the likelihood of fraud. The algorithms were custom developed around the text entered in the claims examiners' reports and the details of each claim. The model was developed specifically for the client by DWreview; because it can give the client a competitive advantage, its technical details are kept as a closely guarded corporate secret.
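To make the general idea of text-based scoring concrete, here is a deliberately simple cue-phrase model. It is purely hypothetical: the client's actual algorithms are proprietary, and neither the cue list nor the point weights below reflect them in any way.

```python
# Purely hypothetical cue-phrase scorer; the client's real algorithms are
# proprietary and nothing here is derived from them.
FRAUD_CUES = {
    "stolen": 2,        # integer point weights, capped at 10 below
    "inception": 3,
    "cash": 2,
    "no witnesses": 3,
}

def text_score(narrative):
    """Sum the points of cue phrases found in the examiner's narrative."""
    text = narrative.lower()
    points = sum(w for cue, w in FRAUD_CUES.items() if cue in text)
    return min(points, 10)

print(text_score("Vehicle stolen shortly after policy inception; paid in cash."))
# prints 7 (stolen + inception + cash)
```

A production model would combine such textual signals with the structured claim details, as the article describes, rather than scoring the narrative in isolation.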
Reports are generated in a web-based application layer. The data produced for the reports is also fed into SAP for Insurance, the ERP application used by the client and common among the larger insurance companies.
Figure 2. Process Chart for Conducting Text Mining
Many text mining applications give users open-ended freedom to explore text for meaning. Text mining can also be used as a deeper, more penetrating method that goes beyond surfacing items of possible interest to sensing the mood of the written text: are the articles generally positive on a certain subject?
While such open-ended, undirected mining may be suitable in some text mining scenarios, the associated cost can be very high and the results carry lower confidence. This may be acceptable in the vague business cases described above, but not in industries with tight business cases that demand cost-effective, practical solutions.
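The "mood of the text" analysis mentioned above can be illustrated with a minimal polarity count. The word lists are illustrative assumptions; real sentiment analysis uses far richer lexicons and handles negation, but the principle is the same.

```python
# Minimal polarity-count sketch of sentiment ("mood") analysis.
# The word lists are illustrative assumptions, not a real sentiment lexicon.
POSITIVE = {"good", "strong", "improved", "praised"}
NEGATIVE = {"bad", "weak", "declined", "criticized"}

def mood(text):
    """Classify text by the balance of positive vs. negative words."""
    words = text.lower().split()
    balance = sum(w in POSITIVE for w in words) - sum(w in NEGATIVE for w in words)
    return "positive" if balance > 0 else "negative" if balance < 0 else "neutral"

print(mood("Analysts praised the improved results"))  # prints "positive"
```

Even this toy version hints at why the open-ended approach is costly: tuning the lexicons and validating the output against human judgment is where the real expense lies.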
Why the Quote from Shakespeare?
Data mining is about finding patterns in data, and patterns are found when relationships are discovered in data sets. As many an esteemed data miner has commented, patterns are all relative. To extract the patterns of value, close thought must be given first to the business case, and only then to correlating it with the data sets. Hence the quote from Shakespeare's Hamlet.
While data mining is a methodology that can help users in this task, ultimately it is understanding the business case that makes data mining a practical reality in the corporate environment.
Please contact us if you have any questions or suggestions.