By Robert L. English
DAT-A is a data mining and OLAP platform for MySQL. The goal of DAT-A is to design a high-end analytical product starting from the needs of the end user and then to follow through on the technology. Data mining software is noted for its poor usability. This, together with the abstract nature of analytics, has meant relatively low uptake of data mining in the commercial world.
Data mining has a useful role to play, but very often the total cost of ownership far outweighs any business benefit. Biotechnology and financial firms find data mining an essential part of their competitive advantage. However, these industries find that the cost of running data warehouses and performing analytics can be a heavy burden.
While working with Hari Mailvaganam on building a data warehouse based on commodity Linux boxes, we designed a platform for performing analytics on a distributed data storage center.
The volume of data collected by businesses is increasing rapidly. Most of this data has business relevance and cannot simply be shunted to archival storage when the cost of storage increases. This is especially true for analytics, where trends need to be observed over an extended period for greater accuracy.
The data center and data mining solution below was designed for a retail client with global operations. Retail is a high cost, low margin (HCLM) and extremely competitive industry. The cost of keeping data on an active storage medium was proving prohibitive for the client. For competitive reasons, it was not possible to simply ignore the data and pare down costs.
Together with Hari Mailvaganam, we designed a distributed data center which replaced the mainframe environment that existed previously. We presented management with an alternative to an expensive server environment: a cluster of over 200 Linux boxes running MySQL databases.
Figure 1. Data Cluster using Linux Boxes
Using commodity Linux boxes offered tremendous cost savings over servers. As the amount of data stored exceeded the terabyte range, it was prudent to index the data and store it in a distributed manner across the data cluster.
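One common way to spread indexed data over a cluster is to hash each record key to a node number. The Python sketch below illustrates that general idea only; the article does not say how the original system assigned data to boxes, and the function names, key format, and node count used here are assumptions made for illustration.

```python
import hashlib


def node_for_key(key: str, num_nodes: int) -> int:
    """Map a record key to a node index using a stable hash.

    hashlib is used instead of the built-in hash() so that the
    mapping is identical across processes and machines.
    """
    digest = hashlib.md5(key.encode("utf-8")).hexdigest()
    return int(digest, 16) % num_nodes


def build_placement_index(keys, num_nodes):
    """Return a dict of node index -> list of keys stored on that node."""
    index = {}
    for key in keys:
        index.setdefault(node_for_key(key, num_nodes), []).append(key)
    return index


# Hypothetical record keys and a hypothetical 200-node cluster.
sample_keys = ["store-001", "store-002", "store-003", "store-004"]
placement = build_placement_index(sample_keys, num_nodes=200)
for node, keys_on_node in sorted(placement.items()):
    print(node, keys_on_node)
```

A plain modulo scheme like this moves most keys when the number of nodes changes, so a cluster that grows over time would more likely use a lookup table or consistent hashing.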
MySQL was not quite as obvious a choice as it may have seemed. While extremely fast and reliable, MySQL lacks many of the sophisticated features needed for data warehousing and data mining. We overcame this hurdle by using the InnoDB transactional database engine. We built a remote database management system for the MySQL/Linux cluster that allows data administrators to visualize the "data spread" over the center - this is similar to the table view found in most popular RDBMS products. The data spread diagram gave data administrators the ability to manipulate and transfer data remotely, without needing to log on to each individual box. The data spread application also allowed the user to build data cubes for OLAP analysis.
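For readers unfamiliar with data cubes, the following Python sketch shows the basic idea: a measure is pre-aggregated over every combination of dimensions, so that any roll-up can be read off directly. It is a minimal in-memory illustration and not the data spread application itself; the row layout, dimension names, and the `build_cube` function are hypothetical.

```python
from collections import defaultdict
from itertools import combinations


def build_cube(rows, dimensions, measure):
    """Sum `measure` for every combination of the given dimensions.

    Each cell is keyed by a tuple of (dimension, value) pairs.
    The empty tuple () holds the grand total.
    """
    cube = defaultdict(float)
    for row in rows:
        for size in range(len(dimensions) + 1):
            for dims in combinations(dimensions, size):
                cell = tuple((d, row[d]) for d in dims)
                cube[cell] += row[measure]
    return dict(cube)


# Hypothetical retail sales records.
sales = [
    {"region": "EU", "product": "shoes", "sales": 10},
    {"region": "EU", "product": "hats", "sales": 5},
    {"region": "US", "product": "shoes", "sales": 7},
]

cube = build_cube(sales, ["region", "product"], "sales")
print(cube[()])                     # grand total
print(cube[(("region", "EU"),)])    # roll-up by region
print(cube[(("product", "shoes"),)])  # roll-up by product
```

In a MySQL setting the same roll-ups would normally be computed with `GROUP BY` queries and stored in summary tables; the in-memory version simply makes the structure of the cube easy to see.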
We approached the design of the data mining solution by asking business executives what their needs were. Working backward from those needs, we could then identify the algorithms required and create the data mining platform. There are two methods by which users can perform data mining. The first uses a stovepipe approach that packages the business needs tightly together with the data mining algorithms. It allows business users to perform data mining directly, but gives them limited latitude in choosing the methodologies. The second method gives users the freedom to explore the data and choose among a number of algorithms. The users envisioned for the second method are more seasoned data miners.
Please contact DWreview if you have any questions or suggestions.