Data Cleansing for Data Warehousing

How important is Extract, Transform, Load (ETL) to data Warehousing?

By Ari Baumgarten

Politicians raising money can be used as an analogy to compare data cleansing to data warehousing. There is almost no likelihood of one existing without the other. Data cleansing is often the most time intensive, and contentious, process for data warehousing projects.

What is Data Cleansing?

The elevator pitch: "Data cleansing ensures that undecipherable data does not enter the data warehouse. Undecipherable data will affect reports generated from the data warehouse via OLAP, Data Mining and KPI's."

A very simple example of where data cleansing would be utilized is how dates are stored in separate applications. Example: 11th March 2007 can be stored as '03/11/07' or '11/03/07' among other formats. A data warehousing project would require the different date formats to be transformed to a uniform standard before being entered in the data warehouse.

Why Extract, Transform and Load (ETL)?

Extract, Transform and Load (ETL) refers to a category of tools that can assist in ensuring that data is cleansed, i.e. conforms to a standard, before being entered into the data warehouse. Vendor supplied ETL tools are considerably more easy to utilized for managing data cleansing on an ongoing basis. ETL sits in front of the data warehouse, listening for incoming data. If it comes across data that it has been programmed to transform, it will make the change before loading the data into the data warehouse.

ETL tools can also be utilized to extract data from remote databases either through automatically scheduled events or via manual intervention. There are alternatives to purchasing ETL tools and that will depend on the complexity and budget for your project. Database Administrators (DBAs) can write scripts to perform ETL functionality which can usually suffice for smaller projects. Microsoft's SQL Server comes with a free ETL tool called Data Transforming Service (DTS). DTS is pretty good for a free tool but it does has limitations especially in the ongoing administration of data cleansing.

Example of ETL vendors are Data Mirror, Oracle, IBM, Cognos and SAS. As with all product selections, list what you think you would require from an ETL tool before approaching a vendor. It may be worthwhile to obtain the services of consultants that can assist with the requirements analysis for product selection.

Figure 1. ETL sits in front of Data Warehouses

How important is Data Cleansing and ETL to the success of Data Warehousing Projects?

ETL is often out-of-sight and out-of-mind if the data warehouse is producing the results that match stakeholders expectations. As a results ETL has been dubbed the silent killer of data warehousing projects. Most data warehousing projects experience delays and budget overruns due to unforeseen circumstances relating to data cleansing.

How to Plan for Data Cleansing?

It is important is start mapping out the data that will be entered into the data warehouse as early as possible. This may change as the project matures but the documentation trail will come in extremely valuable as you will need to obtain commitments from data owners that they will not change data formats without prior notice. 

Create a list of data that will require Extracting, Transforming and Loading. Create a separate list for data that has a higher likelihood of changing formats. Decide on whether you need to purchase ETL tools and set aside an overall budget. Obtain advice from experts in the field and evaluate if the product fits into the overall technical hierarchy of your organization.

If you have and questions or suggestions, please contact us