Rules for data cleansing

Data cleansing is one of the most demanding parts of data warehousing. It includes data scrubbing and discovering business rules embedded within data.

Before you start, you need to know what to expect and how many mops and buckets you'll need.

First, consider this. About 75 per cent of the time spent on data warehouse/data mart projects is spent on back-end issues.

Some other rules to consider:

- You probably can't clean all the

data; there's too much of it. But

you won't need to clean it all

because not all of it needs to be

in the warehouse. Twenty per cent

of the data may meet 80 per cent

of the essential needs.

- Cleanse only those portions of

data that have real business

value. Business analysts should

be primarily responsible for

determining the business value of

data. Also, look at usage. Watch

out for users who say a chunk of

data is essential but is never even

looked at. The users should help

ensure data cleanliness by

working side-by-side with IS.

- The first symptom of dirty data is

a pattern of inconsistent data.

You may notice that the same

customers are listed under a

dozen different spellings or

abbreviations. Data elements may

have different names in different

systems. Or the same name may

be used for different data

elements.

- When you clean legacy data,

don't put the clean data back into

the source system. Legacy

systems are often delicate, and

"fixing" their data may break the

old system. Put the clean data in

the warehouse.

- Data cleaning can involve

re-engineering legacy data. If you

must deal with more than a small

amount of legacy data, you would

be wise to consider automated

software tools for conversion and

migration. An automated soft-

ware tool can extract mapping

rules from actual data in min

utes, instead of the hours it would

take to do it manually.

Join the newsletter!

Error: Please check your email address.
Show Comments