What's the oldest data record you have? Ten years of canceled checks? Yellowed love letters? A family Bible?
The U.S. Department of the Interior's Bureau of Land Management has some data that stretches back more than 100 years. The bureau keeps track of oil and gas leases, rights of way, mining claims and land patents for 300 million publicly owned acres across the U.S.
This 1 billion-record storehouse sits in four databases on a 15-year-old, year 2000-noncompliant, Honeywell-based mainframe system. But the bureau wants to bring the whole thing up to date to make it accessible to the public.
As the bureau prepares to transfer that data into an Informix Corp. relational database, it needs to analyze the source data to be sure it's consistent and accessible, said Leslie Cone, a Denver-based project manager at the bureau. "We need to do this to make sure we know where to put it and make sure we understand the relationships between the data so that we can correctly build our data models," Cone said.
On the most basic level, Cone has to make sure that every format in which a territory is mentioned -- New Mexico, N.M., New Mex. or NM -- is understood to mean the same thing. And for the bureau, the relationships among the data are important because one land citation may have records about oil, water, mining claims and rights of way attached to it.
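The kind of normalization Cone describes can be sketched with a simple lookup table that maps every known spelling of a state onto one canonical code. This is an illustrative sketch only -- the variant list and function name are hypothetical, not part of the bureau's actual software:

```python
# Map the many spellings of a state name onto one canonical code.
# The variant table below is illustrative, not the bureau's real list.
STATE_VARIANTS = {
    "new mexico": "NM",
    "n.m.": "NM",
    "n. m.": "NM",
    "new mex.": "NM",
    "nm": "NM",
}

def normalize_state(raw: str) -> str:
    """Return the canonical two-letter code, or the input unchanged
    if no variant matches."""
    key = " ".join(raw.lower().split())  # lowercase, collapse whitespace
    return STATE_VARIANTS.get(key, raw)

print(normalize_state("New Mex."))  # NM
print(normalize_state("N. M."))     # NM
print(normalize_state("Arizona"))   # Arizona (unknown, passed through)
```

In practice such a table would be built from the distinct values actually found in the source files, which is exactly the kind of inventory a profiling pass produces.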
When the bureau started the project last April, staffers tried to check source data quality by testing data samples. "But we ended up spending a lot of time analyzing whether our samples were representative and not really getting much closer to the data itself," Cone said. "We just have too much data and too many relationships between the data to do that efficiently."
Data warehouse and data mining projects often suffer delays because the people running the projects don't know enough about the source data, said Richard Winter, an independent consultant at Winter Corp. in Boston. "Source data comes from systems that were implemented by people who are long gone and that are inadequately documented. Anybody who has loaded a large volume of data into a database knows that you spend a tremendous amount of time and energy dealing with the nuances and exceptions of data values and relationships," Winter said.
"Until we found software that could handle it, we were even trying to compare samples by hand," recalled a frustrated Cone. The software the bureau turned up is a US$250,000 package called Migration Architect from Evoke Software Corp. in San Francisco. It lets bureau staffers look directly at the data and clean it up before transferring it.
"We can look at the [data] rows and know if we have the right links, the right number of characters and whether it was correctly entered," Cone said. The software scans data and picks out only the pieces that seem to be in the wrong format or that differ in other ways from the model the bureau expected to see, reducing the amount of data to be examined.
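The exception-flagging approach Cone describes can be illustrated with a pattern check: scan every row and surface only the values that fail the expected format, so reviewers examine the exceptions rather than the whole billion-record store. The field names, record-ID format and sample rows below are hypothetical, assumed purely for illustration:

```python
import re

# Assumed record-ID format for illustration: two-letter state code,
# hyphen, six digits (e.g. "NM-012345").
EXPECTED_ID = re.compile(r"^[A-Z]{2}-\d{6}$")

# Hypothetical sample rows: (record_id, record_type)
rows = [
    ("NM-012345", "oil lease"),
    ("nm12345", "gas lease"),       # wrong case, missing hyphen and digit
    ("CO-987654", "mining claim"),
    ("CO-98765", "right-of-way"),   # too few digits
]

# Keep only the rows that deviate from the expected pattern.
exceptions = [(rid, kind) for rid, kind in rows if not EXPECTED_ID.match(rid)]

for rid, kind in exceptions:
    print(f"flagged: {rid!r} ({kind})")
```

A real profiling tool infers candidate patterns, keys and cross-field relationships from the data itself rather than relying on a hand-written regex, but the payoff is the same: a short list of suspect rows instead of a full manual review.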
Ralph Nordstrom, data warehouse architect at the Automobile Club of Southern California (ACSC) in Costa Mesa, California, also knows how time-consuming analyzing data quality can be. The ACSC -- a branch of AAA -- uses Migration Architect to move data from legacy flat files into its 15G-byte Oracle Corp. data warehouse.
"If data in the warehouse is only 80 percent accurate, you can't make important business decisions based on it," Nordstrom said. "I would never attempt to do this type of analysis without a tool like this."