To put it bluntly, performing extensive extract, transform and load (ETL) processes is a symptom of poorly managed data and a fundamental lack of a cogently developed data strategy. When data is managed correctly as an enterprise asset, then ETL is significantly reduced and in many cases completely eradicated. Now, I realize that this is a provocative statement, but in my estimation, ETL is overused within the IT community, leading to inefficiency and unnecessary expense.
ETL gained popularity as companies began to outgrow antiquated systems. As functionality was moved from legacy systems to open systems architectures, ETL played an indispensable role in moving the data. Unfortunately, many companies failed to completely retire their outdated systems; rather than performing ETL as a one-time initial load event, ETL evolved into a part of daily operations.
This problem was further exacerbated as companies developed systems within functional silos. The application-specific approach, in which the database is designed to accommodate the needs of an individual group or department, took root. According to this methodology, every new system requires its own database. As a result, data is copied from system to system. Hence, ETL is now firmly ensconced in nearly every company and is an integral part of IT operations.
Consider a simplified example of typical ETL activities, in which data is propagated from the product system into warranty, finance, purchasing and sales systems, and eventually into the data warehouse. Not only is the data extracted and loaded, but it must also be transformed because the data structures between systems are completely disparate.
This problem is compounded when the data is propagated back to the source system in order to synchronize the data that's no longer synchronized between systems precisely because it was copied. The inevitable result is poor data quality and high maintenance costs.
If the product database in this example changes -- for instance, if a new field or table is added -- it will be necessary to change all of the maps that move data from the source to a target. One minor structural change in the source can create a maintenance nightmare in the ETL maps and target databases -- a lot of IT expense with no value added.
The only legitimate ETL activity in this example is the data warehouse interface. All others are unnecessary and incur a tremendous cost. According to Larry English, president of Information Impact International and a leading expert in information quality, "The IS staff is busy maintaining, on average, a tenfold [increase in] redundant databases and the redundant applications or interface programs that re-create or move the data." Ouch.
The solution to the problem is really quite simple, but it requires a new way of thinking. The answer is a single, sharable, enterprise-strength database for each major subject area. The key is to design each of these databases to meet the needs of all of the information consumers.
With the advent of new technology, there is absolutely no need to propagate data from system to system, thereby creating unnecessary (and expensive) redundancy. Applications can now access data via shared services from different databases.
Information is a corporate asset and doesn't belong to one group or department. Therefore, databases must be designed and implemented with both producer and consumer in mind. The entire company must be involved in defining entities and attributes and modelling relationships.
To support this new way of thinking, applications can't be designed in a vacuum. IT must re-engineer applications to use common, shared databases centred on the significant subject areas. The database, and not the application, is the centre of the universe.
There's a simple rule to determine whether a company has an effective data strategy: count the number of ETL processes and redundant databases. The higher the number, the less likely it is that the company manages its data correctly.
Next, ask, "Why create another database with similar data?" Often the answer is that the data is not in the right format or doesn't support the new application, or the source is incomplete and inaccurate. And then I know that the company is wasting money and resources and performing unnecessary, costly maintenance and development rather than producing value-added services based on sound data management techniques.