Acronyms have haunted every aspect of systems development for decades, and the data-movement domain is no exception. Three key processes commonly used to move data from one application or system to another are known collectively as ETL: extraction, transformation, and loading.
In broad terms, ETL applications extract data from a source database, transform the data into a format suitable for a target database, and then load the data into the target database. In this analysis we provide an overview of ETL processing.
To initiate an ETL process, programmers use extraction routines to read records in a source database and make the data in those records available for transformation processing. To extract data from a source database, programmers have three choices: they can write customised programs, rely on specialised ETL tools, or use a combination of both.
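As a minimal sketch of a customised extraction routine, the following Python fragment reads every record from a source table and exposes it as a dictionary for downstream transformation. The table name and columns are hypothetical, and SQLite stands in for whatever source database an environment actually uses.

```python
import sqlite3

def extract_rows(conn, table):
    """Read each record from a source table and yield it as a dict."""
    cursor = conn.execute(f"SELECT * FROM {table}")  # illustrative only
    columns = [col[0] for col in cursor.description]
    for row in cursor:
        yield dict(zip(columns, row))

# Hypothetical source database holding a few invoice records.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE invoices (sku TEXT, quantity INTEGER, unit_price REAL)")
conn.executemany("INSERT INTO invoices VALUES (?, ?, ?)",
                 [("A100", 2, 9.99), ("B200", 1, 24.50)])

records = list(extract_rows(conn, "invoices"))
```

An ETL tool performs the equivalent of `extract_rows` for you, across many database products, without this per-source coding.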
In practice, most programmers bolster third-party tools with customised programs. Referred to as user exits, these programs perform specialised functions that are unique to each environment.
Third-party ETL products are typically more effective and less costly than customised programs. Many ETL tools provide programmers with a single, intuitive interface that can be used to extract data from multiple database products. Businesses that rely on a hodgepodge of databases will benefit from unified access to those products.
ETL tools are also preferable to custom programs because they can be used "out of the box"; there is no need to write code to open files, read records, and join tables - these products perform those functions for you. Furthermore, leading ETL products include prebuilt extraction routines designed for popular ERP (enterprise resource planning) applications. In many cases, a single ETL tool can meet most of your data-movement requirements.
Once extraction routines have collected the data, transformation routines can prepare that data for its new home. There are several major transformation techniques, including aggregation, value translation, field derivation, and cleansing. Let's use a hypothetical data mart to examine these processes.
Data marts are specialised applications that allow end-users to analyse broad trends in a highly intuitive manner. For instance, a marketing analyst might use a sales data mart to examine revenue per product for each of the past five years. Unlike order processing, manufacturing, and other operational applications, data marts do not require detailed information. In fact, summarised data is preferable because it reduces response times and enhances ease of use.
Before loading a data mart, programmers typically aggregate data. Aggregation routines replace numerous detail records with relatively few summary records. For example, suppose that a year's worth of sales data is stored in several thousand records in a normalised database.
Through aggregation, this data is transformed into fewer summary records that will be written to the sales data mart. Although programmers could write code to manually aggregate data, ETL tools are more efficient because they allow programmers to summarise data in one step, with no coding.
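To make the aggregation step concrete, here is a small Python sketch that collapses detail records into one summary record per product and year. The field names are hypothetical; an ETL tool would perform this summarisation through configuration rather than code.

```python
from collections import defaultdict

def aggregate_sales(detail_records):
    """Collapse detail rows into one summary row per (product, year)."""
    totals = defaultdict(float)
    for rec in detail_records:
        totals[(rec["product"], rec["year"])] += rec["amount"]
    return [{"product": p, "year": y, "revenue": total}
            for (p, y), total in sorted(totals.items())]

# Three detail records become two summary records.
detail = [
    {"product": "widget", "year": 2023, "amount": 100.0},
    {"product": "widget", "year": 2023, "amount": 250.0},
    {"product": "gadget", "year": 2023, "amount": 75.0},
]
summary = aggregate_sales(detail)
```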
Value translation is another common data-transformation technique. Operational databases store encoded information to limit data redundancy and storage requirements. For example, SKUs (stock-keeping units) may be stored in invoice files because they are shorter than their associated product descriptions.
Because data marts contain summarised information and are designed for ease of use, programmers typically replace encoded data with clearer descriptions. Although programmers could accomplish this task with custom code, ETL tools are more efficient because they allow programmers to use value-translation lists to decode data.
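A value-translation list amounts to a lookup table. The sketch below, with a hypothetical SKU mapping, shows the decoding an ETL tool would apply declaratively.

```python
# Hypothetical value-translation list mapping SKUs to product descriptions.
SKU_DESCRIPTIONS = {
    "A100": "14-inch laptop",
    "B200": "Wireless mouse",
}

def translate_sku(record):
    """Replace an encoded SKU with a reader-friendly description."""
    out = dict(record)
    out["product_description"] = SKU_DESCRIPTIONS.get(record["sku"], "UNKNOWN")
    return out

translated = translate_sku({"sku": "A100", "quantity": 2})
```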
Field derivation is a third technique used to transform data. Through field derivation, new information is created for end users. For example, suppose our operational database contains one field for sales quantity and one for unit price. Rather than have end users calculate revenue, programmers could create a revenue field during transformation. Leading ETL products enable programmers to use mathematical operations, statistical functions, string manipulation, date arithmetic and conditional logic to derive new fields.
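The revenue example above can be sketched in a few lines of Python; the field names are hypothetical stand-ins for whatever the operational database actually uses.

```python
def derive_revenue(record):
    """Derive a revenue field from sales quantity and unit price."""
    out = dict(record)
    out["revenue"] = round(record["quantity"] * record["unit_price"], 2)
    return out

row = derive_revenue({"quantity": 3, "unit_price": 19.99})
```

End users then read `revenue` directly from the data mart instead of computing it in every query.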
A fourth transformation routine, cleansing, has many uses. Programmers rely on cleansing algorithms to keep inaccurate data out of other systems. For example, cleansing routines typically verify that numeric fields contain numeric data, that dates and numbers are valid and reasonable, and so on. Cleansing routines can also be used in cases when one unique value is represented by a database in several ways. For example, IBM might be depicted as IBM Co, International Business Machines, IBM, and so on. During cleansing, multiple versions of the same data element are replaced with a single value.
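Both cleansing uses mentioned above can be illustrated with a short Python sketch: a numeric check on a quantity field, and an alias table (hypothetical, built around the IBM example) that collapses name variants to a single value.

```python
# Hypothetical alias table: many spellings, one canonical value.
COMPANY_ALIASES = {
    "ibm co": "IBM",
    "international business machines": "IBM",
    "ibm": "IBM",
}

def cleanse(record):
    """Verify that quantity is numeric and canonicalise the customer name."""
    cleaned = dict(record)
    # Verify that the quantity field really contains a number.
    try:
        cleaned["quantity"] = int(record["quantity"])
    except (TypeError, ValueError):
        raise ValueError(f"non-numeric quantity: {record['quantity']!r}")
    # Replace multiple versions of the same name with a single value.
    key = record["customer"].strip().lower().rstrip(".")
    cleaned["customer"] = COMPANY_ALIASES.get(key, record["customer"].strip())
    return cleaned

cleaned = cleanse({"quantity": "5", "customer": "International Business Machines"})
```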
After data has been transformed for the target database, programmers use load procedures to write that information to the new database. During this phase, you must determine whether to propagate data periodically or continuously. Periodic replenishment occurs regularly, such as daily, weekly, or monthly. Sometimes known as snapshot propagation, this approach captures the state of operational systems at a specific moment. If users require current information, continuous propagation can load data into the target database in real time. Continuous replication requires a dedicated, high-speed communications channel.
Most ETL tools support both periodic and continuous data loads. Advanced products also let you perform propagation on a net-change basis, which allows you to transmit only modified data, and consequently requires minimal communications overhead.
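One simple way to realise net-change propagation, assuming each record carries a primary key, is to compare the current extract against the previously loaded snapshot and transmit only the rows that differ. This is a sketch of the idea, not how any particular ETL product implements it (many use database logs instead).

```python
def net_changes(previous, current):
    """Return only the rows added or modified since the last load.

    Both arguments are dicts keyed by primary key.
    """
    changed = []
    for key, row in current.items():
        if previous.get(key) != row:
            changed.append(row)
    return changed

previous = {1: {"id": 1, "qty": 5}, 2: {"id": 2, "qty": 3}}
current  = {1: {"id": 1, "qty": 5},   # unchanged: not transmitted
            2: {"id": 2, "qty": 4},   # modified
            3: {"id": 3, "qty": 1}}   # new
changed = net_changes(previous, current)
```

Only two of the three rows cross the wire, which is where the reduced communications overhead comes from.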
You can further categorise data loading by the method used to replicate data. In push replication, the source application "pushes" transformed data to the target application. In pull replication, the target application "pulls" data as it's needed, such as when an end user runs a query. You can also design load procedures that use both push and pull processes. In this approach, the source application typically pushes the data to a staging database, where it is transformed and then pulled into the target application as needed. This "mixed-mode" approach requires more disk space but can enhance performance.
Whether you use custom coding, third-party ETL tools, or a combination of the two to move data from one system to another, a thorough understanding of ETL practices, as well as of your source and target applications, is essential to success. Given the importance and price of many ETL solutions, you should also insist on a trial period so that you may test each solution in your environment.
Data extraction, transformation and loading

Business Case: Most companies move data across multiple platforms, databases, and applications. ETL products are an effective way to share data because they relieve programmers of many time-consuming, error-prone tasks. Given the cost of programming talent, ETL products offer a considerable value proposition.

Technology Case: ETL products let programmers quickly move data from one environment to another without requiring custom code to open files, join tables, read records, aggregate data, and so on. As a result, programmers can implement and manage ETL initiatives more effectively.

Pros:
- Provides unified access to multiple databases
- Lets programmers quickly move data across systems

Cons:
- Relatively high cost
- Annual licence fees