Open-source data aces
- 03 December, 2008 10:14
If any software market deserved to be shaken up by open source alternatives, it's enterprise data integration. Commercial, enterprise-grade integration tools -- typically cobbled together from M&A and legacy patchworks -- are notoriously unwieldy and impose an arduous learning curve. Complexity frequently stalls deployments by months, and aftermarket consulting can add hundreds of thousands of dollars to the TCO.
Enter Jitterbit and Talend, two fresh starts in a land of stodgy giants. Talend hits all the highlights one would look for in traditional integration platforms: batch delivery, transforms, ETL (extract, transform, and load), data governance, and a strong set of connectivity adapters. At the same time it keeps pace with important trends with such features as change data capture, metadata support, federated views, and SOA-based access to data services. Talend is capable of scaling from small departmental file migrations to large-scale enterprise warehousing projects.
Jitterbit, by contrast, is the classic case of less is more: a lighter-weight and extensible point solution that can shortcut simple migration projects by weeks. If you're in need of a quick fix for a one-off data migration project -- to quickly move from Salesforce.com to SugarCRM, for example -- Jitterbit's simple, menu-driven interface takes a lot of the tedium out of profiling application data.
These products may not yet surpass the master data management and messaging transform prowess of IBM Information Server, or the legacy and b-to-b domain expertise found in Informatica PowerCenter. But they offer substantial cost savings compared to these commercial counterparts, and their ability to shortcut complexity makes them additionally hard to resist.
Jitterbit 2.0: Master of migration
If you've used the 1.x versions of Jitterbit, you're sure to appreciate the improvements packed into Jitterbit 2.0. The new graphical UI, akin to a BPM modeler for business analysts, is one of the easiest workflow builders I've used. Parallel processing and data chunking have been added -- good for speeding up bulk moves to targets such as Salesforce.com that throttle the number of records per transaction. Plus, Jitterbit can now consume and expose SOAP Web services.
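To make the chunking idea concrete, here is a minimal Python sketch of splitting a bulk load into fixed-size batches and sending them in parallel. It is not Jitterbit's implementation: `send_batch`, the batch size of 200, and the worker count are all hypothetical stand-ins for whatever limit and API the target system actually imposes.

```python
from concurrent.futures import ThreadPoolExecutor
from typing import Iterator, List, Sequence


def chunked(records: Sequence[dict], size: int) -> Iterator[List[dict]]:
    """Yield consecutive slices of at most `size` records."""
    if size < 1:
        raise ValueError("size must be at least 1")
    for start in range(0, len(records), size):
        yield list(records[start:start + size])


def send_batch(batch: List[dict]) -> int:
    """Hypothetical stand-in for one API call to the target system.

    A real implementation would post the batch and return the number
    of records the target accepted.
    """
    return len(batch)


def bulk_load(records: Sequence[dict], batch_size: int = 200, workers: int = 4) -> int:
    """Send all records in batches, several batches at a time.

    Returns the total number of records sent.
    """
    with ThreadPoolExecutor(max_workers=workers) as pool:
        return sum(pool.map(send_batch, chunked(records, batch_size)))
```

Keeping each batch under the target's per-transaction limit avoids rejected calls, while the thread pool overlaps the network wait of several batches at once.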
Unlike Talend, Jitterbit follows a client/server model and uses a centralized scheduling and processing engine. I was impressed by Jitterbit's small footprint and decent performance, but the centralized architecture could present a bottleneck in high-volume scenarios. Scalability is also limited by the absence of features for cluster management, load balanced routing, and real-time business monitoring.
Jitterbit can connect to a variety of sources, using ODBC or JDBC, but there are no native drivers as found in Talend. There is also no JMS support and no direct support for working with PDFs or EDI (Electronic Data Interchange) files. An SDK plug-in allows you to leverage external Java rules, but you're otherwise limited to working in Jitterbit's proprietary scripting language.
There are other shortcomings. Although debugging is enhanced by live data views, actual interprocess debugging capabilities are absent. Projects involving multiple data sources with requirements such as data de-duplication and orphaned record management will still require preprocessing. Jitterbit's forte is normalizing or denormalizing translations rather than actually scrubbing data.
Also of note: I was disappointed to discover the Jitterbit Integration Server phoning home behind my back. While generating server usage reports via the admin shell, I witnessed the Jitterbit server send a blind copy of data to jitterbit.com -- despite my expressly opting out of the User Experience Program during the installation process. Jitterbit indicated that none of the data it siphons is personally identifiable. I've yet to evaluate the claim.
Jittering the bits
Jitterbit won InfoWorld's 2008 Bossie Award for Data Migration for good reason: Jitterbit is perhaps the most uncomplicated tool available to get your data from point A to point B. In my testing, Jitterbit made simple work of configuring source and target specifications with its form-based wizards. Although database table relationships must be defined manually, the tool did a fine job picking up Web service details via WSDL. In addition to databases and Web services, Jitterbit can also pull data from XML, FTP, HTTP, LDAP, and flat files.
Transformation mappings are configured via a drag-and-drop wiring process. A simple double-click on a node spawns a separate interface for building formulas to modify data en route. Here you can draw on decent string manipulation tools and regular expressions. Math and logic functions could use some filling out, but a variety of other functions -- for handling XML, date and time, and e-mail -- round out the options. You can even pull live data into the transform for on-the-fly validation.
I found job scheduling to be very flexible, and the granular ability to set runtime priorities was a plus. The onboard dependency checker is also smart, helping to provide impact analysis for easier change management across operations, including WSDL file updates.
Additional features, including a quick test of active connections and ongoing project validation, helped polish the experience. Collapsible panes and auto-formatting in the process designer help keep designs orderly. However, a thumbnail overview would make it easier to navigate larger projects. A few other minor nits -- including slow sync during object renaming and the lack of an onboard SQL builder -- were similarly easy to live with.
Administration of the Jitterbit server is done using the same client interface. User and group access controls are good. Access to projects and sub-objects can be configured to ensure that only the right users have write permission -- a nice touch.
However, queue and server management is limited. Instead of granular administrative control over engines and queues, I found manually refreshed logs and limited opportunity for intervention into stuck jobs. The ability to drill down into processes from the admin UI would be a good idea, as would the ability to reschedule stuck jobs or view live data streamed from multiple servers simultaneously.
On the plus side, Jitterbit projects can be encapsulated and exported into Jitterpaks to simplify migration among dev and production systems. Jitterbit even operates a trading post for its community to buy and sell prepackaged Jitterpak solutions.
Jitterbit can't be considered a full-blown integration platform -- yet. However, despite its shortcomings, I found Jitterbit to be very good at what it does best -- namely, application data migration. Its transformation tools, though basic, are good, and its repository encourages best practices and reuse. If you're looking to push batch data around, you should consider Jitterbit to alleviate the headaches that frequently complicate -- and delay -- even seemingly simple migration projects.
Talend Open Studio 3.0.1: The real deal
Talend has developed a holistic integration platform from the ground up in a very short time. If the company continues on its current trajectory, it could do for data integration what open source has already accomplished for servers and databases.
New features in Version 3 go a long way toward bolstering enterprise viability. In addition to a native SAP connector (extract and sync), developers will appreciate component search, an ecosystem overview of projects, change impact analysis, and drag-and-drop metadata.
Perhaps most important, Talend has added change data capture (specifically, via slowly changing dimensions). Change data capture enables real-time updates that significantly reduce the size of data transfers -- an increasingly important efficiency measure for data sets that have grown so large that there's no longer enough time to complete batch runs in the overnight hours.
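The reason change data capture shrinks transfers is that only the rows that differ get shipped. The Python sketch below illustrates the idea with a simple snapshot comparison keyed on a primary key. This is an assumption chosen for illustration, not a description of how Talend detects changes, and the function and type names are hypothetical.

```python
from typing import Dict, List, Tuple

Row = Dict[str, object]
Change = Tuple[str, int, Row]


def capture_changes(previous: Dict[int, Row], current: Dict[int, Row]) -> List[Change]:
    """Compare two snapshots keyed by primary key and list the differences.

    Rows that are identical in both snapshots produce no output, so only
    the delta needs to travel to the target.
    """
    changes: List[Change] = []
    for key, row in current.items():
        if key not in previous:
            changes.append(("insert", key, row))
        elif previous[key] != row:
            changes.append(("update", key, row))
    for key, row in previous.items():
        if key not in current:
            changes.append(("delete", key, row))
    return changes
```

With a large table in which only a handful of rows changed since the last run, the list returned here holds only those rows rather than the full data set.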
What I really like about Talend is its code-generating approach -- a practice that fell by the wayside in favor of higher-level, user-friendly tools built around a centralized, proprietary engine. Although the proprietary "black boxes" often help streamline development, they can also lead to processing bottlenecks and scalability issues.
By contrast, Talend jobs can be packaged up and deployed anywhere a Java Virtual Machine or Perl interpreter can reside. Jobs can also be embedded directly into your Java apps or even encapsulated as REST/SOAP Web services via easy export.
Not that Talend is suitable for every enterprise project. It's light on the connectors to mainframes and minis that you'll find in commercial products such as ETI Solution V6, a comparable code-generating solution that can output native code in Java as well as Cobol, C/C++, and SAP.
Open source competitor Pentaho Data Integration (Kettle), despite taking a black-box approach, does offer good control over distributed processing, as well as integration into a more elaborate set of tools for BI and EAI. Nevertheless, I prefer Talend; it's better developed and more extensible than Kettle, and it offers superb data governance.
Deploying the pieces of Talend Open Studio -- namely Job Designer, Business Modeler, and the repository manager -- is straightforward. I installed to a Windows Server 2003 platform with Sun JVM and ActiveState Perl, and was quickly off and running. (ActiveState, incidentally, has a great new rev of Komodo IDE and Perl dev tools that are worth a look.)
The Business Modeler component -- a nice touch for Talend -- is a piece of the puzzle often omitted even at the commercial level. The Business Modeler provides a palette of components that allow nontechnical analysts to build a view of the system and its workflows, without ever touching a drop of Java. The result gets turned over to developers, who flesh out the details using the Job Modeler, an Eclipse-based IDE and debugger.
The Job Modeler will put any Eclipse-seasoned developer at ease with its own palette of drag-and-drop components. It also provides access to the central repository, which holds all of your organization's business models, job designs, metadata, documentation, and connection-specific information.
The latest version of Job Modeler adds collapsible subroutines for easier navigation. Other niceties include quick tabbing between graphical layout and code, a job scheduling interface (that puts a GUI on the Unix crontab command), and a thumbnail overview for easy navigation of large document layouts.
I liked the tMap component for defining my transforms and data routings. Although it was reminiscent of an old switchboard with wires strewn about, it was ultimately fast and effective. An Automap option saves time setting up initial connections.
The Job Modeler IDE's graphical SQL editor and test facility, called SQLBuilder, helps with SQL chores. Talend generates native SQL code for every supported database, no additional effort required. XSLT and XPath are in tow for XML processing. And a good set of orchestration components makes long-running and staged processing a possibility.
Onboard debugging offers step-by-step trace and variable inspection, with real-time stats and trace data viewable directly from the layout. Other niceties, like auto-generation of HTML documentation, sweeten the offering.
You need to be able to trust the accuracy of your data, not just push it around. Talend has data governance covered with good provisions for data quality and profiling. Data conformity and consistency, beyond de-duplication, are achieved using filters such as search-and-replace, interval- and fuzzy-matching, and schema-based transformation. The profiler adds metrics on data quality -- tracked and assessed over time -- and graphically depicts stats and performance summaries for quick isolation of data in need of scrubbing.
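For readers unfamiliar with fuzzy matching, the Python sketch below shows the general technique applied to de-duplication: records whose normalized text is similar enough to one already kept are treated as duplicates. It uses the standard library's `difflib.SequenceMatcher`; the 0.85 threshold and the function names are assumptions for illustration and say nothing about the algorithm Talend's filters use.

```python
from difflib import SequenceMatcher
from typing import List


def similarity(a: str, b: str) -> float:
    """Return a 0.0-1.0 similarity score, ignoring case and outer whitespace."""
    return SequenceMatcher(None, a.strip().lower(), b.strip().lower()).ratio()


def deduplicate(names: List[str], threshold: float = 0.85) -> List[str]:
    """Keep the first occurrence of each name; drop later near-duplicates."""
    kept: List[str] = []
    for name in names:
        if not any(similarity(name, seen) >= threshold for seen in kept):
            kept.append(name)
    return kept
```

An exact-match filter would treat "Acme Corporation" and the misspelled "Acme Corporatoin" as two customers. A similarity threshold catches the typo, at the cost of needing tuning so that genuinely different names are not merged.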
I was impressed by Talend's rich set of components for third-party products, too. Support ranges from the higher end of OLAP cubes and Microsoft AX Server, down to QuickBooks and Google Apps. Even open BI solutions, including Jaspersoft and SpagoBI, as well as CRM apps, including Salesforce.com, Sugar, and Centric CRM, are supported.
Talend needs to work on automating management and partitioning of distributed jobs. I'd like to see Talend (and the Talend community) generate more industry-specific components -- say, to address HIPAA (Health Insurance Portability and Accountability Act) and SWIFT (Society for Worldwide Interbank Financial Telecommunication) directly. And although Talend offers ELT support in addition to ETL, currently ELT mode is limited to Oracle, MySQL, and Teradata databases.
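The ETL/ELT distinction comes down to where the transform runs. The Python sketch below contrasts the two using an in-memory SQLite database purely as a stand-in target. The table names and sample rows are hypothetical, and SQLite is not among the databases named above.

```python
import sqlite3

# Hypothetical raw extract: names in lowercase, amounts as text.
RAW = [("alice", "10.50"), ("bob", "3.25")]


def etl(conn: sqlite3.Connection) -> None:
    """ETL: transform in application code, then load the finished rows."""
    rows = [(name.title(), float(amount)) for name, amount in RAW]
    conn.execute("CREATE TABLE sales_etl (name TEXT, amount REAL)")
    conn.executemany("INSERT INTO sales_etl VALUES (?, ?)", rows)


def elt(conn: sqlite3.Connection) -> None:
    """ELT: load the raw rows first, then let the database do the transform."""
    conn.execute("CREATE TABLE staging (name TEXT, amount TEXT)")
    conn.executemany("INSERT INTO staging VALUES (?, ?)", RAW)
    conn.execute(
        "CREATE TABLE sales_elt AS "
        "SELECT upper(substr(name, 1, 1)) || substr(name, 2) AS name, "
        "CAST(amount AS REAL) AS amount "
        "FROM staging"
    )
```

Both paths end with the same rows. In the ELT case the heavy lifting is expressed as SQL and executed inside the database engine, which is why support for it depends on the specific database.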
Support is always a key concern for open source. Although Talend is still a young company, its worldwide presence enables it to deliver service, support, and training 24/7. Support is included in its team-oriented Integration Suite, along with added provisions for distributed monitoring and load balancing down to the CPU core. Talend even offers a free SaaS edition, Talend on Demand, with subscription-based support.
Clearly Talend has much to offer. Before you break the bank for a six-figure proprietary alternative or ponder the ongoing maintenance nightmare of a hand-coded solution, you'd be foolish not to explore Talend for your next data integration project.