Open source: Leading the way for big data applications

This vendor-written tech primer has been edited by Network World to eliminate product promotion, but readers should note it will likely favor the submitter's approach.

The "big data" term has been used since 2009, but has quickly turned into the biggest thing to hit information technology since the virtualization craze of the last decade. Enterprises are awash with data, having amassed terabytes and even petabytes of information. When the amount of data in the world increases at an exponential rate, analyzing that data and producing intelligence from it becomes increasingly complex -- but no less important to the success of an organization. [Also see: "Could data scientist be your next job?"]

According to the research firm Wikibon, the big data market is on the verge of a growth spurt that will take it to $50 billion worldwide within the next five years. Data volumes are growing to the point where companies are being forced to scale out their infrastructure, and the traditional "scale up" technologies, legacy systems and licensing models are simply not keeping pace. From the outset, open source technology has been at the forefront of massive data management. Today, open source provides the most effective way to address a problem of this scale and get the job done faster and more accurately, at a fraction of the price of alternative solutions.

ROUNDUP: 9 open source big data technologies to watch

Open source data and analytics products are no longer the risky bets they once were. They are now integral to business, and a real alternative to proprietary software. With a strong foundation of underlying tools and technologies, open source has emerged as a compelling building block for robust, cost-effective enterprise applications and infrastructure. It has gone mainstream -- not just within the vendor community, but within customer organizations of all types and sizes.

The proof is in the numbers. In 2011, an estimated $672.8 million of venture capital was invested in open source-related vendors -- an increase of more than 48% over 2010, and the highest annual total on record.

Innovation in a brand new world

Most of the new big data frameworks and databases have their roots in the open source world, where developers routinely create new approaches to problems that haven't yet hit the mainstream. Many of the biggest providers of online communication and data transactions -- Facebook, Yahoo, Amazon, Twitter and eBay, for example -- use and contribute to innovative, open development initiatives. The rate at which the importance and popularity of big data has grown can be directly attributed to open source.

Hadoop has proven to have the most market traction of all big data technologies. Today big data is largely centered on the open source Apache Hadoop platform and the innovation coming out of the companies supporting or extending it, such as Cloudera, Hortonworks and MapR. This is where the center of IT innovation is now, and these emerging companies are disrupting large software companies such as IBM and Microsoft. Open source communities are fostering innovative new approaches and ecosystems, increasingly getting a jump on the traditional providers of proprietary offerings in advanced analytics, data warehousing and integration. [Also see: "Explosive growth expected for Hadoop, MapReduce-related revenues"]

End users are getting on board and changing their business models to support Hadoop, and efforts to create new data services are changing how companies think about their databases, data warehouses and BI systems. For example, Walmart recently said that it is changing the way it does e-commerce by moving from 10 websites to one, and from a trial-size 10-node Hadoop cluster to a 250-node Hadoop cluster. Along the way, Walmart will build several tools to migrate data from its Oracle, Netezza and EMC Greenplum systems -- tools it hopes to open source. Walmart will still use some of its existing data warehousing technology, but to a much lesser extent.
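
Walmart's migration tools aren't public, but the underlying pattern is easy to sketch: pull rows out of a relational warehouse over JDBC and land them in HDFS, where Hadoop jobs can reach them. Below is a minimal, illustrative example in Java; the JDBC URL, table name and HDFS target path are placeholders, and in practice an open source tool such as Apache Sqoop handles this kind of transfer in parallel and at scale.

```java
import java.io.BufferedWriter;
import java.io.OutputStreamWriter;
import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.ResultSet;
import java.sql.ResultSetMetaData;
import java.sql.Statement;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

// Illustrative sketch: read a relational table over JDBC and land it in HDFS as CSV.
// Assumes the appropriate JDBC driver (Oracle, Netezza, Greenplum...) is on the classpath.
public class WarehouseExport {
  public static void main(String[] args) throws Exception {
    String jdbcUrl = args[0]; // placeholder JDBC URL for the source warehouse
    String table   = args[1]; // placeholder table name
    String target  = args[2]; // placeholder HDFS directory, e.g. /staging

    Configuration conf = new Configuration();
    try (Connection db = DriverManager.getConnection(jdbcUrl);
         Statement stmt = db.createStatement();
         ResultSet rows = stmt.executeQuery("SELECT * FROM " + table);
         FileSystem fs = FileSystem.get(conf);
         BufferedWriter out = new BufferedWriter(new OutputStreamWriter(
             fs.create(new Path(target, table + ".csv"))))) {
      ResultSetMetaData meta = rows.getMetaData();
      int cols = meta.getColumnCount();
      while (rows.next()) {
        StringBuilder line = new StringBuilder();
        for (int c = 1; c <= cols; c++) {
          if (c > 1) line.append(',');
          line.append(rows.getString(c)); // naive: no quoting, escaping or null handling
        }
        out.write(line.toString());
        out.newLine();
      }
    }
  }
}
```

A production migration would add escaping, incremental loads and parallel extraction -- which is exactly the gap that tools like Sqoop, and the tools Walmart plans to open source, are meant to fill.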

Because Hadoop was born as an open source project, and is governed by the Apache Software Foundation, it has created a unique development ecosystem. It's remarkable that big data technology is actively being developed and maintained by several competing vendors. New partnerships between these companies, along with their integrated offerings, are being announced every week. 

CASE IN POINT: Microsoft partners with Hortonworks for big data management

While these companies are partners in the development and greater good of Hadoop, customers will select only one vendor partner for a given deployment. Yet all of them contribute to the same Apache Hadoop stack, making it better for every enterprise. That's the beauty of open source: the technical complexity of big data is so great that it takes a body as large as a community, rather than a single vendor, to tackle it.

Going forward, we'll start to see more "hybrid" platforms, and a symbiosis between established software companies -- think Oracle Exadata & Cloudera, or EMC Greenplum & MapR -- and the open source movement, leading to greater innovation and the increased adoption of data integration tools to handle the divide between open source and customers' legacy systems.

The democratization of big data

Big data has become an equalizer for smaller companies that had been disadvantaged by the high cost of processing massive amounts of data. The removal of these traditional barriers is shaking up the industry and changing the way businesses compete and succeed in the new millennium. Smaller companies want to benefit from big data without having to put millions of dollars on the table. Many companies have been doing big data for quite some time, but with conventional technologies. A Teradata data warehouse, for example, can process massive amounts of data because it is extremely powerful technology, but it is also very expensive.

Hadoop changes the game, making the collection and analysis of data possible on low-cost, easily scaled, commodity hardware. It has democratized data, turning it into a competitive advantage no longer just reserved for the big guys. It brings big data to the masses, and that is thanks to the open source nature of Hadoop.
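
To make that concrete, the canonical Hadoop example is a word count: the map step turns each line of input into (word, 1) pairs, the reduce step sums the counts per word, and the framework takes care of distributing both steps across a cluster of commodity machines. The following is a minimal sketch against Hadoop's Java MapReduce API; the class name and the command-line input/output paths are illustrative.

```java
import java.io.IOException;
import java.util.StringTokenizer;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class WordCount {

  // Map step: emit (word, 1) for every word in each input line.
  public static class TokenizerMapper
      extends Mapper<Object, Text, Text, IntWritable> {
    private static final IntWritable ONE = new IntWritable(1);
    private final Text word = new Text();

    @Override
    public void map(Object key, Text value, Context context)
        throws IOException, InterruptedException {
      StringTokenizer itr = new StringTokenizer(value.toString());
      while (itr.hasMoreTokens()) {
        word.set(itr.nextToken());
        context.write(word, ONE);
      }
    }
  }

  // Reduce step: sum the counts for each word across all mappers.
  public static class IntSumReducer
      extends Reducer<Text, IntWritable, Text, IntWritable> {
    private final IntWritable result = new IntWritable();

    @Override
    public void reduce(Text key, Iterable<IntWritable> values, Context context)
        throws IOException, InterruptedException {
      int sum = 0;
      for (IntWritable val : values) {
        sum += val.get();
      }
      result.set(sum);
      context.write(key, result);
    }
  }

  public static void main(String[] args) throws Exception {
    Job job = Job.getInstance(new Configuration(), "word count");
    job.setJarByClass(WordCount.class);
    job.setMapperClass(TokenizerMapper.class);
    job.setCombinerClass(IntSumReducer.class); // local pre-aggregation on each node
    job.setReducerClass(IntSumReducer.class);
    job.setOutputKeyClass(Text.class);
    job.setOutputValueClass(IntWritable.class);
    FileInputFormat.addInputPath(job, new Path(args[0]));   // HDFS input path
    FileOutputFormat.setOutputPath(job, new Path(args[1])); // HDFS output path
    System.exit(job.waitForCompletion(true) ? 0 : 1);
  }
}
```

Packaged into a jar, a job like this would typically be launched with something like hadoop jar wordcount.jar WordCount /input /output -- and the same code runs unchanged whether the cluster has 10 nodes or 250.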

Taken alone, Hadoop remains complex to use, and data scientists must spend time trying to understand exactly what they're dealing with. Many people understand why big data is important and can see where it can take them from a hypothetical, high-level point of view, but companies struggle to find enough people with big data skills to realize their vision. Skilled resources have not materialized at the same rate as the marketing hype for big data. While awareness of big data is growing, only a few organizations that rely on the management and exploitation of data, such as Facebook and Google, are currently in a position to capitalize on it. [Also see: "Who's hiring data scientists? Facebook, Google, StumbleUpon and more"]

Organizations that expect to leverage big data now not only have to understand the intricacies of foundational technologies like Hadoop, but also need the infrastructure to help them make sense of the data and secure it. Without these complementary capabilities, big data will remain an IT privilege, out of the reach of business people and the lines of business they represent.

If you want to alleviate the complexity of Hadoop, you need skilled resources and complementary technologies. As the enterprise Hadoop market continues to mature and companies deploy their clusters for the most demanding analytical challenges, data scientists will continue to leverage open source-centric platforms to meet these critical needs.

You can reach Bertrand Diard and learn more about open source software by contacting Talend.

