Moving beyond Hadoop for big data needs
- 29 October, 2012 10:16
Hadoop and MapReduce have long been mainstays of the big data movement, but some companies now need new and faster ways to extract business value from massive -- and constantly growing -- datasets.
While many large organizations are still turning to the open source Hadoop big data framework, Google, whose technology inspired it, and others have already moved on to newer technologies.
The Apache Hadoop platform is an open source implementation modeled on the Google File System and Google MapReduce technology, which the search engine giant developed to manage and process huge volumes of data on commodity hardware.
That technology has been a core part of the processing used by Google to crawl and index the Web.
Hundreds of enterprises have adopted Hadoop over the past three or so years to manage fast-growing volumes of structured, semi-structured and unstructured data.
The open source technology has proved to be a cheaper option than traditional enterprise data warehousing technologies for applications such as log and event data analysis, security event management, social media analytics and other applications involving petabyte-scale data sets.
Analysts note that some enterprises have started looking beyond Hadoop not because of limitations in the technology, but because of the purposes for which it was designed.
Hadoop is built for handling batch-processing jobs where data is collected and processed in batches. Data in a Hadoop environment is broken up and stored in a cluster of highly distributed commodity servers or nodes.
In order to get a report from the data, users have to first write a job, submit it and wait for it to get distributed to all of the nodes and get processed.
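The batch model described above can be illustrated with a minimal, single-process sketch of the map, shuffle and reduce phases. This is not Hadoop's API; the log records and function names are hypothetical, and a real cluster would spread each phase across many nodes. The point is only that no report exists until the whole job has run.

```python
from collections import defaultdict

# Hypothetical log records: (server, event_type). Not taken from the article.
LOG_LINES = [
    ("web-1", "error"),
    ("web-2", "ok"),
    ("web-1", "ok"),
    ("web-3", "error"),
    ("web-2", "error"),
]


def map_phase(record):
    """Emit one (key, 1) pair per record, keyed by event type."""
    _server, event_type = record
    yield event_type, 1


def shuffle(pairs):
    """Group emitted values by key, as a framework would between phases."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_phase(key, values):
    """Combine all values for one key into a single count."""
    return key, sum(values)


def run_batch_job(records):
    """Run every phase in order; the report is available only at the end."""
    mapped = [pair for record in records for pair in map_phase(record)]
    grouped = shuffle(mapped)
    return dict(reduce_phase(key, values) for key, values in grouped.items())


report = run_batch_job(LOG_LINES)
print(report)
```

Even in this toy form, a new question about the data means writing and running another job from start to finish, which is the latency the analysts are pointing to.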
While the Hadoop platform performs well, it's not fast enough for some key applications, says Curt Monash, a database and analytics expert and principal at Monash Research. For instance, Hadoop does not fare well in running interactive, ad hoc queries against large datasets, he said.
"What Hadoop has trouble with is interactive responses," Monash said. "If you can stand latencies of a few seconds, Hadoop is fine. But Hadoop MapReduce is never going to be useful for sub-second latencies."
Companies needing such capabilities are already looking beyond Hadoop for their big data analytics needs.
Google, in fact, started using an internally developed technology called Dremel some five years ago to interactively analyze or "query" massive amounts of log data generated by its thousands of servers around the world.
Google says the Dremel technology supports "interactive analysis of very large datasets over shared clusters of commodity machines."
The technology can run queries over trillion-row data tables in seconds and scales to thousands of CPUs and petabytes of data, Google says. It also supports a SQL-like query language that makes it easy for users to interact with data and to formulate ad hoc queries.
Though conventional relational database management technologies have supported interactive querying for years, Dremel offers far greater scalability and speed, contends Google.
Thousands of users at Google operations use Dremel for a variety of applications, such as analyzing crawled web documents, tracking installation data for Android applications, crash reporting and for maintaining disk I/O statistics for hundreds of thousands of disks.
Dremel, though, isn't a replacement for MapReduce and Hadoop, said Ju-kay Kwek, product manager of Google's recently-launched BigQuery hosted big data analytics service based on Dremel.
Google uses Dremel in conjunction with MapReduce, he said. Hadoop MapReduce is used to prepare, clean, transform and stage massive amounts of server log data, and then Dremel is used to analyze the data.
Hadoop and Dremel are distributed computing technologies, but each was built to address very different problems, Kwek said.
For example, if Google were trying to troubleshoot a problem with its Gmail service, it would need to look through massive volumes of log data to pinpoint the issue quickly.
"Gmail has 450 million users. If every user had several hundred interactions with Gmail, think of the number of events and interactions we would have to log," Kwek said.
"Dremel allows us to go into the system and start to interrogate those logs with speculative queries," Kwek said. A Google engineer could say, "show me all the response times that were above 10 seconds. Now show it to me by region," Kwek said. Dremel allows engineers to very quickly pinpoint where the slowdown was occurring, Kwek said.
"Dremel distributes data across many, many machines and it distributes the query to all of the servers and asks each one 'do you have my answer?' It then aggregates it and gets back the answer in literally seconds."
Using Hadoop and MapReduce for the same task would take longer because it requires writing a job, launching it and waiting for it to spread across the cluster before the information can be sent back to a user. "You can do it, but it's messy. It's like trying to use a cup to slice bread," Kwek said.
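The fan-out-and-aggregate pattern Kwek describes can be sketched in a few lines. This is not Dremel's implementation or the BigQuery API; it is a minimal illustration, assuming hypothetical log shards held in memory and threads standing in for servers, of sending one query to every shard and merging the partial answers. The query mirrors his example: response times above 10 seconds, broken down by region.

```python
from collections import Counter
from concurrent.futures import ThreadPoolExecutor

# Hypothetical log shards, one per "server": (region, response_time_seconds).
SHARDS = [
    [("us", 0.4), ("eu", 12.5), ("us", 11.0)],
    [("asia", 0.9), ("eu", 15.2)],
    [("us", 10.5), ("asia", 30.1), ("eu", 2.0)],
]


def query_shard(shard, threshold):
    """Scan one shard and return a partial count of slow responses per region."""
    partial = Counter()
    for region, seconds in shard:
        if seconds > threshold:
            partial[region] += 1
    return partial


def slow_responses_by_region(shards, threshold=10.0):
    """Send the same query to every shard in parallel, then merge the results."""
    with ThreadPoolExecutor() as pool:
        partials = list(pool.map(lambda shard: query_shard(shard, threshold), shards))
    total = Counter()
    for partial in partials:
        total.update(partial)
    return dict(total)


result = slow_responses_by_region(SHARDS)
print(result)
```

Because each shard answers independently, a follow-up question only needs a different filter or grouping key rather than a new batch job.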
The same kind of data volumes that pushed Google to Dremel years ago have started emerging in some mainstream enterprise organizations as well, Kwek said.
Companies in the automobile, pharmaceutical, logistics and financial services industries are constantly inundated with data and are looking for tools to help them quickly query and analyze it.
Google's hosted BigQuery analytics service is being positioned to take advantage of the need for new big data technologies.
In fact, said Gartner analyst Rita Sallam, the Dremel-based hosted service could be a game changer for big data analytics.
The service allows enterprises to interactively query massive data sets without having to buy expensive underlying analytics technologies, Sallam said. Businesses can explore and experiment with different data types and different data volumes at a fraction of what it would cost to buy an enterprise data analytics platform, she said.
The truly noteworthy aspect of BigQuery is not its underlying technology, but its potential to cut IT costs at large companies, she said.
"It offers a much more cost effective way to analyze large sets of data" compared to traditional enterprise data platforms, Sallam said. "It really has a potential to change the cost equation and allow companies to experiment with their big data."
Major vendors of business intelligence products, including SAS Institute, SAP, Oracle, Teradata and Hewlett-Packard Co., have been rushing to release tools that deliver improved data analytics capabilities. Like Google, most of these vendors see the Hadoop platform mainly as a massive data store for preparing and staging multi-structured data for analysis by other tools.
Just last week, SAP unveiled a new big data bundle designed to let large organizations integrate Hadoop environments with SAP's HANA in-memory database and associated technologies.
The bundled product uses the SAP HANA platform to read and load data from Hadoop environments and then do fast interactive analytics on the data using SAP's reporting and analytics tools.
SAS announced a similar capability for its High Performance Analytic Server a few weeks ago. HP, with technology gained in its acquisition of Vertica; Teradata, with its Aster-Hadoop Adaptor; and IBM, with its Netezza tool sets, offer or will soon offer similar capabilities.
The business has also attracted a handful of startups.
One, Metamarkets, has developed a cloud-based service designed to help companies analyze copious amounts of fresh streaming data in real time. At the heart of the service is an internally developed distributed, in-memory, columnar database technology called Druid, according to the company's CEO, Michael Driscoll. He compares Druid to Dremel in concept.
"Dremel was architected from the ground up to be an analytical data store," Driscoll said. Its column-oriented, parallelized, in-memory design makes it several orders of magnitude faster than a traditional data store, he said.
"We have a very similar architecture," Driscoll said. "We are column-oriented, distributed and in-memory."
The Metamarkets technology, though, allows enterprises to run queries over data even before it is streamed into a data store, so it allows for even faster insight than Dremel, he said.
Metamarkets earlier this year released Druid to the open source community to spur more development activity around the technology.
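The column-oriented layout Driscoll attributes to both Dremel and Druid can be illustrated with a small in-memory sketch. It reflects neither product's actual storage format; the records and field names are hypothetical. It only shows why an aggregate over one field is cheaper when each field is stored as its own contiguous list.

```python
# Hypothetical events; field names are illustrative only.
ROWS = [
    {"user": "a", "region": "us", "latency_ms": 120},
    {"user": "b", "region": "eu", "latency_ms": 340},
    {"user": "c", "region": "us", "latency_ms": 95},
]


def to_columns(rows):
    """Pivot a list of row dicts into one list per field (a columnar layout)."""
    columns = {}
    for row in rows:
        for field, value in row.items():
            columns.setdefault(field, []).append(value)
    return columns


def average_latency(columns):
    """Aggregate one field by reading only its column, not whole records."""
    values = columns["latency_ms"]
    return sum(values) / len(values)


COLUMNS = to_columns(ROWS)
avg = average_latency(COLUMNS)
print(avg)
```

In a row-oriented store the same average would require reading every field of every record; here the `user` and `region` columns are never touched.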
The demand for such technology is driven by the need for speed, Driscoll said.
Hadoop, he said, is simply too slow for companies that need sub-millisecond query response times. Analytics technologies such as those being offered by the traditional enterprise vendors are faster than Hadoop but still don't scale as well as a Dremel or a Druid, Driscoll said.
Nodeable, another venture-backed startup, offers a cloud-hosted service called StreamReduce that is similar to the Metamarkets offering.
StreamReduce is powered by Storm, an open source data analytics technology originally developed by BackType before it was acquired by Twitter last year. Storm, also used internally by Twitter, is designed to let enterprises run real-time analytics on streaming data.
Nodeable offers a connector to Hadoop so enterprises can use the service to run interactive queries against data stored in their Hadoop environment as well, CEO Dave Rosenberg said.
Nodeable was launched as a cloud system management company but switched tracks after seeing an opportunity for big data analytics technology. "We realized there was a lack of a real-time complement to Hadoop. We asked ourselves, how do we get real-time with Hadoop?" Rosenberg said.
Services such as Nodeable's do not replace Hadoop; they complement it, Rosenberg said.
StreamReduce gives companies a way to extract actionable information from streaming data that can be stored in a Hadoop environment or in another data store for more traditional batch processing later, he said.
Streaming engines such as those offered by Nodeable and Metamarkets are different from technologies like Dremel in one important aspect -- they are designed for analyzing raw data before it hits a database. Dremel and other technologies are designed for ad hoc querying of data that is already in a data store such as a Hadoop environment.
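The distinction drawn above, analyzing data as it arrives rather than querying it after it has been stored, can be sketched as follows. This is not Storm, StreamReduce or Druid code; the class, event names and in-memory list standing in for a data store are all hypothetical.

```python
from collections import defaultdict


class RunningAverage:
    """Keep a per-key count and sum so the average is current after each event."""

    def __init__(self):
        self._count = defaultdict(int)
        self._total = defaultdict(float)

    def observe(self, key, value):
        self._count[key] += 1
        self._total[key] += value

    def average(self, key):
        return self._total[key] / self._count[key]


def process_stream(events, stats, store):
    """Update the running statistics first, then persist each raw event."""
    for key, value in events:
        stats.observe(key, value)   # analyzed on arrival
        store.append((key, value))  # kept for later batch or ad hoc queries


stats = RunningAverage()
store = []  # stand-in for Hadoop or another data store
process_stream([("checkout", 2.0), ("search", 1.0), ("checkout", 4.0)], stats, store)
print(stats.average("checkout"))
```

The running statistics are usable after every event, while the stored copy remains available for the kind of ad hoc or batch analysis described for Dremel and Hadoop.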
Meanwhile, major Hadoop players are not standing by idly.
Cloudera, the biggest vendor of commercial Hadoop technology, last week rolled out a technology called Cloudera Impala, a real-time query engine for data stored in the Hadoop Distributed File System.
The Impala technology will allow companies to do batch and real-time operations on structured and unstructured data within one system, according to Cloudera.
Jaikumar Vijayan covers data security and privacy issues, financial services security and e-voting for Computerworld. Follow Jaikumar on Twitter at @jaivijayan.