When Premier Inc's medical databases began bogging down last year, the provider of clinical data put its data warehouse in a box -- literally. Premier sells pharmaceutical manufacturers access to clinical data it gathers from 400 hospitals. Last year, the company's IBM Red Brick data warehouse had grown to 3TB, and one table held 3 billion rows. "When you go through 3 billion rows of data, you get long runtimes," says Chris Stewart, director of data warehouse architecture.
The problem wasn't just the size of the database, however, but how clients used the data. "Our users want to access all of the data from top to bottom," says Stewart, and the complex, multipass queries created by Premier's 4000 users each week were slowing performance. Some wouldn't run at all.
Instead of adding to its 24-processor Solaris server infrastructure or making further attempts to optimize the database, Stewart brought in an all-inclusive data warehouse appliance from Netezza. Some calculations that took one or two days now finish in six to eight minutes on the appliance's 108 processors. Premier still uses Red Brick for most queries, but the NPS 8150 appliance handles the "really, really ugly questions" that weren't possible to process before. "We couldn't offer the product offerings we do today" without the appliance, Stewart says.
As data warehouses continue to grow, more users are demanding access to business intelligence (BI) tools to conduct data-mining exercises across large data sets. "We're talking about using every single call-detail record generated in the last three years," says Claudia Imhoff, president of consulting firm Intelligent Solutions. It's hard for database administrators (DBAs) to create aggregations of data, such as summarizations, that can facilitate the processing of these complex queries because users often don't know in advance what they're looking for. "These unplanned questions are the ones that knock the stuffing out of databases," she says.
But such queries are increasingly seen as business-critical, says William Fellows, an analyst at The 451 Group. "The problem of querying data sets that are growing at more than 100 percent a year has led to what might be called a data warehouse capability gap," he says. While market leaders like Teradata, a division of NCR, offer integrated systems to address this for high-end applications, Netezza and others are jumping in with moderately priced systems that don't require the same high-end hardware and software investments as those from IBM, Oracle and Teradata.
It's an interesting trend but still a small part of the $US16 billion market for data warehouse hardware and software, says Dan Vesset, an analyst at IDC.
Small players, big databases
Some start-ups offer only software, while others include software and hardware in a single bundle or appliance. But all use a parallelization scheme that involves symmetric multiprocessing or a massively parallel processing architecture. Designs vary, but all are based on the partitioning of data across servers -- something Teradata has been doing for years, says Fellows. "There's nothing new under the sun in terms of approach here except packaging and price," he adds. While Netezza and competitors like to position themselves against Teradata, the company still dominates on the high end, he adds.
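The shared idea -- segmenting a table across servers so queries can run in parallel -- can be sketched in a few lines. This is a minimal illustration assuming hash-based placement on a row key; the actual placement schemes vary by vendor, and the names here (`partition`, `num_nodes`) are invented for the example.

```python
# Illustrative sketch only: hash-partitioning rows across nodes, the
# shared-nothing technique Teradata has used for years. Each row lands
# on exactly one node, so a query can be split into one sub-query per
# node and the partial results merged.

def partition(rows, key, num_nodes):
    """Assign each row to a node by hashing its partitioning key."""
    nodes = [[] for _ in range(num_nodes)]
    for row in rows:
        nodes[hash(row[key]) % num_nodes].append(row)
    return nodes

rows = [{"id": i, "amount": i * 10} for i in range(1000)]
nodes = partition(rows, "id", 4)

# No row is lost or duplicated by the placement scheme.
assert sum(len(n) for n in nodes) == len(rows)
```

Once data is spread this way, each server scans only its own slice, which is what makes the "nothing new except packaging and price" observation plausible: the win comes from the layout, not the individual boxes.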
Netezza's NPS appliance abandons database indexes in favour of direct table scans, using brute-force processing to get the job done. The system includes its own database, with specialized field-programmable gate array (FPGA) logic that links processors and storage to speed up I/O. A system comparable to Premier's, with 4.5TB of disk space, sells for "a little more than a million dollars," says Netezza CEO Jit Saxena. By dumping the indexes, Premier's database dropped from 3TB to 1TB. The system is sufficiently fast that Stewart now uses the appliance both to process queries and to build the data-aggregation tables that he loads into the Red Brick data warehouse.
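Dropping indexes trades lookup shortcuts for raw scan bandwidth across many processors. The sketch below is a loose, serial stand-in for that idea -- not vendor code -- with each "slice" standing in for the portion of the table a processor/disk pair would scan; the table layout and function names are assumptions for the example.

```python
# Sketch of index-free, brute-force query processing: every processor
# full-scans its own slice of the table and the partial aggregates are
# merged. Here the per-processor work is simulated with a plain loop.

TABLE = [{"hospital": i % 400, "cost": i % 97} for i in range(100_000)]

def scan_slice(rows, hospital):
    """Full scan of one slice: no index, just filter and aggregate."""
    return sum(r["cost"] for r in rows if r["hospital"] == hospital)

def query_total_cost(table, hospital, workers=4):
    step = -(-len(table) // workers)  # ceiling division
    slices = [table[i:i + step] for i in range(0, len(table), step)]
    # In the appliance, each slice is scanned by a processor sitting
    # next to its own disk, all at once; here we just loop over them.
    partials = [scan_slice(s, hospital) for s in slices]
    return sum(partials)
```

The answer is identical no matter how many slices the table is cut into; only the wall-clock time changes, which is why adding processors (Premier's appliance has 108) speeds up the "ugly" queries without any tuning.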
Start-up Calpont Corp is developing a similar appliance that hard-codes the database on an FPGA chip. Because it will store the data on a solid-state disk built from synchronous dynamic RAM, however, it will be targeted at smaller data sets. A 128GB box capable of supporting 40GB to 50GB of data will have a price tag in the "couple hundred-thousand dollar range", says CEO Jim Janicki. "We wanted a brute-force engine to handle everything we could throw at it," he says of the device, which is scheduled to ship by midyear.
Datallegro is rolling out a turnkey system that functions much like the Netezza appliance, but it's built using off-the-shelf components. "We're taking standard, commodity servers with an open-source database," says CEO Stuart Frost. Datallegro's 3TB P3000 includes 21 dual-Xeon-processor servers, each connected to 12 Western Digital Raptor drives, and was priced at $US450,000 for its April release. Frost is targeting Oracle customers with databases in the 1TB to 5TB range and up to 300 concurrent users.
Metapa takes a similar approach but lets users buy their own components based on its specification, rather than bundling everything together. Users "can assemble systems that are just as fast as the high-end data warehouses at a fraction of the cost. We don't believe you need a specialized ASIC chip to get there," says Scott Yara, founder and president of the Californian start-up. The total price, including Metapa's Cluster DataBase -- due to ship in the second quarter -- and required hardware, will be half the cost of a Netezza appliance, he claims.
Clareos' CrossCut software, now available, adds yet another twist. Instead of using database tables, it combines a BI reporting tool with a spreadsheet-like data model that creates a single, flat file of rows and columns.
"The next generation of BI tools will have a flat file structure that will be very fast," predicts Steve Foley, CEO of Clareos. CrossCut software and recommended hardware to process 146GB of data costs about $US65,000. But the product differs from products like Netezza's in one key respect: CrossCut is a read-only database that doesn't provide update capability, Foley says. Competitors that use vector-based processing to support real-time decision-making applications include Alterion and Aleri, says Fellows at The 451 Group.
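A flat, read-only file of rows and columns can be queried without any table machinery at all. The toy below shows the general idea of a spreadsheet-like rollup over one flat grid; the CSV layout, column names, and `report` function are assumptions for illustration, not CrossCut's actual format.

```python
# Toy read-only "flat file" query model: the whole data set is one
# grid of rows and columns, scanned directly -- no tables, no updates.
import csv
import io

FLAT_FILE = io.StringIO(
    "region,product,sales\n"
    "east,widget,100\n"
    "west,widget,250\n"
    "east,gadget,75\n"
)

rows = list(csv.DictReader(FLAT_FILE))

def report(rows, group_by, measure):
    """Spreadsheet-style rollup over the flat file."""
    totals = {}
    for r in rows:
        totals[r[group_by]] = totals.get(r[group_by], 0) + int(r[measure])
    return totals

print(report(rows, "region", "sales"))  # {'east': 175, 'west': 250}
```

Because nothing is ever updated in place, the engine can lay the data out purely for scan speed -- the trade-off Foley describes.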
By contrast, Teradata's integrated systems connect clusters of high-performance servers using a proprietary high-speed interconnect called Bynet and store data in a Fibre Channel storage-area network. The vendor focuses on allowing large numbers of concurrent queries in a mixed-workload environment and supports "active data warehousing", where databases are continuously updated, says Stephen Brobst, chief technology officer. He sees the start-ups' products as best suited for single-function, low-end data marts and cautions that "data marts end up replicating data".
But that's a trade-off users may be willing to make when cost is a factor. "With an IBM or Teradata solution, your scalability is in large chunks," says the vice president of infrastructure at a large financial services company that's beta-testing a Datallegro system. The incremental cost for adding capacity to an appliance can be a small fraction of what it costs to upgrade his Sun Microsystems system. He is cautious about buying from a small vendor, but adds, "If they can deliver the same or better performance at 20 percent of the cost of an IBM or Teradata solution, then you have to do it."
Most of these systems take a black-box approach to optimization, which means DBAs can't do any tuning. That paradigm shift may be the toughest sell, says Intelligent Solutions' Imhoff, and it's definitely a weakness for Michael Benillouche, director of technology at ACNielsen, who prefers to optimize his Oracle data marts.
But Premier's Stewart sees that as an advantage. "My DBA staff has more time for development instead of hand-holding a database. We don't need to build in cycles to make queries go faster," he says.
In traditional systems, ad hoc queries that bog down the data warehouse are restricted, says Imhoff. Now IT can spin off a subset of data to more groups for business analytics without supplying DBA resources. "If I can bring in a technology that doesn't require an army of DBAs, great Scott, what a boost," she says.
Data warehouse acceleration appliances
What they are
Stand-alone, integrated systems designed to support ad hoc queries for business analytics and decision support. The systems require no tuning and are ready to go -- just add data.
How they work
All designs are based on research done at the University of Wisconsin in the mid-'80s on partitioning of data on servers, says William Fellows, an analyst at The 451 Group. Data is segmented, queries are parallelized, and a symmetric multiprocessing or massively parallel processing system executes the pieces of the query in parallel to return results more quickly. Hardware-based data warehouse acceleration appliances like those from Netezza and Datallegro abandon the use of the database index in favour of direct table scans and use parallelization and raw processing horsepower to process the query. Netezza attempts to shorten the path between the query and the result by placing processors next to each storage device within the appliance. By contrast, a traditional query goes first to the database management system and then through the operating system to read the indexes. Only after the indexes are searched is the data retrieved from disk. Systems may consist of an all-in-one hardware appliance or an integrated turnkey system.
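The two access paths described above can be contrasted directly. In this simplified sketch -- both functions are illustrative, not real DBMS internals -- the traditional path consults an index before touching the data, while the appliance path simply scans every row:

```python
# Simplified contrast of the two query paths: index lookup first vs
# direct table scan. Both return the same rows; they differ in how
# much auxiliary structure must be built and read along the way.

TABLE = [{"id": i, "region": "east" if i % 2 else "west"}
         for i in range(10_000)]

# Building and storing the index is itself a cost -- the kind of
# structure Premier shed when its database shrank from 3TB to 1TB.
INDEX = {}
for pos, row in enumerate(TABLE):
    INDEX.setdefault(row["region"], []).append(pos)

def query_via_index(region):
    """Traditional path: read the index, then fetch matching rows."""
    return [TABLE[pos] for pos in INDEX.get(region, [])]

def query_via_scan(region):
    """Appliance path: brute-force scan of every row, no index."""
    return [row for row in TABLE if row["region"] == region]

assert query_via_index("east") == query_via_scan("east")
```

For the unplanned, "knock the stuffing out" queries Imhoff describes, no pre-built index matches the question anyway, which is why the scan path plus parallel hardware can win.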
The appliances cost less than using enterprise data warehouses for this purpose, are easier to use and manage, and can make data or data subsets available to a broader range of people than would otherwise be possible.
For example, an IT organization could spin off a subset of data and make it available via the appliance to a group or department. Users can then query the system independently, without the need for the database tuning and optimization that would normally tie up a database administrator. "Appliances are self-contained, plug-and-play solutions, so there is a lot less baby-sitting by the IT department," says IDC analyst Dan Vesset. A typical application for data warehouse appliances would be as a mechanism to query call center call-detail records.
The appliances won't handle the same levels of data supported in an enterprise data warehouse. An integrated decision support database system from Teradata will support a petabyte vs 27TB for Netezza's appliance, 3TB for Datallegro's, 1.5TB for Metapa's and 128GB for Calpont's appliance, which uses a solid-state disk.
Each appliance also requires its own local copy of the data, so administrators may end up maintaining multiple instances of the same database.