Big data storage doesn't have to break the bank
- 07 October, 2013 14:01
Big data is nothing new to Quicken Loans. The nation's largest online retail mortgage lender is accustomed to storing and analyzing data from more than 1.5 million clients and home loans valued at $70 billion in 2012.
But the big data landscape got a little more interesting for the Detroit-based company about three years ago.
"We were starting to focus on big data derived from social media -- Twitter, Facebook, Web tracking, Web chats" -- a massive amount of unstructured data, explains CIO Linglong He. "How to store that data is important because it has an impact on strategy -- not just in storage and architecture strategy, but how to synchronize [that with structured data] and make it more impactful for the company," she says.
Quicken Loans already had a scale-out strategy using a centralized storage area network to manage growth. But it needed more for big data storage -- not just scalable storage space, but compute power close to where the data resides. The solution: scale-out nodes on a Hadoop framework.
"We can leverage the individual nodes, servers, CPU, memory and RAM, so it's very fast for computations," He says, "and from cost, performance and growth standpoints, it is much more impactful for us."
Move over, storage giants, and make way for the new paradigm in enterprise big data storage -- where storage is cheaper and computing power and storage power go hand in hand.
Data at Warp Speed
When it comes to big data, "storage is no longer considered to be a monolithic silo that's proprietary and closed in nature," says Ashish Nadkarni, an analyst at IDC. "A lot of these storage systems are now being deployed using servers with internal drives. It's almost like Facebook or Google models where storage is deployed using internal drives in servers. Some servers have up to 48 drives in them, and the storage platform itself is all software-driven. It's all done using general-purpose operating systems with a software core written on top of it."
Indeed, in the era of big data, companies are gathering information at warp speed and traditional storage strategies can't keep up.
Stored data is growing at 35% per year, according to Boston-based Aberdeen Group. That means IT departments have to double their storage capacity every 24 to 30 months. "Today, an average of 13% of [the money in] IT budgets [is] being spent on storage," says Aberdeen analyst Dick Csaplar. "Two and a half years from now, it will be 26%, and then 52%. Pretty soon, this ratchets out of control, so you can't keep doing the same things over and over." And while it's true that storage costs are declining, he contends that they're not decreasing quickly enough to offset the need to spend more on storage as the amount of data grows.
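The arithmetic behind those figures can be checked with a short sketch. It assumes a constant compound growth rate and, for the budget projection, a flat overall IT budget with no offsetting drop in unit prices; the function names are illustrative and not from Aberdeen.

```python
import math


def doubling_time_months(annual_growth_rate: float) -> float:
    """Months needed for a quantity to double at a constant compound annual rate."""
    years = math.log(2) / math.log(1 + annual_growth_rate)
    return years * 12


def projected_budget_share(start_share_pct: float, periods: int) -> list[float]:
    """Storage share of the IT budget if storage spending doubles each period.

    Assumes the total budget stays flat, which is a simplification.
    """
    return [start_share_pct * 2 ** n for n in range(periods + 1)]


months = doubling_time_months(0.35)
print(f"Doubling time at 35% annual growth: {months:.1f} months")
print("Share of budget per doubling period:", projected_budget_share(13, 2))
```

At 35% annual growth the doubling time comes out just under 28 months, inside the 24-to-30-month range quoted above, and two doublings take a 13% share to 26% and then 52%.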
The deluge of unstructured data continues to grow as well. "The tough challenge, which everyone is trying to solve, is unstructured data that's coming off documents that you wouldn't have expected to have to mine for information," says Vince Campisi, CIO at GE Software, a unit launched in 2011 that connects machines, big data and people to facilitate data analysis. "The traditional BI principles in concept and form still hold true, but the intensity of how much information is coming at you is much higher than the daily transactions in systems running your business."
How do you build a data storage strategy in the era of big data, scale your storage architecture to keep pace with data and business growth, and keep storage costs under control? Find out from big data veterans who share their storage sagas and explain how they have reinvented their storage strategies.
Lower-End Storage Does the Trick
In close political races, data can make a difference. Just ask the folks at Catalist. A Washington-based political consultancy, Catalist stores and mines data on 190 million registered voters and 90 million unregistered voters -- including almost a billion "observations" of people based on public records such as real estate transactions or requests for credit reports. The information produced from its analytics tools tells campaign organizers whose door to knock on and can even prompt candidates to change their voter strategies overnight.
"We used to have a big EMC storage system that we retired a while back just because it was so expensive and consumed so much power," says Catalist CTO Jeff Crigler, noting that the EMC system also ran out of space. So the firm built a cluster of NAS servers that each hold about a petabyte of data. "It's essentially a big box of disks with a processor that's smart enough to make it act like an EMC-like solution" with high-density disk drives, some "fancy" configuration software and very modest CPU to run the configuration software.
Csaplar sees a growing trend away from expensive storage boxes that can cost more than $100,000 and toward lower-cost servers that are now capable of doing more work. "As servers get more powerful," he says, "they take over some of the work that you used to have specialized appliances do." It's similar to the way networking has evolved from network-attached hubs to a network interface card (NIC) on the back of the server to functionality residing on silicon as part of the CPU, he adds.
"I believe that storage is moving this way as well," says Csaplar. Instead of buying big expensive storage arrays, he says, companies are taking the JBOD (just a bunch of disks) approach -- using nonintelligent devices for storage and using the compute capacity of the servers to manage it. "This lowers the overall cost of the storage, and you don't really lose any functionality -- or maybe it does 80% of the job at 20% of the cost," he notes.
Catalist replaced its "$100,000 and up" boxes with four NAS storage units at a cost of $40,000. "We quadrupled our capacity for about $10,000 each," Crigler says. "That was a year and a half ago," and the cost of storage has continued to go down.
Csaplar says he expects to see more lower-end storage systems on the market as more organizations find that they meet their needs. Big vendors like EMC see the writing on the wall and have been buying up smaller, boutique storage companies, he adds.
The Storage and Processing Gap
Data analytics workflow tools are allowing stored data to sit even closer to analytics tools, while their file compression capabilities keep storage needs under control. Vendors such as Hewlett-Packard's Vertica unit, for instance, have in-database analytics functionality that lets companies conduct analytics computations without the need to extract information to a separate environment for processing. EMC's Greenplum unit offers similar features. Both are part of a new generation of columnar databases, which are designed to offer significantly better performance, I/O, storage footprint and efficiency than row-based databases when it comes to analytic workloads. (In April, Greenplum became part of Pivotal Labs, an enterprise platform-as-a-service company that EMC acquired in March.)
Catalist opted for a Vertica database specifically for those features, Crigler says. Because the database is columnar rather than row-based, it looks at the cardinality of the data in each column and can compress it based on that. Cardinality here means the number of distinct values a column contains; the fewer distinct values there are, the more compactly the column can be stored.
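To make the idea concrete, here is a minimal sketch of two generic techniques that column stores commonly apply to low-cardinality columns: dictionary encoding and run-length encoding. It is an illustration only, not Vertica's actual implementation, and the function names and sample data are hypothetical.

```python
def dictionary_encode(values):
    """Store each distinct value once and replace every entry with a small integer code."""
    dictionary = []   # distinct values, in order of first appearance
    index = {}        # value -> position in the dictionary
    codes = []
    for value in values:
        if value not in index:
            index[value] = len(dictionary)
            dictionary.append(value)
        codes.append(index[value])
    return dictionary, codes


def run_length_encode(values):
    """Collapse consecutive repeats into (value, count) pairs; works best on sorted columns."""
    runs = []
    for value in values:
        if runs and runs[-1][0] == value:
            runs[-1][1] += 1
        else:
            runs.append([value, 1])
    return [(value, count) for value, count in runs]


# Hypothetical "state" column for 12 voter records, sorted by state.
state_column = ["MI"] * 5 + ["OH"] * 3 + ["TX"] * 4

dictionary, codes = dictionary_encode(state_column)
runs = run_length_encode(state_column)
print(dictionary)  # each state string is stored once
print(runs)        # one (state, count) pair per run
```

A row store repeats the state string in every record, while the encoded column keeps each distinct string once plus a compact code or count, so the savings grow with the number of rows.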
The Right People for the Job
What skill sets will big data storage and analytics require? By 2015, 4.4 million jobs around the world will require big data skills, but only one-third of those jobs will be filled, according to Gartner. IT professionals must acquire the skills needed to connect, analyze and manage any type of information, in any location, using any interface, to help organizations fully realize the potential of big data, according to a report by the research firm.
Dealing with big data requires a unique set of skills that may be scarce in mainstream IT. For traditional data analysis, such as for finance and HR, it's easy to find people who are familiar with a business discipline, who know what each data field means and who can help create reports. But with big data, there's more to it.
"You definitely need someone with business domain expertise," but you also need people who know how to work with data to do machine learning and other techniques to, for example, build an algorithm or a transfer function, says Vince Campisi, CIO at GE Software. Having people with more specialized skills "allows you to stitch together this information and produce an analytic that tells you something you couldn't have otherwise seen," he adds.
Campisi equates this role to actuaries in the insurance industry -- those "data scientists of their time" who analyzed data and came up with models or made predictions. "Now every industry is going to have that actuarial type of person that we now call data scientists, who just work at connecting and stitching together this information," he says. "They'll try and find some relationship that no one's thought of, or some curve that's very valuable to know but that no one else has found the formula for yet."
-- Stacy Collett
"We have a column in the database called 'State' on every single person's record." But in a database of 300 million registered voters, "it only appears in our database 50 times," he says. "In [row-based open-source relational database management systems like] Postgres and MySQL, it appeared 300 million times. So if you replicate that level of compression on everything from street names to the last name Smith, that plus other compression algorithms buys you tremendous savings in terms of storage space. So your choice of database technology really does affect how much storage you need."
On the storage side, deduplication, compression and virtualization continue to help companies reduce the size of files and the amount of data that is stored for later analysis. And data tiering is a well-established option for bringing the most critical data to analytics tools quickly.
Solid-state drives (SSDs) are another popular storage medium for data that must be readily available. Basically a flash drive technology that has become the top layer in data tiering, SSDs keep data in very fast response mode, Csaplar says. "SSDs hold the data very close to processors to enable the servers to have the I/O to analyze the data quickly," he says. Once considered too expensive for many companies, SSDs have come down in price to the point where "even midsize companies can afford layers of SSDs between their disks and their processors," says Csaplar.
Cloud-based storage is playing an increasingly important role in big data storage strategies. In industries where companies have operations around the world, such as oil and gas, data generated from sensors is being sent directly to the cloud and stored there -- and in many cases, analytics are being performed there as well.
"If you're gathering data from 10 or more sources, you're more than likely not backlogging it into a data center" because that isn't cost-effective with so much data, says IDC's Nadkarni.
GE, for instance, has been analyzing data from machine sensors for years, using "machine-to-machine" big data to plan for aircraft maintenance. Campisi says data collected for just a few hours off the blade of a power plant gas turbine can dwarf the amount of data that a social media site collects all day.
Companies are using the cloud to gather data and analyze it on the spot, eliminating the need to bring it into the data center. "Companies like Amazon give you a compute layer to analyze that data in the cloud. When you're done analyzing it, you can always move it from, say, the S3-type layer to a Glacier-type layer," Nadkarni adds.
Glacier is an extremely low-cost storage option that Amazon Web Services announced in August 2012. It's designed for keeping data "on ice" for decades. Other companies are introducing similar cloud-based archiving services, says Csaplar, noting that these offerings are professionally managed at a very reasonable price and could, for example, serve as the ultimate resting place for old tapes.
With prices as low as pennies per gigabyte, such services are hard to pass up. "As long as your data is scrubbed and doesn't have any sensitive information, you can dump it into this kind of archive and reduce your data center footprint," says Nadkarni.
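The move from an active layer to an archive layer can be expressed as a simple age-based policy. The sketch below is one hypothetical way to write it: the 90-day threshold, the layer names and the function name are assumptions made for illustration, and it calls no cloud provider's API.

```python
from datetime import date

# Hypothetical cutoff: data untouched by analytics for this long goes to the archive.
ARCHIVE_AFTER_DAYS = 90


def choose_layer(last_analyzed: date, today: date, contains_sensitive: bool) -> str:
    """Pick a storage layer for a data set based on its age and sensitivity."""
    if contains_sensitive:
        # Unscrubbed data stays in-house, following the caveat in the quote above.
        return "data_center"
    age_days = (today - last_analyzed).days
    return "archive" if age_days >= ARCHIVE_AFTER_DAYS else "active"


today = date(2013, 10, 7)
print(choose_layer(date(2013, 9, 20), today, contains_sensitive=False))
print(choose_layer(date(2012, 11, 1), today, contains_sensitive=False))
print(choose_layer(date(2012, 11, 1), today, contains_sensitive=True))
```

In practice the threshold would come from how often each kind of data is actually queried rather than from a fixed number.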
Words of Advice
There isn't just one approach [to big data storage]. You really need to look at the use cases you have internally and understand which technologies would best suit [them]. In the old days, we would try to use one tool and make that tool a sledgehammer for everything. Now we have a whole toolbox. So go out and understand how to use those tools and when those tools apply, and then effectively use them.
-- Lloyd Mangnall, vice president, MIS systems architecture,
VHA, parent company of Novation
Don't Store Everything
There's a temptation to think that you're just going to store everything. First, that's a fool's errand because it will break the bank. Yes, storage is getting cheaper, but it's not getting cheaper as fast as we're getting more data. And second, it just doesn't make good business sense. Your need for all that data varies with time.
-- Jeff Crigler, CTO, Catalist
Big Data Isn't for Everyone
You have to be a fairly large company to generate that amount of data. For [small and midsize businesses], it's about being able to get more and more granular data out of what they've already got, and being able to mine and manage it.
-- Dick Csaplar, analyst, Aberdeen Group
Granted, [outsourcing] doesn't always provide the data you need at first blush, but with some effort and custom code, you can get great results.
-- Jeff Brown, CTO, Cheezburger.com,
an internet humor destination
Compiled by Stacy Collett
Mainstream enterprises are also showing interest in using the cloud for storing and analyzing data. Some 20% of IT leaders surveyed by IDC report that they've turned to the cloud as a way to augment their analytics capabilities, even though they have their own data centers to perform analytics.
"It's mostly for two reasons," Nadkarni explains. "Many times, these projects aren't done by IT. Second, because of the time to deploy and to go live, many business units find it easier to spin up a couple of instances in the cloud and get going, so it goes from a few weeks to a few days."
Campisi says most of the customers his unit supports are still storing and analyzing data on-site -- for now. "We are transitioning to more and more using cloud technology and capabilities to support our strategy. From what I see from customers, it tends to be more of a traditional approach where they use their own internal corporate data center," he says.
For his part, Crigler is trying to figure out how to migrate all of Catalist's data to the cloud. The firm already replicates one database to the cloud, the one that matches voters' identities, "because it's a ton of data, and it's used on a very 'spikey' basis," he says. "Four to five months [before] an election, it gets hammered. So being able to expand processing capacity and throwing more disks and CPUs at it is really important."
He's also trying to come up with a strategy that gets the best performance for the money given the demand on that type of data and the need to do analytic queries against historical data.
"It's a big challenge," Crigler says. For instance, "Amazon's Elastic Block [Store] is slow, and S3 is even slower. The best option is the most expensive, which is the attached dedicated storage on the very large Amazon boxes -- and that's really expensive. So you have to have a way of analyzing your data and calculating the price-performance curve for different kinds and ages of data, and optimizing your storage based on your real needs."
Though many companies are still grappling with the early stages of their big data storage strategies, it won't be long before hyperscale computing environments like those at Google and Facebook become more commonplace.
"It's happening," says Nadkarni. "This whole server-based storage design is a direct result of department practices followed by Amazon, Facebook, Google" and the like.
In Silicon Valley, startups are offering big data storage systems based on those companies' principles. At VMware's recent VMworld virtualization conference, says Nadkarni, "there were at least a dozen companies with founders who used to be at Google and Facebook."
For legal reasons, the startups can't replicate exactly their former employers' magic, "but the principles are well entrenched in Silicon Valley," Nadkarni says. "In a few years you'll see this hyperscale principle make its way into the mainstream enterprise because there won't be any other way to do it."