Diligent Technologies is among the pioneers of data deduplication technology, which helps enterprises reduce redundant copies of data and, in turn, shrink storage requirements and shorten backup times. Neville Yates, Diligent's CTO, talked with Network World Senior Editor Deni Connor about the varying deduplication technologies used with today's virtual tape libraries (VTL).
So what is deduplication?
Deduplication is a means by which data is examined and compared to existing data. If it is the same, it is filtered out and the existing data is referenced. Deduplication is very prominent in applications such as backup that cause a lot of duplication as a byproduct of how they work. These applications are prime targets for deduplication technology.
What forms of deduplication are there?
There are three approaches to deduplication talked about in the market today. One of them is the offering from Diligent called HyperFactor, which looks at data in an agnostic form and searches the datastream for similarity. Once similarity is found, a computational difference is performed, guaranteeing that what is filtered out is exactly the same as what is referenced. Only new data is stored.
Another one uses hash technology or hash algorithms, whereby data is sliced into some digestible piece -- perhaps 8Kbytes in size -- a hash is assigned to that piece, and the data is stored. If the same signature or hash is computed again on a new datastream, that suggests the data already exists and can be referenced. It doesn't need to consume more storage, which reduces the amount of storage consumed.
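The hash-based approach described above can be sketched in a few lines. This is a minimal illustration under stated assumptions, not any vendor's implementation: the fixed 8 KB slice size comes from the example figure in the interview, while the choice of SHA-256, the in-memory index, and the names `HashDedupStore`, `write` and `read` are hypothetical.

```python
import hashlib

CHUNK_SIZE = 8 * 1024  # 8 KB slices, the example size mentioned in the text


class HashDedupStore:
    """Toy chunk store keyed by SHA-256 digest (hypothetical, for illustration)."""

    def __init__(self):
        self.chunks = {}          # digest -> chunk bytes, each stored once
        self.bytes_received = 0   # total bytes sent to the store

    def write(self, data: bytes) -> list:
        """Store a datastream and return the list of digests that describes it."""
        recipe = []
        for offset in range(0, len(data), CHUNK_SIZE):
            chunk = data[offset:offset + CHUNK_SIZE]
            digest = hashlib.sha256(chunk).hexdigest()
            # A digest already in the index is taken to mean the chunk exists,
            # so only a reference is recorded and no new storage is consumed.
            if digest not in self.chunks:
                self.chunks[digest] = chunk
            recipe.append(digest)
        self.bytes_received += len(data)
        return recipe

    def read(self, recipe: list) -> bytes:
        """Rebuild a datastream from its list of digests."""
        return b"".join(self.chunks[d] for d in recipe)

    def bytes_stored(self) -> int:
        return sum(len(c) for c in self.chunks.values())


# Two "nightly backups" that differ only in their last 8 KB slice.
store = HashDedupStore()
monday = b"".join(bytes([i]) * CHUNK_SIZE for i in range(4))
tuesday = monday[:3 * CHUNK_SIZE] + bytes([9]) * CHUNK_SIZE
monday_recipe = store.write(monday)
tuesday_recipe = store.write(tuesday)
print(store.bytes_received, store.bytes_stored())
```

Note that this sketch trusts a matching hash without comparing the underlying bytes, which is why the interview says a recomputed hash "suggests" the data already exists.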
The third is one where the datastream is examined for its logical content. It assumes that a file of a particular name is most likely a good candidate for comparison with a file of exactly the same name on a fully qualified basis, meaning directory, directory tree, etc. A computational difference is then done between the two files.
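As a rough sketch of this content-aware idea, the example below keys files by their fully qualified path and stores a later version as a difference against the first copy seen under the same path. It is an assumption-laden illustration only: the names `ContentAwareStore`, `make_delta` and `apply_delta` are hypothetical, and Python's standard `difflib.SequenceMatcher` stands in for whatever differencing engine a real product would use (it is far too slow for production-sized files).

```python
import difflib


def make_delta(old: bytes, new: bytes) -> list:
    """Describe `new` as copies from `old` plus literal new bytes."""
    matcher = difflib.SequenceMatcher(None, old, new, autojunk=False)
    delta = []
    for tag, i1, i2, j1, j2 in matcher.get_opcodes():
        if tag == "equal":
            delta.append(("copy", i1, i2))      # reference existing bytes
        elif j2 > j1:
            delta.append(("data", new[j1:j2]))  # only new bytes are kept
        # pure 'delete' opcodes contribute nothing to the new version
    return delta


def apply_delta(old: bytes, delta: list) -> bytes:
    """Rebuild the new version from the old version and a delta."""
    out = bytearray()
    for op in delta:
        if op[0] == "copy":
            out += old[op[1]:op[2]]
        else:
            out += op[1]
    return bytes(out)


class ContentAwareStore:
    """Toy store that compares files sharing the same fully qualified path."""

    def __init__(self):
        self.base = {}    # path -> first full copy
        self.deltas = {}  # path -> list of deltas against the base copy

    def backup(self, path: str, content: bytes) -> int:
        """Store a file version; return the number of literal bytes kept."""
        if path not in self.base:
            self.base[path] = content
            return len(content)
        delta = make_delta(self.base[path], content)
        self.deltas.setdefault(path, []).append(delta)
        return sum(len(op[1]) for op in delta if op[0] == "data")

    def restore(self, path: str, version: int) -> bytes:
        """Version 0 is the base copy; version n is the n-th later backup."""
        if version == 0:
            return self.base[path]
        return apply_delta(self.base[path], self.deltas[path][version - 1])
```

For example, backing up `/home/alice/notes.txt` twice, with one line appended the second time, would keep the full file once and only the appended bytes afterwards.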
So there are three fundamental approaches and many different ways of implementing those approaches.
What are the different ways deduplication has been implemented?
One of the implementation differences in those approaches is whether you receive all of the data, lay it down on disk and then, sometime in the future, read it back in to deduplicate it, or whether you process the data inline and in real time as you receive it to achieve the deduplication.
Those are called inline and post-processing?
That is correct.
You say that Diligent uses the HyperFactor approach. Who are some of the vendors that use hash algorithms?
Hashing or some derivative thereof is used by Quantum/ADIC, Data Domain and FalconStor. HyperFactor is our own IP. Content-aware is something that is being pursued by Sepaton.
What are the advantages and disadvantages of inline deduplication and post-processing?
Inline deduplication, first of all, is difficult to achieve in terms of performance. But if you do achieve it, it is advantageous because once you have finished the job, the job is done -- there is no heavy lifting left, and you don't have to worry about capacity planning for background tasks or about what resources might be available to support them. With post-processing, by contrast, none of the heavy lifting is done while the backup data is being received, so end users need to concern themselves with the amount of effort needed to do the post-processing.
It is quite easy to understand, when you look under the covers, that the activity on the disk subsystem is greatly increased as a byproduct of post-processing, simply because you have to write everything and then read it back. Then there's all the database and indexing overhead, which is painful and can slow the process down. It is quite reasonable to assert that if you are able to de-dupe inline at 300 to 400MB per sec, you wouldn't even consider post-processing, because it drives toward a higher I/O profile and slows you down.
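The "write everything and read it back" argument can be made concrete with some back-of-the-envelope arithmetic. The sketch below is a simplified model, not a measurement: the function names are hypothetical, the 10:1 reduction ratio and 1 TB backup size are made-up inputs, the model assumes post-processing writes the reduced data out again after reading the raw copy, and it ignores the database and indexing overhead mentioned above.

```python
def inline_disk_io(backup_bytes: float, dedup_ratio: float) -> dict:
    """Inline: only the data that survives deduplication ever reaches disk."""
    unique = backup_bytes / dedup_ratio
    return {"written": unique, "read": 0.0}


def post_process_disk_io(backup_bytes: float, dedup_ratio: float) -> dict:
    """Post-processing: write the raw backup, read it back, then write the
    deduplicated result (assumed; real products may differ)."""
    unique = backup_bytes / dedup_ratio
    return {"written": backup_bytes + unique, "read": float(backup_bytes)}


def total_io(profile: dict) -> float:
    return profile["written"] + profile["read"]


# Hypothetical inputs: a 1 TB backup that reduces 10:1.
TB = 10 ** 12
inline = inline_disk_io(1 * TB, 10)
post = post_process_disk_io(1 * TB, 10)
print("inline total disk I/O (TB):", total_io(inline) / TB)
print("post-process total disk I/O (TB):", total_io(post) / TB)
```

Under these assumptions the post-processing path moves roughly twenty times as much data through the disk subsystem as the inline path, and the gap widens as the reduction ratio improves.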












