Prologue

It was 1996, the bandwidth between Australia and the rest of the world was miserable, and Andrew Tridgell had a problem. He wanted to synchronize source code located in Australia with source code on machines around the world, but sending patches was annoying and error-prone, and just sending all the files was painfully slow. Most people would have just waited a few years for trans-Pacific bandwidth to improve; instead Tridgell wrote rsync, the first known instance of content-based addressing (also known as content-addressed storage, or compare-by-hash), an innovation which eventually spread to software like BitTorrent, git, and many document storage products on the market.
Used properly, content-based addressing dramatically reduces bandwidth use, simplifies code, and improves security. But used when it doesn't make sense, it can reduce performance, increase bandwidth usage, and create new security risks. Figuring out when content-based addressing is good and when it is very, very bad requires an understanding of both systems programming and mathematics. What's a programmer to do? In this article, we'll lay out the considerations in language that any programmer (and even some managers) can understand.
Content-based addressing: The basics
Let's start with an example problem. Say you have an MP3 music collection, and your friend sends you a link to a new MP3 but doesn't give you the title. If it's already in your collection, you don't want to bother downloading and storing a second copy, so you want to quickly find out if it is identical to one of your other MP3s. If you only have a few MP3s, you'll just directly compare the new MP3 to each of your existing MP3s. This takes an amount of time proportional to the number of MP3s you already have, and if you have a lot of MP3s (and we know you do), it will run too slowly and you'll look for a smarter algorithm. A data structure for quickly finding a particular item - one nearly as old as computer science itself - is the hash table. There are a thousand variations on the hash table, but we'll describe a very simple example, and then use it to understand the tradeoffs of content-based addressing.
Instead of directly comparing the entire MP3 with every other MP3, we could run a function that takes the MP3 data as its input, and outputs a fixed-size signature based on the input. The goal is to generate something like a thumbnail version of a digital photo - the signature doesn't have all the information in the original, but it's a useful hint as to what the whole file contains. For example, we could simply XOR the file together 4 bytes at a time - a crude checksum (real hash functions, like CRC32, mix their input far more thoroughly). Hash functions are designed to generate these signatures, or hash values. The same input will always have the same hash value, and conversely, different hash values will always have different inputs. So, instead of comparing all the file data, you'll keep a table of hash values of the MP3s you already have. Then you can ask your friend to send you the hash value of the new MP3 and look it up in your hash table. If you don't have an MP3 with that hash value, then you know you don't have your friend's MP3, and it's worthwhile downloading.
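The XOR-every-4-bytes idea can be sketched in a few lines of Python. The function name and the zero-padding of a trailing partial word are our own choices here; this is only the toy checksum described above, not a hash function anyone should actually use:

```python
import struct

def xor32(data: bytes) -> int:
    """Toy signature: XOR the input together 4 bytes at a time.

    This is the crude checksum from the text, shown only to make
    the idea of a fixed-size signature concrete.
    """
    # Pad with zero bytes to a multiple of 4 so we can slice evenly
    # (a hypothetical choice; any convention would do for a toy).
    data = data + b"\x00" * (-len(data) % 4)
    h = 0
    for (word,) in struct.iter_unpack("<I", data):
        h ^= word
    return h
```

Note how badly this mixes its input: swapping any two 4-byte words, or XOR-ing the same word in twice, leaves the signature unchanged - which is exactly why real hash functions do something more elaborate.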
What happens if the hash value of the new MP3 matches one already in your database? While the same input always has the same hash value, different inputs may have identical hash values - a hash collision. For example, the 4-byte XOR of an input consisting of the two 32-bit numbers 0x12340000 and 0x00005678 will be 0x12345678, and the 4-byte XOR of 0x12345678 and 0x00000000 will also be 0x12345678. Because the number of inputs is infinite, and the number of hash values is finite, collisions are inevitable. In fact, some of your own MP3s may already have colliding hash values. The easiest way to deal with this is to make each entry in the hash table a list of MP3s that have that same hash value, along with pointers to their entire data. When you find two MP3s with the same hash value, you compare the whole files to see if they are really the same, or just have the same hash value. Now, instead of comparing with every file, you only compare the MP3 with the few files that have the same hash value - much faster than comparing directly with every single file.
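A hash table with per-entry collision lists can be sketched like this. We use Python's built-in zlib.crc32 as a stand-in for whichever hash function you pick, and the function names are illustrative, not from any real library:

```python
import zlib
from collections import defaultdict

# Bucket table: hash value -> list of (name, data) pairs sharing it.
table = defaultdict(list)

def have_identical(data: bytes) -> bool:
    """True if a byte-for-byte identical file is already stored."""
    h = zlib.crc32(data)
    # Only files in the matching bucket need a full comparison;
    # everything else is ruled out by the hash value alone.
    return any(existing == data for _, existing in table[h])

def add_file(name: str, data: bytes) -> None:
    """Store a file unless an identical copy is already present."""
    if not have_identical(data):
        table[zlib.crc32(data)].append((name, data))
```

The full comparison inside the matching bucket is what makes this scheme correct even when two different files collide: the hash only narrows the search, it never decides equality on its own.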
Now comes the interesting part. Sometimes, comparing the file against another file even once is prohibitively expensive, as when the files are located on different machines separated by low-bandwidth networks - say you're at an Antarctic research station and your friend's MP3 has to be trickled through the shared satellite link at a few kilobits per second. (Or maybe you're just still on dial-up.) If you had her send you the hash of the MP3, and you didn't have any file with that same hash, you'd know it was worth downloading the MP3. The problem is when the hash matches an existing MP3 in your database - how do you know whether or not they are the same MP3 without slowly and painfully downloading the new MP3 and comparing it? Hash functions have collisions; that's a fact of life. But if you could find a hash function that almost never had collisions, then you could, with very high probability, be sure that if the hashes are the same, the files they represent are also the same. If that probability is high enough, you would feel safe assuming that a collision will never occur in practice (bravely accepting the risk of never hearing that hi-lar-i-ous Weird Al Yankovic song).
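How high is "high enough"? A rough feel comes from the standard birthday approximation: assuming the hash behaves like a uniformly random n-bit function, the chance that any two of k files collide is roughly k squared divided by 2^(n+1) (this approximation, and the collection sizes below, are our own illustration, not figures from the text):

```python
def collision_probability(num_files: int, hash_bits: int) -> float:
    """Approximate probability that any two of num_files share a hash.

    Uses the birthday approximation p ~= k**2 / 2**(n+1), which is
    accurate when the result is small, and assumes the hash output
    is uniformly distributed.
    """
    return num_files ** 2 / 2 ** (hash_bits + 1)

# 10,000 files with a 32-bit hash: a collision is quite plausible,
# so you must keep the full comparison step.
# The same 10,000 files with a 160-bit hash: the estimate is so tiny
# that some systems choose to skip the comparison entirely.
```

This is exactly the tradeoff the rest of the article turns on: whether "astronomically unlikely" is a sound engineering substitute for "impossible."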