Stories by Valerie Henson

The code monkey's guide to cryptographic hashes for content-based addressing

It was 1996, the bandwidth between Australia and the rest of the world was miserable, and Andrew Tridgell had a problem. He wanted to synchronize source code located in Australia with source code on machines around the world, but sending patches was annoying and error-prone, and just sending all the files was painfully slow. Most people would have just waited a few years for trans-Pacific bandwidth to improve; instead Tridgell wrote rsync, the first known instance of content-based addressing (also known as content-addressed storage, or compare-by-hash), an innovation which eventually spread to software like BitTorrent, git, and many document storage products on the market.

Real-world disk failure rates offer surprises

At this year's USENIX File Systems and Storage Technology Conference, we were treated to two papers studying failure rates in disk populations numbering over 100,000. These kinds of data sets are hard to get -- first you have to have 100,000 disks, then you have to record failure-related data faithfully for years on end, and then you have to release the data in a form that doesn't get anyone sued.