Refactoring development

Ramblings from the trenches...

View on GitHub

Cache by Hash

Cache by hash has been on my mind for a while, but I saw this and it makes me convinced that the web is going to head in this direction for scaling. Maybe we won’t get quite as far as IPFS hopes, but I’m sure we’re heading in that direction.

 

There’s a great free e-book on how git works (Pro-Git). For me the biggest take-away was the beauty of storing data by it’s hashcode (SHA256 in git’s case).

If you can exclude mutable things like the filename and timestamps, then when you have multiple copies of the same file (or chunk of data) you’ll naturally save them under the same location. You could call this ‘hash based compression’ as the same chunk of data is never stored twice.

You’d need to reference count the pointers to this chunk of data (file) if you wanted to delete it (unless you’re using a time based clean up).

Ok, so no dupe, but what about load balancing?

It turns out that the chord algorithm is perfect for this (or pick any other DHT algo). You can split up where you store the data and can store it n-times for redundancy.

Key Gotcha

There’s still the question of how to get the hash if you don’t have the original. The hash makes a perfect immutable key for the data, but you still need to map from your mutable key (filename/timestamp) to the immutable key.

The good news is you have split the problem into two distinct problems:

  1. Mapping from a small mutable key to a small immutable hash.
  2. Distribution of the (heavy) data.

Once you’ve keyed by hash you can’t get the wrong / out of date contents. It is at this stage very error-proof. (And you can always verify that the data you have does indeed match the hash).

Versioning

So far, so good, but how do you have a version 1, version 2 etc of an object? If we’re storing the object under its hashed address we can hash the hash (or just add one to the hash) and store the pointer there to the next hash. (For a more in-depth discussion on this topic see Mutable Value Chains)

 

Additional references:

Single Instance Storage

W3C’s SRI working draft (Supported by Chrome and FireFox)

SRI CDN jsdelivr


 

UPDATE

I see Docker has finally moved to using cache via hash for their layers and images (“content addressable image ids” they call them.).  
  Cache by hash has been on my mind for a while, but I saw this and it makes me convinced that the web is going to head in this direction for scaling. Maybe we won’t get quite as far as IPFS hopes, but I’m sure we’re heading in that direction.

 

<hr /