I think consistent hashing is pretty fascinating. It lets you define a ring of machines that shard out data by a hash value. Imagine that your hash space is 0 -Int.Max, and you have 2 machines. Well one machine gets all values hashed from 0 -Int.Max/2 and the other from Int.Max/2 -Int.Max. Clever. This is one of the major algorithms of distributed systems like cassandra and dynamoDB.
For a good visualization, check out this blog post.
The fun stuff happens when you want to add replication and fault tolerance to your hashing. Now you need to have replicants and manage when machines join and add. When someone joins, you need to re-partition the space evenly and re-distribute the values that were previously held.
Something similar when you have a node leave, you need to make sure that whatever it was responsible for in its primray space … Read more
A fun problem that has come up during the implementation of cassieq (a distributed queue based on cassandra) is how to evenly distribute resources across a group of machines. There is a scenario in cassieq where writes can be delayed, and as such there is a custom worker in the app (by queue) who watches a queue to see if a delayed write comes in and republishes the message to a bucket later on. It’s transparent to the user, but if we have multiple workers on the same queue we could potentially republish the message twice. While technically that falls within the SLA we’ve set for cassieq (at least once delivery) it’d be nice to avoid this particular race condition.
To solve this, I’ve clustered the cassieq instances together using hazelcast. Hazelcast is a pretty cool library since it abstracts away member discovery/connection and gives you events on membership … Read more