A Haskell Kademlia implementation.
Lovely you should ask.
Kademlia is a distributed hash table – why is this cool? It allows a bunch of computers to get together and store information across the entire system. Then, each can find/retrieve information that exists only on a given computer, without ever having talked to it before!
So let’s say Computer 1 wants File A. File A exists on Computer 1337. Computer 1 can make a few jumps across the network (log_2(n) to be exact) to find exactly where File A is - then C1 can talk to C1337 - and splendid piracy continues.
- Its ID, which let’s say is in binary format, like 101101.
- A Binary Tree containing information about a subset of other nodes in the network.
- The tree has lists as its leaves
- These lists are filled with <Node ID, IP Address, UDP Port> tuples.
- PING
- STORE
- FIND_NODE
- FIND_VALUE
Most of these operations involve a lookup, which is described further down.
- A given object is hashed into the same format as the node IDs. (Let’s assume binary format)
- When performing a FIND_VALUE, the goal is to look for nodes close (see below) to that value (in terms of their IDs)
- If we consider Node / Object IDs as binary digits, we can XOR them to find the “distance”
- This has a number of cool properies
- unidirectionality : all paths lead to Rome (the node). For a distance d and point x, there is one y such that dist(x,y) = d
- symmetry : dist(x,y) = dist(y,x)
The following is pseudo-code for a node U looking up the ID W.
user sets A, a concurrency parameter user sets K, a replication parameter maintain k-heap, a min-heap of nodes ordered by distance from W each entry also maintains a queried? flag (which means queried & received response) define get-k-closest, which takes a node and target and returns node's K closest nodes to target define query(to-query): results := to-query.each { |x| get-k-closest(x, W) } terminate if any of the results contain our value or have our desired node ID add results to k-heap terminate if the first K of k-heap has been queried distances := map (distance to W) on to results if min(distances) < peek(k-heap): query-next := grab A unqueried from k-heap else: query-next := grab any node in the first K of k-heap that's unqueried query(query-next) start by calling query on A closest nodes to W.
In order to get the initial A closest nodes:
- Assume IDs are 160-bits.
- Assume the lookup-tree is a binary tree where the branches at each point are labeled 0 and 1.
- Walk the digits of W while walking the tree until you hit a leaf.
- Grab the <= k nodes in that bucket.
- If < k, walk back a digit and find the next leaf bucket (repeat if necessary).
If the request is for a value, terminate once a node returns the value associated with the key.
- When a node receives any message from another node W, it updates the k-bucket for W.
- If W is already there, move it to the end of the list.
- If it’s not there, add it to the end of the list.
- If it’s not there and the bucket is full, ping the last recently seen node (first) in the k-bucket.
- If it responds, move it to the end of the list and discard W.
- If it does not, evict it and add W to the end of the list.
- This way, new nodes can’t flood routing tables.
- Once a lookup succeeds, the requesting node stores the <key,value> it was looking for at the closest node that did not return the value.
- <key,value>s expire in a time exponentially proportionalo to the number of nodes between the storing node and the next closest node to the key.
- If a node hasn’t performed a lookup in a given bucket in an hour, it will pick one at random and perform a lookup.
- To join the network, node U adds W into a k-bucket and then performs a node-lookup for itself. Then it refreshes all k-buckets further than closest neighbor.
- <key,value>s are republished once every hour
- TODO: there’s more to talk about here