Hash tables are a common data structure for storing key-value pairs.
To get a better understanding of how they work I started building my own data structure to store key-value pairs and then tried to optimize it for performance.
If you’ve got the time, consider trying to come up with your own solutions before reading on.
Hash tables can be used to implement associative arrays (also called dictionaries). This is how our hash table will work:
A naive key-value store
Let’s start with a very naive implementation of a dictionary. We’ll actually end up using this in our hash table later on too.
NaiveDict does everything we set out to do, but it’s very slow. We may need to iterate through every item in the table, so worst-case performance is
How hash tables work
At the core of a hash table is an array of roughly similar size to the number of key-value pairs stored in the hash table.
Each index in the array is called a bucket (or entry, or slot) and stores any number of key-value pairs. A bucket may not have any key-value pairs in it, or all key-value pairs in the hash table may be in the same bucket. But the goal is to equally distribute key-value pairs across the buckets.
The key of a key-value pair decides what bucket it should be stored in. First, the key is converted to a number using a hash function:
We want to use the hash as an index in our array of buckets. But what if we only have 1000 buckets, and the hash value exceeds the array size?
To solve that problem we use the modulo operator to get an index that’s smaller than our array size.
Sometimes the hash function will return the same hash for the same key – that’s called a collision.
Because collisions are possible we need a second level of storage in each bucket. In this example we’ll use the
NaiveDict class from above.
When initializing the hash table we create an array containing a fixed number of these buckets.
More complex hash table implementations will resize the table dynamically based on the number of items they store.
We need to decide how our hash function should work. For now, let’s use a very simple implementation that takes the sum of the ASCII codes of the characters in the key.
To determine the correct bucket given a certain key we use the modulo operator on the hash of the key, then retrieve the bucket at the index we just calculated.
Finally, we implement the
set functions. Each bucket contains a
NaiveDict, so we can insert or look up the key there.
We’ve got a working hash table!
Let’s see how well our hash table is doing.
This test generates 100,000 keys and values, then measures how long it takes to insert (
SET) and then read (
GET) them from the hash table.
I took the
makeid function from StackOverflow.
console.timeEnd to measure how fast our hash table is.
Result on my machine:
SET: 108.112ms GET: 760.709ms
That’s pretty good! Doing the same test with
NaiveDict takes almost two minutes:
SET: 7.021ms GET: 111172.771ms
SET: 127.702ms GET: 30.727ms
Let’s see if we can get the performance of our hash table closer to that.
A better hash function
I mentioned earlier that, ideally, the hash function should uniformly distribute keys to buckets. But here’s what our hash function does:
Only a few hundred buckets are actually used.
The current hash function treats every character equally, but let’s change it to take the position of the character into account:
Now a bit over 10 thousand buckets are used:
And we get better performance:
SET: 208.314ms GET: 95.843ms
Which uses most of the buckets we’ve allocated!
However, this hash function is also more complicated and takes more time to run. That makes inserting the key-value pairs take longer. But because we have fewer collisions retrieval becomes faster, despite the increase in time spent on hashing.
SET: 434.148ms GET: 72.455ms
More things we could do better
While our CPU performance is pretty good there’s still a lot of room to save on memory. For example, even buckets that contain no items currently use up memory (two empty lists in
Also, in practice the individual buckets often use linked lists rather than our
While we can retrieve the correct bucket in constant time (
O(1)), the complexity of retrieving the value from
NaiveDict is still
However, now our n is much smaller. In fact, if our hashes were perfectly distributed and there are no collisions each
NaiveDict would only contain one item. Thus, in the best case