Analyzing collision frequency and lookup performance for a chaining, open addressing, and hybrid hash table configurations
Motivation
Hash tables are fundamental data structures used in countless applications, from database indexing to caching systems. The efficiency of these operations heavily depends on how the hash table handles collisions — when two different keys map to the same index. Our project investigates three prominent collision resolution strategies: chaining, open addressing, and a hybrid approach that combines both.
The motivation for this comparison stems from several key considerations:
Performance Trade-offs: Each collision resolution strategy presents unique trade-offs between memory usage, insertion time, and lookup performance. Chaining uses additional memory for linked lists but offers consistent performance, while open addressing is more memory-efficient but can suffer from clustering. The hybrid approach attempts to balance these trade-offs, but its effectiveness needs empirical validation.
Real-world Impact: In practical applications, hash tables often need to handle both successful and unsuccessful lookups efficiently. For example, in a spell checker, most lookups will be successful (words are spelled correctly), but the system must also handle failed lookups (misspelled words) quickly to provide responsive suggestions.
Scalability Concerns: As datasets grow, the performance characteristics of different collision resolution strategies can change dramatically. Our doubling experiment (testing with sizes from 1,000 to 512,000 entries) helps understand how each approach scales and at what point performance might degrade.
Implementation Complexity: While theoretical analysis provides valuable insights, real-world performance can differ due to implementation details, memory access patterns, and hardware characteristics. Our empirical comparison across different operating systems (Windows and macOS) provides practical insights into these factors.
By conducting this comprehensive comparison, we aim to provide data-driven insights that can help developers choose the most appropriate hash table configuration for their specific use case, considering factors like expected data size, lookup patterns, and performance requirements.
Experiment Design
Our experiment was designed to tackle the following research question. RQ: Compare hash table configurations (open addressing, chaining, hybrid) using a doubling experiment with randomly generated key-value pairs to analyze collision frequency and time overhead for lookups, including searches for both existing and non-existing keys. In order to successfully design a tool to test these features we split our tool into three algorithms (chaining, open addressing, and hybrid), benchmarking to measure performance metrics of our implementations, and a doubling experiment. This approach allows for clear implementation, running, and analysis of data.
Implementation
Chaining
Chaining is a collision resolution strategy that is used in hash tables. Hash tables have an index or bucket in the table that stores a list of key value pairs instead of single values. What chaining works to do is have multiple keys hash to the same index, then store them together in the buckets list which is why the method is called chaining. This allows the hash table to handle collisions easier without needing to probe other slots.
My algorithm uses separate chaining to handle collisions with lists (or, called chains) at index or buckets. Then it stores keys that are mapped together. Working to never probe but to only operate within the chain at the hashed index.
self.buckets = [[] for _ in range(capacity)]
In this line I created an array where each element is a list. This list is the “chain” at each index. If two keys hash to the same index, they are stored in this list, giving evidence to the separate chaining process.
def put(self, key, value):
= hash(key) % self.capacity
index = self.buckets[index]
bucket # record a collision if there is already at least one element in the bucket.
if bucket:
self.collisions += 1
Here I compute the hash index
according to index = hash(key) % self.capacity
then I access the bucket at that index bucket = self.buckets[index]
if that bucket is not empty, you record a collision, because another key already hashed there. Then, you either update an existing key or you would append a new key.
for i, (k, v) in enumerate(bucket):
if k == key:
= (key, value)
bucket[i] return
bucket.append((key, value))
This is what it would look like if you have to iterate through the bucket to check if the key already exists. If it does, it updates the value. If not, it appends the new key-value pair to the bucket like mentioned above.
def get(self, key):
"""Retrieves the value associated with the given key from the hash table."""
= hash(key) % self.capacity
index = self.buckets[index]
bucket for k, v in bucket:
if k == key:
return v
return None
This code segment works by looking through the chain (the list at bucket[index]
) and find the correct key. If it finds the correct key it will return the value, if not it will return None.
Open Addressing
Open addressing is a collision resolution method used in hash tables. When a collision occurs (two keys hashing into the same index) open addressing will search for the next available slot in the hash table to store the collided key. This element will ensure that all elements are to be stored in the hash table itself.
I used linear probing in the hash table which is a sequential checking method. The sequence starts by checking the original hash location. If that location is occupied the next location will be checked. This process will continue until an empty slot is found.
= (index + 1) % self.capacity index
This section of code is the rehashing function that will find the empty slot in the hash table.
Hybrid
The hybrid method combines the use of open addressing and chaining for hash tables. It does this with the parameter probe_threshold
, which sets the point at which the algorithm will switch between open addressing and chaining.
def __init__(self, capacity, probe_threshold=3):
When the probe threshold is less than three, open addressing is used, meaning that the algorithm will attempt to resolve collisions by probing sequentially for an empty slot within the hash table. If a collision occurs, the algorithm will check the next available slot, continuing this process up until probe_threshold
times. For the put
operation, this means inserting the key-value pair into the next available slot if the initial one is taken. For the get
operation, this means looking at the next slot to continue searching for the desired key if it isn’t found at the current index.
= (index + 1) % self.capacity index
Once the probe threshold is reached, the algorithm switches to chaining, where multiple values are stored in a list at a single index in the hash table. If another collision occurs at this point, rather than continuing to probe, the new key-value pair is added to the list. For the put
operation, this involves appending the new key-value pair to the chain at that index. For the get
operation, the algorithm searches through the list at that index to find the desired key and return its associated value.
self.table[index].append((key, value))
Experiment and Benchmarking
To evaluate the performance of our hash table configurations (i.e., chaining, open addressing, and hybrid), we implemented a benchmarking structure that compares how each algorithm handles insertion and lookup in a doubling experiment. Our primary goal was to measure collision frequency during insertion and runtime during lookups. For the lookup experiment, we were concerned for both keys that existed in the table and those that did not. By doubling the number of key-value pairs with each test, from 1000 to 512,000 pairs, we can observe each implementation’s worst-case time complexity and how well they maintain performance as the table becomes more populated.
To simulate realistic and diverse inputs, we used randomly generated strings as keys and integers as values. The random keys were created using combinations of uppercase and lowercase letters, while the values were simply random integers. This method ensured that our experiments weren’t biased toward any particular pattern and better represented unpredictable real-world data.
def random_string(length=5):
"""Generate a random string of specified length."""
return ''.join(random.choices(string.ascii_letters, k=length))
def generate_random_pairs(k: int) -> List[Tuple[str, int]]:
"""Generate k random key-value pairs."""
return [(random_string(), random.randint(1, 100)) for _ in range(k)]
Our benchmarking implementation is split into two main tasks: insertion benchmarking and lookup benchmarking. For insertion, we measure the total time it takes to populate the table and count how many collisions occur. This benchmarking design gives us insight into how each method handles collisions in the hash space. Lookup benchmarking is performed by timing look-up for both existing keys and non-existing keys. This approach allows us to see not only how efficient each hash table is under successful look-up, but also how well each algorithm handles lookup failures. This is an important consideration because failed lookups may be common in real-world scenarios like caching or spell checking.
Overall, both approaches use a similar benchmarking technique by using the time
function, to test the total tie to call the put
and get
methods for each of our hash table configurations. Together, out measurements provide a holistic view of the strengths and trade-offs of each collision resolution technique, and presented the data in a way that was simple to analyze and directly see the worst-case time complexity.
Running the Experiments
To run our experiment, we have opted to use Python directly. In order to run the program, all that the user needs to do is clone the repo to their laptop. From there, the user simply enters the src
directory of our tool, and types python main.py
in their terminal and the program will produce 6 basic tables. Each of the tables represented the doubling experiment conducted, three for collision testing and 3 for look-up testing. The data should be displayed in a clear and labeled manner to ensure readability and easy analysis.
When using the tool, it is important to note that there should be no need for alteration of the code to properly run this edition of our experiment, as we have handled all the variables and key-value pair generation through the individual implementations and experiment design.
Example Output
Below is an example of what the expected output from running our tool looks like. This data, along with the analysis from other runs across operating systems, will help guide our analysis and conclusion on collision frequency and lookup performance of various hash table configurations.
- Window OS
Starting hash table comparison experiments...
Generating experiment data...
Running doubling experiment...
Chaining Hash Table Population Performance (seconds)
Size Populate Time Num. of Collisions
---- -------------- ----------
1000 0.000551 0
2000 0.000539 0
4000 0.000535 0
8000 0.001102 0
16000 0.004733 0
32000 0.012599 0
64000 0.023571 0
128000 0.047765 0
256000 0.084535 0
512000 0.188057 0
Open Addressing Hash Table Population Performance (seconds)
Size Populate Time Num. of Collisions
---- -------------- ----------
1000 0.001745 509
2000 0.006581 980
4000 0.016572 1997
8000 0.059750 3990
16000 0.145642 7962
32000 0.767986 15972
64000 2.548893 32008
128000 5.890426 63845
256000 24.904955 127993
512000 32.579744 255330
Hybrid Hash Table Population Performance (seconds)
Size Populate Time Num. of Collisions
---- -------------- ----------
1000 0.000000 918
2000 0.000000 1768
4000 0.000000 3617
8000 0.016283 7232
16000 0.004849 14407
32000 0.018594 28789
64000 0.041206 57284
128000 0.116695 115051
256000 0.246317 229837
512000 0.716221 458745
Running lookup experiment...
Chaining Hash Table Lookup Performance (seconds)
Size Existing Lookup Non-Existing Lookup
---- ---------------- -------------------
1000 0.000543 0.000000
2000 0.000529 0.000000
4000 0.000000 0.000534
8000 0.000564 0.000000
16000 0.000524 0.000000
32000 0.000000 0.000000
64000 0.000000 0.000000
128000 0.000000 0.000000
256000 0.000556 0.000000
512000 0.000000 0.009288
Open Addressing Hash Table Lookup Performance (seconds)
Size Existing Lookup Non-Existing Lookup
---- ---------------- -------------------
1000 0.002285 0.084326
2000 0.001909 0.170237
4000 0.003496 0.368865
8000 0.006198 0.964839
16000 0.015819 1.990731
32000 0.033609 3.385081
64000 0.049811 8.993584
128000 0.046854 11.788684
256000 0.079588 37.889480
512000 0.056853 19.009886
Hybrid Hash Table Lookup Performance (seconds)
Size Existing Lookup Non-Existing Lookup
---- ---------------- -------------------
1000 0.000000 0.000000
2000 0.000000 0.000000
4000 0.001080 0.000503
8000 0.000000 0.000000
16000 0.000000 0.000000
32000 0.000000 0.000000
64000 0.000000 0.000000
128000 0.001000 0.001000
256000 0.005110 0.000000
512000 0.000000 0.000000
Importantly, while this result was run on a Windows system, our results and analysis were ran across a variety of operating systems, including Windows and macOS, in order to provide the most inclusive and well-rounded analysis possible.
Starting hash table comparison experiments...
Generating experiment data...
Running doubling experiment...
Chaining Hash Table Population Performance (seconds)
Size Populate Time Num. of Collisions
---- -------------- ----------
1000 0.000094 0
2000 0.000166 0
4000 0.000331 0
8000 0.000902 0
16000 0.001426 0
32000 0.003253 0
64000 0.008531 0
128000 0.018017 0
256000 0.050152 0
512000 0.099632 0
Open Addressing Hash Table Population Performance (seconds)
Size Populate Time Num. of Collisions
---- -------------- ----------
1000 0.000618 493
2000 0.003033 1029
4000 0.009645 1999
8000 0.020971 3945
16000 0.051408 8036
32000 0.118082 15951
64000 0.491041 31951
128000 2.330490 64169
256000 15.483382 127762
512000 21.686648 255631
Hybrid Hash Table Population Performance (seconds)
Size Populate Time Num. of Collisions
---- -------------- ----------
1000 0.000293 874
2000 0.000498 1852
4000 0.001072 3547
8000 0.002150 7168
16000 0.004979 14467
32000 0.009560 28554
64000 0.020484 57517
128000 0.050377 115388
256000 0.127418 229571
512000 0.303631 459213
Running lookup experiment...
Chaining Hash Table Lookup Performance (seconds)
Size Existing Lookup Non-Existing Lookup
---- ---------------- -------------------
1000 0.000090 0.000096
2000 0.000086 0.000092
4000 0.000086 0.000092
8000 0.000107 0.000092
16000 0.000096 0.000095
32000 0.000101 0.000101
64000 0.000126 0.000123
128000 0.000166 0.000125
256000 0.000207 0.000176
512000 0.000253 0.000221
Open Addressing Hash Table Lookup Performance (seconds)
Size Existing Lookup Non-Existing Lookup
---- ---------------- -------------------
1000 0.000658 0.042120
2000 0.001209 0.084219
4000 0.002769 0.170023
8000 0.002219 0.344002
16000 0.002225 0.707919
32000 0.004066 0.788226
64000 0.007995 1.840707
128000 0.014459 6.570866
256000 0.154649 17.753949
512000 0.088448 19.394856
Hybrid Hash Table Lookup Performance (seconds)
Size Existing Lookup Non-Existing Lookup
---- ---------------- -------------------
1000 0.000244 0.000297
2000 0.000241 0.000287
4000 0.000253 0.000287
8000 0.000239 0.000313
16000 0.000307 0.000331
32000 0.000397 0.000508
64000 0.000456 0.000556
128000 0.000451 0.000611
256000 0.000529 0.000606
512000 0.000684 0.000764
Data Analysis
In our experiment comparing hash table configurations (i.e., chaining, open addressing, and hybrid) we found interesting patterns in how each approach handled doubling data sizes, collisions, and lookup efficiency. Chaining consistently demonstrated the best performance in terms of both population time and collision frequency. Across all sizes, chaining had zero collisions and kept and efficient population and lookup times. This is because chaining uses linked lists at each index to handle collisions, which scales well as long as the size remains manageable. The lookup times also remained minimal, even for non-existent keys.
Open addressing, on the other hand, showed a steep increase in both collisions and time costs, especially as the table size doubled. The primary reason is that open addressing stores all elements within the table itself, using linear probing to resolve collisions. As the size increased, so did the time needed to find an open slot or resolve a failed search, which explains the drastic rise in non-existing key lookup times (reaching as high as 37 seconds in some cases). This method suffers under heavy population due to clustering and probe sequences.
Our hybrid hash table, while faster than open addressing and more flexible than chaining, still incurred a large number of collisions. This approach attempts to balance between chaining and open addressing but ended up trading off efficiency. The high collision count suggests this implementation leans heavily on fallback mechanisms, which increase overhead during population. Yet, lookup times, especially for non-existing keys, remained low across the board, regardless of size. This suggests the hybrid strategy prioritizes efficient searches even if there are frequent collisions. Overall, our data shows that chaining remains the most scalable and efficient method for hash table configurations when looking at collisions and lookup time.
Below is a table of the worst case time complexities of the different configurations and methods, according to the data we received in our experiments and are shown above.
Approach | Population WCTC | Lookup (Existing) WCTC | Lookup (Non-Existing) WCTC |
---|---|---|---|
Chaining | O(n) | O(1) | O(1) |
Open Addressing | Between O(n) and O(n²) | O(n) | O(n), but much slower |
Hybrid | O(n) | O(1) | O(1) |
Future Work
For future work, we could further explore the hybrid method and its use of the probe threshold. Our current implementation uses a fixed probe threshold, setting the value at three for every run. We could experiment with different probe threshold values to see if increasing or decreasing this value has any impact on runtime performance. We could also explore using dynamic thresholds based on load factor to see if runtime performance can be improved.
Conclusion
In conclusion, this project provided a hands-on comparison of different hash table implementations, showing how chaining, open addressing, and hybrid methods perform under the same conditions. By running controlled tests and analyzing the results, we gained practical insight into how each approach handles collisions, lookups, and scales with data. Understanding the trade-offs between different hash table implementations is essential for designing efficient data structures.