Hash Table Data Structure


Things to be discussed here,
  • Introduction
  • Hash Table
  • Implementation
  • Practice Problem
Introduction
If you have programming experience and don't know what a Hash Table is, chances are you use them regularly without realizing it. They are so useful and practical that many programming languages have a hash table construct built into the syntax.
They are referred to as associative arrays (PHP) or dictionaries (Python) or hashes (PERL). By creating your hash table, you can tailor it to your specific needs with optimal efficiency. In this article, I'll try to provide a basic understanding of hash tables. In the end, I'll provide a full implementation of a hash table in C++.

“If I were stranded on a desert island and could only take one data structure with me, it would be a hash table.”
- Peter Van Der Linden, Author of Deep C Secrets
Hash Table
So, what is a hash table? Think of it as arrays on steroids. It’s a versatile data structure that can solve many kinds of problems, and it can also store a huge variety of data types and quantities super efficiently.
A hash table is made up of two different parts: An array (which you're already familiar with) and a hash function (which is probably a new concept!). So why these parts? The array is what holds our data and the hash function decides where exactly our data goes in the array - and how we'll get it out when the time comes.
Hashing is a technique that is used to uniquely identify a specific object from a group of similar objects. Some examples of how hashing is used in our lives include:
  • In universities, each the student is assigned a unique roll number that can be used to retrieve information about them.
  • In libraries, each book is assigned a unique number that can be used to determine information about the book, such as its exact position in the library or the users it has been issued to etc.
In both these examples, the students and books were hashed to a unique number.
More formally, the Hash table is a data structure that represents data in the form of key-value pairs. Each key is mapped to a value in the hash table. The keys are used for indexing the values/data.
Hash Function
Any function that can be used to map a dataset of arbitrary size to a dataset of a fixed size that falls into the hash table is called a hash function. The values returned by a hash function are called hash values, hash codes, hash sums, or simply hashes.
To achieve a good hashing mechanism, it is important to have a good hash function with the following basic requirements:
  1. Easy to compute: It should be easy to compute and must not become an algorithm.
  2. Uniform distribution: It should provide a uniform distribution across the hash table and should not result in clustering.
  3. Fewer collisions: Collisions occur when pairs of elements are mapped to the same hash value. These should be avoided.
Note: Irrespective of how good a hash function is, collisions are bound to occur. Therefore, to maintain the performance of a hash table, it is important to manage collisions through various collision resolution techniques.
Operations
Average Case
Worst Case
space
O(n)
O(n)
insert
O(1)
O(n)
lookup
O(1)
O(n)
delete
O(1)
O(n)
Strengths
  • Fast lookups. Lookups take O(1) time on average.
  • Flexible keys. Most data types can be used for keys if they're hashable.
Weaknesses
  • Slow worst-case lookups. Lookups take O(n) time in the worst case.
  • Unordered. Keys aren't stored in a special order. If you're looking for the smallest key, the largest key, or all the keys in a range, you'll need to look through every key to find it.
  • Single-directional lookups. While you can look up the value for a given key in O(1) time, looking up the keys for a given value requires looping through the whole dataset - O(n) time.
  • Not cache-friendly. Many hash table implementations use linked lists, which don't put data next to each other in memory.
Collision Problem
Let’s just take a quick pause to think about hash functions for a second - since we’re converting data of any length x into data of fixed size y, dataset x will be way larger than dataset y. There are infinite entries in dataset x, but limited ones in dataset y because all the numbers must be a certain size or length. In other words, there could be two or more pieces of data in dataset x that will hash to the same integer in dataset y.
This is called the collision problem, and here are some ways to deal with it:
  1. Separate Chaining
  2. Linear probing
  3. Quadratic Probing
  4. Double Hashing
Separate Chaining
When two or more data collide and hash to the same index, we could simply link the data up into a linked list at that index of the array, as you can see above.
Of course, this is the kind of situation we’re going to want to avoid because instead of using constant time O(1) to find a key and its corresponding information in the array, we end up needing O(n) time to walk through the linked lists to find our actual values.
Linear Probing
Another way to deal with collisions is to walk down the array from the collided index one by one until you get to the next free slot.
Quadratic Probing
Quadratic probing is like linear probing and the only difference is instead of going one by one we follow an arbitrary polynomial to get to the next unoccupied index.
Double hashing
Double hashing is like linear probing and the only difference is the interval between successive jumps. Here, the interval between jumps is computed by using two hash functions.
Dynamic array resizing
Suppose we keep adding more items to our hash map. As the number of keys and values in our hash map exceeds the number of indices in the underlying array, hash collisions become inevitable.
To mitigate this, we could expand our underlying array whenever things start to get crowded. That requires allocating a larger array and rehashing all our existing keys to figure out their new position in O(n) time.
Applications
Hash tables are implemented where
  • Constant time lookup and insertion is required.
  • Cryptographic applications are used.
  • Indexing data is required.
For example,
  • Associative arrays: Hash tables are commonly used to implement many types of in-memory tables. They are used to implement associative arrays (arrays whose indices are arbitrary strings or other complicated objects).
  • Database indexing: Hash tables may also be used as disk-based data structures and database indices (such as in DBM).
  • Caches: Hash tables can be used to implement caches i.e. auxiliary data tables that are used to speed up access to data, which is primarily stored in slower media.
  • Object representation: Several dynamic languages, such as Perl, Python, JavaScript, and Ruby use hash tables to implement objects.
  • Hash Functions are used in various algorithms to make their computing faster.
Implementation in C++
The implementation of Hash Table using separate chaining to resolve collisions is given here,

If you have any doubts or issues in the implementation of other types of hashing, feel free to comment below.
Practice Problems
  1. Bear and Three Musketeers ( 574B )
  2. Gargari and Bishops ( 463C )
  3. King's Path ( 242C )
  4. Train and Peter ( 8A )
  5. Correcting Mistakes ( 533E )
If you have any doubt or any content related to the above-discussed topics, feel free to drop in the comment section

Thank You for Reading


Comments