Transcript for:
Mastering Hash Maps: Key Concepts and Techniques

Hey everyone! Welcome to my channel where we explore the fascinating world of computer science and technology. In this video, we're going to dive deep into the world of hash maps, a fundamental computer science concept that is used in many real-world applications. So, let's jump into it!

Why is it important to learn how a hash map works? Hash maps are probably the most common data structure used in real-world applications, for example in database indexing, caching, compilers, and so on. In practice, you are most likely going to use a library and not worry too much about its internals. However, learning how a HashMap works is a crucial step in developing your skills as a software engineer and improving your ability to write highly performant code.

A HashMap is a collection of key-value pairs, which means that each key is associated with some value. For example, a HashMap can be used to store the ages of Alice, Bob, and Charlie. A HashMap allows us to insert, find, or delete elements. For example, we can add Tim and his age to the map, find Charlie's age, or delete Alice. Although other data structures, such as lists, also provide this functionality, hash maps are much faster for most use cases.

HashMaps are usually implemented on top of an array where each element contains both the key and the value. But how do we decide where to place each element in the array? We need to assign a slot to each key. For example, we could assign a number to each letter and then sum them. This computation is called a hash function, and the result is a hash value. Some numbers returned by the hash function might be bigger than the size of the array, so we take the remainder when dividing by the size of the array. This number gives us a slot in the array for each key. Let's look at another example. Say that we want to store Alice's age. The hash function returns slot 1, which is already taken by Charlie. This situation is called a collision.
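As a sketch, the letter-summing idea can be written out like this in Python. The exact letter values and array size used in the video aren't specified, so the numbering here (a=1, b=2, ..., with 8 slots) is just illustrative:

```python
# Toy hash function: assign a number to each letter (a=1, b=2, ...),
# sum them, then take the remainder modulo the array size to get a slot.
# The letter values and the array size of 8 are illustrative choices.

def toy_hash(key: str, num_slots: int) -> int:
    letter_sum = sum(ord(c) - ord('a') + 1 for c in key.lower())
    return letter_sum % num_slots

# "alice" sums to 1+12+9+3+5 = 30, which lands in slot 30 % 8 = 6.
print(toy_hash("alice", 8))  # 6
print(toy_hash("bob", 8))    # 2+15+2 = 19, slot 19 % 8 = 3
```

With a function this simple, two different names can easily land in the same slot, which is exactly the collision situation described above.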
You may say, let's change the hash function to avoid collisions, and that would be ideal. While a well-designed hash function can minimize the number of collisions, we still need a method for resolving them. One of the simplest collision resolution techniques is called closed addressing. There is also an alternative method called open addressing. We will cover both in this video.

The idea behind the closed addressing technique is to place all the elements that hash to the same slot into the same bucket. The bucket can be a linked list, an array, or a balanced tree. The technique that uses a linked list is called chaining. To implement this, each slot in the array becomes a linked list. We can now insert Alice's age into the corresponding linked list. Let's look at a few more examples. John is stored at slot number one, Amy at slot number zero, and Alex at slot number two.

But what about running time complexity? Wouldn't this be slow? For searching, the worst-case running time is proportional to the length of the list, but the average-case running time for searching is O(1). Actually, the complexity depends on how many elements are in the map, but we'll get to that in a second. Insertion and deletion have the same running time complexity as searching. For insertion, we have to check if the element exists before inserting it, while deletion requires us to find the element with the corresponding key in the linked list. However, insertion can be implemented in O(1) worst case if we assume that the element doesn't already exist. Similarly, deletion can be implemented in O(1) if we already have a pointer to the element in a doubly linked list.

So how well does hashing with chaining perform in the average case? In particular, how long does it take to search for an element with a given key? To answer this question, let's introduce a load factor alpha. For a HashMap with m slots and n elements, we define the load factor as n over m.
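A minimal sketch of chaining, including the load factor just defined, might look like this. Python lists stand in for the linked lists, and the class and method names are illustrative, not taken from any particular library:

```python
# Sketch of a chained hash map: each slot holds a bucket (a list) of
# (key, value) pairs. Python's built-in hash() stands in for the video's
# toy hash function.

class ChainedHashMap:
    def __init__(self, num_slots=8):
        self.buckets = [[] for _ in range(num_slots)]
        self.size = 0

    def _slot(self, key):
        return hash(key) % len(self.buckets)

    def put(self, key, value):
        bucket = self.buckets[self._slot(key)]
        for i, (k, _) in enumerate(bucket):
            if k == key:                 # key already present: update it
                bucket[i] = (key, value)
                return
        bucket.append((key, value))
        self.size += 1

    def get(self, key):
        for k, v in self.buckets[self._slot(key)]:
            if k == key:
                return v
        return None

    def delete(self, key):
        bucket = self.buckets[self._slot(key)]
        for i, (k, _) in enumerate(bucket):
            if k == key:
                del bucket[i]
                self.size -= 1
                return True
        return False

    @property
    def load_factor(self):               # alpha = n / m
        return self.size / len(self.buckets)
```

Note how every operation first walks the bucket for the key, which is why search, insert, and delete all share the same average-case cost.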
This gives us the average number of elements per slot. The average-case performance depends on how well the hash function distributes keys over these m slots. Let's look at an example to understand how the load factor changes. If we add new elements to the HashMap, the load factor increases. Similarly, the load factor decreases if we remove elements. If the load factor is too high, usually above 75%, it is a good idea to create a larger array to reduce it. The new array is twice the size of the original array. To do that, we need to distribute the elements from the original array to the new one, and this process is called rehashing. Note that the elements end up in different slots because the array size has changed.

Now let's talk about open addressing. In open addressing, all elements are stored in the array slots themselves. Each slot either contains an element or it doesn't. When searching for an element, we examine the slots until we find the element we're looking for, or have concluded that the element doesn't exist. There are no additional lists to store elements, like in chaining for example. This means that an open addressing hash table can fill up. One consequence is that its load factor can never exceed 1. So why would you want to use open addressing? There are no pointers to lists in open addressing, which means that an open addressing hash map can have more slots than a closed addressing one for the same amount of memory. This is important because more slots means fewer collisions on average.

But what do we do about collisions? To insert an element, we examine the hash table until we find an empty slot. However, instead of examining slots in order from 0 to n, which would have linear complexity, the sequence of slots examined depends on the key being inserted. This process is called probing. Instead of using a hash function that takes just the key, we create a new function, let's call it the probing function, that takes the key and the number of attempts so far.
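As a sketch, here is one simple probing function of that shape, together with insert and search routines that use it. It starts at the hashed slot and moves one slot forward per attempt; the table is a plain array of (key, value) pairs or None, and the names are illustrative:

```python
# Sketch of open addressing with a probing function that takes the key
# and the attempt number. Slots hold (key, value) pairs or None.

def probe(key, attempt, m):
    return (hash(key) + attempt) % m

def insert(table, key, value):
    for attempt in range(len(table)):
        slot = probe(key, attempt, len(table))
        if table[slot] is None or table[slot][0] == key:
            table[slot] = (key, value)
            return slot
    raise RuntimeError("hash table is full")

def search(table, key):
    for attempt in range(len(table)):
        slot = probe(key, attempt, len(table))
        if table[slot] is None:
            return None          # reached an empty slot: key is absent
        if table[slot][0] == key:
            return table[slot][1]
    return None                  # probed every slot without finding the key
```

Notice that both insert and search walk the same sequence of slots for a given key, which is what makes the scheme consistent.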
The idea is that the probing function returns a different result for each attempt, even if the key doesn't change. This property guarantees that probing will eventually return an index to an empty slot, assuming that the HashMap is not full. In mathematical terms, we require that the sequence returned by the probing function for a fixed key is a permutation of 0 to m-1, where m is the number of slots.

Let's look at an example. Let's define a probing function that starts with the slot returned by the original hash function and adds one for each attempt. We will use the same hash function as before. We can add Alice, Tim, and Bob to their slots because those slots are empty. Charlie should go to slot number one, but it's already taken, so we try the next slot. Same for John, but we keep trying until we find an empty slot. This probing function is called linear probing.

There are other probing techniques, such as quadratic probing and double hashing. Quadratic probing is better than linear probing, while double hashing offers one of the best methods available for open addressing. This is because the permutations it produces have many characteristics of a randomly chosen permutation. However, we need to choose the right parameters to ensure the probing sequence is a permutation, and this can sometimes be tricky.

Let's have a look at how double hashing works. Double hashing uses two hash functions to generate an index. The initial probe goes to the slot returned by the first hash function, and every successive probe position is offset by the second hash function. As a result, the elements are more uniformly distributed in the underlying array and we have fewer collisions. But how can we construct these functions so that they return a permutation? Well, it is sufficient that the second hash function always returns a number that is relatively prime to the size of the array.
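A sketch of double hashing under that relatively-prime condition: one easy way to satisfy it is to make the table size a prime number and have the second hash return a value between 1 and m-1. The hash functions below are illustrative stand-ins, not the ones from the video:

```python
import math

# Sketch of double hashing: the first probe goes to h1(key), and each
# subsequent probe is offset by h2(key). With M prime and h2 in 1..M-1,
# gcd(h2(key), M) == 1, so every probe sequence is a permutation.

M = 11  # a prime number of slots

def h1(key):
    return hash(key) % M

def h2(key):
    return 1 + (hash(key) % (M - 1))   # always in 1..M-1

def probe_sequence(key):
    return [(h1(key) + i * h2(key)) % M for i in range(M)]

# Each key's probe sequence visits every slot exactly once.
for key in ["alice", "bob", "tim"]:
    assert math.gcd(h2(key), M) == 1
    assert sorted(probe_sequence(key)) == list(range(M))
```

The assertions at the end check exactly the property the text asks for: for a fixed key, the sequence of probed slots is a permutation of 0 to m-1.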
This means that the greatest common divisor of the value returned by the second hash function and the size of the array is 1. A convenient way to satisfy this condition is to choose the size of the array to be a power of 2 and design the second hash function to always produce an odd number. Since the size is a power of 2, its only prime divisor is 2, so any odd number is relatively prime to the size.

Let's go back to our example with linear probing and look at how searching works. It is very similar to insertion. We start at the first slot returned by the probing function and keep looking until we find an element that matches the key we are looking for, or until we reach an empty slot. You are probably wondering what happens if we delete an element. Say that we delete Bob from the map. Wouldn't that break our searching algorithm if we were to search for John? Yes, it would. This is why deletion doesn't actually remove the element. It just marks it as deleted. The insertion algorithm is allowed to insert elements into empty slots or slots that are marked as deleted, while the search algorithm stops only at empty slots.

So should you be using open addressing or closed addressing? You should consider the use case that you have and the properties of both of these approaches and then make a decision for yourself. For example, closed addressing is simpler to implement than open addressing. However, closed addressing wastes some space because it needs to store buckets. So if you care about memory usage, open addressing can be a better choice in some cases. Furthermore, open addressing provides better cache performance because everything is stored in a contiguous part of memory. A good rule of thumb is to consider using open addressing when you know the keys and the frequency at which they appear. On the other hand, use closed addressing if you don't know much about the keys and their distribution.
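The lazy-deletion scheme described above can be sketched as follows, using linear probing for simplicity. The DELETED sentinel and function names are illustrative (a full implementation would also keep scanning for an existing key before reusing a DELETED slot, which this sketch skips):

```python
# Sketch of lazy deletion in open addressing with linear probing:
# delete() replaces an element with a DELETED marker rather than None,
# so searches keep probing past it; insert() may reuse such slots.

DELETED = object()  # sentinel marking a removed element

def _slots(table, key):
    m = len(table)
    return ((hash(key) + i) % m for i in range(m))

def insert(table, key, value):
    for slot in _slots(table, key):
        entry = table[slot]
        if entry is None or entry is DELETED or entry[0] == key:
            table[slot] = (key, value)
            return

def search(table, key):
    for slot in _slots(table, key):
        entry = table[slot]
        if entry is None:        # only a truly empty slot ends the search
            return None
        if entry is not DELETED and entry[0] == key:
            return entry[1]
    return None

def delete(table, key):
    for slot in _slots(table, key):
        entry = table[slot]
        if entry is None:
            return
        if entry is not DELETED and entry[0] == key:
            table[slot] = DELETED
            return
```

Because search only stops at None, an element inserted past Bob's old slot is still reachable after Bob is marked DELETED, which is exactly the problem this scheme solves.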
If you are in doubt, use closed addressing. Before we wrap up, let's talk about what makes a good hash function. The hash function that we've used in our examples is not great because some slots are more likely to be chosen than others, depending on the input. A good hash function ensures that each key is equally likely to hash to any of the slots. While we typically have no way to check this condition, many standard hash functions perform well in practice. Another important property is that a hash function should be fast to compute. This matters because the hash function is called on a key for every operation; if it were slow, it would significantly impact the performance of our hash map.

Thank you for watching this video! I hope you found it informative, entertaining, or both. If you enjoyed this content, please consider subscribing to my channel to stay up to date with my latest videos. Also, feel free to share it with your friends who may find it helpful or interesting. See you in the next video!