Hashing Lecture Notes

Introduction

Topic: Hashing
Previous lecture: Set data structures (storing items by key, unsorted array, looking up in log n time).
Today: Discussing the set interface (dynamic find, insert, delete) and hashing for faster operations.

Aim: Can we do faster than log n?
Introduced new model: Comparison Model (only comparisons between keys are allowed).
Comparison Sort Algorithms:
- Based on key comparisons (equal, less than, greater than).
- Represented using a Decision Tree, where internal nodes are comparisons and leaves are outputs.
Goal: Minimum height of binary tree with n+1 leaves (output n items or not found) is log(n).

Store item in an array at the index of its key.
Performance: Constant time for find, insert, delete.
Problem: Prohibitively large space requirement if key range is large (e.g., MIT IDs).
To overcome: Use more space-efficient structures.

Concept: Map a large key space to a smaller array using a hash function.
Function: h(key) maps key to an array index (0 to m-1).
Problem: Collisions (multiple keys mapping to the same index).
Solutions:
- Open Addressing: Find another place in the array (complicated).
- Chaining: Store a list or other data structure at each array index that manages collisions.

Division Method: h(k) = k mod m.
- Works well if keys are uniformly distributed.
Universal Hash Function: More generalized approach to ensure better performance.
- Function: h_a,b(k) = ((a * k + b) mod p) mod m (with random a and b).
- Ensures key distribution with less probability of clustering.
Universality: Ensures probability of two keys colliding is ≤ 1/m.
- Implication: Expected chain length is constant (1 + (n-1)/m) if m ≈ n.

As the number of stored keys (n) grows, increase m dynamically.
Rebuild: Recreate hash table with larger m to maintain efficiency, similar to dynamic arrays.

Hashing allows efficient dynamic set operations even when keys are large integers.
Next lecture will focus on faster sorting techniques.