🔑

Understanding Hashing in Data Structures

Jun 4, 2025

Lecture 4: Hashing in Data Structures

Overview

Focus: Hashing, specifically in set data structures.
Comparison with previous topics: Sequence data structures vs. set data structures.

Recap on Set Data Structures

Last lecture reviewed two ways to implement a set:
- Unsorted Array: Linear scan for operations.
- Sorted Array: Faster find operations with O(log n) time.
- Sorting methods: Two O(n^2) and one O(n log n).
Future topic: Building data structures faster.

Performance Considerations

Static find operations: O(log n) time is an improvement over linear.
Goal: Achieve faster than O(log n) time.
Model of Computation: Comparison model restricts operations to comparisons only.

Comparison Model

Focused on comparing keys:
- Operations: Compare two keys (equal, less than, greater than, etc.).
- Algorithms: Viewed as decision trees; each comparison branches the computation.
- Binary tree representation:
  - Internal nodes = comparisons.
  - Leaves = outputs.

Decision Trees and Comparison Limitations

Minimum height for n+1 leaves in a binary tree is log n.
Conclusion: Need to improve beyond comparison model for faster find operations.

Direct Access Arrays

Utilize random access memory for constant time access:
- Store items at index equal to their key.
- Problems arise with large universe of keys (e.g., MIT IDs).

Hashing Introduction

Goal: Map a large space of keys to a smaller space (from U to m).
Hash Function h: Maps key from large space to smaller space.
Challenges: Collisions when multiple keys map to the same index.

Handling Collisions

Chaining:
- Use pointers to a secondary data structure (e.g., linked list) to handle collisions.
- Ensure average chain length is small to maintain efficiency.

Universal Hash Functions

Aim to choose hash functions that ensure even distribution of keys.
Universality Property:
- Probability of two keys colliding is ≤ 1/m for all distinct pairs.

Expected Chain Length

Use probability to show expected chain length is constant for m = O(n):
- Expected chain length calculation involves indicator random variables.
- Shows 1 + (n-1)/m expected size for each chain.

Dynamic Hash Tables

Adjust m as n grows to maintain efficient performance:
- Rebuild hash table when necessary to keep m linearly proportional to n.

Conclusion

Next lecture will cover faster sorting methods.

Full transcript