🔑

Understanding Hashing in Data Structures

Jun 4, 2025

Lecture 4: Hashing in Data Structures

Overview

  • Focus: Hashing, specifically in set data structures.
  • Comparison with previous topics: Sequence data structures vs. set data structures.

Recap on Set Data Structures

  • Last lecture reviewed two ways to implement a set:
    • Unsorted Array: Linear scan for operations.
    • Sorted Array: Faster find operations with O(log n) time.
    • Sorting methods: Two O(n^2) and one O(n log n).
  • Future topic: Building data structures faster.

Performance Considerations

  • Static find operations: O(log n) time is an improvement over linear.
  • Goal: Achieve faster than O(log n) time.
  • Model of Computation: Comparison model restricts operations to comparisons only.

Comparison Model

  • Focused on comparing keys:
    • Operations: Compare two keys (equal, less than, greater than, etc.).
    • Algorithms: Viewed as decision trees; each comparison branches the computation.
    • Binary tree representation:
      • Internal nodes = comparisons.
      • Leaves = outputs.

Decision Trees and Comparison Limitations

  • Minimum height for n+1 leaves in a binary tree is log n.
  • Conclusion: Need to improve beyond comparison model for faster find operations.

Direct Access Arrays

  • Utilize random access memory for constant time access:
    • Store items at index equal to their key.
    • Problems arise with large universe of keys (e.g., MIT IDs).

Hashing Introduction

  • Goal: Map a large space of keys to a smaller space (from U to m).
  • Hash Function h: Maps key from large space to smaller space.
  • Challenges: Collisions when multiple keys map to the same index.

Handling Collisions

  • Chaining:
    • Use pointers to a secondary data structure (e.g., linked list) to handle collisions.
    • Ensure average chain length is small to maintain efficiency.

Universal Hash Functions

  • Aim to choose hash functions that ensure even distribution of keys.
  • Universality Property:
    • Probability of two keys colliding is ≤ 1/m for all distinct pairs.

Expected Chain Length

  • Use probability to show expected chain length is constant for m = O(n):
    • Expected chain length calculation involves indicator random variables.
    • Shows 1 + (n-1)/m expected size for each chain.

Dynamic Hash Tables

  • Adjust m as n grows to maintain efficient performance:
    • Rebuild hash table when necessary to keep m linearly proportional to n.

Conclusion

  • Next lecture will cover faster sorting methods.