Systems Design: TinyURL and Pastebin

Jul 2, 2024

Lecture Notes on Systems Design: TinyURL and Pastebin

Introduction

  • Topic: TinyURL and Pastebin Systems Design Interview Questions
  • Objective: Design a system that generates unique short links and supports basic per-link analytics.
  • Complexity Added: Including analytics for number of clicks per link.

Core Problem: Generating Unique Short URLs

  • Challenge: Ensuring uniqueness of short URLs while maintaining service speed.
  • Example: TinyURL (e.g., mapping "google.com" to "tinyurl.com/SLABC").
  • Analytics Challenge: Counting clicks per link accurately over time.

Performance Considerations

  • Assumptions: Support on the order of a trillion URLs, with a median of ~10,000 clicks per URL and the most popular links reaching millions of clicks.
  • Storage Needs: 1 trillion short URLs at ~1 KB of metadata each comes to ~1 PB (see the back-of-envelope below).
  • Reads vs. Writes: Reads far outnumber writes, so optimize the read path.
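
A quick back-of-envelope check of the storage figure, using the lecture's assumptions:

```python
# Back-of-envelope storage estimate.
NUM_URLS = 10**12        # one trillion short URLs
BYTES_PER_URL = 1_000    # ~1 KB of metadata per URL

total_bytes = NUM_URLS * BYTES_PER_URL
print(total_bytes / 10**15, "PB")   # -> 1.0 PB
```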

Link Generation Strategies

  • Monotonic Sequence: Not practical at this scale; a single global counter requires locking or coordination on every write.
  • Hashing Approach: Distributes links evenly across the system (see the sketch after this list).
    • Hash Function: Hash the combination of long URL, user ID, and creation timestamp.
    • Character Set: Using 0-9 and A-Z gives 36 choices per character.
    • Combination Calculations: 36^8 ≈ 2.8 trillion possible 8-character keys, comfortably above the trillion-URL target.
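
A minimal sketch of this scheme in Python. SHA-256 and `time.time_ns()` are illustrative stand-ins; the lecture only requires some hash over the long URL, user ID, and creation time:

```python
import hashlib
import time

ALPHABET = "0123456789ABCDEFGHIJKLMNOPQRSTUVWXYZ"  # 36 characters

def short_url_key(long_url: str, user_id: str, length: int = 8) -> str:
    """Hash (long URL, user ID, creation timestamp) into an 8-char base-36 key."""
    raw = f"{long_url}|{user_id}|{time.time_ns()}".encode()
    digest = int.from_bytes(hashlib.sha256(raw).digest(), "big")
    chars = []
    for _ in range(length):
        digest, idx = divmod(digest, len(ALPHABET))
        chars.append(ALPHABET[idx])
    return "".join(chars)

print(short_url_key("https://google.com", "user-42"))  # e.g. 'K3F9A0ZQ'
```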

Handling Hash Collisions

  • Options: Chaining vs. Probing (probing preferred in databases).
  • Strategy: If a hash results in a collision, increment the key and try again (e.g., 32FC1CA8 -> 32FC1CA9), as sketched below.
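
A sketch of the probing strategy, with a plain dict standing in for the URL table; a real system would need an atomic insert-if-absent against the database instead:

```python
ALPHABET = "0123456789ABCDEFGHIJKLMNOPQRSTUVWXYZ"

def increment_key(key: str) -> str:
    """Advance a base-36 key by one, with carry (e.g. 32FC1CA8 -> 32FC1CA9)."""
    digits = [ALPHABET.index(c) for c in key]
    i = len(digits) - 1
    while i >= 0:
        digits[i] = (digits[i] + 1) % len(ALPHABET)
        if digits[i] != 0:   # no carry past this position
            break
        i -= 1
    return "".join(ALPHABET[d] for d in digits)

def assign_key(candidate: str, table: dict, long_url: str) -> str:
    """Probe forward from the hashed candidate until a free slot is found."""
    key = candidate
    while key in table:      # collision: try the next key
        key = increment_key(key)
    table[key] = long_url
    return key
```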

Database Considerations

  • Replication: Needed for fault tolerance and maximizing read performance.
    • Single-leader vs. Multi-leader: Use single-leader to avoid write conflicts.
  • Caching: Not feasible to use a write-back cache due to consistency issues.
  • Partitioning: Distribute the load by partitioning on short URL range.
    • Consistent Hashing: To minimize re-distribution of keys when cluster size changes.
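
A toy consistent-hash ring (node names and virtual-node count are illustrative) showing how short URLs map to partitions:

```python
import bisect
import hashlib

class ConsistentHashRing:
    """Adding or removing a node only remaps the keys nearest to it."""

    def __init__(self, nodes, vnodes: int = 100):
        self._ring = []                       # sorted (hash, node) points
        for node in nodes:
            for i in range(vnodes):           # virtual nodes smooth the load
                bisect.insort(self._ring, (self._hash(f"{node}#{i}"), node))

    @staticmethod
    def _hash(key: str) -> int:
        return int.from_bytes(hashlib.md5(key.encode()).digest()[:8], "big")

    def node_for(self, short_url: str) -> str:
        """Walk clockwise from the key's hash to the next node on the ring."""
        i = bisect.bisect_left(self._ring, (self._hash(short_url),))
        return self._ring[i % len(self._ring)][1]

ring = ConsistentHashRing(["db-0", "db-1", "db-2"])
print(ring.node_for("32FC1CA8"))  # e.g. 'db-1'
```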

Database Schema

  • Fields: Short URL, URL, User ID, Create Time, Expire Time, Number of Clicks.
  • Primary Challenge: Handling simultaneous writes to the same short URL key.
    • Predicate Locks: Locks on rows that don't exist yet.
    • Materializing Conflicts: Pre-writing an empty row for every possible key (36^8 keys × 8 bytes ≈ 23 TB, which is feasible); writers then claim rows instead of inserting them.
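
A sketch of materializing the conflict, using sqlite3 as a stand-in for MySQL. Every candidate key's row is pre-written empty, and a writer claims one with a conditional update rather than an insert, so two writers racing on the same key cannot both succeed:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("""CREATE TABLE urls (
    short_url   TEXT PRIMARY KEY,
    long_url    TEXT,
    user_id     TEXT,
    create_time INTEGER,
    expire_time INTEGER,
    clicks      INTEGER DEFAULT 0)""")

# In production, a row for every possible key would be pre-written.
conn.execute("INSERT INTO urls (short_url) VALUES ('32FC1CA8')")

def claim(short_url: str, long_url: str, user_id: str) -> bool:
    """Atomically claim a pre-materialized key; False means a collision."""
    cur = conn.execute(
        "UPDATE urls SET long_url = ?, user_id = ? "
        "WHERE short_url = ? AND long_url IS NULL",
        (long_url, user_id, short_url))
    conn.commit()
    return cur.rowcount == 1   # 0 rows updated: someone else got there first

print(claim("32FC1CA8", "https://google.com", "user-42"))  # True
print(claim("32FC1CA8", "https://bing.com", "user-99"))    # False
```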

Database Engines

  • No Range Queries: Lookups are single-key point reads, so hash indexes are possible, though a disk-based index may still be preferred at this scale.
  • Choices: LSM Tree (better for writes) vs. B-Tree (better for reads).
    • Preference: B-Tree for handling more reads effectively.
    • Database of Choice: MySQL (structured simplicity favored over NoSQL).

Read Performance Optimization

  • Replication: Using follower replicas to distribute read load.
  • Partitioning: More partitions = less load per partition.
  • Eventual Consistency: Acceptable with application logic checks.
  • **Caching Hot Links:** For links with high traffic.
    • Partitioned Cache: Cache by short URL, use LRU eviction policy.
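
In production this layer would be Redis; a small in-process sketch shows the intended LRU semantics and the partition-by-short-URL routing (capacity and node count are made up):

```python
from collections import OrderedDict

class LRUCache:
    def __init__(self, capacity: int):
        self.capacity = capacity
        self._data = OrderedDict()

    def get(self, short_url: str):
        if short_url not in self._data:
            return None                     # miss: caller falls back to the DB
        self._data.move_to_end(short_url)   # mark as most recently used
        return self._data[short_url]

    def put(self, short_url: str, long_url: str):
        self._data[short_url] = long_url
        self._data.move_to_end(short_url)
        if len(self._data) > self.capacity:
            self._data.popitem(last=False)  # evict the least recently used

# Route each short URL to one cache partition. A stable hash (not Python's
# salted built-in) would be used in practice so routing survives restarts.
caches = [LRUCache(10_000) for _ in range(4)]
def cache_for(short_url: str) -> LRUCache:
    return caches[hash(short_url) % len(caches)]
```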

Analytics Implementation

  • Stream Processing: Avoid race conditions in click counting.
    • In-memory brokers (e.g., Redis) vs. Log-based brokers (e.g., Kafka): Use Kafka for durability.
  • **Choosing Stream Consumers:**
    • Batch Jobs/Spark Streaming: Spark Streaming favored for mini-batching.
  • **Consistency and Idempotency:**
    • Two-Phase Commit vs. Idempotency Key: Favor idempotency keys over two-phase commit, with click events stored in partitioned Kafka queues.
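
A sketch of idempotent click counting: each event carries a unique ID, so at-least-once redeliveries from Kafka are detected and skipped. Here the seen-set is in memory; in practice it would live in the consumer's checkpointed state, scoped per partition:

```python
from collections import defaultdict

clicks = defaultdict(int)   # short_url -> click count
seen = set()                # idempotency keys already processed

def consume(event: dict) -> None:
    """Apply a click event without double counting on redelivery."""
    if event["event_id"] in seen:   # duplicate delivery: skip
        return
    seen.add(event["event_id"])
    clicks[event["short_url"]] += 1

# Redelivering the same event leaves the count unchanged.
consume({"event_id": "e1", "short_url": "32FC1CA8"})
consume({"event_id": "e1", "short_url": "32FC1CA8"})
print(clicks["32FC1CA8"])   # 1
```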

Expired Links Handling

  • Batch Job: Periodically clean up expired URLs; a sketch of the implementation follows.
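
A minimal sketch of that cleanup job against the sqlite3 stand-in schema from the schema section (deleting in small batches keeps each transaction short). If keys are pre-materialized, the job would instead null out the row's fields so the key can be reused:

```python
import sqlite3
import time

def purge_expired(conn: sqlite3.Connection, batch_size: int = 10_000) -> None:
    """Delete expired rows in batches until none remain."""
    now = int(time.time())
    while True:
        cur = conn.execute(
            "DELETE FROM urls WHERE short_url IN ("
            "  SELECT short_url FROM urls"
            "  WHERE expire_time IS NOT NULL AND expire_time < ?"
            "  LIMIT ?)",
            (now, batch_size))
        conn.commit()
        if cur.rowcount == 0:   # nothing left to purge
            break
```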

Special Considerations for Pastebin

  • Storage for Large Pastes: Amazon S3 preferred over HDFS.
  • Delivery Speed: Use CDN for caching static content.
  • Write Sequence for Data Integrity: Write to the CDN, then S3, then the database, so the content already exists everywhere by the time the link becomes resolvable.
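
A sketch of that ordering with in-memory dicts standing in for the CDN, S3, and the database (none of these are real client APIs); the point is that the metadata row is written last:

```python
CDN, S3, DB = {}, {}, set()      # in-memory stand-ins for the real services

def create_paste(paste_id: str, content: bytes) -> None:
    """Blob stores first, metadata row last: a reader who finds the row
    can always fetch the content, so there are no dangling links."""
    CDN[paste_id] = content      # 1. warm the edge cache
    S3[paste_id] = content       # 2. durable blob storage
    DB.add(paste_id)             # 3. only now does the link resolve

def read_paste(paste_id: str):
    if paste_id not in DB:       # no metadata row: paste not visible yet
        return None
    return CDN.get(paste_id, S3[paste_id])
```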

Integrated Design Overview

  • Load Balancer: Distributes write load across URL assigning servers.
  • URL Assigning: Data partitioned across multiple MySQL databases.
  • Caching Layer: Redis caches, distributed and partitioned.
  • Reader Operations: Cache miss redirects to database, with click data sent to Kafka.
  • Analytics Pipeline: Kafka queues, partitioned streams consumed by Spark Streaming.
  • Coordination Service: Zookeeper used for managing metadata and system coordination.

Conclusion

  • Summary: Comprehensive system involving URL generation, data consistency, caching, analytics, expiration, and large data support.
  • Final Note: Emphasis on balancing reads vs. writes with robustness in handling large-scale data efficiently.
  • Personal Note: The presenter ran into technical issues and a long recording session.