Systems Design: Tiny URL and Pastebin

Jul 1, 2024

Systems Design: Tiny URL and Pastebin

Introduction

  • Topic: Systems design interview questions on Tiny URL and Pastebin.
  • Context: Creating short links for URLs and pastes.

Key Concepts

Problem Requirements

  • Generating Unique Short URLs: Ensure uniqueness is challenging and impacts service speed.
  • Analytics: Number of clicks per link; must ensure accurate data eventually.
  • Scale: Consider handling trillions of URLs and various sizes for pastes (from kilobytes to gigabytes).
  • Performance Context: More reads than writes, particularly for popular URLs.

Performance and Scalability

  • Partitioning: Necessary due to data scale.
  • Optimizing Reads: Focus on optimizing for more reads; reads will outnumber writes.

Link Generation

  • Avoid Monotonically Increasing Sequence: Uses locking, which slows down the process.
  • Using Hashing Functions: Distribute links evenly through hashing (include URL, user ID, timestamp).
  • Character Combinations: Estimation using base 36 for sufficient combinations (2 trillion for 8 characters).
  • Hash Collisions: Handle via probing (searching incrementally until a free slot is found).

Database Choices

Replication Strategy

  • Avoid Multi-Leader/Leaderless Replication: Can result in conflicts and inconsistencies (Last Write Wins example).
  • Single Leader Replication: More appropriate to avoid conflicts.

Caching

  • Right Back Cache: Not suitable due to consistency issues.
  • Right Through Cache: Slows down writes unnecessarily (most writes donโ€™t need to be cached).
  • Right Around Cache: Suitable for handling cache misses without slowing down writes.

Partitioning Scheme

  • Use Consistent Hashing: Reduces the need for redistribution of keys when changing cluster size.
  • Schema Example: User, URL, timestamps, number of clicks (use indexes to speed up queries).

Database Engine

  • LSM Trees vs. B-Trees: Prefer B-Trees for better read efficiency (despite higher write cost).
  • Recommended DB: MySQL favored for simplicity and support of B-Trees.

Read Performance Optimization

  • Caching: Utilize distributed, partitioned, and replicated cache for popular links (e.g., Redis instances).
  • Partition on Short URL: Ensures cache misses target the relevant partition.

Handling Hot Links

  • Caching Strategy: Scale out caches to handle high-traffic links.
  • Handling Read from Cache: Write results to cache and return to user on read through process.
  • Eviction Policy: Use LRU (Least Recently Used) to keep cache efficient.

Analytics

  • Stream Processing for Clicks: Avoid direct database writes to prevent race conditions.
  • Kafka for Logging: Durable, log-based message broker to handle incoming click events reliably.
  • Spark Streaming for Aggregation: Mini-batch click data and periodically update the database to reduce load.
  • idempotency Key: Ensure operations are not repeated during failovers.
  • Partition Events by Key: Ensure single consumer per short URL to avoid multiple locks.

Expiring Links

  • Batch Job for Expiration: Nightly job to clean up expired links without significant locking requirements.

Pastebin Specifics

  • Large Files Handling: Store large pastes in object storage (e.g., S3) rather than DB fields.
  • Use of CDN: Deliver static content efficiently using a CDN for cached content.
  • Write Protocol: Ensure data consistency by writing to CDN, then object storage, and finally database.

Overall Architecture

  • Writer and Reader Separation: Separate services for readability and scalability.
  • Replication and Partitioning: Optimize read and write efficiency using appropriate database and cache partitioning strategies.
  • Zookeeper for Coordination: Manage metadata about system components.

Conclusion

  • Key Takeaways: Building efficient systems to handle URL shortening and pastes involves careful design considerations regarding hashing, caching, partitioning, database selection, and stream processing.