Systems Design: Tiny URL and Pastebin

Introduction

Generating Unique Short URLs: Ensuring uniqueness is challenging and affects service speed.
Analytics: Track the number of clicks per link; the counts must eventually be accurate.
Scale: Consider handling trillions of URLs and various sizes for pastes (from kilobytes to gigabytes).
Performance Context: More reads than writes, particularly for popular URLs.

Partitioning: Necessary due to data scale.
Optimizing Reads: Prioritize read performance, since reads will far outnumber writes.

Avoid Monotonically Increasing Sequence: A shared counter requires locking on every write, which slows down link creation.
Using Hashing Functions: Distribute links evenly through hashing (include URL, user ID, timestamp).
Character Combinations: Estimation using base 36 for sufficient combinations (36^8, about 2.8 trillion, for 8 characters).
Hash Collisions: Handle via probing (searching incrementally until a free slot is found).
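
The key-generation points above can be sketched in a few lines of Python. This is a minimal illustration, not the design's actual implementation: the function names, the choice of SHA-256, and the in-memory `taken` dictionary (standing in for a database table with a unique constraint on the short key) are all assumptions.

```python
import hashlib
import time

ALPHABET = "0123456789abcdefghijklmnopqrstuvwxyz"  # base 36
KEY_LENGTH = 8
KEY_SPACE = 36 ** KEY_LENGTH  # 2,821,109,907,456 possible keys


def to_base36(n: int, length: int = KEY_LENGTH) -> str:
    """Encode a non-negative integer as a fixed-width base-36 string."""
    digits = []
    for _ in range(length):
        n, rem = divmod(n, 36)
        digits.append(ALPHABET[rem])
    return "".join(reversed(digits))


def candidate_key(url: str, user_id: str, timestamp: float) -> int:
    """Hash the URL, user ID, and timestamp into the key space."""
    payload = f"{url}|{user_id}|{timestamp}".encode("utf-8")
    digest = hashlib.sha256(payload).digest()
    return int.from_bytes(digest[:8], "big") % KEY_SPACE


def assign_short_key(url, user_id, taken, timestamp=None):
    """Linear probing: step to the next key until a free one is found.

    `taken` is a hypothetical stand-in for the links table.
    """
    if timestamp is None:
        timestamp = time.time()
    n = candidate_key(url, user_id, timestamp)
    while to_base36(n) in taken:
        n = (n + 1) % KEY_SPACE
    key = to_base36(n)
    taken[key] = url
    return key
```

In a real database the "is this key free?" check and the insert would need to happen atomically, for example by attempting the insert and probing again if the unique constraint rejects it.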

Avoid Multi-Leader/Leaderless Replication: Can result in conflicts and inconsistencies (Last Write Wins example).
Single Leader Replication: More appropriate to avoid conflicts.

Write-Back Cache: Not suitable due to consistency issues.
Write-Through Cache: Slows down writes unnecessarily (most writes don’t need to be cached).
Write-Around Cache: Suitable; writes go straight to the database, and the cache is filled only on a read miss, so writes are not slowed down.
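
A write-around flow might look like the sketch below. The class name and the two dictionaries (standing in for the database and the cache) are assumptions made for illustration only.

```python
class WriteAroundStore:
    """Writes bypass the cache; the cache is filled only on read misses."""

    def __init__(self):
        self.db = {}       # stand-in for the database
        self.cache = {}    # stand-in for the cache tier
        self.db_reads = 0  # counts how often a read fell through to the DB

    def write(self, short_key, long_url):
        # Write goes to the database only; drop any stale cached copy.
        self.db[short_key] = long_url
        self.cache.pop(short_key, None)

    def read(self, short_key):
        if short_key in self.cache:
            return self.cache[short_key]
        # Cache miss: read from the database, then populate the cache.
        self.db_reads += 1
        value = self.db.get(short_key)
        if value is not None:
            self.cache[short_key] = value
        return value
```

The first read of a new link pays the database cost; later reads of the same link are served from the cache.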

Use Consistent Hashing: Reduces the need for redistribution of keys when changing cluster size.
Schema Example: User, URL, timestamps, number of clicks (use indexes to speed up queries).
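
To illustrate why consistent hashing limits key movement, here is a small hash-ring sketch. The class, the use of MD5 as a stable hash, the virtual-node count, and the node names are all illustrative assumptions.

```python
import bisect
import hashlib


def stable_hash(value: str) -> int:
    """Process-independent hash (Python's built-in hash() is randomized)."""
    digest = hashlib.md5(value.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big")


class HashRing:
    def __init__(self, nodes, vnodes=100):
        self.vnodes = vnodes
        self._ring = []  # sorted list of (position, node) pairs
        for node in nodes:
            self.add_node(node)

    def add_node(self, node):
        # Each node owns several points on the ring to even out the load.
        for i in range(self.vnodes):
            bisect.insort(self._ring, (stable_hash(f"{node}#{i}"), node))

    def node_for(self, key):
        # The first ring point at or after the key's position owns the key;
        # wrap around to the start if the key hashes past the last point.
        idx = bisect.bisect_left(self._ring, (stable_hash(key),))
        return self._ring[idx % len(self._ring)][1]
```

When a node is added, only the keys that now fall just before one of its ring points change owner, and they all move to the new node; every other key stays where it was.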

LSM Trees vs. B-Trees: Prefer B-Trees for better read efficiency (despite higher write cost).
Recommended DB: MySQL favored for simplicity and support of B-Trees.

Caching: Utilize distributed, partitioned, and replicated cache for popular links (e.g., Redis instances).
Partition on Short URL: Ensures cache misses target the relevant partition.

Caching Strategy: Scale out caches to handle high-traffic links.
Handling Cache Misses: On a miss, read the link from the database, write the result to the cache, and return it to the user.
Eviction Policy: Use LRU (Least Recently Used) to keep cache efficient.
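
The miss handling and LRU eviction described above could be sketched as follows. The class name and the `load_from_db` callback are hypothetical; a production cache such as Redis provides its own eviction settings.

```python
from collections import OrderedDict


class LRUCache:
    def __init__(self, capacity, load_from_db):
        self.capacity = capacity
        self._load = load_from_db    # hypothetical database lookup callback
        self._items = OrderedDict()  # oldest entry first, newest last

    def get(self, key):
        if key in self._items:
            # Hit: mark the entry as most recently used.
            self._items.move_to_end(key)
            return self._items[key]
        # Miss: load from the database and cache the result.
        value = self._load(key)
        if value is None:
            return None
        self._items[key] = value
        if len(self._items) > self.capacity:
            # Evict the least recently used entry.
            self._items.popitem(last=False)
        return value

    def cached_keys(self):
        return list(self._items)
```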

Stream Processing for Clicks: Avoid direct database writes to prevent race conditions.
Kafka for Logging: Durable, log-based message broker to handle incoming click events reliably.
Spark Streaming for Aggregation: Mini-batch click data and periodically update the database to reduce load.
Idempotency Key: Ensure a batch of clicks is not applied twice when a consumer fails over and retries.
Partition Events by Key: Ensure a single consumer per short URL so that updates to the same link do not contend for locks.
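
The aggregation steps above can be sketched without any Kafka or Spark dependency. Everything here is an assumption for illustration: the function and class names, the CRC32 partitioner, and the in-memory counter and set standing in for database tables.

```python
import zlib
from collections import Counter


def partition_for(short_key: str, num_partitions: int) -> int:
    """Route every click for a given short URL to the same partition."""
    return zlib.crc32(short_key.encode("utf-8")) % num_partitions


class ClickAggregator:
    """Applies mini-batches of click events at most once each."""

    def __init__(self):
        self.click_counts = Counter()  # stand-in for the clicks column
        self.applied_batches = set()   # idempotency keys already applied

    def apply_batch(self, batch_id, short_keys):
        # A retried batch carries the same ID and is skipped.
        if batch_id in self.applied_batches:
            return False
        self.click_counts.update(short_keys)
        self.applied_batches.add(batch_id)
        return True
```

In a real database, the count update and the recording of the idempotency key would need to commit in the same transaction; otherwise a crash between the two steps could still double-count.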

Batch Job for Expiration: Nightly job to clean up expired links without significant locking requirements.
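
One way to keep such a cleanup job light on locking is to delete in small batches. The sketch below uses SQLite purely as a stand-in; the `links` table and its `short_key` and `expires_at` columns are hypothetical names, not the schema from these notes.

```python
import sqlite3


def delete_expired(conn, now, batch_size=1000):
    """Delete expired links in small batches; return the total removed."""
    total = 0
    while True:
        cur = conn.execute(
            "DELETE FROM links WHERE short_key IN ("
            "SELECT short_key FROM links WHERE expires_at <= ? LIMIT ?)",
            (now, batch_size),
        )
        conn.commit()  # commit per batch so no single transaction runs long
        if cur.rowcount == 0:
            break
        total += cur.rowcount
    return total
```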

Large Files Handling: Store large pastes in object storage (e.g., S3) rather than DB fields.
Use of CDN: Deliver static content efficiently using a CDN for cached content.
Write Protocol: Ensure data consistency by writing to CDN, then object storage, and finally database.

Writer and Reader Separation: Run writes and reads as separate services so that each path stays simple and can scale independently.
Replication and Partitioning: Optimize read and write efficiency using appropriate database and cache partitioning strategies.
ZooKeeper for Coordination: Manage metadata about system components.

Key Takeaways: Building an efficient URL shortener or pastebin requires careful choices about hashing, caching, partitioning, database selection, and stream processing.