Systems Design: TinyURL and Pastebin
Jul 2, 2024
Lecture Notes on Systems Design: TinyURL and Pastebin
Introduction
Topic:
TinyURL and Pastebin Systems Design Interview Questions
Objective:
Design systems that generate unique short links and support basic click analytics.
Complexity Added:
Including analytics for number of clicks per link.
Core Problem: Generating Unique Short URLs
Challenge:
Ensuring uniqueness of short URLs while maintaining service speed.
Example:
TinyURL (e.g., mapping "google.com" to "tinyurl.com/SLABC").
Analytics Challenge:
Counting clicks per link accurately over time.
Performance Considerations
Assumption:
Support on the order of 1 trillion URLs; a median of 10,000 clicks per URL, with the most popular links reaching millions of clicks.
Storage Needs:
1 trillion short URLs at ~1 KB of data each comes to about 1 petabyte (checked in the sketch after this section).
Reads vs. Writes:
Optimizing for more reads than writes.
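As a quick back-of-the-envelope check, the storage estimate works out as follows (Python used purely as a calculator):

```python
urls = 10**12            # 1 trillion short URLs
bytes_per_url = 1_000    # ~1 KB of metadata per URL
print(urls * bytes_per_url / 10**15, "PB")  # -> 1.0 PB
```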
Link Generation Strategies
Monotonic Sequence:
Not practical: a globally incrementing counter requires locking or cross-server coordination.
Hashing Approach:
Distributes links evenly across the system.
Hash Function:
Combine long URL, user ID, and creation timestamp.
Character Set:
Using 0-9 and A-Z gives 36 choices per character.
Combination Calculations:
36^8 ≈ 2.8 trillion combinations, comfortably above the 1 trillion URL target.
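A minimal sketch of this generation scheme, assuming an MD5 hash and an 8-character base-36 key; the helper names and the `|` separator are illustrative, not from the lecture:

```python
import hashlib

ALPHABET = "0123456789ABCDEFGHIJKLMNOPQRSTUVWXYZ"  # the 36-character set above

def to_base36(n: int, length: int = 8) -> str:
    """Encode an integer as a fixed-length base-36 string."""
    chars = []
    for _ in range(length):
        n, r = divmod(n, 36)
        chars.append(ALPHABET[r])
    return "".join(reversed(chars))

def short_url_key(long_url: str, user_id: str, created_at: float) -> str:
    """Hash (long URL, user ID, creation timestamp) into an 8-character key."""
    digest = hashlib.md5(f"{long_url}|{user_id}|{created_at}".encode()).digest()
    return to_base36(int.from_bytes(digest, "big") % 36**8)

print(short_url_key("https://google.com", "user-42", 1719900000.0))
```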
Handling Hash Collisions
Options:
Chaining vs. Probing (probing preferred in databases).
Strategy:
If a hash results in a collision, increment and try again (e.g., 32fc1ca8 -> 32fc1ca9).
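A sketch of this probing strategy, with a plain dict standing in for the database; a real store would need an atomic insert-if-absent:

```python
ALPHABET = "0123456789ABCDEFGHIJKLMNOPQRSTUVWXYZ"

def to_base36(n: int, length: int = 8) -> str:
    chars = []
    for _ in range(length):
        n, r = divmod(n, 36)
        chars.append(ALPHABET[r])
    return "".join(reversed(chars))

def assign_key(store: dict, candidate: int, long_url: str, max_probes: int = 100) -> str:
    """Linear probing: if the candidate key is taken, increment and retry."""
    for _ in range(max_probes):
        key = to_base36(candidate)
        if key not in store:          # dict stands in for an atomic insert-if-absent
            store[key] = long_url
            return key
        candidate = (candidate + 1) % 36**8   # next key, wrapping around the key space
    raise RuntimeError("too many consecutive collisions")

store = {"000009IX": "https://already.taken"}
print(assign_key(store, 12345, "https://example.com"))  # 12345 hashes to a taken key -> "000009IY"
```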
Database Considerations
Replication:
Needed for fault tolerance and maximizing read performance.
Single-leader vs. Multi-leader:
Use single-leader to avoid write conflicts.
Caching:
Not feasible to use a write-back cache due to consistency issues.
Partitioning:
Distribute the load by partitioning on short URL range.
Consistent Hashing:
Minimizes re-distribution of keys when the cluster size changes.
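A minimal consistent-hashing sketch, assuming MD5-based ring positions and virtual nodes; node names are illustrative:

```python
import bisect
import hashlib

class ConsistentHashRing:
    """Minimal consistent hash ring: when a node is added or removed,
    only the keys adjacent to it on the ring change owners."""

    def __init__(self, nodes, vnodes: int = 100):
        # Each node gets `vnodes` virtual points on the ring for smoother balance.
        self.ring = sorted(
            (self._hash(f"{node}#{i}"), node)
            for node in nodes
            for i in range(vnodes)
        )
        self.points = [point for point, _ in self.ring]

    @staticmethod
    def _hash(s: str) -> int:
        return int.from_bytes(hashlib.md5(s.encode()).digest()[:8], "big")

    def node_for(self, key: str) -> str:
        # The first ring point clockwise from the key's hash owns the key.
        i = bisect.bisect(self.points, self._hash(key)) % len(self.points)
        return self.ring[i][1]

ring = ConsistentHashRing(["db-0", "db-1", "db-2"])
print(ring.node_for("SLABC"))  # partition responsible for this short URL
```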
Database Schema
Fields: Short URL, URL, User ID, Create Time, Expire Time, Number of Clicks.
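One way to render these fields as a table definition; column names and types here are assumptions, not from the lecture (smoke-tested against SQLite):

```python
import sqlite3

SCHEMA = """
CREATE TABLE short_urls (
    short_url   CHAR(8)  PRIMARY KEY,   -- 8-character base-36 key
    long_url    TEXT,                   -- NULL until the key is claimed
    user_id     BIGINT,
    create_time DATETIME,
    expire_time DATETIME,
    num_clicks  BIGINT DEFAULT 0
);
"""
sqlite3.connect(":memory:").executescript(SCHEMA)  # verify the DDL parses
```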
Primary Challenge:
Handling simultaneous writes to the same short URL key.
Predicate Locks:
Locks taken on a query predicate, covering rows that don't exist yet.
Materializing Conflicts:
Pre-write every possible key (~16 TB of data, which is feasible); claiming a key then becomes an ordinary row update rather than a predicate lock.
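A sketch of claiming a pre-materialized key, using SQLite as a stand-in for MySQL; the `long_url IS NULL` guard is what makes a predicate lock unnecessary:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE short_urls (short_url TEXT PRIMARY KEY, long_url TEXT, user_id INT)")
conn.execute("INSERT INTO short_urls (short_url) VALUES ('000009IX')")  # materialized, unclaimed

# Claiming is an ordinary row update on an existing row.
cur = conn.execute(
    "UPDATE short_urls SET long_url = ?, user_id = ? "
    "WHERE short_url = ? AND long_url IS NULL",
    ("https://example.com", 42, "000009IX"),
)
print(cur.rowcount)  # 1 on success; 0 means another writer claimed it first -> probe the next key
```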
Database Engines
No Range Queries:
Since lookups are by exact key only, hash indexes are an option, though disk-based indexes may still be preferred at this scale.
Choices:
LSM Tree (better for writes) vs. B-Tree (better for reads).
Preference:
B-Tree for handling more reads effectively.
Database of Choice:
MySQL (structured simplicity favored over NoSQL).
Read Performance Optimization
Replication:
Using follower replicas to distribute read load.
Partitioning:
More partitions = less load per partition.
Eventual Consistency:
Acceptable with application logic checks.
Caching Hot Links:
For links with high traffic.
Partitioned Cache:
Cache by short URL, use LRU eviction policy.
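A toy version of the eviction policy; in practice this role would likely be played by Redis configured with an LRU max-memory policy rather than hand-rolled code:

```python
from collections import OrderedDict

class LRUCache:
    """Tiny LRU cache for hot short URL -> long URL mappings."""

    def __init__(self, capacity: int):
        self.capacity = capacity
        self.data = OrderedDict()

    def get(self, key):
        if key not in self.data:
            return None                # cache miss: caller falls through to the DB
        self.data.move_to_end(key)     # mark as most recently used
        return self.data[key]

    def put(self, key, value):
        self.data[key] = value
        self.data.move_to_end(key)
        if len(self.data) > self.capacity:
            self.data.popitem(last=False)  # evict the least recently used entry

cache = LRUCache(capacity=2)
cache.put("SLABC", "https://google.com")
print(cache.get("SLABC"))
```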
Analytics Implementation
Stream Processing:
Avoid race conditions in click counting.
In-memory vs. Log-based Brokers:
In-memory brokers (e.g., Redis) vs. log-based brokers (e.g., Kafka); use Kafka for durability.
Choosing Stream Consumers:
Batch Jobs/Spark Streaming:
Spark Streaming favored for mini-batching.
Consistency and Idempotency:
Two-Phase Commit vs. Idempotency Key:
Favor idempotency keys carried with events through partitioned Kafka queues.
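A minimal sketch of idempotent click counting; in production the seen-ID set and counters would live in durable storage rather than process memory:

```python
# Each event carries a unique event ID; a consumer only counts the first delivery.
seen_event_ids = set()
click_counts = {}

def consume_click(event: dict) -> None:
    event_id = event["event_id"]      # idempotency key attached by the producer
    if event_id in seen_event_ids:
        return                        # duplicate delivery (e.g., consumer retry): skip
    seen_event_ids.add(event_id)
    key = event["short_url"]
    click_counts[key] = click_counts.get(key, 0) + 1

consume_click({"event_id": "e1", "short_url": "SLABC"})
consume_click({"event_id": "e1", "short_url": "SLABC"})  # replayed event: ignored
print(click_counts["SLABC"])  # -> 1, not 2
```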
Expired Links Handling
Batch Job:
A periodic batch job cleans up expired URLs; the lecture sketched pseudo-code for it.
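One plausible sketch of such a cleanup job, with SQLite standing in for the real database; deleting in small batches keeps each transaction short:

```python
import sqlite3
import time

def cleanup_expired(conn: sqlite3.Connection, batch_size: int = 1000) -> int:
    """Delete expired short URLs in batches; returns the total deleted."""
    total = 0
    while True:
        cur = conn.execute(
            "DELETE FROM short_urls WHERE short_url IN ("
            "  SELECT short_url FROM short_urls WHERE expire_time < ? LIMIT ?)",
            (time.time(), batch_size),
        )
        conn.commit()
        total += cur.rowcount
        if cur.rowcount < batch_size:   # last, partial batch: nothing more to delete
            return total

# Demo with an in-memory database:
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE short_urls (short_url TEXT PRIMARY KEY, expire_time REAL)")
conn.execute("INSERT INTO short_urls VALUES ('SLABC', 0)")  # already expired
print(cleanup_expired(conn))  # -> 1
```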
Special Considerations for Pastebin
Storage for Large Pastes:
Amazon S3 preferred over HDFS.
Delivery Speed:
Use CDN for caching static content.
Write Sequence for Data Integrity:
Write to the CDN, then S3, then the database, so that a visible metadata row always points at content that is already stored.
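A minimal sketch of this write order, with plain dicts standing in for the CDN, S3, and database clients (not a specific SDK):

```python
cdn, s3, db = {}, {}, {}  # stand-ins for real clients

def create_paste(paste_id: str, content: bytes) -> None:
    cdn[paste_id] = content                 # 1. warm the CDN cache
    s3[paste_id] = content                  # 2. durable blob storage
    db[paste_id] = {"blob_key": paste_id}   # 3. metadata last: any row a reader
                                            #    can see already has its blob stored

create_paste("paste-1", b"hello world")
```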
Integrated Design Overview
Load Balancer:
Distributes write load across URL assigning servers.
URL Assigning:
Data partitioned across multiple MySQL databases.
Caching Layer:
Redis caches, distributed and partitioned.
Reader Operations:
On a cache miss, reads fall through to the database; click events are published to Kafka.
Analytics Pipeline:
Kafka queues, partitioned streams consumed by Spark Streaming.
Coordination Service:
Zookeeper used for managing metadata and system coordination.
Conclusion
Summary:
Comprehensive system involving URL generation, data consistency, caching, analytics, expiration, and large data support.
Final Note:
Emphasis on balancing read and write paths while handling large-scale data robustly and efficiently.
Personal Touch:
The presenter mentioned technical difficulties and a lengthy recording effort.