Systems Design: TinyURL and Pastebin
Jul 2, 2024
Lecture Notes on Systems Design: TinyURL and Pastebin
Introduction
Topic:
TinyURL and Pastebin Systems Design Interview Questions
Objective:
Design systems that generate unique short links and support basic click analytics.
Complexity Added:
Including analytics for number of clicks per link.
Core Problem: Generating Unique Short URLs
Challenge:
Ensuring uniqueness of short URLs while maintaining service speed.
Example:
TinyURL (e.g., mapping "google.com" to "tinyurl.com/SLABC").
Analytics Challenge:
Counting clicks per link accurately over time.
Performance Considerations
Assumption:
Support on the order of 1 trillion URLs; a median of 10,000 clicks per URL, with the most popular links reaching millions of clicks.
Storage Needs:
1 trillion short URLs at ~1 KB of data each comes to about 1 petabyte (checked in the sketch after this section).
Reads vs. Writes:
Optimizing for more reads than writes.
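As a quick back-of-the-envelope check, the storage estimate works out as follows (Python used purely as a calculator):

```python
urls = 10**12            # 1 trillion short URLs
bytes_per_url = 1_000    # ~1 KB of metadata per URL
print(urls * bytes_per_url / 10**15, "PB")  # -> 1.0 PB
```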
Link Generation Strategies
Monotonic Sequence:
Not practical: a globally incrementing counter requires locking or cross-server coordination.
Hashing Approach:
Distributes links evenly across the system.
Hash Function:
Combine long URL, user ID, and creation timestamp.
Character Set:
Using 0-9 and A-Z gives 36 choices per character.
Combination Calculations:
36^8 ≈ 2.8 trillion combinations, comfortably above the 1 trillion URL target.
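A minimal sketch of this generation scheme, assuming an MD5 hash and an 8-character base-36 key; the helper names and the `|` separator are illustrative, not from the lecture:

```python
import hashlib

ALPHABET = "0123456789ABCDEFGHIJKLMNOPQRSTUVWXYZ"  # the 36-character set above

def to_base36(n: int, length: int = 8) -> str:
    """Encode an integer as a fixed-length base-36 string."""
    chars = []
    for _ in range(length):
        n, r = divmod(n, 36)
        chars.append(ALPHABET[r])
    return "".join(reversed(chars))

def short_url_key(long_url: str, user_id: str, created_at: float) -> str:
    """Hash (long URL, user ID, creation timestamp) into an 8-character key."""
    digest = hashlib.md5(f"{long_url}|{user_id}|{created_at}".encode()).digest()
    return to_base36(int.from_bytes(digest, "big") % 36**8)

print(short_url_key("https://google.com", "user-42", 1719900000.0))
```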
Handling Hash Collisions
Options:
Chaining vs. Probing (probing preferred in databases).
Strategy:
If a hash results in a collision, increment and try again (e.g., 32fc1ca8 -> 32fc1ca9).
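A sketch of this probing strategy, with a plain dict standing in for the database; a real store would need an atomic insert-if-absent:

```python
ALPHABET = "0123456789ABCDEFGHIJKLMNOPQRSTUVWXYZ"

def to_base36(n: int, length: int = 8) -> str:
    chars = []
    for _ in range(length):
        n, r = divmod(n, 36)
        chars.append(ALPHABET[r])
    return "".join(reversed(chars))

def assign_key(store: dict, candidate: int, long_url: str, max_probes: int = 100) -> str:
    """Linear probing: if the candidate key is taken, increment and retry."""
    for _ in range(max_probes):
        key = to_base36(candidate)
        if key not in store:          # dict stands in for an atomic insert-if-absent
            store[key] = long_url
            return key
        candidate = (candidate + 1) % 36**8   # next key, wrapping around the key space
    raise RuntimeError("too many consecutive collisions")

store = {"000009IX": "https://already.taken"}
print(assign_key(store, 12345, "https://example.com"))  # 12345 hashes to a taken key -> "000009IY"
```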
Database Considerations
Replication:
Needed for fault tolerance and maximizing read performance.
Single-leader vs. Multi-leader:
Use single-leader to avoid write conflicts.
Caching:
Not feasible to use a write-back cache due to consistency issues.
Partitioning:
Distribute the load by partitioning on short URL range.
Consistent Hashing:
Minimizes re-distribution of keys when the cluster size changes.
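A minimal consistent-hashing sketch, assuming MD5-based ring positions and virtual nodes; node names are illustrative:

```python
import bisect
import hashlib

class ConsistentHashRing:
    """Minimal consistent hash ring: when a node is added or removed,
    only the keys adjacent to it on the ring change owners."""

    def __init__(self, nodes, vnodes: int = 100):
        # Each node gets `vnodes` virtual points on the ring for smoother balance.
        self.ring = sorted(
            (self._hash(f"{node}#{i}"), node)
            for node in nodes
            for i in range(vnodes)
        )
        self.points = [point for point, _ in self.ring]

    @staticmethod
    def _hash(s: str) -> int:
        return int.from_bytes(hashlib.md5(s.encode()).digest()[:8], "big")

    def node_for(self, key: str) -> str:
        # The first ring point clockwise from the key's hash owns the key.
        i = bisect.bisect(self.points, self._hash(key)) % len(self.points)
        return self.ring[i][1]

ring = ConsistentHashRing(["db-0", "db-1", "db-2"])
print(ring.node_for("SLABC"))  # partition responsible for this short URL
```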
Database Schema
Fields: Short URL, URL, User ID, Create Time, Expire Time, Number of Clicks.
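One way to render these fields as a table definition; column names and types here are assumptions, not from the lecture (smoke-tested against SQLite):

```python
import sqlite3

SCHEMA = """
CREATE TABLE short_urls (
    short_url   CHAR(8)  PRIMARY KEY,   -- 8-character base-36 key
    long_url    TEXT,                   -- NULL until the key is claimed
    user_id     BIGINT,
    create_time DATETIME,
    expire_time DATETIME,
    num_clicks  BIGINT DEFAULT 0
);
"""
sqlite3.connect(":memory:").executescript(SCHEMA)  # verify the DDL parses
```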
Primary Challenge:
Handling simultaneous writes to the same short URL key.
Predicate Locks:
Locks taken on a query predicate, covering rows that don't exist yet.
Materializing Conflicts:
Pre-write every possible key (~16 TB of data, which is feasible); claiming a key then becomes an ordinary row update rather than a predicate lock.
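A sketch of claiming a pre-materialized key, using SQLite as a stand-in for MySQL; the `long_url IS NULL` guard is what makes a predicate lock unnecessary:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE short_urls (short_url TEXT PRIMARY KEY, long_url TEXT, user_id INT)")
conn.execute("INSERT INTO short_urls (short_url) VALUES ('000009IX')")  # materialized, unclaimed

# Claiming is an ordinary row update on an existing row.
cur = conn.execute(
    "UPDATE short_urls SET long_url = ?, user_id = ? "
    "WHERE short_url = ? AND long_url IS NULL",
    ("https://example.com", 42, "000009IX"),
)
print(cur.rowcount)  # 1 on success; 0 means another writer claimed it first -> probe the next key
```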
Database Engines
No Range Queries:
Since lookups are by exact key only, hash indexes are an option, though disk-based indexes may still be preferred at this scale.
Choices:
LSM Tree (better for writes) vs. B-Tree (better for reads).
Preference:
B-Tree for handling more reads effectively.
Database of Choice:
MySQL (structured simplicity favored over NoSQL).
Read Performance Optimization
Replication:
Using follower replicas to distribute read load.
Partitioning:
More partitions = less load per partition.
Eventual Consistency:
Acceptable with application logic checks.
Caching Hot Links:
For links with high traffic.
Partitioned Cache:
Cache by short URL, use LRU eviction policy.
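A toy version of the eviction policy; in practice this role would likely be played by Redis configured with an LRU max-memory policy rather than hand-rolled code:

```python
from collections import OrderedDict

class LRUCache:
    """Tiny LRU cache for hot short URL -> long URL mappings."""

    def __init__(self, capacity: int):
        self.capacity = capacity
        self.data = OrderedDict()

    def get(self, key):
        if key not in self.data:
            return None                # cache miss: caller falls through to the DB
        self.data.move_to_end(key)     # mark as most recently used
        return self.data[key]

    def put(self, key, value):
        self.data[key] = value
        self.data.move_to_end(key)
        if len(self.data) > self.capacity:
            self.data.popitem(last=False)  # evict the least recently used entry

cache = LRUCache(capacity=2)
cache.put("SLABC", "https://google.com")
print(cache.get("SLABC"))
```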
Analytics Implementation
Stream Processing:
Avoid race conditions in click counting.
In-memory vs. Log-based Brokers:
In-memory brokers (e.g., Redis) vs. log-based brokers (e.g., Kafka); use Kafka for durability.
Choosing Stream Consumers:
Batch Jobs/Spark Streaming:
Spark Streaming favored for mini-batching.
Consistency and Idempotency:
Two-Phase Commit vs. Idempotency Key:
Favor idempotency keys carried with events through partitioned Kafka queues.
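A minimal sketch of idempotent click counting; in production the seen-ID set and counters would live in durable storage rather than process memory:

```python
# Each event carries a unique event ID; a consumer only counts the first delivery.
seen_event_ids = set()
click_counts = {}

def consume_click(event: dict) -> None:
    event_id = event["event_id"]      # idempotency key attached by the producer
    if event_id in seen_event_ids:
        return                        # duplicate delivery (e.g., consumer retry): skip
    seen_event_ids.add(event_id)
    key = event["short_url"]
    click_counts[key] = click_counts.get(key, 0) + 1

consume_click({"event_id": "e1", "short_url": "SLABC"})
consume_click({"event_id": "e1", "short_url": "SLABC"})  # replayed event: ignored
print(click_counts["SLABC"])  # -> 1, not 2
```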
Expired Links Handling
Batch Job:
A periodic batch job cleans up expired URLs; the lecture sketched pseudo-code for it.
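One plausible sketch of such a cleanup job, with SQLite standing in for the real database; deleting in small batches keeps each transaction short:

```python
import sqlite3
import time

def cleanup_expired(conn: sqlite3.Connection, batch_size: int = 1000) -> int:
    """Delete expired short URLs in batches; returns the total deleted."""
    total = 0
    while True:
        cur = conn.execute(
            "DELETE FROM short_urls WHERE short_url IN ("
            "  SELECT short_url FROM short_urls WHERE expire_time < ? LIMIT ?)",
            (time.time(), batch_size),
        )
        conn.commit()
        total += cur.rowcount
        if cur.rowcount < batch_size:   # last, partial batch: nothing more to delete
            return total

# Demo with an in-memory database:
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE short_urls (short_url TEXT PRIMARY KEY, expire_time REAL)")
conn.execute("INSERT INTO short_urls VALUES ('SLABC', 0)")  # already expired
print(cleanup_expired(conn))  # -> 1
```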
Special Considerations for Pastebin
Storage for Large Pastes:
Amazon S3 preferred over HDFS.
Delivery Speed:
Use CDN for caching static content.
Write Sequence for Data Integrity:
Write to the CDN, then S3, then the database, so that a visible metadata row always points at content that is already stored.
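A minimal sketch of this write order, with plain dicts standing in for the CDN, S3, and database clients (not a specific SDK):

```python
cdn, s3, db = {}, {}, {}  # stand-ins for real clients

def create_paste(paste_id: str, content: bytes) -> None:
    cdn[paste_id] = content                 # 1. warm the CDN cache
    s3[paste_id] = content                  # 2. durable blob storage
    db[paste_id] = {"blob_key": paste_id}   # 3. metadata last: any row a reader
                                            #    can see already has its blob stored

create_paste("paste-1", b"hello world")
```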
Integrated Design Overview
Load Balancer:
Distributes write load across URL assigning servers.
URL Assigning:
Data partitioned across multiple MySQL databases.
Caching Layer:
Redis caches, distributed and partitioned.
Reader Operations:
On a cache miss, reads fall through to the database; click events are published to Kafka.
Analytics Pipeline:
Kafka queues, partitioned streams consumed by Spark Streaming.
Coordination Service:
Zookeeper used for managing metadata and system coordination.
Conclusion
Summary:
Comprehensive system involving URL generation, data consistency, caching, analytics, expiration, and large data support.
Final Note:
Emphasis on balancing read and write paths while handling large-scale data robustly and efficiently.
Personal Touch:
The presenter mentioned technical difficulties and a lengthy recording effort.