Overview
This lecture covers the high-level system design for an e-commerce platform like Amazon, focusing on the products database, shopping cart service, order processing, and optimizing read performance for scalability and speed.
System Requirements & Capacity Estimates
- Focus is on product catalog, cart management, inventory, and fast/scalable reads; excludes reviews, payments, AWS, and support features.
- Assume up to 1 billion products (~1PB data, including images).
- Estimated about 10 million orders/day (≈115 orders/sec on average).
- Requires sharding and caching due to data volume and performance needs.
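The estimates above can be sanity-checked with quick back-of-envelope arithmetic; the per-product size below is an assumption chosen to match the lecture's ~1 PB figure.

```python
# Back-of-envelope capacity estimates (totals from the lecture; per-product
# size is an assumption that makes the numbers line up).
PRODUCTS = 1_000_000_000        # ~1 billion products
DATA_PER_PRODUCT_KB = 1_000     # ~1 MB each, including images (assumed)
ORDERS_PER_DAY = 10_000_000

total_pb = PRODUCTS * DATA_PER_PRODUCT_KB / 1e12  # 1 PB = 1e12 KB
orders_per_sec = ORDERS_PER_DAY / 86_400          # seconds per day

print(f"~{total_pb:.0f} PB of product data")      # -> ~1 PB
print(f"~{orders_per_sec:.0f} orders/sec average")  # -> ~116 orders/sec
```

At ~1 PB, the catalog cannot fit on one machine, which is what forces sharding; the order rate alone is modest, but peaks will be far above the daily average.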
Product Database Design
- Read throughput is prioritized; writes by admins/sellers are infrequent.
- Single-leader replication is sufficient; B-tree indexing provides fast reads.
- NoSQL (MongoDB) chosen for flexible schema (products vary widely); data stored as JSON documents.
- Images stored in S3, with CDN for fast, global access.
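The flexible-schema point is easiest to see with two concrete documents; the field names below are illustrative, not from the lecture.

```python
# Two product documents with different shapes -- a document store lets each
# category carry its own attributes without schema migrations.
laptop = {
    "_id": "prod-123",
    "name": "UltraBook 14",
    "category": "laptops",
    "attributes": {"cpu_cores": 8, "ram_gb": 16},
    # Image bytes live in S3 behind a CDN; the document stores only URLs.
    "image_urls": ["https://cdn.example.com/prod-123/main.jpg"],
}
tshirt = {
    "_id": "prod-456",
    "name": "Plain Tee",
    "category": "apparel",
    "attributes": {"sizes": ["S", "M", "L"], "material": "cotton"},
}
# With pymongo this would be roughly: db.products.insert_many([laptop, tshirt])
```

Note the documents share only a few core fields; everything category-specific lives under `attributes`, which a rigid relational schema would handle awkwardly.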
Shopping Cart Service Design
- Avoid store-and-overwrite or simple-list approaches, which risk lost updates under concurrent writes.
- Optimal model: represent cart as a set (table with cart ID and product ID) to minimize locking.
- Use CRDTs (Conflict-Free Replicated Data Types), specifically observed-remove sets, to achieve eventual consistency across multiple leaders and avoid lost updates.
- Reads and writes are pinned per user session (read-your-writes consistency) so users always see their own most recent cart changes.
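A minimal sketch of an observed-remove set, assuming the cart stores product IDs. The key idea: each add gets a unique tag, and a remove deletes only the tags it has observed, so a concurrent re-add on another replica survives the merge.

```python
import uuid

class ORSet:
    """Minimal observed-remove set (OR-set) sketch for a shopping cart."""

    def __init__(self):
        self.adds = {}      # element -> set of unique add-tags
        self.removes = {}   # element -> set of tags removed so far

    def add(self, element):
        # Every add gets a fresh tag, so re-adds are distinguishable.
        self.adds.setdefault(element, set()).add(uuid.uuid4().hex)

    def remove(self, element):
        # Remove only the tags observed locally at remove time.
        observed = self.adds.get(element, set())
        self.removes.setdefault(element, set()).update(observed)

    def contains(self, element):
        live = self.adds.get(element, set()) - self.removes.get(element, set())
        return bool(live)

    def merge(self, other):
        # Merging is a per-element union of both tag sets; order-independent.
        for e, tags in other.adds.items():
            self.adds.setdefault(e, set()).update(tags)
        for e, tags in other.removes.items():
            self.removes.setdefault(e, set()).update(tags)
```

If replica A removes an item while replica B concurrently re-adds it, the re-add carries a new tag that A's remove never observed, so after merging the item is still in the cart: the "add wins" behavior you usually want for carts.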
Order Processing & Inventory Management
- Each product has a stock counter; decrements on order require locking to prevent race conditions.
- To scale, use Kafka for order/event buffering and Flink for stream processing.
- Orders are placed into a single Kafka topic to preserve a consistent order of events, then split into per-product partitions for inventory checks.
- Inventory updates from sellers flow into Kafka and are reflected in real-time inventory via Flink.
- Email notifications are sent after processing; since sending may not be idempotent, two-phase commit can guard against duplicate sends.
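The race condition that motivates locking the stock counter can be shown in a single-process sketch (the lecture's point is that this lock becomes a bottleneck, which is why orders are partitioned through Kafka instead). All names here are illustrative.

```python
import threading

class Inventory:
    """Per-product stock counters guarded by a lock.

    The check-then-decrement must be atomic: without the lock, two
    concurrent orders could both see stock == 1 and both succeed,
    overselling the product.
    """

    def __init__(self):
        self._stock = {}
        self._lock = threading.Lock()

    def restock(self, product_id, qty):
        with self._lock:
            self._stock[product_id] = self._stock.get(product_id, 0) + qty

    def try_order(self, product_id, qty=1):
        with self._lock:
            if self._stock.get(product_id, 0) >= qty:
                self._stock[product_id] -= qty
                return True  # order accepted
            return False     # out of stock
```

Hammering this with 20 concurrent order threads against a stock of 5 yields exactly 5 successes. A streaming design replaces the shared lock with single-threaded processing per product partition, achieving the same atomicity without contention.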
Optimizing Read Performance
- Pre-cache popular products/data using Redis; cache images with CDN.
- Identify popular products with Spark Streaming, aggregating order/click metrics daily and exporting to HDFS.
- Batch job ranks products by popularity per category for cache prepopulation.
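The ranking step reduces to a weighted group-by-category aggregation. This is a plain-Python sketch of the logic a Spark batch job would run at scale; the event weights are assumptions, not from the lecture.

```python
from collections import Counter, defaultdict

# Hypothetical daily event log: (product_id, category, event_type).
events = [
    ("p1", "laptops", "order"), ("p1", "laptops", "click"),
    ("p2", "laptops", "click"), ("p3", "apparel", "order"),
    ("p1", "laptops", "order"), ("p3", "apparel", "click"),
]

# Orders weighted above clicks when scoring popularity (weights assumed).
WEIGHTS = {"order": 5, "click": 1}

scores = defaultdict(Counter)
for product_id, category, event_type in events:
    scores[category][product_id] += WEIGHTS[event_type]

# Top-N per category feeds the Redis cache-prepopulation step.
top_per_category = {cat: c.most_common(2) for cat, c in scores.items()}
print(top_per_category)
```

In the real pipeline the same aggregation runs over a full day of events exported to HDFS, and the resulting top-N lists decide which product documents get preloaded into Redis.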
Search Indexing Strategy
- Use Elasticsearch with inverted index (term → product ID/description/thumbnail).
- Prefer local (category- and score-based) partitioning over a global index to minimize duplication and speed up pagination.
- Popularity scores guide internal shard allocation for fast retrieval of top items.
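A toy inverted index makes the term-to-products mapping concrete (in Elasticsearch each posting would also carry the description and thumbnail; here we keep only IDs):

```python
from collections import defaultdict

# Toy corpus: product_id -> searchable text (illustrative data).
docs = {
    "p1": "wireless bluetooth headphones",
    "p2": "wired gaming headphones",
    "p3": "bluetooth speaker",
}

# Build the inverted index: term -> set of product IDs containing it.
index = defaultdict(set)
for product_id, text in docs.items():
    for term in text.split():
        index[term].add(product_id)

def search(query):
    # AND query: intersect the posting lists of every query term.
    return sorted(set.intersection(*(index.get(t, set()) for t in query.split())))

print(search("bluetooth headphones"))  # -> ['p1']
```

Keeping each shard's postings pre-sorted by popularity score is what lets the top results for a query come back without scanning the full posting list.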
Cache & Index Swap Process
- Populate a standby cluster with the updated cache/index after the batch job completes.
- Update the coordination service (ZooKeeper/DNS) to point services at the new cluster, minimizing downtime during the switch.
- Schedule swap during off-peak hours to reduce impact.
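The swap amounts to a blue/green cutover: services resolve the active cluster through a single pointer, so switching is one atomic update rather than a rolling reconfiguration. This sketch stands in for a ZooKeeper node or DNS record; the cluster names are hypothetical.

```python
import threading

class ClusterPointer:
    """Single source of truth for which cache/index cluster is live."""

    def __init__(self, active):
        self._active = active
        self._lock = threading.Lock()

    def resolve(self):
        # Every service call looks up the current cluster here.
        with self._lock:
            return self._active

    def swap(self, new_active):
        # Atomic cutover; returns the old cluster so it can be
        # repopulated as the standby for the next batch cycle.
        with self._lock:
            old, self._active = self._active, new_active
        return old

pointer = ClusterPointer("cache-blue")
# ... batch job finishes populating "cache-green" ...
previous = pointer.swap("cache-green")
print(pointer.resolve())  # -> cache-green
```

Because readers only ever see the old pointer or the new one, never a half-updated state, the swap itself causes no downtime; scheduling it off-peak just limits the cold-cache impact on the freshly promoted cluster.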
Key Terms & Definitions
- NoSQL (MongoDB) — Database supporting flexible JSON schemas, ideal for diverse product data.
- B-tree Index — Tree-based data structure for efficient read queries.
- CDN (Content Delivery Network) — Distributed servers for fast image/content access.
- CRDT (Conflict-Free Replicated Data Type) — Data type ensuring eventual consistency across distributed nodes.
- Kafka — Distributed messaging system for buffering events and orders.
- Flink — Stream processing framework for real-time data.
- Elasticsearch — Search engine using inverted indexes for quick full-text search.
- Sharding — Partitioning data to distribute load across multiple servers.
- Redis — In-memory data store used for caching.
Action Items / Next Steps
- Review concepts of CRDTs and their use in distributed systems.
- Study stream processing pipeline setup (Kafka, Flink, Spark).
- Understand how inverted indexes and sharding enhance search scalability.
- (If assigned) Implement sample cart service or search indexer following these principles.