Overview of Twitter's System Design

Aug 15, 2024

System Design Overview - Twitter

Introduction

  • Speaker: Lorraine
  • Focus: Core concepts of Twitter's system design rather than specific technologies.

Key Features of Twitter

  1. User Tweets:

    • Users can tweet to followers quickly, even with millions of followers.
  2. Timelines:

    • Home Timeline: Tweets from accounts the user follows.
    • User Timeline: All tweets made by the user.
    • Search Timeline: Search function for tweets based on keywords.
  3. Trending Topics:

    • Display trending hashtags or topics based on location.

Application Characteristics

  • Over 300 million users.
  • Up to 600 tweets written per second and 600,000 tweet queries (reads) processed per second.
  • Read/Write Ratio: 1 tweet generates approximately 200 reads.
  • Eventual Consistency: Tweets can be delayed in visibility by a few seconds.
  • Storage Cost: Because each tweet is limited to 140 characters, storage cost is not a primary concern.
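
A quick back-of-envelope check of the storage claim, using the figures listed above. It is only a rough sketch: it assumes one byte per character and counts tweet text alone, ignoring metadata, media, indexes, and replication, none of which the talk quantifies.

```python
# Back-of-envelope estimate of raw tweet text volume.
# Assumptions (not from the talk): 1 byte per character, text only,
# no metadata, media, indexes, or replication.
TWEETS_PER_SECOND = 600      # write rate quoted in the notes
MAX_CHARS_PER_TWEET = 140    # tweet length limit quoted in the notes
SECONDS_PER_DAY = 24 * 60 * 60


def daily_text_bytes(tweets_per_second: int, chars_per_tweet: int) -> int:
    """Upper bound on tweet text bytes written per day."""
    return tweets_per_second * chars_per_tweet * SECONDS_PER_DAY


if __name__ == "__main__":
    total = daily_text_bytes(TWEETS_PER_SECOND, MAX_CHARS_PER_TWEET)
    print(f"{total / 1e9:.2f} GB of tweet text per day at most")
```

Under these assumptions the text volume is a few gigabytes per day, which is why the design effort goes into read capacity rather than storage.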

System Design Requirements

  • High Read Capacity: Must support extensive reads effectively.
  • Horizontal Scalability: Ability to scale out as demand increases.

Proposed Solutions

  • Redis:

    • Use Redis for in-memory data storage to speed up read operations.
    • Data structured with user IDs and tweet IDs for efficient access.
  • Database Structure:

    • Key tables include User Table, Tweet Table, and Follows Table.
    • One-to-many relationship from users to tweets: one user has many tweets.
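
The three tables could look roughly like the sketch below. All table and column names are hypothetical, and SQLite (from the Python standard library) stands in for whatever relational database is actually used.

```python
import sqlite3

# Hypothetical schema: table and column names are illustrative only.
SCHEMA = """
CREATE TABLE users (
    id   INTEGER PRIMARY KEY,
    name TEXT NOT NULL
);
CREATE TABLE tweets (
    id         INTEGER PRIMARY KEY,
    user_id    INTEGER NOT NULL REFERENCES users(id),  -- one user, many tweets
    body       TEXT NOT NULL CHECK (length(body) <= 140),
    created_at INTEGER NOT NULL                         -- Unix timestamp
);
CREATE TABLE follows (
    follower_id INTEGER NOT NULL REFERENCES users(id),
    followee_id INTEGER NOT NULL REFERENCES users(id),
    PRIMARY KEY (follower_id, followee_id)
);
CREATE INDEX tweets_by_user ON tweets (user_id, created_at);
"""


def make_db() -> sqlite3.Connection:
    """Create an in-memory database containing the three tables."""
    conn = sqlite3.connect(":memory:")
    conn.executescript(SCHEMA)
    return conn


if __name__ == "__main__":
    db = make_db()
    db.execute("INSERT INTO users VALUES (1, 'alice'), (2, 'bob')")
    db.execute("INSERT INTO follows VALUES (2, 1)")  # bob follows alice
    db.execute("INSERT INTO tweets VALUES (10, 1, 'hello', 1000)")
    print(db.execute("SELECT body FROM tweets WHERE user_id = 1").fetchall())
```

The index on (user_id, created_at) is a design choice of this sketch: it keeps "all tweets by one user, newest first" cheap, which is the query the user timeline needs.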

Data Retrieval Strategies

User Timeline Retrieval

  • Look up the user by user ID, then fetch that user's tweets and retweets.
  • Sort data by date/time for display.
  • Introduce a caching layer to reduce database load and speed up retrieval.
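
The three steps above can be sketched as a read-through cache. This is a minimal illustration, not the production design: a Python list stands in for the tweet table, a dict stands in for the caching layer, and the class and field names are made up for the example.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Tweet:
    tweet_id: int
    user_id: int
    body: str
    created_at: int  # Unix timestamp


class UserTimelineService:
    """Hypothetical service: a list plays the tweet table, a dict plays the cache."""

    def __init__(self, tweets):
        self._tweets = list(tweets)
        self._cache = {}   # user_id -> sorted list of that user's tweets
        self.db_reads = 0  # counts how often the "database" is scanned

    def user_timeline(self, user_id):
        cached = self._cache.get(user_id)
        if cached is not None:
            return cached  # cache hit: no database work
        self.db_reads += 1
        rows = [t for t in self._tweets if t.user_id == user_id]
        rows.sort(key=lambda t: t.created_at, reverse=True)  # newest first
        self._cache[user_id] = rows
        return rows

    def add_tweet(self, tweet):
        self._tweets.append(tweet)
        self._cache.pop(tweet.user_id, None)  # invalidate the stale entry
```

Invalidating on write is the simplest way to keep the cache correct; updating the cached list in place would avoid the next miss at the cost of more code.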

Home Timeline Retrieval

  • Naive approach: retrieve the accounts a user follows and gather the latest tweets from each of them.
  • Fan-out on Write:
    • When a user tweets, the tweet is added to the user's timeline and sent to each follower’s home timeline.
    • Reading a home timeline then becomes a single lookup instead of one database query per followed account.
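
A minimal sketch of fan-out on write, with plain Python containers standing in for Redis. The class name, method names, and the per-timeline cap are assumptions made for the example.

```python
from collections import defaultdict, deque

TIMELINE_LIMIT = 800  # assumed cap on entries kept per timeline


def _new_timeline():
    # Bounded list of tweet IDs, newest at the left.
    return deque(maxlen=TIMELINE_LIMIT)


class FanOutStore:
    """In-memory stand-in for the timeline cache."""

    def __init__(self):
        self.followers = defaultdict(set)               # author -> follower IDs
        self.user_timelines = defaultdict(_new_timeline)
        self.home_timelines = defaultdict(_new_timeline)

    def follow(self, follower_id, author_id):
        self.followers[author_id].add(follower_id)

    def post_tweet(self, author_id, tweet_id):
        # Write once to the author's own timeline...
        self.user_timelines[author_id].appendleft(tweet_id)
        # ...then push the tweet ID to every follower's home timeline.
        for follower_id in self.followers[author_id]:
            self.home_timelines[follower_id].appendleft(tweet_id)

    def home_timeline(self, user_id):
        # A read is a single lookup; no per-author queries are needed.
        return list(self.home_timelines[user_id])
```

The cost moves from read time to write time: one tweet causes one write per follower, which is acceptable when reads vastly outnumber writes.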

Handling High-Profile User Tweets

  • For celebrities with millions of followers, skip the fan-out: writing a single tweet into millions of home timelines is too slow.
  • Instead, maintain a cache of celebrity tweets and merge it into a user's home timeline when that timeline is requested.
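
One way to picture this hybrid is sketched below. The follower threshold, the class, and its method names are all hypothetical; the threshold is kept tiny so the behaviour is easy to follow.

```python
from collections import defaultdict

CELEBRITY_THRESHOLD = 3  # assumed cutoff; a real system would use a far larger value


class HybridTimelines:
    """Fan-out on write for ordinary users, merge on read for celebrities."""

    def __init__(self):
        self.followers = defaultdict(set)         # author -> follower IDs
        self.following = defaultdict(set)         # user -> followed author IDs
        self.home = defaultdict(list)             # user -> [(created_at, tweet_id)]
        self.celebrity_cache = defaultdict(list)  # author -> [(created_at, tweet_id)]

    def follow(self, follower_id, author_id):
        self.followers[author_id].add(follower_id)
        self.following[follower_id].add(author_id)

    def is_celebrity(self, author_id):
        return len(self.followers[author_id]) >= CELEBRITY_THRESHOLD

    def post_tweet(self, author_id, tweet_id, created_at):
        entry = (created_at, tweet_id)
        if self.is_celebrity(author_id):
            # One write to the celebrity cache, no fan-out.
            self.celebrity_cache[author_id].append(entry)
            return
        for follower_id in self.followers[author_id]:
            self.home[follower_id].append(entry)

    def home_timeline(self, user_id):
        # Start from the precomputed timeline, then merge in cached
        # tweets from any celebrities this user follows.
        entries = list(self.home[user_id])
        for author_id in self.following[user_id]:
            entries.extend(self.celebrity_cache.get(author_id, []))
        entries.sort(reverse=True)  # newest first
        return [tweet_id for _, tweet_id in entries]
```

The merge adds a little work to each read, but only for users who follow a celebrity, and it replaces millions of writes per celebrity tweet with one.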

Trending Hashtags Calculation

  • Volume and Velocity: A hashtag trends when many tweets mention it within a short time frame.
  • Use stream processing frameworks for real-time analysis (e.g., Kafka, Spark).
  • Geolocation Consideration: Trends may vary by region; geographical mapping of hashtags is essential.
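
The core of the calculation is a per-region count over a sliding time window. The sketch below shows that idea in a single process; it is not how a Kafka or Spark job would be written, and the window length, class, and method names are assumptions. It also assumes events arrive in timestamp order.

```python
from collections import Counter, deque

WINDOW_SECONDS = 300  # assumed window length


class TrendingCounter:
    """Counts hashtags per region over a sliding time window."""

    def __init__(self, window_seconds=WINDOW_SECONDS):
        self.window = window_seconds
        self.events = deque()  # (timestamp, region, hashtag), oldest first
        self.counts = {}       # region -> Counter of hashtags in the window

    def record(self, timestamp, region, hashtag):
        self.events.append((timestamp, region, hashtag))
        self.counts.setdefault(region, Counter())[hashtag] += 1

    def _expire(self, now):
        # Drop events that have fallen out of the window.
        while self.events and self.events[0][0] <= now - self.window:
            _, region, hashtag = self.events.popleft()
            counter = self.counts[region]
            counter[hashtag] -= 1
            if counter[hashtag] == 0:
                del counter[hashtag]

    def top(self, region, now, n=3):
        """Return the n most frequent hashtags in the region's current window."""
        self._expire(now)
        return self.counts.get(region, Counter()).most_common(n)
```

Keeping a separate counter per region is what lets the same hashtag trend in one place and not in another.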

Search Functionality

  • Twitter uses an inverted full-text indexing system called Earlybird.
  • Tweets are indexed into a distributed table that maps each term to the tweets containing it, for efficient look-up.
  • Scatter and Gather: a query is sent to every index partition across data centers, and the partial results are merged.
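
A toy version of an inverted index with scatter and gather is sketched below. It illustrates the general technique only and says nothing about Earlybird's internals; the classes, the whitespace tokenizer, and the partition-by-tweet-ID rule are assumptions.

```python
from collections import defaultdict


class IndexShard:
    """One partition of a toy inverted index: term -> set of tweet IDs."""

    def __init__(self):
        self.postings = defaultdict(set)

    def index(self, tweet_id, text):
        for term in text.lower().split():
            self.postings[term].add(tweet_id)

    def search(self, terms):
        # Tweets on this shard that contain every query term.
        sets = [self.postings.get(term, set()) for term in terms]
        return set.intersection(*sets) if sets else set()


class SearchCluster:
    def __init__(self, shard_count):
        self.shards = [IndexShard() for _ in range(shard_count)]

    def index(self, tweet_id, text):
        # Assumed partitioning rule: tweet ID modulo the shard count.
        self.shards[tweet_id % len(self.shards)].index(tweet_id, text)

    def search(self, query):
        terms = query.lower().split()
        results = set()
        for shard in self.shards:           # scatter: ask every shard
            results |= shard.search(terms)  # gather: merge partial results
        # Assumes larger tweet IDs are newer, so this lists newest first.
        return sorted(results, reverse=True)
```

Here the shards are queried one after another; a real deployment would query them in parallel, but the merge step is the same.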

Data Flow in the System

  1. Tweet Submission:

    • User submits a tweet through an API call.
    • Tweet stored in a database and sent to the trending hashtag service.
    • Processed for updates to followers' timelines.
    • Indexed for searchability.
  2. Request Handling:

    • Queries for timelines hit a load balancer, directing traffic to the appropriate services (e.g., timeline service).
    • Data pulled from Redis for rapid access.
  3. Real-Time Connections:

    • WebSocket service for maintaining persistent connections with users.
    • Manages millions of connections simultaneously.
  4. Coordination:

    • ZooKeeper is used for managing distributed components, maintaining node configurations, and tracking online/offline servers.

Database Technology

  • Twitter uses MySQL for core databases and Cassandra for analytics and large-scale data processing.

Conclusion

  • Summary of components covered in the system design.
  • Encouragement for viewers to subscribe and provide feedback for improvements.