How Discord Stores Trillions of Messages

Jul 22, 2024

Background

  • Discord originally stored messages in MongoDB and ran into significant scaling issues.
  • By 2017 it had transitioned to Cassandra to handle a growing user base and message volume.
  • Growth continued rapidly: from 10 million users in 2017 to 140 million in 2021.

Issues with Cassandra

  • Messages were stored in partitions keyed by channel and a static time window (called a bucket).
  • Each message was replicated across three nodes.
  • Major issues:
    • Discord servers vary enormously in size (small ones vs. giant ones like Midjourney), so partitions were very uneven.
    • Reads were slower and more resource-intensive than writes.
    • Data compaction and Java garbage collection caused further slowdowns.
    • The cluster became difficult to manage as it grew from 12 nodes initially to 177.
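
To make the bucketing idea concrete, here is a minimal sketch of how a partition key could be derived from a channel ID and a message ID. It assumes message IDs are Discord-style snowflakes, whose bits above the lowest 22 hold milliseconds since the Discord epoch. The 10-day window and the function names are assumptions for illustration, not details taken from this summary.

```python
# Hypothetical sketch: derive a (channel_id, bucket) partition key.
# Assumption: message IDs are snowflakes; the bits above the low 22 are
# milliseconds elapsed since the Discord epoch (first second of 2015).
DISCORD_EPOCH_MS = 1420070400000
BUCKET_SIZE_MS = 10 * 24 * 60 * 60 * 1000  # assumed 10-day static window


def snowflake_timestamp_ms(snowflake: int) -> int:
    """Unix timestamp in milliseconds encoded in a snowflake ID."""
    return (snowflake >> 22) + DISCORD_EPOCH_MS


def make_bucket(snowflake: int) -> int:
    """Index of the static time window the snowflake falls into."""
    return (snowflake >> 22) // BUCKET_SIZE_MS


def partition_key(channel_id: int, message_id: int) -> tuple[int, int]:
    """All messages of one channel within one window share a partition."""
    return (channel_id, make_bucket(message_id))
```

Under this scheme a quiet channel produces tiny partitions, while a very busy channel packs a huge number of messages into every window, which is one way the size imbalance described above can arise.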

Need for Change

  • Increased latency and maintenance difficulties necessitated a new solution.

Transition to ScyllaDB

  • Discord turned to ScyllaDB, a high-performance NoSQL database that is compatible with Cassandra but written in C++.
  • Advantages of ScyllaDB:
    • No Java garbage collection.
    • Better performance and scalability.
  • The migration was an extensive effort because it had to maintain uptime and minimize user disruption.

Data Services and Rust Integration

  • Built "data services" that sit between the API and the database and handle the traffic between them.
  • Chose Rust for its performance, safety, and developer experience, especially for asynchronous requests.
  • Used gRPC for data transmission between services, for performance and safety.
  • Added data coalescing: when multiple users request the same data at the same time, the database is queried only once.
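
The coalescing idea can be illustrated with a small sketch. Discord's data services are written in Rust; the version below uses Python's asyncio only to show the technique, and the `Coalescer` class, the `fake_db_fetch` function, and the channel ID are all hypothetical. The first request for a key starts the database query, and every request that arrives while it is still in flight waits on the same task.

```python
import asyncio


class Coalescer:
    """Share one in-flight fetch among all concurrent requests for a key."""

    def __init__(self, fetch):
        self._fetch = fetch      # async function: key -> value
        self._in_flight = {}     # key -> asyncio.Task currently running

    async def get(self, key):
        task = self._in_flight.get(key)
        if task is None:
            # First requester for this key: start the real query.
            task = asyncio.ensure_future(self._run(key))
            self._in_flight[key] = task
        # shield() keeps one cancelled waiter from cancelling the shared task.
        return await asyncio.shield(task)

    async def _run(self, key):
        try:
            return await self._fetch(key)
        finally:
            # Later requests start a fresh query and see fresh data.
            self._in_flight.pop(key, None)


async def demo():
    calls = []

    async def fake_db_fetch(channel_id):
        # Stand-in for a real database query (hypothetical).
        calls.append(channel_id)
        await asyncio.sleep(0.01)
        return f"messages for channel {channel_id}"

    coalescer = Coalescer(fake_db_fetch)
    # 100 simultaneous requests for the same channel.
    results = await asyncio.gather(*(coalescer.get(42) for _ in range(100)))
    return results, calls
```

Running `asyncio.run(demo())` returns 100 identical results while `fake_db_fetch` is called only once. Results are deliberately not cached after the query completes, so this reduces load during bursts without serving stale data.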

Migration Process

  1. Set Up ScyllaDB Cluster: Initialize the new ScyllaDB nodes.
  2. Data Migration Tool: Rewrite the migration tool in Rust to increase speed (estimated migration time dropped from 3 months to 9 days).
  3. Migrate Data: Continuously migrate both new and historical data.
  4. Performance Improvements: The migration ran at up to 3.2 million messages per second, and the node count dropped from 177 to 72.
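
As a rough sanity check on these figures, the arithmetic below multiplies the quoted rate by the quoted duration. It assumes the 3.2 million messages per second were sustained for all 9 days, which is an upper bound rather than a reported total, and the function names are illustrative.

```python
SECONDS_PER_DAY = 24 * 60 * 60


def messages_moved(rate_per_second: int, days: int) -> int:
    """Messages migrated if the rate were sustained for the whole period."""
    return rate_per_second * SECONDS_PER_DAY * days


def percent_reduction(before: int, after: int) -> float:
    """Relative decrease from `before` to `after`, as a percentage."""
    return 100 * (before - after) / before


upper_bound = messages_moved(3_200_000, 9)
node_cut = percent_reduction(177, 72)
print(f"{upper_bound:,} messages at most")  # 2,488,320,000,000 messages at most
print(f"{node_cut:.1f}% fewer nodes")       # 59.3% fewer nodes
```

Roughly 2.5 trillion messages in 9 days is consistent with the "trillions of messages" in the title, and the cluster ended up with about 59% fewer nodes.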

Performance Gains

  • Significant improvement in latency and scalability:
    • Message fetch time reduced from 40-125ms down to 15ms.
    • Insertion time stabilized at 5ms.
  • Successful handling of high-load events (e.g., World Cup Finals).

Key Takeaways

  • Importance of tackling infrastructure issues beyond simply adopting new technology.
  • Transition to ScyllaDB and Rust integration improved performance and maintainability.
  • Innovation should be balanced with practical problem-solving.

Conclusion

  • Discord's engineers successfully migrated to ScyllaDB, demonstrating the importance of adapting to technological demands while ensuring stability and performance. This migration underscores the value of strategic innovation in handling large-scale infrastructure challenges.

Check out the Discord Engineering blog for more detailed insights.