How Discord Stores Trillions of Messages
Background
- Discord faced significant scaling issues with their MongoDB database in 2017.
- Transitioned to a Cassandra database to handle growing user base and message volume.
- Continued exponential growth: from 10 million users in 2017 to 140 million in 2021.
Issues with Cassandra
- Stored messages in partitions based on channel and static time windows (called buckets).
- Messages replicated across three nodes.
- Major issues:
- Different server sizes (small vs giant ones like Mid Journey).
- Reading messages was slower and more resource-intensive than writing.
- Data compaction and Java garbage collection caused further slowdowns.
- Difficult to manage large number of nodes (12 initial increased to 177).
Need for Change
- Increased latency and maintenance difficulties necessitated a new solution.
Transition to ScyllaDB
- Eyeing ScyllaDB, a high-performance NoSQL database, compatible with Cassandra but written in C++.
- Advantages of ScyllaDB:
- No Java garbage collection.
- Better performance and scalability.
- Migration was an extensive effort to maintain uptime and minimize user disruption.
Data Services and Rust Integration
- Development of "data services" to handle traffic between API and database.
- Chose Rust for its performance, security, and developer experience, especially with asynchronous requests.
- Implemented gRPC for data transmission between services, ensuring performance and safety.
- Data coalescing to reduce database queries when multiple users access the same data simultaneously.
Migration Process
- Setup ScyllaDB Cluster: Initialize new ScyllaDB nodes.
- Data Migration Tool: Rewrite migration tool in Rust to increase speed (from 3 months to 9 days).
- Migrate Data: Continuously migrate new data and historical data.
- Performance Improvements: Achieved migration target with 3.2 million messages per second; reduced nodes from 177 to 72.
Performance Gains
- Significant improvement in latency and scalability:
- Message fetch time reduced from 40-125ms down to 15ms.
- Insertion time stabilized at 5ms.
- Successful handling of high-load events (e.g., World Cup Finals).
Key Takeaways
- Importance of tackling infrastructure issues beyond simply adopting new technology.
- Transition to ScyllaDB and Rust integration improved performance and maintainability.
- Innovation should balance with practical problem-solving.
Conclusion
- Discord's engineers successfully migrated to ScyllaDB, demonstrating the importance of adapting to technological demands while ensuring stability and performance. This migration underscores the value of strategic innovation in handling large-scale infrastructure challenges.
Check out the Discord Engineering blog for more detailed insights.