Transcript for:
How Discord Stores Trillions of Messages

Imagine storing the trillions of messages ever sent on Discord. With millions more being sent every hour, it's critical that the system stays operational without any downtime. Discord faced exactly this situation: they were outgrowing their previous database and needed to find a solution. This video is based on the Discord engineering blog article written by Bo Ingram, a senior software engineer at Discord, so make sure you check it out. I put the link in the description below. If you're new here, I'm a professional software developer on a mission to inspire developers and tech enthusiasts.

But before we start, let's roll back a little bit. Funny enough, I previously made a video on this called "How Discord Stores Billions of Messages," which is kind of annoying. Like, they couldn't have worked on my schedule? Anyway, if you don't want to watch that video, here's a little backstory. In 2017, MongoDB was causing significant scaling issues for Discord as its popularity grew, so Discord switched to a Cassandra database. Despite some challenges with the migration, the team was ultimately able to transition to Cassandra successfully and has kept the system running ever since. Discord has continued to grow exponentially, reaching over 140 million users in 2021, which meant they had to keep making updates and improvements to their database infrastructure to keep up with the increasing demand from their users.

But 2017 was six years ago. In 2017, 10 million users were using Discord; in 2021 there were over 140 million. That's 14 times the users. Something's gotta change. With this massive growth came a massive problem: how do you store trillions of existing messages? Initially they started with just 12 Cassandra nodes that stored billions of messages between them. Fast forward to last year, and that grew to 177 nodes with trillions of messages on them, and managing this was becoming a nightmare. The system was going through insane slowdowns, and the Discord engineering team was having a hard time keeping it maintained. It was clear that they needed a new solution, and they needed it fast.

Okay, so let's look at the issues they were having. Each message is partitioned by the channel it was sent in, along with a static time window that they call a bucket (sketched in code below). Since it's partitioned this way, Cassandra stores all the messages for a given channel and bucket together and replicates them across three nodes. This design caused major issues for the two types of servers on Discord: a small server that sends a few messages versus a giant server like Midjourney, where messages are sent practically every millisecond.

And for Cassandra, reading a message takes more time and effort than writing one. When you send a message on Discord, it's appended to a commit log and held in memory, then eventually written to disk while you're doing something else. But when you want to read a message, Cassandra has to look in multiple places (nodes, partitions, files on disk, et cetera) to find the message you want. When too many people do this at once, those hot partitions slow down and cause latency across the whole database cluster, kind of like a road closing and all the cars being forced to merge into your lane. It just slows everybody down. So not only did this result in slowdowns in the software and the databases, it also slowed down engineering.
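To make that partitioning scheme concrete, here's a minimal Rust sketch of how a message's bucket can be derived from its snowflake ID. The epoch constant comes from Discord's public snowflake format, and the roughly-ten-day window comes from their earlier "billions of messages" post; treat the exact numbers and names as illustrative, not as Discord's production code.

```rust
// Illustrative sketch of Discord-style message bucketing.

/// First second of 2015, in milliseconds: the epoch Discord snowflakes count from.
const DISCORD_EPOCH: u64 = 1_420_070_400_000;

/// Bucket width: a static time window of roughly 10 days, in milliseconds.
const BUCKET_MS: u64 = 10 * 24 * 60 * 60 * 1000;

/// Extract the millisecond timestamp embedded in a snowflake ID.
/// The top 42 bits of a snowflake are milliseconds since DISCORD_EPOCH.
fn snowflake_timestamp_ms(snowflake: u64) -> u64 {
    (snowflake >> 22) + DISCORD_EPOCH
}

/// Compute the static time-window bucket a message falls into.
fn bucket_for(snowflake: u64) -> u64 {
    (snowflake_timestamp_ms(snowflake) - DISCORD_EPOCH) / BUCKET_MS
}

fn main() {
    let message_id: u64 = 1_049_430_251_161_800_000; // an arbitrary example ID
    // The partition key is (channel_id, bucket): every message sent in the
    // same channel within the same ~10-day window lands on one partition,
    // which is replicated across three nodes.
    println!("bucket = {}", bucket_for(message_id));
}
```

This is exactly why a server like Midjourney's is a problem: every message from a ten-day window piles onto a single partition, which becomes hot.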
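And to see why writes are cheap but reads get expensive, here's a toy log-structured storage engine in Rust. It's a teaching sketch of the general LSM-tree idea that Cassandra-style databases use, not Cassandra's actual implementation: a write is a log append plus an in-memory update, while a read may have to check the memtable and every flushed file.

```rust
use std::collections::BTreeMap;

// Toy LSM-style store: writes are cheap, reads can touch many places.
struct ToyLsm {
    commit_log: Vec<(String, String)>,       // sequential append: fast
    memtable: BTreeMap<String, String>,      // in-memory, sorted
    sstables: Vec<BTreeMap<String, String>>, // immutable "on-disk" files (simulated)
}

impl ToyLsm {
    fn write(&mut self, key: String, value: String) {
        // A write is two cheap operations: append to the log, update memory.
        self.commit_log.push((key.clone(), value.clone()));
        self.memtable.insert(key, value);
    }

    fn flush(&mut self) {
        // "While you're doing something else," the memtable is flushed to
        // disk as a new immutable SSTable.
        self.sstables.push(std::mem::take(&mut self.memtable));
    }

    fn read(&self, key: &str) -> Option<&String> {
        // A read may look in multiple places: memtable first, then every
        // SSTable from newest to oldest. More files means slower reads,
        // which is what compaction (coming up next) exists to fix.
        self.memtable
            .get(key)
            .or_else(|| self.sstables.iter().rev().find_map(|t| t.get(key)))
    }
}

fn main() {
    let mut db = ToyLsm {
        commit_log: Vec::new(),
        memtable: BTreeMap::new(),
        sstables: Vec::new(),
    };
    db.write("msg:1".into(), "hello".into());
    db.flush();
    db.write("msg:2".into(), "world".into());
    assert_eq!(db.read("msg:1").map(String::as_str), Some("hello"));
    println!("reads may touch the memtable and {} sstable(s)", db.sstables.len());
}
```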
To maintain this database and keep reads fast, Discord would have to run compaction, a maintenance task that reorganizes information on disk so Cassandra can find it faster. However, with the issues already mentioned, this task kept getting put off, which made it way more expensive down the line and caused a domino effect of issues after that. Also, another classic enemy showed his face: the Java garbage collector, which constantly had to be tuned when things would break. Ugh. Java garbage collection is still causing people issues? I thought we were done with this.

This clearly wasn't cutting it, so they did what any programmer would want to do when they see something new, and that is change architectures. Funny enough, Discord had actually had their eye on another database for years: ScyllaDB. ScyllaDB is a high-performance NoSQL database. Yep, that's how I pronounce NoSQL. Get over it. It's designed to handle the most demanding workloads, it's Cassandra-compatible, and it's written in C++, which gives it many awesome features that set it apart from other databases. One of the best features of ScyllaDB is the removal of the garbage collector, since it isn't written in Java. That makes it a great choice for organizations that want to avoid the performance overhead that comes with Java's garbage collector. ScyllaDB is also designed to handle a huge number of requests simultaneously, and it can scale up easily whether users are constantly hitting the database or not. It's a great choice for organizations that need a high-performance, scalable, and reliable NoSQL database. So, in summary, ScyllaDB was arguably perfect for what Discord needed at that moment.

After some experimentation and toying around, the Discord engineers made the bold choice of migrating all of their databases to ScyllaDB. They were successful and moved every database over, except for one, and that database was cassandra-messages: trillions of messages, nearly 200 nodes. Ugh. This was gonna be a serious undertaking.

Also, just because you get the latest and greatest technology doesn't mean it's always the best; sometimes the issues are with your own setup and infrastructure. One of the most important components of any web application is its database. If you had a load balancer pointing at different versions of Discord during the migration, there would be a huge risk of one of those versions storing information in one database without it being replicated to the other when everything is finished. And sometimes the part we do see doesn't interact with a new API well at all, so in some instances you might run into an error message because the newest version of Discord still hasn't settled in your area. To think of this happening with over 140 million active users while moving over Discord's primary message store... sounds stressful.

First, a great optimization was finding a way to handle the traffic that was being funneled to their overloaded hot partitions. In between their API monolith and their database, they wrote something called data services. Now, when dealing with something that should be milliseconds fast, you need a programming language that is fast and safe. It's 2023, so of course they chose Rust. I mean, can you blame them though? Rust is awesome.
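Since ScyllaDB is Cassandra-compatible, it speaks CQL, so a Rust service can talk to it with the community `scylla` driver. Here's a minimal, hypothetical sketch (the keyspace, table, and node address are made up, and method names can differ slightly between driver versions), just to show the same (channel_id, bucket) partition key from earlier expressed as a CQL schema:

```rust
// Hypothetical sketch using the community `scylla` Rust driver; not
// Discord's actual code. Names and addresses are illustrative.
use scylla::{Session, SessionBuilder};

#[tokio::main]
async fn main() -> Result<(), Box<dyn std::error::Error>> {
    let session: Session = SessionBuilder::new()
        .known_node("127.0.0.1:9042") // a local ScyllaDB node
        .build()
        .await?;

    // Replication factor 3 matches "replicated across three nodes."
    session
        .query(
            "CREATE KEYSPACE IF NOT EXISTS chat WITH replication = \
             {'class': 'SimpleStrategy', 'replication_factor': 3}",
            &[],
        )
        .await?;

    // The same partitioning scheme as before: (channel_id, bucket) is the
    // partition key; message_id orders rows within a partition.
    session
        .query(
            "CREATE TABLE IF NOT EXISTS chat.messages (
                 channel_id bigint,
                 bucket int,
                 message_id bigint,
                 body text,
                 PRIMARY KEY ((channel_id, bucket), message_id)
             )",
            &[],
        )
        .await?;
    Ok(())
}
```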
It offers an amazing developer experience while also providing some of the advantages you'd typically see in low-level programming languages such as C or C++. Rust has been ranked the most loved programming language to work with for the past several years in the Stack Overflow Developer Survey, precisely for this reason. And Rust's seamless integration with the Tokio ecosystem for managing asynchronous requests makes it a top choice for developers working on high-performance applications, which is probably why big companies like AWS, Facebook, and Dropbox use it daily in their own applications. The combination of Rust and Tokio provides a reliable, scalable solution for handling large amounts of traffic and requests without experiencing significant slowdowns or downtime.

But let's look at these data services. Each data service contains a gRPC endpoint per database query and doesn't contain any business logic. Now, gRPC is a newer way to transmit data across the different services you might have. For example, if I wanted my Python app to talk to my Java server, gRPC could be used. The benefits are that it's more performant, since it uses HTTP/2 to transmit the data, and it's safe, since it uses something called protocol buffers to help validate data before it's even sent. Performant and safe: seems like the words of the day here.

The biggest feature these data services provided is called request coalescing (sketched in code below). If multiple users try to read the same message at the same time, rather than running one query per user, the data service queries the database once and distributes the result among all the waiting users. This solved many problems, like when someone pings @everyone or a new announcement is made and people swarm to a server. Remember, the less work a database has to do, the better. To optimize even further, they also provided a way to route requests to a data service based on the channel ID, so all requests for the same channel go to the same data service instance, which can then pool the requests together and take advantage of the coalescing mentioned above.

Okay, cool. But they're still on Cassandra. When do we abandon ship? Okay, so: move trillions of messages while 140 million people use your application at the same time, but do this without any downtime. That's like migrating the population of Russia to brand new houses without any of them noticing. This is gonna be tough. Step one: boot up a new ScyllaDB cluster. Step two: migrate newer data using a cutover time, and then migrate the historical data behind it.

After they got this process started, they were writing new data to both their old Cassandra database and their shiny new ScyllaDB database. And after booting up their ScyllaDB migration tool, they got an estimated time of three months. Three months is a very long time; by then they'd probably want to migrate to yet another database. So, doing what every programmer does in the year 2023, they decided to completely rewrite the migration tool in Rust to make it go 10 times faster. Maybe even a hundred times faster. Here's how it works: it reads a range of data from the database, and then, using a smaller embedded database, SQLite, it checkpoints where it is in the migration. Then it dumps it all into ScyllaDB. The new estimate brought it all the way down to nine days instead of three months, which means they could do the whole migration in one clean swoop. With this approach they were able to migrate 3.2 million messages per second (and the math checks out: 3.2 million per second for nine days is roughly 2.5 trillion messages).
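Here's what request coalescing can look like in Rust with Tokio. This is an illustration of the pattern in the spirit of Discord's data services, not their actual code: the first reader of a message triggers the real query, and everyone else who asks for the same message while it's in flight just waits for that one result.

```rust
use std::collections::HashMap;
use std::sync::Arc;
use tokio::sync::{broadcast, Mutex};

// Minimal request-coalescing ("singleflight") sketch.
#[derive(Clone, Default)]
struct Coalescer {
    in_flight: Arc<Mutex<HashMap<u64, broadcast::Sender<String>>>>,
}

impl Coalescer {
    async fn get_message(&self, message_id: u64) -> String {
        let mut rx = {
            let mut map = self.in_flight.lock().await;
            if let Some(tx) = map.get(&message_id) {
                // Someone is already fetching this message: wait for their result.
                tx.subscribe()
            } else {
                // We're the first: register ourselves, then do the one real query.
                let (tx, rx) = broadcast::channel(1);
                map.insert(message_id, tx.clone());
                let this = self.clone();
                tokio::spawn(async move {
                    let value = fake_database_query(message_id).await;
                    this.in_flight.lock().await.remove(&message_id);
                    let _ = tx.send(value);
                });
                rx
            }
        };
        rx.recv().await.expect("sender dropped")
    }
}

// Hypothetical stand-in for the actual CQL query.
async fn fake_database_query(message_id: u64) -> String {
    tokio::time::sleep(std::time::Duration::from_millis(50)).await;
    format!("message {}", message_id)
}

#[tokio::main]
async fn main() {
    let c = Coalescer::default();
    // 100 concurrent readers of the same message result in one database query.
    let handles: Vec<_> = (0..100)
        .map(|_| {
            let c = c.clone();
            tokio::spawn(async move { c.get_message(42).await })
        })
        .collect();
    for h in handles {
        h.await.unwrap();
    }
    println!("done: 100 reads, 1 query");
}
```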
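The routing step can be sketched just as simply. The blog doesn't describe the routing layer in this much detail, so take this as the general idea rather than Discord's implementation: hash the channel ID so every request for the same channel lands on the same data service instance, where it can be coalesced.

```rust
use std::collections::hash_map::DefaultHasher;
use std::hash::{Hash, Hasher};

// Route a request to a data service instance by channel ID so requests for
// the same channel always land together. A real system would use a stable
// hash (DefaultHasher isn't guaranteed stable across builds) or consistent
// hashing to survive instances being added and removed.
fn data_service_for(channel_id: u64, num_services: usize) -> usize {
    let mut h = DefaultHasher::new();
    channel_id.hash(&mut h);
    (h.finish() as usize) % num_services
}

fn main() {
    // All requests for channel 1234 route to the same instance.
    assert_eq!(data_service_for(1234, 16), data_service_for(1234, 16));
    println!("channel 1234 -> data service {}", data_service_for(1234, 16));
}
```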
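And here's the migrator's checkpointing idea as a minimal sketch, assuming the `rusqlite` crate: after each migrated range, record how far you got in a small local SQLite file, so a crash means resuming from the checkpoint rather than restarting a nine-day job. The table and column names are hypothetical.

```rust
use rusqlite::{params, Connection, Result};

fn load_checkpoint(db: &Connection) -> Result<i64> {
    db.execute(
        "CREATE TABLE IF NOT EXISTS checkpoint (id INTEGER PRIMARY KEY, last_token INTEGER)",
        [],
    )?;
    // No row yet means we start at the beginning of the token ring.
    Ok(db
        .query_row("SELECT last_token FROM checkpoint WHERE id = 1", [], |r| r.get(0))
        .unwrap_or(i64::MIN))
}

fn save_checkpoint(db: &Connection, token: i64) -> Result<()> {
    db.execute(
        "INSERT INTO checkpoint (id, last_token) VALUES (1, ?1)
         ON CONFLICT(id) DO UPDATE SET last_token = excluded.last_token",
        params![token],
    )?;
    Ok(())
}

fn main() -> Result<()> {
    let db = Connection::open("migration_checkpoint.db")?;
    let mut token = load_checkpoint(&db)?;

    // In the real tool this loop would read a token range from Cassandra and
    // write it to ScyllaDB with many concurrent workers; here we just advance
    // a counter to show the durable resume point.
    for _ in 0..3 {
        token = token.saturating_add(1_000_000); // pretend we migrated a range
        save_checkpoint(&db, token)?;
    }
    println!("migrated up to token {}", token);
    Ok(())
}
```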
After the nine days, it was all migrated successfully. Well, apart from a small collection of tombstones, which you can learn more about in the previous video. And just like that, ScyllaDB became the primary database of Discord, hosting trillions of messages. Goodbye, Cassandra.

A year after this migration, things are still going extremely well. Beforehand there were 177 Cassandra nodes, but now it's down to just 72 ScyllaDB nodes. Latency to fetch messages also improved, going from 40-125 milliseconds down to 15 milliseconds. Inserting a message used to take 5-70 milliseconds and is now down to a consistent 5 milliseconds.

Everything was looking great during the World Cup final, too. Discord engineers could look at the message performance graphs and easily see the hype behind the whole match just from the metrics. Looking at the metrics, you can see right where Lionel Messi scores a penalty and where Kylian Mbappé ties it back up. Hope I said his name right. It's funny to think that while millions of people were stressing about a World Cup game, these Discord engineers were looking at these metrics and stressing even harder, making sure Discord didn't go down. Being able to handle the communication during such a popular event is truly a flex and a perfect way to showcase that their new system just worked. The migration clearly was a success.

It's fun to think about the takeaways from this story, since only a few software engineers have truly gone through a mega-scale project like this. As of this video, Discord is one of the world's most popular applications, so having the app go down for the time it took to migrate would have made history if it had happened. What I really liked about this article is that the Discord engineers knew that just moving over to a shiny new object wasn't going to solve all of their problems. Developers have shiny object syndrome, which distracts them from solving the problems they actually run into, because they think the shiny object is the answer. Transitioning to ScyllaDB was for sure an improvement, but the data services in between the database and the API were a massive improvement on top of the database migration. Now, although it sounds like I'm gonna contradict myself here: don't get too comfortable with the tech stack you've been using. Just because something fits in somewhere doesn't mean it belongs there. Discord's switch to Rust helps us understand that getting into the nitty-gritty helps us determine what the best tool for the job truly is.

Thanks for watching this video. I have a playlist dedicated to more videos like this, showing how big tech builds the tools that you love. And if you'd like more stories like this, make sure you subscribe to my newsletter, The Better Dev, where I inspire developers to become better devs. Who should I cover next? Peace out, coders.