This video was sponsored by Brilliant. Race conditions are some of the trickiest bugs in software, and today we'll try to beat them with nothing but code, only to discover why that fight is already lost. Hi friends, my name is George and this is Core Dumped.

Let's start with the classic example of a race condition. Imagine we have a variable, say a counter. Our program enters a for loop and increments this counter on every iteration. After the loop ends, we print the final value of the counter. We'll hard-code the number of iterations, so the final result doesn't depend on user input. Since the counter increases by one per iteration, we expect the program to output exactly the number of iterations. And if we run it, that's exactly what happens.

Now let's make this process concurrent. Instead of having the main program perform all the increments, we'll spawn two threads and split the workload. Each thread is responsible for half of the total number of increments. Since there are two threads and each performs half the increments, we would expect the final counter to equal the total number of iterations. But when we run this version, that's not what we see. Even worse, if we run the program multiple times, we get a different result each time. What's affecting our program is something called a race condition.

We covered race conditions in detail in a previous episode, but here's a quick recap of what's happening. When we write a line of code, even a very simple one like counter = counter + 1, the CPU doesn't necessarily execute it in a single step. Incrementing a number can take multiple instructions: on many architectures, load the number from memory into a register, then increment the value in the register, and then store the result back into memory. When a single operation takes multiple steps, we say that the operation is not atomic. This is how all computers fundamentally work: they achieve complex tasks by breaking them down into very simple instructions.

Unfortunately, this way of operating does not mix well with another fundamental concept in computing: concurrency. We've explained before that concurrency is a technique that allows computers to run more programs than there are physical processors by rapidly switching the CPU among them. This switching happens so quickly that it creates the illusion that everything runs at the same time. The problem is that since some tasks don't finish in a single step, if such a task is interrupted in the middle of modifying data and another task accesses that same data, it may see that data in an invalid state.

Here's a quick explanation of what could have happened in our example. One thread fetches the counter's value and increments it, but gets interrupted before storing it back into memory. The second thread then fetches the same old value, increments it, and successfully stores it. But when the first thread resumes, the value it fetched from memory is already outdated. Since it applies its increment to this stale number and only now writes it back, the counter increases once instead of twice. And because this can happen at unpredictable points during execution, the final result varies each time the program runs.

The very first thing we should always do when dealing with threads is to identify the part of our code where shared data might be accessed while it is concurrently being modified. This is known as the critical section.
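Here's a minimal sketch of the concurrent version in C with POSIX threads. The exact source isn't shown in the transcript, so the names ITERATIONS and increment_half, and the choice of pthreads, are assumptions for illustration:

```c
#include <pthread.h>
#include <stdio.h>

#define ITERATIONS 1000000  /* hard-coded, so the result doesn't depend on user input */

long counter = 0;  /* shared by both threads */

/* Each thread performs half of the increments. counter = counter + 1 is not
 * atomic: it becomes a load, an increment, and a store, and a thread can be
 * interrupted between those steps. */
static void *increment_half(void *arg) {
    (void)arg;
    for (long i = 0; i < ITERATIONS / 2; i++) {
        counter = counter + 1;  /* the critical section */
    }
    return NULL;
}

int main(void) {
    pthread_t t0, t1;
    pthread_create(&t0, NULL, increment_half, NULL);
    pthread_create(&t1, NULL, increment_half, NULL);
    pthread_join(t0, NULL);
    pthread_join(t1, NULL);

    /* We expect ITERATIONS, but lost updates usually produce a smaller number
     * that changes on every run. */
    printf("counter = %ld (expected %d)\n", counter, ITERATIONS);
    return 0;
}
```

Built with something like gcc -pthread and optimizations off, this typically prints a different, smaller value on each run.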
Notice that in this example, the rest of the code, the for loop, is not part of the critical section. That's because the loop's control variable is local to each thread and the number of iterations is a constant, so it is safe for both threads to read it, since it cannot change.

Once the critical section has been identified, our job is to find a mechanism that prevents both threads from executing their critical sections at the same time. In other words, we need to achieve mutual exclusion: stop one thread from entering its critical section while the other is already inside. That way, even if that code translates into multiple CPU instructions, those instructions can't overlap and the race condition is avoided.

The first idea most people new to concurrency come up with is: what if we add another control variable? For example, suppose we introduce a variable called is_safe that indicates whether it's safe to enter the critical section. The plan is simple. When a thread enters its critical section, it sets is_safe to false. When it exits, it sets is_safe to true. So if one thread is executing its critical section, the other thread spin-waits until is_safe becomes true again. Sounds reasonable, right?

But here's the problem. This control variable itself is shared: both threads are reading from and writing to it. And since these reads and writes are not atomic, we've just created a new potential race condition. In fact, with the right timing, which we cannot control, something like this could happen. Thread one reads is_safe as true but gets interrupted before acting on it. Thread two runs, sees is_safe equal to true, sets it to false, and enters its critical section. But since the critical section is not an atomic operation, thread two is then interrupted in the middle of it. When thread one resumes, it compares the old value it fetched, which was true at the time, and assumes it's safe to enter, even though thread two is still inside. And as you can see, both threads can still slip into the critical section at the same time, even if one hasn't finished yet. So instead of solving the problem, this approach just moved it somewhere else.
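As a rough sketch, assuming the same C setup as before, the naive approach looks something like this (the name is_safe comes from the description above; naive_increment is a made-up name, and as just explained, this does not actually work):

```c
#include <stdbool.h>

long counter = 0;     /* the shared counter from the earlier sketch */
bool is_safe = true;  /* shared flag: true means nobody is inside the critical section */

/* One increment, "protected" by the naive flag. Both threads would call this. */
void naive_increment(void) {
    while (!is_safe)        /* spin-wait while the other thread looks busy...       */
        ;                   /* ...but we can be interrupted right after reading     */
                            /* is_safe and before the next line runs                */
    is_safe = false;        /* claim the critical section                           */
    counter = counter + 1;  /* the critical section itself                          */
    is_safe = true;         /* release it                                           */
}
```

The window between reading is_safe in the loop and setting it to false is itself unprotected shared access, which is exactly the new race condition described above.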
If you want a deeper look, this is explained with more intuitive animations in my previous video. Intuitive, like every single lesson on Brilliant. If you're someone who loves learning new things and building real skills, but feels like endlessly scrolling through PDFs isn't the way to do it, then Brilliant is for you. Brilliant is a learning platform built around active problem solving. Every lesson is designed to let you play with ideas, using a first-principles approach that helps you understand concepts from the ground up. You'll stay motivated with a perfect blend of engaging challenges, daily streaks, and competitive features that actually make learning feel rewarding. All content on Brilliant is crafted by an award-winning team of teachers, researchers, and professionals from Stanford, MIT, Caltech, Microsoft, Google, and more. The best part is that Brilliant is available right on your phone, so instead of passively scrolling PDFs, you can learn a little every day, wherever you are. Courses like Thinking in Code and Programming with Variables help you develop the mindset of a programmer by teaching you to break down complex problems into logical, manageable parts, and also help you develop an intuition for computer logic as you learn to design and debug real programs.

To learn for free on Brilliant, go to brilliant.org/corumpt, scan the QR code on screen, or click on the link in the description. Brilliant is also giving our viewers 20% off an annual premium subscription, which gives you unlimited daily access to everything on Brilliant. And now, back to the video.

How can we solve race conditions? If we can't trust that any single read or write will be atomic, it feels like we're stuck. But take a look at this algorithm. Some decades ago, this was one of the most widely taught approaches to illustrate how to avoid race conditions. The idea is that whenever two threads need to protect their critical sections from one another, they surround those sections with a small amount of coordination code. Both threads share two variables: the variable turn indicates whose turn it is to enter the critical section, and the flag array is used to indicate whether a thread is ready to enter its critical section. This is known as Peterson's solution.

For thread 0 to enter its critical section, it first sets its own flag to true and then sets turn to 1, which effectively asserts that if the other thread wishes to enter the critical section, it can do so. After that, thread 0 enters a loop and spin-waits until the other thread leaves its critical section. When that happens, thread 1 will set its flag to false and thread 0's loop will break. Only then does thread 0 proceed into its own critical section. Notice that if thread 1 doesn't want to enter its critical section, the waiting condition fails immediately and thread 0 enters right away. Instead of competing to rush in, this algorithm has each thread offer priority to the other and wait until it's safe to proceed. That's why, in the second line of code, they assign the turn to each other and not to themselves.
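In code, thread 0's coordination might look like the following sketch, using the turn and flag variables described above (the function names and the plain C types are my own assumptions; thread 1 runs the mirror image with every 0 and 1 swapped). Keep in mind, as the rest of the video explains, that this only behaves correctly if the statements really execute in this order:

```c
#include <stdbool.h>

/* Shared coordination variables for Peterson's solution (two threads). */
bool flag[2] = {false, false}; /* flag[i] == true: thread i wants to enter      */
int  turn = 0;                 /* which thread is currently offered priority    */

/* Entry protocol for thread 0 (thread 1 swaps every 0 and 1). */
void enter_critical_section_0(void) {
    flag[0] = true;               /* announce the intention to enter            */
    turn = 1;                     /* offer priority to the other thread         */
    while (flag[1] && turn == 1)  /* spin while thread 1 wants in and it is     */
        ;                         /* thread 1's turn                            */
}

/* Exit protocol for thread 0. */
void exit_critical_section_0(void) {
    flag[0] = false;              /* no longer interested                       */
}
```

The critical section itself goes between a call to enter_critical_section_0 and a call to exit_critical_section_0.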
The high-level reason this algorithm guarantees mutual exclusion is that for a thread to enter its critical section, its loop must break. If both threads try to enter at once, both flags will be true, and the only way for both loops to break simultaneously would be if turn were equal to 0 and 1 at the same time, which is impossible.

At this point, you may ask: how reliable is this approach? Beyond the critical section itself, the algorithm depends on shared variables to coordinate access. But since these are also shared, wouldn't they too be susceptible to race conditions? Well, let's briefly analyze how the algorithm behaves for each thread. The only way to enter the critical section is if one of the loop conditions becomes false, so we need to examine what happens if a thread is spin-waiting while the other modifies one of these shared variables.

Let's take this code apart and start with the turn variable. Checking its value is not an atomic operation. That means while thread 0 is reading turn, it could be interrupted and thread 1 might change its value. At first, this sounds dangerous, but what it really means is that thread 0 will simply compare an outdated value of turn and therefore won't realize right away that thread 1 has just given it permission to enter the critical section. The only consequence is that thread 0 will need one more iteration of its spin loop before it evaluates the updated value and proceeds into its critical section. Notice that any later interruption does not affect the synchronization, since by that time thread 1 is already stuck in its own spin loop, for two reasons: thread 0 has already raised its flag to signal its intention to enter, and thread 1 itself set turn to 0 in the previous step, giving priority to thread 0.

So the only way for thread 1 to get through is for its own while-loop condition to break, and that happens only after thread 0 finishes its critical section and resets flag[0] to false. This doesn't mean the critical section cannot be interrupted. What it means is that if the critical section is interrupted and the other thread gets scheduled to run, it will only execute instructions that are outside its own critical section, essentially just keeping the spin loop alive. Again, only when one thread exits its critical section does the other become allowed to enter.

The other variable constantly being checked inside the loop is the flag of the other thread. As mentioned before, the idea is simple: if thread 1 hasn't set its flag to true, it means either it's not ready to enter its critical section yet or it has already exited it. In either case, this flag tells thread 0 that thread 1 is not interested, so thread 0 can enter immediately without spin-waiting. But once again, the while loop might be checking the condition at the same time the other thread is modifying its flag, which can happen in two ways.

Take this line of code and see what could happen. Thread 0 fetches the value of flag[1], then gets interrupted. Thread 1 sets its flag to false. When thread 0 resumes, it's still holding the old value, thinking thread 1 is still inside the critical section. But once again, the only effect is that thread 0 will spin for one more loop iteration before realizing thread 1 has left its critical section. Not a problem.

But now consider the opposite case. Thread 0 might fetch the flag, get interrupted, and then thread 1 sets the flag to true. When thread 0 resumes, it still believes thread 1 is not interested and goes ahead into the critical section, even though that's not true anymore. But again, this is not really an issue, because the very next instruction thread 1 executes is to set turn to 0, giving priority back to thread 0 anyway.

Finally, there's the situation where both threads have expressed their intention to enter. If they both try to assign turn at the same time, the operating system scheduler will end up deciding, allowing one of them to set the value before the other. And here's the interesting part: the first thread to set turn effectively gives the other thread priority, but since the second thread immediately flips it back, the first thread is the one allowed to enter. So at the low level they're still competing. There's still a race, just not a race condition.

Any solution to the critical section problem must satisfy three requirements: mutual exclusion, progress, and bounded waiting. And if you analyze Peterson's solution, you'll find that it satisfies all three. Yet, if you take the simple counter program from earlier, protect its critical section with Peterson's algorithm, and run it, the problem persists. Race conditions keep happening. So, what's going on?

It turns out that when we write code, not only does each statement expand into multiple CPU instructions, but compilers may also reorder those instructions. Take this simple snippet. Notice that the third statement depends on the result of the first statement, but it does not depend on the result of the second statement, so the compiler is free to swap the order of the last two.
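The exact snippet isn't reproduced in the transcript, but a hypothetical example of the pattern could look like this (variable names are mine):

```c
/* A hypothetical snippet of the kind described above. */
int example(int x, int y) {
    int a = x + 1;   /* statement 1                                             */
    int b = y * 2;   /* statement 2: does not use a                             */
    int c = a + 5;   /* statement 3: depends on statement 1, not on statement 2 */
    /* Since statement 3 is independent of statement 2, the compiler may compute
     * c right after a, while a is still sitting in a register, instead of
     * storing a and reloading it from memory later. */
    return b + c;
}
```

In this toy example everything would likely stay in registers anyway, but the same freedom to reorder applies when the values live in memory, like the flag and turn variables in Peterson's algorithm.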
And you might be wondering, why would it do that? The answer is: to make the program faster. In the original order, one statement stores a result in memory, another runs, and then the third statement needs the result of the first one, so that value must be loaded from memory. But if we reorder them, the value is still in a register, so the extra memory load isn't needed. The result: fewer instructions, fewer memory operations, and better performance. This kind of rearranging is part of what's called compiler optimizations, and modern compilers do it all the time.

Now consider what happens if the first two statements in Peterson's algorithm get reordered. Thread 0 might set turn to 1 first, then get interrupted before setting its flag to true. Meanwhile, thread 1 sets turn to 0, then sets its own flag to true. From thread 1's perspective, thread 0 hasn't set its flag yet, so it assumes thread 0 isn't interested and enters the critical section without spin-waiting. If that critical section gets interrupted, thread 0 resumes and finally sets its flag to true. But since thread 1 already flipped the turn, thread 0 also skips the spin wait and enters while thread 1 is still inside. The result: both threads end up in the critical section at the same time. Mutual exclusion is lost.

You might think, okay, fine, just disable compiler optimizations so the compiler emits instructions in the exact order I wrote them. Unfortunately, the problem goes even further, because modern CPUs themselves can also reorder the instructions they read from the executable. That may sound even weirder, but it's just as common as compiler optimizations, and the reason is also performance. This time, though, it's a bit harder to explain, so here's my best attempt.

You probably know that executing an instruction fundamentally involves three stages: fetch, decode, and execute. On very early architectures, this process was strictly sequential: an instruction wasn't fetched until the previous one had finished executing, and so on. But engineers realized computers would be much faster if the different stages overlapped. So while one instruction is being decoded, the CPU can already fetch the next one. And while that next one is being decoded, the previous one is executing. This is known as CPU pipelining, and it is now the rule in CPU design, since it lets the CPU do multiple things on each clock cycle.

Modern CPUs push this idea further. Since some instructions are more complex than others and might require more clock cycles, they're broken into smaller micro-operations that can also be pipelined. Some of those micro-ops involve memory operations, and inevitably, two of them might need the memory unit on the same clock cycle. Since memory can't handle both simultaneously, one micro-op has to wait. Waiting, though, leaves parts of the CPU idle, which is wasted potential. But the CPU doesn't just handle one instruction at a time: through a feature called out-of-order execution, it can shuffle, overlap, and even temporarily reorder instructions while they're in flight. Of course, the CPU includes complex circuitry to ensure that only safe reordering happens, meaning no instruction that depends on the result of another gets executed before it. Unfortunately, unlike compiler optimizations, this feature is not optional on many architectures.

And on top of that, modern hardware adds more complexity, like multi-core systems where parallelism means threads can truly run at the same time. Think about the cache:
that fast memory inside the CPU where a portion of main memory is copied for quick access. If two threads run on different cores but share the same cache, any change to a shared variable will be visible right away when the other thread reads it. But what if those threads are on cores that don't share a cache? If one modifies a shared variable, is the change instantly replicated in the other core's cache? And if not, what happens when one thread updates the variable in its cache while the other keeps reading an outdated value from its own?

In the end, this video wasn't about teaching a brand-new trick, but about showing how high-level solutions don't always behave the way we expect once they're running down at the hardware level. We don't need to fully understand the insanely complex circuitry that makes all of this possible, but we should at least be aware of it. That way, when things don't go as expected, we have a better idea of where to start looking when debugging.

Regarding race conditions: unless we're targeting a very specific architecture and know exactly what we're doing, we can't fully trust software-only solutions. Hardware support is therefore required, in the form of atomic operations. These guarantee that neither compiler optimizations, CPU reordering, nor caching mechanisms can alter their execution and compromise correctness. This, however, is a subject we will cover in a later episode, when we study thread synchronization in detail. So make sure to subscribe, because you won't want to miss it. And if you enjoyed this video or learned something new, hit the like button. It's free, and that would help me a lot. See you in the next one.