All right, let's get started. This is 6.824, Distributed Systems. So I'd like to start with just a brief explanation of what I think a distributed system is. You know, the core of it is a set of cooperating computers that are communicating with each other over a network to get some coherent task done.
And so the kinds of examples... that we'll be focusing on in this class are things like storage for big websites or big data computations such as MapReduce, and also somewhat more exotic things like peer-to-peer file sharing. So those are all just examples of the kinds of case studies we'll look at.
And the reason why all this is important is that a lot of critical infrastructure out there is built out of distributed systems: infrastructure that requires more than one computer to get its job done, or that sort of inherently needs to be spread out physically. So, the reasons why people build this stuff. First of all, before I even talk about distributed systems, I just want to remind you that if you're designing a system and you need to solve some problem, and you can possibly solve it on a single computer without building a distributed system, you should do it that way. There's many, many jobs you can get done on a single computer, and it's always easier.
So, distributed systems: you should try everything else before you try building a distributed system, because they're not simpler. So one reason why people are driven to use lots of cooperating computers is that they need to get high performance. And the way to think about that is they want to achieve some sort of parallelism.
Lots of CPUs, lots of memories, lots of disk arms moving in parallel. Another reason why people build this stuff is to be able to tolerate faults. Have two computers do the exact same thing.
If one of them fails, you can cut over to the other one. Another is that some problems are just naturally spread out in space. Like, you know, you want to do interbank transfers of money or something. Well, you know, Bank A has this computer in New York City and Bank B has this computer in London.
You know, you just have to have some way for them to talk to each other and cooperate in order to carry that out. So there's some natural sort of physical reasons. Systems that are inherently physically distributed. And the final reason that people build this stuff is in order to achieve some sort of security goal.
So often, if there's some code you don't trust, or you need to interact with somebody who may be malicious or whose code has bugs in it, you don't want to have to trust it. You may want to split up the computation: your stuff runs over there on that computer, my stuff runs here on this computer, and they only talk to each other through some sort of narrowly defined network protocol.
So we may be worried about security, and that's achieved by splitting things up into multiple computers so that they can be isolated. Most of this course is going to be about performance and fault tolerance, although the other two often work themselves in by way of the constraints on the case studies that we're going to look at. All of these distributed systems, because they have many parts and the parts execute concurrently on multiple computers, give you all the problems that come up with concurrent programming: all the complex interactions and weird timing-dependent stuff. And that's part of what makes distributed systems hard. Another thing that makes distributed systems hard is that because, again, you have multiple pieces plus a network, you can have very unexpected failure patterns.
That is, if you have a single computer, it's usually the case that either the computer works or maybe it crashes or suffers a power failure or something. But it pretty much either works or doesn't work. In distributed systems made up of lots of computers, you can have partial failures. That is, some pieces stop working while other pieces continue working. Or maybe the computers are working but some part of the network is broken or unreliable.
So partial failure is another reason why distributed systems are hard. Those are sort of the basic challenges. And a final reason why it's hard is that the original reason to build a distributed system is often to get higher performance, to get 1,000 computers' worth of performance or 1,000 disk arms' worth of performance. But it's actually very tricky to obtain that 1,000x speedup with 1,000 computers. There's often a lot of roadblocks thrown in your way.
Okay, so it often takes a bit of careful design to make the system actually give you the performance you feel you deserve. Solving these problems, addressing these issues, is much of what this course is about.
The reason to take the course is that often the problems and the solutions are just quite technically interesting. They're hard problems. For some of these problems, there's pretty good solutions known.
For other problems, there are not such great solutions known. Distributed systems are used by a lot of real-world systems out there. Big websites often involve vast numbers of computers that are put together as distributed systems. When I first started teaching this course, distributed systems were something of an academic curiosity.
People thought, oh, at a small scale they were used sometimes, and people felt that someday they might be important. But now, particularly driven by the rise of giant websites that have vast amounts of data and entire warehouses full of computers, distributed systems in the last 20 years have gotten to be a very seriously important part of computing infrastructure. This means that there's been a lot of attention paid to them, a lot of problems have been solved, but there's still quite a few unsolved problems. So if you're a graduate student or you're interested in research, there's a lot of problems yet to be solved in distributed systems that you could look into as research. And finally, if you like building stuff, this is a good class because it has a lab sequence in which you'll construct some fairly realistic distributed systems focused on performance and fault tolerance.
So you'll get a lot of practice building distributed systems and making them work. All right. Let me talk about course structure a bit before I get started on real technical content. You should be able to find the course website using Google, and on the course website are the lab assignments and the course schedule, and also a link to a Piazza page where you can post questions and get answers.
The course staff: I'm Robert Morris, and I'll be giving the lectures. I also have four TAs; you guys want to stand up and show your faces? The TAs are experts, in particular, at solving the labs. They'll be holding office hours, so if you have questions about the labs, you should go to office hours, or you could post questions to Piazza.
The course has a couple of important components. One is lectures. There's a paper for almost every lecture. There's two exams.
There's the labs, programming labs, and there's an optional final project that you can do instead of one of the labs. The lectures will be about sort of big ideas in distributed systems. There'll also be a couple of lectures that are...
More about sort of lab programming stuff. A lot of the lectures will be taken up by case studies. A lot of the way that I try to bring out the content of distributed systems is by looking at papers, some academic, some written by people in industry, describing real solutions to real problems. These lectures will actually be videotaped, and I'm hoping to post them online, so that if you're not here or you want to review the lectures, you'll be able to look at the videotaped lectures. The papers, again, there's one to read per week. Most of them are research papers.
Some of them are classic papers, like today's paper, which I hope some of you have read, on MapReduce. It's an old paper, but it spurred an enormous amount of interesting work, both academic and in the real world. So some are classic and some are more recent papers, talking about more up-to-date research, what people are currently worried about. And from the papers, I'll be hoping to tease out what the basic problems are, and what ideas people have had that might or might not be useful in solving distributed system problems.
We'll sometimes be looking at implementation details in some of these papers, because a lot of this has to do with the actual construction of software-based systems. And we're also going to spend a certain amount of time looking at evaluations: people evaluating how fault-tolerant their systems are by measuring them, or measuring how much performance improvement they got, or whether they got any performance improvement at all.
So I'm hoping that you'll all read the papers before coming to class. The lectures are maybe not going to make as much sense if you haven't already read the paper, because there's not enough time to both explain all the content of the paper and have a sort of interesting reflection on what the paper means in one class. So you've really got to read the papers before coming to class. And hopefully one of the things you'll learn in this class is how to read a paper rapidly and efficiently, skip over the parts that maybe aren't that important, and focus on teasing out the important ideas. On the website, linked from the schedule, there's a question for every paper that you should submit an answer to.
I think the answers are due at midnight. And we also ask that you submit a question you have about the paper to the website, both to give me something to think about as I'm preparing the lecture and so that, if I have time, I can try to answer at least a few of the questions by email. The question and the answer for each paper are due at midnight the night before. There's two exams.
There's a midterm exam in class, I think on the last class meeting before spring break. And there's a final exam during final exam week at the end of the semester.
The exams are going to focus mostly on the papers and the labs. And probably the best way to prepare for them, as well as attending lecture and reading the papers, is to look at old exams. We have links to 20 years of old exams and solutions. And so you can look at those and get a feel for what kinds of questions I like to ask.
And indeed, because we read many of the same papers, inevitably I ask questions each year that can't help but resemble questions asked in previous years. The labs: there's four programming labs. The first one of them is due Friday next week.
Lab 1 is a simple MapReduce lab, in which you implement your own version of the paper that we read today and which I'll be discussing in a few minutes. Lab 2 involves using a technique called Raft which, in theory, allows any system to be made fault-tolerant by replicating it, with Raft managing the replication and the automatic cutover if one of the replicated servers fails. So that's Raft for fault tolerance. In Lab 3, you'll use your Raft implementation in order to build a fault-tolerant key-value server.
It'll be replicated and fault-tolerant. And in Lab 4, you'll take your replicated key-value server and clone it into a number of independent groups, and you'll split the data in your key-value storage system across all these individual replicated groups to get parallel speedup, by running multiple replicated groups in parallel. You'll also be responsible for moving the various chunks of data between different servers as they come and go, without dropping any balls.
So this is what's often called a sharded key-value service. Sharding refers to splitting up the data, partitioning the data among multiple servers, in order to get parallel speedup. If you want, instead of doing Lab 4, you can do a project of your own choice. And the idea here is: if you have some idea for a distributed system, in the style of some of the distributed systems we talk about in the class, and you'd like to build something and measure whether it worked in order to explore your idea, you can do a project.
And so for a project, you'll pick some teammates, because we require that projects are done in teams of two or three people. So pick some teammates and send your project idea to us, and we'll think about it and say yes or no and maybe give you some advice. If we say yes and you want to do a project, you do that instead of Lab 4, and it's due at the end of the semester. You should do some design work and build a real system, and then on the last day of class you'll demonstrate your system, as well as handing in a short written report to us about what you built. And I've posted on the website some ideas which might or might not be useful to spur thoughts about what projects you might build.
But really the best projects are ones where you have a good idea of your own for the project. And the idea is, if you want to do a project, you should choose an idea that's sort of in the same vein as the systems that we talk about in this class. Okay, back to labs. The lab grades: you hand in your lab code, we run some tests against it, and you're graded based on how many tests you pass.
We give you all the tests that we use, so there's no hidden tests. So if you finish the lab and it reliably passes all the tests, then chances are good, unless there's something funny going on, which there sometimes is, chances are good that if your code passes all the tests when you run it, it'll pass all the tests when we run it and you'll get a full score. So hopefully there'll be no mystery about what score you're likely to get
on the labs. Let me warn you that debugging these labs can be time consuming. Because they're distributed systems and have a lot of concurrency and communication, sort of strange, difficult to debug errors can crop up.
So you really ought to start the labs early. Don't leave them, you'll have a lot of trouble if you leave the labs to the last moment. You gotta start early.
If you have problems, please come to the TAs' office hours, and please feel free to ask questions about the labs on Piazza. And indeed, I hope that if you know the answer, you'll answer other people's questions on Piazza as well. All right, any questions about the mechanics of the course?
Yes? So the question is, how do these things factor into the grade? I forget, but it's all on the website under something. I think the labs are the single most important component. Okay.
All right, so this is a course about infrastructure for applications. And so all through this course, there's going to be a sort of split in the way I talk about things, between applications, which somebody else, the customer, writes, but which are going to use the infrastructure that we're thinking about in this course. And the kinds of infrastructure that tend to come up a lot are storage, communication, and computation.
And we'll talk about systems that provide all three of these kinds of infrastructure. Storage turns out to be the one we focus most on, because it's a very well-defined, useful, and usually fairly straightforward abstraction. So people know a lot about how to use and build storage systems, and how to build sort of replicated, fault-tolerant, high-performance distributed implementations of storage.
We'll also talk about some computation systems; MapReduce, for today, is a computation system. And we will talk about communication some, but mostly from the point of view that it's a tool we need to use to build distributed systems: the computers have to talk to each other over a network, maybe you need reliability or something, and so we'll talk a bit about that, but we're actually mostly consumers of communication. If you want to learn about how communication systems themselves work, that's more the topic of 6.829. So for storage and computation, a lot of our goal is to be able to discover abstractions, ways of simplifying the interface to distributed storage and computation infrastructure, so that it's easy to build applications on top of it. And what that really means is that we'd like to be able to build abstractions that hide the distributed nature
of these systems. So the dream, which is rarely fully achieved, but the dream would be to be able to build an interface that looks to an application as if it's a non-distributed storage system, just like a file system or something that everybody already knows how to program and that has a pretty simple model and semantics. We'd love to be able to build interfaces that look and act just like non-distributed storage and computation systems, but are actually vast, extremely high-performance, fault-tolerant, distributed systems underneath.
So we'd love to have abstractions. And, you know, as you'll see as the course goes on, we're sort of only part of the way there. It's rare that you find an abstraction for a distributed version of storage or computation that has simple behavior, that behaves just like
the non-distributed version of storage that everybody understands. But people are getting better at this, and we're going to try to study the ways, and what people have learned, about building such abstractions. Okay, so what kinds of topics show up as we're considering these abstractions? The first general topic, one that we'll see in a lot of the systems we look at, has to do with implementation.
So, for example, the kinds of tools that you see a lot, that people have learned to use to build these systems, are things like remote procedure call, whose goal is to mask the fact that we're communicating over an unreliable network. Another kind of implementation topic that we'll see a lot is threads, which are a programming technique that allows us to harness multi-core computers, but maybe more importantly for this class, threads are a way of structuring concurrent operations that hopefully simplifies the programmer's view of those concurrent operations. And because we're going to use threads a lot, it turns out we're also going to need, at the implementation level, to spend a certain amount of time thinking about concurrency control, things like locks.
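Just to make those two ideas concrete, here's a tiny sketch in Go of threads (goroutines) sharing state under a lock; this is only an illustration I'm adding, not code from any paper:

```go
package main

import (
	"fmt"
	"sync"
)

func main() {
	var mu sync.Mutex
	var wg sync.WaitGroup
	count := 0

	// Launch ten concurrent threads that all update shared state.
	for i := 0; i < 10; i++ {
		wg.Add(1)
		go func() {
			defer wg.Done()
			// The lock keeps the concurrent increments from interfering.
			mu.Lock()
			count++
			mu.Unlock()
		}()
	}

	wg.Wait()
	fmt.Println("count =", count) // reliably prints 10
}
```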
And the main place that these implementation ideas will come up in the class: they'll be touched on in many of the papers, but you're going to come face to face with all of this in a big way in the labs, where you'll do the programming for distributed systems. Beyond just ordinary programming, these are some of the critical tools that you'll need to use to build distributed systems. Another big topic that comes up in all the papers we're going to talk about is performance.
Usually the high-level goal of building a distributed system is to get what people call scalable speedup. So we're going to be looking for scalability. And what I mean by scalability, or scalable speedup, is that if I have some problem that I'm solving with one computer, and I buy a second computer to help me, and I can now solve the problem in half the time, or maybe solve twice as many problem instances per minute on two computers as I did on one, then...
That's an example of scalability. So, two times the computers or resources gets me two times the performance or throughput. And this is a huge hammer. If you can build a system that actually has this behavior, namely that if you increase the number of computers you throw at the problem by some factor, you get that factor more throughput or performance out of the system, that's a huge win.
Because you can buy computers with just money, right? Whereas the alternative, in order to get more performance, is that you have to pay programmers to restructure your software to make it more efficient, or to apply some sort of specialized techniques, better algorithms or something. If you have to pay programmers to fix your code to be faster, that's an expensive way to go. We'd love to be able to just, oh, buy a thousand computers instead of ten computers and get a hundred times more throughput.
That's fantastic. And so this sort of scalability idea is a huge idea in the backs of people's heads when they're building things like big websites that run on a building full of computers. The building full of computers is there to get a corresponding amount of performance, but you have to be careful about the design in order to actually get that performance. So often the way this looks, when we're drawing diagrams in this course, is that, let's suppose we're building a website.
Ordinarily you might have a website that has some users running web browsers, and they talk to a web server running Python or PHP or whatever sort of web server, and the web server talks to some kind of database. When you have one or two users, you can just have one computer running both, or maybe a computer for the web server and a computer for the database. But maybe all of a sudden you get really popular and a hundred million people sign up for your service. How do you, you know, how do you fix your...
You certainly can't support millions of people on a single computer, except by extremely careful, labor-intensive optimization, which you don't have time for. So, typically, the way you're going to speed things up, the first thing you do, is buy more web servers and just split the users, so that half your users, or some fraction of the users, go to web server one and the other half you send to web server two. And because maybe you're building, I don't know, Reddit or something, where all the users need to see the same data ultimately, you have all the web servers talk to the same back end. And maybe you can keep on adding web servers for a long time here. And so this is a way of getting parallel speedup on the web server code.
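As a toy illustration of that "split the users" step, it could look something like this in Go; the server names are purely hypothetical, and hashing by user ID is just one way to split:

```go
package main

import (
	"fmt"
	"hash/fnv"
)

// webServers is a hypothetical list of front-end web server addresses.
var webServers = []string{"web1.example.com", "web2.example.com"}

// pickServer routes a user to one of the web servers by hashing the
// user's ID, so each server handles roughly an equal share of users.
func pickServer(userID string) string {
	h := fnv.New32a()
	h.Write([]byte(userID))
	return webServers[h.Sum32()%uint32(len(webServers))]
}

func main() {
	fmt.Println(pickServer("alice")) // always the same server for alice
	fmt.Println(pickServer("bob"))
}
```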
If you're running PHP or Python, maybe it's not too efficient. As long as each individual web server doesn't put too much load on the database, you can add a lot of web servers before you run into problems. But this kind of scalability is rarely infinite, unfortunately. Certainly not without serious thought.
And so what tends to happen with these systems is that at some point after you have 10 or 20 or 100 web servers all talking to the same database, now all of a sudden the database starts to be a bottleneck. And adding more web servers no longer helps. So it's rare that you get full scalability to sort of...
adding infinite numbers of computers. At some point you run out of gas, because the place at which you are adding more computers is no longer the bottleneck. By having lots and lots of web servers we've basically moved the bottleneck, the thing that's limiting performance, from the web servers to the database.
And at this point you almost certainly have to do a bit of design work, because it's rare that there's any straightforward way to take a single database and refactor it. You can take the data stored in a single database and refactor it so it's split over multiple databases, but it's often a fair amount of work, and it's awkward. But many people actually need to do this. We're going to see a lot of examples in this course in which the distributed system people are talking about
is a storage system, because the authors were running something like a big website that ran out of gas on a single database or storage server. Anyway, the scalability story is that we'd love to build systems that scale this way, but it's hard, or it often takes design work, to push this idea infinitely far. Okay, so another big topic that comes up a lot is fault tolerance. If you're building a system with a single computer in it, well, a single computer often can stay up for years.
Like, I have servers in my office that have been up for years without crashing. You know, the computer's pretty reliable, the operating system's pretty reliable. Apparently the power in my building is pretty reliable. So it's not uncommon to have single computers that just stay up for amazing amounts of time. However...
If you're building systems out of thousands of computers, then even if each computer can be expected to stay up for a year, with a thousand computers that means you're going to have about three computer failures per day in your set of a thousand computers (a thousand machines failing about once a year each is about a thousand failures per year, and a thousand divided by 365 days is roughly three a day). So solving big problems with big distributed systems turns very rare failure problems into failure problems that happen all the time. In a system with a thousand computers, there's almost certainly always something broken.
There's always some computer that's either crashed or mysteriously, you know, running incorrectly or slowly or doing the wrong thing. Or maybe there's some piece of the network. With a thousand computers, we've got a lot of network cables and a lot of network switches.
And so, you know, there's always some network cable that somebody stepped on and is now unreliable, or a network cable that fell out, or some network switch whose fan is broken so the switch overheated and failed. There's always some little problem somewhere in your building-sized distributed system. So big scale turns problems from very rare events, that you really don't have to worry about that much, into just constant problems. That means the response to failures, the masking of failures, the ability to proceed despite failures, just has to be built into the design, because there are always failures. And, as part of building convenient abstractions for application programmers, we really need to be able to build infrastructure that, as much as possible, hides the failures from application programmers, or masks them or something, so that every application programmer doesn't have to have a complicated story for all the different kinds of failures that can occur.
There's a bunch of different notions that you can have about what it means to be fault tolerant, so let me say a little more about exactly what we mean by that. And we'll see a lot of different flavors, but among the more common ideas you see, one is availability.
So some systems are designed so that under certain kinds of failures, not all failures, but certain kinds of failures, the system will keep operating despite the failure, while providing undamaged service, the same kind of service it would have provided even if there had been no failure. So some systems are available in that sense. For example, if you build a replicated service that has two copies, and one of the replica servers fails, maybe the other server can continue operating. If they both fail, of course, you can't.
You know, you can't promise availability in that case. So available systems usually say, well, under a certain set of failures, we're going to continue providing service. We're going to be available. If more failures than that occur, it won't be available anymore.
Another kind of fault tolerance you might have, or in addition to availability or by itself, is recoverability. And what this means is that if something goes wrong, maybe the service will stop working. That is, it'll simply stop responding to requests, and it'll wait for someone to come along and repair whatever went wrong, but after the repair occurs, the system will be able to continue as if nothing bad had gone wrong.
So this is sort of a weaker requirement than availability, because here we're not going to do anything until the failed component has been repaired. But the fact that we can get up... get going again without, you know, without any loss of correctness is still a significant requirement. It means, you know, recoverable systems typically need to do things like save their latest date on disk or something where they can get it back, you know, after the power comes back up. And even among available systems, in order for a system to be useful in real life, usually what the way available systems are specced is that They're available until some number of failures have happened.
If too many failures have happened, an available system will stop working, or will stop responding at all, but when enough things have been repaired, it'll continue operating. So a good available system will sort of be recoverable as well, in the sense that if too many failures occur, it'll stop answering, but then will continue correctly after that. So this is what we'd love to obtain.
We'll see a number of approaches to solving these problems, but there are really two things that are the biggest hammers, the most important tools we have in this department. One is non-volatile storage, so that if something crashes, the power fails, or whatever, maybe there's a building-wide power failure, we can use non-volatile storage, like hard drives or flash or solid-state drives, to store a checkpoint or a log of the state of the system, and then when the power comes back up, or...
Somebody repairs our power supply, who knows what, we'll be able to read our latest state off the hard drive and continue from there. So one tool is sort of non-volatile storage. And the management of non-volatile storage is something that comes up a lot, because non-volatile storage tends to be expensive to update.
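As a concrete, very simplified illustration of the kind of thing I mean, here's a sketch in Go of appending an update to a log and forcing it to disk; the file name and record format are made up for illustration, not taken from any real system:

```go
package main

import (
	"fmt"
	"os"
)

// appendToLog durably records one update by appending it to a log file
// and forcing it to non-volatile storage with Sync. After a crash and
// reboot, the system can replay this log to recover its state.
func appendToLog(record string) error {
	f, err := os.OpenFile("state.log", os.O_APPEND|os.O_CREATE|os.O_WRONLY, 0644)
	if err != nil {
		return err
	}
	defer f.Close()
	if _, err := fmt.Fprintln(f, record); err != nil {
		return err
	}
	// Sync is the expensive part: it waits for the disk or SSD to
	// actually persist the bytes, which is why real systems work so
	// hard to batch or avoid these writes.
	return f.Sync()
}

func main() {
	if err := appendToLog("put key=1 value=21"); err != nil {
		fmt.Println("log write failed:", err)
	}
}
```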
And so a huge amount of the sort of nitty-gritty of building sort of high-performance fault-tolerant systems is in, you know, clever ways to avoid having to write the non-volatile storage too much. In the old days, and even today, what writing non-volatile storage meant was moving a disk arm and waiting for a disk platter to rotate, both of which are agonizingly slow on the scale of three gigahertz microprocessors. With things like Flash, life's quite a bit better, but still requires a lot of thought to get good performance out of it. And the other big tool we have for fault tolerance is replication. And the management of replicated copies is sort of tricky.
You know, the key problem lurking in any replicated system, where we have two servers each with a supposedly identical copy of the system state, is always that the two replicas will accidentally drift out of sync and will stop being replicas. And this is just, you know, at the back of every design that we're going to see for using replication to get fault tolerance.
And Lab 2 and Lab 3 are all about management of replicated copies for fault tolerance. As you'll see, it's pretty complex. A final cross-cutting topic is consistency. So as an example of what I mean by consistency, suppose we're building a distributed storage system, and it's a key-value service, so it just supports two operations.
Maybe there's a put operation, and you give it a key and a value, and the storage system stashes the value away as the value for this key, so it maintains just a big table of keys and values. And then there's a get operation. Your client sends it a key, and the storage service is supposed to respond with the value it has stored for that key.
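If you wrote that interface down in Go, a minimal, purely hypothetical version might look like this; the point is just that there are only these two operations:

```go
package kv

// KVService is a hypothetical interface for the key/value service
// described above: a big table mapping keys to values.
type KVService interface {
	// Put stores value as the value for key.
	Put(key, value string)
	// Get returns the value the service has stored for key,
	// and whether the key was present at all.
	Get(key string) (value string, ok bool)
}
```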
And this is kind of the... When I can't think of anything else as an example of a distributed system, I'll whip out key value services. And they're very useful, right?
They're just sort of the kind of fundamental, simple version of a storage system. So... Of course, if you're an application programmer, it's helpful if these two operations kind of have meanings attached to them.
That you can go look in the manual, and the manual says what it means, what you'll get back if you call get. And sort of what it means for you to call put. So it would be great if there was some sort of spec for what they meant. Otherwise, who knows? How can you possibly write an application without a description of what...
put and get are supposed to do. And this is the topic of consistency. And the reason why it's interesting in distributed systems is that, both for performance and for fault tolerance reasons, we often have more than one copy of the data floating around. So, you know, in a non-distributed system, where you just have a single server with a single table, there's often, although not always, relatively little ambiguity about what put and get could possibly mean, right?
Intuitively, what put means is update the table, and what get means is just get me the version that's stored in the table. But in a distributed system, where there's more than one copy of the data due to replication or caching or who knows what, there may be lots of different versions of this key-value pair floating around. Supposing there's two copies of the server, so they both have a key-value table, and maybe key 1 has value 20 on both of them, and then some client issues a put.
So we have a client over here. It's going to send a put. It wants to update the value of 1 to be 21. Maybe it's counting stuff in this key value server.
So it sends a put, key 1, value 21. It sends it to the first server. And it's about to send the same put, because it wants to update both copies to keep them in sync. It's about to send this put.
But just before it sends the put to the second server, it crashes: power failure, bug in the operating system, or something. So now the state we're left in, sadly, is that we sent this put, and so we've updated one of the two replicas to a value of 21, but the other one still has 20. Now if somebody comes along and reads with a get, wanting the value associated with key 1, they might get 21, or they might get 20, depending on who they talk to. And even if the rule is you always talk to the top server first, if you're building a fault-tolerant system, the actual rule has to be: you talk to the top server first,
unless it's failed, in which case you talk to the bottom server. So either way, someday you risk exposing this stale copy of the data to some future get. And it could be that many gets get the updated 21, and then, like, next week, all of a sudden, some get yields, you know, a week-old copy of the data. So that's not very consistent, right?
But, you know, it's the kind of thing that could happen if we're not careful, right? So we need to actually write down what the rules are going to be about puts and gets, given this danger due to replication. And it turns out there's many different definitions you can have of consistency.
Many of them are relatively straightforward. Many of them sound like, well, a get yields the value put by the most recently completed put. So that's usually called strong consistency.
It turns out it's also very useful to build systems that have much weaker consistency, ones that, for example, do not guarantee anything like a get seeing the value written by the most recent put. So there's strongly consistent systems, which usually have some version of gets seeing the most recent puts.
Although there's a lot of details to work out. There are also many flavors of weakly consistent systems that do not make any such guarantee: if someone does a put, you may not see the put.
You may see old values, that weren't updated by the put, for an unbounded amount of time, maybe. And the reason for people being very interested in weak consistency schemes is that strong consistency, that is, having reads actually be guaranteed to see the most recent write, is a very expensive spec to implement.
Because what it means is almost certainly that somebody has to do a lot of communication in order to actually implement some notion of strong consistency. If you have multiple copies, it means that either the writer or the reader, or maybe both, has to consult every copy. Like in this case.
Where, you know, maybe a client crash left one copy updated but not the other. If we wanted to implement strong consistency in maybe a simple way here, we'd have readers read both of the copies, or if there's more than one copy, all the copies, and use the most recently written value that they find.
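Here's a toy sketch in Go of that "read every copy, take the newest" idea, where each replica tags its value with a version number. All the names are hypothetical, and real protocols are much more careful than this; it's just to show the shape of it:

```go
package kv

// Versioned is a value tagged with a version number, so a reader can
// tell which replica holds the most recent write.
type Versioned struct {
	Value   string
	Version int64
}

// Replica is a hypothetical handle for asking one replica for its
// current copy of a key.
type Replica interface {
	Get(key string) Versioned
}

// readLatest asks every replica and returns the newest value it sees.
// That gives a strong result for this simple case, but at the cost of
// talking to every copy on every read.
func readLatest(replicas []Replica, key string) Versioned {
	var best Versioned
	for _, r := range replicas {
		if v := r.Get(key); v.Version > best.Version {
			best = v
		}
	}
	return best
}
```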
But that's expensive. That's a lot of chit-chat to read one value. So in order to avoid communication as much as possible, particularly if replicas are far away, people build weak systems that might actually allow the stale read of an old value in this case. Although there's often more semantics attached to that to try to make these weak schemes more useful. And where this communication problem, you know, strong consistency requiring expensive communication, where this really runs you into trouble is that if we're using replication for fault tolerance, then we really want the replicas to have independent failure probability, to have uncorrelated failure.
So, for example, putting... both of the replicas of our data in the same rack, in the same machine room, is probably a really bad idea. Because if someone trips over the power cable to that rack, both of our copies of our data are gonna die, because they're both attached to the same power cable in the same rack.
So in the search for making replicas as independent in failure as possible, in order to get decent fault tolerance, people would love to put different replicas as far apart as possible, like in different cities, or maybe on opposite sides of the continent. So an earthquake that destroys one data center will be extremely unlikely to also destroy the other data center that has the other copy. You know, so we'd love to be able to do that. If you do that, then the other copy is thousands of miles away, and the rate at which light travels means that it may take on the order of milliseconds or tens of milliseconds to communicate to a data center across the continent in order to update the other copy of the data.
And so that makes the communication required for strong consistency, for good consistency, potentially extremely expensive. Every time you want to do one of these put operations, or maybe a get, depending on how you implement it, you might have to sit there waiting for 10 or 20 or 30 milliseconds in order to talk to both copies of the data, to ensure that they're both updated or both checked to find the latest copy. And that's a tremendous expense, right?
This is 10 or 20 or 30 milliseconds on machines that, after all, will execute, like, a billion instructions per second. So we're wasting a lot of potential instructions while we wait. People often build much weaker systems.
You're allowed to only update the nearest copy, or only consult the nearest copy. I mean, there's a huge amount of academic and real-world research on how to structure weak consistency guarantees so they're actually useful to applications, and how to take advantage of them in order to actually get high performance.
All right, so that's a lightning sort of preview of the technical ideas in the course. Any questions about this before I start talking about MapReduce? All right, I want to switch to MapReduce. Sort of detailed case study that's actually going to illustrate most of the ideas that we've been talking about here.
MapReduce is a system that was originally designed and built and used by Google. I think the paper dates back to 2004. The problem they were faced with was that they were running huge computations on terabytes and terabytes of data. Creating an index of all of the content of the web or analyzing the link structure of the entire web in order to identify the most important pages or the most authoritative pages.
The whole web is, even in those days, tens of terabytes of data. Building an index of the web is basically equivalent to a sort, running a sort over the entire data. And sorting, you know, is reasonably expensive.
and to run a sort on the entire content of the web on a single computer, I don't know how long it would have taken, but weeks or months or years or something. So Google at the time was desperate to be able to run giant computations on giant data on thousands of computers in order that the computations could finish rapidly. It was worth it to them to buy lots of computers so that their engineers wouldn't have to spend a lot of time reading the newspaper or something waiting for their big compute jobs to finish.
And so for a while they had their clever engineers hand-write that software. If you needed to write a web indexer or some sort of web link analysis tool, well, Google bought the computers and said, here, engineers, run whatever software you like on these computers. And they would laboriously write the sort of one-off, manually written software to take whatever problem they were working on, somehow farm it out to a lot of computers, organize that computation, and get the data back.
If you only hire engineers who are skilled distributed systems experts, maybe that's okay, although even then it's... probably very wasteful of engineering effort. But they wanted to hire people who were skilled at something else and not necessarily engineers who wanted to spend all their time writing distributed system software.
So they really needed some kind of framework that would make it easy for their engineers to write just the guts of whatever analysis they wanted to do, like a sort algorithm or a web indexer or a link analyzer or whatever, just write the guts of that application, and be able to run it on thousands of computers without worrying about the details of how to spread the work over the thousands of computers, how to organize whatever data movement was required, or how to cope with the inevitable failures. So they were looking for a framework that would make it easy for non-specialists to be able to write and run giant distributed computations.
And so that's what MapReduce is all about. And the idea is that the programmer, the application designer, the consumer of this distributed computation, should just be able to write a simple map function and a simple reduce function that don't know anything about distribution, and the MapReduce framework would take care of everything else.
So an abstract view of what MapReduce is up to is it starts by assuming that there's some input. And the input is split up into a whole bunch of different files or chunks in some way. So we're imagining that we have input file 1, input file 2, etc.
These inputs are maybe web pages crawled from the web, or more likely big files, each of which contains many web pages crawled from the web. All right. And the way MapReduce starts is that you define a map function, and the MapReduce framework is going to run your map function on each of the input files.
And of course, you can see here there's some obvious parallelism available. You can run the maps in parallel. So each of these map functions only looks at its input and produces output. The output that a map function is required to produce is a list.
It takes a file as input, and the file is some fraction of the input data, and it produces a list of key value pairs as output, the map function. And so, for example, let's suppose we're writing the simplest possible map-reduce example, a word count map-reduce job, whose goal is to count the number of occurrences of each word. So your map function might emit...
Key value pairs where the key is the word and the value is just one. So for every word it sees, so this map function will split the input up into words. For every word it sees, it emits that word as the key and one as the value.
And then later on, we'll count up all those ones in order to get the final output. So, you know, maybe input one has the word A in it and the word B in it. And so the output map is going to produce is...
Key A value 1, key B value 1. Maybe this second map indication sees a file that has a B in it and nothing else. So it's going to implement output B1. Maybe this third input has an A in it and a C in it.
All right, so we run all these maps on all the input files. And we get this intermediate, what the paper calls intermediate output, which is, for every map, a set of key value pairs as output.
Then the second stage of the computation is to run the reduces. And the idea is that the MapReduce framework collects together all instances, from all maps, of each key. So the MapReduce framework is going to collect together all of the A's from every map, every key value pair whose key was A. It's going to collect them all and hand them to one call of the programmer-defined reduce function. And then it's going to take all the B's and collect them together.
Of course, it requires real collection because different instances of key B were produced by different invocations of map on different computers. So we're now talking about data movement. So we're going to collect all the B keys and hand them to a different call to reduce that has all of the B keys as its arguments.
And same with C, all the C's. So the MapReduce framework will arrange for one call to reduce for every key that occurred in any of the map output. And, you know, for our sort of silly word count example, all any one of these reduces has to do is just count the number of items passed to it. It doesn't even have to look at the items, because it knows that each of them is the word it's responsible for, with one as the value. We don't have to look at those ones, we just count them. So this reduce is going to produce A and then the count of its inputs, which is two. This reduce is going to produce the key associated with it and then the count of its values, which is also two. So this is what a typical MapReduce job looks like at a high level.
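So, written out, the little example I just drew flows like this (same inputs as on the board):

```
map(input 1: "a b") -> (a,1) (b,1)
map(input 2: "b")   -> (b,1)
map(input 3: "a c") -> (a,1) (c,1)

group by key        -> a: [1,1]   b: [1,1]   c: [1]

reduce(a, [1,1]) -> 2
reduce(b, [1,1]) -> 2
reduce(c, [1])   -> 1
```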
Just for completeness, a little bit of terminology. The whole computation is called a job. Any one invocation of map or reduce is called a task. So we have the entire job and it's made up of a bunch of map tasks and then a bunch of reduce tasks. So an example for this word count, what the map and reduce functions would look like.
The map function takes a key and a value as arguments. And now we're talking about functions like written in an ordinary programming language, like C++ or Java or who knows what. So this is just code people, ordinary people can write.
What a map function for word count would do: the key is the file name, which typically is ignored; we don't really care what the file name was. And V is the content of this map's input file, so V just contains all this text.
We're going to split V into words. And then for each word, we're just going to emit. And emit takes two arguments. Emit's a call only map can make. Emit is provided by the MapReduce framework.
We hand emit a key, which is the word, and a value, which is the string "1". So that's it for the map function. And a word count map function in MapReduce literally could be this simple.
And this map function doesn't know anything about distribution, or multiple computers, or the fact that we need to move data across the network, or who knows what. This is extremely straightforward. And the reduce function for word count: remember, each reduce is called with sort of all the instances of a given key. The MapReduce framework calls reduce with the key that it's responsible for, and a vector of all the values that the maps produced associated with that key.
The key is the word, the values are all ones, we don't really care about them, we only care about how many there were. And so, reduce has its own emit function that just takes a value to be emitted as the final output, as the value for this key. So we're gonna emit the length of this array. So this is also about as simple as reduce functions are in MapReduce, namely extremely simple and requiring no knowledge about fault tolerance or anything else.
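To make that concrete, here's roughly what those two functions could look like written out in Go. The KeyValue type is a stand-in for whatever the framework provides, and instead of calling a framework-supplied emit, this sketch just returns the pairs; it's an illustration, not the paper's actual code:

```go
package main

import (
	"fmt"
	"strconv"
	"strings"
	"unicode"
)

// KeyValue is a stand-in for the intermediate key/value pairs that the
// framework shuttles from map invocations to reduce invocations.
type KeyValue struct {
	Key   string
	Value string
}

// Map is the word-count map function: split the input file's contents
// into words and emit (word, "1") for each one. The file name is ignored.
func Map(filename string, contents string) []KeyValue {
	words := strings.FieldsFunc(contents, func(r rune) bool {
		return !unicode.IsLetter(r)
	})
	kva := []KeyValue{}
	for _, w := range words {
		kva = append(kva, KeyValue{Key: w, Value: "1"})
	}
	return kva
}

// Reduce is the word-count reduce function: it gets one word plus all
// of the "1" values the maps emitted for it, and just counts them.
func Reduce(key string, values []string) string {
	return strconv.Itoa(len(values))
}

func main() {
	fmt.Println(Map("input1", "a b"))            // [{a 1} {b 1}]
	fmt.Println(Reduce("a", []string{"1", "1"})) // 2
}
```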
All right, any questions about the basic framework? Yes? You mean, can you feed the output of the reduces into another MapReduce? Oh yes, oh yes, in real life, all right, in real life...
It was routine among MapReduce users to define a MapReduce job that took some inputs and produced some outputs and then have a second MapReduce job. You know, if you're doing some very complicated multi-stage analysis or iterative algorithm, like PageRank, for example, which is the algorithm Google uses to sort of estimate how important or influential different web pages are. That's an iterative algorithm that sort of...
Gradually converges on an answer, and if you implement it in MapReduce, which I think they originally did, you have to run the MapReduce job multiple times, and the output of each one is sort of a list of web pages with an updated value or weight or importance for each web page. So it was routine to take this output and then use it as the input to another MapReduce job. Well, yeah, you need to sort of set things up: you need to write the reduce function in the knowledge that, oh, I need to produce data that's in the format, or has the information, required by the next MapReduce job.
I mean, this actually brings up a little bit of a shortcoming in the MapReduce framework, which is that it's great if the algorithm you need to run is easily expressible as a map, followed by this sort of shuffling of the data by key, followed by a reduce, and that's it. MapReduce is fantastic for algorithms that can be cast in that form.
And furthermore, each of the maps has to be completely independent. The maps are required to be functional, pure functional functions that just look at their arguments and nothing else. You know, it's like, it's a restriction. And it turns out that many people want to run much longer pipelines that involve lots and lots of different kinds of processing. And with MapReduce, you have to sort of cobble that together from multiple MapReduce, distinct MapReduce jobs.
And more advanced systems, which we'll talk about later in the course, are much better at allowing you to specify the complete pipeline of computations. And they'll do optimization: the framework sees all the stuff you have to do, and can organize and efficiently optimize much more complicated computations. Yes, question?
From the programmer's point of view, it's just about map and reduce. From our point of view, it's going to be about the worker processes and the worker servers that are part of the MapReduce framework and that, among many other things, call the map and reduce functions. So, yeah, from our point of view, we care a lot about how this is organized by the surrounding framework.
This is sort of the programmer's view, with all the distributed stuff stripped out. Yes? Sorry, you've got to say it again. Sorry, do you emit locally?
Oh, you mean where does the emit data go? Okay, so there's two questions. One is, when you call emit, what happens to the data?
And the other is, where are the functions run? Let me answer the second one first: where does this stuff run? There's a number of servers, say a thousand servers.
Actually the right thing to look at here is figure one in the paper. Sitting underneath this in the real world there's some big collection of servers. And we'll call them maybe worker servers or workers. And there's also a single master server that's organizing the whole computation.
And what's going on here is the master server knows that there's some number of input files, you know, 5,000 input files, and it farms out invocations of map to the different workers. So it'll send a message to worker 7 saying, please run this map function on such-and-such an input file. And then the worker process, which is part of MapReduce and knows all about MapReduce, will read whichever input file that is and call this map function with the file name and the file content as arguments. Then that worker process
is what implements emit, and every time the map calls emit, the worker process will write this data to files on the local disk. So what happens to map emits is they produce files on the map worker's local disk, accumulating all the keys and values produced by the maps run on that worker. So at the end of the map phase, what we're left with is all those worker machines, each of which has the output of whatever maps were run on that worker machine.
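In the labs you'll do something similar. One common scheme, which I'll show here purely as an illustration rather than as Google's exact mechanism, is to hash each intermediate key into one of the reduce "buckets", so every map worker can write one local file per reduce task:

```go
package mr

import "hash/fnv"

// reduceBucket decides which reduce task an intermediate key belongs to.
// Every map worker applies the same function, so all instances of a given
// key, no matter which worker produced them, are destined for the same
// reduce task. nReduce is the number of reduce tasks in the job.
func reduceBucket(key string, nReduce int) int {
	h := fnv.New32a()
	h.Write([]byte(key))
	return int(h.Sum32()&0x7fffffff) % nReduce
}
```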
Then the MapReduce workers arrange to move the data to where it's going to be needed for the reduces. In a typical big computation, this reduce invocation is going to need all the map output that mentioned the key A. This is a sort of simple example, but in general, every single map invocation will probably produce lots of keys, including some instances of key A. So typically, before we can even run this reduce function, the MapReduce framework, that is, the MapReduce worker running on one of our thousand servers, is going to have to go talk to every single one of the other thousand servers and say: look, I'm going to run the reduce for key A. Please look at the intermediate map output stored on your disk, fish out all of the instances of key A, and send them over the network to me.
So the reduce worker is going to do that. It's going to fetch, from every worker, all of the instances of the key that it's responsible for, that the master has told it to be responsible for. And once it's collected all of that data, then it can call reduce. And the reduce function itself calls reduce's emit, which is different from the map's emit.
And what reduce's emit does is write the output to a file in a cluster file service that Google uses. So here's something I haven't mentioned: I haven't mentioned where the input lives and where the output lives.
They're both files. Because we want the flexibility to be able to read any piece of input on any worker server, we need some kind of network file system to store the input data. And so indeed, the paper talks about this thing called GFS, for Google File System. And GFS is a cluster file system, and GFS actually runs on exactly the same set of workers,
that is, the same worker servers that run MapReduce. And GFS, it's a file system, you can read and write files, and it just automatically splits up any big file you store on it across lots of servers in 64 megabyte chunks. So if you have 10 terabytes of crawled web page contents and you just write them to GFS, even as a single big file, GFS will automatically split that vast amount of data up into 64 megabyte chunks distributed evenly over all of the GFS servers, which is to say all the servers that Google has available. And that's fantastic; that's just what we need.
If we then want to run a MapReduce job that takes the entire crawled web as input, the data is already stored in a way that's split up evenly across all the servers. And so that means that if we have a thousand servers, we're going to launch 1,000 map workers, each reading one one-thousandth of the input data, and they're going to be able to read the data in parallel from 1,000 GFS file servers, thus getting a tremendous total read throughput, the read throughput of 1,000 servers. So are you thinking maybe that Google has one set of physical machines that run GFS and a separate set of physical machines that run MapReduce jobs?
So you might have to read from a different machine. OK. They're not necessarily on the same machine.
Right. So the question is, what does this arrow here actually involve? And the answer to that actually sort of changed over the years as Google evolved the system. But in the most general case, we have big files stored in some big network file system. GFS is a bit like the AFS you might have used on Athena: your data is split over a big collection of servers, and you have to go talk to those servers over the network to retrieve your data.
In that case, what this arrow might represent is the MapReduce worker process having to go off and talk across the network to the correct GFS server, or maybe servers, that store its part of the input, and fetch it over the network to the MapReduce worker machine in order to pass it to the map. That's certainly the most general case, and that was eventually how MapReduce actually worked. Not in the world of this paper, though. If you did that, that's a lot of network communication: we're talking about 10 terabytes of data, so we'd have to move 10 terabytes across their data center network. Data center networks run at gigabits per second, but that's still a lot of time to move tens of terabytes of data. And indeed, in the world of this paper, in 2004, the most constraining bottleneck in their MapReduce system was network throughput, because of the network they were running on. If you read as far as the evaluation section: they had thousands of machines, and they would plug each rack of machines into an Ethernet switch for that rack.
But then they all need to talk to each other, and there was a single root Ethernet switch that all of the rack Ethernet switches talked to. So if you just pick some MapReduce worker and some GFS server, chances are, at least half the time, the communication between them has to pass through this one root switch.
The root switch had only some amount of total throughput, some number of gigabits per second; I forget the exact number.
But when I did the division, that is, divided the total throughput available in the root switch by the roughly 2,000 servers they used in the paper's experiments, what I got was that each machine's share of the root switch, of the total network capacity, was only 50 megabits per second in their setup. So, 50 megabits per second per machine. And that might seem like a lot (50 megabits, gosh, millions and millions), but it's actually quite small compared to how fast, say, disks run or CPUs run.
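Here's that back-of-the-envelope arithmetic written out; the 2,000-server and 50-megabit figures come from the lecture, the roughly 100 gigabits per second of root capacity is just what those two numbers imply, and the 10-terabyte transfer is the running example.

```go
package main

import "fmt"

func main() {
	const servers = 2000        // roughly what the paper's experiments used
	const perMachineBits = 50e6 // 50 Mbit/s share of the root switch, per machine

	rootBits := perMachineBits * servers // implied root capacity: ~100 Gbit/s
	perMachineBytes := perMachineBits / 8 // 6.25 MB/s per machine

	// Moving a 10 TB input across the root switch, even just once:
	const inputBytes = 10e12
	seconds := inputBytes * 8 / rootBits // ~800 s for the bits to cross the root

	fmt.Printf("root ~%.0f Gbit/s, %.2f MB/s per machine, ~%.0f s to move 10 TB\n",
		rootBits/1e9, perMachineBytes/1e6, seconds)
}
```

Six and a quarter megabytes per second per machine is well below what even a single disk of that era could stream sequentially, which is the sense in which the network, rather than the disks or CPUs, was the bottleneck.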
And so their network, this 50 megabits per second, was a tremendous limit. They really stood on their heads in the design described in the paper to avoid using the network, and they played a bunch of tricks to avoid sending stuff over the network whenever they possibly could. One of them was that they ran the GFS servers and the MapReduce workers on the same set of machines.
So if they have a thousand machines, they'd run their GFS service on those thousand machines and run MapReduce on the same thousand machines. And then, when the master was splitting up the map work and farming it out to different workers, it would be clever: when it was about to hand out the map that was going to read input file 1, it would figure out from GFS which server actually holds input file 1 on its local disk, and it would send the map for that input file to the MapReduce software on that same machine, so that by default this arrow was actually a read from the local disk and did not involve the network. Depending on failures or load or whatever, it couldn't always do that, but almost all of the maps would be run on the very same machine that stored their data, thus saving them the vast amount of time they would otherwise have had to wait to move the input data across the network.
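Here's a minimal sketch of that scheduling trick, assuming a made-up table of GFS chunk locations; none of these names are the paper's actual interfaces, it's just the shape of the decision the master makes.

```go
package main

import "fmt"

// chunkLocations is a stand-in for asking GFS which servers hold a local
// replica of each input split (in reality a request to the GFS master).
var chunkLocations = map[string][]string{
	"input-1": {"worker-17", "worker-203"},
	"input-2": {"worker-5"},
}

// idleWorkers is a stand-in for the master's view of which workers can
// accept a map task right now.
var idleWorkers = map[string]bool{"worker-17": true, "worker-99": true}

// assignMap prefers a worker that already stores the split on its local
// disk, so the map's input read never touches the network; otherwise it
// falls back to any idle worker.
func assignMap(split string) string {
	for _, w := range chunkLocations[split] {
		if idleWorkers[w] {
			return w // local read: the common case in the 2004 setup
		}
	}
	for w := range idleWorkers {
		return w // remote read: only when locality isn't possible
	}
	return "" // no idle worker; the real master would wait and retry
}

func main() {
	fmt.Println(assignMap("input-1")) // worker-17 holds a replica and is idle
	fmt.Println(assignMap("input-2")) // no idle local worker, so any idle worker
}
```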
The next trick they played is that a map, as I mentioned before, stores its output on the local disk of the machine that runs the map. So again, storing the output of the map does not require network communication, at least not immediately, because the output is stored on that machine's disk. However, we know for sure that one way or another, given the way MapReduce is defined, grouping together all of the values associated with a given key and passing them to a single invocation of reduce on some machine is going to require network communication.
We need to fetch all the A's onto a single machine, so they have to be moved across the network. And so this shuffle, this movement of the keys: originally the intermediate data is stored by row, on the same machine that ran the map.
We need it essentially to be stored by column, on the machine that's going to be responsible for the reduce. This transformation of row storage into, essentially, column storage is what the paper calls the shuffle. And it really did require moving every piece of data across the network, from the map that produced it to the reduce that needed it.
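A minimal sketch of that row-to-column transformation, assuming the usual hash partitioning; in the real system each map worker writes its R partitions to local files and the reduce workers fetch them over the network, which the in-memory loop below only stands in for.

```go
package main

import (
	"fmt"
	"hash/fnv"
)

type KeyValue struct{ Key, Value string }

const nReduce = 3 // number of reduce tasks (R in the paper)

// partition decides which reduce task owns a key; every map worker uses
// the same function, so all values for a key end up at the same reducer.
func partition(key string) int {
	h := fnv.New32a()
	h.Write([]byte(key))
	return int(h.Sum32()) % nReduce
}

func main() {
	// "Rows": each map worker's output, stored on that worker's local disk.
	mapOutputs := [][]KeyValue{
		{{"A", "1"}, {"B", "1"}},             // map worker 0
		{{"A", "1"}, {"C", "1"}, {"A", "1"}}, // map worker 1
	}

	// "Columns": what reduce task r ends up with after fetching its
	// partition from every map worker; this loop stands in for that fetch.
	byReducer := make([]map[string][]string, nReduce)
	for r := range byReducer {
		byReducer[r] = map[string][]string{}
	}
	for _, row := range mapOutputs {
		for _, kv := range row {
			r := partition(kv.Key)
			byReducer[r][kv.Key] = append(byReducer[r][kv.Key], kv.Value)
		}
	}
	fmt.Println(byReducer) // all the "A" values land in one reducer's map
}
```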
And that's the expensive part of the MapReduce. Yeah? Maybe reduce could just take the values one at a time, and then you wouldn't have to gather them all first?
You're right, you can imagine a different definition in which you have a more streaming kind of reduce. I don't know; I haven't thought this through.
I don't know whether that would be feasible or not. Certainly as far as the programmer interface goes, their number one goal really was to make it easy to program for people who had no idea what was going on inside the system. So it may matter that this spec really is just the way reduce functions look in C++ or something.
A streaming version of this would start to look, I don't know how it would look, probably not this simple. But maybe it could be done that way. And indeed, people have gotten a lot more sophisticated with the modern systems that are the successors to MapReduce, and those often do involve processing streams of data rather than this very batch approach.
It is a batch approach in the sense that we wait until we have all the data and then we process it, and that means you have to have a notion of finite inputs, right? Modern systems often do indeed use streams and are able to take advantage of some efficiencies due to that, but not MapReduce. Okay, so the shuffle is the point at which all the network traffic happens, and it can actually be a vast amount of data.
If you think about sort: the output of a sort has the same size as its input. That means that if your input is 10 terabytes of data and you're running a sort, you're moving 10 terabytes of data across the network at this point, and your output will also be 10 terabytes. So this is quite a lot of data, and indeed it is for many MapReduce jobs, although not all.
There are some that significantly reduce the amount of data at these stages. Somebody mentioned: what if you want to feed the output of the reduce into another MapReduce job? And indeed, that was often what people wanted to do.
And in any case, the output of the reduce might be enormous. For sort or web indexing, on 10 terabytes of input the output of the reduce is, again, going to be 10 terabytes. So the output of the reduce was also stored on GFS: the reduce would just produce these key-value pairs, but the MapReduce framework would gather them up and write them into giant files on GFS.
And so there was another round of network communication required to get the output of each reduce to the GFS servers that needed to store it. Now, you might think that they could have played the same trick with the output: storing it on the GFS server that happened to be the same machine as the MapReduce worker that ran the reduce.
And maybe they did do that, but GFS, as well as splitting data up for performance, also keeps two or three copies of it for fault tolerance. That means that no matter what, you need to write at least one copy of the data across the network to a different server. So there's a lot of network communication here, in the shuffle, and a bunch here, at the output, as well.
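Here's a small sketch of why that is; the replica count and the write interface are assumptions for illustration, not GFS's actual API.

```go
package main

import "fmt"

const replicas = 3 // GFS-style replication factor (two or three copies)

// writeChunk stores one output chunk on `replicas` servers. Even if the
// first replica lands on the local machine, the remaining copies must
// travel over the network to different servers for fault tolerance.
func writeChunk(local string, servers []string, chunk []byte) {
	targets := []string{local}
	for _, s := range servers {
		if len(targets) == replicas {
			break
		}
		if s != local {
			targets = append(targets, s)
		}
	}
	for i, t := range targets {
		if t == local {
			fmt.Printf("replica %d: local disk write on %s\n", i, t)
		} else {
			fmt.Printf("replica %d: %d bytes over the network to %s\n", i, len(chunk), t)
		}
	}
}

func main() {
	writeChunk("worker-7", []string{"worker-7", "worker-12", "worker-40"}, make([]byte, 64<<20))
}
```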
And it was this network communication that really limited the throughput of MapReduce in 2004. By 2020, because that network arrangement was such a limiting factor for so many things people wanted to do in data centers, modern data center networks are a lot faster at the root than that one was. A typical data center network you might see today actually has many roots: instead of a single root switch that everything has to go through, you might have many root switches, with each rack switch having a connection to each of these replicated root switches, and the traffic split up among them.
So modern data center networks have far more throughput. And because of that, modern MapReduce, I think Google actually stopped using MapReduce a few years ago, but before they stopped, the modern version no longer tried to run the maps on the same machine the data was stored on. It was happy to read the data from anywhere, because it just assumed the network was extremely fast.
Okay, we're out of time for MapReduce. We have a lab due at the end of next week in which you'll write your own, somewhat simplified, MapReduce. So have fun with that, and see you on Thursday.