The following content is provided under a Creative Commons license. Your support will help MIT OpenCourseWare continue to offer high quality, educational resources for free. To make a donation or view additional materials from hundreds of MIT courses, visit MIT OpenCourseWare at ocw.mit.edu. PROFESSOR: Today, what we want to do is talk about something at a much higher scale than what we've thought about through most of this semester. And that's probably by design. Over the course of the semester, we started with kind of enzyme kinetics or molecular binding kind of events, and we slowly built our way up the larger and larger scales. Now there's always this question about whether we're claiming that we really understand how the higher levels of organization result from the lower level interactions. And I'd say, we definitely don't understand all of it. So you shouldn't come away with that as the notion. But at least one thing that I think is fascinating about this area of systems biology is that much of the framework that we use to understand, let's say, molecular scale interactions or stochastic gene expression, so these dynamics at the smaller scale, much of those ideas and such certainly transport up to these higher scales or translate up to the higher scales, where, in this case, we're using kind of master equation type formulas to try to understand relative species abundance. And so I think part of what I like about this topic of neutral theory versus niche theory and so forth in ecology is that you can just see how very, very similar ideas, that we applied for studying stochastic gene expression, can also be used to try to understand why it is that some species are more common than others when you go and you count them, in this case, on an island in Panama. Now, the subject is, by its nature, less experimentally focused than much of what we've done over the course the semester. And this is really a topic the tends to be a combination of mathematical theory with kind of careful counting of species in some different areas and trying to understand what that means. But it's an area that there have been a number of physicists involved in over the last 10 years. And I think that it's fascinating, because it does get to the heart of what we are looking for from a theory, what kind of evidence do we use to support a theory or to refute it. So I think there are a lot of very basic issues about science that come up when we start thinking about this question of neutral theory in ecology. And since it's, for many of us, a totally new area that we don't know very much about, you can come to it with maybe fresh eyes. And you don't have the same preconceptions that you would have for many other models that you might be more familiar with in the context of molecular cell biology. So the basic question that we're going to try to talk about today is just the question of why is it that, when you look out at the world, you see that there are some species that seem to be abundant and some that seem to be rare? Are there other patterns that are somehow universal? And what kind of sort of lower scale processes might lead to the patterns that we observe? And I think that this paper that we read is-- I mean, it's not that it's. Well, can somebody say what the actual scientific contribution of this paper was? Yes? AUDIENCE: They did a calculation. PROFESSOR: They did a calculation. But it's a little bit more specific than that. What is it? AUDIENCE: They came up with the closed form equation? PROFESSOR: That's right. Basically, there was a model of this neutral theory in ecology that we're going to explain or try to understand. You can simulate the model, but then there are possible issues associated with convergence or something of those. Although it's hard to believe that that's really such a concern. But you can simulate that model. What they did is they just showed that you could get an analytic-y kind of expression for it. It's not a super analytic expression, but, at least, it's not a straight up simulation. You kind of numerically do something, integrate something, as compared to doing the stochastic simulation. So it's not that that, in and of itself, is what you feel like-- it's not what we necessarily care so much about. But I think that it's still just a nice, short description of the model and the assumptions that go into it. And you get a little bit of a window into the debate that's going on between these two communities of kind of the neutral theory guys and the niche theory community. So there's only one figure in this paper. And it's an example of the kind of data that we want to try to understand. So there's a particular pattern in terms of the relative species abundance. And we want to understand what kind of models might lead to that observed pattern. But given that there's just one figure in the paper, we have to make sure that we understand exactly what is being plotted. And what I've found from experience-- and, actually, even the answer to the email question that was sent out, I think, was incorrect on one of these things. So we'll talk about that some more. So beware. We'll figure it out. But I think it's actually surprisingly tricky to understand what this figure is saying. But first of all, can somebody describe not what the figure is saying but just what the data is supposed to be? Where do they get the data? Anything that's useful? AUDIENCE: They were on an island ecosystem. PROFESSOR: There's an island. It's called BCI, Barro Colorado Island. AUDIENCE: [INAUDIBLE]. PROFESSOR: So it's a 50 hectare plot. Does anybody know what a hectare is? AUDIENCE: It's a lot more than a square meter. PROFESSOR: It's a lot more than a square meter, yes, indeed. Yeah. Is this an English unit of measure? This is the kind of thing that I have to Google. But it's one hectare is equal to 10 to the 4 meters squared. That's a good thing to memorize. I AUDIENCE: Exactly or approximate? PROFESSOR: I think it's exact. I think I think it's an exact. AUDIENCE: Then it's a metric unit. PROFESSOR: Yeah, so apparently it is a metric unit. So the idea is that if you take a 100 meters by 100 meters, this is a hectare. And there's 50 of them. It's about like a half a square kilometer to give you a sense of what we're talking about. And what do they do on this plot? AUDIENCE: They count a certain number as canopy trees. So the trees that are, like, really big. PROFESSOR: And how do they decide which trees to count? Did they count every tree? AUDIENCE: No, just the ones that like formed the top layer. PROFESSOR: I think that the way that they decide-- OK. Does anybody remember how many trees were counted? AUDIENCE: [INAUDIBLE]. PROFESSOR: So there are 21,457 trees in this 50 hectare plot. They identify the species for each one of these 21,000 trees. And they assign them. And they found that there were 225 distinct species. So this is really quite an amazing data set. Because I can tell you that I would not be able to do this. This was highly skilled biologists that can distinguish 225. If they can identify these 225, that means they have to be able to identify other ones as well. And they did it for 20,000 trees. And indeed, Barro Colorado Island is one of the major Smithsonian research institutes, where they've been tracking. They do this like every five years or so, where they do a census, where they count all of the trees. And they're also tracking many other-- it's not just trees. They're doing everything there. AUDIENCE: Is there only plants? PROFESSOR: What's that? AUDIENCE: Is it only plants? PROFESSOR: No. So actually, I visited BCI, and it seemed like they were studying all sorts of things. And there were nice looking birds there. AUDIENCE: No, I mean in this census. PROFESSOR: In this census, it's only trees. And the way that they decide which of the trees to do, it's the ones that are more than 10 centimeters DBH. Anybody can guess what DBH might mean? It's actually diameter at breast height. So what they do is they walk up to the tree with a ruler, and then, if it's larger than 10 centimeters, then they count it. You need to have some threshold at the lower end, otherwise you're in trouble, right? And there were plenty of trees that satisfied this requirement here. Then what they do, for all of these trees, it's assigned to some species. The basic goal of this branch of biology or ecology is to try to understand the pattern, from this sort of data, where it comes from. Or first describe it, and then once you have a description of it, then you can try to understand what microscale processes might lead to the pattern. And the pattern is what's plotted in figure 1. It's the only figure in the paper. I have reconstructed a rough version of it, here, for you on the board. But if you want a more accurate version, you can look at your paper. Now, we want to make sure that we understand what the figure is saying. So we will ask the following question. What is the most common number of individuals for a species in this data set? The most common/frequent number of individuals for a species to have in this data set. Now, it's maybe worth just saying something a little bit more. So you notice that they were not trying to count the total number of species, altogether. And in general, all of this field of relative species abundance, to try to understand them, what you do is typically take one trophic level. So some of the classic studies were of beetles in the Thames River. The idea is that it's some set of species that you think are going to be interacting, maybe competing, with each other, in some way, in the sense that they're maybe eating related things and being eaten by related things. And so in this case, these are the trees in Barro Colorado Island. And you can imagine that this is useful. The fact that it's trees instead of something else means that you can actually track the individuals over time. And when you go to the island what you see is that all the trees, they're wrapped by some tag. And presumably, they have some system to tell you which species that is so that they keep records of everything. But the question is, what's the most common number of individuals for species in the data set? Do you understand what I'm trying to ask? And we're going do approximate, so we'll say. Or this, can't determine. We want to know, what is the mode of this distribution of the number of individuals for each of these species? Do you understand the question? I'm going to give you 20 seconds to look at this. AUDIENCE: Should we just hold a blank piece of paper? PROFESSOR: Oh, we don't have our-- ah. AUDIENCE: [INAUDIBLE]? PROFESSOR: You know, the TA always lets me down. All right, yeah. So you can do A, B, C, D, E. Are we ready? AUDIENCE: [INAUDIBLE]? PROFESSOR: You can just do this if you're not. But given this was the only figure in the paper, and that this is a basic property of the distribution, I'm sure that you figured that out last night, anyways, right? Especially since it was one of the questions in the [INAUDIBLE]. So you presumably already thought about this question, right? OK. Yes? AUDIENCE: Yes. PROFESSOR: Ready, three, two, one. I'd say we got a lot of B's. So it seems like B is the most. So this, we'll put a question mark here. Can somebody verbally say why their neighbor said that the mode of the distribution is around 30? Yeah? AUDIENCE: The tallest bar. PROFESSOR: The tallest bar there is around 30. That's a very practical definition. So that's normally what we mean by the mode. There is a slight problem in all of this, which is that this thing is plotted in a very kind of funny way. So if you look at the figure, what you'll see is that it's number of individuals. And down here, it says, log2 scale. Now, when we say the mode, what we're wondering about is that, if you just take the most typical kind of species of tree that's there, how many individuals do we think there should be there? Of course, typical is hard to define. We can talk about mode, median, mean, et cetera. But the most common number of individuals for a species of the data set ends up not being 30. It ends up being 1. And we will try to reconstruct this right now. Because you have to do a little bit of digging to figure out what is being plotted here. But it's not the raw data. The problem here is that this is on this log scale, where the bins here are growing kind of geometrically or exponentially, whatever, as you move to the right. So over here, this thing only contains one real bin. And actually, we're about to find it's half a bin, which is even weirder. Whereas out here, this is maybe 30 bins. So the number of species that we're going to put in this bin is everything between around 20 something up to 50 or so. The number of kind of true bins that end up in each of these plotted bins is going to grow geometrically as we move to the right. So this is a very funny transform of the data. And indeed, I think it's always nice to just, in life, you always plot the raw data first. And then what you can do is then you can do funny. There's a reason to plot it this way. Because this is where they get this idea that this might described as described as a log normal. The idea is, if you take a log of the data, then you get something that looks like a normal. But you always plot the raw data first. So let's try to figure out what the raw data looked like. And now what we're going to do is we're going to have real scalings, honest to goodness numbers. Now the number of species you get still. So this is asking, how many different species do we see with one member or with two members or with three, four, et cetera? And I don't know how far we're actually going to be able to get. But in this one figure, in our paper, they tell us what the histogram means. So the first histogram bar represents what they call phi 1 divided by 2. Phi 1 was the number of species observed with one member, which means that even this first plot bar is not the number of species observed with a single individual. It's half of that. You can argue about the consistency of how these things should be, but that's what this thing's plotted. And it looks like it was nine, here, so this should be 18. So I'm going to put up here, here's a 20 and here's a 10. Right, so here is an 18. Now, what do they say? This bin represents phi 1 divided by 2 plus phi 2 divided by 2. So they took the number of species where they saw just a single individual plus the number of species where they saw two individuals, and they added those and they divided by 2. That's this number. We're not going to go through this whole process, because it's a little bit tiresome. But I've already done it for you. So I'm going to plot a few of things to get you there. And so I calculated it was 19, 13, 9, 6. It becomes ill-determined once you get out here, in the sense that we don't have enough. It's not uniquely specified going from that to that as it has to be. But I calculated it. It's around 5, in here, for a few. And somewhere in here, it's going to go into 4. And then this might go down to 3, and then deh, deh, deh. Now, if you look at this and the rapid rapid fall-off, do you think that you're going to find any species that have more than 20 individuals? We're going to vote. So you see this falling-off? So let's say that I've just showed you this, and I haven't yet calculated the rest, do we think that there's going to be any species with more than 20 individuals? Greater than 20 individuals, question mark? 1 is yes. 2 is no. It's going to be yes, no. Ready, three, two, one. So we got some 2s. So I'd say that most people are saying, no. Look at this fall-off. They're not going to be any species with more than 20 individuals. Although we already know that there are many species with more than 20 individuals. So this plot is useful for something. You can see that there are. And we know exactly the number of species that have more than 20 individuals, roughly. So those ones are all in these. So you can see that there are hundreds of species with more than 20 individuals. And indeed, it looks like there were two or three species that had more than 1,000 individuals or 1,500 or whatever the cutoff there was. So this distribution starts out rather high but then falls quickly. And out here, it's going to be very, very sparse. So there's going to be a bunch of numbers in here where there's not any species in the histogram. And then out there, there's going to be one, right? And indeed, you have to go really far out. Because there's one species out there that has a couple thousand. And indeed, the mean number of individuals per species has to be around 100. We know how to calculate a mean. This divided by this is just short of 100. So the mean number of individuals in a species is around 100. The mode is one. And the median? Well, ready? We decided this was the mode. Where is the median going to be? Is it going to be A, B, C, D? Ready, three, two, one. Indeed, this tells you pretty clear where the median is. This thing is indeed around the median. Because you can say, oh, it's about the same numbers to either side. So the median is around here. And I told you where the mean was, again. You guys remember? Ready, three, two, one. Mean, uno. Mean. So this is a very, very funny distribution. I guess I want to highlight that. And I think it's not at all what you would have expected somehow. At least, if you had described this measurement process to me, if you told me that you went to this island and you counted 20,000 trees, I don't know how many species I would have guessed. But OK, 220, it's reasonable. Well, I would have guessed it would have looked something like this on a linear scale, maybe, right? You know, that there would be a bunch of them around 50 to 100 and some would go couple hundred, some of them. So I guess I would have thought that the mean, mode, median would all be kind of a more similar thing. But this is just not the way the world is. It's not just on BCI. People, for hundreds of years, have been studying these distributions. And things that look like this, with extremely long tails, this is what people see. Now you can argue about exactly how fast it falls off and whether it's different on a mainland or an island. But this basic feature, that rare species are common, this seems to be just that's what you always see. This is the thing that you have to remember, rare species are common. And I think that this is the basic, surprising thing in this whole field. And the ironic thing is that even after spending all this time reading about theories to describe these distributions, it's still very possible-- and I would say, based on the statistics, this year and past years, it's not just possible, but it is the standard outcome-- is that after reading this paper, you do not realize that the distribution looks like this. You somehow still think that it looks-- you kind of still think it's like a linear scale, where the typical species has this, where the mean, median, mode are all about the same thing. So I guess always plot the raw data in an untransformed way. There are theoretical reasons why it might be nice to plot it like this. But be very careful about what you're doing. Because then you're left with a mental image of a histogram that looks like this. And that's very, very dangerous. Yeah? AUDIENCE: Why does it matter [INAUDIBLE]? [INAUDIBLE] the aggregate data in bins like that. And I mean, sure, exactly one species is the mode, but do you really want the--? PROFESSOR: I understand what you're saying. It's just that there's a qualitative aspect to the data, which is that most species are very rare. And this is something that I think is surprising. I think it's deep. And it's something that you do not get realized. AUDIENCE: Most species have more than 16. I mean, it depends what you mean by rare. PROFESSOR: Yeah. AUDIENCE: Look at the way that the distribution is away from trend. AUDIENCE: That's a good point. But the species density is clustered around the low numbers. PROFESSOR: Right. AUDIENCE: But actually most species have more than 30. PROFESSOR: Maybe the surprising thing is that just if you take-- the mean is 100. And so I would've thought that, if you plot number of species as a function of the number of individuals, given those numbers, I would have guessed, OK, here's 100. I would have guessed-- here's 50, so just to highlight that this is 150. So linear scale, I would have guessed it would look something like that, maybe larger than Rudin or something. AUDIENCE: What would that look like in a log2 scale? It would look like It's like the log of [INAUDIBLE]? So it goes up really fast and then-- PROFESSOR: So this thing would be kind of like shoom. I mean all the weight would be in. It would be like all here plus a little bit on each of these. AUDIENCE: But yeah. I don't think it's actually that different. The only thing that's different is the tail on the left. PROFESSOR: And the tail on the right. AUDIENCE: Yeah, it's a little bit longer. PROFESSOR: No, it's lot longer, right? Because this thing, all of the weight is between 50 and 150, which means that all of the counts are basically going to be these two, basically. Because this thing comes out either way. So in this case, if you take that histogram put it on this kind of scale, you end up with two bars up high, nothing outside. So it's a very different distribution. And it's not to say that this is a ridiculous thing to do. It's just that. But the problem is that your mental image of what the distribution looks like ends up being incorrect, in the sense that you have a qualitatively different sense of what's of what's going on. And if you go up to 10 species, here, and 10 is way down here. If this is what it looked like, there would be essentially no species with fewer than 10 individuals. But if you come over here and you add it up here. It's like a mean of 6 times 10 is 60 out of 200. A quarter or a third of the species on this plot of land have fewer than 10 individuals. And 10 is really a very small number. Well, rare species are common. I think it's a true description of the observed distribution here and elsewhere. And it's not something that you appreciate or realize when you plot it in that way. AUDIENCE: But you can get this information from that plot. PROFESSOR: No, I agree. You can get it. You can get it. But it was only 10% of the group got it. Right, the fact that you can get it-- right, it's possible. But you don't get it. That is a practical statement. Yeah, I'm not dead set against this distribution. It's just that it makes everybody think something that's not true. So if you think that that's OK, then I can't help you. It's OK, but it's just you have to be careful is my only statement. And I very much want you to take away. Because I this is an accurate description of the data. Rare species are common. And one of the readings-- I think it was in this paper, maybe it was a different one that I was reading. Even Darwin, when talking about this, commented on this fact that rarity of species is somehow a typical event. AUDIENCE: And common species are rare. PROFESSOR: And common species are rare, that's right. This distribution is hugely, hugely skewed. These are the measurements. It's good to look at them in both of these ways. Because you can't even plot the data on a linear scale. So that's a good reason for doing it. But I think it's good to have both of these pictures in mind. What we want to do is to talk about two classes of models that give something that's essentially this log normal distribution. So on a log scale it looks normally distributed, approximately. And those two models are going to be kind of a niche-based model and a neutral model. Can somebody, in words, explain what they maybe see as the difference between this niche and a neutral kind of approach? Yeah? AUDIENCE: [INAUDIBLE]. PROFESSOR: Every species is--? AUDIENCE: [INAUDIBLE]. PROFESSOR: In which one? AUDIENCE: In niche. PROFESSOR: In the niche theory, the species are different. So it seems like a ridiculous statement. Do you believe that species are different? We can vote, yes or no. Ready, three, two, one. Yeah. Well, somebody's been convinced by the neutral theory. It's clear that species are different. And the question is which patterns in the data do you need to invoke differences in order to explain? And I think that one, maybe, theme that's come out of this relative species abundance literature and the debates between the neutral and the niche guys is just that this distribution is less informative of the micro scale or individual kind of interactions then you might have thought. Because multiple models can adequately explain such a pattern. In all areas, we have to remember that you make an observation, and you write down a model that explains that observation. So what you do is you write down a model. And writing down a model, what that means is that you make some set of assumptions. And then you look to see what happens in that model. And if the model is consistent with the data, that's good. But it doesn't prove that the assumptions that went into the model are correct. And this is a trivial statement. And I've said it before. You have to tell yourself this or remind yourself of this kind of once a month. Because it's just such an easy thing to forget about. Now, the niche models indeed assume that the species are different. And that's reasonable. Because we think it's true. But then, of course, there are many different ways of capturing those differences. And then you have to decide whether the assumptions there are reasonable or whether they're necessary, essential. In the context of the niche models, we're going to think about the so-called broken stick models. So basically, you get log normal distributions when there's some sort of multiplicative-type random process that's being added together. You get normal distributions when you have sums of random things going together. This is the central limit theorem. But when you have multiplicative kind of errors or random processes coming together, you get log normal distributions. And I want to highlight that that does not necessarily have to tell you so much about the biology of it. Because a classic situation where you get log normal distributions is if you take a stone and you crush it. You can do this experiment at home. And then you measure the mass distribution of the resulting fragments. And the distribution of mass is log normal. Just take a stone, grind it under your boot or hammer it, just kind rub it right in. You'll get you'll get some distribution of fragments. For each of the fragments, measure the mass, and, indeed, you end up getting a log normal distribution. Because there's some sense that what's happening is that you take a larger mass, you break it up randomly, and then the resulting fragments, at some rate, each of them you break up randomly. and the small ones are maybe kind of less likely to get broken up as the big ones, so then the small ones can still get even smaller. But then there's going to be, at some rate, some very large ones. So such a process ends up-- I mean it's not biology. This is just something about the nature of the breaking up of this physical object. And indeed, the basic idea behind many of the niche models that give you a log normal distribution is equivalent to crushing a stone and measuring the resulting distribution. I'll describe what I mean by that. Typically, the broken stick models, they say there's some resource axis. This is a resource axis. And this could be, for example, where you're getting food from. Now, we're going to have to divide up this resource access among some number of different species. And what we're going to assume is that the number of individuals in the species is proportional to the length of the resource axis that it's able to capture. And I want to make sure I find my notes. I want to highlight this. This comes from MacArthur in the 1950s. MacArthur and it's 1957. So we imagine there's this homogeneous resource axis. We're going to break it up into N segments. And the abundances are proportional to the length. And the idea is that, if you just break this up randomly, so let's say you just draw N minus 1 lines randomly, or N minus 1 points randomly here. Now you have N species with N different abundances. The question is does that give a log normal? We'll say N minus 1 random points. Do you understand what I mean. You sample uniformly once, sample uniformly twice. You do that N minus 1 times, and now you have N and deh deh. And then we say, OK, the first species has this many individuals. The second has this one. The third is this one, et cetera. The question is does random points, does that lead to a log normal? Yes and no. Let's think about this for 10 seconds. N minus 1 random points, log normal distribution, ready, three, two, one. So I'd say that we have a majority are saying no. Can somebody say why that is? AUDIENCE: [INAUDIBLE]. PROFESSOR: Because it's something else. That's fair. But can you say qualitatively why it is that this is not going to work? AUDIENCE: You can't have very long gaps. PROFESSOR: Right. That is it's going to be very unusual that you get a very long gap. What about the other end? AUDIENCE: Also a very long tail. PROFESSOR: Now I'm a little bit worried. I think that that's true, right? Well, I'm going to say that you're not going to get this super long ones. I think that the distribution might still be peaked at short values. No? AUDIENCE: No. PROFESSOR: Random? If we were just traveling along this resource axis, at a rate that's kind of exponentially distributed, like Poisson rate, we just dropped points, that's something very similar to this random-- AUDIENCE: It said we're limited in the number-- PROFESSOR: No. Is that not true? AUDIENCE: Your sample [INAUDIBLE]. PROFESSOR: I'm a little bit worried that I might be-- now, I'm not 100% confident. Depending on how I look at this, I get different distributions. Yeah? AUDIENCE: But I think the first thing that he said, where you just say, I'm going to pick N minus 1 points-- PROFESSOR: Yes. AUDIENCE: --is a different thing than going along the axis and exponentially dropping ones along. PROFESSOR: I agree it's different. AUDIENCE: I don't think that would be the idea simulated, because you would be very likely to just get this giant thing at the end when you're finished. AUDIENCE: What you could do, you could go on to draft N plus 2 points. PROFESSOR: No, I think-- AUDIENCE: These scales that are your two end points [? are doubled. ?] PROFESSOR: Because I think that the probability distribution does grow. I think that I'm going to side with you. So we've decided that there are not going to be as many short sticks, and there's not going to be as long sticks as compared to a log normal. Do we agree with that? At least we agree that it's not going to be a log normal. So you're not going to get this huge variation of some very long sticks and some very short ones. Now, the question is how would you change this sort of model in order to generate a log normal? And the answer is that what you have to do is you have to what is called some niche hierarchy or so some hierarchical breaking. Just like what led to the stone giving you a log normal is that you have to have some successive process of breaking things. So this is what they call some hierarchy model. And then they key thing is that it's sequential. You have your resource axis. First, you have some rule for breaking it up. It could be that you just sample uniformly or some other probability distribution. And the way that you might think about this is via-- just everything up on the board is so nice and useful. I feel bad getting rid of it. This thing is not true, so I don't mind erasing it. So let's imagine some bird community in the forest. And we're going to think about where is it that the birds are getting their grub or their food to eat. First, well, now the axis is somehow vertical. You could divide them up into the ground foragers as compared to the tree foragers in terms of where they're getting their food. And you say, oh, well, how much of the food is on each side? Oh, well, we'll say 30% is on the ground, 70% is on the tree. This is along the stick. You cut the stick in some way, or you break the stick in some way. But then within the tree foragers, you'd say, well, the resources might be separated. And this is really like speciation, a species is in the niche, the species are focusing on different niches. So you'd say, oh, some are going to focus on the trunk, some will focus on branches. And again, this part of the stick is now broken or divided among different resource locations with some amount. But then also, you're going to get speciation in different directions here, because there's both the surface-- I don't know if you guys have ever eaten grubs-- but there's the surface grubs, and then there's also the sub-bark grubs. And so you kind of do this process multiple times, where you kind of pick different branches and break them to divide up the niche. And then you end up with a log normal type distribution. And this is a similar process to the crushing of the stone, because the idea is that there's sequential breaks of the stone. So the stone first breaks into maybe simply two or it could be three. First, there's one breaking. And then one of them is broken more. So given this process, you end up getting a log normal distribution. Yeah. AUDIENCE: But you also have a distribution of like how far. Because I guess there are two questions. Like when you break your stick, you assume, somehow, that you uniformly break it. PROFESSOR: Yeah. A lot of work has gone into the question of how it is you should break the stick. Given that you have this tree foraging stick. On a practical level, what they do is they ask, well, what probability distribution gives you the best agreement with the data? Is it uniform? Or is it, oh, it's broken like this? And in some cases people say, well, it's actually tilted on one side. Well, in the context of a succession and some other environments, there's an idea that, if a species first gets somewhere, they can kind of monopolize a larger fraction of the resources then if it's divided kind of an equally at the beginning. And that's going to effect where this probability distribution is going to break each one. But there's always this question about how constrained are the notions and so forth. And I'm agnostic on that point. AUDIENCE: But you also need distribution for how many times it breaks [INAUDIBLE]. PROFESSOR: Yes. It's just that, if you do this process, it's like a central limit theorem type result. So you have to do it enough times so that you get to some limiting distribution. And then you could keep on doing it. In the end, we always say that species abundance is proportional to the size. So we're going to scale, ultimately, to get the correct number of individuals. It's just that you have to do it some reasonable number of times so that the randomness kind of washes out, and you end up approaching that limiting behavior. Does that make sense? And indeed I just want to mention a major result in this field. These niche type models successfully explained or predicted another pattern that had been observed, which is the so-called species area relationships. So this is just saying that, here, we looked at 50 hectares, and we asked how many species where there. 225 species in 50 hectares. Now, the question is, if instead of looking at 50 hectares, we instead looked at 500, do you think of that the number of species we observed would have gone up, stayed the same, or gone down? Up, same, down, ready, three, two, one. Up. Up. If you look at a larger area, you expect to see more species in a larger area. And people really do this. They look in some area, going from, say, they take a meter, and they count all the species. And then they go and here is 100 meters, and they count all the species. And they ask, how many species do you see as a function of the area? And what people have found is that the number of species you observe it is proportional to the area to some power, where Z is around a 1/4. And of course, the area goes as some r squared. If you wanted to, you could say it goes as the square root of the radius, whatever. But the number species in some area, it grows, but it grows in a manner that is less than linear. Does that make sense? It definitely makes sense that's less the linear. Because linear would be that you sample a bunch of species here, and then you look at another identical plot, you get some other species. And they were saying that, oh, that you really don't expect any of those species overlap. That would be a weird world. So it very much make sense that this is less than 1. Of course, it didn't have to be this power law. But one thing that has been discovered, around the world, is that power laws are very interesting. But once again, many different microscopic processes can lead to power laws. The niche models have successfully predicted or explained why it might have this scaling. But it turns out that neutral models can also predict it. And may just be that lots of spatially explicit models will give you some power law type scaling that looks kind of like this. So once again, it's a question of how convinced you should be about microscopic processes based on being able to explain some data. And I think the best cure for this danger, of assuming that the microscopic assumptions are correct, because the model is able to explain something, is that, if you find some other very different set of microscopic assumptions that also explain the patterns, then it becomes clear that you have to take everything with a grain of salt. And that's I think part of what's been very valuable about the neutral theory contribution to this field. AUDIENCE: Does this just come from-- you assume that all the individuals are uniformly distributed and then [INAUDIBLE]? PROFESSOR: There are multiple derivations of this, so it's a little bit confusing. The neutral models, that I have seen, that lead to these patterns, they basically have the individuals randomly, either with sex or without sex, kind of diffusing around, and then they divide, deh-deh. And then you can explicitly just do the different spaces and see that you get a scaling. It seems to be a surprisingly emergent feature of many of these models. And once again, it may be something that tells us less about biology than it does about math or something. Any other questions about this, the base notion of this niche hierarchy type models? So I want to spend some time talking about this neutral theory in ecology. The math, in particular the derivation of this particular closed form solution, is not really so interesting or relevant. But I think it's very important to understand what the assumptions are in the model and maybe also something about the circumstances in which we think that it should apply. So the basic idea is that we have, what we hope, is some metacommunity that is large. And then we have an island. So this has to do with this theory of island biogeography. We have an island over here. And in the context of the nomenclature of this paper, they are some community size, size j here. This tells us about the number of individuals. And they're distributed across some number of species. Now, the neutral theory, the key thing is that we assume that all individuals are identical. And once again, it's not that the neutral theorists believe that this is true. It's that they think that it may be sufficient to explain the patterns that are observed. And when we say that all individuals are identical, what we mean is that the demographic parameters are the same, birth, death rates. And it's even a stronger assumption, in some ways, than that. It's assuming that the individuals are the same, the species are the same, and that there are no interactions within the species as well. So there's no Alley effect, or no specific competition. So the birth, death rates are going to be independent of everything, which is an amazingly parsimonious model. And it's kind of amazing you can get anything out of it. And then we have a migration rate m. It's either a rate or a probability, depending on how you think about it. Rate or probability m. And can somebody remind us how we handle that? AUDIENCE: Both just in a community? PROFESSOR: Yeah. AUDIENCE: At some probability that is proportional to the distribution of the species in the metacommunity? PROFESSOR: Yeah, that's right. AUDIENCE: --transfer an individual from the metacommunity to the island. PROFESSOR: Perfect. AUDIENCE: We do stick to the island to make sure that number of individuals. PROFESSOR: Right. So what we're going to do is we're basically going to pick a random individual, here, each cycle. This is kind of like a Moran process. We're going to pick an individual here. And we're going to kill him. And then what we're going to do is, with probability m, replace that individual with one member of the metacommunity at random. So the rate coming from here will be proportional to the species abundance in the metacommunity. And with a probability of 1 minus m, what we're going to do is we're going to replace that individual with another individual in the island. Now, the math kind of gets hairy and complicated. But the basic notion is really quite simple. You have a metacommunity distribution, which is going to end up being the so-called Fisher log series in this model. This describes the species abundance on the metacommunity. But then on the island, we're just going to assume that there's birth, death that occurs over here at some rate. But we don't even have to hardly think about that. From the standpoint of, say, a simulation or model, we just run multiple cycles of this, where we have j individuals. And we always have j individuals, because it's like the Moran process. At every time point, we kill one individual, and we replace it, with somebody either from the same community or from the island. And you can imagine that in the limit of m going to zero, what's going to happen on the island? Yeah, so you'll end up just one species, just because this is just random, like genetic drift. It's ecological drift where one species will take over. Whereas if m is large, then somehow it's more of a reflection of the metacommunity. Are there any questions about what this model is looking like for now? AUDIENCE: Could we talk about the Fisher log series? PROFESSOR: Yeah. AUDIENCE: So we would put it on the same axis as the [INAUDIBLE]? PROFESSOR: Yes, this is a very, very good question. So we'll do this in just a moment. Because this is very important. I want to say just a couple things about this model. So when I read this paper, what I imagined is that it really looked like this. This was Panama, and that, 30 kilometers off the coast, there was this island, BCI, Barro Colorado Island. But that's not maybe an accurate description of what the real system looks like. Does anybody know where BCI is? AUDIENCE: It's in Panama. PROFESSOR: Hm? AUDIENCE: Panama. PROFESSOR: So it is in Panama. But it's not off the coast of Panama. I guess that was my original. AUDIENCE: It's in the canal. PROFESSOR: Yeah, it's in the canal. So it's an island that was created when they made the Panama Canal. So this thing was not always an island. It's been an island for 100 years. And it's in the middle of a canal. And they actually have cougars that swim back and forth from the mainland. But it does make you wonder whether this is-- it's much more strongly coupled to the mainland then I imagined when I read this paper at first. I don't know what that means for all this. But certainly, you expect this to be a more or less appropriate model depending on this. Because, of course, if you went and you sampled 50 hectares here, you wouldn't believe that it should have the same distribution. You'd believe it should be more like the Fisher log series. And there's some evidence that things are tilted in a way that you would expect. And we'll talk about that. It's tricky. And of course, you have to decide in all this stuff, oh, what do you mean by free parameters? And actually, it seems like people can't count. And we'll talk about this in a moment, too. Because, of course, constructing the model, there's some sense of free parameters that you have there. Because we could have said, oh, it's just going to be the Fisher log series, or we could have said, oh, it's going to be island. Or we could have said, oh, there's another island out here. And then that would be another distribution. And not all of these things introduce more free parameters, necessarily, because you could say, oh, this is the same migration rate, or you could do something. But they are going to lead to different distributions, and you have that freedom when you're trying to explain the data. There are a lot of judgment calls in this business. But let's talk about Fisher log series, because this is relevant. So the model is very similar to what we did for the master equation in the context of gene expression and the number of mRNA. So was the equilibrium or steady state distribution of mRNA in a cell, was that a Fisher log series? Yes or no, five seconds? Was the mRNA steady state probability distribution a Fisher log series? Ready, three, two, one. No. No. What was it? It was a Poisson. And you guys should review what all these distributions are, when you get them, and so forth. So what was the Difference why is it that we have some probability, P0, P1, P2? This could be mRNA or it could be number of individuals in some species with some birth and death rates. What was the key difference between the mRNA model, which led to this distribution becoming Poisson, and the model that we just studied here, where it became a Fisher log series? And I should maybe write down what the Fisher log series is. So this is the expected number of species with n individuals on the metacommunity. Here is the Fisher log species. There was some theta X to the n divided by n. So what's the key difference? Yeah. AUDIENCE: I think that the birth and death rates are both proportional [INAUDIBLE]. PROFESSOR: Right, the birth and death rates are both proportional. AUDIENCE: In the Fisher log series. PROFESSOR: In the Fisher log series. So what we have is that b0-- and what should we call b0 in this model? AUDIENCE: [INAUDIBLE]. PROFESSOR: Well, right now, we're thinking about the metacommunity. AUDIENCE: Speciation. PROFESSOR: Speciation. b0 is speciation, which we're going to assume is going to be constant. In this model, do we have speciation on the island? No. The assumption is that the island is small enough that the rate of speciation is just negligible. So speciation plays a role in forming the metacommunity distribution, but it doesn't play a role in the model. So this is speciation. But then what we assume is that b1, here, is equal to some fundamental rate b times n, but it's b times, in this case, 1. So more broadly, bn is equal to some birth rate times n. This is saying that the individuals can give birth to other individuals. Now, we're not assuming anything about sexual reproduction necessarily or not. We're just saying that the kind of rates are proportional to the numbers. So if you have twice as many individuals, the birth rate will be twice as large. This is reasonable. This is Pn and this is Pn plus 1. So this is d of n plus 1 is equal to some death rate times n plus 1. So each individual just has some rate of dying. It's exponentially distributed. This again makes sense. What was the key difference between our mRNA model, from before that gave the Poisson, and this model that gives the Fisher log series? AUDIENCE: So with the mRNA, it's with a standard like a chemical equation where there's some fixed external input. But then the degradation is according to the amount that you have. So death is proportionate [INAUDIBLE]. PROFESSOR: Perfect. In both cases, the death rate is proportional to the number of either mRNA or individuals. However, in the mRNA model, what we assume is there some just constant rate of transcription, so a constant rate, per unit time, of making more mRNA. So just because there's more mRNA doesn't mean that you're going to get more mRNA. But here, we assume that the birth rate is proportional to the number. So that's what leads to the difference. And so this is one of the few other cases that you can simply solve the master equation and get an equilibrium distribution. And it's the same thing we do from just always, where we say, at steady state, the probability fluxes or whatever are equal. So you get that P1 should be equal to P0. and then we have a b0 divided by d1. And more broadly, we just cycle through. The probability of being in the nth state, it's going to be some P0. And then basically, it's going to b0 divided by d1, b1 divided by d2, b2, d3, dot, dot, dot, up to bn minus 1 dn. And indeed, if we just plug in what these things are equal to, we end up getting-- there's P0, the fundamental birth over death to the nth power. And then we just are left with a 1 over n. Because we're going to have a 2 here and a 2 here, and those cancel. A 2 here and 3 here, and those cancel. And we're just left with the n at the end, finally. So this x, over there, is then, in this model, the ratio of the birth and death rates. So which one is larger? Is it A slash 1? Is it b is greater than d? Or is it b slash 2, that b is less than d? Think about this for five seconds. Do you think that birth rates should be larger than death rates or death rates should be larger than birth rates or do they have to equal? Ready, three, two, one. So we got a number of-- it's kind of distributed, 1 and 2's. Well, it's maybe not that deep, not deep enough. Can somebody say why their neighbor thinks it's one or the other? People are actually turning to their neighbor. A justification for one or the other. AUDIENCE: So if this problem where b over d is greater than 1, then this distribution is not normalized. PROFESSOR: Right. So if b over d is greater than 1, so if x is greater than 1, then this distribution blows up. Then it gets more and more likely to have all these larger numbers. But then if b is less than d, shouldn't everybody be extinct? No. Can somebody else say why it is that it's OK for b to be less than d? If birth rates are less than death rates, shouldn't everyone be extinct? AUDIENCE: Because there's a rate b0. PROFESSOR: Because there's a rate b0, exactly. So there's a finite rate of speciation. So it's true that every species will go extinct. But because we have a constant influx of new species, we end up with this distribution that's this Fisher log series. Now, if you plot the Fisher log series, it looks a bit like this. But let's think about it a little bit. Does the Fisher log series, does it fall off, A, faster or slower than this? Fisher falls, A, faster-- this is in this direction-- or, B, slower? AUDIENCE: Faster or slower than what? PROFESSOR: Than the island distribution. Because you can see that this falls off pretty rapidly. Ready, maybe? Three, two, one. I saw a fair number of people that don't want to make a guess. Indeed, it's going to be faster. Can somebody say why? Yeah. AUDIENCE: [INAUDIBLE]. PROFESSOR: Is it going to be because of the 1 over n? I mean the 1 over n is certainly relevant. Without the one over n, then we just have sort of a geometric series. And the log normal is not just a geometric series either. AUDIENCE: [INAUDIBLE] Whereas this has a very long tail. PROFESSOR: That's right. So this falls off. This would be kind of exponentially, and this is faster than exponentially. And indeed, this make sense based on the model. Because this community, the reason that it has some very, very abundant species is partly because it gets migration from the abundant species here. This falls off pretty quickly. But those frequent species still can play a pretty important role in the island community, because the migration rate is influenced by large numbers. And the other thing is, of course, that the rare species are going to often go extinct. I mean the distribution on the island is some complicated process of the dynamics going here, plus sampling from here. But there's a sense that it's biased towards-- it's not just a reflection of the metacommunity, because the migration rate is sampled towards the abundant species. So the migration of these species ends up playing a major role in pushing the distribution to the right. So you have much more frequent, abundant species on the island as compared to the mainland. AUDIENCE: [INAUDIBLE]? PROFESSOR: Yeah. AUDIENCE: [INAUDIBLE] measurement of the distribution on the-- PROFESSOR: Well, I'm sure they have. I think the statement that there's a faster fall off on mainlands than on the islands I think is borne out by the data. But I don't know if trees on the Panama side of the canal are actually better described by a Fisher log series as compared to this, though. AUDIENCE: I guess my question was the abundant species that we see on the island, is it just the result of diffusive drift? PROFESSOR: Well, this also has the diffusive drift. AUDIENCE: But in the sense that what really pushes. PROFESSOR: Well, I mean I think you need both, the diffusive drift and the migration. But I think that the fact that the migration is from the mainland, and it's biased towards those abundant things, I think is necessary or important. AUDIENCE: I guess just in terms of distinguishing between the niche and the neutral models, as applied to the mainland, does the niche model predict also a log normal? Because it seemed like, in the discussion earlier, the neutral also predicted log normal [INAUDIBLE]. PROFESSOR: That's a good question. In this whole area, I mean it's a little bit empirical. The fact that the niche model kind of predicts this, or this broken stick thing predicts a log normal, they didn't say anything about islands there, right? I guess even Fisher's original log series, he used it to describe-- I think maybe that was the beetles on the Thames. But his original data set, where the Fisher log series was supposed to described it, as it was sampled better and better, it eventually started looking more and more like a log normal anyways. I mean it's easy to see the frequent species, because you see them. This tail can actually be very hard to see, because you have to find the individuals. It's a good question of to what degree each of the models really predicts one thing on one place and another. There's always tweaks of each model that adjust things. So I think it's a bit muddy. But the one thing that I want to highlight. So there's a lot of debates, then, between these different models. And each of the models have some fit. They have red and black. There's one that kind of goes like this. And another one that kind of goes like that. And they're not labeled, because they look the same. And you can argue about chi squareds and everything, but I think it's irrelevant. They both fit the data fine. And the other thing, just the sampling of kind root n sampling, if you expect to see 10 species, then if you go and you actually do sampling, you expect to have kind of a root n on each one. I mean the error bars, I think, around this are consistent with both models. So I'd say that the exercise of trying to distinguish those models based on fit to such a data set I think is hopeless from the beginning. And then you can talk about the number of parameters. And if you read these two papers, they both say that they have fewer number of free parameters. And it is hard to believe that there could be a disagreement about this. But then, you know, it's like, oh, well, what do you call a free parameter? And then what they say, any given RSA data set contains information about the local community size j. So they say, given that, it's not a free parameter, because you put that in. That's the number of individuals. And then outcome is your distribution, right? And you say, OK, well, all right, that's fine if you don't want to call that a free parameter. But then when you fit the log normal to this distribution, the overall amplitude is also to give you the number of individuals in the metacommunity. So if you don't call j a free parameter in this model, then you can't call the amplitude a free parameter when you fit the log normal, at least in my opinion. I think that they both have three. Because if you fit a log normal to this, you have the overall amplitude. That's the number of individuals. And then you have the mean and the standard deviation or whatever. From that standpoint, I think they're the same. Yeah. AUDIENCE: But I mean how do you fit the log normal when you don't impose? Do they impose the amplitude? I mean it's still a parameter. PROFESSOR: No, that's what I was saying. It's a parameter. I mean the normalized log normal, you integrate, and it goes to 1. But then you have some measured number of individuals in your sample, and then you have to multiply by that to give you. AUDIENCE: But is that what they do when they do their fit? PROFESSOR: Yeah. AUDIENCE: Or do they keep that amplitude as also a parameter? PROFESSOR: I think that you can argue whether this is a free parameter or not. But I think that you can just put it as the number of individuals, and it's not going to affect anything. You could actually have it be a free. But this gets into this question about what constitutes a free parameter or not. And actually, there is some subtlety to this. But I think, at the end of the day, the log normal is not going to look like this. You have to. You basically put in the number of individuals that you measured. AUDIENCE: So when you calculate [INAUDIBLE]? PROFESSOR: Huge numbers of pages of has been written about comparing these things. At some point, it comes down to this philosophical question about what you think constitutes a null model. And this gets to be much more subtle. And I think reasonable people can disagree about whether the null model that you need to reject should be this neutral model or if it should be a niche-based model. Or maybe it's just that there's some multiplicative type process that's going on and gives you distributions that look like this, and you need other kinds of information to try to distinguish those things. And in particular, I'd say that it's really the dynamic information in which these models have strikingly different predictions, and then you can reject neutral-type models. Because that neutral models predict that these species that are abundant are just transiently abundant, and they should go way. Whereas the niche-based models would say, oh, they're really fixed. And indeed, in many cases, the abundant species kind of stick around longer than you would expect from a neutral model. Of course, the neutral model is not true in the sense that different individuals are different. But it's important to highlight that even such a minimal model can give you striking patterns that are similar to what you observe in nature. And so I think we're out of time. So with that, I think we'll quit. But it's been a pleasure having you guys for this semester. And if you have any questions about any systems biology things in the future, please, email me. I'm happy to meet up. Good luck on the final.