Good day, everyone. This is your lecturer, Monica Wahi, and we're going to start now with section 1.1: What is statistics? Here are the learning objectives for this lecture. At the end of this lecture, the student should be able to state at least one definition of statistics — yes, there's more than one — give one example of a population parameter and one example of a sample statistic, and classify a variable as quantitative or qualitative and as nominal, ordinal, interval, or ratio.

So what we're going to cover in this lecture: first, I'm going to go over some definitions of statistics. Like I said, there's more than one, but they all relate to the basic concept of why you're doing statistics and not, say, just math — so what's the difference? Then we're going to go over population parameters and sample statistics, and you'll know what those mean by the end of the lecture. Finally, we're going to go over classifying levels of measurement.

Let's start with the definition of statistics. We're going to go over what it is, and I'm also going to define for you the concepts of individuals versus variables. You may know definitions for those words already, but I'm going to give them to you in statistics-ese. Then I'll give you examples of statistics, individuals, and variables in healthcare.

So here are the definitions. What is statistics? Statistics is the study of how to collect, organize, analyze, and interpret numerical information and data. That sounds pretty esoteric, right? But think about it: even if you did a simple survey — say you just look on Yelp — you see the restaurant you want to go to, and some people say five stars or four stars, but there are a few two-star and one-star reviews. Will you go? There's a whole bunch of different answers, so how do you decide? You kind of have to analyze it; you kind of have to interpret it. It's not that easy. So statistics is both the science of uncertainty and the technology of extracting information from data. In other words, if you've got a bunch of data about a restaurant, you don't know how it's going to be if you actually go there — you don't know for sure. That's the science of uncertainty. If you look on Yelp and almost everybody is giving it four or five stars, maybe it's going to be good for you, right? But you don't know — maybe there's new management. That's the uncertainty.

So statistics is used to help us make decisions — not just whether to go to a restaurant, but important decisions, such as in healthcare and public health. (Well, I guess if it's an expensive restaurant, maybe that's important too.) In healthcare and public health, you really need these statistics, because they really guide you. For example, think of the Centers for Disease Control and Prevention in the United States. What do they do? They spend the whole year studying the different flu viruses that go around, because there's more than one. They spend the whole year organizing, analyzing, and interpreting numerical information and data about the different influenza viruses in circulation. They extract that information, and you know what decisions they make? They make the decisions about which viruses to include in the next year's vaccine. Are they always right? Sure enough, they're not.
I mean, have you ever had a year where you thought, "Oh my gosh, everybody I know got vaccinated, and they're still getting sick"? Well, give them a break — that's the science of uncertainty; it just didn't work out that time. However, this is probably better than just randomly guessing. So that's statistics for you.

Now, I promised I'd tell you the statistics-ese version of individuals and variables. Outside statistics, you know that individuals are people, and you know that a variable is a factor that can vary — like, "the only variable is I don't know what time something's going to happen." But when you enter the land of statistics, these two words have specific meanings. Individuals are the people or objects included in a study. So if you do an animal study with some mice in it, those mice would be the individuals. If you do a randomized clinical trial and you include people who have Alzheimer's in it, then those patients are your individuals. But we do a lot of different things in healthcare. We sometimes study hospitals — like the rate of nosocomial infections — in which case, if you're looking at a whole bunch of hospitals, those hospitals would be the individuals. Sometimes we look at states — rates of infant mortality in different states, for example — and in that case, the states would be the individuals.

As you can see at the bottom of the slide, a variable, then, is a characteristic of the individual to be measured or observed. I give some examples on the slide. Like I was saying, if you wanted to study hospitals, I gave the example of the rate of nosocomial infections as a variable, but you could also have other variables about that individual hospital, like the rate of in-hospital mortality. So one of the things we do in statistics is sit down and decide: who are going to be our individuals, and what variables are we going to measure about them? I just threw up here a few examples of different kinds of individuals we use a lot in healthcare and public health, and an example of just one variable about each of those example individuals — but there would theoretically be many variables about them. And I just want you to notice that a lot of the time the individuals are geographic locations; other times they might be institutions, like hospitals, clinics, or programs. There are other possibilities, but those are the big ones.

So, just to review what I went over: statistics is used in healthcare and other disciplines to aid in decision-making, like the example of the CDC and the influenza vaccine. Therefore, it's really important to understand statistics, because you need to understand these processes in healthcare — not only what we do, but how we figure out what to do. And that's really important, because we use statistics a lot in healthcare.
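Just to make the individuals-versus-variables idea concrete, here's a minimal sketch in Python. The hospital names and rates are invented for illustration; the point is only that each individual is a "row" and each variable is a characteristic measured on that individual.

```python
# A tiny, made-up data set: each dictionary is one individual (a hospital),
# and each key besides the name is a variable measured on that individual.
hospitals = [
    {"name": "Hospital A", "nosocomial_infection_rate": 0.021, "in_hospital_mortality_rate": 0.015},
    {"name": "Hospital B", "nosocomial_infection_rate": 0.034, "in_hospital_mortality_rate": 0.019},
    {"name": "Hospital C", "nosocomial_infection_rate": 0.012, "in_hospital_mortality_rate": 0.011},
]

# Individuals are the entries in the list; variables are the characteristics we chose to measure.
for hospital in hospitals:
    print(hospital["name"], hospital["nosocomial_infection_rate"])
```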
Now, we're going to move on to talk about what a population parameter is and what a sample statistic is. We'll go over the definition of a population and the definition of a sample, so you're sure what those mean. We'll talk about data about a population versus data about a sample and how those are different, and then we'll get into parameters and statistics, and I'll give you a few examples.

So let's start with: what is a population? Again, this is another case where a normal word has a special meaning in statistics. A population is a group of people or objects with a common theme, where every member of that group is considered part of the population. Here's one example: the theme might be nurses who work at Massachusetts General Hospital. The population, if that was your theme, would be the list from human resources of every nurse currently employed at MGH. Now, it really does depend on how you define that theme. I could have said nurses who belong to the American Nurses Association — then we'd be looking at a different list. I could say nurses who live within the city limits of New Orleans — then we'd be looking at a different population. So it really has to do with the details of how you describe the theme around that population. But the point is, once you describe that theme, the population is every single individual in it.

So then, what is a sample? It's a small portion of that population. It can be a representative sample, but it can also be a biased sample, and we're going to get into that. Let's go back to MGH. Say we were going to survey a sample of the population of nurses at MGH, and say we only surveyed nurses in the intensive care unit. That would be a sample, but not a representative sample — a small portion of the population, but not a representative one. More representative would probably be asking at least one nurse from each department. So I just want to get into your head that the whole concept of a sample is that it's a small portion of the population — of that one population, not some other population. The catch is that you can get a biased sample or a representative one, so you have to think about it.

When you think about it, if you've got a whole population, you would get variables about each individual in that population, and those variables would be your data. But if you chose a sample — just a portion — that would be a lot less work, right? You'd still have to get variables about those individuals, but there are far fewer individuals, so it would probably be easier. So with population data, data from every single individual in the population are available, and that's called a census. I knew a person who decided to do a survey of every single professor at a college. She didn't take just some professors from each department; she sent the survey to every single professor. So she did not use a sample — she used a census. With sample data, on the other hand, the data are only available from some of the individuals in the population. If we go back to that researcher, if she had only taken some of the professors from the college's email list, then she would have been surveying a sample. And that's actually very common in research studies, especially with patients. Why would you need to go get every kidney dialysis patient, for example, and study every single one? You only need a sample. And why is that? Because we have statistics.

So I'm going to give you a few examples of real population data in healthcare. You're probably familiar with Medicare: Medicare is the public insurance program in the United States for elders. Even my grandma was on Medicare when she was alive, and she was not a US citizen — she was from India.
So we really do a good job of covering our elders in the US with Medicare. In fact, I even read a statistic that said almost 100% of people aged 65 and over are in Medicare. So if you download data from Medicare — they make it confidential; they strip out all the personal identifiers — there's this thing called the Medicare claims data set. Every single transaction that happens is in there: if you're in Medicare and you go get some treatment, that's in there. It has all the insurance claims filed by the Medicare population, and because it has everybody, anything that comes out of it is population data. Also, in the United States, every 10 years the government hires a bunch of people to go out and survey people, and it also sends out a bunch of surveys, and the idea is to try to get every single person in the United States to fill out that survey. That's called the United States Census.

Now I'm going to give you sort of the mirror image: sample data. Remember how I was just talking to you about Medicare? People who are enrolled in Medicare are called Medicare beneficiaries, and Medicare cares what they think. So they do a survey of a sample of individuals on Medicare, and they do this fairly often — I think once a year — and it's a phone survey. They only do a sample, because they're going to use statistics to extrapolate that knowledge back to the population of Medicare beneficiaries. Also, as you noticed, the United States Census only takes place every 10 years. Do you think changes happen in between? Yep, lots of changes. Just think about Hurricane Katrina — that's very sad; it changed the population distribution in Louisiana very, very dramatically, and in other states around there too. So how do they keep up? They use the American Community Survey. The government — the United States Census Bureau — does this, and it's done by phone and conducted yearly. It's a sample, so the US doesn't know exactly how many people are in Louisiana or anywhere else, but they can use statistics to extrapolate that from the sample in the American Community Survey.

I want to do a quick shout-out to statistical notation. From now on, when we see a capital N — let's say you see capital N = 25 — you can assume those 25 are a population. That's just a kind of secret code we use in statistics. However, if you see a lowercase n — n = 25, lowercase — then you can assume this is a sample from a population. Again, it's a kind of secret code, and you have to pay attention: when I'm just talking and I say "n," you can't hear uppercase or lowercase, so you don't know whether I'm talking about a population or a sample.

Now I'm going to get into the concept of parameter versus statistic. Notice that the word parameter starts with P, just like population: a parameter is a measure that describes the entire population. So, for instance, anything that comes out of that whole Medicare claims data set, or that whole United States Census, would be a parameter. On the other hand, statistic starts with S, just like sample: a statistic is a measure that describes only a sample of a population.
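Here's a small sketch of that parameter-versus-statistic distinction, with the N/n notation attached. The ages are invented; the point is only that the measure computed on the whole population (capital N) is the parameter, and the same measure computed on a sample of it (lowercase n) is the statistic.

```python
import random

# Pretend this list is an entire (tiny) population of Medicare beneficiary ages.
population_ages = [67, 70, 72, 75, 68, 81, 79, 66, 74, 88, 69, 73]
N = len(population_ages)                   # capital N: population size
parameter_mean = sum(population_ages) / N  # a parameter: describes the whole population

# Draw a random sample from that population.
sample_ages = random.sample(population_ages, 5)
n = len(sample_ages)                       # lowercase n: sample size
statistic_mean = sum(sample_ages) / n      # a statistic: describes only the sample

print(f"Population parameter (mean age, N={N}): {parameter_mean:.1f}")
print(f"Sample statistic (mean age, n={n}):     {statistic_mean:.1f}")
```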
Here again we have a situation where a word gets used loosely — you hear "statistic" on the news daily. Sometimes I hear something like, "Look at the rate of HIV in Africa — it's going up. That's a terrible statistic." I agree, it's terrible. But they mean parameter, because they're talking about all of Africa — every single person in Africa. If the rate of HIV is going up in Africa as a whole, they mean a parameter, not a statistic.

So here's an example of a parameter and a statistic based on the same population. The mean age of every American on Medicare is a parameter — that's every single person. However, remember the Medicare beneficiary survey? That's just a sample, so if we took the mean age of those people, we would have a statistic. Again, you just have to pay attention, because on the news you'll hear the word statistic used to mean both parameter and statistic. But when you're practicing in the field of statistics, it's very important to point out whether the number you're talking about comes from a population or from a sample. So you should really use the terms: "this is a parameter" if it's from a population, or "this is a statistic" if it's from a sample.

And again, don't get confused. If you're listening to someone talk in a lecture or a video, you might want to look for clues about whether a number is a population parameter or a sample statistic. If you hear that the data set they used encompasses an entire population — and usually that's the kind of thing done by governments; remember, the rate of HIV in Africa would probably come from governments, or the United Nations, or the World Health Organization — then you're probably looking at a population parameter. A clue that someone is talking about a sample statistic is if you hear them describe a study that recruited volunteers: if it's volunteers, they didn't get everybody in the population, so it's going to be a sample. Same with surveys — surveys about who people are going to vote for, public opinion surveys — they're never going to ask every single person in the state who they're going to vote for; they'll just ask a sample. So if you hear about a survey, you might even hear them say n equals maybe a few thousand people, because that's all they surveyed, and that's a clue that we're talking about a sample statistic rather than a population parameter.

Now I'm going to talk about the difference between descriptive statistics and inferential statistics. But first, let me remind you what the word infer means. To infer means to get a hint from something indirectly; it's kind of the complement to imply. If my friend implied that I should not call after 9 p.m., and I figured that out, I would say I inferred that I should not call my friend after 9 p.m. So inferential is what I'll talk about next, but first, descriptive. Descriptive is pretty easy, because you can do it with samples and you can do it with populations — well, with variables from samples and populations. Descriptive statistics involve methods of organizing, picturing, and summarizing information from samples and populations. It's basically just making pictures of the data — look at that bar chart; that's just a simple picture, and it can be made with just about any data: data from surveying people at work, data from surveying your friends about what they're going to bring to the potluck.
Any of that can be used — you can go download the census data and make descriptive statistics out of it. But there's something very special about inferential statistics: it involves methods of using information from a sample to draw conclusions regarding the population. Therefore, inferential statistics can only be done on a sample, and that's why it's called inferential — the sample is going to give a hint about what the population is. It's not going to say it directly, which is annoying, right? But that's that uncertainty thing I was telling you about. The sample is going to imply something, and we're going to infer something from the sample about the population. That's what inferential statistics is: you take a sample and you infer something about the population. Descriptive statistics is more loosey-goosey: you can do it with samples and with populations, and basically just make pictures out of them. So in statistics it's really important to properly identify measures as either population parameters or sample statistics, because, as you can see, you can only do inferential statistics on samples, and different types of data are used for parameters versus statistics.

Alrighty, now we're going to get into classifying variables into different levels of measurement. Remember our variables: we have individuals, and then we have variables about them. Those variables can only fall into two groups — quantitative versus qualitative — and then, depending on which group they fall into, you can further classify them as interval versus ratio, or nominal versus ordinal. I'm going to give you some examples of how to classify a few types of healthcare variables. Alrighty, so I like to draw this picture: it's a four-level data classification, and I'll build it slowly here for you. We start with human research data — that's where I like to start — and then we split it into two.

Remember, I said we're going to start by talking about quantitative. Another word often used for that is continuous, but we're going to use the word quantitative. What does it mean? A quantitative variable is a numerical measurement of something — a classic example is temperature. So it's something with a number in it; I always think, if I can make a mean out of it, it must be a quantitative variable. Here are some examples of quantitative variables. Time of admission: imagine you work a shift in the ER from maybe 8 p.m. to midnight, so you have those four hours. You could say what the average time of admission was for the people who got admitted to the hospital — somebody got admitted at eight o'clock, somebody at 8:15, and so on; you could put that together and report the average time. Also, if you were doing a study of patients with a particular condition, like Alzheimer's disease, you could ask them their year of diagnosis and make an average of that, so you know that's quantitative. Systolic blood pressure is also numerical, and so is platelet count, and these are variables we run into all the time in healthcare. So we've said that this side is quantitative. Now we'll get back to our picture — so that's one side.
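As a quick illustration of that "can I make a mean out of it?" test, here's a sketch with made-up systolic blood pressure readings — a quantitative variable, so averaging it is meaningful.

```python
# Made-up systolic blood pressure readings (mmHg) -- a quantitative variable.
systolic_bp = [118, 132, 125, 141, 110, 128]

# Because the values are numerical and the differences between them are meaningful,
# taking a mean makes sense.
mean_bp = sum(systolic_bp) / len(systolic_bp)
print(f"Mean systolic blood pressure: {mean_bp:.1f} mmHg")
```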
So what if it's not quantitative? What else could it be? Well, the only other category it could be is categorical, or qualitative. I use the term qualitative, but some people use the term categorical, and that's essentially what it is: a quality or characteristic of something, like sex or race. Here are some qualitative variables in healthcare. Type of health insurance — whether you're on Medicare or Medicaid or different types of private insurance — those are all just categories; you can't make a mean out of that. Also country of origin: if there are international students in a group of students, what countries are they from? You can't make a mean out of that either. You also have situations where numbers are involved, like the stage of cancer — that's depressing — stage one cancer, stage two, stage three. You can never make a mean out of the stage of cancer; you wouldn't say the mean stage is 1.4 or something like that. It's just a category. And of course stage four is a lot worse than stage one — they're not equal categories, but they're still categories. Same with trauma center level — a level four trauma center — you wouldn't make a mean out of the number after the words "trauma center." But you could say that, in this state, so-many percent of our trauma centers are level four trauma centers. So it's really just a categorical variable, even though there's a number involved.

Alright, let's get back to our diagram. We figured out how to take any variable and first split it into one of two categories: it's either quantitative, if it's numerical, or qualitative, if it's a characteristic. Now we're going to concentrate on quantitative, because we're going to separate those variables into two further categories. The first one is interval, and the second is ratio. If you decide a variable is quantitative, then it can be interval or ratio — but not if it's qualitative; a qualitative variable doesn't get to do that. So let's look at interval versus ratio. On the left side of the slide we have interval, which is quantitative and where the differences between data values are meaningful. Ratio has the same property: the differences between the data values are meaningful. What do I mean by that? Remember how I said that level one trauma center and level two trauma center are really categories, not quantitative values, because the differences between them are not actually equal? It's especially clear with job classifications that go 1, 2, 3, 4 — like Nurse One, Nurse Two, Nurse Three, Nurse Four. I worked at a job where we had Office Specialist One, Office Specialist Two, and Office Specialist Three, and the deal for going from Office Specialist Two to Office Specialist Three was really hard — you had to do a lot — but going from One to Two wasn't that hard. So that was a categorical variable, because the differences between the values weren't meaningful: the gap between OS One and OS Two versus OS Two and OS Three wasn't equal. Whereas when you're dealing with a quantitative variable, regardless of whether it's interval or ratio, you're talking about things like years or systolic blood pressure — one year for you is one year for me. So that's fine, right?
But here's where the difference comes in between interval and ratio. All quantitative variables have meaningful differences between their data values, but the hair-splitting distinction is this: in interval data, there is no true zero, and in ratio data, there is a true zero. Here's how I think about it. Interval means something like a space between two things — think of the word intermission, which is kind of like an interval: an interval of time during a show when you get up, go to the bathroom, and get some coffee. Something that is a space in between doesn't have a zero; it doesn't really start anywhere or end anywhere — it's in between. Whereas ratio — I don't know if you remember from high school, but you can't have a zero on the bottom of a ratio or a fraction. That's my mnemonic: the word ratio makes me think about zero, and ratio data are the ones with a true zero.

How does this work out literally? I'll show you. Let's go back to those examples of quantitative variables, because those are the only ones we have to make this interval-or-ratio decision about. Remember, ratio has a true zero — remember the little mnemonic about not dividing by zero. Well, think about it: it's not very pleasant to have a systolic blood pressure of zero, because you'd be dead — same with a platelet count — but it is possible, right? Whereas with interval data, there's no true zero: with time of admission, or year of diagnosis, there's no year zero. So as you probably guessed, ratio is where it's at in healthcare. There aren't a whole lot of times when we have interval data, but we do — anytime you have a time. So keep that in mind: if you want to split your quantitative variables into interval or ratio, the thing to check is whether there is a true zero or not.

Okay, here's our handy-dandy diagram. We've just gone through the branch of the tree that classifies quantitative data into interval versus ratio. Now let's pay attention to the other side of the tree: qualitative. How do we split those? We split them into nominal versus ordinal. Nominal applies to categories, labels, or names that cannot be ordered from smallest to largest. I think of advertisements that say, "for a nominal fee, you can do this" — it means it's small, like there's almost no difference. That's how I remember it: with nominal data there's no ordering from smallest to largest; the categories are just treated as equal. Ordinal, on the other hand, applies to data that can be arranged in order, in categories. But remember what I was saying about quantitative data — this is still not quantitative, because the difference between the data values either cannot be determined or is not meaningful. Like with cancer: going from stage three to stage four is materially different from going from stage one to stage two, so you really can't quantify those differences. So that's ordinal: categories that can be ordered from smallest to largest.
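To pull the whole four-level tree together, here's a sketch of the decision logic as code. The questions mirror the diagram — numeric or not, true zero or not, natural order or not — and the example variables in the comments are just illustrations.

```python
def classify(is_numeric: bool, has_true_zero: bool = False, has_natural_order: bool = False) -> str:
    """Classify a variable's level of measurement using the four-level tree."""
    if is_numeric:                                   # quantitative branch
        return "ratio" if has_true_zero else "interval"
    else:                                            # qualitative branch
        return "ordinal" if has_natural_order else "nominal"

print(classify(is_numeric=True, has_true_zero=True))        # e.g., platelet count -> ratio
print(classify(is_numeric=True, has_true_zero=False))       # e.g., year of diagnosis -> interval
print(classify(is_numeric=False, has_natural_order=True))   # e.g., cancer stage -> ordinal
print(classify(is_numeric=False, has_natural_order=False))  # e.g., insurance type -> nominal
```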
So remember our old friends from before — those examples of qualitative variables in healthcare? Well, let's reflect on them. Nominal cannot be ordered, so that would be type of health insurance and country of origin, because those categories are all treated as equal. Ordinal, on the other hand, has a natural order, even though the differences between the levels are not meaningful — which is what makes it so different from a quantitative variable, and why it stays on the qualitative side of the tree; it just gets labeled ordinal. So if you think you have a qualitative variable on your hands, look for a natural order: if there is one, it's ordinal, and if not, it's nominal.

So all data can be classified as quantitative or qualitative. If you have a variable, that's the first split you make, and once you've done that, you can further classify it as interval, ratio, nominal, or ordinal. It's really important to know how to classify data in healthcare, as you'll find out later, because depending on how you classify a variable, you may be able to do different things with it in statistics.

Alrighty, so what we went over was the definition of statistics, and we talked a little about why and how you use it, especially in healthcare. We went over what it means to talk about a population parameter and a sample statistic, and we went over some examples of them. Then we talked about classifying variables into the different levels of measurement, with a few examples there too. So I hope you enjoyed my lecture.

Greetings, this is Monica Wahi, lecturer at Labouré College, bringing you your lecture on section 1.2, on the topic of sampling. Here are your learning objectives for this lecture. At the end of this lecture, the student should be able to: define sampling frame and sampling error; give one example of how to do simple random sampling and one example of how to do systematic sampling; explain one reason to choose stratified sampling over other approaches; state two differences between cluster sampling and convenience sampling; and give an example of a national survey that uses multistage sampling.

So let's jump right into it. In this lecture we'll go over sampling definitions, and then the different types of sampling I mentioned in the learning objectives: simple random sampling, stratified sampling, systematic sampling, and then convenience sampling and multistage sampling. Let's start with some sampling definitions. What is a sample? We're going to revisit that concept from the previous lecture. We're also going to talk about sampling frames, what errors mean, and errors related to sampling frames, and then we'll go back over all of that to make sure you understand it before we move on to the different types of sampling.

We take a sample of a population because we want to do inferential statistics — remember, we want to infer from the sample to the population — and it's just not necessary to measure the whole population. It would be impractical, and it would cost a lot. And actually, if you ever do an experiment where you really do measure the whole population, you'll find that if you had taken a reasonably good proportion of the population, that's all you really needed.
So ultimately, we save resources, especially in healthcare, when we do a good job of sampling and use the sample to infer to the population, rather than having to take a census of the whole population every time. That brings us to the concept of a sampling frame. The sampling frame is the list of individuals from which a sample is actually selected. The list may be a physical, concrete list — a list of students enrolled at a nursing college, or, as in my other lecture, the list from human resources of the nurses who work at Massachusetts General Hospital. Or it could be a theoretical list, like the list of patients who present to the emergency department today: obviously, when you go into work at the beginning of your shift, you don't know who's on that list yet, but it's still a list. Whatever that list is, that is your sampling frame — those are the people who actually could be selected for your study. So the sampling frame is the part of the population from which you draw the sample, and you want to set things up so that everybody in your sampling frame has a chance of being selected. In other words, you don't want to leave anyone who should be in your sampling frame out in the cold.

That leads us to the concept of undercoverage. What is it? It's omitting population members from the sampling frame: they're supposed to be on the list, but they're not there. How can this happen? Say you did what I suggested on the previous slide and got a list of nursing students from a college. Maybe somebody enrolled that day, or was just admitted that day, and didn't make it into the database in time — you'd be missing them. Or take that HR list from MGH: I know how nurses are — sometimes they temp in different places and aren't on the payroll because they work through a temp agency, and then we would miss those nurses from the sampling frame. And people who present to the emergency department at night might be different from those who come in during the day, so if you're really trying to sample people who present to the emergency department, you can't just look at a small window of time; you have to cover the whole 24-hour cycle. So if you omit population members from your sampling frame, they don't even get a chance to be in your sample, and that's called undercoverage.

Now I'm going to shift around — we're jumping between a few different definitions — and talk about errors. This is something that took me a while to get used to in statistics: there are actually two kinds of errors. The first kind I call — and this is my own terminology — a "fact of life" error. It's just an error that happens when you do statistics; it's not bad or good, it's just what happens. The one I'm going to describe here is called sampling error. Sampling error simply means that the population mean will be different from your sample mean, and the population percentage will be different from your sample percentage. What does that mean? Think about it this way: I said I could cut corners and just take a sample to infer to the population.
Well, if I actually do one of those experiments I was telling you about — where I have the population data, and I take a sample and compare the means — they will be different. There might be some huge coincidence where they're the same, but typically they're different. The same goes for percentages. And we just know this is going to happen; in statistics we account for it, and we have ways of dealing with it. But we know there will always be sampling error whenever you take a sample from a population and compute a mean or a percentage: the value in the sample is just not going to be exactly what's in the population, and that's fine.

But then there are other errors in statistics that are actually bad — they mean you made a mistake, literally a mistake. So as you learn statistics, it's almost like you have to sit down and ask somebody: is this one of those fact-of-life errors, or is this one of the errors you want to avoid? We just talked about sampling error; that's a fact-of-life error. An error you want to avoid is non-sampling error. That's basically using a bad list. I had an example in my own life where I wanted to study a whole bunch of providers, and a friend gave me a list of providers and said it was the entire list of providers in a particular professional society. But when I emailed that list, I found there were not only duplicates on it, but a lot of people emailed me back and said, "Why are you sending this to me? I'm not a provider; I'm not part of this professional society." And some people who were in the society, and had heard about the survey, emailed me to ask why they hadn't gotten it. So this was a bad list: some people had been left out of the sampling frame — people who were in the society somehow weren't on my email list — and that's a problem. This was a mistake I made. You have to pay careful attention that everyone in the population who is supposed to be represented in your sampling frame is actually there. I should have done a better job of calling the professional society and making sure it was a good list.

So sampling error is caused by the fact that, no matter what you do, your sample will not perfectly represent the population. Non-sampling error, on the other hand — yeah, I was sloppy. It comes from poor sample design, sloppy data collection, inaccurate measurement instruments, bias in data collection, and other problems introduced by the researcher. Non-sampling error is your fault; sampling error is just a fact of life.

A little whiplash here: we're now going to move on to the concept of simulations. A simulation is defined, technically, as a numerical facsimile or representation of a real-world phenomenon — it's like working through a pretend situation to see how it would come out if it were real. When you study statistics, you end up doing a lot of simulations. And remember the experiment I've been describing: if you somehow did a census and had data on a whole population, you could take a sample from that population and compare its mean to the population mean to see the sampling error. That's an example of a simulation.
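Here's a minimal version of that simulation: pretend you have census data on a whole population, take a random sample, and compare the sample mean to the population mean. The gap you see is sampling error, and it shows up even though nothing was done wrong. The ages here are invented.

```python
import random

random.seed(1)  # so the simulation is repeatable

# Pretend census: ages for an entire (small) population.
population = [random.randint(18, 90) for _ in range(1000)]
population_mean = sum(population) / len(population)

# One random sample of n = 50 from that population.
sample = random.sample(population, 50)
sample_mean = sum(sample) / len(sample)

print(f"Population mean: {population_mean:.2f}")
print(f"Sample mean:     {sample_mean:.2f}")
print(f"Sampling error:  {sample_mean - population_mean:+.2f}")
```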
So, to conclude this little section: it's really important to do your best to avoid non-sampling error, and that is achieved by making sure you do not have undercoverage when sampling from your sampling frame. That ties together some of our vocabulary. But just remember: sampling error is a fact of life.

Okay, now we're going to talk specifically about different types of sampling, starting with the simple random sample. First we'll explain what is meant by simple random sampling. Then we'll talk about two different methods of doing it — they work the same way and achieve the same thing; it's just that, depending on how you're doing your research, one may be more convenient than the other. Finally, we'll go over the limits of simple random sampling, because all of these sampling methods seem perfect until you look at their limitations.

Let's first define simple random sampling. Here's the definition: a simple random sample of n measurements from a population is a subset of the population selected in such a manner that every sample of size n from the population has an equal chance of being selected. That sounds complicated, but what it means is that if you use the proper approach, whatever sample you end up with, you could just as easily have gotten a different batch — a different group of people — from that population. In other words, say you have a list of the population of students in a class (I'm defining the class as the population), and you want to take a sample of five students from it. If you take a simple random sample, all the different groups of five students you could pick from the list have an equal chance of being the group you actually pick.

Now, imagine you race into the class right at the beginning and take your sample of five before everybody has arrived. What does that sound like? Right — a sampling frame problem, maybe an undercoverage problem, maybe bias creeping in. So if you're going to do simple random sampling, you have to be careful to start with a list that has everybody in your sampling frame on it, because every sample you could possibly take should have an equal chance of ending up being your sample. I'll explain this further by walking through the two different methods of obtaining that sample.

One of the best things you can do is start with a really good list of everyone in your population. I used to work for the Army, so say I was going to study everyone on active duty in the US Army. I would want a list of all of those people from an accurate source at the Army, and I would want them to have unique IDs — and that's true in the Army; everybody has a unique numerical ID. So what I would do — or here, if you were looking at students, you'd maybe use the student IDs — is take the IDs of everybody on the list, print them out, cut them up, and put them in a hat, or a bag you can't see into. You mix them all up, you don't look, and you draw five of them out — like in the picture, where they mixed up all the slips of paper and they're drawing a few out without looking. So what did you just do? First of all, you made sure that everybody in the population had an ID number.
And when you printed them out and cut them up, you didn't lose any — if you drop some on the floor, that's not a simple random sample. You have to keep all of them, put them all in the hat, not look, and draw your five or however many, because then any five of those slips of paper could have been the ones drawn, and that's what we mean by simple random sampling. Okay, so that method works.

Another method that works — and may work better if you can't do the ID-on-paper thing — is to simply make your own list of unique random numbers and assign them to the population. A great example: you're teaching kids and you want to put them in a random order, maybe for a game. Say you have 10 kids. You number slips one through 10, put them in the hat, pull out the first number — say it's five — and give it to the first kid. You keep pulling out numbers and handing them to the kids, and then tell them to stand in numerical order. So you generate a list of random numbers as long as the list of the population. I said 10 kids, but if you have 500 names, you get 500 numbers. They don't have to be one through 500; they just have to be unique. I like smaller numbers, so I'd keep them small, but you can do what you want. In any case, you randomly assign these numbers — you can use the hat; I'm big on hats — to the population, and then you ask them to stand in order, or you figure it out like a raffle: you call out, "Who's got number one?" and whoever says yes gets to be in your study. You take the first five numbers in that order, and that achieves the same thing as the first method — you get a simple random sample. It's just two different ways of doing it.

So ultimately, a simple random sample means that the sample — this group of people, or group of whatever — had an equal chance of being selected out of the hat. You'll see that the picture on the left is bingo; some of you may play bingo. They pull balls out and call off the names on the balls, and each ball has a unique letter-and-number combination on it. That's how they make it random: they take a simple random sample of the bingo balls each time they play. So I described the first method, using an old-fashioned hat. The second method — where you generate your own unique numbers, assign them to things, and sort by them — is my electronic hat. That's how I handle it if, for example, somebody sends me an Excel sheet with a list of hospitals: I assign each hospital a random number, sort them in that order, and sample the top few. That's how I get a simple random sample of hospitals — and that way I'm not biased, picking out my favorite hospitals where all my friends work. Whether I use the first method or the second, all members of the population have an equal probability of being selected into the sample, and, more importantly, all possible samples — all possible groups — have an equal chance of being the one selected.
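Here's a sketch of that "electronic hat" method: assign each individual a random number, sort by it, and take however many you need off the top. The hospital list is made up for illustration.

```python
import random

hospitals = [f"Hospital {i}" for i in range(1, 21)]  # a made-up list of 20 hospitals

# Assign every hospital a random number (the electronic hat) ...
tagged = [(random.random(), name) for name in hospitals]
# ... sort by that random number, and take the top 5 as the simple random sample.
tagged.sort()
sample = [name for _, name in tagged[:5]]

print(sample)

# random.sample(hospitals, 5) would accomplish the same thing in one call.
```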
Of course, I only did it once, so I only got one of those samples, but the other samples that weren't selected had an equal chance of being selected.

Alright, you probably already saw the limits: this whole "list" business. Even if I'm sampling hospitals, I still need a list of hospitals to sample from. And you may not know who's going to show up in the emergency department on a given day — if you do, you're psychic, and most people are not. So how would you sample those patients using simple random sampling? Simple random sampling is fine when you've got a list, like a list of hospitals, but it's not so good when you don't know who's going to show up that day. And even when you do simple random sampling, you need a good list. I made a mistake once where I surveyed a bunch of professionals using a professional society list, and when I sent out the survey, I learned that there were people on the list who were no longer part of the society — it was an old list — and, more importantly, there were people who had joined the society who had never made it onto the list. So I was getting undercoverage. Or, if you were doing a study with students, what if the list left off the part-time students? You'd be missing them. These are great examples of non-sampling error. So if you're going to do simple random sampling, you do need a list, and you really want to research it and make sure it's the best list possible.

So I just went over the characteristics of simple random sampling, two different methods you can use to sample from a list, and the limits of the approach. Now we'll talk about a different kind of sampling: stratified sampling. We'll go over what it is; and just as simple random sampling had its steps, there are different steps in stratified sampling, and I'll give you some examples; and then, of course, just like simple random sampling, stratified sampling has limitations, and I'll talk about those.

First, let me remind you what stratified means, or what strata are. The singular is stratum, and more than one are strata. You see that rock on the slide, with the big horizontal lines across it? Each of those layers is a stratum; together they're strata of rock. If you study geology, the geologists will explain that where those breaks are, something happened — often in the weather or the environment. The reason I put this picture up is that I want you to imagine those layers, because that's what we do in stratified sampling: first we divide our list — of course, you need a list — into layers. Remember how I was just talking about simple random sampling of hospitals? I could take that hospital list and divide it into layers by, for example, how close the hospitals are to a city: urban, suburban, and rural. I could put them into those strata first, and if I did that, I'd be doing stratified sampling. Same with students: I could divide them into first-year nursing students, second-year students, and so on, so they're divided into strata first. So why would you do that? Why not just do simple random sampling?
Well, think about it: say you've got a class like statistics, and maybe there aren't that many first-year students in it — a very small proportion. If you do simple random sampling, you might, just by luck, miss all of them. So if you're really concerned about what a minority group thinks, you can make sure to get representation from that stratum by doing stratified sampling, because the first thing you do is divide the list into groups, and then you take a simple random sample from each of the strata.

So here are the steps, with a code sketch after the examples below. Step one: divide the entire population — the whole list you have — into distinct subgroups called strata. Each individual has to fit into exactly one of those categories, so if somebody is halfway between first year and second year, or a hospital is kind of on the border, you have to choose and put it in one group. Step two — well, it's not really a step, but you have to think about what the strata are based on: one specific characteristic, such as age, income, or education level. A great example: income is a quantitative variable, but you can put people into strata — less than a certain amount, then that amount up to the next cutoff, and so on — and make four or five strata. Then you make sure all members of each stratum share that characteristic. And then the last step is to draw a simple random sample from each stratum. So in the case I was describing, where a class has very few first-year students, if you take a random sample of five from each stratum, you're effectively giving extra votes to a small minority — you're treating them equally even though there's a much bigger group you're also taking exactly five from. But that's the trade-off you accept, because you want to make sure you hear from that small group. If you just did simple random sampling with a group that small, you might accidentally miss it entirely.

Here are some examples of stratified sampling. You'll see it in the Youth Risk Behavior Surveillance surveys they do in high schools: they stratify by grade, because if they did a simple random sample — and a lot of students drop out in junior and senior year — they'd probably get too many freshmen and sophomores. So they take a certain amount from the freshman class, a certain amount from the sophomore class, a certain amount from the junior class, and a certain amount from the senior class, so they have enough of each to make good estimates. And in hospitals, they often sample providers from each department. If they're asking about provider satisfaction, or about a policy, they won't just do a simple random sample of providers, because they might, for example, miss everybody in the ICU. Or if you're studying ICUs and there are multiple ICUs, you might want to stratify by ICU — even if one of them is smaller — just to make sure you have solid representation from each one. So those are the reasons that push you toward stratified sampling. It's not always necessary, but when you have these distinct groups, especially with a small one involved, and you want to hear from everybody, you really want to consider stratified sampling.
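Here's a rough sketch of those steps in code: split the list into strata by one characteristic, then take a simple random sample of the same size from each stratum. The student roster and the strata are invented for illustration.

```python
import random

# A made-up roster of (student, year) pairs. First-years are the small group.
roster = [("Student %02d" % i, "second-year") for i in range(1, 26)]
roster += [("Student %02d" % i, "first-year") for i in range(26, 31)]

# Step 1: divide the population into strata based on one characteristic (year).
strata = {}
for name, year in roster:
    strata.setdefault(year, []).append(name)

# Last step: draw a simple random sample of the same size from each stratum,
# which guarantees the small first-year group is represented.
stratified_sample = {year: random.sample(names, 3) for year, names in strata.items()}
print(stratified_sample)
```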
So of course, there are limitations, and I've been leading up to this. What you end up doing is oversampling one of the groups — usually the smallest one — if you take the same number of people from that stratum as from the big stratum. It's like the smallest group gets these extra-powerful votes and the biggest group's votes are weaker: they're treated as equal when they're not actually equal in the population. But that's the way it goes. In higher-level statistics there are ways to adjust back for that — to take a penalty for it and extrapolate back to the population proportions. It's possible, but it takes some post-processing; that's the issue. Also, like simple random sampling, it's not really possible without a list beforehand, and it's harder to do, because you actually have to split the list into strata. Say I had those hospitals but didn't know where they were — whether they were urban, rural, or suburban. That adds another layer of complexity to the whole stratified sampling exercise.

So, in summary, I went over what stratified means — putting things into groups and then sampling from each group — and I described the steps involved. A stratified sample goes a lot more easily if the strata happen to be roughly equal to begin with. In the high school example, there are usually slightly fewer people in junior and senior year, but it's fairly close. And if you're comparing ICUs, it's nice if the ICUs are roughly the same size, because then you don't have to worry about one of them being smaller but getting an equal vote.

Alrighty, now we're going to move on to systematic sampling. Systematic sampling can actually be done with or without a list, so it's a little more flexible than the kinds of sampling we've been talking about. It's easier for me to define it by describing the steps you go through, so I'll just explain how to do it, and then you'll understand — in fact, you'll understand why it's called systematic. Whether you have a list or not, step one is to arrange all the individuals of the population in a particular order. If it's a list, you put it in whatever order you want. But if we're talking about, say, patients coming into the ER, they arrive in whatever order they arrive — they're already arranged in a list; you just don't know what that list is yet. Step two is to pick a random individual as a start. Say I had a list of hospitals sorted by state, and I picked a random starting point — maybe I went seven down the list and picked that hospital. Or maybe you're in the ER, you start your shift, and you pick the seventh patient who is admitted. I picked seven, but you could have picked five or twenty — you just pick a random starting person.
Then the next step, step three, is to take every kth member of the population into the sample. Now, don't try this in Scrabble — "kth" is not a word in Scrabble. It's just a word in statistics-ese, spelled k-t-h, and it means every so many. So let's pick a number and fill it in for k — say three. After you pick your first hospital from the list, or your first patient in the ER — it doesn't matter what number you chose for the starting point — you take every third one after that. Every third patient who comes in after that, you ask whether they want to be in the study; or every third hospital after that original random one, I say, okay, this is going to be part of my systematic sample. So as you can see, it's pretty simple to do — easy with a list, easy without one. The deal is: first you pick a random place to start, then you pick k, and then you just keep going, every so many. You could do this with classes: take the list of classes available at your college next semester, sorted some way, pick a random number like three, go to the third class and circle it, then pick another number like five and take every fifth class after that — so after the third one you count 4, 5, 6, 7, 8, then 9, 10, 11, 12, 13, and keep picking classes. Okay, this is not career advice — do not pick your classes that way; it was just an example.

Alright, and as you probably guessed, I'm going to be a negative Nelly again: there are problems with systematic sampling. If things are already arranged boy, girl, boy, girl, for example, and you pick an even number for k, you're going to get all boys or all girls. I noticed this when I was doing a study in a lab. We thought some of the assays being run through the machines weren't running right, so we wanted to take a sample, and I wanted a systematic sample — every seven days, which is a week. So I asked my colleague: does the lab vary day by day in which assays it runs? Because if it saves up the sexually transmitted disease assays and runs them all on Friday, and I'm sampling every Friday, that's all I'm going to get, right? That's actually called periodicity. You don't have to remember that — I don't think I've ever even seen it written down; I just remember the lecturer in my class warning us that it's what you have to worry about with systematic sampling. It's not a very common problem, though. What's great about systematic sampling is that you can do it in a clinical setting. You can sample patients that way as they come into a clinic, a central lab, or the emergency room. That's why it's a particularly powerful way to sample: if you have an ongoing influx of patients, then when you design your research, once you decide how many people you need to recruit, you can simply use systematic sampling and have somebody in the clinic invite every kth person who qualifies — every kth patient who qualifies — into your study. So systematic sampling is easy to do, with or without a list: you pick a random starting point, and then you take every kth individual.
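Here's what that looks like as code: pick a random starting point, then take every kth individual after it. The patient list and the choice of k = 3 are just for illustration.

```python
import random

# Patients in the order they arrive at the ER (made up).
patients = [f"Patient {i}" for i in range(1, 31)]

k = 3                                    # take every 3rd patient...
start = random.randrange(k)              # ...after a random starting point
systematic_sample = patients[start::k]   # slicing: the start, then every kth after it

print(systematic_sample)
```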
So what is up with cluster sampling? Why do we even need other kinds of sampling? I just went over so many kinds. You could use stratified, systematic, or simple random sampling; why would you need another kind? Well, cluster sampling is special, because it's the kind of sampling you use when you think there's a problem at a particular geographic location. Typically that's how cluster sampling is used, and I'll explain it further. Imagine, for example, a particular factory that is believed to emit fumes that cause problems with people's health. You can't do simple random sampling all over the nation, right, or you won't even get people near that factory, and you can't easily do stratified or systematic sampling there either. Cluster sampling is what's designed for studying something that's coming from a geographic location. When you do cluster sampling, you start by dividing a map into geographic areas. I'm from Minnesota, and I know there was a mine there with vermiculite in it, and it was contaminated; a lot of people got sick from it, but they didn't know that's what was going on. So first, I think, they divided Minnesota into different geographic areas. After dividing the area into these different geographic areas, some with the bad thing in them and some without, you randomly pick clusters, or areas, from the map. If you look there on the screen, there's a map of the state of Virginia, and it's all been divided into different groups, and one cluster is highlighted. You usually pick more than one cluster; sometimes it's only four or five. But the idea is that you try to enroll all of the individuals in each chosen cluster. It's usually people, although you can do it with animals: if there's a disease going around among animals, you would divide the area up into clusters and then try to measure all the animals in the chosen clusters. As you can imagine, not only is this practically difficult, but there are reasons why people live together, right? People live in communities; they don't just randomly scatter themselves. Cultural communities grow, and affluent communities have different people in them than communities that have less money. So sometimes the people located in a cluster are all similar in a way that makes the problem hard to study. This comes up especially if you're studying some geographic thing, like a factory or a sewage plant that you think might be causing cancer. If it's in an area where there's already a lot of pollution from other things, and a lot of low-income people live there, because if you're high-income you can afford not to, well, those people are already being exposed to higher rates of carcinogens and probably have a higher cancer rate. It's hard to tell what the independent effect of that thing in that geographic location might be, because of the other similarities of the people around it. This is why cancer ends up being such a tough nut to crack: where we see high rates, there are often a lot of different geographic issues going on, and cluster sampling doesn't really help tease that out. So, to wrap this up, cluster sampling is used when geography is important. If there is something geographically located in a certain spot and you can't move it, then you're kind of stuck doing cluster sampling.
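And just to make that concrete, here's a minimal sketch of cluster sampling in Python. The area names, the people in them, and the choice to pick two clusters are all invented for the example.

```python
import random

def cluster_sample(clusters, n_clusters):
    """Randomly pick a few geographic clusters, then take everyone inside them."""
    chosen = random.sample(list(clusters), n_clusters)  # randomly pick whole areas
    # every individual inside a chosen area ends up in the sample
    return [person for area in chosen for person in clusters[area]]

# Hypothetical map divided into areas, each listing the people who live there
clusters = {
    "area_1": ["Ana", "Ben"],
    "area_2": ["Cho", "Dev", "Eli"],
    "area_3": ["Fay"],
    "area_4": ["Gus", "Hana"],
}
print(cluster_sample(clusters, n_clusters=2))
```

The key design point is that the randomness happens at the level of whole areas, not individual people, which is exactly why whoever lives in the chosen areas dominates the sample.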
So briefly: the map around that area is divided into different sub-areas, and not all the areas are picked; just a few are randomly picked, and then all of the people in each chosen area are sampled. And of course, it's biased towards the people living in that area. If the area you pick has a bunch of affluent people, you'll get affluent people; if you pick an area with a bunch of immigrants, you'll get immigrants. So cluster sampling is not perfect, but you're kind of stuck with it when there's a situation involving geography. The way I remember it is, when I used to live in Florida, we'd like to drive up to Georgia because they had the best pecan clusters; that's a type of dessert with pecans and caramel and stuff. So when I think of cluster sampling, I think of those pecan clusters that are only really good in Georgia, and that's my way of remembering that cluster sampling has to do with geography. Now I'm finally going to talk about the last two types of sampling that I'm going to cover in this lecture: convenience sampling and multistage sampling. They're both a little quick, so I'm going to cover them quickly. First, we're going to start by talking about convenience sampling. And we like that name, right? It's convenient. Convenience sampling can be used under low-risk circumstances, like when the findings of what you're doing aren't really that important. For instance, let's say you wanted to know which ice cream is the best from the restaurant next to the hospital. A new restaurant opens up, and you're going to go off your diet and get some ice cream, but you don't want to waste it, right? So you want to ask people which is the best one. You might ask your coworkers, you might ask the people at the restaurant, hey, what's the best ice cream? But the results are not so reliable, because you might end up on Yelp and see that other people disagree. So convenience sampling is basically using results or data that are conveniently or readily obtained. For my master's degree, one of the things I did was survey people anonymously who were coming to a health fair. I sat at a booth and gave them the survey, which had a few questions in it. That was definitely a convenience sample: just people showing up for the health fair. And this can be useful when there aren't a lot of resources allocated to the study. I was a starving master's student, right? I didn't have any money, so convenience sampling was perfect for me. Also, the questions I was asking were just about characteristics related to whether or not they had risk for diabetes. I'm not a doctor, and I wasn't going to do anything about it, but it was interesting, so it wasn't a very high-risk survey to fill out. And convenience sampling is convenient because it uses an already-assembled group for surveys, like the one I was doing at the health fair. An example might be to ask patients in the waiting room to fill out a survey, or to ask students in a class. Sometimes when I'm teaching, I'll do a convenience sample of whoever's sitting there. I'll say, hey, is the homework I assigned you this week too hard? Well, it's always too hard; I don't even know why I do the survey. But sometimes as a teacher you'll just want to do a convenience sample to get a gauge on where the class is. But there are problems with it, right?
You can't just use it for everything, even though it's nice and convenient. There's bias in every group, right? So if I let everybody go on break, and then I ask whoever's still sitting there whether the homework is too hard, I might get a totally different answer than if I waited for everybody to come back. And just about any time you waltz into a room, like when I went to the health fair, who do you think is there? A bunch of sick people? No, a bunch of health-minded people. So I'm going to get a bunch of bias. Also, more importantly, when you do convenience sampling, you often miss important subpopulations. Remember with stratified sampling how sometimes people don't group evenly into the different strata? Maybe they do, kind of, in high schools, but especially when it comes to job classifications, there are usually fewer bigwigs than lackeys, right? And if there are just a few bigwigs and you do a simple random sample, you might miss all of them, so maybe you try a stratified sample. On the other hand, if you walk into the break room used by the lackeys and say, hey, I want you to fill out my work satisfaction survey, all the responses you get are going to be from the lackeys. You're not going to get any representation from the upper job classes, because they don't go into that lounge, so you'd be missing them. So that's the main problem with a convenience sample: the results can be severely biased, because you're only asking a small, biased group of people who are probably all alike in some way. It's not a very representative sample. Next, I'm going to talk about multistage sampling. You know, if you have a kid and the kid's crying, and somebody asks what's up, you say, well, the kid's going through a stage. Well, that's exactly what you're doing when you do multistage sampling: you're going through stages. It's basically mixing and matching the different kinds of sampling I just talked about, only you do it in stages: stage one, then stage two, then stage three, then stage four, or maybe even more. And that's how you get your sample. So if you're imagining that you've got to start with a lot of people, you're probably right. Here's an example I made up of a way you could do multistage sampling. You could start with stage one as a cluster sample, remember, where you take out a map and divide it into areas. Let's divide the country into states and take two census regions of states, about 10 states, from those clumps. Okay, now we've limited it to that. Now let's go to stage two of our multistage sampling: from each of those, we take a random sample of counties. So we look at all the counties and take a random sample. Then, after we get those counties, in stage three we take a stratified sample of schools from each county. Some counties will be totally rural, some will be totally urban, but most will have some mix, so we'll take a few schools from the urban stratum and a few from the rural stratum. So stage three is a stratified sample of schools, from the simple random sample of counties, from the cluster sample of states. Okay, now we've got our schools. Stage four could be a stratified sample of classrooms: once we've got our urban schools and rural schools, we could go in there, look at all the classrooms, freshman, sophomore, junior, senior, and take a stratified sample of those.
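If it helps to see the stages chained together, here's a rough sketch in Python of the first three stages of that made-up example (the classroom stage is left off to keep it short). Every state, county, and school name in it is invented, and the numbers picked at each stage are arbitrary.

```python
import random

# Hypothetical multistage sample: stage 1 is a cluster sample of states,
# stage 2 a simple random sample of counties, stage 3 a stratified sample
# of schools (one urban and one rural school per chosen county).
states = {
    "State_A": {
        "County_1": {"urban": ["School_1"], "rural": ["School_2", "School_3"]},
        "County_2": {"urban": ["School_4", "School_5"], "rural": ["School_6"]},
    },
    "State_B": {
        "County_3": {"urban": ["School_7"], "rural": ["School_8"]},
    },
}

stage1 = random.sample(list(states), 1)                 # stage 1: pick a state cluster
sample_schools = []
for state in stage1:
    counties = random.sample(list(states[state]), 1)    # stage 2: random counties
    for county in counties:
        for stratum, schools in states[state][county].items():  # stage 3: urban/rural strata
            sample_schools += random.sample(schools, 1)          # one school per stratum
print(sample_schools)
```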
So it's basically mixing and matching. But you're right, you've got to start with a lot to begin with if you're going to whittle it down over a whole bunch of stages. It doesn't have to be four stages; I just gave you four. Now I'm going to give you a real-life example: the National Health and Nutrition Examination Survey, or NHANES. Definitely not a master's project. This is done by the Centers for Disease Control and Prevention of the United States. What I'm hinting at is that the kinds of places doing multistage sampling are governments. Not only do you have to start with a whole bunch of people and things and individuals, states and schools and what have you, it's also a lot of work to do all that sampling, and it had better be for a good reason. The National Health and Nutrition Examination Survey is a good reason: it's a survey done by the CDC to try to measure America's health. Of course, it's doing inferential statistics; it's taking a sample and trying to extrapolate that information back to the population. So it has to be really careful about how it does its sampling; you can't just waltz in and do a bunch of convenience sampling. This is how it does it, just briefly. In stage one, they sample counties. Then, from those counties, they sample something called segments, which are defined in the census; they're different areas. From those segments, those areas, they sample households, and a household means wherever you live: if you live in a dorm, that's a household; if you live in assisted living, that's a household; an apartment in an apartment building is a household. So they sample those, and once they knock on the door of your household, they sample individuals from the house. So they use four stages of sampling, and that's a real-life example of multistage sampling. So, in summary on convenience and multistage sampling: with respect to convenience sampling, you want to avoid it unless it's a really low-risk question you're asking about, and unless it's really the only type of sampling possible under the circumstances. When you have patients with a very rare disease, convenience sampling from your rare disease clinic is probably reasonable. It's also used when resources are low. Those are a few good reasons to use convenience sampling, but it's really something you want to use only when it's what you're stuck with; it's much better to look towards the other sampling approaches I described. And finally, multistage sampling is usually used in large governmental studies, so don't expect to design anything alone with multistage sampling. When that happens, like those four stages I showed you for that CDC survey, hundreds of people work on it; even just the sampling takes tons of people to set up. It's very difficult. But I wanted you to know about that kind of sampling, because it's important in healthcare and it happens a lot. So, in conclusion, we made it through the sampling lecture, didn't we? I first started by describing some definitions you needed in order to understand all these different types of sampling. Then I went into simple random sampling, showed you how to do it two different ways, what it achieves, and also its limitations. We next talked about stratified sampling, why you do it and how you do it, and the limitations of that one, too.
Then we got into systematic sampling, which is a little more flexible and pretty easy to explain. Next, we talked about cluster sampling and why you might need to pull that tool out of your sampling toolbox. And then, finally, we covered convenience sampling and multistage sampling. All right, well, I hope you better understand sampling now and can keep all of these different types of sampling straight in your mind. Hello, everybody, it's Monica Wahi, your Labarre College statistics lecturer. We're on to Section 1.3, Introduction to experimental design. Here are your learning objectives. At the end of this lecture, you should be able to, first, state the steps of conducting a statistical study, and then select one step of developing a statistical study and state the reason for that step. You should be able to name one common mistake that can introduce bias into a survey and give an example. You should be able to explain what a lurking variable is and give an example of that. And you should be able to define what a completely randomized experiment is. So let's get started. This lecture is going to cover four basic topics. First, we're going to look at the steps to conducting a statistical study. You may think there are a lot of steps to conducting a study; this is from the point of view of the statistician, okay? Then we're going to go over basic terms and definitions. By now you're probably used to the fact that in statistics, certain words are reappropriated and mean something specific, so we'll talk about that. Then we'll talk about bias, what it is, and how to avoid it when designing your studies. Finally, we'll talk about randomization, and in particular the topics you need to think about when thinking about randomization. So let's get started. We're going to start, of course, with basic terms and definitions. First, we're going to review these steps I keep talking about for conducting a statistical study. There's some vocabulary that comes up, so we're going to talk about those vocabulary terms, and then I'm going to give you a few examples from healthcare. So here are the steps I keep talking about; these are the basic guidelines for planning a statistical study. The first thing you want to do is state your hypothesis. And you know, I've been a scientist a while now, and I can't tell you how many times I get into a group of us, and people are all curious, and they start thinking, let's do a study, and it's only halfway through our conversation that I suddenly say, hey, wait a second, we don't have a hypothesis. What's our hypothesis? So it's easy, even for scientists, to forget that that's really step one: you have to have a hypothesis. And whatever hypothesis you pick, the hypothesis is about some individuals. If I have a hypothesis about hospitals, those are the individuals; if I have a hypothesis about patients, those are the individuals. So step two is to identify the individuals of interest, and it's important to actually nail that down, because am I talking about the patients in the hospitals, or am I talking about the hospitals? So make sure that after you percolate and decide on your hypothesis, you understand who the actual individuals of interest are. And that's because you're going to have to measure variables about these individuals. So step three is to specify all the variables you're going to need to measure about these individuals. And of course, they relate to the hypothesis, so it's a good thing that was step one, right?
Step four is to determine whether you want to use the entire population in your study or a sample. If you already have a bunch of data, like the census data, you might as well use the entire population. But typically, if you don't have the data, you're going to want to sit down and think about using a sample, and if you do that, while you're sitting down, you should probably also choose the sampling method on the basis of what I talked about in the sampling lecture. Now that you've figured out your hypothesis, your individuals, your variables, and whether you're going to do a census or a sample, and if a sample, what type of sample, step five is to think about the ethical concerns before data collection. If you're going to be asking sensitive questions, you think about privacy. If you're going to be doing invasive procedures, you think about how painful that would be and how hard it would be on somebody, especially if they're just healthy and you're doing an experiment on healthy people only to better understand biology. So you have to really sit down and think about these ethical concerns, and they may change your study design slightly. Finally, after steps one through five are taken care of, that's when you actually jump in and collect the data. Like I was saying, when I meet with my scientist friends and we get all excited about an idea, we're often talking about step six. We're like, oh, we should do a survey, we should do this, we should do that, and I realize I end up saying, hey, we actually have to go back to step one and start talking about a hypothesis, because I suddenly realize I don't even know what data to collect. If you don't go through the steps in order, you really aren't doing it right. Step seven is, after you get the data, you finally use either descriptive or inferential statistics to answer your hypothesis. That's what statistics is about; it's here for that. And then finally, after you use the statistics, you have to write up what you find, even if you're at a workplace. That happened once when I was working somewhere, and they wanted us to do a survey. Their hypothesis was that they didn't have enough leadership programs and weren't building good leaders they could promote. I was on a team that did the survey. We didn't really publish it everywhere, but we made an internal report, and in that internal report we had to do step eight, which is to note any concerns about data collection or analysis that came up while we were doing the report, and also make recommendations for future studies, in case they wanted to study this in future groups of employees. In science, what it usually ends up being is a peer-reviewed literature report: you do a scientific study, maybe you get a grant, you do all these steps, and then step eight is where you actually prepare a journal publication. In that, you have to note any concerns about your data collection or analysis, anything that might have gone wrong, or not gone exactly the way you planned, or that needs to be taken into account to properly interpret what the study found. You also want to make recommendations for future studies, especially if you screwed something up, or especially if you answered a really good question.
There's no reason to perseverate on that question; why don't we move forward and ask the next one? Now, these are a lot of steps to remember, so I'm going to help you remember them in clumps. Let's look at the first clump, steps one through three: state the hypothesis, identify the individuals of interest, and specify the variables to measure. Let's give an example. Say our hypothesis is that air pollution causes asthma in children who live in urban settings. That's how we'd state it, or we could state it as a research question: does air pollution cause asthma in children who live in urban settings? In that case, the individuals would be children in urban settings, and the variables we'd have to measure are air pollution, at least, and asthma, at least. Of course, we'd want to know more things about these individuals, these children: we'd probably measure their income, where exactly they were living, how old they were, whether they're male or female, and those kinds of things. But that just helps you think about the first three steps together. Now let's think about the second clump of steps, four, five, and six: determine whether you're going to use a population or a sample, and if it's a sample, pick the sampling method; look at the ethical concerns; and then actually collect the data. When you do that, you can either "collect" data, quote unquote, by using existing data, like downloading data from the census or from Medicare, which has datasets available that are de-identified so you don't know who exactly is in there, or you can collect data yourself, like doing a survey or getting a bunch of patients who will allow you to measure them. When you use a government dataset, you often can make population measures out of it, so you don't really have to go through a lot of sampling or ethics work, because they've already provided it for you and it's confidential, and that's kind of your data collection. But most of the time, especially for studying patients and treatments and cures and things like that, those studies are on a smaller scale, so you end up collecting data from a sample. And again, you need to choose a sampling approach, and then you need consent if it's legally found to be human research. I just want to share this in case you didn't know: if you want to do research on humans, whether you're a nursing student, a medical student, a dental student, any student, or a dentist, a physician, a nurse, whatever, you can't just make up a survey or a study design and go out and do it. You have to get approval from an ethical board, and that ethical board will tell you whether what you're doing is legally considered human research and whether you need to get consent from the patients or participants in your study if they're humans. If you're collecting data about children, for example, you have to get the consent of their parents and the assent of the children. In the United States, the way we have it set up, it's called an Institutional Review Board for the Protection of Human Subjects in Research, or IRB for short. So I just want to make sure that if you ever do design a study, you know about this IRB thing, and you realize you have to go through this ethical board and make sure they're okay with it before you can move on to the next step of designing a statistical study.
All right, finally, we're on to the last clump of steps, seven and eight. Step seven is using descriptive or inferential statistics to answer your hypothesis; in step six you collected the data, and now we do the statistics. Step eight is noting any concerns about your data collection or analysis and making recommendations for future studies. You can imagine this is where we're sitting in our offices writing up our research, whether we're writing an internal report for our bosses or writing for the scientific literature to publish for everybody. At this point, I just want to remind you that it matters whether you picked a census or a sample for your study design, because if you picked a census, you're going to do one kind of analysis, and if you picked a sample, you're going to do a different kind of analysis and statistics. So again, it all cycles back to your study design. And what's important here is that I want to talk to you about the two main types of studies. Within these two categories there are different subtypes, but these are the two main types you can have. The first is called an experiment. An experiment is where a treatment or intervention is deliberately assigned to the individuals. You can imagine that if you enter a study and they assign you to take a drug you weren't taking before, that would be an experiment. But other things could happen, too. You could do this to individuals, you could do it to animals, and, to keep using my example of hospitals, we could choose some hospitals and say, hey, you need to try a new policy. That's the intervention, and it was assigned by the researcher, so that makes it an experiment. The reason we have experiments is that sometimes you need them: the purpose is to study the possible effect of the treatment or intervention on the variables measured. So that's one option, an experimental study, where the researcher assigns the individuals to do certain things in the study. The other kind of study is called observational, and the way you can think about it is that in experiments, the researcher does something: they intervene, they give a treatment, right? But in an observational study, the researcher doesn't do that; the researcher just observes. So if you enroll in a study and you ask, do I have to take a drug, am I supposed to eat something, what am I supposed to do, and the researcher just says, no, we're just going to measure you, we're just going to ask you questions and measure things about you, we're not going to tell you to do anything different, then you're in an observational study. No treatment or intervention is assigned by the researcher in an observational study. Now, let's say you're taking a drug anyway, maybe you have migraines and you're taking a migraine drug. Well, you can keep taking it, or you can stop taking it; they don't care. They might ask you about taking the drug, but they're not going to assign you to take it. That's an observational study. I wanted to give you a couple of real-life examples. The Women's Health Initiative, up on the slide, was mainly an experiment. It was run by the United States government, but of course with the cooperation of many, many universities and healthcare centers, and, most importantly, women.
So women in America, women who were postmenopausal, volunteered to be in the study, and the study actually had two separate sections: the experiment section and the observational study section. They really wanted women to qualify for the experiment, and the purpose of the experiment was to study whether hormone replacement therapy, which is a therapy for the unpleasant symptoms women can get when they're postmenopausal, is good for women or bad for women. They thought maybe it helps with postmenopausal symptoms, but they also thought maybe it causes cancer, right? They didn't know. So what they had to do was get a bunch of women who agreed to take whatever was assigned to them, and then assign the drug to some of those women. That's what made it an experiment. The thing is, not all the women qualified for the experiment, so they had a separate observational study: if a woman did not qualify to be assigned the experimental drug, she could be in the observational study. And because these are big government studies, why not? If somebody wants to be in a study, why not study them; just put them in the observational section. Another huge, long-running, ongoing study that's an observational study, and this one actually started out of Harvard, is the Nurses' Health Study. Some really smart person figured out a long time ago that nurses are smart people: they understand their own health, they understand other people's health, and they're good at filling out surveys about health. So they started studying nurses and regularly sending them surveys. Of course, they didn't tell the nurses what to do; they didn't assign the nurses any drug to take, or any diet or intervention or anything. They just observe the nurses: they send the nurses a survey about their health, and the nurse fills out that information. I think it's every two years that they do that, and they're still doing it. At this point, I also want to point out the concept of replication. The word replication in regular speech means to copy, right? Like if you ever have a new roommate, you might need to replicate your key so you have a copy for the new roommate. Well, part of the whole science thing is that studies must be done rigorously enough to be replicated. Those are the key words in there. A rigorous study means one that's done really carefully, like thinking about sampling very carefully: avoiding non-sampling error, not being sloppy, not getting a lot of undercoverage, using a good sampling frame. I'm just giving you examples you might know about, but there are a lot of things that have to be done in research to do it properly. It's just like driving or anything else: you really have to keep your eye on a lot of different things, and you want to try to do them perfectly. And the main reason you want to do that is so that somebody else can try to do the same experiment you did, or roughly the same experiment, because you can't do exactly the same one, right? If I study this hospital over here, and somebody wants to study that hospital over there, well, they're going to get different people in there, right?
But even so, if that person decides they want to study that hospital over there, and I did my study rigorously, then it won't be so hard for that person to replicate how I did the study. Then we can see whether that person's study and my study get the same thing, or whether something is slightly off and what's going on. Replicating the results of both observational studies and experiments is necessary for science to progress. You'll notice that a lot of experiments are done on drugs before they can be approved to be given to everybody, because they can't just do one study; they have to replicate it to make sure the findings all come in about the same, so we can deduce some information from them. You really just don't want to rely on one study for your findings. So I just went over the several steps we need to follow when we're doing a statistical study, and we actually have to follow them in order. You also have to determine the type of study you're doing: is it an experiment, or an observational study? And there are a ton of study decisions you have to make, so you've got to keep that in mind. Now we're going to talk about avoiding bias, specifically in survey design. You can do a lot of different kinds of studies, but let's just talk about surveys, because that happens a lot in nursing. Nurses interact with patients a lot, and with the community, and with each other, and often they gather information about those interactions, or attitudes, or how the healthcare system functions, by using a survey. So surveys can provide a lot of useful information, but it's important that in all aspects of survey design and administration, when you're giving it, you think about minimizing bias: try to get a representative sample, try to get accurate measurements. Several considerations should be made. First, you want to think about non-response and also voluntary response. I talked a lot about sampling in the previous lecture, but just because you invite someone to participate in your study, maybe you're doing systematic sampling and you ask every third patient, would you like to fill out a survey, that doesn't mean they're going to, right? If that person says no, thank you, even though they were in your sample, that's called non-response. If I were helping you with a survey and you said, hey, I'm getting a lot of non-response, I would look at the proportion. If you approach 200 people and 160 say no, that's only a 20% response rate and an 80% non-response rate. If many people are refusing your survey, the few who actually complete it are likely to have a biased opinion. I've noticed this in situations where things are really bad. I remember going to a subway station, and it was flooded; it was really a bad situation. And there was a man handing out surveys from the transportation authority, and he was like, please take my survey, please take my survey, and everybody was waving past him; they didn't want to grab a survey. Well, you know me, I've got a bleeding heart for surveys, so I took his survey and I filled it out. You know, I think the transportation authority's not so bad. I lived in Florida, and there's no transportation there, right? Here in Massachusetts, we've got a great transportation system, even if it's flooded or doesn't work half the time. It's way better than not having one.
Well, I'm not the only one who grabbed a survey; a bunch of nice Pollyannas like me grabbed one. So the transit authority probably thinks that everybody loves the subway, when actually everybody was waving past this poor guy because they were so disgusted that the station was flooded. So if many people are refusing your survey, a high proportion, the few people who actually fill it out are going to be kind of unusual, probably like me. You're going to get a bunch of happy people, when most of the people who said no might be unhappy people. The reason they're not completing your survey may have to do with how they feel about your topic. And this isn't just about satisfaction. Let's say you want to ask how many drinks per night somebody has. Do you think a lot of people who are struggling with alcoholism are going to want to fill out that survey? How about illegal drugs or other illegal activity? People who are involved in that don't always feel so good about talking about it. So you might get a few people to fill out your survey, but those are not necessarily the people who are engaging in the behaviors. The fact that we have the freedom to choose whether or not to be in a survey is great, but from a researcher's standpoint, you have to be careful. If you get low response rates, you need to ask yourself who is not responding, and whether you're missing a good share of opinion there. And then, when you do get people who respond, you've got to be careful with that, too: respondents may lie on purpose. If you've got a pretty good survey but you suddenly ask a question that's too personal, people might just lie. Maybe you're doing a satisfaction survey with students about how the front desk runs at a dorm or something, and you ask, have you ever cheated on a test? Everybody's probably going to say no. Also, if you ask a question where people don't really know the answer offhand, they're not going to get it right. Like if you ask a kid who's been living in the house forever, when did your parents buy the house, and how much did it cost? They're not going to know. Maybe they'll know, but probably not. So you want to be careful when you design your questions that you're not asking anything so personal that everybody will lie about it, and that you're not asking a question where, even if people try to be accurate, they probably can't give you the right answer because it's just too hard to think about. Respondents to surveys may also lie without meaning to, inadvertently. Again, if you ask a question about something that happened a really long time ago, they're probably not going to get it right. This is called recall bias. You know how you can look back at a time in your life, especially if you went through something really harsh, like if you were part of a sports team and you went to state and it was really tough, and you don't remember the tough part? You sit around singing your sports songs and you say, hey, that was awesome. Well, that's recall bias, because after winning state, everything looks rosy, but on the bus it really wasn't that easy. People tend to have recall bias; it's influenced by events that have happened since the original event.
So if you're giving people a survey and asking, well, before you applied for nursing school, did you think this or did you think that, they might just tell you and think they're telling you the truth, but they're actually wrong. If you actually managed to go back in time and ask them, they'd tell you something different. So again, you can kind of mess up your own data by messing up your own questions, and you want to think about how you word them. You can also mess up your questions by introducing a hidden bias. Something happened to me recently where a company sent me a free app. They said, try our free app, and I downloaded it, and it was awful. About a month later, they sent me a survey, and these were the questions it asked: When do you use the app? What time of day do you use it? How do you use it? Do you read scientific literature? Do you read news? And the problem was, I couldn't really answer any of it, because from the day I downloaded it, I never used it. It was that bad. So question wording may induce a certain response. They were asking me how I use this app, but they didn't give me the choice of "I don't." So I had to say something; I don't even know what I said. There was nothing I could honestly say, because of that bias. So you have to be careful that you aren't too rosy about whatever your topic is and don't assume everybody loves everything. You've got to put in questions like, are you even using the software? Did you have any problems with the software? Just assuming they're using it, liking it, and using it the way it's supposed to be used is a big assumption. Order of questions and other wording may also induce a certain response, and you'll see this a lot if you take a public opinion poll. I used to do a lot of polling. We'd ask questions like, how likely are you to vote for candidate X? Very likely, somewhat likely, somewhat unlikely, not at all likely? And people would say, I don't know, not likely. And then you'd say, well, what if you knew that candidate X supported this new proposition, Proposition 69? Then would you be more likely to vote for candidate X? That's what order of questions and other wording is about: they're trying to see if adding this fact or that fact is going to make the person like the candidate better. So you do have to think about the order you put the questions in, and if you want to ask about two different subjects, think about which subject should come first, because it might color the respondents' answers on the subsequent subject. Also, as the slide points out, the scales of questions may not accurately measure responses. Do your feelings always fit on a scale from one to five? Well, Yelp has kind of figured out that people's feelings about restaurants tend to fit on a scale of one to five; I'd have a lot of trouble filling that out if they gave me a scale of one to 17, right? But sometimes people have more granular feelings about things, and maybe they need a longer scale, one to seven. You'll see a lot of pain scales that offer more than just five choices, because pain can maybe go from one to seven or one to ten. So think about your scales when you're creating these questions, because that's your choice if you're designing the study. Another point to be made is the influence of the interviewer.
Now, we don't have as much interviewing going on these days, because we have the internet, where we can do anonymous surveys and people just fill them out themselves, and we have robocalls, where you can get survey data using an automated voice that's obviously not a person. But there are always situations where you actually have to interview people, especially if somebody is really sick in bed and you have to show up and talk to them. And even on the phone, when you interview people, they can hear your voice. So you've got to think about how you pair up whoever's being interviewed with whoever's interviewing. I've found that, in general, it's best to have the interviewer come from the same population as the research participant; the only time that can be a problem is if they're from the same community and there's a privacy issue. But for the most part, not always, it can be very helpful to have your interviewers actually come from the population you're studying, from the individuals you're studying. For instance, say you need to interview some young African American men; I recently saw a study on how healthcare in the United States really isn't suited for them and needs to improve and better cater to this population. Let's say you wanted to better understand that. The best thing would be to hire a young African American man and train him to be a good interviewer and a good data collector, because you'd probably get the best data that way. On the other hand, think of different ways that could go. You could take a person who is older, maybe of a different race, and maybe that would change how this young African American man would respond to the interviewer. The interviewer could be like the respondent in many ways, but the respondent's perception might change how they answer. All verbal and nonverbal influences matter: clothing, the setting the person is being interviewed in. I'm not saying there's really a solution to all this; I'm just saying, make some good decisions. I remember working on a dataset where some questions had been asked of older men about their sexual function, and the data looked funny to me. The statistician who was there during data collection told me they had chosen young female nursing students to interview these older men about their sexual habits, and I just said, you know, that might be subject to interviewer influence. Then you of course have to worry about vague wording. Just because it looks clear to you doesn't mean it looks clear to everyone. There are simple ways of avoiding vague terms in a survey when you can just put a number on it. So instead of asking a person if they've waited a long time in the waiting room, you can say "more than 10 minutes." You can be exact: within the last month, have you done a certain activity; within the next year, do you expect to change schools; or whatever. So wherever you can, use numbers or something very specific. Instead of "go to the clinic," say "go to the public health clinic at this particular corner," or whatever, and then you're going to get some pretty accurate information. But sometimes you're stuck using vague terms, because you're studying vague things, right?
I was doing a study of attitudes towards controllable lifestyle in medical students, so we asked this question: how important is having a controllable lifestyle to you in your future career? Well, what does that mean? That's pretty vague. So what we did was use grounding, anchoring language. We added the sentence: a controllable lifestyle is defined as one that allows the physician to control the number of hours devoted to practicing his or her specialty. So even though we were talking about something kind of woolly and loosey-goosey, like a controllable lifestyle, who knows what that means, we anchored it. That's not to say that sentence can't be interpreted differently by different people; it certainly can be. But if you're stuck with vague wording, try to put some grounding language in it so everybody is at least led in the same direction with their thoughts before they answer the question. Now, you've probably noticed there are all these issues you have to think about when doing surveys, and there's this other issue called the lurking variable. Lurk means to sneak around behind the scenes, right? Behind the scenes, a lurking variable is a variable that's associated with a condition but may not actually cause it. I remember when I was studying epidemiology, they talked about how a lot of people who unfortunately got into motorcycle accidents had tattoos, so therefore, they joked, nobody should get a tattoo or you might get into a motorcycle accident. Well, that's a great example of a lurking variable. Yes, a lot of people who get into motorcycle accidents have tattoos, but the tattoos don't cause the accidents. We also know that having more education increases income, but people with the same education level do not all make the same income. There are these things called sexism and racism. It matters whether you're a woman or a man; it matters what color your skin is. If you've got darker skin, it doesn't matter that you have the same education as somebody with lighter skin; you're still going to make less money. So you have these lurking variables behind the scenes. When people ask, well, why are these people making less income, is it because they're less educated or whatever, you've also got to look for the lurking variables. Current studies show that the reason women and African Americans make less money on the whole is not explained by fewer of them working or fewer of them getting degrees; it's really these lurking variables. So you've got to think critically. And I guess what I would say is, whenever you do a survey, if you're studying something that has a lot of lurking variables associated with it, make sure you measure those variables. Like the early studies looking at whether drinking a lot of alcohol causes lung cancer: some of them forgot to really measure how much these people smoked. We know smoking causes lung cancer, and we know that if you're hanging out in a place with a lot of drinking where they allow smoking, you'll see a lot of people smoking too; they seem to go hand in hand. So you don't want to skip measuring variables that you think might be lurking variables. It's no problem to measure them and not use them later, but make sure they're included. So, as a final note on bias, I just want to point out that survey results are so important
for healthcare, and for the progression of science, that you really owe it to even the simplest survey to think about all of these things that could go wrong, just with the wording of questions or with how you're approaching things, and really consider how you can improve it. It's really important to pay attention to avoiding bias when you're designing and conducting your survey, so think about all these things at the design phase. Finally, I'll get into the last section of this lecture, which is about randomization, which I think a lot of us have heard about. I'm going to explain the steps of a completely randomized experiment, and after I go through all that, I'm also going to talk about the concept of a placebo and the placebo effect. Then we're going to briefly touch on blocked randomization and also define what is meant by blinding. So why ever randomize? What randomizing is, is when you take a bunch of respondents or participants in your study and you randomly choose what group they go in. And if you remember when I was talking about experiments versus observational studies, we can't do that in an observational study; this is definitely an experiment, because you're telling them what group to go in, right? So randomization is used to assign individuals to treatment groups. And when you randomly assign them, you're not picking; you're using dice or some other random method, and that helps prevent bias in selecting members for each group. It distributes the lurking variables evenly, even if you don't know about the lurking variables, even if you aren't measuring them. By using this randomization method, they get equally allocated to each group. Just to remind you how you actually do that: first, remember the steps of a statistical study; you have to follow those. After you get to the point where you have ethical approval, that's when you start the data collection step, and that's where you start recruiting your sample, hanging up signs saying "be in my study," and people come in and you see if they qualify. If they qualify, you've got your sample. And what you do with those people is say thank you for being in my study, and then you measure the confounders, which is another word for lurking variables, and you also measure the outcome, whatever you're trying to study. If you're doing a randomized experiment, and I've been involved in a lot of these where they're studying drugs for lowering blood pressure, they'll often have maybe two or three groups they're randomizing people into. But they don't do that first. The first thing is to get everybody in there and measure their blood pressure, the outcome, because they want a picture of that before. They also measure confounders, like smoking; remember, smoking is not good for your blood pressure, and other things are not good for your blood pressure either, like not exercising, so we measure all of those things. Okay, now here's where the whole randomization happens. I showed a picture of a die, but we usually use a computer for it. We've got all these people together, and now we randomly put them into different groups, and in this example on the slide, we're just going to pretend that there are two groups.
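And just as a sketch of what the computer is doing at that point, here's a minimal example in Python of randomly splitting a group of participants into two groups; the participant IDs and the fixed seed are made up for illustration, and real trials use more carefully controlled randomization software.

```python
import random

def randomize(participants, seed=None):
    """Randomly assign each participant to Group A or Group B."""
    rng = random.Random(seed)        # the computer plays the role of the die
    shuffled = list(participants)
    rng.shuffle(shuffled)            # random order, so assignment isn't picked by anyone
    half = len(shuffled) // 2
    return {"Group A": shuffled[:half], "Group B": shuffled[half:]}

# Hypothetical participants who qualified and had baseline measurements taken
people = ["P01", "P02", "P03", "P04", "P05", "P06", "P07", "P08"]
print(randomize(people, seed=42))
```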
And in fact, we can't really do a blood pressure study the way it's shown on the slide anymore, because on the slide we give one group treatment and the other group placebo, which is an inactive treatment; it's fake, it doesn't work. Of course, the treatment and the placebo are going to look the same to the people taking them; we're going to fool them, they won't know. But the reason you can't do that with a blood pressure study today is that we know high blood pressure is really bad for you, so it's unethical to give someone a placebo; you've got to give them some sort of drug to lower their blood pressure. So usually when we do studies on new blood pressure drugs now, Group A gets the new treatment and Group B gets the old treatment, to see if they can find a better treatment. But if we were talking about something like Alzheimer's, especially late-stage Alzheimer's, there's no treatment, okay? So what's on the slide here, Group A getting treatment and Group B getting this sham pill, this placebo, would be ethical in that case. But let's cross our fingers that someday that's not ethical anymore, because we'll actually have a treatment. Okay, so after you put them into the two groups, what's sort of missing from the slide is that time passes. People in Group A take whatever they're supposed to take, their treatment, and in this example on the slide, people in Group B take the fake treatment, the placebo, and usually neither of them knows which is which. It takes a while, right? In the olden days, before we knew high blood pressure was bad, these were the study designs, and this is what ended up happening: at the beginning, when they measured the confounders and the outcome, everybody had high blood pressure; they all looked the same. But after treatment, Group A's blood pressure would go down, whereas Group B's would go down a little bit from the placebo effect, which I'll explain on the next slide. That's how we learned that you can make blood pressure go down with these different pills. Finally, after that time passed, and it could be six weeks, it could be years, however long it took, when it was over, we'd measure the confounders again, because they could have changed, and the outcome, which in my example was blood pressure, or, if we were doing the Alzheimer's study, how serious the Alzheimer's disease was. So, I promised you on the last slide that I'd talk more about what a placebo is and about the placebo effect. I found this great picture of old placebos from the National Institutes of Health. A placebo is a fake drug that's given, and it's actually kind of hard to make placebos. Just imagine a drug you might take, maybe even Excedrin or something like that. Imagine we had to study Excedrin: we'd have to make a fake Excedrin that tasted like it and looked like it, because otherwise the people randomized to the placebo group would be able to totally tell they were in the placebo group, and that's not good. The reason you need a placebo is that there's this thing called the placebo effect, and that occurs when there is no treatment, but the participant assumes he or she is receiving treatment and responds favorably.
Now, sometimes I talk about one of my favorite epidemiologist comedians, Ben Goldacre. He reported, I think in one of his TED talks, on a study where everybody they enrolled had, I guess, a mild condition, and they told everybody either they were going to give them nothing, or a pill that's a placebo and doesn't do anything, or an injection that's a placebo injection and doesn't do anything. And what they found is that of the three groups, the people who got the fake injection did the best, the people who got the fake pill, the placebo pill, did second best, and the people who didn't get anything did the worst. His point is, that's what the placebo effect is: for some reason, when we get injected, even with just saline, we think we're getting some sort of drug, and it psychologically, or however, affects our bodies. Same thing when we're taking a pill. I don't know if you've ever seen a kid saying they need medicine, and then the parent gives them an M&M; they think it's a pill, and they're happy with it. But the placebo effect can actually cause real effects on your health; it can make you feel better just because you think you're taking a drug. And that's why it's super important to include a placebo group in your studies if you don't have a comparison group like I described with blood pressure, because if you just have one group taking the treatment, they'll all say it's good. They would say it's good if it were water, right? So the placebo is given to what's called the control group; they receive the placebo. Now, if you're studying something like acupuncture, you can't really give a placebo pill for acupuncture. So what they'll do is hang up a little curtain and kind of tap you, and you don't know whether you're getting real acupuncture or what's called sham acupuncture. Things like that have to happen when you're studying interventions that aren't pills; those are called attention controls, like the sham acupuncture. In any case, you've got to think about this, because you need a control or comparison group that's fair whenever you're testing a new thing in a randomized experiment. I promised you I'd talk a little bit about blocked randomization; I won't get too far into it. Sometimes when you go to randomize, you've got this whole group of people, they're all about the same, and you're going to split them into a Group A and a Group B, one maybe getting a drug and the other maybe getting the placebo. Sometimes you get worried that the groups are going to be unbalanced with respect to a particular lurking variable. In blood pressure studies, we'd always care about smoking; we want an equal amount of smokers in each group. A lot of times we care about gender; we want equal amounts of men and women in each group. If you're worried about that, with ordinary randomization you can't just randomize one person at a time, because you might just randomly put too many men in one group. So what you do is blocked randomization. See, I drew all these blocks on the screen, and you'll notice there's nobody in them; they're just blank, I just put Xs. So this is before you do your study: you have these blank blocks.
And what you do is, as you enroll people (remember, you have to measure them and make sure they qualify for the study), you write them into the blocks as they come in. Here I just put fake initials. Let's say XYZ came in first, and that's a woman; then maybe NSW came in, and that's another woman; you keep putting the women in their slots, and when the men come in, you put them in theirs, and you fill up the blocks. Then here's the trick: you randomize the entire blocks. So block one and block three ended up in Group A, and like magic, you've got equal numbers of men and women there, and Group B has equal numbers of men and women too. That's how you do it with blocks. But there are some limitations to this. If you have multiple races in your study, maybe four or five racial groups, and you make a five-slot block, you've got to fill up the whole block before you can randomize it. And sometimes you're in an area where certain racial groups are rare, so you might have trouble filling up your blocks. So there are some limitations to this, too. To make the blocking idea a little more concrete, there's a small sketch of it right after this paragraph.
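Here is a minimal sketch in Python of the blocking scheme as I read it from the slide: each block holds two women and two men, and once a block fills up, the whole block is randomly sent to Group A or Group B. The block composition, the initials, and the function names are all made up for illustration; real trials use validated randomization software, not anything like this.

```python
import random
from collections import deque

BLOCK_SIZE_PER_GENDER = 2  # assumed block composition: 2 women + 2 men per block

def block_randomize(participants):
    """participants: list of (initials, gender) tuples in enrollment order."""
    queues = {"F": deque(), "M": deque()}
    groups = {"A": [], "B": []}
    for initials, gender in participants:
        queues[gender].append(initials)
        # As soon as we can fill a whole block, close it and randomize the entire block
        if all(len(q) >= BLOCK_SIZE_PER_GENDER for q in queues.values()):
            block = [queues[g].popleft()
                     for g in queues for _ in range(BLOCK_SIZE_PER_GENDER)]
            groups[random.choice(["A", "B"])].extend(block)
    return groups

# Made-up enrollees, in the order they showed up
people = [("XYZ", "F"), ("NSW", "F"), ("ABC", "M"), ("DEF", "M"),
          ("GHI", "F"), ("JKL", "M"), ("MNO", "F"), ("PQR", "M")]
print(block_randomize(people))
```

One thing to notice in this sketch is that because each completed block is sent to a group at random, the two groups stay balanced on gender but can end up with different total sizes; that is one more of the practical limitations of blocking mentioned above.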
Now, I had mentioned the situation where, if you're going to do an experiment (not an observational study, an experiment) and you're going to randomize people either to a drug or some sort of intervention versus a placebo, or to a new drug versus an old drug, you really don't want them to know what group they're in. Because you have to be ethical, before they enter the study you have to tell them you're going to put them in one of two groups, but you also tell them they won't know which group they're in while it's going on. So blinding is where any person is deliberately not told the treatment assignment, so he or she is not biased in reporting study information. And it doesn't have to be just the participant; it can be the researchers. The most common setup is that the participant is blinded to whether they're getting treatment or placebo. But I've worked on studies of Alzheimer's disease where they take the participants, who might have Alzheimer's disease, and look at the MRI of their head, and they'll also have a neurologist interview them, and they'll also see a neuropsychologist. They often want those three groups, the imaging group, the neuropsychology group, and the neurology group, not to know each other's opinion of that particular patient, so they blind them to each other's opinions. So blinding is much more complicated than just blinding the participant to whether they're in the placebo or the drug group. But double blind is a really important concept, and it means that both the participant and the study staff do not know the treatment assignment; everybody working with the patient doesn't know. Now, you're probably thinking that's pretty serious. What if that person gets sick and goes to the emergency room, and they could be taking an experimental drug or they could be taking a placebo; who knows what they're taking? Well, in that case, there's an unblinding procedure; there has to be, as part of ethics, and it's already set up in the study. If somebody goes to the emergency room, there's a person who can be called to unblind the participant, who is now a patient. And once they're unblinded and learn what they were taking, even if it was the placebo, the whole thing's over for them; even the study staff find out. It's just a fact of life, it has to happen sometimes. But for the most part, what we try to do is keep things double blind, because it makes things the least biased and the most fair. So, to end this session on randomization: the purpose of randomization, and why we go through all this when we're testing treatments especially, is that it's used to reduce bias. If you have a particular variable you're concerned about, like gender, or race, or smoking status, you can use block randomization to even out each group. Then blinding further prevents bias, because people don't know what they're taking and the study staff don't know what they're giving them. And the reason you really have to think about blinding is that the placebo effect has to be taken into account; you're going to get the placebo effect every time you give somebody something, so you've got to account for it in your study design. So, in conclusion, I went over the steps to conducting a statistical study in order and gave you tips on how to remember them; we looked at some basic terms and definitions; we talked about how to avoid bias in survey design, because there are a lot of different considerations; and finally, we talked more in depth specifically about randomization in experiments. All right, now you know a lot, maybe too much. I hope you enjoyed my lecture. Hello, it's me again, Monica wahi, your statistics lecturer from labarre College. Now we're going to go back and cover what I didn't cover in the last lecture about chapter 2.1, which is frequency histograms and distributions. So here are your learning objectives for this lecture. At the end of this lecture, you should be able to state the steps for drawing a frequency histogram, name two types of distributions and explain how they look, define what an outlier is, and say one reason why you would make a frequency histogram. Finally, you should be able to define what a relative frequency is and what a cumulative frequency is. Okay, so let's get started. First, we're going to review frequency histograms and relative frequency histograms, so you'll figure out what I'm talking about there. Then we're going to go over five common distributions in statistics, so you know what that's all about. And then I'm going to talk about outliers. Now, you'll notice I have a lot of pictures of skylines in this presentation, and the reason why is they remind me of histograms. So let's talk about what a frequency histogram is. A frequency histogram is important in statistics because, as you'll see, you need to make one in order to see what the distribution is. So I'm going to first explain what one is and show you what one looks like, then I'll explain how to make one, and then I'll explain the relative frequency histogram, and then we'll move on to why we need all that for distributions. So here's another skyline, because it looks like a histogram to me. So what is a frequency histogram? Well, it's actually a specific type of bar chart, and it's made from data in a frequency table. So you might see a frequency histogram and go, well, that looks like a boring old bar graph.
Well, it's not just any old bar graph; it has specific properties that I'm going to talk about in this lecture. Both frequency histograms and relative frequency histograms are bar charts, but they're special bar charts that have to be done a certain way. And why? Because if they're done that way, as histograms, they will reveal the distribution of the data, which I'll explain later. So here is a frequency table; we had this before. This was those fake patient transport miles. You'll notice here are the class limits, and then we put in the frequency, and we even threw in the relative frequency. This is the frequency table I'm going to use as a demonstration of how you make a frequency histogram; you first need a frequency table. Now, here's the histogram version of what's in that frequency table. I'm going to annotate this one image to explain the order in which you draw it, basically by hand. The first thing you do is draw the vertical line for the y axis; you just draw a line. Next, you write words next to that line, and you always start with "frequency of" and then whatever it is; in our example, it was patients. And I'm telling you, you need to do it in this order, or you'll get confused. So you start with that first line, and then you write that frequency label. Next, you draw the whole horizontal line for the x axis, and after that you write the classes below it. Remember, the lowest class is one to eight; those are the lower class limit and the upper class limit of the lowest class, and you literally write those labels in. And why am I so worked up about this order? Because I totally get confused if I do not do the y axis first; otherwise there are all these numbers and it's totally confusing. So just try to do it in this order. Now, at step six (I had to flip the slide here), you've drawn the basic background: you've got the x and y axes and those labels. So now you start drawing in the bars. For your first bar, you look at the first class and you find its frequency in the table, which I think was 14 or something, and you look for it on the y axis. You want to label the y axis so that the maximum frequency is incorporated in it; you can see our maximum is above 20, so we wouldn't want to end our y axis at 20 or 15 or something, you have to make it big enough to fit everybody on there. Our first class was at 14, so we draw a horizontal line around the 14, because we're going to make that first bar. Then you draw the two vertical lines down, positioned over where you labeled the class, and that makes the bar, and then you color in the bar. You repeat this for each class; that's why I labeled the classes first on the x axis, just to make sure everything is even, and then I go through and make all the bars. And again, this is why you need to prepare your frequency table first, so you know what to put on the graph. Okay, this is the relative frequency histogram. You already understand what relative frequency is, right? It's that proportion, the proportion of your sample that's in each class. If you're going to do a relative frequency histogram, you basically go through the same steps; the only change is what's on the y axis and how you label it, but the x axis stays the same. There's a little sketch of both versions coming up next.
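If you'd rather let software do the drawing, here is a minimal sketch in Python with matplotlib of both versions. The class limits follow the transport-miles example (width eight, starting at one), but the frequencies are made-up stand-ins chosen to total 60; they are not the actual values from the slide's table.

```python
import matplotlib.pyplot as plt

# Illustrative classes and frequencies (made up to total 60, not the slide's exact numbers)
classes = ["1-8", "9-16", "17-24", "25-32", "33-40", "41-48"]
freqs = [14, 21, 12, 7, 4, 2]
rel_freqs = [f / sum(freqs) for f in freqs]  # proportion of the sample in each class

fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(10, 4))

# Frequency histogram: counts on the y axis, class limits on the x axis
ax1.bar(classes, freqs, width=1.0, edgecolor="black")
ax1.set_xlabel("Miles transported (class limits)")
ax1.set_ylabel("Frequency of patients")
ax1.set_ylim(0, 25)  # y axis tall enough to fit the biggest bar

# Relative frequency histogram: same bars, same pattern, different y axis
ax2.bar(classes, rel_freqs, width=1.0, edgecolor="black")
ax2.set_xlabel("Miles transported (class limits)")
ax2.set_ylabel("Relative frequency of patients")

plt.tight_layout()
plt.show()
```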
And even though you're charting the relative frequencies, and you'll think, okay, these are totally different numbers, what you'll see is that the pattern ends up being the same. It takes on the same shape, and the pattern is actually what we're going after; that's the thing I'm going to talk about with the distribution. And since the pattern comes out the same, I tend to prefer using a relative frequency histogram over a frequency histogram. Because if I had two different groups, let's say there were two hospitals and I gathered two sets of data and I wanted to compare the miles transported, then I could use the relative frequency histogram, and not only would the patterns be evident, but I could compare them fairly. Whatever is 0.35, or 35%, in this one I could compare to 35% at the other hospital, even if that hospital had tons more transports; I could really compare the percentages. So that's why I lean toward the relative frequency histogram. But ultimately, you're going to get the same pattern on your histogram whether you use frequency or relative frequency. So, again, another picture of a skyline, so you can see why I think of skylines: they look like histograms. So after making a frequency table, which is what you do with quantitative data because you're trying to organize it, it's also important to then make a frequency histogram and/or a relative frequency histogram. Why? Because it reveals the distribution, and that's what we're going to talk about now: distributions. First I'm going to define what I mean by a distribution, and you're going to see a lot of pictures like the one on the right; see that shape? That's one of our distributions, so that's a little preview of what I'm going to say. Then I'm going to describe what an outlier is and how you can detect them using histograms. Finally, I'm going to wrap up by explaining what cumulative frequency is and what an ogive is. Okay, so what is this distribution thing I keep talking about? Well, it's actually just a shape. It's the shape that is made if you draw a line along the edges of the histogram's bars. On the left, you see I drew that scribbly shape. But you'll notice you can do it with a stem and leaf display too; that's not the same data graphed on the right in the stem and leaf, I'm just recycling an old picture I used before. But you see, you can do the same thing, drawing that squiggly line, and that's actually the distribution. They don't all look exactly like that, but that's what you do: you draw this line. I know it's kind of odd that a distribution is just a shape, but there are actually five of them that we use a lot. There are way more than five in statistics, but you have to get into higher-level statistics to care about those; we're only going to concentrate on these five. The first one is called the normal distribution. It's called that everywhere, except I noticed the book calls it a mound-shaped symmetrical distribution, but I'm going to call it a normal distribution. And there's nothing really normal about it; it's just named that for some reason.
Then there's the uniform distribution, the skewed left distribution, the skewed right distribution, and the bimodal distribution; those are the five we're going to cover. So let's start with the normal distribution. As you can see on the right, somebody made a histogram and then drew that squiggly line; well, actually, it was me who made this histogram and drew the squiggly line. Notice what the squiggly line looks like: it kind of looks like what the book called it, mound-shaped and symmetrical. That's the shape of the normal distribution; it's got low tails on the sides and a mound in the middle. If that's what your histogram ends up looking like, kind of like a little mountain, then you've got a normal distribution. Okay, let's look at a different histogram. In this one, you'll notice that each of the frequencies is almost the same; it's either five or six, and it doesn't matter which class we're talking about. When it's like that, the little line you draw across isn't squiggly at all, it's straight. I don't see this very often in healthcare data, but it does happen more frequently in other kinds of data. This is called the uniform distribution, which makes sense: all of the bars are a uniform height. Okay, now this one is kind of like the one we were looking at before, where it looks like a slide at a playground: you climb up the right side and then you slide down to the left. Whenever it's like that, low on one side and high on the other, it's called skewed. The problem is, which way is it skewed? How I remember which way to say it's skewed is to ask where it's light, or short. Here, I would say it's light on the left, so it's skewed left, because on the left side the bars are all short. And you can imagine what's coming next: look at this one, it's skewed right, because it's light on the right; it's short on the right. So technically both of them are just skewed distributions, but I like to explain them separately, because sometimes people don't know which way to say it's skewed, and this is how I remember: light on the left, light on the right. Finally, we have bimodal. Now, the word mode in some areas of statistics, and in engineering and elsewhere, often means a high point, and bimodal means two high points; as you can see, it looks like a camel with two humps. It's sometimes a little hard to tell bimodal from normal, because if you have a normal distribution with just one short bar kind of in the middle, you might ask, is this bimodal or is this normal? How I coach people to tell if it's bimodal is to look for a really big space between the two humps. That's not so apparent in this image here, but you'll see class three and class four are both short; if only one of them had been short, I might have called it a normal distribution. I've really seen bimodal distributions when it comes to lab data, because my best friend is a pathologist, and he'll show me situations where people have really super high platelet counts, and then practically no platelets, and there's nothing in the middle. And that's where you'll see a bimodal distribution. If you want to see roughly what these five shapes look like side by side, there's a little simulation sketch below.
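Here is a minimal sketch in Python that simulates data with each of the five shapes and plots the histograms side by side. The particular distributions, parameters, and sample sizes are arbitrary choices for illustration; none of this comes from the lecture's data.

```python
import numpy as np
import matplotlib.pyplot as plt

# Simulated examples of the five shapes (all parameters are arbitrary)
rng = np.random.default_rng(0)
examples = {
    "Normal (mound-shaped)": rng.normal(50, 10, 500),
    "Uniform": rng.uniform(0, 100, 500),
    "Skewed right": rng.exponential(10, 500),        # long tail to the right, light on the right
    "Skewed left": 100 - rng.exponential(10, 500),   # long tail to the left, light on the left
    "Bimodal": np.concatenate([rng.normal(30, 5, 250),
                               rng.normal(70, 5, 250)]),  # two humps with a gap
}

fig, axes = plt.subplots(1, 5, figsize=(15, 3))
for ax, (name, data) in zip(axes, examples.items()):
    ax.hist(data, bins=12, edgecolor="black")
    ax.set_title(name)
plt.tight_layout()
plt.show()
```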
Now we're going to talk about outliers. Outliers are data values that are, quote, "very different" from the other measurements in the data. What counts as very different? It's an opinion, but people in statistics have come up with different formulas to try to figure out whether something is very different from the other measurements, and we'll actually talk about that in later chapters in the class, not so much for identifying outliers as for better understanding our distributions. But as a quick and dirty representation of an obvious outlier, one nobody would disagree about, look at this histogram here. You'll notice I just threw down nine classes; I made up this data. You'll see that in class two and class three there's just nothing, and there's nothing in class eight, but then suddenly there's something in class one and something in class nine. When you have these big gaps, it's kind of like those platelets I was telling you about, only here maybe you'd say it's trimodal, like there are three modes. But there aren't really three modes, right? There's one wacky low value and one wacky high value, and everything else is in the middle. Because that value in class one and that value in class nine are so far away from what's in the middle, just about every statistician would agree these are both outliers. But you can imagine how much we argue about what actually is an outlier. It's especially hard when you're getting data on people's weight. Some people really do weigh 400, 500, maybe even 600 pounds; you don't know if those are really outliers, or data mistakes, or what to do with them. They're real people, and maybe they really have high weights, and unfortunately some have really low weights too. So one of the main points of making the histogram is not only to look for these distributions, but also to see if you've got any super obvious outliers that you're going to have to think about before you proceed with your analysis. Now I'm going to talk about what cumulative frequency means. You know the word accumulate means to just keep collecting things; if you have a gutter on your house, it will accumulate leaves, the old leaves will sit there and new leaves will keep coming, until it totally clogs your gutter and you have to clean it. That's what cumulative frequency is: it accumulates all the frequencies. So you see on the slide that in the first class, one to eight, we had a frequency of 14. Your cumulative frequency, those are like the leaves at the beginning of the season, is all you've got: 14. But when you add on the next class's 21, you add to the cumulative frequency; it accumulates. You add that 21 to the 14, and now you've got 35. And as you can extrapolate, as you walk up all these classes, you eventually get to the total. So the first cumulative frequency is always the same as the first frequency, and each cumulative frequency is equal to or higher than the last one. I have to say, in healthcare we don't really use cumulative frequency a whole lot; you'll see it, but we are really into relative frequency, I'll just tell you that. But some groups are into cumulative frequency, and those who are like to plot it in a plot called an ogive; there's a little sketch of the whole idea below.
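Here is a minimal sketch in Python of the accumulation and of the kind of plot (an ogive) that comes out of it. The first two frequencies, 14 and 21, are the ones mentioned from the table; the remaining frequencies are made up so the total reaches 60, and the upper class limits assume the width-eight classes from the transport example.

```python
import matplotlib.pyplot as plt
from itertools import accumulate

# 14 and 21 come from the lecture's table; the rest are made up so the total is 60
freqs = [14, 21, 12, 7, 4, 2]
upper_limits = [8, 16, 24, 32, 40, 48]  # assumed upper class limits, width 8

cumulative = list(accumulate(freqs))
print(cumulative)  # [14, 35, 47, 54, 58, 60] -- each value adds on the next class; the last is the total

# The ogive: cumulative frequency against the upper class limits, so the line
# only ever goes up (or stays flat) and ends at the total.
plt.plot(upper_limits, cumulative, marker="o")
plt.xlabel("Miles transported (upper class limit)")
plt.ylabel("Cumulative frequency")
plt.title("Ogive of patient transport miles (fake data)")
plt.show()
```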
And again, I'll be honest, in healthcare I've never seen an ogive in the scientific literature, which is why the example you'll see here is about NFL team salaries; I think they use these a lot more in economics. But at any rate, what you'll see is that the classes are along the x axis, which you're used to, because that's what we do in a frequency histogram, and along the y axis you see these numbers called cumulative frequency, and you just graph it. One thing you'll notice is that it's going to go up: each point either increases or, if you have a class with zero in it, stays the same for that one, but otherwise it just keeps going up. So you'll always see some sort of shape like this, always going up, and at the end it hits the total cumulative frequency. So, just to review, there are five main types of distributions used in statistics, and I emphasize "main"; there are other ones, but these are the ones we're going to look at. That's why we were making our histograms and our stem and leaf displays: we were looking for these distributions, and we were also looking for outliers. And then finally, I did a quick shout-out to the ogive and cumulative frequency, so you know what's up with that. So, in conclusion, the purpose of the histogram is to reveal the distribution, and stem and leaf displays also reveal the distribution, and then you look for outliers. You're probably wondering, why do we do all this work to reveal the distribution? You'll find in later chapters that it matters: what kind of distribution you have determines, in a way, what kind of statistics you can do. I went on and on about the normal distribution; well, we all really like that one in statistics, we're all partial to it, because it lets you do a whole bunch of different statistics pretty easily if you get a normal distribution. However, what often happens in healthcare, because I've had it happen, is that you get a skewed distribution, left skewed or right skewed, and then you have to make some decisions, which makes things a little harder. I've also had a bimodal distribution before; I'm remembering one day when that was kind of an issue and I had to figure it out. So that's roughly why we have to go through this chapter and figure out these distributions, and later I'll explain what you do with that knowledge. Hello there, it's Monica wahi, labarre College statistics lecturer. We're going to circle back now to chapter 2.2 and talk about these other graphs; I'm doing things a little out of order because it makes sense to me, and I hope it makes sense to you too. For this lecture, we have these learning objectives. When you're done with this lecture, you should be able to describe a case in which a time series graph would be appropriate, explain the difference between what would be graphed on a bar graph versus a time series graph, describe the type of data graphed in a pie chart, and list two considerations to make when choosing what type of chart to develop. Alright, so let's get started. What I'm going to do in this lecture is, first, explain what a time series graph is, then talk about a bar graph.
And of course, I'm going to show you roughly how to make these; I'm also going to explain a pie chart and how to make that. Then I'm going to review all the graphs I've talked about for chapter two and summarize when to use what type of graph. So let's start with the time series graph, and actually, the word time is the key. We're going to talk about what a time series graph is and what time series data are. As you can see from this little example, time runs across the x axis, and that's a hint for where we're going. Then I'll show you roughly how to plot one, and I'll explain why we have these time series graphs, how you interpret them, and why you even make them. So, of course, I'm an epidemiologist, so what am I into? M&M: morbidity and mortality. So here's a nice time series graph, a wonderful graph of the percentage of visits for influenza-like illness reported by the US Outpatient Influenza-like Illness Surveillance Network, by surveillance week, from October 1, 2006 through May 1, 2010. And you're like, oh, time? Yeah, that's the deal. Time series data are made of measurements of the same variable, for the same individual, taken at intervals over a period of time. Only, in this example, the individual is not a person, right? Because remember, individuals are just the things you measure variables about. Here, the individuals are actually weeks, because every week they're making a measurement. So like I said, time series data are measurements of the same variable, which here is the percentage of visits for influenza-like illness. Every week, whoever is in this outpatient influenza-like illness surveillance network, whatever clinics are in it, and let's just pretend there are 10 clinics, each clinic has to go in and say, for example, I had 100 visits this week and 10 of them were for influenza-like illness; that would be 10% that week for that clinic. They got all the clinics together and found out what the percentages were, and you can see the percentage on the y axis and all the weeks of the year on the x axis. So you've seen these before, right? You especially see them with the stock market. You go on Yahoo and look at your favorite stock, like we're all so rich we own so much stock, and you track your favorite stock that way. Personally, I spend more time looking at mortality and morbidity, things like influenza, but hey, after I get some money, I'll be looking at stock market prices. So when we see time series data graphed in these time series graphs, it's often about things like influenza rates. You'll also see other rates: life expectancy, rates of heart attack. That's usually what we see, because we're trying to affect those rates, and we're trying to see if they're going up or down. So I'm going to roughly go through how you make one, if you ever wanted to. The first thing you need is a table, kind of like the one on the right; I just made up these data, they don't mean anything. But roughly, you need a column of regular time increments in the first column (in this case I put year, and the influenza people put week), and then you have to put the variable measured at that time in the next column.
So let's say it's today, and you're like, oh, I want to measure how many times I went to the gym each week over the last few months. Well, you're going to have to reconstruct that data, maybe from your memory or your calendar. Normally, when you do time series work, you start by collecting the data as you go along, and then it's nice and accurate. Okay, so let's say you did that and you managed to get some time series data together; then how do you plot it? Well, the first thing you do, and I'm using this influenza graph as an example, is draw a horizontal line and make that your x axis. You gathered your data by years or weeks or something, so you can label those time periods there, because you already know them; you just label that x axis. Then you draw the vertical line for your y axis. Again, you've already done all your measurements, so if you were measuring how many times you went to the gym per week, at most once a day, then seven would be the maximum, and you'd want to make sure your y axis is tall enough to fit that seven in case you had a good week. That's really what you're looking for with the y axis: you don't want it too tall. You can see the highest point they have; ooh, in 2009 they had an outbreak there, and they needed to make sure the y axis was tall enough to graph that, but other than that, you don't want it much taller. And make sure you label it; I'm big on labeling here, because otherwise people get confused. Okay, the next step is where you actually put in your data. Now, because there were so many weeks, if you look at 2007, the x axis is only about two inches wide, and all 52 weeks of 2007 were plotted in there, so it literally looks like a super smooth line. But honestly, what they did was put each point in separately and then connect the dots, and that's why it looks so smooth. If you only have a few points and a wider x axis, it'll be choppier; it'll look a little more like those stock market graphs that go up and down and look like a roller coaster, not so smooth. But if you have a lot of points squeezed together, it ends up looking really smooth. I also just wanted to point out that you can have more than one line on the graph, for more than one set of data values. Like here, they're comparing, I don't know, some sort of book performance, how much it sold in the US versus Canada; you just have to make sure you include a legend if you do that, so people can tell the lines apart. So, to summarize, time series graphs are useful for understanding trends over time, like whether things go up or down, like you saw on that influenza chart where we could see when there apparently was an epidemic or an outbreak. And graphing more than one set of time series data on one graph, like you saw in the last example, can help in comparing the differences between the datasets. I worked for the US Army, and there are a lot of problems with people getting injured in the army, so I made a lot of time series graphs of rates of injury over the years, because we were trying to do things to make the rates of injury go down, and that way we could see whether the trend was there, whether we were actually making them go down. That's the main goal of these time series graphs, and there's a small sketch of one below.
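Here is a minimal sketch in Python of a simple time series graph, using the do-it-yourself gym example. The weekly counts are made up; the only constraint borrowed from the lecture is that seven visits is the most you could manage in a week.

```python
import matplotlib.pyplot as plt

# Made-up weekly gym visits over 12 weeks (maximum possible is 7 per week)
weeks = list(range(1, 13))
gym_visits = [3, 4, 2, 5, 6, 4, 3, 7, 5, 4, 6, 5]

# Time goes on the x axis, the measured variable on the y axis, points connected
plt.plot(weeks, gym_visits, marker="o")
plt.xlabel("Week")
plt.ylabel("Times at the gym")
plt.ylim(0, 7)  # y axis just tall enough for the best possible week
plt.title("Gym visits per week (made-up data)")
plt.show()
```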
Now, I'm going to move on to talk about the bar graph, which can display quantitative or qualitative data. I'm going to start with the features of the bar graph; here's just an example on the right. Then I'm going to talk about how to make one, and then we're going to talk about what happens when you change the scale, meaning how tall the y axis is on a bar chart, because it really changes things. I call it a bar chart sometimes, or a bar graph; they're really the same thing, and I don't know why they chose "graph" in the book. And finally, I want to give a little shout-out to what Pareto charts are; we don't really use them much in healthcare, but I still want you to know about them. Alright, so let's look at the features of a bar graph. The first thing to know is that the bars can be vertical or horizontal, so even though I'm showing you this vertical example, don't be thrown off if you see a horizontal one. Regardless of whether they're vertical or horizontal, the bars are supposed to have a uniform width and uniform spacing; they can't be wider or skinnier, and they have to be spaced apart uniformly. I'm going to use this big one here as an example to talk about bar graphs, and I want you to notice what is being graphed: this is the percentage of people in the US not covered by health insurance, split up by race and ethnicity, for the years 2008 through 2012. Which is bad, right? You want people to have health insurance. Okay, so item three here says the length of the bars represents either the variable's frequency or its percentage of occurrence. I've circled percentage, because that's what we're looking at in this one, but we could instead have looked at, say, the number of visits at a health care clinic, and that would be frequency; I just wanted to call that out. You'll see, then, that on the y axis we have the measurement scale, and as long as we write it there and use that same measurement scale for graphing each of the bars, we'll be fulfilling item four, which is that the same measurement scale is used for each bar. I don't know why anybody would do it any other way, but that's one of the features of the bar graph. Now, this next feature is my pet peeve: I get so irritated when I find a bar graph, or any other graph, where things are not labeled, because I get totally confused. So you really want to put on a title, and you need to put the bar labels at least on the x axis; see how it says "white alone", "black alone"? You wouldn't even know what those bars were unless somebody put something there. Some people also add the actual values for each bar; I'll do that if there's space, like there was here, but if it gets too busy I don't, because you can kind of read them off the graph. Now, you're probably having a flashback and thinking, this looks totally like a histogram; what is the difference? Well, I started by talking to you about histograms, and they're actually a special case of a bar graph. So bar graphs are more general.
And the histogram is a specific type of bar graph. Histograms are bar graphs that must have classes of a quantitative variable on the x axis, so you can already see that the bar graph I'm showing you is not a histogram, because it has categorical, qualitative things on the x axis, not classes. Also, histograms must have frequency or relative frequency on the y axis, and as you can see, this one has a percentage of something, so that's not that either. So this isn't a histogram; but whenever you make a histogram, you're just making a special kind of bar graph, and I wanted to point that out so you aren't confused. Now, I said I was going to warn you about what goes wrong when you change the scale. What I mean by changing the scale is, when you look at the y axis, notice how the top of it, the way this person made it, is at 35, or 35%. But notice that the racial group with the highest percentage without health insurance, which is unfortunately those of Hispanic origin, is close to 30, not all the way up to 35. So I'm not exactly sure why they made it so high. I wanted to see how the shape of these bars would change if I made the top 30, so I regenerated it, and you'll see what happens. See, it's the same data; I just made the top 30. It's kind of subtle, but suddenly all the bars look bigger, right? So if I were some advocate running around saying this is terrible, these people don't have insurance, I'd rather show the one on the left than the one on the right. But in a way that's a little misleading, because it's the same data. So the differences between bars look a bit more dramatic when we shorten the scale. But let's go the other way, and this is where I see people do things a lot. See how the top of the y axis is 35 right now? Let's double it, make it 70, and see what happens. As you can see, the differences between the bars look small; the difference between that big Hispanic-origin bar and the lower white-alone and Asian-alone bars doesn't really look that big anymore. So my opponents would rather look at that graph. In fact, everything looks kind of small on that graph; it's as if, oh, there are no problems with insurance. And that's what people mean when they talk about lying with statistics, so to speak; these are the kinds of tricks people use to change how things appear. The best way to handle it is what I suggested: look at the next value up from your tallest bar and use that as the top of your y axis. What I had to do with the Army is, I was looking at the rate of knee injury and also the rate of ankle injury, but knee injury was way more common, so if I wanted to compare the two, I always used the same scale, because otherwise people wouldn't be able to see that the ankle injury rate was really low compared to the knee injury rate, even though they're both important. So, with a taller y axis, the differences between the bars look less dramatic, and the taller you make your y axis, the smaller the bars look overall, so you've got to be really careful. I don't think you would do that, but other people do, when they're trying to make their point, so just watch out for it. There's a little sketch below of the same bars drawn with three different y-axis tops, so you can see the effect.
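Here is a minimal sketch in Python of that scale effect: the same four bars drawn three times with different y-axis tops. The category names echo the slide, but the percentages are rough, made-up stand-ins, not the actual Census figures.

```python
import matplotlib.pyplot as plt

# Illustrative (made-up) percentages uninsured by group
groups = ["White alone", "Black alone", "Asian alone", "Hispanic origin"]
pct_uninsured = [15, 20, 17, 29]

# Same data, three different tops on the y axis
fig, axes = plt.subplots(1, 3, figsize=(12, 3.5))
for ax, top in zip(axes, [30, 35, 70]):
    ax.bar(groups, pct_uninsured)
    ax.set_ylim(0, top)
    ax.set_ylabel("Percent uninsured")
    ax.set_title(f"y axis ends at {top}")
    ax.tick_params(axis="x", labelrotation=20)
plt.tight_layout()
plt.show()
```

Notice how the shorter the y axis, the more dramatic the differences between bars appear, even though nothing about the data has changed.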
Also, a term that was mentioned in the book is the clustered bar graph. It's not that complicated; it just means more than one bar is graphed for each category. In the last one I showed you, there was just one bar per category. Here, if you look at this one on the right, and of course I mixed it up a little and did the horizontal version, this is life expectancy at birth, and you'll see there are three sets of bars: both sexes together, with a bunch of bars for that, and you see the legend, Hispanic, non-Hispanic Black, non-Hispanic White, and then all races and origins together; and then they also have separate sets of bars for male and female. So this would be clustered, and if you do that, you really need a legend so people can tell what's going on. You'll also notice that life expectancy is good if it's high, right? You want to live to be 80, 90, 100. But if you look at the bottom of the slide, where the x axis is, if they had started at zero and drawn it all out, it would not even fit on the slide. So what they do is make these little hash marks with a little squiggle to indicate that they skipped ahead. But like I said in the first part of this, if they skip ahead on the female bars, they have to skip ahead on all of them, so everything is skipped ahead the same way, and it's a fair comparison. It's like fast-forwarding through the movie up to about 50 and then looking at the differences from there, because everything's the same up to that point. So that's just another thing about scale: notice whether it's clustered, whether you've got a legend, and also look for the squiggle. Okay, now I'm going to give you a shout-out to the Pareto chart, and you probably already noticed we don't really use these much in healthcare, because this example is about causes of an engine overheating, and we don't do that a lot in healthcare. You'll see I slapped a label on the y axis, the word frequency. So, remember how I was saying the histogram is a special kind of bar chart, or bar graph? Well, a Pareto chart is a different kind of special bar graph. In a Pareto chart, the height of the bar indicates the frequency of an event. If you look at these events here, damaged radiator core happened 31 times, and that happened more often than faulty fans, which only happened 20 times. So what they do is figure out what happened the most, and the second most, and the least, and they deliberately arrange the bars in order, left to right, in decreasing height. It's a way of zeroing in on the most important problem you're finding, so it's really meant to graph frequencies of problems; there's a little sketch of one below.
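Here is a minimal sketch in Python of a Pareto chart. The two counts from the slide (31 damaged radiator cores and 20 faulty fans) are kept; the other categories and counts are made up just to fill out the example.

```python
import matplotlib.pyplot as plt

# 31 and 20 come from the slide; the other two categories/counts are made up
causes = {"Damaged radiator core": 31, "Faulty fan": 20,
          "Low coolant": 12, "Stuck thermostat": 6}

# Sort so the most frequent problem comes first -- that ordering is what makes it a Pareto chart
ordered = sorted(causes.items(), key=lambda kv: kv[1], reverse=True)
labels, counts = zip(*ordered)

plt.bar(labels, counts)
plt.ylabel("Frequency")
plt.title("Pareto chart of engine overheating causes (partly made-up data)")
plt.xticks(rotation=20)
plt.tight_layout()
plt.show()
```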
I've actually only ever seen one Pareto chart in healthcare, and I really looked for one. It was about bad things that can happen in a nursing home, and I remember the tallest bar was for falls, because people fall in nursing homes, and then there was a smaller bar for medication errors. The reason I think we don't use these a lot in healthcare is, let's pretend this chart was about that, and this 31 was falls instead of damaged radiator cores. Well, the first thing you'd probably ask is, how many people are in that nursing home, and how long did you collect data for? Thirty-one falls is pretty bad, but it's not that bad if you have hundreds of people over 10 years and all you got was 31 falls; then you're doing pretty well. So I would say the reason we don't use Pareto charts a lot in healthcare is that they sort of leave out some important information about these serious events, and we like to look at things in different ways. So, just to summarize bar graphs: bar graphs must be made following a few rules. I talked about how you have to keep the width of the bars the same and how you have to label the axes so we know what you're talking about. Because you can visualize both quantitative and qualitative data with a bar chart, these labels become really important, as do scales; I showed you how changing the scale can make things look different, so you want to be careful and cognizant of that. And I gave a shout-out to Pareto charts and explained why I think they're not used that much in healthcare. Now we're going to jump into pie charts. Even the thought of a pie chart makes me hungry; doesn't it make you hungry? So here's what a pie chart is. They're also called circle graphs, and they're used with counts or frequencies that are mutually exclusive. That sounds really fancy, but all it means is that every individual can only fall in one category. I'm going to give you the example on the right, which is actually from a real report you should probably read. It was a survey done by the Massachusetts Nurses Association, and they got 339 nurses to fill out the survey. One of the questions was, do you receive annual bloodborne pathogen training? Now, the answer is only going to be yes or no; they can't say yes and no. That is what mutually exclusive means: you can only give one answer. As you can see, 234 people said yes, which is good, and 105 said no, which is bad, and I'm worried about that. These pie charts are usually made in graphing programs, because they're a little difficult to do by hand, and I'll explain why. And unlike Pareto charts, they're super common in healthcare, as you can see right there on the slide. So let's look at the features of a pie chart. I actually made up this fake pie chart; I pretended I had a class where I gave a five-point quiz. The reason I did that is I wanted to show you how to do it with a quantitative variable, because remember, the last one was yes or no, and that's qualitative; those were the answers the nurses could give to that survey question. This is a different one: this is where I put fake students' points on the quiz into classes, so you see zero points, one to two points, three to four points, and five points. So regardless of whether you're using qualitative categories like yes or no, or classes like this, every individual in your data must be in only one of the categories or classes, kind of like frequency tables and histograms: everybody gets one vote. That's really important in a pie chart, even though it can be used with qualitative or quantitative variables, and you'll see later what I mean by that. So here is just a fake example I made of how you would make a pie chart out of a quantitative variable. I'm going to briefly go over how you would do this by hand, and I'm realizing I've never actually done this by hand.
I always use Excel, as you can probably tell from that lovely purple color, which comes out of Excel. But if you were going to do it by hand, I guess you'd have to go buy one of those things in the lower left, a protractor, because that helps you measure the degrees of a circle. Remember, a whole circle has 360 degrees; I don't know if you remember this from trigonometry, but a half circle would be 180 degrees. That's how you figure out how big each piece of the pie needs to be, using the protractor. So if you're going to make a pie chart by hand, you first have to make a table; you'll see we make tables constantly in statistics. I put class in the first column, because I was doing one that required classes, since it's quantitative; if you were doing the one with the nurses saying yes or no, you would put category and just list yes and no, and then total. Then, of course, you put the frequency in the next column, and I always add up the total; my fake class apparently had 37 people in it, and I just want to make sure everything adds up. The next column will remind you of relative frequency: it's where you figure out the proportion of the circle that class is going to take up. Take the seven people who got five points; if you divide 7 by 37, you get 0.19, and I like percents, so that's 19%. That says what proportion of the circle they get. Then finally, in the last column, remember how the whole circle is 360 degrees? You take the proportion you got and multiply it by 360 to figure out how many degrees of the circle that slice gets, and that's why you need the protractor. That's also why I always use Excel for this, because then you don't have to worry about those things: all you need for Excel is the classes or the categories and the frequency, and if you use its automatic pie graph function, you get all the other stuff out very quickly. There's a little sketch of that arithmetic just below. So I wanted to make a few notes about pie charts. The thing I keep coming back to is these mutually exclusive categories. Imagine that I do a survey and I ask the question, what is your favorite color? And I give some choices, like red, green, blue, whatever. There's only going to be one answer from each person, because you can only have one favorite, and that question is then eligible to be used in a pie chart, because everybody gets one vote. But a lot of times I'll see people use a different survey question; they'll say, check off all of the colors you like. If I get that, I'm like, oh, I love red, I like orange, I like green, and I'm checking off a bunch. There are some people I know who don't really like color, they just wear gray and black, so they probably wouldn't check off anything, and then there are the people who check off just one or two. As you can see, people can have multiple votes, or no votes, or whatever, and if you have that situation, where people can say multiple things, you've got to go into bar graph land, because a whole bunch of people can like red, a whole bunch can like green, a whole bunch can like blue, and you won't get a circle out of that. If everybody answers with just one answer, so that everybody is in a mutually exclusive category, then you can use the pie chart.
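Here is a minimal sketch in Python of that by-hand arithmetic plus the software shortcut. Only two numbers echo the lecture: the 37 students total and the 7 students who got five points (the 19% slice); the other class frequencies are made up so everything sums to 37.

```python
import matplotlib.pyplot as plt

# 37 total and the 7 students with five points come from the lecture;
# the other frequencies are made up so the column adds up to 37.
classes = ["0 points", "1-2 points", "3-4 points", "5 points"]
freqs = [2, 10, 18, 7]
total = sum(freqs)  # 37

# The by-hand table: proportion of the circle, then degrees for the protractor
for label, f in zip(classes, freqs):
    proportion = f / total
    print(f"{label}: proportion {proportion:.2f}, slice angle {proportion * 360:.0f} degrees")

# The software shortcut: the pie function does the protractor work for you
plt.pie(freqs, labels=classes, autopct="%1.0f%%")
plt.title("Quiz scores (fake data)")
plt.show()
```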
I also wanted to let you know that I find it more informative, and I think a lot of people do, to put the percentage on the actual chart rather than just the frequency. Some people put both the frequency and the percentage, which is good; it's just not so helpful to put only the frequency, as you can see the nursing report did on the left. That's because you really don't know: 234 seems like a lot, but what proportion of the circle is that? That's what you'd want to know. Whereas if you look at mine on the right, you can see, for instance, that only 5% got zero points, and that's a small amount; you know what 5% means. Otherwise it's just hard to tell. If you look at the one on the left, it looks like roughly two thirds, which would be about 66%, but we don't know the percent, so it's really helpful to have it. And always include a title and a legend, because if you're graphing a pie chart, you're going to have more than one category, and people are going to want to know what each color means. This looks so good, doesn't it look good? So, pie charts are common in healthcare, and they graph mutually exclusive categories; you'll see them all the time. Like I said, they're easier to make using software. I use Excel; they can come out of other software too, but I like Excel because you can put fancy labels on and do that squiggle thing. But choosing a graph requires some consideration: whether you actually want to make a pie chart or a bar chart or something else requires some thought. And regardless of the chart you make, you should follow these rules. Always provide a title, even if it's just for your private use; trust me, I've done this, where I go back and I'm like, I don't even know what I graphed. So take your time, sit down, and write a little title so you remember. You should also label the axes, because again, you think you're going to remember, or you think it's obvious and everybody in the audience will be able to tell, but don't leave anything to be assumed; be absolutely clear about what's on each axis. Always identify your units of measure: whether you're talking about a rate per 10,000 people, or a percentage, or maybe an average, or a frequency, just make sure you're clear about it. Usually the units of measure end up on the y axis. The idea is to make the graph as clear as possible: think about font size, think about the number of items graphed. I've sometimes seen time series graphs where they put so many lines on there that I can't see anything, or they'll have really tiny font sizes, or they'll try to put too much on one graph, and it's hard to read. If you have trouble reading it, probably everybody else will too, so you should modify it. So I just threw this one up on the right: can you tell what's missing from the above graph? It's really missing a lot of information. We don't even know what it's about; we can kind of guess it's a time series graph because of the time at the bottom, but what else? The person who made this really knew what they were talking about, but we don't, and you don't want that to happen to your graph.
Okay, so here, what I'm going to do is review all the different graphs I've talked about in chapter two and the cases where each graph is useful, so you can keep straight in your heads why we have all these graphs. First, there's the frequency histogram. Remember, that was only for quantitative data, and it's what you make when you want to see the distribution; remember, the distribution is a shape, and a frequency histogram is a particular type of bar graph meant for showing these distributions. I also showed you how to make a relative frequency histogram, which is almost the same thing, only it graphs the relative frequency instead of the frequency. That also shows you the distribution, because the pattern will be the same, but it's specifically good for comparing with other data: if you have two sets of data, maybe from two different locations or two different groups, you want to use the relative frequency histogram, because then it's easier to compare distributions. I also showed you how to make a stem and leaf display; I explained what the leaves are and what the stem is. That's also for quantitative data, and it's also for seeing the distribution; it's good for organizing the data, too, and it's a little easier to make by hand than a histogram, because a histogram makes you make a frequency table first, and with a stem and leaf display you can kind of skip that step. So these first three were about taking quantitative data and visualizing it so you can look at distributions and also look for outliers. Next, we went into the time series graph, and that is really about time: it's for graphing a variable that changes over time and is measured at regular intervals, mainly to see trends, like is it going up, is it going down, was there an epidemic. That's what a time series graph is for. The bar graph, now, this is the generic bar graph, not the specific histogram I described; the generic bar graph can be used for qualitative data or quantitative data, and it can display frequency or percentage, and we went over some examples. Then I gave a shout-out to the Pareto chart, which is a special bar graph that graphs frequencies of events, usually bad things, in descending order; and again, we don't really use this much in healthcare. Finally, I went over the pie graph, and that's for mutually exclusive categories, quantitative or qualitative, and we use those a lot in healthcare. So, in conclusion, in this particular lecture I first went over time series graphs and explained how they show changes over time. Then I went over bar graphs and showed you how they can display quantitative and qualitative data, and how they can be vertical or horizontal, with some different examples. Then we went through pie charts, looking at mutually exclusive categories, which I think are my favorite; look at this pie, it makes me so hungry. But in the end, it's important to pick the right chart, because you want a useful visualization of your data. If you're trying to look for a distribution, choose the right kind of graph for that; if you instead want to look for trends over time, you've got to choose the right kind of graph for that. So I gave you some pointers on how to do that, and now my mouth is watering.
So I'm gonna go eat some pie. Yoo-hoo, it's Monica wahi again, your statistics lecturer from labarre College. I decided to chop up chapter two and reconfigure it, so this first lecture is going to be on part of chapter 2.1, frequency tables, and the entire chapter 2.3, which is stem and leaf displays. So here are your learning objectives for this lecture. At the end of this lecture, you should be able to state the steps for making a frequency table; define class, upper class limit, and lower class limit; explain what relative frequency is and why it's useful for comparing groups; state the steps for making a stem and leaf display; and finally, describe the difference between an ordered and an unordered leaf. If all that sounds foreign to you, don't worry, you'll understand it all by the end of this lecture. So, just to introduce what I'm going to cover: first, I'm going to define for you what a frequency table actually is, and then I'll explain how to make one, which will help you understand even better what it is. After that, I'm jumping right into what a stem and leaf display is and how to make one of those, and the main reason I can combine these is that I feel stem and leaf displays can help you make frequency tables. That connection was not really made in the book, so I'm making it here. So let's start with the frequency table. What is one of those? Well, when I think of frequency, I think of the radio, like R.E.M.'s "What's the Frequency, Kenneth?"; I think that was one of their last hits. Okay, that's not what we're talking about. We're talking about frequency like the word frequently, like how frequently you go to work per week, where you would count how many times you go to work or go to class per week. Frequency is like frequently: it's how frequently something happens. So first I'm going to explain what a frequency table is and why you make them, then I'm going to define some more terms (I just defined frequency, and there are more you're going to need to know), and then I'm going to explain the steps for making a frequency table and a relative frequency table. So remember, I'll just remind you: qualitative data are categorical, like gender, race, or diagnosis, where you put individuals into categories, and quantitative data are numerical, like age, heart rate, or blood pressure. I just want to calibrate you to the idea that this whole frequency table thing is about quantitative data, so this entire lecture is actually focusing only on quantitative data, not qualitative data, all right? Now, when you have quantitative data, as you've probably noticed if you've ever had any, let's say you go on Yelp (I always give that example) and you try to decide whether to go to a restaurant or not; you have a bunch of five-star, four-star, three-star, two-star, and one-star ratings. How do you make sense of that? You just have a pile of numbers, so how do you organize them? I'm going to give you a totally fake example I made up. I'm pretending that 60 patients were studied for the distance they needed to be transported in an ambulance: how far they needed to go from where they called the ambulance and were picked up to where they actually got to the hospital.
So the shortest transport in my fake data, or the minimum, was one mile, which is awesome. That's kind of what happens to me, because I live right near a hospital; hopefully I won't need an ambulance very often, but that's what happens in urban centers. The longest transport, the maximum, was 47 miles, which would really suck, and I just want to point out that that happens to people in rural areas because of lack of access. So this is kind of realistic even though it's fake data. But anyway, it's hard to just look at a pile of numbers, so how do we understand these data? Well, now I'm going to start those definitions. The word class means an interval in the data, and remember, we're talking about quantitative data. So let's say, I just made this up, we ask how many people got transported between 30 and 40 miles. That would be a class of 30 to 40. And the class limits are the lowest and highest values that can fit in the class. Carrying on with my example, I just randomly picked 30 to 40; if we made that a class, we would say 30 is the lower class limit and 40 is the upper class limit. Make sense? Alrighty. Then, of course, you have the width of the class, or the class width, which is how wide the class is. Carrying on with the example, if the upper class limit is 40 and the lower class limit is 30, what you do is subtract 30 from 40, which gives you 10, and then you add one, and you get 11. That's a little formula. But if you're like me and you count on your fingers, you would go 30, 31, 32, 33, 34, and so on, and you'd realize there are 11 numbers in there. Now we get to frequency, which, like I quickly explained, is how many values from the data fall in the class. So how many patients were transported 30 to 40 miles? Another way of saying it is, if you look at all the data you have and you find every single person that got 30, 31, 32, 33, and so on up to 40, and count all those people up, then you get the frequency for that class. Okay, but you probably realize you need to decide on classes before you go counting frequencies, because you need to know the lower and upper class limits. So let's talk about some rules about classes. First of all, classes have to be the same width. You can't have 30 to 40 and then 41 to 42; you can't have a skinny class and a fat class, they all have to have the same width. But there are different ways to pick that width. Class width can be determined empirically, and isn't that a fancy word; empirically here just means you choose it because you like it. If you ever look at survey data about just about anything, when they look at the quantitative variable of age, they often put it in classes, and as you'll see on the slide, these are the classes we often see: 18 to 24, 25 to 34, 35 to 44, and so on. That's what you normally see, and empirically, they just picked it out of a hat. And already you're probably noticing that 18 to 24, or 65 and older, aren't really the same width as the classes in the middle; like, what's the upper class limit for 65 and older? Okay, well, that's just normally what happens in the world, and especially in healthcare. In healthcare, when you pick classes, even though the classes are technically supposed to be the same width, you really should be guided by the scientific literature, and you'll see why later when I show you the other videos in this chapter.
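Just to make that counting arithmetic concrete, here's a tiny Python sketch; it's only an illustration with made-up numbers, not the actual 60-patient data set, showing the upper-minus-lower-plus-one count and tallying the frequency for one class.

```python
# One class from the example: 30 to 40 miles.
lower_limit, upper_limit = 30, 40

# Counting the whole numbers that fit in the class: 30, 31, ..., 40.
width_by_counting = upper_limit - lower_limit + 1   # 40 - 30 + 1 = 11

# Frequency: how many data values fall in this class (made-up mileage values).
transport_miles = [1, 8, 12, 29, 31, 33, 35, 38, 40, 47]
frequency = sum(1 for x in transport_miles if lower_limit <= x <= upper_limit)

print(width_by_counting)  # 11
print(frequency)          # 5 (the values 31, 33, 35, 38, and 40)
```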
The reason you should be guided by the literature is that you really want to be able to compare whatever you find to whatever other people have found before you, and therefore you don't want to cut up your classes in different ways, or it becomes hard to compare them. However, the book teaches this class width formula, so I thought I should show you that, too. Here's the class width formula, which I don't really see used much in healthcare statistics, but I'm going to teach you anyway. First you calculate this number: you find the maximum in your data and the minimum in your data, and you subtract the minimum from the maximum. In the example from the fake transport data, 47 was the maximum and the minimum was one, so I did the first step and got 46. Looking back at the formula, you divide whatever you got there by the number of classes desired, in other words, however many categories you want. You never want too many; you don't want 10 or something. Three, four, five, six, or seven, usually something in that range, is a good number of classes. So let's pick six just for fun. We take that 46 we got, divide it by six, and we get 7.7. Then, back to the formula, how you decide your class width is you increase this number to the next whole number. A lot of people are confused by that, because even if I had gotten something low, like 7.1, I'd still go up to eight; you have to increase it to the next whole number, so you have an integer, you know, a number without any decimals after it, for your class width. So our class width in this example would be eight. Now, I described that whole class width formula, but I'm not going to use it in the example, because we don't really do that much in healthcare, and it actually makes things kind of hard to understand, because you want something a little intuitive. If you look at the slide right now, less than 20 miles, 21 to 29, 30 to 39, and then 40 or more, that makes a little more sense in your head; that's how we think of miles. If I had put something like 18 to 24 and 25 to 29, we don't really think that way. So in healthcare it's helpful to boil it down to something like this. And by the way, if I were writing a real paper with real data like this, I'd be looking at the earlier papers that talked about ambulance transport and looking at their class limits. Okay, so a frequency table displays each class along with the frequency, the number of data points in each class. As you can see, the class limits are on the left side of this simple frequency table, you know, the classes, and then the frequencies are on the right side. And you'll notice that they all add up to 60, because we measured 60 fake patients, and it's really good to do that little check, because you don't want to double-count people and put them in two classes; they only get to be in one, et cetera. So, selecting arbitrary class limits can make the frequency table unbalanced. In other words, doing this empirical thing can make it sort of weird, because "less than 20" is big, and "40 or more miles" is big, bigger than the other classes, so it kind of breaks the rules of class width. But not following the scientific literature can make your results not comparable and can make the science less useful, and that's why I sort of rail against the book on this class width formula thing.
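Going back to that class width formula for a second, here's what those steps look like as a minimal Python sketch, just my own rendering using the fake transport numbers (minimum 1, maximum 47, six classes desired):

```python
# Step 1: maximum minus minimum.
data_min, data_max = 1, 47
spread = data_max - data_min          # 47 - 1 = 46

# Step 2: divide by the number of classes desired.
desired_classes = 6
raw_width = spread / desired_classes  # 46 / 6 = 7.67

# Step 3: increase to the next whole number (even 7.1 would become 8).
class_width = int(raw_width) + 1      # 8

print(raw_width, class_width)         # 7.666... 8
```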
So I'm going to give you another example of a frequency table. This one is also healthcare-y: glucose is measured in the blood and expressed in milligrams per 100 milliliters. Glucose should be cleared from the blood, especially when fasting; if you're not eating anything and not putting any glucose into your body, you're supposed to be metabolizing it. The problem is, some people don't metabolize glucose very well; that's what diabetes is. So you care about how much glucose is sitting around in people's blood. Here, blood glucose levels for a random sample of 70 women were recorded after a 12-hour fast, and this is what they got: the minimum was 45, the maximum was 109, and they picked six classes. This is how they set up their class limits, and again, this uses the class width formula, and just to demonstrate, it comes out a little weird here. Then they got these frequencies. So this is just another example, this time using the class width formula to get six classes and to make sure they covered everybody. Now, you'll notice that in this one we start with the minimum, 45 to 55, and we end with the maximum, going up to 110. That's really the clearest way to do it; it's just not typically done that way. If you read the scientific literature in healthcare, you just don't see frequency tables labeled like that. And just to wrap up this part: make sure all of your data points are accounted for, only once, in one of the classes. Whether you use the class width formula or empirically, arbitrarily picked classes, every single data point only gets one vote; it can only be in one of the classes. Also, you don't want to leave any data points out, so you want to make sure you account for all of them, and you need to make sure your classes cover all the data. In healthcare, when we do that "up to 20" and "65 and over" stuff, we make that happen. However, if you're going to use the class width formula, you really have to pay attention to where your minimum and your maximum are, because you want to make sure all of your classes cover all of your data. And like I mentioned, make sure the total of the frequencies in your classes adds up to the total number of data points; it's just a little check that you didn't do something wrong. Now I'm going to talk about what a relative frequency table is, and that builds on what you just learned about frequency. We all know what our relatives are; they're our family, right? We have relationships with them. And what relative means here is in relationship to the rest of the data. In statistics, they often use a fancy f to stand for frequency, and, as I've mentioned before, for the sample size, if you have a sample, they use a lowercase n. So the formula they use for relative frequency is f divided by n. And if you're clever with math, you realize what that means: if you take the frequency of any class, which is just a portion of the whole sample, and divide it by the total sample, which is that n, you get the proportion of values that are in that class. It's not really that fancy. So relative frequency is something very useful to put in a frequency table, and there's a little sketch of the whole thing right below.
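Here's a minimal Python sketch of building a frequency table with a relative frequency column. The data values are made up for illustration, they are not the actual 70 glucose readings from the slide, but the class limits follow the same 45-to-110 pattern.

```python
# Made-up fasting glucose values (mg per 100 mL), not the real data from the slide.
data = [47, 52, 58, 61, 63, 70, 72, 75, 78, 80,
        81, 85, 88, 90, 94, 99, 101, 104, 106, 109]

# Class limits, all the same width, covering the minimum through the maximum.
classes = [(45, 55), (56, 66), (67, 77), (78, 88), (89, 99), (100, 110)]

n = len(data)
total_frequency = 0
for lower, upper in classes:
    frequency = sum(1 for x in data if lower <= x <= upper)  # count values in the class
    relative_frequency = frequency / n                        # f divided by n
    total_frequency += frequency
    print(f"{lower}-{upper}: f = {frequency}, f/n = {relative_frequency:.2f}")

# The little check from the lecture: every data point counted once and only once,
# so the frequencies have to add back up to the number of data points.
assert total_frequency == n
```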
Back on the slide, you'll see that I kind of crammed the relative frequency in on the right side; this is the frequency table I just showed you with glucose, but with relative frequency next to it. It's super easy to calculate. For example, for the first class, 45 to 55, the frequency is three. What did I do, pull out the old calculator? Well, actually I used Excel, and I did three divided by 70, because 70 was the total, and I got 0.04. And for those of you who don't really like proportions, you can do that thing where you move the decimal two places to the right and put a percent sign on it, so that would be 4% of those 70 people in that first class. The same thing happened with the next one, 56 to 66: I took seven divided by 70, which came out to 0.10, and those of you into percents, and I'm really into percents, I like moving that decimal over, so I think of it as 10%. But whatever; as you can see at the bottom, it all has to add up to 1.0 if you're in proportion land, or 100% if you're like me and you like percent land. In any case, this is all you have to do to make a relative frequency table: you just make another column and do all those calculations. It's super easy, and it's very helpful. So why did we even do this? Because we had a pile of quantitative data, and it was really hard to organize. The first thing we had to do was select a class width, and I talked about the politics behind that, but ultimately, whatever you do, the lower and upper class limits need to be determined and put in the first column of your frequency table. Then in your second column, which holds the frequencies, you count up how many are in each class and fill it in. And if you make that third column, you can do that dividing thing and get your relative frequencies. That's how you build your frequency table, and as I go through future lectures, you'll see even more why you would make that table and how useful it can be. Given that you have quantitative data, and it kind of gets all over the place, it's very helpful to organize it in that table. Now I'm going to move on to talk about the stem and leaf, and the reason I'm talking about it now is that it's on the same theme of organizing quantitative data. I'm going to talk about what the stem and leaf plot actually is, here's just an example on the slide, how you make one, and why you might make one of these. You'll find it feels a lot like making a frequency table, so why make these instead of a frequency table? It's just more food for thought. First, one of the things I got hung up on when I took biostatistics is that I could not get over the fact that it was called a stem and leaf, so I had to understand that. This is an example of a stem and leaf here. So why is it called a stem and leaf? Well, there's always the stem. See these corn stalks? I'm from Minnesota, I'm used to seeing them, and you'll notice that there's a stem; this big corn stalk has a stem. That thing you see, that vertical line with a bunch of numbers on the left, that part of the stem and leaf plot is called the stem. And then leaves are added onto the stem as we tally up the data. That may not make much sense right now, but I'll show you how to make one.
But essentially, what you end up doing is adding these leaves. Like you see, off of the 2 there's a little leaf that just has a zero on it, but off of the 5 there's this big long leaf with a whole bunch of numbers on it. Making one will help you understand this terminology, but I first wanted to show you this picture, because it's actually kind of hard to understand what's going on with a stem and leaf unless you understand that the vertical line and the numbers to the left of it are considered the stem, and each of these things we build off of those numbers is called a leaf. So people talk about the four leaf and the five leaf. Okay, so again, I'm just so into making up data, right? So I decided to make up data on 42 patients who visited a primary care clinic and were referred to mental health. The reason I made up data on this subject is that I'm upset about it: I think people are waiting too long to get mental health treatment, especially if you've been following the news about the Veterans Administration in the US. A lot of people are put on hold even for primary care, they're put on waiting lists, and I don't like it, so I made fake data about it as a demonstration, just to highlight these issues. So what data did I make up? I made up the number of days between the referral and the first mental health appointment; that was what was collected. Let's say you go in on January 1 and you get a referral, and then 10 days later you actually show up at the clinic; then your value would be 10. So that's quantitative. Let's take a look at it. On the right side of the slide, you see just this pile of numbers from all these people who came in and got a referral. Like, the first person had to wait 30 days, a month, to see a mental health professional, but the third one only needed 12 days. That's how you read this fake data I made. And you'll see over on the left side, I already made a stem; it's blank, it doesn't have any numbers on it, but I knew I'd need that vertical line, so I made it in preparation. Okay, so let's build our stem and leaf. What we do is start with the first number, and that's what's awesome about this: you just start with the first number, and if you want, you can cross the numbers out as you go along to keep track. So we start with this first number, and you'll see what I did: I went over to the stem, and I put the three on the left side of the stem and the zero on the right. This begins the three leaf. Here's the next number. Now, I put the two above the three, because it comes right before it, and you can imagine we're going to walk down 2, 3, 4, 5, 6, and then I put the seven on the right side to start the two leaf. Alrighty, here we are with the next number, which is 12, and as you'll see, I started the one leaf. You're starting to see the pattern, right? And you can probably guess what happens next: we start the four leaf and put the two there. Okay, the next leaf we've already started, right, for 35. So what do we do there? Well, we just add the five onto the three leaf; the three leaf was already started with that 30 at the beginning, so we just pile a five on there. Here's 47: we just pile a seven on there. Now, you'll notice I tried to line up that seven on the four leaf with the five on the three leaf.
When you're doing this by hand, well, even when you're not doing it by hand, you really have to keep those things lined up, or you won't have a good stem and leaf. Now I'm going to fast forward a little, because you can probably imagine how to do the next ones, the 38 and the 36; you just keep piling them on. But I want to show you what happens when you get to a special case. Okay, we'll go with this 29; this is the last one before the special case. You'll notice that the 38 got put in there, see that eight on the three leaf, and the 36 got put in there from the second row; we put everything in. Now we put in the 29, and look at that, the next value is a 3. So where are we going to put that 3? You might think on the three leaf, but that's not right, because the three leaf is for 30-something. So where do you put the 3? Well, some of you figured this out: you have to add a zero onto your stem. So look at that, I put that zero there, and then we put the 3 in. And you can already guess how to do the 21 next; we just tack a one onto the two leaf. And when we get to a zero in the data, we just add a zero onto the zero leaf. So you can probably figure out how to pile up all of these, but I did want to talk about something else that happens with these stem and leafs. As you go on adding to the leaves, you've got to be careful, because you might end up with something big. Now, I really feel sorry for this fake person: 51 days for a mental health appointment, that's too long, right? But it causes us to have to add a five to the stem later. That can cause real estate problems, especially on a piece of paper; what if the four was right at the bottom of the page? Maybe you have to tape some paper onto the bottom; I have this problem a lot. You'll see I even had to move things up on the slide when we got to the 70 later and had to add the seven to the stem. And I just want to show you that, for some reason, this data set didn't have any 60s, but you still have to put that six placeholder on the stem; it's got to be there. Even as we go on, if we're missing any leaves in between, we still need the placeholder, because that space has to be there. And here's an outlier; we're going to learn about outliers pretty soon. This is a really long time, 105 days, which is kind of like the VA situation, right? Of course this is fake data, but it unfortunately reflects real data. You'll see that when we get to 105, not only did we skip the eight leaf and the nine leaf, and we need to leave spaces for them, but the 10 becomes part of the stem. So if we went on to 200 or 300 days, I mean, it would be awful to wait that long, but the first two digits would go on the stem; like if we had 365, the 36 of the 365 would be the part on the stem. Alright, so I just did a little demonstration to explain certain nuances of the stem and leaf that you might encounter in your life. Now I'm going to reflect back on the two ways I've described in this lecture for organizing quantitative data. First, I showed you how to make a frequency table, but with that one you need to set up classes and a class width and then count the frequencies; there's a lot of pre-processing, a lot of pre-calculation, so you really want to think while you're doing it, and you don't want to be distracted.
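If it helps to see the same idea in code, here's a minimal Python sketch; it's my own, it uses made-up wait times rather than the full 42-value data set from the slide, and it builds a stem and leaf the way I just walked through, including the zero stem, placeholder stems with no leaves, and a two-digit stem for the outlier.

```python
# Made-up wait times in days (not the full 42-value data set from the slide).
waits = [30, 27, 12, 42, 35, 47, 38, 36, 29, 3, 21, 0, 51, 70, 105]

# Each value's stem is everything but its last digit (0 for single-digit values),
# and its leaf is the last digit; 105 gets stem 10 and leaf 5.
leaves_by_stem = {}
for value in waits:
    stem, leaf = divmod(value, 10)
    leaves_by_stem.setdefault(stem, []).append(leaf)

# Print every stem from 0 up to the largest, keeping placeholder stems
# (like 6, 8, and 9 here) even though they have no leaves.
for stem in range(max(leaves_by_stem) + 1):
    leaves = leaves_by_stem.get(stem, [])
    # Printing sorted(leaves) gives an ordered leaf; printing leaves as-is
    # would give the unordered version described next.
    print(f"{stem:>2} | {' '.join(str(leaf) for leaf in sorted(leaves))}")
```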
A stem and leaf, on the other hand, you really can do on the fly. You don't need to set up classes or a class width; as you noticed, we just went down the line of that pile of numbers and crossed them off as we put them onto the stem and leaf. There was really no need to count; you can tally the data as you go through the list and cross values off, and it's just quicker to do. Of course, those of you who are pretty clever are saying, well, basically a stem and leaf forces everything to be in classes of tens, the 20s and the 30s and the 40s; that's the two leaf, the three leaf, and the four leaf. And yes, it is kind of a simplified way of making those kinds of classes. In any case, I just wanted to alert you to this, because you might see some similarities between the two, and I wanted to highlight those as well as the differences. Now I'm going to give you a few tricks. I want to tell you about the concept of an unordered leaf. An unordered leaf is what we were making before when I was demonstrating; it's just where the numbers are out of order in the leaf. Like you'll see this two leaf: it's 7, 7, 2, 9. Well, if it were in order, it would say 2, 7, 7, 9; the two would come first, before the sevens and the nine. And the same with the three leaf, that's out of order, because you can see the zero and the five are fine, but the eight shouldn't come before the six and the five. It's no problem to make an unordered leaf. However, after making an unordered version, you can rewrite the stem and leaf in an ordered way. You see how I did that: I rewrote the two leaf and the three leaf, and now all the leaves are in order. You don't have to do this, but you can. And if you make your stem and leaf unordered first, the way I was demonstrating, and then rewrite it into an ordered one, it is way easier to count it up to make a frequency table, no matter what classes you choose. Or you can just make each leaf a class, and then it's super easy to make the frequency table. That's why I combined these two pieces of the chapter: I wanted to show you how you can use a stem and leaf to help you make a frequency table. So a stem and leaf is just another way to organize quantitative data. It's easier to make on the fly than a frequency table because it requires less preparation, and it can help you put data in order in preparation for a frequency table; it's a first step that makes sure you can organize everything. And remember, I keep emphasizing that your frequency table has to reflect all your data points, and each one can only be in one class, and so on. Well, this is one way to make sure that happens: first do this pre-organization using an ordered stem and leaf. So in conclusion, frequency tables and stem and leaf displays organize quantitative data, and the stem and leaf may help you make a frequency table, so you might want to start with that. The purpose of both of these is to reveal a thing called a distribution, and I'm going to explain that in the next lecture.

Hello, it's Monica Wahi again, your lecturer from Labouré College, and we are moving on to chapter 3.1, which is measures of central tendency. Here are your learning objectives: at the end of this lecture, you should be able to explain how to calculate the mean.
You should also be able to describe what a mode is and say how many modes a data set can have; you should be able to demonstrate how to find the median in a set of data with an odd number of values, as well as in a set of data with an even number of values; and you should be able to define trimmed mean and weighted average. All right, so what are these measures of central tendency? I'm going to explain why we call them that, and then I'm going to talk about the three biggies, which are the mode, the median, and the mean, and explain how to get them. Then, toward the end of the lecture, I'm going to go into some special situations: one is called the trimmed mean, and the second is the weighted average. So let's get started. What is this central tendency thing? Well, think about quantitative data, because you can only do this with quantitative data, not qualitative data. When you have a pile of numbers, one of the things you want to know is how much they tend toward the center. Of course, you don't know where the center is until you start looking at the data. Some data are high up in the hundreds, like systolic blood pressure; I give a five-point quiz in one of my classes, so those numbers are low, like 1, 2, 3, 4, 5. But then the question becomes: do the data group toward the center of whatever list they're in, or don't they? How "center-y" are they? You see these distributions on the slide: you'd probably say the one on the left looks more center-y than the one on the right, you know, the normal distribution on the left and the right-skewed distribution on the right. Intuitively, you kind of know what I'm talking about, but what this lecture is about is how to actually put numbers on the difference between what you see on the left and what you see on the right. These are the numbers, the measures of central tendency: we're going to go over the mode, the median, and the mean. The median is a little different depending on whether you have an odd number of values or an even number of values; it means the same thing, but you calculate it slightly differently, so I'll go over that. And then the mean: a lot of you already know what a mean is, but there are a couple of special means we can make. One is called the trimmed mean, and another is called the weighted average, which is a weighted mean; I don't know why they chose the word average for that one, because mean and average mean the same thing. But I'm going to go over these things. Okay, let's start with the mode. The mode is the number in the data set that occurs most frequently. I put up this little tiny data set of just five numbers, and it's obvious that five is the mode, because it repeats; there are two fives in there. But look, I just changed one of them to a six, and now there's no mode. So I just want you to know that a lot of data sets don't even have a mode; there's no repeat at all in them. That usually happens when you have a broad range of numbers, like systolic blood pressure; it would be kind of lucky to get two people with the exact same value, though it can happen. So don't assume there's always going to be a mode; there might not be one. It's also possible to have more than one mode. Look at that: I've got six numbers up there, and the two repeats once and the three repeats once. So you've got two modes, right?
But let's say the three actually repeated three times; then there would only be one mode, because the three threes would trump the two twos. You can imagine how confusing this gets when you've got a ton of numbers. What's a little less confusing is, like I said, if you have a broad range of numbers, it would be kind of a coincidence if two patients had the exact same systolic blood pressure or platelet count, but if you did get a repeat in there, that would be the mode. Of course, if you measure a whole bunch of people, eventually you'll probably get one. And if you look at the slide with all those numbers, you'd really have to go through, organize them, and count them up to see if there is a mode; there probably is one, because we see a lot of repeats, but then which one wins, the one repeated the most? Or are there two repeated the most? It becomes kind of political when you really do it, and it's not worth a lot of work, because what does the mode tell you? It doesn't really tell you much. It does tell you the most popular answer. The word mode in French means fashion, so, like I put on the slide, à la mode, it's in fashion. It's the most popular or most common result, but it's not used a lot in healthcare; it's actually not used very often. Once in a while I'll say, oh, the mode in the class for my five-point quiz was five, meaning everybody did pretty well, they mostly got a five; that was the most popular result. But you hardly ever have to say that. And remember, we learned the word resistant: if a measure is resistant, you can't whack it out of shape very easily. Well, you can change things pretty easily with the mode; the mode is not resistant. I just demonstrated that on those slides: by changing one number, you can erase the mode, or add a mode, or whatever. So it's not stable, it's not resistant, and those are the kinds of things we don't really like in healthcare, so we don't really use it. So I'll move on to some cooler measures of central tendency, and here's a really cool one, called the median. It's the middle of the data, and I'll explain a little more what we mean by the middle of the data. Remember, we're talking about quantitative data, so you've got some pile of numbers, and no matter what, you can always sort them in order from lowest to highest. I keep talking about this five-point quiz I give in my class. It's an easy quiz, and most people get fives, but even so, somebody usually gets a four, or somebody doesn't show up for the quiz and gets a zero. It doesn't matter: I can have 100 people in the class and still put all of those numbers in order from lowest to highest, even if most of them are fives, because you'll get repeats in your data sometimes. And sometimes you'll get outliers, like if one person doesn't take the quiz and gets a zero while everybody else gets fours and fives on an easy quiz; then that zero would be an outlier. You don't have to worry about that. And like I said, sometimes the data values are almost all the same, like almost everybody getting a five on my quiz because it's so easy. It doesn't matter; even with these weirdnesses in your data, you can still just arrange the values in order. And that's what we mean by the median: it's the number that is halfway up, or halfway down.
So if I've got 100 people in my class, and I've got the zero over here on the left, and I put all the fours and then the fives, I have to count up, what, 50, right, to see where the middle is, and it's probably going to be in the five range. But that's all we mean: you take however many values you have, put them in order, even if there are repeats or outliers or whatever, and then count up halfway, and that's where the median is going to be. So I'll demonstrate this here. How to find the median: the first step is to order the data from smallest to largest. I'm giving you two demonstrations, and I don't even know what these data mean, I just totally made them up. The data set at the top, the one that starts with 42, has only five numbers in it, so I'm going to demonstrate the odd version with it. The set at the bottom has six numbers in it, so I'm going to demonstrate the even version, because remember, it goes a little differently depending on whether you have an odd or an even number of values. Okay, so those are the numbers, and we still have to do the first step, which is to order the data from smallest to largest, because you can see they're not in order. So I'm going to do that here. Okay, there it is: those are the same numbers, just in order from smallest to largest. Now we're going to get rid of the numbers above them and instead put in the positions they're in. Let's look at the top data set, which is the odd one. This is how you find the median: you number the positions, 1, 2, 3, 4, 5, and the median is in the middle position. You can imagine, if we had seven data points, we'd go 1, 2, 3, 4, circle that one, and that would be the median. So that's what you do with an odd number of values: you put them in order, and see, I numbered them for you, and then you take the middle number, and that's the median. It's 42 in this one. Okay, now let's do the downstairs data set, the one with six values. As you can see, the positions are numbered, and then what do you do? You go to the third and fourth positions, which are kind of the middle, and you literally make an average of them: you add the two together and divide by two. They happen to be seven and eight, right next to each other, but if they had been, say, eight and 10, then the average would have been nine, and that would have been the median. Because this is seven and eight, you do seven plus eight divided by two, and it's 7.5. So when you do the median with an odd number of values, you're going to end up with one of the values that's actually in there. If you're doing the median on an even number of values, you might get something with a decimal, because you're looking for the two values that straddle the middle and you're averaging them, so you might get kind of a wacky number like 7.5 that's not in the underlying data set. Now, this is fine if you have five or six or seven numbers. But what if you have, like, 150 numbers? You still have to put them all in order to begin with; I'd use Excel and probably just sort them. But then you have to know how many numbers to go up, and it's not obvious. So this is how you find the middle position: they have a little formula for it. Let's say we have an odd number of values.
I'm giving you the example of 21: let's say there are 21 students in my class, so that's how many values I have, and I want to make a median of their grades. What I would do is put them all in order and say, well, I have to go up so many, and that's the median; but I don't know how many to go up, so I use this calculation. I take the n, which in our case is 21, I add one to it and get 22, and then I divide by two. That's just how it works. So if you had 41, you would do 41 plus one, which is 42, divided by two. Or if you had, and I don't know why I keep picking odd ones, 27, you'd do 27 plus one, which is 28, and 28 divided by two is 14. You see, with an odd number of values, the n plus one comes out even, so you end up with a whole number, and that's the position you go up to. So if I had 21 students in my class, and I took the grades and arranged them in order from lowest to highest, like their quiz grades, most of them would probably be fours and fives, but it wouldn't matter: I would just start with the lowest and count up to the 11th position, and that would be my median. Now, you also have to find the middle position when you have an even number of values. So I took the example of 14, and you'll notice we use the same formula, but here you get 7.5. That's not the median; that's just how many positions you have to go up. Remember, on the earlier slide, we had to go between the third and fourth positions and average those two numbers. Well, this is basically saying that if you get 7.5, you go to the seventh and eighth positions, the ones that straddle it, and those are the two values you average. So if my n is 100, a nice even number, you do 100 plus one and get 101, and then you've got 50.5, and that's just a secret message that when you line up all your data, you take the 50th one in the row and the 51st one in the row, add them together, divide by two, and that's going to be your median. I just wanted to share this little formula with you in case you get a large pile of numbers thrown at you, where putting them in order is a big pain and you then have to figure out how many to count up; you can use this formula to find the middle position. So what does a median tell you? We have a lot more to talk about here. First of all, it's called the 50th percentile of the data, which means 50%, or half, of the data points are below the median and the other half are above. That intuitively makes sense, because we just created this median together, and we could see that half of the points are on the bottom and half on the top. It's also known as the middle rank of the data. And what's nice about the median is that it doesn't really care much about the ends of the data. If I gave extra credit to a few people on my five-point quiz and they got a few sixes, the median probably wouldn't even change, because the middle is where all the action is, and that's where we find the median. Outliers don't really bother it either: if one or two people get a zero on the quiz, and there are 21 people in there, or 100 people, the median really isn't going to be affected by these things happening at the ends.
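Here's a minimal Python sketch of the mode and the median; it's just my own illustration, with made-up numbers chosen to match the answers described in the lecture (a middle pair of 7 and 8, so the even-case median comes out to 7.5).

```python
import statistics

# Mode: the most frequently occurring value; there can be one, none, or several.
print(statistics.multimode([3, 8, 5, 5, 9]))     # [5]: the five repeats
print(statistics.multimode([2, 3, 2, 5, 3, 9]))  # [2, 3]: two modes

# Median by hand, using the (n + 1) / 2 position rule.
def median_by_hand(values):
    ordered = sorted(values)                  # step 1: order smallest to largest
    position = (len(ordered) + 1) / 2         # e.g., 21 values -> 11, 14 values -> 7.5
    if position.is_integer():
        return ordered[int(position) - 1]     # odd n: take the middle value itself
    low = ordered[int(position) - 1]          # even n: average the two values
    high = ordered[int(position)]             # that straddle the middle
    return (low + high) / 2

print(median_by_hand([42, 3, 58, 40, 61]))    # odd n: 42
print(median_by_hand([10, 4, 2, 8, 7, 9]))    # even n: (7 + 8) / 2 = 7.5
print(statistics.median([10, 4, 2, 8, 7, 9])) # the built-in agrees: 7.5
```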
So we like the median: it's very resistant and very stable, and you can't really whack it out by throwing some outliers on the ends. Now I'm moving on to the third measure of central tendency, which is the mean, but I also threw in trimmed mean and weighted average here, because there are other kinds of means, and we're going to talk a little more about resistant measures, like I just mentioned. But first I'm going to step back and talk about the Greek letter sigma, and that's actually capital sigma. I do not speak Greek, and I actually have trouble speaking statistics, because a lot of it is in Greek, so I try to avoid that in my lectures, but sometimes you can't get away from it. So I have to introduce you to this capital sigma. In English, or statistics-ese I guess, whenever you see this, you say "sum of," and you expect something to come right after it. So if you see the sigma and then x, you would say "sum of x." That's how you say it. So what is x? Well, remember how we were just making medians and looking at modes? Each value there is considered an x; each of the values in those data sets is an x. So "sum of x" means add them all up, add up all the x's. And to throw in another example, if somebody came to you and said "sum of xy," it would mean you must have some xy's lying around and you have to add them together. Or if somebody said, you know, "sum of the prices of the food in your basket at the grocery store," you'd be like, okay, I have to go through all these prices and add them up. That's what "sum of" means, and it's used a lot in statistics; we're going to use "sum of" all the time. So I just want you to get it in your head that whenever you see "sum of," there's going to be something next to it, and it's going to be a batch of numbers you have to add up. If they're numbers from our data set, they'll be called x; if they're other numbers from something else, they'll be called whatever they're called. Just know that this symbol means "sum of." I see on the slide that the upper one is in Times New Roman and the lower one is in Arial, so they look a little different, but I just want you to get ready to deal with this "sum of" a lot. Okay, so here we are, I'm hitting you with a "sum of": this is the formula for the mean. A lot of you already know how to calculate the mean, you just kind of do it, and you didn't know this is how you say it in statistics. Basically, it's this ratio, like a fraction. On the top of the fraction is sum of x: you add up all your x's. On the bottom of the fraction is n, which is however many you have. So you add them all up and divide by however many you have, and you've probably been doing this your whole life, but this is actually the formula. So I thought I'd demonstrate it. See, remember those six data points I was using for the median? I copied them over here and added them all up, and I get sum of x is 40. Then I count them, and there are six; well, I made there be six. And 40 divided by six is 6.7, so that would be the mean for these data. You probably already knew how to do that, but I wanted to cross-reference it with the actual formula.
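Here's that same calculation as a tiny Python sketch, using made-up numbers that match the description (six values that add up to 40):

```python
# Six made-up data points whose sum is 40, like the example in the lecture.
x = [10, 4, 2, 8, 7, 9]

# The mean formula: sum of x divided by n.
sum_of_x = sum(x)          # 40
n = len(x)                 # 6
mean = sum_of_x / n        # 6.666..., about 6.7

print(sum_of_x, n, round(mean, 1))   # 40 6 6.7
```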
Okay, now I'm going to take a little break here to talk about means, because remember, we talked about sample statistics and population parameters. If somebody just talks about a mean to you and says, look, the mean of such and such is six, then unless you really get into it with them, it's not going to be obvious whether they calculated a sample mean or a population mean. But when we write it down, it becomes obvious. If I say x-bar, see that x with a little line above it, that's pronounced x-bar, and you'll see I also write it out on the slides as "x bar," because it's hard to type that little line; they mean the same thing. Whenever you hear x-bar, or see that x with a line over it, it means it's a sample statistic. So if you ever see x-bar equals six, not only do you know the mean is six, but the secret code says this mean comes from a sample, because x-bar is being used. If you look on the right side, you'll see this other symbol, and it's pronounced mu; it's a Greek letter again. You'll see on the left I put it in Arial and on the right in Times New Roman, so it looks a little different, but it's pronounced mu. And if you saw mu equals six, you'd be like, whoa, they measured a population. You'd probably say that, too, because you don't see mu a lot; people usually don't measure whole populations, it's a lot of work, so you more often see x-bar. Even so, I want you to be cognizant of whether it says mu or x-bar, because either way it's a mean, but if it's mu, they're talking about a population, and if it's x-bar, they're talking about a sample. That might be more important later, so just keep it in mind. Also, when we talk about samples, we use a lowercase n to mean the number of values we have, whereas when we're talking about populations, we use an uppercase N, a capital N. So you'll see the sample mean formula on the left side of the slide, x-bar equals sum of x divided by n, and it changes if you're talking about the population mean. And you're like, come on, you add it up the same way. Yes: mu is basically the population mean, and capital N is just the number in the population, so it's almost the same formula. But the issue is that you really are supposed to label things what they are. If you're calculating a population mean, you're supposed to call it mu and write it like the right side of the slide, and if you're calculating a sample mean, you're supposed to call it x-bar and write it like the left side. I just wanted to make that clear as you go through the rest of these lectures, because when I say mu, I mean a mean, but from a population, and when I say x-bar, I also mean a mean, but from a sample.
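Written out in symbols (my own rendering, since the slide itself isn't reproduced here), those two versions of the formula look like this:

$$\bar{x} = \frac{\sum x}{n} \ \text{(sample mean, with sample size } n\text{)} \qquad\qquad \mu = \frac{\sum x}{N} \ \text{(population mean, with population size } N\text{)}$$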
Alright, so now we've talked about several measures of central tendency, but I wanted to put means and medians together in kind of a cage match, because I want you to look at them and see what their differences are. Now, I've been giving accolades to the median, because it is very resistant to outliers and very stable; remember how I pointed out that if you throw some outliers on either end, it doesn't really affect it much. Unfortunately, means are not resistant to outliers. Like if I took my five-point quiz and just felt like giving one student 10 points, it would totally throw off the mean for that class. So the mean is not very stable. One thing we can do if we've got outliers in our data is to just use the median, but sometimes we want to use the mean, so we have to do different things with it. One of the things we can do to try to make a more stable, more honest mean is to trim it, so I'm going to talk about how you do that. As you can see on the left side of the slide, a very high value or a very low value, like an outlier, or more than one outlier, can really throw off the mean, and that's not a problem for the median. So if you want to make the mean a little more resistant, what you can do is trim data off of each end, so the outliers get cut off. The catch is that you can't really look at the data while deciding that; you have to make a rule in advance and say, okay, I'm going to trim this much off the top and the same amount off the bottom, and it has to be equal, and you just have to look away while you're doing it. So what some people do is a 5% trimmed mean, which means you take 5% of the data at the top and cut it off, and 5% at the bottom and cut it off, so you basically lose 10% of your data. In healthcare, a lot of people get mad about that; they don't want to lose any data, so they don't like to use this way of fixing the problem of outliers, they use other ways. But I wanted to show you this as a simple fix. I'm going to imagine we have 100 data points, because that makes it easier to see what's going on. If you had 100 data points, 5% of them would be five, so basically you'd be trimming five off the top and five off the bottom. The first step: probably you already made the mean out of these 100 and you didn't like it, because you saw outliers at the top and bottom. So what you do is put the data in order, just like you do for the median; you sort them from lowest to highest, all 100 of them. Then you circle the five bottom-most ones, and they get cut off, and you circle the five top-most ones, and they get cut off; they get thrown out. Then you've got the 90 values left in the middle, and you make a mean out of those, and that's a 5% trimmed mean. And you've got to tell people if you do that: you can say, here's the original mean, and here's the 5% trimmed mean, because then people get the idea that there must have been some outliers and some of your data got hacked off. This might give you a more stable estimate of the mean. Now I'm going to move on to something else entirely. This isn't about trying to make the mean stable; it's about making the mean a little different. Sometimes certain values should count more than others toward the mean. That sounds esoteric, but the place we see it all the time is school. You might get a great grade on your homework, you might get A's on your homework, but if homework is only worth 10% of your final grade, it doesn't help you much. That 10%, when you have a class like that, is called a weight. When you move into statistics, you say, well, I as the teacher am going to weight your homework grade at 10% of your final grade.
So it doesn't matter how awesome your homework grade is, or how bad it is; it's really only going to count for 10% of your final grade. And that's why we do weighted averages. I don't think your homework should be worth, like, 50% of your grade; that doesn't make any sense. So you might want different things to contribute different amounts of weight to that final mean. This is a way of messing around with the mean and making certain things that go into it count for more, or have a bigger vote, than the others. I'm going to stick with school examples, because this is where we normally see it. In this example, homework is worth 10% of your final grade, quizzes are worth 20%, and the final is worth 70%. And I just want to point out, I've actually seen people do this, because I tutor, and it's horrible to make your final worth over 50% of your grade. This is a shout-out to any professors watching: don't do this. Okay, but anyway, let's say I was mean and I did it. And let's say you're a pretty good student and you got an A on the homework; we're going to say that's a 4.0, because a lot of schools say an A is 4.0. Then let's say you got a B plus on the quizzes, maybe because the lectures weren't very good, haha. That B plus translates to 3.5 on the four-point scale. And let's say you got a B on the final. That's too bad, but that's a 3.0. Why do I say that's too bad? Well, you'd want an A there, because the final carries the greatest weight, right, it counts for 70%, so you'd want that grade to be really high. Now, first I want to show you the non-weighted average, the normal mean. The normal mean you would make is: you just add the 4 to the 3.5 to the 3 and then divide by three, because you have three numbers in there, and you get 3.5; you'd get a B plus in the class, right? But now let's look at that formula. This is the weighted average formula: it's the sum of x times the weights, divided by the sum of the weights. And remember what I said about "sum of xw" as an example: instead of just summing x, like we did in the non-weighted average, we have to do x times w for every value and sum that, and you're like, what's w? Well, remember, I told you the homework is worth 10%; that's the weight for it. And when we do the weighted average, instead of the percent we use the decimal version. So you'll see, under the weighted average, I'm doing that sum-of-xw thing by taking the 4 and timesing it by 0.1 for that 10% first; then that B plus, the 3.5, gets multiplied by 0.2, because it's worth 20%; and then the B you got on the final gets multiplied by 0.7. That's the sum-of-xw part, and what do you get? You get 3.2. Now, I didn't even bother to divide this by sum of w, because sum of w is one in this case: if you add up 0.7 plus 0.2 plus 0.1, you get one. That often happens; you usually just make the weights add up to one. But I wanted to let you know that if for some reason you had goofy weights that didn't add up to one, the last thing you have to do is divide by their sum. So, as you can see in the lower part of the slide, the sum of xw is 3.2, and if we divide it by one, we get 3.2. And now you don't get a B plus in the class; now you get more like a B. There's a little sketch of both the trimmed mean and the weighted average right below.
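For those of you who like to see it in code, here's a minimal Python sketch of a 5% trimmed mean and of the weighted average formula; the trimmed-mean data are made up, and the grades and weights are the ones from the example.

```python
def trimmed_mean(values, trim_fraction=0.05):
    """Cut trim_fraction of the values off each end, then average what's left."""
    ordered = sorted(values)                       # put the data in order first
    k = int(len(ordered) * trim_fraction)          # e.g., 5% of 100 values -> 5
    kept = ordered[k:len(ordered) - k] if k else ordered
    return sum(kept) / len(kept)

# Made-up data with outliers on the ends (say, wait times in days).
waits = [1, 2, 3, 3, 4, 4, 5, 5, 6, 7, 7, 8, 9, 10, 12, 14, 15, 18, 21, 105]
print(sum(waits) / len(waits))   # ordinary mean, dragged up by the 105: 12.95
print(trimmed_mean(waits))       # 5% trimmed mean (drops the 1 and the 105): 8.5

# Weighted average: sum of (x times w) divided by the sum of the weights.
grades  = [4.0, 3.5, 3.0]        # A on homework, B+ on quizzes, B on the final
weights = [0.1, 0.2, 0.7]        # 10%, 20%, 70%
weighted = sum(x * w for x, w in zip(grades, weights)) / sum(weights)
print(sum(grades) / len(grades)) # unweighted mean: 3.5, a B+
print(round(weighted, 2))        # weighted average: 3.2, more like a B
```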
So that's the difference between the unweighted and the weighted average: the weighted average weighted that final B extra, and that caused the final grade to be lower. That's what weighting does. Now, I just want to say a few more things. I've gone through all our measures of central tendency, but I want to talk about how they relate to the distributions we learned about recently. I put up an example of a normal distribution, and I color coded these lines: see, there's a blue mean, a green median, and a purple mode. Technically they should all be right on top of each other, but you couldn't see them if I drew it that way, so I just squished them up next to each other. The point is, if you have data with a normal distribution, all three of these sit on top of each other. And the magic of this is that you don't even need a histogram to know. Like, I use statistical software, and I'll feed in the data, a quantitative variable, and say, tell me the mean, median, and mode, and it will. Even if I don't look at the histogram, if it gives almost the same number for the mean, the median, and the mode, I automatically know it's a normal distribution. Well, that's not the case with skewed distributions. With skewed distributions, the measures of central tendency are not right on top of each other; in fact, they're in a different order depending on whether the data are right skewed or left skewed. At the top of the slide, I've got an example of a right-skewed distribution, right, because it's light on the right. So what's happening here? Well, the mean is getting dragged around by that big tail, so you can see that the blue mean is on the right side of the median. The median is more resistant, so it hangs out closer to the bulk of the data, but that right tail is pulling the mean up, and the mode is the lowest one. So if I get this printout and I see that the mode is the lowest, the median is in the middle, and the mean is the highest, I can say, without even looking at the histogram, that this is probably right skewed. Now let's look at the bottom of the slide, where we have a left-skewed distribution, you know, because it's light on the left, and you see the same phenomenon going the other direction: that tail toward the low end of the data is dragging the mean down now. Notice the median is more resistant and doesn't get dragged down as much, and of course the mode stays at the high part of the data, where there's more data. So if I get the printout and I see that the mean is the lowest, the median is in the middle, and the mode is the highest, I'm like, okay, I don't even have to look at the histogram, I know this is left skewed. So that's basically what I wanted to tell you about the distributions and how these actual numbers relate to them. In conclusion, this lecture was mainly about the measures of central tendency, the mode, the median, and the mean, and how to calculate them. And, you know, I've been kind of bagging on the mean; I'm sorry, but the mean is just not resistant, it's totally not stable, and the median is. So you want to remember these things. Yes, you can kind of fix things by doing the trimmed mean, but we don't really like to do that in healthcare.
That's because we lose some of our data; we find other ways of fixing the fact that our mean may be kind of goofy, but how we do that is outside of this lecture. I also showed you the weighted average, just in case you have to hand-calculate your grade. I actually had a student in my class once, back when we had Blackboard, and there was something wrong with Blackboard, so she was really upset because she thought she was getting a really bad grade. But she only thought that because she hadn't done a good job of learning the weighted average: when I showed her how to actually calculate her grade, it turned out to be a B. I remember she was crying in my office because she had done an unweighted average; then I showed her how to do the weighted average, she stopped crying, and she was getting a B. So just don't cry; try the weighted average first, okay? And then finally, I went over distributions and measures of central tendency, and related how the numbers we get from the measures of central tendency can be placed on distributions and tell us something about the distribution. All right, well, you made it through the measures of central tendency; get ready for 3.2, measures of variation.

Hello, and welcome to chapter 3.2. It's Monica Wahi, your Labouré College lecturer, and I'm here to go over measures of variation with you. Alright, here are your learning objectives. At the end of this lecture, the student should be able to state three different measures of variation used in statistics. You should also be able to explain how to calculate variance and standard deviation, and I'll give you a hint: those are two of the measures. You should also be able to calculate the coefficient of variation and explain its interpretation. And finally, you should be able to state Chebyshev's theorem. So now we're going to be concentrating on measures of variation. The first one I'm going to talk about is the range, and then I'm going to talk about variance and standard deviation, which are two different measures, but I'm going to talk about them together, and you'll see why. Then we're going to go over the coefficient of variation, which is abbreviated CV. Then we're going to talk about Chebyshev; Chebyshev came up with a theorem, and we're going to talk about his theorem, which leads us to calculate these intervals. Remember, intervals have a lower limit and an upper limit; I'll remind you of that, and we'll calculate Chebyshev intervals together. Alright, let's get started. So let's think about variation. What does variation even mean? Well, it means: how much do the data vary? Imagine I taught two classes, which isn't too hard, because I do teach two sections of the same class. Imagine that I gave a quiz and the mean grade was the same in each class. Could we tell how internally consistent those grades were? For instance, let's say I gave a five-point quiz and the mean in each class was three. Do we really know how many people got something far from three? Maybe in one class people got a lot of fives and ones, and that's how we got the average of three.
And maybe in the other class, everybody just got a three. We really can't tell from a measure of central tendency like the median, or the mean, or even the mode, how internally consistent the data are. Two different classes can have the same mean and a totally different kind of variation behind the scenes. So when you're talking about quantitative data and you have a whole data set, doing the measures of central tendency, mean, median, and mode, doesn't tell the whole story; you also have to add on information about variation. The calculations we're going to learn in this lecture are ways to express how much the data vary in the data set. It's separate from central tendency: central tendency is just about central tendency, and variation is about variation, and you need to know both before you can really evaluate your data set.

So we'll get started on ways to calculate these measures of variation. As I said, I'm going to go through range first, then variance and standard deviation. And I just want to remind you, you know how I'm always going on about sample statistics versus population parameters? Well, that starts playing in here, in that the formulas are slightly different for the sample variance and standard deviation versus the population variance and standard deviation, so we'll go over those different formulas. Finally, we're going to talk about the coefficient of variation, or CV, but we'll do that after the others.

Okay, so we're going to start with the range, because it's the simplest to calculate. Here's how you do it. You'll notice on the right, I just made up five numbers, totally made them up for demonstration, because the range is the difference between the maximum and minimum value. It's pretty easy to calculate: you first search around for the highest value, the maximum, which in this little data set, it's so cute, it's only got five numbers, is obviously 78, and it's sort of obvious that 21 is the lowest. How you calculate the range is you take the highest minus the lowest, and you get a number, and that's the range. Sometimes my students actually write down the highest, then a minus sign, then the lowest, and tell me that's the range, and I'm like, no, you actually have to subtract it out. So you'll see here it says 78 minus 21 equals 57. The range is 57. All it's telling you is the distance between the top and the bottom.

And I'll just say that's not very useful. In fact, I had a problem with that when I was working on this Army database. I looked at the range of ages of soldiers when they started, and the range was age 4 to 107. Obviously there was a problem with the data, right? For some reason there was a screwed-up record that said somebody got in when they were four, and another screwed-up record that said somebody got in when they were over 100; they were just screwed-up data, and that caused me to have this ridiculous range. So the range is not very stable or resistant. If we just fixed that record that said somebody was four when they got in the Army, then we might have a normal range, with a minimum more like 17, 18, or 19, or something.
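Here is the whole range calculation in a couple of lines, a minimal sketch of my own; only the lecture's maximum (78) and minimum (21) were given, so the middle values below are placeholders.

```python
# Range = maximum minus minimum. You have to actually subtract it out.
def data_range(values):
    return max(values) - min(values)

sample = [21, 30, 45, 60, 78]     # placeholder middle values; real min and max from the slide
print(data_range(sample))          # 78 - 21 = 57
```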
But, as you can see on the right side of the slide, I just picked out the minimum and the maximum, and we could arbitrarily change those numbers and suddenly we'd have something totally different from 57. So even though the range is a measure of variation, it's not stable or resistant, and it actually doesn't tell you much. If I say we've got a range of 57, you don't know if the minimum is zero, or negative, or 105; you really don't know where that range sits. So it's not very useful, but it's a place to start, because it's our first measure of variation. You'll sometimes see articles report the range, but they usually don't state the single number I just told you to calculate; they actually state the minimum and the maximum, and sometimes that's interesting.

Now we're going to get into what we really use in statistics a lot: variance and standard deviation. That's what we really live on in statistics for measures of variation. You're probably wondering why I'm talking about them together when they're totally different calculations. Well, it's because they're friends. And how are they friends? The variance calculation is kind of a big formula, and you get through that and you have the variance; then all you have to do to get the standard deviation is take the square root of the variance. That's why they're friends: you go through all this trouble to get the variance, and then the next step is just the square root of that, and you get the standard deviation.

Before I actually talk about those formulas, I wanted to set in your head what these words mean. I remember I worked in a mental health place, and we didn't have enough licensed people there, so our leader said, oh, I'm applying to the state for a variance, meaning the state would allow us to vary from the rules. Well, that's what variance is: how the data vary. So think of the spread of the data, and how well the mean really represents that spread. Variance is a way of representing how the data really vary around the mean. Now you're probably wondering, well, then why do you even have standard deviation if it's just the square root of variance? Let's just think about what the words mean. Standard means sort of following a standard, or the same, so it's the amount of variation that's standard in the data set. And you know what the word deviation means. You can say, oh, that person is a social deviant because they commit crimes or something. Or this guy with a healthy nose does not have a deviated septum, but some people do have a deviated septum, where it's crooked and they have trouble sneezing and blowing their nose and sometimes even breathing. Well, a standard deviation would simply mean that everybody's deviation is about the same. So variance is a calculation that says how much things vary.
And the standard deviation is just the square root of the variance, but I want you to imagine in your head: oh, standard deviation, that means how much the data deviate around the mean. A lot of times students get confused and try to apply the measures of central tendency to variation, but variation is a totally different thing. So just remember what variance literally means and what standard deviation literally means, and that might help you get through these formulas and understand the interpretation.

As I mentioned earlier, the formulas for variance and standard deviation are different depending on whether you're talking about a sample or a population. Admittedly, we don't use the population variance or population standard deviation calculation very often, because we don't measure the population that often, so we tend to use the sample variance and sample standard deviation all the time, and those are the ones I'm going to demonstrate. But you'll notice they're conceptually really similar. Population parameters like mu and the population standard deviation tend to behave similarly in formulas to the sample versions; it's just that in statistics we always want to be really clear about what we're talking about, so we use the right symbol to hint at whether we're analyzing a sample or a population, even though conceptually a mean is a mean, right? You want to represent which mean you're talking about, the one that's a parameter or the one that's a statistic, whenever you write out the formula. I'm just being picky about that.

Then there are two other things you want to know. There are two different ways of writing each of these formulas. You know how in algebra you can have a big equation and express it more than one way? That's all they do: they write the formula one way, called the defining formula, and then the same formula, rearranged by algebra, is called the computational formula. I always think it's kind of funny that they call it the computational one; both formulas give you the same result, it's just plugging in numbers and getting out the answer, and the answer is going to be the same whether you use the defining formula or the computational formula. But what I think is so funny is that they call it the computational formula, and I cannot compute it; I always get confused when I use it. So I pretty much ignore the computational formula in my entire life, and I just teach the defining formula, and I find my students always remember the defining formula and can always get through it. People who are into the computational formula tell me I'm doing things the hard way, going the long way around, but you know what, going the long way around helps you not get confused and helps you convince yourself you actually got the right answer.

So let's just do the defining formula. You can look up the computational formula, but this is the defining formula, so let's get our minds wrapped around that. Remember, I told you that variance is great because you calculate it, then you just take its square root and you get the standard deviation. As you can see on the left side of the slide, we abbreviate the sample variance as s squared, which is the standard deviation to the second power.
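For reference, here are those formulas written out in standard textbook notation; this is the usual way they appear, not a copy of her slide.

```latex
% Sample variance (defining formula) and sample standard deviation
s^2 = \frac{\sum (x - \bar{x})^2}{n - 1}, \qquad s = \sqrt{s^2}

% The algebraically equivalent "computational" formula she mentions skipping
s^2 = \frac{\sum x^2 - \frac{\left(\sum x\right)^2}{n}}{n - 1}

% Population versions: use \mu instead of \bar{x} and divide by N instead of n - 1
\sigma^2 = \frac{\sum (x - \mu)^2}{N}, \qquad \sigma = \sqrt{\sigma^2}
```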
I know that notation sounds ridiculous, right? Like, why don't we have a special symbol just for the variance? Why do we say it's s squared, and then say the sample standard deviation is just s? Well, to be honest with you, people use different notation. I'm using this because it matches the textbook we're using, but other textbooks and statistical software will often say "var" for variance, and they'll also write s squared like this. Maybe it's a good way of remembering that the standard deviation is just the square root of the variance. So if you ever see s squared, remember: s is the sample standard deviation, and s squared is the sample variance. I'll show you the population ones in a minute, but if you see those symbols, that's what they're talking about.

Now, let's look upstairs at the top formula. See that thing on the top? It's really kind of scary, but we're going to work through it and you're not going to be scared of it. I know you know there's a little sum sign there, that capital sigma, so you know something's going to get summed up. And that (x minus x bar) squared piece looks kind of scary, but we'll handle it. The n minus one on the bottom, that's not so scary, and we'll handle that one too. And then you'll notice that all I did for the bottom formula is put a huge square root sign over the whole thing; that's the only difference between the upstairs and the downstairs.

I also wanted to show you a picture of a calculator, because a lot of times, if you haven't done math or statistics for a while, you forget the whole concept of a square root. I'll just remind you: the square root of something is the number that, if you times it by itself, gives you that number back. So for 25, if you put 25 in your calculator and hit that square root button, you get 5, because 5 times 5 is 25. If you put in 24, you're going to get something with decimals, but whatever you get, if you times it by itself, you'll get 24. I just want to remind you of that, because sometimes people forget if they haven't used a calculator for a while.

All right, I told you about the fraction: the top is the numerator and the bottom is the denominator. So let's talk about this numerator: the sum of (x minus x bar) squared, that's how I would say it. This little piece of the formula is actually called the sum of squares, and from now on, when I say sum of squares, I literally mean the top half of this equation. So what you do when you use the defining formula is you just relax and say, the first thing I'm going to do is figure out the sum of squares, the top part; I'll write that down, and later I'll come back to the formula and enter it. So this next part is how we figure out that top part of the equation, how we get the sum of squares, and I'll show you.

Okay, let's look at the slide. On the left, there's this blank table, and that's usually what I do first: I make this blank table. You don't normally have to label them column one, column two, column three; I just put that there so I could talk about the columns.
Usually what I put is x in the first column, and then x minus x bar in the second; I wrote out "minus," but you can just use a dash. And in the third column I put, in parentheses, (x minus x bar) squared, like that. Remember, when you have parentheses, you have to do what's inside the parentheses first, so this means you literally have to do x minus x bar before you square it. I'm just walking you through this to get you ready for what we're going to do with this table. On the right of the slide, I'm reminding you that the sum of (x minus x bar) squared, in other words the sum of whatever ends up in column three, is another way of saying the sum of squares.

An easy way to explain what the squares are is to just show you how to calculate them. So I pulled out a little data set: imagine a sample of six patients presented to a central lab. This happens to me when I go to my doctor; sometimes she'll say, you know, it's time to do a lab panel for you, so she gives me a slip of paper, I go downstairs to the central lab, I hand it over, and they say, okay, sit down, and we'll call you up and draw your blood or whatever. So we're imagining six people did that, and when they got up to have their blood drawn, we asked them: how long did you wait? At my central lab I literally do wait about two minutes, which is a really good lab, but sometimes it's really busy, like if I go during lunch, and I'll wait something like ten minutes. So here are six patients: one waited two minutes, a couple waited three minutes, and the other three probably came in during lunch, because they waited eight minutes, ten minutes, and ten minutes. So that's our data. It's a tiny little data set, but I wanted something small to show you how to calculate the variance, and then the standard deviation.

Okay, so what's the first step after making the blank table? You fill in the first column, which is called x. And what is x? Each of these patients' waiting times is an x. Remember sum of x: if we say sum of x, we mean add all these x's together. So that's all I did; I put each x in the column, and you'll see 2, 3, 3, 8, 10, 10, identical to these x's. Then at the bottom I put that little fancy sum of x, and that's 36. That's the first thing you do: put them all in and get the sum of x.

Now, the next step: don't look at the left side of the slide yet, look at the right side. Before you go and fill in column two, you have to get x bar; in other words, you have to figure out the mean. You can kind of cheat, because you just figured out the sum of x, and if you remember the formula, the mean, or x bar, of the sample is the sum of x divided by n. Remember, we had six patients, so you take 36 divided by 6 and you get 6. Now you just hold onto that number. Between column one and column two, you have to calculate x bar and keep it off to the side, and while you're holding it, you realize this is how we're going to fill in column two: x minus x bar, where x bar is just 6.
So we have to go through each x and subtract x bar from it. It's helpful to order the x's before you do this; notice I put them in order, 2, 3, 3, 8, 10, 10. It's a good idea, because it helps your brain check whether you're doing the right thing. So let's start with the 2. We do 2 minus 6, the x bar. Now you can look at column two: 2 minus 6 equals negative 4. I hate negative numbers, but you just have to deal with them sometimes. Then you go to the next x, and it's 3 minus 6, which is negative 3. We're still underwater with the negatives, but you'll notice the next x is also 3, so you can just copy what you just did and get negative 3 again. So you're putting negative 4 in the first row, negative 3 in the second, and negative 3 in the third. Now the fourth x is 8, so 8 minus 6 is 2; we got above water. And then we have 10 minus 6, which is 4, and another 10 minus 6, which is also 4. When you order them like that, that's what always happens: you end up with a bunch of negatives at the beginning and a bunch of positives later, and that's totally normal, don't worry about it. But you've got to be careful and make sure you compute the right mean. I've had people on tests actually mess up this mean, and you can imagine the train wreck that happens after that: you do not get anything right after that. So make sure your mean is right, then subtract it from every single x and put the right answer in column two. That's the next step.

Okay, so we're done with that step. What do we do next? Now we just take whatever we got in column two and square it. The first one was negative 4. Remember, squaring is just the number times itself, so if you don't like using the x-squared button on your calculator, you can just do negative 4 times negative 4, same thing. Negative 4 times negative 4 is 16. The rest are pretty easy: negative 3 times negative 3 is 9, and 2 times 2 is 4. But what I really want you to look at is the 10s. Notice they get a 16, too, just like the 2 did. And that's the trick here. Remember I said I hate negative numbers? Well, a lot of statisticians feel the same way, and they often fix it by squaring the number, because that erases the negative. Just remember, negative times negative is positive, and positive times positive is also positive; that's a little trick when it comes to multiplying. So we square each entry in column two, and they're called squares: we've got 16, 9, 9, 4, 16, 16. Each of these is a square. So what do you think we do? We add up that entire column, and we get the sum of squares. We add up the whole column and we get that super complicated looking thing at the bottom, which is the numerator for our variance equation. That wasn't really that hard, was it? So we sum it up, and as it turns out, we get 70. So 70 is our sum of squares. All right, now we're back at the sample variance formula, and I'm so excited, because look at the top of the formula: we have the answer, and it's 70.
But we still have to deal with the bottom of the formula. Remember, n was 6, we had six patients, and the bottom of the formula is n minus 1, so the bottom is going to be 5. Let's fill this in; I was kind of running out of room, so I filled it in upstairs. You see that 70 divided by 5 suddenly looks super easy, right? 70 divided by 5 is 14. That's the variance. Totally easy, right? I mean, it's tedious, you have to make that whole table and add things up, but this part is not really that hard. Now guess how we're going to get the standard deviation. You've probably guessed it: we just take the square root of 14. Remember that button on your calculator? Put in 14, hit it, and you get 3.74 and a bunch of other decimals, but I just chopped it off at 3.74. So that is your sample standard deviation.

Now, I promised I would talk about the population formulas for standard deviation and variance as well as the sample ones, and I told you they wouldn't really be conceptually much different. On the left side of the slide, I made things red so you can see the differences: the sample variance is s squared, but the population variance is this other Greek letter. Remember, I told you that sum sign was a capital sigma; Greek is like English in the sense that it has capital and lowercase letters. Well, that thing I always think looks like a jelly roll is actually a lowercase sigma, except I'm never going to say lowercase sigma after this; I'm going to say population variance and population standard deviation. You'll see at the bottom of the slide that the lowercase sigma alone is the population standard deviation, and the lowercase sigma squared is the population variance. So if you see that jelly roll thing, we're talking about the population version of the standard deviation or variance, not the sample. Also, you already know about mu versus x bar: we have x bar on the left, the sample mean, and mu on the right, the population mean. And you already know about n, the number in your sample. This is where there's actually a big difference: in the sample formula, you do n minus 1 on the bottom, and in the population formula, you just use capital N, the whole population. If you think about it, it makes sense, because populations are huge, so it wouldn't even matter much if you subtracted one, whereas samples are small, so you have to adjust by subtracting one. If people make a mistake and accidentally subtract one in the population formula, they don't get much of a different answer. That's why I'm concentrating on the sample ones; that's what we normally do. But I wanted to give a shout-out just so you know: if you ever see the formulas on the right side of the slide, you know they're population-level formulas.
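Before moving on, here is the whole waiting-time walk-through collected into one short sketch of my own, mirroring the table column by column; it is just an illustration, not code from the course.

```python
# Defining-formula steps for the waiting times 2, 3, 3, 8, 10, 10 (in minutes).
from math import sqrt

x = [2, 3, 3, 8, 10, 10]                    # column 1: the x's
n = len(x)                                  # 6 patients
x_bar = sum(x) / n                          # 36 / 6 = 6.0

deviations = [xi - x_bar for xi in x]       # column 2: x - x_bar -> -4, -3, -3, 2, 4, 4
squares = [d ** 2 for d in deviations]      # column 3: (x - x_bar)^2 -> 16, 9, 9, 4, 16, 16

sum_of_squares = sum(squares)               # 70.0, the numerator
sample_variance = sum_of_squares / (n - 1)  # 70 / 5 = 14.0
sample_sd = sqrt(sample_variance)           # about 3.74

print(sum_of_squares, sample_variance, round(sample_sd, 2))   # 70.0 14.0 3.74
```

As a sanity check, Python's statistics.variance and statistics.stdev return the same 14 and roughly 3.74, since they also use the n minus 1 sample formulas.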
Alright, now we're going to move on. We made it through range, variance, and standard deviation, so now we're going to talk about the coefficient of variation, and this is used a lot for comparisons, often for comparing between two different labs. I say that because my friends are pathologists, and the first time I actually used this in medicine, we were comparing lab values on the same assay from two different labs.

I also wanted to mention that this might be the first time you've heard the word coefficient, and that gets a little confusing for people who are new to statistics, because coefficient is actually just a generic term for certain kinds of numbers. You'll hear somebody say coefficient of variation, and you'll hear somebody say coefficient of something else, and coefficient of something else again. Most people haven't even heard the word; it just means a certain kind of number that comes out of statistics. So if somebody says, oh, the coefficient is not good, or it's high, or whatever, you need to ask them which coefficient they're talking about, because coefficient doesn't mean one specific thing. So this may be the first time you've heard the word coefficient, and I'm going to talk to you about a specific one called the coefficient of variation. As we go through this textbook there are other coefficients in it, so please remember this one is the coefficient of variation, or CV for short; other coefficients have different abbreviations.

I put the formulas on the right side of the slide, and nobody seems to have any trouble doing the formula, because once you calculate the standard deviation, the sample one or the population one, and once you calculate x bar, the mean for the sample, it's pretty easy to do the division. And then they like it when you express it as a percent. You'll notice that about statistics: certain things they prefer as proportions and certain things they prefer as percents; it's just like our culture, in a way, and the coefficient of variation is always expressed as a percent. So you have to times it by 100 and put a percent sign after it. But really, that's pretty easy to do. You take the standard deviation, and you'll see I did it for our patients: 3.74. It took us all that work to get there, remember, the square root of 14. And our x bar was 6; remember, we needed that earlier for column two. So I just dumpster-dove for those numbers, did the calculation out, and got 62%. Students generally don't have trouble getting that number; the problem is, what does the number even mean? What does it mean if you divide the standard deviation by x bar and times it by 100, and how do you interpret that percent?

The easiest way to talk about it is to compare it with something, because one thing you'll also notice in statistics is that if you make ratios of things, they don't have any units. If I take your systolic blood pressure and say it's, whatever, 130 mmHg, and I divide it by your diastolic blood pressure, or by some lab value, or your temperature, or your IQ, suddenly I get a ratio, and that doesn't have units; it doesn't have mmHg or anything like that. And if I do that to a bunch of people, none of those ratios have any units.
So they technically can be compared to each other. You'll see that's a strategy in statistics: they make ratios of things, and yes, the ratios are sort of lacking in that they don't have units, but the power is that you can compare them.

So I decided to just make up some other patients. I pretended we went back to the lab the next day and gathered some more data, and we came up with, I just made this up, an x bar of 8 and a standard deviation of 4. That's pretty close to what we had before, right? An x bar of 6 and an s of 3.74. But anyway, in this next sample of patients, the s of 4 divided by the x bar of 8, times 100, equals 50%, not 62% like the other one. So how do you interpret that? Well, the CV is a measure of the spread of the data relative to the average of the data. In the purple sample, the one with the CV of 50%, the standard deviation is only 50% of the mean, but in the red sample, our original waiting times, the standard deviation is 62% of the mean. So I would say the red one with the 62% has more standard deviation compared to its mean; it has more spread relative to its mean, so it's less stable, it moves around more. If these were actually two different labs, I would say I prefer the purple lab, because it's more predictable: it has less variation relative to its mean, at 50%, while the 62% means the other one is less predictable. It's a little hard to see in this example, but what happens is, if you have two different labs, maybe you split a bunch of blood samples and send half to one lab and half to the other, you're supposed to get the same mean and the same standard deviation; it's the same blood, you just split it. But sometimes you don't; sometimes you get something like this, in which case, if you're comparing labs, you would go with the purple lab and not the red lab, because it produces a more predictable result. So the CV is a little hard to interpret, but it's easy to calculate, and that's one awesome thing about it.
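Here is the comparison in a tiny sketch of my own, using the two sets of numbers from the lecture; the function name cv_percent is just for illustration.

```python
# Coefficient of variation: standard deviation relative to the mean, as a percent.
def cv_percent(s, x_bar):
    return s / x_bar * 100

print(round(cv_percent(3.74, 6)))   # 62 -> the "red" sample: more spread relative to its mean
print(round(cv_percent(4.0, 8)))    # 50 -> the "purple" sample: more stable relative to its mean
```

Because both results are unitless percents, they can be compared even though they came from different samples.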
Now we're going to move on to Chebyshev and his theorem. Chebyshev figured something out a long time ago, and this is how he started thinking about it. He first thought, well, let's say you have an x bar and an s, like we just did with the CV. He noticed something else about them, not the CV: he noticed that you can create a lower and an upper limit by subtracting the s from, and adding the s to, the x bar. Remember back when we were making frequency tables and I said we need class limits, a lower class limit and an upper class limit? We use that terminology a lot, lower limits and upper limits. Well, Chebyshev was like, wait a second, I've got an idea. Say I take a mean; this will force the mean to be in the middle. I can subtract one standard deviation from it and get some sort of lower limit, and add one standard deviation to it and get some sort of upper limit. And, you know, if the standard deviation were one, you'd subtract one and add one.

So this would be totally symmetric: the x bar would be in the middle, surrounded equally by these two standard deviations. And I'm saying standard deviation generically, because you could do this with a mu and the population standard deviation, too; you can do the population version. So he figured out that's a thing you can do: add and subtract a standard deviation from the mean and get these limits. For example, let's say I have a mu. I'm going to pretend I have a population with a mu of 100, I don't know what I measured, but I got 100, and a population standard deviation of 5. So Chebyshev was thinking, you know what I could do? I could take that 100 and subtract the 5 and get 95, and take the 100 and add 5 and get 105. And he just started working with this concept: I can subtract and add a standard deviation. Then he thought, wait a second, I could even do this with two standard deviations. If the standard deviation is 5, I can take that times 2, which is 10, and do 100 minus 10 to get 90 for the lower limit, and 100 plus 10 to get 110 for the upper limit. And so I can make this range, or this interval; from the lower limit to the upper limit, we call it an interval. He conceptually realized that if he used some rules along with this, there might be some useful interpretation of these limits, some way to use them to mean something.

So we're going to look at how he figured out how to use one standard deviation on either side of the mean, or two, or three, or four multiples of the standard deviation on either side of the mean, to come up with lower and upper limits that actually mean something. What he realized these lower and upper limits would mean is that at least some percent of the data would be between them; in other words, some percent of the x's would fall between the lower and the upper limit. But that percent depends on how many standard deviations you go out: is it one, two, three? The more you go out, obviously, the more of your data are covered by the limits, because the interval gets so big it almost covers the whole thing. So you would expect the percentage to go up as the number of standard deviations goes up.

So he worked this out and came up with a formula, and he wanted it to work for all distributions: normal, but also skewed, and also uniform and bimodal. This is the formula he came up with. In this formula, see at the bottom, k stands for the number of standard deviations, or the number of population standard deviations, that he's going to use. So let's pretend he made k equal to 2, two standard deviations. Then you'd see this: it says 1 minus 1 divided by k squared. Two squared is 4, so 1 divided by 4 is 0.25, and 1 minus 0.25 is 0.75. Make that a percent, and it's 75%. So he's like, okay, that's what I'm going to say.
If you go out two standard deviations up and down and make those upper and lower limits, at least 75% of the data, of the x's, are going to be in there. At least; there might be more, but it will be at least that. So he did this: he used 2, and he used 3, and he used 4; two standard deviations either way, three either way, or four either way. Now, students in my class often think they have to memorize this 1 minus 1 over k squared. You don't have to memorize it; that's just the story of how Chebyshev did the proof. You can memorize it for fun, but nobody does; Chebyshev did the work, and I'm just showing you the proof. You can plug in 2, 3, and 4 yourself and you'll get the same answers Chebyshev did, which is kind of a waste of time, but you can do it just for fun.

So he did the 2 one; I showed you that at the top and even talked you through it: plug 2 into the equation and you get 75%. So in that example I was just talking about, imagine the mean was 100 and the standard deviation was 5; two times that is 10, so my lower limit would be 90 and my upper limit would be 110, and I would be able to confidently say at least 75% of my x's are between 90 and 110. If I'd measured maybe 100 people, I'd say at least 75 of them are going to be between those limits. In fact, it could be 80, could be more, but at least 75. And remember, I told you to predict that as we make this number bigger and go out more standard deviations, we cover more of the data. So when he did 3, it didn't come out as even: it came out to 88.9% of the data, so at least 88.9%, almost 89%, will be covered if you go out three. And if you go out four standard deviations, it's at least 93.8%. And just to remind you, when you have upper and lower limits, you have an interval; that's just what we call it. But this particular interval, if you get it this way, is Chebyshev's interval, because everybody was so happy he did all this work; I sure wouldn't have figured it out. So I just wanted to demonstrate an example of Chebyshev's interval, so you know how to interpret one and why anybody makes them.
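Here is the theorem and the mu of 100, sigma of 5 example in one short sketch of my own; the function names are made up for illustration.

```python
# Chebyshev's theorem: for ANY distribution, at least 1 - 1/k^2 of the values
# lie within k standard deviations of the mean (for k greater than 1).
def chebyshev_minimum_fraction(k):
    return 1 - 1 / k ** 2

def chebyshev_interval(mean, sd, k):
    return mean - k * sd, mean + k * sd

for k in (2, 3, 4):
    low, high = chebyshev_interval(100, 5, k)     # the mu = 100, sigma = 5 example
    pct = chebyshev_minimum_fraction(k) * 100
    print(f"k={k}: at least {pct:.1f}% of values fall between {low} and {high}")

# k=2: at least 75.0% of values fall between 90 and 110
# k=3: at least 88.9% of values fall between 85 and 115
# k=4: at least 93.8% of values fall between 80 and 120
```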
Okay, so remember our patient sample, the people in the waiting room at the lab? They waited six minutes on average, and the standard deviation of their waits was 3.74. When I gave you the demonstration of how to calculate the standard deviation, I used this patient sample, and I only put a few patients in it on purpose, because otherwise the table we made with the defining formula would be huge and I'd never finish this video. So what I'm going to ask you to do is pretend that instead we had 100 patients in there, and we got that same x bar of 6 and standard deviation of 3.74. I put the Chebyshev rules in this table. If we go out two standard deviations either side of the mean, of the x bar, then whatever interval we get, we know, because we're saying we studied 100 patients, that at least 75 of those patients will be between the lower and upper limits, if we follow Chebyshev's theorem. If I go out three standard deviations, at least 88.9 patients will be in there. I know that doesn't make any sense, 0.9 of a patient, but what it's saying is, call it 89: at least 89% of the patients, in other words at least 89 patients, will be in that interval. And of course, if I went out four, I wouldn't say 93.8 of a patient, but at least 94 patients would fit in that interval. And if you think about it, if we only started with 100 patients, that's almost all of them, so the four one isn't so useful.

You'll see me on the left side of the slide calculating the intervals. Let's start with the first one, two standard deviations on either side of the mean. The Chebyshev interval we get is negative 1.48 to 13.48. You probably noticed you can't wait a negative amount of time, so already this is kind of weird, but what this is saying is that of our 100 patients, at least 75 of them, because this is the 75% Chebyshev interval, waited between negative 1.48 minutes, which might as well be rounded up to zero, and 13.48 minutes. Now, 13.48 minutes is kind of long, so I guess we'd be happy that at least 75% of them fell in that range, because it means they were probably not waiting that long. But if you go out to the 88.9% one, you widen the interval: then you say at least, well, round it to 89, 89% of the patients waited between negative 5.22 minutes, which you might as well make zero, and 17.22 minutes. So as you see, when we widen the interval, we pick up some longer waiters, and we can say at least 89% were between those limits, but the interval is bigger. Then we go out one more and get the 93.8%, let's just round it to 94: at least 94% of the patients, or with 100 patients, at least 94 of them, waited between negative 8.96 minutes, which again is nonsensical, and 20.96 minutes. But at that point we're just saying almost all the patients waited somewhere between zero and about 21 minutes; we really don't have a precise idea of how long any of them waited. So this is just to show you what happens when you widen the interval: you cover more of the individuals, but you get a vaguer idea of where they actually fall. Again, I put this at the bottom: if we had 100 patients, this is how you would interpret it. At least 75 would have waited between the lower and upper limits for the 75% Chebyshev interval, then at least 88.9 patients, I know, nonsensical, for the next one, and then 93.8 for the last one. You see that interpretation at the lower part of the slide.
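Here are those three intervals and the at-least counts in one sketch of my own, reproducing the numbers on the slide.

```python
# Chebyshev intervals for x-bar = 6 minutes, s = 3.74 minutes, n = 100 patients.
# Negative lower limits are nonsensical for waiting times and get read as zero.
import math

x_bar, s, n = 6, 3.74, 100

for k in (2, 3, 4):
    low, high = x_bar - k * s, x_bar + k * s
    min_patients = math.ceil((1 - 1 / k ** 2) * n)
    print(f"k={k}: at least {min_patients} of {n} patients waited "
          f"between {low:.2f} and {high:.2f} minutes")

# k=2: at least 75 of 100 patients waited between -1.48 and 13.48 minutes
# k=3: at least 89 of 100 patients waited between -5.22 and 17.22 minutes
# k=4: at least 94 of 100 patients waited between -8.96 and 20.96 minutes
```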
This is a really difficult concept for a lot of students, so I'll just give you the take-home messages. First of all, Chebyshev's interval works for any distribution: normal, skewed, whatever. The reason that's worth pointing out is that later we're going to learn about intervals that only work with normal distributions; this one is loosey-goosey and works with all distributions. That's one take-home message. Also, Chebyshev's interval tells you that at least a certain percent of the data are in the interval; later, we're going to learn about intervals where exactly a certain amount of the data are in the interval. So Chebyshev, again, a little loosey-goosey: he says at least. Next, Chebyshev intervals are sometimes nonsensical, as we just talked about; negative time doesn't work, and sometimes you'll get very high limits, especially with a 4. So ultimately, they're not very useful, and they're not used in healthcare. I literally had never heard of Chebyshev's interval until I started teaching this class. So what is the purpose of teaching you Chebyshev's interval? The purpose is to point out that in statistics, we often take the s, or the population standard deviation, and add it to and subtract it from the mean, and that's a good way of making lower and upper limits that have special significance. That's really the main take-home message: you'll see this pattern as we go through this class, where we take a mean, either a sample x bar or a population mean, and a standard deviation, either sample or population, and we add and subtract one standard deviation, or two, or some multiple, and those intervals have a certain significance. In this lecture I only taught you about Chebyshev's; you'll learn about other intervals later that are made similarly.

So in conclusion, what did we learn? We learned how to calculate the range. We learned how to calculate the variance and standard deviation. We learned how to calculate the coefficient of variation and how to interpret it, and we talked about the differences in the formulas for a sample versus a population. And we learned about Chebyshev and his theorem, how he figured it out, how we calculate these intervals, and how you interpret them. Now, I just thought I'd show you this picture of Chebyshev. He's a Russian guy; well, the stamp was from the USSR, before the Iron Curtain fell. But I thought I'd show it to you so you know who figured all this out. Good job, you've made it through the measures of variation, and now you're ready to do, what, the quiz, the homework, whatever, right? You're totally knowledgeable. Good job.

Well, I'm back, and so are you. Welcome to Chapter 3.3, percentiles and box and whisker plots. It's Monica Wahi, Labouré College lecturer. Here is what we're going to talk about and what you're going to learn. At the end of this lecture, the student should be able to explain what a percentile means, describe what the interquartile range is and how to calculate it, explain the steps to making a box and whisker plot, and also state how a box and whisker plot helps a person evaluate the distribution of the data.

So let's get started. You know, whenever we talk about a box and whisker plot, I think of some cute little animal with all those whiskers. I'll explain what the whiskers really are, not on the animal, but on the box and whisker plot, later. So what are we going to go over? We're going to go over percentiles and explain what those are. Then we're going to talk about quartiles, which sounds a little similar, it's got the "tiles" in it, and you'll understand why they're related. Then we're going to compute quartiles, and finally, we're going to do the box and whisker plot. All right, let's go. So, percentiles. We're going to have a flashback, okay? You're not going to like this little part, because it's going to remind you of standardized tests.
Maybe not all of you have been subjected to this, but most of us have: if you went to high school in the US, you probably got to deal with these standardized tests. Just remember, we're only talking about quantitative data here. If you take a standardized test, or a non-standardized test, you usually get points, and points are numerical, so that's quantitative data. I remember I used to take the standardized tests and show my friends what I got, because they'd send you that thing in the mail. Now, I learned pretty early on that it mattered who was in the pool of people taking the test with you. If you're taking the test with a lot of people who score low, it's easier to get a higher percentile, because what percentile means is, for example, if you test at the 77th percentile, you did better than 77% of the people taking the test. A lot of those standardized tests didn't care how many points you got; what they cared about was what percentile you were at. Different batches of people would have different scores, and if you got lucky with your pool, your percentile would be higher. So it didn't really matter what your absolute score was; it just mattered what your percentile was. Just to remind you, if somebody had come up to me in high school and said, I got the 77th percentile, I'd say, okay, if only 100 people had taken the test, you'd have done better than 77 of them. Of course, we were all braggy; I was always in like the 95th, or the 97th, or the 98th, and it happened so often I wondered if it was really true. What I realized is that there were so many people in the pool, because I was in a public high school in Minnesota, and they were pooling together all the public high schools in Minnesota: ninth graders pooled with ninth graders, tenth grade with tenth grade, or whatever. And when you're taking nursing examinations, sometimes they'll do that too; they'll put you on a percentile. So I tell people, you know, strategize, try to take it when only low scorers are taking it, which of course makes no sense: how can you tell who's taking it? You don't even know. But really, that's what a percentile is: the percentage of people you did better than. If you're at the 77th percentile, you did better than 77%.

Okay, so here are some rules about percentiles. First of all, I gave the example of the 77th percentile; well, the rule is a percentile has to be between 1 and 99. You can't have the negative 2nd percentile or the 105th percentile. Second, whatever number you pick, that percent of the values fall below that number, and 100 minus that percent of the values fall above it. Here, we'll do an example. Twenty people take a test, just 20, and let's say there's a maximum score of five. The 25th percentile means that 25% of the scores fall below whatever score that is, and 75% fall above it. Let's say it's an easy test, and out of my 20 people, 12 get a four, which is almost the top score, and the remaining 8 get a five, so everybody gets either a four or a five. Well then, the 25th percentile, the score that cuts off the bottom five tests, will be a four, just because this was an easy test. The first 12 people got a four and the remaining 8 got a five, so even the 50th percentile would technically be at a four.
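Here is that 20-score example in a small sketch of my own; note that textbooks define the exact percentile cut-off rule in slightly different ways, so this simple version is just for illustration.

```python
# 12 people score a 4 and 8 people score a 5 on an easy five-point test.
scores = sorted([4] * 12 + [5] * 8)

def score_at_percentile(sorted_scores, p):
    # The value sitting just above the bottom p percent of the scores
    # (one simple cut-off rule; not necessarily the textbook's exact method).
    cutoff = int(len(sorted_scores) * p / 100)
    return sorted_scores[cutoff]

print(score_at_percentile(scores, 25))   # 4 -- the bottom five tests are cut off at a 4
print(score_at_percentile(scores, 50))   # 4 -- even the 50th percentile is a 4
print(score_at_percentile(scores, 75))   # 5
```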
Now, this would all come out differently if it were a hard test and most people got a score below three; then the percentiles would be shifted down. I just tell you that so you keep in mind the difference between the actual score and the percentile. The percentile just means that that percent of people got a score lower than yours; it doesn't actually say what your score was. That's what you want to remember as we go on with percentiles.

Okay, now we're going to talk about quartiles, and also the interquartile range. Remember the "-tile" thing? This relates to percentiles. I put a little quarter up there because quartiles are a specific set of percentiles, and you'll see why I put the quarter: there are technically four quartiles, it's just that the top quartile doesn't count, because it's like the 100% one, and remember, percentiles only go up to 99, like I was just showing you. So we calculate the first, second, and third quartiles. The 25th percentile is the first quartile; the 50th percentile, which is also known as the median, which you're already good at, is the second quartile; and the third quartile is the 75th percentile. So those are your quartiles: the 25th, 50th, and 75th percentiles (and technically a 100th, but we never say that, because percentiles only go up to 99). And these are actually not that hard to calculate by hand.

Here's an overview of how you do it. Step one, you order the data from smallest to largest; remember, we have quantitative data, so you can sort them. And this is feeling pretty familiar, right? Well, guess what, that's because step two is to find the median, because the median is also the second quartile, which is also the 50th percentile. So you already know how to do steps one and two. Now, here's the harder, new part. Step three is where you find the median of the lower half of the data: wherever you put your median, you pretend that's the end, you look at the smaller values, and you find the median of those, and that is the first quartile, or the 25th percentile. Then finally, step four, which you probably guessed: you find where your median was, look at the upper half of the data between the median and the maximum, take the median of that part, and that's your third quartile, or 75th percentile. I'll show you an example of doing that, but this is the overview of the steps.

Now, remember the range from before? Yeah, you remember it: the maximum minus the minimum, and I told you that you have to actually do out the subtraction and tell me what number you get. Well, we have something new and improved in this lecture: the interquartile range. You already know about quartiles, we were just talking about them, and interquartile sort of means within the quartiles. So once you have the third quartile and the first quartile, you can calculate the interquartile range, or IQR for short.
So if you see IQR, just remember that's the interquartile range, and it's the third quartile minus the first quartile. And again, I'll show you an example; this is just the overview.

Okay, here's the example I promised. On the right side of the slide, you'll see a sample of data I collected. I went to AHD.com, that's the American Hospital Directory, which provides publicly available information about American hospitals. I went in and took a random sample of 11 Massachusetts hospitals; there are a lot more, so I took a random sample. And what I did was write down how many beds each of those hospitals had, because if a hospital has several hundred beds, it's considered kind of a big hospital, and if it has less than 100 beds, it's considered a smaller hospital. So I wrote all those numbers down, and I already did step one of making our quartiles, which is to order the data from smallest to largest. You'll see on the right side of the slide that my smallest hospital had only 41 beds and my largest had 364 beds, and I put all of them in order. So we already did step one.

Let's go on to step two, which is to find the median; that's quartile two, or the 50th percentile. You're already good at that. We have 11 hospitals, so we know the sixth one in the row is going to be the median, because it's an odd number of hospitals. So the sixth one, we circle it; that's the 50th percentile, or the median, so we already got quartile two. It's funny that you have to start with quartile two, but that's what you do. Now, I re-color-coded these so you can keep track of what's going on in the next slides: 126 is the median, and it's kind of not on anybody's side, it's not in the lower group and not in the upper group. The orange ones are below the median and the blue ones are above the median.

Okay, now we're going to do the 25th percentile, which is step three. The goal is to find the median of the lower half of the data. Now you see why I color coded it: we pretend just the orange ones exist, and we find the median of those, and we're not counting the 126, because that's already been used. So we find that 90 is the 25th percentile. How do you remember that it's not the 75th, not the third quartile? Because it's the low one: 25 is a low number and 75 is a higher number, so you go to the lower part of the data, find the median of that, and that's your 25th percentile. In our case, that's 90. Then, you probably guessed it, you go to the blue ones, the upper half, and take the median of that, and ours is 254, so that's our 75th percentile. So what we just did is calculate our quartiles: we have our 50th percentile, our 25th percentile, and our 75th percentile. That's what I meant by that overview slide; this is an example of how you do it. And of course, I have to give a shout-out to the IQR, the interquartile range, which you just learned: the 75th percentile minus the 25th percentile. In our case, that's 254 minus 90, which equals 164. So that is your IQR.
So if I gave you a test and asked you what the IQR is for these data, you can't just write 254 minus 90; you actually have to work it out and put 164. So there you go, that's our quartile example.

Now I want to step back and give you some philosophical points about what happens with Q1 and Q3 depending on how many data points you have. Remember, the first step is always to put them in order from smallest to largest. So let's pretend I had only drawn the first six values of my hospitals. See how, on the slide, I put the position of each number, 1, 2, 3, 4, 5, 6, above the example numbers? If I were going to find the median of that, I'd have to take 90 plus 97 and divide by two. But then the next question is, what do we do for Q1 and Q3? Well, in the six-value example, the 90 and 97 get mushed together for the median, but they do get reused when looking at the bottom and top halves of the data. So when we go to do Q1, we actually count that 90 in there: Q1 would be 74, because that's the median of the three numbers at or below that median line. And Q3 would be 121, because we count the 97 in there. In other words, when you have six values and the median is made by mushing together two values, taking their average, those two values get to double dip: the lower one gets to be in the bottom half and the upper one gets to be in the top half when calculating Q1 and Q3.

Now, what if we had seven values instead of six? I expanded and pretended we had seven hospitals, so there are seven positions. This one is a little like the one we did together with the 11 values, where the median was clearly one number; here it's 97. That 97 does not get reused in the bottom or the top. You'll notice Q1 is the middle number of the bottom three, and Q3 is the middle number of the top three. That's what happens when you have seven values, and it's also what happens when you have 11 values, like I demonstrated with the hospitals.

But it's not super predictable, because what if you had eight values? Suddenly it gets a little complicated. See, the first four are between 41 and 97, and the top four are between 121 and 155. To make our median, we have to take the mean of 97 and 121. But remember, they don't get used up: the 97 gets to double dip and be part of the calculation for Q1, and the 121 gets to double dip and be part of the calculation for Q3. Even with this double dipping, though, if you go down, you'll see there are four numbers to contend with for Q1, so to get Q1 you actually have to mush together, take the average of, 74 and 90. And if you go up to the upper part of the data, to get Q3 you have to take the average of 126 and 142, the ones in position six and position seven. So if you're unlucky enough to get something like eight values, you realize you're going to have to make your median by averaging two numbers, your Q1 by averaging two numbers, and your Q3 the same way. So it's not super predictable what's going to happen; you just have to pay a lot of attention.
Just remember: if your median is made out of averaging two numbers, those numbers get to double dip in the downstairs and the upstairs when calculating Q1 and Q3. If instead your median is just one number, because you have an odd number of values, then that one has to stay where it is and does not double dip in the Q1 and Q3 calculations. So we can just see another example of this. This is nine values, right? Now remember, when I had 11 values, it was like having seven values: there was one clear median, like we have here, and the medians of the top of the data and the bottom of the data were easy to figure out, because each half had an odd number of values. Well, you see here, in this case our median is the fifth value, and that's 121. So 121 does not double dip anywhere. When we go to calculate Q1, we only have four values, because we're not counting the 121, and then we're stuck with taking an average of the second and third values to get Q1. And then the same thing upstairs, between 142 and 155: those are the two middle numbers of our four numbers at the top, and we have to take an average of those to get Q3. So I guess this is just my long way of saying you've got to be really careful what you're doing. First, make sure you've gotten the median; then figure out if that median is the kind where you're just circling one value, or the kind that came out of an average. Because if it's a median that came out of an average, just know that those two numbers are going to double dip in Q1 and Q3. And if it's a median you got because you had an odd number of data points, the one just sitting in the middle, that one doesn't get to double dip. Okay, enough double dipping, I'm getting hungry. When I go to that roller coaster, I'm going to get a double dip ice cream cone. Okay, we're going to move on to the box and whisker plot, which is kind of like your percentiles getting graphed, right? So let's go back to our ingredients; we already created our box plot ingredients. In fact, that's why I trickily went through those quartiles first, because now we've got what we need to make a box plot. So I just summarized what we have on the left side of the slide (say that 50 times). Hospital beds was what we were counting. The smallest regional hospital had only 41 beds, Q1 was 90, the median Q2 was 126 — you know what I mean by these Qs, right, I mean quartiles — then Q3 was 254, and the maximum was 364. Okay, so let's make a boxplot. And you remember what the data look like on the right side of the slide. Now I'm going to walk you through how you would make this box plot. So first, you draw this thing. Well, how do you know what to draw? I usually just draw a vertical line and put a zero at the bottom, and then I cheat: I go look at the maximum and see where that is. Our maximum was 364, so I made 400 the top. If our maximum had been something like — I think Massachusetts General Hospital has something like 600 or 800 beds — if we had gotten that one in there and that was our maximum, I would maybe go up to 900. Whatever is a little bit above the maximum, that's what I put at the top. So this was 364, so I put 400. Then what I did was I divided it in half, so I see where the 200 is, and I just kind of threw that in there.
And then I divided between the 200 and the 400 in half and put the 300. So you can just eyeball this and draw it out that way if you want. Okay, so I got this thing set up, and here we go with the first step. The first thing we're going to draw in is Q1, or quartile one. So on the left side of the slide, you'll see a circle around the 90. On the right side of the slide, I made this horizontal line. Now, why do you make the line that width? Well, look at how it's proportioned to that vertical axis I made with the numbers: you probably don't want it too wide, but you don't want it too skinny either; this is just about right, like Goldilocks. So you make this horizontal line at Q1; that's the first step. Now you make a copy of that same line, parallel, and you put it at Q3. And if you look at that axis — the 100, 200, 300, 400 — Q1 is 90, so it's about 10 under 100; that's how I knew where to position that lower one. And then 254 is a little bit higher than halfway between 200 and 300, so that's roughly how I knew where to position this one. It's not perfect. If you do it in statistical software, it puts it out perfectly, but for demonstration purposes, that's what I'm doing. Okay, so now what we've done is we've put in Q1 and Q3 as horizontal, parallel lines. All right, here's the next step: we connect them. Hence, the box. The box gets made just by connecting them. All right, now I put a little circle on the right side of the slide because I wanted to make sure you saw what's going on there. That's when we put in Q2, or the median. The median is 126; see where the 100 is, it's up a little bit from that, and we make that line parallel too. But notice how I made Q1 and Q3, connected the box, and then did the median. I think this is the easiest order to do it in when you're drawing it by hand and you're not the statistical software, because then the box is all nice and your median fits and everything looks nice. But we're not done yet; we've got the whiskers. So you're probably wondering this whole time, what is this whisker thing? Well, you just figured out what the box is; the whiskers are the markers for the minimum and the maximum. So you'll see the minimum is at 41, and we have a whisker at 41. So why is it called a whisker? Well, it's smaller. I don't know why it's called a whisker, but it's different from the other lines because it's smaller; I guess that's a reason, maybe. Notice how it's almost half the size. Sometimes they're really, really small, but it's tiny. And you want to position it vertically in the middle, not off to the side or anything, and you also want these parallel. You'll notice the maximum is up there way high at 364; I just did both of these on the same slide. So you draw in the whiskers, and then you can probably guess the last step: yeah, connect the whiskers to the box. So good job, there you went and did it, you made a box plot. And now let's look at the interquartile range. Remember how you calculated this: you took Q3 minus Q1. Well, that means this boxy thing is 164 beds long, right? So that's your IQR; the box is a visual depiction of your IQR. So very good, we did our boxplot and we did our interquartile range.
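And if you ever wanted to draw this same picture with software instead of by hand, here's a rough sketch using Python's matplotlib (assuming it's installed). The five numbers are the ones from the hospital example, but the drawing choices, like the widths of the box and whiskers, are just my own eyeballing of the slide, not the course software.

```python
# A rough sketch of drawing this box plot with matplotlib (assumed installed),
# following the by-hand order from the lecture: Q1 and Q3 lines, then the box,
# then the median, then the whiskers. The five numbers come from the hospital
# example; the widths and positions are my own approximation.
import matplotlib.pyplot as plt

minimum, q1, median, q3, maximum = 41, 90, 126, 254, 364

fig, ax = plt.subplots()
ax.set_ylim(0, 400)          # a little above the maximum of 364, like the slide
ax.set_ylabel("Hospital beds")
ax.set_xlim(0, 2)
ax.set_xticks([])

ax.hlines([q1, q3], xmin=0.7, xmax=1.3)              # Q1 and Q3, parallel horizontal lines
ax.vlines([0.7, 1.3], ymin=q1, ymax=q3)              # connect them -- hence, the box
ax.hlines(median, xmin=0.7, xmax=1.3)                # the median goes inside the box
ax.hlines([minimum, maximum], xmin=0.85, xmax=1.15)  # the shorter whiskers
ax.vlines(1.0, ymin=minimum, ymax=q1)                # connect the bottom whisker to the box
ax.vlines(1.0, ymin=q3, ymax=maximum)                # connect the top whisker to the box

plt.show()
```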
And you're probably wondering, why do we even do this? I'll explain. Well, one of the main things we do with it is look at the distribution of the data. I know, I know, you already learned how to do a histogram, and you're good at a stem-and-leaf; those are other ways of looking at the distribution. And if you make a histogram of these data — I mean, these are only 11 values, but if you get a pile of data and you make a histogram and a stem-and-leaf — you'll find that those images agree with the boxplot. And you're probably thinking, well, how do they agree? If you look on the right side of the slide, I'm giving you an example: skewed right. If you had skewed-right data, and you knew it because you made a histogram and saw a skewed-right distribution, then if you took the same data and made a boxplot, it would look kind of like that skewed-right one, where the top whisker is really high and the part connecting that whisker to the box is really long, whereas the one on the bottom is short. As you can see, skewed left is the opposite: the bottom one is long and the top one is short. If you have a normal distribution — remember, that's the symmetrical, mound-shaped distribution — and you have a larger spread, in other words a bigger standard deviation, a bigger variance, then you're going to see a box that's really big like that. But if you have a smaller spread and it's a normal distribution, you're going to see a box that looks smaller, like this. And you're probably wondering, where am I getting these shapes? Well, I'll show you on the last slide here as we wrap up the conclusion. It's because if you fly over a roller coaster — see this roller coaster, this roller coaster is skewed right. That would make sense, right? Because you want to go up steeply and then go down really fast. And see how the boxplot for the roller coaster looks: you've got the part where you start going up really fast, that's kind of near the median and near the 25th percentile, and the part where you're just getting on and it's slowly going, that's like the bottom whisker. And then you go up and you come down, and it's a long tail, which is good, I guess, if you design roller coasters, and that long tail is the right skew. So if in your mind you're going, how is she getting from this histogram to this boxplot, this is how I'm doing it: if you flew over the histogram, or the roller coaster, you might see the shape of a box plot. So in conclusion, we talked about percentiles in general, like the 77th percentile and what that all means. Then we focused in on quartiles, which are a specific set of percentiles, and we calculated the quartiles. The reason we did that is because we needed them in order to make the interquartile range, and finally, we need those quartiles in order to make and interpret a box and whisker plot. Okay, this isn't the roller coaster I'm going to, but I'm going to one, and I guarantee you it is skewed right. Greetings and salutations. Hi, this is Monica Wahi, your Labouré College lecturer, bringing to you chapter 4.1, scatter diagrams and linear correlation.
So here's what you're going to learn. At the end of this lecture, you should be able to explain what a scattergram is and how to make one, state what strength and direction mean with respect to correlations, and compute the correlation coefficient r using the computational formula. And finally, you should be able to describe why correlation is not necessarily causation. So let's jump right into it. First, we're going to talk about making a scatter diagram. The thing on the right side of the screen is not a scatter diagram, but it's kind of scattered, so I put it there; it's kind of pretty. Next, we're going to talk about the correlation coefficient, r, and how to make it. And finally, we're going to do a shout out to causation and lurking variables, which we talked about before, but we're going to talk about them again in relationship to r. So let's start with the scattergram. I also call it a scatterplot, because, like everything in statistics, there have to be about eight names for everything; scattergram and scatterplot mean the same thing. So let's get going with the setup here. Scattergrams, or scatterplots, are graphs of x-y pairs. So what's an x-y pair? X-y pairs are two measurements made of the same individual, or the same unit. If you measure my height and my weight, that's an x-y pair. If you measure my height and my friend's weight, that's not an x-y pair, because that's two different people, right? Now, in these x-y pairs, the x part is called the explanatory or independent variable, and it's always graphed on the x axis. Remember, in algebra you would do these graphs where you have this vertical line, and that was the y axis, and you have this horizontal line, which was the x axis. I always had trouble remembering which is which, but that's how it is. So whichever of the pair is x, expect that to be graphed along the x axis, and it's also called the explanatory or independent variable — remember, there have to be a million names for everything. So if I talked to you and said, here's an x-y pair, and this one is the independent variable, or this one is the explanatory variable, you need to just secretly know I'm talking about the x of the two. And then, surprise, here's the y of the two: the y is also called the response variable, it's also called the dependent variable, and it is graphed on the y axis. Again, like I said, I used to have trouble remembering whether the vertical one is the y axis or the horizontal one. But what I did was I remembered: if you take a capital Y and you grab onto its tail and pull it straight down, you'll see that it's vertical. And that's how I remember that's the y axis. It doesn't hurt the Y; it's used to it. So if you can stretch the Y's tail down and you get vertical, remember, that's the y axis, and then the other one is the x axis. Okay. And then also, you have to find a way to remember which one means what. Like, does x mean explanatory and independent, or does it mean response and dependent? So how I do it is, you know how we sing the ABCs? Well, if you fast forward to the end, it's w, x, y, z, right? So the x comes before the y in the alphabet. So I write x and then an arrow to y, and then I imagine in my head that it's saying x causes y. Even though x doesn't necessarily cause y, as you'll see at the end of this lecture, I think about it that way.
Because if that happens, then y is dependent on x, and x is independent; it can do whatever it wants, but y is dependent. So that's my way of remembering that x is the independent variable and y is the dependent variable. Anyway, that's a long way of saying the scattergram is a graph of these x-y pairs, and that's what we're going to do: make that graph. So we needed some x-y pairs, right? So I asked the question: does the number of diagnoses a patient has correlate with the number of medications she or he takes? If you don't have that many diagnoses, you probably aren't on that many meds, right? But if you have a lot of diagnoses, you should be on a lot of meds. But we all know people in real life can sort of violate that, just depending: you could have one really bad diagnosis with a lot of meds, or you could have a bunch of diagnoses that are all taken care of with one med. So it's not perfect, but it's a reasonable thing to think. So what I did was I put up here just four x-y pairs, as you can see; I've got four pretend patients. Here's the first patient: that person has an x of one, because they have only one diagnosis, but like I was saying, it must be a bad diagnosis, because that person has a y of three, or is on three meds for it. That's how you read this table. So let's start making our scattergram out of these data. Okay, here we go. I labeled the x axis "number of diagnoses," just to keep things straight, and the y axis "number of medications," and then you'll see where I put the dot. Because x is one, I went over to one on number of diagnoses, the one diagnosis, and then, because y was three, I went up three, and there goes the dot; that's where that first person gets a dot. And that's what you're going to do with the other ones, too: four dots total. Okay, I just threw all the dots down so you can see what's going on. Here's the second person: that person had an x of three, so I went over three. I put those green arrows in so you can see what's going on; they're really not part of the scatterplot, it's more like cheating, to show you, because we're just practicing. So that person had an x of three and then a y of five, and you see where that dot goes. And then here you can see where the third dot goes, because there's a four and a four. And then here we have the fourth dot. So this is the scattergram of these four patients. Of course, a lot of times you have hundreds of patients in there, but I just showed you a simple example.
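If you'd rather have software throw the dots down for you, here's a minimal matplotlib sketch of the same kind of scattergram (again, assuming matplotlib is installed). The first three x-y pairs are the ones read off the table; the fourth patient's pair isn't fully spelled out here, so the (2, 2) point is only a hypothetical stand-in.

```python
# A minimal scattergram sketch with matplotlib (assumed installed). The first
# three x-y pairs are read off the table in the lecture; the fourth patient's
# pair isn't fully stated, so (2, 2) is only a hypothetical stand-in.
import matplotlib.pyplot as plt

diagnoses   = [1, 3, 4, 2]   # x: explanatory / independent, goes on the x axis
medications = [3, 5, 4, 2]   # y: response / dependent, goes on the y axis

plt.scatter(diagnoses, medications)
plt.xlabel("Number of diagnoses")
plt.ylabel("Number of medications")
plt.title("Scattergram of four pretend patients")
plt.show()
```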
Okay, now that we did that, I can talk about linear correlation, and you'll kind of get it. Linear correlation, that term, means that when you make a scatterplot of x-y pairs, it kind of looks like a line. Now, what's over here on the right is not like biology, and it's not like statistics; that's like algebra, right? Because back in algebra, you'd have these perfect lines where the dots were right on the line, and see, it's just x and y; notice there's no diagnosis or anything. That's algebra. So perfect linear correlation looks like graphing points in algebra. And if you actually make a scatterplot of people's x-y pairs and you see that, you should suspect there's something wrong. It actually happened to me once: one of our statisticians came to me and said, Monica, look at this, you won't believe this. And I said, well, I don't believe this; what are you graphing? And he said, on the x axis he had put the weight of each person's liver, and on the y axis he had put the weight of the whole person. And I'm like, how do you weigh people's livers? That sounds painful. And he goes, oh, let me go see. And what he learned was that you don't weigh people's livers; you use an equation to estimate the weight of the liver, and guess what's in the equation: their actual weight. So I'm like, that's why it came out on a line, because you were using the y to calculate the x. And he was like, oh, you're so smart for a secretary. So then I became an epidemiologist. Anyway, if you ever see this in biology, just suspect something's fishy, because things just don't end up right on a line. But if they get really close, you can say it's close to perfect linear correlation. I just wanted to let you know that's what's going on with this term, linear correlation. Okay, so let's talk about some facts about linear correlation. Things can be linearly correlated without being perfectly on the line; obviously our little example wasn't perfect. If, when you make those dots on your scattergram, you imagine a line going through it and that line looks like it's going up, this is called a positive correlation. But you don't always have a line going up. So I want you to look at this — I made up these data too — but on the x axis is the number of patient complaints, so as we go along, the patients are madder and madder, grouchier and grouchier, making more complaints. On the y axis, we have the number of nurses staffed on the shift, so as you go up, there are more nurses. Well, sure enough, when you've got a lot of nurses, you don't have as many patient complaints, right? Because the patients are being attended to. Some people call this an inverse correlation, but in this presentation I'm calling it a negative correlation, because as one goes up, the other goes down, and as one goes down, the other goes up. And that's depicted visually with this line going down; you can imagine a line going down, and that's a negative correlation. Neither is better, positive versus negative; it just describes how these things behave together, how x and y behave together. But then you can have situations where there's really no correlation, where x and y really don't have anything to do with each other. As you've seen, when you have patients in the hospital, some of them have really big families, and those families come a lot, and some of them don't really have that many loved ones. So as you can see, along x here are total unique visitors, meaning you just count each person once. So you could have a patient who only has one unique visitor, but if you look at y, the days spent in the hospital, that person has been there seven days and that one visitor keeps coming. And then you have maybe a patient here, the second one, with two unique visitors, and that person's only been in one day, but both of those visitors have been there. Then you have a person with three unique visitors, and they've been in the hospital four days, and those are probably the same three people coming back. So it really doesn't matter how long a person's in the hospital; if they've got a lot of loved ones who keep coming, they'll keep coming, or not. Right?
Right — according to this correlation, you end up imagining a line that just goes straight across, and that's no correlation. That's fine, too; nothing is better or worse. It's just that you make the scattergram to try and understand how x and y are related. This is always fun: in books, they always make some sort of goofy picture, I don't know why they do this, but they show something like this made-up correlation — the number of games in the lobby and the number of books in the lobby, which should really have nothing to do with each other. If you see something just way goofy like this, just say it's no correlation; I don't even know how you'd get this. All right, so we've been talking about correlation, and it actually has two attributes. So far we've only talked about one, and that is direction: we talked about positive, negative and no correlation. So whenever you're talking about a correlation, you have to say what direction it is. But you also have to say the other thing, which is what strength it is. So now we're going to talk about how you figure out the strength. Strength refers to how close to the line all of the dots fall. If they fall really close to the line, it is considered strong. If they fall kind of close to the line, it's called moderate. And if they are not very close to the line, it's weak. Now remember, that's totally different from direction: it could be positive strong or negative strong, positive moderate or negative moderate. So strength is just a statement of how close the dots you make in your scattergram fall to the line that you end up drawing. So I thought I'd give you a few examples. Look at this; I just made this up. This is what a strong negative one would look like; notice how those pink dots are almost on the line. And this is a strong positive. Again, maybe one of the dots is on the line, but not all of them, or it'd be perfect, and it's never perfect. But this is really close, so it's strong positive. Strong just refers to the fact that the dots are almost on the line. Now, this next one is almost the same correlation, but the dots are not really almost on the line; the line is trying to be fair and go between them, but they're kind of far away. So just eyeballing it, you would say this is moderate. And here, it gets weak, mainly because the dots are more all over the place. But you'll notice there's one that's right on the x axis, and then, hey, look up there near the title, there's one way up there, and that's an outlier. Sometimes, when you get outliers, they can really whack things out. So even though this is a weak correlation, that line looks all powerful, because it's almost basically connecting those two outliers. So you just have to be careful, and that's part of why you make a scattergram first: outliers can have a really powerful effect on the correlation, especially if they're in any of the four corners of the plot. If you get a weird outlier kind of in the middle, it's not going to do as much as if it's in the upper right, upper left, lower right or lower left. It can really affect the direction — it's like a seesaw, or a teeter totter: an outlier can get on and really change the direction of the line — and it can also mess with how strong or weak the correlation looks. So that's why you really want to start with a scatterplot.
And that's why the way this chapter is organized starts with the scatterplot: you want to look for outliers, and also just see how x and y look when you plot them. Now we're going to get on to the correlation coefficient, r. We're going to get on to computation and actually make a number, so you're not just using fuzzy words like positive, negative, or weak, moderate, strong to describe it; you can actually put a number on how correlated x and y are. So remember the word coefficient: we saw it with the coefficient of variation, which is different. The CV is one kind of coefficient, but what we're going to talk about is a different kind. Our coefficient this time is called r, and coefficient just means a number; we just like to use that word in statistics. Now, it seems kind of weird, because I'm talking about correlation, and people are like, well, why is it r? Why isn't it c, for correlation? I don't know, I didn't invent it. But this is how you can remember: you can go corr-r-relation. So: correlation coefficient, r. Just remember, r means correlation. And technically, r means the sample correlation coefficient. The population correlation coefficient — imagine you're correlating, say, height and weight in a whole population, like everybody in a particular state — actually needs a Greek letter, and I showed it on the screen; it's that fancy-looking p, the Greek letter rho. But we don't actually cover it in this class, so I just wanted to show it to you; we're only going to focus on r, which is the sample correlation coefficient. So what is r? Well, like I said, it's the numerical quantification of how correlated a set of x-y pairs is, and it's actually calculated by plugging all of the x-y pairs into an equation; I'll show you how to do it. And you can see that if you do it by hand and you have a lot of x-y pairs, it will take forever, so I tried to limit that. And remember, for standard deviation and variance, there was a defining formula and a computational formula; this time I'm only going to show you the computational formula. It's, in my opinion, way easier to do, and it gets you the same number. All right, so that's what we're going to do: we're going to take a set of x-y pairs and we're going to calculate r. But then how do you interpret r? Well, let me just prepare you mentally for what we're going to get out of this calculation. The r calculation produces a number, and the lowest number possible is negative 1.0. That's perfect negative correlation. So if we were in algebra, and we had a line going down and all the dots were on it, then the r would be negative 1.0. But that never happens, right? So the way you want to think about it is: if you have a negative correlation and you get an r that's like negative 0.95, or something really close to negative 1.0, then it's close to perfect negative correlation. That's how you want to think about it. And then the opposite: the highest possible number you can get for r is 1.0, but you basically never see that, except for that one mistake I was telling you about, and that would be perfect positive correlation. So if you calculate an r and it gets really close, like 0.95, like I said, or 0.98 or whatever, then you're thinking, whoa, this is really close to perfect positive correlation, right?
And then everything else is in between. So, you know, 0.5, or negative 0.3, or 0.02, or negative 0.09 — all of those are between negative 1.0 and 1.0, and that's where r should be. So let's say you calculate r and you get 8: okay, you did it wrong, right? Or you calculate r and you get negative 2.3: that's not right either; it's got to be between negative 1.0 and 1.0. And if you make a scattergram, you should know whether it should be on the negative side or the positive side; it should give you a hint. So this is just to calibrate what to expect from r, because it's kind of a big calculation. Now I'm going to give you some pictorial examples, because remember, every single time we make an r, we also have a scatterplot behind it, and I thought it would be helpful to see some real-life examples of r. These are real-life examples, okay? Real life — you don't get this from just anything. I'm just teasing. Anyway, I started with some negative r's, because I'm feeling negative today. I went into the literature and I found this article — it's from MIT and Harvard — about the evolutionary principles of modular gene regulation in yeasts, and all I really know about yeast is that I'm supposed to cut down on eating bread. But they had these really nice scatterplots, and they calculated r for them, and they had a little line on them, so I thought I'd show them to you. If you look at the one that's labeled D, see where the dots are and see where the line is: this looks like a moderate to strong negative correlation, because the dots are kind of close to the line. And when the group calculated r, they got negative 0.7, so that kind of makes sense. And then I put my opinion in the lower right. These aren't official cut points or anything, but I usually use them as a guide: see how I said negative 0.4 to negative 0.7 is moderate? So I would call the D one moderate. Now let's look at E. See how the dots don't cluster as close to the line as they do in the D one? That's going to make it a weaker correlation. It's still negative, and it's negative 0.44, and when you look at my little opinion, I still call that moderate, but it's on the low end. And then if you look at F, see how many of the dots are way far away from that line, dragging it down; so now it's an even weaker correlation, negative 0.25, and that's weak. So this is just some examples to give you a picture. And now, I promised to be more positive, so here are some positive r's. They didn't draw a line on this one; this is a different article, which says obesity is associated with macrophage accumulation in adipose tissue. So again, try to cut down on bread. Anyway, if you look on the left side, you'll see all of these x-y pairs plotted on the scattergram, and even though we don't have a line there, we can imagine it's going up, so we would expect this to be positive. But we also would imagine the dots aren't really clustering around the line very tightly. So when we see that the r is 0.6, we're not surprised; it's on the high side of moderate in my world, which makes sense. But go look at the right one, the B one: you could almost connect the dots and get a line out of that. It's really tightly hugging the line, so we're not surprised to see that the r is 0.92; that's pretty strong.
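Just to make that informal guide concrete, here's a tiny Python sketch of it. Remember, these cut points are only the opinion from the slide, not official statistics, and the function name is made up for illustration.

```python
# A tiny sketch of the informal guide from the slide. These cut points are just
# a rule of thumb, not official statistics, and the function name is made up.
def describe_r(r):
    if not -1.0 <= r <= 1.0:
        return "impossible -- r must fall between -1.0 and 1.0, so check your work"
    direction = "positive" if r > 0 else "negative" if r < 0 else "no"
    size = abs(r)
    strength = "strong" if size > 0.7 else "moderate" if size >= 0.4 else "weak"
    return f"{direction} and {strength}"

print(describe_r(-0.70))  # negative and moderate (right at the cut point)
print(describe_r(-0.25))  # negative and weak
print(describe_r(0.92))   # positive and strong
```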
So I just wanted to give you these pictorials before we actually go forth and calculate r, because that's one thing you can do: do the scatterplot, have an expectation of what r should look like, and then, if you calculate r and it's totally wacky, you know you did something wrong. Okay, let's calculate r, and let's use the computational formula. I threw the formula up in the upper left; don't feel overwhelmed by it, we're going to take it apart very carefully. But before we even do that, I want you to have a flashback to chapter 3.2. See all those sums, those capital sigmas, in the equation? We're going to handle calculating r a lot like we handled calculating the variance and standard deviation: we're going to make a table with columns, then we're going to fill in those columns with calculations, and then we're going to add up the columns to get all those sums. You were already good at that in 3.2, so you'll be good at this too. And then I made up a story, because it's a lot easier to check your work in statistics if there's some story behind it. So pretend we have seven patients that have been going to your clinic for a year. They're good patients; they keep coming. So they came to the clinic over the year, and at the last visit of the year, you measured their diastolic blood pressure. And what you predicted, or what you thought would make sense, is that those with a higher diastolic blood pressure would have had more appointments over the year, because probably they're trying to stabilize their blood pressure, or maybe they have other problems that are driving it up. That makes reasonable sense, right? So what you want to do is see if you're right. You're going to take the diastolic blood pressure at the last appointment as your x, because you think that's the explanatory variable, the independent variable, the one that would have something to do with whether or not they had a lot of appointments. And then you take y as the number of appointments over the last year, because you'd say, okay, a high DBP probably means they had more appointments. That's just your idea; maybe you're wrong, but that's what we're going to do. Okay. So I put in the title, just as a reminder, that x is DBP and y is the number of appointments, so you don't forget. And then we made up this table. Look at the first column: it's just the patient number. It's nothing exciting; we just want to keep track of which patient is which. And then notice under x, we just have all of their DBPs. So patient one, at the last appointment, had 70 mmHg, and patient two had 115 mmHg. That's kind of alarming, but these are fake data, so don't get worried about these patients. Anyway, we just fill in x. And then also, when you have their chart out, you can look up how many appointments they had over the last year: patient one only had 3, whereas patient two had 45, which you can believe, because sometimes they're coming in all the time to get stuff adjusted. Then patient three only had 21, and patient four had 7. So you can see these are the x-y pairs for each of these patients. And it's pretty simple to go to the bottom and sum up each of the columns: we have a sum of x of 678 and a sum of y of 166.
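If it helps to see the bookkeeping of this table in code form, here is a small sketch of what each column and column total is doing. The two pairs in the example call are just the ones read off the slide so far; the full seven-patient list isn't written out in this lecture, so the printed totals are not the lecture's totals.

```python
# A small sketch of the bookkeeping this table does: for a list of x-y pairs,
# build n and the column totals. The two pairs in the example call are just the
# ones read off the slide so far, so the printed totals are not the lecture's.
def column_sums(pairs):
    n = len(pairs)
    sum_x  = sum(x for x, y in pairs)
    sum_y  = sum(y for x, y in pairs)
    sum_x2 = sum(x * x for x, y in pairs)   # the x-squared column, summed
    sum_y2 = sum(y * y for x, y in pairs)   # the y-squared column, summed
    sum_xy = sum(x * y for x, y in pairs)   # the xy column, summed
    return n, sum_x, sum_y, sum_x2, sum_y2, sum_xy

print(column_sums([(70, 3), (115, 45)]))    # (2, 185, 48, 18125, 2034, 5385)
```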
And also, I'm reminding you of the r calculation; I put it in the upper right so we can see what we're doing. I want to call your attention to one of the terms in there, which is the sum of x, which I put in parentheses here. We already know that one, just from making the first part of this table and adding it up, so we already have it. And now I want to point out: if you look at this other sum over here, it's not exactly the sum of x, it's the sum of xy. The y is mushed right next to it; that's not sum of x, that's sum of xy. That comes later in the game: we're going to put the sum of xy at the bottom of the last column. So that term there is not sum of x, it's sum of xy. Okay, now downstairs we see a sum of x squared, and that looks an awful lot like the one next to it on the left, which also says sum of x squared. So how do you tell the difference between the kind without the parentheses and the kind with the parentheses? This is how I do it. The rule is, always, regardless of what's going on, do what's in the parentheses first. That's easy if you have parentheses: if you've got the parentheses version, the sum of x, quantity squared, you just take the sum of x and multiply it by itself. But what if you don't have any parentheses? Well, what I do is I say, if I did have some, I'd do it that way; but since I don't have any, then I know I have to use the sum of the x-squared column. That's where you take x times x on each line, put it in the column, and sum that column. So that's how I go through it, no matter where I am in statistics or algebra: if I see the sum symbol and then the x squared, I first look for the parentheses. If they're there, I know what to do. If they're not there, then I know you don't just square the sum of x; you have to go look at the bottom of the x-squared column and take the sum of that. I hope this is helpful. All right, so as you can see, I've shown you that on the top of the equation is where you just take the sum of x and the sum of y, and on the bottom I'm showing you where you take those and square them, and the other kind of term is the one where you just take the sum of the column. All right, so what happened here? Well, we filled in the x-squared column. So if you go to patient one, 70 times 70 is 4,900; that's where we're getting that number. So you go through, and for patient two, 115 times 115 is 13,225. You go through all of those and then you sum them up, and that's what goes in that term. And then I'll bet you can guess what the next slide is: surprise, now we do the y one. Don't get confused, because you kind of have to skip a column there. So 3 times 3 is 9, and that's what goes in the y-squared column; and 45 times 45 is 2,025. That's how we're doing those. You sum all that up, and then go look up at the equation: that's where you put that sum of y squared. Now we have xy. And this reminds me of a student I had before. She was really confused. She's like, Monica, I don't know what to do with xy, the xy quantity. And I go, what do you mean? It's pretty obvious: you just take x times y, like here, 70 times 3 is 210. She goes, x times y? Where's the times? How do you know it's supposed to be times? I don't see any times. Right? I don't see any dimes either.
Like, there's no times sign, so how do you know to do that? Well, anyway, I'll just tell you: imagine a little multiplication symbol between the x and the y. That's what's supposed to be there; that's what you're supposed to imagine. I was so used to looking at it that I said, you're right, I guess you're just supposed to assume that. So you take x times y. For patient two, we just took 115 times 45, and that's how we got 5,175. You go through each of those — it's a lot of processing — and then you sum it up at the bottom. Whoo, that's a big number. And then you see I circled it in the r equation. So I think we've figured out where to put everything. Obviously, n is seven, because we have seven patients, and you see a bunch of n's in there. So I think we have all our ingredients; let's move forward. All I did here was rewrite the exact same equation with all the ingredients in it. Like I said, n is seven, so wherever you see n, you'll see a seven. See that sum of xy on the top? You see where that goes; see the sum of x and the sum of y; and then downstairs, you'll see I filled in all those numbers too. Now, let me talk to you a little bit about both levels, the numerator and the denominator. In the numerator, because of order of operations, you need to do out the n times the sum of xy — that's seven times 18,458 — first, and then you need to do out the other one, the 678 times 166. And then, after you're done with those two things, you subtract the second one from the first one. That's the order you have to do it in to get the numerator right. Now, for the denominator, it's a little bit the same, but a little more complicated. On the left side you have that seven times 67,892; you have to do that out. Then you have 678 squared; you have to do that out. Then you subtract the second from the first, and after you have that, you take the square root of all of it, and that's your first term. Then you still have to go over to the other one: you take seven times 6,768, keep that; then take 166 times 166, keep that; subtract the second from the first; and after you're done with all that, you take the square root of that. And then those two square roots, you have to multiply together. So it's a lot of work, and you have to do it in the right order. Here, I just wanted you to see that you probably want to work out the left term separately first, and then work out the right term separately, and once you work those two terms out, you take the square root of the left one and the square root of the right one and multiply them together to get the denominator. So this slide is to help you see — I threw the numerator on there, that was relatively easy — the two different numbers you should get from the left side of the denominator and the right side of the denominator, just to check your work. And then, of course, once you multiply them by each other, you get this number, 17,561.3. So ultimately, what the calculation for r comes down to is: you're trying to calculate the numerator, and you're trying to calculate the denominator, and at the end, you divide the numerator by the denominator and you get the answer, which is r. So we're going to do that now, and here's what we got: 0.949. And because we see that it's positive, we know it's a positive correlation. And then remember my opinion — and probably everyone's opinion — because if you round that up, you get 0.95, and that's getting really close to 1.0, so most people would agree that's pretty strong. So how you would diagnose this correlation is: you would say it's positive, and it's strong.
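If you want to double-check this big hairy calculation with a few lines of Python, here is a minimal sketch of the same computational formula, plugging in the column sums from the table; the variable names are just my own labels.

```python
# A minimal check of the computational formula for r, plugging in the column
# sums from the seven-patient table (the variable names are my own labels):
#   r = (n*sum_xy - sum_x*sum_y)
#       / ( sqrt(n*sum_x2 - sum_x**2) * sqrt(n*sum_y2 - sum_y**2) )
# Note: sum_x ** 2 is the "parentheses" kind, (sum of x) squared, while sum_x2
# is the other kind, the total of the x-squared column.
from math import sqrt

n = 7
sum_x, sum_y = 678, 166        # totals of the x and y columns
sum_x2, sum_y2 = 67892, 6768   # totals of the x-squared and y-squared columns
sum_xy = 18458                 # total of the xy column

numerator = n * sum_xy - sum_x * sum_y                                       # 16,658
denominator = sqrt(n * sum_x2 - sum_x ** 2) * sqrt(n * sum_y2 - sum_y ** 2)  # about 17,561.3

print(round(numerator / denominator, 3))   # 0.949
```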
Okay, I just want to wrap this up by giving you a few facts about r that I may not have covered yet. First, r technically requires data with a bivariate normal distribution, which is something we didn't check before doing our r in this class, because I just don't cover that. But please know, if you take another statistics class and they bring up r, they might talk about checking for the bivariate normal distribution, so just know about it. Next, please know that r also does not have any units. Other things don't have units either — remember, the coefficient of variation didn't have any units — some things just don't have units, and r is one of them. Also, we did talk about how perfect linear correlation is where r equals negative 1.0, if it's a negative correlation, or r equals 1.0, if it's a positive correlation. But I might not have mentioned that no linear correlation is r equals zero. Now, you probably won't see exactly zero in real life, but sometimes you'll make an r and it will be either positive or negative but 0.0000-something. Regardless of whether it's positive or negative, if it's 0.0000-something, it's really close to zero, and that means there's probably no linear correlation. And then we learned about positive and negative r, but I just want to remind you of the behavior of x and y in those circumstances. If you have a positive r, it means as x goes up, y goes up, but it also means as x goes down, y goes down: they travel together. When you get a negative r, it means as x goes up, y goes down, but it also means the opposite, as x goes down, y goes up: they travel in opposite directions. Now, here's another little factoid about r: if you choose to switch the axes — let's say you give me x-y pairs, and I designate a certain variable as x and the other one as y, and you designate them the opposite way — it really doesn't matter, even in the equation, because you'll end up with the same r value. So it doesn't matter if you call my x your y and my y your x; we can switch them, but we'll still end up with the same r from the calculation. Then finally, even if you converted x and y to different units, you'd get the same r. Let's say that you were in England and you were doing the correlation between height and weight, and you were using the metric system on the same patients that I was using the US system on: even though we'd have different numbers, because obviously you have to convert them, we'd still get the same r when we're done.
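Those last two factoids, about switching the axes and converting the units, are easy to see for yourself. Here's a quick sketch with made-up height and weight pairs; it uses statistics.correlation, which needs Python 3.10 or newer.

```python
# A quick look at those last two factoids, using made-up height and weight
# pairs (statistics.correlation needs Python 3.10 or newer):
from statistics import correlation

heights_in = [60, 64, 67, 70, 73]        # made-up heights in inches
weights_lb = [120, 150, 160, 180, 200]   # made-up weights in pounds

r_xy = correlation(heights_in, weights_lb)        # height as x, weight as y
r_yx = correlation(weights_lb, heights_in)        # axes switched
heights_cm = [h * 2.54 for h in heights_in]       # same people, metric units
weights_kg = [w / 2.2046 for w in weights_lb]
r_metric = correlation(heights_cm, weights_kg)

print(round(r_xy, 4), round(r_yx, 4), round(r_metric, 4))  # all three are the same r
```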
So finally, we get to the last subject of this lecture, which is lurking variables, which you've heard about before. The main point I want to make is: correlation is not causation. You don't want to be misled by correlations, so beware of lurking variables. Remember, lurking variables are things lurking behind the scenes that cause things. And you may have realized that selecting x and y — if you have x-y pairs, designating which one is x and which one is y — is kind of political, because you're implying that x could cause y. So let's say you're correlating height and weight: taller people are heavier, so you would choose x to be height and y to be weight. People don't go, oh, I'm too short, I should gain weight so I can grow taller; that's just not the way things work. So you have to put x as the height and y as the weight. But in reality, there are other causes of weight besides height. In fact, there are things that cause both height and weight, like genetics. A genetic profile that leads to tallness and also to obesity could be a lurking variable in the relationship between height and weight. So there could be some tall people who are obese, and it's not really just because they're tall; it could be because they have genetics that programmed them to be both tall and obese. And here's an example where you've got to be really careful with correlation. There's been this claim that eating ice cream causes murders, because people noticed that in areas where ice cream sales go up, murder rates rise. And I don't know about you, but when I have some really good ice cream, it just makes me so mad. I'm just kidding. I mean, why would this happen? Well, the reality is that summer and warm weather are lurking variables: we sell more ice cream in the summer, so ice cream consumption goes up, but also people are outside more, and more murders occur. And you know, I'm from Minnesota, where it gets really cold for periods of the winter, and oh my gosh, there are totally no murders then; people just don't commit murders when it's really frigid out, it's just really inconvenient. So that's a situation where there's a lurking variable, and so you don't want to start messing with our ice cream laws and making it so we can't have ice cream, just because you misinterpret this as ice cream causing murders. There's a lurking variable behind it that has something to do with both. Here's another one, and this came from my professor in my biostatistics class. They put up a time series chart over a long time, since the 1900s, and they pointed out that, over time, as onion consumption goes up and down, the stock market rises and falls with it; when the stock market is low, people aren't eating as many onions. And this is just true over generations in the US. So, yeah, we've had some problems with our economy in the US; do you think we should all start eating a bunch of onions? Right. The healthy economy is a lurking variable: in a healthy economy, people buy more food, including onions, and a healthy economy also boosts the stock market. So you've got to be careful, because correlation is not causation. If you want to make the stock market go up, don't make everybody eat onions, and definitely don't make us stop eating ice cream; that would make me very upset. At the end of the day, you're not going to be able to affect the murder rate by bringing down the ice cream consumption rate, and you're not going to be able to fix the stock market by making people eat onions. And that's the whole concept behind lurking variables, and why correlation is not necessarily causation. So in conclusion, when you're doing your correlations: first, make a scattergram, because you want to get a visual idea of the strength and the direction, and you also want to look for outliers. Then go on and calculate r by hand, but be really careful, because it's a big hairy calculation and you don't want to make any mistakes.
And then finally, when you go to interpret r, be careful of lurking variables, and remember that correlation is not necessarily causation. And now, time for some ice cream. Hello, it's Monica Wahi, your Labouré College lecturer, here to ruin your day with chapter 4.2, linear regression and the coefficient of determination. So at the end of this probably painstaking lecture, the student should be able to explain what the least squares line is, identify and describe the components of the least squares line equation, explain how to calculate the residuals, and calculate and interpret the coefficient of determination, or CD for short. All right, so it's really cool if you have a crystal ball, because then you can make predictions, right? You just look into the crystal ball. It's some nice equipment; I've had friends who have them; they're very nice to put out on your dining room table as a centerpiece. Unfortunately, though, they don't really play much into statistical prediction. So what I'm going to show you in this lecture is how we use statistics for prediction instead of this beautiful crystal ball. We're going to start by talking about what the least squares line is. Then we're going to talk about the least squares line equation, which is the crystal ball thing we use in statistics. Then we're going to talk about doing prediction using the least squares line. And finally, we're going to talk about the coefficient of determination. So let's get started, and let's get started with the term least squares criterion. Remember, criteria is plural and criterion is singular, and criteria are stuff you need to meet, right, to be eligible — like you have to meet the criteria for registration for college. Well, the least squares criterion is just one, which is awesome, because then you only have to meet one thing. So one of the things you probably wondered when you were watching the last lecture is, how do you know exactly where to draw this line when you have a scatterplot? How do you know where to make the line the most fair? In the last chapter, when we plotted the scattergrams, I just drew a line there for demonstration. But there actually is an official rule as to where the line goes, and basically, the rule is it has to meet the least squares criterion. If it meets that criterion — and there's only one line that does — then that is where the line goes. So how do we get to that? Well, this is roughly what it looks like. When you draw the line, there is a vertical distance from each of the dots to the line. As you can see on the slide, sometimes the dots are below the line and sometimes they're above the line. And the word "squares" indicates that whether that distance is up or down, you're going to square it, so it's not negative anymore, because whenever you square a negative, it becomes positive. So first, you're going to square all of these distances. Then imagine you were just going to try it out: maybe you draw this line, and then you calculate the squares, and you're like, okay, that's how many. And then maybe you tilt the line a little and calculate the squares again. Your goal would be, when you add up all the squares, to have the least total. So the line belongs wherever it causes the smallest sum of squares for the whole data set.
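Here's a tiny sketch of that "tilt the line and add up the squares" idea in code. The points and both candidate lines are made up purely for illustration; the second line's smaller total is the whole point of the criterion.

```python
# A minimal sketch of the least squares criterion: for a candidate line
# y-hat = b*x + a, add up the squared vertical distances from each dot to the
# line. The points and both candidate lines are made up purely for illustration.
def sum_of_squares(points, b, a):
    return sum((y - (b * x + a)) ** 2 for x, y in points)

points = [(1, 3), (3, 5), (4, 4), (2, 2)]        # made-up x-y pairs

print(sum_of_squares(points, b=1.0, a=1.0))      # 4.0
print(sum_of_squares(points, b=0.7, a=1.75))     # 3.25 -- smaller, so this tilt is closer
```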
So if you were software — which you're not, you're a person — you'd figure out, using your software brain, exactly how to tilt this line and exactly where to put it to minimize those squares. But we're people, so I'm going to go on and explain how people do this. The trick is, if you can figure out roughly where the line goes, you can draw it on the scatterplot and be about right, but there is a challenge in knowing exactly where it belongs on the graph. And then also, you're probably realizing you don't always have a graph to draw it on; maybe you need to talk to somebody about where the line goes and you can't draw a picture. So the way you explain where the line goes is with an equation. Some of you may remember this and some of you may not, so I thought I'd do a quick review of how lines and equations relate. We're going to get into the least squares line equation, but first, I'm going to give you a little flashback about algebra. I'm sorry if this is painful; it's hard for me, because I wasn't really that good at algebra. This isn't statistics, this is algebra, but I just want you to remember this part. Back in algebra, there was a chapter where you were given these x-y pairs, and it was different from statistics, because they all lined up on a line. See, these pink dots are perfectly on a line, and these are the x-y pairs. And remember, you had to graph this, kind of like we had to do scatterplots. And then you were given this equation, y equals bx plus a, and that was the linear equation to describe this line. And you were like, okay, I don't get how to put this equation together with this line. So first, the teacher would say, well, b stands for the slope of the line, because you have to know the slope; the line could be tilted any which way, so if you know the slope, you already know something about the line. And in algebra, the way you would get the slope is you'd calculate the rise over the run. So b in algebra was rise over run, and that gave you the slope. And then you'd be like, great, but you always needed another thing in order to define the line, because if you imagine this line is in an elevator, it could keep the same slope but go up or down, right? So we need to anchor it on the y axis somewhere. So a stands for the y intercept, or where the line spears through the y axis. And as you can see by the drawing, it looks like a is at zero comma zero, right? But you don't have to eyeball it. What you could do in algebra to get a is, since you'd already filled in b, you'd just grab an x-y pair, plug in the x, plug in the y, plug in the b you just got, and back-calculate the y intercept. And that's how you would get the whole linear equation; that's how you would do it in algebra. I just wanted to remind you of that, because we do some similar things in statistics. It's a little different, but I wanted to remind you how to connect what a line looks like with how this equation works.
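Here's that algebra move written out as a tiny sketch, with two made-up points, just to jog your memory before we do the statistics version.

```python
# A tiny made-up algebra refresher: with the slope b (rise over run) and one
# (x, y) point on the line, you back-calculate the y intercept a from y = b*x + a.
x1, y1, x2, y2 = 1, 3, 3, 7     # two made-up points that sit on a line
b = (y2 - y1) / (x2 - x1)       # rise over run = 4 / 2 = 2.0
a = y1 - b * x1                 # back-calculate: 3 - 2.0 * 1 = 1.0
print(f"y = {b}x + {a}")        # y = 2.0x + 1.0
```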
All right, well, welcome to statistics: look, those pink things are not on a line. We still want to make a line, but now you know about the least squares criterion; what you're trying to do is make the line that minimizes the sum of squares. So here we go. Remember, I was just talking about this linear equation back in algebra; well, notice the difference. The main difference here is the hat: the y is wearing a hat. And universally in statistics, whenever you see a letter or number wearing a hat, it means it's an estimate. Of course we're estimating y, because if you look at that line, none of these dots actually falls on it, and we don't really expect even an estimate to land on the dots, just close, because of the least squares. And so we have, in a way, the same goal we had back in algebra: we have to get that b, that slope, and then we have to use it to back-calculate our a. So let's go on with that. Like I said, in the software approach, you just feed all the x-y pairs in, and the software actually prints out the b and the a; it just prints out the slope and the y intercept, which is why I love the software. But we don't get to use that in our class. In our class, we have to do the manual approach, just because it's painful and I had to do it too, so now I'm making you do it. Right? That's me. Okay, what we'll do is plug all the x-y pairs into an equation to get the slope, the b, and I promise I won't give you a ton of x-y pairs, or you'll be there forever. But there's a next step we have to do that we didn't have to do in algebra, and that is we're going to have to go back to all of our x's and calculate x bar, and go back to all of our y's and calculate y bar. Remember, that's the mean of the x's and the mean of the y's. And you're probably wondering, well, why do we have to do that? I'll show you again, but in case you didn't notice, those dots really don't fall on the least squares line; they fall around it, and you need a dot, at least one, on that line to help back-calculate that y intercept. And one of the rules of the least squares line is that the point x bar comma y bar is on that least squares line. So if you calculate that out, you know that point is actually on the least squares line. And so finally, after you do x bar and y bar, you plug in b, and you plug in x bar for the x, and you plug in y bar for the y hat, to back-calculate the a. So it's a similar but different process from algebra. So the moral of the story is, you need to recycle; we've got to be good to the environment. What has happened? Well, you wouldn't be at this point in your life of making a least squares line if you hadn't already started out by making a scatterplot, and then deciding you wanted to do r, and then making r. And when you make r, you end up with that big table, remember, and you end up with all these calculations, like the sum of x, the sum of y, the sum of x squared, and the sum of xy. Now you want to recycle those; you want to save those calculations from r, because they also fit into the equation for b. So recycle that. Also, you want to save the r you made, because you're going to recycle that into the coefficient of determination, which I'll explain later. And then this part is not about recycling; you'll actually have to make it anew: you need to calculate x bar and y bar. You never needed those before now, but now you do. So get together your old r calculations, then put your x bar and y bar together, and you'll be ready to do the least squares line equation. All right, so here's a flashback. Remember this big table? Remember our story: we had seven patients, and x was their diastolic blood pressure at the last visit they had of the year.
And y was the number of appointments they had over the year. The thought was, well, if your diastolic blood pressure goes up, then maybe you need more appointments, because it's a marker of being sick. That was my little story. Over on the right you'll see the formula we're using for b. The text gives you two formulas; as always, I've got my favorite, and it's the one that works with the table. So here's the formula for b, and then after you calculate b, you'll notice that b is in the formula for a, so you've got to do b first. A lot of times students are a little confused about what the goal is here. If you look at the bottom of the slide, the goal is to come up with what b is and what a is, and then fill them in, and that's your least squares line equation. Your least squares line equation is always going to have a y hat in it; that's a variable that just gets to stay there. It's always going to have that equals sign, and then whatever your b is gets mushed up next to the x, so the x is always there too, and then plus whatever you get for a. Just as a trick, if a turns out to be negative, it ends up being minus a. But that's the generic equation, and our goal is to calculate b and a, fill them in, and then say this is our least squares line equation. Oh, and remember how I was saying you need to make some new calculations: you need to make y bar and you need to make x bar. It's a little easier to show with the columns up. If you look at the bottom of the slide, remember how the sum of x was 678, remember how our n is 7, and remember how the sum of x divided by n is your x bar. The same goes for y: the sum of y divided by 7. I just wanted to quickly remind you that you need to generate these before you can completely finish the least squares line equation. Then I cut to the chase and just summarized the actual numbers you're going to need and put them over here, so we don't have to look at that whole big table anymore. You'll notice I grayed out the sum of y squared, because I realized later we don't actually use it. Okay, so now look on the left side, under the big list of numbers, and you'll see the b equation that I filled in. If you compare it to the formula on the right side, you'll see what's going on: n is 7, so wherever you see a 7, that's where n goes. Then at the top of the equation, remember the sum of xy; let's look that up. That's the big number, 18,458. I want to be clear: you have to do out that left side, the 7 times the 18,458, and then do out the right side, which is the sum of x times the sum of y, 678 times 166, and then you subtract the right one from the left one, because of order of operations. That's how you make the numerator. Now let's look downstairs. Again we have an n, so that's 7, and then the sum of x squared. Remember, it doesn't have parentheses around the sum of x before squaring; if it had the parentheses, you'd be taking 678 and squaring that, but it doesn't, so you have to use that big number, 67,892.
And again, like with the upstairs, you've got to do out that side of the equation, that term, multiplying it out before even looking at the rest. Then on the right side of the denominator we have the sum of x, squared, meaning the whole sum squared; that's exactly the example I was giving, so you take 678 times 678 and do that one out. After you do that one out and you do the first one out, you subtract the second one from the first one, remember, order of operations. If you do it right, you should get what you see on the left side of the slide for the numerator and for the denominator, and then you divide them out and you get 1.1. That's your b. So there you go; that's how you do it. Now we have to worry about a. What I did was write b at the top there, so b is 1.1, and now we can use b to figure out a. Remember, I did x bar and y bar for you so we'd have those ready. Now we calculate a by taking y bar minus, and remember order of operations again, the b, which is 1.1, times x bar. You do that product out first and then subtract it from 23.7. And remember how I said sometimes you get a negative a? Well, we got negative 80 for a. All right, we've got our b, we've got our a, let's go. Oh, and if you want to check your work, this should work out: you should be able to take b times x bar, 1.1 times 96.9, subtract the 80, and get back about 23.7 (it works out exactly if you keep the unrounded b instead of 1.1). If that works out, then you know you did everything right. But remember what the goal was: to fill in that least squares line equation. If you look over on the right, that's what we did. We still have our y hat, we still have our equals, and now we have 1.1 where the b belongs. We still have that x, because that's a variable that stays, and then we do minus 80, because we came out with a negative a. If it had been plain 80, we'd say plus 80. All right. At the beginning of this presentation, I teased you that we were going to do prediction with the least squares line equation; we weren't going to use a crystal ball, we were going to use this equation. Well, I finally get to that exciting part of the presentation. But, and there's always a big but, I first have to warm you up with some rules. First of all, I just want you to reflect on what we did and realize that we can draw the least squares line, but unlike algebra, our xy pairs probably aren't on it; in this example, none of the xy pairs are on it. So you need to be sure about at least one xy pair that is actually going to land on the least squares line, and the only one you can be sure of is x bar comma y bar. If you reflect on it, that's why we had to calculate it: we had to use x bar and y bar in the calculation to back-calculate a, the y-intercept. Now, you may be lucky and get a data set where there is an xy pair that just happens to fall on the least squares line, maybe even a couple or more, but you can't trust that. If you need to trust that there's a point on the least squares line, you know it's always going to be x bar comma y bar. All right. Now I want to focus more closely on the slope, or b. Remember, in our example we just calculated b and got 1.1, and that's a slope.
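If you wanted to double-check this arithmetic with a little code, here is a Python sketch of the same manual calculation using the sums from the table on the slide; it's just a check on the hand calculation, and the only difference from the slide is that it keeps more decimal places before rounding:

```python
# Manual least squares calculation from the lecture's sums
# (n = 7 patients, x = diastolic blood pressure, y = appointments).

n      = 7
sum_x  = 678
sum_y  = 166
sum_xy = 18458
sum_x2 = 67892

b = (n * sum_xy - sum_x * sum_y) / (n * sum_x2 - sum_x ** 2)

x_bar = sum_x / n            # about 96.9
y_bar = sum_y / n            # about 23.7
a = y_bar - b * x_bar        # back-calculate the intercept

print(round(b, 2), round(a, 1))   # about 1.07 and -80.0; the lecture rounds b to 1.1
```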
So I want to point out that the slope b of the least squares line tells us how many units the response variable, or y, is expected to change for each one-unit change in the explanatory variable, or x. That's a bit of a tongue twister, but if you think of our example it's easier to understand. The slope was 1.1, our x was DBP, and our y was the number of appointments over the last year. So what we're essentially saying is that for each increase of one mmHg of DBP, the x, there is a 1.1 increase in the number of appointments the patient had over the past year. As DBP goes up by one, the appointments go up by 1.1. Now, I don't know what a tenth of an appointment is, but you get what I'm saying, because it's just a y. And the number of units of change in y for each unit change in x is called the marginal change in y. So if you think about it, 1.1 is the slope, but 1.1 is also the marginal change in y for each unit change in x. Now, I also want to recall for you the concept of influential points. Remember, we should have done a scatterplot and everything before we got to this point, because we need r, and we need all those sums of x's and sums of y's and sums of everything. And just like with r, if a point is an outlier, and you can see it on the scatterplot, it can drastically influence the least squares line equation, just like it can screw up r. An extremely high x or an extremely low x can do this, and I'm just pointing out a culprit we have here on the scatterplot. So always check your scatterplot first for outliers, because you could end up in a situation where you're making a least squares line and a bunch of outliers are whacking it out. Okay, now I'm going to bring up another issue, and you're probably asking when we get to the prediction part; you just have to relax, I have to get through a few of these issues first. One of them is the residual. The word residual kind of sounds like residue, right? Like when somebody comes over and sets their cup on your coffee table without using a coaster, it leaves some residue and you get all mad. Well, that's kind of what a residual is: something left over. Once the equation is there, once you make the least squares line equation, there's something I want you to notice. Remember how we had seven patients, and each had an x? You can take each x, plug it into the equation, and get a y hat out, and I want to demonstrate doing that. We have our equation in the upper right here. For patient one, I took patient one's x, which was 70, and plugged it in: 70 times 1.1 minus 80, and I got negative three. Now, patient one's real y, which I put on the screen here, is actually three. So as you can see, it's not the same answer. Then I did it with patient two also: 1.1 times 115, because that's the x, and then minus 80, because that's the rest of the equation, and I got 46.5. That was a little closer, because look at patient two's y: that was 45, and it's really close to the 46.5, so that's a little bit better.
The reason I was doing all that is I wanted to tell you that the residual is y minus y hat. So in the first case, y hat was negative three and y was three, so for patient one we did three minus negative three and got six. That's the residual; it's like the residue left over between y and y hat. Then for patient two we did it again: we took y, which is 45, minus y hat, which was bigger at 46.5, and we got negative 1.5. So that's the residual. That's how you calculate it, and that's what it is. The bottom line is you don't want big residuals, because that would mean the line didn't fit very well. You'll find that if you have a really good-fitting line, you have very small residuals. And you're probably asking, well, what counts as a good-fitting line? We'll get to the coefficient of determination, and that will help you see what constitutes a good fit. But first I'll get to the prediction part. So you're done with your least squares line equation and you want to use it for prediction. Let's say you knew someone's DBP and you wanted to predict how many appointments she or he would have in the next year. What you're not doing is reusing the x's from your data; we just did that to make the residuals. What you're doing is imagining a new person out there and using this equation for prediction. You could plug in the DBP as an x, get the y hat out, and say that's your prediction. But you've got to use some caution. If you use an x within the range of the original data, and as you can see, I put the x's up here, the range was about 70 to 125, that type of prediction is called interpolation, and people feel pretty good about it. But if you use an x from outside the range, one that's smaller, like 65, or one that's bigger, like 130, then it's called extrapolation, and that's not such a good idea, because you don't know if it's really going to work. So here's an example of interpolation. The patient in your study has a DBP of 80. Okay, 80 is right in that range, so let's use it. Now, this looks familiar, because we just did this when we did residuals, but we're using a new person now: 1.1 times 80, minus 80, equals 8. So what we would do is predict that this patient would come to eight appointments next year. There, that's how we use our least squares line equation like a crystal ball to predict. So is it really this easy? Is this all you have to do to predict the future? Well, it's not really that easy. You can't make a linear equation out of any old set of xy pairs. Remember this scatterplot from our last lecture? It looks like a cloud, right? It doesn't look like it should make a line. But you know what, if you feed that stuff into the software, or into your b formula and your a formula, you'll get a line out of it, even if there's no linear correlation. And if you get a line out of a scatterplot that looks like this, it's not a very good line, and it wouldn't work very well for prediction, because a scatterplot like that looks pretty unpredictable.
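Here is a short Python sketch of the residual and interpolation steps we just walked through, using the lecture's rounded equation of y hat equals 1.1x minus 80; the numbers match the slides apart from rounding:

```python
# Residuals and an interpolation prediction with the rounded least squares line.

def y_hat(x, b=1.1, a=-80):
    return b * x + a

# patient 1: x = 70, observed y = 3; patient 2: x = 115, observed y = 45
for x, y in [(70, 3), (115, 45)]:
    pred = y_hat(x)
    print(x, round(pred, 1), round(y - pred, 1))   # (70, -3.0, 6.0) and (115, 46.5, -1.5)

# interpolation: a new patient with DBP = 80, inside the original x range of about 70 to 125
print(round(y_hat(80), 1))   # 8.0 predicted appointments
```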
So for that reason, we can't just accept any line that is handed to us. To evaluate whether our least squares line equation should be used for prediction, we need the coefficient of determination. So here we are at the coefficient of determination. Remember how I said you have to recycle, recycle, recycle? Well, get out your r; it's time to recycle. The coefficient of determination is also called r squared, and it literally means r times r. And I just have to add this on: just like the coefficient of variation, remember that one, we always turn r squared into a percent, so you multiply it by 100. In this example, remember, early on in the last lecture we did the r for this, not the cloud scatterplot I just showed you, but the one of DBP and the appointments, and we got an r that was a really strong positive correlation: 0.95. If we want to calculate r squared, the coefficient of determination, we take 0.95 times 0.95 and we get 0.90, but we've got to do the percent thing, so we end up with 90%. This is how you say it: 90% is the variation in y that is explained by the linear equation. So y varies, like how many appointments they had was different for each person, and 90% of that variation is explained by the equation. And of course, if you take 100 minus 90%, there's 10% unexplained variation. So there's still some variation that could be explained by other variables, but not a lot. How you actually state it, if you were writing a paper, is: 90% of the variation in the number of appointments is explained by DBP. And I know people say, explained, it doesn't have a mouth, what is it talking about? You just have to say it this way; it's statistics-ese. By contrast, or complementarily, what you would say is: 10% of the variation in the number of appointments is not explained by DBP. It could be explained by other things. Well, we happened to get a nice CD, coefficient of determination; we got a nice high one. But what if it's low? Well, let's think about it: the CD should be better than at least 50%, because that would be random, and the higher the better. On a test, nobody's going to give you a CD of 60% and ask if it's any good, because you'd be very conflicted. In real life, what I use it for is to compare models: if one is 60% and the other is 55%, of course I'm going to go with the 60% one, but it's still not very good. Basically, the higher the better, and if it's low, it means you probably need other variables to help the x you used explain more of the variation, because that x is not doing enough on its own.
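If you wanted to check that percent with a bit of code, here is a quick Python sketch of the r-squared arithmetic, with r = 0.95 carried over from the last lecture:

```python
# Coefficient of determination (CD): recycle r, square it, read it as a percent.

r = 0.95                  # strong positive correlation between DBP and appointments
r_squared = r * r         # r times r
print(r_squared)          # roughly 0.90, i.e. about 90% of the variation in y explained
print(1 - r_squared)      # roughly 0.10, i.e. about 10% unexplained
```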
Okay, in summary, I just wanted to go over chapter four so you realize where we've been. We started out with a set of quantitative xy pairs. The first thing we did was make a scatterplot; we wanted to look at the linear relationship between x and y, and we wanted to look for outliers. If we'd seen a lot of outliers, or no linear relationship, we would have stopped there. But because this is a class and we had to learn, I forced there to be a scatterplot with a linear relationship and not too many outliers, so we could move forward and do r. So we calculated r to see if our correlation was positive or negative, and weak, moderate, or strong. That's what you do if you find a linear relationship. Next, in this lecture, we calculated b and a to come up with the least squares line equation. And I just wanted you to notice that the sign on b will always match the sign on r: if you have a positive r, you'll have a positive slope, and if you have a negative r, you'll have a negative slope. Otherwise the numbers won't match, just the sign. I also wanted you to notice that strong correlations will give you a high coefficient of determination even if they're negative correlations, because remember, it's r times r, and a negative times a negative is still positive. So if you have a strong correlation, like negative 0.9 or positive 0.9, it really doesn't matter which direction; if it's strong, you're going to get a high coefficient of determination. After we did this b and a thing, we used the linear equation to calculate residuals: we took the x's from the original data, put them in, got the y hats, and calculated the residuals. After that, we used r to calculate the coefficient of determination, or CD, to decide if we wanted to use the linear equation for prediction, because if it was bad, we weren't going to do that. But we decided it was good for prediction at 90%, and we decided to use it. So that was our journey through these xy pairs all the way down to the coefficient of determination. Good job, you made it. So in conclusion, the least squares criterion and calculating the least squares line was the first thing we went over: how to do it and what it all means. Then I reviewed some issues with prediction using the least squares line, because it looks easy, it looks better than sliced bread, but there are some things you have to think about. Finally, we went over the coefficient of determination, so you could figure out how good your least squares line equation was. And I just wanted to point out that CD kind of looks like CDs, you know, like we used to have CDs; they were so pretty and rainbowy. But now all CD means is coefficient of determination. Hello, and welcome back to statistics. It's Monica Wahi, your Laboure College lecturer, and you've made it to chapter seven. I broke chapter seven up into bite-sized pieces, and we're going to start with chapter 7.1, talking about the normal distribution and the empirical rule. So here are your learning objectives for this lecture. At the end of this lecture, you should be able to state two properties of the normal curve, state two differences between Chebyshev intervals and the empirical rule, and explain how to apply the empirical rule to a normal distribution. So, remember distributions? We learned about them a while back, but I'll remind you a little bit about them, and then we're going to talk about properties of the normal distribution, or specifically the normal curve, that shape that comes out of making a histogram of normally distributed data. Then we're going to remember Chebyshev intervals; we're going to talk about what Chebyshev did for us, and what Chebyshev really didn't do for us. And then we're going to move on to the empirical rule, which works very well, better than Chebyshev intervals, when you have normally distributed data.
And then I'm going to show you an example of how to apply the empirical rule to that normally distributed data. So, remember the normal distribution; in fact, remember distributions at all? To get a distribution, and a lot of people have sort of forgotten this by the time we get to chapter seven, but I just wanted to remind you, this is from an earlier lecture: we had a quantitative variable, which was how far patients had been transported, and we determined classes and made a frequency table. Remember that. And then after that, we made a frequency histogram, and that made a shape. As you can see, that shape, which is the distribution, was skewed in this one; see that tail on the right? That's an example of something we cannot apply the empirical rule to, because the empirical rule only applies to normally distributed data. So I had to give you an example of that, and here's my example. When I was an undergraduate in costume design at the University of Minnesota, they made us take a chemistry class in one of those big lecture halls, so I was in a very large class that probably had about 100 people. We were given this really difficult test; it was a 100-point test, and I was used to getting A's. When they were done with the test, the TAs were handing the tests back to everybody so they could see their grades, while the professor was writing on the board, recording the frequency of all the different scores. I remember the TA handed me my test, and it said 73 on it. I'm used to getting 90s, up to 100, and I remember stating out loud, 73, that is an awful score, I can't believe I did so badly. But at the same time, the professor was writing the frequencies on the board, and what I realized is that the top score was in the 80s, and I had the third top score with my 73. That's how hard the test was. And that made me shut up, because I noticed everybody giving me dirty looks, since they had actually scored below me. So I want you to imagine that class. I imagined what the normal distribution would look like for that class, the distribution of the scores, and the reason I thought it would be normal is because we all did badly: nobody got 100, so we were all below 100. So I imagined this curve here for you, and I imagined my class had exactly 100 people, just to make it easy. Of course, the test was difficult, nobody got 100 points, and the mode, the median, and the mean were all near each other, because remember, when you have a normal distribution, the mode, median, and mean are all on top of each other. So we all did pretty badly. I'm going to use this example of the fake chemistry test scores to exemplify these properties of the normal curve; there are five I'm going to talk about. The first is that the curve is bell shaped with the highest point over the mean, and you can see I drew a scribbly little curve and put a little arrow there to show you where the mean of the scores was. The second is that the curve is symmetrical about a vertical line through the mean, so there's a mirror image of the curve on either side. Now, it's not perfect, obviously, but it should be roughly like that, and this is not true of skewed or bimodal distributions or these other things we've been talking about. Okay, and the third property is that the curve approaches the horizontal axis but never touches it.
You don't have to memorize this, but remember the word asymptote, or asymptotically close; that's when a line gets really close to another line but they never touch. It's so romantic. That's a very Bollywood thing to say, by the way. So the curve approaches the horizontal axis but never touches or crosses it. Then there are also these inflection points, or transition points, between cupping upward and cupping downward, and these transition points occur at about the mean plus one standard deviation and about the mean minus one standard deviation. This is a little hard to explain, but imagine you're on a roller coaster and you're going up this normal curve. There's a part where it seems to level out and you're at the top of the curve and you start relaxing; that's one inflection point. And as you go over the top in the roller coaster, past that flat part, and start going down, that's the second inflection point. So one property of this curve is that you have these inflection points, and they occur roughly at one standard deviation above and one standard deviation below the mean. Then finally, the area under the entire curve is one, so think 100%. It would be nice if that were a square or a rectangle, or even a triangle, something we're used to from geometry, but it's not; it's this goofy shape. Still, you need to get it in your head that that shape is worth 1.0 in proportion land, or 100% in percent land. What I mean by that is, let's say we cut that shape in half: each side would have 50%, or 0.5, on it. Then let's cut it a different way, so the part of the curve on the right side of the line is a fourth of the curve, or 25%, even though it's a goofy shape, and the part on the left side is 75%. That's how we're trying to get you to think: you can just declare that all the area under the curve equals one, or 100%, and the reason we declare that is because we're going to cut it up and talk about different percentages of the curve. Now we get to the empirical rule, since we've reviewed this whole curve thing, and I'm going to make you remember Chebyshev, I'm sorry. So let's talk about Chebyshev. Chebyshev helped us get some intervals, and intervals have boundaries, or limits: they have a lower limit and an upper limit, and that's how you know what bounds the interval. So when we were doing Chebyshev intervals, we would figure out a lower limit and an upper limit, and we'd say at least so much percent of the data falls in the interval. When we would choose a lower limit of mu minus two times the standard deviation and an upper limit of mu plus two times the standard deviation, we would say at least 75% of the data were in the interval. So I wanted to show you a demonstration using my fake class. Remember, there were 100 students in the class, and I actually came up with a mu for them: the mu on the test was 65.5, so my 73 was better than the mean, but not much better. The mu for that class was 65.5, and the standard deviation was 14.5. So I calculated this Chebyshev interval for 75% of the data: I took 65.5 minus two times 14.5 and got 36.5, which is a pretty bad grade. And then the upper limit was pretty good: 65.5 plus two times 14.5 equals 94.5.
On a 100-point test, that's a pretty good grade. So if you had 100 data points, or 100 students, at least 75 would have scored between 36.5 and 94.5. And you're probably already realizing, okay, that doesn't really help Monica, who scored 73, and this is a really wide range. If we say at least 75% of people scored in there, you could probably have guessed that without even knowing about Chebyshev intervals. So it didn't really help me narrow down how well this class was doing; if I had had the mu and the standard deviation, I could have calculated this and said, okay, I'm no better off. So Chebyshev's theorem, on the left side of the slide, applies to any distribution; you don't need a normal distribution, you can use it on that skewed distribution too. Also, you'll notice it says at least. So this was at least 75% of the data fell in there; maybe even 100% fell in there. So it doesn't really help us. And as you go out, you start with two standard deviations at 75%; if you go out three, it's at least 88.9%, and at four it's at least 93.8%. You might as well just start at the beginning and say almost 100% of the data falls in this interval, and if you're saying that, it's not very useful. But Chebyshev kind of gets stuck doing that, because his theorem has to apply to any distribution. The empirical rule is much more elite: it only applies to the normal distribution, and you'll see why, if you are lucky enough to have a normal distribution, you want to use the empirical rule instead of Chebyshev. Secondly, the empirical rule says approximately; it doesn't say at least. It's saying, basically, about exactly this, so you can trust it; you don't have this unknown where maybe 100% is in there. Here's what it says, and I'll show you a diagram of it: approximately 68% of the data are in the interval mu plus or minus one standard deviation. So from mu minus one standard deviation all the way up to mu plus one standard deviation, about 68% of the data are in there. And you'll notice that Chebyshev didn't even say anything about one standard deviation, so already we've got something way more useful if we apply the empirical rule. Next, approximately 95% of the data are in the interval mu plus or minus two standard deviations. Now, if we had brought Chebyshev, he'd be saying something about this too: he'd be saying at least 75%, which could be 95%. But here, if we're using the empirical rule, we're relatively sure that it's about 95% between mu plus or minus two standard deviations, which you can like better. Finally, if you get out to three standard deviations, you're kind of running out of data, because approximately 99.7%, almost all of the data, fall in that interval. So as you can see, the empirical rule is going to give you a more specific answer, but again, you can only use it if you have a normal distribution, which we do.
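Here is a small Python sketch contrasting the two rules for this fake class (mu = 65.5, sd = 14.5); the Chebyshev percentages come from the "at least 1 minus 1 over k squared" bound, and the empirical rule percentages are the approximate ones just listed:

```python
# Chebyshev's "at least" bound versus the empirical rule for the fake class.

mu, sd = 65.5, 14.5

for k in (2, 3, 4):
    lower, upper = mu - k * sd, mu + k * sd
    at_least = (1 - 1 / k**2) * 100      # Chebyshev: holds for any distribution
    print(k, lower, upper, round(at_least, 1))
# k=2 -> 36.5 to 94.5, at least 75.0%
# k=3 -> 22.0 to 109.0, at least 88.9%
# k=4 -> 7.5 to 123.5, at least 93.8%

# Empirical rule (normally distributed data only): approximately 68%, 95%, and
# 99.7% of the data fall within 1, 2, and 3 standard deviations of mu.
```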
So let's go look at that. Okay, this is a diagram I made myself, actually, because I thought the other diagrams I saw were not pretty, and this one is very pretty, at least in my mind. Let me unpack this diagram for you, because there's a lot going on. First of all, I want you to notice the shape of it: it's a normal distribution. Then I want you to notice that I put a black line down the middle with a little arrow that says mu. This is where we imagine mu, no matter what your actual numbers are; in our case, it's 65.5, for our test scores. Just imagine whatever your mu is and whatever your standard deviation is; this is where you would put the mu. Then you'll notice that each of these colored sections has a little standard deviation symbol in it, because that's representing that the width of that section is one standard deviation. So if your standard deviation were, say, five, then the green one would be mu plus one standard deviation, meaning the mean plus five, and you'd draw that parallel line there; see the arrow that says mu plus one standard deviation, that's where it would be. I just had to use the symbols, because I don't know how big the standard deviation really would be, or what the mean really would be. But whatever it is, if you go up to mu plus one standard deviation, you'll see that the green area represents 34% of the data. And if you're lucky enough to have exactly 100 people, like I did in my demonstration, that would mean that between mu and mu plus one standard deviation would be 34 people's test scores, so you can really figure that out. Same with the yellow section, only that's mu minus one standard deviation, and 34% of the scores would be between those two numbers. Now, as you get up into the blue, that's between one and two standard deviations above the mu, and you'll see that because the roller coaster is a lot lower to the ground there, that section is really small: it's only 13.5% of the data. The same with the orange one on the other side of the mu, below the mean; that's also only 13.5%. Then you'll notice that between two and three standard deviations there's a little tiny piece on each side, the purple piece and the red piece, and those are each only worth 2.35% of this shape. And I wanted to point out there is some stuff at the ends, in the little black parts beyond three standard deviations on either side: there's 0.15% there, and a lot of times people forget that. One way you can make sure you remember it's there is that if you add up all the percents on the slide, you'll get 100%, because remember, I promised you the whole curve is worth 100%, and this is how we split it up. I also want you to notice there's kind of a cheat: if you just add up the green, blue, purple, and the little black part at the end, you'll get 50%, because that's half the curve. And you'll get the same thing if you add up the yellow, orange, red, and the little black part at the bottom: 50%. So that's how you want to conceptualize this whole empirical rule diagram. But now we'll apply it. I put the empirical rule diagram on the left and our class frequency histogram on the right, and look, I put the mu and the standard deviation there so we'd have them. In the first part of this section, I'm just going to show you how to fill in the numbers under the diagram, and then after we fill in the numbers, I'll talk about how to interpret them. So let's start easy: let's write the mu underneath the symbol for mu, which was 65.5. We just wrote that; simple. Now let's do the plus or minus one standard deviation. You'll see 65.5, which is our mu, minus, and I wrote one times 14.5; I know, I just put the one times there for demonstration purposes.
So you see, we're doing one times the standard deviation. If you subtract that from the mu, you get 51, and I wrote that 51 underneath the mu minus one standard deviation. If you go the opposite way and add on 14.5, you get 80, so I put that up there. So I just labeled those two, and you can kind of guess what we're going to do on the next slide. Surprise: almost the same thing. All we're doing is mu minus two times the standard deviation to get the 36.5, and mu plus two times the standard deviation to get that 94.5. And you're probably already ahead of me on this one: this is where we do 65.5 minus three standard deviations and get 22, and then we add three standard deviations and get 109. And now we're all labeled. So what does this all mean? Well, remember, our n equals 100, just out of convenience. It means that 34% of the scores are between 51 and 65.5; that's the yellow bar. So 34 scores were in there, because I had 100 people in the class. So I'm standing there in that class with my 73, but I know 34 of the people I'm looking at have a score between 51 and 65.5. I also know that another 34%, or another 34 people in this class, because there are 100, have a score between 65.5 and 80, and my 73 is somewhere in there. So already I'm getting the idea that 68 people, or 68% of the scores, are going to be between 51 and 80, and I'm right there with 68% of the class. So I'm going to go through some fake test questions to show you how to come up with the answers. Let's say the question was: what percent of the student scores are between 36.5 and 80? Think about how you would answer that question. See where 36.5 is: it's on the lower limit of the orange part. And see where the 80 is: it's on the upper limit of the green part. So what you would do is add up the percents in between: 13.5 plus 34 plus 34. The answer to what percent of the data are between 36.5 and 80 would be 81.5%. Here's another question: what cut point marks the top 16% of the scores? So already you know you're up in the area where the purple or the blue are. What would make the top 16%? Well, if you add together that 0.15% from the little black part, the 2.35% from the purple, and the 13.5% from the blue, you get 16%. So the cut point is 80: all the scores above 80 constitute the top 16% of the scores. Here's another quiz question: what percent of the scores are below 94.5? We see 94.5 is at the upper limit of the blue section, so you could say, let's just add up everything below it, and that percent of the scores will be below 94.5. So we do that; we add up everything below it. But remember how I said that the yellow, orange, red, and the little black part together equal 50%? If you just want to say, okay, that's 50%, plus the green part, plus the blue part, you can do that, and you get the same answer. So what are the cut points for the middle 68% of the data? I just wanted to show you an example of what happens if they say middle. Well, you're going to have to be centered around the mean, so the middle 68% means 34% above the mean and 34% below the mean, and the cut points would be 51 to 80.
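If you wanted to check those answers with a bit of code, here is a Python sketch that fills in the cut points and adds up the colored bands; the band percentages are just the empirical rule numbers from the diagram:

```python
# Empirical rule cut points for the fake class and the quiz-style answers.

mu, sd = 65.5, 14.5
cuts = {k: mu + k * sd for k in range(-3, 4)}
print(cuts)   # {-3: 22.0, -2: 36.5, -1: 51.0, 0: 65.5, 1: 80.0, 2: 94.5, 3: 109.0}

# approximate percent of the area in each band (the tails beyond +/-3 sd hold 0.15 each)
band = {(-3, -2): 2.35, (-2, -1): 13.5, (-1, 0): 34.0,
        (0, 1): 34.0, (1, 2): 13.5, (2, 3): 2.35}

print(round(band[(-2, -1)] + band[(-1, 0)] + band[(0, 1)], 2))   # 81.5, between 36.5 and 80
print(round(0.15 + band[(2, 3)] + band[(1, 2)], 2))              # 16.0, so the top 16% starts at 80
print(round(100 - band[(2, 3)] - 0.15, 2))                       # 97.5, below 94.5
```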
Okay, now I'm going to ask a similar question, but I'm going to use different words. What is the probability that, if I select one student from this class, that student will have a score less than 80? Notice I'm using totally different terminology; I'm saying what is the probability. Yet the actual answer is what you would probably guess, which is that you add up all the percents below 80. The point of giving you these quiz questions is that percent and probability mean the same thing when you talk. Either I say, what percent of the data are below a score of 80, or I say, what is the probability that if I select one student, that student scored less than 80; that is actually the same question. So for the answer, I use that 50% trick: 50%, which is the whole bottom half of the curve, plus 34%, gets us to 84%. So the probability that if I select one student, that student will have a score less than 80 is 84%, and that's the same as saying 84% of the data are below 80. Here's another probability question: what is the probability I will select a student with a score between 36.5 and 51? Well, that's the same question as what percent of the data are between 36.5 and 51, and you know the answer to that: 13.5%, the orange part. Even if I say, what is the probability I will select a student with a score between 36.5 and 51, it's 13.5%. So let's say we were at a casino and we were betting. I say, okay, there are 100 students, I'm going to grab one score out, and I'm betting a lot of money that I'm going to grab somebody between 36.5 and 51. You'd probably say, you don't want to bet on that, because you only have a 13.5% probability of selecting one; if you're going to bet on something, you probably want to bet on the yellow section or the green section, because those have a higher probability. So that's how you would think about probability and percent: even though they're kind of the same thing, I just wanted to show you how the questions are worded differently but mean the same thing. So now I want you to just sit back and think for a second. Think about what would happen in a different class taking the same hard test, meaning nobody's getting 100%, where the mu was the same, meaning everybody's doing badly, but the standard deviation was larger than 14.5. What would that do to the intervals? Let's just stare at this for a second. Say the mu was still 65.5, but the standard deviation was 30; there was a lot of variation in the class. That would already mean that where the 80 is right now would actually be 95.5, and where that 51 is now, with a standard deviation of 30, would actually be 35.5. That would be a way bigger interval. And the class I was in in chemistry was an undergraduate class, I was in costume design, and it was a whole bunch of different kinds of people taking chemistry, and that's probably why we had kind of a big standard deviation of 14.5. Even though I made that number up, in reality we probably did have a big standard deviation. I know the chemical engineering department had chemistry classes just for chemical engineering majors, and I'll tell you, their standard deviation was probably a lot smaller, because they were probably more alike and got more similar grades. But with this diverse class, we probably had a pretty big standard deviation.
So that gets to my last question: what if the standard deviation was actually smaller than 14.5? If we were in the chemical engineering class taking chemistry, and they had a smaller standard deviation, maybe they'd have the same mean of 65.5, but let's say their standard deviation was five. Then where the 80 is now would be 70.5, and where the 51 is would be 60.5, and we'd have way more confidence about where we knew the scores fell. As I stood there with my 73, I'd be saying, oh, my 73 is pretty high, if everybody has a small standard deviation. Whereas it's not that high here, because we have kind of a big standard deviation; it's just in the first standard deviation above the mean, the green part. The reason I want you to think about that is that this is why the shape goes by mu and standard deviation: how big the standard deviation is really matters for how big each of those colored areas is. So I just wanted to remind you that percent, area, and probability are all related. The percents literally refer to the percent of the area of the shape, and you imagine the whole thing is 100%. So, just to remind you, the orange part is 13.5% of the area of the whole shape, but it is also the probability that an x, like a student, falls between mu minus one standard deviation and mu minus two standard deviations; if I select one x from this group, 13.5% is the probability that I will get an x in that range. So it means both things. So, in conclusion, the empirical rule helps establish intervals that apply to normally distributed data, and it's more useful than Chebyshev because it's more specific: these intervals have a certain percentage of the data points in them, and they also refer to the probability of selecting an x in that interval. And these intervals depend on the mean and the standard deviation of the data distribution, so if those change, then exactly where the numbers fall on those intervals changes. Well, I hope you enjoyed my explanation of the empirical rule, and now you can practice doing it yourself at home. Good morning, good day, and good afternoon. This is Monica Wahi, your Laboure College lecturer, here moving you through chapters 7.2 and 7.3, z scores and probabilities. I decided to merge these two chapters together because I thought they actually kind of belong together; I didn't really understand why they were separated. At the end of this lecture, you should be able to explain how to convert an x to a z score, show how to look up a z score in a Z table, explain how to find the probability of an x falling between two values on a normal distribution, describe how to use the Z table to look up the z corresponding to a percentage, and describe how to use the formula to calculate x from a z score. Well, that sounds like a lot, but you'll understand it by the end of this lecture. First, I'm going to go over what a z score is and what the standard normal distribution is. Then I'm going to talk about z score probabilities and what those are. I'm going to show you how to use the Z table to answer some harder questions beyond the ones I cover in the z score probabilities section, and then I'm going to show you how to use a slightly different formula to calculate x from z. Finally, I'm going to remind you of some tips and tricks for using z scores and probabilities correctly. So, all this talk about z scores. What is a z score?
And what is the standard normal distribution? Well, let's take a look at this very pretty thing I made. You may recognize it from the last lecture; it was my little empirical rule diagram. So remember the empirical rule, and remember how it required a normal distribution? Well, that worked well for the cut points available, like mu, mu plus or minus one standard deviation, and mu plus or minus two standard deviations. If we asked questions that were right on those cut points, we had good answers. But what about in between those cut points? So I want you to notice, in this empirical rule diagram, these numbers at the bottom; I just circled them: negative three, negative two, negative one, and then mu doesn't have a number, so pretend there's a zero there, and then one, two, and three. That is the standard normal distribution, and it is also called z. So those numbers are z scores. See the green area: zero is the z score on its lower limit, and one is the z score at its upper limit. So you can see that for this whole curve, the standard normal distribution, the mean of the whole curve is zero and the standard deviation of the whole curve is one, and that is what a z score is measured on. I just want you to notice the concept of a standard. I'm in the US, and in the US we use the US dollar, but one of the things I've noticed is that a lot of countries see it as a standard, so they'll map their currency to the US dollar: maybe the euro maps its value to the US dollar, maybe the Egyptian pound also maps its value to the US dollar, and once they do that, it's a lot easier to compare them. That's the main reason for the standard normal distribution: it helps you compare x's from different distributions, different normal distributions that have different means and different standard deviations from each other. It helps you map them to this standard normal distribution, the standard, so you can compare them. So let's talk about z scores. Every value on a normal distribution, so every x, can be converted to a z score, just like I was saying you can convert any currency to dollars; there's a formula for that. But you have to know how to use the formula and what goes into it. First, you need the x that you want to convert to a z score, so you need to pick one. Then you need to know the mu of your normal distribution, and the standard deviation of your distribution. And here are the two formulas that are used: the one I was just talking about, on the left, is the formula for calculating the z score, and we'll go over the one on the right later in this lecture. So remember, in the last lecture I was talking about a class that had 100 people in it that all took a really hard test. It was so hard nobody got 100%, and it was a 100-point test, so nobody got 100; the top score was in the 90s. And remember, in the upper right there is the mu; the mu was 65.5, which is a pretty bad score on a 100-point test, and the standard deviation was 14.5. So I'm going to give you an example of calculating a z score on that particular distribution. Let's say you've got a smart friend, and that smart friend got a 90 in the face of all this. Well, let's calculate the z score for 90 on this particular distribution.
Okay, so here's what we're going to do. First, we're going to remind ourselves what our empirical rule stuff looked like; you don't have to do this in real life, I'm just doing it for demonstration purposes. Remember, mu plus one standard deviation was 80, and mu plus two standard deviations was 94.5. So already you know that whatever your answer for 90 is going to be, it's going to be between one and two; we just don't know exactly what. I'm showing you this to relate it to the last lecture, but you don't have to do this in real life when you calculate. So we know the z we're going to calculate is going to be somewhere between one and two, and as you'll see on the slide, I labeled on the z curve where z equals zero, which is the mu, 65.5. So we anticipate we're going to get a z score that's somewhere between one and two. And you'll see in blue that I listed the ingredients: we have the smart friend's score, 90, we have the mu, 65.5, and we have the standard deviation, 14.5, and then we have our z formula. So let's do it. x minus mu is going to be 90, which is our x, minus 65.5; you do that out first, and then you divide it by 14.5, and look, our z score is 1.69. That's exactly where we thought it would be, somewhere between one and two. So as you can see, you can take any x and convert it to a z. Here, we'll do another example, only this friend is not so smart. This friend actually got a score that was kind of low; it was so low it was below the mu of 65.5. This poor friend only got a 50. So let's try it again; let's do the z score for 50. Again, this is just for demonstration purposes, but remember, in empirical rule land, 51 was the mu minus one standard deviation, so we're going to expect that our x of 50 is going to land at a z somewhere between negative one and negative two, just past negative one. And here we are: we calculate the z score, 50 minus 65.5, divided by 14.5, and we get negative 1.07. The reason it's negative is that, as you can see, it's to the left of the mu, so the z score is going to be negative. And as you can see, it's exactly where we thought it would be: a little bit to the left of negative one.
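Here is a tiny Python sketch of that conversion, using the class's mu and standard deviation from the lecture:

```python
# Converting an x to a z score: z = (x - mu) / sd.

def z_score(x, mu=65.5, sd=14.5):
    return (x - mu) / sd

print(round(z_score(90), 2))   #  1.69, the smart friend
print(round(z_score(50), 2))   # -1.07, the not so smart friend
```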
So now we're going to get into something that's a little bit harder, which is the z score probability. You're feeling pretty good about the z score, but now let's talk about the probabilities. Remember the probability from the empirical rule; this is just old empirical rule stuff. I gave you a question at the end of that lecture: what is the probability I will select a student with a score between 36.5 and 51? And remember, the answer was this orange area, which is 13.5%. But what if you have z scores like 1.69, the smart friend, and negative 1.07, the not so smart friend? In other words, you have x's of 90 and 50, which are not on the empirical rule cut points. How do you figure out the percent, or the probability? That's the next step with your z scores. Okay, so now let's ask this question: what is the probability that students scored above the smart friend? We could also ask for below, but I'm choosing to ask for above this time. In other words, what is the area under the curve from z equals 1.69 all the way up? See, it starts a little way into that blue edge. We wish we knew the area for everything up from a z of 1.69, through the purple area, through the little black piece at the top. We only know from the empirical rule what's on the cut points, like one and two, but we don't know these in-between things. So how do we figure that out? Well, here's another problem: what is the probability that students scored below the not so smart friend? In that case, see the diagram, we'd have to figure out the part of the orange that friend cuts off, plus the red, plus the little black part at the bottom; what is the percent, or the proportion, of the curve that represents that? That's what we're getting into now, and what we do is look these up in a Z table. What the Z table is, basically, is that they figured out every single z score you could have between negative 3.49, and I'll go into why negative 3.49, and positive 3.49, going by every hundredth. They figured out, for every single one of those z scores, what the probability is, and they actually fit that all on a table. So now I'm going to show you how to use that table to look up the probabilities. And by the way, if you look up a probability that happens to be on one of those empirical rule cut points, you'll get what the empirical rule says; it's just that the empirical rule is nice because you don't have to pull out the table. But if you have something that's not on the empirical rule cut points, get out your Z table. So how do you use the Z table? Well, the first thing is to figure out what area you want. We're going to start with the not so smart friend, because that's actually a little easier to demonstrate. So, what is the probability that students scored below the not so smart friend? Which is a secret way of saying: what is the area under the curve that makes up most of that orange part, all of the red, and the little black part at the bottom? What is that proportion? For areas to the left of a specified z value, you're supposed to use the table directly. So I'm going to show you how to use the table to look up negative 1.07, and then I'll come back and tell you what they mean by use it directly. Hi there, so here we are at the Z table. If you have the book, you can look it up in the appendix, on page eight, but there are also a lot of Z tables on the internet; sometimes they're arranged a little differently, so I'm using this one because it's from the book. So remember, the z we're looking up is negative 1.07. And remember, I said they had to somehow calculate all the different probabilities for every single z between negative 3.49 and positive 3.49, every hundredth; how did they fit it all on the table? Well, this is what they did. See, this is the beginning of the Z table. Remember I said negative 3.49? Well, this row is negative 3.4, and then to find negative 3.49, you have to imagine the nine: it's going to be the last column over here, see this nine. So, just for pretend, if we had a z score of negative 2.58, I'd go to the negative 2.5 row and then over to the eight column, right here. Or if I had one that was negative 2.10, or just plain negative 2.1, then I'd go over just one, to the zero column. And see these little tiny numbers in here? Those are all probabilities.
In fact, let's go look up our probability, which is for negative 1.07. So we go down here; here we are at negative 1.0, and then we have to go over to the seven column, so where's the seven? Here it is. So for negative 1.07, the entry is 0.1423, otherwise known as 14.23%. That's actually what you get out of the Z table: that's the probability, that's the percent you're looking for. And just in case you're wondering, these aren't all negative: the first page is the negative z scores, and the second page is all the positive z scores, all the way up to 3.49. But what I want you to hold in your head is what we just looked up, negative 1.07, which is 0.1423. Hold that thought. Okay, here we are back at our slides. Look at that green part where it says: for areas to the left of a specified z value, which is what we're doing with the not so smart friend, use the table entry directly. So here was our table entry: 0.1423. We just use that number we found, and we say the probability is 14.23%, which kind of makes logical sense knowing the empirical rule. Now I'm going to show you an example of why I was saying use it directly. In this next example, we're going to look at the smart friend's probability; in fact, we're going to ask what is the probability that students scored above the smart friend, and the smart friend's z equals 1.69. So I'm going to demonstrate now: for areas to the right of a specified z value, you either look the z up in the table and then subtract the result from one, or you use the opposite z, which in this case would be negative 1.69, and you'll get the same answer whether you do it the first way or the second way. I'm going to demonstrate both. First, I'm going to demonstrate what happens when you look up the probability in the table for that z and then subtract that probability from one. So let's go look up z equals 1.69. All right, here we are back at our Z table, only this time we're looking up a positive z, so we don't want the first page, we want the second one. Remember, we're looking up z equals 1.69, so we look under here for 1.6, and that's right here, and now we have to go over to the nine column, and that's going to be 0.9545. So hold that thought: 0.9545. Okay, we're back with the probability we looked up in the Z table. Now remember, we were supposed to look it up in the table and subtract the result from one, so that's what we're going to do now. We found 0.9545 in the table, so we take one minus 0.9545, and we get 0.0455, or 4.55%, this little tiny piece. That kind of makes sense, because it's right at the top of the distribution: just a little piece of the blue, the purple, and then the little black at the top. And what you want to imagine is that the 0.9545, which is like 95.45%, is the whole piece below z equals 1.69: most of the blue, the green, the yellow, the orange, the red, and the little black at the bottom, that's all in the 0.9545. Okay, so again, we were looking up the area to the right of a specified z value, and I showed you the first way of doing it. There's another way, where you just use the opposite z from the get-go. So now we're going to use the opposite z; we're going to look up negative 1.69. All right, here we are back at the Z table.
Only this time we're looking up negative 1.69. So negative 1.6 is the first thing we need to find in this column; here it is. And we know .09 is the last column, I'm learning that, so we go over here, and that looks familiar: 0.0455. Okay, hold that thought. All right, we're back. So as you can see, if you look up 1.69 directly in the table and subtract that probability from one, which is what we did last, you get the same answer we got now: 0.0455, or 4.55%. So it is a bit more efficient to just use the opposite z if you're looking for areas to the right of the specified z value. But I always say, when you're done looking it up, compare it to the picture. And I always say draw a picture, too. I don't mind if you have normal curves drawn all over your homework, or all over the wall, or maybe a whiteboard, that's probably more efficient. It's best to draw it out, label where your z and your x are, and then just look at it. Because we know that the little piece above z equals 1.69 is not 95% of the curve; it just isn't, that's over 50%, and we can tell that little tiny piece is under 50%. So if you accidentally do it the first way and forget to subtract from one, maybe when you check it against your normal curve drawing you'll realize, oh, I made a mistake. So even though there are two different ways to find the probability to the right of a z value, whichever way you use, do a reality check against the drawing you make, just to make sure you got the right piece, because there are only two pieces: a big piece and a little piece of the curve. We got 4.55%, we know that's the little piece, and we know from our drawing that we were looking for the little piece. That's how you do your reality check. Okay, you thought there weren't any harder questions? Well, here are some harder questions. This is a little more on probabilities and the Z table. Here's a question we haven't handled yet: what if you were looking at a probability between two scores, such as the probability that students will score between 50 and 90, somewhere in the middle? Note that when you have a "between" one, you actually have two x's, and we'll label them x one and x two. The not-so-smart friend is going to be x one, and the smarter friend is going to be x two, just to keep the x's straight. The next step is to calculate z one and z two. And I'm kind of cheating, because we already did these: we already knew the z one for the not-so-smart friend was negative 1.07, and the z two for the smarter friend was 1.69, so I just put them on the diagram. Okay, here's the beginning of the strategy; I'll explain the strategy and then do it. For z one, you find the probability to the left of that z, the little piece on the left, and remember you can take the probability directly from the Z table; that's what "direct" means, you just copy it out of the table. Then for z two, you find the probability to the right of, or above, that z, the little piece there, using one of the two methods I showed you, which we did together.
And then finally, imagine the whole curve: you're subtracting the piece at the bottom, the z one probability, and you're subtracting the piece at the top. You're trimming off those two pieces to get the "between" probability. So that's the strategy: you find the probability of each of the little pieces on the sides, you subtract both of those from one, and that traps whatever's left in the middle. So I'll demonstrate. Remember, for z one, the probability to the left was 0.1423; we did that together. And then we used both of those methods, and they got the same answer, to find the probability to the right of z two, which was 0.0455. So we've got the little piece at the top and the little piece at the bottom, and now we take one minus the piece at the bottom minus the piece at the top, and the total is 0.8122, or 81.22%. Which kind of makes sense; that's a big piece in the middle, so it isn't surprising that it's about 80% of the curve. So that's how you do a "between" one. Here's another question I haven't really handled: what if you're looking at a probability that's more than 50%, such as the probability that students will score greater than 50, the big side? Well, actually, you just do what you normally would do: for areas to the right of the specified z value, either look it up in the table and subtract the result from one, or use the opposite z, which in this case would be positive 1.07. If we did method one, we'd go one minus 0.1423, which we already looked up, and get 0.8577. If we used method two, we'd take the z of positive 1.07, not negative 1.07, go look it up in the Z table, and get 0.8577 again, or 85.77%. So this isn't actually a harder question; I just wanted to show you how it works when you're getting a bigger piece, bigger than 50% of the distribution. And here's another similar example, where we're looking at the probability that students will score less than 90. That's easy, right? For areas to the left of the specified z value, just use the table directly. When we looked up z equals 1.69, we got 0.9545, so that's the answer: 95.45% of the curve is below z equals 1.69, or below x equals 90. As I mentioned before, but I'll mention again: for z values to the left of negative 3.49, you're supposed to treat the area below them as p equals zero. I showed you what negative 3.49 looks like in the Z table; its entry is 0.0002, and there's not much smaller than that. So if you actually calculate a z and you get something like negative four, just say the p is zero. The second thing is, for z values to the right of positive 3.49, treat the area below them as p equals one, or 100%. As you can imagine, 3.49 is at the top of the curve, so if you calculate a z and get something like five, you can just assume the area below it is one, or 100%.
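Before moving on, here is a quick code check of the lookups just worked through: the area to the left of a z, the area to the right, and the area between two z's. This is a minimal sketch assuming Python's scipy is available; the lecture does the same arithmetic with the printed table.

```python
# Checking the lookups above with the standard normal CDF.
# Minimal sketch assuming scipy; the table values are these numbers rounded.
from scipy.stats import norm

# Area to the LEFT of a z: use the table entry (the CDF) directly.
p_below = norm.cdf(-1.07)             # ~0.1423, below the not-so-smart friend

# Area to the RIGHT of a z: 1 - CDF, or equivalently the CDF of the opposite z.
p_above = 1 - norm.cdf(1.69)          # ~0.0455, above the smart friend
p_above_alt = norm.cdf(-1.69)         # same answer via the opposite-z shortcut

# Area BETWEEN the two z's: trim off both outside pieces.
p_between = 1 - p_below - p_above     # ~0.8122

print(round(p_below, 4), round(p_above, 4), round(p_between, 4))
```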
Okay, so we've gone through how to calculate z, we've talked about looking up probabilities in the Z table, and we've even talked about manipulating those probabilities to get certain probabilities. But we haven't talked about calculating x when z is given. Sometimes you're actually given a z, and you have to calculate the x back from the z. In fact, sometimes it's even harder: sometimes you're given a probability, and the probability is not as easy. But you can use the probability: remember those little percents in the middle of the table? You can go find the probability in the middle of the table, look up the z that keys to it, and then put that z into this equation. So I'm going to give you examples of some questions you might see, like on a homework or a test, probably not in real life, where you need to calculate x using the formula in the red circle. Let's say I was just bored and wondering: what is the test score on this distribution that is at z equals 1.5? We never asked that question before. So say, just out of curiosity, I wanted to know what the test score would be of a student who was at z equals 1.5. What I'd do is take 1.5 times 14.5, because that's what the formula says, z times the standard deviation, and I do that first because of order of operations. Then I add the mu, which is 65.5, and I get about 87.3. So the student who got an 87.3 got a score that's at z equals 1.5. Now, as you can probably imagine, people don't go around asking, "I wonder what that person's score is at z equals negative 2.3," or whatever. They don't usually phrase it like that. Usually you see a question more like this: what is the score that marks the top 7% of scores? And that's a secret way of saying we're looking for the z at p equals 0.0700. It's like we turn that 7% backwards into a probability and say we're actually looking for the z at p equals 0.0700. So how do you do that? I'm going to show you. Okay, we're on the hunt for probability 0.0700. Let's start at the top of the table. You'll see we're digging around in the middle of the table, and up here we're nowhere near the ballpark, because we're looking for 0.0700. So let's scroll down. Now we're in the 0.04 neighborhood; here's 0.06, okay, we're getting close. Here we have 0.0708, and that's 0.0008 more than we want. And next door we have 0.0694, which is only 0.0006 less than we want, because if it had 0.0006 more it would be 0.0700. So 0.0694 is technically closer: the other one is off by 0.0008, and this one is only off by 0.0006. So we're going to choose 0.0694 as the probability of record for the top 7%. Only we're not just going to choose it; we're going to figure out what z is at that entry. So we map back over here to negative 1.4, and then we go all the way up to the column, which is .08. So it's negative 1.48. Hold that thought. Okay, we started out looking for the z at p equals 0.0700, but the closest we got was 0.0694, and that mapped to z equals negative 1.48. Now, what I want you to notice is that negative 1.48 is actually on the left side of mu; that is the z score at the bottom 7% of the scores. Since we want the top 7%, we're going to use the positive version of that z, the opposite z, 1.48. Now we plug it into the equation: 1.48 times 14.5, which is the standard deviation, plus 65.5, equals about 87. So 87 is the score that marks the top 7% of the scores.
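For anyone following along in code, the same two calculations, a score at a given z and the score marking the top 7%, can be checked with an inverse normal lookup. A minimal sketch, assuming scipy, with mu = 65.5 and sigma = 14.5 from the example:

```python
# Turning a z (or a percentile) back into a score x, as in the examples above.
# Minimal sketch assuming scipy; mu and sigma are the class values from the slides.
from scipy.stats import norm

mu, sigma = 65.5, 14.5

# Score at z = 1.5: x = z * sigma + mu
x_at_z = 1.5 * sigma + mu        # 87.25, about 87.3

# Score marking the top 7%: the z with area 1 - 0.07 = 0.93 to its LEFT,
# i.e. the sign-flipped version of the z at 0.07.
z_top7 = norm.ppf(0.93)          # ~1.476 (the table rounds to 1.48)
x_top7 = z_top7 * sigma + mu     # ~86.9, about 87

print(round(x_at_z, 1), round(z_top7, 2), round(x_top7, 1))
```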
I'm going to do another exercise for you, this time for the bottom 3% of the scores, because this is often challenging for students, so I'll give you a second demonstration. As you can imagine, we're going on the hunt now for the z at p equals 0.0300. So let's go over to the Z table. All right, we're getting a little good at this, right? We're digging around in the middle, looking for 0.0300. Starting at the top, we're in the 0.00-something neighborhood; here's 0.01-something, 0.02-something, okay, we're getting close. And here: 0.0301. Could you ask for anything closer? Practically perfect. So that's what we're going to use; our z is the z at 0.0301. Let's look up that z: it's in the negative 1.8 row and the .08 column, so it's negative 1.88. Hold that thought. All right, we were on the hunt for p equals 0.0300 and we didn't find exactly that, but we did find p equals 0.0301 in the table, and that mapped back to z equals negative 1.88. Now we go back to the question, and we see that we want the bottom 3%, so we keep the negative. If I'd asked about the top 3%, we'd lose the negative and use positive 1.88 in the equation, but since we want the bottom 3%, we keep the negative. Okay, so now let's do the equation: x equals, in parentheses, negative 1.88 times 14.5, which is our standard deviation, then plus our mu, which is 65.5, and the score we get is about 38.2. So 38.2 is the score that marks the bottom 3% of scores, and just be happy your score is not in there. Okay, now here's another challenging question. A question on a test, probably not in real life, says: what scores mark the middle 20% of the data? I put little arrows on the slide just to point out that when they say "middle," they mean hugging the mu; it's assuming there's going to be 10% on the right side of mu and 10% on the left side of mu. How you start is you figure out the z score at (one minus 0.20) divided by two: one minus 0.2 is 0.8, and 0.8 divided by two is 0.4. So we get 0.4, and we go find the z score at a probability of 0.4, which you're good at now using the Z table. So I looked around, digging in the middle of the Z table, and I found 0.4013, and that mapped back to z equals negative 0.25. That's what goes on for the lower limit, and then z equals positive 0.25, the positive version, goes on the other side. Once you've figured out both z's, the one on the left and the one on the right, you just put them through the equation: for the left side we use the negative z, and for the right side we use the positive z, and that's how we get our limits. So what scores mark the middle 20% of the data? 61.9 and 69.1. Isn't it funny how that worked out? I totally didn't do that on purpose; it just worked out that way. All right, I can't believe you made it through all this. I'll bet your brain is ready to explode.
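Here's one more quick code check before the review, covering the bottom-3% and middle-20% examples just worked out. A minimal sketch assuming scipy:

```python
# The bottom-3% and middle-20% examples, done with an inverse lookup (ppf) instead
# of digging through the middle of the table. Minimal sketch assuming scipy.
from scipy.stats import norm

mu, sigma = 65.5, 14.5

# Bottom 3%: keep the negative z, since we want the low end.
z_bottom3 = norm.ppf(0.03)           # ~-1.88
x_bottom3 = z_bottom3 * sigma + mu   # ~38.2

# Middle 20%: 10% on each side of mu, so the lower cut point has
# (1 - 0.20) / 2 = 0.40 of the curve below it; the upper one is its mirror image.
z_low = norm.ppf(0.40)               # ~-0.25
x_low = z_low * sigma + mu           # ~61.8 (table rounding gives 61.9)
x_high = -z_low * sigma + mu         # ~69.2 (table rounding gives 69.1)

print(round(x_bottom3, 1), round(x_low, 1), round(x_high, 1))
```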
So now is a good time for a little review, just to help you come down a bit from this really intense lecture. First, I'm going to do a little z score quiz, game-show style. So if you ever get a question on a test and you're thinking, oh my gosh, where is x? Where's x? Well, if you can't find x, it's usually in the question. Usually the way these questions go is somebody, maybe me, puts a mu and a standard deviation at the top of the question, and then there are maybe five questions that pertain to that mu and that standard deviation, but they ask about different x's. When I taught this class in person, people would come running up to me in the middle of a test, which you probably shouldn't do, and say, "Where's the x? You gave me these pieces of the equation, but I can't find the x." And I'd say, look in the question, because I didn't want to give it away, and they'd run back to their seats and find it. So if you're panicking and wondering where x is, look in the question; it's usually in the question. Okay, so let's say you find an x. What do you do with an x? Usually what you have to do is calculate a z score. Remember, if you've got an x, you probably also have a mu and a standard deviation, so you can calculate a z score from that. So if you're panicking on a test and you have an x, a mu, and a standard deviation, just for fun calculate a z score and see if it gets you anywhere. Okay, let's say you have a z score. What do you do with a z score? You look it up. If you're going in this direction, where you started with an x and got a z, you've got to go to the Z table with it. So if you've done all this work, calculated a z score, and you're wondering what your next step is: go look in the Z table. Well, what if the question asks for an x? Remember, we have a whole formula for that. So if there's no x anywhere and it's asking for an x, use the other formula, the x formula. And what if the question gives you a p? I said p for probability, but it could be a percentage, like the top 7% and the bottom 3% from before. If they give you a percent, just start digging around in the middle of the Z table looking for that percent, because once you start digging around, you'll realize it maps back to a z, and then you can get into the groove of using the x formula, and you'll probably get yourself unstuck. So here are some final tips and tricks for getting z scores and probabilities right. I've said this one before: draw a picture. What do I mean by that? Graph out the question: draw the curve, draw the line for mu, which goes in the middle, and put the x where it goes, above or below mu. Just start with that; it doesn't have to be to scale, you mainly want those elements in there. If there's one x, shade the part of the curve you want, either above or below the x, just color it in, so you get an idea of whether you want the big part, the one that's greater than 50%, or the little part, the one that's less than 50%. If there are two x's, shade in the area you want, which is usually in between them. If it's a calculate-the-x question, mark where the z or the p is. So if it was the top 7%, you could shade in the top little part of the curve; if it was the bottom 3%, you could shade in the bottom little part of the curve.
So make this picture, and do it at the beginning. Then, note that x is usually in the question. If you can't find x, and you're trying to do the z formula, and you're saying, okay, I'm trying to make a z score, that's what it asks for, or I'm trying to find a probability, that's what it asks for: look in the question, and you'll probably find the x in there. A big problem I see is people mistaking little z's for p's. Now, obviously, if you've got a z that's negative, you won't make that mistake, because a p, a probability, can't be negative, even if it's something like negative 0.25. And if the z is bigger than one, you won't make that mistake either; if you see z equals 2.5, obviously that's not a probability. But when you have a little baby z score that's between zero and one, like 0.023, it looks a lot like a p, but it's still a z. A lot of times people get a little lazy, they hate using the Z table, and when they calculate a z score that's really little, they don't look it up. Don't be fooled: you still have to look it up. If you calculate a z and get a little baby z like that, it's still a z; go look it up. Then finally, remember how step one was draw a picture, and I went on and on about that? Step 99, or the last step before you're done with the question, is to check your logic against that picture. If you shaded a big part of your picture, your probability should be bigger than 0.5, or 50%. If you shaded a little tiny part of your picture and you're getting something like 0.95, you know that's wrong. So please check your logic against the picture before you say you're done with your question. Okay. So you made it through this long lecture about z and about probabilities. I gave you an introduction to the standard normal curve and to those two z score formulas, I showed you how to calculate z scores and how to look up probabilities, and I also showed you at the end how to calculate x if you're given a z score or a probability. And all I want to say is, unfortunately, none of those pretend students on that distribution got 100%. That's not the case in our class; a lot of times people get 100% on the quizzes. That's why I can't use your grades as examples. Okay, so good luck on the quiz. Well, hello, it's time for statistics. It's Monica Wahi, your college lecturer, back with chapters 7.4 and 7.5, sampling distributions and the central limit theorem. At the end of this lecture, you should be able to: state the new statistical notation for parameters and statistics, including for two measures of variation; name one type of inference and describe it; explain the difference between a frequency distribution and a sampling distribution; describe the central limit theorem in either words or formulas; and describe how to calculate the standard error. So here's your introduction to this lecture. As you can see, I put 7.4 and 7.5 together again; they felt like a natural fit. First, we're going to review, and maybe overview, parameters, statistics, and inferences, because that will ease us into the next part, where we start talking about the sampling distribution, which is the new concept here. And then we'll go on to talk about the central limit theorem.
And finally, I'll do a little demonstration of how to find probabilities regarding x bar. If you're not really sure what that means, don't worry; you should understand it by the end of this lecture. All right, here's the first part: parameters, statistics, and inferences. This is the review and overview I promised you. If you remember from a long time ago, a statistic is a numerical measure describing a sample, and a parameter is a numerical measure describing a population. Remember: s-s, sample statistic; p-p, population parameter. You probably remember that. Okay, so we have different ways of notating these. If you look under "measure," you'll see the mean. If it's a statistic, it's x bar; I sometimes just write "x bar" on the slide because it's hard to make that little line sit above the x, so I'm lazy and spell it out. Under "parameter," it's that mu symbol; it's pronounced "mew," but it looks like the thing on the slide. All right, the next two are variance and standard deviation; remember how they're friends. The statistic version of variance is s with the little two up there, the exponent, because standard deviation to the second power is variance, and the square root of variance is the standard deviation. So they use s for the standard deviation statistic and s squared for the variance. For the parameter, it's that lowercase sigma symbol: sigma squared when it's variance, and just sigma, without the exponent, when it's the regular parameter of standard deviation. You're used to seeing these on the slides; this is just review. The book also mentions that the statistic for a proportion is p hat and the parameter is p, but I don't really go into that; I just wanted to give it a little shout-out. Okay, let's think about the word inference, like infer: if somebody implies something, maybe you'll infer it. Like, he implied it would be hard if I came over late that night, so I inferred that I shouldn't come over late. You may have heard the saying "where there's smoke, there's fire." You see this on the slide: there's a lot of smoke. Is there fire, though? Is that smoke coming from fire? It probably could be, but there's some outside chance it's not what we think it is. Maybe, if you've ever used a fire extinguisher, you know they make all this foam and cloud come out; maybe it's that. Or if you've ever seen dry ice, it makes a bunch of what looks like smoke; maybe it's not fire. So "where there's smoke, there's fire" is an inference. We can't see whether it's actually fire; we think it's likely to be fire, but we're not sure. Inference is something you do in statistics, because you use probability to make these inferences; you can't see the fire, you can only see the smoke, and you're not sure. There are three different kinds. The first kind is estimation, where we estimate the value of a parameter using a sample. The sample is kind of like the smoke, and the parameter is the fire we can't see, so we estimate. We'll talk about that more in chapter eight. A second type of inference we do is testing, where we do a test to help us make a decision about a population parameter.
In other words, we don't know the parameter, but we want to make a decision about it, so we do a statistical test. We're not going to get into that; that's chapter nine. Finally, there's regression, where we make predictions or forecasts about a statistic; that's the third kind of inference, and we actually already did it in chapter 4.2. The reason I bring all this up is that estimation, which is in chapter eight, and testing, which is in chapter nine (we're not covering chapter nine in this class, but if we were, you'd need this too), both require you to grasp what I'm going to talk about in this lecture: sampling distributions and the central limit theorem. You need those concepts in order to do the two things on the slide with the box around them, estimation and testing. That's why I'm bringing this up now. Okay, so now we're going to move on to the sampling distribution and how it's different from a frequency distribution. All right, let's remind ourselves what a frequency distribution actually is. Remember from a long time ago: you'd have a quantitative variable, you'd make a frequency table, and then you'd use that to graph a histogram. Down there I made an example of a frequency histogram that shows a normal distribution. That's what you would do: draw it, see the shape, and figure out the distribution of that quantitative variable, that x. Each bar is made up of x's; the middle one has almost 30 x's in that frequency. Okay, now we're going to talk about a sampling distribution, which is a little more complicated. In a sampling distribution, you start out with a population; that's the first thing, you're dealing with a population. Then you pick an n of a certain size, a number that's going to be your sample size. Then you take as many samples of that size as possible from the population, and you make an x bar from each of the samples. So there are a ton of samples, and I'll show you a little demonstration so you can wrap your mind around how many different samples there can be, but each one is going to have an x bar. And then you make a histogram of all those x bars. So, like I said, I'm going to show you what I'm talking about. Imagine this is a population of people, and we're going to talk about BMI, or body mass index, just so you can wrap your mind around it. You start with this population and decide on an n. How about five? Five is good. Now the deal is, I'm trying to take as many samples of size five as possible from all these people on the slide. Here's our first sample, and we got an x bar for BMI of 23 from these five people. Well, let's try these five people: look, we double-dipped with that first person, but we get this x bar of 21. And we can keep going; there are going to be a ton of these, a ton of different ones, but it's finite. At the end of the day, there are only so many groups of five I can get out of this population on the slide, and each group of five is going to have its own x bar. So I could write down every single x bar I get for every single group of five I can make out of this, and then I can make a histogram of all the x bars.
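If you'd like to see this idea without being up all night taking samples, a short simulation makes the same point. This is a minimal sketch assuming Python's numpy, with an invented BMI population (the lecture's slide uses pictures, not code); instead of enumerating every possible sample of five, it draws a large number of random ones:

```python
# A simulation of the sampling-distribution idea: draw lots of samples of size 5
# from a made-up BMI population and collect the x bars. Enumerating every possible
# sample is impractical, but many random samples paints the same picture.
import numpy as np

rng = np.random.default_rng(0)
population = rng.normal(loc=24, scale=4, size=1000)   # pretend BMIs for 1,000 people

n = 5
x_bars = [rng.choice(population, size=n, replace=False).mean() for _ in range(20000)]

print(round(population.mean(), 2))        # the population mu
print(round(float(np.mean(x_bars)), 2))   # the mean of the x bars lands right on mu
# A histogram of x_bars (e.g. with matplotlib) looks like the normal curve on the slide.
```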
And, of course, I'd start with a frequency table. But look at the frequencies, they're huge. That's because you can get a ton of samples out of one population. And what you'll see is that if you make a histogram out of that, it looks normally distributed; it's just that the frequencies are really high, because there are so many different samples you can take. And remember, this is a frequency histogram of x bars: each observation counted in those frequencies is an x bar from one of the groups of five you could take. So that's what the sampling distribution is. It ends up looking like a histogram, but it's a histogram of all the possible x bars you could get from all the possible samples of whatever n size you picked from the population you have. The fancy, official statistical way of saying it is: a sampling distribution is a probability distribution of a sample statistic, in this case x bar, based on all possible simple random samples of the same size from the same population. That's what makes it a sampling distribution and not a frequency distribution. You're probably thinking, okay, great, you just explained that. But in the next section we're going to talk about the central limit theorem. Here comes a theorem, and there's a proof for the theorem, and you need to understand this concept of a sampling distribution in order to follow it and to use it for inference; that's why I had to go through this. Okay, now we're on to the central limit theorem and how it's used for statistical inference. I'm going to start by explaining it in words; you can see the sampling distribution over there on the slide. Here are the words around the central limit theorem. One: for any normal distribution, and remember, we're talking about a normal distribution here, the sampling distribution, meaning the distribution of the x bars from all possible samples like we just talked about, is a normal distribution. It's not skewed, it's not bimodal; it looks kind of like what's on the slide. Two, and this is important: the mean of the x bars is actually mu. I had a student who would say, "Oh, the x bar of the x bars is mu," and that's actually true. If you actually did the thing I described, which, don't try it at home, because you'll be up all night taking samples, but if you actually got all samples of five from a population, got all their x bars, and took the mean of all those x bars, you'd get mu. And how you could check it is, of course, by just taking the mean of the entire population; that would have been the easy way to do it. But if you do it this other way, where you get every possible x bar for a particular sample size and then take the mean of those x bars, you'll get mu. So that sounds like the kind of statement that would be in a proof, right? Now here's the next part, three: the standard deviation of all those x bars is actually the population standard deviation divided by the square root of whatever n you picked. In other words, if you have the whole population's data and you find the standard deviation, you just have the standard deviation.
But if you did this thing with the x bars, where you took all those x bars and found the standard deviation of those x bars, that would equal the population standard deviation divided by the square root of whatever n you used to get all those x bars. Again, it sounds really theoretical, but that's the third part of the central limit theorem in words. Some people like to look at it from a formula standpoint. On the right side of the slide, in these little formulas, n means the sample size (remember, I picked five, but you could pick a different one), mu is the mean of the x distribution, meaning the population mean, and that population standard deviation symbol, sigma, is the standard deviation of the x distribution, meaning the population standard deviation. On the left is just the formula version of what I said: the mu of all the x bars you could get for a particular sample size in a particular population is going to equal the mean of the population, and the standard deviation of all those x bars is going to equal the population standard deviation divided by the square root of whatever n you picked. Now I just want to point out the z thing. We've been doing this z thing, but we've been doing it with one x. If you imagine grabbing a bunch of x's, in other words a sample, this is the formula you're going to be using: x bar minus mu, over the standard deviation divided by the square root of n. That's what we're moving into here: what happens if you get a sample and you're looking at x bar, not just grabbing one x and looking at that. I wanted to point out, first of all, that this whole thing is only supposed to happen if your n is greater than 30; otherwise you shouldn't really be doing this. The second thing I wanted to point out is that the piece underneath, the lower part of the equation, is called the standard error; they named that piece. Part of the reason I like that they named it separately is that I usually calculate that piece before I even do the equation, so I just have that number sitting around. There's a square root underneath the standard deviation, and that whole thing is underneath another thing, so it's hard to do all that dividing at once. So I usually make the standard error first, by taking the population standard deviation divided by the square root of n, and then later I use it in the z equation. Those are the two things I wanted you to notice, so I brought them out on the slide. Okay, here's more on the central limit theorem. If the distribution of x is normal, then the distribution of x bar is also normal. At the top is an example of just an x distribution, and if you do that thing where you take all those samples, get all those x bars, and make the histogram, you'll see the pink one lower down, the x bar distribution; this is just a pictorial example. But even if the distribution of x is not normal, as long as n is more than 30, the central limit theorem says the x bar distribution is approximately normal. Remember a lot of that hospital data we've been looking at, like hospital beds in a state? Often you'll see a skewed distribution. But if you have more than 30 hospitals, what you could do is pick an n bigger than 30.
And take a bunch of samples and get a bunch of x bars — not just a bunch, all of the possible ones. Then, if you made that x bar distribution, even though hospital beds are skewed as an x distribution, their x bar distribution would be normal. That's the other important piece of the central limit theorem, the other important piece of that proof: all of those x bars you get will end up on a normal distribution, even if your underlying distribution is not normal, so long as the n you're picking is greater than 30. And finally, because proofs build on each other, that leads us to the concept that a sample statistic is considered unbiased, just unbiased, if the mean of its sampling distribution equals the parameter being estimated. It's not perfect, but it's unbiased. In other words, the fact that the x bar of the x bars is mu means that an x bar is going to be unbiased. It might not be exactly the same as the population mean, but it's not a biased representative of mu. All right, now let's move on to finding probabilities regarding x bar, for those of you who want to actually apply something and stop thinking about theory. But let's remind ourselves: what are we doing? What were we doing in chapters 7.1 through 7.3? We had a normally distributed x, a population of quantitative values that were normally distributed, with a population mean, mu, and a population standard deviation. And we kept doing these exercises where we found the probability of selecting a value, an x, from that population above or below a certain value of x. We'd look up the z score in the Z table to get the probabilities. Basically, what we were doing was converting an x to a z, using the formula here, so whenever we had an x, we could put it on the z distribution and figure out the probability. So here's what's different now. You'll notice the first thing has not changed: we're still talking about normally distributed x's, we're still talking about a population where we have a mu and a population standard deviation. But now we're not just grabbing one x from that population, we're grabbing a sample, and because we're grabbing a sample, we have to pick an n, and the n can be different each time. So we're grabbing a sample from the population. How do we boil that down to one number? We take the x bar, or the mean value of that sample. The z score uses that x bar instead of the x, because we're taking a sample. When you see the formula below, you'll notice the other one just had x in it, because we only had one value; this one has x bar. And because we have a sample, you'll also notice that downstairs, where we used to have the population standard deviation, we now have the standard error, which, remember, is the population standard deviation divided by the square root of n. That's where n comes in; it matters what n you have in order to make the z come out right.
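Before the worked examples, here's a quick simulation check of those central limit theorem claims: the mean of the x bars is near mu, their standard deviation is near sigma over the square root of n, and the shape is roughly normal even when the data are skewed. A minimal sketch assuming numpy, with an invented skewed population standing in for the hospital-beds example:

```python
# Checking the central limit theorem claims by simulation: start with a skewed,
# "hospital beds"-style population, take samples with n > 30, and look at the x bars.
# Minimal sketch assuming numpy; the population here is invented for illustration.
import numpy as np

rng = np.random.default_rng(1)
population = rng.exponential(scale=150, size=5000)  # right-skewed, like beds per hospital

mu, sigma, n = population.mean(), population.std(), 36   # n bigger than 30

x_bars = np.array([rng.choice(population, size=n, replace=False).mean()
                   for _ in range(10000)])

print(round(mu, 1), round(x_bars.mean(), 1))                  # mean of x bars is about mu
print(round(sigma / np.sqrt(n), 1), round(x_bars.std(), 1))   # sd of x bars is about sigma / sqrt(n)
# The x bar histogram comes out roughly bell-shaped even though the population is skewed.
```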
All right, so now that we're reminded of what we're doing, let's explain how to do it. So let's say you do have an n, and you have an x bar; you grabbed your sample of size n and got an x bar. You can convert that x bar to a z score using this formula, where, of course, you have to be told the population mean and the population standard deviation, but then you have your x bar and your n, so you can do the whole equation. Then you get a z, and guess what you do with a z: you look it up. You look up the probability for that z score in the Z table, like in chapters 7.2 and 7.3, only this time it's about x bar. I thought I'd walk you through two examples. You're already kind of good at this, because it's not too different from 7.2 and 7.3, but it is a little different when you have a sample versus just one x. Okay, so remember our poor chemistry class, the one where I got a 73? We were assuming it was a 100-student class, so capital N equals 100, because they're the population. If you look on the slide, the mu of their scores was pretty bad, 65.5 on a 100-point test, and the population standard deviation was 14.5. So this was the population of this 100-student class. Now I'm going to do some exercises. We have to pick an n bigger than 30, so we're going to pick an n of 49. And I'm coming up with a little scenario here: to pass the class, students have to get at least a 70, which is a C. So let's pretend this is the question: what is the probability of me selecting a sample of 49 students with an x bar greater than 70? Notice how we ask the question a little differently: what's the probability of me getting a set of 49 students such that their x bar is greater than 70? Doesn't that kind of remind you of the central limit theorem, where we went back and got all those different samples of n equals five? What's the probability of me getting one of those samples that has an x bar greater than 70? That's the question. And I drew this out here; remember our old z distribution, along with our x distribution, and I drew roughly where 70 sits. But I wanted to point out that the probability for an x bar above 70 is going to be smaller than for a single x above 70, because it takes a lot for a whole sample's mean to land above 70. So here we go. I'll just remind you that the equation at the top and the equation at the bottom are the same equation; I'm just using the abbreviation SE for the standard error, and I like to calculate that separately, like I told you, so I do that first. How do we do that? Well, the n was 49, and the population standard deviation is 14.5, so 14.5 divided by the square root of 49 is where we get the standard error of about 2.1. Now let's calculate the z. So z is our x bar, 70, minus 65.5, which is our mu, divided by our pre-cooked standard error of about 2.1, and we get a z of 2.17. We're tempted to go look that up, but let's look at our picture first. Here's our z distribution, and what we're going for is the little piece at the top, above 2.17. So that's a little piece; let's go look for it. Because we're going for the piece at the top, we're going to use the opposite z. Remember, there are two ways of doing this.
But everybody seems to prefer the way where you use the opposite z if you're looking for something to the right. So we're going to use negative 2.17 to get the little piece. When you look that up, and I'm not going to demonstrate because you're good at this now, you get p equals 0.0150. If you were to look up positive 2.17, you'd get the big piece; that's why we do this. So then the answer: remember, the question was, what is the probability of me selecting a sample of 49 students with an x bar greater than 70? And remember how this test really sucked; the mu was 65.5, so it was pretty hard to get a high score. So the probability was pretty low: 0.0150, or in the percent version, 1.5%. Okay, now we're going to try a different one. That one asked for the probability of selecting a sample with an x bar greater than a certain number; now we're going to look at the probability of selecting a sample with an x bar between two numbers. So again, we're back with our poor class and its terrible chemistry test. This time I chose an n of 36; you'll notice I always choose perfect squares for n, because you have to take the square root and I'm lazy. Okay, here's our question: what is the probability of me selecting a sample of 36 students with an x bar between 60 and 65? I drew this picture up here to remind you that that's going to be on the left side of mu, so we're going to be dealing with negative z's. And remember how we would have two x's back in 7.2 and 7.3? Well, now we have a situation with two x bars, so you just name them x bar one and x bar two. And again, I show you with these red arrows that the probability for x bar will be smaller than for x, because it's harder to get a whole group of people together whose x bar lands in a certain range. All right, so this is not new; these are the same formulas I showed you before. I just want to emphasize that making your standard error first can really help you as you move through these problems; it makes things a little easier to calculate, especially in this case, where we're going to use the standard error twice. So this would look exactly like the last standard error, but it's different because our n is different: this time, 14.5 divided by the square root of 36 gives a standard error of about 2.4. And I just want to remind you: the more n you have, the bigger the square root of n gets, and the smaller the standard error gets. You can make the standard error really small if you just get a lot of n. So here are z one and z two; I put them both up there, but we can walk through it. X bar one is 60 and x bar two is 65, because it's between 60 and 65; you can see what's going on in the slide. And like I told you, both of these x bars are below the mu, so they both give negative z's. So we've got our negative z's, and now we have to remind ourselves what we're doing. Z one comes out at negative 2.28; that's a little piece at the bottom we're going to want to trim off. And then the big piece at the top, for z two, starts at negative 0.21. Just remember, the picture is really helpful.
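For anyone following along in code, here is a quick check of both x bar examples; it computes the standard errors and z's just described and carries them through to the probabilities that are read off the table next. A minimal sketch assuming scipy, with mu and sigma taken from the chemistry-class slides:

```python
# A quick check of both x bar examples: P(x bar > 70) with n = 49, and
# P(60 < x bar < 65) with n = 36.
from math import sqrt
from scipy.stats import norm

mu, sigma = 65.5, 14.5

# Example 1: n = 49, probability the sample mean is above 70.
se1 = sigma / sqrt(49)                   # standard error, about 2.1
z1 = (70 - mu) / se1                     # about 2.17
p1 = 1 - norm.cdf(z1)                    # about 0.015, i.e. 1.5%

# Example 2: n = 36, probability the sample mean is between 60 and 65.
se2 = sigma / sqrt(36)                   # about 2.4
z_low = (60 - mu) / se2                  # about -2.28
z_high = (65 - mu) / se2                 # about -0.21
p2 = norm.cdf(z_high) - norm.cdf(z_low)  # about 0.41 (table rounding gives 0.4055)

print(round(p1, 4), round(p2, 4))
```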
So now we're going to go deal with the probabilities. For z one, we're looking at something to the left, so we just leave the z alone and go look it up, and that's p equals 0.0113. For z two, we have to flip the sign and use the opposite z, because we're going for the piece to the right; that probability is 0.5832, and we can sanity-check that, because we can see it's more than 50% of the shape. Okay, so we've got our probabilities now, and just like last time, we take one minus both of those pieces, and that gives us the probability in the middle. That's the probability of drawing a sample of 36 students with an x bar between 60 and 65. To translate that into the answer: the probability is 0.4055, or if you round it and you like percents, about 41%. So, in conclusion, we reviewed parameters and statistics and their notations, and we talked about inferences and what we're doing with inference. Next, we talked about what a sampling distribution is and how it's different from a frequency distribution, so you can tell what's going on with that. Then I presented the central limit theorem, which may have been kind of confusing, because theorems always are; they're always about different principles and different things equaling each other. But because of the central limit theorem, we have permission to do the operations we did after that, which is finding probabilities regarding x bar. The central limit theorem says this is how the world works, so you get to use the standard error and do these kinds of calculations. So now, in addition to finding probabilities regarding x, you know how to find probabilities regarding x bar. Don't you feel smart?