Transcript for:
Understanding Key Concepts in Statistics

Stats start here. This is the beginning of our statistical journey, and we're just trying to understand what statistics are. So with that idea, let's look at how statistics are often portrayed in, like, a newspaper. Here we are with "four out of nine Alabamians will die of colorectal cancer in 1986." 44.44 percent of Alabamians will die. That sounds pretty extreme. Well, this is why stats get a bad rap. It's because people don't usually give real statistics, and we usually see inflated numbers or things that don't represent the population. Like maybe they're talking about people who go to the hospital, people who get colorectal cancer. You probably have people with an agenda right here, like a hospital trying to get people checked, and getting checked is not a bad thing. But it's bad if you use numbers that don't represent the truth. You should actually say, you know, we see in the history 0.3% of people will die of colorectal cancer. Three out of a thousand will die of colorectal cancer. But maybe the truth is four out of nine Alabamians who get colorectal cancer will die of it. So if you get colorectal cancer, your probability of dying is much higher. And now we have a conditional probability. Look at how much more complex things get when we add more context to what's going on with the data. And this is a key thing right here. All data needs context. I can't state this enough. All data needs context. Because when we think about statistics, statistics as a noun is just numbers, values, data. That four out of nine right there, that is a statistic. What's the context of it, though? When we think about doing statistics, statistical reasoning, statistical inference is a way of understanding the world around us. So statistics can be used to grant knowledge. Imagine you are playing a game of basketball with your friend and they make 4 out of 9 shots. Well, you have a data collection that they make 4 out of 9 shots.
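To put numbers on the conditional probability point from the colorectal cancer example, here is a small sketch in Python. It only uses the lecture's illustrative figures (4 out of 9 and 3 out of 1,000), which are not real medical data. It also assumes that a person can only die of the disease if they get it, so the share of people who get the disease (`p_cancer`, a name made up for this sketch) can be backed out from the other two numbers.

```python
from fractions import Fraction

# Illustrative numbers from the lecture, NOT real epidemiological data.
p_die_given_cancer = Fraction(4, 9)   # P(die of it | you get colorectal cancer)
p_die_overall = Fraction(3, 1000)     # P(die of it) across the whole population

# Assumption for this sketch: you can only die of the disease if you get it, so
#   P(die) = P(die | cancer) * P(cancer)
# Rearranging gives the implied share of people who get the disease.
p_cancer = p_die_overall / p_die_given_cancer

print(f"P(die | cancer)   = {float(p_die_given_cancer):.4f}")
print(f"P(die)            = {float(p_die_overall):.4f}")
print(f"implied P(cancer) = {float(p_cancer):.5f}")
```

Both 4 out of 9 and 3 out of 1,000 can be true at once. They just answer different questions, which is why the context has to come with the number.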
But then you would try to understand the world around you to understand how good they are at basketball. Compare that to a friend who would make 9 out of 9 shots. Well, now you might compare these and try to understand the world. And if you're picking for a team, you might say, that friend who made 9 out of 9 shots, using the statistics they have, I'm making a statistical inference, inferring that they are a better player. So we can use statistics to make inferences about the world around us by gathering data and understanding it in context. Once again, maybe we need some context about that whole making shots in basketball, because maybe the one friend lowered the rim and stood underneath it, and the other friend was making half-court shots. I'll tell you this much, if you're making 4 out of 9 half-court shots, that's more impressive than lowering the rim and making 9 out of 9 shots. So once again, we need the context for this. How was this data collected, what did we observe, and what inferences can we then make from the data that's collected? The main things we do in statistics are measure variation, understand the variation, and then reduce or adapt to the variation afterwards. So think about this, you're measuring variation. You have your friends right here, and friend one makes, let's say, 18 out of 20 shots. Friend two is going to make 3 out of 20 shots, probably me. And with this, you would then also control the conditions to make sure the only differences you're observing here are going to be the differences in shots they make. You don't want them to be at different spots on the court. Maybe you put them on the free throw line, you have the rim at the same height for both of them, and then you try to measure the variation between your friends. Now to understand the variation after measuring it, you would simply just do a comparison right here. Then...
reducing or adapting to the variation, well, we could try to train up one of our friends, or we could get them both to train each other, so the one gets better and pulls up, but we're just going to look at this and make decisions from it. So really, this last part right here is the whole decision-making process from the variation that we observe. One, we're going to collect the data to measure the variation, in this example looking at our friends and seeing the percent of free throws they make. We're going to understand these differences that we observe. And then we are going to try to reduce or react to this in some way that adapts and changes our decisions based on the variation. And the biggest thing we have to highlight again here is decisions are made from statistics. Statistics are our best insight into what is going on. And so we make decisions based on data. We call this data-driven decisions. Data-driven decisions. And that's the important thing, especially if you're working in a business, is making data-driven decisions. So what about that data? We have to know how to handle data. We have to know what ways data can be presented to us to be able to handle data. Data can be anything. It can be just me on the screen right now as data. I'm zeros and ones. Characters, images. By characters I mean letters, not characters in a show. But characters in a show would be data too, because they're images. Images, what you're seeing, other labels, things you can think of, anything you can collect and put into a database is going to be data in a database. Usually we think of numbers as data, or we think of categories, like maybe the color of shirt people are wearing, like wearing a blue shirt today. So we could say people who chose to wear blue shirts versus people who did not choose to wear blue shirts. That'd be a categorical variable. More on that in a moment.
Or we could do things like how old are you, which would be quantitative. Now notice right here we have what looks like numeric data, until you notice that it represents a category. That's one of the trick questions we have right there. You have to watch out, because not all numeric data is going to be actual real numbers. An actual real number is something like, how tall are you? 75 inches? Well, that's a quantity of how tall you are. But if we do gender and male is one, that's not a quantity of gender. That's not how much gender you have. It's just that we are assigning group one to be called male and group two to be called female. And that's why we always need, and we can't state it enough, the context of the data. Your gender is one. What does that mean? It means group one in this instance. So we always need people to present us the context and the ways in which they collected the data. What does it mean? Here's some data without context. What does it mean? Well, nothing yet, until we get the context to realize it's Amazon orders. Now we can see the order number and the name on the order, the state and country, the price, the area code, what they downloaded, did they get a gift, the Amazon Standard Identification Number, ASIN right there, and the artist. So all of this right here is the context, with the variable names as column names in a data frame. This is a data frame right here. We talked about data frames just a second ago. The data frame has rows and has columns. The columns are going to be the variables and the rows right here are going to be each observation. That's going to be a row right there. And one problem we run into is data can be very messy. There are lots of ways we have data that just doesn't make sense. Probably my biggest messy data is shoe size. I hate how when I go to get a shoe and I say size 12, they're like, hey, you want to go get those size 12s. We have, at least in America, male and female and sometimes unisex shoes.
And so depending on if you get a 12 male or 12 female, the shoe is going to be a different size. The best way I would think to do shoe size would be just to measure your foot. How many centimeters long is your foot? Because that's going to be a more exact measurement than inches. So let's go centimeters, and then, how many centimeters long is this shoe? That's a good size. And I'd prefer to do that because what is a 12? Well, not all 12s are the same when you go and buy a shoe. Just measure the shoe. Maybe I should start doing that. I should measure my foot and measure the shoe. And then I'll have a less ambiguous form of data right there when I go and buy shoes. I can even learn something from this. We also have the problem that sometimes people don't respond. This is called non-response bias. So how do we handle it when someone doesn't talk to us and we're like, how tall are you? And they don't respond. Do we infer, or what do we do? How do we collect data? Usually non-response comes up with more controversial topics where people don't want to answer questions, and then the people who didn't answer might differ in ways we never got to collect. How do we handle ridiculous responses? I've asked people how many days a week do you study and they say 10. Maybe they mean 10 hours a week. What do they mean by 10? Maybe they mean this month. So how do we handle these responses? When you collect data, the first thing you want to do is plan out how you're going to collect this data and clean the data, and make sure the questions that you have get precisely what you want. Data collection is extremely messy and it's hard to do. Even questions we would think are simple get messy. Number of siblings. Well, do we mean step-siblings? Number of countries visited. Which ones are we talking about? Does it count if you're born in that country? Because I'm visiting America right now. I guess visiting, yeah, sure, I'm here.
Height, do we mean centimeters, do we mean inches, do we mean meters? Favorite coffee flavor at Starbucks, one problem with this is you're going to get a lot of responses and they're going to differ, and it's very hard to analyze data from categorical responses that have lots and lots of different possibilities. Shoe size, well, do we mean in centimeters, do we mean male or female shoe sizes? And gender right here, there are lots of ways to collect variables differently. So we might not have everything in our survey that we're looking for, especially if you get surveys over time, given how we've changed the ways we collect different variables. So how do you collect all of this? This is something you're going to have to plan out before you do the data collection. Usually even before collecting data, people think about, how do I handle this? What answers do I give? Do I get free responses? And that's up to the person doing the research to figure out how they want to collect the data, because once you collect the data, it's very hard to fix data that's been collected if you didn't collect it properly. So here we go to some key words that you need to know. The first key word is population. The population is the group of people or things that you want to understand. So friends are making free throws. Your population of interest is the free throws they make. You're trying to understand their free throws. That's the thing you want to understand. You go out here to the University of Tennessee and you talk to students and you say, do you like the dining options? You're trying to understand University of Tennessee students. Notice it's not always a person. Like with free throws, you're trying to understand free throws and specifically the parameter. The parameter is the thing you are trying to understand. So let's just put parameter on here as a subset under this. We'll decrease the font size just a little bit and we'll add parameter onto this slide.
The parameter is the thing of interest about the population. We sometimes call it a population parameter. The parameter is the thing you want to understand. For University of Tennessee students in the previous example, it would be their thoughts on dining options. The population is UT students. The parameter of interest is what are their thoughts on dining options. For the basketball free throws, the population is basketball free throws. The parameter of interest is the percent made. So what we have now is a sample. You're probably not going to talk to all UT students. You'll probably do a survey and get a sample from that, which is a subset of the population. And from a sample we get statistics, and these are estimates of parameters. So when we talk about inferential statistics, we're actually saying statistics allow us to make inferences about a population. And one way I like to describe this is SSPP. Why? Well, sample statistics allow us to make inferences about population parameters. To understand the free throws of your friends, you would take a sample of them taking free throws. The statistic would be the percent they make in the sample. You want to understand their free throws, the population, and the true percentage they make over time, like what is their actual free throw percentage. In the other example we have, you want to understand University of Tennessee students' thoughts on dining options. So you take a sample of maybe 100 UT students, you ask them their thoughts on dining options, and 60% say they like the dining options. Well, that would be the statistic because it came from a sample. You then use that statistic of 60% to make inferences, build confidence intervals, or run statistical tests to understand the population parameter, which is what is true for all University of Tennessee students. One thing that your sample must be is representative.
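As a rough sketch of going from the sample statistic to a statement about the population parameter, here is the dining example (60 out of 100 students) in Python. It uses the normal-approximation interval for a proportion, which is one common textbook choice; the lecture does not say which method the course will use, so treat the function `proportion_ci` and the `z=1.96` default as assumptions made for this sketch.

```python
import math

def proportion_ci(successes, n, z=1.96):
    """Sample proportion plus an approximate 95% confidence interval.

    Uses the normal approximation: p_hat +/- z * sqrt(p_hat * (1 - p_hat) / n).
    """
    p_hat = successes / n                       # the sample statistic
    se = math.sqrt(p_hat * (1 - p_hat) / n)     # estimated standard error
    return p_hat, p_hat - z * se, p_hat + z * se

# 60 of the 100 sampled UT students said they like the dining options.
p_hat, low, high = proportion_ci(60, 100)
print(f"statistic: {p_hat:.2f}, approximate 95% interval: ({low:.3f}, {high:.3f})")
```

The 0.60 is the statistic. The interval is a range of plausible values for the parameter, and it only means something if the sample is representative of the population.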
If you go to the dining halls you probably won't get a representative sample, because people in the dining halls are choosing to go to the dining halls and might be more likely to like the dining hall options. You'd want to find some way to randomly select University of Tennessee students to figure out their views on dining hall options. Just like with the free throws too, you wouldn't want to record the free throws of your friend after they've been playing basketball for three hours. They might be tired. You want to figure out a good time. Maybe they've warmed up and you do it at the start of their basketball session and then at the end of the basketball session. Maybe do some, we'll say, stratified sampling (more on these topics later in class), picking from different times which would represent how they are as a free throw shooter. So we like to use this right here. We use randomness to figure out who we're going to select to take representative samples. So a lot of times a representative sample is taken randomly from the population. Don't go out and talk to the first 100 students you see. That'd be a non-random sample. Go out and try to find a spot on campus. Talk to 10 students. Go to another spot on campus. Talk to 10 more students. And once again, this would be kind of cluster sampling by location if you did that. Lots of ways to do sampling. More on this in later chapters. The key aspects on this slide have been highlighted for you, and a few little tidbits for the future have been thrown in. This is what our data files will look like, and we'll be going over in class how to use data and what it looks like. One thing I want to point out right here is over on the side in JMP you can see some color coding. It's a little small, but you'll see some blue and some red, and that helps us see what type the variable is. So with this we can take random samples right here and try to understand some key aspects of the data.
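Here is a small Python sketch of what "taken randomly from the population" can look like in practice. The roster of 30,000 numbered students is made up for illustration, not a real enrollment figure, and the fixed seed is only there so the sketch gives the same result every time it runs.

```python
import random

random.seed(0)  # fixed seed so the sketch is repeatable

# Hypothetical roster: one ID per student (the 30,000 is a made-up number).
roster = list(range(1, 30001))

# Simple random sample: every student has the same chance of being picked,
# and nobody can be picked twice.
sample = random.sample(roster, k=100)

# For contrast, a non-random sample: just the first 100 on the list,
# like stopping the first 100 students you happen to see.
first_hundred = roster[:100]

print("random sample (first 5 IDs):", sample[:5])
print("first-100 sample covers IDs", first_hundred[0], "to", first_hundred[-1])
```

The random sample reaches across the whole roster, while the first-100 sample only ever sees one corner of it.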
Now the who's are going to be each of the rows. This is a big thing for your assignments, to know what the who's are. This is who was collected on. What was collected on is the column. Key thing for your assignments. The who's, whatever these rows are. Looks like apps in the Apple Store. If you look right here, we're looking at the Apple Store data. The who was collected on, apps in the Apple Store. What was collected on, well, that's the different characteristics. What is their ID? What is the app name? How big is the app? So the what's are the variables and the who's are the rows. There's a lot more we can ask with this, but the who's and the what's are big right there. Remember, who's are the rows. Every time you collect on, like, a student, that would be a who. What you ask them would be the what's. And there is the when was it collected, the where, why did we collect this? How did we collect this? And all of this right here is very important context for your data. Your data needs context. You have to be able to explain to everybody, oh, we went out and spoke to 100 UT students. That's the who. Here are the questions we asked them. That's the what. When did we collect this? At the start of the year. Where did we collect this? Five different locations on campus. Why did we collect this? To understand dining options. And how did we collect this? Well, it was Brian and three other people. That's how we collected it right there. That gives the full context. Then we can use it to make data-driven decisions with this data right here. I just noticed that's slightly covered up. The who's and what's are essential right there, so make sure to see that. Finally for this lecture, let's talk about variable types. Variable types are important throughout the remainder of this semester because I'll usually ask you, is something categorical or quantitative? Let's start with categorical. You've probably heard of things called qualitative.
I'm not a big fan of saying this because it kind of sounds like quantitative, but if you want to remember it, it's like quality, like your eye color is a quality, your hair color is a quality, and that's actually a category. People with brown hair is a category. I prefer to call it categorical because it puts things into categories or groups. These are the same thing if you say categorical or qualitative. Now be careful also because sometimes we can use numbers as categories. If we assign female to group one and male to group two, that doesn't make it quantitative, because it's not how much is your gender, it's going to be which group are you in, so it's still a category. So categorical data puts things into categories. And for categorical data we can usually think of a what question, like what state were you born in? What's your favorite dessert? Now there are three subtypes of this. The first one is going to be ordinal. I'll make this red right here so we can see it a little bit clearer. Ordinal is going to be when there's an order to the groups. Freshman, sophomore, junior, senior. How about steak preference? Rare, medium-rare, medium, medium-well, well. Notice how it goes in an order. Just like freshman, sophomore, junior, senior. You wouldn't say senior, freshman, junior, sophomore. It kind of bounces around. We know there's a logical order. Now, nominal is easy to remember. If you think of ordinal as having order, nominal has no order to it. So categorical data with no order, like what state were you born in? Of course, you could alphabetize them, but there's not some sort of order the states must be presented in. Gender has no order either. Many things are usually nominal, like favorite dessert would be nominal. There is no order that puts chocolate pie ahead of or behind another dessert. Last but not least, we have identifiers. And the number one thing to know about identifiers is that they are always unique. An identifier can never repeat. Never, ever, ever.
Identifiers can never repeat. Which means if you're taking a test with us, and you are trying to see if something is an identifier, you simply just need to look at the data and see if it repeats. Now, that doesn't mean it is an identifier if it doesn't repeat. But looking right here, that is not an identifier. That is not an identifier. This, that's the same row right there. Download could be an identifier, but here's the thing. It's not that it doesn't repeat in the small data set. It's that it can never repeat. An identifier, think about why we call it an identifier, must never repeat no matter how big the dataset is, because it would identify that row. So order number, if you call Amazon and you give them your order number, that would be an identifier. If you tell them what you downloaded, it's probably not going to identify you. So we have order number here, usually the first column. I can't promise that, but the first column is usually the identifier for what we have in a dataset because we put it first, and then we can find the row we're talking about based on that identifier, which will be unique and can never repeat no matter how large the dataset gets. If we keep adding rows to this, the identifier will never repeat. Remember, identifiers are unique categorical variables that can never repeat no matter how much data you collect. I'll say that one more time. Identifiers are unique categorical variables that can never repeat no matter how large the dataset gets. And there's only one other type of variable. Quantitative. Quantitative can come in a few forms right here. It can be discrete or continuous. So discrete, or continuous, which I can never spell. Did I actually spell it right today? Nope, of course I didn't. Discrete or continuous. Just fix that right quick. There we go. Continuous is when you think about a number line right here. Any number on the number line that exists anywhere is continuous. So continuous would look like the following.
1.1, 1.2, 1.3. Where discrete is going to be whole numbers. So something that is discrete could be like pets owned. Something that is continuous could be like height. You could be 75.1 inches tall. So usually someone wouldn't say I own 2.1 pets. We don't usually collect that as a continuous data point. So continuous is going to be a number that can have decimals, where discrete is going to be whole numbers. And so you would say something is quantitative continuous, or a continuous quantitative variable. We ask you, how many pizzas did you eat? You say, oh, yesterday I ate one and a half pizzas. That's continuous. How many pizzas did you order? They don't usually let you order half a pizza. You probably ordered two pizzas. That's discrete. Oh no, I'm going for some pizza. JMP actually shows you a lot of this right here. And we can see right here, this is going to be ordinal categorical. That's the green one. And this is going to be nominal categorical. And this is going to be quantitative. And age is probably going to be collected discretely. Finish time is probably continuous. Because with age, people usually say something like I'm 40, 41. They don't usually say they are 40.5. Now, you can collect age continuously. A lot of people will collect it as discrete whole numbers. So whole numbers is discrete. And there are variables people do collect as a discrete quantitative number. But sometimes even with age, you could collect down to the decimal points. But then there's things like pets owned, which you probably can't. No one's going to say they own 2.1 dogs. I probably wouldn't even say that. Once again, we want to stress, and I can't stress this enough, about identifiers. Identifiers aren't really a third type of variable. We once again want to state that identifiers are special-case categorical variables. They can never repeat. If someone's collecting Social Security numbers, those will not repeat. The numbers that they're going to look up will not repeat.
If we have books, the ISBNs for those books cannot repeat. We do not usually summarize these or graph these because they're just used for identification purposes. Why do we have identifiers? To identify. Identifiers are used to identify. Rarely are they ever used to do anything else, like make graphics. We have a small data set right here. And how would we handle this small data set? Order number is going to be an identifier. Name is actually a category of names, and you can tell because it repeats, so it can't be an identifier. If we say we're looking for the order from Catherine H., well, they're not going to know which order it is. So name right here would be categorical nominal. State somebody ordered from, categorical nominal. Price is quantitative continuous. And area code, even though it's a number, is going to be categorical. And it might have some sort of ordinal way to it. The United States is kind of broken up into different groups. I know zip code is that way. Not the end of the world, but it might have a little bit of ordinal nature to it. There might be some sort of order to zip codes, or area codes, excuse me, where certain area codes are in different parts of the United States. It might be a little ordinal to that. Download is categorical nominal. Gift, yes or no, is categorical nominal. Amazon Standard Identification Number. This is a confusing one. Even though it has identification number in it, it could repeat because it's the product they ordered. That is going to be categorical nominal. And the artist somebody ordered is categorical nominal. Now notice a huge thing to see is only things that are numbers could possibly be quantitative. So if it's not a number, it's not quantitative for certain. Then you have to be careful, because you have instances like this where it is a number and it's not quantitative, because it's not how much is your area code, not an amount of area code.
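The "does it repeat?" check for identifiers can be sketched in a few lines of Python. The four rows below are made up for illustration and only loosely follow the columns of the order table on the slide; `could_be_identifier` is a hypothetical helper name, not something from JMP or any library.

```python
def could_be_identifier(values):
    """True if no value repeats in the rows we happen to have."""
    return len(set(values)) == len(values)

# Made-up rows, loosely modeled on the columns of the order table.
orders = [
    {"order_number": 10001, "name": "Catherine H.", "state": "TN", "price": 0.99, "area_code": 865},
    {"order_number": 10002, "name": "Alex R.",      "state": "TN", "price": 1.29, "area_code": 865},
    {"order_number": 10003, "name": "Catherine H.", "state": "GA", "price": 9.99, "area_code": 404},
    {"order_number": 10004, "name": "Jordan P.",    "state": "OH", "price": 0.89, "area_code": 614},
]

results = {}
for column in orders[0]:
    values = [row[column] for row in orders]
    results[column] = could_be_identifier(values)
    print(f"{column:13s} repeats: {not results[column]}")
```

A repeat rules a column out, which is why name, state, and area code fail. Passing the check proves nothing on its own: price does not repeat in these four rows, but it could repeat as soon as more orders are added, so it is still not an identifier. Only order number is unique by design.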
Finally, let's talk about randomness. Now, what is randomness? Well, think about a coin flip. In a coin flip, you have heads or tails. So you know what could happen. Here's what could happen down here. But we can't predict what will happen. And knowing what just happened, like if we just flipped heads on the coin, is not going to give us any information. So in other words, for the next coin flip, since we know what could happen, we don't know what will happen, and previous information doesn't give us an understanding of what will happen, we consider this to be random. Now in this instance heads and tails are equally likely. But do random events always have to be equally likely? No. Think about the lottery. If you go out and buy a lottery ticket, there's one winner and tons of losing tickets. Although there are just two outcomes to the lottery, it's not going to be a 50-50 chance of winning or losing. So there are lots of things in life that are random, but it doesn't mean that everything's a 50-50 chance just because it could or could not happen. We don't think of randomness as a bad thing. We actually love randomness. It's the way we collect data. We use it for a lot of different techniques and different models. And it can be very hard to generate random numbers. That's why we often use computers to generate random numbers. So computers have random number generators, something called RNG, random number generation, used a lot in video games too, to generate these random numbers, and not every outcome is equally likely. If you think about this, maybe you're playing a video game and you have a 1 in 100 chance of getting a really cool loot drop. Well, it's not equally likely to get the loot drop or not get the loot drop, and what the computer probably does is generate a random number to see if you get it, and then if you do get the right random number in the background of the computer, you get the really cool item.
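A minimal Python sketch of that loot drop idea follows. How a real game implements its drops is not something the lecture specifies, so this is just one plausible way: draw a random number between 0 and 1 and hand out the item when the number falls below the drop chance. The function name `loot_drops` and the seed are made up for this sketch.

```python
import random

def loot_drops(chance=0.01, rng=random):
    """One attempt: draw a number in [0, 1) and compare it to the drop chance."""
    return rng.random() < chance

rng = random.Random(42)   # seeded generator so the sketch is repeatable
attempts = 100_000
drops = sum(loot_drops(0.01, rng) for _ in range(attempts))

print(f"{drops} drops in {attempts} attempts, rate = {drops / attempts:.4f}")
```

There are two outcomes, drop or no drop, but they are nowhere near 50-50. Over many attempts the rate settles near 1 in 100, even though no single attempt can be predicted.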
That's how things are happening in the background, how computers are using randomness. That 1 out of 100 people who do that really cool raid or something will get that item, and it's just random who gets it. When we do sampling right here, a big thing we do is use randomness to take a representative sample. So representative samples are taken by taking random samples. Simulation also is going to use randomness right here. I could simulate the video game, where 1 out of 100 people get the random loot drop, and I could see, maybe if there's like four items that you want to get from four different raids, how long you would have to play the game, and give you a distribution of times. There are a lot of really cool things you can do with statistics. I would just take a random sample from the numbers 1 through 100 and see if you got that item, and calculate how long it takes for you to play each raid, and then simulate until you get all four items. Then I'd rerun the simulation again, run the simulation like 500 times, take the average, and then I would know on average how long it would take somebody to get all four items if they're doing those raids. All the things you can do with statistics. I always think about video games when it comes to stats. A simulation is going to allow us to use randomness to figure out and predict what would happen, or what we would expect, given certain distributions. Once again, the one out of 100 is when we know what could happen right there. We don't know what will happen. We don't know if you're going to get that item. And getting the item previously or not getting it is not going to explain if you get it the next time. So once again, the idea of simulating a video game right here works under the idea of randomness as we've defined it.
You could get it 1 out of 100 times, we don't know if you will get it, and knowing whether you got it or didn't get it in the previous attempt at the raid wouldn't give us information about whether you're going to get it this time. So with this right here we run into a problem, that some simulations are too complex or impossible. I do want to say this, it's never that they're impossible, it's just that it can be very hard to do simulations that perfectly mimic reality. If you think about even simulating someone shooting basketball free throws, they're not constant in the amount of free throws they make. Maybe they start out really strong, they get a little bit worse, and then they get better as time goes on. So you see a change in the probabilities, which might be hard to model. And with that, we've completed the material. Feel free to email me if you have any questions. Good work.
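As a follow-up to the raid example, here is a rough Python sketch of that simulation: four items, each with a 1 in 100 drop chance from its own raid, repeated 500 times and averaged. It counts raid runs rather than minutes, because the lecture gives no time per raid, and it assumes the player repeats one raid until its item drops before moving to the next. The function name `raids_until_all_items` and the seed are made up for this sketch.

```python
import random
import statistics

def raids_until_all_items(n_items=4, chance=0.01, rng=random):
    """Count raid runs until every item has dropped once.

    Assumption: each item comes from its own raid, and the player repeats
    that raid until the item drops before moving on to the next one.
    """
    total_runs = 0
    for _ in range(n_items):
        while True:
            total_runs += 1
            if rng.random() < chance:   # the 1-in-100 roll
                break
    return total_runs

rng = random.Random(2024)               # seeded so the sketch is repeatable
results = [raids_until_all_items(rng=rng) for _ in range(500)]

print("average runs needed:", statistics.mean(results))
print("luckiest player:", min(results), "unluckiest player:", max(results))
```

With a 1 in 100 chance, each item takes 100 runs on average, so the simulated average should land somewhere near 400 runs. The spread between the luckiest and unluckiest players is the distribution part, and it is usually the more interesting result.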