Stats start here. This is the beginning of our statistical journey, and we're just trying to understand: what are statistics?

With that idea in mind, let's look at how statistics are often portrayed in something like a newspaper. Here we are with "four out of nine Alabamians will die of colorectal cancer," in 1986. 44.44% of Alabamians will die? That sounds pretty extreme. Well, this is why stats get a bad rap. It's because people don't usually give real statistics, and we usually see inflated numbers or things that don't represent the population. Maybe they're talking about people who go to the hospital, or people who get colorectal cancer. You probably have people with an agenda right here, like a hospital trying to get people checked. It's not a bad thing to get checked, but it's bad if you use numbers that don't represent the truth. You should actually say something like, "We see in the history that three out of a thousand people will die of colorectal cancer." But maybe the truth is that four out of nine Alabamians who get colorectal cancer will die of it. So if you get colorectal cancer, your probability of dying is much higher, and now we have a conditional probability. Look at how much more complex things get when we add more context to what's going on with the data.

This is a key thing right here: all data needs context. I can't state this enough. All data needs context. When we think about statistics as a noun, it's just numbers, values, data. That four out of nine right there, that is a statistic. What's the context of it, though?

Now think about doing statistics. Statistical reasoning, statistical inference, is a way of understanding the world around us, so statistics can be used to grant knowledge. Imagine you are playing a game of basketball with your friend and they make four out of nine shots. Well, you have a data collection: they made four out of nine shots. Then you would try to understand the world around you, to understand how good they are at
basketball. Compare that to a friend who makes nine out of nine shots. Now you might compare these and try to understand the world, and if you're picking for a team you might say, "Hm, that friend who made nine out of nine shots." You're using the statistics you have and making a statistical inference, inferring that they are the better player. So we can use statistics to make inferences about the world around us by gathering data and understanding it in context.

Once again, maybe we need some context about that whole making-shots-in-basketball example, because maybe the one friend lowered the rim and stood underneath it and the other friend was shooting half-court shots. I'll tell you this much: making four out of nine half-court shots is more impressive than lowering the rim and making nine out of nine. So once again we need the context. How was this data collected? What did we observe? And what inferences can we then make from the data that's collected?

The main things we do inside of statistics are: we measure variation, we understand the variation, and we reduce or adapt to the variation afterwards. So think about this. You're measuring variation. You have your friends right here: friend one makes, let's say, 18 out of 20 shots, and friend two is going to make three out of 20 shots (probably me). With this you would also control the setup, to make sure the only differences you're observing are the differences in the shots they make. You don't want them to be at different spots on the court, so maybe put them both on the free throw line and have the rim at the same height for both of them. Then you try to measure the variation between your friends. To understand the variation after measuring it, you would simply do a comparison right here. Then, reducing or adapting to the variation: well, we could try to train up one of our friends, or we could have them both train each other so the one gets better and pulls up, or
we're just going to look at this and make decisions from it. So really, this last part right here is the whole decision-making process based on the variation that we observe. One, we're going to collect the data to measure the variation, in this example by looking at our friends and seeing the percent of free throws they make. Two, we're going to understand the differences that we observe. And then we are going to try to reduce or react to this in some way, adapting and changing our decisions based on the variation.

The biggest thing we have to highlight again here is that decisions are made from statistics. Statistics are our best insight into what is going on, and so we make decisions based on data. We call these data-driven decisions, and making data-driven decisions is the important thing, especially if you're working in a business.

So what about that data? We have to know how to handle data, and to be able to handle it we have to know the ways data can be presented to us. Data can be anything. Just me on the screen right now is data: I'm zeros and ones. Characters and images are data. By characters I mean letters, not characters in a show, although characters in a show would be data too, because they're images. What you're seeing, labels, other things you can think of: anything you can collect and put into a database is going to be data.

In a database we usually think of numbers as data, or we think of categories. Say we talk about the color of shirt people are wearing. I'm wearing a blue shirt today, so we could compare people who chose to wear blue shirts versus people who did not. That'd be a categorical variable; more on that in a moment. Or we could ask things like "How old are you?", which would be quantitative. Now notice right here we have what looks like numeric data, until you notice that the numbers represent categories. That's one of the trick questions we have right there. You have to watch out, because not all
numeric data is going to be actual real numbers. An actual real number is something like "How tall are you?" "75 inches." That's a quantity of how tall you are. But if we do gender and male is one, that's not a quantity of gender. It's not how much gender you have. It's just that we are assigning group one to be called male and group two to be called female. That's why we always need, and we can't state this enough, the context of the data. Your gender is one: what does that mean? It means group one in this instance. So we always need people to present us the context, the ways in which they coded the data, and what it means.

Here's some data without context. What does it mean? Nothing yet, until we get the context and realize it's Amazon orders. Now we can see the order number, the name on the order, the state and country, the price, the area code, what they downloaded, whether it was a gift, the Amazon Standard Identification Number (the ASIN, right there), and the artist. So all of this right here is the context, and the variable names are the column names in a data frame. This is a data frame right here; we talked about data frames just a second ago. The data frame has rows and columns. The columns are going to be the variables, and the rows are going to be each observation, so that's going to be a row right there.

One problem we run into is that data can be very messy. There are lots of ways we get data that just doesn't make sense. Probably my biggest messy-data example is shoe size. I hate how, when I go to get a shoe and say size 12, they're like, "Hey, you want to go get those size 12s?" At least in America, we have male and female and sometimes unisex shoes, so depending on whether you get a 12 male or a 12 female, the shoe is going to be a different size. The best way I can think of to do shoe size would be to just measure your foot. How many centimeters long is your foot? That's going to be a more exact measurement than inches, so let's go with centimeters, and then this is
a however-many-centimeter-long shoe. That's a good size, and I'd prefer to do that, because what is a 12? Not all 12s are the same when you go and buy a shoe. Just measure the shoe. Maybe I should start doing that: I should measure my foot and measure the shoe, and then I'll have a less ambiguous form of data when I go and buy shoes. I can even learn something from this.

We also have the problem that sometimes people don't respond. This leads to what is called non-response bias. So how do we handle it when someone doesn't talk to us? We ask, "How tall are you?" and they don't respond. Do we infer a value, or what do we do? Non-response usually comes up with more controversial topics, where people don't want to answer questions, and the people who don't answer might differ in some way we didn't collect on.

How do we handle ridiculous responses? I've asked people "How many days a week do you study?" and they say 10. Maybe they mean 10 hours a week. Maybe they mean this month. What do they mean by 10? So how do we handle these responses? When you collect data, the first thing you want to do is plan out how you're going to collect the data and clean the data, and make sure your questions get precisely what you want.

Data collection is extremely messy and hard to do, even for questions we would think are simple. Number of siblings: do we mean step-siblings? Number of countries visited: which ones are we talking about? Does it count if you were born in that country? Am I visiting America right now? I guess so, sure, I'm here. Height: do we mean centimeters, inches, or meters? Favorite coffee flavor at Starbucks: one problem with this is that you're going to get a lot of different responses, and it's very hard to analyze categorical responses that have lots and lots of different possibilities. Shoe size: do we mean in centimeters? Do we mean male or female
shoe sizes? And gender right here: there are lots of ways to collect variables differently, so we might not have everything in our survey that we're looking for, especially if you look at surveys over time and how we've changed the ways we collect different variables.

So how do you collect all of this? This is something you have to plan out before you do the data collection. Usually, even before collecting data, people think: how do I handle this? What answer choices do I give? Do I allow free responses? That's up to the person doing the research, and it matters, because once data has been collected improperly it's very hard to fix.

So here we go to some key words that you need to know. The first key word is population. The population is the group of people or things that you want to understand. If your friends are making free throws, your population of interest is the free throws they make. You're trying to understand their free throws; that's the thing you want to understand. If you go out here at the University of Tennessee and ask students, "Do you like the dining options?", you're trying to understand University of Tennessee students. Notice it's not always a person: with free throws, you're trying to understand free throws, and specifically the parameter.

The parameter is the thing you are trying to understand. Let's just put parameter on here as a sub-point under this; we'll decrease the font size just a little bit and add parameter onto this slide. The parameter is the thing of interest about the population, and we sometimes call it a population parameter. For University of Tennessee students in the previous example, it would be their thoughts on dining options: the population is UT students, and the parameter of interest is what they think of the dining options. For the basketball free throws, the population is
basketball free throws, and the parameter of interest is the percent made.

What we actually have is a sample. You're probably not going to talk to all UT students. You'll do a survey and get a sample, which is a subset of the population. From a sample we get statistics, and these are estimates of parameters. So when we talk about inferential statistics, we're actually saying that statistics allow us to make inferences about a population. One way I like to describe this is SS, PP. Why? Sample statistics allow us to make inferences about population parameters.

To understand the free throws of your friends, you would take a sample of them shooting free throws. The statistic would be the percent they make in the sample. What you want to understand is their free throws, the population, and the true percentage they make over time: what is their actual free throw percentage?

In the other example, you want to understand University of Tennessee students' thoughts on dining options. So you take a sample of maybe 100 UT students, you ask them their thoughts on dining options, and 60% say they like them. That would be the statistic, because it came from a sample. You then use that statistic of 60% to make inferences, build confidence intervals, or run statistical tests to understand the population parameter, which is what is true for all University of Tennessee students.

One thing your sample must be is representative. If you go to the dining halls, you probably won't get a representative sample, because people in the dining halls are choosing to go there and might be more likely to like the dining hall options. You'd want to find some way to randomly select University of Tennessee students to figure out their views on dining hall options. It's the same with the free throws: you wouldn't want to measure your friend's free throws after they've been playing basketball for three hours, because they might be tired. You want to
figure out a good time. Maybe they've warmed up, and you measure at the start of their basketball session and then again at the end of the session, to do some, we'll say, stratified sampling (more on these topics later in class): picking from different times, which together would represent how they are as a free throw shooter.

So we like to use this right here: we use randomness to figure out who we're going to select, in order to take representative samples. A lot of times, a representative sample is taken randomly from the population. Don't go out and talk to the first 100 students you see; that'd be a non-random sample. Go out and find a spot on campus and talk to 10 students, then go to another spot on campus and talk to 10 more students. Once again, this would be kind of like clustering by location if you did that. There are lots of ways to do sampling; more on this in later chapters. The key aspects on this slide have been highlighted for you, and some little tidbits for the future have been thrown in.

This is what our data files will look like, and we'll be going over how to use data and what it looks like in class. One thing I want to point out is that over on the side in JMP you can see some color coding. It's a little small, but you'll see some blue and some red, and that helps us see what type each variable is. With this we can take random samples and try to understand some key aspects of the data.

Now, the whos are going to be the rows. This is a big one for your assignments: know what the whos are. The who is who was collected on; the what is what was collected, and that is the columns. Key thing for your assignments: the whos are whatever these rows are. Here it looks like apps in the Apple Store; we're looking at the Apple Store data. So who was collected on? Apps in the Apple Store. What was collected? Well, that's the different characteristics: what is the ID, what is the app name, how big is the app. So the whats are the variables and the whos are the rows. There's a lot more
we can ask with this, but the whos and the whats are the big ones. Remember, the whos are the rows. Every time you collect data on a student, for example, that student would be a who, and what you ask them would be the what. Then there is the when (when was it collected?), the where, the why (why did we collect this?), and the how (how did we collect this?). All of this right here is, very importantly, the context of your data. Your data needs context. You have to be able to explain to everybody: "We went out and spoke to 100 UT students." That's the who. "Here are the questions we asked them." That's the what. When did we collect this? At the start of the year. Where did we collect this? Five different locations on campus. Why did we collect this? To understand dining options. And how did we collect this? It was Brian and three other people asking; that's how we collected the data. That gives the full context, and then we can use the data to make data-driven decisions. I just noticed that's slightly covered up on the slide. The whos and whats are essential right there, so make sure to see that.

Finally for this lecture, let's talk about variable types. Variable types are important throughout the remainder of this semester, because I'll usually ask you whether something is categorical or quantitative.

Let's start with categorical. You've probably heard these called qualitative. I'm not a big fan of saying that, because it sounds a lot like quantitative, but if you want to remember it, think of a quality: your eye color is a quality, your hair color is a quality. And that's actually a category: people with brown hair is a category. I prefer to call it categorical because it puts things into categories or groups. Categorical and qualitative mean the same thing. Be careful, though, because sometimes we use numbers as categories. If we assign female to group one and male to group two, that doesn't make it a quantity, because it's not "How much is your gender?" It's "Which group are you in?" So it's still a category. Categorical data puts things into
categories, and for categorical data we can usually think of a "what" question, like "What state were you born in?" or "What's your favorite dessert?"

Now, there are three subtypes of this. The first one is going to be ordinal. (Let me make this red right here so we can see it a little bit clearer.) Ordinal is when there's an order to the groups: freshman, sophomore, junior, senior. How about steak preference? Rare, medium rare, medium, medium well, well done. Notice how it goes in an order, just like freshman, sophomore, junior, senior. You wouldn't say senior, freshman, junior, sophomore; that bounces around. We know there's a logical order.

Nominal is easy to remember if you think: ordinal has order, nominal has no order. So nominal is categorical data with no order, like "What state were you born in?" Of course you could alphabetize the states, but there's no order they must be presented in. Gender has no order either. Many things are nominal. Favorite dessert would be nominal: there is no certain order that chocolate pie has to fall in as a dessert.

Last but not least, we have identifiers. The number one thing to know about identifiers is that they are always unique. An identifier can never repeat. Never, ever, ever. This means that if you're taking a test with us and you are trying to see whether something is an identifier, you simply need to look at the data and see if it repeats. If it repeats, it is not an identifier. Now, not repeating doesn't by itself make it an identifier. Looking right here: that is not an identifier, that is not an identifier, this one is, that's not; that's the same row right there. Download could be an identifier, but here's the thing: it's not that it doesn't repeat in this small data set, it's that it can never repeat. Think about why we call it an identifier: it must never repeat, no matter how big the data set is, because it identifies that row. So take the order number. If you call Amazon and give them your order number, that would be an identifier. If you tell
them what you downloaded, it's probably not going to identify you. So we have order number here, usually in the first column. I can't promise that, but the first column is usually the identifier in a data set, because we put it first and then we can find the row we're talking about based on that identifier, which will be unique and can never repeat, no matter how large the data set gets. If we keep adding rows to this, the identifier will never repeat. Remember, identifiers are unique categorical variables that can never repeat, no matter how much data you collect. Say that one more time: identifiers are unique categorical variables that can never repeat, no matter how large the data set gets.

And there's only one other type of variable: quantitative. Quantitative can come in a few forms; it can be discrete or continuous. Discrete, which I can never spell. Did I actually spell it right today? Nope, of course I didn't. Discrete or continuous. Let me just do a little fix right quick. There we go.

Continuous is when you think about a number line: any number that exists anywhere on the number line is possible. So continuous would look like the following: 1.1, 1.2, 1.3. Discrete is going to be whole numbers. Something discrete could be pets owned; something continuous would be height, like 75.1 inches tall. Someone usually wouldn't say "I own 2.1 pets," so we don't collect that as a continuous data point. So continuous is going to be a number that can have decimals, where discrete is going to be whole numbers, and you would say something is quantitative continuous, or a continuous quantitative variable. If we ask you "How many pizzas did you eat?", you might say, "Oh, yesterday I ate one and a half pizzas." That's continuous. "How many pizzas did you order?" They usually don't let you order half a pizza, so you probably ordered two pizzas. That's discrete. I don't know, I'm hungry for some pizza.

JMP actually shows you a lot of this right here, and we can see right here that this
is going to be ordinal categorical (that's the green one), this is going to be nominal categorical, and this is going to be quantitative. Age was probably collected discretely, and finish time is probably continuous, because with age people usually say something like "I'm 40" or "I'm 41." They don't usually say they are 40.5. Now, you can collect age continuously, but a lot of people collect it as discrete whole numbers. So whole numbers are discrete, and there are variables people do collect as a discrete quantitative number. Even with age you could collect down to the decimal points, but then there are things like pets owned, where you probably can't. No one's going to say they own 2.1 dogs. I probably wouldn't even say that.

Once again, we want to stress, and I can't stress this enough about identifiers: identifiers aren't really a third type of variable. Identifiers are special-case categorical variables, and they can never repeat. If someone's collecting Social Security numbers, those will not repeat. FedEx tracking numbers that people are going to look up will not repeat. If we have books, the ISBN for a book cannot repeat. We do not usually summarize these or graph these, because they're just used for identification purposes. Why do we have identifiers? To identify. Identifiers are used to identify; rarely are they ever used for anything else, like making graphics.

So we have a small data set right here. How would we handle it? Well, order number is going to be an identifier. Name is actually a category of names, and you can tell because it repeats, so it can't be an identifier. If we say we're looking for the order from Katherine H., they're not going to know which order it is. So name is categorical nominal. The state somebody ordered from: categorical nominal. Price is quantitative continuous. And area code, even though it's a number, is going to be categorical, and it
might have some sort of ordinal nature to it. The United States is kind of broken up into different groups. I know zip code is that way. It's not the end of the world either way, but there might be some sort of order to zip codes, or area codes, excuse me: certain area codes are in different parts of the United States, so there might be a little ordinal nature to that. Download is categorical nominal. Gift, yes or no, is categorical nominal. The Amazon Standard Identification Number is a confusing one: even though it has "identification number" in its name, it could repeat, because it identifies the product they ordered, so that is going to be categorical nominal. And the artist somebody ordered is categorical nominal.

Now, a huge thing to notice is that only things that are numbers could possibly be quantitative. If it's not a number, it's certainly not quantitative. But then you have to be careful, because you have instances like this one, where it is a number and it's still not quantitative, because it's not "How much is your area code?" It's not an amount of area code.

Finally, let's talk about randomness. What is randomness? Well, think about a coin flip. In a coin flip you have heads or tails, so you know what could happen (here's what could happen, down here), but we can't predict what will happen. And knowing what just happened, like if we just flipped heads on the coin, is not going to give us any information about the next flip. In other words, we know what could happen, we don't know what will happen, and previous outcomes don't tell us what will happen next. We consider this to be random.

In this instance heads and tails are equally likely, but do random events always have to be equally likely? No. Think about the lottery. If you go out and buy a lottery ticket, there's one winner and tons of losing tickets. Although there are just two outcomes to the lottery, it's not going to be a 50-50 chance of winning or losing. So there are lots of things in life
that are random, but that doesn't mean everything is a 50/50 chance just because it could or could not happen.

We don't think of randomness as a bad thing. We actually love randomness. It's the way we collect data, and we use it for a lot of different techniques in different models. It can be very hard to generate random numbers, which is why we often use computers to do it. Computers have random number generators; the process is called RNG, random number generation. It's used a lot in video games too, and not every outcome is equally likely. Think about this: maybe you're playing a video game and you have a one in 100 chance of getting a really cool loot drop. It's not equally likely to get the loot drop or not get it. What the computer probably does is generate a random number in the background to see if you get it, and if the right number comes up, you get the really cool item. So that's how computers are using randomness in the background: about one out of 100 people who do that really cool raid will get that item, and it's just random who gets it.

When we do sampling, a big thing we do is use randomness to take a representative sample. So representative samples are taken by taking random samples.

Simulation also uses randomness. I could simulate the video game, with one out of 100 people getting the random loot drop. Then, if there are, say, four items that you want to get from four different raids, I could simulate how long you would have to play the game and give you a distribution of times. There are a lot of really cool things you can do with statistics. I would just take a random sample from the numbers 1 through 100 and see if you got that item, calculate how long it takes you to play each raid, and simulate until you get all four items. Then I'd rerun the simulation again, run
the simulation like 500 times, and take the average, and then I would know, on average, how long it would take somebody to get all four items if they're doing those raids. That's the kind of thing you can do with statistics. I always think about video games when it comes to stats.

Simulation allows us to use randomness to figure out and predict what would happen, or what we would expect, given certain distributions. Once again, with the one out of 100, we know what could happen, but we don't know what will happen. We don't know if you're going to get that item, and whether you got it previously is not going to tell us if you get it the next time. So the idea of simulating a video game works under the idea of randomness: we know you could get the item one out of 100 times, we don't know if you will get it, and knowing whether you got it in the previous attempt at the raid wouldn't give us information about whether you're going to get it this time.

With this we run into a problem: some simulations are too complex or impossible. I do want to say, it's never really that they're impossible; it's just that it can be very hard to build simulations that perfectly mimic reality. Even with simulating someone shooting basketball free throws, they're not constant in how many free throws they make. Maybe they start out really strong, get a little bit worse, and then get better as time goes on. So you see the probabilities change over time, which might be hard to model.

And with that, you've completed the material. Feel free to email me if you have any questions. Good work.
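
For anyone who wants to try the loot-drop simulation described in the lecture, here is a minimal sketch in Python. It is only one way to set this up, under some assumptions that are not from the lecture: each raid is repeated until its item drops, the four raids are done one after another, the result is counted in raid attempts rather than minutes, and all of the function and constant names are made up for illustration.

```python
import random

ROLL_MAX = 100       # the item drops on 1 roll out of 100
ITEMS_WANTED = 4     # one item from each of four different raids
N_SIMULATIONS = 500  # how many times the whole simulation is rerun


def raids_until_drop(rng):
    """Run one raid repeatedly until its item drops; return the attempts."""
    attempts = 0
    while True:
        attempts += 1
        # Pick a random whole number from 1 to 100; a 1 means the item drops.
        if rng.randint(1, ROLL_MAX) == 1:
            return attempts


def raids_for_all_items(rng):
    """Total raid attempts needed to collect all four items, one raid at a time."""
    return sum(raids_until_drop(rng) for _ in range(ITEMS_WANTED))


def average_raids(n_simulations=N_SIMULATIONS, seed=0):
    """Rerun the simulation many times and average the totals."""
    rng = random.Random(seed)  # fixed seed (an assumption) so runs are repeatable
    totals = [raids_for_all_items(rng) for _ in range(n_simulations)]
    return sum(totals) / len(totals)


if __name__ == "__main__":
    print(f"Average raids to get all four items: {average_raids():.1f}")
```

With a one in 100 chance per attempt, each item takes 100 attempts on average, so the simulated average should land near 4 × 100 = 400 raid attempts, though any single run of 500 simulations will wander around that value. To turn attempts into playing time, you would multiply by however long one raid takes, which is something you would have to measure yourself.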