Transcript for:
Understanding Descriptive Statistics Basics

Hi everybody, this is Dr. Alvarez. Welcome to Lecture 4 of Sociology 303, Social Statistics, where we will be discussing descriptive statistics. Now, if you remember what we said, descriptive statistics are useful, and this is going to sound mad obvious to you when you hear me say it, so don't roll your eyes at me, but here we go: descriptive statistics are used for describing variables. We use descriptive statistics to describe how respondents in our sample answered questions. You know, what is the average or median age of all the people in our sample? What is the most frequently occurring race or party identification? Are people in our sample more liberal or more conservative? We use descriptive statistics to describe the variables that we are interested in in our sample, and when we say describe, what we mean is we are going to be able to write out for readers how people in general answered our questions.

I'm gonna give you three tools today, and if you want to, you can write this down, and I think it will help you: three tools to help you describe variables. The first one is a frequency table. A frequency table is used for nominal and ordinal variables, in other words categorical variables, if you're thinking about a type of variable. So if you're faced with a nominal or ordinal variable, what we call a categorical variable, you will use a frequency table. I also provide for you measures of central tendency; that's the second tool. Many of you have heard of these before: this is mode, median, and mean, right? If you have an ordinal variable, or if you have an interval-ratio variable, you will use some select measures of central tendency, depending upon the nature of your variable. Finally, we have measures of dispersion. The major measure of dispersion that we are going to be using is standard deviation, which you will use when you have an interval-ratio variable. So I laid out for you, right
now, some very important points. I laid out for you the three tools: frequency tables, measures of central tendency, and measures of dispersion. And I laid out for you which of those tools you should use depending upon the variable that you have. You can look at the level of measurement to indicate which of these three tools, or which combination of the tools, you should use in order to describe the variable that you are interested in using or learning more about. So I think that's a nice starting point. Let's jump right in.

What are the goals for the day? We want to understand what descriptive statistics are used for, we want to demonstrate frequency tables, explore measures of central tendency, and explore measures of dispersion. Let me say right now that you should be writing down the language that I use when I talk about these tables. So for instance, when we talk about frequency tables, we're going to go through about four different examples, and I have the examples on the slide, and you're going to need to make sure you write down the language I use to talk about them, because I'm going to ask that you mimic that language. The phrase that I always like to use to describe this is to say: you will always, always describe and interpret any statistic that you encounter in this class. You describe and you interpret. To describe simply means that you tell your reader in words what's there, what's in the table. Interpretation is the act of making that information meaningful, right? The description will almost always simply be what's in the table; that won't be up for debate. The interpretation will be the thing where you, the researcher, you, the scientist, decide what is most meaningful about that particular statistic. And so you will always describe and interpret every statistic that you encounter in this class. I am going to describe and interpret frequency tables for you in this lecture, and you will need to make sure you write down how
to do this, and then email me if you have any questions about it. We clear? I'm giving you fair warning to make sure you have your pen and pencil out, that you have your notes out, and that you're ready to write some stuff down, and I'm encouraging you to email me if you have any problem.

Okay, so again, we're going to understand the goals of descriptive statistics, we're going to demonstrate frequency tables, we're gonna explore measures of central tendency, and explore measures of dispersion. All right, so descriptive statistics: they describe the variables of interest in your sample. Essentially, when you're looking at, for instance, the GSS, we said that there's, you know, 900 or so variables. You're probably interested in somewhere between 5 and 35 variables in any particular analysis that you might do, and before you actually dive into a very sophisticated, complex analysis, the first thing that you would do is describe the distribution of each of those variables for your readers, if you're writing this up as a report, right? And so you would want to describe the distribution.

One way to think about this is to think about a political poll that somebody takes. Let's imagine that Pew takes a poll of 1,200 registered voters, right? The first thing they would have to tell us is, of those 1,200 registered voters, how many were Republicans, how many were Democrats, and how many were independents, right? That is a descriptive statistic. You have to describe it. If I asked each of those 1,200 people, "How much money did you earn last year?", and we think of that as their annual income, then you would want to be able to say something about what's the average or median annual income for all the people in our sample, right? So you get a sense of who these people are. You can ask them, obviously, "Do you support, you know, the president?" and then you could provide what percent said they support the president and what percent said they did
not. Those are describing the things that are in the sample. Normally speaking, of course, in statistics we don't just care about what's in the sample; we care about using the information in the sample to say something about the population overall. But you always begin with describing the variables that you are interested in using that are in the sample. For our purposes, for this class, I usually tell you which variable I want you to provide descriptive statistics for, or I will provide for you the tables I want you to describe and interpret, and then your job will be to describe and interpret. Okay, so this act of describing the variables that are of interest to you is the first thing that you would do in any analysis. In any analysis, you would first describe the variables and their distribution for your reader.

And here is a reminder that much of what we do in this class, and I've said this before, is an act of storytelling, where we're taking numbers and making them meaningful. And so one of the ways that you can sort of get around the idea that there's gonna be a lot of math in this class is to think instead about the fact that what we're really doing is taking a bunch of numbers and learning how to talk about them, and more specifically how to write about them, so that a reader can understand the statistics you are providing, right? And so whenever you write out a report, or when you write out a research paper, you would have different portions of the paper that provide a story for your reader about the statistical analysis that you ran. We talked earlier in this course about conceptualization; that is part of the story, right? But describing your variables of interest is another important part of that storytelling, and in particular, how you describe and interpret the statistics that you decide to run is an important part of that storytelling. Okay, so let's keep going. What is a frequency table?
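As a rough illustration of what such a table contains, here is a minimal Python sketch. The responses, the category names, the missing codes, and the `frequency_table` helper are all made up for illustration; this is not GSS data or SPSS output. It assumes that "don't know" (DK), "no answer" (NA), and "not asked" (IAP) are treated as missing, so the valid percent and cumulative percent are computed only over people who gave a valid answer.

```python
from collections import Counter

# Hypothetical answers to one ordinal survey question (illustrative only, not GSS data).
CATEGORIES = ["lower class", "working class", "middle class", "upper class"]
MISSING = {"DK", "NA", "IAP"}  # assumed missing codes: don't know, no answer, not asked

responses = [
    "working class", "middle class", "working class", "DK", "lower class",
    "middle class", "middle class", "upper class", "working class", "IAP",
]


def frequency_table(responses, categories, missing):
    """Return (rows, n_valid); each row is
    (category, frequency, percent, valid percent, cumulative percent)."""
    valid = [r for r in responses if r not in missing]
    counts = Counter(valid)
    n_total = len(responses)  # everyone, including missing answers
    n_valid = len(valid)      # only people who gave a valid answer
    rows = []
    cumulative = 0.0
    for category in categories:
        n = counts[category]
        percent = 100 * n / n_total        # share of all respondents
        valid_percent = 100 * n / n_valid  # share of valid answers only
        cumulative += valid_percent        # running total of valid percent
        rows.append((category, n, round(percent, 1),
                     round(valid_percent, 1), round(cumulative, 1)))
    return rows, n_valid


rows, n_valid = frequency_table(responses, CATEGORIES, MISSING)
print(f"valid responses: {n_valid} of {len(responses)}")
for category, n, pct, valid_pct, cum_pct in rows:
    print(f"{category:<14} {n:>3} {pct:>6.1f} {valid_pct:>6.1f} {cum_pct:>6.1f}")
```

In this toy sample two of the ten answers are missing, which is why the percent column and the valid percent column differ; with no missing answers the two columns would be identical.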
Remember, the frequency table is a way of doing descriptive statistics for nominal and ordinal variables. What does a frequency table do? It lays out how every single respondent answered the question. It does this by showing you all of the categories that respondents can provide as answers, and then letting you know how many people provided each response. We'll give you an example of this in just one second. With frequency tables, and I have to keep hitting this point home to you, you have to be able not only to describe what's there but also to interpret what is there. You have to be able to not only describe the table but to look at that table and decide for yourself what is the interpretation that you want your reader to walk away with from this analysis, right? Always, always, always describe and interpret any statistics that you come across.

So let's look at our first example real fast. This is our first example frequency table: the respondent's highest degree. This question is in the GSS and is asked of every single respondent in the GSS: what is the highest educational degree that you have earned? The response categories are over here, listed on the right-hand side, right? What are they? They are LT high school, which means less than high school, then high school, junior college, bachelor, and graduate degree. You see that? Then we have a bunch of columns here: we have the frequency, the percent, the valid percent, the cumulative percent. I'm gonna just tell you this right now, and if I were you I would put a star or an exclamation point on this: we're not gonna focus so much on the actual specific frequency. We're gonna focus here on the valid percent. The valid percent, the valid percent, the valid percent, that's the key issue here. We want to know, of the respondents who answered this question, who provided a valid response, what percent provided each answer, right? So what does that look like, or what
is it going to sound like in this case? I'm gonna read it to you now. Follow along with me and write it down for yourself in your notes. Here's what a description sounds like. Okay: this variable asked about the respondent's highest degree. 11.8 percent of respondents said they had less than a high school degree, 49.9 percent of respondents said they have a high school degree, 7.5 percent of respondents said they have a junior college degree, 19.2 percent said they have a bachelor's degree, and 11.5 percent said they have a graduate degree. Boom, done. That's your description.

Notice I didn't say anything about the frequency itself, right? I didn't say anything at all about the frequency, because we don't care about the actual frequency itself. I mean, there are some times when we do care about it, but rather our focus is on the percent, the valid percent. That's where our attention is. And notice right now that the percent and the valid percent are exactly the same. Well, you might be like, "So what the hell is the difference?" Well, as we go through some of the next examples, you're gonna see where the difference is. I'll just tell you right now, in case you want to write it down, and you will need to know this: the valid percent provides the percent of all the respondents who provided a valid answer to the question. Well, what are the non-valid answers to questions? Well, if that person said "don't know," if they said "I refuse to answer the question," or if they were not asked the question at all, which happens all the time and is called IAP, that is considered a missing response, and we generally don't want to include that when we describe the distribution of a variable. We only want to focus on the people who provide valid responses, right? That's what we want to do: valid responses. So we always read the valid percent. Now notice one thing here that I didn't say, which is the total number of respondents who
answered the question, right? 3,842. That's the total sample size who answered the question. That's one thing that you could actually provide if you wanted to: the total sample size for this question, which provides context for the valid percent. So you can do that.

So right now I just laid out for you what a description looks like. Well, what would an interpretation look like? An interpretation is entirely up to you, depending upon what your particular focus is. Now, in this class you're often just doing what I tell you to do, right? And so I say, go look at this variable and do XYZ statistical analysis on it. However, normally, you know, as a researcher, you look at a variable like this and you come to it with a research question in mind, and so the interpretation relates to that research question that you may have in mind. But for our purposes, just looking at this and trying to say something overall about the distribution that you think is interesting and important, that you would want a reader to understand, could be just about anything. So I'm going to provide you a whole bunch right here.

You could simply say only about 12% of respondents in this sample have less than a high school degree. Now, I said "only" because that might seem low, but you might think, wow, almost 12% of this sample has less than a high school degree, you mean they didn't graduate high school? That feels like a lot, right? It depends on your particular perspective. And in terms of grading, I would give you points regardless of how you said that, as long as you provided the interpretation, right? But let's do a couple more. If we're focusing on the valid percent right now, you could say almost half, 49.9 percent, have a high school degree. That's interesting, right? You can say something like 19 percent of the sample has a bachelor's degree, right? And I'm just repeating some of the description here, right? So
like, we're not doing anything sophisticated or difficult here, right? But these are all fine interpretations of this, depending upon what jumps out at you.

Normally, though, I like to use the information contained in the cumulative percent to describe stuff. So for instance, 61.7 percent of the distribution has either a high school degree or less than a high school degree. How do I know that? Well, the cumulative percent adds up how much of the distribution is taken up, up to that particular category, right? So for the first row the cumulative percent is 11.8 percent. That's because so far only 11.8 percent of the sample has been gone through, right? But then we go to the next row, and we include the people who say they have a high school degree. That's 49.9%, plus the 11.8%, which gives you 61.7%. So 61.7% of this sample has a high school degree or less than a high school degree. Do you see that? Does that make sense? You could say something along the lines of: 69.3% either have a junior college degree, a high school degree, or less than a high school degree, right? That's how we use the cumulative percent. You use some combination of the valid percent and the cumulative percent to talk about, and make meaning of, the distributions that you encounter, right? It's entirely up to you, and for the purposes of this class, as long as your interpretation makes sense and I can see where you got it from based on the table, you'll earn points. But you must have it, right? You must have it. You should first describe and then interpret, right? And so, I mean, there's lots of little things you could do. You could add together bachelor and graduate degree, right? And so that way you get, what is that, about 30.7 percent, is that right? 30.7 percent of this distribution have either a bachelor's degree or graduate
degree. How did I do that? I just added these things up, right? That's what I did. And so you will always, when you encounter a frequency table, first describe it and then interpret it, right? First describe and then interpret. Now, this is a very basic one, and we're going to deal with some more complex frequency tables as we move forward. So let's take a look at another one.

Okay, so: subjective class identification. This is a question from the GSS, right, that you can run on your own. So if you're feeling confident, you could go to the GSS and run a descriptive frequency table on this, right? It's simply asked of respondents: of these four class identifications, which one do you identify with? So now I am going to describe and interpret. I'll do one interpretation, then I'll move on to a bunch of others. Okay, here we go. A total of 3,818 respondents were asked to provide their subjective class identification. 9.1 percent said lower class, 45.4 percent said working class, 42.8 percent said middle class, and 2.7 percent said upper class. 54.5% identified as either working class or lower class. Boom, done. Makes sense? Just like that, I described it by going through all the categories and providing the valid percent, right? And then I provided an interpretation that jumped out at me.

But here there's lots of things that you could say in terms of interpretation, right? One thing that you could say is that 45.4 percent and 42.8% identify as working class and middle class, meaning that there are not very many respondents in this sample who identify as lower class or upper class. Most people identify as being in the middle, which is what we know from, you know, essentially the last two generations of doing public opinion work: most people would like to identify themselves as in the middle somewhere, so working class and middle class end up being in
the middle, right? You could simply say only 2.7 percent say they are upper class, right? That in and of itself could be an interpretation, and that's also interesting, right? It's entirely up to you. Notice here again, though, we see the valid responses, right? Then we have the missing responses: DK, meaning don't know, and NA, meaning did not provide an answer, right, they, you know, refused to answer. This is why there's a difference between the percent and the valid percent that we see here. Does that make sense? Because really there are 3,842 respondents in the GSS overall, but only 3,818 provided a valid response. That's what the valid percent is based off of, right? That's what the valid percent is based off of.

Let's keep going and do another one. Okay, here's one of my personal favorites. I'm an economic sociologist, and a lot of the work that I do looks at how people come to make decisions about lending money to their friends and family, right? Lending money to their friends and family. This variable here is "lent money to another person, past 12 months," right? And then it lists out a bunch of valid possible responses. Do you see that? Notice here that only about 1,269 respondents, to be more specific, only 1,269 respondents, answered this question. Another 2,568 were not asked this question, right? Another five refused to provide an answer. Notice that the percent takes all of that into account, but we don't want to pay any attention to this missing section. We only want to look at those who provide a valid response. So here I am going to provide a description and interpretation just like we did before. Okay: this variable asked respondents to provide an answer to how frequently they lent money to another person within the past 12 months. 1.7% said more than once a week, 1.2 percent said once a week,
5.9 percent said once a month, 17.3 percent said at least two or three times in the past year, 18.7 percent said once in the past year, and 55.2 percent said not at all in the past year. Overall, we can see in this distribution that 44.8 percent lent at least once a year, if not more, right? Do you see that? Because I'm using the cumulative percent: 44.8 percent lent once in the past year, or at least two or three times, or once a month, or once a week, or more than once a week, right? Make sense? Or you could have said that over half, a majority, 55 percent, have not lent money to anybody in the past twelve months. That could have been your interpretation as well. I do not, excuse me, I do not need to provide information about the missing section, right? I don't need to do that. I just need to describe and interpret what's there amongst the valid responses, right? Does that make sense?

Let's do one more. Yeah, and also, there's some additional information, for those of you who are ready for a little bit more complexity right now. There's a little bit more information that we can provide in terms of descriptive statistics aside from the frequency table. So for instance, you oftentimes have to identify the mode and the median, and we're going to cover that in a later portion of this lecture. I'm just gonna, you know, tip my hat to this point, right? Oftentimes, not only will we look at a table like this and say, hey, here's how everybody answered this question, I describe and interpret the frequency table, but I can also identify the mode, right, which is the most frequently occurring response. What's the most frequently occurring response here? It's "not at all in the past year." How do we know? It's 55.2 percent. But then we also can identify the median, right? The median, which is the middle response. And there we look at the cumulative percent and we try to find the category where we
go over 50%, right? So no category up to here goes over 50%; it's "not at all in the past year." That's the median. And if you're like, "What the hell did you just do, Dr. Alvarez?", I'm just tipping my hat so that when we talk about it later on, it will make sense to you. Also, the description of the median for ordinal variables is actually in your textbook, so make sure that you read that and prepare for that.

Okay, so let's keep going one more time. Last example: favor spanking to discipline child. Let's just describe and interpret this real fast, right? 3,186 respondents were asked to provide their opinion about spanking to discipline a child. 22.1 percent said they strongly agree with spanking to discipline a child, 48.1% said they agree with spanking to discipline a child, 22.1% said they disagree, and 7.7% said they strongly disagree. Makes sense? Again, our focus is on the valid percent. That's our description. What's our interpretation? You can do anything. The one that jumps out at me is that a whopping 70 percent either agree or strongly agree with spanking a child to discipline them. As someone who was spanked growing up, I'm not surprised by that, though, you know, I don't think spanking is a great way of parenting. But hey, you do you with your kids. I don't have any. Sorry, why are we even going into this? Anyway, that's a description and interpretation. Description, interpretation, really very straightforward. And on an exam, what would happen? I would give you a table like this and ask you to describe and interpret it.

Sitting here right now, I just mentioned to you the idea of the mode and the median, because those are descriptive statistics that are also appropriate for an ordinal variable. Can you remember what those are? Take a guess. For instance, the mode would be the most commonly
occurring, the most commonly occurring response, rather. What would it be? It would be "agree," right? Yes, 48 percent. It has 48 percent of all the respondents, so that's the response that's provided most frequently. Then we also have the median, right, which is the middle of the distribution. When you're looking at a table like this, again, how do we find the median? We find the row where the cumulative percent goes over 50%. So in the first row, does it go over 50%? It does not. How about for the "agree" row, does it go over 50%? It does. So the median is in the "agree" category. So "agree" is the median; that is the middle. Again, I haven't gone over what mode and median mean. I'm just tipping my hat so that when you get asked about it later, when we talk about it later, you have a sense of what those things mean, and we'll talk more about them.

Okay, so frequency tables: they display the overall distribution of responses for categorical variables, nominal and most ordinal variables. They help us describe and interpret how respondents answered a question, right? And that's why I always will ask you to describe and interpret how respondents answered the question. It provides an overview of the variable; it tells you overall how people answered this particular question.

Now, I have said that frequency tables are for nominal and ordinal variables, right? And I'm going to demonstrate for you why that is. There's a variable in the GSS that asks for the number of days of poor physical health in the last 30 days, right? What level of measurement is that? They ask you for the number of days. It's interval-ratio, right? Because I'm asking for a number. Now, do you see how many different responses there are here? And remember, these are not categories now; these are actual numbers. This is the number 0, number 1, number 2, number 3, et cetera, et cetera, right? And would you expect
to say, "When asked the number of days of poor physical health in the past 30 days, 67.9 percent said zero, 5.5 percent said one, 6.4 percent said two," all the way down through all the responses? No, that doesn't make any sense. You don't use frequency tables for interval-ratio variables. Never, ever, ever do that. The tables are only for nominal and ordinal variables, and even some ordinal variables have too many categories, nine or more, and so you wouldn't normally produce a table for them; you would try to find some other way to describe and interpret them, right? So never, ever produce a frequency table for an interval-ratio variable, and make sure that's in your notes. If you do have an interval-ratio variable, there are other descriptive statistics that you can provide: measures of central tendency and measures of dispersion, or variation.

There is also something called a recode, where you can change the structure of the variable. Normally, you know, in a face-to-face class I would go over this and teach you how to do it, but I think that's a little bit much for an online class, so, you know, we're gonna set that aside. But I do at least want you to know what a recode is, so that you've heard someone talk about it. We can actually go in and change the structure of a variable, and maybe later on in the course I'll give you an example of this. But the point that I'm trying to make to you is: never, ever, ever provide a frequency table for an interval-ratio variable, right?

Do remember that a descriptive statistic is something we use to describe the distribution of a variable, right? The distribution of a variable. And you should always describe and interpret the table that you have. Remember that frequency tables, as descriptive statistics, are only for categorical variables. If we have an interval-ratio variable, which you may, we should use
measures of central tendency and measures of dispersion, which we're going to talk about soon. I keep hitting on that point about what specific descriptive statistics you should use for the specific variable that you have. This is because you actually need to start getting prepared for the idea of being able to look at a variable, identify what the level of measurement of the variable is, the type of variable it is, and then tell me what specific set of descriptive statistic tools you need to provide to appropriately describe that variable, right? And so I've given you this handy little guide, and if I were you I'd write this down: in Lecture 4, slide 14, I have laid out for you, depending upon the type of variable that you have, what specific descriptive statistics you should provide, right? Notice here that if you have a categorical variable, meaning a nominal or ordinal variable, you should provide a frequency table. See how it says yes, right? But notice here I also say that you should provide the mode and the median, right? The mode and the median, particularly for ordinal variables. So remember that you also have to provide the mode or the median for categorical variables, and we've talked about that so far, right? We talked about that when I was going through the examples, so I just want to remind you of that.

I've already been blabbing away quite a bit, and we're only one third of the way through this lecture. I'm so sorry, I am so, so sorry. I promise you it's much more dynamic and engaging on a face-to-face level, but listening to this, I know it can be kind of tough. Feel free to take a little bit of a break right now. I'm gonna take a little drink of water. Let's get through this next section.

Okay, measures of central tendency. This is still part of our descriptive statistics section. What are our goals for this section? We want to understand the uses of mode, median, and mean. We want to understand symmetrical and skewed
distributions. I had some SPSS practice in here, some additional output, but I moved it to the very end of the lecture, so I have some more examples in just a little bit where I'll talk you through how to do some of this. Okay, but let me be clear here. Let me be clear that when you're faced with an interval-ratio variable, you're going to want to provide measures of central tendency, and then you can use those measures of central tendency to say even more interesting stuff about the distribution of a variable, particularly whether it's a symmetrical or a skewed distribution. You'll see exactly what I mean very shortly. Let's just jump right in and get to this.

Measures of central tendency: they are an attempt to tell us about the middle of a distribution as a way of summarizing, rather than looking at the distribution as a whole, as we commonly do with a frequency table. So I'm gonna do something, I just like doing it, and go back. We had this interval-ratio variable right here, right? And I provided a frequency table for it, and I said this is not the right way to do it, right? Because look at all this information. With an interval-ratio variable, there are all of these possible responses that people can provide. It would take forever to try to actually read through this and describe it like we normally do with frequency tables, right? Because frequency tables are used for describing all the responses. But normally, if we have an interval-ratio variable, when we have an actual number, there are far too many responses to go through every single response, so we have to find ways of summarizing that information, right? In some ways, really what this is, for interval-ratio variables, is finding the most appropriate way to summarize a distribution, the most appropriate way to characterize how people have answered a question, rather than providing all of those responses. Now, that said, even though we are summarizing, we're
still going to like everything else in this class describe and interpret the measures of central tendency that we get so we still describe and interpret we're just going to have to make sure we understand exactly what it is we're describing and interpreting first the mode the mode is the most commonly occurring response typically the most common category usually used with categorical variables in other words the mode even though we sometimes use it for interval/ratio variables can also be used with categorical variables nominal and ordinal and is very very often provided when you have a frequency table remember we talked about the mode in our frequency tables right it's very useful if you're looking at a distribution you can normally find where the mode is by finding the highest point on the distribution that's normally where it is if you're looking at it graphically but SPSS will also provide the mode for you if you ask for it right sometimes you have distributions that have more than one mode meaning that they've got two responses that both have about let's say 40% of the distribution in which case you could call that bimodal and you would want to point that out for our purposes your job is simply to know what the mode is it is the most commonly occurring response to any variable your job is to be able to find the mode in some output and to put that into a sentence right that's what your job is and again normally we use the mode for categorical variables but also sometimes for interval ratio variables and we're going to get into this complication in a bit let's keep going the median this is the response that cuts the distribution into two equal parts the middle response and this can be used for describing ordinal and interval ratio variables why can't you find the middle of a nominal variable why can't you find the middle of a
nominal variable well because a nominal variable is not ordered right you can't put the categories in order so therefore there is no beginning middle or end it's just a bunch of responses so the median can only be used for ordinal and interval ratio variables for this class you will need to be able to look at a frequency table and identify the median and we've done that but also to put into a sentence the median for an interval ratio variable and again we're going to get to that soon the textbook provides for you a very easy way to calculate what the median is but for the most part in this class and generally in statistics overall we can ask SPSS our statistical program to actually produce the mode and the median and the mean for us so we won't do a lot of calculating by hand instead your job is to be able to look at it in a statistical output and then actually put it into a sentence as to what it means right let's go to our final one the mean what is the mean it's just the average right you've done averages all your life probably you know how to do this it is the average of all the scores the average of all the numbers provided so we add up everybody's stated age and we divide by the number of people and so I provide here a little formula for you some of you may have seen this before we have Y bar that is the typical statistical way of denoting the mean Y bar means mean or average and then we have this funny E I mean we can use the appropriate name for it but I like to just say the funny E just to make sure nobody gets intimidated about it and if you write funny E on your exam or a homework I'll accept that that's not a problem so funny E Y we add up all the values of the variable so I ask everybody how old you are I add up all the ages so what that funny E says is to add up all the Y's right then I divide by what N right what is N it is
the total number of people who are in my sample right so if I have 3,800 people I add up all their ages then I divide by 3,800 make sense this just puts it into words that funny E Y means to sum all the values and again I'm showing this to you because we're gonna deal with that language that symbol really quite frequently in this course N equals the number of cases also known as the number of respondents right and so one of the things that I have to point out here and we're gonna talk about this in just one second in the next slide but I want to point out that for interval/ratio variables and I will put a star next to this this means numerical values things like if I ask you what is your age if I ask you in dollars how much money did you earn last year if I ask you in dollars what is your total net wealth right when we provide descriptive statistics about that information we don't go through and give every single person's age or every single person's income or every single person's net worth we find a way to summarize it mode median and mean are each ways of trying to summarize distributions that give us information about what the sample looks like overall right the mode is not a very good summary of a distribution it's okay but it's not very good and normally we use it with categorical variables so in other words make sure you provide a mode when you're looking at a frequency table right but when you're looking at a whole list of numbers think about the one that most commonly appears if you're thinking about income well income can range from zero to two million dollars right does the most common number that's provided actually give you a good sense overall of what the distribution of income looks like no you're probably better off providing either the median or the mean right the median or the mean now though we get into this question of well what really is the
difference between the median and the mean and there are really really big differences and these are often used for political purposes and why do I say that there are problems and I'll just go to the next slide to help us start thinking about this there is a problem with means and that is that a mean is highly sensitive to outliers am I making sense to you right now it is highly sensitive to outliers and so this means that using averages can sometimes be sort of tricky or misleading and so one of the ways in which people can lie with statistics is to be strategic in how they summarize the distribution by using means let me give you an example you're like what the hell are you talking about dr. Alvarez let's be real there is this thing called the Bill Gates problem which I guess you could also say the Zuckerberg problem or whatever you want to call it imagine there's a bar somewhere right there are four people in this bar who are unemployed they've got enough money to buy a couple beers and to go and be social and so they went down to the bar to drink and to spend some time with each other right clearly the average income of the patrons the customers in the bar is zero because all of them are now unemployed right the median is also zero that's the middle right the middle two cases are zero and zero zero plus zero divided by 2 is zero so the median is zero right Bill Gates happens to walk by and you know he knows that this particular bar has a chocolate stout and he loves chocolate stout right he also loves you know those lambic beers and some of those Trappist beers anyway he goes into the bar and sits down and he orders a beer right and let's just say last year that Bill Gates earned ten million dollars in income let's just say for the sake of argument right so now you have to ask yourself well what is the average income of the patrons of the bar well
let's just do a little math zero plus zero plus zero plus zero plus ten million equals ten million divided by five gives us two million dollars so the average income of the people in the bar has gone from zero to two million dollars well what's the median all right well what do we do with the median we put things in order from lowest to highest zero zero zero zero and then ten million we find the middle point all right one two three four five the third person is the middle what's their income zero the median is zero the median hasn't changed at all even though this new outlier has come in right so then you have to step back and ask yourself which of these two numbers does a better job summarizing income in the bar is it better to think of that bar as having customers who on average earn about two million dollars or is it better to think of that bar as having customers who have a median income of zero right the answer to that and there is a right answer to this is the median of zero because saying everybody has an income of two million dollars paints a very rosy picture of how well the people in the bar are actually doing right and now you see this actually quite a bit when people talk about say and I'm just gonna give you an example near and dear to my heart if you think about something like income in the United States and how income has changed across time if you look at average income of Americans starting right around 1973 to about right now what you see is that there has been a real growth in average income from 1973 to right now right so let's say around 2017 2018 there's been some fairly good growth in average income if you look at median income growth there has not been that much change from 1973 levels right I mean you know controlling for inflation and all that good stuff right so well what's going on here is one of these right and one of these wrong no they both would be
right it's just that if you look at where income gains have gone over the period from 1973 to 2017 it's mostly been to people at the very high end of the income spectrum right so people who have very very very high incomes and this draws up the average income making it seem like all workers are doing better when in fact it's only people at the very high end of the income distribution who are doing better right we can actually see how poorly people in the middle of the distribution are doing by looking at the median and we can recognize that median income has not changed very much so why am I saying all this I'm saying all this because one of the ways that people try to lie or be political with their statistics is by strategically using the median or the mean in order to talk about changes in populations right and oftentimes because the mean is very sensitive to outliers it can easily be used to mislead people so when we do statistics in this class when we're describing a variable we want to make sure that we are clear about the relationship between the mean and the median in any particular variable that we're looking at where it's appropriate right and again it's only appropriate in interval/ratio variables right and so one of the things that you need to be able to do is to look at the median of a variable and the mean of a variable and to determine if they are different from one another why would that matter if the mean and the median are different this indicates that there is an outlier in your distribution right an outlier that is skewing your distribution now if the mean and the median are equal to each other and technically the mode too though we ignore it we normally think of that as a symmetrical distribution can you see this on 4.6 so let me turn on my little writing tool here even though I don't like to do that let's just turn that on real
fast and draw something so it's symmetrical right do you see that here's my terrible attempt at drawing that's terrible right and what does that mean the mean and the median are the same and the mode too but normally we'll focus on the mean and the median in this class right so what does that mean that means that there's no outliers either low outliers or high outliers that are pulling the mean away from the median so that's a symmetrical distribution but there are instances where we can have outliers on the bottom side of things right you see the bottom here down here there are some outliers how do you know there's an outlier because there's this long tail on this distribution do you see how here the tails are the same on both sides here the tail on the left-hand side is longer right here the median is above the mean or more specifically the way that I like for you to think about this the mean is less than the median why would the mean be less than the median why would the mean not line up to the middle of the distribution because there are these bottom outliers right here in this tail why am I writing this out this doesn't make any sense this leftward tail that shows a negatively skewed distribution there are low values low outliers that are pulling the mean down below the median right now be careful about which way the tail runs if you were to look at the distribution of the number of children that an individual has there are lots of individuals who have no children and so there are lots of 0 values and then there are some people who have one kid some people have two kids and then three kids and four kids but those get fewer and fewer so the long tail there runs out to the right toward the high values and that is typically a positive skew not a negative one right so how about a positively skewed distribution what would you expect to
happen there here the tail is out to the right toward the positive values right there are these positive values over here far high values you see that and so there's this tail all the way out to the right and so what do you see happening the mean is pulled above the median giving it a positive skew right so let's be clear if you have low value outliers very small numbers you have a negatively skewed distribution how will you know this because the mean will be less than the median right then we have cases when there are very high positive outliers which produces positively skewed distributions how will you be able to identify those by being able to see that the mean is a higher value than the median right and we're gonna go through my examples near the end of this lecture I'm gonna put it all together for you let me be clear about this to you let me actually erase this because otherwise it follows us all the way through the lecture you will need to be able to look at the median and the mean in this class to determine if a distribution is symmetrical positively skewed or negatively skewed right the rules of thumb that we have are if the mean is higher than the median then you have positive skew you have positive outliers drawing the mean above the median can you think of an example that we just talked about that I might ask you a question about the Bill Gates problem is an example of a positive skew right the median is zero the mean is two million so clearly there's a very high value outlier that's drawing the mean away from the median that's drawing the mean away from the bulk of the distribution in this case who is the outlier it's Bill Gates right or Mark Zuckerberg or you know name
your you know your random billionaire right if the mean is lower than the median then you have a negative skew right you can sometimes look at a histogram in SPSS and we might do those a little later on in the class but oftentimes I find it much easier just to look at the mean and median and compare them to get a sense of what the shape of the distribution looks like you can also calculate in SPSS an actual precise value for skewness I think that's a little bit too technical for us for this class and so we're just gonna stick with our rules of thumb if the mean is higher than the median then you have positive skew if it's lower than the median you have negative skew and if it's roughly equal to the median then you have a roughly symmetrical distribution make sense so measures of central tendency frequency tables describe the entire distribution but are mostly good for categorical variables right though of course and here write this down you still can use mode and median with frequency tables and you should okay if you don't remember go back to the frequency table example number three and number four and that will remind you remember that measures of central tendency summarize distributions you don't go through every single value they are useful for summarizing distributions right the mean is the typical score our best guess of the typical respondent's value in an interval ratio variable unless it's skewed so what do I mean by that the mean is really good at summarizing a distribution it's very useful for us it gives us a very good sense of where the main part of the distribution is but if you have skew in your distribution the mean becomes less and less useful and for very very skewed distributions you would actually prefer the median as your measure of central tendency rather than the mean write that down and actually let me just do a quick little summary statement in about I don't know 15 slides no twelve
slides I'm gonna go through some examples of how you describe and interpret interval/ratio descriptive statistics we just went through measures of central tendency which is mode median and mean the mode is the most commonly occurring response and is less useful for interval/ratio variables but it's more useful for the categorical variables that we saw in frequency tables we talked about the median which is the middle of the distribution it is the person in the middle of a distribution we talked about how to find that in a frequency table which you absolutely should do and we also said that SPSS will calculate this for us for interval/ratio variables we said that we also use the mean which is the average response when you have an interval ratio variable specifically you will describe the measure of central tendency by just providing me the number the median for this distribution is zero and the mean is two million that's your description your interpretation is to say because the mean is far above the median this indicates positive outliers and therefore there is positive skew in our distribution given this our preferred measure of central tendency for this distribution given its skew is the median of 0 that's how we would interpret it so we describe and interpret a measure of dispersion excuse me a measure of central tendency when we have an interval ratio variable by first providing what the numbers are literally you just say the numbers you write the numbers that are in the output I know that might seem silly to you but this is a lot of what we do in published work you write out what they actually mean you write out in words what the tables say you write out what those numbers are and then you interpret it by saying is there skew present in this what type of skew is it and what is my preferred measure of central tendency okay I hope that makes sense if not please drop me an email please please please drop me an email right and just so we're
clear if it's skewed your best guess your best summary measure for that distribution is the median and not the mean okay all right measures of dispersion last part we're almost there we are almost there take another sip of water go take a bathroom break do your thing by the way my cats are being very well behaved you know I'm in my office and I have the doors closed but the door has windows on it so you can see into it and none of my cats have come over here and cried and meowed to get inside the office and so that's a good thing for you you don't have to listen to that instead you have to listen to me talking about my cats so I don't know maybe that's not so great so anyway hope you had a good break let's talk about measures of dispersion we're gonna get through this and be done with this lecture I'm excited you're excited let's get done with this okay goals for this section understand what dispersion refers to understand variance understand standard deviation we're going to be dealing with some more formulas here so prepare yourself for that but it's all good we're gonna break it down to make sure you understand so what is a measure of dispersion what does it mean to disperse it means to spread widely right and what we're really doing is that we are measuring how widely spread out our distribution is the greater the dispersion or the larger our range the wider our distribution what does that really mean let me just say this what we want to know is how similar respondents' answers to questions are are they very similar to each other or are they dissimilar to each other right you know this class is capped at roughly about 25 students and let's imagine that we asked 25 students how much money do you have in your pocket right this second and you can imagine a world where you know everybody is like I very rarely bring cash with me to campus I have
my card on me and my debit card and I load up money onto my student ID so I very rarely have cash so everybody roughly has about I don't know about three dollars in their pockets you know maybe four and everybody's within that realm right you know some people have three some people have four some people have five some people have one some have two and a bunch of people have zero right and people are quite similar they're very very similar in the amount of money that they have in their pocket so what we would say is that there's not much dispersion there's not much difference among people in how they responded to this question they're quite similar to one another right alternatively you could imagine me asking this question and some people have $2.00 but some people like to pay in cash for stuff right and so they have twenty dollars some people have forty dollars somebody has a hundred dollars some person has thirty dollars right and so there we would imagine that there's lots of dispersion there's a lot of difference between people in how they respond to that question right and so really what we want to be able to do is to look at a variable and to look at the numbers that are provided for the analysis of that variable and to determine is there a lot of difference in the way that people are responding to this question right do they all basically look like each other they all basically have like around three or four bucks in their pocket or is it a distribution where people look very different from one another where some people have three dollars and other people have hundreds of dollars in their pocket right so is it a low dispersion group or is it a high dispersion group and so we're going to use some mathematical tools to help us figure this out
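To make the low dispersion group and the high dispersion group concrete, here is a minimal Python sketch. The dollar amounts are made up for illustration (they loosely follow the amounts mentioned in the lecture and are not real class data), and the helper name value_range is hypothetical. The two measures it prints, the range and the standard deviation, are the tools this section goes on to introduce; Python's statistics module stands in here for what SPSS would report.

```python
from statistics import stdev

# Hypothetical pocket-money amounts in dollars (illustrative only, not real data)
low_dispersion = [0, 0, 1, 2, 3, 3, 4, 5]   # everybody within a few dollars of each other
high_dispersion = [2, 20, 30, 40, 100]      # some people carry a lot more cash

def value_range(values):
    """Range = maximum value minus minimum value."""
    return max(values) - min(values)

for name, group in [("low", low_dispersion), ("high", high_dispersion)]:
    # stdev() is the sample standard deviation (it divides by N minus one)
    print(name, "range:", value_range(group), "sd:", round(stdev(group), 2))
```

If the sketch behaves as intended, the second group shows both a larger range and a larger standard deviation than the first, which is what a high dispersion group means.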
here is one way of thinking about this right now can you see this yellow distribution here right you see how thin it is that's an indicator of a distribution where everybody is pretty similar to one another right everybody falls within a very similar range so it's a very narrow distribution that curve is very narrow and thin do you see that but then we also said there was another one where we have a lot of dispersion in it where some people had $1 or $2 but other people had $100 or $200 right that might be this sort of blue purple line down here let's call it blue on the bottom right do you see how wide it is there's a lot of dispersion in that distribution right and then we have something like the pink one in the middle which is sort of like a moderate one a moderate amount of dispersion now this is an oversimplified way of thinking about it but I think it's very useful and those of you who might have had this class before you know that actually what we're looking at is a graph of kurtosis you don't need to know that for this class but I find this to be a very useful way of trying to illustrate to you what we are doing we're trying to figure out when we look at a variable is it a sort of yellow distribution where people are pretty similar to one another is it a blue distribution where there's a lot of dispersion where people are really quite different from one another or is it the sort of pink one where there's some similarity and some difference a sort of medium amount right so how do we measure this well there are some very simple ones that we can do one is to look at the minimum and the maximum values that are given that can sometimes be useful also known as the range right the minimum and the maximum so you know if in one class the minimum the
minimum is 1 and the maximum is 5 right so the range is only 4 5 minus 1 gives you 4 that's the range whereas in the other one the minimum was about $1 but the maximum was say $200 clearly there's a lot more dispersion in that second distribution why because it has a much greater max and a much greater range right however that's not really precise enough for us and so instead we calculate a specific statistic that we're going to spend a lot of time on today the first version of it is called the variance and we're going to talk about the variance quite a bit right now it is the average of the squared deviations from the mean but we'll also talk about the standard deviation the square root of the average of the squared deviations from the mean let me just give you a heads up right now about what we're gonna do in the next couple slides let me break it down for you before we get there we're gonna look at the formula for the variance the equation for it just to make sure you understand what it does because we're going to be using formulas that are similar to it later on in this course right I'm gonna go through it we're going to calculate it once I'm trying to get you to understand it intuitively and then I'm going to point you to the problems that emerge from using the variance and I'm gonna make the argument and actually it's not an argument it's just a straight up dictum you're gonna have to do it this way we're gonna use the standard deviation instead of the variance and you're gonna need to know why we use the standard deviation instead of the variance right so we're gonna go through the variance formula understand the variance formula we're going to identify the problems with the variance and then we're going to see how the standard deviation is the solution to these problems and
we're going to focus on the standard deviation for the rest of the lecture okay that's what we're gonna do I'm just giving you an outline let's get to it what is the variance variance is designated by s squared of Y right s squared is the variance and this is the variance of the variable Y s squared of Y what are we saying this funny E says to add things up right and so what do we do we take any particular value minus the mean of that variable so this is the value of a variable minus the mean of that variable we square it we add those up and then we divide by the number of people in the sample minus one right so we're taking the average I want you to step back for one second and think about this what are we doing here what we are doing is that we're measuring distances from the mean that's how we measure dispersion what does that mean using mean and mean a lot in the same sentence is hard let's try to use average now shall we let's imagine in our low dispersion group of asking people how much money do you have in your pocket that the average amount of money in people's pocket is three dollars right what we will want to do is to go through and for every person who says how much money they have in their pocket we put their value in for Y so for me myself I literally have one dollar in my pocket like that's for real that's exactly how much money I have in my pocket right now one lonely miserable dollar right so what would I do it would be 1 minus what did we say the average is 3 right so how far away is one dollar from the mean of three dollars it's two dollars away it's two units away that's one way of thinking about it so the distance that my value is from the mean is two does that make sense then we can go on to the next person and we find out that they have five dollars in their pocket so that will be five minus three they are two units away from the mean right so we're getting for every
single person in our class we can find out how far away they are from the mean right and for each one of those I square it so 1 minus 3 equals obviously minus 2 you square that you get 4 plus 5 minus 3 that gives you 2 you square that that's also 4 so 4 plus 4 and on and on for each person in our classroom right that is how we add this up that's what the top of this means the funny E for each individual value we have we put that value in we subtract out the mean to find out how far it is away from the mean and then we square it right so we get squared deviations from the mean see that deviations from the mean how far is something from the mean we square them and we add them all up so then we have this big pot of squared deviations and then we just divide by the number of people in the sample in the classroom minus one a small correction what are we doing we're finding the average of that we're finding out what the average squared deviation from the mean is this is a fancy way of doing an average we're adding a minus one so it's a little longer and more complicated but the minus one is a slight correction that we have to put in there but we are doing a fancy average we're finding out on average how far our scores are away from the mean the average of the squared deviations from the mean that's what the variance is so let's be more systematic than that let's take it step by step what the hell does this all mean right so step by step s squared of Y that's the variance the variance of some variable Y Y is the actual value for the variable for each person right I've been using this example of number of dollars in your pocket but it could be anything right I mean think about if I went through and asked all 3,800 people their age in the GSS right well then we have
3800 people that's a lot of math for us to do because we have to do it for each person the average age in the GSS is about 44 something around that right so you know we go to the first person they say they're 21 so what do we do we do 21 minus 44 we get some number we square it and we go to the next person and we add all those up right so Y is the value of that particular variable whatever that variable is and then Y bar is the mean of that variable in that sample the mean number of dollars in your pocket the mean age whatever it is N is the sample size with the correction N minus one we use this to get the average right because it's the average of the squared deviations from the mean making sense so we are taking every value Y and subtracting the mean this is what a deviation is a deviation measures how far a score is away from the mean right then we square those deviations and I put question marks here because why do we do that I'll explain that to you in just one second and then we add up all of those squared deviations and then we divide by the sample size minus one that gives us the average of those squared deviations right we are averaging the squared distance from the mean for our values of Y but why do we square it anybody know yes you in the back just kidding I miss being in front of people talking about this so why would we square the values let's look at an example here we have four people in our sample Ari Barbara Kurt and Diego we look at their income right Ari makes twenty three thousand Barbara makes seventy eight thousand Kurt makes twenty seven thousand and Diego makes forty three thousand right the mean for this group is forty two thousand seven hundred and fifty do you see that all I did was add it up and divide by four that gives us
forty-two thousand seven hundred and fifty, that's right. Now, what if I don't square the deviations, if I just calculate the straight-up deviations? So the first one: twenty-three thousand minus forty-two thousand seven hundred and fifty; the deviation there is minus nineteen thousand seven hundred and fifty. Barbara: seventy-eight thousand minus forty-two thousand seven hundred and fifty gives me thirty-five thousand two hundred and fifty. Kurt: twenty-seven thousand minus forty-two thousand seven hundred and fifty gives me minus fifteen thousand seven hundred and fifty. And then Diego: forty-three thousand minus forty-two thousand seven hundred and fifty is two hundred and fifty. I add all of that up and what do I get? I get 0. You see that? This is a mathematical property of the mean: the deviations from the mean always sum to zero. And so really, the reason why we square them is to get rid of these negatives, so that it doesn't sum to zero. If you're like, Dr.
Alvarez, I have no damn clue what you just did or what you're talking about, then the thing that you should write down in your notes right now, just straightforwardly, is: we square the deviations in order to get rid of the negative numbers. That's what you should write. And then here in this last column, I've squared them all: squared this, squared this, squared this. Then I added them up and divided by three, so that my variance is this big-ass number right here. That number right there is our variance. Here's the step-by-step thing that I did to get the variance: twenty-three thousand minus forty-two thousand seven hundred and fifty, seventy-eight thousand minus forty-two thousand seven hundred and fifty, twenty-seven thousand minus forty-two thousand seven hundred and fifty, blah blah blah. Each one of these is a deviation. If I don't square them and I add them up, they just sum to zero, so I square them to get rid of the negative numbers. I get these big-ass numbers, and I sum all of those numbers: I sum the squared deviations. Then I divide by the sample size minus one. There were four people in the sample; minus one gives us three. There's a long story behind why we have to do that correction. I will never ask you about it. I'm happy to give you more information about why we do this correction, but it's beyond the scope of this class. Just know that it's n minus one. We are taking the average of these squared deviations, and what do we get? This ginormous number right here. This presents a problem, though. Actually, two problems. First, because we square it, the variance is no longer expressed in the original units. In this case we were talking about dollars, so when we square, we end up with dollars squared, which doesn't make any sense. So the variance cannot be directly interpreted, because the units are some
random squared unit of whatever you're talking about, because we had to square them. Am I making sense to you right here, right now? The second issue is that we squared it and we are doing an average, taking a mean, and we know that the mean is very susceptible to outliers. Think back to the Bill Gates problem. So what do you think happens if we square, if we raise it to the power of two, multiply it by itself? If we square something inside of an average, it means that it's really, really sensitive to outliers. Outliers have a very large impact on the variance. So if squaring poses a problem, because it puts the variance in the wrong units and it makes the variance more susceptible to outliers, what do you think we can do to get rid of these two problems? I'll give you a second. The solution: take the square root of the variance. You remember what we said in the beginning of this section: the square root of the variance is called the standard deviation. So let me just stop for one quick second and say, what the hell did we just do? What we did was lay out what the variance is, because I find that starting with the variance is a little bit simpler. We saw that the variance is a way of measuring dispersion in an interval/ratio variable, and the way that it does it is by taking every response to that question and finding out its distance from the mean: Y minus y-bar, what we call a deviation. We then square that deviation to make sure there are no negatives. We take the squared deviation for each person in our sample, no matter how many people are in our sample, and we add up those squared deviations. We then divide by the number of people in that sample minus one, the correction, in order to get the average squared deviation.
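The variance recipe recapped here can be sketched in a few lines of Python. This is only an illustration using the four incomes from the lecture's example; the variable names are made up for the sketch, and the standard library's `statistics.variance` appears only as a cross-check, since it also divides by n minus one.

```python
import math
import statistics

# The four incomes (in dollars) from the lecture's example.
values = [23_000, 78_000, 27_000, 43_000]

n = len(values)
mean = sum(values) / n                      # y-bar

# Deviation: how far each value is from the mean (Y minus y-bar).
deviations = [y - mean for y in values]

# Squaring removes the negatives; summing gives the sum of squares.
squared_deviations = [d ** 2 for d in deviations]
sum_of_squares = sum(squared_deviations)

# Divide by n - 1 (the small correction) to get the variance,
# then take the square root to get back to the original units.
variance = sum_of_squares / (n - 1)
std_dev = math.sqrt(variance)

print("mean:", mean)                                  # 42750.0
print("sum of raw deviations:", sum(deviations))      # 0.0
print("variance (dollars squared):", round(variance, 2))
print("standard deviation (dollars):", round(std_dev, 2))

# Cross-check against the standard library's sample variance.
print("matches statistics.variance:",
      math.isclose(variance, statistics.variance(values)))
```

Run on these four incomes, the raw deviations sum to zero, the variance comes out in the hundreds of millions of "dollars squared", and the square root brings it back to roughly twenty-five thousand dollars.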
We then said that this measure that we call the variance has a problem. It actually has two problems that emerge from using the squared deviations: it puts it in the wrong units, and it makes it very sensitive to outliers. And so, in order to deal with this, we take the square root of the variance. We call this statistic the standard deviation. The standard deviation is one of the most important concepts in statistics, and we're going to be using it throughout this course, so I encourage you to get used to it, to ask questions about it if you don't understand, and to really start trying to wrap your mind around it. The standard deviation is a way of measuring how much dispersion, how much variation, there is in a variable. It's a way of measuring: are people very similar to each other, or are they very different from each other? This is the actual formula for it. The symbol is s sub y, the standard deviation of Y. It is the square root of, you know, the funny E, summing up y minus y-bar squared, over n minus 1. That is the formula for the standard deviation. So the standard deviation is less influenced by outliers than the variance, and it can be interpreted in the original units. In our particular example, the mean was 42,750 dollars and the standard deviation comes out to about 25,038 dollars. So how do we interpret dispersion? Much like when dealing with measures of central tendency, where we said that you have to describe, by providing what the numbers are, and then interpret, by putting the numbers in communication with each other to come to some understanding of the shape of the distribution. With measures of central tendency, it was: say what the mean is, say what the median is, and then interpret by putting those two things in conversation to determine if it's a symmetrical or
if it's a positively or negatively skewed distribution. We do a similar sort of thing for the standard deviation. How do you describe a standard deviation? You simply say what the standard deviation is. Then, to interpret it, you have to tell me what you believe is the amount of dispersion in that distribution. Well, how do you figure that out? You compare the standard deviation to the mean. If the standard deviation is small relative to the mean, that means that there's not much dispersion and that values are, on average, relatively close to the mean. It's another way of saying that people are actually really quite similar to each other. If the standard deviation is large relative to the mean, we would say that there's a large amount of dispersion, that values are, on average, not close to the mean: people are very different from one another. So look at this for a second and write it down: if the standard deviation is small relative to the mean, we say there's not a lot of dispersion; if it's large relative to the mean, we say there is. So I want to take you back to a slide we started with, just to look at this for a second. Let's go back, boom. Which one of these did we say had low dispersion, where people are very similar to each other, where it's not widely dispersed? It's the yellow one there, right? So in that case, do you think that the standard deviation is large relative to the mean or small relative to the mean? It's small relative to the mean. A small standard deviation means that there's not a lot of dispersion, so you get these really narrow distributions, because people are really quite similar to each other, so they fall in a
very narrow range. But if the standard deviation is large relative to the mean, which one of these do we think it is? It's this blue one on the bottom, where it's a very wide distribution, where there is a lot of dispersion. What do you think this pink, this purple one is? This pink one is in the middle. That's when the standard deviation and the mean are relatively equal to each other, so there's an average amount of dispersion there. So when you are describing and interpreting your standard deviation, you are telling me what the actual number is, what the standard deviation is, and then you're telling me: is there a little dispersion in it, is there an average amount of dispersion in it, or is there a lot of dispersion? That's what your job is going to be. So let's get back to where we were: how to interpret dispersion. If it is small relative to the mean: not much dispersion, values are on average relatively close to the mean. If it's large relative to the mean: large amount of dispersion, values are on average not close to the mean. So to recap, our focus will be on standard deviation, not variance. You need to be able to interpret the standard deviation: how large is it relative to the mean, and what does that tell us? And you should get comfortable with what that equation is doing, because we're going to be thinking about similar equations down the line. Okay, so let's look at a couple of examples right here, where I've provided a bunch of information about three different variables: age, TV hours per day, and internet use per week. If I were you, I would get out my pen or pencil right now and make sure that you take down how I talk about some of this, so you can use similar language in your exams and in your homeworks and all that good stuff. Okay, so let's do it. Let's start with age of respondent. Sorry, I need a glass of water,
some water. So, age of respondent: there are three thousand eight hundred and eighteen valid responses to this. Everybody see that? The mean age in the distribution is forty-nine point eight; the median age is 50. Therefore this represents a symmetrical distribution, because the mean and the median are really very close to one another. Boom: I described and interpreted the measures of central tendency. The standard deviation for this distribution is seventeen. That is small relative to the mean, so this symmetrical distribution has very little dispersion in it. Boom, I'm done. Make sense? You could spice this up if you really want to and say that the minimum age given was 18 and the maximum was 89. You want to do that to impress me, go for it. Not required, though. Notice I didn't say anything at all about the variance, because the variance is not actually very useful for us; we care about the standard deviation. Now, the mode: the mode is 53. Is that super useful for us? No, it's not, not for an interval/ratio variable. It's much more useful for categorical variables, nominal and ordinal variables. So that was a good example. Let's do another one real fast: hours per day watching TV. I'm gonna give you a second to write for yourself what you think the answer should be. I mean, I guess you could just pause it before I say it, but I'm gonna give you a... that's annoying, sorry, I'll stop there. So the valid number of respondents to this question is 2,571. The mean was two point nine five and the median is 2. This indicates positive skew, since the mean is above the median. That's my description and interpretation. Now you might be saying, well, it's two point nine five versus two, is that really that big of a difference? Let me have you stop and think about it. It kind of is, isn't it? If the median is two and the mean is essentially three, that means that the mean
is about fifty percent greater than the median. So while it seems like a relatively small number, it's actually a relatively big percentage difference, isn't it? So that indicates positive skew. You have to sort of get used to, and get experienced with, looking at numbers and trying to determine for yourself: is that a big difference, is that a small difference, should I pay attention to that? You really do have to come to your own determination about it. In this particular case, though, this would be indicative of positive skew: not a huge amount, but a fair amount. The standard deviation is two point five five eight. This is roughly the same size as the mean, so there's an average amount of dispersion in this distribution. Boom, done again. I don't know why I keep saying boom like that; that's annoying. I've been very informal, and now I'm going to say it very formally. Let's deal with this last one: WWW hours per week. It's very funny to see that we did that, because nobody really says the World Wide Web. Funnily enough, I worked on the original research team that actually got this specific question on the GSS. This is way back in 2000, a long time ago. This was at the University of Maryland, when I was about to become a graduate student there to get my master's degree. We got a three-million-dollar grant from the National Science Foundation to fund exploring how people use the Internet, and so we put a bunch of questions on the GSS. All right, back to this. You ready? Write down for yourself what you think the answer is; I'm gonna go in a second. How would you describe and interpret WWW hours per week? The number of valid respondents to this question is 1,399. The mean number of hours online per week is 11.6, whereas the median amount of time online per week is 6. This indicates a large positive skew, a large
positive skew. The standard deviation is 15. This is much larger than the mean, indicating a very wide distribution: there is a lot of dispersion in this distribution. You are therefore done. That's it, that's all there is to it, and if you have questions about that, you should let me know. Now, on the exam you're gonna get a table that looks like this, and I'm going to ask you to describe and interpret that table, just like you'll get frequency tables on the exam and I will ask you to describe and interpret those. You're gonna get at least one of the two frequency tables and then one of these descriptive statistics tables. Okay, now we have two more things that we need to very quickly talk about, and then we'll be done. I really do appreciate you hanging with us and hanging tough, and I hope you've taken some breaks. So let's deal with the first one, and this next point is going to become really, really important in the next week or so, so make sure you pay close attention. The standard deviation is used throughout statistics; it's a very, very important concept. It not only measures the amount of dispersion in a distribution, it also has very special properties with specific types of distributions. In particular, if you have what's called a normal distribution, that's the so-called bell curve, and some of you have heard of the bell curve, then under the properties of a normal distribution we can say, and there's a mathematical proof that backs this up, that between plus one standard deviation of the mean and minus one standard deviation of the mean, 68% of all cases will fall between those two numbers. No matter what you're measuring, no matter what the variable is, if it's normally distributed, if there's a normal curve,
between plus one and minus one standard deviation from the mean you will find 68% of all cases. Between plus 2 and minus 2 standard deviations there will be 95% of all cases. And between plus 3 and minus 3 standard deviations from the mean you have 99% of all cases. What the hell does this mean? Let me work through an example. You ready? I'll try to make this fast, as I know we've already been going for a while. Let's imagine that the average height for men is 6 feet tall, which is 72 inches, and let's assume the standard deviation is 3 inches. So a very, very narrow distribution. The mean is 72 inches, or 6 feet, and the standard deviation is 3 inches. Between what two values would you expect 68% of all men to fall? Well, 72 is the mean, so what would be plus one standard deviation? Well, what is one standard deviation? It's three inches. So 72 inches plus three inches equals 75 inches; that's six foot three. Well, what about one standard deviation beneath that? Again, the standard deviation is three inches and 72 inches is our mean; 72 minus three is 69, so that's five foot nine. So we would expect 68% of all men to be between approximately five foot nine and six foot three. That's three inches above the mean and three inches below the mean, one standard deviation above and one standard deviation below. So what are the two heights that 95% of all men will fall between? This is plus two standard deviations and minus two standard deviations. So now we're not just doing plus three, we're doing what? Plus six. And we're not just doing minus three, we're doing minus six, because it's two standard deviations. So that would take us from what? Five foot six to six foot six.
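The plus-and-minus arithmetic in this height example, including the three-standard-deviation range, can be sketched as follows. It is only an illustration: the function and constant names are invented for the sketch, the 72-inch mean and 3-inch standard deviation are the lecture's assumed values rather than real data, and the percentages are the usual approximations for a normal distribution (the three-standard-deviation share is often quoted more precisely as about 99.7%).

```python
# Assumed values from the lecture's example (not real-world data).
MEAN_HEIGHT = 72   # inches, i.e. 6 feet
SD_HEIGHT = 3      # inches


def range_within(mean, sd, k):
    """Return (low, high): the values k standard deviations below and above the mean."""
    return mean - k * sd, mean + k * sd


def feet_inches(total_inches):
    """Format a whole number of inches as feet and inches."""
    feet, inches = divmod(total_inches, 12)
    return f"{feet} ft {inches} in"


# Approximate share of cases within k standard deviations of the mean,
# valid only if the variable is normally distributed.
shares = {1: "about 68%", 2: "about 95%", 3: "about 99.7%"}

for k, share in shares.items():
    low, high = range_within(MEAN_HEIGHT, SD_HEIGHT, k)
    print(f"{share} of cases within {k} SD: "
          f"{feet_inches(low)} to {feet_inches(high)}")
```

With these assumed numbers the three ranges come out as 5 ft 9 in to 6 ft 3 in, 5 ft 6 in to 6 ft 6 in, and 5 ft 3 in to 6 ft 9 in.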
Right, and so we would say that 95% of all men will be between five foot six and six foot six. Makes sense? And then we can do the same thing for 99% of cases: that's plus three standard deviations and minus three standard deviations. Again, very straightforward: 99% of all men will be between five foot three and six foot nine. Does that make sense? Now you may be like, why does this matter? I will show you why it matters very soon. It is important for you to see it. Here's a graphic representation of it; it probably doesn't help you too much, but we're gonna get there. We are going to get there together. Finally, last point, and this won't be on an exam, but I want you to see it, and I'll just say it very quickly. This thing that we use on the very top of the variance equation, or in the standard deviation, it's called the sum of squares. We calculate deviations: you get a deviation, you square it, and then you sum it. This thing called the sum of squares shows up in statistics all over the place, and so just make sure that you recognize it when you see it later. It's a way of measuring how far things are away from the mean of something, and it's usually called the sum of squares. That is it. Thank you for sticking by me this entire time, and I appreciate your patience never going