Transcript for:
Handling CSV Files and Hypothesis Testing

so for today's session i changed the plan slightly based on a consultation session this morning where um some of the students who have tried the new practical have run into a little problem that they didn't foresee and that was that we got are actually giving you the samples for your questions two three and four so you are going to have to in the numbers assignment you will see a link for each question for a data set which you need to download and that data set is a csv file so it's not an excel file but a csv file is just a comma separated values file which you can use in excel as well so some of you are having the issue that when you open that data set all the data is in one column and you don't know what to do with it so i thought let's start today's session with first how do we deal with cs um and then after that we will get back to the hypothesis templates now i quickly went in it for question two one of the data sets um it's gonna be different for you so everyone's gonna have a different data set so let's say you open it and it looks like this it hasn't separated the values by the commas and the commas in this case actually tell excel this is the data that goes into the next column so if your data set looks like this then it's actually very easy to fix that um and to move over into data that is in two columns so you could see here we should have a names column and we should have a values column and that's the data that you would be working with now um this typically happens if your computer is set up to use decimal commas then it will open in this format if you're a computer set up to use decimal points then when you open these files it should automatically split the data into two columns for you so there are instructions on click up how to change your computer to use decimal points instead of commas which will solve the problem immediately now if you don't want to do that easy way of doing this is you highlight your entire column of data you go here to data and you'll see there's a text to columns but so michaela this data set will be available when you open your practical sex the first question will be based on the fifa data set which you can download before you open the test and then for each of the for questions two three and four for each of them you'll get a csv file that you're going to have to download and um this is one of those but it's going to be different for every single student so yours isn't going to have exactly the same names and numbers in it every student is getting their own sample to work with now once you've highlighted this entire column you can click your antiques to columns and you will tell it that it is delimited data so you can see here it says characters such as commas or tabs separate each field so you click next and then you change it from it's normally on tab delimited by default you just go and change that to comma delimited and you'll see immediately it recognizes oh this is where i need to go split the data set um so you can click next and sometimes we'll we'll go and add um like a format for the data but you don't have to do it you can actually at the previous step have already clicked on finish and you can see now the data is given in this format and if i just make the columns wide enough the values are given nicely so you can see here this is quite a large value this is 2.9 something times 10 to the power 8 but if you actually just make the column wider it gives you the full value so it's nothing weird happening with that data now a second way in which you can do this is if you're already in a date like an excel file you can import your data directly from the csv as well by again clicking here on data going to get external data and then save from text and in that case you can go and open your sample data with the csv file where you've downloaded it and you click on import and again you're going through the same process where you say yes it's delimited you could tell it yes it has headers as well i normally don't even remember to do that and you can just say okay this is comma delimited and say finish and then it asks you where to put the data a1 is a good place to start and you can see it imports the data correctly for you immediately so um this is just in case you uh get your data and it's not already split into different columns this is how you're going to do it you can see it's quite straightforward um so at the beginning of if if you open your practical data and you get stuck on that you can just quickly come back to this recording and just watch how to do that now um let's get back to the practical so where we were last time we finished the first case study and we were working on the second case study part one so we finished all of these formulas let's just go back there we finish these formulas um up to the t statistic and now things should make a lot more sense to you guys now if we're working with um the data here and we want to calculate our p-value i'm going to go write down the formulas here so my p-value for lower tail remember rep is linked to the probability that we obtain a z value and we have our t statistic here what sign should i have in here what is mis probability statement less than and it's less than because it's a twist now the next one upper tail how we calculate a proper p value there we want the probability that z is what greater perfect and then for the two-tailed one remember we're going to do two times and let's just write it like this probability that z is greater than the absolute value of the t statistic so if i make my t statistic positive then i can um always take the greater than probability um it's easier to work with this then to work with the um with the negative one so we want remember with our templates we want you to do the calculations automatically so when you put new data in here and you specify your population standard deviation hypothesize you then all of these calculations need to be done automatically so standard error just a recap from monday sample size we got with the count function i don't know what that thing is that keeps popping up sample mean which is the average function population standard deviation we had to paste in there and then our standard error was just let's just double click on that our standard error is the population standard deviation divided by the square root of the sample size then this t statistic is just remember i've written it down here it is my sample average minus my hypothesized value divided by my standard error so i just referenced those cells and done my calculation now my p value what function should i be using to calculate the p-value and i see danielle you're saying that it's a confusing pack yes it's confusing in the beginning because hypothesis testing is new but with practice and with all the hints that we're giving you um that should help you a lot as well so um with the the um on numbers just have a look bottom right of your screen when you've typed in answers if it's marked wrong go scroll down to the bottom right there will be hints there for a lot of mistakes that you guys commonly make okay um so what formula should we be using to calculate the probability that z is less than a statistic any ideas i'm seeing tdist is the sigma known or sigma unknown it's a sigma known case so remember when we know sigma we are always going to use the normal distribution and that z rt statistics z actually tells us that remember z has a standard normal distribution so that means you are correct solani we are going to use the norm and i press the wrong button again very good at pressing the wrong button apparently i'm going to use norm.a.s because that gives me a probability from a standard normal distribution so now what we need to reference is our t statistic because we want to calculate the probability that z is less than our taste statistic value and as always we put true in there okay so that is what we have this i just want to see if i can move all of this and then what i'll do is i'll just copy these so let's just actually for them and they fold this as well so that you guys can see exactly what i've done at each step so these ones maybe we don't need to show but let's show these values so if 4 f4 and i'm going to do the same thing and then i'll just copy these formulas so that you can see exactly what we've done here so let's put that there and we'll put that there as well and then just delete that bit okay so these are formulas i'm gonna also just lock onto this one of course if you're setting your templates up you don't have to do that i'm just doing it so that i don't have to keep changing the references okay so that was just my norm.est.testbit now if i want to calculate my upper p value so now i'm trying to calculate the probability that z is larger than the t statistic again what what should i be using here is so you guys one minus and what am i norm that is the best and we use our taste statistic and we use the word true and let me just fix on to that okay so there we go that is now i don't know why it's doing that that's very weird i'm gonna just see if i can fix it yeah it was just a formatting issue okay so now we need to do the two-sided one and that's the most difficult of all of these so there's more than one way of doing it so the first way would have been to just say exactly what i've written over here so we can go do it as two times norm dot ace dot best and then we can just use the absolute value of the t statistic and add the argument true so this is one way of doing this but you could see that oh this should have been one minus a p-value remember is a probability so if you get something like i just did you should immediately know okay there's a problem here we're going to do your method next so that problem that i fixed i realized that it was the problem was that there were too many decimal places for some reason it suddenly wanted to show me a whole ton of decimal places so all i did was i clicked on the previous one which only had four decimal places clicked on the formatting brush here at the top left and clicked on that one so let's say for instance i want to make this five decimal places then i can use my formatting brush and brush over those two cells and it will automatically then apply the same formatting to that cell okay now um i'm seeing something here population standard deviation we did not calculate it in this data set with standard deviation not p remember we used the full data set with the population data and then the function standard deviation.p that was done in our um other sheet and then i just copied the answer over here okay now the other option for doing this is like solani said we could also have said let's take two times the minimum of those two values okay that is how we typically go and code this so you can see we're getting the same answer so i'm just going to quickly fix on to them and then i'll discuss why we can do this so and you would have seen this in your textbook already as well because this template is actually available in your textbook so how this works let's do it with a sketch so let's say we have a normal distribution and our t statistic we'd say that's zero in the middle let's say our t statistic is here on the left hand side now if i have a two-sided test what i would have done because this is a negative t statistic so this is a negative one then i would have said okay well let's take this area and multiply it by two now this area is also my lower tail p value and the area to the right would have been my upper tail p value so if you look at these two you can see that if you add those two answers together they're going to give you one because the one gives you the area to the left of the t statistic the other gives you the area to the right so if i have a negative t statistic then it's going to i need to take the lower tail area but if i had a positive t statistic so let's say my t statistic is over there and i was doing a two-sided test i would have taken the area to the right which is then multiplied by two so you can see that if i'm negative one i want to take the smaller probability and times it by two if i had a positive t statistic again the area to the left is quite large area to the right is quite small and the smaller one is the one that we want to multiply by two so that's why we can simply say here let's take two times the smallest one of these because if um my t statistic is negative i'm going to multiply the lower tail probability by two if it's positive i'm going to multiply the upper tail probability by 2. and we can go double check if this works i'm just going to quickly make an error here in my calculation of the t statistic let's just quickly make it a negative value to see if this works i'm just going to put a minus sign in there and we'll take it out again and you can see here if i had a negative t statistic my formula here takes the lower tail p value and my two to give me the two sided p value so that function now automatically according to what case i'm in to do the calculation for me so yeah so it's been about the um the data the original data you'll see if you go to the next template which we'll work on just now it has units here here we have drive while drowsy these are textbook examples um so all you do is you just copy your song data and you paste it in here so now let's say for instance i wanted to do a hypothesis test using this data um that we got from clicker i can just highlight this copy it and go paste it over here and all my calculations should adjust immediately you can immediately see here something is happening here and the reason why i'm getting such weird answers because i've actually not replaced these two values over here so i still need to go and input my hypothesized value and my population standard deviation which will also be different for everyone so obviously population standard deviation would be the same but um the hypothesized value could be different for all of you when you do your practical so um the idea with this template is simply that whatever data you put in here and you add the appropriate values here all your calculations should be done automatically now if we go back to this you can see that we're also going to calculate our critical values and we need the significance level so we're going to use the 0.05 and now we just need to figure out how to do critical values here so hypothesized value and um will be given to you in a in the wording of your question so i'm not going to tell you guys now what it is it's going to be different for everyone anyway um you would just have to put the right one in and you get that from the wording of your question just like linda's been doing in class this week okay now critical value let's focus on that one how would i get let's say we're doing a low tail test um how do i calculate my critical value we know how to get it from the table so how would i do it if i needed to get it from the from excel any ideas what function exactly so we're working with a normal distribution because we're dealing with a z t statistic so z is your clue that this is normal so normal and is it inverse or it's not inverse it's not inverse perfect okay and now what do i enter in here alpha exactly so i just go and reference my significance level and there we go i have my um critical value so now if i decide i want to do this test at a 10 level i just change this to 0.1 and there we go automatically i get my new um critical value so that is the beauty of referencing in the cells and then your level of significance again would be given in a question if we ever give you a question where we don't specify the level of significance then you go for five percent um that is a very commonly used one no one can tell you wrong if if we don't like if we don't tell you what significance level to use um then automatically assume five percent and we will not argue with you about that now if i want to do the upper tail one i'm also going to use the norm dot s dot inverse one and what should i be entering this or this time okay so now we want one minus alpha okay you can just log on to that one so i can share these formulas with you guys um there we go okay so let me show that and that okay so here the five percent is in the upper tail and that's why we do one minus that to get the area to the left because remember the norm dot s.inverse function takes the area to the left as your argument and again if i change this to 0.1 you can see all of my possible critical values change so let's just change that back um set okay and again just notice this where i'm typing the values in i've made blue the background blue just so that it's a reminder for me this is where i changes now our upper critical value is going to be again norm dot ace dot inverse and what do i input this time so one minus alpha over two so remember when we are doing a two-sided test let's just draw a picture of that very quickly and you'll see with the rest of the templates those arguments are pretty much the same it's just we might be using a different distribution so if we have a two-sided test remember that we have two critical values and the area here is going to be alpha over 2 the area there is going to be alpha over 2. so this area in here is 1 minus alpha so if i want to go and work out what the area to the left of this is i can just say 1 minus the area here to the right so that's 1 minus alpha over 2 and that's what i have typed in there or if you want to go and add those you could have said 1 minus alpha plus alpha over 2 which again gives you 1 minus alpha over 2. so if we go back to this formula just to show that to you again that's not showing it helps if you press the right buttons so let's just do that and you can see here we have one minus alpha over two so i'm just gonna move that sketch a little bit okay cool everyone happy with this and remember the total area under the curve is one that's why if we want area to the left we can say one minus area to the right perfect okay so confidence intervals we covered in monday's one so i'm going to skip them now so that we can do the other two templates um but i'm sure you guys can go and fill those in um and you can use the data from the first case study to go and double check if you then paste the data from the first case study into this column a um it should give you your low and upper limits that we calculated the other day okay so now again same thing happens here we need to fill this in so this is the sigma unknown case i'm going to start by writing down my formula for my t statistic t is equal to x bar minus mu naught over s over square root of n so this is now the formula i need to work with and you can see in this case i need to input my hypothesized value manually and i need to input my level of significance so where we get the hypothesized values from our question level of significance also from our question and you can see here we're going to use the sample standard deviation so if we go back to our case study um here let's see what data we're working with we're again going to work with the ages and all of that but let's first go full in some formulas so again sample size we all know sample size is going to use the count function so this is the same as before sample mean again we need to go calculate the average of the values in column a so that's nice and straightforward what function should i be using here for the sample standard deviation perfect standard deviation and where do i get this from the population or from column a column a perfect okay so everything is obtained from my sample then i'm going to input my 23 here but i haven't done the data but i want you guys to see how we would use a template so we're going to do the calculator or put the formulas in and then paste the data which is what you would then basically do so you would set your templates up now get them to work test them with your case study data and once you know all your calculations are fine when you open your practical and you download that new data set you literally need to copy that data paste it into your the appropriate template just make sure you use the right one edit the values that you need to edit and that's pretty much it and then you can go and put your answers and deal with all the theory that goes with this so calculation for d8 that is there's no calculation this value i typed because i know that that's what we're going to enter in the next question so but let's actually take it out for now let's pretend we don't know what it is now my standard error so this is actually my standard error estimate how am i going to calculate that so you can look at the formula i've written down that should give you s over square root of n that's perfect so we're going to equal standard deviation and then we're going to divide by root of my sample size so same actually exactly the same as we did in the previous template so i'm going to show that formula to you guys okay read my t statistic what should i type here and i want you guys to give me the answer in um references not not with the symbols so ngati what you typed in the chat is the um margin of error for a confidence interval not the standard error remember the standard error is the standard deviation of x bar and in this case we're estimating it so if it makes it easier you can also write here estimate of standard error so that would technically be more correct so someone said let's take d5 in brackets e5 minus b8 that would be correct divided by e3 so that is 100 correct that is doing exactly what this formula on the side tells us to and again we don't have the right data in yet so we're going to go and check just now with the data to see exactly what's going on now in my or my degrees of freedom how do i calculate that i think i just said the answer out loud but i want it with references so 25 minus 1 in this case but if exactly b4 minus 1. so we're going to take reference the sample size because remember we're going to paste different size samples in column a so we want to it to adjust automatically to the new sample size there we go and then we need to go calculate our p-value should we now paste the data and check if everything is fine so far so that yeah f4 okay so let's go look at that now if we go to our case study data we are dealing with the ages of the people so i'm going to just copy that and i'm going to go paste it into my template and then if we go back to our question is leia looking at us um if we go back to our question let's go zoom in a little bit you can see here that my hypothesized value is 23. so we repeated the same question so here it is in words uh we repeated the same question but using a sigma unknown case so if i enter there 23 you can see immediately everything here has been calculated correctly so here we have less decimal place space so let's actually just add some decimals here to make it a little bit clearer okay there we go so now it matches exactly so i would also suggest when you want to do your practical on click up first thing you do is go and look at the question and see how many decimal places spaces for each of these things so if i ask you a test statistic and go just edit the decimal spaces on your template so that if you do this repeatedly as well that you always end up with the right number of decimal places um when you type it in that you don't forget to put the right number of decimal places so now all we need to do is we need to go calculate our t statistic ah sorry not a statistical p value so what formula should i be using for my p value for a lower tail test in this situation could i use norm dot s dot test no so you guys will see this so why we using tdist it's a sigma unknown case and our t statistic is already telling us it's a t t statistic so we should be using a t distribution so we're going to use t dot rest and you'll see there's a whole bunch of functions here for some reason excel went crazy and did all kinds of different functions for t distributions not just the one so i'm going to start with the t dist because it returns the left lt distribution there's also one that does the two-tailed one also one that does the right-tailed one but you can actually do everything with just the left-hand one so i'm going to do it with the left-hand one just so that it links up nicely with what we did in the previous template now what do i need to add as my first argument yeah so elandre you're gonna use each but every time you do it instead of having to do the calculations from scratch you are going to literally just um paste and read the answers off so it's it's that easy okay so i'm seeing a 23 and i'm seeing a d4 so what's in d4 d4 is 55 but have a look at what this function says it says test x degrees of freedom cumulative so my value of x in this case what we need here is actually our test statistic so remember same as with the norm inverse where we had the x we would have put our value of the t statistic now we're just putting our value yeah again of the t statistic okay then degrees of freedom we find over there so we're going to reference that and same as the normal distribution functions we just need to add true in there so we can just press f4 on these so yeah the beauty of the templates is that you need to understand the work to be able to set up the template and that's why i want you guys to go through this process but once you've set it up like i often use my templates when i'm checking students work instead of doing all the calculations by hand i actually overwrite formulas in my template and it does all my calculations for me yeah so i'm pressing the f4 just so that when i do this to show you guys what the calculation is that it doesn't move my references so you technically don't need to press f4 there at all um you could have left it as is i'm just doing it so i don't constantly have to move the references okay so that's pretty much it there now upper tail one how can we calculate this one one minus e14 that works or we could have said let's go try the next way we could have said one minus and done the t dot this thing again with rt statistic and our degrees of freedom and the argument true you can see we're getting exactly the same answer let's just go formatted so the two are the same so again i'm using my little trick there and i'm gonna just make it let's make it a few more decimal places and then there's another way you can do it as you said we could have also said t dot best dot rt for right tailed and then all we do is we enter the t statistic and we say degrees of freedom and you can see yet again we are getting exactly the same answer as before so you i've you've now seen three different ways that you could have done this calculation and all of them are fine now if we go to the two-tailed one what should we do there okay we can do the 2t let's start by doing the first one um and let's see if it works if we do the 2t1 so um 2 times the minimum of these two values that would be the way that we generally do it if we go and try it with the t dot this dot two t and we enter that let's see if we get the same answer only way we'll know is if we actually try it and there we go same answer so you can see there's two ways you could have approached this as well so and you could have of course done it like the other way that i did it earlier as well where uh you take absolute values and all kinds of stuff like that but you can see that let's go test this if we maybe just manipulate this so that our statistic is negative we want to see if it still works and you can see now i'm running into issues so this is this function it seems let's go and look at that function the t-dist.2t because remember we want this to work in every single situation so let's go see what the problem is here i can tell you what the problem is but we're gonna go look for it in the help so there's our function all kinds of information this tells us a value if it's non-numeric we're going to get an error then we're going to get an error and if x is negative then this will give us a num error as well so this function doesn't like negative t statistics to be placed in there so if you do want to use this you would have to go and say okay well we need the absolute value of this and then you'll see we end up with the right p p-value again but there's that little trick there with that function let's just go fix our test statistic so we know it's not a negative one and we can go double check we see all of these work out we're getting the right values make all of them for five decimal places then it matches what we have um in the uh case study everyone happy so far perfect cool so and then level of significance again 0.05 let's go and use that and then critical values so how am i going to get my lower critical value in this case so if we don't need to finish the uh proportion one we will just continue with that in tomorrow's uh prac session but you'll see that we'll we'll make good progress on the proportion just now as well so but first you guys must tell me what we're doing with the critical value i'm blonde i don't know okay so there's a t dot inverse okay so now again there's two of them there's a t inverse and there's a t inverse 2t so let's keep that in the back of our mind so there's a two-tailed one there as well but again everything can be done using the t inverse function so t inverse what am i putting in there alpha one minus alpha alpha over two we're just putting alpha not the one below it and then our degrees of freedom so i'm just going to put f4 here just so that i can show the formula and there we have our um critical value let's just try and copy that over and then i'll show the formula and we can go and check quickly yes we are getting the same critical value so we're on the right track here now for the next one how am i going to do this one the upper tail one we know it's t dot inverse i'm gonna have that and then what do i put in here one minus okay so 1 minus d18 you are correct and then we need to add our degrees of freedom so remember when we say one minus t inverse similar then you're actually saying the probability one minus the percentile from the t distribution so you need to know should the one minus the outside we only put it outside our function when we're working with probabilities in this case we're working with percentile so yeah so perfect i'm glad you you recognize that so it is actually quite a common mistake for students to make so i was glad you said that so that i could make that correction or that comment so let's do that okay so then for our upper tail one we can again say t dot inverse and we can say alpha one minus alpha divided by two so that is our first option and then we'll look at the other option and see if we get to the same answer so i just need to add my degrees of freedom as well um so this is our one option for getting the upper tail uh critical value if it's a two-tailed test now if we're using the t dot inverse.2t function you can see what it asks us for is for a probability so in this case i can put alpha and then i can put my degrees of freedom and you will see that it automatically gives me that upper tail one so let's just show you those two formulas so here in this case we're doing it the same way that we would have approached it with the normal distribution here we're using that special function but both give us the same answer and that's what i want you guys to recognize here so 1 minus something we use when it's a probability that we're calculating so something like this where we said one minus that probability so i could have also i think we did typed it over there one minus t dist so we never do one minus something inverse because then inverse is linked to our proportion to our percentiles okay and that's pretty much it so then again confidence intervals we covered that in monday session so i want you guys to go and fill these in and check them for yourselves and let's just go back to this and double check you can see our upper critical value here was fine as well now last one proportions let's get this one going so sample size what function should we use here okay let's strike out and it's a zero but we can see values why is it giving me a zero ah count a so remember function counts the number of cells in a range that contain numbers count a counts the number of cells in the range that are not empty so count a is what we want in this case so i'm gonna highlight that and am i done with this or am i still missing something let's there's a minus one okay so let's add that minus one so the count it will be very difficult to use because there might be quite a few different values so here we have yes's and no's so you can say count if yes plus count if no and all of that but what if you have 10 different things in there you're not even sure what's in there so you want something more generic to do your count and we're subtracting one y heading exactly so the count a function will also count the heading where if you go back to the previous ones we didn't have to worry about the heading because it wasn't counted because it's not a number so if your heading is a number then you're going to run into issues then we need to tell it what our response of interest is so let's actually go back to our case study here so in this one we are interested in the nationality of the players so let's actually start by adding that data into our template so that we can see all the way whether we're on the right track or not so i'm gonna that's not the right one there we go so i'm gonna take my nationality if i can find it where is the nationality i can't remember what column we put it oh right next to hcl see like my dad always used to go to the shops and he would look at something on the shelf and then take the thing next to it and buy the thing next to it which we didn't want i apparently got those jeans as well missing the obvious okay so in this case we want to go and look for england now count our response of interest what function should we use for that so counting the number of englands in in column a without having to sort or do anything there countif okay and count if obviously going to count in a and what criteria should i use what exactly should i be typing here do i type the word england yes i want to reference it so that if i change this it will automatically change uh the reference so yeah so it's going to count something different so if i want to go account spain such as valid right there's one of them italy one of them let's write turkey one of them germany there's three of them so at least there's something different for germany but whatever question i'm giving you if i just type in there i can reference that cell and you can see it will automatically update now my sample proportion how do i calculate that what references should i type in or what exactly should i type 5 divided by d3 so x over um so that is my formula perfect so we have that one and if we go and check here we'll see yes we're on the right track there's seven people from england blah blah everything makes it we also have a hypothesized value given in the question as 0.1 so i'm going to type that in that's a given again in the blue cell and now my standard error let's just go and think about this formula first so if i do my formula formula is going to be z is p bar minus p naught over the square root of what i'm gonna write this but i need to add something in here p naught yes so we are using for hypothesis test we are using p naught in these calculations so i think this is now the end of the the session so i'm gonna just stop thinking and let me just share those formulas with you guys uh and see this is what happens um so now i have to go fix it there we go let's just make it lift so then and we'll show that one so that one is that and that one is and that should be not so let's just do that bit okay so and let me just make it so you can actually see all of this so um what i want you guys to do is carry on with this and then in tomorrow's session we will finish this template and then also take any q a um just keep in mind i want you guys to also please go do the confidence interval and the confidence interval formula write it down before you code it here so then we can help you um because you are likely to run into some issues with this one so we're gonna deal with that tomorrow um and i think any other questions you guys have um you can also just bring tomorrow then um for sigma unknown just again upper critical value so it's the same kind of thing where you have um let's just write that down so if we have a two-sided test we have here alpha over two alpha and this is alpha over 2. so if i use the first formula i just need to say 1 minus the upper bit gives me the lower probability which is what i've put in there and then this one this function automatically knows to give you the positive one so you just give it the probability and it knows to split it into the tail ends okay so thank you very much everyone and then i will see you tomorrow