Intro to Data Wrangling and Analysis Using R

Hello and welcome to Intro to Data Wrangling and Analysis Using R. My name is Dr Emily Nordmann and I will be your instructor for this part of the course. In this video I'm going to do a walkthrough of the data wrangling chapter, where you'll be looking at how to select and filter data and do some of the common cleaning and processing tasks that we all have to do regardless of the kind of data we're using.

You should have R and RStudio installed. The first thing we're going to do for this course is make a new project so that we can keep all of our files organized. To do so, I'm going to click File, then New Project, then New Directory, then New Project. I'm going to call this directory intro wrangling analysis, and I'm going to save it in a folder I have for various R projects. If you click Create Project, it'll take a second to initialize and then it will open up in the new project. You can see at the top here that we've got our project name, intro wrangling analysis, so you can easily switch between projects there, and it has also opened up in the working directory down here.

Hopefully you'll have a little bit of experience using R, even if it's just a little bit, so none of this will be new to you, but it's always good to have a refresher of the basics.

The next thing we want to do is create a new R Markdown document; we're going to do all of our work in R Markdown in this course. We'll call it chapter two. You can call it whatever you want, to be honest, but I'm going to go with chapter two. Do remember to save it: I'm going to click the little blue save icon, again call it chapter two, and it will automatically save into the project directory, which is the working directory here. So I can see on my computer that I
now have the folder with the .Rmd file and the project file in it. Then just get rid of all the welcome text. When you open up a new R Markdown document, it gives you that welcome template every time, and we want to get rid of it because it's not helpful.

The next thing we're going to do is load in the data and the packages that we need. For this chapter we need the tidyverse, which is the package we use for all of our data wrangling and visualization tasks. You'll get a message here that just tells you that it has attached the packages. We're then going to load the package medicaldata, because that's the package that contains the data sets we're going to use. If you get an error message saying it could not find the package, it means that you need to install the package, which you can do using install.packages() with the name of the package.

We're also going to load in the data sets that we need: there's a data set called opt and a data set called polyps, which has been a fun word to learn how to type and spell. When you load in these data sets using the data function, they'll initially come up in the Environment pane with this weird promise thing, but once you click on it, it will load itself in properly. It's always good to eyeball the data whenever you read it in.

The other thing you can do is use the help documentation. If you type a question mark and then opt into the console, it'll give you a couple of options, but it's the data set that we want, which is the one at the top. It gives you a bit of a description: this study is about whether treatment of maternal periodontal disease (lovely, fun topic) can reduce pre-term
birth and low birth weight in babies, so this is a study on mothers. There are a lot of variables in this data set, as you can see. We're not going to use all of them, but it's worth having a quick look through, because the analyses we're doing will make a little more sense if you know roughly what each variable relates to.

You can do the same with the polyps data set. The polyps data set is much smaller: opt has 171 columns, polyps has just seven. This one is about treatment for colonic polyps, so we have the baseline number of polyps for each patient, then the number of polyps after treatment, and which treatment the patient had.

What we're going to do is run through a number of the most frequent wrangling and data cleaning functions that you might use, using a tidyverse approach. This chapter is not going to cover absolutely everything you might want to do with data wrangling, but it should give you the code, the syntax and the style you need to extend your learning to other problems. Once you have the tidyverse style down, it's generally quite easy to look up other functions and figure out how they work, because they're all fairly consistent.

The first thing I'm going to do is create a new code chunk. I have a keyboard shortcut for this, which is set to Control and full stop (period), and which inserts a new code chunk. You can also do it from the menu, using this little code chunk button up here and clicking R. I think on Windows there's a bug where the default keyboard shortcut for adding a new code chunk doesn't work, but if that's the case you can make your own: click Tools, then Modify Keyboard Shortcuts, and search for
chunk. Is it Insert... Insert Chunk, that's it. You can change the keyboard shortcut to whatever you want. I've set mine to Control and period; other options are available.

Okay, so we've got a new code chunk, and we're going to start with the select function, which is the function for selecting variables, or columns. I'm going to create a new object called opt select that contains a reduced subset of the columns. I'm going to use pipe notation here, so I start with the opt data set and then select the columns Clinic, Age and Education. If I run that code, you'll see that we now have opt select, which has the same number of observations, so we haven't done anything to the number of rows. Instead we've taken the three variables Clinic, Age and Education, and it has produced them in that order. If we said Clinic, Education and Age, we would get the same columns in a different order.

You can also do exactly the same thing with the numbers of the columns rather than their names. So rather than saying Clinic, Age and Education, we say columns 2, 4 and 10.
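Selecting by name and selecting by position, as described above, can be sketched roughly as follows. This is a minimal sketch, not the code from the video: opt_toy is a small made-up stand-in for opt with invented values, and it assumes the first ten columns are named and ordered as shown (PID, Clinic, Group, Age, then the columns from Black to Hisp, then Education). The object names are hypothetical.

```r
library(dplyr)

# Made-up stand-in for opt: three invented rows, ten columns.
# The column names and their order are assumptions for illustration.
opt_toy <- data.frame(
  PID = 1:3,
  Clinic = c("NY", "MN", "KY"),
  Group = c("T", "C", "T"),
  Age = c(20, 31, 25),
  Black = c("No", "Yes", "No"),
  White = c("Yes", "No", "No"),
  Nat.Am = c("No", "No", "No"),
  Asian = c("No", "No", "Yes"),
  Hisp = c("No", "No", "No"),
  Education = c("level A", "level B", "level C")
)

# Select three columns by name; the rows are untouched
by_name <- opt_toy %>%
  select(Clinic, Age, Education)

# Same columns, different order in the result
reordered <- opt_toy %>%
  select(Clinic, Education, Age)

# Select the same three columns by position (columns 2, 4 and 10)
by_position <- opt_toy %>%
  select(2, 4, 10)
```

With the real data you would replace opt_toy with opt. Selecting by position depends on the column order never changing, which is one reason names are usually the safer choice.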
Okay, selecting by number produces exactly the same thing as selecting by name. Most of the time I tend to use the column names, because I think it's more meaningful: it's easy to see what you're doing and harder to accidentally select the wrong column. But sometimes, if variable names are really long or have been written in a really faffy way, it might be easier to use the number of the column. I've just realized that I'm not sure whether the word faffy translates from British English, so apologies: faffy means awkward.

I should also note that, as I said, I'm using the pipe syntax here. If I wanted to use regular syntax, I would call select with the name of the data set we're starting from as the first argument, so opt, and then Clinic, Education and Age. I have, just as a personal preference, taken to always using a pipe: the first line is the data set I'm starting with and then I pipe into the function. It's only a personal preference, and if you wanted to do it the standard way, this is how you would do it.

In that example we wrote out each of the variable names that we want to select, but there is another way of doing it. Let's say we want to select the participant ID, Clinic, Group and Age. These columns are all actually together: if we look at the data set, PID, Clinic, Group and Age are in order, which is the order they come in the data set. That's different from the columns Clinic, Age and Education, which were columns 2, 4 and 10 and so weren't consecutive. Instead of writing these all out individually, we can use what's called the colon notation, which is just a colon: you give it the first variable and the last variable, and
what this will do is select the first variable, the last variable and everything in between. So this is going to look exactly the same (I've spelled multiple wrong) as the version with every name written out; it's just slightly quicker. If you want to select a huge number of variables and they're all in order, you can just specify the first and the last, which is a very quick way of doing things.

The other reason select is quite handy is that you can use it to rename columns whilst you're selecting them. Rather than selecting them and then doing a separate step to rename them, you can do it all in one process. In this case we start with opt and select the participant ID column, but then let's say we want to rename the variable Clinic as location. The syntax is new name equals old name. You can also combine the different specifications, so this is going to select PID, then select Clinic and rename it location, and then select everything from Black to Hispanic using the colon notation. If we now look at that, you'll see that opt rename has a column called location, which used to be called Clinic.

What we've been doing so far is selecting columns by telling R which ones we want to keep. Sometimes it's easier to tell R which ones we want to get rid of. It might be that you've got 10 columns and you want to keep nine of them, so it's easier to tell it to drop one than to give all nine names. This works exactly the same as selecting; the only difference is that you put a minus sign in front. So you say: give me the opt data set minus the column location. Oh, we've got an error. What have I done? The column location doesn't exist. Oh, that's
because the column location is the one whose name we changed. It's only in opt rename that it's called location, so if I wanted to drop location I would have to use that data set; if I want to use opt, it should be Clinic. Okay.

Just to note (sorry, I'm having trouble typing and talking today) that the colon notation also works for dropping columns, so we can say drop everything from Black to Hispanic. And it's going to give us an error message up here as well. What we have to do here is put the selection in brackets. Rather than just having the minus sign and then the names of the variables, we need the minus sign and then the selection of variables with the colon notation wrapped in brackets as well, so we've actually got two opening brackets here and two closing brackets there. If we run that, it won't produce an error, and if we look at opt deselect two you can see it has gotten rid of all of the columns from Black to Hispanic inclusive.

It's also worth saying that there are loads of different ways you can select columns, and we've given you some examples in the course materials. You could say give me all the columns that start with the word data, give me all the columns that end with a number, or give me the columns that contain a particular phrase. For example, if you have columns called time one, time two, time three and time four, you could say give me everything that contains the word time. Exactly which functions you need will depend on your data set, but it all follows the same kind of logic.

In those examples we were selecting or deselecting columns, or variables. Now we're going to use the filter function to retain or remove rows of data, so this
isn't about the variables, it's about the observations. Filter is incredibly useful; I think it's probably the function I use the most. We start with the opt data set and then filter it, and you just have to give it the criteria you want it to filter on. In this case, the code will keep all of the data where Clinic equals NY, so New York. We're not saving this to an object; it's just to show you the different options.

So you can give it an exact value: you can say only keep the rows where Clinic is exactly equal to NY. If it's text data that you're matching on, it should be in double quotation marks. We can also use filter on numerical data, so this would give us all of the rows of data where the variable Age equals exactly 20. Not 20.1, not 19.9: exactly 20.
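The select variations and the exact-match filters described above can be sketched as follows. This is an illustrative sketch rather than the code typed in the video: opt_toy is a made-up stand-in for opt with invented values, the column name Hisp for the Hispanic column and the capitalised new name Location are assumptions, and the object names are hypothetical.

```r
library(dplyr)

# Made-up stand-in for opt; values are invented for illustration
opt_toy <- data.frame(
  PID = 1:3,
  Clinic = c("NY", "MN", "NY"),
  Group = c("T", "C", "T"),
  Age = c(20, 31, 25),
  Black = c("No", "Yes", "No"),
  White = c("Yes", "No", "No"),
  Hisp = c("No", "No", "Yes")
)

# Colon notation: first column, last column and everything in between
opt_multiple <- opt_toy %>% select(PID:Age)

# Rename while selecting: new name = old name
opt_rename <- opt_toy %>% select(PID, Location = Clinic, Black:Hisp)

# Drop a single column with a minus sign
opt_deselect <- opt_toy %>% select(-Clinic)

# Drop a range: the colon selection needs its own brackets
opt_deselect2 <- opt_toy %>% select(-(Black:Hisp))

# Keep rows that match a text value exactly (note the double equals sign)
ny_only <- opt_toy %>% filter(Clinic == "NY")

# Keep rows that match a number exactly
age_20 <- opt_toy %>% filter(Age == 20)
```

Note that filter changes the number of rows and leaves the columns alone, while select does the opposite.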
On the other hand, we can give filter slightly more nuanced criteria. Rather than an exact age, we might want to return all the rows where Age is more than 20, or less than 20, and so on. I think what's quite important to know about how filter works is that, in the background, it checks each value against the criterion and asks whether it is true or false. So it asks, for the value in row three, does Age equal 20, true or false? It will keep whatever evaluates as true and drop whatever evaluates as false.

Just like select, you can also tell it what not to keep rather than what to keep. For example, if we wanted to keep everything but the New York clinic, then rather than the double equals sign it's an exclamation mark and then an equals sign. That will give you everything but New York.

You can then build up filter to be as nuanced and as complex as you want, with multiple criteria that you're filtering on. In this case we are actually going to save the result, to an object called BMI diabetes. We start with the opt data set and then use filter, and what we want is all the patients who have a BMI equal to or more than 30 and who also have diabetes, so we want both of those things to be true. It's very easy to combine these kinds of criteria: we say we want the variable Diabetes to equal Yes, and then we want BMI to be more than or equal to 30.
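The comparison filters and the combined criteria described above can be sketched like this. It is a minimal sketch on invented data: opt_toy is a made-up stand-in for opt, the values (including the exact spelling "Yes") are assumptions, and the object names are hypothetical.

```r
library(dplyr)

# Made-up stand-in for opt; values are invented for illustration
opt_toy <- data.frame(
  PID = 1:5,
  Clinic = c("NY", "NY", "MN", "KY", "MS"),
  Age = c(19, 20, 24, 31, 20),
  Diabetes = c("No", "Yes", "Yes", "No", "Yes"),
  BMI = c(22.5, 30, 35.1, 31, 27.3)
)

# Each row is checked against the criterion; rows that are TRUE are kept
older <- opt_toy %>% filter(Age > 20)

# != means "is not equal to": everything except the NY clinic
not_ny <- opt_toy %>% filter(Clinic != "NY")

# Two criteria separated by a comma: both must be TRUE
bmi_diabetes <- opt_toy %>% filter(Diabetes == "Yes", BMI >= 30)

# An ampersand between the criteria does the same thing
same_with_and <- opt_toy %>% filter(Diabetes == "Yes" & BMI >= 30)
```

You can see the true-or-false check directly by running something like opt_toy$Age > 20 in the console, which returns one TRUE or FALSE per row.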
I always get this the wrong way around: it has to be the more-than sign and then the equals sign. If you do the equals sign and then the more-than sign, it doesn't work, so if you get an error you've probably done that, because I do it all the time.

This is really easy: you just give it multiple criteria separated with a comma, or you can separate them with an ampersand, the and sign. Either the comma or the ampersand will only keep the data where both of those conditions are met, so where Diabetes equals Yes is true and BMI is equal to or more than 30. If we run that, we'll see that there are 17 cases that meet these criteria, and you can see that Diabetes is Yes for all of them and the BMI values are all 30 or above.

If you want to do or rather than and, it's slightly different: rather than a comma or the ampersand, you use the vertical bar, the pipe symbol, which is this one here. This feels like something I shouldn't admit in a course, but I can never get the pipe key on my keyboard to work, so I just copied it in from the book. We'll change the name of this object so that it ends in either. So this is going to retain rows where Diabetes equals Yes or BMI is more than or equal to 30.
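The difference between and and or can be sketched as follows, together with the kind of row-count check that catches a misspecified filter. This is a sketch on invented data: opt_toy is a made-up stand-in for opt, and the object names both and either are hypothetical.

```r
library(dplyr)

# Made-up stand-in for opt; values are invented for illustration
opt_toy <- data.frame(
  PID = 1:5,
  Diabetes = c("No", "Yes", "Yes", "No", "Yes"),
  BMI = c(22.5, 30, 35.1, 31, 27.3)
)

# AND: both conditions must be met (>= is "more than", then "equals")
both <- opt_toy %>% filter(Diabetes == "Yes", BMI >= 30)

# OR: the vertical bar keeps rows where at least one condition is met
either <- opt_toy %>% filter(Diabetes == "Yes" | BMI >= 30)

# Sanity check: the OR version can never keep fewer rows than the AND version
n_both <- nrow(both)
n_either <- nrow(either)
n_either >= n_both
```

On the real data the same comparison is the 17 rows from the and version against the row count of the or version.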
So only one of those conditions has to be true. If we run that, we'll see that we've got 239 observations. You would expect the number of observations retained to be much higher when you only have to meet one of the conditions than when you have to meet both of them. It's these kinds of checks that are really important when you're running filter, because it's so easy to accidentally misspecify what you want to keep, so do those sanity checks. A lot of what you're learning is programming, but the thing that will actually make you successful at programming is having that critical eye and making sure that you're checking your data and checking the outputs of what you produce.

It could also be that you want to retain rows that match any of a choice of different values. For example, if we have our opt data set and we want to filter by Clinic, it could be that we don't want to keep just one clinic, we want to keep a number of them. There's a slightly weird notation we use here, which is a percentage sign, then in, then a percentage sign again. What we can do is give it a set of values, and because we've got multiple values being passed to a single argument, we have to wrap them in the c notation, which stands for combine, collect or concatenate. You tell it the values it can match, and they should each be in their own quotation marks if they're text (they don't need them if they're numbers), separated by commas. What this will do is keep anything where Clinic equals New York or Minnesota or, I want to say, Kansas, and I really apologize for my British geography if that's wrong. You can also build up multiple criteria in there, so this will keep anything where the clinic is New York, Minnesota or Kansas (probably) and where Black equals Yes. One
thing that I find a little bit confusing about this notation is how you use it to deselect. In the same way that we said do not keep the row if it's New York up here, you can do the same thing with multiple values. The reason I find it confusing is that where the exclamation mark goes is not completely intuitive: it goes at the front here. So we say filter where Clinic does not equal any of these values, and we can also say where Black is not equal to Yes. This would keep everything that's not New York, Minnesota or Kansas and where Black does not equal Yes, so the reverse of each condition in the previous example. It's just to make it clear that, if you're using this multiple-value selection, the exclamation mark goes right at the front, which I think is slightly less intuitive than it could be.

Filter is incredibly powerful. It's very useful for selecting data and for creating new groups based on multiple different grouping variables. Because of that, I think it's one of the most dangerous functions: it's so easy to accidentally misspecify something, to put an and instead of an or, or to put equals when you meant does not equal, and so on. So always check the output of what you've done. Check that the number of observations makes sense; you might want to run some basic descriptive statistics to check that the numbers in each group make sense for the filtering you've done. This is why it's important to know your data.

The next function we're going to look at is the arrange function. This isn't one that you're probably going to use all that often in R, but it's one of those where, when it's useful, it's really useful for sorting the data. It might be that you work in Excel, and I always find in Excel that I end up sorting things by columns a lot more frequently
than I do in R. It's ideal if you're sorting the values. For this one we're going to switch to the polyps data set, again keep using the pipe notation, and use the function arrange. For example, if we put in one argument and say arrange by sex, it will group the rows together in alphabetical order: it has arranged all the female participants first, followed by all the male participants. You can also really easily add in another criterion, so we say arrange by sex and then by baseline. If we do that, it's still arranged female and then male, but within that it has arranged the rows in order of the value of baseline.

That was arranging in ascending order, so it started with the lowest and went up to the highest. If you want to arrange in descending order, you wrap the variable you want sorted descending in the function desc, for descending. If we do that, it's exactly the same, female and then male, but baseline now starts with the highest value and goes down. Like I said, it's not a complicated function, but it can be quite useful, particularly if you're trying to identify the highest score or something like that.

The next function we're going to use is the mutate function. Mutate is very useful and very powerful, and you're going to use it commonly, because what mutate does is create new columns or overwrite existing ones. You might overwrite a column to change the type of data, say because R thinks it's a number and actually it's a factor, or you might be creating new scores; there are loads of different uses of mutate. Quite often mutate will be used in conjunction with other functions, so you'll create a new column using mutate, and the creation of that column also involves another function, and
there's an example of that coming up in a bit. For now, we're going to create a new data set called polyps 2. We start with the original polyps data set and then use mutate to create new columns.

Just to quickly go back, because it's important that you understand the data set: we have a row for each participant. We've got their ID, sex and age, how many polyps they had at baseline when they first came in, and the treatment they then got. Then we have the number of polyps they had after three months, and then, let's say there was a second round of treatment, the number of polyps they had after 12 months of treatment. Hopefully the idea is that baseline is the highest number and that, with treatment, the number of polyps after three months and after 12 months would go down.

So we can create new variables here using mutate. We're going to create a variable called treatment one, which tells us how much the number of polyps changed between baseline and three months. This is basically what effect treatment number one had in reducing the polyps. We can then do the same thing for treatment number two, where we look at the difference between 12 months and three months, which should be the effect of the second treatment. We can then also create a column called total, which is the total change in the number of polyps with treatment one and treatment two combined.

What you'll notice is that these variables don't exist in the original data set. We're creating them here and then using them in that same function call, which is absolutely fine, but you need to remember that it reads things in order, so you have to create treatment one and treatment
two first in this line of code before you can use them as variable names here.

The other thing we're going to do is create a column called treatment using the function paste. This is one of those where it's easier to show than tell. We're not actually creating a new column here, because we already have a column called treatment. Instead, if we look at polyps 2, we can see that in treatment we've now pasted on the word condition, so now it says sulindac condition or placebo condition. We didn't really need to add this information; it's just to show how paste works.

We also now have treatment one and treatment two, so we can see that participant three had three fewer polyps after treatment one and then an additional two polyps removed after treatment two, for a total of five polyps. You'll also notice that there are a couple of bits of missing data there. That's going to cause us a problem, but we will come back to it in a bit.

As I said, we can also use mutate with other functions, and we can use it a bit like we used filter, to create new values based on criteria. We're going to stick with the polyps 2 data set that we've created and create a column called improvement. I'm missing an equals sign there. What this is going to do is create a column called improvement that evaluates whether or not the column total is more than zero. If it is more than zero it will return a value of true, and if it's zero or less it will return a value of false.

The other thing we're going to do is overwrite treatment as a factor. At the moment, if you run summary on polyps 2, you'll see that R thinks treatment is a character: it thinks it's just text. Whilst it is text, it's more informative than that. This isn't just any old text; these
represent distinct groups, so it's a factor variable. What we're going to do with this code is overwrite treatment with treatment as a factor.

If we run this... okay, I've realized I've done this the wrong way around. Because of the way we've set up our numbers, a negative number equals improvement: a total of minus six means that you had six fewer polyps after treatment, so that's a good thing. So what I actually want to say is that there is improvement if the total is less than zero, because that means you've got fewer polyps. You can see here that we now have this column called improvement, and we can see that there was an improvement because, for example, this person had 14 fewer polyps at the end of the treatment. And if we run summary again on polyps 2, now that the treatment condition is a factor, it doesn't just tell us that it's character data; instead it tells us how many people are in each condition. Awesome.

In addition to doing it the default way, which gives you true or false, you can specify the criteria and what the new value should be for each. In this case we're going to create a new data set, polyps 3, and take the data from polyps 2, with those new variables we've created. Again we're going to use mutate, and this time we're going to overwrite the column improvement, using an additional function nested within mutate to do it. The function case when is really helpful here. I'm going to say when the total is more than zero, and remember that a positive number is a bad thing, we want to say that this represents a decline in the patient's condition: they're
getting worse, they've got more polyps. Oh sorry, I've just realized I missed out this little tilde here. So it's the criterion, that the variable is more than zero, then the tilde, then the label that you want attached to that, then a comma, and then the next criterion. If the total equals zero, that means there's been no change. Finally, if the total is less than zero, so they've got fewer polyps than they started with, that represents an improvement. What this will do is that, rather than the improvement column just having true or false in it, we now have something a little more sophisticated: we've given it a label depending upon the criteria.

We can also use case when to recode variables, and this is something I use relatively frequently. We're going to start with the polyps 3 data set that we just coded, again use mutate, and create a new variable called category, again using case when. What we're going to do here is create a category that puts people into groups depending upon the number of polyps they had at baseline. You can see if you look through this data set that some people came in with very few polyps to start with and some people had a huge number of polyps; there are actually a couple of outliers. So we're going to say that when the baseline was less than or equal to 10, they're in the low category. However, if the baseline is more than 10 but also less than or equal to 30,
then we'll say it's medium, the medium group. This is where lining up your code using the enter key is very helpful, because it shows that you're keeping everything inside the right brackets. One of the things I find quite difficult with case when is that, when you have lots of nested arguments, making sure you didn't accidentally go outside one of the brackets or delete a bracket can get a little tricky. If you can see your code lining up, I think it helps. Finally, if it's more than 30, which is basically the rest of them, it'll be high. If we look at polyps 4 now, we've got this category column, and looking at baseline we can see that everything up to a baseline of 10 is categorized as low, everything above 10 and up to 30 is medium, and anything above 30 is high. So you can use it like that as well.

Finally, in terms of mutate, just like filter, you can use it to combine multiple variables, so again it's very powerful and very flexible. In this case we're going to create a new variable called risk, which combines several different types of information to tell us whether or not the person is at high risk. We're going to say that if the category is equal to high, then they're high risk. So regardless of anything else, if the baseline number of polyps was high, we're going to label them as high risk. But then we could also say that if their sex is male and their age is less than 25, then they're also high risk. Obviously there are a lot of other combinations. So what we're saying here is: if you've got a high number of polyps at baseline, you'll be categorized as high risk; if you're male and your age is less than 25, you'll also be categorized as high risk; and everybody else we're just
going to categorize as not high risk. Obviously, if there are any physicians who actually know about this, these are just arbitrary things that I have made up here. What we can see now in that risk column is that we have these patients that have been labeled as high risk, and they were all actually in the high category there. Okay, so you can combine multiple different criteria depending upon what exactly it is you want to do, but do go and check the data and make sure that it has actually done what you want it to do. So actually, a good check for this one, just to make sure that everything has gone right: let's take one of the not-high-risk patients. So this woman is under 18, so let's just change this and put another condition in there, so if age is less than 18. Right, yeah, so there's a 17-year-old, okay, no, so that should be high risk. Okay, aha, is it, say, the pipe? I've put in an "and" rather than an "or". It's always good to go and check these things. Okay, so let's have a look at our high risk now. The female is still coming up as not high risk. What have I done? So I had to pause the video there for a second to figure out what it is I'd done wrong. I'm going to leave this mistake in, because I think it's useful for you to see that even people who are teaching the course sometimes make mistakes. The problem there was not that I had used a comma rather than the pipe or anything like that; it was that I had put in Female with a capital F, and of course if you look at the data set, there are no people in that data set that have the value Female with a capital F, which is why it wasn't coming up. So if we run this then and have a look, what we can see is that our woman who's under the age of 18 is also now categorized as high risk, even though
her category isn't high. Okay, so yes: always know your data set and always check it. Give yourself tests to actually see whether or not you have done what you intended to do. Basically, don't be like me in the last five minutes. Okay, so what we're going to do now is some descriptive statistics, and just run through a bunch of different functions. The code for all of them is fairly simple, and once you've got the syntax down it's very easy to extend. So a really useful function is count. We just give it the data set and then literally pipe it and say count, and if you do that, what it will produce is a count of the number of rows in that data set. So this is just telling us that the n is 22: it has 22 different rows. But you can also use it to count the number in a specific variable, and obviously this would work best with a categorical variable. So we can say: data set, count how many are in each treatment condition. In this case we've got 11 in the placebo condition and 11 in the sulindac condition as well. Just like most of the other functions that we've come across, you can add in multiple grouping variables, so we can say count treatment by sex, and this will give you the number of each sex (I assume it's sex assigned at birth; it doesn't say) by each treatment condition. And it's just worth highlighting here that if you were to change the order of these variables, then it would change how the table is displayed. The data is the same, but in the first one you get the condition and then sex, whereas in the second one you get sex by condition. So basically it's whichever one you think is most helpful for the data that you're trying to present. The next function is summarize, and summarize is used to produce descriptive statistics, but this is where those missing values are going to come in. So I'll
show you how to write the code to use summarize. So we can just type in summarize here. I should note that summarize will work with either the British spelling with an s or the American spelling with a z. The urge to use the s is obviously quite strong, but it will work with both. So summarize is kind of similar to mutate, in that first of all you give it the name of the column that you want to create. This is going to create a table of descriptive statistics, so we give it the names of the columns. So we say, right, we want to create a column called mean polyps that tells us the mean number of polyps, and we can just do the same there for median polyps, and min as well, so min polyps. There are a bunch of built-in functions, so median, mean, min, max, standard deviation and so on are all actually built into base R for you to use. So this is giving us the mean of total; if you remember, total was the total reduction, or the change between baseline and treatment two. Okay, so if we run this, what we're going to get is a table full of NAs, and that's because we have missing data. And the problem with missing data is that it seems unintuitive that R would give you this NA, this "don't know", because, you know, we look at this data set and there's all the data there, there's all the numbers. But the reason it makes sense is because if you say, what's the average of 100 plus "I don't know", then the average is "I don't know"; it's not a hundred. So R is just being conservative: it doesn't want to guess, or to ignore something that you haven't told it to ignore. Okay, the first thing we want to do is actually have a look at where our missing data are. Now, this is actually a really small data set, so you could just go and click on it and have a look, and we can see here that we've got two missing values down here. But if
you have a much larger data set, you're going to want to do it with code. And what we're going to do here is go back and use the original polyps data set, so before we changed anything, before we added anything on. So again we can use summarize, and this time what we're going to do is create three variables. One is called missing baseline, and that is going to give us how many values are equal to NA. The syntax here is sum, and then in brackets is.na, so sum up how many values in baseline are equal to NA; that's what that's doing. And then we can do the same thing for the three-month variable, so is.na of number 3m, and then again missing at 12 months, with sum and is.na. Okay, and if we do that, it'll produce this table here. So we can see there are no missing values in the baseline variable, there aren't actually any missing values in the three-month variable, and it's the 12-month one: there are two missing values in that 12-month variable. And even though there are only two missing values, it's causing us a huge number of issues. To show why, I'm just going to copy and paste this code from the workbook. If we look at polyps 5, so that's the one that we created, and then the variables that we created, so treatment one and treatment two, we can see that we don't have any missing data in treatment one. If you remember, treatment one was created by subtracting the number of polyps at three months from baseline, and those two variables don't have any missing data. Then missing treatment two: treatment two was created by subtracting the number in the 12-month column from the three-month column, but of course there are two missing values in the 12-month column, so you get that there. And then missing total is two, because the total column is created by using treatment one and treatment two, and treatment two has missing values in it. So you can
see that it's that thing of: you really need to look at your data sets right at the very beginning, because the problems will just follow through and potentially expand. There are a few different ways you can deal with missing values, and you might also have some statistical techniques that you want to use, in terms of imputation and stuff like that. We're just going to give you some relatively simple ways of dealing with it, but you should be able to extend the code relatively easily once you decide what you want to do. So we're going to create a new data set now called polyps 6, and as I said, we're going to go back to using the original data set from before we added anything on, because we need to get rid of these issues right back at the start. One way we can get rid of them is to use the function drop_na, and we can say we want to get rid of all the rows of data where there is a missing value in that 12-month column. And then, and I'm just going to copy and paste this in again for time, we're going to add on exactly what we did before. But this time, because we've now dropped those missing values, polyps 6 just has 20 rows of data. It used to have 22, but we've now got 20 rows, and because they're all complete you can see we've got numbers in every single one of those. What that means is that if we were then to run our summarize code, so this is exactly the same code as we ran up here when we got all those NAs, because we're now running it on polyps 6, which is a complete data set, we actually get the mean number of polyps, and I realize that that one should be median. Okay, so the mean, median, minimum and maximum number. So you can see how this tiny little thing actually has quite a big impact. The other thing that we can do is that, rather than dropping
the missing numbers here, which is basically just taking those rows of data out, we could instead use the data set that has the missing values in it, and you can add na.rm = TRUE here, and this will just ignore the missing values for each individual calculation. So rather than actually creating a data set where they're dropped, you can do it this way instead. Okay, the final option, and it's approaching imputation, and I'm just going to copy and paste the code in here, is to replace the NAs with a different number. In this one we're going to create a data set called polyps 7, and this is going to overwrite the column number 12m, so that's the one with the missing data, and it's going to replace any missing values in that column with the number zero. Now, for this data set it actually doesn't make any sense to do this, because zero, you know, suggests that there's been no change in treatment. It would be inappropriate to do this, but I just want to show you the code. And you could also do something like overwriting it so that it replaces any missing values in that column with the mean of that column, so mean imputation. Again, this is really inappropriate for the data that it is, but in terms of learning the code, hopefully it's useful for you to be able to see that. So the final function we're going to use, before we finish off with rounding, is group_by. We've already kind of used group_by, or something similar to it; we used it on the Intro to R course, if you were a part of that. But it is worth showing you how to do this in the context of everything that we've done in this chapter so far. So what group_by does is perform the function on each level of
the group separately. It's easier to show than tell, as it is most times. So what we're going to do is take the data set polyps 5 and group it by sex. What that means is that whatever function we specify next will be done multiple times, separately for each group that's in our sex variable. So we can then just put in summarize, and we say give us the mean of total, with na.rm = TRUE, and this will produce the mean, but separately by sex. And group_by is really powerful, because all you have to do if you want to add in another variable is just add it to the group_by. So again, it's very similar to the count function. We can now produce the mean, and we can see that for women in the placebo condition, on average, they gained 6.75 polyps, but for all the others there was a reduction. So it's really, really easy just to add in other grouping variables and break your data down by lots of different conditions. Do just be aware that, just like count, the order that you put these in does matter: same numbers, but it'll be laid out slightly differently. But it's not just summarize. You can use group_by with lots of other functions, basically any other function in the tidyverse, really. So there's quite a nice function, slice_max, which will give you the top three values, or the top number of values that you specify. So what we're going to say is give us the top three reductions in the number of polyps. We're looking here at the total, so these are the top three: one participant had a reduction of 274 polyps, the next highest was 135, and the next highest was 90.
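As a rough sketch of the group_by steps just described, the code below uses a small made-up tibble rather than the course's polyps data. The name toy and all of the values are invented for illustration; only the column names sex, treatment and total follow the ones mentioned in the video.

```r
library(dplyr)

# Hypothetical stand-in for the polyps data; the values are invented
toy <- tibble(
  sex       = c("female", "female", "male", "male", "female", "male"),
  treatment = c("placebo", "sulindac", "placebo", "sulindac", "placebo", "sulindac"),
  total     = c(-4, 12, 30, NA, 8, 20)
)

# Mean of total for each sex, ignoring the missing value
by_sex <- toy %>%
  group_by(sex) %>%
  summarise(mean_total = mean(total, na.rm = TRUE))

# Adding a second grouping variable breaks the data down further
by_sex_treatment <- toy %>%
  group_by(sex, treatment) %>%
  summarise(mean_total = mean(total, na.rm = TRUE), .groups = "drop")

# The three largest values of total across all rows
top3 <- toy %>%
  slice_max(order_by = total, n = 3)

# The largest value of total within each treatment group
top_by_treatment <- toy %>%
  group_by(treatment) %>%
  slice_max(order_by = total, n = 1)
```

Swapping the order of the variables inside group_by changes how the summary table is laid out, not the numbers in it.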
You can actually see that there are some fairly substantial outliers in this data set from looking at this. But this is just across all participants. For this particular data set, what we might actually want to look at is: okay, but what was the reduction depending upon the treatment you got? And so what we can see here is that the top three for the placebo condition were 274, 14 and 8, whereas the top three for the sulindac condition were 135, 19 and 10. So it'll do it by groups as well: everything it does, it does separately depending upon the grouping variable that you've given it. The final thing we're going to do is just have a look at rounding in R, because rounding in R is one of those things that you don't realize is a problem until it causes you a big problem, and I say this from personal experience. So we're going to go back to using the opt data set, and I'm just going to copy and paste this code over. So we've got the opt data set, and what we're going to do is just select this column, so this is pocket depth. We're just going to select this one column here, and the reason I'm going back to this data set is because we need some decimal places. You can see that there's a bunch of data here to three decimal places. We're going to create a new column now called BL.PD.avg.1, and that is going to be a column where this data is rounded to one decimal place. So it's quite easy: the function is round, you give it the name of the column you want to round and then how many decimal places to round it to. If you just give it the name of the column and don't tell it how many decimal places, it'll round to a whole number. So if we run that and have a look at opt 2, what you can see is we've got this one rounded to one decimal place and this one rounded to a whole number. There is a quirk of R, though, and to show this what I'm going to do is create a new data set called opt 3, and it's just going to take out
the ones where the values that were rounded to one decimal place are on the .5, so we're just going to keep the values at 1.5, 2.5, 3.5 and 4.5. And then we're going to round just those numbers. So if we have a look at opt 3, we've just got the .5s here, and what you can see is that 2.5 has rounded to two, so it's rounded down; 3.5 has rounded to four, so it's rounded up; and, where are we, 4.5 has rounded to four, so it's rounded down. And this is because R rounds to the nearest even number. So for 2.5 the nearest even number is two, 3.5 is going to round to four, and 4.5 will round to four as well. The reason for this is that .5 is exactly halfway between two numbers, so there's no reason it should always round up. The problem is that many of us use rounding up from .5 as a rule. So for example, when I'm working out grades at the University of Glasgow, we use the .5 rule to round up; it's very, very commonly used. To get around this, what you have to do is redefine the round function. In the course materials there's this function code here, and what we'd recommend is putting it in at the top of your script, where you load in the packages. If you run this, in the environment pane you'll then see we've got this Functions bit here that says round, and that means that you've defined your own function. So rather than using the round function that's built in, we've actually changed how it works. So now if we run this and look at opt 3, it will round as we would expect: 2.5 rounds to three, 3.5 rounds to four, 4.5 rounds to five, and so on. If you want to get rid of that function, then you use rm and then round, which will work for any object in the environment, and then it would go back to using the initial one. But it is one of those things where you don't realize it's a
problem until it's quite a big problem. So that's the end of the data wrangling chapter. There is simultaneously both a lot in this chapter and also so much more to do. What I strongly recommend is to find a data set of your own that you know well, that is meaningful to you, and try to replicate each of the functions that we've done in this chapter, and then start thinking about what else you would have to do to clean up the data set in the way that you'd like. With the functions that we've given you from the tidyverse, you should be able to extend your knowledge of the syntax and the way that the code works. But if there's anything that you would like to know that you're not sure of, then please do post on the help and discussion forum, and I'll see you in the next video.
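To illustrate the rounding behaviour described in this chapter, here is a minimal sketch. The helper name round_half_up is hypothetical and this is not the function from the course materials; it is given its own name here so that base R's round is left untouched.

```r
# Values that sit exactly halfway between two whole numbers
halves <- c(1.5, 2.5, 3.5, 4.5)

# Base R rounds halves to the nearest even number
base_rounded <- round(halves)

# Hypothetical helper (an assumption, not the course's code) that
# always rounds halves away from zero
round_half_up <- function(x, digits = 0) {
  scaled <- abs(x) * 10^digits
  # The small tolerance guards against values stored just below .5
  sign(x) * floor(scaled + 0.5 + sqrt(.Machine$double.eps)) / 10^digits
}

half_up_rounded <- round_half_up(halves)

base_rounded     # 2 2 4 4
half_up_rounded  # 2 3 4 5
```

If you do define a function called round yourself, as in the video, rm(round) removes it from the environment and the built-in version is used again.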
