Transcript for:
Essential Guide to Data Analysis in R

Hey everybody, this is a short but, I think, shockingly thorough introduction to data analysis with R. One of the amazing things about R, I think, is how quickly you can just jump in and get stuff done, even if you don't have any programming experience. My whole YouTube channel is dedicated to data analysis with R; I encourage you to subscribe. I'll throw in links to different videos going into some of these topics in more depth as we go through.

If you haven't already, you'll need to install R. I strongly recommend that you also install RStudio. R is the programming language; RStudio is the front end that almost all of us use when we're working with R. If you just Google "RStudio" or "RStudio Desktop" and follow the installation links, you'll wind up at a page like this, where it prompts you to install both R and the front end, RStudio. All of this is free and generally pretty hassle-free to install. Once you've done that, open up RStudio. That is the only one you'll ever need to click on; you don't need to interact with the actual R icon directly. You'll get something like this.

There's a lot going on, but when you're first starting out, if you want, you can just view it as a graphing calculator and do things like five plus seven, or the absolute value of negative 17, and get answers out. You can also work with variables here, so we can assign x to be the value, I don't know, negative 12, and you can see that that value x is now stored as negative 12. I use a left arrow for assignment; that's encouraged in R. An equal sign will also work, but for some technical reasons we generally don't do that. Then you can do your usual operations on that variable, so I could do x plus 7 to get negative 5, or I could do the absolute value of x. You can also assign variables to be entire vectors of values, so let's let y be equal to negative 12, 6, 0, and negative 1.
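The console steps described so far can be sketched as follows. This is a minimal sketch rather than the exact keystrokes from the video; in particular, `c()` is assumed here as the way the vector is built, since it is the standard base R function for combining values but is not named aloud.

```r
# Use the console like a calculator
5 + 7
abs(-17)

# Assign with the left arrow, then operate on the variable
x <- -12
x + 7     # -5
abs(x)    # 12

# c() combines several values into a single vector
y <- c(-12, 6, 0, -1)
y
```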
You can see now I have a value stored for y, but there are actually four numbers in there, so it's sort of an ordered n-tuple: negative 12, 6, 0, negative 1. And I can do operations on y, on this vector. For instance, I can double the whole thing; notice that's happening component-wise. I can apply functions to it, like the absolute value of y, and expect those to be done component-wise. I could take a sine, a tangent, an exponential, whatever I like.

If you're watching this video, though, you're probably not so much interested in the programming aspects and more in the data aspects, and that makes sense: R is fundamentally a language set up for working with data. So let's import a data set and take a look at it. Here in the lower right I have a file browser, so I'll click on that, and you can navigate around your machine until you find the data set that you want to work with. If you have an Excel spreadsheet, for instance, or a CSV file, importing a data set is extremely easy: find it by following these breadcrumbs, clicking on the folders that you want, then click on the file that you want and go to Import Dataset. That'll pull up a window with lots of different options that you can more or less ignore when you're getting started; just click Import.

Okay, so a few things have happened. Before I talk about any of them, I just want to mention what this data set is that we're looking at. This is the Scooby-Doo database, and I got it from the TidyTuesday project. Every week, TidyTuesday posts a new and interesting data set for us to practice our R skills: data cleaning, visualization, and analysis more broadly. You can also, of course, work with these data sets in other languages, but this definitely revolves around the R ecosystem. I strongly recommend checking out TidyTuesday; the Scooby-Doo data set was featured at one point or another.

Okay, back to the code. The importing is actually happening here in the second line, the read_excel command, and you can
see inside it has found the file in question (you can see the file path there), read it in, and assigned it to the variable named Scooby. If you look in my Environment tab up here on the upper right, there's now a data set: 549 observations (rows) of 75 variables (columns). This line here, the View command, is what actually opened up that data set so that we could see it. It just put it in the viewer in an interactive way, and we can scan through it and see some of the variable names: who caught the villain (Velma, Shaggy, and so on), did Shaggy get a Scooby Snack, stuff like that. Lots of fun stuff here.

This first command, library: let's talk about that for a second. R is an old language, but over time it has been expanded and developed by its large and vibrant community of users. The add-on sets of functions that they have created, and which are available to us, are called packages, and this library command is opening up a package of functions called readxl that gives us some additional functionality for working with Excel spreadsheets. You can think of this package, readxl, like an app on your mobile phone; the library command is opening that app and giving us access to its functions. But if you haven't already installed that app, if you haven't already installed this package of functions, R's not going to know what to do with this library command; it won't know how to open the app. So you need to start by installing it with install.packages, parentheses, quote, readxl. I won't actually execute that because I already have it installed, and actually I already have the package loaded, so that's not something I need to do. You only need to install the package once, but every session, every time you open up R and want to use it, you have to library it; you have to actually open that app.

Okay, so lots of interesting stuff here, lots of interesting variables. We might want, for instance, to get some summary information on some of these variables, like, for
instance, what's the average run time of all the episodes in this database? Not surprisingly, that's a mean command. Now, I want to get the average of this column in the Scooby data set, and the syntax in R is to first name the data set, so Scooby (notice the autocomplete suggestion; you can use either Tab or Enter to accept it). So I want the mean, in the Scooby-Doo data set, of the runtime variable, and I'll specify the column I want with a dollar sign. Then if I start typing "runtime", I can select the variable that I want, or just type it entirely. A little more than 19 minutes for the average run time of all these episodes.

Let's do the same thing for IMDb, the average IMDb rating. We'll see there's a little bit of a subtlety. Notice how fast I was able to key that in: I used the up arrow to get to the previous command, and if I want to get two commands back, I hit the up arrow twice. This can save you some typing time. So up arrow, then backspacing over runtime, and I'll replace it with IMDb. When I execute that to get the average IMDb rating of all these episodes, I get an NA, and the reason is, if I go all the way down to the bottom here, there are some NAs in this column, some missing values. There's literally no data for more recent episodes of Scooby-Doo in here; they just didn't have IMDb ratings when this set was made. So when R tries to figure out what the average IMDb rating is, it just doesn't know, because these NAs could be very high or very low, and so those NAs propagate. The mean command (I'm using the up arrow here), like many functions in R, has optional arguments, and I'm going to put in one of those optional arguments right now: na.rm, remove the NAs, is TRUE. You can leave out na.rm and it will default to FALSE, leave them in; here I'm overriding that and saying take them out. The average IMDb rating in this set, for the episodes that actually have IMDb ratings, is 7.34.

Okay, now we've already done a bunch of commands
here, and things are starting to get a little bit busy. If you're doing a fuller data analysis, going line by line like this can be problematic: you can lose track of your work, and it's also more difficult to recreate things later on. So what we'd like is a document that actually contains all of this. In this video we're going to see two ways. The most fundamental, though, is a script, so I'm going to go up to this little piece of paper here and get New R Script. That's going to open up a new tab right next to my data set that's essentially just a text file, and the idea is that we can code line by line here and then save this file later on, just using the disk icon, and save it anywhere we want. So, for instance, I could put in a library command like library(readxl).

Going forward, I'm going to want to use an entire ecosystem of packages that have become very popular over time, and that is the tidyverse family of packages. These are produced by Posit, the same company that makes this front end, RStudio, and were largely developed by their chief data scientist, Hadley Wickham. The tidyverse family of packages has really revolutionized R and data analysis just over the last decade. So I'm going to execute that. If I just hit Enter right now, nothing's actually going to happen except I'm going to get a new line. This is literally a text file, so R doesn't know that I want to execute that code; it just thinks I want to go to a new line to write some more stuff. If I want to actually execute the code, I have to hit Command-Enter on a Mac, which I'm on, or Control-Enter on a PC, and then that will send the line of code down to the console and execute it.

The tidyverse consists of eight core packages; you can see them listed here. They all have some pretty important purposes in R. I would say that ggplot2 and dplyr in particular have become absolute standards in R
programming. If you are using R today, you absolutely have to know those two. The others have been largely adopted as well, and it's certainly worth learning them all. In this video I'm going to talk about those two packages a little bit; I don't think I'll get into any of the others particularly.

Okay, the first command I want to show you, other than loading up packages, or the first helpful tip I want to point out, is this data command. I'm going to hit that, and when I do, it's going to show me a long list of data sets that are built into R that you can use to practice some of the skills that I'm going to be teaching: some that are just built in with base R, that come no matter what else you do (Titanic is a famous one), and some that load up with these other packages. So, for instance, if I scan down far enough: data sets in the package dplyr. These are some data sets included with that dplyr package, which was included in the tidyverse, that you can use to work on your data wrangling; then ggplot2; and so on.

In the next few examples I'm going to be using the mpg data set, so I could do View(mpg). We've already seen that command when we imported the Scooby data set. I'll hit Command-Enter to actually execute it, and you can see the command was sent down here, and now the mpg data set gets opened up in the viewer. I'm going to close up a couple of these other windows for neatness.

Okay, so I don't know anything about the mpg data set, and I want to learn a little bit more about it. When you want to learn more about a function or a data set that is either built in or loaded in with a package, you can ask about it with a question mark, in this case ?mpg. When I hit Command-Enter, that will open up something in this Help tab. In this case we see we have fuel economy data from 1999 to 2008 for 38 popular models of cars, and then we have a little bit of a data dictionary telling us more about it. We can get
help files on all sorts of functions, as I mentioned; for instance, mean, to get the arithmetic mean, and you can see the sorts of arguments we might use as well as some of the optional arguments.

Okay, one other command I would like to show, which I think is very helpful when you're encountering a new data set, is the glimpse command. You just feed it the name of the data set, and the glimpse command (if you look down here; let me make this a little bit more visible for the moment) gives you a top-level overview of your data set: how many rows, how many columns, what are the variable names (remember, after a dollar sign R is typically specifying a variable name in a data set), what are the first few values, and what sort of variable is it. Now, if you have programming experience, you know that different variables can have different types, and you've probably had to suffer through a fair amount of information on different data types. In R that certainly exists; however, we are able to suppress some of the technical stuff in our data analysis and just acknowledge that data can be either categorical, like Audi or A4, or quantitative, like 1.8 or 1999.
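The help and glimpse steps above can be sketched as below. This is a minimal sketch that assumes the tidyverse is already installed; the help calls are left commented out because they open a help page and are meant for interactive use.

```r
library(tidyverse)  # loads dplyr (glimpse) and ggplot2 (the mpg data set)

# Open the help page for a data set or a function (interactive sessions)
# ?mpg
# ?mean

# Top-level overview: rows, columns, variable names, types, first few values
glimpse(mpg)

# A single column is reached with the dollar sign
mean(mpg$cty)
```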
Now, at a deeper level of sophistication, there are decimals, doubles, and integers; there are factor variables versus character vectors. But another nice thing about R is that, for most purposes, you don't have to think too hard about that, so I'm not going to say anything more about it in this video.

Okay, so a very common data analysis task that you might have on a set like this is to subset it by rows. For instance, I might be interested in only the front-wheel-drive cars, or only the cars that have city mileage of at least 20 miles per gallon. So let's do both of those things. The fundamental command we're going to use to subset by row is the filter command, and of course we can learn more about that with ?filter. In fact, maybe I'll just do that: ?filter. I want this one: "Subset rows using column values". So, for instance, I'm going to start out by getting the cars whose city mileage is at least 20 miles per gallon. The first argument here should be the data set, so that's mpg, and then I have to specify the condition I want: city mileage should be at least 20, so greater than or equal to. When I Command-Enter this, it's not going to be exactly what I would hope for; I'll explain why. Let me get a little better view on this. Okay, so what happened is it just printed it out. You'll see I now have 56 rows, as opposed to many, many more that I had before; if I go up and look at that glimpse command, I can see I had 234. What I would really like to do is take this filtered set and save it as a new value, so maybe mpg_efficient, and then be able to do operations with that. So I'm going to copy and paste. For instance, just right off the bat, maybe I want to View that mpg_efficient, and now here in my viewer I can see that all these cars have city mileage of at least 20.
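The difference between printing a filtered set and saving it can be sketched like this. It is a minimal sketch that assumes the tidyverse is installed; the `View()` call is commented out because it opens the RStudio viewer and only makes sense interactively.

```r
library(tidyverse)

# Printed only: the filtered rows are shown but not kept
filter(mpg, cty >= 20)

# Saved: assign the result to a new name so later steps can use it
mpg_efficient <- filter(mpg, cty >= 20)

nrow(mpg)            # rows in the full data set
nrow(mpg_efficient)  # fewer rows after filtering
# View(mpg_efficient)  # opens the RStudio viewer (interactive only)
```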
Great. Let's do one more filter. Let's take mpg, let's call this Ford, and let's do a filter. So that was that manufacturer, I think; yes, manufacturer should be, quote, ford. Now, if I hit Command-Enter right now, I'm going to get an error: "We detected a named input". Remember when I was doing an optional argument on that mean command earlier, which was pretty far back (let's see, should I even try to find it? It's up here somewhere. Yes): I named the argument, na.rm equals TRUE. So here R is looking for an argument called manufacturer. That's not what I mean; I want a value of a variable. So the equal sign that I'm looking for is actually different from the equal sign that R thinks I mean. To specify this kind of logical equality, I want a double equal sign, and now that'll work. And I misspelled it: manufacturer. Now it will actually work. Let's just take a View of that, mpg underscore... there it is, and you can see now it's all Fords. Great. Neaten that up a little bit.

I think the next most common data task that you might have is to add or change a column in a data set, so I'm going to do that. One of the things I notice in this set is that the units of measure are standard, miles per gallon. I know many people in my audience will be unfamiliar or less comfortable with miles per gallon than with, for instance, kilometers per liter. So let's take the mpg data set and add in a new column that has the city mileage in that new unit of measure. The command I'm looking for is mutate. mutate is going to add or change a variable in my data set. As with filter, the first argument should be the name of the data set, and after that I need to specify the name of the column that I want to add or change; in this case it's going to be cty_metric. And I need to specify a formula for this new column. So I Googled the conversion factor for converting miles per gallon to kilometers per liter, and it
is this number right here, so let me just copy and paste that. I'm going to do that times the city mileage, which was in miles per gallon. Command-Enter, and you can see I now have a data set called mpg_metric; instead of having 11 variables, it now has 12. Maybe I'll glimpse this: all the same until I get to this last column. There is a new column called cty_metric, just like I would hope.

Frequently in R you'll be doing long procedures where you start with a data set and do a number of different things to it: first filter it, then add a new column, then maybe do something else. We have this standard syntax among many, many verbs in R, and in particular in all the tidyverse verbs, where the first argument is just the data set, and you'll find yourself doing verb, parentheses, mpg over and over and over again. That gets inconvenient for any number of reasons, so there's actually a tool built into R to help you get around that. It's called the pipe, and the pipe just takes an argument and passes it to the next function as the first argument. So I'm going to redo this mutate command using the pipe. It's still mpg_metric; that's still the output I want. But instead of putting mutate, parentheses, mpg, I'm going to put mpg, pipe, mutate, and I'll start a new line so it's a little bit neater. What this is saying is: take the mpg data set and pass it as the first argument into the mutate command. So now all I have to do is put in this conversion, the second part of it, and then if I wanted to apply another function after that, I could do another pipe, for instance, and then a filter command or whatever. This becomes very natural to read, in English or whatever language you happen to be speaking, just from left to right, because it's noun and then verb: take the mpg data set and mutate it in such and such a way, and if there were another pipe here, I'd say "and then", and if there were another verb
here, filter it, or whatever else you might like. By the way, there's a keyboard shortcut in RStudio for this pipe; you saw that I didn't actually type the characters one by one. It is Command-Shift-M on a Mac or Ctrl-Shift-M on a PC, and you will use that shortcut more than probably any other when you're using R. If I execute that, the same thing happens as did before.

Another hugely common data analysis task is to get grouped summaries. So, for instance, let's take a look at this data set one more time. We have several different classes of vehicles: compact, midsize, two-seater, SUV, and so on, and I might want to know how the average city mileage differs across these classes. Now, there's a bunch of different classes, so it's not really so convenient to do a filter over and over and over again followed by a mean. We'd like to automate this process, and R, and the tidyverse family of packages in particular, has a very natural way of doing that. So I'm going to take the mpg data set and, using my new pipe operator, pass that as the first argument into this group_by command. group_by is just literally going to take this data set and view it as grouped by whatever categorical variable I pass it, so in this case I want to group it by class. After I've grouped it by class, I'll do another pipe, so I'll take that data set and pass it into a summarize command. You can use an s or a z; this was originally developed, I believe, in New Zealand, so s is the default, but z will work if you're in the United States, for instance. So now it's grouped by class. I have to say what operation I'd actually like to perform, and I'd like to take the mean of the city mileage. So I'll execute that: this is going to take the mpg data set, group it by class, and then take the group means. When I Command-Enter that, the whole thing will be executed and I get a data set back. In this case it has seven rows; I get the average mileage for the two-seaters, the compacts, and so on, all at once, and
I can do more than one thing at once. For instance, suppose I also want the median of the city mileage. I'm just going to put both of those in, and you can see I now get an extra column with the medians.

I should talk about these line breaks here. Notice that I've sometimes been hitting Enter at the end of a line and getting this indenting. If you hit Enter after a pipe or a comma or another place where R is expecting more in its command, RStudio will indent things for you and recognize that you're intending to carry on the command on the next line. That can be really helpful when you get these longer commands; this is much more human-readable than it would be if I'd put it all on the same line. There are entire style guides devoted to the best way to indent your code, and as you get deeper into this, eventually you'll need to make a foray into that.

Okay, we have two more things to accomplish in this video: first, a tiny little bit of data visualization, and second, a little bit about communicating your results. Before I start visualizing, let me mention that you can insert comments into your code with the hash. This lets R know that the next thing I'm going to write isn't actually code; it's supposed to be a comment for a human reader. So I'm going to do a little data viz.

There are lots of different graphing systems in R. By far the predominant one, by far the standard one these days, is ggplot2, so I strongly recommend that you use that rather than the base plotting package or anything else. The ggplot2 package is one of the core pieces of the tidyverse. GG stands for grammar of graphics, and it's the rather revolutionary idea, I think, that when you look at a data visualization, there are some very key components that are common to every data visualization, and so when you're specifying a plot you should follow that fundamental structure.
That'll be clearer as I do an example. So in R, in the tidyverse family of packages and ggplot2 in particular, we have one workhorse plotting command, not one command for every sort of plot like you would in some languages: ggplot. First we specify the name of the data set; our assumption in R is that we're working with data. The philosophy behind the grammar of graphics is that the fundamental thing about a plot isn't so much whether it's a histogram or a frequency polygon or a scatter plot, but rather what variables are being communicated in which ways: on the x-axis, on the y-axis, with color, and so on. So we're going to specify which variables are being communicated in which ways first, and only later specify the sort of plot we want. In this case I want to specify an x aesthetic: on the x-axis of my plot I want city mileage. So "aesthetic" is just the way of saying what variables are going to be communicated in which way. If I execute this command right now, something will come up in my plot pane in a second (sorry, I have an old computer and it's going slowly), but it's not very interesting. You can see it's put city mileage on the x-axis but not actually plotted any data, and that's because I haven't specified what sort of plot I want, so R doesn't know how to visualize that. I'm going to put a plus, and then, to make it read a little better, I'll start a new line and specify the sort of plot I want. The syntax for that is geom, underscore, and then the type of plot, so in this case let's get a histogram. There we are. There's a handy little Zoom button; I'll click that. There it is.

Okay, so there's any number of things we could do to make this plot look nicer: we could change the labels, we could put a title on it, we could change some colors, lots of different stuff. I'm not going to get into that in depth, except to say that the grammar of graphics is layered, and the idea is that we should first get the basic plot and then go back and change
the non-data aspects later, for instance with plus labs. For instance, we can change the x label to be "City mileage", and when I execute that you'll see it's changed the x-axis label.

Okay, so this seems rather verbose. Why should we bother with all of this when we could just have a command that said histogram? Well, one reason is that there's not really anything different between this plot and, for instance, a frequency polygon, and if I execute this command you'll see the plot looks fundamentally the same even though it's no longer a histogram. The key point of both of these plots is that city mileage is being put on the x-axis; the fact that one is a frequency polygon and the other is a histogram really is secondary. This becomes really powerful because (and I'll just copy and paste again) we can actually do both at once. So I'll do a histogram and a frequency polygon at the same time. This isn't maybe the best plot, but it illustrates an important point once it comes up: you can just layer things on however you want.

Let's do just a couple more plots with the mpg data set. Let's get a scatter plot. Let's get city, and let's also get a y aesthetic this time: highway mileage. So city versus highway mileage; makes sense. Obviously I don't want a histogram anymore, I want a scatter plot, so let's do that with geom_point. This is how I get a fundamental scatter plot using ggplot, and I'll zoom in on that. You can see city mileage on the x-axis, highway mileage on the y-axis. Not surprisingly, there's a fairly linear relationship between those two. Oh, linear relationship: let's put on a regression line, another layer, geom_smooth. There are many different ways of putting smoothers on top of plots; I want a linear one. The syntax for that is method equals lm, and there's my regression line. You can see a little gray band here; that is a confidence band. I won't talk too much about that right now. Let's see, one more thing with this scatter plot that I'd
like to show. I'll leave out the regression line for now. Remember, I have different classes here; I did some grouped summaries with those earlier. I want to show how to get different colors for the different classes. Just as the city mileage is being displayed on the x-axis and highway mileage is being displayed on the y-axis, I want to add another aesthetic here saying how the class variable should be displayed. Now, I don't want that on an axis; I want that to be displayed with color. So I'm just going to add an extra aesthetic, and I'll Command-Enter on that. Okay, so now each of these points has a different color, and there's a key on the right letting you know what color is representing what class. Now, just as we can layer different labels, we can also change the colors that are actually being used here. You can do that by hand, but what I recommend is that you choose a built-in color palette. So that's not a geom, sorry; it's changing the color palette, so scale_color_brewer, and I'm going to use the palette that's called Dark2. I like this one because it's colorblind-friendly, unlike the base palette. There we go; that looks a little better.

I want to close by talking a little bit about communication of results. As your data science skills improve, you're going to want to share your results, either with supervisors in your jobs or with clients or just with friends, and it's awkward to be writing Word documents and dragging in visualizations like this one from R or copying and pasting tables. Fortunately, RStudio provides a great tool for this: the R Markdown document. So let's create one of those. I'm going up to the new-file button here and going a little lower than R Script, to R Markdown. There are various options here; I'm just going to leave them all as is. HTML is a good all-purpose format, both from RStudio's perspective, from the coding perspective, and in terms of flexibility: if you generate an HTML document, you can convert it to
other things later, using either R or other tools. When I open up that file, there's going to be a template that RStudio pulls up, and there are essentially three pieces to this. By the way, when you're starting out, and for a while after that, I recommend just modifying this template rather than starting from scratch every time. The first part of the document is this header, the so-called YAML header, and you can see title, author, date, and output format. You can obviously modify those first three as you see fit. You have these gray chunks with three backticks before and after; this is R code, and so you can insert R code that the R Markdown document will later execute if you want it to. Finally, you have some lightly formatted text, like right here. You can do headers (that's what's going on here), links like this, and this is going to be boldface. There are some other formatting options as well; you can Google, for instance, "R Markdown cheat sheet" to get some of the other commands at your disposal.

You can see that this template document includes some R code. So, for instance, let's change summary(cars), which gives some summary information about a built-in data set, to a command we've already seen, like glimpse(mpg). Okay, you can embed plots; this is going to give a plot using the base R package, so when we see this plot it's going to be kind of ugly. Let's see: remember that the glimpse command was in the dplyr package, which was in the tidyverse, so we've got to make sure that we load up the tidyverse. Let's do that. "But, Dr. Guard, didn't you already load up tidyverse?" Well, when I go to make this code into a document, R is essentially going to be starting from a blank slate, so I've got to have things like library(tidyverse) actually in the document.

Okay, so I mentioned rendering a document. How do we do that? It's this little spool of thread here, the Knit command. By the way, that's actually written
in the template document: you should click the Knit button to actually generate the output. So let's see what happens. It's going to prompt me to save it, and I will just take all of the defaults. Sorry, you didn't have to see my file structure there. It tends to move a little slowly; it's not the fastest thing in general, but on my old computer in particular it is going to be really slow. There it is. You can see the name, the author, and so on. I promised you a header (here, "R Markdown"), a link, and some boldface; we got all of those. The code and the output of the code are right here, and there's that old-fashioned plot. This does not look great, in my mind. Possibly it looked great in the '80s when the function was written; these days it just looks kind of old-school.

There are lots of different options for these code chunks that you can explore as you get better at R, but I will just point out this little gear right here. When you click on that gear you have choices, for instance: do I want to show just the output of the code (in this case the glimpse, so it was a table sort of thing), or do I want to show the code and the output, and so on. When you're writing reports for non-R users, for instance, you might want to suppress the code, but if you're writing for more expert users who you want to troubleshoot your code, you'd probably want to include it.

All right, so R is an entire world. Obviously this is just a quick start for you, but I think it should give you lots of directions to get some stuff done as well as to start exploring. If this was helpful to you, please subscribe. I'll be generating a lot more R help for you over the coming weeks.
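For reference, the wrangling and plotting steps walked through in this video can be collected into one script. This is a minimal sketch under stated assumptions, not the exact code typed on screen: the pipe is written as `%>%` (the pipe the tidyverse loads; the video never spells out the characters), the names `mpg_ford`, `mean_cty`, `median_cty`, and the plot variables are illustrative labels, and the conversion factor is computed from unit definitions because the number pasted in the video is never read aloud.

```r
library(tidyverse)

# mutate() with the pipe: add city mileage in kilometers per liter.
# Assumption: factor derived from unit definitions, not copied from the video.
km_per_mile <- 1.609344
litres_per_us_gallon <- 3.785411784
mpg_to_kml <- km_per_mile / litres_per_us_gallon  # roughly 0.425

mpg_metric <- mpg %>%
  mutate(cty_metric = mpg_to_kml * cty)

# Logical equality needs a double equal sign inside filter()
mpg_ford <- mpg %>%
  filter(manufacturer == "ford")

# Grouped summaries: one row per vehicle class
class_summary <- mpg %>%
  group_by(class) %>%
  summarize(mean_cty = mean(cty), median_cty = median(cty))

# Layered plot: histogram and frequency polygon share the same x aesthetic
p_hist <- ggplot(mpg, aes(x = cty)) +
  geom_histogram() +
  geom_freqpoly() +
  labs(x = "City mileage")

# Scatter plot with class mapped to color and a ColorBrewer palette
p_scatter <- ggplot(mpg, aes(x = cty, y = hwy, color = class)) +
  geom_point() +
  scale_color_brewer(palette = "Dark2")

# A linear smoother is one more layer on the plain scatter plot
p_line <- ggplot(mpg, aes(x = cty, y = hwy)) +
  geom_point() +
  geom_smooth(method = "lm")
```

Typing a plot's name at the console (for example `p_scatter`) draws it in the plot pane. In an R Markdown document the same code would go inside a chunk, with `library(tidyverse)` included so that knitting from a blank slate still works.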