Welcome to R: An Introduction. I'm Barton Poulson, and my goal in this course is to introduce you to R. This is R.
But also, this is R. And then finally, this is R. It's arguably the language of data science.
And just so you don't think I'm making stuff up off the top of my head, I have some actual data. This is a ranking from a survey of data mining experts on the software they use most often in their work. Take a look here at the top: R is first. In fact, its usage is about 50% higher than Python's, which is another major tool in data science.
And so both of them are important, but you can see why I personally am fond of R and why it's the one I want to start with in introducing you to data science. Now, there are a few reasons that are especially important.
Number one, it's free and it's open source, whereas other software packages can cost thousands of dollars per year. Also, R is optimized for vector operations, which means you can go through an entire row, or an entire table of data, without having to explicitly write for loops (I'll put a quick sketch of this below). If you've ever had to do that, then you know it's a pain, and so this is a nice thing. Also, R has an amazing community behind it, where you can find supportive people.
And you can get examples of whatever it is you need to do, and you can get new developments all the time. Plus, R has over 9,000 contributed, or third-party, packages available that make it possible to do basically anything.
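To make the vector-operations point concrete, here's a minimal sketch (the numbers are made up for illustration):

    # Vectorized: R multiplies every element at once, no loop needed
    x <- c(2, 4, 6, 8)
    x * 10                              # returns 20 40 60 80
    # The loop you'd have to write in many other languages
    for (i in seq_along(x)) print(x[i] * 10)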
Or, if you want to put it in the words of Yoda, you can say this: "This is R. There is no if. Only how." And in this case, I'm quoting R user Simon Blomberg. So, very briefly, in sum, here's why I want to introduce you to R: number one, because R is the language of data science; because it's free and it's open source; and because the free packages that you can download and install make it possible to do nearly anything when you're working with data.
So I'm really glad you're here, and that I'll have this chance to show you how you can use R to do your own work with data in a more productive, more interesting, and more effective way. Thanks for joining me.
The first thing that we need to do for our introduction is to get set up. More specifically, we need to talk about installing R. The way you do this is to download it.
You just need to go to the homepage for The R Project for Statistical Computing, and that's at r-project.org. When you get there, you can click on the link in the first paragraph that says "download R," and that'll bring you to a page that lists all the places you can download it from. Now, I find the easiest is to simply go to the top one.
This is the cloud mirror, because that'll automatically direct you to whichever of the mirrors below it is best for your location. When you click on that, you'll end up at this page, the Comprehensive R Archive Network, or CRAN, which we'll see again in this course. You need to come here and click on your operating system. If you're on a Mac, it'll take you to this page.
And the version you're going to want to click on is just right here: it's a .pkg file, a zipped application installation file. Click on it, download it, and follow the standard installation directions. If you're on a Windows PC, then you're probably going to want this one, "base." Again, click on it, download it, and go through the standard installation procedure.
And if you're on a Linux computer, you're probably already familiar with what you need to do, so I'm not going to run through that. Now, before we take a look at what it's actually like when you open it, there's one other thing you need to do, and that is to get the files that we're going to be using in this course. On the page where you found this video, there's a link that says "download files."
If you click on that, you'll download a zipped folder called r01_intro_files. Download that, unzip it, and, if you want, put it on your desktop. When you open it, you're going to see something like this: a single folder that's on your desktop. And if you click on it, it opens up a collection of scripts; the .R extension is for an R source, or script, file. I also have a folder with a few data files that we'll be using in one of these videos. You simply double-click on this first file, whose full name is this, and that'll open up in R.
And let me show you what that looks like. When you open up the application R, you will probably get a setup of windows that look like this. On the left is the source window or the script window where you actually do your programming.
On the right is the console window that shows you the output, and right now it's got a bunch of boilerplate text. Now, coming over here again on the left: any line that begins with a pound sign, or hashtag, or octothorpe (#), is a commented line that's not run, but these other lines are code that can be run. By the way, you may notice a red warning just popped up on the right side.
That's just telling us about something that has to do with changes in R, and it doesn't affect us. What I'm going to do right here is put the cursor in this line, and then hit Command or Ctrl and then Enter, which will run that line. And you can see now that it's opened up over here. What I've done is make available to the program a collection of datasets. Now I'm going to pick one of those datasets; it's the iris dataset, which is very well known.
It's a set of measurements of three species of the iris flower. And we're going to do head to see the first six lines. There we have the sepal length, sepal width, petal length, and petal width of, in this case, all setosa flowers. But if we want to see a summary of the variables, to get some quick descriptive statistics, we can run this next line over here. And now I get the quartiles and the mean, as well as the frequencies of the three different species of iris.
On the other hand, it's really nice to get things visually. So I'm going to run this basic plot command for the entire data set. And it opens up a small window, I'm going to make it bigger.
And it's a scatterplot matrix of the measurements for the three species of irises, as well as a funny panel where it includes the three different categories there. I'm going to close that window. And so that is basically what R looks like and how R works, in its simplest possible version. Now, before we leave, I'm actually going to take a moment to clean up the application and the memory: I'm going to detach, or remove, the datasets package that I added. I already closed the plot, so I don't need to do that one separately.
But what I can do is come over here to clear the console: I'm going to come up to Edit and down to Clear Console, and that cleans it out. And this is a very quick run-through of what R looks like in its native environment. But in the next movie, I'm going to show you another application we can install, called RStudio, that lays on top of this and makes interacting with R a lot easier, a lot more organized, and really a lot more fun to work with.
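Collected together, the little session we just ran looks something like this sketch (your script's comments and spacing may differ):

    library(datasets)     # make the built-in datasets available
    head(iris)            # first six rows of the iris data
    summary(iris)         # quartiles, means, and species counts
    plot(iris)            # scatterplot matrix of the whole data frame
    detach("package:datasets", unload = TRUE)  # clean up when finished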
The next step in R: An Introduction, and in setting up, is about something called RStudio. Now, this is RStudio, and what it is, is a piece of software that you can download in addition to R, which you've already installed. And its purpose is really simple: it makes working with R easier.
Now there's a few different ways that it does this. Number one is it has consistent commands. What's funny is the different operating systems have slightly different keyboard commands for the same operations in R.
RStudio fixes that and it makes it the same whether you're on Mac, Windows or Linux. Also, there's a unified interface. Instead of having 2, 3, or 17 windows open, you have one window with the information organized.
And it also makes it really easy to navigate with the keyboard and to manage the information that you have in R. Let me show you how to do this, but first we have to install it. What you're going to need to do is go to RStudio's website, which is at rstudio.com. From there, click on Download RStudio.
That'll bring you to this page, or something like it, and you're going to want to choose the desktop version. Now, when you get there, you're going to want to download the free, sort of community, version, as opposed to the $1,000-a-year version.
And so click here on the left. Then you're going to come to the list of installers for supported platforms, down here on the left; this is where you get to choose your operating system. Click the top one if you have Windows, the next one if you have a Mac, and then we have lots of different versions of Linux.
Whichever one you get, click on it, download it and go through the standard installation process. Then open it up and then let me show you what it's like working in RStudio. To do this, open up this file and we'll see what it's like in RStudio. When you open up RStudio, you get this one window that has several different panes in it. At the top, we have the script or the source window and this is where you do your actual programming.
And you'll see that it looks really similar to what we did when I opened up the R application. The coloring is a little different, but that's something that you can change in preferences or options. The console is down here at the bottom.
And that's where you get the text output. Over here is the environment that saves the variables if you're using any, and then plots and other information show up here in the bottom right. Now you have the option of rearranging things and changing what's there as much as you want. RStudio is a flexible environment.
And you can resize things by simply dragging the divider between the areas. So let me show you a quick example using the exact same code from my previous example, so you can see how it works in RStudio, as opposed to the regular R app that we used the first time. First, I'm going to load some data.
That's by using the datasets package. I'm going to do a Command or Ctrl and Enter to load that one, and you can see right here, it's run the command. Then I'm going to do the quick summary of data: I'm going to do head(iris), which shows the first six lines, and there it is down here; I can make that a little bit bigger if I want. Then I can do a summary by just coming back here and hitting Command or Ctrl and Enter.
And actually, I'm going to do a keyboard command to make the console bigger now. Then we can see all of that: I have the same basic descriptive statistics and the same frequencies there. I'll go back to how it was before and bring this one down a little.
And now we can do the plot. Now this time you see it shows up in this window here on the side, which is nice. It's not a standalone window.
Let me make that one bigger. It takes a moment to adjust. And there we have the same information that we had in the R app. But here it's more organized in a cohesive environment. And you see that I'm using keyboard shortcuts to move around.
And it makes life really easy for dealing with the information that I have in R. I'm going to do the same cleanup: I'm going to detach the package that I added. This is actually a little command to clear the plots. And then, here in RStudio, I can run a funny little command that will do the same as hitting Ctrl+L to clear the console for me.
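For reference, the cleanup commands look roughly like this; the plot-clearing and console-clearing lines are my assumptions about the exact commands in the script (the last one sends a form feed, which RStudio interprets as Ctrl+L):

    detach("package:datasets", unload = TRUE)  # unload the datasets package
    dev.off()     # close the current plot device (assumed; clears the plot)
    cat("\014")   # same effect as Ctrl+L: clears the console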
And that is a quick run-through of how you can do some very basic coding in RStudio, which, again, makes working with R more organized, more efficient, and easier to do overall. In our very basic introduction to R and setting up, there's one more thing I want to mention that makes working with R really amazing.
And that's the packages that you can download and install. Basically, you can think of them as giving you superpowers when you're doing your analysis, because you can do basically anything with the packages that are available. Specifically, packages are bundles of code: it's more software that adds new functionality to R and makes it so it can do new things.
Now, there are two kinds of packages, two general categories. First, there are base packages. These are packages that are installed with R, so they're already there, but they're not loaded by default.
That way, R doesn't use as much memory as it might otherwise. But more significant than that are the contributed, or third-party, packages. These are packages that need to be downloaded, installed, and then loaded separately.
And when you get those, it makes things extraordinary. And so you may ask yourself where to get these marvelous packages that make things so super-duper. Well, you have a few choices. Number one, you can go to CRAN.
That's the Comprehensive R Archive Network, an official R site that has things listed with the official documentation. Two, you can go to a site called CRANTASTIC, which really is just a way of listing these things; when you click on the links, it redirects you back to CRAN. And then, third, you can also get R packages from GitHub, which is an entirely different process.
If you're familiar with GitHub, it's not a big deal; otherwise, you don't usually need to deal with it. But let's start with this first one, the Comprehensive R Archive Network, or CRAN.
Now, we saw this previously when we were just downloading R. This time we're going to cran.r-project.org, and we're specifically looking for this one, the CRAN packages. That's going to be right here on the left: click on Packages. And when you open that, you're going to have an interesting option, and that's to go to Task Views.
And that breaks it down by topic. So we have packages that deal with Bayesian inference, packages that deal with chemometrics and computational physics, and so on and so forth. If you click on any one of those, it'll give you a short description of the packages that are available and what they're designed to do. Now, another place to get packages, as I said, is CRANTASTIC, at crantastic.org.
And this is one that lists the most recently updated and the most popular packages. It's a nice way of getting some sense of what people use most frequently, although it does redirect you back to CRAN to do the actual downloading. And then finally, at github.com, if you go to /trending/r, you'll see the most frequently downloaded packages on GitHub for use in R. Now, regardless of how you get them, let me show you the ones that I use most often.
And I find these make working with R really a lot more effective and a lot easier. Now, they have kind of cryptic names. The first one is dplyr, which is for manipulating data frames. Then there's tidyr, for cleaning up information; stringr, for working with strings or text information; lubridate, for manipulating date information; httr, for working with website data; and ggvis, where the gg stands for grammar of graphics: this is for interactive visualizations. ggplot2 is probably the most common package for creating graphics or data visualizations in R. shiny is another one that allows you to create interactive applications that you can install on websites. And rio, which is for R input/output, is for importing and exporting data.
And then rmarkdown allows you to create what are called interactive notebooks, or rich documents, for sharing your information. Now, there are others, but there's one in particular that I think is useful; I call it the one package to load them all. And it's pacman, which, not surprisingly, stands for package manager. I'm going to demonstrate all of these in another course that we have here, but let me show you very quickly how to get them working.
Let's just try it in R. If you open up this file from the course files, let me show you what it looks like. What we have here in RStudio is the file for this particular video. And I said that I use pacman; if you don't have it installed already, then run this one installation line. This is the standard installation command in R, and it'll add pacman, and then it will show up here in Packages.
Now I already have it installed. And so you can see it right there. But it's not currently loaded.
You see, because installing means making it available on your hard drive, but loading means actually making it accessible to your current routines. So then I need to load it, or import it, and I can do it in one of two ways. I can use require(), which gives a confirmation message; I can do it like this, and you see it's got that little sentence there.
Or I can do library(), which simply loads it without saying anything. You can see now, by the way, that it's checked off, so we know it's there.
Now, if you have pacman installed, even if it's not loaded, you can actually use pacman to install other packages. So what I actually do, because I have pacman installed, is go straight to this one: you do pacman and then the two colons, which say use this command from this package, even though the package isn't loaded. And then I load an entire collection: all the things that I showed you, starting with pacman itself. So now I'm going to run this command.
And what's nice about pacman is, if you don't have a package, it will actually install it, make it available, and load it. And I've got to tell you, this is a much easier way to do it than the standard R routine; there's a sketch of the whole sequence below.
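Here's that sequence as a sketch, from installing pacman to loading the collection (the package list mirrors the ones named above):

    install.packages("pacman")   # one-time install, the standard R command
    require(pacman)              # load with a confirmation message
    library(pacman)              # or load silently
    pacman::p_load(pacman, dplyr, ggplot2, ggvis, httr, lubridate,
                   rio, rmarkdown, shiny, stringr, tidyr)
    # p_load installs any package in the list that's missing, then loads them all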
And then for base packages, meaning the ones that come with R natively, like the datasets package, you still want to do it the standard way: you load and unload them separately. So now I've got that one available, and I can do the work that I want to do. Now, I'm actually not going to do it right now, because I'm going to show it to you in future videos. But now I have a whole collection of packages available.
They're going to give me a lot more functionality and make my work more effective. I'm going to finish by simply unloading what I have here. Now, if you want to, with pacman, you can unload specific packages, or, the easiest way, you can do p_unload(all).
And what that does is it unloads all of the add-on, or contributed third-party, packages, and you can see I've got the full list there of what was unloaded. However, for the base packages, like datasets, you need to use the standard R command detach(), which I'll use right here. And then I'll clear my console.
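The cleanup steps, roughly (the console-clearing line is an assumption; it's a common stand-in for Ctrl+L):

    p_unload(all)                              # unload all contributed packages
    detach("package:datasets", unload = TRUE)  # base packages come off the standard way
    cat("\014")                                # clear the console (assumed; same as Ctrl+L)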
And that's a very quick run-through of how packages can be found online, installed into R, and loaded to make more functionality available. I'll demonstrate how those work in basically every video from here on out, so you'll be able to see how to exploit their functionality to make your work a lot faster and a lot easier. Probably the best place to start when you're working with any statistics program is
basic graphics, so you can get a quick visual impression of what you're dealing with. And the command in R that makes this simplest of all is the default plot command. It's also known as basic X-Y plotting, for the x and y axes on a graph.
And what's neat about R's plot command is that it adapts to data types and to the number of variables that you're dealing with. Now, it's going to be a lot easier for me to simply show you how this works, so let's try it in R. Just open up the script file, and we'll see how we can do some basic visualizations in R.
The first thing that we're going to do is load some datasets from the datasets package that comes with R. We simply do library(datasets), and that loads it up.
We're going to use the iris data, which I've shown you before and you'll get to see many more times. Let's look at the first few lines; I'll zoom in on that. What this is, is the measurement of the sepal and petal length and width for three species of irises. It's a very famous data set, about 100 years old.
And it's a great way of getting a quick feel for what we're able to do in R. I'll come back to the full window here. What we're going to do first is get a little information about the plot command. To get help on something in R, just do the question mark and the thing you want help for. Now, we're in RStudio, so this opens up right here in the Help window.
And you see we've got the whole set of information here: all the parameters, additional links you can click on, and then examples here at the bottom. I'm going to come over here, and I'm going to use the command for a categorical variable first; that's the most basic kind of data that we have.
And so Species, which has three different species, is what I want to use right here. So I'm going to do plot, and then, in the parentheses, you put what it is you want to plot. What I'm doing here is saying it's in the dataset iris; that's our data frame, actually.
Then the dollar sign says use this variable that's in that data. So that's how you specify the whole thing. And then we get an extremely simple three-bar chart; I'll zoom in on it. What it tells you is that we have three species of iris: setosa, versicolor, and virginica.
And then we have 50 of each. And so it's nice to know that we have balanced groups, and that we have three groups, because that might affect some of the analyses that you do. It's an extremely quick and easy way to begin looking at the data.
I'll zoom back out. Now, let's look at a quantitative variable, one that's on an interval or ratio level of measurement.
For this one, I'll do petal length, and you see I do the same thing: plot, then iris, then petal length. Please note, I'm not telling R that this is now a quantitative variable; on the other hand, it's able to figure that one out by itself. Now, this one's a little bit funny, because it's a scatterplot; I'm going to zoom in on it.
But the x axis is the index number, or the row number, in the data set, so that one's really not helpful. It's the variable that's going on the y, the petal length, where you get to see the distribution.
On the other hand, you know that we have 50 of each species: first we have the setosa, then we have the versicolor, and then we have the virginica. And so you can see that there are group differences across these three species.
Now, what I'm going to do is ask for a specific kind of plot, to break it down more explicitly between the categories. That is, I'm going to put in two variables now, where I have my categorical Species, then a comma, and then the petal length, which is my quantitative measurement.
I'm going to run that; again, you just hit Ctrl or Command and Enter. And this is the one I'm looking for here; let's zoom in on that. Again, you see that it's adapted.
It knows, for instance, that the first variable I gave it is categorical and the second was quantitative, and the most common chart for that is a boxplot, so that's what it automatically chooses to do. And you can see it's a good plot here: we can see very strong separation between the groups on this particular measurement. I'll zoom back out. And then let's try a quantitative pair.
So now I'll do petal length and petal width. It's going to be a little bit different; I'll run that command.
And now this one is a proper scatterplot, where we have a measurement across the bottom and a measurement up the side. You can see that there's a really strong positive association between these two. So, not surprisingly, as a petal gets longer, it generally also gets wider.
So it just gets bigger overall. And then finally, if I want to run the plot command on the entire data set, the entire data frame, this is what happens: we do plot(iris). Now, we've seen this one in previous examples, but let me zoom in on it. What it is, is an entire matrix of scatterplots of the four quantitative variables.
And then we have Species, which is kind of funny because it's not labeling them, but it shows us a dot plot for the measurements of each species. And this is a really nice way, if you don't have too many variables, of getting a very quick, holistic impression of what's going on in your data.
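Put together, the calls from this section look something like this sketch:

    library(datasets)
    ?plot                                      # help on the plot command
    plot(iris$Species)                         # categorical variable -> bar chart
    plot(iris$Petal.Length)                    # quantitative variable -> index plot
    plot(iris$Species, iris$Petal.Length)      # categorical x quantitative -> boxplots
    plot(iris$Petal.Length, iris$Petal.Width)  # quantitative pair -> scatterplot
    plot(iris)                                 # entire data frame -> scatterplot matrix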
And so, the point of this is that the default plot command is able to adapt to the number of variables I give it and to the kind of variables I give it, and it makes life really easy. Now, I want you to know that it's possible to change the way these look. I'm going to specify some options: I'm going to do the plot again, the scatterplot, where I say plot, and then in parentheses I give these two arguments, saying what I want in it: do the petal length, and do the petal width. And then I'm going to go to another line; I'm just separating with a comma. Now, if you want to, you can write this all as one really long line; I break it up because I think it makes it a little more readable.
I'm going to specify the color, and I do that with col, for color, and then I use a hex code; that code is actually for the red that is used on the datalab homepage. Then pch is for point character, and 19 is a solid circle. Now I'm going to put a main title on it, and then a label on the x axis and a label on the y axis; the whole call looks something like the sketch below.
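A sketch of that call; the hex value and the label text here are paraphrased assumptions, not copied from the script:

    plot(iris$Petal.Length, iris$Petal.Width,
         col = "#cc0000",   # a red hex code (assumed value)
         pch = 19,          # point character: solid circle
         main = "Iris: Petal Length vs. Petal Width",
         xlab = "Petal Length",
         ylab = "Petal Width")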
So I'm actually going to run those now, by doing Command or Ctrl and Enter for each line, and you can see it builds up. When we finish, we've got the whole thing; I'll zoom in on it again. And this is the kind of plot that you could actually use in a presentation, or possibly in a publication. So even with the base command, we're able to get really good-looking, informative, clean graphs. Now, what's interesting is that the plot command can do more than just show data: we can actually feed it formulas. If you want, for instance, to get a cosine,
I do plot, and then cos is for cosine, and then I give the limits: I go from zero to two times pi, because that's relevant for cosine. I click on that, and you can see the graph there, drawing our little cosine curve. I can do an exponential distribution from one to five.
And there it is, curving up. And I can do dnorm, which is for the density of a normal distribution, from minus three to plus three; there's the good old bell curve there in the bottom right.
And then we can use the same kinds of options that we used earlier for our scatterplot. Here I'm going to say do a plot of dnorm, so the bell curve from minus three to plus three on the x axis, but now we're going to change the color to red; lwd is for line width, to make it thicker; and we'll give it a title on the top, a label on the x axis, and a label on the y axis.
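Here are those formula plots as a sketch (the limits follow the narration; the color, line width, and label text are assumptions):

    plot(cos, 0, 2*pi)        # cosine from 0 to 2*pi
    plot(exp, 1, 5)           # exponential function from 1 to 5
    plot(dnorm, -3, +3)       # standard normal density
    plot(dnorm, -3, +3,
         col = "#cc0000",     # assumed red, as before
         lwd = 5,             # thicker line (assumed width)
         main = "Standard Normal Distribution",
         xlab = "z-scores",
         ylab = "Density")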
We'll zoom in on that. And so there's my new and improved, prettier, presentation-ready bell curve that I got with the default plot command in R. So this is a really flexible and powerful command, and it's in the base package.
And you'll see that we have a lot of other commands that can do even more elaborate things, but this is a great way to start: get a quick impression of your data, see what you're dealing with, and shape the analyses you do subsequently. The next step in our introduction and our discussion of basic graphics is bar charts. And the reason I like to talk about bar charts is this: because simple is good. And when it comes to charts, bar charts are the most basic graphic for the most basic data.
And so they're a wonderful place to start in your analysis. Let me show you how this works; just try it in R. Open up the script, and let's run through and see how it works. When you open up the file in RStudio, the first thing we're going to want to do is come down here and load the datasets package.
And then we're going to scroll down a little bit, and we're going to use a data set called mtcars. Let's get a little bit of information about this: do the question mark and the name of the data set. This is Motor Trend, that's a magazine, car road tests from 1974. So, you know, they're 42 years old.
Let's take a look at the first few rows of what's in mtcars by doing head. I'm going to zoom in on this, and what you can see is that we have a list of cars: the Mazda RX4 and the wagon version, the Datsun 710, the AMC Hornet, and I actually remember these cars. And we have several variables on each of them.
We have mpg, miles per gallon; we have the number of cylinders; the displacement in cubic inches; the horsepower; and the final drive ratio, which has to do with the axle. Then we have the weight, in units of 1,000 pounds, and the quarter-mile time in seconds, and these are a bunch of really, really slow cars. vs is for whether the cylinders are in a V or whether they are in a straight, or inline, arrangement.
And then am is for automatic or manual. Then we go down to the next line: we have gear, which is the number of gears in the transmission, and carb, for how many carburetor barrels they have, which is funny because we don't even use carburetors anymore. Anyhow, that's what's in the data set.
I'll zoom back out. Now, if we want to do a really basic bar chart, you might think the most obvious thing to do would be to use R's barplot command, that's its name for the bar chart, and then to specify the data set, mtcars, then the dollar sign, and then the variable that we want, cyl for cylinders. So you'd think that would work, but unfortunately, it doesn't.
Instead, what we get is this, which just goes through all the cases, one by one, row by row, and tells us how many cylinders are in that case. That's not a good one; that's not what we want.
And so what we need to do is actually reformat the data a little bit. By the way, you'd have to do the exact same thing if you wanted to make a bar chart in a spreadsheet like Excel or Google Sheets: it can't do it with the raw data; you first need to create a summary table. And so what we're going to do here is use the command table. We're going to say take this variable from this data set, make a table of it, and feed it into an object,
you know, a data thing, a data container, called cylinders. I'm going to run that one, and you see that just showed up in the top left; let me zoom in on that one. So now I have in my environment a data object called cylinders: it's a table, it's got a length of three, it's got a size of 1,000 bytes, and it gives us a little bit more information. Let's go back to where we were. But now we've saved that information into cylinders, which just has the counts for each number of cylinders.
Now I can run the barplot command, and I get the kind of plot I expected to see. From this, we see that we have a fair number of cars with four cylinders, a smaller number with six,
and, because this is 1974, a lot of eight-cylinder cars in this particular data set. Now, we can also use the default plot command, which I showed you previously, on the same data, but it's going to do something a little different: it's actually going to make a line chart, where the lines are the same height as the bars. I'd probably use the barplot instead, because it's easier to tell what's going on.
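Collected together, a minimal sketch of this section's commands:

    barplot(mtcars$cyl)             # doesn't work as hoped: one bar per car
    cylinders <- table(mtcars$cyl)  # summary table: counts by cylinder count
    barplot(cylinders)              # a proper bar chart
    plot(cylinders)                 # default plot: a line chart instead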
Either way, this is a quick way of making a default chart that gives you the information you need for a categorical variable. Remember: simple is good, and that's a great way to start. In our last video on basic graphics, we talked about bar charts.
If you have a quantitative variable, then the most basic kind of chart is a histogram. This is for data that is quantitative, or scaled, or measured, or at the interval or ratio level; all of those are referring to basically the same thing. And in all of those, you want to get an idea of what you have, and a histogram allows you to see that. Now, there are a few things you're going to be looking for with a histogram.
Number one, you're going to be looking at the shape of the distribution. Is it symmetrical? Is it skewed?
Is it unimodal or bimodal? You're going to look for gaps, or big empty spaces, in the distribution. You're also going to look for outliers, unusual scores, because those can distort any of your subsequent analyses. And you'll look for symmetry, to see whether you have the same number of high and low scores, or whether you have to do some sort of adjustment to the distribution.
But this is going to be easier if we just try it in R, so open up this R script file, and let's take a look at how we can do histograms in R. When you open up the file, the first thing we need to do is come down here and load the datasets package. We'll do this by running the library command; I just do Ctrl or Command and Enter. And then we can use the iris data set.
Again, we've looked at it before, but let's get a little bit of information about it by asking for help on iris. And there we have Edgar Anderson's iris data, also known as Fisher's iris data, because he published an article on it. Here's the full set of information available on it, from 1936, so that's 80 years old. Let's take a look at the first few rows.
And again, we've seen this before: sepal and petal length and width for three species of iris. We're going to do a basic histogram on each of the four quantitative variables that are in here. And so I'm going to use just the hist command: hist, then the dataset iris, and then the dollar sign to say which variable, Sepal.Length.
When I run that, I get my first histogram. Let's zoom in on it a little bit. What happens here is, of course, it's a basic sort of black-line-on-white-background chart, which is fine for exploratory graphics. It gives us a default title that says "Histogram of" the variable, with the clunky variable name, which is also on the x axis at the bottom. It automatically adjusts the x axis, and it chooses about seven or nine bars, which is usually the best choice for a histogram.
And then, on the left, it gives us the frequency, the count of how many observations are in that group. So, for instance, we have only five irises whose sepal length is between four and four and a half centimeters, I think it is. Let's zoom back out, and let's do another one.
Now, this time, for sepal width, you can see that's almost a perfect bell curve. If we do petal length, we get something different. Let me zoom in on that one. This is where we see a big gap: we've got a really strong bar there at the low end.
In fact, it goes above the frequency axis. And then we have a gap, and then sort of a bell curve, which lets us know that there's something interesting going on with this data that we're going to want to explore a little more fully. And then we'll do another one, for petal width; I'll just run this command.
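The four histogram calls, as a sketch:

    hist(iris$Sepal.Length)
    hist(iris$Sepal.Width)    # almost a perfect bell curve
    hist(iris$Petal.Length)   # big bar at the low end, then a gap
    hist(iris$Petal.Width)    # same kind of pattern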
And you can see the same kind of pattern here, where there's a big clump at the low end, then a gap, and then sort of a bell curve. Now, another way to do this is to do the histograms by groups, and that would be an obvious thing to do here, because we have three different species of iris. So what we're going to do is put the graphs into three rows, one above another, in one column. I'm going to do this by changing a parameter: par is for parameter, and I'm giving it the number of rows that I want to have in my output.
And I need to give it a combination of numbers. I do the c, which is for concatenate, and means treat these two numbers as one unit, where three is the number of rows and one is the number of columns. I run that; it doesn't show anything just yet. Then I'm going to come down and do this more elaborate command: hist, that's the histogram we've been doing, and petal width, except this time, in square brackets, I'm going to put a selector; this means use only these rows. And the way I do this is by saying I want to do it for the setosa irises.
So I say iris, that's the data set, then the dollar sign and then Species, the variable, and then two equal signs, because in computers that means "is equivalent to." Then, in quotes, and you have to spell it exactly the same, with the same capitalization, I do setosa. So that's the variable and the row selection.
I'm also going to put in some limits for the x axis, because I want to manually make sure that all three of the histograms have the same x scale, so I'm going to specify that. breaks is for how many bars I want in the histogram, and actually, what's funny about this is that it's really only a suggestion that you give to the computer. Then I'm going to put a title above that one, I'm going to have no x label, and I'm going to make it red.
So I'm going to do all of that right now; I'll just run each line. And then you see I have a very skinny chart; let's zoom in on it. It's very short, but that's because I'm going to have multiple charts,
and it's going to make more sense when we look at them all together. But you can see, by the way, that the petal width for the setosa irises is on the low end. Now, let's do the same thing for versicolor. I'm going to run through all that; it's all going to be the same, except we're going to make it purple.
There's versicolor. And then let's do virginica last, and we'll make those blue. Now I can zoom in on that, and what we have are three histograms.
It's the same variable, petal width, but now I'm doing it separately for each of the three species. And it's really easy to see what's going on here now: setosa is really low; versicolor and virginica overlap, but they're still distinct distributions. This approach, by the way, is referred to as small multiples: making many versions of the same chart on the same scale,
so it's really easy to compare across groups or across conditions, which is what we're able to do right here. Now, by the way, anytime you change the graphical parameters, you want to make sure to change them back to what they were before. So here I'm running par again and going back to one row and one column. And that's a good way of doing histograms for examining quantitative variables, and even for exploring some of the complications that can arise when you have different categories with different scores on those variables.
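Here's a sketch of that small-multiples sequence; the xlim and breaks values are assumptions consistent with the description, not copied from the script:

    par(mfrow = c(3, 1))   # three rows, one column
    hist(iris$Petal.Width[iris$Species == "setosa"],
         xlim = c(0, 3), breaks = 9,   # shared x scale; breaks is only a suggestion
         main = "Petal Width: Setosa", xlab = "", col = "red")
    hist(iris$Petal.Width[iris$Species == "versicolor"],
         xlim = c(0, 3), breaks = 9,
         main = "Petal Width: Versicolor", xlab = "", col = "purple")
    hist(iris$Petal.Width[iris$Species == "virginica"],
         xlim = c(0, 3), breaks = 9,
         main = "Petal Width: Virginica", xlab = "", col = "blue")
    par(mfrow = c(1, 1))   # restore the default parameters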
In our two previous videos, we looked at some basic graphics for one variable at a time: we looked at bar charts for categorical variables, and we looked at histograms for quantitative variables. Now, there's a lot more you can do with univariate distributions,
but you also might want to look at bivariate distributions, and we're going to look at scatterplots as the most common version of that. You do a scatterplot when what you want to do is visualize the association between two quantitative variables. Now, it's actually more flexible than that, but this is the canonical case for a scatterplot.
And when you do that, what sorts of things do you want to look for in your scatterplot? I mean, there's a purpose in it. Well, number one, you want to see whether the association between your two variables is linear, that is, whether it can be described by a straight line, because most of the procedures we do assume linearity. You also want to check whether you have consistent spread across the scores as you go from one end of the x axis to the other, because if things fan out considerably, then you have what's called heteroscedasticity,
and that can really complicate some of the other analyses. As always, you want to look for outliers, because an unusual score, or especially an unusual combination of scores, can drastically throw off some of your other interpretations. And then you want to look for the correlation: is there an association between these two variables?
So that's what we're looking for. Let's try it in R: simply open up this file, and let's see how it works. The first thing we need to do in R is come down and open up the datasets package, just do Command or Ctrl and Enter, and we'll load the datasets.
We're going to use mtcars; we looked at that before. Let's get a little bit of information about it: it's road test data from 1974. And let's look at the first few cases; I'll zoom in on that. Again, we have miles per gallon, cylinders, and so on and so forth.
Now, anytime you're going to do an association, it's a really good idea to look at the univariate, or one-variable-at-a-time, distributions as well. We're going to look at the association between weight and miles per gallon, so let's look at the distribution for each of those separately.
I'll do that with a histogram: I do hist, and then in parentheses I specify the data set, mtcars in this case, and then the dollar sign to say which variable in that data set. So there's the histogram for weight.
And, you know, it's not horrible, though it looks like we've got a few on the high end there. And here's the histogram for miles per gallon: again, mostly kind of normal,
but a few on the high end. But let's look at the plot of the two of them together. Now, what's interesting is, I just use the generic plot command, I feed that in, and R is able to tell that I'm giving it two quantitative variables and that a scatterplot is the best kind of plot for that.
So we're going to do weight and miles per gallon, and then let me zoom in on that. What you see here is one circle for each car, at the joint position of its weight and its miles per gallon.
And it's a strong downhill pattern. Not surprisingly, the more a car weighs, and we have some in this data set over 5,000 pounds, the lower its miles per gallon; we get down to about 10 miles per gallon here. The smallest cars, which weigh well under 2,000 pounds, get about 30 miles per gallon. Now, this is probably adequate for most purposes,
but there are a few other things we can do. So, for instance, I'm going to add some colors here. I'm going to take the same plot and then add on additional arguments, to say use a solid circle: pch is for point character, and 19 is a solid circle. cex has to do with the size of things, and I'm going to make it 1.5, which means make them 150% larger. col is for color, and I'm specifying a particular red, the one for datalab, in hex code. Then I'm going to give a title, an x label, and a y label. We'll zoom in on that, and now we have a more polished chart that, also because of the solid red circles, makes it easier to see the pattern that's going on there: we've got some really heavy cars with really bad gas mileage, and an almost perfectly linear association up to the lighter cars with much better gas mileage.
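A sketch of the two histograms and the two scatterplot calls; the hex red and the title text are paraphrased assumptions:

    hist(mtcars$wt)              # check each variable on its own first
    hist(mtcars$mpg)
    plot(mtcars$wt, mtcars$mpg)  # basic scatterplot
    plot(mtcars$wt, mtcars$mpg,
         pch = 19,               # solid circles
         cex = 1.5,              # 150% size
         col = "#cc0000",        # assumed red
         main = "MPG as a Function of Weight of Cars",
         xlab = "Weight (1,000 lbs)",
         ylab = "MPG")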
And so a scatterplot is the easiest way of looking at the association between two variables, especially when those two variables are quantitative, that is, on a scaled or measured outcome. It's something you want to do anytime you're doing your analysis: first visualize it, and then use that as the introduction to any numerical or statistical work you do after that. As we go through our necessarily very short presentations on basic graphics, I want to finish by saying one more thing, and that is that you have the possibility of overlaying plots.
And that means putting one plot directly on top of, or superimposing it on, another. Now, you may ask yourself why you would want to do this. Well, I can give you an artistic take on this. This, of course, is Pablo Picasso's Les Demoiselles d'Avignon.
And it's one of the early masterpieces in cubism. And the idea of cubism is it gives you many views, or it gives you simultaneously several different perspectives on the same thing. And we're going to try to do a similar thing with data.
And so we can say very quickly, thanks, Pablo. Now, why would you overlay plots? Really, if you want the technical explanation, it's because you get increased information density, you get more information, and hopefully more insight in the same amount of space and hopefully the same amount of time.
Now, there is a potential risk here. You might be saying to yourself at this point, "You want dense? Guess what, I can do dense." And then we end up with something vaguely like this: The Garden of Earthly Delights, and it's completely overwhelming, and it just makes you kind of shut down cognitively. Anyhow, thank you, Hieronymus Bosch.
Instead, while I like Hieronymus Bosch's work, I'm going to tell you: when it comes to data graphics, use restraint. Just because you can do something doesn't mean that you should do that thing. When it comes to graphics and overlaying plots, the general rule is this: use views that complement and support one another, that don't compete, but that give greater information in a coherent and consistent way.
This is going to make a lot more sense if we just take a look at how it works in R. So open up this script, and we'll see how we can overlay plots for greater information density and greater insight. The first thing that we're going to need to do is open up the datasets package. And we're going to be using a data set we haven't used before, about lynxes; that's the animal. This is about Canadian lynx trappings from 1821 to 1934. If you want the actual information on the data set, there it is.
Now, let's take a look at the first few lines of data. This one is a time series, and so what's unusual about it is that it's just one line of numbers,
and you have to know that it starts at 1821 and goes on from there. So let's make a default chart with a histogram, as a way of seeing whether lynx trappings were consistent, or how much variability there was. We'll do hist, which is the default histogram, and we'll simply put lynx in; we don't have to specify a variable, because there's only one variable in it.
And when we do that, and I'll zoom in on it, we get a really skewed distribution: most of the observations are down at the low end, and then it tapers off; it's actually measured in thousands. And so we can tell that the most common values are at the low end. On the other hand, we don't know what years those were, so we're ignoring that for just a moment and looking at the overall distribution of trappings regardless of year. Let me zoom back out.
And we can use some options on this one to make it a little more intricate. We can do a histogram, and then, in the parentheses, after I specify the data, I can also tell it how many bins I want; again, it's sort of a suggestion, because R is going to do what it wants. Anyhow, I can say make it a density instead of a frequency,
so it'll give proportions of the total distribution. We'll change the color to one called thistle1, because you can use color names in R, and we'll give it a title here. By the way, I'm using the paste command because it's a long title and I want it to show up on one line, but I need to spread my command across two lines; you can go longer, but I have to use a short command line so you can actually see what we're doing when we're zoomed in here.
So there's that one. And then we're going to give it a label that says "Number of Lynx Trapped." And now we have a more elaborate chart; I'll zoom in on it. It's in a kind of thistle, purple-lilac color, and we've divided the number of bins differently: previously it was one bar for every 1,000; now it's one bar for every 500. But that's just one chart.
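Roughly, that call looks like this; the breaks value is an assumption, while the colors and labels follow the description:

    hist(lynx,
         breaks = 14,        # suggest more bins; R treats this as a hint
         freq = FALSE,       # densities instead of frequencies
         col = "thistle1",   # an R color name
         main = paste("Histogram of Annual Canadian Lynx",
                      "Trappings, 1821-1934"),
         xlab = "Number of Lynx Trapped")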
We're here to see how we can overlay charts, and a really good overlay, anytime you're dealing with a histogram, is a normal distribution. You want to see: are the data distributed normally? Now, we can tell they're skewed here, but let's get an idea of how far they are from normal. To do this, we use the command curve, and then dnorm is for the density of the normal distribution.
And then here I tell it x, which is, you know, just a generic variable name, but I tell it to use the mean of the lynx data and the standard deviation of the lynx data. We'll make it a slightly different thistle color, number four; we'll make it two pixels wide, so the line width is two pixels; and then add says stick it on the previous graph. And so now I'll zoom in on that.
And you can see, if we had a normal distribution with the same mean and standard deviation as this data, it would look like that. Obviously, that's not what we have, because we have this great big spike here on the low end. Then I can do a couple of other things.
I can put in what are called kernel density estimators. Those are sort of like a bell curve, except they're not parametric; instead, they follow the distribution of the data. That means they can have a lot more curves in them, but they still add up to one, like a normal distribution. So let's see what those would look like here.
We're going to do lines; that's what we use for this one. And then we say density; that's going to be the standard kernel density estimator. We'll make it blue, and there it is on top. I'm going to do one more, then we'll zoom in: I can change a parameter of the kernel density estimator here. I'm using adjust to say average across a little more; it's sort of like a moving average.
And now let me zoom in on that. You can see, for instance, that the blue line follows the spike at the low end a lot more closely, and then it dips down. On the other hand, the purple line is a lot slower to change, because of the way I gave it its instructions, with adjust equals three. And then I'm going to add one more thing, something called a rug plot. It puts little vertical lines underneath the plot for each individual data point.
I do that with rug, and I say just use lynx, and then we're going to make it a line width, or pixel width, of two, and we'll make it gray.
And that, zooming in, is our final plot. You can see now that we have the individual observations marked, and you can see why each bar is as tall as it is and why the kernel density estimator follows the distribution that it does. This is our final histogram, with several different views of the same data. It's not cubism, but it's a great way of getting a richer view of even a single variable, a view that can then inform the subsequent analyses you do, to get more meaning and more utility out of your data.
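The overlays from this section, as a sketch; the line widths and the blue/purple colors follow the narration, so treat the exact values as assumptions:

    curve(dnorm(x, mean = mean(lynx), sd = sd(lynx)),
          col = "thistle4", lwd = 2, add = TRUE)           # normal curve on top
    lines(density(lynx), col = "blue", lwd = 2)            # kernel density estimator
    lines(density(lynx, adjust = 3), col = "purple", lwd = 2)  # smoother KDE
    rug(lynx, lwd = 2, col = "gray")                       # one tick per observation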
Continuing in our introduction, the next thing we need to talk about is basic statistics, and we'll begin by discussing the basic summary function in R. The idea here is that once you have done the pictures, once you've done the basic visualizations, then you're going to want to get some precision by getting numerical or statistical summaries. Depending on the kinds of variables you have, you're going to want different things.
So, for instance, you're going to want counts, or frequencies, for categories, and you're going to want things like quartiles and the mean for quantitative variables. We can try this in R, and you'll see that it's a very, very simple thing to do. Just open up the script and follow along. What we're going to do is load the datasets package: Ctrl or Command and then Enter.
And we're actually going to look at some data and do an analysis that we've seen several times already: we're going to load the iris data. Let's take a look at the first few lines. Again, this has four quantitative measurements on the sepal and petal length and width of three species of iris flowers. And what we're going to do is get a summary in three different ways.
First, we're going to do a summary for a categorical variable. The way we do this is we use the summary function, and then we say iris, because that's the data set, then a dollar sign, and then the name of the variable that we want. So, in this case, it's Species. We'll run that command.
And you can see it just says setosa 50, versicolor 50, and virginica 50; those are the frequencies, or the counts, for each of those three categories in the Species variable. Now we're going to get something more elaborate for a quantitative variable; we'll use sepal length for that one, and I'll just run that next line.
And now you can see it lays it out horizontally: we have the minimum value of 4.3, then we have the first quartile of 5.1, the median, then the mean, then the third quartile, and then the maximum score of 7.9. And so this is a really nice way of getting a quick impression of the spread of scores. Also, by comparing the median and the mean, sometimes you can tell whether it's symmetrical or whether there's skewness going on. And then you have one more option,
and that is getting a summary for the entire data frame, or data set, at once. What I do is simply do summary, and then, in the parentheses, for the argument, I just give the name of the data set, iris. And this one I need to zoom in on a little bit, because now it arranges things vertically, where we do sepal length first.
So that's our first variable, and we get the quartiles and the median; then we do sepal width, petal length, and petal width. Then it switches over at the last one, Species, where it gives us the counts, or frequencies, of each of those three categories. And so that's the most basic version of what you're able to do with the default summary function in R: it gives you quick descriptives, gives you the precision to follow up on some of the graphics we did previously, and gets you ready for your further analyses.
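Here are those three summary calls together:

    summary(iris$Species)       # counts for a categorical variable
    summary(iris$Sepal.Length)  # quartiles, median, and mean for a quantitative variable
    summary(iris)               # the whole data frame at once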
As you're starting to work with R and getting basic statistics, you may find you want a little more information than the base summary function gives you. In that case, you can use something called describe. Its purpose is really simple: it gets you more detail. Now, this is not included in R's base functionality; instead, it comes from a contributed package, the psych package.
And when you run describe from psych, this is what you're going to get: n, that's the sample size; the mean; the standard deviation; the median; the 10% trimmed mean; the median absolute deviation; the minimum and maximum values; the range; skewness and kurtosis; and the standard error. Now, don't forget, you still want to do this after you do your graphical summaries: pictures first, numbers later.
But let's see how this works in R; simply open up this script, and we'll run through it step by step. When you open up R, the first thing we need to do is install the package. Now, I'm actually going to go through my default installation of packages, because I'm going to use one of them, pacman, and this just makes things a little bit easier. So we're going to load all these packages.
And this assumes, of course, that you have pacman installed already. We're going to get the datasets, and then we'll load our iris data; we've done that lots of times before: sepal and petal length and width, and the species. But now we're going to do something a little different.
We're going to load a package. I'm using p_load from the pacman package; that's why I loaded it already. This will download the package if you don't have it already, which might take a moment, and it downloads a few dependencies, generally other packages that need to come along with it. Now, if you want to get some help on it, you can do p_help; anytime you see p and an underscore, that's something from pacman.
So, p_help(psych). Now, when you do that, it's going to open up a web browser, and it's going to get the PDF help. I've got it open already, because it's really big; in fact, it's 367 pages of documentation about the functions in psych.
Obviously, we're not going to do the whole thing here. What we can do is look at some of it in the R viewer, if you simply add this argument here: web = F, for FALSE; you can spell out the word FALSE as long as you do it in all caps. Then it opens up here on the right, and this is actually a web browser; this is a web page we're looking at. And each of these you can click on to get information about the individual bits and pieces.
Now, let's use describe, which comes from this package. It's for quantitative variables only, so you don't want to use it for categories. What we're going to do here is pick one quantitative variable right now, and that is iris and then sepal length. When we run that one, here's what we get: a list, here, on one line. The first number, the 1, simply indicates the row number; we only have one row,
so that's what we have. Anyhow, it gives us the n of 150, the mean of 5.84, the standard deviation, the median, and so on and so forth, out to the standard error there at the end. Now, that's for one quantitative variable.
If you want to do more than that, or especially if you want to do an entire data frame, just give the name of the data frame in describe. So here we go: describe(iris). I'm going to zoom in on that one, because now we have a lot of stuff. Now it lists all the variables down the side,
starting with sepal length, and it gives the variables numbers 1 through 5, and it gives us the information for each one of them. Please note, it's given us numerical information for Species, but it shouldn't be doing that, because that's a categorical variable, so you can ignore that last line; that's why there's an asterisk right there.
But otherwise, this gives you more detailed information, including things like the standard deviation and skewness, that you might need to get a more complete picture of what you have in your data. I use describe a lot. It's a great way to complement histograms and other charts, like boxplots, to give you a more precise image of your data and prepare you for your other analyses.
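A sketch of the describe workflow, assuming pacman is installed:

    pacman::p_load(psych)        # install if needed, then load
    p_help(psych, web = F)       # documentation in the R viewer instead of a browser
    describe(iris$Sepal.Length)  # one quantitative variable
    describe(iris)               # whole data frame; ignore the Species row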
To finish up our section and our an introduction on basic statistics. Let's take a short look at selecting cases. What this does is it allows you to focus your analysis, choose particular cases and look at them more closely. Now in R, you can do this a couple of different ways.
You can select by category, if you have the name of a category; you can select by value on a scaled variable; or you can select by both. Let me show you how this works in R: just open up the script and we'll take a look. As with most of our other examples, we'll begin by loading the datasets package using library; just Ctrl+Enter or Command+Enter to run that command. That's now loaded, and we'll use the iris data set.
So we'll look at the first few cases; head(iris) is how we do that. Zoom in on it for a second: there's the iris data, we've already seen it several times. Then we'll come down and make a histogram of the petal length for all of the irises in the data set.
So iris is the name of the data set, and then petal length. There's our histogram off to the right; I'll zoom in on it for a second. You see, of course, that we've got this group stuck way over at the left, then we have a gap right here, and then a pretty much normal distribution for the rest of it.
I'll zoom back out. We can also get some summary statistics; I'll do that right here for petal length. There we have the minimum and maximum values, the quartiles, and the mean.
Now let's do one more thing: let's get the name of the species, which is going to be our categorical variable, and the number of cases of each species. So I do summary.
And then it knows that this is a categorical variable. We run it through, and we have 50 of each; that's good. The first thing we're going to do is select cases by their category, in this case by the species of iris. We'll do this three times, and we'll do it once for versicolor.
So I'm going to do a histogram, where I say: use the iris data, and the dollar sign means use this variable, petal length. Then, in square brackets, I put this to indicate: select these rows, or select these cases. And I say select where this variable, Species, equals (you've got to use the two equal signs) versicolor; make sure you spell it and capitalize it exactly as it appears in the data. Then we'll put a title on it that says Petal Length: Versicolor.
So here we go, and there are our selected cases; this is just 50 cases going into the histogram. Now, on the bottom right, we'll do a similar thing for virginica, where we simply change our selection criterion from versicolor to virginica.
And we get a new title there. And then, finally, we can do it for setosa also. So, great: that's three different histograms, made by selecting values on a categorical variable, where you just type them in quotes exactly as they appear in the data. Now, another way to do this is to select by value on a quantitative or scaled variable. If you want to do that, then in the square brackets that indicate you're selecting rows, you put the variable (I'm specifying that it's in the iris data set) and then say what value you're selecting. I'm looking for values less than two, and I have the title changed to reflect that.
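Sketches of both kinds of selection (the titles are illustrative):

```r
# Select by category: one species at a time
hist(iris$Petal.Length[iris$Species == "versicolor"],
     main = "Petal Length: Versicolor")

# Select by value on a quantitative variable
hist(iris$Petal.Length[iris$Petal.Length < 2],
     main = "Petal Length < 2")
```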
Now, what's interesting is that this selects the setosas; it's the exact same group. And so the histogram doesn't change, but the title and the method of selecting the cases did. Probably more interesting is when you want to use multiple selectors.
Let's look for virginica; that'll be our species. And we want short petals only. So this says what variable we're using, petal length, and this is how we select: we say iris, dollar sign, Species.
That tells us which variable, is equal to (with the two equal signs) virginica; then I just put an ampersand and say iris petal length is less than 5.5. Then I can run that, I get my new title, and I'll zoom in on it.
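The two selectors combined with & (a logical AND), as a sketch:

```r
hist(iris$Petal.Length[iris$Species == "virginica" &
                       iris$Petal.Length < 5.5],
     main = "Petal Length: Short Virginica")
```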
And so what we have here are just the virginicas, but only the shorter ones. That's a pair of selectors used simultaneously. Now, another way to do this, by the way: if you know you're going to be using the same subsample many times, you might as well create a new data set that has just those cases. The way you do that is you specify the data you're selecting from, then in square brackets the rows and the columns, and then you use the assignment operator, that's the less-than and dash here, which you can read as 'gets'.
So I'm going to create one called i.setosa, for iris setosa. And I'm going to do it by going to the iris data and, in Species, reading just setosa. I then put a comma, because that first part selects the rows; I would need to tell it which columns, but if I want all of them, I just leave that part blank.
So I'm going to do that. And now you see, up here in the top right (I'll zoom in on it), I have a new data object in the environment: a data frame called i.setosa. And we can look at the subsample I've just created; we'll get the head of just those cases.
Now, you see, it looks just the same as the other ones, except it only has 50 cases as opposed to 150. I can get a summary for those cases, and this time I'm doing just the petal length. I can also get a histogram for the petal length, and it's going to be just the setosas. So those are several ways of dealing with subsamples. Saving the selection, if you're going to be using it multiple times, lets you drill down in the data, get a more focused picture of what's going on, and inform the analyses you carry on from this point.
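Creating and reusing the saved subsample, sketched:

```r
# All rows where Species is setosa; all columns (blank after the comma)
i.setosa <- iris[iris$Species == "setosa", ]
head(i.setosa)                  # same structure, but only 50 cases
summary(i.setosa$Petal.Length)
hist(i.setosa$Petal.Length)
```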
The next step in our introduction is to talk about accessing data. To get that started, we need to say a little bit about data formats, and the reason for that is that sometimes, with your data, you're talking about apples and oranges: you have fundamentally different kinds of things.
Now, there are two ways in particular that this can happen. First, you can have data of different types. And then, regardless of the type, you can have your data in different structures. It's important to understand each of these.
We'll start by talking about data types. This is like the level of measurement of a variable. You can have numeric variables, which usually come as integer (whole number) or single or double precision. You can have character variables, with text in them; we don't have string variables in R, they're all character. You can have logical, which are TRUE/FALSE, otherwise called Boolean. You can have complex numbers. And there's a data type called raw. But regardless of which kind you have, you can arrange them into different data structures.
The most common structures are vector, matrix (or array), data frame, and list; we'll take a look at each of these. A vector is one or more values in a one-dimensional array: imagine them all in a straight line.
Now, what's interesting here is that in other settings, a single number would be called a scalar, but in R it's still a vector; it's just a vector of length one. The important thing about vectors is that the data are all of the same data type, so, for instance, all character or all integer.
You can think of the vector as R's basic data object; most other things are a variation on it. Going one step up from this is a matrix. A matrix has rows and columns; it's two-dimensional data.
The columns all need to be the same length, and all the data needs to be of the same class. Interestingly, the columns are not named; they're referred to by index numbers, which can make matrices a little awkward to work with. And then you can step up from that to an array, which is identical to a matrix but for three or more dimensions.
Probably the most common form, though, is the data frame. This is a two-dimensional collection that can have vectors of multiple types: character variables in one column, integer variables in another, logical in a third. The trick is they all need to be the same length.
You can think of the data frame as the closest thing R has to a spreadsheet; in fact, if you import a spreadsheet, it will typically go into a data frame. Now, the neat thing is that R has special functions for working with data frames, things you can do with them that you can't do with the other structures, and we'll see how those work as we go through this course and others. And then, finally, there's the list.
This is R's most flexible data format; you can put basically anything in a list. It's an ordered collection of elements, and you can have any class, any length, any structure. Interestingly, lists can include lists, which can include lists, and so on and so forth.
So it gets like the Russian nesting dolls: one inside the other, inside the other. The trick is that while that may sound very flexible and very good, lists are actually kind of hard to work with. And so the data frame is really sort of the optimal level of complexity for a data structure.
Then let me talk about something else here: the idea of coercion. Now, in the world of ethics, coercion is a bad thing; in the world of data science, coercion is good. What it means here is changing a data object from one type to another: changing the level of measurement or the nature of the variable you're dealing with.
So, for example, you can change a character to a logical, a matrix to a data frame, or double precision to integer; you can do any of these. It's going to be easiest to see how this works if we go to R and give it a whirl. So open up the script, and let's see how it works in RStudio. Now, for this demonstration of data types, we don't need to load any packages; we're just going to run through things on their own.
We'll start with numeric data. What I'm going to do is create a data object, a variable called n1, my first numeric variable. Then I use the assignment operator, that's the little left arrow, and it's read as 'n1 gets 15'. Now, R does double precision by default. Let me run this n1 line, and you can see that it showed up here on the top right. If I call the name of that object, it'll show its contents in the console.
So I just type n1 and run that. And there you can see, in the console at the bottom left, it brought up a 1 in square brackets; that's the index number for the first element, and this is an array of one number.
But there it is, and we get the value of 15. Also, we can use the R command typeof to confirm what type of variable this is, and it's double precision by default. We can do another one with 1.5, get its contents, 1.5, and see that it also is double precision. Then we come down and do character; I'm calling that c1, for my first character variable. You see that I write c1, the name of the object I want to create, then the assignment operator, the less-than and dash, which is read as 'gets', and then I have something in double quotes. In other languages, you would use single quotes for a single character and double quotes for strings.
They're the same thing in R. And I put in double quotes the lowercase c. That's just something I chose. So I feed that in. You can see that it showed up in the global environment there on the right.
We can call it up, and you see it shows up with the double quotes on it. We get the typeof, and it's a character. That's good.
If we want an entire string of text, I can feed that into c2, just by having it all in the double quotes. We pull it out, and we see that it also is listed as a character, even though in other languages it would be called a string. Next we can do logical; this is l1, for logical first.
And I'm feeding in TRUE. When you write TRUE or FALSE, they have to be in all caps, or you can use just the capital T or the capital F. Then I call that one out, and it says TRUE.
Notice, by the way, there are no quotes around it; that's one way you can tell it's a logical and not a character. If we put quotes around it, it would be a character variable. We get the typeof; there we go, it's logical. I said you can also use abbreviations:
so for my second logical variable, l2, I'll just use F. I feed that in, and now you see that when I ask it to tell me what it is, it prints out the whole word FALSE. And we get the typeof again: also logical.
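The type examples from this walkthrough, sketched (the c2 string is illustrative):

```r
n1 <- 15                  # numeric, double precision by default
typeof(n1)                # "double"
n2 <- 1.5
typeof(n2)                # "double"
c1 <- "c"                 # character; R has no separate string type
c2 <- "a string of text"  # still "character"
l1 <- TRUE                # logical; must be capitalized
l2 <- F                   # T and F work as abbreviations
typeof(l2)                # "logical"
```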
Then we can come down to data structures. I'm going to create a vector, which is a one-dimensional collection, and I'm doing it by creating v1, for vector one. I use c, which stands for concatenate; you can also think of it as combine or collect. I'm going to put five numbers in there; you need to use a comma between the values. And then I call out the object.
And there are my five numbers. Notice it shows them without the commas, but I had to have the commas going in. Then I ask R: is it a vector? That's is.vector, and it's going to say TRUE.
Yes, it is. I can also make a vector of characters. I do that right here, I get the characters.
And it's also a vector. I can make a vector of logical values, TRUE and FALSE; call that, and it's a vector also. Now, a matrix, you may remember, goes in more than one dimension.
In this case, I'm going to call it m1, for matrix one, and I'm using the matrix function. So I'm saying matrix, and then combine these values: T, T, F, F, T, F. Then I say how many rows I want, and R can figure out the number of columns by doing some math. So I put that into m1, and then I ask for it.
And see, now it displays the values in rows and columns, and it writes out the full TRUE or FALSE. Now I can do another one, where I make a second matrix,
and this is where I explicitly shape it into rows and columns in the code. Now, that's just for my convenience; R doesn't care that I broke it up to show the rows and columns, but it's an easier way of working with it.
And if I want to tell it to organize the values by rows, I can specify that with byrow = T, for true. I do that, and now I have the a, b, c, d. You see, by the way, that I have the index numbers.
On the left are the row index numbers, row one and row two. On top are the column index numbers, and they come second, which is why it's blank and then 1 for the first column, blank and then 2 for the second. Then we can make an array. What I'm going to do here is create the data with the colon operator, which says give me the numbers 1 through 24 (I still wrap them in c to combine them), and then give the dimensions of my array; those go rows, columns, and then tables, because I'm using three dimensions here. I'm going to feed that into an object called a1, for array one.
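Those structures, sketched (the vector values are illustrative):

```r
v1 <- c(1, 2, 3, 4, 5)      # numeric vector
v2 <- c("a", "b", "c")      # character vector
v3 <- c(TRUE, FALSE, TRUE)  # logical vector
is.vector(v1)               # TRUE

m1 <- matrix(c(T, T, F, F, T, F), nrow = 2)     # filled column by column
m2 <- matrix(c("a", "b",
               "c", "d"), nrow = 2, byrow = T)  # filled row by row

a1 <- array(c(1:24), c(4, 3, 2))  # dimensions: rows, columns, tables
```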
And there's my array right there; you can see that I have two tables. In fact, let me zoom in on that one. It's organized by the last dimension, which is the table,
and then the rows and the columns are listed separately for each of them. A data frame lets me combine vectors of the same length but of different types. What I'm doing here is creating a vector of numeric values, one of character values, and one of logical values; so these are three different vectors. Then I'm going to use the function cbind, for column bind, to combine them into a single data frame I'm calling dfa, for data frame a, or 'all'.
Now, the trick here is that we got some unintentional coercion. By just using cbind, R coerced everything to the most general format: I had numeric, character, and logical variables, and the most general is character.
And so it turned everything into a character variable. That's a problem; it's not what I wanted. I have to add another function and tell it specifically to make a data frame, using as.data.frame.
When I do that, I can combine it. And now you see it's maintained the data types of each of the variables. That's the way I want it.
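A sketch of that pitfall and the fix. Because cbind() on mixed vectors makes a character matrix first, the sketch below builds the data frame with cbind.data.frame(), which reliably keeps each column's type (the vector contents are illustrative):

```r
vNumeric   <- c(1, 2, 3)
vCharacter <- c("a", "b", "c")
vLogical   <- c(TRUE, FALSE, TRUE)

cbind(vNumeric, vCharacter, vLogical)  # everything coerced to character
dfa <- cbind.data.frame(vNumeric, vCharacter, vLogical)
str(dfa)                               # each column keeps its own type
```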
And then, finally, I can do a list. I'm going to create three objects here: object one, which is numeric with three values; object two, which is character with four; and object three, which is logical with five. Then I combine them into a list using the list function, and put them into list1. And now we can see the contents of list1.
You can see it's kind of a funky structure, and it can be hard to read, but all the information is there. And then we're going to do something that's kind of hard to get your head around: I'm going to create a new list that has list1 in it.
So I have the same three objects, plus I'm adding list1 onto it; that's list2. I'm going to zoom in on that one, and you can see it's a lot longer, and we've got a lot of index numbers there in the brackets: there are the three numeric values, the four character values, and the five logical values.
And then here they are repeated, but that's because they're all part of list1, which I included in this list. So those are some of the different ways you can structure data of different types. But you'll also want to know that we can coerce data into different types to serve our different purposes.
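The list examples, sketched (the element values are illustrative):

```r
o1 <- c(1, 2, 3)
o2 <- c("a", "b", "c", "d")
o3 <- c(TRUE, FALSE, TRUE, TRUE, FALSE)

list1 <- list(o1, o2, o3)
list2 <- list(o1, o2, o3, list1)  # lists can nest, like Russian dolls
str(list2)
```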
The next thing we need to talk about is coercing types. There's automatic coercion, and we've seen a little of that, where the data automatically goes to the least restrictive data type.
So, for instance, if we do this, where we have a 1, which is numeric, a 'b' in quotes, which is character, and a logical value, and we feed them all into this object, coerce1. (By the way, by putting parentheses around the whole assignment, it both saves the object and shows us the result.) You can see that it's taken all of them and made them all character, because that's the least specific, most general format. And so that'll happen.
But you've got to watch out, because you don't want things getting coerced when you're not paying attention. On the other hand, you can coerce things explicitly if you want them to go a particular way. So I can take this variable right here, coerce2, and put a 5 into it. We can get its type, and we see that it's double.
Okay, that's fine. What if I want to make it an integer? Then what I do is use the command as.integer. I run that, feed it into coerce3, and it looks the same when we see the output, but now it is an integer.
That's how it's represented in memory. I can also take a character variable; here I have 1, 2, and 3 in quotes, which makes them characters. Call those up, and you can see that they're all character.
But now I can feed them in with as.numeric, and it's able to see that there are numbers inside those characters and coerce them to numeric. Now you see that it's lost the quotes, and it goes to the default double precision.
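Those coercion steps, sketched (the object names follow the narration):

```r
(coerce1 <- c(1, "b", TRUE))  # automatic coercion: all become character

coerce2 <- 5
typeof(coerce2)               # "double"
coerce3 <- as.integer(5)
typeof(coerce3)               # "integer"

coerce4 <- c("1", "2", "3")
coerce5 <- as.numeric(coerce4)
typeof(coerce5)               # "double"; the quotes are gone
```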
Probably the coercion you'll do most often is with a matrix. Let's take a look: I'll make a matrix of nine numbers, in three rows and three columns. There they are.
What we're going to do is coerce it to a data frame. Now, that doesn't change the way it looks; it's going to look the same. But there are a lot of functions you can only use with data frames, not with matrices. This line, by the way, asks: is it a matrix? And the answer is TRUE.
But now let's do this: we'll do the same thing and just add on as.data.frame, and now we're telling it to make a data frame.
And you see, it basically looks the same; it's just listed a little differently. The matrix had index numbers for the rows and the columns; this one has a row index and then variable names across the top.
R automatically names the variables V1, V2, and V3, but the numbers in it look exactly the same. And if we come back here and ask, is it a data frame? We get TRUE.
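Sketched out:

```r
coerce6 <- matrix(1:9, nrow = 3)  # nine numbers, three rows by three columns
is.matrix(coerce6)                # TRUE

coerce7 <- as.data.frame(matrix(1:9, nrow = 3))
is.data.frame(coerce7)            # TRUE; columns are auto-named V1, V2, V3
```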
And so that was a fairly long discussion, but the point is this: data comes in different types and in different structures, and you're able to manipulate those so you can get the format, the type, and the arrangement you need for doing your analyses in R. To continue our introduction to accessing data, we want to talk about factors.
Depending on the kind of work you do, this may be a really important topic. Factors have to do with categories and the names of those categories. Specifically, a factor is an attribute of a vector that specifies the possible values and their order. It's going to be a lot easier to see if we just try it in R, and let me demonstrate some of the variations.
Just open up the script and we can run through it together. What we're going to do here is create a bunch of artificial data and see how factors work. First, I'm going to create a variable x1 with the numbers one through three.
By putting the assignment in parentheses, it'll both store the object in the environment and display it in the console. So there we have the three numbers 1, 2, and 3. Then I'm going to create another variable, y, with the numbers one through nine. So there that is.
Now what I want to do is combine these two, and I'm going to use cbind.data.frame, column bind to a data frame. So it's going to put them together and make them a data frame.
And it's going to save them into a new object I'm creating called df1, for data frame one. We get to see the results of that; let me zoom in on it a little bit. There you can see we have nine rows of data: we have one variable, x1, from the one that I created; then we have y; and then we have the nine indexes, or row IDs, down the side. Please note that the first one, x1, only had three values.
And so what R did is recycle it; you see it happening three times: 1, 2, 3, 1, 2, 3, 1, 2, 3. And what we want to find out now is: what kind of variable is x1 in this data frame? Well, it's an integer. And when we get the structure, it shows that it's still an integer, if we look at this line right here.
Okay, but we can change it to a factor by using as.factor, and R will treat it differently then. So I'm going to create a new one called x2 that, again, is just the numbers 1, 2, and 3, but now I'm telling R that those specifically represent factors. Then we'll create a new data frame using this x2 that I saved as a factor, and the one through nine that we had in y. At this point it looks the same, and if we get the typeof, it's still an integer; that's fine.
But if we get the structure of df2, it now tells us that x2, instead of being an integer, is a factor with three levels. It gives us the three levels in quotes, "1", "2", and "3", and then it lists the data.
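The integer-versus-factor contrast, sketched:

```r
x1  <- 1:3
y   <- 1:9
df1 <- cbind.data.frame(x1, y)
str(df1)  # x1 is an integer, recycled: 1 2 3 1 2 3 1 2 3

x2  <- as.factor(c(1, 2, 3))
df2 <- cbind.data.frame(x2, y)
str(df2)  # x2 is now a factor with three levels: "1", "2", "3"
```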
Now, if we want to take an existing variable and define it as a factor, we can do that too. Here I'll create yet another variable with three values in it, and then we'll bind it to y in a data frame. Then I use the factor function right here: I tell it to reclassify this variable, x3, as a factor, feed it back into the same place, and specify that these are the levels of the factor.
And because I put it in parentheses, it shows it to us in the console, and there we have it. Let's get the type: it's an integer, but the structure shows it again as a factor. So that's one where we took an existing variable and turned it into a factor.
If you want labels, we can do it this way. We'll do x4, again the numbers one through three, and we'll bind it to y, one through nine, to make a data frame. And here I'm going to take the existing variable in df4, the variable x4, and I'm going to give it labels.
I'm going to give them text labels: macOS, Windows, and Linux, three operating systems. And please note, I need to put those in the same order that I want them to line up with the numbers, so 1 will be macOS, 2 will be Windows, and 3 will be Linux.
I run that through, and we can pull it up here. Now you can see how it goes through and changes that factor to the text labels, even though I entered the values numerically. If I get the typeof to see what it is, it still calls it integer, even though it's showing me words. And the structure, this is an important one; let's zoom in on that for a second. The structure here at the bottom says it's a factor with three levels, and it starts giving me the labels.
But then it shows us that those are actually the numbers 1, 2, and 3 underneath. If you're used to working with a program like SPSS, where you can have values and then value labels on top of them, it's the same kind of concept here. Next, I want to show you how we can switch the order of things, and this gets a little confusing, so try it a couple of times and see if you can follow the logic. We'll create another variable, x5, that's just the 1, 2, and 3.
We'll bind it to y, and there's our data frame, just like in the other examples. Now what I'm going to do is take that new variable, x5, in the data frame df5.
And notice here, I'm listing the levels, but in a different order; I'm changing the order that I put them in. Then I'm lining up these labels. When I run that through, you can see the labels: Maybe, Yes, No; Maybe, Yes, No; it's showing us the nine values. And this is an interesting one: because the factor is ordered, it prints the levels with a less-than sign at each point, to indicate which one comes first and which comes later. We can take a look at the actual data frame that I made; I'll zoom in on that.
And you can see, we know the first value is a 1, because when I created this it was 1, 2, 3, and so the Maybe is a 1; it's the second thing in the level order, No < Maybe < Yes, so by listing the levels in this order, the 1 falls in the middle. There may be situations in which you want to do that; I just want you to know that you have this flexibility in creating your factor labels in R. And finally, we can check the typeof: it's still an integer, because it's still coded numerically underneath, but we can get the structure and see how that works. And so factors give you the opportunity to assign labels to your values and then use them as factors in various analyses.
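Labeled and ordered factors, sketched from this walkthrough (the level and label choices follow the narration):

```r
x4  <- c(1, 2, 3)
df4 <- cbind.data.frame(x4, y = 1:9)
df4$x4 <- factor(df4$x4,
                 levels = c(1, 2, 3),
                 labels = c("macOS", "Windows", "Linux"))

x5  <- c(1, 2, 3)
df5 <- cbind.data.frame(x5, y = 1:9)
(df5$x5 <- ordered(df5$x5,
                   levels = c(3, 1, 2),  # reordered on purpose
                   labels = c("No", "Maybe", "Yes")))
str(df5)  # ordered factor: No < Maybe < Yes, still integers underneath
```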
If you do experimental research, this sort of thing becomes really important, and so this gives you an additional possibility for your analyses in R, as you define your numerical variables as factors. Our next step in our introduction to accessing data is entering data; this is where you're typing it in manually.
And I like to think of this as a version of ad hoc data. Because under most circumstances, you would import a data set. But there are situations in which you need just a small amount of data right away. And you can type it in this way.
Now, there are several methods available for this. There's the colon operator; there's seq, which is for sequence; there's c, which is short for concatenate; there's scan; and there's rep. I'm going to show you how each of these works. I'll also mention this little one, the less-than and dash: that's the assignment operator in R.
Let's take a look at it in R and I'll explain how all of it works. Just open up the script and we'll give it a whirl. What we're going to do here is just begin with a little discussion of the assignment operator.
The less-than and dash is used to assign values to a variable, so it's called the assignment operator. Now, a lot of other programs would use an equals sign, but we use this one that looks like an arrow, and you read it as 'gets': so, 'x gets 5'. It can go in the other direction, pointing to the right, though that would be very unusual. And you can use an equals sign, and R knows what you mean, but both of those are generally considered poor form, and that's not just arbitrary:
if you look at the Google style guide for R, it's specific about that. In RStudio, you have a shortcut for this: if you do Option-dash, it inserts the assignment operator and a space.
So I'll come down here right now, do Option-dash, and there you see it. That's a nice little shortcut you can use in RStudio when you're doing your ad hoc data entry. Let's start by looking at the colon operator; most of this you'll have seen already.
What this means is you simply stick a colon between two numbers, and it runs through them sequentially. So x1 is a variable I'm creating; then I have the assignment operator, and it gets 0:10, which means it gets the numbers zero through ten. And there they all are. (I'm going to delete a stray colon that's waiting for me to do something here.) Now, if we want to go in descending order, just put the higher number first; so I'll put 10:0, and there it goes the other way. Next is seq.
seq is short for sequence, and it's a way of being a little more specific about what you want. If you want, we can call up the help on seq; it's right over here, sequence generation, there's the information.
We can do ascending values: seq(10) gives one through ten; it doesn't start at zero, it starts at one. But you can also specify how much you want things to jump by. So if you want to count down in threes, you do seq from 30 to 0 by negative 3, which means step down by threes. We'll run that one.
And because it's in parentheses, it'll both save it to the environment and show it in the console right away. So those are ways of doing sequential numbers, and they can be really helpful.
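The colon and seq() examples, sketched:

```r
x1 <- 0:10                   # 0 through 10
x2 <- 10:0                   # descending
(x3 <- seq(10))              # 1 through 10
(x4 <- seq(30, 0, by = -3))  # 30, 27, 24, ..., 0
```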
Now, if you want to enter an arbitrary collection of numbers in any order, you can use c; that stands for concatenate, and you can also think of it as combine or collect. We can call up the help on that one; there it is. Let's just take these numbers and use c to combine them into the data object x5. We can pull it up, and there you see, it went right through. An interesting one is scan, which is for entering data live.
So we'll do scan here, and get some help on that one; you can see it's for 'read data values'. This one takes a little explanation. I'm going to create an object x6, and I'm feeding into it scan with opening and closing parentheses, because I'm running that command.
So here's what happens: I run that one, and down here in the console, you see that it now shows a 1 and a colon, and I can just start typing numbers.
And after each one, I hit Enter. And I can type in however many I want. And then when you're done, just hit Enter twice.
And it reads them all in. If you want to see what's in there, come back up here and just call the name of that object; there are the numbers I entered. And so there may be situations in which that makes it a lot easier to enter data, especially if you're using a 10-key pad.
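Concatenation and live entry, sketched (the x5 values are illustrative):

```r
x5 <- c(5, 3, 1, 8, 4)  # arbitrary values in any order
x6 <- scan()            # type a number, press Enter; press Enter twice to stop
x6                      # shows whatever was typed in
```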
Now, rep, you can guess, is for repetition; we'll call the help on that one, 'replicate elements'. And here's what we're going to do: for x7, we're going to repeat, or replicate, TRUE five times. So x7, and if you want to see, there are our five TRUEs all in a row.
If you want to repeat more than one value, it depends a little on how you set things up. Here I'm going to repeat TRUE and FALSE, but by doing it as a set, where I use c, concatenate, to collect the set, what it's going to do is repeat that set in order five times: TRUE, FALSE, TRUE, FALSE, TRUE, FALSE, and so on.
That's fine. But maybe you want the first value five times and then the second value five times; think of it like collating on a photocopier.
If you don't want it collated, you add the argument each, and that's going to do TRUE, TRUE, TRUE, TRUE, TRUE, FALSE, FALSE, FALSE, FALSE, FALSE. And so these are various ways you can set up data quickly for an ad hoc or as-needed analysis. It's also a way of checking how functions work, as I've done in a lot of examples here, and you can explore the possibilities and see how to use them in your own work.
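The rep() variations, sketched:

```r
x7 <- rep(TRUE, 5)                   # TRUE five times
x8 <- rep(c(TRUE, FALSE), 5)         # the pair repeated: T F T F T F ...
x9 <- rep(c(TRUE, FALSE), each = 5)  # "uncollated": T T T T T F F F F F
```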
The next step in our introduction to accessing data is importing data, which will probably be the most common way of getting data into R. The goal here is to make it easy: get the data in, get a large amount in quickly, and get processing as soon as you can. Now, there are a few kinds of data files you might want to import. There are CSV files; that stands for comma-separated values, and it's sort of the plain-text version of a spreadsheet. Any spreadsheet program can export data as a CSV, and nearly any data program at all can read them. There are also straight text files, .txt.
Those can be opened in text editors and word processors. Then there are .xlsx files, which are Excel spreadsheets, as well as the older .xls version. And finally, if you're going to get fancy, you have the opportunity to import JSON; that's JavaScript Object Notation.
If you're using web data, you might be dealing with that kind of file. Now, R has built-in functions for importing data in many formats, including the ones I just mentioned. But if you really want to make your life easy, you can use just one. A package that I load every time I use R is rio, which is short for R input/output. What rio does is combine all of R's import functions into one simple utility with consistent syntax and functionality, and it makes life so much easier.
Let's see how this all works in R; just open up this script and we'll run through the examples all the way through.
But there is one thing you'll want to do first, and that is to go to the course files that we downloaded at the beginning of this course. These are the individual R scripts, but this folder right here is significant: it's a collection of three data sets. I'm going to click on that. They're all called mbb, and the reason they're called that is that they contain Google Trends information about searches for Mozart, Beethoven, and Bach, three major classical composers.
And it's all about the relative popularity of these three search terms over a period of several years. I have it here in CSV, or comma-separated value, format; as a text file, .txt; and even as an Excel spreadsheet. Now let's go to R and we'll open up each one of these. The first thing we need to do is make sure that you have rio.
I've mentioned before that rio is one of the packages I load every time, so I'm going to use pacman and do my standard loading of packages, and rio is available. Now, I do want to tell you one significant thing about Excel files.
For that, we're going to go to the official R documentation. If you click on this link, it'll open up your web browser; it's a shortcut to the R documentation page. And here's what it says; I'm actually going to read this verbatim.
Reading Excel spreadsheets. The most common R data import/export question seems to be: how do I read an Excel spreadsheet? This chapter collects together advice and options given earlier. Note that most of the advice is for pre-Excel 2007 spreadsheets and not the later .xlsx format.
The first piece of advice is to avoid doing so if possible. If you have access to Excel, export the data you want from Excel in tab-delimited or comma-separated form, and use read.delim or read.csv to import it into R. You may need to use read.delim2 or read.csv2 in a locale that uses the comma as the decimal point. Exporting a DIF file and reading it using read.DIF is another possibility. Okay, so really, what they're saying is: don't do it.
Well, let's go back to R, and I'm just going to say right here: you have been warned. But let's make life easy by using rio.
Now, if you've saved these three files to your desktop, it's really easy to import them. We'll start with the CSV; rio_csv is the name of the object I'm going to import into. All we need is this one command, import; we don't have to specify that it's a CSV, or say that it has headers, or anything. We just use import.
Then, in quotes and in parentheses, we put the name and location of the file; on a Mac, a file on your desktop shows up this way. I'm going to run that, and you can see that it just showed up in my environment on the top right. I'll expand that a little bit: I now have a data frame. I'll come back out, and let's take a look at the first few rows of that data frame. Zoom in, and you can see we have months listed,
and then the relative popularity of searches for Mozart, Beethoven, and Bach during those months. Now, if I want to read the text file, what's really nice is I can use the exact same command, import; I just give the location and name of the file, adding the .txt. I run that, we look at the head, and you'll see it's exactly the same. No difference; piece of cake.
What's nice about rio is I can even do the .xlsx file. It helps that there's only one tab in that file, and that it's set up to look exactly the same as the others. But when I do that, we run through, and you see that, once again, it's the same thing.
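The three imports, sketched (the desktop paths are assumptions about where you saved the files):

```r
pacman::p_load(rio)
rio_csv  <- import("~/Desktop/mbb.csv")
rio_txt  <- import("~/Desktop/mbb.txt")
rio_xlsx <- import("~/Desktop/mbb.xlsx")
head(rio_csv)  # months, plus search popularity for each composer
```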
rio is able to read all of these automatically, which makes life very easy. Another neat thing is that R has something called a data viewer; we'll get a little information on that through help.
And you invoke the data viewer like this: we do it with a capital V, for View, and then we say what it is we want to see, and we'll do rio_csv.
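For instance:

```r
View(rio_csv)  # note the capital V; opens a sortable, spreadsheet-like tab
```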
When we run that command, it opens up a new tab here, and it's like a spreadsheet. In fact, it's sortable: we can click on a column header and go from the lowest to the highest, and vice versa.
And you see that Mozart actually is setting the range here. That's one way to do it. You can also come over here and just click on this little icon; it looks like a calendar, but it is in fact the same thing. We can double-click on that.
And now you see we get a viewer of that file as well. I'm going to close both of those, and I'm going to show you the built-in R commands for reading files. These are the ones rio uses under the hood. We don't have to go through all of them, but you may encounter them in a lot of existing code, because not everybody uses rio,
and I want you to see how they work. If you have a text file saved in tab-delimited format, you need the complete address, and you might try something like this: read.table is normally the command.
And you need to say that you have a header, that there are variable names across the top. But when you read this, you're going to get an error message, and, you know, it's frustrating. That's because there are missing values in there,
in the top left corner. So what we need to do is be a little more specific about what the separator is. I do the same thing, where I say read.table, there's the name of the file in this location, we have a header, and this is where I say the separator is a tab: the backslash-t escape sequence indicates a tab.
So if I run that one, it reads it properly. We can also do CSV. The nice thing here is you don't have to specify the delimiter, because CSV means it's comma-separated, so we already know what it is, and I can read that one in the exact same way.
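The built-in readers, sketched (the path is an assumption):

```r
# Tab-delimited text needs an explicit separator, or missing values cause errors
df_txt <- read.table("~/Desktop/mbb.txt", header = TRUE, sep = "\t")

# CSV implies the comma, so no separator argument is needed
df_csv <- read.csv("~/Desktop/mbb.csv", header = TRUE)
```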
And if I want to, I can come over here, click on the viewer, and see the data that way also. So it's really easy to import data, especially if you use the package rio, which is able to read the format automatically, get the data in properly, and get you started on your analyses as soon as possible. Now, the part of our introduction that maybe most of you were waiting for is modeling data.
On the other hand, because this is a very short introductory course, I'm really just giving a tiny overview of a handful of common procedures. In other courses here at datalab.cc, we'll have much more thorough investigations of common statistical modeling and machine learning algorithms. But right now, I just want to give you a flavor of what can be done in R. We'll start by looking at a common procedure: hierarchical clustering, a way of finding which cases or observations in your data belong with each other. More specifically, you can think of it as the idea of 'like with like': which cases are like which other ones.
Now, the thing is, of course, this depends on your criteria: how you measure similarity, how you measure distance. And there are a few decisions you have to make. You can use, for instance, what's called a hierarchical approach, which is what we're going to do,
or you can do it where you're trying to get a set number of groups; that's called k, the number of groups. You also have many choices of distance measures. And you have a choice between what's called divisive clustering, where you start with everything in one group and then split them apart, and agglomerative, where they all start separately and you selectively put them together. But we're going to keep our life simple, and so we'll do the single most common kind of clustering: we'll use a measure of Euclidean distance; we'll use hierarchical clustering, so we don't have to set the number of groups in advance; and we'll use a divisive method, where we start with everything together and gradually split it apart. Let me show you how this works in R. What you'll find is that even though this may sound like a very sophisticated technique, and a lot of the mathematics behind it is sophisticated, it's really not hard to do in practice.
So what we're going to do here is use a data set that we use frequently. I'm going to load my default packages to get some of this ready, and then I'll bring in the datasets package. We're going to use mtcars, which, if you recall, is Motor Trend car road test data from 1974; there are 32 cars in there, and we're going to see how they group, which cars are similar to which. Now let's take a look at the first few rows of data to see what variables we have. You see we have miles per gallon, cylinders, displacement, and so on and so forth. Not all of these are going to be really influential or useful variables,
and so I'm going to drop a few of them and create a new data set that includes just the ones I want. If you want to see how I do that: I'm going to come back here and create a new object, a new data frame called cars, and it gets the data from mtcars. By putting the blank before the comma here, I'm saying: use all of the rows.
But here I'm selecting the columns: c, for concatenate, means combine these, and I want columns one through four, skip five, keep six and seven, skip eight, and then nine through eleven. That's a way of selecting my variables. So I'm going to run that, and you see that cars is now showing up in my environment there at the top right.
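That selection, sketched (dplyr is loaded here for the pipes used shortly):

```r
pacman::p_load(dplyr)
library(datasets)

cars <- mtcars[, c(1:4, 6:7, 9:11)]  # keep columns 1-4, 6-7, 9-11; drop 5 and 8
head(cars)
```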
Let's take a look at the head of that data set; we'll zoom in on that one. You can see it's a little bit smaller: we have miles per gallon, cylinders, displacement, horsepower, weight, quarter-mile seconds, and so on.
Now we're going to do the cluster analysis, and what we'll find is that if we use the defaults, it's super, super easy. In fact, I'm going to be using something called pipes, which come from the package dplyr; that's why I loaded it. It's this thing right here.
What a pipe allows you to do is take the results of one step and feed them directly in as the input to the next step. Otherwise, this would be several separate steps, but this way I can run it really quickly.
I'm going to create an object called hc, for hierarchical clusters. We read the cars data that I just created; we get the distance, or dissimilarity, matrix, which says how far each observation is in Euclidean space from each of the others; and then we feed that through the hierarchical clustering routine, hclust. So that saves it all into one object.
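As a sketch:

```r
hc <- cars %>%  # start with the reduced data set...
      dist %>%  # ...get the Euclidean distance / dissimilarity matrix...
      hclust    # ...and run hierarchical clustering on it
```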
Now all we need to do is plot the results: we do plot(hc), my hierarchical cluster object. And then we get this very busy chart over here. But if I zoom in on it and wait a second, you can see this nice little thing called a dendrogram, because it branches like a tree; it looks more like roots here. You can see the cars all start together, and then they split, and split, and split again. Now, if you know your cars from 1974, you can see that some of these groupings make sense.
So, for instance, here we have the Honda Civic and the Toyota Corolla, which are still in production, right next to each other. The Fiat 128 and the Fiat X1-9 were both small Italian cars.
They were different in many ways, but you can see that they're right next to each other. The Ferrari Dino and the Lotus Europa make sense next to each other. If we come over here: the Lincoln Continental, the Cadillac Fleetwood, and the Chrysler Imperial; it's no surprise they're next to each other. What is interesting is this one here, the Maserati Bora, totally separate from everything else, because it was a very unusual, different kind of car at the time.
Now one really important thing to remember is that the clustering is only valid for these data points based on the data that I gave it. I only gave it a handful of variables and so it has to use those ones to make the clusters. If I gave it different variables or different observations we could end up with a very different kind of clustering. But I want to show you one more thing we can do here with this cluster to make it even easier to read. Let me zoom back out.
What we're going to do is draw some boxes around the clusters. We'll start by drawing two boxes with gray borders; I'm going to run that one, and you can see it showed up. Then we'll make three blue ones, four green ones, and five dark red ones.
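Sketched with rect.hclust(), following the colors in the narration (the exact color names are assumptions):

```r
plot(hc)                                    # the dendrogram from above
rect.hclust(hc, k = 2, border = "gray")     # two clusters
rect.hclust(hc, k = 3, border = "blue")     # three
rect.hclust(hc, k = 4, border = "green4")   # four
rect.hclust(hc, k = 5, border = "darkred")  # five
```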
Let me zoom in on this again. Now it's easier to see what the groups are in this particular data set. We have here, for instance, the Hornet 4 Drive, the Valiant, the Mercedes-Benz 450SLC, the Dodge Challenger, and the AMC Javelin, all clumping together in one general group.
And then we have these other really big V8 American cars. What's interesting, again, is that the Maserati Bora is off by itself almost immediately. That's kind of surprising, because the Ford Pantera L has a lot in common with it.
But this is a way of seeing, based on the information I gave it, how things are clustered. And if you're doing market analysis, if you're trying to find out who's in your audience, or what groups of people think in similar ways, this is an approach you're probably going to use.
And you can see that it's really simple to set up, at least using the defaults in R, as a way of finding the regularities, consistencies, and groupings in your data. As we continue our very brief introduction to modeling data in R, another common procedure we might want to look at briefly is called principal components. The idea here is that in certain situations, less is more; that is, less noise and fewer unhelpful variables in your data can translate to more meaning.
And that's what we're after in any case. Now, this approach is also known as dimensionality reduction. And I like to think of it by an analogy, you look at this photo, and what you see are these big black outlines of people, you can tell basically how tall they are, what they're wearing, where they're going. And it takes a moment to realize you're actually looking at a photograph that goes straight down. And you can see the people there on the bottom, and you're looking at their shadows.
And we're trying to do a similar thing. Even though these are shadows, you can still tell a lot about the people. People are three-dimensional, shadows are two-dimensional, but we've retained almost all of the important information.
If you want to do this with data, the most common method is called principal component analysis or PCA. And let me give you an example of the steps metaphorically in PCA. You begin with two variables.
And so here's a scatterplot, we've got X across the bottom, Y at the side, and this is just artificial data. And you can see that there's a strong linear association between these two. Well, what we're going to do is we're going to draw a regression line through the data set. And you know, it's there about 45 degrees. And then we're going to measure the perpendicular distance of each data point to the regression line.
Now, not the vertical distance, that's what we would do if we were looking for regression residuals, but the perpendicular distance. And that's what those red lines are. Then what we're going to do is we're going to collapse the data by sliding each point down the red line to the regression line.
And that's what we have there. And then finally, we have the option of rotating it so it's not on diagonal anymore, but it's flat. And that there is the PC, the principal component. Now, let's recap what we've accomplished here.
We went from a two-dimensional data set to a one-dimensional data set, but maintained some of the information in the data. In fact, I like to think that we've maintained most of the information, and hopefully the most important information, in our data set. And the reason we're doing this is that it makes the analysis and interpretation easier and more reliable: going from something more complex, two dimensions or more, down to something simpler, with fewer dimensions, means it's easier to make sense of in general. Let me show you how this works in R; open up this script,
and we'll go through an example in RStudio. To do this, we'll first need to load our packages, because I'm going to use a few of them. I'll load those, and we'll load the datasets package. Now, I'm going to use the mtcars data set; we've seen it a lot.
And I'm going to create a little subset of variables. Let's look at the entire list of variables. And I don't want all of those in my particular data set. So the same way I did with hierarchical clustering, I'm going to create a subset by dropping a few of those variables.
And we'll take a look at that subset. Let's zoom in on that. And so there's the first six cases in my slightly reduced data set.
And we're going to use that to see what dimensions we can get down to, so we have fewer than the nine variables we have here. Let's try to get to something a little smaller and see if we still maintain the important information in this data set. Now, what we're going to do is start by computing the PCA, the principal component analysis. We'll use the entire data frame here, and I'm going to feed it into an object called pc, for principal components.
There's more than one way to do this in R, but I'm going to use prcomp. This specifies the data set that I'm going to use, and I'm going to add two optional arguments.
One is centering the data, which means shifting the values so the means of all the variables are zero. The second is scaling the data, which compresses or expands the range of each variable so it has unit variance, a variance of one. That puts all of the variables on the same scale and keeps any one variable from overwhelming the analysis.
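A sketch of that call (prcomp's scale argument is formally spelled 'scale.' with a trailing dot, but the short spelling matches it):

```r
pc <- prcomp(cars,           # the reduced mtcars subset from before
             center = TRUE,  # shift each variable to a mean of zero
             scale = TRUE)   # rescale each variable to unit variance
```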
So let me run through that, and now we have a new object that showed up on the right. If you want to, you can also specify the variables explicitly with a formula; the tilde here introduces the variables that the components will be based on.
I can give the variable names all the way through, and then I say what data set they come from: data = mtcars. I can do the centering and the scaling there also, and it produces exactly the same thing; it's just two different ways of writing the same command.
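The formula version, sketched (the names are the nine columns kept earlier):

```r
pc <- prcomp(~ mpg + cyl + disp + hp + wt + qsec + am + gear + carb,
             data = mtcars, center = TRUE, scale = TRUE)
```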
To examine the results, we can come down and get a summary of the object pc that I created. So I'll run that, and then we'll zoom in on this.
And here's the summary. It shows nine components, PC1, principal component one, through PC9, principal component nine; you get the same number of components as original variables.
But the question is how the variation is divvied up among them. Take a look here at principal component one: it has a standard deviation of 2.3391. What that means is that, since each scaled variable began with a standard deviation of one, this one component carries as much variation as about 2.4 of the original variables.
The second one has about 1.59, and the others have less than one unit of standard deviation, which means they're probably not very important in the analysis. We can also get a scree plot of the components, to get an idea of how much each one explains of the original variance. And we see right here, I'll zoom in on that, that our first component seems to be really big and important; our second one is smaller, but still clearly above zero; and then we kind of grind down from there. Now, there are several different criteria for choosing how many components are important and what you want to do with them.
Right now, we're just eyeballing it, and we see that number one is really big, and number two is sort of a minor axis in our data. If you want, you can also get the standard deviations and something called the rotation; here I'm just going to call pc, and then we'll zoom in on that in the console and scroll back up a little bit.
It's a lot of numbers. The standard deviations here are the same as what we got in the first row of the summary, so that just repeats that the first one's really big and the second one smaller. And then what the rotation shows is the association between each of the individual variables and the nine components; you can read these roughly like correlations.
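Examining the results, sketched:

```r
summary(pc)  # standard deviation and proportion of variance per component
plot(pc)     # scree-style plot of the component variances
pc           # standard deviations plus the rotation (variable loadings)
```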
Now let's see how the individual cases load on the PCs. To do that, I use predict, running it on the principal components, and then I feed those results through the pipe into round, so they're a little more readable. I'll zoom in on that. And here we've got the nine components listed, and all of our cars.
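As a sketch:

```r
predict(pc) %>% round(2)  # each car's score on each component, rounded
```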
But the first two are probably the ones that matter most. So we have here PC1 and PC2, and you see we've got a big value there, 2.49, and so on. But probably the easiest way to deal with all this is to make a plot.
What we're going to do is something with a funny name: a biplot. What that means here is a two-dimensional plot; really, all it does is chart the first two components. But that's good, because based on our analysis, it's really only the first two that seem to matter anyhow. So let's do the biplot, which is a very busy chart.
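For instance:

```r
biplot(pc)  # cases on PC1 and PC2, with each variable's direction drawn in red
```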
But if we zoom in on it, we might be able to see a little better what's going on. What we have is the first principal component across the bottom and the second one up the side. The red lines indicate, approximately, the direction of each individual variable's contribution, and each case is shown by name, about where it would fall.
Now, if you remember from the hierarchical clustering, the Maserati Bora was really unusual, and you can see it's up there all by itself. And then really what we seem to have here is displacement and weight and cylinders and horsepower: this appears to be big, heavy cars going in this direction. Then we have the Honda Civic, the Porsche 914-2, and the Lotus Europa.
These are small cars with smaller, more efficient engines. These are fast cars up here, and these are slow cars down here. And so it's pretty easy to see what's going on with each of these in terms of clustering the variables: with hierarchical clustering, we clustered cases; now we're looking at clusters of variables.
And we see that it might work to talk about big versus small, and slow versus fast, as the important dimensions in our data, as a way of getting insight into what's happening and directing our subsequent analyses. Let's finish our very short introduction to modeling data in R with a brief discussion of regression, probably one of the most common and powerful methods for analyzing data. I like to think of it as the analytical version of E Pluribus Unum: out of many, one.
Or in the data science sense, out of many variables, one variable. Or you want to put it one more way, out of many scores, one score. The idea with regression is that you use many different variables simultaneously. to predict scores on one particular outcome variable.
And there's so much going on here, I'd like to think that there's something for everyone. There are many versions and many adaptations of regression that make it flexible and powerful for almost anything you're trying to do. We'll take a look at some of these in R, so let's open up this script.
And let's see how you can adapt regression to a number of different tasks and use different versions of it. When we come to our script, we're going to scroll down a little bit and install some packages; we're going to be using several in this one. I'll load those, as well as the datasets package, because we're going to use a data set from it called USJudgeRatings. Let's get some information on it. It is lawyers' ratings of state judges in the US Superior Court.
And let's take a look at the first few cases with head. We'll zoom in on that. What we have here are six judges, listed by name, with scores on a number of different variables like diligence and demeanor, and it finishes with whether they're worthy of retention; that's RTEN, for retention. We'll scroll back out. And what we might want to do is use all these different judgments to predict whether lawyers think these judges should be retained on the bench.
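As a sketch, that opening looks something like this:

    library(datasets)     # built-in example data sets
    ?USJudgeRatings       # help page: lawyers' ratings of US Superior Court judges
    head(USJudgeRatings)  # the first six judges and their twelve variables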
Now, we're going to use a couple of shortcuts that can make working with regression kind of nice. First, we're going to take our data set and feed it into an object called data, so that shows up now in our environment at the top right. And then we're going to define variable groups. You don't have to do this, but it makes the code really, really easy to use.
Plus, you'll find that if you do this, you can reuse the same code without having to redo it every time you run an analysis. So what we're going to do is create an object called x. It's actually going to be a matrix, and it's going to consist of all of our predictor variables simultaneously. The way I'm going to do this is with as.matrix.
And then I'm going to say: read data, which is what we defined right here, and read all of the columns except number 12. That's the one called RTEN; that's our outcome. So the minus means don't include that one, but do all the others. I do that, and now I have an object called x. Then for the second one, I say: go to data, and the blank before the comma means use all of the rows, but read only the 12th column.
That's the one that has RTEN, our outcome. So, following standard conventions, x is all of our predictor variables, and y is our single outcome variable.
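On the page, those two definitions look roughly like this, assuming the data object defined just above:

    data <- USJudgeRatings       # the data set under a shorter name

    x <- as.matrix(data[-12])    # every column except 12 (RTEN), as a matrix
    y <- data[, 12]              # column 12 only: RTEN, the retention outcome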
Now, the easiest version of regression is called simultaneous entry: you use all of the x variables at once, throwing them into one big equation to try to predict your single outcome. In R, we use lm, which stands for linear model. And what we have here is y, our outcome variable.
The tilde means "is predicted by" or "as a function of," and x is all of our variables together being used as predictors. So this is the simplest possible version, and we'll save it into an object called reg1, for regression one.
Now, if you want to be a little more explicit, you can name the individual variables: you can say that RTEN, retention, is a function of, or is predicted by, all of these other variables, and that they come from the data set USJudgeRatings. That way, I don't have to put data and a dollar sign before each of them, and it gives me the exact same thing.
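Here's a sketch of both forms; the spelled-out variable names are the standard USJudgeRatings columns, which I'm assuming is what the script lists:

    reg1 <- lm(y ~ x)  # simultaneous entry: y predicted by all of x at once

    # The explicit equivalent; same fit, so there's no need to run it:
    # lm(RTEN ~ CONT + INTG + DMNR + DILG + CFMG + DECI +
    #           PREP + FAMI + ORAL + WRIT + PHYS,
    #    data = USJudgeRatings)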
So I don't need to run that one explicitly. If you want to see the results, we just call the object that we created from the linear model, and I'm going to zoom in on that. What we have are the coefficients: the intercept starts at about minus two, and then each predictor gets its own slope, 0.01, 0.36, and so on and so forth.
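Both of those calls, the plain print and the summary that comes up next, are one-liners; roughly:

    reg1           # prints the coefficients only
    summary(reg1)  # adds standard errors, t tests, p values, and R-squared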
You'll see, by the way, that it has changed the name of each of the variables to add an x prefix, because they're coming from the matrix x now; that's fine. We can do inferential tests on these individual coefficients by asking for a summary. We click on that and zoom in, and now you can see the value that we had previously, but with a standard error, and then the t test.
And then over here is the probability value, and the asterisks indicate values that are below the standard probability cutoff of .05. Now, we expect the intercept to be below that. But you see, for instance, that this one, integrity, has a lot to do with people's judgments of whether a person should be retained. And this one, physical: really, you know, are they sick?
And we have some others that are kind of on their way. This is a nice one overall. And if you come down here, you can see the multiple R-squared; it's super high.
And what it means is that these variables collectively predict very, very well whether the lawyers felt the judge should be retained. Let's go back now to our script. You can get some more summary data here if you want: we can get the analysis of variance table, the ANOVA table, and if we click on that and zoom in, you can see that we have our x and our residuals. We come back out and do the coefficients; here are the regression coefficients we saw previously, just a different way of getting at the same information. We can also get confidence intervals; we'll zoom in on that.
And now we have a 95% confidence interval, so the 2.5% on the low end and the 97.5% on the top end, in terms of what each of the coefficients could be. We can get the residuals on a case-by-case basis; let's do this one. When we zoom in on that, it's a little hard to read in and of itself, because they're just numbers. An easier way to deal with that is to get a histogram of the residuals from the model. To do that, we just run this command.
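Collected together, those follow-ups, ending with the histogram command we just mentioned, look something like this on the same reg1 object:

    anova(reg1)            # analysis of variance table for the model
    coef(reg1)             # the regression coefficients again
    confint(reg1)          # 95% confidence intervals for each coefficient
    residuals(reg1)        # case-by-case residuals
    hist(residuals(reg1))  # histogram of the residuals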
And then I'll zoom in on this. You can see that it's a little bit skewed; mostly it's around zero, and we've got one case way up on the high end, but mostly these are pretty good predictions. We'll come back out. Now I want to show you something a little more complicated.
We're going to do different kinds of regression, and I'm going to use two additional libraries for this. One is called lars, which stands for least angle regression, and the other is caret, which stands for classification and regression training. We'll do that by loading those. Then we're going to do a conventional stepwise regression, which a lot of people say has problems, but I'm just going to show it really fast. There's our stepwise regression. Then we're going to do something from lars called stagewise; it's similar to stepwise, but it has better generalizability.
We run that through, and we can also do least angle regression. And then one of my favorites is the lasso; that's the Least Absolute Shrinkage and Selection Operator. Now, I'm running through just the absolute bare-minimum versions of these; there's a lot more that we would want to do to explore them. But what I'm going to do is compare the predictive ability of each of them.
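As a bare-bones sketch, those four fits use lars's type argument; caret is loaded alongside, per the script, though it isn't called directly here:

    library(lars)   # least angle regression and relatives
    library(caret)  # classification and regression training

    stepwise <- lars(x, y, type = "stepwise")           # conventional stepwise
    forward  <- lars(x, y, type = "forward.stagewise")  # stagewise
    lar      <- lars(x, y, type = "lar")                # least angle regression
    lasso    <- lars(x, y, type = "lasso")              # the lasso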
And I'm going to feed the results into an object called r2comp, for a comparison of the R-squared values. Here I specify where that value lives in each of the fitted models; I have to give a little index number. Then we round off the values and give them names: the first one is stepwise, then forward, then lar, then lasso. Then we can see the values. And what this shows us here at the bottom is that all of them were able to predict the outcome super well.
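Something along these lines; note that the step index, 6 here, is just my illustrative guess at the "little index number" mentioned above, and you'd pick whichever step of each model's path you want to compare:

    # R-squared for each method, side by side
    r2comp <- round(c(stepwise = stepwise$R2[6],   # index 6 is illustrative
                      forward  = forward$R2[6],
                      lar      = lar$R2[6],
                      lasso    = lasso$R2[6]), 2)
    r2comp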
But we knew that, because when we did just the standard simultaneous entry, there was amazingly high predictive ability within this data set. You will find situations in which each of these varies a little bit, and maybe sometimes they vary a lot. But the point here is that there are many different ways of doing regression, and R makes them available for whatever you want to do.
So explore your possibilities and see what seems to fit. In other courses, we'll talk much more about what each of these means, how they can be applied, and how they can be interpreted. But for right now, I simply want you to know that these exist and that they can be done, at least in principle, in a very simple way in R. And so that brings us to the end of R: An Introduction. I want to make a brief conclusion, primarily to give you some next steps, other things that you can do as you learn to work more with R.
Now, we have a lot of resources available here. Number one, we have additional courses on R at datalab.cc, and I encourage you to explore each of them.
If you like R, you might also like working with Python, another very popular language for data science, which has the advantage of also being a general-purpose programming language. Almost all of the things that we do in R we can also do in Python, and it's nice to do a compare-and-contrast between the two with the courses we have at datalab.cc.
I'd also recommend you spend some time simply on the concepts and practice of data visualization. R has fabulous packages for data visualization, but understanding what you're trying to get and designing quality visualizations is sort of a separate issue. So I encourage you to get the design training from our other courses on visualization.
And then finally, a major topic is machine learning, or methods for processing large amounts of data and getting predictions from one set of data that can be applied usefully to others. We cover that for both R and Python here at datalab.cc; take a look at all of those courses and see how well you think you can use them in your own work.
Now, another thing you can do is look at the annual useR! conference; that's "use" with a capital R and an exclamation point. There are also local R user groups, or RUGs. And I have to say, unfortunately, there is not yet an official R day.
But consider September 19: it's International Talk Like a Pirate Day, and we like to think that pirates say "arrr." So that could be our unofficial day for celebrating the statistical programming language R. In any case, I'd like to thank you for joining me for this course, and I wish you happy computing.