Understanding R Programming and Statistics

Okay, it is two o'clock. So before we do anything here, one thing I forgot to mention last class is that we have an undergraduate learning assistant in this course, and she has set up a Discord server for you guys.

I'm not going to be using the Discord server, but it's a resource that you can make use of if you want. You can use it to talk with other students, work together on different things, and so on. So feel free to make use of that. The other thing I want to mention is that if you go to the Modules section of our Canvas page here, you can see I've added a course project.

So this is your first course project, and it's just called Prison Plot. It's pretty simple. Basically, what you're going to be doing is recreating the plot that you see here using R code. Now, I will say this assignment kind of assumes that you've read those three chapters in the book I wrote.

So if you haven't read those, this project is going to be pretty hard. As we kind of move through the lectures, we'll cover a lot of the topics that are necessary to produce this plot. But if you've gone ahead and you've read those three chapters, like kind of I had asked you to, you could conceivably do this tonight, no problem. So that's there. It's due this September 27th, basically by midnight.

So that's a Friday night. And yeah, so you have basically all month to do it. And as I say, it'll be easiest if you have read through those three chapters of the book I wrote, because it's really meant to just get you working with R, get you creating things and doing some fun stuff.

But with that we will kind of get rid of this. Don't need that. And we're going to pick up from where we left off last time.

So we had just finished talking about vectors in R and how you can create different vectors. You could have vectors of numbers, you could have vectors of character strings, or you could have logical vectors, that is to say vectors that consist of TRUE and FALSE and so on.

Now, going along with all that discussion about vectors is an important concept in programming called indexing. So let's say you have a vector of just some random numbers. So we'll create an object, and we'll call it randnums. Okay, and we'll use our combine function, c(), here, and we'll just give it, I don't know, a bunch of numbers.

Doesn't matter what they are, it's not really that important for this example. Did I throw in an extra comma there? I did, yeah.

You don't have to copy these exact numbers, just choose your own. It really does not matter. So we have this vector of just random numbers here.

And actually, yeah, so like let's suppose we wanted to see the number that's in the fifth position of this vector. What we could do is we could just type the object name, so in this case randnums. And then we put some square brackets.

And these square brackets, these are how you index things. So if we type a 5, this is going to show us the number that's in the 5th position. So 1, 2, 3, 4, 5. So it should show us 243 here when we run this. And you can see that is precisely what it does.

If you wanted to, for instance, look at the first three numbers, then what you could do is, you know, you could just basically give it a sequence. So you put your square brackets. You could just say 1 to 3 like so and it shows us the first three values. Yes, this is recording.
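As a rough sketch of the indexing just described: the actual numbers typed in class aren't in the transcript, so the values below are made up. Only the length (11 items) and the fifth value (243) are taken from the lecture.

```r
# Hypothetical values; only the length (11) and the fifth value (243)
# come from the lecture.
randnums <- c(12, 87, 6, 310, 243, 68, 154, 9, 401, 77, 36)

# Square brackets index the vector: position 5 only
fifth <- randnums[5]

# A sequence inside the brackets pulls out several positions at once
first_three <- randnums[1:3]

print(fifth)
print(first_three)
```

Note that R counts positions starting from 1, which is why counting "1, 2, 3, 4, 5" along the vector lands on the same value the brackets return.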

What else? If you wanted to take the mean of those three numbers, you could do that very easily. We know we have a function for the mean, so we could just take the mean, insert randnums, and then index that and just say 1 to 3. And now we have the mean of those three numbers. You know, it's pretty simple.

If we wanted to see maybe the last number in this vector of random numbers that we have, what we could do is make use of the length function. So hopefully you all remember what the length function does. It just counts how many elements are in the vector.

So if we type length and insert randnums like so, you can see it's telling us there's 11 items in this vector. So if we wanted to look at the last item, one thing we could do is say randnums and just insert 11 in the square brackets. Alternatively, we could literally just insert the length function there, applied to randnums. And this second method is preferable to just typing the number 11, because it means that if you end up changing your original vector for some reason, like maybe you've made some kind of mistake, this code will still work, right? So that's the advantage to that.
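Here is a small sketch of those steps, again with made-up values for randnums (only the length of 11 comes from the lecture).

```r
# Hypothetical values; the lecture only tells us there are 11 of them.
randnums <- c(12, 87, 6, 310, 243, 68, 154, 9, 401, 77, 36)

# Mean of the first three values: index first, then pass to mean()
mean_first_three <- mean(randnums[1:3])

# length() counts the elements in the vector
n_items <- length(randnums)

# Two ways to get the last element
last_hard_coded <- randnums[11]                # breaks if the vector changes
last_general    <- randnums[length(randnums)]  # adapts to any length

print(mean_first_three)
print(n_items)
print(last_general)
```

Both approaches return the same value here; the difference only shows up once the vector is edited and no longer has exactly 11 elements.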

But what if you wanted, let's say, what if you wanted to see like the last five numbers of the vector? You know, how would you do something like that? Well, we know there's 11 numbers in this vector, right?

So one option would be to do something like this. You could just say randnums, put your square brackets, and say 7 to 11, right?

And that'll give you the last five numbers. One, two, three, four, five, yeah. But, you know, a method like this, it only really works if you have kind of a few values.

You know, if you have a giant vector with just an insane amount of values, you're going to want a better strategy than actually hard coding the numbers into this. So if you think about it, we can actually use this length function to solve this problem of grabbing the last however many numbers we want. And we can use the length function to do this in a way that will generalize to a vector of any size. Whereas when we write it like this, we're basically forced to always have 11 items, right?

So if we wanted the last five values, what we could do is, first of all, store how big our vector is. We'll call this n, so n will be equal to the length of randnums. You can see n here is just equal to 11. Now, if you think about this, we can then just create a sequence. We want the last five values, so that means we want the positions from n minus 4 all the way to n. So if we type n minus 4, and we put a colon, all the way to n, this will give us a sequence of numbers.

And we have to make sure to put parentheses here so that it's n minus 4 that gets evaluated first, because the colon takes precedence over the minus sign. So notice that gives us 7, 8, 9, 10, 11. So that's the last five positions. So what we could do is just insert this into our square brackets. So we could say randnums and then just go like that.

And now we have the last five values. And what's cool about this code that we've written, as opposed to the hard-coded 7 to 11 version, is that this code isn't dependent on how many numbers are in the vector, right? So if, for instance, you realize one of these numbers was a mistake and you had to delete it, this code will still work, whereas the hard-coded version is going to give you the wrong answer if you end up changing your vector for some reason. So maybe we'll label these: bad strategy, good strategy. All right.
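To make the comparison concrete, here is a sketch of both strategies side by side. The values are hypothetical, and the step of deleting the first element is just one way to illustrate "changing your vector".

```r
# Hypothetical values; the lecture only tells us there are 11 of them.
randnums <- c(12, 87, 6, 310, 243, 68, 154, 9, 401, 77, 36)
n <- length(randnums)

# Bad strategy: positions are hard coded
last_five_bad <- randnums[7:11]

# Good strategy: positions are computed from the length
last_five_good <- randnums[(n - 4):n]

# Why the parentheses matter: n - 4:n is read as n - (4:n)
no_parens <- n - 4:n
print(no_parens)   # counts down from 7 to 0, not the positions we want

# Now suppose one value turns out to be a mistake and is removed
shorter <- randnums[-1]
m <- length(shorter)

hard_coded_after_edit <- shorter[7:11]       # asks for a position that no longer exists
general_after_edit    <- shorter[(m - 4):m]  # still the last five values

print(hard_coded_after_edit)
print(general_after_edit)
```

Asking for a position past the end of a vector does not raise an error in R; it quietly returns NA, which is part of why the hard-coded version is risky.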

What else? Okay, so one thing that, or actually what I should say, like another useful feature of R concerns its help documentation. So let's say there's a function and you don't understand how that function works. What you can do is you can access R's kind of built-in documentation about that function.

So let's take a simple example here. Whoops, what I'm going to do is just clean this up a bit and start from a blank slate. So let's say we have an object nums, and this is just going to be equal to the numbers 1 through 5. Okay, very simple. And what we'll do is take the mean of nums here, like that. So you can see we get a value of 3 there. Now, if you wanted to learn more about this function mean, what you can do is just put a question mark and type mean, like so. And when you do that, conveniently in RStudio, you'll get a little window popping up here that will give you information about the function.

So up here in the top corner, you can see there's the function's name, and you can see that this function is part of base R. We're going to be talking about packages later on, and not all functions are going to be part of base R. So that's just something to be aware of.

So this is called the arithmetic mean. It's a generic function for calculating the arithmetic mean. Then we can see some information here. Critically, this line right here is kind of important.

So remember I said different functions have multiple arguments. And here you can see the various arguments that the mean function has. So basically what the documentation is showing us here with this little x is that the function takes an R object.

So that's x. And that's basically what this nums is, right? So this is the R object.

We're basically inserting that into the function. But the mean function also has a variety of other arguments. So it has this trim argument, it has this na.rm argument, and so on.
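A short sketch of the same steps in code. The help calls are left as comments because they open a viewer and are meant to be run interactively. Listing the arguments with formals() on mean.default (the method R uses for ordinary numeric vectors) is an extra not shown in the lecture, included only as another way to see the same argument list.

```r
nums <- 1:5
avg <- mean(nums)
print(avg)

# In an interactive session, either of these opens the help page:
#   ?mean
#   help("mean")

# Not from the lecture: the argument names can also be listed directly.
arg_names <- names(formals(mean.default))
print(arg_names)
```

The first argument, x, is the object being averaged; trim and na.rm are the optional arguments the help page describes.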

And as I mentioned earlier, you know, arguments are basically what allow you to customize the behavior of a function. So imagine, let's suppose, for instance, we're missing a data point in our vector. Okay, so we have this vector nums.

Let's imagine one of those data points is missing. So in R, the way missing values get notated is just using NA. So just type NA like, whoops, didn't mean to do that. If you just type NA like that, it stands for not available.

And now if we kind of rerun this, so if we run that, let's go like this, clean that up. All right. So if we rerun that, you can see there's no errors there. And now if we calculate our mean, it just shows as NA.

So this should make sense, right? If you actually think about it, this behavior that we're seeing from R should make sense. How are you supposed to calculate the mean of these things if you genuinely don't know what one of the numbers is? You can't.

However, if we look at our documentation here, we have this argument na.rm, and it currently defaults to FALSE. The documentation says na.rm is a logical value evaluating to TRUE or FALSE, indicating whether NA values should be stripped before the computation proceeds. In other words, what this is telling us is that we can have the function ignore this NA value. So the way we do that is we just add the argument to this.

So we say comma, na.rm, and we set that equal to TRUE. Oops, TRUE, like so. So we're saying remove the NA value.

That's what na.rm is doing there. And if you run that, you then get the mean of just the values that aren't missing. So if you don't tell R to remove the missing values, then it just doesn't know what to do with them.
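As a minimal sketch of this behavior: the transcript doesn't record which position was replaced with NA, so the vector below simply assumes it was the third value.

```r
# Assumption: the third value is the missing one.
nums <- c(1, 2, NA, 4, 5)

# With the default na.rm = FALSE, one unknown value makes the mean unknown
mean_default <- mean(nums)
print(mean_default)

# With na.rm = TRUE, NA values are dropped before the mean is computed
mean_no_na <- mean(nums, na.rm = TRUE)
print(mean_no_na)
```

With this particular vector the second call averages 1, 2, 4, and 5.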

All right, so that is the help documentation. And admittedly, sometimes the help documentation can be quite difficult to make sense of, particularly for novices with programming. But it should always really be your first go-to source for information about something.

If the help documentation doesn't make sense, then a good strategy is to just do an internet search. An internet search will usually clear things up. Now, so that's vectors, that's missing values.

Oh, sure. Do you mean this here, or do you mean the actual argument itself? Well, NA actually means not available, so you can think of it as a placeholder for a missing value. Yes, exactly.

Exactly. And not all functions are going to have this na.rm. That's something to keep in mind. A lot of the base R functions will have that, but not all of them are going to. So you might have to use other methods to remove missing values if that's the case.

But most of the time, for everything we're going to be doing in this course, most of what we'll be using will have this kind of functionality built into it. But missing values are not something we're going to be dealing with all that much in this class.

Alright, so, oh and just as a reminder, feel free to shout out questions because I probably won't see your hand raised and stuff, and try and talk loudly because my hearing is not too great. All right, so another very important type of object that we have to kind of talk about here, since we're on the topic of like R basics and stuff, is something called a data frame, or a data frame, I guess, if you're American. So a data frame is kind of most easily understood as just a spreadsheet, basically, in R.

So for instance, let's suppose you've run some kind of study and you've collected some data. Just as an example here. So we'll get some space here. So what you might have in that case is a variable called subjects.

And let's say this variable subjects just represents unique IDs for the subjects. Let's say you have six subjects like that. So we just have the numbers one through six.

That's all that's there here. Okay. And let's also suppose that you have another variable called group. Let's say group like so.

And basically this variable just tells us what group each of those six subjects is assigned to. So maybe one subject is assigned to the experimental group, another is assigned to the control group, another is assigned to the experimental group, another to the control group, and let's do that.

So I'll just point out you can always move to a new line if there's a comma, just in case you guys were unsure about that. So we have experimental, control, experimental, control, and we'll say experimental and control. So that's all six subjects.

So subject one was in the experimental group, subject two was in the control, subject three was in the experimental, and so on and so forth. All right, and let's also suppose that you have another variable called score, and this variable just tells you what each participant scored in your study, whatever it is you happen to be measuring. So we'll say score, like so. Okay, so these can really be any numbers, it doesn't really matter. Pick some values here, one, two, three, four, five, and toss in a missing value. One, two, three, four, five, six, okay. So as I mentioned earlier, the last value here, NA, just stands for not available. So for whatever reason, subject number six just has no data. I don't know, maybe their computer crashed. Maybe they showed up absent.

Who knows? All right. Now a data frame can basically take these three variables that we have and group them into this nice spreadsheet structure.

So I'll show you how that's done. What we can do is give our data frame a name here. So we'll just call it expdf. So df is just going to be short for data frame.

And to create our data frame, we're going to use the function data.frame, like so. And then we can just insert these three variables that we created here. So let's go ahead and do that. So we'll say, whoops, subjects, group, and score, like so.
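Here is a sketch of the three vectors and the data.frame call. Only the first two scores (45 and 234), the alternating group assignment, and the trailing NA are taken from the lecture. The group labels and the remaining three scores are placeholders, chosen so the non-missing scores average to the 165 reported later.

```r
# Subject IDs 1 through 6
subjects <- 1:6

# Group assignment (label spellings are an assumption)
group <- c("experimental", "control", "experimental",
           "control", "experimental", "control")

# Scores: 45 and 234 are from the lecture, the next three are placeholders,
# and subject 6 has no data
score <- c(45, 234, 112, 198, 236, NA)

# Combine the three vectors into one spreadsheet-like object
expdf <- data.frame(subjects, group, score)
print(expdf)
```

Each vector becomes a column, and the column names are taken from the object names passed in.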

And now if we run expdf, what you can see is we have everything laid out very, very nicely there. So we can see subject one was in the experimental group, they scored 45, subject two was in the control, they scored 234, and so on and so forth. Now there is one critical aspect of data frames that I need to draw your attention to, and this is why this NA value is so important. It's because every single column of your data frame needs to contain the same number of values as every other column. And this often trips students up, because a lot of conventional spreadsheet software doesn't care if you have unequal column sizes, but R does care. Okay, so for instance, if we didn't have this NA value here and we just skipped it, like if we just delete this and we try and run all of this, what you can see is we get a nice big error thrown in our face. It's telling us the arguments imply a differing number of rows, and that's because subjects here has six elements, group has six elements, but score only has five elements.

So we need to tell R, whoops, that there's a missing value there.

That's pretty important. Then we can run that, we can run that, and we're back to square one there. So that's really important.
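The unequal-length problem can be reproduced with a sketch like the one below. The values are placeholders, and tryCatch() is not something covered in the lecture; it is used here only so the script captures the error message instead of stopping.

```r
subjects <- 1:6
group <- c("experimental", "control", "experimental",
           "control", "experimental", "control")

# Placeholder scores with the NA left out, so only five values
score_short <- c(45, 234, 112, 198, 236)

# data.frame() refuses columns of different lengths;
# tryCatch() turns the error into a message we can print
result <- tryCatch(
  data.frame(subjects, group, score_short),
  error = function(e) conditionMessage(e)
)
print(result)

# Putting the NA back gives every column six values again
score <- c(score_short, NA)
expdf <- data.frame(subjects, group, score)
```

The NA acts as an explicit placeholder, so subject 6 still gets a row even though there is no score for them.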

All of the columns need to have the same number of items in them. But now that we have our data frame here, we can do some basic things with it. So for instance, if we want to look at just a single column, what we can do is type the name of our data frame here.

And we can put a dollar sign. And actually RStudio is quite clever. You'll notice it lists the three columns in our data frame.

So instead of typing this out, you could just pick from that list. Let's say we want to look at the score column. You just do that, run it, and you can see there's all of the values in the score column. Moreover, if we wanted to take the mean of all of the values in this score column, like if we wanted the mean score, we could just type, whoops, mean, and then insert the column. So we could say expdf, dollar sign, score. If you run that, you can see we get an NA value, because as we learned a moment ago, you have to tell R to strip the NA values when it's calculating the mean. So we're just going to say na.rm, set that to TRUE. Whoops. There we go. This should work now.
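A sketch of the dollar-sign access and the mean of a column. As before, only 45, 234, the NA, and the resulting mean of 165 come from the lecture; the other three scores are placeholders chosen to reproduce that mean.

```r
# Placeholder data frame (see the assumptions stated above)
expdf <- data.frame(
  subjects = 1:6,
  group = c("experimental", "control", "experimental",
            "control", "experimental", "control"),
  score = c(45, 234, 112, 198, 236, NA)
)

# The dollar sign pulls one column out as a vector
scores <- expdf$score
print(scores)

# The NA makes the plain mean NA
mean_raw <- mean(expdf$score)

# Dropping the NA first gives the mean of the five observed scores
mean_clean <- mean(expdf$score, na.rm = TRUE)
print(mean_clean)
```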

And look at that. We get 165. What else here? So once you've created a data frame, you can easily add other columns to it. So let's just rerun that so we can look at it. So right now we have three columns, right?

Subjects, group, and score. But let's say we wanted to add a fourth column to this that multiplies all of the values in the score column by 100. Okay, that's really easy to do.

All we do is type the name of our data frame. Then we type a dollar sign and then the name of this new column. So we'll call this new column newcol, because why not? It's nice and straightforward. Then what we can do is literally just grab the score column and multiply it by 100. So we say expdf, dollar sign, score, multiply that by 100, run it. No errors, it's a good thing. But now if we take a look at our data frame (here, I'll resize this for you guys, there we go), what you can see is we now have a fourth column that is just all of the values in the score column multiplied by 100. So 45 times 100 is, of course, 4,500.

234 times 100 is 23,400, and so on and so forth. And of course, missing times 100 is still missing, right? Okay.
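The new-column step might look like the sketch below. The spelling newcol is an assumption about how the column name was typed in class, and all scores other than 45 and 234 are placeholders.

```r
# Placeholder data frame (only 45, 234, and the NA are from the lecture)
expdf <- data.frame(
  subjects = 1:6,
  group = c("experimental", "control", "experimental",
            "control", "experimental", "control"),
  score = c(45, 234, 112, 198, 236, NA)
)

# Assigning to a column name that doesn't exist yet creates the column.
# Multiplication is applied to every element of score at once.
expdf$newcol <- expdf$score * 100

print(expdf)
```

The first two entries of the new column come out as 4500 and 23400, and the missing score stays missing after multiplication.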

But what if, let's suppose you want to only look at certain rows. Like let's say you have this data frame and you only want to look at the first three rows. How would you do something like that? Well, type the name of your data frame, and we're going to use those square brackets that we were using for indexing with the vectors.

And basically we're going to index the data frame. So if we say 1 to 3, and then put a comma, run that, you can see that shows us just the first three rows there. Alternatively, you could specify individual rows.

So if you wanted to look at maybe only rows 1, 3, and 5, let's say, what you can do is say expdf, and you could use your combine function and just say 1, 3, 5, put a comma there, run it, and now we're just looking at rows 1, 3, and 5. If you wanted to look at a subset of columns as opposed to rows, you would do basically the same thing. So let's say we want to look at columns 2 and 4. What we could do is say expdf, put our square brackets, put a comma, and then on the other side of that comma, we could use our combine function and just say we want column 2 and column 4. And you can see now we're looking at the group column and the new column.

Oh, yeah. Ah, then you would use either the subset function or the filter function, which we haven't actually covered yet. Actually, there's multiple ways of doing that. You can use those functions. There are also ways of writing basically inequalities inside your indexing.

We're not going to cover that right now. But actually, if you read, I think, those three chapters that I wrote, at some point in there we're doing stuff like that. So I would highly encourage you to read that. And as we go through the course, we will do some of that. For now, I'm just trying to give you guys the ABCs, essentially.

Sure, you'll see right away. You'll see right away why it matters.

So right now, with this syntax right here, we've selected column 2 and column 4. But suppose we wanted to select rows 1, 3, and 5 and columns 2 and 4. Basically what we're going to do is type the exact same thing, except on the left side of the comma we'll say we want to look at rows 1, 3, and 5, and on the right side of the comma we're going to say we want to look at columns 2 and 4. When you do that, you can see we get rows 1, 3, and 5, and we get columns 2 and 4. So hopefully you can see what this comma right here is actually doing. Basically, everything on the left side of the comma applies to rows, and everything on the right side of the comma applies to columns. So what we could do is write a little note to ourselves.

So here we'll say expdf, and we'll say rows, columns, like that. So the numbers on the left side of the comma correspond to rows, and the numbers on the right side of the comma correspond to columns. And this is a pretty common convention. You'll see it a lot in other languages as well. The easy way to remember this is to think of rows and columns as essentially x and y coordinates. That's how I tend to visualize this.
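The row and column indexing described above can be sketched as follows. The data frame is rebuilt with placeholder values so the block runs on its own, and newcol is an assumed spelling of the fourth column's name.

```r
# Placeholder data frame with the fourth column already added
expdf <- data.frame(
  subjects = 1:6,
  group = c("experimental", "control", "experimental",
            "control", "experimental", "control"),
  score = c(45, 234, 112, 198, 236, NA)
)
expdf$newcol <- expdf$score * 100

# expdf[rows, columns]: left of the comma is rows, right is columns

first_three_rows <- expdf[1:3, ]             # rows 1 to 3, all columns
odd_rows         <- expdf[c(1, 3, 5), ]      # rows 1, 3, 5, all columns
two_cols         <- expdf[, c(2, 4)]         # all rows, columns 2 and 4
both             <- expdf[c(1, 3, 5), c(2, 4)]  # rows 1, 3, 5 and columns 2, 4

print(both)
```

Leaving one side of the comma empty means "everything" along that dimension.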

You know rows are like the x values because they're horizontal and columns are the y values because they're vertical. Anyway that is your little intro to coding done here. So right now what I've done is I've given you guys a very bare-bones understanding of R and it's probably not obvious how a lot of this is really going to apply.

But believe me when I say that what we've covered here are the fundamental building blocks of using R and conducting statistics with it. Now that we've covered this material, I would strongly recommend making sure you've read through those three chapters in the book that I wrote. That'll give you a much more nuanced understanding of everything that's happening here.

And particularly chapter two and chapter three get you to do plotting and some more advanced stuff that is going to be very useful not only for your homework, but also for your first course project. All right.

So with that out of the way, what we're going to do now is we're going to write some notes. So I'm going to get kind of a whiteboard or a blackboard, I guess. Do you guys, I have an option here.

We'll see what you guys prefer. I suspect I know what it'll be. But we could do...

A white background or a dark background? Sorry? Okay, well, we can always, yeah, we can switch.

It's not a big deal. All right, so before we do any math, there's a few basic statistical concepts that we have to kind of cover. And most of these will be concepts you're actually already familiar with, but they're definitely worth repeating in the context of this class.

So maybe what we'll do is we'll just say basic terms. Well, that's not very big, is it? Let's make that bigger.

View, that's too big. All right. Hopefully, that is large enough.

So one profoundly critical concept is that of a population. So a population is just the entire collection of events or observations, if you prefer, that you're interested in as a researcher. So we could say population: the entire collection of events or observations that you're interested in. So for instance, maybe you're a researcher who's interested in the phone screen time of students who are enrolled at the U of A.

In that case, your population of interest is just literally every student at the U of A. That's one example of a population, but a population is not always so simple. So the number of heads and tails that you obtain from flipping a coin for the rest of eternity would also be a statistical population. Though in this case, the population of an infinite number of coin flips is a lot more conceptual, whereas the population of U of A students is much more concrete, it's finite. Now related to the concept of a population is what's known as a sample.

And a sample is just a proper subset of a population. So we'll say sample: a proper, whoops, proper subset of a population. So a sample could be just a single element of the population, or it could be all of the elements of the population minus one, or it could be, of course, something in between. So as I'm sure you all appreciate, samples are pretty critical for research because you usually can't realistically study every element of a population. So you need a sample to make inferences about the population that you're interested in. So the easy way to think about the distinction between samples and populations is to imagine you're making some soup. Okay, so you can think of the whole pot of soup as your population.

And, you know, sometimes you're going to take a small sip of that soup, you know, to see if it needs more salt or whatever. That small sip is analogous to a sample in a study. So the assumption is when you take that sip, it's telling you something about the overall flavor of the pot. You know, that is why samples are important.

You know. But what we're really interested in is the population, right? The sample, though, is just presumably telling us something about the population.

So since we have some text on screen, this is what it looks like as white. Do you guys prefer it this way or the other way? Okay, are you indifferent? If you're indifferent, I'm just going to pick what I prefer.

Okay, I prefer the black one. Okay, so that's populations and samples. Now there's two terms that are intimately related to population and sample, and those are the terms parameter and statistic. So a parameter, or actually, when a measurement refers to the entire population, that is what we call a parameter.

So let's do this. Or you could even say applies to the entire population. So if the population of interest is all of the students at the U of A, and I calculate the mean screen time using the values of every single one of those students, the mean screen time would be what we call a parameter in that case, because it's being calculated using the entire population. If, however, I calculated the mean screen time using only a sample of students at the U of A, then that is what we call a statistic. Okay.

Statistic. A measurement that refers to a sample of a population. So the easy way to tell if you're dealing with a parameter as opposed to a statistic is parameters will be denoted with Greek characters whereas statistics will be denoted with Latin characters.
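To make the parameter versus statistic distinction concrete, here is a small simulation sketch. Everything in it is invented for illustration: the population size, the range of screen times, and the sample size are assumptions, not figures from the lecture.

```r
set.seed(1)  # makes the simulated numbers reproducible

# Hypothetical population: daily phone screen time (hours) for 30,000 students
population <- round(runif(30000, min = 0, max = 12), 1)

# Parameter: computed from every element of the population
population_mean <- mean(population)

# Statistic: computed from a sample drawn from that population
study_sample <- sample(population, size = 100)
sample_mean <- mean(study_sample)

print(population_mean)
print(sample_mean)
```

In real research the population mean is usually unknown, which is exactly why the sample mean is used to estimate it.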

So for instance, let's maybe write that in: denoted by Greek letters. As an example, a famous one that you've probably seen in other classes is mu, right? So mu is the mean of the population. Actually, let's write that, the mean of the population. By contrast, the sample mean is usually denoted by x-bar or y-bar. So for instance, sometimes you'll see x-bar like this. Oops, let's try to write that nicely. This is the sample mean.

Now this brings us to the distinction between descriptive and inferential statistics. So descriptive statistics are statistics that provide us with mathematical descriptions or mathematical summaries of our data so that we can understand it better and talk about it a bit more easily.

So let's do this, we'll say descriptive statistics: statistics that provide mathematical summaries of our data. So for instance, the mean of a data set is a measure of the data set's central tendency. In other words, the mean is a number that is trying to represent what value is most typical of the data set. Similarly, the median of your data is also a measure of central tendency, right? Like the mean, it's a mathematical representation of what value is thought to be most typical. Another example of a descriptive statistic would be the standard deviation, which I'm sure you've all heard of, even if you don't necessarily know what it is. The standard deviation is a measure of how spread out the data is. So if we didn't have descriptive statistics like these, we'd really have no way of describing our data other than to just list out every single individual value, which is obviously highly impractical, especially if you have a giant data set.
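These three descriptive statistics each have a base R function. The sketch below uses a small made-up data set (not from the lecture) chosen so the results come out as whole numbers.

```r
# Hypothetical data: daily screen time in hours for five people
screen_time <- c(2, 4, 4, 5, 10)

avg    <- mean(screen_time)    # central tendency: arithmetic mean
middle <- median(screen_time)  # central tendency: middle value when sorted
spread <- sd(screen_time)      # spread: sample standard deviation

print(avg)     # 5
print(middle)  # 4
print(spread)  # 3
```

Here the one large value (10) pulls the mean above the median, a small reminder that the two measures of central tendency can disagree.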

Now in contrast to descriptive statistics, there are what are known as inferential statistics. So, inferential, oops, statistics. So these are statistics that infer or deduce something about a population based on samples that have been drawn from that population. So we'll say statistics that make an inference about a population based on samples drawn from that population. Or actually, maybe we should say based on a sample.

Could be either, to be honest, but we'll do it that way. So for instance, if I take a sample of students from the U of A and I measure their phone screen time, and then I use the average of that sample to estimate the average screen time of the whole student body, I've just made an inference about the population based on a sample, in the same way that the small sip of soup is telling me something about the overall pot, or that I'm using it to tell me something about the overall pot. Now just because you've collected a sample, that doesn't necessarily mean that you are allowed to use that sample to draw an inference about the population. To do this, your sample has to be representative of the population.

And really the only good way to get a representative sample is to employ what's known as random sampling, or as some people call it, random selection. In essence, this is referring to a method of drawing a sample from a population such that every element of the population has an equal chance of being selected.

So I'll try and flesh that out, but let's maybe define this. Random sampling: a method of drawing a sample such that every element of the population has an equal chance of being selected. And we can even add to this. So you can imagine random sampling as being sort of equivalent to stirring the entire pot of soup before you actually take the small sip. Stirring the soup is going to mean that the sip you take is more representative of the pot as a whole.

By contrast, if you don't stir the pot, that sip may not be a good reflection of the pot as a whole, and consequently the inference you draw from it may be inaccurate. Random sampling is especially important for survey research. Having said that, it's also something of an ideal.
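In R, the sample() function draws a simple random sample in which every element has the same chance of being picked. The sketch below uses an invented list of student IDs; the population size and sample size are assumptions made only for illustration.

```r
set.seed(42)  # makes the random draw reproducible

# Hypothetical sampling frame: one ID per enrolled student
student_ids <- 1:30000

# Simple random sample: each ID is equally likely to be chosen,
# and by default no ID can be chosen twice
random_sample <- sample(student_ids, size = 50)

# For contrast, a non-random selection: just taking the first 50 IDs
first_fifty <- student_ids[1:50]

print(head(random_sample))
print(head(first_fifty))
```

Taking the first 50 IDs is like sipping from the top of an unstirred pot: whatever determines the ordering of the list could also be related to the thing being measured.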

So like in psych research, true random samples are more often than not basically impossible to achieve because of practical and ethical constraints. Even things like the psych department's use of intro psych students as a subject pool, that is technically a non-random sample because, number one, all of the students get to choose which studies they want to participate in. And the reason that's a problem is because of something known as non-response bias.

So certain demographics may be choosing to not participate, or they might not be able to participate in a study for one reason or another. In other words, you're violating this equal likelihood principle that is part of true random sampling. But also, psych research is often interested in generalizing to a broader population. Usually, it's not only interested in psych students.

Usually, the idea is you hope that the results from the psych students generalize to the population at large. The hope is that psych students are basically equivalent to normal people. But let's be real, they're probably not.

The point being, the sample isn't being randomly chosen from the population of interest with something like the intro psych subject pool. Psych students are what's known as a convenience sample. They're not a random sample. Just consider how much females tend to outnumber males in psych classes.

That right there is a pretty big bias in your sample, to say nothing of the other differences that exist between psych students and the general population, like education level, socioeconomic status, etc. And the other thing to quickly mention here is that increasing your sample size is not going to make these kinds of problems go away, right? These problems are intrinsic to the way the participants are being recruited. And this is also why replication and looking at converging lines of evidence are so important. All right. So on Wednesday, we'll pick up from where we left off here. But yeah, if you have any questions, feel free to ask.

And I will talk to you guys on Wednesday.
