Transcript for:
Understanding Variables and Data Types in Research

So in this module we're going to learn about types of variables and types of data. So variables basically describe the measurements that we've made. Now these measurements might be done in terms of a response to an experiment or not in response to an experiment.

Okay, so it might just be some information that you're collecting. But variables come in different types. Broadly speaking, we can separate variables into explanatory variables and response variables. Explanatory variables, also known as independent variables, are the variables that may explain the response.

So in an experiment, the independent or explanatory variables are the ones that are manipulated by the experimenter to determine an effect on a dependent or response variable. So one of the easy examples of this is a drug, right? So an investigator might take a drug and they'll fix the dosage.

So if you're doing an early trial, like a phase one clinical trial, and you're looking for the optimum dose, you might give the participants in the study maybe two or three different doses of the drug. So you're controlling what's happening there. The dose is your explanatory or independent variable.

So you can think of the independent variable as the cause. Then the response or dependent variables are the variables that are the outcome of interest. In an experiment, the dependent or response variables are the ones the experimenter measures rather than controls. In the previous example, our independent or explanatory variable was a drug where the dosages are controlled. Our response or dependent variable might be blood pressure, right? If we're looking at a new antihypertensive drug, then blood pressure is the outcome variable of interest.

Okay, so that's the response or dependent variable. We don't control the blood pressure, but we want to see how blood pressure changes relative to the drug that we gave. So you can think of dependent variables as the effect, right? If you think about cause and effect, explanatory or independent variables are the cause.

And response or dependent variables are the effect. So let's just go through a little exercise here, and you can think about what the correct answers are.

But what are the response and explanatory variables for the following experimental questions? Okay, does sleep deprivation affect math ability? So here, what would be the explanatory variable and what would be the response variable?

So here the explanatory variable is sleep deprivation, because you're controlling that. And then the response variable is math ability. So if you get more or less sleep, does your math ability change?

Let's look at the second one. Do glucose levels go lower if patients are treated with a higher dose of an anti-diabetes drug? I'll give you a second to think about that.

Do glucose levels go lower in response to treatment with a higher dose of a drug? Okay, so here our independent variable is the dosage of the anti-diabetes drug. And our dependent variable is going to be glucose levels, because we're looking to see whether glucose levels go down in response to the drug. Okay, and then the last one: does tumor size change with radiation therapy? What do you think?

So in this case the explanatory variable would be the radiation therapy, because that's what we're giving the patient, and we want to see whether tumor size changes, so the response variable here is tumor size, okay? Now, a variable is a value or a characteristic or just a measurement that can differ from individual to individual. Remember, in the module where we talked about population versus sample, we said you take measurements on a sample from the population to make an inference about the population. So our variable is basically that measurement that we make on each individual in our sample.

Okay, so you can think of variables as being either quantitative, like a numeric type of variable, or qualitative, something that's more descriptive. So for example, if we think about the height of each student in the class in PM 510, well, that's a quantitative variable, right? Because measurements of height might be in inches or centimeters or meters, right?

But it's a quantitative value. You're six foot two inches, you're one meter, 12 centimeters, whatever. Okay.

Hair color of each student in PM 510. Well, this is a qualitative characteristic, right? It's not numeric.

You have blonde hair, you have brunette hair, you have red hair, right? Purple hair, whatever it is, but that's not a numeric characteristic. That's a qualitative, descriptive characteristic. Body mass index or BMI of research subjects, that would be a quantitative variable, because body mass index is calculated as weight divided by height squared.

So it's usually in kilograms per meter squared. And so that's a quantitative value. Average number of workouts per week. Again, this is a quantitative variable, right?

Because you're saying how often do you work out on average, right? So you might say I work out five times a week or three times a week, right? On average during the month, I work out maybe four times a week.

So it's a quantitative value. Level of difficulty of a course. This would be, again, a qualitative variable because you're not quantifying it. Think about when you talk to your friends about different classes. Oh, that class is hard or that class is easy.

It's kind of medium, right? It's not really a quantitative value. You guys don't actually do some kind of survey and rate the class on a scale of 1 to 10. You don't do a quantitative analysis of the class, but you might give it a descriptive label: easy, not easy, etc. Okay?

Or a variable might be a value or characteristic that can differ at different times for a single individual. So think about looking at a single individual but measuring something about them over time, okay? Or measuring a characteristic of something over time, like cell characteristics that you follow with multiple measurements on a single set of cells, or on a single group of individuals. Okay. So for example, hours of sleep per night during a week, right?

So we're saying how many hours do you sleep per night over the course of a week? That would be a quantitative variable that we'd be looking at. Okay. The weather each day in a neighborhood where you lived, right?

And again, here we might say the weather each day over a week or over a month or something, right? And this is a qualitative characteristic because of how we describe the weather. We say it's sunny, it's cloudy, it's raining, right? It's not a numeric assignment that we make, but a characteristic that we give it.

Okay. Traffic count at a freeway exit over a certain amount of time, right? This is another quantitative variable.

Average number of workouts per week, another quantitative variable. Hair color over a year, right? People change their hair color fairly often.

I have a friend who seems to change hair color every other month. Well, maybe not now under the pandemic, but generally speaking. And so, again, this would be a qualitative variable because there's no numeric value assigned to hair color. You're blonde, you're brunette, you're red, purple, whatever. Now, there are two broad categories of data types that make up response and explanatory variables.

Notice the difference here. Now we're talking about types of data.

We're not talking about the variables, but we're talking about the types of data that make up the variables, either the response variable or the explanatory variable. And those are categorical or numeric. In all the previous examples, I've been talking about whether something's quantitative or qualitative. So I've been talking about the different types of variables, right?

So we can describe the qualitative variables as categorical and the quantitative variables as numeric. OK, so you have categorical versus numeric data. Categorical data fall into mutually exclusive categories. And under the heading of categorical, we have two types of categorical data. There's data that we call nominal.

OK. And nominal data are basically categorical without a natural order or ranking. So you think about sex, you think about like animal strains, cell types, tissue types, ethnicity, right?

They're just labels. There's no specific order to them, right? And so those are what we would call nominal variables.

We can code nominal variables numerically by simply assigning arbitrary numbers to them, right? So, for example, we can assign 1 to males and 2 to females, or 1 to females and 2 to males, or 0 to males and 1 to females. This is totally arbitrary, but we're taking our qualitative male-female characterization and making a numeric assignment to it. The numbers are still just labels; they don't turn the variable into a measured quantity.
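To make the point about arbitrary coding concrete, here is a minimal Python sketch. The sample values and the names `coding_a` and `coding_b` are made up for illustration and are not from the lecture. It suggests that counting categories gives the same answer under either coding, while arithmetic on the codes changes with the coding you happened to pick.

```python
from collections import Counter

# Hypothetical sample of a nominal variable (labels only, no order).
sex = ["male", "female", "female", "male", "female"]

# Two equally valid, arbitrary codings of the same labels.
coding_a = {"male": 1, "female": 2}
coding_b = {"male": 0, "female": 1}

codes_a = [coding_a[s] for s in sex]
codes_b = [coding_b[s] for s in sex]

# Counting how many fall in each category does not depend on the coding.
females_a = Counter(codes_a)[coding_a["female"]]
females_b = Counter(codes_b)[coding_b["female"]]

# The mean of the codes does depend on the coding, so it is not a
# property of the data itself.
mean_a = sum(codes_a) / len(codes_a)
mean_b = sum(codes_b) / len(codes_b)

print(females_a, females_b)
print(mean_a, mean_b)
```

The counts agree under both codings, but the two means differ, which is why the codes should be read as labels and not as measurements.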

Okay, so you just say 1 and 2, or 0 and 1, or whatever. The other group of categorical variables are ordinal variables. This is categorical data with a natural order or ranking. So you can think about maturation levels.

You're an infant, you're a toddler, you're an adolescent, right, teenager. There's a progression. They're still just categories.

We don't have a numeric value assigned to infancy or toddler or whatever, but it's just a category. But we do have a progression. OK, you can't start as an adult and then go to infant and then go to teenager.

Right. There's a natural progression here. Education is the most common example.

Right. So you can think about K through 12, and then college and postgraduate, right?

So different levels. You can't jump from first grade, go to college, and then come back to third grade, and then go to kindergarten, right? There's a natural progression here. So the categories can be considered a ranking with ordinal data.

Sometimes we don't know how large the differences are between categories, okay? But we know that there's an order. So for example, think about finishing a race. We have first, second, and third place.

So that's a ranking. The places are categorical, but they're ranked. The person who wins first place may have been a second faster, a minute faster, or even 30 minutes faster, depending on the type of event.

They could have been much faster or only slightly faster than the second place finisher. When we say first, second, and third, we don't know what the interval is between first and second or between second and third. We just know the ranking. So the ranking tells us nothing about the actual quantitative distance between first, second, and third. The best example to use is races. If you think about the Olympics, right, there's gold, silver and bronze.

But you could have won the gold medal by one one-hundredth of a second. And the difference between the silver and the bronze could have been like a second. So there's no defined interval between the rankings. We just know that the ranking exists.
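The race example can be sketched in a few lines of Python. The finishing times and the helper name `placings` are hypothetical, chosen only to illustrate the idea. Two races with very different gaps between finishers end up with exactly the same ordinal ranking.

```python
def placings(times):
    """Return finisher names ordered from fastest to slowest."""
    return [name for name, _ in sorted(times.items(), key=lambda kv: kv[1])]

# Hypothetical finishing times in seconds for two different races.
close_race = {"A": 9.58, "B": 9.59, "C": 10.60}
spread_race = {"A": 120.0, "B": 150.0, "C": 151.0}

order_close = placings(close_race)
order_spread = placings(spread_race)

# The gap between first and second place differs a lot between the races.
gap_close = close_race["B"] - close_race["A"]
gap_spread = spread_race["B"] - spread_race["A"]

print(order_close, order_spread)
print(gap_close, gap_spread)
```

Once only the placings are kept, the two races look identical. The size of the gaps is information that the ordinal variable does not carry.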

Another example is the pain rating scale. So you go to the doctor, you hurt someplace, and they'll show you this little chart and say, well, on a scale of 0 to 10 or 1 to 10 or whatever, how do you rate your pain? And they show you these little smiley and frowny faces, and you're supposed to tell them, well, my pain is like a 5 or it's a 9 or whatever. Again, these numbers are assigned, but a four isn't necessarily twice as much pain as a two, right?

The numbers are arbitrary. It's just a scale to help understand what the pain is like. But it doesn't necessarily mean that there's a quantitative relationship between these values.

Okay. So again, educational status is another example of ordinal data. So here this is just some data showing you unemployment rates and earnings for different levels of education.

And you can see here, right, we start down here with less than a high school diploma going all the way up to a doctoral degree. And you can see the different gradations of education here, right? So again, there's a ranking in terms of the level of education.

And yet another example is geography, right? So if you think about how maps work, you start with your home and then your town and then the city and then the state, etc. So again, this is some data looking at disadvantage. This is a percentile ranking comparing Aboriginal and Torres Strait Islander people to non-Indigenous Australians, but looking at different geographic regions. And you can see again the ordinal scale on geography, as opposed to just random non-ordinal categories. So here again, you can see how that works. And then the last example we're going to look at is what's known as the ECOG, or the Eastern Cooperative Oncology Group, Performance Status.

And this is a scale used to see how functional a patient is, right? Are you able to do tasks without restriction, or do you have some kind of restrictions, like you can't go upstairs but you can still walk? And you can see here, we start off as asymptomatic: fully active, able to carry out all pre-disease activities. Next you're symptomatic but completely ambulatory. And then you move to symptomatic with less than half the day in bed, symptomatic with more than half the day in bed but not totally bed bound, and then totally bed bound. Right.

So, again, you can see this progression going from asymptomatic to bed bound. And then, of course, nobody wants to die, but the last category is dead. OK, so, again, there's an order here that we're following.

But again, it doesn't mean that there is some quantitative difference between being a zero and a one. A two doesn't mean that it's twice as bad as a one. It's just a category that we're using. Okay, when we think about numeric data, we're thinking about whole or real numbers.

Okay, and we break numeric data down into discrete and continuous. Discrete data are countable, ordered numerical data that are whole numbers. Right, so things like the number of mice that we used, or the number of transformed cells that turned up during an experiment, or the number of genes that are expressed in an experiment, or the number of people with red hair.

These are whole numbers. We're just counting things. Compare that to a continuous numeric variable, where the ordered numeric data can theoretically take on any value. So you think about height.

You're only restricted by the tool that you use to measure. So you can do height in inches or feet, or you can go down to centimeters. If you really, really wanted to get nasty, you can pull out a micrometer, right? But it's still some numeric value, and that value varies from individual to individual.

Think about weight, age, cholesterol level, okay? So, to clarify the difference between discrete and continuous: you can think of discrete as distinct, separate items, right?

An apple and an orange, or an orange and an apple, right? Very distinct. Whereas with continuous, there's a whole spectrum of values.

So you can't really take an apple and an orange and make that continuous. It's not like an apple gradually converts over to an orange in going from one to two.

It's either an apple or an orange. There's no mixture here, okay? So one hint for you is that if there's a digit beyond the decimal point, then it's probably a continuous variable. And if it's a whole number, then it's probably a discrete variable. So, showing you examples of discrete data, this is the number of automobiles sold by maker and month.

And you can see here, this is looking at the year 2015 in the United States, with the different months of the year and the total here at the end. And then we have the different manufacturers and car models over here on the left, and then the numbers, right?

So this is discrete data. We're counting how many, you know, Nissan Leafs were sold in the different months. And overall, I guess, in 2015, Nissan sold 13,630 Leafs, right?

So we're just counting the numbers, right? The discrete data. Now, the confusing thing, and I want you guys to all clearly understand this, is that the data categories that we just talked about are not absolute. For example, we can take continuous data and create an ordinal categorical variable, right? So we can take age and convert it into age categories.

Maybe we don't care about exactly how old you are, but we want to separate you out into young, average and older, right? So for example, we can take age and ask how many of you are less than 10 years old, how many of you are 10 to 17 years old, and how many of you are 18 or older. So that's taking a continuous variable and converting it into a categorical variable.

And in this case, it's ordinal, right? Because it's still age, so the under-10 category is still lower than the 10 to 17 category, which is lower than 18 and over, okay? Same thing with income. You might be making $60,000 a year or $54,000 a year or a billion dollars a year, and we may not care about that.

We may want to create low, medium, and high income categories by setting some criteria, right? So that's taking, again, a continuous variable and converting it into a categorical one. A very common example in science, or especially in medicine, is to take levels of something and create disease categories.

So for example, in the diabetes field, you can take the glucose levels and create what are known as glucose tolerance categories. So based on your blood sugar level, you may be categorized as having normal glucose tolerance or impaired glucose tolerance or a diabetic level of glucose tolerance.

So again, we're taking a continuous variable like glucose level, which is measured in milligrams per deciliter or millimolar. Say it's milligrams per deciliter. We might measure your fasting glucose and it's 120.3 milligrams per deciliter. Under the commonly used fasting cutoffs, where below 100 is normal and 126 or above is diabetic, that puts you in the impaired category.
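Turning a continuous measurement into ordered categories is just a matter of comparing it against cutoffs. Here is a minimal Python sketch. The function name `glucose_category` is made up, and the fasting cutoffs of 100 and 126 milligrams per deciliter are an assumption used for illustration; the lecture does not state which cutoffs apply.

```python
def glucose_category(fasting_mg_dl):
    """Map a continuous fasting glucose value to an ordered category.

    The cutoffs below are illustrative assumptions, not values
    given in the lecture.
    """
    if fasting_mg_dl < 100:
        return "normal"
    if fasting_mg_dl < 126:
        return "impaired"
    return "diabetic"

# Hypothetical measurements in milligrams per deciliter.
readings = [92.0, 110.5, 140.2]
categories = [glucose_category(r) for r in readings]
print(categories)
```

The decimal values disappear after the conversion. What is left is a categorical ordinal variable with three levels.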

Okay. So similarly, we can take continuous data and create a nominal categorical variable, right? So we can take something like blood pressure.

And again, that's an example of using a variable in the clinical arena. We can take a continuous variable like blood pressure and basically classify you as being hypertensive or not hypertensive. Right.

So two categories, taking a continuous variable, but creating categories out of it. OK. Tumor sizes.

Right. Operable versus not operable. Depending on how big or how small the tumor is, maybe it's operable, maybe it's not operable.

OK. And taking years and splitting them into, like, 1964 to 1968 versus all others, right?

So taking years and creating categories to study things. Here's another example: taking ordinal categorical data and then treating it as continuous, right? So now we have a categorical variable whose categories are labeled with numbers, and we treat them as if they were continuous. And we can do that because, remember, ordinal variables have a natural order to them. So again, if we think of education levels like K through 12, we can treat K through 12 as 0 through 12, right?

So 0 being kindergarten, first grade being 1, second grade being 2, etc. And even though those are whole numbers, because they have an order to them, we can treat them as continuous variables. Especially if we're looking at educational level like this, you can think of it as a real number, right? This could be 0.0, 1.0, 2.0, 3.0, etc.

Okay, so you can treat it like it was a continuous variable even though it isn't. Glucose tolerance categories that we used in the previous example, you know, you have normal, impaired, and diabetic. We can just treat those, right?

Normal leads to impaired, which leads to diabetic, and so we can numerically assign 0, 1, 2 and create a progression. These are progressive categories, so we can make numeric groupings and make it a numeric progression. And again, you treat this as a continuous variable. The Karnofsky performance status takes performance categories and treats them as numeric. The Karnofsky scale goes from 0 to 100 in increments of 10. And so again, even though it's a categorical variable, because it goes from 0 to 100, we can treat it as a continuous variable.
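Here is a minimal Python sketch of going the other way, from ordered categories to numeric scores. The sample values and the names `grade_score` and `tolerance_score` are hypothetical. The scores follow the 0, 1, 2 progression described above and the K through 12 to 0 through 12 mapping.

```python
# Education: kindergarten -> 0, first grade -> 1, ..., twelfth grade -> 12.
GRADES = ["K"] + [str(g) for g in range(1, 13)]
grade_score = {grade: i for i, grade in enumerate(GRADES)}

# Glucose tolerance: normal -> 0, impaired -> 1, diabetic -> 2.
tolerance_score = {"normal": 0, "impaired": 1, "diabetic": 2}

# Hypothetical sample of four subjects.
sample = ["normal", "diabetic", "impaired", "normal"]
scores = [tolerance_score[c] for c in sample]

# Treating the scores as numeric lets us compute things like a mean,
# which quietly assumes the steps between categories are equally spaced.
mean_score = sum(scores) / len(scores)

print(grade_score["K"], grade_score["12"])
print(scores, mean_score)
```

The design choice to keep in mind is the equal-spacing assumption. The order is real, but the distance from 0 to 1 being the same as from 1 to 2 is something we impose when we treat the scores as continuous.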

Now, I want to warn you, some people use discrete and categorical interchangeably. So in this class, we'll try to refer to discrete numeric to emphasize that, at least for us, discrete data are not categorical. So let's look at some more examples. So here we're going to take age. So age in and of itself is a continuous variable.

So you can see here, in this table over here on the right, if we look at age as a continuous variable, it can be 18, 20, 30. Little kids always like to say, I'm two and a half, right? So they're 2.5 years old.

And if you measure age in months and you want to convert it to years, then you can end up with a decimal value. So age can be continuous. We can also take age and create categories from it.

And because age is continuous and has a natural order to it, we can get a categorical ordinal variable for age by creating age groups, 15 to 20, 21 to 25, etc. Or we can create a different categorical ordinal variable using labels like youth, teen, adult, elderly.

That similarly captures the progression in age, but in this case the data would just be counts. How many of our sample are youth, how many are teen, how many are adult, depending on how we define those categories.

Okay, and the same is true of the categorical ordinal variable where we have the age categories like 15 to 20, right? We're counting how many of our sample fall within each age range. Okay, we can also use age as a categorical nominal variable, right, young versus old. Take your age and split it in two.

You have two categories. Or even ages versus odd ages. That's a different way to look at age as a variable.
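The same age data can be recast in each of these ways with a few lines of Python. This is only a sketch: the helper `to_group`, the sample ages, and every cutoff (13, 20, 65 for the life stages and 40 for young versus old) are assumptions made up for illustration, since the lecture leaves the category definitions open.

```python
from bisect import bisect_right
from collections import Counter

def to_group(age, cutoffs, labels):
    """Map a continuous age to the label of the interval it falls in."""
    # bisect_right counts how many cutoffs are <= age; that count is
    # the index of the matching label.
    return labels[bisect_right(cutoffs, age)]

# Hypothetical cutoffs: youth < 13, teen 13-19, adult 20-64, elderly 65+.
LIFE_CUTOFFS = [13, 20, 65]
LIFE_LABELS = ["youth", "teen", "adult", "elderly"]

# Age as a continuous variable (hypothetical sample).
ages = [2.5, 16, 18, 30, 70]

# Age as labeled ordered groups, and as a two-category split.
life_stage = [to_group(a, LIFE_CUTOFFS, LIFE_LABELS) for a in ages]
young_old = [to_group(a, [40], ["young", "old"]) for a in ages]

# Once age is categorical, the data we analyze are counts per category.
stage_counts = Counter(life_stage)

print(life_stage)
print(young_old)
print(dict(stage_counts))
```

All three variables could be named age in a dataset, which is why it helps to look at the values themselves before deciding how to treat them.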

So one thing that you need to do is you need to pay attention to what the data are. In other words, are you looking at a category? Are you looking at something like a categorical variable like this?

Or are you looking at a continuous variable like that? Continuous data. Okay.

And don't pay attention to what the name of the variable is. These could all be called age, right? And if you see age and immediately start thinking, oh, it's continuous data, so it's going to be like this, then you're not going to really follow what's going on. But look at what the actual data are. Oh, I've got real numbers, so it must be numeric and continuous. Oh, I've got categories and I'm counting how many people there are.

So therefore it must be age as a categorical variable. Okay, things like that. Pay close attention to how the data are being treated and not necessarily what the label for the data is.