So in this module we're going to talk about what biostatistics is, and then we'll talk about the difference between populations and samples. So what is biostatistics? If you look it up in Merriam-Webster's dictionary, it says statistics applied to the analysis of biological data.
Other people talk about it as a mathematical body of science dealing with the collection, analysis, interpretation, and presentation of biological data. And one philosophy of biostatistics is that it's a search for the truth. There are many specialized branches of biostatistics.
For example, there's statistical genetics, which is an area that I'm trained in, clinical trials, exposure methods. There are all kinds of different specialized areas of biostatistics.
And it's very, very important to remember that you cannot prove a biologic hypothesis with biostatistics. Biostatistics only provides you the weight of the evidence for or against your hypothesis. And really, there's no real way that we can prove anything with statistics.
Why do we need biostatistics? Well, one thing I always tell all my students in all my classes is that part of it is your responsibility as a scientist, physician, or public health official to clearly, honestly, and correctly explain scientific findings to the public. They're the ones that don't really know statistics.
You're taking this class to learn about biostatistics, and you're the one who has to explain to them why a certain scientific outcome is the way it is, what the weight of the evidence is, and so on. You need to understand what methodology was used, how the samples were collected, how the data were analyzed, and all these complicated things that the public in general doesn't understand and doesn't want to understand. They just want to know an answer.
And so by you learning biostatistics in this class, you'll be able to interpret what was done, provide the weight of the evidence. Is it a strong study? Was the evidence good?
Were the results very positive, very strong, or were they weak? You're the one who can interpret that stuff, because you're the one who's going to have the knowledge from this class to do that. I've always said to students that this is one of the primary responsibilities that we have, again, as scientists, physicians, or public health professionals: to explain these things to the public. If you look today, for example, at our current pandemic, there's tons of misinformation out there.
And really, it's up to us as scientists to at least get the facts and the correct interpretation of the data out there. Whether people want to believe it or not is a whole different problem that we're facing. There's that old saying that you can lie with statistics, and that partly comes from the fact that in the past there's been very poor use of statistics, very poor study designs, and poor interpretation of statistical results.
So again, we're here to teach you how to use these methodologies, how to correctly interpret them, and to be honest with whatever the result is that comes out. This is an important tool for your future careers. Even for those of you who think that, oh, I'm never going to do a statistical analysis in my life.
I'm interested in public health policy. I just have to take this class because it's a requirement. You're still going to need to know this stuff because if you've got to go look through all these studies, these scientific studies, to come up with your policies, you need to be able to understand what data analysis was done, how was it interpreted, how do you interpret the results based on what they did. So you need to understand biostatistics as well. So this is a critical tool for everybody.
So as a little bit of an assignment, I want you to view this John Oliver episode from his show Last Week Tonight. The episode was called Science. There's a link available on Blackboard.
Please view this. In the synchronous session related to today's module, we're going to discuss a little bit about what was said in that episode. So again, why do we need biostatistics? Well, one of our goals is to make an inference about a population based on a measurement made on a sample from that population. So if you think about what we want to do, we want to make an inference about some health aspect of a population.
Does eating a high-fiber diet reduce cholesterol in humans? Well, we can't go and measure cholesterol and fiber intake in everybody in the country or everybody in the world, because that's the population. But we can go get a sample and try to make an inference about the population based on the sample that we've collected. The statistical tools that we use to analyze that sample give us a way to summarize and characterize the data that we've collected. They help us determine whether we have observations that deviate from what we'd expect by random chance. And they give us a way to compare results across studies.
And this is one of the things about science. Science is very inefficient. No single study proves anything about anything in biomedical research. Really what happens is that multiple people do the same or similar studies, and we use the weight of the evidence across all those studies.
If all the studies are consistent in direction and outcome, then we make an inference about the population and say, well, this must be true. Then, and only then, do we really have anything like a proof of a hypothesis.
So again, no single study does that, but we need to be able to compare results across studies in order to interpret things. Biostatistics also gives us tools to estimate how many samples or subjects we'll need for a given study. For anybody who writes a scientific grant to the NIH or the NSF or any other organization, every grant has to include a sample size calculation, or what we call a statistical power calculation, which basically says: in order to have sufficient statistical power to test my hypothesis, I need to collect x number of samples. And there's a series of different calculations you can do to do this.
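As a rough sketch of what one such calculation can look like (this particular formula is not given in the lecture), the standard normal-approximation formula for comparing the means of two equal-sized groups is n per group = 2 * ((z_alpha/2 + z_beta) * sigma / delta)^2. The function name `n_per_group` and the example values for the difference `delta` and the standard deviation `sigma` are hypothetical, chosen only for illustration.

```python
import math
from statistics import NormalDist


def n_per_group(delta, sigma, alpha=0.05, power=0.80):
    """Approximate subjects needed per group to detect a mean difference
    `delta` between two groups with common standard deviation `sigma`,
    using a two-sided test and the normal approximation."""
    z_alpha = NormalDist().inv_cdf(1 - alpha / 2)  # critical value for the test
    z_beta = NormalDist().inv_cdf(power)           # quantile for the desired power
    n = 2 * ((z_alpha + z_beta) * sigma / delta) ** 2
    return math.ceil(n)  # round up: you cannot recruit a fraction of a subject


# Hypothetical inputs: detect a difference of half a standard deviation.
print(n_per_group(delta=0.5, sigma=1.0))  # 63 per group
```

Halving `delta` roughly quadruples the required sample size, which is why small effects are expensive to study. An exact calculation based on the t distribution would give a slightly larger number.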
And for some of the methods that we're going to talk about in this class, we'll talk about how to make those estimates of sample size. So here's an example of where statistics can be useful. This is a Gallup poll from November 2013 looking at congressional approval, and they conclude that congressional approval was at its lowest in Gallup's history. This is the figure from their poll, and you can see it goes from 2008 to 2013, and the y-axis is percentage approval. And you can see down here at the very end, at the last measurement, congressional approval is only 9%.
And they say this is the lowest in Gallup's history. But you can look at the data here, and it bounces around all over the place. So is it really the lowest in its history?
I mean, based on the raw number, yeah. But how do we know whether this 9 is really different from this 10 over here, or this 10 over here? Or how do we know whether it's different from this 14 way back over here in 2008? OK, so if you look at the fine print that comes with this poll, it says the results for this Gallup poll are based on telephone interviews conducted November 7 through 10, 2013, with a random sample of 1,039 adults aged 18 and older living in all 50 U.S. states and the District of Columbia, and that for the results based on the total sample of national adults, one can say with 95% confidence that the maximum margin of sampling error is plus or minus four percentage points. So they're saying they're 95% confident that the sampling error is only plus or minus four percentage points.
And that's based on only 1,039 adults. Based on a July 4, 2013 estimate, there were approximately 316 million people living in the United States, but they only sampled 1,039 out of that. And they said across all 50 states. So if you do the rough calculation, and this is not totally correct, but if you take 1,039 divided by 50, that's only about 21 people per state, so the sample size for each state gets even smaller.
So is sampling 1,039 adults enough to be able to make this statement that congressional approval is at its lowest in Gallup's history? We're going to revisit this poll later in the class, and we'll see that 1,039 adults is actually sufficient to make this statement with 95% confidence and an error of plus or minus four percentage points.
And you see this kind of stuff all the time, especially when we get into election season. There are all these polls that come out and talk about whoever's running against whoever and who's up and who's down, and they talk about the error, the 95%, plus or minus whatever percentage points. This is where all of that comes from. So you'll be able to calculate these things yourself and see if these guys are right or not.
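As a preview of that calculation, here is a minimal sketch using the textbook margin of error for a proportion from a simple random sample, z * sqrt(p(1 - p) / n). The function name `margin_of_error` is hypothetical, and p = 0.5 is used because it gives the largest possible margin. The course may develop this differently later on.

```python
import math
from statistics import NormalDist


def margin_of_error(n, p=0.5, confidence=0.95):
    """Margin of error for a sample proportion, assuming a simple random
    sample of size n and the normal approximation."""
    z = NormalDist().inv_cdf(1 - (1 - confidence) / 2)  # about 1.96 for 95%
    return z * math.sqrt(p * (1 - p) / n)


# Worst case (p = 0.5) for the poll's sample size.
print(f"{100 * margin_of_error(1039):.1f} percentage points")          # 3.0
# At the observed 9% approval the margin is smaller still.
print(f"{100 * margin_of_error(1039, p=0.09):.1f} percentage points")  # 1.7
```

Under these assumptions the worst-case margin is about three percentage points, which sits inside the plus or minus four that the poll reports. The reported figure may be larger because it allows for weighting and survey design, but that is an assumption here and not something stated in the fine print quoted above.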
OK, but all of this is premised on some other factors that we're going to talk about in this course that are very, very important. Another example comes from the 2012 presidential election, when Barack Obama ran against Mitt Romney.
There is this episode from The Colbert Report that talks about that election. I have the link here, but if you go to Blackboard, I also have a direct link. You just click on it and it'll go to the episode. I want you all to look at that as well as part of your homework, and we'll be discussing this also in the synchronous session later this week.
So population versus a sample, right? I said we were trying to make an inference about the population based on a measurement that we make from the sample. You have some population.
So population here is defined as the totality of subjects under study. And when you talk about humans, it doesn't necessarily mean all humans. When you talk about a scientific question, you have to clearly define what you're talking about.
You may be only interested in, for example, adults between the ages of 18 and 34 who have some certain characteristics. That might be your population. Your population might be only those people who are left-handed or some specific ethnic group.
For myself, in my own studies, we're looking at Hispanic women who have had a previous diagnosis of gestational diabetes within the last five years. That's the population that we're interested in, and we're going to make an inference about all Hispanic women who have had a previous diagnosis of gestational diabetes within the last five years.
So you have to define what your population is. And then you have to take a measurement on an individual within that population. A measurement is basically an assessment that's made on a single unit. So you might be interested in cholesterol, so you might measure blood cholesterol levels. Or you might be interested in dexterity, so you might have a dexterity test that your individuals have to do.
If you're doing air pollution research and lung function, you might have a lung function test. Or think about a treadmill test for cardiac events. You can think of all these different kinds of measurements that you can do, and they don't necessarily have to be biological samples.
You could just ask, like, have you had a previous experience with fainting? Or have you experienced prejudice? There are all kinds of measurements that you can make, but you're asking each individual unit this question.
And then the sample. Again, we can't measure everybody in the population, so we draw a sample from that population, we do a measurement on each one of the individuals we've sampled, and we try to make an inference about the population.
So the sample is a subset of the population on which these measurements are obtained. Now this is where problems start to occur, because sampling of the population is subject to error and/or bias. And this is where different studies can run into problems, depending on how each study sampled its subjects. One study may have drawn samples in one way, another study may have used a different approach, and that difference in the sampling could result in a difference in outcome. So sampling is really, really critical in terms of biostatistics and analysis.
OK, so again, the population is the totality of whatever it is we're looking at. Statistically, we describe populations using what we call parameters, and parameters are usually represented by Greek symbols.
Whereas if we take a sample from that population, the sample is described using what we call sample statistics. The sample statistics are estimates of the population parameters. And we usually, though this is not always true, represent sample statistics using the equivalent Roman letters.
So if we use a certain Greek letter to describe a certain population parameter, the sample statistic that estimates that same parameter will be represented by the equivalent Roman letter. That's not always true, but that's typically true. So again, how you sample from the population can affect your results.
So let's look at a simple example here. Let's say we're interested in height. And let's say we have a population where the average height is 1.73 meters.
Let's say we, for whatever reason, we know exactly what the population average is. But of course, in a population, some people are shorter, some people are close to the average, and others are taller. So there's all kinds of variation in our population. So when we go to sample the population to estimate, we're going to try to estimate this 1.73.
So we're going to go into the population, we're going to sample, and then we're going to measure height in our sample and use that as our estimate of our population height. Depending on how we sample, we could end up with different results. For example, if I do poor sampling and I only sample from the short end of the population distribution, then I'm going to underestimate the population average height, because I'm only going to have short people, so my estimate is going to be smaller than the actual population height. I'm also going to underestimate the variation in height, because again I'm only collecting short people and not the people who are closer to the average or the taller people. Similarly, if I only collect from the tall end of the population distribution, now I'm going to overestimate my population average.
And I'm still going to underestimate the population variation now, because now I'm only collecting tall people and I'm not sampling short people or people who are closer to the average. And then again, similarly, if I collect only at the population average, somewhere around the population average, I'm probably going to get the correct estimate of the population average height, right, because I'm sampling near the actual population average. But again, I'm going to underestimate the population variation because I'm not sampling short people and I'm not sampling tall people. However, if I sample across the entire population distribution, I'm going to get a sample that's going to include short people, tall people, and people who are close to the average. So if I do a good job of sampling, then I should correctly estimate the population average height.
I should get some number that's close to 1.73 meters. And I should correctly estimate the population variation in height, because now I have a mixture of short, tall, and average people.
So the different kinds of samples that we collect, and this is only a small subset of the different types of samples that we could be collecting, have an impact on our outcome and our ability to make inferences about the population. A random sample is probably one of the best samples you could be collecting, because a random sample eliminates bias in how subjects are selected. And a random sample is basically what its name implies: you randomly go in and grab a sample. So for example, you might take a random selection of 200 test tubes from a day's production of something.
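The height example can be sketched as a small simulation. This is only an illustration under stated assumptions: the population mean of 1.73 meters comes from the example, while the normal shape, the standard deviation of 0.10 meters, the population size, and the sample size of 200 are all hypothetical choices.

```python
import random
import statistics

random.seed(0)  # fixed seed so the illustration is reproducible

POP_MEAN = 1.73  # population average height from the example (meters)
POP_SD = 0.10    # hypothetical population standard deviation (meters)

# Simulated population of heights (normal shape is an assumption).
population = [random.gauss(POP_MEAN, POP_SD) for _ in range(100_000)]
ordered = sorted(population)
third = len(ordered) // 3
n = 200

samples = {
    "short end only": random.sample(ordered[:third], n),
    "near average only": random.sample(ordered[third:2 * third], n),
    "tall end only": random.sample(ordered[2 * third:], n),
    "simple random": random.sample(population, n),
}

# Each sample mean and standard deviation is a sample statistic that
# estimates the corresponding population parameter.
for label, sample in samples.items():
    mean = statistics.mean(sample)
    sd = statistics.stdev(sample)
    print(f"{label:18s} mean = {mean:.3f} m   sd = {sd:.3f} m")
```

The three restricted samples all report a standard deviation well below 0.10, and the short-only and tall-only samples miss the mean in opposite directions. Only the simple random sample should land close to both population values.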
Sometimes, depending on the type of question that you're asking, it might be important to get some sort of balance across certain strata. So, for example, let's say you wanted to make sure you had similar numbers of males and females.
So you might do stratified random sampling, where you say, I'll get a random sample of males and I'll get a random sample of females. That's different from saying I'm going to get one random sample from the whole population, because with that kind of sampling there's no guarantee that you're going to get males and females in sufficient numbers. You may be unlucky and sample only females or only males, or something else.
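The difference between the two approaches can be sketched as follows. The population of 1,000 labeled units and the helper names `simple_random_sample` and `stratified_random_sample` are hypothetical and exist only for this illustration.

```python
import random
from collections import Counter

random.seed(1)  # fixed seed so the illustration is reproducible

# Hypothetical population: 1,000 units, half labeled "F" and half "M".
population = [{"id": i, "sex": "F" if i % 2 else "M"} for i in range(1000)]


def simple_random_sample(pop, n):
    """Draw n units at random from the whole population."""
    return random.sample(pop, n)


def stratified_random_sample(pop, key, n_per_stratum):
    """Split the population into strata by `key`, then draw a separate
    random sample of n_per_stratum units from each stratum."""
    strata = {}
    for unit in pop:
        strata.setdefault(unit[key], []).append(unit)
    sample = []
    for members in strata.values():
        sample.extend(random.sample(members, n_per_stratum))
    return sample


srs = simple_random_sample(population, 20)
strat = stratified_random_sample(population, "sex", 10)

# The simple random sample's split varies from draw to draw;
# the stratified sample is 10 and 10 by construction.
print("simple random:", Counter(unit["sex"] for unit in srs))
print("stratified:   ", Counter(unit["sex"] for unit in strat))
```

Both samples contain 20 units, but only the stratified design guarantees the balance. Selection within each stratum is still random.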
You might have some imbalance. So stratified random sampling tries to eliminate bias within strata. And then there's also the sample of convenience. This is probably the worst kind of sample you can collect, because it's possibly very biased.
So, for example, you might collect cell lines derived from some previous set of experiments instead of generating new cell lines. Or maybe you have a large clinical trial you're doing, and you decide it's too hard to go get new patients and run them through your experiment, so you just sample from the existing clinical trial because they're there and they're easy to access. That's why it's called a sample of convenience, but your sample could be biased and not represent the population that you're interested in. So I want you all to prepare for our synchronous session by reading, or at least looking through, the three papers that we've assigned. You don't have to read them in detail, but familiarize yourself with the papers by Shikuma, Zhang, and Arshad, and we'll talk about these during our synchronous session.