Transcript for:
Understanding Replicability and External Validity

This week we'll be covering Chapter 14, Replicability, Generalization, and the Real World. This is the final chapter in this course, and it aims to wrap up and tie together all of the concepts we've discussed throughout the semester. So let's look at how this chapter is broken down. It's divided into two major sections: the first focuses on replicability and the other on external validity. Both of these sections are important to understand so that you can better take what you've learned in this class and apply it in the real world moving forward. Replicability deals with a study's ability to be repeated and come to similar conclusions. If results are reliable, they should also be repeatable. Remember, external validity refers to the degree to which a study's findings can be generalized from the sample to the population of interest. External validity is an important aspect of a study's overall value.

Here we see the subheadings for the first section of the chapter. We'll discuss some replication studies, how replication is debated in psychology, what a meta-analysis is and how it applies to replication, and how replication is viewed and discussed in the popular media. Remember, calling a result replicable does not mean that the study could hypothetically be repeated, but rather that the result has actually been repeated. In a replication study, a researcher performs a study again. There are three major types of replication: direct replication, conceptual replication, and replication plus extension.

Direct replication, also known as exact replication, is when the original study is repeated as closely as possible to determine whether the original effect is found in the new data. Jones and colleagues theorized in 2004 that because most people like themselves, they might implicitly associate positive feelings with the letters in their own name and the numbers in their birthday. They further hypothesized that people might be more attracted to people who share their name, birthday, or initials. To test this, participants completed a short survey about themselves. Then, each participant was given the page of responses from a randomly selected partner. At the top of the page was the partner's ID number, printed in large font. For half of the participants, the partner's ID number had been matched to their own birthday; for the other half, it had not. All other information on the partner's responses was identical, so the only difference was whether the ID number matched their birthday. After several minutes, participants were asked to rate their partner on different dimensions. As you can see in the graph, when the partner's ID number matched their birthday, participants liked that partner more than when the ID number did not match. You can see the results from the original study in 5a, and the direct replication in 5b. Although direct replications make a lot of sense, when using one you run the risk of repeating any threats to internal validity or construct validity from the original study. Consequently, researchers sometimes turn to other forms of replication instead.

Conceptual replication is where researchers explore the same research question but use different procedures. The conceptual variables in the study are the same, but the procedures for operationalizing the variables are different. Consider the theory of implicit egotism, in which people's implicit self-associations predict their social behaviors.
They have operationalized implicit self-associations through first-name, last-name, and birthday similarity, and operationalized social behaviors through liking for a person, marriage patterns, residence location, and career choice. For example, they found that people named Denise or Dennis are more likely to become dentists than lawyers, while people named Laura or Lawrence are more likely to become lawyers than dentists. Your dentist's name is Crentist. Because many factors influence our social behaviors, implicit egotism effects are very small. However, they appear in almost all instances tested by the researchers.

Replication plus extension: here researchers replicate the original study but add some variables to test additional questions. One example is the research on note-taking discussed in Chapter 10, by Mueller and Oppenheimer. In the original posttest-only design, students taking notes with a laptop performed worse on subsequent exams than those taking notes by hand. The researchers noticed that when people used laptops, their notes had more verbatim overlap with the lectures compared to longhand note takers. The researchers suspected that the laptop users were merely transcribing directly, so they were processing the information superficially rather than thinking about it and making deep connections. Thus, they performed poorly when tested. The researchers tested this hypothesis in a second study with three conditions: a longhand condition, a laptop condition, and a new laptop condition in which people were warned not to copy the lecturer's words exactly. This was a replication-plus-extension study because it replicated the original study and included one new condition. They found that the notes of the warned laptop users contained the same level of verbatim overlap as those of laptop users given the original instructions. In addition, both laptop groups performed worse on the test questions compared to the longhand note takers. Another way to conduct a replication-plus-extension study is to introduce a new independent variable. Mueller and Oppenheimer's third study illustrates this approach: it replicated the laptop note-taking effect and extended the research by allowing half the students to study their notes a week later.

The replication debate in psychology can be discussed in three parts: first, the replication crisis, or the finding that many replication studies do not reach the same conclusions as the original works; second, why these replication studies might fail; and third, ways to improve research practice so that studies are replicable. Let's first talk about the replication crisis. This problem came about in part because journals prefer to publish new methods and new theories, whereas direct replication studies are neither of those and often go unpublished. Early replication attempts inspired a large group of psychologists, known collectively as the Open Science Collaboration, or OSC, to attempt replication on a larger, systematic scale. The OSC selected 100 studies from three major psychology journals and recruited researchers around the world to conduct direct replications. They used different metrics to judge whether a study was a successful replication. By one metric, only 39% of the studies clearly replicated the original effects. Other measures yielded better rates of replication, but the media usually reported the lowest figure, 39%. The announcement was alarming, and science writers dubbed the problem a replication crisis.
And in the months that followed, scientists began to analyze the situation. So why did the replication crisis take place? What caused the replication studies to fail? Let's talk about three issues that might have been involved.

First, there may be contextually sensitive effects. Some of the measures and manipulations used in replications may not have had the same meaning as the originals. Some original effects are contextually sensitive, and when the replication context is too different, some authors argued, the replication is more likely to fail. For example, one original study of how people respond to charitable appeals mailed actual letters about an AIDS-related charity, but the replication sent emails, and they were about an environmental charity. Another original study had Stanford University students watch a video about diversity-related admissions practices; the replication used the same video in the Netherlands, where the university system differs.

The number of replication attempts may also cause problems. The OSC conducted only one replication attempt for each original study, but any single study always has the potential to miss a true finding, leading to a failed replication. To counter this problem, some researchers obtained data from another large-scale effort, the Many Labs Project, or MLP, which had conducted up to 36 replications of each study and combined the results of them all. Using this more powerful approach, the replication rate rose to 85%.

Finally, there may simply be problems with the original study. In certain cases, the original study's sample size was simply too small, so a few extreme individuals could have had a disproportionate influence on means and patterns. In large samples, extreme cases will almost certainly be cancelled out by individuals at other extremes, but in small samples this is less likely to happen. Small samples can accidentally lead to significant findings that can't be replicated because there probably wasn't a real effect in the first place. HARKing is a term that stands for hypothesizing after the results are known, and it involves creating a hypothesis after seeing surprising results. Such findings may be due to chance and so cannot be replicated. P-hacking is when a researcher peeks at the study's results and, if they are not quite significant, runs a few more participants, decides to remove certain outliers from the data, or runs a different type of analysis. The P stands for p-value, or significance.
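To make that idea concrete, here is a minimal, hypothetical simulation (not from the textbook) of one p-hacking tactic, optional stopping: the researcher tests a group, peeks at the p-value, and keeps adding participants until the result crosses p < .05. Even when there is no true effect at all, this inflates the false-positive rate well above the advertised 5%.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)

def phacked_study(n_start=20, n_add=10, max_n=60, alpha=0.05):
    """Simulate one study of a TRUE NULL effect (population mean = 0)
    with optional stopping: peek, then add participants until p < alpha."""
    data = list(rng.normal(loc=0.0, scale=1.0, size=n_start))
    while True:
        p = stats.ttest_1samp(data, popmean=0.0).pvalue
        if p < alpha or len(data) >= max_n:
            return p < alpha          # True = a "significant" (spurious) finding
        data.extend(rng.normal(0.0, 1.0, size=n_add))

# False-positive rate across many simulated studies
n_sims = 2000
false_positives = sum(phacked_study() for _ in range(n_sims))
print(f"False-positive rate with optional stopping: {false_positives / n_sims:.2%}")
# Lands well above the nominal 5%, even though no real effect exists.
```

Because every peek is another chance for a fluke, the rate of spurious significant results climbs, and those results are exactly the kind that fail to replicate. Fixing the sample size and analysis plan in advance, as the preregistration practices discussed later in this chapter encourage, removes that flexibility.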
Here's a clip from John Oliver where he discusses some of the issues in replication. There is a lot of bullshit currently masquerading as science. So tonight, we thought we'd talk about a few of the reasons why. And first, not all scientific studies are equal. Some may appear in less-than-legitimate scientific journals, and others may be subtly biased because of scientists feeling pressured to come up with eye-catching positive results. My success as a scientist depends on me publishing my findings, and I need to publish as frequently as possible in the most prestigious outlets that I can. Now that's true. Scientists are under constant pressure to publish, with tenure and funding on the line. And to get published, it helps to have results that seem new and striking, because scientists know nobody is publishing a study that says, nothing up with acai berries. And to get those results, there are all sorts of ways that, consciously or not, you can tweak your study. You could alter how long it lasts, or make your random sample too small to be reliable, or engage in something that scientists call p-hacking. That's p-hacking with a hyphen, not to be confused with phacking, which, as I think everyone knows, is a euphemism for f***ing the Phillie Phanatic. Now, p-hacking is very complicated, but it basically means collecting lots of variables and then playing with your data until you find something that counts as statistically significant but is probably meaningless. For example, the website FiveThirtyEight surveyed 54 people, collected over 1,000 variables, and by p-hacking the results was able to find statistically significant correlations between eating cabbage and having an innie belly button, drinking iced tea and believing Crash didn't deserve to win Best Picture, and eating raw tomatoes and Judaism. And the only thing tomatoes have in common with Judaism is that neither of them really feel quite at home in the upper Midwest. But you don't even need to engage in these kinds of manipulations to get results that don't hold up. Even the best-designed studies can get flukish results. And the best process that science has to guard against that is the replication study, where other scientists redo your study and see if they get similar results. Unfortunately, that happens way less than it should. Replication studies are so rarely funded and they're so underappreciated. They never get published. No one wants to do them. There's no reward system in place that enables it to happen. So you just have all of these exploratory studies out there that are taken as fact, that this is a scientific fact that's never actually been confirmed. Exactly. There is no reward for being the second person to discover something in science. There's no Nobel Prize for fact-checking. And incidentally, "There's no Nobel Prize for fact-checking" is a motivational poster in Brian Williams's MSNBC dressing room. And for all those reasons, scientists themselves know not to attach too much significance to individual studies until they're placed in the much larger context of all the work taking place in that field. But too often a small study with nuanced, tentative findings gets blown out of all proportion when it's presented to us, the lay public. Sometimes that happens when a scientific body puts out a press release summarising the study for a wider audience. For instance, earlier this year, a medical society hosted a conference at which a paper was presented comparing the effects of high- and low-flavanol chocolate during pregnancy. If that sounds narrow and technical, it was supposed to be. There wasn't even a control group of women who didn't eat chocolate. And the study found no difference in pre-eclampsia or high blood pressure between women who ate the two chocolates. So there is no way a study that boring can make it to television, right? Well, wait. Because that medical society issued a press release with the much sexier but pretty misleading title, The Benefits of Chocolate During Pregnancy. And because most TV producers just read press releases, this happens. Turns out if you're pregnant, eating 30 grams a day of chocolate, that's about two-thirds of a chocolate bar, not the whole chocolate bar, could improve blood flow to the placenta and benefit the growth and development of your baby, especially in women at risk for preeclampsia or high blood pressure in pregnancy. Except that's not what the study said!
It's like a game of telephone. The substance gets distorted at every step. And I can only imagine how someone who watched that segment must have described it the next day. Or, the news said our baby is made of chocolate and it's okay if I eat it, but only two-thirds.

In response to these critiques, psychologists and other scientists have introduced changes. One is for research journals to require much larger sample sizes, both for original studies and for replication studies. Researchers are also urged to report all of the variables and analyses they tested. Open science has been strongly encouraged in the academic community: open science is the practice of sharing one's data and materials freely so that others can collaborate, use, and verify the results. Preregistration has also been implemented to help improve the scientific process. Scientists can preregister their study's method, hypotheses, or statistical analyses online, in advance of data collection.

We're now going to talk about another method the scientific community uses to come to a larger consensus on research findings, called meta-analysis. These are studies that look at multiple research publications on a single topic and determine the overall consensus of the literature. Remember, a scientific literature is a series of related studies conducted by different researchers who have tested similar variables. So, a meta-analysis is a statistical analysis that yields a quantitative summary of a scientific body of literature. Let's first go over an example of a meta-analysis and then talk about the strengths and weaknesses of such analyses.

In our example, a group of researchers conducted a meta-analysis on studies that examined the relationship between response time and truthful responding. By searching databases, contacting online groups, and emailing colleagues, they collected 114 experiments. They computed the effect size, d, of each study such that the larger the effect size, the greater the reaction time difference between lies and truths. Then they calculated the average of all the effect sizes. Although they used d, a meta-analysis can also compute the average of effect sizes measured as r. As shown in the table, Cohen's 1992 conventions for effect sizes describe an average effect size as small, medium, or large. The average effect size in the 114 studies was 1.26, a very large effect size supporting the cognitive cost of lying. In follow-up analyses, the team separated the studies into different categories, seen in Table 14.3. For example, some studies told participants to go as fast as possible, while others did not. The theory predicts that the reaction time difference will be especially noticeable when people are trying to go fast. As expected, the average effect size for the fast-as-possible studies was large. In other words, speed instructions moderated the relationship between lying and reaction time. The researchers also separately analyzed studies in which participants had an incentive to avoid being caught, a situation similar to a criminal interrogation, seen in Table 14.4. Motivation to avoid getting caught moderated the influence of lying on reaction time, such that when people were motivated, the reaction time difference was smaller.
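As a rough illustration of the mechanics just described, here is a minimal sketch of averaging effect sizes overall and within moderator categories. The numbers and the study list are invented for illustration; they are not the 114 real experiments.

```python
# Hypothetical effect sizes (Cohen's d) from a handful of made-up studies,
# each tagged with whether participants were told to respond as fast as possible.
studies = [
    {"d": 1.40, "speed_instructions": True},
    {"d": 1.10, "speed_instructions": True},
    {"d": 1.65, "speed_instructions": True},
    {"d": 0.70, "speed_instructions": False},
    {"d": 0.55, "speed_instructions": False},
]

def mean_d(subset):
    """Simple (unweighted) average effect size, as described in the lecture."""
    return sum(s["d"] for s in subset) / len(subset)

overall = mean_d(studies)
fast = mean_d([s for s in studies if s["speed_instructions"]])
not_fast = mean_d([s for s in studies if not s["speed_instructions"]])

# Cohen's (1992) rough benchmarks: d around 0.2 small, 0.5 medium, 0.8 large.
print(f"Overall average d: {overall:.2f}")
print(f"Speed-instruction studies: {fast:.2f} vs. other studies: {not_fast:.2f}")
# A gap between the two subgroup averages is what the lecture calls a moderator
# effect: speed instructions moderate the lying/reaction-time relationship.
```

Published meta-analyses usually weight each study by its precision (for example, by sample size) rather than taking a plain mean, but the moderator logic, comparing average effect sizes across categories of studies, is the same.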
So, what are some strengths and weaknesses of conducting a meta-analysis? The file-drawer problem states that a meta-analysis might overestimate the true effect size because null effects, or opposite effects, haven't been included in the analysis. It's called the file-drawer problem because null results tend to be left filed away in a drawer and forgotten, rather than published the way findings from significant studies are. This ties back to the problem of academic journals favoring the publication of significant findings over null ones. For example, consider research on the effectiveness of antidepressant medications on depressive symptoms. Of 74 studies analyzed, 38 showed positive results, 24 showed negative results, and 12 showed questionable results. All but one of the studies with positive results had been published in medical journals, whereas null findings were less likely to be published; only 3 of the 36 negative or questionable outcomes were published. If you examine only the published studies, 94% showed the antidepressants are effective, but only 51% of all the registered studies in fact showed that antidepressants are effective.

Finally, in terms of the popular media, replicability is very often undervalued and underrepresented. When viewing popular media yourself, you should always look for additional information that summarizes the entire body of literature, not just the one study in particular. A responsible journalist will do just that: present all of the known information on any given topic.

We're now going to shift and talk about the second half of the chapter, external validity. In this section, we'll talk about generalizing findings to other participants and settings, how to evaluate the degree to which a study does in fact generalize, and whether or not a study must take place in a real-world setting to be considered generalizable. In order to evaluate a study's ability to generalize to other people, you need to find out how the participants were obtained. A probability sample is intended to generalize to the population from which the sample is drawn, but a convenience sample may not generalize to the population. Remember, it's a population, not the population. When we refer to a population used in research, we are not talking about everyone in the world or everyone in the United States; that would be the population. We are referring to a population of interest, the population to which we would like our results to generalize. It might be freshmen at your university or fifth-grade teachers in El Paso, Texas. External validity comes from how, not how many: how a sample is obtained, random sampling versus biased sampling, is more important than how many participants are in your sample. Just because a sample comes from a population doesn't mean it generalizes to that population. It is incorrect to assume that just because a convenience sample includes some members of a population, for example, juniors in sororities, Jewish Democrats, or motorcyclists, it generalizes to all the members of that population. In order to generalize to a population, you must have a probability sample of, in this case, juniors in sororities, Jewish Democrats, or motorcyclists.
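Here is a minimal, hypothetical sketch (the scenario and numbers are invented for illustration) of the "how, not how many" rule: a small probability sample tracks the population, while a much larger convenience sample drawn from only the easy-to-reach part of the population stays biased no matter how big it gets.

```python
import numpy as np

rng = np.random.default_rng(1)

# An invented population of 10,000 students: 30% live in dorms (higher scores
# on some survey measure), 70% live off campus (lower scores).
dorm = rng.normal(loc=7.0, scale=1.0, size=3_000)
off_campus = rng.normal(loc=4.0, scale=1.0, size=7_000)
population = np.concatenate([dorm, off_campus])
print(f"True population mean:        {population.mean():.2f}")

# Probability sample: every member of the population has an equal chance.
prob_sample = rng.choice(population, size=100, replace=False)
print(f"Random sample of 100:        {prob_sample.mean():.2f}")   # close to the truth

# Convenience sample: only dorm residents are easy to reach, but we grab 2,000.
conv_sample = rng.choice(dorm, size=2_000, replace=False)
print(f"Convenience sample of 2,000: {conv_sample.mean():.2f}")   # biased high
```

The large convenience sample lands near the dorm residents' mean rather than the population mean, while the random sample, despite being twenty times smaller, comes out close to the truth. That is the logic behind preferring probability samples when external validity is the goal.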
When we want to know whether a laboratory situation within the context of a study generalizes to real-world settings, we're interested in the study's ecological validity. Think about this example: would the studies on lying about playing cards, which took place in the laboratory on a computer, generalize meaningfully to real-world lies? They might, but we also need to think about what might cause them not to generalize. This is how we evaluate a study's ecological validity.

How is external validity related to a study's importance? The answer is that it depends on whether the researcher is operating in theory-testing mode or in generalization mode. These two different approaches to research shape the overall goal of the study. If the study is intended to focus on generalizability, then external validity is indeed crucial to its importance. Let's look at each of these modes with some examples.

Theory-testing mode is typically used when testing association claims or causal claims to investigate whether there is support for a particular theory. You might recall the theory-data cycle from Chapter 1, in which a study is designed to test a theory and the data from the study are then used to reject, refine, or support the theory. When in theory-testing mode, external validity is less important than internal validity. Basic research tends to be done in theory-testing mode. Think back to the contact comfort theory and Harry Harlow's studies of attachment in infant monkeys. His studies were designed to test whether the data supported the cupboard theory, that babies are attached to mothers because they feed them, or the contact comfort theory, that babies are attached to mothers because they nurture them and are soft and comforting. The study had two mothers, a cloth mother providing comfort but not food, and a wire mother providing food but not comfort. The baby monkeys in this study spent almost all of their time with the cloth mother, which supports the contact comfort theory. Another example of research in theory-testing mode comes from studies of how children learn grammar, testing the idea of the parent as grammar coach. The reinforcement theory of grammar states that children learn correct grammar through reinforcement. Brown and Hanlon tested this theory in 1970 by listening to audiotapes of parents and children interacting. They found that most of the time parents corrected children's speech for factual accuracy, for example, "Mommy is a boy," but not for grammar, for example, "Mommy are a girl," which does not support the reinforcement theory of grammar. The participants in the study were upper-middle-class Boston families, who are not representative of all families in the United States. The biased sample wasn't important, though, because the researchers were testing a theory; therefore, they were less interested in external validity.

Generalization mode, on the other hand, is used when researchers want to generalize the findings from the sample in the study to a larger population of interest. Here, it is important to use probability samples, and the primary concern is external validity. Applied research tends to be done in generalization mode; however, some research has aspects of both modes. Frequency claims are always in generalization mode, and survey research that's intended to support frequency claims is always done in generalization mode. For example: what percentage of college students have lived in the dorms, or what percentage of voters will vote yes on this bill? Having a representative sample is crucial for supporting frequency claims. Association and causal claims, by contrast, are usually tested in theory-testing mode, but there are situations in which it is appropriate to conduct them in generalization mode. For example, suppose you've come up with a new technique to help people increase their reading speed and comprehension.
You might first conduct it on a limited sample in theory-testing mode. Then, you might wonder if the findings generalize to a more diverse sample, so you shift into generalization mode. Another example: marketing researchers might start with a very limited sample, or a focus group, to see if a particular advertising campaign or product package is effective. Then, they might want to extend the research with a larger and more diverse sample to see if the findings generalize.

Cultural psychology is a subdiscipline of psychology focusing on how cultural contexts shape human thinking, feeling, and behavior. Most cultural psychologists tend to emphasize generalization mode, and they have challenged researchers who operate exclusively in theory-testing mode. You might be familiar with the perceptual illusion called the Mueller-Lyer illusion. Does vertical line B appear longer than vertical line A? The lines are the same length, but North Americans and Europeans tend to say that line B is longer than line A. If you look at the cross-cultural results on the left, you'll see that many people around the world do not see line B as being longer than line A. If researchers from North America were working in theory-testing mode, they would probably use a North American sample, since generalizability is not a priority. But Segall and colleagues were operating in generalization mode in their cross-cultural study with participants from around the world. They found that people who grow up in a carpentered world have more experience with right angles as cues for depth perception than people in other societies, and this experience changes their perception of the two lines' lengths.

When you look at the picture on the right, what do you notice first? Most North Americans tend to notice the three large fish, referred to as the figure of the picture. But as you study the image further, you might notice other things, such as the plants, smaller fish, or water; these are referred to as the ground. When researchers Masuda and Nisbett showed this image to Japanese college students, more of them tended to comment on the background, or ground, first and the figure later. Later on, participants were quizzed on what they had seen. Sometimes participants were shown one of the original fish with the original background, with no background, or with a novel background; see the figure on the bottom. The results from Masuda and Nisbett's study are shown here. North Americans could still recognize the fish, which suggests that they processed the figure separately from the ground, whereas Japanese participants were less able to identify the fish when it was removed from the surrounding scene, which suggests that they processed the fish in context, and when the context changed it affected memory. Masuda and Nisbett were in generalization mode in this cross-cultural study; had they been in theory-testing mode, they might have studied only North Americans to test theories about how all people perceive complex scenes. This is yet another example of how cultural psychologists combine generalization and theory-testing modes to test a theory in multiple cultures before assuming it applies to all people. In the top six journals in psychology in 2007, 68% of the participants were American and 96% were from North America, Europe, Australia, or Israel.
Participants from these countries are very different from participants in the rest of the world, and they have been referred to as WEIRD: Western, educated, industrialized, rich, and democratic. WEIRD samples are not very representative of the world's population. As you know, when researchers are in theory-testing mode, they don't prioritize external validity and might only test WEIRD participants. But cultural psychologists remind researchers that if they only test WEIRD participants, their results may not generalize to everyone else.

Let's now look at the last section of part two of this chapter. Some people assume that studies conducted in the real world are more important than studies conducted in laboratory settings, but that may not be the case. Let's look at some of the reasons why. When a study takes place in the real world, it occurs in a field setting and has high external validity, more specifically, high ecological validity. Sometimes, however, studies conducted in laboratories can feel very real: their tasks might be similar to tasks we do in the real world, and we might even feel some real emotions in the lab. This is what's referred to as experimental realism. Remember, when working in generalization mode, it is important to have a representative sample for high external validity, but it's also important to consider whether the findings generalize to the real world, again referred to as ecological validity. Finally, remember that when you're working in theory-testing mode, external validity and real-world applicability are not as important. The chart on the right shows good examples of how to react to scientific results. Remember, it's always important to think critically and question the information you're given.

Hi, post-production Dr. Volante here. I wanted to leave you with another clip from John Oliver, but due to copyright issues, I can't actually have it in the video. So I'll just recommend you go watch his video on scientific studies. I'm going to go ahead and leave a link for that in the video description, so just take a look at that whenever you get a chance.

So let's wrap up this week's topics. This week, we discussed Chapter 14. Specifically, we discussed the two major sections of this chapter: study replicability and why it's important for a study's findings to be replicable, and the importance of external and ecological validity, specifically in terms of generalization mode. I want to sincerely thank everyone for the hard work and effort they've put into this class. Keep working hard and pushing forward, and I'm confident everyone can have a strong and successful finish to their semester. See you in class!