Transcript for:
Effective Sampling and Survey Construction Strategies

Welcome to Module 4. We'll be describing sampling, both in a quantitative and a qualitative context, and also how these methods can be combined for mixed methods. We'll specifically be describing purposeful sampling. Following that, we'll detail how to construct a survey; the article that we read introduced a new synthesis for doing this. Lastly, we'll close with intersexual competition, where I introduce a scale that I created. For the discussion this week, you're going to be tearing that apart, taking into account what you learned in the article about survey construction, and seeing where the missed opportunities are and how it could be improved.

Before we get into it, we're going to define some concepts. There are many different types of validity. Internal validity relates to how well a study is conducted: how well can you say that there is a relationship between variable X and variable Y? Internal validity tends to be higher in experimental studies, because experimental studies tend to control for outside factors or other variables that might interfere with or impact the relationship. They also tend to use randomization. Because of that, experimental studies can show cause and effect. External validity, on the other hand, relates to how well the findings from the study apply to the real world, outside of the laboratory, beyond the survey that participants are taking.

Randomized controlled trials, or RCTs, are the gold standard for showing the effectiveness and side effects of drugs, treatments, and interventions. In this design, participants are randomly assigned to one of two groups. They're either assigned to the experimental group, where they receive the treatment, drug, or intervention that's being tested, or they're assigned to the control group. Participants who are assigned to the control group will receive a placebo. The placebo contains an inert substance: it might be a sugar pill, or if it's an injection, it might contain saline solution. It doesn't necessarily have to be a placebo, though; it could be whatever the conventional treatment actually is, and that can serve as a baseline. It would be really unethical if we didn't offer participants whatever the standard or conventional treatment is that they would get if they went to see a doctor, a clinician, or a psychiatrist. Randomized controlled trials tend to have really large samples. You need them to be really diverse or heterogeneous, and they need to represent the target population. If you're testing, for example, how effective vaccines are, you need to make sure that they work equally well in everybody, or to the extent that they don't, you need to be able to detect side effects, even if they're really small. Do keep in mind, though, that whenever you have a large sample, results tend to come out statistically significant, because significance is related to sample size. When you calculate the standard error, you're dividing by the square root of the sample size, so as sample sizes get larger and larger, almost any effect, even a tiny one, becomes significant. What we pay attention to is the effect size. Does a drug have a noticeable effect on the outcome? Is there a noticeable difference between the experimental group and the control group that would actually be important in real life? You can see whether the effect size is near zero, small, medium, or large. And when something does have an effect size that's important, you can say it has clinical significance.
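To make that last point concrete, here's a minimal Python sketch (my own illustration, not from the lecture or the readings) of why huge samples make even trivial differences statistically significant: as n grows, the p-value from a t-test collapses toward zero, while the effect size, here Cohen's d, stays tiny.

```python
# A minimal sketch (illustrative only) of significance versus effect size.
# With a tiny true difference between groups, the p-value shrinks as the
# sample grows, but Cohen's d stays the same. Requires numpy and scipy.
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)

def cohens_d(a, b):
    # Pooled-SD version of Cohen's d for two independent groups.
    pooled_sd = np.sqrt((a.var(ddof=1) + b.var(ddof=1)) / 2)
    return (a.mean() - b.mean()) / pooled_sd

for n in (50, 5_000, 500_000):
    treatment = rng.normal(loc=0.05, scale=1.0, size=n)  # tiny true effect
    control = rng.normal(loc=0.00, scale=1.0, size=n)
    t, p = stats.ttest_ind(treatment, control)
    print(f"n per group = {n:>7}: p = {p:.4f}, d = {cohens_d(treatment, control):.3f}")
```

Running something like this typically shows the p-value shrinking toward zero as the groups get enormous, even though a difference this small would never matter clinically.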
The goal with randomized controlled trials is to reduce extraneous variables, to reduce the possibility of confounds accounting for the result, and to attempt to reduce bias. A lot of randomized controlled trials tend to be double-blinded, meaning the researcher or doctor doesn't know whether the participant is getting the placebo or the experimental treatment. If the researcher or doctor is aware of who's getting the experimental treatment or placebo, they could interact differently with the participant, and there could be bias introduced because of that. Likewise, if a participant knows that they're getting the treatment, that could affect how much they believe the treatment is going to influence the outcome, and they could get a placebo effect. They could also just behave differently, and this can manifest in real physiological differences too. To the extent that you can, it's great to have double-blinded studies.

Recall from lecture one that, with respect to mixed methods, there are three different approaches. You can have a parallel approach, where you collect and analyze qualitative and quantitative data separately. You can have a sequential approach, where the quantitative and qualitative data are analyzed in a particular sequence. The concurrent approach combines qualitative and quantitative data: they're integrated, and you transform the data so you can merge them together, compare them, and analyze them at the same time.

Critical to running a study is collecting data from a sample. A population is the entire group that you want to draw conclusions about, and your target population can either be very small or very broad. Some topics might apply to everybody across the globe; other topics might be really confined. For example, if you're interested in how head injuries impact professional football players, your population is rather small. If you're interested in something like how money affects happiness, then your population can be quite broad, although, of course, this could differ from country to country. You want to make sure that your sample actually represents that population. The sample is the specific group of individuals that you'll actually collect the data from. The sampling frame is the actual list of individuals that the sample will be drawn from. Now, a lot of studies might not include this because you just might not have access to it; it's important for probabilistic sampling. You can imagine if you wanted to survey people at the university that you're in, so everybody at ASU, you could actually collect a roster of everybody's name, and that could be your sampling frame. Many studies do tend to have potentially bad samples, such that the results may be difficult to extrapolate to the target population they have in mind, much less a broader population. One problem with studies we've mentioned before is that there can be a self-selection bias. If a study is being advertised, certain people might read that advertisement and want to join your study. They might be interested in the topic, or if you're offering some sort of cash reward or gift card, they might join for that reason. The kind of person who opts into the study could be different from people who aren't interested in participating. Here's another example of a potentially bad sample. Imagine you're conducting a study on whether playing video games improves hand-eye coordination. You do want to make sure that your sample represents whatever your target population is.
Imagine that you or a researcher only recruited participants for the study by putting up flyers in or around gyms, or by posting online in forums dedicated to athletics. Unless you're specifically interested in athletes as your target population, that's a problem: athletes tend to be very coordinated, so the results of the study may not necessarily apply to the general population.

The study by Palinkas and colleagues was about purposeful sampling. The field that they were coming from was mental health services. They were concerned with evidence-based practices, innovative practices, treatments, interventions, and programs, not only how effective they might be, but also how to implement them. And they recognize that mixed methods offer certain advantages for examining these complex research topics about effectiveness and implementation. The question that they have, though, is how do you sample people for mixed methods research? I felt like this article had a lot of content, and a lot of the ideas were somewhat abstract; they didn't give a lot of concrete examples. I'm not necessarily saying I need to have a bunch of pictures in order to understand things, but I thought it would have been nicer if they had included some more examples. So if you weren't exactly drawn in, or you felt a little lost in the depths, I totally understand. I'm going to do my best to condense it down and provide some examples.

To begin with, there are seven general principles of sampling, and these apply whether you're conducting a quantitative study or a qualitative study. First, the sampling strategy should stem logically from the study's research question and conceptual framework. Second, the sample should generate a thorough database, a lot of knowledge, a lot of data on the phenomenon that is being studied. Third, the sample should at least allow the possibility of drawing clear inferences and credible explanations from the data. Fourth, the sampling strategy must be ethical. Fifth, the sampling plan should be feasible. Sixth, the sampling plan should allow the researcher to transfer or generalize the conclusions of the study to other settings or populations. Now, this is definitely easier to do if you run a quantitative study, but we have learned in the past certain techniques that qualitative studies can use to try to transfer their results to a broader population as well. Seventh, the sampling scheme should be as efficient as possible.

With respect to quantitative sampling, the goal tends to be to minimize bias in the sample. Desirable sample characteristics include being large and diverse; you want to make sure that you represent the target population as much as possible. Quantitative studies may use what's known as probabilistic sampling. Now, this isn't always the case, but quantitative studies have this option available to them, whereas qualitative studies don't. Quantitative approaches rely on hypothesis testing. You'll have the null hypothesis, where there's no effect or no relationship between two variables, versus the alternative hypothesis, which suggests there is an effect. And you'll run statistics to either reject the null hypothesis, in which case you support the alternative hypothesis, or you fail to reject the null hypothesis. When you fail to reject the null hypothesis, it doesn't necessarily mean that there isn't an effect; it means you didn't detect one. It could be that no effect exists, or it could be that an effect exists and you simply missed it.
Whether you decide to reject the null hypothesis or fail to reject it is largely dependent on p-values and also on confidence intervals. With null hypothesis testing, there are four different outcomes you can have: two are correct decisions, two are errors. The null hypothesis might actually be the true hypothesis; the real state of the world could be that there's no relationship between two variables. For example, it could be the case that there's no relationship between video game playing and violence. If you fail to reject the null hypothesis, that would be the correct decision. On the other hand, there could be some sort of effect there; there could be a relationship between video game playing and violence. If that's what you found in your study, and if the real state of the world is that it's true, then you would reject the null hypothesis and you would be accurate. In both of those cases, you would be right, just depending on the real state of the world. A type one error, on the other hand, would be if there really is no link between video game playing and violence, and you say that there is. That's an example of a false positive. Alternatively, it's possible that you conduct a study and there really is, in the real world, a link between playing video games and violence, but you fail to reject the null hypothesis; you say, we didn't find the effect. That would be a type two error. I thought this picture here illustrated this really nicely. The type one error, or false positive, is the example of the doctor telling the elderly gentleman, you're pregnant. The false negative is the very obviously pregnant woman being told that she is not pregnant. My dog, Jojo, is another example involving false negatives. He's a total food fiend, and whenever he hears the slightest crinkle of a bag or a box opening, he is right there. He wants the food. It is worse for him, in his mind, to miss the potential opportunity to have food, so he guards against false negatives by committing a lot of false positives. A lot of times, there are no treats, but he's there to check it out anyway, because to him, the worst error is missing a potential opportunity.

So null hypothesis testing tends to protect against false positives. We don't like to say there's an effect when there really isn't one. And that's why the significance level for the p-value is set at 0.05, meaning we only accept a 5% chance of finding a result like ours purely by chance when the null hypothesis is actually true. The stronger the relationship is between two variables, or the bigger the group difference is, the smaller the p-value will get. We can never completely rule out that you could have found the result just by chance, but it gets more and more unlikely. But there is a balance. We want to make sure that our studies also guard against type two errors, or false negatives. This means that you need a sample size that's large enough to be able to detect effects if they really exist. If you have a study that is what they call underpowered, and the sample size is too small, you might miss detecting an effect when there actually is one, and it's just a product of you not having enough people in your study to detect it. This is particularly true for smaller effect sizes, but you might even miss some moderate effect sizes if your sample size is really small. So say you have 10 people
in each group, and you're looking to see how some sort of intervention changes behavior. With a between-subjects design and only 10 people in each group, you might be underpowered to detect the effect, even if it's a moderate size. So studies particularly don't like false positives, but they also try to protect against type two errors by making sure that you have enough power to detect the effect, which means having a large enough sample.

Overall, the quantitative approach is about achieving breadth in the results. Whatever you find is largely about what people do on average, what the central tendencies are, what the median score is, what the modal score is. Basically, you want to be able to take whatever you found in your study and generalize it to a larger population. Quantitative approaches have the advantage of being able to use probabilistic sampling. Probabilistic sampling is when subjects of the target population have a known probability of being selected into the sample. Now, the probability might not always be equal, but the point is that the probability is actually known. This kind of sampling is used to generate what is known as a random sample, and whatever results you find from your study, you can generalize to a broader population. Now, that doesn't mean the results necessarily apply to every individual. That is kind of the con of the approach: it talks about central tendencies as opposed to actually looking at how people might differ, how individual differences matter. A very obvious example is that athletes probably are more coordinated than most people, and there would be a mean difference in coordination compared to the general population. But that doesn't mean there isn't a distribution of scores within those groups. And those distributions will overlap, perhaps only minimally, but more likely there's going to be a decent amount of overlap. So you could have some athletes who perhaps aren't any more coordinated than the general population. On the other hand, you can have folks from the general population who are just as coordinated as, if not more coordinated than, some athletes.

There are different types of probabilistic sampling. Simple random sampling is typically used when the population is relatively small. You can use something like a random number generator to pick participants. In this case, each participant has an equal chance of being selected for the study. You're unlikely to be able to do this with really large populations, however. In stratified random sampling, people are divided into groups, and then the researcher selects people from these strata or groups. You would do this when people come in natural groups. It could be stratified by age, by gender, by socioeconomic status. The example I give here is grade level at a university. You could have a university with 5,000 freshmen, 3,000 sophomores, 3,000 juniors, and 2,000 seniors. If a researcher wanted a representative sample of this university, not oversampling freshmen, for example, they could divide people up into these grades and then randomly select people from the different grades. So you could select 50 freshmen, 30 sophomores, 30 juniors, and 20 seniors. Systematic random sampling is where the researcher uses a random starting point and then selects every nth individual. The example I give: every 20th shopper's receipt has a link to a survey they can take about their customer service experience.
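Here's a minimal Python sketch of the stratified example just described; the student IDs are made up, and the group sizes match the hypothetical university above (5,000 / 3,000 / 3,000 / 2,000), so a proportional sample of 130 comes out to 50, 30, 30, and 20 per grade.

```python
# A minimal sketch of proportional stratified random sampling, using the
# made-up university numbers from above. The IDs are fake placeholders.
import random

random.seed(42)

# Hypothetical sampling frame: every student ID, grouped by class standing.
strata = {
    "freshman":  [f"FR{i:04d}" for i in range(5000)],
    "sophomore": [f"SO{i:04d}" for i in range(3000)],
    "junior":    [f"JR{i:04d}" for i in range(3000)],
    "senior":    [f"SR{i:04d}" for i in range(2000)],
}

total_sample = 130
population = sum(len(ids) for ids in strata.values())

sample = {}
for grade, ids in strata.items():
    # Each stratum is sampled in proportion to its share of the population.
    k = round(total_sample * len(ids) / population)
    sample[grade] = random.sample(ids, k)

for grade, chosen in sample.items():
    print(grade, len(chosen))  # freshman 50, sophomore 30, junior 30, senior 20
```

The same frame could be reused for simple random sampling (random.sample over all the IDs at once) or systematic sampling (every 20th ID after a random start).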
Cluster sampling is similar to stratified random sampling, but the groups that you make aren't stratified or divided by natural groupings. Instead, the groups themselves should look like the target population, so each group should be really similar to the others. Then the researcher randomly selects amongst these clusters, so not every cluster gets selected. As I mentioned before, I don't want you to have the impression that all quantitative studies are probabilistic. Many of them just don't have the capacity to do that: the target population is either too large, or, as in psychology studies, a lot of them are run at universities using freshmen, which may or may not be a great group. You'll see in my paper that we did that too. That's what's known as convenience sampling, where the researcher goes out and collects data from people who are nearby or accessible. That can bias the sample, of course, so it's not an ideal strategy. Another technique that quantitative and qualitative studies have in common, and one I've used in the past, is snowball sampling. This is where a participant in the study or an informant recommends other people that the researcher can contact. This can be a really good method, especially if you're interested in having other people who are similar to the participant, or maybe you don't want a lot of variation. Maybe you're interested in a really niche group of people. If it's a very small population or a population that's hard to reach, this method can be invaluable.

With respect to qualitative sampling, sample sizes tend to be a lot smaller. For in-depth one-on-one interviews, a lot of times you're going to have a sample size of three to six, and the researcher is going to go back to these participants and have multiple interviews. With grounded theory, the sample size tends to be a little bit larger; they may have 10 to 20 participants. It's also possible to conduct qualitative studies on just a single participant. The sample, of course, is not random. The researcher selects information-rich cases, and the emphasis is on saturation: the researcher continues to sample until no new information is obtained. The goal of qualitative studies is to obtain depth of understanding. Qualitative studies make use of purposeful sampling strategies. Purposeful sampling is used to identify and select information-rich cases for the most effective use of limited resources in qualitative studies. There are various types; you'll see there's a table in the article that goes through a number of them. Researchers go out and pick information-rich cases. This is not probabilistic. The researcher picks people based on their understanding of the phenomenon and the goals that they would like to reach. What they generally tend to do is seek informants or participants who are particularly knowledgeable about or experienced with the phenomenon. Beyond that, you have to make sure that you select people who can also articulate their experience. An advantage of purposeful sampling is that it tends to be cost- and time-effective: while analyzing data for qualitative studies tends to be a real time sink, sampling for it is not so bad. A con is the potential for selection or sampling bias to occur.
Obviously, the researcher has their preconceived notions about what they're trying to find, and they may select cases that tend to confirm whatever it is they're trying to show, or they select people who are similar to them, or some other sort of bias creeps in when they're selecting people. Additionally, anybody who opts into the study might also differ from people who choose not to. The gist is, quantitative research tends toward breadth: you can generalize the findings to a larger population. Qualitative studies, on the other hand, tend to show more depth: information-rich, detailed descriptions and personal experiences.

As we've just shown, qualitative and quantitative studies are often contrasted based on depth and breadth, and in general this is true. But obviously we need both to be able to understand phenomena; we need both depth and breadth to get a comprehensive understanding of whatever it is we're researching. And while what I said is generally true, with either approach there is usually an attempt to achieve at least some of both. With quantitative studies, you can look at individual differences and try to find variables that might moderate the effect. For example, does a treatment affect men and women differently? With qualitative data, an example is grounded theory and theoretical saturation, where you're trying to ensure that all aspects of the phenomenon are included, so there's a bit of breadth in there as well. Any one aspect of the phenomenon needs to be thoroughly examined, so depth really applies here, but there is a degree of trying to get some breadth in qualitative research too.

Mixed method approaches can be used to expand on or explain results obtained from one method with another, which enables researchers to broaden their investigation. They can also complement the results: you can combine the two methods to produce a holistic picture that allows for divergent views. Mixed method approaches can also be used for development, where the results of one method are used to construct questions, questionnaires, or conceptual models. Having both quantitative and qualitative data also allows for triangulation, or convergence and confirmation of results. Likewise, with initiation, the results might diverge in interesting ways; you might find contradictory perspectives that you wouldn't necessarily have thought were there.

Sampling for quantitative studies is really well defined, but it's pretty poorly defined for qualitative studies, and this remains true for mixed methods. There are no clear guidelines for conducting purposeful sampling in mixed methods implementation studies. Researchers using mixed methods face a number of challenges. Most obviously, if you're running a quantitative study, you're going to start off with a large sample and a very broad base of people in order to analyze a phenomenon. But then when you move to the qualitative study, that sample is too large to work with, and you might have a lot of people in there who aren't very experienced with the phenomenon you're interested in, not in a way that would support in-depth one-on-one interviews. The informant quality won't be as good as if you actually sought out people who were knowledgeable, experienced, and articulate. On the other hand, if you start out with a qualitative study, you start out with a small sample of people who are really informed about a particular issue.
They're going to be unique in their own ways, and the sample size is just too small to conduct statistics on. There are various types of purposeful sampling. The article noted that criterion sampling was the most commonly used method among the articles they reviewed; out of 28 articles, I believe about 72% used criterion sampling. And what is criterion sampling? This is where a participant must meet specific criteria in order to be included in the study. For example, within the clinical field, a criterion for taking part in interviews for a specific study might be that you have to be a program director, supervisor, or clinician. To some extent, this makes sense: these are people who are using treatments and thinking about how to implement them. But on the other hand, it can come with some drawbacks. The article asks specifically: is criterion sampling actually adequate for capturing the breadth and depth of the phenomenon? They come back with, well, maybe not; there could be better ways to run studies like this. Criterion sampling might have some limitations that make it inadequate. For example, by focusing only on program directors, it eliminates other potential groups of people who might also play a role. Family members, staff, and policymakers may have insights into a specific phenomenon, into how a certain treatment is implemented and how they use it. If you're not getting their point of view, you're missing part of the picture; there's a puzzle piece that's missing. This can limit the breadth of the results. On the other hand, you might also fail to include the most knowledgeable, experienced, or articulate individuals, which could limit the depth.

Be sure to look at table one for all the different types of purposeful sampling strategies. What type of purposeful sampling you use depends on whether you want to emphasize variation, or breadth, or whether you want to emphasize similarity, or depth. Strategies that tend to emphasize similarity include homogeneous cases, typical case, snowballing, and, this one might be a little bit tricky, extreme or deviant cases. So what are these, and how do they emphasize similarity? Homogeneous case sampling is about reducing variation: what you're doing is looking at a group of people who are all very similar in a particular aspect. Typical case sampling is used to illustrate or highlight what tends to be normal, average, or typical. A clinician might provide an in-depth profile of how the program or intervention typically affects people and the kind of results that you could expect to see. We've already mentioned what snowball sampling is; obviously the goal there is to get people who are very similar. Extreme or deviant case sampling is about learning from unusual manifestations of the phenomenon. This might sound like a strategy for diversity, but really what you're interested in is people at the extreme ends, so within each extreme you're actually sampling people who are really similar to each other. In the clinical setting, you could focus on clinicians who have really high success rates with implementing a treatment, as well as those who have notable failures. You can use extreme case sampling if you want to learn lessons about unusual conditions or extreme outcomes, and this can actually help you better understand the phenomenon and how it might apply to normal or more typical cases or people.
For example, in the early days of AIDS research, when HIV infections almost always resulted in death, a small number of people actually appeared to do quite well and be fine even though they were infected with HIV; they didn't develop AIDS. Studying them became really crucial to understanding how researchers could translate this into combating AIDS.

Strategies that emphasize diversity include maximum variation sampling, intensity sampling, and confirming or disconfirming case sampling. Maximum variation, as the name implies, is used to capture a wide range of perspectives relating to whatever it is you're interested in studying. Not only do you document the ends of the spectrum, but you also get people who are in between. Intensity sampling is a lot like extreme or deviant sampling, but there's less emphasis on the extremes; the researcher tries to identify individuals in whom the phenomenon of interest is strongly represented. Take this article on how general practitioners persuade parents to vaccinate their children. It was conducted in 2001, and it's about general immunization, so measles, mumps, rubella, tetanus. The sampling technique originally used for the study was typical case sampling, but the author found that by doing that, they had a really poor response rate from GPs, who would often say that they didn't have time or weren't interested in participating in the study. When typical case sampling proved not to be that successful, the author engaged in intensity sampling: they purposely found GPs who had a strong opinion on immunization. You can think about why intensity sampling might have been a good strategy in this study when you consider what they did and what their end goals were. In the study, the researcher presented the general practitioners with two scenarios. One scenario was about a parent who was considering delaying immunizing their child. The second scenario was about a parent who was just outright refusing vaccines. In the interviews, the researcher role-played the part of a parent. They had a script that they stuck to, and they had a character description; there was a particular way that they had to act, and the GP was aware of this, of course. This was role-playing. GPs were asked to respond to the researcher as if this were a normal encounter or conversation with an actual patient. The results were fairly interesting: the GPs tended to adopt the role of persuader rather than informer, and the themes that came out of the data were characterized as either helpful or unhelpful communication. You can see examples of what they got from the study. You can imagine that having GPs with a strong opinion on the subject would influence how they communicate this information, the sort of tactics they use to try to persuade parents to immunize their child, the potentially good ways they can do this, and how it can potentially go wrong.

Confirming or disconfirming case sampling is also really important. Once trends are identified, you can both seek confirmation and deliberately seek cases that counter the trend. These are people who are exceptions to the rule, your black swans, as Karl Popper would point out. Disconfirming cases test and highlight the boundaries of the finding, while confirming cases provide deeper insight into your preliminary findings.
The article that I found was Conflicting Discourses in Qualitative Research: The Search for Divergent Data Within Cases. Note that this is similar to disconfirming case sampling, but instead of resampling people, the authors looked back at their field notes and made reference to previous literature to see what sort of disconfirming information they had already collected that went against the trends they were expecting to find and against what the previous literature said. Their sample was African-American women. These women were originally discussing the food choices they were making, but the conversation commonly went in the direction of body image. The authors thought this was interesting, so they did start to ask people about it. And what they found was a conflict between satisfaction and dissatisfaction with body image. This was related to societal pressures, which tend to emphasize thinner bodies, as well as ethnic ideals, which tend to emphasize not necessarily thinner bodies but different kinds of body shapes, not to mention people's own preferences for how they like themselves. I have two examples here of the narratives that Marie and Nene gave; you can read through those. The point isn't really the results of the study, but how the researchers found information that was discordant or divergent and further investigated it, which brought about new insights. They said that this was an important discovery, given that it challenges much of the literature on body image for African-American women. There are certain strategies, like stratified purposeful sampling or opportunistic or emergent sampling, that are designed to achieve both of these goals.

The article also describes alternatives to randomized controlled trials, or RCTs. Generally speaking, RCTs are the gold standard, and they're the gold standard for a reason, especially for showing cause and effect, detecting small effect sizes, and being able to detect side effects. But the article did highlight some interesting study designs that I thought are really good to know, designs that can look at individual differences as opposed to just central tendencies. One of these is what's known as an interrupted time series design. This is where you collect multiple data points before a treatment or intervention is given, so you can see what somebody's baseline is. Maybe you take 10 measurements before the treatment is given, and you can find the average of that baseline. Imagine you're interested in depression. You can have somebody rate their depression level on a scale, or have a questionnaire where they get a score. You can see how that fluctuates over time a bit, but you can get an average for it. After that, you give them the treatment, take multiple measurements following the treatment, get an average for that, and compare. You can see here with these two different graphs, this is made-up data; it's just illustrating what a significant difference would look like versus one where the treatment had no significant effect. When you're doing this kind of design, since you have multiple measurements, you don't necessarily need as big of a sample size. And it's also within subjects, so you're controlling for some individual differences as well. Another potential alternative to RCTs is the reversal design, also known as the ABA design. This is where the dependent variable is measured at baseline, again taking multiple measurements.
Then you give a treatment and you take multiple measurements, but after that you take away the treatment and see what the baseline is. Now, a lot of times after the treatment is taken away, people revert back to their normal baseline. But they may not revert at all, or they may revert only a little bit. So designs like this are really good for showing whether a treatment has a lasting effect and how long that effect actually lasts. The example that I show here: if you have kids who are really disruptive in the classroom, what is the best way to help mitigate that? This is a hypothetical scenario; it's just meant to illustrate what a significant finding would look like. When the student was praised for their good behavior, their disruptive behavior declined. But we can see that once the praise stopped, it basically went back up to, or pretty close to, the pretreatment level. Another method is known as multiple baseline. In this case, you have multiple people, and the baseline is only measured once for each person, with no reversal; the key is that the treatment starts at different times for different people. You can imagine that if you wanted a treatment for the hand-washing compulsion in obsessive-compulsive disorder, you could measure people's rates of hand washing before the treatment is applied and then measure it afterwards, but you really want the treatment to start at different times for the different individuals. It's possible that people just change over time, or that behavior tends to shift at different times of year, or that people's behavior changes just because they know they're part of a clinical trial. Sometimes just knowing that you're going to receive help can have a bit of a placebo effect on people; they know they're being watched or recorded, and their behavior changes a bit. If you can see that the treatment reduces the behavior for each individual only when that individual's treatment starts, then you can be more confident that the treatment actually does have a causal effect in, say, reducing the number of times participants wash their hands.

Another approach is a multi-stage strategy for purposeful sampling. Here you alternate between sampling for variation and sampling for similarity, and you may have several phases of this. Generally speaking, RCTs give an average of how well the treatment works. Maybe the clinician is instead interested in sampling people who represent one end of the spectrum: people who had extreme success with the treatment, or people for whom it basically failed. If that's the case, maybe you sample with extreme or intensity sampling, and that could provide valuable insights. Once you've done this, you may change your sampling strategy: after you identify the extreme cases, you could then sample for homogeneity within the specific parameter you're looking at in the second stage. You're going to take a funnel approach. This is where your qualitative methods start out seeking people or sampling for breadth and variation, and then you hone in and go more for similarity or depth. This is especially true for semi-structured interviews and focus groups. Opportunistic or emergent sampling is where a researcher is able to add to the sample by taking advantage of unforeseen opportunities after data collection has begun. Imagine you find a new direction in your research, or you realize that other people might make good informants; you're going to take advantage of that. That's a really good thing to do.
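Circling back to the interrupted time series and reversal (ABA) designs described above: the basic analysis is just comparing the average of repeated measurements within each phase. Here's a minimal Python sketch with made-up scores, not data from any real study.

```python
# A minimal sketch of the phase comparison behind an interrupted time series /
# reversal (ABA) design. The depression-style scores below are made up.
from statistics import mean

baseline   = [22, 24, 23, 25, 22, 24, 23, 25, 24, 23]  # A: before treatment
treatment  = [18, 16, 15, 14, 15, 13, 14, 13, 12, 13]  # B: during treatment
withdrawal = [20, 21, 22, 22, 23, 22, 23, 24, 23, 24]  # A: treatment removed

for phase, scores in [("baseline", baseline),
                      ("treatment", treatment),
                      ("withdrawal", withdrawal)]:
    print(f"{phase:>10}: mean = {mean(scores):.1f}")

# If scores drop during treatment and climb back toward baseline once it is
# withdrawn (as in these invented numbers), that within-person pattern is
# evidence the treatment, rather than some outside change, drove the effect.
```

For a multiple baseline design, you would track several people this way and stagger when each person's treatment phase begins.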
The other type of design they described was hybrid designs. Hybrid designs look at both the effectiveness and the implementation of a treatment. Again, this article was really about the clinical field and how to design studies that look at both how effective treatments are and how they're implemented, and hybrid designs do exactly this. There are three different types. Hybrid 1 designs prioritize the effectiveness of a treatment; while data is also being collected on how the treatment is implemented, the priority really is whether the treatment itself is effective. Hybrid 2 designs give equal priority to the effectiveness and the implementation of a treatment, drug, or intervention. Hybrid 3 designs prioritize implementation over effectiveness. So imagine a treatment has already been shown to be effective in the lab. You might collect some more data on that, but you really want to see how it is being rolled out. How do we actually implement this policy? Why do some people succeed or fail? And hybrid designs, of course, can use purposeful sampling and multi-stage sampling. A quantitative study will typically look at the effectiveness of the drug, and the qualitative data is going to expand on the results. Generally speaking, when you talk about the implementation of a treatment, a lot of the time it's going to be qualitative data. If the effectiveness trial finds substantial variation in success rates, you can imagine what kind of sampling you might use to follow that up in your qualitative study: you're going to want to look at diversity, at variation and breadth. So you might choose maximum variation sampling, and you might also look at confirming and disconfirming cases. If, alternatively, your quantitative study shows that there's not a lot of variation, your follow-up qualitative study on implementation might look at depth instead; you might engage in typical case sampling or homogeneous sampling.

Just summing up the conclusions: researchers need to clearly describe their sampling strategies and provide the rationale for the sampling they're doing. Are they going for variation or similarity, breadth or depth? They need to stick to the seven principles of sampling, whether the study is quantitative or qualitative. A multi-stage approach to purposeful sampling should start broad, emphasizing variation, and then narrow, just like a funnel. Often, probability sampling is the preferred strategy for quantitative research, so the selection of a single- or multi-stage purposeful sampling strategy afterwards should be based in part on what you find in your quantitative study. Quantitative and qualitative studies tend to differ in how well they capture breadth and depth; however, both strategies ought to capture elements of both. They don't always do this, but in general they strive to, at least a little. Both of these elements are needed to have a comprehensive picture of what's going on and to generate new knowledge. This is why you use mixed methods and hybrid or multi-stage designs, because it's important to be able to get at both of these concepts. Moving on from sampling, we're going to be talking about survey development, and with the paper that you're going to read after that, you're going to look at the gender-neutral intersexual competition scale and apply these principles to it.
Another instance of déjà vu: recall from lecture two the conceptual definition versus the operational definition of a construct. The conceptual definition outlines what a construct means. You can think of this as being like a textbook definition. What does it mean to be angry? What does it mean to be happy? How are you defining those things? These conceptual definitions don't explain how you're going to measure the concept. The operational definition, on the other hand, does just that: it describes how you're measuring whatever construct you're interested in. The example that I'm going to give is jealousy. What is jealousy? It's a complex of thoughts, feelings, and behaviors that follows threats either to your self-esteem or, what we're going to be looking at, threats to your existing relationship or the quality of your relationship, and those threats lead to negative emotions. These could be threats from a real romantic rival, from someone who is merely perceived as a rival, or from someone who's completely imaginary; it can all be in your own head. The emotions it triggers are usually negatively valenced and can include elements of fear, anger, sadness, and anxiety. It can be long-lasting. Sure, it could be fleeting, but it can also last for a really long time. Some of it is instinctual, while other parts of it you can elaborate on cognitively.

There are various ways to operationalize a measurement. What we're really interested in for this segment is surveys, so that's what we're going to look at. Here's a study that measured romantic jealousy. Importantly, they actually did collect measures of reliability and validity, and you can see there are three different components of romantic jealousy: a cognitive component, an emotional component, and a behavioral component, with seven to eight items in each of these subscales. For the cognitive component, participants were asked to indicate how often they have certain thoughts about their partner. For example: I suspect that X, whatever your partner's name is, is secretly seeing someone of the opposite sex. Obviously, this questionnaire could be tailored to whomever you're attracted to, whatever your romantic orientation is. For emotional jealousy, participants were asked to consider their emotional reactions to various situations, imagining those scenarios and indicating how jealous they would be. An example: X comments to you on how good-looking a particular member of the opposite sex is. The behavioral component is about actual behavior: participants are asked how often they actually engage in specific behaviors. An example item is: I look through X's drawers, handbags, or pockets. They probably should have another item on here about checking up on a partner's social media or looking at their DMs. Later, we're going to see how sometimes you might want to adapt or modify scales: you like the basic gist, but every once in a while there's an item you want to toss out, or you want to add an item or modify one in some way. This would be a case where maybe you want to do that, as technology has changed; some of these items may not apply as much, or there are new items you'd like to have. What they found is that these scales were pretty reliable. You can see that the Cronbach's alphas for the various subscales were all above 0.7, and two of them were above 0.8, so the scale has adequate reliability.
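Since Cronbach's alpha keeps coming up as the reliability statistic for these subscales, here's a minimal Python sketch of how it's computed from item responses. The responses are made up, not the jealousy study's data; the formula is the standard one, alpha = k/(k-1) * (1 - sum of item variances / variance of the total score).

```python
# A minimal sketch of Cronbach's alpha for a small made-up dataset:
# 6 respondents answering 4 items on a 7-point scale.
import numpy as np

responses = np.array([
    [5, 6, 5, 6],
    [2, 3, 2, 2],
    [7, 6, 7, 6],
    [4, 4, 5, 4],
    [3, 2, 3, 3],
    [6, 6, 5, 7],
])  # rows = respondents, columns = items

def cronbach_alpha(items: np.ndarray) -> float:
    k = items.shape[1]                         # number of items
    item_vars = items.var(axis=0, ddof=1)      # variance of each item
    total_var = items.sum(axis=1).var(ddof=1)  # variance of the summed scale
    return (k / (k - 1)) * (1 - item_vars.sum() / total_var)

print(f"alpha = {cronbach_alpha(responses):.2f}")  # values above ~0.7 are usually called adequate
```

This should match what standard reliability routines report; the sketch is just to show what the number is made of.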
I bring up jealousy because I want you to keep it in mind when you're thinking about my scale: do you think my scale measures jealousy or something different? How would you test that? The second article was Measure Twice, Cut Down Error: A Process for Enhancing the Validity of Survey Scales. I really enjoyed it because it gave concrete examples. As somebody who's more quantitatively focused, I really appreciated the examples and the highlighting of best practices for survey development. It was fun to see how my approach maybe differs from the best practices. They suggest that many new scales violate best practices for survey design. Why is this the case? The causes are unknown, but it may be that grad students don't really learn the best practices. All the information, they say, is just sequestered in scholarly journals, and the word doesn't really get out. It could be that people are stuck in their ways and old habits die hard. I kind of liken it to opinions and podcasts: almost everybody has one, or could create one, but that doesn't mean it's necessarily good. People like to see their name on something. I'm probably not immune to that. I think it has a lot to do with that. Even when existing scales are available, it's not unusual to see researchers coming up with their own. In this paper, the authors synthesize a new approach: they give six steps for developing scale items that gets input from both academics and potential respondents, and they explain why this is valuable.

Recall from lecture two that too few researchers consider the validity or reliability of their measures. If you say something is a validated scale, that's technically a misnomer, because the scale doesn't have the property of being valid or reliable. It can maybe produce valid and reliable results, but that may only be true in the sample or population it was intended for. If you go somewhere else with the scale, the results may not be as valid or reliable there. An instrument can produce valid and reliable results, but the instrument itself is not inherently valid or reliable. And recall from lecture three that many newly created scales fail to report validity and reliability. The proportion that reported reliability was 50%; the proportion that reported validity was only 37%. This is for new scales. That's appallingly bad. I don't know why anybody would do that. Clearly, with scale development, it's important that researchers conceptually define the variable of interest as well as adequately operationalize it. But scale development is tricky. Is your definition correct or complete? It may not be. You may not have defined your construct very well, or your measurement may not capture all aspects of that construct. Perhaps the most important question: are you measuring what you think you're measuring? Validity is critical. Reliability is important, and all good scales should have some form of reliability, but if a scale isn't valid, it's trash. So let's assume you have a scale that's at least somewhat valid. Does your scale produce consistent measurements? If the trait doesn't change that much over time, and I take a scale measuring that trait today, will I get the same score tomorrow, or a year from now, depending on how much that trait tends to fluctuate? Or another way to think about it: say you and I are the same on introversion in the real world, so we should have the same score. But the test gives me a score saying I'm very introverted, while for you it says you're just introverted.
That may not be right, especially if in the real world we're the same. A good scale also needs to minimize measurement error. You don't want errors where somebody misunderstands the question or the response scale isn't good. There are a million ways to have measurement error, but you want your scale to minimize it as much as possible. Critically, do participants understand the survey items in the intended manner? You may have a way of expressing what you're trying to get at, but participants may not understand it that way. You may not be clear, you may be using jargon, and the academic versus everyday colloquial use of specific terms, or the understanding of certain things, can just be different.

How do most researchers create a scale? Well, there's a general template that they tend to follow. The first step is to clearly define the construct in question. Then you consult the literature and decide whether a new scale is even needed. Are there existing scales available? Do you think they get at the construct adequately? Maybe you have a different conceptual definition of what you're trying to measure. From there, you develop an item pool, and this actually should be overly inclusive: whatever you think of, you write it down. You're going to err on the side of, say, false positives; you're going to have items on there that you think get at the construct when in reality they may not. From there, you select an appropriate response format for the items. Are you going to use a Likert scale or something different? You have to be very careful about how you formulate the response scale. You conduct several iterations of pilot testing, and from there you use statistics to remove problematic items. You can look at Cronbach's alpha, reliability measures, factor analysis, and other forms of psychometrics. With this approach, there's less focus on developing the items compared to actually selecting the items. You do go out and develop a pool of items, but once you've come up with that, maybe on the basis of your own judgment and face validity, it basically stops there, and then it's about narrowing those items down.

But there is a better way. This relies less on psychometrics and more on underutilized techniques that involve collaboration. Input is gained from both experts and potential participants, as they can both offer valuable insights. The focus should be on validity during the development of the items as opposed to the selection of the items. Doing so potentially leads to greater efficiency: front-loading the task of validity may lead to shorter scales and more efficient pilot testing. This new synthesis involves six different steps. Step one is similar to the general template: it's a literature review. This enables you to precisely define the construct, which is critical. Typically, you're going to go with what's in the research literature, though it is possible that you redefine the concept. After that, you're going to identify how existing measures of the construct, or maybe even related constructs, tend to measure whatever it is you're interested in. Even a questionnaire that may not look so good overall could have a few items on it that are useful. A consideration you need to make is how your measurement will overlap with existing scales.
If you find that a scale is good overall, maybe what you do is just modify one or two items. The article gave an example here: the authors changed the wording of an item on a scale that already existed. The teacher-student relationship questionnaire they were working from had an item that I probably don't have to explain why it's awkward: I share an affectionate, warm relationship with this child. This is a teacher writing about a student. The authors of the article obviously revised this statement.

Step two is getting feedback from potential respondents, either in one-on-one interviews or through focus groups. You'll want to see whether your definition, and the items you select as being important, actually reflect what potential respondents think; there can be discrepancies there. Does your definition match how potential participants think about the construct? Are you using the appropriate terminology? There are important considerations here. Whoever you're interviewing, or whoever is involved in the focus group, you need to make sure this isn't just a convenience sample; the people involved should look like the target population you're interested in. Researchers also need to kind of close their mouths a little bit and let participants talk. They need to know how participants think and feel about the construct in their own words, with very little prompting. Then, once the unprompted data is collected, the researcher can poke and prod and ask more direct questions. For example, with the teacher-student relationship scale, after the unprompted data was collected, the researcher asked the group: I'd like you to think about and describe what it means to have a good, positive relationship with a teacher. What does it look like? Perhaps participants hadn't talked about this on their own, so the researcher prompted them to. You get more concrete examples as well, which is great: they asked participants specifically to talk about a teacher they had a particularly good relationship with and what that teacher's qualities were. Another thing they did that I thought was really interesting is they used cards to rank concepts. You could have a bunch of concepts that you're thinking about including as survey items, and you can have participants rank these concepts as being extremely important or essential to what you're trying to measure, somewhat important, or not important at all. Sometimes the results might surprise you.

Step three is synthesizing the literature review with the interview or focus group data. Sometimes the results will be pretty similar, sometimes they'll be different. If there's agreement, then merging the data is really easy, and you can use the colloquial terms that participants come up with. If there is some sort of discrepancy, you want to make sure you're using the right terminology. If there are differences, you're not going to throw out or modify questions right away; you're going to retain both versions until a later stage. An example with the teacher-student relationship scale: the student participants, as well as the research literature, described the importance of having high expectations, that students want their teacher to have high expectations of them. Teachers, on the other hand, expressed the need for realistic expectations, and the authors went on to change the items about having high expectations based on this. Step four is developing the items.
As I mentioned before, coming out of step three you might have items from the literature review as well as from the focus groups or interviews with potential respondents. Here's where you're going to try to integrate them, change them, and see which ones are more adequate representations of the construct. There are various challenges in constructing scales. One is just how many items you're going to include. Generally speaking, you don't want your scales to be really long; people can get bored, and that can affect how the results come out. Sometimes some scales do need to be a little longer than others. We've seen before with self-esteem that there's a 10-item measure, and then they actually created a single-item measure. Shorter can sometimes be better, but not all scales have that luxury. If you want a scale that has, say, eight items in the final version, it's best to keep around 15 items before you start doing your rigorous tests on it. So you want to be a little more inclusive than what you're going to end up with. An example from the teacher-student relationship scale: they noticed that there were three separate items addressing teachers' enthusiasm, warmth, and communication. Instead of having three separate items, they decided to combine these into a single item about encouragement. Again, that makes the scale a bit shorter and a bit more efficient.

Still on step four, developing the items: wording is critical, and Steven Pinker and his ideas are creeping in here again. You want to be simple, concise, and clear. And there are various things that you definitely need to avoid. One is the use of jargon. You want to use words that the general population is going to know. You want the reading level not necessarily to be at the college level, but maybe at the seventh-grade level. Writers who use jargon they assume is widespread will only be understood within a sub-sub-sub-specialty, and they won't give the concrete detail needed to visualize what they're describing, because it's already so clear to them. An example would be: the product helped me meet my OKRs. The first time I heard this term, I didn't know what it meant. Other people probably aren't going to know what it means either, especially if they don't work in a business environment. Rephrase that sentence to be: the product helped me meet my goals and responsibilities. Avoid leading questions. An example would be: how great is our hardworking customer service team? Why is this a leading question? Because it's telling you that you already have a hardworking customer service team; you're already viewing them in a positive light. Adjectives that are overly positive or negative frame things in a biased way. You also want to avoid loaded questions. An example would be: what problems do you have with the new version of the software? Some people confuse loaded questions with leading questions. A loaded question has some sort of assumption already built into it. This question is assuming that you have problems; it's biased in that respect. You may not actually have any problems with it, but however you respond, you're answering a question based on the assumption that you have some sort of issue. You want to avoid double-barreled questions. This is where a question has more than one question embedded in it. Sometimes you'll have two questions, but sometimes you'll see things that are triple- or quadruple-barreled. An example would be: was the product easy to navigate and aesthetically pleasing?
Things like ease of navigation and aesthetic appeal might be related, but not everything that's pleasing to the eye is easy to use or navigate. A classic example is Norman doors, doors that may look sleek but give no clue whether to push or pull. When it comes to a product or a website, just because it looks good doesn't mean it's easy to navigate. So you really need to break that question up into two separate questions. One of my favorites: double negatives. These can be extremely difficult to understand, and you might have participants respond in a way that doesn't actually reflect their true feelings or opinions. You also want to make sure you use appropriate and unbiased terminology. You can look at the APA page, which lists terminology you can use when referring to specific groups of people. If you're working with a very specific or niche group, they may have specific terminology that they understand or prefer to use as well. An example that Dr. Von Becker, who does research on polyamory, brought up is that you don't want to use the term swingers to refer to people in the polyamorous community, or what they sometimes refer to as ethical non-monogamy.

Most people who construct surveys tend to be aware of those issues, even though they don't always develop the best questionnaires. This article suggests new best practices, which you can see in table four. Some of these I'm a little bit guilty of, not going to lie. Some of them I'm not 100% sure I actually agree with; I think it can depend on the type of scale you're developing. But overall, they're worth consideration, and if empirical evidence indicates that some of these practices are indeed better than other ways of doing things, it's definitely good to follow them. One of their suggestions is avoiding reverse-scored items. Traditionally, reverse scoring tends to be seen as a good thing. An example of a reverse-scored pair would be "I consider myself outgoing" and its reverse, "I consider myself introverted." If you're calculating a scale for how extroverted somebody is, the more introverted or non-outgoing item would have to be reverse scored. Part of the reason researchers do this is the idea that you want to keep participants honest, keep them alert, and make sure they're answering questions consistently. But it's possible that the reverse of an item is not the exact opposite of the original item. People can interpret them differently, and people sometimes answer at the far end of the scale for one of those items but not the other. I have seen this in practice when analyzing data and conducting reliability analysis: the items that are reverse scored tend to lower the reliability of the whole scale. (There's a small sketch of reverse scoring and a quick reliability check a little further down.) Another of their suggestions is that response anchors should have at least five or seven points. I generally follow this rule; in fact, my preference tends to be for seven-point Likert scales. The authors suggest that you can use a five-point scale when it's unipolar, so when it conceptually ranges from zero to infinity, and a seven-point scale when it's bipolar, when the scale conceivably ranges from negative infinity to positive infinity. Now, while I tend to like five-point and especially seven-point scales, I think there are cases in which you could have fewer points than that. You could have a four-point scale or a forced-choice two-point scale. A good example is the social desirability scale, where you're trying to see how much people respond to items on a survey in a way that makes them look good.
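Here is the sketch mentioned above of what reverse scoring and a quick reliability check can look like in practice. The responses are made up, and Cronbach's alpha is computed by hand from its standard formula rather than with any particular survey package, so treat this as an illustration only.

```python
import numpy as np

def cronbach_alpha(items: np.ndarray) -> float:
    """Cronbach's alpha for an (n_respondents, n_items) array."""
    item_vars = items.var(axis=0, ddof=1)
    total_var = items.sum(axis=1).var(ddof=1)
    k = items.shape[1]
    return (k / (k - 1)) * (1 - item_vars.sum() / total_var)

# Hypothetical 7-point responses to three extraversion items; the third item
# ("I consider myself introverted") is worded in the reverse direction.
raw = np.array([
    [7, 6, 2],
    [6, 7, 1],
    [2, 3, 6],
    [5, 5, 3],
    [3, 2, 5],
], dtype=float)

scored = raw.copy()
scored[:, 2] = 8 - scored[:, 2]   # reverse score on a 1-7 scale: (max + min) - x

# Without the reversal, the third item works against the others and alpha
# collapses (it can even come out negative); with the reversal it recovers.
print("alpha without reverse scoring:", round(cronbach_alpha(raw), 2))
print("alpha with reverse scoring:   ", round(cronbach_alpha(scored), 2))
```

The reversal formula (max plus min, minus the response) simply flips the item's direction so that all items point the same way before you sum them into a scale score.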
On the social desirability scale, respondents are given the option of just true or false. Example items include, "I sometimes feel resentful when I don't get my way," and, "No matter who I'm talking to, I'm always a good listener." Nobody is always a good listener, so people who respond in the affirmative to that item are probably not being completely truthful and are probably answering in a socially desirable way. Having this forced-choice element is definitely helpful for this scale. Many of you have probably heard of the trolley problem. This is a clear case where having a forced choice is necessary, because it's a difficult decision for people to make, and what you're trying to see is what people do when they're essentially forced to decide. There are many different versions of the trolley problem. The most basic one is where a trolley is going down the track, and if it continues, it's going to hit five people who are on the track. They can't move, the trolley is coming, there's nothing they can do about it, and there's nobody else around to help them. All you can do is pull a switch and divert the trolley down another track. That would be an easy decision to make if that were the end of it. But the problem is that there's somebody else on the track the trolley would be diverted to, a single person. So if you divert the trolley, you save five people, but you're going to end up killing one person. Generally speaking, most people are able to make this decision fairly quickly in a forced-choice scenario. The problem gets a little harder with other versions of the trolley problem. For example, it might be that you don't have a switch you can pull, but you can push somebody off a bridge in order to stop the trolley from hitting the five people. It's basically the same scenario in a sense: you're going to save five people by sacrificing one. But the difference is how far removed you are from the situation. One is a little more tangible, right? People have a much harder time with that scenario.

Another suggestion they made is to avoid using agree/disagree response anchors. They say this is a cognitively demanding task. I don't know; I tend to do this, actually. You'll see with the gender-neutral intersexual competition scale, this is exactly what I did: I used strongly agree to strongly disagree on a seven-point scale. Do you agree or disagree? Should you use these kinds of anchors or not? They also suggest labeling each response anchor with a construct-specific verbal label and avoiding purely numeric labels. I tend not to label the numbers in between. I do sometimes, but not always. I tend to label the ends, and while I generally use the specific construct as part of that label, I don't always. The last suggestion, and I think it's a really important one, is that you don't want questions on your survey that apply to some participants but not at all to others. You need to try to avoid items like that. An example item they gave is, "How often do you see your family doctor?" If a participant doesn't have a family doctor, or doesn't frequently see them, or, as the authors even suggested, frequently sees their family doctor but only to play tennis, this question is difficult. People might skip it, so you're going to have more missing data, or your scale might be less reliable if you have items like this.
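On that last point about items that don't apply to everyone, one easy thing to check in pilot data is how often each item gets skipped. A minimal sketch, with invented responses and pandas assumed to be available; an unusually high skip rate on one item is a hint, not proof, that it doesn't apply to many respondents.

```python
import numpy as np
import pandas as pd

# Hypothetical pilot responses; NaN means the respondent skipped the item.
pilot = pd.DataFrame({
    "q1_overall_health": [4, 5, 3, 4, 2, 5],
    "q2_family_doctor":  [np.nan, 2, np.nan, np.nan, 1, np.nan],
    "q3_exercise":       [3, 4, 4, np.nan, 5, 3],
})

# Fraction of respondents who left each item blank, worst first.
missing_rate = pilot.isna().mean().sort_values(ascending=False)
print(missing_rate)
# An item skipped far more often than the rest (like q2 here) is a candidate
# for rewording, an explicit "not applicable" option, or removal.
```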
As you can see, a lot of the best practices relate to having response scales that are perceived as good, and I think there definitely needs to be more research on this topic. Certain response scales might do better when applied to specific domains, populations, or research topics; as we mentioned with the trolley problem, having a forced choice is important there, and the same goes for social desirability. I found a study that compared different types of response scales in terms of their reliability as well as how participants perceived them. Did they like using those scales? Did certain kinds of response scales do a better job of capturing the construct than others? There are different kinds of response scales you can have. The first one you see here is a bipolar seven-point scale ranging from negative three to positive three. The second is a bipolar VAS, a visual analog scale: there are no numerical response choices above the scale, you just move a slider along its length. The third is a unipolar 11-point numeric scale, and you can see that they broke that up into two different scales. There are also two unipolar five-point verbal scales, where the labels sit above the scale points. As you can tell by looking at these response scales, the researchers were measuring something quite basic: depending on the response scale people were given, did they respond differently to questions about how cold or hot they were and how comfortable or uncomfortable they were? So they were measuring thermal sensations, and they did this incredibly systematically. They had participants come in for multiple sessions with the room at specific temperatures, set 15 hours before the participants actually came in to do the study. They also ensured that participants didn't vary much in their clothing. Overall, the results revealed that the bipolar visual analog scale was subjectively preferred in all conditions. People really liked that slider without the numbers and without specific words above the different points on the scale. In fact, the fastest response times were for the two bipolar scales, whether it was the one ranging from negative three to three or the visual analog scale with the slider. The reliability of the visual analog scale was pretty good as well; it wasn't always the highest, but it did well on all the different measures. So even though it seems like you would get a lot of variance in responses there, it was actually a pretty reliable measure. They also found that there wasn't a ton of variation in how these scales performed; the differences were not significant for their sample of young college students reporting on temperature and acoustic sensations. Now, the results of the study may not generalize to the broader population, but I do think it's really interesting. Other studies have found that ease of use is best for 6-, 7-, and 10-point scales, but once you get to 11 points or more, scales are harder and take longer for participants to process and respond to. To some degree this contradicts some of the best practices, but maybe we just need more research; maybe different domains and areas of study need different kinds of response scales, or maybe some of the differences aren't as big as you might imagine. It depends.
There needs to be more research on this, and I'm certainly not an expert in this area. Step five is expert validation. After you synthesize information from the literature review and the focus groups or interviews with potential respondents, you're going to send the questionnaire to other people who are experts in the field. You're going to ask them to define the construct as well and to indicate how well the items represent the construct. You can do this using a survey; the authors actually provide, at the end of their article, a survey you could give to experts. It has places where experts can rate the items for how relevant they are, space for them to make suggestions about clarity or to comment on specific items, and they can even indicate what they anticipate the mean score would be for a given item or for the scale. Experts can also comment on whether they think anything has been omitted and needs to be included. (A small sketch of one way to summarize expert relevance ratings appears a little further down.)

Step six is what is called the cognitive pre-flight: you want to make sure you avoid any sort of epic fails. There are certain problems you just need to look out for before you run your study. More often than not, when you don't do adequate testing, you look back on the study and say, there's a better way I could have run this; I wish I had done this instead. Instead of running a large study and having regrets about it, it's better to do the cognitive pre-flight and then pilot testing to catch some of these mistakes. Before conducting a large-scale study, or even your pilot study, it's much better to learn how potential respondents understand and respond to each item, and this takes a structured approach. You can have participants come in, look at your items, and do a think-aloud session. They can look at a question and repeat in their own words what they think that item is trying to get at, or how they understand it. After this, the researcher can follow up with more probing questions to clarify how respondents understand each question. This can feel really strange and unnatural for people to do, so it's good to make them comfortable and maybe give them some practice items before you get into your construct of interest. To get them accustomed to what they're expected to do, you can give them feedback on the items that don't matter, so that you're not biasing them in any specific way. The authors noted that with their teacher-student relationship survey, there were three problems they caught during this phase. First, some of the items were ambiguous. Second, some of the items had vocabulary that was challenging; remember, students are going to be taking this questionnaire, and some of them will be at the elementary or perhaps middle school level. And third, there was ambiguity in the situation being described. Generally speaking, they were dealing with people not really understanding how to interpret the questions correctly or not understanding all the terminology. Once you've gone through the cognitive pre-flight, you can move on to piloting the study. You're still not really running your study; you're running a smaller study with a smaller subset of actual respondents. There's some stuff you're just not going to catch until you can run proper statistics and have people take the questionnaire for real, in a smaller subset, to see what actually happens.
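Circling back to the expert validation step, here is the sketch mentioned above. One common way to summarize expert relevance ratings, though not necessarily how the authors' own survey scores them, is an item-level content validity index: the proportion of experts who rate an item as relevant. The item texts, the 1-to-4 relevance scale, and the ratings below are all made-up assumptions.

```python
# Hypothetical expert ratings of item relevance on a 1-4 scale
# (1 = not relevant, 4 = highly relevant), one row per candidate item.
ratings = {
    "My teacher encourages me when the work is hard": [4, 4, 3, 4],
    "My teacher has realistic expectations of me":    [3, 4, 4, 4],
    "My teacher drives a nice car":                   [1, 2, 1, 2],
}

for item, scores in ratings.items():
    # Item-level CVI: share of experts rating the item 3 or 4 ("relevant").
    i_cvi = sum(s >= 3 for s in scores) / len(scores)
    print(f"{i_cvi:.2f}  {item}")
# Items with a low index are candidates for revision or removal before
# the cognitive pre-flight and pilot stages.
```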
One thing you'll have to watch out for is what's known as ceiling or floor effects. This is where respondents basically all answer the same way, at one end of the scale or the other. For example, in a statistics class I was teaching, I collected data from the class and had them analyze it. I asked, on a five-point scale, how much do you like dogs? You can imagine almost everybody likes dogs: there was a ceiling effect, and almost everybody answered a five. A floor effect is the opposite pattern, which I actually did find, to some degree, in my study on the gender-neutral intersexual competition scale. For some of the behaviors I was measuring, which were related to eating disorder behaviors, the mean was quite low on a one-to-seven scale; we're talking a little above one and under two for some of the behaviors. Another thing you may notice is that some items have really good face validity, but when you run the actual correlations to see whether an item correlates with other, similar items, maybe they don't correlate that highly. If that's the case, then maybe you're not actually measuring what you think you're measuring, so you might want to toss that item out or modify it. (I'll show a small sketch of these pilot-stage checks below.) The authors stress that taking these six steps won't necessarily correct every problem, but you can catch a lot of the errors along the way. The idea of front-loading the validity work, spending a lot of time developing the items, being systematic and thoughtful about how you do it, and getting opinions about the scale from both experts and laypeople in the target population, is immensely helpful and will save you a lot of headaches in the future. The data you collect later will be better quality data. Spending time up front means you'll spend less time running studies that might be large-scale, expensive, and lower quality. These six steps are intended to help minimize measurement error, increase construct validity, and increase efficiency. Measure twice, cut once.
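To close with something concrete, here is a minimal sketch of the two pilot-stage checks mentioned above: looking for ceiling or floor effects and checking corrected item-total correlations. The data are randomly generated and the scale is a made-up 1-to-5 measure, so this only illustrates the mechanics, not any particular result.

```python
import numpy as np
import pandas as pd

# Hypothetical pilot data on a 1-5 scale.
rng = np.random.default_rng(0)
df = pd.DataFrame({
    "likes_dogs": rng.choice([4, 5], size=50, p=[0.1, 0.9]),  # built to sit near the ceiling
    "item_a":     rng.integers(1, 6, size=50),
    "item_b":     rng.integers(1, 6, size=50),
})

# Ceiling/floor check: share of responses at the top or bottom of the scale.
for col in df.columns:
    top = (df[col] == 5).mean()
    bottom = (df[col] == 1).mean()
    print(f"{col}: {top:.0%} at ceiling, {bottom:.0%} at floor")

# Corrected item-total correlation: each item against the sum of the others.
total = df.sum(axis=1)
for col in df.columns:
    r = df[col].corr(total - df[col])
    print(f"{col}: corrected item-total r = {r:.2f}")
```

An item that piles up at one end of the scale, or that barely correlates with the rest, is exactly the kind of thing the pilot study is there to catch before you commit to a large, expensive data collection.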