Categorical Data Analysis in Admissions

Hello, statisticians! Mr. Young-Saver here from Skew The Script. Today we'll discuss how a team of researchers at UC Berkeley dug into some data about the admissions process at the university and investigated it for potential bias. As they dug into that data, they found a bit of a confusing trend. We'll discuss that using the tools we have for describing categorical data. Let's skew it! [Music]
Today's lesson is on describing categorical data; this is lesson 1.2 in our course sequence. We'll look at data from UC Berkeley, which is California's flagship public university. Specifically, we're going to look at data on newly admitted graduate students in 1973. If you're planning to be a graduate student, that means you're applying to get a PhD or a master's, and you apply to a specific department: you apply for a PhD in the English department, or you apply to the History department, or you apply to the Physics department. So we're going to look at data among the six largest departments that students were applying to for graduate school.
The data in the original data set is broken down into male students and female students. Note that this data set is from 1973; at the time, the intake forms for admitted students didn't have non-binary gender categories. In more modern data sets, you're going to see a wider array of gender categories; however, we don't have that here. That said, we can still investigate the data that we do have to see if there are any important trends that we should be considering, and it looks like there might be one. Among the admitted students, 1,198 were male and only 557 were female. In terms of percentages, that means that only about 32% of the admitted class of graduate students were female. At the time, university employees were wondering, "Why does this gap exist? What's driving this gap?"
So a stats professor, an anthropology professor, and a data scientist, all from Berkeley, decided to team up, gather data, and investigate further. In today's key analysis, we're going to follow their work and ask, "Were Berkeley's admissions biased along gender lines?" If you'd like to follow along, you can print our guided notes at this URL.
First, we're going to discuss marginal distributions by looking at our full data set. Here is the data: all admissions decisions from Berkeley's six largest graduate departments in 1973. This is organized in a two-way table; that's a table of counts describing two categorical variables. The two categorical variables we're describing are gender and the admissions decision: whether applicants were admitted or rejected. A good first step is to find all the totals, so I'm going to get the row totals and the column totals here. This is what I mean by column totals. A good way to distinguish between a row and a column: columns go up and down, like Greek columns or Roman columns, and rows go side to side, like rowing oars in a boat. So my column total for male students, if you add these up, is 2,691. For female students, it's this. I can get my row totals for admitted and rejected, and then I can get the grand total. To get the grand total describing all individuals in the table, you can either add up the column totals or add up the row totals. You don't need to do both, just one or the other; if you do both, they should match.
So this is the gender gap that we described at the outset of the lesson: the number of admitted male students versus the number of admitted female students. The initial question researchers had was, "Is this due to maybe a gap in applications?"
Maybe it has nothing to do with the admissions process at all; maybe it's just the composition of people who are choosing to apply to the university as graduate students. So they wanted to ask, "What was the gender distribution among all applicants to the university?" To answer this, we're not looking at the admissions decision; we're just looking at the gender composition of people who applied to the university, regardless of whether they were admitted or rejected. To do that, we're going to calculate a marginal distribution: that's a breakdown of one variable, or one margin, of this table. In this case, the gender margin. When you think "marginal," think about the margins of a piece of paper, like the top margin and the side margin. So we're just going to look at the top margin here, which is gender. I'm going to delete all the information that pertains to a breakdown by admission decision and just look at the totals for gender. Among all 4,526 applicants, 2,691 were male. If I divide those numbers, I get the proportion of all applicants who are male, which, if you convert the decimal to a percent, is 59.5%. I can do the same thing by dividing with the female total, and we find that 40.5% of applicants were female. So it looks like more male students apply.
So is this gap in applications the driving reason behind the gap in admitted students? To look at this, we can create what's known as a side-by-side bar graph. Again, this was the marginal distribution of all applicants, and this was the distribution of admitted students that we saw at the very beginning of the lesson. We can visually compare these distributions using a side-by-side bar graph, and that's the graph we see here. Just a couple of things to note before we analyze this. This graph follows a key thinking process: title, tick, tick, label, label, label.
This is a mnemonic you can use to make sure that you're putting everything you need on a graph in order to get full points on the AP exam. I have a title here, I have tick marks to show scale on the axis, and I have labels: an x-axis label, a y-axis label, and a label for the key showing what I shaded in. Note that I put the bar values, the percentages, over the bars; you don't need to do that on the AP exam; I'm just doing it here for clarity.
So what trends do we notice? Well, it is true that among all applicants, more men apply, but the proportion of men among admitted students is even higher than the proportion among applicants. In other words, male students are admitted at a disproportionately high rate, because their share of admitted students is higher than their share of applicants. And if you look at female students, they're admitted at a disproportionately low rate; their share of admitted students is lower than their share of applicants.
So to dig deeper, investigators decided to look at admission rates by gender, which means we need to look at conditional distributions. Let's calculate the admission rate for each gender. When you see language like "among each" or "for each," that should signal to you that you need to find the conditional distribution by a certain variable; in this case, the conditional distribution by gender. Here's what we mean by conditional distribution: look only at the male applicants and ignore everything else in the table. You have this new total, the total number of male applicants, and the number that were, in this case, admitted. Out of the 2,691 male applicants, 1,198 were admitted. If you calculate that proportion, 44.5% of male students were admitted. And you can do the same thing with the proportion that were rejected.
So of the 2,691 male students, 1,493, roughly 1,500, were rejected; that's 55.5% when you divide those numbers. So now we have the conditional distribution among male applicants: the proportion that got admitted and the proportion that got rejected. This is the likelihood of being admitted or rejected, conditional on being male. So, conditional on being female, what do those proportions look like? The female total is 1,835, with 557 admitted. Out of that total, 30.4% of female students were admitted, and 69.6% were rejected.
So we have a bunch of numbers here; how can we visualize this in a nice summary way? To do that, we can use what's called a segmented bar plot. I'm going to go ahead and make a scale here with axes, and I'm going to fill in my data. Among male students, about 45% were admitted and about 56% were rejected. So I'm going to draw this bar to show the proportion of male students that were admitted, up to 44.5% on the y-axis, and you can see here it matches up. Then, to fill in the rest of the male students, I'm going to draw the rest of the percentage, which is 55.5%, to get up to 100%. So now we've described all 100% of male applicants, and we see the proportion that were admitted and the proportion that were rejected. We can do the same thing for the female applicants: 30.4% were admitted and the remainder, 69.6%, were rejected. So I can go ahead and draw those here, and this is what's called a segmented bar chart. Note that I'm still going by that mnemonic: title, tick, tick, label, label, label. I want to get full points on the AP exam. I have a title here, I have tick marks to show scale, and I have labels for my y-axis, my x-axis, and my coloring key.
So is there evidence of an association between gender and admission rates, based on this table and my conditional distributions? Whenever you're asked about associations, such as "Do the data suggest an association?", you need to think about a few things.
What this question is asking you is, "Do admission rates differ between gender groups?" Well, in this case, we saw that female applicants are admitted at a lower rate compared to male applicants, so we can say there is an association. In order to get full points on the AP exam when talking about an association, we need to make sure to include three things. We need to make a claim, i.e., there is or isn't an association. We need to support that claim by comparing percentages between groups. And we need to include context, meaning we need to mention the variables involved in our analysis. When comparing percentages, you want to use comparative language: "higher," "lower," "similar." These sorts of words show that you're comparing things in a way that's going to earn you points on the AP exam. So my answer would be: "Gender and admission rates are associated; in particular, women are admitted at a lower rate compared to men." I have a claim here: I said these two variables are associated. I have context: I've talked about the variables involved, gender and admission rates. And I have comparative language: I'm showing that women are admitted at a "lower" rate than men, and I'm comparing two percentages as evidence for that.
So this brings us to our discussion for the lesson, and note that I said at the top, "Graduate school admissions are made individually by each department." The requirements to get into the Physics department are different from the requirements to get into the English department, which are different from the requirements to get into the History department as a PhD student or a master's student. So they all run their own admissions processes. To pinpoint where this association between gender and admission rates is coming from, the researchers wanted to see which departments had the largest gender admission gaps.
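The marginal and conditional distributions worked out so far can be double-checked with a short Python sketch. This is only an illustration, not part of the original lesson materials: the variable and function names are made up here, the applicant totals and admitted counts are the ones quoted in the lesson, and the rejected counts are an assumption in the sense that they are derived by subtracting admitted from total rather than read from the table.

```python
# Counts quoted in the lesson (UC Berkeley, 1973, six largest departments).
applicants = {"Male": 2691, "Female": 1835}  # column totals
admitted = {"Male": 1198, "Female": 557}     # admitted counts

# Assumption: rejected counts are derived as total minus admitted.
rejected = {g: applicants[g] - admitted[g] for g in applicants}

# Grand total: adding the column totals is enough.
grand_total = sum(applicants.values())


def pct(part, whole):
    """Return part/whole as a percentage rounded to one decimal place."""
    return round(100 * part / whole, 1)


# Marginal distribution of gender: each gender's share of ALL applicants.
marginal_gender = {g: pct(n, grand_total) for g, n in applicants.items()}

# Each gender's share of the ADMITTED students only.
admitted_share = {g: pct(n, sum(admitted.values())) for g, n in admitted.items()}

# Conditional distribution of the decision, given gender:
# divide by that gender's own total, not by the grand total.
conditional = {
    g: {
        "Admitted": pct(admitted[g], applicants[g]),
        "Rejected": pct(rejected[g], applicants[g]),
    }
    for g in applicants
}

print("Grand total:", grand_total)
print("Marginal distribution of gender (%):", marginal_gender)
print("Share of admitted students (%):", admitted_share)
print("Admission decision given gender (%):", conditional)
```

The key design point is the denominator: the marginal distribution divides by the grand total, while each conditional distribution divides by the total for its own gender group. Comparing the two "Admitted" percentages in `conditional` is the comparison the association answer rests on.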
So the investigators broke down the data into three types of departments: the not-selective departments, the quote-unquote easiest or least selective departments to get into, which admitted over half of their applicants; the selective departments, which admitted between 26% and 50% of their applicants; and the very selective departments, which turn away a lot of people and admit only 1% to 25% of applicants. So here, among male applicants, are the totals admitted and rejected for all three types of departments, and here are the percentages. When you look at the admit rate for each type of department, you can see that the not-selective departments admit at the highest rate, as you would expect, and the very selective departments admit at the lowest rate, as you would expect. And here's the same thing for female applicants, and those rates again: the not-selective departments admit at a higher rate than the very selective departments.
So what we can do is look at the male and female admission rates across these different sorts of departments and compare what percent of each group were admitted. You'll notice that if we just look at the not-selective departments, which tend to admit a lot of people, they actually have a higher admission rate among female applicants than among male applicants. If you look at the selective departments, the moderately selective ones, the rates are about equal; there's a 0.4% difference in the admission rate, but it's roughly equal for males and females. And then, if you look at the very selective departments, they actually have a higher admission rate for female applicants; a higher proportion of female applicants get in compared to male applicants. And the researchers are like, "What? That is so confusing."
And the reason it's confusing is that the overall admit rate was much higher for men than for women, but if you look within each sort of department, the admission rates tend to be higher for women, or roughly equal between men and women. So how is that possible? That's the discussion question for today. Looking at the data that I presented here at the end, across departments, the female admit rate is higher or roughly equal. So how is it possible that the overall admit rate is lower for women? What is going on here? I want you to try to think about it, look at that raw data, and explain your reasoning in class.
Now, one hint for you as you look at this. We have here what are called mosaic plots. Mosaic plots are the same thing as segmented bar graphs, but the widths of the bars are scaled by the sample size, i.e., in this case, the number of applicants in each category: not-selective, selective, or very selective departments. And we have here the mosaic plot for male applicants and the one for female applicants. So again, that discussion question: "Across departments, the female admit rate is higher or roughly equal," so in each of these bars, we see the admit rate is higher or roughly equal for women, "so how could the overall admit rate be lower for women?" I want you to explain your reasoning and see if you can use these graphs to help you out. That's it for today, statisticians. We'll see you in class. [Music]
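To make the "bar widths scaled by sample size" idea concrete, here is a minimal Python sketch of the geometry behind a mosaic plot. The department-level counts shown in the video are not reproduced in this transcript, so as an assumption the sketch applies the same idea to the gender totals quoted earlier: one bar per gender, with width set by the number of applicants and the admitted segment's height set by the admit rate. The function name `mosaic_rectangles` is hypothetical, and the sketch only computes positions and sizes on a unit square rather than drawing anything.

```python
# Counts quoted earlier in the lesson.
applicants = {"Male": 2691, "Female": 1835}
admitted = {"Male": 1198, "Female": 557}


def mosaic_rectangles(totals, successes):
    """Return (group, x_start, width, admitted_height) for each group.

    Widths are each group's share of all individuals, so they sum to 1.
    Heights are the within-group proportions, as in a segmented bar graph.
    """
    grand_total = sum(totals.values())
    rectangles = []
    x_start = 0.0
    for group, n in totals.items():
        width = n / grand_total        # scaled by sample size
        height = successes[group] / n  # conditional proportion admitted
        rectangles.append((group, x_start, width, height))
        x_start += width               # next bar begins where this one ends
    return rectangles


for group, x_start, width, height in mosaic_rectangles(applicants, admitted):
    print(f"{group}: starts at x={x_start:.3f}, "
          f"width={width:.3f}, admitted height={height:.3f}")
```

The same function could be called with applicant counts per department type for one gender, which is how the plots in the lesson are laid out; a wider bar simply means more applicants fell into that category.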
