The value of a randomized controlled trial is that it produces two comparable groups of people: a treatment group and a control group. The control group provides us with a counterfactual, or an idea of how things would have turned out for the people in the treatment group if they never got the treatment. Having a clear and correct counterfactual is what allows us to conclude that one thing is causing another. And so, in cases where a randomized controlled trial is not feasible, we need to test our theories by identifying the correct counterfactual for those who get the treatment. If we want to figure out the impact college has on wages, we need to look at people who go to college and somehow figure out what would have happened to them if they went straight into the workforce instead, with no college education.

Economists often employ the Latin phrase ceteris paribus, which means "all else equal" or "all else the same." If I want to identify the causal effect of sending you to college, I want to split the universe into two: along one path you go to college, and along the other you don't. We let life take its course, and there are sure to be other factors that impact your income, but when we compare the outcomes in each universe, the difference must be due to the treatment of going to college, because this is the treatment that split the universe in two. Every other factor before the split is the same, and every difference after must be due to the decision to go to college. And so everything else, all the other causes of income not impacted by college attendance, is the same in each universe.

Of course, as it stands now, this kind of analysis just isn't possible, and so we have to compare different people to each other. We have to compare those who go to college to those who don't. When we do that, as we saw before, we see this big difference in wages between the college-educated and the non-college-educated. But is the group of people who didn't go to college the true counterfactual for those who did? Is
that what would happen to everyone if they didn't go to college? Probably not. People who go to college tend to have higher IQs and higher family incomes. A high IQ probably makes school a little easier to complete, leading people with high IQs to get more education, but a higher IQ is also valuable to some employers and might lead to higher wages independently of school attendance. The same thing is true for a high family income, which allows people to afford more school and also tends to open doors to higher-paying careers. Working in the other direction is experience: people with more work experience tend to get paid more. Those who skip college and go right into the workforce will accrue four years of experience while their peers are in classrooms.

If we compare these two groups directly, it will likely lead to a biased result. If we took the people who did not go to college and sent them, it would be unlikely that college would raise their wages by the full 37 percent. The comparison is not ceteris paribus: there are other factors influencing both the amount of treatment that people get and the outcome being measured.

That's where the statistical process of regression comes in to save us. Regression is a method of estimating the relationship between a dependent variable and independent variables. Have you ever seen a scatter plot where data is plotted out, as is pictured here, and then there's a line drawn through the dots to indicate the trend in the data? Regression is how we calculate that trend, and what really makes it special is that we can calculate the trend across multiple variables at the same time. It allows you to control for specified variables, estimating the independent effects for each one.

If you've ever heard someone say something like "you're tall for your age," that is a mental application of the concept of regression. A five-year-old who's tall for their age is probably still short compared to adults, but we can all recognize that if we held age constant, or, said another
way, if we compare the five-year-old to only other five-year-olds, then they would be comparatively tall. That is what we mean by controlling for a variable. Regression allows us to estimate how one variable affects another, holding other variables constant.

We can use regression to control for the confounding variables we identified before. Regression compares people with the same IQ, family income, and work experience to each other and sees how their income varies as their schooling varies. And so the coefficient we estimate for going to college would be the effect independent of differences in IQ, family income, and work experience. When we do that, we find a smaller result, around 28 percent, which is quite a bit less than 37 percent. Regression allows us to correct the biases of confounding variables in order to create a ceteris paribus comparison.

The only problem is whether we thought of them all. Are there other things which impact both whether you go to college or not and your wages? I could think of a couple, and some of them might be hard to even measure, which means it might not be possible to include them in a regression. When we leave out a confounding variable, our results will be biased. We call this omitted variable bias, and any omitted variable correlated with both the treatment X and the outcome Y will lead to a biased estimate of how the treatment affects the outcome.

Here are some examples of omitted variables which might matter to our estimate. A student's aptitude, inclination, and motivation for schoolwork will impact whether they attend college and further impact their wages. In the ongoing effort to estimate the causal effect of schooling on wages, this is referred to as ability bias. How do we measure a person's grit? And yet it is almost certainly an omitted variable. Another could be different college opportunities. Family income might grab some of it, but if you're more likely to get into Harvard as a legacy admit because your parents went there, then you probably are going
to have a lot more doors open for you too. People with access to high-quality schools also have access to high-quality jobs. These opportunities increase both college attendance and wages. Before we close the book on schooling, we would need to figure out a way to account for omitted variables.

Omitted variables can lead to some bad conclusions. In 1999, scientists published an article in one of the most prestigious peer-reviewed journals in the world which showed that infants who sleep with a night light, or in a room where the light is on, are significantly more likely to need glasses early in life. Monsters under the bed everywhere celebrated as parents were convinced to keep their babies in dark rooms. But alas, there was no effect on myopia, or the need for glasses due to nearsightedness. The researchers overlooked an important omitted variable: genetics. Genetics are a very important determinant of whether a person will need glasses. But why would a baby's genetics be correlated with whether they had a night light? Because the night light wasn't for them. The night light was for their poorly sighted parents, who, when woken up in the middle of the night by their crying infant, didn't always have their glasses on, and so they preferred to walk into a room with a little light in it so they could see. Parents with glasses used night lights, and they were also more likely to pass on their myopia to their children through genetics. It was a third factor, C, that was causing both A and B.
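
To make the "third factor C" story concrete, here is a minimal simulation sketch in Python. Everything in it is an assumption made up for illustration: the variable names, the coefficients 0.8 and 0.5, the sample size, and the use of continuous scores instead of yes/no outcomes. None of it comes from the night light study itself. In this toy world C (the parents' nearsightedness) drives both A (night light use) and B (the child's nearsightedness), and A has no effect on B at all. To "control for" C, the sketch strips the part of A and B that C explains and then regresses what is left over, which gives the same coefficient on A as a regression of B on both A and C.

```python
import random


def slope(x, y):
    """OLS slope from regressing y on x (with an intercept)."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((xi - mx) * (yi - my) for xi, yi in zip(x, y))
    sxx = sum((xi - mx) ** 2 for xi in x)
    return sxy / sxx


def residuals(x, y):
    """The part of y left over after removing its linear trend in x."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    b = slope(x, y)
    return [(yi - my) - b * (xi - mx) for xi, yi in zip(x, y)]


def simulate(n=20000, seed=0):
    """Toy data (all coefficients are illustrative assumptions)."""
    rng = random.Random(seed)
    # C: parents' nearsightedness, the omitted variable
    c = [rng.gauss(0, 1) for _ in range(n)]
    # A: night light use, driven by C plus noise
    a = [0.8 * ci + rng.gauss(0, 1) for ci in c]
    # B: child's nearsightedness, driven by C only (no effect of A)
    b = [0.5 * ci + rng.gauss(0, 1) for ci in c]

    # Naive comparison: regress B on A and ignore C
    naive = slope(a, b)
    # Controlled comparison: remove C from both A and B, then regress
    controlled = slope(residuals(c, a), residuals(c, b))
    return naive, controlled


naive, controlled = simulate()
print(f"naive slope of B on A:          {naive:.3f}")
print(f"slope of B on A, holding C fixed: {controlled:.3f}")
```

Under these made-up numbers the naive slope should land clearly above zero (about 0.24 in expectation, since 0.8 × 0.5 / (0.8² + 1) ≈ 0.24), even though A does nothing to B, while the slope that holds C constant should land close to zero. That gap is the omitted variable bias, and it only goes away here because C was measured and included.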