So today's lecture is going to be about causality. Who's heard about causality before? Raise your hand. What's the number one thing you hear when thinking about causality? Yeah: correlation does not imply causation. Anything else come to mind? That's what came to my mind.

Up until now in the semester we've been talking about purely predictive questions, and for purely predictive questions one could argue that correlation is good enough: if there are signals in our data that are predictive of some outcome of interest, we want to be able to take advantage of that, and whether the signal is upstream or downstream, the causal directionality, is irrelevant for that purpose. Although even that isn't quite true, because I've been hinting throughout the semester, and others have too, that there are times when the data changes on you, for example when you go from one institution to another, or when you have non-stationarity, and in those situations a deeper understanding of the data might allow one to build in additional robustness to that type of dataset shift. But there are other reasons why understanding something about your underlying data-generating process can be really important: often the questions we want to answer in health care are not predictive questions, they're causal questions. So what I'll do now is walk through a few examples of what I mean by this.

Let's start with what we saw in lecture 4 and problem set 2, where we looked at the question of how to do early detection of type 2 diabetes. You used the Truven MarketScan dataset to build a risk stratification algorithm for detecting who's going to be newly diagnosed with diabetes one, two, three years from now. If you think about how one might deploy that algorithm, you might, for example, try to get patients into the clinic to get them diagnosed. But the next set of questions is usually the "so what" question: what are you going to do based on that prediction? Once diagnosed, how will you intervene? At the end of the day the interesting goal is not how to find them early, but how to prevent them from developing diabetes, or how to prevent the patient from developing complications of diabetes. Those are questions about causality.

Now, when we built our predictive models and introspected the weights, we might have noticed some interesting things. For example, if you looked at the largest negative weights, which I'm not sure we did as part of the assignment, but it's something I did as part of my research, you'd see that gastric bypass surgery has the biggest negative weight. Does that mean that if you give an obese person gastric bypass surgery, it will prevent them from developing type 2 diabetes? That's an example of a causal question raised by a predictive model, but just by looking at the weights alone, as I'll show you this week, you won't be able to correctly infer that there's a causal relationship. So part of what we'll be doing is coming up with a mathematical language for thinking about how one answers: is there a causal relationship here?

Here's a second example. Right before spring break we had a series of lectures about diagnosis, particularly diagnosis from imaging data of various kinds, whether radiology or pathology, and all the questions were of this sort: here is a woman's breast, she has breast cancer,
maybe you have an associated pathology slide as well, and you want to know: what is the risk of this person dying in the next five years? One can take a deep learning model and learn to predict what one observes: for the patients in your dataset you have the input and you have, say, the survival time, and you might use that to predict something about how long it takes to go from diagnosis to death. Based on those predictions you might take actions; for example, if you predict that a patient is not high risk, you might conclude that they don't need treatment. But that could be really, really dangerous, and I'll give you one example of why. If you learn predictive models in this way, the outcome, in this case say time to death, is going to be affected by what happens in between. For example, this patient might have been receiving treatment, and receiving treatment between diagnosis and death might have prolonged their life. So for this patient in your dataset you might have observed that they lived a very long time. But if you ignore what happens in between and simply learn to predict Y from X, X being the input, then when a new patient comes along you predict that they're going to survive a long time, and it would be completely the wrong conclusion to say that you don't need to treat that patient, because the only reason patients like them in the training data lived a long time is that they were treated. So when it comes to machine learning in healthcare, we need to think really carefully about these types of questions, because an error in the way we formalize our problem could kill people through mistakes like this.

Other questions are about not how to predict outcomes but how to guide treatment decisions. For example, as data from pathology gets richer and richer, we might think we can use computers to better predict who is likely to benefit from a treatment than humans could alone. But the challenge with using algorithms for that is that people respond differently to treatment, and the data being used to guide treatment is biased by existing treatment guidelines. So, similarly to the previous question, we could ask what would happen if we trained to predict past treatment decisions; this would be the most naive way to try to use data to guide treatment decisions. Maybe you see David gets treatment A, John gets treatment B, another patient gets treatment A, and you ask: a new patient comes in, how should this new patient be treated? If you just learned a model to predict, from what you know about David, the treatment David is likely to get, then the best you could hope for is to do as well as existing clinical practice. So if we want to go beyond current clinical practice, for example to recognize that there is heterogeneity in treatment response, then we have to somehow change the question that we're asking.

I'll give you one last example, which is perhaps the more traditional question of "does X cause Y". For example, "does smoking cause lung cancer" is a question of major societal importance. You might be familiar with the traditional way of trying to answer questions of this nature, which would be to do a randomized controlled trial, except this isn't exactly the type of setting where you could do a randomized controlled trial. How would you feel if you
were a smoker and someone came up to you and said, you have to stop smoking because I need to see what happens? Or how would you feel if you were a nonsmoker and someone came up to you and said, you have to start smoking? That would be both infeasible and completely unethical. So if we want to answer questions like this from data, we need to start thinking about how to design, using observational data, ways of answering them. The challenge is that there's going to be bias in the data because of who decides to smoke and who decides not to smoke. The most naive way to try to answer this question would be to compare the conditional likelihood of getting lung cancer among smokers with that among non-smokers, but those numbers, as you'll see in the next few slides, can be very misleading, because there might be confounding factors: factors that, for example, both cause people to be smokers and cause them to develop lung cancer, and which would create a difference between those two numbers. We'll have a very concrete example of this in just a few minutes.

So, to properly answer all of these questions, one needs to think in terms of causal graphs. Rather than the traditional setup in machine learning, where you just have inputs and outputs, we now need triplets: inputs, interventions, and outcomes (or outputs). We need to keep three quantities in mind and start thinking about the causal relationships between them.

For those of you who have taken more graduate-level machine learning classes, you might be familiar with ideas such as Bayesian networks. When I went to undergrad and grad school and studied machine learning, for the longest time I thought causal inference had to do with learning causal graphs. So this is what I thought causal inference was about: you have data of the following nature, one, zero, zero, one, dots; here there are four random variables, and I'm showing realizations of those four binary variables, one per row, and you have a dataset like this. I thought causal inference was about taking data like this and trying to figure out the underlying Bayesian network that created it: is it X1 goes to X2 goes to X3 goes to X4 (I'll say this is X1, that's X2, X3, and X4), or is it some other graph over the same variables? Trying to distinguish between different causal graphs from observational data is one type of question one can ask, and one thing you learn in traditional machine learning treatments is that sometimes you can't distinguish between these causal graphs from the data you have. For example, suppose you had just two random variables. Any distribution can be represented as the probability of X1 times the probability of X2 given X1, just by the chain rule of conditional probability, and similarly any distribution can be represented the opposite way, as the probability of X2 times the probability of X1 given X2. The statement one would make is that if you just had data involving X1 and X2, you couldn't distinguish between the two causal graphs, X1 causes X2 versus X2 causes X1.
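To write that two-variable argument out explicitly (this is nothing more than the chain rule applied in both orders):

```latex
\begin{align*}
p(x_1, x_2) &= p(x_1)\, p(x_2 \mid x_1) && \text{(the factorization of the graph } X_1 \to X_2\text{)} \\
            &= p(x_2)\, p(x_1 \mid x_2) && \text{(the factorization of the graph } X_2 \to X_1\text{)}
\end{align*}
```

Since every joint distribution over (X1, X2) admits both factorizations, observational data on these two variables alone can never favor one graph over the other.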
Usually a treatment of this topic would then say: OK, but if you have a third variable and a v-structure, something like X1 goes to X2 and X1 goes to X3, that you could distinguish from, say, a chain structure. And the final answer from this philosophy of causal inference would be: if you're in a setting where you can't distinguish whether X1 causes X2 or X2 causes X1, then you do some interventions; you intervene on X1 and look at what happens to X2, and that helps you disentangle the direction of causality.

None of this is what we're going to talk about today. Today we're going to talk about the simplest possible setting you could imagine: the graph shown up there. You have three sets of random variables: X, which is perhaps a vector, so it's high-dimensional; a single random variable T; and a single random variable Y. And we know the causal graph. We're going to suppose that we know the directionality, that X might cause T and that X and T might cause Y, and the only thing we don't know is the strength of the edges. So let's think this through in the context of the previous examples. Yeah, question? Correct, that's the assumption we're going to make here.

All right, let's instantiate this. Start with this example: X might be what you know about the patient at diagnosis. T, I'm going to assume for the purposes of today's class, is a decision between two different treatment plans, and I'm going to simplify the state of the world: those treatment plans depend only on what you know about the patient at diagnosis. So at diagnosis you decide, I'm going to give them this sequence of treatments at three-month intervals, or this other sequence of treatments at, say, four-month intervals, and you make that decision based only on diagnosis; you don't change it based on anything you observe later. Then the causal graph of relevance is: based on what you know about the patient at diagnosis, which I'll call X (a vector, because maybe it's based on images, your whole electronic health record, a ton of data you have on the patient at diagnosis), you make some decision about a treatment plan, which I'll call T. T could be a binary choice between two treatments, it could be continuous (maybe you're deciding the dosage of the treatment), or it could even be a vector. For today's lecture I'm going to suppose T is just binary, just two choices, but most of what I'll tell you generalizes to the setting where T is non-binary as well.

Critically, I'm going to assume for today's lecture that you're not observing new things in between. So in today's and this whole week's lectures, the following scenario will not happen: based on diagnosis you make a decision about a treatment plan, the treatment plan starts, you get new observations, based on those new observations you realize the treatment plan isn't working, and you change to another treatment plan, and so on. That scenario goes by a different name, dynamic treatment regimes, or off-policy reinforcement learning, and we'll learn about it next week. In today's and Thursday's lectures, we suppose that based on what you know about the patient at this time, you make a decision, you act on that decision, and you observe some outcome. So X causes T, not the other way around, and that's pretty clear because of our prior knowledge about this problem:
it's not that the treatment affects what their diagnosis was. And then there's the outcome Y, and there again we suppose that the outcome, what happens to the patient, maybe survival time for example, is a function of what treatment they got and of aspects of that patient. So this is the causal graph. We know it, but we don't know: does this treatment do anything for this patient? For whom does this treatment help the most? Those are the types of questions we're going to try to answer today. Is the setting clear?

OK. Now, these are not new questions. They've been studied for decades in fields such as political science, economics, statistics, and biostatistics, and the reason they're studied in those other fields is that often you don't have the ability to intervene, and one has to try to answer these questions from observational data. For example, you might ask: what will happen to the US economy if the Federal Reserve raises US interest rates by 1%? When's the last time you heard of the Federal Reserve doing a randomized controlled trial? And even if they had done one, for example flipped a coin to decide which way interest rates would go, an experiment done today wouldn't be comparable to the same experiment done two years from now, because the state of the world changes in those years.

Or take political science. I have close colleagues at NYU who look at Twitter, and they want to ask questions like: how can elections be influenced, or how were elections influenced? You might look at some unnamed actors, possibly people supported by the Russian government, who were posting to Twitter or other social media, and ask: did that actually influence the outcome of the previous presidential election? Again, in that scenario we have the data, something happened in the world, and we'd like to understand what the effect of that action was, but we can't exactly go back and replay history to do something else. So these are fundamental questions that appear all across the sciences, and of course they're extremely relevant in healthcare. Yet we don't teach them in our introduction to machine learning classes, and we don't teach them in our undergraduate computer science education. I view this as a hole in our education, which is why we're spending two weeks on it in this course, and it's still not enough.

Now, what has changed between those fields and what is relevant in healthcare? The traditional way these questions were approached in statistics required a huge amount of domain knowledge: first of all, to make sure you're setting up the problem correctly, which is always going to be important, but then to think through all of the factors that could influence the treatment decision, the confounding factors. In the traditional approach one would write down ten or twenty different variables and make sure the analyses, including the analyses I'll show you in today's and Thursday's lectures, used those ten or twenty variables. But where this field is going is toward high-dimensional data. I talked about how you might have imaging data for X, or the patient's entire electronic health record for X. The traditional approaches the statistics community used to work with no longer work in this high-dimensional setting, and so in fact it's a really
interesting area for research, one that my lab and many other labs are starting to work on, where we ask: how can we bring machine learning algorithms that are designed to work with high-dimensional data to answer these types of causal inference questions? In today's lecture you'll see one example of a reduction from causal inference to machine learning, where we'll use machine learning to answer one of those causal inference questions.

So the first thing we need is some language in order to formalize these notions. I'll work within what's known as the Rubin-Neyman causal model, where we talk about what are called potential outcomes: what would have happened under this world or that world. We'll call one of them Y0; often it's denoted as Y with a subscript 0, sometimes as Y(0), and sometimes as Y conditioned on X with the treatment set to 0, and all of these notations are equivalent. Y0 corresponds to what would have happened to this individual if you gave them treatment 0, and Y1 is the potential outcome of what would have happened to this individual if you gave them treatment 1. You can think of Y1 as what happens on giving the blue pill and Y0 as what happens on giving the red pill.

Once you can talk about these states of the world, you can start to ask: which is better, the red pill or the blue pill? One can formalize that notion mathematically in terms of what's called the conditional average treatment effect, which also goes by the name of individual treatment effect. It takes as input x_i, the data you had at baseline for the individual, their covariates, their features, and asks: for this individual, given what we know about them, what's the difference between giving them treatment 1 and giving them treatment 0? Mathematically, that corresponds to a difference in expectations: the expectation of Y1 minus the expectation of Y0, each conditioned on x_i. The reason I'm calling this an expectation is that I'm not going to assume Y1 and Y0 are deterministic. Maybe there's some bad-luck component: a medication usually works for this type of person, but with the flip of a coin, sometimes it doesn't. That's the randomness I'm referring to when I talk about the probability of Y1 given x_i. So the CATE looks at the difference of those two conditional expectations. One can then talk about the average treatment effect, which is the expectation of the CATE over the distribution of people, P(X). We're going to go through this in four different ways in the next ten minutes, then you'll go over it five more ways in your homework assignment and two more ways on Friday in recitation, so if you don't get it just yet, stay with me; you'll get it by the end of this week.

Now, in the data that you observe, all you see for an individual is what happened under one of the interventions. For example, if the i-th individual in your dataset received treatment t_i = 1, then what you observe, y_i, is the potential outcome Y1; on the other hand, if the individual received treatment t_i = 0, then what you observe for that individual is the potential outcome Y0. That's the observed, factual outcome.
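In symbols, collecting the definitions just given (Y1 and Y0 are the potential outcomes; t_i and y_i are the treatment and outcome recorded for the i-th individual):

```latex
\begin{align*}
\mathrm{CATE}(x_i) &= \mathbb{E}[Y_1 \mid x_i] - \mathbb{E}[Y_0 \mid x_i] \\
\mathrm{ATE} &= \mathbb{E}_{x \sim p(x)}[\mathrm{CATE}(x)] = \mathbb{E}[Y_1 - Y_0] \\
y_i &= t_i\, y_i^{(1)} + (1 - t_i)\, y_i^{(0)} \qquad \text{(the factual, observed outcome)}
\end{align*}
```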
One can also talk about the counterfactual: what would have happened to this person had the opposite treatment been given? Notice that you just swap each t_i for 1 - t_i. The key challenge in this field is that in your dataset you only observe the factual outcomes, and when you want to reason about the counterfactual, you have to impute the unobserved counterfactual outcome. That is known as the fundamental problem of causal inference: we only observe one of the two outcomes for any individual in the dataset.

Let's look at a very simple example. Here, individuals are characterized by just one feature, their age, and the two curves I'm showing you are the potential outcomes: what would happen to an individual's blood pressure if you gave them treatment 0 (the blue curve) versus treatment 1 (the red curve). Let's dig in a little deeper. For the blue curve, people who received the control, what I'm calling treatment 0, blood pressure is pretty low for individuals whose age is low and for individuals whose age is high, but for middle-aged individuals blood pressure under treatment 0 is in the higher range. On the other hand, for individuals who receive treatment 1, the red curve, young people have much higher blood pressure, and similarly much older people.

One can then ask about the difference between these two potential outcomes. The CATE, the conditional average treatment effect, is simply the distance between the blue curve and the red curve for that individual. For someone of a specific age, say a young person or a very old person, there's a very big difference between giving treatment 0 and giving treatment 1, whereas for a middle-aged person there's very little difference. So, for example, if treatment 1 were significantly cheaper than treatment 0, you might say: we'll give treatment 1, even though it's not quite as good as treatment 0, because it's so much cheaper and the difference between them is so small. But to make that type of policy decision, one of course has to understand the conditional average treatment effect for that individual, and that's something we're going to want to predict using data.

Now, we don't always get the luxury of personalized treatment recommendations; sometimes we have to set a broad policy. For example (I took this example out of my slides, but I'll give it to you anyway), the federal government might come out with a guideline saying that all men over the age of 50 (I'm making that number up) need to get annual prostate cancer screening. That's an example of a very broad policy decision. You might ask: what is the effect of that policy, applied over the full population, on, say, decreasing deaths due to prostate cancer? That would be asking about the average treatment effect. If you average the red line and average the blue line, you get the two dotted lines I show there, and the difference between them is the average treatment effect between giving the red intervention and giving the blue intervention. If the average treatment effect is very positive, you might say that on average this intervention is a good one; if it's very
negative, you might say the opposite.

Now, the challenge of doing causal inference from observational data is that of course we don't observe those red and blue curves. What we observe are data points that might be distributed all over the place. For example, here the blue treatment happens to be given more often to young people in the data, and the red treatment happens to be given more often to older people. That could happen for a variety of reasons: access to medication, socioeconomic reasons, or existing treatment guidelines saying that old people should receive treatment 1 and young people should receive treatment 0. These are all reasons why who receives what treatment in your data could be biased in some way, and that's exactly what the edge from X to T is modeling. But for each of those people you might want to know what would have happened had they gotten the other treatment, and that's asking about the counterfactual. The dotted circles are the counterfactuals for each of those observations. By the way, you'll notice that the dots are not on the curves; that's to point out that there can be stochasticity in the outcome. The dotted lines are the expected potential outcomes, and the circles are realizations of them.

All right, everyone take out a calculator, or your computer, or your phone. I'll take out mine. This is not an opportunity to go on Facebook, just to be clear; all I want is the calculator. We're going to do a little exercise. Here's a dataset. On the left-hand side, each row is an individual; we observe their age, gender, whether they exercise regularly (a 1 or a 0), and which treatment they got, A or B. On the far right-hand side are their observed blood sugar (glucose) levels, say at the end of the year. Now, what we'd like to have looks like this: we'd like to know what would have happened to this person's sugar level had they received medication A and had they received medication B. But as you saw on the previous slide, we observe for each individual either A or B, so we're only going to know one of these columns for each individual. The first row, for example: this individual received treatment A, so I've taken the observed sugar level for that individual, and since they received treatment A, that observed level represents the potential outcome Y_A, or Y0. That's why there's a bolded 6 under Y0. We don't know what would have happened to that individual had they received treatment B; in this case some magical creature came to me and told me their sugar level would have been 5.5, but we don't actually know that. It wasn't in the data.

Let's look at the next row just to make sure we get what I'm saying. The second individual actually received treatment B, and their observed sugar level is 6.5. Let's do a little survey: should that 6.5 go in this column (raise your hand) or in this column? About half of you got it right: indeed, it goes in the second column. And again, what we would have liked to know is the counterfactual, what their sugar level would have been had they received medication A, which we don't actually observe in
our data, but which I'll hypothesize: suppose someone told me it was 7; then you'd see that value filled in there. That's the unobserved counterfactual. First of all, is the setup clear?

All right, now's where you use your calculators. We're going to demonstrate the difference between a naive estimator of the average treatment effect and the true average treatment effect. What I want you to do right now is compute, first, the average sugar level of the individuals who got medication B. For that we're only using the red numbers: this is conditioning on receiving medication B, which is equivalent to going back to the table, taking only the rows where individuals received medication B, and averaging their observed sugar levels. Everyone should do that. What's the first number? 6.5 plus... I'm getting 7.875 for the average sugar level given that they received medication B. Is that what other people are getting? All right. What about the second number, the average sugar level given medication A? I want you to compute it, and in literally one minute I'm going to ask everyone to say it out loud, and if you get it wrong you can of course be embarrassed, so I'm going to try it myself too. OK, on the count of three, everyone read out that number: one, two, three: 7.125. Good, we can all do arithmetic. Again, we're just looking at the red numbers here. So we just computed the difference, which is 0.75. That looks about right. It's a positive number.

Now let's do something different. Let's compute the actual average treatment effect: we're going to average every number in this column, and average every number in that column. This is the average sugar level under the potential outcome had each individual received treatment B, and the average sugar level under the potential outcome had each individual received treatment A. Who's got it? Minus 0.75? You're fast; let's see if you're right, I actually don't know. The first average is 7.75, good, we got that right (I intentionally didn't post the slides before today's lecture), and the difference comes out to minus 0.75.

So now let's put ourselves in the shoes of a policymaker. Say it's a health insurance company trying to decide: should I reimburse for treatment B, or should I say no, I'm never going to reimburse for it because it doesn't work well? If they had used the naive estimator, the first computation, it would look like medication B is worse than medication A (we want lower numbers here). But if you properly estimate the actual average treatment effect, you reach the absolute opposite conclusion: medication B is much better than medication A. That's a simple example to really illustrate the difference between conditioning and actually computing the counterfactual. So hopefully now you're starting to get it, and you'll have many more opportunities to work through these things in your homework assignment.
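Here is a minimal sketch of that exercise in Python. The numbers are hypothetical stand-ins, not the table from the slide, but they are chosen to reproduce the same phenomenon: the naive conditional difference comes out positive while the true average treatment effect is negative, because in this toy dataset treatment B is preferentially given to sicker patients.

```python
import numpy as np

# Hypothetical data: for each individual we write down the treatment
# received AND both potential outcomes. In real data only the factual
# outcome is ever observed; having both here lets us compare the naive
# estimator against the true ATE.
t   = np.array(["A", "B", "A", "B", "A", "B", "A", "B"])
y_a = np.array([6.0, 7.0, 7.5, 8.5, 6.5, 8.0, 7.0, 9.0])  # potential outcome under A
y_b = y_a - 0.5                                            # B always lowers sugar by 0.5

# The factual outcome: what a real dataset would actually contain.
y_obs = np.where(t == "B", y_b, y_a)

# Naive estimator: condition on the treatment actually received.
naive = y_obs[t == "B"].mean() - y_obs[t == "A"].mean()

# True average treatment effect: average over everyone's potential outcomes.
ate = y_b.mean() - y_a.mean()

print(f"naive E[Y | T=B] - E[Y | T=A] = {naive:+.3f}")  # prints +0.875
print(f"true  E[Y_B] - E[Y_A]        = {ate:+.3f}")     # prints -0.500
```

The sign flip happens for exactly the reason in the lecture's table: treatment B went preferentially to individuals with higher underlying sugar levels, which is the kind of bias the edge from X to T models.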
OK, so by now you should be starting to wonder: how the hell could I do anything in this state of the world? You don't actually observe those black numbers; they're all unobserved, and clearly there's bias in what the values should be, for the reasons I've been describing all along. So what can we do? The first thing to realize is that in general this is an impossible problem, so your instincts aren't wrong, and we're going to have to make a ton of assumptions in order to do anything here. The first assumption, called SUTVA, I won't even talk about; you can read about it in your readings. I'll tell you about the two assumptions that are a little easier to describe.

The first critical assumption is that there are no unobserved confounding factors. Mathematically, it says that the potential outcomes Y0 and Y1 are conditionally independent of the treatment decision given what you observe about the individual, X. This is called ignorability, and it can be a bit hard to understand, so let me draw a picture. X is your covariates and T is your treatment decision, and now I've drawn for you a slightly different graph. Over here I had X goes to T, and X and T go to Y; but now I don't have Y. Instead I have Y0 and Y1, and there's no edge from T to them. That's because I'm now using the potential-outcomes notation: Y0 is what would have happened to this individual had they received treatment 0, and Y1 is what would have happened had they received treatment 1. Those quantities are defined regardless of which treatment the individual actually received, so it doesn't make sense to draw an edge from T to those values; that's why there's no edge there.

So you might wonder what a violation of this conditional independence assumption could look like. Before I give you the answer, let me put some names on these things. Think of X as the age, gender, weight, diet, and so on of an individual; T might be a medication, like an antihypertensive medication to try to lower the patient's blood pressure; and these would be the potential outcomes under the two medications. An example of a violation of ignorability is if there is something else, some hidden variable H, which is not observed, and which affects both which treatment the individual in your dataset received and the potential outcomes. It should be really clear that this violates the conditional independence assumption: in this graph, Y0 and Y1 are not conditionally independent of T given X.

So what are these hidden confounders? They might be things that really affect treatment decisions. Maybe there's a treatment guideline saying that diabetic patients should receive treatment 0, that that's the right thing to do, and a violation of ignorability would be if the fact that the patient is diabetic were not recorded in the electronic health record. You don't know that the real reason the patient received treatment 0 was this factor H. And critically, there's a second requirement: H must actually affect the outcomes as well, which is why there are edges from H to the Y's.
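Stated compactly (this is just the assumption and its violation, written in symbols):

```latex
\begin{align*}
&\textbf{Ignorability:} && (Y_0, Y_1) \perp\!\!\!\perp T \mid X \\
&\textbf{Violation:}    && \text{some unobserved } H \text{ with edges } H \to T \text{ and } H \to (Y_0, Y_1)
\end{align*}
```

With such an H present, conditioning on X alone no longer makes the treatment independent of the potential outcomes.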
If H were something that affected the treatment decision but not the actual potential outcomes (and that can happen: things like gender can often affect treatment decisions but, for some diseases, might not affect outcomes), then it wouldn't be a confounding factor, because it doesn't violate this assumption, and in fact one can still come up with consistent estimators of the average treatment effect. Where things go to hell is when you have both of those edges. So there can't be any of these H's: you have to observe everything that affects both treatment and outcome.

A question about what I'm showing here: how good a model is this for hypertension? I have no idea. But I think what you're really getting at with your question is: oh my god, how do I know whether I've observed everything? And that's where you need to start talking to domain experts. Remember my starting point: I said I'm not going to attempt to learn the causal graph; I'm going to assume I know the causal graph and just try to estimate the strength of the effects. This is where that becomes really relevant (notice that this is another causal graph, not the one I drew on the board), and it's where talking with domain experts matters. If you say, OK, I'm going to study hypertension and this is the data I've observed on patients, you could go to a clinician, maybe a primary care doctor who often treats patients with hypertension, and ask: what usually affects your treatment decisions? You get a set of variables out, and then you check: am I observing all of those variables, at least the ones that would also affect the outcomes? Often there's a back-and-forth in that conversation to make sure you've set up your problem correctly. And here again you see a critical difference between causal inference and machine learning. In machine learning, if some variables are unobserved, maybe your predictive accuracy isn't quite as good as it could have been, but whatever; here, your conclusions could be completely wrong if you don't get those factors right. In some of the optional readings for Thursday's lecture (we'll touch on it very briefly, but there's not much time in this course) you'll read about ways to assess robustness to violations of these assumptions; those go by the name of sensitivity analyses. For example, you might ask: how would my conclusions change if there were a confounding factor of such-and-such strength? That's something one can try to answer from data, but it's really beyond the scope of this course, so I'll give you readings on it rather than covering it in lecture.

The second major assumption one needs is what's known as common support. By the way, pay close attention here, because at the end of today's lecture (if I forget, someone must remind me) I'm going to ask you where these two assumptions come up in the proof I'm about to give. The first one I'll give away: I'll show you where ignorability comes up. But it's up to you to figure out where common support shows up. So what is common support? Common support says that there must always be some stochasticity in the treatment decisions. For example, if
in your data patients only ever received treatment A and no patient received treatment B, you would never be able to figure out the counterfactual of what would have happened had patients received treatment B. And what if it's not quite that universal, but there are classes of people, some individuals X, say people with blue hair, where people with blue hair always received treatment 0 and never treatment 1? If something about having blue hair also affects how they respond to the treatment, then you wouldn't be able to answer anything about the counterfactual for those individuals. This is formalized via what's called the propensity score: the probability of receiving a given treatment for each individual. We're going to assume that the propensity score is always bounded away from zero and one, between epsilon and 1 minus epsilon for some small epsilon, that is, epsilon <= p(T = 1 | X = x) <= 1 - epsilon. Violations of that assumption will completely invalidate any conclusions we draw from the data.

Now, in actual clinical practice you might wonder whether this can ever hold, because there are clinical guidelines. A couple of places where you'll see it are as follows. First, there are settings where we haven't the faintest idea how to treat patients, like second-line diabetes treatments. We know the first thing to start with is metformin, but if metformin doesn't control the patient's glucose values, there are several second-line diabetes treatments, and right now we don't really know which one to try first. A clinician might start with treatments from one class, and if that isn't working, try a different class, and so on; it's a bit random which class you start with for any one patient. In other settings there might be clear clinical guidelines, but there's randomness in other ways: clinicians trained on the West Coast might be taught that this is the right way to do things, and clinicians trained on the East Coast might be taught that something else is. So even if any one clinician's treatment decisions are deterministic, you'll see stochasticity across clinicians. It's a bit subtle how to use that in your analyses, but trust me, it can be done.

So, if you want to do causal inference from observational data, you first have to formalize things mathematically: what is your X, what is your T, what is your Y? Then you have to think through whether these choices satisfy the assumptions of ignorability and overlap. Ignorability you cannot explicitly check in your data, but overlap you can test. By the way, how would you test it? Someone who hasn't spoken today. Think back to the previous example: you have this table with the X's, treatment A or B, and the sugar values. How would you test it? Yep: a frequentist version, where you just count how many of each combination show up, and if some count is zero, overlap fails. So go back to that table (we'll go back to the previous slide, where it's a bit easier to see). Here we're going to ignore the outcome, the sugar levels, because overlap only has to do with the probability of treatment given your covariates; the Y doesn't show up here at all, so the observed sugar levels on the right-hand side are irrelevant for this question. All we care about is what goes on over here: these are your X's and this is your treatment. So you can look: here is a 75-year-old male who exercises regularly and received treatment A. Is there anyone else in the dataset who is 75 years old, male, exercises regularly, but received treatment B? Yes or no? No. So overlap is not satisfied here, at least not empirically.
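Here is a minimal sketch of that frequentist check, assuming the table lives in a pandas DataFrame; the column names and rows are hypothetical stand-ins for the slide's table, not a reproduction of it.

```python
import pandas as pd

# Hypothetical version of the table: covariates plus the treatment received.
# The outcome column is irrelevant for checking overlap.
df = pd.DataFrame({
    "age":       [75, 75, 68, 68, 75],
    "male":      [1,  1,  0,  0,  1],
    "exercises": [1,  1,  1,  0,  1],
    "treatment": ["A", "A", "A", "B", "A"],
})

# For each covariate stratum, count how many distinct treatments appear.
strata = df.groupby(["age", "male", "exercises"])["treatment"].nunique()

# Strata seen under only one treatment are empirical overlap violations.
print(strata[strata < 2])
# Every stratum here appears under only one treatment; in particular the
# (75, male, exercises) stratum only ever received A, so empirically there
# is no overlap for individuals like that.
```

With continuous or high-dimensional covariates, exact strata like these become too sparse to be meaningful; a common alternative is to fit a model of the propensity score p(T = 1 | x) and check that its predictions stay inside [epsilon, 1 - epsilon].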
Now, you might argue that I'm being a bit too coarse here: what if an individual is 74 and received treatment B; maybe that's close enough? So there start to be subtleties in assessing these things when you have finite data. But at a fundamental level, overlap is something you can start to assess using data, as opposed to ignorability, which you cannot test using data. So you have to think about whether these assumptions are satisfied, and only once you've thought through those questions can you start your analysis. That brings me to the next part of the lecture: let's now believe that these assumptions hold; how do we actually do the causal inference?

A question from the audience: what happens if you have a violation of overlap, for example individuals who are healthy never receive any treatment; should you remove them from your dataset? First of all, it depends on how you formalize the question, because not receiving a treatment is itself a treatment; that might be your control arm, just to be clear. If you're asking about the difference between two classes of treatment for a condition, then often one defines the relevant inclusion criteria so that these conditions hold. For example, we could redefine the set of individuals we're asking about so that overlap does hold, but then you have to make sure your policy is modified accordingly: you conclude that the average treatment effect is such-and-such for this type of person.

OK, so how can we possibly compute the average treatment effect from data? Remember, the average treatment effect, mathematically, is the expectation of the potential outcome Y1 minus Y0. The key tool we'll use to estimate it is what's known as the adjustment formula; it goes by many names in the statistics community, such as the G-formula. I'll give you a derivation of it. We're first going to recognize that this expectation is actually two expectations in one: an expectation over individuals X, and an expectation over the potential outcome Y given X. So I'll first write it out in terms of those two expectations, with the expectation over X on the outside. That goes by the name of the law of total expectation, and it's trivial at this stage. By the way, I'm writing out the expectation of Y1; in a few minutes I'll show you the expectation of Y0, and it's exactly analogous. Now, the next step is where we use ignorability (I told you I'd give that one away). Remember, we're assuming that Y1 is conditionally independent of the treatment T given X. What that means is that the probability of Y1 given X is equal to the probability of Y1 given X and T equals whatever; in this case I'll say T = 1. This is
implied by Y1 being conditionally independent of T given X, so I can just stick in T = 1 there, explicitly because ignorability holds. And now we're in a really good place, because (here I've used some shorthand and hidden the outer expectation, and you can do the same for Y0) notice that we can replace the average treatment effect with the expectation, over all individuals X, of the expectation of Y1 given X and T = 1, and so on. These are mostly quantities we can observe in our data. For example, we can look at the individuals who received treatment 1, and for those individuals we have realizations of Y1; we can look at individuals who received treatment 0, and for those we have realizations of Y0; and we can average those realizations to get estimates of the corresponding expectations. So these we can easily estimate from our data, and we've made progress.

But notice there are some things we can't yet directly estimate from our data. In particular, we can't estimate the expectation of Y0 given X and T = 1, because we have no idea what would have happened, for an individual who actually got treatment 1, had they gotten treatment 0. So those terms we don't know. Now, what's the trick I'm pulling on you; wait, how does it help that we can estimate the others? The key point is that the quantities we can estimate from data show up in the adjustment formula. In particular, if you look at an individual x sampled from the full population P(X) for whom we in fact observed T = 1, we can estimate the expectation of Y1 given X and T = 1, and similarly for Y0. What we need to be able to do is extrapolate, because empirically we only have samples from P(X | T = 1) and P(X | T = 0) for the two potential outcomes, respectively, but we're also going to get samples of X for which we only observed, say, T = 0, and to compute this formula we have to answer, for that X, what the outcome would have been under T = 1. So there's going to be a set of individuals for whom we have to extrapolate in order to use this adjustment formula.

Yep, a question: isn't it true, because of common support, that we'd have some patients who received each treatment for any given X? Yes, but that's a statement about infinite data, and in reality one only has finite data. So although common support has to hold to some extent, you can't build on it to say that you always observe the counterfactual for every individual, as in the pictures I showed you earlier.

I'm going to leave this slide up for one more second to let it sink in. We started from the goal of computing the average treatment effect, the expected value of Y1 minus Y0. Using the adjustment formula, we've arrived at an equivalent representation: an expectation, over individuals sampled from P(X), of the expected value of Y1 given X and T = 1 minus the expected value of Y0 given X and T = 0. For some of the individuals you can observe the relevant quantity, and for some of them you have to extrapolate.
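Collected in one place, the derivation we just walked through (shown for Y1; the Y0 term is identical with T = 0):

```latex
\begin{align*}
\mathbb{E}[Y_1]
  &= \mathbb{E}_{x \sim p(x)}\big[\, \mathbb{E}[Y_1 \mid X = x] \,\big]
     && \text{(law of total expectation)} \\
  &= \mathbb{E}_{x \sim p(x)}\big[\, \mathbb{E}[Y_1 \mid X = x,\, T = 1] \,\big]
     && \text{(ignorability: } Y_1 \perp\!\!\!\perp T \mid X\text{)} \\[4pt]
\mathrm{ATE}
  &= \mathbb{E}_{x \sim p(x)}\big[\, \mathbb{E}[Y_1 \mid X = x,\, T = 1]
      - \mathbb{E}[Y_0 \mid X = x,\, T = 0] \,\big]
\end{align*}
```

Within each arm the potential outcome coincides with the observed outcome, so the inner conditional expectations are estimable from data; the outer expectation, however, still runs over all individuals x, which is exactly the extrapolation burden just described.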
Hold your question for a little while. From here there are many ways one can go. Classes of causal inference methods you will have heard of include things like covariate adjustment, propensity score reweighting, doubly robust estimators, matching, and so on; those are the tools of the causal inference trade. In this course we're only going to talk about the first two: covariate adjustment in today's lecture, and the second one on Thursday.

Covariate adjustment is a very natural way to do that extrapolation. It also goes by the name of response surface modeling. We're going to learn a function f that takes as input X and T, and whose goal is to predict Y. Intuitively, you should think of f as that conditional probability distribution: it's predicting Y given X and T. So T is going to be an input to the machine learning algorithm, which predicts what the potential outcome Y would be for the individual described by features X1 through Xd under intervention T. This is where we get the reduction to machine learning: we use empirical risk minimization, or maybe regularized empirical risk minimization, to fit a function f that approximates the expected value of Y_t given T = t and X. Once we have that function, we can use it to estimate the average treatment effect by implementing the formula from the previous slide. We first take the expectation over individuals: we approximate the true expectation with an empirical expectation, summing over the n individuals in your dataset. Then we estimate the first term, which is f(x_i, 1), because that approximates the expected value of Y1 given T = 1 and X = x_i, and the second term, which just plugs in 0 for T instead of 1, and we take the difference between them. That is our estimator of the average treatment effect.

Here's a natural place to ask a question. One thing you might wonder is: in your dataset you actually did observe an outcome for each individual, yet notice that the raw observed Y's don't show up in this estimator at all. I did machine learning, then threw away the observed Y's and used this estimator. An alternative formula, which by the way is also a consistent estimator, would be to use the observed Y for the factual outcome and the imputed Y, using f, only for the counterfactual. That would also be a consistent estimator of the average treatment effect; you could do either.

Now, sometimes you're not interested in just the average treatment effect; you're interested in understanding the heterogeneity in the population. This approach also gives you an opportunity to explore that heterogeneity: for each individual x_i you can look at the difference between what f predicts for x_i under treatment 1 and what f predicts under treatment 0, and that difference is your estimate of the conditional average treatment effect. So, for example, to figure out the optimal policy for an individual, you might look at whether the CATE is positive or negative, or whether it's greater than some threshold.
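A minimal sketch of the whole reduction, assuming numpy arrays X (an n-by-d covariate matrix), t (n binary treatments), and y (n observed outcomes). The choice of gradient boosting is illustrative only; any regressor fit by (regularized) empirical risk minimization plays the role of f.

```python
import numpy as np
from sklearn.ensemble import GradientBoostingRegressor

def covariate_adjustment(X, t, y):
    """Fit f(x, t) ~ E[Y | X = x, T = t], then impute both potential outcomes."""
    # The treatment enters simply as one extra input feature.
    f = GradientBoostingRegressor().fit(np.column_stack([X, t]), y)

    # Impute both potential outcomes for every individual in the dataset.
    y1_hat = f.predict(np.column_stack([X, np.ones(len(X))]))
    y0_hat = f.predict(np.column_stack([X, np.zeros(len(X))]))

    cate_hat = y1_hat - y0_hat   # estimated conditional average treatment effects
    ate_hat = cate_hat.mean()    # empirical average over the n individuals
    return ate_hat, cate_hat
```

The alternative, also-consistent estimator mentioned above would keep each individual's observed y_i for the factual term and use f only to impute the counterfactual term.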
So let's look at some pictures. We're using the function f to impute those counterfactuals; once we have them, we can compute the CATEs, and by averaging over those we can estimate the average treatment effect. Yep? Where can this go wrong; isn't there a bias, as in the paper where they took the vanilla approach? Oh, thank you so much for bringing that back up. You're referring to one of the readings for the course from several weeks ago, where we talked about using a pure machine learning algorithm to predict outcomes in a hospital setting, in particular what happens to patients who come to the emergency department with pneumonia. If you all remember, there was this asthma example, where patients with asthma were predicted to have better outcomes than patients without asthma. You're calling that bias, but remember, when I taught about this I called it bias due to a particular thing. What was the language I used? Bias due to intervention, maybe; I can't remember exactly what I said, but we'll make it up now and the textbook will be written with "bias by intervention".

The problem there is that they didn't formalize the prediction problem correctly. The question they should have asked involves X and T and Y, where T is the interventions that are done for asthmatics. The failure of that paper is that it ignored the causal inference question hidden in the data: it just predicted Y given X, marginalizing over T altogether. T never appeared in the predictive model, so they never asked counterfactual questions about what would have happened had you done a different T, and yet they wanted to use the model to guide treatment decisions, such as whether to send a person home or keep them for careful monitoring. This is exactly the same example I gave you at the beginning of the lecture: if you just use a risk stratification model to make decisions, you run the risk of making the wrong decisions, because those predictions are biased by the treatment decisions in your data. That doesn't happen here, because we're explicitly accounting for T in all of our analyses.

Another question: how much treatment information is in MIMIC? A ton. In fact, one of the readings for next week is about trying to understand how one could manage sepsis, a condition caused by infection, which is managed by, for example, giving broad-spectrum antibiotics, fluids, vasopressors, and mechanical ventilation. All of those are interventions, and all of them are recorded in the data, so one can ask counterfactual questions of the data, like: had this patient received a different set of interventions, would we have prolonged their life? In an intensive care unit setting, most of the questions we want to ask (not all, but many) are about dynamic treatment: not a single treatment, but a sequence of treatments responding to the current patient condition. That's where we'll really get into that material next week, not in today's lecture.

Yep? That's a phenomenal question; where were you this whole course? Thank you for asking it. I'll repeat
it: how do you know that your function f actually learned something about the relationship between the input X, the treatment T, and the outcome? That really gets at the question of whether my reduction is valid. I've taken this problem and reduced it to a machine learning problem where I take my data and literally just learn a function f to fit what we observe. How do we know that f does a good job at estimating something like the average treatment effect? In fact, it might not, and this is where things get really tricky, particularly with high-dimensional data. It could happen, for example, that the treatment decision is only one of a huge number of factors that affect the outcome Y, and that a much more important factor is hidden in X. Because you don't have much data, and because you have to regularize your learning algorithm, say with L1 or L2 regularization, or maybe early stopping if you're using a deep neural network, your algorithm might never learn the actual dependence on T. It might learn to throw T away and just use X to predict Y. If that's the case, you will never be able to infer average treatment effects accurately; you'll have huge errors.

That gets back to one of the slides I skipped. I started from this picture, the machine learning picture, saying: the reduction to machine learning is that you add an additional feature, your treatment decision, and you learn the black-box function f. But this is where machine learning and causal inference start to differ, because we don't actually care about the quality of predicting Y. You can measure the root mean squared error in predicting Y given your X's and T's, and that error might be low, yet you can still hit these failure modes where the model completely ignores T. T is special here. The picture we really want to have in mind is that T is a parameter of interest: we want to learn a model f such that if we twiddle T, we see the differential effect on Y of twiddling T. That's what we truly care about when using machine learning for causal inference, and that's the gap in our understanding today. It's an active area of research to figure out how to change the machine learning paradigm to recognize that when you use machine learning for causal inference, you're interested in something a little bit different. By the way, that's a major area of my lab's research, and we've just published a series of papers trying to answer that question; they're beyond the scope of this course, but I'm happy to send them to anyone who's interested. This type of issue doesn't show up as much when X isn't very high-dimensional and regularization isn't important, but once X becomes high-dimensional, and once you want to consider more and more complex function classes during fitting, deep neural networks for example, these differences in goals become extremely important.
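One simple way to prevent regularization from discarding T entirely (my illustration, not something prescribed in the lecture) is to fit a separate response surface per treatment arm, sometimes called a T-learner in the causal inference literature, so that neither model can trade T off against stronger features:

```python
import numpy as np
from sklearn.linear_model import Ridge

def t_learner(X, t, y):
    """Fit one outcome model per treatment arm so T cannot be regularized away."""
    f1 = Ridge().fit(X[t == 1], y[t == 1])    # response surface for the treated
    f0 = Ridge().fit(X[t == 0], y[t == 0])    # response surface for the controls
    cate_hat = f1.predict(X) - f0.predict(X)  # imputed effect for every individual
    return cate_hat.mean(), cate_hat
```

The trade-off is that each arm's model sees only part of the data, and evaluating each model on the other arm's individuals leans even more heavily on the overlap assumption.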
I'm gonna go back to the formula. Someone who hasn't spoken today, hopefully. You can be wrong, it's fine. Yep, in the back. So maybe you have an individual with some age, and we're going to want to look at the difference between what f predicts for that individual if they got treatment one versus treatment zero. And let me try to lead this a little bit: it might happen that in your data set, for that individual, or for individuals like them, you only have observed treatment one, and there's no one even remotely like them for whom you observed treatment zero. So what is this function going to output when you input zero for that second argument? Everyone, say it out loud: garbage. Right. If in your data set you never observed anyone even remotely similar to x_i who received treatment zero, then this function is basically undefined for that individual. I mean, yes, your function will output something, because you fit it, but it's not going to be the right answer. And so that's where this assumption starts to show up: when one talks about the sample complexity of learning these functions to do covariate adjustment, and when one talks about the consistency of these estimators. For example, you'd like to be able to make claims that as the amount of data grows to, let's say, infinity, this gives you the right estimate. That's the type of proof which is often given in the causal inference literature: if you have overlap, then as the amount of data goes to infinity, you will observe someone like the person who received treatment one who also received treatment zero. It might have taken a huge amount of data to get there, because treatment zero might have been much less likely than treatment one, but because the probability of treatment zero is not zero, eventually you'll see someone like that. And so eventually you'll get enough data to learn a function which can extrapolate correctly for that individual. All right, so that's where overlap comes in, in giving that type of consistency argument.

Of course, in reality you never have infinite data. So these questions about the trade-off between the amount of data you have and the fact that you never truly have empirical overlap with a small amount of data, and answering when you can extrapolate correctly despite that, is the critical question one needs to answer. That is, by the way, not studied very well in the literature, because people don't usually think in terms of sample complexity in that field; that's where computer scientists can really start to contribute, bringing things that we often think about in machine learning to this new topic.

I've got a couple of minutes left. Are there any other questions, or should I introduce some new material in one minute? Yep. "You said the average treatment effect estimator here is consistent, but does that still hold if we choose the wrong functional form of the features?" Great question. You're asking all three questions today; good job, Ron. So, no. If you walk through that argument I made, I assumed two things: first, that you observe enough data such that you have any chance of extrapolating correctly; but implicit in that statement is that you're choosing a function family which is powerful enough that it can extrapolate correctly.
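As another aside for these notes, here's a minimal sketch of what checking for empirical overlap might look like in practice. The propensity-score check and the age-based assignment mechanism are my own illustration, not something from the lecture.

```python
# Minimal sketch (invented setup): estimate the propensity score
# e(x) = P(T=1 | X=x) and flag individuals whose probability of receiving
# the treatment they did NOT get is essentially zero. For those people,
# asking f(x, t') for the unobserved arm is pure extrapolation: garbage.
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(1)
n = 1000
age = rng.uniform(20, 80, size=n).reshape(-1, 1)
# Older patients almost always receive treatment 1, younger ones almost never.
p_treat = 1.0 / (1.0 + np.exp(-(age[:, 0] - 50) / 3.0))
T = rng.binomial(1, p_treat)

e_hat = LogisticRegression().fit(age, T).predict_proba(age)[:, 1]
eps = 0.01
no_overlap = ((T == 1) & (e_hat > 1 - eps)) | ((T == 0) & (e_hat < eps))
print(f"{no_overlap.mean():.0%} of patients have essentially no empirical overlap")
```

The threshold eps here is a judgment call with finite data: overlap is formally a statement that the true treatment probabilities are bounded away from zero and one, and the sample-complexity question raised above is exactly how much data you need before you can say anything for the people near those boundaries.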
So if the true function is nonlinear, think back to this figure I showed you: if the true potential outcome functions are these quadratic functions and you're fitting them with a linear function, then no matter how much data you have, you're always going to get wrong estimates. So this type of consistency argument really requires that you're considering more and more complex nonlinearities as your amount of data grows.

All right, so now here's a visual depiction of what can go wrong if you don't have overlap. Previously I had one or two red points over here and one or two blue points over here, but I've taken those out, so in your data all you have are these blue points and those red points. Now one can learn the best functions one can: you could imagine trying to, let's say, minimize the mean squared error predicting these blue points, and minimize the mean squared error predicting those red points. And what you might get out is, maybe, a linear function, because that's about as good as you can do if all you have are those red points. Even if you were willing to consider more and more complex hypothesis classes here, anything more complex than this line would probably just be overfitting to the data you have. So you decide on that line, and because you had no data over here, you don't even know that it's not a good fit to the data. And then you notice that you're getting completely wrong estimates: for example, if you asked about the CATE for a young person, it would have the wrong sign over here, because the two lines have flipped (a small sketch of this sign flip follows below).

So that's an example of how one can start to get errors. When we begin on Thursday's lecture, we're going to pick up right where we left off today. I'll talk about this issue in a little more detail, and I'll talk about how, if one were to learn a linear function, one could actually interpret the coefficients of that linear function in a causal way, under the very strong assumption that the true potential outcomes are linear. So that's what we'll return to on Thursday. That's all.
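Here is the promised sketch of the sign flip just described. The quadratic potential outcomes, the age ranges, and all the constants are made up to mirror the figure, not taken from the lecture's actual example.

```python
# Minimal sketch (invented numbers): the true potential outcomes are quadratic
# in age, but treatment 1 is observed only for older patients and treatment 0
# only for younger ones, and each arm is fit with its own line.
import numpy as np

rng = np.random.default_rng(2)
def y1(age): return 0.01 * age**2               # true outcome under treatment 1
def y0(age): return 0.02 * (age - 30)**2 + 5    # true outcome under treatment 0

age1 = rng.uniform(55, 80, 200)                 # treated: only older patients
age0 = rng.uniform(20, 45, 200)                 # control: only younger patients
obs1 = y1(age1) + rng.normal(0, 1, 200)
obs0 = y0(age0) + rng.normal(0, 1, 200)

# Fit a separate line to each arm; each fits well in its own region.
line1 = np.poly1d(np.polyfit(age1, obs1, 1))
line0 = np.poly1d(np.polyfit(age0, obs0, 1))

age = 25  # a young person, far from anywhere treatment 1 was observed
print("true CATE at 25:     ", y1(age) - y0(age))        # small and positive
print("estimated CATE at 25:", line1(age) - line0(age))  # large and negative
```

Each line is a fine fit where its arm has data, so nothing in the training error warns you; the wrong sign only appears once you extrapolate outside the region of overlap, which is exactly the point of the figure.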