Transcript for:
Lecture Notes on Inverse Probability Weighting

Okay, the last approach we're going to use for adjustment is inverse probability weighting instead of matching. The advantage, as you saw in the lecture, is that we don't have to throw away data. With the matched data we got rid of something like 600 rows; it would be nice if we could use those rows and get a more accurate estimate without discarding anything. What we want to do instead is weight each row based on how "weird" it is, so we're still kind of matching, but in a couple of steps.

First, we generate a predicted probability for each row: we model whether each row should be using a net based on the different confounders. Then we use that propensity score (the predicted probability) to generate an inverse probability weight, which is the "weirdness" score. Higher weights go to observations that were predicted to use a net but aren't, or that are using a net but weren't predicted to; a low score means the row did what the model predicted (predicted to use a net and using one, or predicted not to use a net and not using one). Ultimately we use that weirdness score, the inverse probability weight, as weights in a regression, just like we did with the matched regression, and the regression coefficient should be a causal estimate.

So we'll make a new section here called "Inverse probability weighting." The first step is to generate propensity scores: the probability that a row uses a mosquito net, based on whatever covariates we want to include. We want to include the things in our DAG, because those are our confounders, so we'll predict net usage from nighttime temperatures, income, and health. We don't need any of the other nodes; we're just going to use those three. Okay, so to do
that, we need to build a model. We'll name this chunk "make-p-scores." We want to build a model, using logistic regression, to predict whether people are using nets or not, and we'll call it model_net. This has nothing to do with the effect of nets on malaria risk; it's just predicting whether people are using nets.

The function for this is glm(), which stands for generalized linear model (instead of a regular linear model), and it's how we run logistic regression. It uses the same formula syntax we're familiar with from regular regression, so we say net is explained by income plus temperature plus health. For the data, we're not using the matched data (we're ignoring the matching stuff now); we're using the original nets data. The last thing we need is to tell glm() to actually do logistic regression, which we do with the family argument: family = binomial(link = "logit"), meaning there are only two outcomes, yes or no. The only way to remember that incantation is to either memorize it, which I have because I've been doing this for a while, or copy and paste from previous logistic regressions, which is how you learn anyway.

Then we want to see the results just for fun, so we'll say tidy(model_net). If we run this, here's our logistic regression. All of these coefficients are pretty uninterpretable because they're log odds; they don't make intuitive sense. But we can un-log them by exponentiating, which means taking e to the power of each coefficient to get an odds ratio. We don't have to do that manually: in tidy() we can say exponentiate = TRUE. If we run that, there we go. So there are
no huge effects here. Remember, odds ratios are centered around one, so anything above or below one changes the likelihood of using a net. According to this, as temperature increases by one degree, you are six percent less likely to use a net; it's six percent because 1 − 0.94 = 0.06. The other coefficients are above one: as your income goes up by one dollar, you are 0.2 percent more likely to use a net, which isn't huge, but income is measured in dollars, so if income goes up by 100 you'll be noticeably more likely to use a net, and likewise if health goes up by 10 units instead of just one. That's what this is showing, but we don't really care about these coefficients. If you were doing this in real life, you'd want to make sure the model does a fairly good job of predicting net usage; because this is simulated, fake data, we can assume that it does.

The next step is to generate the actual propensity scores. We have a model, so now we take our data set and plug every value of income, temperature, and health into it, and it spits out a propensity, a predicted probability of using a net. To do that, we'll make a new data set called nets_ipw, because it will hold the inverse probability weights. There's a function called augment() that takes our model, plugs in our data set, and generates the propensity scores, which is neat. The one issue with augment() is that it throws away any columns you didn't use in the model, so eligibility would disappear, and so would the number of people in the household. If we don't care about those, we
can just use augment(), and that's fine. But if we do care about them and want to keep them, we can use augment_columns(), which adds things instead. So we give it our model, model_net, then our mosquito nets data set, and then one last argument: type.predict = "response". That tells it to scale the predictions down into probabilities; if we don't include it, we get log odds, which again aren't interpretable, but with type.predict = "response" it converts them into a zero-to-one probability score.

If we run this and look at nets_ipw, we have all of our existing columns, even the ones we didn't use like household and eligible, and if we keep scrolling over we have a bunch of new columns. The one we care about most is .fitted: that's our predicted probability column, our propensity score. We can sort by it and say this person here only has a 10% chance of using a net given their income, temperature, and health, and they used a net anyway; that's probably a weird observation, and they're going to get a high inverse probability weight. If we reverse the sort, we can see the people most likely to use nets: this person has a 74% chance of using a net based on the confounders we used, and they used one, so they followed the prediction. The other new columns are things like the standard error around the fitted value and various diagnostics; all we really care about is .fitted.

For the sake of remembering that .fitted is the propensity score, we can rename it, because ".fitted" is not super clear. To rename that column we add the pipe symbol (Cmd+Shift+M, or Ctrl+Shift+M on Windows) and we can say
rename(): if we say propensity = .fitted, we take the .fitted column and rename it to propensity. Now if we run it, look at nets_ipw, and scroll over, the column is no longer called .fitted; it's called propensity, which is nicer. So we'll do that.

Okay, the last thing we want to do is generate the inverse probability weight value. Remember, this is the "weirdness" column; it measures how unpredictable different rows are. That person with a 74% chance of using a net was supposed to use a net according to the model, and if you scroll over, they did, so that's not very weird or unexpected. If we sort back to the lowest, though, the person with a 10% chance did use a net, which is weird and unexpected, so they're going to be weighted as more important.

To generate that weight we add a new column, which involves the mutate() function. We'll call the new column ipw, though you could name it whatever you want: weights, inverse_probability_weights, weirdness_score, whatever. The formula, if you remember from the lecture (let me pull up the slide so you can see it), is treatment over propensity, plus one minus treatment over one minus propensity. Treatment needs to be numeric, 1 or 0 for whether they used a net or not, and propensity is our propensity score, that probability. So let's move the slide off to the side and look at it as we build this, and we'll use parentheses to make sure the order of operations works. We want our treatment column, the numeric net variable, which we called net_num. There we go.
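The steps so far — the logistic regression, the propensity scores, and the rename — might look like this as one sketch. The nets data frame here is a simulated stand-in I've made up for illustration; the real course data set has the same columns but different values:

```r
library(dplyr)
library(broom)

# Simulated stand-in for the course's mosquito net data: same column
# names as the transcript, but made-up values.
set.seed(1234)
nets <- tibble(
  income      = rnorm(1752, mean = 900, sd = 200),  # dollars
  temperature = rnorm(1752, mean = 24, sd = 3),     # degrees C
  health      = rnorm(1752, mean = 50, sd = 15)     # 0-100 scale
) %>%
  mutate(net_num = rbinom(n(), 1, plogis(-1 + 0.002 * income -
                                           0.06 * temperature + 0.01 * health)),
         net = net_num == 1)

# Step 1: predict net usage from the confounders in the DAG
model_net <- glm(net ~ income + temperature + health,
                 data = nets,
                 family = binomial(link = "logit"))

# Coefficients as odds ratios instead of log odds
tidy(model_net, exponentiate = TRUE)

# Step 2: plug every row back into the model to get propensity scores;
# augment_columns() keeps the columns that aren't in the model too
nets_ipw <- augment_columns(model_net, nets, type.predict = "response") %>%
  rename(propensity = .fitted)
```

One quirk: newer releases of broom consider augment_columns() soft-deprecated and may warn about it, but it still works as described here.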
So we'll say net_num divided by propensity; that's the first part of the formula, treatment over propensity. Then we want plus, and some extra parentheses for order-of-operations purposes: (1 − net_num) / (1 − propensity). And that should be enough parentheses, yep. (I'm using the RStudio preview that was just released, which lets you turn on rainbow parentheses in the settings; each opening parenthesis is colored to match its closing one — this green one matches that green one, this pink one matches that one — which really helps with parenthesis counting.)

So if we run this, let's see if it worked: looking at nets_ipw, we should have a new column called ipw. This is, again, the weirdness score. If we sort it ascending, these are the most boring people: this person with a weight of 1.13 had a 12% chance of using a net and didn't, that one a 14% chance and didn't. If we sort the other way, the top weight is about 9: there's our true exception, the person who had a 10% chance of using a net but did, so they get a really high inverse probability weight. Everybody else near the top, the ones with weights around 6, had propensities of about 15% and used nets anyway. I wonder if there are any high-propensity people up here... these are all net users... oh, there's somebody, row 109: they did not use a net even though they had a 70% probability of using one, so their inverse probability weight is high, around 3.4.

Cool. So now we have a column of weights that gives more importance to the observations that don't follow expectations. That's kind of a different way of matching these observations: we don't have to throw anything away; we just give more importance to some rows. We now have data we can work with, so we'll do the estimation. We'll come and make a
new section here called "Find effect" and add a new chunk called "ipw-model." The nice thing is that it follows the same syntax as before, so we can come up to our earlier weighted formula and copy it: lm(), malaria risk is explained by net, with a weights argument and a data argument, and we'll change the pieces that referred to the matched data. We'll make a new object called model_ipw and set it equal to this. Malaria risk is explained by net usage; for data we're not using the matched data, we're using nets_ipw, the data set that has the inverse probability weights in it; and the weight column is no longer named weights, we named it ipw, so: ipw. The indentation is off, so I'll select these rows and press Cmd+I to make everything nicely indented, and then we want to see the results, so we'll say tidy(model_ipw).

If I run this now, the causal effect is −10.1. That is a lot more accurate than the −16 we found with just the observational data without any adjustments, and it's fairly close to the matched version, so that's good. This is a plausible causal effect that we have now. You could report this in a paper and say this is our estimate of the true causal effect using observational data. The reason you can legally claim causality now, instead of just saying "this is correlated with" or "associated with," is that we followed the DAG: we adjusted for income, temperature, and health, we made those adjustments by incorporating them into our propensity score calculations (that was one way of making these adjustments), and all we're left with is the effect of net on malaria. And there's our causal effect of negative 10.
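Putting the weight formula and the weighted regression together, a sketch — again on a simulated stand-in for the course data, here with a true effect of −10 built in to mirror the lecture's setup:

```r
library(dplyr)
library(broom)

# Simulated stand-in data with a true causal effect of -10 built in
set.seed(1234)
nets <- tibble(
  income      = rnorm(1752, mean = 900, sd = 200),
  temperature = rnorm(1752, mean = 24, sd = 3),
  health      = rnorm(1752, mean = 50, sd = 15)
) %>%
  mutate(net_num = rbinom(n(), 1, plogis(-1 + 0.002 * income -
                                           0.06 * temperature + 0.01 * health)),
         net = net_num == 1,
         malaria_risk = 70 - 10 * net_num - 0.02 * income -
           0.2 * health + rnorm(n(), mean = 0, sd = 4))

model_net <- glm(net ~ income + temperature + health,
                 data = nets, family = binomial(link = "logit"))

# The weirdness score: treated rows get 1 / propensity, untreated rows
# get 1 / (1 - propensity), so unexpected rows get bigger weights
nets_ipw <- augment_columns(model_net, nets, type.predict = "response") %>%
  rename(propensity = .fitted) %>%
  mutate(ipw = (net_num / propensity) + ((1 - net_num) / (1 - propensity)))

# Weighted outcome regression; the net coefficient is the causal estimate
model_ipw <- lm(malaria_risk ~ net, data = nets_ipw, weights = ipw)
tidy(model_ipw)
```

On this made-up data the net coefficient should land near −10; the exact value depends on the random draw.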
So we want to compare all of these models at once at the end, just to see which ones work well and which ones don't. We'll make a new heading called "All models," and we're going to use the modelsummary package that you used in problem set two. For good practice it's good to have all of your packages up at the top so you know what you need to install, so we'll say library(modelsummary) up there and run it so it actually loads.

Then we'll insert a new chunk named "all-models-together," yay. The way modelsummary works is we give it a list of models to show all at once, simultaneously, so we'll use the list() function and feed it all of the models we've made. If we scroll up we can find what those are called: model_wrong (we'll include that one), model_matched, model_matched_weights, and model_ipw. So there are our four models, and if I just click play it should show them down below.

The way you read this is that it's now vertical: each of the models is a column. These are the intercepts for each model, 41, 38, 36, 39, and the number below each estimate is its standard error; the 0.46 under the last model's intercept is the same 0.46 we saw in its tidy() output, so that's where that number is coming from. The row we care about most is netTRUE: this right here is the ostensible causal effect of using a net. The naive model, the one that was wrong, gives −16, with no adjustments behind it. The matched version gives −12, the matched-with-weights version gives −10.4, and the inverse probability weighting version gives −10.
In this case the inverse probability weighting version was the most accurate: the real, true causal effect is −10, because that's what I built into the data, and this gets the closest. That is not true in all situations; it's not that inverse probability weighting will always get you the right effect, and the methods aren't strictly increasing in accuracy like they happen to be here. (Never do the naive one, though; that one's bad, since we didn't make any adjustments there.)

If we want to be a little more official and cool with our modelsummary table, we can give names to the models instead of the default "Model 1," "Model 2," "Model 3" headings. In the list we can name each element: "Naive" = model_wrong, "Matched" = model_matched, "Matched + weights" = model_matched_weights, and "IPW" = model_ipw. Now if we run it, our column headings are much nicer: naive, matched, matched plus weights, and inverse probability weighting.

So that is how you can use R to make adjustments based on a causal diagram. dagitty tells us which things we need to adjust for; matching and inverse probability weighting are how we actually make the adjustments; and it's all here in one R Markdown file that we can knit and email to people as our final report, finding the causal effect of using a mosquito net on malaria risk using observational data.
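The comparison table described above can be sketched like this. To keep the example self-contained I only fit the naive and IPW models on simulated stand-in data; the matched models from earlier in the course would slot into the same named list:

```r
library(dplyr)
library(broom)
library(modelsummary)

# Simulated stand-in data (true effect of -10 built in)
set.seed(1234)
nets <- tibble(
  income      = rnorm(1752, mean = 900, sd = 200),
  temperature = rnorm(1752, mean = 24, sd = 3),
  health      = rnorm(1752, mean = 50, sd = 15)
) %>%
  mutate(net_num = rbinom(n(), 1, plogis(-1 + 0.002 * income -
                                           0.06 * temperature + 0.01 * health)),
         net = net_num == 1,
         malaria_risk = 70 - 10 * net_num - 0.02 * income -
           0.2 * health + rnorm(n(), mean = 0, sd = 4))

# Naive model: no adjustment at all
model_wrong <- lm(malaria_risk ~ net, data = nets)

# IPW model: weight by the inverse probability of treatment
model_net <- glm(net ~ income + temperature + health,
                 data = nets, family = binomial(link = "logit"))
nets_ipw <- augment_columns(model_net, nets, type.predict = "response") %>%
  rename(propensity = .fitted) %>%
  mutate(ipw = (net_num / propensity) + ((1 - net_num) / (1 - propensity)))
model_ipw <- lm(malaria_risk ~ net, data = nets_ipw, weights = ipw)

# A named list gives the table nicer column headings
modelsummary(list("Naive" = model_wrong,
                  "IPW"   = model_ipw))
```

Each model becomes one column of the table, with standard errors printed below each estimate, as described above.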