Hello everyone, hello, hello, and welcome to those of you who've just joined us: welcome to the final talk of the DSF morning session. Excuse me for one second while I get on the right page. In introducing this talk: whenever we go to data meetups and see data talks, we find everyone is keen to say how brilliant they are, how amazing their product is, and so on and so on. It's actually really rare, and really refreshing, to hear someone come and talk about something that didn't go so well. So with that learning in mind, I'd like to introduce Cecilia Chen and her talk, "How Not to Run a Consumer-Side Experiment". Cecilia is a seasoned behavioral economist and a senior data scientist at Deliveroo. She's also a senior lecturer in the economics department at the University of Exeter. Her research combines experimental methodology, game theory, and insights from psychology and sociology (that's a lot of ologies, I know), and she holds a passion for bringing behavioral science out of academia and into the real world to drive consumer behavior. Can we give a round of applause for Cecilia, please?

Thank you guys. Is the mic working? All right, cool. I think I'm now officially the major obstacle between you and lunch, so I hope the obstacle will be fun, that you learn something in the process, and that you have something to take home with you.

So what am I going to talk about today? We're going to talk about how to run an experiment, and I will also tell you a story about how not to run one.

First, a little bit about me. Thank you for the introduction. I'm a senior lecturer at the University of Exeter, where I research and teach behavioral economics and experimental economics, so in my day-to-day life I run a lot of laboratory experiments, field experiments, and online experiments. For those of you who aren't familiar with taking subjects into a laboratory and running experiments on them, maybe this is what you think experiments in economics look like. Or maybe something like this, because we have to put a little bit of money in, on the economist's view: how much money are you willing to pay to save a mouse's life? This is not a joke, it's a real study, and the mouse will die if you don't pay. It's a very interesting study, and we can talk about it later if you're interested. What usually happens is a bit more mundane than you might imagine: when we run laboratory experiments, we invite subjects into the lab, where they make decisions independently in their own compartments, usually over a computer. So it's not quite that cool.

I also want to highlight that the more familiar term for what I call experiments is A/B testing. An A/B test is a kind of experiment: in an A/B test you have a control and a treatment, while in a typical experiment there may be more variants being tested simultaneously, but essentially they are the same thing.

So what am I going to do today? Like I said, I want to tell you about a particular type of experiment: experiments with imperfect compliance.
I will also tell you a story, an example of how I applied those tips to a real experiment we ran, and where I messed up. Hopefully by the end you'll feel comfortable with making mistakes along the way.

So what are experiments with incomplete compliance? On the left here is the usual type of A/B test we're all very familiar with. When you have a new feature, you usually want to test it against the current version: you randomly assign some units to the control condition, the existing version that your users experience, and you randomly assign the others to the treatment condition, where they experience the new feature. That's the common form of experiment we all know.

But there are plenty of occasions, a different type of experiment, where there is potential non-compliance: you still allocate units randomly to the control, but the units allocated to the treatment are not necessarily all treated. That is what we call an experiment with non-compliance.

What do I mean? A couple of examples to help you get the concept. Think about a clinical trial. When you run a clinical trial you have a randomized controlled trial: some patients are allocated to the placebo and some to the actual drug. Among the patients allocated to the drug, some may or may not actually take it: they forget, or for various other reasons they fail to take the drug. That's a case with non-compliance.

Another example, probably a bit more relatable: suppose some agency wants to test whether a particular health regimen helps treat a particular disease. Patients allocated to the treatment condition are given instructions: things to eat, times to exercise, step counts to hit every single day. Now think about it for yourself: are you seriously going to follow that regimen religiously? Probably not. That's another example where, even if you're allocated to the treatment condition, you're not necessarily treated.

The more concrete example relevant to what we do is a feature A/B test. You might release a new feature, a new experience, a new journey for your users, but it's not necessarily the case that all the users allocated to the treatment are actually treated, actually using the feature.

So, coming back: that's the typical picture. You have your subjects in the control and your subjects in the treatment, but not all of them are actually using or experiencing your feature. In these cases, what people normally do is calculate the average treatment effect the usual way: take the outcome metric you really care about, compute its average in the treatment, subtract its average in the control, and call it a day.
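To make the setup concrete, here's a minimal simulation of a two-arm test with one-sided non-compliance. This is my own illustration, not from the talk; all the numbers (30% take-up, a true effect of +2.0) are made up.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 10_000

# Random assignment: 0 = control, 1 = treatment.
assigned = rng.integers(0, 2, size=n)

# Only ~30% of units would ever trigger the feature ("compliers").
complier = rng.random(n) < 0.3

# One-sided non-compliance: treated only if assigned to treatment AND a complier.
treated = (assigned == 1) & complier

# Outcome: the feature shifts the metric by +2.0, but only for treated units.
y = rng.normal(10.0, 5.0, size=n) + 2.0 * treated

# The naive "call it a day" estimate: difference in means by assignment.
naive = y[assigned == 1].mean() - y[assigned == 0].mean()
print(f"naive difference in means: {naive:.2f}")  # roughly 0.6, not 2.0
```

The naive estimate comes out around 0.6, a third of the true per-treated-unit effect, which is exactly the dilution discussed next.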
The problem is that when not all of your units are actually treated, what you're calculating is not the average treatment effect; it's what we call the intent-to-treat effect. All those other units never experienced your feature, so they were never really treated, and when you take the simple difference in means between treatment and control, you're not really measuring the impact of your feature, the improvement your feature is bringing to your company.

You can do that, but I want to highlight why you may not want to settle for the intent-to-treat effect. Remember, at the end of the day, as a data scientist, when you run your experiments you need to make recommendations. You want to tell the business stakeholders that the feature your engineers worked so hard to release is actually making a difference; we want to help our engineers show that their work is making an impact, and our analysis is essentially the way to achieve that. But in this scenario, if you calculate your effect by lumping everyone in the treatment together and comparing them with everyone in the control, you're not doing your engineers any good, for two reasons.

First, the simple difference in means, the intent-to-treat effect, dilutes the impact of your feature. Look at those units: they never experienced the feature, and by putting them into your sample you dilute the real impact your feature could have had, so you get a smaller effect size. Second, all those extra units have variance of their own, so when you do the comparison you also end up with a much bigger variance. A diluted effect size together with a bigger variance means a much bigger p-value, and it becomes much harder to reach statistical significance. And that, I want to argue, is something you don't want.

So what should we do instead? There are a couple of solutions; some are more elegant, and some are more universally applicable. The first solution is analytical. The intuition: you still calculate the intent-to-treat difference, but you scale it up by dividing by the percentage of your subjects who were actually treated. Essentially we're saying: we calculated a diluted effect, but we recognize it really comes from a limited subset of the sample, so we need to inflate the effect size by dividing by the share of units actually treated. This drastically bumps your estimate back up, and the corrected quantity is what we call the local average treatment effect. That's the analytical intuition, taking into account the percentage of units actually treated, and this simple division gives you a point estimate.
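Here's that correction as a minimal sketch on the same simulated data as above (again, my own illustrative numbers). Dividing the intent-to-treat difference by the take-up rate in the treatment arm is the Wald estimator, and it's valid here because the control arm has no access to the feature at all (one-sided non-compliance).

```python
import numpy as np

rng = np.random.default_rng(0)
n = 10_000
assigned = rng.integers(0, 2, size=n)
complier = rng.random(n) < 0.3
treated = (assigned == 1) & complier
y = rng.normal(10.0, 5.0, size=n) + 2.0 * treated

# Intent-to-treat: difference in means by *assignment*.
itt = y[assigned == 1].mean() - y[assigned == 0].mean()

# Share of the treatment arm that was actually treated (take-up rate).
take_up = treated[assigned == 1].mean()

# Local average treatment effect: scale the diluted ITT back up.
late = itt / take_up
print(f"ITT: {itt:.2f}, take-up: {take_up:.1%}, LATE: {late:.2f}")  # LATE near 2.0
```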
If you want to go the whole shebang, the more appropriate way is an instrumental-variables estimator: you use the random assignment as an instrument for the actual treatment status, and that gives you the effect size for the case where only some of your units were treated. We're not going to go deep into the details here; you can download a couple of packages and use something like ivreg.

There's also another way around this problem. The analytical route asks: given the data you already have, how do you process it? The other route is to think, before you run the analysis, about how to design your experiment so that the problem is taken into account from the start. If potentially not everyone will actually be treated, you want to design the experiment so that, even within the control condition, you can identify the subjects who would have been treated had they been put into the treatment condition. That is, you create a counterfactual inside the control condition. Once you can identify the units in your control that could have been treated, the next step is simple: drop the never-treated units from both control and treatment, and focus only on the control units identified as would-have-been-treated versus the units that were actually treated. This is more elegant in the sense that the analysis becomes extremely simple: compare the averages in the treated group versus the untreated counterfactual, the same difference-in-means mechanism and steps you'd follow anyway.

So those are the two ways to handle a situation where your feature doesn't have 100% take-up.
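A sketch of that design-based route on the same simulated setup, assuming we managed to log a would-be-treated flag in both arms (the flag and variable names are my invention). Keep only the flagged units and compare means, here with a plain Welch t-test:

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
n = 10_000
assigned = rng.integers(0, 2, size=n)

# Logged in BOTH arms: would this unit trigger the feature if treated?
would_be_treated = rng.random(n) < 0.3

treated = (assigned == 1) & would_be_treated
y = rng.normal(10.0, 5.0, size=n) + 2.0 * treated

# Drop every unit the feature would never touch, then compare the
# actually treated against the would-have-been-treated controls.
t_grp = y[would_be_treated & (assigned == 1)]
c_grp = y[would_be_treated & (assigned == 0)]

effect = t_grp.mean() - c_grp.mean()
t_stat, p_val = stats.ttest_ind(t_grp, c_grp, equal_var=False)
print(f"effect among compliers: {effect:.2f} (p = {p_val:.3g})")  # near 2.0
```

Note that the never-treated units still exist in the logs; they're simply excluded from this comparison, which is exactly what shrinks the variance.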
The next thing I want to do is tell you an example of how I used this design to solve an incomplete-compliance problem, how I stumbled through it, and hopefully you'll learn something from that.

When I joined Deliveroo, I was essentially advertising to my team that I'd done years and years of experiments, that I was the expert they should listen to when running experiments. That's the impression I set. Then, one month in, we had this problem: some of our restaurant partners would give a promised delivery time that was way too ambitious. These partners usually set a fixed delivery time, and in a lot of cases, if you live too far away from a particular restaurant, the time set by the restaurant is simply not realistic. The problem became particularly urgent when our fearless leader raised it on the company's Workplace, highlighting that this really was a problem and that we should look into it. So, after a little push from the higher execs, we thought: OK, this really is a big problem, we need to jump into action and fix it.

So what was our solution? We wanted to manage the negative side of the inaccuracy in the promised delivery times. We implemented an algorithm such that, if the delivery time a restaurant promises to a customer is shorter than the time we estimate it takes to travel to that customer, we replace the promised delivery time with the travel time instead. Essentially, if a restaurant is too far away from you and promises a ten-minute delivery, we show you the actual travel time instead.

Once we had the solution, it was time for the experiment, and this is exactly a scenario with imperfect take-up. Remember, our algorithm doesn't modify all delivery times; it only modifies a subset of them. If you're a restaurant partner who is particularly ambitious, saying "I can deliver anywhere in London within five minutes", you're very likely to be modified. But if you're a partner who promises a more reasonable time, we won't modify your delivery time at all. So from the way the algorithm operates, you know that even a user allocated to the feature will not necessarily experience it: if you live very close to a restaurant, and that restaurant is reasonable in setting its delivery time, our feature will never fire for you. You're allocated to the feature, but you're never treated. As a matter of fact, a lot of our users never experienced the new feature we were trying to test.

So what did I do? Following my own recommendation, I wanted to solve the problem at the root, by design. We ran the algorithm for subjects in both the control and the treatment, but for the control units we only logged whether their delivery time would have been modified; when they actually went through the app and the ordering experience, they never saw the updated delivery time. That's what I mean by identifying the units that would have been treated had they been allocated to the treatment condition: we log them, but we never show them the updated time. On the treatment side, we know which units were actually treated, whose delivery times really were modified. This way we identify both the counterfactuals that could have been treated but never received the treatment, and the units that actually were treated. Then, in the analysis, you kick everyone else out, compare just those two groups, and voila, you have your treatment effect. So that was the plan, and I thought it would work perfectly.
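Here's a sketch of what that counterfactual logging can look like in the serving path. This is my reconstruction from the description above, not Deliveroo's actual code, and every name in it (DeliveryQuote, quote_delivery, the field names) is hypothetical.

```python
from dataclasses import dataclass

@dataclass
class DeliveryQuote:
    shown_eta_min: int   # what the customer actually sees
    would_modify: bool   # logged in BOTH arms: this is the counterfactual flag
    arm: str             # "control" or "treatment"

def quote_delivery(promised_min: int, travel_min: int, arm: str) -> DeliveryQuote:
    """Run the correction logic for every user, but only *show* it in treatment."""
    # The feature fires only when the promise is shorter than the travel time.
    would_modify = promised_min < travel_min

    if arm == "treatment" and would_modify:
        shown = travel_min    # actually treated
    else:
        shown = promised_min  # control, or a promise we leave alone

    return DeliveryQuote(shown_eta_min=shown, would_modify=would_modify, arm=arm)

# A control user still gets flagged, but sees the original promise.
print(quote_delivery(promised_min=10, travel_min=25, arm="control"))
```

One thing this sketch quietly lacks is any user or order identifier tying the logged flag to the eventual outcome, and that omission is exactly what the next part of the story is about.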
Well. As a data scientist, one of the important things is that you need to log your data, and you need to communicate to your engineers exactly which data to log. I told my engineers we needed to identify the counterfactuals; I made sure I could identify the units in the control that would have been treated had they been in the treatment, and we also logged what happened to the users who actually were treated. I checked all of that, decided the test data was looking good, everything seemed to be logged, and we rolled out the experiment.

The only thing is, six weeks later, I found out that although I thought I had logged everything, I hadn't: I never logged the identifier that would link the treatment allocations to the outcomes. So I knew whether you were in treatment or control, and, even when you were in the control, whether you would have been treated or not, but I hadn't logged anything that would let me link those units to the outcome metric I actually wanted to measure. When I looked at my data, I realized I had two piles of data and no connection between them. At that point it was a little embarrassing, to be honest: you come in saying you're the expert, and when you get to the analysis you realize you can't actually do it, because you can't link the two sides together.

I was struggling, and thought about going back to the team and asking them to rerun the whole experiment, but with the pressure from the execs, that wasn't a great idea. Ultimately we came to a satisfactory end: I was able to use timestamps to link the assignments to the outcome variable. It's not a perfect solution, because matching on timestamps isn't exact, but hey, I'm not in academia anymore, am I? As long as it's good enough, satisfactory enough, we should call it a day. So that's what I did: instead of asking our engineers to rerun it, I matched the data at a good-enough level, and we rolled out the feature. It turns out the impact was actually quite big, once you're able to isolate the units actually receiving the treatment.
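For what it's worth, here's a sketch of that kind of timestamp matching with pandas. This is my illustration, not the actual pipeline; the column names and the tolerance are invented. pd.merge_asof joins each outcome row to the nearest-in-time assignment row, which has exactly the "good enough, not exact" flavor described above.

```python
import pandas as pd

# Two piles of data with no shared key, only timestamps.
assignments = pd.DataFrame({
    "ts": pd.to_datetime(["2019-05-01 12:00:01", "2019-05-01 12:00:07"]),
    "arm": ["treatment", "control"],
    "would_modify": [True, True],
}).sort_values("ts")

outcomes = pd.DataFrame({
    "ts": pd.to_datetime(["2019-05-01 12:00:02", "2019-05-01 12:00:06"]),
    "orders": [3, 1],
}).sort_values("ts")

# Nearest-timestamp join; rows further apart than the tolerance stay unmatched.
linked = pd.merge_asof(
    outcomes, assignments, on="ts",
    direction="nearest", tolerance=pd.Timedelta("5s"),
)
print(linked)
```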
So that's the story. What's the takeaway? Log your data. If you know you need the data, log it. If you're unsure whether you need the data, log it. If you think you won't need the data today, log it. Please. That's the first takeaway: always log your data. The second thing: it's actually quite helpful to think through the type of analysis you'd like to run beforehand. If I had walked through the analysis I wanted to run, I would have discovered there was a gigantic missing link, and I would have called it out to our engineers. So don't follow my steps on that one: go through the analysis you'll have to do, and look for any missing data. That's that, and thank you very much.

[Host:] That was really good, thank you. And remember: log your data. We're not passing the mic around for questions, so if you could stick your hand up, Cecilia will do her best to navigate around the room. I think that gentleman at the back, please.

[Audience: why not keep all the units, run over the same time period, and use everything to estimate the variance, best case and worst case, rather than throwing data away and shrinking your sample?]

So, essentially, we're not restricting the data. When we run the experiment, we capture data not just for the treated and counterfactual units; we capture it for everyone. It's just that once we have the data and run the analysis, we exclude the observations from the never-treated units and focus on the others. Yes, if you include everyone you have a bigger sample size, but you also have a much bigger variance, and those extra observations aren't actually useful, especially the never-treated ones: they never experienced the feature, so you wouldn't anticipate their behavior changing. Including them as treated just adds a bunch of useless data that creates a lot of variance. So I'm not saying don't keep their data; the data is still there. It's just that when you run your analysis, you kick those units out, because they don't contribute to understanding the actual impact of your feature. Does that answer the question? Cool, thank you. All right, that gentleman there.

[Audience: I'm not sure what the point of splitting into control and treatment is. Why not run the old algorithm as a baseline and the new one on the same population at different times, and compare, say, the mean squared error of each against the actual delivery time, to see which is closer?]

I think there's just a little bit of uncertainty there. In our particular example it's actually not clear, ex ante, who is going to see the feature, so there's uncertainty about how many people would actually be treated versus not. Maybe I'm not understanding your question correctly... Yes, that's the algorithm: the people in the treatment actually receive the more accurate delivery promise, while for those in the control I know who would have been receiving the more accurate time, but I'm not showing it to them. That's the way to create the counterfactual: I'm not really releasing the feature to the control units, they never experience the new delivery time at all, but at the back end I know that, had I placed them in the treatment, they would have been treated, while in fact they never received the treatment. Does that make sense? We can talk about it more later. Cool, yeah, go for it.

[Audience: can you walk through how the instrumental-variables estimation works here?]

Yes. In the first stage, you regress the actual treated status, whether or not you actually received the treatment, on the assignment, whether or not you were in the treatment condition. In the second stage, you regress the outcome variable on the fitted values from that first stage. So the random assignment is the instrument for the actual treated status: it's correlated with whether you're actually treated, but it affects the outcome only through the treatment itself, and that's why it's a valid instrument.
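A minimal sketch of those two stages in plain numpy, on the simulated data from earlier (illustrative only; in practice you'd reach for a package such as ivreg in R or linearmodels in Python, not least because this naive second stage does not give you correct standard errors):

```python
import numpy as np

rng = np.random.default_rng(0)
n = 10_000
z = rng.integers(0, 2, size=n)               # assignment: the instrument
complier = rng.random(n) < 0.3
d = ((z == 1) & complier).astype(float)      # actual treated status
y = rng.normal(10.0, 5.0, size=n) + 2.0 * d  # outcome

# First stage: regress treated status d on assignment z (plus a constant).
Z = np.column_stack([np.ones(n), z])
d_hat = Z @ np.linalg.lstsq(Z, d, rcond=None)[0]

# Second stage: regress the outcome on the fitted treated status.
X = np.column_stack([np.ones(n), d_hat])
beta = np.linalg.lstsq(X, y, rcond=None)[0]
print(f"2SLS estimate of the treatment effect: {beta[1]:.2f}")  # near 2.0
```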
Yeah, go for it.

[Audience: did you run this for a particular restaurant? When you evaluate which people would be treated, aren't you just picking the ones close enough to one specific restaurant?]

It's all the restaurants, not a particular one. The algorithm essentially calculates, for any pair of customer and restaurant, the travel time and the promised time, and whichever one is bigger is the one we show.

[Audience: and on the other side?]

We don't touch that side; we only cut down on delivery promises that are too ambitious, the undeliverable ones. If it actually takes only ten minutes to deliver to you but the restaurant says "we'll deliver to you in twenty minutes", we don't modify the promise.

[Audience: I'm still trying to say: what if the person never actually buys from that restaurant? You've put people in your counterfactual who are flagged for restaurants that are too far away for them, restaurants they'd never buy from.]

So, in the end, what we care about, the outcome variable for this particular experiment, is how many orders you receive. The thinking is: in the control condition we're not making modifications, while in the treatment condition we're improving the delivery promise, so when users see places promising to deliver much sooner, we should see an increase in order volume. And that's what we found. But, and I want to highlight this, the effect was only evident when we compared the actually treated units against the would-have-been-treated counterfactuals. If we had run the analysis taking everyone in the control and comparing them with everyone in the treatment, there's no effect; there's just too much noise. That's why you need to focus on the would-have-been-treated versus the actually treated to identify the real impact.

[Audience: but won't there always be a restaurant far enough away for someone to be potentially treated?]

Not necessarily, but either way they're equally likely in both situations, because it's a random assignment. A given type of customer-restaurant pair, say one where the customer is super far from the restaurant, should be equally likely in the control and in the treatment; it's just that in the treatment condition some of them actually receive the treatment. That's the point.

Cool, all right. Yes, please... I don't know who went first. All right, please.

[Audience: how do you know you actually have an imperfect-compliance problem, and that the split is behaving as intended?]

Good question. The algorithm runs for both the control and the treatment, and because the assignment is random, the percentage of units flagged, treated on one side, would-be treated on the other, should be comparable in both arms, which is what we found in the data as well. With a well-randomized sample and the same algorithm on both sides, you should anticipate the same percentage of potentially treated units in each arm. And we did, so it was fine. Cool. Yes, please.
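That kind of sanity check is easy to automate. Here's a sketch (my own illustration, with made-up counts) that compares the flagged share in each arm using a two-proportion z-test from statsmodels:

```python
from statsmodels.stats.proportion import proportions_ztest

# Units flagged by the algorithm (would-be treated / treated), per arm.
flagged = [2_970, 3_040]     # control, treatment (made-up counts)
arm_sizes = [10_000, 10_000]

stat, p_value = proportions_ztest(count=flagged, nobs=arm_sizes)
print(f"flagged share: {flagged[0] / arm_sizes[0]:.1%} "
      f"vs {flagged[1] / arm_sizes[1]:.1%}, p = {p_value:.3f}")
# A large p-value is consistent with a healthy randomization.
```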
[Audience: given that you designed the experiment to measure the difference between treated and not treated, how hard would it be to also evaluate the side effects, for example on the number of sales elsewhere?]

So we're talking about potential spillover. Good question, and we did look at that as well. Essentially there are two things here. Under our algorithm, some restaurants had their displayed delivery promise corrected, while in the control conditions it's just business as usual. What we observed is that the restaurants left displaying the shorter delivery times indeed saw a very significant increase in their order volume. But in the follow-up analysis, what we found is that it's really a substitution: the restaurants whose promises were corrected, what we call our core restaurants, had a slight reduction in order volume, and the total order volume between the two conditions was comparable. You just get a slightly different cut of the pie. As a company, this particular feature didn't really change the total pie we were getting. So there is spillover, and we were able to identify it, but only because we compared the treated units against the would-have-been-treated ones; that's what gives you the much greater precision. Cool, good question, thank you. Yes, the gentleman there.

[Audience: what if the mechanism separating treated from untreated isn't very good at identifying the two groups?]

Very good question, thank you very much. One thing I think I forgot to highlight is precisely this. Generally, the analytical solutions are more robust: if you apply the correction, by and large you'll recover the real effect. The experimental-design solution is elegant, but it only applies in particular situations. In our situation there's no selection into the treatment per se, because it's an algorithm: you either trigger it or you don't. It's not that a particular type of user chooses to be treated. That's why the control-versus-treatment comparison is clean; there's no selection going on. So that's the caveat for the design approach: you have to think about what's actually causing your imperfect compliance. If it's self-selection, you cannot use the design to differentiate the groups. You could try to run some sort of propensity-score matching to find the control units that could have been treated, but that rather defeats the whole purpose; you might as well go back to the analytical solution that simply corrects for the percentage of units actually treated.

Generally, though, coming from a background of running experiments, I personally prefer the design route when you can take it. I know there are a lot of fancy methods out there for identifying causality, but if you can identify the causality through the experiment itself, at the core, that's a far better solution than ex-post analysis trying to figure out which model can disentangle the nitty-gritty. When you've been in this business for a long time, you know that the effect you estimate can vary quite a bit depending on the model you choose, so the purely analytical solution is always a little less trustworthy than doing it properly from the start: you run your experiment, you have the data, and you're not changing the data or how you analyze it. If the data is there, there's a standard method that everyone uses to calculate the treatment effect, and that's it.
I think that's the most robust way to deal with this type of problem. But thank you very much for your question. I see another gentleman back there.

[Audience: in hindsight, would you have gone back and rerun the experiment rather than relying on the timestamp matching?]

Yeah, good point. I know we could have gone back and just rerun it, but it was too late to do that, and I think our proxy using the timestamps was good enough. When we evaluated the experiment, the effect was within our expectations and made sense, so we made our recommendation to roll out the feature, and that was that; we packed it away. All right, and we have a question here, please. Yes, this lady here.

[Audience: thank you for sharing. How do you measure, or how do you define, your success metric?]

So when you're running experiments, usually before you even collect your data you have some hypotheses about how you anticipate your feature impacting the business. Depending on where you are, those outcome variables could be something as high-level as order volume, or other things that are impactful for your particular organization. You think it through, and before you run the experiment you make sure you know which metrics you'll evaluate against in order to make a recommendation. Those metrics should be things your organization cares about: even if a metric doesn't contribute directly to revenue, it could contribute on the cost side, or to user engagement; all of those can be your success metric. The key is to communicate with the PMs on your team, or with the more senior stakeholders, to see what they want and whether your particular feature can help your team achieve its goals. Pick the right metrics, the ones that actually matter, and run your analysis against them. If you have a significant result, you can make the recommendation: hey, I was able to scientifically, causally identify the impact, it's statistically significant, we're confident in the results, roll it out. OK? Cool, thank you for the question. Yes?

[Audience: related to that, you mentioned your metric was an increase in the number of orders. Do you know where that comes from? Is it that people cancelled less, or that there was more confidence because the estimated times seemed fairer, so people converted more?]

The reason we uncovered is that when a restaurant shows a shorter delivery time, it generally becomes a lot more attractive. I think everyone here is hungry and waiting for lunch, so you'll understand this: if you open a delivery app right now, you're obviously going to pick the restaurant that delivers in ten minutes rather than thirty. That's why we think that making a restaurant's displayed delivery time longer reduces its order volume, while restaurants showing shorter times pick up more orders. But essentially we didn't reduce anyone's delivery time; we only increased some of them, relative to the neighborhood around them.
Right. So the corrected restaurants may have received fewer orders, but those missed orders went to the other restaurants that could present a shorter delivery time. That's the spillover the other gentleman asked about. Thank you very much. Cool, thank you for the question. Please.

[Audience: did you measure customer satisfaction?]

We're not sure; we didn't measure that directly, because what we care about is whether you're willing to make more orders with us. I guess that's a good sign you're happy enough, because otherwise you wouldn't order from us. It would be interesting to capture, but customer satisfaction is hard to measure. How do you measure it? Do I give you a survey? Are you going to fill out that survey? And if you don't fill it out, where does that leave me?

[Audience: I'd expect people to order more from a restaurant because they know it promises and delivers on those promises. When you started describing the experiment, my first idea was that restaurants don't deliver on time, people aren't happy because their expectations aren't met, and so they order less. Whereas it now looks like your experiment was more about the distribution of the orders.]

Fair point. We do measure care contacts, though: we track how many contacts you send to order help, asking where your order is or complaining that your order is late. I guess that captures customer satisfaction to a degree. And what we found is that with this particular feature, when the delivery time is more reasonable, we did see a significant reduction in the care contacts complaining about the lateness of orders. So that somewhat measures customer satisfaction. Does that make sense? Cool, thank you for the question.

I think everybody is really hungry, so I need to release you. If you have any questions, please come and talk to me. Thank you very much.

[Host:] Thank you, Cecilia. Thank you.