Transcript for:
Creating Transformative Impact with AI in Food Delivery

[Music]

Thanks. Good morning — I hope nobody's too cold out there. Please feel free to just ask questions or interrupt me as you go. I know it's going to be hard for people to run around with microphones throughout the talk, but don't be shy: raise a hand, shout it out, and I'll be happy to repeat it back and make sure that everybody hears you. So don't be shy.

So how do you create transformative impact with AI? The answer is actually that you just stick to your fundamentals, and I'm going to go through what that means for a food delivery business today. Our offer to customers is that we want to give you your favorite restaurants, delivered fast to your door. It's a fun, light-hearted, very 21st-century value proposition: everything's convenient, it's just there for you — open your phone and you've got it.

What this looks like as a business problem is that you actually need to satisfy three different kinds of users, each of whom cares about very different things. They're all sorts of customers, but your first type of user is the one we tend to just call our customer — we focus on the eater. These are the people who are going to pick up their phones and press that they would like their food. Your next type of user is your riders — these are your couriers — and what they care about is making as much money as possible per unit time. Your third type of user is your restaurants, and what they care about is having a fast, reliable delivery partner that's going to help them put their best foot forward for their customers and create an awesome dining experience every time.

So in order to make these three kinds of users happy — and this is an image that's probably going to suggest a type of problem that's familiar to all of you — you have a very hard optimization problem. You need to look at all the orders that are coming in, you need to look at where they're all distributed in space, and then you need to
look at where all of your riders are distributed in space, and you need to look at where all the restaurants are distributed in space, and then you need to form the best matching, or assignment, of riders to orders.

The orders are coming in in real time, so this isn't the case of a postal system, where you get everything in a big batch and then have all night to process and think about it. The orders are coming in in real time, and you tell people you're going to give them their food in half an hour, so this is very, very quick — you don't have a lot of time to perfect or optimize things. And everything is uncertain. You don't know when you're going to get your next order request. You don't know if a rider that's logged on right now is going to stay logged on, or if other riders are going to log on. When you send a rider an order request, you don't know if they're going to say "this looks like an order I want to do" or "no thank you." When a rider starts moving, it could take them five minutes to go a distance, or it could take them twenty. So there's a lot of uncertainty and there are a lot of real-time aspects. This problem is sort of the cellphone-age nightmarish realization of a lot of textbook optimization problems, and you simply can't solve it. So the name of the game for the business is to find the best approximation it can, and the opportunity for the business is to find better and better and better approximations.

There are a lot of ways to tackle that hard problem — and to tackle the real problem, which is finding better and better approximations. We really took a page from the Y Combinator playbook, following this famous bit of advice from Paul Graham: one of the most common types of advice they give people at Y Combinator is to do things that don't scale. What that means is that to build something quickly that solved this big nasty problem — that gave us our first approximate solution — we didn't wait years and years to build a perfect
artificial intelligence system. Instead, what we did was hire a bunch of McKinsey analysts, give them Google Sheets, and say "figure it out" — and all of these people developed processes for doing this. That was a really successful way to get a company off the ground. In fact, we started in January 2013; this red dot here, in May of 2016, is when we really seriously started building automation. We got all that way through four rounds of funding, got launched in 12 countries, and at that time we were up to 75 cities — we're up to 200 cities now. We came a huge, huge, huge way by doing things that don't scale; Paul Graham would be proud.

And because we got so far before we really started to build AI and to build algorithms, this presents a really fun case study in the impact that building algorithms can have at a company that's already at international scale. That's why this is kind of a fun story that I get to share, and that's what I'm going to talk about: how you can get impact, and what we did to get that impact given that large opportunity.

I want to break this down: the impact of AI is made up of "AI" and "impact," so let's clarify terms. I'm sure I'm not the only person in the room who has no idea what artificial intelligence is. What I mean by artificial intelligence is what's sometimes called IA, or intelligent automation. We are talking about algorithms that automate a bunch of human decisions into an optimization algorithm, which then allows you to control and improve your processes over time.

And what I mean by that is that in May of 2016 we started with a system that was not automated. Our technology platform in May of 2016 was heavily reliant on manual input. Customers would place orders; sometime later, a restaurant worker would decide that it was time to call for a rider for the order, so a restaurant worker would press a button that would enter the order into the system. The system was basically not doing
anything with it — not thinking about the order — up until then; the restaurant worker would decide when to enter the order into the system. The order would then enter a first-in-first-out queue, and when its place in the queue came up, the system would just grab the nearest rider. There were thousands of restaurant workers and hundreds of delivery workers making the manual decisions that made the system work. As I just said, the restaurant workers made the critical logistical decision of when to send a rider. There were also hundreds of delivery workers that would also make that decision: they would watch individual orders, try to guess whether the restaurant had forgotten to make the decision, and then make the difficult call about whether it was time to override it. Then there were all the estimates. Some of the estimates were made on a moment-to-moment basis, so we had hundreds of workers making real-time decisions about what the network delay was, so that customers would experience their food as being on time. And then we had restaurant workers and delivery workers making hourly and daily and weekly decisions, all from different rationales, on what the food prep time estimate was going to be, and delivery workers making velocity estimates from a variety of rationales.

Here's what I mean by intelligent automation: we got to a place where there are no manual inputs into that system. All of the time estimates are made by machine-learned models, and all of those time estimates feed into an algorithm that's completely automated. Nobody's saying "I should send for a rider for this order now"; nobody's saying "I think the network delay is about six minutes right now" — it's all fully automated.

And here's the impact: this has totally transformed the way that we do business. I'm going to walk through these, but there's greater context that isn't even on the slide. All of this is against the backdrop of a hundred fifty
percent growth, and against a backdrop of eliminating tedious manual tasks that people had to be employed to do — and we're talking about hundreds or thousands of people. So against that backdrop of 150 percent growth — our network being pressed harder than it had ever been pressed before — and against that backdrop of people not having to do those tasks anymore, we made all three kinds of users happier.

From the customer side, late orders went down by 40 percent and order duration went down by 12 percent, so people are getting their food more when they expect it — more on time and, in general, faster. Rider-to-restaurant time went down 13 percent, so riders are just making more money. And late pickup was down 48 percent, so restaurants are getting a much, much more reliable delivery partner that they can depend on and are happy to work with.

So I've talked about what we did — what I meant by AI, going from that manual system to a very automatic system — and we've seen that it had a massive impact. The rest of the talk is about what we did to make that happen. As I said, to make this happen we really stuck to our fundamentals, and what I mean by that is incrementalism, a bias to simplicity, and rigorous experimentation. There isn't a silver bullet. We have not used cutting-edge deep learning techniques; we haven't used the latest libraries. It's just good old-fashioned scientific and engineering research methodology: incrementalism, simplicity, and rigorous experimentation. These things are all really important, as anyone will tell you, but the fun thing about our story is that you can actually see how they went into action.

So let's start with incrementalism. Remember, we were coming from here — a very manual platform — to here, a fully automatic platform. And we weren't building the fully automatic platform in a vacuum: it wasn't a research project, and it wasn't pre-launch development. We were building this at a company that was
already functioning and had very sharp, very fast competitors, so we needed to do this quickly. And keeping all of that in mind — there were hundreds of millions of dollars staked on beating the competition and really moving things forward — this is a great example of what it means to really push yourself to be incremental.

The first thing we did was look at the difference between where we were at and the goal state, and break that up into different themes, each of which is fairly self-contained. The idea is that there are three machine-learned models: the travel time model, the rider delay model, and the food prep model. Those models combine to form an objective function, and the objective function then plugs into a solver. When you have those three models, you put them together into an objective function, and you use a solver to minimize that objective function — you now have your optimization algorithm. So that takes everything that all those thousands of people were doing in distributed, non-centralized ways and puts it all together into this handful of themes that can then all be continually iterated on, optimized, and improved.

So first we identified all of the themes, and then we had to sequence them and decide which ones to work on first. The sequencing was really important for getting incremental value as we went. First we had to do the travel time model, the rider delay model, and the food prep time model, because everything else depended on those. We picked the travel time model first because you could already plug it into the existing system and begin to get a little bit of a win. Remember that the existing system just grabbed the nearest rider; you could have the existing system instead grab the rider that was going to get there fastest. You immediately begin to get wins there: the customer gets a little bit more accurate an estimate, and the
system gets a little bit more efficient. Plug in the rider delay model, and all of a sudden the customers get much better delivery time estimates, and they get frustrated much less often. But all of these machine-learned models had to go live before we could do the objective function, because the objective function is made of these things — you can't build the objective function without the models that make it up. And then of course the objective function itself has to go live before you can build the solver.

The solver, by the way, is not any particular textbook solver, but we borrowed really heavily from the vehicle routing problem literature. You can't find VRP solvers that are going to tackle one of these problems really, really well off the shelf, but a lot of those concepts are very powerful and very useful, so I'll refer to it throughout as a VRP solver — a vehicle routing solver. But that had to come last, after the objective function and all the machine-learned models.

Once we got this minimum viable product for everything, we could then decouple them all and parallelize — there are no longer interdependencies like there were at the beginning. Once you get to here, the horse is out of the barn: you can just let it run, and you can have multiple teams working on these in parallel.

One thing that's really important about this sequence is that every single step was a live experiment. That let us check that we were going in the right direction, and it also let us fail quickly, in the sense that we didn't spend a lot of time prototyping something, fussing around, seeing if it works, and then maybe experimenting on it — because the answer that you get with a live experiment is often very different from the answer that you get with offline analysis. Every single step here was a live experiment that generated value, or didn't.

So why did we bother breaking everything up like that? Why didn't we just do it right the first time?
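As an aside, that first increment — replacing "grab the nearest rider" with "grab the rider the travel time model predicts will arrive soonest" — can be sketched in a few lines. This is a toy illustration with made-up names and numbers, not the production system; `predicted_eta_min` stands in for the output of a hypothetical travel-time model:

```python
from dataclasses import dataclass

@dataclass
class Rider:
    rider_id: str
    distance_km: float        # straight-line distance to the restaurant
    predicted_eta_min: float  # output of a (hypothetical) travel-time model

def nearest_rider(riders):
    """The original heuristic: grab whoever is physically closest."""
    return min(riders, key=lambda r: r.distance_km)

def fastest_rider(riders):
    """The incremental win: grab whoever the travel-time model says will
    arrive soonest. Closest is not always fastest (rivers, one-way
    systems, traffic)."""
    return min(riders, key=lambda r: r.predicted_eta_min)

riders = [
    Rider("a", distance_km=0.8, predicted_eta_min=9.0),  # close, but across a river
    Rider("b", distance_km=1.5, predicted_eta_min=5.5),  # farther, but a straight run
]

assert nearest_rider(riders).rider_id == "a"
assert fastest_rider(riders).rider_id == "b"
```

Same dispatch loop, same interface (pick one rider from a list) — only the ranking key changes, which is why it could be plugged into the existing system for an early win.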
This is an impulse that I feel very deeply myself, and I expect a lot of people in this room feel it too. It's part of our discipline to tackle this impulse and recognize that it's a bit of a false question: there is no such thing as doing it right the first time. Doing it right the first time actually means doing it wrong again and again and again and again, as quickly as possible. You're always trying to do the least wrong thing, not the right thing.

And why is that? Our total dispatch algorithm is still getting large improvements two years later. That means we still haven't figured out how to do it right — so if we'd waited more than two years before ever delivering any value, that would have been really bad for the company. Getting those little bits of value along the way has really helped the company make it as far as it has.

Another thing is the competitive environment: on-demand food delivery is heating up, and there are a lot of entrants into the space. That means the expectations of restaurants and customers and riders are shifting somewhat radically, on a time scale of about six months. So if you spend two years trying to build the perfect thing, it'll be obsolete about 25 percent of the way through your build, and you will have wasted a whole lot of time and, again, delivered no value.

Finally, as I've mentioned a couple of times, this problem is just too hard to solve. You can't sit down with pen and paper, or even really extensive and elaborate months' worth of computer simulation, and determine what the right thing to build is. That's a nasty bit of industrial reality, but it's also the fun part, right? If you wanted to solve problems that you could solve without having to hack around, you could solve them in academia; the rubber hits the road here when it's too hard. What we have found to work is using a moderate amount of theory, a moderate amount of offline analysis and simulation, and
then leaning really heavily on just trying it — putting it out there and experimenting on it. Those two things together helped us to go fast. We would go nowhere if we tried to solve a problem that's totally beyond the state of the art in difficulty with just our brains; we're not that smart.

Sure — so the question was: can you anticipate when the market is going to change and bake that into your development cycle, or do you really just need to react to market changes? I think the notion of a market change here is basically a placeholder for the fact that you can't anticipate what competitors or the market are going to do. You simply don't know what Just Eat is suddenly going to announce in their quarterly statement; you simply don't know what Uber Eats is going to do. So no, you really can't bake it in. In fact, the way that you do bake it in is by saying: we're simply not going to have a two-year development cycle. We're baking in a bit of temporal discounting by saying, if you can't tell me what the value of this is going to be in three months, I'm going to have a real hard time believing that it's delivering value in two years. Does that make sense?

Yes — I mean, that's where it's fun, right? That's the secret sauce, and that's the hard part. You have to use judgment, you have to use experience, you have to have a team of really smart people that are asking themselves the hard questions all the time. I think that's the real answer: you get that team of really smart people, and then they have to say, it's not that we're trying to solve a theory problem, it's that we're trying to solve a really, really hard business problem. And they have to use their judgment — and learn that judgment by trial and error — to make sure that they're asking the right things. Does that
make sense?

Sure — so the question was about the optimization algorithm: how do you iterate on it, how did you pick it, did you go greedy from day one, or what was the sequence of things that you tried? I'm going to talk a little bit more about the optimization later — I'll say a bit about it in a couple of slides — so maybe we can come back to that.

So that was incrementalism. Let's talk about the bias to simplicity. This one is really nasty too. To the question in the back: it's cute to say "incrementalism," and it's also very easy to say it — but what does incrementalism actually mean? I think the bias to simplicity is the worst of these three in that regard. Every single person in this room will tell you that simplicity is the most important part of the job, and that if you're doing things in a way that's too complicated, you're not going to succeed. But every single person in this room has a totally different idea of what simplicity means. Simplicity is like beauty: it's completely in the eye of the beholder. So I want to offer you some judgments that reflect our simplicity aesthetic, and talk about what those mean to us and how they've made us successful; you can use that as you like. I'm breaking this up into machine learning and optimization, which are very different areas.

The simplicity in machine learning is going to be a little bit technical — feel free to check your email if you're not interested in the technical part; I'll give you a warning in a couple of slides when we come back down to the ground. OK, so: we still use the generalized linear model. We still use logistic, Poisson, and Gaussian regression for absolutely everything. You could call these models classic; you could also call these models ancient. Why are we using these instead of deep learning, or at least trees? The reason is that with really careful sampling and feature engineering, you kill two birds with one stone. One is that you understand your problem very
deeply — you gain domain knowledge that's going to accelerate your trajectory forward — and at the same time you get really good features and really good training-set samples that make these classic (or ancient) models work very well. If you put in all that work to gain insight about your problem, which you're probably going to have to put in anyway, these models are going to be really hard to beat. They'll obviously be beaten eventually, but two years in, we haven't beaten them yet. If you really roll up your sleeves, put your statistician hat on, and really get down in the mud, the generalized linear model is just real hard to beat. We haven't gone beyond it yet — we will eventually. We try trees every now and then; we've looked at a deep neural network — haven't really gone for it, but we've taken a look. It just hasn't paid for itself yet.

Yes — the question was about XGBoost. We have used XGBoost, and it's really true, it can be great in a lot of cases. However, sometimes you can simply do surprisingly well with linear regression. Another thing, in terms of speed and really getting things going — getting that value per unit time: you can have data scientists set up a training pipeline in R, and they are very, very fast at doing model selection in R. Then you can set up a training pipeline in a mixture of Python and R and plug it into your ETL. That's not sexy, glamorous technology, but the fact is it's real fast — you can get a linear model retraining like that, super fast. And then, because it's just a Gaussian linear model, engineers can hard-code the coefficients into their production environment of choice, so the port from your training environment to your scoring environment is incredibly easy with the Gaussian model. If you go with a random forest or a gradient-boosted tree ensemble, you're going to have to get some kind of library, or you're going to have to have engineers really buckle down in order
to do that — and the library ports can be flaky. So beyond model selection, there are other parts that make up the hurdle — the linear-model hurdle — that other methods have to clear, and we've just found that we've gotten really far and done a lot of good stuff without other methods surmounting that hurdle yet.

OK: we overuse linear models. We overuse linear regression. We use linear regression for at least one quantity that is time — and time, of course, is only positive, while Gaussians can be anywhere on the real line. It works really, really well. Technically we should be using something like a gamma regression, or at least a Poisson regression, but it's so easy and so fast to fit the linear model that you have to go pretty far along your development cycle before a really hard-nosed look at model selection says the other models pay for themselves. In a lot of cases, obviously, this doesn't apply at all — we would not advocate fitting a binary outcome with a linear model — but don't be shy about using linear models for all kinds of positive quantities like time; at least give it a try.

We violate "independent and identically distributed" — we violate IID all over the place, and we just don't worry about it. We have lots of regularization, we have empirical validation methods, we have lots of data. It's fine — check it — but whitening your data, or finding ways to resample it so that it's actually IID, is very, very, very famously difficult, and, you know, don't kill yourself on that.

Sure — the question was, do we not do any scale normalization before training a model? Do you mean scale normalization as in making sure all of your features are on the same scale? We do do that; that's a bit of a separate issue. We definitely do normalize predictors and things like that.

Sure — so all of these things are rubrics that we found to work by checking outcomes. These are not principles; these are not theoretical ideas. I mean, certainly don't
take them as principles — these are anti-principles. I'm saying these are places where we have found that the outcomes are fine.

And finally, we lean really heavily on empirical validation. This is probably not surprising to a lot of people in this room, but closed-form analytical validation statistics are classic parts of theory that are very difficult to get to work in these settings: we're breaking the IID assumption, we're using regularization. It's very hard to get p-values, or your F-change statistic, or your Akaike information criterion to work in these settings — and you just don't need to. You have reams and reams of data a lot of the time; use that data to empirically validate your models.

Yep — so with predictive models we use a two-stage validation process. First of all, we're never doing predictive models for their own sake; as mentioned, we only care about predictive models as they plug into a system and give that system better operational characteristics. So, two-stage validation: first we do offline validation, just to give ourselves a good back-of-the-envelope check, and then we plug the thing into the algorithm and run an experiment on it, to see if the algorithm actually makes customers and riders and restaurants happier, with business metrics.

I think you're asking about how we check things in the offline validation stage. For predictive models we lean really heavily on Hastie, Tibshirani, and Friedman's one-standard-error rule. We will find ways to carve up the validation set into approximately independent and identically distributed objects, and we will compute the mean validation metric and the standard error of that validation metric across all of the validation-set objects. Then we'll pick the model — the technique, whether it's a tree or a linear model, and if it's a linear model, what the feature set is — that gives the best
validation metric — that is, the simplest model within one standard error of the most... rather: the simplest model that is within one standard error of the best validation-metric result of the more complicated models. Wow — no, don't clap, I messed that one up. But it's Hastie, Tibshirani, and Friedman's rubric, which they call their one-standard-error rule. It's in The Elements of Statistical Learning — it's a PDF online; good book.

Sorry — I think there was a question behind you. Sure: so all of our offline validation is the one-standard-error rule on an empirical holdout set. We never take that at face value. Even if it's something like the travel time model, where you'd think it's obviously going to work if it's more accurate offline, we always plug it in and run a live experiment on it. We've found many times that when you get a more accurate offline machine learning result and plug it into a live, entire system, the system gets worse. So, two stages.

Yes — we have not yet gotten to the point where, for example, we've used XGBoost, found that it works better by more than one standard error on a holdout set, and then said, "well, it's just too complicated, we're not going to implement it." In fact, XGBoost hasn't gotten to one standard error over the generalized linear model yet. Stacking or blending after regression? Do you mean like fallback models? Yeah — all of our models are actually a series of fallback models.

OK, so, machine learning: we don't cut corners on feature engineering or regularization. Feature engineering is where a lot of this actually succeeds: that's where you put in your domain knowledge and your sampling and all the unsexy, roll-your-sleeves-up fundamentals, and that's where it works. I'm going to hold off on questions until the end, unless something's really unclear at this point, because we're going to run out of time. OK, so, in optimization: two points.
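The one-standard-error selection just described can be sketched as follows — a toy illustration with made-up model names and validation numbers, assuming per-object errors have already been computed on a carved-up holdout set:

```python
import math

# (model name, complexity rank, per-object validation errors) — toy numbers.
candidates = [
    ("linear, 5 features", 1, [4.1, 3.9, 4.3, 4.0, 4.2]),
    ("linear, 50 features", 2, [3.8, 3.6, 4.0, 3.7, 3.9]),
    ("gradient-boosted trees", 3, [3.2, 4.2, 3.4, 4.0, 3.7]),
]

def mean_and_se(errors):
    """Mean validation error and its standard error across holdout objects."""
    m = sum(errors) / len(errors)
    var = sum((e - m) ** 2 for e in errors) / (len(errors) - 1)
    return m, math.sqrt(var / len(errors))

scored = [(name, rank, *mean_and_se(errs)) for name, rank, errs in candidates]

# The best (lowest) mean error, and that model's standard error.
best_mean, best_se = min((m, se) for _, _, m, se in scored)

# One-standard-error rule: among models whose mean error is within one
# standard error of the best, pick the simplest (lowest complexity rank).
chosen = min((s for s in scored if s[2] <= best_mean + best_se),
             key=lambda s: s[1])
print(chosen[0])  # → linear, 50 features
```

With these numbers the boosted trees win on raw mean error, but the 50-feature linear model sits inside the one-standard-error band and is simpler, so it gets picked — that's the whole point of the rule.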
First, we don't use a really fancy solver. The vehicle-routing-problem-ish problem that we solve is an integer optimization problem, which is a very, very difficult class of problems to solve, and there are all kinds of state-of-the-art solutions with really fun and scary names, like matheuristics, or ant colony optimization, or particle swarm optimization. Each of these is basically hill climbing, and none of them works off the shelf for every problem; each of them requires really intense domain knowledge to get it working. So we kind of just said, well, we'll try straight hill climbing, apply all the domain knowledge, and see how far we get. That's exactly what we're doing: we're not using a genetic algorithm or particle swarm, we are literally using deterministic hill climbing, and then we're putting a lot into the domain knowledge that makes up the system. We've gotten really far with that — again, using empirical validation to make sure that we're not missing opportunities. That's not to say fancier techniques won't work in the fullness of time, but to the point we've gotten to so far, we haven't seen it.

The other thing is that we radically approximate the objective function. The objective function is an expectation — a statistical expectation. It's often something like the mean order duration across all orders, which we build up from a nonlinear combination of the machine-learned models — the travel time model, the prep time model — and those are themselves expectations, right? The travel time isn't exactly five minutes; that's an expectation. The nasty bit there is that the objective function is, mathematically, the expectation of a nonlinear combination of expectations — or rather, an expectation of a nonlinear combination — and that is not generally equal to the nonlinear combination of the expectations. So, for example, you might have the travel time model giving you an expectation of X, and the prep time model giving you an
expectation of Y, and you're taking the max of those — and the max of the expectations is not equal to the expectation of the max.

Instacart had a great blog post — it was really pretty cool — where they go and do it right: they have a big Monte Carlo simulation, and they compute that expectation of the nonlinear combination through brute simulation force. And it's great, it's really nice. But statisticians have been grossly abusing expectations and grossly abusing these kinds of rules for a hundred years. Look at the method of moments: with the method of moments you're abusing these things all over the place. The derivation of the AIC looks at a single observation of a number and calls that the expectation of the number. And, you know, the world moves forward; things have not completely broken down. So we just borrow from that playbook. We're not as sexy as Instacart; we just say the expectation of the nonlinear combination is equal to the nonlinear combination of the expectations, and we've gotten pretty far with that.

Finally, we don't cut corners in optimization on the neighborhood operators. For the people in the room who are more familiar with machine learning than optimization: neighborhood operators in optimization are analogous to features in machine learning, so there's an analogous truth here. Just like in machine learning you really don't go small on feature engineering, in optimization you really don't go small on neighborhood operators. This is where you roll up your sleeves, really think hard about the problem, and don't take things off the shelf at face value. That's been very, very successful.

OK, this is last but definitely not least: we have rigorous experimentation. This would have been the first part of the talk, because in many ways I think it's the most important, but somebody told me that I could not go to a machine learning and AI presentation and start with experimentation or I would just lose the room. This is the most important. As I said in the
incrementalism section, you simply cannot theorize about the right answer beforehand. It's too complicated; even if you could theorize about it, it would take years of research, and we don't have years. So what you need to do, what we've found to be very successful, is you do a bunch of pretty smart things and then you just try them and see what happens. In order to do that, you need a really rigorous protocol for trying them: you need to measure live business metrics, not offline machine learning results (they can go in opposite directions); you need to account for network effects; and you have to have causation. You can't do a before-and-after experiment, because you can't tell whether the before-and-after difference was causal.

So, quickly going through this: let's say we just tried to do an A/B test on the orders, right? Some orders get algorithm A, some orders get algorithm B. That's actually nonsensical. You have a bunch of riders that are all together in one network; it doesn't even mean anything to have two algorithms in, like, a race condition to assign each of the riders; your world would sort of collapse. But let's say it didn't: the knock-on effects from one of these algorithms being faster would spill into the other set of orders, so your A experience would be perturbed by the fact that there were network effects from the B experience. So you simply can't do that.

OK, so let's take different cities as different independent networks: some cities get treatment A, other cities get treatment B. That doesn't work, because when we started this we had 75 cities, call it a hundred cities, and an n of a hundred is not good enough to get good statistical sensitivity with an A/B test design.

So we went to a blocked design: every city gets A, and then B, and then A, and then B on different days, so every day-and-city pair is an isolated observation in the network. But then notice that the first day of the
experiment may have been really nasty: say it rained throughout England. If the first day of the experiment is really nasty rain, then A is going to get bad observations in all of the cities, and then you don't have causal interpretation. So finally we did a counterbalanced randomized design to restore causal interpretation: half of the cities get B-A-B-A and the other half get A-B-A-B. This gives us, with just an n of a hundred, statistical power that is as good as we would have gotten if we ran it on a hundred thousand orders, but it allows you to have all of the orders on the same day in the same city getting the same treatment, so you can capture those network effects. So the counterbalanced randomized block design gives you statistical power, causal interpretation, and network effects.

This is just an illustration of how much time we put into developing rigorous experimentation techniques. Before we had those, we were paralyzed. You could do all the work on the machine learning and the optimization and the incrementalism; you just can't move if you can't tell whether what's really happening is the result of what you've done. So we achieved transformation with AI with just really classic principles of science and engineering: incrementalism, a bias to simplicity, and rigorous experimentation. [Applause]

Sure, sure. So the question was: how do we do feature engineering for categorical variables? We use dummy coding, which is close to one-hot coding. You do get a lot of features, so you have to be judicious there, and regularization is essential; we have many hundreds of thousands of parameters, so you have to regularize. Let's take one in the back.

Sure, sure. So, I think the bias to experimentation and the rigorous experimentation come from psychology. I wouldn't say it makes somebody better than people that come from other fields; people have their own styles and their own strengths. I think the
experimentation mindset, the mix of empiricism and bits of machine learning together, can be very effective.

I'd just like to remind everybody that although the next speaker in this room has cancelled, we've got a few minutes here with Michael if people want to hang around, but there will be speakers starting in the other rooms about now. [Music]
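The expectation-of-the-max point from earlier in the talk can be sketched numerically. Below is a minimal Monte Carlo check, with entirely hypothetical distributions standing in for two uncertain delivery legs (e.g. food prep time and rider travel time): the brute-force simulation estimates E[max(X, Y)], while the cheap shortcut the speaker describes computes max(E[X], E[Y]) instead, and the two disagree.

```python
import random

random.seed(0)

def simulate_eta_pair():
    # Two uncertain legs of a delivery; the job finishes when the
    # slower leg finishes, so the total is max(X, Y), a nonlinear
    # combination. All parameters here are illustrative only.
    x = random.gauss(20, 5)   # e.g. food prep time, minutes
    y = random.gauss(18, 8)   # e.g. rider travel time, minutes
    return x, y

N = 100_000
samples = [simulate_eta_pair() for _ in range(N)]

# Brute-force Monte Carlo (the "do it right" approach): E[max(X, Y)]
e_of_max = sum(max(x, y) for x, y in samples) / N

# The cheap approximation from the talk: max(E[X], E[Y])
e_x = sum(x for x, _ in samples) / N
e_y = sum(y for _, y in samples) / N
max_of_e = max(e_x, e_y)

print(f"E[max(X, Y)]    ~ {e_of_max:.2f}")   # systematically larger
print(f"max(E[X], E[Y]) ~ {max_of_e:.2f}")
```

The gap is the bias the Monte Carlo approach pays to remove; swapping the expectation and the nonlinearity is the deliberate simplification the talk describes getting away with.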
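The counterbalanced randomized block design described in the talk can also be sketched in a few lines. This is a toy simulation under assumed numbers (a true −2-minute treatment effect, Gaussian day and city effects, 100 cities, 8 days): half the cities run A,B,A,B,... and the other half B,A,B,A,..., so every day contains both arms and a shared shock like England-wide rain cancels out of the within-day comparison.

```python
import random
from statistics import mean

random.seed(1)

CITIES = [f"city_{i}" for i in range(100)]   # n of a hundred, as in the talk
DAYS = 8                                     # experiment length, illustrative

# Counterbalanced assignment: half the cities alternate A,B,A,B,...
# and the other half B,A,B,A,... so both treatments run every day.
random.shuffle(CITIES)
half = len(CITIES) // 2
schedule = {
    city: [("AB" if i < half else "BA")[d % 2] for d in range(DAYS)]
    for i, city in enumerate(CITIES)
}

# Toy outcome model (all numbers hypothetical): delivery time = baseline
# + shared day effect (weather hits everyone) + city effect + noise,
# with a true -2.0 minute improvement under treatment B.
TRUE_EFFECT = -2.0
day_effect = [random.gauss(0, 5) for _ in range(DAYS)]
city_effect = {c: random.gauss(0, 3) for c in CITIES}

def observe(city, day):
    base = 30 + day_effect[day] + city_effect[city] + random.gauss(0, 1)
    return base + (TRUE_EFFECT if schedule[city][day] == "B" else 0.0)

# Because both arms run every day, comparing B vs A *within* each day
# subtracts the day effect; counterbalancing also cancels the fixed
# city effects across days, restoring a causal estimate.
daily_deltas = []
for d in range(DAYS):
    a = [observe(c, d) for c in CITIES if schedule[c][d] == "A"]
    b = [observe(c, d) for c in CITIES if schedule[c][d] == "B"]
    daily_deltas.append(mean(b) - mean(a))

estimate = mean(daily_deltas)
print(f"estimated effect of B: {estimate:+.2f} min (true: {TRUE_EFFECT:+.1f})")
```

Note that a naive before-and-after design would be confounded by `day_effect`, and a city-level A/B split would be confounded by `city_effect`; the blocked, counterbalanced schedule removes both while keeping every order in a given city on a given day under one treatment, which is what preserves the network effects.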