foreign [Music] on stochastic Inventory management optimization so far we have seen a deterministic inventory model which was the eoq model then we saw a very simple probabilistic inventory model which was our news vendor problem we went into the time axis and extended the time access to a finite time Horizon and discussed deterministic Dynamic inventory optimization problem which was solved using the dynamic programming formulation and the Bellman equation and now this is going to be the last session on the inventory management which is on stochastic Dynamic Inventory management problem so this is also an optimal an opportunity for us to introduce uh applications of reinforcement learning uh so I'll discuss uh that uh once I describe the scenario so what are we what are we looking at in the stochastic inventory problem the uh the most highlighting feature is that the demand is not certain or we we don't know Demand with guarantee so we only probably know the problem we probably know the probably the distribution of demand that is known uh later on we will probably relax that also uh that's will that's when we will get into uh Q learning and all that but we are right now assuming that the probability distribution of demand is known right uh so that that's uh so this is what differentiates it from the deterministic dynamic inventory problem uh so this is a stochastic Dynamic inventory problem where not only we are stretching the time axis but we are also making the demand stochastic okay so uh the everything else remains uh the same uh which essentially means that in every decision Epoch other decision maker looks at the current inventory level decides how much to order which essentially decides uh uh the the the state variable in the next time period and so on right so the the problem is a typical inventory optimization problem where we need to decide how much to order in every time Epoch we need to look at how how what is the current inventory level what could be the potential demand because we don't know demand for sure and then decide how much inventory to order from our Upstream suppliers right so that that that basically remains exactly the same right so every time uh the order is placed the decision maker incurs some kind of ordering cost we have looked at uh CO as our ordering cost right and that trade-off also remains exactly the same so if I if I place too many orders I am going to incur ordering cost however if I decide to order fewer times but increase the order quantity then I may incur what is called as the holding cost and right that was my carrying cost Cc or ch right top uh because the inventory has to be carried in a warehouse a warehouse has to be rented uh so the storage facility uh the cost of maintaining the storage facility all that we have discussed right so uh the essential features of the inventory optimization problem remain exactly the same just that demand now has a probability distribution right so let us get into the detail before we get into the detail let me highlight uh some of the standard very very uh often quoted references from where we have used our notations and our model building so these are the two famous uh references I I refer to these two books very often uh so the first one is the classic Martin Portman's book on mdps Markov decision processes uh right I I happen to have a 2005 version I honestly don't know whether there was a version that came out after 2005. Abhijit gosafi's book even though the the focus of the book is on simulation uh simulation based optimization there is a very very detailed description of mdps through exhaustive search and why exhaustive search fails and therefore we need to use some some clever method some some method using a Bellman equation and therefore policy iteration value iteration precisely what we are going to do is in this session therefore kosovi's text also becomes very relevant right uh most of the references actually use the notations uh uh that that were first used in putterman's book so putterman's book has kind of become a classic now we will use the same notations most of the times our notations are going to be similar to what putterman had used in his text okay so what are the assumptions of the model are the assumptions of the model as we said are uh so uh first of all demand is discrete uh uh so uh demand is one two three four five six seven eight uh right uh otherwise uh the summation sign becomes the integral sign let let's not let's not worry about that for now computationally complex and therefore let us assume that demand is discrete for now right uh the delivery from the supplier is instantaneous which means that there is no lead time we had discussed about this issue uh when we spoke about deterministic Dynamic inventory optimization problem right uh it does shift the shift the state updations by that much uh remember L was our lead time uh if we were to consider early time uh so we order we place the order in this time period I get the order only in t plus L time period and therefore the state update has to happen accordingly uh here the problem is much more complicated than what it was in the dynamic deterministic case because uh now uh I don't know what the demand is going to be uh in the t plus Health time time period but I'm going to receive the order that I am placing now only in the t plus Health time period right also all those complications uh will have to be solved therefore since we are describing the problem first let us not let us not get into those complications and let us say that as soon as the order is placed uh the shipment is received instantaneously right the demand is IID random variable right it has a very stationary pmf and uh the demo the probability that demand is equal to J is essentially PJ right uh doesn't depend on T right that doesn't depend on T which that that's what makes it stationary so it is IID right uh when we talk about RL towards the very end of the session now we will probably say that well these assumptions can be relaxed now because we are going to uh we are going to actually forecast these probabilities using the data that we have using the history that we have collected and so on so but uh right now let us start by saying that there is a discrete pmf PJ that the demand is actually equal to J right uh and the warehouse remember this was the same assumption that we had used in the deterministic dynamic case Warehouse has a capacity which means that the state variable can only be up to M State variable cannot become more than M because you can only store uh so many units in the inventory right and we are going to make a further assumption that invent is non-perishable which means that I can store the inventory for how many time periods uh I I want the only thing is uh the more I store the more I will incur the holding cost or the carrying cost right those are the assumptions uh we are also going to assume that demand that is not met on that particular day is essentially lost uh against back ordering remember we had looked at back ordering uh previously uh so uh we are going to right now say that uh demand uh is essentially lost which means that uh we better ensure that uh whatever demand can be met should be met because once the customer goes away from our shop uh we are assuming that the customer doesn't come back or we don't uh we cannot ask the customer to come back after two days three days four days uh and that would have been back ordering right so we we are not considering back ordering right now where does it have an impact uh remember uh how where does it have an impact it has impacts it has impact on state of data updation right how does a state variable get updated so what are the sequence of events sequence of events is uh this is the beginning of the time period This is the end of the time period so very typical uh the ending inventory from the previous time period becomes the beginning inventory of the current time period so this is that uh the start of the time period this is the start of the day this is the start of the weeks this is the start of the month and this is end of the time period T this is end of the time period t right so we observe the state variable uh which is the inventory level and that is St right uh based on that I place the order 8080 is my action variable uh that is my decision variable that's what I'm trying to find the optimal value for uh using all this uh stochastic dynamic programming and all that so 80 is my action variable that tells me how many inventory uh how many quantities to order as soon as I place the order uh remember we said the delivery is instantaneous so immediately right here right here I get the inventory uh delivered to me whatever I had ordered therefore my inventory at hand becomes St plus 80 which can be used to satisfy the demand now demand can happen any time during the time period right this is your DT uh unfortunately I don't know the value of DTR priority I only know the probability distribution however as I realize the demand I will actually know the value but I have to place the order even before I realize how much demand I am going to see in that particular time period right and then for the the time period and then the demand is incurred the demand is realized uh sometimes I'm able to meet the demand using St plus 80 sometimes I am not able to meet the demand using the inventory that I have plus the order quantity whatever it is uh I incur some uh cost I incur some Revenue right whatever I'm able to sell I I generate Revenue uh plus I I may have spent on some ordering cost uh whenever I place the order for 80 right all that will happen I may have to pay for my raw material uh buying this quantity 80 I may have to pay something that cost is incurred and the time period ends and then I update my uh my state variable the state variable for the next time period will be essentially whatever was uh available at the beginning of the time period St plus whatever was delivered in this time period which was 80 minus DT right minus DT DT is the demand and therefore this is the amount of inventory that I will have to carry to the next time period uh therefore this is the amount that will have to incur the holding cost right this is this order quantity gets into my warehouse and uh it is carried for the next day which means overnight it stays in my warehouse and therefore this will incur the holding cost right as we said uh the demand that is not met is lost which means that we have to ensure that this guy is greater than or equal to zero right most of the time because uh we don't know what the demand is going to be therefore we cannot guarantee that it will always be greater than or equal to zero but we are saying that if DT is more then St plus 80 right if DT is more than St plus 80 since we are considering the Lost sales case uh then we are simply going to say that uh demand was more therefore all the inventory was exhausted and therefore how much am I carrying from this time period to the next time period I'm only carrying a zero inventory from the current time period to the next time period right we will not allow the inventory to become negative because negative inventory would actually mean back ordering right uh figure out figure out what are the implications of uh considering the back ordering case right now we are not going to consider in this session all right so how do we formulate this problem as a Markov decision problem uh we need stage variable we need State variable I mean reduction variables so let us Define all of that actually we did little bit of that in the deterministic dynamic case uh let us let us recall all that and uh let us let us Define all these things so uh what was our stage variable stage variable was the time right the days and the months and the weeks right uh so we we can think of having a finite time Horizon uh and N days to be considered uh uh in the in the problem sometimes the problem is infinite Horizon problem uh so T will go all the way from one two three four five six all the way to Infinity right so there are finite uh Verizon markup decision problems there are infinite Horizon Market decision problems but essentially time is our stage variable what is our state variable State variable is our inventory level right inventory level or at a particular time t s t is our state variable right if we were to consider lead time uh if we were to consider lead time our state uh variable uh becomes um two dimensional not only do we have to keep track of how many how much inventory do we have in hand which would be St we also have to keep track of how much inventory is on order right so uh there the state variable will come St and Y T y T is the general notation used for inventory on order uh so remember once we consider lead time I place the order in this time period I may receive the goods only L time periods later therefore that part whatever I ordered in this time period will go into y t will not immediately get added to St here we simply set St minus St plus 80 right here we said s t plus 80. can't do that if we consider lead time because uh that will only be part of uh that will only be part of the y t variable which is inventory on order right let us not worry about that right now our state variable is a single dimensional uh right s t is the inventory level at time t what is the action variable action set is how much to order at time T right uh looking at the starting inventory I make an I can only look at what could be the potential demand I may not know the exact demand I still have to decide how much to order in each time period and once again ordering 0 is admissible right I may say that well I am carrying you know enough inventory from the previous time period I don't need anything in this time period that is perfectly acceptable so 80 equal to 0 for a particular T is an admissible value of 80. not doing something is an admissible action right that's what we mean so uh the reward function reward function is slightly involved so let us discuss that in detail our reward function is generally denoted by R right r r is St and 80 uh at time period T we had a state variable value of St we took an action of 80 right we took an action of 80. therefore what is my immediate reward my immediate reward is whatever money that I get by selling the inventory right obviously I'm I'm buying it from my supplier so that I can sell it to my customers right that's the whole point uh this price is supposed to be known and fixed throughout the time Horizon we are not changing that we may change it but whatever is the price Discovery we we know right where we are assuming that we get to fix the price and therefore we know the value of the price there is no stochasticity associated with price the only stochasticity is associated with the Demand right so this expected value is with respect to demand and not with respect to the price uh so uh here how much will I be able to sell I will be able to sell minimum of DT or St plus 80 right uh remember uh uh I can sell only what I can have right I cannot sell more than what I have so uh if let us say if I carried an inventory of uh St was 4 right I I decided to order two more quantity so the total inventory at hand is four plus two so I can only sell six units as long as demand is less than six I will be able to uh I'll be able to satisfy all my customers however if the demand is more than six I can only sell 6 units and therefore it is minimum of DT and St plus 80. right so this is the revenue part uh this is the revenue part uh plus uh my action quantity is 80 uh this is the amount of uh inventory that I order in this time period uh Therefore I may have to incur the ordering cost Co ordering cost remember inventory problem classical inventory problem is always a trade-off between carrying cost or holding cost and ordering cost uh how much uh sorry this is a mistake here uh this this should be a St plus 80 minus DT St plus 80 minus DT right uh so we discussed that how much inventory do I carry I carry St plus 80 minus DT amount of inventory uh in in my warehouse right uh and I don't allow this uh stock to become negative and therefore uh we are considering the case for lost sales right uh so that's the reward function uh but this is the this is what is called as immediate reward right therefore it is denoted by r r is always the notation for immediate reward then there is future reward which will be represented in the dynamic programming formulation precisely what we had done in our deterministic uh Dynamic inventory case right but we'll come to that we will come to that in the next slide okay so what is the terminal condition uh if I assume if I assume a finite time Horizon what's going to happen in the last time period in the last time period uh for whatever inventory is left over right whatever is inventory is less to leftover what will be the inventory that will be left over in the in the last time period you are going to start with uh beginning inventory of SN you may Place some order a n you have to meet the demand in the last time period which is DN right so this is the amount of inventory that is left over in the nth which is the last time period there is no future Beyond this right because this is a finite time Horizon problem in a fighter in a finite time Horizon problem I always have to worry about the terminal condition and what we are what we are going to assume here is whatever is leftover inventory that is salvaged and we have some function right we have some we have some function uh once again a n need not always be zero so this I I would say this is also a typo right uh this need not always be zero because if you start with the starting inventory of 0 in the nth time period and you will realize that you have to meet the demand you have to meet the demand so why would you not meet the demand so you will place the order so a need not always be 0 right and E naught always be zero a n plus 1 will always be 0 because there is no there is no n plus first time period you are not going to play place any order in the N plus first time period so n plus 1 would be 0 and need not always be zero but the assumption is uh in the last time period actually yeah actually this is n plus 1 everywhere right n plus 1 n plus 1 n plus 1 everywhere and then it will make sense right then it will make sense so in the N plus first time period and N plus first time period the only thing you are going to do is solve is the inventory you are not going to incur any new Demand right uh therefore in the N plus first time period uh you me uh you uh generate you generate uh you generate uh income by selling by selling your leftover inventory and your left or inventory is going to be SN plus 1 right so if this is SN plus 1 then all of this makes sense right all of this makes sense so sorry for the typo uh we can we can always correct it okay uh so uh all right so let us continue uh we have formulated uh all the things needed for mdp formulation now let us get into the equations right let us get to another Bellman equation uh so uh how do you formulate the mdp uh how do we update the state first of all as we have consistently said the state is going to be updated as St plus 1 is Max of s t plus 80 minus DT 0 this 0 is needed because we are not going to allow ST1 St plus 1 to become negative right so if St plus 80 minus DT is negative uh the the state variable is going to get updated only a zero right uh this is obviously under the say uh Assumption of loss sales if we were considering back ordering if we were considering back ordering the state update would happen like this s t plus 1 is equal to St plus 80 minus DT and therefore if DT is bigger than s t plus 80 then St plus 1 would become negative and therefore you are actually carrying negative inventory to the next time period uh so what would negative inventory mean let us say that St plus 1 St plus 1 is negative 3. now what would St plus 1 equal to negative 3 mean it would mean that there are three customers who came to my shop I wanted to buy something let us say pen or so that's what the example that we had taken in the last session they wanted to buy a pen from me I recorded their order I told them boss I can't deliver the pen right now I don't have inventory uh I will give you the pen two days from now please come to my shop and pick up the pen and I did that for three of my customers right therefore uh uh the the the inventory is negative three so what do I have to do probably not not after three days come back tomorrow and I'll give you your pen right that's what we mean so tomorrow what is going to happen is the first thing I have to do right first thing I have to do is forget about meeting the demand for tomorrow the first thing I have to do tomorrow is to make sure that uh I keep the three pens uh separately for the customers who had come yesterday and uh asked me for the pens right uh therefore uh from the from the next days uh from the next day is inventory I'm going to pull three products out uh therefore the inventory has become negative and that's what would happen if we consider back ordering this is this is called back ordering right you tell the customer to come tomorrow or you deliver the product to the customer tomorrow and all that will happen right the back ordering cost will have to be incurred right back ordering cost there is typically a back ordering cost because you allows the customer right uh you you not lost the customer you have asked the customer to come back tomorrow so there may be a transportation cost and so on so uh so if we had considered considered back ordering uh we would have updated the state as St plus 1 equal to St plus 80 minus DT but uh right now we are considering with the max function all right so what is objective function objective function is maximization of reward over a particular time Horizon right we will discuss that little later little later uh but this is not a minimization problem this is maximization problem where was our reward immediate reward immediate reward was here immediate reward was whatever Revenue I get by selling the inventory minus ordering cost minus the holding cost right uh that that is the immediate reward that I get plus I may get some reward in future so that will be the discounted reward right uh so that maximization of the total reward is my objective in solving this mdp all right how do I update my state variables uh the state transition probabilities this is new this was not there in the stockcast this was not there in the deterministic dynamic inventory problem because there was no need to update uh probabilities right there was no need to update probabilities so so here uh we use a markovian transition uh transition probability Matrix right Markov and transition probability Matrix essentially means uh that what will be my state variable tomorrow only depends on what is the state variable today it does not depend on the history right so where am I today I am I am at a state variable S I have taken an action a and therefore that is going to dictate the state variable that I may get into tomorrow I may get into at T Plus 1. currently my State uh my state variable is as my action variable value is a right so this is a typical markovian uh probability transition update all right uh and M what was m m was the warehouse capacity right remember M was the warehouse capacity so uh you trust me right M was the M was the warehouse capacity so uh now that I have come back let me show you that warehouse capacity was m thank you yeah we are all capacity for them right so we will not allow the the state variable to become more than M right because that's the way that's the uh Warehouse capacity so uh the there there cannot be a situation that my S Prime uh is more than M can't happen can't happen because S Prime can only be up to M therefore the probability transition Matrix probability transition values uh will be governed will be governed with this m right so uh so uh yeah uh exactly so why is this zero why is this zero uh uh let us say that you are currently at s you have placed the order for a which will be delivered immediately so what is your uh time uh at time T at time T what is your total inventory your total inventory will be S Plus a right S Plus a now there may be some Demand right there may be some demand which is D now even if this D is 0 the maximum inventory that you can carry to the next time period is S Plus a the maximum inventory you can carry is S Plus a you cannot carry more than S Plus a because that was the maximum inventory that you had in the current time period so how can you carry more than that in the next time period this Demand right this demand is going to be always non-negative right it may be zero sometimes but it is going to be positive right sometimes so uh this demand is never going to be negative and therefore you cannot carry more than S Plus a which is what is governed by this right so S Prime uh if S Prime is uh in in this interval uh S Plus a to M uh you can't have the situation therefore the probability is going to be 0 right now S Prime could be between 0 to S Plus ah but S Plus a has to be less than or equal to M because that's what would have happened in the previous time period right you can carry only if you can carry only m in the warehouse right and therefore that is given by S plus a minus a uh P of s plus a minus S Prime right this is the probability that the demand was actually s plus a minus a prime right that is the probability of demand now there could be a possibility that uh uh you had more than what you uh what inventory you were carrying the demand was actually more than that right if the demand was more than S Plus a if the demand was more than S Plus a H then your S Prime will be zero uh your S Prime will be zero right but you have to ensure that you are not carrying more than M at any point of time because that's how much the warehouse can hold right uh so what is the probability that uh what is the probability that demand was more than S Plus a uh what is the probability the demand was K uh where K was bigger than S Plus a that is all p k because p k remember demand had a very stationary pmf PJ PJ was the probability that the demand will be equal to J here we are talking about K where K is any value which is bigger than S Plus a right so summation of all that is the total probability that the demand was more than S Plus a right up so that in those cases uh the next time period inventory will be uh zero right you will carry zero inventory to the next time period so if uh you cannot carry inventory between S Plus a to M you may carry some inventory between 0 to S Plus a that is when the demand is s plus a minus uh S Prime and if you if you have just too much of demand then you will carry zero inventory in the next time period right so those are the three possibilities and those are the corresponding three probabilities so this is our state transition probabilities this is what creates what is called as PPM what is TPM transition probability Matrix right transition probability Matrix ah similarly there is something called trm trm is uh transition reward Matrix and we saw the reward function and it obviously depends TPM and trm obviously depends on what action you have taken so you change the action TPM changes right up if you change the value of a what was a a was the order quantity that I that I ordered in this time period right in time period t uh so uh uh obviously if a changes then the TPM changes and you will you see how what is the impact of that and also trm changes because the reward also depends on how much was ordered because that dictates how much inventory can you carry where you carrying how much inverted did you order and how much you were able to send all right so this is how you update the transition probability matrices now let us as we promised let us look at the objective function little more closely let us calculate the the components of total reward right not the immediate reward but the total reward right so uh ordering cost uh what was our ordering cost component uh if you are placing the order right if 80 is greater than 0 which means you are definitely placing the order what is going to happen you are going to incur some fixed cost this was our fixed cost component and then you are going to incur some cost because you ordered for two units so you pay for those two units right so that was C into a c of 80. ah C of 80 is the variable cost of ordering but the moment you decide to place the order you are going to incur fixed cost K very similar structure that we had considered even in the deterministic dynamic example right however if you are not placing the order uh not only the C of 80 will be 0 even your fixed cost will be 0 because there is no cost of ordering as such right so that is our ordering cost this is our ordering cost uh now the reward right reward uh the the revenue so uh as we said earlier uh uh the reward is generated from whatever you are able to sell this is your total inventory that you are carrying your total inventory that you are carrying is St Plus 80. right up and uh obviously how much can you sell you can certainly sell only minimum of DT or St plus 80 uh minimum of DT or St plus 80 multiplied by price we have already always said that price remains fixed uh we are going to consider price to be fixed uh in in in the time Horizon that we are considering okay uh now as we said what is the probability what is the probabilities here so let us say that St plus 80 is uh denoted by U I'll let us say DT is denoted by J let the demand be equal to J what is the probability that demand is actually equal to J or that will be P J right that will be P J now let us say that you carry enough inventory uh much bigger than what the demand was right the total inventory that you are carrying is St plus 80 which is you and apparently this U is bigger than J which is the demand if you are carrying enough inventory then your demand you will be able to meet the entire demand J right you will be able to meet the entire demand J and therefore your Revenue let us denote that by small F small F your your Revenue will be price multiplied by J because demand was J and you were able to meet the entire demand now this happens with the probability PJ because demand was J and that happens with the probability PJ so if this J is lesser than the total inventory that you are carrying you will be able to meet the full demand and you will actually earn the maximum potential Revenue that you were going to earn which is priced into demand however if your inventory is not enough or I mean essentially means that your demand was lot higher then you will not be able to meet the full demand you will be able to meet the Demand only that is equal to the total inventory that you are carrying right so if the demand exceeds the available stock then the revenue will be Fu right Fu where U is St plus 80 uh simple right so as we said let us say St was 2 a t was 3 right so your total uh inventory that you are carrying was five now if demand ah if J was 7 right demand was seven but you were only carrying five units of inventory uh and therefore uh how much revenue will you earn you will earn F of U where price into u u is five even though the demand was seven you will earn Revenue only for the units that you were able to sell and the number of units that you were able to sell was five right and that happens with the probability uh PJ where all these JS are bigger than you where the JS are bigger than you so u u plus one U plus two U plus three U plus four all the way to Infinity summation of all that uh is essentially the total probability that your demand will be more than the available stock right and therefore what is the value of this ft uh which is revenue uh it is uh if the demand is up to you minus 1 up to U minus 1 right uh it will be uh it will be FJ right the the revenue will be FJ and that happens with the probability PJ and what is the what are all the possible J's J could be 0 to U minus 1 however uh if uh demand is more than the available uh stock then that probability is q u and the revenue is Fu right so this is the expected this is the expected uh Revenue right this is the ordering cost plus the holding cost what is the holding cost holding cost is uh CC uh which is carrying cost of St plus 1 right as t plus one uh because you are going to carry St plus 1 which is the closing inventory of the current time period which becomes the beginning inventory of the next time period right so that is perfectly straightforward now what are policies right policies are essentially the order quantities right some decision rule uh Markov decision rule that assigns an order quantity to be ordered in each time period right 80. for all the possible STS now at uh the value of 80 will depend on St if St changes 80 will change right so the available uh the the order quantity that you will decide will obviously depend on what was the starting inventory in that particular time period so 80 is always going to be a function of St changes 80 changes remember we have actually seen something like that in the dynamic deterministic case uh so what is the policy policy is this Vector 80 okay is this Vector 80 A1 A2 A3 A4 A5 right or sometimes it is called decision rule right uh it is a sequence of decision rule sometimes though so for example putnam's text actually refers to that as DN which is decision rule right in the end stage uh which is uh how much are you going to place uh uh actually I once again this notation is wrong this notation is wrong okay so this was uh this was uh a n of 1 a n of two a n of three right because this is your state variable this is your state variable all the way to S your state variable can be anywhere from 0 to S where s is the maximum value uh here that s is M because you can carry only so much of inventory in your Warehouse right uh so this notation again I'm sorry for this uh it it because this is not 80 right this is not t one t two t three t t equal to one t equal to three and that's that's not what we want it is a vector right in time period one what what you are going to do in time period two what are you going to do time period three what are we going to do right and so on so not only that ah what are we going to do in time period one if the state variable is this much if the state variable is this much if the state variable is this much and all that right so it is a sequence of such decision rules uh so a very famous example of this policy of of one such policy is order sufficient stock to ensure that inventory reaches capital s uh right place these orders whenever inventory at the beginning of the time period is less than S right so whenever St is less than S you place the order in such a way that your St plus 80 is actually equal to this capital S right so that's what this policy means place the order whenever your beginning inventory is less than S and order so order so much that you are total on hand inventory reaches capital s whenever the beginning inventory is more than S which means if St is greater than S then don't place any order then 80 will be actually equal to 0 right so this is this is example of one such policy right so this is obviously a stationary policy very stationary policy doesn't depend on time it only says keep looking at your beginning inventory whenever beginning inventory is less than small s place the order whenever the beginning inventory is more than as don't place the order how much do you place the order you place the order in such a way that your total on hand inventory is actually equal to capital S right so that that's a famous SS policy uh right if you want to read more about SS policy enough books on Inventory management that that speak about uh this policy uh this was only an example of such policy but once again what are policies policies are essentially sequence of decision rules that tell us exactly what do I need to do in every time period not only that it must tell it must tell me what do I need to do in every state I may find myself in right not only not only every stage what do I do on day one what do I do on day two what do I do on day three right not only that on day three if I find myself in state equal to something what do I need to do right uh State State uh three state four state five state zero right what do I need to do right so policy should tell me that essentially that's our decision variable right uh that that's our entire decision variable remember the vector vector of is Vector of phase and a obviously depends on S so let us take a numerical example for this uh let us put some put some numbers uh to this abstract idea so far so let us put some numbers so that we can actually solve it okay so what is ordering cost uh let let the capital value capital K be four dollars every time you place the order you incur the cost of four dollars irrespectively of how much you order and how much you order the variable cost depends on two dollars per unit so if you place the order for four units you are going to incur four into two which is eight dollars plus this four dollars therefore twelve dollars remember we have tried to keep the costs uh similar to what we had encountered in the deterministic dynamic case Okay holding cost let us say every unit of every unit of uh inventory kept in the warehouse incurs one dollar right so U units of inventory will will incur you dollars so every unit of inventory incurs one dollar right that's what I meant uh Salvage cost let us assume that there is no Salvage cost essentially meaning that uh whatever is left over inventory that is a waste uh so uh you we may assume some value here that may change the ordering policy because there is some value to it right now let us assume that there is no salvage value Revenue let us say that uh I should have put a dollar sign here uh so what is revenue uh it is priced into units right if you uh where U is the unit sold uh so it is eight dollars per unit of sales okay uh so that's our cost function uh inventory cap is 4 we can carry only four units of inventory in our warehouse uh let us say that we are looking at a very small time very small problem only three time periods we have tried to keep things similar to what we have done in the deterministic case here uh because of bellman equation and because of the objective function that we are going to use we may need one more parameter and that is called discounting Factor uh discounting Vector I'll explain that in the next slide let Lambda be the discounting factor and let the value of Lambda be 0.95 uh doesn't make sense to have a discounting factor which is more than one so essentially what does that mean it means that uh a dollar tomorrow is only worth 0.95 today right uh so Bird in Hand is right uh you you know the phrase so uh if you give me money tomorrow that's not worth as much to me as you as you would give me money today right so one dollar today is not worth actually I said it I said it the opposite one dollar today would only mean 0.95 tomorrow all right uh density very important or stochastic problem therefore I have to tell you the probability density function so probability is equal to one uh with a prob uh demand is equal to one with the probability one fourth uh demand is equal to two with the probability one half demand is equal to three with the probability one fourth and there is no other possibility of demand taking any other value and therefore probability zero or at all the other places right so demand can only be one two or three and the corresponding probabilities are one by four plus one by two plus one by four which is equal to one and therefore exams of probability are met uh so these are the only three possibilities of demand demand can be one demand can be two demand can be three remember we try we have tried to keep keep the problem a very simple problem okay so uh how do we start we start with the Bellman equation of course but the probabilistic uh Wellman equation in the previous session we have seen a deterministic version of the Billman equation so let us do that here so how do you do that uh at time period uh T at time period T what is the optimal uh value of being in state s and taking an action a there is some immediate reward immediate reward plus the expected future reward expected future reward is this right uh this is UT plus one of being in state J but I could be in any state J right that probability is governed by me going to State J given the fact that I was in state s and to connection a right uh so this was my element of TPM this is my element of TPM currently in in time period t i I am in state s taking an action a demand is realized because of demand realizing I'm a land up in a different state at the beginning of the next time period right at the beginning of the next time period so this is the probability this is the utility that I get or the reward that I get from t plus 1 onwards t plus 1 onwards so this is the expected reward of all the future expected reward of the future uh but I have to discount it to the current value or discount it to the current time so I have to find the present value of it so Lambda is my discounting Factor so this is the Bellman equation for the probabilistic version right uh earlier we didn't have this P of uh J given s s n a right uh so this is obviously conditional probability uh I do realize that I have skipped a lot of portion where I could have explained to you what is the TPM and all that right now right now take it that these are markovian transition probability values and therefore they get into some kind of Matrix and therefore that Matrix is TPM so the Bellman equation is going to be of this format right now you can you can take the limits of all that and or or if the problem is uh finite time Horizon problem uh remember we solve the deterministic problem also using backward induction uh we can also do that in this case right so we will actually start with the beginning inventory of the fourth time period which will be the closing inventory of the third time period because our time Horizon is only three and there is no Salvage there is no Salvage we are not going to get anything uh by selling anything in time period four right what are we going to sell we are only going to sell the leftovers so we are not going to incorporation demand we are not going to incur fresh ordering nothing only selling of whatever is left over but apparently that also doesn't get you anything right uh because that was you that was the assumption so uh most of the finite time Horizon problems are solved using backward induction we could we can solve it using backward induction how would we solve our numerical example let us let us do that so let us solve the problem at time period three at time period three it's a very trivial problem right uh ah because there is no u4 right there is no u4 there is no future right so the expected future cost the expected future cost is going to be equal to zero there is no Salvage right so everything is zero so what happens in time period three which is the last time period you incur only the immediate reward right so what is the objective in the third time period the objective that you have in the third time period is you start with some inventory that inventory is as you take the order you place the order which is a in such a way that you meet the Demand right you meet the demand but that's about it right you want to be just enough because if you have leftover inventory that is going to have no value at all but you will incur the cost of ordering that excess inventory right so you just want to order so whatever maximizes this immediate reward whatever maximizes this immediate reward is the objective function for time period three time period three time period 3 starts by you sitting in a state called s and you're taking an action called a based on the numerical values you can actually calculate what is the best policy what is the best action that maximizes the reward immediate reward only immediate reward there is no future reward when you are in time period three however when we are in time period 2 and time period 1 we have to worry not only about the immediate reward but we have to also worry about uh what state transitions I'm going to make and what what I'm going to Inc what I'm going to earn from next time period onwards right uh so the pro the Wellman equation will be fully utilized uh currently the second part was Zero that will get utilized completely and that will be slightly more involved calculation uh so we will solve the t t equal to two problem and then we will solve the t equal to one problem and then we would say the entire problem has been solved right because we are using backward induction uh right now I I leave this discussion here and uh obviously after this we will have a python demonstration of the same numerical example where we will actually solve this problem and tell you what is the optimal policy which essentially tells us what to do in each time period if we find if we find ourselves in a particular State variable right so that is a numerical example and that is the example of a finite time Horizon problem and that is solved using backward induction of the famous Bellman equation now this is this is a simple example right inventory problem is solved now what can be complications what could be complications uh right now since we wanted to keep this problem very small we said well hold on Warehouse can only accommodate uh four units of inventory right so uh your s your state variable can only be 0 1 2 3 4 that's all only five values and therefore your TPM the transition probability Matrix uh is is a is a very small Matrix right five by five Matrix what if this uh possible values of State variable uh possible values of State variable is huge which means TPM is large remember if TPM is large a computation of this future reward uh becomes uh complex right calculating that becomes complex uh so what if what if this TPM is very large and very unwieldy right unwieldy what if I am solving an infinite time Horizon problem all right how how long will you keep doing this right if it's an infinite time Horizon you can't obviously go backwards because there is no last time period right up so you will have to essentially look at some other algorithm uh so uh in those cases backward inductions will fail us right backward induction will fail us and we need something better than the Brute Force implementation of bellman equation right we need something better what are the two or three most famous uh methods to solve these kinds of problems then the first method is essentially value iteration so first of all we have to understand what is the value okay so what is the value we use uh uh optimality equation right optimality equation what is optimality equation let us say that VN is the value of being in a state s at a at a stage n what is the value value is the maximization of not only the immediate reward but all the future rewards from n plus one time period onwards uh being in state J where J could be anywhere in my state uh uh set of set of possible uh values of the state right obviously discounting Factor right and the possibility that I may hit State J if I am in state s today I am taking an action a is obviously governed by TPM maximization over all the possible values of a will essentially give me the value of being in state s at a particular time n or at a particular stage n if I were to take the limits of this if I were to take the limits of this and infinite Horizon equation actually becomes something like this where value of being in state I right value of being in state I today is essentially maximization of immediate reward uh maximization of uh immediate reward because of that action a plus all the future rewards and remember since it's an infinite time Horizon problem uh the future that looks uh the future that is visible from today onwards will be very similar to the Future that looks from Tomorrow onwards because it's Infinity right so we drop the subscript n plus 1 we drop the subscript n and so on and we only say that the uh the value today is uh being in state I which is a v i and the value tomorrow will be per v j because we are in state J and J is essentially added uh so I'm taking an expectation over all the possible states that I may end up being tomorrow onwards tomorrow onwards right this what is this v j Vijay is the expected reward not not expected reward after we multiply with PJ it will become expected what what is v j v j is the value of sitting in state jail tomorrow and earning a future reward from Tomorrow onwards that will include that will include immediate reward that you are going to incur tomorrow plus all the future from Tomorrow onwards right so essentially Vijay is that future multiply that with PJ you will get all the future rewards discount that to the present time using Lambda and add that to the current reward immediate reward that you are going to get by being in state I today and taking an action a and you maximize that over all the possible actions all the possible actions right so how do you how do you run a value iteration you run a value iteration assuming that you don't know the value of these vs right you don't know the value of these V's obviously you are going to assume that TPM is known uh you know the say transitions uh you obviously know the immediate reward you know how to calculate the immediate reward because we have seen right up the the revenue that you earn the expected Revenue that you earn minus the cost of ordering minus the cost of holding that can that that will go as part of our the probability transitions are known Lambda is known so we are going to start a value iteration by assuming that this v i and v j are variables that I don't know I have to find the values of I may start with some arbitrary values right I may start with some arbitrary values so how do I start arbitrary values I'll start by setting some v0 some v0 and I will select some threshold I I'll tell you what is the purpose of this threshold let us start with n equal to 0. and keep incrementing uh so for all values of s we calculate this guy we calculate this guy uh where we plug in where we plug in this uh V 0 right so when we calculate V1 of s we will use V 0 of s and V 0 of s is arbitrarily selected we may select all the V 0s to be zero doesn't matter right up we'll select some values we'll select some values of v0 for All State so we will have to we will have to assume V 0 of 1 V 0 of 1 uh sorry V 0 of 0 V 0 of 1 V 0 of 2 V 0 of 3 V 0 of 4 that's it five five values that I have to assume because my state variable cannot be more than 4 and therefore these are the five values that I will have to assume I have to select some values right and based on that I will calculate I will calculate the value of v n plus 1 right which is maximized over all possible actions uh that I can take which maximizes my immediate reward plus the future here V 0 will be known when I solve it for V1 keep doing this right top and uh essentially check if uh the gap between v n and VN plus 1 is very small right some threshold some Lambda Prime actually if you refer to putterman's book if you refer to any standard text on mdps and particularly evalu iteration uh there is some equation which depends on some guy called Lambda uh I mean some some guy called Epsilon and obviously Lambda which is the discounting Factor I'm going to right now assume that all of that has been taken care of in this Lambda Prime right so threshold uh if sorry once again a typo this has to be three so if this v n plus 1 and v n are very close and they are less than this Epsilon Prime Epsilon Prime is the threshold that I have set then we are done actually we have we can stop the value iteration and we can go to step four otherwise we have to increment uh n by n plus 1 and go to step number two which is to calculate that again we will calculate v n plus 2 use v n plus 1 here and keep doing that what happens if we go to the stopping criteria uh this should be 4 because that was step number four so essentially if uh uh we have a policy we have an optimal it is called actually Epsilon Prime policy uh optimal policy optimal policy is going to be given by uh whatever action maximizes the total value uh the immediate reward plus the future values right so Arc Max ARG Max so what is the decision if you are in state s uh the arc Max of all of this uh will give us the optimal policy uh in this particular state right obviously we are assuming um yeah so this is this is not the the optimal policy this is this is called Epsilon Prime of optimal policy because I am assuming that the gap between VN plus 1 and VN is small enough right so this is how you run value iteration fairly straightforward implementation of the Bellman equation hope this is okay with you let us go quickly to policy iteration all right policy iteration you here you assumed values of in the value iteration we assumed values of uh V 0 here we assume values of V 0 in policy iteration we start by assuming a particular policy right select a policy at random right select some policy n equal to zero select an arbitrary decision rule some arbitrary decision rule d0 which is which is a set of all the admissible decision rules right uh so select that now there are two steps in policy iteration one is called policy evaluation one is called policy Improvement first for for this arbitrarily selected decision rule evaluate the policy right uh policy evaluation uh here you solve for unknown VNS by plugging in the value of d0 right here you know the d0 so here you know the values of a and therefore you calculate the VN uh here everything is known you know a uh you know the state you are in right and therefore you will be able to calculate the immediate reward you know the TPM right you know the TPM you know Lambda so VN is the unknown guy so essentially if you solve for this and you get the values of VN right for all the possible States uh this is for a particular decision rule which gives us the value of a but decision rule essentially gives you the order quantity right in every state so you you solve this v n and remember how many decision variables are there when you say we have to solve for VN there are essentially five decision variables uh v n of 0 v n of 1 v n of two v n of three v n of four right so five decision variables you solve for for the given policy then you improve the policy right improve the policy uh now at the N plus First Step you choose a DN plus n such that you maximize this maximize this value maximize this value right maximize this value which is uh uh immediate reward plus all the future expected reward discounted right this is a discounted uh reward policy this is a discounted reward objective right so what what is your DN plus one DN plus 1 which is a new decision rule in the N plus First Step n plus first step is you are trying to improve uh because you are trying to find the action that maximizes not only the immediate reward plus all the future expected discounted rewards right uh from that time onwards from from Tomorrow onwards uh generally when we select DN plus one we try to select uh DN plus one as elements of DN itself as much as possible uh sometimes they are diffic sometimes they are different when does the iteration uh complete uh the policy iteration algorithm closes when DN plus 1 is actually equal to DN then we stop right else we increment n by 1 and then go to policy evaluation again use DN plus 1 here and solve this v n and then select the next decision Rule and keep doing this right uh here what is the optimal uh policy then the optimal policy when we stop the optimal policy is d n plus 1 which is essentially obtained by taking an R Max of all the possible actions uh right uh right okay so this is the policy iteration okay one last uh thing to be discussed uh which will take us very very close to which will take us inside RL which is called Q learning uh so when do we use generally equal Q learning uh so far we have been debating about uh uh this Bellman equation where this is the TPM right what if we don't know TPM when we do know TPM that that uh that all that mdp is essentially called Model based mdp when the probability transition matrices are known right however if the if the transition matrices are not known then that is called Model free mdp model free mdps are when the probabilities are not known in those cases what do we do we Define we start by defining a vector of queues right q t uh which is in the state which is in the state space of s cross a number of State uh number of states multiplied by the number of actions for each time Epoch right so if we Define a sequence of vectors and how do we update these vectors uh we update these vectors uh by by this equation by this equation where ah q t plus 1 of sna you are in state s and taking an action a right is given by this expression you keep updating this and this is essentially uh becomes what is called as Q learning so this is essentially Q learning so what we will do is we'll stop here we will demo all of this by the way what is this uh Alpha T Alpha T in Q learning is essentially the rate of learning the rate at which we are learning right so this is this is our attempt this is essentially our attempt uh uh to uh calculate uh the expected rewards even without knowing the probability so essentially we are learning the probability transitions uh so Alpha T generally refers to the rate of learning right uh so more of more of these details when we actually discuss uh python uh example of all this right so essentially three methods value iteration policy iteration and Q learning there are many many methods right so once we set Q learning then we can argue why not TD right temporal difference method many many methods uh but uh but the drill is going to be similar right the the elements are going to be very similar just that computational complexities will force us to take that particular method on it so let me end the session here uh and uh we will follow this up with a python example of everything