Transcript for:
Understanding Multi-Armed Bandit Techniques

So we have the multi-armed bandit problem in chapter two of reinforcement learning. In this chapter we will first talk about the multi-armed bandit problem and what it is. Then we talk about exploration and exploitation and what they mean. Based on that, we look at the epsilon-greedy policy, a very simple technique to use. Then we talk about the upper confidence bound, or UCB, algorithm, then the Thompson sampling algorithm, and we will see how we can solve, for example, the multi-armed bandit problem with them. In this chapter you are going to see the most basic reinforcement learning problem and how we can deal with it with different techniques in the simplest way.

So let's start with exploration and exploitation. I assume all of you already know the idea: exploration means you want to explore something new, try something new; exploitation means you use your memory, your knowledge of things that have already happened.

We have a very nice example for this. Since Taiwan is all about restaurants and food, let's say you are hungry and you want to go to a restaurant. You have two options. One is to go to a restaurant you already know: you have tried it, you know it is delicious and the price is affordable. The second option is going to some totally new restaurant.

So exploitation means going where you know: based on the options you know, the price, and how delicious the food is, you choose one. Exploration is trying something new. Clear? Very simple.

As a definition, then: exploration is trying new things, and exploitation is using your previous experience.

We have a note here that is critical, and we are going to emphasize it many times: sometimes you need a long-term strategy and you have to make some sacrifices. What does that mean? It means that when you try something new, you may go to a restaurant where the food is awful and very expensive, and you have wasted your money. But that sacrifice is how you find the really good restaurants. Let's say you already know five restaurants. If you never explore other restaurants, maybe there is a very good one out there that is very cheap and very delicious and you would love it, but in order to find it you may have to try 10, 20, 100 more times. That is the sacrifice you need to make to find the restaurant you really want to go to. And that restaurant I am describing is basically the solution to our problem in reinforcement learning: our reinforcement learning algorithm needs to sacrifice, to explore, in order to find the solution. Keep that in mind.

Okay, so now we want to talk about the multi-armed bandit problem. In a casino you are facing multiple slot machines, each configured with an unknown probability of winning. You can play the different machines by pulling the handle (or the lever, there are different names), and each of these machines has a different winning probability already defined and built into it. The producer of the machine has written some code so that, for example, on this machine you win 25 percent of the time. That is actually too high, but for simplicity, in this simple example, say you win 75 percent of the time here, 30 percent here, and 50 percent here.
But of course in a real environment these probabilities are very small; here we just have some simple examples.

Now the question is: what is the best strategy to achieve the highest long-term reward? Think about it. You have multiple machines, you are the agent, you are going to play, and you do not know these probabilities. What is your strategy?

Yes, that is a very nice strategy: try each one of them ten times, see which one you win on more, and then keep playing that one. That is one of the strategies we can have. But what we want to do here is formalize it in the form of reinforcement learning, so that we can solve it with different techniques. Okay, nice.

To understand the concept, consider an infinite number of trials. Why? Because if we say we have a finite number of trials, how many should it be? Your classmate said 10. But is 1000 enough to get the exact probability? Maybe not. For now, for simplicity, we say: let's assume we can try forever, so we can see what the real probability is. Later, of course, we will have a finite number of tries.

This is a very nice example for demonstrating exploration versus exploitation. Why? Because, as your classmate's strategy indicated, at the beginning you do not really care about the outcome: say you play the first machine and you win. Are you going to keep playing that machine forever? No; as your classmate said, you try each one ten times. That means at the beginning we explore. But later you want to collect a lot of reward, and based on the probabilities you have estimated, you want to play the machine that has the higher chance of winning. So the second stage is basically exploitation.

Now a question for you: this problem is well known as a one-state problem. Why? Oh really, I thought you were going to think about it a little. Okay, close, but not the answer. Yes, "the original state" — that is very close, I can kind of accept that as an answer. What your classmate means is this: you have the current state, you choose one of the actions, and then whatever happens in the "next state", if you imagine there is a next step — what is the difference between before and now? Is the reward distribution changing? Is anything happening? We are in exactly the same scenario: there are probabilities we do not know, and we again need to play one of the machines. Nothing changes; the rewards do not change over time. Because of that we have a single-state problem: the actions we take do not affect future rewards, and the reward distributions are stationary. Of course, most of the problems we will face later are not like this; here we are just warming up before getting into the real problems of reinforcement learning.

So the goal is, of course, maximizing the accumulated reward over time. And the note we had: this is the simplest version of the Markov decision process that we will see next chapter. The reason is that we do not have many states — we have just one state — while in a Markov decision process we have multiple states.
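A minimal sketch of this kind of slot-machine setup, using the example win probabilities mentioned above (the class and method names are illustrative, not from the course code):

```python
import random

class BernoulliBandit:
    """A k-armed bandit: each arm pays reward 1 with a hidden probability, otherwise 0."""

    def __init__(self, win_probs):
        self.win_probs = win_probs          # hidden from the agent
        self.k = len(win_probs)

    def pull(self, arm):
        """Play one machine and return the reward (1 = win, 0 = lose)."""
        return 1 if random.random() < self.win_probs[arm] else 0

# three machines with the example probabilities from the lecture
bandit = BernoulliBandit([0.75, 0.30, 0.50])
print(bandit.pull(0))   # 1 or 0, depending on luck
```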
So what is the most basic approach to this? You already gave a better answer to this question, but one approach can be: okay, just pick one machine and play it for many runs and collect the reward. Like, I am the kind of guy who says "I don't care", I randomly choose one machine and I just play, play, play, whatever happens. That is one approach, but of course it is not a good one: it does not guarantee the best long-term reward. Also, given the limited knowledge and resources we have, we need a better approach, and based on that we want to see how to solve the problem.

The definition of the multi-armed bandit considers a set of actions and rewards. Notice that we do not have a state S anymore here, for now, for this chapter; we have only actions and rewards: we do something and we get the reward.

So we take an action and get a reward. Then we have K machines with reward probabilities: for each of the K machines there is a probability of getting a reward, which we call p_k. The value of an action a, Q(a), is its expected reward. So for each action I can get a value. What was the value we defined in the previous chapter? The expected reward of the action. Here there is a slight difference, because we do not have a state anymore (in the problems we will see next there will be states), so this expected reward directly represents the probability of winning for each of the bandits or machines. At time step t the reward function looks like this: R_t = R(a_t), the reward of the chosen action, drawn according to the probability that is hidden from us at the beginning.

So, formally, the goal is maximizing the cumulative reward: maximize the sum of the rewards r_t from t = 1 to T.

If we know the optimal action and its reward, the goal can also be written as minimizing a loss function. What is a loss function? I mentioned to you that we have objective functions, loss functions, and reward functions, and to clarify a little what they are in general: a loss function, which is also famous in neural networks, measures the difference between what the answer is supposed to be and what we are currently doing — the difference between the optimal and the current. Here we can also define it as the regret we feel when we play something.

So let's say we play one machine. How much regret should I feel? That regret, which we can also call the loss here, is the difference between the optimal reward probability p* and Q(a_t), the value of the action we actually took. That is really what happens: you play something and you get some value, Q(a_t) — again, we have no state for now, just the action — and we sum these differences over time. I call it the expected loss and write it as the total loss of the bandit problem:

L_T = E[ Σ_{t=1..T} ( p* − Q(a_t) ) ]

The optimal reward probability can also be represented in this form: p* = Q(a*) = max over all actions a of Q(a). If we played an infinite number of times — after billions of pulls — the observed winning frequency of each machine would be very close, very accurate, to its true probability, so we can say the maximum of those values represents Q(a*), the optimal reward probability p*. Clear?
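A minimal sketch of that regret idea, assuming the true win probabilities are known to whoever is measuring (the function name and the example action sequence are illustrative):

```python
def cumulative_regret(win_probs, chosen_arms):
    """Regret = sum over time of (best win probability - win probability of the chosen arm)."""
    p_star = max(win_probs)
    return sum(p_star - win_probs[a] for a in chosen_arms)

# e.g. an agent that mostly played arm 1 even though arm 0 (p = 0.75) is the best
print(cumulative_regret([0.75, 0.30, 0.50], chosen_arms=[1, 1, 0, 2, 1]))
```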
So, as we said, one of the approaches is playing one machine for many runs. This is a naive approach because there is no exploration. A second approach is exploring randomly: instead of picking one machine and playing it forever, play a random machine each time. Again, this approach is not reliable, because it is completely random — every time you play you are just trusting your luck, there is no strategy, and you learn nothing. What we actually want is exploration that prefers uncertainty, what we can call wise exploration. That is our objective, and we want to implement some techniques, or see how we can use some techniques, to do this. Today we are going to see a few.

So first, let's start with the epsilon-greedy policy. The epsilon-greedy policy is very simple if you know exploration and exploitation — the restaurant example I explained to you: the usual place or somewhere new. The epsilon-greedy algorithm says: toss a coin. If it lands one way — say 50 percent of the time; the split can be different, but a plain coin toss is 50-50 — you explore, meaning you take a random action; the other 50 percent of the time you use your knowledge.

That is the whole of what epsilon-greedy is: toss a coin; fifty-fifty, either play a random action or use your memory. Using your memory means taking the best action among the options we know, and where does the best action come from? From the Q values we have, the values of the actions.

I formalized it in some simple steps for you. First, generate a random number between zero and one — you can use a rand function in any programming language. Then, if the random number you generated is smaller than epsilon, choose a random action; we call that the exploration case. Otherwise, choose the greedy (best-so-far) action; that is exploitation.

Although this technique seems very simple, it is going to be very useful. We will use it in different algorithms; one of the famous ones is Q-learning, and we are going to use it in deep Q-learning as well, even though it is so simple.
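A minimal sketch of that rule, with a fixed epsilon purely for illustration:

```python
import random

def epsilon_greedy_action(q_values, epsilon):
    """With probability epsilon explore (random arm), otherwise exploit (best-known arm)."""
    if random.random() < epsilon:
        return random.randrange(len(q_values))                      # exploration
    return max(range(len(q_values)), key=lambda a: q_values[a])     # exploitation (greedy)

q_values = [0.0, 0.0, 0.0]      # initial estimates for three machines
action = epsilon_greedy_action(q_values, epsilon=0.5)
```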
Now, before going further, a question: what should the value of epsilon be here? What do you think? I gave the example of tossing a coin, which is 50-50. But is 50-50 good, or is 75-25 better, and how should I define it for this simple algorithm?

"It can be anything"? Yes, close, but not quite the answer. Exactly: the answer is that we do not know the value. If I tell you 50-50 is good, you can say no, it is not — and how could I argue? You would be right; I cannot claim 50-50 is good, and I cannot claim 90-10 is good either. To deal with this problem we have a technique called epsilon decay. At the beginning we say: okay, we do not know which restaurant is good yet. Let's say I am new to Taiwan and I do not know anywhere — this is actually true for some of the students here who are very new — so I just explore. So we set a high epsilon value, meaning everything is random at the beginning, and then gradually, as time passes, we decrease it. After six months, after one year in the restaurant case, we say: okay, we have explored enough, now come and exploit — just use your own knowledge, do not explore too much. That is a very clear picture of what happens with epsilon decay. Is it important to know? Of course, it is very important: we are going to use it even in very nice, very advanced algorithms — some of them still use exactly this technique. If you have studied neural networks or deep learning, it is kind of similar to a learning rate schedule, but it is very applicable in reinforcement learning: in Q-learning, for example, in SARSA and expected SARSA, which we are going to study, and later in deep Q-learning, and so on.

Clear? Any questions?

So we say that in the multi-armed bandit with an epsilon-greedy policy, the Q value is estimated using past experience: we average the rewards of an action observed up to the current time step. We can represent it like this: the Q value of action a is the sum of the rewards obtained on the time steps when we played that specific machine, divided by N_t(a), the number of times we have played it:

Q_t(a) = (1 / N_t(a)) · Σ_{τ=1..t} r_τ · 1[a_τ = a]

This gives us the Q value for each of the machines, so for each one we calculate a separate Q that indicates the average value we have gotten from it. The indicator term 1[a_τ = a] just says whether we played that machine at that step or not, and it multiplies the reward.

And what is the reward? If you lose, you can say minus one, or zero; if you win, we can say plus 100, or plus one — whatever the reward is. Who is going to define it? You guys. How are you going to define it? By experience working with different reinforcement learning algorithms, to know what is supposed to work well. Usually, for example, if you have a problem like task planning from a start to a goal, we give a really big reward for the end result. Even if we also have a small reward for each step, we still give a big reward for success — like when you were a kid and did something nice, your parents said "wow, very nice", gave you chocolate or praised you; you were happy, you remembered that action, and you tried to do it again. It is something similar.
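A minimal sketch of the sample-average Q estimate together with one possible epsilon-decay schedule (the decay constants are arbitrary illustration values, not values from the slides):

```python
class SampleAverageQ:
    """Keeps a running average reward per arm, updated incrementally."""

    def __init__(self, k):
        self.counts = [0] * k       # N_t(a): how many times each arm was played
        self.values = [0.0] * k     # Q_t(a): average reward observed for each arm

    def update(self, arm, reward):
        self.counts[arm] += 1
        # incremental form of the sample average: Q <- Q + (r - Q) / N
        self.values[arm] += (reward - self.values[arm]) / self.counts[arm]

q = SampleAverageQ(k=3)
q.update(arm=1, reward=1)           # machine 2 paid out once

# one possible epsilon-decay schedule: shrink epsilon a little every step
epsilon, decay, min_epsilon = 1.0, 0.995, 0.05
for step in range(1000):
    epsilon = max(min_epsilon, epsilon * decay)
```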
Now you may say: okay, we have Q(a), what do we do with it? This Q is just indicating some value. What we do with it is very simple: we go to the next step and take the argmax of Q(a). Who knows what argmax is? Anybody?

Okay, so argmax gets the maximum and returns the index, or the item itself, not the value. You have multiple values, multiple parameters; argmax finds which one is the maximum and returns not the value inside but which element it was, or its index. So here we get the action whose Q value is the maximum and put it in a*_t, which is the greedy action, the best action you currently have:

a*_t = argmax over a of Q_t(a)

Any questions? Thank you.

So if you run this approach, of course it is going to work, but what are the challenges? You do this in a loop: each step you apply epsilon-greedy, then update the Q values and get the best action, and over time, based on the coin toss, sometimes you pick a random machine and sometimes you pick the best action you know and play it. Especially if you apply epsilon decay, it is going to work. But what is the challenge with this approach? We say that if we can tune the epsilon parameter well, it will give good results; but although we introduced epsilon decay, how big should the decay value be? How fast? It is still not easy. Should it decay over one million steps, or 100, or 500? If you can tune that, it works pretty well, but it is still challenging.

And now you may ask: if I am going to use it, how do I find it — I mean the epsilon decay rate? What is the answer? Any guess? Again, the challenge in epsilon-greedy is defining epsilon, the value that decides how often you pick which kind of action. To deal with that, I said we can use epsilon decay: very random at the beginning, then decrease it over time, and after a while no more randomness, just the best action. Now the question is the schedule: how fast should this decay happen, and how can we find that? Yes, I know what you mean — you are suggesting a smarter approach, and yes, such techniques exist — but let me explain the simple one first. The simplest one is just to try it: one time try a decay over one thousand steps, one time one million, one time 500, one time 100, and you will see what happens — when it more or less converges and stops finding better values as you try different decay rates, different amounts of decrease from exploration to exploitation, then you have found it. Or, as your classmate said, you can write some code, or some equations, that tries different values and finds the best decay value for you.

Is what I am explaining important? Very, very important. You are going to code this, you are going to implement this, and if you cannot do these things that I explained, you will not be able to simply run Q-learning, for example, one of the most basic algorithms we are going to use to solve problems. So these are very important — try everything I am explaining here yourself.

So, defining a proper value for epsilon is a challenge, as I said, and one of the solutions can be the upper confidence bound, UCB, which we are going to introduce now: an approach that takes into account the number of times we have applied each action.
It is also a kind of smart observation: let's say I have three machines, and I am in exploration mode — with the previous algorithm, epsilon-greedy, or whatever it is — which means trying randomly, not caring about anything. But if after one thousand plays I have tried that machine only 10 times, and these two machines a lot, is that good? It is not very wise, but with randomness it can happen, right? Maybe these two mostly come up and that one rarely does. So the upper confidence bound, or UCB, is going to propose an approach to deal with this problem; let's see how. The definition is: favor exploration of the actions with strong potential to have the optimal value.

What is the optimal value? Remember, we defined it today and I said it is very important, right? I am going to use it a lot, up to the end of class, so please: when I say "optimal value" now, you should imagine that equation. So we are going to favor exploration of the actions that can lead us to the optimal value. Let's have a look. We want to define a bound per action, and that bound indicates how confident we are about the Q value of that action.

And now again you should quickly recall what the Q value is and what I am talking about. Less confidence means the bound will be big; more confidence means the bound will be small. How are we going to define the bound? We use this equation — a simplification here; first have a look at the equation, and on the next slide we will see an example in visual form and in numerical form, so you will see what happens.

So we want to define the bound that helps us pick an action among the multiple options we have. Here we have U_t(a), the upper bound of action a that we want to calculate:

U_t(a) = sqrt( 2 · log(t) / N_t(a) )

where t is the total number of action selections so far, and N_t(a) is the number of times we have chosen that specific action — that specific machine, because here the actions correspond to different machines: a1, a2, a3 for machine 1, machine 2, machine 3, for example.

In the numerator we have two times the logarithm of t — taking the logarithm keeps this value growing slowly even as t becomes large — and in the denominator, for each machine, how many times we have played it. This gives us an upper bound, an amount that will help us choose which action should be played.

What else is there to mention? This N_t(a), the number of times you have played a machine: if it is larger, it gives us a smaller bound — it decreases the uncertainty bonus — because we want to give more chances to the machine we have not played much, as I explained. If we have played these machines thousands of times but this machine only 10 times, we cannot be confident about this machine, right? We want to give it more of a chance, through the confidence term we define there.

The same explanation, in fact: we prefer actions for which we do not yet have a confident value estimate. The UCB algorithm measures the potential of an action by the upper confidence bound of its reward value together with Q, and it selects the greedy action that maximizes the upper confidence bound, as follows.
a_t = argmax over a of [ Q_t(a) + U_t(a) ]

Each time, after we calculate the upper bound, we add it to Q_t(a), and based on these sums we pick the maximum action. So the Q of a, the value of the action for each machine, plus the confidence term, gives us a score per machine. Why are we adding the Q? Because if a machine has a very high chance of winning, we do not care so much anymore how often we have played it, right? If this machine has an 80 or 90 percent winning chance, then even if we have tried it hundreds of times we still want to keep playing it, while sometimes also trying the others. That is the reason we have this term, and we take the argmax in a greedy way — greedy means we always want to grab the best one. So let's see an example to clarify everything.

For UCB we have an example here. In this example, assume we have three machines, one, two, three, and the Q values of the machines are initially all one. What does that mean? It means the value of playing each of them is one, all equal — for simplicity we just take one, and of course this gets updated after playing. And N indicates how many times we have tried each machine: N(a1) is, for example, 200 for machine 1, 150 for machine 2, and 750 for machine 3. We already have that term here, so we can calculate U_t. Now, using the equation above, let's see what the result is and how we calculate it. We say: the upper bound U_1000(a1) is equal to the square root of two times the logarithm of t — what is t? We played 1000 times in total — over N_t(a1), how many times we played that machine, which is 200. Then we calculate and get some value. Clear? We just plug in the values we have: the number of times we played the machine and the total count t — the Q value stays the same, we are not touching it, I have just visualized it there. Then for the second machine everything is the same except that you played it 150 times, so we get some other value, and for the third machine we plug in 750, calculate, and get about 0.29. Now let's visualize it.

So these points are the Q values of a1, a2, a3. Why are they on one horizontal line here? Because they happen to be the same; in general the Q values could be different and these points could sit higher or lower — be careful about that. Then what we have done is calculate the different upper bounds, and if we visualize them — about 0.51, 0.64, and 0.29 — we get something like this: from each Q point we go up by 0.51 for the first machine, 0.64 for the second, and 0.29 for the third. And what we are going to do is pick the one with the maximum value of Q plus upper bound and play it.

So we can say that with machine 2 we have higher hope to win the game and get higher reward. This is going to change over time: the Q points will go up and down, and based on where each point is, the upper confidence bound will change too, and based on that we will try different machines. This algorithm tells us what to do every time — instead of tossing a coin like epsilon-greedy and choosing between a random and a best action — making the whole procedure much smarter.

I did not put the comparison here, but if you have a look in the book there are comparisons between, for example, epsilon-greedy and UCB on different problems. Since you have an assignment on this, you are going to compare them and see for yourselves which one works better on a given problem.
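A minimal sketch of that calculation with the example numbers above — depending on the exact constant and logarithm base used on the slides, the bound values may differ slightly, but machine 2, the least-played one, still comes out with the largest score:

```python
import math

def ucb_scores(q_values, counts, t):
    """Score each arm by Q(a) + sqrt(2 * log(t) / N(a))."""
    return [q + math.sqrt(2 * math.log(t) / n) for q, n in zip(q_values, counts)]

q_values = [1.0, 1.0, 1.0]       # initial Q values from the example
counts   = [200, 150, 750]       # how many times each machine was played
t        = 1000                  # total number of plays so far

scores = ucb_scores(q_values, counts, t)
best_arm = max(range(len(scores)), key=lambda a: scores[a])   # -> index 1, i.e. machine 2
```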
Okay, any questions? Is everything clear? All right, I am going to take a two- or three-minute break, and then we continue with Thompson sampling.

All right, so now we are going to discuss the Thompson sampling algorithm; let's see what it does and what the differences are compared with the previous ones. We say: at each time step we want to select an action according to the probability that it is the optimal one. So we want to talk about a probabilistic way of choosing an action, in terms of a sampling algorithm. First of all, before talking about the Thompson sampling algorithm, we need to talk about the beta distribution, in case you have not heard about it yet. We consider Q(a), which is essentially a success probability theta — we will see on the next slide how we can model it with a beta distribution, and we will see the equation. Here we have two parameters, alpha and beta, corresponding to the counts of how often we have succeeded or failed in gaining reward from each of the actions we take on our machines.

So Thompson sampling actually samples an expected reward Q̂(a) — this is an estimate, hence the hat; in case I forget to explain it, in machine learning and related fields, whenever you see a hat it means the expected or estimated version of something — from a prior distribution, the beta distribution Beta(α_i, β_i) that we are going to see in a minute. Then the best action is selected among the samples.

The beta distribution says we can write an equation like this: the probability density of theta is

f(x; α, β) = ( 1 / B(α, β) ) · x^(α−1) · (1 − x)^(β−1)

where alpha and beta are the two input arguments. This is the term that defines the beta distribution. What does it mean? It means that if you feed in the two parameters alpha and beta, you get a probability density over theta, and these alpha and beta determine the shape of the distribution on the plot. We will see in a moment that we can move this distribution to the right and to the left in order to model the probability we care about. The normalizer B(α, β) in the formula is the integral

B(α, β) = ∫_0^1 u^(α−1) · (1 − u)^(β−1) du

Are we going to prove this and derive where it comes from? Not in this course; if you are interested, just have a look.

So what does it look like? It looks like the examples I am going to show you. If the input arguments alpha and beta are the same small values, the distribution over the probability — over x here — is flat, the same everywhere. What if I increase them? If you increase them, it becomes — not exactly Gaussian, but something that looks like a Gaussian distribution, and if I increase them further while keeping them equal, it looks even more like a Gaussian. These are the input arguments I am feeding into the equation, alpha and beta, and they give me the probability distribution I have here. But if I have something like alpha 1 and beta increased to 6, the curve is pushed toward the left; with alpha 2, beta 2 it is a symmetric bump; and if alpha is bigger than beta, the curve is pushed toward the right.
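A minimal sketch for reproducing these shapes with scipy and matplotlib, using a few illustrative (alpha, beta) pairs:

```python
import numpy as np
from scipy.stats import beta
import matplotlib.pyplot as plt

x = np.linspace(0, 1, 200)
for a, b in [(1, 1), (2, 2), (8, 8), (1, 6), (6, 1)]:   # illustrative (alpha, beta) pairs
    plt.plot(x, beta.pdf(x, a, b), label=f"Beta({a}, {b})")

plt.xlabel("theta (win probability)")
plt.ylabel("density")
plt.legend()
plt.show()
```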
What is the intuition here? The idea is: when we are talking about the probability of playing the machines, can we somehow use this equation, this beta distribution — play with these parameters — to finally find a good distribution over which machine to play? The answer is yes, and Thompson sampling is trying to do exactly that. Clear up to here? If you want to try it, I put a link to an online implementation; I think I also used it in the code — I should check, I developed some examples for you — but anyway, if you go to that link you can try different alpha and beta values and see what happens. It is also useful. So the idea is somehow using this particular distribution to solve our multi-armed bandit problem. Can we do that? Yes. How? Let's see. The idea is to estimate the alpha and beta values by changing them continuously, which changes the shape of the distribution associated with each possible action, and if we repeat this many times — sampling from the distributions and getting Q values — we can form a true estimate of the probability of each machine winning.

So how? Let's see it visually.

Let's talk about Thompson sampling in iteration one, and assume the distribution for every arm is initially Beta(2, 2). We have three bandits — three machines: machine one, machine two, machine three — and these distributions represent their Q values for us.

So now, first of all, we need to sample a Q value from each distribution, and that randomness follows the probability of the distribution: I want to get one random point for each machine, and then based on those samples I am going to play. So I do it for you: I generate random numbers according to these probability distributions, and you can see, okay, maybe the sample lands here for machine one, here for machine two, and here for machine three. Is there any particular reason this happens? No, because the distributions are exactly the same, so the samples are essentially completely random.

Next step: I say, okay, the Q value is maximum here, because this point shows the highest value for me. After taking these sampled numbers, I take the maximum — the one that shows the highest chance to win, which is more or less the definition of "highest chance". It does not mean much here yet, but later it will make sense. So I use this point and play that machine.

Then, after playing it, the idea is to update the alpha and beta parameters according to winning or losing: if I win, I add one to alpha, and if I lose, I add one to beta.

Okay, and what happens then? Look here, even before the explanation: when I was changing alpha and beta, the probability distribution was changing — that is exactly the idea from the shapes I showed. When alpha is greater, the distribution moves to the right; when beta is greater, it moves to the left. The idea comes from there, from the beta distribution, and I am going to tune it by winning or losing. So in this example, iteration one, I generated one sample point per machine, this one showed the highest value, and I played it. Now let's see what happens if we win.
Whether we actually win, we do not know — we just pulled the lever, it is an assumption, because you are going to try it yourself — but in this case let's say we win. What happens if we win? We increase alpha by one, so it becomes alpha 3, beta 2, and you can see visually that this distribution shifts to the right. As it shifts to the right, the Q value we sample from it tends to increase, and now you can get the key point: if we keep drawing random values from this distribution alone, they are much more likely to land on the right side. Why did that happen? Because we won the game.

Clear? Okay, now let's go through iteration two and try again. What do we do? Generate random numbers based on the distributions we now have. In this case, let's say these samples come out, and this one looks like the winner — the sampled Q value is maximum there — so this time we play bandit number one. Now let's say this machine loses. What we do this time is: instead of alpha, we increase beta. If you increase beta, the curve shifts to the left, by the beta-distribution equation we defined, and now you can see one distribution shifted to the left, one shifted to the right, and one still like the initial one.

Now, what if we keep doing this for a long time? Based on each machine's chances of winning and losing, we end up with distributions like these. For example, they show that bandit one has a very low probability of winning after trying it — I showed around 200 tries here, for example; you can sum up all the tries to get how many times each machine was played in total. As I remember these are real values from a run — I should double-check, but I think the plots of the trials are correct. Bandit two ends up with something like a 70-75 percent winning chance, for example, and this one with around 25 percent, and this tells us: okay, bandit two is the one most likely to win, so keep playing that bandit. That is the Thompson sampling algorithm. Clear? Any questions? It is the same thing I just explained to you.

Now we are going to look at an implementation example I made for you for Thompson sampling, and after we discuss it, you have one assignment for next week — the most challenging part. I will also share the code with you; I will explain it a little now, and then you are going to implement the two other techniques we studied today, epsilon-greedy and UCB, compare all three together, and plot the results to see, for a three-armed bandit machine for example, which of these techniques works better at collecting reward. I also noted that you need to plot the total reward over 1000 iterations, as we are going to see in the example.

Okay, this chapter is also finished; let me conclude, and then we go to the code. We discussed and introduced the multi-armed bandit; we saw the epsilon-greedy algorithm, which will be very useful later on; we saw UCB and how the UCB algorithm can cover the problems of epsilon-greedy; and finally we went through one example of Thompson sampling, applied it to the multi-armed bandit, and understood the concept of how it works. And that is it for this first chapter on the multi-armed bandit in reinforcement learning.
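For reference, a minimal sketch along the lines of the Thompson sampling implementation mentioned above — the hidden win probabilities are made-up illustration values, not the ones from the lecturer's code:

```python
import random

true_win_probs = [0.45, 0.75, 0.25]        # hidden from the agent (illustration only)
k = len(true_win_probs)
alpha = [2] * k                            # start every arm at Beta(2, 2), as in the example
beta_ = [2] * k
total_reward = 0

for t in range(1000):
    # sample one estimated win probability per arm from its Beta distribution
    samples = [random.betavariate(alpha[a], beta_[a]) for a in range(k)]
    arm = max(range(k), key=lambda a: samples[a])      # play the arm with the highest sample
    reward = 1 if random.random() < true_win_probs[arm] else 0
    total_reward += reward
    if reward == 1:
        alpha[arm] += 1                    # win: shift this arm's distribution to the right
    else:
        beta_[arm] += 1                    # loss: shift it to the left

print("total reward after 1000 pulls:", total_reward)
```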