So we have the multi-armed bandit problem, in chapter two of reinforcement learning. In this chapter we will first talk about what the multi-armed bandit problem is, then about exploration and exploitation and what they mean. Based on that we will look at the epsilon-greedy policy, a very simple technique, then the upper confidence bound (UCB) algorithm, and then the Thompson sampling algorithm, and we will see how we can use them to solve, for example, a multi-armed bandit problem. In this chapter you are going to see the most basic reinforcement learning setting and how we can deal with it, with different techniques, in the simplest way.

So let's start with exploration and exploitation. I assume all of you already know the idea: exploration means you want to try something new, and exploitation means you use your memory, your knowledge of things that have already happened. We have a very nice example for this. Taiwan is all about restaurants and food, right? So let's say you are hungry and you want to go to a restaurant. You have two options. One is to go to a restaurant you already know: you have tried it, you know it is delicious and the price is affordable. The second option is to go to a totally new restaurant. Exploitation means going where you know; among the options you know, based on the price and how delicious the food is, you choose one. Exploration is to try something new. Clear? Very simple. So as a definition: exploration is trying new things, exploitation is using previous experience.

There is an important note here, and we will emphasize it many times: sometimes you need a long-term strategy and you have to make sacrifices. What does that mean? When you try something new, you may end up at a restaurant where the food is awful and very expensive; you wasted your money. But that is the sacrifice you make in order to find the really good restaurants. Let's say you already know five good places. If you never explore, there may be a restaurant out there that is very cheap and very delicious, one you would love, but to find it you may have to try 10, 20, 100 more places. That is the sacrifice you need to make to find the restaurant you really want to go to. And the restaurant I am talking about is basically the solution to our problem in reinforcement learning: the algorithm needs to sacrifice, to explore, in order to find the solution and bring it into its memory.

Okay, so now we want to talk about the multi-armed bandit problem. In a casino you are facing multiple slot machines, each configured with an unknown probability of winning. You can play different machines by pulling the handle (the lever, it goes by different names), and each machine has a different winning probability already built in: the producer of the machine has written some code so that, for example, on this machine you win 25 percent of the time (too high in reality, but fine for a simple example), on this one 75, here 30, there 50.
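To make this concrete, here is a minimal sketch of such a bandit environment in Python. The win probabilities 0.25, 0.75, 0.30, 0.50 are just the illustrative numbers from the example above, and the class and variable names are mine; in a real casino the probabilities would of course be hidden from the player.

```python
import random

class BanditMachine:
    """One slot machine ('arm') with a hidden win probability."""
    def __init__(self, win_prob):
        self.win_prob = win_prob  # unknown to the agent

    def pull(self):
        """Pull the handle: reward 1 if we win, 0 otherwise."""
        return 1 if random.random() < self.win_prob else 0

# The illustrative probabilities from the example (hidden from the agent).
machines = [BanditMachine(p) for p in (0.25, 0.75, 0.30, 0.50)]

# The agent only ever observes rewards, never the probabilities themselves.
print([m.pull() for m in machines])
```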
Of course, in a real environment these probabilities would be much smaller; here we just use simple numbers as an example. Now the question is: what is the best strategy to achieve the highest long-term reward? Think about it: there are multiple machines, you are the agent, you are going to play, and you do not know the probabilities. What is your strategy?

Yes, that is a very nice strategy: try each one of them 10 times, see which one you win on more, and then keep playing that one. That is one possible strategy. But what we want to do here is formalize the problem in the form of reinforcement learning, so that we can solve it with different techniques.

Okay, nice. To understand the concept, first consider an infinite number of trials. Why? Because if we have a finite number of trials, how many is enough? Your classmate said 10; is 1000 enough? Maybe not, to get the exact probability. So for now, for simplicity, we say we can try forever and see what the real probability of each machine is; later, of course, we will have a finite number of tries.

This is a very nice example of exploration versus exploitation. Why? Because of the strategy you suggested at the beginning: say you play the first machine and you win. Are you going to keep playing that machine forever? No. As your classmate said, you try each one, say, 10 times; that means at the beginning we do exploration. But later you want to collect a lot of reward, so based on the probabilities you have estimated, you play the machine with the higher chance of winning. That second stage is basically exploitation.

Now a question for you: this problem is well known as a one-state problem. Why? Okay, close, but not quite the answer. The original state? Yes, very close, I can accept that as the answer. What your classmate means is this: you are in the current state, you choose one of the actions, and then, if you imagine there is a next state, what is the difference between before and now? Is the reward changing? Is anything happening? We are in exactly the same scenario: the probabilities are still unknown and we again have to play one of the machines. Nothing changes; the reward distribution does not change over time. Because of that, we have a single-state problem: the actions do not affect future rewards, and the reward distributions are stationary. Of course, most of the problems we face later will not be like this; here we are just warming up before getting into the real problems of reinforcement learning.

So the goal is maximizing the accumulated reward over time, of course. And note: this is the simplest version of the Markov decision process that we will see next chapter. The reason is that we do not have many states, we have just one, whereas in a Markov decision process we have multiple states. So what is the most basic approach to solve this? You already gave a better answer, but one basic approach is simply to play one machine for many runs and collect whatever reward it gives.
Like a player who says: I don't care, I randomly pick one machine and just play, play, play, whatever happens. That is one approach, but of course it is not a good one: it does not guarantee the best long-term reward. Also, given the limited knowledge and resources we have, we need a better approach, and that is what we want to work out: how to solve this problem properly.

The definition of the multi-armed bandit considers a set of actions and rewards. So we do not have S, states, anymore for this chapter; we have actions and rewards: we do something and we get a reward. Then we have K machines with reward probabilities: for each of the K machines there is a probability of getting a reward, so we have p_k per machine. The value of an action a is its expected reward. So for each action we can compute a value. What was the value we defined in the previous chapter? The expected reward of an action, given the state. Here it is slightly different because we no longer have a state, but in the later problems we will have states again and this expected reward carries over. In the end, this value represents the winning probability of each bandit, each machine. At time step t the reward looks like r_t = R(a_t): a reward drawn for the chosen action according to the probability that is hidden from us at the beginning.

Formally, the goal is maximizing the cumulative reward, the sum of r_t from t = 1 to T. If we knew the optimal action and its reward, the goal could equivalently be stated as minimizing a loss function. What is a loss function? We have objective functions, loss functions, reward functions; to clarify a little, a loss function, which is also famous in neural networks, measures the difference between what we are supposed to do and what we are actually doing, the difference between the optimal answer and our current behaviour. Here we can define it as regret: if we play some machine, how much should we regret it? The regret, or loss, is the difference between the optimal reward probability p* and Q(a_t), the value of the action we actually took; again, no state for now, just the action. We sum these differences over time and call the result the expected loss of the bandit problem. The optimal reward probability can be written as p* = Q(a*), the maximum over all the machines' probabilities: if we could play each machine an infinite number of times, after billions of pulls the empirical winning frequency of each machine would converge to its true probability, so the maximum of those gives us Q(a*), the optimal reward probability. Clear? Let me write these definitions together compactly below before we move on.
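Here is a compact restatement of these definitions, using p_a as the hidden winning probability of machine a (this is just my shorthand for the quantities described above):

$$Q(a) = \mathbb{E}\,[\,r \mid a\,] = p_a, \qquad p^{*} = Q(a^{*}) = \max_{a} Q(a) = \max_{a} p_a$$

$$\mathcal{L}_T = \mathbb{E}\!\left[\sum_{t=1}^{T}\bigl(p^{*} - Q(a_t)\bigr)\right]$$

Maximizing the cumulative reward and minimizing this expected regret are two ways of stating the same goal.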
As we said, one approach is playing one machine for many runs; this is a naive approach, because there is no exploration at all. A second approach is exploring randomly: instead of picking one machine and playing it forever, we play a random machine every time. Again, this approach is not reliable, because the randomness is complete: every time you play randomly, whatever happens is pure luck; there is no strategy and you learn nothing. What we actually want is exploration that prefers uncertainty, what we can call wise exploration, and that is our objective. We want to implement, or at least see how to use, some techniques that do this, and today we are going to see a few.

So let's start with the epsilon-greedy policy. It is very simple if you remember the exploration and exploitation definition and the restaurant example: the usual place or somewhere new. The epsilon-greedy algorithm says: toss a coin. If you just toss a fair coin it is 50/50, but the split can be different. Part of the time you explore, meaning you take a random action; the rest of the time you use your knowledge. That is the whole of epsilon-greedy: toss a coin, and either take a random action or use your memory. Using your memory means taking the best action we have so far, and where does the best action come from? From the Q value we keep for each action.

I formalized it in a few simple steps for you. One: generate a random number between zero and one; you can use the rand function in any programming language. Two: if the random number is smaller than epsilon, choose a random action; that is exploration. Otherwise, choose the greedy action, the best one so far; that is exploitation. So epsilon is the probability of exploring, and keep that in mind for when we talk about decaying it. This technique, although very simple, is going to be very useful; we will use it in different algorithms. One of the famous ones is Q-learning, and we will use it in deep Q-learning too.

Now, before going further, the question is: what should the value of epsilon be? What do you think? I gave the example of tossing a coin, which is 50/50. A minimal sketch of the selection step is below, and then we will come back to this question.
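Here is a minimal sketch of that selection step, assuming we already keep a list of Q-value estimates; the names `q_values` and `epsilon_greedy_action` are mine for illustration.

```python
import random

def epsilon_greedy_action(q_values, epsilon):
    """With probability epsilon explore (random arm), otherwise exploit (greedy arm)."""
    if random.random() < epsilon:                   # steps 1 and 2: random number vs. epsilon
        return random.randrange(len(q_values))      # exploration: any arm, uniformly at random
    return max(range(len(q_values)), key=lambda a: q_values[a])  # exploitation: argmax of Q

# Example: three machines, current estimates, 30 percent exploration.
print(epsilon_greedy_action([0.2, 0.5, 0.1], epsilon=0.3))
```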
But is 50/50 good? Is 75/25 better? How should I define it for this simple algorithm? Anything could work, but that is not quite the answer. Exactly: the honest answer is that we do not know the value. If I claim 50/50 is good, you can say no, it is not, and how would I argue? You are right, I cannot say 50/50 is good, and I cannot say 90/10 is good either. To deal with this problem we have a technique called epsilon decay. At the beginning we do not know which restaurant is good, right? Say I am new to Taiwan, I do not know anywhere (this is actually true for some of the students here who are very new), so I just explore. So we set the epsilon value high at the beginning, so that everything is random, and then gradually, as time passes, we decrease it. After six months, after a year in the restaurant case, we say: we have explored enough, now exploit, use your own knowledge, do not explore too much anymore. That is a very clear picture of what epsilon decay does. Is it important to know? Of course: we are going to use it even in very advanced algorithms; some of them still rely on this technique. If you have studied neural networks or deep learning, it is somewhat similar to a learning-rate schedule, but it is very widely applicable in reinforcement learning: Q-learning, SARSA or expected SARSA, which we will study, and later deep Q-learning and so on. Clear? Any questions?

So, in the multi-armed bandit with the epsilon-greedy policy, the Q value is estimated from past experience: we average the rewards of an action observed up to the current time step. We can write it as

$$\hat{Q}_t(a) = \frac{1}{N_t(a)} \sum_{\tau=1}^{t} r_\tau \, \mathbb{1}[a_\tau = a]$$

where N_t(a) is the number of times we have played that specific machine, the indicator says whether we played machine a at step tau or not, and the sum collects the rewards from that specific machine. This gives us a Q value per machine: for each one we calculate a separate Q that tells us the average reward we got from it. And what is the reward itself? If you lose, you can say minus one, or zero; if you win, plus one, or plus 100, whatever you decide. Who defines it? You do, using your experience with different reinforcement learning algorithms, to know what is likely to work well. Usually, for a problem like task planning from a start to a goal, we give a big reward for reaching the end result, possibly with smaller rewards for intermediate steps, but the success still gets the big reward. It is like when you were a kid and did something nice: your parents said "wow, very nice", gave you chocolate or praised you, you were happy, you remembered that action, and you tried to do it again. Something similar happens here.

Now you may say: okay, we have Q(a), but what do we do with it? This Q just assigns a value to each action. We will come to that in a moment; first, here is a small sketch of the averaging itself.
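A minimal sketch of that running average, assuming we track a reward sum and a pull count per machine (the class and attribute names are mine):

```python
class QEstimator:
    """Sample-average estimate: Q(a) = total reward from arm a / number of times a was played."""
    def __init__(self, n_arms):
        self.counts = [0] * n_arms          # N(a): how many times each arm was played
        self.reward_sums = [0.0] * n_arms   # sum of rewards received from each arm

    def update(self, arm, reward):
        self.counts[arm] += 1
        self.reward_sums[arm] += reward

    def value(self, arm):
        if self.counts[arm] == 0:
            return 0.0                      # no information about this arm yet
        return self.reward_sums[arm] / self.counts[arm]

# Example: arm 1 won twice and lost once, so Q(1) = 2/3.
est = QEstimator(3)
for r in (1, 1, 0):
    est.update(1, r)
print(est.value(1))
```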
So the next step is very simple: we take the argmax of Q(a). Who knows what argmax is? Anybody? Okay: argmax finds the maximum and returns the index, or the item itself, not the value. You have multiple values; argmax tells you which one is the maximum and returns its position (or the item), not the value stored there. So here it gives us the action with the maximum Q, and we put it in a*_t: that is the greedy, best action we currently have. Any questions? Thank you.

If you run this approach, it is going to work. But what are the challenges? You do this in a loop: apply epsilon-greedy each time, update Q, get the best action, and over time, depending on the coin toss, sometimes you pick a random machine and sometimes the best-known one. Over time, especially if you apply epsilon decay, it works. But here is the challenge: if we can tune the epsilon parameter well, we will get good results, but even though we introduced epsilon decay, how large should the decay be, how fast? It is still not easy. Should it decay over one million steps, over 100, over 500? If you can tune that, it works pretty well, but it remains challenging. So you may ask: if I am going to use this, how do I find the epsilon decay rate? Any guess?

Again, the challenge in epsilon-greedy is defining epsilon, the value that decides how often you pick a random action. To deal with that I said we can use epsilon decay: very random at the beginning, then decrease it over time until there is no randomness anymore, just the best action. Now the question is the schedule: how fast should this decay happen, and how do we find it? Yes, I know what you mean, you are proposing a smarter, automated approach, and such techniques exist. But let me explain the simplest one first: just try. Run it once with a decay over a thousand steps, once over 500, once over 100, and see what happens; see when it converges and stops finding better results. By trying different values for the decay rate, the amount by which exploration decreases toward exploitation, you will find a good one. Or, as your classmate said, you can write some code that tries different values and finds the best decay rate for you. Is this important? Very: you are going to code and implement this, and if you cannot do these things, you will not be able to run even Q-learning, one of the most basic algorithms we will use to solve problems. So try out yourself whatever I explain here.

So, defining a proper value of epsilon is a challenge, as I said. Below is a small sketch of the "just try a few schedules" idea, and after that we look at one of the solutions: the upper confidence bound, UCB.
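As a sketch of that idea, here is a toy comparison of a few exponential epsilon-decay rates on a simulated three-armed bandit. Everything here is made up for illustration: the decay factors, the 0.25/0.75/0.50 win probabilities, and the function names are my own choices, not values from the slides.

```python
import random

def run(decay, probs=(0.25, 0.75, 0.50), steps=1000):
    """Epsilon-greedy with exponential epsilon decay; returns the total collected reward."""
    counts = [0] * len(probs)
    sums = [0.0] * len(probs)
    epsilon, total = 1.0, 0
    for _ in range(steps):
        if random.random() < epsilon:
            arm = random.randrange(len(probs))   # explore
        else:                                    # exploit: greedy w.r.t. the sample averages
            arm = max(range(len(probs)),
                      key=lambda a: sums[a] / counts[a] if counts[a] else 0.0)
        reward = 1 if random.random() < probs[arm] else 0   # the hidden machine decides
        counts[arm] += 1
        sums[arm] += reward
        total += reward
        epsilon *= decay                         # shift from exploration to exploitation
    return total

for decay in (0.9, 0.99, 0.999):
    avg = sum(run(decay) for _ in range(20)) / 20   # average over 20 runs to smooth out luck
    print(f"decay {decay}: average total reward {avg:.1f}")
```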
UCB is an approach that takes into account the number of times each action has been tried. Think of it this way: say I have three machines and I am in exploration mode. With the previous algorithm, epsilon-greedy or whatever it is, exploring just means trying randomly without caring about anything, and after one thousand plays I may have tried this machine only 10 times and those two machines a lot. Is that good? It is not very wise, but with pure randomness it can happen: maybe those two keep getting picked and this one rarely does. So the upper confidence bound, UCB, proposes an approach to deal with this problem; let's see how.

The definition: favor exploration of the actions with strong potential to have the optimal value. What is the optimal value? Remember, we defined it today and I said it is very important, that I would use it a lot up to the end of the class; so when I say optimal value you should immediately picture that equation. We are going to favor exploring actions that could turn out to have the optimal value. Let's have a look.

We want to define a bound per action; that bound indicates how confident we are about the Q value of that action, and again you should immediately recall what the Q value is. Less confidence means a big bound; more confidence means a small bound. How do we define the bound? We use the following equation; look at the equation first, and on the next slides we will see a visual example and a numerical example of what happens. The upper bound of action a that we want to calculate is

$$\hat{U}_t(a) = \sqrt{\frac{2\,\log t}{N_t(a)}}$$

where t is the total number of actions chosen so far and N_t(a) is the number of times we have chosen that specific action; and here the actions correspond to the machines: a1, a2, a3 are machine 1, machine 2, machine 3, for example. The logarithm of t grows very slowly, and dividing by how many times we have played each machine gives us an upper bound that, together with the Q value, tells us which action deserves to be played.

One more point to mention: if N_t(a), the number of times you have played an action, is large, the bound becomes small; the confidence interval shrinks. That is exactly what we want, because we want to give more chances to the machine we have not played much. If we have played these machines thousands of times and that machine only 10 times, we cannot be confident about that machine, so we give it more chance through this confidence term. Said another way: we prefer actions for which we do not yet have a confident value estimate. The UCB algorithm measures this potential with the upper confidence bound of the reward value, and then greedily selects the action that maximizes Q plus the upper confidence bound:

$$a_t^{UCB} = \arg\max_a \bigl[\hat{Q}_t(a) + \hat{U}_t(a)\bigr]$$

Each time, after calculating the upper bound, we add it to Q_t(a) and take the action with the maximum sum. So the value of each machine, plus the confidence term, gives the score we compare.
Why are we adding the Q value and the bound? Because if a machine has a very high chance of winning, we do not care so much anymore whether we have played it a lot or not: if this machine wins 80 or 90 percent of the time, then even after trying it hundreds of times we still want to keep playing it, while occasionally trying the others too. That is the reason for having both terms, and we take the argmax in a greedy way; greedy means we always go for the best one.

So let's see an example to clarify everything for UCB. In this example, assume we have three machines, machine 1, 2 and 3, and the Q values of the machines are initially all 1. What does that mean? It means the value of playing each of them is 1, equal for all; for simplicity we just start them equal, and of course these get updated after playing. N(a) indicates how many times we have tried each machine: here, for example, we have tried machine 1 200 times, machine 2 150 times, and machine 3 750 times, and the total number of plays t is 1000. Now, using the equation, we compute the upper bound of machine 1: 2 times the logarithm of t (what is t? we have played 1000 times in total) divided by N(a1), how many times we played that machine, 200, then the square root; we compute it and get some value. Clear? We just plug in the numbers: the total number of plays and the number of times we played that machine; the Q value stays the same for now, we are not touching it yet, I will visualize it in a moment. For the second machine everything is the same except that we played it 150 times, so we get another value; and for the third machine we plug in 750 and get 0.29.

Now let's visualize it. These are the Q values of a1, a2, a3; the points lie on one horizontal line because they are equal. In general the Q values could differ and the points could sit higher or lower, so be careful about that. Then we take the upper bounds we calculated, 0.51, 0.64 and 0.29 on the slide, and draw them on top: from each Q point we go up by 0.51 in the first case, 0.64 in the second, 0.29 in the third. And then what do we do? We pick the machine with the maximum value of Q plus upper bound and play it. So here we can say machine 2 gives us the highest hope of playing, winning the game and getting a higher reward. This changes over time: the Q points move up and down, the upper confidence bounds change with the counts, and based on that we keep trying different machines. The algorithm tells us what to do at every step, instead of tossing a coin like epsilon-greedy and choosing between a random action and the best action, which makes the whole procedure much smarter.

I did not put a comparison here; if you look in the book there are comparisons between epsilon-greedy, UCB and other algorithms. Since you have an assignment on this, you are going to compare them yourself and see which one works better on a given problem. A small numeric sketch of this example follows.
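Here is a sketch of the UCB score for the counts in this example (equal initial Q values, t = 1000). Depending on the logarithm base and the constant used on the slide, the exact bound values may differ from the 0.51 / 0.64 / 0.29 quoted above, but the ordering, machine 2 highest and machine 3 lowest, comes out the same. The function name is mine.

```python
import math

def ucb_scores(q_values, counts, t):
    """Score every arm with Q(a) + sqrt(2 * ln(t) / N(a)); pick the arm with the largest score."""
    scores = [q + math.sqrt(2 * math.log(t) / n) for q, n in zip(q_values, counts)]
    best = max(range(len(scores)), key=lambda a: scores[a])
    return scores, best

# The example's numbers: equal initial Q values, play counts 200 / 150 / 750, t = 1000 total pulls.
scores, best = ucb_scores([1.0, 1.0, 1.0], [200, 150, 750], t=1000)
print(scores)                     # machine 2 gets the largest bonus because it was played least
print("play machine", best + 1)
```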
Okay, any questions? Is everything clear? All right, I will take a two or three minute break and then we continue with Thompson sampling.

All right, so now we are going to discuss the Thompson sampling algorithm, and see what it does and how it differs from the previous ones. The idea: at each time step we want to select an action according to the probability that it is the optimal one; in Thompson sampling we choose actions in a probabilistic form. But before talking about the Thompson sampling algorithm itself, we need to talk about the beta distribution, in case you have not heard of it yet. We consider Q(a), which is essentially the success probability theta, and we will see on the next slide how to model it with a beta distribution; we will see the equation. Here we have two parameters, alpha and beta, corresponding to the counts of how many times we have succeeded or failed in getting a reward from each of the actions, the machines we are playing. Thompson sampling then samples an expected reward Q̂(a) (the hat, and whenever I forget to explain it: in machine learning, whenever you see a hat it means an estimate or expectation of something) from the prior distribution Beta(alpha_i, beta_i), which we will see in a minute, and the best action is selected among those samples.

The beta distribution says we can write the density like this:

$$f(x; \alpha, \beta) = \frac{1}{B(\alpha, \beta)}\, x^{\alpha-1} (1-x)^{\beta-1}, \qquad B(\alpha, \beta) = \int_0^1 u^{\alpha-1} (1-u)^{\beta-1}\, du$$

So alpha and beta are the two input arguments, and this is the term that defines the beta distribution. What does it mean? You feed in the two parameters, alpha and beta, and you get a probability density over x between zero and one, and alpha and beta determine the shape of that distribution on the plot. We will see in a second that we can shift this distribution to the right or to the left, which is exactly what we need for modelling the probability we are after. The B(alpha, beta) in the denominator is just the normalizing integral. Are we going to prove it and derive where it comes from? Not in this course; if you are interested, have a look on your own.

What does it look like? In this example: if the input arguments alpha and beta are the same small values, the distribution over x is flat. If you increase both equally, it starts to look like a Gaussian (not exactly Gaussian, but similar), and if you increase them more, still equal, it looks even more Gaussian-like. These are the input arguments I feed into the equation, and they give me the probability distribution you see here. But if I set alpha to 1 and increase beta to 6, the curve leans toward zero; with alpha 2, beta 2 it is symmetric; and if alpha is bigger than beta, the curve leans toward one. What is the intuition here? The intuition is this: when we talk about the probability of winning on the machines, can we somehow use this beta distribution, playing with these two parameters, to end up with a good distribution over which machine to play?
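To see these shapes yourself, here is a tiny sketch that evaluates the density on a grid; the (alpha, beta) pairs are my own picks to mimic the cases on the slides, and the online demo linked on the slide does the same thing interactively.

```python
import math

def beta_pdf(x, alpha, beta):
    """Beta density f(x; alpha, beta) = x^(alpha-1) * (1-x)^(beta-1) / B(alpha, beta)."""
    b_norm = math.gamma(alpha) * math.gamma(beta) / math.gamma(alpha + beta)
    return x ** (alpha - 1) * (1 - x) ** (beta - 1) / b_norm

# Evaluate the density on a grid for a few (alpha, beta) pairs and report where it peaks.
xs = [i / 100 for i in range(1, 100)]
for a, b in [(2, 2), (10, 10), (1, 6), (6, 1)]:
    peak_x = max(xs, key=lambda x: beta_pdf(x, a, b))
    print(f"alpha={a}, beta={b}: density peaks near x={peak_x:.2f}")
```

Larger beta pushes the mass toward zero, larger alpha pushes it toward one, and equal parameters keep it centred, which is exactly the behaviour Thompson sampling exploits.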
The answer is yes, and Thompson sampling tries to do exactly that. Clear up to here? If you want to play with it, I put a link to an online implementation where you can try different alpha and beta values and see what happens; I think I also used it in the code (I should check, I developed some examples for you), but anyway, if you go to that link you can experiment with it, and it is useful. The idea is to use this beta distribution to solve our multi-armed bandit problem. Can we do that? Yes. How? Let's see. The idea is to estimate the alpha and beta values by changing them continuously, which changes the shape of the distribution attached to each possible action; and if we repeat this many times, sampling the distributions and getting Q values, we can form a true estimate of the winning probability of each machine.

How? Let's see it visually. Thompson sampling, iteration one: assume every distribution is initially Beta(2, 2), and we have three bandits, machine 1, machine 2, machine 3, and these distributions represent the Q values for us. First we need to sample a Q value from each distribution, a random draw whose randomness follows that distribution. So I draw one random point per machine, and based on that I will play. I did it for you: I generated random numbers from these distributions, and we get, say, this point for machine 1, this one for machine 2, and this one for machine 3. Is there any deeper reason it came out this way? No: the distributions are identical, so at this stage it is essentially completely random. Next step: I say the Q value is maximum here, because this point shows the highest value. After drawing these numbers, taking the maximum, the one that suggests the highest chance of winning, is in a sense how we define the most promising machine; it does not mean much yet, but later it will. So I take that point and play that machine. After playing it, the idea is to update the alpha and beta parameters according to winning or losing: if I win, I add one to alpha; if I lose, I add one to beta.

And what happens then? Look back at what I showed earlier without explanation: when I changed alpha and beta, the probability distribution changed. When alpha is greater, the distribution moves to the right; when beta is greater, it moves to the left. That is where the idea comes from, and I tune the parameters by winning or losing. So in iteration one I generated one sample point per machine, that one showed the highest value, and I played it. Now, did we win? We do not know in advance; we just pulled the lever. Let's assume we won (you are going to try this yourself anyway). In that case we increase alpha by one, so we have Beta(3, 2), and you can see visually that this distribution leans to the right. As it leans to the right, the Q samples we draw from it tend to be larger. And now you can see the key point: if we keep sampling from this distribution, the random values are much more likely to land on the right side. Why did that happen? Because we won the game. Clear? Okay. A minimal sketch of this whole loop is below, and then we will walk through iteration two.
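Here is a minimal sketch of that loop, assuming a 0/1 reward per pull; the win probabilities 0.25/0.75/0.50 and all the names are mine for illustration, not values from the slides.

```python
import random

def thompson_sampling(true_probs, steps=1000):
    """Thompson sampling with Beta priors: sample a Q per arm, play the argmax, update alpha/beta."""
    n = len(true_probs)
    alpha = [2] * n   # prior 'win' counts, starting from Beta(2, 2) as in the example
    beta = [2] * n    # prior 'loss' counts
    total = 0
    for _ in range(steps):
        samples = [random.betavariate(alpha[a], beta[a]) for a in range(n)]  # one draw per machine
        arm = max(range(n), key=lambda a: samples[a])                        # play the largest sample
        win = random.random() < true_probs[arm]                              # the hidden machine decides
        if win:
            alpha[arm] += 1   # a win pushes this machine's distribution to the right
            total += 1
        else:
            beta[arm] += 1    # a loss pushes it to the left
    return total, alpha, beta

total, alpha, beta = thompson_sampling([0.25, 0.75, 0.50])
print("total reward:", total)
print("posterior (alpha, beta) per machine:", list(zip(alpha, beta)))
```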
Now iteration two, let's try it again. What do we do? We generate random numbers from the distributions we have now. In this case, say we get these samples, and this one looks like the winner, the sampled Q is maximum there, so this time we play bandit number one. Now let's say this machine loses: this time, instead of alpha, we increase beta, and if you increase beta the curve leans to the left, by the beta distribution equation we defined. So now you can see one distribution leaning to the left, one leaning to the right, and one still looking like the initial one.

What if we continue this for a long time? Based on each machine's chances of winning and losing, we end up with distributions like these, which show, for example, that bandit 1 has a very low probability of winning after being tried; I showed around 200 tries here, and you can sum up the parameters to see how many times each machine was tried (as far as I remember these are real values from a run; I should check, but I think the plots are correct). Bandit 2 ends up with something like a 70 to 75 percent winning chance, and another machine around 25 percent, and this tells us that bandit 2 is the one most worth playing. That is the Thompson sampling algorithm. Clear? Any questions?

So, that is exactly what I just explained; now we will go and look at an implementation example I prepared for you for Thompson sampling. After we discuss it, you have an assignment for next week, the most challenging part: I will share the code with you and explain it a little, and then you will also implement the two techniques we studied today, epsilon-greedy and UCB, compare all three together, and plot the results to see, for example on a three-armed bandit machine, which of these techniques works better at collecting reward. I also noted that you need to plot the total reward over 1000 iterations, as we will see in the example.

Okay, this chapter is finished, so let me conclude before we go to the code. We discussed and introduced the multi-armed bandit; we saw the epsilon-greedy algorithm, which will be very useful later; then UCB and how the UCB algorithm can cover the problems of epsilon-greedy; and finally Thompson sampling, which we applied to the multi-armed bandit to understand how it works. And that is it for this first chapter on the multi-armed bandit in reinforcement learning.