Thank you very much. I want to thank the organizers for inviting me to give this lecture on Model Predictive Control and Reinforcement Learning. The contents of the lecture are based on my course at ASU on reinforcement learning that I have given in the last five years, and also on my recent books that you see over here.
This last book here is the textbook for the course, and this book in blue is the one most connected with this lecture.
It's a short monograph, and you can find the PDFs of these two books on the internet at my website. So let me give an outline of my talk. I'm going to try to connect reinforcement learning and MPC, and I'm going to use computer chess as a vehicle for doing so. Then I'm going to introduce a new theoretical framework for MPC that is based on Newton's method for solving the Bellman equation of the associated dynamic programming problem.
All of this is going to be for one-step look ahead, and then I'm going to discuss MPC with multi-step look ahead, then discuss a few related issues relating to stability, the rollout and policy iteration algorithms, and then I'm going to discuss some applications. If time permits, I'd like to discuss an application of MPC to computer chess for improving the performance of any chess engine; it's something very recent that I'm quite excited about, and I hope I'll have time to get into it. So let me go into computer chess. It's an old subject, but it went through a rather dramatic transformation with the introduction of the AlphaZero program in 2017, which plays remarkable and innovative chess and also created a lot of enthusiasm for the prospects of reinforcement learning and artificial intelligence. Now, AlphaZero, like most computer chess programs, involves two algorithms.
One is the offline training algorithm, by which the program learns to evaluate positions. This is all done offline, before the program gets to play any game against an opponent. And then there is the online play algorithm, which is used to play against human or computer opponents, and it relies on multi-step look ahead with a position evaluation at the end.
Now, most attention has been focused on the AlphaZero offline training algorithm, which involves innovative neural network technology, deep neural networks and the like. The online play algorithm is more or less traditional, but it's the one that's critically important for good performance, in AlphaZero as well as in general computer chess engines. And the online play algorithm of computer chess is the one that's strongly connected with MPC.
So going into MPC, let me describe briefly the model that I'm going to have in mind as I speak. It involves L-step look-ahead optimization, followed by a cost function approximation at the end of the L steps. The problem I'm considering is a deterministic optimal control problem that involves a state x k and a control u k, and the state evolves according to the system equation. There is a cost at each stage k that depends on state and control. At the current state x k, MPC considers an L-step look-ahead optimization involving the next L controls and a terminal cost function approximation after that, at the state that results after those L steps.
And the problem that's being solved involves the cost of the first L steps and an approximation of the cost of the future steps by means of this J tilde function. We solve this problem for the first L controls, from u k to u k plus L minus 1, obtain the optimal sequence, apply the first control of that sequence, and discard the remaining controls. Then at the next stage, k plus 1, we repeat the process. It's a classical form of MPC, deterministic MPC.
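To fix notation, the L-step look-ahead problem just described can be written as follows; here f stands for the system function, g for the stage cost, and J tilde for the terminal cost function approximation, a generic form that should match the slides up to symbols:

$$
\min_{u_k,\ldots,u_{k+L-1}} \;\sum_{i=k}^{k+L-1} g(x_i,u_i) \;+\; \tilde J(x_{k+L}),
\qquad \text{subject to } x_{i+1}=f(x_i,u_i),\ \ i=k,\ldots,k+L-1,
$$

and only the first control $u_k^*$ of the minimizing sequence is applied, after which the process is repeated at $x_{k+1}$.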
And let me go now into computer chess. It also involves multi-step look ahead with a position evaluation at the end. So at the current position or current state, we consider all possible legal moves, which represent the controls.
Then we consider the moves of the opponent, and that's the first stage of look ahead. Now there are multiple stages like that, and at the end we look at the position obtained and evaluate it using the offline-obtained position evaluator. Now, if you look at this diagram and compare it with the previous one, they are essentially isomorphic.
Both involve multi-step look ahead with some kind of cost evaluation that represents the future. However, there are some differences. In chess, the state and control spaces are discrete. There's a finite number of positions, a finite number of moves that we can make. Whereas in MPC, they are usually continuous, or sometimes mixed, discrete, and continuous.
In chess, the look-ahead tree is usually pruned and simplified for computational efficiency, while in MPC the look-ahead optimization is usually exact. But these two differences, I would argue, are inconsequential, because they can be interpreted within the framework of a Newton step theory, which relies on abstract dynamic programming and allows arbitrary state and control spaces as well as inexact minimization. Even though I'm going to focus on simple problems, everything that I'm going to discuss in this lecture generalizes.
to much broader domains. There's one more difference in computer chess. Chess is a two-player game, but you can view it as a one-player game if you consider chess against a fixed opponent. There's only one branch that's going to come out of here, the decision making of the opponent is known, and so it becomes a one-player game.
And I'm going to discuss this a little later also. So let me get into the principal viewpoints of this talk. I'm going to argue that online play with one-step look ahead is a single step of Newton's method for solving the problem's Bellman equation.
And for multi-step look ahead, a similar interpretation applies. Now, the offline training provides the starting point for the Newton step, and the Newton step improves on it. But online play is the real workhorse, while offline training plays a secondary role.
Now, this Newton step framework is very general, because it's couched in abstract dynamic programming ideas, and it applies to arbitrary state and control spaces, stochastic problems, deterministic problems, hybrid systems, multi-agent systems, minimax problems. Anything that you can imagine is covered by this framework, because the fundamental principles underlying it are abstract and very broad in scope.
There's a cultural difference which has always mystified me and I'm trying to understand. Reinforcement learning and artificial intelligence research and model predictive control have different emphases. Reinforcement learning is focused largely on offline training issues, while model predictive control research is focused largely on online play and also stability issues, which reinforcement learning doesn't consider almost at all.
However, through this framework that I'm going to discuss, I'm going to argue that MPC and reinforcement learning have strong similarities, and they stand to learn a lot from each other. So let me go into this Newton theoretical framework. For visualization, I'm going to focus on linear quadratic problems that are one-dimensional. Everything generalizes, but the visualization is not possible if you go to two-dimensional or higher-dimensional problems.
So we assume that the state and control are scalars. The system equation is like so, with a and b being given scalars. The cost function is quadratic, where q and r are positive scalars, and the basic facts about this problem are very well known. The optimal cost function is a quadratic function of x, and the optimal policy is a linear function of x. And the optimal solution is as follows.
It is obtained in terms of a Riccati equation. The optimal cost function J star of x, the optimal cost that one can obtain starting from state x, is quadratic, with the coefficient K star being the unique positive solution of the Riccati equation, which is a fixed point equation involving the Riccati operator that I've written here for one-dimensional problems. The optimal policy is linear, of this form, where the scalar L star is given by this expression in terms of K star.
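For reference, here are the standard one-dimensional forms of these facts, with q and r the state and control cost weights:

$$
x_{k+1}=a\,x_k+b\,u_k,\qquad \text{cost per stage: } q\,x_k^2+r\,u_k^2,
$$
$$
J^*(x)=K^*x^2,\qquad K^*=F(K^*),\qquad F(K)=\frac{a^2 r K}{r+b^2K}+q,
$$
$$
\mu^*(x)=L^*x,\qquad L^*=-\,\frac{a\,b\,K^*}{r+b^2K^*}.
$$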
Now, I'm going to focus for visualization on one-dimensional problems, but the insights generalize to multi-dimensional linear quadratic problems and beyond. So let's look at the graphical solution of the Riccati equation. Here's the Riccati equation. To solve this equation graphically, we have to draw the graph of the right-hand side and the graph of the left-hand side and find the points where the two graphs meet.
So here's K, and the graph of the Riccati operator is what you see here, the red curve. This is the graph of the right-hand side. The graph of the left-hand side is a straight line, the 45-degree line, and the point where the two meet is K star, the solution of the Riccati equation. K star corresponds to the point of intersection of the graph of F with the 45-degree line.
Now, the Riccati equation can also be solved iteratively, by iterating with the Riccati operator. This is called value iteration in dynamic programming, and value iteration generates the optimal cost functions of k-stage problems.
The k-th iteration generates the optimal cost function of a k-stage problem, and it's well known that jk is quadratic of this form, where the coefficient kk is obtained by iterating with the Riccati operator, like so, with k0 being a given initial condition. And now let's try to visualize the Riccati iteration. Here is k, the Riccati operator at the typical iterate, kk. We obtain kk plus 1 by iterating with the Riccati operator, it's this point here. We reflect this point down to the horizontal axis, here's kk plus 1, and kk plus 2 is obtained similarly, kk plus 3, and so on.
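To make this iteration concrete, here is a minimal numerical sketch of value iteration for the one-dimensional Riccati equation; the parameters a, b, q, and r are illustrative choices, not values from the talk.

```python
def riccati_op(K, a=2.0, b=1.0, q=1.0, r=1.0):
    """One-dimensional Riccati operator F(K) = a^2 r K / (r + b^2 K) + q."""
    return a**2 * r * K / (r + b**2 * K) + q

K = 0.0                       # initial condition K_0
for k in range(30):           # value iteration: K_{k+1} = F(K_k)
    K = riccati_op(K)
print(K)                      # approaches K*, the fixed point of F (about 4.236 here)
```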
So value iteration can be interpreted in terms of this staircase structure that starts from any point over here and converges to the optimal solution K star. Similarly, if we start on this side, the Riccati operator again operates in a staircase fashion, and each iteration brings the iterate closer to K star. So now let's try to visualize linear policies and the Riccati equation for linear policies, for both stable and unstable policies.
So let's consider a linear policy involving a scalar coefficient L, and its cost function. Now, this cost function is quadratic, of the form K L x squared, where K L is the unique solution of the Riccati equation for linear policies, sometimes also called the Lyapunov equation. It's a fixed point equation where the operator corresponding to L is now linear, linear in K, with this coefficient for the linear term and this constant. And so you can visualize these policies in terms of lines, lines that represent the graph of the corresponding linear operator.
So each policy has a line here, and there are some policies that are stable. These are the ones for which this relation holds: (a plus bL) squared is less than 1. Their lines have slope less than 1 and intersect the 45-degree line at some point, which is the solution of the Lyapunov equation.
So that's how you obtain the cost function of a linear policy represented by some line: we intersect the two lines, project down, and this is the cost coefficient K L. Now, this is true for stable policies, which have an intersection, but unstable policies, the ones for which (a plus bL) squared is greater than one, either do not have an intersection or have an intersection that's irrelevant, and for these unstable policies the K L coefficient is infinity. So we have a graphical representation of the solution of the Riccati equation, of value iteration, and also of policies in terms of lines, and now the question is how all these connect. The important fact is that the graph of the Riccati operator is the lower envelope of the policy lines, stable as well as unstable.
So we have all these lines that represent policies, and if you look at the lower envelope, we obtain this red curve. And the proof of this is very simple: F of K, you can calculate, is the minimum over L of these lines.
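In formulas, in the one-dimensional notation used here, the lower envelope fact is

$$
F(K)\;=\;\min_{L}\Bigl[(a+bL)^2K+q+rL^2\Bigr],
$$

with the minimum attained at $L=-\dfrac{abK}{r+b^2K}$, which is exactly the one-step look-ahead gain at $K$.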
And this is precisely what you do in MPC. If you take the cost function approximation K tilde, all these policy lines intersect the vertical line at K tilde, and the one that comes lowest is the one that is tangent to F at the point K tilde. And this is precisely the MPC policy with one-step look ahead.
when the cost function approximation is quadratic with coefficient k tilde. You start with k tilde, go up, draw the tangent, and this gives you the cost coefficient of the MPC policy. Now notice that this cost is obtained by a linearization process.
We basically linearize the Riccati operator at this point and solve the linearized fixed-point problem to obtain K L tilde. And this is the critical property for understanding how Newton's method is involved.
In Newton's method, you have to solve a nonlinear equation: you linearize the equation at the current iterate, solve the linearized equation, and that's the next iterate of Newton's method. So this linearization is a critical part of this graph, and also, for analytical purposes, the concavity of the Riccati operator is important. All of these properties generalize to linear quadratic problems of multiple dimensions and beyond. So let's review the Newton step interpretation with one-step look ahead. At a given state we consider all possible controls that lead to states x k plus 1, and then we take a cost function approximation at the end of this step.
And this part here is the Newton step. It maps the cost approximation K tilde to the cost of the one-step MPC policy. And this relation is superlinear, because Newton's method has superlinear convergence.
So it relates superlinearly the cost function approximation to the MPC performance. Visually, here's the Newton step. We look at the cost function approximation K tilde and linearize at this point. This is the line corresponding to the MPC policy with one-step look ahead, and then at the point of intersection we go down to compute the cost coefficient of the MPC policy. And the mapping that takes K tilde into K L tilde is the Newton step.
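As a small numerical check, with the same illustrative parameters as before (a sketch, not code from the talk): the map from the terminal coefficient K tilde to the cost coefficient of the one-step look-ahead MPC policy coincides with one Newton iteration for solving K minus F of K equals zero.

```python
a, b, q, r = 2.0, 1.0, 1.0, 1.0      # illustrative parameters, not from the talk

def F(K):                             # one-dimensional Riccati operator
    return a**2 * r * K / (r + b**2 * K) + q

def mpc_cost(K_tilde):
    """Cost coefficient of the one-step look-ahead MPC policy built from K_tilde."""
    L = -a * b * K_tilde / (r + b**2 * K_tilde)        # one-step look-ahead gain
    assert (a + b * L)**2 < 1, "K_tilde lies outside the region of stability"
    return (q + r * L**2) / (1 - (a + b * L)**2)       # solve the Lyapunov equation

def newton_step(K_tilde):
    """One Newton iteration for G(K) = K - F(K) = 0, started at K_tilde."""
    dF = a**2 * r**2 / (r + b**2 * K_tilde)**2         # derivative F'(K_tilde)
    return K_tilde - (K_tilde - F(K_tilde)) / (1 - dF)

print(mpc_cost(3.0), newton_step(3.0))   # the two values coincide (about 4.333)
```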
Now, the meaning of superlinear convergence here is that the performance error, the error of the MPC policy to the optimal cost, is much smaller than the approximation error. And in fact, the ratio between these two goes to zero as K tilde approaches K star. And to me, this explains the consistently good performance of MPC in practice. I think this is what makes MPC work. Now let me extend this view to multi-step look-ahead.
Excuse me. Now, the important fact about L-step look ahead is that it is really one-step look ahead on a different cost function approximation: L-step look ahead on K tilde is the same as one-step look ahead applied to K tilde after it has been iterated upon with L minus 1 value iterations. So for example, I'm showing here a two-step look ahead minimization. At the current state we apply one control to get to state k plus 1, then a second control to get to state k plus 2, and then we consider a cost function approximation. Now, the Newton step is just this first stage of look ahead acting on the remainder, which includes a value iteration corresponding to the second look-ahead step. It is only the first step of the look ahead that's the Newton step.
The remainder helps to modify the cost function approximation at the end. So graphically for two-step look-ahead. Here is the line that corresponds to the two-step look-ahead MPC policy. You are at k tilde, you apply a value iteration to get to this k1, remember the staircase structure, and then you apply the Newton step, which brings you to the performance coefficient of the two-step MPC policy. So this is the Newton step.
and this step here enhances the starting point for the Newton step. And now you'll notice that K1 is closer than K tilde to K star, and more generally, multi-step look-ahead brings the starting point of the Newton step closer to K star. That's the key idea behind multi-step look-ahead. This is two-step look-ahead.
If we had 10-step look-ahead, you would have nine steps like this, and then the Newton step. So let's reflect on the importance of the first step of look ahead. Again, this is the same figure as before: L-step look ahead is one step of look ahead on a modified terminal cost function.
The first step of look ahead acts like a Newton step; the remaining L minus 1 steps only serve to enhance the starting point of the Newton step. And there is an important insight that comes from this. The first minimization step of the look-ahead should be done exactly, it should not be approximated, because approximating it would spoil the good convergence properties of Newton's method. On the other hand, the remaining steps of the look-ahead can be done approximately, because that just amounts to an approximation of the effective cost function approximation at the end of the first step.
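As a minimal sketch of this point, in the same illustrative one-dimensional setting: the first-stage gain of L-step look-ahead is the one-step gain computed at the value-iterated coefficient, and only that final minimization plays the role of the Newton step.

```python
def multistep_mpc_gain(K_tilde, ell, a=2.0, b=1.0, q=1.0, r=1.0):
    """First-stage gain of ell-step look-ahead MPC with terminal coefficient K_tilde."""
    K = K_tilde
    for _ in range(ell - 1):                       # ell-1 value iterations on K_tilde
        K = a**2 * r * K / (r + b**2 * K) + q      # Riccati operator F
    return -a * b * K / (r + b**2 * K)             # one-step gain at F^{ell-1}(K_tilde)

print(multistep_mpc_gain(K_tilde=0.5, ell=3))      # first-stage gain; compare with ell=1
```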
Now, this has important implications in stochastic problems, where MPC involves a lot of computation. We can use certainty equivalence to simplify the computations in all look-ahead steps except the first one. This is an idea that has been verified in a number of computational studies going back 25 years.
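In a generic stochastic notation (not used elsewhere in the talk, so take the symbols as assumptions), the idea is to keep the expectation over the disturbance exact only in the first step,

$$
\tilde\mu(x_k)\in\arg\min_{u_k}\;E_{w_k}\Bigl\{g(x_k,u_k,w_k)+\hat J\bigl(f(x_k,u_k,w_k)\bigr)\Bigr\},
$$

where $\hat J$ is computed by deterministic look-ahead in which the future disturbances are fixed at nominal values $\bar w$ (certainty equivalence).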
Reduction in the computational burden is very substantial, and the loss in performance, if any, is very small by doing the certainty equivalence, depending on the problem, of course. So now let me go into issues of stability, the region of stability, and also the rollout and policy iteration algorithms. When you apply MPC with a given cost function approximation k tilde, you're not sure that you're going to get a stable policy.
Sometimes you get a stable policy, sometimes an unstable policy. So what is the region of k tildes for which the MPC policy is stable? Let's take this k tilde.
Going up here, the slope of the corresponding MPC policy line is less than one, so this policy is stable. On the other hand, if you consider this K tilde, the corresponding slope is greater than one, and the MPC policy is unstable. So the horizontal axis of all K is divided into two parts: the stable part, the region of stability, and the unstable part. And we can see that in this one-dimensional case there is a unique scalar here that demarcates the region of stability from the region of instability, and this scalar corresponds to the point where the Riccati operator has slope exactly equal to one. Because the Riccati operator is concave, everything to the right of this point leads to stable policies, and everything to the left leads to unstable policies. Another important point here is that the region of stability expands as the length of the look ahead increases. A long look ahead means several value iterations are applied to K tilde to bring it closer to K star, so the longer the look ahead, the better the chance that at the end of the look ahead you will have entered the region of stability.
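For the one-dimensional problem above, this threshold can be written out explicitly. As a quick check in that notation, the closed-loop coefficient of the one-step look-ahead policy built from K tilde is a r / (r + b^2 K tilde), so

$$
\text{stability}\;\Longleftrightarrow\;\frac{|a|\,r}{r+b^2\tilde K}<1
\;\Longleftrightarrow\;\tilde K>\frac{r\,(|a|-1)}{b^2},
$$

and the threshold value r(|a|-1)/b^2 is exactly the point where the slope F'(K) = a^2 r^2 / (r + b^2 K)^2 equals one; when |a| is at most one, every nonnegative K tilde lies in the region of stability.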
Now, a related issue with stability has to do with the rollout algorithm. The rollout algorithm is the one that uses as K tilde the cost coefficient of some stable linear policy mu. This is called the base policy, and the important fact is that a stable base policy produces a stable rollout policy through a Newton step. The cost coefficient of a stable policy always lies within the region of stability, and when you apply a Newton step to it, you again obtain a stable policy. This new policy is called the rollout policy.
So let me describe here what's happening in terms of the graph. Suppose that you have a stable policy whose line has slope less than 1. Its cost coefficient is right here, and this is the starting point for the Newton step to obtain another policy.
So we linearize here and we go down, and this is the cost obtained after a Newton step starting from here, and this policy is called the rollout policy. The cost of the rollout policy, because of the Newton step, is typically much better than the cost of the base policy. That's an important fact for the success of the rollout method. And this has been verified, by the way, by many, many experiments over the years. Now, what is policy iteration?
Policy iteration is repeated application of rollout. In other words, it's exactly the same as Newton's method applied to solving the Riccati equation. And this is a well-known fact in control theory.
It dates back to 1968 and the PhD thesis of David Kleinman at MIT. Let me also mention that there are several interesting variants of rollout, and also variants of policy iteration. For rollout, you can find literature on truncated rollout, fortified rollout, and simplified rollout.
I'm not going to explain what these mean. They are basically simplifications or modifications of this Newton step, or more precisely of the starting point of the Newton step.
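To make the rollout and policy iteration mechanics concrete, here is a minimal sketch for the one-dimensional linear quadratic case, again with illustrative parameters: policy evaluation solves the Lyapunov equation, policy improvement is the one-step look-ahead, and repeating the two is Newton's method for the Riccati equation.

```python
a, b, q, r = 2.0, 1.0, 1.0, 1.0      # illustrative parameters

def evaluate(L):
    """Cost coefficient K_L of the stable linear policy u = L x (Lyapunov equation)."""
    assert (a + b * L)**2 < 1, "the base policy must be stable"
    return (q + r * L**2) / (1 - (a + b * L)**2)

def improve(K):
    """Rollout / policy improvement: the one-step look-ahead gain for coefficient K."""
    return -a * b * K / (r + b**2 * K)

L = -1.5                              # a stable base policy: |a + b L| = 0.5 < 1
for _ in range(6):                    # policy iteration = repeated Newton steps
    L = improve(evaluate(L))
print(evaluate(L))                    # converges rapidly to K* (about 4.236 here)
```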
So now let me spend some time on applications, with a focus on our work at ASU. First, let me note that the Newton step view of MPC applies very broadly, thanks to the generality of the underlying abstract dynamic programming theory, which I haven't talked about, but which is what underlies the Newton step view. Traditional MPC applications have dealt with continuous state and control problems, such as process control, robotics, aerospace, self-driving cars, and so on. But there are many successful applications involving discrete state and control spaces, and these are the applications that we have primarily been focusing on at ASU.
One type of problem started at ASU when Stephanie Gil was at ASU and continued after she moved to Harvard. It involves multi-agent robotics, such as maintenance and repair with multiple repair persons, taxi fleet management, and so on. Another computational study that's recent involves data association and multi-target tracking, and it has four authors: Placutio Mussinuri, Yuchao Li, James Van Beber, and myself.
There has been work on sequential inference, sequential decoding problems, and the New York Times word puzzle; this is work with two of my students, Siddhant Bambri and another student. And there are now two very recent applications that I'm quite excited about. One involves the problem of generating the most likely sequence in a ChatGPT-type transformer. This is an extremely difficult problem to solve exactly; however, with rollout we have been able to obtain very good approximations at a relatively small computational cost. There is also a very important related class of problems dealing with HMM inference and the Viterbi algorithm, and we have an algorithm, which we call Viterbi rollout, that's much easier to execute, much less computationally demanding than the Viterbi algorithm. This is joint work with Yuchao Li and myself.
And finally, very recently, we have done work on computer chess, providing an MPC architecture for computer chess. This is work with Atharva Gundawar, Yuchao Li, and myself. And I'm going to spend a little bit of time describing this one.
How to use MPC to play better chess. This is done with an architecture which we call MPCMC. MC stands for meta chess.
And as I mentioned earlier chess involves two players. However, we introduce a one-player MPC architecture, where the true opponent is approximated by a nominal opponent. So in this MPC architecture, at the current position we consider all possible legal moves, and then we predict the play of the opponent.
using some computer chess engine, and then we evaluate the resulting position using a position evaluator engine. So this is an architecture that involves existing computer chess engines as components. That's why we call it a meta algorithm.
There is the position evaluator engine, which evaluates positions right over here, and the nominal opponent engine, which predicts the move of the true opponent of MPCMC, either exactly or approximately. What's important is that the nominal opponent is a pretty good player, so even if the prediction is approximate, the predicted move is still good.
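A minimal sketch of the move-selection loop of this architecture is shown below. The Position and Engine interfaces (legal_moves, apply, is_terminal, terminal_value, best_move, evaluate) are hypothetical placeholders for illustration; the actual experiments use engines such as Stockfish through their own protocols rather than this simplified API.

```python
def mpcmc_move(position, nominal_opponent, evaluator):
    """One-step look-ahead move selection against a nominal opponent model."""
    best_move, best_score = None, float("-inf")
    for move in position.legal_moves():                  # exact first step: no pruning
        after_ours = position.apply(move)
        if after_ours.is_terminal():
            score = after_ours.terminal_value()          # e.g., mate or draw score
        else:
            reply = nominal_opponent.best_move(after_ours)        # predicted opponent reply
            score = evaluator.evaluate(after_ours.apply(reply))   # offline-trained evaluation
        if score > best_score:
            best_move, best_score = move, score
    return best_move
```

Since each legal move is scored independently, the loop parallelizes naturally across moves, which is the parallel computation point that comes up later.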
Now each move at any one position involves a Newton step, based on what I've discussed earlier, starting at the position evaluation function. So we expect to get the benefit of the Newton step, and indeed this is so. Let me discuss a little bit the computational results that we have gotten here.
We tested two variants of the algorithm, a standard version and another version which we call fortified; remember fortified rollout, it relates to that. This one is a little bit more conservative. It plays a little better against very strong opponents, a little worse against weaker opponents.
And our computational experiments involve the Stockfish engine. It's a computer engine that's very, very strong. It's considered the top engine by many. It has been the champion engine that is a benchmark for people in computer chess.
And we have run matches between MPCMC that uses Stockfish as both the position evaluator and the nominal opponent and plays against Stockfish. So now here on this column are Stockfish engines of different strength. At 0.5 seconds per move it plays pretty good chess.
At two seconds it plays very, very good chess, very hard to beat, and at five seconds it's almost unbeatable. It plays almost perfect chess. Now here are the results of the matches of MPCMC against these Stockfish engines, using them also within the architecture.
For the case where the nominal opponent engine replicates exactly the moves of the true opponent, we get an advantage. This advantage diminishes in both the standard and the fortified version as the engine becomes stronger, and the fortified version, as we can see, helps against the very strong engines here.
Now, we have also done experiments with an approximately known opponent, where we cannot predict the move of the opponent exactly, and the results are a little worse, but still very good for our architecture. Of course, at the five-second level Stockfish plays almost perfect chess and it's hard to measure the improvement, but we are still getting an improvement. Let me also mention that in all these games MPCMC has not lost a single game; all the games have been either wins or draws. This type of result was also replicated with other engines, such as Komodo Dragon and others. And this is just one-step look ahead. We expect to get better results with multi-step look ahead, but we have not done those experiments yet. We have done some partial experiments, but these experiments are very time-consuming, and we have not had time to conduct them. The other thing to note is that our architecture is very well suited for the use of parallel computation, because there are a lot of engine evaluations here, which increase the move generation time, but everything can be done in parallel.
MPCMC with one-step look ahead requires about twice as much time per move as the engine that it plays against, assuming sufficient parallel computational resources are available. So our paper will be posted soon, and you can find some of the details there. But now let me get to my final slide. The first point I want to make is that the MPC architecture is the right one.
It's the right framework for sequential decision-making and approximate dynamic programming, provided, however, that it makes use of the synergy between online play and offline training. If you use just offline training without online play, it may not work very well, because you will miss out on the Newton step. Online play uses an exact Newton step, not subject to training errors, and can also, parenthetically, deal with changing system parameters, as in adaptive control.
That's the reason why MPC has worked very well with systems with changing parameters in the context of adaptive control: it's because there's an exact Newton step involved here that's not subject to training errors. At the same time, using just online play without offline training misses out on performance, because offline training can produce good starting points for the Newton step and therefore enhance the stability and the performance of the MPC policy. Now, the role of Newton's method in obtaining these insights is central.
It provides new insights that can guide both analysis and algorithmic design. The framework is very general in the sense that it applies to arbitrary state and control spaces, and to discrete and continuous optimization applications, multi-agent problems, and so on. MPC can be very computationally intensive, particularly for longer look ahead and for stochastic problems, but massive parallel computation is possible and can allow more sophisticated online-play strategies.
And in this talk I have tried to bring out the point that the cultural divide between reinforcement learning/AI and control can be bridged, because MPC uses a very similar architecture to AlphaZero and other similar reinforcement learning applications, and can benefit from reinforcement learning ideas. And finally, let me note that the decision and control and MPC community can provide intellectual leadership in the journey ahead, thanks to its mathematical orientation. MPC people use mathematics quite a bit, and in ways that many people in reinforcement learning do not. And the intuition that we can get through mathematics is very useful in delineating the right methods and getting the right insights.
And I think that's where the decision and control and MPC community has an advantage. So this is my last slide, and I thank you very much for your attendance. I hope we have some time for questions.
I kept perfectly within the 45 minutes so that we have ample time for interaction and discussion. I'm having some microphone problems. Can you hear me?
Can you hear me? Your sound comes out a little choppy, I'm afraid. Okay.
Is it better if I only speak into your microphone? Can you hear better now? I can hear you better now, yes. Okay.
So we have an algorithm: if I speak into both microphones it's a bit bad, but I suggest that if people have questions, they come to me and speak into this microphone, and then maybe I repeat the question. Or Dimitri, you can repeat the question.
That's the easiest thing. So I would invite people to come to this microphone that I hold here in my hand to ask your questions. And since I don't see a hand right now, I would like to start with one question. I would like to look again at your beautiful visualization of the linear quadratic regulator problem at the beginning.
I wanted to ask, Dimitri, if you could discuss something: in MPC we also have another function of interest, which we call the value function of the nominal MPC problem. That's basically not the L-minus-one-step look-ahead value function, but the L-step value function. It might be useful to discuss where it would be located in your plot.
And also, another thing: we very often have terminal costs that are already stabilizing. So I wanted to discuss these two cost functions. So if I understand correctly, are you considering a cost function approximation that is itself the cost function of another policy?
No, in our stability analysis we very often have an object which I think doesn't appear in your slides at the moment, and that is the value function, basically the optimal value of the optimal control problem starting at x. I mean, the MPC policy is of course what we apply, and I think your approximation is the closed-loop cost, right? That's the closed-loop cost to go.
And the value function that we have in MPC is not yet in the plot, but I think it's close to, it's basically, I think it's one more value iteration, isn't it? It may be one more Newton step. Can you go to the picture where you show MPC?
If you look at this picture, you have the point that you obtain from MPC, and if you apply a Newton step to that, that would be a second Newton step, a double Newton step, which of course would be extremely powerful. I hope that I understand correctly what you mean.
Yeah, I mean... Yeah, I would like to invite everyone to guess with me, so to say. I mean, if we had a terminal constraint, for example, which is something we sometimes do, like 10 steps ahead and a terminal equality constraint, that would probably correspond to an infinite K value at the beginning, right? That's right, yeah. Yeah.
And then we would... We can move very much to the right, yeah. And then the value iterations would drive us to the left, right? So they are upper approximations of the true cost to go. I just wanted to invite everyone to think about this; I think it's a very fascinating framework to think in. We have this notion of a terminal cost which satisfies the stability conditions, where basically, if you do one step of dynamic programming, you get an improvement.
That's probably in your case everything in this example which is on the right of the k-star, isn't it? These would be all the value functions that by optimizing we can only improve, so to say. Yeah, if you start from the right you can only improve, but that's true only for one-dimensional linear quadratic problems.
For multidimensional linear quadratic problems the region of stability is more complicated, and so one has to be careful about making general statements, but I'm pretty sure that this is a good way to go about it intuitively. And it's generally true that if you're clever about picking the starting point of the Newton step on the right side of K star, then you can be within the region of stability, and then MPC is going to work very well. Yeah, thank you. I think that was exactly the direction I wanted.
Now Alberto Bemporad has a question. Please come to the front and speak into the microphone for Dimitri, and then Dimitri, please repeat the question. Yes, so hi Dimitri, Alberto Bemporad speaking.
Thanks a lot for your very, very instructive and interesting lecture. I have a question regarding the modeling of the opponent. As far as I understood from what you have shown, you have modeled the opponent as a deterministic function of the input.
Very often, in control problems, we model the opponent as a completely stochastic disturbance, with no dependence on the input, which on one hand simplifies computations. Can you comment on the different ways of modeling the opponent, and whether one should go for, say, an in-between modeling, a partially stochastic, partially deterministic function of your moves, and on the consequences, both computational and in terms of optimality?
Right, you're talking about a two-player minimax problem where the second player is approximated by a stochastic prediction. Yes. And okay, this is certainly something that we have thought about and in fact something that we have implemented in the context of chess.
Instead of having a deterministic opponent, we would have an engine that produces a probability distribution over moves, which is then used to guide the further selections. It didn't work very well in the context of chess, but that doesn't mean that it's not going to work well in other contexts. It's an interesting subject for investigation that we haven't thought about very much.
Isn't that the reason why it doesn't perform well against bad players? Because you have a deviation from the model that you have made of the opponent, which is a very good player. Yeah, that's a good point. What should the quality of the nominal opponent be?
The nominal opponent should not be bad. Even if the actual player is bad, the nominal opponent should not be a bad player, a poor player. It should be a good player because otherwise it can make bad mistakes that the opponent can exploit perhaps by accident or even deliberately.
So it's very important to have a good nominal opponent that approximates a minimax type of view of the problem. Okay, thanks. So that's why in our experiments we have used a very strong engine to model the nominal opponent, namely the Stockfish engine.
Well, on the other hand, this is our experience from chess, and it may not generalize to other problems. So we are looking forward to experimenting with other situations, and I also look forward to seeing what other people come up with. The next question will be by Gabriele Pannocchia.
Hi Dimitri, thank you very much for your very nice presentation. One clarification, can you go to the next slide? Because I'm not familiar with the numbers that you put in the table, could you explain what, for example, 7.5 dash 2.5 means, et cetera? I kind of figured it out, but I want to make sure that I got the right message. Thank you.
Okay. So in chess, it's standard to score one point for a win, half a point for a draw, and zero for a loss. So the result of a 10-game match here was 7.5 against 2.5, which translated to five wins and five draws in the 10-game match. And this one would correspond to six wins and four draws, and so on.
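With the standard scoring just described, the arithmetic behind that entry is

$$
5\times 1+5\times 0.5=7.5
\qquad\text{versus}\qquad
5\times 0+5\times 0.5=2.5 .
$$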
Incidentally this number here is a 20 game match. The reason is that at five seconds Stockfish plays just perfect chess, almost perfect chess. And we had to play more than 10 games to get a win. We never got a loss, but in order to get a win against this very strong opponent, we needed to go to more than 10 games. Does that answer your question about the meaning of the numbers in this table?
Yes, I think it was answered. Perfect. Maybe I have another question on this chess.
Do you know what's happening inside this engine? Doesn't it do maybe similar algorithms as you? I mean, it's like somehow, do you prolong, so to say, its horizon length in some sense? Could you comment on what's inside the Stockfish engine? Do they also, I mean, they must use some related algorithms, I guess.
Yeah, it's clear to me that you're a chess player, and you understand what's going on in these engines. Indeed, what we do is extend the length of the look ahead, but it's a key point here. Many chess engines, most chess engines, use pruning of the look ahead tree.
And what's important about the Newton step is that the first step is exact. In other words, we don't prune anything right here. And I think that's an important part of our success, of our observed success.
So, I don't know whether existing chess engines can be modified to incorporate MPC ideas. Perhaps they can, but that's what we have done. And we think that the key part is that this first step is exact, with no legal moves being pruned, considered inferior, and discarded. Thank you.
That's a very generic argument to improve any policy that someone gives you, right? That's right. Thank you for this comment.
I should say that the ideas here apply generally to MPC. In the context of chess, they have a special meaning in terms of the search mechanism of the chess engines involved. Okay, so I suggest I switch on the big microphone because now I'm mostly speaking to the audience. I suggest that we thank Dimitri for a very inspiring talk and also for answering our questions, and then we go into the coffee break now.
So let's give a big applause again to Dimitri. Thank you very much. Thank you very much. It's my pleasure to give this talk and I'm sorry I can't be with you. Thank you very much, Dimitri.
We just gave an applause. Oh, I see. I didn't hear you.
Bye-bye. Bye-bye.