Linear programming solution process
In the previous section we introduced the formula for inverse reinforcement learning; now we are going to solve it. The problem is essentially a linear program, but the formula in its current form is not convenient to solve directly, so we transform it by variable substitution into a standard linear-programming form. Two substitutions are involved: an auxiliary vector $t$ that stands in for the inner minimum over non-expert actions, and an auxiliary vector $u$ that stands in for the absolute values coming from the $L_1$ penalty:

$$t_i = \min_{a \in A \setminus \{a_1\}} \big(P_{a_1}(i) - P_a(i)\big)\big(I - \gamma P_{a_1}\big)^{-1} R, \qquad u_i = |R_i|,$$

where $P_{a_1}(i)$ denotes the $i$-th row of the transition matrix under the expert action $a_1$.
Then we can get:

$$\max_{R,\,t,\,u}\ \sum_{i=1}^{N} t_i - \lambda \sum_{i=1}^{N} u_i$$

s.t.

$$t_i \le \big(P_{a_1}(i) - P_a(i)\big)\big(I - \gamma P_{a_1}\big)^{-1} R, \qquad \forall a \in A \setminus \{a_1\},\ i = 1, \dots, N,$$

$$\big(P_{a_1} - P_a\big)\big(I - \gamma P_{a_1}\big)^{-1} R \succeq 0, \qquad \forall a \in A \setminus \{a_1\},$$

$$-u_i \le R_i \le u_i, \qquad |R_i| \le R_{\max}, \qquad i = 1, \dots, N.$$
Once the formula is in this form, we are ready to solve it. A solution to this problem can be found in the project https://github.com/MatthewJA/Inverse-Reinforcement-Learning; the specific implementation is the irl function in irl/linear_irl.py. Let's see how it works.
The code uses the cvxopt library, which implements common convex-optimization algorithms, including a linear-programming solver. To solve the problem with this library, we need to arrange it into the following form:
$$\min_{x}\ c^{T} x$$

s.t.

$$A x \preceq b$$
where $A$ is a matrix and $b$ and $c$ are vectors. Finally, we pass the three parts $A$, $b$ and $c$ to the solver function, and the linear-programming algorithm in the library does the rest.
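As a minimal sketch of what such a call looks like (the toy numbers below are made up purely for illustration; note that cvxopt's own argument names for the inequality part are G and h rather than A and b):

```python
import numpy as np
from cvxopt import matrix, solvers

# Toy LP purely for illustration: minimise x1 + x2
# subject to x1 >= 1 and x2 >= 1, rewritten as -x <= -1 to fit A x <= b.
c = matrix(np.array([1.0, 1.0]))
A = matrix(np.array([[-1.0, 0.0],
                     [0.0, -1.0]]))
b = matrix(np.array([-1.0, -1.0]))

solvers.options["show_progress"] = False  # silence the solver's progress log
solution = solvers.lp(c, A, b)            # cvxopt calls these arguments (c, G, h)
print(list(solution["x"]))                # roughly [1.0, 1.0]
```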
So how do we get our formula into this form? For convenience of computation we stack the three groups of variables $R$, $t$ and $u$, each of length $N$ (the number of states), into one vector $x = [R;\,t;\,u]$, so the objective function becomes:

$$\min_{R,\,t,\,u}\ -\sum_{i=1}^{N} t_i + \lambda \sum_{i=1}^{N} u_i,$$

that is, $c = \big(\mathbf{0}_N^{T},\ -\mathbf{1}_N^{T},\ \lambda\,\mathbf{1}_N^{T}\big)^{T}$.
Following this form, and writing $D_a = (P_{a_1} - P_a)(I - \gamma P_{a_1})^{-1}$ for brevity, we can rewrite the constraints as stacked blocks acting on $x$. Each block has dimension $(n \times 3N)$; the row count $n$ differs between blocks, but the column count $3N$ is always the same:

$$\begin{aligned}
[-D_a,\ I,\ 0]\,x &\le 0, \qquad \forall a \in A \setminus \{a_1\},\\
[-D_a,\ 0,\ 0]\,x &\le 0, \qquad \forall a \in A \setminus \{a_1\},\\
[\,I,\ 0,\ -I\,]\,x &\le 0, \qquad [-I,\ 0,\ -I\,]\,x \le 0,\\
[\,I,\ 0,\ 0\,]\,x &\le R_{\max}\mathbf{1}, \qquad [-I,\ 0,\ 0\,]\,x \le R_{\max}\mathbf{1}.
\end{aligned}$$
The formula above is still not fully detailed; the complete layout is shown in Figure 14-1.
By organizing the data according to Figure 14-1, we can run the solver and obtain the reward vector.
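To make the assembly concrete, here is a rough sketch of building $c$, $A$ and $b$ from an expert policy and its transition matrices and then solving for $R$. It follows the block layout above, but the function name solve_lp_irl and its argument layout are my own invention for illustration; the reference implementation is the irl function in irl/linear_irl.py of the repository mentioned earlier.

```python
import numpy as np
from cvxopt import matrix, solvers

def solve_lp_irl(transition, policy, gamma, r_max, lam):
    """Recover a reward vector R from an expert policy (illustrative sketch).

    transition: array of shape (n_actions, n_states, n_states),
                transition[a, i, j] = P(j | i, a).
    policy:     length-n_states array with the expert's action in each state.
    """
    transition = np.asarray(transition)
    policy = np.asarray(policy)
    n_actions, n_states, _ = transition.shape
    # Transition matrix of the expert policy: its i-th row is P_{a1}(i).
    p_star = np.array([transition[policy[i], i] for i in range(n_states)])
    inv = np.linalg.inv(np.eye(n_states) - gamma * p_star)

    I, Z = np.eye(n_states), np.zeros((n_states, n_states))
    A_blocks, b_blocks = [], []
    for a in range(n_actions):
        d_a = (p_star - transition[a]) @ inv       # the D_a block from the text
        rows = policy != a                         # states where a is non-expert
        if not rows.any():
            continue
        # t <= D_a R   ->   [-D_a, I, 0] x <= 0
        A_blocks.append(np.hstack([-d_a[rows], I[rows], Z[rows]]))
        b_blocks.append(np.zeros(rows.sum()))
        # D_a R >= 0   ->   [-D_a, 0, 0] x <= 0
        A_blocks.append(np.hstack([-d_a[rows], Z[rows], Z[rows]]))
        b_blocks.append(np.zeros(rows.sum()))
    # -u <= R <= u   and   |R_i| <= R_max
    A_blocks += [np.hstack([I, Z, -I]), np.hstack([-I, Z, -I]),
                 np.hstack([I, Z, Z]), np.hstack([-I, Z, Z])]
    b_blocks += [np.zeros(n_states), np.zeros(n_states),
                 r_max * np.ones(n_states), r_max * np.ones(n_states)]

    A = np.vstack(A_blocks)
    b = np.concatenate(b_blocks)
    # Objective: minimise -sum(t) + lam * sum(u), i.e. c = [0, -1, lam].
    c = np.concatenate([np.zeros(n_states), -np.ones(n_states),
                        lam * np.ones(n_states)])

    solvers.options["show_progress"] = False
    sol = solvers.lp(matrix(c), matrix(A), matrix(b))
    x = np.array(list(sol["x"]))
    return x[:n_states]   # the first N entries of x = [R; t; u] are the reward
```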
A practical example
Having looked at the algorithm, let's look at a concrete example: Grid World. Grid World is a very simple problem: given an N by N board, each grid cell is a place the player can stand. As shown in the figure, we use a 5 by 5 board as an example.
Standing on different cells of the board has different effects. The cell at the bottom-right corner is the bonus cell: standing on it earns one point and ends the game. Standing on any other cell earns no reward, and the game continues.
In the initial state, the player is placed on a random cell. At each subsequent time step the player decides to move in one of four directions: up, down, left or right. But the move does not always go as planned: the game has a "random wind" that triggers with a certain probability. When it triggers, the player's movement is out of their control and they move in a random direction instead of the one they chose. If the player bumps into the edge of the board, they are blocked and stay where they are.
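To pin the rules down, here is a small sketch of one step of these dynamics. The 5×5 size matches the example; the wind probability is not specified in the text, so the value below is purely illustrative, as are the function and constant names:

```python
import numpy as np

N = 5                                          # board size from the example
WIND = 0.3                                     # illustrative wind probability
ACTIONS = [(-1, 0), (1, 0), (0, -1), (0, 1)]   # up, down, left, right

def step(state, chosen_action, rng):
    """One move on the board; returns (next_state, reward, done)."""
    row, col = divmod(state, N)
    # With probability WIND the wind overrides the player's choice
    # and the move happens in a uniformly random direction instead.
    action = rng.integers(len(ACTIONS)) if rng.random() < WIND else chosen_action
    dr, dc = ACTIONS[action]
    # Bumping into the edge of the board leaves the player in place.
    row = min(max(row + dr, 0), N - 1)
    col = min(max(col + dc, 0), N - 1)
    next_state = row * N + col
    # Only the bottom-right cell scores, and reaching it ends the game.
    if next_state == N * N - 1:
        return next_state, 1.0, True
    return next_state, 0.0, False

print(step(0, 3, np.random.default_rng(0)))    # e.g. try to move right from the top-left
```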
The rules of the game are not complicated. From our point of view, since the bottom-right corner is the only scoring cell, the strategy is simply to keep moving toward it; if we are already there, we just stay put.
Now let's think about it the other way round: suppose we do not know the scoring rules of the game, and all we can observe are the actions of an experienced player (for such a simple game, they do not need to be very experienced). Can we infer the scoring rules from those actions? That is inverse reinforcement learning on Grid World.
Since the problem is small, we can represent the policy, the reward and the value vectors as tables. Running the algorithm above, we obtain the final result shown in the figure.
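A rough end-to-end sketch of this experiment, reusing the step dynamics and the solve_lp_irl helper sketched above, might look like the following; the discount factor, L1 weight and wind probability are illustrative choices, not values taken from the text:

```python
def build_transition():
    """P[a, s, s'] for the 5x5 grid with the random wind described above
    (for simplicity, the bonus cell is not made absorbing here)."""
    P = np.zeros((len(ACTIONS), N * N, N * N))
    for s in range(N * N):
        row, col = divmod(s, N)
        for chosen in range(len(ACTIONS)):
            for actual, (dr, dc) in enumerate(ACTIONS):
                # Chosen direction happens with prob 1 - WIND + WIND/4,
                # every other direction with prob WIND/4.
                p = (1 - WIND if actual == chosen else 0.0) + WIND / len(ACTIONS)
                nr = min(max(row + dr, 0), N - 1)
                nc = min(max(col + dc, 0), N - 1)
                P[chosen, s, nr * N + nc] += p
    return P

def greedy_policy(P, reward, gamma, iters=500):
    """Value iteration; returns the 'experienced player's' greedy policy."""
    V = np.zeros(P.shape[1])
    for _ in range(iters):
        Q = P @ (reward + gamma * V)    # shape (n_actions, n_states)
        V = Q.max(axis=0)
    return Q.argmax(axis=0)

true_reward = np.zeros(N * N)
true_reward[-1] = 1.0                   # one point on the bottom-right cell
P = build_transition()
expert_policy = greedy_policy(P, true_reward, gamma=0.9)
recovered = solve_lp_irl(P, expert_policy, gamma=0.9, r_max=1.0, lam=0.1)
print(recovered.reshape(N, N).round(2)) # compare against true_reward
```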
As can be seen from the figure, the difference between the reward we recovered and the true reward is very small; at the very least, the difference does not prevent us from selecting the right strategy. That completes our introduction to this problem.