Modern game theory is rapidly merging with artificial intelligence, forming a new framework of data-driven game theory. Game theory intersects with computer science in many areas:
- Theoretical Computer Science: Algorithmic Game Theory
- Artificial intelligence: multi-agent systems, game AI, human-computer interaction, machine learning, advertising and recommendation, etc.
- Internet: the Internet economy and the sharing economy.
- Distributed systems: blockchain.
The combination of artificial intelligence and game theory has formed two main research directions: 1. solving game strategies and equilibria; 2. designing game rules.
Game theory provides mathematical models for many problems, and Nash's theorem guarantees the existence of a solution (an equilibrium) to such games. Artificial intelligence methods can then be used to compute equilibria or optimal strategies.
The central research problem is: how to efficiently compute the players' strategies and the equilibrium of the game.
Its application areas mainly include:
- Solving games with very large search spaces: Go.
- Solving imperfect-information games: Texas Hold'em.
- Game AI for online battle games: Dota, StarCraft.
- Equilibrium computation in dynamic games: competition among manufacturers, information security.
Regret Minimization algorithm:
We first define the notation used in the regret minimization (RM) algorithm:
- Suppose there are $N$ players in total. The strategy adopted by player $i$ is denoted $\sigma_i$.
- For each information set $I_i$, $\sigma_i(I_i)$ is a probability distribution over the action set $A(I_i)$. Player $i$'s strategy space is denoted $\Sigma_i$.
- A strategy profile $\sigma = (\sigma_1, \sigma_2, \dots, \sigma_N)$ contains the strategies of all players.
- $\sigma_{-i}$ denotes the strategies of all players other than player $i$ (i.e., the profile $\sigma$ with player $i$'s strategy removed).
- In a game, different players take different actions at different times according to their strategies. Under the strategy profile $\sigma$, the probability that an action sequence (history) $h$ occurs is written $\pi^{\sigma}(h)$. It factors as $\pi^{\sigma}(h) = \prod_{i=1}^{N} \pi_i^{\sigma}(h)$, where $\pi_i^{\sigma}(h)$ is the probability that player $i$'s strategy $\sigma_i$ contributes toward the action sequence $h$ occurring. The probability contributed by all players other than player $i$ through their respective strategies can be written as $\pi_{-i}^{\sigma}(h) = \prod_{j \neq i} \pi_j^{\sigma}(h)$.
- For each player $i$, $u_i$ denotes player $i$'s payoff function: for a terminal sequence $z$ in the set of terminal sequences $Z$, $u_i(z)$ is the payoff player $i$ receives.
- Given a strategy profile $\sigma$, player $i$'s expected payoff can be computed as $u_i(\sigma) = \sum_{z \in Z} \pi^{\sigma}(z)\, u_i(z)$ (see the sketch after this list).
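To make the expected-payoff formula concrete, here is a minimal Python sketch for a two-player normal-form game, using the rock-paper-scissors payoffs from the example later in this post; the names `ACTIONS`, `PAYOFF_1`, and `expected_payoff` are illustrative, not from any particular library.

```python
import itertools

# Rock-paper-scissors actions and player 1's payoff at each terminal outcome
# z = (a1, a2); the game is zero-sum, so player 2's payoff is the negation.
ACTIONS = ["R", "P", "S"]
PAYOFF_1 = {
    ("R", "R"): 0, ("R", "P"): -1, ("R", "S"): 1,
    ("P", "R"): 1, ("P", "P"): 0,  ("P", "S"): -1,
    ("S", "R"): -1, ("S", "P"): 1, ("S", "S"): 0,
}

def expected_payoff(sigma_1, sigma_2):
    """u_1(sigma) = sum over terminals z of pi^sigma(z) * u_1(z)."""
    return sum(
        sigma_1[a1] * sigma_2[a2] * PAYOFF_1[(a1, a2)]
        for a1, a2 in itertools.product(ACTIONS, ACTIONS)
    )

# Uniform random play by both players yields expected payoff 0.
uniform = {a: 1 / 3 for a in ACTIONS}
print(expected_payoff(uniform, uniform))  # 0.0
```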
Best response strategy and Nash equilibrium
Let's look at the best response strategy and Nash equilibrium used in regret minimization.
- Player $i$'s best response strategy $\sigma_i^*$ to the other players' strategies $\sigma_{-i}$ satisfies the following condition (a small sketch computing a best response is given after this list):
$u_i(\sigma_i^*, \sigma_{-i}) \geq u_i(\sigma_i', \sigma_{-i})$ for all $\sigma_i' \in \Sigma_i$.
That is, with the other players' strategies held fixed, the payoff player $i$ obtains by adopting any other strategy is no greater than the payoff obtained by adopting the best response.
- In a strategy profile, if every player's strategy is a best response to the other players' strategies, then the strategy profile $\sigma$ is a **Nash equilibrium**.
- Nash equilibrium: the strategy profile $\sigma = (\sigma_1, \dots, \sigma_N)$ is a Nash equilibrium if and only if, for each player $i$, the following condition holds:
$u_i(\sigma_i, \sigma_{-i}) \geq u_i(\sigma_i', \sigma_{-i})$ for all $\sigma_i' \in \Sigma_i$.
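As a small illustration of the best-response definition, the sketch below computes a pure best response for player 1 against a fixed mixed strategy of player 2 in rock-paper-scissors (in a finite game a pure best response always attains the maximum); `best_response` and `PAYOFF_1` are illustrative names.

```python
ACTIONS = ["R", "P", "S"]

# Player 1's payoff at each joint action (rock-paper-scissors, zero-sum).
PAYOFF_1 = {
    ("R", "R"): 0, ("R", "P"): -1, ("R", "S"): 1,
    ("P", "R"): 1, ("P", "P"): 0,  ("P", "S"): -1,
    ("S", "R"): -1, ("S", "P"): 1, ("S", "S"): 0,
}

def best_response(opponent_strategy):
    """Pure best response of player 1 to a fixed mixed strategy of player 2."""
    def expected_value(a1):
        return sum(opponent_strategy[a2] * PAYOFF_1[(a1, a2)] for a2 in ACTIONS)
    return max(ACTIONS, key=expected_value)

# Against an opponent who plays rock too often, paper is the best response.
print(best_response({"R": 0.5, "P": 0.25, "S": 0.25}))  # 'P'
```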
$\varepsilon$-Nash equilibrium and average regret value
- $\varepsilon$-Nash equilibrium: for a given positive real number $\varepsilon$, the strategy profile $\sigma$ is an $\varepsilon$-Nash equilibrium if and only if, for each player $i$, the following condition holds:
$u_i(\sigma) + \varepsilon \geq \max_{\sigma_i' \in \Sigma_i} u_i(\sigma_i', \sigma_{-i})$.
- Average regret value (average overall regret): suppose the game can be played repeatedly (as in Go and similar games). Let $\sigma^t$ denote the strategy profile used in the $t$-th game. After $T$ games have been played, player $i$'s average regret is defined as (a hypothetical helper computing it is sketched after this list):
$R_i^T = \frac{1}{T} \max_{\sigma_i^* \in \Sigma_i} \sum_{t=1}^{T} \left( u_i(\sigma_i^*, \sigma_{-i}^t) - u_i(\sigma^t) \right)$
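A hypothetical helper for the average overall regret defined above, assuming the player has a finite action set (so the maximum over $\Sigma_i$ is attained at a pure action); the name `average_overall_regret` and the input format are assumptions made for illustration.

```python
def average_overall_regret(action_payoffs, realized_payoffs):
    """R_i^T = (1/T) * max_a sum_t [ u_i(a, sigma_{-i}^t) - u_i(sigma^t) ].

    action_payoffs[a][t] -- payoff player i would have received by playing the
                            fixed action a in round t, against the opponents'
                            actual round-t strategies
    realized_payoffs[t]  -- payoff player i actually received in round t
    """
    T = len(realized_payoffs)
    return max(
        sum(action_payoffs[a][t] - realized_payoffs[t] for t in range(T)) / T
        for a in action_payoffs
    )

# Two rounds of the RPS example below (player A): realized payoffs -1 and -1;
# a fixed rock would have scored -1 then 0, paper 0 then 1, scissors 1 then -1.
print(average_overall_regret(
    {"R": [-1, 0], "P": [0, 1], "S": [1, -1]}, [-1, -1]
))  # 1.5 (attained by always playing paper)
```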
Strategy choice
- The regret minimization algorithm determines future action choices based on the regrets incurred in past games.
- In the game, player $i$'s regret for a strategy $\sigma_i'$ after $T$ rounds (each round is one complete game) is defined as the cumulative regret:
$\text{Regret}_i^T(\sigma_i') = \sum_{t=1}^{T} \left( u_i(\sigma_i', \sigma_{-i}^t) - u_i(\sigma^t) \right)$
- Generally, a strategy with a negative regret value is regarded as unable to improve the return at the next step, so only nonnegative regret values are kept:
$\text{Regret}_i^{T,+}(\sigma_i') = \max\left( \text{Regret}_i^T(\sigma_i'),\ 0 \right)$
Having counted player $i$'s regret for each strategy after round $T$, player $i$ selects action $a$ in round $T+1$ with the following probability (the greater the regret, the more likely the choice — in the spirit of "better late than never"); a minimal sketch of this update follows:
$\sigma_i^{T+1}(a) = \begin{cases} \dfrac{\text{Regret}_i^{T,+}(a)}{\sum_{a' \in A} \text{Regret}_i^{T,+}(a')} & \text{if } \sum_{a' \in A} \text{Regret}_i^{T,+}(a') > 0 \\[6pt] \dfrac{1}{|A|} & \text{otherwise} \end{cases}$
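Below is a minimal sketch of the regret-matching update just described, assuming cumulative regrets are stored in a plain dictionary; `regret_matching_strategy` is an illustrative name.

```python
def regret_matching_strategy(cumulative_regret):
    """Turn cumulative regrets into the next round's mixed strategy.

    Negative regrets are clipped to 0; if no action has positive regret,
    fall back to the uniform distribution.
    """
    positive = {a: max(r, 0.0) for a, r in cumulative_regret.items()}
    total = sum(positive.values())
    if total > 0:
        return {a: r / total for a, r in positive.items()}
    n = len(cumulative_regret)
    return {a: 1.0 / n for a in cumulative_regret}

# After round 1 of the RPS example below, player A's regrets for
# (rock, paper, scissors) are (0, 1, 2):
print(regret_matching_strategy({"R": 0.0, "P": 1.0, "S": 2.0}))
# {'R': 0.0, 'P': 0.333..., 'S': 0.666...}
```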
Rock-Paper-Scissors (RPS) example
- Suppose two players $A$ and $B$ play rock-paper-scissors (RPS): the winner earns 1 point, the loser earns $-1$ point, and a tie earns both players 0 points.
- In the first game, if player $A$ plays rock ($R$) and player $B$ plays paper ($P$), then player $A$'s payoff is $u_A(R, P) = -1$ and player $B$'s payoff is $u_B(R, P) = +1$.
- For player $A$, given that player $B$ played paper ($P$): if player $A$ had instead chosen paper ($P$) or scissors ($S$), player $A$'s corresponding payoff would have been $u_A(P, P) = 0$ or $u_A(S, P) = +1$.
- So after the first game, player $A$'s regret for not having played paper is $u_A(P, P) - u_A(R, P) = 0 - (-1) = 1$, and the regret for not having played scissors is $u_A(S, P) - u_A(R, P) = 1 - (-1) = 2$ (the regret for rock, the action actually played, is $0$).
- Therefore, by regret matching, player $A$ plays rock, paper, and scissors in the second game with probabilities $0$, $1/3$, and $2/3$, respectively, so player $A$ tends to play scissors in the second game.
- Suppose that in the first game player $A$ plays rock and player $B$ plays paper, and in the second game player $A$ plays scissors and player $B$ plays rock. Player $A$'s regret values in each game and the cumulative regrets after the second game are as follows:

| Regret of player $A$ | Rock | Paper | Scissors |
| --- | --- | --- | --- |
| Game 1 ($A$: rock, $B$: paper) | 0 | 1 | 2 |
| Game 2 ($A$: scissors, $B$: rock) | 1 | 2 | 0 |
| Cumulative after game 2 | 1 | 3 | 2 |

- From the table above, in the third game player $A$ plays rock, paper, and scissors with probabilities $1/6$, $3/6$, and $2/6$, respectively.
- In practice, by accumulating regret values over many simulated games (iterations), each player's strategy can be driven toward an optimal strategy; a self-play sketch is given after this list.
- However, when the game's state space grows exponentially, the regret minimization algorithm by itself cannot handle such a large game tree.
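Putting the pieces together, here is a self-play sketch of regret matching on rock-paper-scissors, as mentioned above: both players repeatedly accumulate regrets and play the regret-matching strategy. Because RPS is a two-player zero-sum game, the *average* strategies approach the uniform Nash equilibrium $(1/3, 1/3, 1/3)$. Function and variable names (`payoff`, `self_play`, etc.) are illustrative.

```python
import random

# Self-play regret matching in rock-paper-scissors: a minimal sketch of the
# "accumulate regret over many simulated games" idea.
ACTIONS = ["R", "P", "S"]

def payoff(a, b):
    """Payoff of the player choosing action index a against action index b."""
    if a == b:
        return 0
    return 1 if (a - b) % 3 == 1 else -1  # R beats S, P beats R, S beats P

def regret_matching(cum_regret):
    """Normalize the positive part of the cumulative regrets into a strategy."""
    positive = [max(r, 0.0) for r in cum_regret]
    total = sum(positive)
    if total > 0:
        return [r / total for r in positive]
    return [1.0 / len(cum_regret)] * len(cum_regret)

def self_play(rounds=200_000, seed=0):
    rng = random.Random(seed)
    cum_regret = [[0.0] * 3, [0.0] * 3]
    strategy_sum = [[0.0] * 3, [0.0] * 3]
    for _ in range(rounds):
        strategies = [regret_matching(cum_regret[p]) for p in (0, 1)]
        actions = [rng.choices(range(3), weights=strategies[p])[0] for p in (0, 1)]
        for p in (0, 1):
            me, opp = actions[p], actions[1 - p]
            realized = payoff(me, opp)
            for a in range(3):
                # Regret of not having played the fixed action a this round.
                cum_regret[p][a] += payoff(a, opp) - realized
                strategy_sum[p][a] += strategies[p][a]
    # The average strategies converge toward the Nash equilibrium (1/3, 1/3, 1/3).
    return [[s / rounds for s in strategy_sum[p]] for p in (0, 1)]

print(self_play())
```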