Markov process-Markov reward process-Markov decision process
Overview
Markov Decision Processes (MDPs) are a mathematical formalization of reinforcement learning problems.
- The environment is required to be fully observable.
Markov property
"The future is independent of the past given the present: once the current state is known, all earlier information can be thrown away."
Definition: the state $S_t$ at time $t$ is called a Markov state (i.e., it satisfies the Markov property) if and only if $\mathbb{P}[S_{t+1} \mid S_t] = \mathbb{P}[S_{t+1} \mid S_1, \dots, S_t]$
Note:
- The state contains all relevant historical information, i.e., everything that happened before is summarized in the current state (it can replace all previous states).
- Therefore, the environment is required to be fully observable (if it is only partially observable, some state information is missing).
- Whether the Markov property is satisfied is closely related to how the state is defined.
Example:
- Playing chess (the current board position summarizes the whole game history)
- Tetris (the current stack plus the falling piece determines everything that follows)
With a Markov state:
- a state transition matrix can be defined
- we can ignore the history and focus only on the present moment
Note: whether a state satisfies the Markov property is closely related to how the state is defined.
State transition matrix
State transition probability
The state transition probability is the probability of moving from a Markov state $s$ to a successor state $s'$; it is a conditional distribution given the current state: $\mathcal{P}_{ss'} = \mathbb{P}[S_{t+1} = s' \mid S_t = s]$
State transition matrix
If the states are discrete (finite):
- the current states index the rows
- the successor states index the columns

giving the state transition matrix (a small numeric sketch follows below)
$$\mathcal{P} = \begin{pmatrix} \mathcal{P}_{11} & \cdots & \mathcal{P}_{1n} \\ \vdots & \ddots & \vdots \\ \mathcal{P}_{n1} & \cdots & \mathcal{P}_{nn} \end{pmatrix}$$
- $n$ is the number of states
- each row sums to 1
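A minimal sketch (assuming NumPy and a hypothetical 3-state chain chosen only for illustration): the transition matrix is an $n \times n$ array whose rows are probability distributions.

```python
import numpy as np

# Hypothetical 3-state Markov chain: rows = current state, columns = successor state.
P = np.array([
    [0.9, 0.1, 0.0],   # from state 0
    [0.0, 0.5, 0.5],   # from state 1
    [0.0, 0.0, 1.0],   # from state 2 (absorbing)
])

# Each row is a conditional distribution P[S_{t+1} = s' | S_t = s], so rows must sum to 1.
assert np.allclose(P.sum(axis=1), 1.0)

# Probability of being in each state after 5 steps, starting from state 0.
mu0 = np.array([1.0, 0.0, 0.0])
print(mu0 @ np.linalg.matrix_power(P, 5))
```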
State transition function
If the number of states is very large, or infinite (a continuous state space), it is more appropriate to use the function form of the transition probability given above: $\mathcal{P}(s' \mid s) = \mathbb{P}[S_{t+1} = s' \mid S_t = s]$
- At this point the row normalization becomes an integral, $\int_{s'} \mathcal{P}(s' \mid s)\,\mathrm{d}s' = 1$
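A minimal sketch of a transition function for a continuous state space (assuming a simple Gaussian random-walk dynamic, chosen only for illustration):

```python
import numpy as np
from scipy.stats import norm
from scipy.integrate import quad

def transition_density(s_next: float, s: float, sigma: float = 0.5) -> float:
    """p(s' | s): density of the next state under a Gaussian random walk."""
    return norm.pdf(s_next, loc=s, scale=sigma)

# The analogue of "each row sums to 1": the density integrates to 1 over s'.
total, _ = quad(lambda s_next: transition_density(s_next, s=0.0), -np.inf, np.inf)
print(total)  # ~1.0
```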
Markov process
Definition
A Markov process (MP) is a memoryless random process, that is, a sequence of Markov states.
A Markov process can be defined by a two-tuple $\langle \mathcal{S}, \mathcal{P} \rangle$
- $\mathcal{S}$: the set of states
- $\mathcal{P}$: the state transition matrix
We usually assume that $\mathcal{P}$ exists and is stationary; when it is non-stationary, methods such as online learning and fast learning are adopted instead.
An example of a Markov process
- A Markov process can terminate in two ways:
- the time horizon runs out
- a terminal state is reached
Episode
Definition: in reinforcement learning, an episode is the sequence of states from the initial state $S_1$ to the terminal state $S_T$.
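A minimal sketch of sampling an episode from a Markov process, reusing the hypothetical transition matrix above and treating state 2 as the terminal state (both are assumptions for illustration):

```python
import numpy as np

rng = np.random.default_rng(0)

P = np.array([
    [0.9, 0.1, 0.0],
    [0.0, 0.5, 0.5],
    [0.0, 0.0, 1.0],
])
TERMINAL = 2

def sample_episode(start: int, max_steps: int = 100) -> list[int]:
    """Roll the chain forward from `start` until the terminal state or a step limit."""
    state, episode = start, [start]
    for _ in range(max_steps):
        if state == TERMINAL:
            break
        state = rng.choice(len(P), p=P[state])
        episode.append(state)
    return episode

print(sample_episode(0))  # e.g. [0, 0, 1, 2]
```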
Markov reward process
Definition
A Markov reward process is obtained from a Markov process by attaching reward values to the transition structure.
A Markov Reward Process (MRP) is a four-tuple $\langle \mathcal{S}, \mathcal{P}, \mathcal{R}, \gamma \rangle$
- $\mathcal{S}$: the set of states
- $\mathcal{P}$: the state transition matrix
- $\mathcal{R}$: the reward function, $\mathcal{R}_s = \mathbb{E}[R_{t+1} \mid S_t = s]$, describing the expected reward received in state $s$
- $\gamma$: the discount factor, $\gamma \in [0, 1]$
Return
- Reward: an evaluation of a single state
- Return: an evaluation of an entire episode
The return $G_t$ is the cumulative discounted reward starting from time $t$: $G_t = R_{t+1} + \gamma R_{t+2} + \gamma^2 R_{t+3} + \cdots = \sum_{k=0}^{\infty} \gamma^k R_{t+k+1}$
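A minimal sketch of computing a discounted return from a sequence of rewards (the reward values are made up for illustration):

```python
def discounted_return(rewards: list[float], gamma: float = 0.9) -> float:
    """G_t = R_{t+1} + gamma * R_{t+2} + gamma^2 * R_{t+3} + ..."""
    g = 0.0
    # Accumulate from the back so each step is g = r + gamma * g.
    for r in reversed(rewards):
        g = r + gamma * g
    return g

print(discounted_return([1.0, 0.0, 0.0, 10.0]))  # 1 + 0.9**3 * 10 = 8.29
```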
Value functions in MRPs
Why a value function? The return is the outcome of a single episode, so it is a noisy sample-based quantity tied to a time index $t$; the value function, in contrast, depends only on the state $s$.
The value function of an MRP is defined as $v(s) = \mathbb{E}[G_t \mid S_t = s]$
Since the value function here is a function of the state $s$, it is called the state value function, also known as the V function.
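A minimal sketch of estimating $v(s)$ by Monte Carlo, i.e., averaging the returns of many episodes started from $s$ (reusing the hypothetical chain above and an assumed reward of +1 for every non-terminal step):

```python
import numpy as np

rng = np.random.default_rng(0)
P = np.array([
    [0.9, 0.1, 0.0],
    [0.0, 0.5, 0.5],
    [0.0, 0.0, 1.0],
])
TERMINAL, GAMMA = 2, 0.9

def sample_return(start: int, max_steps: int = 200) -> float:
    """Run one episode from `start` and accumulate the discounted return."""
    state, g, discount = start, 0.0, 1.0
    for _ in range(max_steps):
        if state == TERMINAL:
            break
        g += discount * 1.0          # assumed reward of +1 per non-terminal step
        discount *= GAMMA
        state = rng.choice(len(P), p=P[state])
    return g

# v(0) is approximately the average return over many episodes starting in state 0.
print(np.mean([sample_return(0) for _ in range(5000)]))
```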
Bellman equation in MRPs (important)
The value function of the current state consists of two parts:
- first term: the immediate reward $R_{t+1}$
- second term: the value function of the successor state multiplied by the discount factor, $\gamma\, v(S_{t+1})$
so $v(s) = \mathbb{E}[R_{t+1} + \gamma\, v(S_{t+1}) \mid S_t = s]$. Since there may be multiple successor states, if the transition matrix is known, then
$$v(s) = \mathcal{R}_s + \gamma \sum_{s' \in \mathcal{S}} \mathcal{P}_{ss'}\, v(s')$$
The matrix-vector form is
$$v = \mathcal{R} + \gamma \mathcal{P} v$$
This is essentially a linear system and can be solved directly (see the sketch below):
$$v = (I - \gamma \mathcal{P})^{-1} \mathcal{R}$$
The direct solution only applies to small MRPs:
- the computational complexity is $O(n^3)$ in the number of states $n$
- it only works when $\mathcal{P}$ and $\mathcal{R}$ are known
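A minimal sketch of the direct solution $v = (I - \gamma \mathcal{P})^{-1}\mathcal{R}$ with NumPy (the transition matrix and rewards are made-up values for illustration):

```python
import numpy as np

gamma = 0.9
P = np.array([
    [0.9, 0.1, 0.0],
    [0.0, 0.5, 0.5],
    [0.0, 0.0, 1.0],
])
R = np.array([1.0, 1.0, 0.0])  # assumed expected reward per state; the terminal state gives 0

# Solve (I - gamma * P) v = R rather than forming the inverse explicitly.
v = np.linalg.solve(np.eye(len(P)) - gamma * P, R)
print(v)
```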
Markov decision process
In both the MP and the MRP we are passive observers: we watch the state transitions and compute returns. In an RL problem we instead want to influence the state transition process so as to maximize the return. Markov Decision Processes (MDPs) are therefore obtained by introducing decisions (actions) into the MRP.
Definition
A Markov decision process (MDP) is a five-tuple $\langle \mathcal{S}, \mathcal{A}, \mathcal{P}, \mathcal{R}, \gamma \rangle$
- $\mathcal{A}$: the set of actions
- $\mathcal{P}$: the state transition matrix, $\mathcal{P}^a_{ss'} = \mathbb{P}[S_{t+1} = s' \mid S_t = s, A_t = a]$
- $\mathcal{R}$: the reward function, $\mathcal{R}^a_s = \mathbb{E}[R_{t+1} \mid S_t = s, A_t = a]$, the reward for taking action $a$ in state $s$ ($\mathcal{S}$ and $\gamma$ are as in the MRP)
Policy
In an MDP, a policy $\pi$ is a probability distribution over actions given the state: $\pi(a \mid s) = \mathbb{P}[A_t = a \mid S_t = s]$
- The policy is stationary: it depends only on the state $s$, not on the time $t$
- Finding a good policy is the ultimate goal of an RL problem
- If the distribution is one-hot, the policy is deterministic; otherwise it is a stochastic policy
The relationship between MDPs and MRPs
If the MDP is given a fixed policy $\pi$, it degenerates into an MRP with $\mathcal{P}^\pi_{ss'} = \sum_{a \in \mathcal{A}} \pi(a \mid s)\, \mathcal{P}^a_{ss'}$ and $\mathcal{R}^\pi_s = \sum_{a \in \mathcal{A}} \pi(a \mid s)\, \mathcal{R}^a_s$, as sketched below.
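A minimal sketch of this reduction (the MDP arrays and policy are made-up shapes for illustration: P has shape [A, S, S], R has shape [S, A], pi has shape [S, A]):

```python
import numpy as np

n_states, n_actions = 3, 2
rng = np.random.default_rng(0)

# Hypothetical MDP: P[a, s, s'] (each row over s' sums to 1) and R[s, a].
P = rng.dirichlet(np.ones(n_states), size=(n_actions, n_states))
R = rng.normal(size=(n_states, n_actions))

# A uniformly random policy pi[s, a].
pi = np.full((n_states, n_actions), 1.0 / n_actions)

# Induced MRP:
# P_pi[s, s'] = sum_a pi(a|s) * P[a, s, s']
# R_pi[s]     = sum_a pi(a|s) * R[s, a]
P_pi = np.einsum('sa,ast->st', pi, P)
R_pi = (pi * R).sum(axis=1)

assert np.allclose(P_pi.sum(axis=1), 1.0)
print(P_pi, R_pi, sep='\n')
```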
Value functions in MDPs
- State value function (V function)
  - Definition: the expected return obtained by starting from state $s$ and then following policy $\pi$: $v_\pi(s) = \mathbb{E}_\pi[G_t \mid S_t = s]$
- State-action value function (Q function)
  - Definition: the expected return obtained by starting from state $s$, taking action $a$, and then following policy $\pi$: $q_\pi(s, a) = \mathbb{E}_\pi[G_t \mid S_t = s, A_t = a]$
  - The action $a$ does not have to come from the policy $\pi$; only after action $a$ has been taken does $\pi$ take over the action selection
Bellman expectation equation
Similar to the MRP case, the value functions in MDPs can be split into two parts, the immediate reward and the discounted value of what follows:
$$v_\pi(s) = \mathbb{E}_\pi[R_{t+1} + \gamma\, v_\pi(S_{t+1}) \mid S_t = s]$$
$$q_\pi(s, a) = \mathbb{E}_\pi[R_{t+1} + \gamma\, q_\pi(S_{t+1}, A_{t+1}) \mid S_t = s, A_t = a]$$
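A minimal sketch of policy evaluation under these equations (using the same made-up MDP shapes as in the previous sketch): reduce the MDP plus policy to an MRP, solve for $v_\pi$ directly, then recover $q_\pi$.

```python
import numpy as np

n_states, n_actions, gamma = 3, 2, 0.9
rng = np.random.default_rng(0)

# Hypothetical MDP: P[a, s, s'], R[s, a], and a uniformly random policy pi[s, a].
P = rng.dirichlet(np.ones(n_states), size=(n_actions, n_states))
R = rng.normal(size=(n_states, n_actions))
pi = np.full((n_states, n_actions), 1.0 / n_actions)

# Reduce to an MRP under pi.
P_pi = np.einsum('sa,ast->st', pi, P)
R_pi = (pi * R).sum(axis=1)

# Bellman expectation equation in matrix form: v_pi = R_pi + gamma * P_pi v_pi.
v_pi = np.linalg.solve(np.eye(n_states) - gamma * P_pi, R_pi)

# q_pi(s, a) = R[s, a] + gamma * sum_{s'} P[a, s, s'] * v_pi(s').
q_pi = R + gamma * np.einsum('ast,t->sa', P, v_pi)

# Consistency check: v_pi(s) = sum_a pi(a|s) * q_pi(s, a).
assert np.allclose(v_pi, (pi * q_pi).sum(axis=1))
print(v_pi, q_pi, sep='\n')
```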