Markov Decision Process (MDP)

A Markov decision process is a mathematical model of sequential decision making, used to describe the policies and rewards of an agent in an environment with the Markov property. To solve a decision problem, one first builds the Markov decision model and then solves it with a reinforcement learning algorithm. The final result is a sequential policy $\pi(\theta)=\left\{\pi_{1}, \pi_{2}, \cdots, \pi_{\tau}\right\}$; optimizing the policy parameters $\theta$ according to the formula below yields the optimal policy.


$$\theta=\arg \max _{\theta} \mathcal{J}(\theta)$$
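As a concrete illustration (not from the original text), the sketch below builds a hypothetical two-state environment, estimates $\mathcal{J}(\theta)$ by Monte-Carlo rollouts of a simple Bernoulli policy with parameter $\theta$, and picks the best $\theta$ from a coarse grid; every name and number in it is an assumption made for this example.

```python
import numpy as np

# Hypothetical two-state environment, made up for this illustration:
# taking action 1 in state 1 yields reward 1, everything else yields 0.
def env_step(state, action, rng):
    reward = 1.0 if (state == 1 and action == 1) else 0.0
    next_state = rng.integers(2)  # the next state is drawn uniformly
    return next_state, reward

def rollout_return(theta, rng, horizon=50, gamma=0.9):
    """Discounted return of one episode under the Bernoulli policy
    P(a = 1 | s) = theta (a deliberately simple parameterization)."""
    state, total, discount = 0, 0.0, 1.0
    for _ in range(horizon):
        action = int(rng.random() < theta)
        state, reward = env_step(state, action, rng)
        total += discount * reward
        discount *= gamma
    return total

def estimate_J(theta, episodes=2000, seed=0):
    """Monte-Carlo estimate of J(theta)."""
    rng = np.random.default_rng(seed)
    return np.mean([rollout_return(theta, rng) for _ in range(episodes)])

# theta = argmax_theta J(theta), approximated over a coarse grid.
thetas = np.linspace(0.0, 1.0, 11)
best_theta = max(thetas, key=estimate_J)
print("best theta on the grid:", best_theta)
```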

Markov property

What is the Markov property? When the relation between states satisfies the following formula, the environment is said to have the Markov property.


$$\mathcal{P}_{s s^{\prime}}=P\left(S_{t+1} \mid S_{t}\right)$$

That is, the current state of the environment depends only on the immediately preceding state, not on the earlier history of states.

Markov reward

The state transition matrix $\mathcal{P}$ (row $s$ contains the probabilities of moving from state $s$ to each state $s^{\prime}$) and the expected reward $\mathcal{R}_{s}$ of each state are:

$$\begin{array}{l}
\mathcal{P}=\left[\begin{array}{ccc}
\mathcal{P}_{11} & \cdots & \mathcal{P}_{1 n} \\
\vdots & & \vdots \\
\mathcal{P}_{n 1} & \cdots & \mathcal{P}_{n n}
\end{array}\right] \\
\mathcal{R}_{s}=\mathbb{E}\left[R_{t+1} \mid S_{t}=s\right]
\end{array}$$
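As a numeric sketch (the three states and all numbers below are made up purely for illustration, and the same toy $\mathcal{P}$ and $\mathcal{R}$ are reused in the later sketches), the two objects above can be written as a NumPy matrix and vector:

```python
import numpy as np

# Hypothetical 3-state Markov reward process (all numbers are made up).
# P[i, j] = probability of moving from state i to state j.
P = np.array([[0.7, 0.2, 0.1],
              [0.1, 0.8, 0.1],
              [0.2, 0.3, 0.5]])

# R[i] = E[R_{t+1} | S_t = i], the expected immediate reward in state i.
R = np.array([1.0, 0.0, 2.0])

# Every row of P must sum to 1, i.e. be a probability distribution.
assert np.allclose(P.sum(axis=1), 1.0)
```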

Markov chain

A Markov chain is a stochastic process described by states and state transitions that satisfies the Markov property.
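For example, a chain with the toy transition matrix above can be simulated by repeatedly sampling the next state from the row of $\mathcal{P}$ that corresponds to the current state; the sketch below is illustrative only.

```python
import numpy as np

def simulate_chain(P, start_state, n_steps, seed=0):
    """Sample a trajectory from a Markov chain with transition matrix P.

    Because of the Markov property, sampling the next state only
    needs the row P[current_state]."""
    rng = np.random.default_rng(seed)
    states = [start_state]
    for _ in range(n_steps):
        next_state = rng.choice(len(P), p=P[states[-1]])
        states.append(int(next_state))
    return states

# Toy transition matrix (same made-up example as above).
P = np.array([[0.7, 0.2, 0.1],
              [0.1, 0.8, 0.1],
              [0.2, 0.3, 0.5]])
print(simulate_chain(P, start_state=0, n_steps=10))
```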

The value function

State value

The goal of a Markov decision process is to maximize the cumulative return $G_{t}=R_{t+1}+\gamma R_{t+2}+\gamma^{2} R_{t+3}+\cdots$.

  • State value

$$\begin{aligned}
v(s) &=\mathbb{E}\left[G_{t} \mid S_{t}=s\right] \\
&=\mathbb{E}\left[R_{t+1}+\gamma R_{t+2}+\gamma^{2} R_{t+3}+\ldots \mid S_{t}=s\right] \\
&=\mathbb{E}\left[R_{t+1}+\gamma\left(R_{t+2}+\gamma R_{t+3}+\ldots\right) \mid S_{t}=s\right] \\
&=\mathbb{E}\left[R_{t+1}+\gamma G_{t+1} \mid S_{t}=s\right] \\
&=\mathbb{E}\left[R_{t+1}+\gamma v\left(S_{t+1}\right) \mid S_{t}=s\right]
\end{aligned}$$

Therefore:


$$v(s)=\mathcal{R}_{s}+\gamma \sum_{s^{\prime} \in \mathcal{S}} \mathcal{P}_{s s^{\prime}} v\left(s^{\prime}\right)$$
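One standard way to use this equation (not part of the original text, which gives the closed-form matrix solution next) is as a fixed-point iteration: start from an arbitrary $v$ and repeatedly apply the right-hand side. A minimal sketch with the made-up $\mathcal{P}$ and $\mathcal{R}$ from above:

```python
import numpy as np

def evaluate_values(P, R, gamma=0.9, tol=1e-8, max_iters=10_000):
    """Iterate v <- R + gamma * P @ v until it stops changing.

    For gamma < 1 the update is a contraction, so the iteration
    converges to the unique solution of the Bellman equation."""
    v = np.zeros(len(R))
    for _ in range(max_iters):
        v_new = R + gamma * P @ v
        if np.max(np.abs(v_new - v)) < tol:
            break
        v = v_new
    return v_new

P = np.array([[0.7, 0.2, 0.1],
              [0.1, 0.8, 0.1],
              [0.2, 0.3, 0.5]])
R = np.array([1.0, 0.0, 2.0])
print(evaluate_values(P, R))
```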

Solution

  • To solve for the value of each state:

$$v=\mathcal{R}+\gamma \mathcal{P} v$$

$$\begin{array}{l}
{\left[\begin{array}{c}
v(1) \\
\vdots \\
v(n)
\end{array}\right]=\left[\begin{array}{c}
\mathcal{R}_{1} \\
\vdots \\
\mathcal{R}_{n}
\end{array}\right]+\gamma\left[\begin{array}{ccc}
\mathcal{P}_{11} & \ldots & \mathcal{P}_{1 n} \\
\vdots & & \vdots \\
\mathcal{P}_{n 1} & \ldots & \mathcal{P}_{n n}
\end{array}\right]\left[\begin{array}{c}
v(1) \\
\vdots \\
v(n)
\end{array}\right]} \\
v=\mathcal{R}+\gamma \mathcal{P} v \\
(I-\gamma \mathcal{P}) v=\mathcal{R} \\
v=(I-\gamma \mathcal{P})^{-1} \mathcal{R}
\end{array}$$
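A numeric sketch of the closed-form solution above, again with the made-up $\mathcal{P}$ and $\mathcal{R}$ (the linear system is solved with np.linalg.solve rather than forming the inverse explicitly, which is equivalent but numerically preferable):

```python
import numpy as np

# Toy MRP from the earlier sketches (all numbers are made up).
P = np.array([[0.7, 0.2, 0.1],
              [0.1, 0.8, 0.1],
              [0.2, 0.3, 0.5]])
R = np.array([1.0, 0.0, 2.0])
gamma = 0.9

# v = (I - gamma * P)^{-1} R, computed by solving the linear system.
v = np.linalg.solve(np.eye(len(R)) - gamma * P, R)
print(v)
```

For $\gamma<1$ the matrix $I-\gamma \mathcal{P}$ is always invertible, since the spectral radius of $\gamma \mathcal{P}$ is strictly below 1, and the result matches the fixed-point iteration sketched earlier.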

Partially Observable Markov Decision Process (POMDP)

In real applications, the state of the environment is usually not fully observable to the agent, so an MDP cannot describe the problem accurately; for this case the POMDP model is introduced. In a POMDP, an observation model is introduced so that the agent can infer a probability distribution over the states (its belief).

The belief update formula is:


$$b^{\prime}\left(s^{\prime}\right)=\eta\, O\left(o \mid s^{\prime}, a\right) \sum_{s \in S} T\left(s^{\prime} \mid s, a\right) b(s)$$

where $\eta=1 / \operatorname{Pr}(o \mid b, a)$ is a normalizing constant and

$$\operatorname{Pr}(o \mid b, a)=\sum_{s^{\prime} \in S} O\left(o \mid s^{\prime}, a\right) \sum_{s \in S} T\left(s^{\prime} \mid s, a\right) b(s)$$

The state value is:


$$V^{\pi}\left(b_{0}\right)=\sum_{t=0}^{\infty} \gamma^{t} r\left(b_{t}, a_{t}\right)=\sum_{t=0}^{\infty} \gamma^{t} \mathbb{E}\left[R\left(s_{t}, a_{t}\right) \mid b_{0}, \pi\right]$$

The optimality equations are:


$$\pi^{*}=\underset{\pi}{\operatorname{argmax}}\, V^{\pi}\left(b_{0}\right)$$

$$V^{*}(b)=\max _{a \in A}\left[r(b, a)+\gamma \sum_{o \in \Omega} \operatorname{Pr}(o \mid b, a) V^{*}(\tau(b, a, o))\right]$$
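A minimal sketch of the belief update defined above; the layout conventions (T[a][s, s'] = T(s'|s, a), O[a][s', o] = O(o|s', a)) and every number in the toy model are assumptions made for this example:

```python
import numpy as np

def update_belief(b, a, o, T, O):
    """Belief update: b'(s') = eta * O(o | s', a) * sum_s T(s' | s, a) * b(s).

    b    : current belief over states, shape (n_states,)
    T[a] : transition matrix for action a, T[a][s, s'] = T(s' | s, a)
    O[a] : observation matrix for action a, O[a][s', o] = O(o | s', a)
    """
    unnormalized = O[a][:, o] * (b @ T[a])   # numerator, indexed by s'
    pr_o = unnormalized.sum()                # Pr(o | b, a)
    return unnormalized / pr_o               # eta = 1 / Pr(o | b, a)

# Hypothetical 2-state, 1-action, 2-observation POMDP for illustration.
T = {0: np.array([[0.9, 0.1],
                  [0.2, 0.8]])}
O = {0: np.array([[0.8, 0.2],
                  [0.3, 0.7]])}
b = np.array([0.5, 0.5])
print(update_belief(b, a=0, o=1, T=T, O=O))
```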

Applications

To be added ~