Markov Decision Process (MDP)

A Markov decision process is a mathematical model of sequential decision making, used to describe the policies and rewards of an agent in an environment with the Markov property. To solve a decision problem, one first builds the Markov decision model and then solves it with a reinforcement learning algorithm. The final result is a sequential policy $\pi(\theta)=\left\{\pi_{1}, \pi_{2}, \cdots, \pi_{\tau}\right\}$; optimizing the policy parameters $\theta$ according to the formula below yields the optimal policy.


$$\theta=\arg \max _{\theta} \mathcal{J}(\theta)$$
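As a concrete illustration (not from the original text), the sketch below builds a hypothetical two-state environment, estimates $\mathcal{J}(\theta)$ by Monte-Carlo rollouts of a simple Bernoulli policy with parameter $\theta$, and picks the best $\theta$ from a coarse grid; every name and number in it is an assumption made for this example.

```python
import numpy as np

# Hypothetical two-state environment, made up for this illustration:
# taking action 1 in state 1 yields reward 1, everything else yields 0.
def env_step(state, action, rng):
    reward = 1.0 if (state == 1 and action == 1) else 0.0
    next_state = rng.integers(2)  # the next state is drawn uniformly
    return next_state, reward

def rollout_return(theta, rng, horizon=50, gamma=0.9):
    """Discounted return of one episode under the Bernoulli policy
    P(a = 1 | s) = theta (a deliberately simple parameterization)."""
    state, total, discount = 0, 0.0, 1.0
    for _ in range(horizon):
        action = int(rng.random() < theta)
        state, reward = env_step(state, action, rng)
        total += discount * reward
        discount *= gamma
    return total

def estimate_J(theta, episodes=2000, seed=0):
    """Monte-Carlo estimate of J(theta)."""
    rng = np.random.default_rng(seed)
    return np.mean([rollout_return(theta, rng) for _ in range(episodes)])

# theta = argmax_theta J(theta), approximated over a coarse grid.
thetas = np.linspace(0.0, 1.0, 11)
best_theta = max(thetas, key=estimate_J)
print("best theta on the grid:", best_theta)
```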

Markov property

What is the Markov property? When the relation between states satisfies the following formula, the environment is said to have the Markov property.


$$\mathcal{P}_{s s^{\prime}}=P\left(S_{t+1} \mid S_{t}\right)$$

That is, the current state of the environment depends only on the immediately preceding state, not on the earlier history of states.

Markov reward

The state transition matrix $\mathcal{P}$ (row $s$ contains the probabilities of moving from state $s$ to each state $s^{\prime}$) and the expected reward $\mathcal{R}_{s}$ of each state are:

$$\begin{array}{l}
\mathcal{P}=\left[\begin{array}{ccc}
\mathcal{P}_{11} & \cdots & \mathcal{P}_{1 n} \\
\vdots & & \vdots \\
\mathcal{P}_{n 1} & \cdots & \mathcal{P}_{n n}
\end{array}\right] \\
\mathcal{R}_{s}=\mathbb{E}\left[R_{t+1} \mid S_{t}=s\right]
\end{array}$$
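As a numeric sketch (the three states and all numbers below are made up purely for illustration, and the same toy $\mathcal{P}$ and $\mathcal{R}$ are reused in the later sketches), the two objects above can be written as a NumPy matrix and vector:

```python
import numpy as np

# Hypothetical 3-state Markov reward process (all numbers are made up).
# P[i, j] = probability of moving from state i to state j.
P = np.array([[0.7, 0.2, 0.1],
              [0.1, 0.8, 0.1],
              [0.2, 0.3, 0.5]])

# R[i] = E[R_{t+1} | S_t = i], the expected immediate reward in state i.
R = np.array([1.0, 0.0, 2.0])

# Every row of P must sum to 1, i.e. be a probability distribution.
assert np.allclose(P.sum(axis=1), 1.0)
```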

Markov chain

A Markov chain is a stochastic process described by states and state transitions that satisfies the Markov property.
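For example, a chain with the toy transition matrix above can be simulated by repeatedly sampling the next state from the row of $\mathcal{P}$ that corresponds to the current state; the sketch below is illustrative only.

```python
import numpy as np

def simulate_chain(P, start_state, n_steps, seed=0):
    """Sample a trajectory from a Markov chain with transition matrix P.

    Because of the Markov property, sampling the next state only
    needs the row P[current_state]."""
    rng = np.random.default_rng(seed)
    states = [start_state]
    for _ in range(n_steps):
        next_state = rng.choice(len(P), p=P[states[-1]])
        states.append(int(next_state))
    return states

# Toy transition matrix (same made-up example as above).
P = np.array([[0.7, 0.2, 0.1],
              [0.1, 0.8, 0.1],
              [0.2, 0.3, 0.5]])
print(simulate_chain(P, start_state=0, n_steps=10))
```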

The value function

State value

The goal of a Markov decision process is to maximize the cumulative return $G_{t}=R_{t+1}+\gamma R_{t+2}+\gamma^{2} R_{t+3}+\cdots$.

  • State value

$$\begin{aligned}
v(s) &=\mathbb{E}\left[G_{t} \mid S_{t}=s\right] \\
&=\mathbb{E}\left[R_{t+1}+\gamma R_{t+2}+\gamma^{2} R_{t+3}+\ldots \mid S_{t}=s\right] \\
&=\mathbb{E}\left[R_{t+1}+\gamma\left(R_{t+2}+\gamma R_{t+3}+\ldots\right) \mid S_{t}=s\right] \\
&=\mathbb{E}\left[R_{t+1}+\gamma G_{t+1} \mid S_{t}=s\right] \\
&=\mathbb{E}\left[R_{t+1}+\gamma v\left(S_{t+1}\right) \mid S_{t}=s\right]
\end{aligned}$$

Therefore:


$$v(s)=\mathcal{R}_{s}+\gamma \sum_{s^{\prime} \in \mathcal{S}} \mathcal{P}_{s s^{\prime}} v\left(s^{\prime}\right)$$
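One standard way to use this equation (not part of the original text, which gives the closed-form matrix solution next) is as a fixed-point iteration: start from an arbitrary $v$ and repeatedly apply the right-hand side. A minimal sketch with the made-up $\mathcal{P}$ and $\mathcal{R}$ from above:

```python
import numpy as np

def evaluate_values(P, R, gamma=0.9, tol=1e-8, max_iters=10_000):
    """Iterate v <- R + gamma * P @ v until it stops changing.

    For gamma < 1 the update is a contraction, so the iteration
    converges to the unique solution of the Bellman equation."""
    v = np.zeros(len(R))
    for _ in range(max_iters):
        v_new = R + gamma * P @ v
        if np.max(np.abs(v_new - v)) < tol:
            break
        v = v_new
    return v_new

P = np.array([[0.7, 0.2, 0.1],
              [0.1, 0.8, 0.1],
              [0.2, 0.3, 0.5]])
R = np.array([1.0, 0.0, 2.0])
print(evaluate_values(P, R))
```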

Solution

  • To solve for the value of each state:

$$v=\mathcal{R}+\gamma \mathcal{P} v$$

$$\begin{array}{l}
{\left[\begin{array}{c}
v(1) \\
\vdots \\
v(n)
\end{array}\right]=\left[\begin{array}{c}
\mathcal{R}_{1} \\
\vdots \\
\mathcal{R}_{n}
\end{array}\right]+\gamma\left[\begin{array}{ccc}
\mathcal{P}_{11} & \ldots & \mathcal{P}_{1 n} \\
\vdots & & \vdots \\
\mathcal{P}_{n 1} & \ldots & \mathcal{P}_{n n}
\end{array}\right]\left[\begin{array}{c}
v(1) \\
\vdots \\
v(n)
\end{array}\right]} \\
v=\mathcal{R}+\gamma \mathcal{P} v \\
(I-\gamma \mathcal{P}) v=\mathcal{R} \\
v=(I-\gamma \mathcal{P})^{-1} \mathcal{R}
\end{array}$$
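A numeric sketch of the closed-form solution above, again with the made-up $\mathcal{P}$ and $\mathcal{R}$ (the linear system is solved with np.linalg.solve rather than forming the inverse explicitly, which is equivalent but numerically preferable):

```python
import numpy as np

# Toy MRP from the earlier sketches (all numbers are made up).
P = np.array([[0.7, 0.2, 0.1],
              [0.1, 0.8, 0.1],
              [0.2, 0.3, 0.5]])
R = np.array([1.0, 0.0, 2.0])
gamma = 0.9

# v = (I - gamma * P)^{-1} R, computed by solving the linear system.
v = np.linalg.solve(np.eye(len(R)) - gamma * P, R)
print(v)
```

For $\gamma<1$ the matrix $I-\gamma \mathcal{P}$ is always invertible, since the spectral radius of $\gamma \mathcal{P}$ is strictly below 1, and the result matches the fixed-point iteration sketched earlier.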

Partially Observable Markov Decision Process (POMDP)

In real applications, the state of the environment is usually not fully observable to the agent, so an MDP cannot describe the problem accurately; for this case the POMDP model is introduced. In a POMDP, an observation model is introduced so that the agent can infer a probability distribution over the states (its belief).

The belief update formula is:


$$b^{\prime}\left(s^{\prime}\right)=\eta\, O\left(o \mid s^{\prime}, a\right) \sum_{s \in S} T\left(s^{\prime} \mid s, a\right) b(s)$$

where $\eta=1 / \operatorname{Pr}(o \mid b, a)$ is a normalizing constant and

$$\operatorname{Pr}(o \mid b, a)=\sum_{s^{\prime} \in S} O\left(o \mid s^{\prime}, a\right) \sum_{s \in S} T\left(s^{\prime} \mid s, a\right) b(s)$$

The state value is:


$$V^{\pi}\left(b_{0}\right)=\sum_{t=0}^{\infty} \gamma^{t} r\left(b_{t}, a_{t}\right)=\sum_{t=0}^{\infty} \gamma^{t} \mathbb{E}\left[R\left(s_{t}, a_{t}\right) \mid b_{0}, \pi\right]$$

The optimality equations are:


$$\pi^{*}=\underset{\pi}{\operatorname{argmax}}\, V^{\pi}\left(b_{0}\right)$$

$$V^{*}(b)=\max _{a \in A}\left[r(b, a)+\gamma \sum_{o \in \Omega} \operatorname{Pr}(o \mid b, a) V^{*}(\tau(b, a, o))\right]$$
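A minimal sketch of the belief update defined above; the layout conventions (T[a][s, s'] = T(s'|s, a), O[a][s', o] = O(o|s', a)) and every number in the toy model are assumptions made for this example:

```python
import numpy as np

def update_belief(b, a, o, T, O):
    """Belief update: b'(s') = eta * O(o | s', a) * sum_s T(s' | s, a) * b(s).

    b    : current belief over states, shape (n_states,)
    T[a] : transition matrix for action a, T[a][s, s'] = T(s' | s, a)
    O[a] : observation matrix for action a, O[a][s', o] = O(o | s', a)
    """
    unnormalized = O[a][:, o] * (b @ T[a])   # numerator, indexed by s'
    pr_o = unnormalized.sum()                # Pr(o | b, a)
    return unnormalized / pr_o               # eta = 1 / Pr(o | b, a)

# Hypothetical 2-state, 1-action, 2-observation POMDP for illustration.
T = {0: np.array([[0.9, 0.1],
                  [0.2, 0.8]])}
O = {0: np.array([[0.8, 0.2],
                  [0.3, 0.7]])}
b = np.array([0.5, 0.5])
print(update_belief(b, a=0, o=1, T=T, O=O))
```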

Applications

To be added ~