markov decision process (MDP)

2022-03-15 · 7 min read

\[ \renewcommand{\Pr}[1]{\text{Pr}\left[ #1 \right]} \renewcommand{\E}[1]{\mathbb{E}\left[ #1 \right]} \renewcommand{\argmax}{\text{argmax}} \]

Overview #

A Markov Decision Process (MDP) is a 4-tuple, $(S, A, P_a, R_a)$, where:

$S$ is the State space
$A$ is the Action space (or $A = \set{ A_s \;|\; s \in S}$, where $A_s$ are the actions available at state $s$)
$P(s', s, a) = \Pr{s' \;|\; s, a}$ is the transition probability from states $s$ to $s'$ given the agent took action $a$.
- The transition probabilities must be Markovian (i.e., "memoryless"); they must not depend on prior states (other than the most recent).
- Recall that we can always engineer "memorylessness" into a process by just redefining each new process state as the complete history of states to the given state, though this is often prohibitively expensive.
$R(s, s', a)$ is the immediate (or expected immediate) reward when going from states $s$ to $s'$ using action $a$.

We also have policy functions $\pi : S \mapsto A$ which take a state and output which action to take at that state.

If we fix a policy function $\pi$ for a given MDP, we actually get a standard markov chain (since there's no more pesky non-linear max in there). This markov chain has state transition probabilities

\[ \Pr{s_{t+1} = s' \;|\; s_t = s} = P(s', s, \pi(s)) \]

Our objective for an MDP is to find an optimal policy $\pi^*$ that maximizes the expected (discounted) sum of rewards

\[ \E{\sum_{t=0}^\infty \gamma^t \cdot R(s_t,\, s_{t+1},\, \pi^*(s_t))} \]

where $0 \le \gamma \le 1$ is a discounting factor.

The hyperparameter $\gamma$ is a quantity we control. It reflects how much we value short-term rewards over long-term rewards. $\gamma$ values closer to zero are miopic (prioritize short-term rewards). $\gamma$ values closer to one produce more optimal policies, but may prevent our learning algorithm from converging within a reasonable time.

Simulator models #

Learning the optimal policy $\pi^*$ means first choosing an algorithm appropriate for our process. Different algorithms require different assumptions about the process and require varying levels of simulator model fidelity.

Here are three popular algorithms in decreasing order of required simulator model fidelity:

Dynamic Programming (e.g. Value Iteration): requires an explicit model of the MDP. This explicit model contains the complete transition probability function and reward function. For certain games, the transition and reward functions could be discrete and infinitely large, ruling out this approach.
MCTS (Monte Carlo Tree Search): requires a generative model, which means you can generate samples of the next state and reward given a prior state and action. If you can efficiently simulate a single turn of your game without knowing the complete dynamics, MCTS may be an option.
Reinforcement Learning (RL): only requires an episodic simulator. In this case, all we need are samples of full episodes (e.g. full game rollouts like $[(s_0, a_0, r_0),\; (s_1, a_1, r_1,\; \ldots,\; (s_k, a_k, r_k))]$). For instance, AlphaStar (which uses RL) trains on full Starcraft replays as opposed to extracting the core game logic to simulate single game ticks in isolation.

Algorithms #

Preliminaries #

Before we dive into the algorithms, let's first take a look at some different ways we can express the value of states and q-states under the optimal policy $\pi^*$:

\[ \begin{align*} Q^*(s, a) &= \sum_{s'} P(s', s, a) \cdot (R(s', s, a) + \gamma \cdot V^*(s')) & \text{(1.)} \\ &= \sum_{s'} P(s', s, a) \cdot (R(s', s, a) + \gamma \cdot \max_{a' \in A_{s'}} Q^*(s', a')) & \text{(2.)} \\ \\ V^*(s) &= \max_{a \in A_s} Q^*(s, a) \\ &= \max_a \sum_{s'} P(s', s, a) \cdot (R(s', s, a) + \gamma \cdot V^*(s')) & \text{(3.)} \\ \\ V^\pi(s) &= Q^\pi(s, \pi(s)) \\ &= \sum_{s'} P(s', s, \pi(s)) \cdot (R(s', s, \pi(s)) + \gamma \cdot V^*(s')) & \text{(4.)} \\ \\ \pi(s) &= \underset{a \in A_s}{\argmax} \; Q^\pi(s, a) & \text{(5.)} \\ \end{align*} \]

This is the standard Q-state value formulation under the optimal policy (also called a Q-value).
The Q-values expressed only in terms of other Q-values. Used in Q-value iteration.
The state value under the optimal policy expressed in terms of other state values. Used in value iteration.
The state value given an explicit policy. Used in policy iteration.
The policy update equation. For each state, look at all the actions, evaluate them based on the previous policy, then set the new policy action to the best action. Used in policy iteration.

Dynamic Programming #

Value Iteration #

Value iteration (as the name implies) iteratively improves our estimate of the optimal state values $V^*$ until convergence.

We start with an initial estimate $V_0$ of all the state values. We then iterate from $k = 0, 1, 2, \ldots$, updating our next estimate of the state values $V_{k+1}$ from the current estimate $V_k$. For each iteration, we loop through all the possible states $s \in S$ and update $V_{k+1}(s)$ like so

\[ V_{k+1}(s) \gets \max_a \sum_{s'} P(s', s, a) \cdot (R(s', s, a) + \gamma \cdot V_k(s')) \]

Notice that we're not learning an explicit policy function $\pi$ here. Instead, the policy is implicit in the $\max$. If we want to recover an explicit policy function for our current estimate, $\pi_{k+1}(s)$, we simply look at all actions $a$ from state $s$ and choose the one with the greatest expected reward

\[ \pi_{k+1}(s) = \underset{a}{\argmax} \sum_{s'} P(s', s, a) \cdot (R(s', s, a) + \gamma \cdot V_k(s')) \]

There are some proofs that this converges, so long as certain properties are true. For our purposes, we just care about the required properties:

If the game is finite, then value iteration always converges regardless of $\gamma$.
If the game is infinite but has absorbing states that are reached w.p. 1 as $k \rightarrow \infty$ then value iteration converges regardless of $\gamma$.
Otherwise, for an infinite game, without absorbing states, we need a discounting value $0 \le \gamma < 1$ and a bounded maximum state transition reward $R_\text{max}$. You can see that with $\gamma < 1$ and a bounded reward, the maximum possible reward sum is bounded by

\[ R(s_1, s_0, a) + \gamma \cdot R(s_2, s_1, a) + \gamma^2 \cdot R(s_3, s_2, a) + \cdots \le \sum_{t = 0}^\infty \gamma^t \cdot R_\text{max} = \frac{1}{1 - \gamma} \cdot R_\text{max} \]

Q-value Iteration #

We can also do value iteration, but store Q-values instead of state-values. For some games, the space of Q-states $(s, a)$ (action $a$ applied to state $s$) might be significantly smaller than the original state space.

In these cases, we use Q-value Iteration over Value Iteration. The Q-value Iteration update equation looks nearly identical, except we just swap the order of the mean and max

\[ Q_{k+1}(s, a) = \sum_{s'} P(s', s, a) \cdot (R(s', s, a) + \gamma \cdot \max_{a'} Q_{k}(s', a') \]

Finally, our recovered policy

\[ \pi_{k+1}(s) = \underset{a}{\argmax} \; Q_{k}(s, a) \]