expectimax search

2022-03-27 · 4 min read

    • Adversarial search algorithm (two competitive agents)
    • Non-deterministic environment (actions can have random outcomes)
    • Expectimax is like Minimax but for non-deterministic games.
    • Have Player1, Player2, and Chance node levels in our search tree (a minimal sketch of these levels follows this list).
    • Unlike minimax, which only requires the evaluation function to preserve the ordering of states, expectimax requires that our evaluation function $f(\text{leaf-state})$ be a positive linear operator on the win probability (so it must also preserve positive-ness) to be optimal.
      • Recall: the evaluation function scores terminal states (game end) and depth-pruned states (states exceeding a depth limit).
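
    A minimal sketch of those three levels, assuming a hypothetical game-state interface (`is_terminal()`, `legal_moves()`, `apply(move)`, `chance_outcomes()`) and an `evaluate()` function for terminal and depth-pruned states; none of these names come from a specific library:

    ```python
    # Minimal expectimax sketch. The state interface (is_terminal, legal_moves,
    # apply, chance_outcomes) and evaluate() are hypothetical placeholders.
    def expectimax(state, depth, to_move):
        """Return the expectimax value of `state` for the maximizing player."""
        if state.is_terminal() or depth == 0:
            return evaluate(state)  # score terminal and depth-pruned leaves
        if to_move == "max":        # Player1 level: take the best child
            return max(expectimax(state.apply(m), depth - 1, "min")
                       for m in state.legal_moves())
        if to_move == "min":        # Player2 level: take the worst child for us
            return min(expectimax(state.apply(m), depth - 1, "chance")
                       for m in state.legal_moves())
        # Chance level: probability-weighted mean over the random outcomes
        return sum(p * expectimax(outcome, depth - 1, "max")
                   for p, outcome in state.chance_outcomes())
    ```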

    Pruning + State & Q-State Cache #

    See also: 2007 - Effective Use of Transposition Tables in Stochastic Game Tree Search

    For a chance node, we're taking the probability-weighted mean over its possible outcomes.

    As we explore branches from the chance node, our upper and lower bound estimates of the mean slowly tighten, until we explore all branches and the mean becomes an exact value.

    We can think of the lower bound as the exact contribution from the branches explored so far, plus the remaining probability mass times the worst possible outcome. Likewise, the upper bound is the explored contribution plus the remaining probability mass times the best possible outcome.
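
    Written out, with $E$ the set of outcomes explored so far and $[v_{\min}, v_{\max}]$ the known range of leaf values:

    \[ a = \sum_{i \in E} p_i v_i + \Big(1 - \sum_{i \in E} p_i\Big) v_{\min}, \qquad b = \sum_{i \in E} p_i v_i + \Big(1 - \sum_{i \in E} p_i\Big) v_{\max} \]

    Once $E$ covers every outcome, the remaining probability mass is zero and both bounds collapse to the exact mean.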

    With a naive alpha-beta pruning and q-state caching strategy, we might prune a branch but still cache the current estimate. However, that estimate is only valid for another path whose alpha-beta window is at least as tight as the one we pruned under; a path with a looser window may need more information than the cached estimate provides.

    Instead of recording just a single value, we record an enum with two cases:

    Case 1: the exact value, meaning we've fully explored all branches of this chance node.

    Case 2: an upper and lower bound plus a thunk (a suspended computation) that resumes exploration of this node, meaning there are still more branches to explore.

    In this way, we don't lose any information when we place a q-state into the cache. If another path reaches this q-state with looser alpha-beta bounds (perhaps we hit it again at a shallower level in a later branch), it can resume exploring the q-state before deciding whether to prune.
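
    A sketch of what such a cache entry could look like (the type and field names here are hypothetical, not taken from the paper above):

    ```python
    from dataclasses import dataclass
    from typing import Callable, Optional, Union

    @dataclass
    class Exact:
        """All branches of this chance node were explored; the value is the exact mean."""
        value: float

    @dataclass
    class Partial:
        """Exploration was cut off: we only know bounds, plus a thunk to pick up where we left off."""
        lower: float  # explored contribution + remaining probability * worst outcome
        upper: float  # explored contribution + remaining probability * best outcome
        resume: Optional[Callable[[float, float], "QEntry"]] = None  # continue under a new (alpha, beta)

    QEntry = Union[Exact, Partial]
    ```

    On a cache hit with a `Partial` entry, a caller whose window cannot already prune invokes `resume` with its own (looser) alpha-beta bounds and updates the entry with whatever it learns.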

    What about min- and max-state caching? #

    We presumably do the same thing, except that a partially explored max node only yields a lower bound and a partially explored min node only yields an upper bound (rather than both), alongside the exact-value case and a resumable exploration thunk.
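
    Continuing the hypothetical cache-entry sketch from above (reusing its imports and the `Exact`/`Partial` types), the min/max analogues might look like:

    ```python
    @dataclass
    class MaxPartial:
        """Partially explored max node: the running max only lower-bounds the true value."""
        lower: float
        resume: Optional[Callable[[float, float], "Entry"]] = None

    @dataclass
    class MinPartial:
        """Partially explored min node: the running min only upper-bounds the true value."""
        upper: float
        resume: Optional[Callable[[float, float], "Entry"]] = None

    Entry = Union[Exact, Partial, MaxPartial, MinPartial]
    ```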

    How do we prune? #

    At a given chance node we have the bound $[\alpha, \beta]$ from our parent node, where $\alpha$ is the best value the maximizing player can already guarantee along this path and $\beta$ is the best (i.e., lowest) value the minimizing player can already guarantee.

    As we explore the chance node's branches, the bound interval $[a, b]$ on its mean keeps tightening. As soon as $[a, b]$ and $[\alpha, \beta]$ no longer intersect, we can prune: if $b \le \alpha$ the max ancestor already has an option at least as good, and if $a \ge \beta$ the min ancestor does, so this node's exact value can no longer matter.
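
    A sketch of that loop, reusing the hypothetical interface and the `Exact`/`Partial` entries from above, and assuming leaf values are known to lie in $[v_{\min}, v_{\max}]$:

    ```python
    def chance_value(state, depth, alpha, beta, v_min, v_max):
        """Evaluate a chance node, pruning once its bound interval leaves [alpha, beta]."""
        explored = 0.0   # probability-weighted sum over explored outcomes
        remaining = 1.0  # probability mass not yet explored
        for p, outcome in state.chance_outcomes():
            explored += p * expectimax(outcome, depth - 1, "max")
            remaining -= p
            a = explored + remaining * v_min  # lower bound on the mean
            b = explored + remaining * v_max  # upper bound on the mean
            if b <= alpha or a >= beta:       # [a, b] no longer overlaps [alpha, beta]
                # Prune, but keep the bounds; a real implementation would also
                # capture a resume thunk here instead of discarding the iterator.
                return Partial(a, b)
        return Exact(explored)                # fully explored: the mean is exact
    ```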

    Appendix #

    Positive Linear Operator #

    source: https://www.wikiwand.com/en/Positive_linear_operator

    For scalar spaces, positive linear (more precisely, positive affine) transformations are functions of the form:

    \[ f(x) = a_1 \cdot x + a_2 \qquad \text{where } a_1 > 0 \]

    More generally, let $(X, \le_X)$ and $(Y, \le_Y)$ be vector spaces equipped with preorders $\le_X$ and $\le_Y$ respectively.

    Let $f : X \to Y$ be a function from $X$ to $Y$. The function $f$ is a positive linear operator if:

    1. It preserves positive-ness, i.e., positive elements in $X$ are all mapped to positive elements in $Y$.
    2. It preserves the preordering (it is order-preserving), i.e., if $x_1 \le_X x_2$, then $f(x_1) \le_Y f(x_2)$.

    Formally, $f$ is a positive linear operator if

    \[ \begin{align*} &\forall\, x \in X: \quad 0 \le_X x \;\Rightarrow\; 0 \le_Y f(x) && \text{(1.) preserves positive-ness} \\ &\forall\, x_1, x_2 \in X: \quad x_1 \le_X x_2 \;\Rightarrow\; f(x_1) \le_Y f(x_2) && \text{(2.) preserves the preordering} \end{align*} \]

    For example, consider a simple zero-sum game where agents $A$ and $B$ have utility functions

    \[ \begin{align*} U_A &= p_w \cdot r + (1 - p_w) \cdot -r \\ &= r \cdot (2 \cdot p_w - 1) \\ &= 2 r \cdot p_w - r \\ U_B &= (1 - p_w) \cdot r + p_w \cdot -r \\ &= p_l \cdot r + (1 - p_l) \cdot -r \\ &= 2r \cdot p_l - r \end{align*} \]

    where $p_w$ is the probability that agent $A$ wins and $p_l = 1 - p_w$ is the probability that agent $A$ loses (i.e., that agent $B$ wins).

    Notice that both of these simple utility functions are positive linear transformations of the corresponding agent's win probability ($p_w$ and $p_l$ respectively). This makes an expectimax search over them equivalent to searching directly on the normalized score (i.e., the win probability).
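
    To spell out the step that makes this work: expectation is linear, so a positive linear transformation commutes with the averaging done at chance nodes and therefore does not change which move attains the maximum expected value. For agent $A$, with $m$ ranging over its available moves and the expectation taken over chance outcomes,

    \[ \arg\max_{m} \mathbb{E}\big[\, 2r \cdot p_w - r \,\big] = \arg\max_{m} \big( 2r \cdot \mathbb{E}[p_w] - r \big) = \arg\max_{m} \mathbb{E}[p_w] \qquad \text{since } 2r > 0. \]

    A merely order-preserving (non-linear) transform, which minimax would tolerate, need not commute with the expectation, which is why expectimax demands the stronger condition.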