multinomial distribution

2022-03-08 · 2 min read

$$ \renewcommand{\Pr}[1]{\text{Pr}\left[ #1 \right]} \def\eps{\varepsilon} \def\E{\mathbb{E}} $$

Example #

You have a $k$-sided, weighted die. The weight of each face $i = 1, \ldots, k$ is $p_i$. You roll the die $n$ times. The distribution of face counts follows a multinomial distribution.
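As a concrete sketch of the setup, the hypothetical `roll_counts` helper below rolls the weighted die $n$ times by inverse-CDF sampling and tallies the faces; the resulting count vector is one multinomial draw. A tiny xorshift PRNG is used so the example has no dependencies.

```rust
/// Roll a weighted die with face probabilities `p` a total of `n` times,
/// returning the count of each face. `seed` must be nonzero.
fn roll_counts(n: usize, p: &[f64], mut seed: u64) -> Vec<u64> {
    let mut counts = vec![0u64; p.len()];
    for _ in 0..n {
        // xorshift64 step
        seed ^= seed << 13;
        seed ^= seed >> 7;
        seed ^= seed << 17;
        // take 53 bits -> uniform in [0, 1)
        let u = (seed >> 11) as f64 / (1u64 << 53) as f64;
        // inverse-CDF: pick the first face whose cumulative weight exceeds u
        let mut cum = 0.0;
        let mut face = p.len() - 1; // fall back to last face on rounding
        for (i, &pi) in p.iter().enumerate() {
            cum += pi;
            if u < cum {
                face = i;
                break;
            }
        }
        counts[face] += 1;
    }
    counts
}
```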

Overview #

| Property | Value |
| --- | --- |
| Parameters | $n > 0$ number of trials, $p_1, \ldots, p_k$ event probabilities, where $\sum_{i} p_i = 1$ |
| Support | $i \in \set{1, \ldots, k}, \; x_i \in \set{0, \ldots, n}, \text{ with } \sum_i x_i = n$ |
| PMF | $\Pr{X = (x_1, \ldots, x_k)} = \frac{n!}{x_1! \cdots x_k!} p_1^{x_1} \cdots p_k^{x_k}$ |
| Mean | $\E[X_i] = n \cdot p_i$ |
| Variance | $\text{Var}(X_i) = n p_i (1 - p_i), \; \text{Cov}(X_i, X_j) = -n p_i p_j \; (i \ne j)$ |
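The PMF is straightforward to evaluate directly. A minimal sketch (hypothetical `multinomial_pmf` helper, computed in log space so the factorials don't overflow):

```rust
/// Evaluate the multinomial PMF at counts `x` for `n` trials with face
/// probabilities `p`, working in log space to avoid overflowing factorials.
fn multinomial_pmf(n: u64, p: &[f64], x: &[u64]) -> f64 {
    assert_eq!(p.len(), x.len());
    assert_eq!(x.iter().sum::<u64>(), n);
    let ln_fact = |m: u64| (1..=m).map(|i| (i as f64).ln()).sum::<f64>();
    let mut ln_pr = ln_fact(n); // ln n!
    for (&pi, &xi) in p.iter().zip(x) {
        ln_pr -= ln_fact(xi); // ln x_i!
        if xi > 0 {
            ln_pr += xi as f64 * pi.ln(); // ln p_i^{x_i}
        }
    }
    ln_pr.exp()
}
```

For a 3-sided die with $p = (1/2, 1/4, 1/4)$ rolled $n = 4$ times, the probability of seeing counts $(2, 1, 1)$ is $\frac{4!}{2! \, 1! \, 1!} (1/2)^2 (1/4) (1/4) = 0.1875$.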

Statistical Inference #

Equivalence Tests #

Ostrovski 2017 - Testing equivalence of multinomial distributions

Wikiwand - Multinomial Test

Resin 2021 - A Simple Algorithm for Exact Multinomial Tests

Goodness-of-fit Tests #

Ideally, we could use an exact test: choose a significance level $p_{significance}$, define a failure (rejection) region of outcomes whose total probability under the hypothesized distribution is $p_{significance}$ (according to some criterion), and claim that $X$ is not drawn from our hypothesized distribution if the observed counts $X$ lie in that region. The size of the failure region follows from our initial choice of significance: a smaller significance level (greater required confidence before rejecting) implies a smaller failure region.

$$ p_{significance} = \sum_{X \;\in\; \text{region}} \text{multinomial-pmf}(X) $$

Unfortunately, the multinomial PMF has an absurdly huge support for any non-trivial $n$ and $k$ (there are $\binom{n + k - 1}{k - 1}$ possible count vectors), and we can only compute the CDF by exhaustively summing joint probabilities, which makes this approach computationally infeasible: any non-trivial region is just too big. :'(
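For tiny $n$ and $k$, though, the exact test is computable, which makes the blow-up concrete. The sketch below (hypothetical names, with a minimal PMF helper so it's self-contained) enumerates every count vector summing to $n$ and uses "at most as likely as the observed outcome" as the failure-region criterion:

```rust
/// ln(m!), summed directly; fine for small m.
fn ln_fact(m: u64) -> f64 {
    (1..=m).map(|i| (i as f64).ln()).sum()
}

/// Multinomial PMF at counts `x`, evaluated in log space.
fn pmf(n: u64, p: &[f64], x: &[u64]) -> f64 {
    let mut ln_pr = ln_fact(n);
    for (&pi, &xi) in p.iter().zip(x) {
        ln_pr -= ln_fact(xi);
        if xi > 0 {
            ln_pr += xi as f64 * pi.ln();
        }
    }
    ln_pr.exp()
}

/// Visit every count vector of length `k` summing to `total`.
fn for_each_outcome(k: usize, total: u64, prefix: &mut Vec<u64>, f: &mut dyn FnMut(&[u64])) {
    if k == 1 {
        prefix.push(total);
        f(prefix);
        prefix.pop();
        return;
    }
    for xi in 0..=total {
        prefix.push(xi);
        for_each_outcome(k - 1, total - xi, prefix, f);
        prefix.pop();
    }
}

/// Exact p-value: total probability of all outcomes at most as likely as
/// the observed counts (one common failure-region criterion).
fn exact_pvalue(n: u64, p: &[f64], observed: &[u64]) -> f64 {
    let p_obs = pmf(n, p, observed);
    let mut total = 0.0;
    for_each_outcome(p.len(), n, &mut Vec::new(), &mut |x: &[u64]| {
        let pr = pmf(n, p, x);
        if pr <= p_obs + 1e-12 {
            total += pr;
        }
    });
    total
}
```

For a fair coin ($k = 2$) flipped $n = 4$ times with observed counts $(4, 0)$, the only outcomes at most as likely are $(4, 0)$ and $(0, 4)$, giving a p-value of $2/16 = 0.125$. The recursion visits all $\binom{n + k - 1}{k - 1}$ outcomes, which is exactly what becomes infeasible as $n$ and $k$ grow.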

Instead, we use derived random variables, called test statistics, which are more mathematically convenient or more computationally tractable, and compare their observed values against the failure region projected into the test statistic's distribution.

G-test - multinomial goodness-of-fit

Wikiwand - Pearson's chi-squared test

Wikiwand - G-test / log-likelihood ratio test
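As a sketch of the test-statistic idea, Pearson's chi-squared statistic $\chi^2 = \sum_i \frac{(O_i - E_i)^2}{E_i}$ compresses the $k$ observed counts into a single number that is approximately $\chi^2$-distributed with $k - 1$ degrees of freedom under $H_0$ (hypothetical helper name; zero-probability bins are skipped):

```rust
/// Pearson's chi-squared statistic for observed counts `x` against a
/// hypothesized distribution `p` over the same bins. Under H_0 it is
/// approximately chi-squared distributed with (#nonzero bins - 1) DOF.
fn chi_squared_stat(p: &[f64], x: &[u64]) -> f64 {
    let n: u64 = x.iter().sum();
    p.iter()
        .zip(x)
        .filter(|&(&pi, _)| pi > 0.0) // skip impossible bins
        .map(|(&pi, &xi)| {
            let expected = n as f64 * pi; // E_i = n * p_i
            let diff = xi as f64 - expected; // O_i - E_i
            diff * diff / expected
        })
        .sum()
}
```

For example, a fair coin flipped 6 times with observed counts $(4, 2)$ gives expected counts $(3, 3)$ and $\chi^2 = \frac{1}{3} + \frac{1}{3} = \frac{2}{3}$.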

Example Implementation #

Using `g_test` as defined in G-test - multinomial goodness-of-fit > Example Implementation.

```rust
use claim::{debug_assert_ge, debug_assert_le};
use ndarray::ArrayView1;
use statrs::distribution::{ChiSquared, ContinuousCDF};

/// A goodness-of-fit test between a hypothesized multinomial distribution, `p`,
/// and an experimentally observed distribution, `p_hat`, both represented as
/// dense PMFs. `n` is the number of samples taken to construct `p_hat`.
///
/// Returns a p-value, `Pr[G(X) >= g | H_0: X ~ p]`, where `g` is the
/// G-statistic computed from the observed distribution.
fn multinomial_test(n: usize, p: ArrayView1<f64>, p_hat: ArrayView1<f64>) -> f64 {
    // degrees of freedom = (number of nonzero entries in p) - 1
    let nnz = p.fold(0.0, |nnz, &x| nnz + if x > 0.0 { 1.0 } else { 0.0 });
    let dof = nnz - 1.0;

    debug_assert_le!(nnz, p.dim() as f64);
    debug_assert_ge!(dof, 1.0);

    // p_hat puts mass where p has none => impossible to draw p_hat from p
    if !is_pmf_subset(p_hat, p) {
        return 0.0;
    }

    let g = g_test(n, p, p_hat);
    let pvalue = 1.0 - chisq_cdf(dof, g);

    println!(
        "multinomial_test: n: {n}, |p|: {}, dof: {dof}, g: {g}, p-value: {pvalue}",
        p.dim()
    );

    pvalue
}
```
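The `is_pmf_subset` helper used above isn't shown; a minimal sketch (written with plain slices rather than `ArrayView1`, for brevity) just checks that `p_hat` only puts mass where `p` does:

```rust
/// True iff `sub`'s support is contained in `sup`'s support, i.e. `sub`
/// never puts probability mass on a bin where `sup` has none.
fn is_pmf_subset(sub: &[f64], sup: &[f64]) -> bool {
    sub.iter().zip(sup).all(|(&s, &t)| s == 0.0 || t > 0.0)
}
```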