multinomial distribution

$\renewcommand{\Pr}[1]{\text{Pr}\left[ #1 \right]} \def\eps{\varepsilon} \def\E{\mathbb{E}}$

Example #

You have a k-sided, weighted die. The weight of each face $i = 1, \ldots, k$ is $p_i$, where $\sum_i p_i = 1$. You roll the die $n$ times. The joint distribution of the face counts follows a multinomial distribution.
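A quick sketch of this setup in Rust, tallying simulated rolls of a weighted 3-sided die. The tiny hand-rolled LCG is only there to keep the example dependency-free; in practice use a real RNG such as the `rand` crate.

```rust
// Simulate n rolls of a weighted die and tally the face counts. The resulting
// counts vector is one draw from Multinomial(n; p_1, ..., p_k).
fn roll_counts(n: usize, p: &[f64], seed: &mut u64) -> Vec<u64> {
    let mut counts = vec![0u64; p.len()];
    for _ in 0..n {
        // LCG step (Knuth's MMIX constants), high bits mapped to [0, 1);
        // for illustration only -- not a quality RNG.
        *seed = seed
            .wrapping_mul(6364136223846793005)
            .wrapping_add(1442695040888963407);
        let u = (*seed >> 11) as f64 / (1u64 << 53) as f64;
        // Inverse-CDF sampling: pick the face whose cumulative weight covers u.
        let mut acc = 0.0;
        for (i, &pi) in p.iter().enumerate() {
            acc += pi;
            if u < acc || i == p.len() - 1 {
                counts[i] += 1;
                break;
            }
        }
    }
    counts
}

fn main() {
    let p = [0.5, 0.3, 0.2];
    let mut seed = 42u64;
    let counts = roll_counts(10_000, &p, &mut seed);
    println!("counts: {counts:?}");
    // The face counts always sum to the number of rolls.
    assert_eq!(counts.iter().sum::<u64>(), 10_000);
}
```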

Overview #

| Property | Value |
| --- | --- |
| Parameters | $n > 0$ number of trials; $p_1, \ldots, p_k$ event probabilities, where $\sum_{i} p_i = 1$ |
| Support | $x_i \in \set{0, \ldots, n}$ for $i \in \set{1, \ldots, k}$, with $\sum_i x_i = n$ |
| PMF | $\Pr{X = (x_1, \ldots, x_k)} = \frac{n!}{x_1! \cdots x_k!} \, p_1^{x_1} \cdots p_k^{x_k}$ |
| Mean | $\E[X_i] = n \cdot p_i$ |
| Variance | $Var(X_i) = n p_i (1 - p_i),\; Cov(X_i, X_j) = -n p_i p_j \; (i \ne j)$ |
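As a sanity check on these formulas, a small Rust sketch (helper names are mine) that evaluates the PMF in log-space to avoid overflowing factorials, then checks the mean and variance formulas for one face:

```rust
// ln(m!) as a running sum of logs; stable for moderate m, unlike m! itself.
fn ln_factorial(m: u64) -> f64 {
    (2..=m).map(|i| (i as f64).ln()).sum()
}

// pmf(x) = n! / (x_1! ... x_k!) * p_1^{x_1} ... p_k^{x_k}, computed in log-space.
fn multinomial_pmf(x: &[u64], p: &[f64]) -> f64 {
    let n: u64 = x.iter().sum();
    let mut ln_pmf = ln_factorial(n);
    for (&xi, &pi) in x.iter().zip(p) {
        ln_pmf -= ln_factorial(xi);
        if xi > 0 {
            ln_pmf += xi as f64 * pi.ln(); // guard: 0 * ln(0) would be NaN
        }
    }
    ln_pmf.exp()
}

fn main() {
    let p = [0.5, 0.3, 0.2];
    // n = 4 rolls, observed counts (2, 1, 1):
    // pmf = 4!/(2! 1! 1!) * 0.5^2 * 0.3 * 0.2 = 12 * 0.015 = 0.18
    let pmf = multinomial_pmf(&[2, 1, 1], &p);
    assert!((pmf - 0.18).abs() < 1e-12);

    // E[X_1] = n p_1 = 2, Var(X_1) = n p_1 (1 - p_1) = 1
    let (n, p1) = (4.0, p[0]);
    assert!((n * p1 - 2.0).abs() < 1e-12);
    assert!((n * p1 * (1.0 - p1) - 1.0).abs() < 1e-12);
}
```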

Statistical Inference #

Equivalence Tests #

Ostrovski 2017 - Testing equivalence of multinomial distributions

Wikiwand - Multinomial Test

Resin 2021 - A Simple Algorithm for Exact Multinomial Tests

Goodness-of-fit Tests #

Ideally, we would use an exact test: check whether the observed counts $X$ fall inside some failure region (chosen according to some criterion) at a chosen significance level $p_{significance}$. The criterion rejects the claim that $X$ was drawn from our hypothesized distribution whenever $X$ lies in the failure region, and the region is sized so that its total probability under the hypothesized distribution equals our chosen significance level:

$p_{significance} = \sum_{X \;\in\; \text{region}} \text{multinomial-pmf}(X)$

Unfortunately, the multinomial PMF's support is combinatorially huge for any non-trivial $n$ and $k$ (it contains $\binom{n+k-1}{k-1}$ outcomes), and computing the CDF requires exhaustively summing joint probabilities over it, so any non-trivial failure region is just too big to enumerate. :'(
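For tiny $n$ and $k$ the exhaustive approach is still workable, which makes the blow-up concrete. A sketch (helper names and the particular rejection criterion are mine) that enumerates the full support and sums the PMF over one choice of failure region, namely all outcomes no more probable than the observed one:

```rust
// ln(m!) as a running sum of logs.
fn ln_factorial(m: u64) -> f64 {
    (2..=m).map(|i| (i as f64).ln()).sum()
}

// Multinomial PMF evaluated in log-space.
fn pmf(x: &[u64], p: &[f64]) -> f64 {
    let n: u64 = x.iter().sum();
    let mut ln_pmf = ln_factorial(n);
    for (&xi, &pi) in x.iter().zip(p) {
        ln_pmf -= ln_factorial(xi);
        if xi > 0 {
            ln_pmf += xi as f64 * pi.ln();
        }
    }
    ln_pmf.exp()
}

/// Recursively enumerate every count vector x of length k with sum(x) = n.
/// The support has C(n + k - 1, k - 1) points -- this explodes quickly.
fn enumerate(n: u64, k: usize, prefix: &mut Vec<u64>, out: &mut Vec<Vec<u64>>) {
    if k == 1 {
        prefix.push(n);
        out.push(prefix.clone());
        prefix.pop();
        return;
    }
    for xi in 0..=n {
        prefix.push(xi);
        enumerate(n - xi, k - 1, prefix, out);
        prefix.pop();
    }
}

fn main() {
    let p = [0.5, 0.3, 0.2];
    let n = 6u64;
    let mut support = Vec::new();
    enumerate(n, p.len(), &mut Vec::new(), &mut support);
    // C(6 + 3 - 1, 3 - 1) = C(8, 2) = 28 outcomes
    assert_eq!(support.len(), 28);

    // Exact p-value: total mass of every outcome whose probability does not
    // exceed the observed outcome's (one possible exact-test criterion).
    let p_obs = pmf(&[1, 2, 3], &p);
    let p_value: f64 = support
        .iter()
        .filter(|x| pmf(x, &p) <= p_obs + 1e-12)
        .map(|x| pmf(x, &p))
        .sum();
    println!("exact p-value: {p_value:.4}");
    assert!(p_value > 0.0 && p_value <= 1.0 + 1e-12);
}
```

Even here, moving to $n = 100$, $k = 10$ gives $\binom{109}{9} \approx 10^{12}$ support points, which is why the test statistics below exist.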

Instead, we use derived random variables, called test statistics, which are more mathematically convenient or more computationally tractable, and compare their observed values against failure regions projected into the test statistic's distribution.

G-test - multinomial goodness-of-fit

Wikiwand - Pearson's chi-squared test

Wikiwand - G-test / log-likelihood ratio test

Example Implementation #

Using g_test defined in G-test - multinomial goodness-of-fit > Example Implementation.

```rust
use claim::{debug_assert_ge, debug_assert_le};
use ndarray::ArrayView1;
use statrs::distribution::{ChiSquared, ContinuousCDF};

/// Returns true if every outcome with nonzero probability in `sub` also has
/// nonzero probability in `sup`, i.e., support(sub) ⊆ support(sup).
fn is_pmf_subset(sub: ArrayView1<f64>, sup: ArrayView1<f64>) -> bool {
    sub.iter().zip(sup.iter()).all(|(&a, &b)| a == 0.0 || b > 0.0)
}

/// A goodness-of-fit test between a hypothesized multinomial distribution, p,
/// and an experimentally observed distribution, p_hat, both represented as
/// dense PMFs. n is the number of samples taken to construct p_hat.
///
/// Returns a p-value, Pr[G(X) >= g_observed | H_0: X ~ p].
fn multinomial_test(n: usize, p: ArrayView1<f64>, p_hat: ArrayView1<f64>) -> f64 {
    // degrees of freedom = (number of nonzero entries in p) - 1
    let nnz = p.fold(0.0, |nnz, &x| nnz + if x > 0.0 { 1.0 } else { 0.0 });
    let dof = nnz - 1.0;

    debug_assert_le!(nnz, p.dim() as f64);
    debug_assert_ge!(dof, 1.0);

    // impossible to draw p_hat from p
    if !is_pmf_subset(p_hat, p) {
        return 0.0;
    }

    // under H_0, G is asymptotically chi-squared distributed with dof degrees
    // of freedom, so the upper tail beyond the observed G is the p-value
    let g = g_test(n, p, p_hat);
    let pvalue = 1.0 - ChiSquared::new(dof).unwrap().cdf(g);

    println!(
        "multinomial_test: n: {n}, |p|: {}, dof: {dof}, g: {g}, p-value: {pvalue}",
        p.dim()
    );

    pvalue
}
```