multinomial distribution
2022-03-08 ยท 3 min read
\[ \renewcommand{\Pr}[1]{\text{Pr}\left[ #1 \right]} \def\eps{\varepsilon} \def\E{\mathbb{E}} \]Example #
You have a k-sided, weighted die. The weight each face $i = 1, \ldots, k$ is $p_i$. You roll the die $n$ times. The distribution of faces rolled will follow a Multinomial Distribution.
Overview #
Property | Value |
---|---|
Parameters | $n > 0$ number of trials, $p_1, \ldots, p_k$ event probabilities, where $\sum_{i} p_i = n$ |
Support | $i \in \set{1, \ldots, k}, ; x_i \in \set{0, \ldots, k}, \text{ with } \sum_i x_i = n$ |
PMF | $\Pr{X = \set{x_1, \ldots, x_k}} = \frac{n!}{x_1! \cdots x_n!} p_1^{x_1} \cdots p_k^{x_k}$ |
Mean | $\E[X_i] = n \cdot p_i$ |
Variance | $Var(X_i) = n p_i (1 - p_i), ; Cov(X_i, X_j) = -n p_i p_j ; (i \ne j)$ |
Statistical Inference #
Equivalence Tests #
Ostrovski 2017 - Testing equivalence of multinomial distributions
Resin 2021 - A Simple Algorithm for Exact Multinomial Tests
Goodness-of-fit Tests #
Ideally, we could use an exact test, where we check if the observed counts $X$ lie in some failure region (according to some criterion) with some confidence probability $p_{significance}$, for example. The criterion will claim the $X$ is not the same as our hypothesized distribution if $X$ lies in the failure region. We choose the size of our failure region based on our initial choice of confidence (greater required confidence implies greater sized failure region).
\[ p_{significance} = \sum_{X \;\in\; \text{region}} \text{multinomial-pmf}(X) \]Unfortunately, the multinomial CDF and PMF have an absurdly huge support for any non-trivial $n$ and $k$, and we need to calculate the CDF by just exhaustively computing the joint probabilities, which makes this approach computationally infeasible, since any non-trivial region is just too big. :'(
Instead, we need derived random variables, called test statistics, which are more mathematically convenient or more computationally tractable, and compare their observed values against the failure regions projected into the test-statistic's distribution.
G-test - multinomial goodness-of-fit
Wikiwand - Pearson's chi-squared test
Wikiwand - G-test / log-likelihood ratio test
Example Implementation #
Using g_test
defined in G-test - multinomial goodness-of-fit > Example Implementation.
use claim::{debug_assert_ge, debug_assert_le};
use ndarray::ArrayView1;
use statrs::distribution::{ChiSquared, ContinuousCDF};
/// A goodness-of-fit test between a hypothesized multinomial distribution, `p`,
/// and an experimentally observed distribution, `p_hat`, both represented as
/// dense PMFs. `n` is the number of samples taken to construct `p_hat`.
///
/// Returns a p-value, `Pr[G(x) >= p-value | H_0: x ~ p]`.
fn multinomial_test(n: usize, p: ArrayView1<f64>, p_hat: ArrayView1<f64>) -> f64 {
// want to compute the DOF (nnz of p)
let nnz = p.fold(0.0, |nnz, &x| nnz + if x > 0.0 { 1.0 } else { 0.0 });
let dof = nnz - 1.0;
debug_assert_le!(nnz, p.dim() as f64);
debug_assert_ge!(dof, 1.0);
// impossible to draw p_hat from p
if !is_pmf_subset(p_hat, p) {
return 0.0;
}
let g = g_test(n, p, p_hat);
let pvalue = 1.0 - chisq_cdf(dof, g);
println!(
"multinomial_test: n: {n}, |p|: {}, dof: {dof}, g: {g}, p-value: {pvalue}",
p.dim()
);
pvalue
}