# G-test - multinomial goodness-of-fit

2022-03-08 · 3 min read

$\def\Chi{\mathcal{X}}$

## Overview #

- $G$ is a test statistic, asymptotically equivalent to the $\Chi^2$ test statistic.
- Used for a goodness-of-fit test between observed and hypothesized multinomial distributions.
- We use the $\Chi^2$ distribution to compute the $G$ threshold corresponding to a desired $p$-value.
- $G$ is a log-likelihood ratio, closely related to the KL-divergence and the $\Chi^2$ statistic.

## Test Statistic #

In terms of observed counts $O_i \ge 0$ and expected counts $E_i > 0$:

$G = 2 \sum_i O_i \cdot \ln\Bigl( \frac{O_i}{E_i} \Bigr)$

In terms of probabilities $\hat{p}_i$ and $p_i$ (estimated and expected, respectively) and the number of samples $n$:

\begin{align*} G &= 2 \sum_i \hat{p}_i \cdot n \cdot \ln\Bigl( \frac{\hat{p}_i \cdot n}{p_i \cdot n} \Bigr) \\ &= 2 \sum_i \hat{p}_i \cdot n \cdot \ln\Bigl( \frac{\hat{p}_i}{p_i} \Bigr) \\ &= 2 n \sum_i \hat{p}_i \cdot \ln\Bigl( \frac{\hat{p}_i}{p_i} \Bigr) \\ &= 2 \, n \; D_{\text{KL}}(\hat{p} \;||\; p) \end{align*}

where $D_{\text{KL}}(\hat{p} \;||\; p)$ is the Kullback-Leibler divergence.
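The two forms above can be checked numerically. The sketch below (with hypothetical die-roll counts) computes $G$ directly from counts via $2 \sum_i O_i \ln(O_i / E_i)$ and again via $2 n \, D_{\text{KL}}(\hat{p} \,||\, p)$, and asserts the results agree:

```rust
fn main() {
    // Hypothetical data: 120 rolls of a die, tested against a fair-die hypothesis.
    let observed = [25.0_f64, 18.0, 22.0, 20.0, 17.0, 18.0];
    let n: f64 = observed.iter().sum(); // 120
    let expected = n / 6.0; // E_i = 20 under the uniform hypothesis

    // G = 2 * sum_i O_i * ln(O_i / E_i)
    let g_counts: f64 = 2.0
        * observed
            .iter()
            .map(|&o| o * (o / expected).ln())
            .sum::<f64>();

    // G = 2 * n * D_KL(p_hat || p), with p_hat_i = O_i / n and p_i = 1/6.
    let kl: f64 = observed
        .iter()
        .map(|&o| {
            let p_hat_i = o / n;
            let p_i = 1.0 / 6.0;
            p_hat_i * (p_hat_i / p_i).ln()
        })
        .sum();
    let g_kl = 2.0 * n * kl;

    assert!((g_counts - g_kl).abs() < 1e-12);
    println!("G = {g_counts:.4}");
}
```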

## Example Implementation #

```rust
use ndarray::{ArrayView1, Zip};

const EPS: f64 = 1e-10;

/// Return true iff supp(p) ⊆ supp(q) for dense PMFs p and q.
fn is_pmf_subset(p: ArrayView1<f64>, q: ArrayView1<f64>) -> bool {
    Zip::from(p).and(q).all(|&p_i, &q_i| {
        // A = (q_i == 0.0)
        // B = (p_i == 0.0)
        // (A ==> B) <==> (¬A ∨ B)
        (q_i > 0.0) || (p_i <= 0.0)
    })
}

/// Compute the [KL-divergence](https://www.wikiwand.com/en/Kullback%E2%80%93Leibler_divergence)
/// between dense PMFs p and q.
///
/// D_{KL}(p || q) = \sum_i p_i * \ln(p_i / q_i)
///
/// Note: p's support must be a subset of q's support!
///       This is also referred to as the "absolute continuity" assumption,
///       where q_i = 0 implies p_i = 0.
fn kl_divergence(p: ArrayView1<f64>, q: ArrayView1<f64>) -> f64 {
    // The caller should check this before calling.
    debug_assert!(is_pmf_subset(p, q));

    Zip::from(p)
        .and(q)
        .fold(0.0, |sum, &p_i, &q_i| sum + kl_div_term(p_i, q_i))
}

#[inline]
fn kl_div_term(p_i: f64, q_i: f64) -> f64 {
    if p_i <= EPS {
        // lim_{p -> 0} p * ln(p / q) = 0, so zero-probability terms contribute nothing.
        0.0
    } else if q_i > EPS {
        p_i * (p_i / q_i).ln()
    } else {
        // p_i > 0 but q_i == 0: absolute continuity is violated.
        debug_assert!(false, "supp(p) is not a subset of supp(q)");
        f64::INFINITY
    }
}

/// The G-test statistic.
///
/// * Used for comparing an observed multinomial distribution with an expected
///   hypothesis multinomial distribution.
/// * Asymptotically approximates the chi^2-test statistic.
///
/// n: the number of samples
/// p: the expected PMF
/// p_hat: the observed PMF
///
/// G-test: https://www.wikiwand.com/en/G-test
fn g_test(n: usize, p: ArrayView1<f64>, p_hat: ArrayView1<f64>) -> f64 {
    (n as f64) * (2.0 * kl_divergence(p_hat, p))
}
```