winsorize

2022-03-15 · 3 min read

    Source: https://www.wikiwand.com/en/Winsorizing

    Cheap way to remove outliers from samples.

    • TL;DR: winsorize(X, q in [0,0.5]) replaces all elements in X below quantile q and above quantile 1-q with the quantile q and 1-q respectively.
    • Just clamps outliers above and below a quantile.
    • Effectively removes outliers without changing the number of samples (length of X).
    • Used in cargo's default bench harness when taking timing samples.

    Example Implementation #

    Source: https://github.com/phlip9/idris2-aoc21/blob/master/src/Bench.idr

    import Data.List
    import Data.List1
    import Data.String
    import Data.Vec
    import Scratch
    import System.Clock
    import System.Random
    
    %default total
    
    ||| Clamp `x` to the range `[min, max]`. Assumes `min <= max`.
    clamp : Ord a => (min : a) -> (max : a) -> (x : a) -> a
    clamp min max x = if x > max then max
                      else if x < min then min
                      else x
    
    ||| Linearly interpolate between `x0` and `x1` according to the scale parameter `t`.
    ||| For example, `lerp` returns `x0` when `t = 0` and `x1` when `t = 1`.
    lerp : Num a => Neg a => (x0 : a) -> (x1 : a) -> (t : a) -> a
    lerp x0 x1 t = (1 - t) * x0 + (t * x1)
    
    ||| Linearly interpolate between `x0` and `x1` then clamp in the range `[x0, x1]`.
    ||| Useful for handling scale parameters `t ∉ [0, 1]`
    lerpClamp : Ord a => Num a => Neg a => (x0 : a) -> (x1 : a) -> (t : a) -> a
    lerpClamp x0 x1 t = clamp x0 x1 (lerp x0 x1 t)
    
    ||| Truncate a float number returning the whole part. Truncate works
    ||| correctly on negative numbers, unlike `floor`.
    ||| For example, `trunc 3.58 == 3.0` and `trunc (-4.9) == (-4.0)`
    trunc : F64 -> F64
    trunc x = cast {to=F64} $ cast {to=Integer} x
    
    ||| Return the fractional part of a float number.
    ||| For example, `frac 3.58 == 0.58` and `fract (-4.9) == (-0.9)`
    fract : F64 -> F64
    fract x = x - trunc x
    
    data Interval a
      = OpenAbove a -- [a ..)
      | OpenBelow a -- (.. a]
      | Closed a a -- [a .. b]
    
    ||| Return the fractional index of a list as an `Interval` of the adjacent integer
    ||| number entries. A negative index will result in an `OpenBelow` interval while
    ||| an index greater than the largest integer index will result in an `OpenAbove`
    ||| interval.
    fractIndex : (xs : List1 a) -> (idx : F64) -> Interval a
    fractIndex (x ::: xs) idx = if idx < 0.0 then OpenBelow x
                                else go x xs (cast idx)
    where
      go : a -> List a -> Nat -> Interval a
      go x [] idx = OpenAbove x
      go x (y :: _) 0 = Closed x y
      go _ (y :: ys) (S idx) = go y ys idx
    
    ||| Return the empirical quantile, interpolating between the two adjacent integer
    ||| number entries.
    ||| Assumes `xs` is sorted. Handles `q ∉ [0, 1]` by returning the min or max
    ||| respectively.
    empiricalQuantile : (xs : List1 F64) -> (q : F64) -> F64
    empiricalQuantile xs q
      = let n = cast {to=F64} $ length xs
            idx = (n * q) - 1
            t = fract idx in
            case fractIndex xs idx of
                 OpenAbove x => x
                 OpenBelow x => x
                 Closed x y => lerpClamp x y t
    
    ||| Replace elements below/above `q`/`1-q` quantile with the `q`/`1-q` quantile
    ||| respectively. Reduces impact of outliers without changing the number of samples.
    |||
    ||| Note: assumes xs is sorted
    ||| See: https://en.wikipedia.org/wiki/Winsorising
    |||
    ||| ### Example
    |||
    ||| ```idris2
    ||| > winsorize [1.0,2,3,4,5,6,7,8,9,10] 0.25
    ||| [2.5 2.5, 3.0, 4.0, 5.0, 6.0, 7.0, 7.5, 7.5, 7.5]
    ||| ```
    winsorize : (xs : List1 F64) -> (q : F64) -> List1 F64
    winsorize xs q
      = let q = clamp 0.0 1.0 q
            q = if q > 0.5 then 1.0 - q else q
            lo = empiricalQuantile xs q
            hi = empiricalQuantile xs (1.0 - q) in
            map (clamp lo hi) xs
    
    winsorizeUnsorted : (xs : List1 F64) -> (q : F64) -> List1 F64
    winsorizeUnsorted xs q = winsorize (sort xs) q