# winsorize

2022-03-15 · 3 min read

Cheap way to remove outliers from samples.

• TL;DR: winsorize(X, q in [0,0.5]) replaces all elements in X below quantile q and above quantile 1-q with the quantile q and 1-q respectively.
• Just clamps outliers above and below a quantile.
• Effectively removes outliers without changing the number of samples (length of X).
• Used in cargo's default bench harness when taking timing samples.

## Example Implementation #

import Data.List
import Data.List1
import Data.String
import Data.Vec
import Scratch
import System.Clock
import System.Random

%default total

||| Clamp x to the range [min, max]. Assumes min <= max.
clamp : Ord a => (min : a) -> (max : a) -> (x : a) -> a
clamp min max x = if x > max then max
else if x < min then min
else x

||| Linearly interpolate between x0 and x1 according to the scale parameter t.
||| For example, lerp returns x0 when t = 0 and x1 when t = 1.
lerp : Num a => Neg a => (x0 : a) -> (x1 : a) -> (t : a) -> a
lerp x0 x1 t = (1 - t) * x0 + (t * x1)

||| Linearly interpolate between x0 and x1 then clamp in the range [x0, x1].
||| Useful for handling scale parameters t ∉ [0, 1]
lerpClamp : Ord a => Num a => Neg a => (x0 : a) -> (x1 : a) -> (t : a) -> a
lerpClamp x0 x1 t = clamp x0 x1 (lerp x0 x1 t)

||| Truncate a float number returning the whole part. Truncate works
||| correctly on negative numbers, unlike floor.
||| For example, trunc 3.58 == 3.0 and trunc (-4.9) == (-4.0)
trunc : F64 -> F64
trunc x = cast {to=F64} $cast {to=Integer} x ||| Return the fractional part of a float number. ||| For example, frac 3.58 == 0.58 and fract (-4.9) == (-0.9) fract : F64 -> F64 fract x = x - trunc x data Interval a = OpenAbove a -- [a ..) | OpenBelow a -- (.. a] | Closed a a -- [a .. b] ||| Return the fractional index of a list as an Interval of the adjacent integer ||| number entries. A negative index will result in an OpenBelow interval while ||| an index greater than the largest integer index will result in an OpenAbove ||| interval. fractIndex : (xs : List1 a) -> (idx : F64) -> Interval a fractIndex (x ::: xs) idx = if idx < 0.0 then OpenBelow x else go x xs (cast idx) where go : a -> List a -> Nat -> Interval a go x [] idx = OpenAbove x go x (y :: _) 0 = Closed x y go _ (y :: ys) (S idx) = go y ys idx ||| Return the empirical quantile, interpolating between the two adjacent integer ||| number entries. ||| Assumes xs is sorted. Handles q ∉ [0, 1] by returning the min or max ||| respectively. empiricalQuantile : (xs : List1 F64) -> (q : F64) -> F64 empiricalQuantile xs q = let n = cast {to=F64}$ length xs
idx = (n * q) - 1
t = fract idx in
case fractIndex xs idx of
OpenAbove x => x
OpenBelow x => x
Closed x y => lerpClamp x y t

||| Replace elements below/above q/1-q quantile with the q/1-q quantile
||| respectively. Reduces impact of outliers without changing the number of samples.
|||
||| Note: assumes xs is sorted
||| See: https://en.wikipedia.org/wiki/Winsorising
|||
||| ### Example
|||
||| idris2
||| > winsorize [1.0,2,3,4,5,6,7,8,9,10] 0.25
||| [2.5 2.5, 3.0, 4.0, 5.0, 6.0, 7.0, 7.5, 7.5, 7.5]
||| 
winsorize : (xs : List1 F64) -> (q : F64) -> List1 F64
winsorize xs q
= let q = clamp 0.0 1.0 q
q = if q > 0.5 then 1.0 - q else q
lo = empiricalQuantile xs q
hi = empiricalQuantile xs (1.0 - q) in
map (clamp lo hi) xs

winsorizeUnsorted : (xs : List1 F64) -> (q : F64) -> List1 F64
winsorizeUnsorted xs q = winsorize (sort xs) q