Bagplots, boxplots and outlier detection for functional data

Han Lin Shang & Rob J Hyndman
Business & Economic Forecasting Unit

Outline
1 Introduction
2 Functional bagplot and HDR boxplot
3 Outlier detection
4 Conclusions

Introduction

French male mortality rates

[Figure: France: male death rates (1899−2003). Log death rate (−8 to 0) against age (0 to 100), one curve per year, with the war years highlighted.]

Aims
1 "Boxplots" for functional data
2 Tools for detecting outliers in functional data

Robust principal components

Let $\{y_i(x)\}$, $i = 1, \dots, n$, be a set of curves.

1 Apply a robust principal component algorithm:
\[ y_i(x) = \mu(x) + \sum_{k=1}^{n-1} z_{i,k}\,\phi_k(x) \]
  - $\mu(x)$ is the mean curve
  - $\{\phi_k(x)\}$ are the principal components
  - $\{z_{i,k}\}$ are the PC scores
2 Plot $z_{i,2}$ vs $z_{i,1}$.
  - Each point in the scatterplot represents one curve.
  - Outliers show up in the bivariate score space.
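The decomposition into a mean curve, principal components and scores can be sketched in a few lines. This is only an illustration: it uses ordinary (non-robust) PCA via the SVD, whereas the slides call for a robust algorithm, and the function name `pc_scores` and the toy curves are assumptions made for the example.

```python
import numpy as np

def pc_scores(curves, n_components=2):
    """Mean curve, principal components and PC scores for curves on a common grid.

    curves has shape (n_curves, n_grid_points).
    NOTE: ordinary PCA via SVD, used here as a stand-in for a robust PCA.
    """
    curves = np.asarray(curves, dtype=float)
    mean_curve = curves.mean(axis=0)               # mu(x)
    centered = curves - mean_curve
    _, _, vt = np.linalg.svd(centered, full_matrices=False)
    components = vt[:n_components]                 # phi_k(x), one per row
    scores = centered @ components.T               # z_{i,k}
    return mean_curve, components, scores

# Toy curves (hypothetical data): y_i(x) = a_i sin(x) + b_i cos(x)
rng = np.random.default_rng(0)
x = np.linspace(0.0, 2.0 * np.pi, 50)
a = rng.uniform(0.0, 0.1, size=30)
b = rng.uniform(0.0, 0.1, size=30)
toy_curves = a[:, None] * np.sin(x) + b[:, None] * np.cos(x)

mean_curve, components, scores = pc_scores(toy_curves)
# Each row of `scores` is one curve; plotting scores[:, 1] against
# scores[:, 0] gives the bivariate score scatterplot.
```

Each curve is reduced to a single point in the plane, which is what allows bivariate tools such as the bagplot and HDR boxplot to be applied to functional data.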
[Figure: Scatterplot of first two PC scores (PC score 2 against PC score 1). The labelled points are the years 1914–1919 and 1940–1945.]

Functional bagplot and HDR boxplot

Functional bagplot
- Bivariate bagplot due to Rousseeuw et al. (1999).
- Rank points by halfspace location depth.
- Display median, 50% convex hull and outer convex hull (with 99% coverage if bivariate normal).
- Boundaries contain all curves inside bags.
- 95% CI for median curve also shown.
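Ranking points by halfspace location depth can be illustrated with a small approximation. The sketch below is not the algorithm of Rousseeuw et al. (1999); it simply scans a finite set of directions, so it returns an upper bound on the exact depth. The function name `halfspace_depth` and the five-point example are assumptions made for illustration.

```python
import numpy as np

def halfspace_depth(points, n_directions=360):
    """Approximate halfspace (Tukey) depth of each 2-D point within the sample.

    For every direction, count the points on each side of the line through
    the point of interest (ties included) and keep the smallest count seen.
    Scanning finitely many directions gives an upper bound on the true depth.
    """
    points = np.asarray(points, dtype=float)
    n = len(points)
    angles = np.linspace(0.0, np.pi, n_directions, endpoint=False)
    directions = np.column_stack([np.cos(angles), np.sin(angles)])
    projections = points @ directions.T            # shape (n, n_directions)
    depth = np.full(n, n)
    for j in range(n_directions):
        p = projections[:, j]
        sorted_p = np.sort(p)
        n_le = np.searchsorted(sorted_p, p, side="right")      # points <= p_i
        n_ge = n - np.searchsorted(sorted_p, p, side="left")   # points >= p_i
        depth = np.minimum(depth, np.minimum(n_le, n_ge))
    return depth

# Four corners of a square plus its centre: the centre is the deepest point.
square = np.array([[-1, -1], [1, -1], [1, 1], [-1, 1], [0, 0]])
depth = halfspace_depth(square)
deepest = int(np.argmax(depth))
```

The deepest point plays the role of the median; the bag is then built around the half of the points with greatest depth.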
[Figure: Bivariate bagplot of the first two PC scores (left) and functional bagplot of log death rate against age (right). Years labelled as outliers in the score plot: 1914–1919, 1940, 1942–1944.]

Functional HDR boxplot
- Bivariate HDR boxplot due to Hyndman (1996).
- Rank points by value of kernel density estimate.
- Display mode, 50% and (usually) 99% highest density regions (HDRs).
- Boundaries contain all curves inside HDRs.

[Figure: Bivariate HDR boxplot of the first two PC scores with a 91% outer region (left) and functional HDR boxplot of log death rate against age (right). Years labelled as outliers in the score plot: 1914–1919, 1940, 1942–1944.]

Outlier detection

Outlier detection: existing methods

Likelihood ratio method
- Febrero et al. (2007) find the curve that maximizes the LRT statistic.
- If LRT > C, then the curve is considered an outlier.
- C is computed via smoothed bootstrap.
- Process continues until no more outliers are found.

Disadvantages
- Computationally intensive.
- Ignores shape outliers.
- If trimmed mean is used and there is no outlier, C will be downward biased.
Integrated squared error method

Hyndman & Ullah (2007) proposed the use of
\[ v_i = \int_x \Big[ \hat{y}_i(x) - \mu(x) - \sum_{k=1}^{K} z_{i,k}\,\phi_k(x) \Big]^2 \, dx \]
where $z_{i,k}$ are (robust) PC scores and $\phi_k(x)$ are PCs.

A curve is an outlier if $v_i > s + \lambda\sqrt{s}$, where $s = \mathrm{median}(v_1, \dots, v_n)$ and $\lambda$ is a tuning parameter.

Disadvantages
- Depends on $K$ and $\lambda$.
- If $K$ is large, outliers are modelled by the higher components.

Outlier detection: comparison

French male mortality data set

Based on historical information, the outliers are expected to be 1914–1919 & 1940–1945.

Method                     Outliers detected
Likelihood ratio           none
Integrated squared error   1914–1918, 1940, 1943–1944
Bagplot                    1914–1919, 1940, 1942–1944
91% HDR boxplot            1914–1919, 1940, 1942–1944
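Returning to the integrated squared error rule, a minimal sketch may help make the roles of $K$ and $\lambda$ concrete. It assumes ordinary (non-robust) PCA and trapezoidal integration on a common grid; the function name `ise_outliers`, the toy data and the value of `lam` are illustrative choices, not values from the slides.

```python
import numpy as np

def ise_outliers(curves, x, n_components, lam):
    """Integrated squared error v_i of each curve after a K-component PCA fit.

    Returns (v, flags), where flags[i] is True when v_i > s + lam * sqrt(s)
    and s is the median of v. Ordinary PCA is used here as a stand-in for
    the robust version.
    """
    curves = np.asarray(curves, dtype=float)
    centered = curves - curves.mean(axis=0)
    _, _, vt = np.linalg.svd(centered, full_matrices=False)
    phi = vt[:n_components]                        # first K principal components
    scores = centered @ phi.T
    resid_sq = (centered - scores @ phi) ** 2
    # Trapezoidal rule over the grid x
    v = np.sum(0.5 * (resid_sq[:, 1:] + resid_sq[:, :-1]) * np.diff(x), axis=1)
    s = np.median(v)
    return v, v > s + lam * np.sqrt(s)

# Toy data (hypothetical): 41 smooth curves, the last one with an added wiggle.
rng = np.random.default_rng(0)
x = np.linspace(0.0, 2.0 * np.pi, 101)
a = rng.uniform(-1.0, 1.0, size=41)
b = rng.uniform(-1.0, 1.0, size=41)
curves = a[:, None] * np.sin(x) + b[:, None] * np.cos(x)
curves[-1] += 0.5 * np.sin(5.0 * x)                # shape outlier

v, flagged = ise_outliers(curves, x, n_components=2, lam=3.0)
```

Raising `n_components` to 3 in this toy example would let the third component absorb the wiggle, which is the dependence on $K$ listed among the disadvantages.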
Method                     Sensitivity  Specificity  Time (s)
Likelihood ratio           0%           100%         18.8
Integrated squared error   50%          94%          3.4
Bagplot                    83%          98%          0.6
91% HDR boxplot            83%          98%          0.3

Simulation

\[ y_i(x) = a_i \sin(x) + b_i \cos(x), \quad 0 < x < 2\pi \]
- $a_i, b_i \sim \mathrm{Unif}(0, 0.1)$ with probability 99%
- $a_i, b_i \sim \mathrm{Unif}(0.1, 0.108)$ with probability 1%

[Figure: Simulated curves, y (−0.15 to 0.15) against x (0 to 2π); outliers shown in black.]

[Figure: Scatterplot of first two PC scores for the simulated curves (PC score 2 against PC score 1).]

Method                     Outliers detected
Likelihood ratio           none
Integrated squared error   none
Bagplot                    none
99% HDR boxplot            All

Method                     Sensitivity  Specificity  Time (s)
Likelihood ratio           0%           100%         28.5
Integrated squared error   0%           100%         18.8
Bagplot                    0%           100%         7.3
99% HDR boxplot            100%         100%         6.9
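The simulation design and the density-based ranking behind the HDR boxplot can be sketched together. Several choices below are assumptions not given in the slides: the number of curves, drawing $a_i$ and $b_i$ from the outlying range jointly, ordinary rather than robust PCA, a Gaussian product kernel, and a normal-reference bandwidth. The sketch is not meant to reproduce the reported results, and whether the planted outliers fall outside the 99% HDR depends on those choices.

```python
import numpy as np

rng = np.random.default_rng(1)
n_curves, alpha = 500, 0.01            # assumed sample size; 99% HDR
x = np.linspace(0.0, 2.0 * np.pi, 100)

# Simulate y_i(x) = a_i sin(x) + b_i cos(x); about 1% of curves are outliers.
is_outlier = rng.random(n_curves) < 0.01
low = np.where(is_outlier, 0.1, 0.0)
high = np.where(is_outlier, 0.108, 0.1)
a = rng.uniform(low, high)
b = rng.uniform(low, high)
curves = a[:, None] * np.sin(x) + b[:, None] * np.cos(x)

# First two PC scores (ordinary PCA as a stand-in for robust PCA).
centered = curves - curves.mean(axis=0)
_, _, vt = np.linalg.svd(centered, full_matrices=False)
scores = centered @ vt[:2].T

# Bivariate Gaussian product-kernel density estimate at each score point,
# with a normal-reference bandwidth per dimension (an assumed choice).
h = scores.std(axis=0, ddof=1) * n_curves ** (-1.0 / 6.0)
z = (scores[:, None, :] - scores[None, :, :]) / h
kernel = np.exp(-0.5 * (z ** 2).sum(axis=2)) / (2.0 * np.pi * h[0] * h[1])
density = kernel.mean(axis=1)

# Points whose density falls below the alpha-quantile of the fitted
# densities lie outside the (1 - alpha) highest density region.
threshold = np.quantile(density, alpha)
outside_hdr = density < threshold
```

Because the ranking uses density rather than depth, a low-density point can be flagged even when it sits in the interior of the point cloud, which is how the HDR approach can pick up inliers and bimodality.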
Conclusions
- Functional bagplot highly robust but sometimes misses outliers.
- Functional HDR boxplot more flexible but coverage probability needs tuning.
- Functional HDR boxplot can detect bimodality and inliers.
- Existing depth method performs poorly and ignores shape outliers.
- Existing ISE method often misses outliers.

Paper and R code: www.robhyndman.info
Comments to: Han.Shang@buseco.monash.edu