Bagplots, boxplots and outlier detection for functional data

Han Lin Shang & Rob J Hyndman
Business & Economic Forecasting Unit
Outline
1. Introduction
2. Functional bagplot and HDR boxplot
3. Outlier detection
4. Conclusions
Introduction
French male mortality rates
[Figure: France: male death rates (1899−2003). Log death rate plotted against age (0 to 100), with the war years highlighted.]
Aims
1. "Boxplots" for functional data
2. Tools for detecting outliers in functional data
Robust principal components
Let $\{y_i(x)\}$, $i = 1, \dots, n$, be a set of curves.
1. Apply a robust principal component algorithm:
   $$y_i(x) = \mu(x) + \sum_{k=1}^{n-1} z_{i,k}\,\phi_k(x)$$
   $\mu(x)$ is the mean curve
   $\{\phi_k(x)\}$ are the principal components
   $\{z_{i,k}\}$ are the PC scores
2. Plot $z_{i,2}$ vs $z_{i,1}$
   → Each point in the scatterplot represents one curve.
   → Outliers show up in bivariate score space.
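The decomposition above can be sketched in a few lines. This is only an illustration under stated assumptions: it uses classical (non-robust) principal components found by power iteration, not the robust algorithm the slides refer to, and the helper names `dot`, `leading_direction` and `pc_decompose` as well as the toy curves are hypothetical.

```python
import math
import random


def dot(u, v):
    """Inner product of two equal-length sequences."""
    return sum(a * b for a, b in zip(u, v))


def leading_direction(rows, iters=200, seed=0):
    """Power iteration for the leading eigenvector of rows^T rows."""
    rng = random.Random(seed)
    p = len(rows[0])
    v = [rng.gauss(0.0, 1.0) for _ in range(p)]
    start_norm = math.sqrt(dot(v, v))
    v = [x / start_norm for x in v]
    for _ in range(iters):
        proj = [dot(r, v) for r in rows]  # X v
        w = [sum(s * r[j] for s, r in zip(proj, rows)) for j in range(p)]  # X^T (X v)
        norm = math.sqrt(dot(w, w))
        if norm == 0.0:
            break  # no variation left to explain
        v = [x / norm for x in w]
    return v


def pc_decompose(curves, n_components=2):
    """Return (mu, phis, scores): curves[i] ~ mu + sum_k scores[i][k] * phis[k].

    Assumption: classical PCA on discretised curves, NOT a robust algorithm.
    """
    n, p = len(curves), len(curves[0])
    mu = [sum(c[j] for c in curves) / n for j in range(p)]  # pointwise mean curve
    resid = [[c[j] - mu[j] for j in range(p)] for c in curves]
    phis, scores = [], [[] for _ in range(n)]
    for k in range(n_components):
        phi = leading_direction(resid, seed=k)
        z = [dot(r, phi) for r in resid]
        # Deflate: remove the component just extracted before finding the next.
        resid = [[rj - zi * fj for rj, fj in zip(r, phi)] for r, zi in zip(resid, z)]
        phis.append(phi)
        for i in range(n):
            scores[i].append(z[i])
    return mu, phis, scores


# Toy curves on a grid (hypothetical data, not the mortality curves).
grid = [2 * math.pi * j / 40 for j in range(40)]
rng = random.Random(1)
curves = []
for _ in range(15):
    a, b = rng.uniform(0, 1), rng.uniform(0, 0.2)
    curves.append([a * math.sin(x) + b * math.cos(x) for x in grid])

mu, phis, scores = pc_decompose(curves)
# Each pair (scores[i][0], scores[i][1]) is one point in the score scatterplot.
```

Each curve contributes one pair of scores, which is what the scatterplot of PC score 2 against PC score 1 displays.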
Scatterplot of first two PC scores
[Figure: Scatterplot of the first two PC scores (PC score 2 against PC score 1). Labelled points: 1914–1919 and 1940–1945.]
Functional bagplot and HDR boxplot
Functional bagplot
Bivariate bagplot due to Rousseeuw et al. (1999).
Rank points by halfspace location depth.
Display median, 50% convex hull and outer convex hull (with 99% coverage if bivariate normal).
Boundaries contain all curves inside bags.
95% CI for median curve also shown.
[Figure: Bivariate bagplot of the first two PC scores (PC score 2 against PC score 1). Labelled points: 1914–1919, 1940 and 1942–1944.]
[Figure: Functional bagplot of the French male mortality curves (log death rate against age, 0 to 100).]
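The depth ranking behind the bagplot can be sketched as follows. This is a rough approximation under stated assumptions: the halfspace depth is evaluated over a finite grid of directions rather than computed exactly, and the function names `halfspace_depth` and `rank_by_depth`, the `n_dirs` parameter and the five toy points are all hypothetical.

```python
import math


def halfspace_depth(point, points, n_dirs=360):
    """Approximate halfspace (Tukey) depth of `point` relative to `points`.

    For each direction, count the points in the closed halfplane whose
    boundary passes through `point`; the depth is the smallest such fraction.
    A finite direction grid can only overestimate the true depth.
    """
    n = len(points)
    smallest = n
    for k in range(n_dirs):
        theta = 2.0 * math.pi * k / n_dirs
        ux, uy = math.cos(theta), math.sin(theta)
        count = sum(
            1
            for (px, py) in points
            if (px - point[0]) * ux + (py - point[1]) * uy >= -1e-12
        )
        smallest = min(smallest, count)
    return smallest / n


def rank_by_depth(points, n_dirs=360):
    """Indices sorted from deepest (most central) to shallowest, plus depths."""
    depths = [halfspace_depth(p, points, n_dirs) for p in points]
    order = sorted(range(len(points)), key=lambda i: -depths[i])
    return order, depths


# Toy score pairs (hypothetical): one central point and four corners.
toy_scores = [(0.0, 0.0), (1.0, 1.0), (1.0, -1.0), (-1.0, 1.0), (-1.0, -1.0)]
order, depths = rank_by_depth(toy_scores)
median_index = order[0]  # the deepest point plays the role of the median
```

The deepest half of the points would then form the inner region, and the curves corresponding to the shallowest points are the candidates for outliers.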
Functional HDR boxplot
Bivariate HDR boxplot due to Hyndman (1996).
Rank points by value of kernel density estimate.
Display mode, 50% and (usually) 99% highest density regions (HDRs).
Boundaries contain all curves inside HDRs.
[Figure: Bivariate HDR boxplot of the first two PC scores (PC score 2 against PC score 1). Labelled points: 1914 and 1919.]
[Figure: The same plot with a 91% outer region. Labelled points: 1914–1919, 1940 and 1942–1944.]
[Figure: Functional HDR boxplot of the French male mortality curves (log death rate against age, 0 to 100).]
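A minimal sketch of the density ranking, assuming a bivariate Gaussian kernel with a single fixed bandwidth `h`. The slides do not say which kernel or bandwidth selector is used, so `kde_at`, `hdr_outliers`, `h` and the toy points are all hypothetical. Points with the lowest estimated density fall outside the outer region, and the coverage level decides how many are flagged.

```python
import math


def kde_at(point, data, h):
    """Bivariate Gaussian kernel density estimate at `point` (bandwidth h)."""
    total = 0.0
    for (x, y) in data:
        dx, dy = (point[0] - x) / h, (point[1] - y) / h
        total += math.exp(-0.5 * (dx * dx + dy * dy))
    return total / (len(data) * 2.0 * math.pi * h * h)


def hdr_outliers(data, h, coverage=0.99):
    """Return (mode_index, outlier_indices) for the given outer coverage.

    Points are ranked by estimated density. The round((1 - coverage) * n)
    lowest-density points are treated as lying outside the outer HDR
    (a simplification of a quantile-based cutoff).
    """
    dens = [kde_at(p, data, h) for p in data]
    order = sorted(range(len(data)), key=lambda i: dens[i])  # lowest density first
    n_out = int(round((1.0 - coverage) * len(data)))
    mode_index = order[-1]  # observation with the highest estimated density
    return mode_index, sorted(order[:n_out])


# Toy score pairs (hypothetical): a 3x3 cluster plus one distant point.
toy_points = [(float(x), float(y)) for x in (-1, 0, 1) for y in (-1, 0, 1)]
toy_points.append((10.0, 10.0))

mode_99, out_99 = hdr_outliers(toy_points, h=1.0, coverage=0.99)
mode_90, out_90 = hdr_outliers(toy_points, h=1.0, coverage=0.90)
```

With only ten points, a 99% region flags nothing while a 90% region flags the distant point, which illustrates why the coverage level may need tuning.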
Outlier detection
Outlier detection: existing methods
Likelihood ratio method
Febrero et al. (2007) find the curve that maximizes the LRT statistic.
If LRT > C, then the curve is considered an outlier.
C is computed via smoothed bootstrap.
Process continues until no more outliers.
Disadvantages
Computationally intensive.
Ignores shape outliers.
If trimmed mean is used and there is no outlier, C will be downward biased.
Integrated squared error method
Hyndman & Ullah (2007) proposed the use of
$$v_i = \int_x \Big( \hat{y}_i(x) - \mu(x) - \sum_{k=1}^{K} z_{i,k}\,\phi_k(x) \Big)^2 dx$$
where $z_{i,k}$ are (robust) PC scores and $\phi_k(x)$ are PCs.
Curve is outlier if $v_i > s + \lambda\sqrt{s}$, where $s = \mathrm{median}(v_1, \dots, v_n)$ and $\lambda$ is a tuning parameter.
Disadvantages
Depends on K and λ.
If K is large, outliers are modelled by higher components.
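The integrated squared error rule can be sketched as below, assuming curves observed on a common grid and the integral approximated by the trapezoidal rule. The function names and the value of `lam` used in the example are hypothetical, and the slides do not state a default for λ.

```python
import math
from statistics import median


def ise_statistic(y_hat, mu, z, phis, grid):
    """Trapezoidal approximation of v_i for one curve.

    y_hat, mu : curve and mean curve evaluated on `grid`
    z, phis   : the K scores of this curve and the K components on `grid`
    """
    resid = [
        y_hat[j] - mu[j] - sum(zk * phi[j] for zk, phi in zip(z, phis))
        for j in range(len(grid))
    ]
    total = 0.0
    for j in range(1, len(grid)):
        step = grid[j] - grid[j - 1]
        total += 0.5 * step * (resid[j - 1] ** 2 + resid[j] ** 2)
    return total


def ise_outliers(v, lam):
    """Indices i with v_i > s + lam * sqrt(s), where s = median(v)."""
    s = median(v)
    cutoff = s + lam * math.sqrt(s)
    return [i for i, vi in enumerate(v) if vi > cutoff]


# Toy values (hypothetical): four well-fitted curves and one poorly fitted one.
v = [1.0, 1.0, 1.0, 1.0, 9.0]
flagged = ise_outliers(v, lam=3.0)  # cutoff is 1 + 3 * sqrt(1) = 4
```

Raising K shrinks every residual, including those of unusual curves, which is one way to see why the result depends on both K and λ.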
Outlier detection: comparison
French male mortality data set
Based on historical information, the outliers are expected to be 1914–1919 & 1940–1945.

Method                    Outliers detected
Likelihood ratio          None
Integrated squared error  1914–1918, 1940, 1943–1944
Bagplot                   1914–1919, 1940, 1942–1944
91% HDR boxplot           1914–1919, 1940, 1942–1944

Method                    Sensitivity  Specificity  Time (s)
Likelihood ratio          0%           100%         18.8
Integrated squared error  50%          94%          3.4
Bagplot                   83%          98%          0.6
91% HDR boxplot           83%          98%          0.3
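As a small worked check, the sketch below computes sensitivity for the bagplot row, assuming sensitivity means the share of expected outlier years that a method flags. The slides do not spell out how the table was computed, so this definition and the helper names are assumptions, and the same calculation need not reproduce every row.

```python
def year_range(first, last):
    """Set of consecutive years from first to last, inclusive."""
    return set(range(first, last + 1))


def sensitivity(detected, expected):
    """Share of expected outliers that were detected (assumed definition)."""
    return len(detected & expected) / len(expected)


# Expected outliers and the bagplot detections, as listed above.
expected_years = year_range(1914, 1919) | year_range(1940, 1945)
bagplot_years = year_range(1914, 1919) | {1940} | year_range(1942, 1944)

bagplot_sensitivity = sensitivity(bagplot_years, expected_years)  # 10 of 12 years
```

Ten of the twelve expected years are flagged, which rounds to the 83% reported for the bagplot.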
Simulation
$y_i(x) = a_i \sin(x) + b_i \cos(x)$, $0 < x < 2\pi$
$a_i, b_i \sim \mathrm{Unif}(0, 0.1)$ with probability 99%
$a_i, b_i \sim \mathrm{Unif}(0.1, 0.108)$ with probability 1%
[Figure: Simulated curves, y against x (0 to 2π); outliers shown in black.]
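The simulation design can be sketched as follows. One assumption is made loudly here: $a_i$ and $b_i$ are drawn from the same uniform range for a given curve, so a curve is either entirely "typical" or entirely "outlying". The slides do not say whether the two coefficients switch together. The function name, grid size and seed are hypothetical.

```python
import math
import random


def simulate_curves(n, grid, outlier_prob=0.01, seed=0):
    """Simulate y_i(x) = a_i sin(x) + b_i cos(x) on `grid`.

    Assumption: a_i and b_i share one range per curve, Unif(0, 0.1) with
    probability 1 - outlier_prob and Unif(0.1, 0.108) otherwise.
    Returns (curves, is_outlier).
    """
    rng = random.Random(seed)
    curves, is_outlier = [], []
    for _ in range(n):
        outlying = rng.random() < outlier_prob
        low, high = (0.1, 0.108) if outlying else (0.0, 0.1)
        a, b = rng.uniform(low, high), rng.uniform(low, high)
        curves.append([a * math.sin(x) + b * math.cos(x) for x in grid])
        is_outlier.append(outlying)
    return curves, is_outlier


# Interior grid on the open interval (0, 2*pi).
m = 50
sim_grid = [2.0 * math.pi * (j + 1) / (m + 1) for j in range(m)]
sim_curves, sim_flags = simulate_curves(500, sim_grid, seed=42)
```

The outlying coefficients sit just above the range of the typical ones, so the outlying curves are not extreme in magnitude at most values of x.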
[Figure: Scatterplot of the first two PC scores for the simulated curves (PC score 2 against PC score 1).]
Simulation results

Method                    Outliers detected
Likelihood ratio          None
Integrated squared error  None
Bagplot                   None
99% HDR boxplot           All

Method                    Sensitivity  Specificity  Time (s)
Likelihood ratio          0%           100%         28.5
Integrated squared error  0%           100%         18.8
Bagplot                   0%           100%         7.3
99% HDR boxplot           100%         100%         6.9
Conclusions
Functional bagplot highly robust but sometimes misses outliers.
Functional HDR boxplot more flexible but coverage probability needs tuning.
Functional HDR boxplot can detect bimodality and inliers.
Existing depth method performs poorly and ignores shape outliers.
Existing ISE method often misses outliers.
→ Paper and R code: www.robhyndman.info
→ Comments to: Han.Shang@buseco.monash.edu