Magee Campus
Dataset Shift Detection in Non-Stationary
Environments using EWMA Charts
Prof. Girijesh Prasad
Co-authors: Haider Raza, Yuhua Li
School of Computing & Intelligent Systems @ Magee,
Faculty of Computing & Engineering, Derry~Londonderry.
g.prasad@ulster.ac.uk
http://isrc.ulster.ac.uk
Outline
 Motivation
 Background
 Proposed contribution
 Future work and Conclusion
Motivation
 Classical learning systems are built on the assumption that the input data distributions for training and testing are the same.
 Real-world environments are often non-stationary (e.g., EEG-based BCI).
 Learning in real-time environments is therefore difficult: non-stationarity effects degrade system performance over time.
 Predictors thus need to adapt online. However, online adaptation, particularly of classifiers, is difficult to perform and should be avoided as far as possible. This calls for performing, in real time, a non-stationary shift-detection test.
[Block diagram: an adaptive predictor with inputs x1 and x2 produces output y, which is compared with the target t to give the error e.]
Background
• Supervised learning (Mitchell, 1997; Sugiyama et al., 2009)
• Non-stationary environments (Krauledat, 2008; Sugiyama, 2012)
• Dataset shift (Torres et al., 2012)
• Dataset shift-detection (Shewhart, 1939; Page, 1954; Roberts, 1959; Alippi & Roveri, 2008a; Alippi & Roveri, 2008b; Alippi et al., 2011b)
• Proposed work
Supervised Learning
 Training samples: Input (𝑥) and output (𝑦)
 Learn input-output rule: 𝑦 = 𝑓(𝑥)
 Assumption: “Training and test samples are drawn from the same probability distribution”, i.e., P_train(y, x) = P_test(y, x)
Is this assumption really true?
No….!!! Not always true 
Reason: Non-stationary environments!
Non-Stationarity
Definition: Let (X_t), t ∈ I, be a multivariate time series, where I ⊂ ℝ is an index set. (X_t) is called a stationary time series if its probability distribution does not change over time, i.e.,
P(X_t(i)) = P(X_t(j))
for all t(i), t(j) ∈ I. A time series is called non-stationary if it is not stationary.
For example, samples drawn from 𝒩(μ, σ²) whose parameters drift over time form a non-stationary series; learning from the past alone is then of limited use 
Example applications:
 Brain-computer interface
 Robot control
 Remote sensing application
 Network intrusion detection
What is the challenge?
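To make the definition concrete, the following sketch (Python with NumPy; the segment lengths and distribution parameters are illustrative choices, not taken from the slides) builds a series whose generating distribution changes halfway through, so statistics estimated on the past no longer describe the present:

```python
import numpy as np

rng = np.random.default_rng(0)

# Two segments whose generating distribution differs:
# N(0, 1) for the first 1000 samples, N(2, 1) afterwards.
segment_a = rng.normal(loc=0.0, scale=1.0, size=1000)
segment_b = rng.normal(loc=2.0, scale=1.0, size=1000)
series = np.concatenate([segment_a, segment_b])

# A model fitted on the first half alone mispredicts the second:
# the sample means of the two halves differ markedly.
print(series[:1000].mean(), series[1000:].mean())
```

The same effect appears in EEG data, where the mean of band-power features drifts between sessions.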
Dataset Shift
Dataset shift appears when the training and test joint distributions differ, i.e., P_train(y, x) ≠ P_test(y, x) (Torres et al., 2012).
*Note: the direction of the relationship between covariates (x) and class label (y) defines the problem type:
X→Y: predictive model (e.g., spam filtering)
Y→X: generative model (e.g., fault detection)
Types of Dataset Shift
 Covariate Shift
 Prior Probability Shift
 Concept Shift
 Covariate shift (appears only in X→Y problems): P_train(y|x) = P_test(y|x) and P_train(x) ≠ P_test(x)
 Prior probability shift (appears only in Y→X problems): P_train(x|y) = P_test(x|y) and P_train(y) ≠ P_test(y)
 Concept shift: P_train(y|x) ≠ P_test(y|x) and P_train(x) = P_test(x) (in X→Y problems), or P_train(x|y) ≠ P_test(x|y) and P_train(y) = P_test(y) (in Y→X problems)
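A minimal numerical illustration of covariate shift (a hypothetical Python/NumPy sketch; the threshold rule `label` and the Gaussian input distributions are invented for illustration): the labelling rule p(y|x) is held fixed while the input distribution p(x) shifts, which alone changes what the learner observes:

```python
import numpy as np

rng = np.random.default_rng(42)

def label(x):
    # Fixed input-output rule: p(y|x) is identical for train and test.
    return (x > 1.0).astype(int)

# Covariate shift: p_train(x) != p_test(x) while p(y|x) stays the same.
x_train = rng.normal(loc=0.0, scale=1.0, size=5000)   # inputs centred at 0
x_test = rng.normal(loc=2.0, scale=1.0, size=5000)    # inputs centred at 2
y_train, y_test = label(x_train), label(x_test)

# The class proportions induced by the shifted inputs differ sharply,
# even though the labelling rule never changed.
print(y_train.mean(), y_test.mean())
```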
Dataset Shift-Detection
Detecting abrupt and gradual shifts in time-series data is called dataset shift-detection.
Types of Shift-Detection
 Retrospective/offline detection (e.g., shift-point analysis)
 Real-time/online detection (e.g., control charts)
Types of Control Charts
 Shewhart Chart (Shewhart, 1939)
 Cumulative Sum (CUSUM) (E S Page, 1954)
 Exponentially Weighted Moving Average (EWMA) (S W Roberts, 1959)
 Computational Intelligence CUSUM (CI-CUSUM) (Alippi et al., 2008)
 Intersection of Confidence Interval (ICI) (Alippi et al., 2011 )
Proposed Contribution
 We propose a dataset shift-detection test:
• Shift-Detection based on the Exponentially Weighted Moving Average (SD-EWMA) model
Shift-Detection based on the Exponentially Weighted Moving Average
(SD-EWMA)
z(i) = λ·x(i) + (1 − λ)·z(i−1)        (1)
where λ is the smoothing constant (0 < λ ≤ 1).
The EWMA statistic is the optimal 1-step-ahead predictor for a first-order integrated moving-average ARIMA(0,1,1) process:
x(i) = x(i−1) + ε(i) − θ·ε(i−1)        (2)
where ε(i) is a sequence of i.i.d. random variables with zero mean and constant variance. Equation (1) with λ = 1 − θ gives the optimal 1-step-ahead prediction for this process: x̂(i+1)(i) = z(i).
The 1-step-ahead errors are calculated as
err(i) = x(i) − x̂(i)(i−1)        (3)
If the 1-step-ahead error is normally distributed, then
P(−L·σ_err ≤ err(i) ≤ L·σ_err) = 1 − α
i.e., P(x̂(i)(i−1) − L·σ_err ≤ x(i) ≤ x̂(i)(i−1) + L·σ_err) = 1 − α
which gives the control limits UCL = x̂(i)(i−1) + L·σ_err and LCL = x̂(i)(i−1) − L·σ_err.
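The link between equations (1) and (2) can be checked numerically. The sketch below (Python/NumPy; θ, the series length, and the mismatched λ are arbitrary illustrative choices) simulates an ARIMA(0,1,1) process and compares the one-step-ahead squared error of the EWMA predictor with λ = 1 − θ against a mismatched λ:

```python
import numpy as np

rng = np.random.default_rng(1)

def ewma_predictions(x, lam):
    """One-step-ahead EWMA statistics: z(i) = lam*x(i) + (1-lam)*z(i-1).
    z[i] serves as the prediction of x[i+1]."""
    z = np.empty_like(x)
    z[0] = x[0]
    for i in range(1, len(x)):
        z[i] = lam * x[i] + (1.0 - lam) * z[i - 1]
    return z

# Simulate ARIMA(0,1,1): x(i) = x(i-1) + e(i) - theta*e(i-1).
theta = 0.7
n = 5000
e = rng.normal(size=n)
x = np.zeros(n)
for i in range(1, n):
    x[i] = x[i - 1] + e[i] - theta * e[i - 1]

def mse(lam):
    # Mean squared one-step-ahead prediction error of the EWMA.
    z = ewma_predictions(x, lam)
    return np.mean((x[1:] - z[:-1]) ** 2)

# lam = 1 - theta should be (near-)optimal; a mismatched lam does worse.
print(mse(1.0 - theta), mse(0.05))
```

With λ = 1 − θ the error variance approaches the innovation variance of ε, while a too-small λ lags behind the integrated (random-walk-like) component of the series.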
Proposed Algorithm: SD-EWMA
Training input: for each observation in the training dataset, run the training phase and compute the parameters for testing.
Testing input: receive new data sample-by-sample and perform the following check:
IF (shift detected) THEN report the point of shift and initiate an appropriate corrective action; ELSE continue and integrate the upcoming information.
Output: shift-detection points.
Training Phase
1. Initialize the training data x(i) for i = 1 to n, where n is the size of the training data.
2. Calculate the mean of the input data (x) and use it as the initial value z(0).
3. Compute the z-statistic for each observation x(i) in the training data: z(i) = λ·x(i) + (1 − λ)·z(i−1).
4. Compute the 1-step-ahead prediction errors err(i) = x(i) − x̂(i)(i−1).
5. Compute the mean of the squared 1-step-ahead prediction errors and use it as the initial variance σ²_err(0) for the testing phase.
6. Set λ by minimizing the sum of squared prediction errors on the training data.
Testing Phase: for each data point x(i+1) in the operation/testing phase,
1. Compute z(i) = λ·x(i) + (1 − λ)·z(i−1).
2. Compute err(i) = x(i) − x̂(i)(i−1).
3. Compute the estimated variance σ²_err(i) = ϑ·err(i)² + (1 − ϑ)·σ²_err(i−1).
4. Compute the control limits UCL(i+1) = z(i) + L·σ_err(i) and LCL(i+1) = z(i) − L·σ_err(i).
5. IF (LCL(i+1) < x(i+1) < UCL(i+1)) THEN continue processing with the new sample; ELSE a covariate shift is detected: initiate an appropriate corrective action.
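The steps above can be sketched as follows (a hypothetical Python/NumPy implementation, not the authors' code; `lam`, `nu` and `L` play the roles of λ, ϑ and L, λ is fixed rather than optimized, and the demo data at the bottom are invented for illustration):

```python
import numpy as np

def sdewma_detect(train, stream, lam=0.2, nu=0.05, L=3.0):
    """Sketch of the SD-EWMA shift-detection test.
    Returns the indices (into `stream`) flagged as shift points."""
    # --- Training phase ---
    z = float(np.mean(train))              # z(0) = mean of training data
    errs = []
    for x in train:
        errs.append(x - z)                 # 1-step-ahead prediction error
        z = lam * x + (1.0 - lam) * z      # EWMA update, eq. (1)
    var = float(np.mean(np.square(errs)))  # initial error variance

    # --- Testing phase ---
    shifts = []
    for i, x in enumerate(stream):
        sigma = np.sqrt(var)
        ucl = z + L * sigma                # UCL(i+1)
        lcl = z - L * sigma                # LCL(i+1)
        if not (lcl < x < ucl):
            shifts.append(i)               # covariate shift flagged
        err = x - z
        var = nu * err**2 + (1.0 - nu) * var  # EWMA of squared errors
        z = lam * x + (1.0 - lam) * z
    return shifts

# Demo on invented data: a jump in the mean from 0 to 10 at sample 500.
rng = np.random.default_rng(7)
train = rng.normal(0.0, 1.0, 500)
stream = np.concatenate([rng.normal(0.0, 1.0, 500),
                         rng.normal(10.0, 1.0, 500)])
print(sdewma_detect(train, stream)[:5])
```

Note that after a flagged point the statistic keeps adapting, so a single sustained jump is typically reported at its onset rather than at every subsequent sample.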
Datasets
Synthetic Data
Dataset 1, Jumping Mean (D1): y(t) = 0.6·y(t−1) − 0.5·y(t−2) + ε_t,
where ε_t is noise with mean μ and standard deviation 1.5. The initial values are set as y(1) = y(2) = 0. A change point is inserted at every 100 time steps by setting the noise mean μ at time t as
μ_N = 0 for N = 1;  μ_N = μ_{N−1} + N/16 for N = 2, …, 49,
where N is a natural number such that 100(N−1) + 1 ≤ t ≤ 100N.
Dataset 2, Scaling Variance (D2): a change point is inserted at every 100 time steps by setting the noise standard deviation σ at time t as
σ = 1 for N = 1, 3, …, 49;  σ = ln(e + N/4) for N = 2, 4, …, 48,
where N is a natural number such that 100(N−1) + 1 ≤ t ≤ 100N.
Dataset 3, Positive-Auto-correlated (D3): the dataset consists of 2000 data points; the non-stationarity occurs in the middle of the data stream, shifting from 𝒩(x; 1, 1) to 𝒩(x; 3, 1), where 𝒩(x; μ, σ) denotes the normal distribution with mean μ and standard deviation σ.
Dataset 4, Auto-correlated (D4): the dataset is a time series of 2000 data points generated with MATLAB's 1-D digital filter. The filter function creates a direct form II transposed implementation of the standard difference equation. In the filter, the denominator coefficient is changed from 2 to 0.5 after the first 1000 points.
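Dataset 1 (jumping mean) described above can be generated with a short sketch (Python/NumPy; the seed and the helper name `jumping_mean` are illustrative):

```python
import numpy as np

def jumping_mean(n_segments=49, seg_len=100, seed=0):
    """Dataset 1: y(t) = 0.6*y(t-1) - 0.5*y(t-2) + eps_t, with
    eps_t ~ N(mu_N, 1.5^2) and the noise mean jumping every seg_len
    steps: mu_1 = 0, mu_N = mu_{N-1} + N/16 for N = 2..n_segments."""
    rng = np.random.default_rng(seed)
    y = np.zeros(n_segments * seg_len)
    mu = 0.0
    for n in range(1, n_segments + 1):
        if n > 1:
            mu += n / 16.0                  # jump at each segment boundary
        start = (n - 1) * seg_len
        for t in range(start, start + seg_len):
            eps = rng.normal(mu, 1.5)
            y_1 = y[t - 1] if t >= 1 else 0.0   # y(1) = y(2) = 0
            y_2 = y[t - 2] if t >= 2 else 0.0
            y[t] = 0.6 * y_1 - 0.5 * y_2 + eps
    return y

series = jumping_mean()
# The AR recursion has steady-state gain 1/(1 - 0.6 + 0.5) = 1/0.9, so
# segment means sit at roughly mu_N / 0.9.
print(series[:100].mean(), series[-100:].mean())
```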
Real-world Dataset: EEG-Based Brain Signals
The real-world data used here are from BCI Competition III, dataset IVb. The dataset contains 2 classes and 118 EEG channels (0.05-200 Hz), recorded at a 1000 Hz sampling rate and down-sampled to 100 Hz, with 210 training trials and 420 test trials.
[Figure: probability density plots of three sessions' data; x-axis: log(bandpower).]
Figure: pdf plot of three different sessions' data taken from the training dataset. The plot shows that the distribution changes from session to session through a shift in the mean.
[Plots: SD-EWMA statistic with UCL and LCL against sample number; detected shift points marked.]
Figure: Shift detection based on SD-EWMA on Dataset 1 (jumping mean): (a) a shift point is detected at every 100th sample; (b) zoomed view of (a): a shift is detected at the 401st sample, where the statistic crosses the upper control limit.
[Plots: SD-EWMA statistic with UCL and LCL against sample number; detected shift points marked.]
Figure: Shift detection based on SD-EWMA: (a) Dataset 2 (scaling variance): shifts are detected at 3 points; (b) Dataset 3 (positive auto-correlated): the shift is detected after 1000 observations; (c) Dataset 4 (auto-correlated): the shift is detected after 1000 observations.
TABLE: SD-EWMA shift detection in time-series data

Dataset (total shifts)   Lambda (λ)   # TP   # FP   # FN   % Shifts detected
D1 (9)                   0.20         8      1      1      88.8
D1 (9)                   0.30         8      3      1      88.8
D1 (9)                   0.40         All    4      0      100
D2 (5)                   0.60         All    11     0      100
D2 (5)                   0.70         All    6      0      100
D2 (5)                   0.80         4      8      1      80
D3 (1)                   0.40         All    13     0      100
D3 (1)                   0.50         All    9      0      100
D3 (1)                   0.60         All    15     0      100
D4 (1)                   0.40         All    7      0      100
D4 (1)                   0.50         All    6      0      100
D4 (1)                   0.60         All    7      0      100
Table: Simulation results on different tests

           ICI-CDT                        SD-EWMA
Dataset    # TP   # FP   # FN   AD       # TP   # FP   # FN   AD
D1         6      0      3      35       All    1      0      0
D2         1      0      4      60       All    6      0      0
D3         1      0      0      80       All    9      0      0
D4         1      0      0      60       All    6      0      0
[Plot: SD-EWMA statistic with UCL and LCL; y-axis: log(bandpower), x-axis: sample number (5500 to 6500).]
Figure 4: A window of samples obtained from the real-world dataset.
TABLE 4: SD-EWMA shift detection in BCI data

Lambda (λ)   Trials   # Shift points (Session 2)   # Shift points (Session 3)
0.01         1-20     0                            6
0.01         21-45    1                            7
0.01         46-70    3                            5
0.05         1-20     1                            16
0.05         21-45    4                            9
0.05         46-70    2                            22
0.10         1-20     10                           21
0.10         21-45    4                            25
0.10         46-70    9                            17
Conclusion and Future Work
 The drawbacks of classical supervised learning techniques in non-stationary environments, and the motivation behind dataset shift-detection, were discussed.
 The background of non-stationary environments and dataset shift-detection was presented.
 The proposed SD-EWMA method was presented and its results were discussed.
 In future work, SD-EWMA will be combined with an adaptive learning framework for non-stationary learning.
Questions
Thank You !