Dataset Shift Detection in Non-Stationary Environments using EWMA Charts
Prof. Girijesh Prasad
Co-authors: Haider Raza, Yuhua Li
School of Computing & Intelligent Systems @ Magee, Faculty of Computing & Engineering, Derry~Londonderry
g.prasad@ulster.ac.uk
http://isrc.ulster.ac.uk

Outline
• Motivation
• Background
• Proposed contribution
• Future work and conclusion

Motivation
• Classical learning systems are built on the assumption that the input data distributions for training and testing are the same.
• Real-world environments are often non-stationary (e.g., EEG-based BCI).
• Learning in real-time environments is therefore difficult: non-stationarity degrades system performance over time, so predictors need to adapt online.
• However, online adaptation, particularly of classifiers, is difficult to perform and should be avoided as far as possible. This calls for a real-time non-stationary shift-detection test.
[Figure: block diagram of an adaptive predictor — inputs x1, x2; prediction y; target t; error e.]

Background
• Supervised learning (Mitchell, 1997), (Sugiyama et al., 2009)
• Non-stationary environments (Krauledat, 2008), (Sugiyama, 2012)
• Dataset shift (Torres et al., 2012)
• Dataset shift-detection (Shewhart, 1939), (Page, 1954), (Roberts, 1959), (Alippi & Roveri, 2008a; 2008b), (Alippi et al., 2011b)

Supervised Learning
• Training samples: input (x) and output (y).
• Learn the input-output rule y = f(x).
• Assumption: "Training and test samples are drawn from the same probability distribution", i.e., P_train(y, x) = P_test(y, x).
• Is this assumption really true? No, not always. The reason: non-stationary environments.
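To make the failing assumption concrete, here is a minimal Python sketch (the distributions, threshold, seed, and the helper name `label` are illustrative assumptions, not from the slides): the labelling rule p(y|x) is held fixed, but the covariate distribution p(x) differs between training and test, so the joint distributions no longer match.

```python
import numpy as np

def label(x):
    # Fixed input-output rule p(y|x): identical for training and test
    # (illustrative threshold rule).
    return (x > 1.0).astype(int)

rng = np.random.default_rng(42)
x_train = rng.normal(0.0, 1.0, 5000)   # training covariates: N(0, 1)
x_test  = rng.normal(2.0, 1.0, 5000)   # test covariates have drifted: N(2, 1)
y_train, y_test = label(x_train), label(x_test)

# p(y|x) is unchanged, but p(x) differs, hence P_train(y, x) != P_test(y, x):
# a model tuned on the training region of x is evaluated mostly outside it.
```

A classifier fitted to `x_train` would see most of `x_test` in a region it rarely observed during training, which is exactly the mismatch the motivation slide describes.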
Non-Stationarity
Definition: Let (X_t), t ∈ I, be a multivariate time-series, where I ⊂ ℝ is an index set. Then (X_t) is called a stationary time-series if its probability distribution does not change over time, i.e., P(X_t(i)) = P(X_t(j)) for all t(i), t(j) ∈ I. A time-series is called non-stationary if it is not stationary.
For example: a Gaussian source 𝒩(μ, σ²) whose parameters drift over time. Learning from the past only is then of limited use.
Application areas:
• Brain-computer interfaces
• Robot control
• Remote sensing
• Network intrusion detection
What is the challenge?

Dataset Shift
Dataset shift appears when the training and test joint distributions are different, i.e., P_train(y, x) ≠ P_test(y, x) (Torres et al., 2012).
Note: the relationship between the covariates (x) and the class label (y) defines two problem families:
• X→Y: predictive model (e.g., spam filtering)
• Y→X: generative model (e.g., fault detection)
Types of dataset shift:
• Covariate shift (appears only in X→Y problems): p_train(y|x) = p_test(y|x) and p_train(x) ≠ p_test(x).
• Prior probability shift (appears only in Y→X problems): p_train(x|y) = p_test(x|y) and p_train(y) ≠ p_test(y).
• Concept shift: the relationship itself changes, i.e., p_train(y|x) ≠ p_test(y|x) in X→Y problems.

Dataset Shift-Detection
Detecting abrupt and gradual shifts in time-series data is called dataset shift-detection.
Types of shift-detection:
• Retrospective/offline detection (i.e., shift-point analysis)
• Real-time/online detection (i.e., control charts)
Types of control charts:
• Shewhart chart (Shewhart, 1939)
• Cumulative sum, CUSUM (Page, 1954)
• Exponentially weighted moving average, EWMA (Roberts, 1959)
• Computational intelligence CUSUM, CI-CUSUM (Alippi et al., 2008)
• Intersection of confidence intervals, ICI (Alippi et al., 2011)

Proposed Contribution
We have proposed a dataset shift-detection test:
• Shift-Detection based on Exponentially Weighted Moving Average (SD-EWMA)

Shift-Detection based on Exponentially Weighted Moving Average (SD-EWMA)
The EWMA statistic is
z(i) = λ·x(i) + (1−λ)·z(i−1)    (1)
where λ is the smoothing constant (0 < λ ≤ 1). The EWMA is the optimal predictor for a first-order integrated moving average, ARIMA(0,1,1), process
x(i) = x(i−1) + ε(i) − θ·ε(i−1)    (2)
where ε(i) is a sequence of i.i.d. random variables with zero mean and constant variance. Equation (1) with λ = 1 − θ gives the optimal 1-step-ahead prediction for this process: x̂(i+1)(i) = z(i).
The 1-step-ahead error is calculated as
err(i) = x(i) − x̂(i)(i−1)    (3)
If the 1-step-ahead error is normally distributed, then
P(−L·σ_err ≤ err(i) ≤ L·σ_err) = 1 − α
i.e., P(x̂(i)(i−1) − L·σ_err ≤ x(i) ≤ x̂(i)(i−1) + L·σ_err) = 1 − α
which gives the upper and lower control limits (UCL and LCL).

Proposed Algorithm: SD-EWMA
Training input: for each observation in the training dataset, go to the training phase and compute the parameters for testing.
Testing input: receive new data sample-by-sample and perform the check as follows. IF (shift detected) THEN (report the point of shift and initiate an appropriate corrective action) ELSE (continue and integrate the upcoming information).
Output: shift-detection points.

Training Phase
1. Initialize training data: x(i) for i = 1 to n, where n is the size of the training data.
2. Calculate the mean of the input data x and set it as the initial value z(0).
3. Compute the z-statistic for each observation x(i) in the training data: z(i) = λ·x(i) + (1−λ)·z(i−1).
4. Compute the 1-step-ahead prediction errors: err(i) = x(i) − x̂(i)(i−1).
5. Compute the sum of squared 1-step-ahead prediction errors divided by the number of observations and use it as the initial value of the variance σ²_err(0) for the testing phase.
6. Set λ by minimizing the sum of squared prediction errors on the training data.

Testing Phase
1. For each data point x(i+1) in the operation/testing phase:
2. Compute z(i) = λ·x(i) + (1−λ)·z(i−1)
3. Compute err(i) = x(i) − x̂(i)(i−1)
4.
Compute the estimated variance: σ²_err(i) = ϑ·err(i)² + (1−ϑ)·σ²_err(i−1)
5. Compute UCL(i+1) and LCL(i+1):
6. UCL(i+1) = z(i) + L·σ_err
7. LCL(i+1) = z(i) − L·σ_err
8. IF (LCL(i+1) < x(i+1) < UCL(i+1))
9. THEN
10. Continue processing with the new sample
11. ELSE (covariate shift detected)
12. Initiate an appropriate corrective action.

Datasets
Synthetic data:
Dataset 1, Jumping Mean (D1): y(t) = 0.6·y(t−1) − 0.5·y(t−2) + ε(t), where ε(t) is noise with mean μ and standard deviation 1.5. The initial values are set as y(1) = y(2) = 0. A change point is inserted every 100 time steps by setting the noise mean μ at time t as
μ_N = 0 for N = 1, and μ_N = μ_{N−1} + N/16 for N = 2, ..., 49,
where N is a natural number such that 100(N−1) + 1 ≤ t ≤ 100N.

Dataset 2, Scaling Variance (D2): the change point is inserted every 100 time steps by setting the noise standard deviation σ at time t as
σ = 1 for N = 1, 3, ..., 49, and σ = ln(e + N/4) for N = 2, 4, ..., 48,
where N is a natural number such that 100(N−1) + 1 ≤ t ≤ 100N.

Dataset 3, Positive Auto-correlated (D3): the dataset consists of 2000 data points; the non-stationarity occurs in the middle of the data stream, shifting from 𝒩(x; 1, 1) to 𝒩(x; 3, 1), where 𝒩(x; μ, σ) denotes the normal distribution with mean μ and standard deviation σ.

Dataset 4, Auto-correlated (D4): the dataset is a time-series of 2000 data points produced with a 1-D digital filter in MATLAB. The filter function creates a direct form II transposed implementation of a standard difference equation. In the filter, the denominator coefficient is changed from 2 to 0.5 after producing 1000 points.

Real-world dataset: EEG-based brain signals
The real-world data used here are from BCI competition III, dataset IVb. This dataset contains 2 classes, 118 EEG channels (0.05-200 Hz), and a 1000 Hz sampling rate down-sampled to 100 Hz, with 210 training trials and 420 test trials.
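Dataset 1 above can be generated along these lines; this is a minimal sketch (the function name, seed, and return convention are illustrative assumptions):

```python
import numpy as np

def jumping_mean_series(n_segments=49, seg_len=100, seed=0):
    """Dataset 1 (jumping mean): y(t) = 0.6*y(t-1) - 0.5*y(t-2) + eps(t),
    eps(t) ~ N(mu_N, 1.5^2), where the noise mean jumps every seg_len steps:
    mu_1 = 0 and mu_N = mu_{N-1} + N/16 for N = 2..49."""
    rng = np.random.default_rng(seed)
    mu = 0.0
    y = [0.0, 0.0]                       # initial values y(1) = y(2) = 0
    for N in range(1, n_segments + 1):
        if N > 1:
            mu += N / 16.0               # change point at each segment boundary
        for _ in range(seg_len):
            eps = rng.normal(mu, 1.5)
            y.append(0.6 * y[-1] - 0.5 * y[-2] + eps)
    return np.array(y[2:])               # drop the two seed values
```

Each 100-sample segment is driven by a progressively larger noise mean, so the series level jumps at every segment boundary, which is the behaviour the shift detector is asked to catch.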
[Figure: pdf plot of three different sessions' data taken from the training dataset (x-axis: log(bandpower)). The plot shows that in each session the distribution changes, with the mean shifting from session to session.]

[Figure: Shift detection based on SD-EWMA on Dataset 1 (jumping mean), plotting the SD-EWMA statistic with the UCL and LCL against sample number: (a) the shift point is detected at every 100th point; (b) zoomed view of (a): a shift is detected at the 401st sample by crossing the upper control limit.]

[Figure: Shift detection based on SD-EWMA: (a) Dataset 2 (scaling variance): the shift is detected at 3 points; (b) Dataset 3 (positive auto-correlated): the shift is detected after 1000 observations; (c) Dataset 4 (auto-correlated): the shift is detected after 1000 observations.]
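The detections shown in these figures come from the SD-EWMA procedure listed earlier. The following is a minimal Python sketch of it (the default ϑ and L values, all variable names, and the omission of the λ-tuning step are illustrative assumptions):

```python
import numpy as np

def sd_ewma_train(x_train, lam):
    """Training phase: EWMA statistic z, 1-step-ahead errors, and the
    initial error variance sigma^2_err(0)."""
    z = float(np.mean(x_train))          # step 2: z(0) = mean of training data
    errs = []
    for x in x_train:
        errs.append(x - z)               # err(i) = x(i) - x_hat(i)(i-1) = x(i) - z(i-1)
        z = lam * x + (1.0 - lam) * z    # z(i) = lam*x(i) + (1-lam)*z(i-1)
    var0 = float(np.mean(np.square(errs)))  # mean squared 1-step-ahead error
    return z, var0

def sd_ewma_test(x_stream, z, var, lam, vartheta=0.1, L=3.0):
    """Testing phase: flag any sample falling outside [LCL, UCL]."""
    shift_points = []
    for i, x in enumerate(x_stream):
        sigma = np.sqrt(var)
        ucl = z + L * sigma              # UCL(i+1) = z(i) + L*sigma_err
        lcl = z - L * sigma              # LCL(i+1) = z(i) - L*sigma_err
        if not (lcl < x < ucl):
            shift_points.append(i)       # covariate shift detected at sample i
        err = x - z                      # 1-step-ahead prediction error
        var = vartheta * err**2 + (1.0 - vartheta) * var
        z = lam * x + (1.0 - lam) * z
    return shift_points
```

On a stream whose mean jumps, the first out-of-limit sample is flagged as the shift point; in a full system this is where the "appropriate corrective action" of the algorithm would be triggered.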
TABLE: SD-EWMA shift detection in time-series data

| Dataset | Total shifts | λ    | #TP | #FP | #FN | % shifts detected |
|---------|--------------|------|-----|-----|-----|-------------------|
| D1      | 9            | 0.20 | 8   | 1   | 1   | 88.8              |
|         |              | 0.30 | 8   | 3   | 1   | 88.8              |
|         |              | 0.40 | All | 4   | 0   | 100               |
| D2      | 5            | 0.60 | All | 11  | 0   | 100               |
|         |              | 0.70 | All | 6   | 0   | 100               |
|         |              | 0.80 | 4   | 8   | 1   | 80                |
| D3      | 1            | 0.40 | All | 13  | 0   | 100               |
|         |              | 0.50 | All | 9   | 0   | 100               |
|         |              | 0.60 | All | 15  | 0   | 100               |
| D4      | 1            | 0.40 | All | 7   | 0   | 100               |
|         |              | 0.50 | All | 6   | 0   | 100               |
|         |              | 0.60 | All | 7   | 0   | 100               |

Table: Simulation results on different tests

| Dataset | ICI-CDT #TP | #FP | #FN | AD | SD-EWMA #TP | #FP | #FN | AD |
|---------|-------------|-----|-----|----|-------------|-----|-----|----|
| D1      | 6           | 0   | 3   | 35 | All         | 1   | 0   | 0  |
| D2      | 1           | 0   | 4   | 60 | All         | 6   | 0   | 0  |
| D3      | 1           | 0   | 0   | 80 | All         | 9   | 0   | 0  |
| D4      | 1           | 0   | 0   | 60 | All         | 6   | 0   | 0  |

[Figure 4: A window of 2000 samples obtained from the real-world dataset, showing log(bandpower) with the SD-EWMA statistic, UCL, and LCL over samples 5500-6500.]

TABLE 4: SD-EWMA shift detection in BCI data

| λ    | Number of trials | # Shift points (Session 2) | # Shift points (Session 3) |
|------|------------------|----------------------------|----------------------------|
| 0.01 | 1-20             | 0                          | 1                          |
|      | 21-45            | 3                          | 1                          |
|      | 46-70            | 4                          | 2                          |
| 0.05 | 1-20             | 10                         | 4                          |
|      | 21-45            | 9                          | 6                          |
|      | 46-70            | 7                          | 5                          |
| 0.10 | 1-20             | 16                         | 9                          |
|      | 21-45            | 22                         | 21                         |
|      | 46-70            | 25                         | 17                         |

Conclusion and Future Work
• The drawbacks of classical supervised learning techniques in non-stationary environments and the motivation behind dataset shift-detection were discussed.
• The background of non-stationary environments and dataset shift-detection was presented.
• The proposed SD-EWMA method was presented and its results were discussed.
• In future work, SD-EWMA will be combined into an adaptive learning framework for non-stationary learning.

Questions
Thank You!