Machine learning methods with applications to precipitation and streamflow
William W. Hsieh
Dept. of Earth, Ocean & Atmospheric Sciences, The University of British Columbia
http://www.ocgy.ubc.ca/~william
Collaborators: Alex Cannon, Carlos Gaitan & Aranildo Lima

Nonlinear regression
- Linear regression (LR): y = a0 + Σj aj xj
- Neural networks (NN/ANN): y = a0 + Σj aj hj, with adaptive basis functions hj (e.g. hj = tanh(Σi wij xi + w0j))
- A cost function J is minimized to solve for the weights; here J is the mean squared error.
- Underfit and overfit:

Why is climate more linear than weather?
[Yuval & Hsieh, 2002, Quart. J. Roy. Met. Soc.]

Curse of the central limit theorem
- The central limit theorem says that averaging weather data makes climate data more Gaussian and more linear => no advantage in using NN!
- Nowadays, climate is not just the mean of the weather data; it can also be other statistics computed from weather data, e.g. the climate of extreme weather.
- Use NN on daily data, then compile climate statistics of extreme weather => can we escape from the curse of the central limit theorem?

Statistical downscaling
- Global climate models (GCM) have poor spatial resolution.
- (a) Dynamical downscaling: embed a regional climate model (RCM) in the GCM.
- (b) Statistical downscaling (SD): use statistical/machine learning methods to downscale GCM output.
- Statistical downscaling at 10 stations in Southern Ontario & Quebec [Gaitan, Hsieh & Cannon, Clim. Dynam. 2014].
- Predictors (1961-2000) from the NCEP/NCAR Reanalysis, interpolated to the grid (approx. 3.75° lat. by 3.75° lon.) used by the Canadian CGCM3.1.

How to validate statistical downscaling in a future climate?
- Following Vrac et al. (2007), use regional climate model (RCM) output as pseudo-observations.
- CRCM 4.2 provides 10 "pseudo-observational" sites.
- For each site, downscale from the 9 surrounding CGCM 3.1 grid cells, with 6 predictors per cell: Tmax, Tmin, surface u, v, SLP and precipitation. 6 predictors/cell x 9 cells = 54 predictors.
- Two periods: 1971-2000 (20th century climate: 20C3M run) and 2041-2070 (future climate: SRES A2 scenario).

10 meteorological stations

Precipitation Occurrence Models
Using the 54 available predictors and a binary predictand (precip. / no precip.), we implemented the following models:
- Linear discriminant classifier
- Naive Bayes classifier
- kNN (k nearest neighbours) classifier (45 nearest neighbours)
- Classification tree
- TreeEnsemble: ensemble of classification trees
- ANN-C: artificial neural network classifier

[Figure: Peirce skill score (PSS) for downscaled precipitation occurrence in the 20th century (20C3M) and future (A2) periods, for persistence, discriminant, naïve-Bayes, kNN, ClassTree, TreeEnsem. and ANN-C.]

Climdex climate indices
- Compute the indices from the downscaled daily data.

[Figure: index of agreement (IOA) of the climate indices for ANN-F, ARES-F and SWLR-F in the 20C3M and A2 periods.]

[Figure: differences between the IOA of the future (A2) and 20th century (20C3M) climates for ANN-F, ARES-F and SWLR-F.]

Conclusion
- Use NN on daily data, then compile climate statistics of extreme weather => beat the linear methods => escaped from the curse of the central limit theorem.

Extreme learning machine (ELM) [G.-B. Huang]
- ANN: y = a0 + Σj aj hj, hj = tanh(Σi wij xi + w0j), with all weights found by nonlinear optimization.
- ELM: choose the weights (wij and w0j) of the hidden neurons randomly; only need to solve for aj and a0 by linear least squares.
- ELM turns nonlinear regression by NN into a linear regression problem!

Tested ELM on 9 environmental datasets [Lima, Cannon and Hsieh, Environmental Modelling & Software, under revision].
Goal: develop ELM into a nonlinear updateable model output statistics (UMOS) scheme.
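To make the ELM idea concrete, here is a minimal NumPy sketch of the scheme described above: the hidden-layer weights wij and biases w0j are drawn at random, and only the output weights aj and a0 are obtained by a linear least-squares solve. The function names and the choice of tanh activation are illustrative assumptions, not code from the study.

```python
import numpy as np

def elm_fit(X, y, n_hidden=50, seed=0):
    """Basic ELM regressor: random hidden-layer weights, least-squares output weights."""
    rng = np.random.default_rng(seed)
    n_features = X.shape[1]
    # Randomly chosen hidden weights w_ij and biases w_0j (never trained)
    W = rng.uniform(-1.0, 1.0, size=(n_features, n_hidden))
    b = rng.uniform(-1.0, 1.0, size=n_hidden)
    H = np.tanh(X @ W + b)                       # hidden-layer outputs h_j
    H1 = np.column_stack([H, np.ones(len(X))])   # extra column for the output bias a_0
    # Linear least-squares solve for the output weights a_j and a_0
    a, *_ = np.linalg.lstsq(H1, y, rcond=None)
    return W, b, a

def elm_predict(X, W, b, a):
    H1 = np.column_stack([np.tanh(X @ W + b), np.ones(len(X))])
    return H1 @ a
```

In practice the number of hidden neurons (and the scaling of the random weights, as in ELM-S) would be chosen on validation data, as done in the experiments below.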
Deep learning

Spare slides

Compare 5 models over 9 environmental datasets
[Lima, Cannon & Hsieh, Environmental Modelling & Software, under revision]
- MLR = multiple linear regression
- ANN = artificial neural network
- SVR-ES = support vector regression (with evolutionary strategy)
- RF = random forest
- ELM-S = extreme learning machine (with scaled weights)
- The optimal number of hidden neurons in ELM is chosen over validation data by a simple hill-climbing algorithm.
- Models are compared in terms of the RMSE skill score = 1 - RMSE/RMSE_MLR, and t = CPU time.

RMSE skill score (relative to MLR)

CPU time

Conclusions
- ELM turns the nonlinear ANN into a multiple linear regression problem, but with the same skill as the ANN.
- ELM-S is faster than ANN and SVR-ES on 8 of the 9 datasets, and faster than RF on 5 of the 9 datasets.
- When a dataset has both a large number of predictors and a large sample size, ELM loses its advantage over ANN.
- RF is fast but could not outperform MLR on 3 of the 9 datasets (ELM-S outperformed MLR on all 9 datasets).

Online sequential learning
- Previously we used ELM for batch learning: when new data arrive, the model must be retrained on the entire data record => very expensive.
- Now use ELM for "online sequential learning", i.e. as new data arrive, update the model with only the new data.
- For multiple linear regression (MLR), online sequential MLR (OS-MLR) is straightforward. [Environment Canada's updateable MOS (model output statistics), used to post-process NWP model output, is based on OS-MLR.]
- Online sequential ELM (OS-ELM) (Liang et al. 2006, IEEE Trans. Neural Networks) is easily derived from OS-MLR.

Predict streamflow at Stave, BC at 1-day lead
- 23 potential predictors (local observations, GFS reforecast data, climate indices) [same as Rasouli et al. 2012].
- Data during 1995-1997 are used to find the optimal number of hidden neurons (3 for ANN and 27 for ELM) and to train the first model.
- New data arrive (a) weekly, (b) monthly or (c) seasonally. Forecasts are validated for 1998-2001.
- 5 models: MLR, OS-MLR, ANN, ELM, OS-ELM.
- Compare correlation (CC), mean absolute error (MAE), root mean squared error (RMSE) and CPU time.

Conclusion
- With new data arriving frequently, OS-ELM provides a much cheaper way to update and maintain a nonlinear regression model.

Future research
- OS-ELM retains the information from all the data it has seen. If the data are non-stationary, the older data need to be forgotten.
- Adaptive OS-ELM has an adaptive forgetting factor.
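For illustration, here is a minimal sketch of the online sequential update in the spirit of OS-ELM (Liang et al. 2006): the random hidden layer is fixed, and the output weights are updated chunk by chunk with the standard recursive least-squares formulas, so retraining on the full record is not needed. The class name, variable names and default settings are assumptions for this sketch, not code from the study.

```python
import numpy as np

class OSELM:
    """Online sequential ELM: random hidden layer plus recursive least squares
    for the output weights, so each new data chunk updates the model in place."""

    def __init__(self, n_features, n_hidden=27, seed=0):
        rng = np.random.default_rng(seed)
        self.W = rng.uniform(-1.0, 1.0, size=(n_features, n_hidden))
        self.b = rng.uniform(-1.0, 1.0, size=n_hidden)
        self.beta = None   # output weights a_j (plus bias a_0)
        self.P = None      # inverse of H^T H, carried between updates

    def _hidden(self, X):
        H = np.tanh(X @ self.W + self.b)
        return np.column_stack([H, np.ones(len(X))])  # bias column for a_0

    def fit_initial(self, X0, y0):
        """Batch-train on the initial block of data (e.g. the 1995-1997 record)."""
        H = self._hidden(X0)
        self.P = np.linalg.inv(H.T @ H)
        self.beta = self.P @ H.T @ y0

    def update(self, Xk, yk):
        """Recursive least-squares update with a new chunk (e.g. a week, month or season)."""
        H = self._hidden(Xk)
        K = np.linalg.inv(np.eye(len(Xk)) + H @ self.P @ H.T)
        self.P = self.P - self.P @ H.T @ K @ H @ self.P
        self.beta = self.beta + self.P @ H.T @ (yk - H @ self.beta)

    def predict(self, X):
        return self._hidden(X) @ self.beta
```

A forgetting factor, as mentioned under future research for non-stationary data, would rescale P in the update step so that older observations gradually lose influence; the adaptive variant adjusts that factor from the incoming data.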