Forecasting Thailand Summer Monsoon Rainfall

Nkrintra Singhrattna (1, 2), Balaji Rajagopalan (+, 1, 3), Martyn Clark (3) and K. Krishna Kumar (3, 4)

(1) Department of Civil, Environmental and Architectural Engineering, University of Colorado at Boulder, Boulder, USA
(2) Thailand Public Works Department, Bangkok, Thailand
(3) Co-operative Institute for Research in Environmental Sciences, University of Colorado, Boulder, USA
(4) Indian Institute of Tropical Meteorology, Pune, India
(+) Corresponding Author

International Journal of Climatology
February, 2004

Abstract

Predictors of Thailand summer (August-October) monsoon rainfall are identified from the large-scale ocean-atmospheric circulation variables (i.e., Sea Surface Temperatures, Sea Level Pressures) in the Indo-Pacific Ocean. The identified predictors are part of the broader El Niño Southern Oscillation (ENSO) phenomenon. The predictors exhibit a significant relationship to the summer rainfall only during the post-1980 period, when the Thailand summer rainfall also shows a relationship with ENSO. This suggests that for the predictors to be useful the rainfall itself should be related to ENSO. Two methods for generating ensemble forecasts are proposed. The first is the traditional linear regression, and the second is a nonparametric approach where local polynomials are fitted. The associated predictive standard errors are used for generating ensembles. All the methods exhibit significant and comparable skill in a cross-validated mode. However, the nonparametric methods show improved skill during extreme years (i.e. wet and dry years). Furthermore, the models provide useful skill at 1-3 month lead time, which can have significant impact on resources planning and management.

1. Introduction

Seasonal forecast of Thailand summer monsoon can have significant impacts on resources planning and management.
In particular, the forecast can help public sector agencies to plan for floods or droughts, e.g., reservoir operations, agricultural practices and flood emergency responses. It is not clear whether a seasonal forecast mechanism is in place in Thailand. In contrast, for the Indian summer monsoon, the India Meteorological Department is required to issue a seasonal forecast of the upcoming monsoon season by the end of May. Given the great potential utility of the forecast, this paper develops two approaches to forecast Thailand summer monsoon rainfall: a traditional linear regression and a nonparametric approach based on local regressions, both using the predictors identified in section 3.

The paper is organized as follows. Data description and predictor identification are presented first. The two forecasting methods are described next, followed by cross-validated model skills in forecasting the Thailand summer rainfall. Implications for water resource management conclude the presentation.

2. Data

The data used in this study are:

1. Thailand summer monsoon (August-October) rainfall averaged over three stations: Nakhon Sawan (Nk), Suphan Buri (Sp) and Don Muang (Dm). All of these stations are in the West Central region and in the Chao Phraya River basin, one of the key regions for Thailand's socio-economic well-being. Please see Singhrattna (2003) and Singhrattna et al. (2003) for details on the rainfall data set.

2. Large-scale ocean and atmospheric circulation variables such as sea surface temperatures (SST), sea level pressures (SLP), winds and velocity potential, obtained from the NCEP/NCAR Re-analysis (Kalnay 1996). These data sets span the period from 1948 to the present, cover the globe on a 2.5° x 2.5° grid and are available at http://www.cdc.noaa.gov.

3. Standard ENSO indices: NINO3, NINO1+2 and the Southern Oscillation Index (SOI), available at http://www.cpc.noaa.gov.

3. Predictor Identification

The aim here is to identify predictors of Thailand summer rainfall, which can then be used in statistical forecast models for improved predictions. Clearly, better forecasts can help significantly in resources planning and management. The two main requirements for good predictors are (i) a good relation with the hydroclimate variable (i.e. rainfall) and (ii) a reasonable lead-time (i.e. months to a season). Our earlier work (Singhrattna 2003, Singhrattna et al. 2003) indicated that Thailand summer rainfall is strongly correlated with ENSO in the post-1980 period. So, the first step is to look for relationships with standard ENSO indices during the pre-monsoon seasons, and then to follow up with correlations between the rainfall and large-scale ocean-atmospheric variables (SSTs, SLPs, etc.). This approach of correlation with large-scale ocean-atmospheric circulation variables has been used to identify predictors for streamflows in Northern Brazil (DeSouza and Lall 2003) and in the Truckee-Carson river basins in NV, USA (Grantz 2002).

3.1 Correlation with ENSO indices

Thailand summer rainfall was correlated with the standard ENSO indices and the Indian Ocean Dipole (IOD) index (Saji et al. 1999; see footnote 1) from pre-monsoon seasons, and also with the spring (March-May, MAM) Thailand air temperatures. The latter is believed to be an indicator of the land-ocean thermal gradient that is important for the strength of the monsoon (Singhrattna et al. 2003). The correlations are computed for the post-1980 period and shown in Table 1. Correlation values that are statistically significant at the 95% confidence level using a t-test (Helsel and Hirsch 1995) are shown in bold in the table. It can be seen that the SLP-based ENSO index, SOI, shows a strong correlation with monsoon rainfall during the concurrent season and also 1-2 seasons prior. The spring land temperatures also exhibit a significant correlation, as expected.
Interestingly, the IOD shows a strong correlation with the monsoon rainfall at 2-season lead-time. All of these brighten the prospects for a long-lead forecast.

To check the temporal variation of these correlations, selected predictors from pre-monsoon seasons (JFM NINO3, MJJ SOI, MAM IOD and MAM SATs) were correlated with monsoon rainfall in a 21-year moving window (Figure 1). The predictors show correlations with summer rainfall only in recent decades, much like the correlation between the rainfall and ENSO (shown as the solid line, between summer rainfall and ASO SOI) seen by Singhrattna et al. (2003). This suggests that the ENSO-based predictors are related to the monsoon rainfall only when the monsoon rainfall itself is related to ENSO. Interestingly, this is similar to the finding by Krishna Kumar et al. (1999), who show that the predictors of Indian monsoon rainfall are related to the rainfall only when the Indian monsoon is related to ENSO.

Footnote 1: Japan Marine Science and Technology Center. Indian Ocean Dipole (IOD) Home Page. [Online] Available http://w3.jamstec.go.jp/frsgc/research/d1/saji/test1.html, July 2002.

3.2 Correlation with Large-Scale Variables

While the indices show significant correlations as seen above, we would like to check their large-scale aspects and also see whether other, stronger predictors could be identified. To this end, the summer monsoon rainfall is correlated with surface air temperatures (SATs), SSTs and SLPs during the pre-monsoon seasons, and the correlation maps are shown in Figure 2. The red and dark purple colors indicate correlations that are significant at the 95% confidence level.

Strong positive correlations with SATs can be seen over Thailand during the spring (MAM) season (Fig. 2(a)). This essentially captures the land-ocean gradient that is set up by the land temperatures during the spring season before the monsoon (Singhrattna 2003). With SLPs (Fig. 2(b)), the correlations are strong in the Pacific subtropical region, indicating that a higher than normal subtropical pressure tends to enhance the easterlies, thereby increasing the moisture transport to Thailand and, consequently, the rainfall. With SSTs (Fig. 2(c)), strong positive correlations are seen in the Eastern Indian Ocean and Western Pacific Ocean regions around the equator. This region is also one of the poles of the Indian Ocean Dipole index (Saji et al. 1999), hence the strong correlation with IOD seen in Table 1. Correlations with all three variables indicate persistence from spring leading up to the monsoon season, thus providing the potential for a long-lead forecast. The solid boxes in the figures show the regions of high correlation from which the predictors are developed in the following sections.

3.3 Predictor Selection

Predictors are identified from the correlations with indices and the correlation maps with large-scale variables; those with large correlations to the summer rainfall are selected. With this criterion, the selected predictors are (i) Thailand spring temperatures, i.e. MAM SATs, (ii) SSTs averaged over 10.5°-14.3° S latitudes and 108.8°-120° E longitudes and (iii) SLPs averaged over 20°-30° N latitudes and 165°-180° E longitudes.

To check the temporal variability of the strength of the relationship between the predictors and monsoon rainfall, moving window correlations are shown in Figure 3. As expected, the predictors are correlated mainly in the post-1980 period. Furthermore, the predictors show significant correlations with the summer rainfall at 1-2 season lead-time. To investigate the nature of the relationship between the predictors and the rainfall, Figure 4 shows a 3D plot of summer rainfall and two selected predictors (MAM SST and MJJ SLP). Notice the interesting nonlinearity at the extremes (i.e. during high rainfall).

4. Forecasting Models

Typically, a regression (often linear) is fit between the identified predictors and the dependent variable (i.e. the summer rainfall). The fitted regression is then used to forecast the mean value of the variable. There is a rich literature on fitting and testing linear regression models, and software is widely available (Helsel and Hirsch 1995). Such models have been widely used for hydroclimate forecasting in the US (e.g., Lui et al. 1998, Piechota et al. 2001, Cordery and McCall 2000) and for Indian monsoon forecasting (Krishna Kumar et al. 1995). Below, the linear regression model is briefly described.

4.1 Linear Regression

Traditional models fit a regression, often linear, between the response variable (i.e. summer rainfall) and the independent variables (i.e. predictors). They are of the form:

$$Y_t = a_1 x_{1t} + a_2 x_{2t} + a_3 x_{3t} + \dots + a_p x_{pt} + e \qquad [1]$$

where the coefficients $a_1, a_2, \dots, a_p$ are estimated from the data and $e$ is the error, which is assumed to be normally distributed with mean 0 and variance $\sigma_e^2$, also estimated from the data. Implicitly, the variables are also assumed to be normally (or Gaussian) distributed. If not, they are generally transformed to a normal distribution (e.g., by a log or power transform) before the model is fit.

Once the model is fit (i.e. the coefficients are estimated), then for any new value of the predictors the model with the fitted coefficients (Equation [1]) is used to estimate the mean value of the dependent variable, say $\hat{Y}$. The standard error $\sigma_e$ (or the standard deviation of the estimate) is obtained directly from the theory (Helsel and Hirsch 1995). Normal random deviates with a mean of 0 and a standard deviation of $\sigma_e$ provide the ensembles of errors; when added to the mean estimate $\hat{Y}$, they yield the ensemble forecasts.

In the above model, if the independent variables happen to be past values of the response variable itself, then it forms a time series model in the Auto Regressive Moving Average (ARMA) framework.
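As an illustration of the ensemble generation described above for Equation [1], the following is a minimal Python sketch, not the authors' implementation. It assumes NumPy, adds an intercept term (Equation [1] shows none), and approximates the standard error by the residual standard deviation. The function name and its arguments are hypothetical.

```python
import numpy as np

def linear_regression_ensemble(X, y, x_new, n_members=100, seed=0):
    """Fit a linear regression by least squares and generate an ensemble
    forecast at x_new by adding Normal(0, sigma_e) deviates to the mean.

    X: (n, p) array of predictors, y: (n,) response, x_new: (p,) new point.
    """
    X = np.asarray(X, dtype=float)
    y = np.asarray(y, dtype=float)
    n, p = X.shape

    # Design matrix with an intercept column (an assumption of this sketch).
    A = np.column_stack([np.ones(n), X])
    coef, *_ = np.linalg.lstsq(A, y, rcond=None)

    # Residual standard deviation, used here as a simple stand-in for sigma_e.
    resid = y - A @ coef
    sigma_e = np.sqrt(resid @ resid / (n - p - 1))

    # Mean forecast at the new predictor values.
    y_mean = np.concatenate([[1.0], np.atleast_1d(x_new)]) @ coef

    # Ensemble members: mean forecast plus Normal random deviates.
    rng = np.random.default_rng(seed)
    ensemble = y_mean + rng.normal(0.0, sigma_e, size=n_members)
    return y_mean, sigma_e, ensemble
```

Note that every ensemble member shares the same spread $\sigma_e$ regardless of where x_new lies, which is the property the nonparametric variants in the later sections relax.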
Such time series models are also linear regression based frameworks, and hydrologists have developed and used them for streamflow simulation and forecasting (Salas 1985, Yevjevich 1972, Bras and Iturbe 1985).

The main drawbacks of traditional linear regression based models are (i) the assumption of a Gaussian distribution of data and errors, (ii) the assumption of a linear relationship between the variables, (iii) the large amounts of data required for fitting higher order models (e.g., quadratic or cubic) and (iv) the lack of portability across data sets (i.e. sites).

4.2 Nonparametric regression – Locally Weighted Polynomials

Nonparametric methods provide an attractive alternative with their flexible and powerful framework. In this approach, the model is:

$$Y = f(x_1, x_2, x_3, \dots, x_p) + e \qquad [2]$$

which is the same as the traditional model (Equation [1]) except that the function $f$ is estimated locally. That is, the value of the function at any point x is obtained by fitting a polynomial to a small number (K) of neighbors of x. This is the gist of the nonparametric approach: "local" estimation. Once the neighbors of x are identified, there are a couple of options:

(i) The neighbors can be resampled with a weight function that gives more weight to the nearest neighbor and less to the farthest, thus generating an ensemble forecast or simulation (Lall and Sharma 1996, Rajagopalan and Lall 1999, Yates et al. 2003, DeSouza et al. 2003).

(ii) A polynomial can be fit to the neighbors and used to issue a mean value (or mean forecast); this is also used for spatial interpolation (Loader 1999, Rajagopalan and Lall 1998). This can be repeated at every point of interest, thus resulting in a smoothing of the data.

The parameters to be estimated are the size of the neighborhood (K) and the order of the polynomial (P). These parameters are obtained using objective criteria such as the Generalized Cross Validation (GCV) function or the likelihood function. Note that if a first order (i.e. linear) polynomial is selected and the neighborhood includes all the observations, the result is the traditional linear regression. The local polynomial estimation can thus be viewed as a superset. Loader (1999) provides the theoretical background for this approach.

Being a local estimation scheme, it has the ability to capture any arbitrary local features. Furthermore, unlike the parametric alternative, no prior assumption is made as to the functional form of the relationship (e.g., in the parametric model, Equation [1], a linear relationship is assumed to begin with). These are the two major differences that give nonparametric methods their flexibility.

There are several nonparametric approaches, such as kernel-based methods (Bowman and Azzalini 1997), splines, K-nearest neighbor (K-nn) local polynomials (Rajagopalan and Lall 1999, Owosina 1992) and locally weighted polynomials (LOCFIT) (Loader 1999). The K-nn local polynomials and the LOCFIT approaches are very similar. Owosina (1992) performed an extensive comparison of a number of regression methods, both parametric and nonparametric, on a variety of synthetic data sets. He found that the nonparametric methods handily outperform the parametric alternatives.

The LOCFIT method is simple and easy to implement; hence, we chose it for this research. We then adapt a modified version of LOCFIT, also known as the modified K-nn. The modification was developed by Prairie (2002) and Prairie et al. (2003) and applied to streamflow and salinity modeling. Later, Grantz (2003) demonstrated the use of this approach for streamflow forecasting on the Truckee-Carson basin in Nevada, USA. The methodology is described in the following section.

4.3 Modified K-nn Model

The modification is described in the following steps:

(i) First, the optimal polynomial order and the size of the neighborhood are obtained (see section 4.2 above). We use these to obtain an estimate at all the observation points and, consequently, the residuals. This is done using LOCFIT.
The modification is as follows (refer to Figure 5).

(ii) For a new data point $x_{new}$ at which a forecast is required, the conditional mean value $Y_{new}$ is obtained using step (i).

(iii) Now select one of the (K) neighbors of $x_{new}$, say $x_i$, and select the corresponding residual $e_i$. This residual is added to the mean forecast, giving $Y_{new} + e_i$ as one of the ensemble members. The selection of one of the neighbors is done using the weight function

$$W(j) = \frac{1/j}{\sum_{i=1}^{k} 1/i} \qquad [3]$$

where $j = 1, \dots, k$ indexes the neighbors in order of increasing distance from $x_{new}$. As can be seen, this weight function gives more weight to the nearest neighbor and less to the farthest neighbors. The number of neighbors used to resample the residuals need not be the same as the number of neighbors (K) used to fit the local polynomial in step (i). In practice, the square root of (n-1) neighbors is used to resample the residuals, where n is the number of observations.

Step (iii) is repeated as many times as required, resulting in an ensemble forecast. Steps (i)-(iii) are repeated for each forecast point.

4.4 LOCFIT With Normal Errors

In the modified K-nn method described above, if the sample size is small then the resampled residuals provide limited variety in the ensembles. This can be addressed by modifying step (iii) of the algorithm described in the preceding section as follows. LOCFIT (Loader 1999) also provides an estimate of the standard error, $\sigma_{le}$ (much like $\sigma_e$ in the case of linear regression), for the conditional mean value (Loader 1999, page xx). Assuming a normal distribution of the errors "locally" around $Y_{new}$, Normal random deviates with a mean of 0 and a standard deviation of $\sigma_{le}$ provide the ensembles of errors; when added to the mean estimate $Y_{new}$, they yield the ensemble forecasts.

This step is similar to generating the ensembles from the linear regression method described at the end of section 4.1. The main difference is that the mean estimate is obtained via local regressions (i.e. LOCFIT) and the standard error estimate $\sigma_{le}$ is also a "local" estimate, with an assumption of a Normal distribution for the errors "locally".
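The residual resampling of section 4.3 and the Normal-error variant described above can be sketched in Python as follows. This is a hedged illustration rather than the authors' code: it assumes the conditional mean $Y_{new}$, the fitted residuals and the local standard error $\sigma_{le}$ have already been obtained from a local polynomial fit (no LOCFIT call is shown), and it assumes Euclidean distance and rounding of the square root of (n-1). All function names are hypothetical.

```python
import numpy as np

def knn_weights(k):
    """Weights of Equation [3]: W(j) = (1/j) / sum_{i=1..k}(1/i), j = 1..k."""
    inv = 1.0 / np.arange(1, k + 1)
    return inv / inv.sum()

def resample_residual_ensemble(X, residuals, x_new, y_new,
                               n_members=100, k=None, seed=0):
    """Modified K-nn step (iii): add resampled neighbor residuals to y_new.

    X: (n, p) predictors, residuals: (n,) residuals of the local fit,
    x_new: (p,) forecast point, y_new: conditional mean at x_new.
    """
    X = np.asarray(X, dtype=float)
    residuals = np.asarray(residuals, dtype=float)
    n = len(residuals)
    if k is None:
        # Square root of (n - 1); rounding to an integer is an assumption.
        k = max(1, int(round(np.sqrt(n - 1))))

    # Rank observations by Euclidean distance to x_new (distance metric assumed).
    dist = np.linalg.norm(X - np.asarray(x_new, dtype=float), axis=1)
    nearest = np.argsort(dist)[:k]

    # Pick neighbors with probability W(j); nearest neighbor is most likely.
    rng = np.random.default_rng(seed)
    picks = rng.choice(nearest, size=n_members, p=knn_weights(k))
    return y_new + residuals[picks]

def normal_error_ensemble(y_new, sigma_le, n_members=100, seed=0):
    """Section 4.4 variant: add Normal(0, sigma_le) deviates to y_new."""
    rng = np.random.default_rng(seed)
    return y_new + rng.normal(0.0, sigma_le, size=n_members)
```

With the resampling variant, the ensemble can contain at most k distinct values, which is the limited variety noted above for small samples; the Normal-error variant draws from a continuous distribution instead.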
Whereas the LOCFIT standard error is a local estimate, in linear regression the predictive standard error $\sigma_e$ remains constant regardless of the values of the independent variables. This is a key difference. Since the errors are generated from a Normal distribution with the appropriate standard deviation, the ensembles provide a good variety. Furthermore, because the errors are assumed to be Normally distributed only "locally", the method is able to capture local features in the error structure.

The main advantage of nonparametric approaches is the ability to capture local error structure that can be non-Gaussian or nonlinear. This approach also improves upon the straight bootstrap (Lall and Sharma 1996, DeSouza and Lall 2003) in that values not seen in the historical record can be produced, whereas in the traditional bootstrap only historical values are resampled.

5. Model Evaluation

The models are verified in a cross-validated mode. That is, the data (rainfall and the predictors) for a given year are dropped, and the model(s) fitted to the rest of the data are used to generate an ensemble forecast for the dropped year. This is repeated for all the years. Apart from visual inspection, the ensembles are evaluated on three criteria:

(i) Correlation between the observed value and the median of the ensemble forecast. This is much like evaluating the mean forecast that would come from a standard linear regression model.

(ii) Likelihood function (LLH) (Rajagopalan et al. 2002). This evaluates the skill of the model in capturing the Probability Density Function (PDF).

(iii) Rank Probability Skill Score (RPSS) (Wilks 1995). This evaluates the skill of the model in capturing the categorical probabilities (i.e. the probability distribution function).

The likelihood function (LLH) is applied to measure the skill of the forecast models. The forecast values are first categorized into three divisions: below normal, normal and above normal. The proportions of ensemble forecasts falling into these three categories are then compared with the historical data to develop a skill score.
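The categorization step just described can be sketched as follows. This is an illustrative Python example under stated assumptions, not the authors' procedure: the category boundaries are placed at the 33rd and 66th percentiles of the historical record (as the paper specifies for its three categories), NumPy's default percentile interpolation is used, and the function names are hypothetical.

```python
import numpy as np

def tercile_probabilities(ensemble, historical):
    """Fraction of ensemble members in the below-normal, normal and
    above-normal categories, with boundaries at the 33rd and 66th
    percentiles of the historical data."""
    lo, hi = np.percentile(historical, [33, 66])
    ensemble = np.asarray(ensemble, dtype=float)
    below = np.mean(ensemble < lo)
    above = np.mean(ensemble > hi)
    return np.array([below, 1.0 - below - above, above])

def observed_category(obs, historical):
    """Index of the category the observation fell in:
    0 = below normal, 1 = normal, 2 = above normal."""
    lo, hi = np.percentile(historical, [33, 66])
    if obs < lo:
        return 0
    if obs > hi:
        return 2
    return 1
```

For a given year, the forecast probability of the observed category is then `tercile_probabilities(ensemble, historical)[observed_category(obs, historical)]`, which is the kind of quantity the categorical skill scores compare against the climatological probability.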
The likelihood skill score is defined as:

$$\mathrm{LLH} = \frac{\left[\prod_{t=1}^{N} P_{f,t}\right]^{1/N}}{\left[\prod_{t=1}^{N} P_{c,t}\right]^{1/N}} \qquad [4]$$

where $P_{f,t}$ is the probability assigned by the ensemble forecast to the observed category in year $t$; $P_{c,t}$ is the climatological probability of the observed category in year $t$ (since the rainfall is divided into three categories at the 33rd and 66th percentiles, the climatological probability of each category is 1/3); and $N$ is the length of the time series.

The LLH values vary from 0 to the number of categories (i.e. 3 in this study). A score of zero indicates lack of skill, a score of 1 indicates skill equal to that of climatology, and a score greater than 1 indicates that the forecasts capture the PDF better than climatology.

The ranked probability skill score (RPSS) is also applied to quantify the skill of the forecast models against climatology. This method evaluates the probabilities of the ensemble forecasts falling into each of several categories (in this study: below average, average and above average) and compares them to the historical data. The ranked probability score (RPS) for any given year and the RPSS are defined as:

$$\mathrm{RPS} = \frac{1}{k-1} \sum_{i=1}^{k} \left[ \sum_{n=1}^{i} P_n - \sum_{n=1}^{i} d_n \right]^2 \qquad [5]$$

$$\mathrm{RPSS} = 1 - \frac{\mathrm{RPS(forecast)}}{\mathrm{RPS(standard)}} \qquad [6]$$

for $k$ mutually exclusive and collectively exhaustive categories (in this case we have three categories, so $k = 3$), where the standard (reference) forecast is climatology. The vector $P$ $(P_1, P_2, \dots, P_k)$ contains the forecast probabilities of the categories, and the vector $d$ $(d_1, d_2, \dots, d_k)$ represents the observation, such that $d_n$ equals 1 if the observation fell in category $n$ and 0 otherwise. The RPSS scores vary from $+1$ to $-\infty$ (i.e. perfect skill to bad skill). Scores above 0 indicate improvement over the climatological forecast.

5. Results

Ensemble forecasts are made at the beginning of each month starting from April. The skill of the forecasts from the three models described above is evaluated using the three skill scores described in the previous section and is presented in this section. Skills are also compared during high (wet) and low (dry) years.
Threshold exceedence probabilities during the extreme years are estimated, and the PDFs of the ensembles for a few representative years are also presented.

From the set of predictors (SST, SAT and SLP based) identified in section 3, the optimal subset was taken to be the combination that gave the best forecast skill. Several formal methods are available for subset selection, such as subset regression or cross-validation metrics. Since there are only three predictors in this case, almost all combinations were tried to find the optimal predictor set. For summer monsoon rainfall, the best set of predictors was found to be SLP and SST based. The land temperatures (SAT) did not seem to improve the skill much. The SLP and SST based predictors (MJJ SLPs and MAM SSTs) were used in the modified K-nn model, and a forecast was issued at the beginning of each month starting April 1st for each year. For each issue date, the predictors are the average values from the preceding season (i.e. the preceding three months).

The skill scores are shown in Figure 6 and Table 2. It can be seen that the skill increases significantly as the forecast lead-time decreases. This is intuitive and consistent with expectations. The linear regression and LOCFIT+normal errors show similar skill on all three measures. This indicates that most of the relationship between the predictors and the rainfall is linear, and that linear regression seems appropriate. The modified K-nn is comparable in performance as the lead-time decreases, but at longer leads its performance is weak. This is due to the small sample size and the even smaller number (K) of residuals used in resampling. Recall from section 4.3 that K is typically the square root of the number of data points (21 in this case), which is about 4-5. At long leads, the relationship between the predictors and rainfall is not as strong; consequently, with a small K there is less variety in the ensembles, and a bias if the predictors are not very useful.
This leads to poor skill scores. Notice the significant skill from May 1st onwards, providing a 2-month lead-time that can be very useful for resources planning and management. This useful long-lead skill, regardless of the method, is quite impressive.

In order to compare the performance of these models in extreme years, Figure 7 (a) and (b) (and Table 3) show the skill scores for the high (wet) and low (dry) rainfall years, defined consistently with Singhrattna (2003). Interestingly, the nonparametric models seem to show a slight improvement over linear regression for forecasts starting May 1st, in both the wet and dry years and generally for all the skill measures. This could be explained by the subtle nonlinear relationship between the predictors and the rainfall at the extremes (Fig. 4), which gives nonparametric methods some advantage. Furthermore, notice that the skill in wet years is much higher than that in the dry years, consistent with earlier findings.

PDFs of the ensemble forecasts made on August 1st during selected wet and dry years from the three models are presented along with the climatological PDF and the observed values in Figures 8 and 9, respectively. The observed value is shown as the dotted line. For the high years, the modified K-nn shows the ensembles to be shifted to the right of the climatological PDF. Even though the observed value is not in the middle of the ensemble PDF (as we would like it to be), the forecast can still provide useful information and skill in terms of threshold exceedence probabilities (as will be seen below). For the low years, it can be seen that the LOCFIT+normal method does a good job of exhibiting the shift in the PDF (to the left of the climatological PDF) relative to the linear regression. Also, in 1984 and 1994 the observed value is much closer to the mean of the LOCFIT+normal PDF than to that of the linear regression ensembles.

One of the key variables for decision makers is the probability of threshold exceedences.
We chose 700 mm (the 90th percentile of the data) as a surrogate for wet (or flood) conditions and 400 mm (the 10th percentile of the data) for dry conditions. From the PDFs of the ensembles, the exceedence probabilities are computed for the selected wet and dry years and shown in Table 4. It can be seen that for the wet years the climatological exceedence probability is 0.10, while the ensembles indicate a very high probability of exceedence of this threshold, thus indicating a wet condition. This information can be very helpful in flood emergency response planning and management. For the dry years 1984 and 1994, the models improve significantly upon the climatological non-exceedence probability. It can be seen that the nonparametric models in general show a slight improvement upon the linear regression model. Similar estimates were obtained from forecasts issued in June and July.

6. Implication to Water Management

Increased stress on the Chao Phraya River basin due to population growth is resulting in water quantity and quality problems. To mitigate this, effective planning and management of water resources is called for. In the short term, this requires a good idea of the upcoming monsoon season, i.e. a good seasonal forecast; in the long term, it needs realistic projections of scenarios of future variability and change. There is no known long-lead forecast of Thailand summer monsoon or streamflows; as a result, much of the water resources planning in the Chao Phraya basin is near term, i.e. responding to near term weather forecasts. Given the need, the ensemble forecasts of monsoon rainfall from the models presented above can provide critical information to water managers for effective decision making. Some of the potential applications include:

(i) The threshold exceedence probabilities can provide water managers with probabilistic knowledge of potential drought or flood conditions in the upcoming season.
This can be used to effectively plan annual and seasonal reservoir operations, emergency response preparedness, flood plain management, cropping strategies, conservation measures, etc.

(ii) The threshold exceedence probabilities can also be used as a surrogate for wetness or dryness, providing probabilistic information on flooding potential, landslides, etc., from which optimal response strategies can be developed.

(iii) The ensembles of rainfall can be used to drive a water balance model and generate ensembles of streamflows. These can then be used in water resources decision making to identify optimal responses.

The long-lead forecasts can provide significant information to government and public sector agencies to mobilize resources to meet contingencies. The forecasts will provide water managers with a useful and powerful tool for long-term planning that is currently lacking.

7. Summary

Predictors from large-scale ocean, atmosphere and land variables have been identified that have strong correlations with the Thailand summer monsoon. The predictors are physically consistent and provide 1-2 seasons of lead-time. Interestingly, the predictors are related to the monsoon rainfall only during the post-1980 period, when the monsoon rainfall is correlated with ENSO, as seen in Singhrattna (2003). This suggests the tantalizing possibility that the ENSO relationship could be modulating the predictability, similar to what is seen in the case of the Indian monsoon (Krishna Kumar et al. 1999).

Two modeling approaches for ensemble forecasts of the Thailand summer monsoon are offered: a traditional linear regression (parametric) and a nonparametric method based on local polynomials. In the linear regression model, the regression is used to obtain the mean forecast at the point of interest, and the standard error of prediction that is obtained as a byproduct is used to generate the ensembles assuming a Normal distribution.
In the nonparametric approach, local polynomials are fitted to obtain the mean forecast; then either the residuals in the neighborhood are resampled for the ensembles, or local estimates of the predictive standard error are used to obtain the ensembles assuming a local Normal distribution of residuals. The nonparametric method has the ability to capture nonlinear features and non-Gaussian error features that might be present in the data.

Small sample size is a big constraint. However, there seems to be useful, significant skill at lead times of 3-4 months. This has tremendous implications for water management, early warning and preparedness, and for resources planning in general.

Acknowledgements

The authors would like to thank Somkiat Apipattanavis and Tom Chase for useful discussions. The first author is grateful to the Thailand Government for the graduate student fellowship and to the Thailand Public Works Department for their continued support. Travel support from IUGG to the first author to present preliminary results from this work at the IUGG General Assembly 2003 in Sapporo, Japan is very much appreciated. Funding for the third author via a CIRES Visiting Fellowship is also thankfully acknowledged.

References