COST733CAT - a database of weather and circulation type classifications

Authors
Philipp A., J. Bartholy, C. Beck, M. Erpicum, P. Esteban, R. Huth, P. James, S. Jourdain, T. Krennert, S. Lykoudis, S. Michalides, K. Pianko, P. Post, D. Rassilla Álvarez, A. Spekat, F. S. Tymvios

Abstract
A new database of classification catalogs has been compiled during the COST Action 733 "Harmonisation and Applications of Weather Type Classifications for European regions" in order to evaluate different methods for weather and circulation type classification. This paper gives a technical description of the included methods and provides a basis for further studies dealing with the dataset. Even though the list of included methods is far from complete, it demonstrates the large variety of methods and their variations. In order to allow a systematic overview, a new conceptual systematization of classification methods is presented, reflecting the way types are defined. Methods using predefined types include manual and threshold based classifications, while methods producing derived types include those based on eigenvector techniques, leader algorithms and optimization algorithms such as k-means cluster analysis. The different methods are discussed with respect to their intention and implementation.

Introduction

Classification of weather and atmospheric circulation states into distinct types is a widely used tool for describing and analyzing weather and climate conditions. The principal idea is to transfer multivariate information given on the metrical scale in an input dataset, e.g. a time series of daily pressure fields, to a univariate time series of type membership on the nominal scale, i.e. a so-called classification catalog. The advantage of such a substantial information compression is the straightforward use of the catalogs. On the other hand, the loss of information caused by the classification process sometimes makes it difficult to relate the remaining information to other climate elements like temperature or precipitation, which in most cases is the main objective of applying classifications. It may be a consequence of this contrariness that the number of different classification methods is huge and still increasing, in the hope of finding a classification method that produces a simple catalog while still reflecting the most relevant variability of the climate system. However, the large number of classification methods and their differing results is a drawback, as it is hard to decide which one to use for a certain application. This was the reason to initiate COST Action 733, entitled "Harmonisation and Applications of Weather Type Classifications for European regions". The main goal of this network of European meteorologists and climate scientists is to systematically compare different classification methods and to evaluate whether there might be one or a few universally superior methods which can be recommended. Further, it should be evaluated whether it is possible to combine favorable properties of existing methods in order to develop a reference classification. The basis for this work has been established by producing a database of classification catalogs (called cost733cat) under unified conditions concerning the input dataset and the configuration of the classification procedure, in order to make the resulting catalogs as comparable as possible concerning the method alone.
As a consequence, in some respects the included catalogs might be suboptimal for certain applications; e.g. all methods have been applied to mean sea level pressure, while the 500 hPa geopotential height might be better suited for some applications. However, the current version of the collection, which will be made freely available at the end of COST Action 733 in 2010, can be used for more than intercomparison questions, and future versions might include more suitable configurations.

[COMMENT: A more detailed justification of the selection of fixed numbers of categories in the first place, and in particular of 9, 18 and 27 in the second stage, should be included unless this is covered in another paper, e.g. by WG3. The reason is that even among us there is still a discussion about the physical meaning and the "quality" - in terms of being appropriate for use in certain topics - of the classifications in the catalogue. So we need to stress the intended use, that is intercomparison of the methods in terms of discriminating power/efficiency both for the patterns of key meteorological parameters (WG3) and in other applications (WG4). COMMENT] I tried to point this out above.

In order to describe the included methods, a new systematization has been developed within the COST Action and is used here in the following. Classification methods have long been discriminated into two main groups, e.g. by Yarnal (1993). The first group has been called "manual", while the second group is called "automated". An alternative discrimination refers to "subjective" versus "objective" methods, which is not quite the same, since automated methods, which are often seen as objective, always include subjective decisions. A third group has been established (e.g. Sheridan 2002), called "hybrid" methods, referring to methods that define the types subjectively but assign all observation patterns automatically.
Another distinction can be made between circulation type classifications (CTC), including only information on the atmospheric circulation like air pressure, and so-called weather type classifications (WTC), including also information about other weather elements like temperature, precipitation, etc. However, besides the subjective methods, WTCs are rare, and actually only one method using parameters other than pressure fields is included in cost733cat. The reason is probably the demand from the majority of applications to use CTCs for relating circulation to target variables (circulation-to-environment approach after Yarnal REF), e.g. for downscaling, where only reliable circulation data are available.
With the growing availability of computing capacities during the last decades, the number of automated CTC methods has increased considerably, since it is now easy to modify existing algorithms and produce new classifications. This increased variety of automated methods makes it necessary to find a new systematization of methods, especially accounting for the increased diversity of automated methods and their algorithms (see Huth et al. 2009 for further developments).
This paper is organized as follows: after the description of the classification input dataset, the new methodological systematization is presented and used as the structure for the description of the individual classification methods. Concluding remarks point out the differences and commonalities from the technical point of view.
Input data and configuration

In order to be able to evaluate differences between the resulting classification catalogs, all methods have been applied to daily 12 UTC ERA-40 reanalysis data (REF) within the period 09/1957 to 08/2002, covering all months, thus excluding discrepancies caused by different input datasets. However, while most of the authors contributing to the catalog dataset originally used sea level pressure (SLP), some methods have been applied to other parameters like the geopotential height of the 500 hPa level or wind components, humidity and temperature (the latter three parameters only in one single method called WLK). Therefore, in order to further reduce sources of differences for comparisons, at least one variant of each method has been produced using SLP only.
Another important feature is the clipping of the ERA-40 1° by 1° grid. Different spatial scales and regions have been covered by a set of 12 unified domains throughout Europe, presented in Figure 1 and Table 1. The largest covers the whole of Europe at a reduced grid resolution (domain 00), while the smallest is confined to the greater Alpine area, comprising 12 x 18 grid points (domain 06). All methods have been applied to all 12 domains.
[Figure 1.]
[Table 1.]
While some methods allow an arbitrary number of types to be chosen (like cluster analysis), others are limited to one or a few numbers, either due to their concept (e.g. division by wind sectors) or for technical reasons (e.g. empty classes). The original numbers of types vary between 4 and 43, which makes it rather difficult to compare the classifications. Therefore three reference numbers of types have been specified, namely 9, 18 and 27, in order to keep the departure from these numbers to a maximum of 2 for each method. Again, all methods have been run three times, once for each of the reference numbers of types.
An overview of the different variants of classifications is presented in Table 2.
[Table 2.]

Methods and systematization

From the methodological point of view, two main groups can be discerned concerning the way types are defined. The first strategy is to establish a set of types prior to the process of assignment (called "predefined types" hereafter), while the second is to arrange the entities to be classified (daily patterns in this case) following a certain algorithm, such that the types are, together with the assignment, the result of the process (called "derived types").
...

I. Methods using predefined types

Methods using predefined types include those with subjectively defined circulation patterns and/or weather situations and those where the allocation of days to a type depends on thresholds. The latter thus define the types indirectly by declaring a boundary: e.g. a distinction is made between days with a westerly main flow direction over the domain and days with a northerly, easterly or southerly direction, where the angle between the sectors serves as threshold and boundary between the types.
...

I.1. Subjective definition of types

A common feature of the subjective classifications is their relatively high number of types, ranging between 29 and 43, except for the Peczely classification with 13 types.
...
HBGWL/HBGWT - Hess-Brezowsky Grosswetterlagen/-typen
One of the most famous catalogs is surely the one founded by Baur and revised and developed by Hess and Brezowsky for central Europe. It is now maintained in Potsdam by Gerstengarbe and Werner and freely available at ...
The concept of type definition strongly follows the onflow direction of air masses onto central Europe, discerning zonal, mixed and meridional types, which are further discriminated on a second hierarchical level into a total of 10 "Großwettertypen" and on the last hierarchical level into 29 "Großwetterlagen" plus one undefined type for transition patterns. The subjective definition of types and assignment of daily patterns includes the knowledge of the authors about the importance of the specific circulation patterns for temperature and precipitation conditions in central Europe. Therefore they are called weather types rather than circulation types, because other weather elements, not only circulation patterns, are included in the classification process.

OGWL - Objective Grosswetterlagen
An objectivized version of the "Hess and Brezowsky Großwetterlagen" has been produced by James (), using only circulation composites of the original classification and newly assigning the daily circulation patterns to the types by finding the minimum Euclidean distance.

PECZELY
Gyorgy Peczely, a Hungarian climatologist (1924-1984), originally published his macrocirculation system in 1957. The system was defined on the basis of the geographical location of cyclones and anticyclones over the Carpathian Basin. Altogether 13 types were composed. The abbreviations, the corresponding numbers (used in the code file) and short descriptions of the 13 circulation types are:
Meridional, northern types
mCc (1) Cold front with meridional flow
AB (2) Anticyclone over the British Isles
CMc (3) Cold front arising from a Mediterranean cyclone
Meridional, southern types
mCw (4) Warm front arising from a meridional cyclone
Ae (5) Anticyclone located east of the Carpathian Basin
CMw (6) Warm front arising from a Mediterranean cyclone
Zonal, western types
zC (7) Zonal cyclone
Aw (8) Anticyclone located west of the Carpathian Basin
As (9) Anticyclone located south of the Carpathian Basin
Zonal, eastern types
An (10) Anticyclone located north of the Carpathian Basin
AF (11) Anticyclone located over the Scandinavian Peninsula
Central types
A (12) Anticyclone located over the Carpathian Basin
C (13) Cyclone located above the Carpathian Basin
After the death of Gyorgy Peczely (professor at the University of Szeged, Hungary), one of his followers, Csaba Karossy (professor at the Berzsenyi Daniel College, Szombathely, Hungary), continued the coding process.

PERRET

ZAMG

I.2. Threshold based methods

GWT - Grosswettertypen or Prototype classification
This classification approach is based on predefined circulation patterns determined according to the subjective classification of the so-called Central European Großwettertypen (Hess and Brezowsky, 1952; Gerstengarbe and Werner, 1993).
It is assumed that these 10 Großwettertypen, resulting from a generalization of 29 large-scale weather patterns defined by the geographical position of major centres of action and the location and extension of frontal zones (Gerstengarbe and Werner, 1993), can be sufficiently characterized in terms of varying degrees of zonality, meridionality and vorticity of the large-scale SLP field over Europe. On the basis of this assumption, the classification scheme consists of the following steps:
At first, three prototypical SLP patterns are defined for the region 19°W-38°E and 36°N-64°N, representing idealized W-E, S-N and central low-pressure isobars over the European region. Spatial correlations with these prototypical patterns are calculated for all monthly mean SLP grids; they are referred to as coefficients of zonality (Z), meridionality (M) and vorticity (V), respectively.
The ten Central European Großwettertypen are defined by means of particular combinations of these three correlation coefficients: the Großwettertypen "high pressure over Central Europe" and "low pressure over Central Europe" result from a maximum V coefficient (negative and positive, respectively). The remaining eight main circulation types are defined in terms of the Z and M coefficients corresponding to the main isobar directions over central Europe (e.g. Z = 1 and M = 0 for the W-E pattern, Z = 0.7 and M = 0.7 for the SW-NE pattern, and so on). Each monthly SLP grid outside the high- and low-pressure samples is assigned to one of these direction types according to the minimum Euclidean distance of its Z and M coefficients from those of the predefined prototypes.
A further subdivision of these eight circulation type samples into cyclonic and anticyclonic subsamples is achieved according to the sign of the corresponding V coefficient.

LITADVE/LITTC - Litynski advection and circulation types
1.) Calculate two (LITADVE) or three (LITTC) indices: the meridional index Wp, the zonal index Ws and, for LITTC only, the sea level pressure at the central point of the domain, Cp. Wp and Ws are defined by the averaged components of the geostrophic wind vector and describe the advection of the air masses.
2.) Calculate the lower and upper boundary values of Wp, Ws (and Cp) for each day.
3.) The component is N (for Wp), E (for Ws) or C (for Cp) when the index is less than the lower boundary value.
4.) The component is 0 (for Wp), 0 (for Ws) or 0 (for Cp) when the index lies between the lower and upper boundary values.
5.) The component is S (for Wp), W (for Ws) or A (for Cp) when the index is not less than the upper boundary value.
6.) Finally, the type is the superposition of these two (LITADVE) or three (LITTC) components.

LWT2 - Lamb weather types version 2
The LWT2 method is a modified and improved version of the objective Jenkinson-Collison (JC) system for classifying daily MSLP fields into 26 flow categories, indicating flow direction and vorticity.
The LWT2 grid can be placed anywhere. The vorticity and flow strength thresholds (of JC) are modified dynamically so that exactly 33% of days (in ERA-40) fall into each of the three vorticity classes Ax, Ux and Cx.
The size of the LWT2 grid is usually optimized so that the mean within-type pattern correlations are maximized. However, for the COST733 sub-domains, a fixed grid size is forced as a function of the sub-domain size. Thus, while free LWT2 grids normally have a typical north-south extent of about 24-30 degrees of latitude, the forced LWT2 grids for the WG2 sub-domains have extents of 12-18 degrees. The large domain's LWT2 grid is also forced and has an extent of 46 degrees, much larger than normal.
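To make the threshold based approach more concrete, the following Python sketch illustrates the kind of flow-direction and vorticity indexing used by Jenkinson-Collison-type schemes such as LWT2. It is a deliberately simplified illustration, not the operational LWT2 code: it omits the specific 16-point grid, geostrophic scaling factors and the pure anticyclonic/cyclonic types (so only 8 x 3 = 24 of the 26 LWT2 categories appear), and all array and function names are assumptions made for this sketch.

```python
import numpy as np

def flow_indices(slp, dy_km=111.0, dx_km=78.0):
    """slp: (ndays, nlat, nlon) daily SLP fields on a regular grid."""
    dp_dy = np.gradient(slp, axis=1) / dy_km      # gradient along the latitude index
    dp_dx = np.gradient(slp, axis=2) / dx_km      # gradient along the longitude index
    w = dp_dy.mean(axis=(1, 2))                   # proxy for westerly flow strength
    s = dp_dx.mean(axis=(1, 2))                   # proxy for southerly flow strength
    # Vorticity proxy: Laplacian of SLP, averaged over the grid.
    z = (np.gradient(np.gradient(slp, axis=1), axis=1) / dy_km ** 2
         + np.gradient(np.gradient(slp, axis=2), axis=2) / dx_km ** 2).mean(axis=(1, 2))
    return w, s, z

def classify(w, s, z):
    """Combine 8 flow-direction sectors with 3 tercile-based vorticity classes."""
    angle = np.degrees(np.arctan2(s, w)) % 360.0  # angle of the resultant flow vector
    sector = np.round(angle / 45.0).astype(int) % 8
    # Thresholds chosen dynamically so that about one third of all days falls
    # into each vorticity class (anticyclonic / indifferent / cyclonic), as in LWT2.
    lo, hi = np.percentile(z, [100 / 3, 200 / 3])
    vort = np.digitize(z, [lo, hi])               # 0, 1, 2
    return sector + 8 * vort                      # type index 0..23

# toy example with random data standing in for daily SLP fields
slp = np.random.rand(365, 12, 18) * 20 + 1000
types = classify(*flow_indices(slp))
```

The dynamic tercile thresholds mirror the LWT2 idea of forcing equally populated vorticity classes, but the gradient-based indices above are only stand-ins for the proper geostrophic flow and vorticity formulas of the JC scheme.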
WLK - Wetterlagenklassifikation
1.) Determine the main wind sector: the wind direction at 700 hPa indicates the flow, derived from the u- and v-components of the true wind. Two thirds of all weighted wind directions have to fall into a sector of 10°; the 10° sector is then assigned to a quadrant.
2.) For each gridpoint ∇²(geopotential) and its weighted area mean [gpdm²/km² x 10,000] are calculated.
3.) The weighted area mean value of precipitable water (whole atmosphere) is compared to the daily area mean value (annual variation), indicating values above or below the daily mean.

II. Methods producing derived types

II.1. PCA based methods

TPCA - t-mode principal component analysis
a) Introduction and historical context
The potential of principal component analysis (PCA) to be used as a classification tool was suggested by Richman (1981), and the idea was developed and discussed in more depth by Gong and Richman (1995). The basic idea of using PCA as a classification tool consists in assigning each case to that principal component for which it has the highest loading. To classify circulation patterns (as well as patterns of other atmospheric variables), PCA should be used in T-mode (i.e., rows of the data matrix correspond to gridpoints and columns to days), not in S-mode (where rows correspond to days and columns to gridpoints); this was discussed and proved to a different extent by Richman (1986), Drosdowsky (1993), Huth (1993, 1996a), and Compagnucci and Richman (2007). The first application of T-mode PCA to the classification of circulation patterns dates back to Compagnucci and Vargas (1986), who analyzed artificial data; the first analysis of this kind based on real data was conducted by Huth (1993). In order to obtain a real classification, rotated PCA must be used; usually, an oblique rotation yields better results than an orthogonal one (Huth 1993). Nevertheless, results of an unrotated analysis can also be interpreted, though not directly as circulation types (e.g., Compagnucci and Salles 1997). The classification based on raw data yields better results than that of anomalies, because the latter creates artificial types and has difficulties with classifying patterns close to the time-mean flow. The use of correlation rather than covariance as a similarity measure is recommended, since the latter would give more influence to the patterns with larger spatial variability, for which there is no reason; in fact, the difference in results between the correlation and covariance matrix is usually small (Huth 1996a). An important choice is the number of principal components to retain and rotate; Huth (1996a) showed that the numbers of PCs selected by the rule of O'Lenic and Livezey (1988), i.e., whose eigenvalue is well separated from the following one, tend to yield better solutions. When raw data are used, the number of types (classes) is identical to the number of principal components. A comparison of various circulation classification methods (Huth 1996b) demonstrated that TPCA is excellent in uncovering the real structure of data (i.e., it succeeds in reproducing a classification known in advance), however at the expense of a lower between-group separation. It is important to stress that the fact that the correlation / covariance matrix in TPCA in a typical climatological setting (more temporal than spatial points) is singular does not pose any limitations on the calculation procedure and does not lead to the instability of results alleged e.g. by Ehrendorfer (1987). Other applications of T-mode PCA as a pattern classification tool (not only of circulation fields) include Bartzokas et al. (1994), Bartzokas and Metaxas (1996), Huth (1997), De and Mazumdar (1999), Huth (2000, 2001), Compagnucci et al. (2001), Salles et al. (2001), Jacobeit et al. (2003), Müller et al. (2003), Brunetti et al. (2006), and Huth et al. (2007).
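The core assignment rule of T-mode PCA described above - each day is classified with the principal component on which it loads highest - can be sketched in a few lines of Python. The sketch below is an illustration only: it uses an unrotated solution and a fixed number of PCs, whereas the actual TPCA catalogs rely on obliquely rotated components, an eigenvalue-separation criterion and the subset/projection procedure described in the next subsection; all names are assumptions of this sketch.

```python
import numpy as np

def tmode_pca_classify(fields, n_pc):
    """fields: (ndays, ngridpoints) array of daily pressure fields."""
    # Remove the spatial mean of each pattern (cf. step 1 of the TPCA procedure).
    x = fields - fields.mean(axis=1, keepdims=True)
    # T-mode: correlation matrix between days.
    r = np.corrcoef(x)                                     # (ndays, ndays)
    eigval, eigvec = np.linalg.eigh(r)                     # ascending eigenvalues
    order = np.argsort(eigval)[::-1][:n_pc]
    loadings = eigvec[:, order] * np.sqrt(eigval[order])   # loadings of days on PCs
    # Flip PCs whose loadings are predominantly negative (the sign is arbitrary).
    flip = np.sign(loadings.sum(axis=0))
    flip[flip == 0] = 1.0
    loadings *= flip
    # Assign each day to the PC (type) with the highest loading.
    return loadings.argmax(axis=1)

# toy example: 1000 days on an 18 x 12 grid, 9 types
days = np.random.rand(1000, 18 * 12)
catalog = tmode_pca_classify(days, n_pc=9)
```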
b) Settings of the method
Here we apply TPCA in a setting similar to Huth (2000). Since the calculation of the correlation matrix and principal components (the size of the correlation matrix in our case would be 16436 x 16436) is extremely demanding on computer resources and practically intractable on a personal computer, we avoid it by calculating PCA on a subset of the data and subsequently projecting the obtained principal components onto the rest of the data. Such an approach was shown to be reliable. The procedure is as follows:
1. Data preprocessing: the spatial mean is calculated for each pattern and subtracted from the data. The reason is that it is reasonable to treat patterns with the same spatial structure, differing only by a constant, as identical.
2. The dataset is divided into ten subsets by selecting the 1st, 11th, 21st, etc. day as the first subset, the 2nd, 12th, 22nd, etc. as the second subset, ..., and the 10th, 20th, 30th, etc. as the tenth subset. Steps 3 to 9 are repeated for each subset separately.
3. The correlation matrix is calculated.
4. Principal components (loadings and scores) are calculated.
5. The plausible numbers of principal components are selected; the criterion is that there is a pronounced drop between the selected and the next eigenvalue in the eigenvalue vs. PC number diagram. Usually two to three numbers are selected. (The plausible numbers of PCs are usually the same in all the analyses of the ten subsamples.)
6. The oblique rotation (direct oblimin method) is conducted for the selected numbers of components.
7. The principal components are projected onto the rest of the data by solving the matrix equation Φ Aᵀ = Fᵀ Z, where F and Φ are the matrices of PC scores and PC correlations, respectively, Z is the full data matrix, and A contains the "PC loadings" (pseudo-loadings) to be determined.
8. The sign of principal components with a prevalent negative sign is changed (the sign is assigned to the components to some extent randomly and has no real meaning).
9. Each day is classified with that PC (type) for which it has the highest loading.
10. Contingency tables are used to compare the resulting classifications based on the ten subsamples. The classifications are usually comparable (i.e., yield very similar types); if one (or several) of them differs considerably from the rest, it is not considered further.
11. The final classification (one for each selected number of PCs) is taken by a random choice from those passing point 10. That is, if e.g. three numbers of PCs were chosen as plausible for a given domain, three different final classifications are produced.
The following classifications are available:
domain   no. of types
00       7, 12
01       7, 9
02       7, 11
03       7, 9
04       7, 9
05       7, 9
06       7, 9
07       7, 9
08       7
09       7, 9
10       7, 9
11       7, 6
The classifications are denoted as Huth_TPCAnn_Dxx.txt, where xx is the number of the domain (two digits) and nn is the number of types (two digits).
Note: Due to the splitting into a classification TPCA07 (7 PCs for all domains) and TPCA (varying numbers of PCs for the domains), the names have now been changed to Huth_TPCA_Dxx.txt, without the numbers.

P27 - Kruizinga empirical orthogonal function types
P27 is a simple eigenvector based classification scheme (Kruizinga 1978, 1979). The scheme uses daily 500 hPa heights on a regular grid (originally 6 x 6 points with a step of 5° in latitude and 10° in longitude). The actual 500 hPa height h_tq for day t at gridpoint q is first reduced by subtracting the daily average height over the grid, h̄_t. This operation removes a substantial part of the annual cycle. The reduced 500 hPa heights p_tq are given by
p_tq = h_tq - h̄_t,  q = 1, ..., n (n the number of gridpoints), t = 1, ..., N (N the number of days).
The vector of reduced 500 hPa heights is approximated as
p_t ≈ s_1t a_1 + s_2t a_2 + s_3t a_3,  t = 1, ..., N,
where a_1, a_2, a_3 are the first three principal component vectors (eigenvectors of the second-moment matrix of p_t) and s_1t, s_2t, s_3t are their amplitudes or scores.
The flow pattern of a particular day is thus described by the three amplitudes s_1t, s_2t, s_3t:
s_1t a_1 characterizes the east-west component of the flow,
s_2t a_2 the north-south component, and
s_3t a_3 the cyclonicity (or anticyclonicity).
The range of each amplitude is divided into three equiprobable intervals; each pattern is then, on the basis of its amplitudes, uniquely assigned to one of the 3 x 3 x 3 = 27 possible interval combinations. The type numbers show the intervals to which the type belongs: the first number gives the intensity of the westerly flow (1 easterly or weak westerly, 2 medium, 3 strong westerly flow), the second number the intensity of the southerly flow (1 northerly, 3 southerly), and the third number the direction of the vorticity (1 anticyclonic, 3 cyclonic, 2 in between).
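A compact Python sketch of the P27 scheme as described above is given below. It is an illustration under simplifying assumptions (eigenvectors obtained directly via an SVD of the reduced heights, terciles computed from the sample itself, illustrative array names), not the original Kruizinga implementation.

```python
import numpy as np

def p27_classify(z500):
    """z500: (ndays, ngridpoints) daily 500 hPa heights."""
    p = z500 - z500.mean(axis=1, keepdims=True)      # reduced heights p_tq
    # Eigenvectors of the second-moment matrix p^T p via an SVD of p.
    _, _, vt = np.linalg.svd(p, full_matrices=False)
    a = vt[:3]                                       # a_1, a_2, a_3
    s = p @ a.T                                      # amplitudes s_1t, s_2t, s_3t
    # Equiprobable (tercile) intervals for each amplitude -> digits 1..3.
    digits = np.empty((p.shape[0], 3), dtype=int)
    for k in range(3):
        t1, t2 = np.percentile(s[:, k], [100 / 3, 200 / 3])
        digits[:, k] = np.digitize(s[:, k], [t1, t2]) + 1
    # Combine the three digits into a single type number 1..27.
    return (digits[:, 0] - 1) * 9 + (digits[:, 1] - 1) * 3 + digits[:, 2]

# toy example: 1000 days on the original 6 x 6 grid
z500 = np.random.rand(1000, 36) * 300 + 5500
types = p27_classify(z500)
```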
PCAXTR - principal component analysis extreme scores
The methodology is based on the well-known rotated principal component analysis, introducing some modifications:
- S-mode with previous standardization of the rows (spatial standardization),
- correlation matrix,
- scree test and North's rule of thumb as selection methods,
- Varimax (orthogonal) rotation,
- decision on the number of clusters and their centroids with the "extreme scores method" (Esteban et al. 2005, 2006, 2009),
- assignment of the remaining cases to the groups by the Euclidean distance to the nearest centroid.
In this way, the main novelty of this procedure is the so-called "extreme scores method", intended to facilitate the decision on the number of groups and the centroids needed for the classification. For this purpose, the spatial variation patterns (modes) established by the PCA, i.e. the principal components retained and rotated, are considered in their positive and negative phases as potential groups for the circulation pattern classification. The centroids are calculated by averaging the days that fulfil the "extreme scores" principle for a certain pattern and phase: observations with high score values for a certain component (normally values higher than +2 for the positive phase, or lower than -2 for the negative phase), but with low score values for the remainder (normally between +1 and -1), are selected. Using this technique, if there is no real case that could be assigned to the component under consideration (i.e., there is no day with a close spatial structure), this component and phase are considered an artificial result of the PCA, and the corresponding potential group (circulation pattern) is eliminated. Briefly, the extreme scores procedure establishes the number of groups and their centroids for the clustering method, but also acts as a filter against artefacts in the final classification in case the Varimax rotation used before in the PCA process has not removed them as well as expected.

II.2. Leader algorithm

Methods based on the so-called leader algorithm were established when computing capacities became available but were still low. They try to find key (or leader) patterns in the sample of maps, which are located in the centers of high-density clouds of entities within the multidimensional phase space spanned by the variables, i.e. the grid point values. Thus the aim is close to that of non-hierarchical cluster analysis, but the expensive iterations of the latter are avoided.

LUND
1.) Calculate Pearson correlation coefficients between all days,
2.) for each day count all correlation coefficients r > 0.7,
3.) define the day with the largest number of correlation coefficients r > 0.7 as the key pattern day for type 1,
4.) remove from the dataset the key pattern day for type 1 as well as all days with a correlation coefficient r > 0.7 to that key day,
5.) on the remaining dataset apply steps 2.) to 4.) for the next types, until all types (of the predefined number of types) have a key day,
6.) assign each day to a key day according to the highest correlation coefficient.

E - Erpicum et al.
Our classification algorithm works as follows:
1. We compute the geometrical 3D direction of the vector normal to the 500 hPa geopotential (Z500) / sea level pressure (SLP) surface at all grid points of each domain.
2. We compute the cosine of the angle between these vectors for all pairs of days at each grid point.
3. We sum these cosines to create a similarity index for all pairs of weather maps (of the ERA-40 period, i.e. 09/1957 to 08/2002). The index is normalized by the number of grid points. It equals 1 for a weather map compared with itself and is near zero if the two weather maps (i.e. days) are very different.
4. For a given similarity index threshold (between 1 and 0), we compute for each weather map the number of similar weather maps (i.e. maps whose similarity index is higher than the threshold). The reference weather map of the first class is the weather map with the maximum number of similar weather maps, the second one is that of the second largest class, and so on. The number of classes is fixed here to 10 or 30.
5. We decrement the similarity index threshold from 1 towards 0 until 99% of the weather maps are classified. Therefore, 1% of the weather maps remain unclassified.
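The similarity index of the Erpicum et al. scheme can be sketched as follows. This is an illustration of the idea only: the interpretation of the "vertical vector" as the unit normal of the pressure surface, the gradient spacing and the scaling of the vertical component are assumptions of this sketch, and all names are illustrative.

```python
import numpy as np

def surface_normals(field, dx=1.0, dy=1.0, scale=1.0):
    """Unit normal vectors of a 2D field z(y, x); result shape (nlat, nlon, 3)."""
    dz_dy, dz_dx = np.gradient(field, dy, dx)
    n = np.stack([-scale * dz_dx, -scale * dz_dy, np.ones_like(field)], axis=-1)
    return n / np.linalg.norm(n, axis=-1, keepdims=True)

def similarity(field_a, field_b):
    """Mean cosine between the normal vectors of two daily maps (maximum 1)."""
    na, nb = surface_normals(field_a), surface_normals(field_b)
    return float(np.mean(np.sum(na * nb, axis=-1)))

# toy example with two random SLP maps on a 12 x 18 grid
day1 = np.random.rand(12, 18) * 20 + 1000
day2 = np.random.rand(12, 18) * 20 + 1000
print(similarity(day1, day1))   # 1.0: a map compared with itself
print(similarity(day1, day2))   # < 1 for differing maps
```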
KH - Kirchhofer types
1.) Calculate spatial correlation coefficients between all fields, as well as between single rows and single columns,
2.) assign to each pair of days the minimum of the whole-grid, single-column and single-row correlation coefficients,
3.) for each day count all correlation coefficients r > 0.4,
4.) define the day with the most correlations r > 0.4 as the key pattern for type 1,
5.) take out the key day and all days with r > 0.4 to the key day,
6.) on the remaining days proceed as in steps 2.) to 4.) for the following type numbers, until all types (of the predefined number of types) have a key day,
7.) finally choose the type number of each day (using the whole sample again) by the highest correlation coefficient to any of the key days.

II.3. Optimisation algorithms

CKMEANS
CKMEANS is a k-means clustering algorithm with a few modifications, e.g. with respect to obtaining the starting partition:
1. Random selection of one day (daily data are used).
2. Computation of the distance measure to all remaining days, using information from 500 and 1000 hPa gridded geopotential reanalysis fields; the nine most dissimilar days are identified. Together with the randomly selected day from step 1 this gives a starting partition of 10. The number 10 is explained in Remark 1 below.

PCACA - k-means by seeds from hierarchical cluster analysis of principal components
This method is a combination of two well-known multivariate statistical techniques, principal component analysis and clustering. Most of it follows the recommendations proposed by Yarnal (1993).
Thus, for the large domain and each of the smaller subdomains, the analysis was initiated by running a principal component analysis (S-mode) to reduce collinearity, simplifying the numerical calculations and improving the performance of the subsequent clustering procedure.
In order to avoid that the PCA extracts the annual cycle of sea level pressure (strong/weak pressure gradients in winter/summer) as the greatest source of common variance in the first component, while still keeping the seasonality of the circulation patterns (e.g. more cyclonic patterns in winter), the classification procedure was not performed season by season; instead, the raw (original) sea level pressure data were filtered with a high-pass filter (Hewitson and Crane, 1992) to emphasize the synoptic variability over the other sources of variability. Day-to-day cycles in sea level pressure were identified by submitting the daily time series of sea level pressure map means (the mean value of all grid points of the corresponding day's map) to a spectral and autocorrelation analysis. The results show that, on average, cycles longer than 13 days can be considered non-synoptic and can therefore be removed by subtracting a 13-day running mean of the grid (from day t - 6 to day t + 6, centered on day t) from each data point in each grid.
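A minimal Python sketch of this 13-day running-mean high-pass filter is given below: the mean over days t-6 to t+6 is subtracted from day t at every grid point. The truncated windows at the start and end of the series, as well as the array names, are assumptions of this sketch (those edge days could also simply be discarded).

```python
import numpy as np

def highpass_13day(slp, half_width=6):
    """slp: (ndays, nlat, nlon); returns SLP anomalies varying about zero."""
    ndays = slp.shape[0]
    filtered = np.empty_like(slp, dtype=float)
    for t in range(ndays):
        lo, hi = max(0, t - half_width), min(ndays, t + half_width + 1)
        filtered[t] = slp[t] - slp[lo:hi].mean(axis=0)
    return filtered

# toy example standing in for the daily SLP series of one domain
slp = np.random.rand(400, 12, 18) * 20 + 1000
anomalies = highpass_13day(slp)
print(anomalies.shape)   # (400, 12, 18)
```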
This high-pass filter provides de-seasoned daily maps of sea level pressure, showing spatial differences in pressure from the surrounding 13-day mean map. The retained map patterns are identical to the unfiltered maps, except that the values vary about zero. Although seasonal and annual cycles in mean sea level pressure are removed, the seasonal frequency of the various patterns is not changed (e.g. more cyclonic patterns in winter).
The matrix was arranged in S-mode, with the principal components extracted from a correlation matrix. The significant components (mostly 3 components in the subdomains, 9 in the larger domain) were identified through several tests (e.g. scree test, Jolliffe's test, North's test) and subsequently rotated with a VARIMAX algorithm to avoid the classical spatial problems associated with this mode of decomposition (domain shape, ...).
The daily time series of component scores were the input to the clustering procedure. A two-stage modus operandi has been used: an appropriate number of clusters and initial "seed points" are identified using a hierarchical algorithm (Ward's method), and the partition is then refined by the non-hierarchical k-means algorithm to obtain a more robust classification. The primary partition was selected with the help of some statistical tests (R², pseudo-F, pseudo-T², ...); when the cutting points were not clear, several solutions were run, the resulting types were plotted, and the best one was finally chosen manually by comparison with composite plots of climatically remarkable monthly episodes (temperature and precipitation) for each sub-domain.

PETISCO - k-means
We take each day in the sample and, using this day as a seed, form the group of days similar to the seed; we then use the average of the group as a new seed and again form the group of days similar to it, repeating the process until a stable group is obtained. Finally we select as a subtype the average of the group that contains the most days, and remove its days. We repeat the process with the remaining days in the same way, continuing to select as a new subtype the average of the largest group not similar to the previously selected subtypes. The process continues until a new selection is not possible or until the size of the group is smaller than a certain number.
The similarity measure used is the correlation coefficient between the gridded data fields of the two elements whose similarity is being examined, and it is calculated for the MSLP and for the 500 hPa geopotential fields. To improve the similarity analysis, the domain is divided into two subdomains and the correlation is also calculated in each of them, so two elements are considered similar if the correlation values for MSLP and for 500 hPa exceed a threshold both in the whole domain and in each subdomain. We have used a correlation value of 0.90 as the threshold to select the subtypes for each domain, in order to represent a large number of synoptic patterns.
These subtypes are then grouped into types: first, we select a number of dissimilar subtypes, and then the remaining subtypes are grouped with the most similar of the previously selected subtypes. At this stage we use 0.80 as the threshold value.
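The seed-growing step of this procedure can be sketched as follows. For brevity the sketch uses a single field and one domain-wide correlation threshold, whereas the full method checks both MSLP and Z500 over the whole domain and two subdomains and also groups the subtypes into types afterwards; all names and simplifications here are illustrative only.

```python
import numpy as np

def corr(a, b):
    """Pearson correlation between two daily grids."""
    return float(np.corrcoef(a.ravel(), b.ravel())[0, 1])

def grow_group(fields, seed, threshold=0.9, max_iter=50):
    """Grow a group of days around a seed until its membership is stable."""
    members = [i for i, f in enumerate(fields) if corr(f, seed) > threshold]
    for _ in range(max_iter):
        if not members:
            break
        seed = np.mean([fields[i] for i in members], axis=0)
        new_members = [i for i, f in enumerate(fields) if corr(f, seed) > threshold]
        if new_members == members:          # stable group reached
            break
        members = new_members
    return members, seed

def find_subtypes(fields, threshold=0.9, min_size=5):
    """Repeatedly extract the largest stable group; its average is a subtype."""
    remaining = list(range(len(fields)))
    subtypes = []
    while remaining:
        best_members, best_mean = [], None
        for i in remaining:                  # every remaining day is tried as a seed
            members, mean = grow_group([fields[j] for j in remaining],
                                       fields[i], threshold)
            if len(members) > len(best_members):
                best_members = [remaining[m] for m in members]
                best_mean = mean
        if len(best_members) < min_size:
            break                            # no further selection possible
        subtypes.append(best_mean)
        remaining = [d for d in remaining if d not in best_members]
    return subtypes

# usage sketch: subtype_centroids = find_subtypes(list_of_daily_slp_fields)
```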
In our previous work we used a lower resolution (5° x 5°); in this work the higher resolution can introduce a lot of redundant information inside the domain related to the border, which might produce a worse representation of the similarity at the domain border. Nevertheless, the method has been applied as it was originally applied, without performing a principal component analysis.

PCAXTRKM
This method follows the same procedure as PCAXTR, introducing a change only in the last step: the assignment of the remaining cases to the groups is made with the k-means non-hierarchical clustering method. Concerning these two different ways of clustering the cases (without iterations in PCAXTR and with iterations in PCAXTRKM), some authors have pointed out that the iterations of the k-means method tend to equalize the sizes of the groups (Huth, 1996), probably related to a strongly continuum-like underlying structure of the data. Recently, Philipp et al. (2007) highlighted some limitations of k-means regarding the dependence on the ordering of checks and reassignments; thus different local optima can be obtained with different starting partitions or different k-means algorithms.

SANDRA/SANDRAS - simulated annealing and diversified randomization clustering
The numbers of clusters for COST733 are subjectively chosen in order to reach a compromise between a reasonably low number of types allowing an overview and the MSLP variation explained by the clustering. For all domains it is around 20, as suggested also by an external criterion for t-mode PCA. Fine adjustment is done by selecting elbows of the silhouette index.
Method: simulated annealing clustering of 3-day sequences of daily Z925 and Z500 fields together, i.e. one day is described by the Z925 and Z500 fields of the day itself and the two preceding days.

NNW - neural network self-organizing maps (Kohonen network)
The methodology used is artificial neural networks, and the number of clusters chosen is considered to be proportional to the size of the domain. The results are in tabular form for the 500 hPa contour analyses.

Discussion and concluding remarks

Manual or subjective methods are still in use and have their value due to the integration of the experience of experts, which is often hard to formulate as precise rules for automated classification. Therefore four subjective catalogs have been included in the dataset for comparison (HBGWL, PECZELY, PERRET and ZAMG), even though it is unlikely that new manual classifications will be established in the future. Disadvantages are the sometimes imprecise definition of the types, which may lead to inhomogeneities, and the fact that they are made for specific geographical regions and are not scalable. The former problem is overcome by the automatic assignment of days according to their pressure patterns, as done by the OGWL method; the latter problem, however, remains.
Subjective methods are now mostly replaced by threshold based methods, which have the advantage of automatic processing and a lower degree of subjectivity. However, subjectivity also enters through the definition of the thresholds. Even though some thresholds seem natural (like the symmetric partitioning of the wind rose), it is always an artificial or arbitrary decision to use certain thresholds and not others. On the other hand, threshold based methods are attractive for their clear and precise definition of types.
The list of methods included in the cost733cat database is by far not complete. However, it is believed that the most common basic methods are represented, together with some of their variations. The methods not included are mostly those which are specialized for a certain target variable or do not fulfill the criterion of producing distinct and unambiguous classifications of the entities. An example of a method combining both aspects is the fuzzy classification method developed by Bardossy et al. (), which tries to find circulation types explaining the variations of a specified target variable, e.g. station precipitation time series, while the daily maps are members of more than one class. The consequence of optimizing a classification for one or a few target variables is that the resulting catalog is applicable for a small target region but not for climate variations on a larger, e.g. continental, scale. On the other hand, one of the attractive properties of classification methods is the transfer from the metric scale to the nominal scale, which is violated by the principle of fuzzy classification. These aspects should not diminish the value of the mentioned methods, since they are mostly superior to generalized and discrete methods, but their results are valid for specific purposes only.
However, further studies of the presented classification dataset might show that the idea of generalized classifications, including the main links between circulation and climate variability in different or large regions, is unrealistic, and that the attempt to find generally "better" classification methods is like shooting at a moving target. If a method is optimal for circulation it may be bad for explaining temperature variations; if it is optimal for temperature it may be bad for precipitation, etc. Thus a universally best method might be unreachable, which would explain the large number of attempts to develop more and more variations of existing methods or to develop completely new ones. It is the hope of the authors that this database, reflecting the large variety of classification methods, will help to reach a conclusion on whether it is worth spending such large efforts in the search for the ultimate best method. However, even if studies on the intercomparison of the classifications only show that some of the methods are not useful in many respects, the aim of the work on this catalog collection will have been reached.

References
Ekstrom M, Jonsson P, Barring L. 2002. Synoptic pressure patterns associated with major wind erosion events in southern Sweden. Climate Research 23: 51-66.
Esteban P, Jones PD, Martín-Vide J, Mases M. 2005. Atmospheric circulation patterns related to heavy snowfall days in Andorra, Pyrenees. International Journal of Climatology 25: 319-329.
Esteban P, Martin-Vide J, Mases M. 2006. Daily atmospheric circulation catalogue for western Europe using multivariate techniques. International Journal of Climatology 26: 1501-1515.
Esteban P, Ninyerola M, Prohom M. 2009. Spatial modelling of air temperature and precipitation for Andorra (Pyrenees) from daily circulation patterns. Theoretical and Applied Climatology. In press.
Online at: http://www.springerlink.com/content/r2728365g085q007/?p=0713eaeac536434ebc4b237b6677218a&pi=1
Hewitson B, Crane RG. 1992. Regional climates in the GISS Global Circulation Model and synoptic-scale circulation. Journal of Climate 5: 1002-1011.
Yarnal B. 1993. Synoptic Climatology in Environmental Analysis. Belhaven Press, London.

Figure 1. (will be finished soon and made available on the wiki)

Table 1: Identification numbers, names and coordinates of the spatial domains defined for the classification input data.

id  name                    longitudes                 latitudes
00  Europe                  37°W to 56°E by 3° (32)    30°N to 76°N by 2° (24)
01  Iceland                 34°W to 3°W by 1° (32)     57°N to 72°N by 1° (16)
02  West Scandinavia        06°W to 25°E by 1° (32)    57°N to 72°N by 1° (16)
03  Northeastern Europe     24°E to 55°E by 1° (32)    55°N to 70°N by 1° (16)
04  British Isles           18°W to 08°E by 1° (27)    47°N to 62°N by 1° (16)
05  Baltic Sea              08°E to 34°E by 1° (27)    53°N to 68°N by 1° (16)
06  Alps                    03°E to 20°E by 1° (18)    41°N to 52°N by 1° (12)
07  Central Europe          03°E to 26°E by 1° (24)    43°N to 58°N by 1° (16)
08  Eastern Europe          22°E to 45°E by 1° (24)    41°N to 56°N by 1° (16)
09  Western Mediterranean   17°W to 09°E by 1° (27)    31°N to 48°N by 1° (18)
10  Central Mediterranean   07°E to 30°E by 1° (24)    34°N to 49°N by 1° (16)
11  Eastern Mediterranean   20°E to 43°E by 1° (24)    27°N to 42°N by 1° (16)

Table 2: Methods and variants overview. Alphabetical list of abbreviations used to identify individual variants of classification configurations. Column 2 gives the number of types, column 3 the parameters used for classification, and column 4 the method group: sub (subjective), thr (threshold based), pca (based on principal component analysis), ldr (leader algorithm), opt (optimization algorithm).

abbreviation    types   parameters              method group
CKMEANSC09      9       SLP                     opt
CKMEANSC18      19      SLP                     opt
CKMEANSC27      27      SLP                     opt
EZ850C10        10      Z850                    ldr
EZ850C20        20      Z850                    ldr
EZ850C30        30      Z850                    ldr
ESLPC09         9       SLP                     ldr
ESLPC18         19      SLP                     ldr
ESLPC27         27      SLP                     ldr
GWT             18      SLP                     thr
GWTC10          10      SLP                     thr
GWTC18          18      SLP                     thr
GWTC26          26      SLP                     thr
KHC09           9       SLP                     thr
KHC18           18      SLP                     thr
KHC27           27      SLP                     thr
LITADVE         9       SLP                     thr
LITTC18         18      SLP                     thr
LITTC           27      SLP                     thr
LUND            10      SLP                     ldr
LUNDC09         9       SLP                     ldr
LUNDC18         18      SLP                     ldr
LUNDC27         27      SLP                     ldr
LWT2C10         10      SLP                     thr
LWT2C18         18      SLP                     thr
LWT2            26      SLP                     thr
NNW             9-30    Z500                    opt
NNWC09          9       SLP                     opt
NNWC18          18      SLP                     opt
NNWC27          27      SLP                     opt
P27             27      Z500                    pca
P27C08          8       SLP                     pca
P27C16          16      SLP                     pca
P27C27          27      SLP                     pca
PCACA           4-5     SLP                     opt
PCACAC09        9       SLP                     opt
PCACAC18        18      SLP                     opt
PCACAC27        27      SLP                     opt
PCAXTR          11-17   SLP                     pca
PCAXTRC09       9-10    SLP                     pca
PCAXTRC18       15-18   SLP                     pca
PCAXTRKM        11-17   SLP                     opt
PCAXTRKMC09     9-10    SLP                     opt
PCAXTRKMC18     15-18   SLP                     opt
PETISCO         25-38   SLP, Z500               opt
PETISCOC09      9       SLP                     opt
PETISCOC18      18      SLP                     opt
PETISCOC27      27      SLP                     opt
SANDRA          18-23   SLP                     opt
SANDRAC09       9       SLP                     opt
SANDRAC18       18      SLP                     opt
SANDRAC27       27      SLP                     opt
SANDRAS         30      Z925, Z500              opt
SANDRASC09      9       SLP                     opt
SANDRASC18      18      SLP                     opt
SANDRASC27      27      SLP                     opt
TPCAV           6-12    SLP                     pca
TPCA07          7       SLP                     pca
TPCAC09         9       SLP                     pca
TPCAC18         18      SLP                     pca
TPCAC27         27      SLP                     pca
WLKC733         40      U/V700, Z925/500, PW    thr
WLKC09          9       U700, V700              thr
WLKC18          18      U700, V700, Z925        thr
WLKC28          28      U700, V700, Z925/500    thr
HBGWL           29      -                       sub
HBGWT           10      -                       sub
OGWL            29      SLP, Z500               sub
OGWLSLP         29      SLP                     sub
PECZELY         13      -                       sub
PERRET          40      -                       sub
SCHUEPP         40      -                       sub
ZAMG            43      -                       sub