Quality control and homogenization of the COST benchmark dataset

Petr Štěpánek, Pavel Zahradníček
Czech Hydrometeorological Institute, regional office Brno
e-mail: petr.stepanek@chmi.cz, zahradnicek@chmi.cz

Processing before any data analysis
Software: AnClim, ProClimDB

Data Quality Control: Finding Outliers
Two main approaches:
• using limits derived from interquartile ranges (time series)
• comparing values to values of neighbouring stations (spatial analysis)
[Figure: example time series with outlier limits, 1950 to 2000]

Creating Reference Series
• for monthly data
• weighted or unweighted mean from neighbouring stations
• inverse distance weighting (IDW): the power of the weight is 1 for temperature (1/d) and 3 for precipitation (1/d^3)
• criteria used for station selection (alone or in combination):
  - best correlated or nearest neighbours (correlations computed from the first-differenced series)
  - limit correlation, limit distance
  - limit difference in altitudes
• neighbouring station series should be standardized to the test series AVG and/or STD, or altitude
• comparison with the "expected" value (calculated as the weighted mean of the standardized neighbours' values)

Example: proposed list of stations used for creating reference series
[Figure: proposed list of stations]

"Outliers": temperature, sur1, network 1
• detected 12 "outliers"
• 10 errors for station 150 (5 in the year 1909)
• the mean difference between the measured outliers and the expected value is about 6 °C

"Outliers": precipitation, sur1, network 1
• detected 8 "outliers"
• the mean difference between the measured outliers and the expected value is about 180 mm
• the maximum difference is 313 mm (station 4307012, 8/1971)

Data Processing
[Diagram: processing workflow]
• Monthly, seasonal and annual averages
• Quality control (outliers): interquartile range, comparing to neighbours
• Combining near stations
• Reference series: from correlations, from distances
• Homogeneity testing: Alexandersson test, bivariate test, t-test, Mann-Whitney-Pettitt
• Several iterations
• Homogeneity assessment, probability
• Adjusting data
• Filling missing values
• Months, seasons, year

Creating Reference Series
• for monthly data
• weighted or unweighted mean from neighbouring stations
• criteria used for station selection (alone or in combination):
  - best correlated or nearest neighbours (correlations computed from the first-differenced series)
  - limit correlation, limit distance
  - limit difference in altitudes
• neighbouring station series should be standardized to the test series AVG and/or STD (temperature: elevation, precipitation: variance)
• missing data are then much less of a problem

Relative homogeneity testing
• test series length: 40 years
• longer series are divided into several sections that overlap by 10 years
• tests: SNHT, bivariate test, t-test

Example of the detected breaks: temperature, sur1, network 1
• detected 63 breaks
[Figures: station no. 50, break 1928; station no. 50, break 1975; station no. 100, break 1983. Panels: test and reference series; difference between test and reference series; test statistics]

Example of the detected breaks: precipitation, sur1, network 1
• detected 10 breaks
[Figures: station no. 4309900, break 1909; station no. 4311803, break 1991]

Adjusting monthly data
• using reference series based on distance
• the power of the weight is 0.5 for temperature and 1 for precipitation
• adjustment: from differences (or ratios) 20 years before and after a change, computed monthly
• smoothing of monthly adjustments (low-pass filter for adjacent values)
[Figures: monthly adjustments (ADJ) and smoothed adjustments (ADJ SMOOTH) for months I to XII; station no. 100, break 1983, and station no. 50, break 1928]

Adjusting values: evaluation
• after adjustment the correlation must increase; if it does not, the series is not adjusted
[Figures: Temperature; Precipitation]

Absolute values of adjustment for temperature, sur1, network 1
[Figure]

Iterative homogeneity testing
• several iterations of testing and evaluation of results
• several iterations of homogeneity testing and series adjusting (3 iterations should be sufficient)
• the question of homogeneity of the reference series is thus solved:
  - possible inhomogeneities should be eliminated by using averages of several neighbouring stations
  - if this is not the case, the neighbours should already be homogenized in the next iteration

Example: homogenized temperature series
[Figures: raw data and homogenized series, T [°C], 1900 to 1996; station no. 50 and station no. 100]

Example: homogenized precipitation series
[Figures: raw data and homogenized series, 1900 to 1996; station no. 4309900, break 1909, and station no. 4311803, break 1991]

http://www.climahom.eu
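
Sketch: outlier limits from interquartile ranges

The slides name limits derived from interquartile ranges but do not give the formula or the multiplier. The sketch below assumes the common form q1 - k*IQR and q3 + k*IQR; the multiplier k = 3.0 and the function names are illustrative assumptions, not settings taken from AnClim or ProClimDB.

```python
from statistics import quantiles


def iqr_limits(values, k=3.0):
    """Return (lower, upper) limits q1 - k*IQR and q3 + k*IQR.

    k is an assumed multiplier; the slides do not state which value is used.
    """
    q1, _, q3 = quantiles(values, n=4, method="inclusive")
    iqr = q3 - q1
    return q1 - k * iqr, q3 + k * iqr


def flag_outliers(values, k=3.0):
    """Return the indices of values that fall outside the IQR limits."""
    lower, upper = iqr_limits(values, k)
    return [i for i, v in enumerate(values) if v < lower or v > upper]


# Hypothetical series of values for one calendar month, one suspicious value.
series = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0, 9.0, 50.0]
suspects = flag_outliers(series)
```

For monthly data it probably makes sense to compute the limits separately for each calendar month, so that the annual cycle does not inflate the interquartile range.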
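
Sketch: reference series as an inverse-distance-weighted mean

The reference-series slides describe a weighted mean of neighbours with weights 1/d^p, after the neighbours have been standardized to the test series AVG and/or STD. The sketch below is one possible reading of that description. It assumes complete series of equal length (no missing values), distances in kilometres, and standardization to both mean and standard deviation; all names are hypothetical.

```python
from statistics import mean, pstdev


def standardize_to(neighbour, test):
    """Rescale a neighbour series so its mean and std match the test series."""
    n_avg, n_std = mean(neighbour), pstdev(neighbour)
    t_avg, t_std = mean(test), pstdev(test)
    # A constant neighbour series cannot be rescaled; only shift it.
    scale = t_std / n_std if n_std > 0 else 1.0
    return [t_avg + (v - n_avg) * scale for v in neighbour]


def reference_series(test, neighbours, distances_km, power=1.0):
    """Weighted mean of standardized neighbours with weights 1 / d**power.

    For quality control the slides give power 1 for temperature and 3 for
    precipitation.
    """
    weights = [1.0 / d ** power for d in distances_km]
    total = sum(weights)
    scaled = [standardize_to(nb, test) for nb in neighbours]
    return [
        sum(w * s[t] for w, s in zip(weights, scaled)) / total
        for t in range(len(test))
    ]


# Hypothetical example: two neighbours that differ from the test series
# only by a constant offset, at 10 km and 20 km.
test = [10.0, 12.0, 11.0, 13.0]
neighbours = [[v + 2.0 for v in test], [v - 1.0 for v in test]]
ref = reference_series(test, neighbours, [10.0, 20.0], power=1.0)
```

In this constructed case the standardization removes the offsets, so the reference series coincides with the test series. With real data the difference between the measured value and the reference value is the quantity compared against the outlier threshold.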
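
Sketch: SNHT statistic for a single shift

The slides list SNHT (the Alexandersson test) among the relative homogeneity tests. As a minimal illustration, the sketch below computes the single-shift SNHT statistic T(k) = k*mean(z[1..k])^2 + (n-k)*mean(z[k+1..n])^2 on the standardized difference series between a test and a reference series. It uses differences (as for temperature) and the sample standard deviation; both are assumptions here, and the code is not the AnClim implementation.

```python
from statistics import mean, stdev


def snht(test, reference):
    """Return (k, T0): the most likely break position and the maximum statistic.

    k is the number of values before the break, so index k is the first
    value of the later segment.
    """
    q = [t - r for t, r in zip(test, reference)]
    q_avg, q_std = mean(q), stdev(q)
    if q_std == 0:
        raise ValueError("difference series is constant; no break to test")
    z = [(v - q_avg) / q_std for v in q]
    n = len(z)
    best_k, best_t = 0, 0.0
    for k in range(1, n):
        t_k = k * mean(z[:k]) ** 2 + (n - k) * mean(z[k:]) ** 2
        if t_k > best_t:
            best_k, best_t = k, t_k
    return best_k, best_t


# Hypothetical 20-year series with a shift of 1.0 after the tenth value.
test_series = [0.1, -0.1] * 5 + [1.1, 0.9] * 5
reference = [0.0] * 20
break_k, t0 = snht(test_series, reference)
```

Whether a break is significant depends on comparing T0 with a critical value for the given series length, which this sketch does not include.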
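
Sketch: monthly adjustments and their smoothing

The adjustment slide describes monthly adjustments estimated from differences before and after a change, then smoothed with a low-pass filter over adjacent months. The filter itself is not specified in the slides. The sketch below assumes a cyclic three-point filter with weights (0.25, 0.5, 0.25), so that December and January count as neighbours; the weights, the sign convention and the function names are assumptions made for illustration.

```python
from statistics import mean


def monthly_adjustments(diff_before, diff_after):
    """Adjustment per month: mean difference after minus mean difference before.

    diff_before and diff_after map month (1..12) to lists of test-minus-reference
    values from the window before and after the break (20 years in the slides).
    Adding the result to the earlier segment aligns it with the later one.
    """
    return [mean(diff_after[m]) - mean(diff_before[m]) for m in range(1, 13)]


def smooth_cyclic(adjustments, weights=(0.25, 0.5, 0.25)):
    """Smooth adjustments with a three-point filter that wraps around the year."""
    n = len(adjustments)
    left, centre, right = weights
    return [
        left * adjustments[(i - 1) % n]
        + centre * adjustments[i]
        + right * adjustments[(i + 1) % n]
        for i in range(n)
    ]


# Hypothetical differences: no offset before the break, about 0.6 after it.
before = {m: [0.0, 0.0] for m in range(1, 13)}
after = {m: [0.5, 0.7] for m in range(1, 13)}
adj = monthly_adjustments(before, after)
adj_smooth = smooth_cyclic(adj)
```

For precipitation the slides mention ratios instead of differences; the same structure would apply with a ratio of means in place of the subtraction.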