Statistics in WR: Lecture 1
• Key Themes
  – Knowledge discovery in hydrology
  – Introduction to probability and statistics
  – Definition of random variables
• Reading: Helsel and Hirsch, Chapter 1

How is new knowledge discovered?
After completing the Handbook of Hydrology in 1993, I asked myself the question: how is new knowledge discovered in hydrology? I concluded:
• By deduction from existing knowledge
• By experiment in a laboratory
• By observation of the natural environment

Deduction – Isaac Newton
• Deduction is the classical path of mathematical physics
  – Given a set of axioms
  – Then by a logical process
  – Derive a new principle or equation
• In hydrology, the St Venant equations for open channel flow and Richards' equation for unsaturated flow in soils were derived in this way.
Three laws of motion and law of gravitation (1687)
http://en.wikipedia.org/wiki/Isaac_Newton

Experiment – Louis Pasteur
• Experiment is the classical path of laboratory science: a simplified view of the natural world is replicated under controlled conditions
• In hydrology, Darcy's law for flow in a porous medium was found this way.
Pasteur showed that microorganisms cause disease and discovered vaccination. Foundations of scientific medicine.
http://en.wikipedia.org/wiki/Louis_Pasteur

Observation – Charles Darwin
• Observation: direct viewing and characterization of patterns and phenomena in the natural environment
• In hydrology, Horton discovered stream scaling laws by interpretation of stream maps
Published Nov 24, 1859. Most accessible book of great scientific imagination ever written.

Mean Annual Flow
[Figure: Mean Annual Flow, Colorado River at Austin (1929-2008); x-axis: year; y-axis: Discharge (cfs)]

Is there a relation between flow and water quality?
[Figure: Mean Annual Flow, Colorado River at Austin (1929-2008); y-axis: Discharge (cfs)]
[Figure: Total Nitrogen in water, Colorado River at Austin; x-axis: date (ticks from Jun-68 to Oct-95); y-axis: Total Nitrogen (mg/l)]

Are Annual Flows Correlated?
[Figure: Correlation of Annual Flows (Colorado River at Austin); x-axis: This Year's Discharge (cfs); y-axis: Last Year's Discharge (cfs)]

CE 397 Statistics in Water Resources, Lecture 2, 2009
David R. Maidment
Dept of Civil Engineering
University of Texas at Austin

Key Themes
• Statistics
  – Parametric and non-parametric approach
• Data Visualization
• Distribution of data and the distribution of statistics of those data
• Reading: Helsel and Hirsch pp. 17-51 (Sections 2.1 to 2.3)
• Slides from Helsel and Hirsch (2002), "Techniques of water resources investigations of the USGS, Book 4, Chapter A3"

Characteristics of Water Resources Data
• Lower bound of zero
• Presence of "outliers"
• Positive skewness
• Non-normal distribution of data
• Data measured with thresholds (e.g. detection limits)
• Seasonal and diurnal patterns
• Autocorrelation: consecutive measurements are not independent
• Dependence on other uncontrolled variables, e.g. chemical concentration is related to discharge

Normal Distribution
[Figure from Helsel and Hirsch (2002)]

Lognormal Distribution
[Figure from Helsel and Hirsch (2002)]

Method of Moments
[Figure from Helsel and Hirsch (2002)]

Statistical measures
• Location (Central Tendency)
  – Mean
  – Median
  – Geometric mean
• Spread (Dispersion)
  – Variance
  – Standard deviation
  – Interquartile range
• Skewness (Symmetry)
  – Coefficient of skewness
• Kurtosis (Flatness)
  – Coefficient of kurtosis

Histogram
[Figure from Helsel and Hirsch (2002): Annual Streamflow for the Licking River at Catawba, Kentucky (03253500)]

Quantile Plot
[Figure from Helsel and Hirsch (2002)]

Plotting positions
• i = rank of the data, with i = 1 the lowest
• n = number of data
• p = cumulative probability or "quantile" of the data value (its percentile value)

Normal Distribution Quantile Plot
[Figure from Helsel and Hirsch (2002)]

Probability Plot with Normal Quantiles (Z values)
$q = \bar{q} + z\, s_q$
[Figure from Helsel and Hirsch (2002)]

Annual Flows From HydroExcel
Annual flows produced using Pivot Tables in Excel

Estimating the mean annual discharge
[Figure: Licking River at Catawba, Kentucky, 1934-2008 (75 years of data); series: Mean, Mean + Se, Mean - Se; x-axis: Number of years of Data; y-axis: Mean Discharge (cfs)]

CE 397 Statistics in Water Resources, Lecture 3, 2009

Key Themes
• Using HydroExcel for accessing water resources data using web services
• Descriptive statistics and histograms using Excel Analysis Toolpak
• Reading: Chapter 11 of Applied Hydrology by Chow, Maidment and Mays

CE 397 Statistics in Water Resources, Lecture 4, 2009

Key Themes
• Frequency and probability functions
• Fitting methods
• Typical distributions
• Reading: Chapter 4 of Helsel and Hirsch, pp.
97-116 on Hypothesis tests

Method of Moments

Maximum Likelihood

CE 397 Statistics in Water Resources, Lecture 5, 2009

Key Themes
• Using Excel to fit frequency and probability distributions
• Chi Square test and probability plotting
• Beginning hypothesis testing
• Reading: Chapter 3 of Helsel and Hirsch, pp. 65-97 on Describing Uncertainty
• Slides from Helsel and Hirsch Chap. 4

Statistics in Water Resources, Lecture 6
• Key theme
  – T-distribution for distributions where standard deviation is unknown
  – Hypothesis testing
  – Comparing two sets of data to see if they are different
• Reading: Helsel and Hirsch, Chapter 6 Matched Pair Tests

Chi-Square Distribution
http://en.wikipedia.org/wiki/Chi-square_distribution

t-, z and ChiSquare
Source: http://en.wikipedia.org/wiki/Student's_t-distribution

Normal and t-distributions
[Figure: normal distribution compared with t-distributions for ν = 1, 2, 3, 5, 10 and 30]

Standard Normal and Student's t
• Standard Normal z
  – X1, …, Xn are independently and normally distributed with mean μ and standard deviation σ, and
  – then $Z = (\bar{X}_n - \mu)/(\sigma/\sqrt{n})$ is normally distributed with mean 0 and std dev 1
• Student's t-distribution
  – Applies to the case where the true standard deviation σ is unknown and is replaced by its sample estimate Sn, so that $T = (\bar{X}_n - \mu)/(S_n/\sqrt{n})$ follows a t-distribution with n - 1 degrees of freedom

The p-value is the probability of obtaining a test statistic at least as extreme as the one observed if the null hypothesis (Ho) is true.
If the p-value is smaller than the significance level α (e.g. 0.05 or 0.025), then reject Ho.
If the p-value is larger than α, then do not reject Ho.

One-sided test

Two-sided test

Statistics in WR: Lecture 7
• Key Themes
  – Statistics for populations and samples
  – Suspended sediment sampling
  – Testing for differences in means and variances
• Reading: Helsel and Hirsch Chapter 8 Correlation

Estimators of the Variance
• Maximum Likelihood Estimate for population variance: $\hat{\sigma}^2 = \frac{1}{n}\sum_{i=1}^{n}(x_i - \bar{x})^2$
• Unbiased estimate from a sample: $s^2 = \frac{1}{n-1}\sum_{i=1}^{n}(x_i - \bar{x})^2$
http://en.wikipedia.org/wiki/Variance

Bias in the Variance
Common sense would suggest applying
the population formula to the sample as well. That estimate is biased because the sample mean is generally somewhat closer to the observations in the sample than the population mean is. The sample mean is by definition in the middle of the sample, while the population mean may even lie outside the sample. The deviations from the sample mean will therefore often be smaller than the deviations from the population mean, so if the same formula is applied to both, the variance estimate will on average be somewhat smaller in the sample than in the population.

Suspended Sediment Sampling
http://pubs.usgs.gov/sir/2005/5077/

T-test with same variances

T-test with different variances

Statistics in WR: Lecture 8
• Key Themes
  – Replication in Monte Carlo experiments
  – Testing paired differences and analysis of variance
  – Correlation
• Reading: Helsel and Hirsch Chapter 9 Simple Regression

Statistics of Mean of Replicated Series
[Figure: Variance of Replicates of Cumulative mean of 1000 uniform(0,1) random variables; series: Variance, Theoretical Value; x-axis: Number of Replicates]

Patterns of data that all have correlation between x and y of 0.7
[Figure panels: Monotonic nonlinear correlation; Linear correlation; Non-monotonic correlation]

Statistics in WR: Lecture 9
• Key Themes
  – Using SAS to compute cross-correlation between two data series
  – Using Excel to compute autocorrelation of a single data series
  – Correlation length and the influence of the data interval on it
  – Lagged cross-correlation between rainfall and flow
• Reading: Helsel and Hirsch Chapter 12 Trend Analysis

Correlation
• Correlation (or cross-correlation) measures the association between two sets of data (x, y)
• Autocorrelation measures the correlation of a dataset with lagged or displaced values of itself (either in time or space), e.g. x(t) with x(t – L), where L is the lag time
• Lagged cross-correlation measures the association
between one series y(t) and lagged values of another series x(t – L)

Statistics in WR: Lecture 10
• Key Themes
  – Trend analysis using Simple Linear Regression
  – Characterization of outliers
  – Multiple Linear Regression
• Reading: Helsel and Hirsch Chapter 11 Multiple Linear Regression
• Slides are from Helsel and Hirsch, Chapter 9

H&H p.222

Regression Formulas
H&H p.226

Regression Formulas
H&H p.227

Statistics in WR: Lecture 11
• Key Themes
  – Simple Linear Regression
  – Derivation of the normal equations
  – Multiple Linear Regression
• Reading: Helsel and Hirsch Chapter 7 Comparing several independent groups
• Reading: Barnett, Environmental Statistics Chapter 10 Time series methods
• Slides are from Helsel and Hirsch, Chapter 9

Regression Assumptions

Formulas used in the derivation of the normal equations

(1a) Plot the Data: TDS vs LogQ

(2) Interpret Regression Statistics

A good set of Residuals

Multiple Linear Regression

Simple vs Complex regression models

F-distribution
http://en.wikipedia.org/wiki/F-test
"If U is a Chisquare random variable with m degrees of freedom, V is a Chisquare random variable with n degrees of freedom, and if U and V are independent, then the ratio (U/m)/(V/n) has an F-distribution with (m, n) degrees of freedom." Haan, Statistical Methods in Hydrology, p.122
The values of the F-statistic are tabulated at: http://www.itl.nist.gov/div898/handbook/eda/section3/eda3673.htm

Statistics in WR: Lecture 12
• Key Themes
  – Regression y|x and x|y
  – Adjusted R2
  – Time series and seasonal variations

R2 and Adjusted R2
$R^2 = 1 - \frac{SSE}{SS_y}$
$\mathrm{Adj}R^2 = 1 - \frac{SSE/(n-p)}{SS_y/(n-1)}$

SUMMARY OUTPUT

Regression Statistics                 Excel output   From the formulas above
Multiple R                            0.950344
R Square                              0.903154       0.903154347
Adjusted R Square                     0.898543       0.89854265
Standard Error                        159033.1
Observations                          23

ANOVA
                   df   SS            MS            F          Significance F
Regression          1   4.95309E+12   4.95309E+12   195.8399   4.07E-12
Residual (error)   21   5.31122E+11   25291521454
Total (y)          22   5.48421E+12

Time Series Trend: Tide Levels at San Diego
[Figure: tide level time series with fitted trend line y = 2E-05x + 2.2869; x-axis: date]
http://tidesandcurrents.noaa.gov/sltrends/sltrends_station.shtml?stnid=9410170%20San%20Diego,%20CA

[Figure panels: One harmonic; Five harmonics]
http://en.wikipedia.org/wiki/Fourier_series

Statistics in WR: Lecture 13
• Key Themes
  – ANOVA for sediment data
  – Fourier series for diurnal cycles
  – Fourier series for seasonal cycles

Analysis of Variance (ANOVA) Assumptions
There are several variants (one factor, two factor, two factor with replication). We will deal just with One Factor ANOVA.

Single Factor ANOVA

ANOVA Formulas

Single Factor ANOVA
Groups of Sediment Load Data (Ex3)
[Figure: sediment load data by group; marked values: 3.5 x 10^6, 5.5 x 10^6, 480,000 and zero]
• USGS1 Mean: 218,000 Ton/yr
• TWDB Mean: 189,000 Ton/yr
• USGS2 Mean: 97,000 Ton/yr
• Overall Mean: 183,000 Ton/yr
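
The single factor ANOVA named in Lecture 13 compares the variation between group means with the variation within groups. The slides' own formulas did not survive, so what follows is only a minimal sketch of the standard textbook calculation, not necessarily the form used in class. The function name `one_way_anova`, the `AnovaResult` tuple and the three small groups are hypothetical illustrations; the numbers are made up and are not the USGS1, TWDB or USGS2 sediment loads.

```python
from collections import namedtuple

# Hypothetical container for the quantities in a single factor ANOVA table.
AnovaResult = namedtuple(
    "AnovaResult", "ss_between ss_within df_between df_within f_stat"
)


def one_way_anova(groups):
    """Compute the single factor ANOVA F statistic for a list of groups.

    Each group is a non-empty list of numbers. The F statistic is the
    between-group mean square divided by the within-group mean square.
    """
    k = len(groups)
    n_total = sum(len(g) for g in groups)
    if k < 2 or n_total <= k:
        raise ValueError("need at least 2 groups and more observations than groups")

    grand_mean = sum(sum(g) for g in groups) / n_total
    group_means = [sum(g) / len(g) for g in groups]

    # Between-group sum of squares: group size times squared distance
    # of each group mean from the overall mean.
    ss_between = sum(
        len(g) * (m - grand_mean) ** 2 for g, m in zip(groups, group_means)
    )
    # Within-group sum of squares: squared distance of each value
    # from its own group mean.
    ss_within = sum(
        (x - m) ** 2 for g, m in zip(groups, group_means) for x in g
    )
    if ss_within == 0:
        raise ValueError("no within-group variation; F is undefined")

    df_between = k - 1
    df_within = n_total - k
    f_stat = (ss_between / df_between) / (ss_within / df_within)
    return AnovaResult(ss_between, ss_within, df_between, df_within, f_stat)


if __name__ == "__main__":
    # Made-up values for illustration only.
    hypothetical_groups = [[1.0, 2.0, 3.0], [2.0, 3.0, 4.0], [5.0, 6.0, 7.0]]
    result = one_way_anova(hypothetical_groups)
    print(result)
```

The resulting F statistic would be compared with the tabulated F value for (df_between, df_within) degrees of freedom, as with the F-distribution described in Lecture 11.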