Second Semester 2022-2023 (16.01-29.04.2023)
CIVL 6005: Data Analysis in Hydrology
Lecture 2 (February 7, 2023)
Review of Probability and Statistics (2)
Lecturer: Ji Chen, Department of Civil Engineering, The University of Hong Kong, Pokfulam

[Figure: Intensity-Duration-Frequency curves (for durations not exceeding 4 hours) in Hong Kong, from the Stormwater Drainage Manual – Planning, Design and Management (DSD, p. 150)]

Bayes Theorem
• Bayes' theorem provides a method for combining new information with previous (so-called prior) probability assessments to yield updated values for the relative likelihood of the events of interest:
  $p(B_i \mid A) = \dfrac{p(B_i)\, p(A \mid B_i)}{\sum_{j=1}^{n} p(A \mid B_j)\, p(B_j)}$
• These new (conditional) probabilities are called posterior probabilities
• Bayes' theorem thus provides a means of estimating the probabilities of one event by observing a second event

Example
The manager of a recreational facility has determined that the probability of having 1,000 or more visitors on any Sunday in July depends on the maximum temperature for that Sunday, as shown in the table below. The table also gives the probabilities that the maximum temperature falls in the indicated ranges. On a certain Sunday in July, the facility has 1,000 or more visitors. What is the probability that the maximum temperature was in each of the temperature classes?

Temperature Tj (°C) | P(1000+ visitors | Tj) | P(Tj)
<15                 | 0.05                   | 0.05
15-20               | 0.20                   | 0.15
20-25               | 0.50                   | 0.20
25-30               | 0.75                   | 0.35
30-35               | 0.50                   | 0.20
>35                 | 0.25                   | 0.05

Solution
• Let Tj for j = 1, 2, …, 6 represent the six temperature intervals. Then, from Bayes' theorem,
  $p(T_j \mid \text{1000 or more}) = \dfrac{p(\text{1000 or more} \mid T_j)\, p(T_j)}{\sum_{i=1}^{6} p(\text{1000 or more} \mid T_i)\, p(T_i)}$

Temperature Tj (°C) | P(1000+ visitors | Tj) | P(Tj) | P(Tj | 1000+ visitors)
<15                 | 0.05                   | 0.05  | 0.005
15-20               | 0.20                   | 0.15  | 0.059
20-25               | 0.50                   | 0.20  | 0.197
25-30               | 0.75                   | 0.35  | 0.517
30-35               | 0.50                   | 0.20  | 0.197
>35                 | 0.25                   | 0.05  | 0.025
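The posterior column can be reproduced directly from Bayes' theorem; the following is a minimal Python sketch (the variable names are illustrative, not from the slides):

```python
# Posterior probabilities for the visitor example via Bayes' theorem.
# p_visitors_given_T[j] = P(1000 or more visitors | T_j), p_T[j] = P(T_j).
p_visitors_given_T = [0.05, 0.20, 0.50, 0.75, 0.50, 0.25]
p_T                = [0.05, 0.15, 0.20, 0.35, 0.20, 0.05]

# Total probability of 1000 or more visitors (the denominator in Bayes' theorem).
p_visitors = sum(l * p for l, p in zip(p_visitors_given_T, p_T))

# Posterior P(T_j | 1000 or more visitors) for each temperature class.
posterior = [l * p / p_visitors for l, p in zip(p_visitors_given_T, p_T)]
print([round(x, 3) for x in posterior])   # [0.005, 0.059, 0.197, 0.517, 0.197, 0.025]
```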
Sampling Principle: Counting

Sampling Principle: Counting (1)
• Multiplication principle:
  – If an experiment can be divided into tasks in such a way that the number of available outcomes for the remaining tasks does not at any step depend on the outcomes of the previous steps, then
  – the total number of outcomes is the product of the available outcomes at each step

Sampling Principle: Counting (2)
• Addition principle: If an experiment can be split into several mutually exclusive methods of accomplishing it
  – the only possible ways of accomplishing the experiment are these "several" methods, and
  – if we use one of these methods, the other methods are impossible
  – then the number of ways the experiment can be accomplished is the sum of the available outcomes over the methods

Sampling Principle: Counting (3)
• Ordering and replacement principle:
  – Sampling can be done with replacement, so that the selected item is immediately returned to the population, or without replacement, so that the item is not returned
  – The order of sampling may be important in some situations and not in others
• Four types of samples
  – ordered with replacement
  – ordered without replacement
  – unordered with replacement, and
  – unordered without replacement

Sampling Principle: Counting (4)
Consider choosing a sample of size r from a population of n elements. The number of ways of selecting the sample under the four types is:
  – ordered, with replacement: $n^r$
  – ordered, without replacement: $P^n_r = \dfrac{n!}{(n-r)!}$
  – unordered, with replacement: $\dbinom{n+r-1}{r}$
  – unordered, without replacement: $\dbinom{n}{r} = \dfrac{n!}{r!(n-r)!}$
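As a quick check of the four counts, a short Python sketch using the standard library (the function name sample_counts is illustrative):

```python
from math import comb, perm

def sample_counts(n: int, r: int) -> dict:
    """Number of ways to draw a sample of size r from n elements
    for the four ordering/replacement combinations."""
    return {
        "ordered, with replacement":      n ** r,
        "ordered, without replacement":   perm(n, r),            # n! / (n-r)!
        "unordered, without replacement": comb(n, r),            # n! / (r!(n-r)!)
        "unordered, with replacement":    comb(n + r - 1, r),
    }

print(sample_counts(n=5, r=3))
# {'ordered, with replacement': 125, 'ordered, without replacement': 60,
#  'unordered, without replacement': 10, 'unordered, with replacement': 35}
```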
Sample Data: Grouping

Sample Data Grouping (1)
• Since it is difficult to grasp the overall picture of the data from a tabulation, a useful first step in data analysis is to plot the data as a frequency histogram
• A histogram is a graphic representation of the frequency distribution of a continuous variable
• Rectangles are drawn in such a way that their bases lie on a linear scale representing the different class intervals and their heights are proportional to the frequencies of the values within each interval

Sample Data Grouping (2)
• The midpoint of a class is called the class mark. The class interval is the difference between the upper and lower class boundaries
• The choice of class interval and the location of the first class mark can appreciably affect the appearance of a frequency histogram

Sample Data Grouping (3)
• The appropriate width of a class interval depends on the range of the data, the number of observations, and the behaviour of the data
  – One suggestion is that the class interval should not exceed one-fourth to one-half of the standard deviation of the data, and
  – a suggestion for the number of classes is $m = 1 + 3.3 \log_{10}(n)$, where m is the number of classes and n the number of data values

Sample Data Grouping (4)
• Be cautious about using too few or too many classes in a histogram
  – Too few classes eliminate detail and obscure the basic pattern of the data
  – Too many classes result in erratic patterns of alternating high and low frequencies
  – The aim of a histogram is to convey the full meaning of the data distribution

Annual rainfall in Hong Kong, 1960-1999 (mm):

Year  Rainfall | Year  Rainfall | Year  Rainfall | Year  Rainfall
1960  2,250    | 1970  2,360    | 1980  1,700    | 1990  2,050
1961  2,240    | 1971  1,910    | 1981  1,650    | 1991  1,600
1962  1,750    | 1972  2,850    | 1982  3,250    | 1992  2,700
1963    900    | 1973  3,100    | 1983  2,900    | 1993  2,350
1964  2,450    | 1974  2,350    | 1984  2,000    | 1994  2,750
1965  2,430    | 1975  3,010    | 1985  2,200    | 1995  2,790
1966  2,440    | 1976  2,200    | 1986  2,400    | 1996  2,250
1967  1,510    | 1977  1,650    | 1987  2,350    | 1997  3,350
1968  2,350    | 1978  2,600    | 1988  1,600    | 1998  2,550
1969  1,900    | 1979  2,610    | 1989  1,950    | 1999  2,100

[Figure: frequency histogram of annual rainfall in Hong Kong (1960-1999), with rainfall classes labelled at 900, 1250, 1600, 1950, 2300, 2650, 3000 and 3350 mm and frequencies from 0 to 14]
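A short Python sketch of how this grouping could be reproduced; the class-number rule is the one from the slides, while the equal-width binning of numpy.histogram is an assumption of convenience and need not match the classes shown on the slide:

```python
import numpy as np

# Annual rainfall in Hong Kong, 1960-1999 (mm), as tabulated above.
rain = np.array([
    2250, 2240, 1750,  900, 2450, 2430, 2440, 1510, 2350, 1900,
    2360, 1910, 2850, 3100, 2350, 3010, 2200, 1650, 2600, 2610,
    1700, 1650, 3250, 2900, 2000, 2200, 2400, 2350, 1600, 1950,
    2050, 1600, 2700, 2350, 2750, 2790, 2250, 3350, 2550, 2100,
])

n = rain.size
m = int(np.round(1 + 3.3 * np.log10(n)))        # suggested number of classes
counts, edges = np.histogram(rain, bins=m)       # frequency in each class
print(m, counts, edges.round(0))

# To draw the histogram itself:
# import matplotlib.pyplot as plt
# plt.hist(rain, bins=m); plt.xlabel("Annual rainfall (mm)"); plt.ylabel("Frequency"); plt.show()
```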
Variables

Random Variables (1)
• A random variable is a function defined on a sample space
• Random variables may be discrete or continuous
  – If the set of values of a random variable is finite (or countably infinite), the random variable is said to be a discrete random variable
  – If the set of values of a random variable is uncountably infinite (e.g. an interval of real numbers), the random variable is said to be a continuous random variable

Random Variables (2)
• In this course, capital letters will be used to denote random variables and the corresponding lower-case letters will represent values of the random variable
• Any function of a random variable is also a random variable

Univariate probability distributions
• $p_X(x)$ denotes the probability density function (pdf)
• $P_X(x)$ denotes the cumulative distribution function (cdf)

Bivariate Distributions (1)
• The situation frequently arises where one is interested in the simultaneous behaviour of two or more random variables
• If X and Y are continuous random variables, their joint probability density function is $p_{X,Y}(x, y)$ and the corresponding cumulative probability distribution is $P_{X,Y}(x, y)$
• These two are related

Bivariate Distributions (2)
$p_{X,Y}(x, y) = \dfrac{\partial^2 P_{X,Y}(x, y)}{\partial x\, \partial y}$

$P_{X,Y}(x, y) = p(X \le x \text{ and } Y \le y) = \int_{-\infty}^{x} \int_{-\infty}^{y} p_{X,Y}(t, s)\, ds\, dt$

For discrete random variables,
$f_{X,Y}(x_i, y_j) = p(X = x_i \text{ and } Y = y_j)$
$F_{X,Y}(x, y) = p(X \le x \text{ and } Y \le y) = \sum_{x_i \le x} \sum_{y_j \le y} f_{X,Y}(x_i, y_j)$

Marginal distributions (1)
• The marginal distribution of X is the distribution of X across all values of Y
• In other words, it is the distribution of X ignoring Y
• For instance, the marginal density of X, $p_X(x)$, is obtained by integrating $p_{X,Y}(x, y)$ over all possible values of Y:
  $p_X(x) = \int_{-\infty}^{\infty} p_{X,Y}(x, s)\, ds$

Marginal distributions (2)
• The cumulative marginal distribution is given by
  $P_X(x) = p(X \le x \text{ and } -\infty < Y < \infty) = \int_{-\infty}^{x} \int_{-\infty}^{\infty} p_{X,Y}(t, s)\, ds\, dt = \int_{-\infty}^{x} p_X(t)\, dt$

Conditional distributions
• The distribution of one variable with restrictions or conditions placed on the second variable is called a conditional distribution:
  $p_{X|Y}(x \mid Y \in R) = \dfrac{\int_R p_{X,Y}(x, s)\, ds}{\int_R p_Y(s)\, ds}$

Independence
• In general, the conditional density of X given Y is a function of y
• If the random variables X and Y are independent, this functional relationship disappears
• In fact, in this case the conditional density equals the marginal density, and
  $p_{X,Y}(x, y) = p_X(x)\, p_Y(y)$
  $f_{X,Y}(x, y) = f_X(x)\, f_Y(y)$

Derived Distributions (1)
• For a univariate continuous distribution of the random variable X, the distribution of U, where U = u(X) is a monotonic function, can be found from
  $p_U(u) = p_X(x)\, \left| \dfrac{dx}{du} \right|$

Derived Distributions (2)
• For a continuous bivariate density, the transformation from $p_{X,Y}(x, y)$ to $p_{U,V}(u, v)$, where U = u(X, Y) and V = v(X, Y) are one-to-one continuously differentiable transformations, can be made by the relationship
  $p_{U,V}(u, v) = p_{X,Y}(x, y)\, \left| J\!\left(\dfrac{x, y}{u, v}\right) \right|$

PROPERTIES OF RANDOM VARIABLES

PROPERTIES OF RANDOM VARIABLES (1)
• Every hydrologic variable is a random variable – for example rainfall, streamflow, infiltration rate, evaporation, reservoir storage, etc.
• Any process whose outcome is a random variable is defined as an experiment
• A single outcome from this experiment is called a realization of the experiment, or an observation from the experiment

PROPERTIES OF RANDOM VARIABLES (2)
• The terms realization and observation can be used interchangeably
• However, an observation is generally taken to be a single value of a random variable, and
• a realization is generally taken as a time series of random variables generated by a random experiment

PROPERTIES OF RANDOM VARIABLES (3)
• The complete assemblage of all the values representative of a particular random process is called a population
• Any subset of these values is a sample from the population

Parameters for describing population (1)
• Quantities that are descriptive of a population are called parameters
• Sample statistics are estimates of population parameters

Parameters for describing population (2)
• Sample statistics are estimated from samples of data; as such they are functions of random variables (the sample values) and are therefore themselves random variables
• For a decision based on a sample to be valid in terms of the population, the sample statistics must be representative of the population parameters

Parameters for describing population (3)
• It is required that the sample itself be representative of the population and that "good" parameter estimation procedures are used
• For example, one cannot obtain "good" estimates of the parameters of a streamflow synthesis model if the estimates are based on a short period of record during which an extreme drought (or an extremely wet spell) occurred

Parameters for describing population (4)
• Usually the true probability density function that generated the available sample of data is not known
• Thus, it is necessary not only to estimate the population parameters, but also to estimate the form of the random process (experiment) that generated the data

Moments of Distribution
➢ The ith moment about the origin:
  ✓ for a continuous random variable X: $\mu'_i = \int_{-\infty}^{\infty} x^i\, p_X(x)\, dx$
  ✓ for a discrete distribution: $\mu'_i = \sum_j x_j^i\, f_X(x_j)$
➢ The ith central moment is defined as the ith moment about the mean of the distribution:
  $\mu_i = \int_{-\infty}^{\infty} (x - \mu)^i\, p_X(x)\, dx$   or   $\mu_i = \sum_j (x_j - \mu)^i\, f_X(x_j)$

Mean and Variance
➢ The expected value (mean) of the random variable X:
  $E(X) = \mu = \mu'_1 = \int_{-\infty}^{\infty} x\, p_X(x)\, dx$   or   $E(X) = \mu'_1 = \sum_j x_j\, f_X(x_j)$
➢ The variance of the random variable X:
  $\mathrm{Var}(X) = \sigma^2 = \mu_2 = E[(X - \mu)^2] = \int_{-\infty}^{\infty} (x - \mu)^2\, p_X(x)\, dx$   or   $\sigma^2 = \sum_j (x_j - \mu)^2\, f_X(x_j)$

Measures of Central Tendency (1)
➢ Arithmetic mean: the mean of the random variable X
  ✓ $E(X) = \mu'_1$
  ✓ A sample estimate of the population mean: $\bar{x} = \sum_{i=1}^{n} x_i / n$
➢ Geometric mean: $x_G = \left( \prod_{i=1}^{n} x_i \right)^{1/n}$

Measures of Central Tendency (2)
➢ Median:
  ✓ The sample median ($x_{md}$) is the observation such that half of the values lie on either side of $x_{md}$
  ✓ The population median satisfies $\int_{-\infty}^{x_{md}} p_X(x)\, dx = 0.5$   or   $\sum_{x_i \le x_{md}} f_X(x_i) = 0.5$
➢ Mode: a possible value ($x_{mo}$) of X that occurs with a probability at least as large as the probability of any other value of X

Measures of Dispersion (1)
• Range: the difference between the largest and smallest sample values
  – The relative range is the range divided by the mean

Measures of Dispersion (2)
• Variance, or standard deviation (the positive square root of the variance): the most common measure of dispersion
  – The sample estimate of $\sigma^2$ is denoted by $s^2$: $s^2 = \sum_i (x_i - \bar{x})^2 / (n - 1)$
  – A dimensionless measure of dispersion is the coefficient of variation, defined as the standard deviation divided by the mean: $c_v = s / \bar{x}$

Measures of Symmetry (1)
➢ Absolute skewness: the difference between the mean and the mode
➢ Relative skewness: the difference between the mean and the mode divided by the standard deviation
  ✓ population measure of skewness: $(\mu - \mu_{mo}) / \sigma$
  ✓ sample measure of skewness: $(\bar{x} - x_{mo}) / s$

Measures of Symmetry (2)
➢ The most commonly used measure of skewness is the coefficient of skew:
  $\gamma = \mu_3 / \mu_2^{3/2}$
❖ An unbiased estimate of the coefficient of skew based on a sample of size n is
  $c_s = \dfrac{n^2 M_3}{(n - 1)(n - 2)\, s_X^3}$
  where $M_3$ is the sample estimate of $\mu_3$

Measures of Peakedness (1)
• Kurtosis: the extent of peakedness or flatness of a probability distribution in comparison with the normal probability distribution:
  $\kappa = \mu_4 / \mu_2^2$
• The sample estimate of the kurtosis is
  $k = \dfrac{n^3 M_4}{(n - 1)(n - 2)(n - 3)\, s_X^4}$
  where $M_4$ is the sample estimate of $\mu_4$
• Mesokurtic: the kurtosis of a normal distribution is 3

Measures of Peakedness (2)
• Leptokurtic (kurtosis > 3): a relatively greater concentration of probability near the mean than the normal distribution
• Platykurtic (kurtosis < 3): a relatively smaller concentration of probability near the mean than the normal distribution
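These sample statistics can be computed directly from the formulas above; a minimal Python sketch (the function name sample_stats is illustrative, and the short record used is the 1960-1969 rainfall from the earlier table):

```python
import numpy as np

def sample_stats(x: np.ndarray) -> dict:
    """Sample statistics using the formulas from the slides."""
    n = x.size
    xbar = x.mean()
    s2 = ((x - xbar) ** 2).sum() / (n - 1)           # sample variance
    s = np.sqrt(s2)                                   # sample standard deviation
    cv = s / xbar                                     # coefficient of variation
    m3 = ((x - xbar) ** 3).mean()                     # sample estimate of mu_3
    m4 = ((x - xbar) ** 4).mean()                     # sample estimate of mu_4
    cs = n ** 2 * m3 / ((n - 1) * (n - 2) * s ** 3)   # coefficient of skew
    k = n ** 3 * m4 / ((n - 1) * (n - 2) * (n - 3) * s ** 4)  # kurtosis
    return {"mean": xbar, "std": s, "cv": cv, "skew": cs, "kurtosis": k}

# Illustrative data: annual rainfall (mm), 1960-1969, from the table above.
x = np.array([2250., 2240., 1750., 900., 2450., 2430., 2440., 1510., 2350., 1900.])
print(sample_stats(x))
```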
Plotting Positions

Annual Rainfall in Hong Kong (1960-2009), in mm:

Year  Rainfall | Year  Rainfall | Year  Rainfall | Year  Rainfall | Year  Rainfall
1960  2,237    | 1970  2,316    | 1980  1,711    | 1990  2,047    | 2000  2,752
1961  2,232    | 1971  1,904    | 1981  1,660    | 1991  1,639    | 2001  3,092
1962  1,741    | 1972  2,807    | 1982  3,248    | 1992  2,679    | 2002  2,490
1963    901    | 1973  3,100    | 1983  2,894    | 1993  2,344    | 2003  1,942
1964  2,432    | 1974  2,323    | 1984  2,017    | 1994  2,726    | 2004  1,739
1965  2,353    | 1975  3,029    | 1985  2,191    | 1995  2,754    | 2005  3,215
1966  2,398    | 1976  2,197    | 1986  2,338    | 1996  2,249    | 2006  2,628
1967  1,571    | 1977  1,680    | 1987  2,319    | 1997  3,343    | 2007  1,707
1968  2,288    | 1978  2,593    | 1988  1,685    | 1998  2,565    | 2008  3,066
1969  1,896    | 1979  2,615    | 1989  1,945    | 1999  2,129    | 2009  2,182

Mean = 2318.2 mm, s = 518.8 mm

Plotting Position: Definition
➢ A plotting position is the probability value assigned to each piece of data to be plotted
➢ The probability of an event being exceeded in any year can be estimated using a plotting position, which is defined by a number of formulae as given below, where
  n: the number of samples
  m: the rank of the sample when arranged in descending order
  Pm: the exceedance probability of the event with rank m

Empirical Equations for Computing Pm
➢ California: $P_m = m/n$
➢ Weibull: $P_m = m/(n+1)$
➢ Hazen: $P_m = (2m-1)/(2n)$
➢ Tukey: $P_m = (3m-1)/(3n+1)$
The inverse of Pm gives the return period T.
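A minimal Python sketch of the four plotting-position formulas applied to a ranked series (the function name and the short ten-year record are illustrative; the lecture example applies this to the full 1960-2009 record with n = 50):

```python
import numpy as np

def plotting_positions(series: np.ndarray, formula: str = "weibull"):
    """Rank a series in descending order and assign exceedance probabilities P_m."""
    x = np.sort(series)[::-1]                 # largest value gets rank m = 1
    n = x.size
    m = np.arange(1, n + 1)
    p = {
        "california": m / n,
        "weibull":    m / (n + 1),
        "hazen":      (2 * m - 1) / (2 * n),
        "tukey":      (3 * m - 1) / (3 * n + 1),
    }[formula]
    return x, p

# Illustrative short record: annual rainfall (mm), 1960-1969, from the table above.
rain = np.array([2237, 2232, 1741, 901, 2432, 2353, 2398, 1571, 2288, 1896])
x, p = plotting_positions(rain, "weibull")
T = 1 / p                                      # return period is the inverse of P_m
print(np.column_stack([x, p.round(3), T.round(1)]))
```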
Probability Density Function

Probability Density Function
For a random variable X with probability density function f(x), the following are true:
  $f(x) \ge 0$ for all x
  $\int_{-\infty}^{\infty} f(x)\, dx = 1$
  $P(a \le X \le b) = \int_a^b f(x)\, dx$

Cumulative Distribution Function
  $F(a) = P(X \le a) = \int_{-\infty}^{a} f(x)\, dx$
➢ F(x) is the cumulative distribution function (CDF)
➢ F(a) is the area under the PDF from −∞ to a
➢ The CDF F(x) and the PDF f(x) are related to each other by
  $F(x) = \int_{-\infty}^{x} f(t)\, dt$   and   $f(x) = \dfrac{dF(x)}{dx}$

Normal distribution (Gaussian distribution) (1)
  $f(x) = \dfrac{1}{\sigma\sqrt{2\pi}}\, e^{-\frac{(x-\mu)^2}{2\sigma^2}}$
Properties of the Normal distribution
➢ E(X) = μ
➢ Var(X) = σ²
➢ Skewness coefficient = 0

Normal distribution (Gaussian distribution) (2)
Standard Normal distribution
➢ When μ = 0 and σ² = 1 in a Normal distribution, the resulting distribution is called a Standard Normal distribution:
  $\phi(z) = \dfrac{1}{\sqrt{2\pi}}\, e^{-z^2/2}$
➢ Any other Normal random variable can be transformed into a Standard Normal random variable Z, referred to as a Normal deviate and defined as
  $Z = \dfrac{X - \mu}{\sigma}$

Frequency Analysis: Normal distribution (1)
Using probability paper:
➢ assume a Normal distribution N(μ, σ)
➢ compute $\bar{x}$ and s
➢ assume μ = $\bar{x}$ and σ = s
➢ plot a straight line using any two points
➢ usually, the following two points are used: ($\bar{x}$ − s, 0.1587) and ($\bar{x}$ + s, 0.8413)
➢ the values 0.1587 and 0.8413 are the cumulative probabilities $P(X \le \bar{x} - s)$ and $P(X \le \bar{x} + s)$

Frequency Analysis: Normal distribution (2)
• For visual comparison of the sample data with the fitted straight line, the data can be arranged in descending order and the plotting positions calculated using any of the plotting position formulas
• If the fit is not good, it may be because either
  ➢ the Normal distribution is not suitable, or
  ➢ the sample statistics are not good estimators of the population parameters

Normal distribution: Example
Plotting positions: (1) $P_m = m/n$ and (2) $P_m = m/(n+1)$, where $P_m$ estimates $P(X \ge x_m)$

Year  Rainfall (mm)  No.  P(1)  P(2)
1997  3343            1   0.02  0.020
1982  3248            2   0.04  0.039
2005  3215            3   0.06  0.059
1973  3100            4   0.08  0.078
2001  3092            5   0.10  0.098
2008  3066            6   0.12  0.118
1975  3029            7   0.14  0.137
1983  2894            8   0.16  0.157
1972  2807            9   0.18  0.176
1995  2754           10   0.20  0.196
2000  2752           11   0.22  0.216
1994  2726           12   0.24  0.235
1992  2679           13   0.26  0.255
2006  2628           14   0.28  0.275
1979  2615           15   0.30  0.294
  :     :             :     :     :

[Figure: ranked data plotted on normal probability paper using plotting position P(1), with the fitted straight line through ($\bar{x}$ − s, 0.1587) and ($\bar{x}$ + s, 0.8413)]
[Figure: the same data plotted using plotting position P(2), with the same fitted straight line]
[Figure: relative frequency histogram of the annual rainfall, frequency plotted against rainfall from 500 to 4000 mm]

Normal distribution and Frequency Factor (1)
Frequency analysis using the frequency factor method:
• The cumulative probability refers to $P(X \le x_T)$, which is equal to (1 − 1/T)
  For example, if T = 100, $P(X \le x_{100})$ = 1 − 1/100 = 0.99
  if T = 50, $P(X \le x_{50})$ = 1 − 1/50 = 0.98
  if T = 20, $P(X \le x_{20})$ = 1 − 1/20 = 0.95
  if T = 5, $P(X \le x_{5})$ = 1 − 1/5 = 0.8
• Frequency factor K = f(type of distribution, return period)

Normal distribution and Frequency Factor (2)
• In the case of the Normal distribution, K = the value of the standard variate corresponding to the cumulative probability = the z value corresponding to $P(X \le x_T)$
  e.g. T = 100, P = 0.99, and z = 2.326
  T = 50, P = 0.98, and z = 2.054
  T = 20, P = 0.95, and z = 1.645
  T = 5, P = 0.80, and z = 0.8416
  T = 2, P = 0.50, and z = 0

Extreme Value Distribution (1)
➢ The generalized extreme value distribution is given by
  $F(x) = \exp\!\left[-\{1 - k(x - \beta)/\alpha\}^{1/k}\right]$ for k ≠ 0
  $F(x) = \exp\!\left[-\exp\{-(x - \beta)/\alpha\}\right]$ for k = 0
➢ k = 0 corresponds to the Type I, or Gumbel, distribution; k < 0 corresponds to the Type II distribution; and k > 0 corresponds to the Type III distribution
➢ Type I is used for maximum values and Type III for minimum values

Extreme Value Distribution (2)
• The Gumbel distribution is expressed as follows:
  $P(X \le x) = e^{-e^{-\alpha(x - \beta)}}$
  $f(x) = \dfrac{dF(x)}{dx} = \alpha\, e^{-\alpha(x - \beta)}\, e^{-e^{-\alpha(x - \beta)}}$

Properties
• The parameters are given by
  $E(X) = \mu = \beta + 0.5772/\alpha$
  $\sigma = \pi/(\alpha\sqrt{6}) = 1.2825/\alpha$
  Median [$F(x_{md}) = 0.5$]: $x_{md} = \beta + 0.3665/\alpha$
  Skewness coefficient ≈ 1.14

[Figure: Gumbel fit to the Hong Kong annual rainfall for the period 1960-2009]

Extreme Value Distribution: Example
The following table lists the record of annual peak discharges at a stream gauging station:

Year               2001  2002  2003  2004  2005  2006  2007  2008  2009
Discharge (m³/s)     25    17    26    31    42    19    37    20    35

Using the Extreme Value Type I distribution, determine (a) the probability that an annual flood peak of 30 m³/s will not be exceeded, (b) the return period of an annual flood peak of 42 m³/s, and (c) the annual peak discharge for a 20-year return period.
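One way to work this example numerically is to estimate α and β by the method of moments, using the property relations above (σ = 1.2825/α and μ = β + 0.5772/α). The following Python sketch illustrates that approach; it is not necessarily the exact procedure followed in the lecture:

```python
import numpy as np

# Annual peak discharges (m^3/s), 2001-2009, from the example above.
q = np.array([25, 17, 26, 31, 42, 19, 37, 20, 35], dtype=float)

# Fit the Gumbel (EV Type I) parameters by the method of moments:
# sigma = 1.2825 / alpha and mean = beta + 0.5772 / alpha.
xbar, s = q.mean(), q.std(ddof=1)
alpha = 1.2825 / s
beta = xbar - 0.5772 / alpha

def F(x):
    """Gumbel CDF, P(X <= x)."""
    return np.exp(-np.exp(-alpha * (x - beta)))

# (a) probability that a peak of 30 m^3/s is not exceeded
print(F(30.0))                       # ~ 0.66
# (b) return period of a peak of 42 m^3/s
print(1.0 / (1.0 - F(42.0)))         # ~ 14 years
# (c) peak discharge for a 20-year return period, from F(x_T) = 1 - 1/T
T = 20.0
x_T = beta - np.log(-np.log(1.0 - 1.0 / T)) / alpha
print(x_T)                           # ~ 44 m^3/s
```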
Jointly Distributed Random Variables

Moments for Jointly Distributed Random Variables (1)
• If X and Y are jointly distributed continuous random variables and U is a function of X and Y, U = g(X, Y),
  – then E(U) can be calculated using
    $E(U) = E[g(X, Y)] = \int_{-\infty}^{\infty} u\, p_U(u)\, du$
  – or directly as
    $E[g(X, Y)] = \int_{-\infty}^{\infty} \int_{-\infty}^{\infty} g(x, y)\, p_{X,Y}(x, y)\, dx\, dy$

Moments for Jointly Distributed Random Variables (2)
• A general expression for the (r, s) moment of the jointly distributed random variables X and Y:
  – moments about the origin: $\mu'_{r,s} = \int \int x^r y^s\, p_{X,Y}(x, y)\, dx\, dy$
  – central moments: $\mu_{r,s} = \int \int (x - \mu_X)^r (y - \mu_Y)^s\, p_{X,Y}(x, y)\, dx\, dy$

Covariance
➢ The covariance of X and Y is the (1, 1) central moment:
  $\mathrm{Cov}(X, Y) = \sigma_{X,Y} = \mu_{1,1} = E[(X - \mu_X)(Y - \mu_Y)] = E(XY) - E(X)E(Y) = \int \int (x - \mu_X)(y - \mu_Y)\, p_{X,Y}(x, y)\, dx\, dy$
➢ The sample estimate of the covariance:
  $s_{X,Y} = \sum_i (x_i - \bar{x})(y_i - \bar{y}) / (n - 1)$
➢ If X and Y are independent,
  $\sigma_{X,Y} = \int (x - \mu_X)\, p_X(x)\, dx \int (y - \mu_Y)\, p_Y(y)\, dy = 0$

Correlation Coefficient (1)
➢ The covariance has units equal to the units of X times the units of Y
➢ The correlation coefficient is a normalized covariance, computed using
  $\rho_{X,Y} = \dfrac{\sigma_{X,Y}}{\sigma_X \sigma_Y}$

Correlation Coefficient (2)
➢ The population correlation coefficient can be estimated using
  $\hat{\rho}_{X,Y} = \dfrac{s_{X,Y}}{s_X s_Y}$
  where $s_X$ and $s_Y$ are the sample estimates of $\sigma_X$ and $\sigma_Y$, and $s_{X,Y}$ is the sample covariance

Example: The tabulation below shows the occurrence of average daily temperature (T) and average daily relative humidity (RH) on each day of a selected 8-day period for 43 consecutive years.

                        Temperature (°C)
Relative Humidity (%)   -5~0   0~5   5~10   10~15   15~20   20~25
0~20                      1     4     6      1       2       1
20~40                     4     9    11     30       8       8
40~60                     5    15    31     62      28      20
60~80                     3     8     9     25      18      11
80~100                    1     0     1     12       8       2

Determine:
1. $f_{X,Y}(x_i, y_j)$
2. $f_X(x_i)$ and $F_X(x_i)$
3. $f_Y(y_j)$ and $F_Y(y_j)$
4. the probability that
   a. 10 ≤ T ≤ 15 and 60 ≤ RH ≤ 80
   b. 10 ≤ T ≤ 15 given that 60 ≤ RH ≤ 80
   c. T ≤ 20
   d. RH ≤ 60
   e. T ≤ 10 and RH ≤ 40
5. whether T and RH are independent
6. the covariance and the correlation of T and RH

Solution
There are 6 intervals of temperature and 5 intervals of relative humidity. Let $x_i$ be the ith temperature interval for i = 1 to 6 and let $y_j$ be the jth relative humidity range for j = 1 to 5. Letting $n_{ij}$ be the entry in the tabulation corresponding to the ith temperature interval and the jth relative humidity interval, we have
  $f_{X,Y}(x_i, y_j) = \dfrac{n_{ij}}{N}$,   where $N = \sum_{i,j} n_{ij} = 8 \times 43 = 344$

j \ i       1        2        3        4        5        6       f_Y(y_j)  F_Y(y_j)
1        0.0029   0.0116   0.0174   0.0029   0.0058   0.0029     0.0435    0.0435
2        0.0116   0.0262   0.0320   0.0872   0.0233   0.0233     0.2036    0.2471
3        0.0145   0.0436   0.0901   0.1802   0.0814   0.0581     0.4679    0.7150
4        0.0087   0.0233   0.0262   0.0727   0.0523   0.0320     0.2152    0.9302
5        0.0029   0        0.0029   0.0349   0.0233   0.0058     0.0698    1.0000
f_X(x_i) 0.0406   0.1047   0.1686   0.3779   0.1861   0.1221
F_X(x_i) 0.0406   0.1453   0.3139   0.6918   0.8779   1.0000
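Parts 1-3 and 6 of the example can be checked numerically if each class is represented by its midpoint; the following Python sketch does so (the midpoint representation is an assumption for illustration, not stated in the slides):

```python
import numpy as np

# Observed counts n_ij: rows = RH classes (0-20, ..., 80-100 %),
# columns = temperature classes (-5-0, ..., 20-25 degC), from the tabulation above.
counts = np.array([
    [1,  4,  6,  1,  2,  1],
    [4,  9, 11, 30,  8,  8],
    [5, 15, 31, 62, 28, 20],
    [3,  8,  9, 25, 18, 11],
    [1,  0,  1, 12,  8,  2],
], dtype=float)

N = counts.sum()                        # 8 * 43 = 344
f_xy = counts / N                       # joint relative frequencies f_{X,Y}
f_x = f_xy.sum(axis=0)                  # marginal of temperature
f_y = f_xy.sum(axis=1)                  # marginal of relative humidity
F_x, F_y = np.cumsum(f_x), np.cumsum(f_y)

# Class midpoints (assumed representative values; the slides give only class limits).
t_mid  = np.array([-2.5, 2.5, 7.5, 12.5, 17.5, 22.5])   # degC
rh_mid = np.array([10.0, 30.0, 50.0, 70.0, 90.0])        # %

# Covariance and correlation of the grouped (tabulated) distribution.
mu_t  = (f_x * t_mid).sum()
mu_rh = (f_y * rh_mid).sum()
cov = (f_xy * np.outer(rh_mid - mu_rh, t_mid - mu_t)).sum()
var_t  = (f_x * (t_mid - mu_t) ** 2).sum()
var_rh = (f_y * (rh_mid - mu_rh) ** 2).sum()
rho = cov / np.sqrt(var_t * var_rh)

print(N, f_x.round(4), F_x.round(4), cov.round(2), rho.round(3))
```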