Second Semester 2022-2023 (16.01-29.04.2023)
CIVL 6005: Data Analysis in Hydrology
Lecture 2 (February 7, 2023)
Review of Probability and Statistics (2)
Lecturer: Ji Chen
Department of Civil Engineering
The University of Hong Kong, Pokfulam
[Figure: Intensity–Duration–Frequency curves (for durations not exceeding 4 hours) in Hong Kong, from the Stormwater Drainage Manual – Planning, Design and Management (DSD, p. 150)]
Bayes Theorem
• Bayes' theorem provides a method for incorporating
new information with previous, or so-called prior,
probability assessments to yield new values for the
relative likelihood of events of interest
$$p(B_i \mid A) = \frac{p(B_i)\,p(A \mid B_i)}{\sum_{j=1}^{n} p(A \mid B_j)\,p(B_j)}$$
• These new (conditional) probabilities are called
posterior probabilities
• Bayes theorem provides a means of estimating
probabilities of one event by observing a second event
Example
The manager of a recreational facility has determined
that the probability he will have 1000 or more visitors
on any Sunday in July depends upon the maximum
temperature for that Sunday as shown in a table in the
next slide. The table also gives the probabilities that the
maximum temperature will fall in the indicated ranges.
On a certain Sunday in July, the facility has 1000 or
more visitors.
What is the probability that the maximum temperature
was in each of the temperature classes?
Example
Temperature   Prob of 1000   Prob of being in      Prob of Tj | 1000
Tj (°C)       or more        temperature class     or more visitors
<15           0.05           0.05
15-20         0.20           0.15
20-25         0.50           0.20
25-30         0.75           0.35
30-35         0.50           0.20
>35           0.25           0.05
Solution
• Let Tj for j=1, 2, … 6 represent the 6 intervals
of temperature. Then from Bayes theorem
$$p(T_j \mid 1000\text{ or more}) = \frac{p(1000\text{ or more} \mid T_j)\,p(T_j)}{\sum_{i=1}^{6} p(1000\text{ or more} \mid T_i)\,p(T_i)}$$
Example
Temperature   Prob of 1000   Prob of being in      Prob of Tj | 1000
Tj (°C)       or more        temperature class     or more visitors
<15           0.05           0.05                  0.005
15-20         0.20           0.15                  0.059
20-25         0.50           0.20                  0.197
25-30         0.75           0.35                  0.517
30-35         0.50           0.20                  0.197
>35           0.25           0.05                  0.025
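The posterior column can be reproduced with a few lines of Python (a minimal sketch; the lists simply transcribe the table above):

```python
# Posterior probabilities for the recreation-facility example via Bayes' theorem
classes = ["<15", "15-20", "20-25", "25-30", "30-35", ">35"]
p_visit_given_T = [0.05, 0.20, 0.50, 0.75, 0.50, 0.25]  # P(1000 or more | Tj)
p_T = [0.05, 0.15, 0.20, 0.35, 0.20, 0.05]              # P(Tj)

# total probability of 1000 or more visitors (the denominator in Bayes' theorem)
p_visit = sum(a * b for a, b in zip(p_visit_given_T, p_T))  # = 0.5075

for c, a, b in zip(classes, p_visit_given_T, p_T):
    print(f"P(T in {c} | 1000 or more) = {a * b / p_visit:.3f}")
```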
Sampling Principle: Counting
Sampling Principle: Counting (1)
• Multiplication principle:
– If an experiment can be divided into
successive tasks in such a way that the
number of available outcomes at each
step does not depend on the outcomes
of the previous steps,
– then the total number of outcomes is the
product of the number of available
outcomes at each step
Sampling Principle: Counting (2)
• Addition principle: If an experiment can
be split into several mutually exclusive
methods of accomplishing it, such that
– the only possible ways of accomplishing the
experiment are these “several” methods, and
– using one of these methods makes the
other methods impossible,
– then the number of ways of accomplishing
the experiment is the sum of the available
outcomes over the methods
Sampling Principle: Counting (3)
• Ordering and replacing principle:
– Sampling can be done with replacement so
that the item selected is immediately
returned to the population or without
replacement so that the item is not returned
– The order of sampling may be important in
some situations and not in others
• Four types of samples
– ordered with replacement
– ordered without replacement
– unordered with replacement, and
– unordered without replacement
Sampling Principle: Counting (4)
Consider choosing a sample of size r from a population
of n elements. The following table gives the number of
ways of selecting samples under the four types.
                      Ordered                              Unordered
With replacement      $n^r$                                $\binom{n+r-1}{r}$
Without replacement   $^{n}P_r = \dfrac{n!}{(n-r)!}$       $\binom{n}{r} = \dfrac{n!}{r!\,(n-r)!}$
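These four formulas are easy to check numerically; a sketch in Python using the standard library (math.perm and math.comb require Python 3.8 or later), with illustrative values n = 5 and r = 3:

```python
from math import comb, perm

n, r = 5, 3  # population of 5 elements, sample of size 3 (illustrative values)

print("ordered, with replacement:     ", n ** r)             # n^r = 125
print("ordered, without replacement:  ", perm(n, r))         # n!/(n-r)! = 60
print("unordered, with replacement:   ", comb(n + r - 1, r)) # C(n+r-1, r) = 35
print("unordered, without replacement:", comb(n, r))         # C(n, r) = 10
```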
Sample Data: Grouping
Sample Data Grouping (1)
• Since it is difficult to grasp the total data picture
from tabulation, a useful first step in data
analysis is to plot the data as a frequency
histogram
• A histogram is a graphical representation of the
frequency distribution of a continuous variable
• Rectangles are drawn in such a way that their
bases lie on a linear scale representing different
class intervals, and their heights are proportional
to the frequencies of the values within each of
the intervals
Sample Data Grouping (2)
• The midpoint of a class is called the class
mark. The class interval is the difference
between the upper and lower class boundaries
• The selection of the class interval and the
location of the first class mark can
appreciably affect the appearance of a
frequency histogram
Sample Data Grouping (3)
• The appropriate width for a class interval
depends on the range of the data, the
number of observations, and the behavior of
the data
– One suggestion for the class interval is not
exceeding one-fourth to one-half of the standard
deviation of the data, and
– a suggestion for the number of classes is
$$m = 1 + 3.3\log_{10}(n)$$
where m is the number of classes and n the
number of data values.
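For example, for the n = 40 annual rainfall totals tabulated below, $m = 1 + 3.3\log_{10}(40) \approx 6.3$, so 6 or 7 classes would be a reasonable choice.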
Sample Data Grouping (4)
• Be cautious about using too few or too many classes
in a histogram.
– Too few classes will eliminate detail and
obscure the basic pattern of the data
– Too many classes result in erratic patterns of
alternating high and low frequencies
– The aim of making histograms is to grasp the
full meaning of the data distribution
Year  Rainfall (mm)   Year  Rainfall (mm)   Year  Rainfall (mm)   Year  Rainfall (mm)
1960  2,250           1970  2,360           1980  1,700           1990  2,050
1961  2,240           1971  1,910           1981  1,650           1991  1,600
1962  1,750           1972  2,850           1982  3,250           1992  2,700
1963  900             1973  3,100           1983  2,900           1993  2,350
1964  2,450           1974  2,350           1984  2,000           1994  2,750
1965  2,430           1975  3,010           1985  2,200           1995  2,790
1966  2,440           1976  2,200           1986  2,400           1996  2,250
1967  1,510           1977  1,650           1987  2,350           1997  3,350
1968  2,350           1978  2,600           1988  1,600           1998  2,550
1969  1,900           1979  2,610           1989  1,950           1999  2,100
[Figure: frequency histogram of annual rainfall in Hong Kong (1960-1999); class marks from 900 to 3350 mm at 350 mm spacing, frequency axis 0-14]
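A sketch of how this histogram could be reproduced in Python (numpy and matplotlib are assumed to be available; the 350 mm bin edges are inferred from the class marks in the figure):

```python
import numpy as np
import matplotlib.pyplot as plt

# annual rainfall in Hong Kong, 1960-1999 (mm), from the table above
rain = [2250, 2240, 1750, 900, 2450, 2430, 2440, 1510, 2350, 1900,
        2360, 1910, 2850, 3100, 2350, 3010, 2200, 1650, 2600, 2610,
        1700, 1650, 3250, 2900, 2000, 2200, 2400, 2350, 1600, 1950,
        2050, 1600, 2700, 2350, 2750, 2790, 2250, 3350, 2550, 2100]

m = 1 + 3.3 * np.log10(len(rain))   # suggested number of classes (about 6.3)

# class marks 900, 1250, ..., 3350 imply 350 mm bins centred on those marks
edges = np.arange(725, 3526, 350)
plt.hist(rain, bins=edges, edgecolor="black")
plt.xlabel("Annual rainfall (mm)")
plt.ylabel("Frequency")
plt.title("Annual Rainfall in Hong Kong (1960-1999)")
plt.show()
```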
Variables
Random Variables (1)
• A random variable is a function defined on a
sample space
• Random variables may be discrete or continuous
– If the set of values of a random variable is
finite (or countably infinite), the random
variable is said to be a discrete random
variable
– If the set of values of a random variable is
uncountably infinite (e.g., an interval of the real
line), the random variable is said to be a
continuous random variable
Random Variables (2)
• In this course, capital letters will be
used to denote random variables and
the corresponding lower-case letters
will represent values of the random
variable
• Any function of a random variable is
also a random variable
Univariate probability distributions
• pX(x) denotes probability density
function (pdf)
• PX(x) denotes cumulative distribution
function (cdf)
Bivariate Distributions (1)
• The situation frequently arises where one is
interested in the simultaneous behavior of
two or more random variables
• If X and Y are continuous random
variables, their joint probability density
function is pX,Y(x, y) and the corresponding
cumulative probability distribution is
PX,Y(x, y)
• These two are related
Bivariate Distributions (2)
2
p X ,Y (x, y ) =
PX ,Y (x, y )
xy
PX ,Y ( x, y ) = p ( X  x and Y  y ) = 
x

y
− −
p X ,Y (t , s )dsdt
f X, Y (xi , y j ) = p(X = x i and Y = y j )
FX, Y (xi , y j ) = p(X  x i and Y  y j ) =   f X, Y ( xi , y j )
xi  x y j  y
Marginal distributions (1)
• The definition of the marginal distribution is the
distribution of X across all values of Y
• In other words, the distribution of X ignoring Y
• For instance the marginal density of X, pX(x), is
obtained by integrating pX,Y(x, y) over all possible
values of Y
$$p_X(x) = \int_{-\infty}^{\infty} p_{X,Y}(x, s)\,ds$$
Marginal distributions (2)
• The cumulative marginal distribution is given by
$$P_X(x) = p(X \le x \text{ and } Y \le \infty) = \int_{-\infty}^{x}\int_{-\infty}^{\infty} p_{X,Y}(t, s)\,ds\,dt = \int_{-\infty}^{x} p_X(t)\,dt$$
Conditional distributions
• The distribution of one variable with restrictions
or conditions placed on the second variable is
called a conditional distribution.
$$p_{X|Y}(x \mid Y \in R) = \frac{\int_R p_{X,Y}(x, s)\,ds}{\int_R p_Y(s)\,ds}$$
Independence
• In general the conditional density of X given Y is a
function of y
• If the random variables X and Y are independent, this
functional relationship disappears
• In fact in this case, the conditional density equals the
marginal density
$$p_{X,Y}(x, y) = p_X(x)\,p_Y(y) \qquad\qquad f_{X,Y}(x, y) = f_X(x)\,f_Y(y)$$
Derived Distributions (1)
• For a univariate continuous distribution of
the random variable X, the distribution of
U where U = u(X) is a monotonic function
can be found from
$$p_U(u) = p_X(x)\left|\frac{dx}{du}\right|$$
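As a quick illustration (a linear transformation, not from the slides): if $U = aX + b$ with $a \neq 0$, then $x = (u-b)/a$ and
$$p_U(u) = p_X\!\left(\frac{u-b}{a}\right)\frac{1}{|a|},$$
so, for example, a Normal $X$ gives a Normal $U$ with mean $a\mu + b$ and standard deviation $|a|\sigma$.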
Derived Distributions (2)
• For a continuous bivariate density, the
transformation from pX,Y(x, y) to pU,V(u, v),
where U = u(X, Y) and V = v(X, Y) are one-to-
one continuously differentiable transformations,
can be made by the relationship
$$p_{U,V}(u, v) = p_{X,Y}(x, y)\left|J\!\left(\frac{x, y}{u, v}\right)\right|$$
where $J\!\left(\frac{x, y}{u, v}\right) = \frac{\partial(x, y)}{\partial(u, v)}$ is the Jacobian of the transformation
PROPERTIES OF RANDOM
VARIABLES
PROPERTIES OF RANDOM VARIABLES (1)
• Every hydrologic variable is a random variable
– for example: rainfall, streamflow, infiltration
rates, evaporation, reservoir storage, etc
• Any process whose outcome is a random
variable is defined as an experiment
• A single outcome from this experiment is called
a realization of the experiment or an observation
from the experiment
PROPERTIES OF RANDOM VARIABLES (2)
• The terms realization and observation
are often used interchangeably
• However, an observation is generally
taken to be a single value of a random
variable, and
• a realization is generally taken to be a
time series of random variables generated
by a random experiment
PROPERTIES OF RANDOM VARIABLES (3)
• The complete assemblage of all the
values representative of a particular
random process is called a population
• Any subset of these values would be
a sample from the population
Parameters for describing population (1)
• Quantities that are descriptive of
a population are called
parameters
• Sample statistics are estimates
for population parameters
Parameters for describing population (2)
• Sample statistics are estimated from samples
of data and as such are functions of random
variables (the sample values) and thus are
themselves random variables
• For a decision based on a sample to be valid in
terms of the population, the sample statistics
must be representative of the population
parameters
Parameters for describing population (3)
• It is required that the sample itself be
representative of the population and that “good”
parameter estimation procedures are used
• Similarly, one cannot get “good” estimates for
the parameters of a streamflow synthesis model
if the estimates are based on a short period of
record during which an extreme drought (or
wet) occurred
Parameters for describing population (4)
• Usually the true probability density
function that generated the available
sample of data is not known
• Thus, it is necessary not only to estimate
the population parameters, but also to
estimate the form of the random process
(experiment) that generated the data
Moments of Distribution
➢ the ith moment about the origin
✓ for the continuous random variable X
$$\mu_i' = \int_{-\infty}^{\infty} x^i\,p_X(x)\,dx$$
✓ for the case of a discrete distribution
$$\mu_i' = \sum_j x_j^i\,f_X(x_j)$$
➢ the ith central moment is defined as the ith
moment about the mean of a distribution
$$\mu_i = \int_{-\infty}^{\infty} (x - \mu)^i\,p_X(x)\,dx \qquad\qquad \mu_i = \sum_j (x_j - \mu)^i\,f_X(x_j)$$
Mean and Variance
➢ The expected value (mean) of the random
variable X
$$E(X) = \mu = \mu_1' = \int_{-\infty}^{\infty} x\,p_X(x)\,dx \qquad\qquad E(X) = \mu_1' = \sum_j x_j\,f_X(x_j)$$
➢ Variance of the random variable X
$$\sigma^2 = E\left[(X - \mu)^2\right] = \mu_2 = \int_{-\infty}^{\infty} (x - \mu)^2\,p_X(x)\,dx \qquad\qquad \sigma^2 = \sum_j (x_j - \mu)^2\,f_X(x_j)$$
Measures of Central Tendency (1)
➢ Arithmetic mean: the mean of random
variable X
✓ E ( X ) = 1'
✓ A sample estimate of the population mean
X = i =1 xi / n
n
➢ Geometric mean:
(
)
1/ n
n
i =1 i
XG =  x
Measures of Central Tendency (2)
➢ Median:
✓ A sample median (Xmd) is the observation such that
half of the values lie on either side of Xmd
✓ A population median $\mu_{md}$ satisfies
$$\int_{-\infty}^{\mu_{md}} p_X(x)\,dx = 0.5 \qquad \text{or} \qquad \mu_{md} = x_p \ \text{with} \ \sum_{i=1}^{p} f_X(x_i) = 0.5$$
➢ Mode: A possible value (Xmo) of X that occurs with a
probability at least as large as the probability of any
other value of X
Measures of Dispersion (1)
• range: the difference between
the largest and smallest
sample values
– relative range is the range divided
by the mean
Measures of Dispersion (2)
• variance or standard deviation (positive
square root of the variance): the most
common measure of dispersion
– The sample estimate of σ2 is denoted by s2
$$s^2 = \frac{\sum_i (x_i - \bar{x})^2}{n - 1}$$
– A dimensionless measure of dispersion is the
coefficient of variation, defined as the standard
deviation divided by the mean
$$c_v = \frac{s}{\bar{x}}$$
Measures of Symmetry (1)
➢ absolute skewness: the difference between
the mean and the mode
➢ relative skewness: defined as the difference
between the mean and the mode divided by
the standard deviation
✓ population measure of skewness: $(\mu - \mu_{mo})/\sigma$
✓ sample measure of skewness: $(\bar{x} - x_{mo})/s$
Measures of Symmetry (2)
➢ The most commonly used measure of
skewness is the coefficient of skew
$$\gamma = \frac{\mu_3}{\mu_2^{3/2}}$$
❖ an unbiased estimate for the coefficient of skew
based on a sample of size n is
$$c_s = \frac{n^2 M_3}{(n-1)(n-2)\,s_X^3}$$
where $M_3$ is the sample estimate for $\mu_3$
Measures of Peakedness (1)
• Kurtosis: the extent of peakedness or flatness
of a probability distribution in comparison
with the normal probability distribution
=
4
 22
the sample estimate for the kurtosis
n3 M 4
k=
(n − 1)(n − 2)(n − 3)s X4
where M4 is the sample estimate for 4
• Mesokurtic: Kurtosis for a normal distribution is 3
Measures of Peakedness (2)
• Leptokurtic (kurtosis > 3): a relatively
greater concentration of probability near
the mean than does the normal
• Platykurtic (kurtosis < 3): a relatively
smaller concentration of probability near
the mean than does the normal
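As an illustration, the sample statistics of the 1960-1999 rainfall series from earlier in this lecture could be computed as follows (a sketch assuming scipy is available; note that scipy's bias-corrected skewness and kurtosis use slightly different small-sample formulas than the $c_s$ and $k$ above):

```python
import numpy as np
from scipy import stats

rain = np.array([2250, 2240, 1750, 900, 2450, 2430, 2440, 1510, 2350, 1900,
                 2360, 1910, 2850, 3100, 2350, 3010, 2200, 1650, 2600, 2610,
                 1700, 1650, 3250, 2900, 2000, 2200, 2400, 2350, 1600, 1950,
                 2050, 1600, 2700, 2350, 2750, 2790, 2250, 3350, 2550, 2100])

print("mean:                 ", rain.mean())
print("std dev (n-1):        ", rain.std(ddof=1))
print("coeff. of variation:  ", rain.std(ddof=1) / rain.mean())
print("skew (corrected):     ", stats.skew(rain, bias=False))
print("kurtosis (not excess):", stats.kurtosis(rain, fisher=False, bias=False))
```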
Plotting Positions
Annual Rainfall in Hong Kong (1960-2009)
Year  Rainfall (mm)   Year  Rainfall (mm)   Year  Rainfall (mm)   Year  Rainfall (mm)   Year  Rainfall (mm)
1960  2,237           1970  2,316           1980  1,711           1990  2,047           2000  2,752
1961  2,232           1971  1,904           1981  1,660           1991  1,639           2001  3,092
1962  1,741           1972  2,807           1982  3,248           1992  2,679           2002  2,490
1963  901             1973  3,100           1983  2,894           1993  2,344           2003  1,942
1964  2,432           1974  2,323           1984  2,017           1994  2,726           2004  1,739
1965  2,353           1975  3,029           1985  2,191           1995  2,754           2005  3,215
1966  2,398           1976  2,197           1986  2,338           1996  2,249           2006  2,628
1967  1,571           1977  1,680           1987  2,319           1997  3,343           2007  1,707
1968  2,288           1978  2,593           1988  1,685           1998  2,565           2008  3,066
1969  1,896           1979  2,615           1989  1,945           1999  2,129           2009  2,182
Mean = 2318.2 mm, s = 518.8 mm
Plotting Position: Definition
➢ plotting position refers to the probability value
assigned to each piece of data to be plotted
➢ the probability of an event being exceeded in
any year can be estimated by using the plotting
position, which is defined by a number of
formulae as given below:
n: the number of samples
m: the rank of the sample when arranged in
descending order
Pm: exceedance probability for an event with rank m
Empirical Equations for Computing Pm
➢ California
Pm = m/n
➢ Weibull
Pm = m/(n+1)
➢ Hazen
Pm = (2m-1)/(2n)
➢ Tukey
Pm = (3m-1)/(3n+1)
The inverse of Pm gives the return period T
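A sketch of the four formulas in Python (the sample values here are the nine annual peak discharges from the Extreme Value example later in this lecture):

```python
import numpy as np

def plotting_positions(x):
    """Exceedance probabilities for a sample x, ranked in descending order."""
    n = len(x)
    m = np.arange(1, n + 1)            # rank m = 1 for the largest value
    values = np.sort(np.asarray(x))[::-1]
    return values, {
        "California": m / n,
        "Weibull": m / (n + 1),
        "Hazen": (2 * m - 1) / (2 * n),
        "Tukey": (3 * m - 1) / (3 * n + 1),
    }

values, pp = plotting_positions([25, 17, 26, 31, 42, 19, 37, 20, 35])
print(values[0], pp["Weibull"][0])     # largest value 42 gets Pm = 1/10 = 0.1
print(1 / pp["Weibull"][0])            # corresponding return period T = 10 years
```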
Probability Density Function
Probability Density Function
For a random variable X with a probability
density function f(x), the following are true:
f(x)  0 for all x ;

 f(x)dx = 1
−
b
P(a  X  b) =
 f ( x)dx
a
Cumulative Distribution Function
$$F(a) = P(X \le a) = \int_{-\infty}^{a} f(x)\,dx$$
➢ F(x) is the cumulative distribution function
(CDF)
➢ F(a) is the area under the PDF from -∞ to a
➢ The CDF F(x) and PDF f(x) are related to each
other by
$$F(x) = \int_{-\infty}^{x} f(t)\,dt \qquad\qquad f(x) = \frac{dF(x)}{dx}$$
Normal distribution (Gaussian distribution) (1)
$$f(x) = \frac{1}{\sqrt{2\pi}\,\sigma}\,e^{-\frac{(x-\mu)^2}{2\sigma^2}}$$
Properties of the Normal distribution
➢ E(X) = μ
➢ Var(X) = σ²
➢ Skewness coefficient = 0
Normal distribution (Gaussian distribution) (2)
Standard Normal distribution
➢ when μ = 0 and σ² = 1 in a Normal distribution,
the resulting distribution is called a Standard
Normal Distribution
$$\phi(z) = \frac{1}{\sqrt{2\pi}}\,e^{-z^2/2}$$
➢ any other Normal random variable can be
transformed into a Standard Normal random variable
Z, referred to as a Normal deviate and defined as
$$Z = \frac{X - \mu}{\sigma}$$
Frequency Analysis: Normal distribution (1)
using probability paper
➢ assume a Normal distribution N(μ, σ)
➢ compute x̄ and s
➢ assume μ = x̄ and σ = s
➢ plot a straight line using any two points
➢ usually, the following two points are used: (x̄ − s,
0.1587) and (x̄ + s, 0.8413)
➢ the values 0.1587 and 0.8413 are the cumulative
(non-exceedance) probabilities for the magnitudes
(x̄ − s) and (x̄ + s)
Frequency Analysis: Normal distribution (2)
• for visual comparison of the sample data with the
fitted straight line, the data can be arranged in
descending order and the plotting positions
calculated using any of the plotting position
formulas
• if the fit is not good, it may be due to either
➢ the Normal distribution is not suitable, or
➢ the sample statistics are not good estimators of
population parameters
Normal distribution : Example
Pm (X  xm )= m/n
(1)
Pm (X  xm )= m/(n+1) (2)
Year   Rainfall (mm)   No.   P (1)   P (2)
1997   3343             1    0.02    0.020
1982   3248             2    0.04    0.039
2005   3215             3    0.06    0.059
1973   3100             4    0.08    0.078
2001   3092             5    0.10    0.098
2008   3066             6    0.12    0.118
1975   3029             7    0.14    0.137
1983   2894             8    0.16    0.157
1972   2807             9    0.18    0.176
1995   2754            10    0.20    0.196
2000   2752            11    0.22    0.216
1994   2726            12    0.24    0.235
1992   2679            13    0.26    0.255
2006   2628            14    0.28    0.275
1979   2615            15    0.30    0.294
:      :                :    :       :
Normal distribution: Example P(1)
[Figure: normal probability plot of the ranked data using formula (1), with the fitted straight line through (1799.4, 0.1587) and (2837.0, 0.8413)]
Normal distribution: Example P(2)
[Figure: as above, with plotting positions from formula (2)]
Normal distribution: Example
[Figure: relative frequency histogram of annual rainfall (mm), 500-4000 mm, with the fitted normal density; frequency axis 0-0.25]
Normal distribution and Frequency Factor (1)
Frequency analysis using the frequency factor method
• the cumulative probability refers to P(X ≤ xT), which is equal
to (1 − 1/T). For example,
  if T = 100, P(X ≤ x100) = 1 − 1/100 = 0.99
  if T = 50,  P(X ≤ x50) = 1 − 1/50 = 0.98
  if T = 20,  P(X ≤ x20) = 1 − 1/20 = 0.95
  if T = 5,   P(X ≤ x5) = 1 − 1/5 = 0.8
• frequency factor K = f (type of distribution, return period)
Normal distribution and Frequency Factor (2)
• In the case of the Normal distribution,
K = value of the standard variate corresponding to the cumulative probability
  = z value corresponding to P(X ≤ xT)
e.g.
  T = 100, P = 0.99, and z = 2.326
  T = 50,  P = 0.98, and z = 2.054
  T = 20,  P = 0.95, and z = 1.645
  T = 5,   P = 0.80, and z = 0.8416
  T = 2,   P = 0.50, and z = 0
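These z values can be checked against scipy's inverse standard normal CDF (a sketch; the T-year magnitude then follows from $x_T = \bar{x} + K s$):

```python
from scipy.stats import norm

for T in (100, 50, 20, 5, 2):
    P = 1 - 1 / T                      # cumulative (non-exceedance) probability
    print(f"T = {T:3d}, P = {P:.2f}, z = {norm.ppf(P):.4f}")
```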
Extreme Value Distribution (1)
➢ the generalized extreme value distribution is given by
$$F(x) = \exp\left[-\left\{1 - k(x-\beta)/\alpha\right\}^{1/k}\right] \quad \text{for } k \ne 0$$
$$F(x) = \exp\left[-\exp\left\{-(x-\beta)/\alpha\right\}\right] \quad \text{for } k = 0$$
➢ k = 0 corresponds to the Type I, or Gumbel, distribution;
k < 0 corresponds to the Type II distribution; and k > 0
corresponds to the Type III distribution
➢ Type I is used for maximum values and Type III for
minimum values
Extreme Value Distribution (2)
• The Gumbel distribution is expressed as follows :
P ( X  x ) =e
− e−  ( x −  )
dF ( x)
− e − ( x −  ) − ( x −  )
f ( x) =
= {e
e
}
dx − ( x −  )
− ( x −  )
= {e −e
}
− ( x −  )
= {e − ( x −  ) −e
}
Properties
• The parameters are given by
$$E(X) = \mu = \beta + \frac{0.5772}{\alpha}$$
$$\sigma = \frac{\pi}{\alpha\sqrt{6}} = \frac{1.2825}{\alpha}$$
$$\text{Median } [F(x_{md}) = 0.5] = \beta + \frac{0.3665}{\alpha}$$
Skewness coefficient ≈ 1.14
HK Annual Rainfall for the Period 1960-2009
Extreme Value Distribution: Example
The following table lists the record of annual peak
discharges at a stream gauging station:
Year               2001  2002  2003  2004  2005  2006  2007  2008  2009
Discharge (m3/s)    25    17    26    31    42    19    37    20    35
Using the Extreme Value Type I distribution, determine
(a) the probability that an annual flood peak of 30 m3/s
will not be exceeded, (b) the return period of an annual
flood peak of 42 m3/s, and (c) the annual peak
discharge for 20-year return period.
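The slides stop at the problem statement; a minimal sketch of one standard solution (method-of-moments estimates of α and β from the Gumbel properties above) is:

```python
import math

q = [25, 17, 26, 31, 42, 19, 37, 20, 35]      # annual peak discharges (m3/s)
n = len(q)
mean = sum(q) / n                              # = 28.0
s = math.sqrt(sum((x - mean) ** 2 for x in q) / (n - 1))  # = 8.76

alpha = 1.2825 / s                             # from sigma = 1.2825 / alpha
beta = mean - 0.5772 / alpha                   # from mu = beta + 0.5772 / alpha

def F(x):
    """Gumbel (EV Type I) CDF with the moment-based parameter estimates."""
    return math.exp(-math.exp(-alpha * (x - beta)))

print("(a) P(X <= 30)    =", round(F(30), 3))                     # about 0.66
print("(b) T for 42 m3/s =", round(1 / (1 - F(42)), 1), "years")  # about 14
x20 = beta - math.log(-math.log(1 - 1 / 20)) / alpha              # F(x20) = 0.95
print("(c) 20-yr peak    =", round(x20, 1), "m3/s")               # about 44
```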
Jointly Distributed Random Variables
Moments for Jointly Distributed
Random Variables (1)
• If X and Y are jointly distributed continuous
random variables and U is a function of X and Y,
U=g(X, Y)
– then E(U) can be calculated using
$$E(U) = E[g(X, Y)] = \int u\,p_U(u)\,du$$
– or directly from the joint density:
$$E[g(X, Y)] = \iint g(x, y)\,p_{X,Y}(x, y)\,dx\,dy$$
Moments for Jointly Distributed
Random Variables (2)
• A general expression for the r, s moment
of the jointly distributed random
variables X and Y
– original moments:
$$\mu'_{r,s} = \iint x^r y^s\,p_{X,Y}(x, y)\,dx\,dy$$
– central moments:
$$\mu_{r,s} = \iint (x - \mu_X)^r (y - \mu_Y)^s\,p_{X,Y}(x, y)\,dx\,dy$$
Covariance
➢ the covariance of X and Y is the (1, 1) central moment
$$\mathrm{Cov}(X, Y) = \sigma_{X,Y} = \mu_{1,1} = E[(X - \mu_X)(Y - \mu_Y)] = E(XY) - E(X)E(Y) = \iint (x - \mu_X)(y - \mu_Y)\,p_{X,Y}(x, y)\,dx\,dy$$
➢ the sample estimate of the covariance
$$s_{X,Y} = \frac{\sum_i (x_i - \bar{x})(y_i - \bar{y})}{n - 1}$$
➢ if X and Y are independent
$$\sigma_{X,Y} = \int (x - \mu_X)\,p_X(x)\,dx \int (y - \mu_Y)\,p_Y(y)\,dy = 0$$
Correlation Coefficient (1)
➢ the covariance has units equal to the
units of X times the units of Y
➢ correlation coefficient: a normalized
covariance
✓ computed using
$$\rho_{X,Y} = \frac{\sigma_{X,Y}}{\sigma_X \sigma_Y}$$
Correlation Coefficient (2)
➢ The population correlation
coefficient can be estimated using
$$\hat{\rho}_{X,Y} = \frac{s_{X,Y}}{s_X\,s_Y}$$
where sX and sY are the sample estimates
for σX and σY and sX,Y is the sample
covariance
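In practice the sample covariance and correlation are one call each in numpy; a sketch with hypothetical paired observations:

```python
import numpy as np

# hypothetical paired observations of two variables X and Y
x = np.array([25, 17, 26, 31, 42, 19, 37, 20, 35])
y = np.array([61, 48, 58, 70, 88, 50, 80, 52, 74])

print("sample covariance:", np.cov(x, y, ddof=1)[0, 1])   # uses n-1, as above
print("correlation:      ", np.corrcoef(x, y)[0, 1])
```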
Example: The tabulation below shows the occurrence of average daily temperature
(T) and average daily relative humidity (RH) on each day of a selected 8-day period
for 43 consecutive years.
Relative          Temperature (°C)
Humidity (%)   -5~0   0~5   5~10   10~15   15~20   20~25
0~20             1     4      6      1       2       1
20~40            4     9     11     30       8       8
40~60            5    15     31     62      28      20
60~80            3     8      9     25      18      11
80~100           1     0      1     12       8       2
Find:
1. $f_{X,Y}(x_i, y_j)$
2. $f_X(x_i)$ and $F_X(x_i)$
3. $f_Y(y_j)$ and $F_Y(y_j)$
4. the probability that
   a. 10 ≤ T ≤ 15 and 60 ≤ RH ≤ 80
   b. 10 ≤ T ≤ 15 given that 60 ≤ RH ≤ 80
   c. T ≤ 20
   d. RH ≤ 60
   e. T ≤ 10 and RH ≤ 40
5. whether T and RH are independent
6. the covariance and the correlation of T and RH
There are 6 intervals of temperature and 5 intervals of relative humidity. Let $x_i$ be the
ith temperature interval for i = 1 to 6 and let $y_j$ be the jth relative humidity range for j = 1
to 5. Letting $n_{ij}$ be the entry in the tabulation corresponding to the ith temperature interval
and the jth relative humidity interval, we have
$$f_{X,Y}(x_i, y_j) = \frac{n_{ij}}{N}, \qquad N = \sum_{i,j} n_{ij} = 8 \times 43 = 344$$
j \ i        1        2        3        4        5        6       fY(yj)   FY(yj)
1         0.0029   0.0116   0.0174   0.0029   0.0058   0.0029    0.0435   0.0435
2         0.0116   0.0262   0.0320   0.0872   0.0233   0.0233    0.2036   0.2471
3         0.0145   0.0436   0.0901   0.1802   0.0814   0.0581    0.4679   0.7150
4         0.0087   0.0233   0.0262   0.0727   0.0523   0.0320    0.2152   0.9302
5         0.0029   0        0.0029   0.0349   0.0233   0.0058    0.0698   1.0000
fX(xi)    0.0406   0.1047   0.1686   0.3779   0.1861   0.1221
FX(xi)    0.0406   0.1453   0.3139   0.6918   0.8779   1.0000
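A sketch of the same computations in Python (numpy assumed; class midpoints are taken as the representative values of T and RH for the covariance and correlation, the usual convention for grouped data):

```python
import numpy as np

# joint counts n_ij: rows are the RH classes (j), columns the T classes (i)
n = np.array([[1,  4,  6,  1,  2,  1],
              [4,  9, 11, 30,  8,  8],
              [5, 15, 31, 62, 28, 20],
              [3,  8,  9, 25, 18, 11],
              [1,  0,  1, 12,  8,  2]])

N = n.sum()                          # 8 days x 43 years = 344
f = n / N                            # joint relative frequencies f_XY
fx = f.sum(axis=0)                   # marginal of T (sum over RH classes)
fy = f.sum(axis=1)                   # marginal of RH (sum over T classes)
print("FX:", np.cumsum(fx))          # cumulative marginal of T
print("FY:", np.cumsum(fy))          # cumulative marginal of RH

# independence check: f should equal the product of the marginals
print("independent?", np.allclose(f, np.outer(fy, fx), atol=1e-3))

# covariance and correlation, using class midpoints for T and RH
t_mid = np.array([-2.5, 2.5, 7.5, 12.5, 17.5, 22.5])
rh_mid = np.array([10.0, 30.0, 50.0, 70.0, 90.0])
mt, mrh = (fx * t_mid).sum(), (fy * rh_mid).sum()
cov = (f * np.outer(rh_mid - mrh, t_mid - mt)).sum()
st = np.sqrt((fx * (t_mid - mt) ** 2).sum())
srh = np.sqrt((fy * (rh_mid - mrh) ** 2).sum())
print("cov(T, RH):", cov, " corr(T, RH):", cov / (st * srh))
```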