THE PRINCIPLES OF MONTE CARLO SIMULATION
Lecture Two:
Probability Distributions
• Basic definitions
• CDF and PDF
• Quantiles
• Moments
• Parametric distributions
Basic definitions
• Statistics:
Mathematical methods for collecting, organizing, and interpreting data, as
well as drawing conclusions and making reasonable decisions based on
such analysis.
• Population:
A collection of a finite number of measurements, or a virtually infinite collection of data, about something of interest.
• Sample:
Representative subset selected from the population. A good sample must
reflect the essential features of the population from which it is drawn.
• Random sample:
A sample in which each member of a population has an equal chance of
being included in the sample.
• Sample space:
A set of all possible outcomes of a chance experiment.
Basic definitions
• Inductive statistics (statistical inference):
If the sample is representative, conclusions about the population
can often be inferred.
Such inference cannot be absolutely certain, so the language of
probability is used for stating conclusions.
• Descriptive statistics:
A phase of statistics that describes or analyses a given sample
without inference about the population.
• Event of a sample space:
A group of outcomes of the sample space whose members have
some common characteristic.
Basic definitions
• Events can be defined as:
– Null or empty: event that contains no outcomes of the sample space
– Union: consists of all possible outcomes that are either in E1 or in E2 or in
both E1 and E2 (a)
– Intersection: consists of all outcomes that are in both E1 and E2 (b)
– Mutually exclusive: E1 and E2 cannot both occur; their intersection is the null event (c)
– Containment of an event by another: intersection of E2 and E1 consists of all
possible outcomes of E2 (d)
– Complement: consists of all outcomes not contained in E (e)
Basic definitions
• Statistically independent events:
The occurrence of one event does not depend on the occurrence of other
events.
• Random variable:
A representation of any unsampled (or unknown) value z whose probability
distribution models the uncertainty about z. A variable can be
continuous or discrete.
• Continuous variable:
A variable that can assume any value between two given values: porosity,
permeability, price
• Discrete or categorical variable:
Not continuous, e.g., lithofacies
• Probability function of a random variable Z:
Mathematical function that assigns a probability to each realization z of
the random variable Z: P(Z = z)
Cumulative Distribution Function:
CDF
• The cumulative distribution function (cdf) is defined as:
F(z) = Prob{Z ≤ z} ∈ [0,1]
This gives the area under the pdf of the RV Z up to z, i.e., the
probability that the RV Z is less than or equal to a threshold value z.
• The probability of exceeding any of the threshold values of
z can be written:
Prob{Z > z} = 1 − F(z)
• Properties of the cdf:
– F(z) is non-decreasing
– F(z) ∈ [0,1]
– F(−∞) = 0 and F(∞) = 1
CDF and PDF
• The probability of Z occurring in an interval from a to b (where b>a)
is the difference in the cdf values evaluated at points b and a:
Prob{Z ∈ [a, b]} = F(b) − F(a)
• The probability density function (pdf) is the derivative of the cdf, if it is
differentiable:
f(z) = F′(z) = lim_{dz→0} [F(z + dz) − F(z)] / dz
• The cdf can be obtained from integrating the pdf:
F(z) = ∫_{−∞}^{z} f(t) dt
CDF and PDF
• Properties of the pdf:
– f(z) ≥ 0
– ∫_{−∞}^{∞} f(z) dz = 1
[Figure: a pdf f(z) with its cdf F(x) rising from 0.0 to 1.0]
Quantiles
• The p-quantile of the distribution F is the value zp for
which:
F(zp) = Prob{Z ≤ zp} = p ∈ [0,1]
Thus, the quantile can be expressed as the inverse of the cdf:
q(p) = F⁻¹(p)
This inverse is what is used in Monte Carlo simulation.
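Because q(p) = F⁻¹(p), drawing p uniformly on [0,1] and evaluating the quantile function yields samples distributed according to F; this inverse-transform trick is the core of Monte Carlo simulation. A minimal sketch (Python, not part of the lecture), inverting a tabulated cdf by interpolation:

```python
import numpy as np

rng = np.random.default_rng(0)

# Any tabulated, non-decreasing cdf works; here a standard normal cdf
# built numerically on a grid as an example.
z = np.linspace(-6.0, 6.0, 2001)
f = np.exp(-z**2 / 2) / np.sqrt(2 * np.pi)
F = np.cumsum(f) * (z[1] - z[0])
F /= F[-1]  # normalize so the table ends exactly at 1

# Inverse transform: p ~ Uniform(0,1), z_p = q(p) = F^{-1}(p),
# obtained by interpolating z as a function of F.
p = rng.uniform(size=100_000)
samples = np.interp(p, F, z)

print(samples.mean(), samples.std())  # ~0 and ~1 for the standard normal
```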
Quantiles
• Lower quartile q(0.25), the median (M) q(0.5), and the
upper quartile values q(0.75) are commonly used
• Interquartile Range (IR) is the difference between the
upper and the lower quartiles: IR = q(0.75) – q(0.25)
• Skewness sign: sign of the difference between the mean
and the median (m-M): positively skewed (a), negatively
skewed (b), symmetric (c)
Quantiles
• Q-Q plot: to compare two distributions F1 and F2
– Choose a series of probability values pk, k = 1, 2, …, K
– Plot q1(pk) versus q2(pk), k = 1, 2, …, K
– If all the points fall on the 45° line, the two distributions are exactly the same
– If the line is parallel to but shifted from the 45° line, the two distributions have the same shape but different means
– If the slope of the line is not 45°, the two distributions have different variances
– If the Q-Q plot is non-linear, the two distributions have different histogram shapes
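A minimal sketch of this construction (Python, not part of the lecture): compute matching quantiles of two samples and summarize the line they fall on.

```python
import numpy as np

rng = np.random.default_rng(1)

# Two samples whose distributions are to be compared.
x1 = rng.normal(loc=0.0, scale=1.0, size=5000)
x2 = rng.normal(loc=2.0, scale=1.0, size=5000)  # same shape, shifted mean

# Choose a series of probability values p_k and compute matching quantiles.
p = np.linspace(0.01, 0.99, 99)
q1 = np.quantile(x1, p)
q2 = np.quantile(x2, p)

# On a Q-Q plot, q2 vs. q1 here forms a line parallel to the 45-degree
# line: slope ~1 (same variance), intercept ~2 (different means).
slope, intercept = np.polyfit(q1, q2, 1)
print(f"slope ~ {slope:.2f}, intercept ~ {intercept:.2f}")
```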
Moments
• The expected value is the probability-weighted sum of all the possible
occurrences of the RV Z:
E{Z} = m = Σ_{i=1}^{n} wi zi / Σ_{i=1}^{n} wi
where:
E{Z} = expected value of Z
wi = weight of the ith data
n = number of data points
m = mean
In the continuous case:
E{Z} = m = ∫_{−∞}^{+∞} z dF(z) = ∫_{−∞}^{+∞} z f(z) dz
• The variance of the RV Z is defined as the expected squared deviation
of Z about its mean:
Var{Z} = σ² = E{[Z − m]²} = E{Z²} − m² ≥ 0
Variance
• In discrete form, the variance can be defined as
Var{Z} = Σ_{i=1}^{n} wi (zi − m)² / Σ_{i=1}^{n} wi
• In continuous form, it can be written as
Var{Z} = ∫_{−∞}^{+∞} (z − m)² dF(z) = ∫_{−∞}^{+∞} (z − m)² f(z) dz
• The variance is a measure of the spread of the data from the mean.
• The standard deviation (SD), which is the square root of the variance,
is also one of the measures of data variability from the mean.
• The dimensionless coefficient of variation (CV) is the ratio of the SD
to the mean (SD/m).
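The discrete formulas translate directly into code. A small sketch (Python, not part of the lecture) computing the weighted mean, variance, SD, and CV of a data set; the data values and weights are hypothetical:

```python
import numpy as np

z = np.array([2.1, 3.4, 1.8, 5.0, 2.7])  # data values (hypothetical)
w = np.array([1.0, 1.0, 2.0, 0.5, 1.0])  # weights (hypothetical)

# Weighted mean: m = sum(w_i * z_i) / sum(w_i)
m = np.sum(w * z) / np.sum(w)

# Weighted variance: expected squared deviation about the mean.
var = np.sum(w * (z - m) ** 2) / np.sum(w)

sd = np.sqrt(var)  # standard deviation
cv = sd / m        # dimensionless coefficient of variation

print(f"mean={m:.3f}, var={var:.3f}, SD={sd:.3f}, CV={cv:.3f}")
```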
Bivariate Statistics
Extension to bivariate and higher order distributions:
Let X and Y be RVs. The cdf of X and Y, FXY(x,y), is defined as:
FXY(x, y) = Prob{X ≤ x and Y ≤ y}
The pdf of a bivariate distribution, fXY, is:
fXY(x, y) = ∂²FXY(x, y) / ∂x∂y
The second-order moment of a bivariate distribution is the covariance. The
covariance between the two variables is defined as:
Cov{X, Y} = E{[X − mX][Y − mY]} = E{XY} − mX mY
= ∫_{−∞}^{+∞} ∫_{−∞}^{+∞} (x − mX)(y − mY) fXY(x, y) dx dy
Correlation
• The covariance of a variable with itself is its variance:
Cov{X,X} = Var{X}; Cov{Y,Y} = Var{Y}
• The correlation coefficient is a measure of the linear dependence
between the two variables
ρXY = Cov{X, Y} / √(Var{X} Var{Y}) ∈ [−1, +1]
• A correlation of ρXY = 1 implies that X and Y are perfectly
correlated.
• If the two variables are independent, the correlation coefficient is
zero, i.e., ρXY = 0. However, the reverse statement is not always true:
zero correlation does not imply independence between the two variables.
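A classic illustration of that last point (a sketch in Python, not part of the lecture): with X symmetric about zero and Y = X², Y is completely determined by X, yet the linear correlation is essentially zero.

```python
import numpy as np

rng = np.random.default_rng(2)

# X symmetric about zero; Y is a deterministic function of X.
x = rng.normal(size=100_000)
y = x**2

# Sample correlation coefficient: Cov{X,Y} / sqrt(Var{X} Var{Y}).
cov = np.mean((x - x.mean()) * (y - y.mean()))
rho = cov / np.sqrt(x.var() * y.var())

print(f"rho ~ {rho:.4f}")  # ~0, even though Y depends entirely on X
```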
High Order Moments
• The first moment is the mean; the second moment about the mean is
the variance; the third central moment measures skewness, the fourth kurtosis, …
• The nth moment of a RV Z about the mean m, also called the nth central
moment, is defined as
mn = E{(Z − m)^n}
where n = 0, 1, 2, ….
• The nth moment of Z about the origin is defined as
m′n = E{Z^n}
These are the non-centered moments.
Entropy
• A measure of local uncertainty, not specific to a particular
interval [a,b]
• The entropy of the local probability density function is defined
as
H(u) = −∫_{−∞}^{∞} [ln f(u; z(n))] · f(u; z(n)) dz
where f(u; z(n)) is a conditional pdf, and all zero pdf values are excluded from
the integral.
• If entropy decreases, uncertainty decreases since the
probability distribution tends toward a single value (or few
values).
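A minimal numerical sketch of this definition (Python, not part of the lecture): discretize a pdf, exclude the zero values, and integrate −ln(f)·f. A narrow, near-single-valued pdf gives lower entropy than a spread-out one, matching the remark above.

```python
import numpy as np

def entropy(z, f):
    """H = -integral of ln(f(z)) * f(z) dz, zero pdf values excluded."""
    dz = z[1] - z[0]
    mask = f > 0  # exclude zero pdf values from the integral
    return -np.sum(np.log(f[mask]) * f[mask]) * dz

def gaussian(z, m, s):
    return np.exp(-((z - m) / s) ** 2 / 2) / (s * np.sqrt(2 * np.pi))

z = np.linspace(-10, 10, 4001)

# A tight distribution (little uncertainty) vs. a spread-out one.
print(entropy(z, gaussian(z, 0.0, 0.1)))  # lower entropy
print(entropy(z, gaussian(z, 0.0, 2.0)))  # higher entropy
```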
Parametric distribution
• A parametric distribution model is an analytical expression for the
probability given the variable value, e.g., the normal or Gaussian
distribution model:
p = (1/(σ√(2π))) e^{−(z − m)²/(2σ²)}
with parameters m and σ that control the center and spread of the bell-shaped normal distribution.
• Parametric models sometimes relate to an underlying theory
– e.g., the normal distribution is the limit distribution for the central limit
theorem
• In general, there is no need to assume a parametric distribution model;
there are often enough data to infer the shape of the distribution non-parametrically
• It is possible to transform any univariate distribution to any other
univariate distribution
• We can smooth any distribution that is not well resolved by the
available data
Uniform distribution
• pdf:
f(z) = 1/(b − a), ∀z ∈ [a, b]
f(z) = 0, otherwise
• cdf:
F(z) = ∫_{−∞}^{z} f(t) dt =
0, ∀z ≤ a
(z − a)/(b − a), ∀z ∈ [a, b]
1, ∀z ≥ b
• Moments:
E{Z} = (a + b)/2 = m = median
E{Z²} = (1/(b − a)) ∫_a^b z² dz = (a² + ab + b²)/3
σ² = Var{Z} = E{Z²} − m² = (b − a)²/12
• The uniform distribution on [0,1] is the distribution of random
numbers: it has a mean of 0.5, a variance of 1/12, and the cdf
F(z) = z
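A one-line Monte Carlo check of these moment formulas (Python, not part of the lecture), using a hypothetical interval [2, 5]:

```python
import numpy as np

rng = np.random.default_rng(5)
a, b = 2.0, 5.0  # hypothetical interval

z = rng.uniform(a, b, size=200_000)
print(z.mean())       # ~(a + b)/2 = 3.5
print(z.var())        # ~(b - a)^2 / 12 = 0.75
print(np.mean(z**2))  # ~(a^2 + a*b + b^2)/3 = 13.0
```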
Dirac distribution
• z = a (constant, no uncertainty)
• cdf:
F(z) = 1, if z ≥ a
F(z) = 0, if not
• pdf:
f(z) = 0, ∀z ≠ a
f(z) undefined at z = a
• Moments:
E{Z} = m = a
σ² = 0
Exponential distribution
• cdf:
F(z) = 1 − e^{−z/a}, ∀z ≥ 0
where a = 1/λ is the mean (λ is the rate parameter)
• pdf:
f(z) = (1/a) e^{−z/a}, ∀z ≥ 0
• Moments:
Mean = a
Variance = a²
Median = a ln 2 ≈ 0.69a
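The exponential cdf inverts in closed form, so the inverse-transform recipe from the quantiles slide applies directly: q(p) = −a ln(1 − p). A small sketch (Python, not part of the lecture):

```python
import numpy as np

rng = np.random.default_rng(3)
a = 2.0  # mean of the exponential (hypothetical value)

# Inverse transform: p ~ Uniform(0,1), z = F^{-1}(p) = -a * ln(1 - p).
p = rng.uniform(size=200_000)
z = -a * np.log(1.0 - p)

print(z.mean())          # ~a
print(z.var())           # ~a^2
print(np.median(z) / a)  # ~ln 2 ~ 0.693
```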
Normal (Gaussian) distribution
• The Gaussian distribution is fully characterized by its two
parameters, the mean m and the standard deviation σ:
g(z) = (1/(σ√(2π))) exp[−(1/2)((z − m)/σ)²]
• The standard normal pdf has a mean of zero and a standard
deviation of one:
g₀(z) = (1/√(2π)) exp(−z²/2)
Normal (Gaussian) distribution
• The cdf of the Gaussian distribution, G(z), has no closed-form
analytical expression, but the standard normal cdf G₀(z) is well
tabulated in the literature:
G₀(z) = ∫_{−∞}^{z} g₀(t) dt
G(z) = ∫_{−∞}^{z} g(t) dt = G₀((z − m)/σ)
• The Gaussian distribution has characteristic symmetry:
– It is symmetric about its mean m; therefore, the mean and the
median are the same, and
– The pdf g(m+z) = g(m-z)
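In practice the printed tables are replaced by the error function, which standard math libraries provide. A sketch (Python, not part of the lecture), using the identity G₀(z) = ½[1 + erf(z/√2)]:

```python
import math

def std_normal_cdf(z: float) -> float:
    """Standard normal cdf: G0(z) = (1 + erf(z / sqrt(2))) / 2."""
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))

def normal_cdf(z: float, m: float, s: float) -> float:
    """General Gaussian cdf by standardization: G(z) = G0((z - m) / s)."""
    return std_normal_cdf((z - m) / s)

print(std_normal_cdf(0.0))          # 0.5 (symmetry about the mean)
print(std_normal_cdf(1.96))         # ~0.975
print(normal_cdf(12.0, 10.0, 2.0))  # ~0.841 (one SD above the mean)
```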
Lognormal distribution
• A positive RV, Y > 0, is said to be lognormally distributed
if X = ln(Y) is normally distributed
Y > 0 ~ logN(m, σ²), if X = ln Y ~ N(α, β²)
RVs that are lognormally distributed have characteristically skewed distributions.
Lognormal distribution
• Lognormal distributions are also characterized by two parameters, a
mean and a variance. However, they can be characterized by either the
arithmetic parameters (m and σ²) or the logarithmic parameters (α and β²).
• The lognormal cdf and pdf are more easily expressed as functions of
the logarithmic parameters:
FY(y) = Prob{Y ≤ y} = G₀((ln y − α)/β), for all y > 0
fY(y) = F′Y(y) = (1/(βy)) g₀((ln y − α)/β)
• The relations between the arithmetic and logarithmic
parameters are:
m = e^{α + β²/2}
σ² = m² [e^{β²} − 1]
α = ln m − β²/2
β² = ln(1 + σ²/m²)
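These conversions are easy to get wrong by hand, so a round-trip check is useful. A sketch (Python, not part of the lecture) implementing the four relations above; the function names and test values are hypothetical:

```python
import math

def log_params(m, var):
    """Arithmetic (m, sigma^2) -> logarithmic (alpha, beta^2)."""
    beta2 = math.log(1.0 + var / m**2)
    alpha = math.log(m) - beta2 / 2.0
    return alpha, beta2

def arith_params(alpha, beta2):
    """Logarithmic (alpha, beta^2) -> arithmetic (m, sigma^2)."""
    m = math.exp(alpha + beta2 / 2.0)
    var = m**2 * (math.exp(beta2) - 1.0)
    return m, var

alpha, beta2 = log_params(10.0, 25.0)
print(alpha, beta2)
print(arith_params(alpha, beta2))  # recovers (10.0, 25.0)
```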
Central Limit Theorem
• Theorem:
The sum of a great number of independent, identically distributed (not
necessarily Gaussian) standardized random variables tends to be
normally distributed; i.e., if n RVs Zi have the same cdf and zero
means, their standardized sum tends toward a normal cdf as n increases towards
infinity.
m̂ = (1/n) Σ_{i=1}^{n} xi
E{m̂} = E{X} = m
Var{m̂} = (1/n) Var{X} = σ²/n
→ Normal
• Corollary:
The product of a great number of independent, identically
distributed RVs tends to be lognormally distributed:
α̂ = (1/n) Σ_{i=1}^{n} log xi
E{α̂} = E{log X} = α
Var{α̂} = (1/n) Var{log X} = β²/n
→ Normal
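A quick Monte Carlo illustration of the theorem (Python, not part of the lecture): averages of n uniform draws, which are far from Gaussian individually, become approximately normal with variance σ²/n as n grows.

```python
import numpy as np

rng = np.random.default_rng(4)

n = 30            # terms per average
trials = 100_000  # number of averages

# Uniform(0,1): mean 0.5, variance 1/12.
x = rng.uniform(size=(trials, n))
m_hat = x.mean(axis=1)

print(m_hat.mean())          # ~0.5 : E{m_hat} = m
print(m_hat.var() * n * 12)  # ~1   : Var{m_hat} = sigma^2 / n = 1/(12n)

# Near-normality check: ~68% of the averages lie within one SD of the mean.
sd = m_hat.std()
print(np.mean(np.abs(m_hat - 0.5) < sd))  # ~0.68
```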
Combination of distributions
• Combination of distributions results in a new distribution.
• F(x) = Σk λk F′k(x) is a distribution model if the F′k(x) are distribution
functions and Σk λk = 1, λk ≥ 0, ∀k.
• For example, F(x) = p∆0(x) + (1 − p)F1(x) is a mixture of a spike at zero
and a positive distribution
• Disjunctive coding of an experimental histogram: given data x1, x2, …,
xn and the rank-ordered data x(1), x(2), …, x(n).
• The experimental cumulative distribution is
F̂(x) = (1/n) Σ_{i=1}^{n} ∆x(i)(x)
a sum of n Dirac distributions with parameters x(i) and equal amplitude 1/n.
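This empirical cdf is simple to implement: each datum contributes a step of height 1/n at its value. A sketch (Python, not part of the lecture; the data are hypothetical):

```python
import numpy as np

def ecdf(data, x):
    """Empirical cdf F_hat(x) = (1/n) * #{data <= x}: a sum of n Dirac
    (step) distributions of amplitude 1/n located at the sorted data."""
    xs = np.sort(data)
    return np.searchsorted(xs, x, side="right") / len(xs)

data = np.array([3.1, 0.7, 2.2, 4.8, 1.5])
x = np.array([0.0, 1.5, 2.5, 5.0])
print(ecdf(data, x))  # [0.  0.4 0.6 1. ]
```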
Non-parametric distribution
• The term "non-parametric model" is misleading in that it could be interpreted as
denoting a RF model that has no parameters
• All RF models imply a full multivariate distribution as characterized
by the set of distributions, but some RF models have more free
parameters than others
• The more free parameters that can be fitted confidently to data, the
more flexible the model
• Unfortunately, the fewer free parameters a model has, the easier it is to
handle (less inference and less computation) → inflexible models
– The Dirac distribution is a one-free-parameter model (Z constant)
– The Poisson/exponential model is fully determined by its mean
– The Gaussian distribution is a two-free-parameter model
• Parameter-rich models versus parameter-poor models
Experimental and sample statistics
• Histogram (frequency):
– A counting of samples in classes
– Sometimes two scales are needed to show the details (use trimming limits)
– A logarithmic scale can be useful
– Shows the sample statistics
– The mean is sensitive to outliers
– The median is sensitive to gaps in the middle of a distribution
– Locate the distribution by selected quantiles (e.g., quartiles)
– Spread is measured by the standard deviation (very sensitive to extreme values)
• Cumulative histogram (cumulative frequency, from 0 to 1):
– Useful to see all of the data values on one plot
– Useful for isolating statistical populations
– May be used to check distribution models:
- A straight line on an arithmetic scale → a normal distribution
- A straight line on a logarithmic scale → a lognormal distribution
– Possible to transform data to perfectly reproduce any univariate distribution
[Figure: a frequency histogram and the corresponding cumulative frequency plot]
Distribution smoothing
• Sample histograms and scattergrams tend to be erratic with few data.
• Sawtooth-like fluctuations are usually not representative of the underlying
population, and they disappear as the sample size increases.
• Smoothing a distribution not only removes such fluctuations; it also allows
increasing the class resolution and extending the distribution(s) beyond the
sample minimum and maximum values.
• More flexible smoothing techniques (quadratic programming or annealing
approaches) have been applied for smoothing histograms and scattergrams while
honoring the sample statistics.
Extreme Values-Outliers
• Extreme values: a few very small or very large values may strongly
affect summary statistics like the mean or variance of the data, the
linear correlation coefficient, or measures of spatial continuity
• Such extreme values can be handled as follows:
1. Declare the extreme values erroneous and remove them
2. Classify the extreme values into a separate statistical population
3. Use robust statistics, which are less sensitive to extreme values: the median, the rank
correlation coefficient, …
4. Transform the data to reduce the influence of extreme values
• Outliers: observations that have values out of line with the rest of the
data
• Outliers can create a difficult situation in a regression equation because
they have a disproportionate effect on the estimated values of the
regression coefficients
• An outlier observation should be removed only with extreme care because it
may actually supply unique information about the response
Identification of statistical
populations
• By histogram: the histogram has different modes
• By Q-Q plot: to compare two distributions
• By hypothesis tests: to see whether or not two populations
have the same means, variances, … (t-test, F-test, …)
Review of Main Points
• Probability as a language of uncertainty
• The CDF is always defined and is used in Monte Carlo simulation
• The PDF is the derivative of the CDF
• A quantile is a z-value at a particular F(z)
• Mean and variance are "moments"
• Parametric distribution: F(z)/f(z) defined by an equation
• Non-parametric distribution: defined by data values
• Experimental statistics are used to clean and understand
the data, as well as to build statistical models.