THE PRINCIPLES OF MONTE CARLO SIMULATION

Lecture Two: Probability Distributions

• Basic definitions
• CDF and PDF
• Quantiles
• Moments
• Parametric distributions

Basic definitions

• Statistics: Mathematical methods for collecting, organizing, and interpreting data, as well as drawing conclusions and making reasonable decisions based on such analysis.
• Population: A collection of a finite number of measurements, or a virtually infinite collection of data, about something of interest.
• Sample: A representative subset selected from the population. A good sample must reflect the essential features of the population from which it is drawn.
• Random sample: A sample in which each member of the population has an equal chance of being included.
• Sample space: The set of all possible outcomes of a chance experiment.

Basic definitions

• Inductive statistics (statistical inference): If the sample is representative, conclusions about the population can often be inferred. Such inference cannot be absolutely certain, so the language of probability is used for stating conclusions.
• Descriptive statistics: The phase of statistics that describes or analyses a given sample without inference about the population.
• Event of a sample space: A group of outcomes of the sample space whose members have some common characteristic.

Basic definitions

• Events can be defined as:
– Null or empty: an event that contains no outcomes of the sample space
– Union: consists of all outcomes that are in E1, in E2, or in both E1 and E2 (a)
– Intersection: consists of all outcomes that are in both E1 and E2 (b)
– Mutually exclusive: E1 and E2 cannot both occur; their intersection is the null event (c)
– Containment of one event by another: the intersection of E2 and E1 consists of all outcomes of E2 (d)
– Complement: consists of all outcomes not contained in E (e)

Basic definitions

• Statistically independent events: The occurrence of one event does not depend on the occurrence of the other events.
• Random variable: A way of representing any unsampled (or unknown) value z; the probability distribution of the random variable models the uncertainty about z. A variable can be continuous or discrete.
• Continuous variable: A variable that can assume any value between two given values: porosity, permeability, price.
• Discrete or categorical variable: Not continuous, e.g., lithofacies.
• Probability function of a random variable Z: A mathematical function that assigns a probability to each realization z of the random variable Z: P(Z = z).
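These definitions are easy to try numerically. The sketch below is a minimal illustration in plain Python (the two-dice example is my own choice, not from the slides): it draws a random sample of a discrete random variable and estimates its probability function P(Z = z) by the relative frequency of each outcome.

```python
import random
from collections import Counter

# Discrete random variable: the sum of two fair dice.
# P(Z = z) is estimated from a random sample by relative frequency.
n = 100_000
sample = [random.randint(1, 6) + random.randint(1, 6) for _ in range(n)]

counts = Counter(sample)
for z in sorted(counts):
    p_hat = counts[z] / n          # estimated P(Z = z)
    print(f"P(Z = {z}) ~ {p_hat:.4f}")

# Exact value for comparison: P(Z = 7) = 6/36 ~ 0.1667
```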
Cumulative Distribution Function: CDF

• The cumulative distribution function (cdf) is defined as:

F(z) = Prob{Z ≤ z} ∈ [0,1]

This gives the area under the pdf of the RV Z up to z, and is the probability that the RV Z is less than or equal to a threshold value z.
• The probability of exceeding any threshold value z can be written:

Prob{Z > z} = 1 − F(z)

• Properties of the cdf:
– F(z) is non-decreasing
– F(z) ∈ [0,1]
– F(−∞) = 0 and F(+∞) = 1

CDF and PDF

• The probability of Z occurring in an interval from a to b (where b > a) is the difference of the cdf values evaluated at b and a:

Prob{Z ∈ [a,b]} = F(b) − F(a)

• The probability density function (pdf) is the derivative of the cdf, where it is differentiable:

f(z) = F′(z) = lim_{dz→0} [F(z + dz) − F(z)] / dz

• The cdf can be obtained by integrating the pdf:

F(z) = ∫_{−∞}^{z} f(t) dt

• Properties of the pdf:
– f(z) ≥ 0
– ∫_{−∞}^{+∞} f(z) dz = 1

[Figure: an example cdf F(x), rising from 0 to 1]

Quantiles

• The p-quantile of the distribution F is the value z_p for which:

F(z_p) = Prob{Z ≤ z_p} = p ∈ [0,1]

• Thus, the quantile can be expressed through the inverse of the cdf:

q(p) = F⁻¹(p)

This inverse is used in Monte Carlo simulation.

Quantiles

• The lower quartile q(0.25), the median M = q(0.5), and the upper quartile q(0.75) are commonly used.
• The Interquartile Range (IR) is the difference between the upper and the lower quartiles:

IR = q(0.75) − q(0.25)

• Skewness sign: the sign of the difference between the mean and the median (m − M): positively skewed (a), negatively skewed (b), symmetric (c).

Quantiles

• Q-Q plot: used to compare two distributions F1 and F2:
– Choose a series of probability values p_k, k = 1, 2, …, K
– Plot q1(p_k) versus q2(p_k), k = 1, 2, …, K
– If all the points fall on the 45° line, the two distributions are exactly the same
– If the line is shifted from the 45° line, the two distributions have the same shape but different means
– If the slope of the line is not 45°, the two distributions have different variances
– If the Q-Q plot is non-linear, the two distributions have different histogram shapes

Moments

• The expected value is the probability-weighted sum of all possible occurrences of the RV Z:

E{Z} = m = Σ_{i=1}^{n} w_i z_i / Σ_{i=1}^{n} w_i

where:
E{Z} = expected value of Z
w_i = weight of the ith datum
n = number of data points
m = mean

• In the continuous case:

E{Z} = m = ∫_{−∞}^{+∞} z dF(z) = ∫_{−∞}^{+∞} z f(z) dz

• The variance of the RV Z is defined as the expected squared deviation of Z about its mean:

Var{Z} = σ² = E{[Z − m]²} = E{Z²} − m² ≥ 0

Variance

• In discrete form, the variance can be written as:

Var{Z} = Σ_{i=1}^{n} w_i (z_i − m)² / Σ_{i=1}^{n} w_i

• In continuous form:

Var{Z} = ∫_{−∞}^{+∞} (z − m)² dF(z) = ∫_{−∞}^{+∞} (z − m)² f(z) dz

• The variance is a measure of the spread of the data about the mean.
• The standard deviation (SD), the square root of the variance, is another measure of data variability about the mean.
• The dimensionless coefficient of variation (CV) is the ratio of the SD to the mean (SD/m).
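The quantile relation q(p) = F⁻¹(p) is exactly how Monte Carlo sampling works: draw a uniform random number p in [0,1] and read off F⁻¹(p). A minimal sketch in Python with NumPy (the exponential distribution and the parameter value a = 2 are my own illustrative choices):

```python
import numpy as np

rng = np.random.default_rng(seed=42)

# Exponential distribution with mean a: F(z) = 1 - exp(-z/a),
# so the quantile function is F^{-1}(p) = -a * ln(1 - p).
a = 2.0
p = rng.uniform(0.0, 1.0, size=100_000)   # uniform [0,1] random numbers
z = -a * np.log(1.0 - p)                  # inverse-cdf (quantile) transform

# Sample moments should approach the theoretical mean a and variance a^2.
print("sample mean    :", z.mean())             # ~ 2.0
print("sample variance:", z.var())              # ~ 4.0
print("sample median  :", np.quantile(z, 0.5))  # ~ 0.69 * a
```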
Bivariate Statistics

• Extension to bivariate and higher-order distributions: let X and Y be RVs. The joint cdf of X and Y, F_XY(x,y), is defined as:

F_XY(x,y) = Prob{X ≤ x and Y ≤ y}

• The pdf of a bivariate distribution, f_XY, is:

f_XY(x,y) = ∂²F_XY(x,y) / ∂x∂y

• The second-order moment of a bivariate distribution is the covariance. The covariance between the two variables is defined as:

Cov{X,Y} = E{[X − m_X][Y − m_Y]} = E{XY} − m_X m_Y
         = ∫_{−∞}^{+∞} ∫_{−∞}^{+∞} (x − m_X)(y − m_Y) f_XY(x,y) dx dy

Correlation

• The covariance of a variable with itself is its variance: Cov{X,X} = Var{X}; Cov{Y,Y} = Var{Y}.
• The correlation coefficient is a measure of the linear dependence between the two variables:

ρ_XY = Cov{X,Y} / √(Var{X}·Var{Y}) ∈ [−1,+1]

• A correlation of ρ_XY = 1 implies that X and Y are perfectly correlated.
• Independence between the two variables implies a zero correlation coefficient, i.e., ρ_XY = 0. However, the reverse statement is not always true: zero correlation does not imply independence between the two variables.
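The last point is worth seeing numerically. In this sketch (NumPy; the Y = X² construction is a standard textbook counterexample, my own choice here), Y is completely determined by X, yet the correlation coefficient is essentially zero:

```python
import numpy as np

rng = np.random.default_rng(seed=0)

# X symmetric about zero; Y is a deterministic function of X,
# so the two variables are strongly dependent.
x = rng.normal(0.0, 1.0, size=200_000)
y = x**2

# Cov{X,Y} = E{XY} - mX*mY; for Y = X^2 with symmetric X this is E{X^3} = 0.
cov = (x * y).mean() - x.mean() * y.mean()
rho = cov / np.sqrt(x.var() * y.var())

print("correlation:", rho)   # ~ 0: no linear dependence, yet not independent
```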
High Order Moments

• The first moment is the mean and the second central moment is the variance; the third central moment relates to skewness and the fourth to kurtosis, …
• The nth moment of a RV Z about the mean m, also called the nth central moment, is defined as:

m_n = E{(Z − m)ⁿ}, n = 0, 1, 2, …

• The nth moment of Z about the origin is defined as:

m′_n = E{Zⁿ}

These are the non-centered moments.

Entropy

• A measure of local uncertainty, not specific to a particular interval [a,b].
• The entropy of the local probability density function is defined as:

H(u) = −∫_{−∞}^{+∞} [ln f(u; z(n))] · f(u; z(n)) dz

where f(u; z(n)) is a conditional pdf, and all zero pdf values are excluded from the integral.
• If the entropy decreases, uncertainty decreases, since the probability distribution tends toward a single value (or a few values).

Parametric distribution

• A parametric distribution model is an analytical expression for the probability as a function of the variable value, e.g., the normal or Gaussian distribution model:

p = (1/(σ√(2π))) · exp(−(z − m)²/(2σ²))

with parameters m and σ that control the center and spread of the bell-shaped normal distribution.
• Parametric models sometimes relate to an underlying theory, e.g., the normal distribution is the limit distribution of the central limit theorem.
• In general, there is no need to assume a parametric distribution model; there are often enough data to infer the shape of the distribution non-parametrically.
• It is possible to transform any univariate distribution into any other univariate distribution.
• We can smooth any distribution that is not well resolved by the available data.

Uniform distribution

• pdf:

f(z) = 1/(b − a) for all z ∈ [a,b]; 0 otherwise

• cdf:

F(z) = ∫_{−∞}^{z} f(t) dt = 0 for z ≤ a; (z − a)/(b − a) for z ∈ [a,b]; 1 for z ≥ b

• Moments:

E{Z} = (a + b)/2 = m = median
E{Z²} = (1/(b − a)) ∫_a^b z² dz = (a² + ab + b²)/3
σ² = Var{Z} = E{Z²} − m² = (b − a)²/12

• The uniform distribution on [0,1] is the distribution of random numbers: mean 0.5, variance 1/12, and cdf F(z) = z.

Dirac distribution

• z = a (constant, no uncertainty)
• cdf:

F(z) = 1 if z ≥ a; 0 if not

• pdf:

f(z) = 0 for all z ≠ a; undefined at z = a

• Moments:

E{Z} = m = a, σ² = 0

Exponential distribution

• cdf:

F(z) = 1 − e^(−z/a), for all z ≥ 0, with a = 1/λ (λ the rate parameter)

• pdf:

f(z) = (1/a) e^(−z/a), for all z ≥ 0

• Moments:

Mean = a
Variance = a²
Median = a ln 2 ≈ 0.69a

Normal (Gaussian) distribution

• The Gaussian distribution is fully characterized by its two parameters, the mean m and the standard deviation σ:

g(z) = (1/(σ√(2π))) · exp(−(1/2)((z − m)/σ)²)

• The standard normal pdf has a mean of zero and a standard deviation of one:

g₀(z) = (1/√(2π)) · exp(−z²/2)

• The cdf of the Gaussian distribution G(z) has no closed-form analytical expression, but the standard normal cdf G₀(z) is well tabulated in the literature:

G₀(z) = ∫_{−∞}^{z} g₀(t) dt
G(z) = ∫_{−∞}^{z} g(t) dt = G₀((z − m)/σ)

• The Gaussian distribution has a characteristic symmetry:
– It is symmetric about its mean m; therefore the mean and the median are the same, and
– the pdf satisfies g(m + z) = g(m − z)

Lognormal distribution

• A positive RV, Y > 0, is said to be lognormally distributed if X = ln(Y) is normally distributed:

Y > 0 → logN(m, σ²) if X = ln Y → N(α, β²)

RVs that are lognormally distributed have characteristically skewed distributions.

Lognormal distribution

• Lognormal distributions are also characterized by two parameters: a mean and a variance. They can be characterized either by the arithmetic parameters (m and σ²) or by the logarithmic parameters (α and β²).
• The lognormal cdf and pdf are more easily expressed as functions of the logarithmic parameters:

F_Y(y) = Prob{Y ≤ y} = G₀((ln y − α)/β), for all y > 0
f_Y(y) = F′_Y(y) = (1/(βy)) · g₀((ln y − α)/β)

• The relations between the arithmetic and the logarithmic parameters are:

m = e^(α + β²/2)        σ² = m²[e^(β²) − 1]
α = ln m − β²/2         β² = ln(1 + σ²/m²)

Central Limit Theorem

• Theorem: The sum of a great number of independent, identically distributed (not necessarily Gaussian) standardized random variables tends to be normally distributed. That is, if n RVs Z_i have the same cdf and zero mean, the average

m̂ = (1/n) Σ_{i=1}^{n} x_i → Normal

tends toward a normal cdf as n increases towards infinity, with

E{m̂} = E{X} = m and Var{m̂} = (1/n) Var{X} = σ²/n

• Corollary: The product of a great number of independent, identically distributed RVs tends to be lognormally distributed:

α̂ = (1/n) Σ_{i=1}^{n} log x_i → Normal

with E{α̂} = E{log X} = α and Var{α̂} = (1/n) Var{log X} = β²/n
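The theorem is easy to check by simulation. A minimal sketch (NumPy; the uniform summands and the sample sizes are my own illustrative choices): averages of n uniform variables concentrate around the mean, with a variance shrinking like Var{X}/n, and their histogram approaches a bell shape.

```python
import numpy as np

rng = np.random.default_rng(seed=1)

# Uniform [0,1] summands: mean 0.5, variance 1/12 (see the uniform slide).
m, var = 0.5, 1.0 / 12.0

for n in (1, 10, 100, 1000):
    # 10,000 replications of the average of n iid uniforms.
    m_hat = rng.uniform(0.0, 1.0, size=(10_000, n)).mean(axis=1)
    print(f"n={n:5d}  mean={m_hat.mean():.4f}  "
          f"var={m_hat.var():.6f}  (theory {var / n:.6f})")
# The variance of the average decays as Var{X}/n, as the theorem states.
```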
Combination of distributions

• A combination of distributions results in a new distribution.
• F(x) = Σ_k λ_k F′_k(x) is a distribution model if the F′_k(x) are distribution functions and Σ_k λ_k = 1, λ_k ≥ 0 for all k.
• For example, F(x) = p Δ₀(x) + (1 − p) F₁(x) is a mixture of a spike at zero and a positive distribution.
• Disjunctive coding of an experimental histogram: given data x1, x2, …, xn and the rank-ordered data x(1), x(2), …, x(n), the experimental cumulative distribution is

F̂(x) = (1/n) Σ_{i=1}^{n} Δ_{x(i)}(x)

i.e., a sum of n Dirac distributions with parameters x(i) and equal amplitude 1/n.

Non-parametric distribution

• The term "non-parametric model" is misleading in that it could be interpreted as characterizing an RF model that has no parameters.
• All RF models imply a full multivariate distribution as characterized by the set of distributions, but some RF models have more free parameters than others.
• The more free parameters that can be fitted confidently to data, the more flexible the model.
• Unfortunately, the fewer free parameters a model has, the easier it is to handle (less inference and less computation), but the more inflexible it is:
– The Dirac distribution is a one-free-parameter model (Z constant)
– The Poisson-exponential model is fully determined by its mean
– The Gaussian distribution is a two-free-parameter model
• Parameter-rich models versus parameter-poor models.

Experimental and sample statistics

• Histogram:
– A counting of samples in classes
– Sometimes two scales are needed to show the details (use trimming limits)
– A logarithmic scale can be useful
– The mean is sensitive to outliers
– The median is sensitive to gaps in the middle of a distribution
– Locate the distribution by selected quantiles (e.g., quartiles)
– Spread is measured by the standard deviation (very sensitive to extreme values)
• Cumulative histogram:
– Useful to see all of the data values on one plot
– Useful for isolating statistical populations
– May be used to check distribution models: a straight line on an arithmetic scale indicates a normal distribution; a straight line on a logarithmic scale indicates a lognormal distribution
– Possible to transform the data to reproduce perfectly any univariate distribution

Distribution smoothing

• Sample histograms and scattergrams tend to be erratic when there are few data. Sawtooth-like fluctuations are usually not representative of the underlying population, and they disappear as the sample size increases.
• Smoothing a distribution not only removes such fluctuations; it also allows increasing the class resolution and extending the distribution(s) beyond the sample minimum and maximum values.
• More flexible smoothing techniques (quadratic programming or annealing approaches) have been applied to smooth histograms and scattergrams while honoring the sample statistics.
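The slides mention quadratic-programming and annealing smoothers; a simpler and widely used alternative is kernel smoothing, sketched below (NumPy; the bandwidth and the sample are my own illustrative choices, not from the slides). Each datum is replaced by a small Gaussian bump, turning the sum-of-Diracs experimental distribution into a smooth pdf that extends beyond the sample minimum and maximum.

```python
import numpy as np

rng = np.random.default_rng(seed=3)
data = rng.lognormal(mean=0.0, sigma=0.5, size=50)   # a small, erratic sample

def smoothed_pdf(z, data, h=0.2):
    """Gaussian kernel density estimate: average of one bump per datum."""
    u = (z[:, None] - data[None, :]) / h
    bumps = np.exp(-0.5 * u**2) / np.sqrt(2.0 * np.pi)
    return bumps.mean(axis=1) / h

# Evaluate the smooth pdf on a grid extending past the sample range.
grid = np.linspace(data.min() - 1.0, data.max() + 1.0, 200)
pdf = smoothed_pdf(grid, data)

# The smoothed pdf still integrates to ~1 (a property any pdf must honor).
dz = grid[1] - grid[0]
print("integral:", (pdf * dz).sum())
```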
Extreme Values - Outliers

• Extreme values: a few very small or very large values may strongly affect summary statistics such as the mean or variance of the data, the linear correlation coefficient, or measures of spatial continuity.
• Such extreme values can be handled as follows:
1. Declare the extreme values erroneous and remove them
2. Classify the extreme values into a separate statistical population
3. Use robust statistics, which are less sensitive to extreme values: median, rank correlation coefficient, …
4. Transform the data to reduce the influence of extreme values
• Outliers: observations whose values are out of line with the rest of the data.
• Outliers can create a difficult situation in a regression equation because they have a disproportionate effect on the estimated values of the regression coefficients.
• An outlier observation should only be removed with extreme care, because it may actually supply unique information about the response.

Identification of statistical population

• By histogram: the histogram has distinct modes
• By Q-Q plot: to compare two distributions
• By hypothesis tests: to see whether or not two populations have the same mean, variance, … (t-test, F-test, …)

Review of Main Points

• Probability is a language of uncertainty
• The CDF is always defined and is used in Monte Carlo simulation
• The PDF is the derivative of the CDF
• A quantile is the z-value at a particular F(z)
• The mean and variance are "moments"
• A parametric distribution has F(z)/f(z) defined by an equation
• A non-parametric distribution is defined by the data values
• Experimental statistics are used to clean and understand the data, as well as to build statistical models
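To close the loop between the experimental cdf and Monte Carlo simulation: because the experimental distribution is a sum of n Dirac spikes of amplitude 1/n, its quantile function q(p) = F̂⁻¹(p) simply returns an order statistic, so sampling from the non-parametric model amounts to drawing data values with equal probability. A minimal sketch (NumPy; the data array is an invented illustration):

```python
import numpy as np

rng = np.random.default_rng(seed=7)

# Invented sample playing the role of x1, ..., xn.
data = np.array([0.7, 1.2, 1.9, 2.3, 3.1, 4.8, 6.0])
x_sorted = np.sort(data)        # rank-ordered data x(1) <= ... <= x(n)
n = len(x_sorted)

def q_hat(p):
    """Quantile function of the experimental cdf F^(x) = (1/n) sum of Diracs."""
    # F^ jumps by 1/n at each x(i), so F^{-1}(p) is the ceil(p*n)-th order statistic.
    i = np.clip(np.ceil(p * n).astype(int), 1, n) - 1
    return x_sorted[i]

# Monte Carlo simulation: uniform p in [0,1] mapped through the inverse cdf.
p = rng.uniform(0.0, 1.0, size=100_000)
z = q_hat(p)

print("sample mean:", z.mean(), " data mean:", data.mean())  # should agree closely
```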