EAS31116/B9036: Statistics in Earth & Atmospheric Sciences
Lecture 3: Probability Distributions (cont’d)
Instructor: Prof. Johnny Luo
www.sci.ccny.cuny.edu/~luo

Project Dates
Oct 2: 1-page abstract due
Nov 6: Progress report due
Dec 4: Project presentation
Dec 11: Project report due

Outline
1. Definition of Terms
2. Some Empirical & Exploratory Data Analysis
3. Parametric Distribution I: Discrete Distributions
4. Parametric Distribution II: Continuous Distributions
5. Assessments of the Goodness of Fit

Definition of Terms
Random Variable: a variable whose value is the outcome of a statistical experiment (i.e., uncertain and dependent on chance).
Probability Distribution: the probabilities (remember: a probability is a ratio) assigned to the values of a random variable.
Empirical Probability Distribution: simply describes what has been observed; an exploratory approach.
Parametric Probability Distribution: summarizes the observed probability distribution using a particular mathematical form.

Some Empirical & Exploratory Data Analysis
Jan 1987 Ithaca precipitation and temperature data

Describe data in n-quantiles
2-quantiles: Median
3-quantiles: Terciles
4-quantiles: Quartiles
…
100-quantiles: Percentiles
Step 1: Rank the data in ascending (or descending) order.
Step 2: Find the cutoff values that split the ranked data into equal-size subgroups.

Box-and-whisker plots

Histograms
Histograms of the Jan max temperature in Ithaca. In Matlab, the function hist plots a histogram of a data array.

Discrete Distribution I: Binomial Distribution
Setting: a sequence of N independent yes/no (or head/tail) experiments, each with the same probability of yes. Usually we use 1 to represent yes and 0 to represent no.
Random Variable (X): the number of yes outcomes (or heads) in a sequence of N trials.
If you flip a coin 3 times, what are the possible values of X? X = 0, 1, 2, 3.
Think-Pair-Share: For N = 3, what are the probabilities (remember: a probability is a ratio) of these four possible outcomes?
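One way to check an answer to the question above is to enumerate every possible sequence of flips. The following is a minimal Python sketch, assuming a fair coin so that all sequences are equally likely; the variable names are illustrative and not from the lecture.

```python
from collections import Counter
from fractions import Fraction
from itertools import product

N_FLIPS = 3  # number of coin flips (N in the lecture)

# Every possible sequence of flips, coded 1 = head, 0 = tail.
# Assumption: the coin is fair, so all 2**N sequences are equally likely.
sequences = list(product([0, 1], repeat=N_FLIPS))

# X = number of heads in a sequence; count how many sequences give each X.
heads_counts = Counter(sum(seq) for seq in sequences)

# Probability as a ratio: (sequences with X heads) / (all sequences).
probabilities = {
    x: Fraction(heads_counts[x], len(sequences)) for x in sorted(heads_counts)
}

for x, prob in probabilities.items():
    print(f"Pr{{X = {x}}} = {prob}")
```

Enumeration only scales to small N, since the number of sequences doubles with every extra flip.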
X = 0 (all 3 tails): 1/8
X = 1 (1 head/2 tails): 3/8
X = 2 (2 heads/1 tail): 3/8
X = 3 (all 3 heads): 1/8

The binomial distribution is a discrete parametric distribution with two parameters:
1) N (total number of trials)
2) p (probability of yes at each trial)

More generally,
$$\Pr\{X = x\} = \binom{N}{x}\, p^{x} (1-p)^{N-x}, \quad x = 0, 1, \ldots, N,$$
where $\binom{N}{x} = \frac{N!}{x!\,(N-x)!}$ (“N choose x”) is the number of different ways of distributing x successes in a sequence of N trials.

Application in Earth Sciences (220-yr record)

Q1: Compute the probability of the lake freezing next winter (or in any single winter in the future).
Step 1: Identify x and the two parameters of the binomial distribution: x = 1 (one freeze), N = 1 (only one future year), p = 10/220 ≈ 0.045.
Step 2: This is trivial! $\Pr\{X = 1\} = \frac{1!}{1!\,0!}\,(0.045)^{1}(1-0.045)^{1-1} = 0.045$

Q2: Compute the probability of the lake freezing exactly once in 10 years.
Step 1: Identify x and the two parameters of the binomial distribution: x = 1 (one freeze), N = 10, p = 10/220 ≈ 0.045.
Step 2: $\Pr\{X = 1\} = \frac{10!}{1!\,9!}\,(0.045)^{1}(1-0.045)^{10-1} = 0.30$

Q3: Compute the probability of the lake freezing at least once in 10 years.
Step 1: Look for the complement event, no freezing at all in 10 years: x = 0 (no freeze), N = 10, p = 10/220 ≈ 0.045.
Step 2: $\Pr\{X = 0\} = \frac{10!}{0!\,10!}\,(0.045)^{0}(1-0.045)^{10} = 0.63$, so $\Pr\{X \ge 1\} = 1 - 0.63 = 0.37$

[Figure: binomial distributions for a sequence of N yes/no experiments: flipping a coin 20 times, flipping a cheating coin 20 times, and flipping a coin 40 times. Source: Wikipedia]

Discrete Distribution II: Poisson Distribution
The Poisson distribution describes the probability of a given number of events occurring in a fixed interval of time.
Examples include the number of emails you receive each day and the number of tornadoes reported in New York State each year. (This is no longer a yes/no experiment.)
The Poisson distribution has only one parameter: μ (called the intensity; it happens to be the mean value). X is the random variable:
$$\Pr\{X = x\} = \frac{\mu^{x} e^{-\mu}}{x!}, \quad x = 0, 1, 2, \ldots$$

Consider the annual tornado counts in NYS for 1959–1988, in Table 4.3. During the 30 years covered by these data, 138 tornadoes were reported in New York State. The average, or mean, rate of tornado occurrence is 138/30 = 4.6 per year.
The Poisson distribution fits the data fairly well (we will learn how to do the fitting later in class).

Expected Value of a Random Variable
The expected value of a random variable, or of a function of a random variable, is simply the probability-weighted average of that variable or function.
Expected value: $E[X] = \sum_{x} x \,\Pr\{X = x\}$
Variance: $\mathrm{Var}[X] = \sum_{x} (x - E[X])^{2} \,\Pr\{X = x\}$
For example, flip a coin 3 times (N = 3, p = 0.5): E[X] = 0(1/8) + 1(3/8) + 2(3/8) + 3(1/8) = 1.5 (in between one head and two heads).

Parametric Distribution II: Continuous Distributions
Probability Density Function (PDF): f(x). Analogous to a histogram. Probability is represented by the area under the curve.
Cumulative Distribution Function (CDF): $F(x) = \Pr\{X \le x\} = \int_{-\infty}^{x} f(t)\,dt$

Continuous Distribution I: Gaussian Distribution (aka Normal distribution)
Two parameters: μ (mean) and σ (standard deviation)
$$f(x) = \frac{1}{\sigma\sqrt{2\pi}} \exp\left[-\frac{(x-\mu)^{2}}{2\sigma^{2}}\right]$$
Why is the Gaussian distribution so popular?
Central Limit Theorem: as the sample size gets large, the sum (or average) of a set of independent observations will approximately follow a Gaussian distribution, largely regardless of the distribution of the individual observations.
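The theorem can be illustrated with a small simulation. The sketch below is an illustration only, not part of the lecture: it assumes exponentially distributed observations (strongly right-skewed, with mean 1 and standard deviation 1), a fixed random seed, and a hypothetical helper name. It checks how often the standardized sample mean falls within one standard deviation of zero, which is about 68% for a Gaussian.

```python
import random

random.seed(0)  # fixed seed so the sketch is repeatable


def fraction_within_one_sigma(n_obs, n_samples=20000):
    """Fraction of standardized sample means with |z| <= 1.

    Each sample mean averages n_obs independent exponential(1) draws,
    whose population mean and standard deviation are both 1.
    """
    sigma_of_mean = 1.0 / n_obs ** 0.5  # std. dev. of the mean of n_obs draws
    hits = 0
    for _ in range(n_samples):
        sample_mean = sum(random.expovariate(1.0) for _ in range(n_obs)) / n_obs
        z = (sample_mean - 1.0) / sigma_of_mean
        if abs(z) <= 1.0:
            hits += 1
    return hits / n_samples


single = fraction_within_one_sigma(1)     # raw, skewed observations
averaged = fraction_within_one_sigma(30)  # means of 30 observations

print(f"n = 1:  fraction within one sigma = {single:.3f}")
print(f"n = 30: fraction within one sigma = {averaged:.3f}")
```

For single observations the fraction should sit well above the Gaussian value, while for averages of 30 observations it should be close to 0.68, even though the underlying data are far from Gaussian.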
Many quantities in natural science are the result of many factors superimposed (resembling the sum or average of these factors).
Histograms of the Jan max temperature in Ithaca already look somewhat Gaussian-like, although not exactly. If you plot the distribution of the mean Jan max temperature (i.e., use multiple years of data), it will look more Gaussian.

Standard Normal Distribution
Mean: 0, standard deviation: 1
Z-score (random variable): $z = (x - \mu)/\sigma$
Quantiles

[Figure: PDF and CDF of a normal distribution]

Q1: The mean Jan temperature in Ithaca is 22.2°F and σ is 4.4°F. In Jan 1987, the mean Jan temperature was 21.4°F. Assume it follows a Gaussian distribution. What is the probability that the mean Jan temperature is as cold as or colder than in Jan 1987?
$z = (21.4 - 22.2)/4.4 = -0.18$, so $\Pr\{Z \le -0.18\} \approx 0.43$ from the standard normal table.
What about z in the positive range? Use the symmetry of the standard normal distribution: $\Pr\{Z \le z\} = 1 - \Pr\{Z \le -z\}$.

Q2: The mean Jan temperature in Ithaca is 22.2°F and σ is 4.4°F. Assume it follows a Gaussian distribution. What is the probability that 20°F ≤ mean temperature ≤ 25°F?
$z_{20} = (20 - 22.2)/4.4 = -0.50$
$z_{25} = (25 - 22.2)/4.4 = 0.64$
$\Pr\{-0.50 \le Z \le 0.64\} = \Pr\{Z \le 0.64\} - \Pr\{Z \le -0.50\} \approx 0.739 - 0.309 = 0.43$

Continuous Distribution II: Gamma Distribution
Sometimes a variable is constrained by a physical limit on the left. For example, precipitation cannot be lower than zero, but it can (in theory) go to infinity. So the distribution is not Gaussian, but skewed to the right.
Random variable: x
Two parameters: 1) α: the shape parameter; 2) β: the scale parameter.
$$f(x) = \frac{(x/\beta)^{\alpha-1} \exp(-x/\beta)}{\beta\,\Gamma(\alpha)}, \quad x, \alpha, \beta > 0,$$
where Γ(α) is the gamma function.
Standard gamma distribution: rescale the variable as $\xi = x/\beta$, which leaves α as the only parameter.

Q1: Suppose Ithaca Jan precip follows the gamma distribution with α ≈ 4 and β = 0.52 inches.
For Jan 1987, the mean precip in Ithaca is 3.15 inches. Use the standard gamma distribution table to find the percentile value for the Jan 1987 precip.
Step 1: Standardize: $\xi = 3.15/0.52 = 6.06$
Step 2: For α ≈ 4, the standardized value of 6.06 falls between the cumulative probabilities of 0.80 and 0.90, so its cumulative probability is about 0.85 (roughly the 85th percentile).

Assessments of the Goodness of Fit
Superimpose the fitted Gaussian and gamma distribution curves on the raw histogram (Jan 1987 Ithaca precip). More will be covered later in class.

Difference between Binomial & Poisson distributions
Binomial: predicts the number of successes within a set number of trials.
Poisson: predicts the number of occurrences per unit time, space, …
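To make the contrast concrete, the following minimal Python sketch evaluates both probability functions with the numbers used earlier in the lecture: the lake-freezing record (p = 10/220, N = 10 years) for the binomial case and the New York State tornado rate (μ = 138/30 per year) for the Poisson case. The function names are illustrative, and the choice to compute the probability of a tornado-free year is an assumption made for the example, not a question posed in the lecture.

```python
from math import comb, exp, factorial


def binomial_pmf(x, n, p):
    """Pr{X = x}: x successes in n independent trials, success probability p."""
    return comb(n, x) * p**x * (1 - p) ** (n - x)


def poisson_pmf(x, mu):
    """Pr{X = x}: x events in one interval when the mean rate is mu."""
    return mu**x * exp(-mu) / factorial(x)


# Binomial: a fixed number of trials (10 winters), each a yes/no outcome.
p_freeze = 10 / 220
freeze_once = binomial_pmf(1, 10, p_freeze)
freeze_at_least_once = 1 - binomial_pmf(0, 10, p_freeze)

# Poisson: a count of events per unit time (tornadoes per year), no trial count.
mu_tornado = 138 / 30
no_tornado_year = poisson_pmf(0, mu_tornado)

print(f"Pr{{freeze exactly once in 10 yr}}  = {freeze_once:.2f}")
print(f"Pr{{freeze at least once in 10 yr}} = {freeze_at_least_once:.2f}")
print(f"Pr{{no tornadoes in a year}}        = {no_tornado_year:.3f}")
```

The two binomial results should round to the 0.30 and 0.37 obtained in the lake-freezing questions. Note that the binomial function needs both N and p, whereas the Poisson function needs only the mean rate μ.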