Highlights on Probability Distributions

Binomial Distribution
• It is one of the most widely used discrete probability distributions.
• It concerns a dichotomous (binary) random variable.
• It is based on the Bernoulli trial: a single trial of an experiment that can result in only one of two mutually exclusive outcomes (success or failure; dead or alive; sick or well; male or female).

Example:
• We are interested in determining whether a newborn infant will survive until his/her 70th birthday.
• Let Y represent the survival status of the child at age 70 years.
• Y = 1 if the child survives and Y = 0 if he/she does not.
• The outcomes are mutually exclusive.
• Suppose that 72% of infants born survive to age 70 years:
  P(Y = 1) = p = 0.72
  P(Y = 0) = 1 − p = 0.28

Characteristics of a Binomial Distribution
• The experiment consists of n identical trials.
• Only two outcomes are possible on each trial.
• The probability of A (success), denoted by p, remains the same from trial to trial.
• The probability of B (failure), denoted by q, is q = 1 − p.
• The trials are independent.
• n and p are the parameters of the binomial distribution.
• The mean is np and the variance is np(1 − p).

The Poisson Distribution
• We are observing a count or number of events, rather than a binary outcome for each subject or trial as in the binomial distribution.
• It is applicable to counts of events over a given interval of time, for example:
  – number of patients arriving at an emergency department in a day
  – number of new cases of HIV diagnosed at a clinic in a month
• In theory, the random variable X is a count that can assume any integer value greater than or equal to 0.

B. Continuous Probability Distributions
• A continuous random variable X can take on any value in a specified interval or range.
• With a large number of class intervals, the frequency polygon begins to resemble a smooth curve.
• The probability distribution of X is represented by a smooth curve called a probability density function, f(x).
[Figure: distribution of serum triglyceride]
• The area under the smooth curve is equal to 1.
• The area under the curve between any two points x1 and x2 is the probability that X takes a value between x1 and x2.
• Instead of assigning probabilities to specific outcomes of the random variable X, probabilities are assigned to ranges of values.

3. The Normal Distribution
• The normal distribution (ND) is the most important probability distribution in statistics.
• It is frequently called the "Gaussian distribution" or the bell-shaped curve.
• Variables such as blood pressure, weight, height, serum cholesterol level, and IQ score are approximately normally distributed.
• A random variable is said to have a normal distribution if it has a probability distribution that is symmetric and bell-shaped.
• In the ND, the "average" represents the true or normal value of the measurement, and deviations from it are errors.
• Small errors occur more frequently than large errors.
• The ND is vital to statistical work because it underlies most estimation procedures and hypothesis tests.
• The concept of the probability of "X = x" in a discrete probability distribution is replaced by the probability density function f(x).
• The ND is also an approximating distribution to other distributions (e.g., the binomial).

1. The mean µ tells you about location
  – Increase µ: location shifts right
  – Decrease µ: location shifts left
  – Shape is unchanged
2. The variance σ² tells you about the narrowness or flatness of the bell
  – Increase σ²: bell flattens; extreme values are more likely
  – Decrease σ²: bell narrows; extreme values are less likely
  – Location is unchanged

Properties of the Normal Distribution
1. It is symmetrical about its mean, µ.
2. The mean, the median and the mode are equal. It is unimodal.
3. The total area under the curve above the x-axis is 1 square unit.
4. The curve never touches the x-axis.
5. As the value of σ increases, the curve becomes flatter, and vice versa.
6. Perpendiculars at ±1 SD contain about 68%, at ±2 SD about 95%, and at ±3 SD about 99.7% of the area under the curve.
7. The distribution is completely determined by the parameters µ and σ.

• We have different normal distributions depending on the values of µ and σ².
• We cannot tabulate every possible distribution.
• Tabulated normal probability calculations are available only for the ND with µ = 0 and σ² = 1.

Standard Normal Distribution
• It is a normal distribution that has a mean equal to 0 and an SD equal to 1, and is denoted by N(0, 1).
• If a random variable X ~ N(µ, σ²), then we can transform it to a standard normal distribution (SND) with the help of the Z-transformation.
• These Z-scores can then be used to find the area (the probability) under the normal curve.
• We compute a standard score to transform a score from its original units into standard deviation units. The formula for standard scores is:
$$ Z = \frac{x - \mu}{\sigma} $$
• Z represents the Z-score for a given x value.
• A Z-score is the number of SDs that a given x value lies above or below the mean.
• The first standard score is a z-score for a population.
• A z-score specifies the precise location of each X value within a distribution.
• The sign of the z-score (+ or −) signifies whether the score is above the mean (positive) or below the mean (negative).
• The numerical value of the z-score specifies the distance from the mean by counting the number of standard deviation units between X and µ.
• The standard normal distribution has mean 0 and variance 1.
• Approximately 68% of the area under the standard normal curve lies between ±1, about 95% between ±2, and about 99.7% between ±3.

a) What is the probability that z < −1.96?
(1) Sketch a normal curve
(2) Draw a perpendicular line at z = −1.96
(3) Find the area in the table
(4) The answer is the area to the left of the line: P(z < −1.96) = 0.0250
b) What is the probability that −1.96 < z < 1.96?
• The answer is the area between the two values: P(−1.96 < z < 1.96) = 0.9750 − 0.0250 = 0.9500
c) What is the probability that z > 1.96?
• The answer is the area to the right of the line, found by subtracting the table value from 1.0000: P(z > 1.96) = 1.0000 − 0.9750 = 0.0250

Applications of the Normal Distribution
• The ND is used as a model to study many different variables.
• The ND can be used to answer probability questions about continuous random variables.
• Following the model of the ND, a given value of x must be converted to a z-score before it can be looked up in the z table.

Example:
• The diastolic blood pressures (DBP) of males 35–44 years of age are normally distributed with µ = 80 mm Hg and σ² = 144 mm Hg², so σ = 12 mm Hg.
• Therefore, a DBP of 80 + 12 = 92 mm Hg lies 1 SD above the mean.
• Suppose individuals with a DBP above 95 mm Hg are considered to be hypertensive.
a. What is the probability that a randomly selected male has a DBP above 95 mm Hg?
   Z = (95 − 80) / 12 = 1.25
   P(Z > 1.25) = 0.1056
• Approximately 10.6% of this population would be classified as hypertensive.
b. What is the probability that a randomly selected male has a DBP above 110 mm Hg?
   Z = (110 − 80) / 12 = 2.50
   P(Z > 2.50) = 0.0062
• Approximately 0.6% of the population has a DBP above 110 mm Hg.
c. What is the probability that a randomly selected male has a DBP below 60 mm Hg?
   Z = (60 − 80) / 12 = −1.67
   P(Z < −1.67) = 0.0475
• Approximately 4.8% of the population has a DBP below 60 mm Hg.

Sampling Distributions
• Sampling distributions are important in the understanding of statistical inference.

Definitions
• Parameter: a number that describes a population as a whole.
• Statistic: a number derived from a sample drawn from a specific population.
• In statistical practice, the value of a population parameter is not known. A statistic is used to estimate a parameter.
• The sampling distribution of a statistic is the distribution of all possible values of the statistic, computed from samples of the same size randomly drawn from the same population.
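To make the definition of a sampling distribution concrete, here is a minimal sketch that enumerates every sample and tabulates the sample means. It assumes the small four-value population (ages 18, 20, 22, 24) and samples of size 2 drawn with replacement that these slides work through; the variable names are illustrative only.

```python
from collections import Counter
from itertools import product
from statistics import mean, pstdev

population = [18, 20, 22, 24]   # ages from the slides' worked example
n = 2                           # sample size

# Every ordered sample of size n drawn with replacement (4**2 = 16 samples)
samples = list(product(population, repeat=n))
sample_means = [mean(s) for s in samples]

# Sampling distribution: each distinct sample mean with its relative frequency
counts = Counter(sample_means)
distribution = {m: c / len(samples) for m, c in sorted(counts.items())}

mean_of_means = mean(sample_means)     # centre of the sampling distribution
sd_of_means = pstdev(sample_means)     # its standard deviation (standard error)

print(len(samples))
print(distribution)
print(mean_of_means, round(sd_of_means, 2))
```

Full enumeration like this is only practical for very small populations, which is why the slides note that the construction is difficult for large populations.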
• When sampling a discrete, finite population, a sampling distribution can be constructed.
• Note that this construction is difficult with a large population and impossible with an infinite population.

Developing a Sampling Distribution
Assume there is a population …
• Population size N = 4
• Random variable, X, is the age of individuals
• Values of X: 18, 20, 22, 24 (years), for individuals A, B, C, D

Summary Measures for the Population Distribution:
$$ \mu = \frac{\sum X_i}{N} = \frac{18 + 20 + 22 + 24}{4} = 21 $$
$$ \sigma = \sqrt{\frac{\sum (X_i - \mu)^2}{N}} = 2.236 $$
[Figure: P(x) against x = 18, 20, 22, 24 (A, B, C, D); uniform distribution]

Now consider all possible samples of size n = 2. There are 16 possible samples (sampling with replacement):

1st Obs    2nd Observation
           18       20       22       24
18         18,18    18,20    18,22    18,24
20         20,18    20,20    20,22    20,24
22         22,18    22,20    22,22    22,24
24         24,18    24,20    24,22    24,24

16 Sample Means:

1st Obs    2nd Observation
           18    20    22    24
18         18    19    20    21
20         19    20    21    22
22         20    21    22    23
24         21    22    23    24

Sampling Distribution of All Sample Means
[Figure: P(X̄) against X̄ = 18, 19, 20, 21, 22, 23, 24; no longer uniform]

Summary Measures of this Sampling Distribution:
$$ \mu_{\bar X} = \frac{\sum \bar X_i}{N} = \frac{18 + 19 + \cdots + 24}{16} = 21 $$
$$ \sigma_{\bar X} = \sqrt{\frac{\sum (\bar X_i - \mu_{\bar X})^2}{N}} = \sqrt{\frac{(18 - 21)^2 + (19 - 21)^2 + \cdots + (24 - 21)^2}{16}} = 1.58 $$

Comparing the Population with its Sampling Distribution
• Population (N = 4): $\mu = 21$, $\sigma = 2.236$
• Sample means distribution (n = 2): $\mu_{\bar X} = 21$, $\sigma_{\bar X} = 1.58$
[Figure: population distribution P(X) over 18, 20, 22, 24 beside the sampling distribution P(X̄) over 18 to 24]

Standard Error of the Mean
• Different samples of the same size from the same population will yield different sample means.
• A measure of the variability in the mean from sample to sample is given by the standard error of the mean.
• The variance of the sampling distribution is not equal to the population variance.
• Note, however, that the variance of the sampling distribution is equal to the population variance divided by the size of the sample used to obtain the sampling distribution. In the example above, σ²/n = 5/2 = 2.5, and √2.5 = 1.58.
• Note that the standard error of the mean decreases as the sample size increases:
$$ \sigma_{\bar X} = \frac{\sigma}{\sqrt{n}} $$

If the Population is Normal
• If a population is normal with mean µ and standard deviation σ, the sampling distribution of X̄ is also normally distributed, with
$$ \mu_{\bar X} = \mu \quad \text{and} \quad \sigma_{\bar X} = \frac{\sigma}{\sqrt{n}} $$

Z-value for Sampling Distribution of the Mean
• Z-value for the sampling distribution of X̄:
$$ Z = \frac{\bar X - \mu_{\bar X}}{\sigma_{\bar X}} = \frac{\bar X - \mu}{\sigma / \sqrt{n}} $$
where:
  X̄ = sample mean
  µ = population mean
  σ = population standard deviation
  n = sample size

Sampling Distribution Properties
• $\mu_{\bar X} = \mu$ (i.e., X̄ is unbiased)
[Figure: normal population distribution and normal sampling distribution, which has the same mean]
• As n increases, $\sigma_{\bar X}$ decreases.
[Figure: sampling distributions for a larger and a smaller sample size]

If the Population is not Normal
• We can apply the Central Limit Theorem:
  – Even if the population is not normal,
  – … sample means from the population will be approximately normal as long as the sample size is large enough.
• Properties of the sampling distribution:
  – Central tendency: $\mu_{\bar X} = \mu$
  – Variation: $\sigma_{\bar X} = \sigma / \sqrt{n}$

Central Limit Theorem
• As the sample size gets large enough (n↑), the sampling distribution becomes almost normal regardless of the shape of the population.
[Figure: a non-normal population distribution and its sampling distributions for a smaller and a larger sample size; the sampling distribution becomes normal as n increases]

How Large is Large Enough?
• For most distributions, n > 30 will give a sampling distribution that is nearly normal.
• For fairly symmetric distributions, n > 15 is enough.
• For normal population distributions, the sampling distribution of the mean is always normally distributed.

Example
• Suppose a population has mean µ = 8 and standard deviation σ = 3. Suppose a random sample of size n = 36 is selected.
• What is the probability that the sample mean is between 7.8 and 8.2?

Solution:
• Even if the population is not normally distributed, the central limit theorem can be used (n > 30)
• … so the sampling distribution of X̄ is approximately normal
• … with mean $\mu_{\bar X} = 8$
• … and standard deviation
$$ \sigma_{\bar X} = \frac{\sigma}{\sqrt{n}} = \frac{3}{\sqrt{36}} = 0.5 $$

$$ P(7.8 < \bar X < 8.2) = P\left(\frac{7.8 - 8}{3 / \sqrt{36}} < \frac{\bar X - \mu}{\sigma / \sqrt{n}} < \frac{8.2 - 8}{3 / \sqrt{36}}\right) = P(-0.4 < Z < 0.4) = 0.3108 $$

[Figure: population distribution of unknown shape (µ = 8); sampling distribution of X̄ with 7.8, $\mu_{\bar X} = 8$ and 8.2 marked; standard normal distribution with −0.4, $\mu_Z = 0$ and 0.4 marked and areas .1554 + .1554]

• Up until this point, we have assumed that the values of the parameters of a probability distribution are known.
• In the real world, the values of these population parameters are usually not known.
• Instead, we must try to say something about the way in which a random variable is distributed using the information contained in a sample of observations.
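The sample-mean example above (µ = 8, σ = 3, n = 36) can be checked numerically. The following is a minimal sketch that uses Python's standard-library `statistics.NormalDist` in place of the z table; the helper name `prob_mean_between` is hypothetical, and the normal shape of the sampling distribution is an assumption justified by the central limit theorem, as in the solution.

```python
from math import sqrt
from statistics import NormalDist


def prob_mean_between(mu, sigma, n, low, high):
    """P(low < sample mean < high), treating the sample mean as normal (CLT)."""
    se = sigma / sqrt(n)                    # standard error of the mean
    xbar = NormalDist(mu=mu, sigma=se)      # approximate sampling distribution
    return xbar.cdf(high) - xbar.cdf(low)   # area between the two limits


se = 3 / sqrt(36)                           # standard error for the example
z_low = (7.8 - 8) / se                      # standardized lower limit
z_high = (8.2 - 8) / se                     # standardized upper limit
prob = prob_mean_between(mu=8, sigma=3, n=36, low=7.8, high=8.2)

print(se, round(z_low, 2), round(z_high, 2), round(prob, 4))
```

The computed probability should agree with the table-based answer of 0.3108 to four decimal places; small differences in other examples can arise because z tables round z to two decimals.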