Chapter 5 Sampling Distributions

Introduction
• Distribution of a Sample Statistic: The probability distribution of a sample statistic obtained from a random sample or a randomized experiment
– What values can a sample mean (or proportion) take on, and how likely are ranges of values?
• Population Distribution: Set of values of a variable for a population of individuals. Conceptually equivalent to a probability distribution, in the sense of selecting an individual at random and observing their value of the variable of interest

Sampling Distributions for Counts and Proportions
• Binary outcomes: Each individual or realization can be classified as a “Success” or “Failure” (Presence/Absence of characteristic of interest)
• Random variable X is the count of successes in n “trials”
• Sample proportion: Proportion of successes in the sample
• Population proportion: Proportion of successes in the population

Sample Proportion: $\hat{p} = X/n$
Population Proportion: $p$

Binomial Distribution for Sample Counts
• Binomial “Experiment”
– Consists of n trials or observations
– Trials/observations are independent of one another
– Each trial/observation can end in one of two possible outcomes, often labelled “Success” and “Failure”
– The probability of success, p, is constant across trials/observations
– Random variable, X, is the number of successes observed in the n trials/observations.
• Binomial Distributions: Family of distributions for X, indexed by success probability (p) and number of trials/observations (n). Notation: X ~ B(n, p)

Binomial Distributions and Sampling
• Problem when sampling from a finite population: the sequence of probabilities of Success is altered after observing earlier individuals.
• When the population is much larger than the sample (say at least 20 times as large), the effect is minimal and we say X is approximately binomial
• Obtaining probabilities:

$$P(X = k) = \binom{n}{k} p^k (1-p)^{n-k}, \qquad \binom{n}{k} = \frac{n!}{k!\,(n-k)!}, \qquad k = 0, 1, \ldots, n$$

Table C gives probabilities for various n and p.
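The binomial probability formula above can also be evaluated directly rather than read from a table. The following is a minimal sketch in Python; the function names `binomial_pmf` and `binomial_cdf` and the values n = 5, p = 0.5 are illustrative assumptions, not part of the slides.

```python
from math import comb


def binomial_pmf(k, n, p):
    """P(X = k) for X ~ B(n, p), computed from the formula on the slide."""
    if k < 0 or k > n:
        return 0.0
    return comb(n, k) * p**k * (1 - p) ** (n - k)


def binomial_cdf(k, n, p):
    """P(X <= k): sum of the individual probabilities for 0, 1, ..., k."""
    return sum(binomial_pmf(j, n, p) for j in range(k + 1))


# Hypothetical illustration values (not taken from the slides).
n, p = 5, 0.5
for k in range(n + 1):
    print(k, binomial_pmf(k, n, p))
```

The probabilities over k = 0, 1, ..., n sum to 1, which is a quick sanity check on any table built this way.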
Note that for p > 0.5, use 1-p in Table C; the entry in row k is then P(X = n-k), the probability of k failures.

Example - Diagnostic Test
• Test claims to have a sensitivity of 90% (among people with the condition, the probability of testing positive is .90)
• 10 people who are known to have the condition are identified; X is the number that correctly test positive

$$P(X = k) = \binom{10}{k} (0.9)^k (0.1)^{10-k}, \qquad \binom{10}{k} = \frac{10!}{k!\,(10-k)!}, \qquad k = 0, 1, \ldots, 10$$

k     P(k)
0     1E-10
1     9E-09
2     3.64E-07
3     8.75E-06
4     0.000138
5     0.001488
6     0.01116
7     0.057396
8     0.19371
9     0.38742
10    0.348678

• Compare with Table C, n=10, p=.10
• Table obtained in EXCEL with function: BINOMDIST(k,n,p,FALSE) (the TRUE option gives the cumulative distribution function: P(X ≤ k))

Binomial Mean & Standard Deviation
• Let $S_i = 1$ if the ith individual was a success, 0 otherwise
• Then $P(S_i = 1) = p$ and $P(S_i = 0) = 1-p$
• Then $E(S_i) = \mu_S = 1(p) + 0(1-p) = p$
• Note that $X = S_1 + \cdots + S_n$ and that trials are independent
• Then $E(X) = \mu_X = n\mu_S = np$
• $V(S_i) = E(S_i^2) - \mu_S^2 = p - p^2 = p(1-p)$
• Then $V(X) = \sigma_X^2 = np(1-p)$

$$X \sim B(n, p): \qquad E(X) = \mu_X = np, \qquad \sigma_X = \sqrt{np(1-p)}$$

For the diagnostic test:

$$\mu = 10(0.9) = 9.0, \qquad \sigma = \sqrt{10(0.9)(0.1)} = 0.95$$

Sample Proportions
• Counts of Successes (X) are rarely reported due to their dependency on sample size (n)
• More common is to report the sample proportion of successes:

$$\hat{p} = \frac{\text{\# of successes in sample}}{\text{sample size}} = \frac{X}{n}$$

$$E(\hat{p}) = \mu_{\hat{p}} = p, \qquad V(\hat{p}) = \sigma_{\hat{p}}^2 = \frac{p(1-p)}{n}, \qquad \sigma_{\hat{p}} = \sqrt{\frac{p(1-p)}{n}}$$

Sampling Distributions for Counts & Proportions
• For samples of size n, counts (and thus proportions) can take on only n+1 distinct possible values (0, 1, …, n)
• As the sample size n gets large, so does the number of possible values, and the sampling distribution begins to approximate a normal distribution.
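The last bullet can be checked numerically by comparing the exact binomial probabilities with a normal curve that has the same mean np and standard deviation sqrt(np(1-p)). The sketch below is one way to do this in Python; the helper names, the choice p = 0.2, and the sample sizes 10, 100, and 1000 are assumptions made for illustration. Probabilities are computed on the log scale (an implementation choice, not something the slides prescribe) so that large n does not cause numerical overflow.

```python
import math


def binomial_pmf(k, n, p):
    """P(X = k) for X ~ B(n, p) with 0 < p < 1, computed via logs for stability."""
    log_prob = (
        math.lgamma(n + 1) - math.lgamma(k + 1) - math.lgamma(n - k + 1)
        + k * math.log(p) + (n - k) * math.log(1 - p)
    )
    return math.exp(log_prob)


def normal_pdf(x, mean, sd):
    """Density of a normal distribution with the given mean and SD."""
    z = (x - mean) / sd
    return math.exp(-0.5 * z * z) / (sd * math.sqrt(2 * math.pi))


def max_gap(n, p):
    """Largest absolute difference between the binomial probability at k
    and the matching normal density at k, over k = 0, 1, ..., n."""
    mean = n * p
    sd = math.sqrt(n * p * (1 - p))
    return max(
        abs(binomial_pmf(k, n, p) - normal_pdf(k, mean, sd))
        for k in range(n + 1)
    )


# Illustrative values only: p = 0.2 with increasing sample sizes.
gaps = {n: max_gap(n, 0.2) for n in (10, 100, 1000)}
for n, gap in gaps.items():
    print(n, gap)
```

The printed gap should shrink as n grows, which is the sense in which the sampling distribution "begins to approximate" a normal distribution.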
Common rule of thumb: use the normal approximation when np ≥ 10 and n(1-p) ≥ 10 (the second argument of N below is the standard deviation):

$$X \sim N\left(np, \sqrt{np(1-p)}\right) \quad \text{(approximately)}$$

$$\hat{p} \sim N\left(p, \sqrt{\frac{p(1-p)}{n}}\right) \quad \text{(approximately)}$$

Sampling Distribution for X~B(n=1000,p=0.2)
[Figure: Sampling Distribution of X (n=1000, p=0.2). Vertical axis: Probability (0 to 0.035); horizontal axis: # Successes]

$$\mu_X = np = 1000(.20) = 200, \qquad \sigma_X = \sqrt{np(1-p)} = \sqrt{1000(.2)(.8)} = 12.65$$

Using Z-Table for Approximate Probabilities
• To find probabilities of certain ranges of counts or proportions, we can make use of the fact that sample counts and proportions are approximately normally distributed for large sample sizes.
– Define range of interest
– Obtain mean of the sampling distribution
– Obtain standard deviation of sampling distribution
– Transform range of interest to range of Z-values
– Obtain (approximate) probabilities from Z-table

Coin Tossing (Heads): $P(\hat{p} \ge 0.51 \mid n = 1000 \text{ tosses})$
Range: $\hat{p} \ge 0.51$
Mean: $p = 0.50$
SD: $\sqrt{\dfrac{(0.5)(0.5)}{1000}} = .0158$

$$z = \frac{\hat{p} - \mu_{\hat{p}}}{\sigma_{\hat{p}}} = \frac{0.51 - 0.50}{.0158} = 0.63$$

$$P(Z \ge 0.63) = 1 - P(Z \le 0.63) = 1 - .7357 = .2643$$

Sampling Distribution of a Sample Mean
• Obtain a sample of n independent measurements of a quantitative variable: $X_1, \ldots, X_n$ from a population with mean μ and standard deviation σ
– Averages will be less variable than the individual measurements
– Sampling distributions of averages will become more like a normal distribution as n increases (regardless of the shape of the population of individual measurements)

$$E(\bar{X}) = E\left(\frac{1}{n}\sum_{i=1}^{n} X_i\right) = \frac{1}{n}\, n\mu = \mu = \mu_{\bar{X}}$$

$$V(\bar{X}) = V\left(\frac{1}{n}\sum_{i=1}^{n} X_i\right) = \left(\frac{1}{n}\right)^2 n\sigma^2 = \frac{\sigma^2}{n} = \sigma_{\bar{X}}^2$$

$$\sigma_{\bar{X}} = \frac{\sigma}{\sqrt{n}}$$

Central Limit Theorem
• When random samples of size n are selected from any population with mean μ and finite standard deviation σ, the sampling distribution of the sample mean will be approximately normally distributed for large n:

$$\bar{X} \sim N\left(\mu, \frac{\sigma}{\sqrt{n}}\right) \quad \text{approximately, for large } n$$

The Z-table can be used to approximate probabilities of ranges of values for sample means, as well as percentiles of their
sampling distribution.

Exponential Distribution
• Often used to model times: survival of components, time to complete tasks, time between customer arrivals at a checkout line, etc. Density is highly skewed:

[Figure, two panels. Left: Individual Measurements (μ=1, σ=1). Right: Sample means of size 10 (μ=1, σ=1/√10=0.32). Axes: x (0 to 5) and y (density)]

Miscellaneous Topics
• Normal approximation for sample counts and proportions is an example of the CLT ($X = S_1 + \cdots + S_n$)
• Any linear function of independent normal random variables is normal (use the rules on means and variances to get the parameters of the distribution)
• Generalizations of the CLT apply to cases where random variables are correlated (to an extent) and have different distributions (within reason)
– Variables made up of many small random influences will tend to be approximately normal
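Two of the calculations in this chapter lend themselves to a quick numerical check: the coin-tossing probability found from the Z-table, and the claim that means of 10 exponential measurements have mean 1 and standard deviation 1/√10 = 0.32. The Python sketch below is one possible way to check both, under stated assumptions: the function names, the seed, and the 20,000 simulated samples are illustrative choices, and the standard normal CDF is computed from the error function instead of a printed table.

```python
import math
import random
import statistics


def normal_cdf(z):
    """Standard normal CDF, P(Z <= z), via the error function."""
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))


def upper_tail_proportion(p_hat, p, n):
    """Normal-approximation P(sample proportion >= p_hat) for n trials."""
    sd = math.sqrt(p * (1.0 - p) / n)   # SD of the sample proportion
    z = (p_hat - p) / sd                # transform to a Z-value
    return 1.0 - normal_cdf(z)


def simulate_sample_means(n, reps, seed=1):
    """Means of `reps` simulated samples of size n from an exponential
    population with mean 1 (rate 1). The seed is arbitrary."""
    rng = random.Random(seed)
    return [
        statistics.fmean(rng.expovariate(1.0) for _ in range(n))
        for _ in range(reps)
    ]


# Coin tossing: P(p-hat >= 0.51) with p = 0.50 and n = 1000 tosses.
coin = upper_tail_proportion(0.51, 0.50, 1000)
print("coin-toss upper tail:", coin)

# Exponential population: sample means of size 10.
means = simulate_sample_means(n=10, reps=20000)
print("mean of sample means:", statistics.fmean(means))
print("SD of sample means:", statistics.stdev(means))
```

The coin-toss result differs slightly from the slide's .2643 because the slide rounds z to 0.63 before using the table. The simulated mean and standard deviation should land near 1 and 0.32, though the exact values depend on the seed and the number of simulated samples.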