Announcements Announcements Lecture 7: Geometric & Binomial distributions Due: HW 2 at the beginning of class on Thursday: Clarification on Exercise 3.4 parts (c) and (d): Mary and Leo’s percentile mean the proportion of people whose finishing times are lower than theirs. Note that this doesn’t mean ”% of people they performed better than” because in a triathlon ”performing better” means finishing faster. Statistics 101 Mine Çetinkaya-Rundel February 7, 2012 Statistics 101 (Mine Çetinkaya-Rundel) Recap Review question Q3: More than three-quarters of the nation’s colleges and universities now offer online classes, and about 23% of college graduates have taken a course online. 39% of those who have taken a course online believe that online courses provide the same educational value as one taken in person, a view shared by only 27% of those who have not taken an online course. At a coffee shop you overhear a recent college graduate discussing that she doesn’t believe that online courses provide the same educational value as one taken in person. What’s the probability that she has taken an online course before? valuable not valuable total Statistics 101 (Mine Çetinkaya-Rundel) 1 / 31 Qnline quiz 2- Q3 Question 3: didn’t take online course February 7, 2012 Recap Qnline quiz 2- commonly missed questions took online course L7: Geo & Binom distributions Which is the correct notation for the following probability? “At a coffee shop you overhear a recent college graduate discussing that she doesn’t believe that online courses provide the same educational value as one taken in person. What’s the probability that she has taken an online course before?” (a) P(took online course | not valuable) (b) P(not valuable | took online course) (c) P(valuable and took online course) total (d) P(took online course and not valuable) 0.23 (e) P(valuable | didn’t take online course) 1 L7: Geo & Binom distributions February 7, 2012 2 / 31 Statistics 101 (Mine Çetinkaya-Rundel) L7: Geo & Binom distributions February 7, 2012 3 / 31 Geometric distribution Bernoulli distribution Geometric distribution Milgram experiment Milgram experiment (cont.) Stanley Milgram, a Yale University psychologist, conducted a series of experiments on obedience to authority starting in 1963. These experiments measured the willingness of study participants to obey an authority figure who instructed them to perform acts that conflicted with their personal conscience. Experimenter (E) orders the teacher (T), the subject of the experiment, to give severe electric shocks to a learner (L) each time the learner answers a question incorrectly. Milgram found that about 65% of people would obey authority and give such shocks. Over the years, additional research suggested this number is approximately consistent across communities and time. The learner is actually an actor, and the electric shocks are not real, but a prerecorded sound is played each time the teacher administers an electric shock. Statistics 101 (Mine Çetinkaya-Rundel) Bernoulli distribution L7: Geo & Binom distributions Geometric distribution February 7, 2012 4 / 31 Statistics 101 (Mine Çetinkaya-Rundel) Bernoulli distribution L7: Geo & Binom distributions Geometric distribution Bernouilli random variables February 7, 2012 5 / 31 Geometric distribution Geometric distribution Dr. Smith wants to repeat Milgram’s experiments but she only wants to sample people until she finds someone who will not inflict a severe shock. What is the probability that she stops after the first person? Each person in Milgram’s experiment can be thought of as a trial. A person is labeled a success if she refuses to administer a severe shock, and failure if she administers such shock. P (1st person refuses ) = 0.35 Since only 35% of people refused to administer a shock, probability of success is p = 0.35. ... the third person? When an individual trial has only two possible outcomes, it is called a Bernoulli random variable. P (1st and 2nd shock , 3rd refuses ) = S S R × × = 0.652 ×0.35 ≈ 0.15 0.65 0.65 0.35 ... the tenth person? Statistics 101 (Mine Çetinkaya-Rundel) L7: Geo & Binom distributions February 7, 2012 6 / 31 Statistics 101 (Mine Çetinkaya-Rundel) L7: Geo & Binom distributions February 7, 2012 7 / 31 Geometric distribution Geometric distribution Geometric distribution Geometric distribution Geometric distribution (cont.) Geometric distribution describes the waiting time until a success for independent and identically distributed (iid) Bernouilli random variables. Clicker question Can we calculate the probability of rolling a 6 for the first time on the 6th roll of a die using the geometric distribution? independence: outcomes of trials don’t affect each other identical: the probability of success is the same for each trial (a) no, on the roll of a die there are more than 2 possible outcomes (b) yes, why not Geometric probabilities If p represents probability of success, (1 − p ) represents probability of failure, and n represents number of independent trials P (success on the nth trial ) = (1 − p )n−1 p Statistics 101 (Mine Çetinkaya-Rundel) L7: Geo & Binom distributions Geometric distribution February 7, 2012 8 / 31 Statistics 101 (Mine Çetinkaya-Rundel) Geometric distribution L7: Geo & Binom distributions Geometric distribution Expected value February 7, 2012 9 / 31 Geometric distribution Expected value and its variability Mean and standard deviation of geometric distribution How many people is Dr. Smith expected to test before finding the first one that refuses to administer the shock? The expected value, or the mean, of a geometric distribution is defined as p1 . 1 1 µ= = = 2.86 p 0.35 s 1 µ= p σ= 1−p p2 Going back to Dr. Smith’s experiment: s σ= She is expected to test 2.86 people before finding the first one that refuses to administer the shock. But how can she test a non-whole number of people? 1−p = p2 r 1 − 0.35 = 2.3 0.352 Dr. Smith is expected to test 2.86 people before finding the first one that refuses to administer the shock, give or take 2.3 people. These values only makes sense in the context of repeating the experiment many many times. Statistics 101 (Mine Çetinkaya-Rundel) L7: Geo & Binom distributions February 7, 2012 10 / 31 Statistics 101 (Mine Çetinkaya-Rundel) L7: Geo & Binom distributions February 7, 2012 11 / 31 Binomial distribution Binomial distribution Suppose we randomly select four individuals to participate in this experiment. What is the probability that exactly 1 of them will refuse to administer the shock? Scenario 2: Scenario 3: Scenario 4: 0.35 (A) refuse 0.65 (A) shock 0.65 (A) shock 0.65 (A) shock 0.65 (B) shock 0.35 × (B) refuse 0.65 × (B) shock 0.65 × (B) shock × 0.65 (C) shock 0.65 × (C) shock 0.35 × (C) refuse 0.65 × (C) shock × 0.65 (D) shock 0.65 × (D) shock 0.65 × (D) shock 0.35 × (D) refuse × # of scenarios × P (single scenario ) = 0.0961 = 0.0961 # of scenarios: there is a less tedious way to figure this out, we’ll get to that shortly... = 0.0961 P (single scenario ) = p k (1 − p )(n−k ) = 0.0961 probability of success to the power of number of successes, probability of failure to the power of number of failures The Binomial distribution describes the probability of having exactly k successes in n independent Bernouilli trials with probability of success p. The probability of exactly one 1 of 4 people refusing to administer the shock is the sum of all of these probabilities. 0.0961 + 0.0961 + 0.0961 + 0.0961 = 4 × 0.0961 = 0.3844 Statistics 101 (Mine Çetinkaya-Rundel) L7: Geo & Binom distributions Binomial distribution Binomial distribution The question from the prior slide asked for the probability of given number of successes, k , in a given number of trials, n, (k = 1 success in n = 4 trials), and we calculated this probability as Let’s call these people Allen (A), Brittany (B), Caroline (C), and Damian (D). Each one of the four scenarios below will satisfy the condition of “exactly 1 of them refuses to administer the shock”: Scenario 1: February 7, 2012 The binomial distribution 12 / 31 Statistics 101 (Mine Çetinkaya-Rundel) L7: Geo & Binom distributions The binomial distribution Binomial distribution Counting the # of scenarios February 7, 2012 13 / 31 The binomial distribution Calculating the # of scenarios Choose function The choose function is useful for calculating the number of ways to choose k successes in n trials. Earlier we wrote out all possible scenarios that fit the condition of exactly one person refusing to administer the shock. If n was larger and/or k was different than 1, writing out the scenarios would get even more tedious. For example, what if n = 9 and k = 2: ! n n! = k k !(n − k )! RRSSSSSSS SRRSSSSSS SSRRSSSSS ··· SSRSSRSSS k = 1, n = 4: 4 k = 2, n = 9: 9 1 ··· 2 = 4! 1!(4−1)! = 4×3×2×1 1×(3×2×1) =4 = 9! 2!(9−1)! = 9×8×7! 2×1×7! 72 2 = = 36 SSSSSSSRR Note: You can also use R for these calculations: Writing out all possible scenarios is incredibly tedious and prone to errors. Statistics 101 (Mine Çetinkaya-Rundel) L7: Geo & Binom distributions February 7, 2012 > choose(9,2) [1] 36 14 / 31 Statistics 101 (Mine Çetinkaya-Rundel) L7: Geo & Binom distributions February 7, 2012 15 / 31 Binomial distribution The binomial distribution Binomial distribution Properties of the choose function Binomial distribution (cont.) If k = 1, only 1 of the n trials result in a success, it could be the first, the second, · · · , or the nth trial, so there are n ways this can happen: Binomial probabilities If p represents probability of success, (1 − p ) represents probability of failure, n represents number of independent trials, and k represents number of successes ! n =n 1 If k = n, all n trials result in a success, and there’s only one way this can happen: ! n k P (k successes in n trials ) = p (1 − p )(n−k ) k We can use the binomial distribution to calculate the probability of k successes in n trials, as long as ! n =1 n If k = 0, all n trials result in a failure, and there’s only one way this can happen as well: ! n =1 0 Statistics 101 (Mine Çetinkaya-Rundel) L7: Geo & Binom distributions Binomial distribution The binomial distribution February 7, 2012 16 / 31 1 the trials are independent 2 the number of trials, n, is fixed 3 each trial outcome can be classified as a success or a failure 4 the probability of success, p, is the same for each trial Statistics 101 (Mine Çetinkaya-Rundel) The binomial distribution L7: Geo & Binom distributions Binomial distribution February 7, 2012 17 / 31 The binomial distribution Clicker question Clicker question A January 27, 2012 Gallup survey suggests that 48% of Americans would vote for Obama over Romney if the presidential election was held that day. Among a random sample of 10 Americans what is the probability that exactly 8 would vote for Obama over Romney? A January 27, 2012 Gallup survey suggests that 48% of Americans would vote for Obama over Romney if the presidential election was held that day. Among a random sample of 10 Americans what is the probability that exactly 8 would vote for Obama over Romney? (a) pretty low (a) 0.488 × 0.522 (b) pretty high (b) 8 × 0.488 × 0.522 10 8 2 (c) 10 8 × 0.48 × 0.52 10 (d) 8 × 0.482 × 0.528 http:// www.gallup.com/ poll/ election.aspx , February 6, 2012. Statistics 101 (Mine Çetinkaya-Rundel) L7: Geo & Binom distributions February 7, 2012 18 / 31 Statistics 101 (Mine Çetinkaya-Rundel) L7: Geo & Binom distributions February 7, 2012 19 / 31 Binomial distribution The binomial distribution Binomial distribution Expected value The binomial distribution Expected value and its variability Mean and standard deviation of binomial distribution A January 27, 2012 Gallup survey suggests that 48% of Americans would vote for Obama over Romney if the presidential election was held that day. Among a random sample of 100 people, how many would you expect to vote for Obama? µ = np σ= q np (1 − p ) Going back to the voters: Easy enough, 100 × 0.48 = 48. σ= Or more formally, µ = np = 100 × 0.48 = 48. But this doesn’t mean in every random sample of 100 people exactly 48 will vote for Obama. In some samples this value will be less, and in others more. How much would we expect this value to vary? √ q np (1 − p ) = 100 × 0.48 × 0.52 ≈ 5 We would expect 48 out of 100 randomly sampled voters, give or take 5. Note: Mean and standard deviation of a Binomial might not always be whole numbers, and that is alright, these values represent what we would expect to see on average. Statistics 101 (Mine Çetinkaya-Rundel) L7: Geo & Binom distributions Binomial distribution February 7, 2012 20 / 31 Statistics 101 (Mine Çetinkaya-Rundel) The binomial distribution L7: Geo & Binom distributions Binomial distribution February 7, 2012 21 / 31 The binomial distribution Unusual observations Clicker question Using the notion that observations that are more than 2 standard deviations away from the mean are considered unusual and the mean and the standard deviation we just computed, we can calculate a range for how many people we should expect to find in random samples of 100 that are planning to vote for Obama in the 2012 presidential election. A January 2012 Gallup report suggests that 7.5% of 18-29 year old Americans have been diagnosed with high blood pressure. Would a random sample of 1,000 adults in this age range where only 60 of them have high blood pressure be considered unusual? (a) Yes (b) No 48 ± 2 × 5 = (38, 58) http:// www.gallup.com/ poll/ 152108/ Key-Chronic-Diseases-Decline.aspx Statistics 101 (Mine Çetinkaya-Rundel) L7: Geo & Binom distributions February 7, 2012 22 / 31 Statistics 101 (Mine Çetinkaya-Rundel) L7: Geo & Binom distributions February 7, 2012 23 / 31 Binomial distribution Normal approximation to the binomial Binomial distribution A recent study found that “Facebook users get more than they give”. For example: 40% of Facebook users in our sample made a friend request, but 63% received at least one request Users in our sample pressed the like button next to friends’ content an average of 14 times, but had their content “liked” an average of 20 times Normal approximation to the binomial This study also found that approximately %25 of Facebook users are considered power users. The same study found that the average Facebook user has 245 friends. What is the probability that the average Facebook user with 245 friends has 70 or more friends who would be considered power users? We are given that n = 245, p = 0.25, and we are asked for the probability P (K ≥ 70). Users sent 9 personal messages, but received 12 12% of users tagged a friend in a photo, but 35% were themselves tagged in a photo P (X ≥ 70) = P (K = 70 or K = 71 or K = 72 or · · · or K = 245) Any guesses for how this pattern can be explained? = P (K = 70) + P (K = 71) + P (K = 72) + · · · + P (K = 245) This seems like an awful lot of work... http:// www.pewinternet.org/ Reports/ 2012/ Facebook-users/ Summary.aspx Statistics 101 (Mine Çetinkaya-Rundel) L7: Geo & Binom distributions Binomial distribution February 7, 2012 24 / 31 Statistics 101 (Mine Çetinkaya-Rundel) Normal approximation to the binomial Binomial distribution Normal approximation to the binomial 25 / 31 Normal approximation to the binomial P (K ≥ 70) = P (K = 70) + P (K = 71) + P (K = 72) + · · · + P (K = 245) = P (X > 70 − 0.5) In the case of the Facebook power users, n = 245 and p = 0.25. σ= February 7, 2012 Correction for the normal approximation When the sample size is large enough, the binomial distribution with parameters n and p can be approximated by the normal model with p parameters µ = np and σ = np (1 − p ). µ = 245 × 0.25 = 61.25 L7: Geo & Binom distributions √ 245 × 0.25 × 0.75 = 6.78 0.06 Bin(n = 245, p = 0.25) ≈ N (µ = 61.25, σ = 6.78). Bin(245,0.25) N(61.5,6.78) 0.05 0.06 Bin(245,0.25) N(61.5,6.78) 0.05 0.04 0.04 We apply a 0.5 correction in order to account for the probability of exactly 70 “successes”. 0.03 0.03 0.02 0.02 0.01 0.01 0.00 0.00 20 40 60 80 100 20 k 40 60 80 100 k Statistics 101 (Mine Çetinkaya-Rundel) L7: Geo & Binom distributions February 7, 2012 26 / 31 Statistics 101 (Mine Çetinkaya-Rundel) L7: Geo & Binom distributions February 7, 2012 27 / 31 Binomial distribution Normal approximation to the binomial Binomial distribution Normal approximation to the binomial Histograms of number of successes Hollow histograms of samples from the binomial model where p = 0.10 and n = 10, 30, 100, and 300. Clicker question What is the probability that the average Facebook user with 245 friends has 70 or more friends who would be considered power users? 0 (a) 0.0984 (c) 0.8888 (b) 0.1112 (d) 0.9016 2 4 6 0 2 n = 10 0 5 10 4 6 8 10 n = 30 15 20 10 20 n = 100 30 40 50 n = 300 Try it yourself at http:// www.socr.ucla.edu/ htmls/ SOCR Distributions.html . Statistics 101 (Mine Çetinkaya-Rundel) L7: Geo & Binom distributions Binomial distribution February 7, 2012 28 / 31 Statistics 101 (Mine Çetinkaya-Rundel) Normal approximation to the binomial L7: Geo & Binom distributions Binomial distribution February 7, 2012 29 / 31 Normal approximation to the binomial Low large is large enough? Clicker question Below are four pairs of Binomial distribution parameters. Which distribution can be approximated by the normal distribution? There are two correct answers. The sample size is considered large enough if the expected number of successes and failures are both at least 10. np ≥ 10 Statistics 101 (Mine Çetinkaya-Rundel) and n(1 − p ) ≥ 10 L7: Geo & Binom distributions February 7, 2012 30 / 31 (a) n = 100, p = 0.8 (c) n = 150, p = 0.05 (b) n = 25, p = 0.6 (d) n = 500, p = 0.015 Statistics 101 (Mine Çetinkaya-Rundel) L7: Geo & Binom distributions February 7, 2012 31 / 31