161.120 Introductory Statistics
Week 6 Lecture slides

• Probability
– CAST chapter 8
– Text sections 7.1 to 7.3, 7.6 and 7.7
• Random Variables
– Text sections 8.1, 8.2 and 8.5
– CAST section 9.1

7.1 Random Circumstances
Random circumstance is one in which the outcome is unpredictable.

Case Study 1.1 Alicia Has a Bad Day
Doctor Visit: Diagnostic test comes back positive for a disease (D). Test is 95% accurate. About 1 out of 1000 women actually have D.
Statistics Class: Professor randomly selects 3 separate students at the beginning of each class to answer questions. Alicia is picked to answer the third question.

Random Circumstances in Alicia’s Day
Random Circumstance 1: Disease status
Alicia has D.
Alicia does not have D.
Random Circumstance 2: Test result
Test is positive.
Test is negative.
Random Circumstance 3: 1st student’s name is drawn
Alicia is selected.
Alicia is not selected.
Random Circumstance 4: 2nd student’s name is drawn
Alicia is selected.
Alicia is not selected.
Random Circumstance 5: 3rd student’s name is drawn
Alicia is selected.
Alicia is not selected.

Assigning Probabilities
• A probability is a value between 0 and 1 and is written either as a fraction or as a decimal fraction.
• A probability simply is a number between 0 and 1 that is assigned to a possible outcome of a random circumstance.
• For the complete set of distinct possible outcomes of a random circumstance, the total of the assigned probabilities must equal 1.

7.2 Interpretations of Probability
The Relative Frequency Interpretation of Probability
In situations that we can imagine repeating many times, we define the probability of a specific outcome as the proportion of times it would occur over the long run -- called the relative frequency of that particular outcome.

Example 7.1 Probability of Male versus Female Births
Long-run relative frequency of males born in the United States is about .512. Information Please Almanac (1991, p. 815).
Table provides results of simulation: the proportion is far from .512 over the first few weeks but in the long run settles down around .512.

Determining the Relative Frequency Probability of an Outcome
Method 1: Make an Assumption about the Physical World

Example 7.2 A Simple Lottery
Choose a three-digit number between 000 and 999. Player wins if his or her three-digit number is chosen.
Suppose the 1000 possible 3-digit numbers (000, 001, 002, . . . , 999) are equally likely.
In long run, a player should win about 1 out of 1000 times. This does not mean a player will win exactly once in every thousand plays.

Example 7.3 Probability Alicia has to Answer a Question
There are 50 student names in a bag. If names mixed well, can assume each student is equally likely to be selected.
Probability Alicia will be selected to answer the first question is 1/50.

Method 2: Observe the Relative Frequency

Example 7.4 The Probability of Lost Luggage
“1 in 176 passengers on U.S. airline carriers will temporarily lose their luggage.”
This number is based on data collected over the long run. So the probability that a randomly selected passenger on a U.S. carrier will temporarily lose luggage is 1/176 or about 0.006.

Proportions and Percentages as Probabilities
Ways to express the relative frequency of lost luggage:
• The proportion of passengers who lose their luggage is 1/176 or about 0.006.
• About 0.6% of passengers lose their luggage.
• The probability that a randomly selected passenger will lose his/her luggage is about 0.006.
• The probability that you will lose your luggage is about 0.006.
Last statement is not exactly correct – your probability depends on other factors (how late you arrive at the airport, etc.).
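
The long-run behaviour described in Example 7.1 can be illustrated with a short simulation. This is only a sketch and not the simulation behind the table in the slides: the function name `running_proportion`, the seed, and the number of simulated births are assumptions chosen for illustration. Each birth is treated as an independent draw with probability .512 of being male.

```python
import random


def running_proportion(p, n_trials, seed=0):
    """Return the proportion of successes after each of n_trials draws."""
    rng = random.Random(seed)  # fixed seed so the run is repeatable
    successes = 0
    proportions = []
    for i in range(1, n_trials + 1):
        if rng.random() < p:  # this draw counts as a "success" (a boy)
            successes += 1
        proportions.append(successes / i)
    return proportions


# Assumed settings: 100,000 simulated births, P(boy) = 0.512
props = running_proportion(p=0.512, n_trials=100_000, seed=1)

for n in (10, 100, 1_000, 10_000, 100_000):
    print(f"after {n:>7} births: proportion of boys = {props[n - 1]:.4f}")
```

The early proportions can be well away from .512, while the later ones should sit close to it. The same function could be called with p = 1/1000 to explore the lottery in Example 7.2.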
Estimating Probabilities from Observed Categorical Data
Assuming data are representative, the probability of a particular outcome is estimated to be the relative frequency (proportion) with which that outcome was observed.
Approximate margin of error for the estimated probability is 1/√n, where n is the sample size.

Example 7.5 Nightlights and Myopia Revisited
Assuming these data are representative of a larger population, what is the approximate probability that someone from that population who sleeps with a nightlight in early childhood will develop some degree of myopia?
Note: 72 + 7 = 79 of the 232 nightlight users developed some degree of myopia. So we estimate the probability to be 79/232 = 0.34.
This estimate is based on a sample of 232 people, with a margin of error of about 1/√232 ≈ 0.066.

The Personal Probability Interpretation
Personal probability of an event = the degree to which a given individual believes the event will happen.
Sometimes the term subjective probability is used, because the degree of belief may be different for each individual.
• Must fall between 0 and 1 (or between 0 and 100%).

7.3 Probability Definitions and Relationships
• Sample space: the collection of unique, nonoverlapping possible outcomes of a random circumstance.
• Simple event: one outcome in the sample space; a possible outcome of a random circumstance.
• Event: a collection of one or more simple events in the sample space; often written as A, B, C, and so on.

Example 7.6 Days per Week of Drinking
Random sample of college students. Q: How many days do you drink alcohol in a typical week?
Simple Events in the Sample Space are: 0 days, 1 day, 2 days, …, 7 days
Event “4 or more” comprises the simple events {4 days, 5 days, 6 days, 7 days}

Assigning Probabilities to Simple Events
P(A) = probability of the event A
Conditions for Valid Probabilities
1. Each probability is between 0 and 1.
2. The sum of the probabilities over all possible simple events is 1.
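
The arithmetic in Example 7.5 can be checked with a few lines of code. This is a minimal sketch that assumes the approximate margin of error is 1/√n, which matches the 0.066 quoted for n = 232; the function name `estimate_with_margin` is hypothetical.

```python
import math


def estimate_with_margin(count, n):
    """Estimate a probability by a relative frequency, with margin 1/sqrt(n)."""
    p_hat = count / n           # observed relative frequency
    margin = 1 / math.sqrt(n)   # approximate margin of error (assumed formula)
    return p_hat, margin


# Example 7.5: 72 + 7 = 79 of the 232 nightlight users developed myopia
p_hat, margin = estimate_with_margin(72 + 7, 232)

print(f"estimated probability       = {p_hat:.2f}")
print(f"approximate margin of error = {margin:.3f}")
```

Because the margin depends only on n, quadrupling the sample size halves it.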
Equally Likely Simple Events
If there are k simple events in the sample space and they are all equally likely, then the probability of the occurrence of each one is 1/k.

Example 7.2 A Simple Lottery (cont)
Random Circumstance: A three-digit winning lottery number is selected.
Sample Space: {000, 001, 002, 003, . . . , 997, 998, 999}. There are 1000 simple events.
Probabilities for Simple Event: Probability any specific three-digit number is a winner is 1/1000. Assume all three-digit numbers are equally likely.
Event A = last digit is a 9 = {009, 019, . . . , 999}. Since one out of every ten numbers is in the set, P(A) = 1/10.
Event B = three digits are all the same = {000, 111, 222, 333, 444, 555, 666, 777, 888, 999}. Since event B contains 10 simple events, P(B) = 10/1000 = 1/100.

Complementary Events
One event is the complement of another event if the two events do not contain any of the same simple events and together they cover the entire sample space.
Notation: A^C represents the complement of A.
Note: P(A) + P(A^C) = 1

Example 7.2 A Simple Lottery (cont)
A = player buying single ticket wins
A^C = player does not win
P(A) = 1/1000 so P(A^C) = 999/1000

Mutually Exclusive Events
Two events are mutually exclusive, or equivalently disjoint, if they do not contain any of the same simple events (outcomes).

Example 7.2 A Simple Lottery (cont)
A = all three digits are the same.
B = the first and last digits are different.
The events A and B are mutually exclusive (disjoint), but they are not complementary.

Independent and Dependent Events
• Two events are independent of each other if knowing that one will occur (or has occurred) does not change the probability that the other occurs.
• Two events are dependent if knowing that one will occur (or has occurred) changes the probability that the other occurs.
The definitions can apply either … to events within the same random circumstance or to events from two separate random circumstances.
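
The lottery probabilities in Example 7.2 can be verified by listing the sample space and counting. This is a sketch that assumes all 1000 three-digit numbers are equally likely, as in the slides; the helper `prob` and the set names are hypothetical, and the ticket "123" is an arbitrary choice.

```python
from fractions import Fraction

# Sample space: the 1000 three-digit numbers 000, 001, ..., 999
sample_space = [f"{i:03d}" for i in range(1000)]


def prob(event):
    """P(event) when all simple events are equally likely."""
    return Fraction(len(event), len(sample_space))


last_digit_9 = {s for s in sample_space if s[-1] == "9"}
all_same = {s for s in sample_space if s[0] == s[1] == s[2]}
first_last_differ = {s for s in sample_space if s[0] != s[2]}
single_ticket = {"123"}  # arbitrary ticket, chosen for illustration

print("P(last digit is 9)      =", prob(last_digit_9))
print("P(all digits the same)  =", prob(all_same))
print("P(single ticket loses)  =", 1 - prob(single_ticket))

# Mutually exclusive: the two events share no simple events
print("disjoint?      ", all_same.isdisjoint(first_last_differ))
# Complementary would also require covering the whole sample space
print("complementary? ", (all_same | first_last_differ) == set(sample_space))
```

Numbers such as 121 belong to neither event, which is why the last two events are disjoint without being complementary.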
Example 7.7 Winning a Free Lunch
Customers put business card in restaurant glass bowl. Drawing held once a week for free lunch. You and Vanessa put a card in two consecutive weeks.
Event A = You win in week 1.
Event B = Vanessa wins in week 1.
Event C = Vanessa wins in week 2.
• Events A and B refer to the same random circumstance and are not independent.
• Events A and C refer to different random circumstances and are independent.

Example 7.3 Alicia Answering (cont)
Event A = Alicia is selected to answer Question 1.
Event B = Alicia is selected to answer Question 2.
Events A and B refer to different random circumstances, but are A and B independent events?
• P(A) = 1/50.
• If event A occurs, her name is no longer in the bag, so P(B) = 0.
• If event A does not occur, there are 49 names in the bag (including Alicia’s name), so P(B) = 1/49.
Knowing whether A occurred changes P(B). Thus, the events A and B are not independent.

Conditional Probabilities
Conditional probability of the event B, given that the event A occurs, is the long-run relative frequency with which event B occurs when circumstances are such that A also occurs; written as P(B|A).
P(B) = unconditional probability event B occurs.
P(B|A) = “probability of B given A” = conditional probability event B occurs given that we know A has occurred or will occur.

Example 7.8 Probability That a Teenager Gambles Depends upon Gender
Survey: 78,564 students (9th and 12th graders)
The proportions of males and females admitting they gambled at least once a week during the previous year were reported.
Results for 9th grade:
P(student is weekly gambler | teen is boy) = 0.20
P(student is weekly gambler | teen is girl) = 0.05
Notice dependence between “weekly gambling habit” and “gender.” Knowledge of a 9th grader’s gender changes probability that s/he is a weekly gambler.

Two-Way Table: “Hypothetical Hundred Thousand”
Example 7.8 Teens and Gambling (cont)
Sample of 9th grade teens: 49.1% boys, 50.9% girls.
Results: 22.9% of boys and 4.5% of girls admitted they gambled at least once a week during previous year.
Start with hypothetical 100,000 teens …
(.491)(100,000) = 49,100 boys and thus 50,900 girls
Of the 49,100 boys, (.229)(49,100) = 11,244 would be weekly gamblers.
Of the 50,900 girls, (.045)(50,900) = 2,291 would be weekly gamblers.

Example 7.8 Teens and Gambling (cont)
         Weekly Gambler   Not Weekly Gambler     Total
Boy          11,244             37,856           49,100
Girl          2,291             48,609           50,900
Total        13,535             86,465          100,000
P(boy and gambler) = 11,244/100,000 = 0.1124
P(boy | gambler) = 11,244/13,535 = 0.8307
P(gambler) = 13,535/100,000 = 0.13535

Simulations
Probability can be used to model complex situations. A simulation of the model involves using the model's probabilities to generate an instance of the situation. Repeating the simulation can give insight into the behaviour of the system.
CAST examples:
Tennis game, shows how randomly generating points can simulate a tennis match.
Soccer league, shows how simulations can explore some properties of the league.

In a sample, any individual's value is highly variable
• Unexplained variability in data is usually modelled as random sampling from some underlying population.
– A consequence of this model is that standard graphical and numerical summaries of data cannot be regarded as fixed – different samples from the same population would result in different summaries.
• The variability of random samples is most noticeable when the sample values belong to named individuals.
– The value for a single individual will vary greatly from sample to sample and the rankings of the individuals are similarly variable.

Distribution of the sample as a whole
• Although the value associated with any individual is highly variable, there is more stability in the overall distribution of random samples.
• Features such as centre, spread or skewness in the distribution are more stable from sample to sample.
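
The contrast between highly variable individual values and a more stable overall distribution can be illustrated with a small simulation. This is a sketch and not one of the CAST examples: the population (10,000 values drawn from a normal distribution with mean 50 and standard deviation 10), the sample size of 25, and the 1,000 repetitions are all assumptions made for illustration.

```python
import random
import statistics

rng = random.Random(42)  # fixed seed so the run is repeatable

# Hypothetical population: fixed once generated
population = [rng.gauss(50, 10) for _ in range(10_000)]

first_values = []   # value of the first sampled individual in each sample
sample_means = []   # a summary of the sample as a whole

for _ in range(1_000):
    sample = rng.sample(population, 25)  # random sample without replacement
    first_values.append(sample[0])
    sample_means.append(statistics.mean(sample))

spread_individual = statistics.stdev(first_values)
spread_mean = statistics.stdev(sample_means)

print(f"spread of a single individual's value: {spread_individual:.2f}")
print(f"spread of the sample mean:             {spread_mean:.2f}")
```

The value in any one position of the sample varies about as much as the population itself, while the sample mean varies much less from sample to sample.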
Parameters and statistics
• We usually model data sets as random samples from some population.
– The sampling process is random, so if sampling is repeated, a different sample is usually obtained, and summary statistics of the sample will vary from sample to sample.
– The underlying population remains unchanged, so summaries of the population will remain constant.
• To distinguish, we call numerical summaries of a population population parameters whereas the corresponding summaries of a sample are called sample statistics.
Population parameters are constants.
Sample statistics vary from sample to sample.

8.1 What is a Random Variable?
Random Variable: assigns a number to each outcome of a random circumstance, or, equivalently, to each unit in a population.
Two different broad classes of random variables:
1. A continuous random variable can take any value in an interval or collection of intervals.
2. A discrete random variable can take one of a countable list of distinct values.

Example 8.1 Random Variables at an Outdoor Graduation or Wedding
Some random factors that will determine how enjoyable the event is:
Temperature: continuous random variable
Number of airplanes that fly overhead: discrete random variable

Example 8.2 Probability an Event Occurs Three Times in Three Tries
• What is the probability that three tosses of a fair coin will result in three heads?
• Assuming boys and girls are equally likely, what is the probability that 3 births will result in 3 girls?
• Assuming probability is 1/2 that a randomly selected individual will be taller than median height of a population, what is the probability that 3 randomly selected individuals will all be taller than the median?
• Answer to all three questions = 1/8.
• Discrete Random Variable X = number of times the “outcome of interest” occurs in three independent tries.

8.2 Discrete Random Variables
X = the random variable.
k = a number the discrete r.v. could assume.
P(X = k) is the probability that X equals k.
Discrete random variable: can only result in a countable set of possibilities – often a finite number of outcomes, but can be infinite.

Example 8.3 It’s Possible to Toss Forever
Repeatedly toss a fair coin, and define: X = number of tosses until the first head occurs
Any number of flips is a possible outcome.
P(X = k) = (1/2)^k for k = 1, 2, 3, …

Probability Distribution of a Discrete R.V.
Using the sample space to find probabilities:
Step 1: List all simple events in sample space.
Step 2: Find probability for each simple event.
Step 3: List possible values for random variable X and identify the value for each simple event.
Step 4: Find all simple events for which X = k, for each possible value k.
Step 5: P(X = k) is the sum of the probabilities for all simple events for which X = k.
Probability function (pf) of X is a table or rule that assigns probabilities to possible values of X.

Example 8.4 How Many Girls are Likely?
Family has 3 children. Probability of a girl is ½. What are the probabilities of having 0, 1, 2, or 3 girls?
Sample Space: For each birth, write either B or G. There are eight possible arrangements of B and G for three births. These are the simple events.
Sample Space and Probabilities: The eight simple events are equally likely.
Random Variable X: number of girls in three births. For each simple event, the value of X is the number of G’s listed.

Example 8.4 How Many Girls? (cont)
Value of X for each simple event:
Probability function for Number of Girls X:
Graph of the pf of X:

Conditions for Probabilities for Discrete Random Variables
Condition 1: The sum of the probabilities over all possible values of a discrete random variable must equal 1.
Condition 2: The probability of any specific outcome for a discrete random variable must be between 0 and 1.

Cumulative Distribution Function of a Discrete Random Variable
Cumulative distribution function (cdf) for a random variable X is a rule or table that provides the probabilities P(X ≤ k) for any real number k.
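
Steps 1 to 5 above, applied to Example 8.4, can be sketched in code, together with the cumulative probabilities P(X ≤ k). This is an illustration only: it assumes the eight arrangements of B and G are equally likely, as stated in the example, and the variable names are hypothetical.

```python
from collections import defaultdict
from fractions import Fraction
from itertools import product

# Step 1: list all simple events (BBB, BBG, ..., GGG)
simple_events = ["".join(births) for births in product("BG", repeat=3)]

# Step 2: each of the eight simple events is equally likely
p_event = Fraction(1, len(simple_events))

# Steps 3 to 5: X = number of G's; add up the probabilities of the
# simple events that give each value k
pf = defaultdict(Fraction)
for event in simple_events:
    pf[event.count("G")] += p_event

# Cumulative distribution function: running total of the pf
cdf = {}
running_total = Fraction(0)
for k in sorted(pf):
    running_total += pf[k]
    cdf[k] = running_total

for k in sorted(pf):
    print(f"k = {k}:  P(X = k) = {pf[k]}   P(X <= k) = {cdf[k]}")
```

The pf values sum to 1 and each lies between 0 and 1, so both conditions for a discrete random variable hold, and the final cdf value is 1.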
Cumulative probability = probability that X is less than or equal to a particular value.

Example 8.4 Distribution Function for the Number of Girls (cont)

Finding Probabilities for Complex Events
Example 8.4 A Mixture of Children
What is the probability that a family with 3 children will have at least one child of each sex?
If X = Number of Girls then either family has one girl and two boys (X = 1) or two girls and one boy (X = 2).
P(X = 1 or X = 2) = P(X = 1) + P(X = 2) = 3/8 + 3/8 = 6/8 = 3/4
pf for Number of Girls X:

8.5 Continuous Random Variables
Continuous random variable: the outcome can be any value in an interval or collection of intervals.
Probability density function for a continuous random variable X is a curve such that the area under the curve over an interval equals the probability that X is in that interval.
P(a ≤ X ≤ b) = area under density curve over the interval between the values a and b.

Example 8.13 Time Spent Waiting for Bus
Bus arrives at stop every 10 minutes. A person arrives at the stop at a random time. How long will s/he have to wait?
X = waiting time until next bus arrives.
X is a continuous random variable over 0 to 10 minutes.
Note: Height is 0.10 so total area under the curve is (0.10)(10) = 1
This is an example of a Uniform random variable.

Example 8.13 Waiting for Bus (cont)
What is the probability the waiting time X was in the interval from 5 to 7 minutes?
Probability = area under curve between 5 and 7 = (base)(height) = (2)(.1) = .2
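
The (base)(height) calculation for the bus example can be written as a small function. This is a sketch for a uniform density only; the function name `uniform_interval_probability` and its parameters are hypothetical, with the 0 to 10 minute range taken from Example 8.13.

```python
def uniform_interval_probability(a, b, low=0.0, high=10.0):
    """P(a <= X <= b) for X uniform on [low, high], as base times height."""
    height = 1 / (high - low)               # constant density, 0.10 here
    left, right = max(a, low), min(b, high)  # clip interval to the range of X
    base = max(0.0, right - left)            # zero if the interval misses the range
    return base * height


# Example 8.13: waiting time between 5 and 7 minutes
print(uniform_interval_probability(5, 7))

# Total area under the density curve should be 1
print(uniform_interval_probability(0, 10))
```

Because the probability is an area, a single exact waiting time such as X = 5 has base 0 and therefore probability 0.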