Name ________________________________
AP Statistics Class Notes & Reading Guide
TPOSe5

Chapter P: Preliminary

Key Vocabulary: statistics, probability, available data, census, observational study, sample, variable, quantitative variable, distribution, individuals, survey, population, experiment, statistical inference, categorical variable

1. What is meant by a “variable”?
2. Describe the differences between categorical and quantitative data.
3. What is meant by a “distribution”?
4. Explain the key differences between observational studies and experiments. (Consider the design, implementation, and how the results of each can be interpreted.)
5. Explain the key questions of data analysis from your book.
6. Explain the concept of a “lurking variable”.
7. (a) Describe the explanation of probability as stated in your book.
   (b) Then give an example that further demonstrates your understanding of the concept of probability as long-run behavior.

Displaying Distributions with Graphs (1.1)

Learning Targets:
  Describe what is meant by exploratory data analysis
  Explain what is meant by the distribution of a variable
  Differentiate between categorical and quantitative variables
  Construct bar graphs and pie charts for a set of categorical data
  Construct a stemplot for a set of quantitative data
  Construct a back-to-back stemplot to compare two related distributions
  Construct a stemplot using split stems
  Construct a histogram for a set of quantitative data, and discuss how changing the class width can change the impression of the data given by the histogram
  Describe the overall pattern of a distribution by its center, shape, and spread
  Explain what is meant by the mode of a distribution
  Recognize and identify symmetric and skewed distributions
  Explain what is meant by an outlier in a stemplot or histogram
  Construct and interpret an ogive from a relative frequency table
  Construct a time plot for a set of data collected over time

CASE STUDY: What does it mean to say that a TV show was ranked number
1? The Nielsen Media Research Company randomly samples about 5100 households and 13,000 individuals each week. The TV viewing habits of this sample are captured by metering equipment, and data are sent automatically in the middle of the night to Nielsen. Broadcasters and companies that want to air commercials on TV use the data on who is watching TV and what they are watching. The results of this data gathering appear as ratings on a weekly basis. Here is an alphabetical list of the top 20 prime-time shows for viewers aged 18 to 49 during the week of November 22-28, 2004:

Show                             Network   Viewers (millions)
According to Jim                 ABC        5.5
Amazing Race                     CBS        6.1
Apprentice                       NBC        7.1
Biggest Loser                    NBC        5.4
Boston Legal                     ABC        7.4
CSI                              CBS       10.9
CSI: Miami                       CBS       10.5
CSI: NY                          CBS        6.1
Desperate Housewives             ABC       16.2
Extreme Makeover: Home Edition   ABC        9.7
Fear Factor                      NBC        6.5
Law & Order: SVU                 NBC        7.8
Monday Night Football            ABC        7.8
NFL Monday Showcase              ABC        5.7
Raymond                          CBS        8.0
Seinfeld                         NBC        7.6
Survivor: Vanuatu                CBS        7.8
Two and a Half Men               CBS        8.8
60 Minutes                       CBS        5.4

Which network is winning the ratings battle?

Individuals:
Variables:
Categorical variables:
Quantitative variables:
Univariate data:

Graphs are used to display data. (There will be a graph question on the AP Exam.)

Categorical variables are displayed with one of the following graphs:
  Bar graph
  Pie chart
  Time plot

Quantitative variables are displayed with one of the following graphs:
  Dotplot
  Stemplot (Stem-and-leaf)
  Frequency Distribution
  Relative frequency/Cumulative frequency
  Histogram

When we describe the graph, we are describing the distribution of the quantitative variable. Look for an overall pattern (use the terms “center”, “shape”, and “spread”). Center and spread are discussed in the next section. Shape refers to:
  Symmetric
  Skewed (right-skewed vs. left-skewed)

Example: The population of the United States is aging, though less rapidly than in other developed countries.
Here is a stemplot of the percents of residents aged 65 and over in the 50 states, according to the 2000 census. The stems are whole percentages and the leaves are tenths of a percent.

 5 | 7
 6 |
 7 |
 8 | 5
 9 | 679
10 | 6
11 | 02233677
12 | 0011113445789
13 | 00012233345568
14 | 034579
15 | 36
16 |
17 | 6

1. There are 2 outliers (Alaska and Florida). What are the percents for these two states?
2. Ignoring Alaska and Florida, describe the shape, center, and spread of this distribution.
3. Re-graph this data using a histogram.
   Enter the data into a list in your calculator (STAT EDIT L1)
   Sort the list in descending order (STAT SortD L1)

Frequency Distribution:
CLASSES     FREQUENCY     RELATIVE FREQ. (%)     RELATIVE CUMULATIVE FREQ.

Histogram:

Describing Distributions with Numbers (1.2)

Learning Targets:
  Given a data set, compute the mean and median as measures of center
  Explain what is meant by a resistant measure
  Given a data set, find the quartiles
  Given a data set, find the five-number summary
  Use the five-number summary of a data set to construct a boxplot for the data
  Compute the interquartile range (IQR) of a data set
  Given a data set, use the 1.5(IQR) rule to identify outliers
  Given a data set, compute the standard deviation and variance as measures of spread
  Identify situations in which the mean is the most appropriate measure of center and situations in which the median is the most appropriate measure

MEASURES OF CENTER: Mean, Median, Mode

A small company consists of the owner, a manager, a salesperson, and 2 technicians. Their annual salaries are listed below:

Staff          Salary
owner          $200,000
manager        $50,000
salesperson    $25,000
technician     $15,000
technician     $15,000

What is the Median? ________________   What is the Mode? ______________

Mean = (sum of values) / (# of values):

$\bar{x} = \frac{x_1 + x_2 + x_3 + \dots + x_n}{n} = \frac{\sum x_i}{n} = \frac{1}{n}\sum x_i$

Why is the mean so much higher than the median?

NONRESISTANT measure of center:
RESISTANT measure of center:
(unimodal vs. bimodal)

If the data are symmetric, the mean and median are the same.
If the data are skewed, the mean is farther out in the tail than the median.

MEASURES OF SPREAD: Range, Quartiles, Standard Deviation
1.) Range:
2.) Quartiles:
3.) Standard Deviation:

Your calculator will compute a 5-number summary that includes min value, Q1, median, Q3, and max value, which you can then graph as a boxplot.

Example: Weights of Cowley County Community College Volleyball Players:
131 134 114 188 167 175 180 133
126 130 265 110
a) Find the 5-number summary for the above data
b) Convert this information to a boxplot
c) Check your boxplot with the calculator
d) Perform the outlier test:
   Q1 – 1.5(IQR) = low-end boundary
   Q3 + 1.5(IQR) = high-end boundary

Example: Two different brands of paint were tested to see how long each would last before fading. The following data lists the number of months the paint lasted:
Brand A:  10 60 50 30 40 20    $\bar{x}$ = 35
Brand B:  35 45 30 35 40 25    $\bar{x}$ = 35
The means are the same, but Brand B was more consistent…the data was clustered around the mean more than Brand A.

Brand A     distance from $\bar{x}$ (deviations)     squared deviations
10
60
50
30
40
20
            sum = ________                           sum = ________

Use your calculator to calculate the squared deviations for Brand B.

Since the sum of the deviations is always zero, we can find the last deviation just by figuring out what number makes the group sum to zero. This means all but the last number are free to vary, but the last one must make the group add to zero. If there are n data values, then there are n-1 values that are free to vary. We call this “n-1 degrees of freedom.”

To continue calculating the standard deviation, we divide the sum of the squared deviations by n-1 to find the “average”, which gives us the variance.
Variance for Brand A =
Variance for Brand B =
What we know from these numbers is that Brand A is more spread out. That’s it, nothing else.
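To check the hand calculation above, here is a minimal Python sketch of the same steps (deviations, squared deviations, then dividing by n-1). The function name `sample_variance` is only an illustrative choice, not something from the textbook or the calculator.

```python
def sample_variance(values):
    """Variance with n-1 in the denominator, following the steps in the notes."""
    n = len(values)
    mean = sum(values) / n
    deviations = [x - mean for x in values]      # distance of each value from the mean
    squared = [d ** 2 for d in deviations]       # squaring removes the negatives
    return sum(squared) / (n - 1)                # divide by the degrees of freedom

brand_a = [10, 60, 50, 30, 40, 20]
brand_b = [35, 45, 30, 35, 40, 25]

var_a = sample_variance(brand_a)
var_b = sample_variance(brand_b)
print("Variance for Brand A:", var_a)
print("Variance for Brand B:", var_b)
```

The larger variance for Brand A matches the observation that its values are more spread out around the shared mean of 35.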
s = standard deviation for the data = $\sqrt{\text{variance}}$
Standard deviation for Brand A =
Standard deviation for Brand B =
Standard deviation can be thought of as the “typical distance the data is from the mean”.

Chapter 1: Exploring Data

Key Vocabulary: center, dot plot, range, shape, histogram, time plot, mean, median, resistant, IQR, minimum, standard deviation, spread, skewed left, stemplot, nonresistant, maximum, variance, frequency, skewed right, split stems, outlier, symmetric, back-to-back stemplot, $\bar{x}$, 5-number summary, quartiles Q1, Q3, boxplot, modified boxplot, degrees of freedom

1.1 Displaying Distributions With Graphs
1. Explain “roundoff error”.
2. When is it useful to use a bar chart?
3. When is it useful to use a pie chart?
4. Define “range”.
5. When is it better to use a histogram rather than a dotplot?
6. What is meant by “frequency” in a histogram?
7. Draw examples of a symmetric histogram, skewed right histogram, and skewed left histogram.
8. Explain a split-stem stemplot.
9. Explain a back-to-back stemplot.
10. Explain a modified boxplot.
11. Explain a parallel boxplot.

1.2 Describing Distributions with Numbers
12. In statistics, what is the most common measurement of center?
13. Explain how to calculate the mean, $\bar{x}$.
14. Explain why the median is resistant to extreme observations, but the mean is nonresistant.
15. What does standard deviation measure?
16. What is the five-number summary?
17. What is the relationship between variance and standard deviation?
18. When does standard deviation equal zero?
19. Is standard deviation resistant or nonresistant to extreme observations? Explain.

Write a 20-word summary of the reading assignment that captures the main ideas of the 1st chapter.
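As a companion to the five-number summary and the 1.5(IQR) outlier test from section 1.2, here is a small Python sketch using the volleyball weights from the notes. It assumes quartiles are found as the medians of the lower and upper halves of the sorted data, which is one common textbook convention; other software may use a slightly different quartile rule. The function name `five_number_summary` is illustrative only.

```python
from statistics import median

def five_number_summary(values):
    """Return (min, Q1, median, Q3, max) using the median-of-halves convention."""
    data = sorted(values)
    n = len(data)
    half = n // 2
    lower = data[:half]
    upper = data[half + (n % 2):]   # skip the middle value when n is odd
    return min(data), median(lower), median(data), median(upper), max(data)

weights = [131, 134, 114, 188, 167, 175, 180, 133, 126, 130, 265, 110]

minimum, q1, med, q3, maximum = five_number_summary(weights)
iqr = q3 - q1
low_fence = q1 - 1.5 * iqr          # low-end boundary
high_fence = q3 + 1.5 * iqr         # high-end boundary
outliers = [w for w in weights if w < low_fence or w > high_fence]

print("Five-number summary:", minimum, q1, med, q3, maximum)
print("Boundaries:", low_fence, high_fence)
print("Outliers:", outliers)
```

Any weight outside the two boundaries is flagged, which is the same check as part (d) of the volleyball example.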
Measures of Relative Standing and Density Curves (2.1)

Learning Targets:
  Compute the z-score of an observation given the mean and standard deviation
  Explain what is meant by a standardized value
  Compute the “pth” percentile of an observation
  Explain what is meant by a mathematical model
  Define a density curve
  Explain where the mean and median of a density curve are to be found
  Describe the relative position of the mean and median in a symmetric density curve and in a skewed density curve

Density Curves

What would you say about the distribution of data in this histogram?
Center:
Shape:
Spread:

Since we’ve been estimating values when using a histogram, and we’re really only interested in the shape (most of the time), we can smooth out the tops of the bars.

Mathematical model:
  *
  *
  *

A density curve could be any shape:

Since the density curve is an approximation, the mean and standard deviation of the curve might be different from the actual observed values.
$\bar{x}$ =          $\mu$ =          s =          $\sigma$ =

Measures of Relative Standing: z-scores

Here are the scores of all 25 students in Mr. Pryor’s statistics class on their first test:
79 77 81 83 80 86 77 90 73 79 83 85 74 83 93 89 78 84 80 82 75 77 67 72 73

Jenny scored an 86. How did she perform relative to her classmates?

We could look at the stemplot. Notice it is roughly symmetric with no apparent outliers. What can we conclude from these displays?

We could look at the Minitab output of summary statistics for the test scores.

STANDARDIZING - Z-SCORE:

EXAMPLE: Three landmarks of baseball achievement are Ty Cobb’s batting average of 0.420 in 1911, Ted Williams’s 0.406 in 1941, and George Brett’s 0.390 in 1980. These batting averages cannot be compared directly because the distribution of major league batting averages has changed over the years. The distributions are quite symmetric, except for the outliers such as Cobb, Williams, and Brett.
While the mean batting average has been held roughly constant by rule changes and the balance between hitting and pitching, the standard deviation has dropped over time. Here are the facts:

Decade    Mean     Std. Dev.
1910s     0.266    0.0371
1940s     0.267    0.0326
1970s     0.261    0.0317

Which player stood out most from his peers?
zCobb =
zWilliams =
zBrett =

Measures of Relative Standing: percentiles

In chapter 1, we defined the pth percentile of a distribution as the value with p percent of the observations less than or equal to it. If the distribution is normal, we can use z-scores to calculate percentiles. A z-score of 0 has 50% of the observations below it and 50% of the observations above it. You can use a table of values to calculate percentiles.

TABLE “A” IS IN YOUR TEXTBOOK, AND A COPY OF IT IS IN YOUR BINDER

A z-score of -2.24 has 0.0125 of the area/observations below it.
The z-score corresponding to 1.58% of the area/observations is -2.15.

EXAMPLE: The distribution of heights of young women aged 18 to 24 is approximately normal with mean = 64.5 inches and standard deviation = 2.5 inches.
(a) What percent of girls are shorter than 69.5 inches?
(b) What percent of girls are taller than 67 inches?
(c) What percent of girls are between 59.5 in. and 67 in.?
(d) Taylor Swift is approximately 5’11”. How does her height compare to the other girls?

Normal Distributions (2.2)

Learning Targets:
  Identify the main properties of the Normal curve as a particular density curve.
  List three reasons why Normal distributions are important in statistics
  Explain the 68-95-99.7 rule (the empirical rule)
  Explain the notation $N(\mu, \sigma)$
  Define the standard Normal distribution
  Use a table of values for the standard Normal curve (Table A) to compute the proportion of observations that are (a) less than a given z-score, (b) greater than a given z-score, or (c) between two given z-scores.
  Use a table of values for the standard Normal curve to find the proportion of observations in any region given any Normal distribution (i.e. given raw data rather than z-scores)
  Use a table of values for the standard Normal curve to find a value with a given proportion of observations above or below it (inverse Normal)
  Identify at least two graphical techniques for assessing Normality
  Explain what is meant by a Normal probability plot; use it to help assess the Normality of a given data set
  Use technology to perform Normal distribution calculations and to make Normal probability plots

Normal Curve:

The exact density curve for a particular distribution is described by giving its mean and its standard deviation. The usual notation is: ________________

To sketch a Normal curve, draw a number line and mark the mean. Put the peak of the curve above the mean. Label the number line in intervals of your standard deviation. Always include 3 intervals to the left and 3 intervals to the right.

(Sketch labels: “concave down”, “concave up”)

Practice: Sketch the distribution N(15.8, 1.2)

All Normal distributions obey the following rule:
_____ _ _____ _ ________ _
This rule is sometimes known as the “Empirical Rule”.
_________________ _
_____________ _
_____________ _

Example: The distribution of heights of American men aged 20 – 29 is approximately Normal with mean 70 inches and standard deviation 3 inches.
(a) Draw a Normal curve on which the mean and standard deviation are correctly located.
(b) What percent of men are taller than 69.5 inches?
(c) Between what heights do the middle 95% of men fall?
(d) What percent of men are shorter than 59.5 inches?
(e) A height of 67 inches corresponds to what percentile of adult American men’s heights?

Example: The level of cholesterol in the blood is important because high cholesterol levels may increase the risk of heart disease. The distribution of blood cholesterol levels in a large population of people of the same age and gender is roughly Normal.
For 14-year-old boys, the mean is 170 milligrams of cholesterol per deciliter of blood (mg/dl) and the standard deviation is 30 mg/dl. About 1% of 14-year-old boys have cholesterol high enough to require medical attention. What cholesterol level requires medical attention in 14-year-old boys?
(a) Sketch the distribution and mark the important points on the horizontal axis.
(b1) Use Table A to find the standardized value
(b2) Use InvNorm to find the standardized value
(c) Un-standardize the variable
(d) State your conclusion

Assessing Normality

There are some methods we can use to determine if a distribution might be Normally distributed.
I. Construct a histogram or stemplot. Look for symmetry and an approximate bell shape.
II. Construct a Normal Probability Plot
   a. Arrange data values from smallest to largest (Use L1)
   b. Find corresponding z-scores for each x-value (Enter into L2)
   c. Plot each data point (x-value) against its corresponding z-score. (Scatterplot)
If the result is a fairly straight line, then the data values came from an approximately Normal distribution.

Example: According to USA Today, the 2011 salaries for the Kansas City Royals are as follows. Assess the Normality of this data.

Kansas City Royals
RK    PLAYER               Salary (US$)
1.    Billy Butler         8,500,000
2.    Jeff Francoeur       6,750,000
3.    Alex Gordon          6,000,000
4.    Bruce Chen           4,500,000
5.    Jonathan Broxton     4,000,000
6.    Luke Hochevar        3,510,000
7.    Felipe Paulino       1,900,000
8.    Aaron Crow           1,600,000
9.    Humberto Quintero    1,000,000
10.   Alcides Escobar      1,000,000
RK    PLAYER               Salary (US$)
11.   Chris Getz           967,500
12.   Jose Mijares         925,000
13.   Brayan Pena          875,000
14.   Mitch Maier          865,000
15.   Eric Hosmer          502,500
16.   Greg Holland         497,150
17.   Tim Collins          495,725
18.   Luis Mendoza         488,925
19.   Mike Moustakas       487,250
20.   Kelvin Herrera       480,650

Chapter 2: The Normal Distributions

Key Vocabulary: mu ($\mu$), density curve, inflection point, Normal distribution, standard Normal distribution, z-scores, sigma ($\sigma$), empirical rule, outcomes, percentile, Normal probability plot, simulation, $N(\mu, \sigma)$, Normal curve, standardized value, normalcdf, invNorm

2.1 Density Curves and Measures of Relative Standing:
1. Explain the two ways to describe the position of Jenny’s score on the first statistics test within the distribution of test scores.
2. How do we calculate the percent of observations falling within k standard deviations of the mean?
3. List the 4 steps used to explore data from a single quantitative variable.
4. Describe the relationship between the mean and the median of a skewed distribution.

2.2 Normal Distributions:
5. List the three reasons that normal distributions are important in statistics.
6. Give 3 examples of distributions that are often close to normal.
7. How is a standard normal distribution different from a normal distribution?
8. List the steps used in solving problems involving normal distributions.
9. In what situation do we use Table A backwards?
10. Is there a difference between the 80th percentile and the top 80%? Explain.
11. Explain the basic idea of a Normal probability plot.
12. How does a Normal probability plot indicate that the data are normal?
13. Write 5 – 6 sentences about your progress in AP Statistics so far.
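The Normal calculations in this chapter (standardizing, looking up an area as with Table A or normalcdf, and working backwards as with invNorm) can also be checked in software. The sketch below uses `NormalDist` from Python's standard `statistics` module as a stand-in for the calculator commands; it is not the calculator's own implementation. It reuses two examples from the notes: the heights of young women, N(64.5, 2.5), and the cholesterol levels of 14-year-old boys, N(170, 30).

```python
from statistics import NormalDist

# Heights of young women: approximately Normal, mean 64.5 in., std. dev. 2.5 in.
heights = NormalDist(mu=64.5, sigma=2.5)

# Standardize 69.5 inches: z = (x - mean) / standard deviation
z = (69.5 - 64.5) / 2.5

# Proportion of heights below 69.5 inches (the area Table A gives for z)
p_shorter = heights.cdf(69.5)

# Cholesterol in 14-year-old boys: approximately Normal, mean 170, std. dev. 30 mg/dl.
# Find the level with 99% of the area below it (the top 1% lies above it).
z_cut = NormalDist().inv_cdf(0.99)     # standardized value, like reading Table A backwards
cutoff = 170 + z_cut * 30              # un-standardize: x = mean + z * std. dev.

print("z-score for 69.5 in.:", z)
print("Proportion shorter than 69.5 in.:", round(p_shorter, 4))
print("Cholesterol cutoff (mg/dl):", round(cutoff, 1))
```

The last two lines of the calculation follow the same order as steps (b) and (c) of the cholesterol example: find the standardized value first, then un-standardize it.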
Scatterplots and Correlation (3.1)

Learning Targets:
  Explain the difference between an explanatory variable and a response variable
  Given a set of bivariate data, construct a scatterplot
  Explain how to recognize an outlier in a scatterplot
  Explain what it means for two variables to be positively or negatively associated
  Explain how to add categorical variables to a scatterplot
  Use a TI83 / TI84 to construct a scatterplot
  Define the correlation “r” and describe what it measures
  Explain what is meant by the direction, form, and strength of the overall pattern of a scatterplot.
  Given a set of bivariate data, use technology to compute the correlation “r”
  List the four basic properties of the correlation “r” that you need to know in order to interpret any correlation
  List four other facts about correlation that must be kept in mind when using “r”

Explanatory variables ____________________________________________
Response variables ______________________________________________

EXAMPLE: One effect alcohol has on the body is a drop in body temperature.
Explanatory variable: _____________________   Response variable: __________________

EXAMPLE: Can we predict a state’s average SAT math score if we know the average SAT verbal score?
Explanatory variable: ___________________   Response variable: _____________________

Note: _________________________________________________________________
_________________________________________________________________________

Scatterplots show the relationship between two quantitative variables measured on the same individuals. We plot the explanatory variable on the x-axis, and plot the response variable on the y-axis.

To interpret scatterplots, look for the overall pattern and for striking deviations from that pattern.
To describe a pattern, look at association, form, and strength:
Association: ______________________________________________________
Form: ___________________________________________________________
Strength: ______________________________________________________

What is the direction of these scatterplots? Which scatterplot is stronger?

Note:___________________________________________________________
________________________________________________________________________.

The correlation, r, measures the ___________________________________ of linear associations between quantitative variables.

To calculate r:

$r = \frac{1}{n-1}\sum \left(\frac{x_i - \bar{x}}{s_x}\right)\left(\frac{y_i - \bar{y}}{s_y}\right)$

Notice that $\frac{x_i - \bar{x}}{s_x}$ represents the standardized values of x and $\frac{y_i - \bar{y}}{s_y}$ represents the standardized values of y. This allows us to ___________________________________.

** r is always a value between _________. Strong correlations produce r values close to __________. The closer r is to 0, the ____________________ the linear relationship is.
** Positive r-values indicate _______________; negative r-values indicate a _________________________.
** r has no ___________________________.
** When calculating r, it doesn’t matter which variable is _______________ and which one is ________________.

EXAMPLE: A food industry group asked 3368 people to guess the number of calories in each of several common foods. Here is a table of the average of their guesses and the correct number of calories:

Food                                 Guessed calories   Correct calories
8 oz whole milk                      196                159
5 oz spaghetti with tomato sauce     394                163
5 oz macaroni with cheese            350                269
one slice wheat bread                117                61
one slice white bread                136                76
2 oz candy bar                       364                260
saltine cracker                      74                 12
medium-sized apple                   107                80
medium-sized potato                  160                88
cream-filled snack cake              419                160

We think that how many calories a food actually has helps explain people’s guesses of how many calories it has. With this in mind, make a scatterplot of these data and describe the relationship.
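The recipe for r described above (standardize each x and each y, multiply the pairs, add the products, and divide by n-1) can be sketched in Python as a check on the calculator work. This is only an illustration using the calorie table from the example; the function name `correlation_r` is a made-up label, and `statistics.stdev` is used because it divides by n-1, like the calculator's s.

```python
from statistics import mean, stdev

def correlation_r(xs, ys):
    """Correlation as the sum of products of standardized values, divided by n-1."""
    n = len(xs)
    x_bar, y_bar = mean(xs), mean(ys)
    s_x, s_y = stdev(xs), stdev(ys)            # sample standard deviations (n-1)
    z_x = [(x - x_bar) / s_x for x in xs]      # standardized x-values
    z_y = [(y - y_bar) / s_y for y in ys]      # standardized y-values
    products = [a * b for a, b in zip(z_x, z_y)]
    return sum(products) / (n - 1)

# x = correct calories (explanatory), y = guessed calories (response)
correct = [159, 163, 269, 61, 76, 260, 12, 80, 88, 160]
guessed = [196, 394, 350, 117, 136, 364, 74, 107, 160, 419]

r = correlation_r(correct, guessed)
print("r =", round(r, 3))
```

Swapping the two lists gives the same r, which illustrates that it does not matter which variable is called x and which is called y.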
Calculate the correlation r by hand using the lists on your calculator. Here are the keystrokes:

Let:
  L1 = x-values (type them in)
  L2 = y-values (type them in)
  L3 = $\frac{x_i - \bar{x}}{s_x}$ (find $\bar{x}$ and $s_x$ using 1-var stats on L1)
  L4 = $\frac{y_i - \bar{y}}{s_y}$ (find $\bar{y}$ and $s_y$ using 1-var stats on L2)
  L5 = $\left(\frac{x_i - \bar{x}}{s_x}\right)\left(\frac{y_i - \bar{y}}{s_y}\right)$ (multiply L3 and L4)
Find the sum of L5 using: List (2nd Stat) Math 5: Sum

L1     L2     L3     L4     L5
159    196
163    394
269    350
61     117
76     136
260    364
12     74
80     107
88     160
160    419

Sum of L5 = ______________________

$\frac{1}{n-1}(\text{sum of L5})$ = ___________________ = r

Least Squares Regression (3.2)

Learning Targets:
  Explain what is meant by a regression line
  Given a regression equation, interpret the slope and y-intercept in context
  Explain what is meant by extrapolation
  Explain why the regression line is called the “least squares regression line” (LSRL)
  Explain how the coefficients of the regression equation $\hat{y} = a + bx$ can be found given r, $s_x$, $s_y$, and $(\bar{x}, \bar{y})$
  Given a bivariate data set, use technology to construct a least-squares regression line
  Define a residual
  Given a bivariate data set, use technology to construct a residual plot for a linear regression
  List two things to consider about a residual plot when checking to see if a straight line is a good model for a bivariate data set.
  Explain what is meant by the standard deviation of the residuals
  Define the coefficient of determination, $r^2$, and explain how it is used in determining how well a linear model fits its bivariate set of data.
  List and explain four important facts about least-squares regression

Regression

Since linear relationships between quantitative variables are quite common, it is useful to summarize overall patterns by drawing a line on a scatterplot. This line is called a regression line.
A REGRESSION LINE (or line of best fit) ______________________________________
______________________________________________________________________

Regression lines allow us to make predictions.

Regression lines model ____________ (like density curves in ch. 2) to give us a concise mathematical description of the relationship between the variables.

There are many methods for drawing a line of fit. We need a method of drawing a line of fit that does not depend on guessing where the line should be. In the field of statistics, the model we use is called the Least Squares Regression Line (LSRL).

Since we are using the line to predict y from x, a good line of fit will make the vertical distances of the points from the line as small as possible. The error is calculated by subtracting the predicted y-value from the actual y-value:

error = observed (actual) response - predicted response
Residual = Actual – Predicted (R A P)
The vertical error is called the residual.

The least-squares regression line of y on x is the line that makes the sum of the squared vertical distances of the data points from the line as small as possible.

If our regression line is “perfect”, the points will have the same distance above the line as distance below. So it’s possible to get a total of zero error when we add all of the residuals. For this reason, when discussing residuals, we square the errors to eliminate the negatives (this strategy should sound familiar to you…) and then “undo” the squaring later.

$\hat{y}$ is read “y hat” to emphasize that this is a PREDICTED response for any x.
b is the PREDICTED slope.
Note: When explaining anything about regression lines or regression slopes, be sure to state that these are PREDICTED values.

Example: When purchasing a car, there is a relationship between the price of the car and the age of the car. Find the LSRL, the value of r, and the residuals using the calculator.
Price of a Saturn SL1 from 10 classified car ads in a newspaper

Age (years)    Asking Price ($)
1.0            11875
1.0            10995
2.0            8500
2.0            9995
3.0            8995
4.0            6995
5.0            4450
5.0            5500
6.0            4400
6.0            4800

Solution:
1st  Determine which variable is explanatory and which is response.
     x = _____________________   y = _____________________
2nd  To find r: Under “CATALOG” find “DIAGNOSTIC ON”, then press ENTER
3rd  Enter data into lists L1 and L2, then use STAT CALC 4: a+bx, L1, L2, Y1 (this tells the calculator to use list 1 and list 2 in finding the regression equation, and then place the equation into the Y= screen)
     $\hat{y}$ = ________________________________   r = ________________________
4th  Calculate the residuals for this data:

Price of a Saturn SL1 from 10 classified car ads in a newspaper

L1     L2       L3 = predicted values ($\hat{y}$)     L4 = Residual (actual - predicted)
1.0    11875
1.0    10995
2.0    8500
2.0    9995
3.0    8995
4.0    6995
5.0    4450
5.0    5500
6.0    4400
6.0    4800

Residual plots should have no clear pattern, with points uniformly scattered above and below the line. Make a residual plot in the STAT PLOT using L1 for the X List and L4 for the Y List. When sketching it on your paper, always be sure to label the axes!

(Residual plot axes: vertical axis RESIDUALS, horizontal axis X-VALUES)

Correlation and Regression Wisdom (3.3)

Learning Targets:
  Recall the three limitations on the use of correlation and regression
  Explain what is meant by an outlier in bivariate data
  Explain what is meant by an influential observation and how it relates to regression
  Given a scatterplot in a regression setting, identify outliers and influential observations
  Define a lurking variable
  Give an example of what it means to say “association does not imply causation”
  Explain how correlations based on averages differ from correlations based on individuals

Outliers and Influential Points:

Three limitations on the use of correlation and regression:
1. ______________________________________________________________
2. ______________________________________________________________
3.
______________________________________________________________.

Example: Does the age at which a child begins to talk predict a later score of mental ability? A study of the development of young children recorded the age in months at which each of 21 children spoke their first word and their Gesell Adaptive Score, the result of an aptitude test taken much later. The results appear in the scatter and residual plots below.

Children 18 and 19 are outside the cluster of the rest of the data.
Child 19 is an outlier in the y-direction.
Child 18 is an outlier in the x-direction. It has a strong influence on the position of the regression line.

The graph below adds a 2nd regression line calculated after leaving out child 18. What happens to the LSRL?

This applet will show you how the LSRL changes with the addition of outliers and influential points.
http://statweb.calpoly.edu/chance/applets/LRApplet.html

In the regression setting, not all outliers are influential:
  The LSRL is most likely to be influenced by outliers in the x-direction.
  Influential points often have small residuals, because they pull the regression line toward themselves, causing other points’ residuals to increase.
  If you suspect an influential point, find the equation of the regression both with and without the point in question. If the line moves more than a small amount when the point is deleted, the point is influential.

Lurking Variables:

Another caution about regression and correlation is the effect of a third variable. This third variable may influence the interpretation of relationships between the explanatory and response variables.

EXAMPLE: A study was once completed on the relationship between ice cream sales and snake bite treatments in a local area hospital. The study found a strong positive correlation between these two variables. Why do you think that is?

Do left-handers die early?
A study of 1000 deaths in California found that left-handed people died at an average age of 66 years old, while right-handed people died at an average age of 75 years old. Should left-handers fear an early death?

Coefficient of Determination:

When we find the equation for the LSRL we find a value for “r” (correlation coefficient). We are also given a value for “$r^2$”, which is called the coefficient of determination. $r^2$ tells us the fraction of the variation in one variable that is explained by the least-squares regression on the other variable. Basically, it tells us how well the LSRL does at predicting values of the response variable, y.

EXAMPLE: A study of class attendance and grades among first-year students at a state university showed that, in general, students who attended a higher percent of their classes earned higher grades. The coefficient of determination was $r^2$ = 0.16. Class attendance explained 16% of the variation in grade index among the students.
What else could explain the variation?
What is the numerical value of the correlation between percent of classes attended and grade index?

Chapter 3: Examining Relationships

Key Vocabulary: response variable, explanatory variable, independent variable, dependent variable, “y-hat”, negative association, r-squared, r-value, regression line, mathematical model, least-squares regression line, positive association, SSE, correlation, residuals, residual plot, influential observation, scatterplot, SSM, linear, coefficient of determination

3.1 Scatterplots and Correlation:
1. What is the difference between a response variable and an explanatory variable?
2. How are response and explanatory variables related to dependent and independent variables?
3. When is it appropriate to use a scatterplot to display data?
4. Explain the difference between a positive association and a negative association.
5. Explain what it means to “describe the overall pattern of a scatterplot using direction, form, and strength”.
6.
How do you add categorical variables in a scatterplot?
7. List the 4 cautions about correlation.

3.2 Least-Squares Regression:
8. In what ways is a regression line a mathematical model?
9. What is extrapolation?
10. What is a least-squares regression line?
11. Define residual:
12. If a least-squares regression line (LSRL) fits the data well, what 2 characteristics should the residual plot exhibit?
13. What numerical quantity tells us how well the LSRL will do at predicting?
14. Explain the idea of r-squared.

3.3 Correlation and Regression Wisdom:
15. What three cautions are reviewed in this section?
16. Explain the difference between an influential point and an outlier.
17. Explain how a lurking variable can make correlation or regression misleading.
18. What are the three new cautions of this section?

Relationships between Categorical Variables (4.2)

Learning Targets:
  Explain what is meant by a “two-way table”
  Explain what is meant by “marginal distributions” in a two-way table
  Describe how changing “counts” to “percents” is helpful
  Explain what is meant by a “conditional distribution”
  Define “Simpson’s paradox” and give an example of it

Two-way Tables:

We use a two-way table to display relationships between two or more categorical variables. Some variables are categorical by nature: gender, race, occupation, etc. Other variables become categorical by grouping quantitative variables into classes. A two-way table has a row variable and a column variable. The entries in the table can be counts or percents.

College students by gender and age group, 2003 (thousands of persons)

Age group       Female    Male     Total
15-17 years     89        61       150
18-24 years     5668      4697     10,365
25-34 years     1904      1589     3,494
35+ years       1660      970      2,630
Total           9321      7317     16,639

Identify the row variable and the column variable:

Marginal Distributions:

The distribution of the categorical variables in the table above says how often each outcome occurred.
The “total” column contains the total for each of the rows.
How many college students were ages 18 to 24? __________________
How many college students were ages 15 to 17? ___________________
The “total” row contains the total for each column.
How many female college students were there in 2003? ____________________
Since the totals occur in the margins of the table, they are called marginal distributions.

College students by sex and age group, 2003 (thousands of persons)
Age group        Female     Male     Total
15-17 years          89       61       150
18-24 years        5668     4697    10,365
25-34 years        1904     1589     3,494
35+ years          1660      970     2,630
Total              9321     7317    16,639

The marginal distributions from the above table do not tell us how the two variables are related. The relationship is in the body of the table. To describe relationships among categorical variables, calculate appropriate percents from the given counts. Counts are often hard to compare, but percents can tell a lot about the relationship.
What percent of the 15-17 age group is female? __________________________
What percent of the male students were of age 18-24? ______________________
What percent of college students were females between the ages of 25-34? _______
Each of the above answers is called a conditional distribution, because to find the value, we used a table entry from a given condition.
Do the female and male totals agree with the overall total? __________
We sometimes encounter ROUNDOFF ERROR in two-way tables. The entries above were rounded to the nearest thousand, so when adding the row totals and the column totals, we sometimes find a very small discrepancy. As long as you understand why the error is there, it shouldn’t affect your conclusions.

Simpson’s Paradox:
Lurking variables can also affect categorical variables. They can change or even reverse relationships.
EXAMPLE: Accident victims are sometimes taken by helicopter from the accident scene to a hospital. Helicopters save time, but do they save lives?
Below is a comparison between the counts of accident victims who die with helicopter evacuation and with the usual transport to a hospital by road.

                   Helicopter    Road
Victim died                64     260
Victim survived           136     840
Total                     200    1100

What percent of helicopter patients died? ___________________
What percent of road transported patients died? _____________________
What do these percents suggest? _______________________________________
There is a lurking variable in this example. What is it? ________________________

Here is the same data broken down differently:

Serious Accidents
                   Helicopter    Road
Victim died                48      60
Victim survived            52      40
Total                     100     100

Less Serious Accidents
                   Helicopter    Road
Victim died                16     200
Victim survived            84     800
Total                     100    1000

Check the conditional distributions for death of patients:
Helicopter (serious)/died
Helicopter (less serious)/died
Road (serious)/died
Road (less serious)/died
Do these percents suggest the same conclusion as above?

Establishing Causation (4.3)
Learning Targets:
Identify the three ways in which the association between two variables can be explained
Explain what process provides the best evidence for causation
Define what is meant by a common response
Define what it means to say that two variables are confounded
Discuss why establishing a cause-and-effect relationship through experimentation is not always possible
Explain what it means to say that a lack of evidence for a cause-and-effect relationship does not necessarily mean that there is no cause-and-effect relationship
List 5 criteria for establishing causation when you cannot conduct a controlled experiment

Explaining Association:
When we look at relationships between variables, often we hope to show that changes in the explanatory variable CAUSE changes in the response variable. Remember, a strong association between two variables is not enough to determine a cause-and-effect relationship. There are three ways we can explain the association between two variables:
1.
CAUSATION: There is a direct cause-and-effect relationship between two variables.
2. COMMON RESPONSE: The observed association between two variables is explained by a lurking variable. BOTH variables change due to the lurking variable.
3. CONFOUNDING: The effects of two variables (explanatory variables or lurking variables) on the response are mixed together and cannot be distinguished from each other.

Reminder: A lurking variable is a variable that is not an explanatory or response variable but may still influence the relationships among those variables.
It is important to note that even when direct causation is present, it very seldom explains an association between variables completely.

EXAMPLE: SAT scores
Why is a student’s SAT Math score positively associated with SAT Verbal score?

How do we establish causation?
Carefully designed experiments are the best way, but are not always practical, ethical, or even possible.
What are the criteria for establishing causation when we can’t do an experiment? Look for the following attributes:
1. The association is strong.
2. The association is consistent. Consistency reduces the chances that a lurking variable is present.
3. Larger values of the explanatory variable are associated with stronger responses.
4. The alleged cause precedes the effect in time.
5. The alleged cause is plausible.

EXAMPLE A: Doctors had long observed that most lung cancer patients were smokers. Comparison of smokers and similar non-smokers showed a very strong association between smoking and death from lung cancer. Could the association be due to common response? Might there be, for example, a genetic factor that predisposes people both to nicotine addiction and to lung cancer? If so, smoking and lung cancer would be positively associated even if smoking had no direct effect on the lungs. (Genetics would be a lurking variable affecting both the smoking and the lung cancer.) Or perhaps confounding was to blame.
It might be that smokers live unhealthy lives in other ways, such as an unhealthy diet, too much alcohol, or lack of exercise, which could cause the cancer. In this case, another habit that has nothing to do with smoking might be influencing the lung cancer. We can’t design an experiment where we force some patients to smoke and some not to smoke; this would be unethical. Yet medical authorities do not hesitate to say that smoking causes lung cancer. So, how are they so sure? Let’s look at the five criteria:
1. Strong association: The association between smoking and lung cancer is very strong.
2. Consistent association: Many countries studied the association and found similar results.
3. Higher doses are associated with stronger responses: People who smoke more cigarettes per day, or who smoke over a longer period, develop lung cancer at a higher rate; people who stop smoking reduce their risk.
4. Cause precedes effect: Lung cancer develops after years of smoking. When smoking became more common, lung cancer cases increased about 30 years later.
5. Plausible: Experiments on animals show that tar from cigarette smoke causes cancer in the animal.

Each of the following examples shows that causation is not a simple idea:

CAUTION: Even well-established causal relations may not generalize to other settings.
EXAMPLE B: Experiments have conclusively shown that large amounts of saccharin in the diet cause bladder cancer in rats. Should we avoid saccharin as a sugar substitute?
*
*

CAUTION: Even when direct causation is present, it is rarely a complete explanation of an association between the two variables.
EXAMPLE C: A study of Mexican-American girls aged 9 to 12 years old recorded body mass index (BMI), a measure of weight relative to height, for both the girls and their mothers. The study also measured hours of television viewing, minutes of physical activity, and intake of several kinds of food.
The strongest correlation (r = 0.506) was between the BMI of daughters and the BMI of mothers. Yet the mothers’ BMIs explain only about 26% (r² ≈ 0.256) of the variance among the daughters’ BMIs. What else could explain the variance among the daughters’ BMIs?
*
*

Chapter 4: More about Relationships between Two Variables
Key Vocabulary: Exponential function Power function Linear growth Exponential growth Extrapolation Power law model Simpson’s paradox lurking variables causation confounding common response marginal distributions conditional distributions

4.2 Relationships between Categorical Variables
1. What is used to analyze categorical variables?
2. What is a two-way table?
3. Describe roundoff error.
4. Why are percents used to describe the relationship between categorical variables?
5. Explain the difference between marginal distributions and conditional distributions.
6. Explain Simpson’s Paradox.

4.3 Establishing Causation
1. Define causation and give an example.
2. Explain confounding and give an example.
3. Explain common response and give an example.
4. Give an example of a potential cause-and-effect situation that cannot be verified by the use of an experiment.
5. Comment on your performance in the class so far. Is there anything you can do to be more successful? Is there anything your instructor can do to help you be more successful?

Designing Samples (5.1)
Learning Targets:
Define: Population, Sample, Biased, Simple Random Sample (SRS), Systematic Random Sampling, Probability Sample, Cluster Sample, Undercoverage, Nonresponse
Explain: Voluntary Response Sample, Sampling vs. Census, Convenience Sampling, Observational study vs. Experiment
Give examples of: Voluntary Response Sample, Response Bias
Determine: 4 steps involved in choosing an SRS, Strata of interest, Major advantage of large random samples

Introduction:
The chapters we have studied taught us how to analyze data by looking at patterns and departures from patterns. Now, we will see how to produce the data.
Data can be gathered from observational studies and experiments.
Observational Study: _____________________________________________
_________________________________________________________________
Experiment: _____________________________________________________
_________________________________________________________________

EXAMPLES: Which technique (observational study or experiment) for gathering data do you think was used in the following studies?
1. The Colorado Division of Wildlife netted and released 774 fish at Quincy Reservoir. There were about 219 perch, 315 blue gill, 83 pike, and 157 rainbow trout.
2. The Colorado Division of Wildlife caught 41 bighorn sheep on Mt. Evans and gave each one an injection to prevent heartworm. A year later, 38 of these sheep did not have heartworm, while the other 3 did.
3. The Colorado Division of Wildlife imposed special fishing regulations on the Deckers section of the South Platte River. All trout shorter than 15 inches had to be released. A study of trout lengths before and after the regulation showed that the average length of a trout increased by 4.2 inches after the regulation went into effect.
4. An ecology class used binoculars to watch 23 turtles at Lowell Ponds. It was found that 18 were box turtles and 5 were snapping turtles.

We usually want to gather information about a large group of individuals. We use certain symbols when discussing the data.
Population: ________________________________________________________
Population mean:
Population standard deviation:
Sample: __________________________________________________________
Sample mean:
Sample standard deviation:

EXAMPLES: For each of the following sampling situations, identify the population and the sample as exactly as possible.
1. Each week, the Gallup Poll questions a sample of about 1500 adult U.S. residents to determine national opinion on a wide variety of issues.
2.
The 2000 Census tried to gather basic information from every household in the United States. A “long form” requesting much additional information was sent to a sample of about 17% of households.
3. A machinery manufacturer purchases voltage regulators from a supplier. There are reports that variation in the output voltage of the regulators is affecting the performance of the finished products. To assess the quality of the supplier’s production, the manufacturer sends a sample of 5 regulators from the last shipment to a laboratory for study.

There are two distinct ways of gathering data:
Sampling: _______________________________________________________
Census: _________________________________________________________

Methods of Sampling:
Voluntary Response Sampling: _________________________________________
_________________________________________________________________
Convenience Sampling: _______________________________________________
_________________________________________________________________
Simple Random Samples (SRS): _______________________________________
_________________________________________________________________
Systematic Random Sampling: _________________________________________
_________________________________________________________________
Stratified Random Sample: ___________________________________________
_________________________________________________________________
_________________________________________________________________
Cluster Sampling: __________________________________________________
_________________________________________________________________
_________________________________________________________________

An SRS is usually the best sampling method, since it does the most to reduce bias. The simplest way to select an SRS is to put all names in a hat and draw names. Another method of selecting an SRS is the use of a Random Digit Table (Table B in the back of the book).
We will also use the random digit table in chapter 6 to conduct a simulation.

EXAMPLE: Suppose we are interested in the number of students in all of the Blue Valley high schools who have ever spent more than a week in another country. In the interest of time, we choose to select a sample.
1. How could we choose our sample using the Voluntary Response Sampling method?
2. How could we choose our sample using the Convenience Sampling method?
3. How could we choose our sample using the Systematic Random Sampling method?
4. How could we choose our sample using the Cluster Sampling method?
5. How could we choose our sample using the Simple Random Sampling (SRS) method?

Cautions about Sample Surveys
Random selection eliminates bias in the choice of a sample. When the population consists of human beings, however, it takes more than a good sampling method to get accurate information.
Undercoverage: ____________________________________________________
There is rarely a list of every possible participant available in a population.
household sample survey –
telephone opinion polls –
Nonresponse: ______________________________________________________
Response Bias: ____________________________________________________

Inference about the Population
Using chance to select a sample eliminates bias in the actual selection of the sample, but it is unlikely that results from a sample are exactly the same as for the entire population. Later in the course, we will discuss how close we need to be in order to generalize our sample results to the entire population. If we select two samples at random from the same population, we will draw different individuals, so the sample results will almost certainly differ somewhat. We can improve our results by remembering that larger random samples give more accurate results than smaller samples.
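The claim that larger random samples give more accurate results can be checked with a short simulation. The Python sketch below is only an illustration: it assumes a hypothetical population of 10,000 students in which 30% have spent more than a week in another country (that figure, the seed, and the function names are made up for the example). It draws many simple random samples of two different sizes and compares how much the sample proportions vary from sample to sample.

```python
import random
import statistics


def sample_proportions(population, n, repetitions, rng):
    """Draw `repetitions` simple random samples of size n.

    Returns the proportion of 1s found in each sample.
    """
    proportions = []
    for _ in range(repetitions):
        # rng.sample draws without replacement, so every set of n
        # individuals is equally likely to be chosen (an SRS).
        srs = rng.sample(population, n)
        proportions.append(sum(srs) / n)
    return proportions


def spread_by_sample_size(sizes=(25, 400), repetitions=500, seed=1):
    """Standard deviation of the sample proportions for each sample size."""
    rng = random.Random(seed)
    # Hypothetical population: 1 = spent more than a week abroad, 0 = did not.
    population = [1] * 3000 + [0] * 7000
    return {
        n: statistics.stdev(sample_proportions(population, n, repetitions, rng))
        for n in sizes
    }


if __name__ == "__main__":
    for n, sd in spread_by_sample_size().items():
        print(f"n = {n}: spread of sample proportions = {sd:.3f}")
```

The sample proportions from the larger samples cluster much more tightly around the population value, which is what "more accurate" means here. A larger sample does not fix bias from a poor sampling method.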
Designing Experiments (5.2)
Learning Targets:
Define experimental units, subjects, treatment, factor, level, placebo effect, replication, randomization, completely randomized design, block, matched pairs design, and double blind
Given a number of factors and the number of levels for each factor, determine the number of treatments
Explain the major advantage of an experiment over an observational study
Explain the difference between control and a control group, and explain the purpose of each
Use the random-digit table to assign individuals to a treatment group or a control group
List the 3 main principles of experimental design
Explain the phrase “statistically significant”
Generate an outline, followed by an explanation, of a completely randomized design for an experiment

Experimental Units, Subjects, and Treatments
A study is an experiment when we actually do something to individuals in order to observe the response. What we do is called the treatment.
The explanatory variables in an experiment are often called ___________________.
Many experiments study the joint effects of several ________________. In such an experiment, each treatment is formed by combining a specific value of each of the _____________, called a ____________________.

EXAMPLE: An experiment investigating the effects of repeated exposure to an advertising message used undergraduate students as subjects. All subjects viewed a 40-minute television program that included ads for a digital camera. Some subjects saw a 30-second commercial; others saw a 90-second version. The same commercial was shown 1, 3, or 5 times during the program.
This experiment has 2 factors: _________________________________________
The first factor has 2 levels: ___________________________________ and the second factor has 3 levels:__________________________________.
Combining one level of each factor creates ___________ treatments:

Basic Principles of Statistical Design:
The FIRST basic principle of statistical design of experiments is Control:
Some laboratory experiments can get away with having a design as simple as:
Treatment → Observe Results
For example, we may place a heavy object on a support (treatment) and measure how much it bends (observation). A controlled environment is protection from lurking variables. When experiments are done in the field or with living subjects, we cannot determine if the response is from the treatment or a _______________________.

EXAMPLE: “Gastric freezing” is a clever treatment for ulcers in the upper intestine. The patient swallows a deflated balloon with tubes attached, then a refrigerated liquid is pumped through the balloon for an hour. The idea is that cooling the stomach will reduce its production of acid and so relieve ulcers. An experiment reported in the Journal of the American Medical Association showed that gastric freezing did reduce acid production and relieve ulcer pain. The treatment was safe and easy, and was widely used for several years. The design of the experiment was:
Gastric freezing → Observe pain relief
The experiment was poorly designed. The patients’ response may have been due to the placebo effect. A placebo is a dummy treatment. Many patients respond favorably to any treatment, even a placebo. This may be due to trust in the doctor and expectations of a cure, or simply to the fact that medical conditions often improve without treatment.
A later experiment divided ulcer patients into two groups. One group was treated by gastric freezing as before. The other group received a placebo treatment in which the liquid in the balloon was at body temperature rather than freezing. The results: 34% of the patients in the treatment group improved, but so did 38% of the patients in the placebo group.
This and other properly designed experiments showed that gastric freezing was no better than a placebo, and its use was abandoned. Explain the situation of “confounding” in the previous example: We can defeat confounding by __________________________________________. The placebo effect and other lurking variables now operate on both groups. The only difference between the groups is ________________________________________. A control group is ___________________________________________________. Caution: Don’t confuse the terms “control” and “control group” (see page 357) The SECOND basic principle of statistical design of experiments is Replication. If the 2nd gastric-freezing experiment was performed on one person in each group, would we be able to conclude the ineffectiveness of gastric-freezing? Replication (several subjects in each treatment group) promotes similar responses within each treatment group, but different from other treatment groups. The THIRD basic principle of statistical design of experiments is Randomization. Comparison of the effects of several treatments is valid only when all treatments are applied to similar groups of experimental units. What would have happened if the control group in the second gastric-freezing experiment consisted of patients whose ulcers were 10 times worse than the ulcers of the patients in the treatment group? We must rely on chance to assign subjects to the groups. This way, subjects with similar characteristics are less likely to be grouped together and confounding is minimized. EXAMPLE: Does talking on a hands-free cell phone distract drivers? Undergraduate students “drove” in a high-fidelity driving simulator equipped with a hands-free cell phone. The car ahead brakes: how quickly does the subject respond? Twenty students (control group) simply drove. Another 20 (the experimental group) talked on the cell phone while driving. How many factors does this experiment have? _________ How many levels? 
________
The researchers needed to divide the 40 student subjects into two groups of 20. One completely unbiased method is to put the names of the 40 students in a hat, mix them up, and draw 20. These students would make up the control group and the remaining 20 make up the experimental group (or vice-versa).
The logic behind the randomized comparative design is as follows:
Randomization produces two groups of subjects that we expect to be similar in all respects before the treatments are applied.
Comparative design helps ensure that influences other than the cell phone operate equally on both groups.
Therefore, differences in average brake reaction time must be due either to talking on the cell phone or to the play of chance in the random assignment of subjects to the two groups.

When describing your experimental design, it is usually helpful to begin with an outline or diagram of your experiment. Here is the design of the cell phone experiment:

We would then follow this design with an explanation, in paragraph form, of how we choose to randomly allocate subjects, implement treatments, and do the comparisons. The number of treatments and method of allocation can make our diagram more complicated. Be sure to include everything you are doing to minimize confounding in your experiment.

EXAMPLE: Does regularly taking aspirin help protect people against heart attacks? The Physicians’ Health Study was a medical experiment that helped answer this question. In fact, the Physicians’ Health Study looked at the effects of two drugs: aspirin and beta-carotene. The body converts beta-carotene into vitamin A, which may help prevent some forms of cancer. The subjects were 21,996 male physicians. There were two factors, each having two levels: aspirin (yes or no) and beta-carotene (yes or no). These factors and levels combined to form four treatments. One-fourth of the subjects were assigned to each of these treatments.
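How two factors with two levels each combine into four treatments, and how subjects can be divided evenly among them by chance, can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the subject ID numbers, the seed, and the function names are invented here, and these notes do not describe how the study itself carried out its randomization.

```python
import itertools
import random


def build_treatments(factors):
    """Form every treatment: one level chosen from each factor.

    `factors` maps a factor name to its list of levels.
    """
    names = list(factors)
    return [
        dict(zip(names, combination))
        for combination in itertools.product(*factors.values())
    ]


def assign_completely_at_random(subjects, treatments, seed=None):
    """Shuffle the subjects, then deal them out to the treatments in turn,
    so group sizes are as equal as possible."""
    rng = random.Random(seed)
    shuffled = list(subjects)
    rng.shuffle(shuffled)
    groups = {index: [] for index in range(len(treatments))}
    for position, subject in enumerate(shuffled):
        groups[position % len(treatments)].append(subject)
    return groups


if __name__ == "__main__":
    treatments = build_treatments(
        {"aspirin": ["yes", "no"], "beta-carotene": ["yes", "no"]}
    )
    # 20 hypothetical subject ID numbers stand in for the physicians.
    groups = assign_completely_at_random(range(1, 21), treatments, seed=7)
    for index, members in groups.items():
        print(treatments[index], sorted(members))
```

The number of treatments is the product of the numbers of levels (here 2 × 2 = 4), and shuffling before dealing is the software version of drawing names from a hat.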
On odd-numbered days, the subjects took either a white tablet that contained aspirin or a dummy pill that looked and tasted like the aspirin but had no active ingredient. This tablet is called a ______________________. On even-numbered days they took either a blue capsule containing beta-carotene or a dummy pill mimicking the beta-carotene. Draw a diagram of this experiment: EXAMPLE (continued) There were several response variables – the study looked for heart attacks, several kinds of cancer, and other medical outcomes. After several years, it did not appear that beta-carotene had any effect, but approximately 24% of the placebo group had suffered a heart attack, while almost 14% of the aspirin group suffered heart attacks. This difference is large enough to support the claim that taking aspirin does reduce heart attacks. We can conclude causation since the three basic principles of experimental design were followed. We hope to see a large enough difference in the responses that it is unlikely to have happened by chance. We will discuss the requirements of difference in responses during 2nd semester and will be making decisions on whether or not the results are statistically significant. Results are statistically significant if ___________________________________ _________________________________________________________________ Other Randomized Comparative Experiments: There are different designs that we can base our experiments on, depending on the circumstances. BLOCK DESIGN A block is a group of experimental units or subjects that are known to be _______________ before the experiment begins. The experimenter believes that this particular characteristic plays an important role in the outcome. In a block design, the random assignment to treatments is carried out separately within each block. Blocks are a form of control, since the characteristic to be blocked may act as a lurking variable. 
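Random assignment carried out separately within each block can be sketched as follows. The subject labels, block labels, treatment names, and function name in this Python sketch are all hypothetical; it only illustrates the idea of randomizing inside each block rather than across the whole group of subjects.

```python
import random
from collections import defaultdict


def block_randomize(subjects, block_of, treatments, seed=None):
    """Randomly assign treatments separately inside each block.

    subjects   : list of subject identifiers
    block_of   : dict mapping each subject to its block label
                 (a characteristic known before the experiment begins)
    treatments : list of treatment labels
    Returns a dict mapping each subject to a treatment.
    """
    rng = random.Random(seed)

    # Step 1: sort the subjects into their blocks.
    blocks = defaultdict(list)
    for subject in subjects:
        blocks[block_of[subject]].append(subject)

    # Step 2: within each block, let chance decide the order,
    # then deal the treatments out in turn.
    assignment = {}
    for members in blocks.values():
        rng.shuffle(members)
        for position, subject in enumerate(members):
            assignment[subject] = treatments[position % len(treatments)]
    return assignment


if __name__ == "__main__":
    subjects = [f"S{i}" for i in range(12)]
    block_of = {s: ("block 1" if i < 6 else "block 2") for i, s in enumerate(subjects)}
    result = block_randomize(subjects, block_of, ["T1", "T2", "T3"], seed=3)
    for subject in subjects:
        print(subject, block_of[subject], result[subject])
```

Because the shuffle happens inside each block, every treatment group receives the same number of subjects from each block, so the blocking characteristic cannot be confounded with the treatments.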
EXAMPLE: The progress of a certain type of cancer differs in men and women. A clinical experiment will compare three therapies for this type of cancer. How can we control the difference in progression between men and women?
A diagram for this design would look like this:

MATCHED PAIRS DESIGN
A matched pairs design compares just two treatments, but the subjects are matched in pairs to avoid lurking variables, or the same subjects are used for each treatment.
EXAMPLE: In the hands-free cell phone experiment, it is possible that everyone in the no-phone group was a good driver, while everyone in the cell phone group was a bad driver. We can address this issue by creating a matched pairs experiment. Each subject would drive with the cell phone and without the cell phone. The randomization comes in when deciding which 20 will drive non-phone first and with-phone second. The remaining 20 will drive with-phone first and non-phone second.
A diagram for this design would look like this:

BLIND / DOUBLE-BLIND
Blinding refers to withholding knowledge of which treatment group a subject is in. Single-blind studies do not inform the subject of which treatment he/she is receiving. Double-blind studies keep the treatment group a secret from the evaluator as well. Why would this be necessary?

Lack of Realism
One very serious weakness of experiments is lack of realism. If the situation does not realistically duplicate the conditions we want, we will not be able to come to a conclusion. Using a placebo for something with a familiar taste can tip subjects off that it’s a placebo if it lacks the familiar taste. Just knowing they are participating in an experiment may affect the responses of your subjects. And the environments in different laboratories can alter the results when trying to achieve replication.
Caution – Statistical analysis of an experiment cannot tell us how far the results will generalize to other settings.
[Grid: field plots numbered 1 to 100]

C. Method Number Three: Stratified Sample
Consider the field as grouped in vertical columns (called strata). Using your calculator or a random number table, randomly choose one plot from each vertical column and mark these plots on the grid.
[Grid: columns A to J, rows 1 to 10]

D. Method Number Four: Stratified Sample
Consider the field as grouped in horizontal rows (also called strata). Using your calculator or a random number table, randomly choose one plot from each horizontal row and mark these plots on the grid.
[Grid: columns A to J, rows 1 to 10]

Observations:
1) Compare the class boxplots of the sample means obtained from the SRS and the two methods of stratified sampling.
2) Based on the results of both activities, under what conditions is it more useful to use stratified sampling?
3) Based on the results of both activities, under what conditions is it more useful to use a simple random sample?

Chapter 5: Producing Data
Key Vocabulary: Voluntary response Sample Confounded Population Design Convenience sampling Statistically significant biased response bias simple random sample systematic random sample stratified random sample observational study strata experimental units undercoverage subjects nonresponse treatment double-blind block design factor level placebo effect control group randomization completely randomized experiment matched-pairs design

5.1 – Designing Samples
1. Describe the relationship between sample, population, sampling, and a census.
2. Discuss two sampling methods that may not provide reliable results. Give an example for each.
3.
Discuss the sampling methods mentioned in your book that do produce reliable results (assuming they are designed and conducted appropriately). Give an example for each.
4. What is bias?
5. Why is a simple random sample usually the best sampling method?

5.2 – Designing Experiments
6. Describe the difference between “control” and “control group”.
7. Summarize the basic principles of experimental design: control, replication, and randomization. What is the purpose or goal of designing experiments using these principles?
8. What does statistically significant mean?
9. Describe the experimental concept of “blinding” and “double-blinding.” Why is this necessary?

Simulation (6.1)
Learning Targets:
List 3 methods that can be used to calculate or estimate the chances of an event occurring
Define “simulation”
List the 5 steps involved in a simulation
Explain what is meant by “independent trials”
Use a random-digit table to carry out a simulation
Given a probability problem, conduct a simulation in order to estimate the probability desired
Use a calculator to conduct a simulation of a probability problem

Introduction:
Toss a coin 10 times. How could we determine the likelihood of a run of 3 or more consecutive heads or consecutive tails?
An airline knows from past experience that a certain percent of customers who have purchased tickets will not show up to board the airplane. If the airline overbooks a particular flight, what are the chances that the airline will encounter more ticketed passengers than they have seats for?
There are 3 methods we can use to answer questions like these involving chance.
1. Observe the random phenomenon many times and calculate the frequency of the results.
2. Develop a probability model to calculate a theoretical answer.
3.
Simulate the random phenomenon by repeating a procedure and calculating the results.
Option 1 is not the best choice because ___________________________________
Option 2 may not be feasible because ____________________________________
Option 3 is our best choice because ______________________________________

Simulation:
A simulation imitates a random phenomenon using a model that accurately reflects it. When we simulate an experiment, we are looking for an estimated probability to use for predicting future outcomes.

ACTIVITY A: Simulating a coin toss
Calculator: APPS, ProbSim, 1 (toss coin)
We would expect 50% of the tosses to be heads and 50% to be tails. Does this happen? Simulate 80 tosses and record the number of heads tossed. The more repetitions we simulate, the more closely our results will resemble the true probability.
CLASS RESULTS: Number of “heads” tossed
[Blank plot for class results: vertical axis FREQUENCY, horizontal axis NUMBER OF “HEADS” TOSSED]

Simulation Steps:
Step 1: State the problem or describe the random phenomenon: “We are investigating …”
Step 2: State the assumptions: “We are assuming …”
Step 3: Assign digits to represent the outcomes: “Let digits 0, 1, 2 represent … while digits 3 – 9 represent …”
Step 4: Simulate many repetitions. Show the result of each trial.
Step 5: State your conclusions: “It appears the probability of ____________ is approximately ________”

ACTIVITY B: Suppose we were interested in the likelihood of a “run” of 3 heads or 3 tails in 10 tosses of a coin. We would have to keep track of this manually if we used the probability simulator. But heads and tails are equally likely, and on the random-digit table each of the digits 0 through 9 is equally likely, so we can use the random-digit table to perform the simulation.
STEP 1: We are investigating __________________________________________
STEP 2: We are assuming _____________________________________________
We could assign ODD digits to heads and EVEN digits to tails, or vice-versa.
Or we could assign digits 0 – 4 to heads and digits 5 – 9 to tails. There are many ways we can make the assignment of the digits AS LONG AS THE PROBABILITIES MATCH UP.
STEP 3: Let digits _______ represent _______________ while digits _________ represent _____________________
STEP 4: Begin with line 101 and find the likeliness of a run of 3 heads or 3 tails in a total of 10 tosses:
LINE 101
TRIAL 1: 19223 95034
TRIAL 2: 05756 28713
TRIAL 3: 96409 12531
From these 3 trials, what is the probability? ___________ Is this reasonable? _____
Continue with line 101 (wrapping around to lines 102, 103, 104, and so on) and perform 22 more trials, for a total of 25 trials:
42544 82853 73676 47150
99400 01927 27754 42648
82425 36290 45467 71709
77558 00095 32863 29485
82226 90056 52711 38889
93074 60227 40011 85848
48767 52573 95592 94007
69971 91481 60779 53791
17297 59335 68417 35013
15529 72765 85089 57067
50211 47487 82739 57890
How many of the 25 trials had a run of 3? ________ What is the likeliness of getting a run of 3 heads or 3 tails in 10 tosses? __________
STEP 5: It appears _____________________________________________

ACTIVITY C: Suppose that 80% of a university’s students favor abolishing evening exams. You ask 10 students chosen at random. What is the likelihood that all 10 favor abolishing exams?
STEP 1:
STEP 2:
STEP 3: Assign digits _________ to represent “yes” and digits ___________ to represent “no”
STEP 4: Simulate 25 repetitions (of 10 responses) beginning at line 129. Record just the results of each trial as “Y” or “N”
TRIAL 1 TRIAL 2 TRIAL 3 TRIAL 4 TRIAL 5
TRIAL 6 TRIAL 7 TRIAL 8 TRIAL 9 TRIAL 10
TRIAL 11 TRIAL 12 TRIAL 13 TRIAL 14 TRIAL 15
TRIAL 16 TRIAL 17 TRIAL 18 TRIAL 19 TRIAL 20
TRIAL 21 TRIAL 22 TRIAL 23 TRIAL 24 TRIAL 25
STEP 5:

ACTIVITY D: Explain why each of the following simulations fails to model the real situation properly:
1. Use a random integer from 0 through 9 to represent the number of heads that appear when 9 coins are tossed
2.
A basketball player takes a foul shot. Look at a random digit, using an odd digit to represent a good shot and an even digit to represent a miss.
3. Simulate a baseball player’s performance at bat by letting 0 = out, 1 = single, 2 = double, 3 = triple, 4 = home run.
4. The preferences for Dairy Queen Treats at a local franchise are as follows: Peanut Buster Parfait = 38%, Banana Split = 42%, Dilly Bar = 14%, Other = 6%. Simulate the daily orders by assigning digits 0 – 3 to Peanut Buster Parfait, 4 – 7 to Banana Split, 8 = Dilly Bar, and 9 = Other.
5. McDonald’s claims to have 60% of their business generated by drive-through orders. Simulate a particular McDonald’s weekly orders by assigning even numbers to represent drive-through orders, and odd numbers to represent eat-in orders.

ACTIVITY E: The owner of a bakery knows the daily demand for a highly perishable cheesecake is as follows:
Number of cheesecakes sold per day: 0 1 2 3 4 5
Rel. freq.: 0.05 0.15 0.25 0.25 0.20 0.10
1. Use simulation to find the demand for the cheesecake on 30 consecutive business days.
a) Assign digits to each cheesecake number:
no cheesecakes = ____________________ 1 cheesecake = ___________________
2 cheesecakes = _____________________ 3 cheesecakes = __________________
4 cheesecakes = _____________________ 5 cheesecakes = __________________
b) Perform the simulation for 30 days: (Use line 135 on the random-digit table.)
day: 1 2 3 4 5 6 7 8 9 10
# cakes:
day: 11 12 13 14 15 16 17 18 19 20
# cakes:
day: 21 22 23 24 25 26 27 28 29 30
# cakes:
c) Summarize your findings:
# days of no cheesecakes =
# days of 1 cheesecake =
# days of 2 cheesecakes =
# days of 3 cheesecakes =
# days of 4 cheesecakes =
# days of 5 cheesecakes =
2. Suppose that it costs the baker $5 to produce a cheesecake, and that the unused cheesecakes must be discarded at the end of the business day. Suppose also that the selling price of a cheesecake is $13.
Use the simulation from above to estimate the number of cheesecakes that he should produce each day in order to maximize his profit.
If he bakes 0 cheesecakes each of the 30 days: amount in sales: ________ cost for production: ________ what is his net profit? _________
If he bakes 1 cheesecake each of the 30 days: amount in sales: ________ cost for production: ________ what is his net profit? _________
If he bakes 2 cheesecakes each of the 30 days: amount in sales: ________ cost for production: ________ what is his net profit? _________
If he bakes 3 cheesecakes each of the 30 days: amount in sales: ________ cost for production: ________ what is his net profit? _________
If he bakes 4 cheesecakes each of the 30 days: amount in sales: ________ cost for production: ________ what is his net profit? _________
If he bakes 5 cheesecakes each of the 30 days: amount in sales: ________ cost for production: ________ what is his net profit? _________
MAXIMUM PROFIT OCCURS WHEN _________________

Probability Models (6.2)
Learning Targets:
Explain how the behavior of a chance event differs in the short and long run
Explain what is meant by a “random phenomenon”
Explain the idea of probability being “empirical”
Define “probability” in terms of relative frequency
Define: sample space, event, equally likely outcomes, and independent events
Explain what is meant by a “probability model”
Construct a tree diagram
Use the multiplication principle to determine the number of outcomes in a sample space
Explain sampling with / without replacement
List the 4 rules that must be true for any assignment of probabilities
Explain what is meant by $A \cup B$ and $A \cap B$
Explain what is meant by each of the regions in a Venn diagram
Give an example of 2 events, A and B, where $A \cap B = \emptyset$
Use a Venn diagram to illustrate the intersection of two events A and B
Compute the probability of an event given the probabilities of the outcomes that make up the event
Compute the probability of an event in the special case of
equally likely outcomes

Language of Probability:
A phenomenon is random if: a) _____________________________________________ and b) ____________________________________________
The probability of any outcome of a random phenomenon is ____________________________________ ________________________________________________________________________
The idea of probability is empirical, or ___________________________________________

Probability Models: Think about the event of tossing a coin. When we toss a coin, we cannot know the outcome in advance. The description of a coin toss has two parts:
1) A list of all possible outcomes, which is called the ___________________________. (each individual outcome is called an _______________)
2) A probability for each outcome.
If the above two parts are included in a mathematical description of a random phenomenon, then it is called a ____________________________________.
EXAMPLE: List the sample space, S, for the phenomenon and find the probability of each outcome in the sample space:
a) a single coin toss S = _____________ P = ___________
b) a roll of a die S = __________________ P = ___________
c) toss a coin and roll a die S = _________________________________________________ P = _____________
Being able to properly list all outcomes in a sample space will be critical to determining probabilities. There are three helpful techniques to make sure you don’t accidentally overlook any outcomes:
TREE DIAGRAM
MULTIPLICATION PRINCIPLE
MAKE AN ORGANIZED LIST
TREE DIAGRAM: Make a tree diagram for tossing a coin and rolling a die
MULTIPLICATION PRINCIPLE: If you can do one task in $n_1$ ways and a second task in $n_2$ ways, then both tasks can be done in $n_1 \times n_2$ ways.
tossing a coin: $n_1$ = _________ rolling a die: $n_2$ = __________ total outcomes = _______
MAKE AN ORGANIZED LIST: Record the results of each of 4 tosses of a coin in order
EXAMPLE: How many 3-digit numbers can you make?
(a) if no digits are repeated (called SAMPLING WITHOUT REPLACEMENT): ________________
(b) if digits may be repeated (called SAMPLING WITH REPLACEMENT): ___________________

Probability Rules:
1. Any probability is a number between 0 and 1. Symbolically: ___________________________
2. The sum of the probabilities of all possible outcomes must be 1. Symbolically: ____________
3. Probability of an event not occurring (called the _______________________ of the event) is 1 minus the probability that it does occur. Symbolically: _________________________
4. If 2 events have no outcomes in common (called ___________________ events or ________________________ events), then the probability that one or the other occurs is the sum of their individual probabilities. Symbolically: ______________________________

Probability Notation: We use set notation to describe events:
The event $A \cup B$ is read _________________________________ and is the set of ___________________________________________________________. It is another way to say ______________.
The event $A \cap B$ is read ___________________________ and is the set of ________________________________________________________. It is another way to say ______________.
The symbol $\emptyset$ is used for __________________, which is the event that _____________________________________.

Assigning Probabilities: We can use a VENN DIAGRAM to help answer questions about probability. For example, suppose that a certain high school has 90 students enrolled in AP Statistics, of which 16 are juniors. If the junior class has 390 students and the high school has 1400 students enrolled, what is the probability that a randomly selected student is a junior not taking AP Statistics?
In our Venn diagram, we should have overlapping circles (why?) and a border around the circles to enclose space not in either circle:
[Venn diagram: two overlapping circles labeled AP Stats and Juniors inside a border labeled High School]
We fill in the numbers given in the information and answer the question.
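One way to check the numbers you place in the Venn diagram is to compute each region from the counts in the example. The short Python sketch below is only an illustration; the variable names are made up for this check and are not part of the notes.

```python
# Counts stated in the AP Statistics / juniors example
total_students = 1400
juniors = 390
ap_stats = 90
juniors_in_ap_stats = 16  # the overlap of the two circles

# Each region of the Venn diagram
juniors_only = juniors - juniors_in_ap_stats      # juniors not taking AP Statistics
ap_stats_only = ap_stats - juniors_in_ap_stats    # AP Statistics students who are not juniors
neither = total_students - (juniors_only + ap_stats_only + juniors_in_ap_stats)

# Probability that a randomly selected student is a junior not taking AP Statistics
p_junior_not_ap = juniors_only / total_students

print("juniors only:", juniors_only)
print("AP Stats only:", ap_stats_only)
print("neither:", neither)
print("P(junior and not AP Stats) =", round(p_junior_not_ap, 4))
```

The four regions must add up to the 1400 students enrolled, which is a quick check that nothing was counted twice.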
- FINITE NUMBER OF OUTCOMES – If we have a finite number of outcomes we can use a table to help calculate probabilities.
EXAMPLE: Faked numbers in tax returns, payment records, invoices, expense account claims, and many other settings often display patterns that aren’t present in legitimate records. Some patterns, like too many round numbers, are obvious and easily avoided by a clever crook. Others are more subtle. It is a striking fact that the first digits of numbers in legitimate records often follow a distribution known as Benford’s law. Here it is:
First digit: 1 2 3 4 5 6 7 8 9
Probability: 0.301 0.176 0.125 0.097 0.079 0.067 0.058 0.051 0.046
Consider the events A = {first digit is 1}, B = {first digit is 6 or greater}, C = {first digit is odd}
a) Find P(A) and P(B)
b) Find $P(A \cup B)$
c) Find the probability that a first digit is anything other than 1
d) Find P(C)
e) Find P(B or C)
- EQUALLY LIKELY OUTCOMES – Assigning correct probabilities to individual outcomes often requires long observation of the random phenomenon. But sometimes we are willing to assume equal likeliness because of a balance in the phenomenon.
EXAMPLE: You roll a die. Find the probability that you roll a 2 or higher.
Event (roll): 1 2 3 4 5 6
Probability:

Independence and the Multiplication Rule: We can find the probability that two events occur at the same time if the events are independent. Independent events are events in which the outcome of one event does not affect the outcome of the other. Disjoint events with nonzero probabilities cannot be independent: if A and B are disjoint, and if A occurs, then B cannot occur.
In a toss of two coins, what is the probability that both coins land on heads?
Sample Space = _________________ There are _________ possible outcomes. The combination H H is one of the possible outcomes. The probability of H H occurring is __________. We can obtain this without writing out the sample space.
The Multiplication Rule tells us: If A and B are independent events, then P(A and B) = P(A)P(B)
EXAMPLES:
1.
You roll two dice. Find the probability that you roll doubles.
2. All human blood can be typed as one of O, A, B, or AB but the distribution of the types varies a bit with race. Here is the distribution of the blood type of a randomly chosen black American:
Blood type: O A B AB
Probability: .59 .17 .18 ?
a) What is the probability of type AB blood? Why?
b) Maria has type B blood. She can safely receive blood transfusions from people with blood types O and B. What is the probability that a randomly chosen black American can donate blood to Maria?
3. A general can plan a campaign to fight one major battle or three small battles. He believes that he has a probability of 0.6 of winning the large battle and probability 0.8 of winning each of the small battles. Victories or defeats in the small battles are independent. The general must win either the large battle or all 3 small battles to win the campaign. Which strategy should he choose?

General Probability Rules (6.3)
Learning Targets:
State the following rules: (A) the addition rule for disjoint events, (B) the general addition rule for union of two events, (C) the general multiplication rule for any two events.
Given any two events A and B, compute $P(A \cup B)$
Define joint event, joint probability
Define independent events in terms of conditional probability
Given two events, compute their joint probability
Explain what is meant by the conditional probability
Use the general multiplication rule to define $P(A \mid B)$ and $P(B \mid A)$

Review: There are additional probability rules that govern any assignment of probabilities. We need more rules so we can give probability models for more complex random phenomena. Let’s review the 5 rules we already have:
a. The value of any probability is ________________________________
b. The sum of the probabilities of the sample space is ________
c. If A and B are independent events, then P(A and B) = ____________
d. For any event A, $P(A^c)$ = ________________
e.
If A and B are disjoint events, then P(A or B) = _______________

General Probability Rules: Rule e applies to more than 2 disjoint events. In fact, it can extend to any number of disjoint events. The union of a collection of events is the event that any of them occur.
If events A and B are not disjoint, then ___________________________________________. The probability of their union is _______ the sum of their probabilities. Why are we using subtraction in this formula?
EXAMPLE: Call a household “prosperous” if its income exceeds $100,000. Call a household “educated” if the householder completed college. Select an American household at random, and let A be the event that the selected household is prosperous, and B be the event that the household is educated. According to the Current Population Survey, P(A) = 0.138, P(B) = 0.261, and the probability that a household is both educated and prosperous is $P(A \cap B)$ = 0.082. What is the probability that the household selected is either prosperous or educated?

Conditional Probability: The probability we assign to an event can change if we know that some other event has occurred. The other event that has occurred reduces the size of the sample space. So, when calculating the probability, the denominator in the fraction has changed.
Consider rolling two dice and observing the sum. There are _________ possible outcomes for the rolls, which is the denominator for any probability concerning the roll of two dice without a condition. We wish to calculate the probability of rolling a sum of 8 at the same time one of the dice shows a 3. If we are given that one die already shows a 3, how many possible outcomes (of any sum) are there now? _______ So the conditional probability of the sum being 8 given that one die is a 3 is: _______
The new notation P(A|B) or P(B|A) is a conditional probability. That is, it gives the probability of one event (sum = 8) under the condition that we know another event has occurred (a 3 already rolled).
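The shrinking sample space in the two-dice discussion can be checked by listing every outcome. The Python sketch below is one possible reading, not the official answer key: it assumes “one die shows a 3” means at least one of the two dice shows a 3. If the condition were instead “the first die shows a 3”, the counts would differ.

```python
from fractions import Fraction
from itertools import product

# All ordered outcomes (die 1, die 2) for a roll of two dice
rolls = list(product(range(1, 7), repeat=2))

# Assumption: the condition is "at least one die shows a 3"
has_three = [r for r in rolls if 3 in r]

# Outcomes that satisfy the condition AND have a sum of 8
sum8_and_three = [r for r in has_three if sum(r) == 8]

# No condition: the denominator is the full sample space
p_joint = Fraction(len(sum8_and_three), len(rolls))

# With the condition: the denominator is the reduced sample space
p_conditional = Fraction(len(sum8_and_three), len(has_three))

print("outcomes in full sample space:", len(rolls))
print("outcomes with at least one 3:", len(has_three))
print("P(sum = 8 and a 3 shows) =", p_joint)
print("P(sum = 8 | a 3 shows) =", p_conditional)
```

The numerator is the same in both fractions; only the denominator changes, which is the point made in the notes above.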
EXAMPLE: Students at the University of New Harmony received 10,000 course grades last semester. The grades in the table are broken down by which school of the university taught the course. The schools are Liberal Arts, Engineering and Physical Sciences, and Health and Human Services.
Grade      Liberal Arts   Engineering & Physical Sciences   Health & Human Services   Total
A          2,142          368                               882                       3,392
B          1,890          432                               630                       2,952
Below B    2,268          800                               588                       3,656
Total      6,300          1,600                             2,100                     10,000
It is common knowledge that college grades are lower in engineering and the physical sciences (EPS) than in the liberal arts and social sciences. Consider the two events:
A = the grade comes from an EPS course
B = the grade is below a B
Are these two events mutually exclusive? ____________
If we choose a grade at random, what is the probability that the grade will be below a B? P(B) =
Find the probability that the randomly chosen grade is below a B given that the grade comes from the EPS School.
Find the probability that a grade is an A given that it comes from a Liberal Arts course.

Extended Multiplication Rules: The conditional probability rule can be re-written as a multiplication rule: ______________________
Notice how this is just a restatement of our original conditional probability rule. This rule can be extended to more than two events and is used to calculate the probability of the intersection. The intersection of a collection of events is the event that all of the events occur at the same time.
Probability problems often require us to combine several of the basic rules into a more elaborate calculation. If we use our organizational tools, the problems become easier to solve.
EXAMPLE: Online chat rooms are dominated by the young. Teens are the biggest users. If we look only at adult Internet users (aged 18 and over), 47% of the 18 – 29 age group chat online, as do 21% of those aged 30 – 49, while just 7% of those 50 and over do.
To learn what percent of all Internet users participate in chat rooms, we also need the age breakdown of users. Ages 18 – 29 make up 29% of adult Internet users, 47% are age 30 – 49, and the remaining 24% are 50 years old or over.
a) What is the probability that a randomly chosen user of the Internet participates in chat rooms?
b) What percent of adult chat room participants are aged 18 – 29?
Tree diagrams make the probability problem much simpler. In fact it takes longer to explain it than it does to make the diagram and calculate the answer. Tree diagrams combine the addition and multiplication rules.

Independent Events: The conditional probability P(B|A) is generally not equal to the unconditional probability P(B). This is because the occurrence of event A gives us some additional information about whether or not event B occurs. If knowing that event A occurs gives us no additional information about whether or not B occurs, then A and B are ______________________________________.
EXAMPLE: Going back to the educated and prosperous households, where event A is “prosperous” with P(A) = 0.138, event B is “educated” with P(B) = 0.261, and the probability that a household is both educated and prosperous is $P(A \cap B)$ = 0.082. What is the conditional probability that a household is prosperous given that it is educated? Are the events prosperous and educated independent?

Chapter 6: Probability and Simulation: The Study of Randomness
Key Vocabulary: Simulation Trial Random Probability Independence random phenomenon sample space tree diagram event P(A) replacement $A^c$ $P(A^c)$ disjoint Venn diagram union intersection joint event joint probability conditional probability
6.1 Simulation:
1. In statistics, what is meant by the term random?
2. In statistics, what is meant by probability?
3. In statistics, what is meant by an independent trial?
4. What is simulation?
5. Why do statisticians use simulation?
6. List the 5 steps for conducting a simulation.
6.2 Probability Models
1.
What is probability theory?
2. In statistics, what is sample space?
3. In statistics, what is an event?
4. What is a probability model?
5. Explain why the probability of any event is a number between 0 and 1.
6. What is the sum of the probabilities of all possible outcomes?
7. Give your own example of an event that does not occur. What is its probability? Explain.
8. What is meant by the complement of an event?
9. When are two events considered disjoint?
10. What is the probability of disjoint events both occurring?
11. Explain why the probability of getting heads when flipping a coin is 50%.
12. What is the multiplication rule for independent events?
13. Can disjoint events be independent? Explain.
14. If two events, A and B, are independent, what must be true about $A^c$ and $B^c$?
6.3 General Probability Rules
1. What is meant by the union of two or more events?
2. State the addition rule for disjoint events.
3. State the general addition rule for unions of two events.
4. Explain the difference between the rules in #2 and #3.
5. What is meant by joint probability?
6. What is meant by conditional probability?
7. State the general multiplication rule.
8. How is the general multiplication rule different from the multiplication rule for independent events?
9. State the formula for finding conditional probability.
10. What is meant by the intersection of two or more events? Give an example by drawing a Venn diagram.
11. Explain the difference between the union and the intersection of two or more events.
12. Give an example of two disjoint events, and explain why they are disjoint.
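To see how the rules asked about in 6.3 fit together, here is a small Python sketch that applies them to the prosperous/educated household figures from the class notes (P(A) = 0.138, P(B) = 0.261, probability of both = 0.082). The variable names are invented for this illustration, and the tolerance used in the independence check is an arbitrary choice.

```python
import math

# Figures from the household example in the class notes
p_a = 0.138          # P(A): household is prosperous
p_b = 0.261          # P(B): household is educated
p_a_and_b = 0.082    # P(A and B): household is both

# General addition rule: subtract the overlap so it is not counted twice
p_a_or_b = p_a + p_b - p_a_and_b

# Conditional probability: P(A | B) = P(A and B) / P(B)
p_a_given_b = p_a_and_b / p_b

# Independence check: for independent events, P(A and B) equals P(A) * P(B)
looks_independent = math.isclose(p_a_and_b, p_a * p_b, rel_tol=1e-6)

print("P(A or B) =", round(p_a_or_b, 3))
print("P(A | B) =", round(p_a_given_b, 3))
print("P(A) * P(B) =", round(p_a * p_b, 3))
print("independent?", looks_independent)
```

Comparing P(A | B) with P(A) gives the same verdict as comparing P(A and B) with P(A)P(B); either comparison can be used to answer the independence question.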
Discrete and Continuous Random Variables (7.1)
Learning Targets:
Define: random variable, discrete random variable, continuous random variable, probability distribution, density curve
Explain what is meant by a probability distribution and a uniform distribution
Construct the probability distribution for a discrete random variable
Construct a probability histogram for a discrete random variable

Introduction: Find the following probabilities on the roll of one die
P(1) _______ P(2) _______ P(3) _______ P(4) ________ P(5) _______ P(6) _______
Since we have several probabilities involved for the same event (the roll of a die), we could use the notation P(X) where X represents the value that each roll could be. X is called a random variable.
Definition: Random Variable -
This chapter moves us from general probability to statistical inference.
Definition: Statistical Inference –
Now our sample space, S, becomes a list of the possible values of the random variable instead of a list of possible outcomes of an event. This section shows us 2 ways of assigning probabilities to the values of the random variable.
Definition: Discrete Random Variable - (a) (b) (c) (d)
EXAMPLE: The instructor of a large class gives 15% each of A’s and D’s, 30% each of B’s and C’s, and 10% F’s. Assign probabilities by making a probability distribution table. Use a grade-point scale (A = 4, B = 3, etc.). Find the probability that a student chosen at random (a) receives a B or better (b) doesn’t receive an F (c) gets better than a D
Definition: Continuous Random Variable - (a) (b) (c) (d) (e)
EXAMPLE: Use the density curve to find the following probabilities:
[Density curve figure: vertical axis marked 0.25, 0.5, 0.75, 1.0; horizontal axis marked 0.2 to 1.4 in steps of 0.2]
a) P(0.2 < X < 0.4) = ______________
b) P(0.2 ≤ X ≤ 0.4) = ______________
c) P(X = 0.2) = ______________
d) P(0 < X < 0.8) = ______________
We are most familiar with a density curve being the normal curve. Normal distributions are probability distributions. Remember $N(\mu, \sigma)$?
This means ________________________________________________
$z = \frac{x - \mu}{\sigma}$ is the formula for _____________________________________
Now, we use a capital letter X: $z = \frac{X - \mu}{\sigma}$ represents a standard normal variable with N(0, 1)
We will use z-scores to calculate probabilities using table A. Your calculator has table A programmed into it. It’s under the Distributions key (2nd VARS) “normalcdf”. You enter the minimum and maximum z-scores for the area under the Normal curve.
EXAMPLE: (a) find the following probabilities and (b) draw and shade the area under the normal curve. Write down the calculator keystrokes used.
Find P(z > 2.23) _______________________________________
Find P(z < -1.27) _______________________________________
Find P(-0.7 < z < 2.11) ____________________________________
EXAMPLE: An opinion poll asks an SRS of 1500 American adults what they considered to be the most serious problem facing our schools. Suppose that if we could ask all adults this question, 30% would say “drugs”. This is the proportion for the entire population, p. Our sample proportion is an estimate of p, so we call it $\hat{p}$. We will see in a later chapter that $\hat{p}$ has a distribution of N(0.3, 0.0118). What is the probability that the poll result differs from the truth about the population by more than 2 percentage points?

Means & Variances of Random Variables (7.2)
Learning Targets:
Define “mean of a random variable”
Calculate the mean, variance, and standard deviation of a discrete random variable
Calculate $\mu_{X+Y}$, $\mu_{a+bX}$, $\sigma^2_{X+Y}$, $\sigma^2_{a+bX}$
Explain: the Law of Large Numbers, the Law of Small Numbers, standard deviations of combined random variables

ACTIVITY: To investigate means of a random variable, consider a random variable that takes values {0, 1, 2, 3, 5, 8}. Complete the following:
1. Calculate the mean of the population, $\mu$.
2. Make a list of all of the samples of size 2 from this population. (There should be 15)
3. Find the mean of each of the 15 samples of size 2.
4. Find the mean of the 15 $\bar{x}$ values.
Compare this mean to the mean you calculated in step 1.
Population mean $\mu$ = _____________
Sample#   Sample   $\bar{x}$
1         0, 1     0.5
2         0, 2     1
3         0, 3     1.5
4
5
6         1, 2
7         1, 3
8         1, 5
9
10
11
12
13
14
15
Mean of all $\bar{x}$’s = ________
$\mu$ represents ___________________________________________________
$\bar{x}$ represents _____________________________________________________
Since the value of a random variable is a numerical outcome of a random phenomenon, the probabilities can all be different. To find the mean value, we apply a formula.
Formula: X is a discrete random variable with the distribution:
X: x1 x2 x3 … xk
P(X): p1 p2 p3 … pk
$\mu_X$ = ______________________
EXAMPLE: According to the following Census Bureau data, what is the mean size of an American household?
Inhabitants: 1 2 3 4 5 6 7
Proportion of households: .25 .32 .17 .15 .07 .03 .01
$\mu_X$ =
For continuous random variable distributions, $\mu_X$ is the point at which the area under the curve would ________________________ if it were made out of a solid material. If the density curve is symmetric, the mean is the center. If the density curve is skewed, we need advanced mathematics to find the mean.
Law of Large Numbers: As the number of observations increases (sample size), the sample mean $\bar{x}$ eventually approaches the mean $\mu$ of the population. How many observations are necessary? That depends on the ______________________ of the random outcomes. The more variable the outcomes, the more trials are needed.
Rules for Means:
1.
2.
EXAMPLE: Gain Communications sells aircraft communications units to both the military and civilian markets. Next year’s sales depend on market conditions that cannot be predicted exactly. Gain follows the modern practice of using probability estimates of sales. Gain makes a profit of $2000 on each military unit sold and $3500 on each civilian unit sold.
Let X be the random variable: # units sold in military division.
Let Y be the random variable: # units sold in civilian division.
Then:
X: 1000 3000 5000 10,000
P(X): .1 .3 .4 .2
Y: 300 500 750
P(Y): .4 .5 .1
(a) Find the mean number of military units sold and civilian units sold
MILITARY:
CIVILIAN:
(b) Find the mean number of units sold overall:
(c) Find the mean profit from military sales and the mean profit from civilian sales; then find the overall mean profit. (Military profit = $2000 each; Civilian profit = $3500 each)
MILITARY:
CIVILIAN:
OVERALL:
Mean is a measure of ______________. Variance and standard deviation are measures of ________________________.
Formula: X is a discrete random variable with the distribution:
X: x1 x2 x3 … xk
P(X): p1 p2 p3 … pk
with mean = $\mu_X$
$\sigma_X^2$ = ______________________
$\sigma_X$ = ______________________
EXAMPLE: Find the standard deviation for Gain Communications sales in the military division.
X: 1000 3000 5000 10,000
P(X): .1 .3 .4 .2
Y: 300 500 750
P(Y): .4 .5 .1
Let the calculator do the calculations for you.
Rules for Variances:
1.
2.
EXAMPLE: The payoff, X, of a $1 ticket in the Tri-State Pick 3 game is $500 with probability 0.001 and $0 the rest of the time.
(a) Find the distribution of X
(b) Find $\mu_X$ and $\sigma_X$
(c) Find your average winnings
(d) Suppose you buy a $1 ticket on each of 2 different days. Calculate your mean total payoff and the standard deviation of the total payoff.

Chapter 7: Random Variables
Key Vocabulary: Random variable Law of Large Numbers Discrete random variable Probability distribution standard deviation Probability histogram Density curve continuous random variable variance uniform distribution normal distribution Probability density curve $\mu_X$ $\mu_Y$ expected value
7.1 Discrete and Continuous Random Variables:
1. What is a discrete random variable?
2. If X is a discrete random variable, what information does the probability distribution of X give?
3. In a probability histogram, what does the height of each bar represent?
4. In a probability histogram, what is the sum of the heights of each bar?
5. What is a continuous random variable?
6.
If X is a continuous random variable, how is the probability distribution of X described?
7. If X is a discrete random variable, do P(X > 2) and P(X ≥ 2) have the same value? Explain.
8. If X is a continuous random variable, do P(X > 2) and P(X ≥ 2) have the same value? Explain.
9. How is a normal distribution related to a probability distribution?
10. Is a probability distribution always a normal distribution? Explain.
7.2 Means and Variances of Random Variables:
1. Explain the difference between the notations $\bar{x}$ and $\mu_X$.
2. What is meant by the expected value of X?
3. How do you calculate the mean of a discrete random variable X?
4. Explain the Law of Large Numbers.
5. Given the means $\mu_X$ and $\mu_Y$, explain how to calculate the mean $\mu_{X+Y}$.
6. Given the means $\mu_X$ and $\mu_Y$, explain how to calculate the mean $\mu_{X+4Y}$.
7. Explain how to calculate the variance of a discrete random variable.
8. Given the variance of a discrete random variable, explain how to calculate the standard deviation.
9. Suppose X and Y are independent random variables. Given $\sigma^2_X$ and $\sigma^2_Y$, explain how to calculate $\sigma^2_{X+Y}$ and $\sigma_{X+Y}$.
10. Suppose X and Y are independent random variables. Given $\sigma^2_X$, explain how to calculate $\sigma^2_{3+2X}$ and $\sigma_{3+2X}$.

Binomial Distributions (8.1)
Learning Targets:
Describe the conditions that need to be present to have a binomial setting
Define binomial distribution
Explain what is meant by the sampling distribution of a count
Explain the difference between binompdf(n, p, x) and binomcdf(n, p, x)
State the mathematical expression that gives the value of a binomial coefficient.
State the mathematical expression used to calculate the value of binomial probability
Evaluate a binomial probability by using the mathematical formula for P(X=k).
Use your calculator to help evaluate a binomial probability
Calculate the mean and variance of a binomial distribution

Review:
Discrete Probability Distribution - ___________________________________________
Continuous Probability Distribution - __________________________________________

Binomial Random Variable: A random variable can also be what is called a binomial random variable if the data are produced in a binomial setting. There are 4 conditions that determine a binomial setting:
a)
b)
c)
d)
X = number of successes, and is called a binomial random variable because the number of successes varies according to the random phenomenon. We use the notation __________ to tell you the distribution is binomial with “n” observations each having probability “p”.
EXAMPLES: Determine whether the following situations are binomial or not:
1) Deal 10 cards from a shuffled deck and count the number of red cards dealt.
CHECK THE REQUIREMENTS: a) b) c) d)
If binomial, X represents the number of successes, which is _________________________
2) A couple decides to continue having children until their first girl is born. X is the number of children they have.
CHECK THE REQUIREMENTS: a) b) c) d)
If binomial, X represents the number of successes, which is __________________________
3) A quality engineer selects an SRS of 10 switches from a large shipment for detailed inspection. Unknown to the engineer, 10% of the switches in the shipment fail to meet the specifications. X is the number of defective switches in the sample.
CHECK THE REQUIREMENTS: a) b) c) d)
If binomial, X represents the number of successes, which is __________________________

Binomial Probabilities: Once we determine the type of distribution we have, we will be asked to calculate probabilities. Use example (3) to find the probability that out of the 10 switches inspected, exactly 2 switches in the sample will fail the inspection.
Determine what event is a “success” ________________________________
Determine what the random variable X represents:
X = _______________________________________
P(X) = ______________________________________
Make a probability distribution:
X      P(X)
P(X = 0) is the probability that no switches fail.
P(X = 1) is the probability that exactly one switch fails. (We don’t know which switch it will be.)
In order to calculate P(X = 2):
Number of ways we can have 2 failing switches
Probability of a switch failing
Probability of a switch not failing
Probability Formula: \(P(X = k) = \binom{n}{k} p^k (1-p)^{n-k}\)

We can also use the Binomial pdf (probability distribution function) to calculate the probabilities.
Binompdf (n, p, X): Binompdf (total # trials, probability of success, X-value)
If we wanted to calculate the probability that no more than 2 switches failed, we would use the cumulative distribution function (cdf).
Binomcdf (n, p, X): Binomcdf (total # trials, probability of success, X-value)

EXAMPLE: Corinne is a basketball player who makes 75% of her free throws over the season. In a key game, Corinne shoots 12 free throws and makes only 7 of them. The fans think she failed because she was nervous. Is it unusual for Corinne to perform this poorly?
What question are we really answering? _________________________________________
Identify:
Success: _______________________________________________
X represents ___________________________________________
n = ________ p = _________ X = _________ pdf or cdf? ______________
Calculate:

Binomial Mean and Standard Deviation:
What would we guess to be the average (mean) number of shots made by Corinne in games where she shoots 12 free throws?
Mean of a Binomial Random Variable: \(\mu = np\)
Standard Deviation of a Binomial Random Variable: \(\sigma = \sqrt{np(1-p)}\)

EXAMPLE: Among employed women, 25% have never been married. Select 10 employed women at random.
(a) What is the distribution of X?
(b) What is the probability that exactly 2 of the 10 women in your sample have never been married?
(c) What is the probability that 2 or fewer have never been married?
(d) What is the probability that 6 or more have never been married?
(e) What is the probability that more than 4 have never been married?
(f) What is the mean number of women in such a sample who have never been married?
(g) What is the standard deviation?

Geometric Distributions (8.2)
Learning Targets:
Describe what is meant by a geometric setting
Given the probability of success, p, calculate the probability of getting the first success on the nth trial
Calculate the mean (expected value) and the variance of a geometric random variable
Calculate the probability that it takes more than n trials to see the first success for a geometric random variable

Review: Conditions for a Binomial Distribution:
a)
b)
c)
d)

Geometric Random Variable:
A random variable is called a geometric random variable if the data are produced in a geometric setting. There are 4 conditions that determine a geometric setting:
a)
b)
c)
d)
X = number of trials to obtain the first success.

EXAMPLES: Determine if the following situations are geometric or not:
1) Deal 10 cards from a shuffled deck and count the number X of red cards dealt.
CHECK THE REQUIREMENTS:
a)
b)
c)
d)
If geometric, X represents the trial where the first success occurs. ________________
2) A couple decides to continue having children until their first girl is born.
CHECK THE REQUIREMENTS:
a)
b)
c)
d)
If geometric, X represents the trial where the first success occurs. ________________
3) Blood type is inherited, but children inherit independently of each other. Count the number of children with type A blood among 5 children.
CHECK THE REQUIREMENTS:
a)
b)
c)
d)
If geometric, X represents the trial where the first success occurs. ________________

Geometric Probabilities:
A game consists of rolling a single die. The event of interest is rolling a 3.
We can label each roll as one of two possible outcomes:
success = ___________________________; failure = _____________________________.
To calculate the probabilities, we consider each roll separately:
X = 1: P(X = 1) finds the probability of a success on the first roll, which is _______________
X = 2: P(X = 2) finds the probability of the first success on the second roll, which means:
roll #1 was a failure (Probability = ______________)
roll #2 was a success (Probability = _____________)
so P(X = 2) = _________________
X = 3: P(X = 3) finds the probability of ________________________________________
roll #1 was a ______________________ (Probability = ______________)
roll #2 was a ______________________ (Probability = ______________)
roll #3 was a ______________________ (Probability = ______________)
so P(X = 3) = ____________
Probability Formula: \(P(X = n) = (1-p)^{n-1} p\)

We can also use the Geometric pdf (probability distribution function) to calculate the probabilities.
Geometpdf (p, n): Geometpdf (probability of success, trial # of 1st success)
EXAMPLE: Find the probability that the first 3 occurs on the 5th roll.
If we wanted to find the probability that the first roll of a 3 occurred on or before the nth roll, we would use the cumulative distribution function.
Geometcdf (p, n): Geometcdf (probability of success, maximum trial # of 1st success)

EXAMPLE: Bob is a basketball player who makes 65% of his free throws over the season. We put him on the free-throw line and ask him to shoot free throws until he misses one. Let X = # of free throws Bob shoots until he misses (counting the miss). Construct a probability distribution table.
X:     1   2   3   4   5
P(X):
(a) Find the probability that Bob’s first miss occurs on his 5th shot
(b) Find the probability that Bob’s first miss occurs before his 5th shot
(c) Find the probability that Bob’s first miss occurs after the 5th shot

Geometric Mean and Standard Deviation:
Mean of a Geometric Random Variable: \(\mu = \frac{1}{p}\)
Variance of a Geometric Random Variable: \(\sigma^2 = \frac{1-p}{p^2}\)
Standard Deviation of a Geometric Random Variable: \(\sigma = \frac{\sqrt{1-p}}{p}\)

EXAMPLE: Suppose that Albert, a well-known major league baseball player, finished last season with a 0.325 batting average. He wants to calculate the probability that he will get his first hit of this new season at his first at-bat. He also wants to know his expected number of at-bats until he gets a hit.

EXAMPLE: In 1986-1987, Cheerios cereal boxes displayed a dollar bill on the front of the box and a cartoon character who said, “Free $1 bill in every 20th box.” Conduct a simulation to determine the number of boxes of Cheerios you would expect to buy in order to get one of the “free” dollar bills.
Success = _____________________ Digits: ____________________
Failure = ______________________ Digits: ____________________
Use the following random digits. The trial is over as soon as you find a $1 bill.
43909 99477 25330 64359 40085 16925 85117 36071 15689 14227 06565 14374
13352 49367 81982 87209 36759 58984 68288 22913 18638 54303 00795 08727
According to Cheerios, how many boxes would we expect to have to open? _______________
How many did our simulation say we have to open? ________________
Calculate the variance and the standard deviation using the probability from Cheerios.
How does the standard deviation help explain the results of our simulation?

Chapter 8: The Binomial and Geometric Distributions
Key Vocabulary: Binomial setting, Binomial distribution, Probability distribution function (pdf), Binomial coefficient, Binomial random variable, B(n, p), Cumulative distribution function (cdf), n!, \(\binom{n}{k} = \frac{n!}{k!\,(n-k)!}\)

8.1 The Binomial Distributions:
1.
What are the four conditions for the binomial setting?
2. In the binomial distribution, what do the parameters n and p represent?
3. What is meant by B(n, p)?
4. What is the difference between a probability distribution function (pdf) and a cumulative distribution function (cdf)?
5. How do we find the mean and standard deviation for a binomial random variable?
6. There are 50 poker chips in a container, 25 of which are red, 15 white, and 10 blue. You draw a chip without looking 25 times, each time returning the chip to the container.
a) What is the probability that you will draw 9 or fewer blue chips?
b) What is the probability that you will draw 6 or more red chips?

8.2 The Geometric Distributions:
1. What are the four conditions for the geometric setting?
2. Explain the difference between the binomial setting and the geometric setting.
3. How do you calculate the mean of a geometric random variable X?
4. How do you calculate the variance of a geometric random variable?
5. Explain what has been the most difficult part of this class so far.

Sampling Distributions (9.1)
Learning Targets:
Compare and contrast parameter and statistic
Explain what is meant by sampling variability
Define the sampling distribution of a statistic
Explain how to describe a sampling distribution
Define an unbiased statistic and an unbiased estimator
Describe what is meant by the variability of a statistic
Explain how bias and variability are related to estimating with a sample

Introduction:
The remainder of this course focuses on Statistical Inference. We will be asking “how often would this method give a correct answer if I used it many, many times?” This chapter prepares us for the study of statistical inference by looking at the probability distributions of sample proportions and sample means.
Parameters and Statistics:
A recent poll asked random individuals “Are you afraid to go outside at night?” The results showed that 45% of the sample said yes, they were afraid to go outside at night. Does that mean 45% of all people are afraid to go outside at night? 45% describes the sample; it is used to estimate the _______________________________.

EXAMPLE A: State whether each boldface number is a parameter or a statistic:
1. The Tennessee STAR experiment randomly assigned children to regular or small classes during their first four years of school. When these children reached high school, 40.2% of blacks from small classes took the ACT or SAT college entrance exams. Only 31.7% of blacks from regular classes took one of these exams.
2. A random sample of female college students has a mean height of 64.5 inches, which is greater than the 63-inch mean height of all adult American women.

Sampling Distributions:
Suppose I wanted to know what percent of the class has a dog for a pet.
Population: _____________________________________________________________
Population Proportion: ________________________
Now, I choose an SRS of 10 students and survey them. I can calculate the statistic: \(\hat{p}_1 = \frac{\#\,\text{yes}}{10} =\) ______ %
Then I choose a different SRS of 10 students and calculate a new statistic: \(\hat{p}_2 = \frac{\#\,\text{yes}}{10} =\) ______ %
And I choose yet another SRS of 10 students and calculate a 3rd statistic, \(\hat{p}_3\). Are these 3 statistics always equal?
\(\hat{p}_4\) = ___________ \(\hat{p}_5\) = ______________ \(\hat{p}_6\) = ______________
The statistics from every possible sample of 10 students combine to create the sampling distribution. Since the value of the statistic varies with repeated sampling, a graph of the statistic's values over all possible samples would be the graph of the sampling distribution.

EXAMPLE B: Let us illustrate the idea of sampling variability and a sampling distribution in the case of a very small sample from a very small population.
The population is the scores of 10 students on an exam:
Student #:  0   1   2   3   4   5   6   7   8   9
Score:      82  62  80  58  72  73  65  66  74  62
The parameter of interest is the mean score in this population, which is 69.4. The sample is an SRS of size n = 4 drawn from the population. Because the students are labeled 0 to 9, a single random digit from Table B chooses one student for the sample.
a) Use Table B (line 120) to draw an SRS of size 4 from this population. Write the four scores in your sample and calculate the mean \(\bar{x}\) of the sample scores. This statistic is an estimate of the population parameter.
LINE 120 (STUDENT #): 3 5 4 7
STUDENT TEST SCORE: _____________
\(\bar{x}_1\) = _____________
b) Continue in line 120, wrapping around to line 121, and repeat this process 9 more times.
SCORES: \(\bar{x}_2\) = _____________    SCORES: \(\bar{x}_3\) = _____________
SCORES: \(\bar{x}_4\) = _____________    SCORES: \(\bar{x}_5\) = _____________
SCORES: \(\bar{x}_6\) = _____________    SCORES: \(\bar{x}_7\) = _____________
SCORES: \(\bar{x}_8\) = _____________    SCORES: \(\bar{x}_9\) = _____________
SCORES: \(\bar{x}_{10}\) = _____________
Notice the sampling variability in all the \(\bar{x}\) values. Find the mean of all 10 \(\bar{x}\) values _________
The sampling distribution is the ideal pattern that would emerge if we looked at all possible samples of size n from our population.

The Bias of a Statistic:
How trustworthy is a statistic as an estimate of a parameter? Here, bias describes a statistic in terms of the center of its sampling distribution, rather than describing a sampling method. A statistic is unbiased if the mean of its sampling distribution equals the true value of the parameter.

The Variability of a Statistic:
The sample proportion from a random sample of any size is an unbiased estimator of the population proportion, but larger samples have a clear advantage. Since there is less variability among large samples than among small samples, large samples are more likely to produce an estimate close to the true value of the parameter. As long as the population is much larger than the sample, the variability of a statistic does not depend on the size of the population.
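The idea of "all possible samples" in EXAMPLE B can be checked directly, because the population is so small. The following is a minimal Python sketch, not part of the original notes: the function name `sampling_distribution` is made up for illustration, and it lists every possible sample rather than using Table B. It shows that the sample means center on the population mean (no bias) and that their spread shrinks as the sample size grows.

```python
from itertools import combinations
from statistics import mean, pstdev

# Exam scores for students 0-9 from EXAMPLE B (population mean is 69.4).
scores = [82, 62, 80, 58, 72, 73, 65, 66, 74, 62]

def sampling_distribution(population, n):
    """Return the sample mean of every possible SRS of size n (hypothetical helper)."""
    return [mean(sample) for sample in combinations(population, n)]

xbars = sampling_distribution(scores, 4)
print(len(xbars))             # number of possible samples of size 4: 210
print(round(mean(xbars), 1))  # center of the sampling distribution: 69.4

# Spread of the sampling distribution for several sample sizes.
for n in (2, 4, 8):
    print(n, round(pstdev(sampling_distribution(scores, n)), 2))
```

The spread printed in the last loop gets smaller as n increases, which is the "larger samples have less variability" statement above.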
Bias and Variability:
We can think of the true value of the population parameter as the bull’s eye on a target and think of the sample statistic as an arrow fired at the target. Both bias and variability describe what happens when we take many shots at the target.
Bias means that __________________________________________________________
_______________________________________________________________________
High variability means that __________________________________________________
_______________________________________________________________________

Sample Proportions (9.2)
Learning Targets:
Describe the sampling distribution of a sample proportion
Compute the mean and standard deviation for the sampling distribution
Identify the “rule of thumb” that justifies the use of the formula for the standard deviation
Identify the conditions necessary to use a Normal approximation for the sampling distribution
Use a Normal approximation for the sampling distribution of \(\hat{p}\) to solve probability problems

The Sampling Distribution of a Sample Proportion \(\hat{p}\):
To describe the sampling distribution of \(\hat{p}\), we need to discuss the center, shape, and spread. The center uses the mean value. In a sampling distribution of sample proportions (all possible samples of the same size), the mean of the distribution of \(\hat{p}\) is the parameter p. When discussing the spread, if the population is much larger than the sample, the standard deviation of the distribution of \(\hat{p}\) is \(\sqrt{\frac{p(1-p)}{n}}\). This formula for standard deviation only applies if the population is __________________________________________________.
YOU MUST CHECK THIS CONDITION (AND SHOW THE CHECK) BEFORE USING THIS FORMULA.

Using the Normal Approximation for \(\hat{p}\):
To discuss the shape, remember that we’re talking about the distribution of \(\hat{p}\) over all possible samples. We said that the larger the sample size, the less variability we will have. If the sample size is large enough, we will have a shape close to Normal. How large is “large enough”?
if: \(np \ge 10\) and if: \(n(1-p) \ge 10\), we use the Normal approximation.
YOU MUST ALWAYS CHECK THIS CONDITION (AND SHOW THE CHECK) BEFORE ASSUMING YOU HAVE A NORMAL DISTRIBUTION.

EXAMPLE A: You ask an SRS of 1500 first-year college students whether they applied for admission to any other college. Actually, 35% of all first-year students applied to colleges besides the one they’re attending. What is the probability that your sample will give a result within 2 percentage points of the true value?
n = ______ p = ______
* Can we assume a Normal distribution?
* Can we use our formula for standard deviation?

EXAMPLE B: The Gallup Poll once asked a random sample of 1540 adults, “Do you happen to jog?” Suppose that in fact 15% of all adults jog.
a) Find the mean and, if you can, the standard deviation of the sample proportion \(\hat{p}\) who jog.
b) Check that you can assume a Normal approximation for the distribution of \(\hat{p}\).
c) If you can use a Normal distribution, find the probability that between 13% and 17% of the sample jog.
d) What sample size would be required to reduce the standard deviation of the sample proportion to one-half the value you found in (a)?

Sample Means (9.3)
Learning Targets:
Given the mean and standard deviation of a population, calculate the mean and standard deviation for the sampling distribution of a sample mean
State the Central Limit Theorem
Identify the shape of the sampling distribution of a sample mean drawn from a population that has a Normal distribution
Use the Central Limit Theorem to solve probability problems for the sampling distribution of a sample mean

Introduction:
We use sample proportions most often when we are interested in categorical variables. We look at “what proportion of ...” or “what percent of adults ...” When we record quantitative variables, we are interested in other statistics, like the mean and standard deviation. Sample means are among the most common statistics.
A sampling distribution for the sample mean would describe the means of __________________________________. Since the value of the sample mean depends on the sample we use, the sample mean is a random variable.

The Mean and the Standard Deviation of \(\bar{x}\):
We will be using statistics calculated from samples to make decisions about the population. To estimate the population mean, we can use the sample mean because:
The mean of the sample means is the population mean: \(\mu_{\bar{x}} = \mu\)
If the sample size is increased, the variability of the sample mean decreases: \(\sigma_{\bar{x}} = \frac{\sigma}{\sqrt{n}}\) (only use this formula when the population is at least 10 times as large as the sample.)
The distribution of the sample mean becomes closer to a Normal distribution as the sample size becomes larger, regardless of the shape of the population from which the sample is drawn. (The Central Limit Theorem)

EXAMPLE: Suppose a population has a mean of 30 and a variance of 25. If a sample of size 100 is drawn from the population, what is the probability that the sample mean will be larger than 31?

EXAMPLE: A company that owns and services a fleet of cars for its sales force has found that the service lifetime of the disc brake pads varies from car to car according to a Normal distribution with mean \(\mu\) = 55,000 miles and standard deviation \(\sigma\) = 4500 miles. The company installs a new brand of brake pads in 8 cars.
a) If the new brand has the same lifetime distribution as the previous type, what is the distribution of the sample mean lifetime for the 8 cars? (center, shape, spread)
b) The average life of the pads on these 8 cars turns out to be \(\bar{x}\) = 51,800 miles. What is the probability that a sample mean lifetime would be 51,800 or less if the true lifetime distribution is unchanged?
(This probability is evidence of whether or not the new brand of pads has a mean lifetime less than 55,000 miles.)

Chapter 9: Sampling Distributions
Key Vocabulary: Parameter, Statistic, Sampling Variability, Law of Large Numbers, Sampling Distribution, Unbiased, Central Limit Theorem, Population mean, Sample mean, \(\bar{x}\) vs. \(\mu\); \(\hat{p}\) vs. \(p\)

9.1 Sampling Distributions
1. Explain the difference between a parameter and a statistic.
2. Explain the difference between \(p\) and \(\hat{p}\).
3. Explain the difference between \(\hat{p}\) and \(\bar{x}\).
4. What is meant by the sampling distribution of a statistic?
5. When is a statistic considered unbiased?
6. How is the size of the sample related to the spread of the sampling distribution?

9.2 Sample Proportions
1. In an SRS of size n, what is true about the sampling distribution of \(\hat{p}\) when the sample size, n, increases?
2. In an SRS of size n, what is the mean of the sampling distribution of \(\hat{p}\)?
3. In an SRS of size n, what is the standard deviation of the sampling distribution of \(\hat{p}\)?
4. What happens to the standard deviation of \(\hat{p}\) as the sample size, n, increases?
5. When does the formula \(\sqrt{\frac{p(1-p)}{n}}\) apply to the standard deviation of \(\hat{p}\)?
6. When the sample size, n, is large, the sampling distribution of \(\hat{p}\) is approximately Normal. What test can you use to determine if the sample is large enough to assume that the sampling distribution is approximately Normal?
7. Explain the difference between “rule of thumb 1” and “rule of thumb 2”.

9.3 Sample Means
1. What symbols are used to represent the parameters in this section?
2. What symbols are used to represent the statistics in this section?
3. Since averages are less variable than individual outcomes, what is true about the standard deviation of the sampling distribution of \(\bar{x}\)?
4. Suppose we draw an SRS of size n from a population that has a Normal distribution with mean \(\mu\) and standard deviation \(\sigma\). Give three characteristics of the sample mean, \(\bar{x}\).
5. What does the Central Limit Theorem say about the shape of the sampling distribution of \(\bar{x}\)?
6. What is your plan in preparing for the final exam?

Estimating with Confidence (10.1)
Learning Targets:
List the 6 basic steps in the reasoning of statistical estimation
Distinguish between a point estimate and an interval estimate
Identify the basic form of all confidence intervals
Explain what is meant by margin of error
State in nontechnical language what is meant by a level C confidence interval
State the 3 conditions that need to be present in order to construct a valid confidence interval
Explain what is meant by the “upper p critical value” of the standard Normal distribution
For a known population standard deviation, construct a level C confidence interval for a population mean
List the 4 necessary steps in the creation of a confidence interval
Identify 3 ways to make the margin of error smaller when constructing a confidence interval
Once a confidence interval has been constructed for a population value, interpret the interval in the context of the problem
Determine the sample size necessary to construct a level C confidence interval for a population mean with a specified margin of error
Identify as many of the 6 warnings about constructing confidence intervals as you can

Introduction:
The goal of statistical inference is to use the sample data to _________________________ __________________. We won’t be sure that our conclusions are correct; a different sample might lead to _____________________________. Using probability to express the strength of our conclusions will add support to our decisions. In this chapter, we will ___________________ the value of a population parameter with a confidence interval.

Confidence Intervals:
Let’s say we have an SRS of size 840 with a mean test score of 272 points. What would we guess the population mean, \(\mu\), to be? _________
If we had every possible sample of size 840 (the “sampling distribution”), what would we know?
* _____________________________________________________________________
* _____________________________________________________________________
* _____________________________________________________________________
Realistically we wouldn’t know \(\sigma_{\bar{x}}\) if we don’t know \(\sigma\). But let’s suppose \(\sigma\) = 60 for this population. Then where are 95% of all samples?

When calculating a confidence interval, there are only 2 possible results:
* _____________________________________________________________________
* _____________________________________________________________________
The probability that our confidence interval contains the true mean, \(\mu\), is either _______ or _______.

All confidence intervals are in the form:
ESTIMATE ± MARGIN OF ERROR
(Estimate: obtained from the sample. Margin of error: how far from the sample we are willing to go.)

The 68-95-99.7 rule will work if we want to be:
68% confident: Margin of error = _________________________
If we repeat this process over and over, 68% of the intervals will contain the true mean.
95% confident: Margin of error = _________________________
If we repeat this process over and over, 95% of the intervals will contain the true mean.
99.7% confident: Margin of error = _________________________
If we repeat this process over and over, 99.7% of the intervals will contain the true mean.

What do we need to find if we want a 90% confidence interval? When 90% of the area is in the middle of the curve, that means 10% is left out. So each tail has 5% of the area. We need to find the point where that happens. Each * is a certain number of standard deviations away from \(\mu\). We call it z*, and the value of z* depends on the confidence level we wish to use.
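The tail-area reasoning above can be checked with a short Python sketch (not part of the original notes). The function name `z_star` is made up for illustration; `NormalDist` comes from Python's standard `statistics` module, and its `inv_cdf` method returns the point with a given area to its left under the Normal curve.

```python
from statistics import NormalDist

def z_star(confidence):
    """Critical value z* that leaves `confidence` area in the middle
    of the standard Normal curve (hypothetical helper)."""
    tail = (1 - confidence) / 2            # area left out in EACH tail
    return NormalDist().inv_cdf(1 - tail)  # point with that area to its right

for level in (0.90, 0.95, 0.99):
    print(level, round(z_star(level), 3))
```

For a 90% confidence level, each tail has area 0.05, so the function asks for the point with 0.95 of the area to its left. The same call with other confidence levels gives the corresponding z* values.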
Confidence level:  90%     95%     99%
Tail area:         0.05    0.025   0.005
z* value:          1.645   1.960   2.576

EXAMPLE 1:
Find z* for 80% confidence level ___________________
Find z* for 98% confidence level ___________________

A level “C” confidence interval for \(\mu\) is: \(\bar{x} \pm z^* \frac{\sigma}{\sqrt{n}}\)

EXAMPLE 2: A study of the career paths of hotel general managers sent questionnaires to an SRS of 160 hotels belonging to major US hotel chains. There were 114 responses. The average time those general managers had spent with their current company was 11.78 years. Give a 99% confidence interval for the mean number of years general managers of major chain hotels have spent with their current company. (Use \(\sigma\) = 3.2 years.)

For all confidence intervals:
* The user chooses the confidence level when choosing the sampling method and setting up the experiment; the margin of error depends on the confidence level.
* Margin of Error = \(z^* \frac{\sigma}{\sqrt{n}}\)    Numerator = \(z^* \sigma\) and Denominator = \(\sqrt{n}\)
* We would like a high confidence level and a small margin of error
* Margin of error gets smaller when: (how does a fraction get smaller?)
* _____________________________________________________
* _____________________________________________________
* _____________________________________________________

EXAMPLE 3: How large a sample of the hotel managers in the previous example would we need to estimate the mean within 1 year with 99% confidence?

Estimating a Population Mean (10.2 A)
Learning Targets:
Identify the 3 conditions that must be present before estimating a population mean
Explain what is meant by the standard error of a statistic in general and by the standard error of the sample mean in particular
List 3 important facts about the t-distributions. Include comparisons to the standard Normal curve.
Use Table C to determine critical t-values for a given level C confidence interval for a mean and a specified number of degrees of freedom
Construct a one-sample t confidence interval for a population mean (remembering to use the 4-step procedure)

Conditions for Inference about a Population Mean:
As mentioned in section 10.1, we usually will not know the population standard deviation when finding confidence intervals. If \(\sigma\) is not known, we cannot calculate the standard deviation for the sampling distribution, \(\sigma_{\bar{x}}\). We must estimate \(\sigma\) from the data, even though we are primarily interested in \(\mu\). Before we can estimate, we need to verify 3 important conditions:
*
*
*

The t Distribution:
Once we have verified the conditions, we can estimate \(\sigma\) with “s”, the sample standard deviation. Then the standard deviation of the sampling distribution, \(\sigma_{\bar{x}}\), becomes the standard _____________________ of the sample mean. Since \(\sigma\) was replaced by s, the statistic has more ________________ and no longer has a Normal distribution. We cannot find the standardized value, z. There is another standardized value we can use, called ______. The formula is very similar to the z-score formula:

Like z, this standardized value tells us the number of standard deviation units \(\bar{x}\) is from \(\mu\). Unlike z, there is a different t-distribution for each _______________________. We specify a particular t-distribution by giving its ___________________________________. The notation is ______________.
Table C in the back of the book gives critical values t* for the t-distributions. (t* tells us ____________________________________________) Each row in the table contains critical values for one of the t-distributions; the degrees of freedom appear at the left of the row. By looking down any column, you can check that the t critical values approach the Normal values as the degrees of freedom increase. The t-table uses area to the right of t*.
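The claim that t critical values approach the Normal values can also be checked numerically. The sketch below is not how Table C was built and is not part of the original notes; it is a rough illustration that assumes the standard t density formula, integrates it with Simpson's rule, and searches for t* by bisection. The function names are made up, and the search range (0 to 100) is an assumption that covers typical table entries.

```python
import math

def t_pdf(x, df):
    """Density of the t-distribution with df degrees of freedom."""
    log_const = (math.lgamma((df + 1) / 2) - math.lgamma(df / 2)
                 - 0.5 * math.log(df * math.pi))
    return math.exp(log_const - (df + 1) / 2 * math.log1p(x * x / df))

def upper_tail(t, df, steps=1000):
    """Area to the right of t (for t >= 0), via Simpson's rule on [0, t]."""
    if t == 0:
        return 0.5
    h = t / steps
    total = t_pdf(0, df) + t_pdf(t, df)
    for i in range(1, steps):
        total += (4 if i % 2 else 2) * t_pdf(i * h, df)
    return 0.5 - total * h / 3   # half the curve minus the area from 0 to t

def t_star(df, tail_area):
    """Critical value t* with `tail_area` probability to its right."""
    lo, hi = 0.0, 100.0          # assumed search range
    for _ in range(40):
        mid = (lo + hi) / 2
        if upper_tail(mid, df) > tail_area:
            lo = mid             # too much area to the right: move right
        else:
            hi = mid
    return (lo + hi) / 2

# Upper-tail area 0.025 (used for 95% confidence) for growing df.
for df in (5, 30, 100, 1000):
    print(df, round(t_star(df, 0.025), 3))
```

As the degrees of freedom grow, the printed values drop toward the Normal critical value 1.960 for the same tail area.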
EXAMPLE 1: Find the critical value t* from Table C to satisfy each of the following conditions:
a) The t-distribution has 5 degrees of freedom and probability 0.05 to the right of t*. ____
b) The t-distribution has 21 degrees of freedom and probability 0.99 to the left of t*. ____
c) Used for a 95% confidence interval based on 10 observations (n = 10). __________
d) Used for a 99% confidence interval from an SRS of 20 observations. __________
e) Used for an 80% confidence interval from a sample size of 7. __________

Facts About the t-Distribution:
* The density curves of the t-distributions are similar in shape to the _________________ ______________________________. They are symmetric about _____________, single-peaked, and ________________________.
* The spread of the t-distributions is a bit _____________________ than that of the standard Normal distribution. The t-distributions have more area/probability in ___________ and less in the _________________ than the standard Normal distribution does. This is true because _______________________________________________________________.
* As the degrees of freedom (k) increase, the t(k) density curve approaches the _________ _________________________ more closely. This happens because _________________ ___________________________________________________________________.

The One-Sample t Confidence Intervals:
Now we use this knowledge to construct confidence intervals for an unknown mean, with an unknown standard deviation: \(\bar{x} \pm t^* \frac{s}{\sqrt{n}}\)

Steps to displaying a confidence interval:
P: “Let ________ represent the mean _________________ of all _______________.”
A:
N:
I:
C: “We are ______% confident that the true mean __________________ is between _____ & _____.”
(Being _______% confident means that ______% of all intervals created by this method will contain the true mean, so we are pretty sure our interval is one of those intervals.)

EXAMPLE 2: Poisoning by the pesticide DDT causes tremors and convulsions.
In a study of DDT poisoning, researchers fed several rats a measured amount of DDT. They then made measurements on the rats’ nervous systems that might explain how DDT poisoning causes tremors. One important variable was the “absolutely refractory period”, the time required for a nerve to recover after a stimulus. This period varies Normally. Measurements on four rats gave the data below (in milliseconds).
1.6 1.7 1.8 1.9
a) Find the mean refractory period (\(\bar{x}\)) and the standard error of the mean (\(s_{\bar{x}}\)).
b) Give a 90% confidence interval for the mean absolutely refractory period for all rats of this strain when subjected to the same treatment.
P
A
N
I
C

Robustness of t-procedures:
One-sample t-procedures are exactly correct when the population is ___________________. However, no real data are exactly Normal. Procedures that are robust are not strongly influenced by the lack of Normality. If there are no outliers in the data, the t-procedures can be used even if the population isn’t Normal. The t-procedures are strongly influenced by __________________.
We assume that the population is ___________________________ in order to justify the use of t-procedures. If there are _____________________ and the sample size is small, the results will not be reliable.
For small sample sizes, sizes less than _______, only use t-procedures if the data are close to Normal.
If the sample size is larger, at least _______, we can use t-procedures if there are no outliers or ____________________.
If the sample size is large enough, at least ________, we can safely use t-procedures, even if the data are skewed.
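The arithmetic behind the one-sample t interval in EXAMPLE 2 can be sketched in Python. This is only an illustration of the formula \(\bar{x} \pm t^* \frac{s}{\sqrt{n}}\), not part of the original notes and not a replacement for showing the P-A-N-I-C steps: the function name `t_interval` is made up, and the critical value 2.353 is an assumed lookup from a standard t table (3 degrees of freedom, upper-tail area 0.05).

```python
from math import sqrt
from statistics import mean, stdev

def t_interval(data, t_star):
    """One-sample t confidence interval: x-bar +/- t* * s / sqrt(n).
    t_star must be looked up for n - 1 degrees of freedom (hypothetical helper)."""
    xbar = mean(data)
    standard_error = stdev(data) / sqrt(len(data))  # s / sqrt(n)
    margin = t_star * standard_error
    return xbar - margin, xbar + margin

refractory = [1.6, 1.7, 1.8, 1.9]         # milliseconds, from EXAMPLE 2
low, high = t_interval(refractory, 2.353)  # 90% confidence, df = 3
print(round(low, 3), round(high, 3))
```

Note that `stdev` divides by n - 1, which matches the sample standard deviation s used in these notes.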
Estimating a Population Mean (10.2 B)
Learning Targets:
Describe what is meant by “paired-t procedures”
Calculate a level C confidence interval for a set of paired data
Explain what is meant by a robust inference procedure and comment on the robustness of t-procedures
Discuss how sample size affects the usefulness of t-procedures

Paired t Procedures:
Comparative studies are more convincing than single-sample investigations, so ____________ inference is not as common as comparative _______________________. In a comparative study, we may want to compare two _____________, or we may want to compare two ___________________. In either case, the samples must be chosen __________________ and _____________________________ in order to perform statistical inference. Because matched pairs are not chosen independently, we will not use two-sample inference for a matched pairs design. Instead, we apply the one-sample _____________________ to the observed ____________________. If the same subjects produce both data sets, it is a matched pairs design.

EXAMPLES
The following situations require inference about a mean or means. Identify each as a single-sample, a matched-pairs sample, or a two-sample:
A) An education researcher wants to learn whether it is more effective to put questions before or after introducing a new concept in an elementary school mathematics text. He prepares two textbook segments that teach the same concept, one with motivating questions before and the other with review questions after. He uses each text segment to teach a separate group of children. The researcher compares the scores of the groups on a test over the material.
B) Another researcher approaches the same issue differently. She prepares textbook segments on two unrelated topics. Each segment comes in two versions, one with questions before and the other with questions after. The subjects are a single group of children.
Each child studies both topics, one (chosen at random) with questions before and the other with questions after. The researcher compares test scores for each child on the two topics to see which topic he or she learned better.

The parameter in a matched-pair t-procedure is either:
*
OR
*

EXAMPLE: Is caffeine dependence real? Our subjects are 11 people diagnosed as being dependent on caffeine. Each subject was barred from coffee, colas, and other substances containing caffeine. Instead they took capsules containing their normal caffeine intake. During a different time period, they took placebo capsules. The order in which subjects took caffeine and placebo was randomized. The table below contains data on one of several tests given to the subjects. “Score” is the score on the Beck Depression Inventory. Higher scores show more symptoms of depression. Construct and interpret a 90% confidence interval for the mean change in depression score. SHOW ALL STEPS

Subject #   Score (caffeine)   Score (placebo)
1           5                  16
2           5                  23
3           4                  5
4           3                  7
5           8                  14
6           5                  24
7           0                  6
8           0                  3
9           2                  15
10          11                 12
11          1                  0

Estimating a Population Proportion (10.3)

Learning Targets:
* Given a sample proportion, $\hat{p}$, determine the standard error of $\hat{p}$.
* List the 3 conditions that must be present before constructing a confidence interval for an unknown proportion.
* Construct a confidence interval for a population proportion, remembering to use the 4-step procedure.
* Determine the sample size necessary to construct a level C confidence interval for a population proportion with a specified margin of error.

Conditions for Inference about a Proportion:
As always, inference is based on the sampling distribution of a statistic. We described the sampling distribution of a sample proportion $\hat{p}$ in section 9.2. Here is a brief review of its important properties:

Center: The mean is ________. The sample proportion is an ___________________ estimator of the population proportion, p.
Spread: The standard deviation of $\hat{p}$ is ________________________, provided that the population is at least ______ times as large as the sample.

Shape: The distribution of $\hat{p}$ is approximately Normal if the sample size is large enough that ______________________ and _______________________.

In practice, we don’t know the value of p. (If we did, we wouldn’t need a confidence interval for it.) So we won’t be able to check the conditions for a Normal distribution given above. In large samples, $\hat{p}$ will be ________________ to p. So we replace p by $\hat{p}$ in determining the conditions for Normality. We also replace p with $\hat{p}$ in the formula for standard deviation. Since we are using an estimated value, the standard deviation is now called the ______________________________ of $\hat{p}$:

$SE_{\hat{p}} = \sqrt{\dfrac{\hat{p}(1 - \hat{p})}{n}}$

We use this in the confidence interval formula: estimate ± margin of error

REVIEW: Steps to displaying a confidence interval:
P:
A:
N:
I:
C:

EXAMPLE 1: As part of a quality improvement program, your mail-order company is studying the process of filling customer orders. Company standards say an order is shipped on time if it is sent out within 3 working days after it is received. You audit an SRS of 100 of the 500 orders received in the past month. The audit reveals that 86 of these orders were shipped on time. Find a 95% confidence interval for the true proportion of the month’s orders that were shipped on time.
P:
A:
N:
I:
C:

EXAMPLE 2: A national opinion poll found that 44% of all American adults agree that parents should be given vouchers good for education at any public or private school of their choice. The result was based on a small sample. How large an SRS is required to obtain a margin of error of 0.03 (that is, 3%) in a 95% confidence interval?

EXAMPLE 3: A company has received complaints about its customer service. They intend to hire a consultant to carry out a survey of customers.
Before contacting the consultant, the company president wants some idea of the sample size that she will be required to pay for. One critical question is the degree of satisfaction with the company’s customer service, measured on a five-point scale. The president wants to estimate the proportion, p, of customers who are satisfied (that is, who choose either “satisfied” or “very satisfied,” the two highest levels on the 5-point scale). She decides she wants the estimate to be within 2% (0.02) at a 95% confidence level. Find the sample size needed to estimate the proportion of customers who are satisfied with this margin of error.

Chapter 10: Estimating with Confidence

Key Vocabulary: Statistical inference, Confidence levels, Critical values, t-distribution, z-distribution, Paired t procedures, Confidence intervals, Margin of Error, Standard error, Degrees of freedom, one-sample t interval, Robust

10.1 Confidence Intervals: The Basics
1. What does a confidence interval estimate?
2. Explain the difference between a confidence interval and a confidence level.
3. Describe the conditions that must be met in order to determine a confidence interval.
4. What is meant by a critical value?
5. Explain margin of error. What error does it cover?
6. List three situations in which the margin of error gets smaller.

10.2 Estimating a Population Mean
1. Why do we use standard error?
2. Under what assumptions is Sx a reasonable estimate of $\sigma$?
3. What is a t-distribution?
4. List 3 facts about the t-distribution.
5. Why do we use degrees of freedom?
6. Describe the difference between a one-sample t-interval and a two-sample t-interval.
7. If a procedure is robust, it is not strongly affected by ______________________.

10.3 Estimating a Population Proportion
1. Explain how “center”, “shape”, and “spread” help determine the conditions/assumptions for inference about a proportion.
2. In statistics, what is meant by a sample proportion?
3. What is the confidence interval estimating?
4.
When is $\hat{p}$ approximately Normal?
5. How do you calculate the standard error of $\hat{p}$?
6. How do you determine a confidence interval for p?

Significance Tests – The Basics (11.1)

Learning Targets:
* Explain why significance testing looks for evidence against a claim rather than in favor of the claim.
* Define Null Hypothesis and Alternative Hypothesis.
* Explain the difference between a one-sided hypothesis and a two-sided hypothesis.
* Identify the three conditions that need to be present before doing a significance test for a mean.
* Explain what is meant by a test statistic. Give the general form of a test statistic.
* Define P-value.
* Define significance level.
* Define statistical significance at level $\alpha$.
* Explain the difference between the P-value approach to significance testing and the statistical significance approach.

Introduction:
In the last chapter, we found confidence intervals to _________________ ___________________________________. The other common type of statistical inference, called significance tests, has a different goal: to assess the evidence provided by data about some claim concerning a population.

EXAMPLE: I claim that I make 80% of my basketball free throws. To test my claim, you ask me to shoot 20 free throws. I make only 8 of the 20. “Aha!” you say. “Someone who makes 80% of their free throws would almost never make only 8 of the 20. So I don’t believe your claim.” Your reasoning is based on asking what would happen if my claim were true and we repeated the sample of 20 free throws many times: I would almost never make as few as 8. This outcome is so unlikely that it gives strong evidence that my claim is not true.

The Basics:
A significance test is a formal procedure for comparing __________________ with a _________________. The hypothesis is a statement about a population parameter, like the population mean, $\mu$, or population proportion, p.
The reasoning of statistical tests is based on asking what would happen if we repeated the sampling or experiment many times. As in the previous chapter, we begin with the unrealistic assumption that we know the population standard deviation, $\sigma$.

The Hypotheses:
We will be writing a null hypothesis, which always states that ____________________________, and an alternative hypothesis, which suggests that ____________________________________________________. We abbreviate the null hypothesis as H0 and the alternative hypothesis as Ha. Hypotheses always refer to some population, not a particular outcome. Be sure to state H0 and Ha in terms of a population parameter.

EXAMPLE: Diet colas use artificial sweeteners to avoid sugar. Colas with artificial sweeteners gradually lose their sweetness over time. Manufacturers therefore test new colas for loss of sweetness before marketing them. Trained tasters sip the cola along with drinks of standard sweetness and score the cola on a “sweetness score” of 1 to 10. The cola is then stored for a month at high temperature to imitate the effect of 4 months’ storage at room temperature. After a month, each taster scores the stored cola again. This is a matched-pairs experiment. Our data are the differences in the tasters’ scores (score before storage minus score after storage). The bigger these differences, the bigger the loss of sweetness. Here are the sweetness losses for a new cola, measured by 10 trained tasters:

2.0 0.4 0.7 2.0 -0.4 2.2 -1.3 1.2 1.1 2.3

Most tasters found a loss of sweetness, but 2 found a gain in sweetness. Assume the population standard deviation is $\sigma = 1$. We need to know whether these data are good evidence that the cola actually lost sweetness in storage.

The sample mean $\bar{x}$ is calculated to be __________.
The population mean $\mu$ would represent ___________________________________________________
* What are we saying if “there is no change”? ___________________________________
* What does the evidence point to?
____________________________________________

H0: ________________ Ha: ________________ (One-sided or two-sided?)

EXAMPLES: Practice stating hypotheses:
1. Suppose a dog food manufacturer wants to know if the proper amount of dog food is being placed in the 25-lb bags. (One-sided alternative or two-sided alternative?)
H0:
Ha:
2. The Standard Tire Company has introduced a new tire in Europe that will be guaranteed to last at least 30,000 km. An independent agency conducted several tests and suspects the tires do not last as long as claimed. (One-sided alternative or two-sided alternative?)
H0:
Ha:

The Assumptions:
In chapter 10, we had three conditions that should be satisfied before we construct a confidence interval about an unknown population mean or proportion. These same 3 conditions must be verified before performing significance tests about a population mean or proportion:
* _________________
* _____________________
* __________________
As in the previous chapter, the details for checking the Normality condition are different for means and proportions.
For means: ________________________________________________________
For proportions: ____________________________________________________

The Test Statistics:
The hypothesis test is based on a statistic comparing the value of the parameter as stated in the null hypothesis with an estimate of the parameter from the sample data. If the estimate is far from the parameter, we have evidence against the null hypothesis. To assess how far the estimate is from the parameter, we standardize the estimate:

EXAMPLE: Find the test statistic for the Diet Cola problem.

The P-Values:
The null hypothesis states the claim we are seeking evidence against. The test statistic measures how much the sample data diverge from the null hypothesis. A large divergence tells us we have data that would be unlikely if H0 were true. “Unlikely” is determined by a probability, called a P-value.
We compute the probability assuming _____________________________________________. The P-value tells us how likely it is that the sample data would occur when we assume the null hypothesis is true. (Maybe we just drew a bad sample, but another sample would be closer to the parameter.) Small P-values are evidence against H0 because they say that the observed result is unlikely to occur when H0 is true (probably not just a bad sample). Once we find the test statistic, we find the probability of this value occurring under the assumed H0.

EXAMPLE: Find the P-value for the Diet Cola problem.

If the P-value (probability) is small enough, we reject the claim that the null hypothesis is making. How small is “small enough”?

The Significance Levels:
We must have some value to compare the P-value to. It is very important to understand that this value, called the significance level ($\alpha$-value), must be chosen during the design stage of the experiment, not after the data are collected. The “significance” refers to how likely it is for something to happen. For example, does it have a 5% chance of occurring? Does it have a 1% chance of occurring? If the P-value is less than the $\alpha$-value, we say ___________________________________________, meaning the results are significant enough to reject the claim of the null hypothesis. When deciding on an $\alpha$ value, determine how important the decision is. Life-threatening situations usually require a 0.01 significance level, while a question such as the height of soap suds might only need a 10% significance level.

EXAMPLE: Assume in the Diet Cola problem, the significance level ($\alpha$) is 5%. That is, $\alpha = 0.05$. Compare the P-value found in the previous example with $\alpha$, and decide if the results are significant.

There is a difference between “statistical significance” and “practical significance”. We will get to that later in the chapter.
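The test statistic, P-value, and significance-level comparison for the Diet Cola problem can be checked with a short Python sketch. It assumes the one-sided alternative Ha: μ > 0 (a loss of sweetness) and the stated σ = 1; the function name `one_sided_z_test` is invented for this illustration and is not from the textbook.

```python
import math
from statistics import NormalDist, mean

def one_sided_z_test(data, mu0, sigma):
    """z statistic and upper-tail P-value for H0: mu = mu0 vs Ha: mu > mu0."""
    n = len(data)
    z = (mean(data) - mu0) / (sigma / math.sqrt(n))   # standardize the estimate
    p_value = 1 - NormalDist().cdf(z)                 # area to the right of z
    return z, p_value

# Sweetness losses (before minus after) for the 10 tasters.
losses = [2.0, 0.4, 0.7, 2.0, -0.4, 2.2, -1.3, 1.2, 1.1, 2.3]

alpha = 0.05                                          # chosen before seeing the data
z, p_value = one_sided_z_test(losses, mu0=0, sigma=1)
significant = p_value < alpha

print(f"x-bar = {mean(losses):.2f}, z = {z:.2f}, P-value = {p_value:.4f}")
print("statistically significant at alpha = 0.05" if significant
      else "not statistically significant at alpha = 0.05")
```

Under these assumptions the P-value is far below 0.05, so the results would be called statistically significant. The written conclusion still has to be stated in the context of the cola problem.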
The Conclusion:
The final step in performing a significance test is to draw a conclusion about the null hypothesis: “reject” or “do not reject”; we NEVER “ACCEPT”. Your conclusion should have a clear connection to your calculations and should be stated in the context of the problem. We reject H0 if our sample result is too unlikely to have occurred by chance assuming H0 is true.

EXAMPLE: Write a conclusion for the Diet Cola problem.

To help remember the steps in a significance test, you might use this phrase:
Parameter (What parameter are we using? What does it represent?)
Hypotheses (null and alternative)
Assumptions (SRS, Normality, Independence)
Name the test (one-sample z-test… for now …)
Test statistic (show the formula used and give the value)
Obtain p-value (finding probability)
Make decision (“reject” or “do not reject”)
State conclusion (decision made in the step above, what made you decide that, what it means in context of the problem)

Carrying Out Significance Tests (11.2)

Learning Targets:
* Identify and explain the steps involved in formal hypothesis testing
* Conduct a z-test for a population mean
* Explain the relationship between a level $\alpha$ two-sided significance test for $\mu$ and a level $1 - \alpha$ confidence interval for $\mu$
* Conduct a 2-sided significance test for $\mu$ using a confidence interval

Results of a significance test hold up in court because they point to a result that is unlikely to occur simply by chance.

Two-sided tests:
Two-sided, or two-tailed, tests occur when our alternative hypothesis is two-sided. We need to check both tails of the Normal curve when finding the P-value of a two-tailed test.

EXAMPLE: The medical director of a large company is concerned about the effects of stress on the company’s younger executives. According to the National Center for Health Statistics, the mean systolic blood pressure for males 35 to 44 years old is 128, and the standard deviation in this population is 15.
The medical director examines the medical records of 72 male executives in this age group and finds that their mean systolic blood pressure is $\bar{x} = 129.93$. Is this evidence that the mean blood pressure for all the company’s younger male executives is different from the national average?

Tests from Confidence Intervals:
A 95% confidence interval captures the _____________________________________________ in 95% of all samples. If we are 95% confident of this idea, we are also confident that values outside of our confidence interval are incompatible with our data. There is a __________ chance that we won’t capture the true value of the mean. So a 5% significance level and a 95% confidence interval can be used to draw the same conclusions. We can say the same about a _________ confidence interval and a _________ significance level.

EXAMPLE: The Deely Laboratory analyzes specimens of a drug to determine the concentration of the active ingredient. Such chemical analyses are not perfectly precise. Repeated measurements on the same specimen will give slightly different results. The results of repeated measurements follow a Normal distribution quite closely. The analysis procedure has no bias, so the mean of the population of all measurements is the true concentration of the specimen. The standard deviation of this distribution is a property of the analysis method and is known to be $\sigma = 0.0068$ grams per liter. The laboratory analyzes each specimen three times and reports the mean result. A client sends a specimen for which the concentration of active ingredient is supposed to be 0.86%. Deely’s three analyses give concentrations 0.8403, 0.8363, 0.8447. Give a 99% confidence interval for the concentration of the active ingredient. Is this significant evidence (at the 1% level) that the true concentration is not 0.86%?
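Both examples in this section reduce to a few lines of arithmetic, sketched here in Python. The function names `two_sided_z_test` and `z_interval` are hypothetical helpers written for this illustration; both assume σ is known, as the examples state.

```python
import math
from statistics import NormalDist, mean

def two_sided_z_test(x_bar, mu0, sigma, n):
    """z statistic and two-sided P-value for H0: mu = mu0 vs Ha: mu != mu0."""
    z = (x_bar - mu0) / (sigma / math.sqrt(n))
    p_value = 2 * (1 - NormalDist().cdf(abs(z)))      # both tails
    return z, p_value

def z_interval(x_bar, sigma, n, confidence):
    """Level-C confidence interval for mu when sigma is known."""
    z_star = NormalDist().inv_cdf(1 - (1 - confidence) / 2)
    margin = z_star * sigma / math.sqrt(n)
    return x_bar - margin, x_bar + margin

# Blood pressure example: n = 72 executives, national mean 128, sigma = 15.
z_bp, p_bp = two_sided_z_test(x_bar=129.93, mu0=128, sigma=15, n=72)
print(f"blood pressure: z = {z_bp:.2f}, two-sided P-value = {p_bp:.3f}")

# Deely Laboratory example: three analyses, sigma = 0.0068.
analyses = [0.8403, 0.8363, 0.8447]
low, high = z_interval(mean(analyses), sigma=0.0068, n=len(analyses), confidence=0.99)
claimed = 0.86
print(f"Deely 99% CI: ({low:.4f}, {high:.4f})")
print("0.86 lies inside the interval" if low <= claimed <= high
      else "0.86 lies outside the interval, so reject H0 at the 1% level")
```

The second half shows the duality described above: a two-sided test at level α = 0.01 rejects H0: μ = 0.86 exactly when 0.86 falls outside the 99% confidence interval.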
Use and Abuse of Tests (11.3)

Learning Targets:
* Distinguish between statistical significance and practical importance
* Identify the advantages and disadvantages of using P-values rather than a fixed level of significance

Introduction:
Significance tests are frequently used when reporting results of research in many fields. New drugs require significant evidence of effectiveness and safety. Courts ask about statistical significance in hearing discrimination cases. Marketers want to know whether a new ad campaign significantly outperforms the old one, and medical researchers want to know whether a new treatment performs significantly better.

Choosing a level of significance:
When designing a significance test, you should choose $\alpha$ before performing the test. Consider how much evidence is required to reject H0.
* How plausible is H0?
- If H0 represents an assumption that the people you must convince have believed for years, it’s going to take strong evidence (small $\alpha$) to convince them.
* What are the consequences of rejecting H0?
- If rejecting H0 means an expensive change, you will need strong evidence.
* $\alpha$ is not a hard-and-fast cutoff.
- There is no practical distinction between P = 0.049 and P = 0.051.

Statistical Significance and Practical Importance:
When large samples are available, even tiny deviations from the null hypothesis will be significant. Always remember to check the practical significance.

EXAMPLE: Suppose a dog food manufacturer wants to know if the proper amount of dog food is being placed in the 25-lb bags. Suppose an SRS of 2000 bags was selected, and $\bar{x} = 25.01$ pounds with $\sigma = 0.1$ pound. Is this enough evidence that the bag-filling equipment needs to be adjusted? Show all steps.

Statistical inference is not valid for all sets of data:
Badly designed surveys or experiments often provide invalid results. Our tests of significance cannot correct flaws in the design.
Don’t be too impressed by P-values on a printout until you are confident that the data were produced correctly.

EXAMPLE: You wonder whether background music would improve the productivity of the staff who process mail orders in your business. After discussing the idea with the workers, you add music and find a statistically significant increase. Should you conclude that the improvement is due to the background music?
Change in environment
Study under way
What needs to be added?

Don’t ignore lack of significance:
There is a tendency to infer “no change” or “no effect” whenever the P-value fails to attain the usual 5% standard. Remember, we simply “fail to reject H0” when the P-value is not less than $\alpha$. That does not mean that H0 is true. Maybe we need to increase our sample size and perform the significance test again.

Using Inference to Make Decisions (11.4)

Learning Targets:
* Define what is meant by a Type I error and a Type II error
* Describe, given a real situation, what constitutes a Type I error and what the consequences of such an error would be
* Describe, given a real situation, what constitutes a Type II error and what the consequences of such an error would be
* Describe the relationship between significance level and a Type I error
* Define what is meant by the power of a test
* Identify the relationship between the power of a test and a Type II error
* List 4 ways to increase the power of a test
* Explain why a large value for the power of a test is desirable.

Introduction:
Tests of significance assess the _____________________________________________. We measure evidence by the P-value, which is “probability, computed under the assumption that ____________________________________________”. The alternative hypothesis enters the test only to ____________________________________________________. When we choose $\alpha$ before performing the test, we are using the outcome of the test to ______________________________.
EXAMPLE: A producer of bearings and the consumer of the bearings agree that each carload lot must meet certain quality standards. When a carload arrives, the consumer inspects a sample of the bearings. On the basis of the sample outcome, the consumer either accepts or rejects the carload. This is called ____________________________________. For this situation, we must decide between
H0: the lot of bearings meets standards
Ha: the lot of bearings does not meet standards
We have 4 possible situations here:
A)
B)
C)
D)
What is the difference between the two types of errors?

Whether or not we make one of these errors depends on the performance of the significance test. So the probability of making one of these errors will describe the performance of the test.

Probability of Making a Type I Error:
EXAMPLE: The mean diameter of a type of bearing is supposed to be 2.000 cm. The bearing diameters vary Normally with standard deviation $\sigma = 0.010$ cm. When a carload lot of the bearings arrives, the consumer takes an SRS of 5 bearings from the lot and measures their diameters. The consumer rejects the bearings if the sample mean diameter is significantly different from 2 cm at the 5% significance level.

Probability of Making a Type II Error:
Step 1: ________________________________________________________________
Step 2: _______________________________
Step 3: ________________________________________________________________
(to standardize, you will be given a value to use as $\mu_a$)

EXAMPLE: Find the probability of making a Type II error in the bearing problem. Use a significance level of 5%, and $\mu_a = 2.015$:

A test makes a Type II error when it fails to reject a null hypothesis that really is false. A high probability of a Type II error means that __________________________________. We usually report the sensitivity of the test as the strength, or the ________________, of the test.
This is the probability that the test will reject H0 when it’s supposed to reject it (when Ha is actually true). To calculate, use
POWER = 1 – P(Type II error)
** Calculations of P-values and calculations of power both say ___________________ ____________________________________________________________.
** P-value describes _________________________________________________.
** Power tells us ____________________________________________________.

EXAMPLE: The cola maker in the “loss of sweetness” problem determines that a sweetness loss is too large to accept if the mean response for all tasters is $\mu_a = 1.1$. Will a 5% significance test of the hypotheses
H0: $\mu = 0$
Ha: $\mu > 0$
based on a sample of 10 tasters usually detect a change this great? (Previously, $\sigma = 1$.)

Chapter 11: Testing a Claim

Key Vocabulary: Null hypothesis, P-value, Test statistic, Significance level, Test of significance, Type I error, Alternative hypothesis, Statistically significant, Practical importance, Upper p critical value, Two-sided test, Type II error

11.1 Significance Tests: The Basics
1. What is a null hypothesis?
2. What is an alternative hypothesis?
3. What is a test statistic? How do you find it?
4. What is meant by a P-value?
5. How does a P-value give information about the significance test?
6. What does the alpha-value do in a significance test?

11.2 Carrying Out Significance Tests
11.3 Use and Abuse of Tests
1. Explain the 3 conditions necessary to perform inference.
2. Explain the difference between accepting H0 and failing to reject H0.
3. Explain how we can use a confidence interval to make a decision about H0.
4. What should be considered when deciding on a level of significance?
5. Explain the purpose of example 11.15.

11.4 Using Inference to Make a Decision
1. Explain the difference between a Type I error and a Type II error.
2. What is the relationship between the significance level and the probability of a Type I error?
3. Describe briefly how to find the power of a significance test.
4.
What does the power of a significance test tell us?
5. List 4 ways you can increase the power of a significance test.

Inference for the Mean of a Population (12.1)

Learning Targets:
* Define the one-sample t-statistic
* Determine critical values of t (t*) from a “t-table”, given the probability of being to the right or left of t*
* Determine the P-value of a t-statistic for both a one-sided and two-sided significance test
* Conduct a one-sample t significance test for a population mean using the required steps
* Conduct a paired t-test for the difference between two population means

Introduction:
Now that we have studied the principles and process of significance tests, we move to the practical application. We begin by dropping the unrealistic assumption that we know the population standard deviation, $\sigma$. We use the t-distribution, as we did in chapter 10 with confidence intervals. Let’s review some of the characteristics of the t-distribution:
* Since $\sigma$ was replaced by s, the statistic has more ________________ and no longer has a Normal distribution.
* We specify a particular t-distribution by giving its ________________________.
* The density curves of the t-distributions are similar in shape to the _____________ ______________________________. They are symmetric about ________, single-peaked, and ________________________.
* The spread of the t-distributions is a bit _________________ than that of the standard Normal distribution. The t-distributions have more area/probability in ___________________ and less in the __________________ than the standard Normal distribution does.
* As the degrees of freedom, k, increase, the t(k) density curve approaches the _____________________________ more closely.

The t-distribution:
Since the population standard deviation is unknown and must be estimated with the ____________________________________, we cannot find the standardized value, z, to use as the test statistic. There is another standardized value we can use, called ______.
The formula is very similar to the z-score formula:

Like z, this standardized value t tells us how many standard-deviation units $\bar{x}$ is from $\mu$. Unlike z, there is a different t-distribution for each sample size. A significance test using the test statistic t is called a one-sample t-test.

Assumptions:
One-sample t-procedures are exactly correct when the population is _____________, which never happens in reality. We assume that the population is _____________________________ in order to justify the use of t-procedures. The t-procedures are strongly influenced by ____________. The results will not be reliable if there are _______________ and the sample size is ________________. For small sample sizes, sizes of less than ______, only use t-procedures if the data are close to Normal. If the sample size is at least _______, only use t-procedures if there is no ____________________. We can safely use t-procedures if the sample size is at least ________, even if the data are skewed. Outliers in any sample size make the t-procedures invalid.

YOU MUST MAKE A BOXPLOT ON YOUR CALCULATOR AND CHECK ITS SYMMETRY BEFORE ASSUMING THE DATA ARE NEARLY NORMAL. COMMENTS ABOUT THE SHAPE OF THE BOXPLOT ARE REQUIRED IN THE ASSUMPTIONS STEP.

EXAMPLE: Many homeowners buy detectors to check for the invisible gas, radon, in their homes. How accurate are these detectors? To answer this question, university researchers placed 12 radon detectors in a chamber that exposed them to 105 picocuries per liter of radon. The detector readings were as follows:

91.9 97.8 111.4 122.3 105.4 95.0 103.8 99.6 96.6 119.3 104.8 101.7

a) Make a boxplot of the data. Describe the shape of the distribution.
b) Is there convincing evidence that the mean reading of all detectors of this type differs from the true value of 105? Carry out a test in detail, and then write a brief conclusion. (Use the tcdf function on the calculator during the “O” step.)
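For part (b) of the radon example, the calculations in the “T” and “O” steps can be sketched in Python. Python’s standard library has no counterpart to the calculator’s tcdf, so this sketch assumes a small hand-written `t_cdf` that integrates the t density numerically with Simpson’s rule. Both `t_pdf` and `t_cdf` are illustrative helpers, not library functions.

```python
import math
import statistics

def t_pdf(x, df):
    """Density of the t distribution with df degrees of freedom."""
    c = math.gamma((df + 1) / 2) / (math.sqrt(df * math.pi) * math.gamma(df / 2))
    return c * (1 + x * x / df) ** (-(df + 1) / 2)

def t_cdf(x, df, steps=2000):
    """P(T <= x), via Simpson's rule on [0, |x|] and symmetry about 0."""
    a = abs(x)
    h = a / steps                       # steps must be even for Simpson's rule
    total = t_pdf(0.0, df) + t_pdf(a, df)
    for i in range(1, steps):
        total += (4 if i % 2 else 2) * t_pdf(i * h, df)
    area = total * h / 3                # area under the curve between 0 and |x|
    return 0.5 + area if x >= 0 else 0.5 - area

# Readings from the 12 radon detectors; the true level in the chamber was 105.
readings = [91.9, 97.8, 111.4, 122.3, 105.4, 95.0,
            103.8, 99.6, 96.6, 119.3, 104.8, 101.7]
mu0 = 105

n = len(readings)
x_bar = statistics.mean(readings)
s = statistics.stdev(readings)
t_stat = (x_bar - mu0) / (s / math.sqrt(n))        # one-sample t statistic
p_value = 2 * t_cdf(-abs(t_stat), df=n - 1)        # two-sided, since Ha: mu != 105

print(f"x-bar = {x_bar:.2f}, s = {s:.2f}, t = {t_stat:.3f}, P-value = {p_value:.3f}")
```

The P-value here is large, so this sketch would not reject H0. The boxplot check and the written conclusion in context are still needed for the full PHANTOMS solution.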
P:
H:
A:
N:
T:
O:
M:
S:

Paired t-tests:
Remember the diet cola “loss of sweetness” problem? The same 10 tasters rated the sweetness before and after storage. We subtracted the after-storage score from the before-storage score and performed a test on the differences. This is the method of a paired t-test.

EXAMPLE: The Acculturation Rating Scale for Mexican Americans (ARSMA) and the Bicultural Inventory (BI) both measure the extent to which Mexican Americans have adopted Anglo/English culture. These tests were compared by administering both tests to 22 Mexican Americans. Both tests have the same range of scores (1.00 to 5.00) and are scaled to have similar means for the groups used to develop them. There was a high correlation between the two scores, giving evidence that both are measuring the same characteristics. The researchers wanted to know whether the population mean difference in scores for the two tests is 0. The differences in scores (ARSMA – BI) for the 22 participants had $\bar{x} = 0.2519$ and s = 0.2767. Find the test statistic and the P-value to answer the researchers’ question.
P:
H:
A:
N:
T:
O:
M:
S:

Tests about a Population Proportion (12.2)

Learning Targets:
* Explain why $p_0$ instead of $\hat{p}$ is used when computing the standard error of $\hat{p}$ in a significance test for a population proportion.
* Explain why the correspondence between a two-tailed significance test and a confidence interval for a population proportion is not as exact as when testing for a population mean.
* Explain why the test for a population proportion is sometimes called a large sample test.
* Conduct a significance test for a population proportion using the PHANTOMS steps.
* Discuss how significance tests and intervals can be used together to help draw conclusions about a population proportion.

Introduction:
The proportion of a population having a given characteristic is a parameter called ______. The proportion of a sample having a given characteristic is a statistic called __________.
The problems in this section are the same type as in chapter 11 and 12.1, where we are hypothesizing about a _________________, but instead of $\mu$, we are making inferences about _______. Instead of $\bar{x}$, we use _____ to standardize and find the P-value. We still do not know the population standard deviation, ______. Instead of using s, we calculate an estimate (standard error) using a formula similar to the one used in confidence intervals:

Assumptions:
For sufficiently large samples, we know that the sampling distribution of $\hat{p}$ is approximately _______________ (provided p is not too close to _______ or _______), with a mean equal to _____ and standard deviation equal to ______________. To standardize $\hat{p}$, we use the z-score formula:

Assumptions for inference about a proportion (one-proportion z-test):
*
*
*

Hypothesis Tests:
H0:
Ha:
Find z:
Find p-value:

Confidence Intervals:
The formula for finding a confidence interval is:
The interval is describing ________________________________________________
How does the confidence interval support a decision made in a significance test?

EXAMPLE 1: As part of a quality improvement program, your mail-order company is studying the process of filling customer orders. Company standards say an order is shipped on time if it is sent out within 3 working days after it is received. You audit an SRS of 100 of the 500 orders received in the past month. The audit reveals that 86 of these orders were shipped on time. Find a 95% confidence interval for the true proportion of the month’s orders that were shipped on time.
P A N I C
Is there good evidence that the proportion of orders shipped on time differs from last month’s average of 88%?

EXAMPLE 2: According to the National Institute for Occupational Safety and Health, job stress poses a major threat to the health of workers. A national survey of restaurant employees found that 75% said that work stress had a negative impact on their personal lives.
A random sample of 100 employees from a large restaurant chain finds that 68 answer “yes” when asked, “Does work stress have a negative impact on your personal life?” Is this good reason to think that the proportion of all employees in this chain who would say “yes” differs from the national proportion $p_0 = 0.75$?

P H A N T O M S

Chapter 12: Significance Tests in Practice
Key Vocabulary: Standard Error, Degrees of Freedom, Paired t-tests, t Distribution, One-sample t-statistic, One-prop z-test

12.1 Tests about a Population Mean
1. Who does the author say developed the t-distributions?
2. What was he trying to do when he noticed the need for t-distributions?
3. What is the interpretation for the t-statistic?
4. Describe the differences between a standard normal distribution and a t distribution.
5. Explain how normality is checked for a population mean.
6. How do you calculate the degrees of freedom for the t-distribution?
7. What happens to the t-distribution as the degrees of freedom increase?
8. In a matched pairs t-procedure, what is the parameter of interest, $\mu$?
9. What measures a test’s ability to detect deviations from the null hypothesis?
10. How would you construct a level C confidence interval for $\mu$ if $\sigma$ is unknown?
11. Under what assumptions is s a reasonable estimate of $\sigma$?

12.2 Tests about a Population Proportion:
1. Explain how normality is checked for a population proportion.
2. Why do some people call this a “large-sample test”?
3. Why are confidence intervals sometimes included as part of the analysis?
4. Give the formula for standard error used in finding a confidence interval.
5. Give the formula for standard error used in a one-proportion z-test.
6. Explain why we use different formulas in the confidence interval and the test statistic.
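Optional software check for section 12.2: questions 4 to 6 above contrast the two standard errors, and a short Python sketch can make the contrast concrete. It is only an illustration under stated assumptions: the function names are made up for these notes, the Normal approximation is assumed to be appropriate, and the numbers come from Example 2 (68 of 100, $p_0 = 0.75$) and Example 1 (86 of 100). The test uses $p_0$ in the standard error; the interval uses $\hat{p}$.

```python
from math import sqrt
from statistics import NormalDist


def one_prop_z_test(successes, n, p0):
    """Two-sided one-proportion z-test. Returns (p_hat, z, p_value)."""
    p_hat = successes / n
    # The test assumes H0 is true, so the standard error uses p0.
    se_test = sqrt(p0 * (1 - p0) / n)
    z = (p_hat - p0) / se_test
    p_value = 2 * NormalDist().cdf(-abs(z))
    return p_hat, z, p_value


def one_prop_z_interval(successes, n, confidence=0.95):
    """Confidence interval for p. Returns (low, high)."""
    p_hat = successes / n
    # No hypothesized value here, so the standard error uses p_hat.
    se_ci = sqrt(p_hat * (1 - p_hat) / n)
    z_star = NormalDist().inv_cdf(1 - (1 - confidence) / 2)
    return p_hat - z_star * se_ci, p_hat + z_star * se_ci


# Example 2: restaurant employees, H0: p = 0.75 vs Ha: p != 0.75
p_hat, z, p_value = one_prop_z_test(68, 100, 0.75)
print(f"p_hat = {p_hat:.2f}, z = {z:.3f}, P-value = {p_value:.3f}")

# Example 1: 95% interval for the proportion of orders shipped on time
low, high = one_prop_z_interval(86, 100)
print(f"95% CI: ({low:.3f}, {high:.3f})")
```

For Example 2 this gives z of roughly -1.62 and a two-sided P-value of roughly 0.106. You should still write out the PHANTOMS or PANIC steps by hand; the code only checks the arithmetic.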
Comparing Two Means (13.1)
Learning Targets:
* Identify situations in which two-sample problems might arise
* Describe the three conditions necessary for doing inference involving two population means
* Clarify the difference between the two-sample z-statistic and the two-sample t-statistic
* Identify the two practical options for using two-sample t-procedures and how they differ in terms of computing the number of degrees of freedom
* Conduct a 2-sample significance test for the difference between two independent means using PHANTOMS
* Compare the robustness of two-sample procedures with that of one-sample procedures. Include in your comparison the role of equal sample sizes
* Explain what is meant by “pooled two-sample t-procedures,” when pooling can be justified, and why it is advisable not to pool

Introduction:
Comparative studies are more convincing than single-sample investigations, so _________ inference is not as common as comparative __________________________________. In a comparative study, we may want to compare two ____________, or we want to compare two _______________. In either case, the samples must be chosen ______________ and ____________________ in order to perform statistical inference.

Because matched pairs are not chosen independently, we will not use two-sample inference for a matched pairs design. Instead, we apply the one-sample __________________ to the observed __________________. If the same subjects are creating both data sets, it is a matched pairs design.

EXAMPLE 1
The following situations require inference about a mean or means. Identify each as a single-sample, a matched-pairs sample, or a two-sample:
A) An education researcher wants to learn whether it is more effective to put questions before or after introducing a new concept in an elementary school mathematics text. He prepares two textbook segments that teach the same concept, one with motivating questions before and the other with review questions after.
He uses each text segment to teach a separate group of children. The researcher compares the scores of the groups on a test over the material.
B) Another researcher approaches the same issue differently. She prepares textbook segments on two unrelated topics. Each segment comes in two versions, one with questions before and the other with questions after. The subjects are a single group of children. Each child studies both topics, one (chosen at random) with questions before and the other with questions after. The researcher compares test scores for each child on the two topics to see which topic he or she learned better.

Comparing Two Means
The null hypothesis for a two-sample hypothesis test still says there is no difference, but now there are two $\mu$’s. You can use: ______________ or ___________________
The alternative hypothesis could be:
_______________________ or _______________________ (two-sided)
_______________________ or _______________________ (one-sided)
_______________________ or _______________________ (one-sided)

Before you begin, check your assumptions:

If these assumptions hold, then the difference in sample means is an unbiased estimator of the difference in population means, which tells us that ______________________. Furthermore, if both populations are normally distributed, then __________ is also normally distributed. Also, the variance of $\bar{x}_1 - \bar{x}_2$ is the sum of the variances of _____ and ______. So the standard error for the two-sample means is:

The two-sample t-statistic formula is similar to the one-sample t-statistic formula:

The degrees of freedom is taken to be the smaller of $n_1 - 1$ and $n_2 - 1$.

EXAMPLE 2
In a study of cereal leaf beetle damage on oats, researchers measured the number of beetle larvae per stem in small plots of oats after randomly applying one of two treatments: no pesticide or Malathion at the rate of 0.25 pound per acre. The data appear roughly normal.
Here are the summary statistics:

Group   Treatment   n    $\bar{x}$   s
1       control     13   3.47        1.21
2       Malathion   14   1.36        0.52

Is there significant evidence at the 1% level that Malathion reduces the mean number of larvae per stem?

P H A N T O M S

Estimating Two Means with Confidence
We can also find confidence intervals with two-sample t-procedures. The formula is similar to the one-sample t-procedure:

The confidence interval says: ________________________________________________ or _____________________________________________________________________

EXAMPLE 3
Ordinary corn doesn’t have as much of the amino acid “lysine” as animals need in their feed. Plant scientists have developed varieties of corn that have increased amounts of lysine. In a test of the quality of high-lysine corn as animal feed, an experimental group of 20 one-day-old male chicks received a ration containing the new corn. A control group of 20 one-day-old male chicks received a ration that was identical except it contained normal corn. Here are the weight gains (in grams) after 21 days:

CONTROL
380 321 366 356
283 349 402 462
356 410 329 399
350 384 316 272
345 455 360 431

EXPERIMENTAL
361 447 401 375
434 403 393 426
406 318 467 407
427 420 477 392
430 339 410 326

a) Check your assumptions for comparing two means
b) Is there good evidence that chicks fed high-lysine corn gain weight faster?
c) Give a 95% confidence interval for the mean extra weight gain in chicks fed high-lysine corn.

Robustness
Recall from chapter 10, an inference procedure is called robust if the probability calculations involved in the procedure remain fairly accurate when a condition for use is violated. The _____________________ procedures are more robust than the ____________________ procedures, particularly when the distributions are not symmetric. When the sizes of the two samples are _________________ and the two populations being compared have distributions with similar shapes, probability values from the t-table are quite accurate.
A broad range of distributions is included when the sample sizes are as small as $n_1 = n_2 = 5$. When the two distributions have different _______________________, larger samples are needed. As a guide, we will adapt the sample size conditions from the one-sample t-procedures by replacing “sample size” with “the sum of the sample sizes” as long as $n_1$ and $n_2$ are both at least 5.
* Except in the case of small samples, the assumption that the data are an SRS from the population of interest is more important than the assumption that the population distribution is Normal.
* Sum of the sample sizes is less than 15: Use t-procedures only if the data are close to Normal.
* Sum of the sample sizes is at least 15: The t-procedures can be used except in the presence of _________________________ or strong _________________________
* Large samples: The t-procedures can be used even for clearly skewed distributions when the sum of the samples is large (n > 30).
WE CAN NEVER USE T-PROCEDURES WHEN OUTLIERS ARE PRESENT

In planning a two-sample study, choose equal sample sizes if you can. The two-sample t-procedures are most robust against non-Normality in this case, and the conservative P-values are most accurate. We can, however, complete a study even if the sample sizes are not equal.

The Pooled Two-Sample t-Procedures (Don’t Use Them!)
Pooled two-sample t-procedures average the two sample variances to estimate a population standard deviation that is assumed to be the same for both populations. The pooled t-statistic equals our two-sample t-statistic only when the two sample sizes are the same.
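Optional software check for section 13.1: the sketch below works the arithmetic for the Malathion summary statistics from Example 2 and compares the unpooled statistic used in these notes with the pooled one. It is a minimal illustration, not the textbook's method of presentation: the function names are made up, the degrees of freedom follow the conservative "smaller of $n_1 - 1$ and $n_2 - 1$" rule from the notes, and the critical value is a constant copied from a t table because the Python standard library has no t distribution.

```python
from math import sqrt


def two_sample_t(mean1, s1, n1, mean2, s2, n2):
    """Unpooled two-sample t statistic with conservative degrees of freedom."""
    standard_error = sqrt(s1**2 / n1 + s2**2 / n2)
    t = (mean1 - mean2) / standard_error
    df = min(n1 - 1, n2 - 1)  # conservative choice used in these notes
    return t, df


def pooled_t(mean1, s1, n1, mean2, s2, n2):
    """Pooled two-sample t statistic, shown only for comparison."""
    pooled_var = ((n1 - 1) * s1**2 + (n2 - 1) * s2**2) / (n1 + n2 - 2)
    return (mean1 - mean2) / sqrt(pooled_var * (1 / n1 + 1 / n2))


# Example 2: control (group 1) vs Malathion (group 2), Ha: mu1 > mu2
t, df = two_sample_t(3.47, 1.21, 13, 1.36, 0.52, 14)
t_pooled = pooled_t(3.47, 1.21, 13, 1.36, 0.52, 14)

# Critical value t* read from a t table: df = 12, upper-tail area 0.01.
T_STAR_DF12_UPPER_01 = 2.681

print(f"t = {t:.3f} with df = {df}; pooled t = {t_pooled:.3f}")
print("Reject H0 at the 1% level:", t > T_STAR_DF12_UPPER_01)
```

With unequal sample sizes (13 and 14) the two statistics differ slightly. If you call both functions with equal sample sizes they return the same value, which is the point made in the pooled-procedures paragraph above.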
Comparing Two Proportions (13.2)
Learning Targets:
* Identify the mean and standard deviation of the sampling distribution of $\hat{p}_1 - \hat{p}_2$
* List the conditions under which the sampling distribution of $\hat{p}_1 - \hat{p}_2$ is approximately Normal
* Identify the standard error of $\hat{p}_1 - \hat{p}_2$ when constructing a confidence interval for the difference between two population proportions
* Construct a confidence interval for the difference between two population proportions using the PANIC method for confidence intervals
* Explain why, in a significance test for the difference between two proportions, it is reasonable to combine (pool) your sample estimate to make a single estimate of the difference between the proportions.
* Explain how the standard error of $\hat{p}_1 - \hat{p}_2$ differs between constructing a confidence interval for $p_1 - p_2$ and performing a hypothesis test for $H_0: p_1 - p_2 = 0$
* List three conditions that need to be satisfied in order to conduct a significance test for the difference between two proportions
* Conduct a significance test for the difference between two proportions using PHANTOMS

Introduction:
In section 13.1, we discussed a need to compare two ____________________ or two __________________________. We used a ____________________________. We can still make these comparisons when using proportions. We call it a _____________________ _________________________________.

Two-Sample Problems: Proportions
We will use notation similar to that used in our study of two-sample t statistics. The groups we want to compare are Population 1 (or Treatment 1) and Population 2 (or Treatment 2). We have a separate SRS from each population (or treatment). Here is our notation:

Population   Population proportion   Sample size   Sample proportion
1            $p_1$                   $n_1$         $\hat{p}_1$
2            $p_2$                   $n_2$         $\hat{p}_2$

We compare the populations by doing inference about the difference $p_1 - p_2$. The statistic that estimates this difference is the difference between the two sample proportions, $\hat{p}_1 - \hat{p}_2$.
Significance Tests for Comparing Two Proportions:
An observed difference between two sample proportions can reflect a difference in the populations, or it may just be due to chance variation in random sampling. Significance tests help us decide if the effect we see in the samples is really there in the populations.
The null hypothesis says there is _______________________ between the two parameters.
The alternative hypothesis says what kind of difference we expect:
_______________________ or _______________________ (two-sided)
_______________________ or _______________________ (one-sided)
_______________________ or _______________________ (one-sided)

Before you begin, check your assumptions:
*
*
*

To do a test, standardize $\hat{p}_1 - \hat{p}_2$ to get a z-statistic. If H0 is true, all the observations in both samples really come from a single population with a single unknown proportion. So instead of estimating $p_1$ and $p_2$ separately, we combine the two samples and use the overall sample proportion to estimate the single population parameter $p$. Call this the combined sample proportion $\hat{p}_c$ (or the “pooled” sample proportion). It is:

Use $\hat{p}_c$ in place of both $\hat{p}_1$ and $\hat{p}_2$ in the expression for the standard error:

Using this standard error will give us a z-statistic that has the standard Normal distribution when H0 is true. The formula for the z-score is:

EXAMPLE: The 1958 Detroit Area Study was an important investigation of the influence of religion on everyday life. The sample “was basically a simple random sample of the population of the metropolitan area” of Detroit, Michigan. Of the 656 respondents, 267 were white Protestants and 230 were white Catholics. The study took place at the height of the cold war. One question asked whether the government was doing enough in areas such as housing, unemployment, and education. 161 of the Protestants and 136 of the Catholics said no. Is there enough evidence that the 267 white Protestants and the 230 white Catholics differed on this issue?
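Optional software check for the Detroit Area Study example: the pooling idea described above can be sketched in a few lines of Python. This is an illustration under assumptions, not a required method: the function name is invented for these notes, the counts are large enough for the Normal approximation, and the test is two-sided because the question asks whether the groups "differed".

```python
from math import sqrt
from statistics import NormalDist


def two_prop_z_test(x1, n1, x2, n2):
    """Two-sided z-test of H0: p1 = p2 using the combined (pooled) proportion.

    Returns (p_hat_1, p_hat_2, p_hat_c, z, p_value).
    """
    p_hat_1 = x1 / n1
    p_hat_2 = x2 / n2
    # Under H0 both samples estimate one proportion, so combine the counts.
    p_hat_c = (x1 + x2) / (n1 + n2)
    standard_error = sqrt(p_hat_c * (1 - p_hat_c) * (1 / n1 + 1 / n2))
    z = (p_hat_1 - p_hat_2) / standard_error
    p_value = 2 * NormalDist().cdf(-abs(z))
    return p_hat_1, p_hat_2, p_hat_c, z, p_value


# 161 of 267 Protestants and 136 of 230 Catholics said "no"
p1_hat, p2_hat, pc_hat, z, p_value = two_prop_z_test(161, 267, 136, 230)
print(f"p1_hat = {p1_hat:.3f}, p2_hat = {p2_hat:.3f}, pooled = {pc_hat:.3f}")
print(f"z = {z:.3f}, P-value = {p_value:.3f}")
```

The z-statistic comes out near 0.27 with a P-value near 0.79, so the arithmetic you do by hand in the PHANTOMS steps should land close to those values.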
P: H: A: N: T: O: M: S:

Confidence Interval for Comparing Two Proportions:
Confidence Interval = Estimate ± Margin of Error
= Estimate ± z* (Standard Error)
=
Conditions:
*
*
*

EXAMPLE: Another question from the 1958 Detroit Area study asked if the right of free speech included the right to make speeches in favor of communism. Of the 267 white Protestants, 104 said yes, while 75 out of the 230 white Catholics said yes. Check that it is safe to use the z-confidence interval, then give a 95% confidence interval for the difference between the proportion of Protestants who agreed and the proportion of Catholics who agreed.

Chapter 13: Comparing Two Population Parameters
Vocabulary: Two-sample problems, Two-sample z statistic, Robust, Combined sample proportion, Two-sample t statistic, Pooled variances

13.1 Comparing Two Means
1. How are two-sample problems different than one-sample problems?
2. How are the conditions for two-sample procedures different from the conditions for one-sample procedures?
3. Here are the notations used in two-sample procedures. Complete the table:

                          POPULATION            STATISTIC
Population   Variable     Mean     St. Dev.     Sample size   Mean     St. Dev.
1            $x_1$        _____    _____        $n_1$         _____    $s_1$
2            $x_2$        _____    _____        _____         _____    _____

4. Explain when to use two-sample z inference and when to use two-sample t-inference.
5. Write the formula for the test statistic when using two-sample z-procedures.
6. Write the formula for the test statistic when using two-sample t-procedures.
7. How do we adjust the one-sample t guideline about sample sizes to work in the two-sample t test?

13.2 Comparing Two Proportions
1. Explain the difference between a matched-pairs study and a two-sample study.
2. Explain why we can use the combined sample proportion, $\hat{p}_c$, in hypothesis tests but not in confidence intervals.
3. Explain how to calculate the standard error of $\hat{p}_1 - \hat{p}_2$.
4. What assumptions must be met in order to use z-procedures for inference about two proportions?
5.
Describe how to construct a level C confidence interval for the difference between two proportions $p_1 - p_2$.
6. For a two-sample hypothesis test where $H_0: p_1 = p_2$, what is the formula for calculating the z-test statistic?

Test for Goodness of Fit (14.1)
Learning Targets:
* Describe the situation for which the chi-square test for goodness of fit is appropriate
* Define the $X^2$ statistic, and identify the number of degrees of freedom it is based on, for the $\chi^2$ goodness of fit test
* List the conditions that need to be satisfied in order to conduct a $\chi^2$ test for goodness of fit
* Conduct a $\chi^2$ test for goodness of fit
* Identify three main properties of the chi-square density curve
* Use technology to conduct a $\chi^2$ test for goodness of fit
* If a $X^2$ statistic turns out to be significant, discuss how to determine which observations contribute the most to the total value

Introduction:
On average, the new mix of colors of M&M’s Milk Chocolate Candies will contain 13% of each of browns and reds, 14% yellows, 16% greens, 20% oranges and 24% blues. Assume you have a 1.69-ounce bag of M&M’s. After opening, you discover the following candies:

Browns: 7   Reds: 8   Yellows: 9   Greens: 10   Oranges: 13   Blues: 13

Does your bag (sample) reflect the distribution advertised by the M&M/Mars Company? If so, there should be very little difference between the ______________________ counts and the ____________________ counts. But how much difference is too much?

The chi-square (“kie-square”) test for goodness of fit allows us to determine whether a specified population distribution seems valid. To analyze categorical data, we construct one-way tables and examine the counts in the __________________ for the explanatory and _________________ variables. We then compare the observed counts to the expected counts.
One-Way Table:

COLOR    OBSERVED COUNT (O)   EXPECTED COUNT (E)   (observed – expected)² / expected = $\frac{(O - E)^2}{E}$
Blue
Brown
Green
Orange
Red
Yellow
TOTAL                                              Sum:

$X^2 = \sum \frac{(O - E)^2}{E} =$

The sum of the last column is called the ___________________________________. It measures how well the __________________ counts fit the __________________ counts, assuming that the null hypothesis is true. The larger this value is, the more evidence there will be against H0.

P :
H0 :
Ha :

Properties of the Chi-Square Distribution:
* like the t-statistic, there are many chi-square distributions in the family. To choose the appropriate one, use the ________________________________________.
* the degrees of freedom (d.f.) is determined by the number of cells (rows) in the one-way table. (d.f. = # rows – 1)
* all individual expected counts must be at least 1
* no more than 20% of the expected counts can be less than 5 (to ensure a sufficient sample size)
* the distribution is a density curve, beginning at ________ on the horizontal axis and is skewed _______________.
* as the degrees of freedom increase, the shape of the curve becomes more _________.
* $\chi^2$pdf (x, d.f.) will graph the chi-square distribution
* $\chi^2$cdf (lower, upper, d.f.) will calculate the area under the chi-square curve. Use this to find the _____________ for the hypothesis test.

EXAMPLES:
1. Find the area to the right of $X^2$ = 1.41 under the chi-square curve with 2 degrees of freedom.
2. Find the area to the right of $X^2$ = 19.62 under the chi-square curve with 9 degrees of freedom.
3. A “wheel of fortune” at a carnival is divided into four equal parts:
   part I: Win a doll
   part II: Win a candy bar
   part III: Win a free ride
   part IV: Win nothing
You suspect that the wheel is unbalanced (i.e., not all parts of the wheel are equally likely to be landed upon when the wheel is spun). The results of 500 spins of the wheel are as follows:

Part   Frequency
I      95
II     105
III    135
IV     165

Perform a goodness of fit test. Is there evidence that the wheel is not in balance?
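Optional software check for section 14.1: if you do not have a calculator with $\chi^2$cdf handy, the three examples above can be checked with a short Python sketch. Everything named here is an assumption made for illustration: `chi2_sf` and `chi_square_gof` are invented helper names, and the tail area is computed with a standard finite-series formula that is valid only for whole-number degrees of freedom (it plays the role of $\chi^2$cdf(x, very large number, d.f.)).

```python
from math import erfc, exp, pi, sqrt


def chi2_sf(x, df):
    """Area to the right of x under the chi-square curve (integer df >= 1)."""
    if x <= 0:
        return 1.0
    if df % 2 == 0:
        # Even df: exp(-x/2) * sum of (x/2)^k / k! for k = 0 .. df/2 - 1
        term, total = 1.0, 1.0
        for k in range(1, df // 2):
            term *= (x / 2) / k
            total += term
        return exp(-x / 2) * total
    # Odd df: Normal tail piece plus a finite sum of odd powers of sqrt(x)
    chi = sqrt(x)
    term, total = chi, 0.0
    for r in range(1, (df - 1) // 2 + 1):
        total += term
        term *= x / (2 * r + 1)
    return erfc(chi / sqrt(2)) + sqrt(2 / pi) * exp(-x / 2) * total


def chi_square_gof(observed, proportions):
    """Goodness-of-fit test. Returns (expected, components, x2, df, p_value)."""
    n = sum(observed)
    expected = [n * p for p in proportions]
    components = [(o - e) ** 2 / e for o, e in zip(observed, expected)]
    x2 = sum(components)
    df = len(observed) - 1
    return expected, components, x2, df, chi2_sf(x2, df)


# Examples 1 and 2: areas to the right
print(f"{chi2_sf(1.41, 2):.4f}  {chi2_sf(19.62, 9):.4f}")

# Example 3: wheel of fortune, H0: each part has probability 1/4
expected, components, x2, df, p_value = chi_square_gof(
    [95, 105, 135, 165], [0.25, 0.25, 0.25, 0.25]
)
print(f"expected = {expected}, X2 = {x2:.2f}, df = {df}, P-value = {p_value:.6f}")
```

The `components` list holds the individual $\frac{(O - E)^2}{E}$ terms, which is what you would look at to see which part of the wheel contributes most to the total.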
P H A N T O M S

I   II   III   IV

Inference for Two-Way Tables (14.2)
Learning Targets:
* Explain what is meant by a two-way table.
* Given a two-way table, compute the row or column conditional distributions
* Define the chi-square ($X^2$) statistic
* Using the words “populations” and “categorical variables”, describe the major difference between homogeneity of populations and independence
* Identify the form of the null hypothesis in a $\chi^2$ test for homogeneity of populations
* Identify the form of the null hypothesis in a $\chi^2$ test of association/independence
* Given a two-way table of observed counts, calculate the expected counts for each cell
* List the conditions necessary to conduct a $\chi^2$ test of significance for a two-way table
* Use technology to conduct a $\chi^2$ test of significance for a two-way table

Introduction:
In both sections of chapter 14, we are interested in comparing a set of ____________________ to a set of _______________________. In a goodness-of-fit test, there is a single categorical variable that takes on values over a single population. But what if we want to compare more than two groups? We may want to compare gender and opinion on abortion, or background music and wine selection. We need a new way to present the data and a new test.

We will use a two-way, or contingency, table to present the data. We studied these tables in chapter 4 as an organizational method for data analysis. Now, we want to use the data to make decisions.

When comparing multiple variables, we usually have one of the following questions:
1. Do the data come from the same population, or do the populations differ?
2. Are the variables associated, or are they independent?
These questions determine which new Chi-square test to use:

EXAMPLE 1: Suppose that data are collected on gender (M, F) and political party preference (Republican, Democrat, Other). Suppose the data look like this:

             Male   Female
Republican   24     2
Democrat     2      24
Other        1      1

Does there appear to be any relationship?
_______________________________________
Are the variables independent? _____________

Suppose the data look like this:

             Male   Female
Republican   24     24
Democrat     2      2
Other        1      1

Does there appear to be any relationship? _________________________________________
Are the variables independent? _____________

Finally, suppose the data look like this:

             Male   Female
Republican   14     12
Democrat     11     10
Other        2      5

Does there appear to be any relationship? _________________________________________
Are the variables independent? _____________

Part 1 - Homogeneity of Populations:
Homogeneity (homogeneous):
Homogeneity of populations:

Statistical methods for dealing with multiple comparisons usually have two parts:
1. An overall test to see if there is good evidence of any differences among the parameters that we want to compare.
2. A detailed follow-up analysis to decide which of the parameters differ and to estimate how large the differences are.

EXAMPLE 2: In a study of the television viewing habits of children, a developmental psychologist selects a random sample of 300 first graders - 100 boys and 200 girls. Each child is asked which of the following TV programs they like best: The Lone Ranger, Sesame Street, or The Simpsons. Results are shown in the contingency table below.

                      TV Show
Gender   Lone Ranger   Sesame Street   The Simpsons   Total
Boys     50            30              20             100
Girls    50            80              70             200
Total    100           110             90             300

Do the boys' preferences for these TV programs differ significantly from the girls' preferences?

The Problem of Multiple Comparisons:
The researchers expected that gender would influence choice of TV show, so gender is the ________________ variable, and the favorite TV show is the __________________ variable.
If we used a chi-square goodness of fit, we would have to do it three times:
H0: the distribution for gender and Lone Ranger is the same as the distribution for gender and Sesame Street
H0: the distribution for gender and Lone Ranger is the same as the distribution for gender and The Simpsons
H0: the distribution for gender and The Simpsons is the same as the distribution for gender and Sesame Street
The problem is _______________________________________________. We need to be able to do many comparisons at once with some overall measure of confidence.

For this type of problem, we are looking at a situation in which there are separate categorical variables taking values on separate _____________________. We will use a 2-way table to organize our data. It is similar to a matrix with “r” rows and “c” columns, so we call it an r x c table.

The test we perform, which looks at whether or not each treatment (TV show) affected the populations (gender) differently, is called the Chi-Square Test for Homogeneity of Populations.

Stating Hypotheses:
We will still use a null hypothesis that says ______________________________________, and an alternative hypothesis that says _________________________________________.
Identify the populations involved:
In this setting, the null hypothesis becomes:
The alternative hypothesis would be:

Computing the Expected Cell Counts:
We will still use the ________________ test to measure how far the observed values are from the expected values. In this type of problem, however, we will also need to calculate the ______________.
$\text{expected count} = \frac{\text{row total} \times \text{column total}}{\text{table total}}$

Observed counts for gender and TV show:

                      TV Show
Gender   Lone Ranger   Sesame Street   The Simpsons   Total
Boys     50            30              20             100
Girls    50            80              70             200
Total    100           110             90             300

Find the expected counts for gender and TV show:

                      TV Show
Gender   Lone Ranger   Sesame Street   The Simpsons
Boys
Girls

The $X^2$ Statistic and its P-value:
Since the expected counts are all large enough to satisfy the required conditions (How many expected counts are greater than 1? _________ What percent of expected counts are less than 5? _____________), we proceed with the test, comparing the table of observed counts with the table of expected counts. We will calculate $\frac{(O - E)^2}{E}$ for each cell in the table and find the sum as our $X^2$ test statistic.

We will still use $\chi^2$cdf, but our degrees of freedom are calculated differently.
Degrees of freedom = (# rows – 1)(# columns – 1) = (r – 1)(c – 1)

Enter the observed counts into L1, and the expected counts into L2. Use L3 to calculate the $X^2$ statistic just as we did in the previous section. Think of the $X^2$ statistic as a measure of the distance that the observed counts are from the expected counts. Find $\chi^2$cdf to determine the probability of gathering data as extreme or more extreme than this data.

L1     L2     L3

$X^2$ = ____________
$\chi^2$cdf ( _____________________ ) = _____________

Follow up Analysis:
1. Look at the conditional distribution table (Unusually high? Unusually low?)
2. Look at the chi-squared components (Unusually high?)
3. WARNING: The test confirms only that there is some relationship. The statistical analysis does not tell us what population our conclusion describes. It cannot generalize to other TV shows, other ages, etc.

Part 2 - Association/Independence:
The previous example compared 3 TV shows using separate and independent samples. Each group is a sample from a separate population corresponding to a separate treatment.
The null hypothesis from the Goodness of Fit test of “______________________________________” took the form of “equal proportions among the 3 populations.” The chi square test for association/independence does not compare several populations. Instead, it classifies observations from a SINGLE population into two or more categories. EXAMPLE 3: Many popular businesses are franchises – think of McDonald’s. The owner of a local franchise benefits from the brand recognition, national advertising, and detailed guidelines provided by the franchise chain. In return, he or she pays fees to the franchise firm and agrees to follow its policies. The relationship between the local entrepreneur and the franchise firm is spelled out in a detailed contract. One clause that the contract may or may not contain is the entrepreneur’s right to an exclusive territory. This means that the new outlet will be the only representative of the franchise in a specified territory and will not have to compete with other outlets of the same chain. How does the presence of an exclusive-territory clause in the contract relate to the survival of the business? A study designed to address this question collected data from a sample of 170 new franchise firms. Two categorical variables were measured for each firm. First, the firm was classified as successful or not based on whether or not it was still franchising as of a certain date. Second, the contract each firm offered to franchises was classified according to whether or not there was an exclusive-territory clause. Here are the data, arranged in a two-way table: We are comparing franchises that have exclusive territories with those that do not. “Exclusive Territory” is the explanatory variable so it’s the column variable. The row variable is the response variable “success”. 
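Optional software check before working Example 3: the expected-count rule (row total times column total, divided by table total) and the $X^2$ statistic from Part 1 can be sketched in Python using the gender and TV show counts from Example 2. This is an illustration only: the function names are invented for these notes, and the P-value line relies on the fact that for exactly 2 degrees of freedom the area to the right of x under the chi-square curve equals exp(-x/2). For other degrees of freedom you would use $\chi^2$cdf on the calculator.

```python
from math import exp


def expected_counts(table):
    """Expected count for each cell: row total * column total / table total."""
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    n = sum(row_totals)
    return [[r * c / n for c in col_totals] for r in row_totals]


def chi_square_statistic(table):
    """Returns (x2, df) for an r x c table of observed counts."""
    expected = expected_counts(table)
    x2 = sum(
        (o - e) ** 2 / e
        for obs_row, exp_row in zip(table, expected)
        for o, e in zip(obs_row, exp_row)
    )
    df = (len(table) - 1) * (len(table[0]) - 1)
    return x2, df


# Rows: Boys, Girls. Columns: Lone Ranger, Sesame Street, The Simpsons.
observed = [[50, 30, 20], [50, 80, 70]]

expected = expected_counts(observed)
x2, df = chi_square_statistic(observed)
p_value = exp(-x2 / 2)  # valid only because df = 2 for this 2 x 3 table

for row in expected:
    print([round(e, 2) for e in row])
print(f"X2 = {x2:.3f}, df = {df}, P-value = {p_value:.6f}")
```

The same two functions work for any r x c table of counts, including the franchise table in Example 3 once its counts are entered, but the shortcut for the P-value does not carry over unless the degrees of freedom are again 2.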
P H A N T O M S

EXAMPLE 4: A study of the relationship between men’s marital status and the level of their jobs used data on all 8235 male managers and professionals employed by a large manufacturing firm. Each man’s job has a grade set by the company that reflects the value of that particular job to the company. The authors of the study grouped the many job grades into quarters. Grade 1 contains jobs in the lowest quarter of job grades, and grade 4 contains those in the highest quarter. Here is the Minitab output for the 4 x 4 table:

Expected counts are printed below observed counts

          SINGLE   MARRIED   DIVORCED   WIDOWED   Total
    1         58       874         15         8     955
           39.08    896.44      14.61      4.87

    2        222      3927         70        20    4239
          173.47   3979.05      64.86     21.62

    3         50      2396         34        10    2490
          101.90   2337.30      38.10     12.70

    4          7       533          7         4     551
           22.55    517.21       8.43      2.81

Total        337      7730        126        42    8235

ChiSq =  9.158 + 0.562 + 0.010 + 2.011 +
        13.575 + 0.681 + 0.407 + 0.121 +
        26.432 + 1.474 + 0.441 + 0.574 +
        10.722 + 0.482 + 0.243 + 0.504 = 67.397
df = 9
2 cells with expected counts less than 5.0

Chisquare 9.
    67.3970    1.0000

Is there a relationship between marital status and job grade?

Chapter 14: Inference for Distributions of Categorical Variables
Vocabulary: Chi-square test for Goodness of Fit, Degrees of freedom, Components of Chi-square, Cell counts, Observed count, Cell, Chi-square test for Association/Independence, Chi-square test for Homogeneity of Populations, Chi-square statistic, Expected count, r x c table

14.1 Test for Goodness of Fit
1. What is the chi-square statistic?
2. What is the difference between the notation $X^2$ and $\chi^2$?
3. How many degrees of freedom does a chi-square distribution have?
4. As the chi-square statistic increases, what happens to the P-value?
5. What is the domain of the chi-square distribution?
6. What is the shape of the chi-square distribution?
7. State the null and alternative hypotheses for a Goodness of Fit test.
8. What happens to the shape as the degrees of freedom increase?

14.2 Inference for Two-Way tables
1.
What information is contained in a two-way table for a chi-square test?
2. State the null and alternative hypotheses for comparing more than two populations.
3. Explain how to calculate the expected count in any cell of a two-way table.
4. How many degrees of freedom does a chi-square test for a two-way table with “r” rows and “c” columns have?
5. When is the chi-square test of association/independence used?
6. When is the chi-square test for homogeneity of populations used?

Write a short summary of your performance/effort in the class this year. Include whether or not you feel prepared for a college-level class.

Inference For Regression (Chp 15)
Learning Targets:
* Identify the conditions necessary to do inference for regression
* Given a set of data, check that the conditions for doing inference for regression are present
* Explain what is meant by the standard error about the least-squares line
* Compute a confidence interval for the slope of the regression line
* Conduct a test of the hypothesis that the slope of the regression line is 0 (or that the correlation is 0) in the population

Introduction:
When a scatterplot shows a linear relationship between a quantitative explanatory variable, ______, and a quantitative response variable, ______, we can use the least-squares line fitted to the data to predict y for a given value of x. Now we want to perform tests and find confidence intervals in this setting.

Before attempting the inference procedures, we need to examine the data:
1. Make a scatterplot (look for a roughly linear pattern)
2. Find the LSRL (use calculator to find Linear Regression)
3. Look for outliers and influential observations
   *outliers are __________________________________________________
   *influential observations are ______________________________________
4.
Calculate the correlation coefficient and the coefficient of determination (___ and ____)

EXAMPLE 1: One of nature’s patterns connects the percent of adult birds in a colony that return from the previous year and the number of new adults that join the colony. Here are data for 13 colonies of sparrow hawks. Examine the data for linear regression appropriateness.

Percent return (x):  74  66  81  52  73  62  52  45  62  46  60  46  38
New adults (y):       5   6   8  11  12  15  16  17  18  18  19  20  20

Explanatory variable: ___________________
Response variable: ___________________
Is the pattern roughly linear? _______ outliers? ______ influential observations? _______
Equation of LSRL: STAT CALC 8: ________________________
r = _____   $r^2$ = _____

The Regression Model:
The basic idea here is that there is an “on the average” straight line relationship between y and x. The true regression line $\mu_y = \alpha + \beta x$ says that the mean response $\mu_y$ moves along a straight line as x changes. We can’t observe the true regression line. ($\alpha$ and $\beta$ are population ______________) The values of y that we can observe vary about their means according to a __________________. The standard deviation $\sigma$ determines whether the points fall close to the true regression line (______ $\sigma$) or are widely scattered (___________ $\sigma$). Calculation of the least squares line $\hat{y} = a + bx$ gives us unbiased estimates for $\alpha$ and $\beta$ in the form of ____ and ______.

Conditions for Regression Inference:
We have n observations of an explanatory variable x and a response variable y. Our goal is to study or predict the behavior of y for given values of x.

Normality
1. For any fixed value of x, the response value y varies according to a _____________. To verify this, check the residuals’ distribution with a histogram. Is it approximately normal?
   *remember: residual = actual – predicted [RAP]
   *need to put predicted values from the LSRL equation into L3. Then L4 = L2 – L3 (L4 are the residuals)
Independence
2. Repeated observations on the same individual are not allowed.
Linearity
3.
The mean response, $\mu_y$, has a straight-line relationship with x in the form of $\mu_y = \alpha + \beta x$, where the slope, $\beta$, and y-intercept, $\alpha$, are unknown parameters.
Variability
4. The variability of the responses cannot change with x while the mean response is changing with x. In other words, the standard deviation of y (we’ll use $\sigma$) is the same for all values of x.

Estimating the Parameters:
We use the LSRL to estimate the parameters $\alpha$ and $\beta$. The third parameter involved in the regression model is $\sigma$. Since the LSRL estimates the true regression line, the residuals from the LSRL estimate how much y varies about the true regression line. If $\sigma$ is the standard deviation of responses about the true regression line, an estimate of $\sigma$ is calculated from the residuals in the sample. A standard deviation calculated from a sample is called the ___________________

$s = \sqrt{\frac{1}{n-2}\sum \text{residual}^2} = \sqrt{\frac{1}{n-2}\sum (y - \hat{y})^2}$

Why are we using n-2?______________________________________________________

EXAMPLE: Calculate the standard error about the least squares line for the sparrow hawk example.
1. LSRL: __________________________________
2. L1 L2 L3 L4
______________________________________________________________
______________________________________________________________
______________________________________________________________
______________________________________________________________
______________________________________________________________
______________________________________________________________
______________________________________________________________
______________________________________________________________
______________________________________________________________
______________________________________________________________
______________________________________________________________
______________________________________________________________
______________________________________________________________
Sum of the squared residuals: ______________
3.
s = ____________________

EXAMPLE: Coffee is a leading export from several developing countries. When coffee prices are high, farmers often clear forest to plant more coffee trees. Here are five years’ data on prices paid to coffee growers in Indonesia and the percent of forest area lost in a national park that lies in a coffee-producing region:

Price (cents per pound)    Forest lost (percent)
29                         0.49
40                         1.59
54                         1.69
55                         1.82
71                         3.10

a) Find the regression equation (LSRL) ________________________________________
b) Explain in words what the slope of the population regression would tell us if we knew it.
c) Use the residuals to estimate the standard deviation of percents of forest lost about the means given by the population regression line. (Show the formula and the numbers you use)

Confidence Intervals for the Regression Slope:
The slope of the true regression line (population regression) is usually the most important parameter in a regression problem. The slope is the __________ of increase or decrease of the mean response as the explanatory variable increases. We often want to estimate the true slope, so we use ______ of the LSRL as an unbiased estimator. The confidence interval can show how accurate the estimate is likely to be. To find the confidence interval for the regression slope, we need the standard error of the slope.

STANDARD ERROR OF THE SLOPE:
$SE_b = \frac{\text{estimation of change in } y}{\text{change in } x} = \frac{\sqrt{\frac{1}{n-2}\sum (y - \hat{y})^2}}{\sqrt{\sum (x - \bar{x})^2}}$

EXAMPLE: Does the length of time young children remain at the lunch table help predict how much they eat? Here are data on 20 toddlers observed over several months at a nursery school. “Time” is the average number of minutes a child spent at the table when lunch was served. “Calories” is the average number of calories the child consumed during lunch, calculated from careful observation of what the child ate each day.
Time:      21.4  30.8  37.7  33.5  32.8  39.5  22.8  34.1  33.9  43.8
Calories:  472   498   465   456   423   437   508   431   479   454

Time:      42.4  43.1  29.2  31.3  28.6  32.9  30.6  35.1  33.0  43.7
Calories:  450   410   504   437   489   436   480   439   444   408

a) Look at the scatterplot of the data and describe briefly what the data show about the behavior of children
b) Find the equation of the least-squares line for predicting calories consumed from time at the table.
c) Determine the standard error of the slope.

IN GENERAL:
Confidence interval = estimate ± t*(standard error of estimate)

FOR THIS CHAPTER:
FORMULA: Confidence interval = slope ± t*(standard error of slope)
Confidence interval = $\text{slope} \pm t^* \cdot \frac{\sqrt{\frac{1}{n-2}\sum (y - \hat{y})^2}}{\sqrt{\sum (x - \bar{x})^2}}$
where t* is the critical value for the t(n-2) density curve

EXAMPLE: Find the confidence interval for the toddler lunch table example above.

Hypothesis Testing:
The most common hypothesis about the slope of $\mu_y = \alpha + \beta x$ is: $H_0: \beta = 0$
This hypothesis says that the slope of the regression line is ___________. If that were true, we would be saying that the mean of y does not change at all when x changes. In other words, there is no true linear relationship between x and y. (This does not say “there is no slope”)
We will follow the PHANTOMS procedure, using a Linear Regression t-test. The test statistic is just the standardized version of the least-squares slope. It is another t statistic, with the t(n-2) distribution.

$t = \frac{\text{slope}}{\text{SE of slope}} = \frac{b}{SE_b}$

EXAMPLE: Infants who cry easily may be more easily stimulated than others. This may be a sign of higher IQ. Child development researchers explored the relationship between the crying intensity of infants four to ten days old and their later IQ test scores.
Infants’ crying and IQ scores

L1 crying  L2 IQ     L1 crying  L2 IQ     L1 crying  L2 IQ
10         87        20         90        12         94
12         97        16         100       12         103
9          103       23         103       14         106
16         106       27         108       10         109
18         109       15         112       23         113
15         114       21         114       9          119
12         119       12         120       16         124
20         132       15         133       31         135
16         136       17         141       22         157
33         159       13         162       17         94
19         103       13         104       18         109
18         112       16         118       19         120
22         135       30         155

Perform a significance test to determine if a linear relationship exists between the crying intensity and IQ scores
P
H
A
(a) Independent observations?
(b) Linear relationship? (Look at the scatterplot)
(c) Normality?
(d) Variation consistent?
N
T
O
M
S

When reading a computer output of regression information, be aware that the software usually gives a 2-sided p-value. If we have a one-tailed test, we will need to _____________.

Chapter 15: Inference for Regression

Vocabulary:
Explanatory variable
Response variable
True regression line
Parameters $\alpha$, $\beta$, $\sigma$
Standard error about the line
Residuals

1. Explain the 4 conditions for regression inference.
2. Explain the difference between the equations $\mu_y = \alpha + \beta x$ and $\hat{y} = a + bx$
3. Explain the concept of standard error about the least-squares line
4. What does the slope represent?
5. How does the null hypothesis $H_0: \beta = 0$ fit in with the other null hypotheses we have used?
6. What does the confidence interval estimate in this chapter?
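
Checking your calculator work: the chapter's formulas for the LSRL, the standard error about the line (s), the standard error of the slope (SE_b), the t statistic, and the confidence interval for the slope can be sketched in a few lines of Python. This is only a minimal sketch, not the method used in class (which relies on the calculator lists L1 to L4). The function name regression_summary, the variable names, and the choice of a 95% confidence level are assumptions made for illustration. The data are the five coffee price / forest loss pairs from the example above, and t* = 3.182 is the t-table critical value for 95% confidence with n - 2 = 3 degrees of freedom.

```python
import math


def regression_summary(x, y, t_star):
    """Least-squares line plus the inference quantities from this chapter.

    t_star is the critical value looked up in a t table for n - 2 degrees
    of freedom (hypothetical parameter name, chosen for this sketch).
    """
    n = len(x)
    x_bar = sum(x) / n
    y_bar = sum(y) / n

    # Sums of squares used by the least-squares formulas
    sxx = sum((xi - x_bar) ** 2 for xi in x)
    sxy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))

    b = sxy / sxx            # slope of the LSRL
    a = y_bar - b * x_bar    # y-intercept of the LSRL

    # residual = actual - predicted (same as L4 = L2 - L3 on the calculator)
    residuals = [yi - (a + b * xi) for xi, yi in zip(x, y)]

    # Standard error about the line: divide by n - 2, not n - 1
    s = math.sqrt(sum(r ** 2 for r in residuals) / (n - 2))

    se_b = s / math.sqrt(sxx)   # standard error of the slope
    t = b / se_b                # test statistic for H0: beta = 0
    ci = (b - t_star * se_b, b + t_star * se_b)

    return {"a": a, "b": b, "residuals": residuals,
            "s": s, "se_b": se_b, "t": t, "ci": ci}


# Coffee example data from these notes
price = [29, 40, 54, 55, 71]
forest_lost = [0.49, 1.59, 1.69, 1.82, 3.10]

# Assumption: 95% confidence, df = 5 - 2 = 3, so t* = 3.182 from a t table
result = regression_summary(price, forest_lost, t_star=3.182)
for name in ("a", "b", "s", "se_b", "t", "ci"):
    print(name, result[name])
```

To use this on another example from the chapter, replace the two lists and look up t* for the new n - 2 degrees of freedom. The p-value still has to come from a t table or the calculator's Linear Regression t-test, since this sketch only computes the test statistic.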