AP Statistics - Chapter 4: Relations Between Two Variables

Chapter Objectives:
A) use logs to allow the use of linear techniques with exponential relationships
B) find marginal and conditional probabilities from a two-way table
C) identify Simpson’s Paradox and explain how it arises in the data
D) define the possible explanations of an observed association and identify them within examples

4.1 Transforming to Achieve Linearity

Linear transformations ( + , - , × , ÷ ): stay linear
Non-linear: can be transformed to become linear

A) Exponential Relationships
They can be made linear by converting with logs.

EXAMPLE #1a: Bacteria Growth By Year

Year   Bact. Growth
1      3
2      12
3      23
4      35
5      70
6      120

Analyze. Is it a good fit?

Now add:

7      300
8      700
9      1200
10     2700
11     4800
12     12000

Re-analyze. Is it a good fit?

Method to use regression techniques with exponential relationships:
1. Log the y’s [L3 = Log(L2) or Ln(L2)]
2. Regress (Linreg) on x (L1) and log y (L3)
3. To predict, substitute x into the line from #2, then “unlog” it

EXAMPLE #1b:

Yr     BG        log BG OR ln BG
1      3
2      12
3      23
4      35
5      70
6      120
7      300
8      700
9      1200
10     2700
11     4800
12     12000
13     _____

ŷ =               ŷlog =               ŷln =
r² =              r²log =              r²ln =
ŷ(7) =            ŷlog(7) =            ŷln(7) =
                  10^ŷlog(7) =         e^ŷln(7) =

Homework: p 276 5, 6, 8, 9

4.2 Relationships Between Categorical Variables

B) Two-Way Tables
Marginal Probabilities - %’s in the margins (row and column percents)
Conditional Probabilities - %’s within the rows and columns

EXAMPLE: College Students By Gender and Age Group

Age      15-17   18-24   25-34   35+    TOTAL
Female   89      5668    1904    1660
Male     61      4697    1589    970
Total

Marginal Percentages:
% 15-17 =        % 18-24 =        % 25-34 =        % 35+ =
% Female =       % Male =

Conditional Distributions:
% of F who are 18-24 =         % of F who are 25-34 =
% of 15-17 who are M =         % of M who are 35+ =
% of 35+ who are M =           % of 25-34 who are F =

Homework: p 298 23-25, p 301 29

C) Simpson’s Paradox

EXAMPLE #3: Survival Outcome vs.
Evacuation Type After an Accident

ALL ACCIDENTS
Outcome    Helicopter   Ambulance
Died       64           260
Survived   136          840

Which is the better way to be evacuated? Why?

Now, add severity of accident to the data:

SERIOUS ACCIDENTS
Outcome    Helicopter   Ambulance
Died       48           60
Survived   52           40

LESS SERIOUS ACCIDENTS
Outcome    Helicopter   Ambulance
Died       16           200
Survived   84           800

Which way is better for serious accidents? Why?
Which way is better for less serious accidents?

Simpson’s Paradox
- when comparisons between groups reverse direction when the data are presented at a different level of detail
- usually caused by some sort of extreme value or values

What led to the paradox in Example #3?

4.3 Establishing Causation

D) Association - Causation

Of the 16 factors below, 8 show a strong correlation (+ or -) with test scores. The other 8 don’t matter. Discuss and guess which are which.

1. Child has highly educated parents
2. Child’s family is intact
3. Parents have high socioeconomic status
4. Parents recently moved into a better neighborhood
5. Mother was 30+ at the time of 1st child’s birth
6. Mother didn’t work between birth and kindergarten
7. Child had low birth weight
8. Child attended Head Start
9. Parents speak English at home
10. Parents regularly take child to the museum
11. Child is adopted
12. Child is regularly spanked
13. Parents are involved in the PTA
14. Child frequently watches TV
15. Child has many books at home
16. Parents read to the child nearly every day

Explaining Association (x and y are associated)

A strong association between 2 variables, as measured by correlation, does not imply causation.

Example of causation:
1) Increased drinking of alcohol causes a decrease in coordination

Example of association:
1) High SAT scores are associated with high freshman GPA

(In the diagrams: association = dashed line, causation = solid line)

1. Causation: x causes y.
Example: smoking and lung cancer

2. Common response: The observed association between the variables x and y is explained by a lurking variable z.
Both x and y change in response to changes in z.
Example: smoking and lung cancer (a genetic factor that predisposes people to both nicotine addiction and lung cancer)

3. Confounding: x and y are related for unknown reasons (too much happening to tell what causes what). Even a very strong association between two variables is not by itself good evidence that there is a cause-and-effect link between the variables. A confounding variable is one whose effects on the response variable cannot be distinguished from those of one or more of the explanatory variables in the study.
Example: smoking and lung cancer. People who drink too much, don’t exercise, eat unhealthy foods, etc., are more likely to get lung cancer as a result of their lifestyle. Such people may be more likely to be smokers as well.

Does smoking cause lung cancer? There is much evidence, but it is not experimental. The cigarette companies exploited this for decades.

EXAMPLE: The following are examples of observed associations. Are they explained by causation, common response, or is there confounding?

1) X - mother’s BMI
   Y - daughter’s BMI
2) X - amount of saccharin in a rat’s diet
   Y - number of tumors in the rat’s body
3) X - high school seniors’ SAT scores
   Y - students’ first-year GPA
4) X - monthly flow of money into stock mutual funds
   Y - monthly rate of return for the stock market
5) X - whether a person attends religious services
   Y - how long the person lives
6) X - number of years of education a worker has
   Y - worker’s income

The following are studies you will analyze. Suppose studies were to be done on the following.
Part a) Determine if you believe the association would be positive, negative, or none.
Part b) Then decide if the relationship would most likely be causation, common response, or confounding.
Part c) If it is common response, identify the lurking variable affecting both. If it is confounding, identify the confounding variable affecting the response variable.

1. When you are on a diet, the number of calories you eat daily vs.
the amount of weight you lose.
2. The number of pets you own vs. the amount you spend on pet food.
3. How much you pay for a house vs. how much you pay for a car.
4. How much you study vs. your GPA.
5. The number of police officers that are visible on a stretch of road vs. the speed you travel.
6. How a student does in algebra vs. how the student does in geometry.
7. A person’s height vs. the amount of money that person has.
8. The number of wins the Indians have vs. the total amount of money spent on concessions at Indians games.
9. The number of people who smoke cigarettes vs. the number of people who get lung cancer.
10. The number of people in a family vs. the number of cars the family owns.
11. The number of problems on a math test vs. the amount of time it takes students to complete the exam.
12. The amount of gasoline purchased on the Ohio Turnpike daily vs. the total length of time it takes vehicles to travel the Ohio Turnpike.
13. Amount of fertilizer and yield of corn.
14. Dosage of a drug and the survival rate of mice.
15. High temperatures in the summer lead to higher electricity use.
16. It has been observed that children with more cavities tend to have larger vocabularies.
17. For countries, pick any measure of technological modernity (e.g., # of TVs per capita) and life expectancy.
18. The number of firefighters who respond to a fire and the amount of damage done.
19. Religious people live longer.
20. You might want to test a fertilizer on your lawn. Suppose you spread it on half the lawn to see if the grass will look better there. You found the fertilized half grew better.

Establishing Causation - The best method is to conduct a carefully controlled experiment.
Does smoking cause lung cancer? There is evidence, but it isn’t experimental. It’s just observational.

Homework: p 312 41-48

Chapter 4 Practice Problems (Linear Fit to an Exponential Relationship)
Note: Round to 2 decimal places.

1.
The number of animals by year is shown as follows:

Year   # of Animals
1      200
2      500
3      1500
4      5000
5      17000
6      50000

a. Draw a scatterplot (to scale, with labels).
b. Calculate the best-fit line and sketch it in the scatterplot. Find and interpret r² and r.
c. Calculate the residuals and sketch the residual plot (with labels).
d. Predict the number of animals for years 7 and 8.
e. Explain why the linear fit is not a good one.
f. Transform the data (show how) to make it more linear. Sketch the new scatterplot (with labels).
g. Calculate the new best-fit line and sketch it in the scatterplot. Find and interpret the new r² and r.
h. Calculate the new residuals and sketch the new residual plot (with labels).
i. Predict the number of animals for years 7 and 8 using the new line.
j. Show the generic prediction equation for any particular year.
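
For checking work away from the calculator, the three-step method from section 4.1 (log the y’s, regress on x and log y, then “unlog” the prediction) can be sketched in Python using the practice data above. This is only a sketch under stated assumptions: it uses base-10 logs (natural logs would work the same way with e in place of 10), and the names least_squares and predict_count are illustrative helpers, not part of the notes or of any calculator function.

```python
import math

# Practice-problem data: year and number of animals
years = [1, 2, 3, 4, 5, 6]
animals = [200, 500, 1500, 5000, 17000, 50000]


def least_squares(xs, ys):
    """Return (slope, intercept, r) of the least-squares line of ys on xs.

    Hypothetical helper standing in for the calculator's LinReg.
    """
    n = len(xs)
    x_bar = sum(xs) / n
    y_bar = sum(ys) / n
    sxx = sum((x - x_bar) ** 2 for x in xs)
    syy = sum((y - y_bar) ** 2 for y in ys)
    sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    slope = sxy / sxx
    intercept = y_bar - slope * x_bar
    r = sxy / math.sqrt(sxx * syy)
    return slope, intercept, r


# Linear fit on the raw counts (parts b-e)
slope_raw, intercept_raw, r_raw = least_squares(years, animals)

# Step 1: log the y's. Step 2: regress log y on x (parts f-g)
log_animals = [math.log10(y) for y in animals]
slope_log, intercept_log, r_log = least_squares(years, log_animals)


def predict_count(year):
    """Step 3: substitute x into the log line, then 'unlog' with 10^(...)."""
    return 10 ** (intercept_log + slope_log * year)


print("raw fit:  r^2 =", round(r_raw ** 2, 2))
print("log fit:  r^2 =", round(r_log ** 2, 2))
print("log line: log(animals) =", round(intercept_log, 2), "+",
      round(slope_log, 2), "* year")
for yr in (7, 8):
    print("predicted animals in year", yr, "=", round(predict_count(yr), 2))
```

Comparing the two r² values printed by the sketch is one way to see why the transformed fit is preferred; the residual plots asked for in parts c and h still need to be drawn by hand or on the calculator.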