Ch 4 Notes - Franklin Board of Education

AP Statistics - Chapter 4: Relations Between Two Variables
Chapter Objectives:
A) use logs to allow the use of linear techniques with exponential relationships
B) find marginal and conditional probabilities from a two-way table
C) identify Simpson’s Paradox and explain how it exists from the data
D) define the possible explanations of an observed association and identify them within examples
4.1 Transforming to Achieve Linearity
Linear transformations ( + , - , × , ÷ by a constant): a linear relationship stays linear
Non-linear relationships: can often be transformed to become linear
A) Exponential Relationships
they can be made linear by taking the log of the y-values
EXAMPLE #1a: Bacteria Growth By Year:
Year:          1    2    3    4    5    6
Bact. Growth:  3   12   23   35   70  120
Analyze. Is it a good fit?
Now add:
Year:            7    8     9    10    11     12
Bact. Growth:  300  700  1200  2700  4800  12000
Re-analyze. Is it a good fit?
Method to use regression techniques with exponential relationships:
1. Log the y’s [L3 = Log(L2) or Ln(L2)]
2. Regress (Linreg) on x (L1) and log y (L3)
3. To predict, substitute x into the line from step 2, then “unlog” the result (10^ŷ if you used log, e^ŷ if you used ln)
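The three steps above can be sketched in Python instead of calculator lists. This is only an illustration under stated assumptions: the helper `fit_line` and the function `predict` are hypothetical names (not from the notes), the data are the twelve years from Example #1, and base-10 logs are used (ln works the same way, with e in place of 10 when unlogging).

```python
import math

def fit_line(xs, ys):
    """Least-squares line y = a + b*x. Returns (a, b, r). Hypothetical helper."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    b = sxy / sxx                      # slope
    a = mean_y - b * mean_x            # intercept
    r = sxy / math.sqrt(sxx * syy)     # correlation
    return a, b, r

# Data from Example #1 (calculator lists L1 and L2)
years = list(range(1, 13))
growth = [3, 12, 23, 35, 70, 120, 300, 700, 1200, 2700, 4800, 12000]

# Step 1: log the y's (L3 = log(L2))
log_growth = [math.log10(y) for y in growth]

# Step 2: regress log y on x
a, b, r = fit_line(years, log_growth)

# Step 3: substitute x into the line, then "unlog" the result
def predict(year):
    return 10 ** (a + b * year)

# For comparison: a straight line fitted to the untransformed data
_, _, r_linear = fit_line(years, growth)

print(f"log fit: log(y-hat) = {a:.2f} + {b:.2f}x, r^2 = {r ** 2:.2f}")
print(f"linear fit on raw data: r^2 = {r_linear ** 2:.2f}")
print(f"predicted growth in year 13: {predict(13):.0f}")
```

Comparing the two r² values shows why the transformation helps: the line fits the logged data much more closely than it fits the raw data.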
EXAMPLE #1b:
Yr:   1   2   3   4   5    6    7    8     9    10    11     12     13
BG:   3  12  23  35  70  120  300  700  1200  2700  4800  12000  _____
                 log BG                 OR   ln BG
ŷ =              ŷlog =                      ŷln =
r² =             r²log =                     r²ln =
ŷ(7) =           ŷlog(7) =                   ŷln(7) =
                 10^(ŷlog(7)) =              e^(ŷln(7)) =
Homework: p 276 5, 6, 8, 9
4.2 Relationships Between Categorical Variables
B) Two-Way Tables
Marginal Probabilities - %’s in the margins (each row or column total as a percent of the grand total)
Conditional Probabilities - %’s within the rows and columns (each cell as a percent of its row total or column total)
EXAMPLE: College Students By Gender and Age Group
Age       15-17   18-24   25-34     35+   TOTAL
Female       89    5668    1904    1660   _____
Male         61    4697    1589     970   _____
Total     _____   _____   _____   _____   _____
Marginal Percentages:
% 15-17 =        % 18-24 =        % 25-34 =        % 35+ =
% Female =       % Male =
Conditional Distributions:
% of F who are 18-24 =        % of F who are 25-34 =
% of 15-17 who are M =        % of M who are 35+ =
% of 35+ who are M =          % of 25-34 who are F =
Homework: p 298 23 – 25; p 301 29
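One way to check the marginal and conditional percentages is a short Python sketch built on the counts from the college-student example. The variable names and the `pct` helper are assumptions made for illustration, and percentages are rounded to one decimal place.

```python
# Counts from the example: rows are gender, columns are age group
counts = {
    "Female": {"15-17": 89, "18-24": 5668, "25-34": 1904, "35+": 1660},
    "Male":   {"15-17": 61, "18-24": 4697, "25-34": 1589, "35+": 970},
}
age_groups = list(counts["Female"])

# Totals in the margins of the table
row_totals = {gender: sum(ages.values()) for gender, ages in counts.items()}
col_totals = {age: sum(counts[g][age] for g in counts) for age in age_groups}
grand_total = sum(row_totals.values())

def pct(part, whole):
    """Percent of whole, rounded to one decimal place (hypothetical helper)."""
    return round(100 * part / whole, 1)

# Marginal percentages: a margin total divided by the grand total
marginal_gender = {g: pct(t, grand_total) for g, t in row_totals.items()}
marginal_age = {age: pct(t, grand_total) for age, t in col_totals.items()}

# Conditional percentages: a cell divided by its row or column total
pct_female_18_24 = pct(counts["Female"]["18-24"], row_totals["Female"])
pct_15_17_male = pct(counts["Male"]["15-17"], col_totals["15-17"])

print("Marginal % by gender:", marginal_gender)
print("Marginal % by age:", marginal_age)
print("% of F who are 18-24:", pct_female_18_24)
print("% of 15-17 who are M:", pct_15_17_male)
```

The only difference between the two kinds of percentage is the denominator: the grand total for marginal percentages, a single row or column total for conditional ones.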
C) Simpson’s Paradox
EXAMPLE #3: Survival Outcome vs. Evacuation Type After an Accident
ALL ACCIDENTS
Outcome     Helicopter   Ambulance
Died                64         260
Survived           136         840
Which is the better way to be evacuated? Why?
Now, add severity of accident to the data:
SERIOUS ACCIDENTS
Outcome     Helicopter   Ambulance
Died                48          60
Survived            52          40
LESS SERIOUS ACCIDENTS
Outcome     Helicopter   Ambulance
Died                16         200
Survived            84         800
Which way is better for serious accidents? Why?
Which way is better for less serious accidents?
Simpson’s Paradox
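occurs when a comparison between groups reverses direction once the data are presented at a different level of detail (combined vs. broken down by a third variable)
usually caused by a lurking variable together with very unequal group sizes (some sort of extreme count or counts)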
What led to the paradox in Example #3?
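The reversal in Example #3 can be checked with a few lines of Python. This is a sketch using the counts from the tables above; the dictionary layout and the `death_rate` helper are assumptions made for illustration.

```python
# (died, survived) counts from Example #3, by accident severity
serious = {"Helicopter": (48, 52), "Ambulance": (60, 40)}
less_serious = {"Helicopter": (16, 84), "Ambulance": (200, 800)}

# Combining the two severity tables reproduces the ALL ACCIDENTS table
all_accidents = {
    mode: (serious[mode][0] + less_serious[mode][0],
           serious[mode][1] + less_serious[mode][1])
    for mode in serious
}

def death_rate(died, survived):
    """Proportion of victims who died (hypothetical helper)."""
    return died / (died + survived)

for label, table in [("All accidents", all_accidents),
                     ("Serious", serious),
                     ("Less serious", less_serious)]:
    for mode, (died, survived) in table.items():
        total = died + survived
        rate = death_rate(died, survived)
        print(f"{label:14s} {mode:11s} n = {total:5d}  death rate = {rate:.1%}")
```

The helicopter has the lower death rate within each severity level but the higher rate overall. The printed group sizes suggest why: half of the helicopter cases are serious accidents, while most ambulance cases are less serious, so severity acts as a lurking variable.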
4.3 Establishing Causation
D) Association - Causation
Of the 16 factors below, 8 show a strong correlation (+ or -) with test scores. The other 8 don’t matter. Discuss and guess which are which.
1. Child has highly educated parents
2. Child’s family is intact
3. Parents have high socioeconomic status
4. Parents recently moved into a better neighborhood
5. Mother was 30+ at the time of 1st child’s birth
6. Mother didn’t work between birth and kindergarten
7. Child had low birth weight
8. Child attended Head Start
9. Parents speak English at home
10. Parents regularly take child to the museum
11. Child is adopted
12. Child is regularly spanked
13. Parents are involved in the PTA
14. Child frequently watches TV
15. Child has many books at home
16. Parents read to the child nearly every day
Explaining Association (x and y are associated)
Strong association measured by correlation between 2 variables does not imply causation.
Example of causation:
1) increased drinking of alcohol causes a decrease in coordination
Example of association:
1) High SAT scores are associated with high freshman GPA
1. Causation: X causes Y
Example: smoking and lung cancer
(Association = dashed line, Causation = solid line)
2. Common response: The observed association between the variables x and y is explained by a lurking variable z. Both x and y change in response to changes in z.
Example: Smoking and lung cancer (a genetic factor that predisposes people to both nicotine addiction and lung cancer)
3. Confounding: x and y are related for unknown reasons (too much happening to tell what causes what). Even a very strong association between two variables is not by itself good evidence that there is a cause-and-effect link between the variables. A confounding variable is one whose effects on the response variable cannot be distinguished from those of one or more of the explanatory variables in the study.
Example: smoking and lung cancer. People who drink too much, don’t exercise, eat unhealthy foods, etc., are more likely to get lung cancer as a result of their lifestyle. Such people may be more likely to be smokers as well.
Does smoking cause lung cancer? There is much evidence, but it is observational, not experimental. The cigarette companies exploited this for decades.
EXAMPLE:
The following are examples of observed associations. Are they explained by causation, common response, or confounding?
1) X - Mother’s BMI
Y - Daughter’s BMI
2) X - Amount of saccharin in a rat’s diet
Y - number of tumors in the rat’s body
3) X - high school senior’s SAT score
Y - student’s first-year GPA
4) X - monthly flow of money into stock mutual funds
Y - monthly rate of return for the stock market
5) X - whether a person attends religious services
Y - how long the person lives
6) X - number of years of education a worker has
Y - worker’s income
Suppose studies were to be done on each of the following.
Part a) Determine whether you believe the association would be positive, negative, or none.
Part b) Then decide whether the relationship would most likely be causation, common response, or confounding.
Part c) If it is common response, identify the lurking variable affecting both. If it is confounding, identify the confounding variable affecting the response variable.
1. When you are on a diet, the amount of calories you eat daily vs. the amount of weight you lose.
2. The number of pets you own vs. the amount you spend on pet food.
3. How much you pay for a house vs. how much you pay for a car.
4. How much you study vs. your GPA.
5. The number of police officers visible on a stretch of road vs. the speed you travel.
6. How a student does in algebra vs. how the student does in geometry.
7. A person’s height vs. the amount of money that person has.
8. The number of wins the Indians have and the total amount of money spent on concessions at Indians games.
9. The number of people who smoke cigarettes vs the number of people who get lung cancer.
10. The number of people in a family vs. the number of cars the family owns.
11. The number of problems on a math test vs. the amount of time it takes students to complete the exam.
12. The amount of gasoline purchased on the Ohio Turnpike daily vs. the total length of time it takes vehicles to travel the Ohio Turnpike.
13. Amount of fertilizer and yield of corn
14. Dosage of a drug and the survival rate of mice
15. High temperatures in the summer lead to higher electricity use
16. It has been observed that children with more cavities tend to have larger vocabularies.
17. For countries, pick any measure of technological modernity (# of TVs per capita) and life expectancy.
18. The number of firefighters who respond to a fire and the amount of damage done.
19. Religious people live longer
20. You might want to test a fertilizer on your lawn. Suppose you spread it on half the lawn to see if the grass will look better there. You found the fertilized half grew better.
Establishing Causation - The best method is to conduct a carefully controlled experiment.
Does smoking cause lung cancer? There is evidence, but it isn't experimental. It's just observational.
Homework: p 312 41 – 48
Chapter 4 Practice Problems (Linear Fit to an Exponential Relationship)
Note: Round to 2 decimal places.
1. The number of animals by year is shown as follows:
Year:            1    2     3     4      5      6
# of Animals:  200  500  1500  5000  17000  50000
a. Draw a scatterplot (in scale, show labels)
b. Calculate the best-fit line and sketch it in the scatterplot. Find and interpret r² and r.
c. Calculate the residuals and sketch the residual plot (labels).
d. Predict the number of animals for years 7 and 8.
e. Explain why the linear fit is not a good one.
f. Transform the data (show how) to make it more linear. Sketch the new scatterplot (labels).
g. Calculate the new best-fit line and sketch it in the scatterplot. Find and interpret the new r² and r.
h. Calculate the new residuals and sketch the new residual plot (labels).
i. Predict the number of animals for years 7 and 8 using the transformed model.
j. Show the generic prediction equation for any particular year.