AP Statistics
Unit 3- Examining Relationships
Name:____________________________________
Date:_____________________________________
Most statistical studies involve more than one variable. Often on the AP Statistics exam, you will be asked to
compare two data sets by using side-by-side boxplots, histograms, etc. However, there are times when we
want to examine relationships among several variables for the same group of data. For example, is the
temperature outside related to the amount of ice cream sold? Or does the amount of TV watched affect IQ?
When you examine the relationship between two variables you need to start with a scatterplot.
Scatterplots and Correlation
A scatterplot shows the relationship between two quantitative variables measured on the same individuals. The
values of one variable appear on the horizontal axis, and the values of the other variable appear on the vertical
axis. Each individual in the data appears as a point in the plot fixed by the values of both variables for that
individual.
A response variable measures the outcome of a study. An explanatory variable helps explain or influences
change in a response variable.
You will often find explanatory variables called independent variables, and response variables called dependent
variables. The idea behind this language is that the response variable depends on the explanatory variable.
Because the words independent and dependent have other, unrelated meanings in statistics, we won’t use them
here.
Types of LINEAR plots
[Six example scatterplots: Perfect +, Strong +, Moderate +, Perfect −, Strong −, Moderate −]
Types of NON-LINEAR plots
[Six example scatterplots: Quadratic, Cubic, Exponential, Natural Logarithmic, Quartic, None]
There are two ways of determining whether two variables are related:
a) By looking at a scatter plot (graphical approach)
b) By calculating a “correlation coefficient” (mathematical approach)
Graphical Approach:
Interpreting a Scatterplot
• In any graph of data, look for the overall pattern and for striking deviations from that pattern.
• You can describe the overall pattern of a scatterplot by the direction, form, and strength of the relationship.
• An important kind of deviation is an outlier, an individual value that falls outside the overall pattern of the relationship.
Things to look for in a scatterplot:
1) Overall pattern (linear, exponential, etc.) or deviations from the pattern (outliers)
2) Direction: Positive or negative slope
3) Strength: How close do the points lie to a simple form (such as a line)
Ex 1: Suppose we hypothesize that the number of doctor visits a person has can be explained by the number of
cigarettes they smoke. So we want to see if there is a relationship between the number of cigarettes one smokes
a week (the explanatory or independent variable) and the number of times per year one visits a doctor (the response
or dependent variable). We ask 10 random people and get the following information:

# of Cigarettes Per Week:         0   3   21   15   30   5   40   60   0   0
Number of Doctor Visits Per Year: 1   2    4    3    5   1    5    6   2   0
If we were to plot these data points on an X-Y coordinate plane, we would get a scatter plot that looks
like this:
[Scatterplot: "Smoking Scatter Plot" — Doctor Visits / year (0 to 7) on the vertical axis vs. Number of Cigarettes / week (0 to 80) on the horizontal axis]
Note that the graph shows us that as you smoke more cigarettes / week, you also tend to go to the doctor more
often. This result demonstrates that there is a positive relationship between the two variables. The following
scatter plots show other types of relationships that might occur between two variables:
Mathematical Approach:
In order to strengthen the analysis when comparing two variables, we can attach a number, called the
correlation coefficient (r), to describe the linear relationship between two variables. This number helps remove
any subjectivity in reading a linear scatter plot and helps us avoid being fooled by axis manipulation (see
example below).
The correlation measures the strength and direction of the linear relationship between two quantitative
variables:
Linear Correlation Coefficient (r)
Measures the strength and direction of the linear relationship between the given variables (AKA Pearson's Product
Moment Correlation Coefficient).
r = [ n∑xy − (∑x)(∑y) ] / √( [n∑x² − (∑x)²] · [n∑y² − (∑y)²] ) ; where n is the number of ordered pairs
Or…
r = (1/(n − 1)) · ∑ [(xᵢ − x̄)/sₓ] · [(yᵢ − ȳ)/s_y]
r² (in percent form) is the percent of variation in the y variable that can be explained by variation in the x
variable.
r                     Type of correlation
1                     Perfect +
.75 ≤ r < 1           Strong +
.50 ≤ r < .75         Moderate +
−.50 < r < .50        None
−.75 < r ≤ −.50       Moderate −
−1 < r ≤ −.75         Strong −
−1                    Perfect −
Interpreting Correlation Coefficients:
1) The value of r is always between –1 and 1.
2) A correlation of –1 implies two variables are perfectly negatively correlated: As one goes up the other goes
down and as one goes down the other goes up.
3) A correlation of 1 implies that there is perfect positive correlation between the two variables. As one goes up,
so does the other.
4) A correlation of 0 implies that there is no correlation between the two variables. In other words, there is no
linear relationship between them. As one goes up, we know nothing about whether the other will go up or down.
5) Positive correlations between 0 and 1 have varying strengths, with the strongest positive correlations being
closer to 1.
6) Negative correlations between 0 and –1 are also of varying strengths, with the strongest negative correlations
being closer to –1.
7) r does not have units. Changing the units on your data will not affect the correlation.
8) Correlation describes only linear relationships between two variables.
9) r is very strongly affected by outliers. Use r with caution when outliers appear in your scatter plot. Don’t rely
on r alone to determine the linear strength between two variables – graph a scatter plot first.
10) Correlation makes no distinction between explanatory and response variables. It makes no difference
which variable you call x and which you call y when calculating the correlation.
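The shortcut formula for r can be sketched in a few lines of code (Python here purely as an illustration; the data are the cigarette/doctor-visit pairs from Ex 1):

```python
import math

def pearson_r(xs, ys):
    """Shortcut formula: [n*Sxy - Sx*Sy] / sqrt((n*Sxx - Sx^2)(n*Syy - Sy^2))."""
    n = len(xs)
    sx, sy = sum(xs), sum(ys)
    sxy = sum(x * y for x, y in zip(xs, ys))
    sxx = sum(x * x for x in xs)
    syy = sum(y * y for y in ys)
    return (n * sxy - sx * sy) / math.sqrt((n * sxx - sx ** 2) * (n * syy - sy ** 2))

# Cigarettes per week vs. doctor visits per year (data from Ex 1)
cigs   = [0, 3, 21, 15, 30, 5, 40, 60, 0, 0]
visits = [1, 2, 4, 3, 5, 1, 5, 6, 2, 0]
print(round(pearson_r(cigs, visits), 4))   # 0.9258
```

Note that swapping the two lists gives the same value of r, illustrating point 10 above.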
Ex 2: Construct a scatterplot and compute the linear correlation coefficient between x and y.
x:   1   2   3   5    9   11
y:   3   5   7   10   16   20
Ex 3: Let's calculate the correlation between the number of cigarettes smoked per week and the number of
doctor visits per year:

# of Cigarettes Per Week:         0   3   21   15   30   5   40   60   0   0
Number of Doctor Visits Per Year: 1   2    4    3    5   1    5    6   2   0
x=
sx =
y=
sy =
x
r=
x−x
sx
y−y
sy
y
0
1
3
2
21
4
15
3
30
5
5
1
40
5
60
6
0
2
0
0
 xi − x   y i − y 
1

⋅
∑
n − 1  s x   s y 
So sum up the last column and multiply it by
1
9
r = _____________
Now check your answer by doing it on the calculator.
x−x y− y
.
sx
sy
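The z-score form of the hand computation can also be checked with a short Python sketch (a verification aid, not part of the worksheet):

```python
from statistics import mean, stdev

xs = [0, 3, 21, 15, 30, 5, 40, 60, 0, 0]   # cigarettes per week
ys = [1, 2, 4, 3, 5, 1, 5, 6, 2, 0]        # doctor visits per year
n = len(xs)
xbar, ybar = mean(xs), mean(ys)
sx, sy = stdev(xs), stdev(ys)              # sample standard deviations
# r = 1/(n-1) times the sum of the products of z-scores
r = sum((x - xbar) / sx * ((y - ybar) / sy) for x, y in zip(xs, ys)) / (n - 1)
print(round(r, 4))   # 0.9258
```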
Linear Regression
AKA- Least Squares Regression Line (LSRL), Line of Best Fit, Linear Regression Equation
Least Squares Regression A.K.A. linear regression allows you to fit a line to a scatter diagram in order to be
able to predict what the value of one variable will be based on the value of another variable.
The fitted line is called the line of best fit, linear regression line, or least squares regression line (LSRL), and
has the form ŷ = a + bx where:
a: y-intercept
b: slope of the line
The way the line is fitted to the data is through a process called the method of least squares. The main idea
behind this method is that the square of the vertical distance between each data point and the line is minimized.
The least squares regression line is a mathematical model for the data that helps us predict values of the
response (dependent) variable from the explanatory (independent) variable. Therefore, with regression, unlike
with correlation, we must specify which is the response and which is the explanatory variable.
There are two sets of formulas that could be used to find the line of best fit:
Formula for finding the slope and y-intercept in a linear regression line:
Slope:  b = r · (s_y / sₓ)

Intercept:  a = ȳ − b·x̄
The slope of the regression line is important in the sense that it gives us the rate of change of ŷ with respect to
x. In other words, it gives us the amount of change in ŷ when x increases by 1.
So for the previous example the process for finding the LSRL is:
1) We found the correlation coefficient is r =
2) We can find s y and s x using the calculator
3) Therefore:
b=
4) We can find ȳ and x̄ from the calculator as well.
5) Therefore:
a=
The equation of the regression line is:
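The five steps above can be sketched in code as a check on the hand calculation (Python for illustration; r is computed from the Ex 3 data rather than copied in):

```python
from statistics import mean, stdev

xs = [0, 3, 21, 15, 30, 5, 40, 60, 0, 0]   # cigarettes per week
ys = [1, 2, 4, 3, 5, 1, 5, 6, 2, 0]        # doctor visits per year
n = len(xs)
xbar, ybar, sx, sy = mean(xs), mean(ys), stdev(xs), stdev(ys)
r = sum((x - xbar) / sx * ((y - ybar) / sy) for x, y in zip(xs, ys)) / (n - 1)
b = r * sy / sx                 # slope: b = r * (sy / sx)
a = ybar - b * xbar             # intercept: a = ybar - b * xbar
print(f"y-hat = {a:.4f} + {b:.4f}x")   # y-hat = 1.3069 + 0.0916x
```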
The alternate set of formulas that can be used for determining the LSRL is:

slope = b = b₁ = [ n(∑xy) − (∑x)(∑y) ] / [ n∑x² − (∑x)² ]

y-intercept = a = b₀ = [ (∑y)(∑x²) − (∑x)(∑xy) ] / [ n∑x² − (∑x)² ]
Ex 4: Using the data from Ex 3, write the equation of the line of best fit using the formula above.
Using the Graphing Calculator: Note: You should only need to do steps 2-8, 10 once.
1. Enter x values in L1 and y values in L2
2. Press 2nd Y=
3. Make sure all plots are off; if not, press 4 (PlotsOff), Enter
4. Press 2nd Y=
5. Select 1
6. Turn this plot on by highlighting On and hitting Enter
7. Press the Down Arrow once and press Enter
8. Verify that L1 and L2 are listed.
9. Press Zoom 9 (on your graph you should see a scatterplot)
10. Press 2nd 0 (Catalog), select DiagnosticOn, press Enter twice
11. Press Stat, Right Arrow, 4 (LinReg(ax + b))
12. L1, L2, Y1, Enter (correlation and regression data are now presented)
13. Zoom 9 (line should appear on the graph, verifying that the equation is correct)
Ex 5: Verify your answers to Ex 2-4 using the graphing calculator.
Ex 6: The costs of mountain bikes are given in the following table, along with overall rating scores given to
each bike by Consumer Reports magazine. Graph the data using a scatterplot. Identify the explanatory and
response variable. Interpret the scatterplot that you just constructed on your calculator. Find the strength of the
linear correlation. Find the equation of the least squares line.
How good is our prediction?
The strength of a prediction which uses the LSRL depends on how close the data points are to the regression
line. The mathematical approach to describing this strength is via the coefficient of determination. The
coefficient of determination gives us the proportion of variation in the values of y that is explained by the
least-squares regression of y on x. The coefficient of determination turns out to be r² (the correlation coefficient
squared).

Whenever you use the regression line for prediction, also include r² as a measure of how successful the
regression is in explaining the response.

In our example r² = 0.8569. This means that over 85% of the variation in doctor visits per year can be
explained by the linear relationship it has with cigarettes smoked per week.
Residuals:
Since the LSRL minimizes the vertical distances between the data values and the fitted line, we have a special
name for these vertical distances: they are called residuals. A residual is simply the difference between the
observed y and the predicted y.

Residual = y − ŷ
In our example, if we want to find the residual for the person in our sample who smoked 3 cigarettes per week
and had 2 doctor visits, we would need to use the regression equation to find ŷ .
Residual =
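The arithmetic can be sketched as follows (Python for illustration; the coefficients are the rounded LSRL values used in the residual table on the next page):

```python
# Rounded LSRL for the smoking data: y-hat = 1.3069 + 0.0916x
a, b = 1.3069, 0.0916
x, y = 3, 2                      # person who smoked 3 cigarettes, had 2 visits
y_hat = a + b * x
residual = y - y_hat
print(round(y_hat, 4), round(residual, 4))   # 1.5817 0.4183
```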
Residual Plots:
Residuals help us determine how well our data can be modeled by a straight line, by enabling us to construct a
residual plot. A residual plot is a scatter diagram that plots the residuals on the y-axis and their corresponding x
values on the x-axis. So for our example, the residual plot would be plotted as follows:
x     y     ŷ         Residual
0     1     1.3069    −0.3069
3     2     1.5817     0.4183
21    4     3.2305     0.7695
15    3     2.6809     0.3191
30    5     4.0549     0.9451
5     1     1.7649    −0.7649
40    5     4.9709     0.0291
60    6     6.8029    −0.8029
0     2     1.3069     0.6931
0     0     1.3069    −1.3069
Sum of residuals: −0.0074

Notice that the sum of the residuals is zero (any error is due to rounding).

[Residual plot: residuals (−1.5 to 1.5) on the vertical axis vs. number of cigarettes per week (0 to 80) on the horizontal axis]
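The −0.0074 total can be reproduced with a quick sketch (coefficients rounded as in the table above):

```python
a, b = 1.3069, 0.0916            # rounded LSRL coefficients for the smoking data
xs = [0, 3, 21, 15, 30, 5, 40, 60, 0, 0]
ys = [1, 2, 4, 3, 5, 1, 5, 6, 2, 0]
residuals = [y - (a + b * x) for x, y in zip(xs, ys)]
print(round(sum(residuals), 4))  # -0.0074 (zero up to rounding of a and b)
```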
How does a residual plot help us? The pattern of the plot is the indicator of whether or not the data can be
modeled linearly. The ideal plot is one that shows a uniform scatter of the points above and below the horizontal
line at zero, with no unusual individual observations.
Here are a few things to watch out for when examining the pattern of a residual plot:
1) A curved pattern: Shows that the overall pattern of the data is not linear
2) Increasing or decreasing spread about the line as x increases: Predictions for y will be less accurate for
larger values of x
3) Individual points with large residuals: indicate outliers from the overall pattern
Outlier: An observation that lies outside the overall pattern in the scatter plot (either in the x or y direction)
Influential point: A point is influential if removing it would markedly change the position of the regression line.
Points that are outliers in the x direction are often influential.
In the scatter plot above child 19 is an outlier (its score lies far above the rest of the data) and child 18 is
influential because if we were to fit the LSRL without child 18 (dashed line) it would look considerably
different than with child 18 (solid line). Child 18, because it is an extreme in the x direction, has a
disproportionate pull on the regression line and is therefore an influential point.
Important note: Influential points often have small residuals because they tend to pull the line towards
themselves. Therefore, you might miss influential points if you only look at residuals.
Facts about least-squares regression
Fact 1. The distinction between explanatory and response variables is essential in regression. Least-squares
regression looks at the distances of the data points from the line only in the y direction. If we reverse the roles
of the two variables, we get different least-squares regression line.
Fact 2. There is a close connection between correlation and the slope of the LSRL. The slope is:

b = r · (s_y / sₓ)
This equation says that along the regression line, a change of one standard deviation in x corresponds to a
change of r standard deviations in y.
Fact 3. The LSRL always passes through the point (x̄, ȳ).
Fact 4. The correlation r describes the strength of a straight-line relationship. In the regression setting, this
description takes a specific form: the square of the correlation, r², is the fraction of the variation in the values of
y that is explained by the least-squares regression of y on x.
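Fact 3 can be verified numerically for the smoking data (a small Python check using the rounded coefficients, not part of the original examples):

```python
from statistics import mean

xs = [0, 3, 21, 15, 30, 5, 40, 60, 0, 0]
ys = [1, 2, 4, 3, 5, 1, 5, 6, 2, 0]
a, b = 1.3069, 0.0916                    # rounded LSRL coefficients
# The fitted value at xbar should equal ybar (up to rounding of a and b)
print(round(a + b * mean(xs), 3), mean(ys))   # 2.901 2.9
```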
Ex 7: An employer gives an aptitude test to his employees and compares it with their productivity at work. The
results are given below:
X (Score on Aptitude Test):   6    9    3    8    7    5    8   20    4   11   10
Y (Productivity at Work):    30   49   18   42   39   25   70   40   15   45   52
a) Draw a scatter plot of the data. Label and scale the axes. Is there any pattern? Does the data appear to be
positively correlated, negatively correlated, or neither? Which is your response and which is your explanatory
variable?
b) Find the correlation coefficient, r. Does it support your answer to part a?
c) Find the regression equation for this data and sketch it onto your scatter plot.
d) Discuss any outliers or influential points by visually looking at the scatter plot.
e) Remove any potential influential points from your data and recalculate your regression equation. Did they, in
fact, make a significant difference to your regression line?
f) Construct a residual plot and discuss its implications.
g) What proportion of the variation in productivity can be explained by the aptitude test?
i. Including the influential point
ii. Not including the influential point
h) A new job applicant scores a 9 on the aptitude test. The employer wants to predict his productivity level.
Help the employer (use the regression equation without the influential point)
Correlation and Regression Wisdom
Extrapolation – is the use of a regression line for prediction outside the domain of values of the explanatory
variable x that you used to obtain the line. Such predictions cannot be trusted.
Ex 1: The number of people living on American farms has steadily declined in the last 70 years:
Year:         1945   1950   1955   1960   1965   1970   1975   1980   1985   1990
Population:   32.1   30.5   24.4   23.0   19.1   15.6   12.4    9.7    8.9    7.2
a.) Make a scatterplot of the data and find the LSRL.
b.) According to the regression equation, how much did the farm population decline each year on average
during this period?
c.) What percent of the observed variation in farm population is accounted for by the linear change over time?
d.) Use the regression equation to predict the number of people living on farms in 2010. Is this a reasonable
result? Why?
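To see why extrapolating this line to 2010 is dangerous, one can fit the LSRL and extend it (a Python sketch of parts a and d; exact values depend on rounding):

```python
from statistics import mean

years = [1945, 1950, 1955, 1960, 1965, 1970, 1975, 1980, 1985, 1990]
pop   = [32.1, 30.5, 24.4, 23.0, 19.1, 15.6, 12.4, 9.7, 8.9, 7.2]
xbar, ybar = mean(years), mean(pop)
# Least-squares slope and intercept from the deviation sums
b = (sum((x - xbar) * (y - ybar) for x, y in zip(years, pop))
     / sum((x - xbar) ** 2 for x in years))
a = ybar - b * xbar
pred_2010 = a + b * 2010
print(round(b, 4), round(pred_2010, 2))
# the slope is negative, and the 2010 prediction falls below zero,
# an impossible population: extrapolation cannot be trusted
```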
Other Regressions
There are other regressions that can be determined using the graphing calculator.
Quadratic
Cubic
Quartic
Exponential
Natural Logarithmic
Ex 1: A study was conducted to determine if there is a relationship between the number of brownie bites eaten
on a daily basis and an individual male's weight. The data are given below.
# of brownies daily:   12    13    22    31    41    53    65    73    83    99
Weight in lb:         120   140   170   195   205   225   250   295   310   325
a. Construct a scatterplot (complete, with appropriate labels)
b. Visually determine the best fit model.
c. Determine, using your calculator, the best regression model. Why is the one you chose the best?
Justify your answer by showing all work.
d. Determine the equation of best fit.
e. What weight would you expect a person who eats 100 brownies per day to be?
f. How many brownie bites per day would you expect a person who is 150 lb to eat?
Transforming Relationships
Question: What do you do if the data shows a relationship that is not linear?
Answer: Try a transformation.
Applying a function such as the logarithm or square root to a quantitative variable is called transforming or
"re-expressing" the data. Transforming data amounts to changing the scale of measurement that was used when
the data were collected. We can choose to measure temperature in degrees Fahrenheit or in degrees Celsius, and
distance in miles or kilometers. These changes of units are linear transformations. Linear transformations
cannot straighten a curved relationship between two variables. Linear transformations also have NO effect
on the correlation coefficient (r). To straighten a curved relationship we need to use nonlinear functions like
the logarithm, the square root, or the cubing function, to name a few. Nonlinear transformations can and do
affect r, and that is our goal: to make r as strong as possible.
Ex 1: No tortilla chip lover likes soggy chips, so it is important to find characteristics of the production process
that produce chips with an appealing texture. The following data on x = frying time (in seconds) and y =
moisture content (%) appeared in the article “Thermal and Physical Properties of Tortilla Chips as a Function of
Frying Time” (Journal of Food Processing and Preservation [1995]: 175-189)
Frying Time (x):          5    10   15   20   25   30   45   60
Moisture Content (y):   16.3  9.7  8.1  4.2  3.4  2.9  1.9  1.3
Sketch a scatterplot of the data, try a few transformations and come up with the best transformed equation.
Frequently, an appropriate transformation is suggested by the data. One type of transformation that
statisticians have found useful for straightening a plot is a power transformation. A power (exponent) is first
selected, and each original value is raised to that power to obtain the corresponding transformed value. Table
5.6 displays a "ladder" of the most frequently used power transformations. The power 1 corresponds to no
transformation at all. Using the power 0 would transform every value to 1, which is certainly not informative, so
statisticians use the logarithmic transformation in its place in the ladder of transformations. Other powers
intermediate to or more extreme than those listed can be used, of course, but they are less frequently needed
than those on the ladder. Notice that all the transformations previously presented are included in this ladder.
Figure 5.31 is designed to suggest where on the ladder we should go to find an appropriate transformation. The
four curved segments, labeled 1, 2, 3, and 4, represent shapes of curved scatterplots that are commonly
encountered. Suppose that a scatterplot looks like the curve labeled 1. Then, to straighten the plot, we should
use a power of x that is up the ladder from the no-transformation row (x² or x³) and/or a power on y that is
also up the ladder from the power 1. Thus, we might be led to squaring each x value, cubing each y, and
plotting the transformed pairs. If the curvature looks like curved segment 2, a power up the ladder from no
transformation for x and/or a power down the ladder for y (e.g., √y or log(y)) should be used.
The scatterplot for the tortilla chip data has the pattern of segment 3 in Figure 5.31. This suggests going down
the ladder of transformations for x and/or for y. We found that transforming the y values in this data set by
taking logarithms worked well, and this is consistent with the suggestion of going down the ladder of
transformations for y.
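The improvement from taking logs of y can be seen numerically (a Python sketch; the correlation helper is a stand-in for the calculator's r):

```python
import math
from statistics import mean, stdev

def r(xs, ys):
    """Sample correlation via standardized products."""
    xbar, ybar, sx, sy = mean(xs), mean(ys), stdev(xs), stdev(ys)
    return sum((x - xbar) / sx * ((y - ybar) / sy)
               for x, y in zip(xs, ys)) / (len(xs) - 1)

time  = [5, 10, 15, 20, 25, 30, 45, 60]               # frying time (seconds)
moist = [16.3, 9.7, 8.1, 4.2, 3.4, 2.9, 1.9, 1.3]     # moisture content (%)
log_moist = [math.log(y) for y in moist]

print(round(r(time, moist), 3), round(r(time, log_moist), 3))
# the (x, log y) correlation is closer to -1 than the raw (x, y) correlation,
# so the log transformation straightens the relationship
```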
Ex 2: In many parts of the world, a typical diet consists mainly of cereals and grains, and many individuals
suffer from a substantial iron deficiency. The article "The Effects of Organic Acids, Phytates, and Polyphenols
on the Absorption of Iron from Vegetables" (British Journal of Nutrition [1983]: 331-342) reported the data in
Table 5.7 on x = proportion of iron absorbed and y = polyphenol content (in milligrams per gram) when a
particular food is consumed
Food              x       y
Wheat germ        0.007   6.4
Aubergine         0.007   3.0
Butter beans      0.012   2.9
Spinach           0.014   5.8
Brown lentils     0.024   5.0
Beetroot greens   0.024   4.3
Green lentils     0.032   3.4
Carrot            0.096   0.7
Potato            0.115   0.2
Beetroot          0.185   1.5
Pumpkin           0.206   0.1
Tomato            0.224   0.3
Broccoli          0.260   0.4
Cauliflower       0.263   0.7
Cabbage           0.320   0.1
Turnip            0.327   0.3
Sauerkraut        0.327   0.2
Sketch a scatterplot of the data, try a few transformations and come up with the best transformed equation.
Use your transformed equation to answer the following questions.
1. What is the residual value for Turnips?
2. Suppose Yams had a proportion of iron absorbed of .181; what would be the predicted polyphenol
content of Yams?
3. Suppose the polyphenol content was 5.5 for zucchini. Using the transformed LSRL, find the estimated
proportion of iron absorbed.
Cautions about Correlation and Regression
A Lurking Variable is a variable that is not among the explanatory or response variables in a study and yet
may influence the interpretation of relationships among those variables. A lurking variable can falsely suggest
a strong relationship between x and y, or it can hide a relationship that is really there. Lurking variables can
dramatically change the conclusions of a regression study. Because lurking variables are unrecognized and
unmeasured, detecting their effects is a challenge. Many lurking variables change systematically over time.
Ex 1: As the number of cell phones has increased, so has the number of accidents on I-95. So there is a strong
positive correlation between the number of cell phones and the number of car accidents. What is the lurking
variable in this example?
Ex 2: A study of 1000 deaths in California showed that left-handed people died at an average age of 66 while
right-handers died at age 75. Should left-handers fear an early death?
“Beware the lurking variable” is good advice when thinking about an association between two
variables. The example below illustrates Common Response. The observed association between the
variables shoe size and math ability is explained by a lurking variable, age. This common response
creates an association even though there may be no direct causal link between x and y.
Ex 3: If we were to collect data on the shoe size of a child (age 6 to 18) and their Mathematical ability we
would probably find a strong positive correlation. This does not mean that having a larger foot causes you to
be better in math. It simply means that the third variable, age, causes both shoe size and mathematical ability
to increase.
[Diagram: Age (the lurking variable) points to both Shoe size and Math ability]
Two variables are Confounded when their effects on a response variable cannot be distinguished from each
other. The confounded variables may be either explanatory variables or lurking variables. When many
variables interact with each other, confounding of several variables often prevents us from drawing conclusions
about causation.
Ex 4: Researchers found that the more schooling a man’s wife had, the less likely he was to be at risk for heart
disease. Confounding would occur if the men who married college graduates differ from those who don’t on
some variable that is related to heart disease. We know that stress affects the heart. Maybe men who marry
college grads have less stress because their wives make good livings and lessen the burden of bill paying.
Ex 5: Many studies have found that people who are active in their religion live longer than nonreligious people.
But people who attend church or mosque or synagogue also take better care of themselves than non-attenders.
They are less likely to smoke, more likely to exercise, and less likely to be overweight. The effects of good
habits are confounded with the direct effects of attending religious services.
Note: many observed associations are at least partly explained by lurking variables. Both common response
and confounding involve the influence of a lurking variable (or variables) z on the response variable y. The
distinction between these two types of relationships is less important than the common element, the influence of
lurking variables. The most important lesson is: even a strong association between two variables is not by
itself good evidence that there is a cause and effect link between the variables.
Ex 6: Studies show that there is a correlation between societies that have high milk consumption and
an increase in the incidence of cancer in those societies. Can we therefore conclude that high milk
consumption causes cancer? Not really; keep in mind that societies that have high milk consumption are
generally wealthier and its citizens also have a greater longevity – the longer one lives the more likely
one is to contract cancer. Thus we are not sure if the higher incidence of cancer is due to the third factor
of longevity or the milk consumption, or some combination of both.
[Diagram: Longevity (the lurking variable) points to both drinking milk and contracting cancer]
The third factor of longevity is a lurking variable.
Here is the moral of the story:
Correlation/Association Doesn’t Imply Causation!