Ch3

Finding Areas with Calc
1. Shade Norm
•
Consider WISC data from before: N (100, 15). Suppose we
want to find % of children whose scores are above 125
•
Specify window: X[55,145]12 and Y[-.008, .028].01
•
Press 2nd VARS (DISTR), then choose Draw and 1:
ShadeNorm(.
•
Complete the command: ShadeNorm(125, 1E99, 100, 15) or
(0, 85, 100, 15).
•
If using Z scores instead of raw scores, the mean 0 and SD
1 will be understood so ShadeNorm (1,2) will give you area
from z-score of 1.0 to 2.0. How does this compare to the
68-95-99.7 rule?
2. NormalCDF
•
Advantage- quicker, disadvantage- no picture.
•
2nd Vars (DISTR) choose 2: normalcdf(.
•
Complete command (125, 1E99, 100, 15) and
press enter. You get .0477 or approx. 5%.
•
If you have Z scores: normalcdf (-1, 1) = .6827
aka 68%...like our rule!
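For checking these areas away from the calculator, here is a minimal Python sketch using the standard library's `statistics.NormalDist`. The helper name `normalcdf` is only a stand-in that mimics the TI command; it is an assumption for illustration, not a real library function.

```python
from statistics import NormalDist

def normalcdf(lower, upper, mean=0.0, sd=1.0):
    """Area under the Normal(mean, sd) curve between lower and upper."""
    dist = NormalDist(mean, sd)
    return dist.cdf(upper) - dist.cdf(lower)

# Proportion of WISC scores above 125 when scores are N(100, 15);
# 1e99 plays the role of "infinity", as on the calculator
above_125 = normalcdf(125, 1e99, 100, 15)

# Area within one standard deviation of the mean, using z-scores
within_one_sd = normalcdf(-1, 1)

print(round(above_125, 4), round(within_one_sd, 4))
```

The first value should land close to the 5% quoted above, and the second close to the 68% of the 68-95-99.7 rule.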
3. InvNorm
•
InvNorm calculates the raw or Z score value
corresponding to a known area under the curve.
•
2nd Vars (DISTR), choose 3: invNorm(.
•
Complete the command invNorm (.9, 100, 15) and
press Enter. You get 119.223, so this is the score
corresponding to 90th percentile.
•
Compare this with command invNorm (.9) you
get 1.28. This is the Z score.
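The same two invNorm results can be reproduced with a short Python sketch. It assumes Python 3.8 or later, where `statistics.NormalDist.inv_cdf` is available; the variable names are illustrative.

```python
from statistics import NormalDist

# Raw score at the 90th percentile of N(100, 15)
raw_90th = NormalDist(100, 15).inv_cdf(0.9)

# z-score at the 90th percentile of the standard Normal
z_90th = NormalDist().inv_cdf(0.9)

# The two agree once the z-score is unstandardized: x = mean + z * sd
print(round(raw_90th, 3), round(z_90th, 2), round(100 + 15 * z_90th, 3))
```

This shows why the two commands are consistent: the raw score is just the mean plus the z-score times the standard deviation.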
Bivariate Relationships
What is Bivariate data?
When exploring/describing a bivariate (x,y) relationship:
Determine the Explanatory and Response variables
Plot the data in a scatterplot
Note the Strength, Direction, and Form
Note the mean and standard deviation of x and the
mean and standard deviation of y
Calculate and Interpret the Correlation, r
Calculate and Interpret the Least Squares
Regression Line in context.
Assess the appropriateness of the LSRL by
constructing a Residual Plot.
3.1 Response Vs.
Explanatory Variables
•
A response variable measures an outcome of a study. An
explanatory variable helps explain or influences
changes in a response variable (explanatory is like
independent, response is like dependent).
•
Calling one variable explanatory and the other
response doesn’t necessarily mean that changes in
one CAUSE changes in the other.
•
Ex: Alcohol and Body temp: One effect of Alcohol is a
drop in body temp. To test this, researchers give
several amounts of alcohol to mice and measure each
mouse’s body temp change. What are the
explanatory and response variables?
Scatterplots
•
Scatterplot shows the relationship between two quantitative variables measured
on the same individuals.
•
Explanatory variables along X axis, Response variables along Y.
•
Each individual in data appears as the point in the plot fixed by the values of
both variables for that individual.
•
Example:
Interpreting Scatterplots
•
Direction: in previous example, the overall pattern moves
from upper left to lower right. We call this a negative
association.
•
Form: The form is slightly curved and there are two
distinct clusters. What explains the clusters? (ACT
States)
•
Strength: The strength is determined by how closely the
points follow a clear form. The example is only
moderately strong.
•
Outliers: Do we see any deviations from the pattern?
(Yes, West Virginia, where 20% of HS seniors take the
SAT but the mean math score is only 511).
Association
Introducing Categorical
Variables
Calculator Scatterplot
Student  Beers  BAC
1        5      0.10
2        2      0.03
3        9      0.19
4        8      0.12
5        3      0.04
6        7      0.09
7        3      0.07
8        5      0.06
9        3      0.02
10       5      0.05
11       4      0.07
12       6      0.10
13       5      0.08
14       7      0.09
15       1      0.01
16       4      0.05
•
Enter the Beer consumption in L1 and the BAC values in L2
•
Next specify scatterplot in Statplot menu (first graph). X list
L1 Y List L2 (explanatory and response)
•
Use ZoomStat.
•
Notice that there are no scales on the axes and they aren’t
labeled. If you are copying your graph to your paper,
make sure you scale and label the axes (use Trace).
Correlation
Caution- our eyes can be fooled! Our eyes are not
good judges of how strong a linear relationship is.
The 2 scatterplots depict the same data but drawn
with a different scale. Because of this we need a
numerical measure to supplement the graph.
•
The correlation r measures the direction and strength of the linear relationship
between 2 variables.
•
Formula (don’t need to memorize or use): $r = \frac{1}{n-1}\sum z_x z_y$, where $z_x$ and $z_y$ are the standardized values of x and y.
•
In Calc: Go to Catalog (2nd, zero button), go to DiagnosticOn, enter, enter. You
only have to do this ONCE! Once this is done:
•
Enter data in L1 and L2 (you can do Calc, 2-Var Stats if you want the mean and SD
of each)
•
Calc, LinReg (a + bx), enter
Interpreting r
•
The absolute value of r tells you the strength of the
linear association (0 means no linear association, 1 means
a perfect linear association)
•
The sign tells you whether it’s a positive or a negative
association. So r ranges from -1 to +1
•
Note- it makes no difference which variable you call x and which you
call y when calculating correlation, but stay consistent!
•
Because r uses standardized values of the observations, r does not
change when we change the units of measurement of x, y, or both.
(Ex: Measuring height in inches vs. ft. won’t change correlation with
weight)
•
Values of -1 and +1 occur ONLY in the case of a perfect linear
relationship, when the points lie exactly along a straight line.
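These properties can be checked numerically. The sketch below uses made-up heights and weights (hypothetical values, for illustration only) to show that r is unchanged when x and y are swapped or when units change, and that points exactly on a line give r = -1 or +1.

```python
from statistics import mean, stdev

def correlation(x, y):
    # Average product of z-scores, dividing by n - 1
    mx, my, sx, sy = mean(x), mean(y), stdev(x), stdev(y)
    return sum((a - mx) / sx * (b - my) / sy for a, b in zip(x, y)) / (len(x) - 1)

# Hypothetical heights (inches) and weights (pounds), for illustration only
height_in = [60, 62, 65, 68, 70, 72]
weight_lb = [110, 125, 130, 155, 160, 185]

height_ft = [h / 12 for h in height_in]          # change of units for x

r_inches = correlation(height_in, weight_lb)
r_feet = correlation(height_ft, weight_lb)       # same r after unit change
r_swapped = correlation(weight_lb, height_in)    # same r after swapping x and y

line_x = [1, 2, 3, 4]
r_perfect = correlation(line_x, [10 - 2 * v for v in line_x])  # points exactly on a line

print(r_inches, r_feet, r_swapped, r_perfect)
```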
Examples
1. Correlation requires that both variables be
quantitative
2. Correlation measures the strength of only
LINEAR relationships, not curved...no matter
how strong they are!
3. Like the mean and standard deviation, the
correlation is not resistant: r is strongly affected
by a few outlying observations. Use r with
caution when outliers appear in the scatterplot
4. Correlation is not a complete summary of
two-variable data, even when the relationship is
linear- always give the means and standard
deviations of both x and y along with the
correlation.
3.2- least squares
regression
The slope here, b = -.00344, tells us that fat
gained goes down by .00344 kg for each
added calorie of NEA according to this
linear model. The slope of a regression line is
the predicted RATE OF CHANGE in the
response y as the explanatory variable x
changes.
The Y intercept a = 3.505 kg is the fat gain
estimated by this model if NEA does not
change when a person overeats.
•
We can use a regression line to predict the response y
for a specific value of the explanatory variable x.
Prediction
LSRL
•
In most cases, no line will pass exactly through all
the points in a scatter plot and different people will
draw different regression lines by eye.
•
Because we use the line to predict y from x, the
prediction errors we make are errors in y, the
vertical direction in the scatter plot
•
A good regression line makes the vertical distances of the
points from the line as small as possible
•
Error: Observed response - predicted response
LSRL Cont.
Equation of LSRL
•
Example 3.36: The Sanchez household is about to install
solar panels to reduce the cost of heating their house. In
order to know how much the panels help, they record their
consumption of natural gas before the panels are installed.
Gas consumption is higher in cold weather, so the
relationship between outside temp and gas consumption is
important.
•
About how much gas does the regression line predict that the family will
use in a month that averages 20 degree-days per day?
•
Answers: the relationship is positive, linear, and very strong; the predicted
consumption is about 500 cubic feet per day.
How well does the least-squares line fit the data?
Residuals
•
The errors of our predictions, or vertical distances
from predicted Y to observed Y, are called
residuals because they are “left-over” variation in
the response.
One subject’s NEA rose by 135 calories.
That subject gained 2.7 kg of fat. The
predicted gain for 135 calories is
y hat = 3.505 - .00344(135) = 3.04 kg
The residual for this subject is
y - y hat = 2.7 - 3.04 = -.34 kg
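The arithmetic for this subject can be written as a short Python sketch. The coefficients are the rounded ones quoted in the text; the function name `predicted_fat_gain` is an illustrative assumption.

```python
# Coefficients quoted in the text for the NEA / fat-gain regression line
intercept = 3.505     # kg
slope = -0.00344      # kg per calorie of NEA

def predicted_fat_gain(nea_change):
    """Predicted fat gain (kg) for a given change in NEA (calories)."""
    return intercept + slope * nea_change

observed = 2.7                       # kg gained by this subject
predicted = predicted_fat_gain(135)  # NEA rose by 135 calories
residual = observed - predicted      # residual = observed - predicted

print(round(predicted, 2), round(residual, 2))
```

A negative residual means the subject gained less fat than the line predicts.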
Residual Plot
•
The sum of the least-squares residuals is always zero.
•
The mean of the residuals is always zero, so the horizontal line
at zero in the figure helps orient us. This “residual = 0” line
corresponds to the regression line.
List Residuals on Calc
•
If you want to get all your residuals listed in L3,
highlight L3 (the name of the list, on the top)
and go to 2nd, Stat, RESID, then hit Enter twice.
The list that pops out is the residual for
each individual in the corresponding L1 and
L2. (If you were to create a normal scatterplot
using this list as your Y list, so X list: L1 and Y list:
L3, you would get the exact same thing as if you
did a residual plot defining X list as L1 and Y list
as RESID, as we had been doing.)
This is a helpful list to have to check your work when
Examining Residual Plot
•
Residual plot should show no obvious pattern. A curved
pattern shows that the relationship is not linear and a straight
line may not be the best model.
•
Residuals should be relatively small in size. A regression line
in a model that fits the data well should come “close” to most
of the points.
•
A commonly used measure of this is the standard deviation
of the residuals, given by:
$s = \sqrt{\frac{\sum \text{residuals}^2}{n-2}}$
For the NEA and fat gain data, $s = \sqrt{\frac{7.663}{14}} = .740$
Residual Plot on Calc
•
Produce Scatterplot and Regression line from data (let’s use
BAC if still in there)
•
Turn all plots off
•
Create new scatterplot with X list as your explanatory
variable and Y list as residuals (2nd stat, resid)
•
ZoomStat
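As a cross-check on the calculator's RESID list, the following Python sketch fits the least-squares line to the beer and BAC data, lists the residuals, and computes their standard deviation with the n - 2 formula. Variable names are illustrative, and the plotting step is left to the calculator.

```python
from math import sqrt

beers = [5, 2, 9, 8, 3, 7, 3, 5, 3, 5, 4, 6, 5, 7, 1, 4]
bac = [0.10, 0.03, 0.19, 0.12, 0.04, 0.09, 0.07, 0.06,
       0.02, 0.05, 0.07, 0.10, 0.08, 0.09, 0.01, 0.05]

n = len(beers)
x_bar = sum(beers) / n
y_bar = sum(bac) / n

# Least-squares slope and intercept
sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(beers, bac))
sxx = sum((x - x_bar) ** 2 for x in beers)
b = sxy / sxx
a = y_bar - b * x_bar

# Residual = observed - predicted, one per student (the calculator's RESID list)
residuals = [y - (a + b * x) for x, y in zip(beers, bac)]

# Standard deviation of the residuals: s = sqrt(sum(residuals^2) / (n - 2))
s = sqrt(sum(e ** 2 for e in residuals) / (n - 2))

print(round(a, 4), round(b, 4), round(s, 4))
print(abs(sum(residuals)) < 1e-9)  # residuals sum to zero, up to rounding
```

Plotting `residuals` against `beers` gives the same picture as the calculator's residual plot.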
R squared- Coefficient
of determination
If all the points fall directly on the least-squares
line, r squared = 1. Then all the variation in y
is explained by the linear relationship with x.
So, if r squared = .606, that means that about 61% of
the variation in y among individual subjects is
explained by the linear relationship with x. The
other 39% is “not explained”.
r squared is a measure of how successful the
regression was in explaining the response
Facts about Least-Squares regression
•
The distinction between explanatory and response variables is essential
in regression. If we reverse the roles, we get a different least-squares
regression line.
•
There is a close connection between correlation and the slope of the
LSRL. Slope is r times Sy/Sx. This says that a change of one standard
deviation in x corresponds to a change of r standard deviations in y.
When the variables are perfectly correlated (r = +/- 1), the change in the
predicted response y hat is the same (in standard deviation units) as the
change in x.
•
The LSRL will always pass through the point (X bar, Y Bar)
•
r squared is the fraction of variation in values of y explained by the x
variable
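The four facts above can be checked on a small data set. The sketch below uses made-up numbers (for illustration only) and a hypothetical helper `fit` that builds the line from r, the means, and the standard deviations.

```python
from statistics import mean, stdev

def fit(x, y):
    """Return (a, b, r) for the least-squares line y-hat = a + b*x."""
    mx, my, sx, sy = mean(x), mean(y), stdev(x), stdev(y)
    r = sum((xi - mx) / sx * (yi - my) / sy for xi, yi in zip(x, y)) / (len(x) - 1)
    b = r * sy / sx          # slope = r * (s_y / s_x)
    a = my - b * mx          # forces the line through (x-bar, y-bar)
    return a, b, r

# Made-up data, for illustration only
x = [1, 2, 3, 4, 5, 6]
y = [2.1, 2.9, 4.2, 4.8, 6.1, 6.4]

a, b, r = fit(x, y)
a_rev, b_rev, r_rev = fit(y, x)      # reverse the roles of x and y

passes_through_means = abs((a + b * mean(x)) - mean(y)) < 1e-9
same_correlation = abs(r - r_rev) < 1e-9
# Solving x = a_rev + b_rev*y for y gives slope 1/b_rev, which differs from b unless r = +/-1
different_line = abs(b - 1 / b_rev) > 1e-6

# r squared as the fraction of variation in y explained by the line
sse = sum((yi - (a + b * xi)) ** 2 for xi, yi in zip(x, y))
sst = sum((yi - mean(y)) ** 2 for yi in y)
r_squared_from_fit = 1 - sse / sst

print(passes_through_means, same_correlation, different_line)
print(round(r ** 2, 4), round(r_squared_from_fit, 4))
```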
3.3 Influences
•
Correlation r is not resistant: one unusual point in the
scatterplot can greatly affect the value of r. The LSRL is
also not resistant. Extrapolation is not very
reliable.
•
A point extreme in the x direction with no other
points near it pulls the line toward itself. This point
is influential.
Lurking Variables: Beware!
•
Example: A College Board study of HS grads found a strong correlation
between how much math minority students took in high school and their later
success in college. News articles quoted the College Board saying that
“math is the gatekeeper for success in college”.
•
But, Minority students from middle-class homes with educated
parents no doubt take more high school math courses. They are also
more likely to have a stable family, parents who emphasize
education, and can pay for college etc. These students would likely
succeed in college even if they took fewer math courses. The family
background of students is a lurking variable that probably explains
much of the relationship between math courses and college success.
Beware correlations
based on averages
•
Correlations based on averages are usually too
high when applied to individuals.
•
Example: if we plot the average height of young
children against their age in months, we will see
a very strong positive association with correlation
near 1. But individual children of the same age
vary a great deal in height. A plot of height
against age for individual children will show much
more scatter and lower correlation than the plot
of average height against age.
Chapter Example:
Corrosion and Strength
Consider the following data from the article “The Carbonation
of Concrete Structures in the Tropical Environment of
Singapore” (Magazine of Concrete Research (1996): 293-300),
which discusses how the corrosion of steel (caused by
carbonation) is the biggest problem affecting concrete strength:
x= carbonation depth in concrete (mm)
y= strength of concrete (Mpa)
x   8     20    20    30    35    40    50    55   65
y   22.8  17.1  21.5  16.1  13.4  12.4  11.4  9.7  6.8
Define the Explanatory and Response Variables.
Plot the data and describe the relationship.
Corrosion and Strength
There is a strong, negative,
linear relationship between
depth of corrosion and
concrete strength. As the
depth increases, the
strength decreases at a
constant rate.
[Scatterplot: Strength (Mpa) vs. Depth (mm)]
Corrosion and Strength
The mean depth of corrosion
is 35.89mm with a standard
deviation of 18.53mm.
The mean strength is 14.58
Mpa with a standard deviation
of 5.29 Mpa.
Find the equation of the
Least Squares Regression
Line (LSRL) that models the
relationship between
corrosion and strength.
Corrosion and Strength
y=24.52+(-0.28)x
strength=24.52+(-0.28)depth
r=-0.96
Corrosion and Strength
What does “r” tell us?
There is a Strong, Negative, LINEAR relationship
between depth of corrosion and strength of concrete.
What does “b=-0.28” tell us?
For every increase of 1mm in depth of corrosion, we
predict a 0.28 Mpa decrease in strength of the concrete.
Corrosion and Strength
Use the prediction model (LSRL) to determine the following:
What is the predicted strength of concrete with a corrosion
depth of 25mm?
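strength=24.52+(-0.28)depth
strength=24.52+(-0.28)(25)
strength=17.59 Mpa (computed with the unrounded coefficients; the rounded ones give 17.52)
What is the predicted strength of concrete with a corrosion
depth of 40mm?
strength=24.52+(-0.28)(40)
strength=13.44 Mpa (computed with the unrounded coefficients; the rounded ones give 13.32)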
How does this prediction compare with the observed strength
at a corrosion depth of 40mm?
Residuals
Note, the predicted strength when corrosion=40mm is:
predicted strength=13.44 Mpa
The observed strength when corrosion=40mm is:
observed strength=12.4 Mpa
• The prediction did not match the observation.
• That is, there was an “error” or “residual” between our
prediction and the actual observation.
• RESIDUAL = Observed y - Predicted y
• The residual when corrosion=40mm is:
• residual = 12.4 - 13.44
• residual = -1.04
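The whole corrosion example can be reproduced from the nine data points. The sketch below is one way to do it in plain Python; it keeps the unrounded slope and intercept, which is why its predictions match 17.59 and 13.44 rather than what the rounded equation gives. Variable and function names are illustrative.

```python
depth = [8, 20, 20, 30, 35, 40, 50, 55, 65]                       # mm
strength = [22.8, 17.1, 21.5, 16.1, 13.4, 12.4, 11.4, 9.7, 6.8]   # Mpa

n = len(depth)
x_bar = sum(depth) / n
y_bar = sum(strength) / n

sxx = sum((x - x_bar) ** 2 for x in depth)
syy = sum((y - y_bar) ** 2 for y in strength)
sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(depth, strength))

b = sxy / sxx                  # slope of the LSRL
a = y_bar - b * x_bar          # intercept of the LSRL
r = sxy / (sxx * syy) ** 0.5   # correlation

def predict(d):
    """Predicted strength (Mpa) at corrosion depth d (mm)."""
    return a + b * d

residual_40 = 12.4 - predict(40)   # observed - predicted at depth 40mm

print(round(a, 2), round(b, 2), round(r * r, 4))
print(round(predict(25), 2), round(predict(40), 2), round(residual_40, 2))
```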
Assessing the Model
Is the LSRL the most appropriate prediction model for
strength? r suggests it will provide strong
predictions...can we do better?
To determine this, we need to study the residuals
generated by the LSRL.
Make a residual plot.
Look for a pattern.
If no pattern exists, the LSRL may be our best bet for
predictions.
If a pattern exists, a better prediction model may
exist...
Residual Plot
Construct a Residual Plot for the (depth,strength) LSRL.
[Residual plot: residuals vs. depth (mm)]
There appears to be no
pattern to the residual
plot...therefore, the
LSRL may be our best
prediction model.
Coefficient of Determination
We know what “r” tells us
about the relationship
between depth and
strength. What about r2?
93.75% of the variability in
strength can be
explained by the LSRL on
depth.
Summary
When exploring a bivariate relationship:
Make and interpret a scatterplot:
Strength, Direction, Form
Describe x and y:
Mean and Standard Deviation in Context
Find the Least Squares Regression Line.
Write in context.
Construct and Interpret a Residual Plot.
Interpret r and r2 in context.
Use the LSRL to make predictions...
Examining
Relationships
Regression Review
Regression Basics
When describing a Bivariate
Relationship:
Make a Scatterplot
Strength, Direction, Form
Model: y-hat=a+bx
Interpret slope in context
Make Predictions
Residual = Observed-Predicted
Assess the Model
Interpret “r”
Residual Plot
Reading Minitab
Output
Regression Analysis: Fat gain versus NEA
The regression equation is
FatGain = ****** + ******(NEA)
Predictor   Coef         SE Coef      T       P
Constant    3.5051       0.3036       11.54   0.000
NEA         -0.0034415   0.00074141   -4.64   0.000
S = 0.739853   R-Sq = 60.6%   R-Sq(adj) = 57.8%
Regression equations aren’t always as easy to spot as they
are on your TI-84. Can you find the slope and intercept
above?
Outliers/Influential Points
Age at First Word and Gesell Score
Does the age of a child’s first word predict
his/her mental ability? Consider the
following data on (age of first word, Gesell
Adaptive Score) for 21 children.
Child  Age (months)  Score
1      15            95
2      26            71
3      10            83
4      9             91
5      15            102
6      20            87
7      18            93
8      11            100
9      8             104
10     20            94
11     7             113
12     9             96
13     10            83
14     11            84
15     11            102
16     10            100
17     12            105
18     42            57
19     17            121
20     11            86
21     10            100
Influential?
Does the highlighted point markedly affect the
equation of the LSRL? If so, it is “influential”.
Test by removing the point and finding the new
LSRL.
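The remove-and-refit test can be sketched in Python with the Gesell data. The slide does not say which point is highlighted; the sketch assumes it is Child 18 (42 months, score 57), the point most extreme in the x direction. The helper name `lsrl` is illustrative.

```python
age = [15, 26, 10, 9, 15, 20, 18, 11, 8, 20, 7,
       9, 10, 11, 11, 10, 12, 42, 17, 11, 10]          # months
score = [95, 71, 83, 91, 102, 87, 93, 100, 104, 94, 113,
         96, 83, 84, 102, 100, 105, 57, 121, 86, 100]  # Gesell score

def lsrl(x, y):
    """Return (intercept, slope) of the least-squares line of y on x."""
    n = len(x)
    x_bar, y_bar = sum(x) / n, sum(y) / n
    sxy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
    sxx = sum((xi - x_bar) ** 2 for xi in x)
    slope = sxy / sxx
    return y_bar - slope * x_bar, slope

a_all, b_all = lsrl(age, score)

# Assumption: the highlighted point is Child 18 (list index 17)
candidate = 17
age_wo = age[:candidate] + age[candidate + 1:]
score_wo = score[:candidate] + score[candidate + 1:]
a_wo, b_wo = lsrl(age_wo, score_wo)

print(round(a_all, 2), round(b_all, 3))   # line using all 21 children
print(round(a_wo, 2), round(b_wo, 3))     # line with the candidate removed
```

If the slope and intercept change markedly between the two fits, the removed point is influential.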
Explanatory vs.
Response
The Distinction Between Explanatory and Response variables
is essential in regression.
Switching the distinction results in a different least-squares
regression line.
Note: The correlation value, r, does NOT depend on the
distinction between Explanatory and Response.
Correlation
The correlation, r, describes the
strength of the straight-line
relationship between x and y.
Ex: There is a strong, positive,
LINEAR relationship between
# of beers and BAC.
There is a weak, positive, linear
relationship between x and y.
However, there is a strong
nonlinear relationship.
r measures the strength of
linearity...
Coefficient of Determination
The coefficient of determination, r2, describes the percent of
variability in y that is explained by the linear regression on x.
71% of the variability in death
rates due to heart disease can
be explained by the LSRL on
alcohol consumption.
That is, alcohol consumption
provides us with a fairly good
prediction of death rate due to
heart disease, but other factors
contribute to this rate, so our
prediction will be off somewhat.
Cautions
Correlation and Regression are NOT
RESISTANT to outliers and Influential Points!
Correlations based on “averaged data” tend to be
higher than correlations based on all raw data.
Extrapolating beyond the observed data can
result in predictions that are unreliable.
Correlation vs. Causation
Consider the following historical data:
     Year  Ministers  Rum
1    1860  63         8376
2    1865  48         6406
3    1870  53         7005
4    1875  64         8486
5    1880  72         9595
6    1885  80         10643
7    1890  85         11265
8    1895  76         10071
9    1900  80         10547
10   1905  83         11008
11   1910  105        13885
12   1915  140        18559
x = # Methodist Ministers in New England
y = # of Barrels of Rum Imported to Boston
There is an almost perfect linear
relationship between x and y.
(r=0.999997)
CORRELATION DOES NOT IMPLY CAUSATION!
Summary