– Working with relationships between two variables
• “Donation” made to teacher & Stats Test Score
[Scatterplot: Stats Test Score (0 to 100) vs. donation ($0 to $80)]
Correlation & Regression
• Univariate & Bivariate Statistics
– U: frequency distribution, mean, mode, range, standard deviation
– B: correlation – two variables
• Correlation
– linear pattern of relationship between one variable (x) and
another variable (y) – an association between two variables
– relative position of one variable correlates with relative
distribution of another variable
• X – an explanatory variable attempts to explain the observed outcomes.
• Y – a response variable measures an outcome of a study.
• Warning:
– No proof of causality
– Cannot assume x causes y
Scatterplot or
Scatter Diagram
a plot of paired data to determine or show a
relationship between two variables
[Figure 3.1: Graduating Seniors by State in 2005 (Percent taking SAT vs. Score); labeled points: the state of Louisiana, the state of Rhode Island]
• Attributes of a good scatterplot
– Consistent and uniform scale
– Labels on both axes
– Accurate placement of data
– Data throughout the axes
– Axis break lines if not starting at zero.
• To achieve this goal you should try to do your
scatterplots on graph paper.
AP Statistics, Section 3.1, Part 1
[Figure: Graduating Seniors by State in 2005; labeled groups: states from NE, Mid-Atlantic and West; states from Midwest, Mtn Central, and Southwest]
Paired Data
Miles traveled   2   5   12   7    7    15   10
Minutes          6   9   23   18   15   28   19
Scatter Diagram
[Scatterplot: Relationship between miles traveled and minutes; x-axis: miles (0 to 20), y-axis: minutes (0 to 30)]
Linear Correlation
The general trend of the points seems to
follow a straight line segment.
Linear Correlation
Non-Linear Correlation
No Linear Correlation
High Linear Correlation
Points lie close to a straight line.
High Linear Correlation
Moderate Linear Correlation
Low Linear Correlation
Perfect Linear Correlation
Questions Arising
• Can we find a relationship between x and y?
• How strong is the relationship?
[Scatterplot: Relationship between miles traveled and minutes; x-axis: miles (0 to 20), y-axis: minutes (0 to 30)]
When there appears to be a linear relationship
between x and y:
attempt to “fit” a line to the scatter
diagram.
When using x values
to predict y values:
• Call x the explanatory variable
• Call y the response variable
Scatterplot!
• No Correlation
– Random or circular assortment of dots
• Positive Correlation
– ellipse leaning to right
– GPA and SAT
– Smoking and Lung Damage
– Number of Whoppers eaten and Mr. Flynn’s weight
• Negative Correlation
– ellipse leaning to left
– Depression & Self-esteem
– Studying & test errors
– Vampire friends & Werewolf boyfriends
Interpreting Scatterplots
• Pattern/Shape: linear, parabola, bell shaped
– Deviations from pattern: Are there areas where the data conform
less to the pattern?
– Form: Are there clusters of data?
– Special data: Are there any influential points?
– Is a transformation of data necessary?
• Trend/Direction: positive, negative, or WTF?
– As x increases what happens to y?
• Strength/Association: weak, moderate, strong
– IF a line were drawn through the data, how close would the points
be to the line?
– Is there a small or large amount of variability within the y values?
Pearson’s Correlation Coefficient
• “r” indicates…
– strength of relationship (strong, weak, or none)
– the variation of the points around the model (linear)
– direction of relationship
• positive (direct) – variables move in same direction
• negative (inverse) – variables move in opposite directions
• r ranges in value from –1.0 to +1.0
Scale: -1.0 = Strong Negative, 0.0 = No Rel., +1.0 = Strong Positive
•Try quick estimates
–Next slide and strange quiz
Practice with Scatterplots
r = .__ __
r = .__ __
r = .__ __
r = .__ __
A relationship between the correlation coefficient, r, and the slope, b, of the least squares line:
$$ r = b \left( \frac{s_x}{s_y} \right) $$
where $s_y$ = standard deviation of the y values
and $s_x$ = standard deviation of the x values
Linear correlation coefficient
$$ -1 \le r \le +1 $$
Calculating the Correlation Coefficient, r
$$ r = \frac{SS_{xy}}{\sqrt{SS_x \, SS_y}} $$
where
$$ SS_{xy} = \sum xy - \frac{(\sum x)(\sum y)}{n} $$
$$ SS_x = \sum x^2 - \frac{(\sum x)^2}{n} $$
$$ SS_y = \sum y^2 - \frac{(\sum y)^2}{n} $$
n = number of data pairs
Find the Least Squares Line
x (Miles Traveled)   y (Minutes)   x²     xy
2                    6             4      12
5                    9             25     45
12                   23            144    276
7                    18            49     126
7                    15            49     105
15                   28            225    420
10                   19            100    190
Σx = 58              Σy = 118      Σx² = 596   Σxy = 1174
Finding the slope
$$ SS_{xy} = \sum xy - \frac{(\sum x)(\sum y)}{n} = 1174 - \frac{(58)(118)}{7} = 196.28571 $$
and
$$ SS_x = \sum x^2 - \frac{(\sum x)^2}{n} = 596 - \frac{58^2}{7} = 115.42857 $$
$$ \text{slope} = b = \frac{SS_{xy}}{SS_x} = \frac{196.28571}{115.42857} = 1.700495 $$
Finding the y-intercept
$$ \bar{y} = \text{mean of y values} = \frac{118}{7} = 16.857143 $$
$$ \bar{x} = \text{mean of x values} = \frac{58}{7} = 8.2857143 $$
$$ \text{y-intercept} = a = \bar{y} - b\bar{x} = 16.857143 - 1.700495(8.2857143) = 2.7673273 $$
The equation of the least squares line is:
y = a + bx
y = 2.8 + 1.7x
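The slope and intercept steps above can be sketched in a few lines of Python. This is only an illustration of the SS formulas from the slides, using the miles/minutes data; the function name `least_squares_line` is made up for this example.

```python
# Paired data from the slides: miles traveled (x) and minutes (y).
miles = [2, 5, 12, 7, 7, 15, 10]
minutes = [6, 9, 23, 18, 15, 28, 19]

def least_squares_line(xs, ys):
    """Return (a, b) for the line y = a + b*x using the SS formulas."""
    n = len(xs)
    sum_x, sum_y = sum(xs), sum(ys)
    ss_xy = sum(x * y for x, y in zip(xs, ys)) - sum_x * sum_y / n
    ss_x = sum(x * x for x in xs) - sum_x ** 2 / n
    b = ss_xy / ss_x                # slope = SSxy / SSx
    a = sum_y / n - b * sum_x / n   # intercept = y-bar - b * x-bar
    return a, b

a, b = least_squares_line(miles, minutes)
print(f"y = {a:.1f} + {b:.1f}x")  # y = 2.8 + 1.7x
```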
To Compute r:
• Complete a table, with columns listing x, y, x², y², xy
• Compute SSxy, SSx, and SSy
• Use the formula:
$$ r = \frac{SS_{xy}}{\sqrt{SS_x \, SS_y}} $$
Find the Correlation Coefficient
x (Miles)   y (Min.)   x²     y²     xy
2           6          4      36     12
5           9          25     81     45
12          23         144    529    276
7           18         49     324    126
7           15         49     225    105
15          28         225    784    420
10          19         100    361    190
Σx = 58     Σy = 118   Σx² = 596   Σy² = 2340   Σxy = 1174
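Calculations:
$$ SS_{xy} = \sum xy - \frac{(\sum x)(\sum y)}{n} = 1174 - \frac{(58)(118)}{7} = 196.28571 $$
$$ SS_x = \sum x^2 - \frac{(\sum x)^2}{n} = 596 - \frac{58^2}{7} = 115.42857 $$
$$ SS_y = \sum y^2 - \frac{(\sum y)^2}{n} = 2340 - \frac{118^2}{7} = 350.85714 $$
$$ r = \frac{SS_{xy}}{\sqrt{SS_x \, SS_y}} = \frac{196.28571}{\sqrt{(115.42857)(350.85714)}} = 0.9753643 $$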
The Correlation Coefficient, r = 0.9753643
r ≈ 0.98
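As a cross-check on the hand calculation, here is a minimal Python sketch of the same SS-based formula for r. The helper name `correlation` is an assumption for this example, not something from the slides.

```python
import math

# Paired data from the slides: miles traveled (x) and minutes (y).
miles = [2, 5, 12, 7, 7, 15, 10]
minutes = [6, 9, 23, 18, 15, 28, 19]

def correlation(xs, ys):
    """Pearson's r computed as SSxy / sqrt(SSx * SSy)."""
    n = len(xs)
    sum_x, sum_y = sum(xs), sum(ys)
    ss_xy = sum(x * y for x, y in zip(xs, ys)) - sum_x * sum_y / n
    ss_x = sum(x * x for x in xs) - sum_x ** 2 / n
    ss_y = sum(y * y for y in ys) - sum_y ** 2 / n
    return ss_xy / math.sqrt(ss_x * ss_y)

r = correlation(miles, minutes)
print(round(r, 2))  # 0.98
```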
Calculating Correlation
• The calculation of correlation is based on mean and standard deviation.
$$ r = \frac{1}{n-1} \sum \left( \frac{x_i - \bar{x}}{s_x} \right) \left( \frac{y_i - \bar{y}}{s_y} \right) $$
• Remember that both mean and standard deviation are not resistant measures.
AP Statistics, Section 3.2, Part 1
Calculating Correlation
• What do the contents of the parentheses look like?
– The formula for calculating z-values.
$$ r = \frac{1}{n-1} \sum \left( \frac{x_i - \bar{x}}{s_x} \right) \left( \frac{y_i - \bar{y}}{s_y} \right) $$
• What happens when the values are both from the lower half of the population?
– Both z-values are negative. Their product is positive.
• From the upper half?
– Both z-values are positive. Their product is positive.
Calculating Correlation
• What happens when one value is from the lower half of the population but the other value is from the upper half?
– One z-value is positive and the other is negative. Their product is negative.
$$ r = \frac{1}{n-1} \sum \left( \frac{x_i - \bar{x}}{s_x} \right) \left( \frac{y_i - \bar{y}}{s_y} \right) $$
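The z-value form of the formula can be tried out directly. The sketch below is one possible illustration, applied to the miles/minutes data from earlier in the slides; the function name is hypothetical, and `statistics.stdev` is used because it divides by n − 1, matching the formula.

```python
import statistics

# Paired data from the slides: miles traveled (x) and minutes (y).
miles = [2, 5, 12, 7, 7, 15, 10]
minutes = [6, 9, 23, 18, 15, 28, 19]

def correlation_from_z_scores(xs, ys):
    """r = (1 / (n - 1)) * sum of (z-value of x) * (z-value of y)."""
    n = len(xs)
    x_bar, y_bar = statistics.mean(xs), statistics.mean(ys)
    s_x, s_y = statistics.stdev(xs), statistics.stdev(ys)  # sample std devs
    products = [((x - x_bar) / s_x) * ((y - y_bar) / s_y)
                for x, y in zip(xs, ys)]
    return sum(products) / (n - 1)

r = correlation_from_z_scores(miles, minutes)
print(round(r, 2))  # 0.98
```

A pair such as (7, 18) sits below the mean of x but above the mean of y, so its product of z-values is negative and pulls r down slightly.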
Using the TI-83/84 to calculate r
• With Diagnostics ON:
• Run LinReg(a+bx) [STAT>CALC>option 8] with the explanatory variable as the first list, and the response variable as the second list
• The results are the slope and vertical intercept of the regression equation (more on that later) and values of r and r². (More on r²: check the next handout ;)
Predictive Potential
• Coefficient of Determination
– r²
– Amount of variance accounted for in y by x
– Percentage increase in accuracy you gain by using the regression
line to make predictions
– Without correlation, you can only guess the mean of y
– [Used with regression]
[Scale: 0% to 100% of variance accounted for]
Understanding r-squared activity
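One way to see "variance accounted for" is to compare the squared prediction errors from the regression line with the squared errors from always guessing the mean of y. The sketch below does this for the miles/minutes data from earlier slides; the variable names are illustrative assumptions.

```python
# Paired data from the slides: miles traveled (x) and minutes (y).
miles = [2, 5, 12, 7, 7, 15, 10]
minutes = [6, 9, 23, 18, 15, 28, 19]

n = len(miles)
x_bar, y_bar = sum(miles) / n, sum(minutes) / n

# Least squares slope and intercept.
b = (sum((x - x_bar) * (y - y_bar) for x, y in zip(miles, minutes))
     / sum((x - x_bar) ** 2 for x in miles))
a = y_bar - b * x_bar

# Squared error when guessing the mean of y for every point.
error_guessing_mean = sum((y - y_bar) ** 2 for y in minutes)
# Squared error when using the regression line instead.
error_using_line = sum((y - (a + b * x)) ** 2 for x, y in zip(miles, minutes))

r_squared = 1 - error_using_line / error_guessing_mean
print(f"{r_squared:.1%} of the variance in y is accounted for by x")
```

For this data set the result is about 95%, which agrees with squaring r = 0.9753643.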
Limitations of Correlation
• linearity:
– can’t describe (accurately) non-linear relationships
– e.g., flavor and % eaten, thickness and strength
• truncation of range:
– underestimate strength of relationship if you can’t see full range
of x value
• no proof of causation
– third variable problem: a 3rd variable could be causing change in both variables
– directionality: can’t be sure which way causality “flows”
• “We don’t get it” – what does it have to do with
that f#$%@! Line?
That is for another session…
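The linearity limitation listed above is easy to demonstrate. In the sketch below the data are made up for illustration (they are not from the slides): y depends perfectly on x through y = x², yet r comes out as zero because the relationship is not linear.

```python
import math

# Hypothetical data: a perfect but non-linear (parabolic) relationship.
xs = [-3, -2, -1, 0, 1, 2, 3]
ys = [x ** 2 for x in xs]

def correlation(xs, ys):
    """Pearson's r computed as SSxy / sqrt(SSx * SSy)."""
    n = len(xs)
    sum_x, sum_y = sum(xs), sum(ys)
    ss_xy = sum(x * y for x, y in zip(xs, ys)) - sum_x * sum_y / n
    ss_x = sum(x * x for x in xs) - sum_x ** 2 / n
    ss_y = sum(y * y for y in ys) - sum_y ** 2 / n
    return ss_xy / math.sqrt(ss_x * ss_y)

r = correlation(xs, ys)
print(r)  # 0.0, even though y is completely determined by x
```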
Regression
• Regression: Correlation + Prediction
– predicting y based on x
– e.g., predicting….
• throwing points (y)
• based on distance from target (x)
• Regression equation
– formula that specifies a line
– y’ = a + bx
– plug in an x value (distance from target) and predict y (points)
– note
• y = actual value of a score
• y’ = predicted value
•Data Handout
–Test takers, planets, darts
The Least-Square Regression
• Finds the best-fit line by minimizing the total area of the squares formed by the differences between the real data and the values predicted by the model.
AP Statistics, Section 3.3, Part 1
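The "minimize the areas" idea can be checked numerically. The following sketch fits the line to the miles/minutes data from earlier slides, then nudges the intercept and slope to a few hypothetical nearby values and shows that each nudged line has a larger total squared error. The function names and nudge sizes are assumptions chosen for illustration.

```python
# Paired data from the slides: miles traveled (x) and minutes (y).
miles = [2, 5, 12, 7, 7, 15, 10]
minutes = [6, 9, 23, 18, 15, 28, 19]

def sum_squared_errors(a, b, xs, ys):
    """Total area of the squares between each observed y and a + b*x."""
    return sum((y - (a + b * x)) ** 2 for x, y in zip(xs, ys))

def fit_line(xs, ys):
    """Least squares intercept and slope."""
    n = len(xs)
    x_bar, y_bar = sum(xs) / n, sum(ys) / n
    b = (sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
         / sum((x - x_bar) ** 2 for x in xs))
    return y_bar - b * x_bar, b

a_fit, b_fit = fit_line(miles, minutes)
best = sum_squared_errors(a_fit, b_fit, miles, minutes)

# Hypothetical nearby lines: change the intercept or the slope a little.
nudges = [(0.5, 0.0), (-0.5, 0.0), (0.0, 0.1), (0.0, -0.1)]
nudged_errors = [sum_squared_errors(a_fit + da, b_fit + db, miles, minutes)
                 for da, db in nudges]
for (da, db), err in zip(nudges, nudged_errors):
    print(f"a{da:+.1f}, b{db:+.1f}: {err:.2f} (least squares line: {best:.2f})")
```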
The Least-Square Regression
• Statisticians use a slightly different version of “slope-intercept” form:
$$ \hat{y} = a + bx $$
• Slope is the product of r value and std dev ratio:
$$ b = r \frac{s_y}{s_x} $$
• Y-intercept is the value found using the avg x and avg y:
$$ a = \bar{y} - b\bar{x} $$
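To see that the slope formula agrees with the SS-based slope found earlier, the sketch below computes b both ways for the miles/minutes data. It is a rough check rather than a prescribed method; variable names are illustrative.

```python
import math
import statistics

# Paired data from the slides: miles traveled (x) and minutes (y).
miles = [2, 5, 12, 7, 7, 15, 10]
minutes = [6, 9, 23, 18, 15, 28, 19]

n = len(miles)
sum_x, sum_y = sum(miles), sum(minutes)
ss_xy = sum(x * y for x, y in zip(miles, minutes)) - sum_x * sum_y / n
ss_x = sum(x * x for x in miles) - sum_x ** 2 / n
ss_y = sum(y * y for y in minutes) - sum_y ** 2 / n

r = ss_xy / math.sqrt(ss_x * ss_y)   # correlation coefficient
slope_from_ss = ss_xy / ss_x         # b = SSxy / SSx

# b = r * (s_y / s_x), using sample standard deviations.
s_x, s_y = statistics.stdev(miles), statistics.stdev(minutes)
slope_from_r = r * s_y / s_x

intercept = sum_y / n - slope_from_r * sum_x / n  # a = y-bar - b * x-bar
print(f"{slope_from_ss:.4f} {slope_from_r:.4f} {intercept:.4f}")
```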
Regression Graphic – Regression Line
[Scatterplot with regression line; x-axis: Distance from target (8 to 26), y-axis: 0 to 120; Rsq = 0.6031; annotations: “if x=18 then…”, “if x=24 then…”, y’=47, y’=20]
Predicting Model
• To put the regression line on the graph, use Statistics:EQ:RegEQ from the Vars menu to paste the equation into Y1.
• Then you can use Trace or Table or Y1 to find response values that correspond to particular explanatory values.
Regression Equation
• y’= a + bx
– See STAT – CALC – LinReg: a + bx
– y’ = predicted value of y
– b = slope of the line
– x = value of x that you plug in
– a = y-intercept (where line crosses y axis)
• In the dart throwing case….
– y’ = 125.401 - 4.263(x)
• So if the distance is 20 feet
– y’ = 125.401 - 4.263(20)
– y’ = 125.401 -85.26
– y’ = 40.141
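The same plug-in step can be written as a small function. This sketch uses the dart-throwing coefficients quoted above; the function and parameter names are hypothetical.

```python
def predict_points(distance_ft, a=125.401, b=-4.263):
    """Predicted throwing points y' = a + b*x for a distance x in feet."""
    return a + b * distance_ft

prediction = predict_points(20)
print(round(prediction, 3))  # 40.141
```

Predictions are only meaningful for distances inside the range of the original data; plugging in a very large distance would give negative points.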
Drawing a Regression Line by Hand
Four steps
1. Use the y-intercept (if possible; does it have meaning? interval vs. ratio scale)
2. Plot the average point (mean x, mean y)
3. Plug in a large value for x (so the point falls on the right end of the graph), then plot the resulting point
4. Connect the three points with a straight line!
Residuals
• It is important to note that the observed values almost never match the predicted values exactly
• The difference between the observed value and the predicted value has a special name: residual
Residual: $y - \hat{y}$
where y = observed value and $\hat{y}$ = predicted value
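As an illustration of the definition, the sketch below computes the residuals for the miles/minutes data from earlier slides using its least squares line. Variable names are assumptions made for this example.

```python
# Paired data from the slides: miles traveled (x) and minutes (y).
miles = [2, 5, 12, 7, 7, 15, 10]
minutes = [6, 9, 23, 18, 15, 28, 19]

n = len(miles)
x_bar, y_bar = sum(miles) / n, sum(minutes) / n
b = (sum((x - x_bar) * (y - y_bar) for x, y in zip(miles, minutes))
     / sum((x - x_bar) ** 2 for x in miles))
a = y_bar - b * x_bar

# Residual = observed value minus predicted value.
residuals = [y - (a + b * x) for x, y in zip(miles, minutes)]
for x, y, e in zip(miles, minutes, residuals):
    print(f"x={x:2d}  observed={y:2d}  predicted={a + b * x:6.2f}  residual={e:+.2f}")
```

For a least squares line the residuals always sum to zero (up to rounding), so positive and negative misses balance out.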
Residual Plots
• You can plot the residuals to see if there are any trends in the quality of the predictive model
• Try looking in the List menu for “RESID:”
Residual Plots
• This residual plot shows no tendencies. It is equally bad throughout.
• This suggests that the original relationship is linear.
“Pattern” =Not Linear
“Well Distributed”=Linear
Predictive Ability
• Mantra!!
– As variability decreases, prediction accuracy __________
– if we can account for variance, we can make better predictions
• As r increases (in absolute value):
– r² increases
• “variance accounted for” increases
• the prediction accuracy increases
– prediction error decreases (distance between y’ and y)
– Sy decreases
• the standard error of the residual/predictor
• measures overall amount of prediction error
– It can be thought of like this …
We like big r’s and we cannot lie!!!
You other brothers can’t deny!!!
Check out those residuals son
and plot em with your TI-84 on
Cause if they don’t look all scattered and patterned
then your least squared line is shattered
Then I only want that - if your scale and r squared is fat
So kick out those nasty outliers
When your correlation factor is on
BABY GOT STATS!
Thanks – Peace !