Coefficient of Determination

Two-Variable Statistics
Desired Outcomes
By the end of this unit, participants will . . .
• have a quick overview of Unit 8: Two-Variable Statistics.
• know the difference between categorical and quantitative
data.
• understand two-way tables and calculate marginal, joint,
and conditional probability.
• understand how to find best-fit lines, Median-Median lines,
and LSRLs.
• understand correlation coefficient and coefficient of
determination.
• know the definition of a residual and how to construct
residual plots and use them to evaluate the appropriateness
of a model.
• have a quick overview of Non-Linear Data.
• have ideas for the unit project (performance task).
Materials You Have
• Outline located on the Wiki – where you can
take notes and find electronic copies of the
handouts and other activities
• Copy of PowerPoint on Wiki
• Handouts – for your use now (yes, you can
write on them!!)
Unit 4 vs. Unit 8
• Please quickly look over the TENTATIVE outline
for Unit 8
• Our goal for today is to review/teach the
concepts and give you exposure to the
activities
• Your job is to ask questions as needed
Types of Data
• Qualitative – also called categorical
– Group characteristics
– Ex. What school do you work in?
• Quantitative
– Numerical
– Ex. What is your height?
Activity – Venn Diagram
Categorical Data: Two Way Tables
People leaving a soccer match were asked if they
supported Manchester United or Newcastle
United. They were also asked if they were happy.
The table below gives the results.
             Manchester United   Newcastle United
Happy               40                   8
Not Happy            2                  20
Categorical Data: Two-Way Tables
Marginal Distribution
• Counts vs. Percentages
• How many Manchester fans were surveyed?
• What is the probability that a randomly selected person is a
fan of Newcastle?
• What is the probability that a randomly selected person left
the game happy?
             Manchester United   Newcastle United   Total
Happy               40                   8            48
Not Happy            2                  20            22
Total               42                  28            70
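The marginal questions above can be checked with a short sketch. The counts come from the table; the variable names are illustrative choices, not part of the handout.

```python
# Two-way table from the slide: rows are mood, columns are team supported.
counts = {
    "Happy": {"Manchester United": 40, "Newcastle United": 8},
    "Not Happy": {"Manchester United": 2, "Newcastle United": 20},
}

# Marginal totals come from summing across a whole row or a whole column.
row_totals = {mood: sum(teams.values()) for mood, teams in counts.items()}
col_totals = {}
for teams in counts.values():
    for team, n in teams.items():
        col_totals[team] = col_totals.get(team, 0) + n
grand_total = sum(row_totals.values())

manchester_fans = col_totals["Manchester United"]
p_newcastle = col_totals["Newcastle United"] / grand_total
p_happy = row_totals["Happy"] / grand_total

print(manchester_fans)        # 42
print(round(p_newcastle, 3))  # 0.4
print(round(p_happy, 3))      # 0.686
```

Each marginal probability divides a row or column total by the table total of 70.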
Categorical Data: Two Way Tables
Joint Probability
• compound event: ______ AND ______, ______ OR ______
• percentages/probability based on the table total
• How many of those surveyed are happy Manchester United
fans?
• What percentage of those surveyed are Newcastle fans and
not happy?
• How likely is a person to be a Newcastle fan or Not Happy?
             Manchester United   Newcastle United   Total
Happy               40                   8            48
Not Happy            2                  20            22
Total               42                  28            70
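As a rough check on the joint questions above, the sketch below works directly from the four cell counts in the table. The variable names are made up for illustration.

```python
# Cell counts from the two-way table.
happy_man, happy_new = 40, 8
unhappy_man, unhappy_new = 2, 20
total = happy_man + happy_new + unhappy_man + unhappy_new  # 70

# "AND": a single cell divided by the table total.
p_newcastle_and_unhappy = unhappy_new / total

# "OR": add the two marginal totals, then subtract the overlap once
# so the Newcastle-and-not-happy cell is not counted twice.
newcastle_total = happy_new + unhappy_new    # 28
unhappy_total = unhappy_man + unhappy_new    # 22
p_newcastle_or_unhappy = (newcastle_total + unhappy_total - unhappy_new) / total

print(happy_man)                                # 40
print(round(100 * p_newcastle_and_unhappy, 1))  # 28.6
print(round(p_newcastle_or_unhappy, 3))         # 0.429
```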
Categorical Data: Two Way Tables
Conditional Probability
• How likely is one event to happen, given that another event
has happened?
• percentages/probability based on the row or column total of
the given event
• How likely is a person to be happy, given that they were a
Newcastle fan?
• If a person left the game happy, how likely is it that he/she is a
Manchester fan?
             Manchester United   Newcastle United   Total
Happy               40                   8            48
Not Happy            2                  20            22
Total               42                  28            70
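For the conditional questions above, the denominator changes from the table total to the total of the given event. A small sketch, with illustrative variable names:

```python
# Cell counts from the two-way table.
happy_man, happy_new = 40, 8
unhappy_man, unhappy_new = 2, 20

# Given "Newcastle fan": divide by the Newcastle column total (28).
newcastle_total = happy_new + unhappy_new
p_happy_given_newcastle = happy_new / newcastle_total

# Given "left happy": divide by the Happy row total (48).
happy_total = happy_man + happy_new
p_manchester_given_happy = happy_man / happy_total

print(round(p_happy_given_newcastle, 3))   # 0.286
print(round(p_manchester_given_happy, 3))  # 0.833
```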
Categorical Data: Two-Way Tables
• M&M’s sheets
• In your group, devise at least one count or
probability question for each type we discussed:
– Marginal
– Joint
– Conditional
Thirst Dilemma
• In your groups of four,
work through the Thirst
Dilemma activity.
• Be prepared to report
out on your answers!
• Group Roles: Survivor,
Measurer, Reader,
Recorder
Describing Bivariate
Relationships
Strength
•Visually – how closely do
the points fall to the line or
curve?
•Numerical measure – the
correlation coefficient
(applies only to linear
models)
0 ≤ |r| ≤ 1
0 - .5 weak
.5 - .75 moderate
.75 - 1 strong
Form
• Linear
• Nonlinear
- exponential
- quadratic
Direction
• Positive
• Negative
(for linear and
exponential models)
• Positive then negative,
or negative then positive
(for quadratic models)
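The strength cutoffs on this slide can be written as a small helper. This is only a sketch: the function name is hypothetical, it applies the cutoffs to the size of r (the sign gives the direction), and it assumes a value exactly on a boundary goes to the higher band, which the slide does not specify.

```python
def describe_strength(r):
    """Label the strength of a linear relationship using the slide's cutoffs."""
    if not -1 <= r <= 1:
        raise ValueError("r must be between -1 and 1")
    size = abs(r)  # strength ignores direction
    if size < 0.5:
        return "weak"
    if size < 0.75:  # assumption: 0.5 itself counts as moderate
        return "moderate"
    return "strong"  # assumption: 0.75 itself counts as strong


print(describe_strength(0.2))    # weak
print(describe_strength(0.6))    # moderate
print(describe_strength(-0.99))  # strong
```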
Describe the Thirst Dilemma Data
• Strength
• Form
• Direction
Thirst Dilemma
[Scatter plot: Height of Water (cm) vs. Number of Drinks]
There is a strong, negative, linear relationship between
the number of drinks and the height of the water. The more
drinks, the lower the height of the water left in the bottle.
Thirst Dilemma
a. x is the number of drinks of water (independent variable)
b. y is the height of the water in cm (dependent variable)
c. data table
Number of Drinks (x)   Height of Water in cm (y)
        0                      15
        1                      14.2
        2                      13.8
        3                      13.1
        4                      12.6
        5                      11
        6                      10.4
        7                      9.7
        8                      9.2
Thirst Dilemma
d. graph
[Scatter plot: Height of Water (cm) vs. Number of Drinks]
e. The more drinks, the lower the height of the water
left in the bottle. The height goes down about ½ to ¾
of a cm for each drink.
f. 2 hours → 8 drinks → approximately 9 cm
Thirst Dilemma
[Scatter plot with fitted line y = -0.76x + 15.151: Height of Water (cm) vs. Number of Drinks]
g. The height of the water in the bottle decreases by 0.76 cm
for each drink. We started with 15.151 cm of water in the
bottle.
h. 7 cm ÷ 0.76 cm/drink = 9.21 ≈ 9
It would take about 9 drinks to bring the level down by 7 cm.
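The fitted line on this slide can be reproduced from the data table in part c. The sketch below uses the standard least-squares formulas rather than a calculator; the function name is an illustrative choice.

```python
# Thirst Dilemma data from part c.
drinks = [0, 1, 2, 3, 4, 5, 6, 7, 8]
heights = [15, 14.2, 13.8, 13.1, 12.6, 11, 10.4, 9.7, 9.2]


def least_squares_line(xs, ys):
    """Return (slope, intercept) of the least squares regression line."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x  # forces the line through the means
    return slope, intercept


slope, intercept = least_squares_line(drinks, heights)
print(round(slope, 2), round(intercept, 3))  # -0.76 15.151

# Part h: number of drinks needed to lower the water by 7 cm.
drinks_for_7cm = 7 / abs(slope)
print(round(drinks_for_7cm, 2))  # 9.21
```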
Thirst Dilemma
i. y = -0.76x + 15.151
A linear function is the best fit because the
data follows a strong, negative, linear trend.
j. 0 = -0.76x + 15.151
-15.151 = -0.76x
19.94 = x
It would take about 20 drinks before the
water is all gone. If you take a drink every
15 min, that would be four drinks per hour.
So the water would last 20 ÷ 4 = 5 hours.
[Two plots of the line y = -0.76x + 15.151, Height of Water (cm) vs. Number of Drinks; the second extends the x-axis to 22 drinks]
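The "when is the water gone" calculation can be sketched in a few lines. The slope and intercept are the values from the slide; the variable names are illustrative.

```python
slope, intercept = -0.76, 15.151  # fitted line from the slide

# Solve 0 = slope * x + intercept for x.
drinks_until_empty = -intercept / slope
print(round(drinks_until_empty, 2))  # 19.94

# One drink every 15 minutes is four drinks per hour.
drinks_per_hour = 60 / 15
hours = round(drinks_until_empty) / drinks_per_hour
print(hours)  # 5.0
```

Note that 20 drinks lies well beyond the 0 to 8 drinks that were actually measured, so this answer is an extrapolation from the model.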
Thirst Dilemma
k. The height would decrease more quickly.
l. [Graph of y = -0.76x + 15.151 and y = -0.76x + 4: Height of Water (cm) vs. Number of Drinks]
Thirst Dilemma
m. Group 1 started out with more water, but took bigger
drinks. Group 2 started out with less water but took
smaller drinks.
n. Steep slope at first (drinking fast), and then a less
steep slope.
o. Assuming they were using the same size bottle, Sam
took bigger drinks because his slope shows that for
each drink, the height went down 2.4 cm. For Julie,
the height only went down .9 cm for each drink.
Extension
• What would the graphs look like with the
following bottle shapes? Would they still be
linear? Sketch the graph for each bottle.
Least Squares Regression Line (LSRL)
How does the calculator work its magic?
http://www.nctm.org/standards/content.aspx?id=26787
Least Squares Regression Line (LSRL)
residual = actual value – predicted value
It is the job of
the LSRL to
minimize the
squared errors.
Why do we
square the
residuals???
Properties of the Least Squares
Regression Line (LSRL)
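• Minimizes the sum of the squared vertical distances
between the data points and the line (the signed
distances sum to 0, so we have to square them)
• Goes through (mean x, mean y)
• Slope is related to correlation: slope = r × (s_y / s_x),
where s_x and s_y are the standard deviations of x and y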
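These properties can be checked numerically. The sketch below uses the Thirst Dilemma data and the standard relationship slope = r × (s_y / s_x); all variable names are illustrative.

```python
import math

# Thirst Dilemma data.
xs = [0, 1, 2, 3, 4, 5, 6, 7, 8]
ys = [15, 14.2, 13.8, 13.1, 12.6, 11, 10.4, 9.7, 9.2]
n = len(xs)

mean_x = sum(xs) / n
mean_y = sum(ys) / n
sxx = sum((x - mean_x) ** 2 for x in xs)
syy = sum((y - mean_y) ** 2 for y in ys)
sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))

slope = sxy / sxx
intercept = mean_y - slope * mean_x
residuals = [y - (slope * x + intercept) for x, y in zip(xs, ys)]

# The residuals cancel out, which is why they must be squared.
print(abs(sum(residuals)) < 1e-9)  # True

# The line passes through (mean x, mean y).
print(math.isclose(slope * mean_x + intercept, mean_y))  # True

# The slope equals r times the ratio of the standard deviations.
r = sxy / math.sqrt(sxx * syy)
s_x = math.sqrt(sxx / (n - 1))
s_y = math.sqrt(syy / (n - 1))
print(math.isclose(slope, r * s_y / s_x))  # True
```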
Correlation Coefficient (r)
r is the correlation coefficient
It describes the strength of the linear
relationship between two quantities
•for linear: x vs. y
•for exponential: x vs. log y
•for power: log x vs. log y
•no correlation coefficient for
quadratic, just $R^2$
Correlation
Correlation goes from -1 to 1, inclusive
•Closer to 1 or -1 is strong
•Closer to 0 is very weak
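As an illustration, here is a sketch that computes r for the Thirst Dilemma data with the usual formula (sum of cross-deviations divided by the square root of the product of the sums of squared deviations). The function name is an illustrative choice.

```python
import math

drinks = [0, 1, 2, 3, 4, 5, 6, 7, 8]
heights = [15, 14.2, 13.8, 13.1, 12.6, 11, 10.4, 9.7, 9.2]


def correlation(xs, ys):
    """Pearson correlation coefficient r for paired data."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    return sxy / math.sqrt(sxx * syy)


r = correlation(drinks, heights)
print(round(r, 3))  # -0.992
```

The value is close to -1, which matches the "strong, negative, linear" description of this data.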
Demos for Correlation and LSRL
Build a plot and see the correlation coefficient:
http://strader.cehd.tamu.edu/Mathematics/Statistics/LeastSquares/least_squares.html
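Exponential
$y = ab^x$
$\log y = \log(ab^x)$
$\log y = \log a + \log b^x$
$\log y = \log a + x \log b$
$\log y = (\log b)\,x + \log a$
$Y = Mx + B$, where $Y = \log y$, $M = \log b$, $B = \log a$. Data is linearized!
Power
$y = ax^b$
$\log y = \log(ax^b)$
$\log y = \log a + \log x^b$
$\log y = \log a + b \log x$
$\log y = b \log x + \log a$
$Y = MX + B$, where $Y = \log y$, $X = \log x$, $M = b$, $B = \log a$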
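To see the exponential linearization at work, the sketch below fits a line to (x, log y) and converts the slope and intercept back to b and a. The data is made up to follow y = 3 · 2^x exactly, and the helper name is hypothetical.

```python
import math

# Made-up data that follows y = 3 * 2**x exactly (an assumption for illustration).
xs = [0, 1, 2, 3, 4, 5]
ys = [3 * 2 ** x for x in xs]

# Linearize: replace y with log y.
log_ys = [math.log10(y) for y in ys]


def fit_line(us, vs):
    """Least squares slope and intercept for the points (u, v)."""
    n = len(us)
    mean_u = sum(us) / n
    mean_v = sum(vs) / n
    slope = sum((u - mean_u) * (v - mean_v) for u, v in zip(us, vs)) / sum(
        (u - mean_u) ** 2 for u in us
    )
    return slope, mean_v - slope * mean_u


M, B = fit_line(xs, log_ys)

# Undo the logs: M = log b and B = log a.
b = 10 ** M
a = 10 ** B
print(round(a, 6), round(b, 6))  # 3.0 2.0
```

For a power model the same idea applies, except that x is also replaced by log x before fitting.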
Anscombe Data Sets
From Wikipedia
So the moral of the story is . . .
GRAPH THE DATA FIRST!!!
Coefficient of Determination
$r^2$
The coefficient of determination describes the
proportion of variation in y that can be explained
by the linear relationship with x. It tells us how
much error in prediction can be “explained” by
the relationship with x.
http://hadm.sph.sc.edu/courses/J716/demos/leastsquares/leastsquaresdemo.html
Coefficient of Determination
If there is no relationship between x and y, the best
predictor for y is the average of the y-values,
represented by the horizontal line $y = \bar{y}$.
Coefficient of Determination
Residuals are
represented by the
vertical distance
between a data point
and the line.
Coefficient of Determination
SSM = sum of the squared
errors about the mean
Coefficient of Determination
If there is a relationship between x and y, the average
y-value is no longer the best predictor.
Coefficient of Determination
SSE = sum of the
squared errors
Coefficient of Determination
$r^2 = \frac{SSM - SSE}{SSM} = \frac{2051 - 162}{2051}$
$r^2 \approx 0.92$
SSM is the total error we
started with, SSE is the
error still left after we fit
the LSRL to the data;
SSM-SSE is the amount
of error that was taken
away
So the coefficient of determination measures
the proportion of variation in y that can be
explained by the linear relationship with x.
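The same calculation can be sketched for the Thirst Dilemma data: compute SSM about the mean, SSE about the LSRL, and take the proportion of error removed. The variable names are illustrative.

```python
drinks = [0, 1, 2, 3, 4, 5, 6, 7, 8]
heights = [15, 14.2, 13.8, 13.1, 12.6, 11, 10.4, 9.7, 9.2]
n = len(drinks)

mean_x = sum(drinks) / n
mean_y = sum(heights) / n

# Least squares line for the data.
sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(drinks, heights))
sxx = sum((x - mean_x) ** 2 for x in drinks)
slope = sxy / sxx
intercept = mean_y - slope * mean_x

# SSM: squared errors when the mean of y is used as the predictor.
ssm = sum((y - mean_y) ** 2 for y in heights)

# SSE: squared errors left over after fitting the LSRL.
sse = sum((y - (slope * x + intercept)) ** 2 for x, y in zip(drinks, heights))

r_squared = (ssm - sse) / ssm
print(round(ssm, 2), round(sse, 2), round(r_squared, 3))  # 35.23 0.57 0.984

# The slide's own numbers give the same kind of ratio.
print(round((2051 - 162) / 2051, 2))  # 0.92
```

About 98% of the variation in water height is explained by the linear relationship with the number of drinks.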
Coefficient of Determination
Note that $r = \pm\sqrt{r^2}$, so the size of the correlation coefficient
is equal to the square root of the coefficient of
determination, and the sign of r matches the sign of the slope.
This is mathematically true, but the meanings of
the two quantities are very different!
Quadratic Functions
• Fitting a quadratic function to a set of data
uses a different process. So, there is no r-value.
Only $R^2$ (note the capital!) is reported.
• $R^2$ has a similar meaning to $r^2$ and can be
interpreted in the same way.
Interpreting Constants/Coefficients
Linear
$y = mx + b$
slope: the constant rate of
change of y in relation
to x. For every one unit
increase in x, there are
m units of increase (or
decrease) in y.
y-intercept: the initial value
or starting amount
Making Predictions
Predictions are reliable if you have a good model:
• Model fits the graph of the data
• Correlation coefficient is close to 1 (or -1)
• Residual plot shows no pattern
Predictions
Predictions are not reliable if:
• The correlation is weak (low r-value)
• The residual plot shows a pattern
• You are making a prediction outside the
domain of the data (e.g. using data from 1950 to
1990 to predict what is going to happen in
2015)
Linear Data Set Activity
This activity is a sheet
that will be used for
several concepts. There
are many sets of data
out there that you can
use for more practice.
PART ONE: Graph the
data and describe.
Best-Fit Lines
In Unit 4, we start with the “eye-ball” line of
best fit.
• What are the flaws with this method?
• What do you do with an outlier?
• How do you know whose line is the best
model?
PART THREE: Median-Median Line
When it comes to outliers, which measure is
best to use?
MEAN
MEDIAN
Just like the median, the median-median line is
not sensitive to outliers.
Finding the Median-Median Line
1. Arrange the data so that the x-values are in ascending order.
2. Divide the data into three groups. If the number of data values does not
divide evenly, then divide so that the 1st and 3rd groups contain the same
number of data values and the middle group contains only one more or one
less value.
3. On the plot of the data, use vertical lines to divide the groups visually.
4. Look at the first group. Determine the median x-value and the median
y-value and write them as an ordered pair. This is the summary point for the
first group. Call it S1.
5. Plot S1 using a plus sign or a square instead of a dot, in order to distinguish
it from the other points.
6. Repeat Steps 4 and 5 for the 2nd and 3rd groups of data points to find points
S2 and S3.
7. Draw a line (lightly) through S1 and S3. Find the equation of this line.
8. Calculate the distance between the line and point S2.
9. Now adjust the line connecting points S1 and S3 by sliding it one-third of
this distance towards point S2 while keeping the same slope (the resulting
line should be parallel to the first line).
10. Write the equation of this new line by adjusting the value of the y-intercept
by the one-third amount. If you are sliding up, add the 1/3 amount to the
y-intercept; if you are sliding down, subtract. This is the equation of the
Median-Median Line!
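The steps above can be sketched in code. This is a minimal version, assuming the "distance" in step 8 is the vertical distance from the line to S2 and that ties in x at a group boundary are not a concern; the function name is hypothetical. The Thirst Dilemma data is used as the example.

```python
from statistics import median


def median_median_line(points):
    """Return (slope, intercept) of the median-median line for (x, y) pairs."""
    pts = sorted(points)  # step 1: ascending x
    n = len(pts)
    if n < 3:
        raise ValueError("need at least three points")

    # Step 2: outer groups are equal; the middle has one more or one less.
    base, extra = divmod(n, 3)
    outer = base + (1 if extra == 2 else 0)
    groups = [pts[:outer], pts[outer:n - outer], pts[n - outer:]]

    # Steps 4-6: summary point = (median x, median y) of each group.
    summary = [
        (median([x for x, _ in g]), median([y for _, y in g])) for g in groups
    ]
    (x1, y1), (x2, y2), (x3, y3) = summary

    # Step 7: line through S1 and S3.
    slope = (y3 - y1) / (x3 - x1)
    intercept_13 = y1 - slope * x1

    # Steps 8-10: slide one-third of the signed vertical gap toward S2.
    gap = y2 - (slope * x2 + intercept_13)
    return slope, intercept_13 + gap / 3


drinks = [0, 1, 2, 3, 4, 5, 6, 7, 8]
heights = [15, 14.2, 13.8, 13.1, 12.6, 11, 10.4, 9.7, 9.2]
mm_slope, mm_intercept = median_median_line(list(zip(drinks, heights)))
print(round(mm_slope, 3), round(mm_intercept, 3))  # -0.75 15.167
```

For this data the summary points are S1 = (1, 14.2), S2 = (4, 12.6), and S3 = (7, 9.7), and the result is close to the LSRL y = -0.76x + 15.151.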
PART FOUR: Towards Finding the
Least Squares Regression Line
Find the point $(\bar{x}, \bar{y})$ and graph it.
Find a best-fit line that goes through this
point. Try to make the distances between the
line and the data points as small as possible.
PART FIVE: Compare
• Using your graphing calculator, find the LSRL.
• Of the three lines you found, which one best
matches the data? Explain
Residuals and Residual Plots
Residual: the difference between the actual y-value
and the predicted y-value for a given x-value
$\text{residual} = y - \hat{y}$
Residual Plot: the plot of the x-values vs. their
residuals
Visually speaking, a
residual is a measure of the
vertical distance between a
data point and the model.
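A short sketch of the residual calculation, using the Thirst Dilemma data and the fitted line y = -0.76x + 15.151 from the earlier slides. The variable names are illustrative; plotting (x, residual) pairs would give the residual plot.

```python
drinks = [0, 1, 2, 3, 4, 5, 6, 7, 8]
heights = [15, 14.2, 13.8, 13.1, 12.6, 11, 10.4, 9.7, 9.2]
slope, intercept = -0.76, 15.151  # LSRL from the Thirst Dilemma slides

# residual = actual y - predicted y, rounded for display.
residuals = [round(y - (slope * x + intercept), 3) for x, y in zip(drinks, heights)]
print(residuals)

# A quick look at the balance between positive and negative residuals.
positives = sum(1 for e in residuals if e > 0)
negatives = sum(1 for e in residuals if e < 0)
print(positives, negatives)  # 4 5

# The largest residual in size, to compare with the size of the data.
largest = max(residuals, key=abs)
print(largest)  # 0.489
```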
Residual Plot
• Why do you need to make a residual plot?
– To evaluate the goodness of fit of the model
• What does a good residual plot look like?
– Points are scattered, as if there is no correlation
– There is a balance between positive and negative
residuals
– The values of the residuals are small compared to
the size of the data
Examples
We want there to be no pattern and for the
residuals to be small.
Examples
If there is a pattern, it indicates that the model is
not good. This “curve” tells us that a nonlinear
function is a better choice.
Example
This “Cone” shape tells us that the errors in
prediction are getting larger as x gets larger.
Examples of Residual Plots
PART SEVEN: Make a Residual Plot
Finding the Best Model: Linear,
Exponential or Quadratic?
• Exponential
• Quadratic
Exponential
$y = ab^x$
Growth (base > 1)
$y = a(1 + r)^x$, where a is the initial value or amount and r is the growth rate
Decay (base < 1)
$y = a(1 - r)^x$, where a is the initial value or amount and r is the decay rate
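A quick sketch of the two forms, with made-up numbers: an initial amount of 100, a 5% growth rate, and a 50% decay rate. The function names are hypothetical.

```python
def growth(a, r, x):
    """y = a(1 + r)^x: initial amount a growing at rate r per step."""
    return a * (1 + r) ** x


def decay(a, r, x):
    """y = a(1 - r)^x: initial amount a shrinking at rate r per step."""
    return a * (1 - r) ** x


# Hypothetical values chosen only for illustration.
print(round(growth(100, 0.05, 3), 2))  # 115.76
print(decay(100, 0.5, 3))              # 12.5
```

In both cases x = 0 returns the initial amount a, which is why a is read as the starting value.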
Quadratic
Vertex: min or max
$\left(-\frac{b}{2a},\ f\left(-\frac{b}{2a}\right)\right)$
y-intercept: (0, c)
starting value or amount
e.g. initial height of a projectile
x-intercept(s):
e.g. time when object hits the ground
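The sketch below finds these features for a quadratic written as y = ax² + bx + c. The projectile coefficients are made up for illustration, the function names are hypothetical, and the x-intercepts use the quadratic formula.

```python
import math


def vertex(a, b, c):
    """Vertex of y = ax^2 + bx + c as (x, y)."""
    x = -b / (2 * a)
    return x, a * x * x + b * x + c


def x_intercepts(a, b, c):
    """Real x-intercepts of y = ax^2 + bx + c, smallest first."""
    disc = b * b - 4 * a * c
    if disc < 0:
        return []  # the graph never crosses the x-axis
    root = math.sqrt(disc)
    return sorted({(-b - root) / (2 * a), (-b + root) / (2 * a)})


# Hypothetical projectile: height = -5t^2 + 20t + 25.
a, b, c = -5, 20, 25
print(vertex(a, b, c))        # (2.0, 45.0)
print((0, c))                 # (0, 25)
print(x_intercepts(a, b, c))  # [-1.0, 5.0]
```

In this example the maximum height of 45 occurs at t = 2, the starting height is 25, and the positive intercept t = 5 is the time when the object hits the ground.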
Performance Task:
Two-Variable Statistics
1. On a piece of chart paper, make a table and a graph of your
data. Describe the strength, form, and direction of the
relationship.
2. Answer the following discussion questions:
A. What type of function does your data model?
B. What is the algebraic equation that best models your data?
C. What is the meaning (in context) of each constant and
coefficient in your equation?
D. Find the correlation coefficient and make a residual plot.
How good is your model?
E. Answer any questions from your assigned activity.
F. What are some tips or suggestions for using this activity in
the classroom?
Data Collection
Investigations
Note: These are a mix of different types of
functions.
Overhead Projector
Sitting in class, you have noticed that the image
projected onto a screen from an overhead projector
gets larger as the overhead projector is moved farther
away from the screen.
Question: Is the relationship between the distance an overhead projector is
from a screen and the height of the image projected on the screen linear or
curved?
Equipment: Overhead projector, transparency with an image in focus, meter
stick or ruler to measure the height of the image, tape measures to measure
distance from the projector to the screen.
Data Collection: Place the overhead as close to the screen as possible with the
image in focus. Measure the distance from the screen to a fixed point on the
projector. Also measure the height of the image on screen. Move the
overhead projector slightly away from the screen, focus the image, and take
both measurements again. Repeat this process to collect at least 10 data
points.
Analysis: Make a scatter plot of (distance, image height). Describe the
relationship.
Pennies
If you take a jar containing a collection of 100 pennies and
empty it onto a table, how many pennies would you expect
to land heads? If you remove the pennies that show
heads, return the remaining pennies to the jar, shake it up
and empty the jar again, how many do you expect to land
heads? What happens in the long run?
Question: What is the relationship between the number of times you have
emptied the jar and the number of pennies that remain after you remove
those that show heads?
Equipment: One hundred pennies, jar.
Data Collection: Take a jar containing a collection of 100 pennies, shake the jar
to mix the pennies, and empty it onto a table. Remove the pennies that are
showing heads and record the number of pennies remaining. Return the
remaining pennies to the jar, shake it well, and empty it again. Continue
this process until no pennies remain.
Analysis: Make a scatter plot of (# times you empty the jar, # pennies remaining).
Describe the relationship.
Bouncing Ball
If you drop a ball from the ceiling of your math classroom, it
will bounce higher than if you drop it from desk level.
What is the relationship between the height of the drop
and the height of the bounce?
Question: How is the height from which a ball is dropped related to the height
of its first bounce?
Equipment: Bouncing ball, tape measure, tape.
Data Collection: Tape or hang the tape measure on a wall. Measure the height
from which you plan to drop the ball. Drop the ball and measure the height
of the first bounce. Error can be minimized by having two or three students
sight the rebound height and averaging their results. Repeat this process
until you collect at least 10 data points.
Analysis: Make a scatter plot of (drop height, rebound height). Describe the
relationship. How high will your ball bounce if it is dropped from a height of
3 meters?
Other Questions to Consider: Do all balls bounce in the same way? You can try
this investigation with different types of balls and make a comparison.
Circles
You have learned the relationship between the
diameter of a circle and its circumference.
Can you use data from circular objects to
confirm this result?
Question: How is the circumference of a circle related to its diameter?
Equipment: Empty cans or jar lids, tape measure.
Data Collection: Measure the diameter and circumference of empty
cans, jar lids, or other circular objects until you have collected at
least 10 data points.
Analysis: Make a scatter plot of (diameter, circumference). Describe the
relationship. Is this the relationship that you expected? Use your
model to find the circumference of a can with a diameter of 3
centimeters, and compare it to the known result.
Index Card (part 1)
If you are sitting in the second row of a movie theater and
someone sits directly in front of you, your view is probably
not obstructed. However, if you are sitting towards the
back and the same thing happens, it will be significantly
harder to see this movie screen especially if you are not
very tall. The following experiment investigates this issue
by using a tape measure in place of the movie screen and
an index card in place of the head of the person who is
blocking your view.
Question: How does the distance you are away from the wall affect the length of
the tape measure that is obscured by an index card?
Equipment: Index card, tape measure, tape.
Data Collection: Attach a tape measure horizontally to a wall. Have a student
close one eye, and hold an index card at arm's length. Record the student's
distance from the wall and the length of the section of the tape measure that
is obscured by the card. Have the student take one small step back (about
12 in or 30 cm), close one eye, and again record the distance from the wall
and the length of the tape measure that is obscured. Repeat this process
until you collect at least 10 data points.
Index Card (part 2)
Analysis: Make a scatter plot of (distance from wall, length of tape measure
obscured). Describe the relationship.
Other Questions to Consider: How does the size of the card affect the data and
therefore the scatter plot? (Experiment by simply rotating the card 90°.
Compare results.) How does the length of the person’s arm affect the data
and therefore the scatter plot? (Experiment by having a different person hold
the index card. Compare results.)
Pendulum
If you swing a long pendulum, it takes more time to
complete one swing than if you swing a short
pendulum. This experiment allows you to investigate
the relationship between the length of a pendulum
and its period.
Question: How does the period of a pendulum depend on its length?
Equipment: Pendulum (constructed by tying a small nut or several washers
onto a string at least two meters long), meter stick, stopwatch.
Data Collection: Vary the length of the string by about 20 cm from one trial to
the next, and measure how the period of the pendulum (time to complete
one swing across and back) changes. To measure the period, students
should hold one end of the string stable, pull the weight slightly (about 20°)
to one side, let the weight make 10 complete swings, record the time, and
then divide by 10. Collect at least 10 data points. Be sure to include some
long lengths as well as short lengths.
Analysis: Make a scatter plot of (length, period). Describe the
relationship.
Road Map
Road maps provide a legend for computing straight-line distances
as well as mileage between points along roads shown on the
map. How do these distances compare in your state?
Question: How does the straight-line distance between two cities relate to the shortest
travel distance between the cities?
Equipment: State road map, ruler.
Data Collection: Use the ruler to measure the straight-line distance between two cities
and convert this distance to miles using the legend on the map. Compute the travel
distance by adding the distances along the shortest route between the two cities.
Repeat this process to collect at least 10 data points. Be sure to include a variety of
distances.
Analysis: Make a scatter plot of (straight-line distance, travel distance). Describe the
relationship. How many miles would you expect to travel between two cities that are
exactly six inches apart on the map?
Other Questions to Consider: How do you think the scatter plot for Nevada would
compare to the scatter plot of West Virginia? Is the relationship observed in your
scatter plot the same for all states?