The Weather Turbulence

Algebra 1 Summer Institute 2014
The Weather Turbulence
Summary
Participants will work and reason with bivariate data that show a linear relationship.
They will determine how strong the relationship is by calculating a correlation
coefficient. They will also calculate the coefficient of determination and interpret its
result based on the context of the problem. Finally, they will do the calculations to
determine the equation of the regression line.
Goals
• Distinguish between scatter plots that display a relationship that can be reasonably
modeled by a linear equation and those that should be modeled by a nonlinear equation.
• Use an equation given as a model for a nonlinear relationship to answer questions based
on an understanding of the specific equation and the context of the data.
• Determine the least-squares regression line from a given set of data using technology.
• Calculate and interpret the correlation coefficient and the coefficient of determination.
Participant Handouts
1. The Weather Turbulence
2. Excel file: The Weather Turbulence
Materials: Paper, Colored Pencils
Technology: LCD Projector, Facilitator Laptop, Excel, GeoGebra
Source: Engageny.org, Stattrek.com
Estimated Time: 120 minutes
Mathematics Standards
Common Core State Standards for Mathematics
MAFS.7.SP.2: Draw informal comparative inferences about two populations.
2.3: Informally assess the degree of visual overlap of two numerical data distributions
with similar variabilities, measuring the difference between the centers by
expressing it as a multiple of a measure of variability. For example, the mean
height of players on a basketball team is 10 cm greater than the mean height of
players on the soccer team, about twice the variability (mean absolute deviation)
on either team; on a dot plot, the separation between the two distributions of
height is noticeable.
2.4: Use measures of center and measures of variability for numerical data from
random samples to draw informal comparative inferences about two populations.
For example, decide whether the words in a chapter of a seventh-grade science
book are generally longer than the words in a fourth-grade science book.
MAFS.8.SP.1: Investigate patterns of association in bivariate data
1.1: Construct and interpret scatter plots for bivariate measurement data to investigate
patterns of association between two quantities. Describe patterns such as
clustering, outliers, positive or negative association, linear association, and
nonlinear association.
Standards for Mathematical Practice
1. Make sense of problems and persevere in solving them
2. Reason abstractly and quantitatively
3. Construct viable arguments and critique the reasoning of others
4. Model with mathematics
5. Use tools appropriately
Instructional Plan
Briefly introduce the data in the table below. Explain how plotting the ordered pairs of
data creates a scatter plot.
Example:
The National Climate Data Center collects data on weather conditions at various
locations. They classify each day as clear, partly cloudy, or cloudy. Using data
taken over a number of years, they provide data on the following variables:
(Slide 2)
𝐱 = elevation above sea level (in feet)
𝐲 = mean number of clear days per year
𝐰= mean number of partly cloudy days per year
𝐳 = mean number of cloudy days per year
Could a city’s elevation above sea level be used to predict the number of clear,
partly cloudy, or cloudy days per year a city experiences? After observing a
scatter plot of the data, linear models (or the least-squares linear model obtained
from a calculator or computer software) can provide a reasonable description of
the relationship between these two variables. The linear model will be evaluated
by considering how close the data points are to the corresponding graph of the
line. The equation of the linear model will be used to answer the statistical
question. We will mostly concentrate on the association between elevation and
the number of clear days.
The table below shows data for 14 U.S. cities.
City | 𝐱 = Elevation Above Sea Level (ft.) | 𝐲 = Mean Number of Clear Days per Year | 𝐰 = Mean Number of Partly Cloudy Days per Year | 𝐳 = Mean Number of Cloudy Days per Year
Albany, NY | 275 | 69 | 111 | 185
Albuquerque, NM | 5,311 | 167 | 111 | 87
Anchorage, AK | 114 | 40 | 60 | 265
Boise, ID | 2,838 | 120 | 90 | 155
Boston, MA | 15 | 98 | 103 | 164
Helena, MT | 3,828 | 82 | 104 | 179
Lander, WY | 5,557 | 114 | 122 | 129
Milwaukee, WI | 672 | 90 | 100 | 175
New Orleans, LA | 4 | 101 | 118 | 146
Raleigh, NC | 434 | 111 | 106 | 149
Rapid City, SD | 3,162 | 111 | 115 | 139
Salt Lake City, UT | 4,221 | 125 | 101 | 139
Spokane, WA | 2,356 | 86 | 88 | 191
Tampa, FL | 19 | 101 | 143 | 121
Data Source:
http://www.ncdc.noaa.gov/oa/climate/online/ccd/cldy.html
1. Let participants work in groups of two. Then discuss and confirm as a class.
Create a scatter plot in Excel or GeoGebra of the data on elevation and mean
number of clear days. (Slide 3)
2. Do you see a pattern in the scatter plot, or does it look like the data
points are scattered?
The scatter plot does not have a strong pattern. Participants may respond
that it looks like the data points are randomly scattered. If they look
carefully, however, there is a pattern that suggests as elevation increases,
the number of clear days also appears to increase. Motivate the
discussion by looking at various data points, with several at lower
elevations, and several others at higher elevations to indicate the possible
relationship.
3. How would you describe the relationship between elevation and mean
number of clear days for these 14 cities? That is, does the mean number
of clear days tend to increase as elevation increases, or does the mean
number of clear days tend to decrease as elevation increases?
As the elevation increases, the number of clear days generally increases.
4. Do you think that a straight line would be a good way to describe the
relationship between the mean number of clear days and elevation? Why
do you think this?
Although the pattern is not strong, a straight line would describe the
general pattern that was observed in the discussion of the first two
questions.
We have noticed that the pattern is not very strong. How strong or weak is it?
We will look at a number called the correlation coefficient. The version used when we
suspect a linear association between two variables is the Pearson product-moment
correlation coefficient, which measures the strength of that linear relationship.
Generally, the correlation coefficient of a sample is denoted by r, and the correlation
coefficient of a population is denoted by ρ or R. (Slide 4)
The sign and the absolute value of a correlation coefficient describe the direction and the
magnitude of the relationship between two variables.
• The value of a correlation coefficient ranges between -1 and 1.
• The greater the absolute value of a correlation coefficient, the stronger the linear
relationship.
• The strongest linear relationship is indicated by a correlation coefficient of -1 or 1.
• The weakest linear relationship is indicated by a correlation coefficient equal to 0.
• A positive correlation means that if one variable gets bigger, the other variable
tends to get bigger.
• A negative correlation means that if one variable gets bigger, the other variable
tends to get smaller.
Keep in mind that the Pearson product-moment correlation coefficient only measures
linear relationships. Therefore, a correlation of 0 does not mean zero relationship between
two variables; rather, it means zero linear relationship. (It is possible for two variables to
have zero linear relationship and a strong curvilinear relationship at the same time.)
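The point about curvilinear relationships can be checked numerically. The sketch below uses a small hypothetical data set (not taken from the weather table) in which y is completely determined by x through y = x², yet the Pearson coefficient comes out as zero. The helper name `pearson_r` is an illustrative choice.

```python
import math

def pearson_r(xs, ys):
    """Pearson correlation computed from deviation scores."""
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    dx = [x - mean_x for x in xs]          # deviation scores for x
    dy = [y - mean_y for y in ys]          # deviation scores for y
    sum_xy = sum(a * b for a, b in zip(dx, dy))
    return sum_xy / math.sqrt(sum(a * a for a in dx) * sum(b * b for b in dy))

# Hypothetical data: a perfect curved relationship, symmetric around x = 0.
xs = [-3, -2, -1, 0, 1, 2, 3]
ys = [x ** 2 for x in xs]

r_curved = pearson_r(xs, ys)
print(r_curved)  # no linear association, even though y depends entirely on x
```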
The scatterplots below show how different patterns of data produce different degrees of
correlation. (Slide 5)
[Figure: six scatterplots, in this order]
• Maximum positive correlation (r = 1.0)
• Strong positive correlation (r = 0.80)
• Zero correlation (r = 0)
• Maximum negative correlation (r = -1.0)
• Moderate negative correlation (r = -0.43)
• Strong correlation & outlier (r = 0.71)
Several points are evident from the scatterplots.
• When the slope of the line in the plot is negative, the correlation is negative; and
vice versa.
• The strongest correlations (r = 1.0 and r = -1.0) occur when data points fall
exactly on a straight line.
• The correlation becomes weaker as the data points become more scattered.
• If the data points fall in a random pattern, the correlation is equal to zero.
• Correlation is affected by outliers. Compare the first scatterplot with the last
scatterplot. The single outlier in the last plot greatly reduces the correlation (from
1.00 to 0.71).
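The effect of an outlier can be illustrated with a short sketch. The data here are hypothetical: six points exactly on the line y = 2x, then the same points plus one extra point far below the line. They are not the points behind the slide, so the result will not match the 0.71 quoted above. The sketch only shows the direction of the effect.

```python
import math

def pearson_r(xs, ys):
    """Pearson correlation computed from deviation scores."""
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    dx = [x - mean_x for x in xs]
    dy = [y - mean_y for y in ys]
    sum_xy = sum(a * b for a, b in zip(dx, dy))
    return sum_xy / math.sqrt(sum(a * a for a in dx) * sum(b * b for b in dy))

# Hypothetical data: six points that lie exactly on y = 2x.
xs = [1, 2, 3, 4, 5, 6]
ys = [2, 4, 6, 8, 10, 12]
r_line = pearson_r(xs, ys)

# The same points plus one hypothetical outlier at (7, 2).
r_with_outlier = pearson_r(xs + [7], ys + [2])

print(round(r_line, 2), round(r_with_outlier, 2))
```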
How to Calculate a Correlation Coefficient
The formula is based on deviation scores: the difference between a raw score and the
mean score. For example, the deviation score for x is:
$x_i = X_i - \bar{X}$
Where:
• $x_i$ is the deviation score for observation i
• $X_i$ is the raw score for observation i
• $\bar{X}$ is the mean of all raw scores
The most common formula for computing a product-moment correlation coefficient (r)
between two variables is: (Slide 6)
$r = \dfrac{\sum x_i y_i}{\sqrt{\left(\sum x_i^2\right)\left(\sum y_i^2\right)}}$
Where:
• Σ is the summation symbol,
• $x_i = X_i - \bar{X}$ is the deviation score for x, $X_i$ is the raw score for
observation i, and $\bar{X}$ is the mean x value,
• $y_i = Y_i - \bar{Y}$ is the deviation score for y, $Y_i$ is the raw score for
observation i, and $\bar{Y}$ is the mean y value.
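As a quick check of how the formula works, here is a worked example on a tiny hypothetical data set. The three observations are made up for illustration and are not taken from the weather table.

```latex
\begin{align*}
X &= (1, 2, 3), & \bar{X} &= 2, & x_i &= (-1, 0, 1) \\
Y &= (1, 3, 2), & \bar{Y} &= 2, & y_i &= (-1, 1, 0) \\
\sum x_i y_i &= (-1)(-1) + (0)(1) + (1)(0) = 1 \\
\sum x_i^2 &= 1 + 0 + 1 = 2, & \sum y_i^2 &= 1 + 1 + 0 = 2 \\
r &= \frac{1}{\sqrt{2 \cdot 2}} = 0.5
\end{align*}
```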
5. Using Excel, let’s calculate the correlation coefficient r. (The Excel file “The
Weather Turbulence” shows all calculations. The second page shows the formulas
used.)
In this case, we get r = 0.605648914, a positive linear relationship that is not very strong.
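For facilitators who want to cross-check the spreadsheet, the same calculation can be sketched in Python. The two lists copy the elevation and clear-day columns from the table of 14 cities, and the variable names are illustrative choices. This is one possible way to do the computation, not a reproduction of the Excel file.

```python
import math

# Elevation (ft.) and mean number of clear days per year, in table order
# (Albany, Albuquerque, Anchorage, ..., Spokane, Tampa).
elevation = [275, 5311, 114, 2838, 15, 3828, 5557, 672, 4, 434, 3162, 4221, 2356, 19]
clear_days = [69, 167, 40, 120, 98, 82, 114, 90, 101, 111, 111, 125, 86, 101]

mean_x = sum(elevation) / len(elevation)
mean_y = sum(clear_days) / len(clear_days)

# Deviation scores, as defined in the formula above.
dx = [x - mean_x for x in elevation]
dy = [y - mean_y for y in clear_days]

sum_xy = sum(a * b for a, b in zip(dx, dy))
sum_x2 = sum(a * a for a in dx)
sum_y2 = sum(b * b for b in dy)

r = sum_xy / math.sqrt(sum_x2 * sum_y2)
print(round(r, 4))
```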
The Coefficient of Determination
The coefficient of determination (denoted by $R^2$) is a key output of regression analysis.
It is interpreted as the proportion of the variance in the dependent variable that is
predictable from the independent variable. (Slide 7)
• The coefficient of determination ranges from 0 to 1.
• An $R^2$ of 0 means that the dependent variable cannot be predicted from the
independent variable.
• An $R^2$ of 1 means the dependent variable can be predicted without error from the
independent variable.
• An $R^2$ between 0 and 1 indicates the extent to which the dependent variable is
predictable. An $R^2$ of 0.10 means that 10 percent of the variance in Y is
predictable from X; an $R^2$ of 0.20 means that 20 percent is predictable; and so on.
If you know the linear correlation (r) between two variables, then the coefficient of
determination ($R^2$) is easily computed using the following formula: $R^2 = r^2$.
6. Compute $R^2$ for our example. What does the coefficient of determination tell
us in the context of the problem?
Since we know r = 0.605648914, then $r^2$ = 0.366810607. This means that $R^2$ =
0.366810607.
This means that about 37% of the variability of the number of clear days in a year
can be explained by the elevation of the city.
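The "proportion of variability explained" reading can be made concrete with a sketch that computes the coefficient of determination two ways for the 14 cities: by squaring r, and as 1 minus the ratio of unexplained variation to total variation around a fitted line. The second route uses the least-squares slope and intercept formulas from the regression section of this handout. Variable names are illustrative.

```python
elevation = [275, 5311, 114, 2838, 15, 3828, 5557, 672, 4, 434, 3162, 4221, 2356, 19]
clear_days = [69, 167, 40, 120, 98, 82, 114, 90, 101, 111, 111, 125, 86, 101]

n = len(elevation)
mean_x = sum(elevation) / n
mean_y = sum(clear_days) / n

sxx = sum((x - mean_x) ** 2 for x in elevation)
syy = sum((y - mean_y) ** 2 for y in clear_days)   # total variation in clear days
sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(elevation, clear_days))

# Route 1: square the correlation coefficient.
r_squared = sxy ** 2 / (sxx * syy)

# Route 2: 1 - (variation left over after fitting the line) / (total variation).
b1 = sxy / sxx
b0 = mean_y - b1 * mean_x
unexplained = sum((y - (b0 + b1 * x)) ** 2 for x, y in zip(elevation, clear_days))
r_squared_from_residuals = 1 - unexplained / syy

print(round(r_squared, 4), round(r_squared_from_residuals, 4))
```

Both routes give the same number, which is why about 37% of the variability in clear days is described as "explained" by elevation.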
In the last activity, we created some lines and experimented with residuals to determine
which line was a better fit. In this activity we will figure out mathematically how to come
up with the equation of the least-squares regression line.
The Least Squares Regression Line
Linear regression finds the straight line, called the least squares regression line or
LSRL, that best represents observations in a bivariate data set. Suppose Y is a dependent
variable, and X is an independent variable. The population regression line is:
$Y = \beta_0 + \beta_1 X$
Where $\beta_0$ is a constant, $\beta_1$ is the regression coefficient, X is the value of the
independent variable, and Y is the value of the dependent variable.
Given a random sample of observations, the population regression line is estimated by:
(Slide 8)
$\hat{y} = b_0 + b_1 x$
Where $b_0$ is a constant, $b_1$ is the regression coefficient, x is the value of the
independent variable, and $\hat{y}$ is the predicted value of the dependent variable.
Normally, you will use a computational tool - a software package (e.g., Excel) or a
graphing calculator - to find b0 and b1. You enter the X and Y values into your program or
calculator, and the tool solves for each parameter.
In the unlikely event that you find yourself on a desert island without a computer or a
graphing calculator, you can solve for b0 and b1 "by hand". Here are the equations. (Slide
9)
$b_1 = \dfrac{\sum (x_i - \bar{x})(y_i - \bar{y})}{\sum (x_i - \bar{x})^2}$
$b_1 = r \cdot \dfrac{s_y}{s_x}$
$b_0 = \bar{y} - b_1 \bar{x}$
Where:
• $b_0$ is the constant in the regression equation,
• $b_1$ is the regression coefficient,
• r is the correlation between x and y,
• $x_i$ is the X value of observation i,
• $y_i$ is the Y value of observation i,
• $\bar{x}$ is the mean of X,
• $\bar{y}$ is the mean of Y,
• $s_x$ is the standard deviation of X, and
• $s_y$ is the standard deviation of Y
7. Optional: Using Excel do the computations to find the equation of the regression
line.
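The same computation can be sketched in Python as a cross-check on the spreadsheet. The sketch evaluates the slope with both formulas given above, using sample standard deviations (dividing by n - 1) for the second one, and then the intercept. Variable names are illustrative, and the data are the elevation and clear-day columns from the table.

```python
import math

elevation = [275, 5311, 114, 2838, 15, 3828, 5557, 672, 4, 434, 3162, 4221, 2356, 19]
clear_days = [69, 167, 40, 120, 98, 82, 114, 90, 101, 111, 111, 125, 86, 101]

n = len(elevation)
mean_x = sum(elevation) / n
mean_y = sum(clear_days) / n

sxx = sum((x - mean_x) ** 2 for x in elevation)
syy = sum((y - mean_y) ** 2 for y in clear_days)
sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(elevation, clear_days))

# Slope from the deviation-score formula.
b1 = sxy / sxx

# Slope from r and the sample standard deviations; should agree with b1.
r = sxy / math.sqrt(sxx * syy)
s_x = math.sqrt(sxx / (n - 1))
s_y = math.sqrt(syy / (n - 1))
b1_from_r = r * (s_y / s_x)

# Intercept: the line must pass through (mean_x, mean_y).
b0 = mean_y - b1 * mean_x

print(f"y-hat = {b0:.2f} + {b1:.5f} x")
```

With these data the intercept comes out near 83.6 clear days and the slope near 0.0085 clear days per foot of elevation.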
Properties of the Regression Line
When the regression parameters ($b_0$ and $b_1$) are defined as described above, the regression
line has the following properties.
• The line minimizes the sum of squared differences between observed values (the y
values) and predicted values (the $\hat{y}$ values computed from the regression
equation).
• The regression line passes through the mean of the X values ($\bar{x}$) and through the
mean of the Y values ($\bar{y}$); that is, through the point ($\bar{x}$, $\bar{y}$).
• The regression constant ($b_0$) is equal to the y intercept of the regression line.
• The regression coefficient ($b_1$) is the average change in the dependent variable (Y)
for a 1-unit change in the independent variable (X). It is the slope of the regression
line.
The least squares regression line is the only straight line that has all of these properties.
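Two of these properties can be spot-checked numerically. The sketch below fits the line to the elevation and clear-day data, confirms that it passes through the point of means, and compares its sum of squared differences with two arbitrarily perturbed lines. The perturbations (raising the intercept by 1, steepening the slope by 10%) are illustrative choices. A spot check like this is not a proof of the minimizing property.

```python
elevation = [275, 5311, 114, 2838, 15, 3828, 5557, 672, 4, 434, 3162, 4221, 2356, 19]
clear_days = [69, 167, 40, 120, 98, 82, 114, 90, 101, 111, 111, 125, 86, 101]

n = len(elevation)
mean_x = sum(elevation) / n
mean_y = sum(clear_days) / n
sxx = sum((x - mean_x) ** 2 for x in elevation)
sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(elevation, clear_days))
b1 = sxy / sxx
b0 = mean_y - b1 * mean_x

def sum_squared_differences(intercept, slope):
    """Sum of squared gaps between observed y and the line's predicted y."""
    return sum((y - (intercept + slope * x)) ** 2
               for x, y in zip(elevation, clear_days))

# Property: the line passes through (mean_x, mean_y).
gap_at_mean = (b0 + b1 * mean_x) - mean_y

# Property: other lines do worse on the sum of squared differences.
best = sum_squared_differences(b0, b1)
shifted = sum_squared_differences(b0 + 1, b1)    # same slope, intercept raised by 1
tilted = sum_squared_differences(b0, b1 * 1.1)   # same intercept, slope 10% steeper

print(abs(gap_at_mean) < 1e-9, best < shifted, best < tilted)
```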
8. Construct a scatter plot that displays the data for 𝐱 = elevation above sea
level (in feet) and 𝐰 = mean number of partly cloudy days per year.
(Slide 10)
Based on the scatter plot you constructed, is there a relationship between
elevation and the mean number of partly cloudy days per year? If so, how
would you describe the relationship? Explain your reasoning.
There appears to be a relationship. As the elevation increases, the
number of partly cloudy days tends to decrease from approximately
0 to 3000 feet above sea level. Then at approximately 3000 feet
above sea level, as the elevation increases, the number of partly
cloudy days also appears to increase. This pattern suggests a
quadratic model. Some cities, however, don’t follow this pattern.
(Students should discuss the overall pattern.)
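Because the Pearson coefficient measures only linear association, it can be instructive to compute it for elevation and partly cloudy days as well. The sketch below does this with the two corresponding table columns. Variable names are illustrative. A value of r near zero here would not rule out the curved pattern described above; it would only indicate that a straight line describes these data poorly.

```python
import math

elevation = [275, 5311, 114, 2838, 15, 3828, 5557, 672, 4, 434, 3162, 4221, 2356, 19]
partly_cloudy_days = [111, 111, 60, 90, 103, 104, 122, 100, 118, 106, 115, 101, 88, 143]

n = len(elevation)
mean_x = sum(elevation) / n
mean_w = sum(partly_cloudy_days) / n

# Deviation scores for elevation (x) and partly cloudy days (w).
dx = [x - mean_x for x in elevation]
dw = [w - mean_w for w in partly_cloudy_days]

r_partly_cloudy = (sum(a * b for a, b in zip(dx, dw))
                   / math.sqrt(sum(a * a for a in dx) * sum(b * b for b in dw)))

print(round(r_partly_cloudy, 2))  # small in absolute value: weak linear association
```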