Relating two variables: Scatterplots

Chapter 4: Scatterplots and Correlation
Slide
1
Section 4.1 Scatter Diagrams and Correlation
• Scatterplots
• Linear Correlation Coefficient
Slide
2
Association between two variables:
Size of diamond and price of ring
The source of the data is a full page advertisement placed in the Straits
Times newspaper issue of February 29, 1992, by a Singapore-based
retailer of diamond jewelry.
The variables are the size of the diamond in carats (1 carat = 0.2 gram) and
the price of ladies’ rings (single diamond stone) in Singapore dollars.
Carats    Singapore dollars
0.20      495
0.16      328
0.17      350
0.19      385
0.25      642
...       ...
How would you describe the association between the two variables?
Slide
3
SCATTERPLOT: Diamond rings data
N = 48
Variable           Average    s.d.      Min     Max
X Carat            0.20       0.056     0.12    0.35
Y Price in US $    865.144    213.64    385     1879
[Figure: Diamond carats vs Price in US$. x-axis: Carat; y-axis: Price in US dollars]
Slide
4
Terminology
Response variable: measures the outcome of the study
(Dependent variable)
Explanatory variable: explains or causes changes in the response variable
(Independent variable)
Example:
Carat=Explanatory variable
Price=Response variable
Slide
5
Slide
6
EXAMPLE
Interpreting a Scatter Diagram
The data shown to the right are based
on a study for drilling rock. The
researchers wanted to determine
whether the time it takes to dry drill a
distance of 5 feet in rock increases
with the depth at which the drilling
begins.
• Depth, x, is the explanatory
variable,
• Time, y, (in minutes) to drill five feet
is the response variable.
Draw a scatter diagram of the data.
Source: Penner, R., and Watts, D.G. “Mining Information.” The
American Statistician, Vol. 45, No. 1, Feb. 1991, p. 6.
Slide
7
Slide
8
Interpreting scatter plots
1. Look for the overall pattern and for striking deviations.
2. Define the form, direction and strength of the relationship:
a. Form: roughly linear if the points follow a straight line, or nonlinear…
b. Direction: positive or negative?
c. Strength: how closely the points follow a clear form
3. Check for the presence of outliers: individual values that fall outside
the overall pattern.
4. Two variables are positively (negatively) associated if an increase in
one variable corresponds to an increase (decrease) in the other
variable.
Slide
9
Various Types of Relations in a Scatter Diagram
Slide
10
Example: 2000 Presidential Elections
Did the butterfly ballots confuse
voters? Did voters for Al Gore
instead cast their votes for other
candidates?
Bush spokesman Ari Fleischer
stated on Nov. 9, 2000 that "Palm
Beach County is a Pat Buchanan
stronghold and that's why Pat
Buchanan received 3,407 votes
there."
What is the level of support that Pat Buchanan enjoys in Palm Beach
County?
The published election results show the association between the vote totals for
Pat Buchanan and the total population for Florida counties.
Slide
11
• Is the association positive or negative?
• Is the form of the relationship almost linear?
• Is an outlier present?
Slide
12
Another example: The statistics of poverty and inequality
Data from U.N.E.S.C.O. 1990 Demographic Year Book.
For 97 countries in the world, data are given for birth rates and for
an index of the Gross National Product.
Slide
13
Note:
More information can be added into a graph by putting the
categorical variable ON the scatter plot, either
• as a label of the points, or
• as a symbol instead of the points themselves, or
• by the use of color (different color for different category) as in
the previous graph.
Slide
14
Linearization using Mathematical Transformations:
The plot before shows a non-linear association!
Sometimes we can make it linear, by using some transformations on
the variables. Possible transformations are, for example, “ln”, “exp”,
“sqrt”. Here we consider the natural log of GNP.
Birth rate vs Log G.N.P.
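The effect of a log transformation can be sketched in a few lines of Python. The numbers below are made up for illustration (they are not the U.N.E.S.C.O. data), and the helper `correlation` is our own name for the usual sample correlation, not something from the slides. The y values are built to fall by a fixed amount whenever x is multiplied by a constant, so the raw plot is curved but the plot against ln(x) is a straight line.

```python
import math

def correlation(xs, ys):
    """Sample correlation: average product of standardized values, divisor n - 1."""
    n = len(xs)
    x_bar = sum(xs) / n
    y_bar = sum(ys) / n
    s_x = math.sqrt(sum((x - x_bar) ** 2 for x in xs) / (n - 1))
    s_y = math.sqrt(sum((y - y_bar) ** 2 for y in ys) / (n - 1))
    products = (((x - x_bar) / s_x) * ((y - y_bar) / s_y) for x, y in zip(xs, ys))
    return sum(products) / (n - 1)

# Hypothetical values, chosen only to show the idea
gnp = [100, 300, 1000, 3000, 10000, 30000]
birth_rate = [50 - 8 * math.log(g / 100) for g in gnp]

r_raw = correlation(gnp, birth_rate)                         # curved relationship
r_log = correlation([math.log(g) for g in gnp], birth_rate)  # straight line after ln

print("r using GNP:    ", round(r_raw, 2))
print("r using ln(GNP):", round(r_log, 2))
```

With these invented numbers the association is much closer to a straight line after taking logs; real data will be noisier.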
Slide
15
Measure of Linear Association
If there is a strong linear association between the variables, then the
cloud of points on the scatter plot will be close to a line.
[Figure: scatterplot of birth rate (per 1,000 population) vs. log G.N.P.]
Slide
16
The Correlation Coefficient r
The correlation coefficient r measures the direction and the strength of
the linear relationship between two variables.
• It is a value between –1 and 1
• The closer r is to 1 or –1, the stronger the linear association is.
• Positive values of r imply a positive association, negative values imply a
negative association
• Values of r close to 0 imply weak linear association.
• Sample r is defined as:
$$ r = \frac{1}{n-1} \sum_{i=1}^{n} \left( \frac{x_i - \bar{x}}{s_x} \right) \left( \frac{y_i - \bar{y}}{s_y} \right) $$
where the X data have average $\bar{x}$ and standard deviation $s_x$, and the Y
data have average $\bar{y}$ and standard deviation $s_y$.
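The definition of r can be turned into a short Python sketch. The function name `correlation` is our own choice; the data are the five diamond rows listed earlier, so the result will not match a value computed from all 48 rings.

```python
import math

def correlation(xs, ys):
    """Sample r: sum of products of standardized values, divided by n - 1."""
    n = len(xs)
    if n != len(ys) or n < 2:
        raise ValueError("need two lists of equal length with at least 2 values")
    x_bar = sum(xs) / n
    y_bar = sum(ys) / n
    # Sample standard deviations (divisor n - 1)
    s_x = math.sqrt(sum((x - x_bar) ** 2 for x in xs) / (n - 1))
    s_y = math.sqrt(sum((y - y_bar) ** 2 for y in ys) / (n - 1))
    total = sum(((x - x_bar) / s_x) * ((y - y_bar) / s_y) for x, y in zip(xs, ys))
    return total / (n - 1)

# First five rows of the diamond ring data (carats, Singapore dollars)
carats = [0.20, 0.16, 0.17, 0.19, 0.25]
prices = [495, 328, 350, 385, 642]

r_diamonds = correlation(carats, prices)
print(round(r_diamonds, 3))
```

Each term in the sum is positive when x and y are on the same side of their averages and negative otherwise, which is why r picks up the direction of the association.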
Slide
17
EXAMPLE
Determining the Linear Correlation Coefficient
Determine the
linear correlation
coefficient of the
drilling data.
Slide
18
[Table columns: (xi - 126.25)/sx, (yi - 6.9858)/sy, product]
Slide
19
$$ r = \frac{\sum \left( \frac{x_i - \bar{x}}{s_x} \right) \left( \frac{y_i - \bar{y}}{s_y} \right)}{n - 1} = \frac{8.501037}{12 - 1} = 0.773 $$
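The last step of the drilling example is plain arithmetic and can be checked quickly. The variable names below are ours; the two numbers come from the slide.

```python
sum_of_products = 8.501037  # total of the "product" column for the drilling data
n = 12                      # number of observations

r = sum_of_products / (n - 1)
print(round(r, 3))  # 0.773
```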
Slide
20
Properties of r
• The correlation coefficient r varies between –1 and 1. If r = 0, there is
no linear association between X and Y. If r = 1 or –1, the
points in a scatter plot lie exactly on a straight line.
• Positive r indicates positive association between X and Y.
Negative r indicates negative association between X and Y.
• Both variables X and Y must be quantitative.
• The correlation between X and Y is the same as the correlation between Y
and X.
• r does not change if we change the units of measurement for X and Y.
• The correlation measures only the linear relationship between two
variables.
• r can be strongly affected by the presence of outliers.
Slide
21
Example of correlation
[Figure: scatterplot of birth rate (per 1,000 population) vs. log G.N.P.]
Negative association: r = -0.74
Slide
22
Diamond rings data
N = 48
Variable           Average    s.d.      Min     Max
X Carat            0.20       0.056     0.12    0.35
Y Price in US $    865.144    213.64    385     1879
[Figure: Diamond carats vs Price in US$. x-axis: Carat; y-axis: Price in US dollars]
Strong positive association: r = 0.989
Slide
23
Positive Correlation
In each plot there are 100 points. The correlation coefficient measures
the amount of clustering around a line.
If r is close to 1, then points lie close to a straight line!!
Slide
24
Negative Correlation
Negative correlation: as x increases, y tends to decrease.
If r is close to – 1, then points lie close to a straight line!!
Slide
25
Match the correlation with the plot!
Match the diagrams with the following correlations:
–0.93, –0.75, –0.20, 0.27, 0.63, 1.0
Slide
26
Change of scale
These are the low and high temperatures in Boulder (CO) for the month of
April 1996. The first scatter plot uses degrees Fahrenheit and the second
plot uses degrees Celsius. Note that °C = 5/9 × (°F – 32).
First plot (°F): r = 0.74
Second plot (°C): r = ?
Are the correlations between low and high temperatures in the two
graphs different?
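One way to explore the question is to compute r in both units. The temperatures below are made up for illustration (they are not the Boulder data), and `correlation` and `to_celsius` are our own helper names.

```python
import math

def correlation(xs, ys):
    """Sample correlation with divisor n - 1."""
    n = len(xs)
    x_bar = sum(xs) / n
    y_bar = sum(ys) / n
    s_x = math.sqrt(sum((x - x_bar) ** 2 for x in xs) / (n - 1))
    s_y = math.sqrt(sum((y - y_bar) ** 2 for y in ys) / (n - 1))
    total = sum(((x - x_bar) / s_x) * ((y - y_bar) / s_y) for x, y in zip(xs, ys))
    return total / (n - 1)

def to_celsius(f):
    """Convert degrees Fahrenheit to degrees Celsius."""
    return 5 / 9 * (f - 32)

# Hypothetical daily lows and highs in degrees Fahrenheit
low_f = [28, 31, 35, 30, 40, 44, 38]
high_f = [50, 58, 63, 52, 70, 75, 61]

low_c = [to_celsius(t) for t in low_f]
high_c = [to_celsius(t) for t in high_f]

r_f = correlation(low_f, high_f)
r_c = correlation(low_c, high_c)
print(round(r_f, 4), round(r_c, 4))  # the two values agree
```

Standardizing subtracts the average and divides by the standard deviation, so the shift by 32 and the factor 5/9 both cancel out.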
Slide
27
Different correlations?
In which diagram below is the correlation coefficient the largest? The smallest?
Slide
28
Outliers and nonlinear association
How are the data sets different?
Slide
29
Plot the data: the nature of the association between x and y is very
different in each data set. The correlation coefficient can be misleading in
the presence of outliers or non-linear association. Always check the scatter
plot of the data.
For each of these data sets, r = 0.82.
• Perfect association! Why is r not equal to 1?
• Outliers change the value of r. What would the value of r be without
the outliers?
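Both effects can be reproduced with tiny invented data sets (these are not the data behind the plots). In the sketch below, `correlation` is our own helper. The first data set lies exactly on a curve, the second lies exactly on a line, and the third is the same line with one outlier added.

```python
import math

def correlation(xs, ys):
    """Sample correlation with divisor n - 1."""
    n = len(xs)
    x_bar = sum(xs) / n
    y_bar = sum(ys) / n
    s_x = math.sqrt(sum((x - x_bar) ** 2 for x in xs) / (n - 1))
    s_y = math.sqrt(sum((y - y_bar) ** 2 for y in ys) / (n - 1))
    total = sum(((x - x_bar) / s_x) * ((y - y_bar) / s_y) for x, y in zip(xs, ys))
    return total / (n - 1)

# Perfect but non-linear association: y = x squared
curve_x = [-3, -2, -1, 0, 1, 2, 3]
curve_y = [x * x for x in curve_x]
r_curve = correlation(curve_x, curve_y)

# Perfect linear association: y = 2x
line_x = [1, 2, 3, 4, 5, 6]
line_y = [2, 4, 6, 8, 10, 12]
r_line = correlation(line_x, line_y)

# Same line with a single outlier at (7, 0)
r_outlier = correlation(line_x + [7], line_y + [0])

print(round(r_curve, 3), round(r_line, 3), round(r_outlier, 3))
```

In this example y is completely determined by x on the curve, yet r is essentially 0 because r only measures linear association. One extra point is enough to pull r from 1 down to 0.25.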
Slide
30
Which of the following diagrams should be summarized
by r?
(1)
(2)
(3)
Slide
31
Correlation does not mean Causation!!
Slide
32
Example
Ice cream sales and crime rates have a very high
correlation.
Does this mean that local governments should shut
down all ice cream shops?
Ans: There is another variable: temperature! As air
temperatures rise, both ice cream sales and crime
rates rise. Here, temperature is a lurking variable.
Two variables can be related through a lurking
variable even though there is no causal relation.
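A toy illustration of a lurking variable, using invented numbers: both ice cream sales and crime counts are built from temperature plus a small fixed disturbance, and neither is computed from the other. The helper `correlation` and all values are assumptions made for this sketch.

```python
import math

def correlation(xs, ys):
    """Sample correlation with divisor n - 1."""
    n = len(xs)
    x_bar = sum(xs) / n
    y_bar = sum(ys) / n
    s_x = math.sqrt(sum((x - x_bar) ** 2 for x in xs) / (n - 1))
    s_y = math.sqrt(sum((y - y_bar) ** 2 for y in ys) / (n - 1))
    total = sum(((x - x_bar) / s_x) * ((y - y_bar) / s_y) for x, y in zip(xs, ys))
    return total / (n - 1)

# Hypothetical temperatures (the lurking variable)
temperature = [10, 15, 20, 25, 30, 35]

# Each series depends on temperature only, plus a small made-up disturbance
ice_cream_sales = [2 * t + e for t, e in zip(temperature, [1, -1, 2, -2, 1, -1])]
crime_counts = [3 * t + e for t, e in zip(temperature, [-2, 2, -1, 1, 2, -2])]

r_sales_crime = correlation(ice_cream_sales, crime_counts)
print(round(r_sales_crime, 3))
```

The correlation between sales and crime comes out very high even though the only link between them in this construction is temperature.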
Slide
33
SCATTERPLOT and CORRELATION using Excel
Slide
34
To graph a Scatterplot
– (Highlight the two data columns)
– Use the Chart Wizard
– Choose: XY(Scatter)
– Follow the dialog window steps
appropriately (label axes etc.)
Slide
35
Computing the Correlation coefficient
• The correlation coefficient is computed using the Correlation function
in the Data Analysis ToolPak.
Click on TOOLS > DATA ANALYSIS > Correlation
• Or you can use the function:
= CORREL(data range X, data range Y)
Example:
If the X values are in B2:B25 and the Y values are in C2:C25, the correlation
between the X data and Y data is obtained as follows:
= CORREL(B2:B25, C2:C25)
Slide
36
SCATTERPLOT and CORRELATION using TI-83
Slide
37
Create the two Lists
To input data into the STAT list editor:
• Enter STAT edit mode by pressing [STAT] [1].
• Enter the data in the L1 and L2 lists, pressing [ENTER] after each entry.
• Press [2nd] [MODE] to QUIT and return to the home screen.
Example:
L1: {7,2,4,2,5}
L2: {8,4,6,2,7}
Slide
38
Graph the ScatterPlot
• Press [2nd] [Y=] to access the STAT PLOT editor.
• Press [ENTER] to edit Plot1.
• Press [ENTER] to turn ON Plot1.
• Scroll down and highlight the scatter plot graph type (first option in the first
row). Press [ENTER] to select the scatter plot graph type.
• Scroll down and make sure Xlist: is set to L1 and Ylist: is set to L2. To
input L1, press [2nd] [1]. To input L2, press [2nd] [2].
• Press [GRAPH] to display the scatter plot.
You may have to change the [WINDOW] settings to view your graph.
Slide
39
Get the Correlation Coefficient r
• Turn on diagnostics with the [DiagnosticOn] command:
– [2nd] [0] gets [CATALOG]
– Scroll down to DiagnosticOn and press [ENTER] twice.
• Press [STAT], then [►] to reach the CALC menu.
• Scroll down to 4: LinReg(ax+b)
• Press [ENTER] twice. The output includes r.
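To cross-check the calculator, the same r can be computed by hand or with a few lines of Python. The sketch below uses the example lists L1 and L2 from the earlier slide; the function name `correlation` is our own.

```python
import math

def correlation(xs, ys):
    """Sample correlation with divisor n - 1."""
    n = len(xs)
    x_bar = sum(xs) / n
    y_bar = sum(ys) / n
    s_x = math.sqrt(sum((x - x_bar) ** 2 for x in xs) / (n - 1))
    s_y = math.sqrt(sum((y - y_bar) ** 2 for y in ys) / (n - 1))
    total = sum(((x - x_bar) / s_x) * ((y - y_bar) / s_y) for x, y in zip(xs, ys))
    return total / (n - 1)

L1 = [7, 2, 4, 2, 5]
L2 = [8, 4, 6, 2, 7]

r = correlation(L1, L2)
print(round(r, 3))  # 0.93
```

If the calculator shows a different r for these lists, check that the data were entered into the right lists.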
Slide
40