Scatterplots and Correlation

Exploring Relationships Between Variables
Chapter 7: Scatterplots and Correlation
Chapter 7 Objectives

Scatterplots
  Explanatory and response variables
  Interpreting scatterplots
  Outliers
  Categorical variables in scatterplots

Correlation
  The correlation coefficient “r”
  r does not distinguish x and y
  r has no units
  r ranges from -1 to +1
  Influential points

Basic Terminology

Univariate data: 1 variable is measured on
each sample unit or population unit
e.g. height of each student in a sample

Bivariate data: 2 variables are measured on
each sample unit or population unit
e.g. height and GPA of each student in a
sample; (caution: data from 2 separate
samples is not bivariate data)
Same goals with bivariate data that we
had with univariate data

Graphical displays and numerical summaries

Seek overall patterns and deviations from those
patterns

Descriptive measures of specific aspects of the
data
Here, we have two quantitative
variables for each of 16
students.
1) How many beers they
drank, and
2) Their blood alcohol level
(BAC)
We are interested in the
relationship between the two
variables: How is one affected
by changes in the other one?
Student   Beers   Blood Alcohol
1         5       0.1
2         2       0.03
3         9       0.19
4         7       0.095
5         3       0.07
6         3       0.02
7         4       0.07
8         5       0.085
9         8       0.12
10        3       0.04
11        5       0.06
12        5       0.05
13        6       0.1
14        7       0.09
15        1       0.01
16        4       0.05
Scatterplots

Useful method to graphically describe the relationship between
2 quantitative variables
Scatterplot: Blood Alcohol Content vs Number of Beers
In a scatterplot, one axis is used to represent each of the variables,
and the data are plotted as points on the graph.
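To make the idea of "one axis per variable, one point per student" concrete, here is a minimal Python sketch that draws a rough character-grid scatterplot of the beers and blood alcohol data using only the standard library. The helper name `text_scatterplot`, the grid size, and the way blood alcohol values are rounded into rows are all assumptions made for illustration; in practice a plotting tool would be used instead.

```python
# Beers and blood alcohol content (BAC) for the 16 students in the data table.
beers = [5, 2, 9, 7, 3, 3, 4, 5, 8, 3, 5, 5, 6, 7, 1, 4]
bac = [0.10, 0.03, 0.19, 0.095, 0.07, 0.02, 0.07, 0.085,
       0.12, 0.04, 0.06, 0.05, 0.10, 0.09, 0.01, 0.05]

def text_scatterplot(xs, ys, x_max, y_max, rows=10):
    """Return a rough character-grid scatterplot as a list of strings.

    Each point (x, y) becomes a '*' in column x and in the row nearest
    to y on a scale from 0 to y_max. Points that land in the same cell
    are drawn once. Illustrative helper, not a standard function.
    """
    grid = [[" "] * (x_max + 1) for _ in range(rows + 1)]
    for x, y in zip(xs, ys):
        row = round(y / y_max * rows)  # 0 = bottom row, rows = top row
        grid[row][x] = "*"
    # Reverse so the largest y values come first, like a graph.
    return ["".join(line) for line in reversed(grid)]

plot = text_scatterplot(beers, bac, x_max=10, y_max=0.20)
for line in plot:
    print("|" + line)          # y axis: blood alcohol, 0.20 at the top
print("+" + "-" * 11)
print(" 0    5    10")         # x axis: number of beers
```

The stars drift upward from left to right, which is the positive, roughly linear pattern the slides go on to describe.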
Focus on Three Features of a Scatterplot
Look for an overall pattern regarding …
1. Shape: approximately linear, curved, up-and-down?
2. Direction: positive, negative, none?
3. Strength: are the points tightly clustered in the particular shape, or are they spread out?
… and deviations from the overall pattern: outliers
[Figure: Blood Alcohol as a function of Number of Beers. x axis: Number of Beers (0 to 10); y axis: Blood Alcohol Level (mg/ml), 0.00 to 0.20.]
Scatterplot: Fuel Consumption vs Car Weight
x = car weight (1000 lbs), y = fuel consumption (gal/100 miles)

(xi, yi): (3.4, 5.5) (3.8, 5.9) (4.1, 6.5) (2.2, 3.3) (2.6, 3.6) (2.9, 4.6) (2, 2.9) (2.7, 3.6) (1.9, 3.1) (3.4, 4.9)
[Figure: FUEL CONSUMPTION vs CAR WEIGHT. x axis: WEIGHT (1000 lbs), 1.5 to 4.5; y axis: FUEL CONSUMP. (gal/100 miles), 2 to 7.]
Explanatory and response variables
A response variable measures or records an outcome of a study. An
explanatory variable explains changes in the response variable.
Typically, the explanatory or independent variable is plotted on the x
axis, and the response or dependent variable is plotted on the y axis.
[Figure: Blood Alcohol as a function of Number of Beers. x axis: Number of Beers, the explanatory (independent) variable; y axis: Blood Alcohol Level (mg/ml), the response (dependent) variable.]
SAT Score vs Proportion of Seniors Taking SAT 2005
[Figure: 2005 Average SAT Score. x axis: Percent of Seniors Taking SAT (0% to 100%); y axis: 2005 SAT Total (950 to 1250). Labeled points: IW, IL, NC (74%, 1010), DC.]
Chapter 7 Objectives
Correlation

The correlation coefficient “r”

r does not distinguish x and y

r has no units

r ranges from -1 to +1

Influential points
The correlation coefficient "r"
The correlation coefficient is a measure of the direction and strength of
the linear relationship between 2 quantitative variables. It is calculated
using the mean and the standard deviation of both the x and y variables.
Correlation can only be used to
describe quantitative variables.
Categorical variables don’t have
means and standard deviations.
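For reference, the description above corresponds to the usual formula for the sample correlation, written here in LaTeX. The symbols are the standard ones: n is the number of (x, y) pairs, \bar{x} and \bar{y} are the sample means, and s_x and s_y are the sample standard deviations.

```latex
r = \frac{1}{n-1} \sum_{i=1}^{n}
    \left( \frac{x_i - \bar{x}}{s_x} \right)
    \left( \frac{y_i - \bar{y}}{s_y} \right)
```

Each factor in the sum is a standardized value, which is why the calculation needs a mean and a standard deviation for both variables.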
Correlation: Fuel Consumption vs Car Weight
r = 0.9766
[Figure: FUEL CONSUMPTION vs CAR WEIGHT. x axis: WEIGHT (1000 lbs), 1.5 to 4.5; y axis: FUEL CONSUMP. (gal/100 miles), 2 to 7.]
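Example: calculating correlation

(x1, y1), (x2, y2), (x3, y3) = (1, 3), (1.5, 6), (2.5, 8)

$$\bar{x} \approx 1.67, \quad \bar{y} \approx 5.67, \quad s_x \approx 0.76, \quad s_y \approx 2.52$$

$$r = \frac{(1 - 1.67)(3 - 5.67) + (1.5 - 1.67)(6 - 5.67) + (2.5 - 1.67)(8 - 5.67)}{(3 - 1)(0.76)(2.52)} \approx 0.9538$$

(0.9538 is the value obtained with unrounded means and standard deviations; plugging in the rounded values shown gives about 0.957.)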
Automate calculation of the correlation!
(Excel, statcrunch, calculator, etc.)
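As one way to automate the calculation, here is a minimal Python sketch using only the standard library. The function name `correlation_r` is an illustrative choice, not something from the slides; it applies the same formula as the worked example, dividing the sum of products of deviations by (n - 1) times the two sample standard deviations.

```python
from statistics import mean, stdev

def correlation_r(xs, ys):
    """Sample correlation: sum of (x - xbar)(y - ybar) over (n - 1) * sx * sy."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two equal-length lists with at least 2 points")
    x_bar, y_bar = mean(xs), mean(ys)
    s_x, s_y = stdev(xs), stdev(ys)  # sample standard deviations (n - 1 divisor)
    cross = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    return cross / ((len(xs) - 1) * s_x * s_y)

# The three points from the worked example.
r_small = correlation_r([1, 1.5, 2.5], [3, 6, 8])
print(round(r_small, 4))

# The car weight (1000 lbs) and fuel consumption (gal/100 miles) data.
weight = [3.4, 3.8, 4.1, 2.2, 2.6, 2.9, 2.0, 2.7, 1.9, 3.4]
fuel = [5.5, 5.9, 6.5, 3.3, 3.6, 4.6, 2.9, 3.6, 3.1, 4.9]
r_fuel = correlation_r(weight, fuel)
print(round(r_fuel, 4))
```

The two printed values should agree with the slides: about 0.9538 for the three-point example and about 0.9766 for the fuel consumption data.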
Properties of Correlation

r is a measure of the strength of the linear relationship between x and y.
r has no units [like demand elasticity in economics, which takes values in (-infinity, 0)]
-1 ≤ r ≤ 1
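Two of these properties can be checked numerically: r has no units, and r does not distinguish x and y. The Python sketch below reuses the car weight and fuel consumption data. The helper `pearson_r` is an illustrative name; it uses sums of squared deviations, which is algebraically the same as dividing by (n - 1) times the two standard deviations. The unit conversions (pounds to kilograms, US gallons to liters) are added here only to illustrate a change of units.

```python
from math import sqrt

def pearson_r(xs, ys):
    """Correlation computed from deviations about the means (illustrative helper)."""
    n = len(xs)
    x_bar, y_bar = sum(xs) / n, sum(ys) / n
    sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    sxx = sum((x - x_bar) ** 2 for x in xs)
    syy = sum((y - y_bar) ** 2 for y in ys)
    return sxy / sqrt(sxx * syy)

weight = [3.4, 3.8, 4.1, 2.2, 2.6, 2.9, 2.0, 2.7, 1.9, 3.4]  # 1000 lbs
fuel = [5.5, 5.9, 6.5, 3.3, 3.6, 4.6, 2.9, 3.6, 3.1, 4.9]    # gal/100 miles

r_original = pearson_r(weight, fuel)

# Swap the roles of x and y: the correlation is the same.
r_swapped = pearson_r(fuel, weight)

# Change units: weight in kilograms, fuel in liters per 100 miles.
weight_kg = [w * 1000 * 0.45359237 for w in weight]
fuel_liters = [f * 3.785411784 for f in fuel]
r_rescaled = pearson_r(weight_kg, fuel_liters)

print(r_original, r_swapped, r_rescaled)
```

All three values agree up to floating-point rounding, and each lies between -1 and +1.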
Values of r and scatterplots
[Figure: four scatterplot panels of y against x: r near +1, r near -1, and two panels with r near 0.]
Properties (cont.)
r ranges from -1 to +1
"r" quantifies the strength and direction of a linear relationship between 2 quantitative variables.
Strength: how closely the points follow a straight line.
Direction: r is positive when individuals with higher x values tend to have higher values of y, and negative when they tend to have lower values of y.
Properties of Correlation (cont.)

r = -1 only if y = a + bx with slope b<0

r = +1 only if y = a + bx with slope b>0
[Figure: two panels. Points on the line y = 11 - x, with r = -1; points on the line y = 1 + 2x, with r = 1.]
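A quick numerical check of the two extreme cases, as a Python sketch. The x values 1 through 10 are an assumption (the exact points in the figure are not recoverable), `pearson_r` is an illustrative helper, and the "bent" data set is a hypothetical tweak added only to show r dropping below 1 once a point leaves the line.

```python
from math import sqrt

def pearson_r(xs, ys):
    """Correlation computed from deviations about the means (illustrative helper)."""
    n = len(xs)
    x_bar, y_bar = sum(xs) / n, sum(ys) / n
    sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    sxx = sum((x - x_bar) ** 2 for x in xs)
    syy = sum((y - y_bar) ** 2 for y in ys)
    return sxy / sqrt(sxx * syy)

xs = list(range(1, 11))            # assumed x values: 1, 2, ..., 10
ys_down = [11 - x for x in xs]     # exactly on y = 11 - x (slope b < 0)
ys_up = [1 + 2 * x for x in xs]    # exactly on y = 1 + 2x (slope b > 0)

r_down = pearson_r(xs, ys_down)
r_up = pearson_r(xs, ys_up)

# Hypothetical tweak: move the last point 3 units above the line.
ys_bent = ys_up[:-1] + [ys_up[-1] + 3]
r_bent = pearson_r(xs, ys_bent)

print(r_down, r_up, r_bent)
```

The first two values come out at -1 and +1 (up to floating-point rounding), while the tweaked data give a value that is still strongly positive but strictly less than 1.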
Properties (cont.) High correlation
does not imply cause and effect
CARROTS: Hidden terror in the produce department
at your neighborhood grocery

Everyone who ate carrots in 1920, if they are still
alive, has severely wrinkled skin!!!

Everyone who ate carrots in 1865 is now dead!!!

45 of 50 17-year-olds arrested in Raleigh for juvenile
delinquency had eaten carrots in the 2 weeks
prior to their arrest!!!
Properties (cont.) Cause and Effect

There is a strong positive correlation between the monetary damage
caused by structural fires and the number of firemen present at the
fire. (More firemen, more damage.)

Improper training? Would having no firemen present result in the least
damage?
Properties (cont.) Cause and Effect
x = fouls committed by player;
y = points scored by same player
(x, y) = (fouls, points): (1,2) (24,75) (1,0) (18,59) (9,9) (3,7) (5,35) (20,46) (1,0) (3,2) (22,57)
correlation r = 0.935

The correlation is due to a third “lurking” variable: playing time.

r measures the strength of the linear relationship between x and y; it does not indicate cause and effect.
[Figure: scatterplot of Points (0 to 80) against Fouls (0 to 30).]
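The value of r for the fouls and points data can be reproduced with a short Python sketch. This version uses the standardized-value form of the calculation: convert each variable to z-scores, multiply them pairwise, and divide the sum by n - 1. The helper name `standardize` is illustrative.

```python
from statistics import mean, stdev

fouls = [1, 24, 1, 18, 9, 3, 5, 20, 1, 3, 22]
points = [2, 75, 0, 59, 9, 7, 35, 46, 0, 2, 57]

def standardize(values):
    """Convert values to z-scores: (value - mean) / sample standard deviation."""
    m, s = mean(values), stdev(values)
    return [(v - m) / s for v in values]

z_fouls = standardize(fouls)
z_points = standardize(points)

# r is the sum of the products of paired z-scores, divided by n - 1.
r_fouls_points = sum(zx * zy for zx, zy in zip(z_fouls, z_points)) / (len(fouls) - 1)
print(round(r_fouls_points, 3))
```

The result is about 0.935. Nothing in the calculation involves playing time or any notion of cause, which is why a large r by itself cannot show that fouls produce points.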