Alternatively, dependent variable and independent variable.
Alternatively, endogenous variable and exogenous variable.
Association versus causation
Scatterplots
[Figure: scatterplot. Labels recovered from the slide: "Weeks since beginning of semester", "Percentage of computers used", "free in computer labs"]
Stata Exercise 1
Suppose we were considering the effect of hiring more people into the
firm. On average, what total billings can we expect from a staff of 50?
150?
Stata Exercise 2
Stata Exercise 3
Stata Exercise 4
Adding Categorical Values to a Scatterplot

Often it is useful to have a way of
distinguishing groups of data in a scatterplot.
Stata Exercise 5
Stata Exercise 6
Transforming Data

Data analysts often look for a transformation
of the data that simplifies the overall pattern.
Stata Exercise 7

The transformation typically involves turning
a non-Normally distributed variable into a
more-or-less Normally distributed variable.
Categorical Explanatory Variable


What if the explanation for the numbers is not
another number but the category?
For example, investing in a particular sector
of the economy might be great in some years
or terrible in others.
Stata Exercise 8
More scatterplots

Relations between competitors
Stata Exercise 9
Correlation
Which one has the stronger correlation?
r = covariance(x,y) / [stdev(x)*stdev(y)]
r = (1/(n-1)) * sum of [(standardized values of x) * (standardized values of y)]
week   prop of comps
1      73.1
2      89.7
3      71.3
4      65.3
5      54.6
6      57.9
7      51.6
8      41.2
9      59.1
10     48.5
11     24
12     43
13     29.1
14     19.7
15     12.1
16     10.1

mean of w = 8.5, stdev of w = 4.8
mean of p = 46.9, stdev of p = 23.1
Columns to fill in: w - mean of w; z-score of w; p - mean of p; z-score of p; z-score * z-score
Summary cells: sum = 0.00; count = 16; corr = (blank)
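The worksheet can be filled in programmatically. The sketch below is one way to do it in Python rather than Stata or a spreadsheet (the language choice and the helper names `z_scores` and `correlation` are assumptions for illustration). It uses the week and "prop of comps" values listed above and the z-score formula for r.

```python
from statistics import mean, stdev

# Data from the worksheet: week number (w) and "prop of comps" (p).
weeks = list(range(1, 17))
prop = [73.1, 89.7, 71.3, 65.3, 54.6, 57.9, 51.6, 41.2,
        59.1, 48.5, 24.0, 43.0, 29.1, 19.7, 12.1, 10.1]

def z_scores(values):
    """Standardize values using the sample standard deviation (n - 1)."""
    m, s = mean(values), stdev(values)
    return [(v - m) / s for v in values]

def correlation(x, y):
    """r = (1/(n-1)) * sum of the products of the z-scores."""
    zx, zy = z_scores(x), z_scores(y)
    return sum(a * b for a, b in zip(zx, zy)) / (len(x) - 1)

r = correlation(weeks, prop)
print(f"mean of w = {mean(weeks):.1f}, stdev of w = {stdev(weeks):.1f}")
print(f"mean of p = {mean(prop):.1f}, stdev of p = {stdev(prop):.1f}")
print(f"r = {r:.2f}")
```

The means and standard deviations printed here should agree with the values already given in the worksheet, which is a quick check that the data were entered correctly.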
Correlation
r = (1/(n-1)) * sum of [(standardized values of x) * (standardized values of y)]

The r coefficient between measures of height
and weight is positive because people who
are of above-average height tend to be of
above-average weight … so if the z-score for
height is large, the z-score for weight tends
to be large.
Correlation applet at
www.whfreeman.com/pbs
Stata Exercise 11
Correlation


Correlation coefficients, as well as
scatterplots, can be used for comparisons.
For example, how well did Vanguard
International Growth Fund (an investment
vehicle) do compared to an average of the
stocks in Europe, Australasia and the Far
East?
Stata Exercise 12
Correlation







Doesn’t tell you anything about causality
Variables must be numerical
It is indifferent to units of measurement
r>0 means positive association; r<0, negative
-1 ≤ r ≤ 1. r = -1 means a perfectly straight downward-sloping line. r = 0 means no linear relation.
r only measures linear relations
r is not resistant to outliers
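Three of these properties are easy to check numerically. The sketch below uses made-up heights and weights (hypothetical data, not from the course) to show that r is unchanged by a change of units, that a perfect but curved relation can give r = 0, and that a single outlier can move r a long way.

```python
from statistics import mean, stdev

def correlation(x, y):
    """Sample correlation via products of z-scores."""
    mx, my, sx, sy = mean(x), mean(y), stdev(x), stdev(y)
    return sum((a - mx) / sx * (b - my) / sy
               for a, b in zip(x, y)) / (len(x) - 1)

# Hypothetical heights (inches) and weights (pounds), for illustration only.
height_in = [60, 62, 65, 67, 70, 72, 75]
weight_lb = [110, 125, 140, 150, 165, 180, 200]
r_original = correlation(height_in, weight_lb)

# Units: inches -> centimetres, pounds -> kilograms. r should not change.
height_cm = [h * 2.54 for h in height_in]
weight_kg = [w * 0.4536 for w in weight_lb]
r_metric = correlation(height_cm, weight_kg)

# Linear only: y = x^2 on symmetric x is a perfect relation with r = 0.
x_sym = [-3, -2, -1, 0, 1, 2, 3]
r_curve = correlation(x_sym, [v * v for v in x_sym])

# Not resistant: add one tall, very light (hypothetical) person.
r_outlier = correlation(height_in + [78], weight_lb + [100])

print(r_original, r_metric, r_curve, r_outlier)
```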
Stata Exercise 13
Regression
The Linear Regression Model
y_i = a + b x_i + error_i

Errors have a mean 0 and a constant sd of s
and are independent of x.
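To see what the model assumes, it can help to generate data from it and then recover a and b by least squares. This is a minimal sketch: the parameter values, the sample size, and the helper `least_squares` are all assumptions chosen for illustration, and the errors are drawn from a Normal distribution only for convenience.

```python
import random
from statistics import mean

random.seed(0)  # fixed seed so the run is repeatable

# Hypothetical parameter values, chosen only for illustration.
a_true, b_true, s = 1.0, 2.0, 1.0

x = [random.uniform(0, 10) for _ in range(500)]
# Errors: mean 0, constant sd s, drawn independently of x.
y = [a_true + b_true * xi + random.gauss(0, s) for xi in x]

def least_squares(x, y):
    """Return (a, b) for the least-squares line y = a + b x."""
    mx, my = mean(x), mean(y)
    sxy = sum((xi - mx) * (yi - my) for xi, yi in zip(x, y))
    sxx = sum((xi - mx) ** 2 for xi in x)
    b = sxy / sxx
    return my - b * mx, b

a_hat, b_hat = least_squares(x, y)
print(f"estimated a = {a_hat:.2f}, estimated b = {b_hat:.2f}")
```

With this many observations the estimates should land close to the values used to generate the data.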
[Figure: Price of Homes against Square Footage of Homes (0 to 4000), with linear prediction]
[Figure: histograms (Frequency) of Price of Homes within square-footage bands: 1000<sqft<=1500, 1500<sqft<=2000, 2000<sqft<=2500, 2500<sqft<=3000, 3000<sqft<=3500, 3500<sqft<=4000]
[Figure: Price of Homes against Square Footage of Homes (0 to 4000), with linear prediction]
y - 20,000 = 1560 (x - 66.5)
y = -84,000 + 1560 x
Sketch a scatterplot of the data consistent with this line
[Figure: earn against Height (inches), 55 to 80, with fitted values. Labeled points: (61.5’’, $12,200), (66.5’’, $20,000), (76.5’’, $35,600). Annotations: "95% of values", "$37,694"]
[Figure: earn against Height (inches), 55 to 80, with fitted values]
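The two forms of the line and the three labeled points can be checked with a few lines of arithmetic. The function name below is an assumption for illustration; the numbers come from the slide.

```python
def earnings_point_slope(height):
    """Point-slope form y - 20,000 = 1560 (x - 66.5), solved for y."""
    return 20_000 + 1560 * (height - 66.5)

# The intercept is the predicted value at height 0.
intercept = earnings_point_slope(0)
print("intercept:", intercept)

# The three points labeled on the figure.
for h in (61.5, 66.5, 76.5):
    print(h, earnings_point_slope(h))
```

The exact intercept is -83,740, which the slide-intercept form rounds to -84,000.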
[Figure: circles plotted on y (0 to 3) against x (0 to 4)]
Draw the best-fitting line through the circles
[Figure: circles plotted on y (0 to 3) against x (0 to 6)]
Draw the best-fitting line through the circles
[Figure: circles plotted on y (0 to 3) against x (0 to 4)]
Mark with an “X” the average “y” value for each “x” value. Then draw the best-fitting line through the Xs
[Figure: circles plotted on y (0 to 3) against x (0 to 6)]
Mark with an “X” the average “y” value for each “x” value. Then draw the best-fitting line through the Xs
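The point of marking the averages is that the regression line tracks the average y at each x. A small sketch with made-up points (hypothetical data, three circles at each x) shows the line through the circles and the line through the Xs coinciding. This exact agreement relies on every x having the same number of circles; with unequal counts the two lines can differ.

```python
from statistics import mean

# Hypothetical circles: three y values observed at each x.
points = [(1, 0.5), (1, 1.0), (1, 1.5),
          (2, 1.0), (2, 2.0), (2, 1.5),
          (3, 2.0), (3, 2.5), (3, 3.0)]

def least_squares(x, y):
    """Return (a, b) for the least-squares line y = a + b x."""
    mx, my = mean(x), mean(y)
    sxy = sum((xi - mx) * (yi - my) for xi, yi in zip(x, y))
    sxx = sum((xi - mx) ** 2 for xi in x)
    b = sxy / sxx
    return my - b * mx, b

# Line through the circles (all raw points).
a_raw, b_raw = least_squares([p[0] for p in points], [p[1] for p in points])

# The Xs: the average y at each distinct x, then a line through them.
xs = sorted({x for x, _ in points})
avg_y = [mean([y for x, y in points if x == xv]) for xv in xs]
a_avg, b_avg = least_squares(xs, avg_y)

print("through circles:", a_raw, b_raw)
print("through Xs:     ", a_avg, b_avg)
```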
Fact 1
Regression (unlike correlation) is
sensitive to your determination of
which variable is explanatory and
which response.
Item = a + b(sales)
Sales = a + b(item)
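A quick numerical check of Fact 1, using made-up item and sales figures (hypothetical data, for illustration only): the correlation is the same whichever way round the variables go, but the two regression lines are different lines. The product of the two slopes equals r squared, so one slope is the reciprocal of the other only when r is exactly 1 or -1.

```python
from statistics import mean, stdev

# Hypothetical paired observations, for illustration only.
sales = [10, 12, 15, 17, 20, 24, 25]
item = [3, 5, 4, 7, 6, 9, 8]

def least_squares(x, y):
    """Return (a, b) for the least-squares line predicting y from x."""
    mx, my = mean(x), mean(y)
    sxy = sum((xi - mx) * (yi - my) for xi, yi in zip(x, y))
    sxx = sum((xi - mx) ** 2 for xi in x)
    b = sxy / sxx
    return my - b * mx, b

def correlation(x, y):
    mx, my, sx, sy = mean(x), mean(y), stdev(x), stdev(y)
    return sum((a - mx) / sx * (b - my) / sy
               for a, b in zip(x, y)) / (len(x) - 1)

a1, b1 = least_squares(sales, item)   # Item = a + b(sales)
a2, b2 = least_squares(item, sales)   # Sales = a + b(item)
r = correlation(sales, item)

print("Item on sales: ", a1, b1)
print("Sales on item: ", a2, b2)
print("r =", r, " b1 * b2 =", b1 * b2, " r^2 =", r ** 2)
```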
Stata Exercise 14
Facts 2 and 3

If x changes by one standard deviation of x, the predicted y changes by r standard deviations of y.
– E.g., s_x = 1, s_y = 2, and r = 0.61. If x changes by 1, the predicted y changes by 2*0.61 = 1.22.

The regression line goes through the point (x̄, ȳ).
– The point-slope form of the line requires only the information on this slide to draw a line.
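Facts 2 and 3 together give the whole line from five summary numbers: the slope is b = r * s_y / s_x and the line passes through the point of means. In the sketch below, s_x, s_y and r are the values from the example; the two means are hypothetical, since the slide does not give them, and the function name is an assumption.

```python
def line_from_summary(x_bar, y_bar, s_x, s_y, r):
    """Return (a, b) for the regression line y = a + b x."""
    b = r * s_y / s_x          # Fact 2: slope in original units
    a = y_bar - b * x_bar      # Fact 3: line passes through the means
    return a, b

# s_x, s_y and r are from the slide; the means 5.0 and 10.0 are made up.
a, b = line_from_summary(x_bar=5.0, y_bar=10.0, s_x=1.0, s_y=2.0, r=0.61)
print("slope b =", b)
print("prediction at the mean of x:", a + b * 5.0)
```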
Fact 4


Correlation r is related to the slope of the
regression line and therefore to the relation
between x and y.
Actually, the square of r, that is, R^2, is the
fraction of the variation in y that is
explained by the variation in x.

R^2 = (variation in ŷ as x pulls it along the line) / (total variation in observed values of y)

Because most of the variation
in gas consumption is
explained by temperature,
the R^2 of this regression is
very high.
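The ratio can be computed directly and compared with the square of r. The temperature and gas-consumption figures below are invented for illustration (the course data are not reproduced here), and the helper names are assumptions.

```python
from statistics import mean, stdev

# Hypothetical temperature and gas-consumption data, for illustration only.
temp = [20, 25, 30, 35, 40, 45, 50, 55]
gas = [9.8, 9.1, 8.3, 7.2, 6.6, 5.4, 4.9, 4.1]

def least_squares(x, y):
    """Return (a, b) for the least-squares line y = a + b x."""
    mx, my = mean(x), mean(y)
    sxy = sum((xi - mx) * (yi - my) for xi, yi in zip(x, y))
    sxx = sum((xi - mx) ** 2 for xi in x)
    b = sxy / sxx
    return my - b * mx, b

def correlation(x, y):
    mx, my, sx, sy = mean(x), mean(y), stdev(x), stdev(y)
    return sum((a - mx) / sx * (b - my) / sy
               for a, b in zip(x, y)) / (len(x) - 1)

a, b = least_squares(temp, gas)
fitted = [a + b * t for t in temp]
y_bar = mean(gas)

# Variation in the fitted values divided by total variation in observed y.
explained = sum((f - y_bar) ** 2 for f in fitted)
total = sum((g - y_bar) ** 2 for g in gas)
r_squared = explained / total

r = correlation(temp, gas)
print("R^2 from the ratio:", r_squared)
print("square of r:       ", r ** 2)
```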
tbill98    tbill98_hat    residuals
11.5       10.84649
12.6       12.19961
13.8       14.81564
6.4        5.975251
5.3        6.336083
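A residual is the observed value minus the fitted value. A short sketch that fills in the residuals column from the five rows shown, using Python rather than Excel or Stata (an assumption made for illustration):

```python
# Observed and fitted values from the five rows shown.
tbill98 = [11.5, 12.6, 13.8, 6.4, 5.3]
tbill98_hat = [10.84649, 12.19961, 14.81564, 5.975251, 6.336083]

# residual = observed - fitted
residuals = [obs - fit for obs, fit in zip(tbill98, tbill98_hat)]

for obs, fit, res in zip(tbill98, tbill98_hat, residuals):
    print(f"{obs:6.1f}  {fit:10.6f}  {res:10.6f}")
```

A positive residual means the observation lies above the regression line; a negative one means it lies below.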
Excel Exercise 1
Stata Exercises 15 and 16
With influential observations
Without influential observation 21
Stata Exercise 17
Cautions about Correlation and Regression




Don’t extrapolate too far
Correlations are stronger for averages than for individuals
Beware of lurking (latent, hidden, excluded, neglected) variables
Association is not causation
– Establishing causation takes a lot of work (see p. 139).
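The caution about averages can be illustrated with a small simulation. Everything in it is an assumption made for illustration: 20 groups of 50 simulated individuals, with a lot of individual-level noise in y. Averaging within groups cancels much of that noise, so the correlation between group averages comes out much stronger than the correlation between individuals.

```python
import random
from statistics import mean, stdev

random.seed(1)  # fixed seed so the run is repeatable

def correlation(x, y):
    mx, my, sx, sy = mean(x), mean(y), stdev(x), stdev(y)
    return sum((a - mx) / sx * (b - my) / sy
               for a, b in zip(x, y)) / (len(x) - 1)

# Hypothetical setup: 20 groups of 50 individuals each.
ind_x, ind_y, avg_x, avg_y = [], [], [], []
for g in range(20):
    xs = [g + random.gauss(0, 1) for _ in range(50)]
    ys = [xi + random.gauss(0, 10) for xi in xs]  # noisy individuals
    ind_x.extend(xs)
    ind_y.extend(ys)
    avg_x.append(mean(xs))   # one average per group
    avg_y.append(mean(ys))

r_individuals = correlation(ind_x, ind_y)
r_averages = correlation(avg_x, avg_y)
print(f"r for individuals:    {r_individuals:.2f}")
print(f"r for group averages: {r_averages:.2f}")
```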