B632_06_lect1b

Statistics Refresher: Topics
• Central tendency
– Expected value and means
• Dispersion
– Population variance, sample variance, standard deviations
• Measures of relations
• Covariation
– covariance matrices
• Correlations
• Sampling distributions
• Characteristics of sampling distributions
• Class Data
– 2005 National Security Survey (phone and web)
– Stata application
• Means, Variance, Standard Deviations
• The Normal Distribution
• Medians and IQRs
• Box Plots and Symmetry Plots
January 17 2006
Lecture 1b
Slide 1
Measures of Central Tendency
In general: $E[Y] = \mu_Y$
For discrete functions: $E[Y] = \sum_{i=1}^{I} Y_i f(Y_i) = \mu_Y$
For continuous functions: $E[Y] = \int_{-\infty}^{\infty} Y f(Y)\,dY = \mu_Y$
An unbiased estimator of the expected value: $\bar{Y} = \frac{\sum_{i=1}^{n} Y_i}{n}$
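The discrete formula and the sample-mean estimator can be sketched in a few lines of Python (used here only for illustration; the lecture itself works in Stata). The fair die and the eight "rolls" are made-up values, not class data.

```python
from fractions import Fraction
from statistics import mean

# Hypothetical discrete distribution: a fair six-sided die.
values = [1, 2, 3, 4, 5, 6]
probs = [Fraction(1, 6)] * 6

# E[Y] = sum over i of Y_i * f(Y_i)
expected_value = sum(y * p for y, p in zip(values, probs))

# Sample mean: sum(Y_i) / n, computed on a made-up sample of rolls.
sample = [2, 6, 3, 3, 5, 1, 4, 5]
sample_mean = mean(sample)

print(expected_value)  # 7/2
print(sample_mean)     # 3.625
```

The sample mean (3.625) differs from the expected value (3.5): the estimator is unbiased on average, not exact in any one sample.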
Rules for Expected Value
• E[a] = a -- the expected value of a constant is the constant itself
• E[bX] = bE[X]
• E[X+W] = E[X] + E[W]
• E[a + bX] = E[a] + E[bX] = a + bE[X]
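A quick numerical check of the last rule, as a sketch: the distribution of X and the constants a and b below are arbitrary illustrative choices, and `expect` is a helper defined here, not a library function.

```python
from fractions import Fraction

def expect(values, probs):
    """E[Y] for a discrete distribution given its values and probabilities."""
    return sum(v * p for v, p in zip(values, probs))

# Hypothetical distribution for X (values and probabilities are made up).
x = [0, 1, 4]
p = [Fraction(1, 2), Fraction(1, 4), Fraction(1, 4)]
a, b = 3, 2

lhs = expect([a + b * xi for xi in x], p)  # E[a + bX], computed directly
rhs = a + b * expect(x, p)                 # a + bE[X], using the rule

print(lhs, rhs)  # 11/2 11/2
```

Exact fractions are used so that the two sides can be compared without floating-point rounding.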
Measures of Dispersion
• Var[X] = Cov[X,X] = $E[(X - E[X])^2]$
• Sample variance: $s_X^2 = \frac{\sum_i (X_i - \bar{X})^2}{n - 1}$
• Standard deviation: $\sigma_X = \sqrt{Var(X)}$
• Sample Std. Dev: $s_X = \sqrt{s_X^2}$
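The sample formulas can be computed by hand and compared with Python's `statistics.variance` and `statistics.stdev`, both of which use the n - 1 divisor. The data are made up for illustration.

```python
from math import sqrt
from statistics import mean, stdev, variance

x = [2, 4, 4, 4, 5, 5, 7, 9]  # made-up sample
n = len(x)
x_bar = mean(x)

# Sample variance: squared deviations from the mean, divided by n - 1.
s2 = sum((xi - x_bar) ** 2 for xi in x) / (n - 1)
# Sample standard deviation: square root of the sample variance.
s = sqrt(s2)

print(s2, variance(x))  # hand computation vs. library
print(s, stdev(x))
```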
Rules for Variance Manipulation
• Var[a] = 0
• Var[bX] = $b^2$ Var[X]
• From which we can deduce: Var[a + bX] = Var[a] + Var[bX] = $b^2$ Var[X]
• Var[X + W] = Var[X] + Var[W] + 2Cov[X,W]
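These rules can be verified on a toy example. The sketch below treats four paired (X, W) outcomes as equally likely, so the variances and covariance are population quantities; the numbers and the helper functions `mean`, `cov`, and `var` are illustrative assumptions.

```python
from fractions import Fraction

def mean(v):
    return Fraction(sum(v), len(v))

def cov(u, v):
    """Population covariance of paired, equally likely outcomes."""
    mu, mv = mean(u), mean(v)
    return sum((ui - mu) * (vi - mv) for ui, vi in zip(u, v)) / len(u)

def var(v):
    return cov(v, v)  # Var[X] = Cov[X, X]

x = [1, 2, 3, 4]  # made-up outcomes for X
w = [2, 1, 4, 3]  # made-up outcomes for W
a, b = 5, 3

shifted_scaled = [a + b * xi for xi in x]
total = [xi + wi for xi, wi in zip(x, w)]

print(var(shifted_scaled), b ** 2 * var(x))             # Var[a + bX] vs b^2 Var[X]
print(var(total), var(x) + var(w) + 2 * cov(x, w))      # Var[X + W] vs the sum rule
```

Note that adding the constant a changes nothing, while the covariance term matters whenever X and W move together.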
Measures of Association
• Cov[X,Y] = E[(X - E[X])(Y - E[Y])]
= E[XY] - E[X]E[Y]
• Sample Covariance: $s_{XY} = \frac{\sum_i (X_i - \bar{X})(Y_i - \bar{Y})}{n - 1}$
• Correlation: $\rho_{XY} = \frac{Cov[X,Y]}{\sqrt{Var[X]\,Var[Y]}}$
• Correlation rescales covariance so that it always lies between -1 and +1
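A minimal sketch of the sample covariance and the correlation computed from it, on made-up paired data:

```python
from math import sqrt

x = [1, 2, 3, 4, 5]  # made-up paired observations
y = [2, 1, 4, 3, 5]
n = len(x)
x_bar = sum(x) / n
y_bar = sum(y) / n

# Sample covariance: cross-products of deviations, divided by n - 1.
s_xy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y)) / (n - 1)

# Sample variances, needed for the denominator of the correlation.
s_x2 = sum((xi - x_bar) ** 2 for xi in x) / (n - 1)
s_y2 = sum((yi - y_bar) ** 2 for yi in y) / (n - 1)

# Correlation: covariance divided by the product of standard deviations.
r = s_xy / sqrt(s_x2 * s_y2)

print(s_xy, r)
```

Unlike the covariance, the correlation is unit-free: rescaling X or Y by a positive constant leaves it unchanged.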
Rules of Covariance Manipulation
• Cov[a,Y] = 0 (why?)
• Cov[bX,Y] = bCov[X,Y] (why?)
• Cov[X + W,Y] = Cov[X,Y] + Cov[W,Y]
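One way to answer the two "why?" prompts is to substitute into the definition Cov[X,Y] = E[(X - E[X])(Y - E[Y])] and apply the rules for expected value. A short derivation along those lines:

```latex
\begin{align*}
\mathrm{Cov}[a, Y] &= E\big[(a - E[a])(Y - E[Y])\big]
                    = E\big[0 \cdot (Y - E[Y])\big] = 0, \\
\mathrm{Cov}[bX, Y] &= E\big[(bX - E[bX])(Y - E[Y])\big]
                     = E\big[b\,(X - E[X])(Y - E[Y])\big]
                     = b\,\mathrm{Cov}[X, Y].
\end{align*}
```

The first line uses E[a] = a; the second uses E[bX] = bE[X] twice, once inside the deviation and once to pull b out of the expectation.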
Covariance Matrices
$$
\begin{bmatrix}
Var[Y] & Cov[Y,X] & Cov[Y,Z] \\
Cov[X,Y] & Var[X] & Cov[X,Z] \\
Cov[Z,Y] & Cov[Z,X] & Var[Z]
\end{bmatrix}
$$
Correlation Matrices (Example)
. correlate p2_age p1_edu p100d_in
(obs=2500)

             |   p2_age   p1_edu p100d_in
-------------+---------------------------
      p2_age |   1.0000
      p1_edu |   0.0322   1.0000
    p100d_in |  -0.0456   0.3234   1.0000
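A correlation matrix is a covariance matrix with each entry divided by the two corresponding standard deviations. The sketch below uses a made-up 3x3 covariance matrix (not the survey data) to show the conversion in Python.

```python
from math import sqrt

# Hypothetical covariance matrix for (Y, X, Z); the numbers are made up.
cov = [
    [4.0, 2.0, -1.0],
    [2.0, 9.0, 3.0],
    [-1.0, 3.0, 16.0],
]
k = len(cov)

# Standard deviations are the square roots of the diagonal (the variances).
sd = [sqrt(cov[i][i]) for i in range(k)]

# corr[i][j] = cov[i][j] / (sd[i] * sd[j]); the diagonal becomes 1.
corr = [[cov[i][j] / (sd[i] * sd[j]) for j in range(k)] for i in range(k)]

for row in corr:
    print("  ".join(f"{value:7.4f}" for value in row))
```

Like the covariance matrix, the result is symmetric, which is why Stata prints only the lower triangle.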
In-Class Dataset: National Security Survey
• Review the Frequency Report
– Public perspectives on national security, domestic and international
– Telephone and Internet survey
– Dates: April 2005-June 2005
– Knowledge, beliefs, policy preferences
• Class data: n=3006
– Variable types
• Nominal
• Ordinal scales, Likert-type scales
• Ratio scales
• Stata format
Characterizing Data
• Rolling in the data -- before modeling
– A Cautionary Tale
• Sample versus population statistics
Concept | Sample Statistic | Population Parameter
Mean | $\bar{Y} = \frac{\sum_{i=1}^{n} Y_i}{n}$ | $\mu_Y = E[Y]$
Variance | $s_Y^2 = \frac{\sum_i (Y_i - \bar{Y})^2}{n - 1}$ | $\sigma_Y^2 = Var[Y]$
Standard Deviation | $s_Y = \sqrt{s_Y^2}$ | $\sigma_Y = \sqrt{Var[Y]}$
Properties of Standard Normal (Gaussian) Distributions
• Can differ dramatically from sample frequency distributions, especially in small samples (Stata)
• Tails go to plus/minus infinity
• The density of the distribution is key:
+/- 1.96 standard deviations cover 95% of the distribution
+/- 2.58 standard deviations cover 99% of the distribution
• Student’s t distribution converges on the Gaussian as the degrees of freedom increase
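The 95% and 99% figures can be checked against the standard normal CDF. This sketch uses `statistics.NormalDist` from the Python standard library; `coverage` is a helper name introduced here for illustration.

```python
from statistics import NormalDist

z = NormalDist()  # standard normal: mean 0, standard deviation 1

def coverage(k):
    """Probability mass within +/- k standard deviations of the mean."""
    return z.cdf(k) - z.cdf(-k)

print(round(coverage(1.96), 4))  # 0.95
print(round(coverage(2.58), 4))  # 0.9901

# Going the other way: the cutoff that leaves 2.5% in each tail.
print(round(z.inv_cdf(0.975), 2))  # 1.96
```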
Standard Normal (Gaussian) Distributions
• So what?
– Only the mean and standard deviation are needed to characterize the data and test simple hypotheses
– Large sample characteristics: homing in on normal
[Figure: curves labeled n_i = 300, n_i = 100, and n_i = 20; horizontal axis X]
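One reading of the large-sample point is that the sampling distribution of the mean tightens as n grows. The simulation below is a sketch under stated assumptions: the population is exponential with mean 1 and standard deviation 1 (an arbitrary skewed choice, not the survey data), the seed and number of replications are arbitrary, and `sample_mean_spread` is a helper defined here.

```python
import random
from math import sqrt
from statistics import stdev

random.seed(632)  # arbitrary seed, for reproducibility only

def sample_mean_spread(n, reps=2000):
    """Standard deviation of `reps` sample means, each from n exponential draws."""
    means = [
        sum(random.expovariate(1.0) for _ in range(n)) / n
        for _ in range(reps)
    ]
    return stdev(means)

# Same sample sizes as the curves in the figure.
spread = {n: sample_mean_spread(n) for n in (20, 100, 300)}

for n, sd in spread.items():
    # Compare the simulated spread with the theoretical value sigma / sqrt(n).
    print(n, round(sd, 3), round(1 / sqrt(n), 3))
```

Even though the population is skewed, the spread of the sample means should track sigma / sqrt(n) closely at all three sample sizes.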
Order Statistics
• Medians
– Order statistic for central tendency
– The value positioned at the middle or (n+1)/2 rank
– Robustness compared to mean
• Basis for “robust estimators”
• Quartiles
– Q1: 0-25%; Q2: 25-50%; Q3: 50-75%; Q4: 75-100%
• Percentiles
– List of hundredths (say that fast 20 times)
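The rank rule for the median and its robustness relative to the mean can be illustrated on a small made-up list of ages. Quartile cut points come from `statistics.quantiles`, whose default interpolation rule is one of several conventions and may not match what other software reports.

```python
from statistics import mean, median, quantiles

ages = [23, 31, 35, 42, 49, 51, 58, 64, 77]  # made-up values, n = 9 (odd)

ordered = sorted(ages)
n = len(ordered)
rank = (n + 1) // 2            # middle rank; exact only when n is odd
md = ordered[rank - 1]         # ranks start at 1, list indexes at 0

# Introduce one wild value (say, 77 mistyped as 770) and compare.
with_outlier = ages[:-1] + [770]

print(md, median(ages), median(with_outlier))    # the median does not move
print(round(mean(ages), 1), round(mean(with_outlier), 1))  # the mean does

# Cut points separating the four quarters of the data.
print(quantiles(ages, n=4))
```

For even n, the usual convention (and what `median` does) is to average the two middle values.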
Distributional Shapes
• Positive Skew: $\bar{Y} > Md_Y$
• Negative Skew: $\bar{Y} < Md_Y$
• Approximate Symmetry: $\bar{Y} \approx Md_Y$
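A small sketch of the mean-versus-median comparison on three made-up datasets, one for each shape:

```python
from statistics import mean, median

datasets = {
    "positive skew": [1, 2, 2, 3, 3, 3, 4, 20],         # long right tail
    "negative skew": [-20, -4, -3, -3, -3, -2, -2, -1],  # long left tail
    "symmetric": [1, 2, 3, 4, 5, 6, 7],
}

# Mean minus median: positive, negative, and zero respectively.
gaps = {name: mean(data) - median(data) for name, data in datasets.items()}

for name, gap in gaps.items():
    print(f"{name}: mean - median = {gap}")
```

The long tail pulls the mean toward it while the median stays with the bulk of the data.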
Using the Interquartile Range (IQR)
• IQR = Q3 - Q1
• Spans the middle 50% of the data
• A measure of dispersion (or spread)
• Robustness of IQR (relative to variance)
• If Y is normally distributed, then:
– $s_Y \approx IQR/1.35$
• So: if $Md_Y \approx \bar{Y}$ and $s_Y \approx IQR/1.35$, then
– Y is approximately normally distributed
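The 1.35 constant is the IQR of a standard normal distribution, which can be confirmed from its quartiles. The sketch below also applies the two-part check to simulated normal data; the mean of 50, standard deviation of 15, sample size, and seed are arbitrary choices, and `normality_screen` is a hypothetical helper.

```python
from statistics import NormalDist, mean, median, quantiles, stdev

# IQR of the standard normal: distance between its 75th and 25th percentiles.
std_normal = NormalDist()
iqr_std_normal = std_normal.inv_cdf(0.75) - std_normal.inv_cdf(0.25)
print(round(iqr_std_normal, 3))  # 1.349

def normality_screen(y):
    """Return the four quantities compared in the rule of thumb."""
    q1, _, q3 = quantiles(y, n=4)
    return {
        "mean": mean(y),
        "median": median(y),
        "sd": stdev(y),
        "iqr_over_1_35": (q3 - q1) / 1.35,
    }

# Simulated normal data, so both comparisons should come out close.
sample = NormalDist(50, 15).samples(5000, seed=1)
screen = normality_screen(sample)
for name, value in screen.items():
    print(f"{name}: {value:.2f}")
```

This is a rough screen rather than a formal test: close agreement is consistent with normality but does not establish it.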
Example: The Observed Distribution of Age (p2_age)
[Figure: Distribution of Age. Histograms of p2_age (horizontal axis, 20 to 100) against Density (vertical axis, 0 to .03), each with an overlaid normal curve, in two panels: 1 = phone survey, 2 = web survey. Graphs by phone=1_web=2]
Interpreting Box Plots
[Figure: box plots of p2_age (axis ticks 20 to 100) in two panels: 1 = phone survey, 2 = web survey. Graphs by phone=1_web=2]
Median Age = ~49; IQR = ~25 years
Quantile Normal Plots
• Allow comparison between an empirical distribution and the Gaussian distribution
• Plot observed percentiles against those expected under a normal distribution
• Most intuitive:
– Normal QQ plots
• Evaluate
[Figure: quantile-normal plot of p2_age (vertical axis, 0 to 100) against Inverse Normal (horizontal axis, 0 to 100)]
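The coordinates behind a quantile-normal plot can be sketched without any plotting library. Each ordered observation is paired with the value a normal distribution with the same mean and standard deviation would put at that rank. The ages are made up, `qnorm_pairs` is a hypothetical helper, and the plotting position i/(n + 1) is one common convention assumed here.

```python
from statistics import NormalDist, mean, stdev

def qnorm_pairs(y):
    """Pair each ordered observation with its expected normal quantile."""
    ordered = sorted(y)
    n = len(ordered)
    fitted = NormalDist(mean(ordered), stdev(ordered))
    # Assumed plotting position: the i-th smallest value sits at i / (n + 1).
    expected = [fitted.inv_cdf(i / (n + 1)) for i in range(1, n + 1)]
    return list(zip(expected, ordered))

ages = [22, 25, 31, 38, 44, 47, 49, 53, 58, 61, 66, 72, 85]  # made-up values

pairs = qnorm_pairs(ages)
for expected, observed in pairs:
    print(f"{expected:6.1f}  {observed:3d}")
```

If the data were exactly normal, the two columns would match and the plotted points would fall on the 45-degree line; systematic departures in the tails indicate skew or heavy tails.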
Data Exploration in Stata
• Access National Security dataset (new)
• Using Age: univariate analysis (Stata)
• Using Age: split by survey mode (Stata)
• Exercises:
– Univariate analysis of age
• By mode, gender
– Graphing: Produce
• Histograms
• Box plots
• Q-Normal plots
For Next Week
• Read Hamilton
– Appendix 1 (review carefully)
– Pages 1-23; 29-37
• Review Herron and Jenkins-Smith
– Homework #1
• Bivariate Regression Analysis
– Theoretical model
– Model formulation
– Model assumptions
– Residual analysis