Statistics Refresher: Topics
• Central tendency
  – Expected value and means
• Dispersion
  – Population variance, sample variance, standard deviations
• Measures of relations
  – Covariation – covariance matrices
  – Correlations
• Sampling distributions
• Characteristics of sampling distributions
• Class Data
  – 2005 National Security Survey (phone and web)
  – Stata application
    • Means, Variance, Standard Deviations
    • The Normal Distribution
    • Medians and IQRs
    • Box Plots and Symmetry Plots

January 17 2006 Lecture 1b Slide 1

Measures of Central Tendency
In general:
  $E[Y] = \mu_Y$
For discrete functions:
  $E[Y] = \sum_{i=1}^{I} Y_i f(Y_i) = \mu_Y$
For continuous functions:
  $E[Y] = \int_{-\infty}^{\infty} Y f(Y)\, dY = \mu_Y$
An unbiased estimator of the expected value:
  $\bar{Y} = \frac{\sum Y_i}{n}$

Rules for Expected Value
• E[a] = a: the expected value of a constant is the constant itself
• E[bX] = bE[X]
• E[X+W] = E[X] + E[W]
• E[a + bX] = E[a] + E[bX] = a + bE[X]

Measures of Dispersion
• Variance: $\mathrm{Var}[X] = \mathrm{Cov}[X,X] = E[(X - E[X])^2]$
• Sample variance: $s_X^2 = \frac{\sum (X_i - \bar{X})^2}{n-1}$
• Standard deviation: $\sigma_X = \sqrt{\mathrm{Var}[X]}$
• Sample Std. Dev: $s_X = \sqrt{s_X^2}$

Rules for Variance Manipulation
• $\mathrm{Var}[a] = 0$
• $\mathrm{Var}[bX] = b^2\,\mathrm{Var}[X]$
• From which we can deduce: $\mathrm{Var}[a+bX] = \mathrm{Var}[a] + \mathrm{Var}[bX] = b^2\,\mathrm{Var}[X]$
• $\mathrm{Var}[X + W] = \mathrm{Var}[X] + \mathrm{Var}[W] + 2\,\mathrm{Cov}[X,W]$

Measures of Association
• $\mathrm{Cov}[X,Y] = E[(X - E[X])(Y - E[Y])] = E[XY] - E[X]E[Y]$
• Sample Covariance: $s_{XY} = \frac{\sum \{(X_i - \bar{X})(Y_i - \bar{Y})\}}{n-1}$
• Correlation: $\frac{\mathrm{Cov}[X,Y]}{\sqrt{\mathrm{Var}[X]\,\mathrm{Var}[Y]}}$
• Correlation is restricted to the range -1 to +1

Rules of Covariance Manipulation
• Cov[a,Y] = 0 (why?)
• Cov[bX,Y] = bCov[X,Y] (why?)
• Cov[X + W,Y] = Cov[X,Y] + Cov[W,Y]

Covariance Matrices
$$
\begin{bmatrix}
\mathrm{Var}[Y] & \mathrm{Cov}[Y,X] & \mathrm{Cov}[Y,Z] \\
\mathrm{Cov}[X,Y] & \mathrm{Var}[X] & \mathrm{Cov}[X,Z] \\
\mathrm{Cov}[Z,Y] & \mathrm{Cov}[Z,X] & \mathrm{Var}[Z]
\end{bmatrix}
$$

Correlation Matrices (Example)

    . correlate p2_age p1_edu p100d_in
    (obs=2500)

                 |   p2_age   p1_edu p100d_in
    -------------+---------------------------
          p2_age |   1.0000
          p1_edu |   0.0322   1.0000
        p100d_in |  -0.0456   0.3234   1.0000

In-Class Dataset: National Security Survey
• Review the Frequency Report
  – Public perspectives on national security, domestic and international
  – Telephone and Internet survey
  – Dates: April 2005-June 2005
  – Knowledge, beliefs, policy preferences
• Class data: n=3006
  – Variable types
    • Nominal
    • Ordinal scales, Likert-type scales
    • Ratio scales
  – Stata format

Characterizing Data
• Rolling in the data: before modeling
  – A Cautionary Tale
• Sample versus population statistics

Concept            | Sample Statistic                                   | Population Parameter
Mean               | $\bar{Y} = \frac{\sum_{i=1}^{n} Y_i}{n}$           | $\mu_Y = E[Y]$
Variance           | $s_Y^2 = \frac{\sum (Y_i - \bar{Y})^2}{n-1}$       | $\sigma_Y^2 = \mathrm{Var}[Y]$
Standard Deviation | $s_Y = \sqrt{s_Y^2}$                               | $\sigma_Y = \sqrt{\mathrm{Var}[Y]}$

Properties of Standard Normal (Gaussian) Distributions
• Can be dramatically different from sample frequencies (especially in small samples) [Stata]
• Tails go to plus/minus infinity
• The density of the distribution is key:
  – ±1.96 standard deviations cover 95% of the distribution
  – ±2.58 standard deviations cover 99% of the distribution
• Student's t distribution converges to the Gaussian as the degrees of freedom increase

Standard Normal (Gaussian) Distributions
• So what?
  – Only the mean and standard deviation are needed to characterize the data and test simple hypotheses
  – Large sample characteristics: homing in on normal
[Figure: distributions over X for n_i = 300, n_i = 100, and n_i = 20]

Order Statistics
• Medians
  – Order statistic for central tendency
  – The value positioned at the middle, or (n+1)/2, rank
  – Robustness compared to mean
    • Basis for "robust estimators"
• Quartiles
  – Q1: 0-25%; Q2: 25-50%; Q3: 50-75%; Q4: 75-100%
• Percentiles
  – List of hundredths (say that fast 20 times)

Distributional Shapes
• Positive Skew: $\bar{Y} > Md_Y$
• Negative Skew: $\bar{Y} < Md_Y$
• Approximate Symmetry: $\bar{Y} \approx Md_Y$

Using the Interquartile Range (IQR)
• IQR = Q3 - Q1
• Spans the middle 50% of the data
• A measure of dispersion (or spread)
• Robustness of IQR (relative to variance)
• If Y is normally distributed, then:
  – $s_Y \approx IQR/1.35$
• So: if $Md_Y \approx \bar{Y}$ and $s_Y \approx IQR/1.35$, then
  – Y is approximately normally distributed

Example: The Observed Distribution of Age (p2_age)
[Figure: Distribution of Age. Histograms of p2_age (density, with normal curve for p2_age overlaid), one panel per survey mode: 1 = phone survey, 2 = web_a survey. Graphs by phone=1_web=2.]

Interpreting Box Plots
[Figure: box plots of p2_age, one panel per survey mode: 1 = phone survey, 2 = web_a survey. Graphs by phone=1_web=2.]
Median Age = ~49; IQR = ~25 years

Quantile Normal Plots
• Allow comparison between an empirical distribution and the Gaussian distribution
• Plots percentiles against expected normal
• Most intuitive:
  – Normal QQ plots
• Evaluate
[Figure: quantile-normal plot of p2_age against Inverse Normal]

Data Exploration in Stata
• Access National Security dataset (new)
• Using Age: univariate analysis [Stata]
• Using Age: split by survey mode [Stata]
• Exercises:
  – Univariate analysis of age
    • By mode, gender
  – Graphing: Produce
    • Histograms
    • Box plots
    • Q-Normal plots

For Next Week
• Read Hamilton
  – Appendix 1 (review carefully)
  – Pages 1-23; 29-37
• Review Herron and Jenkins-Smith
  – Homework #1
• Bivariate Regression Analysis
  – Theoretical model
  – Model formulation
  – Model assumptions
  – Residual analysis
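
The exercises listed under Data Exploration in Stata (univariate analysis of age, split by survey mode, then histograms, box plots, and Q-Normal plots) might be sketched as the short do-file below. This is an illustration under stated assumptions, not the course's actual script: the file name nss2005.dta and the survey-mode variable name mode (1 = phone, 2 = web) are hypothetical placeholders, since the slides show only that variable's label. p2_age is the age variable used in the slides.

```stata
* Sketch only: "nss2005.dta" and the variable name "mode" are hypothetical.
use "nss2005.dta", clear

* Univariate analysis of age: mean, std. dev., variance, and percentiles
summarize p2_age, detail

* N, mean, std. dev., median, and IQR of age by survey mode
tabstat p2_age, by(mode) statistics(n mean sd median iqr)

* Histograms with a normal curve overlaid, one panel per mode
histogram p2_age, normal by(mode)

* Box plots, one panel per mode
graph box p2_age, by(mode)

* Quantile-normal and symmetry plots for the pooled sample
qnorm p2_age
symplot p2_age
```

Comparing the median with the mean, and the standard deviation with IQR/1.35, in the tabstat output gives the informal normality check described under Using the Interquartile Range. Substituting a gender variable for mode would cover the "by gender" part of the exercise.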