2011 Boot Camp Basic Statistics: Conceptual understanding and application Dae Joong Kim John Glenn School of Public Affairs Ohio State University kim.2769@osu.edu Introduction Quantitative Analysis Courses Boot Camp: Basic statistics Online: http://glennschool.osu.edu/bootcamp/index.html 820: Data Analysis for Public Policy and Management (Autumn quarter) 822: Multivariate Data Analysis for Public Policy and Management (Spring quarter) 2 6 Statistics = probabilistic, random, or stochastic analysis → errors in equations or models; a useful tool e.g., Y=b0+b1X1+b2X2+e Mathematics = deterministic analysis → no errors in equations or models; mathematical language e.g., Y=b0+b1X1+b2X2 Statistical significance: whether relevant estimates are included in statistical confidence interval (C.I) (99%, 95% or 91%) Substantial significance: whether relevant estimates have expected sign (+ or -)and magnitude 2011 Boot Camp weight height 145 170 170 190 155 172 122 180 167 187 160 174 143 174 142 166 139 164 165 182 student1 student2 student3 student4 student5 student6 student7 student8 studnet9 Student10 Analysis: Mean of weight: 145 + 170 + 155 + ⋯ + 165 = 150.8 10 (145+155) Median of weight: 2 = 150 122 139 142 143 145 155 160 165 167 170 Correlation between weight and height: 170 w/o outlier 160 Descriptive statistics analysis (Data distribution analysis) Mean (average; expected utility) Variance KEY CONDITIONS Standard Deviation Normal distribution Frequency and percentage, etc. Central limit theorem Inferential statistics analysis Correlation (r): just relationship between variables (no direction; (Relationship analysis between variables: symmetric relationship) Explorative analysis or hypothesis testing ) corr(weight, height), -1 ≤ r ≤ 1 Correlation (positive, negative, or nothing) Regression: independent relationship of more than one variables with a variable (direction: asymmetric relationship) Mean difference analysis Weight = height + error, 0 ≤ r2 ≤ 1 t-test for two groups Causation: a correlation or regression is not same as causation if it does Analysis of variance (ANOVA) for multi-groups not satisfy 1) time order between variables (cause-effect), 2) no other Regression (independent variable and dependent variable) variables between them, and 3) direction change at the same time. Single regression model Linear regression model (OLS) BASIC ASSUMPTIONS Multiple regression model Non-linear regression model Linearity; Normality: Homoscedasticity; Interpretation of Data Output Independency; Interpret data outputs based on your background knowledge and Statistical Hypothesis Testing experience Null hypothesis (H0): (Different researchers can interpret the same data in different ways) hypothesis that researchers try to disprove Support your argument, or your theoretical model based on your Alternative or research) hypothesis (Ha) : hypothesis that researchers expect to support their models interpretation variables 150 Mean: arithmetic average of a set of number Median: the middle obs in a group of data when the data are ranked in order of magnitude Mode: the most common value in any distribution Sampling: Population: JGS master students Sample: 10 students observations Analysis of Data Sample Estimate; infer Population Research question: Relation between weight and height weight Surveys - questionnaire - interview Experiments Sampling Example w/ outlier 140 Collection of Data Observations (individuals or cases) Discrete variables (Nominal (e.g., sex (male or female), Variables = observations’ Ordinal scale(e.g., economic status attributes (low, Middle, high)) Continuous variables (Interval scale (e.g., income), Ratio scale (e.g., height; weight)) KEY CONDITIONS Randomness Level of measurement Representativeness (scales of measure) 130 The study of the collection, analysis, and interpretation of data related to your research questions or models Data 120 BASIC STATISTICS 165 170 175 180 185 190 height Regression outlier Statistics? Definition The study of the collection, analysis, and interpretation of DATA related to your research (questions or models) Research question: e.g., Is there any relationship of weight with height? questions or models 4 6 Statistics? Definition Statistics = probabilistic, random, or stochastic analysis → errors in equations or models; a useful tool e.g., Y=b0+b1X1+b2X2 + e Mathematics = deterministic analysis → no errors in equations or models; mathematical language e.g., Y=b0+b1X1+b2X2 5 6 DATA? Definition Data refers to qualitative (e.g., female/male) or quantitative attributes of a variable or set of variables. the results of measurements and can be the basis of graphs, images, or observations of a set of variables. Raw Data(=unprocessed data) refers to a collection of numbers and characters. 6 6 DATA? Example: Raw Data 7 6 DATA? Purpose To get necessary information and knowledge Data interpretation Information Discussion; agreement Knowledge “Data” is not “information” unless it is interpreted 8 6 DATA? Structure Observations (=individuals or cases) Data Variables = observations’ attributes e.g., Raw Data Discrete variables 1. Nominal e.g., sex (male or female) 2. Ordinal scale e.g., economic status (low, Middle, high) Continuous variables 3. Interval scale e.g., income ($100,000) 4. Ratio scale e.g., height; weight Level of measurement (scales of measure) 9 6 Collection of Data Definition The selection of a SAMPLE (a subset of individuals) from within a POPULATION to yield some Information/knowledge about the whole population, especially for the purposes of making predictions based on statistical inference *Population: all people or items with the characteristic that one wishes to understand 10 6 Collection of Data Structure Randomness Representativeness Surveys (Observation) - Questionnaire Paper Web, etc - Interview Face-to-face Phone,etc etc Sampling Population Sample Experiments Control group vs. experimental group Inference (estimation; prediction) 11 6 Collection of Data Survey or Interview Questionnaires 1. Nominal (=categorical or dummy) question e.g., your gender? Male___ Female ______ 2. Ordinal-scale question e.g., How much are you satisfied with your annual salary? a. very high b. high c. neutral d. low e. very low 3. Interval-scale question e.g., How is your annual salary? a. below 20,000 b. 20,000 – 50,000 c. 50,000 – 70,000 d. above 70,000 4. Ratio-scale question e.g., What is your height? ________ 12 6 Collection of Data Randomness and Representativeness The most important conditions to secure reliable sampling, or to eliminate bias Randomness: equal chance of selection (e.g., National lottery) Representativeness: the selection of individuals which are representative of a larger population 13 6 Collection of Data Randomness and Representativeness Low bias and high precision Quality of Data 14 6 Collection of Data Normal Distribution Symmetric distribution of values around the mean of a variable (Bell-shape distribution) s.d (s or σ) = 40 s.d (s or σ) = 24 s.d (s or σ) = 19 Mean (𝑋 or μ)=30 Mean (𝑋 or μ)=70) Mean (𝑋 or μ)=10 15 Collection of Data Normal Distribution (why important?) 1. Distributions of most variables tend to be normal, or they are usually quite close to normal distribution 2. It is easy for mathematical statisticians to work with. This means that many kinds of statistical tests can be derived for normal distributions. 3. If the mean and standard deviation of a normal distribution are known, it is easy to convert back and forth from raw scores to percentiles. 16 Collection of Data Standard Normal Distribution N ~ (0, σ2) Standard normal distribution is called “Z distribution” <probabilistic distribution> Z-distribution (n≥30) cf. t-distribution (n<30) 17 Collection of Data Z distribution table s.d t distribution table t= s.e: how likely the mean yo estimating is true mean e.g., Z =1.13=(1.1 + 0.03) e.g., t =1.26 (df=9) 87% 87% take more than 111.3 min when mean time on a review is 100 mins, and s.d is 10 mins. 18 Collection of Data Central Limit Theorem A foundational concept in statistical inference which states that if a sampling distribution is made up of samples containing more than 30 cases (each), the sample means will be normally distributed 19 6 Collection of Data Normal distribution: Mean, Median, Mode Mean: arithmetic average of a set of number Median: the middle observation in a group of data when the data are ranked in order of magnitude Mode: the most common value in any distribution 20 6 Collection of Data Skewedness Left-tail is longer Right-tail is longer Means are distorted by extreme values, or outliers 1. Using median instead of mean 2. If necessary, transform to normality, especially in regression analysis 21 6 Analysis of Data Purpose A step to find “a pattern of data” to get necessary information and knowledge 22 6 Analysis of Data Type Descriptive (statistical) analysis Numerical information (such as mean, median, standard deviation) that summarize and interpret some of the properties of a set of data (sample) but do not infer the properties of the population from which the sample was drawn. Inferential (statistical) analysis Deducing (inferring) the properties of a population from the analysis of the properties of a data sample drawn from it 23 6 Analysis of Data Descriptive Analysis Data distribution analysis: It tells us what values the variable takes and how often each value occur Mean (𝑿 (sample); μ (population)) - Arithmetic average or expected value of a variable (χ) 𝑛 𝑖=1 𝑥𝑖 𝑛 (n = number of observation) Variance (s2 (sample); σ2 (population)) - The average of the squared differences from the mean 𝑛 2 𝑖=1(𝑥𝑖 −𝑥) 𝑛−1 𝑛 2 𝑖=1(𝑥𝑖 −𝑥) 𝑛−1 Standard Deviation (s (sample); σ (population)) - A measure of dispersion, or variation, the square root of variance 24 6 Analysis of Data Descriptive Analysis Range: difference between maximum value and minimum value Min: the lowest, or minimum value in variable Max: the highest, or maximum value in variable Q1: the first (or 25th) quartile Q2: the third (or 75th) quartile Min 1 2 3 4 50th Mean or Mode 25th 5 6 7 8 9 Max 10 11 12 13 25 6 Analysis of Data Descriptive Analysis Frequency distribution - A table that shows a body of your data grouped according to numerical values Example: 26 6 Analysis of Data Descriptive Analysis Mean: arithmetic average of a set of number Median: the middle observation in a group of data when the data are ranked in order of magnitude Mode: the most common value in any distribution Height Mean: 170+190+172+180+187+174+174+166+164+182 10 Median: 174+174 2 = 𝟏𝟕𝟓.9 =174 164 166 170 172 174 174 180182 187 190 Mode: 174 Variance: (170−175.9)2 +(190−175.9)2 + ∙ ∙ ∙ +(164−175.9)2 +(182−175.9)2 (10−1) =74.77 Standard deviation: 74.77 = 8.65 27 Analysis of Data Descriptive Analysis: Using “Stata” 28 Analysis of Data Inferential Analysis Relationship analysis between variables: (Explorative analysis or hypothesis testing ) Main analysis: Mean difference analysis (t-test; ANOVA) and Relationship analysis (correlation; regression), etc. Mean difference analysis t-test for two groups Analysis of variance (ANOVA) for multi-groups 29 6 Analysis of Data Inferential Analysis: Mean Diff Example. t-test height difference between male and female male = 0 female =1 30 6 Analysis of Data Inferential Analysis Relationship analysis Correlation - Correlation means linear association between two variables - Three types of correlation X2 X2 X2 X1 positive X1 zero X1 negative 31 6 Analysis of Data Inferential Analysis Regression (independent and dependent relationships among variables) 1. Number of independent variable Single regression model: Association of one independent variable with one dependent variable e.g., Y = β0+β1X1+e where Y is dependent var, X is independent var, e is error, β0 is intercept, and β1 is slope of X1. Multiple regression model: Association of more than two independent variables with one dependent variable e.g., Y = β0+β1X1+β2X2+e 2. Shape of regression line Linear regression model (OLS) Non-linear regression model (MLE or GLS) Y Y X 6 X 32 Analysis of Data Correlation (r): just relationship between variables (no direction; symmetric relationship) corr(weight, height), -1 ≤ r ≤ 1 Regression: independent relationship of more than one variables with a variable (direction: asymmetric relationship) Weight(Y) = β0+β1*height(X1) + error, 0 ≤ r2 ≤ 1 r2 is the fraction of the sample variance of weight (Y) explained by (or predicted by) height (X1). Causation: a correlation or regression is not same as causation if it does not satisfy 1) time order between variables (causeeffect), 2) no other variables between them, and 3) direction change at the same time. 33 6 Analysis of Data 150 140 without outlier 130 weight 160 170 Correlation ( r) btwn weight and height: with outlier 120 (outlier) Regression (r2) btwn height and weight: 165 170 175 180 185 190 height r2 tells us that 22.78% of variance in weight is explained by height D.V I.V Slope (β1) 6 Intercept (β0) 34 Analysis of Data Hypothesis Testing Null hypothesis (H0): hypothesis that researchers try to disprove Alternative or research hypothesis (Ha) : hypothesis that researchers expect to support their models 35 6 Analysis of Data Hypothesis Testing: Example H0: Male is not taller than female Ha: Male is taller than female Pr(T>t)=0.0076 < 0.05 We can accept the hypothesis that male’s height is less than female’s height because the difference of height between female and male is statistically significant at 5% signficnace level. 6 36 Interpretation of Data Outputs Interpretation of data outputs based on your background knowledge and experience is the last step of statistics in social science. - Different researchers can interpret the same data in different ways Support your argument, or your theoretical model based on your interpretation 37 6 Interpretation of Data Outputs Interpretation of data outputs based on your background knowledge and experience is the last step of statistics in social science. - Different researchers can interpret the same data in different ways Support your argument, or your theoretical model based on your interpretation 38 6 Interpretation of Data Outputs Statistical significance: whether relevant estimates are included in a statistical confidence interval (C.I) (99%, 95% or 90%), or at a significant level (α=0.01(t value=2.58), 0.05 (t value=1.96) or 0.1 (t value=1.64) Substantial significance: whether relevant estimates have expected sign (+ or -)and magnitude 39 6 Interpretation of Data Outputs • Not statistically significant at 5% significance level, but significant at 1% level; and its sign is positive • If we assume this one is statistically significant, we can interpret that for one center meter increase of height, weight increases by .98 pounds 40 6 Practice 41 6 Practice: Questions Q1. n (number of observations) Q2. 𝑛 𝑖=1 𝑋𝑖 (sum of Xs) Q3. 𝑋 (mean) Q4. Median Mode Q5. Five number summery: Min (lowest value) Q1 (25th quartile value) M (median) Q3 (75th quartile value) Max (highest value) Q6. s2 (variance) Q7. s (standard deviation) 42 6 Practice: Answers 43 6 Practice: Answers 44 6 Practice: Extra Qs 45 6