Learning the Language of the Statistician
• The following slides contain many of the symbols we will be using in this class. These are the symbols we will be using in formulas. While I do not require you to memorize all of the formulas, it is important that you know what these symbols mean. You will be expected to memorize a few of the simpler formulas for the departmental final.
• To do responsible research, you must assimilate, integrate and apply. This PowerPoint presentation concentrates on assimilating this basic information.

                          Sample     Sampling Distribution        Population
--------------------------------------------------------------------------------------------------------
Individual Score          yi                                      yi
Sample Size               n                                       N
Mean                      ȳ                                       µ (Mu)
Standard Deviation        s = √s²    σ/√n, estimated by s/√n      σ (Sigma)
Variance                  s²                                      σ²
Sum                       ∑                                       ∑
Proportion                p                                       π (Pi)
Hypothesized Mean         ȳ0                                      µ0
Hypothesized Proportion   p0                                      π0

Stating Hypotheses with Symbols
• One Sample Hypothesis Test for a Proportion
  o Null hypothesis
    • p = π. The sample proportion is the same as the population proportion.
  o Research hypothesis
    • p ≠ π. The sample proportion is NOT the same as the population proportion. If you have a theory, you can use a one-tailed test and indicate that it is greater or less than the population proportion.
• One Sample Hypothesis Test for a Mean
  o Null hypothesis
    • ȳ = µ. The sample mean is the same as the population mean.
  o Research hypothesis
    • ȳ ≠ µ. The sample mean is NOT the same as the population mean. If you have a theory, you can use a one-tailed test and indicate that it is greater or less than the population mean.
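As a rough illustration of how a one-sample hypothesis about a proportion might be checked, the Python sketch below computes a large-sample Z statistic and a two-tailed p-value. The formula Z = (p − π0) / √(π0(1 − π0)/n) is the standard large-sample form and is an assumption here, as are the function name and the example numbers, which are hypothetical.

```python
from math import sqrt
from statistics import NormalDist


def one_sample_proportion_z(p_sample, pi_0, n):
    """Return (z, two_tailed_p) for the null hypothesis that the proportion is pi_0."""
    # Standard error of a sample proportion when the null hypothesis is true.
    standard_error = sqrt(pi_0 * (1 - pi_0) / n)
    z = (p_sample - pi_0) / standard_error
    # Two-tailed: probability of a Z at least this far from 0 in either direction.
    two_tailed_p = 2 * (1 - NormalDist().cdf(abs(z)))
    return z, two_tailed_p


# Hypothetical example: 60 of 100 sampled students agree; the hypothesized proportion is .50.
z, p_value = one_sample_proportion_z(p_sample=0.60, pi_0=0.50, n=100)
print(f"z = {z:.2f}, two-tailed p = {p_value:.4f}")
```

If a theory predicts the direction in advance (a one-tailed test), the p-value would be half of the two-tailed value, provided the sample falls on the predicted side.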
Stating Hypotheses with Symbols
• Chi Square
  o Null hypothesis
    • H0: E = O. The expected values equal the observed values.
    • The dependent variable is NOT contingent on the independent variable in the population.
  o Research hypothesis
    • H1: E ≠ O. The expected values do not equal the observed values.
    • The dependent variable is contingent on the independent variable in the population.
NOTE: For an Elaborated Chi Square you simply state that E = O for all of the independent/dependent combinations for the null hypothesis. For the research hypothesis you state that E ≠ O for at least one of the combinations. You would actually test each dependent/independent combination separately.

Stating Hypotheses with Symbols
• One-Way Anova with 2 groups
  o Null hypothesis
    • H0: µ1 = µ2. The Means are equal, OR the Mean of Group 1 is the same as the Mean of Group 2 in the population.
  o Research hypothesis
    • Two-tailed (the one the computer uses)
      H1: µ1 ≠ µ2. The Means are not equal, OR the Mean of Group 1 is not the same as the Mean of Group 2 in the population.
    • One-tailed (state a direction)
      H1: µ1 < µ2. The Mean of Group 1 is lower than the Mean of Group 2 in the population.
      H1: µ1 > µ2. The Mean of Group 1 is higher than the Mean of Group 2 in the population.

Stating Hypotheses with Symbols
• One-Way Anova with more than 2 groups*
  o Null hypothesis
    • H0: µ1 = µ2 = … = µk. The Means of all the groups are equal.
  o Research hypothesis
    • Two-tailed (the one the computer uses)
      H1: µ1, µ2, …, µk are not all equal. The Mean of one group is not equal to the Mean of at least one other group.
  o * This is still bi-variate. You don't have more variables, only more categories in the categorical variable.

Stating Hypotheses with Symbols
• Bi-Variate Regression
  o Null hypothesis
    • H0: β1 = 0. The regression slope is not different from 0 in the population.
    • There is no relationship between the independent and dependent variables in the population.
  o Research hypothesis
    • H1: β1 ≠ 0. The Slope is different from 0 in the population.
    • There is a relationship between the independent and dependent variables in the population.
• Multi-Variate Regression
  o Null hypothesis
    • H0: β1 = β2 = … = βk = 0. None of the regression slopes is different from 0 in the population.
    • There is no relationship between the independent variables and the dependent variable in the population.
  o Research hypothesis
    • H1: at least one of β1, …, βk ≠ 0. At least one of the Slopes is different from 0 in the population.
    • There is a relationship between at least one of the independent variables and the dependent variable in the population.

Matching Variables with Types of Analysis
• Chi-square (2 categorical variables)
  o type of car you drive by gender
  o race by political preference
  o race by eye color
  o gender by YES/NO questions
• Anova (1 categorical and 1 continuous variable)
  o gender by yearly income
  o gender by score on self esteem index
  o race by yearly income
  o political preference by yearly income
  o age by whether or not you have children
• Bi-Variate Regression (2 continuous variables)
  o yearly income by years of education
  o years married by marital satisfaction (scale score)
  o age by number of children
• Multiple Regression (continuous/dummy independent variables and a continuous dependent variable)
  o number of dates per year by yearly income, age, height, gender (dummy variable)
  o poverty rates by sex ratio, percent single headed household, percent employed

Statistics That Do Not Use Hypotheses
• Confidence Intervals
  o We generally do not state a hypothesis for a Confidence Interval. Confidence Intervals are used to estimate a population mean or proportion based on a sample mean or proportion. Opinion polls, for example, use Confidence Intervals to predict election results.
• Pearson Correlation (correlation coefficient or r)
  o We generally do not associate Pearson Correlation Matrices with hypotheses. We generally use Pearson Correlation Matrices for diagnostic purposes and to test the strength of bi-variate relationships.
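To make "strength of a bi-variate relationship" concrete, here is a minimal Python sketch that computes a Pearson correlation from deviations around the two means. The function name and the education/income numbers are hypothetical and only for illustration.

```python
from math import sqrt


def pearson_r(x, y):
    """Pearson correlation between two equal-length lists of numbers."""
    n = len(x)
    x_bar = sum(x) / n
    y_bar = sum(y) / n
    # Numerator: sum of cross-products of deviations from the two means.
    cross = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
    # Denominator: square root of the product of the two sums of squares.
    ss_x = sum((xi - x_bar) ** 2 for xi in x)
    ss_y = sum((yi - y_bar) ** 2 for yi in y)
    return cross / sqrt(ss_x * ss_y)


# Hypothetical data: years of education and yearly income (in $1,000s).
education = [10, 12, 12, 14, 16, 18]
income = [22, 30, 28, 41, 50, 62]
r = pearson_r(education, income)
print(f"r = {r:.3f}, r squared = {r * r:.3f}")
```

Values of r near +1 or -1 indicate a strong linear relationship, and values near 0 indicate a weak one.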
Equations/Formulas
Z Tests
• Z scores
  o $Z = \frac{y_i - \mu}{\sigma}$
  o Where $y_i$ = individual's score
  o $\mu$ = population mean
  o $\sigma$ = population standard deviation
  o Information needed
    • Population mean and standard deviation
  o Example of when we would use this
    • If you knew an individual's SAT/ACT score, you could determine what percentile they scored in (e.g., the 95th percentile)
    • OR if you know what percentile they are in, you can determine their score.

Equations for Inferential Statistics
• Summary Statistics
  o Mean
    • $\bar{y} = \frac{\sum y_i}{n}$
  o Median
    • Position $\frac{n+1}{2}$: order the values and count up this far
  o Variance
    • $s^2 = \frac{\sum (y_i - \bar{y})^2}{n-1}$
  o Standard Deviation
    • $s = \sqrt{s^2}$

Inferring a Population Mean or Proportion Based on Sample Mean or Proportion
• The following slides focus on how to estimate a population mean or proportion if we ONLY have a random sample.
• In these cases we estimate one point in the population (e.g., the mean IQ of USU students)
• BUT we build a confidence interval around this single point, generally a 95% confidence interval

A One or Large Sample Hypothesis Test
• In the following slides we compare a sample mean or proportion with a population mean or proportion.
• We want to know if our sample mean or proportion is different from the population mean or proportion
• The population mean or proportion could actually be a mean/proportion that is specified by a theory or by past research (rather than a number computed from a population data set)

Equations/Formulas for One Sample Hypotheses Tests
• The equations (outlined in red on the slides)
  o One sample hypothesis test for Proportion: $Z = \frac{p - \pi_0}{\sqrt{\pi_0 (1 - \pi_0) / n}}$
  o One sample hypothesis test for Mean: $Z = \frac{\bar{y} - \mu_0}{s_{\bar{y}}}$, where $s_{\bar{y}} = \frac{s}{\sqrt{n}}$
• What do the symbols mean
  o One sample hypothesis test for Proportion
    • $p$ = proportion in the sample
    • $\pi_0$ = proportion or hypothesized proportion in the population
    • $n$ = sample size
    • $Z$ = computed statistic
  o One sample hypothesis test for Mean
    • $\bar{y}$ = mean in the sample
    • $\mu_0$ = mean or hypothesized mean in the population
    • $n$ = sample size
    • $s_{\bar{y}}$ = standard error, an estimate of the standard deviation of the sampling distribution
    • $\frac{s}{\sqrt{n}}$ = computation for estimating the standard error: the standard deviation of the sample divided by the square root of the sample size

Symbols for Statistics that Infer the Relationship in the Sample to the Population

Test          Symbol(s)   Interpretation
--------------------------------------------------------------------------------
Chi Square    χ²          Chi Square statistic
Regression    β           beta, slope in population
              b           slope in sample
              a           alpha, intercept or constant in prediction formula
              X1…Xk       value of the X variables
              ŷ           y-hat or predicted Y
              ȳ           Y bar or the mean of Y
Anova         µ           Mu or mean in population
              yi − ȳ      deviation of an individual score from the mean

Chi-Square Equation
• $\chi^2 = \sum \frac{(O - E)^2}{E}$, where O = observed value and E = expected value

Equations/Formulas for Inferential Statistics
• Pearson Correlation Coefficient and R2
  o Formula
    • $r = \frac{\sum (x_i - \bar{x})(y_i - \bar{y})}{\sqrt{\sum (x_i - \bar{x})^2 \sum (y_i - \bar{y})^2}}$
    • $R^2$ = r squared
• Multiple Regression
  o Prediction Equation
    • $\hat{y} = a + b_1 x_1 + b_2 x_2 + b_3 x_3 + \dots$
    • $\hat{y}$ = predicted score for the dependent variable
    • a = intercept or constant
    • b = slope or parameter estimate for an independent variable: the change in the Y variable for every 1-unit increase in X
    • x = value of the X variable, taken from the codebook

Equations/Formulas for Inferential Statistics
• Anova
  o Formula
    • $TSS = \sum y_{ij}^2 - \frac{G^2}{n}$
    • $SSB = \sum \left( \frac{T_i^2}{n_i} \right) - \frac{G^2}{n}$
    • $SSW = TSS - SSB$
    • $s_B^2 = \frac{SSB}{k-1}$ and $s_W^2 = \frac{SSW}{n-k}$
    • $F = \frac{s_B^2}{s_W^2}$ (the F statistic)
  o What symbols mean
    • TSS = Total Sum of Squares
    • SSB = Sum of Squares Between
    • SSW = Sum of Squares Within
    • G = grand total of all scores, $T_i$ = total of the scores in group i, $n_i$ = size of group i
    • n = total number of cases, k = number of groups
    • df between = k − 1, df within = n − k

Anova and Regression Sums of Squares
• Anova
  o TSS = Total Sum of Squares
  o SSW = Sum of Squares within each group
  o SSB = Sum of Squares between the groups
  o SSB/TSS = R square, or the proportion of the total sum of squares that is explained by group membership
• Regression
  o TSS = Total Sum of Squares
  o SSM = Sum of Squares Model
  o SSE = Sum of Squares Error

Equations/Formulas for Inferential Statistics
• Two Sample T-test
  o Formula
    • $t = \frac{\bar{y}_1 - \bar{y}_2}{s_{\bar{y}_1 - \bar{y}_2}}$
    • The denominator is the estimated standard error of the difference between the two means, computed as $s_{\bar{y}_1 - \bar{y}_2} = s_p \sqrt{\frac{1}{n_1} + \frac{1}{n_2}}$
  o Pooled standard deviation
    • $s_p = \sqrt{\frac{(n_1 - 1) s_1^2 + (n_2 - 1) s_2^2}{n_1 + n_2 - 2}}$
    • $s_1$ = standard deviation of sample 1 and $s_2$ = standard deviation of sample 2
  o What symbols mean
    • t = computed test statistic
    • $\bar{y}_1$ = mean of sample one
    • $\bar{y}_2$ = mean of sample two
    • $n_1$ = size of sample 1 and $n_2$ = size of sample 2
    • Degrees of freedom = df = $n_1 + n_2 - 2$
  o Uses a T distribution

Equations/Formulas for Inferential Statistics
• Mann Whitney
  o Focuses on ranks (medians) rather than on means
  o Two Groups
  o Formula
    • $Z = \frac{T_1 - E(T_1)}{\sqrt{Var(T_1)}}$
    • $E(T_1) = \frac{n_1 (n + 1)}{2}$
    • $Var(T_1) = \frac{n_1 n_2}{n} s^2$, where $s^2 = \frac{\sum (y_i - \bar{y})^2}{n - 1}$ is computed on the ranks of all n cases
  o Steps
    • Rank values from smallest to largest
    • Sum ranks in smaller group = $T_1$
    • Compute $E(T_1)$
    • Compute the variance $Var(T_1)$
  o Uses a Z distribution.
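The Mann Whitney steps above (rank all values, sum the ranks in the smaller group, then compute E(T1) and Var(T1)) can be sketched in Python roughly as follows. This is only an illustration under stated assumptions: the function names are hypothetical, tied values are given the average of their ranks, and s² is taken to be the variance of the combined ranks.

```python
from math import sqrt


def average_ranks(values):
    """Rank values from smallest to largest; tied values share the mean rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    start = 0
    while start < len(order):
        end = start
        # Extend the run while the next sorted value is tied with the current one.
        while end + 1 < len(order) and values[order[end + 1]] == values[order[start]]:
            end += 1
        shared = (start + end) / 2 + 1  # mean of the 1-based positions start..end
        for k in range(start, end + 1):
            ranks[order[k]] = shared
        start = end + 1
    return ranks


def mann_whitney_z(group_1, group_2):
    """Z statistic based on T1, the rank sum of group_1 (pass the smaller group first)."""
    n1, n2 = len(group_1), len(group_2)
    n = n1 + n2
    ranks = average_ranks(list(group_1) + list(group_2))
    t1 = sum(ranks[:n1])                      # sum of ranks in group 1
    expected_t1 = n1 * (n + 1) / 2            # E(T1)
    mean_rank = sum(ranks) / n
    s2 = sum((r - mean_rank) ** 2 for r in ranks) / (n - 1)  # variance of the ranks
    var_t1 = n1 * n2 * s2 / n                 # Var(T1)
    return (t1 - expected_t1) / sqrt(var_t1)


# Hypothetical scores for two small groups.
z = mann_whitney_z([1, 2, 3], [4, 5, 6, 7])
print(f"Z = {z:.3f}")
```

A Z far from 0 in either direction suggests the two groups differ in where their ranks fall.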
Equations/Formulas for Inferential Statistics
• Kruskal Wallis
  o Focuses on ranks (medians) rather than on means
  o More than Two Groups
  o Formula
    • $H = \frac{12}{n (n + 1)} \sum \frac{T_k^2}{n_k} - 3 (n + 1)$
  o What symbols mean
    • $T_k$ = sum of ranks for the kth sample
    • n = total number of cases
    • $n_k$ = number of cases for the kth sample
  o Uses a χ² distribution
  o Degrees of freedom = k − 1 (where k is the number of groups)
  o Use when you want to compare more than two groups, and the distribution is not normal.

Equations/Formulas for Inferential Statistics
• Formulas for Sample Size
  o Sample size: $n = \frac{N (.9604)}{D^2 (N - 1) + .9604}$
  o D = margin of error (usually .05)
  o N = population size
  o .9604 = a constant related to being at least 95% sure
  o This sample size is large enough that we can be at least 95% sure we can generalize to the population with a margin of error of .05
• Prepared by Dr. Carol Albrecht