Soci 4003: Statistics for the Social Sciences
Chapter 2, Aron, Aron, and Coups: The Mean, Variance, SD, and Z Scores

I. Measures of Central Tendency: Mean, Median, and Mode

A) Mean: the statistic social scientists typically use to describe data. However, what are its limitations? Outliers.

   3 Characteristics of the MEAN

   1) The mean is always the center of any distribution of scores, in the sense that it is the point around which all of the scores cancel out. Symbolically:

         Σ (Xi - Mean) = 0

      Or, we take each score in a distribution, subtract the mean from it, and add all of the differences; the resulting sum will always be zero. To illustrate (mean = 78):

         X      X - Mean
         65     65 - 78 = -13
         73     73 - 78 =  -5
         77     77 - 78 =  -1
         85     85 - 78 =   7
         90     90 - 78 =  12
                    Sum =   0

   2) Least squares principle: the mean is the point in a distribution around which the variation of the scores (as indicated by the squared differences) is minimized. If the differences between the scores and the mean are squared and then added, the result will be less than the sum of the squared differences between the scores and any other point in the distribution. This simply means that the mean is closer to all of the scores, in terms of squared distance, than any other measure of central tendency.

         Σ (X - Mean)² = minimum

      Remember our last example, with squared deviations taken around the mean (78) and around another point (77):

         X      X - 78    (X - 78)²     X - 77    (X - 77)²
         65      -13        169          -12        144
         73       -5         25           -4         16
         77       -1          1            0          0
         85        7         49            8         64
         90       12        144           13        169
         Sum                388                     393

   3) Unlike the mode and the median, the mean is affected by every score.

B) Median: the score that falls directly in the middle of a distribution

C) Mode: the score that occurs most often in the frequency distribution

When should you use each of these statistics?
   1) Mode: when the variable is nominal; you want to report the most common score
   2) Median: when the variable is ordinal; when an interval-ratio variable is highly skewed (lying with statistics: the mean can make income in a mixed-income community look higher than it typically is); you want to report the middle score
   3) Mean: when the variable is interval or ratio; you want to report the typical score; you want to do further statistical analysis

D) Rate: another useful way of summarizing the distribution of a single variable.

      Rate = (# of actual occurrences / # of possible occurrences) per some unit of time

      Crude death rate: (100 / 7000) x 1000 = 14.29, or 14.29 deaths per 1000 people
      Ex) see Healey page 43

II. Range, Variance, and Standard Deviation: Measures of Dispersion

To provide a full description of a distribution of scores, measures of central tendency should be combined with measures of dispersion.

1) Range: the distance between the highest and lowest scores in a distribution. Unfortunately, because it is based on only two scores (the highest and lowest), the range is often deceptive as a measure of dispersion. Outliers in a distribution make the range especially problematic.

2) Variance and Standard Deviation

   The range is problematic because it does not use all the scores in the distribution and, in this sense, does not capitalize on all the available data. A good measure of dispersion should:
   a) Use all scores in the distribution
   b) Describe the average or typical deviation of the scores. The statistic should give us an idea of how far the scores are from each other or from the center of the distribution
   c) Increase in value as the distribution of scores becomes more diverse

   To create such a statistic, here are the logical steps:
   1) To describe the distance between each score and the mean, we compute the deviation (X - mean) for each score.
   2) However, this gives us many scores.
      To create a single useful statistic, we might think about summing the deviations but, as you know, that sum will always equal 0.
   3) The solution is to square the deviations, which makes every value positive.
   4) One more problem reveals itself. Although the sum of squared deviations lets us see variability (higher values = more dispersion, and vice versa), its size depends heavily on the size of the sample or population. Therefore, we must standardize it by dividing by the sample or population size (discuss rates).
   5) The result is the formula for the VARIANCE:

         s² = Σ (X - mean)² / N

      - Discuss n - 1: dividing by N tends to underestimate the population variance when we are working with a sample, so for samples we divide by n - 1 instead to correct for that.
   6) The standard deviation is much simpler: it is the square root of the variance.

   Interpreting the Standard Deviation (s): You may be asking yourself, why does this matter? What does this number tell me? This measure is meaningful for several reasons:
   a) Interpreting it involves understanding deviation and the normal curve
   b) Think of it as an index that increases as dispersion increases
   c) It can be used to compare the dispersion of 2 samples

III.
Normal or "Bell-Shaped" Curve and Z Scores

The normal curve is central to the theory that underlies inferential statistics. The normal curve is a THEORETICAL model, or line chart, that is:
   a) unimodal (has a single mode or peak)
   b) perfectly smooth and symmetrical (unskewed), so that its mean, median, and mode are all exactly the same value

Of course, no empirical distribution has a shape that perfectly matches this ideal model, but many variables (standardized test scores, test results of large classes, heights and weights) are close enough to permit the assumption of normality, especially when the data come from large random samples. In turn, this assumption makes possible one of the most important uses of the normal curve: the description of empirical distributions based on our knowledge of the theoretical normal curve.

NOTE: On any normal curve, distances along the abscissa (horizontal axis), when measured in standard deviations, always encompass exactly the same proportion of the total area under the curve.

      Distance from the mean       Total area     Area on each side of the mean
      ±1 standard deviation         68.26%         34.13%
      ±2 standard deviations        95.44%         47.72%

The relationship between distance from the mean and area allows us to describe empirical distributions that are at least approximately normal. The position of individual scores can be described with respect to the mean, the distribution as a whole, or any other score in the distribution.

Computing Z Scores

To find the percentage of the total area (or number of cases) above or below scores in an empirical distribution, the original scores must first be expressed in units of the standard deviation, that is, converted into Z SCORES. The original scores could be in any unit of measurement (feet, IQ points, dollars), but Z scores always have the same values for their mean (0) and standard deviation (1).
   - Think of converting the original scores into Z scores as a process of changing scales, similar to changing from meters to yards, or kilometers to miles.
     The original (or raw) scores and Z scores are two equally valid but different ways of measuring distances under the normal curve.

When computing Z scores, we convert the original units of measurement to Z scores and thus "standardize" the normal curve to a distribution that has a mean of 0 and a standard deviation of 1:

      Z = (X - mean) / s

A Z score of +1 indicates that the original score lies 1 standard deviation above the mean.

Normal Curve Table: allows you to find the proportion of scores between the mean and a given Z score, and the proportion above or below that Z score (see index)

IV. SPSS example

* Univar.sps.
* Sample SPSS descriptive statistics example. Replicates examples in handout.
* This program is really quite short, but these painstakingly detailed
* comment lines stretch it out. Comment lines are very handy though
* if you are ever trying to figure out why you did something the way you did.
* Also, while I am giving you this program, this could all easily be done
* interactively using SPSS Menus. In effect, SPSS will generate most
* of this syntax for you.

* First, enter the data. Normally I would create a separate data file, but for
* now I will enter the data directly into the program using the
* data list, begin data and end data commands.

data list free / X.
begin data.
100 150 200 250 250 250 250 325 325 400
end data.

* The formats command tells SPSS that X is measured in dollars.
* Not essential, but it helps make the display easier to read. This could
* also be done using the SPSS Data Editor. The Var Labels Command
* will also make the output easier to read.

Formats X (dollar8).
Var Labels X "Weekly Income".

* Next, run the frequencies command, indicating what stats I want.
* I used SPSS menus to generate the syntax for this command, but it
* could also be typed in directly.

FREQUENCIES
  VARIABLES=x
  /STATISTICS=STDDEV VARIANCE MEAN MEDIAN MODE SUM
  /ORDER= ANALYSIS .

* Now, here is how to run the problem when the data are already grouped
* in a frequency distribution.
* The variable WGT indicates how often the value occurs in the data.

data list free / X WGT.
begin data.
100 1
150 1
200 1
250 4
325 2
400 1
end data.

Formats X (dollar8).
Var Labels X "Weekly Income"/ Wgt "Weighting Var".

* The Weight command causes cases to be weighted by the # of times
* the value occurs.

Weight by Wgt.

* Now just run the frequencies again.

FREQUENCIES
  VARIABLES=x
  /STATISTICS=STDDEV VARIANCE MEAN MEDIAN MODE SUM
  /ORDER= ANALYSIS .
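
Cross-check outside SPSS (illustrative only)

The short Python sketch below recomputes, by hand, the quantities covered in this handout. It is not part of the Univar.sps program. The variable and function names (scores, income, sum_sq, z_scores, and so on) are made up for this illustration. It also assumes the sample (n - 1) formulas for the variance and standard deviation of the weekly income data, since that is what SPSS reports; the population (divide by N) variance is computed alongside for comparison.

```python
from statistics import mean, median, mode, pvariance, stdev, variance

# Exam scores from the Section I illustration.
scores = [65, 73, 77, 85, 90]
m = mean(scores)

# Characteristic 1: deviations from the mean always sum to zero.
deviation_sum = sum(x - m for x in scores)


def sum_sq(values, point):
    """Sum of squared differences between each value and a chosen point."""
    return sum((x - point) ** 2 for x in values)


# Characteristic 2 (least squares): the sum is smallest around the mean.
ss_mean = sum_sq(scores, m)    # around the mean
ss_other = sum_sq(scores, 77)  # around another point, 77

# Weekly income data from the SPSS example in Section IV.
income = [100, 150, 200, 250, 250, 250, 250, 325, 325, 400]
inc_mean = mean(income)
inc_median = median(income)
inc_mode = mode(income)

# Variance: divide by N (population) or by n - 1 (sample).
var_pop = pvariance(income)
var_sample = variance(income)
sd_sample = stdev(income)  # square root of the sample variance

# Z scores: each raw score expressed in standard deviation units.
z_scores = [(x - inc_mean) / sd_sample for x in income]

print("scores: mean =", m, " sum of deviations =", deviation_sum)
print("sum of squares around mean =", ss_mean, " around 77 =", ss_other)
print("income: mean =", inc_mean, " median =", inc_median, " mode =", inc_mode)
print("variance (N) =", var_pop, " variance (n - 1) =", round(var_sample, 2))
print("standard deviation (n - 1) =", round(sd_sample, 2))
print("Z score for $400 =", round(z_scores[-1], 2))
```

For the exam scores, the sums of squares come out to 388 around the mean and 393 around 77, as in the Section I example. For the weekly income data, the mean, median, and mode are all 250, the sample variance is about 7638.89, and the sample standard deviation is about 87.40.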