Discrete Data Distributions and Summary Statistics

Terms: histogram, mode, mean, range, standard deviation, outlier

Discrete vs. Continuous Data

dis·crete adj.
1. Constituting a separate thing.
2. Consisting of unconnected distinct parts.
3. Mathematics: Defined for a finite or countable set of values; not continuous.

con·tin·u·ous adj.
1. Uninterrupted in time, sequence, substance, or extent.
2. Attached together in repeated units: a continuous form fed into a printer.
3. Mathematics: Of or relating to a line or curve that extends without a break or irregularity.

discrete
• Usually related to counts.
• Variable values for different units often tie.
• Averaging two values does not necessarily yield another possible value.

continuous
• Any value in some interval.
• A tie among different units is in theory virtually impossible (and in practice very rare).
• Ties (due to rounding) are infrequent in practice.
• The average of any two different values is another (and different) possible value.

Distribution

The distribution of a variable tells us what values it takes and how often it takes those values.
MAKE A PICTURE! For discrete quantitative data, use a relative frequency chart / histogram* to display the distribution.
* Fundamentally these are the same thing.

[Figures: Left Skewed Distribution; Right Skewed Distribution; Symmetric Distribution]

Outlier

outlier noun
1: something that is situated away from or classed differently from a main or related body
2: a statistical observation that is markedly different in value from the others of the sample

Measures of Center

Median
• Half the data are above the median and half are below.
• Not well suited to highly discrete data. More later about this.

(Sample) Mean
• Sum all the data values x, then divide by how many there are (n).
• Denoted x̄ ("x bar").

Both have the same measurement units as the data.
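As a rough illustration of a relative frequency distribution and the two measures of center just described, here is a minimal Python sketch. The data set (number of siblings reported by eight students) is hypothetical and not taken from the slides; the variable names are likewise only illustrative.

```python
from collections import Counter
from statistics import median

# Hypothetical discrete data (NOT from the slides): siblings reported by 8 students.
siblings = [0, 1, 1, 2, 2, 2, 3, 5]

n = len(siblings)
counts = Counter(siblings)

# Relative frequency: how often each value occurs, as a share of all units.
rel_freq = {value: counts[value] / n for value in sorted(counts)}

# Sample mean: sum all the data, then divide by how many (n).
x_bar = sum(siblings) / n

# Median: half the data are at or below it, half at or above it.
med = median(siblings)

for value, share in rel_freq.items():
    print(f"{value}: {share:.3f}")
print("mean =", x_bar, "median =", med)
```

Several units share the same value here, which is typical of count data.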
Less Important Measures of Center

Midrange
• Average the minimum and maximum.
• For highly skewed data, the midrange is often a value that is quite atypical.

Mode
• Most common value: highest proportion of occurrence.
• There can be 2 (or more) modes if there are ties in relative frequencies.
• Generally found by graphical inspection.
• Sometimes not anywhere near any "center."

Both have the same measurement units as the data.

Measures of Spread / Variation ("spread" and "variation" mean the SAME THING)

• Range = Max – Min. In statistics, the range is a single number.
• Interquartile Range: better suited to continuous data. More later about this.
• Variance / Standard Deviation

All but variance have the same measurement units as the data.

Variance S²

Approximately the mean of the squared deviations from the mean (the divisor is n – 1 rather than n).
1. Obtain the mean.
2. Determine, for each value, the deviation from the mean.
3. Square each of these deviations.
4. Sum these squares.
5. Divide this sum by one fewer than the number of observations to get the variance.
Measure of squared variation from the mean.

Standard Deviation S

Square root of the variance.
Measure of spread / variation (from the mean).
Same measurement units as the data.
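The five variance steps above can be sketched in Python as follows. This is a minimal illustration, not a prescribed method; the function name `sample_variance` is an assumption, and the five values are the Large Section age guesses that the deck works through by hand in its Standard Deviation Calculation slides.

```python
import math

def sample_variance(data):
    """Sample variance following the five steps listed on the slide."""
    n = len(data)
    mean = sum(data) / n                      # 1. obtain the mean
    deviations = [x - mean for x in data]     # 2. deviation of each value from the mean
    squared = [d ** 2 for d in deviations]    # 3. square each deviation
    total = sum(squared)                      # 4. sum the squares
    return total / (n - 1)                    # 5. divide by one fewer than n

# Large Section age guesses used in the deck's hand calculation.
ages = [43, 48, 42, 44, 47]

variance = sample_variance(ages)
sd = math.sqrt(variance)                      # standard deviation = square root of variance

print(round(variance, 2), round(sd, 2))       # 6.7 2.59
```

The variance is in squared units (years squared here), while the standard deviation is back in the units of the data.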
Comparing Means & Standard Deviations

[Dotplots of Age Guess (axis 38 to 50) for the Small Class and the Large Class]
Small: Mean = 41.60, SD = 2.07
Large: Mean = 44.80, SD = 2.59

Starting from Mean = 44.80, SD = 2.59:

Values added    New Mean    New SD
40 and 50       44.86       3.58
42 and 48       44.86       2.73
45 and 45       44.86       2.12

[Histograms on an axis from 0 to 16]
Mean = 4.0, SD = 3.0 compared with Mean = 8.0, SD = 3.0
Mean = 8.0, SD = 3.0 compared with Mean = 8.0, SD = 6.0

Computing Mean & Standard Deviation

Data listed by unit
1. By hand with calculator support (UGH)
2. Using your calculator's built-in statistics functionality
   • 60 second quiz: Determine and write down the mean and standard deviation of at most 10 data values in under 1 minute
3. Using Excel
4. Using Minitab

Z = # of St Devs from Mean

"…within Z standard deviations of the mean…"
Determine Z × SD.
Find the values Mean – Z × SD and Mean + Z × SD.
This means: "…between __________ and ______________."

Mean & Standard Deviation: Where the data are

In general you'll find that about
• 68% of the data falls within 1 standard deviation of the mean
• 95% falls within 2
• almost all (about 99.7%) falls within 3
There are exceptions. These guidelines hold fairly precisely for data that has a bell-shaped (Normal) histogram.

Range Rule of Thumb

To guess the standard deviation, take the usual range of the data and divide by four.
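The "within Z standard deviations" interval and the Range Rule of Thumb can be sketched as two small Python helpers. The function names (`z_interval`, `share_within`, `range_rule_sd`) are assumptions made for illustration. The mean 44.80, SD 2.59 and the five age guesses come from the deck; the "usual range" of 10 to 30 is a hypothetical input.

```python
from statistics import mean, stdev

def z_interval(center, sd, z):
    """Return (Mean - Z*SD, Mean + Z*SD)."""
    return center - z * sd, center + z * sd

def share_within(data, z):
    """Fraction of the data lying within z sample SDs of the sample mean."""
    low, high = z_interval(mean(data), stdev(data), z)
    return sum(low <= x <= high for x in data) / len(data)

def range_rule_sd(usual_min, usual_max):
    """Range Rule of Thumb: guess the SD as (usual range) / 4."""
    return (usual_max - usual_min) / 4

ages = [43, 48, 42, 44, 47]          # Large Section age guesses from the deck

low, high = z_interval(44.80, 2.59, 2)
print(f"within 2 SDs: between {low:.2f} and {high:.2f}")
print("share within 1 SD:", share_within(ages, 1))
print("share within 2 SDs:", share_within(ages, 2))
print("RRoT guess for a usual range of 10 to 30:", range_rule_sd(10, 30))  # hypothetical range
```

With only five values the shares will not match 68% and 95% closely; those guidelines describe larger, roughly bell-shaped data sets.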
Example: Most homes for sale in the Oswego City School District are listed at prices between $50,000 and $200,000. What would you guess for the standard deviation of prices?

Range: about $200,000 – $50,000 = $150,000
Apply the RRoT: $150,000 / 4 = $37,500

Example: Students are asked to complete a survey online. This assignment is made on a Monday at about noon. The survey closes Wednesday at midnight. Since each student's submission is accompanied by a time stamp, it is simple to figure how early, relative to the deadline, each student submitted the work. For the data set of amount of time early, guess the standard deviation. Give results in both days and hours.

Monday noon to Wednesday midnight is 2.5 days, or 60 hours. People will hand it in between immediately (2.5 days / 60 hours early) and at the last minute (0 early). The range is about 2.5 days or 60 hours.
Apply the RRoT:
2.5 / 4 = 0.625 days
60 / 4 = 15 hours
These are the same (0.625 days × 24 = 15 hours).

Example: Consider GPAs of graduating seniors. Guess the standard deviation.

You can't graduate under 2.0. All As gives 4.0.
Min about 2.0; Max probably exactly 4.0
Range: about 4.0 – 2.0 = 2.0
Apply the RRoT: 2.0 / 4 = 0.5

Example

An instructor asked students in two sections of the same course to guess the instructor's age. Students in the first class (in a large lecture hall) had no other knowledge of the instructor's personal life. Students in the second class (in a small classroom) knew that the instructor was the father of a young girl.

Variable: Guess of instructor's age (Quantitative)
Units: The students
Guess of instructor's age varies from student to student.

Variable: Class (or Which class?) (Categorical)
Units: The students
Which class varies from student to student.

[Dotplot of Age_Large, axis 28 to 54]
This is a fairly symmetric distribution.
Mode = 42
Range = 54 – 32 = 22

[Dotplot of Age_Large, axis 28 to 54]
This is a symmetric distribution.
Mean = 42.0
Mode = 42
Symmetry: typically Mean ≈ Mode ("nearly equal")

[Dotplots of Age_Large and Age_Small, axis 28 to 54]
Age_Large: Mean = 42.0
Age_Small: Mean = 39.0
St Dev guess: 22 / 4 = 5.5

[Dotplots, axis 30 to 45]
Large Class: Mean = 40.25, St Dev = 4.33 (guess 4.25)
Small Class: Mean = 38.15, St Dev = 4.14 (guess 3.75)

Properties: Mean & Standard Deviation

• They don't really "depend" (in the usual sense) on how much data there is.
• They depend on the relative frequency (percent) of occurrence of each value.
• Adding a new unit: sometimes the mean will go up; sometimes down. But on average it will stay the same. Same for standard deviation.

Standard Deviation Calculation for the Large Section

Age     Mean     Deviation from Mean     Deviation squared
43      44.8     43 – 44.8 = -1.8        (-1.8)² = 3.24
48      44.8     48 – 44.8 = +3.2        3.2² = 10.24
42      44.8     42 – 44.8 = -2.8        (-2.8)² = 7.84
44      44.8     44 – 44.8 = -0.8        (-0.8)² = 0.64
47      44.8     47 – 44.8 = +2.2        2.2² = 4.84
Sums:
224     224.0    0                       26.80

The deviations sum to 0. ALWAYS, for every data set.

Mean = 224 / 5 = 44.8
Variance = 26.8 / 4 = 6.7
SD = √6.7 = 2.59

Sample Mean: x̄ = 44.80
Sample Standard Deviation: S = 2.59
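The hand calculation can be cross-checked with Python's standard `statistics` module, whose `stdev` uses the same n – 1 divisor. The sketch below also recomputes the earlier "add two values" comparisons for the Large Section; the helper name `summarize` is an assumption made for illustration.

```python
from statistics import mean, stdev

# Large Section age guesses from the hand calculation.
large = [43, 48, 42, 44, 47]

def summarize(data):
    """Return (sample mean, sample SD), each rounded to 2 decimals."""
    return round(mean(data), 2), round(stdev(data), 2)

print("original:", summarize(large))              # (44.8, 2.59)

# Add two values that are far from, near, or at about the mean.
for extra in ([40, 50], [42, 48], [45, 45]):
    print("add", extra, "->", summarize(large + extra))
# add [40, 50] -> (44.86, 3.58)
# add [42, 48] -> (44.86, 2.73)
# add [45, 45] -> (44.86, 2.12)
```

In all three cases the mean barely moves, while the SD grows when the added values lie far from the mean and shrinks when they lie close to it.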