Chapter 3: Numerical Descriptions of Data

Slide set to accompany "Statistics Using Technology" by Kathryn Kozak (Slides by David H Straayer) is licensed under a Creative Commons Attribution-ShareAlike 4.0 International License. Based on a work at http://www.tacomacc.edu/home/dstraayer/published/Statistics/Book/StatisticsUsingTechnology112314b.pdf.

3.1 Measures of Center
• Mode (nah, not really)
• Mean
• Median

Mode
• Most frequently appearing value, or most common frequency class
• “Humps” in the distribution
• Need not be all the exact same height or count
• Mostly used to recognize bi-modal and multi-modal distributions
• Multi-modal is often a tip-off that different types of individuals are in the data set.

Mean or Average
• Known as Mean, Arithmetic Mean, and Average.
• Especially useful when the data is roughly symmetrical and without many outliers.
• Can be misleading on very skewed data.
• Consider the average income of people in this classroom.
• What happens if Bill Gates walks in and we calculate a new average?

Median
• The “halfway point” – roughly half are smaller than this value, half are larger.
• This measure of center is more “resistant” to skewness and outliers.
• Frequently used for distributions like income and house cost.
• Doesn’t change much when Bill G walks into the room.

Shapes and measures of center

Weighted Average
• Students are often familiar with this, as grades are usually computed as weighted averages.
• For example:
  – Homework 20%
  – Worksheets 15%
  – Tests 65%
• Weighted average = $\dfrac{\sum x w}{\sum w}$, where w is the weight of the data value x.
• If the weights add up to 1, even simpler: the weighted average is just $\sum x w$.

3.2 Measures of Spread
• Range
  – Just Max - Min
  – Obviously sensitive to outliers
• IQR
  – Range of the middle half
  – Less sensitive to outliers
  – Will discuss more in the next section
• Standard Deviation
  – Appropriate for symmetric distributions where the mean is a good measure of center.

Spread: how is the data spread out?
• We are often interested in the source and amount of variation in a data set.
• You will find it useful to pause for a moment and ask “What is the source of variation in this data set?”
• Often knowing the source of variation can help you understand the shape of the distribution.

Canonical examples of variation
• The weights of Ping-Pong balls
  – Because of standards and high-quality manufacturing, very little spread
• The weights of apples in a grocery store
  – Obviously some, but not extreme, variability
• “Your little brother’s rock collection”
  – That sapphire is almost microscopic, but the granite rock is pretty hefty. A lot of variation.

Range
• Sorting the data makes it easy to find minimum and maximum values.
• Many computer systems have functions min() and max().
• Probably the most common measure in nontechnical situations, but its sensitivity to outliers can be a problem.

Standard Deviation
• I call it “Average un-averageness”.
• It is a good description of spread in data that is reasonably free of extreme outliers.
• It is sensitive to outliers.
• Although not as useful in skewed data sets, it can be calculated for any data set.

Let’s think about the design of S.D.
• Each data point in a data set has a deviation from the mean (average).
• A deviation is calculated with a subtraction: $x_i - \bar{x}$ is the deviation of “the ith x”.
• By the definition of mean, and the laws of math, if we add up all the deviations, the sum is always zero.

We might try absolute values
• We could add up (and then divide to make an average) the absolute values of the deviations.
• Although this seems appealing, there are a few difficult-to-explain reasons why absolute values are not such a good idea.
• But there is another easy way to get rid of all those nasty minus signs…

Squaring gets rid of negatives
• Squaring the deviations also gets rid of the negatives, so they can be added up on the way to calculating an average.
• This also has some nice mathematical properties.
(smooth graph, rewarding “close”)

We can get an average by dividing
• Yet another strange complication – in most cases it is better to divide by “one less than the count”.
• This makes for slightly larger (safer) estimates of variability: an “unbiased estimator”.
• We call it variance.
• Of course, when we’re done, we get a (sort of) average squared deviation.
• Units like squared ounces or grams are just plain weird.

Variance
• This is the average squared deviation.
• Not much use for us in this class, because the units are squared.
• Think of it as “the last thing you calculate before you take the square root.”

Getting back to sane units
• Simple – just take the square root of the average squared deviation.
• And that’s what we call the “sample standard deviation” – or just “standard deviation” for short. (S or Sx)
• There is also a “population standard deviation” – where we don’t subtract one before dividing, but we won’t use it much.

A formula for nerds
$S_x = \sqrt{\dfrac{\sum (x_i - \bar{x})^2}{n - 1}}$

Population Standard Deviation
• We use a Greek letter for this – the lower case sigma: σ
$\sigma = \sqrt{\dfrac{\sum (x_i - \mu)^2}{N}}$, where μ is the population mean and N is the population size.

Technology
• Excel, and many other computer systems, use stdev(…) for standard deviation, average(…) (or mean(…) on other systems) for mean, and median(…) for median.
• On the TI, put the data in a list, and perform Stats/Calc/1-var stats.

On the TI
[Calculator screenshots]

That little “down arrow”
• On the TI screen, that means “there is more on the next page” – just scroll down to see it.
• We’ll cover the second-screen stuff next.

Use of Standard Deviation
• Chebyshev’s Theorem: For any set of data:
  – At least 75% of the data fall in the interval from $\mu - 2\sigma$ to $\mu + 2\sigma$.
  – At least 88.9% of the data fall in the interval from $\mu - 3\sigma$ to $\mu + 3\sigma$.
  – At least 93.8% of the data fall in the interval from $\mu - 4\sigma$ to $\mu + 4\sigma$.

Z-Score and unusualness
• How far is a data point from the mean?
• If the data value is more than two standard deviations from the mean, either above or below, then the value is uncommon.
• $z = \dfrac{x - \mu}{\sigma}$ or $z = \dfrac{x - \bar{x}}{S_x}$ says how far x is from the mean ($\mu$ or $\bar{x}$), in units of standard deviation ($\sigma$ or $S_x$).
“Dilbert’s way of expressing weirdness”

3.3 Ranking
• Percentile: ranks by 100ths
• Decile: ranks by 10ths
• Quartile: ranks by 4ths
• Quintile: ranks by 5ths

Percentile
• Percentile: The kth percentile is the data value that has k% of the data at or below that value.
• For example, “The 1%” refers to the highest-earning 1% of Americans. The 99th percentile is about $400,000.

Quartiles
1. Sort the data in increasing order.
2. Find the median; this divides the data list into 2 halves.
3. Find the median of the data below the median. This value is Q1.
4. Find the median of the data above the median. This value is Q3.
There are slightly different algorithms for computing Q1 and Q3. Our author uses: “Ignore the median in both calculations for Q1 and Q3”.

Interquartile Range (IQR)
• IQR = Q3 - Q1
• This is another measure of spread. It is resistant to outliers in much the same way that the median is a resistant measure of center.
• Q1 - 1.5*IQR and Q3 + 1.5*IQR provide another definition of “outlier”: a value below the first or above the second.

5-Number Summary
• These 5 numbers are often used to provide a nice summary of a distribution:
1. Minimum
2. Q1
3. Median
4. Q3
5. Maximum

Box-and-Whiskers Plot

Uses of Box-and-Whiskers plots
• They are frequently used to gauge symmetry and spread of a distribution.
• Multiple Box-and-Whiskers plots on the same field are convenient for comparing distributions.
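
The quartile steps, the IQR outlier rule, and the 5-number summary can be sketched in a few lines of Python. This is only a minimal sketch, not the author's or any calculator's implementation: the function names five_number_summary and outlier_fences and the data list are made up for illustration, and the code assumes the “ignore the median” rule quoted in the Quartiles slide. Other software may use a different quartile algorithm and report slightly different values for Q1 and Q3.

```python
from statistics import median


def five_number_summary(data):
    """Return (min, Q1, median, Q3, max).

    Assumes the "ignore the median" rule: when the count is odd,
    the middle value is left out of both halves.
    """
    xs = sorted(data)                 # step 1: sort in increasing order
    n = len(xs)
    if n < 2:
        raise ValueError("need at least two data values")
    half = n // 2
    lower = xs[:half]                 # data below the median
    upper = xs[half + (n % 2):]       # data above the median (skips the middle value if n is odd)
    return xs[0], median(lower), median(xs), median(upper), xs[-1]


def outlier_fences(q1, q3):
    """Return (Q1 - 1.5*IQR, Q3 + 1.5*IQR)."""
    iqr = q3 - q1
    return q1 - 1.5 * iqr, q3 + 1.5 * iqr


# Made-up data, for illustration only.
data = [1, 3, 4, 6, 7, 8, 9, 12, 30]

summary = five_number_summary(data)
low, high = outlier_fences(summary[1], summary[3])
outliers = [x for x in data if x < low or x > high]

print("5-number summary:", summary)
print("fences:", low, high)
print("outliers:", outliers)
```

For this made-up list, Q1 is 3.5 and Q3 is 10.5, so the IQR is 7 and the only value outside the fences is 30.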