STATC141 Spring 2006
Lecture 2, 01/24/2006
Introduction to Probability and Statistics (II)

In this lecture, we focus on the statistics used to summarize data. We use the midterm scores of 19 students as the illustrative dataset (see table 1).

Table 1. Students’ midterm scores

Histogram

To summarize data, statisticians often use a graph called a “histogram.” Figure 1 below is the histogram for the midterm scores of the 19 students. The height of a block represents crowding: percentage per horizontal unit. The areas of the blocks represent percentages; that is, the area under the histogram over an interval equals the percentage of cases in that interval. The total area under the histogram is 100%. For instance, from figure 1, about 0.042 × 10 = 0.42, or 42%, of students had midterm scores from 70 to 80 (80 is counted but not 70).

Figure 1. A histogram. This graph shows the distribution of students by midterm scores.

The average and the standard deviation

A histogram can be used to summarize large amounts of data. Often, an even more drastic summary is possible, giving just the center of the histogram and the spread around the center (this is similar to the “expected value” and “standard error” mentioned in lecture 1). The “average” is often used to find the center, and so is the “median.” The “standard deviation” measures spread around the average; the “interquartile range” is another measure of spread.

Average: the average of a list of numbers equals their sum, divided by how many there are.

Median: the median of a histogram is the value with half the area to the left and half to the right.

Root-Mean-Square (r.m.s.): r.m.s. size of a list = square root of (average of (entries²)). SQUARE all the entries, getting rid of the signs. Take the MEAN (average) of the squares. Take the square ROOT of the mean.

Standard deviation (SD): the SD says how far away numbers on a list are from their average. Most entries on the list will be somewhere around one SD away from the average.
Very few will be more than two or three SDs away.
SD = r.m.s. deviation from average, where deviation from average = entry – average.
So SD = square root of (average of (entry – average)²) = square root of (average of (entries²) – (average of entries)²).

For the 19 students’ midterm scores: average = 68.94737, SD = 11.30621.

The Normal approximation for data

The normal curve was discovered around 1720 by Abraham de Moivre, while he was developing the mathematics of chance. The equation for the standard normal curve is:

$$y = \frac{100\%}{\sqrt{2\pi}} \, e^{-x^2/2}, \quad \text{where } e = 2.71828\ldots$$

This equation involves three of the most famous numbers in the history of mathematics: √2, π, and e. A graph of the curve is shown in figure 2.

Figure 2. The standard normal curve

Important features of the standard normal curve:
The graph is symmetric about 0.
The total area under the curve equals 100%.
The area under the normal curve between -1 and +1 is about 68%.
The area under the normal curve between -2 and +2 is about 95%.
The area under the normal curve between -3 and +3 is about 99.7%.

Many histograms for data are similar in shape to the normal curve, provided they are drawn to the same scale. Making the horizontal scales match up involves “standard units”: a value is converted to standard units by seeing how many SDs it is above or below the average (see figure 3).

Figure 3. A histogram for the midterm scores of students compared to the normal curve. The area between 57.64116 and 80.25358 (the percentage of students whose midterm scores are within one SD of the average) is about equal to the area between -1 and +1 under the curve, which is 68%.

Percentiles

Sometimes, the histogram may not follow the normal curve. To summarize such histograms, statisticians often use “percentiles” (table 2).

Percentile    5   25   75   90   100
Score        43   64   77   80    85

Table 2.
Selected percentiles for student midterm scores.

The 25th percentile of the midterm score distribution was 64, meaning that about 25% of the students had midterm scores of 64 or less, and about 75% had scores above that level. By definition, the “interquartile range” equals the 75th percentile minus the 25th percentile. This is sometimes used as a measure of spread when the SD would be too heavily influenced by a small percentage of cases. For table 2, the interquartile range is 77 – 64 = 13.
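
The summary statistics defined in this lecture can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the list `scores` is hypothetical (it is not the data from table 1), the function names are made up for this sketch, and `percentile` uses the "nearest-rank" convention, which is one of several conventions in use and may differ slightly from the one behind table 2. The SD divides by the number of entries, matching the r.m.s. definition above.

```python
import math


def average(values):
    """Sum of the entries divided by how many there are."""
    return sum(values) / len(values)


def rms(values):
    """Square the entries, take the mean of the squares, then the square root."""
    return math.sqrt(average([v * v for v in values]))


def sd(values):
    """SD = r.m.s. of the deviations from the average."""
    avg = average(values)
    return rms([v - avg for v in values])


def sd_shortcut(values):
    """Same SD via sqrt(average of entries^2 - (average of entries)^2)."""
    return math.sqrt(average([v * v for v in values]) - average(values) ** 2)


def standard_units(value, values):
    """How many SDs a value lies above (+) or below (-) the average."""
    return (value - average(values)) / sd(values)


def percentile(values, p):
    """Nearest-rank percentile (an assumed convention): the smallest entry
    such that at least p% of the list is at or below it."""
    ordered = sorted(values)
    rank = max(1, math.ceil(p / 100 * len(ordered)))
    return ordered[rank - 1]


def interquartile_range(values):
    """75th percentile minus 25th percentile."""
    return percentile(values, 75) - percentile(values, 25)


def normal_area_within(k):
    """Area under the standard normal curve between -k and +k, as a fraction."""
    return math.erf(k / math.sqrt(2))


# Hypothetical scores for illustration only, NOT the lecture's table 1 data.
scores = [50, 60, 70, 80, 90]

print("average:", average(scores))
print("SD:", sd(scores), "shortcut:", sd_shortcut(scores))
print("90 in standard units:", standard_units(90, scores))
print("interquartile range:", interquartile_range(scores))
for k in (1, 2, 3):
    print(f"normal area within {k} SD:", round(100 * normal_area_within(k), 1), "%")
```

Running the two SD functions on the same list is a quick check that the two formulas for the SD agree, and `normal_area_within` reproduces the 68%, 95%, and 99.7% figures quoted for the normal curve.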