STATC141 Spring 2005

STATC141 Spring 2006
Lecture 2, 01/24/2006
Introduction to Probability and Statistics (II)
In this lecture, we focus on the statistics used to summarize data. We use the midterm scores of 19 students as the
illustrative dataset (see table 1).
Table 1. Students’ midterm scores
Histogram
To summarize data, statisticians often use a graph called a “histogram.” Figure 1 below is the histogram for
the midterm scores of 19 students. The height of a block represents crowding: the percentage per horizontal unit.
The area of a block represents a percentage; that is, the area under the histogram over an interval equals the
percentage of cases in that interval. The total area under the histogram is 100%. For instance, in figure 1 the
block over 70-80 has a height of about 0.042 per point and a width of 10 points, so about 0.042 × 10 = 0.42, or 42%,
of students had midterm scores from 70 to 80 (80 is counted but not 70).
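The area rule can be checked numerically. The following is a minimal sketch in Python, assuming a small list of hypothetical scores (table 1 is not reproduced here, so these are not the class data) and a made-up helper name `density_heights`. It computes block heights as fraction per point, using intervals that count the right endpoint but not the left, and confirms that the block areas add up to 100%.

```python
def density_heights(values, edges):
    """Height of each block = (fraction of cases in the interval) / (interval width).

    Intervals are right-inclusive, (left, right], matching the convention
    that 80 is counted in 70-80 but 70 is not.
    """
    n = len(values)
    heights = []
    for left, right in zip(edges[:-1], edges[1:]):
        count = sum(1 for v in values if left < v <= right)
        heights.append(count / n / (right - left))
    return heights

# Hypothetical scores for illustration only, NOT the data from table 1.
scores = [45, 58, 62, 66, 71, 73, 75, 78, 80, 84]
edges = [40, 50, 60, 70, 80, 90]

heights = density_heights(scores, edges)

# Area of a block = height x width = fraction of cases in that interval.
areas = [h * (r - l) for h, l, r in zip(heights, edges[:-1], edges[1:])]
total_area = sum(areas)

for (l, r), h, a in zip(zip(edges[:-1], edges[1:]), heights, areas):
    print(f"{l}-{r}: height {h:.3f} per point, area {a:.0%}")
print(f"total area: {total_area:.0%}")
```

In this made-up list, five of the ten scores fall in 70-80, so that block has height 0.05 per point and area 50%.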
Figure 1. A histogram. This graph shows the distribution of students
by midterm scores.
The average and the standard deviation
A histogram can be used to summarize large amounts of data. Often, an even more drastic summary is
possible, giving just the center of the histogram and the spread around the center (this is similar to the
“expected value” and “standard error” mentioned in lecture 1). The “average” is often used to find the center,
and so is the “median.” The “standard deviation” measures spread around the average; the “interquartile
range” is another measure of spread.
• Average: the average of a list of numbers equals their sum, divided by how many there are.
• Median: the median of a histogram is the value with half the area to the left and half to the right.
• Root-Mean-Square (r.m.s.): r.m.s. size of a list = $\sqrt{\text{average of } (\text{entries}^2)}$
    - SQUARE all the entries, getting rid of the signs.
    - Take the MEAN (average) of the squares.
    - Take the square ROOT of the mean.
• Standard deviation (SD): the SD says how far away numbers on a list are from their average. Most
entries on the list will be somewhere around one SD away from the average. Very few will be more than
two or three SDs away.
SD = r.m.s. deviation from average, where deviation from average = entry - average.
So SD = $\sqrt{\text{average of } (\text{entry} - \text{average})^2}$
= $\sqrt{\text{average of } (\text{entries}^2) - (\text{average of entries})^2}$.

For the 19 students’ midterm scores:
Average = 68.94737, SD= 11.30621
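The definitions of the average, the r.m.s. size, and the SD translate directly into code. The sketch below is an illustration in Python on a short hypothetical list (not the midterm scores, since table 1 is not reproduced here); the function names are made up for this example. It also checks that the two expressions for the SD agree.

```python
import math

def average(xs):
    """Sum of the entries divided by how many there are."""
    return sum(xs) / len(xs)

def rms(xs):
    """Square the entries, take the mean of the squares, then the square root."""
    return math.sqrt(average([x * x for x in xs]))

def sd(xs):
    """SD = r.m.s. size of the deviations from the average."""
    avg = average(xs)
    return rms([x - avg for x in xs])

def sd_shortcut(xs):
    """SD = sqrt(average of (entries^2) - (average of entries)^2)."""
    return math.sqrt(average([x * x for x in xs]) - average(xs) ** 2)

# Hypothetical list for illustration only.
values = [2, 4, 4, 4, 5, 5, 7, 9]

print(average(values))      # 5.0
print(sd(values))           # 2.0
print(sd_shortcut(values))  # 2.0
```

This follows the divide-by-n definition given above. Some software divides by n - 1 instead (in Python's standard library, `statistics.pstdev` divides by n while `statistics.stdev` divides by n - 1), so results can differ slightly for short lists.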
The Normal approximation for data
The normal curve was discovered around 1720 by Abraham de Moivre, while he was developing the
mathematics of chance. The equation for the standard normal curve is:
$$y = \frac{100\%}{\sqrt{2\pi}}\, e^{-x^2/2}, \quad \text{where } e = 2.71828\ldots$$
This equation involves three of the most famous numbers in the history of mathematics: $\sqrt{2}$, $\pi$, and $e$. A
graph of the curve is shown in figure 2.
Figure 2. The standard normal curve
Important features of the standard normal curve:
• The graph is symmetric about 0.
• The total area under the curve equals 100%.
• The area under the normal curve between -1 and +1 is about 68%.
• The area under the normal curve between -2 and +2 is about 95%.
• The area under the normal curve between -3 and +3 is about 99.7%.
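The three areas listed above can be reproduced numerically. This is a small sketch in Python using the standard library's error function `math.erf`; the helper names `curve_height` and `area_between` are made up for this example, and areas are returned as fractions rather than percentages.

```python
import math

def curve_height(x):
    """Height of the standard normal curve at x (as a fraction per unit)."""
    return math.exp(-x * x / 2) / math.sqrt(2 * math.pi)

def area_between(k):
    """Area under the standard normal curve between -k and +k."""
    return math.erf(k / math.sqrt(2))

for k in (1, 2, 3):
    print(f"between -{k} and +{k}: {area_between(k):.1%}")
# between -1 and +1: 68.3%
# between -2 and +2: 95.4%
# between -3 and +3: 99.7%
```

The symmetry about 0 shows up as `curve_height(x) == curve_height(-x)` for any x.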
Many histograms for data are similar in shape to the normal curve, provided they are drawn to the same scale.
Making the horizontal scales match up involves “standard units.” A value is converted to standard units by
seeing how many SDs it is above or below the average (see figure 3).
Figure 3. A histogram for the midterm scores of students compared to the normal curve. The area between
57.64116 and 80.25358 (the percentage of students within one SD of average with respect to midterm scores)
is about equal to the area between -1 and +1 under the curve – 68%.
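Converting to standard units is a one-line calculation: subtract the average, then divide by the SD. The sketch below uses the average and SD reported for the midterm scores; the function names are made up for illustration.

```python
AVERAGE = 68.94737  # average of the 19 midterm scores, as reported above
SD = 11.30621       # SD of the 19 midterm scores, as reported above

def to_standard_units(value, average=AVERAGE, sd=SD):
    """Number of SDs that value lies above (+) or below (-) the average."""
    return (value - average) / sd

def from_standard_units(z, average=AVERAGE, sd=SD):
    """Inverse conversion: the score that sits z SDs from the average."""
    return average + z * sd

print(round(to_standard_units(57.64116), 4))   # -1.0
print(round(to_standard_units(80.25358), 4))   # 1.0
print(round(from_standard_units(2), 5))        # 91.55979
```

So the interval from 57.64116 to 80.25358 in the caption is exactly the interval from -1 to +1 in standard units.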
Percentiles
Sometimes, the histogram may not follow the normal curve. To summarize such histograms, statisticians
often use “percentiles” (table 2).
Percentile    Midterm score
5             43
25            64
75            77
90            80
100           85
Table 2. Selected percentiles for student midterm scores
The 25th percentile of the midterm score distribution was 64, meaning that about 25% of the students had
midterm scores of 64 or less, and about 75% had scores above that level.
By definition, the “interquartile range” equals the 75th percentile minus the 25th percentile. This is sometimes used
as a measure of spread, when the SD would be too heavily influenced by a small percentage of cases. For
table 2, the interquartile range is 77 – 64 = 13.
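The interquartile range calculation can be sketched in Python as follows. The first part uses the values from table 2. The second part computes percentiles from raw data using the "nearest rank" convention, which is an assumption made for this example: several percentile conventions exist and the lecture does not say which one produced table 2. The raw scores are hypothetical, not the class data.

```python
import math

# Percentile -> score, taken from table 2.
table2 = {5: 43, 25: 64, 75: 77, 90: 80, 100: 85}
table_iqr = table2[75] - table2[25]
print(table_iqr)  # 13

def percentile(values, p):
    """Nearest-rank percentile: the smallest value with at least p% of
    the cases at or below it (one of several conventions in use)."""
    ordered = sorted(values)
    rank = max(1, math.ceil(p * len(ordered) / 100))
    return ordered[rank - 1]

def interquartile_range(values):
    """75th percentile minus 25th percentile."""
    return percentile(values, 75) - percentile(values, 25)

# Hypothetical scores for illustration only, NOT the data from table 1.
scores = [45, 58, 62, 66, 71, 73, 75, 78, 80, 84]
print(percentile(scores, 25))       # 62
print(percentile(scores, 75))       # 78
print(interquartile_range(scores))  # 16
```

Because the interquartile range depends only on the middle half of the data, changing the lowest score from 45 to 5 in the hypothetical list would leave it unchanged, while the SD would grow.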