Chapter 6: Putting Statistics to Work

In Chapter 5 we learned the 5 basic steps of a statistical study. We learned about different
methods of sampling and the strengths and weaknesses of each (step 2). We also learned
about experiments and observational studies and the benefits of both. These are both setups
to collect data (step 3). The only mention of step 4 was the margin of error and the confidence interval in surveys and polls. The only thing we heard about conclusions (step 5) is that correlation may hint that there could be a cause-and-effect relationship between the variables.
In other words, Chapter 5 does not cover steps 4 and 5 in much detail. This brings us to
Chapter 6.
Chapter 6 begins with a discussion of the characterization of data and how “average” can
have several meanings and be misleading. The shape of the distribution can lead to large
discrepancies between the mean, median and mode. Furthermore, it is not always clear how
the average is calculated.
Then the focus moves to the variation of a distribution. We learn about the five-number summary, the range of a distribution, and the standard deviation. These will be important
concepts for inferring population parameters from sample statistics.
In unit 6C, the “bell curve,” or more accurately, the normal distribution, is introduced. Its significance and characteristics are discussed.
Finally, we get to the pot of gold at the end of the rainbow. This is where the power
of statistics lies: statistical inference. This sets guidelines for properly making conclusions
based on a study, and what the statistics actually tell you. We learn that what statistics really tell us is how likely it is that we got the data by chance. The less likely it is that the results are just chance, the
more confident we can be that something is actually going on.
Unit 6A: Characterizing Data
The purpose of this section is to discuss the various ways of measuring the “average” and to
see the effects of the distribution on the different “averages.” In particular, if the distribution is skewed left or right, the mean is pulled in the direction of the skew more than the median or the mode. Other ways of describing distributions are also discussed.
What is average?
The word “average” is thrown around in statistics frequently. However, there are three
types of averages: mean, median and mode. All three are distinct for most data sets. Typically, in statistical studies the “average” is the mean: the sum of all the values divided by the number of values. If you were to make a histogram of the data and hold it up with one finger from below, with your finger at the mean, the histogram would balance on your finger. Another type of average is the median. This is just the “center” of the distribution: half of the data values are above it, and half are below it. The median is used less often than the mean, but it has the advantage that outliers do not affect it as much as they affect the mean. The mode is rarely used in statistics; it is just the value that
occurs most frequently. If you had a frequency table, it would just be the category with the
highest number in the frequency column. Unlike the mean and the median, the mode can be
used for qualitative data. However, most studies use numerical variables so there is some sort
of natural ordering. For example, there is no natural ordering on hair color, but there is for
salary.
Deciding which “average” to use can be a hard decision. A general rule of thumb is
that if you have (an) outlier(s), your mean can be way off, so the median should be used.
Alternatively, in reality some studies throw out outliers and then calculate the mean as the
average. This is skating on thin ice. There needs to be a very good reason to throw out a
data point. Unfortunately, this is done more often than you would think. The main reason
this happens is because the statistical community at large prefers the mean as the average.
Since outliers can make the mean unrepresentative of the data, outliers are often disregarded
in the calculation of the mean.
For example, let’s say you are calculating the average speed of processors in the new Apple
G6 computers. You get a random sample of 5 computers and test them with a simple program
to time how long the processor takes. You get the following times (in seconds): 1.32, 1.35, 1.34, 1.37, 0.15. Here is the problem: completing the program in only 0.15 seconds would require the processor to do each computation faster than the speed of light, so that measurement is physically impossible. Now, if we compute the mean with this
value in, we get 1.106 seconds. If we compute the mean without this value, we get 1.345
seconds. Finally, if we compute the median of all five measurements, we get 1.34 seconds, which is nearly the same as the mean without the outlier (the two will typically be close). In this case, we would be justified in throwing out the 0.15 second result because it is impossible under the laws of physics. It would be misleading to include it, since the average time would then be lower than all but one of the measured times, making the processors look faster than they really are.
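To see how much that single impossible reading moves the mean, here is a minimal Python sketch using the hypothetical timing numbers from this made-up Apple G6 example:

    from statistics import mean, median

    times = [1.32, 1.35, 1.34, 1.37, 0.15]   # hypothetical timings in seconds; 0.15 is the bad reading

    print(round(mean(times), 3))       # 1.106 -- the outlier drags the mean down
    print(round(median(times), 3))     # 1.34  -- the median barely notices it
    print(round(mean(times[:4]), 3))   # 1.345 -- the mean after discarding the outlier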
The moral of the story is that outliers can have a dramatic effect on the mean, and one
must be very careful about how to compute the “average” of a data set so it is not misleading.
In cases where there is an outlier, the median is often a better measure of the “average.”
See the textbook for its examples on Confusion About “Average.”
Shapes of Distributions
There are several ways of describing a distribution. To get the best description, if possible,
all three of the following should be used: number of peaks, skewness (or symmetry), and
variation. Below are three example distributions.
(Example histograms: one distribution with two peaks, one right-skewed and one left-skewed.)
The number of peaks is pretty obvious. You look for humps in the histogram and count
them. The first graph above has two peaks, and the other two have only one
peak. The number of peaks usually tells you something about the data being measured. For
example, if you are measuring the height of adult Americans to the nearest inch, you would
probably notice two peaks. This would tell you that there may be a natural stratification of
the data. Clearly, this is men and women.
The skewness or symmetry of the data is also apparent from a histogram. If there is a tail
on one side of the data that is not on the other side, then the data is skewed in the direction
of the tail. There are many possible explanations of skewness. For example, if you look at
the distribution of grades in a large class where most of the students do well, there will be a
tail towards the lower grades. This would likely be due to the fact that you can’t get higher
than 100%, but you can get as low as 0%. Since most of the students do well, the grades will
be stuck around As and Bs, whereas the rest of the class will be spread out through Cs, Ds
and Fs. So, there would probably be a tail towards the Fs.
When data is skewed, the mean is pulled in the direction of the skewness. The median
may also be pulled in that direction, but not as much as the mean.
Symmetry is also clear from the histogram. If the left side is pretty much a mirror image
of the right side, then the data is symmetric. For example, if you plotted the heights of adult
American men, one would expect this to be symmetric. When a single-peaked distribution is symmetric, the mean, median and mode are all the same! Typically, when there is just random variation in the
data, the distribution will be symmetric.
However, one thing the book is not clear about (and may even be misleading on) is that the tail values may or may not be outliers. The book points to the tail values as outliers, but there may be
a good reason for the data to be skewed! For example, if you look at a distribution of the
income of all American families it would be skewed towards the high incomes. There are good
reasons for this data to be skewed, and the “outliers” (according to the book) tell a story.
Finally, the variation of the data tells how spread out it is. If all of the data is close
together, then there is little variation. If the data take on a wide range of values, then there
is large variation. However, be careful about interpreting variation qualitatively, like “a lot
of variation” or “a little variation.” Changing the units of the x-axis can make a “lot of
variation” look like “a little variation” and vice-versa.
Unit 6B: Measures of Variation
Variation is a useful idea in a quantitative sense. In a way, it tells you how close the data
are to the “average” (more on which average later). There are three main ways of measuring
the variation. The crudest of these is the range. A little better are the quartiles and the
five-number summary. The best way is the standard deviation.
The range is simply the difference between the maximum value and the minimum value.
This is a barbaric way of telling how spread out the data is. The plus side of it is that this is
VERY easy to calculate. The down side is that it really isn’t very useful since it says nothing
about any measure of average, and it tells nothing about the shape of the distribution.
Quartiles and the Five-Number Summary
Quartiles are a way of dividing the data up into four chunks with equal numbers of data
values in each chunk. The quartiles are calculated by computing medians. To get the 2nd
quartile (or middle quartile), you simply compute the median of the data. To get the 1st
quartile (or lower quartile), you take all the data points below the median (the lower half of
the values) and compute their median. Similarly, for the 3rd (or upper) quartile, you find the
median of the data points above the middle quartile. To complete the five-number summary,
you include the minimum and maximum value.
Let’s say you want to know the five-number summary for the final grades for introductory
physics. The grades were (in descending order): 98, 96, 94, 89, 84, 83, 83, 81, 79, 78, 76, 76,
75, 75, 73, 71, 70, 69, 64, 52. The median is 77 (since there are 20 grades, and 77 is half
way between the middle two grades of 76 and 78). So, the middle quartile is 77. Now, this
naturally breaks the data into two parts, those above the median and those below the median.
Looking at the grades above the median (98, 96, 94, 89, 84, 83, 83, 81, 79, and 78) we can
calculate their median: 83.5. This is the upper quartile since it is the median of the top half
of the data. Similarly, the lower half of the data (76, 76, 75, 75, 73, 71, 70, 69, 64 and 52)
has a median of 72. That is, 72 is the lower quartile. To finish off the five-number summary,
we add the minimum value of 52 and the maximum value of 98. Putting it all together, we
give the five-number summary: 52, 72, 77, 83.5, 98.
This summary gives you, at a glance, both the range and the median. Furthermore, it
gives you some idea of skewness by looking at the difference between the median and the
lower/upper quartile and the difference between the median and the max/min.
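Here is a short Python sketch of the same calculation. It follows the median-of-halves method described above rather than a library's quartile routine (different conventions can give slightly different quartiles), and it reproduces the summary for the physics grades:

    from statistics import median

    grades = [98, 96, 94, 89, 84, 83, 83, 81, 79, 78,
              76, 76, 75, 75, 73, 71, 70, 69, 64, 52]

    def five_number_summary(data):
        values = sorted(data)
        half = len(values) // 2
        lower = values[:half]                                   # values below the median
        upper = values[half:] if len(values) % 2 == 0 else values[half + 1:]
        return values[0], median(lower), median(values), median(upper), values[-1]

    print(five_number_summary(grades))   # (52, 72.0, 77.0, 83.5, 98)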
However, looking at numbers is not always as telling as a picture. Therefore, we have the
boxplot. This is a plot designed to represent the five-number summary clearly.
Figure 1: Box plot of the physics grades, drawn from the five-number summary 52, 72, 77, 83.5, 98.
This plot tells us a lot more than the range. It gives us clues about symmetry or skewness and about the spread (variation). For example, the lower whisker (from 72 down to 52) is longer than the upper whisker (from 83.5 up to 98). This suggests that the data might be skewed toward the lower grades, just as we would expect for a class where most students do well.
Standard Deviation
There is a reason this is called “standard” deviation. This is the most common measure of variation (and the most precise). However, unlike the five-number summary, which is based on
medians, the standard deviation is a measure of how much the values deviate from the mean.
This is a mathematical measure that is used in a lot of the theory behind drawing conclusions
based on statistical analysis. Technically, the standard deviation is the square root of (the sum of (value − mean)²) divided by (number of values − 1). To calculate it, follow these steps:
1. Compute the mean, and then for each value, calculate the difference between the data
value and the mean (data value - mean).
2. Calculate the square of each of these deviations.
3. Add up the squares of the deviations.
4. Divide the sum by one less than the number of values (number of values − 1).
5. Take the square root of the answer of step 4.
As an example, we will use the data that we made the box plot with. The mean of the data is 78.3. We will calculate everything in a table:
value   deviation   deviation squared
98        19.7        388.09
96        17.7        313.29
94        15.7        246.49
89        10.7        114.49
84         5.7         32.49
83         4.7         22.09
83         4.7         22.09
81         2.7          7.29
79         0.7          0.49
78        -0.3          0.09
76        -2.3          5.29
76        -2.3          5.29
75        -3.3         10.89
75        -3.3         10.89
73        -5.3         28.09
71        -7.3         53.29
70        -8.3         68.89
69        -9.3         86.49
64       -14.3        204.49
52       -26.3        691.69
The sum of the squares of the deviations is 2312.2. Now we have to divide this by 20 − 1 = 19, which gives 2312.2/19 ≈ 121.7. Finally, we take the square root: the square root of 121.7 is about 11.0. Therefore, the standard deviation of this set of class grades is about 11.0.
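The same five steps can be checked in a few lines of Python over the grade data from this section; the library function stdev uses the same divide-by-(n − 1) formula, so it should agree:

    from statistics import mean, stdev

    grades = [98, 96, 94, 89, 84, 83, 83, 81, 79, 78,
              76, 76, 75, 75, 73, 71, 70, 69, 64, 52]

    m = mean(grades)                              # step 1: the mean (78.3)
    squared = [(x - m) ** 2 for x in grades]      # steps 1-2: squared deviations
    variance = sum(squared) / (len(grades) - 1)   # steps 3-4: sum, then divide by n - 1
    print(round(variance ** 0.5, 1))              # step 5: square root -> 11.0
    print(round(stdev(grades), 1))                # the library agrees: 11.0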
Finally, the book gives a rule of thumb for estimating the standard deviation: the standard deviation is approximately 1/4 of the range. In our example, the range was 46, so range/4 = 46/4 ≈ 11.5. This is pretty close to our calculated value of 11.0. It is not precise, but it is in the ball park. So, this rule of thumb is alright for approximating the standard deviation (but should not be used in place of the actual calculation!).
Unit 6C: The Normal Distribution
The famous “bell curve” is a normal distribution. The reason this distribution is so important
is that we know a lot about it, mathematically. The normal distribution is symmetric with
one distinct peak. The highest point of the peak is the mean, median and mode of the
distribution.
A distribution is likely to be normal if the variation is expected to be random, there is only one peak (corresponding to the average value), and values become less and less likely the further they are from the average. For example, the price of gasoline across the country would likely follow
a normal distribution. The height of adult Americans would probably NOT be normal since
there is a different average height for men and women.
The Standard Deviation in Normal Distributions
The standard deviation is an integral part of the normal distribution (it appears in the equation for the curve, which we do not discuss). It also tells us a lot about how much of
the data is above or below a given point. In fact, there is a rule about normal distributions
called the 68-95-99.7 rule. This asserts that 68% of the data is between (mean − 1 standard deviation) and (mean + 1 standard deviation), 95% is between (mean − 2 standard deviations) and (mean + 2 standard deviations), and 99.7% is between (mean − 3 standard deviations) and (mean + 3 standard deviations). If we look at this graphically, we see the following:
(Figure: the normal curve with marks at −3σ, −2σ, −σ, the mean, +σ, +2σ and +3σ, showing the middle 68%, 95% and 99.7% regions.)
This rule allows us to calculate other things as well. For example, you can determine
what percentage of the data points are above (mean + 1 standard deviation). Since 68% of
the data is between (mean - σ) and (mean + σ), we know that 32% is outside of this range.
Since the curve is symmetric, half of this is below (mean - σ) and half is above (mean + σ).
Therefore, 16% of the data lies above (mean + σ).
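That “halve what lies outside” reasoning is easy to tabulate. A tiny Python sketch, using only the 68-95-99.7 percentages (so these are rule-of-thumb figures, not exact normal-curve values):

    # Percent of data above (mean + k standard deviations), by the 68-95-99.7 rule
    within = {1: 68.0, 2: 95.0, 3: 99.7}
    for k, pct in within.items():
        above = (100 - pct) / 2              # the curve is symmetric, so halve what lies outside
        print(f"above mean + {k} sd: {round(above, 2)}%")
    # above mean + 1 sd: 16.0%
    # above mean + 2 sd: 2.5%
    # above mean + 3 sd: 0.15%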
The 68-95-99.7 rule also gives us some power in computing the percentile a given data
point lies in, just by knowing the standard deviation of the distribution and the value of the
data point. However, there is a more precise way of determining the percentile of a data
point. This is found by calculating the z-score (or standard score) and then looking up the
percentile in a table. The z-score measures how many standard deviations the value is away
from the mean. The equation for calculating the z-score is

z = (data value − mean) / (standard deviation) = (x − µ) / σ
After computing the z-score, you look it up in a table and determine the percentile.
It is common for people to want to know their percentile on standardized tests (SAT, ACT
or IQ tests). The SAT verbal section (and math section) has a mean of 500, and a standard
deviation of 100. From our 68-95-99.7 rule, we see that 68% of the people who take the test
score between 400 and 600 on the verbal section. Also, only 16% score above 600. This means
that if you get a 600 on the verbal section, you are in the 84th percentile. If you go up one
more standard deviation to 700, only 2.5% of the population scored higher than you! Finally,
only .15% of the population gets an 800 on the verbal section. In other words, only 15 out of
10000 students taking the SAT get a perfect score on the verbal section! The same goes for
math.
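Instead of a printed table, the standard normal percentile can be computed directly from the z-score using the error function. Here is a sketch for the SAT numbers above; note the exact values differ slightly from the 68-95-99.7 rule (for example, 97.7% rather than 97.5% below 700):

    from math import erf, sqrt

    def percentile(value, mean, sd):
        """Percentile of a value in a normal distribution, via its z-score."""
        z = (value - mean) / sd
        return 100 * 0.5 * (1 + erf(z / sqrt(2)))   # standard normal CDF

    for score in (600, 700, 800):                    # SAT verbal: mean 500, sd 100
        print(score, round(percentile(score, 500, 100), 1))
    # 600 84.1, 700 97.7, 800 99.9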
The power of the normal distribution is apparent from the preceding discussion. The
mean and standard deviation alone determine the entire distribution. Furthermore, given a
data value, it is possible to find out what percent of the data lies below it. This is a powerful
tool for things like standardized tests, grading a class on a “curve,” etc.
Unit 6D: Statistical Inference
The power of statistics lies in statistical inference. This is the ability to say that something
is actually going on. However, the way we must go about it is to rule out the possibility that
the results were due to chance. When an event is not likely to have occurred by chance alone,
it is said to be statistically significant. For example, if the average adult male can throw a
baseball at 60 miles per hour, and a college student can throw a baseball at 93 miles per hour,
the college student’s throwing is statistically significant.
Similarly, if we look at the President’s approval rating, we see that on September 20, 2001
his approval rating was about 70%. Now, (May 2005) his approval rating is about 42%. This
is statistically significant since the margin of error in each survey is about 5%, and this change
is large compared to the margin of error. Therefore, we can actually say with confidence that
his approval has gone down. However, if you look at a year ago versus today, his rating has
gone from 45% to 42%. This is not statistically significant since the change is small compared
to the margin of error. Therefore, we cannot say that his approval has gone down in the last
year without more information.
However, simply saying whether or not an event is statistically significant is not quite
correct. One has to specify what criterion they used to determine the event was “not likely
to have occurred by chance.” You determine how “not likely” it is by saying something is
statistically significant at some level. For example, statistical significance at the 0.05 level
means that there is less than a 5% chance that the event occurred by chance. Similarly,
if there is less than a 1% chance an event occurred by chance, it is said to be statistically
significant at the 0.01 level. With one of these specifications, it is clear what you mean by a
statistically significant event.
Margin of Error and Confidence Intervals
In Chapter 5 we mentioned the margin of error and the confidence interval. However, we
did not get into details about how it was calculated or what it means, technically.
We know that sample statistics only approximate the actual value for the population
(the population parameter). However, we do know that the population parameter is very
likely to fall within the confidence interval. The “very likely” technically means that there
is a 95% chance that the population parameter lies somewhere inside the confidence interval.
(Actually, it means that there is at most a 5% chance that these measurements were taken
and the population parameter is actually outside our confidence interval). In fact, if a survey
is carried out many times with the same number of people surveyed, the distribution of the
proportions measured will be a normal distribution. This gives us all of the tools described in
section 6C. This distribution has a special name: sampling distribution. The name comes
from the fact that it is a distribution of the values of a sample statistic over many trials. The
Central Limit Theorem is the mathematical result that describes these distributions. For a survey proportion, the standard deviation of the sampling distribution works out to about 1/(2√n). From the 68-95-99.7 rule, 95% of the values are between (mean + 2 standard deviations) and (mean − 2 standard deviations), or (mean + 1/√n) and (mean − 1/√n), where n is the number of participants in each survey. Therefore, the margin of error for 95% confidence in a survey is about 1/√n. The 95% confidence interval is found by adding and subtracting 1/√n to the sample statistic (or sample proportion). This backs up something that seems to make sense: the more people you survey, the more accurate the survey is (i.e. the survey has a smaller margin of error when you survey more people).
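The claim about the sampling distribution can be checked by simulation. A sketch, assuming a true proportion of 50% (the case where the 1/(2√n) figure applies): run many imaginary polls of n people and look at the spread of the resulting sample proportions.

    import random
    from math import sqrt
    from statistics import pstdev

    n, trials = 1000, 5000
    # Each "poll" asks n people; each person says yes with probability 0.5.
    proportions = [sum(random.random() < 0.5 for _ in range(n)) / n for _ in range(trials)]

    print(round(pstdev(proportions), 4))   # spread of the simulated proportions: close to 0.0158
    print(round(1 / (2 * sqrt(n)), 4))     # 1/(2*sqrt(1000)) = 0.0158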
One thing to note is that a poll of only 500 people has a margin of error of about 4.5
percent. A 5% margin of error corresponds to a sample of only 400 people. A poll with 1000
people has a margin of error of about 3.2%. Most surveys reported on the news have a margin
of error of 3% or 5%. The 3% margin of error is for a sample of 1111 people. In other words,
when the news gives you the results of a “national” survey or poll with a 3% margin of error,
they really only asked about 1000 people.
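The poll-size numbers in the last paragraph all come from the 1/√n formula, which takes one line to check (the 42% approval figure below just reuses the rating quoted earlier as an example):

    from math import sqrt

    def margin_of_error(n):
        """Approximate 95% margin of error for a simple survey of n people."""
        return 1 / sqrt(n)

    for n in (400, 500, 1000, 1111):
        print(n, f"{margin_of_error(n):.1%}")
    # 400 5.0%, 500 4.5%, 1000 3.2%, 1111 3.0%

    p, n = 0.42, 1000   # e.g. 42% approval from a poll of 1000 people
    print(round(p - margin_of_error(n), 3), round(p + margin_of_error(n), 3))   # 0.388 0.452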
Moreover, a 95% confidence means that there is a 5% chance that the population parameter
is actually outside the confidence interval. In other words, on average, 1 out of every 20 of
these surveys has a population parameter outside the confidence interval!
Hypothesis Testing
This is one of the most powerful techniques in statistics. This technique is used to REJECT
a claim. This is done by assuming the population parameter is a specific value. This is called
the null hypothesis. In experiments, this is the assumption that the treatment had NO
EFFECT. The alternative hypothesis is accepted (i.e. there was a difference between the
control and treatment groups) ONLY IF the null hypothesis is rejected. So, in a way, it is
possible to get an affirmative result from rejecting the null hypothesis.
This method can ONLY say that the null hypothesis is most likely false. If we cannot
reject the null hypothesis, we do not know that the null hypothesis is true. All we can say in
this case is that there is not sufficient evidence to reject the null hypothesis and accept the
alternative.
Now that we know what can happen, we have to decide how to accept or reject a null
hypothesis. Rejecting the null hypothesis is saying that it is not likely that we obtained our
data from a sample that had the assumed population parameter. There is always a probability
that we did get this data set from a sample of the population whose population parameter is
in fact the value in the null hypothesis. However, this probability can be very small. When
this probability is small, it is not likely that the population parameter is the value stated in the null hypothesis.
If, for instance, the probability of getting values as extreme as ours from a population with the population parameter of the null hypothesis is less than 1 in 100 (i.e. 0.01 or 1%), the test is significant at the 0.01 level. Strictly speaking, this 1% is the chance of seeing such extreme data if the null hypothesis were true; informally, it is read as saying that we can be 99% confident in rejecting the null hypothesis.
Typically, significance at the 0.05 level (i.e. less than a 1 in 20, or 5%, chance of seeing such data if the null hypothesis were true) is enough to reject the null hypothesis. However, sometimes
one needs to be very careful, such as in medical studies, and significance at the 0.01 level is
usually required (such as by the FDA).
If, on the other hand, the probability of getting data as extreme as ours by chance is greater than 1 in 20, there is not sufficient evidence to reject the null hypothesis. In this case, you can make no conclusion other than that there is not enough reason to doubt the null hypothesis.
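As a concrete (and entirely hypothetical) illustration, here is a sketch of a significance test for a survey proportion using the normal approximation from 6C; the standard error formula sqrt(p(1 − p)/n) reduces to the 1/(2√n) figure used earlier when p is 50%. Suppose the null hypothesis says approval is 50%, and a poll of 1000 people finds 46%:

    from math import erf, sqrt

    def p_value(p_null, p_hat, n):
        """Two-sided p-value for a sample proportion, using the normal approximation."""
        se = sqrt(p_null * (1 - p_null) / n)            # standard error under the null hypothesis
        z = abs(p_hat - p_null) / se
        return 2 * (1 - 0.5 * (1 + erf(z / sqrt(2))))   # area in both tails of the normal curve

    p = p_value(0.50, 0.46, 1000)
    print(round(p, 3))   # about 0.011: significant at the 0.05 level, but not at the 0.01 level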
Chapter 6 Definitions
distribution of a variable (or data set) describes the values taken on by the variable and
the frequency (or relative frequency) of these values.
mean: a measure of the average, defined as (sum of all values) / (total number of values).
median: a measure of the average, defined as the middle value of the data set. If there are
an even number of values, it is half way between the two middle values.
mode: a measure of the average, defined as the most common value (or group of values) in
a distribution.
outlier: a data value that is much higher or much lower than almost all other values.
symmetric distribution: a distribution whose left half is a mirror image of its right half.
left-skewed distribution: a distribution whose values are more spread out on the left side
(the left side has a tail).
right-skewed distribution: a distribution whose values are more spread out on the right
side (the right side has a tail).
variation: a measure of how widely data values are spread about the center of a distribution.
range: (of a data set:) the difference between its highest and lowest data value.
lower quartile: (or first quartile:) the median of the data values in the lower half of the
data set.
middle quartile: (or second quartile:) the median of the entire data set.
upper quartile: (or third quartile:) the median of the data values in the upper half of the
data set.
five-number summary: summary of a data set consisting of the lowest value, lower quartile,
middle quartile, upper quartile and highest value.
boxplot: graphical representation of the five-number summary. The lower and upper quartiles are the ends of a box, the median is a line inside the box, and whiskers extend to the highest and lowest values.
standard deviation: a (good) measure of variation.
deviation: the difference between a data value and the mean (data value − mean).
normal distribution: a symmetric, bell-shaped distribution with a single peak. Its peak is the mean, median AND mode of the distribution.
68-95-99.7 rule: In a normal distribution, 68% of the values are between (mean - 1 standard
deviation) and (mean + 1 standard deviation), 95% of the values are between (mean
- 2 standard deviations) and (mean + 2 standard deviations), and 99.7% of the values
are between (mean - 3 standard deviations) and (mean + 3 standard deviations).
standard score: (or z-score:) the number of standard deviations a data value lies from the
mean.
nth percentile: the smallest value in the set with the property that n% of the data values
are less than or equal to it.
statistically significant: it is unlikely that the data set or observations occurred by chance.
sampling distribution: the distribution of a sample statistic (such as a mean or a proportion) over many samples.
null hypothesis: a claim that the population is a specific value (usually the claim that what
you are testing has NO EFFECT).
alternative hypothesis: the claim that is accepted if the null hypothesis is rejected.
Chapter 6 Equations
µ = mean = (sum of all values) / (total number of values)

range = highest value (max) − lowest value (min)

σ = standard deviation = √[ (sum of (deviations from the mean)²) / (total number of data values − 1) ]

z = standard score = (value − mean) / (standard deviation)