MATH 2441
Probability and Statistics for Biological Sciences
Rough Cuts
Several simple "rules of thumb" are useful for making rough estimates of various quantities
in statistics. Some of the most useful are described briefly here.
The Empirical Rule
The so-called empirical rule describes how values in sets of data or populations which obey the normal
distribution are clustered around the mean value. It is exact for situations which are exactly normally
distributed, but it can be a reasonable approximation in situations where the normal distribution is followed
approximately (i.e., the distribution is unimodal and symmetric). The rule is illustrated in the following figure:
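[Figure: Empirical Rule. A normal curve with the horizontal axis marked at \(\mu - 3\sigma\), \(\mu - 2\sigma\), \(\mu - \sigma\), \(\mu\), \(\mu + \sigma\), \(\mu + 2\sigma\), and \(\mu + 3\sigma\); the central intervals are labelled 68%, 95%, and 99.7%.]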
For a normally distributed set of data or a normally distributed population,
• approximately 68% of all elements will fall within one standard deviation of the mean; that is,
  will have a value between \(\mu - \sigma\) and \(\mu + \sigma\)
• approximately 95% of all elements will fall within two standard deviations of the mean; that is,
  will have a value between \(\mu - 2\sigma\) and \(\mu + 2\sigma\)
• approximately 99.7% of all elements will fall within three standard deviations of the mean.
  Since this percentage is so close to 100%, people often state this third rule as "effectively the
  entire population will be located within three standard deviations of the mean".
These percentages are accurate when the distribution is exactly a normal distribution (well, ok, the exact
percentages in that case to four significant figures are 68.27%, 95.45%, and 99.73%, respectively).
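The percentages for an exactly normal distribution can be checked numerically. The following is a minimal sketch, assuming only the standard fact that for a normal distribution the fraction within k standard deviations of the mean equals erf(k/√2); the function name `fraction_within` is introduced here purely for illustration.

```python
import math

def fraction_within(k):
    """Fraction of an exactly normal population within k standard deviations of the mean."""
    # For a normal distribution, P(|X - mu| <= k*sigma) = erf(k / sqrt(2)).
    return math.erf(k / math.sqrt(2))

for k in (1, 2, 3):
    print(f"within {k} standard deviation(s): {100 * fraction_within(k):.2f}%")
```

Rounded to the nearest convenient figure, the printed values give the 68%, 95%, and 99.7% of the rule of thumb.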
When the distribution is symmetric, we know that the percentage of the population excluded from each of
these intervals will be evenly split between the lower tail and the upper tail of the distribution. Thus, we can
also make statements along the following lines:
• approximately 16% of the population is more than one standard deviation below the mean
• approximately 16% of the population is more than one standard deviation above the mean
• approximately 2.5% of the population is more than two standard deviations below the mean,
  and similarly, approximately 2.5% of the population is more than two standard deviations
  above the mean
• only about 3 elements out of 1000 (0.3%) of a population will deviate from the mean by more
  than three standard deviations.
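For comparison, the tail fractions of an exactly normal distribution can be computed directly. This is a rough sketch, assuming exact normality; `tail_beyond` is a name made up for this illustration, and it relies on the standard identity that the upper tail beyond k standard deviations is erfc(k/√2)/2.

```python
import math

def tail_beyond(k):
    """Fraction of an exactly normal population more than k standard deviations above the mean."""
    # By symmetry, the same fraction lies more than k standard deviations below the mean.
    return 0.5 * math.erfc(k / math.sqrt(2))

for k in (1, 2, 3):
    one_tail = tail_beyond(k)
    print(f"k = {k}: each tail {100 * one_tail:.2f}%, both tails together {200 * one_tail:.2f}%")
```

The exact values come out slightly below the rounded rule-of-thumb figures (the two-standard-deviation tail is nearer 2.3% than 2.5%, for instance), which is the price of using easy-to-remember numbers.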
David W. Sabo (1999)
Tchebysheff's Theorem
When you don't feel justified in applying the empirical rule (either because you have evidence that the
population distribution is not even approximately normal, or you don't know and don't want to make an
erroneous assumption), there is another rule that may be tried.
This result, called Tchebysheff's Theorem, makes no assumptions at all about the shape of the data
distribution. It can be stated as follows:
"The fraction of a population occurring within k standard deviations of the mean is at least \(1 - \frac{1}{k^2}\)."
This rule is true for any positive value of k (not just whole numbers), but to get an idea of how its results
compare with those from the empirical rule, we look at it for k = 1, 2, and 3:

• when k = 1, \(1 - \frac{1}{k^2} = 1 - \frac{1}{1^2} = 1 - 1 = 0\), indicating that at least 0% of the data values fall within
  one standard deviation of the mean. Since no fraction of a collection can ever be less than 0%,
  this statement tells us nothing.

• when k = 2, \(1 - \frac{1}{k^2} = 1 - \frac{1}{2^2} = 1 - \frac{1}{4} = \frac{3}{4} = 0.75\), indicating that at least 75% of the data values
  will be found within two standard deviations of the mean. (When we are confident enough that
  the data is approximately normally distributed so that we can use the empirical rule, we're able
  to make the statement that approximately 95% of the data falls within this interval -- the lack of
  information about the data distribution results in this rule hedging by 20 percentage points.)
  Note that since we haven't assumed that the distribution is symmetric about the mean here, we
  can't say anything about where the residual 25% of the data may be -- just that up to 25% of
  the data may lie two or more standard deviations away from the mean.

• when k = 3, \(1 - \frac{1}{k^2} = 1 - \frac{1}{3^2} = 1 - \frac{1}{9} = \frac{8}{9} \approx 0.89\), allowing us to say that at least 89% of the data or the
  population will be found within three standard deviations of the mean.
Tchebysheff's theorem has the advantage that what it says is guaranteed to be true. Its disadvantage is that
what it says is often so imprecise that it is of little practical use. On the other hand, if you assume the
empirical rule when it really isn't justified, you may get some very specific results, but they may be untrue.
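The trade-off between the two rules is easy to tabulate. The sketch below is illustrative only: `chebyshev_lower_bound` and `normal_fraction` are names invented here, the first implementing the bound 1 - 1/k² from Tchebysheff's theorem and the second the exact normal fraction erf(k/√2).

```python
import math

def chebyshev_lower_bound(k):
    """Guaranteed minimum fraction within k standard deviations, for any distribution."""
    if k <= 0:
        raise ValueError("k must be positive")
    # For k <= 1 the formula gives zero or a negative number, so the bound is just 0.
    return max(0.0, 1.0 - 1.0 / k**2)

def normal_fraction(k):
    """Fraction within k standard deviations if the distribution is exactly normal."""
    return math.erf(k / math.sqrt(2))

for k in (1, 1.5, 2, 3):
    print(f"k = {k}: at least {100 * chebyshev_lower_bound(k):.1f}% (any distribution), "
          f"{100 * normal_fraction(k):.1f}% (exactly normal)")
```

Since k need not be a whole number, a value such as k = 1.5 can be tried as well; the guaranteed bound is always well below the figure that holds for a normal distribution.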
Example
To give you an idea of how you might use these rules of thumb, consider a situation in which a technologist
is trying to estimate the amount of time in total that must be allocated to carry out a frequent laboratory
procedure. Suppose she monitors 50 repetitions of the procedure, and finds that for those 50 repetitions,
the mean time per procedure was \(\bar{x} = 25.0\) minutes and the standard deviation was \(s = 5\) minutes (nice round numbers for the
example!). According to the empirical rule, she could then say that the procedure will take more than
30 minutes only about 16% of the time, more than 35 minutes only about 2.5% of the time, and more than
40 minutes only rarely (around three times out of every 2000). On the
other hand, using Tchebysheff's theorem, all she could say is that at least 75% of the procedures will require
less than 35 minutes (so up to 25% could require more than 35 minutes); at least 89% of the procedures will
require less than 40 minutes (or up to 11% could require more than 40 minutes), and so on. Of course, if it
was really important to make precise and reliable predictions here, the technologist should employ statistical
techniques that go beyond simple "rules-of-thumb" -- which are primarily intended to allow people to make
some fast, rough, but insightful estimates, rather than profound, reliable, precise analyses.
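The numbers in this example can be reproduced with a few lines of code. This is only a sketch under the assumptions stated in the example (mean 25.0 minutes, standard deviation 5 minutes); the function `time_estimates` and its layout are hypothetical, not part of the course material.

```python
def time_estimates(mean, sd):
    """For k = 1, 2, 3 return (threshold, empirical-rule upper tail, Tchebysheff maximum beyond)."""
    # Central percentages quoted by the empirical rule.
    empirical_central = {1: 0.68, 2: 0.95, 3: 0.997}
    rows = []
    for k, central in empirical_central.items():
        threshold = mean + k * sd
        upper_tail = (1 - central) / 2      # excluded part split evenly between the two tails
        chebyshev_max = min(1.0, 1 / k**2)  # at most this fraction lies k or more sd from the mean
        rows.append((threshold, upper_tail, chebyshev_max))
    return rows

for threshold, upper_tail, chebyshev_max in time_estimates(25.0, 5.0):
    print(f"more than {threshold:.0f} min: about {100 * upper_tail:.2f}% (empirical rule), "
          f"at most {100 * chebyshev_max:.0f}% (Tchebysheff)")
```

Note that the Tchebysheff figure limits both tails combined, so it is a safe but usually generous ceiling on the fraction of procedures running long.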
The Relationship Between the Mean and the Median
Recall that if the distribution is not symmetric, then the mean and median will have different values, with the
mean being in the direction of skewing from the median. It is possible to demonstrate that the mean and
median are never different by more than one standard deviation. When applied to a population, this
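principle takes the form:
\[ |\mu - \tilde{\mu}| \le \sigma \]
where \(\tilde{\mu}\) denotes the population median. Of course, this is true for a symmetric distribution as well, since then \(|\mu - \tilde{\mu}| = 0 \le \sigma\), because \(\sigma\) is never a
negative number.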
Example 1:
Recall the example of the very skewed set of data described on page 2 of the document on "Measures of
Central Tendency". The group under consideration consisted of 20 students, of which 19 had an income of
$2000 each and one had an income of $10,000,000 for a particular year. For these twenty students, the
median annual income was $2000, but the mean annual income was $501,900.
We can work out an estimate of \(\sigma\) here by calculating the value of s for the 20 students. Letting \(x_k\) denote
the annual income of student number k, we have
\[ \sum_{k=1}^{20} x_k = 19 \times \$2000 + 1 \times \$10{,}000{,}000 = \$10{,}038{,}000 \]
Similarly
\[ \sum_{k=1}^{20} x_k^2 = 19 \times (\$2000)^2 + 1 \times (\$10{,}000{,}000)^2 = 100{,}000{,}076{,}000{,}000 \]
Thus,
\[ s^2 = \frac{\sum_{k=1}^{n} x_k^2 - \frac{1}{n}\left(\sum_{k=1}^{n} x_k\right)^2}{n - 1} = \frac{100{,}000{,}076{,}000{,}000 - \frac{1}{20}(10{,}038{,}000)^2}{19} = 4{,}998{,}000{,}200{,}000 \]
so that
\[ s = \sqrt{s^2} = \sqrt{4{,}998{,}000{,}200{,}000} \approx \$2{,}235{,}621 \]
Now,
\[ |\bar{x} - \tilde{x}| = \$501{,}900 - \$2000 = \$499{,}900 , \]
which is clearly less than or equal to the estimate s = $2,235,621 we have for \(\sigma\).
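The arithmetic in this example is easy to get wrong by hand, so here is a short check. It is a sketch using Python's standard `statistics` module on the 20 incomes described above; the variable names are chosen here for illustration, and the shortcut formula is the same one used in the hand calculation.

```python
import statistics

# 19 students with an income of $2000 and one with $10,000,000.
incomes = [2000] * 19 + [10_000_000]
n = len(incomes)

mean = statistics.mean(incomes)
median = statistics.median(incomes)
s = statistics.stdev(incomes)  # sample standard deviation (divides by n - 1)

# The same variance via the shortcut formula used in the hand calculation.
sum_x = sum(incomes)
sum_x2 = sum(x * x for x in incomes)
s2_shortcut = (sum_x2 - sum_x**2 / n) / (n - 1)

print(f"mean = {mean}, median = {median}, s = {s:.0f}")
print(f"|mean - median| = {abs(mean - median):.0f}, which is <= s: {abs(mean - median) <= s}")
```

Both routes to the variance should agree, and the difference between the mean and the median comes out well under one standard deviation, as the principle requires.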

Example 2:
Consider a less far-fetched set of data: the SalmonCa0 data. From information provided in class, we know
that \(\bar{x} = 74.28\) and \(\tilde{x} = 70.50\), so that \(|\bar{x} - \tilde{x}| = 3.78\), which is less than or equal to \(s \approx 22.02\) by a good bit.

A Rough Estimate of s
The standard deviation is not a very intuitive quantity, and so it is not always easy to tell if the value you
calculate is reasonable (as opposed to being the ridiculous result of an arithmetic blunder). One rough
estimate of the value of s arises out of the empirical rule.
For approximately normally distributed populations, about 95% of the members will fall within two standard
deviations of the mean, an interval with a width of \(2\sigma + 2\sigma = 4\sigma\). Thus, particularly when sample sizes are
not in the thousands (in which case there is a good likelihood of encountering members of the population
that are three or more standard deviations from the mean), it is reasonable to equate this interval roughly
with the sample range. This will give a ballpark estimate of \(\sigma\) and hence also of s. Thus, very roughly, for
the data in a sample, we can write
\[ s \approx \frac{\text{largest} - \text{smallest}}{4} \]
Example:
For the SalmonCa0 data, the largest observation was 129 ppm, and the smallest observation was 29 ppm.
Thus, according to this rough rule,
\[ s \approx \frac{129 \text{ ppm} - 29 \text{ ppm}}{4} = 25 \text{ ppm} \]
An exact calculation gives s = 22.02 ppm, to two decimal places. The values 22.02 and 25 are close
enough that we can be confident no serious blunder has been committed in calculating s. Had we obtained
2.202 or 220.2 when we tried to calculate s, we would probably recheck our work, because either of these
two values would be difficult to accept as roughly equal to 25.
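This sanity check is simple to automate. The sketch below is one possible way to do it: `rough_sd` applies the range-over-four rule, while `looks_reasonable` and its threshold of a factor of 3 are assumptions made up for this illustration, not a standard criterion.

```python
def rough_sd(largest, smallest):
    """Range-based ballpark for s: (largest - smallest) / 4."""
    return (largest - smallest) / 4

def looks_reasonable(s, estimate, factor=3.0):
    """True if s is within the given (hypothetical) factor of the rough estimate."""
    return estimate / factor <= s <= estimate * factor

estimate = rough_sd(129, 29)  # SalmonCa0 extremes, in ppm
for candidate in (22.02, 2.202, 220.2):
    verdict = "plausible" if looks_reasonable(candidate, estimate) else "recheck the arithmetic"
    print(f"s = {candidate} ppm against rough estimate {estimate} ppm: {verdict}")
```

Only a loose tolerance makes sense here, since the rule is meant to catch misplaced decimal points and similar blunders rather than small discrepancies.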
