2.0 Descriptive statistics

One of my pleasures is growing roses. In my dreams, I have a garden full of 200 rose
plants. I'd like to maximize the number of flowers on each of my rose plants, so I'll plan
some experiments. Before I begin my experiments, I'll collect some information on my
200 roses: what is the average number of flowers on each rose plant, and how variable is
the number of flowers?
Being enthusiastic, I measure the number of flowers on each of my 200 roses. The graph
on the left shows the result. Each circle represents one rose plant. The graph on the right
shows a histogram for the same data.
Population versus sample
The 200 roses in my garden are the entire population that I'm interested in working on.
Another population I might be interested in could be the set of all pills produced in a
particular factory during a particular year. For both these examples, it is relatively easy to
count and measure every member of the population.
It is unusual that the entire population we are interested in is small and easily defined.
More commonly, we could be interested in the population consisting of all people who
might be given a drug we are developing. It would be difficult to count and measure the
effect of our drug in people who may not even be born yet. Or we might be interested in
the population of all members of a species of beetle. They would be quite difficult to
count and measure.
Since it is often difficult or impossible to count and measure every member of a
population we are interested in, we usually take a sample from that population. We then
try to make inferences about the population based on our samples. When I do my rose
experiments, I usually won't want to do the work of measuring all 200 roses again.
Instead, I'll likely take a sample of, say, 10 or 20 roses. When I do a clinical trial, I can't
measure all the patients who might take the drug, so I might take a sample of, say, 300
patients.
A typical value: the population mean
Because I measured the number of flowers on all 200 roses in my garden, I can calculate
the mean of all 200. The mean of all 200 roses in my population is called the population
mean. As we'll see shortly, the mean of a sample of, say, 10 roses from my garden (the
sample mean) will be close to, but rarely exactly the same as, the mean of all 200 (the
population mean).
To calculate the mean (average) of N numbers, add up all the N numbers and divide by
N. The Greek letter μ (mu) denotes the population mean.
Population mean = $\mu = \frac{\sum_{i=1}^{N} X_i}{N}$
The symbol Σ in the formula is the Greek capital letter Sigma. $\sum X_i$ indicates adding up all
N of the Xi observations. If you are not familiar with using Σ notation, see the end of this
chapter for an explanation.
Suppose my population is 5 small apple trees growing in my garden. I count the number
of ripe apples on each tree.
Tree number             1    2    3    4    5
Number of ripe apples   8    8    10   12   12
The sum of all 5 numbers is 50, so the mean number of ripe apples is 50/5 = 10 ripe
apples.
Population mean = $\mu = \frac{\sum_{i=1}^{N} X_i}{N} = 50/5 = 10$.
The mean has two interesting properties.
1. The sum of the signed distances of the observations from the mean is always zero.
2. The mean minimizes the sum of the squared distances of the observations from
a reference value. If you compute the sum of the squared distances of all the observations
from any value other than the mean, that sum will always be greater than the sum of
the squared distances from the mean.
In statistics, we often describe the distance of an observation from the mean as the
deviation from the mean.
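Both properties can be checked numerically on the five apple trees. The following is only a small sketch: the function names and the list of candidate values are chosen here for illustration and are not from the book.

```python
# Ripe apples on the 5 trees (the whole population in this example).
apples = [8, 8, 10, 12, 12]
mean = sum(apples) / len(apples)  # 10.0

def sum_of_deviations(values, center):
    """Sum of the signed distances of the values from `center`."""
    return sum(x - center for x in values)

def sum_of_squared_deviations(values, center):
    """Sum of the squared distances of the values from `center`."""
    return sum((x - center) ** 2 for x in values)

# Property 1: the deviations from the mean add up to zero.
print(sum_of_deviations(apples, mean))

# Property 2: the sum of squared deviations is smallest at the mean.
# The candidate centers below are arbitrary illustrative values.
for candidate in (9.0, 9.5, 10.0, 10.5, 11.0):
    print(candidate, sum_of_squared_deviations(apples, candidate))
```

The loop prints the smallest sum (16) for the candidate 10, the mean, and larger sums for every other candidate.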
For my 200 roses, the number of flowers on each rose is in the Excel file
"flowercounts.xlsx" on the book website. In that file, you see that the population mean is
μ = 19.65 flowers per rose.
Population variance and standard deviation
I'm also interested in the variability of the number of ripe apples on my trees, and will be
interested in the variability of the number of flowers on my roses.
A common goal of designed experiments is to reduce variability. For example, we might
want to
• reduce the variability of the yield of a process
• reduce the variability in the amount of drug in a pill
• reduce the variability of replicate measures in an assay
We could describe the variability of the number of flowers by giving the highest and the
lowest values (the range of values). But the range is not a very good descriptor of
variability, because it can be greatly affected by a single unusual point. A single incorrect
measurement or an outlier could give very extreme values. The most widely used
descriptors of variability are the variance and the standard deviation.

• population variance = σ² (sigma squared)
• population standard deviation = σ (sigma)
We calculate a population variance to describe the variability of observations around the
population mean.
population variance = $\sigma^2 = \frac{\sum_{i=1}^{N} (X_i - \mu)^2}{N}$
The population standard deviation is just the square root of the population variance.
population standard deviation = $\sigma = \sqrt{\frac{\sum_{i=1}^{N} (X_i - \mu)^2}{N}}$
Tree number             1    2    3    4    5
Number of ripe apples   8    8    10   12   12
So for our ripe apples example, we have
population variance = $\sigma^2 = \frac{\sum_{i=1}^{N} (X_i - \mu)^2}{N}$
= $[(8 - 10)^2 + (8 - 10)^2 + (10 - 10)^2 + (12 - 10)^2 + (12 - 10)^2]/5$
= [4 + 4 + 0 + 4 + 4]/5
= 16/5
= 3.2 ripe apples²
The population standard deviation is the square root of the variance.
=√
2
∑𝑁
𝑖=1( 𝑥𝑖 − 𝜇)
𝑁
= sqrt(3.2) = 1.79 ripe apples.
For my 200 roses, the Excel file "flowercounts.xlsx" shows that the
population standard deviation is σ = 5.02 flowers per rose.
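The apple calculation can be reproduced in a few lines of Python. This is a minimal sketch; the variable names are chosen here for illustration, and the cross-check uses `pvariance` and `pstdev` from Python's standard `statistics` module, which divide by N.

```python
import math
import statistics

# Ripe apples on the 5 trees; these 5 trees are the whole population.
apples = [8, 8, 10, 12, 12]
n = len(apples)

mu = sum(apples) / n                                    # population mean
sigma_squared = sum((x - mu) ** 2 for x in apples) / n  # divide by N
sigma = math.sqrt(sigma_squared)                        # population standard deviation

print(mu, sigma_squared, round(sigma, 2))

# Cross-check against the standard library's population formulas.
print(statistics.pvariance(apples), round(statistics.pstdev(apples), 2))
```

Both lines report a variance of 3.2 ripe apples² and a standard deviation of about 1.79 ripe apples, matching the hand calculation.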
Sample mean, variance and standard deviation
When we evaluate a new drug we might wish that we could collect observations on every
member of the population, and calculate the population variance and the population
standard deviation. But that is usually not possible. In clinical trials and experiments, we
take a random sample of N observations from a (much larger, possibly infinite)
population.
The sample mean, sample variance and sample standard deviation are estimates of the
true population mean, population variance and the population standard deviation.
We will use different symbols when we calculate these descriptive statistics using the N
observations in a sample:
• sample mean = $\bar{X}$
• sample variance = s²
• sample standard deviation = s
We use an X with a bar over the top, $\bar{X}$, as the symbol for the sample mean.
Sample variance = $s^2 = \frac{\sum_{i=1}^{N} (X_i - \bar{X})^2}{N - 1}$
The sample standard deviation, s, is the square root of the sample variance s².
$s = \sqrt{\frac{\sum_{i=1}^{N} (X_i - \bar{X})^2}{N - 1}}$
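Notice that, for the population standard deviation, we divided by N. For the sample
standard deviation, we divide by N-1. Observations tend to be closer to their own sample
mean than to the population mean, so dividing by N would understate the variability.
Dividing by N-1 compensates for this and gives us a better estimate of the true population
variability.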
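The effect of the divisor can be seen without any random sampling. The sketch below is an illustration added here, not the book's own demonstration. It assumes we draw samples of size 2 with replacement from the five apple trees, lists every possible sample, and averages the two versions of the variance over all of them. The function name `variance_with_divisor` is hypothetical.

```python
from itertools import product

population = [8, 8, 10, 12, 12]  # ripe apples on the 5 trees
sample_size = 2                  # assumed sample size for this illustration

def variance_with_divisor(values, divisor):
    """Sum of squared deviations from the mean of `values`, divided by `divisor`."""
    m = sum(values) / len(values)
    return sum((x - m) ** 2 for x in values) / divisor

# True population variance (divide by the population size N = 5).
population_variance = variance_with_divisor(population, len(population))

# Every ordered sample of size 2, drawn with replacement: 5 * 5 = 25 samples.
samples = list(product(population, repeat=sample_size))

avg_dividing_by_n = sum(
    variance_with_divisor(s, sample_size) for s in samples
) / len(samples)
avg_dividing_by_n_minus_1 = sum(
    variance_with_divisor(s, sample_size - 1) for s in samples
) / len(samples)

print("population variance:      ", population_variance)
print("average, dividing by N:   ", avg_dividing_by_n)
print("average, dividing by N-1: ", avg_dividing_by_n_minus_1)
```

Under these assumptions, dividing by N gives an average of 1.6, half the population variance of 3.2, while dividing by N-1 gives an average of 3.2.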
Here is an example of calculating the sample mean and standard deviation. In this case,
we calculate the mean number of flowers on five plants. These N = 5 plants are a sample
of all the plants that we could measure by repeating the experiment.
Plant number        1    2    3    4    5
Number of flowers   8    8    10   12   12
To calculate the mean of N numbers, add up all the N numbers and divide by N. For the
five plants N is 5. The sum of all 5 numbers is 50, so the mean number of flowers is 50/5
= 10 flowers:
Mean number of flowers = $\bar{X}$
= (8 + 8 + 10 + 12 + 12) / 5
= 50/5
= 10 flowers.
We calculate the sample variance to describe the variability of observations around the
sample mean.
Sample variance = $s^2 = \frac{\sum_{i=1}^{N} (X_i - \bar{X})^2}{N - 1}$
= $[(8 - 10)^2 + (8 - 10)^2 + (10 - 10)^2 + (12 - 10)^2 + (12 - 10)^2]/(5 - 1)$
= [4 + 4 + 0 + 4 + 4]/4
= 16/4
= 4 flowers²
The variance has units of flowers², flowers squared. We’d like to have a measure of
variability in flowers, the same units as the original measurements. The sample standard
deviation, s, has the same units as the original measurements.
Sample standard deviation = s
= Square root (sample variance)
= Square root (s²)
= Square root (4 flowers²)
= 2 flowers.
As a check, 2² = 4.
In our sample of 5 plants, the sample standard deviation of the number of flowers on each
plant is s = 2 flowers. The sample variance is s² = 4 flowers².
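The same sample statistics can be computed with Python's standard `statistics` module, whose `variance` and `stdev` functions divide by N-1. This is a minimal sketch with variable names chosen here for illustration.

```python
import statistics

# Number of flowers on the 5 sampled plants.
flowers = [8, 8, 10, 12, 12]

x_bar = statistics.mean(flowers)          # sample mean
s_squared = statistics.variance(flowers)  # sample variance (divides by N-1)
s = statistics.stdev(flowers)             # sample standard deviation

print("sample mean:", x_bar)
print("sample variance:", s_squared)
print("sample standard deviation:", s)
```

This reproduces the hand calculation: a mean of 10 flowers, a variance of 4 flowers², and a standard deviation of 2 flowers.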
Standard Error of the Mean (SEM): An intuitive explanation
Often, we want to test if there is a difference between the means of two groups. For
example, I'd like to test if the mean number of flowers per rose is different when I do or
do not use fertilizer. To test for differences in the means, I need an accurate estimate of
the true population mean for each treatment group.
We want to quantify how accurately the sample mean estimates the population mean. To
quantify the error in estimating the population mean we use the standard error of the
mean, which we'll define next. The symbol for standard error of the mean (SEM) is
$\sigma_{\bar{X}}$.
To illustrate the concept of the standard error, let's start by taking some random samples
from our population of 200 roses and seeing how close the mean of each sample is to the
population mean μ = 19.65 flowers per rose.
The graph shows the 10 roses selected in our first random sample, Sample 1. The black
boxes in the graph show the 10 roses selected. The number of flowers on the 10 roses in
Sample 1 is 14, 15, 18, 19, 21, 21, 23, 27, 28, and 28. The mean number of flowers on the
10 roses is 21.4.
We'll take two more samples of 10 roses each to give us Sample 2 and Sample 3:
Here are the number of flowers on the roses in each sample, and the sample mean.
Sample   N    Number of flowers on each rose            Sample mean
1        10   14, 15, 18, 19, 21, 21, 23, 27, 28, 28    21.4
2        10   6, 13, 19, 20, 20, 20, 21, 23, 27, 27     19.6
3        10   7, 8, 15, 16, 19, 20, 21, 21, 21, 26      17.4
The means of these three samples are distributed around the population mean of 19.65.
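The three sample means can be recomputed directly from the flower counts listed above. This is a small sketch; the dictionary layout and variable names are illustrative choices, and the population mean of 19.65 is taken from the text.

```python
# Flower counts for the three samples of 10 roses, as listed in the text.
samples = {
    1: [14, 15, 18, 19, 21, 21, 23, 27, 28, 28],
    2: [6, 13, 19, 20, 20, 20, 21, 23, 27, 27],
    3: [7, 8, 15, 16, 19, 20, 21, 21, 21, 26],
}
population_mean = 19.65  # from the full population of 200 roses

sample_means = {}
for label, flowers in samples.items():
    sample_means[label] = sum(flowers) / len(flowers)
    difference = sample_means[label] - population_mean
    print(f"Sample {label}: mean = {sample_means[label]:.1f}, "
          f"difference from population mean = {difference:+.2f}")
```

Each sample mean misses the population mean by a different amount, which is the variation the standard error of the mean is meant to describe.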
If, instead of 3 samples, we take 100 samples, what would the distribution of sample
means look like?
Here are the means of 100 random samples, each of size N=10, from the 200 roses.
[1]   15.9 16.1 16.2 16.6 16.7 16.9 17.1 17.2 17.6 17.6 17.6 17.7 17.8 17.9 18.0
[16]  18.3 18.4 18.4 18.4 18.5 18.6 18.6 18.6 18.7 18.7 18.7 18.7 18.8 18.9 18.9
[31]  18.9 18.9 19.0 19.0 19.0 19.1 19.1 19.2 19.2 19.2 19.2 19.2 19.3 19.3 19.4
[46]  19.5 19.5 19.6 19.6 19.6 19.7 19.8 19.8 19.8 19.9 19.9 19.9 19.9 19.9 20.1
[61]  20.2 20.2 20.2 20.3 20.3 20.3 20.3 20.4 20.4 20.4 20.6 20.6 20.7 20.7 20.8
[76]  20.8 20.8 20.9 20.9 21.0 21.1 21.3 21.3 21.3 21.4 21.4 21.5 21.6 21.7 21.7
[91]  21.9 21.9 21.9 22.2 22.2 22.3 23.1 23.2 23.3 24.6
Here is a histogram showing the distribution of the sample means, and a histogram of the
population of 200 roses. Notice that the distribution of the sample means is centered near
19. The mean of the 100 sample means is 19.7, close to the true population mean of
19.65. The standard deviation of the 100 sample means is 1.65.
What would the distribution of sample means look like if we had a population with a
small standard deviation? Let's look at a population with a smaller standard deviation
than the rose population: a population of 160 irises. Here are the number of flowers on
each of 160 irises, and a histogram showing the distribution.
[1]   14 15 15 16 16 16 16 17 17 17 17 17 17 17 17 17 17 17 17 17 17 18 18 18 18 18 18
[28]  18 18 18 18 18 18 18 18 18 18 18 18 18 18 18 18 18 18 18 18 18 18 18 18 18 18 18
[55]  18 19 19 19 19 19 19 19 19 19 19 19 19 19 19 19 19 19 19 19 19 19 19 20 20 20 20
[82]  20 20 20 20 20 20 20 20 20 20 20 20 20 20 20 20 20 20 20 20 20 20 20 20 20 20 20
[109] 21 21 21 21 21 21 21 21 21 21 21 21 21 21 21 21 21 21 21 21 22 22 22 22 22 22 22
[136] 22 22 22 22 22 22 22 22 22 22 22 22 22 22 22 23 23 23 23 23 23 24 24 25 25
The histogram on the left shows the distribution of the number of flowers on the irises.
The histogram on the right shows the distribution of the number of flowers on the roses.
Notice that the distribution of the number of flowers on the iris plants is narrower than
the distribution for the roses.
Here are the population mean and standard deviation for the roses and irises.
Plant   Population mean   Population standard deviation
Rose    19.65             5.02
Iris    19.60             2.03
Now, let's take 100 samples from the iris population, and calculate the mean for each
sample. Here are the means of 100 random samples, each of size N=10, from the 160
irises.
[1]   15.9 16.1 16.2 16.6 16.7 16.9 17.1 17.2 17.6 17.6 17.6 17.7 17.8 17.9 18.0
[16]  18.3 18.4 18.4 18.4 18.5 18.6 18.6 18.6 18.7 18.7 18.7 18.7 18.8 18.9 18.9
[31]  18.9 18.9 19.0 19.0 19.0 19.1 19.1 19.2 19.2 19.2 19.2 19.2 19.3 19.3 19.4
[46]  19.5 19.5 19.6 19.6 19.6 19.7 19.8 19.8 19.8 19.9 19.9 19.9 19.9 19.9 20.1
[61]  20.2 20.2 20.2 20.3 20.3 20.3 20.3 20.4 20.4 20.4 20.6 20.6 20.7 20.7 20.8
[76]  20.8 20.8 20.9 20.9 21.0 21.1 21.3 21.3 21.3 21.4 21.4 21.5 21.6 21.7 21.7
[91]  21.9 21.9 21.9 22.2 22.2 22.3 23.1 23.2 23.3 24.6
Here is a histogram showing the distribution of the iris sample means, and a histogram of
the population of 160 irises. From the histogram you can see that the distribution of the
sample means for the irises is centered near 19. The mean of the 100 sample means is
19.5, close to the true population mean of 19.6. The standard deviation of the 100 sample
means is 0.65.
Here are graphs to summarize the rose and iris example.
Here are the population mean and standard deviation for the roses and irises.
Plant   Sample size   Population mean   Population standard deviation   Standard deviation of sample means
Rose    10            19.65             5.02                            1.65
Iris    10            19.60             2.03                            0.65
The graphs show these results.
• The roses have a larger population standard deviation, and a larger standard
deviation of the sample means.
• The irises have a smaller population standard deviation, and a smaller standard
deviation of the sample means.
When we use a sample to estimate the true population mean, the error in our estimate is
affected by the population standard deviation. This example is an instance of a general
law: the standard error of the mean, $\sigma_{\bar{X}}$, is proportional to the population standard
deviation, σ.
If the population of number of flowers has a very small standard deviation, then
• the samples from that population will have a small sample standard deviation
• the sample means will be close to the population mean
• and we will have a small standard error of the mean.
If the population of number of flowers has a large standard deviation, then
• the samples from that population will have a large sample standard deviation
• the sample means may be far from the population mean
• and we will have a large standard error of the mean.
Large population variability causes a large standard error of the mean.

Now let's look at how the number of observations in a sample (sample size N) affects
how accurately the sample mean estimates the population mean. If the number of
observations in our sample is only N = 2, we’re not very confident that the sample mean
will be close to the population mean. On the other hand, if we have N = 100 or N =
1000, we start to be a lot more confident that the mean of any given sample will be close
to the population mean.
The estimate of the population mean using 2 observations is less reliable than the
estimate using 20 observations and much less reliable than the estimate of the mean using
100 observations. As the sample size N gets bigger, we expect our error in estimating the
population mean to get smaller.
Here's an example using the roses, with sample size of N = 4, 9, or 25.
Sample size N                            4      9      25
Standard deviation of 100 sample means   2.44   1.61   0.88
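The shrinking spread in the table above can be imitated with a small simulation. The sketch below does not use the real rose counts, which are in the Excel file and not reproduced here. It builds a hypothetical population of 200 plants with made-up flower counts and draws 2000 samples per sample size rather than 100, so its numbers will not match the table. Only the pattern, a smaller spread of sample means for a larger N, is the point. The function name `sd_of_sample_means` is also hypothetical.

```python
import random
import statistics

def sd_of_sample_means(population, n, n_samples, rng):
    """Draw `n_samples` random samples of size `n` (without replacement within
    each sample) and return the standard deviation of their means."""
    means = [statistics.mean(rng.sample(population, n)) for _ in range(n_samples)]
    return statistics.stdev(means)

rng = random.Random(0)  # fixed seed so the run is repeatable

# Hypothetical population: 200 plants with made-up flower counts from 5 to 35.
population = [rng.randint(5, 35) for _ in range(200)]

results = {}
for n in (4, 9, 25):
    results[n] = sd_of_sample_means(population, n, 2000, rng)
    print(f"N = {n:2d}: standard deviation of sample means = {results[n]:.2f}")
```

The printed standard deviations fall as N rises from 4 to 9 to 25, the same pattern as in the rose table.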

This example is an instance of the general case: the standard error of the mean, $\sigma_{\bar{X}}$, is
proportional to $1/\sqrt{N}$. Unfortunately, the standard error only goes down in proportion to
the square root of N, rather than in proportion to N itself. For example, to cut the standard
error in half we need four times as many observations.
These examples show that $\sigma_{\bar{X}}$ depends on both the population standard deviation, σ, and
the number of observations in our sample, N. The actual relationship is
$\sigma_{\bar{X}} = \sigma/\sqrt{N}$. That is, the standard error of the
mean is directly proportional to the population standard deviation, and inversely
proportional to the square root of N, the number of observations in each sample.
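As a rough check, the relationship "population standard deviation divided by the square root of N" can be compared with the standard deviations of sample means reported earlier in this chapter. This is a sketch only: the function name `standard_error` is chosen here for illustration, and the observed values are copied from the tables above.

```python
import math

def standard_error(sigma, n):
    """Standard error of the mean: population SD divided by sqrt(N)."""
    return sigma / math.sqrt(n)

sigma_rose = 5.02  # population standard deviation of the 200 roses
sigma_iris = 2.03  # population standard deviation of the 160 irises

# Standard deviations of 100 sample means reported in the chapter.
observed_rose = {4: 2.44, 9: 1.61, 10: 1.65, 25: 0.88}
observed_iris = {10: 0.65}

for n, observed in observed_rose.items():
    print(f"Rose, N = {n:2d}: predicted {standard_error(sigma_rose, n):.2f}, "
          f"observed {observed:.2f}")
for n, observed in observed_iris.items():
    print(f"Iris, N = {n:2d}: predicted {standard_error(sigma_iris, n):.2f}, "
          f"observed {observed:.2f}")
```

The predicted and observed values agree only approximately, which is what we would expect when each observed value is based on just 100 random samples.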
Usually we don't know the population standard deviation, σ, so instead we approximate it
by the sample standard deviation, s. Using s in place of σ gives us this formula to
estimate the standard error of the mean:
Standard Error of the Mean
= (Sample standard deviation)/(Square root of N)
= $s_{\bar{X}} = \frac{s}{\sqrt{N}}$
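The estimated standard error is easy to compute from a single sample. The sketch below applies it to the five plants from the sample variance example and to Sample 1 of the roses. The function name `estimated_sem` is a hypothetical choice, and `statistics.stdev` is the standard library's N-1 version of the standard deviation.

```python
import math
import statistics

def estimated_sem(sample):
    """Estimated standard error of the mean: s / sqrt(N)."""
    return statistics.stdev(sample) / math.sqrt(len(sample))

plants = [8, 8, 10, 12, 12]                           # s = 2 flowers, N = 5
sample_1 = [14, 15, 18, 19, 21, 21, 23, 27, 28, 28]   # rose Sample 1, N = 10

print(round(estimated_sem(plants), 3))    # 2 / sqrt(5), about 0.894
print(round(estimated_sem(sample_1), 3))  # about 1.61
```

For Sample 1 the estimate is close to the 1.65 found earlier as the standard deviation of 100 sample means, even though it uses only one sample of 10 roses.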
We’ll use standard error in statistical tests such as t-tests and analysis of variance to
compare groups. We also use standard error to determine if the slope of a regression line
is non-zero. We'll use the standard error of a statistic (such as the standard error of the
sample mean, or the standard error of coefficients in a regression model) to determine the
statistical significance (p-values).
Adding things up: Sigma (Σ) notation
The Greek symbol Sigma (Σ) in a formula means to take the sum.
Let’s look at calculating the mean number of flowers, using sigma notation. There were 5
plants, and we could assign each of them a label:
Plant               X1   X2   X3   X4   X5
Number of flowers   8    8    10   12   12
The letter X represents the variable, in this case the number of flowers on a plant, and the
subscripts 1 through 5 indicate which plant we are considering. We use the notation Xi
(X sub i) to indicate the value for any individual plant without specifying which one. So,
if i = 2, then we are considering plant 2, for which X2 = 8 flowers.
To indicate that we are adding up the number of flowers in 5 plants, we could write as
follows.
Sum of number of flowers in 5 plants = 8 + 8 + 10 + 12 + 12.
Or we could write:
Sum of number of flowers in 5 plants = X1+ X2+ X3+ X4+ X5.
It would get tedious to write out this formula for a lot of plants, so instead we use the
Sigma (Σ) notation:
Sum of number of flowers in 5 plants
= $\sum_{i=1}^{5} X_i$
= sum of Xi for i from 1 to 5
= X1 + X2 + X3 + X4 + X5
= 8 + 8 + 10 + 12 + 12
= 50
Sometimes we won’t write out the subscript “i=1” or the superscript “5” if the meaning is
clear. In that case, we might just write $\sum X_i$.
Finally, to calculate the mean of the number of flowers in 5 plants using sigma notation,
we write the following.
Mean of number of flowers in 5 plants = $\bar{X} = \frac{\sum_{i=1}^{5} X_i}{5}$
= 50/5
= 10 flowers
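For readers who program, Sigma notation corresponds to a loop that adds up the terms. This is a minimal sketch, with the list name `X` chosen to mirror the notation. Note that Python numbers list positions from 0, so X1 through X5 are stored at positions 0 through 4.

```python
# Number of flowers on plants X1..X5, stored at list positions 0..4.
X = [8, 8, 10, 12, 12]
N = len(X)

# Sigma notation written out as an explicit sum over i.
total = sum(X[i] for i in range(N))

# The mean is the sum divided by the number of observations.
x_bar = total / N

print(total)  # 50
print(x_bar)  # 10.0
```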