Chapter 6: RANDOM SAMPLING AND DATA DESCRIPTION Part 1

Chapter 6: RANDOM SAMPLING AND
DATA DESCRIPTION
Part 1: Random Sampling
Numerical Summaries
Stem-n-Leaf plots
Histograms, and Box plots
Sections 6-1 to 6-4
Random Sampling
In statistics, we’re usually interested in a value
or parameter that describes a particular population.
Such as...
• The mean cholesterol level of all 50-year-old
men
– value of interest is the mean
– population is all 50-year-old men
• The mean height of all NBA basketball
players
– value of interest is the mean
– population is all NBA basketball players
• The mean number of worker-related failures
occurring on any given Friday
– value of interest is the mean
– population is all Fridays
• The mean hole diameter of manufactured
washers
– value of interest is the mean
– population is all manufactured washers
Gathering data on all individuals in a population is usually not realistic (though the census
attempts this every 10 years).
But we can get info on a population by looking
at a subset of the population.
To get at the population parameters (such as
the population mean µ), we collect data on a
subset of the full population.
[Figure: Sample drawn from the Population]
Often, this is done with a simple random sample
of the population, which means the observations
were taken totally at random, and each individual had the same chance of being chosen.
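As a minimal sketch of drawing a simple random sample, assuming Python's standard `random` module and a made-up population of 1,000 washer ID numbers (the population, the seed, and the sample size of 25 are illustrative assumptions, not values from these slides):

```python
import random

# Hypothetical population: ID numbers for 1,000 manufactured washers.
population = list(range(1, 1001))

# A fixed seed only makes this sketch repeatable; it is not part of the method.
rng = random.Random(42)

# random.sample draws without replacement, and every subset of size k
# is equally likely, so each individual has the same chance of selection.
sample = rng.sample(population, k=25)

print(sorted(sample))
```

In practice the hard part is usually obtaining a complete list of the population to draw from, not the draw itself.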
What do we do with the data once we collect it?
We can summarize it in a useful manner. One
option is to report a statistic from the data.
• Statistic
A statistic is a summary value calculated from
a sample of observations. Usually, a statistic
is an estimator of some population parameter.
Suppose we collect n observations in a sample
x1, x2, . . . , xn, from a particular population.

Statistic                                                                Estimates the population parameter
(calculated from the data)                                               (unknown)
Sample mean: $\bar{x} = \frac{\sum_{i=1}^{n} x_i}{n}$                    Population mean: $\mu$
Sample variance: $s^2 = \frac{\sum_{i=1}^{n} (x_i - \bar{x})^2}{n-1}$    Population variance: $\sigma^2$
We discussed this general concept earlier... that
we infer something about the population from
a sample. This is called statistical inference.
[Figure: Sample and Population]
Population parameters are shown with a Greek
letter.
Statistic                        Estimates this...
Sample mean: x̄                   Population mean: µ
Sample variance: s²              Population variance: σ²
Sample std. deviation: s         Population std. deviation: σ
Sample intercept: b0 or β̂0       Population intercept: β0
Sample slope: b1 or β̂1           Population slope: β1
Numerical Summaries
Section 6-1
The sample mean and the sample variance are
numerical summaries of the sample data. The
sample standard deviation is the square root of
the sample variance.
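These three summaries can be sketched in a few lines of Python. The data values below are hypothetical (they do not come from the slides), and the cross-check against the standard `statistics` module is only there to confirm the hand computation:

```python
import math
import statistics

x = [2, 4, 4, 5, 7, 8]   # hypothetical sample, not from the slides
n = len(x)

# Sample mean: sum of the observations divided by n.
x_bar = sum(x) / n

# Sample variance: sum of squared deviations from x_bar, divided by n - 1.
s2 = sum((xi - x_bar) ** 2 for xi in x) / (n - 1)

# Sample standard deviation: square root of the sample variance.
s = math.sqrt(s2)

print(x_bar, s2, s)

# The standard library uses the same n - 1 divisor.
assert math.isclose(s2, statistics.variance(x))
assert math.isclose(s, statistics.stdev(x))
```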
The full (larger) population of interest may be an
actual physical population, but it could also be
a conceptual population if the population doesn’t
physically exist, as with ‘all components that
will be manufactured and sold’.
As we saw earlier, the sample variance s² essentially describes the ‘average’ squared distance of
an observation from the sample mean.
There are n = 8 observations in the sample
below. The deviations from the sample mean
|xi − x̄| are shown below:
Sample variance:
$s^2 = \frac{\sum_{i=1}^{n} (x_i - \bar{x})^2}{n-1}$
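Computation of s²
Original formula and alternatives:
$s^2 = \frac{\sum_{i=1}^{n} (x_i - \bar{x})^2}{n-1}$
$\quad = \frac{\sum_{i=1}^{n} x_i^2 - \frac{\left(\sum_{i=1}^{n} x_i\right)^2}{n}}{n-1}$
$\quad = \frac{\sum_{i=1}^{n} x_i^2 - n\bar{x}^2}{n-1}$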
Note that the divisor for sample variance is
n − 1. We subtract 1 from the sample size because we had to estimate µ with x̄ in order to
compute the sample variance.
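A quick numerical check that the original formula and the two alternatives agree, sketched in Python with a hypothetical five-point sample (the data and variable names are assumptions for illustration):

```python
import math

x = [12, 15, 11, 18, 14]   # hypothetical sample, not from the slides
n = len(x)

sum_x = sum(x)                        # sum of x_i
sum_x2 = sum(xi ** 2 for xi in x)     # sum of x_i squared
x_bar = sum_x / n

# Original formula: squared deviations from the sample mean.
s2_definition = sum((xi - x_bar) ** 2 for xi in x) / (n - 1)

# Alternative 1: uses only the sum and the sum of squares.
s2_from_sums = (sum_x2 - sum_x ** 2 / n) / (n - 1)

# Alternative 2: uses the sum of squares and the sample mean.
s2_from_mean = (sum_x2 - n * x_bar ** 2) / (n - 1)

print(s2_definition, s2_from_sums, s2_from_mean)
```

The alternatives are convenient for hand calculation, but on a computer they subtract two large, nearly equal numbers and can lose precision, so the original formula is usually the safer choice in code.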
We’re interested in how the observations are dispersed around µ, but we only have information
on how the observations are dispersed around x̄.
If we didn’t make this adjustment, our estimate
for σ² (i.e. our s² value) would consistently be
too small in estimating the true population variance.
We also say that s² is based on n − 1 degrees of freedom.
We’ll discuss this more later.
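A small simulation can illustrate why the divisor is n − 1. This is only a sketch under stated assumptions: the population is taken to be normal with mean 10 and standard deviation 2 (so σ² = 4), samples have size n = 5, and the seed and number of repetitions are arbitrary.

```python
import random

rng = random.Random(0)     # arbitrary seed, for repeatability only
true_sigma2 = 4.0          # assumed population variance (sd = 2)
n, reps = 5, 20000

total_div_n = 0.0
total_div_n_minus_1 = 0.0
for _ in range(reps):
    x = [rng.gauss(10.0, 2.0) for _ in range(n)]
    x_bar = sum(x) / n
    ss = sum((xi - x_bar) ** 2 for xi in x)   # squared deviations around x_bar
    total_div_n += ss / n
    total_div_n_minus_1 += ss / (n - 1)

avg_div_n = total_div_n / reps
avg_div_n_minus_1 = total_div_n_minus_1 / reps

print("average estimate dividing by n:    ", avg_div_n)
print("average estimate dividing by n - 1:", avg_div_n_minus_1)
print("true population variance:          ", true_sigma2)
```

Under these assumptions the divide-by-n average should land near (n − 1)/n × σ² = 3.2, while the divide-by-(n − 1) average should land near 4.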
Another measure of sample spread is the sample range.
• Sample Range
If the n observations in a sample are denoted
by x1, x2, . . . , xn, the sample range is
r = max(xi) − min(xi)
This is a single value, not two individual
values.
Stem-n-leaf diagrams
Section 6-2
The mean and variance are quantities that give
us information on the center and spread of the
data, respectively. These are important summaries of a distribution.
But many distributions can have the same mean
and variance, and yet be different distributions.
We can use graphical displays to consider the
whole distribution of the data.
Consider the following set of n = 80 data points
which are compressive strengths in pounds per
square inch of 80 specimens of a new aluminum-lithium alloy undergoing evaluation.
105  221  183  186  121  181  180  143
 97  154  153  174  120  168  167  141
245  228  174  199  181  158  176  110
163  131  154  115  160  208  158  133
207  180  190  193  194  133  156  123
134  178   76  167  184  135  229  146
218  157  101  171  165  172  158  169
199  151  142  163  145  171  148  158
160  175  149   87  160  237  150  135
196  201  200  176  150  170  118  149
For this data, x̄ = 162.66 and s² = 1140.63.
These give a measure of center and spread.
We can look at a stem-n-leaf diagram to get a
feel for the full distribution of the data.
 7 | 6
 8 | 7
 9 | 7
10 | 15
11 | 058
12 | 013
13 | 133455
14 | 12356899
15 | 001344678888
16 | 0003357789
17 | 0112445668
18 | 0011346
19 | 034699
20 | 0178
21 | 8
22 | 189
23 | 7
24 | 5
The decimal point is 1 digit(s) to the right of the |
The minimum value is 76. ‘7’ is the stem, and
‘6’ is the leaf.
The maximum value is 245. ‘24’ is the stem,
and ‘5’ is the leaf.
The ‘legend’ tells us where the decimal point is.
This stem-n-leaf suggests this distribution can
be described as bell-shaped and unimodal (i.e. has
one peak).
Steps for making a Stem-n-Leaf Diagram
1. Separate each observation into a stem consisting of all but the final (rightmost) digit
and a leaf, the final digit. Stems may have
as many digits as needed, but each leaf contains only a single digit.
2. Write the stems in a vertical column with the
smallest at the top, and draw a vertical line
at the right of this column.
3. Write each leaf in the row to the right of its
stem, in increasing order out from the stem.
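The three steps above can be sketched as a short Python function. This is an illustrative helper, not a standard library routine: the name `stem_and_leaf` is made up here, and it assumes non-negative integer data whose leaf is the units digit. The eight values used are the smallest compressive strengths from the n = 80 data set.

```python
from collections import defaultdict

def stem_and_leaf(values):
    """Return stem-and-leaf rows (as strings) for non-negative integers."""
    leaves = defaultdict(list)
    # Step 1: split each observation into stem (all but last digit) and leaf.
    # Sorting first means each stem's leaves come out in increasing order (step 3).
    for v in sorted(values):
        stem, leaf = divmod(v, 10)
        leaves[stem].append(leaf)
    # Step 2: list every stem from smallest to largest, including empty ones.
    rows = []
    for stem in range(min(leaves), max(leaves) + 1):
        row = "".join(str(d) for d in leaves.get(stem, []))
        rows.append(f"{stem:>3} | {row}")
    return rows

lowest_strengths = [105, 97, 110, 115, 76, 101, 87, 118]
rows = stem_and_leaf(lowest_strengths)
for line in rows:
    print(line)
```

Printing the empty stems as well keeps gaps in the data visible, which matters when judging the shape of the distribution.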
If there are too many values for each stem, you
can also do a split-stem-n-leaf diagram by splitting the values for each stem.
Mode, Quartiles, and Percentiles
Once we’ve ordered the data as in the stem-n-leaf diagram, we can easily pull out some other
useful data features.
Consider the following stem-n-leaf diagram:
The decimal point is 1 digit(s) to the
right of the |
6 | 134
6 | 5568
7 | 0113
7 | 57
We see that n = 13, the min is 61, the max is 77.
• Median
This is the value at which 50% fall below and
50% fall above.
– The median is 68 for this data set.
If n is odd, an actual data point is the median.
If n is even, the median falls between the 2
data points at the middle (use the average of
these two data points).
The median is a measure of central tendency,
and is denoted by x̃.
• Mode
This is the most frequently occurring data
point.
– There are two modes in this data set, 65
and 71. We would call this distribution
bimodal (i.e. has 2 peaks).
• Quartiles
The quartiles are the positions that break the ordered data into 4 parts, each containing 25% of the data:
the first quartile (q1), the
second quartile (q2), also called the median,
and the third quartile (q3).
This data set has
q1 = 64.5
q2 = 68
q3 = 72
There are a number of ways to find positions
that break the data into the 25% proportions
since the data is discrete. But here’s one option:
q1 is the interpolated value between the
data points at ordered positions $\lfloor (n+1)/4 \rfloor$
and $\lceil (n+1)/4 \rceil$.
(These are symbols for rounded-down $\lfloor\ \rfloor$
and rounded-up $\lceil\ \rceil$, respectively.)
q3 is the interpolated value between the
data points at ordered positions $\lfloor 3(n+1)/4 \rfloor$
and $\lceil 3(n+1)/4 \rceil$.
For this data set n = 13, so (n+1)/4 = 3.5 and q1 lies halfway between the 3rd and 4th ordered values (64 and 65), giving 64.5.
The interquartile range (IQR) is equal to
q3 − q1 and is a measure of variability. It is
the spread of the middle 50% of the data.
The IQR is less sensitive to extremes than
the ordinary sample range.
• Percentiles
The 100kth percentile is a data value such
that approximately 100k% of the observations are at or below this value and approximately 100(1 − k)% of them are above it (for
0 < k < 1).
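The median, modes, quartiles, and IQR quoted for the n = 13 stem-n-leaf data can be reproduced with a short Python sketch. The helper `value_at` is a made-up name, and it assumes that "interpolated" means linear interpolation between the two neighboring ordered values; the slides do not spell out the interpolation rule, so treat that as an assumption.

```python
import math
import statistics

data = sorted([61, 63, 64, 65, 65, 66, 68, 70, 71, 71, 73, 75, 77])
n = len(data)

def value_at(position):
    """Value at a 1-based ordered position, which may be fractional.

    Assumes linear interpolation between the data points at the
    rounded-down and rounded-up positions, and 1 <= position <= n.
    """
    lo = math.floor(position)
    hi = math.ceil(position)
    frac = position - lo
    return data[lo - 1] + frac * (data[hi - 1] - data[lo - 1])

median = statistics.median(data)        # middle value, since n is odd
modes = statistics.multimode(data)      # all most-frequent values
q1 = value_at((n + 1) / 4)              # position 3.5
q3 = value_at(3 * (n + 1) / 4)          # position 10.5
iqr = q3 - q1
sample_range = data[-1] - data[0]

print(median, modes, q1, q3, iqr, sample_range)
```

Statistical packages implement several different quartile conventions, so their results may differ slightly from this rule on small samples.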
• Example: Mean and Median
A manufacturer of electronic components is
interested in determining the lifetime of a
certain type of battery. A sample, in hours
of life, is as follows:
123, 116, 122, 110, 175, 126, 125, 111, 118, 117
a) Find the sample mean and median.
b) What feature in this data set is responsible for the substantial difference between the
mean and median?
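One way to check part (a), and to explore part (b), is the following Python sketch using the standard `statistics` module. Dropping the largest value is only an exploratory step chosen here for illustration, not a recommendation to discard data.

```python
import statistics

lifetimes = [123, 116, 122, 110, 175, 126, 125, 111, 118, 117]

mean_all = statistics.mean(lifetimes)       # 124.3
median_all = statistics.median(lifetimes)   # 120.0 (average of 118 and 122)

# Exploratory step: recompute both summaries without the largest value.
trimmed = sorted(lifetimes)[:-1]
mean_trimmed = statistics.mean(trimmed)
median_trimmed = statistics.median(trimmed)

print("all data:    mean =", mean_all, " median =", median_all)
print("without max: mean =", round(mean_trimmed, 2), " median =", median_trimmed)
```

The mean moves by several hours when the single large observation is left out, while the median moves much less, which shows how much more sensitive the mean is to an extreme value.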
Frequency Distributions and Histograms
Section 6-3
A frequency distribution is a table that divides
a set of data into a suitable number of classes
(categories), showing also the number of items
belonging to each class.
Consider the following stem-n-leaf diagram for
humidity readings rounded to the nearest percent.
Stem | Leaf
   1 | 2 5 7
   2 | 1 1 3 4 5 7 8 9
   3 | 2 4 4 7 9
   4 | 2 4 8
   5 | 3
We might group these data into the following
frequency distribution:
Class      Class      Frequency   Relative       Cumulative
interval   midpoint   f           frequency      relative frequency
10-19      14.5       3           3/20 = 0.15    0.15
20-29      24.5       8           8/20 = 0.40    0.55
30-39      34.5       5           5/20 = 0.25    0.80
40-49      44.5       3           3/20 = 0.15    0.95
50-59      54.5       1           1/20 = 0.05    1.00
There were 5 bins, or cells, or intervals for this
frequency table.
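The frequency table for the humidity readings can be rebuilt with a few lines of Python. The readings are read off the stem-n-leaf diagram; the tuple layout of each row is just a convenient choice for this sketch.

```python
humidity = [12, 15, 17,
            21, 21, 23, 24, 25, 27, 28, 29,
            32, 34, 34, 37, 39,
            42, 44, 48,
            53]
n = len(humidity)

classes = [(10, 19), (20, 29), (30, 39), (40, 49), (50, 59)]

rows = []
running_count = 0
for low, high in classes:
    f = sum(low <= h <= high for h in humidity)   # class frequency
    running_count += f                            # cumulative frequency
    midpoint = (low + high) / 2
    rows.append((f"{low}-{high}", midpoint, f, f / n, running_count / n))

for label, midpoint, f, rel, cum in rows:
    print(f"{label:8} {midpoint:6.1f} {f:4d} {rel:6.2f} {cum:6.2f}")
```

Accumulating the integer counts and dividing once at the end avoids the small rounding drift that can appear when relative frequencies are summed directly.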
The histogram is a visual display of a frequency distribution.
• Example: Recall the n = 80 compressive
strengths from earlier
105  221  183  186  121  181  180  143
 97  154  153  174  120  168  167  141
245  228  174  199  181  158  176  110
163  131  154  115  160  208  158  133
207  180  190  193  194  133  156  123
134  178   76  167  184  135  229  146
218  157  101  171  165  172  158  169
199  151  142  163  145  171  148  158
160  175  149   87  160  237  150  135
196  201  200  176  150  170  118  149
Using 10 bins, we can create the frequency
distribution...
Class      Class      Frequency   Relative          Cumulative
interval   midpoint   f           frequency         relative frequency
61-80      70.5       1           1/80 = 0.0125     0.0125
81-100     90.5       2           2/80 = 0.0250     0.0375
101-120    110.5      6           6/80 = 0.0750     0.1125
121-140    130.5      8           8/80 = 0.1000     0.2125
141-160    150.5      23          23/80 = 0.2875    0.5000
161-180    170.5      19          19/80 = 0.2375    0.7375
181-200    190.5      12          12/80 = 0.1500    0.8875
201-220    210.5      4           4/80 = 0.0500     0.9375
221-240    230.5      4           4/80 = 0.0500     0.9875
241-260    250.5      1           1/80 = 0.0125     1.0000
The histogram for this frequency table...
[Figure: Histogram of data; x-axis: data, y-axis: Frequency]
We can see this is a unimodal distribution with
a bell-shape.
NOTE: The bin widths can alter the shape of a
histogram. For instance, if I only chose 3 bins...
[Figure: Histogram of data with 3 bins; x-axis: data, y-axis: Frequency]
This is not as informative. In general, you don’t
want too many or too few observations in each
bin (relative to n), and you can play around
with bin size for the best scenario.
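The effect of the bin choice can be seen by counting the n = 80 compressive strengths two ways, sketched below in Python. The helper `bin_counts` is a made-up name, intervals are taken as open on the left and closed on the right, and the cut points 0, 100, 200, 300 for the 3-bin case are an assumption based on the axis of the plot, not something the slides state.

```python
strengths = [
    105, 221, 183, 186, 121, 181, 180, 143,
     97, 154, 153, 174, 120, 168, 167, 141,
    245, 228, 174, 199, 181, 158, 176, 110,
    163, 131, 154, 115, 160, 208, 158, 133,
    207, 180, 190, 193, 194, 133, 156, 123,
    134, 178,  76, 167, 184, 135, 229, 146,
    218, 157, 101, 171, 165, 172, 158, 169,
    199, 151, 142, 163, 145, 171, 148, 158,
    160, 175, 149,  87, 160, 237, 150, 135,
    196, 201, 200, 176, 150, 170, 118, 149,
]

def bin_counts(data, edges):
    """Count observations falling in (edges[i], edges[i+1]] for each bin."""
    return [sum(lo < x <= hi for x in data)
            for lo, hi in zip(edges, edges[1:])]

# Ten bins of width 20: 61-80, 81-100, ..., 241-260.
ten_bins = bin_counts(strengths, list(range(60, 261, 20)))

# Three bins of width 100 (assumed cut points).
three_bins = bin_counts(strengths, [0, 100, 200, 300])

print(ten_bins)     # [1, 2, 6, 8, 23, 19, 12, 4, 4, 1]
print(three_bins)   # [3, 68, 9]
```

With three bins, 68 of the 80 observations fall in a single cell, so almost nothing about the shape of the distribution survives.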
We summarize data in a histogram (by lumping a lot of individual observations together in a
cell), so we lose some information. But this loss
is usually small compared to the information
gained in the visual, and the ease of interpretation gained in the graph.
• Some possible descriptions of histograms
– Symmetric
– Skewed (asymmetric, long tail to one side)
Right-tail stretched out... positive skew
Left-tail stretched out... negative skew
– Unimodal (one peak)
– Bimodal (two peaks)
– Bell-shaped
– Uniformly distributed (flat)
Symmetric
If the distribution is symmetric,
mean = median.
Right-skewed
If the distribution is right-skewed,
mean > median.
Left-skewed
If the distribution is left-skewed,
mean < median.
[Figure: example histograms; panels: Left-skewed, Symmetric, Right-skewed]
The histogram of the sample data at the bottom of the slide gives us a feel for the population
from which the sample was drawn.
The top plot is of the conceptual population
from which the sample was drawn.
Box Plots
Section 6-4
Boxplots are another graphical tool for visualizing data. They utilize the quartiles to give us
a feel for the data distribution.
Values forming the box (shows middle 50% of data):
q1, q2, q3 (left, middle, right)
Largest possible∗ whiskers (as distance from q1 or q3):
1.5 × IQR
Outliers:
values out past the whiskers
(past q1 − 1.5 × IQR or
past q3 + 1.5 × IQR),
seen at either tail
∗ Whiskers will end on an actual data point.
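The quantities behind a box plot can be computed directly, as in this Python sketch for the ten battery lifetimes from the mean-and-median example. The `quartile` helper is a made-up name and assumes linear interpolation at ordered positions (n+1)/4, (n+1)/2, and 3(n+1)/4; plotting software may use a different quartile rule and so draw a slightly different box.

```python
import math

def quartile(sorted_data, position):
    """Value at a 1-based, possibly fractional, ordered position.

    Assumes linear interpolation between the neighboring data points.
    """
    lo, hi = math.floor(position), math.ceil(position)
    frac = position - lo
    return sorted_data[lo - 1] + frac * (sorted_data[hi - 1] - sorted_data[lo - 1])

data = sorted([123, 116, 122, 110, 175, 126, 125, 111, 118, 117])
n = len(data)

# Box: first quartile, median, third quartile.
q1 = quartile(data, (n + 1) / 4)
q2 = quartile(data, (n + 1) / 2)
q3 = quartile(data, 3 * (n + 1) / 4)
iqr = q3 - q1

# Whiskers may reach at most 1.5 x IQR beyond the box ...
lower_limit = q1 - 1.5 * iqr
upper_limit = q3 + 1.5 * iqr

# ... but they end on the most extreme actual data points inside those limits.
inside = [x for x in data if lower_limit <= x <= upper_limit]
whisker_low, whisker_high = inside[0], inside[-1]

# Anything past the limits is flagged as an outlier.
outliers = [x for x in data if x < lower_limit or x > upper_limit]

print("box:", q1, q2, q3)
print("whiskers:", whisker_low, whisker_high)
print("outliers:", outliers)
```

Here the only flagged point is the 175-hour battery, the same observation that pulls the mean away from the median.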