Notes 2

Statistics 13 Elementary Statistics
Summer Session I 2012
Lecture Notes 2: Methods for Describing Data
Describing Qualitative Data
Definition 2.1 A class is one of the categories into which qualitative data can be
classified.
Definition 2.2 The class frequency is the number of observations in the data set
that fall into a particular class.
Definition 2.3 The class relative frequency is the class frequency divided by the
total number of observations in the data set; that is,
class relative frequency = (class frequency) / (total number of observations)
Definition 2.4 The class percentage is the class relative frequency multiplied by 100;
that is,
class percentage = (class relative frequency) × 100
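As a minimal sketch of Definitions 2.2 to 2.4, the following Python snippet computes the class frequency, class relative frequency, and class percentage for a small data set. The observations and the function name `class_summary` are hypothetical and are not taken from these notes.

```python
from collections import Counter

# Hypothetical qualitative data: one class label per observation.
observations = ["A", "B", "A", "C", "A", "B", "A", "C", "B", "A"]


def class_summary(data):
    """Return {class: (frequency, relative frequency, percentage)}."""
    n = len(data)            # total number of observations
    counts = Counter(data)   # class frequency for each class
    return {
        label: (freq, freq / n, 100 * freq / n)
        for label, freq in counts.items()
    }


summary = class_summary(observations)
for label, (freq, rel, pct) in sorted(summary.items()):
    print(f"{label}: frequency={freq}, relative frequency={rel:.2f}, percentage={pct:.0f}%")
```

The relative frequencies always add up to 1, and the percentages to 100, which is a quick check on any such table.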
Summary of Graphical Descriptive Methods for Qualitative Data
• Bar Graph: The categories (classes) of the qualitative variable are represented
by bars, where the height of each bar is either the class frequency, class relative
frequency, or class percentage.
• Pie Chart: The categories (classes) of the qualitative variable are represented by
slices of a pie (circle). The size of each slice is proportional to the class relative
frequency.
• Pareto Diagram: A bar graph with the categories (classes) of the qualitative
variable (i.e., the bars) arranged by height in descending order from left to right.
Last update: June 25, 2012
[Figure: Income of the patients. Examples of pie charts (top) and bar graphs (bottom) for the Control and Treatment groups, with income classes Under $25,000; $25,000−$50,000; $50,001−$75,000; $75,001−$100,000; Above $100,000; Prefer not to answer.]
[Figure: Reasons for arriving late at work (from Wikipedia). Example of a Pareto diagram.]
Describing Quantitative Data
Summary of Graphical Descriptive Methods for Quantitative Data
• Dot Plot: The numerical value of each quantitative measurement in the data set
is represented by a dot on a horizontal scale. When data values repeat, the dots are
placed above one another vertically.
• Stem-and-Leaf Display: The numerical value of the quantitative variable is partitioned into a “stem” and a “leaf.” The possible stems are listed in order in a
column. The leaf for each quantitative measurement in the data set is placed in
the corresponding stem row. Leaves for observations with the same stem value are
listed in increasing order horizontally.
• Histogram: The possible numerical values of the quantitative variable are partitioned into class intervals, each of which has the same width. These intervals form
the scale of the horizontal axis. The frequency or relative frequency of observations
in each class interval is determined. A vertical bar is placed over each class interval, with the height of the bar equal to either the class frequency or class relative
frequency.
Dotplots
Example 1 The outbreak of food poisoning on a sportsday, Thailand 1990.
[Figure: two panels titled "Age by sex" (groups F and M) and "Distribution of birthdate" (vertical axis: Frequency).]
Stem-and-Leaf Display
Example 2 The following data show the ages of the 27 residents of Alcan, Alaska. (Source:
U.S. Bureau of the Census)
45 46 43  1 19 37 52 35  8
42  3 41 10 11 48 40 31 42
50  6 55 40 41 30  7 12 58
The stem-and-leaf display for the data:
0 | 13678
1 | 0129
2 |
3 | 0157
4 | 0011223568
5 | 0258
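The construction of a stem-and-leaf display can be sketched in Python as follows. This is only an illustration: it assumes the stem is the tens digit and the leaf is the units digit, and the function name `stem_and_leaf` is hypothetical. The ages are the 27 values of Example 2.

```python
from collections import defaultdict

# Ages of the 27 residents from Example 2.
ages = [45, 46, 43, 1, 19, 37, 52, 35, 8, 42, 3, 41, 10, 11, 48, 40, 31, 42,
        50, 6, 55, 40, 41, 30, 7, 12, 58]


def stem_and_leaf(values):
    """Map each stem (tens digit) to its leaves (units digits) in increasing order."""
    rows = defaultdict(list)
    for v in sorted(values):
        stem, leaf = divmod(v, 10)
        rows[stem].append(leaf)
    # List every possible stem in order, including stems that have no leaves.
    return {stem: rows.get(stem, []) for stem in range(min(rows), max(rows) + 1)}


display = stem_and_leaf(ages)
for stem, leaves in display.items():
    print(f"{stem} | {''.join(str(leaf) for leaf in leaves)}")
```

Stem 2 has no leaves because no resident is between 20 and 29 years old.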
Histograms
Example 3 Using the age data from above.
[Figure: two histograms of age, one with vertical axis "Frequency" and one with vertical axis "Relative Frequency"; horizontal axis: age, from 0 to 60.]
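The class frequencies behind a histogram of the age data can be sketched as below. The class intervals [0, 10), [10, 20), ..., [50, 60) are an assumption made for illustration; the notes do not state which endpoint of each interval is included. The function name `class_frequencies` is hypothetical.

```python
# Ages of the 27 residents from Example 2.
ages = [45, 46, 43, 1, 19, 37, 52, 35, 8, 42, 3, 41, 10, 11, 48, 40, 31, 42,
        50, 6, 55, 40, 41, 30, 7, 12, 58]


def class_frequencies(values, start, width, n_classes):
    """Count observations in equal-width intervals [start + k*width, start + (k+1)*width)."""
    counts = [0] * n_classes
    for v in values:
        k = int((v - start) // width)   # index of the class interval containing v
        if 0 <= k < n_classes:
            counts[k] += 1
    return counts


counts = class_frequencies(ages, start=0, width=10, n_classes=6)
relative = [c / len(ages) for c in counts]   # class relative frequencies
for k, (c, r) in enumerate(zip(counts, relative)):
    print(f"[{10 * k}, {10 * (k + 1)}): frequency={c}, relative frequency={r:.3f}")
```

Each bar of the histogram would then have height equal to the frequency (or relative frequency) of its interval.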
The Meaning of Summation Notation $\sum_{i=1}^{n} x_i$
Sum the measurements of the variable that appears to the right of the summation symbol,
beginning with the first measurement and ending with the nth measurement.
Example 4 A data set contains the observations 5, 1, 3, 2, 1. Then we set $x_1 = 5$, $x_2 = 1$, $x_3 = 3$, $x_4 = 2$, $x_5 = 1$. Then
a. $\sum_{i=1}^{5} x_i = x_1 + x_2 + x_3 + x_4 + x_5 = 5 + 1 + 3 + 2 + 1 = 12$
b. $\sum_{i=1}^{5} x_i^2 = x_1^2 + x_2^2 + x_3^2 + x_4^2 + x_5^2 = 5^2 + 1^2 + 3^2 + 2^2 + 1^2 = 40$
c. $\sum_{i=1}^{5} (x_i - 1) = (x_1 - 1) + (x_2 - 1) + (x_3 - 1) + (x_4 - 1) + (x_5 - 1) = (x_1 + x_2 + x_3 + x_4 + x_5) - (1 + 1 + 1 + 1 + 1) = \sum_{i=1}^{5} x_i - 5 = 12 - 5 = 7$
d. $\sum_{i=1}^{5} (x_i - 1)^2 = (x_1 - 1)^2 + (x_2 - 1)^2 + (x_3 - 1)^2 + (x_4 - 1)^2 + (x_5 - 1)^2 = 4^2 + 0^2 + 2^2 + 1^2 + 0^2 = 21$
e. $\left(\sum_{i=1}^{5} x_i\right)^2 = (x_1 + x_2 + x_3 + x_4 + x_5)^2 = (5 + 1 + 3 + 2 + 1)^2 = 12^2 = 144$
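The five sums of Example 4 can be checked with a few lines of Python. This is a small sketch; the variable names are illustrative only.

```python
# Observations from Example 4.
x = [5, 1, 3, 2, 1]

sum_x = sum(x)                                    # a. sum of x_i          = 12
sum_x_squared = sum(xi ** 2 for xi in x)          # b. sum of x_i^2        = 40
sum_shifted = sum(xi - 1 for xi in x)             # c. sum of (x_i - 1)    = 7
sum_shifted_squared = sum((xi - 1) ** 2 for xi in x)  # d. sum of (x_i - 1)^2 = 21
square_of_sum = sum(x) ** 2                       # e. (sum of x_i)^2      = 144

print(sum_x, sum_x_squared, sum_shifted, sum_shifted_squared, square_of_sum)
```

Parts b and e show that the sum of the squares (40) is generally not the same as the square of the sum (144).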
Definition 2.5 The mean of a set of quantitative data is the sum of the measurements,
divided by the number of measurements contained in the data set.
Formula for a Sample Mean: $\bar{x} = \frac{\sum_{i=1}^{n} x_i}{n}$
Symbols for the Sample Mean and the Population Mean
x̄ =Sample mean
µ =Population mean
Definition 2.6 The median of a quantitative data set is the middle number when the
measurements are arranged in ascending (or descending) order.
Calculating a Sample Median M
Arrange the n measurements from the smallest to the largest.
1. If n is odd, M is the middle number.
2. If n is even, M is the mean of the middle two numbers.
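The sample mean and the rule for the sample median can be sketched in Python as follows. The function names `sample_mean` and `sample_median` are hypothetical, and the data are the observations of Example 4.

```python
def sample_mean(values):
    """Sum of the measurements divided by the number of measurements."""
    return sum(values) / len(values)


def sample_median(values):
    """Middle number of the ordered data (mean of the two middle numbers if n is even)."""
    ordered = sorted(values)   # arrange from smallest to largest
    n = len(ordered)
    mid = n // 2
    if n % 2 == 1:
        return ordered[mid]                       # n odd: the middle number
    return (ordered[mid - 1] + ordered[mid]) / 2  # n even: mean of the middle two


x = [5, 1, 3, 2, 1]
print(sample_mean(x))               # 12 / 5 = 2.4
print(sample_median(x))             # ordered data 1, 1, 2, 3, 5 -> 2
print(sample_median([5, 1, 3, 2]))  # ordered data 1, 2, 3, 5 -> 2.5
```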
Definition 2.7 A data set is said to be skewed if one tail of the distribution has more
extreme observations than the other tail.
[Figure: three relative frequency distributions, each marking the mean and the median: rightward skewness, symmetry, and leftward skewness.]
Definition 2.8 The mode is the measurement that occurs most frequently in the data
set.
Definition 2.9 The range of a quantitative data set is equal to the largest measurement
minus the smallest measurement.
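A short Python sketch of the mode and the range follows. Because a data set may have more than one most frequent value, the hypothetical function `modes` returns all of them as a list; that choice is an assumption of this sketch, not something stated in the notes.

```python
from collections import Counter


def modes(values):
    """Return every measurement that occurs most frequently (there may be ties)."""
    counts = Counter(values)
    top = max(counts.values())
    return sorted(v for v, c in counts.items() if c == top)


def data_range(values):
    """Largest measurement minus the smallest measurement."""
    return max(values) - min(values)


x = [5, 1, 3, 2, 1]      # observations from Example 4
print(modes(x))          # the value 1 occurs twice, more often than any other
print(data_range(x))     # 5 - 1 = 4
```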
Definition 2.10 The sample variance for a sample of n measurements is equal to the
sum of the squared distances from the mean, divided by (n − 1). The symbol $s^2$ is used
to represent the sample variance.
Formula for a Sample Variance: $s^2 = \frac{\sum_{i=1}^{n} (x_i - \bar{x})^2}{n - 1}$
A shortcut formula: $s^2 = \frac{\sum_{i=1}^{n} x_i^2 - \frac{\left(\sum_{i=1}^{n} x_i\right)^2}{n}}{n - 1}$
Definition 2.11 The sample standard deviation, s, is defined as the positive square
root of the sample variance, $s^2$, or, mathematically,
$s = \sqrt{s^2}$
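As a sketch, the sample variance can be computed both from the definition and from the shortcut formula, and the standard deviation taken as its positive square root. The function names are hypothetical; the data are the observations of Example 4, for which the two formulas should agree up to rounding.

```python
import math


def sample_variance(values):
    """Sum of squared distances from the mean, divided by (n - 1)."""
    n = len(values)
    mean = sum(values) / n
    return sum((v - mean) ** 2 for v in values) / (n - 1)


def sample_variance_shortcut(values):
    """Shortcut formula: (sum of x^2 - (sum of x)^2 / n) / (n - 1)."""
    n = len(values)
    return (sum(v ** 2 for v in values) - sum(values) ** 2 / n) / (n - 1)


x = [5, 1, 3, 2, 1]
s2 = sample_variance(x)                 # approximately 2.8
s2_short = sample_variance_shortcut(x)  # (40 - 144/5) / 4, also approximately 2.8
s = math.sqrt(s2)                       # sample standard deviation, about 1.67
print(s2, s2_short, s)
```

The shortcut needs only the sum of the measurements and the sum of their squares, which is convenient for hand calculation.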
Symbols for Variance and Standard Deviation
$s^2$ = Sample variance
$s$ = Sample standard deviation
$\sigma^2$ = Population variance
$\sigma$ = Population standard deviation
Numerical Descriptive Measures
Central Tendency: Mean, Median, Mode
Variation: Range, Variance, Standard Deviation
Two ways to interpret the standard deviation:
1. Chebyshev’s Rule and 2. Empirical Rule.
1. Chebyshev’s rule applies to any data set, regardless of the shape of the frequency
distribution of the data.
a. It is possible that very few of the measurements will fall within one standard deviation of the mean.
b. At least 3/4 of the measurements will fall within two standard deviations of the
mean.
c. At least 8/9 of the measurements will fall within three standard deviations of the
mean.
d. Generally, for any number k greater than 1, at least $(1 - 1/k^2)$ of the measurements
will fall within k standard deviations of the mean.
2. Empirical rule is a rule of thumb that applies to data sets with frequency distributions
that are mound shaped and symmetric, as follows:
[Figure: mound-shaped, symmetric relative frequency distribution of population measurements.]
a. Approximately 68% of the measurements will fall within one standard deviation of
the mean.
b. Approximately 95% of the measurements will fall within two standard deviations of
the mean.
c. Approximately 99.7% (essentially all) of the measurements will fall within three
standard deviations of the mean.
Interval               Chebyshev's rule                       Empirical rule
x̄ ± s   (µ ± σ)       No useful bound (possibly very few)    approx. 68%
x̄ ± 2s  (µ ± 2σ)      At least 3/4                           approx. 95%
x̄ ± 3s  (µ ± 3σ)      At least 8/9                           approx. 99.7%
x̄ ± ks  (µ ± kσ)      At least $(1 - 1/k^2)$
Example 5 Use Chebyshev’s Theorem to give a lower bound on the percent of data in the
interval (x̄ − 2.5s, x̄ + 2.5s).
Answer: At least $1 - 1/2.5^2 = 0.84 = 84\%$ of the measurements will fall within the
interval; that is, the lower bound is 84%.
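The bound used in Example 5 can be sketched as a small Python function. The name `chebyshev_lower_bound` is hypothetical; the function simply evaluates 1 − 1/k² for k greater than 1.

```python
def chebyshev_lower_bound(k):
    """Lower bound on the fraction of measurements within k standard deviations of the mean."""
    if k <= 1:
        raise ValueError("Chebyshev's rule gives a useful bound only for k > 1")
    return 1 - 1 / k ** 2


for k in (2, 2.5, 3):
    print(f"k = {k}: at least {chebyshev_lower_bound(k):.1%} of the measurements")
```

For k = 2, 2.5, and 3 this gives the familiar bounds 3/4, 84%, and 8/9.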
Definition 2.12 For any set of n measurements (arranged in ascending or descending
order), the pth percentile is a number such that p% of the measurements fall below that
number and (100 − p)% fall above it.
Definition 2.13 The sample z-score for a measurement x is
$z = \frac{x - \bar{x}}{s}$
The population z-score for a measurement x is
$z = \frac{x - \mu}{\sigma}$
Interpretation of z-scores for Mound-Shaped Distributions of Data
1. Approximately 68% of the measurements will have a z-score between -1 and 1.
2. Approximately 95% of the measurements will have a z-score between -2 and 2.
3. Approximately 99.7% (almost all) of the measurements will have a z-score between
-3 and 3.
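Sample z-scores can be sketched in Python as below. The function name `z_scores` is hypothetical, and the data are the five observations of Example 4, which are far too few to look mound shaped; the snippet only illustrates the arithmetic.

```python
import math


def z_scores(values):
    """Return the sample z-score (x - mean) / s for every measurement."""
    n = len(values)
    mean = sum(values) / n
    s = math.sqrt(sum((v - mean) ** 2 for v in values) / (n - 1))
    return [(v - mean) / s for v in values]


x = [5, 1, 3, 2, 1]
z = z_scores(x)
for xi, zi in zip(x, z):
    print(f"x = {xi}: z = {zi:.2f}")
```

A positive z-score means the measurement lies above the mean, a negative one below it, and the z-scores of a sample always sum to zero (up to rounding).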
Definition 2.14 An observation (or measurement) that is unusually large or small
relative to the other values in a data set is called an outlier. Outliers typically are
attributable to one of the following causes:
1. The measurement is observed, recorded, or entered into the computer incorrectly.
2. The measurement comes from a different population.
3. The measurement is correct, but represents a rare (chance) event.
Definition 2.15 The lower quartile QL is the 25th percentile of a data set. The
middle quartile M is the median. The upper quartile QU is the 75th percentile.
Definition 2.16 The interquartile range (IQR) is the distance between the lower
and upper quartiles.
IQR= QU − QL
Elements of a Box Plot
1. A rectangle (the box) is drawn with the ends (the hinges) drawn at the lower and
upper quartiles (QL and QU). The median of the data is shown in the box, usually
by a line.
2. The points at distances 1.5(IQR) from each hinge mark the inner fences of the
data set. Lines (the whiskers) are drawn from each hinge to the most extreme
measurement inside the inner fence. Thus,
Lower inner fence= QL − 1.5(IQR)
Upper inner fence= QU + 1.5(IQR)
A second pair of fences, the outer fences, appears at a distance of 3(IQR) from
the hinges. One symbol (e.g., “*”) is used to represent measurements falling
between the inner and outer fences, and another (e.g., “0”) is used to represent
measurements that lie beyond the outer fences. Thus outer fences are not shown
unless one or more measurements lie beyond them. We have
Lower outer fence= QL − 3(IQR)
Upper outer fence= QU + 3(IQR)
Different symbols can be used to represent the median and the extreme data
points.
Measurements beyond the outer fences are probably outliers.
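The elements of a box plot can be sketched numerically as follows. Two assumptions are made here: the quartiles are computed with Python's `statistics.quantiles` using the "inclusive" method (the notes do not say which of the several common quartile conventions to use), and the data set is hypothetical. The function name `box_plot_summary` is also hypothetical.

```python
import statistics


def box_plot_summary(values):
    """Quartiles, IQR, fences, whisker ends, and flagged measurements for a box plot."""
    # Assumption: "inclusive" quartile convention; other conventions give slightly different values.
    q_l, median, q_u = statistics.quantiles(values, n=4, method="inclusive")
    iqr = q_u - q_l
    inner = (q_l - 1.5 * iqr, q_u + 1.5 * iqr)   # inner fences
    outer = (q_l - 3 * iqr, q_u + 3 * iqr)       # outer fences
    inside = [v for v in values if inner[0] <= v <= inner[1]]
    return {
        "QL": q_l, "M": median, "QU": q_u, "IQR": iqr,
        "inner_fences": inner, "outer_fences": outer,
        # Whiskers reach the most extreme measurements inside the inner fences.
        "whiskers": (min(inside), max(inside)),
        "between_fences": [v for v in values
                           if outer[0] <= v < inner[0] or inner[1] < v <= outer[1]],
        "beyond_outer": [v for v in values if v < outer[0] or v > outer[1]],
    }


data = [2, 4, 4, 5, 6, 7, 8, 16, 30]   # hypothetical measurements
summary = box_plot_summary(data)
for name, value in summary.items():
    print(f"{name}: {value}")
```

In this hypothetical data set, 16 falls between the inner and outer fences, while 30 lies beyond the upper outer fence and is therefore a probable outlier.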
Graphing Bivariate Relationships
One way to describe the relationship between two quantitative variables, called a bivariate relationship, is to plot the data in a scattergram (or scatterplot).
a. Positive relationship
b. Negative relationship
c. No relationship