Lecture 1

Sept. 27, 2007 LEC #1
ECON 140A/240A-1
Exploratory Data Analysis
L. Phillips
I. Introduction
At the beginning of the course we will study three branches of statistics: (1) data
analysis, (2) probability, and (3) statistical inference.
Data analysis is the gathering, display and summary of data. We will use visual
devices and quantitative measures to accomplish these tasks.
Probability has its origins in gambling and the laws of chance. This topic is
interesting in its own right but we will also use probability as a means to better
understand the binomial distribution, the central limit theorem, and the relationship
between the binomial distribution and the normal distribution.
II. Data Description
One use of statistics is to describe data with summary measures. Two notions are
central tendency and dispersion.
There are several measures of central tendency. An intuitive and relatively easy
measure to use is the mode, i.e. the data value that is observed most frequently. Of course,
one issue is that the data may have two or three modes, i.e. multiple peaks.
Another measure of central tendency is the median. The data can be sorted and
ordered from the highest value to the lowest, and the data point in the middle is the
median, with one half of the data values above and one half of the data values below.
Another measure of central tendency requiring some arithmetic is the sample
mean of the data. Add up all the data values and divide by the number of observations or
data points.
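As a minimal sketch of these three measures, the Python snippet below applies the standard `statistics` module to a small made-up sample (the values are hypothetical and not from the lecture). The sample is chosen so that two values tie for most frequent, which is the multiple-mode issue raised above; `statistics.multimode` (Python 3.8 or later) returns every such value.

```python
import statistics

# Hypothetical sample, chosen so that two values tie for most frequent.
sample = [3, 5, 5, 7, 8, 8, 10]

# Mode(s): every value that occurs most often (here there are two).
modes = statistics.multimode(sample)

# Median: the middle value once the data is sorted.
median = statistics.median(sample)

# Sample mean: add up all the values and divide by the number of observations.
mean = sum(sample) / len(sample)

print(modes, median, mean)
```

Here the mode is not unique (5 and 8 each appear twice), while the median and the mean are single numbers.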
III. Exploratory Data Analysis
John Tukey developed exploratory data analysis to visually describe the
characteristics of data. Two visual tools useful for this purpose are the stem and leaf
diagram and the box and whiskers plot.
An example of the methodology of the stem and leaf plot is its application to
weight data from males and females at Penn State, taken from Larry Gonick & Woolcott
Smith, The Cartoon Guide to Statistics (1993).
Males: 140 145 160 190 155 165 150 190 195 138 160 155 153 145 170 175 175 170 180
135 170 157 130 185 190 155 170 155 215 150 145 155 155 150 155 150 180 160 135
160 130 155 150 148 155 150 140 180 190 145 150 164 140 142 136 123 155
Females: 140 120 130 138 121 125 116 145 150 112 125 130 120 130 131 120 118 125
135 125 118 122 115 102 115 150 110 116 108 95 125 133 110 150 108
For this illustration, the data is pooled without regard to gender. The first step is to
determine the range of the data, the minimum weight and the maximum weight, 95 and
215, respectively. The second step is to construct the stem, counting by tens from 9 for
90, 10 for 100, etc. out to 21 for 210.
----------------------------------------------------------------------------------------------------
9
10
11
12
13
14
15
16
17
18
19
20
21
Figure 1: Stem of the Stem and Leaf Diagram
----------------------------------------------------------------------------------------------------
The third step is to construct the leaves. The second digit of 95, the lowest weight,
is placed after 9 on the stem. There are three weights between 100 and 110: 102,
108, and 108, so the digits following 10 on the stem are 2, 8, 8. These digits form the leaf
attached to the stem at 10. Continuing in this fashion:
----------------------------------------------------------------------------------------------------
9: 5
10: 2 8 8
11: 6 2 8 8 5 5 0 6 0
12: 3 0 1 5 5 0 0 5 5 2 5
13: 8 5 0 5 0 6 0 8 0 0 1 5 3
14: 0 5 5 5 8 0 5 0 2 0 5
15: 5 0 5 3 7 5 5 0 5 5 0 5 0 5 0 5 0 0 5 0 0 0
16: 0 5 0 0 0 4
17: 0 5 5 0 0 0
18: 0 5 0 0
19: 0 0 5 0 0
20:
21: 5
Figure 2: Preliminary Leaves in the Stem and Leaf Diagram
----------------------------------------------------------------------------------------------------
The last step is to order the digits composing the leaves. This provides a visual
description of the data including the minimum, the maximum, the modes and the median.
----------------------------------------------------------------------------------------------------
9: 5
10: 2 8 8
11: 0 0 2 5 5 6 6 8 8
12: 0 0 0 1 2 3 5 5 5 5 5
13: 0 0 0 0 0 1 3 5 5 5 6 8 8
14: 0 0 0 0 2 5 5 5 5 5 8
15: 0 0 0 0 0 0 0 0 0 0 3 5 5 5 5 5 5 5 5 5 5 7
16: 0 0 0 0 4 5
17: 0 0 0 0 5 5
18: 0 0 0 5
19: 0 0 0 0 5
20:
21: 5
Figure 3: Stem and Leaf Diagram
Of course, this back-of-the-envelope technique can be combined with a computer
to sort or order the data.
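The construction above can be sketched in Python. This is one possible implementation, not the method used in the lecture: the function name `stem_and_leaf` is an illustrative choice, and the data is the pooled list of weights given earlier.

```python
from collections import defaultdict

# Weight data as listed above (males, then females), pooled.
males = [
    140, 145, 160, 190, 155, 165, 150, 190, 195, 138, 160, 155, 153, 145, 170, 175, 175, 170, 180,
    135, 170, 157, 130, 185, 190, 155, 170, 155, 215, 150, 145, 155, 155, 150, 155, 150, 180, 160, 135,
    160, 130, 155, 150, 148, 155, 150, 140, 180, 190, 145, 150, 164, 140, 142, 136, 123, 155,
]
females = [
    140, 120, 130, 138, 121, 125, 116, 145, 150, 112, 125, 130, 120, 130, 131, 120, 118, 125,
    135, 125, 118, 122, 115, 102, 115, 150, 110, 116, 108, 95, 125, 133, 110, 150, 108,
]
weights = males + females


def stem_and_leaf(data):
    """Return {stem: ordered list of leaves}, with stems counting by tens."""
    leaves = defaultdict(list)
    for value in sorted(data):
        stem, leaf = divmod(value, 10)  # e.g. 145 -> stem 14, leaf 5
        leaves[stem].append(leaf)
    # Keep empty stems (such as 20) so that gaps in the data stay visible.
    stems = range(min(data) // 10, max(data) // 10 + 1)
    return {s: leaves.get(s, []) for s in stems}


diagram = stem_and_leaf(weights)
for stem, row in diagram.items():
    print(f"{stem:>2}: {' '.join(str(digit) for digit in row)}")
```

Sorting the data before splitting each value into stem and leaf produces the ordered leaves of Figure 3 directly, skipping the preliminary step of Figure 2.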
In all there are 92 observations or data points. The median therefore lies between
the 46th and 47th observations, both of which are 145, so the median is 145. Note that the
data is bimodal, with ten 150’s and ten 155’s. The students show a reporting bias, tending
to round off to zeros and fives.
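These counts can be checked with a short Python sketch on the pooled data. The tally of final digits is an added illustration of the rounding pattern, and the variable names are illustrative choices.

```python
from collections import Counter

# Weight data as listed above (males, then females), pooled and sorted.
males = [
    140, 145, 160, 190, 155, 165, 150, 190, 195, 138, 160, 155, 153, 145, 170, 175, 175, 170, 180,
    135, 170, 157, 130, 185, 190, 155, 170, 155, 215, 150, 145, 155, 155, 150, 155, 150, 180, 160, 135,
    160, 130, 155, 150, 148, 155, 150, 140, 180, 190, 145, 150, 164, 140, 142, 136, 123, 155,
]
females = [
    140, 120, 130, 138, 121, 125, 116, 145, 150, 112, 125, 130, 120, 130, 131, 120, 118, 125,
    135, 125, 118, 122, 115, 102, 115, 150, 110, 116, 108, 95, 125, 133, 110, 150, 108,
]
weights = sorted(males + females)
n = len(weights)

# With an even number of observations the median is the average of the two
# middle values (the 46th and 47th here; Python list indices start at zero).
lower_mid, upper_mid = weights[n // 2 - 1], weights[n // 2]
median = (lower_mid + upper_mid) / 2

# Modes: all values that share the highest frequency.
counts = Counter(weights)
top = max(counts.values())
modes = sorted(w for w, c in counts.items() if c == top)

# Tally of final digits, to see the rounding to zeros and fives.
last_digits = Counter(w % 10 for w in weights)
rounded = last_digits[0] + last_digits[5]

print(n, median, modes, top)
print(rounded, "of", n, "weights end in 0 or 5")
```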
IV. Dispersion
One measure of dispersion is the interquartile range, IQR. Sort the data and put the
points into four groups with equal numbers of observations. There will be two groups
above the median and two groups below the median. If the median is a data point, add it
to both the upper group and the lower group. In the case of the weight data, we had an
even number of observations, and the median fell between two observations, the 46th and
the 47th, which were both equal to 145. Next, find the median of the two high groups
combined, i.e. the third quartile, with 25 percent of the observations above it. Also find
the median of the two low groups combined, i.e. the first quartile, with 25 percent of the
observations below it. The difference between the median for the highs and the median
for the lows is the interquartile range.
Having already done the work for the weight data by constructing the stem and leaf
diagram, we can use it to determine the first quartile of 125 pounds, between the 23rd
observation of 125 pounds and the 24th observation of 125 pounds. The third quartile is
between the 23rd and 24th observation from the top, i.e. between 157 pounds and 155
pounds so the third quartile is 156 pounds, and the interquartile range is 156 minus 125 or
31 pounds.
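The quartile rule described above (the median of each half, with the overall median shared by both halves when it is a data point) can be sketched as follows. The function name `quartiles_by_halves` is an illustrative choice. Library routines such as `statistics.quantiles` or `numpy.percentile` use other interpolation conventions and may return slightly different quartiles.

```python
import statistics

# Weight data as listed above (males, then females), pooled.
males = [
    140, 145, 160, 190, 155, 165, 150, 190, 195, 138, 160, 155, 153, 145, 170, 175, 175, 170, 180,
    135, 170, 157, 130, 185, 190, 155, 170, 155, 215, 150, 145, 155, 155, 150, 155, 150, 180, 160, 135,
    160, 130, 155, 150, 148, 155, 150, 140, 180, 190, 145, 150, 164, 140, 142, 136, 123, 155,
]
females = [
    140, 120, 130, 138, 121, 125, 116, 145, 150, 112, 125, 130, 120, 130, 131, 120, 118, 125,
    135, 125, 118, 122, 115, 102, 115, 150, 110, 116, 108, 95, 125, 133, 110, 150, 108,
]
weights = males + females


def quartiles_by_halves(data):
    """First and third quartiles as the medians of the lower and upper halves.

    If the number of observations is odd, the median is itself a data point
    and is included in both halves, as described in the text.
    """
    data = sorted(data)
    n = len(data)
    half = (n + 1) // 2  # size of each half (the median is shared when n is odd)
    lower, upper = data[:half], data[n - half:]
    return statistics.median(lower), statistics.median(upper)


q1, q3 = quartiles_by_halves(weights)
iqr = q3 - q1
print(q1, q3, iqr)
```

With 92 observations each half holds 46 points, so each quartile is the average of the 23rd and 24th values counted from the nearer end of the data.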
John Tukey’s box and whiskers plot displays the interquartile range as well as other
features of the data such as outliers. The left edge of the box is the first quartile and the
right edge of the box is the third quartile. The median is drawn as a vertical line dividing
the box.
---------------------------------------------------------------------------------------------------
125 (first quartile)        145 (median)        156 (third quartile)
Figure 4: Box of the Box and Whiskers Plot
------------------------------------------------------------------------------------------------
To illustrate the whiskers, we need to redraw the box at half scale horizontally so
that we have sufficient room. The whiskers end with points that are not outliers, and the
data points that are outliers are illustrated individually. Outliers are any data points that
are beyond 1.5 times the interquartile range, i.e. 1.5 times 31 or 46.5, from either end of
the box. So the first quartile of 125 minus 46.5 is 78.5, but this lies far below the
minimum point of 95 so the left whisker will end at 95 with no low outlying points. The
third quartile of 156 plus 46.5 is 202.5, so the right whisker ends at 195, the largest data
point below that limit, and there is one outlier at 215 pounds.
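The outlier rule can be sketched in Python as below. The quartiles are taken from the text rather than recomputed, and the names `lower_fence` and `upper_fence` are illustrative labels for the two cutoffs, not terms used in the lecture.

```python
# Weight data as listed above (males, then females), pooled.
males = [
    140, 145, 160, 190, 155, 165, 150, 190, 195, 138, 160, 155, 153, 145, 170, 175, 175, 170, 180,
    135, 170, 157, 130, 185, 190, 155, 170, 155, 215, 150, 145, 155, 155, 150, 155, 150, 180, 160, 135,
    160, 130, 155, 150, 148, 155, 150, 140, 180, 190, 145, 150, 164, 140, 142, 136, 123, 155,
]
females = [
    140, 120, 130, 138, 121, 125, 116, 145, 150, 112, 125, 130, 120, 130, 131, 120, 118, 125,
    135, 125, 118, 122, 115, 102, 115, 150, 110, 116, 108, 95, 125, 133, 110, 150, 108,
]
weights = males + females

# Quartiles as found in the text above.
q1, q3 = 125, 156
iqr = q3 - q1
step = 1.5 * iqr              # 1.5 times the interquartile range

lower_fence = q1 - step       # cutoff below the box
upper_fence = q3 + step       # cutoff above the box

# Points beyond either cutoff are outliers and are drawn individually.
outliers = sorted(w for w in weights if w < lower_fence or w > upper_fence)
inside = [w for w in weights if lower_fence <= w <= upper_fence]

# Each whisker ends at the most extreme point that is not an outlier.
left_whisker, right_whisker = min(inside), max(inside)
print(lower_fence, upper_fence, left_whisker, right_whisker, outliers)
```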
----------------------------------------------------------------------------------------------------
95 (whisker)    125    145    156    195 (whisker)    215 (outlier)
Figure 5: Box and Whiskers Plot
----------------------------------------------------------------------------------------------------
Another measure of dispersion or spread in the data is the standard deviation, s.
This is the square root of the sample variance, i.e. the average of the squared distance of
each observation value from the sample mean:
\[ \left\{ \sum_{j=1}^{n} \left[ x_j - \left( \sum_{i=1}^{n} x_i / n \right) \right]^2 \right\} / (n-1) = s^2 . \]
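As a sketch of this calculation on the pooled weight data, the Python snippet below computes the sample variance with n - 1 in the denominator and takes its square root. The variable names are illustrative. The standard library's `statistics.variance` and `statistics.stdev` use the same n - 1 convention and serve as a cross-check.

```python
import math
import statistics

# Weight data as listed above (males, then females), pooled.
males = [
    140, 145, 160, 190, 155, 165, 150, 190, 195, 138, 160, 155, 153, 145, 170, 175, 175, 170, 180,
    135, 170, 157, 130, 185, 190, 155, 170, 155, 215, 150, 145, 155, 155, 150, 155, 150, 180, 160, 135,
    160, 130, 155, 150, 148, 155, 150, 140, 180, 190, 145, 150, 164, 140, 142, 136, 123, 155,
]
females = [
    140, 120, 130, 138, 121, 125, 116, 145, 150, 112, 125, 130, 120, 130, 131, 120, 118, 125,
    135, 125, 118, 122, 115, 102, 115, 150, 110, 116, 108, 95, 125, 133, 110, 150, 108,
]
weights = males + females

n = len(weights)
mean = sum(weights) / n                                # sample mean

# Sum of squared distances from the mean, divided by n - 1.
s2 = sum((x - mean) ** 2 for x in weights) / (n - 1)   # sample variance
s = math.sqrt(s2)                                      # standard deviation

print(round(mean, 2), round(s2, 2), round(s, 2))

# Cross-check against the standard library.
print(math.isclose(s2, statistics.variance(weights)))
```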