Sept. 27, 2007 LEC #1 ECON 140A/240A-1 Exploratory Data Analysis L. Phillips

I. Introduction

At the beginning of the course we will study three branches of statistics: (1) data analysis, (2) probability, and (3) statistical inference. Data analysis is the gathering, display, and summary of data. We will use visual devices and quantitative measures to accomplish these tasks. Probability has its origins in gambling and the laws of chance. This topic is interesting in its own right, but we will also use probability as a means to better understand the binomial distribution, the central limit theorem, and the relationship between the binomial distribution and the normal distribution.

II. Data Description

One use of statistics is to describe data with summary measures. Two key notions are central tendency and dispersion. There are several measures of central tendency. An intuitive and relatively easy measure to use is the mode, i.e. the data value that is observed most frequently. One issue, of course, is that the data may have two or three modes, so that the distribution has multiple peaks.

Another measure of central tendency is the median. Sort the data from the highest value to the lowest; the data point in the middle is the median, with one half of the data values above it and one half below it.

A third measure of central tendency, requiring some arithmetic, is the sample mean of the data. Add up all the data values and divide by the number of observations or data points.

III. Exploratory Data Analysis

John Tukey developed exploratory data analysis to visually describe the characteristics of data. Two visual tools useful for this purpose are the stem and leaf diagram and the box and whiskers plot.
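Returning briefly to the three measures of central tendency in Section II, the following is a minimal sketch of how they might be computed. The sample values are hypothetical and chosen only for illustration (they are not from the lecture), and the use of Python's standard `statistics` module is an assumption of this sketch.

```python
from statistics import median, multimode

# Hypothetical sample, chosen so that two values tie for most frequent.
sample = [1, 3, 3, 5, 7, 7, 9]

# Mode: the value(s) observed most often. multimode returns every tied value,
# which exposes the "two or three modes" issue directly.
modes = multimode(sample)

# Median: the middle value of the sorted data.
middle = median(sample)

# Sample mean: add up all the values and divide by the number of observations.
average = sum(sample) / len(sample)

print(modes)    # -> [3, 7]
print(middle)   # -> 5
print(average)  # -> 5.0
```

Here the mean and median agree, yet the sample has two modes, so no single "most frequent value" summarizes it.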
An example of the methodology of the stem and leaf plot is its application to weight data from males and females at Penn State, taken from Larry Gonick & Woollcott Smith, The Cartoon Guide to Statistics (1993).

Males:
140 145 160 190 155 165 150 190 195 138 160 155 153 145 170 175 175 170 180
135 170 157 130 185 190 155 170 155 215 150 145 155 155 150 155 150 180 160
135 160 130 155 150 148 155 150 140 180 190 145 150 164 140 142 136 123 155

Females:
140 120 130 138 121 125 116 145 150 112 125 130 120 130 131 120 118 125 135
125 118 122 115 102 115 150 110 116 108 95 125 133 110 150 108

For this illustration, the data is pooled without regard to gender. The first step is to determine the range of the data: the minimum weight is 95 and the maximum weight is 215. The second step is to construct the stem, counting by tens from 9 for 90, 10 for 100, etc. out to 21 for 210.

------------------------------------------------------------
 9
10
11
12
13
14
15
16
17
18
19
20
21
Figure 1: Stem of the Stem and Leaf Diagram
------------------------------------------------------------

The third step is to construct the leaves. Take the last digit of 95, the lowest weight, and place it after 9 on the stem. There are three weights from 100 through 109 (102, 108, and 108), so the digits following 10 on the stem are 2, 8, 8. This is a leaf attached to the stem at 10. Continuing in this fashion:

------------------------------------------------------------
 9: 5
10: 2 8 8
11: 6 2 8 8 5 5 0 6 0
12: 3 0 1 5 5 0 0 5 5 2 5
13: 8 5 0 5 0 6 0 8 0 0 1 5 3
14: 0 5 5 5 8 0 5 0 2 0 5
15: 5 0 5 3 7 5 5 0 5 5 0 5 0 5 0 5 0 0 5 0 0 0
16: 0 5 0 0 0 4
17: 0 5 5 0 0 0
18: 0 5 0 0
19: 0 0 5 0 0
20:
21: 5
Figure 2: Preliminary Leaves in the Stem and Leaf Diagram
------------------------------------------------------------

The last step is to order the digits composing the leaves. This provides a visual description of the data including the minimum, the maximum, the modes and the median.

------------------------------------------------------------
 9: 5
10: 2 8 8
11: 0 0 2 5 5 6 6 8 8
12: 0 0 0 1 2 3 5 5 5 5 5
13: 0 0 0 0 0 1 3 5 5 5 6 8 8
14: 0 0 0 0 2 5 5 5 5 5 8
15: 0 0 0 0 0 0 0 0 0 0 3 5 5 5 5 5 5 5 5 5 5 7
16: 0 0 0 0 4 5
17: 0 0 0 0 5 5
18: 0 0 0 5
19: 0 0 0 0 5
20:
21: 5
Figure 3: Stem and Leaf Diagram
------------------------------------------------------------

Of course, this back-of-the-envelope technique could be combined with using a computer to sort or order the data. In all there are 92 observations or data points, so the median lies between the 46th and 47th observations. Both are 145, so the median is 145. Note the data is bimodal, with ten 150's and ten 155's. The students have a reporting bias, tending to round off to zeros and fives.

IV. Dispersion

One measure of dispersion is the interquartile range, IQR. Sort the data and put the points into four groups with equal numbers of observations. There will be two groups above the median and two groups below the median. If the median is a data point, add it to both the upper group and the lower group. In the case of the weight data, we had an even number of observations, and the median fell between two observations, the 46th and the 47th, which were both equal to 145. Next, find the median of the two high groups taken together, i.e. the third quartile, with 25 percent of the observations above it. Also find the median of the two low groups taken together, i.e. the first quartile, with 25 percent of the observations below it.
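The stem and leaf construction and the quartile rule just described can be sketched in a few lines. This is a minimal illustration under stated assumptions, not the lecture's own procedure: the function names `stem_and_leaf` and `quartiles` are hypothetical, and the weights are the pooled Penn State values listed earlier.

```python
from collections import defaultdict
from statistics import median

MALES = [
    140, 145, 160, 190, 155, 165, 150, 190, 195, 138,
    160, 155, 153, 145, 170, 175, 175, 170, 180, 135,
    170, 157, 130, 185, 190, 155, 170, 155, 215, 150,
    145, 155, 155, 150, 155, 150, 180, 160, 135, 160,
    130, 155, 150, 148, 155, 150, 140, 180, 190, 145,
    150, 164, 140, 142, 136, 123, 155,
]
FEMALES = [
    140, 120, 130, 138, 121, 125, 116, 145, 150, 112,
    125, 130, 120, 130, 131, 120, 118, 125, 135, 125,
    118, 122, 115, 102, 115, 150, 110, 116, 108, 95,
    125, 133, 110, 150, 108,
]
POOLED = MALES + FEMALES  # pooled without regard to gender


def stem_and_leaf(values):
    """Map each stem (the tens) to its sorted leaves (the last digits)."""
    stems = defaultdict(list)
    for v in values:
        stems[v // 10].append(v % 10)
    lo, hi = min(stems), max(stems)
    # Keep empty stems (such as 20) so gaps in the data stay visible.
    return {s: sorted(stems.get(s, [])) for s in range(lo, hi + 1)}


def quartiles(values):
    """Return (Q1, median, Q3) as medians of the lower and upper halves.

    When n is odd the median is a data point and is placed in both halves.
    """
    data = sorted(values)
    n = len(data)
    half = (n + 1) // 2
    lower, upper = data[:half], data[n - half:]
    return median(lower), median(data), median(upper)


for stem, leaves in stem_and_leaf(POOLED).items():
    print(f"{stem:>2}: {' '.join(map(str, leaves))}")

q1, med, q3 = quartiles(POOLED)
print("Q1, median, Q3, IQR:", q1, med, q3, q3 - q1)
```

Run on the pooled weights, this produces the same stems and leaves as Figure 3 and reports a first quartile of 125, a median of 145, a third quartile of 156, and an interquartile range of 31.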
The difference between the median for the highs and the median for the lows is the interquartile range. Having already done the work for the weight data by constructing the stem and leaf diagram, we can use it to determine the first quartile of 125 pounds, which lies between the 23rd observation of 125 pounds and the 24th observation of 125 pounds. The third quartile lies between the 23rd and 24th observations from the top, i.e. between 157 pounds and 155 pounds, so the third quartile is 156 pounds, and the interquartile range is 156 minus 125, or 31 pounds.

John Tukey's box and whiskers plot displays the interquartile range as well as other features of the data such as outliers. The left edge of the box is the first quartile and the right edge of the box is the third quartile. The median is drawn as a vertical line dividing the box.

------------------------------------------------------------
[Box from 125 (first quartile) to 156 (third quartile), with the median line at 145]
Figure 4: Box of the Box and Whiskers Plot
------------------------------------------------------------

To illustrate the whiskers, we need to redraw the box at half scale horizontally so that we have sufficient room. The whiskers end with points that are not outliers, and the data points that are outliers are illustrated individually. Outliers are any data points that lie more than 1.5 times the interquartile range, i.e. 1.5 times 31 or 46.5, beyond either end of the box. The first quartile of 125 minus 46.5 is 78.5, which lies far below the minimum point of 95, so the left whisker ends at 95 and there are no low outlying points. The third quartile of 156 plus 46.5 is 202.5, so the right whisker ends at 195, the next data point below 202.5, and there is one outlier at 215 pounds.

------------------------------------------------------------
[Box from 125 to 156 with the median line at 145; whiskers extending to 95 and 195; one outlier plotted at 215]
Figure 5: Box and Whiskers Plot
------------------------------------------------------------

Another measure of dispersion or spread in the data is the standard deviation, s. This is the square root of the sample variance, i.e. the average of the squared distance of each observation value from the sample mean, where the average is taken with divisor n-1 rather than n:

$$ \frac{\sum_{j=1}^{n} \left[ x_j - \left( \sum_{i=1}^{n} x_i / n \right) \right]^2}{n-1} = s^2 . $$
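The outlier rule for the whiskers and the sample standard deviation can both be checked with a short sketch. This is an illustration under stated assumptions rather than a definitive implementation: the function names `whiskers_and_outliers` and `sample_std` are hypothetical, the quartiles 125 and 156 are taken as given from the text, and the weights are the pooled values from the data listing.

```python
from math import sqrt

MALES = [
    140, 145, 160, 190, 155, 165, 150, 190, 195, 138,
    160, 155, 153, 145, 170, 175, 175, 170, 180, 135,
    170, 157, 130, 185, 190, 155, 170, 155, 215, 150,
    145, 155, 155, 150, 155, 150, 180, 160, 135, 160,
    130, 155, 150, 148, 155, 150, 140, 180, 190, 145,
    150, 164, 140, 142, 136, 123, 155,
]
FEMALES = [
    140, 120, 130, 138, 121, 125, 116, 145, 150, 112,
    125, 130, 120, 130, 131, 120, 118, 125, 135, 125,
    118, 122, 115, 102, 115, 150, 110, 116, 108, 95,
    125, 133, 110, 150, 108,
]
POOLED = MALES + FEMALES


def whiskers_and_outliers(values, q1, q3):
    """Return (left whisker end, right whisker end, sorted outliers).

    A point is an outlier if it lies more than 1.5 * IQR beyond either
    end of the box; each whisker ends at the most extreme non-outlier.
    """
    reach = 1.5 * (q3 - q1)
    low_limit, high_limit = q1 - reach, q3 + reach
    inside = [v for v in values if low_limit <= v <= high_limit]
    outliers = sorted(v for v in values if v < low_limit or v > high_limit)
    return min(inside), max(inside), outliers


def sample_std(values):
    """Square root of the sample variance, using the n - 1 divisor."""
    n = len(values)
    mean = sum(values) / n
    variance = sum((x - mean) ** 2 for x in values) / (n - 1)
    return sqrt(variance)


# Quartiles as stated in the text for the pooled weight data.
print(whiskers_and_outliers(POOLED, 125, 156))  # -> (95, 195, [215])
print(round(sample_std(POOLED), 1))
```

The first line of output matches the whisker ends and the single outlier described in the text. The second prints the standard deviation of the pooled weights in pounds, a value the notes do not report.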