Topic (4) Summarizing Data – Quantitative Variables 4-1

Topic (4) SUMMARIZING QUANTITATIVE DATA

A) Frequency Distributions For Samples

Frequency Tables and Histograms

We can't always list every possible value of a quantitative variable, because the list gets too long for large datasets. Instead, we summarize the data: we create groupings (intervals, bins, classes) and assign each observation to a grouping based on the value of its quantitative variable.

1) How many groupings or intervals (classes)?
   Want approximately 5-10 observations per group (on average) in equal-width intervals (groups):

   $c \approx \frac{\text{number of observations}}{8} = \frac{n}{8}$

   e.g. for n = 48, use anywhere from 6 to 10 intervals.

2) How big is each interval (bin)?
   Intervals should be equal-width, so choose a starting value slightly below the minimum value in the dataset and an ending value slightly above the maximum value in the dataset.

   $\text{size of each class} = \frac{\text{ending value} - \text{starting value}}{c}$

   e.g. Shannon-Wiener Index (SWI) ranges from 0.0 to 2.2685 and n = 48. We'll use a range from 0 to 2.4, which divides nicely with c = 6. So interval width = 2.4/6 = 0.4.

3) Construct each class or grouping:

FREQUENCY TABLE

Grouping     Absolute Frequency    Relative Frequency
0-0.4        27                    27/48 = 56.25%
>0.4-0.8      6                     6/48 = 12.5%
>0.8-1.2      3                     3/48 = 6.25%
>1.2-1.6      4                     4/48 = 8.33%
>1.6-2.0      6                     6/48 = 12.5%
>2.0-2.4      2                     2/48 = 4.17%
TOTAL        48                    100%

Histogram (a graphical display of the frequency table): display either the absolute or relative frequency.

Figure: histogram of SWI. Vertical axis: Percent (0 to 60). Horizontal axis: SWI, with bars centered at 0.2, 0.6, 1.0, 1.4, 1.8, 2.2.

Stem-and-Leaf Plots: every observation is explicitly displayed in the graphic
e.g.
SWI for n = 48 locations

Figure: SAS stem-and-leaf display of SWI with columns Stem, Leaf, and # (count per stem). Stems run from 22 at the top down to 0 at the bottom. Multiply Stem.Leaf by 10**-1.

To construct a stem-and-leaf plot:
1) find the minimum and maximum values in the dataset
2) decide which digits in a value are significant ("stem"), which are less important ("leaves"), and which really do not provide much information (this part of the value is ignored or truncated)

e.g. SWI min = 0 and max = 2.2685
   stem = X.X  = 2.2
   leaf = _._X = _._7

Observations with the same stem are plotted one next to the other in increasing leaf order within the stem.

Note the order of the vertical axis: SAS plots stem-and-leaf plots vertically (not horizontally) and in a mirror image of the X-axis if you were to rotate the plot to the horizontal position. Compare the shape to the histogram above.

Box Plots: plots using the "5-number summary": {minimum, first quartile, median, third quartile, maximum}

Order the data for a particular variable from low to high in value, as is done in the stem-and-leaf plot.

The first quartile (also called the lower quartile or 25th percentile) is the value where 25% of the observations fall below it and the remainder are its value or higher. E.g. in the SWI, the first quartile is the 12th smallest out of 48 ordered numbers: 0.10.

The median is the middle value or 50th percentile, where half of the data have values less than the median and half have values the same or higher. When the number of observations is even, the median is the average of the two middle values. E.g. in the SWI, the median is the average of the 24th and 25th ordered values, 0.30 and 0.34, so median = 0.32.
The third quartile (also called the upper quartile or 75th percentile) is the value where 75% of the observations fall below it and the remaining 25% are its value or higher. E.g. in the SWI, the third quartile is the 36th smallest out of 48 ordered numbers: 1.18.

On a vertical axis showing the range of possible values, plot a rectangle whose length extends from the first to the third quartile and which has a waist (horizontal line) at the median. Extend lines vertically to the minimum and maximum of the data.

Figure: SAS stem-and-leaf display of SWI (stems 22 down to 0, multiply Stem.Leaf by 10**-1) shown side by side with its boxplot. In the boxplot, the box edges mark the quartiles, the line marked *-----* is the median, and + marks the mean.

These can be made fancier by adding the ability to identify outliers and extreme observations as well.

Boxplots are especially useful for comparing several different datasets simultaneously.

EXAMPLE
Figure: summaries of daily high temperatures for each month at Oberlin, OH.

Such graphics allow us to see how the distribution of the data changes in time, specifically how the monthly medians and the variability of the data change throughout the year.

The histogram (i.e. the frequency distribution) or stem-and-leaf plot plays an important role in statistical analysis. As a consequence, we spend a lot of time and effort describing these distributions. The descriptions include:

Shape of the distribution (skew, modality, symmetry, gaps, and outlying or other unusual data points)

Symmetric: each half of the histogram is a mirror image of the other half. The frequency distribution is said to have equal-length tails.

Figure: histogram of X for a symmetric distribution. Vertical axis: frequency (0 to 14). Legend: Std.
Dev = 9.09, Mean = 52.1, N = 99. Horizontal axis: X, from 32.5 to 72.5 in steps of 2.5.

Skew: the tails are not equal in length; the side of the longer tail determines the direction of skew. Positive skew is a long tail toward large values; negative skew is a long tail toward small values.

Figure: two histograms illustrating skew. One has horizontal axis CONCENTR (0.0 to 25.0); the other has a horizontal axis running from -90 to -10.

Gaps, outlying or unusual observations: intervals with no observations, or values that do not fit the pattern of the rest of the data.

E.g. number of fish in a tow

Figure: SAS stem-and-leaf display and boxplot of the number of fish in a tow. Stems run from 7 at the top down to 0 at the bottom, with two rows per stem. Multiply Stem.Leaf by 10**+3.

Modality: the mode is the most frequent value in a dataset when only a few different values are listed. When the data have many different values (e.g. SWI), the mode is usually said to be the interval with the most observations. Data can be unimodal (one mode) or multimodal (one primary mode with secondary modes).

Figure: histogram illustrating modality. Vertical axis: frequency (0 to 50). Horizontal axis: 40.0 to 95.0 in steps of 5.0.

Figure: three examples of unimodal, symmetric distributions.
What distinguishes the three distributions?

Special name for distributions which follow a symmetric, unimodal shape with equal-sized tails and with a specific curve between the mode and the tails:
NORMAL DISTRIBUTION

B) Frequency Distributions for Populations

N = population size >>> n = sample size.
If we used the rule of thumb for the number of bars needed, we'd get an extremely large number. The tops of the bars approach a smooth line: this is called the density curve of the population.

Figure: three histograms with panel titles N=40, N=400, and N=40000.

Topic (5) SUMMARIZING DATA – SPREAD OR VARIABILITY 5-14

C) Summary Measures for Samples

1) Measures of Center

a) Median (50th percentile)

Important Point #1: The median is said to be robust because it is resistant to outliers.

Important Point #2: The sample median divides the total area under the bars in a histogram in half.

Important Point #3: Populations also have medians, called the population median (M). This number divides the area under the curve describing the population frequency distribution in half.

Figure: SAS stem-and-leaf display of SWI (n = 48), stems 22 down to 0. Multiply Stem.Leaf by 10**-1.

b) Arithmetic Mean

Defn: The MEAN of a data set is the average value. That is, it is the value obtained by adding all of the numbers together and dividing the result by the number of values in the sum (see symbols later).

The SAMPLE MEAN is denoted $\bar{x}$ (pronounced "x-bar").

The POPULATION MEAN is denoted μ (pronounced "mu").

EXAMPLE
The fish lengths for a study in the Tennessee River are:
48, 45, 49, 51, 44, 49, 46, 28.5, 26, 25.5, 25, 44

Figure: dot plot of these 12 values on an axis labeled Length (cm), with ticks at 25, 30, 35, 40, 45, 50.

If each point has the same weight, where should the pivot point be to balance the x-axis (i.e. keep it horizontal)?
Ans:

To calculate the sample mean: Sum the data values and divide the result by n.
$\bar{x} = \frac{48+45+49+51+44+49+46+28.5+26+25.5+25+44}{12} = \frac{481}{12} = 40.08$

We say that the fish caught in the study averaged 40.08 cm in length.

Important Point #1: If one were able to observe the value of every single element in a population (say, every single fish in the Tennessee River in 1978), then it would be possible to calculate the population mean μ. Since we can't do that, we say that an estimate of the population mean μ is the sample mean $\bar{x}$.

Important Point #2: Is the mean robust?
Ans:

NOTATION:
X     denotes the NAME of the variable, e.g. LENGTH
x     denotes a value for the named variable, e.g. 48 cm
i     a subscript which denotes the index number for the observation, e.g. fish IDs run from 1 to 12
x_i   denotes the value for the ith observation (that is, the ith observed value), e.g. x_1 = 48, x_2 = 45, etc.
Σ     denotes the operation "SUM"

So, we can write

$\bar{x} = \frac{\sum_{i=1}^{n} x_i}{n} = \frac{x_1 + x_2 + \dots + x_n}{n}$

For frequency distributions, the relationship of the mean to the median depends on the shape of the distribution:

Skewed to the right (positive):   mean ___ median
Skewed to the left (negative):    mean ___ median
Symmetric and unimodal:           mean ___ median
Uniform:                          mean ___ median
Symmetric and bimodal:            mean ___ median

Question: So, which measure of center do you use when?

2) Measures of Spread

How do we capture variability in a set of values using a single summary statistic?

Figure: dot plots of three datasets X, Y, and Z, each on an axis from 0 to 7.

Note how these datasets vary in their minimum and maximum values and how they vary within their distributions as well.

a) Range of a Variable

Defn: Range = Maximum value - Minimum value

e.g. Tenn. River fish lengths: range = 26 cm (51 - 25)

Question: is the range a robust measure for variability?
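The mean and range calculations for the fish lengths can be checked with a short Python sketch. The function and variable names here are illustrative choices, not part of the original notes.

```python
# Fish lengths (cm) from the Tennessee River example.
lengths = [48, 45, 49, 51, 44, 49, 46, 28.5, 26, 25.5, 25, 44]


def sample_mean(values):
    """Sum the data values and divide the result by n."""
    return sum(values) / len(values)


def value_range(values):
    """Range = maximum value - minimum value."""
    return max(values) - min(values)


mean_length = sample_mean(lengths)
range_length = value_range(lengths)

print(f"n = {len(lengths)}, sum = {sum(lengths)}")
print(f"sample mean = {mean_length:.2f} cm")
print(f"range = {range_length} cm")
```

Changing a single value (for example, replacing 51 with 510) and rerunning the sketch is one way to explore the robustness questions posed above.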
b) Standard Deviation

The distance $x_i - \bar{x}$ is called the deviation of the ith value from the sample mean.

EXAMPLE: fish lengths ($\bar{x} = 40.08$)

Figure: dot plot of the 12 fish lengths on an axis with ticks at 25, 30, 35, 40, 45, 50.

$x_i - \bar{x}$ = deviation
25   - 40.1 = -15.1
25.5 - 40.1 = -14.6
26   - 40.1 = -14.1
28.5 - 40.1 = -11.6
44   - 40.1 =   3.9
44   - 40.1 =   3.9
45   - 40.1 =   4.9
46   - 40.1 =   5.9
48   - 40.1 =   7.9
49   - 40.1 =   8.9
49   - 40.1 =   8.9
51   - 40.1 =  10.9

Question: Might these deviations be useful information to describe the variability in a set of data?

The standard deviation is a measure of the average deviation of values in a set of data.

FACT: for any set of data, the deviations always sum to 0!

So to be useful, we do the following:
1) calculate the deviations, $x_i - \bar{x}$, i = 1,…,n
2) square each deviation, $(x_i - \bar{x})^2$, i = 1,…,n
3) sum up the squares, $\sum_{i=1}^{n} (x_i - \bar{x})^2$
4) divide by (n-1) {NOT n}: $s^2 = \frac{\sum_{i=1}^{n} (x_i - \bar{x})^2}{n-1}$
5) take the square root: $s = \sqrt{s^2} = \sqrt{\frac{\sum_{i=1}^{n} (x_i - \bar{x})^2}{n-1}}$

s, the sample standard deviation, can be thought of as the typical or average deviation of an observation from the sample mean.

EXAMPLE: fish lengths

Figure: dot plot of the 12 fish lengths on an axis with ticks at 25, 30, 35, 40, 45, 50.

Deviations    (Deviations)^2
-15.1         228.01
-14.6         213.16
-14.1         198.81
-11.6         134.56
  3.9          15.21
  3.9          15.21
  4.9          24.01
  5.9          34.81
  7.9          62.41
  8.9          79.21
  8.9          79.21
 10.9         118.81
------        ----------
Σ = 0.0       1203.42

Divide by (n-1): $s^2 = \frac{1203.42}{12 - 1} = 109.40\ \text{cm}^2$

Take the square root: $s = \sqrt{109.40\ \text{cm}^2} = 10.46\ \text{cm}$

Interpretation?

Defn: the SAMPLE STANDARD DEVIATION is defined by the equation

$s = \sqrt{\frac{\sum_{i=1}^{n} (x_i - \bar{x})^2}{n-1}}$

The SAMPLE VARIANCE is $s^2$.
The POPULATION VARIANCE is denoted $\sigma^2$.
The POPULATION STANDARD DEVIATION is denoted by σ.

Question: How is it used?
1.
s is the sample estimate of the population standard deviation σ (note that σ is almost always unknown!).
2. Large values of s (or σ) imply large variability in a data set (but it depends on the scale as well).
   a) good for comparing two or more datasets when the data have the same units of measurement

EXAMPLE
Based on a sample of 50 acres on randomly selected farms in Maryland, the 1998 corn yield averaged 125 bushels per acre with a standard deviation (s.d.) of 40 bushels. The next year, a drought year, had an average yield of $\bar{x} = 83$ bushels per acre and s = 25. Let's assume that the frequency distributions of the number of bushels per acre for fields in each of these 2 years look unimodal and symmetric (i.e. "normal").

Important and Useful Point: the range and the s.d. of a set of data that are approximately normally distributed are related: range = max - min ≈ 6s.

So knowing $\bar{x}$ and s, and that the data are "normal" in shape, we can graph and compare the two years' yields:

|____________________________________________|
0                                          250

c) Coefficient of Variation

$CV = \frac{s}{\bar{x}} \times 100\%$

Note that CV is unitless and is often used to compare different variables measured on different scales.

EXAMPLE: Tennessee River fish study of the effects of DDT

Fish lengths:       $CV = \frac{10.46}{40.08} \times 100\% = 26.09\%$
Fish weights:       $CV = \frac{407.76}{1000.33} \times 100\% = 40.76\%$
DDT concentration:  $CV = \frac{6.98}{7.21} \times 100\% = 96.87\%$

Question: which random variable (Length, Weight, DDT conc.) is the most variable?

Question: Suppose I had measured the fish lengths in inches. Would the CV be the same?

d) The Interquartile Range

Defn: Recall that the LOWER QUARTILE (Q1) of a dataset is the 25th percentile of the observations and that the UPPER QUARTILE (Q3) is the 75th percentile of the observations.
The INTERQUARTILE RANGE (IQR) is the range of the middle 50% of the dataset: IQR = Q3 - Q1.

EXAMPLE
n = 12 fish weights:
441, 532, 544, 778, 897, 917, 986, 1023, 1266, 1398, 1459, 1763

Median: $m = \frac{917 + 986}{2} = 951.5$
Q1:     $\frac{544 + 778}{2} = 661$
Q3:     $\frac{1266 + 1398}{2} = 1332$
IQR:    1332 - 661 = 671

Question: Is the IQR resistant to outliers?
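The spread measures worked through in this topic (standard deviation, coefficient of variation, and interquartile range) can be reproduced with a short Python sketch. All function names are illustrative. The quartile rule used here (Q1 and Q3 as the medians of the lower and upper halves of the ordered data) is an assumption inferred from the worked fish-weight example; statistical software often uses other quartile conventions and may give slightly different values.

```python
import math

# Data from the Tennessee River examples in the notes.
lengths = [48, 45, 49, 51, 44, 49, 46, 28.5, 26, 25.5, 25, 44]  # cm
weights = [441, 532, 544, 778, 897, 917, 986, 1023, 1266, 1398, 1459, 1763]


def sample_sd(values):
    """Sample standard deviation: square root of SS / (n - 1)."""
    n = len(values)
    mean = sum(values) / n
    sum_sq = sum((x - mean) ** 2 for x in values)  # sum of squared deviations
    return math.sqrt(sum_sq / (n - 1))


def coefficient_of_variation(values):
    """CV = s / x-bar * 100, expressed in percent."""
    mean = sum(values) / len(values)
    return sample_sd(values) / mean * 100


def median(ordered):
    """Median of an already sorted list."""
    n = len(ordered)
    mid = n // 2
    if n % 2 == 0:
        return (ordered[mid - 1] + ordered[mid]) / 2
    return ordered[mid]


def quartiles_by_halves(values):
    """Return (Q1, median, Q3) using medians of the lower and upper halves.

    Assumed convention: for odd n, the middle value is left out of both halves.
    """
    data = sorted(values)
    n = len(data)
    half = n // 2
    lower = data[:half]
    upper = data[half:] if n % 2 == 0 else data[half + 1:]
    return median(lower), median(data), median(upper)


q1, med, q3 = quartiles_by_halves(weights)
iqr = q3 - q1

print(f"s (lengths)  = {sample_sd(lengths):.2f} cm")
print(f"CV (lengths) = {coefficient_of_variation(lengths):.2f}%")
print(f"CV (weights) = {coefficient_of_variation(weights):.2f}%")
print(f"Q1 = {q1}, median = {med}, Q3 = {q3}, IQR = {iqr}")
```

Because the sketch uses the unrounded mean, its intermediate values can differ in the last digit from the hand calculation, which rounds the mean to 40.1 before taking deviations.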