Statistics 13 Elementary Statistics
Summer Session I 2012
Lecture Notes 2: Methods for Describing Data¹

Describing Qualitative Data

Definition 2.1 A class is one of the categories into which qualitative data can be classified.

Definition 2.2 The class frequency is the number of observations in the data set that fall into a particular class.

Definition 2.3 The class relative frequency is the class frequency divided by the total number of observations in the data set; that is,

\[ \text{class relative frequency} = \frac{\text{class frequency}}{\text{total number of observations}} \]

Definition 2.4 The class percentage is the class relative frequency multiplied by 100; that is,

\[ \text{class percentage} = (\text{class relative frequency}) \times 100 \]

Summary of Graphical Descriptive Methods for Qualitative Data

• Bar Graph: The categories (classes) of the qualitative variable are represented by bars, where the height of each bar is either the class frequency, class relative frequency, or class percentage.
• Pie Chart: The categories (classes) of the qualitative variable are represented by slices of a pie (circle). The size of each slice is proportional to the class relative frequency.
• Pareto Diagram: A bar graph with the categories (classes) of the qualitative variable (i.e., the bars) arranged by height in descending order from left to right.
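Definitions 2.2 to 2.4 and the Pareto ordering can be sketched in a few lines of Python. This is only an illustration: the observations and the function name `frequency_table` are hypothetical and do not come from these notes.

```python
from collections import Counter

# Hypothetical qualitative observations (invented for illustration).
observations = ["traffic", "weather", "traffic", "overslept", "traffic",
                "child care", "weather", "traffic", "overslept", "traffic"]


def frequency_table(data):
    """Return rows of (class, frequency, relative frequency, percentage),
    sorted by descending frequency as in a Pareto diagram."""
    counts = Counter(data)
    total = len(data)
    rows = []
    for cls, freq in counts.most_common():
        rel = freq / total                       # Definition 2.3
        rows.append((cls, freq, rel, rel * 100))  # Definition 2.4
    return rows


table = frequency_table(observations)
for cls, freq, rel, pct in table:
    print(f"{cls:<12}{freq:>4}{rel:>8.2f}{pct:>7.1f}%")
```

The relative frequencies in such a table sum to 1 and the percentages to 100, which is a quick check on any hand-built table.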
¹ Last update: June 25, 2012

[Figure: Income of the patients. Examples of pie charts (top) and bar graphs (bottom) for the Control and Treatment groups. Income classes: Under $25,000, $25,000−$50,000, $50,001−$75,000, $75,001−$100,000, Above $100,000, Prefer not to answer.]

[Figure: Reasons for arriving late at work (from Wikipedia). Example of a Pareto diagram.]

Describing Quantitative Data

Summary of Graphical Descriptive Methods for Quantitative Data

• Dot Plot: The numerical value of each quantitative measurement in the data set is represented by a dot on a horizontal scale. When data values repeat, the dots are placed above one another vertically.
• Stem-and-Leaf Display: The numerical value of the quantitative variable is partitioned into a “stem” and a “leaf.” The possible stems are listed in order in a column. The leaf for each quantitative measurement in the data set is placed in the corresponding stem row. Leaves for observations with the same stem value are listed in increasing order horizontally.
• Histogram: The possible numerical values of the quantitative variable are partitioned into class intervals, each of which has the same width. These intervals form the scale of the horizontal axis. The frequency or relative frequency of observations in each class interval is determined. A vertical bar is placed over each class interval, with the height of the bar equal to either the class frequency or class relative frequency.

Dotplots

Example 1 The outbreak of food poisoning on a sports day, Thailand 1990.

[Figure: Two panels, "Age by sex" (groups F and M; age axis from 0 to 70) and "Distribution of birthdate" (vertical axis: Frequency; horizontal axis from 1930 to 1975).]

Stem-and-Leaf Display

Example 2 The following data show the ages of the 27 residents of Alcan, Alaska. (Source: U.S.
Bureau of the Census)

45 46 43  1 19 37 52 35  8
42  3 41 10 11 48 40 31 42
50  6 55 40 41 30  7 12 58

The stem-and-leaf display for the data:

0 | 1 3 6 7 8
1 | 0 1 2 9
2 |
3 | 0 1 5 7
4 | 0 0 1 1 2 2 3 5 6 8
5 | 0 2 5 8

Histograms

Example 3 Using the age data from above.

[Figure: Histogram of age, drawn once with Frequency and once with Relative Frequency on the vertical axis; horizontal axis: age, from 0 to 60.]

The Meaning of Summation Notation $\sum_{i=1}^{n} x_i$

Sum the measurements of the variable that appears to the right of the summation symbol, beginning with the first measurement and ending with the nth measurement.

Example 4 A data set contains the observations 5, 1, 3, 2, 1. Then we set $x_1 = 5$, $x_2 = 1$, $x_3 = 3$, $x_4 = 2$, $x_5 = 1$. Then

a. $\sum_{i=1}^{5} x_i = x_1 + x_2 + x_3 + x_4 + x_5 = 5 + 1 + 3 + 2 + 1 = 12$
b. $\sum_{i=1}^{5} x_i^2 = x_1^2 + x_2^2 + x_3^2 + x_4^2 + x_5^2 = 5^2 + 1^2 + 3^2 + 2^2 + 1^2 = 25 + 1 + 9 + 4 + 1 = 40$
c. $\sum_{i=1}^{5} (x_i - 1) = (x_1 - 1) + (x_2 - 1) + (x_3 - 1) + (x_4 - 1) + (x_5 - 1) = (x_1 + x_2 + x_3 + x_4 + x_5) - (1 + 1 + 1 + 1 + 1) = \sum_{i=1}^{5} x_i - 5 = 12 - 5 = 7$
d. $\sum_{i=1}^{5} (x_i - 1)^2 = (x_1 - 1)^2 + (x_2 - 1)^2 + (x_3 - 1)^2 + (x_4 - 1)^2 + (x_5 - 1)^2 = 4^2 + 0^2 + 2^2 + 1^2 + 0^2 = 21$
e. $\left(\sum_{i=1}^{5} x_i\right)^2 = (x_1 + x_2 + x_3 + x_4 + x_5)^2 = (5 + 1 + 3 + 2 + 1)^2 = 12^2 = 144$

Definition 2.5 The mean of a set of quantitative data is the sum of the measurements, divided by the number of measurements contained in the data set.

Formula for a Sample Mean:

\[ \bar{x} = \frac{\sum_{i=1}^{n} x_i}{n} \]

Symbols for the Sample Mean and the Population Mean

x̄ = Sample mean
µ = Population mean

Definition 2.6 The median of a quantitative data set is the middle number when the measurements are arranged in ascending (or descending) order.

Calculating a Sample Median M

Arrange the n measurements from the smallest to the largest.
1. If n is odd, M is the middle number.
2. If n is even, M is the mean of the middle two numbers.

Definition 2.7 A data set is said to be skewed if one tail of the distribution has more extreme observations than the other tail.

[Figure: Three relative frequency curves, each marking the mean and the median: rightward skewness, symmetry, and leftward skewness.]

Definition 2.8 The mode is the measurement that occurs most frequently in the data set.

Definition 2.9 The range of a quantitative data set is equal to the largest measurement minus the smallest measurement.

Definition 2.10 The sample variance for a sample of n measurements is equal to the sum of the squared distances from the mean, divided by (n − 1). The symbol $s^2$ is used to represent the sample variance.

Formula for a Sample Variance:

\[ s^2 = \frac{\sum_{i=1}^{n} (x_i - \bar{x})^2}{n - 1} \]

A shortcut formula:

\[ s^2 = \frac{\sum_{i=1}^{n} x_i^2 - \dfrac{\left(\sum_{i=1}^{n} x_i\right)^2}{n}}{n - 1} \]

Definition 2.11 The sample standard deviation, s, is defined as the positive square root of the sample variance, $s^2$, or, mathematically,

\[ s = \sqrt{s^2} \]

Symbols for Variance and Standard Deviation

s² = Sample variance
s = Sample standard deviation
σ² = Population variance
σ = Population standard deviation

Numerical Descriptive Measures

Central Tendency    Variation
Mean                Range
Median              Variance
Mode                Standard Deviation

Two ways to interpret the standard deviation: 1. Chebyshev’s Rule and 2. Empirical Rule.

1. Chebyshev’s rule applies to any data set, regardless of the shape of the frequency distribution of the data.
   a. It is possible that very few of the measurements will fall within one standard deviation of the mean.
   b. At least 3/4 of the measurements will fall within two standard deviations of the mean.
   c. At least 8/9 of the measurements will fall within three standard deviations of the mean.
   d. Generally, for any number k greater than 1, at least $(1 - 1/k^2)$ of the measurements will fall within k standard deviations of the mean.

[Figure: Relative frequency distribution; horizontal axis: Population measurements.]

2. The Empirical rule is a rule of thumb that applies to data sets with frequency distributions that are mound shaped and symmetric, as follows: a. Approximately 68% of the measurements will fall within one standard deviation of the mean. b.
Approximately 95% of the measurements will fall within two standard deviations of the mean. c. Approximately 99.7% (essentially all) of the measurements will fall within three standard deviations of the mean.

Interval               Chebyshev’s rule              Empirical rule
x̄ ± s   (µ ± σ)        No bound (possibly very few)  approx. 68%
x̄ ± 2s  (µ ± 2σ)       At least 3/4                  approx. 95%
x̄ ± 3s  (µ ± 3σ)       At least 8/9                  approx. 99.7%
x̄ ± ks  (µ ± kσ)       At least 1 − 1/k²

Example 5 Use Chebyshev’s Theorem to give a lower bound on the percent of data in the interval $(\bar{x} - 2.5s, \bar{x} + 2.5s)$.

Answer: At least $1 - 1/2.5^2 = 1 - 1/6.25 = 0.84 = 84\%$ of the measurements will fall within the interval; that is, the lower bound is 84%.

Definition 2.12 For any set of n measurements (arranged in ascending or descending order), the pth percentile is a number such that p% of the measurements fall below that number and (100 − p)% fall above it.

Definition 2.13 The sample z-score for a measurement x is

\[ z = \frac{x - \bar{x}}{s} \]

The population z-score for a measurement x is

\[ z = \frac{x - \mu}{\sigma} \]

Interpretation of z-scores for Mound-Shaped Distributions of Data

1. Approximately 68% of the measurements will have a z-score between -1 and 1.
2. Approximately 95% of the measurements will have a z-score between -2 and 2.
3. Approximately 99.7% (almost all) of the measurements will have a z-score between -3 and 3.

Definition 2.14 An observation (or measurement) that is unusually large or small relative to the other values in a data set is called an outlier. Outliers typically are attributable to one of the following causes:

1. The measurement is observed, recorded, or entered into the computer incorrectly.
2. The measurement comes from a different population.
3. The measurement is correct, but represents a rare (chance) event.

Definition 2.15 The lower quartile $Q_L$ is the 25th percentile of a data set. The middle quartile M is the median. The upper quartile $Q_U$ is the 75th percentile.

Definition 2.16 The interquartile range (IQR) is the distance between the lower and upper quartiles.
\[ \text{IQR} = Q_U - Q_L \]

Elements of a Box Plot

1. A rectangle (the box) is drawn with the ends (the hinges) drawn at the lower and upper quartiles ($Q_L$ and $Q_U$). The median of the data is shown in the box, usually by a line.

2. The points at distances 1.5(IQR) from each hinge mark the inner fences of the data set. Lines (the whiskers) are drawn from each hinge to the most extreme measurement inside the inner fence. Thus,

\[ \text{Lower inner fence} = Q_L - 1.5(\text{IQR}) \]
\[ \text{Upper inner fence} = Q_U + 1.5(\text{IQR}) \]

A second pair of fences, the outer fences, appears at a distance of 3(IQR) from the hinges. One symbol (e.g., “*”) is used to represent measurements falling between the inner and outer fences, and another (e.g., “0”) is used to represent measurements that lie beyond the outer fences. Thus outer fences are not shown unless one or more measurements lie beyond them. We have

\[ \text{Lower outer fence} = Q_L - 3(\text{IQR}) \]
\[ \text{Upper outer fence} = Q_U + 3(\text{IQR}) \]

Different symbols can be used to represent the median and the extreme data points. Measurements beyond the outer fences are probably outliers.

Graphing Bivariate Relationships

One way to describe the relationship between two quantitative variables, called a bivariate relationship, is to plot the data in a scattergram (or scatterplot).

[Figure: Three scattergrams: a. Positive relationship, b. Negative relationship, c. No relationship.]
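As a closing sketch, the numerical measures in these notes can be checked with a few lines of Python. The block below reuses the observations from Example 4 (5, 1, 3, 2, 1), computes the sample variance by both the definition and the shortcut formula, evaluates the Chebyshev bound from Example 5, and computes box-plot fences. The function names and the quartiles Q_L = 10 and Q_U = 20 are hypothetical, chosen only for illustration; they do not come from the notes.

```python
import math
import statistics

# Observations from Example 4.
x = [5, 1, 3, 2, 1]
n = len(x)

# Summation quantities from Example 4 (parts a, b, and d).
sum_x = sum(x)                              # sum of x_i
sum_x_sq = sum(v ** 2 for v in x)           # sum of x_i squared
sum_dev_sq = sum((v - 1) ** 2 for v in x)   # sum of (x_i - 1) squared

# Central tendency (Definitions 2.5 and 2.6).
mean = sum_x / n
median = statistics.median(x)

# Sample variance by the definition and by the shortcut formula.
var_definition = sum((v - mean) ** 2 for v in x) / (n - 1)
var_shortcut = (sum_x_sq - sum_x ** 2 / n) / (n - 1)
std_dev = math.sqrt(var_definition)

# Sample z-score of each observation (Definition 2.13).
z_scores = [(v - mean) / std_dev for v in x]


def chebyshev_lower_bound(k):
    """Minimum fraction of measurements within k standard deviations."""
    if k <= 1:
        raise ValueError("Chebyshev's rule gives a useful bound only for k > 1")
    return 1 - 1 / k ** 2


def box_plot_fences(q_l, q_u):
    """Return the (inner, outer) fence pairs for given quartiles."""
    iqr = q_u - q_l
    inner = (q_l - 1.5 * iqr, q_u + 1.5 * iqr)
    outer = (q_l - 3 * iqr, q_u + 3 * iqr)
    return inner, outer


def classify(value, q_l, q_u):
    """Label a measurement by its position relative to the fences."""
    (inner_lo, inner_hi), (outer_lo, outer_hi) = box_plot_fences(q_l, q_u)
    if value < outer_lo or value > outer_hi:
        return "beyond outer fences"
    if value < inner_lo or value > inner_hi:
        return "between inner and outer fences"
    return "inside inner fences"


print(sum_x, sum_x_sq, sum_dev_sq)
print(mean, median, var_definition, var_shortcut, std_dev)
print(chebyshev_lower_bound(2.5))
# Hypothetical quartiles Q_L = 10 and Q_U = 20 (not from the notes).
print(box_plot_fences(10, 20))
print(classify(40, 10, 20))
```

The quartiles are passed in rather than computed because software packages use slightly different percentile conventions, so computed quartiles for a small data set may differ from a hand calculation.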