Chapter 6: RANDOM SAMPLING AND DATA DESCRIPTION

Part 1: Random Sampling, Numerical Summaries, Stem-n-Leaf Plots, Histograms, and Box Plots
Sections 6-1 to 6-4

Random Sampling

In statistics, we're usually interested in a value or parameter that describes a particular population, such as:

• The mean cholesterol level of all 50-year-old men
  – value of interest is the mean
  – population is all 50-year-old men
• The mean height of all NBA basketball players
  – value of interest is the mean
  – population is all NBA basketball players
• The mean number of worker-related failures occurring on any given Friday
  – value of interest is the mean
  – population is all Fridays
• The mean hole diameter of manufactured washers
  – value of interest is the mean
  – population is all manufactured washers

Gathering data on all individuals in a population is usually not realistic (though the census attempts this every 10 years). Instead, to get at the population parameters (such as the population mean $\mu$), we collect data on a subset of the full population, called a sample.

[Figure: Sample and Population]

Often, this is done with a simple random sample of the population, which means the observations were taken totally at random, and each individual had the same chance of being chosen.

What do we do with the data once we collect it? We can summarize it in a useful manner. One option is to report a statistic from the data.

• Statistic
A statistic is a summary value calculated from a sample of observations. Usually, a statistic is an estimator of some population parameter.

Suppose we collect $n$ observations in a sample $x_1, x_2, \ldots, x_n$ from a particular population. Each statistic below is calculated from the data and estimates a population parameter, which is unknown:

Statistic (calculated from the data)                                    Estimates the population parameter (unknown)
Sample mean: $\bar{x} = \frac{\sum_{i=1}^{n} x_i}{n}$                    Population mean: $\mu$
Sample variance: $s^2 = \frac{\sum_{i=1}^{n} (x_i - \bar{x})^2}{n-1}$    Population variance: $\sigma^2$

We discussed this general concept earlier...
that we infer something about the population from a sample. This is called statistical inference.

[Figure: Sample and Population]

Population parameters are shown with a Greek letter.

Statistic                                          Estimates this...
Sample mean: $\bar{x}$                             Population mean: $\mu$
Sample variance: $s^2$                             Population variance: $\sigma^2$
Sample std. deviation: $s$                         Population std. deviation: $\sigma$
Sample intercept: $b_0$ or $\hat{\beta}_0$         Population intercept: $\beta_0$
Sample slope: $b_1$ or $\hat{\beta}_1$             Population slope: $\beta_1$

Numerical Summaries
Section 6-1

The sample mean and the sample variance are numerical summaries of the sample data. The sample standard deviation is the square root of the sample variance.

The full (larger) population of interest may be an actual physical population, but it could also be a conceptual population if the population doesn't physically exist, as with 'all components that will be manufactured and sold'.

As we saw earlier, the sample variance $s^2$ essentially describes the 'average' squared distance of an observation from the sample mean.

There are $n = 8$ observations in the sample below. The deviations from the sample mean, $|x_i - \bar{x}|$, are shown below:

Sample variance:
$$s^2 = \frac{\sum_{i=1}^{n} (x_i - \bar{x})^2}{n-1}$$

Computation of $s^2$

Original formula and alternatives:
$$s^2 = \frac{\sum_{i=1}^{n} (x_i - \bar{x})^2}{n-1} = \frac{\sum_{i=1}^{n} x_i^2 - \frac{\left(\sum_{i=1}^{n} x_i\right)^2}{n}}{n-1} = \frac{\sum_{i=1}^{n} x_i^2 - n\bar{x}^2}{n-1}$$

Note that the divisor for the sample variance is $n - 1$. We subtract 1 from the sample size because we had to estimate $\mu$ with $\bar{x}$ in order to compute the sample variance.

We're interested in how the observations are dispersed around $\mu$, but we only have information on how the observations are dispersed around $\bar{x}$. If we didn't make this adjustment, our estimate for $\sigma^2$ (i.e. our $s^2$ value) would consistently be too small in estimating the true population variance.

We also say $s^2$ is based on $n - 1$ degrees of freedom. We'll discuss this more later.

Another measure of sample spread is the sample range.

• Sample Range
If the $n$ observations in a sample are denoted by $x_1, x_2, \ldots, x_n$, the sample range is
$$r = \max(x_i) - \min(x_i)$$
This is a single value, not 2 individual values.

Stem-n-leaf diagrams
Section 6-2

The mean and variance are quantities that give us information on the center and spread of the data, respectively. These are important summaries of a distribution. But many distributions can have the same mean and variance, and yet be different distributions.

We can use graphical displays to consider the whole distribution of the data.

Consider the following set of $n = 80$ data points, which are compressive strengths in pounds per square inch of 80 specimens of a new aluminum-lithium alloy undergoing evaluation.

105 221 183 186 121 181 180 143  97 154
153 174 120 168 167 141 245 228 174 199
181 158 176 110 163 131 154 115 160 208
158 133 207 180 190 193 194 133 156 123
134 178  76 167 184 135 229 146 218 157
101 171 165 172 158 169 199 151 142 163
145 171 148 158 160 175 149  87 160 237
150 135 196 201 200 176 150 170 118 149

For this data, $\bar{x} = 162.66$ and $s^2 = 1140.63$. These give a measure of center and spread.

We can look at a stem-n-leaf diagram to get a feel for the full distribution of the data.

The decimal point is 1 digit(s) to the right of the |

   7 | 6
   8 | 7
   9 | 7
  10 | 15
  11 | 058
  12 | 013
  13 | 133455
  14 | 12356899
  15 | 001344678888
  16 | 0003357789
  17 | 0112445668
  18 | 0011346
  19 | 034699
  20 | 0178
  21 | 8
  22 | 189
  23 | 7
  24 | 5

The minimum value is 76: '7' is the stem, and '6' is the leaf.
The maximum value is 245: '24' is the stem, and '5' is the leaf.

The 'legend' tells us where the decimal point is.

This stem-n-leaf diagram suggests the distribution can be described as bell-shaped and unimodal (i.e. has one peak).

Steps for making a Stem-n-Leaf Diagram

1. Separate each observation into a stem consisting of all but the final (rightmost) digit and a leaf, the final digit. Stems may have as many digits as needed, but each leaf contains only a single digit.
2. Write the stems in a vertical column with the smallest at the top, and draw a vertical line at the right of this column.
3. Write each leaf in the row to the right of its stem, in increasing order out from the stem.

If there are too many values for each stem, you can also do a split-stem-n-leaf diagram by splitting the values for each stem.

Mode, Quartiles, and Percentiles

Once we've ordered the data as in the stem-n-leaf diagram, we can easily pull out some other useful data features.

Consider the following stem-n-leaf diagram:

The decimal point is 1 digit(s) to the right of the |

  6 | 134
  6 | 5568
  7 | 0113
  7 | 57

We see that $n = 13$, the min is 61, and the max is 77.

• Median
This is the value at which 50% fall below and 50% fall above.
  – The median is 68 for this data set.
If $n$ is odd, an actual data point is the median.
If $n$ is even, the median falls between the 2 data points at the middle (use the average of these two data points).
The median is a measure of central tendency, and is denoted by $\tilde{x}$.

• Mode
This is the most frequently occurring data point.
  – There are two modes in this data set, 65 and 71.
We would call this distribution bimodal (i.e. has 2 peaks).

• Quartiles
The quartiles are the positions that break the ordered data into 4 parts, each containing 25% of the data: the first quartile ($q_1$), the second quartile ($q_2$), also called the median, and the third quartile ($q_3$).

This data set has
$q_1 = 64.5$, $q_2 = 68$, $q_3 = 72$

Because the data is discrete, there are a number of ways to find positions that break the data into the 25% proportions. Here's one option:

$q_1$ is the interpolated value between the data points at ordered positions $\lfloor \frac{n+1}{4} \rfloor$ and $\lceil \frac{n+1}{4} \rceil$.
(These are the symbols for rounding down, $\lfloor \cdot \rfloor$, and rounding up, $\lceil \cdot \rceil$, respectively.)

$q_3$ is the interpolated value between the data points at ordered positions $\lfloor \frac{3(n+1)}{4} \rfloor$ and $\lceil \frac{3(n+1)}{4} \rceil$.

For the data set above, $n = 13$, so $\frac{n+1}{4} = 3.5$ and $q_1$ lies midway between the 3rd and 4th ordered values (64 and 65), giving 64.5. Likewise $\frac{3(n+1)}{4} = 10.5$, so $q_3$ lies midway between the 10th and 11th ordered values (71 and 73), giving 72.

The interquartile range (IQR) is equal to $q_3 - q_1$ and is a measure of variability.
It is the spread of the middle 50% of the data. The IQR is less sensitive to extremes than the ordinary sample range.

• Percentiles
The 100$k$th percentile is a data value such that approximately 100$k$% of the observations are at or below this value and approximately 100$(1 - k)$% of them are above it (for $0 < k < 1$).

• Example: Mean and Median
A manufacturer of electronic components is interested in determining the lifetime of a certain type of battery. A sample, in hours of life, is as follows:

123, 116, 122, 110, 175, 126, 125, 111, 118, 117

a) Find the sample mean and median.
b) What feature in this data set is responsible for the substantial difference between the mean and median?

Frequency Distributions and Histograms
Section 6-3

A frequency distribution is a table that divides a set of data into a suitable number of classes (categories), showing also the number of items belonging to each class.

Consider the following stem-n-leaf diagram for humidity readings rounded to the nearest percent.

Stem | Leaf
   1 | 2 5 7
   2 | 1 1 3 4 5 7 8 9
   3 | 2 4 4 7 9
   4 | 2 4 8
   5 | 3

We might group these data into the following frequency distribution:

Class      Class      Frequency   Relative        Cumulative
interval   midpoint   f           frequency       relative frequency
10-19      14.5       3           3/20 = 0.15     0.15
20-29      24.5       8           8/20 = 0.40     0.55
30-39      34.5       5           5/20 = 0.25     0.80
40-49      44.5       3           3/20 = 0.15     0.95
50-59      54.5       1           1/20 = 0.05     1.00

There were 5 bins (also called cells or intervals) for this frequency table.

The histogram is a visual display of a frequency distribution.

• Example:
Recall the $n = 80$ compressive strengths from earlier (the aluminum-lithium alloy data in Section 6-2).

Using 10 bins, we can create the frequency distribution...
Class      Class      Frequency   Relative          Cumulative
interval   midpoint   f           frequency         relative frequency
61-80       70.5       1           1/80 = 0.0125    0.0125
81-100      90.5       2           2/80 = 0.0250    0.0375
101-120    110.5       6           6/80 = 0.0750    0.1125
121-140    130.5       8           8/80 = 0.1000    0.2125
141-160    150.5      23          23/80 = 0.2875    0.5000
161-180    170.5      19          19/80 = 0.2375    0.7375
181-200    190.5      12          12/80 = 0.1500    0.8875
201-220    210.5       4           4/80 = 0.0500    0.9375
221-240    230.5       4           4/80 = 0.0500    0.9875
241-260    250.5       1           1/80 = 0.0125    1.0000

The histogram for this frequency table...

[Figure: Histogram of data. x-axis: data (100 to 250); y-axis: Frequency (0 to 20)]

We can see this is a unimodal distribution with a bell shape.

NOTE: The bin widths can alter the shape of a histogram. For instance, if I only chose 3 bins...

[Figure: Histogram of data with 3 bins. x-axis: data (0 to 300); y-axis: Frequency (0 to 70)]

This is not as informative.

In general, you don't want too many or too few observations in each bin (relative to $n$), and you can play around with bin size for the best scenario.

We summarize data in a histogram (by lumping a lot of individual observations together in a cell), so we lose some information. But this loss is usually small compared to the information gained in the visual, and the ease of interpretation gained in the graph.

• Some possible descriptions of histograms
  – Symmetric
  – Skewed (asymmetric, long tail to one side)
      Right tail stretched out... positive skew
      Left tail stretched out... negative skew
  – Unimodal (one peak)
  – Bimodal (two peaks)
  – Bell-shaped
  – Uniformly distributed (flat)

Symmetric
If the distribution is symmetric, the mean = median.

Right-skewed
If the distribution is right-skewed, mean > median.

Left-skewed
If the distribution is left-skewed, mean < median.

[Figure: three example histograms: Left-skewed, Symmetric, Right-skewed]

The histogram of the sample data at the bottom of the slide gives us a feel for the population from which the sample was drawn. The top plot is of the conceptual population from which the sample was drawn.
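The 10-bin frequency distribution for the compressive strengths can be re-derived with a short script. The following is a minimal Python sketch, not part of the original slides; the function name `frequency_table` and its parameters are illustrative assumptions, and the classes are treated as closed integer intervals (61-80, 81-100, ...) as in the table.

```python
# Compressive strengths (psi) of the 80 alloy specimens from Section 6-2.
strengths = [
    105, 221, 183, 186, 121, 181, 180, 143, 97, 154,
    153, 174, 120, 168, 167, 141, 245, 228, 174, 199,
    181, 158, 176, 110, 163, 131, 154, 115, 160, 208,
    158, 133, 207, 180, 190, 193, 194, 133, 156, 123,
    134, 178, 76, 167, 184, 135, 229, 146, 218, 157,
    101, 171, 165, 172, 158, 169, 199, 151, 142, 163,
    145, 171, 148, 158, 160, 175, 149, 87, 160, 237,
    150, 135, 196, 201, 200, 176, 150, 170, 118, 149,
]


def frequency_table(values, first_lower, width, n_bins):
    """Build one row per class.

    Each row is (lower, upper, midpoint, count, rel_freq, cum_rel_freq),
    where the class is the closed integer interval [lower, upper].
    (Hypothetical helper written for this illustration.)
    """
    n = len(values)
    rows = []
    running = 0  # cumulative count over the classes seen so far
    for b in range(n_bins):
        lower = first_lower + b * width
        upper = lower + width - 1
        count = sum(1 for v in values if lower <= v <= upper)
        running += count
        rows.append((lower, upper, (lower + upper) / 2,
                     count, count / n, running / n))
    return rows


# Same classes as the table: 61-80, 81-100, ..., 241-260.
table = frequency_table(strengths, first_lower=61, width=20, n_bins=10)
for lower, upper, mid, f, rel, cum in table:
    print(f"{lower:3d}-{upper:3d}  {mid:6.1f}  {f:3d}  {rel:.4f}  {cum:.4f}")
```

Changing `width` and `n_bins` (for example `first_lower=1, width=100, n_bins=3`) is a quick way to see how much the choice of bins changes the counts that a histogram would display.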
Box Plots
Section 6-4

Boxplots are another graphical tool for visualizing data. They utilize the quartiles to give us a feel for the data distribution.

• Box: formed by the values $q_1$, $q_2$, $q_3$ (left, middle, right); shows the middle 50% of the data
• Whiskers: the largest possible∗ whisker length is $1.5 \times \text{IQR}$ (as distance from $q_1$ or $q_3$)
• Outliers: values out past the whiskers (past $q_1 - 1.5 \times \text{IQR}$ or past $q_3 + 1.5 \times \text{IQR}$), seen at either tail

∗ Whiskers will end on an actual data point.
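To make the box plot ingredients concrete, the Python sketch below computes the quartiles, the $1.5 \times \text{IQR}$ fences, the whisker ends, and the outliers for the battery lifetimes from the earlier mean-and-median example. It is an illustration under stated assumptions rather than a definitive method: the helper names `quartile` and `boxplot_summary` are made up here, and "interpolated value" in the quartile rule is assumed to mean linear interpolation at ordered position $k(n+1)/4$. Statistical software may use other quartile conventions and give slightly different values.

```python
import math


def quartile(sorted_x, k):
    """k-th quartile (k = 1, 2, 3) of already-sorted data.

    Assumption: linear interpolation at 1-based position k(n+1)/4,
    between the ordered values at floor(position) and ceil(position).
    """
    n = len(sorted_x)
    pos = k * (n + 1) / 4
    frac = pos - math.floor(pos)
    # Clamp positions to 1..n so very small samples do not index out of range.
    lo = max(1, min(n, math.floor(pos)))
    hi = max(1, min(n, math.ceil(pos)))
    return sorted_x[lo - 1] + frac * (sorted_x[hi - 1] - sorted_x[lo - 1])


def boxplot_summary(data):
    """Return the values needed to draw a box plot (hypothetical helper)."""
    x = sorted(data)
    q1, q2, q3 = (quartile(x, k) for k in (1, 2, 3))
    iqr = q3 - q1
    low_fence = q1 - 1.5 * iqr
    high_fence = q3 + 1.5 * iqr
    inside = [v for v in x if low_fence <= v <= high_fence]
    return {
        "q1": q1, "q2": q2, "q3": q3, "iqr": iqr,
        # Whiskers end on actual data points inside the fences.
        "whiskers": (min(inside), max(inside)),
        "outliers": [v for v in x if v < low_fence or v > high_fence],
    }


# Battery lifetimes (hours) from the mean-and-median example.
battery = [123, 116, 122, 110, 175, 126, 125, 111, 118, 117]
summary = boxplot_summary(battery)
print(summary)
```

Under these assumptions the single lifetime of 175 hours falls beyond the upper fence and is flagged as an outlier, while the upper whisker stops at 126, the largest observation inside the fence.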