Chapter 6 - Random Sampling and Data Description
More joy of dealing with large quantities of data
You can never have too much data.

Chapter 6B
Today in Prob & Stat

6-2 Stem-and-Leaf Diagrams
Steps for Constructing a Stem-and-Leaf Diagram
Example 6-4
Figure 6-4 Stem-and-leaf diagram for the compressive strength data in Table 6-2.
Figure 6-5 Stem-and-leaf displays for Example 6-5 (25 observations on batch yields). Stem: tens digits. Leaf: ones digits. Panels: too few stems, just right, too many stems.
Figure 6-6 Stem-and-leaf diagram from Minitab. Callout: number of observations in the middle stem.

6-4 Box Plots
• The box plot is a graphical display that simultaneously describes several important features of a data set, such as center, spread, departure from symmetry, and identification of observations that lie unusually far from the bulk of the data.
• Whisker
• Outlier
• Extreme outlier
Figure 6-13 Description of a box plot.
Figure 6-14 Box plot for compressive strength data in Table 6-2.
Figure 6-15 Comparative box plots of a quality index at three plants.

6-5 Time Sequence Plots
• A time series or time sequence is a data set in which the observations are recorded in the order in which they occur.
• A time series plot is a graph in which the vertical axis denotes the observed value of the variable (say x) and the horizontal axis denotes the time (which could be minutes, days, years, etc.).
• When measurements are plotted as a time series, we often see
  • trends,
  • cycles, or
  • other broad features of the data.
Figure 6-16 Company sales by year (a) and by quarter (b).
Figure 6-17 A digidot plot of the compressive strength data in Table 6-2. (Gosh! A stem-and-leaf diagram combined with a time series plot.)
Figure 6-18 A digidot plot of chemical process concentration readings, observed hourly.
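The stem-and-leaf construction from Section 6-2 (stem: tens digits, leaf: ones digits) can be sketched in a few lines of Python. This is only a minimal sketch, not Minitab's algorithm. The function name `stem_and_leaf` and the `yields` values are hypothetical, made up for illustration, and the sketch assumes non-negative whole-number observations.

```python
from collections import defaultdict


def stem_and_leaf(values):
    """Build a stem-and-leaf display: stem = tens digits, leaf = ones digit.

    Assumes non-negative whole-number observations (an assumption of this sketch).
    """
    stems = defaultdict(list)
    for v in sorted(values):
        stem, leaf = divmod(int(v), 10)  # e.g. 75 -> stem 7, leaf 5
        stems[stem].append(leaf)

    lines = []
    # Include empty stems between the smallest and largest so gaps stay visible.
    for stem in range(min(stems), max(stems) + 1):
        leaves = "".join(str(d) for d in stems.get(stem, []))
        lines.append(f"{stem:>3} | {leaves}")
    return lines


# Hypothetical data, for illustration only (not the batch yields of Example 6-5).
yields = [62, 71, 75, 68, 83, 79, 74, 91, 66, 77, 85, 72]
for line in stem_and_leaf(yields):
    print(line)
```

Sorting before grouping keeps the leaves in increasing order within each stem, which is what makes the display readable as a sideways histogram.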
6-6 Probability Plots
• Probability plotting is a graphical method for determining whether sample data conform to a hypothesized distribution based on a subjective visual examination of the data.
• Probability plotting typically uses special graph paper, known as probability paper, that has been designed for the hypothesized distribution. Probability paper is widely available for the normal, lognormal, Weibull, and various chi-square and gamma distributions.

Probability (Q-Q)* Plots
• Forget 'normal probability paper'.
• Plot the z score versus the ranked observations, x(j).
• Subjective, visual technique usually applied to test normality. Can also be adapted to other distributions.
• Method (for normal distribution):
  • Rank the observations x(1), x(2), …, x(n) from smallest to largest. (Parentheses usually indicate ordering of data.)
  • Compute the (j - 1/2)/n value for each x(j).
  • Plot z_j = F^{-1}((j - 1/2)/n) versus x(j), where F is the standard normal cumulative distribution function.

Computing z_j, where z_j = F^{-1}((j - 1/2)/n)
29 values; the x(j) values are ordered least to greatest.

   j    x(j)     z_j    (j - 1/2)/n
   1    4.07   -2.11    0.017
   2    4.88   -1.63    0.052
   3    5.10   -1.36    0.086
   4    5.26   -1.17    0.121
   5    5.27   -1.01    0.155
   …
  25    5.65    1.01    0.845
  26    5.75    1.17    0.879
  27    5.79    1.36    0.914
  28    5.85    1.63    0.948
  29    5.86    2.11    0.983

Example in EXCEL - Table 6-6, p. 214
[Chart: Cavendish Earth Density Data. Horizontal axis: x(j) values (4.0 to 6.0). Vertical axis: z_j values (-2.50 to 2.50).]
z_j is computed with the function NORMSINV.

Example in EXCEL - Table 6-6, cont'd
[Chart: Cavendish Earth Density Data (censored). Horizontal axis: x(j) values (4.0 to 6.0). Vertical axis: z_j values (-2.50 to 2.00).]
z_j is computed with the function NORMSINV.

Example 6-7
Figure 6-19 Normal probability plot for battery life.
Figure 6-20 Normal probability plot obtained from standardized normal scores.
Figure 6-21 Normal probability plots indicating a nonnormal distribution. (a) Light-tailed distribution. (b) Heavy-tailed distribution. (c) A distribution with positive (or right) skew.
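The z_j computation used for the Cavendish table can also be done outside Excel. The following is a minimal sketch in Python, assuming the standard library's `statistics.NormalDist().inv_cdf` as a stand-in for NORMSINV; the function names `normal_scores` and `probability_plot_points` are hypothetical helpers introduced here for illustration.

```python
from statistics import NormalDist


def normal_scores(n):
    """Return z_j = F^{-1}((j - 1/2)/n) for j = 1..n (F = standard normal CDF)."""
    std_normal = NormalDist()  # mean 0, standard deviation 1
    return [std_normal.inv_cdf((j - 0.5) / n) for j in range(1, n + 1)]


def probability_plot_points(data):
    """Pair each ranked observation x(j) with its normal score z_j."""
    ordered = sorted(data)  # x(1) <= x(2) <= ... <= x(n)
    return list(zip(ordered, normal_scores(len(ordered))))


# Normal scores for a sample of 29 observations, as in the Cavendish table.
z = normal_scores(29)
print([round(v, 2) for v in z[:5]])

# The five smallest Cavendish values from the table, paired with their scores.
# (Only a fragment of the data, so these scores use n = 5, not n = 29.)
print(probability_plot_points([4.07, 4.88, 5.10, 5.26, 5.27]))
```

For n = 29 the first five scores should agree with the z_j column of the table to two decimals. Plotting the resulting (x(j), z_j) pairs and checking whether they fall near a straight line is the subjective visual test described above.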
The Beginning of a Comprehensive Example
Descriptive Statistics in Action
see real numbers, real data
watch as they are manipulated in perverse ways
be thrilled as they are sorted and
be amazed as they are compressed into a single number

The Raw Data
As part of a life span study of a particular type of lithium polymer rechargeable battery, 120 batteries were operated and their life spans in operating hours determined.

1676.5   895.6  1682.0  1913.6  2881.9  2007.8  3313.4  2156.4  1954.7  2210.4
1630.3  1818.8  1779.5   984.2  1512.6  2046.1  1613.3  2066.1  2926.9  1995.7
2386.6  1663.8  2045.9  1985.2  1387.6   718.3  1088.7  1879.4  2056.6  1740.2
2791.8  2476.0   845.1  1581.8  2713.7  2238.5  1314.2   729.3  1898.7  1377.2
1347.6  2420.6  2450.0  2319.7  2560.1   884.1   596.2  1779.7   908.3   955.4
2383.4  1577.6  2365.4  1527.9  2749.2  2439.7  2016.2  1757.8  1022.7  2063.8
1840.2   943.5  2210.5  2856.3   745.0  2125.3  1759.9  1297.0  2210.1   543.4
 891.5  1818.8  1803.7  1460.3  1753.3  2633.1  4300.8  1250.8  1005.2   667.1
 916.0  1351.9  1823.0  1944.9  1641.3  1694.0  1378.0   849.4  1882.6  2323.8
 807.0  2088.8  2940.7  2004.6  1714.3  2039.1  1760.5   577.8  1945.6  1299.9
1592.0  1395.4  2401.8  2968.7  1952.3  2430.5   999.1  1608.4   983.8  1831.1
1307.4  2139.0  1552.6  1808.1  2398.0  2398.8  2824.3   715.2  2277.3  1941.2

Data generated from a Weibull distribution with shape parameter = 2.8 and scale parameter = 2000.

Descriptive Statistics
(TrMean is the Minitab trimmed mean.)

Variable        N     Mean    Median   TrMean   StDev   SE Mean
Battery Life    120   1789.4  1813.4   1773.9   661.5   60.4

Variable        Minimum   Maximum   Q1       Q3
Battery Life    543.4     4300.8    1348.7   2210.3

More Minitab
[Figure: Histogram of Battery Life. Horizontal axis: Battery Life (0 to 4500). Vertical axis: Frequency.]

More Minitab
[Figure: Histogram of Battery Life, with Normal Curve. Horizontal axis: Battery Life (0 to 4500). Vertical axis: Frequency.]

Stem and Leaf Plot
Leaf Unit = 100

   21   0  555677778888889999999
   36   1  000222333333334
  (40)  1  5555556666666677777777888888888899999999
   44   2  0000000000111222223333333444444
   13   2  56777888999
    2   3  3
    1   3
    1   4  3

[Figure: Dotplot for Battery Life. Horizontal axis: Battery Life.]
More Minitab
[Figure: Boxplot of Battery Life. Horizontal axis: Battery Life (0 to 4000).]

More Minitab
Descriptive Statistics
Variable: Battery Life

Anderson-Darling Normality Test
A-Squared:  0.566
P-Value:    0.139

Mean          1789.45
StDev          661.53
Variance       437627
Skewness     0.359617
Kurtosis     0.736628
N                 120

Minimum        543.40
1st Quartile  1348.68
Median        1813.45
3rd Quartile  2210.32
Maximum       4300.80

95% Confidence Interval for Mu:      1669.87 to 1909.03
95% Confidence Interval for Sigma:    587.10 to  757.74
95% Confidence Interval for Median:  1691.57 to 1945.04

[Figure: histogram of Battery Life (700 to 4300) with interval plots for the 95% Confidence Interval for Mu and the 95% Confidence Interval for Median.]

Time Series Plot
Based upon the order in which the data were generated
[Figure: time series plot. Horizontal axis: Index (20 to 120). Vertical axis: 0 to 4000.]

Time Series Plot
Sorted by failure time
[Figure: time series plot of the sorted data. Horizontal axis: Index (20 to 120). Vertical axis: sorted (0 to 4000).]

Normal Probability Plot for Battery Life
ML Estimates: Mean: 1789.45, StDev: 658.771
[Figure: Percent (1 to 99) versus Data (0 to 4000).]

Weibull Probability Plot for Battery Life
ML Estimates: Shape: 2.92250, Scale: 2005.35
[Figure: Percent (1 to 99) versus Data (100 to 1000 and beyond, log scale).]

Exponential Probability Plot for Battery Life
ML Estimates: Mean: 1789.45
[Figure: Percent (10 to 99) versus Data (0 to 10000).]

Computer Support
This is easy if you use the computer.
Hang on, we are going to Excel…

A Recap …
• Population: the totality of observations with which we are concerned. Issue: conceptual vs. actual.
• Sample: subset of observations selected from a population.
• Statistic: any function of the observations in a sample.
• Sample range: if the n observations in a sample are denoted by x1, x2, …, xn, then the sample range is r = max(xi) - min(xi).
• Sample mean and variance:

  $\bar{x} = \frac{\sum_{i=1}^{n} x_i}{n}$

  $s^2 = \frac{\sum_{i=1}^{n} (x_i - \bar{x})^2}{n - 1} = \frac{\sum_{i=1}^{n} x_i^2 - n\bar{x}^2}{n - 1}$

Note that these are functions of the observations in a sample and are, therefore, statistics.

More Recapping …
Note terminology: ‘population parameter’ vs.
‘sample statistic’

Population mean and variance:

  $\mu = \frac{\sum_{i=1}^{N} x_i}{N}$

  $\sigma^2 = \frac{\sum_{i=1}^{N} (x_i - \mu)^2}{N} = \frac{\sum_{i=1}^{N} x_i^2 - N\mu^2}{N}$

Note difference in denominators:

  $s^2 = \frac{\sum_{i=1}^{n} (x_i - \bar{x})^2}{n - 1} = \frac{\sum_{i=1}^{n} x_i^2 - n\bar{x}^2}{n - 1}$

Sample variance uses an estimate of the mean ($\bar{x}$) in its calculation. If divided by n, the sample variance would be a biased estimate of the population variance: biased low.

Sampling Process
X is a random variable that represents one selection from a population.
Each observation in the sample is obtained under identical conditions.
The population does not change during sampling.
The probability distribution of values does not change during sampling.
f(x1, x2, …, xn) = f(x1) f(x2) … f(xn) if the observations in the sample are independent.

Notation
X1, X2, …, Xn are the random variables.
x1, x2, …, xn are the values of the random variables.

A Final Recap…
A probability distribution is often a model for a population.
This is often the case when the population is conceptual or infinite.
The histogram should resemble the distribution of population values.
The bigger the sample, the stronger the resemblance.

Our Work Here Today is Done
Next Week: The Glorious Midterm
[Photo: Prob/Stat students discussing stem and leaf plots]
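As a closing check on the recap, the sample range, mean, and variance formulas can be verified numerically. The sketch below is illustrative only: it uses the first ten battery life values from the raw data as a small sample, and the variable names are hypothetical. It computes the sample variance both from the definition and from the shortcut form, then shows that dividing by n instead of n - 1 gives a smaller value for the same sample.

```python
import math
from statistics import variance

# First ten battery life values from the raw data (operating hours).
life = [1676.5, 895.6, 1682.0, 1913.6, 2881.9,
        2007.8, 3313.4, 2156.4, 1954.7, 2210.4]

n = len(life)
sample_range = max(life) - min(life)   # r = max(xi) - min(xi)
xbar = sum(life) / n                   # sample mean

# Sample variance from the definition: sum of squared deviations over n - 1.
s2_definition = sum((x - xbar) ** 2 for x in life) / (n - 1)

# Shortcut form: (sum of xi^2 - n * xbar^2) over n - 1.
s2_shortcut = (sum(x * x for x in life) - n * xbar ** 2) / (n - 1)

# Same sum of squares divided by n: always smaller, hence "biased low".
divide_by_n = sum((x - xbar) ** 2 for x in life) / n

print(f"range = {sample_range:.1f}, mean = {xbar:.2f}")
print(f"s^2 (definition) = {s2_definition:.2f}")
print(f"s^2 (shortcut)   = {s2_shortcut:.2f}")
print(f"divided by n     = {divide_by_n:.2f}")
print(math.isclose(s2_definition, variance(life), rel_tol=1e-9))
```

The standard library's `statistics.variance` uses the n - 1 denominator, so it should agree with both hand computations.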