Lecture 1: Descriptive Statistics
MSU-STT-351-Sum 16
(P. Vellaisamy: MSU-STT-351-Sum 16) Probability & Statistics for Engineers

Introduction

Why Statistics?
(i) It is the science that helps us understand many phenomena that occur in engineering, science, economics, finance, etc.
(ii) It is the scientific way to make intelligent judgments/decisions from observed data that contain uncertainty and variation.

We start with two examples.

Example 1. The emission levels of HC (hydrocarbon) and CO (carbon monoxide) of a vehicle:

HC (gm/mile)   CO (gm/mile)
12.8           118
18.3           149
32.2           232
32.5           236

Question: What is the emission level of HC/CO? It is difficult to make a precise statement, as there is high variation in the observed levels.

Example 2. Marks of two students in 4 tests:

S1   S2
25   85
38   62
42   78
39   59

Question: Who is doing better? Any difficulty in answering? Clearly, S2 is doing better. There is no need for statistical analysis in such situations. It is well known that statistics has often been misused in practical situations.

What is statistics?
(i) One-word definition:
    (a) Economics: Money
    (b) Philosophy: Why
    (c) Statistics: Variation
(ii) Layman's definition: Information/summary of data.
(iii) Formal definition (applied perspective): Statistics deals with techniques for how to
    (a) obtain information/data (a sample),
    (b) analyze the data scientifically,
    (c) draw valid conclusions/inferences.
(iv) Theoretical perspective: As a branch of mathematics, it deals with analytical techniques and procedures to analyze the data and to make inferences about the population characteristics.
Population and Samples

Population: The set of all well-defined objects/elements (of interest) that are under investigation.
Example 1. The students studying engineering at MSU.
Example 2. The population of East Lansing.

If we collect information on all the elements in the population, we call it a "census". Most often this is impossible, as it involves a lot of time, effort and money.

Sample: A subset of the population that is selected for obtaining information is called a sample. For example, we may select 10 students from each engineering discipline at MSU.

Often, we are interested in certain characteristics of the population (number of flaws in a piece of cloth, thickness of a capsule wall, monthly income of an individual, etc.). A characteristic may be
(i) Categorical (belongs to one of several categories)
    (a) Gender of a student (male/female)
    (b) Quality of a product (excellent/good/bad)
(ii) Numerical (measured as a real value)
    (a) Heights of students
    (b) Values of a stock

Types of Variables

A variable is any characteristic that changes over the objects in the population. It is denoted by $x, y, z$ (or by $X, Y, Z$). A variable $X$ may be categorical (called a categorical variable) or numerical/quantitative (called a numerical variable).

Types of Data
(i) The data $X_1, X_2, \ldots, X_n$ (or $x_1, x_2, \ldots, x_n$) on a categorical variable $X$ are called categorical data.
(ii) The data $X_1, X_2, \ldots, X_n$ (or $x_1, x_2, \ldots, x_n$) on a numerical variable $X$ are called quantitative data.

Suppose we measure height $= x$ and weight $= y$ on $n$ individuals, giving $(x_1, y_1), \ldots, (x_n, y_n)$. Then we have bivariate data. Multivariate data are defined similarly.
Branches of Statistics

The main branches of statistics are the following:
(i) Descriptive Statistics: Deals with summarizing and describing important features of data (such as the mean, median and standard deviation), using tabular or graphical methods.
(ii) Inferential Statistics: Deals with techniques for drawing inferences (generalizing to the population) and making predictions about the population, based on the information obtained from the sample.

Descriptive Statistics

1.2 Graphical (visual) Display of Univariate Data

Pictures often reveal useful information about data.

1.2.1 Graphs for Quantitative Data

(i) Stem-and-Leaf Display (Stem Plot)
This is a useful plot for displaying quantitative data.

Example 4. Consider the data on the pulse rates (per minute) of 10 patients:
45, 61, 60, 62, 65, 73, 75, 75, 78, 82

Stem: tens digit; Leaf: ones digit

8 | 2
7 | 3558
6 | 0125
5 |          (gap)
4 | 5        ('outlier')

A stem plot gives
• Actual values
• Extent of spread
• Number and location of peaks
• Presence of any outlier

(ii) The Dot Plot
This is used when the data set is small or has few distinct values. Here, each observation is represented by a dot on a horizontal scale.

[Figure: dot plot on a horizontal scale marked 40, 50, 60, 70, 80]

This is similar to a stem plot, except that dots are used instead of digits.

Definition 1
A (quantitative) variable $X$ is discrete if it takes finitely many or countably many values. It is continuous if it can take any value in an interval or on the whole real line.

Example 5.
Let $X$ = number of trials needed to get the first success. Then $X \in \{1, 2, \ldots\}$ and hence $X$ is discrete. Suppose $X$ = height of a student (in cm). Then $X \in [150, 190]$ and $X$ is a continuous variable.

Let $X$ be a discrete variable taking values in $S = \{1, 2, \ldots, k\}$, and let $X_1, \ldots, X_n$ be $n$ data values on $X$. Then
    frequency of $i \in S$ = number of values in the data $\{X_1, X_2, \ldots, X_n\}$ equal to $i$.
For $1 \le i \le k$,
    relative frequency of $i$ = (frequency of $i$)$/n$.

Example 6. Let $X$ = number of children in a family, so that $X \in \{0, 1, 2, 3\}$. Suppose the data on 20 families in East Lansing are:
2, 0, 1, 2, 2, 3, 1, 2, 3, 2, 3, 1, 2, 1, 2, 1, 2, 3, 1, 2.
Then the frequency table is

X       Frequency   Relative Frequency
0       1           1/20 = 0.05
1       6           6/20 = 0.30
2       9           9/20 = 0.45
3       4           4/20 = 0.20
Total   20          1.00

(iii) Histogram for Discrete Data
Take the x-values on the horizontal scale and the frequency/relative frequency on the vertical scale. On each value, draw a rectangle whose height is equal to its frequency/relative frequency. The frequency histogram for Example 6 is:

[Figure: frequency histogram of the data in Example 6; horizontal axis: values 0 to 3; vertical axis: frequency, 0 to 9]

A relative frequency histogram may be drawn similarly.

Histogram for Continuous Data (measurements)

Case 1. (Equal Width Case)
(i) The data assume real values, not necessarily integers.
(ii) Subdivide the range of the data into $k$ subintervals or classes of equal length such that each observation lies in exactly one class.
(iii) Construct rectangles whose height is equal to the frequency (for a frequency histogram) or the relative frequency (for a relative frequency histogram).

Note:
(i) There are no hard-and-fast rules concerning $k$; usually, an integer between 5 and 20 will do.
(ii) For large data sets of size $n$, more classes should be used. A rule of thumb is $k \approx \sqrt{n}$.

Note: If all the data fall in one or two classes, or if most subintervals (of equal length) have low frequencies, it is better to use fewer classes of different lengths.

Histogram: For classes of different lengths:
(i) Decide the class intervals.
(ii) Construct the rectangles using the formula
     rectangle height = relative frequency / class width
     (so that area of rectangle = relative frequency).
(iii) The resulting rectangle heights are called "densities".
(iv) The formula in (ii) works for the equal-width case also.

Example 7. The following data represent the frequency distribution of fracture strength (MPa) observations for ceramic bars fired in a particular kiln (read 81 − 83 as 81 to < 83, meaning that the data value 83 is not included):

Class:  81−83  83−85  85−87  87−89  89−91  91−93  93−95  95−97  97−99
Freq:   6      7      17     30     43     28     22     13     3

(a) Construct a histogram based on relative frequencies, and comment on any interesting features.
(b) What proportion of the strength observations are at least 85? Less than 95?
(c) What proportion of the observations are less than 90?

Solution:
(a) The histogram appears below. A representative value for these data would be $x = 90$. The histogram is reasonably symmetric, unimodal, and somewhat bell-shaped. The variation in the data is not small, since the spread of the data, 99 − 81 = 18, is about 20% of the typical value of 90.

[Figure: relative frequency histogram; horizontal axis: fracture strength (MPa), class boundaries 81 to 99; vertical axis: relative frequency]
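The rectangle heights behind the histogram in part (a) can be reproduced with a few lines of code. The following is a minimal Python sketch, assuming only the class boundaries and frequencies listed in Example 7; the function name `histogram_heights` and the variable names are illustrative choices, not part of the lecture.

```python
# Class boundaries and frequencies from Example 7; each class is [lower, upper).
boundaries = [81, 83, 85, 87, 89, 91, 93, 95, 97, 99]
freqs = [6, 7, 17, 30, 43, 28, 22, 13, 3]


def histogram_heights(boundaries, freqs):
    """Return (lower, upper, relative frequency, density) for each class.

    density = relative frequency / class width, so that the area of each
    rectangle equals the relative frequency of its class.
    """
    n = sum(freqs)
    rows = []
    for lower, upper, f in zip(boundaries, boundaries[1:], freqs):
        rel = f / n
        rows.append((lower, upper, rel, rel / (upper - lower)))
    return rows


rows = histogram_heights(boundaries, freqs)
for lower, upper, rel, dens in rows:
    print(f"{lower} to <{upper}: relative frequency {rel:.4f}, density {dens:.4f}")
```

Because all classes here have width 2, the densities are simply half the relative frequencies; with unequal widths the two would differ from class to class, which is why the density scale is the one to use in that case.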
(b) The proportion of the observations that are at least 85 is 1 − (6 + 7)/169 = 0.9231. Similarly, the proportion less than 95 is 1 − (13 + 3)/169 = 0.9053, since only the classes 95 − 97 and 97 − 99 contain values of 95 or more.
(c) Note that $x = 90$ is the midpoint of the class 89 to < 91, which contains 43 observations (a relative frequency of 43/169 = 0.2544). Therefore, about half of this frequency, 0.1272, should be added to the relative frequencies of the classes to the left of $x = 90$. That is, the approximate proportion of the observations that are less than 90 is
0.0355 + 0.0414 + 0.1006 + 0.1775 + 0.1272 = 0.4822.

Histogram Shapes

The histogram shape is called
(a) Unimodal if it has a single peak. Note: The histogram seen earlier is unimodal.

[Figure: frequency histogram; horizontal axis: flow rate; vertical axis: frequency]

(b) Bimodal if it has 2 different peaks.
(c) Multimodal if it has more than 2 peaks.
(d) Symmetric if it is unimodal and the right half is the mirror image of the left half.

[Figure: frequency histogram; horizontal axis: IDT value; vertical axis: frequency]

(e) Positively skewed if the right tail is stretched out compared with the left tail.
(f) Negatively skewed if the left tail is stretched out compared with the right tail.

Histogram for Qualitative/Categorical Data

(i) A histogram for categorical data is called a bar chart. There will be a natural ordering of the classes.
(Titanic Data)
(ii) A Pareto diagram is a bar chart that results from a quality control study, where the different categories correspond to different kinds of defects or non-conformities.

Example 8. Histogram for Titanic Data: The following table classifies 2201 people by the class in which they traveled:

Class:  First (F)   Second (S)   Third (T)   Crew (C)
Count:  325         285          706         885

[Figure: bar chart titled "Histogram for Titanic Data"; horizontal axis: classes F, S, T, C; vertical axis: count, 0 to 1000]

Some Additional Examples

Example 1. Construct the stem-and-leaf display for the data on the flexural strength of a certain concrete (in MPa):
5.9, 7.2, 7.3, 6.3, 8.1, 6.8, 7.0, 7.6, 6.8, 6.5, 7.0, 6.3, 7.9, 9.0, 8.2, 8.7, 7.8, 9.7, 7.4, 7.7, 9.7, 7.8, 7.7, 11.6, 11.3, 11.8, 10.7
(a) Are the data spread about a representative value?
(b) Is the display symmetric?
(c) Are there any outliers?
(d) What proportion of the observations exceed 10 MPa?

Solution:
(a) Minitab generated the following stem-and-leaf display of these data:

Stem-and-leaf of C1   N = 27
Leaf Unit = 0.10

   1     5   9
   6     6   33588
 (11)    7   00234677889
  10     8   127
   7     9   077
   4    10   7
   3    11   368

The leftmost column shows the cumulative number of observations from each stem to the nearest tail of the data. For example, the 6 in the second row indicates that there are a total of 6 data points in stems 6 and 5. Minitab puts parentheses around the 11 in row three to indicate that the median of the data is contained in this stem. A value close to 8 is representative of these data.
(b) The data display is not perfectly symmetric around some middle/representative value.
There tends to be some positive skewness in these data.
(c) Outliers are data points that appear to be very different from the pack. Looking at the stem-and-leaf display in part (a), there appear to be no outliers in these data. (A more precise definition of an outlier will be given later.)
(d) From the stem-and-leaf display in part (a), there are 3 leaves associated with the stem of 11, which represent the 3 data values that are greater than or equal to 11. The value 10.7, which is represented by the stem of 10 and the leaf of 7, also exceeds 10. Therefore, the proportion of data values that exceed 10 is 4/27 = 0.148, or about 15%.

Example 2. The following data represent the IDTs (inter-division times) of a number of cells in both exposed (treatment) and unexposed (control) conditions:
28.1, 31.2, 13.7, 46.0, 25.8, 16.8, 34.8, 62.3, 28.0, 17.9, 19.5, 21.1, 31.9, 28.9, 60.1, 23.7, 18.6, 21.4, 26.6, 26.2, 32.0, 43.5, 17.4, 38.8, 30.6, 55.6, 25.5, 52.1, 21.0, 22.3, 15.5, 36.3, 19.1, 38.4, 72.8, 48.9, 21.4, 20.7, 57.3, 40.9
Construct a histogram of these data based on classes with boundaries 10, 20, 30, .... Then calculate $\log_{10}(x)$ for each $x$ and construct a histogram of the transformed data using the class boundaries 1.1, 1.2, 1.3, etc. What is the effect of the transformation?

Solution. A histogram of the raw data and a histogram of the log-values (base 10) are shown below.

[Figure: histogram of the raw IDT data]
[Figure: histogram of the $\log_{10}$-transformed IDT data]

The histogram of the log-values is much less skewed than the histogram of the original data.

Numerical Summary of Measures

We now discuss some of the important characteristics of the data and of the population.
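Before moving on, the class counts behind the two histograms of Example 2 above can be checked numerically. The following is a minimal Python sketch that uses only the standard library; the helper name `class_counts` is an illustrative choice, and each class is assumed to include its lower boundary and exclude its upper one, matching the convention of Example 7.

```python
import math

# IDT data from Example 2 (40 observations).
idt = [28.1, 31.2, 13.7, 46.0, 25.8, 16.8, 34.8, 62.3, 28.0, 17.9,
       19.5, 21.1, 31.9, 28.9, 60.1, 23.7, 18.6, 21.4, 26.6, 26.2,
       32.0, 43.5, 17.4, 38.8, 30.6, 55.6, 25.5, 52.1, 21.0, 22.3,
       15.5, 36.3, 19.1, 38.4, 72.8, 48.9, 21.4, 20.7, 57.3, 40.9]


def class_counts(values, boundaries):
    """Count the values in each class [boundaries[j], boundaries[j + 1])."""
    counts = [0] * (len(boundaries) - 1)
    for v in values:
        for j in range(len(counts)):
            if boundaries[j] <= v < boundaries[j + 1]:
                counts[j] += 1
                break
    return counts


raw_bounds = [10 * k for k in range(1, 9)]     # 10, 20, ..., 80
log_bounds = [k / 10 for k in range(11, 20)]   # 1.1, 1.2, ..., 1.9

raw_counts = class_counts(idt, raw_bounds)
log_counts = class_counts([math.log10(x) for x in idt], log_bounds)

print("raw classes:  ", raw_counts)
print("log10 classes:", log_counts)
```

The raw counts peak in the second class and then trail off over five more classes to the right, while the log-scale counts rise and fall more evenly around the middle classes, which is the reduction in skewness described in the solution.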
Measures of Location

First, we discuss them for the data and then for the population distribution.

The Mean

1. The Sample Mean: $\bar{x}$
The sample mean of $n$ observations $x_1, \ldots, x_n$ is
$\bar{x} = \frac{1}{n}\sum_{i=1}^{n} x_i = (x_1 + \cdots + x_n)/n$,
where $n$ denotes the number of observations.

Example 1a. Suppose the scores of 8 students in a test are: 35, 20, 45, 50, 42, 38, 39, 11. Then the sample mean is $\bar{x} = 280/8 = 35$.

Example 1b. Suppose the last score is recorded, by mistake, as 71. Then $\bar{x} = (269 + 71)/8 = 340/8 = 42.5$, an increase of about 21% in the sample mean. Note that this is a significant change.

Rule: Report the mean to one more decimal place than is present in the data. In the above example, the data are integers (no decimal places), and so we write $\bar{x} = 42.5$ (one decimal place).

2. The Median: $\tilde{x}$
This measure is less affected by outliers or extreme values. It divides the sample distribution into two equal parts.

Definition 2 (Sample median)
First order the observations from the smallest to the largest as $X_{(1)} \le X_{(2)} \le \cdots \le X_{(n)}$. Then the median is defined as
$\tilde{x} = \begin{cases} X_{((n+1)/2)}, & \text{if } n \text{ is odd}, \\ \left(X_{(n/2)} + X_{(n/2+1)}\right)/2, & \text{if } n \text{ is even}, \end{cases}$
that is, the middle value if $n$ is odd, and the average of the middle 2 values if $n$ is even.

Example 2: Find the median of the values in Example 1a. The ordered values are 11, 20, 35, 38, 39, 42, 45, 50. Here $n = 8$ is even and $n/2 = 4$, so take the middle values, the 4th and 5th. The median is $\tilde{x} = (38 + 39)/2 = 38.5$.

Example 3: Find the median of Example 1b (the one-outlier case). Here the ordered values are 20, 35, 38, 39, 42, 45, 50, 71. Again, $\tilde{x} = (39 + 42)/2 = 81/2 = 40.5$.

Remark 1.
(i) The median is less affected than the mean.
(ii) Also, this is an extreme case, as we replaced the smallest observation by one that is greater than the largest.
(iii) Decreasing the three smallest values or increasing the three largest values in Example 3 does not affect the median.

3. The Trimmed Mean
(i) First order the observations from the smallest to the largest.
(ii) Let $r \in (0, 0.5)$. Then the $100r\%$ trimmed data are obtained by discarding the largest $100r\%$ and the smallest $100r\%$ of the data.

Definition
The $100r\%$ trimmed sample mean is the sample mean of the $100r\%$ trimmed data.

Example 4. Obtain the 12% trimmed mean of the data in Example 1a: 11, 20, 35, 38, 39, 42, 45, 50.
Here $100r = 12$, so $r = 12/100 = 0.12$. Also $n = 8$, and 12% of 8 = (12/100) × 8 = 24/25 ≈ 1. Discarding the smallest and the largest observations, we get the 12.5% trimmed mean (since 1/8 = 12.5%) as
(20 + 35 + 38 + 39 + 42 + 45)/6 = 219/6 = 36.5.
The trimmed mean is less sensitive to outliers than the mean, but more sensitive than the median.

Measure of Variability

Let $x_1, \ldots, x_n$ be a sample of size $n$ on a variable $x$.

Definition 3
(i) The Range: Arrange the data $x_1, \ldots, x_n$ as $x_{(1)} \le x_{(2)} \le \cdots \le x_{(n)}$. Then the range is $R = x_{(n)} - x_{(1)}$. This is the simplest measure of variability. Drawback: It depends only on $x_{(1)}$ and $x_{(n)}$.
(ii) The Sample Variance: The sample variance of $x_1, \ldots, x_n$ is defined by
$s_x^2 = \frac{1}{n-1}\sum_{i=1}^{n} (x_i - \bar{x})^2 = S_{xx}/(n-1)$,
and the sample standard deviation is $s = +\sqrt{s^2}$, the positive square root.

Facts:
(i) The unit of $s$ is the same as that of the $x_i$'s.
(ii) $\sum_{i=1}^{n} (x_i - \bar{x}) = 0$ for any $x_1, \ldots, x_n$. That is, if the deviations $(x_1 - \bar{x}), \ldots, (x_{n-1} - \bar{x})$ are known, then $(x_n - \bar{x})$ can be found.
Thus, the $n$ deviations actually contain only $(n - 1)$ independent pieces of information (called degrees of freedom), and this suffices to find $s^2$ or $s$. Thus, $s^2$ and $s$ are based on $(n - 1)$ degrees of freedom.

A Useful Formula:
$S_{xx} = \sum_{i=1}^{n} (x_i - \bar{x})^2 = \sum_i x_i^2 - \left(\sum_i x_i\right)^2/n = \sum_i x_i^2 - n\bar{x}^2.$
Hence,
$s_x^2 = \frac{1}{n-1}\left[\sum_i x_i^2 - \frac{1}{n}\left(\sum_i x_i\right)^2\right] = \frac{1}{n-1} S_{xx}.$

A Proposition: Let $s_x^2$ be the variance of the data $x_1, \ldots, x_n$ and let $c \neq 0$.
(i) If $y_1 = x_1 + c, \ldots, y_n = x_n + c$, then $s_y^2 = s_x^2$.
(ii) If $y_1 = c x_1, \ldots, y_n = c x_n$, then $s_y^2 = c^2 s_x^2$ and $s_y = |c| s_x$.

Example 5. The following data represent the value of Young's modulus for certain cast plates:
116.4, 115.9, 114.6, 115.2, 115.8.
(a) Find $\bar{x}$ and the deviations $(x_i - \bar{x})$.
(b) Using the $(x_i - \bar{x})$'s, compute $s^2$.
(c) Calculate $s^2$ using the computational formula for $S_{xx}$.

Solution:
(a) $\bar{x} = \frac{1}{n}\sum_i x_i = 577.9/5 = 115.58$. Deviations from the mean: 116.4 − 115.58 = 0.82, 115.9 − 115.58 = 0.32, 114.6 − 115.58 = −0.98, 115.2 − 115.58 = −0.38, and 115.8 − 115.58 = 0.22.
(b) $s^2 = [(0.82)^2 + (0.32)^2 + (-0.98)^2 + (-0.38)^2 + (0.22)^2]/(5 - 1) = 1.928/4 = 0.482$. Hence, $s = \sqrt{0.482} \approx 0.694$.
(c) $\sum_i x_i^2 = 66{,}795.61$, so
$s^2 = \frac{1}{n-1}\left[\sum_i x_i^2 - \frac{1}{n}\left(\sum_i x_i\right)^2\right] = [66795.61 - (577.9)^2/5]/4 = 1.928/4 = 0.482.$

Box Plot

The quartiles and percentiles yield more information about the location of a data set. Similarly, the median and the IQR (interquartile range) are used to construct a box plot, a visual summary of the data.

Quartiles and IQR
Let $x_1, \ldots, x_n$ denote a data set of size $n$. First order the observations from the smallest to the largest.
(i) Compute the median $\tilde{x}$.
(ii) If $n$ is even, the first $n/2$ observations form the lower half and the remaining $n/2$ observations form the upper half (the median separates the data into two parts).
(iii) If $n$ is odd, the median $\tilde{x}$ is the $\frac{n+1}{2}$th value of the ordered data; include it in both halves.

The Quartiles:
(i) The lower quartile $Q_1$ = median of the lower half of the data.
(ii) The upper quartile $Q_3$ = median of the upper half of the data.
(iii) The interquartile range IQR $= Q_3 - Q_1$.
Note: The IQR is also called the fourth spread, $f_s = Q_3 - Q_1$ = upper fourth − lower fourth, and is resistant to outliers.

Example 1. Consider the following data: 5.2, 3.9, 4.8, 5.1, 3.7, 4.5, 4.2. Here, $n = 7$.
Ordered data: 3.7, 3.9, 4.2, 4.5, 4.8, 5.1, 5.2. The median is 4.5. Since $n$ is odd, include the median in both the lower half and the upper half of the data.
Lower half: 3.7, 3.9, 4.2, 4.5; $Q_1 = (3.9 + 4.2)/2 = 8.1/2 = 4.05$.
Upper half: 4.5, 4.8, 5.1, 5.2; $Q_3 = (4.8 + 5.1)/2 = 9.9/2 = 4.95$.
Hence, IQR = 4.95 − 4.05 = 0.9.

IQR Criteria for an Outlier: An observation that lies above $Q_3 + 1.5\,\mathrm{IQR}$ or below $Q_1 - 1.5\,\mathrm{IQR}$ may be suspected to be an outlier. An outlier is called extreme if it lies outside $(Q_1 - 3\,\mathrm{IQR},\ Q_3 + 3\,\mathrm{IQR})$. Otherwise, it is called a mild outlier.

Box plot: A box plot is a visual display of the 5-number summary $(x_{(1)}, Q_1, \tilde{x}, Q_3, x_{(n)})$.

Procedure:
(i) The middle box shows $Q_1$, the median and $Q_3$.
(ii) The whiskers extend above $Q_3$ or below $Q_1$ up to $Q_3 + 3\,\mathrm{IQR}$ or $Q_1 - 3\,\mathrm{IQR}$, respectively.
(iii) The outliers are denoted by special symbols.

Remark 2. The box plot has the following properties:
(i) It is more compact than a stem plot or a histogram.
(ii) The central box contains roughly 50% of the data.
(iii) It does not reveal the presence of "clusters".
(iv) It is very useful for comparing (similarities and differences of) data sets on the same scale.
(v) Height of the box = IQR.
(vi) If the median is roughly in the middle of the box, then the distribution is symmetric; otherwise it is skewed.
(vii) The whiskers show skewness if they are not of the same length.
(viii) It is useful for detecting outliers.
The main use of box plots is to compare groups.

Example 3. The following data denote the shear strength (MPa) of a joint bonded in a particular manner:
22.2, 40.4, 16.4, 73.7, 36.6, 109.9, 30.0, 4.4, 33.1, 66.7, 81.5
(a) What are the values of the quartiles, and the value of the IQR?
(b) Construct a box plot based on the five-number summary, and comment on its features.
(c) How large or small does an observation have to be to qualify as an outlier? As an extreme outlier?
(d) By how much could the largest observation be decreased without affecting the IQR?

Solution:
(a) The lower half of the data set is 4.4, 16.4, 22.2, 30.0, 33.1, 36.6, and therefore the lower quartile is (22.2 + 30.0)/2 = 26.1. The upper half of the data set is 36.6, 40.4, 66.7, 73.7, 81.5, 109.9, and therefore the upper quartile is (66.7 + 73.7)/2 = 70.2. So the IQR = 70.2 − 26.1 = 44.1.
(b) A box plot (created in Minitab) of these data appears below. There is a slight positive skew to the data. The variation seems quite large. There are no outliers.

[Figure: box plot; horizontal axis: shear strength (MPa), 0 to 100]
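The quartiles in part (a) can be checked with a short program. The following is a minimal Python sketch of the rule stated in the lecture (for odd n the median is included in both halves); the function names `median` and `quartiles` are illustrative, and this is not how Minitab or any other package is known to compute quartiles internally.

```python
def median(sorted_values):
    """Median of an already sorted list."""
    n = len(sorted_values)
    mid = n // 2
    if n % 2 == 1:
        return sorted_values[mid]
    return (sorted_values[mid - 1] + sorted_values[mid]) / 2


def quartiles(data):
    """Return (Q1, median, Q3) using the lecture's rule.

    The lower and upper halves each hold ceil(n / 2) values, so for odd n
    the median belongs to both halves.
    """
    xs = sorted(data)
    n = len(xs)
    half = (n + 1) // 2
    lower, upper = xs[:half], xs[n - half:]
    return median(lower), median(xs), median(upper)


# Shear strength data from Example 3.
shear = [22.2, 40.4, 16.4, 73.7, 36.6, 109.9, 30.0, 4.4, 33.1, 66.7, 81.5]
q1, med, q3 = quartiles(shear)
print(f"Q1 = {q1:.2f}, median = {med:.2f}, Q3 = {q3:.2f}, IQR = {q3 - q1:.2f}")
```

Statistical packages often use other quartile conventions (for example, interpolation between order statistics), so their output may differ slightly from the values obtained with this rule.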
(c) An observation would need to be more than 1.5(44.1) = 66.15 units below the lower quartile or above the upper quartile (that is, below 26.1 − 66.15 = −40.05 or above 70.2 + 66.15 = 136.35) to be classified as a mild outlier. Notice that, in this case, an outlier on the lower side is not possible, since the shear strength cannot be negative. An extreme outlier would fall 3(44.1) = 132.3 or more units below the lower quartile or above the upper quartile. Since the minimum and maximum observations in the data are 4.4 and 109.9, respectively, there are no outliers of either type in this data set.
(d) Not until the value $x = 109.9$ is lowered below 73.7 would there be any change in the value of the upper quartile. That is, the value $x = 109.9$ could be decreased by at most (109.9 − 73.7) = 36.2 units without affecting the IQR.

Homework

Sect 1.2: 11, 16, 19, 26, 27, 29
Sect 1.3: 35, 36, 41, 43
Sect 1.4: 45, 51, 54, 57, 79.
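As a check on part (c) of Example 3, the IQR criteria for mild and extreme outliers can be written as a small function. This is a minimal Python sketch under the lecture's definitions; the function name `classify` and the label strings are illustrative, and the quartiles 26.1 and 70.2 are taken from part (a) rather than recomputed.

```python
def classify(x, q1, q3):
    """Label x using the IQR criteria: beyond 3 IQR from the quartiles is an
    extreme outlier, beyond 1.5 IQR is a mild outlier."""
    iqr = q3 - q1
    if x < q1 - 3 * iqr or x > q3 + 3 * iqr:
        return "extreme outlier"
    if x < q1 - 1.5 * iqr or x > q3 + 1.5 * iqr:
        return "mild outlier"
    return "not an outlier"


q1, q3 = 26.1, 70.2  # quartiles found in Example 3(a)
shear = [22.2, 40.4, 16.4, 73.7, 36.6, 109.9, 30.0, 4.4, 33.1, 66.7, 81.5]

labels = [classify(x, q1, q3) for x in shear]
print(set(labels))

# Hypothetical values (not in the data) that would cross each fence.
print(classify(140.0, q1, q3))
print(classify(210.0, q1, q3))
```

The two hypothetical values are chosen to lie just beyond the upper fences 70.2 + 66.15 = 136.35 and 70.2 + 132.3 = 202.5, so they illustrate the mild and extreme cases that do not occur in the actual data.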