Lecture 1: Descriptive Statistics MSU-STT-351-Sum 16 (P. Vellaisamy: MSU-STT-351-Sum 16)

Lecture 1: Descriptive Statistics
MSU-STT-351-Sum 16
(P. Vellaisamy: MSU-STT-351-Sum 16)
Probability & Statistics for Engineers
Introduction
Why Statistics?
(i) It is the science that helps us understand many phenomena that occur in engineering, science, economics, finance, etc.
(ii) It is the scientific way to make intelligent judgments/decisions from observed data that contain uncertainty and variation.
We start with two examples.
Example 1. The emission levels of HC (hydrocarbon) and CO (carbon
monoxide) of a vehicle:
HC (gm/mile):   12.8   18.3   32.2   32.5
CO (gm/mile):   118    149    232    236
Question: What is the emission level of HC/CO? It is difficult to make a
precise statement, as there is a high variation in the observed levels.
Example 2. Marks of two students in 4 tests:
S1:   25   38   42   39
S2:   85   62   78   59
Question: Who is doing better?
Any difficulty in answering? Clearly, S2 is doing better: S2 scored higher than S1 on every test. There is no need for statistical analysis in such situations.
It is well known that statistics has often been misused in practical situations.
What is statistics?
(i) One word definition:
(a) Economics: Money
(b) Philosophy: Why
(c) Statistics: Variation
(ii) Layman definition: Information/summary of data.
(iii) Formal definition. Applied perspective: Statistics deals with techniques for how to
(a) obtain information/data (sample),
(b) analyze the data scientifically,
(c) draw valid conclusions/inferences.
(iv) Theoretical Perspective: As a branch of mathematics, it deals with
analytical techniques and procedures to analyze the data and to make
inference about the population characteristics.
Population and Samples
Population: The set of all well-defined objects/elements (of interest)
which are under investigation.
Example 1. The students studying engineering at MSU.
Example 2. The population of East Lansing.
If we collect information on all the elements in the population, we call it a “census”.
Most often this is impossible, as it involves a lot of time, effort and money.
Sample: A subset of the population, selected for obtaining information, is called a sample.
For example, we may select 10 students from each engineering discipline at MSU.
Often, we are interested in certain characteristics of the population (number of flaws in a piece of cloth, thickness of a capsule wall, monthly income of an individual, etc.).
A characteristic may be
(i) Categorical (belongs to one of the categories)
(a) Gender of a student (male/female)
(b) Quality of a product (excellent/good/bad)
(ii) Numerical (measured in real value)
(a) Heights of students
(b) Values of a stock
Types of Variables
A variable is any characteristic whose value changes from one object to another in the population. It is denoted by x, y, z (or by X, Y, Z).
A variable “X ” may be categorical (called categorical variable) or
numerical/quantitative (called numerical variable).
Types of Data
(i) The data X1 , X2 , . . . , Xn (or x1 , x2 , . . . , xn ) on a categorical variable X is
called categorical data.
(ii) The data X1, X2, ..., Xn (or x1, x2, ..., xn) on a numerical variable X is called quantitative data. Suppose we measure height = x and weight = y on n individuals, giving (x1, y1), ..., (xn, yn). Then we have bivariate data. Multivariate data is defined similarly.
Branches of Statistics
The main branches of statistics are the following:
(i) Descriptive Statistics: Deals with summarizing and describing important features of data (such as the mean, median and standard deviation), using tabular or graphical methods.
(ii) Inferential Statistics: Deals with techniques for drawing inferences
(generalizing to population) and predictions about the population, based
on the information obtained from the sample.
Descriptive Statistics
1.2 Graphical (visual) Display of Univariate Data
Pictures often reveal useful information about data.
1.2.1 Graphs for Quantitative Data
(i) Stem-and-Leaf Display (Stem Plot)
This is a useful plot for displaying quantitative data.
Example 4. Consider the data on the pulse rates (per minute) of 10
patients:
45, 61, 60, 62, 65, 73, 75, 75, 78, 82
Stem plot gives
• Actual values
• Extent of spread
• Number and location of peaks
• Presence of any outlier

Stem: Tens digit
Leaf: Ones digit
8 | 2
7 | 3558
6 | 0125
5 |         (gap)
4 | 5       (‘outlier’)
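The stem plot for the pulse-rate data of Example 4 can be reproduced with a few lines of code. The following is a minimal sketch, assuming two-digit integer data with the tens digit as stem; the function name `stem_and_leaf` is illustrative and not part of the lecture.

```python
from collections import defaultdict

def stem_and_leaf(values):
    """Map each stem (tens digit) to the sorted string of its leaves (ones digits)."""
    leaves = defaultdict(list)
    for v in sorted(values):
        stem, leaf = divmod(v, 10)
        leaves[stem].append(str(leaf))
    # Keep empty stems so that gaps in the data remain visible.
    return {s: "".join(leaves[s]) for s in range(min(leaves), max(leaves) + 1)}

pulse = [45, 61, 60, 62, 65, 73, 75, 75, 78, 82]  # data of Example 4
display = stem_and_leaf(pulse)

# Print the largest stem first, as on the slide.
for stem in sorted(display, reverse=True):
    print(f"{stem} | {display[stem]}")
```

The empty stem 5 shows up as a row without leaves, which is the gap that separates the value 45 from the rest of the data.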
(ii) The dot plot is used when the data set is small or has few distinct values.
Here, each observation is represented by a ‘dot’ above its value on a horizontal scale.
[Figure: dot plot on a horizontal axis marked 40, 50, 60, 70, 80]
This is similar to the stem plot, except that dots are used instead of digits.
Definition 1
A (quantitative) variable X is discrete if it takes finitely many or countably many values. It is continuous if it can take any value in an interval or on the whole real line.
Example 5. Let X = number of trials to get the first success. Then
X ∈ {1, 2, . . .} and hence X is discrete. Suppose, X = height of a student
(in cm). Then X ∈ [150, 190] and is a continuous variable.
Let X be a discrete variable taking values in {1, 2, . . . , k } = S . Let
X1 , . . . , Xn be n data values on X .
Then the frequency of i ∈ S is the number of values in the data {X1, X2, ..., Xn} that are equal to i. For 1 ≤ i ≤ k, the relative frequency of i is (frequency of i)/n.
Example 6. Let X = Number of children in a family. Then X ∈ {0, 1, 2, 3}.
Also, suppose the data on 20 families in East Lansing are:
2, 0, 1, 2, 2, 3, 1, 2, 3, 2, 3, 1, 2, 1, 2, 1, 2, 3, 1, 2.
Then the frequency table is
X       Frequency   Relative Frequency
0       1           1/20 = 0.05
1       6           6/20 = 0.30
2       9           9/20 = 0.45
3       4           4/20 = 0.20
Total   20          1.0
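The frequency table of Example 6 can be checked by counting. This is a small sketch using Python's standard `collections.Counter`; the variable names are illustrative.

```python
from collections import Counter

# Number of children in 20 families (Example 6).
children = [2, 0, 1, 2, 2, 3, 1, 2, 3, 2, 3, 1, 2, 1, 2, 1, 2, 3, 1, 2]
n = len(children)

freq = Counter(children)                            # frequency of each value
rel_freq = {x: freq[x] / n for x in sorted(freq)}   # relative frequency = frequency / n

for x in sorted(freq):
    print(x, freq[x], rel_freq[x])
print("Total", sum(freq.values()), sum(rel_freq.values()))
```

The relative frequencies always add up to 1, which is a quick sanity check on any hand-built table.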
(iii) Histogram for Discrete Data
Take the x-values on the horizontal scale and the frequency/relative frequency along the vertical scale. On each value, draw a rectangle whose height is equal to the frequency/relative frequency of that value. The frequency histogram for Example 6 is:
[Figure: frequency histogram titled “Histogram of C1”; horizontal axis: C1 (values 0 to 3); vertical axis: Frequency (0 to 9)]
Similarly, a relative frequency histogram may be drawn.
Histogram for Continuous Data (measurements)
Case 1. (Equal Width Case)
(i) The data assume real values, not necessarily integers.
(ii) Subdivide the range of the data into k subintervals or classes of equal length such that each observation lies in exactly one class.
(iii) Construct rectangles whose height is equal to the frequency (for a frequency histogram) or the relative frequency (for a relative frequency histogram).
Note:
(i) There are no hard-and-fast rules concerning k; usually, an integer between 5 and 20 will do.
(ii) For large data of size n, more classes should be used. A rule of thumb is $k = \sqrt{n}$.
Note: If all the data belong to one or two classes, or if most sub-intervals (of equal length) have low frequencies, it is better to use fewer classes with different lengths.
Histogram: For classes of different lengths:
(i) Decide the class intervals.
(ii) Construct the rectangle using the formula:
Rectangle height = relative frequency / class width
(area of rectangle = relative frequency)
(iii) The resulting “rectangle heights” are called “densities”.
(iv) The formula in (ii) works for the “equal width” case also.
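The density formula in (ii) can be sketched in code. The example below reuses the pulse-rate data of Example 4 with class boundaries 40, 60, 70, 80, 90; these boundaries are a hypothetical choice made only for illustration, and the code assumes that every observation falls in one of the classes.

```python
from bisect import bisect_right

def density_histogram(data, edges):
    """Return (counts, densities) for the classes [edges[i], edges[i+1])."""
    counts = [0] * (len(edges) - 1)
    for x in data:
        i = bisect_right(edges, x) - 1   # index of the class that contains x
        if 0 <= i < len(counts):
            counts[i] += 1
    n = len(data)
    # Rectangle height = relative frequency / class width.
    densities = [c / n / (hi - lo) for c, lo, hi in zip(counts, edges, edges[1:])]
    return counts, densities

pulse = [45, 61, 60, 62, 65, 73, 75, 75, 78, 82]
edges = [40, 60, 70, 80, 90]             # hypothetical, unequal class widths
counts, densities = density_histogram(pulse, edges)

# Each rectangle's area equals its relative frequency, so the areas sum to 1.
areas = [d * (hi - lo) for d, lo, hi in zip(densities, edges, edges[1:])]
print(counts, densities, sum(areas))
```

Because the first class is twice as wide as the others, its height is halved relative to its relative frequency, which keeps wide classes from looking more important than they are.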
Example 7. The following data represent the frequency distribution of the fracture strength (MPa) observations for ceramic bars fired in a particular kiln (read the class 81 − 83 as 81 ≤ x < 83, meaning that the data value 83 is not included):
Class:  81−83  83−85  85−87  87−89  89−91  91−93  93−95  95−97  97−99
Freq:   6      7      17     30     43     28     22     13     3
(a) Construct a histogram based on relative frequencies, and comment on any interesting features.
(b) What proportion of strength observations are at least 85? Less than 95?
(c) What proportion of the observations are less than 90?
Solution:
(a) The histogram appears below. A representative value for this data would be x = 90. The histogram is reasonably symmetric, unimodal, and somewhat bell-shaped. The variation in the data is not small, since the spread of the data, 99 − 81 = 18, constitutes about 20% of the typical value of 90.
[Figure: relative frequency histogram; horizontal axis: Fracture strength (MPa), 81 to 99; vertical axis: Relative frequency]
(b) The proportion of the observations that are at least 85 is
1 − (6 + 7)/169 = 0.9231.
Similarly, the proportion less than 95 is 1 − (22 + 13 + 3)/169 = 0.7751.
(c) Note that x = 90 is the midpoint of the class 89 ≤ x < 91, which contains 43 observations (a relative frequency of 43/169 = 0.2544). Therefore, about half of this frequency, 0.1272, should be added to the relative frequencies for the classes to the left of x = 90.
That is, the approximate proportion of the observations that are less than 90 is
0.0355 + 0.0414 + 0.1006 + 0.1775 + 0.1272 = 0.4822.
Histogram Shapes
The histogram shape is called (a) unimodal if it has a single peak.
Note: The histogram seen earlier is unimodal.
[Figure: unimodal histogram; horizontal axis: Flow rate (0 to 20); vertical axis: frequency (0 to 25)]
(b) Bimodal if it has 2 different peaks.
(c) Multimodal if it has > 2 peaks.
(d) The histogram is ‘symmetric’ if it is unimodal and right half is the mirror
image of the left half.
[Figure: histogram; horizontal axis: IDT value (10 to 80); vertical axis: Frequency (0 to 15)]
(e) Positively skewed if the right tail is stretched out compared with the left tail.
(f) Negatively skewed if the left tail is stretched out compared with the right tail.
Histogram for Qualitative/Categorical Data
(i) A histogram for categorical data is called a bar chart. In some cases there will be a natural ordering of the classes (Titanic data).
(ii) A Pareto diagram is a bar chart that results from a quality control study, where the different categories correspond to different kinds of defects or non-conformities.
Example 8. Histogram for Titanic Data: The following table classifies 2201
people as per the class they traveled:
Class:   First (F)   Second (S)   Third (T)   Crew (C)
Count:   325         285          706         885
[Figure: bar chart titled “Histogram for Titanic Data”; horizontal axis: class (F, S, T, C); vertical axis: count (0 to 1000)]
Some Additional Examples:
Example 1. Construct the stem-and-leaf display for the data on flexural
strength of a certain concrete (in MPa units):
5.9, 7.2, 7.3, 6.3, 8.1, 6.8, 7.0, 7.6, 6.8, 6.5, 7.0, 6.3, 7.9, 9.0, 8.2, 8.7, 7.8,
9.7, 7.4, 7.7, 9.7, 7.8, 7.7, 11.6, 11.3, 11.8, 10.7
(a) Is it spread about a representative value?
(b) Is it symmetric?
(c) Any outliers?
(d) What proportion of observations exceed 10 MPa?
Solution: (a) Minitab generated the following stem-and-leaf display of this
data:
Stem-and-leaf of C1   N = 27
Leaf Unit = 0.10
   1     5   9
   6     6   33588
 (11)    7   00234677889
  10     8   127
   7     9   077
   4    10   7
   3    11   368
The leftmost column shows the cumulative number of observations from each stem to the nearest tail of the data. For example, the 6 in the second
row indicates that there are a total of 6 data points contained in stems 6
and 5. Minitab uses parentheses around 11 in row three to indicate that
the median of the data is contained in this stem. A value close to 8 is
representative of this data.
(b) The display is not perfectly symmetric around a middle/representative value. There tends to be some positive skewness in this data.
(c) Outliers are data points that appear to be very different from the pack. Looking at the stem-and-leaf display in Part (a), there appear to be no outliers in this data (a more precise definition of an outlier will be given later).
(d) From the stem-and-leaf display in Part (a), there are 3 leaves associated with the stem of 11, which represent the 3 data values that are greater than or equal to 11. The value 10.7, represented by the stem of 10 and the leaf of 7, also exceeds 10. Therefore, the proportion of data values that exceed 10 is 4/27 = 0.148, or about 15%.
Example 2. The following data represents the IDTs (inter-division time) of
a number of cells both in exposed (treatment) and in unexposed (control)
conditions:
28.1, 31.2, 13.7, 46.0, 25.8, 16.8, 34.8, 62.3, 28.0, 17.9, 19.5, 21.1, 31.9, 28.9,
60.1, 23.7, 18.6, 21.4, 26.6, 26.2, 32.0, 43.5, 17.4, 38.8, 30.6, 55.6, 25.5,
52.1, 21.0, 22.3, 15.5, 36.3, 19.1, 38.4, 72.8, 48.9, 21.4, 20.7, 57.3, 40.9
Construct a histogram of this data based on classes with boundaries
10, 20, 30, ...
Then calculate log(x) (base 10) for each x and construct the histogram of the transformed data using the class boundaries 1.1, 1.2, 1.3, etc.
What is the effect of the transformation?
Solution. A histogram of the raw data appears below:
[Figure: histogram of the raw IDT data]
The histogram of the log-values (base 10) is shown below:
[Figure: histogram of the log-transformed IDT data]
The shape of this histogram is much less skewed than the histogram of the original data.
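Since the figures are not reproduced here, the class counts behind the two histograms can be computed directly. This is a sketch under two assumptions: the upper boundaries 80 and 1.9 are chosen only so that the classes cover all the data, and each class is read as including its left endpoint, as in Example 7.

```python
from bisect import bisect_right
from math import log10

# IDT data of Example 2 (40 observations).
idt = [28.1, 31.2, 13.7, 46.0, 25.8, 16.8, 34.8, 62.3, 28.0, 17.9,
       19.5, 21.1, 31.9, 28.9, 60.1, 23.7, 18.6, 21.4, 26.6, 26.2,
       32.0, 43.5, 17.4, 38.8, 30.6, 55.6, 25.5, 52.1, 21.0, 22.3,
       15.5, 36.3, 19.1, 38.4, 72.8, 48.9, 21.4, 20.7, 57.3, 40.9]

def class_counts(data, edges):
    """Count the observations falling in each class [edges[i], edges[i+1])."""
    counts = [0] * (len(edges) - 1)
    for x in data:
        i = bisect_right(edges, x) - 1
        if 0 <= i < len(counts):
            counts[i] += 1
    return counts

raw_edges = list(range(10, 90, 10))                       # 10, 20, ..., 80
log_edges = [round(1.1 + 0.1 * k, 1) for k in range(9)]   # 1.1, 1.2, ..., 1.9

raw_counts = class_counts(idt, raw_edges)
log_counts = class_counts([log10(x) for x in idt], log_edges)
print(raw_counts)
print(log_counts)
```

The raw counts fall off steadily to the right of the second class (a long right tail), while the counts of the log-values are spread more evenly around the middle classes.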
Numerical Summary of Measures
We now discuss some important characteristics of the data and of the population.
Measures of Location
First, we discuss them for the data and then for the population distribution.
The Mean
1. The Sample Mean: $\bar{x}$
The sample mean of n observations $x_1, \dots, x_n$ is
\[ \bar{x} = \frac{1}{n}\sum_{i=1}^{n} x_i = (x_1 + \dots + x_n)/n, \]
where n denotes the number of observations.
Example 1a. Suppose the scores of 8 students in a test are
35, 20, 45, 50, 42, 38, 39, 11. Then the sample mean is $\bar{x}$ = 280/8 = 35.
Example 1b. Suppose the last score is recorded, by mistake, as 71.
Then $\bar{x}$ = (269 + 71)/8 = 340/8 = 42.5, an increase of about 21% in the sample mean. Note that this is a significant change.
Rule: Report the mean with one more decimal place than is present in the data. In the above example, the data are integers (no decimal places), so we write $\bar{x}$ = 42.5 (one decimal place).
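2. The Median: $\tilde{x}$
This measure is less affected by outliers or extreme values. It divides the sample distribution into two equal parts.
Definition 2 (Sample median)
First order the observations as $X_{(1)} \le X_{(2)} \le \dots \le X_{(n)}$, from the smallest to the largest one. Then the median is defined as
\[ \tilde{x} = \begin{cases} X_{\left(\frac{n+1}{2}\right)}, & \text{if } n \text{ is odd}, \\ \left( X_{\left(\frac{n}{2}\right)} + X_{\left(\frac{n}{2}+1\right)} \right)/2, & \text{if } n \text{ is even} \end{cases} \]
\[ = \begin{cases} \text{middle value}, & \text{if } n \text{ is odd}, \\ \text{average of middle 2 values}, & \text{if } n \text{ is even}. \end{cases} \]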
Example 2: Find the median of the values in Example 1a. The ordered data are
11, 20, 35, 38, 39, 42, 45, 50.
Here n = 8 is even and n/2 = 4, so take the middle values: the 4th and 5th values, 38 and 39.
The median is $\tilde{x}$ = average of the middle two values = (38 + 39)/2 = 38.5.
Example 3: Find the median of Example 1b (one outlier case). Here the ordered data are
20, 35, 38, 39, 42, 45, 50, 71.
Again, $\tilde{x}$ = (39 + 42)/2 = 81/2 = 40.5.
Remark 1.
(i) The median is less affected by the outlier than the mean: the median moves from 38.5 to 40.5, while the mean moves from 35 to 42.5.
(ii) Also, this is an extreme case, as we replaced the smallest observation by one that is greater than the largest.
(iii) Decreasing the three smallest values or increasing the three largest values in Example 3 does not affect the median.
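The comparison between the mean and the median in Examples 1a, 1b, 2 and 3 can be checked numerically. The sketch below implements Definition 2 directly; the function names are illustrative.

```python
def sample_mean(data):
    """Sum of the observations divided by their number."""
    return sum(data) / len(data)

def sample_median(data):
    """Middle value if n is odd, average of the two middle values if n is even."""
    x = sorted(data)
    n = len(x)
    mid = n // 2
    if n % 2 == 1:
        return x[mid]                  # the ((n+1)/2)-th ordered value
    return (x[mid - 1] + x[mid]) / 2   # the (n/2)-th and (n/2 + 1)-th values

scores = [35, 20, 45, 50, 42, 38, 39, 11]   # Example 1a
with_error = scores[:-1] + [71]             # Example 1b: last score recorded as 71

for label, data in [("Example 1a", scores), ("Example 1b", with_error)]:
    print(label, "mean =", sample_mean(data), "median =", sample_median(data))
```

The single wrong value shifts the mean by 7.5 points but the median by only 2 points.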
3. The Trimmed Mean
(i) First order the observations from the smallest to the largest.
(ii) Let r ∈ (0, 0.5). Then the 100r% trimmed data is obtained by discarding the largest 100r% and the smallest 100r% of the data.
Definition: The 100r% trimmed sample mean is the sample mean of the 100r% trimmed data.
Example 4. Obtain the 12% trimmed mean of the data in Example 1a:
11, 20, 35, 38, 39, 42, 45, 50.
Here 100r = 12, so r = 12/100 = 0.12.
Also, n = 8, and 12% of 8 = (12/100) × 8 = 24/25 = 0.96 ≈ 1.
Discarding the smallest and the largest observation, we get the 12.5% trimmed mean (since 1/8 = 12.5%) as
(20 + 35 + 38 + 39 + 42 + 45)/6 = 219/6 = 36.5.
The trimmed mean is less sensitive to outliers than the mean, but more sensitive than the median.
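Example 4 can be sketched in code as follows. Rounding r·n to the nearest integer to decide how many values to drop from each end mirrors the step 0.96 ≈ 1 in the example; it is an assumption of this sketch, and other conventions exist.

```python
def trimmed_mean(data, r):
    """100r% trimmed mean: drop about r*n values from each end, then average."""
    x = sorted(data)
    k = round(r * len(x))        # number trimmed from each end (assumed convention)
    kept = x[k:len(x) - k]
    return sum(kept) / len(kept)

scores = [35, 20, 45, 50, 42, 38, 39, 11]   # data of Example 1a
print(trimmed_mean(scores, 0.12))           # drops 11 and 50, averages the other six
```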
Measure of Variability
Let x1 , . . . , xn be a sample of size n on a variable x .
Definition 3
(i) The Range: Arrange the data x1 , . . . , xn as x(1) ≤ x(2) ≤ . . . ≤ x(n) . Then
the range R = x(n) − x(1) .
This is the simplest measure of variability.
Drawback: It depends only on x(1) and x(n) .
(ii) The Sample Variance: The sample variance of $x_1, \dots, x_n$ is defined by
\[ s_x^2 = \frac{1}{n-1}\sum_{i=1}^{n} (x_i - \bar{x})^2 = S_{xx}/(n-1) \]
and the sample standard deviation is $s = +\sqrt{s^2}$, the positive square root.
Facts:
(i) The unit of s is the same as that of the $x_i$'s.
(ii) $\sum_{i=1}^{n} (x_i - \bar{x}) = 0$, for any $x_1, \dots, x_n$.
That is, if the deviations $(x_1 - \bar{x}), \dots, (x_{n-1} - \bar{x})$ are known, then $(x_n - \bar{x})$ can be found. Thus, the n deviations actually contain only (n − 1) independent pieces of information (called degrees of freedom), and this suffices to find $s^2$ or s. Thus, $s^2$ and s are based on (n − 1) degrees of freedom.
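A Useful Formula:
\[ S_{xx} = \sum_{i=1}^{n} (x_i - \bar{x})^2 = \sum_{i=1}^{n} x_i^2 - \Big(\sum_{i=1}^{n} x_i\Big)^2/n = \sum_{i=1}^{n} x_i^2 - n\bar{x}^2. \]
Hence,
\[ s_x^2 = \frac{1}{n-1}\Big[\sum_{i} x_i^2 - \frac{1}{n}\Big(\sum_{i} x_i\Big)^2\Big] = \frac{1}{n-1} S_{xx}. \]
A Proposition: Let $s_x^2$ be the variance of the data $x_1, \dots, x_n$ and let c ≠ 0.
(i) If $y_1 = x_1 + c, \dots, y_n = x_n + c$, then $s_y^2 = s_x^2$.
(ii) If $y_1 = c x_1, \dots, y_n = c x_n$, then $s_y^2 = c^2 s_x^2$ and $s_y = |c| s_x$.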
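The shortcut formula and the proposition above can be verified numerically. The sketch below uses the test scores of Example 1a and an arbitrary constant c = −2.5 chosen only for illustration; the function names are not from the lecture.

```python
from math import isclose

def sample_variance(x):
    """Definition: sum of squared deviations from the mean, divided by n - 1."""
    n = len(x)
    xbar = sum(x) / n
    return sum((xi - xbar) ** 2 for xi in x) / (n - 1)

def shortcut_variance(x):
    """Computational formula: [sum(x_i^2) - (sum x_i)^2 / n] / (n - 1)."""
    n = len(x)
    return (sum(xi * xi for xi in x) - sum(x) ** 2 / n) / (n - 1)

x = [35, 20, 45, 50, 42, 38, 39, 11]   # scores of Example 1a
c = -2.5                               # arbitrary nonzero constant (illustrative)
shifted = [xi + c for xi in x]         # y_i = x_i + c
scaled = [c * xi for xi in x]          # y_i = c * x_i

print(sample_variance(x), shortcut_variance(x))
print(sample_variance(shifted))        # unchanged by the shift
print(sample_variance(scaled))         # multiplied by c**2
```

Shifting the data moves the mean but leaves every deviation unchanged, while scaling multiplies every deviation by c, which is why the variance picks up the factor c².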
Example 5
The following data represents the value of Young’s modulus for certain
cast plates: 116.4, 115.9, 114.6, 115.2, 115.8.
(a) Find $\bar{x}$ and the deviations $(x_i - \bar{x})$.
(b) Using the $(x_i - \bar{x})$'s, compute $s^2$.
(c) Calculate $s^2$ using the computational formula for $S_{xx}$.
Solution:
(a) $\bar{x} = \frac{1}{n}\sum_i x_i$ = 577.9/5 = 115.58.
Deviations from the mean: 116.4 − 115.58 = 0.82, 115.9 − 115.58 = 0.32, 114.6 − 115.58 = −0.98, 115.2 − 115.58 = −0.38, and 115.8 − 115.58 = 0.22.
(b) $s^2$ = [(0.82)² + (0.32)² + (−0.98)² + (−0.38)² + (0.22)²]/(5 − 1) = 1.928/4 = 0.482. Hence, s = $\sqrt{0.482}$ ≈ 0.694.
(c) $\sum_i x_i^2$ = 66,795.61, so
\[ s^2 = \frac{1}{n-1}\Big[\sum_i x_i^2 - \frac{1}{n}\Big(\sum_i x_i\Big)^2\Big] = [66795.61 - (577.9)^2/5]/4 = 1.928/4 = 0.482. \]
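The numbers in this solution can be cross-checked with Python's standard `statistics` module, whose `variance` and `stdev` functions use the (n − 1) divisor. This is only a check of the arithmetic, not part of the original solution.

```python
from statistics import mean, stdev, variance

# Young's modulus values from Example 5.
modulus = [116.4, 115.9, 114.6, 115.2, 115.8]

xbar = mean(modulus)
deviations = [round(x - xbar, 2) for x in modulus]
s2 = variance(modulus)   # sample variance, divisor n - 1
s = stdev(modulus)       # sample standard deviation, square root of s2

print(round(xbar, 2), deviations)
print(round(s2, 3), round(s, 3))
```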
Box Plot
The quartiles and percentiles yield more information about the location
of a data set. Similarly, median and IQR (inter quartile range) are used to
construct box plot, a visual summary of the data.
Quartiles and IQR
Let x1 , . . . , xn denote the data set of size n.
First order the observations from the smallest to the largest.
(i) Compute the median $\tilde{x}$.
(ii) If n is even, the first n/2 observations form the lower half, and the remaining n/2 observations form the upper half (the median separates the data into two parts).
(iii) If n is odd, the median $\tilde{x}$ is the ((n + 1)/2)th value of the ordered data; include it in both parts.
The Quartiles: (i) The lower quartile Q1 = median of the lower half of the data.
(ii) The upper quartile Q3 = median of the upper half of the data.
(iii) The interquartile range IQR = Q3 − Q1.
Note: The IQR is also called the fourth spread, fs = Q3 − Q1 = upper fourth − lower fourth, and is resistant to outliers.
Example 1 Consider the following data: 5.2, 3.9, 4.8, 5.1, 3.7, 4.5, 4.2.
Here, n = 7. Ordered data: 3.7, 3.9, 4.2, 4.5, 4.8, 5.1, 5.2.
The median = 4.5. Since n is odd, include the median in both the lower half and the upper half of the data.
Lower half: 3.7, 3.9, 4.2, 4.5; Q1 = (3.9 + 4.2)/2 = 8.1/2 = 4.05.
Upper half: 4.5, 4.8, 5.1, 5.2; Q3 = (4.8 + 5.1)/2 = 9.9/2 = 4.95.
Hence, IQR = 4.95 − 4.05 = 0.9.
IQR Criteria for an Outlier: An observation that lies above Q3 + (1.5)IQR or below Q1 − (1.5)IQR may be suspected to be an outlier. An outlier is called extreme if it lies outside (Q1 − 3IQR, Q3 + 3IQR). Otherwise, it is called a mild outlier.
Boxplot: A box plot is a visual display of the 5-number summary:
$(x_{(1)}, Q_1, \tilde{x}, Q_3, x_{(n)})$.
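The quartile rule used in this lecture (for odd n, the median is included in both halves) can be sketched in code. Note that software packages often use other quartile conventions, so library results may differ slightly; the function names here are illustrative.

```python
def median_sorted(x):
    """Median of a list that is already sorted."""
    n = len(x)
    mid = n // 2
    return x[mid] if n % 2 == 1 else (x[mid - 1] + x[mid]) / 2

def quartiles(data):
    """Return (Q1, median, Q3); for odd n the median belongs to both halves."""
    x = sorted(data)
    n = len(x)
    half = (n + 1) // 2   # size of each half: n/2 if n is even, (n+1)/2 if n is odd
    return median_sorted(x[:half]), median_sorted(x), median_sorted(x[n - half:])

data = [5.2, 3.9, 4.8, 5.1, 3.7, 4.5, 4.2]   # Example 1
q1, med, q3 = quartiles(data)
iqr = q3 - q1
five_number_summary = (min(data), q1, med, q3, max(data))

print([round(v, 2) for v in five_number_summary], "IQR =", round(iqr, 2))
```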
Procedure: (i) The middle box marks Q1, the median and Q3.
(ii) The whiskers extend above Q3 or below Q1 till Q3 + 3IQR or Q1 − 3IQR, respectively.
(iii) The outliers are denoted by special symbols.
Remark 2.
The box plot has the following properties:
(i) More compact than a stem plot or histogram.
(ii) Central box contains roughly 50% of the data.
(iii) Does not reveal the presence of “clusters”.
(iv) Very useful in comparing (similarities and differences) data sets on the same scale.
(v) Height of the box = IQR
(vi) If the median is roughly in the middle of the box, then the distribution is
symmetric; or else it is skewed.
(vii) Whiskers show skewness if they are not of the same length.
(viii) Useful to detect outliers.
The main use of box plots is to compare the groups.
Example 3 The following data denotes the shear strength (MPa) of a joint
bonded in a particular manner.
22.2, 40.4, 16.4, 73.7, 36.6, 109.9, 30.0, 4.4, 33.1, 66.7, 81.5
(a) What are the values of the quartiles, and the value of the IQR?
(b) Construct a box plot based on the five-number summary, and comment
on its features.
(c) How large or small does an observation have to be to qualify as an
outlier? As an extreme outlier?
(d) By how much could the largest observation be decreased without
affecting the IQR?
Solution:
(a) The lower half of the data set: 4.4, 16.4, 22.2, 30.0, 33.1, 36.6, and
therefore the lower quartile is ((22.2 + 30.0)/2) = 26.1. The top half of the
data set: 36.6, 40.4, 66.7, 73.7, 81.5, 109.9 and therefore the upper
quartile, is ((66.7 + 73.7)/2) = 70.2. So, the IQR = (70.2 − 26.1) = 44.1.
(b) A boxplot (created in Minitab) of this data appears below. There is a slight positive skew to the data. The variation seems quite large. There are no outliers.
[Figure: box plot; horizontal axis: shear strength (0 to 100)]
(c) An observation would need to be more than 1.5(44.1) = 66.15 units below the lower quartile or above the upper quartile to be classified as a mild outlier, that is, below 26.1 − 66.15 = −40.05 or above 70.2 + 66.15 = 136.35. Notice that, in this case, an outlier on the lower side is not possible, since the shear strength variable cannot have a negative value. An extreme outlier would fall (3)(44.1) = 132.3 or more units below the lower quartile or above the upper quartile. Since the minimum and maximum observations in the data are 4.4 and 109.9, respectively, there are no outliers of either type in this data set.
(d) Not until the value x = 109.9 is lowered below 73.7 would there be any
change in the value of the upper quartile. That is, the value x = 109.9
could not be decreased by more than (109.9 − 73.7) = 36.2 units.
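The outlier check in part (c) can be written as a short classification rule. The sketch below takes Q1 = 26.1 and Q3 = 70.2 from part (a) and applies the 1.5·IQR and 3·IQR criteria with strict inequalities; the function name `classify` and the test values 140 and 210 are illustrative only.

```python
def classify(x, q1, q3):
    """Label x using the IQR criteria: beyond 3*IQR from the quartiles is
    extreme, beyond 1.5*IQR is mild, anything else is not an outlier."""
    iqr = q3 - q1
    if x < q1 - 3 * iqr or x > q3 + 3 * iqr:
        return "extreme outlier"
    if x < q1 - 1.5 * iqr or x > q3 + 1.5 * iqr:
        return "mild outlier"
    return "not an outlier"

shear = [22.2, 40.4, 16.4, 73.7, 36.6, 109.9, 30.0, 4.4, 33.1, 66.7, 81.5]
q1, q3 = 26.1, 70.2          # quartiles found in part (a)

labels = {x: classify(x, q1, q3) for x in shear}
print(labels)
print(classify(140.0, q1, q3), "/", classify(210.0, q1, q3))   # hypothetical values
```

Every observed value is labelled as not an outlier, in agreement with the solution; a hypothetical reading of 140 would be mild and one of 210 would be extreme.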
Homework
Home work:
Sect 1.2: 11, 16, 19, 26, 27, 29
Sect 1.3: 35, 36, 41, 43
Sect 1.4: 45, 51, 54, 57, 79.