The Boxplot
Example: Consider psychological profile scores for 30 convicted felons.
For this sample we know that
m = 196.5, Q1 = 191 and Q3 = 204, so IQR = Q3 - Q1 = 13.
Inner Fences: LIF = Q1 - 1.5(IQR) = 191 - 19.5 = 171.5
HIF = Q3 + 1.5(IQR) = 204 + 19.5 = 223.5
Outer Fences: LOF = Q1 - 3(IQR) = 191 - 39 = 152
HOF = Q3 + 3(IQR) = 204 + 39 = 243
Sketch:
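As a check on these numbers, here is a minimal Python sketch (Python is used for illustration only; the course software is Minitab, and the variable names are ours) that computes the fences from the summary statistics above.

# Fences for the felon profile scores, from the summary statistics above.
q1, q3 = 191.0, 204.0
iqr = q3 - q1                                # interquartile range = 13
lif, hif = q1 - 1.5 * iqr, q3 + 1.5 * iqr    # inner fences: 171.5, 223.5
lof, hof = q1 - 3.0 * iqr, q3 + 3.0 * iqr    # outer fences: 152.0, 243.0
print(iqr, lif, hif, lof, hof)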
The Boxplot on Minitab
MTB > gstd
[switches Minitab to standard character-graphics mode]
MTB > boxp c1
[draws a character boxplot for the sample in C1]
Note: 1. On the boxplot scale, one space (sp) “-” is given by
sp = (the distance between successive tick marks [+]) / 10.
2. Values read on the boxplot scale are approximate.
Example: Consider the psychological profile score data
MTB > boxp c1
[character boxplot of C1, with tick marks (+) at 140, 160, 180, 200, 220 and 240; outlying values are plotted as “*” (possible outliers) and “O” (probable outliers)]
Question: From the boxplot scale determine the (approximate) values of the sample median, the first and third quartiles, and the possible and probable outliers.
sp = ; m =
Q1 = ; Q3 =
Possible Outliers:
Probable Outliers:
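As an illustration of note 1 above, the short Python sketch below converts spaces on the plot scale into data units. The “3 spaces from the 180 tick” reading is a hypothetical example, not a value taken from this plot.

# One character space on this boxplot scale: the ticks run 140, 160, ..., 240,
# so successive tick marks are 20 units apart.
tick_spacing = 20
sp = tick_spacing / 10     # sp = 2.0 data units per space
# Hypothetical reading: a symbol sitting 3 spaces to the right of the 180 tick.
value = 180 + 3 * sp       # 186.0 -- approximate, as note 2 warns
print(sp, value)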
Identifying Outliers in a Stem-and-Leaf Plot
Minitab gives us an option, when we use the STEM command, to identify outliers as calculated in the boxplot.
Using the STEM command with the subcommand TRIM, one can form a stem-and-leaf plot with the outliers trimmed away and shown on special lines labeled LO and HI. In this way one can examine the shape of the sample with the outliers removed.
Example: Scientists use pH (power of hydrogen) as a numerical measure of acidity, with pH = 7.00 being neutral. A scientist suspects a pH meter is defective. He takes a random sample of 45 measurements with this meter on a neutral substance. The data have been entered into C1.
8.32  6.13  5.89  6.45  7.14  9.00  7.21  6.02  5.96
6.19 10.28  7.05  6.56  5.90  5.99  6.99  6.98  8.26
4.61  6.86  9.05  5.88  5.79  6.58  3.46  6.21  6.07
5.49  6.09  5.82  5.38  6.60  6.49  6.62  7.22  3.76
6.76  8.35  7.22  8.85  5.95  7.24  3.26  6.72 11.17
MTB > stem c1;
SUBC> trim.

Stem-and-leaf of PHlevel   N = 45
Leaf Unit = 0.10

 LO    32, 34, 37,        (low outliers: values below the LIF)

  4    4  6
  6    5  34
 14    5  78889999
 22    6  00011244
 (9)   6  556677899
 14    7  012222
  8    7
  8    8  233
  5    8  8
  4    9  00

 HI    102, 111,          (high outliers: values above the HIF)
The outliers are the values beyond the inner fences. We can show this using the approximate sample values read from the plot. See the calculations below.
To show the “HI” values are outliers, we need to calculate the high inner fence.
AP(Q1) = (25/100)(45) = 11.25, so Pos(Q1) = 12
Q1 = 5.9 (the 12th ordered value in the plot)
AP(Q3) = (75/100)(45) = 33.75, so Pos(Q3) = 34
Q3 = 7.2 (the 34th ordered value in the plot)
IQR = Q3 - Q1 = 7.2 - 5.9 = 1.3
HIF = Q3 + 1.5(IQR) = 7.2 + 1.95 = 9.15
The values 10.2 and 11.1 are both bigger than 9.15 and are plotted as “HI” outliers in the stem-and-leaf plot.
Similarly,
LIF = Q1 - 1.5(IQR) = 5.9 - 1.95 = 3.95.
The values 3.2, 3.4 and 3.7 are all smaller than the LIF and are plotted as “LO” outliers in the previous stem-and-leaf plot.
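The same calculation can be scripted. The Python sketch below (illustrative only; the function name and the round-up rule for a non-integer AP are ours, following the positions used above) applies the position rule and the 1.5(IQR) inner fences to the exact pH data. Because it uses the exact data values rather than the truncated plot values, it finds Q1 = 5.95 and Q3 = 7.21 and slightly different fences, but it flags the same LO and HI outliers.

import math

ph = [8.32, 6.13, 5.89, 6.45, 7.14, 9.00, 7.21, 6.02, 5.96,
      6.19, 10.28, 7.05, 6.56, 5.90, 5.99, 6.99, 6.98, 8.26,
      4.61, 6.86, 9.05, 5.88, 5.79, 6.58, 3.46, 6.21, 6.07,
      5.49, 6.09, 5.82, 5.38, 6.60, 6.49, 6.62, 7.22, 3.76,
      6.76, 8.35, 7.22, 8.85, 5.95, 7.24, 3.26, 6.72, 11.17]

def quartile(data, p):
    # Position rule used above: AP = (p/100)*n; for a non-integer AP,
    # the position is the next integer up (11.25 -> 12, 33.75 -> 34).
    s = sorted(data)
    pos = math.ceil(p / 100 * len(s))
    return s[pos - 1]

q1, q3 = quartile(ph, 25), quartile(ph, 75)   # 5.95 and 7.21
iqr = q3 - q1
lif, hif = q1 - 1.5 * iqr, q3 + 1.5 * iqr     # about 4.06 and 9.10

lo = [x for x in ph if x < lif]   # [3.46, 3.76, 3.26] -> the LO line
hi = [x for x in ph if x > hif]   # [10.28, 11.17]     -> the HI line
print(q1, q3, lif, hif, lo, hi)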
Bivariate Data
All the data sets we have encountered up to now deal with measurements of one variable on each of several “individuals” (e.g. incomes of individuals, test scores, etc.). This type of data is called univariate data.
In many situations it may be of interest to take measurements of two
variables on each of several individuals. For example we may be
interested in both the income of an individual and the number of hours
of television watched per week. In such a case each observation will
consist of two values ( a pair of measurements)
(income, number of hours watched).
Such type of data is called bivariate data and is often denoted in the
form; (x,y), the “x-observation” and “y-observation”. A bivariate
sample of ‘n’ observations would then be written as
(x1,y2), (x2,y2),. . ., (xn,yn)
The reason we take observations on two variables is that we are
interested in exploring the relationship between the variables. For
example, “Do poorer people tend to watch more TV?”
The simplest possible relationship between two variables x and y is that
of a straight line.
Measuring Association Between Variables
Here are some typical examples of association:
Positive linear relationship between x and y:
Curvilinear relationship between x and y:
Negative linear relationship between x and y:
There are many measures of association in statistics and the term
CORRELATION may be applied to such measures. We will study a
specific type of correlation called PEARSON’S SAMPLE
CORRELATION COEFFICIENT which measures the strength of the
linear relationship between the two numerical variables.
Pearson’s Sample Correlation Coefficient r
Consider the following bivariate data on two class quizzes:
x: 1 1 2 4 5 5 2 4
y: 5 4 4 2 1 2 2 4

$s_{xy} = \frac{\sum (x_i - \bar{x})(y_i - \bar{y})}{n - 1} = \frac{(1-3)(5-3) + (1-3)(4-3) + \cdots + (4-3)(4-3)}{7} = \frac{-12}{7} \approx -1.714$

(here $\bar{x} = \bar{y} = 3$)

$s_{xy}$ is called the sample covariance and is a measure of the linear relationship between x and y. To make this value independent of the units of measurement we divide by $s_x = 1.690$ and $s_y = 1.414$.
The scatter plot of this sample data is as follows:
[scatter plot of the eight (x, y) pairs]
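To verify the arithmetic, here is a short Python sketch (illustrative only; Minitab is the course tool, and the variable names are ours) computing $s_{xy}$, $s_x$, $s_y$ and the correlation r defined below for these eight pairs.

import math

x = [1, 1, 2, 4, 5, 5, 2, 4]
y = [5, 4, 4, 2, 1, 2, 2, 4]
n = len(x)

xbar, ybar = sum(x) / n, sum(y) / n   # both equal 3
sxy = sum((a - xbar) * (b - ybar) for a, b in zip(x, y)) / (n - 1)
sx = math.sqrt(sum((a - xbar) ** 2 for a in x) / (n - 1))
sy = math.sqrt(sum((b - ybar) ** 2 for b in y) / (n - 1))

r = sxy / (sx * sy)
print(sxy, sx, sy, r)   # about -1.714, 1.690, 1.414, -0.717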
Pearson’s sample correlation coefficient r is
$r = \frac{s_{xy}}{s_x s_y} = \frac{-1.714}{(1.690)(1.414)} \approx -.717$
Note: r is always between -1 and +1. The nearer r is to -1, the closer the plotted points are to a straight line with negative slope. The nearer r is to +1, the closer the plotted points are to a straight line with positive slope. (If r is equal to -1 or +1, then all the points lie on a straight line.)
Note: Pearson’s sample correlation coefficient r is a measure of the linear relationship between two variables X and Y. Other relationships may exist which are not detected by r.
Example: Consider the sample of pairs (-1,1), (0,0), (1,1). We use
Minitab to calculate Pearson’s sample correlation coefficient r.
C1: -1, 0, 1
C2: 1, 0, 1
MTB > corr c1 c2
Correlations (Pearson)
Correlation of X and Y = 0.000
The correlation r is 0. Thus there is no linear component to the relationship between X and Y; note however that for each pair (x, y), $y = x^2$, so there is another type of relationship between X and Y (a quadratic relationship).
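A quick Python sketch (again, illustrative only) confirms this: the sample covariance of these three pairs is 0, so r = 0 even though $y = x^2$ exactly.

x = [-1, 0, 1]
y = [1, 0, 1]
n = len(x)
xbar, ybar = sum(x) / n, sum(y) / n
# Sum of cross-products: the numerator of the sample covariance.
num = sum((a - xbar) * (b - ybar) for a, b in zip(x, y))
print(num)   # 0.0, so r = num / ((n - 1) * sx * sy) = 0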
Correlation and Causality
Just because the value of r is close to 1 does not, of itself, mean that x “causes” y.
A strong (negative or positive) correlation does not necessarily imply a cause-and-effect relationship between X and Y. Often there is a third, hidden variable which creates an apparent relationship between X and Y. Consider the example below.
Example: A sample of students from a grade school was given a vocabulary test. A high positive correlation was found between
X = “student’s height” and Y = “student’s score on the test”.
Should one infer that growing taller will increase one’s vocabulary? Explain. Also indicate a third variable which could offer a plausible explanation for this apparent relationship.
Note: A cause-and-effect relationship is best established by an experiment in which other variables that influence X and Y are controlled.
Question: In the example above, how might a more controlled
experiment be performed?
Fitting a Straight Line: The Least Squares (Regression) Line
A set of points (bivariate observations) may or may not have a linear relationship. If we want to draw a straight line which “fits” these points, which line fits best? Our choice depends on what we use to define a “good” fit.
Given a sample of bivariate data $(x_1, y_1), (x_2, y_2), \ldots, (x_n, y_n)$, the least squares line L is the line fitted to the points in such a way that
$d_1^2 + d_2^2 + \cdots + d_n^2$
is as small as possible, where $d_i$ is the vertical distance between the point $(x_i, y_i)$ and the line.
The equation of the least squares line L is given by
$y = b_0 + b_1 x$, where $b_1 = \frac{s_y r}{s_x}$ and $b_0 = \bar{y} - b_1 \bar{x}$.
Note:
1. The sign of the slope ($b_1$) is the same as the sign of the sample correlation coefficient.
2. The least squares line always passes through the point $(\bar{x}, \bar{y})$.
Example: Consider the previous example of scores on two class quizzes. The sample data is (1,5), (1,4), (2,4), (4,2), (5,1), (5,2), (2,2), (4,4). In our calculation of r = -.717, we also found that
$\bar{x} = 3$, $\bar{y} = 3$, $s_x = 1.690$ and $s_y = 1.414$.
Thus,
$b_1 = \frac{s_y r}{s_x} = \frac{(1.414)(-.717)}{1.690} \approx -0.6$
$b_0 = \bar{y} - b_1 \bar{x} = 3 - (-0.6)(3) = 3 + 1.8 = 4.8$
and the least squares line is
$y = b_0 + b_1 x = 4.8 - 0.6x$.
SKETCH:
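The sketch below redoes this fit in Python (illustrative only; the variable names are ours). It also checks note 2 above, that the line passes through $(\bar{x}, \bar{y})$, and previews the coefficient of determination $r^2$ discussed next.

import math

x = [1, 1, 2, 4, 5, 5, 2, 4]
y = [5, 4, 4, 2, 1, 2, 2, 4]
n = len(x)
xbar, ybar = sum(x) / n, sum(y) / n

sxy = sum((a - xbar) * (b - ybar) for a, b in zip(x, y)) / (n - 1)
sx = math.sqrt(sum((a - xbar) ** 2 for a in x) / (n - 1))
sy = math.sqrt(sum((b - ybar) ** 2 for b in y) / (n - 1))
r = sxy / (sx * sy)

b1 = sy * r / sx         # slope, about -0.6
b0 = ybar - b1 * xbar    # intercept, about 4.8
print(b0, b1)
print(abs((b0 + b1 * xbar) - ybar) < 1e-9)   # True: line passes through (xbar, ybar)
print(r ** 2)            # about 0.514, between 0 and 1 as the next note states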
Note: 1. The least squares line only fits the points used in its calculation. Do not be tempted to extend it below the smallest x-value or above the largest x-value. New observations with x-values outside this range may not fall anywhere close to the fitted line. So if a situation requires you to extend the line, do so with caution.
2. In the language of regression, $r^2$ is known as the COEFFICIENT OF DETERMINATION. It is used in regression to measure how well the least squares line fits the points. $r^2$ satisfies the following inequality:
$0 \le r^2 \le 1$.
Pearson’s Sample Correlation Coefficient and the Least Squares Line on Minitab
Example: The data below gives the cost estimates and actual costs (in millions) of a
random sample of 10 construction projects at a large industrial facility.
Estimate(X)   Actual(Y)
   44.277      51.174
    2.737       9.683
    7.004      14.827
   22.444      22.159
   18.843      26.537
   46.514      50.281
    3.165      15.550
   21.327      23.896
   42.337      50.144
    7.737      13.567
To find Pearson’s sample correlation coefficient r for this data and the equation of the least squares line using Minitab, proceed as follows.
Name C1 as EST(X) and C2 as ACT(Y), then enter the x-data into C1 and the corresponding y-data into C2. Then
Click: Stat → Regression → Fitted Line Plot
Type C2 in the Response (Y) box and C1 in the Predictor (X) box. Now click OK.
Regression Plot
ACT(Y) = 7.49228 + 0.937658 EST(X)
S = 3.49312   R-Sq = 96.0%   R-Sq(adj) = 95.5%
[fitted line plot of ACT(Y) versus EST(X); EST(X) axis from 0 to 50, ACT(Y) axis from 10 to 50]
From the output we find that the coefficient of determination is $r^2 = .960$ and the equation of the least squares line is
$y = 7.49228 + .937658x$.
Questions: (1) What is the value of the sample correlation coefficient?
Interpret it.
(2) Predict the actual cost of a project whose estimated cost is 15
million dollars.
(3) Is it safe to use this data to predict the actual cost of a project
estimated to cost 80 million dollars? Explain.
(4) Estimate the average change in the actual cost of a project when
the estimated cost increases by 10 million.
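As a check on questions (1), (2) and (4), the Python sketch below (illustrative only; the Minitab output above is the course's method, and the variable names are ours) refits the least squares line by hand and evaluates it. The coefficients match the Minitab output.

import math

est = [44.277, 2.737, 7.004, 22.444, 18.843, 46.514, 3.165, 21.327, 42.337, 7.737]
act = [51.174, 9.683, 14.827, 22.159, 26.537, 50.281, 15.550, 23.896, 50.144, 13.567]
n = len(est)

xbar, ybar = sum(est) / n, sum(act) / n
sxy = sum((a - xbar) * (b - ybar) for a, b in zip(est, act)) / (n - 1)
sx = math.sqrt(sum((a - xbar) ** 2 for a in est) / (n - 1))
sy = math.sqrt(sum((b - ybar) ** 2 for b in act) / (n - 1))
r = sxy / (sx * sy)

b1 = sy * r / sx        # slope, about 0.9377
b0 = ybar - b1 * xbar   # intercept, about 7.4923
print(r, r ** 2)        # r about 0.98, r-squared about 0.960
print(b0 + b1 * 15)     # predicted actual cost for a 15-million estimate, about 21.56
print(10 * b1)          # average change per 10-million increase in the estimate, about 9.38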