CDA COLLEGE (LIMASSOL) 2014 - 2015
MTH 221 (Y2S4) STATISTICS II
FOR BUSINESS ADMINISTRATION
Syllabus:
1. Review: Location and Dispersion
2, 3. Sampling, Estimation and Confidence Intervals
4, 5. Hypothesis Testing – Distribution-free tests
6, 7. Review – Midterm
8, 9. Correlation and Regression
10, 11. Time Series and Forecasting
12. Introduction to the Analysis of Variance
13. Practice
50% Homework and 50% Final Exam: Passing Mark 50
Textbooks (MTH 121, Semester 2 / MTH 221, Semester 4):
1. Sanders, Murph, Eng: “STATISTICS, A FRESH APPROACH”, McGraw-Hill
2. Francis, M.: ADVANCED LEVEL STATISTICS, Stanley Thornes Publishers
3. Hamburg, M.: BASIC STATISTICS, Harcourt Brace Jovanovich
WEEK 1 : REVIEW - STATISTICAL MEASURES
Measures of Location:

Minimum, Maximum; Average: $\bar{X} = \frac{1}{n}\sum_{i=1}^{n} x_i$ (single values), or $\bar{X} = \frac{1}{n}\sum_{j=1}^{k} f_j x_j$ (grouped data)

Median (the middle value): $m = L + \dfrac{(n+1)/2 - F}{f}\, w$

Similarly, Quartiles: $Q_1 = L + \dfrac{(n+1)/4 - F}{f}\, w$ ; $Q_3 = L + \dfrac{3(n+1)/4 - F}{f}\, w$

where n is the sample size, k is the number of classes, f is the frequency of the class containing the median (or quartile), $x_j$ is the midpoint of the class, L is the lower boundary of the class, F is the cumulative frequency of the previous class and w is the class width.

Mode is the value with the highest frequency.

Methods of calculation: single values, frequencies, discrete, continuous
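The interpolation formula for the median and quartiles can be sketched in a few lines of Python. The frequency table below is purely illustrative (not data from these notes), and `interpolated_quantile` is a hypothetical helper name.

```python
# Interpolated quantile for grouped data: value = L + (k - F)/f * w,
# where k is the target rank, as in the median/quartile formulas above.

def interpolated_quantile(classes, k):
    """classes: list of (lower_boundary, width, frequency); k: target rank."""
    cum = 0  # F: cumulative frequency of the previous classes
    for lower, width, freq in classes:
        if cum + freq >= k:
            # the k-th ranked value falls in this class: interpolate within it
            return lower + (k - cum) / freq * width
        cum += freq
    raise ValueError("rank beyond the data")

# Hypothetical frequency table: classes 0-10, 10-20, 20-30 with n = 20
table = [(0, 10, 4), (10, 10, 10), (20, 10, 6)]
n = 20
median = interpolated_quantile(table, (n + 1) / 2)   # rank 10.5
q1 = interpolated_quantile(table, (n + 1) / 4)       # rank 5.25
```

The same helper serves the median, $Q_1$ and $Q_3$ by changing only the target rank.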
Examples – Exercises
1. Keith records the amount of rainfall, in mm, at his school, each day for a week. The results are given below.

2.8   5.6   2.3   9.4   0.0   0.5   1.8

(a) Calculate the mean, the median and the mode of the amount of rainfall during the 7 days.

Keith realizes that he has transposed two of his figures. The number 9.4 should have been 4.9 and the number 0.5 should have been 5.0. Keith corrects these figures.

(b) State, giving your reason, the effect this will have on the mean.
2. The following table summarises the birth weights of a random sample of 100 babies born in a clinic over a year.

Weight in Kg   Frequency
1.5-1.9        2
2.0-2.4        9
2.5-2.9        12
3.0-3.4        18
3.5-3.9        22
4.0-4.4        17
4.5-4.9        13
5.0+           7

(a) Write down the upper class boundary of the second class.
(b) Represent these data by a histogram.
(c) Calculate estimates of the median and the quartiles of these birth weights.
(d) Comment on the skewness of these data.
Measures of Dispersion:

Range: $x_{\max} - x_{\min}$

Variance: $s^2 = \dfrac{1}{n-1}\sum_{j=1}^{k}(x_j - \bar{x})^2$ or $s^2 = \dfrac{1}{n-1}\left(\sum_{j=1}^{k} x_j^2 - n\bar{x}^2\right)$

Standard deviation: $s = \sqrt{s^2}$

Interquartile range: $IR = Q_3 - Q_1$
Properties; appropriateness
Computer; calculator
Applications
1. The data below represent the cost of electricity during July 2004 for a random sample of 50 one-bedroom apartments in a city:

 96 157 141  95 108 171 185 149 163 119
202  90 206 150 183 178 116 175 154 151
147 172 123 130 114 102 111 128 143 135
153 148 144 187 191 197 213 168 166 137
127 130 109 139 129  82 165 167 149 158

a. After constructing the stem-and-leaf plot, form a frequency distribution that has class intervals with upper class limits $99, $119 and so on.
b. Draw the ogive and estimate the median and the quartiles from your graph.
c. Estimate the mean, the range, the standard deviation and the interquartile range algebraically.
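The two forms of the variance formula above can be checked against each other numerically. The sketch below uses the first ten electricity costs from the exercise as a convenient sample; the variable names are illustrative.

```python
# Check that the two variance formulas agree:
# s^2 = (1/(n-1)) * sum((x - xbar)^2) = (1/(n-1)) * (sum(x^2) - n*xbar^2)
from math import sqrt

x = [96, 157, 141, 95, 108, 171, 185, 149, 163, 119]  # first ten costs above
n = len(x)
xbar = sum(x) / n

s2_deviations = sum((xi - xbar) ** 2 for xi in x) / (n - 1)
s2_shortcut = (sum(xi ** 2 for xi in x) - n * xbar ** 2) / (n - 1)

s = sqrt(s2_deviations)          # standard deviation
value_range = max(x) - min(x)    # range = x_max - x_min
```

The shortcut form avoids a second pass over the data, at the price of some rounding error when the values are large relative to their spread.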
WEEKS 2 and 3
SAMPLING, ESTIMATION AND CONFIDENCE INTERVALS
“ Knowledge is happiness…”
THE PROBLEM:
Here we have one of the fundamental problems in Statistics:
From a relatively small sample, we try to make inferences about the whole
population. Either with a specified value (Point Estimation), or with an interval,
which intends to cover the true parameter value with a prespecified confidence
(Interval Estimation).
In Probability Theory, we calculate the probability of an event before the experiment, given of course the values of the parameters.
Notation: $f(x;\theta)$ or $f(x \mid \theta)$,
where X is the vector of observations, and θ is the vector of parameters.
In Statistical Theory, we estimate the values of the parameters θ, after the
experiment, based on the observed data X.
SAMPLING
“You don’t have to eat the whole cheesecake to realize that it turned sour”
The Problem: One of the fundamental objectives of statistical
science is to make inferences about the whole
population, examining a small portion of it!
Simple Random Sample:
A collection of Independent Identically Distributed (i.i.d.) random variables $X_1, X_2, \ldots, X_n$
or, we can consider it as a small group of size n, randomly taken from the
population of size N. To draw the sample, we can use random numbers, drawn
from the computer, a calculator, or from statistical tables.
In order to understand some of the important aspects of sampling, we consider a
very simple special example:
Consider the “population” of 4 families in a small village with numbers of children:
{ 3, 1, 0, 8}
Population size: N= 4; Variable: X= the number of children in a family.
Note that for this population
$\mu = \dfrac{1}{N}\sum X = \dfrac{1}{4}(3+1+0+8) = 3$,

$\sigma^2 = \dfrac{1}{N}\sum (X-\mu)^2 = \dfrac{(3-3)^2+(1-3)^2+(0-3)^2+(8-3)^2}{4} = \dfrac{0+4+9+25}{4} = 9.5$
If we take a sample of size n = 2, without replacement, the number of possible samples is $\binom{4}{2} = 6$, and the sampling distribution is the set of all values of $\bar{X}$, along with the corresponding probabilities, for all possible samples:

Sample   $\bar{X}$   Pr
(3,1)    2      1/6
(3,0)    1.5    1/6
(3,8)    5.5    1/6
(1,0)    0.5    1/6
(1,8)    4.5    1/6
(0,8)    4      1/6
Totals   18     1.0
Note that the median of the population is 2 and the mode does not exist!
Now:

$E(\bar{X}) = \sum_{\text{all }\bar{X}} \bar{X}\,P(\bar{X}) = \frac{1}{6}(2 + 1.5 + \ldots + 4) = 3\ (=\mu)$

$Var(\bar{X}) = \frac{1}{6}\left[(2-3)^2 + (1.5-3)^2 + \ldots + (4-3)^2\right] = 3.167\ \text{(verify!)}$

$= \dfrac{\sigma^2}{n}\cdot\dfrac{N-n}{N-1} = \dfrac{9.5}{2}\cdot\dfrac{4-2}{4-1} = 3.167$
Infinite Population
Starting from the population parameters E(X) = μ and Var(X) = σ², and the proportion π, we have the sample statistics:

Sample size: n

Sample average: $\bar{X} = \dfrac{1}{n}\sum_{i=1}^{n} X_i$ ; Proportion: $p = \dfrac{X}{n}$, where $X = \sum_{i=1}^{n} X_i$ is the number of “successes” (Binomial distribution)

Variance: $s^2 = \dfrac{1}{n-1}\sum_{i=1}^{n}(X_i - \bar{X})^2 = \dfrac{1}{n-1}\left(\sum_{i=1}^{n} X_i^2 - n\bar{X}^2\right)$ ; Total: $T = \sum_{i=1}^{n} X_i$

with the properties:

$E(\bar{X}) = \mu$, $E(T) = n\mu$, $E(p) = \pi$, and

$Var(\bar{X}) = \dfrac{\sigma^2}{n}$, $Var(T) = n\sigma^2$, $Var(p) = \dfrac{\pi(1-\pi)}{n}$
INTERVAL ESTIMATION
Point estimation (estimating a parameter with a single value) gives us some idea about the value of the parameter, but important information about the precision of this estimate is missing! An estimate of the population mean μ is given by $\hat{\mu} = \bar{X}$; for the variance Var(X) = σ², the estimate is $\hat{\sigma}^2 = S^2$; for the proportion π, it is $\hat{\pi} = p$; and for the correlation coefficient ρ, it is $\hat{\rho} = r$.

This procedure is useful, but almost always off target. A more realistic and “safe” procedure is often entertained through Confidence Intervals.

Given a sample $X_1, X_2, \ldots, X_n$, we define a P% Confidence Interval (CI) for the parameter θ as a random interval $[T_1, T_2]$, such that:

$\Pr(T_1 \le \theta \le T_2) = P\%$ (Central),

where $T_1$ and $T_2$ are the values of the statistic T, obtained from the sample. As a general principle, the C.I. for θ is

$\theta = \hat{\theta} \pm z \cdot S.E.(\hat{\theta})$, approximately!
Some commonly used intervals:
1. An exact P% C.I. for μ when σ² is known (or approximate, when σ² is unknown but n is large, n > 100), is obtained by solving for μ the inequality:

$-z \le \dfrac{(\bar{X}-\mu)\sqrt{n}}{\sigma} \le z$  to find  $\mu = \bar{X} \pm z\,\dfrac{\sigma}{\sqrt{n}}$,

where of course $\bar{X} = \frac{1}{n}\sum_{i=1}^{n} X_i$ and z is the appropriate value from a standardized normal distribution, corresponding to the prespecified degree of “confidence” (probability) P.

The random interval covers the true but unknown value of μ with probability P%, considering all possible samples of the same size from the population!

Care should be taken, since this is different from an interval about individual values of N(μ, σ²),

$X = \mu \pm z\sigma$,

which is much wider!
2. A P% C.I. for μ when σ² is unknown and n is small (n ≤ 100) is given by

$\bar{X} \pm t_{(n-1)}\,\dfrac{s}{\sqrt{n}}$, where $s^2 = \dfrac{1}{n-1}\sum_{i=1}^{n}(X_i - \bar{X})^2$ is the sample variance and

$t_{(n-1)}$ is the appropriate value from a Student's t-distribution with (n−1) degrees of freedom!
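A t-interval of this kind can be sketched as follows. The data are an illustrative small sample, and the critical value $t_{(9)} = 2.262$ (95%, two-sided) is read from Student's t tables, exactly as the notes suggest:

```python
# 95% t-based confidence interval for mu when sigma^2 is unknown and n small:
# xbar +/- t_(n-1) * s / sqrt(n)
from math import sqrt

data = [5.0, 4.5, 4.8, 5.2, 4.3, 5.1, 5.2, 4.9, 5.1, 5.0]  # illustrative sample
n = len(data)
xbar = sum(data) / n
s = sqrt(sum((x - xbar) ** 2 for x in data) / (n - 1))  # sample std deviation

t9 = 2.262                        # t table value, 95% central, 9 d.f.
half_width = t9 * s / sqrt(n)
ci = (xbar - half_width, xbar + half_width)
```

For n > 100 the same code applies with the normal value z = 1.96 in place of the t value.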
3. Approximate Confidence Intervals:
For the intensity rate λ of a Poisson distribution, or for the proportion π of a Binomial:

For Poisson: $\lambda = x \pm z\sqrt{x}$, where x is the observed value (number of events in a rather wide time interval), and then, by possibly dividing accordingly for the required interval!

For Binomial: $\pi = p \pm z\sqrt{\dfrac{p(1-p)}{n}}$, where $p = \dfrac{x}{n}$, the sample proportion.

This interval is approximate for two reasons!
(i) The Binomial distribution is approximated by the corresponding Normal, and
(ii) π, the theoretical proportion, is replaced by p, the sample proportion!

Aside: Note that an interval closer to the exact C.I. (by overcoming the second approximation) for π may be obtained by solving for π the relevant quadratic inequality

$-z \le \dfrac{(p-\pi)\sqrt{n}}{\sqrt{\pi(1-\pi)}} \le z$

and this interval turns out to be:

$\pi = \dfrac{2np + z^2 \pm z\sqrt{z^2 + 4np(1-p)}}{2(n + z^2)}$
4. A P% C.I. for the population variance σ² is given by

$\dfrac{(n-1)s^2}{u_2} \le \sigma^2 \le \dfrac{(n-1)s^2}{u_1}$, where $s^2 = \dfrac{1}{n-1}\sum_{i=1}^{n}(X_i - \bar{X})^2$ is the sample variance and $(u_1, u_2)$ are the corresponding points from a $\chi^2_{(n-1)}$ distribution with (n−1) degrees of freedom, covering P% central probability!
Examples - Exercises
1. A large bag of coins contains 20c, 50c and 100c (1€) coins, in the ratio 3:2:1.
(a) Find the mean and the variance for the value of coins in this population.
A random sample of two coins is taken and their values $X_1$ and $X_2$ are recorded.
(b) List all possible samples.
(c) Find the sampling distribution for the average $\bar{X} = \dfrac{X_1 + X_2}{2}$
2. The following grouped frequency distribution summarizes the time, to the
nearest minute, spent waiting by a sample of patients in a doctor’s surgery.
Waiting Time (to nearest minute)   Number of Patients
3 or less                          6
4-6                                15
7-8                                27
9                                  49
10                                 52
11-12                              29
13-15                              13
16 or more                         9

The average of the times was 9.63 minutes and the standard deviation was 3.05 minutes (taking the missing limits to be 0 and 20).
a) Using interpolation, estimate the median and semi-interquartile range of these
data. For a normal distribution, the ratio of the semi-interquartile range to the
standard deviation would be approximately 0.67.
b) Calculate the corresponding value for the above data. Comment on your
result.
For a normal distribution, 90% of times would be expected to lie in the interval
(Mean ± 1.645 standard deviation)
c) Find the theoretical limits for these data.
d) Find the 90% C.I. for the mean μ and the variance σ² of the population from which this sample has been drawn.
e) Find also a 95% C.I. for the proportion π of patients who wait longer than 15 minutes.
WEEKS 4 and 5: HYPOTHESIS TESTING
“Doubt is the root of progress”
The Problem: Why testing? First, there should be a claim for the value of a
parameter (μ, σ, π, λ, and so on) or a model, and then the
evidence from the observed data should be left to decide about
the claim!
A test is a statistical procedure which decides with some confidence whether a
statement – hypothesis, about a parameter value, or the whole model, is
valid!
The Characteristics or the “ingredients” of a test!
The supposed claims are formulated as follows:
(a) Null Hypothesis: H 0 : This statement should specify completely the value of
the parameter of interest! e.g. μ=50 vs:
(b) Alternative: H₁: This is a complementary statement to H₀, in some direction! e.g. μ > 50, or μ < 50 (one-tailed), or μ ≠ 50 (two-tailed tests);
Test Statistic: e.g. $Z = \dfrac{(\bar{X}-\mu)\sqrt{n}}{\sigma} \sim N(0,1)$; This is (used to be called) a statistic from the sample, which serves as the criterion for decision – accept or reject H₀; (in fact this is rather a Pivotal Quantity, since by definition a Statistic involves only functions of the sample!)
Significance level: α = Pr (we observe what we did, under H 0 , or even worse in
the direction of H 1 ) = (p-value);
Often this is predetermined at 5% or at 1%, but it can take any value!
Critical value; Critical region: The value, or rather the set of values of the test
statistic Z or X which lead to rejection of H 0 .
The Decision Rule: This is a rule which is based on a specified value T* of the test statistic and is of the form:
Reject H₀ if $T_{obs} \ge T^*$ or $T_{obs} \le T^*$;
Accept H₀ otherwise.
Relation with Confidence intervals:
If the assumed value of μ lies inside the C.I., then we accept H 0 ; if, however it is
outside the interval, then we reject the null hypothesis in favor of the alternative.
Example: Two friends Roger and Marcos play the best of five series (at most 5)
games of tennis to decide whether Roger is better (Roger’s claim) or
they are equally competent!
Let the parameter of interest π = Pr (Roger wins any game);
H₀: π = 0.5 (Marcos’s claim) vs H₁: π > 0.5 (Roger’s claim)
The test statistic naturally is the final score (outcome of the game) with the
corresponding distribution as in the table:
Final score               3-0    3-1    3-2    2-3    1-3    0-3
Probability (under H₀)    4/32   6/32   6/32   6/32   6/32   4/32
Suppose that the game ends (3 – 0) for Roger. Obviously there is evidence in
favor of Roger.
The observed significance level is:
S.L.= P(score 3 – 0, or better for R) = 4/32 = 1/8 = 0.125
PRINCIPLE: The rationale behind a statistical test is the following;
“If a small value of the significance level α is observed, or
equivalently an extreme value of the test statistic T is realized, then
we are faced with two options:
(a) Either H₀ is true and we just observed a quite rare event, or
(b) H₀ is false, and that is the reason we have observed the realized outcome”!
Therefore, following the above rationale, we reject the Null Hypothesis when we
observe a rather improbable outcome!
It should be emphasized that when we decide under uncertainty, we take a risk of
making an error as follows:
                      UNKNOWN STATE OF NATURE
DECISION         H₀ is true                H₁ is true
Accept H₀        Correct decision          Type II error
                 Pr. = 1 − α               Pr. = β
Reject H₀        Type I error              Correct decision
                 Pr. = α = SL              Pr. = 1 − β = power
We can explain all these, using an easy and realistic example;
At a backgammon tournament, there are (among others, of course) the two
players Adam and Daniel, in short A and D.
Adam claims that he is a better player than Daniel, whereas Daniel modestly
argues that there is nothing between them. A test may be formulated as follows:
Let π = Pr(Adam winning any game)
Null Hypothesis H₀: $\pi = \frac{1}{2}$ (Daniel’s claim)
Alternative Hypothesis H₁: $\pi > \frac{1}{2}$ (Adam’s claim)
If a decision is to be taken on the basis of a series of, say 10 games, then the
Test statistic is naturally
X= number of wins for Adam, X~ Bin(n, π), n=10
Now suppose that the series ended with the score 9-1 for Adam, so X obs = 9.
The observed significance level: $\alpha = \Pr(X \ge 9) = \sum_{x=9}^{10}\binom{10}{x}\, 0.5^x\, (1-0.5)^{10-x} = 0.0107$
Taking into account the above quite unlikely result, we reject H 0 in favor of H 1 .
Note however that in real situations the decision rule is traditionally
Reject H 0 if X ≥ 6,
Accept H 0 otherwise
In this case the critical region is X= 6, 7... 10
Following along the same lines, we get
The significance level: $\alpha = \Pr(X \ge 6) = \sum_{x=6}^{10}\binom{10}{x}\, 0.5^x\, (1-0.5)^{10-x} = 0.3770$
Basic Parametric Tests
(a) Tests for the mean μ: H₀: μ = μ₀

(i) Known variance, or large sample size (n > 50, replacing of course the unknown σ with the sample standard deviation s, which is an estimate of σ):

Test statistic: $Z = \dfrac{(\bar{X}-\mu)\sqrt{n}}{\sigma} \sim N(0,1)$

(ii) Unknown variance and small sample size:

$T = \dfrac{(\bar{X}-\mu)\sqrt{n}}{s} \sim t_{(n-1)}$

(iii) Tests for two samples (sizes n, m): H₀: $\mu_x = \mu_y$

$Z = \dfrac{(\bar{Y}-\bar{X}) - (\mu_y - \mu_x)}{\sqrt{\dfrac{\sigma_x^2}{n} + \dfrac{\sigma_y^2}{m}}} \sim N(0,1)$ (known variances), or

$T = \dfrac{(\bar{Y}-\bar{X}) - (\mu_y - \mu_x)}{s\sqrt{\dfrac{1}{n} + \dfrac{1}{m}}} \sim t_{(n+m-2)}$,

where $s^2 = \dfrac{(n-1)s_x^2 + (m-1)s_y^2}{n+m-2}$ is the pooled estimate of the common population variance.
iv) Paired t-test: When natural pairing exists between the observations in the two samples, the most efficient test is the paired one, and the set-up is the following:

H₀: $\mu_x = \mu_y$, or $\delta = \mu_x - \mu_y = 0$

Paired sample: $X_1, X_2, \ldots, X_n$ and $Y_1, Y_2, \ldots, Y_n$; differences $D_i = X_i - Y_i$

Assumptions: $D \sim N(\delta, \sigma^2)$, independent

$\bar{d} = \frac{1}{n}\sum_{i=1}^{n} D_i$, and $S_d^2 = \frac{1}{n-1}\sum_{i=1}^{n}(D_i - \bar{d})^2$

Test statistic: $T = \dfrac{(\bar{d}-\delta)\sqrt{n}}{s_d} \sim t_{(n-1)}$
Examples – Exercises
1. A student takes a multiple choice test. The test is made up of 10 questions, each with 5 possible answers. The student gets 4 questions correct. Her teacher claims she was guessing the answers. Using a one-tailed test, at the 5% level of significance, test whether or not there is evidence to reject the teacher’s claim. State your hypotheses clearly.
2. A random sample of 10 mustard plants had the following height, in mm, after 4
days growth.
5.0, 4.5, 4.8, 5.2, 4.3, 5.1, 5.2, 4.9, 5.1, 5.0
Those grown previously had a mean height of 5.1 mm after 4 days. Using a 5%
significance level, test whether or not the mean height of these plants is less than
those grown previously.
(You may assume that the heights above are normally distributed)
DISTRIBUTION FREE METHODS
Non-parametric methods:
(i) One sample with size n:
Assumption of normality is waived, but symmetry is needed!
As a general principle these tests are obtained by applying the
corresponding parametric tests on the ranks!
(a) Sign test : The test is based only on the sign of the differences of the
observed values X 1 , , X n from the assumed median η 0
Null Hypothesis H₀: η (population median) = η₀ vs. H₁: η > η₀, η < η₀, or η ≠ η₀
First we obtain the differences:
d i = x i – η 0 (ignoring any 0’s and adjusting the sample size accordingly!).
The Test Statistic is U = # of +’s (or –‘s) (whichever is smaller!) in the differences
di ;
Now, under the null hypothesis H₀: δ (median of the differences) = 0,
U ~ Bin(n, ½); S.L. = Pr(U ≤ u_obs) or Pr(U ≥ u_obs)
For a two tailed test, just double the observed S.L., you have obtained from the
one tailed test.
For large samples (say, n>20), U~ N (μ, σ 2 ), approximately, where μ = n/2, σ 2 =
n/4 and we proceed accordingly with the test statistic:
$Z = \dfrac{u - \frac{n}{2}}{\frac{\sqrt{n}}{2}} = \dfrac{2u - n}{\sqrt{n}} \sim N(0,1)$

(Normal approximation to the Binomial!)
(b) Wilcoxon signed rank test (one sample with size n):
This test is based on the ranks of the signed differences of the observed values
X 1 , , X n from the assumed median η 0
Null hypothesis H 0 : η (pop’n median) =η 0 vs. H 1 : η > η 0 , η < η 0 or η  η 0
Obtain the differences d i = x i – η 0 , but have in mind to ignore 0’s, adjusting the
sample size accordingly, and then assign ranks to the resulted differences,
considered as one set, ignoring the sign.
When there are ties in the values, then take the average rank of the tied
observations.
Eventually, you get two totals:
T⁺: the sum of the ranks for the positive differences, and
T⁻: the sum of the ranks for the negative differences
Check however: since T⁺ + T⁻ is the sum of the first n natural numbers, we must always have: $T^+ + T^- = \dfrac{n(n+1)}{2}$
Test statistic: T= T  or T  (whichever is smaller), and compare the value with
the critical value on the appropriate table for the test!
This is a better test than the sign test, since it takes into account the ranks, and
hence the relative magnitude of the differences and not only the signs!
There is however the (not unusual ! ) case, where we observe a lot of (say)
positive differences, but very few, absolutely big negative differences. In this
situation the Sign Test, will turn out to be significant, but the Wilcoxon Test will
not!
For large samples (say, n>20), then T~ N (μ, σ 2 ), approximately, where
μ= n(n+1)/4,
σ 2 = n(n+1)(2n+1)/24 and based on these we can proceed as in Normal tests.
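Both tests can be sketched compactly. The differences below are illustrative (no zeros, no tied absolute values, so the simple ranking suffices), and the function names are hypothetical:

```python
# Sign test: U = number of the rarer sign among differences d_i = x_i - eta_0,
# with U ~ Bin(n, 1/2) under H0; S.L. = Pr(U <= u_obs).
from math import comb

def sign_test_sl(diffs):
    d = [v for v in diffs if v != 0]          # ignore zero differences
    n = len(d)
    plus = sum(1 for v in d if v > 0)
    u = min(plus, n - plus)                   # whichever sign is rarer
    return sum(comb(n, x) * 0.5 ** n for x in range(0, u + 1))

# Wilcoxon signed-rank totals T+ and T- (ties not handled in this sketch).
def wilcoxon_totals(diffs):
    d = sorted((v for v in diffs if v != 0), key=abs)   # rank by |difference|
    t_plus = sum(rank for rank, v in enumerate(d, start=1) if v > 0)
    n = len(d)
    return t_plus, n * (n + 1) // 2 - t_plus            # T+ + T- = n(n+1)/2

diffs = [1.2, -0.4, 2.0, 0.7, -0.1, 1.5, 0.9]           # illustrative differences
sl = sign_test_sl(diffs)
t_plus, t_minus = wilcoxon_totals(diffs)
```

The smaller of T⁺ and T⁻ is then compared with the critical value in the Wilcoxon table, while the sign-test S.L. is read off directly.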
CONTINGENCY TABLE (m × n)
Test for Association or Independence (Karl Pearson, 1857–1936):

(The values in the cells are frequencies. If the values are percentages, then, before proceeding to the test, we have to transform them to frequencies.)

H₀: No association between the two factors, or: the two classifications A and B are independent (m rows and n columns).

Cl.A \ Cl.B   B₁     B₂     …   Bₙ     Total
A₁            o₁₁    o₁₂    …   o₁ₙ    r₁
A₂            o₂₁    o₂₂    …   o₂ₙ    r₂
…             …      …      …   …      …
Aₘ            oₘ₁    oₘ₂    …   oₘₙ    rₘ
Totals        c₁     c₂     …   cₙ     N

Expected frequency in each cell (i-th row, j-th column), under H₀:

$e_{ij} = \dfrac{(\text{row total}) \times (\text{column total})}{\text{grand total}} = \dfrac{r_i c_j}{N}$
For the test to be “good” and reliable (according to Pearson), each expected
frequency, should always be at least 5; otherwise we have to combine adjacent
or similar classes, in order to achieve all expected frequencies to be at least 5.
Test statistic:

$D = \sum_{i,j} \dfrac{(o_{ij} - e_{ij})^2}{e_{ij}} \sim \chi^2_{(m-1)(n-1)}$

For a 2 × 2 contingency table, we often use Yates’ continuity correction, and the appropriate test statistic becomes

$D = \sum_{i,j} \dfrac{(|o_{ij} - e_{ij}| - 0.5)^2}{e_{ij}} \sim \chi^2_{(1)}$

The correction (−0.5) drives the observed value D away from the critical region. So if the uncorrected value is not significant, the corrected one will also be insignificant!
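The 2 × 2 statistic, with and without Yates' correction, can be sketched as below, using the driving-test data of exercise 2 that follows; `chi2_2x2` is an illustrative helper name:

```python
# Chi-square statistic D = sum (|o - e| - c)^2 / e for a 2x2 table,
# with expected frequencies e_ij = r_i * c_j / N and Yates' correction c = 0.5.
def chi2_2x2(table, yates=True):
    (a, b), (c, d) = table
    row = [a + b, c + d]          # row totals r_i
    col = [a + c, b + d]          # column totals c_j
    n = a + b + c + d             # grand total N
    stat = 0.0
    for i, obs_row in enumerate(table):
        for j, o in enumerate(obs_row):
            e = row[i] * col[j] / n
            diff = abs(o - e) - (0.5 if yates else 0.0)
            stat += diff ** 2 / e
    return stat

# Driving-test data: Male (23 pass, 27 fail), Female (32 pass, 18 fail)
corrected = chi2_2x2([[23, 27], [32, 18]])              # about 2.59
uncorrected = chi2_2x2([[23, 27], [32, 18]], yates=False)  # about 3.27
```

Against the 10% critical value $\chi^2_{(1)} = 2.706$ (from tables), the corrected statistic is not significant while the uncorrected one would be, which illustrates how the correction pulls D away from the critical region.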
Examples - Exercises
1. Manuel is planning to buy a new machine to squeeze oranges in his cafe and he
has two models, at the same price, on trial. The manufacturers of machine B
claim that their machine produces more juice from an orange than machine A. To
test this claim Manuel takes a random sample of 8 oranges, cuts them in half and
puts one half in machine A and the other half in machine B. The amount of juice,
in ml, produced by each machine is given in the table below.
Orange      1    2    3    4    5    6    7    8
Machine A   60   58   55   53   52   51   54   56
Machine B   61   60   58   52   55   50   52   58
Stating your hypotheses clearly, test, at the 10% level of significance, whether or
not the mean amount of juice produced by machine B is more than the mean
amount produced by machine A.
Use both parametric and non-parametric tests and compare!
2. A survey in a college was commissioned to investigate whether or not there was
any association between gender and passing a driving test. A group of 50 male
and 50 female students were asked whether they passed or failed their driving
test at the first attempt. All students asked had taken the test. The results were
as follows,
         Pass   Fail
Male     23     27
Female   32     18
Stating your hypotheses clearly, test, at the 10% level of significance, whether
there is any evidence of an association between gender and passing a driving
test at the first attempt.
WEEKS 8, 9
CORRELATION - REGRESSION
“Nothing is isolated!”
The Problem: Given a set of bivariate observations, we attempt to
estimate the “best” algebraic relationship between X
and Y
Correlation from the Sample:
-We are interested in measuring linear association between:
The response (dependent) Variable: Y, vs
The explanatory (independent) Variable X
-A first indication of the existence of this linear association is revealed if we draw the scatter diagram between X and Y. [Scatter diagrams of other possible shapes appeared here.]
-Correlation here means linear association between X and Y; it does not by itself establish cause and effect.
A numerical measure of how strong this linear association between X and Y is, is the sample product moment correlation coefficient (pmcc):

$r = \dfrac{s_{xy}}{s_x s_y}$, where $s_{xy} = \dfrac{1}{n-1}\left(\sum_{i=1}^{n} X_i Y_i - n\bar{X}\bar{Y}\right)$ is the sample covariance and

$s_x, s_y$ are the sample standard deviations of X and Y. Note that always −1 ≤ r ≤ 1, with r = ±1 for perfect correlation, positive or negative! (See the shapes above.)

Another way to calculate the pmcc: $r = \dfrac{SXY}{\sqrt{SXX \cdot SYY}}$

(It can be obtained easily from an advanced calculator!) Where
$SXY = \sum_{i=1}^{n}(X_i - \bar{X})(Y_i - \bar{Y}) = \sum_{i=1}^{n} X_i Y_i - n\bar{X}\bar{Y}$,

$SXX = \sum_{i=1}^{n}(X_i - \bar{X})^2 = \sum_{i=1}^{n} X_i^2 - n\bar{X}^2 = (n-1)s_x^2$,

$SYY = \sum_{i=1}^{n}(Y_i - \bar{Y})^2 = \sum_{i=1}^{n} Y_i^2 - n\bar{Y}^2 = (n-1)s_y^2$
Note that:
(i) The numerical existence of significant correlation does not necessarily imply a real (cause-and-effect) relationship between the two variables, unless of course some natural explanation exists! Spurious (unexplained, nonsense) correlation sometimes occurs!
(ii) On the other hand, a value of the pmcc close to 0 implies no linear association, but there might still be a non-linear association! (Examples are shown above on the scatter diagrams.)
(iii) For tests, or inferences on the true but unknown value of ρ, we can use the result:

$T = \dfrac{r\sqrt{n-2}}{\sqrt{1-r^2}} \sim t_{(n-2)}$

The product moment correlation coefficient (pmcc), moreover, is invariant under scale and location transformations, i.e. it remains exactly the same!
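The shortcut form r = SXY/√(SXX·SYY) and the invariance property are both easy to check numerically; the data are the first eight points of the regression example later in these notes:

```python
# pmcc via r = SXY / sqrt(SXX * SYY), using the shortcut sums.
from math import sqrt

def pmcc(x, y):
    n = len(x)
    xbar, ybar = sum(x) / n, sum(y) / n
    sxy = sum(a * b for a, b in zip(x, y)) - n * xbar * ybar
    sxx = sum(a * a for a in x) - n * xbar ** 2
    syy = sum(b * b for b in y) - n * ybar ** 2
    return sxy / sqrt(sxx * syy)

x = [1, 3, 4, 6, 6, 7, 8, 11]
y = [49, 51, 52, 52, 53, 53, 54, 56]
r1 = pmcc(x, y)
r2 = pmcc([10 * v + 5 for v in x], y)   # location/scale change: r is unchanged
```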
SPEARMAN’s RANK CORRELATION COEFFICIENT
The corresponding non- parametric, rank correlation
coefficient has been developed by Charles Spearman (1904)
and, basically this is the p.m.c.c. of the corresponding ranks.
It can also be evaluated through Spearman’s formula

$r_s = 1 - \dfrac{6\sum_{i=1}^{n} d_i^2}{n(n^2 - 1)}$

where $d_i$ is the difference between corresponding ranks of the two variables. When there are ties in the ranks, we take the average rank of the tied observations.
Note that the proof (see appendix A7) of the formula requires no ties, so the two
ways of calculation:
(i) from the pmcc of the ranks and
(ii) from the above formula of Spearman,
might be slightly different;
More accurate is the first one, obtained using a calculator!
Testing is performed through the use of appropriate tables of rank correlation
coefficient.
Comparisons:
(i) The rank correlation coefficient may be used when only ranking is available, or
the data are qualitative.
(ii) The big difference with the p.m.c.c, is that Spearman’s coefficient measures
agreement of ranks, but not necessarily linear association.
(iii) This coefficient (Spearman’s rank) does not rely on Bivariate Normal.
(iv) Numerically, the two coefficients are often close, but in special cases their
values might be quite apart!
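Spearman's shortcut formula can be sketched as below (a minimal version that assigns plain ranks, so it agrees with the pmcc-of-ranks only when there are no ties, exactly as the note above warns). The function names are illustrative:

```python
# Spearman's rank correlation via r_s = 1 - 6*sum(d^2) / (n(n^2 - 1)).
def ranks(values):
    order = sorted(range(len(values)), key=lambda i: values[i])
    r = [0] * len(values)
    for rank, i in enumerate(order, start=1):
        r[i] = rank
    return r                      # no tie handling in this sketch

def spearman(x, y):
    rx, ry = ranks(x), ranks(y)
    n = len(x)
    d2 = sum((a - b) ** 2 for a, b in zip(rx, ry))
    return 1 - 6 * d2 / (n * (n * n - 1))

rs = spearman([10, 20, 30, 40], [1, 3, 2, 4])   # illustrative data
```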
LINEAR REGRESSION
“ It is our opinion of the situation at one stage, but this must change, if
we find , at a later stage, that the facts are against it”
Glancing at the scatter diagram, we may suspect a rather strong linear association of the response variable Y on the explanatory variable X (a p.m.c.c. close to r = ±1 reveals that!). In this case, we use regression methods to establish the “suspected” linear relationship.
For example, for the set of points, given below, we draw the scatter diagram:
(X, Y) points: (1, 49), (3, 51), (4, 52), (6, 52), (6, 53), (7, 53), (8, 54), (11, 56),
(12, 56), (14, 57), (14, 58), (17, 59), (18, 59), (20, 60), (20, 61)
The line of “best” fit is obtained through the statistical regression model:
$Y_i = \alpha + \beta X_i + \varepsilon_i$,

where $\varepsilon_i$ is the unobserved error. The vertical random errors satisfy the conditions:
(i) $\varepsilon_i \sim N(0, \sigma^2)$, and
(ii) they are independent (i = 1, 2, …, n),

where α is the Y-intercept and β is the slope (gradient) of the straight line.

Variables: Explanatory, Independent: X; Response, Dependent: Y

Principle of regression:
Applying least squares methods, we obtain the estimates of the parameters α and β, which turn out to be the statistics a and b, by minimizing the sum of squared vertical errors with respect to α and β:

$SSE = \sum_{i=1}^{n} \varepsilon_i^2 = \sum_{i=1}^{n}(Y_i - \alpha - \beta X_i)^2$
Solving the resulting two regression equations (see appendix A4), the “best” estimates of the unknown parameters α and β turn out to be:

The slope: $\hat{\beta} = b = \dfrac{s_{xy}}{s_x^2}$, i.e. the sample covariance divided by the sample variance of X, or equivalently b = SXY/SXX, where SXY and SXX are defined as in the previous chapter (Correlation):

$SXY = \sum_{i=1}^{n}(X_i - \bar{X})(Y_i - \bar{Y}) = \sum_{i=1}^{n} X_i Y_i - n\bar{X}\bar{Y}$, $SXX = \sum_{i=1}^{n}(X_i - \bar{X})^2 = \sum_{i=1}^{n} X_i^2 - n\bar{X}^2$

The Y-intercept: $\hat{\alpha} = a = \bar{Y} - b\bar{X}$

The variance of the model: $\hat{\sigma}^2 = s^2 = \dfrac{SSE}{n-2}$

Another form of the fitted line is: $Y - \bar{Y} = b(X - \bar{X})$

From Analytic Geometry, this form indicates that the line passes through the point $G(\bar{X}, \bar{Y})$, the center of gravity of the data, and has gradient m = b.
Natural interpretation of the parameters of the regression Model:
a: Is the estimated value of the response variable Y, when the explanatory
variable X = 0
b: Represents the estimated change of the response variable Y, for a unit
increase of the explanatory variable X
Note that prediction may be obtained from the fitted line, but this is quite risky outside the range of observations (extrapolation), as in any other hard science!
To examine the goodness of fit of the entertained model, we can use the residuals:

$e_i = y_i - \hat{y}_i$,  i = 1, 2, …, n

and the coefficient of determination, $R^2 = 1 - \dfrac{SSE}{SYY}$. This is the proportion of total variation explained by the regression line. For this simple model, we have $R^2 = r^2$; in other words, the coefficient of determination is just the square of the p.m.c.c.

For an adequate fit, the coefficient of determination R² should be high (close to one!).

Also, the residuals, plotted against the explanatory variable X, should theoretically be scattered randomly, above and below the axis of X, within a narrow horizontal band.

Note also that $\sum_i e_i = 0$, always!
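The least-squares line for the fifteen (X, Y) example points listed earlier can be computed directly from b = SXY/SXX and a = Ȳ − bX̄:

```python
# Fit y = a + b*x to the fifteen example points and check R^2 and sum(e_i) = 0.
pts = [(1, 49), (3, 51), (4, 52), (6, 52), (6, 53), (7, 53), (8, 54),
       (11, 56), (12, 56), (14, 57), (14, 58), (17, 59), (18, 59),
       (20, 60), (20, 61)]
n = len(pts)
xbar = sum(x for x, _ in pts) / n
ybar = sum(y for _, y in pts) / n
sxy = sum(x * y for x, y in pts) - n * xbar * ybar   # SXY
sxx = sum(x * x for x, _ in pts) - n * xbar ** 2     # SXX
syy = sum(y * y for _, y in pts) - n * ybar ** 2     # SYY

b = sxy / sxx                     # slope
a = ybar - b * xbar               # intercept
r2 = sxy ** 2 / (sxx * syy)       # coefficient of determination = r^2
residual_sum = sum(y - (a + b * x) for x, y in pts)  # should be (numerically) 0
```

The high R² here reflects what the scatter diagram already suggests: the points lie very close to a straight line.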
Multiple Regression
The general linear model is explained here, through the powerful tools of Linear
Algebra and we have a look at it, from the elegant perspective of three
dimensional geometry!
For a multiple regression situation, we entertain the matrix model:

$y = X\beta + \varepsilon$, $\quad \varepsilon \sim MVN(0, \sigma^2 I_n)$

In matrix form:

$\begin{pmatrix} y_1 \\ y_2 \\ \vdots \\ y_n \end{pmatrix} = \begin{pmatrix} x_{11} & x_{12} & \cdots & x_{1k} \\ x_{21} & x_{22} & \cdots & x_{2k} \\ \vdots & \vdots & & \vdots \\ x_{n1} & x_{n2} & \cdots & x_{nk} \end{pmatrix} \begin{pmatrix} \beta_1 \\ \beta_2 \\ \vdots \\ \beta_k \end{pmatrix} + \begin{pmatrix} \varepsilon_1 \\ \varepsilon_2 \\ \vdots \\ \varepsilon_n \end{pmatrix}$

with the solution being: $b = (X^T X)^{-1} X^T y$, and $E(b) = \beta$; also $Var(b) = (X^T X)^{-1}\sigma^2$.

The predicted value for a given explanatory $x_i$ is:

$\hat{y}_i = x_i^t b = x_i^t (X^T X)^{-1} X^T y$, with $Var(\hat{y}_i) = x_i^t (X^T X)^{-1} x_i\,\sigma^2$,

and the estimate of $\sigma^2$ is $s^2 = \dfrac{(y - \hat{y})^T (y - \hat{y})}{n - k}$
Application
Matrix approach to the simple model
The simple model may be formulated in a multiple regression context as follows:

$y = X\beta + \varepsilon$, where

$y = \begin{pmatrix} y_1 \\ y_2 \\ \vdots \\ y_n \end{pmatrix}, \quad X = \begin{pmatrix} 1 & x_1 \\ 1 & x_2 \\ \vdots & \vdots \\ 1 & x_n \end{pmatrix}, \quad \beta = \begin{pmatrix} \alpha \\ \beta \end{pmatrix}, \quad \varepsilon = \begin{pmatrix} \varepsilon_1 \\ \varepsilon_2 \\ \vdots \\ \varepsilon_n \end{pmatrix}$

In this context, it turns out that

$X^T X = \begin{pmatrix} n & n\bar{x} \\ n\bar{x} & \sum_{i=1}^{n} x_i^2 \end{pmatrix}$, and $(X^T X)^{-1} = \dfrac{1}{n\,SXX}\begin{pmatrix} \sum_{i=1}^{n} x_i^2 & -n\bar{x} \\ -n\bar{x} & n \end{pmatrix}$,

so the variance-covariance matrix of the estimated coefficients (intercept and slope) is given by:

$Var\begin{pmatrix} a \\ b \end{pmatrix} = \dfrac{\sigma^2}{n\,SXX}\begin{pmatrix} \sum_{i=1}^{n} x_i^2 & -n\bar{x} \\ -n\bar{x} & n \end{pmatrix}$
A P% confidence region for the parameter β, or a test may be constructed from
the distributional result of the pivotal quantity
$P = \dfrac{(b-\beta)^T X^T X (b-\beta)/2}{(y-Xb)^T (y-Xb)/(n-2)} \sim F(2,\, n-2)$
In fact the confidence region will become an ellipsoid which will cover the true,
but unknown value of the vector β, with probability P%.
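For the two-parameter case, the normal-equations solution $b = (X^TX)^{-1}X^Ty$ can be written out explicitly using the 2 × 2 inverse derived above, with no matrix library at all. This is an illustrative pure-Python sketch; `fit_simple` is a hypothetical helper name:

```python
# b = (X^T X)^{-1} X^T y for the simple model y = alpha + beta*x + eps,
# using det(X^T X) = n * SXX and the explicit 2x2 inverse above.
def fit_simple(x, y):
    n = len(x)
    sx = sum(x)
    sxx_raw = sum(v * v for v in x)          # sum of x_i^2
    sy = sum(y)
    sxy_raw = sum(a * b for a, b in zip(x, y))
    det = n * sxx_raw - sx * sx              # = n * SXX
    a = (sxx_raw * sy - sx * sxy_raw) / det  # intercept
    b = (n * sxy_raw - sx * sy) / det        # slope
    return a, b

a, b = fit_simple([0, 1, 2, 3], [1, 3, 5, 7])   # data lie exactly on y = 1 + 2x
```

The same a and b come out of the scalar formulas a = Ȳ − bX̄ and b = SXY/SXX, which is the point of the matrix reformulation.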
How good is the model?
The goodness of the model is measured by
(i) The size of the sum of squared errors SSE, and, better
(ii) The value of the coefficient of determination:

$R^2 = \dfrac{SSR}{SST} = 1 - \dfrac{SSE}{SST}$
However, both these measures can be improved artificially (SSE reduced, R² increased) by introducing more independent variables X (more columns in the X matrix). As we know from geometry, the more explanatory variables we introduce, the closer the hyperplane comes to the observed vector, and naturally the question arises where to stop.
An adjusted coefficient of determination has been developed which takes into
account the extra variables. This is adjusted for the loss of the degrees of
freedom when we introduce new parameters.
$\bar{R}^2 = 1 - \dfrac{SSE/(n-k)}{SST/(n-1)} = 1 - \dfrac{n-1}{n-k}(1 - R^2) = 1 - \dfrac{MSE}{MST}$
Other indicators of the goodness of the model are
(i) Plots of the residuals
(ii) Significance of the estimated coefficients
(iii) Tests for Normality of the residuals.
Examples – Exercises
1. The amount of blood expelled from the heart with each ventricular contraction
is known as the stroke volume. Medical researchers studying the
relationship between age (in years) and stroke volume (ml of blood)
obtained the following data from a random sample of patients.
Age (x)             25  30  35  40  45  50  55  60  65  70
Stroke volume (y)   76  77  74  71  72  70  68  67  64  62

(You may use $\sum x^2 = 25\,025$, $\sum y^2 = 54\,835$, $\sum xy = 34\,120$)
Draw a scatter diagram to represent these data.
a) Calculate the product moment correlation coefficient between x and y
b) Interpret your result
c) Find the equation of the regression line of y on x in the form y=a+bx.
d) From your line estimate the stroke volume of a patient at the age of 75
2. A teacher took a random sample of 8 children from a class. For each child
the teacher recorded the length of their left foot, f cm, and their height, h cm.
The results are given in the table below.
f   23   26   23   22   27   24   20   21
h   135  144  134  136  140  134  130  132

(You may use Σf = 186, Σh = 1085, Σfh = 25 291, Sff = 39.5, Shh = 139.875)
(a) Calculate Sfh.
(b) Find the equation of the regression line of h on f in the form h = a + bf.
Give the value of a and the value of b correct to 3 significant figures.
(c) Use your equation to estimate the height of a child with a left foot length of 25
cm.
(d) Comment on the reliability of your estimate in part (c), giving a reason for
your answer.
The left foot length of the teacher is 25 cm.
(e) Give a reason why the equation in part (b) should not be used to estimate the
teacher’s height.
WEEKS 10, 11 TIME SERIES
“Standing on the past, we live the present, hoping for the future”
The Problem: A set of measurements taken at consecutive time
points constitutes a time series. A number of ways to
analyze the series, estimate the parameters, and
forecast future values are considered!
15.1 The Model: Y_t = Trend + Seasonal variation + Short-term (non-random) variation + Random variation
A common problem in analyzing time series is how to estimate, and remove, the seasonal variation, to get a clear picture of the trend. Here are some commonly used techniques:
(i) Regression with dummies:

Y_t = a + bt + cQ_2 + dQ_3 + fQ_4 + \varepsilon_t,

where Q_2, Q_3, Q_4 are dummy variables taking the value 1 or 0, depending on whether we are in the second, third, or fourth quarter, thus reflecting any existing seasonal component!
However, the residuals may be serially correlated, so we estimate the first-order autocorrelation coefficient by:
r = \frac{\sum_{t=2}^{n} e_t e_{t-1}}{\sum_{t=2}^{n} e_t^2},

where e_1, e_2, \ldots, e_n are the residuals from the above multiple regression.
25
The hypothesis H_0: \rho = 0 vs H_1: \rho \neq 0 is tested using the Durbin-Watson test statistic:

D = \frac{\sum_{t=2}^{n} (e_t - e_{t-1})^2}{\sum_{t=2}^{n} e_t^2} \approx 2 - 2r
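Both r and D can be computed directly from the residuals. The following is a minimal sketch in plain Python on hypothetical residuals; the function name is my own.

```python
# Minimal sketch: first-order autocorrelation r and Durbin-Watson D
# from a list of regression residuals e_1..e_n.

def autocorr_and_dw(e):
    """Return (r, D) computed over t = 2..n, as in the formulas above."""
    den = sum(x * x for x in e[1:])                       # sum e_t^2, t >= 2
    r = sum(e[t] * e[t - 1] for t in range(1, len(e))) / den
    d = sum((e[t] - e[t - 1]) ** 2 for t in range(1, len(e))) / den
    return r, d
```

Values of D near 2 suggest no first-order serial correlation, consistent with D being approximately 2 - 2r.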
(ii) The Moving Average Model
The second model, often used, is the Moving Average (M.A.) of appropriate order (usually the period of the process), used to estimate the trend. If s (the period of the seasonality) is even, then we need a further MA of order two to center the estimates, so that these estimates correspond to the observed values. In mathematical terms, the first term of the MA with period s is

Y^{*}_{(1+s)/2} = \frac{1}{s}\sum_{i=1}^{s} Y_i,
or more explicitly, for most series with observations taken quarterly the period is four and the moving average centered at Y_t is

M(Y_t) = \frac{0.5\,Y_{t-2} + Y_{t-1} + Y_t + Y_{t+1} + 0.5\,Y_{t+2}}{4}, \qquad t = 3, 4, \ldots, n-2.
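The centered moving average can be sketched in plain Python; the data are hypothetical and the function name and the choice of returning a dict keyed by t are my own.

```python
# Minimal sketch: centered moving average of period 4, defined for
# t = 3..n-2 (1-based indexing, as in the notes).

def centered_ma4(y):
    """Return {t: M(Y_t)} for the quarterly centered MA above."""
    out = {}
    for t in range(3, len(y) - 1):       # 1-based t = 3..n-2
        i = t - 1                        # 0-based index of Y_t
        out[t] = (0.5 * y[i - 2] + y[i - 1] + y[i] + y[i + 1]
                  + 0.5 * y[i + 2]) / 4
    return out
```

On a purely linear series the centered MA reproduces the central value exactly, which is why it removes a period-4 seasonal pattern while preserving a linear trend.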
The second step is to calculate the seasonal component by taking either

(i) the difference U_t = Y_t - M(Y_t) (Additive Model), or
(ii) the ratio U_t = Y_t / M(Y_t) (Multiplicative Model),

and then averaging over all observations at each of the four phases, i.e.

s_0 = \frac{u_t + u_{t+4} + u_{t+8} + \cdots}{n_0}; \quad
s_1 = \frac{u_{t+1} + u_{t+5} + u_{t+9} + \cdots}{n_1}; \quad
s_2 = \frac{u_{t+2} + u_{t+6} + u_{t+10} + \cdots}{n_2}; \quad
s_3 = \frac{u_{t+3} + u_{t+7} + u_{t+11} + \cdots}{n_3},

where n_i is the number of observations in the i-th phase.
Finally we deseasonalize the series by subtracting (dividing, for the multiplicative model) each observation by the corresponding seasonal effect, obtaining

Y_t - s_0,\; Y_{t+1} - s_1,\; Y_{t+2} - s_2,\; Y_{t+3} - s_3,\; Y_{t+4} - s_0,\ldots
A linear regression is fitted to the deseasonalized series to obtain forecasts of the
trend, which at the last stage should be corrected by the seasonal effect!
For the most commonly used additive model, the seasonal component is in practice estimated by subtracting the estimated trend from the observed series and then averaging for each quarter (or repeated point). Once the trend and the seasonal component have been estimated, we can make predictions based on algebraic or graphical estimates.
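The additive procedure just described (a centered MA for the trend, then averaging Y_t minus trend at each phase) can be sketched as follows. This is plain Python on hypothetical quarterly data; the function name and the generalization to any even period s are my own choices.

```python
# Minimal sketch of the additive decomposition: centered MA trend,
# then per-phase averages of u_t = y_t - trend_t.

def additive_seasonals(y, s=4):
    """Return the s seasonal averages, phase 0..s-1 (0-based positions)."""
    n = len(y)
    half = s // 2
    u = {}                                   # phase -> list of u_t values
    for i in range(half, n - half):          # where the centered MA exists
        trend = (0.5 * y[i - half] + sum(y[i - half + 1:i + half])
                 + 0.5 * y[i + half]) / s
        u.setdefault(i % s, []).append(y[i] - trend)
    return [sum(v) / len(v) for _, v in sorted(u.items())]
```

For a series that is a constant level plus a repeating quarterly pattern, the function recovers the pattern exactly, since the centered MA averages out one full cycle.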
26
\hat{Y}_t = Trend estimate + Seasonal component

Random variation cannot be predicted.
Non-random variation? This is irregular or cyclical variation about the trend.
Autoregressive models may also apply!
Very simple linear regression models:
Due to the fact that the main components of most time series are the trend and
the seasonal component, it is possible to fit simple models which take into
account these two main effects, for example:
Y_t = \alpha + \beta Y_{t-s} + \varepsilon_t, \quad \varepsilon_t \sim N(0, \sigma^2), \text{ independent},
and s is the period of seasonality. α and β are the parameters to be estimated
from the data. Often we take s = 4, for quarterly data or s = 12 for monthly data!
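Estimating α and β amounts to an ordinary least-squares regression of Y_t on its value s periods earlier. A minimal sketch in plain Python (hypothetical data; the function name is mine):

```python
# Minimal sketch: least-squares fit of Y_t = alpha + beta * Y_{t-s} + eps,
# i.e. regress the series on itself lagged by the seasonal period s.

def fit_seasonal_ar(y, s=4):
    x = y[:-s]                # predictor: Y_{t-s}
    z = y[s:]                 # response:  Y_t
    n = len(x)
    xbar, zbar = sum(x) / n, sum(z) / n
    sxz = sum(a * b for a, b in zip(x, z)) - n * xbar * zbar
    sxx = sum(a * a for a in x) - n * xbar ** 2
    beta = sxz / sxx
    alpha = zbar - beta * xbar
    return alpha, beta
```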
Examples – Exercises
1.
Trend, seasonal variation and random variation are three terms used in the
analysis of time series. Explain what they mean.
In order to compete with its larger rivals, a small cinema shows only first-rate
films with one performance each evening and change of program every five
weeks. The table below gives the weekly attendances at this cinema, in
hundreds, during a 20 week period beginning in September.
Week                   1     2     3     4     5     6     7     8     9     10
Attendance (hundreds)  5.5   9.2   10.1  7.5   6.8   11.7  17.4  17.2  15.0  13.1

Week                   11    12    13    14    15    16    17    18    19    20
Attendance (hundreds)  16.3  23.0  25.1  21.2  19.1  25.0  30.4  31.6  29.0  28.3
(i) Plot these data on a graph.
(ii) Calculate an appropriate moving average in order to smooth the series
and plot the values on the graph
(iii) Discuss with reasons what you think might happen to attendances in the next 20 weeks and predict the attendances in the 21st and 22nd weeks.
2. The earnings of a corporation during the seventies are given below in $000:
Quarter \ Year   1971  1972  1973  1974  1975  1976  1977  1978
1                300   330   495   550   590   610   700   820
2                460   545   680   870   990   1050  1230  1410
3                345   440   545   660   830   920   1060  1250
4                910   1040  1285  1580  1730  2040  2320  2730
(i) Plot these data on a graph.
(ii) Calculate an appropriate moving average in order to smooth the series and plot
the values on the graph
(iii) Estimate the trend by least squares on the MA values.
(iv) Provide predictions for all four quarters of 1979.
WEEK 12
EXPERIMENTAL DESIGN
“We only observe the effects! What about the cause?”
The Problem : To isolate the factors of interest from other experimental noise.
13.1 Completely Randomized Design: one factor.

The model: X_{ij} = \mu + \tau_j + \varepsilon_{ij}, with \varepsilon_{ij} \sim N(0, \sigma^2), independent; where
X_{ij} is the ith observation in the jth column (treatment),
\mu is the overall effect,
\tau_j is the effect of treatment j,
\varepsilon_{ij} is the error, with i = 1, 2, \ldots, n_j and j = 1, 2, \ldots, m.
The set-up

Replication \ Level   1          2          3          …   m
1                     X_11       X_12       X_13       …   X_1m
2                     X_21       X_22       X_23       …   X_2m
3                     X_31       X_32       X_33       …   X_3m
…                     …          …          …          …   …
                      X_{n_1 1}  X_{n_2 2}  X_{n_3 3}  …   X_{n_m m}
Totals                C_1        C_2        C_3        …   C_m
where N = \sum_{j=1}^{m} n_j is the total number of observations, T = \sum_{i,j} X_{ij} is the grand total, and C_j = \sum_{i=1}^{n_j} X_{ij} is the jth column total, so the jth column average is \bar{X}_j = C_j / n_j and the grand average is \bar{X} = T/N.

The sums of squares:

Total: SST = \sum_{i,j} (X_{ij} - \bar{X})^2 = \sum_{i,j} X_{ij}^2 - \frac{T^2}{N}, and
Between columns: SSC = \sum_{i,j} (\bar{X}_j - \bar{X})^2 = \sum_{j} \frac{C_j^2}{n_j} - \frac{T^2}{N}, and

The within or error sum of squares: SSW or SSE = \sum_{i,j} (X_{ij} - \bar{X}_j)^2 = \sum_{i,j} X_{ij}^2 - \sum_{j} \frac{C_j^2}{n_j}
Note that, to avoid confusion of the denominators, as a general principle, each total squared is divided by the number of observations contributing to that total!
Analysis of Variance Table

Source of variation    Sum of Squares        Degrees of Freedom   Mean Square = SS/d.f.   F-Ratio
Columns (Treatments)   SSC                   m - 1                MSC = SSC/(m - 1)       MSC/MSE ~ F(m - 1, N - m)
Residual (Error)       SSE (by subtraction)  N - m                MSE = SSE/(N - m)
Total (Corrected)      SST                   N - 1
The Null Hypothesis H_0: no difference between treatments, i.e. \tau_j = 0 for all j.
Reasoning: If there is no difference between treatments, then the model becomes X_{ij} = \mu + \varepsilon_{ij}. So the two mean squares MSC and MSE are independent estimates of the same variance \sigma^2 of the assumed model. Their ratio is distributed like an F random variable with the corresponding degrees of freedom shown above.
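The one-way calculations can be carried out directly from the column totals, as the formulas above show. A minimal sketch in plain Python (hypothetical data; the function name is mine):

```python
# Minimal sketch: one-way ANOVA sums of squares and F-ratio.

def one_way_anova(groups):
    """groups: list of lists, one per treatment (column).
    Returns (SSC, SSE, F); compare F with F(m-1, N-m)."""
    m = len(groups)
    N = sum(len(g) for g in groups)
    T = sum(sum(g) for g in groups)
    sst = sum(x * x for g in groups for x in g) - T * T / N
    ssc = sum(sum(g) ** 2 / len(g) for g in groups) - T * T / N
    sse = sst - ssc                      # SSE by subtraction
    msc = ssc / (m - 1)
    mse = sse / (N - m)
    return ssc, sse, msc / mse
```

Note how each squared total (T and each column total) is divided by the number of observations contributing to it, exactly as the general principle above requires.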
Randomized Block Design, two factors.

The model: X_{ij} = \mu + \alpha_i + \beta_j + \varepsilon_{ij},

with assumptions \varepsilon_{ij} \sim N(0, \sigma^2), independent, and no interaction between the two factors; where
X_{ij} is the observation in the ith row (block) and the jth column (treatment),
\mu is the overall effect,
\alpha_i is the effect of the ith block (row),
\beta_j is the effect of the jth treatment (column),
\varepsilon_{ij} is the error, with i = 1, 2, \ldots, n and j = 1, 2, \ldots, m.
The set-up

Block \ Treatment   1      2      3      …   m      Totals
1                   X_11   X_12   X_13   …   X_1m   R_1
2                   X_21   X_22   X_23   …   X_2m   R_2
3                   X_31   X_32   X_33   …   X_3m   R_3
…                   …      …      …      …   …      …
n                   X_n1   X_n2   X_n3   …   X_nm   R_n
Totals              C_1    C_2    C_3    …   C_m    T
where N = mn is the total number of observations, the grand total is T = \sum_{i,j} X_{ij} and \bar{X}_{..} = T/N is the grand average; R_i = \sum_{j=1}^{m} X_{ij} and C_j = \sum_{i=1}^{n} X_{ij}, with \bar{X}_{i.} = R_i/m the ith row average and \bar{X}_{.j} = C_j/n the jth column average.

The sums of squares:

Total: SST = \sum_{i,j} (X_{ij} - \bar{X}_{..})^2 = \sum_{i,j} X_{ij}^2 - \frac{T^2}{N}

Between columns: SSC = \sum_{i,j} (\bar{X}_{.j} - \bar{X}_{..})^2 = \sum_{j} \frac{C_j^2}{n} - \frac{T^2}{N}, and

Between rows: SSR = \sum_{i,j} (\bar{X}_{i.} - \bar{X}_{..})^2 = \sum_{i} \frac{R_i^2}{m} - \frac{T^2}{N},

The within or error: SSE = \sum_{i,j} \left( X_{ij} - \bar{X}_{.j} - \bar{X}_{i.} + \bar{X}_{..} \right)^2
Analysis of Variance Table

Source of variation   Sum of Squares   Degrees of Freedom   Mean Square = SS/d.f.       F-Ratio
Between Rows          SSR              n - 1                MSR = SSR/(n - 1)           MSR/MSE ~ F(n - 1, N - n - m + 1)
Between Columns       SSC              m - 1                MSC = SSC/(m - 1)           MSC/MSE ~ F(m - 1, N - n - m + 1)
Residual (Error)      SSE              N - n - m + 1        MSE = SSE/(N - n - m + 1)
Total (Corrected)     SST              N - 1
The Null Hypothesis H_0: no difference between rows or between columns, i.e. \alpha_i = 0 and \beta_j = 0 for all i, j.
Reasoning: If there is no difference between rows or columns, then the model becomes X_{ij} = \mu + \varepsilon_{ij}. So the two mean squares MSR and MSC are independent estimates of the same variance \sigma^2. However, regardless of the row or column effects, an independent estimate of the variance is the MSE, since any differences are subtracted from the sum of squares.
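The randomized block calculations can be sketched the same way, again in plain Python on hypothetical data, with a function name of my own:

```python
# Minimal sketch: randomized block (two-factor, no interaction) ANOVA.

def two_way_anova(x):
    """x: n rows (blocks) by m columns (treatments).
    Returns (F_rows, F_cols); compare each with the F distributions above."""
    n, m = len(x), len(x[0])
    N = n * m
    T = sum(sum(row) for row in x)
    c = T * T / N                                         # correction term
    sst = sum(v * v for row in x for v in row) - c
    ssr = sum(sum(row) ** 2 for row in x) / m - c         # between rows
    ssc = sum(sum(row[j] for row in x) ** 2
              for j in range(m)) / n - c                  # between columns
    sse = sst - ssr - ssc                                 # residual
    mse = sse / (N - n - m + 1)
    return ssr / (n - 1) / mse, ssc / (m - 1) / mse
```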
Exercises
A factory manufactures batches of an electronic component. Each component is
manufactured in one of three shifts. A component may have one of two types of
defect, D1 or D2 , at the end of the manufacturing process.
A production manager believes that the type of defect is dependent upon the shift that manufactured the component. He examines 200 randomly selected defective components and classifies them by defect type and shift.
The results are shown in the table below.
Shift          Defect type
               D1   D2
First shift    45   18
Second shift   55   20
Third shift    50   12
Stating your hypotheses, test, at the 10% level of significance, whether or not
there is evidence to support the manager’s belief. Show your working clearly.
WEEKS 13 and 14: Review