
Principal Components Analysis

Subject: Statistics
Paper: Multivariate Analysis
Module: Introduction to Principal Components Analysis
Development Team
Principal investigator: Dr. Bhaswati Ganguli, Professor,
Department of Statistics, University of Calcutta
Paper co-ordinator: Dr. Sugata SenRoy, Professor, Department of
Statistics, University of Calcutta
Content writer: Souvik Bandyopadhyay, Senior Lecturer, Indian
Institute of Public Health, Hyderabad
Content reviewer: Dr. Kalyan Das, Professor, Department of
Statistics, University of Calcutta
Dimension Reduction techniques

Question
Is it necessary to look at all the variables, or is it possible for a smaller set of variables to capture the same information?

Some of the techniques used are:
- Principal Component Analysis
- Factor Analysis
- Canonical Correlations
- Multidimensional Scaling

We will discuss Principal Component Analysis in this lecture.
Introduction

Principal Component Analysis
A principal component analysis is concerned with explaining the variance-covariance structure of a set of variables through a few linear combinations of these variables.

- The technique helps to capture the variability of a set of variables through a few linear combinations of the variables.
- These linear combinations contain most of the information contained in the original set.
Objectives of principal component analysis

- Data reduction: to replace the original set of variables by a smaller set of p variables which retains a large proportion of the total variability of the original variables.
  - We then need to look at fewer variables rather than a long list, and hence the data are easier to work with.
- Interpretation: to reveal relationships that were not previously suspected.
  - It becomes easier to capture the underlying mechanism of the model and hence easier to interpret.
- The components can also serve as an intermediate step in much larger investigations such as cluster analysis or multiple regression.
A Simple Example

Let X1 = Height and X2 = Weight.
Q: Should we look at both of these variables?

New variables
Consider the linear combinations
Y1 = a11 X1 + a12 X2
Y2 = a21 X1 + a22 X2
Instead of looking at both X1 and X2, can we reasonably look at either Y1 or Y2 only, depending on which has the larger variability?

Note
We can only have two linearly independent combinations of X1 and X2.
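
A minimal sketch of this idea in Python, using simulated height and weight values (the numbers below are illustrative assumptions, not data from the lecture): rotating to the leading eigenvector of the sample covariance matrix gives a single variable Y1 that carries most of the variability.

```python
import numpy as np

# Simulate correlated "height" (X1) and "weight" (X2); values are illustrative only.
rng = np.random.default_rng(0)
n = 200
x1 = rng.normal(165, 8, n)                    # heights (cm)
x2 = 0.9 * (x1 - 165) + rng.normal(65, 4, n)  # weights (kg), correlated with height
X = np.column_stack([x1, x2])

S = np.cov(X, rowvar=False)           # 2 x 2 sample covariance matrix
eigvals, eigvecs = np.linalg.eigh(S)  # eigh returns eigenvalues in ascending order
order = np.argsort(eigvals)[::-1]     # reorder so the largest comes first
eigvals, eigvecs = eigvals[order], eigvecs[:, order]

Y = (X - X.mean(axis=0)) @ eigvecs    # rotated coordinates: columns are Y1, Y2
print("Var(Y1), Var(Y2):", Y.var(axis=0, ddof=1))
print("Share of total variance in Y1:", eigvals[0] / eigvals.sum())
```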
Example (contd.)

- Here we have the plot of X2 against X1 (with bases changed).
- The plot shows a fair amount of variability in both X1 and X2.
Example (contd.)

- We now rotate the axes to Y1 and Y2 (shown in red).
- The plot shows that the major variability is captured by Y1.
- So we can discard Y2 and consider the single variable Y1 instead of X1 and X2.
Population Principal Components

Geometrical Significance
Principal components represent the selection of a new coordinate system with coordinates Y1, Y2, . . . , Ym, obtained by rotating the original system with X1, X2, . . . , Xm as the coordinate axes.
The new axes represent the directions with maximum variability and provide a simpler and more parsimonious description of the covariance structure.
Population Principal Components

- Let X = [X1, X2, . . . , Xm].
- Let Cov(X) = Σ.
- Let the eigenvalues of Σ be λ1 ≥ λ2 ≥ · · · ≥ λm ≥ 0.
- Consider the linear combinations

  Y1 = a1′X = a11 X1 + a12 X2 + · · · + a1m Xm
   ⋮
  Ym = am′X = am1 X1 + am2 X2 + · · · + amm Xm

- Y1, Y2, . . . , Ym, chosen orthogonal to each other and with decreasing variability, are the principal components.
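
A short numerical sketch of the standard eigen-decomposition solution: taking the coefficient vectors a1, . . . , am to be the eigenvectors of Σ gives Var(Yj) = λj. The matrix Σ below is an arbitrary illustrative choice, not one from the lecture.

```python
import numpy as np

# Illustrative covariance matrix (assumed for this sketch).
Sigma = np.array([[4.0, 2.0, 0.5],
                  [2.0, 3.0, 1.0],
                  [0.5, 1.0, 2.0]])

lam, A = np.linalg.eigh(Sigma)   # eigenvalues in ascending order
idx = np.argsort(lam)[::-1]      # sort so that lambda_1 >= lambda_2 >= ... >= lambda_m
lam, A = lam[idx], A[:, idx]     # column j of A is the coefficient vector a_j

for j in range(len(lam)):
    print(f"a_{j+1} = {A[:, j].round(3)},  Var(Y_{j+1}) = {lam[j]:.3f}")
```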
How to choose the aj's

- Observe that

  Var(Yj) = aj′ Σ aj,       j = 1, 2, . . . , m
  Cov(Yj, Yk) = aj′ Σ ak,   j, k = 1, 2, . . . , m

- The first principal component is the linear combination with maximum variance, i.e. it maximizes Var(Y1) = a1′ Σ a1.
- Since Var(Y1) can be increased by multiplying a1 by some constant, it is convenient to restrict attention to coefficient vectors of unit length only.
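
A small numerical check of this maximization property, assuming an illustrative 2 × 2 covariance matrix: no random unit-length coefficient vector attains a larger value of a′Σa than the leading eigenvector of Σ.

```python
import numpy as np

rng = np.random.default_rng(1)
Sigma = np.array([[4.0, 2.0], [2.0, 3.0]])   # illustrative covariance matrix

lam, vecs = np.linalg.eigh(Sigma)
a1 = vecs[:, np.argmax(lam)]                 # leading eigenvector (unit length)
best = a1 @ Sigma @ a1                       # equals the largest eigenvalue

# Compare against many random unit-length coefficient vectors.
trials = rng.normal(size=(10_000, 2))
trials /= np.linalg.norm(trials, axis=1, keepdims=True)
random_best = np.max(np.einsum("ij,jk,ik->i", trials, Sigma, trials))

print("a1' Sigma a1            =", round(best, 4))
print("best random unit vector =", round(random_best, 4))  # never exceeds the value above
```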
Choosing the Yj's

- The coefficient vector a1 of the first principal component Y1 = a1′X is chosen so as to maximize Var(a1′X) subject to the condition a1′a1 = 1.
- The coefficient vector a2 of the second principal component Y2 = a2′X is chosen so as to maximize Var(a2′X) subject to a2′a2 = 1 and a1′a2 = 0.
- In general, for j = 1, . . . , m, the coefficient vector aj of the j-th principal component Yj = aj′X is chosen so as to maximize Var(aj′X) subject to aj′aj = 1 and ak′aj = 0 for all k < j.
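
The eigenvectors of Σ, taken in decreasing order of eigenvalue, satisfy exactly these constraints. A brief check, reusing the same illustrative Σ as in the earlier sketch:

```python
import numpy as np

Sigma = np.array([[4.0, 2.0, 0.5],
                  [2.0, 3.0, 1.0],
                  [0.5, 1.0, 2.0]])           # illustrative covariance matrix

lam, A = np.linalg.eigh(Sigma)
idx = np.argsort(lam)[::-1]
lam, A = lam[idx], A[:, idx]

# Unit length and mutual orthogonality: a_j' a_j = 1 and a_k' a_j = 0 for k != j.
print("A'A == I :", np.allclose(A.T @ A, np.eye(3)))
# Decreasing variances: Var(Y_1) >= Var(Y_2) >= Var(Y_3).
print("Var(Y_j) decreasing :", bool(np.all(np.diff(lam) <= 0)))
```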
Advantages

Result
Var(X1) + · · · + Var(Xm) = Var(Y1) + · · · + Var(Ym)

The advantages are:
- The Y-variances are in decreasing order, i.e. Var(Y1) ≥ Var(Y2) ≥ · · · ≥ Var(Ym).
- Unlike Cov(Xj, Xk), we have Cov(Yj, Yk) = 0 for all j ≠ k.

Strategy
Work with only p ≪ m principal components Y instead of all X if these explain a large portion of X's variability (say 80% or 90%).
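
A sketch of this strategy, assuming a hypothetical set of component variances (eigenvalues): compute the proportion and cumulative proportion of the total variance and retain the smallest number of components p reaching a chosen threshold.

```python
import numpy as np

# Hypothetical component variances Var(Y_1), ..., Var(Y_m); illustrative values only.
lam = np.array([6.2, 2.1, 0.9, 0.5, 0.3])

prop = lam / lam.sum()                 # proportion of total variance per component
cum = np.cumsum(prop)                  # cumulative proportion

p = int(np.argmax(cum >= 0.90)) + 1    # smallest p explaining at least 90%
print("proportions :", prop.round(3))
print("cumulative  :", cum.round(3))
print("retain p =", p, "components")
```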
Example

- Let us illustrate the idea of principal components through an example.
- Consider the amounts of protein (X1), carbohydrate (X2), fat (X3), calories (X4) and vitamins (X5) in 12 selected cereal brands (data source: Johnson & Wichern, 2002).
- The purpose is to see whether, instead of these 5 variables, we can observe just one or two constructed variables (principal components) which synthesize the data without losing much of its variability.
- The method by which we construct these principal components will be discussed in later lectures.
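
The cereal measurements themselves are not reproduced here; the sketch below only indicates how such components could be computed, with a placeholder 12 × 5 array `cereal` standing in for the actual Johnson & Wichern data.

```python
import numpy as np

# Placeholder data: a hypothetical 12 x 5 array (rows: brands; columns: protein,
# carbohydrate, fat, calories, vitamins).  The real values are not reproduced here.
rng = np.random.default_rng(2)
cereal = rng.normal(size=(12, 5))

S = np.cov(cereal, rowvar=False)             # 5 x 5 sample covariance matrix
lam, A = np.linalg.eigh(S)
idx = np.argsort(lam)[::-1]
lam, A = lam[idx], A[:, idx]                 # columns of A: loadings of Y_1, ..., Y_5

scores = (cereal - cereal.mean(axis=0)) @ A  # principal component scores per brand
print("standard deviations of Y_j:", np.sqrt(lam).round(2))
```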
Example (contd.)
The 5 principal components come out as follows:
Y1 = 0X1 + 0X2 + 0X3 + 0.228X4 + 0.972X5
Y2 = 0X1 − 0.176X2 + 0X3 − 0.956X4 + 0.233X5
Y3 = 0.561X1 − 0.814X2 + 0.216X3 + 0X4 + 0X5
Y4 = −0.816X1 − 0.521X2 + 0.973X3 + 0.129X4 + 0X5
Y5 = 0.136X1 + 0.179X2 + 0X3 + 0.126X4 + 0X5
Example (contd.)

We next look at the variances, and the proportion and cumulative proportion of variability explained by the Yj's.

                       Y1      Y2      Y3      Y4      Y5
s.d.                  31.82   16.17    2.43    0.51    0.40
prop. of Var           0.79    0.20    0.007   0.002   0.001
cum. prop. of Var      0.79    0.99    0.997   0.999   1.0

- The first principal component explains 79% of the variability in the X's.
- If we want to capture more variability of the original variables, the first and second principal components together account for 99% of the total X-variability.
- Thus, instead of the 5 X variables, we can work with only the 2 principal components Y1 and Y2.
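
As a check, the proportions in the table follow directly from the reported standard deviations, since Var(Yj) = (s.d.)²; small discrepancies are due to rounding of the reported values.

```python
import numpy as np

sd = np.array([31.82, 16.17, 2.43, 0.51, 0.40])   # standard deviations from the table
var = sd ** 2                                      # Var(Y_j) = s.d.^2
prop = var / var.sum()

print("prop. of Var      :", prop.round(3))            # approximately 0.79, 0.20, ...
print("cum. prop. of Var :", np.cumsum(prop).round(3)) # approximately 0.79, 0.99, ...
```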
Summary

- The different data reduction techniques are stated.
- The idea of principal component analysis is explained.
- The derivation of principal components is shown.
- The technique is illustrated through an example.