DESCRIPTION OF COURSE IN THE PROGRAM
COURSE NO. & TITLE: Multivariate Statistics
SEMESTER & YEAR: Elective course (3) for 4th year of Mining.
WEEKLY HOURS: 4 hours
NO. OF WEEKS: 15
NO. OF STUDENTS: 9 students
AVERAGE ATTENDANCE PERCENTAGE: 90 %
NO. OF GROUPS: One
PREREQUISITE COURSES: Statistics - Mining Operations
TEXTBOOK & PUBLISHER: Multivariate Statistics, 2001
REFERENCE & PUBLISHER: Introduction to Multivariate Statistics, Barbara,
2001
TEACHER(S) NAME(S) AND POSITION(S):
Sameh Saad Eldin Ahmed
COURSE GOALS: Understanding multivariate statistics methods and their
applications to environmental studies
PREREQUISITE TOPICS: Descriptive statistics
COURSE TOPICS: Principal component Analysis - Factor Analysis
Interpretation of Factors - Applications.
COMPUTER USAGE IF ANY: SAS - Excel.
LABORATORY EXPERIMENTS OR APPLIED SESSIONS IF ANY: -
RESEARCH PROJECTS IF ANY: Selected topics
OTHER INFORMATION: Seminars and Reports
Syllabus
1. Introduction
1.1. General
1.2. Multivariate Statistics: Why?
1.3. The Domain of Multivariate Statistics: Number of IV’s and DV’s
1.4. Computers and Multivariate Statistics
1.5. Number and Nature of Variables to Include
2. Review of Univariate and Bivariate Statistics
2.1. Hypothesis Testing
2.2. Analysis of Variance
2.3. Parameter Estimation
2.4. Bivariate Statistics: Correlation and Regression
3. Cleaning Up Your Act: Screening Data Prior to Analysis
3.1. Important Issues in Data Screening
3.1.1. Accuracy of data file
3.1.2. Missing data
3.1.3. Outliers
3.1.4. Normality, Linearity, and Homogeneity
3.2. Complete Example of Data Screening
4. Multiple Regression
5. Multivariate Statistical Methods
5.1. Principal Components Analysis
5.1.1. General Purpose and Description
5.1.2. Kinds of Research Questions
5.1.3. Limitations
5.1.4. Examples
5.2. Factor Analysis
5.2.1. Fundamental Equations of Factor Analysis
5.2.2. Major Types of Factor Analysis
5.2.3. Some Important Issues
5.2.4. Complete Example of FA
1. Introduction
1.1. General
Statistics is the branch of mathematics that deals with the analysis of data; it is
divided into descriptive statistics and inferential statistics (statistical inference).
Multivariate statistics is an extension of univariate (one variable) or bivariate (two
variables) statistics. It allows a single test instead of many different univariate and
bivariate tests when a large number of variables are being investigated (Brown, 1998).
Therefore, multivariate statistics represents the general case, and univariate and
bivariate analyses are special cases of the general multivariate model.
It is important to point out initially that there is more than one analytical statistical
strategy that is appropriate for analysing most data. The choice of the techniques used
depends on the nature of the data, number of variables, the interrelationships of the
variables, and application of the principle of parsimony, where simplicity in
interpretation is of primary concern.
1.2. Multivariate Analysis: Why?
Multivariate statistics are increasingly popular techniques used for analysing
complicated data sets. They provide analysis when there are many independent
variables (IVs) and/or many dependent variables (DVs), all correlated with one another to
varying degrees. Because of the difficulty of addressing complicated research questions
with univariate analyses and because of the availability of canned software for
performing multivariate analyses, multivariate statistics have become widely used.
Indeed, the day is near when a standard univariate statistics course ill prepares a student
to read the research literature or to produce research.
As a definition, multivariate data consist of observations on several different variables
for a number of individuals or objects (Chatfield et al., 1980). The analysis of
multivariate data is usually concerned with several aspects. First, what are the
relationships, if any, between the variables? Second, what differences, if any, are there
between classes?
Multivariate analysis combines variables to do useful work. The combination of
variables that is formed is based on the relations between the variables and the goals of
the analysis, but in all cases it is a linear combination of the variables.
1.3. The Domain of Multivariate Statistics: Number of IV’s and DV’s
Multivariate statistical methods are an extension of univariate and bivariate statistics.
Multivariate statistics are the complete or general case, whereas univariate and bivariate
statistics are special cases of the multivariate model. If your design has many variables,
multivariate techniques often let you perform a single analysis instead of a series of
univariate or bivariate analyses.
Variables are generally classified into two major groups: independent and dependent.
Independent Variables (IVs) are the differing conditions to which you expose your
subjects, or characteristics (tall or short) that the subjects themselves bring into the
research situations. IVs are usually considered either predictor or causal variables
because they predict or cause the DVs - the response or outcome variables. Note that IV
and DV are defined within a research context; a DV in one research setting may be an
IV in another.
The term univariate statistics refers to analysis in which there is a single DV. There
may be, however, more than one IV. For example, the amount of social behaviour of
graduate students (the DV) is studied as a function of course load (one IV) and type of
training in social skills to which students are exposed (another IV).
Bivariate statistics frequently refers to analysis of two variables where neither is an
experimental IV and the desire is simply to study the relationship between the variables
(e.g., the relationship between income and amount of education).
With multivariate statistics, you simultaneously analyse multiple dependent and
multiple independent variables.
1.4. Computers and Multivariate Statistics
One answer to the question “Why multivariate statistics?” is that the techniques are now
accessible by computer. Among several computer packages available in the market the
following are the most common:
- SPSS (Statistical Package for the Social Sciences), SPSS Inc., 1999e
- SAS (Statistical Analysis System), SAS Institute Inc., 1998
- SYSTAT, SPSS Inc., 1999f
Garbage In, Roses Out?
The trick in multivariate statistics is not in computation; that is easily done by computer.
The trick is to select reliable and valid measurements, choose the appropriate program,
use it correctly and know how to interpret the output. Output from commercial
computer programs, with their beautifully formatted tables, graphs, and matrices, can
make garbage look like roses.
1.5. Number and Nature of Variables to Include
Attention to the number of variables included in analysis is important. A general rule is
to get the best solution with the fewest variables. As more and more variables are
included, the solution usually improves, but only slightly. Sometimes the improvement
does not compensate for the cost in degrees of freedom of including more variables, so
the power of the analyses diminishes. If there are too many variables relative to sample
size, the solution provides a wonderful fit to the sample that may not generalise to the
population, a condition known as overfitting. To avoid overfitting, include only a
limited number of variables in each analysis.
Considerations for variables in a multivariate analysis include cost, availability,
meaning, and theoretical relationships among the variables.
A few reliable variables give a more meaningful solution than a large number of less
reliable variables. Indeed, if variables are sufficiently unreliable, the entire solution may
reflect only measurement error.
An appropriate data set for multivariate statistical methods consists of values on a
number of variables for each of several subjects, or cases. For continuous variables, the
values are scores on variables. For example, if the continuous variable is the GRE, the
values for the various subjects are scores such as 500, 650, 420, and so on.
2. Review of Univariate and Bivariate Statistics
2.1 Univariate Statistics
Univariate tools are used to describe the distribution of individual variables, one
variable at a time. This is known as data screening and preparation and involves the summary
statistics of the data.
2.1.1 Histograms of data
Histograms are very useful data summaries that allow many characteristics of the
data to be presented in a single illustration. They are obtained simply by grouping data
together into classes.
2.1.2 Summary statistics
The important features of most histograms can be captured by a few summary statistics
(simple statistical measures). The three main categories of statistics are (Isaaks et al., 1989):
2.1.2.1 measures of location
The mean, the median, and the mode give some idea of where the centre of the
distribution lies. The locations of other parts of the distribution are given by various
quantiles.
- The Mean (μ): is the arithmetic average of the data values:

  \mu = \frac{1}{n} \sum_{i=1}^{n} x_i    (3.1)

  where n is the number of data values and x_1, ..., x_n are the data values.
- The Median (M): is the midpoint of the observed values when they are arranged in increasing order.
- The Mode: is the value that occurs most frequently.
- Minimum: the smallest value in the data set.
- Maximum: the largest value in the data set.
- Lower and Upper Quartile: if the data values are arranged in increasing order, then a quarter of the data falls below the lower or first quartile, Q1, and a quarter of the data falls above the upper or third quartile, Q3.
- Deciles, percentiles, and quantiles: deciles split the data into tenths, while percentiles split the data into hundredths. Quantiles are a generalisation of this idea to any fraction.
2.1.2.2 measures of spread
The variance, the standard deviation, and the interquartile range are used to
describe the variability of the data values.
- Variance: the average squared difference of the observed values from their mean. It is directly proportional to the amount of variation in the data. The variance, σ², is given by:

  \sigma^2 = \frac{1}{n} \sum_{i=1}^{n} (x_i - \mu)^2 = \frac{SS}{df}    (3.2)

  where n is the number of observations, SS refers to the sum of squares, and df is the degrees of freedom, df = n - 1.
- Standard deviation (σ): simply the square root of the variance. It is often used instead of the variance since its units are the same as the units of the variable being described.
- Interquartile range (IQR): the difference between the upper and lower quartiles, given by:

  IQR = Q_3 - Q_1    (3.3)
2.1.2.3 measures of shape
Histograms can be classified broadly by their shapes. The following are the most
commonly used measures:
- Coefficient of skewness: the most commonly used statistic for summarising the symmetry of a distribution. It is defined as:

  \text{Coefficient of skewness} = \frac{\frac{1}{n}\sum_{i=1}^{n}(x_i - \mu)^3}{\sigma^3}    (3.4)
The numerator is the average cubed difference between the data values and their
mean, and the denominator is the cube of the standard deviation.
- Coefficient of variation (CV): a statistic that is often used as an alternative to skewness to describe the shape of the distribution. It is defined as the ratio of the standard deviation to the mean:

  CV = \frac{\sigma}{\mu}    (3.5)

  If estimation is the final aim of a study, the coefficient of variation can provide some warning of upcoming problems. If its value is greater than one, it indicates the presence of some erratic high values that may have a significant impact on the final estimates.
These simple statistics are usually enhanced by simple graphical techniques of analysis, to name but a
few: box plots, scattergrams, histograms, stem-and-leaf plots, and probability plots.
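To make these definitions concrete, here is a minimal Python sketch (NumPy only; the data values are made up for illustration) that computes the summary statistics of equations (3.1) to (3.5):

```python
import numpy as np

# Hypothetical sample of a single variable (illustrative values only).
x = np.array([7.2, 6.9, 7.4, 8.1, 6.5, 7.0, 7.8, 9.3, 6.8, 7.1])

n = len(x)
mean = x.mean()                                # equation (3.1)
median = np.median(x)
variance = x.var()                             # 1/n form of equation (3.2)
std = x.std()
q1, q3 = np.percentile(x, [25, 75])
iqr = q3 - q1                                  # equation (3.3)
skew = np.mean((x - mean) ** 3) / std ** 3     # equation (3.4)
cv = std / mean                                # equation (3.5)

print(f"n={n}  mean={mean:.3f}  median={median:.3f}")
print(f"variance={variance:.3f}  std={std:.3f}  IQR={iqr:.3f}")
print(f"skewness={skew:.3f}  CV={cv:.3f}")
```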
2.2 Bivariate Statistics
Bivariate statistics frequently refers to the analysis of two variables. The most common
display of bivariate data is the scatter-plot, which is an x-y graph of the data on which
the x co-ordinate corresponds to the value of one variable and y co-ordinate to the value
of the other variable. In addition to providing a good qualitative feel for how two
variables are related, a scatter-plot is also useful for drawing attention to aberrant
data (Dowd, 1992).
2.2.1 Covariance and Correlation
If “x” represents one variable and “y” represents another variable, then the general
formula for the covariance is:

S_{xy} = \frac{1}{n} \sum_{i=1}^{n} (x_i - \bar{x})(y_i - \bar{y})    (3.6)

where \bar{x} and \bar{y} are the mean values of the “x” and “y” variables respectively.
The covariance is a measure of the degree of linear association between the two
variables. Covariances are positive for positive or direct association, negative for
negative or inverse association and zero for no association. As the degree of association
between any two variables increases the magnitude of the covariance will increase.
The covariance does not take account of different amounts of variability in individual
variables and makes no allowance for variables measured in different units.
To provide a valid comparison, the covariance must be scaled down so that it gives the
same numerical value for a given amount of association between two variables,
regardless of the magnitudes of the values of individual variables and independent of
units of measurement. The most common way of doing this is to divide the covariance
by the product of the standard deviations of the individual variables. The value so
obtained is called a correlation coefficient and is defined as:
r = \frac{S_{xy}}{S_x S_y}    (3.7)
where:

S_x^2 = \frac{1}{n} \sum_{i=1}^{n} (x_i - \bar{x})^2 \quad \text{and} \quad S_y^2 = \frac{1}{n} \sum_{i=1}^{n} (y_i - \bar{y})^2
The correlation coefficient measures the strength of the linear relationship between two
variables and takes values from -1.0 (perfect negative or inverse correlation) to +1.0
(perfect positive or direct correlation). A value of r = 0.0 indicates no linear correlation.
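As an illustration of equations (3.6) and (3.7), the following Python sketch (NumPy only; the paired values are made up) computes the covariance and the correlation coefficient, then cross-checks the result against NumPy's built-in function:

```python
import numpy as np

# Two hypothetical variables (illustrative values only).
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0, 6.0])
y = np.array([2.1, 2.9, 3.6, 5.2, 5.8, 7.1])

n = len(x)
# Covariance as in equation (3.6), dividing by n.
s_xy = np.sum((x - x.mean()) * (y - y.mean())) / n
# Standard deviations (population form) and correlation, equation (3.7).
s_x, s_y = x.std(), y.std()
r = s_xy / (s_x * s_y)

print(f"covariance = {s_xy:.4f}, correlation r = {r:.4f}")
# Cross-check with NumPy's built-in correlation coefficient.
print("np.corrcoef:", np.corrcoef(x, y)[0, 1])
```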
2.2.2 Regression
If the correlation coefficient indicates a strong linear relationship, it may be of interest
to describe this relationship in terms of an equation.
y  a  bx
(3.8)
where: b is the slope of the line and a is the intercept of the y axis. This equation is
called a regression line or more specifically, the regression line of y on x. The
regression line could also be used to predict values of y corresponding to given values
of x.
The method of calculating the values of a and b is called the method of least squares:

a = \frac{\sum_{i=1}^{n} x_i^2 \sum_{i=1}^{n} y_i - \sum_{i=1}^{n} x_i \sum_{i=1}^{n} x_i y_i}{n \sum_{i=1}^{n} x_i^2 - \left( \sum_{i=1}^{n} x_i \right)^2}

b = \frac{n \sum_{i=1}^{n} x_i y_i - \sum_{i=1}^{n} x_i \sum_{i=1}^{n} y_i}{n \sum_{i=1}^{n} x_i^2 - \left( \sum_{i=1}^{n} x_i \right)^2}

The correlation coefficient is related to the slope of the regression line by:

r = b \frac{S_x}{S_y}

from which it can be seen that a zero correlation coefficient implies a zero slope for the
regression line.
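A short Python sketch of the least-squares formulas above (illustrative data; np.polyfit is used only as a cross-check):

```python
import numpy as np

x = np.array([1.0, 2.0, 3.0, 4.0, 5.0, 6.0])   # illustrative data
y = np.array([2.1, 2.9, 3.6, 5.2, 5.8, 7.1])

n = len(x)
# Normal-equation solutions for the intercept a and slope b.
denom = n * np.sum(x**2) - np.sum(x)**2
a = (np.sum(x**2) * np.sum(y) - np.sum(x) * np.sum(x * y)) / denom
b = (n * np.sum(x * y) - np.sum(x) * np.sum(y)) / denom

# Relation between r and the slope: r = b * Sx / Sy.
r = b * x.std() / y.std()

print(f"a = {a:.4f}, b = {b:.4f}, implied r = {r:.4f}")
# Cross-check with NumPy's least-squares polynomial fit.
print("np.polyfit:", np.polyfit(x, y, 1))   # returns [slope, intercept]
```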
2.3 Multivariate Analysis
Multivariate analysis can be defined as general methods applicable to any number of
variables analysed simultaneously, and is usually applied to more (often many more) than
three variables. If there are m variables, the data may be imagined as points in m-dimensional space. The prime objective is to reduce the dimensionality so that the shape
of the data scatter can be viewed. Relationships between variables can also be
investigated (Swan et al., 1995).
3.5.1 Basic topics in multivariate statistics
In the univariate case, it is often necessary to summarise a data set by calculating its
mean and variance. To summarise multivariate data sets, one needs to find the mean and
variance of each of the p variables, together with a measure of the way each pair of
variables is related. For the latter the covariance or correlation of each pair of variables
is used. These quantities are defined below (Everitt et al., 1991).
Mean. The mean vector \mu = [\mu_1, ..., \mu_p] is such that:

\mu_i = E(X_i) = \int x f_i(x) \, dx    (3.9)

is the mean of the ith component of X. This definition is given for the case where X_i is
continuous. If X_i is discrete, then E(X_i) is given by \sum x P_i(x), where P_i(x) is the
(marginal) probability distribution of X_i.
Variance. The variance of the ith component of X is given by:

Var(X_i) = E[(X_i - \mu_i)^2] = E(X_i^2) - \mu_i^2    (3.10)

This is usually denoted by \sigma_i^2 in the univariate case but, in order to tie in with the
covariance notation, it is usually denoted by \sigma_{ii} in the multivariate case.
Covariance. The covariance of two variables X_i and X_j is defined by:

Cov(X_i, X_j) = E[(X_i - \mu_i)(X_j - \mu_j)]    (3.11)

Thus, it is the product moment of the two variables about their respective means. The
covariance of X_i and X_j is usually denoted by \sigma_{ij}. Then, if i = j, the variance of X_i is
denoted by \sigma_{ii}, as noted above.

Equation (3.11) is often written in the equivalent alternative form

\sigma_{ij} = E[X_i X_j] - \mu_i \mu_j    (3.12)
The covariance matrix. With p variables, there are p variances and p(p-1)/2
covariances, and these quantities are all second moments. It is often useful to present
these quantities in a (p \times p) matrix, denoted by \Sigma, whose (i, j)th element is \sigma_{ij}. Thus,

\Sigma = \begin{bmatrix} \sigma_{11} & \sigma_{12} & \cdots & \sigma_{1p} \\ \sigma_{21} & \sigma_{22} & \cdots & \sigma_{2p} \\ \vdots & \vdots & \ddots & \vdots \\ \sigma_{p1} & \sigma_{p2} & \cdots & \sigma_{pp} \end{bmatrix}

This matrix is called the dispersion matrix, the variance-covariance matrix, or simply
the covariance matrix. The diagonal terms of \Sigma are the variances, while the off-diagonal
terms, the covariances, are such that \sigma_{ij} = \sigma_{ji}. Thus the matrix is symmetric.

Using equations (3.11) and (3.12), one can express \Sigma in two alternative useful forms,
namely

\Sigma = E[(X - \mu)(X - \mu)^T] = E[X X^T] - \mu \mu^T    (3.13)
Correlation. If two variables are related in a linear way, then the covariance will be
positive or negative depending on whether the relationship has a positive or negative slope. But the
size of the coefficient is difficult to interpret because it depends on the units in which the
two variables are measured. Thus, the covariance is often standardised by dividing by
the product of the standard deviations of the two variables to give a quantity called the
correlation coefficient. The correlation between variables X_i and X_j will be denoted by
\rho_{ij}, and is given by:

\rho_{ij} = \frac{\sigma_{ij}}{\sigma_i \sigma_j}    (3.14)

where \sigma_i denotes the standard deviation of X_i.
The correlation coefficient provides a measure of a linear association between two
variables. It is positive if the relationship between the two variables has a positive slope
so that ‘high’ values of one variable tend to go with ‘high’ values of the other variable.
Conversely, the coefficient is negative if the relationship has a negative slope.
The correlation matrix. With p variables, there are p variances and p(p-1)/2 distinct
correlations. It is often useful to present them in a (p \times p) matrix whose (i, j)th element
is \rho_{ij}. This matrix, called the correlation matrix, will be denoted by P. The diagonal
terms of P are unity, and the off-diagonal terms are such that P is symmetric.
In order to relate the covariance and correlation matrices, let us define a
(p \times p) diagonal matrix D whose diagonal terms are the standard deviations of the
components of X, so that

D = \begin{bmatrix} \sigma_1 & 0 & \cdots & 0 \\ 0 & \sigma_2 & \cdots & 0 \\ \vdots & \vdots & \ddots & \vdots \\ 0 & 0 & \cdots & \sigma_p \end{bmatrix}

Then the covariance and correlation matrices are related by:

\Sigma = D P D, \quad P = D^{-1} \Sigma D^{-1}    (3.15)

where the diagonal terms of the matrix D^{-1} are the reciprocals of the respective standard
deviations.
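The relation in equation (3.15) can be checked numerically. The sketch below (NumPy, with simulated data rather than a real data set) builds the covariance matrix Σ, the correlation matrix P, and the diagonal matrix D of standard deviations, and verifies Σ = DPD and P = D⁻¹ΣD⁻¹:

```python
import numpy as np

rng = np.random.default_rng(0)
# Hypothetical data matrix: n = 200 observations on p = 3 correlated variables.
X = rng.normal(size=(200, 3)) @ np.array([[1.0, 0.5, 0.2],
                                          [0.0, 1.0, 0.3],
                                          [0.0, 0.0, 1.0]])

Sigma = np.cov(X, rowvar=False)          # covariance matrix (p x p)
P = np.corrcoef(X, rowvar=False)         # correlation matrix (p x p)
D = np.diag(np.sqrt(np.diag(Sigma)))     # diagonal matrix of standard deviations

# Verify the relations of equation (3.15).
print(np.allclose(Sigma, D @ P @ D))
print(np.allclose(P, np.linalg.inv(D) @ Sigma @ np.linalg.inv(D)))
```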
Multiple regression. Any observed variable could be considered a function of any
other variable measured on the same sample (Davis, 1986). In fact, one could have measured several variables in the field, such as depth, permeability,
tip resistance, pore pressure, conductivity, temperature, etc., and could have examined
differences in water content associated with changes in each or all of these variables
with the set of laboratory data. In a sense, variables may be considered as spatial co-ordinates, and one can envision changes occurring 'along' a dimension defined by a
variable such as mineral content.
Regression of any m independent variables upon a dependent variable can be
expressed as:

Y = \beta_0 + \beta_1 X_1 + \beta_2 X_2 + ... + \beta_m X_m + \varepsilon
The normal equations that will yield a least squares solution can be found by
appropriate labelling of the rows and columns of matrix equations and cross-multiplying
to find the entries in the body of the matrix.
For three independent variables, the rows and columns of the matrix scheme are labelled X_0, X_1, X_2, X_3 and Y, and the unknowns are the coefficients b_0, b_1, b_2, b_3.
where X_0 is a dummy variable equal to 1 for every observation. The matrix equation,
after cross-multiplication, is:
\begin{bmatrix} n & \sum X_1 & \sum X_2 & \sum X_3 \\ \sum X_1 & \sum X_1^2 & \sum X_1 X_2 & \sum X_1 X_3 \\ \sum X_2 & \sum X_1 X_2 & \sum X_2^2 & \sum X_2 X_3 \\ \sum X_3 & \sum X_1 X_3 & \sum X_2 X_3 & \sum X_3^2 \end{bmatrix} \begin{bmatrix} b_0 \\ b_1 \\ b_2 \\ b_3 \end{bmatrix} = \begin{bmatrix} \sum Y \\ \sum X_1 Y \\ \sum X_2 Y \\ \sum X_3 Y \end{bmatrix}
The ’s in the regression model are estimated by b’s, the sample partial regression
coefficient. They are called partial regression coefficients because each gives the rate of
change (or slope) in the dependent variable for a unit change in that particular
independent variable provided all other independent variables are held constant.
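As an illustration (not the author's computation), the normal equations can be assembled and solved directly in Python; the variable names and simulated values below are hypothetical:

```python
import numpy as np

rng = np.random.default_rng(1)
n = 50
# Three hypothetical independent variables and a dependent variable.
X1, X2, X3 = rng.normal(size=(3, n))
Y = 2.0 + 1.5 * X1 - 0.8 * X2 + 0.3 * X3 + rng.normal(scale=0.5, size=n)

# Design matrix with the dummy variable X0 = 1 in the first column.
A = np.column_stack([np.ones(n), X1, X2, X3])

# Solve the normal equations (A'A) b = A'Y for the partial regression coefficients.
b = np.linalg.solve(A.T @ A, A.T @ Y)
print("b0, b1, b2, b3 =", np.round(b, 3))

# Equivalent (and numerically safer) least-squares solution.
b_lstsq, *_ = np.linalg.lstsq(A, Y, rcond=None)
print("lstsq:", np.round(b_lstsq, 3))
```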
5. Multivariate Statistical Methods
The best-known multivariate analysis techniques are principal components analysis
(PCA), factor analysis (FA), cluster analysis, and canonical analysis. The first two
methods, PCA and FA, are statistical techniques applied to a single set of variables
where one is interested in discovering which variables in the set form coherent
subsets that are relatively independent of one another. Variables are combined into
factors. Factors are thought to reflect underlying processes that have created the
correlation among variables.
The specific goals of PCA or FA are to summarise patterns of correlation among
observed variables, to reduce a large number of observed variables to a smaller number
of factors, to provide an operational definition (a regression equation) for an underlying
process by using observed variables, and/or to test a theory about the nature of underlying
processes. Interpreting the results obtained from those methods requires a good
understanding of the physical meaning of the problem.
Steps in PCA or FA include selecting and measuring a set of variables, preparing the
correlation matrix (to perform either PCA or FA), extracting a set of factors from the
correlation matrix, determining the number of factors, (probably) rotating the factors to
increase interpretability, and, finally, interpreting the results. A good PCA or FA “makes
sense”; a bad one does not.

The third method, cluster analysis, is designed to solve the following problem: given a sample
of n objects, each of which has a score on p variables, devise a scheme for grouping
the objects into classes so that similar ones are in the same class (Tabachnick et al.,
2000).
The following subsections explain the three multivariate statistics methods that would
be used in this research.
3.6.1 Principal components analysis
Principal component analysis (PCA) is a multivariate technique for examining
relationships among several quantitative variables by forming new variables, which are
linear composites of the original variables. The maximum number of new variables that
can be formed is equal to the number of the original variables, and the new variables are
uncorrelated with one another. The procedure is therefore used if one is interested in summarising data
and detecting linear relationships. In other words, through PCA one seeks
to determine the minimum number of variables that contain the maximum amount of
information and to determine which variables are strongly interrelated.
Principal component analysis (PCA) was originated by Pearson (1901) and later
developed by Hotelling (1933). Many authors, including Rao (1964), Cooley and Lohnes (1971),
Gnanadesikan (1977), and Tabachnick et al. (2000), have discussed the application of PCA.
3.6.1.1 description of principal component analysis method
Given a data set with p numeric variables, p principal components can be computed.
Each principal component is a linear combination of the original variables, with
coefficients equal to the eigenvectors of the correlation or covariance matrix. The
eigenvectors are customarily taken with unit length. The principal components are
sorted by descending order of the eigenvalues, which are equal to the variance of the
components.
In any principal component analysis study, the data set consists of n
observations on p variables. From this n \times p matrix, one can calculate a p \times p matrix
of correlations. In essence, principal component analysis extracts p roots, or eigenvalues,
and p eigenvectors from the correlation matrix. The number of roots corresponds to the
rank of the matrix, which equals the number of linearly independent vectors. The
eigenvalues are numerically equal to the sums of the squared factor loadings and
represent the relative proportion of the total variance accounted for by each component
(Davis 1973; Brown, 1998).
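A minimal sketch of this computation in Python (NumPy eigendecomposition of the correlation matrix of simulated data, not of the research data set):

```python
import numpy as np

rng = np.random.default_rng(2)
# Hypothetical data: n = 100 observations on p = 4 correlated variables.
latent = rng.normal(size=(100, 1))
X = latent + 0.3 * rng.normal(size=(100, 4))

R = np.corrcoef(X, rowvar=False)          # p x p correlation matrix
eigvals, eigvecs = np.linalg.eigh(R)      # eigenvalues and unit-length eigenvectors
order = np.argsort(eigvals)[::-1]         # sort by descending eigenvalue
eigvals, eigvecs = eigvals[order], eigvecs[:, order]

# Principal component scores: standardised data projected onto the eigenvectors.
Z = (X - X.mean(axis=0)) / X.std(axis=0)
scores = Z @ eigvecs

print("eigenvalues (component variances):", np.round(eigvals, 3))
print("sum of eigenvalues (equals p):", round(eigvals.sum(), 3))
```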
3.6.1.2 objectives of principal component analysis
The objective of principal component analysis is to determine the relations existing
between measured properties that were originally considered to be independent sources
of information.
Geometrically, the objective of principal component analysis is to identify a new set of
orthogonal axes such that: (Sharma, 1996)
- The coordinates of the observations with respect to each of the axes give the values of the new variables. The new axes are called principal components, and the values of the new variables are called principal component scores.
- Each new variable is a linear combination of the original variables.
- The first new variable accounts for the maximum variance in the data.
- The second new variable accounts for the maximum variance that has not been accounted for by the first variable.
- The third new variable accounts for the maximum variance that has not been accounted for by the first two variables.
- The pth new variable accounts for the maximum variance that has not been accounted for by the first p-1 variables.
- The p new variables are uncorrelated.
The key parameters to be analysed in PCA are:
- the sum of variation explained,
- eigenvalues greater than one, and
- the cumulative variation explained.
These parameters indicate when factors or components are non-significant in the
statistical sense and should be discarded. The cumulative variation explained is
expected to be in the 95% range, and the eigenvalues should have values greater than
one; otherwise the factors are probably the result of statistical noise and more data are
needed.
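A short numerical sketch of how these retention criteria might be applied to a vector of eigenvalues (the values below are illustrative, not from a real analysis):

```python
import numpy as np

# Eigenvalues of a correlation matrix from a hypothetical PCA run
# (illustrative numbers; in practice take them from the analysis above).
eigvals = np.array([3.1, 1.4, 0.9, 0.4, 0.2])

explained = eigvals / eigvals.sum()     # proportion of variation per component
cumulative = np.cumsum(explained)       # cumulative variation explained
kaiser = eigvals > 1.0                  # Kaiser criterion: eigenvalues greater than one

for i, (ev, cum, keep) in enumerate(zip(eigvals, cumulative, kaiser), start=1):
    print(f"PC{i}: eigenvalue={ev:.2f}, cumulative variation={cum:.1%}, retain={keep}")
```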
3.6.2 Factor analysis
Factor analysis (FA) is a generic name given to a class of multivariate statistical
methods whose primary purpose is data reduction and summarisation. Broadly
speaking, it addresses itself to the problem of analysing the interrelationships among a
large number of variables and then explaining these variables in terms of their common
underlying dimensions [factors], (Hair et al., 1987).
The general purpose of factor analytic techniques is to find a way of condensing
(summarising) the information contained in a number of original variables into a
smaller set of new composite dimensions (factors) with a minimum loss of information.
That is, to search for and define the fundamental constructs or dimensions assumed to
underlie the original variables.
3.6.2.1 description of the factor analysis method
The starting point in factor analysis, as with other statistical techniques, is the research
problem. As mentioned earlier, one of the research objectives is
to reduce and summarise the number of water quality variables being measured
during the groundwater monitoring program. It is believed that factor analysis is an
appropriate technique to achieve this objective. The factor analysis method normally
answers questions such as: which variables should be included, how many variables
should be included, how are the variables measured, and is the sample size large
enough?
Regarding the question of variables, any variables relevant to the research problem can
be included as long as they are appropriately measured.
Regarding the sample size question, the researcher generally would not factor analyse a
sample of fewer than 50 observations and preferably the sample size should be 100 or
larger. As a rule, there should be four or five times as many observations as there are
variables to be analysed, (Hair et al., 1987). However, this ratio is somewhat
conservative, and in many instances the researcher is forced to analyse a set of
variables when only a 2:1 ratio of observations to variables is available. When dealing
with smaller sample sizes and a lower ratio, one should interpret any findings
cautiously. Figure 3.1 shows the general steps followed in any application of factor
analysis techniques.
One of the first decisions in the application of factor analysis involves the calculation of
the correlation matrix. If the objective of the research were to summarise the
characteristics, the factor analysis would be applied to the correlation matrix of the
variables. This is the most common type of factor analysis and is referred to as R factor
analysis. Factor analysis may also be applied to a correlation matrix of the individual
respondents. This type of analysis is called Q factor analysis.
Numerous variations of the general model are available. The two most frequently employed
analytic approaches are principal component analysis and common factor analysis. The
component model is used when the objective is to summarise most of the original
information (variance) in a minimum number of factors for prediction purposes. In
contrast, common factor analysis is used primarily to identify underlying factors or
dimensions not easily recognised.
Figure 3.1: Factor analysis decision diagram, after Hair et al. (1987). The diagram proceeds from the research problem (which variables to include, how many, how they are measured, and the sample size) to the factor model (component analysis versus common factor analysis), the correlation matrix (R versus Q), the extraction method (orthogonal or oblique), the unrotated factor matrix (number of factors), the rotated factor matrix (factor interpretation), and finally the factor scores used in subsequent analyses (regression, discriminant analysis, correlation).
3.6.3 PCA versus FA
PCA and FA procedures are similar, except for preparation of the observed correlation
matrix for extraction. The difference is in the variance that is analysed. In
either PCA or FA, the variance that is analysed is the sum of the values in the positive
diagonal. In PCA, all the variance in the observed variables is analysed, whereas in FA
only the shared variance of the variables is analysed.
Mathematically, the difference between PCA and FA occurs in the contents of the
positive diagonal of the correlation matrix (the diagonal that contains the correlation
between a variable and itself). In PCA, ones are on the diagonal and there is as much
variance to be analysed as there are variables; each variable contributes a unit of
variance to the positive diagonal of the correlation matrix. All the variance is distributed
to the components, including error and unique variance for each observed variable. So if
all components are retained, PCA duplicates exactly the observed correlation matrix and
the standard scores of the observed variables. In FA, only the variance that each variable
shares with the other observed variables is available for analysis (Brown, 1998).
Concerning the ability of the two techniques (PCA and FA) in examining the
interrelationships among a set of variables, the two techniques are different and should
not be confused. FA is more concerned with explaining the covariance structure of the
variables, whereas PCA is more concerned with explaining the variability in the
variables.
Both FA and PCA differ from the regression analysis in that there is no dependent
variable to be explained by a set of independent variables. However, PCA and FA also
differ from each other. In PCA, the major aim is to select a number of components that
explain as much of the total variance as possible. On the other hand, the main objective
of FA is to explain the interrelationships among the original variables.
The major emphasis is placed on obtaining understandable factors that convey the essential
information contained in the original variables.
Significance tests for factor analysis
One of the more sophisticated tests of the adequacy of factor analysis is given by the Kaiser-Meyer-Olkin (KMO) measure of sampling adequacy (Norusis, 1985, cited in Brown, 1998), which
is an index for comparing the magnitudes of the partial correlation coefficients. Small
values of the KMO measure indicate that a technique such as factor analysis may not be
a good idea. Kaiser (1974) has indicated that KMOs below 0.5 are not acceptable.
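For illustration, the KMO measure can be computed directly from a correlation matrix using the standard anti-image (partial correlation) formulation; the sketch below assumes that formulation and uses a made-up correlation matrix:

```python
import numpy as np

def kmo(R):
    """Kaiser-Meyer-Olkin measure of sampling adequacy from a correlation matrix R.

    Sketch of the standard formula: squared off-diagonal correlations are compared
    with squared off-diagonal partial (anti-image) correlations from the inverse of R.
    """
    R = np.asarray(R, dtype=float)
    R_inv = np.linalg.inv(R)
    d = np.sqrt(np.outer(np.diag(R_inv), np.diag(R_inv)))
    partial = -R_inv / d                       # partial correlations (anti-image)
    off = ~np.eye(R.shape[0], dtype=bool)      # mask for off-diagonal elements
    r2 = np.sum(R[off] ** 2)
    p2 = np.sum(partial[off] ** 2)
    return r2 / (r2 + p2)

# Illustrative correlation matrix (made-up values).
R = np.array([[1.0, 0.6, 0.5],
              [0.6, 1.0, 0.4],
              [0.5, 0.4, 1.0]])
print(f"KMO = {kmo(R):.3f}")   # values below 0.5 suggest FA is not appropriate
```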
Mathematically, the two methods produce several linear combinations of observed
variables, each linear combination being a component or factor. The factors summarise
the patterns of the correlations in the observed correlation matrix and can in fact be used
to reproduce the observed correlation matrix. Nevertheless, usually the number of
factors is far fewer than the number of observed variables.
Steps in PCA or FA consist of selecting and measuring a set of variables, preparing the
correlation matrix, extracting the set of factors, rotating the factors to increase
interpretability and finally, interpreting the results. It should be mentioned that there are
relevant statistical considerations to most of those steps, but the final task of the analysis
is the interpretability. In this respect, a factor is more easily interpreted when several
observed variables correlate highly with it and those variables do not correlate with
other factors (Korre, 1997).
3.6.4 Major types of factor analysis
The following sections describe some of the most commonly used methods for factor
extraction and rotation. These methods are described in many references, such as:
Rummel (1970), Mulaik (1972), Harman (1976), Brown (1998), and Tabachnick et al.
(2000).
3.6.4.1 extraction techniques for Factors
About seven extraction techniques can be considered as the principal factor extraction
techniques. Those techniques, found in the most popular statistical packages, are
summarised in Table 3.1. Of these, PCA and principal factors are the most commonly
used.
All the extraction techniques calculate a set of orthogonal components or factors that, in
combination, produce R. The principles used to establish the solutions, such as maximising
variance or minimising residual correlations, differ from one technique to another, but the
differences in the solutions are small for a data set with a large sample, numerous variables,
and similar communality estimates.
Table 3.1: Summary of extraction procedures (modified after Tabachnick, 2000).
- Principal components. Objective: maximise variance extracted by orthogonal components. Special features: mathematically determines an empirical solution, with common, unique, and error variance mixed into the components.
- Principal factors. Objective: maximise variance extracted by orthogonal factors. Special features: estimates communalities in an attempt to eliminate unique and error variance from the factors.
- Image factoring. Special features: uses SMCs (squared multiple correlations) between each variable and all the others as communalities, to generate a mathematically determined solution with error variance and unique variance eliminated.
- Maximum likelihood factoring. Objective: estimate factor loadings for the population that maximise the likelihood of sampling the observed correlation matrix. Special features: has significance tests for factors; especially useful for confirmatory factor analysis.
- Alpha factoring. Objective: maximise the generalisability of orthogonal factors. Special features: somewhat likely to produce communalities greater than 1.
- Un-weighted least squares. Objective: minimise squared residual correlations.
- Generalised least squares. Objective: weight variables by shared variance before minimising squared residual correlations.
Principal components: the objective of PCA is to extract maximum variance from the
data set with each component. The first principal component is the linear combination
of observed variables that maximally separates subjects by maximising the variance of
their component scores. The second component is formed from residual correlations; it
is the linear combination of observed variables that extracts maximum variability
uncorrelated with the first component. Subsequent components also extract maximum
variability from residual correlations and are orthogonal to all previously extracted
components. The principal components are ordered, with the first component extracting
the most variance and the last component the least variance. The solution is
mathematically unique, and if all components are retained, exactly reproduces the
observed correlation matrix.
Principal factors: the objective remains the same as in principal components extraction:
to extract maximum orthogonal variance from the data set with each subsequent
factor. Principal factors extraction differs from PCA in that estimates of communality,
instead of ones, are in the positive diagonal of the observed correlation matrix. These
estimates are derived through an iterative procedure, with SMC’s used as the starting
values in the iteration. One advantage of principal factors extraction is that it conforms
to the factor analytic model in which common variance is analysed with unique and
error variance removed. However, principal factors are sometimes not as good as other
extraction techniques in reproducing the correlation matrix.
Image factor extraction: provides an interesting compromise between PCA and
principal factors. Like PCA, image extraction provides a mathematically unique
solution because there are fixed values in the positive diagonal of R. Like principal
factors, the values of the diagonal are communalities with unique and error variability
excluded. The compromise is struck by using the SMC (R²) of each variable, treated as the
DV with all the others serving as IVs, as the communality for that variable.
Maximum likelihood factor extraction: estimates population values for factor loadings
by calculating loadings that maximise the probability of sampling the observed
correlation matrix from a population. Within constraints imposed by the correlations
among variables, population estimates for factor loadings are calculated that have the
greatest probability of yielding a sample with the observed correlation matrix.
Alpha factoring: is concerned with the reliability of the common factors rather than with the
reliability of group differences. Coefficient alpha is a measure from psychometrics of the
reliability (also called generalisability) of a score taken in a variety of situations. In this
method, communalities are estimated, using iterative procedures that maximise
coefficient alpha for the factors. The procedure is used where the interest is in
discovering which common factors are found consistently when repeated samples of
variables are taken from a population of variables.
Un-weighted least squares factoring: the goal of un-weighted least squares (minimum
residual) factor extraction is to minimise the squared differences between the observed and
reproduced correlation matrices. The off-diagonal differences are considered;
communalities are derived from the solution rather than estimated as part of the
solution. This procedure gives the same results as principal factors if the communalities are
the same.
Generalised least squares factoring: aims to minimise the off-diagonal squared
differences between observed and reproduced correlation matrices, but in this case
weights are applied to the variables. Differences for variables that are not as strongly
related to other variables in the set are not as important to the solution.
3.6.4.2 rotation
Regardless of the extraction method used, none of the extraction techniques routinely
provides an interpretable solution without rotation; however, all types of extraction may
be rotated by any of the procedures described in this section. Rotation is used to
improve the interpretability and scientific utility of the solution. It is usually done when the
matrix of factor loadings is not easily explained. However, it is important to
mention that rotation does not improve the quality of the mathematical fit between the
observed and reproduced correlation matrices, because all orthogonally rotated solutions
are mathematically equivalent to one another and to the solution before rotation. Also,
the different rotation methods tend to give similar results if the data set is good
and the pattern of correlation in the data is fairly clear.
The methods of rotation are varimax, quartimax, parsimax, equamax, orthomax with
user-specified gamma, promax with user-specified exponent, Harris-Kaiser case II with
user-specified exponent, and oblique Procrustean with a user-specified target pattern.
Table 3.2 summarises the most common rotation techniques. These techniques,
described in detail in Harman (1976), Mulaik (1972), and Gorsuch (1983), can be
classified into two main categories: (a) orthogonal methods and (b) oblique methods.
Table 3.2: Summary of rotational techniques (modified after Korre, 1997).
- Varimax (orthogonal). Purpose: minimise the complexity of factors by minimising the variance of loadings on each factor. Characteristics: most commonly used method; gamma (γ) = 1.
- Quartimax (orthogonal). Purpose: minimise the complexity of variables by minimising the variance of loadings on each variable. Characteristics: the first factor tends to be general, with the others sub-clusters of variables (γ = 0).
- Equamax (orthogonal). Purpose: simplify both variables and factors. Characteristics: may behave erratically (γ = 1/2).
- Parsimax (orthogonal). Purpose: performs an orthomax rotation with γ = (nvar × (nfact - 1))/(nvar + nfact - 2), where nvar = number of variables and nfact = number of factors.
- Orthoblique (both orthogonal and oblique). Purpose: rescales factor loadings to yield an orthogonal solution; the non-rescaled loadings may be correlated.
- Promax (oblique). Purpose: orthogonal factors rotated to oblique positions. Characteristics: fast and inexpensive method.
- Procrustes (oblique). Purpose: rotate to a target matrix. Characteristics: useful in confirmatory FA.
Orthogonal rotation techniques include varimax, quartimax and equamax, with
varimax being the most commonly used of all the available rotation methods.
There is a slight difference in the objectives of these methods. The aim of using
varimax is to maximise the variance of the loadings within factors, across variables. The
loadings that are high after extraction become higher and those that are low become
even lower. This leads to an easier interpretation of the factors, as the correlations
between the factors and the variables become obvious. Varimax also tends to
reapportion variance among factors so that they become relatively equal in importance.
Quartimax is more or less similar to varimax; while the latter deals with the factors, the
former deals directly with the variables. As most researchers are more interested in
simple factors rather than simple variables, the quartimax method remains not nearly as
popular as varimax. Equamax is a hybrid between varimax and quartimax that tries
simultaneously to simplify the factors and the variables. One should be careful in using
this method, as it tends to behave erratically unless the researcher can specify the
number of factors with confidence (Mulaik, 1972).
Oblique rotation is used if the researcher believes that the processes represented by the
factors are correlated. The most common oblique rotation methods are Promax and
Procrustes methods. Oblique rotation offers a continuous range of correlations between
factors and often produces more useful patterns than do orthogonal rotations. In Promax
rotation, an orthogonally rotated solution (usually varimax) is rotated again to allow
correlations among factors. The orthogonal loadings are raised to powers (usually 2,4,
or 6) to drive small and moderate loadings to zero while larger loadings are reduced, but
not to zero. Although the factors correlate, simple structure is maximised by
clarifying which variables do and do not correlate with each factor. This method has
another advantage of being fast. In Procrustes rotation, the researcher specifies a target
matrix of loadings (usually 0’s and 1’s) and a transformation matrix is sought to rotate
extracted factors to the target, if possible. If the solution can be rotated to the target,
then the hypothesised factor structure is said to be confirmed. Unfortunately, as Gorsuch
(1974) reports, with procrustean rotation factors are often extremely highly correlated
and sometimes a correlation matrix generated by random processes is rotated to the
target with apparent ease.
Orthoblique rotation uses the quartimax algorithm to produce an orthogonal solution
on rescaled factor loadings; therefore, the solution may be oblique with respect to the
original factor loadings.
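As a hedged illustration (simulated data rather than the research data; scikit-learn's FactorAnalysis is used, and its rotation="varimax" option requires a reasonably recent release), a two-factor extraction followed by a varimax rotation might look like this:

```python
import numpy as np
from sklearn.decomposition import FactorAnalysis

rng = np.random.default_rng(3)
# Hypothetical data: two latent factors driving six observed variables.
f = rng.normal(size=(300, 2))
loadings_true = np.array([[0.9, 0.0], [0.8, 0.1], [0.7, 0.0],
                          [0.0, 0.9], [0.1, 0.8], [0.0, 0.7]])
X = f @ loadings_true.T + 0.3 * rng.normal(size=(300, 6))

# Common factor analysis with two factors and a varimax rotation.
fa = FactorAnalysis(n_components=2, rotation="varimax", random_state=0)
fa.fit(X)

print("Rotated loadings:")
print(np.round(fa.components_.T, 2))   # rows = variables, columns = factors
```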
3.6.4.3 geometric interpretation
Factor extraction yields a solution in which the observed variables are vectors that
terminate at points indicated by the coordinate system. The factors serve as axes for
the system. The coordinates of each point are the entries from the loadings matrix for
the variable, and the length of the vector represents the communality of the variable. If
the factors are orthogonal, the factor axes are all at right angles to one another, and the
coordinates of the variable points are correlations between the common factors and the
observed variables.
As mentioned before, one of the essential goals of PCA and FA is to discover the
minimum number of factor axes needed to reliably position the variables. A second
major intention, and the motivation behind rotation, is to realise the meaning of the
factors that underlie responses to observed variables. This goal is achieved by
interpreting the factor axes that are used to define the space. Factor rotation repositions
factor axes to make them interpretable. It should be noted that repositioning the axes
changes the coordinates of the variable points but not the positions of the points with
respect to each other.
Factors are usually interpretable when some observed variables load highly on them and
the rest do not. Ideally, each variable loads on one, and only one, factor. Graphically,
this means that the point representing each variable lies far out along one axis but near
the origin on the other axes, i.e., the coordinates of the point are large for one axis and
near zero for the other axes.
3.6.5 Canonical analyses
Variables are often found to belong to different groups that are generally related to
different processes or factors. Canonical correlation analysis is used to identify and
quantify the associations between two sets of variables in a data set. Its main objective
is to determine the correlation between a linear combination of the variables in one set
and a linear combination of the variables in another set. The first pair of linear
combinations has the largest correlation. The second pair of linear combinations is
determined and has the second largest correlation of the remaining variable sets. This
process continues until all pairs of remaining variables are analysed. The pairs of linear
combinations are called canonical variables, and their correlations are called canonical
correlations.
If one is interested in the correlation between two sets of water quality parameters,
for example field and laboratory data, or anions and cations, then canonical correlation
would be the technique to define such a correlation.
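A minimal sketch of canonical correlation analysis in Python using scikit-learn's CCA class; the "field" and "laboratory" sets below are simulated stand-ins, not real monitoring data:

```python
import numpy as np
from sklearn.cross_decomposition import CCA

rng = np.random.default_rng(4)
n = 200
# Hypothetical "field" set (3 in-situ measurements) and "laboratory" set
# (4 analysed parameters), linked through a shared latent signal.
latent = rng.normal(size=(n, 1))
X_field = latent @ rng.normal(size=(1, 3)) + 0.5 * rng.normal(size=(n, 3))
Y_lab = latent @ rng.normal(size=(1, 4)) + 0.5 * rng.normal(size=(n, 4))

cca = CCA(n_components=2)
U, V = cca.fit_transform(X_field, Y_lab)    # pairs of canonical variables

# Canonical correlations: correlations between paired canonical variables.
canon_corr = [np.corrcoef(U[:, k], V[:, k])[0, 1] for k in range(2)]
print("canonical correlations:", np.round(canon_corr, 3))
```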
A quick scan of the nature of the data associated with groundwater indicators
showed that the data collected around mines or landfill sites involve many
parameters. In order to summarise these data and detect their linear relationships, one
can use the principal component analysis (PCA) or factor analysis (FA) method to reduce the number
of variables in regression, and hence to define the quality indicators. The aim is to
predict the parameters that are usually determined in the laboratory from pumped
samples from a minimum (optimised) set of parameters measured in the field. In other
words, multivariate statistics methods will provide a tool to detect the inter-correlation
structure of the variables of the raw data.
Presently, there are many software packages on the market in which different multivariate
methods are implemented and can be used quite easily, such as EXCEL, MINITAB,
STATAGRAPH, SYSTAT, STATA, SPSS, and SAS. The latter was used throughout
this research, as it is considered one of the most powerful packages for such
applications.
Applications of PCA and FA are usually conducted using the following steps:
1. Extraction of the principal components;
2. Conduct a varimax rotation if suitable (otherwise try another rotation method);
3. From the results of step 2, estimate:
   a. the factorability of the correlation matrix;
   b. the rank of the observed correlation matrix;
   c. the number of factors; and
   d. the variables that might be excluded from subsequent analyses.
2.4 Statistical Tests
3.7.1 Data screening and processing
The collected water quality data should be examined carefully before conducting any
advanced statistics, in order to understand the nature of the variables. Data errors tend to
follow a definite pattern. The most common errors associated with data preparation are:
- extreme values in the data set;
- inversion or interchange of two numbers;
- repetition of numbers;
- having a number in the wrong column.
Computing descriptive statistics for the original data, such as the range, median, and
mean of the variables, can aid in identifying some of these common errors.
The data for multivariate analysis should always be examined and tested for normality,
homogeneity of variances, and multicollinearity. The main reasons behind these tests are
to (1) determine the suitability of the data for analysis, (2) decide if transformations are
necessary, and (3) decide what form of the data should be used (Brown, 1998).
3.7.2 Testing for normality and homogeneity
Normality of variables is assessed by either statistical or graphical methods. Two
components of normality are skewness and kurtosis. Skewness has to do with the
symmetry of the distribution; a skewed variable is a variable whose mean is not in the
centre of the distribution. Kurtosis has to do with the peakedness of a distribution; a
distribution is either too peaked (with short, thick tails) or too flat (with long, thin tails).
Figure 3.2 shows a normal distribution, distributions with skewness, and distributions
with non-normal kurtosis. Tests of normality may include normal probability plots of the
variables, tests of skewness and kurtosis, chi-square goodness-of-fit tests, and/or
histograms.
3.7.2.1 coefficient of skewness
The coefficient of skewness is often calculated to determine if the distribution is
symmetrical or whether it tails to the left (negative) or right (positive). Generally, one
can look at departures from symmetry of a distribution using the skewness as a measure
of normality.
3.7.2.2 coefficient of kurtosis
The coefficient of kurtosis, CK, is a measure of flatness and may be tested. For a normal
distribution, the CK has a value of 0.263 (Spiegel,1961).
Figure 3.2: Normal distribution, distributions with skewness, and distributions with kurtosis: (a) normal; (b) positive skewness; (c) negative skewness; (d) positive kurtosis; (e) negative kurtosis.
3.7.2.3 significance tests of normality
Graphical displays of data such as histograms and probability plots can be useful in
recognising how data are distributed and whether errors exist in the data. Other tests include
chi-square goodness-of-fit tests. The standard errors of both skewness and kurtosis can be
approximated and then used in a z-test against zero. The forms of the equations are:

z = \frac{S_k - 0}{s_s} \quad \text{and} \quad z = \frac{K_u - 0}{s_k}

with S_k = skewness and K_u = kurtosis, where s_s and s_k are the standard errors, given
approximately by:

s_s = \sqrt{\frac{6}{N}} \quad \text{and} \quad s_k = \sqrt{\frac{24}{N}}
The homogeneity of variances is usually evaluated using Bartlett’s Test of Homogeneity of
Variances or another similar measure.
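These tests are readily available in scipy.stats; the sketch below (simulated data) runs the z-tests of skewness and kurtosis, an omnibus normality test, and Bartlett's test of homogeneity of variances:

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(5)
x = rng.lognormal(mean=0.0, sigma=0.6, size=120)   # deliberately skewed sample
y = rng.normal(size=120)

# z-tests of skewness and kurtosis against zero (as in the equations above).
print("skewtest:", stats.skewtest(x))
print("kurtosistest:", stats.kurtosistest(x))
# Combined omnibus test of normality.
print("normaltest:", stats.normaltest(x))
# Bartlett's test of homogeneity of variances between two groups.
print("bartlett:", stats.bartlett(x, y))
```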
3.7.3 Data transformation
Transformations are used to make data more linear, more symmetric, or to achieve constant
variance. The most commonly used transformation is the logarithm. If the data are chosen for
transformation, it is important to check that the variable is normal or nearly normal after
transformation. This involves finding the transformation that produces skewness and kurtosis
values nearest zero, or the transformation showing the fewest outliers.
Several methods are available for data transformations. The selection of the most appropriate
method depends on the nature of the data. Commonly, data transformations are divided into two
families: nonoptimal transformations and optimal transformations.
3.7.3.1 nonoptimal transformation
Nonoptimal transformations are computed before the algorithm begins. Nonoptimal
transformations create a single new transformed variable that replaces the original variables.
The subsequent iterative algorithms (except for the possible linear transformation and missing
value estimation) do not transform the new variable [Ref. SAS/STAT User’s Guide, Volume
2, 1994, page 1280]. Methods Included involves inverse trigonometric sine, exponential
variables, logarithm, logit, raise variables to specified power, and transform to ranks methods.
However, the most applicable method is the logarithm method.
Logarithmic transformation: if the data values have a high range, then one thinks about
using a logarithmic transformation. However, not all data can be transformed using
this method; for example, in this particular case the following variables are not
suitable for logarithmic transformation: Li, NH4, NO2 and S. Results of this
transformation can be seen in Table A5.8-1.
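A small Python sketch of this screening step (pandas/NumPy; the column names and values are hypothetical): variables containing zeros or negative values are flagged rather than log-transformed:

```python
import numpy as np
import pandas as pd

# Hypothetical water-quality table (made-up values; column names are illustrative).
df = pd.DataFrame({"EC": [410.0, 980.0, 12500.0, 760.0],
                   "Cl": [35.0, 120.0, 4100.0, 88.0],
                   "NO2": [0.0, 0.01, 0.0, 0.02]})   # contains zeros

for col in list(df.columns):
    if (df[col] > 0).all():                  # log-transform only strictly positive variables
        df[f"log_{col}"] = np.log10(df[col])
    else:
        print(f"{col}: not suitable for logarithmic transformation")

print(df.round(3))
```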
3.7.3.2 optimal transformations
Optimal transformations replace the specified variables with new, iteratively derived, optimally
transformed variables that fit the specified model better than the original variables. Transformation
families of this type include linear, monotonic, and optimal scoring transformation methods:
Linear transformation finds an optimal transformation of each variable. For variables with no
missing values, the transformed variable is the same as the original variable.
Monotonic transformation finds a monotonic transformation of each variable using a
least-squares monotonic transformation, with the restriction that ties are preserved.
Optimal scoring transformation finds an optimal scoring of each variable by assigning scores
to each class (level) of the variable.
Other optimal transformations include B-spline methods and monotonic methods in which ties are not preserved.
3.7.4 Standardisation
Standardisation is a transformation of a collection of data to a standardised, or unit-less, form by
subtracting from each observation the mean of the data set and dividing by the respective
standard deviation. The new variable will then have a mean of zero and a variance of one.
Results of data standardisation are given in Table A5.9.
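A one-line version of this standardisation in Python (made-up data matrix; ddof=1 gives the sample standard deviation):

```python
import numpy as np

# Hypothetical data matrix: rows = observations, columns = variables.
X = np.array([[7.1, 410.0, 35.0],
              [6.8, 980.0, 120.0],
              [7.9, 760.0, 88.0],
              [7.4, 655.0, 64.0]])

# Standardise each column: subtract the mean and divide by the standard deviation.
Z = (X - X.mean(axis=0)) / X.std(axis=0, ddof=1)

print(np.round(Z, 3))
print("column means ~ 0:", np.round(Z.mean(axis=0), 6))
print("column variances ~ 1:", np.round(Z.var(axis=0, ddof=1), 6))
```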