DESCRIPTION OF COURSE IN THE PROGRAM

COURSE NO. & TITLE: Multivariate Statistics
SEMESTER & YEAR: Elective course (3) for the 4th year of Mining
WEEKLY HOURS: 4 hours
NO. OF WEEKS: 15
NO. OF STUDENTS: 9 students
AVERAGE ATTENDANCE PERCENTAGE: 90%
NO. OF GROUPS: One
PREREQUISITE COURSES: Statistics, Mining Operations
TEXTBOOK & PUBLISHER: Multivariate Statistics, 2001
REFERENCE & PUBLISHER: Introduction to Multivariate Statistics, Barbara, 2001
TEACHER(S) NAME(S) AND POSITION(S): Sameh Saad Eldin Ahmed
COURSE GOALS: Understanding multivariate statistical methods and their applications to environmental studies
PREREQUISITE TOPICS: Descriptive statistics
COURSE TOPICS: Principal Component Analysis; Factor Analysis; Interpretation of Factors; Applications
COMPUTER USAGE IF ANY: SAS, Excel
LABORATORY EXPERIMENTS OR APPLIED SESSIONS IF ANY: -
RESEARCH PROJECTS IF ANY: Selected topics
OTHER INFORMATION: Seminars and reports

Syllabus

1. Introduction
   1.1. General
   1.2. Multivariate Statistics: Why?
   1.3. The Domain of Multivariate Statistics: Number of IVs and DVs
   1.4. Computers and Multivariate Statistics
   1.5. Number and Nature of Variables to Include
2. Review of Univariate and Bivariate Statistics
   2.1. Hypothesis Testing
   2.2. Analysis of Variance
   2.3. Parameter Estimation
   2.4. Bivariate Statistics: Correlation and Regression
3. Cleaning Up Your Act: Screening Data Prior to Analysis
   3.1. Important Issues in Data Screening
        3.1.1. Accuracy of data file
        3.1.2. Missing data
        3.1.3. Outliers
        3.1.4. Normality, Linearity, and Homogeneity
   3.2. Complete Example of Data Screening
4. Multiple Regression
5. Multivariate Statistical Methods
   5.1. Principal Components Analysis
        5.1.1. General Purpose and Description
        5.1.2. Kinds of Research Questions
        5.1.3. Limitations
        5.1.4. Examples
   5.2. Factor Analysis
        5.2.1. Fundamental Equations of Factor Analysis
        5.2.2. Major Types of Factor Analysis
        5.2.3. Some Important Issues
        5.2.4. Complete Example of FA

1. Introduction

1.1. General

Statistics is the branch of mathematics that deals with the analysis of data, and it is divided into descriptive statistics and inferential statistics (statistical inference). Multivariate statistics is an extension of univariate (one variable) and bivariate (two variables) statistics. It allows a single test instead of many different univariate and bivariate tests when a large number of variables are being investigated (Brown, 1998). Multivariate statistics therefore represents the general case, and univariate and bivariate analyses are special cases of the general multivariate model.

It is important to point out at the outset that there is more than one analytical statistical strategy appropriate for analysing most data sets. The choice of technique depends on the nature of the data, the number of variables, the interrelationships of the variables, and the application of the principle of parsimony, where simplicity in interpretation is of primary concern.

1.2. Multivariate Analysis: Why?

Multivariate statistics are increasingly popular techniques for analysing complicated data sets. They provide analysis when there are many independent variables (IVs) and/or many dependent variables (DVs), all correlated with one another to varying degrees. Because of the difficulty of addressing complicated research questions with univariate analyses, and because of the availability of canned software for performing multivariate analyses, multivariate statistics have become widely used.
Indeed, the day is near when a standard univariate statistics course ill-prepares a student either to read the research literature or to produce research. As a definition, multivariate data consist of observations on several different variables for a number of individuals or objects (Chatfield et al., 1980). The analysis of multivariate data is usually concerned with several aspects. First, what are the relationships, if any, between the variables? Second, what differences, if any, are there between classes?

Multivariate analysis combines variables to do useful work. The combination of variables that is formed is based on the relations between the variables and the goals of the analysis, but in all cases it is a linear combination of variables.

1.3. The Domain of Multivariate Statistics: Number of IVs and DVs

Multivariate statistical methods are an extension of univariate and bivariate statistics. Multivariate statistics are the complete or general case, whereas univariate and bivariate statistics are special cases of the multivariate model. If a design has many variables, multivariate techniques often allow a single analysis instead of a series of univariate or bivariate analyses.

Variables are generally classified into two major groups: independent and dependent. Independent variables (IVs) are the differing conditions to which the subjects are exposed, or characteristics (tall or short) that the subjects themselves bring into the research situation. IVs are usually considered predictor or causal variables because they predict or cause the DVs, the response or outcome variables. Note that IV and DV are defined within a research context; a DV in one research setting may be an IV in another.

The term univariate statistics refers to analyses in which there is a single DV. There may be, however, more than one IV. For example, the amount of social behaviour of graduate students (the DV) may be studied as a function of course load (one IV) and type of training in social skills to which students are exposed (another IV). Bivariate statistics frequently refers to the analysis of two variables where neither is an experimental IV and the desire is simply to study the relationship between the variables (e.g., the relationship between income and amount of education). With multivariate statistics, multiple dependent and multiple independent variables are analysed simultaneously.

1.4. Computers and Multivariate Statistics

One answer to the question "Why multivariate statistics?" is that the techniques are now accessible by computer. Among the several computer packages available on the market, the following are the most common:
- SPSS (Statistical Package for the Social Sciences), SPSS Inc., 1999e
- SAS (Statistical Analysis System), SAS Institute Inc., 1998
- SYSTAT, SPSS Inc., 1999f

Garbage In, Roses Out? The trick in multivariate statistics is not in the computation; that is easily done by computer. The trick is to select reliable and valid measurements, choose the appropriate program, use it correctly, and know how to interpret the output. Output from commercial computer programs, with their beautifully formatted tables, graphs, and matrices, can make garbage look like roses.

1.5. Number and Nature of Variables to Include

Attention to the number of variables included in an analysis is important. A general rule is to get the best solution with the fewest variables. As more and more variables are included, the solution usually improves, but only slightly.
Sometimes the improvement does not compensate for the cost in degrees of freedom of including more variables, so the power of the analyses diminishes. If there are too many variables relative to the sample size, the solution provides a wonderful fit to the sample that may not generalise to the population, a condition known as overfitting. To avoid overfitting, include only a limited number of variables in each analysis.

Considerations for variables in a multivariate analysis include cost, availability, meaning, and the theoretical relationships among the variables. A few reliable variables give a more meaningful solution than a large number of less reliable variables. Indeed, if the variables are sufficiently unreliable, the entire solution may reflect only measurement error.

An appropriate data set for multivariate statistical methods consists of values on a number of variables for each of several subjects or cases. For continuous variables, the values are scores on the variables. For example, if the continuous variable is the GRE, the values for the various subjects are scores such as 500, 650, 420, and so on.

2. Review of Univariate and Bivariate Statistics

2.1 Univariate Statistics

Univariate tools are used to describe the distribution of individual variables, one variable at a time. This is known as data screening and preparation and involves the summary statistics of the data.

2.1.1 Histograms of data

Histograms are very useful data summaries which allow many characteristics of the data to be presented in a single illustration. They are obtained simply by grouping the data together into classes.

2.1.2 Summary statistics

The important features of most histograms can be captured by a few summary statistics (simple statistical measures). The three main categories of statistics are (Isaaks et al., 1989):

2.1.2.1 measures of location

The mean, the median, and the mode give some idea of where the centre of the distribution lies. The locations of other parts of the distribution are given by various quantiles.

The mean (\mu) is the arithmetic average of the data values:

    \mu = \frac{1}{n} \sum_{i=1}^{n} x_i    (3.1)

where n is the number of data and x_1, ..., x_n are the data values.

The median (M) is the midpoint of the observed values when they are arranged in increasing order.

The mode is the value that occurs most frequently.

Minimum: the smallest value in the data set.

Maximum: the largest value in the data set.

Lower and upper quartile: if the data values are arranged in increasing order, a quarter of the data falls below the lower or first quartile, Q1, and a quarter of the data falls above the upper or third quartile, Q3.

Deciles, percentiles, and quantiles: deciles split the data into tenths, while percentiles split the data into hundredths. Quantiles are a generalisation of this idea to any fraction.

2.1.2.2 measures of spread

The variance, the standard deviation, and the interquartile range are used to describe the variability of the data values.

Variance is the average squared difference of the observed values from their mean. It is directly proportional to the amount of variation in the data. The variance, \sigma^2, is given by:

    \sigma^2 = \frac{1}{n} \sum_{i=1}^{n} (x_i - \mu)^2    (3.2)

which for a sample is often computed as SS/df, where n is the number of observations, SS refers to the sum of squares and df is the degrees of freedom (= n - 1).

Standard deviation (\sigma) is simply the square root of the variance. It is often used instead of the variance since its units are the same as the units of the variable being described.
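Although the course software is SAS and Excel, the location and spread measures defined so far can be illustrated with a short, self-contained NumPy sketch (the data values below are made up for illustration; the interquartile range and shape measures follow in the next subsections):

    import numpy as np

    x = np.array([500, 650, 420, 580, 500, 610, 450, 530, 500])  # hypothetical GRE-style scores

    n = len(x)
    mean = x.sum() / n                       # equation (3.1)
    median = np.median(x)                    # midpoint of the ordered values
    values, counts = np.unique(x, return_counts=True)
    mode = values[counts.argmax()]           # most frequently occurring value
    q1, q3 = np.percentile(x, [25, 75])      # lower and upper quartiles
    variance = ((x - mean) ** 2).sum() / n   # equation (3.2), n-divisor form
    std_dev = np.sqrt(variance)              # standard deviation, same units as x

    print(mean, median, mode, q1, q3, variance, std_dev)

Note that equation (3.2) divides by n; the sample (unbiased) version divides by the degrees of freedom df = n - 1, which is what np.var(x, ddof=1) returns.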
Interquartile range (IQR) is the difference between the upper and lower quartiles:

    IQR = Q_3 - Q_1    (3.3)

2.1.2.3 measures of shape

Histograms can be classified broadly by their shapes. The following are the most commonly used measures.

Coefficient of skewness is the most commonly used statistic for summarising the symmetry of a distribution. It is defined as:

    \text{coefficient of skewness} = \frac{\frac{1}{n} \sum_{i=1}^{n} (x_i - \mu)^3}{\sigma^3}    (3.4)

The numerator is the average cubed difference between the data values and their mean, and the denominator is the cube of the standard deviation.

Coefficient of variation (CV) is a statistic that is often used as an alternative to skewness to describe the shape of the distribution. It is defined as the ratio of the standard deviation to the mean:

    CV = \frac{\sigma}{\mu}    (3.5)

If estimation is the final aim of a study, the coefficient of variation can provide some warning of upcoming problems. A value greater than one indicates the presence of some erratic high values that may have a significant impact on the final estimates.

These simple statistics are usually enhanced by simple techniques of analysis, to name but a few: box plots, scattergrams, histograms, stem-and-leaf plots and probability plots.

2.2 Bivariate Statistics

Bivariate statistics frequently refers to the analysis of two variables. The most common display of bivariate data is the scatter plot, an x-y graph of the data in which the x coordinate corresponds to the value of one variable and the y coordinate to the value of the other. In addition to providing a good qualitative feel for how two variables are related, a scatter plot is also useful for drawing attention to aberrant data (Dowd, 1992).

2.2.1 Covariance and Correlation

If x represents one variable and y represents another, the general formula for the covariance is:

    S_{xy} = \frac{1}{n} \sum_{i=1}^{n} (x_i - \bar{x})(y_i - \bar{y})    (3.6)

where \bar{x} and \bar{y} are the mean values of the x and y variables respectively.

The covariance is a measure of the degree of linear association between the two variables. Covariances are positive for a positive or direct association, negative for a negative or inverse association, and zero for no association. As the degree of association between two variables increases, the magnitude of the covariance increases.

The covariance does not take account of the different amounts of variability in the individual variables and makes no allowance for variables measured in different units. To provide a valid comparison, the covariance must be scaled so that it gives the same numerical value for a given amount of association between two variables, regardless of the magnitudes of the values of the individual variables and independent of the units of measurement. The most common way of doing this is to divide the covariance by the product of the standard deviations of the individual variables. The value so obtained is called the correlation coefficient and is defined as:

    r = \frac{S_{xy}}{S_x S_y}    (3.7)

where

    S_x^2 = \frac{1}{n} \sum_{i=1}^{n} (x_i - \bar{x})^2   and   S_y^2 = \frac{1}{n} \sum_{i=1}^{n} (y_i - \bar{y})^2

The correlation coefficient measures the strength of the linear relationship between two variables and takes values from -1.0 (perfect negative or inverse correlation) to +1.0 (perfect positive or direct correlation). A value of r = 0.0 indicates no linear correlation.
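A minimal NumPy sketch of equations (3.6) and (3.7), using two short made-up variables; the n-divisor form follows the definitions above, and the result is checked against NumPy's own correlation routine:

    import numpy as np

    x = np.array([2.0, 4.0, 6.0, 8.0, 10.0])      # hypothetical variable x
    y = np.array([1.5, 3.1, 7.2, 7.9, 11.0])      # hypothetical variable y

    n = len(x)
    xbar, ybar = x.mean(), y.mean()

    s_xy = ((x - xbar) * (y - ybar)).sum() / n    # covariance, equation (3.6)
    s_x = np.sqrt(((x - xbar) ** 2).sum() / n)    # standard deviation of x
    s_y = np.sqrt(((y - ybar) ** 2).sum() / n)    # standard deviation of y

    r = s_xy / (s_x * s_y)                        # correlation coefficient, equation (3.7)
    print(r, np.corrcoef(x, y)[0, 1])             # the two values agree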
2.2.2 Regression

If the correlation coefficient indicates a strong linear relationship, it may be of interest to describe this relationship in terms of an equation:

    y = a + bx    (3.8)

where b is the slope of the line and a is the intercept on the y axis. This equation is called a regression line or, more specifically, the regression line of y on x. The regression line can also be used to predict values of y corresponding to given values of x.

The method of calculating the values of a and b is called the method of least squares:

    a = \frac{\sum_{i=1}^{n} x_i^2 \sum_{i=1}^{n} y_i - \sum_{i=1}^{n} x_i \sum_{i=1}^{n} x_i y_i}{n \sum_{i=1}^{n} x_i^2 - \left( \sum_{i=1}^{n} x_i \right)^2}

    b = \frac{n \sum_{i=1}^{n} x_i y_i - \sum_{i=1}^{n} x_i \sum_{i=1}^{n} y_i}{n \sum_{i=1}^{n} x_i^2 - \left( \sum_{i=1}^{n} x_i \right)^2}

The correlation coefficient is related to the slope of the regression line by:

    r = b \frac{S_x}{S_y}

from which it can be seen that a zero correlation coefficient implies a zero slope for the regression line.

2.3 Multivariate Analysis

Multivariate analysis can be defined as the set of general methods applicable to any number of variables analysed simultaneously, usually applied to more (often many more) than three variables. If there are m variables, the data may be imagined as points in m-dimensional space. The prime objective is to reduce the dimensionality so that the shape of the data scatter can be viewed. Relationships between variables can also be investigated (Swan et al., 1995).

3.5.1 Basic topics in multivariate statistics

In the univariate case, it is often necessary to summarise a data set by calculating its mean and variance. To summarise multivariate data sets, one needs the mean and variance of each of the p variables, together with a measure of the way each pair of variables is related. For the latter, the covariance or correlation of each pair of variables is used. These quantities are defined below (Everitt et al., 1991).

Mean. The mean vector \mu = [\mu_1, ..., \mu_p] is such that

    \mu_i = E(X_i) = \int x f_i(x) \, dx    (3.9)

is the mean of the ith component of X. This definition is given for the case where X_i is continuous. If X_i is discrete, then E(X_i) is given by \sum x P_i(x), where P_i(x) is the (marginal) probability distribution of X_i.

Variance. The variance of the ith component of X is given by:

    Var(X_i) = E[(X_i - \mu_i)^2] = E(X_i^2) - \mu_i^2    (3.10)

This is usually denoted by \sigma_i^2 in the univariate case but, in order to tie in with the covariance notation, it is usually denoted by \sigma_{ii} in the multivariate case.

Covariance. The covariance of two variables X_i and X_j is defined by:

    Cov(X_i, X_j) = E[(X_i - \mu_i)(X_j - \mu_j)]    (3.11)

Thus, it is the product moment of the two variables about their respective means. The covariance of X_i and X_j is usually denoted by \sigma_{ij}; if i = j, the variance of X_i is denoted by \sigma_{ii}, as noted above. Equation (3.11) is often written in the equivalent alternative form:

    \sigma_{ij} = E[X_i X_j] - \mu_i \mu_j    (3.12)

The covariance matrix. With p variables, there are p variances and \frac{1}{2} p(p-1) covariances, and these quantities are all second moments. It is often useful to present them in a (p x p) matrix, denoted by \Sigma, whose (i, j)th element is \sigma_{ij}. Thus,

    \Sigma = \begin{bmatrix} \sigma_{11} & \sigma_{12} & \cdots & \sigma_{1p} \\ \sigma_{21} & \sigma_{22} & \cdots & \sigma_{2p} \\ \vdots & \vdots & & \vdots \\ \sigma_{p1} & \sigma_{p2} & \cdots & \sigma_{pp} \end{bmatrix}

This matrix is called the dispersion matrix, the variance-covariance matrix, or simply the covariance matrix. The diagonal terms of \Sigma are the variances, while the off-diagonal terms, the covariances, are such that \sigma_{ij} = \sigma_{ji}; thus the matrix is symmetric. Using equations (3.11) and (3.12), one can express \Sigma in two alternative useful forms, namely

    \Sigma = E[(X - \mu)(X - \mu)^T] = E[X X^T] - \mu \mu^T    (3.13)
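The covariance matrix defined above can be estimated from a small data matrix. The sketch below (made-up data, three variables, NumPy used for illustration) shows the sample analogue of equation (3.13), with the variances on the diagonal and a symmetric result:

    import numpy as np

    rng = np.random.default_rng(0)
    X = rng.normal(size=(50, 3))            # hypothetical data: 50 observations on p = 3 variables

    Xc = X - X.mean(axis=0)                 # centre each variable (subtract its mean)
    Sigma = Xc.T @ Xc / len(X)              # sample analogue of equation (3.13), n-divisor form

    print(np.allclose(Sigma, Sigma.T))                   # True: the covariance matrix is symmetric
    print(np.allclose(np.diag(Sigma), Xc.var(axis=0)))   # True: the diagonal terms are the variances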
Correlation. If two variables are related in a linear way, then the covariance will be positive or negative depending on whether the relationship has a positive or negative slope. But the size of the covariance is difficult to interpret because it depends on the units in which the two variables are measured. Thus, the covariance is often standardised by dividing by the product of the standard deviations of the two variables to give a quantity called the correlation coefficient. The correlation between variables X_i and X_j will be denoted by \rho_{ij} and is given by:

    \rho_{ij} = \frac{\sigma_{ij}}{\sigma_i \sigma_j}    (3.14)

where \sigma_i denotes the standard deviation of X_i.

The correlation coefficient provides a measure of the linear association between two variables. It is positive if the relationship between the two variables has a positive slope, so that 'high' values of one variable tend to go with 'high' values of the other variable. Conversely, the coefficient is negative if the relationship has a negative slope.

The correlation matrix. With p variables, there are p variances and p(p-1)/2 distinct correlations. It is often useful to present them in a (p x p) matrix whose (i, j)th element is \rho_{ij}. This matrix, called the correlation matrix, will be denoted by P. The diagonal terms of P are unity, and the off-diagonal terms are such that \rho_{ij} = \rho_{ji}, so P is symmetric.

In order to relate the covariance and correlation matrices, define a (p x p) diagonal matrix D whose diagonal terms are the standard deviations of the components of X, so that

    D = \begin{bmatrix} \sigma_1 & 0 & \cdots & 0 \\ 0 & \sigma_2 & \cdots & 0 \\ \vdots & & & \vdots \\ 0 & 0 & \cdots & \sigma_p \end{bmatrix}

Then the covariance and correlation matrices are related by:

    \Sigma = D P D   and   P = D^{-1} \Sigma D^{-1}    (3.15)

where the diagonal terms of the matrix D^{-1} are the reciprocals of the respective standard deviations.

Multiple regression. Any observed variable could be considered a function of any other variable measured on the same sample (Davis, 1986). In fact, one could have measured several variables in the field, such as depth, permeability, tip resistance, pore pressure, conductivity, temperature, etc., and could have examined differences in water content associated with changes in each or all of these variables together with the set of laboratory data. In a sense, variables may be considered as spatial coordinates, and one can envision changes occurring 'along' a dimension defined by a variable such as mineral content.

The regression of m independent variables upon a dependent variable can be expressed as:

    Y = \beta_0 + \beta_1 X_1 + \beta_2 X_2 + \cdots + \beta_m X_m

The normal equations that yield a least squares solution can be found by appropriate labelling of the rows and columns of the matrix equations and cross-multiplying to find the entries in the body of the matrix. For three independent variables, with X_0 a dummy variable equal to 1 for every observation, the matrix equation after cross-multiplication is:

    \begin{bmatrix} n & \sum X_1 & \sum X_2 & \sum X_3 \\ \sum X_1 & \sum X_1^2 & \sum X_1 X_2 & \sum X_1 X_3 \\ \sum X_2 & \sum X_2 X_1 & \sum X_2^2 & \sum X_2 X_3 \\ \sum X_3 & \sum X_3 X_1 & \sum X_3 X_2 & \sum X_3^2 \end{bmatrix} \begin{bmatrix} b_0 \\ b_1 \\ b_2 \\ b_3 \end{bmatrix} = \begin{bmatrix} \sum Y \\ \sum X_1 Y \\ \sum X_2 Y \\ \sum X_3 Y \end{bmatrix}

The \beta's in the regression model are estimated by the b's, the sample partial regression coefficients. They are called partial regression coefficients because each gives the rate of change (or slope) in the dependent variable for a unit change in that particular independent variable, provided all other independent variables are held constant.
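The normal equations above are simply X'Xb = X'Y, with a leading column of ones standing in for the dummy variable X_0. A minimal NumPy sketch with made-up data (three independent variables) solves them directly and checks the result against a standard least-squares routine:

    import numpy as np

    rng = np.random.default_rng(1)
    n = 30
    X123 = rng.normal(size=(n, 3))                       # hypothetical X1, X2, X3
    Y = 2.0 + X123 @ np.array([0.5, -1.0, 3.0]) + rng.normal(scale=0.1, size=n)

    X = np.column_stack([np.ones(n), X123])              # X0 = 1 for every observation

    # Normal equations: (X'X) b = X'Y
    b = np.linalg.solve(X.T @ X, X.T @ Y)                # b0, b1, b2, b3: partial regression coefficients

    print(b)
    print(np.allclose(b, np.linalg.lstsq(X, Y, rcond=None)[0]))  # True: same as the least-squares solution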
5. Multivariate Statistical Methods

The best-known multivariate analysis techniques are principal components analysis (PCA), factor analysis (FA), cluster analysis and canonical analysis.

The first two methods, PCA and FA, are statistical techniques applied to a single set of variables when the analyst is interested in discovering which variables in the set form coherent subsets that are relatively independent of one another. Variables are combined into factors. Factors are thought to reflect underlying processes that have created the correlations among the variables. The specific goals of PCA or FA are to summarise patterns of correlation among observed variables, to reduce a large number of observed variables to a smaller number of factors, to provide an operational definition (a regression equation) for an underlying process by using observed variables, and/or to test a theory about the nature of the underlying processes. Interpreting the results obtained from these methods requires a good understanding of the physical meaning of the problem.

Steps in PCA or FA include selecting and measuring a set of variables, preparing the correlation matrix (to perform either PCA or FA), extracting a set of factors from the correlation matrix, determining the number of factors, (possibly) rotating the factors to increase interpretability and, finally, interpreting the results. A good PCA or FA "makes sense"; a bad one does not.

The third method, cluster analysis, is designed to solve the following problem: given a sample of n objects, each of which has a score on p variables, devise a scheme for grouping the objects into classes so that similar ones are in the same class (Tabachnick et al., 2000).

The following subsections explain the three multivariate statistical methods that would be used in this research.

3.6.1 Principal components analysis

Principal component analysis (PCA) is a multivariate technique for examining relationships among several quantitative variables by forming new variables which are linear composites of the original variables. The maximum number of new variables that can be formed is equal to the number of original variables, and the new variables are uncorrelated among themselves. The procedure is therefore used when one is interested in summarising data and detecting linear relationships. In other words, through PCA one seeks to determine the minimum number of variables that contain the maximum amount of information and to determine which variables are strongly interrelated.

Principal component analysis was originated by Pearson (1901) and later developed by Hotelling (1933). Many authors, including Rao (1964), Cooley and Lohnes (1971), Gnanadesikan (1977) and Tabachnick et al. (2000), discuss the application of PCA.

3.6.1.1 description of principal component analysis method

Given a data set with p numeric variables, p principal components can be computed. Each principal component is a linear combination of the original variables, with coefficients equal to the eigenvectors of the correlation or covariance matrix. The eigenvectors are customarily taken with unit length. The principal components are sorted in descending order of the eigenvalues, which are equal to the variances of the components.

For any principal component analysis study, the data set consists of n observations on p variables. From this n x p matrix, one can calculate a p x p matrix of correlations. In essence, principal component analysis extracts p roots, or eigenvalues, and p eigenvectors from the correlation matrix. The number of roots corresponds to the rank of the matrix, which equals the number of linearly independent vectors.
The eigenvalues are numerically equal to the sums of the squared factor loadings and represent the relative proportion of the total variance accounted for by each component (Davis, 1973; Brown, 1998).

3.6.1.2 objectives of principal component analysis

The objective of principal component analysis is to determine the relations existing between measured properties that were originally considered to be independent sources of information. Geometrically, the objective of principal component analysis is to identify a new set of orthogonal axes such that (Sharma, 1996):
- The coordinates of the observations with respect to each of the axes give the values of the new variables. The new axes are called principal components, and the values of the new variables are called principal component scores.
- Each new variable is a linear combination of the original variables.
- The first new variable accounts for the maximum variance in the data; the second new variable accounts for the maximum variance not accounted for by the first; the third accounts for the maximum variance not accounted for by the first two; and the pth new variable accounts for the maximum variance not accounted for by the first p-1 variables.
- The p new variables are uncorrelated.

The key parameters to be examined in PCA are:
- the sum of variation explained,
- the eigenvalues greater than one, and
- the cumulative variation explained.

These parameters indicate when factors or components are non-significant in the statistical sense and should be discarded. The cumulative variation explained is expected to be in the 95% range, and the eigenvalues should be greater than one; otherwise, the factors are probably the result of statistical noise and more data are needed.
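In the course this screening would normally be done in SAS (e.g. PROC PRINCOMP) or Excel, but the mechanics can be sketched directly in NumPy: the components are the eigenvectors of the correlation matrix, the eigenvalues are the component variances, and the eigenvalue-greater-than-one and cumulative-variance criteria described above follow immediately. The data below are made up for illustration.

    import numpy as np

    rng = np.random.default_rng(2)
    X = rng.normal(size=(100, 5))                 # hypothetical data: n = 100 observations, p = 5 variables
    X[:, 1] = 0.8 * X[:, 0] + 0.2 * X[:, 1]       # make two variables correlated so one component dominates

    R = np.corrcoef(X, rowvar=False)              # p x p correlation matrix
    eigvals, eigvecs = np.linalg.eigh(R)          # eigenvalues and eigenvectors of the symmetric matrix R
    order = np.argsort(eigvals)[::-1]             # sort components by descending eigenvalue
    eigvals, eigvecs = eigvals[order], eigvecs[:, order]

    explained = eigvals / eigvals.sum()           # proportion of total variance per component
    cumulative = np.cumsum(explained)             # cumulative variation explained
    keep = eigvals > 1.0                          # Kaiser criterion: eigenvalues greater than one

    Z = (X - X.mean(axis=0)) / X.std(axis=0)      # standardised data
    scores = Z @ eigvecs                          # principal component scores

    print(np.round(eigvals, 3), np.round(cumulative, 3), int(keep.sum()))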
3.6.2 Factor analysis

Factor analysis (FA) is a generic name given to a class of multivariate statistical methods whose primary purpose is data reduction and summarisation. Broadly speaking, it addresses the problem of analysing the interrelationships among a large number of variables and then explaining these variables in terms of their common underlying dimensions, or factors (Hair et al., 1987).

The general purpose of factor analytic techniques is to find a way of condensing (summarising) the information contained in a number of original variables into a smaller set of new composite dimensions (factors) with a minimum loss of information; that is, to search for and define the fundamental constructs or dimensions assumed to underlie the original variables.

3.6.2.1 description of the factor analysis method

The starting point in factor analysis, as with other statistical techniques, is the research problem. As mentioned earlier in Chapter 1, one of the research objectives is to reduce and summarise the number of water quality variables measured during the groundwater monitoring programme, and factor analysis is believed to be an appropriate technique to achieve this objective.

Factor analysis normally answers questions such as: which variables should be included, how many variables should be included, how are the variables measured, and is the sample size large enough? Regarding the question of variables, any variables relevant to the research problem can be included, as long as they are appropriately measured.

Regarding the sample size question, the researcher would generally not factor analyse a sample of fewer than 50 observations, and preferably the sample size should be 100 or larger. As a rule, there should be four or five times as many observations as there are variables to be analysed (Hair et al., 1987). However, this ratio is somewhat conservative, and in many instances the researcher is forced to analyse a set of variables when only a 2:1 ratio of observations to variables is available. When dealing with smaller sample sizes and a lower ratio, one should interpret any findings cautiously.

Figure 3.1 shows the general steps followed in any application of factor analysis techniques. One of the first decisions in the application of factor analysis involves the calculation of the correlation matrix. If the objective of the research is to summarise the characteristics, the factor analysis is applied to the correlation matrix of the variables. This is the most common type of factor analysis and is referred to as R factor analysis. Factor analysis may also be applied to a correlation matrix of the individual respondents; this type of analysis is called Q factor analysis.

Numerous variations of the general model are available. The two most frequently employed analytic approaches are principal component analysis and common factor analysis. The component model is used when the objective is to summarise most of the original information (variance) in a minimum number of factors for prediction purposes. In contrast, common factor analysis is used primarily to identify underlying factors or dimensions that are not easily recognised.

Figure 3.1: Factor analysis decision diagram, after Hair et al. (1987). [The diagram runs from the research problem (which variables to include, how many, how they are measured, sample size), through the choice of factor model (component analysis versus common factor analysis), the correlation matrix (R versus Q), the extraction method (orthogonal or oblique), the unrotated factor matrix (number of factors) and the rotated factor matrix (factor interpretation), to the factor scores used in subsequent analyses such as regression, discriminant analysis and correlation.]

3.6.3 PCA versus FA

PCA and FA procedures are similar, except for the preparation of the observed correlation matrix for extraction. The difference lies in the variance that is analysed: in both PCA and FA the variance analysed is the sum of the values in the positive diagonal, but in PCA all the variance in the observed variables is analysed, whereas in FA only the shared variance of the variables is analysed.

Mathematically, the difference between PCA and FA occurs in the contents of the positive diagonal of the correlation matrix (the diagonal that contains the correlation between a variable and itself). In PCA, ones are in the diagonal and there is as much variance to be analysed as there are variables; each variable contributes a unit of variance to the positive diagonal of the correlation matrix. All the variance is distributed to the components, including error and unique variance for each observed variable, so if all components are retained, PCA duplicates exactly the observed correlation matrix and the standard scores of the observed variables. In FA, only the variance that each variable shares with the other observed variables is available for analysis (Brown, 1998).

Concerning the ability of the two techniques to examine the interrelationships among a set of variables, PCA and FA are different and should not be confused.
FA is more concerned with explaining the covariance structure of the variables, whereas PCA is more concerned with explaining the variability in the variables. Both FA and PCA differ from regression analysis in that there is no dependent variable to be explained by a set of independent variables. However, PCA and FA also differ from each other. In PCA, the major aim is to select a number of components that explain as much of the total variance as possible. In FA, on the other hand, the main objective is to explain the interrelationships among the original variables; the major emphasis is placed on obtaining understandable factors that convey the essential information contained in the original variables.

Significance tests for factor analysis

One of the more sophisticated tests of the adequacy of factor analysis is the Kaiser-Meyer-Olkin (KMO) measure of sampling adequacy (Norusis, 1985, cited in Brown, 1998), which is an index for comparing the magnitudes of the partial correlation coefficients. Small values of the KMO measure indicate that a technique such as factor analysis may not be a good idea. Kaiser (1974) indicated that KMO values below 0.5 are not acceptable.

Mathematically, the two methods produce several linear combinations of observed variables, each linear combination being a component or factor. The factors summarise the patterns of the correlations in the observed correlation matrix and can in fact be used to reproduce the observed correlation matrix; nevertheless, the number of factors is usually far fewer than the number of observed variables.

Steps in PCA or FA consist of selecting and measuring a set of variables, preparing the correlation matrix, extracting the set of factors, rotating the factors to increase interpretability and, finally, interpreting the results. There are relevant statistical considerations for most of these steps, but the final task of the analysis is interpretation. In this respect, a factor is more easily interpreted when several observed variables correlate highly with it and those variables do not correlate with other factors (Korre, 1997).

3.6.4 Major types of factor analysis

The following sections describe some of the most commonly used methods for factor extraction and rotation. These methods are described in many references, such as Rummel (1970), Mulaik (1972), Harman (1976), Brown (1998), and Tabachnick et al. (2000).

3.6.4.1 extraction techniques for factors

About seven extraction techniques can be considered the principal factor extraction techniques. These techniques, found in the most popular statistical packages, are summarised in Table 3.1; of these, PCA and principal factors are the most commonly used. All the extraction techniques calculate a set of orthogonal components or factors that, in combination, reproduce R. The principles used to establish the solutions, such as maximising variance or minimising residual correlations, differ from one technique to another, but the differences in the solutions are small for a data set with a large sample, numerous variables and similar communality estimates.

Table 3.1: Summary of extraction procedures (modified after Tabachnick, 2000). Each entry gives the extraction technique, the objective of the analysis and any special features.

Principal components. Objective: maximise the variance extracted by orthogonal components. Special features: mathematically determines an empirical solution with common, unique, and error variance mixed into the components.
Principal factors. Objective: maximise the variance extracted by orthogonal factors. Special features: estimates communalities in an attempt to eliminate unique and error variance from the factors.

Image factoring. Special features: uses the SMCs (squared multiple correlations) between each variable and all the others as communalities, to generate a mathematically determined solution with error variance and unique variance eliminated.

Maximum likelihood factoring. Objective: estimate the factor loadings for the population that maximise the likelihood of sampling the observed correlation matrix. Special features: has significance tests for factors; especially useful for confirmatory factor analysis.

Alpha factoring. Objective: maximise the generalisability of orthogonal factors. Special features: somewhat likely to produce communalities greater than 1.

Unweighted least squares. Objective: minimises squared residual correlations.

Generalised least squares. Objective: weights variables by shared variance before minimising squared residual correlations.

Principal components: the objective of PCA is to extract maximum variance from the data set with each component. The first principal component is the linear combination of observed variables that maximally separates subjects by maximising the variance of their component scores. The second component is formed from the residual correlations; it is the linear combination of observed variables that extracts maximum variability uncorrelated with the first component. Subsequent components also extract maximum variability from residual correlations and are orthogonal to all previously extracted components. The principal components are ordered, with the first component extracting the most variance and the last component the least variance. The solution is mathematically unique and, if all components are retained, exactly reproduces the observed correlation matrix.

Principal factors: the objective, as in principal components extraction, is to extract maximum orthogonal variance from the data set with each subsequent factor. Principal factors extraction differs from PCA in that estimates of communality, instead of ones, are in the positive diagonal of the observed correlation matrix. These estimates are derived through an iterative procedure, with SMCs used as the starting values in the iteration. One advantage of principal factors extraction is that it conforms to the factor analytic model, in which common variance is analysed with unique and error variance removed. However, principal factors extraction is sometimes not as good as other techniques at reproducing the correlation matrix.

Image factor extraction: provides an interesting compromise between PCA and principal factors. Like PCA, image extraction provides a mathematically unique solution because there are fixed values in the positive diagonal of R. Like principal factors, the values in the diagonal are communalities with unique and error variability excluded. The compromise is struck by using the SMC (the R^2 of each variable treated as DV with all the others serving as IVs) as the communality for that variable.

Maximum likelihood factor extraction: estimates population values for the factor loadings by calculating loadings that maximise the probability of sampling the observed correlation matrix from a population. Within the constraints imposed by the correlations among the variables, population estimates for the factor loadings are calculated that have the greatest probability of yielding a sample with the observed correlation matrix.
Alpha factoring: is concerned with the reliability of the common factors rather than with the reliability of group differences. Coefficient alpha is a psychometric measure of the reliability (also called the generalisability) of a score taken in a variety of situations. In this method, communalities are estimated using iterative procedures that maximise coefficient alpha for the factors. The procedure is used where the interest is in discovering which common factors are found consistently when repeated samples of variables are taken from a population of variables.

Unweighted least squares factoring: the goal of unweighted least squares (minimum residual) factor extraction is to minimise the squared differences between the observed and reproduced correlation matrices. Only the off-diagonal differences are considered; communalities are derived from the solution rather than estimated as part of the solution. This procedure gives the same results as principal factors if the communalities are the same.

Generalised least squares factoring: also aims to minimise the off-diagonal squared differences between the observed and reproduced correlation matrices, but in this case weights are applied to the variables. Differences for variables that are not as strongly related to the other variables in the set are not as important to the solution.

3.6.4.2 rotation

Regardless of the extraction method used, none of the extraction techniques routinely provides an interpretable solution without rotation; however, all types of extraction may be rotated by any of the procedures described in this section. Rotation is used to improve the interpretability and scientific utility of the solution, and it is always done if the matrix of factor loadings is not unique or not easily explained. It is important to mention, however, that rotation does not improve the quality of the mathematical fit between the observed and reproduced correlation matrices, because all orthogonally rotated solutions are mathematically equivalent to one another and to the solution before rotation. The different rotation methods also tend to give similar results with a good data set in which the pattern of correlations is fairly clear.

The available methods of rotation are varimax, quartimax, parsimax, equamax, orthomax with a user-specified gamma, promax with a user-specified exponent, Harris-Kaiser case II with a user-specified exponent, and oblique Procrustean rotation with a user-specified target pattern. Table 3.2 summarises the most common rotation techniques. These techniques, which are described in detail in Harman (1976), Mulaik (1972) and Gorsuch (1983), can be classified into two main categories: (a) orthogonal methods and (b) oblique methods.

Table 3.2: Summary of rotational techniques (modified after Korre, 1997). Each entry gives the rotational technique, its type, the purpose of the analysis and its characteristics.

Varimax (orthogonal). Purpose: minimise the complexity of the factors by minimising the variance of the loadings on each factor. Characteristics: the most commonly used method; gamma = 1.

Quartimax (orthogonal). Purpose: minimise the complexity of the variables by minimising the variance of the loadings on each variable. Characteristics: the first factor tends to be general, with the other factors forming sub-clusters of variables; gamma = 0.

Equamax (orthogonal). Purpose: simplify both the variables and the factors. Characteristics: may behave erratically; gamma = 1/2.
Parsimax (orthogonal). Purpose: performs an orthomax rotation with gamma = (nvar(nfact - 1)) / (nvar + nfact - 2), where nvar is the number of variables and nfact the number of factors. Characteristics: gamma is defined by this formula.

Orthoblique (both orthogonal and oblique). Purpose: rescales the factor loadings to yield an orthogonal solution; the non-rescaled loadings may be correlated.

Promax (oblique). Purpose: orthogonal factors are rotated to oblique positions. Characteristics: a fast and inexpensive method.

Procrustes (oblique). Purpose: rotate to a target matrix. Characteristics: useful in confirmatory FA.

Orthogonal rotation techniques include varimax, quartimax and equamax, with varimax being the most commonly used of all the available rotation methods. There are slight differences in the objectives of these methods. The aim of varimax is to maximise the variance of the loadings within factors, across variables: loadings that are high after extraction become higher, and those that are low become even lower. This leads to an easier interpretation of the factors, as the correlation between the factors and the variables becomes obvious. Varimax also tends to reapportion variance among the factors so that they become relatively equal in importance.

Quartimax is more or less similar to varimax; while the latter deals with the factors, the former deals directly with the variables. As most researchers are more interested in simple factors rather than simple variables, the quartimax method is nowhere near as popular as varimax. Equamax is a hybrid between varimax and quartimax that tries simultaneously to simplify the factors and the variables. One should be careful in using this method, as it tends to behave erratically unless the researcher can specify the number of factors with confidence (Mulaik, 1972).

Oblique rotation is used if the researcher believes that the processes represented by the factors are correlated. The most common oblique rotation methods are the promax and Procrustes methods. Oblique rotation offers a continuous range of correlations between factors and often produces more useful patterns than orthogonal rotation. In promax rotation, an orthogonally rotated solution (usually varimax) is rotated again to allow correlations among the factors. The orthogonal loadings are raised to powers (usually 2, 4, or 6) to drive small and moderate loadings to zero while larger loadings are reduced, but not to zero. Even though the factors correlate, simple structure is maximised by clarifying which variables do and do not correlate with each factor. This method has the further advantage of being fast.

In Procrustes rotation, the researcher specifies a target matrix of loadings (usually 0's and 1's), and a transformation matrix is sought that rotates the extracted factors to the target, if possible. If the solution can be rotated to the target, then the hypothesised factor structure is said to be confirmed. Unfortunately, as Gorsuch (1974) reports, with Procrustean rotation the factors are often extremely highly correlated, and sometimes a correlation matrix generated by random processes can be rotated to the target with apparent ease.

Orthoblique rotation uses the quartimax algorithm to produce an orthogonal solution on rescaled factor loadings; therefore, the solution may be oblique with respect to the original factor loadings.

3.6.4.3 geometric interpretation

Factor extraction yields a solution in which the observed variables are vectors that terminate at the points indicated by the coordinate system, and the factors serve as axes for the system.
The coordinates of each point are the entries from the loadings matrix for the variable, and the length of the vector represents the communality of the variable. If the factors are orthogonal, the factor axes are all at right angles to one another, and the coordinates of the variable points are correlations between the common factors and the observed variables. As mentioned before, one of the essential goals of PCA and FA is to discover the minimum number of factor axes needed to reliably position of variables. A second major intention, and the motivation behind rotation, is to realise the meaning of the factors that underlie responses to observed variables. This goal is achieved by interpreting the factor axes that are used to define the space. Factor rotation repositions factor axes to make them interpretable. It should be noted that repositioning the axes changes the coordinates of the variable points but not the positions of the points with respect to each other. Factors are usually interpretable when some observed variables load highly on them and the rest do not. Ideally, each variable loads on one, and only one, factor. Graphically, this means that the point representing each variable lies far out along the axis but near the origin on the other axes, i.e., the coordinates of the point are large for one axis and near zero for the other axis. 3.6.5 Canonical analyses Variables are often found to belong to different groups that are generalised to relate to different processes or factors. Canonical correlation analyses are used to identify and quantify the associations between two sets of variables is a data set. Its main objectives is to determine the correlation between a linear combination of the variables in one set and a linear combination of the variables in another set. The first pair of linear combinations has the largest correlation. The second pair of linear combinations is determined and has the second largest correlation of the remaining variable sets. This process continues until all pairs of remaining variables are analysed. The pairs of linear combinations are called canonical variables, and their correlations are called canonical correlations. Dr SaMeH 28 If one is interested in the correlation between the two sets of water quality parameters, Field and Laboratory data or Anions and Cations Anions, then canonical correlation would be the solution to define such correlation. A quick scanning into the nature of the data associated with groundwater indicators showed that the data collected around mines or landfill site involves too many parameters. In order to summarise these data and detecting its linear relationships, one can use the principal component analyses (PCA) or (FA) method to reduce the number of variables in regression, and hence to define the quality indicators. The aim is to predict the parameters that are usually determined in the laboratory from pumped samples from a minimum (optimised) set of parameters measured in the field. In other words, multivariate statistics methods will provide a tool to detect the inter-correlation structure of the variables of the raw data. Presently, these are many software packages in the market, where different multivariate methods have treated and can be used quite friendly such as EXCEL, MINITAB, STATAGRAPH, SYSTAT, STATA, SPSS, and SAS. The latter was used throughout this research as it is considered as one of the most powerful packages of such application. 
2.4 Statistical Tests

3.7.1 Data screening and processing

The collected water quality data should be examined carefully before any advanced statistics are conducted, in order to understand the nature of the variables. Data errors tend to follow a definite pattern. The most common errors associated with data preparation are:
- extreme values in the data set;
- inversion or interchange of two numbers;
- repetition of numbers; and
- numbers placed in the wrong column.

Computing the descriptive statistics of the original data, such as the range, median and mean of the variables, can help to identify some of these common errors. The data for multivariate analysis should always be examined and tested for normality, homogeneity of variances, and multicollinearity. The main reasons for these tests are to (1) determine the suitability of the data for analysis, (2) decide whether transformations are necessary, and (3) decide what form of the data should be used (Brown, 1998).

3.7.2 Testing for normality and homogeneity

Normality of variables is assessed by either statistical or graphical methods. Two components of normality are skewness and kurtosis. Skewness has to do with the symmetry of the distribution: a skewed variable is a variable whose mean is not in the centre of the distribution. Kurtosis has to do with the peakedness of a distribution: a distribution is either too peaked (with short, thick tails) or too flat (with long, thin tails). Figure 3.2 shows a normal distribution, distributions with skewness, and distributions with non-normal kurtosis. Tests of normality may be a normal probability plot of the variables, tests of skewness and kurtosis, chi-square goodness-of-fit tests, and/or histograms.

3.7.2.1 coefficient of skewness

The coefficient of skewness is often calculated to determine whether the distribution is symmetrical or whether it tails to the left (negative) or right (positive). Generally, one can look at departures from symmetry of a distribution using the skewness as a measure of normality.
3.7.3 Data transformation Transformations are used to make data more linear, more symmetric, or to achieve constant variance. The most common transformation that is used is logarithm. If the data is chosen for transformation, it is important to check that the variable is normal or nearly normal after transformation. This involves finding the transformation that produces skewness and kurtosis values nearest zero, or the transformation showing the fewest outliers. Several methods are available for data transformations. The selection of the most appropriate method depends on the nature of the data. Commonly, data transformation divided into two types of transformation families: nonoptimal transformation and optimal transformation. 3.7.3.1 nonoptimal transformation Nonoptimal transformations are computed before the algorithm begins. Nonoptimal transformations create a single new transformed variable that replaces the original variables. The subsequent iterative algorithms (except for the possible linear transformation and missing value estimation) do not transform the new variable [Ref. SAS/STAT User’s Guide, Volume 2, 1994, page 1280]. Methods Included involves inverse trigonometric sine, exponential variables, logarithm, logit, raise variables to specified power, and transform to ranks methods. However, the most applicable method is the logarithm method. Dr SaMeH 3 Logarithmic transformation: If the data values have high range then one thinks about using logarithmic transformation. However not all the data can be transformed using this method, for example, in this particular case, the following variables are not suitable for logarithmic transformation; Li, NH4, NO2 and S. Results of this transformation can be seen in Table A5.8-1. 3.7.3.2 optimal transformations Replace the specified variables with new, iteratively derived optimal transformation variables that fit the specified model better than the original variable). Transformation family of this type includes linear, monotonic and optimal scoring transformation methods: Linear transformation finds an optimal transformation of each variable. For variables with no missing values, the transformed variable is the same as the original variable. Monotonic transformation finds a monotonic transformation of each variable using least square monotonic transformation, with the restriction that ties are preserved Optimal scoring transformation finds an optimal scoring of each variable by assigning scores to each class (level) of the variable. Other optimal transformations include B-spline and monotonic, ties not preserved methods. 3.7.4 Standardisation Standardisation is a transformation of a collection of data to standardise, or unit-less form, by subtracting from each observation the mean of the data set and dividing by the respective standard deviation. The new variable will then have a mean of zero and a variance of one. Results of data standardisation are given in Table A5.9. Dr SaMeH 4