Statistical Structures in Data
Assignment 2
Multivariate Analysis with R

submitted by Mayank Kale (22BM6JP26)
Guide: Prof. Amita Pal
Indian Statistical Institute, Kolkata
PGDBA 2022-2024 (22 Jan’23)

Table of Contents
1 Principal Component Analysis
   Correlation Matrix
   Dispersion Matrix
2 Correspondence Analysis
3 Factor Analysis
   a. Best fit orthogonal Factor model
   b. Interpretation of factors
4 Multiple Correspondence Analysis
5 Metric MDS
6 Non-metric MDS
7 Multiple Linear Regression
   7.1 Without removing influential observations
   7.2 After removing influential observations

1 Principal Component Analysis

From the Concrete Compressive Strength Data Set in the UCI Machine Learning Repository (https://archive.ics.uci.edu/ml/datasets/Concrete+Compressive+Strength), use the observations on the 9 variables (Cement, Blast Furnace Slag, Fly Ash, Water, Superplasticizer, Coarse Aggregate, Fine Aggregate, Age, Concrete compressive strength) to compute the dispersion matrix S and the correlation matrix R. Perform Principal Component Analysis (PCA) with S and R separately and provide the following in each case:
i. The loadings of the variables on the PCs
ii. The variances of the PCs
iii. The scree plot
iv. The number of PCs which explain at least 90% of the variation.

Correlation Matrix

                       Cement     Slag  Fly.Ash    Water Superpl.   Coarse     Fine      Age Strength
Cement                   1.00    -0.28    -0.40    -0.08     0.09    -0.11    -0.22     0.08     0.50
Blast furnace slag      -0.28     1.00    -0.32     0.11     0.04    -0.28    -0.28    -0.04     0.13
Fly.Ash.                -0.40    -0.32     1.00    -0.26     0.38    -0.01     0.08    -0.15    -0.11
Water                   -0.08     0.11    -0.26     1.00    -0.66    -0.18    -0.45     0.28    -0.29
Superplasticizer         0.09     0.04     0.38    -0.66     1.00    -0.27     0.22    -0.19     0.37
Coarse.Agg              -0.11    -0.28    -0.01    -0.18    -0.27     1.00    -0.18     0.00    -0.16
Fine.Aggregate          -0.22    -0.28     0.08    -0.45     0.22    -0.18     1.00    -0.16    -0.17
Age..day.                0.08    -0.04    -0.15     0.28    -0.19     0.00    -0.16     1.00     0.33
Conc comp strength       0.50     0.13    -0.11    -0.29     0.37    -0.16    -0.17     0.33     1.00

a) Loadings of the PCs
b) Variances of the PCs
The variance of each PC is the square of its standard deviation reported in the PCA summary.
c) Scree Plot
The last 3 PCs form a straight line and there is a bend at the 6th PC. Hence, 6 PCs should be retained.
d) Cumulative Proportion of Variance explained
The first 6 PCs explain at least 90% of the variation.
e) Plots

Dispersion Matrix

                       Cement     Slag  Fly.Ash    Water Superpl.   Coarse     Fine      Age Strength
Cement                  10922    -2482    -2658     -182       58     -889    -1866      541      869
Blast furnace slag      -2482     7444    -1787      198       22    -1905    -1948     -241      194
Fly.Ash.                -2658    -1787     4096     -351      144      -50      406     -624     -113
Water                    -182      198     -351      456      -84     -303     -772      374     -103
Superplasticizer           58       22      144      -84       36     -124      107      -73       37
Coarse.Agg               -889    -1905      -50     -303     -124     6046    -1113      -15     -214
Fine.Aggregate          -1866    -1948      406     -772      107    -1113     6428     -791     -224
Age..day.                 541     -241     -624      374      -73      -15     -791     3990      347
Conc comp strength        869      194     -113     -103       37     -214     -224      347      279

a) Loadings of PCs:
b) Variance of the PCs: the eigenvalues of the covariance matrix, i.e. the squares of the standard deviations reported in the summary.
c) Scree Plot
There is a bend at the 5th PC, so the first 5 PCs are retained.
d) Cumulative Proportion of Variance:
The first 5 PCs explain at least 90% of the variance.
e) PCA plot

2 Correspondence Analysis

The dataset author, provided in the first sheet of the attached MS-Excel file Assignment_2_data.xlsx, contains the counts of the 26 letters of the alphabet (columns of the matrix) for 12 different novels (rows of the matrix). Each row contains letter counts in a sample of text from each work, excluding proper nouns.
i. Use any appropriate function from any R package to perform correspondence analysis on the data.
ii. Visualize the data in a two-dimensional space using the first two extracted coordinates from both rows and columns.
iii. Comment, with justification, on how reliable this plot is in respect of portraying associations among row and column categories.
iv. Comment on the information provided by the 2-D CA plot regarding the association between them.

• Function for CA
The ca package was used for correspondence analysis.
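As a rough illustration of what a correspondence analysis computes, the following base-R sketch derives the principal inertias and the first two principal coordinates from a singular value decomposition of the standardized residuals. The table `tab` is a small hypothetical example, not the author data, and all object names are assumptions introduced for this sketch.

```r
# Toy contingency table (hypothetical counts, NOT the assignment data)
tab <- matrix(c(30, 12,  8,  5,
                14, 25, 10,  6,
                 9, 11, 28, 12,
                 4,  7, 13, 26),
              nrow = 4, byrow = TRUE,
              dimnames = list(paste0("text", 1:4), c("a", "e", "t", "z")))

P      <- tab / sum(tab)        # correspondence matrix
r_mass <- rowSums(P)            # row masses
c_mass <- colSums(P)            # column masses
E      <- outer(r_mass, c_mass) # expected proportions under independence

# Standardized residuals and their singular value decomposition
S   <- (P - E) / sqrt(E)
dec <- svd(S)

inertia <- dec$d^2                   # principal inertias (eigenvalues)
share   <- inertia / sum(inertia)    # proportion of inertia per dimension
two_dim_share <- sum(share[1:2])     # quality of a two-dimensional map

# Principal coordinates of rows and columns on the first two axes
row_coord <- (dec$u[, 1:2] / sqrt(r_mass)) %*% diag(dec$d[1:2])
col_coord <- (dec$v[, 1:2] / sqrt(c_mass)) %*% diag(dec$d[1:2])
```

With the ca package, calling `ca()` on the same table and then `summary()` and `plot()` should report the same principal inertias and draw the two-dimensional map; the value of `two_dim_share` is the figure used to judge how reliable that map is.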
The code is attached in the mail.

• Inertia
Inertia measures how much of the variation is accounted for by each dimension. The first 2 dimensions capture 60% of the inertia, so about 40% is not shown in the two-dimensional plot. The plot is therefore only moderately reliable, as some information is missed. The problem is the quantity of the data: the more data, the greater the chance that any good summary will miss out important details.
• Eigenvalues:
• Scree Plot:
• Biplot:
Red points are column points (letters) and blue points are row points (novels).
a. The letters x, w, y, z, q, and k are not used much in the 12 novels.
b. The letter z and sound and fury 7 (Faulkner) are highly discriminating, while a, t, o, and islands are not.
c. The letters q and z are similar to each other. Similarly, x and y. Also, k and w.
d. sound and fury 6 (Faulkner) and Pendorric 3 (Holt) are similar to each other.
e. Profiles of future (Clark) has no association with p, q, z (90-degree angle). Also, farewell to arms (Hemingway) has no association with v, l.
f. sound and fury 6 (Faulkner) has a positive association with v (small angle between them).
g. Similarly, Profiles of future (Clark) has a positive association with d and a negative association with x (almost 180-degree angle).
h. farewell to arms (Hemingway) has a negative association with e and r.
• Row Plots
The rows with the larger values contribute the most to the definition of the dimensions.

3 Factor Analysis

Consider the data related to red wines in the wine quality dataset available in the UCI ML Repository (https://archive.ics.uci.edu/ml/datasets/Wine+Quality).
It has 1599 observations on the following variables for various varieties of red wine:
1 - fixed acidity
2 - volatile acidity
3 - citric acid
4 - residual sugar
5 - chlorides
6 - free sulfur dioxide
7 - total sulfur dioxide
8 - density
9 - pH
10 - sulphates
11 - alcohol
12 - quality (score between 0 and 10)
Treating quality as the dependent variable,
(a) fit the best possible orthogonal factor model to the data, giving appropriate justification regarding the choice of the optimal number of factors
(b) for the best model, give reasonable interpretation to the factors.

a. Best fit orthogonal Factor model
• K=5
The “SS loadings” row is the sum of squared loadings for each factor. It is sometimes used to judge the value of a particular factor: a factor is worth keeping if its SS loading is greater than 1. Here, the 5th factor’s SS loading is less than 1, so the fifth factor is not worth keeping. The non-metric dependent variable (quality) and variables with communality > 1 (density) were not considered for the analysis.
• K=4
Here, the SS loadings of all four factors are greater than 1, and this model has the largest p-value among the models with k = 1, 2, 3, 4. Hence, we retain the k = 4 factor model; its factors capture around 60% of the variance.

b. Interpretation of factors
Factor 1 loads heavily on fixed acidity, citric acid, and pH. Hence, it can be interpreted as an acidity factor.
Factor 2 loads heavily on free and total SO2. It can be interpreted as a sulfur factor.
Factor 3 loads heavily on chlorides. Hence, it can be termed a chlorides factor.
Factor 4 loads heavily on volatile acidity and alcohol. Hence, it can be categorized as an alcohol factor.
This is the FA Biplot.

4 Multiple Correspondence Analysis

Consider the dataset tea, which is provided in the second sheet of the attached MS-Excel file Assignment_2_data.xlsx. It is a data frame (of factors) containing the answers to a questionnaire on tea consumption for 300 individuals.
Although the data contains 36 columns (i.e., variables), consider only the following six columns:
• What kind of tea do you drink (black, green, flavored)
• How do you drink it (alone, w/milk, w/lemon, other)
• What kind of presentation do you buy (tea bags, loose tea, both)
• Do you add sugar (yes, no)
• Where do you buy it (supermarket, shops, both)
• Do you always drink tea (always, not always)
i. Use any appropriate function from any R package to perform multiple correspondence analysis (MCA) on the data.
ii. Visualize the data in a two-dimensional space using the first two extracted coordinates from both rows and columns.
iii. How reliable is this plot in respect of portraying associations among row and column categories? Justify your answer.
iv. Consider the data in the last five columns, which correspond to binary attributes. Treat these observations as ordinal variables by assigning the value 0 to “not-A” and the value 1 to A, A being the attribute corresponding to the respective columns. Compute the tetrachoric correlations for these 5 variables and perform PCA with the tetrachoric correlation matrix. Identify the attributes that explain 90% of the variation.

a. MCA fit
b. Two-dimensional MCA plot
c. The first 2 dimensions retain only 30% of the inertia (variation) contained in the data, so the plot is not very reliable. Not all points are equally well displayed in the two dimensions.
d.
• Tetrachoric Matrix
• PCA on tetrachoric matrix
Relaxing, exciting, and effect on health are the most important variables.

5 Metric MDS

The third sheet of the attached MS-Excel file Assignment_2_data.xlsx, labeled pottery, contains the results of chemical analysis on 45 pots of Romano-British origin, made in five different kilns located in three different regions, in the form of observations on nine different chemical constituents.
i. Compute the distance matrix for the 45 pots.
ii. Perform metric multidimensional scaling to ascertain to what extent the chemical profiles of the pots suggest similarity among them, examining the 2-dimensional MDS plot corresponding to the data.
iii. If you are given additional information that
• the first 21 pots are from kiln no. 1, the next 12 are from kiln no. 2, followed by 2, 5 and 5 pots from kiln nos. 3, 4 and 5 respectively,
• region 1 contains kiln 1, region 2 contains kilns 2 and 3, and region 3 contains kilns 4 and 5,
do your conclusions in (ii) appear to reflect similarity in respect of kiln and/or region? Explain with the help of a modified version of the MDS plot in which pots from different kilns are shown in different colours.

a) Distance matrix for the first 23 observations
b) Metric MDS plot
The MDS plot suggests that there is a demarcation among regions. However, pots 22 and 24 are closer to pots 16 and 13 than to pots 34 and 35, which is visible in the next plot with kilns coloured.
c) Hence, from the above explanation, we can state that the clustering is by kiln, not by region.

6 Non-metric MDS

The fourth sheet of the attached MS-Excel file Assignment_2_data.xlsx, labeled gardenflowers, contains the dissimilarity matrix of 18 species of garden flowers.
i. Use some form of non-metric multidimensional scaling to investigate which species share common properties.
ii. Compute Kruskal’s stress measure for dimensions and generate a scree plot with the values.
iii. According to Kruskal’s guidelines what is the assessment of fit in 2 dimensions?

• Non-metric MDS plot
• Stress Value
It is 18.87%.
• Under Kruskal’s guidelines (stress near 20% is poor, near 10% is fair), this suggests that the fit of the non-metric MDS in 2 dimensions is poor.

7 Multiple Linear Regression

The last sheet in the attached MS-Excel file Assignment_2_data.xlsx, labeled USairpollution, contains observations on seven variables, collected in a study of air pollution in 41 cities in the USA.
The variables are:
i. SO2: SO2 content of air in micrograms per cubic metre;
ii. temp: average annual temperature in degrees Fahrenheit;
iii. manu: number of manufacturing enterprises employing 20 or more workers;
iv. popul: population size (1970 census) in thousands;
v. wind: average annual wind speed in miles per hour;
vi. precip: average annual precipitation in inches;
vii. predays: average number of days with precipitation per year.
(a) Using sulphur dioxide content (SO2) as the response variable and the remaining six variables as explanatory variables, fit a linear regression model by least squares.
(b) Generate the residual plot and comment.
(c) Test whether the regression is significant.
(d) Perform appropriate tests of hypotheses to infer the significance of each explanatory variable in the regression model.
(e) Obtain 95% confidence intervals for the regression coefficients that were found to be significantly different from 0 in part (c).
(f) Obtain the 95% confidence interval for the mean sulphur dioxide content when the vector of observations on the predictors is x0 = (20, 55, 440, 500, 10.0, 11.75, 80)′.
(g) Obtain the 95% prediction interval for the mean sulphur dioxide content when the vector of observations on the predictors is x0 as given in part (f).
(h) Use appropriate regression diagnostic tools to identify influential observations.
(i) Repeat the regression analysis of parts (a)-(d) above after removing whatever cities you think should be regarded as outliers.

7.1 Without removing influential observations

a, c, d) The F-statistic is 11.48 and its p-value is < 0.05, so there is sufficient evidence to conclude that the regression model fits the data better than the model with no predictor variables. This finding is good because it means that the predictor variables in the model actually improve the fit of the model.
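The overall F-test, the coefficient t-tests and the interval estimates asked for in this section can be sketched in base R as follows. The data are simulated stand-ins, not the USairpollution sheet, and the reduced set of predictors, the coefficients used to simulate the response, and the values in `x0` are all assumptions made for illustration only.

```r
set.seed(1)

# Simulated stand-in data (hypothetical; NOT the USairpollution sheet)
n   <- 41
dat <- data.frame(temp = rnorm(n, 55, 7),
                  manu = rexp(n, 1 / 450),
                  wind = rnorm(n, 9.5, 1.5))
dat$SO2 <- 100 - 1.2 * dat$temp + 0.03 * dat$manu + rnorm(n, sd = 12)

# Least-squares fit
fit <- lm(SO2 ~ temp + manu + wind, data = dat)
sm  <- summary(fit)

# Overall F-test: H0 says all slope coefficients are zero
f   <- sm$fstatistic                        # value, numdf, dendf
p_f <- pf(f["value"], f["numdf"], f["dendf"], lower.tail = FALSE)

# t-tests for individual coefficients and their 95% confidence intervals
coef_table <- coef(sm)                      # estimate, SE, t value, Pr(>|t|)
ci         <- confint(fit, level = 0.95)

# Interval estimates at a new predictor vector (hypothetical values)
x0      <- data.frame(temp = 55, manu = 440, wind = 10)
ci_mean <- predict(fit, x0, interval = "confidence", level = 0.95)
pi_new  <- predict(fit, x0, interval = "prediction", level = 0.95)
```

The prediction interval `pi_new` is always wider than the confidence interval for the mean `ci_mean`, because it also accounts for the error variance of a single new observation.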
Hence, the regression is significant, and the regression equation helps us to understand the relationship between the Xi’s and Y. In general, if none of the predictor variables in the model are statistically significant, the overall F statistic is also not statistically significant.
Here, temp, manu, and popul are significant variables according to the t-test: their respective p-values are less than 0.05. In contrast, wind, precip, and predays are not significant; they have little impact on the regression model compared to the other features.

b) Residual Plots
Scale-location plot: the red line, which tracks the typical size of the standardized residuals across the fitted values, should be approximately horizontal. Here it is not, suggesting that there is some heteroscedasticity in the data.
Residual plot: a strong pattern among the residuals indicates non-linearity in the data.
e) 95% CI for significant variables (temp, manu, and popul)
f) CI for mean SO2 content
g) PI for SO2 content

h) Influential Observations
• Outliers
Outliers are observations that are not predicted well by the regression model. They have extremely large positive or negative residuals. If the model underestimates the response value, the residual is positive; if the model overestimates the response value, the residual is negative.
For our regression model, we can start investigating outlying observations with a Q-Q plot. Pittsburgh (32nd observation) and Providence (33rd observation) are the cities detected as potential outliers. The outlierTest function helps to confirm whether potential outliers are indeed outliers. The statistical test shows that Providence is detected as an outlier.
• High Leverage Points
Observations are considered high-leverage points if they are outlying with respect to the other observations in terms of the predictors.
Strictly speaking, they have an uncommon combination of predictor values; the response value plays no role in determining leverage. High-leverage observations are identified from the hat values, whose average is the ratio of the number of parameters estimated in the model to the sample size. If an observation has a hat value greater than 2 to 3 times this average, it is considered a high-leverage point.
• DFFITS, Cook’s Distance
The DFFITS statistic is a measure of how the predicted value at the ith observation changes when the ith observation is deleted. Providence (33) has the maximum DFFITS score and the maximum Cook’s distance value.
• Code was used to create thresholds for identification.
So, Buffalo (5th observation), Phoenix, and Providence were identified as influential observations and were removed.

7.2 After removing influential observations

a, c, d) The F-statistic is 17.48 and its p-value is < 0.05, so there is sufficient evidence to conclude that the regression model fits the data better than the model with no predictor variables. Hence, the regression is significant.
Here, manu and popul are the only significant variables according to the t-test: their respective p-values are less than 0.05. In contrast, temp, wind, precip, and predays are not significant; they have little impact on the regression model compared to the other features.
We can find the 95% CI for manu and popul.
b) The residual plots look okay.
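
The influence diagnostics of section 7 (hat values, DFFITS, Cook's distance, and the refit after removal) can be sketched in base R as below. The data are simulated with one planted influential point, and the cut-offs 2p/n, 2*sqrt(p/n) and 4/n are common rules of thumb assumed here; they are not necessarily the thresholds used in the report, whose code is not reproduced in the text.

```r
set.seed(2)

# Simulated data with one planted influential point (hypothetical)
n   <- 41
dat <- data.frame(x1 = rnorm(n), x2 = rnorm(n))
dat$y <- 1 + 2 * dat$x1 - dat$x2 + rnorm(n)
dat$x1[5] <- 6      # unusual predictor value: high leverage
dat$y[5]  <- -10    # response far from the trend: large residual

fit <- lm(y ~ x1 + x2, data = dat)
p   <- length(coef(fit))            # parameters estimated, incl. intercept

h    <- hatvalues(fit)              # leverage; average value is p / n
dff  <- dffits(fit)                 # change in fitted value on deletion
cook <- cooks.distance(fit)         # overall influence on the fit

# Rule-of-thumb thresholds (assumed, not taken from the report)
high_leverage <- which(h > 2 * p / n)
big_dffits    <- which(abs(dff) > 2 * sqrt(p / n))
big_cook      <- which(cook > 4 / n)

# Observations flagged as influential by DFFITS or Cook's distance
flagged <- sort(unique(c(unname(big_dffits), unname(big_cook))))

# Refit without the flagged observations
keep <- !(seq_len(n) %in% flagged)
fit2 <- lm(y ~ x1 + x2, data = dat[keep, ])
```

Comparing `summary(fit)` with `summary(fit2)` then shows how much the coefficient estimates and their significance depend on the flagged observations, which is the comparison made between sections 7.1 and 7.2.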