Statistical Structures in Data Assignment 2
Multivariate Analysis with R
submitted by
Mayank Kale
22BM6JP26
Guide: Prof. Amita Pal
Indian Statistical Institute, Kolkata
PGDBA 2022-2024
(22 Jan’23)
Table of Contents
1 Principal Component Analysis
   Correlation Matrix
   Dispersion Matrix
2 Correspondence Analysis
3 Factor Analysis
   a. Best fit orthogonal Factor model
   b. Interpretation of factors
4 Multiple Correspondence Analysis
5 Metric MDS
6 Non-metric MDS
7 Multiple Linear Regression
   7.1 Without removing influential observations
   7.2 After removing influential observations
1 Principal Component Analysis
From the Concrete Compressive Strength Data Set in the UCI Machine Learning Repository,
(https://archive.ics.uci.edu/ml/datasets/Concrete+Compressive+Strength), use the observations on
the 9 variables (Cement, Blast Furnace Slag, Fly Ash, Water, Superplasticizer, Coarse Aggregate,
Fine Aggregate, Age, Concrete compressive strength) to compute the dispersion matrix S and the
correlation matrix R. Perform Principal Component Analysis (PCA) with S and R separately and
provide the following in each case:
i. The loadings of the variables on the PCs
ii. The variances of the PCs
iii. The scree plot
iv. The number of PCs which explain at least 90% of the variation.
Correlation Matrix
Correlation Matrix     Cement     Slag   FlyAsh    Water   SuperP   Coarse     Fine      Age Strength
Cement                      1    -0.28     -0.4    -0.08     0.09    -0.11    -0.22     0.08      0.5
Blast furnace slag      -0.28        1    -0.32     0.11     0.04    -0.28    -0.28    -0.04     0.13
Fly.Ash.                 -0.4    -0.32        1    -0.26     0.38    -0.01     0.08    -0.15    -0.11
Water                   -0.08     0.11    -0.26        1    -0.66    -0.18    -0.45     0.28    -0.29
Superplasticizer         0.09     0.04     0.38    -0.66        1    -0.27     0.22    -0.19     0.37
Coarse.Agg              -0.11    -0.28    -0.01    -0.18    -0.27        1    -0.18        0    -0.16
Fine.Aggregate          -0.22    -0.28     0.08    -0.45     0.22    -0.18        1    -0.16    -0.17
Age..day.                0.08    -0.04    -0.15     0.28    -0.19        0    -0.16        1     0.33
Conc comp strength        0.5     0.13    -0.11    -0.29     0.37    -0.16    -0.17     0.33        1
a) Loadings of the PCs
b) Variances of the PCs
These are the squares of the standard deviations (sd) reported below.
c) Scree Plot
As the scree plot shows, the last 3 PCs lie on a roughly straight line and there is a bend at the
6th PC. Hence, 6 PCs should be retained.
d) Cumulative Proportion of Variance explained
So, the first 6 PCs explain at least 90% of the variation.
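The steps for the correlation-matrix case can be sketched in base R as follows. This is only a minimal sketch on simulated data: the matrix X and its column names are hypothetical stand-ins for the nine concrete variables, and the code actually attached to the assignment may differ.

```r
# Minimal sketch on simulated data; X is a hypothetical stand-in for the nine concrete variables.
set.seed(1)
n <- 200
mix <- matrix(runif(81, -1, 1), 9, 9)               # arbitrary mixing to induce correlation
X <- matrix(rnorm(n * 9), n, 9) %*% mix
colnames(X) <- paste0("V", 1:9)

R <- cor(X)                                          # correlation matrix
pca_R <- prcomp(X, center = TRUE, scale. = TRUE)     # PCA based on R

loadings_R  <- pca_R$rotation                        # i.  loadings (one column per PC)
variances_R <- pca_R$sdev^2                          # ii. PC variances = eigenvalues of R
cum_prop    <- cumsum(variances_R) / sum(variances_R)
k90         <- which(cum_prop >= 0.90)[1]            # iv. fewest PCs reaching 90%

print(round(variances_R, 3))
print(round(cum_prop, 3))
cat("PCs needed for at least 90% of the variation:", k90, "\n")
if (interactive()) screeplot(pca_R, type = "lines")  # iii. scree plot
```

With standardized variables the PC variances sum to the number of variables (9 here), which is a quick sanity check on the output.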
e) Plots
Dispersion Matrix
Dispersion Matrix      Cement     Slag   FlyAsh    Water   SuperP   Coarse     Fine      Age Strength
Cement                  10922    -2482    -2658     -182       58     -889    -1866      541      869
Blast furnace slag      -2482     7444    -1787      198       22    -1905    -1948     -241      194
Fly.Ash.                -2658    -1787     4096     -351      144      -50      406     -624     -113
Water                    -182      198     -351      456      -84     -303     -772      374     -103
Superplasticizer           58       22      144      -84       36     -124      107      -73       37
Coarse.Agg               -889    -1905      -50     -303     -124     6046    -1113      -15     -214
Fine.Aggregate          -1866    -1948      406     -772      107    -1113     6428     -791     -224
Age..day.                 541     -241     -624      374      -73      -15     -791     3990      347
Conc comp strength        869      194     -113     -103       37     -214     -224      347      279
a) Loadings of PCs:
b) Variance of the PCs: these are the eigenvalues of the dispersion matrix, i.e. the squares of the
standard deviations reported in the summary below.
c) Scree Plot
As the scree plot shows, there is a bend at the 5th PC, so we retain the first 5 PCs.
d) Cumulative Proportion of Variance:
So, the first 5 PCs explain at least 90% of the variance.
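The dispersion-matrix case differs from the correlation-matrix case because variables with large variances dominate the leading PCs. The sketch below illustrates this on three simulated variables with very different scales. The variable names and scales are assumptions chosen for illustration, not the concrete data.

```r
# Minimal sketch on simulated data with deliberately different scales (illustrative only).
set.seed(2)
n <- 300
X <- cbind(big   = rnorm(n, sd = 100),
           mid   = rnorm(n, sd = 10),
           small = rnorm(n, sd = 1))

S <- cov(X)                                          # dispersion (covariance) matrix
pca_S <- prcomp(X, center = TRUE, scale. = FALSE)    # PCA based on S
variances_S <- pca_S$sdev^2                          # PC variances = eigenvalues of S
prop_S <- variances_S / sum(variances_S)

pca_R  <- prcomp(X, center = TRUE, scale. = TRUE)    # PCA based on R, for comparison
prop_R <- pca_R$sdev^2 / sum(pca_R$sdev^2)

print(round(rbind(S = prop_S, R = prop_R), 3))       # share of variance per PC
print(round(pca_S$rotation[, 1], 3))                 # PC1 under S is essentially "big"
```

Under S the first PC absorbs almost all of the variance because one variable is measured on a much larger scale, which is why the number of PCs needed for 90% can differ between the S and R analyses.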
e) PCA plot
2 Correspondence Analysis
The dataset author, provided in the first sheet of the attached MS-Excel file, Assignment_2_data.xlsx,
contains the counts of the 26 letters of the alphabet (columns of matrix) for 12 different novels (rows of
matrix). Each row contains letter counts in a sample of text from each work, excluding proper nouns.
i. Use any appropriate function from any R package to perform correspondence analysis on the data.
ii. Visualize the data in a two-dimensional space using the first two extracted coordinates from both rows
and columns.
iii. Comment, with justification, on how reliable this plot is in respect of portraying associations among
row and column categories.
iv. Comment on the information provided by the 2-D CA plot regarding the association between them.
• Function for CA
The ca package was used for correspondence analysis. Code is attached in the mail.
• Inertia
Inertia shows how much of the variation is accounted for by each dimension. The first 2 dimensions
capture 60% of the inertia, so the 2-D plot is only moderately reliable: about 40% of the information
is not shown. The problem is the quantity of the data: the more data there is, the greater the chance
that any compact summary will miss important details.
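To make the inertia figures concrete, the sketch below carries out simple correspondence analysis by hand with an SVD in base R. It is an illustration only: the small count table N is hypothetical (not the author data), and the analysis in this report used the ca package rather than this manual route.

```r
# Hypothetical small count table (rows = texts, columns = letters); not the author data.
N <- matrix(c(50, 20, 10,  5,
              45, 25, 12,  8,
              30, 30, 25, 15,
              20, 15, 40, 25),
            nrow = 4, byrow = TRUE,
            dimnames = list(paste0("text", 1:4), c("a", "e", "t", "z")))

P      <- N / sum(N)                                 # correspondence matrix
r_mass <- rowSums(P)                                 # row masses
c_mass <- colSums(P)                                 # column masses
E      <- r_mass %o% c_mass                          # expected proportions under independence
Z      <- (P - E) / sqrt(E)                          # standardized residuals
sv     <- svd(Z)

inertia   <- sv$d^2                                  # principal inertias (eigenvalues)
prop      <- inertia / sum(inertia)                  # share of inertia per dimension
row_coord <- diag(1 / sqrt(r_mass)) %*% sv$u %*% diag(sv$d)  # row principal coordinates
col_coord <- diag(1 / sqrt(c_mass)) %*% sv$v %*% diag(sv$d)  # column principal coordinates

print(round(prop, 3))
cat("Share of inertia shown by a 2-D plot:", round(sum(prop[1:2]), 3), "\n")
```

The total inertia equals the chi-square statistic of the table divided by the grand total, so the share retained by the first two dimensions measures how much of the row-column association a 2-D plot can display.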
• Eigenvalues:
• Scree Plot:
• Biplot:
• Red points are column points (letters) and blue points are row points (novels).
a. The letters x, w, y, z, q, and k are not used much in the 12 novels.
b. The letter z and sound and fury 7 (Faulkner) are highly discriminating, while a, t, o, and
islands aren't.
c. The letters q and z are similar to each other. Similarly, x and y, and also k and w.
d. sound and fury 6 (Faulkner) and pendorric 3 (Holt) are similar to each other.
e. profiles of future (Clark) has no association with p, q, z (90-degree angle). Also, farewell
to arms (Hemingway) has no association with v, l.
f. sound and fury 6 (Faulkner) has a positive association with v (small angle between them).
g. Similarly, profiles of future (Clark) has a positive association with d and a negative
association with x (almost 180-degree angle).
h. farewell to arms (Hemingway) has a negative association with e and r.
• Row Plots
The row categories with the larger values contribute the most to the definition of the
dimensions.
3 Factor Analysis
Consider the data related to red wines in the wine quality dataset available in the UCI ML Repository
(https://archive.ics.uci.edu/ml/datasets/Wine+Quality). It has 1599 observations on the following variables
for various varieties of red wine:
1 - fixed acidity
2 - volatile acidity
3 - citric acid
4 - residual sugar
5 - chlorides
6 - free sulfur dioxide
7 - total sulfur dioxide
8 - density
9 - pH
10 - sulphates
11 - alcohol
12 - quality (score between 0 and 10)
Treating quality as the dependent variable,
(a) fit the best possible orthogonal factor model to the data, giving appropriate justification regarding the
choice of the optimal number of factors
(b) for the best model, give reasonable interpretation to the factors.
a. Best fit orthogonal Factor model
• K=5
The “SS loadings” row is the sum of squared loadings for each factor. It is sometimes used to judge
whether a factor is worth keeping: a factor is retained if its SS loading is greater than 1.
Here, the 5th factor’s SS loading is less than 1. The non-metric dependent variable (quality) and the
variables with communality >1 (density) were not considered for the analysis.
• K=4
Here, the SS loadings of all factors are >1, and this model has the maximum p-value among
the models with k=1,2,3,4. Hence, we retain the k=4 factor model, whose factors capture
around 60% of the variance.
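The comparison of models with different numbers of factors can be sketched with factanal from base R. The data below are simulated with two latent factors and hypothetical loadings, so the numbers are not those of the wine data. The sketch only shows where the p-value and the "SS loadings" row come from.

```r
# Minimal sketch on simulated data with two latent factors (illustrative only; not the wine data).
set.seed(3)
n  <- 500
f1 <- rnorm(n)
f2 <- rnorm(n)
lam <- c(0.8, 0.7, 0.6, 0.7)                         # hypothetical loadings
make_block <- function(f) sapply(lam, function(l) l * f + sqrt(1 - l^2) * rnorm(n))
X <- cbind(make_block(f1), make_block(f2))
colnames(X) <- paste0("x", 1:8)

# Fit orthogonal (varimax-rotated) factor models with k = 1 and k = 2 factors.
fits  <- lapply(1:2, function(k) factanal(X, factors = k, rotation = "varimax"))
pvals <- sapply(fits, function(f) unname(f$PVAL))    # test of H0: k factors are sufficient
ss    <- lapply(fits, function(f) colSums(unclass(f$loadings)^2))  # "SS loadings" row

for (k in 1:2) {
  cat("k =", k, " p-value =", signif(pvals[k], 3),
      " SS loadings =", round(ss[[k]], 2), "\n")
}
```

A small p-value means that k factors are not sufficient, so a larger p-value together with SS loadings above 1 for every retained factor supports the chosen k.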
b. Interpretation of factors
Factor 1 loads heavily on fixed acidity, citric acid, pH. Hence, it can be interpreted as
Acidity.
Factor 2 loads heavily on free and total SO2. It can be interpreted as a sulfur factor.
Factor 3 loads heavily on chlorides. Hence, it can be termed the chlorides factor.
Factor 4 loads heavily on volatile acidity and alcohol. Hence, it can be categorized as
Alcohol.
This is the FA Biplot.
4 Multiple Correspondence Analysis
Consider the dataset tea, that is provided in the second sheet of the attached MS-Excel file
Assignment_2_data.xlsx. It is a data frame (of factors) containing the answers to a questionnaire on tea
consumption for 300 individuals. Although the data contains 36 columns (i.e., variables), consider only
the following six columns:
• What kind of tea do you drink (black, green, flavored)
• How do you drink it (alone, w/milk, w/lemon, other)
• What kind of presentation do you buy (tea bags, loose tea, both)
• Do you add sugar (yes, no)
• Where do you buy it (supermarket, shops, both)
• Do you always drink tea (always, not always)
i. Use any appropriate function from any R package to perform multiple correspondence analysis (MCA) on
the data.
ii. Visualize the data in a two-dimensional space using the first two extracted coordinates from both rows
and columns.
iii. How reliable is this plot in respect of portraying associations among row and column categories?
Justify your answer.
iv. Consider the data in the last five columns, which correspond to binary attributes. Treat these
observations as ordinal variables by assigning the value 0 to “not-A” and the value 1 to A, A being the
attribute corresponding to the respective column. Compute the tetrachoric correlations for these 5
variables and perform PCA with the tetrachoric correlation matrix. Identify the attributes that explain
90% of the variation.
a.
b.
c. The first 2 dimensions retain only 30% of the inertia (variation) contained in the data, so the 2-D
plot is not very reliable. Not all points are equally well displayed in the two dimensions.
d.
• Tetrachoric Matrix
• PCA on tetrachoric matrix
Relaxing, exciting, and effect on health are the most important variables.
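On the reliability remark for the MCA plot (the share of inertia retained by the first two dimensions), the sketch below shows one way to obtain the inertia percentages in base R, by applying correspondence analysis to the indicator matrix. The survey answers are simulated and the column names are hypothetical. The report itself may have used a packaged MCA function instead.

```r
# Hypothetical survey answers (illustrative only; not the tea data).
set.seed(4)
n <- 300
df <- data.frame(
  tea   = factor(sample(c("black", "green", "flavored"), n, replace = TRUE)),
  sugar = factor(sample(c("yes", "no"), n, replace = TRUE)),
  where = factor(sample(c("supermarket", "shop", "both"), n, replace = TRUE))
)

# Indicator (disjunctive) matrix: one 0/1 column per category.
Z <- do.call(cbind, lapply(df, function(f) model.matrix(~ f - 1)))
Q <- ncol(df)                                        # number of questions
J <- ncol(Z)                                         # total number of categories

P      <- Z / sum(Z)
r_mass <- rowSums(P)
c_mass <- colSums(P)
E      <- r_mass %o% c_mass
ev     <- svd((P - E) / sqrt(E))$d^2                 # principal inertias
ev     <- ev[seq_len(J - Q)]                         # only J - Q dimensions are non-trivial
prop   <- ev / sum(ev)

print(round(prop, 3))
cat("Share of inertia in the first two dimensions:", round(sum(prop[1:2]), 3), "\n")
```

For an indicator matrix the total inertia is J/Q - 1 and is spread over J - Q dimensions, so low percentages on the first two axes are common in MCA even when the plot is informative.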
5 Metric MDS
The third sheet of the attached MS-Excel file Assignment_2_data.xlsx, labeled pottery, contains the
results of chemical analysis on 45 pots of Romano-British origin, made in five different kilns located in
three different regions, in the form of observations on nine different chemical constituents.
i. Compute the distance matrix for the 45 pots.
ii. Perform metric multidimensional scaling to ascertain to what extent the chemical profiles of the pots
suggest similarity among them, examining the 2-dimensional MDS plot corresponding to the data.
iii. If you are given additional information that
• the first 21 pots are from kiln no. 1, the next 12 are from kiln no. 2, followed by 2, 5 and 5 pots
from kiln nos. 3, 4 and 5 respectively,
• region 1 contains kiln 1, region 2 contains kilns 2 and 3, and region 3 contains kilns 4 and 5,
do your conclusions in (ii) appear to reflect similarity in respect of kiln and/or region? Explain with the
help of a modified version of the MDS plot in which pots from different kilns are shown in different colours.
a> Distance matrix for the first 23 observations
b> Metric MDS plot
The MDS plot suggests that there is a demarcation among regions. However, pots 22 and 24 are closer
to pots 16 and 13 than to pots 34 and 35, as is visible in the next plot with kilns coloured.
c>
Hence, from the above explanation, we can state that the clustering is by kiln, not by region.
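The metric MDS workflow (distance matrix, classical scaling, plot coloured by kiln) can be sketched in base R as below. The kiln sizes come from the assignment text, but the chemical measurements are simulated, and standardizing the variables before computing Euclidean distances is an assumption of this sketch.

```r
# Kiln sizes follow the assignment text; the nine "chemical" measurements are simulated.
set.seed(5)
kiln    <- rep(1:5, times = c(21, 12, 2, 5, 5))      # 45 pots in total
centres <- matrix(rnorm(5 * 9, sd = 3), 5, 9)        # hypothetical kiln profiles
X       <- centres[kiln, ] + matrix(rnorm(45 * 9), 45, 9)

D   <- dist(scale(X))                                # Euclidean distances on standardized data (assumption)
mds <- cmdscale(D, k = 2, eig = TRUE)                # classical (metric) MDS in 2 dimensions

cat("Goodness of fit in 2 dimensions:", round(mds$GOF, 3), "\n")

if (interactive()) {
  plot(mds$points, col = kiln, pch = 19,
       xlab = "Coordinate 1", ylab = "Coordinate 2")
  text(mds$points, labels = seq_along(kiln), pos = 3, cex = 0.6)
  legend("topright", legend = paste("kiln", 1:5), col = 1:5, pch = 19)
}
```

Colouring by kiln, and then by region, on the same coordinates makes it easy to see which grouping the configuration follows.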
6 Non-metric MDS
The fourth sheet of the attached MS-Excel file Assignment_2_data.xlsx, labeled gardenflowers, contains
the dissimilarity matrix of 18 species of garden flowers.
i. Use some form of non-metric multidimensional scaling to investigate which species share common
properties.
ii. Compute Kruskal’s stress measure for dimensions and generate a scree plot with the values.
iii. According to Kruskal’s guidelines what is the assessment of fit in 2 dimensions?
• Non-metric MDS plot
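• Stress Value
It is 18.87%.
• By Kruskal’s guidelines (roughly 20% poor, 10% fair, 5% good, 2.5% excellent), a stress of 18.87%
suggests that the fit in 2 dimensions is poor for non-metric MDS.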
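A stress scree plot and the reading of Kruskal's guidelines can be sketched with isoMDS from the MASS package, which reports stress in percent. The dissimilarities below are simulated rather than taken from the gardenflowers sheet, and mapping a stress value to the nearest benchmark is a convention assumed here for illustration.

```r
library(MASS)                                        # provides isoMDS (Kruskal's non-metric MDS)

# Simulated dissimilarities among 18 items (illustrative only; not the gardenflowers data).
set.seed(6)
X <- matrix(rnorm(18 * 4), 18, 4)
D <- dist(X)

# Kruskal's stress (in percent) for 1 to 4 dimensions.
stress <- sapply(1:4, function(k) isoMDS(D, k = k, trace = FALSE)$stress)
print(round(stress, 2))

# Nearest-benchmark reading of Kruskal's guidelines (an assumed convention).
kruskal_label <- function(s) {
  bench <- c(perfect = 0, excellent = 2.5, good = 5, fair = 10, poor = 20)
  names(bench)[which.min(abs(bench - s))]
}
cat("A stress of 18.87% is closest to:", kruskal_label(18.87), "\n")

if (interactive()) {
  plot(1:4, stress, type = "b", xlab = "Number of dimensions", ylab = "Stress (%)")
}
```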
7 Multiple Linear Regression
The last sheet in the attached MS-Excel file Assignment_2_data.xlsx, labeled USairpollution,
contains observations on seven variables, collected in a study of air pollution in 41 cities in the
USA. The variables are:
i. SO2: SO2 content of air in micrograms per cubic metre;
ii. temp: average annual temperature in degrees Fahrenheit;
iii. manu: number of manufacturing enterprises employing 20 or more workers;
iv. popul: population size (1970 census) in thousands;
v. wind: average annual wind speed in miles per hour;
vi. precip: average annual precipitation in inches;
vii. predays: average number of days with precipitation per year.
(a) Using sulphur dioxide content (SO2) as the response variable and the remaining six variables
as explanatory variables, fit a linear regression model by least squares.
(b) Generate the residual plot and comment.
(c) Test whether the regression is significant.
(d) Perform appropriate tests of hypotheses to infer the significance of each explanatory variable
in the regression model.
(e) Obtain 95% confidence intervals for the regression coefficients that were found to be
significantly different from 0 in part (d).
(f) Obtain the 95% confidence interval for the mean sulphur dioxide content when the vector of
observations on the predictors is
x0 = (20, 55, 440, 500, 10.0, 11.75, 80)′.
(g) Obtain the 95% prediction interval for the mean sulphur dioxide content when the vector of
observations on the predictors is x0 as given in part (f).
(h) Use appropriate regression diagnostic tools to identify influential observations.
(i) Repeat the regression analysis of parts (a)-(d) above after removing whatever cities you think
should be regarded as outliers.
7.1 Without removing influential observations
a,c,d)
The F-statistic is 11.48 and the p-value is <0.05, so there is sufficient evidence to conclude that the
regression model fits the data better than the model with no predictor variables. This means that
the predictor variables in the model actually improve the fit. Hence, the regression is significant,
and the regression equation helps us to understand the relationship between the Xi’s and Y.
In general, if none of the predictor variables in the model is statistically significant, the overall F
statistic is usually not statistically significant either. Here, temp, manu, and popul are significant
variables by the t-test: their respective p-values are less than 0.05. wind, precip, and predays are
not significant, so they contribute little to the regression model compared with the other features.
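The overall F-test, the individual t-tests and the confidence intervals for the significant coefficients can be read off an lm fit as sketched below. The data frame air is simulated with the same variable names as the USairpollution sheet. The coefficients used to generate it are hypothetical, so the output will not reproduce the figures quoted above.

```r
# Simulated stand-in for the USairpollution sheet (same column names, hypothetical values).
set.seed(7)
n <- 41
air <- data.frame(temp    = rnorm(n, 55, 7),
                  manu    = rexp(n, 1 / 450),
                  popul   = rexp(n, 1 / 600),
                  wind    = rnorm(n, 9.5, 1.4),
                  precip  = rnorm(n, 37, 11),
                  predays = rnorm(n, 114, 26))
air$SO2 <- 100 - 1.2 * air$temp + 0.06 * air$manu - 0.03 * air$popul + rnorm(n, sd = 12)

fit <- lm(SO2 ~ temp + manu + popul + wind + precip + predays, data = air)
sm  <- summary(fit)

# (c) Overall F-test of the regression.
fstat     <- sm$fstatistic
p_overall <- pf(fstat["value"], fstat["numdf"], fstat["dendf"], lower.tail = FALSE)

# (d) t-tests for the individual coefficients.
tt          <- sm$coefficients
signif_vars <- setdiff(rownames(tt)[tt[, "Pr(>|t|)"] < 0.05], "(Intercept)")

# (e) 95% confidence intervals for the significant coefficients.
ci_signif <- confint(fit, parm = signif_vars, level = 0.95)

cat("F =", round(fstat["value"], 2), " p-value =", signif(p_overall, 3), "\n")
print(round(tt, 4))
print(ci_signif)
```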
b> Residual Plots
Scale-location plot: The red line, which tracks the typical size of the standardized residuals across
the fitted values, must be approximately horizontal. Here it is not, suggesting there exists some
heteroscedasticity in the data.
Residual plot: A strong pattern among the residuals indicates non-linearity in the data.
e> 95% CI for significant variables (temp, manu, and popul)
f> CI for mean SO2 content
g> PI for mean SO2 content
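The difference between the interval in (f) and the interval in (g) can be sketched with predict on an lm fit. The data are simulated with hypothetical coefficients. The new point below reuses the last six entries of x0, which is purely an assumption about how the seven listed values map onto the six predictors.

```r
# Simulated stand-in for the USairpollution sheet (same column names, hypothetical values).
set.seed(8)
n <- 41
air <- data.frame(temp    = rnorm(n, 55, 7),
                  manu    = rexp(n, 1 / 450),
                  popul   = rexp(n, 1 / 600),
                  wind    = rnorm(n, 9.5, 1.4),
                  precip  = rnorm(n, 37, 11),
                  predays = rnorm(n, 114, 26))
air$SO2 <- 100 - 1.2 * air$temp + 0.06 * air$manu - 0.03 * air$popul + rnorm(n, sd = 12)
fit <- lm(SO2 ~ ., data = air)

# New predictor vector (assumed mapping of the last six entries of x0).
x0 <- data.frame(temp = 55, manu = 440, popul = 500,
                 wind = 10.0, precip = 11.75, predays = 80)

ci_mean <- predict(fit, newdata = x0, interval = "confidence", level = 0.95)  # mean response
pi_new  <- predict(fit, newdata = x0, interval = "prediction", level = 0.95)  # single new city

print(rbind(confidence = ci_mean[1, ], prediction = pi_new[1, ]))
```

Both intervals share the same centre, but the prediction interval is wider because it also accounts for the error variance of an individual observation.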
h> Influential Observations
• Outliers
Outliers are observations that aren’t predicted well by the regression model. They have
extremely large positive or negative residuals. A positive residual indicates that the
model is underestimating the response value; a negative residual indicates that it is
overestimating the response value.
In our regression model, we can start investigating outlying observations by using a
Q-Q plot. Pittsburgh (32nd observation) and Providence (33rd observation) are the
cities detected as potential outliers.
The outlierTest function helps us to confirm whether potential outliers are indeed
outliers. The test confirms Providence as an outlier.
• High Leverage Points
Observations are considered high-leverage points if they are outlying with respect to
the predictors. Strictly speaking, they have an uncommon combination of predictor
values, while the response value has only a minor impact on determining leverage.
High-leverage observations can be identified from the hat values, whose average equals
the ratio of the number of parameters estimated in the model to the sample size. If an
observation has a hat value greater than 2-3 times this average, it is considered a
high-leverage point.
• DFFITS, Cook’s Distance
The DFFITS statistic is a measure of how the predicted value at the ith observation changes
when the ith observation is deleted. Providence (33) has the maximum DFFITS score and
Cook’s distance value.
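The leverage, DFFITS and Cook's distance checks described above can be sketched with base R influence functions. The data are simulated with one planted unusual point, and the cut-offs used (2p/n for hat values, 2*sqrt(p/n) for DFFITS, 4/n for Cook's distance) are common rules of thumb assumed here. They are not necessarily the thresholds used in this report.

```r
# Simulated regression data with one planted influential point (illustrative only).
set.seed(9)
n  <- 41
x1 <- rnorm(n)
x2 <- rnorm(n)
y  <- 1 + 2 * x1 - x2 + rnorm(n)
x1[n] <- 6                                           # unusual predictor value (high leverage)
y[n]  <- -10                                         # response far from the trend (large residual)
fit <- lm(y ~ x1 + x2)

p   <- length(coef(fit))                             # number of estimated parameters
h   <- hatvalues(fit)                                # leverage; the average hat value is p / n
dff <- dffits(fit)
cd  <- cooks.distance(fit)

# Rule-of-thumb cut-offs (assumed conventions).
flag_leverage <- which(h > 2 * p / n)
flag_dffits   <- which(abs(dff) > 2 * sqrt(p / n))
flag_cook     <- which(cd > 4 / n)

print(list(leverage = flag_leverage, dffits = flag_dffits, cook = flag_cook))
cat("Largest Cook's distance at observation", which.max(cd), "\n")
```

Observations flagged by several of these measures at once are the natural candidates for removal before refitting.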
• The code below was used to create thresholds for identification.
So, Buffalo (5th obs), Phoenix, and Providence were identified as influential observations
and were removed.
7.2 After removing influential observations
a,c,d) The F-statistic is 17.48 and the p-value is <0.05, so there is sufficient evidence to conclude
that the regression model fits the data better than the model with no predictor variables. Hence, the
regression is significant.
Here, manu and popul are the only significant variables by the t-test: their respective p-values
are less than 0.05. temp, wind, precip, and predays are not significant, so they contribute little
to the regression model compared with the other features.
We can find the 95% CI for manu and popul.
b>
The residual plots look okay.