ISEN 614 final project

ISEN 614
FINAL PROJECT REPORT
Jorge Landois
May 5th, 2023
Executive Summary
The objective of this project was to identify the in-control and out-of-control samples using the data
reduction and analysis tools that were taught throughout the semester in ISEN 614. We were provided
with a large dataset, which proved difficult to analyze because the aggregated noise can overwhelm the
signal effects, making it harder to reject the null hypothesis. This phenomenon is known as the curse of
dimensionality. During the project, we used principal component analysis (PCA) as the main data
reduction tool to reduce the dimension of the data used in the analysis. We then used the Hotelling T2 chart on the reduced
data to identify the in-control samples and the out-of-control samples. The entire process was
implemented in a MATLAB script.
First, we calculated the mean vector and the covariance matrix ‘S’ of the given data. After we obtained the
covariance matrix, we obtained the eigenvalues and eigenvectors of ‘S’ to find the reduced dimension.
For a symmetric matrix such as ‘S’, MATLAB’s eig typically returns the eigenvalues in ascending order, so the largest ones come last. The corresponding eigenvectors were used to
create principal components from the original data. The eigenvalues were plotted in a scree plot, and it was
determined that the 4 largest eigenvalues held the most influence on the data. Therefore, the data would
be reduced to only 4 principal components (PCs). Additionally, the 4 largest eigenvalues account for over
80% of the sum of all the eigenvalues. For this reason, it is adequate to continue with 4 PCs.
For Principal Component Analysis (PCA), we calculated the vector ‘y’ of the principal components by
multiplying the data points by the eigenvectors corresponding to the chosen PCs. Then, we performed
Phase I analysis on ‘y’ and approximated the upper control limit as 9.49 using a chi-squared distribution.
We then plotted the Hotelling T2 statistic for each sample. To isolate the in-control data, we removed the out-of-control
samples (those with a T2 statistic greater than 9.49) and recalculated the T2 statistic until we were left with only the
in-control samples. When the process was complete, there were 427 in-control samples.
PCA proved to be a successful tool in this case. However, that might not be true for all datasets. Some
datasets might require Minimum Description Length (MDL) to determine the number of PCs that need to
be selected. In other cases, a correlation matrix might be better suited to identifying the correct PCs than
the covariance matrix used here. The approach chosen in this project is a robust procedure that can be
utilized in various cases. It uses an efficient data reduction tool and a simple calculation to determine
out-of-control samples.
Introduction
PCA (Principal Component Analysis) is a technique used to reduce the dimensionality of a dataset by
identifying a smaller number of uncorrelated variables (principal components) that capture most of the
variation in the original data. These principal components are linear combinations of the original variables
and are ordered in terms of the amount of variation they capture.
Hotelling's T-squared statistic, also known as the T2 statistic, is a multivariate statistical measure used to
test whether an observation is consistent with the rest of a set of observations. It is used to identify outliers and
determine whether an observation is statistically different from the expected mean value. The T2
statistic is calculated from the distance of each observation from the mean, scaled by the covariance
matrix of the data.
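For reference, the usual textbook form of the statistic for a single observation is sketched below. The symbols are assumptions introduced here for illustration: x is one observation, x-bar is the sample mean vector, and S is the sample covariance matrix.

```latex
T^2 = (x - \bar{x})^{\top} S^{-1} (x - \bar{x})
```

A large value means the observation lies far from the mean relative to the spread and correlation captured by S.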
Understanding the Problem
Dr. Lee provided us with a dataset containing 552 samples, each with 209 data points. The team was
tasked with determining which of these samples should be considered outliers, establishing a process, and
explaining the steps taken to determine the control limits. The in-control data is not given, and therefore
the problem requires a Phase I analysis.
Data Reduction
Sample statistics:
The first step is to calculate the sample mean (xbar) and the sample covariance matrix (S). In some cases,
a correlation matrix can be used instead of a covariance matrix. However, those cases usually involve
variables measured in different units or scales that can’t be combined effectively with the rest of the data. In this case,
the units are not specified, so we assumed the variables are on a comparable scale and chose the covariance matrix.
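In symbols, with n = 552 samples x_i, each of dimension p = 209, these two statistics are usually written as below. These are the standard definitions, not lines copied from the project's MATLAB script, and the 1/(n-1) normalization is an assumption.

```latex
\bar{x} = \frac{1}{n}\sum_{i=1}^{n} x_i, \qquad
S = \frac{1}{n-1}\sum_{i=1}^{n} (x_i - \bar{x})(x_i - \bar{x})^{\top}
```

Under these definitions, xbar has 209 entries and S is a 209-by-209 symmetric matrix.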
Eigenvalues and Eigenvectors:
Before we select the variables that we will consider when establishing the control limit for the
sample, we need to determine how many principal components (PCs) we will include in our calculations. To
do that, we need to solve for the eigenvectors and eigenvalues of the covariance matrix S. Using
MATLAB’s function eig(S), we obtained the eigenvalues and plotted them in a scree plot, as shown in
Figure 1. Because the eigenvalues were returned in ascending order, the last 4 eigenvalues/eigenvectors are the largest and the most
influential in the sample. They accounted for over 81% of the total variance of the sample. Taking this into
consideration, we decided that the rest of the analysis would consider only 4 PCs.
Figure 1: Scree plot, eigenvalues vs principal components
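The cumulative-share check that accompanies a scree plot can be sketched as follows. This is a minimal illustration in Python rather than the project's MATLAB script. The function name choose_num_components, the 0.80 threshold argument, and the example eigenvalues are all hypothetical and are not the project's values.

```python
def choose_num_components(eigenvalues, threshold=0.80):
    """Return (k, share): the smallest k whose k largest eigenvalues
    reach the requested share of the eigenvalue sum."""
    ordered = sorted(eigenvalues, reverse=True)  # largest first, whatever the input order
    total = sum(ordered)
    running = 0.0
    for k, value in enumerate(ordered, start=1):
        running += value
        if running / total >= threshold:
            return k, running / total
    return len(ordered), 1.0


# Hypothetical eigenvalues in ascending order, as an eigen-solver might return them.
example_eigenvalues = [0.5, 0.5, 1.0, 2.0, 3.0, 5.0, 9.0]
k, share = choose_num_components(example_eigenvalues)
print(k, round(share, 4))
```

Sorting inside the function avoids depending on the order in which the eigenvalues were returned.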
Principal Component Analysis (PCA) and Phase I Analysis:
The first step after determining the PCs is to create a vector ‘y’ by multiplying the original data by the
eigenvectors of the selected PCs.
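A minimal sketch of this projection, assuming each sample is a list of p numbers and each retained eigenvector is a list of the same length. The function name project and the tiny 3-variable example are hypothetical; the project itself used 209 variables and 4 eigenvectors in MATLAB.

```python
def project(samples, eigenvectors):
    """Return the PC scores: one row per sample, one entry per retained eigenvector.

    Each score is the dot product of the sample with one eigenvector.
    """
    return [[sum(e_j * x_j for e_j, x_j in zip(e, x)) for e in eigenvectors]
            for x in samples]


# Hypothetical data: two samples with three variables, two orthonormal directions.
samples = [[1.0, 2.0, 3.0], [4.0, 5.0, 6.0]]
eigenvectors = [[1.0, 0.0, 0.0], [0.0, 0.6, 0.8]]
y = project(samples, eigenvectors)
# y is approximately [[1.0, 3.6], [4.0, 7.8]]
```

Whether the data are centered before projecting does not change the T2 values computed afterwards, because the mean of ‘y’ is subtracted when the statistic is calculated.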
The next step is determining the Upper Control Limit (UCL), which in Phase I analysis we approximated
using a chi-squared distribution whose degrees of freedom equal the number of retained PCs (4), giving a UCL of 9.49.
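The 9.49 limit quoted in the Executive Summary matches the 95th percentile of a chi-squared distribution with 4 degrees of freedom. The sketch below reproduces that number with the Python standard library only. The 0.05 false-alarm rate is an assumption inferred from the reported limit, not a value stated in the report, and the helper names are hypothetical.

```python
import math


def chi2_cdf_even_df(x, df):
    """Chi-squared CDF for an even number of degrees of freedom (closed form)."""
    if df <= 0 or df % 2 != 0:
        raise ValueError("this closed form needs a positive, even df")
    half = x / 2.0
    term, total = 1.0, 1.0
    for j in range(1, df // 2):
        term *= half / j          # builds (x/2)^j / j!
        total += term
    return 1.0 - math.exp(-half) * total


def chi2_quantile_even_df(prob, df, tol=1e-10):
    """Invert the CDF by bisection."""
    lo, hi = 0.0, 1.0
    while chi2_cdf_even_df(hi, df) < prob:   # grow the bracket until it contains the quantile
        hi *= 2.0
    while hi - lo > tol:
        mid = (lo + hi) / 2.0
        if chi2_cdf_even_df(mid, df) < prob:
            lo = mid
        else:
            hi = mid
    return (lo + hi) / 2.0


# Assumed alpha = 0.05 and 4 retained PCs.
ucl = chi2_quantile_even_df(0.95, 4)
print(round(ucl, 2))
```

In MATLAB the same quantile is normally obtained from the Statistics and Machine Learning Toolbox function chi2inv.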
The third step is to calculate the Hotelling T2 statistic for each sample, which is used to identify
which samples are out-of-control. We chose the T2 chart over the alternatives because we are working with
a multivariate case; some charts (e.g., the y chart) are more efficient in univariate cases. To calculate the T2
statistic, it is essential to determine the mean of the newly created vector ‘y’ and its covariance matrix
‘Sy’. Once the T2 statistic is calculated for each sample, we compare it to the UCL; if the T2 statistic
is greater, that sample is considered out-of-control. Figure 2 displays the plot of the T2 statistics and
the UCL across all the samples of the ‘y’ vector. Note that some samples
lie above the UCL line and are considered out-of-control. Figure 3 displays the same
graph after the out-of-control samples were removed and the statistic was recalculated until no sample exceeded the UCL, leaving a total of 427 in-control samples.
Figure 2: Hotelling T2 Statistic vs sample #
Figure 3: In-control Samples
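The remove-and-recalculate loop behind Figures 2 and 3 can be sketched as follows. This is a hedged illustration in plain Python, not the project's MATLAB script. All function names are hypothetical, and the 2-variable example data and its 5.99 limit (the 95th percentile of chi-squared with 2 degrees of freedom) are made up for the demonstration.

```python
def mean_vector(rows):
    n = len(rows)
    return [sum(col) / n for col in zip(*rows)]


def covariance(rows, mean):
    """Sample covariance matrix with the 1/(n-1) normalization."""
    n, k = len(rows), len(mean)
    cov = [[0.0] * k for _ in range(k)]
    for r in rows:
        d = [v - m for v, m in zip(r, mean)]
        for a in range(k):
            for b in range(k):
                cov[a][b] += d[a] * d[b]
    return [[c / (n - 1) for c in row] for row in cov]


def solve(matrix, rhs):
    """Solve matrix * z = rhs by Gaussian elimination with partial pivoting."""
    k = len(rhs)
    aug = [list(matrix[i]) + [rhs[i]] for i in range(k)]
    for col in range(k):
        pivot = max(range(col, k), key=lambda r: abs(aug[r][col]))
        aug[col], aug[pivot] = aug[pivot], aug[col]
        for r in range(col + 1, k):
            factor = aug[r][col] / aug[col][col]
            for c in range(col, k + 1):
                aug[r][c] -= factor * aug[col][c]
    z = [0.0] * k
    for i in reversed(range(k)):
        tail = sum(aug[i][j] * z[j] for j in range(i + 1, k))
        z[i] = (aug[i][k] - tail) / aug[i][i]
    return z


def t2_values(rows):
    """T2 of every row against the mean and covariance of the same rows."""
    mean = mean_vector(rows)
    cov = covariance(rows, mean)
    values = []
    for r in rows:
        d = [v - m for v, m in zip(r, mean)]
        values.append(sum(di * zi for di, zi in zip(d, solve(cov, d))))
    return values


def phase_one(rows, ucl):
    """Drop rows with T2 > ucl and recompute until every remaining row is within the limit."""
    kept = list(range(len(rows)))
    while True:
        t2 = t2_values([rows[i] for i in kept])
        survivors = [i for i, t in zip(kept, t2) if t <= ucl]
        if len(survivors) == len(kept):
            return kept
        kept = survivors


# Hypothetical data: 16 well-behaved points plus one distant point (index 16).
rows = [[1.0, 1.0], [1.0, -1.0], [-1.0, 1.0], [-1.0, -1.0]] * 4 + [[10.0, 0.0]]
kept = phase_one(rows, ucl=5.99)
print(len(kept))
```

In this example the distant point exceeds the limit on the first pass and is removed, and the second pass finds every remaining point within the limit, so the loop stops.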
Conclusion
PCA proved to be an efficient data reduction method for this specific data set. After consulting
with other teams and students who were enrolled in this class, we decided to follow the chosen
approach of using PCA and the Hotelling T2 statistic. At the end of the analysis, we determined
that 427 of the 552 samples were within the control limits, about 77.4% of all the samples.
While other methods can be useful in determining control limits and identifying out-of-control
samples in certain situations, PCA and the Hotelling T2 statistic were the best choice for this specific
scenario.