PCA & Hotelling T2 for Statistical Process Control

ISEN 614 FINAL PROJECT REPORT
Jorge Landois
May 5th, 2023

Executive Summary

The objective of this project was to identify the in-control and out-of-control samples using the data reduction and analysis tools that were taught throughout the semester in ISEN 614. We were provided with a large, high-dimensional dataset, which proved difficult to analyze because the noise aggregated over many variables can overwhelm the signal, making it harder to reject the null hypothesis that a sample is in control. This phenomenon is known as the curse of dimensionality.

During the project, we used principal component analysis (PCA) as the main data reduction tool. We then applied the Hotelling T2 chart to the reduced data to separate the in-control samples from the out-of-control samples. The entire process was implemented in a MATLAB script.

First, we calculated the mean vector and the covariance matrix ‘S’ of the given data. We then obtained the eigenvalues and eigenvectors of ‘S’ to find the reduced dimension. For a symmetric matrix such as ‘S’, MATLAB's eig generally returns the eigenvalues in ascending order, so the largest eigenvalues appear last; because the documentation does not guarantee a sorted order, sorting them explicitly is safer. The eigenvectors were used to create principal components from the original data. The eigenvalues were plotted in a scree plot, and it was determined that the 4 largest eigenvalues held the most influence on the data. Together they account for over 80% of the sum of all the eigenvalues, so the data were reduced to 4 principal components (PCs).

We calculated the vector ‘y’ of principal component scores by multiplying the data points by the eigenvectors corresponding to the chosen PCs. Then, we performed Phase I analysis on ‘y’ and approximated the upper control limit (UCL) as 9.49 using a chi-squared distribution. We then plotted the Hotelling T2 statistic for each sample.
To isolate the in-control data, we removed the out-of-control samples (those whose T2 statistic exceeded 9.49) and recalculated the T2 statistic until only in-control samples remained. When the process was complete, there were 427 in-control samples.

PCA proved to be a successful tool in this case. However, that might not hold for all datasets. Some datasets might require a criterion such as Minimum Description Length (MDL) to determine the number of PCs to select. In other cases, it might be better to extract the PCs from a correlation matrix instead of a covariance matrix as was done here. The approach chosen in this project is a robust procedure that can be used in various cases. It uses an efficient data reduction tool and a simple calculation to identify out-of-control samples.

Introduction

PCA (Principal Component Analysis) is a technique used to reduce the dimensionality of a dataset by identifying a smaller number of uncorrelated variables (principal components) that capture most of the variation in the original data. These principal components are linear combinations of the original variables and are ordered by the amount of variation they capture.

Hotelling's T-squared statistic, also known as the T2 statistic, is a multivariate measure of how far an observation lies from the mean, scaled by the covariance matrix of the data. It is used to identify outliers and to test whether an observation differs significantly from the expected mean value.

Understanding the Problem

Dr. Lee provided us with a dataset containing 552 samples, each with 209 data points. The team was tasked with determining which of these samples should be considered outliers, establishing a process for doing so, and explaining the steps taken to determine the control limits.
The in-control data is not given, and therefore this requires a Phase I analysis.

Data Reduction

Sample statistics: The first step is to calculate the sample mean (xbar) and the sample covariance matrix (S). In some cases, a correlation matrix can be used instead of a covariance matrix. However, those cases usually involve variables measured in different units or on different scales, which cannot be combined effectively with the rest of the data. In this case, no units are specified, so we treated the variables as comparable and used the covariance matrix.

Eigenvalues and Eigenvectors: Before selecting the variables used to establish the control limit, we need to determine how many principal components (PCs) to include in our calculations. To do that, we solve for the eigenvectors and eigenvalues of the covariance matrix S. Using MATLAB's function eig(S), we obtained the eigenvalues and plotted them in a scree plot, as shown in Figure 1. From this plot we determined that the last 4 eigenvalues/eigenvectors returned by eig(S), which are the 4 largest, were the most influential in the sample. They accounted for over 81% of the total variance (the sum of all eigenvalues). Taking this into consideration, we decided to use only 4 PCs for the rest of the analysis.

Figure 1: Scree plot, eigenvalues vs principal components

Principal Component Analysis (PCA) and Phase I Analysis: The first step after choosing the PCs is to create a vector ‘y’ by multiplying the original data by the eigenvectors of the selected PCs. The second step is to determine the Upper Control Limit (UCL), which in Phase I analysis we approximated using a chi-squared distribution with 4 degrees of freedom, one per PC (the UCL of 9.49 is the upper 5% point of that distribution). The third step is to calculate the Hotelling T2 statistic for each sample, which is used to identify the out-of-control samples. We chose the T2 chart because we are working with a multivariate case; some other charts (e.g. the y chart) are better suited to univariate cases.
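The steps above, together with the iterative removal described in the Executive Summary, can be sketched in a few lines of code. This is a minimal illustration in plain Python, not the MATLAB script used in the project. All function names are hypothetical, the eigen-decomposition is assumed to have been done elsewhere (for example with eig(S)), and the demonstration runs on synthetic 4-dimensional scores rather than the project data. The statistic computed is T2 = (y - ybar)' Sy^{-1} (y - ybar), with ybar and Sy re-estimated after every removal pass.

```python
import random

UCL = 9.49  # limit quoted in the report (chi-squared, 4 degrees of freedom)


def choose_num_components(eigenvalues, threshold=0.80):
    """Smallest k whose k largest eigenvalues reach `threshold` of their sum."""
    ordered = sorted(eigenvalues, reverse=True)  # do not rely on eig's ordering
    total = sum(ordered)
    running = 0.0
    for k, value in enumerate(ordered, start=1):
        running += value
        if running / total >= threshold:
            return k
    return len(ordered)


def project(rows, eigenvectors):
    """Scores y = x * V; `eigenvectors` is a list of the chosen eigenvectors."""
    return [[sum(x * v for x, v in zip(row, vec)) for vec in eigenvectors]
            for row in rows]


def mean_and_covariance(rows):
    """Sample mean vector and sample covariance matrix (divisor n - 1)."""
    n, p = len(rows), len(rows[0])
    mean = [sum(col) / n for col in zip(*rows)]
    cov = [[0.0] * p for _ in range(p)]
    for row in rows:
        d = [x - m for x, m in zip(row, mean)]
        for i in range(p):
            for j in range(p):
                cov[i][j] += d[i] * d[j] / (n - 1)
    return mean, cov


def solve(a, b):
    """Solve a x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    m = [row[:] + [b[i]] for i, row in enumerate(a)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            factor = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= factor * m[col][c]
    x = [0.0] * n
    for i in range(n - 1, -1, -1):
        tail = sum(m[i][j] * x[j] for j in range(i + 1, n))
        x[i] = (m[i][n] - tail) / m[i][i]
    return x


def t2_statistics(scores):
    """T2_i = (y_i - ybar)' Sy^{-1} (y_i - ybar) for every row of `scores`."""
    mean, cov = mean_and_covariance(scores)
    result = []
    for row in scores:
        d = [x - m for x, m in zip(row, mean)]
        result.append(sum(di * zi for di, zi in zip(d, solve(cov, d))))
    return result


def phase_one(scores, ucl=UCL):
    """Drop samples with T2 > ucl and recompute until none exceed the limit."""
    kept = list(range(len(scores)))
    while True:
        t2 = t2_statistics([scores[i] for i in kept])
        survivors = [i for i, t in zip(kept, t2) if t <= ucl]
        if len(survivors) == len(kept):
            return kept
        kept = survivors


# Synthetic 4-dimensional scores (not the project data): 200 in-control
# samples plus 5 planted outliers at indices 200..204.
random.seed(0)
scores = [[random.gauss(0, 1) for _ in range(4)] for _ in range(200)]
scores += [[8.0, 8.0, 8.0, 8.0] for _ in range(5)]
kept = phase_one(scores)
print(len(kept), "of", len(scores), "samples retained")
```

The loop re-estimates the mean and covariance from the surviving samples on every pass, which is why some samples that were inside the limit on the first pass can fall outside it later. The sketch has no guard against too few samples remaining; in that situation the covariance matrix becomes singular and solve would fail with a division by zero.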
To calculate the T2 statistic, we need the mean of the newly created vector ‘y’ and its covariance matrix ‘Sy’. Once the T2 statistic is calculated for each sample, we compare it to the UCL; if the T2 statistic is greater, that sample is considered out-of-control. Figure 2 plots the T2 statistics and the UCL across all samples of the ‘y’ vector. Note that some samples lie above the UCL line and are considered out-of-control. Figure 3 displays the same chart but includes only the in-control samples, a total of 427 samples.

Figure 2: Hotelling T2 Statistic vs sample #

Figure 3: In-control Samples

Conclusion

PCA proved to be an efficient data reduction method for this specific data set. After consulting with other teams and students enrolled in this class, we decided to follow the approach of using PCA and the Hotelling T2 statistic. At the end of the analysis, we determined that 427 of the 552 samples were within the control limits, about 77.4% of all the samples. While other methods can be useful for determining control limits and identifying out-of-control samples in certain situations, PCA combined with the Hotelling T2 statistic was the best choice for this specific scenario.
