Data Science Tools and Techniques - Assignment #2
Submitted to: Dr. Kashif Zafar
Submitted by: Maria Ishtiaq Ahmed (18l-1813)

Question 1 (a): Implementing PCA in Python:
In this question principal component analysis (PCA) has been implemented in Python, using the Spyder IDE (Python 3.6). The figure below shows the code and the output generated on the console:

Figure 1: Python code for PCA along with the output

Question 1 (b): Performing PCA on WEKA:
Principal component analysis (PCA) is a technique used to emphasize variation and bring out strong patterns in a dataset. It is often used to make data easy to explore and visualize. PCA is predominantly used as a dimensionality reduction technique in domains such as facial recognition, computer vision and image compression. It is also used for finding patterns in high-dimensional data in fields such as finance, data mining, bioinformatics and psychology [1].

To understand the concept of PCA I have created dummy data related to weather, with four attributes: Rainfall, Temperature, Humidity and Heat. Given below is the table containing the data along with the mean of each attribute:

Obs.    Rainfall    Temperature    Humidity    Heat
1       2.5         2.4            3.0         4.2
2       0.5         0.7            1.0         6.1
3       2.2         2.9            8.0         4.0
4       1.9         2.2            7.0         4.8
5       3.1         3.0            1.9         3.8
6       2.3         2.7            5.0         6.0
7       2.0         1.6            5.0         8.0
8       1.0         1.1            2.0         9.0
9       1.5         1.6            3.0         4.0
10      1.9         0.9            5.0         5.1
Mean    1.89        1.91           4.09        5.50

I have applied principal component analysis to this dataset using the WEKA software and analyzed the results. The dataset is represented by the matrix A, with one row per observation and one column per attribute:

A = [2.5  2.4  3.0  4.2
     0.5  0.7  1.0  6.1
     2.2  2.9  8.0  4.0
     1.9  2.2  7.0  4.8
     3.1  3.0  1.9  3.8
     2.3  2.7  5.0  6.0
     2.0  1.6  5.0  8.0
     1.0  1.1  2.0  9.0
     1.5  1.6  3.0  4.0
     1.9  0.9  5.0  5.1]

The column means of matrix A are:

mean(A) = [1.89  1.91  4.09  5.50]

Now the covariance matrix of the whole dataset is computed. The covariance of two variables x and y is computed using the formula:

Cov(x, y) = (1 / (n − 1)) Σ_{i=1..n} (x_i − mean(x)) (y_i − mean(y))

The covariance matrix is computed in WEKA.
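Since the code in Figure 1 is not reproduced here, the steps described above can be sketched with NumPy. This is a minimal illustration using the dummy weather table, not the exact code from the figure: centre the data, compute the covariance matrix, take its eigendecomposition, keep the top k eigenvectors as W, and project.

```python
import numpy as np

# Dummy weather dataset from the table above: rows = observations,
# columns = Rainfall, Temperature, Humidity, Heat.
A = np.array([
    [2.5, 2.4, 3.0, 4.2],
    [0.5, 0.7, 1.0, 6.1],
    [2.2, 2.9, 8.0, 4.0],
    [1.9, 2.2, 7.0, 4.8],
    [3.1, 3.0, 1.9, 3.8],
    [2.3, 2.7, 5.0, 6.0],
    [2.0, 1.6, 5.0, 8.0],
    [1.0, 1.1, 2.0, 9.0],
    [1.5, 1.6, 3.0, 4.0],
    [1.9, 0.9, 5.0, 5.1],
])

# 1. Centre the data by subtracting the column means.
centred = A - A.mean(axis=0)

# 2. Covariance matrix, with n - 1 in the denominator as in the formula above.
cov = np.cov(centred, rowvar=False)

# 3. Eigendecomposition; eigh is appropriate because cov is symmetric.
eigenvalues, eigenvectors = np.linalg.eigh(cov)

# 4. Sort by decreasing eigenvalue and keep the k largest to form the
#    d x k matrix W (here d = 4 attributes, k = 2 components).
order = np.argsort(eigenvalues)[::-1]
k = 2
W = eigenvectors[:, order[:k]]

# Each eigenvalue's share of the total variance shows how much information
# the corresponding component carries.
variance_ratio = eigenvalues[order] / eigenvalues.sum()

# 5. Project the samples onto the new subspace: y = W' x for each sample
#    (samples are rows here, so the projection is centred @ W).
projected = centred @ W

print(projected.shape)   # (10, 2)
print(variance_ratio)
```

The same subspace (up to sign of the axes) is what WEKA reports in its principal-components output.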
After that the eigenvectors are calculated. An eigenvector is a vector whose direction remains unchanged when a linear transformation is applied to it. The eigenvectors are sorted by decreasing eigenvalue, and the k eigenvectors with the largest eigenvalues are chosen to form a d × k dimensional matrix W. Given below are the results computed in WEKA for the dataset given above:

Figure 2: Correlation Matrix and Eigenvectors of weather dataset

From the correlation matrix it can be observed which attributes are positively correlated and which are negatively correlated. For example, see row 1, column 2 (Rainfall and Temperature): as the value is positive, i.e. 0.83, the attributes are positively correlated (note: this is dummy data). The eigenvectors only define the directions of the new axes, since they all have the same unit length 1. The eigenvectors with the lowest eigenvalues bear the least information about the distribution of the data, so they are dropped. In the end the matrix W is used to transform the samples onto the new subspace via the equation y = W′ × x, where W′ is the transpose of the matrix W.

Breast Cancer Coimbra:
After implementing PCA on the dataset given above, in this part the Breast Cancer Coimbra dataset is used and principal component analysis is performed on it. This dataset consists of 10 attributes.

Figure 3: Correlation Matrix for Breast Cancer Dataset

It can be observed that the matrix in this case is large. The diagonal entries are all 1. This is because it is a correlation matrix, and each diagonal entry is the correlation of an attribute with itself (e.g. the first row, first column is the correlation of the first attribute with itself), which is the maximum possible correlation. The off-diagonal elements show how two attributes are related. If a value is positive, the attributes are positively correlated, i.e. increasing the value of one increases the value of the other, and decreasing it decreases the other.
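The properties described above (a diagonal of all 1s and symmetric off-diagonal entries measuring pairwise relationships) can be checked with a small NumPy sketch. The data here is randomly generated for illustration only; it is not the Breast Cancer Coimbra dataset:

```python
import numpy as np

rng = np.random.default_rng(0)
data = rng.normal(size=(100, 4))        # 100 observations, 4 attributes

# 4 x 4 correlation matrix (attributes are columns, so rowvar=False).
corr = np.corrcoef(data, rowvar=False)

# Diagonal is all 1s: each attribute is perfectly correlated with itself.
print(np.diag(corr))

# The matrix is symmetric: corr[i, j] == corr[j, i].
print(np.allclose(corr, corr.T))
```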
Figure 4: Eigenvectors for Breast Cancer Dataset

If a value is negative, the attributes are negatively correlated, i.e. increasing the value of one decreases the value of the other, and vice versa. A correlation matrix is a symmetric matrix, i.e. its transpose is equal to the original matrix.

Figure 5: Ranked attributes for Breast Cancer Dataset

Chronic Kidney Disease:
This dataset has 24 attributes; 11 attributes are numeric and 13 are nominal. Given below are the results observed:

Figure 6: Correlation Matrix for Chronic Kidney Dataset

The ranked attributes describe combinations of features in the data and how much each feature is being used. This helps us identify the variance, i.e. what varies in the data and lets us predict the class. By keeping only the top few ranked attributes you retain those carrying the most useful information, which makes things simpler because the less informative features/attributes are removed. The ranking also tells the relative importance of the different features.

Figure 7: Eigenvectors for Chronic Kidney Dataset

Figure 8: Ranked attributes for Chronic Kidney Dataset

Conclusion:
The goal here is to reduce the dimensionality of our feature space, i.e. to project the feature space via PCA onto a smaller subspace, where the chosen eigenvectors form the axes of the new feature subspace. So if you have a small dataset, as I used at the start, there is no real need for PCA. But in the case of the Chronic Kidney Disease and Breast Cancer Coimbra datasets, there are many features and the computation becomes slower and more complicated, so there is a need to find out which features are more important.

Question 2: Critique of Research Paper: Predicting Student's Academic Performances - A Learning Analytics Approach using Multiple Linear Regression

The research paper "Predicting Student's Academic Performances - A Learning Analytics Approach using Multiple Linear Regression", written by Oyerinde O. D. and Chia P.
A., has been published in the International Journal of Computer Applications (0975 – 8887), Volume 157, No. 4, January 2017. This research aims to predict students' academic performance (SAP) using regression analysis. A multiple linear regression model is implemented for the prediction of academic performance. Multiple linear regression can be defined as modelling the relationship between a dependent variable and two or more independent variables. Patterns and relationships are found by comparing or correlating the independent variables with the dependent variable.

The abstract of this research paper is well written. It captures the interest of a potential reader, makes a clear statement of the topic and the research area, and gives a concise summary of the paper. The introduction has a brief overview of the current state of research in machine learning and data mining. The research paper was able to achieve its objective of building a model to predict students' academic performance. The analysis showed that students who perform well in mathematics courses have better chances of achieving excellence in other Computer Science courses. A few problems were faced while implementing the model, i.e. the availability of proper and authentic data for the analysis: the data was not readily accessible, and there were inconsistencies in determining which student attributes contribute to academic performance.

The conclusion is concise and well written. It summarizes the key findings and further implications for the field. The benefits and future areas for research have been written in a concise way. There is a detailed analysis of the results, properly summarized in the form of tables, figures and flowcharts. There are no grammatical or spelling errors in the paper. The flowcharts, tables and figures give a good understanding of the methodology, and the mathematical equations give a clear understanding of the statistical model for regression.
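The multiple linear regression technique the paper applies can be illustrated with a short NumPy least-squares sketch. The numbers below are invented for illustration (they are not the paper's data): two hypothetical earlier maths scores are used to predict a hypothetical CS course score.

```python
import numpy as np

# Hypothetical scores (invented): two independent variables (maths courses)
# and one dependent variable (a CS course).
math1 = np.array([55., 70., 62., 81., 48., 90., 66., 75.])
math2 = np.array([60., 68., 58., 85., 50., 88., 64., 72.])
cs    = np.array([58., 71., 60., 84., 49., 91., 65., 74.])

# Design matrix with an intercept column, so the model is
# cs = b0 + b1 * math1 + b2 * math2.
X = np.column_stack([np.ones_like(math1), math1, math2])

# Ordinary least squares fit of the coefficients [b0, b1, b2].
coeffs, *_ = np.linalg.lstsq(X, cs, rcond=None)

# Fitted values for the training observations.
predicted = X @ coeffs
print(coeffs)
```

This is the same idea the paper develops with more predictors: the fitted coefficients quantify how each independent variable contributes to the predicted academic performance.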
Referencing is done properly, and in-text citations are included. The paper length is fine, i.e. 6 pages, and a length of 5-6 pages is considered good. This research paper is well written, covering all the main components required of a research paper. It makes a clear statement of the topic and the research area. The methodology is explained in detail through figures, flowcharts, mathematical equations and tables.

Some recommendations have been stated, based on the limitations of the research work, for future research in this endeavor. Firstly, as the research depends on large amounts of reliable and authentic data, there should be a proper data warehouse for this purpose. Secondly, there should be proper intervention programs that bring students and educators into close mediation; a face-to-face discussion can spark a student's desire to do more. I think both these points are important for this research. In machine learning tasks the accuracy of a model depends mainly on the data that has been provided, so accurate and authentic data is required for good prediction. The data should not be limited or too small in amount, because this would reduce accuracy and cause overfitting of the model. Prediction of students' academic performance benefits both the course instructors and the students whose performance is lagging in class.

The future direction of this research is to provide a generalized method for data collection, based on identified metrics, parameters and indicators. A proper data warehouse should be built for this purpose. By using machine learning, statistical and data mining tools and methods, the academic performance of students can be predicted. This will help both students and education providers: it will enable proper guidance of students in achieving their academic goals and objectives, and will assist in adequate placement in tertiary education courses.
References:
[1] Principal Component Analysis. Available from: <http://setosa.io/ev/principal-component-analysis/>. Accessed on: September 27, 2019.
[2] The Mathematics behind Principal Component Analysis. Available from: <https://towardsdatascience.com/the-mathematics-behind-principal-component-analysis-fff2d7f4b643>. Accessed on: September 28, 2019.