Data Science tools and Techniques - Assignment #2
Submitted to: Dr. Kashif Zafar
Submitted by: Maria Ishtiaq Ahmed (18l-1813)
Question 1 (a): Implementing PCA in Python:
In this question, principal component analysis (PCA) has been implemented in Python, using the Spyder IDE (Python 3.6). The figure below shows the code and the output generated on the console:
Figure 1: Python code for PCA along with the output
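Since the code in the figure is not reproduced here, a minimal sketch of how such a PCA implementation might look in NumPy (using the dummy weather data introduced in part (b); the choice of k = 2 components is illustrative) is:

```python
import numpy as np

# Dummy weather data: rows = observations; columns = Rainfall, Temperature, Humidity, Heat
A = np.array([
    [2.5, 2.4, 3.0, 4.2],
    [0.5, 0.7, 1.0, 6.1],
    [2.2, 2.9, 8.0, 4.0],
    [1.9, 2.2, 7.0, 4.8],
    [3.1, 3.0, 1.9, 3.8],
    [2.3, 2.7, 5.0, 6.0],
    [2.0, 1.6, 5.0, 8.0],
    [1.3, 1.1, 2.0, 9.0],
    [1.5, 1.6, 3.0, 4.0],
    [1.9, 0.9, 5.0, 5.1],
])

# Step 1: center the data by subtracting each column's mean
centered = A - A.mean(axis=0)

# Step 2: covariance matrix (np.cov divides by n-1 by default)
cov = np.cov(centered, rowvar=False)

# Step 3: eigen-decomposition; eigh is appropriate for a symmetric matrix
eigenvalues, eigenvectors = np.linalg.eigh(cov)

# Step 4: sort by decreasing eigenvalue and keep the top k components
order = np.argsort(eigenvalues)[::-1]
eigenvalues, eigenvectors = eigenvalues[order], eigenvectors[:, order]

k = 2
W = eigenvectors[:, :k]      # d x k projection matrix
projected = centered @ W     # samples expressed in the new k-dimensional subspace
print(projected.shape)
```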
Question 1 (b): Performing PCA on WEKA
Principal component analysis (PCA) is a technique used to emphasize variation and bring out
strong patterns in a dataset. It's often used to make data easy to explore and visualize. PCA is
predominantly used as a dimensionality reduction technique in domains like facial recognition,
computer vision and image compression. It is also used for finding patterns in data of high
dimension in the field of finance, data mining, bioinformatics, psychology, etc. [1].
To understand the concept of PCA, I have created dummy data related to weather, i.e. Rainfall, Temperature, Humidity and Heat. The table below contains the data along with the mean of each attribute:
No. of Obs.   Rainfall   Temperature   Humidity   Heat
1             2.5        2.4           3.0        4.2
2             0.5        0.7           1.0        6.1
3             2.2        2.9           8.0        4.0
4             1.9        2.2           7.0        4.8
5             3.1        3.0           1.9        3.8
6             2.3        2.7           5.0        6.0
7             2.0        1.6           5.0        8.0
8             1.0        1.1           2.0        9.0
9             1.5        1.6           3.0        4.0
10            1.9        0.9           5.0        5.1
Mean:         1.822      1.856         4.211      5.656
I have applied principal component analysis to this dataset using the WEKA software and analyzed the results. The dataset is represented by the matrix A shown below:
$$A = \begin{bmatrix}
2.5 & 2.4 & 3.0 & 4.2 \\
0.5 & 0.7 & 1.0 & 6.1 \\
2.2 & 2.9 & 8.0 & 4.0 \\
1.9 & 2.2 & 7.0 & 4.8 \\
3.1 & 3.0 & 1.9 & 3.8 \\
2.3 & 2.7 & 5.0 & 6.0 \\
2.0 & 1.6 & 5.0 & 8.0 \\
1.3 & 1.1 & 2.0 & 9.0 \\
1.5 & 1.6 & 3.0 & 4.0 \\
1.9 & 0.9 & 5.0 & 5.1
\end{bmatrix}$$
The mean of matrix A would be:
$$\mathit{Mean}(A) = \begin{bmatrix} 1.822 & 1.856 & 4.211 & 5.656 \end{bmatrix}$$
Now the covariance matrix of the whole dataset is computed. The covariance of two variables x and y is given by:
$$\mathit{Cov}(x, y) = \frac{\sum_{i=1}^{n}(X_i - X_{mean})(Y_i - Y_{mean})}{n - 1}$$
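A small Python check of this formula, using the Rainfall and Temperature columns of the dummy data, might look like:

```python
import numpy as np

def cov_xy(x, y):
    """Sample covariance of two variables, matching the formula above (n-1 denominator)."""
    x, y = np.asarray(x, dtype=float), np.asarray(y, dtype=float)
    n = len(x)
    return ((x - x.mean()) * (y - y.mean())).sum() / (n - 1)

rainfall    = [2.5, 0.5, 2.2, 1.9, 3.1, 2.3, 2.0, 1.3, 1.5, 1.9]
temperature = [2.4, 0.7, 2.9, 2.2, 3.0, 2.7, 1.6, 1.1, 1.6, 0.9]

# np.cov uses the same n-1 denominator by default, so the results agree
print(cov_xy(rainfall, temperature))
```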
The covariance matrix is computed in WEKA, after which the eigenvectors are calculated. An eigenvector is a vector whose direction remains unchanged when a linear transformation is applied to it. The eigenvectors are sorted by decreasing eigenvalue, and the k eigenvectors with the largest eigenvalues are chosen to form a d × k dimensional matrix W. Given below are the results computed in WEKA for the dataset above:
Figure 2: Correlation Matrix and Eigenvectors of weather dataset
From the covariance matrix it can be observed which attributes are positively correlated and which are negatively correlated. For example, see row 1, column 2 (Rainfall and Temperature): as the value, 0.83, is positive, the two attributes are positively correlated (note: this is dummy data). The eigenvectors only define the directions of the new axes, since they all have unit length 1. The eigenvectors with the lowest eigenvalues carry the least information about the distribution of the data, so they are dropped. In the end, the matrix W is used to transform the samples onto the new subspace via the equation y = W′ × x, where W′ is the transpose of the matrix W.
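The eigen-decomposition, sorting, and projection steps described above can be sketched in Python as follows (the 2 × 2 covariance matrix here is a hypothetical example, not the WEKA output):

```python
import numpy as np

# Hypothetical symmetric covariance matrix, for illustration only
C = np.array([[2.0, 0.8],
              [0.8, 1.0]])

eigenvalues, eigenvectors = np.linalg.eigh(C)

# Check the defining property: applying C leaves each eigenvector's direction unchanged
for lam, v in zip(eigenvalues, eigenvectors.T):
    assert np.allclose(C @ v, lam * v)

# Sort by decreasing eigenvalue and keep k = 1 column to form the d x k matrix W
order = np.argsort(eigenvalues)[::-1]
W = eigenvectors[:, order][:, :1]

x = np.array([1.5, -0.5])   # one (already centered) sample
y = W.T @ x                 # y = W' x: the sample projected onto the new 1-D subspace
print(y)
```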
Breast Cancer Coimbra:
After implementing PCA on the dataset above, in this part the Breast Cancer Coimbra dataset is used and principal component analysis is performed on it. This dataset consists of 10 attributes.
Figure 3: Correlation Matrix for Breast Cancer Dataset
It can be observed that the matrix in this case is huge. The diagonal entries are all 1 because this is a correlation matrix: each diagonal entry is the correlation of an attribute with itself (e.g. row 1, column 1 is the correlation of the first attribute with itself), and an attribute is maximally correlated with itself. The off-diagonal elements show how pairs of attributes are related. If a value is positive, the attributes are positively correlated, i.e. increasing one tends to increase the other, and decreasing one tends to decrease the other.
Figure 4: Eigenvectors for Breast Cancer Dataset
If a value is negative, the attributes are negatively correlated, i.e. increasing one tends to decrease the other and vice versa. The correlation matrix is symmetric, i.e. its transpose is equal to the original matrix.
Figure 5: Ranked attributes for Breast Cancer Dataset
Chronic Kidney Disease:
This dataset has 24 attributes: 11 numeric and 13 nominal. Given below are the results observed:
Figure 6: Correlation Matrix for Chronic Kidney Dataset
The ranked attributes describe linear combinations of the original features and how much each feature contributes to them. This helps identify the directions of greatest variance, which is what lets us discriminate between classes. By keeping the top few ranked attributes, we retain the components that carry the most useful information. This simplifies the problem, since some features/attributes are removed, and the ranking reflects the relative importance of the remaining features.
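One common way to decide how many ranked components to keep is the cumulative explained-variance ratio of the eigenvalues; a minimal sketch with hypothetical eigenvalues (not the WEKA output) follows:

```python
import numpy as np

# Hypothetical eigenvalues, already sorted in decreasing order
eigenvalues = np.array([4.8, 2.1, 0.7, 0.3, 0.1])

# Proportion of the total variance each component explains
ratios = eigenvalues / eigenvalues.sum()
cumulative = np.cumsum(ratios)

# Keep just enough components to cover, say, 90% of the variance
k = int(np.argmax(cumulative >= 0.9)) + 1
print(k, cumulative)
```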
Figure 7: Eigenvectors for Chronic Kidney disease
Figure 8: Ranked attributes for Chronic Kidney Dataset
Conclusion:
The goal here is to reduce the dimensionality of our feature space, i.e. to project the feature space via PCA onto a smaller subspace in which the eigenvectors form the axes. So if you have a small dataset, as I used at the start, there is no real need for PCA. But in the case of the Chronic Kidney Disease and Breast Cancer Coimbra datasets, there are many features, computation becomes slower and more complicated, and so it becomes necessary to identify the features that are most important.
Question 2:
Critique of Research Paper: Predicting Student's Academic Performances - A Learning Analytics Approach using Multiple Linear Regression
The research paper "Predicting Student's Academic Performances - A Learning Analytics Approach using Multiple Linear Regression", written by Oyerinde O. D. and Chia P. A., was published in the International Journal of Computer Applications (0975-8887), Volume 157, No. 4, January 2017.
This research is meant to predict students' academic performances (SAP) using regression analysis. A multiple linear regression model is implemented for the prediction of academic performance. Multiple linear regression can be defined as the relationship between a dependent variable and two or more independent variables. Patterns and relationships are found by correlating the independent variables with the dependent variable.
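Such a model is typically fitted by least squares. As a minimal sketch (the student scores below are hypothetical, since the paper's actual data is not reproduced here):

```python
import numpy as np

# Hypothetical scores: two independent variables (math, programming)
# and one dependent variable (overall CS performance)
X = np.array([[78, 82],
              [65, 70],
              [90, 88],
              [55, 60],
              [72, 75],
              [85, 80]], dtype=float)
y = np.array([80, 66, 91, 58, 74, 84], dtype=float)

# Add an intercept column and solve the least-squares problem
X1 = np.column_stack([np.ones(len(X)), X])
coef, *_ = np.linalg.lstsq(X1, y, rcond=None)

intercept, b_math, b_prog = coef
predicted = X1 @ coef   # the model's fitted values for each student
print(intercept, b_math, b_prog)
```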
The abstract of this research paper is well written. It captures the interest of a potential reader, makes a clear statement of the topic and the research area, and gives a concise summary of the paper. The introduction provides a brief overview of the current state of research in machine learning and data mining. The research paper was able to achieve its objective of building a model to predict students' academic performance. The analysis showed that students who perform well in mathematics courses have better chances of excelling in other Computer Science courses. A few problems were faced while implementing the model: the availability of proper and authentic data for the analysis, the data not being readily accessible, and inconsistencies in determining which student attributes contribute to academic performance. The conclusion is concise and well written. It summarizes the key findings and further implications for the field, and states the benefits and future areas of research concisely.
There is a detailed analysis of the results, properly summarized in the form of tables, figures and flowcharts. There are no grammatical or spelling errors in the paper. The flowcharts, tables and figures give a good understanding of the methodology, and the mathematical equations give a clear understanding of the statistical model for regression. Referencing is done properly, including in-text referencing. The paper length is fine, i.e. 6 pages, where 5-6 pages is considered good. This research paper is well written, covering all the main components required of a research paper. It makes a clear statement of the topic and the research area. The methodology is explained in detail through figures, flowcharts, mathematical equations and tables.
There are some recommendations, based on the limitations of the research work, for future research in this endeavor. Firstly, as the research depends on large amounts of reliable and authentic data, there should be a proper data warehouse for this purpose. Secondly, there should be proper intervention programs that bring students and educators into close mediation; a face-to-face discussion can spark a student's desire to do more. I think both of these points are important for this research. In machine learning tasks, the accuracy of a model depends mainly on the data provided, so accurate and authentic data is required for good prediction. The data should not be limited or scarce, because too little data reduces accuracy and causes overfitting of the model.
Prediction of students' academic performance benefits both course instructors and the students whose performance is lagging in class. The future direction of this research is to provide a generalized method for data collection, based on identified metrics, parameters and indicators; a proper data warehouse should be built for this purpose. By using machine learning, statistical and data mining tools and methods, the academic performance of students can be predicted. This will help both students and education providers, enabling proper guidance of students toward their academic goals and objectives and assisting in adequate placement in tertiary education courses.
References:
[1] Principal Component Analysis. Available from: <http://setosa.io/ev/principal-component-analysis/>. Accessed: September 27, 2019.
[2] The Mathematics behind Principal Component Analysis. Available from: <https://towardsdatascience.com/the-mathematics-behind-principal-component-analysis-fff2d7f4b643>. Accessed: September 28, 2019.