Assignment #2
Osama Al-Shalali
MUN ID: 202295546
COMP 6915: Intro to Machine Learning
Memorial University of Newfoundland
February 22, 2024
1 Choice of Methods
1.1 Simple Regression
I chose Leave-One-Out Cross-Validation (LOOCV) to evaluate the baseline model because of the small dataset size; LOOCV uses every example for testing exactly once, allowing reliable estimation of model performance. A minimal sketch is shown below.
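The following is a minimal sketch of this baseline in Python with scikit-learn. The arrays X and y are random placeholders standing in for the assignment's dataset, which is not reproduced here; everything else uses the standard scikit-learn API.

import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import LeaveOneOut, cross_val_score

# Placeholder data: 30 examples, 4 features on the raw (unscaled) range.
rng = np.random.default_rng(0)
X = rng.uniform(0, 44491, size=(30, 4))
y = rng.normal(size=30)

# LOOCV: each fold trains on n-1 examples and tests on the one held out.
scores = cross_val_score(LinearRegression(), X, y, cv=LeaveOneOut(),
                         scoring="neg_root_mean_squared_error")
rmse = -scores  # scikit-learn negates errors so that higher is better
print(f"Average RMSE: {rmse.mean():.2f} +/- {rmse.std():.2f}")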
1.2 Other Methods
I tested three alternative methods, described below; a combined code sketch of all three follows the list.
• Feature Scaling
I chose standard scaling, which independently centers and scales each feature using the mean and standard deviation computed from the training set, standardizing the data to zero mean and unit variance. WHY? The feature values in the given data span the interval 0 to 44,491, so it is a good idea to scale them: this puts all features on a similar scale and prevents certain features from dominating the learning process [1].
• Polynomial Regression
I chose polynomial regression, a variation of linear regression that models the relationship between an independent variable x and a dependent variable y as a polynomial of degree n. It captures nonlinear relationships by fitting a curve to the data, representing the conditional mean of y at different values of x, and thus offers greater modeling flexibility than a straight line. WHY? Assuming there is some non-linearity in the data set, polynomial regression can capture and represent that non-linear structure [2].
• Feature Selection
As another alternative, I implemented univariate feature selection, a technique that identifies the most important features in a dataset by evaluating each feature's individual relationship with the target variable. It reduces dimensionality and simplifies the modeling process when dealing with a large number of features. WHY? We assumed that the relationship between the target variable y and some individual features is straightforward and can be assessed through basic statistical analysis [3].
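Below is a hedged sketch of how the three alternatives could be evaluated, each wrapped in a scikit-learn Pipeline so that scaling, polynomial expansion, or feature selection is fit only on each LOOCV training fold (avoiding leakage into the held-out example). X and y are the same placeholder arrays as in the baseline sketch, and degree=2 and k=2 are illustrative choices, not the assignment's actual settings.

from sklearn.feature_selection import SelectKBest, f_regression
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import LeaveOneOut, cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PolynomialFeatures, StandardScaler

candidates = {
    # Standardize each feature to zero mean and unit variance first.
    "Regression + Scaling": make_pipeline(StandardScaler(), LinearRegression()),
    # Expand the features to degree-2 polynomial terms to capture curvature.
    "Polynomial Regression": make_pipeline(PolynomialFeatures(degree=2),
                                           LinearRegression()),
    # Keep only the k features with the strongest univariate F-statistic.
    "Feature Selection": make_pipeline(SelectKBest(f_regression, k=2),
                                       LinearRegression()),
}

for name, pipeline in candidates.items():
    rmse = -cross_val_score(pipeline, X, y, cv=LeaveOneOut(),
                            scoring="neg_root_mean_squared_error")
    print(f"{name}: {rmse.mean():.2f} +/- {rmse.std():.2f}")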
2 PyCharm Output
Figure 1: PyCharm screenshot.
3 Data Visualization
Figure 2: Boxplot for different regression models.
4 Average RMSE Values
Table 1: Average RMSE ± standard deviation per model.

Model                    Average RMSE ± standard deviation
Simple Regression        2.17 ± 1.73
Regression + Scaling     1.76 ± 1.65
Polynomial Regression    1.91 ± 1.98
Feature Selection        0.45 ± 0.32
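As a point of reference, each LOOCV fold tests on a single held-out example, so the per-fold RMSE reduces to an absolute error; assuming the figures above are per-fold values averaged over the n folds, the computation is

\[
\mathrm{RMSE}_i = \sqrt{(y_i - \hat{y}_i)^2} = \lvert y_i - \hat{y}_i \rvert,
\qquad
\overline{\mathrm{RMSE}} = \frac{1}{n}\sum_{i=1}^{n} \lvert y_i - \hat{y}_i \rvert .
\]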
5 Conclusion
• Simple Regression: With an average RMSE of 2.17, this baseline provides a moderate level of accuracy in predicting outcomes on unseen testing examples.
• Regression + Scaling: With an average RMSE of 1.76, this approach improves prediction accuracy compared to simple regression.
• Polynomial Regression: With an average RMSE of 1.91, polynomial regression can capture nonlinear relationships; however, its performance differs little from that of simple regression, suggesting the data is largely linear.
• Feature Selection: The average RMSE is by far the lowest (0.45), indicating that feature selection identifies the most relevant predictors (features) and yields highly accurate predictions on testing examples.

In summary, feature selection appears to be the most effective method, yielding the lowest RMSE. However, the trade-off between model complexity and accuracy should be considered when choosing a model for a specific problem.
References
[1] "sklearn.preprocessing.StandardScaler," scikit-learn. Available: https://scikit-learn.org/stable/modules/generated/sklearn.preprocessing.StandardScaler.html. [Accessed: 21 Feb. 2024].
[2] "Implementation of Polynomial Regression," GeeksforGeeks. Available: https://www.geeksforgeeks.org/python-implementation-of-polynomial-regression/. [Accessed: 21 Feb. 2024].
[3] "Eight ways to perform feature selection with scikit-learn," Shedloadofcode. Available: https://www.shedloadofcode.com/blog/eight-ways-to-perform-feature-selection-with-scikit-learn. [Accessed: 18 Feb. 2024].