Assignment #2
Osama Al-Shalali
MUN ID: 202295546
COMP 6915: Intro to Machine Learning
Memorial University of Newfoundland
February 22, 2024

1 Choice of Methods

1.1 Simple Regression

I chose Leave-One-Out Cross-Validation (LOOCV) to evaluate the baseline simple regression model. Because the dataset is small, LOOCV makes full use of the available examples and gives a reliable estimate of model performance.

1.2 Other Methods

I tested three alternative methods, as follows (illustrative code sketches for the baseline and for each alternative are given in the Code Sketches section after the Conclusion):

• Feature Scaling. I chose standard scaling, which independently centers and scales each feature using the mean and standard deviation computed from the training set. It standardizes the data to have zero mean and unit variance. WHY? The feature values in the given data range over the interval (0, 44491), so it is a good idea to scale them. Scaling ensures the features are on a similar scale and prevents certain features from dominating the learning process [1].

• Polynomial Regression. I chose polynomial regression, a variation of linear regression that models the relationship between an independent variable x and a dependent variable y as a polynomial of degree n. It captures nonlinear relationships by fitting a curve to the data, representing the conditional mean of y for different values of x, and therefore provides greater flexibility than a straight-line fit. WHY? Assuming there is some non-linearity in the dataset, polynomial regression is useful because it can capture and represent that non-linear structure [2].

• Feature Selection. As another alternative, I implemented univariate feature selection, a technique that identifies the most important features in a dataset by evaluating each feature's individual relationship with the target variable. It helps reduce dimensionality and simplify the modeling process when dealing with a large number of features. WHY? We assumed that the relationship between the target variable y and some individual features is straightforward and can be assessed through basic statistical tests [3].

2 PyCharm Output

Figure 1: PyCharm screenshot.

3 Data Visualization

Figure 2: Boxplots of RMSE for the different regression models.

4 Average RMSE Values

Table 1: Average RMSE ± standard deviation for each model.

Model                   Average RMSE ± standard deviation
Simple Regression       2.17 ± 1.73
Regression + Scaling    1.76 ± 1.65
Polynomial Regression   1.91 ± 1.98
Feature Selection       0.45 ± 0.32

5 Conclusion

• Simple Regression: The average RMSE is 2.17; this baseline method provides a moderate level of accuracy in predicting outcomes on unseen testing examples.

• Regression + Scaling: The average RMSE is 1.76; this approach improves prediction accuracy compared to simple regression.

• Polynomial Regression: The average RMSE is 1.91. Polynomial regression can capture nonlinear relationships; however, its performance is not very different from that of simple regression, due to the high linearity of the data.

• Feature Selection: The average RMSE is by far the lowest (0.45), which indicates that feature selection identifies the most relevant predictors (features), resulting in highly accurate predictions on testing examples.

In summary, feature selection appears to be the most effective method, yielding the lowest RMSE. However, we should consider the trade-off between model complexity and accuracy when choosing a model for a specific problem.
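Code Sketches

The sketches below illustrate how each evaluated method could be implemented with scikit-learn. They are minimal illustrations rather than the exact assignment code: the data is a hypothetical stand-in (the real feature matrix and target would be loaded in its place), and the names X, y, and loo_rmse are introduced purely for illustration.

First, the baseline: linear regression evaluated with LOOCV, reporting the per-fold RMSE as mean ± standard deviation, the format used in Table 1.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import LeaveOneOut, cross_val_score

# Hypothetical stand-in data; the assignment's dataset would be loaded here.
rng = np.random.default_rng(0)
X = rng.uniform(0, 44491, size=(60, 5))               # wide-range raw features
y = X[:, 0] * 1e-4 + X[:, 1] * 5e-5 + rng.normal(0, 1, size=60)

def loo_rmse(model, X, y):
    """Per-fold RMSE under LOOCV (each fold holds out one example)."""
    neg_mse = cross_val_score(model, X, y, cv=LeaveOneOut(),
                              scoring="neg_mean_squared_error")
    return np.sqrt(-neg_mse)

scores = loo_rmse(LinearRegression(), X, y)
print(f"Simple regression: {scores.mean():.2f} +/- {scores.std():.2f}")
```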
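Feature scaling fits naturally as a pipeline step: placing StandardScaler inside the pipeline means the mean and standard deviation are recomputed from the training folds of each LOOCV split, so no information leaks from the held-out example. This sketch reuses X, y, and loo_rmse from the baseline sketch.

```python
from sklearn.linear_model import LinearRegression
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# The scaler is fit on the training fold only, then applied to the held-out
# example, so each feature enters the regression with zero mean, unit variance.
scaled_model = make_pipeline(StandardScaler(), LinearRegression())
scores = loo_rmse(scaled_model, X, y)   # helper from the baseline sketch
print(f"Regression + scaling: {scores.mean():.2f} +/- {scores.std():.2f}")
```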
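Polynomial regression can be sketched by expanding the raw features with PolynomialFeatures before the linear fit. The degree (2 here) is an assumed choice, since the report does not state the degree that was used; X, y, and loo_rmse again come from the baseline sketch.

```python
from sklearn.linear_model import LinearRegression
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PolynomialFeatures

# Degree-2 expansion adds squared and pairwise interaction terms, letting the
# otherwise-linear model fit a curved conditional mean. degree=2 is an
# assumed value for illustration.
poly_model = make_pipeline(PolynomialFeatures(degree=2, include_bias=False),
                           LinearRegression())
scores = loo_rmse(poly_model, X, y)     # helper from the baseline sketch
print(f"Polynomial regression: {scores.mean():.2f} +/- {scores.std():.2f}")
```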
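Univariate feature selection can be sketched with SelectKBest and the univariate F-test (f_regression), which scores each feature against the target independently and keeps the k highest-scoring ones. The value k=2 is an assumed illustration, not necessarily the value used in the assignment; X, y, and loo_rmse again come from the baseline sketch.

```python
from sklearn.feature_selection import SelectKBest, f_regression
from sklearn.linear_model import LinearRegression
from sklearn.pipeline import make_pipeline

# Each feature is scored against y with a univariate F-test; only the k best
# survive to the regression step. k=2 is an assumed value for illustration.
select_model = make_pipeline(SelectKBest(score_func=f_regression, k=2),
                             LinearRegression())
scores = loo_rmse(select_model, X, y)   # helper from the baseline sketch
print(f"Feature selection: {scores.mean():.2f} +/- {scores.std():.2f}")
```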
References

[1] "sklearn.preprocessing.StandardScaler," scikit-learn. Available: https://scikit-learn.org/stable/modules/generated/sklearn.preprocessing.StandardScaler.html. [Accessed: 21 Feb. 2024].

[2] "Implementation of Polynomial Regression," GeeksforGeeks. Available: https://www.geeksforgeeks.org/python-implementation-of-polynomial-regression/. [Accessed: 21 Feb. 2024].

[3] "Eight ways to perform feature selection with scikit-learn," Shedloadofcode. Available: https://www.shedloadofcode.com/blog/eight-ways-to-perform-feature-selection-with-scikit-learn. [Accessed: 18 Feb. 2024].