MEMORANDUM

DATE: April 9, 2023
TO: Mitch Cochran, MA, MHS, CISM, CGCIO
FROM:
- Trieu Hoang Hiep: 11191914
- Le Chi Thanh: 11194705
SUBJECT: Midterm assignment

Our project will build machine learning models on the Deloitte ML Challenge dataset. The goal of a predictive machine learning project is to develop a model that can accurately predict outcomes from input data. This is achieved by training the model on a dataset of input features and corresponding output labels, so that it learns the underlying patterns and relationships in the data and can make accurate predictions on new, unseen data.

Beyond predictive accuracy, the project also aims to produce a model that yields actionable insights for solving real-world problems. For example, in a business context, a model that accurately predicts customer churn lets a company take specific actions to retain customers and improve customer satisfaction.

The data for the Deloitte Machine Learning Challenge is typically provided by Deloitte. The specific dataset and problem statement for each year's challenge are announced when the competition is launched. The data is usually made available to all registered participants through a secure online portal, where they can download it and use it to develop and train their machine learning models.

We will use techniques such as:
- Data preprocessing: This involves cleaning, transforming, and manipulating the raw data to prepare it for analysis. Techniques such as data cleaning, feature engineering, and normalization can be used to preprocess the data.
- Exploratory data analysis (EDA): EDA involves visualizing and summarizing the main characteristics of the dataset to gain insights and identify patterns. Techniques such as scatter plots, histograms, and box plots can be used for EDA.
- Feature selection: This involves selecting the most relevant features from the dataset to use in the machine learning model. Techniques such as correlation analysis and recursive feature elimination (RFE) can be used for feature selection, and principal component analysis (PCA) can be used to reduce dimensionality.
- Machine learning algorithms: This involves applying various machine learning algorithms, such as linear regression, decision trees, random forests, and neural networks, to the dataset to develop a predictive model.
- Model evaluation: This involves assessing the performance of the machine learning model using metrics such as accuracy, precision, recall, and F1 score.

We will also use techniques to validate our results, such as:
- Cross-validation: Cross-validation evaluates a machine learning model by partitioning the dataset into multiple subsets, training the model on all but one subset, testing it on the remaining subset, and repeating so that each subset serves once as the test set. This helps to detect overfitting to the training data and to check that the model generalizes well to new data.
- Hold-out validation: Hold-out validation involves splitting the dataset into a training set and a validation set. The model is trained on the training set and evaluated on the validation set. This helps to assess the model's performance on data that it has not seen before.
- Feature importance: Feature importance analysis can be used to identify the features in the dataset that contribute the most to the model's predictions.
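
The validation and evaluation steps listed above can be sketched in a few lines of plain Python. This is only a minimal illustration under stated assumptions: it assumes a binary classification target, it uses made-up toy data rather than the Deloitte dataset, and the names `holdout_split`, `kfold_splits`, `classification_metrics`, `fit_threshold`, and `cross_validate` are hypothetical helpers introduced here, not part of any library or of the final project code.

```python
import random


def holdout_split(n_samples, val_fraction=0.2, seed=0):
    """Shuffle the row indices once and split them into train/validation lists."""
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)
    n_val = int(round(n_samples * val_fraction))
    return indices[n_val:], indices[:n_val]


def kfold_splits(n_samples, k=5, seed=0):
    """Yield (train_indices, test_indices) so each row is tested exactly once."""
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)
    folds = [indices[i::k] for i in range(k)]
    for i in range(k):
        train = [idx for j, fold in enumerate(folds) if j != i for idx in fold]
        yield train, folds[i]


def classification_metrics(y_true, y_pred, positive=1):
    """Compute accuracy, precision, recall, and F1 for a binary target."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == positive and p == positive)
    fp = sum(1 for t, p in pairs if t != positive and p == positive)
    fn = sum(1 for t, p in pairs if t == positive and p != positive)
    accuracy = sum(1 for t, p in pairs if t == p) / len(pairs)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return {"accuracy": accuracy, "precision": precision, "recall": recall, "f1": f1}


def fit_threshold(xs, ys):
    """Toy model (an assumption for illustration): midpoint of the two class means."""
    pos = [x for x, y in zip(xs, ys) if y == 1]
    neg = [x for x, y in zip(xs, ys) if y == 0]
    return (sum(pos) / len(pos) + sum(neg) / len(neg)) / 2


def cross_validate(xs, ys, k=5, seed=0):
    """Train on k-1 folds, test on the remaining fold, and collect accuracies."""
    scores = []
    for train, test in kfold_splits(len(xs), k=k, seed=seed):
        threshold = fit_threshold([xs[i] for i in train], [ys[i] for i in train])
        preds = [1 if xs[i] > threshold else 0 for i in test]
        truth = [ys[i] for i in test]
        scores.append(classification_metrics(truth, preds)["accuracy"])
    return scores


# Toy data: one numeric feature, label is 1 when the feature is at least 10.
xs = list(range(20))
ys = [1 if x >= 10 else 0 for x in xs]

train_idx, val_idx = holdout_split(len(xs), val_fraction=0.2)
scores = cross_validate(xs, ys, k=5)
print("hold-out sizes:", len(train_idx), len(val_idx))
print("mean cross-validation accuracy:", sum(scores) / len(scores))
```

In the actual project, an established library would probably replace these hand-written helpers; they are written out here only to show what each validation step does to the data.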