**CODE GUIDE**

The original file is located at https://colab.research.google.com/drive/1VauCqz60lgTltWIsZItEFi6WyOKDd-QA

1. **File Header**

Explanation:
- The first line `# -*- coding: utf-8 -*-` specifies the encoding used in the file, ensuring that any non-ASCII characters are handled properly.
- The docstring below it indicates that the script was automatically generated by Google Colab and provides a link to the original file location.

2. **Importing Libraries**

```python
import pandas as pd
import numpy as np
import random
from sklearn.model_selection import train_test_split, GridSearchCV
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score, confusion_matrix, classification_report
import matplotlib.pyplot as plt
import seaborn as sns
import gdown
```

Explanation:
- **pandas (`pd`)**: For data manipulation and analysis, especially for handling tabular data.
- **numpy (`np`)**: Provides support for large arrays and matrices, along with mathematical functions.
- **random**: A module for generating random numbers.
- **scikit-learn (`sklearn`)**: Used for machine learning tasks.
  - **train_test_split**: Splits the dataset into training and testing sets.
  - **GridSearchCV**: Performs hyperparameter tuning.
  - **TfidfVectorizer**: Converts text data into numerical features using the TF-IDF method.
  - **RandomForestClassifier**: A machine learning model based on the Random Forest algorithm.
  - **accuracy_score, precision_score, recall_score, f1_score**: Performance metrics for classification.
  - **confusion_matrix, classification_report**: Tools for evaluating model performance.
- **matplotlib.pyplot (`plt`)**: A plotting library for creating static, animated, and interactive visualizations in Python.
- **seaborn (`sns`)**: A library for making statistical graphics, built on top of matplotlib.
- **gdown**: A utility for downloading files from Google Drive.

3. **Loading the Dataset**

```python
url = 'https://drive.google.com/uc?id=13iK_cpP-GHGqcsqcCdwKh-pEbuvRITJD'
gdown.download(url, 'job_recommendation_dataset.csv', quiet=False)
data = pd.read_csv('job_recommendation_dataset.csv')
```

Explanation:
- A Google Drive URL is used to download the dataset (`job_recommendation_dataset.csv`).
- `gdown.download(url, 'job_recommendation_dataset.csv', quiet=False)` downloads the file and saves it locally.
- The dataset is loaded into a pandas DataFrame `data` using `pd.read_csv()`.

4. **Exploratory Data Analysis (EDA)**

```python
# Distribution of Recommended vs Not Recommended
plt.figure(figsize=(8, 6))
sns.countplot(x='Recommended', data=data)
plt.title('Distribution of Recommended vs Not Recommended')
plt.savefig('recommendation_distribution.png')
```

Explanation:
- **Count Plot**: `sns.countplot(x='Recommended', data=data)` creates a bar plot showing the distribution of the `Recommended` column (how many jobs were recommended vs. not recommended).
- **Plot Settings**: `plt.figure(figsize=(8, 6))` sets the figure size, `plt.title('Distribution of Recommended vs Not Recommended')` adds a title, and `plt.savefig('recommendation_distribution.png')` saves the plot as an image file.

```python
# Distribution of Match Scores
plt.figure(figsize=(8, 6))
sns.histplot(data['Match_Score'], bins=20, kde=True)
plt.title('Distribution of Match Scores')
plt.savefig('match_score_distribution.png')
```

Explanation:
- **Histogram**: `sns.histplot(data['Match_Score'], bins=20, kde=True)` creates a histogram showing the distribution of `Match_Score` values, with a kernel density estimate (KDE) overlay.
- **Plot Settings**: The figure size, title, and saving functionality are the same as in the previous plot.
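Alongside these plots, it can help to sanity-check the raw columns with quick text summaries. The snippet below is a minimal optional sketch, not part of the original script; it assumes the loading cell above has been run and uses only the column names already shown (`Recommended`, `Match_Score`, `Job_ID`).

```python
# Optional text-based sanity checks to complement the EDA plots
# (assumes `data` was loaded as above and uses the same column names)
print(data.shape)                           # number of rows and columns
print(data.head())                          # first few records
print(data['Recommended'].value_counts())   # class balance of the target
print(data['Match_Score'].describe())       # summary statistics of match scores
print(data['Job_ID'].nunique(), 'unique job IDs')
```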
```python
# Bar Chart for Top 10 Job IDs
plt.figure(figsize=(10, 8))
top_jobs = data['Job_ID'].value_counts().head(10)
sns.barplot(x=top_jobs.index, y=top_jobs.values, palette='viridis')
plt.title('Top 10 Job IDs')
plt.xlabel('Job ID')
plt.ylabel('Frequency')
plt.savefig('top_10_jobs.png')
```

Explanation:
- **Bar Plot**: `data['Job_ID'].value_counts().head(10)` gets the ten most frequent job IDs, and `sns.barplot(x=top_jobs.index, y=top_jobs.values, palette='viridis')` plots them as a bar chart.
- **Plot Settings**: The plot is customized with axis labels and a title, then saved to a file.

```python
# Pie Chart for Job Recommendations
plt.figure(figsize=(8, 8))
data['Recommended'].value_counts().plot.pie(autopct='%1.1f%%', startangle=90, colors=['#ff9999', '#66b3ff'])
plt.title('Job Recommendations')
plt.savefig('job_recommendations_pie.png')
```

Explanation:
- **Pie Chart**: `data['Recommended'].value_counts().plot.pie(...)` creates a pie chart showing the proportion of recommended vs. not recommended jobs.
- **Plot Settings**: `autopct='%1.1f%%'` adds percentage labels to the slices, and `startangle=90` starts the pie chart at 90 degrees. The plot is customized and saved as an image file.

5. **Preprocessing**

```python
vectorizer = TfidfVectorizer()
X = vectorizer.fit_transform(data['User_Skills'] + ' ' + data['Job_Requirements'])
y = data['Recommended']
```

Explanation:
- **TF-IDF Vectorization**: `TfidfVectorizer()` initializes the TF-IDF vectorizer, which converts text data into a matrix of TF-IDF features. `vectorizer.fit_transform(...)` applies the vectorizer to the concatenated `User_Skills` and `Job_Requirements` columns, turning them into numerical features (`X`).
- **Target Variable**: `y = data['Recommended']` stores the `Recommended` column as the target variable.

6. **Train-Test Split**

```python
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
```

Explanation:
- The dataset is split into training and testing sets with `train_test_split(X, y, test_size=0.2, random_state=42)`:
  - `test_size=0.2` reserves 20% of the data for testing.
  - `random_state=42` ensures reproducibility.

7. **Model Training with Hyperparameter Tuning**

```python
rf_model = RandomForestClassifier(random_state=42)
param_grid = {
    'n_estimators': [100, 200],
    'max_depth': [10, 20, None],
    'min_samples_split': [2, 5, 10],
    'min_samples_leaf': [1, 2, 4]
}
grid_search = GridSearchCV(estimator=rf_model, param_grid=param_grid, cv=3, n_jobs=-1, verbose=2)
grid_search.fit(X_train, y_train)
best_rf_model = grid_search.best_estimator_
```

Explanation:
- **Random Forest Model**: `rf_model = RandomForestClassifier(random_state=42)` initializes a Random Forest classifier.
- **Hyperparameter Tuning**:
  - `param_grid = {...}` defines the grid of hyperparameters to search over.
  - `GridSearchCV(...)` performs a cross-validated grid search (3 folds) to find the best hyperparameters.
  - `grid_search.fit(X_train, y_train)` trains the model with every hyperparameter combination on the training data.
  - `best_rf_model = grid_search.best_estimator_` stores the best model found.

8. **Predictions**

```python
y_pred = best_rf_model.predict(X_test)
```

Explanation:
- `y_pred = best_rf_model.predict(X_test)` generates predictions on the test set using the best model.
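Before moving on to the test-set metrics, it can be useful to inspect what the grid search actually selected. The snippet below is a small sketch that relies only on standard `GridSearchCV` attributes (`best_params_`, `best_score_`, `cv_results_`) and the already-imported pandas; it is not part of the original script.

```python
# Inspect the grid search outcome (optional)
print('Best hyperparameters:', grid_search.best_params_)
# With the default scoring, best_score_ is the mean cross-validated accuracy
print(f'Best cross-validated score: {grid_search.best_score_:.3f}')

# cv_results_ holds per-combination scores, handy for comparing settings
cv_results = pd.DataFrame(grid_search.cv_results_)
print(cv_results[['params', 'mean_test_score', 'std_test_score']]
      .sort_values('mean_test_score', ascending=False)
      .head())
```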
9. **Performance Metrics**

```python
accuracy = accuracy_score(y_test, y_pred)
precision = precision_score(y_test, y_pred)
recall = recall_score(y_test, y_pred)
f1 = f1_score(y_test, y_pred)
conf_matrix = confusion_matrix(y_test, y_pred)
class_report = classification_report(y_test, y_pred)
```

Explanation:
- Several metrics are calculated to evaluate model performance:
  - `accuracy_score`, `precision_score`, `recall_score`, and `f1_score` measure the model's accuracy, precision, recall, and F1 score, respectively (using their default binary-classification settings).
  - `confusion_matrix` calculates the confusion matrix.
  - `classification_report` generates a detailed classification report.

10. **Print Metrics**

```python
print(f'Accuracy: {accuracy:.2f}')
print(f'Precision: {precision:.2f}')
print(f'Recall: {recall:.2f}')
print(f'F1 Score: {f1:.2f}')
print('\nClassification Report:')
print(class_report)
```

Explanation:
- The calculated metrics are printed in a readable format.

11. **Confusion Matrix Plot**

```python
plt.figure(figsize=(8, 6))
sns.heatmap(conf_matrix, annot=True, fmt='d', cmap='Blues',
            xticklabels=['Not Recommended', 'Recommended'],
            yticklabels=['Not Recommended', 'Recommended'])
plt.xlabel('Predicted')
plt.ylabel('True')
plt.title('Confusion Matrix')
plt.savefig('confusion_matrix.png')
```

Explanation:
- **Heatmap**: `sns.heatmap(conf_matrix, annot=True, fmt='d', cmap='Blues', ...)` creates a heatmap of the confusion matrix; `annot=True` writes the counts inside the cells.
- **Plot Settings**: Axis labels, the title, and saving to a file are set in the same way as in the previous plots.

12. **Save Metrics Table**

```python
metrics = pd.DataFrame({
    'Metric': ['Accuracy', 'Precision', 'Recall', 'F1 Score'],
    'Score': [accuracy, precision, recall, f1]
})
metrics.to_csv('performance_metrics.csv', index=False)
```

Explanation:
- **Save Metrics**: A DataFrame `metrics` is created to store the calculated metrics, and `metrics.to_csv('performance_metrics.csv', index=False)` saves it to a CSV file without including the index column.
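To close the loop, the fitted vectorizer and tuned model can be reused to score a new user-job pair. The snippet below is a minimal sketch under the same setup as above; the skill and requirement strings are made-up illustrative examples, not rows from the dataset.

```python
# Sketch: scoring a new user-job pair with the objects trained above.
# The text values are hypothetical examples for illustration only.
new_user_skills = 'python machine learning sql'       # hypothetical candidate skills
new_job_requirements = 'python sql data analysis'     # hypothetical job requirements

# Reuse the fitted TF-IDF vectorizer: transform (not fit_transform) keeps the learned vocabulary
new_features = vectorizer.transform([new_user_skills + ' ' + new_job_requirements])

prediction = best_rf_model.predict(new_features)[0]
probabilities = best_rf_model.predict_proba(new_features)[0]

print('Predicted label:', prediction)
print('Class probabilities:', dict(zip(best_rf_model.classes_, probabilities)))
```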