
CODE GUIDE

1. **File Header**

Original file is located at
https://colab.research.google.com/drive/1VauCqz60lgTltWIsZItEFi6WyOKDd-QA

Explanation:
- The first line `# -*- coding: utf-8 -*-` specifies the encoding used in the file, ensuring that any non-ASCII characters are handled properly.
- The docstring below it indicates that this script was automatically generated by Google Colab and provides a link to the original file location.
2. **Importing Libraries**

import pandas as pd
import numpy as np
import random
from sklearn.model_selection import train_test_split, GridSearchCV
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score, confusion_matrix, classification_report
import matplotlib.pyplot as plt
import seaborn as sns
import gdown
Explanation:

- **pandas (`pd`)**: For data manipulation and analysis, especially for handling tabular data.
- **numpy (`np`)**: Provides support for large arrays and matrices, along with mathematical functions.
- **random**: A module for generating random numbers.
- **scikit-learn (`sklearn`)**: Used for machine learning tasks.
  - **train_test_split**: Splits the dataset into training and testing sets.
  - **GridSearchCV**: Performs hyperparameter tuning.
  - **TfidfVectorizer**: Converts text data into numerical features using the TF-IDF method.
  - **RandomForestClassifier**: A machine learning model based on the Random Forest algorithm.
  - **accuracy_score, precision_score, recall_score, f1_score**: Performance metrics for classification.
  - **confusion_matrix, classification_report**: Tools for evaluating model performance.
- **matplotlib.pyplot (`plt`)**: A plotting library for creating static, animated, and interactive visualizations in Python.
- **seaborn (`sns`)**: A library for making statistical graphics, built on top of matplotlib.
- **gdown**: A utility to download files from Google Drive.
3. **Loading the Dataset**
url = 'https://drive.google.com/uc?id=13iK_cpP-GHGqcsqcCdwKh-pEbuvRITJD'
gdown.download(url, 'job_recommendation_dataset.csv', quiet=False)
data = pd.read_csv('job_recommendation_dataset.csv')
Explanation:

- A Google Drive URL is used to download the dataset (`job_recommendation_dataset.csv`).
- `gdown.download(url, 'job_recommendation_dataset.csv', quiet=False)` downloads the file and saves it locally.
- The dataset is loaded into a pandas DataFrame `data` using `pd.read_csv()` (a quick inspection sketch follows).
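Before moving on to EDA, it can help to confirm the file downloaded and parsed correctly. The following is a minimal inspection sketch, not part of the original notebook; it assumes the columns referenced later in this guide (`Recommended`, `Match_Score`, `Job_ID`, `User_Skills`, `Job_Requirements`) are present in the CSV.

# Optional sanity check on the loaded DataFrame (not in the original script)
print(data.shape)                            # number of rows and columns
print(data.head())                           # first few records
print(data['Recommended'].value_counts())    # class balance of the target column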
4. **Exploratory Data Analysis (EDA)**
# Distribution of Recommended vs Not Recommended
plt.figure(figsize=(8, 6))
sns.countplot(x='Recommended', data=data)
plt.title('Distribution of Recommended vs Not Recommended')
plt.savefig('recommendation_distribution.png')
Explanation:
Count Plot:
- `sns.countplot(x='Recommended', data=data)` creates a bar plot showing the distribution of the `Recommended` column (how many jobs were recommended vs. not recommended).
Plot Settings:
- `plt.figure(figsize=(8, 6))` sets the figure size.
- `plt.title('Distribution of Recommended vs Not Recommended')` adds a title.
- `plt.savefig('recommendation_distribution.png')` saves the plot as an image file.
# Distribution of Match Scores
plt.figure(figsize=(8, 6))
sns.histplot(data['Match_Score'], bins=20, kde=True)
plt.title('Distribution of Match Scores')
plt.savefig('match_score_distribution.png')
Explanation:
Histogram:
- `sns.histplot(data['Match_Score'], bins=20, kde=True)` creates a histogram showing the distribution of `Match_Score` values, with a kernel density estimate (KDE) overlay.
Plot Settings:
- The figure size, title, and saving functionality are similar to the previous plot.
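The histogram can be paired with a numeric summary of the same column. This one-liner is an optional sketch, assuming `Match_Score` is a numeric column.

# Numeric summary to complement the histogram (not in the original script)
print(data['Match_Score'].describe())   # count, mean, std, min, quartiles, max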
# Bar Chart for Top 10 Job IDs
plt.figure(figsize=(10, 8))
top_jobs = data['Job_ID'].value_counts().head(10)
sns.barplot(x=top_jobs.index, y=top_jobs.values, palette='viridis')
plt.title('Top 10 Job IDs')
plt.xlabel('Job ID')
plt.ylabel('Frequency')
plt.savefig('top_10_jobs.png')
Explanation:
Bar Plot:
- `data['Job_ID'].value_counts().head(10)` gets the top 10 most frequent job IDs.
- `sns.barplot(x=top_jobs.index, y=top_jobs.values, palette='viridis')` creates a bar chart showing these top 10 job IDs.
Plot Settings:
- The plot is customized with axis labels and a title, and saved to a file.


# Pie Chart for Job Recommendations
plt.figure(figsize=(8, 8))
data['Recommended'].value_counts().plot.pie(autopct='%1.1f%%', startangle=90,
                                            colors=['#ff9999', '#66b3ff'])
plt.title('Job Recommendations')
plt.savefig('job_recommendations_pie.png')
Explanation:
Pie Chart:
- `data['Recommended'].value_counts().plot.pie(...)` creates a pie chart showing the proportion of recommended vs. not recommended jobs.
Plot Settings:
- `autopct='%1.1f%%'` adds percentage labels to the slices.
- `startangle=90` starts the pie chart at 90 degrees.
- The plot is customized and saved as an image file.
5. **Preprocessing**
vectorizer = TfidfVectorizer()
X = vectorizer.fit_transform(data['User_Skills'] + ' ' + data['Job_Requirements'])
y = data['Recommended']
Explanation:
TF-IDF Vectorization:
- `TfidfVectorizer()` initializes the TF-IDF vectorizer, which converts text data into a matrix of TF-IDF features.
- `vectorizer.fit_transform(...)` applies the vectorizer to the concatenated `User_Skills` and `Job_Requirements` columns, turning them into numerical features (`X`).
Target Variable:
- `y = data['Recommended']` stores the `Recommended` column as the target variable.
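To see what the vectorizer produced, a short check like the following can be useful. This is an optional sketch, not part of the original script; `get_feature_names_out()` is available in scikit-learn 1.0 and later.

# Optional: inspect the TF-IDF feature matrix (not in the original script)
print(X.shape)                                   # (number of rows, size of learned vocabulary)
print(vectorizer.get_feature_names_out()[:10])   # a few terms from the learned vocabulary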
6. **Train-Test Split**
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
Explanation:

- The dataset is split into training and testing sets.
- `train_test_split(X, y, test_size=0.2, random_state=42)`:
  - `test_size=0.2` reserves 20% of the data for testing.
  - `random_state=42` ensures reproducibility.
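The resulting shapes can be confirmed directly, and if the `Recommended` classes turn out to be imbalanced, a stratified split keeps the class ratio the same in both sets. The `stratify` argument is a standard `train_test_split` option but is not used in the original script; treat the second call below as an optional variation rather than a required step.

# Confirm the 80/20 split sizes (not in the original script)
print(X_train.shape, X_test.shape)

# Optional variation: stratify on y if the classes are imbalanced
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y
)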
7. **Model Training with Hyperparameter Tuning**

rf_model = RandomForestClassifier(random_state=42)
param_grid = {
    'n_estimators': [100, 200],
    'max_depth': [10, 20, None],
    'min_samples_split': [2, 5, 10],
    'min_samples_leaf': [1, 2, 4]
}
grid_search = GridSearchCV(estimator=rf_model, param_grid=param_grid, cv=3,
                           n_jobs=-1, verbose=2)
grid_search.fit(X_train, y_train)
best_rf_model = grid_search.best_estimator_
Explanation:
Random Forest Model:
- `rf_model = RandomForestClassifier(random_state=42)` initializes a Random Forest classifier.
Hyperparameter Tuning:
- `param_grid = {...}` defines the grid of hyperparameters to search over.
- `GridSearchCV(...)` performs a cross-validated grid search to find the best hyperparameters.
- `grid_search.fit(X_train, y_train)` trains the model with different hyperparameter combinations on the training data.
- `best_rf_model = grid_search.best_estimator_` stores the best model found (its winning settings can be inspected as sketched below).
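After the search finishes, the chosen hyperparameters and their cross-validated score are available on the fitted `GridSearchCV` object. This short check is an optional addition, not part of the original script.

# Optional: inspect the grid-search result (not in the original script)
print(grid_search.best_params_)   # winning hyperparameter combination
print(grid_search.best_score_)    # mean cross-validated score of that combination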
8. **Predictions**
y_pred = best_rf_model.predict(X_test)
Explanation:
- `y_pred = best_rf_model.predict(X_test)` generates predictions on the test set using the best model (class probabilities are also available, as sketched below).
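If ranked recommendations or confidence scores are needed rather than hard labels, the fitted forest also exposes class probabilities. A small optional sketch, assuming the positive ("Recommended") class is the second entry of `best_rf_model.classes_`:

# Optional: class probabilities instead of hard labels (not in the original script)
print(best_rf_model.classes_)                          # check the class ordering first
y_proba = best_rf_model.predict_proba(X_test)[:, 1]    # probability of the class at index 1
print(y_proba[:5])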
9. **Performance Metrics**
accuracy = accuracy_score(y_test, y_pred)
precision = precision_score(y_test, y_pred)
recall = recall_score(y_test, y_pred)
f1 = f1_score(y_test, y_pred)
conf_matrix = confusion_matrix(y_test, y_pred)
class_report = classification_report(y_test, y_pred)
Explanation:

Various metrics are calculated to evaluate model performance:
- `accuracy_score`, `precision_score`, `recall_score`, `f1_score` measure the model's accuracy, precision, recall, and F1 score, respectively.
- `confusion_matrix` calculates the confusion matrix.
- `classification_report` generates a detailed classification report.
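For a binary problem, precision, recall, and F1 can also be read directly off the confusion matrix, which makes the reported numbers easier to interpret. The following cross-check is an optional sketch, assuming `Recommended` is binary with the positive class listed second by `confusion_matrix`.

# Optional cross-check of the metrics against the confusion matrix (not in the original script)
tn, fp, fn, tp = conf_matrix.ravel()       # rows = true labels, columns = predictions
manual_precision = tp / (tp + fp)
manual_recall = tp / (tp + fn)
manual_f1 = 2 * manual_precision * manual_recall / (manual_precision + manual_recall)
print(manual_precision, manual_recall, manual_f1)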
10. **Print Metrics**
print(f'Accuracy: {accuracy:.2f}')
print(f'Precision: {precision:.2f}')
print(f'Recall: {recall:.2f}')
print(f'F1 Score: {f1:.2f}')
print('\nClassification Report:')
print(class_report)
Explanation:
The calculated metrics are printed in a readable format.
11. **Confusion Matrix Plot**

plt.figure(figsize=(8, 6))
sns.heatmap(conf_matrix, annot=True, fmt='d', cmap='Blues',
            xticklabels=['Not Recommended', 'Recommended'],
            yticklabels=['Not Recommended', 'Recommended'])
plt.xlabel('Predicted')
plt.ylabel('True')
plt.title('Confusion Matrix')
plt.savefig('confusion_matrix.png')
Explanation:
Heatmap:
- `sns.heatmap(conf_matrix, annot=True, fmt='d', cmap='Blues', ...)` creates a heatmap of the confusion matrix.
- `annot=True` adds annotations to the cells.
Plot Settings:
- Labels, title, and saving functionality are set similarly to the previous plots.
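As an alternative to the seaborn heatmap, scikit-learn (1.0+) ships its own confusion-matrix plot. This is a hedged substitute rather than the notebook's original approach, and it assumes label 0 corresponds to "Not Recommended" and label 1 to "Recommended".

from sklearn.metrics import ConfusionMatrixDisplay

# Alternative confusion-matrix plot using scikit-learn's built-in display
ConfusionMatrixDisplay.from_predictions(
    y_test, y_pred,
    display_labels=['Not Recommended', 'Recommended'],
    cmap='Blues'
)
plt.title('Confusion Matrix')
plt.savefig('confusion_matrix_sklearn.png')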
12. **Save Metrics Table**

metrics = pd.DataFrame({
    'Metric': ['Accuracy', 'Precision', 'Recall', 'F1 Score'],
    'Score': [accuracy, precision, recall, f1]
})
metrics.to_csv('performance_metrics.csv', index=False)
Explanation:
Save Metrics:
- A DataFrame `metrics` is created to store the calculated metrics.
- `metrics.to_csv('performance_metrics.csv', index=False)` saves the DataFrame to a CSV file without including the index column.
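A one-line round trip can confirm the file was written as expected; this is an optional check, not part of the original script.

# Optional: read the saved metrics back to verify the CSV contents
print(pd.read_csv('performance_metrics.csv'))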