How To Score ~80% Accuracy in Kaggle's Spaceship Titanic Competition
This is a step-by-step guide that walks you through submitting a ".csv" file of predictions to Kaggle for the new Spaceship Titanic competition.
Zaynab Awofeso
Published in CodeX · 13 min read · Jun 13, 2022
Image from Unsplash
Introduction
Kaggle recently launched a fun competition called Spaceship Titanic. It is
designed to be an update of the popular Titanic competition which helps
people new to data science learn the basics of machine learning, get
acquainted with Kaggle’s platform, and meet others in the community. This
article is a beginner-friendly analysis of the Spaceship Titanic Kaggle
Competition. It covers steps to obtain any meaningful insights from the data
and to predict the “ground truth” for the test set with an accuracy of ~80%
using RandomForestClassifier.
Index
1. Problem definition and metrics
2. About the data
3. Exploratory Data Analysis
4. Data Cleaning and Preprocessing
5. Feature Extraction and Feature Selection
6. Baseline Model Performance and Model Building
7. Submission and Feature Importance
1. Problem definition and metrics
First, we have to understand the problem. It's the year 2912, and
the interstellar passenger liner Spaceship Titanic has collided with a
spacetime anomaly hidden within a dust cloud. Sadly, it met a similar fate as
its namesake from 1000 years before. Though the ship stayed intact, almost
half of the passengers were transported to an alternate dimension! To help
rescue crews retrieve the lost passengers, we are challenged to use records
recovered from the spaceship’s damaged computer system to predict which
passengers were transported to another dimension.
This is a binary classification problem where we have to predict whether or
not each passenger was transported to an alternate dimension, and we will
use accuracy as the metric to evaluate our results.
2. About the data
We will be using 3 CSV files:
train file (spaceship_titanic_train.csv) — contains personal records of the
passengers that would be used to build the machine learning model.
test file (spaceship_titanic_test.csv) — contains personal records for the
remaining one-third (~4300) of the passengers, but not the target variable
(i.e. the value of Transported for the passengers). It will be used to see
how well our model performs on unseen data.
sample submission file (sample_submission.csv) — contains the format
in which we have to submit our predictions.
We will be using Python for this problem. You can download the dataset
from Kaggle here.
Import required libraries
# For loading packages
import pandas as pd
pd.set_option('display.max_columns', None)
pd.set_option('display.max_rows', 100)

# For mathematical calculations
import numpy as np

# For data visualization
import matplotlib.pyplot as plt
import seaborn as sns
sns.set_style("darkgrid")

# To build and evaluate the model
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score
from sklearn.model_selection import cross_val_score
from sklearn.model_selection import GridSearchCV
from sklearn.feature_selection import SelectKBest, chi2

# To ignore any warnings
import warnings
warnings.filterwarnings("ignore")
Reading Data
# Read train data
train_df = pd.read_csv("spaceship_titanic_train.csv")

# Read test data
test_df = pd.read_csv("spaceship_titanic_test.csv")
Let’s make a copy of the train and test data so that even if we make any
changes to these datasets it would not affect the original datasets.
# Copy the train and test data to prevent making changes to the original datasets
train_df_1 = train_df.copy()
test_df_1 = test_df.copy()
We will look at the structure of the train and test dataset next. We will first
check the features present, then we will look at their data types.
# View columns of the train data
train_df_1.columns
Index(['PassengerId', 'HomePlanet', 'CryoSleep', 'Cabin', 'Destination', 'Age',
       'VIP', 'RoomService', 'FoodCourt', 'ShoppingMall', 'Spa', 'VRDeck',
       'Name', 'Transported'],
      dtype='object')
We have 13 independent variables and 1 target variable (Transported) in the
training dataset. Let’s also look at the columns of the test dataset.
# View columns of the test data
test_df_1.columns
Index(['PassengerId', 'HomePlanet', 'CryoSleep', 'Cabin', 'Destination', 'Age',
       'VIP', 'RoomService', 'FoodCourt', 'ShoppingMall', 'Spa', 'VRDeck',
       'Name'],
      dtype='object')
The test dataset has the same features as the training dataset, except for
Transported, which we will predict using the model built on the train data.
Given below is the description for each variable.
PassengerId — A unique Id for each passenger. Each Id takes the form
gggg_pp where gggg indicates a group the passenger is travelling with
and pp is their number within the group. People in a group are often
family members, but not always.
HomePlanet — The planet the passenger departed from, typically their
planet of permanent residence.
CryoSleep — Indicates whether the passenger elected to be put into
suspended animation for the duration of the voyage. Passengers in
cryosleep are confined to their cabins.
Cabin — The cabin number where the passenger is staying. Takes the
form deck/num/side, where side can be either P for Port or S for
Starboard.
Destination — The planet the passenger will be debarking to.
Age — The age of the passenger.
VIP — Whether the passenger has paid for special VIP service during the
voyage.
RoomService, FoodCourt, ShoppingMall, Spa, VRDeck — Amount the
passenger has billed at each of the Spaceship Titanic’s many luxury
amenities.
Name — The first and last names of the passenger.
Transported — Whether the passenger was transported to another
dimension. This is the target, the column we are trying to predict.
Let’s print data types for each variable of the training dataset.
# Print datatypes of the train data
train_df_1.dtypes
PassengerId      object
HomePlanet       object
CryoSleep        object
Cabin            object
Destination      object
Age             float64
VIP              object
RoomService     float64
FoodCourt       float64
ShoppingMall    float64
Spa             float64
VRDeck          float64
Name             object
Transported        bool
dtype: object
We can see there are three data types in the training dataset:
object (categorical variables) — the categorical variables in the training
dataset are PassengerId, HomePlanet, CryoSleep, Cabin, Destination,
VIP and Name
float64 (float variables, i.e. numerical variables with decimal values) —
the numerical variables in our train dataset are Age, RoomService,
FoodCourt, ShoppingMall, Spa and VRDeck
bool (boolean variables, i.e. variables that take one of two possible values,
True or False) — the boolean variable in our dataset is Transported
Let’s look at the shape of our train and test dataset.
# Print shape of train data
print("The shape of the train dataset is: ", train_df_1.shape)

# Print shape of test data
print("The shape of the test dataset is: ", test_df_1.shape)
The shape of the train dataset is: (8693, 14)
The shape of the test dataset is: (4277, 13)
We have 8693 rows and 14 columns in the training dataset and 4277 rows and
13 columns in the test dataset.
3. Exploratory Data Analysis
Univariate Analysis
Univariate analysis is the simplest form of analyzing data, where we examine
each variable individually to understand the distribution of its values.
Target Variable
We will first look at the target variable i.e. Transported. Since it is a
categorical variable, let us look at its percentage distribution and bar plot.
# Normalize is set to True to print proportions instead of counts
train_df_1['Transported'].value_counts(normalize = True)
True     0.503624
False    0.496376
Name: Transported, dtype: float64
# Visualize target variable
ax = sns.catplot(x = "Transported", data = train_df_1, kind = "count", color = "b")
ax.set_axis_labels("Transported", "Transported Count")
Out of 8693 passengers in the train dataset, 4378 (about 50%) were
Transported to another dimension.
Let’s visualize the Independent categorical features next.
Independent Variable (Categorical)
# Visualize independent categorical features
plt.figure(figsize = (14, 15))
plt.subplot(221)
train_df_1['HomePlanet'].value_counts(normalize = True).plot.bar(title = 'HomePlanet')
plt.subplot(222)
train_df_1['CryoSleep'].value_counts(normalize = True).plot.bar(title = 'CryoSleep')
plt.subplot(223)
train_df_1['Destination'].value_counts(normalize = True).plot.bar(title = 'Destination')
plt.subplot(224)
train_df_1['VIP'].value_counts(normalize = True).plot.bar(title = 'VIP')
It can be inferred from the bar plots above that:
About 50% of the passengers in the train set departed from Earth
About 30% of the passengers in the training dataset were in CryoSleep
(i.e. confined to their cabins)
About 69% of the passengers in the training dataset were going to
TRAPPIST-1e
Less than 1% of the passengers in the training dataset paid for VIP
services
The Cabin column takes the form deck/num/side, so let's extract and
visualize the CabinDeck and CabinSide features.
# Extract CabinDeck, CabinNo. and CabinSide features from Cabin
train_df_1[["CabinDeck", "CabinNo.", "CabinSide"]] = train_df_1["Cabin"].str.split('/', expand = True)

# Visualize cabin features
plt.figure(figsize = (14, 15))
plt.subplot(221)
train_df_1['CabinDeck'].value_counts(normalize = True).plot.bar(title = 'CabinDeck')
plt.subplot(222)
train_df_1['CabinSide'].value_counts(normalize = True).plot.bar(title = 'CabinSide')
We can infer from the plots above that:
About 60% of the passengers in the train set were on decks F and G
There is not much difference between the percentage of passengers on
cabin side S compared to side P
We have seen the categorical variables. Now let’s visualize the numerical
variables.
Age
# Visualize Age variable
plt.figure(1)
plt.subplot(121)
sns.distplot(train_df_1['Age'])
plt.subplot(122)
train_df_1['Age'].plot.box(figsize = (16, 5))
plt.show()
There are outliers in the Age variable and the distribution is fairly normal.
Room Service
# Visualize RoomService variable
plt.figure(1)
plt.subplot(121)
sns.distplot(train_df_1['RoomService'])
plt.subplot(122)
train_df_1['RoomService'].plot.box(figsize = (16, 5))
plt.ylim([-500, 1000])
plt.show()
We can see that most of the data in the distribution of RoomService are
towards the left, which means it is not normally distributed, and there are a
lot of outliers. We will try to make it normal later.
Spa
# Visualize Spa variable
plt.figure(1)
plt.subplot(121)
sns.distplot(train_df_1['Spa'])
plt.subplot(122)
train_df_1['Spa'].plot.box(figsize = (16, 5))
plt.ylim([-500, 1000])
plt.show()
Spa has a similar distribution to RoomService: it contains a lot of outliers
and is not normally distributed.
RoomService, FoodCourt, ShoppingMall, Spa and VRDeck are the amounts the
passenger has billed at each of the Spaceship Titanic's many luxury
amenities, so let's see if VRDeck, FoodCourt and ShoppingMall have a similar
distribution.
VRDeck
# Visualize VRDeck variable
plt.figure(1)
plt.subplot(121)
sns.distplot(train_df_1['VRDeck'])
plt.subplot(122)
train_df_1['VRDeck'].plot.box(figsize = (16, 5))
plt.ylim([-500, 1000])
plt.show()
FoodCourt
# Visualize FoodCourt variable
plt.figure(1)
plt.subplot(121)
sns.distplot(train_df_1['FoodCourt'])
plt.subplot(122)
train_df_1['FoodCourt'].plot.box(figsize = (16, 5))
plt.ylim([-500, 1000])
plt.show()
ShoppingMall
# Visualize ShoppingMall variable
plt.figure(1)
plt.subplot(121)
sns.distplot(train_df_1['ShoppingMall'])
plt.subplot(122)
train_df_1['ShoppingMall'].plot.box(figsize = (16, 5))
plt.ylim([-500, 1000])
plt.show()
We can see that VRDeck, FoodCourt and ShoppingMall have similar
distributions: none of them is normally distributed, and they all have outliers.
Bivariate Analysis
After looking at every variable individually, we will explore them again to see
their relationship with the target variable. First, we will find the relationship
between the categorical variables and the target variable.
To do this, we will first create a dataframe to store the number of passengers
transported, and the percentage of passengers transported, for each
categorical variable.
HomePlanet_Transported = train_df_1.groupby('HomePlanet').aggregate({'Transported': 'sum',
                                                                     'PassengerId': 'size'}).reset_index()

HomePlanet_Transported['TransportedPercentage'] = HomePlanet_Transported['Transported'] / HomePlanet_Transported['PassengerId']

CryoSleep_Transported = train_df_1.groupby('CryoSleep').aggregate({'Transported': 'sum',
                                                                   'PassengerId': 'size'}).reset_index()

CryoSleep_Transported['TransportedPercentage'] = CryoSleep_Transported['Transported'] / CryoSleep_Transported['PassengerId']

Destination_Transported = train_df_1.groupby('Destination').aggregate({'Transported': 'sum',
                                                                       'PassengerId': 'size'}).reset_index()

Destination_Transported['TransportedPercentage'] = Destination_Transported['Transported'] / Destination_Transported['PassengerId']

VIP_Transported = train_df_1.groupby('VIP').aggregate({'Transported': 'sum',
                                                       'PassengerId': 'size'}).reset_index()

VIP_Transported['TransportedPercentage'] = VIP_Transported['Transported'] / VIP_Transported['PassengerId']
Now, let’s see how the categorical variables relate to transported.
# Visualize categorical features vs target variable
plt.figure(figsize = (14, 15))
plt.subplot(221)
sns.barplot(x = "HomePlanet", y = "TransportedPercentage", data = HomePlanet_Transported, order = HomePlanet_Transported["HomePlanet"])
plt.subplot(222)
sns.barplot(x = "CryoSleep", y = "TransportedPercentage", data = CryoSleep_Transported, order = CryoSleep_Transported["CryoSleep"])
plt.subplot(223)
sns.barplot(x = "Destination", y = "TransportedPercentage", data = Destination_Transported, order = Destination_Transported["Destination"])
plt.subplot(224)
sns.barplot(x = "VIP", y = "TransportedPercentage", data = VIP_Transported, order = VIP_Transported["VIP"])
We can infer that:
About 64% of the Passengers from Europa were Transported
About 78% of the Passengers in CryoSleep were transported
The proportion of Passengers debarking to 55 Cancri e transported to
another dimension is greater compared to those debarking to PSO
J318.5–22 and TRAPPIST-1e
About 38% of the Passengers that paid for special VIP services were
transported
Next, let’s look at how the CabinDeck and CabinSide columns relate to
transported. We will follow the same steps as above.
CabinDeck_Transported = train_df_1.groupby('CabinDeck').aggregate({'Transported': 'sum',
                                                                   'PassengerId': 'size'}).reset_index()

CabinDeck_Transported['TransportedPercentage'] = CabinDeck_Transported['Transported'] / CabinDeck_Transported['PassengerId']

CabinSide_Transported = train_df_1.groupby('CabinSide').aggregate({'Transported': 'sum',
                                                                   'PassengerId': 'size'}).reset_index()

CabinSide_Transported['TransportedPercentage'] = CabinSide_Transported['Transported'] / CabinSide_Transported['PassengerId']

# Visualize Cabin features vs target variable
plt.figure(figsize = (14, 15))
plt.subplot(221)
sns.barplot(x = "CabinDeck", y = "TransportedPercentage", data = CabinDeck_Transported, order = CabinDeck_Transported["CabinDeck"])
plt.subplot(222)
sns.barplot(x = "CabinSide", y = "TransportedPercentage", data = CabinSide_Transported, order = CabinSide_Transported["CabinSide"])
Cabin Deck B and C have the highest percentage of passengers
transported
The proportion of Passengers in Cabin Side S transported to another
dimension is greater compared to those in Cabin Side P
The PassengerId column takes the form gggg_pp, where gggg indicates a
group the passenger is travelling with and pp is their number within the
group. We want to know how the number of people in a group relates to
whether they were transported. So, we will extract the PassengerGroup feature
from the PassengerId column, get the number of people in each group, and
then visualize how it relates to the Transported feature.
# Extract PassengerGroup column from PassengerId column
train_df_1["PassengerGroup"] = train_df_1["PassengerId"].str.split('_', expand = True)[0]

# Create dataframe No_People_In_PassengerGroup that contains each PassengerGroup and the number of passengers in it
No_People_In_PassengerGroup = train_df_1.groupby('PassengerGroup').aggregate({'PassengerId': 'size'}).reset_index()
No_People_In_PassengerGroup = No_People_In_PassengerGroup.rename(columns = {"PassengerId": "NoInPassengerGroup"})

train_df_1 = train_df_1.merge(No_People_In_PassengerGroup[["PassengerGroup", "NoInPassengerGroup"]], how = 'left', on = ['PassengerGroup'])

# Create dataframe NoInPassengerGroup_Transported that has the number of passengers in a group transported
NoInPassengerGroup_Transported = train_df_1.groupby('NoInPassengerGroup').aggregate({'Transported': 'sum',
                                                                                     'PassengerId': 'size'}).reset_index()

NoInPassengerGroup_Transported['TransportedPercentage'] = NoInPassengerGroup_Transported['Transported'] / NoInPassengerGroup_Transported['PassengerId']

# Visualize NoInPassengerGroup vs transported
sns.barplot(x = "NoInPassengerGroup", y = "TransportedPercentage", data = NoInPassengerGroup_Transported)
There is no clear pattern in how the number of people in a group affects
whether they were transported. So, we will look at how being alone or not
affects whether a passenger was transported.
No_People_In_PassengerGroup["IsAlone"] = No_People_In_PassengerGroup["NoInPassengerGroup"].apply(lambda x: "Alone" if x == 1 else "Not Alone")
train_df_1 = train_df_1.merge(No_People_In_PassengerGroup[["PassengerGroup", "IsAlone"]], how = 'left', on = ['PassengerGroup'])

# Create dataframe IsAlone_Transported that contains the percentage of passengers transported, alone or not
IsAlone_Transported = train_df_1.groupby('IsAlone').aggregate({'Transported': 'sum',
                                                               'PassengerId': 'size'}).reset_index()

IsAlone_Transported['TransportedPercentage'] = IsAlone_Transported['Transported'] / IsAlone_Transported['PassengerId']

# Visualize IsAlone vs transported
sns.barplot(x = "IsAlone", y = "TransportedPercentage", data = IsAlone_Transported, order = IsAlone_Transported["IsAlone"])
It seems more Passengers that were not alone were transported to another
dimension compared to Passengers that were alone.
The Name column contains the first and last names of each passenger.
So, let's extract the family name (last name) of each passenger to see if
family size may affect whether passengers were transported.
# Extract FamilyName column from Name column
train_df_1["FamilyName"] = train_df_1["Name"].str.split(' ', expand = True)[1]

# Create dataframe NoRelatives that contains the FamilyName and the number of relatives in a family
NoRelatives = train_df_1.groupby('FamilyName')['PassengerId'].count().reset_index()
NoRelatives = NoRelatives.rename(columns = {"PassengerId": "NoRelatives"})

train_df_1 = train_df_1.merge(NoRelatives[["FamilyName", "NoRelatives"]], how = 'left', on = ['FamilyName'])

train_df_1["FamilySizeCat"] = pd.cut(train_df_1.NoRelatives, bins = [0, 2, 5, 10, 18], labels = ['0 - 2', '3 - 5', '6 - 10', '11 - 18'])

# Create dataframe FamilySizeCat_Transported that has the family size category and the percentage transported
FamilySizeCat_Transported = train_df_1.groupby('FamilySizeCat').aggregate({'Transported': 'sum',
                                                                           'PassengerId': 'size'}).reset_index()

FamilySizeCat_Transported['TransportedPercentage'] = FamilySizeCat_Transported['Transported'] / FamilySizeCat_Transported['PassengerId']

# Visualize FamilySizeCat vs transported
sns.barplot(x = "FamilySizeCat", y = "TransportedPercentage", data = FamilySizeCat_Transported, order = FamilySizeCat_Transported["FamilySizeCat"])
The percentage of smaller families transported is higher than that of larger
families. This could be because smaller families are wealthier, and wealth
affected who was transported. Let's see how family size affects spending.
To do this, we will add up the amounts each passenger billed at each of the
Spaceship Titanic's many luxury amenities. Then, we will plot the total against
FamilySizeCat.
# Create total spendings feature
train_df_1["TotalSpendings"] = train_df_1["FoodCourt"] + \
                               train_df_1["ShoppingMall"] + \
                               train_df_1["RoomService"] + \
                               train_df_1["Spa"] + \
                               train_df_1["VRDeck"]

# FamilySizeCat vs TotalSpendings
plt.figure(figsize = (7, 5))
sns.boxplot(data = train_df_1, x = "FamilySizeCat", y = "TotalSpendings")
plt.ylim([-800, 12000])
Our hypothesis seems to be correct. It seems passengers with a smaller
family size are wealthier.
Now let’s visualize numerical independent variables with respect to the
target variable.
# Transported vs Age
plt.figure(figsize = (7, 5))
sns.violinplot(x = train_df_1["Transported"], y = train_df_1["Age"])
plt.ylim([-50, 200])
It looks like the percentage of passengers between the ages of 0 and about 4
that were transported is higher than the percentage of older passengers
transported. We will create a new column, AgeCat, to confirm whether more
younger passengers were transported compared to older passengers.
# Extract Age category column from Age column
train_df_1["AgeCat"] = pd.cut(train_df_1.Age, bins = [0.0, 4.0, 12.0, 19.0, 40.0, 60.0, 80.0], labels = ['0 - 4', '5 - 12', '13 - 19', '20 - 40', '41 - 60', '61 - 80'])

AgeCat_Transported = train_df_1.groupby('AgeCat').aggregate({'Transported': 'sum',
                                                             'PassengerId': 'size'}).reset_index()

# Create dataframe AgeCat_Transported that has the Age category and the percentage transported
AgeCat_Transported['TransportedPercentage'] = AgeCat_Transported['Transported'] / AgeCat_Transported['PassengerId']

# Visualize AgeCat vs transported
sns.barplot(x = "AgeCat", y = "TransportedPercentage", data = AgeCat_Transported, order = AgeCat_Transported["AgeCat"])
We can infer from the plot above that:
About 74% of passengers within the age range of 0–4 were transported
About 60% of passengers within the age range of 5–12 were transported
Now, let's do the same for the remaining numerical independent variables.
# Transported vs RoomService, FoodCourt, ShoppingMall, Spa, VRDeck
plt.figure(figsize = (7, 5))
sns.violinplot(x = train_df_1["Transported"], y = train_df_1["RoomService"])
plt.ylim([-500, 2000])

plt.figure(figsize = (7, 5))
sns.violinplot(x = train_df_1["Transported"], y = train_df_1["FoodCourt"])
plt.ylim([-900, 8000])

plt.figure(figsize = (7, 5))
sns.violinplot(x = train_df_1["Transported"], y = train_df_1["ShoppingMall"])
plt.ylim([-900, 6000])

plt.figure(figsize = (7, 5))
sns.violinplot(x = train_df_1["Transported"], y = train_df_1["Spa"])
plt.ylim([-800, 2000])

plt.figure(figsize = (7, 5))
sns.violinplot(x = train_df_1["Transported"], y = train_df_1["VRDeck"])
plt.ylim([-800, 2000])
Observations:
The amounts billed by transported passengers are concentrated near zero.
VRDeck, Spa and RoomService appear to have similar distributions, while
FoodCourt and ShoppingMall appear to have similar distributions.
We have seen how family size affects expenditure. Now let's see how being
in CryoSleep relates to expenditure.
# CryoSleep vs TotalSpendings
plt.figure(figsize = (7, 5))
sns.boxplot(data = train_df_1, x = "CryoSleep", y = "TotalSpendings")
plt.ylim([-900, 14000])
It can be seen from the plot above that passengers in CryoSleep have 0
expenditure. Now let’s see how VIP status affects expenditure.
# VIP vs TotalSpendings
plt.figure(figsize = (7, 5))
sns.boxplot(data = train_df_1, x = "VIP", y = "TotalSpendings")
plt.ylim([-800, 12000])
It can be seen that passengers with VIP status have a higher expenditure
compared to passengers who don’t.
Let's also see how the age category relates to the total spending of a passenger.

# AgeCat vs TotalSpendings
plt.figure(figsize = (7, 5))
sns.boxplot(data = train_df_1, x = "AgeCat", y = "TotalSpendings")
plt.ylim([-800, 12000])

From the plot above it can be inferred that:
Passengers within the age range of 0–12 had 0 expenditure
Expenditure increases with age
4. Data Cleaning and Preprocessing
After exploring the variables in our data, we can now impute the missing
values and treat the outliers.
First, let’s drop the columns we created for the exploratory data analysis.
train_df_2 = train_df_1.copy()

# Drop features created during EDA
train_df_2 = train_df_2.drop(["PassengerGroup",
                              "CabinDeck",
                              "CabinNo.",
                              "CabinSide",
                              "FamilyName",
                              "NoRelatives",
                              "NoInPassengerGroup",
                              "IsAlone",
                              "AgeCat",
                              "FamilySizeCat",
                              "TotalSpendings"], axis = 1)
We will combine the train and test data to make cleaning and preprocessing
easier.
# Save the target variable from the train dataset in target
target = train_df_2["Transported"]

# Save test PassengerId in test_id
test_id = test_df_1["PassengerId"]

# Drop Transported variable from the train set
train_df_3 = train_df_2.drop(["Transported"], axis = 1)

# Join the train and test set
data = pd.concat([train_df_3, test_df_1], axis = 0).reset_index(drop = True)
Let’s look at the shape of our new dataset.
# Print shape of data
print(data.shape)
(12970, 13)
The dataset has 12970 rows and 13 columns. Let’s look at the percentage of
each variable missing in our dataset.
# View percentage of values missing in each column
round(data.isna().sum() * 100 / data.shape[0], 3)
PassengerId     0.000
HomePlanet      2.221
CryoSleep       2.390
Cabin           2.305
Destination     2.113
Age             2.082
VIP             2.282
RoomService     2.028
FoodCourt       2.228
ShoppingMall    2.359
Spa             2.190
VRDeck          2.066
Name            2.267
dtype: float64
There are missing values in every column except PassengerId, but each
column is missing only around 2% of its values, far below the point where we
would consider dropping it. We will treat the missing values in the categorical
columns first by imputing with the mode.
# Get categorical columns in the dataset with missing values and store them in list_missing_cat_columns
data_1 = data.copy()

missing_cat_counts = data_1.select_dtypes(['object', 'category']).isna().sum()
list_missing_cat_columns = list(missing_cat_counts[missing_cat_counts > 0].index)
list_missing_cat_columns

['HomePlanet',
 'CryoSleep',
 'Cabin',
 'Destination',
 'VIP',
 'Name']
# Fill categorical columns with the mode
for col in list_missing_cat_columns:
    data_1[col] = data_1[col].fillna(data_1[col].mode()[0])
Now let’s find a way to fill the missing values in the Numerical features.
# Fill missing values for numeric columns

# Get numeric columns with missing values and store them in list_missing_numeric_col
missing_num_counts = data_1.select_dtypes(np.number).isna().sum()
list_missing_numeric_col = list(missing_num_counts[missing_num_counts > 0].index)
list_missing_numeric_col
['Age', 'RoomService', 'FoodCourt', 'ShoppingMall', 'Spa', 'VRDeck']
We will start with Age. While performing EDA, we saw that RoomService,
FoodCourt, ShoppingMall, Spa and VRDeck total 0 if a passenger's Age is less
than 13 or they are in CryoSleep, so let's create a function to handle that.
# Filling NaNs based on Age and CryoSleep
def fill_nans_by_age_and_cryosleep(df):
    df["RoomService"] = np.where((df["Age"] < 13) | (df["CryoSleep"] == True), 0, df["RoomService"])
    df["FoodCourt"] = np.where((df["Age"] < 13) | (df["CryoSleep"] == True), 0, df["FoodCourt"])
    df["ShoppingMall"] = np.where((df["Age"] < 13) | (df["CryoSleep"] == True), 0, df["ShoppingMall"])
    df["Spa"] = np.where((df["Age"] < 13) | (df["CryoSleep"] == True), 0, df["Spa"])
    df["VRDeck"] = np.where((df["Age"] < 13) | (df["CryoSleep"] == True), 0, df["VRDeck"])

    return df

data_1 = fill_nans_by_age_and_cryosleep(data_1)
Now, let's fill the remaining missing values using the mean.
# Fill numeric columns with the mean
for col in list_missing_numeric_col:
    data_1[col] = data_1[col].fillna(data_1[col].mean())

data_1.isna().sum()
PassengerId     0
HomePlanet      0
CryoSleep       0
Cabin           0
Destination     0
Age             0
VIP             0
RoomService     0
FoodCourt       0
ShoppingMall    0
Spa             0
VRDeck          0
Name            0
dtype: int64
As we can see, all the missing values have been filled in the dataset.
Outlier Treatment
As we saw earlier in our univariate analysis, RoomService, FoodCourt,
ShoppingMall, Spa and VRDeck contain outliers, and we have to treat them
because the presence of outliers affects the distribution of our data. To do
this, we will clip outliers at the 99% quantile.
# Clip outliers at the 99% quantile
def clipping_quantile(dataframe, quantile_values = None, quantile = 0.99):
    df = dataframe.copy()
    if quantile_values is None:
        quantile_values = df[["RoomService", "FoodCourt", "ShoppingMall", "Spa", "VRDeck"]].quantile(quantile)
    for num_column in ["RoomService", "FoodCourt", "ShoppingMall", "Spa", "VRDeck"]:
        num_values = df[num_column].values
        threshold = quantile_values[num_column]
        num_values = np.where(num_values > threshold, threshold, num_values)
        df[num_column] = num_values
    return df

data_1 = clipping_quantile(data_1, None, 0.99)
Our dataset is now clean!
5. Feature Extraction and Feature Selection
Based on our EDA let’s create a function to create new features that might
affect the target variable.
def extract_features(df):
    df["PassengerGroup"] = df["PassengerId"].str.split('_', expand = True)[0]

    No_People_In_PassengerGroup = df.groupby('PassengerGroup').aggregate({'PassengerId': 'size'}).reset_index()
    No_People_In_PassengerGroup = No_People_In_PassengerGroup.rename(columns = {"PassengerId": "NoInPassengerGroup"})
    # Create IsAlone feature
    No_People_In_PassengerGroup["IsAlone"] = No_People_In_PassengerGroup["NoInPassengerGroup"].apply(lambda x: "Alone" if x == 1 else "Not Alone")
    df = df.merge(No_People_In_PassengerGroup[["PassengerGroup", "IsAlone"]], how = 'left', on = ['PassengerGroup'])

    # Create CabinDeck feature
    df["CabinDeck"] = df["Cabin"].str.split('/', expand = True)[0]
    # Create DeckPosition feature
    df["DeckPosition"] = df["CabinDeck"].apply(lambda deck: "Lower" if deck in ('A', 'B', 'C', 'D') else "Upper")
    # Create CabinSide feature
    df["CabinSide"] = df["Cabin"].str.split('/', expand = True)[2]

    # Create Regular feature
    df["Regular"] = df["FoodCourt"] + df["ShoppingMall"]
    # Create Luxury feature
    df["Luxury"] = df["RoomService"] + df["Spa"] + df["VRDeck"]
    # Create TotalSpendings feature
    df["TotalSpendings"] = df["RoomService"] + df["FoodCourt"] + df["ShoppingMall"] + df["Spa"] + df["VRDeck"]

    Wealthiest_Deck = df.groupby('CabinDeck').aggregate({'TotalSpendings': 'sum', 'PassengerId': 'size'}).reset_index()
    # Create DeckAverageSpent feature
    Wealthiest_Deck['DeckAverageSpent'] = Wealthiest_Deck['TotalSpendings'] / Wealthiest_Deck['PassengerId']
    df = df.merge(Wealthiest_Deck[["CabinDeck", "DeckAverageSpent"]], how = 'left', on = ['CabinDeck'])

    df["FamilyName"] = df["Name"].str.split(' ', expand = True)[1]
    # Create NoRelatives feature
    NoRelatives = df.groupby('FamilyName')['PassengerId'].count().reset_index()
    NoRelatives = NoRelatives.rename(columns = {"PassengerId": "NoRelatives"})
    df = df.merge(NoRelatives[["FamilyName", "NoRelatives"]], how = 'left', on = ['FamilyName'])
    # Create FamilySizeCat feature
    df["FamilySizeCat"] = pd.cut(df.NoRelatives, bins = [0, 2, 5, 10, 300], labels = ['0 - 2', '3 - 5', '6 - 10', '11+'])

    return df

data_2 = data_1.copy()
data_2 = extract_features(data_2)
Let us now drop the variables we used to create these features, since they are
no longer relevant, to remove noise from the dataset.
data_3 = data_2.copy()
irrelevant_columns = ["Cabin", "PassengerId", "RoomService", "FoodCourt", "ShoppingMall", "Spa",
                      "VRDeck", "Name", "FamilyName", "PassengerGroup"]
data_3 = data_3.drop(irrelevant_columns, axis = 1)

data_3.shape
(12970, 15)
Now, we will convert our categorical data into model-understandable
numerical data.
# Categorical encoding
data_3 = pd.get_dummies(data_3, columns = ['HomePlanet', 'CryoSleep', 'Destination', 'VIP', 'CabinSide', 'IsAlone'])

# Ordinal encoding
for col in ['CabinDeck', 'DeckPosition', 'FamilySizeCat']:
    data_3[col], _ = data_3[col].factorize()
Next we will split the data back to get the train and test data.
# Split the data back into the train and test data
data_4 = data_3.copy()
train_data_final = data_4.loc[:train_df.index.max(), :].copy()
test_data_final = data_4.loc[train_df.index.max() + 1:, :].reset_index(drop = True).copy()
Let’s print the shape of the train and test data to be sure we split the data
right.
# Print shape of final train data
print(train_data_final.shape)

# Print shape of final test data
print(test_data_final.shape)
(8693, 23)
(4277, 23)
6. Baseline Model Performance and Model Building
It is time to prepare the data for feeding into the models.
X = train_data_final.copy()

# Save target variable in y
y = target.astype(int)
Feature selection always plays a key role in model building. We will perform
a χ² test to select the 22 best features as follows.
# Univariate feature selection
chi_selector = SelectKBest(chi2, k = 22).fit(X, y)

chi_support = chi_selector.get_support()
chi_feature = X.loc[:, chi_support].columns
chi_feature
Index(['Age', 'CabinDeck', 'DeckPosition', 'Regular', 'Luxury',
       'TotalSpendings', 'DeckAverageSpent', 'NoRelatives', 'FamilySizeCat',
       'HomePlanet_Earth', 'HomePlanet_Europa', 'HomePlanet_Mars',
       'CryoSleep_False', 'CryoSleep_True', 'Destination_55 Cancri e',
       'Destination_TRAPPIST-1e', 'VIP_False', 'VIP_True', 'CabinSide_P',
       'CabinSide_S', 'IsAlone_Alone', 'IsAlone_Not Alone'],
      dtype='object')
Next, for our model building, we will use Random Forest, a tree ensemble
algorithm, and try to improve the accuracy.
X = X[chi_feature]

# Baseline model
baseline_model = RandomForestClassifier(random_state = 1)
baseline_model.fit(X, y)
We will use cross validation score to estimate the accuracy of our baseline
model.
# Store accuracy of baseline model prediction in result
result = cross_val_score(baseline_model, X, y, cv = 20, scoring = "accuracy")

# Print mean and standard deviation of baseline model accuracy
print(np.mean(result))
print(np.std(result))
0.7889098998887654
0.01911345656998776
We got a mean accuracy of 78.9%. Now we will try to improve this accuracy
by tuning the model's hyperparameters. We will use grid search to get
optimized values for the hyperparameters. Grid search selects the best of a
family of hyperparameters, parameterized by a grid of parameters. We will
tune the max_depth and n_estimators parameters: max_depth sets the
maximum depth of each tree, and n_estimators sets the number of trees used
in the random forest model.
# Provide a range for max_depth from 1 to 20 with an interval of 2
# Provide a range for n_estimators from 1 to 200 with an interval of 20
paramgrid = {'max_depth': list(range(1, 20, 2)),
             'n_estimators': list(range(1, 200, 20))}

grid_search = GridSearchCV(RandomForestClassifier(random_state = 1), paramgrid)

# Fit the grid search model
grid_search.fit(X, y)

# Estimating the optimized values
grid_search.best_estimator_
RandomForestClassifier(max_depth=11, n_estimators=101, random_state=1)
So, the optimized value for the max_depth variable is 11 and for n_estimators
is 101. Now let’s build the final model using these optimized values.
RandomForestClassifier(max_depth=11, n_estimators=101, random_state=1)
Now, let’s view the new accuracy score of our model with optimized
parameters to confirm it improved.
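The evaluation snippet is also missing here; a minimal sketch mirroring the 20-fold cross-validation used for the baseline:

# Cross-validated accuracy of the tuned model
result = cross_val_score(final_model, X, y, cv = 20, scoring = "accuracy")
print(np.mean(result))
print(np.std(result))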
0.8047907728163567
0.018872624931449773
The model now has a mean accuracy of 80.5% which is an improvement. It’s
time to make predictions for the test dataset using our selected features.
7. Submission and Feature Importance
Before we make our submission, let’s import the sample submission file to
see the format our submission should have.
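The snippet for this step is missing from this copy; a minimal sketch, assuming the file name sample_submission.csv from the competition data:

# Read the sample submission file to see the expected format
sample_submission = pd.read_csv("sample_submission.csv")
sample_submission.head()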
As we can see, we only need PassengerId and Transported for the final
submission. To build it, we will use the test set's PassengerId and our
predictions. Remember, we need to convert 0 to False and 1 to True.
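The original submission code is also missing; a minimal sketch, assuming final_model is the tuned model above and submission.csv is our (hypothetical) output file name:

# Predict on the final test data using the selected features
test_predictions = final_model.predict(test_data_final[chi_feature])

# Build the submission from the saved test PassengerId, converting 1/0 back to True/False
submission = pd.DataFrame({"PassengerId": test_id,
                           "Transported": test_predictions.astype(bool)})
submission.to_csv("submission.csv", index = False)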
Feature importance allows you to understand the relationship between the
features and the target variable. Let us plot the feature importances to
understand what features are most important and what features are
irrelevant for the model.
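The plotting code is missing here as well; a minimal sketch using scikit-learn's feature_importances_ attribute on the tuned model:

# Plot feature importances of the tuned random forest
importances = pd.Series(final_model.feature_importances_, index = X.columns)
importances.sort_values().plot.barh(figsize = (8, 10))
plt.xlabel("Feature Importance")
plt.show()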
We can see from the plot above that Luxury is the most important feature,
followed by TotalSpendings and Regular. So, feature engineering helped us
predict the target variable.
I hope you enjoyed reading. You can find my code on GitHub.
Data Science
Machine Learning
Kaggle Competition
Kaggle Titanic
Tutorial
Written by Zaynab Awofeso
Join me on a journey to discover the latest trends and breakthroughs in data science while
unlocking your full potential to become the best version of yourself.