How To Score ~80% Accuracy in Kaggle's Spaceship Titanic Competition

A step-by-step guide to walk you through submitting a ".csv" file of predictions to Kaggle for the new Titanic competition.

Zaynab Awofeso · Published in CodeX · 13 min read · Jun 13, 2022

Introduction

Kaggle recently launched a fun competition called Spaceship Titanic. It is designed as an update of the popular Titanic competition, which helps people new to data science learn the basics of machine learning, get acquainted with Kaggle's platform, and meet others in the community.

This article is a beginner-friendly analysis of the Spaceship Titanic Kaggle competition. It covers the steps to draw meaningful insights from the data and to predict the "ground truth" for the test set with an accuracy of ~80% using RandomForestClassifier.

Index

1. Problem definition and metrics
2. About the data
3. Exploratory Data Analysis
4. Data Cleaning and Preprocessing
5. Feature Extraction and Feature Selection
6. Baseline Model Performance and Model Building
7. Submission and Feature Importance

1. Problem definition and metrics

First, we have to understand the problem. It's the year 2912, and the interstellar passenger liner Spaceship Titanic has collided with a spacetime anomaly hidden within a dust cloud. Sadly, it met a fate similar to its namesake from 1000 years before. Though the ship stayed intact, almost half of the passengers were transported to an alternate dimension! To help rescue crews retrieve the lost passengers, we are challenged to use records recovered from the spaceship's damaged computer system to predict which passengers were transported to another dimension.

This is a binary classification problem: for each passenger, we have to predict whether or not they were transported to an alternate dimension, and we will use accuracy as the metric to evaluate our results (a quick illustration of the metric follows below).

2. About the data

We will be using 3 CSV files:

- train file (spaceship_titanic_train.csv) — contains personal records of the passengers; it will be used to build the machine learning model.
- test file (spaceship_titanic_test.csv) — contains personal records for the remaining one-third (~4300) of the passengers, but not the target variable (i.e. the value of Transported for those passengers). It will be used to see how well our model performs on unseen data.
- sample submission file (sample_submission.csv) — contains the format in which we have to submit our predictions.

We will be using Python for this problem. You can download the dataset from Kaggle here.
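As a quick illustration of the metric: accuracy is simply the fraction of predictions that match the true labels. The labels below are made up purely for demonstration:

```python
from sklearn.metrics import accuracy_score

# Hypothetical labels, just to show how the metric is computed
y_true = [True, False, True, True]   # ground truth
y_pred = [True, False, False, True]  # model predictions

print(accuracy_score(y_true, y_pred))  # 3 of 4 correct -> 0.75
```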
Import required libraries

```python
# For loading packages
import pandas as pd
pd.set_option('display.max_columns', None)
pd.set_option('display.max_rows', 100)

# For mathematical calculations
import numpy as np

# For data visualization
import matplotlib.pyplot as plt
import seaborn as sns
sns.set_style("darkgrid")

# To build and evaluate the model
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score
from sklearn.model_selection import cross_val_score
from sklearn.model_selection import GridSearchCV
from sklearn.feature_selection import SelectKBest, chi2

# To ignore any warnings
import warnings
warnings.filterwarnings("ignore")
```

Reading Data

```python
# Read train data
train_df = pd.read_csv("spaceship_titanic_train.csv")
# Read test data
test_df = pd.read_csv("spaceship_titanic_test.csv")
```

Let's make a copy of the train and test data so that any changes we make will not affect the original datasets.

```python
# Copy train and test data to avoid modifying the original datasets
train_df_1 = train_df.copy()
test_df_1 = test_df.copy()
```

Next, we will look at the structure of the train and test datasets: first the features present, then their data types.

```python
# View columns of the train data
train_df_1.columns
```

```
Index(['PassengerId', 'HomePlanet', 'CryoSleep', 'Cabin', 'Destination', 'Age',
       'VIP', 'RoomService', 'FoodCourt', 'ShoppingMall', 'Spa', 'VRDeck',
       'Name', 'Transported'],
      dtype='object')
```

We have 13 independent variables and 1 target variable (Transported) in the training dataset. Let's also look at the columns of the test dataset.

```python
# View columns of the test data
test_df_1.columns
```

```
Index(['PassengerId', 'HomePlanet', 'CryoSleep', 'Cabin', 'Destination', 'Age',
       'VIP', 'RoomService', 'FoodCourt', 'ShoppingMall', 'Spa', 'VRDeck',
       'Name'],
      dtype='object')
```

The test dataset has the same features as the training dataset except Transported, which we will predict using the model built on the train data. Below is the description of each variable:

- PassengerId — A unique Id for each passenger. Each Id takes the form gggg_pp, where gggg indicates the group the passenger is travelling with and pp is their number within the group. People in a group are often family members, but not always.
- HomePlanet — The planet the passenger departed from, typically their planet of permanent residence.
- CryoSleep — Indicates whether the passenger elected to be put into suspended animation for the duration of the voyage. Passengers in cryosleep are confined to their cabins.
- Cabin — The cabin number where the passenger is staying. Takes the form deck/num/side, where side can be either P for Port or S for Starboard.
- Destination — The planet the passenger will be debarking to.
- Age — The age of the passenger.
- VIP — Whether the passenger has paid for special VIP service during the voyage.
- RoomService, FoodCourt, ShoppingMall, Spa, VRDeck — Amount the passenger has billed at each of the Spaceship Titanic's many luxury amenities.
- Name — The first and last names of the passenger.
- Transported — Whether the passenger was transported to another dimension. This is the target, the column we are trying to predict.
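With the column meanings in mind, a quick look at a few raw rows is a useful sanity check before going further:

```python
# Peek at the first few rows to see what the raw values look like
train_df_1.head()
```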
Let's print the data types of each variable in the training dataset.

```python
# Print datatypes of the train data
train_df_1.dtypes
```

```
PassengerId      object
HomePlanet       object
CryoSleep        object
Cabin            object
Destination      object
Age             float64
VIP              object
RoomService     float64
FoodCourt       float64
ShoppingMall    float64
Spa             float64
VRDeck          float64
Name             object
Transported        bool
dtype: object
```

We can see there are three data types in the training dataset:

- object (categorical variables) — PassengerId, HomePlanet, CryoSleep, Cabin, Destination, VIP and Name
- float64 (numerical variables with decimal values) — Age, RoomService, FoodCourt, ShoppingMall, Spa and VRDeck
- bool (Boolean variables, i.e. variables with one of two possible values, e.g. True or False) — Transported

Let's look at the shape of our train and test datasets.

```python
# Print shape of train data
print("The shape of the train dataset is: ", train_df_1.shape)

# Print shape of test data
print("The shape of the test dataset is: ", test_df_1.shape)
```

```
The shape of the train dataset is:  (8693, 14)
The shape of the test dataset is:  (4277, 13)
```

We have 8693 rows and 14 columns in the training dataset and 4277 rows and 13 columns in the test dataset.

3. Exploratory Data Analysis

Univariate Analysis

Univariate analysis is the simplest form of data analysis, where we examine each variable individually to understand the distribution of its values.

Target Variable

We will first look at the target variable, Transported. Since it is a categorical variable, let's look at its percentage distribution and bar plot.

```python
# normalize = True prints proportions instead of counts
train_df_1['Transported'].value_counts(normalize = True)
```

```
True     0.503624
False    0.496376
Name: Transported, dtype: float64
```

```python
# Visualize target variable
ax = sns.catplot(x = "Transported", data = train_df_1, kind = "count", color = "b")
ax.set_axis_labels("Transported", "Transported Count")
```

Out of 8693 passengers in the train dataset, 4378 (about 50%) were transported to another dimension. Let's visualize the independent categorical features next.

Independent Variables (Categorical)

```python
# Visualize independent categorical features
plt.figure(figsize = (14, 15))
plt.subplot(221)
train_df_1['HomePlanet'].value_counts(normalize = True).plot.bar(title = 'HomePlanet')
plt.subplot(222)
train_df_1['CryoSleep'].value_counts(normalize = True).plot.bar(title = 'CryoSleep')
plt.subplot(223)
train_df_1['Destination'].value_counts(normalize = True).plot.bar(title = 'Destination')
plt.subplot(224)
train_df_1['VIP'].value_counts(normalize = True).plot.bar(title = 'VIP')
```

It can be inferred from the bar plots above that:

- About 50% of passengers in the train set departed from Earth
- About 30% of passengers in the train set were in CryoSleep (i.e. confined to their cabins)
- About 69% of passengers in the train set were going to TRAPPIST-1e
- Less than 1% of passengers in the train set paid for VIP services

The Cabin column takes the form deck/num/side.
So, let's extract and visualize the CabinDeck and CabinSide features.

```python
# Extract CabinDeck, CabinNo. and CabinSide features from Cabin
train_df_1[["CabinDeck", "CabinNo.", "CabinSide"]] = train_df_1["Cabin"].str.split('/', expand = True)

# Visualize Cabin features
plt.figure(figsize = (14, 15))
plt.subplot(221)
train_df_1['CabinDeck'].value_counts(normalize = True).plot.bar(title = 'CabinDeck')
plt.subplot(222)
train_df_1['CabinSide'].value_counts(normalize = True).plot.bar(title = 'CabinSide')
```

We can infer from the plots above that:

- About 60% of the passengers in the train set were on decks F and G
- There's not much difference between the percentage of passengers on Cabin side S compared to side P

We have seen the categorical variables. Now let's visualize the numerical variables.

Age

```python
# Visualize Age variable
plt.figure(1)
plt.subplot(121)
sns.distplot(train_df_1['Age'])
plt.subplot(122)
train_df_1['Age'].plot.box(figsize = (16, 5))
plt.show()
```

There are outliers in the Age variable, and the distribution is fairly normal.

RoomService

```python
# Visualize RoomService variable
plt.figure(1)
plt.subplot(121)
sns.distplot(train_df_1['RoomService'])
plt.subplot(122)
train_df_1['RoomService'].plot.box(figsize = (16, 5))
plt.ylim([-500, 1000])
plt.show()
```

Most of the mass in the distribution of RoomService is towards the left, which means it is right-skewed rather than normally distributed, and there are a lot of outliers.

Spa

```python
# Visualize Spa variable
plt.figure(1)
plt.subplot(121)
sns.distplot(train_df_1['Spa'])
plt.subplot(122)
train_df_1['Spa'].plot.box(figsize = (16, 5))
plt.ylim([-500, 1000])
plt.show()
```

Spa has a distribution similar to RoomService: it contains a lot of outliers and is far from normal. RoomService, FoodCourt, ShoppingMall, Spa and VRDeck are the amounts the passenger billed at each of the Spaceship Titanic's many luxury amenities, so let's see if VRDeck, FoodCourt and ShoppingMall have a similar distribution.

VRDeck

```python
# Visualize VRDeck variable
plt.figure(1)
plt.subplot(121)
sns.distplot(train_df_1['VRDeck'])
plt.subplot(122)
train_df_1['VRDeck'].plot.box(figsize = (16, 5))
plt.ylim([-500, 1000])
plt.show()
```

FoodCourt

```python
# Visualize FoodCourt variable
plt.figure(1)
plt.subplot(121)
sns.distplot(train_df_1['FoodCourt'])
plt.subplot(122)
train_df_1['FoodCourt'].plot.box(figsize = (16, 5))
plt.ylim([-500, 1000])
plt.show()
```

ShoppingMall

```python
# Visualize ShoppingMall variable
plt.figure(1)
plt.subplot(121)
sns.distplot(train_df_1['ShoppingMall'])
plt.subplot(122)
train_df_1['ShoppingMall'].plot.box(figsize = (16, 5))
plt.ylim([-500, 1000])
plt.show()
```

We can see that VRDeck, FoodCourt and ShoppingMall do have a similar distribution: none of them is normally distributed, and they all have outliers.
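If you want to reduce that skew before modelling, one common option is a log transform. This is a sketch of the idea, not a step the rest of this walkthrough depends on; tree-based models such as the random forest we use later are largely insensitive to monotonic transforms, so it is optional here:

```python
# np.log1p computes log(1 + x), which handles the many zero bills gracefully
spend_cols = ["RoomService", "FoodCourt", "ShoppingMall", "Spa", "VRDeck"]
for col in spend_cols:
    train_df_1[col + "Log"] = np.log1p(train_df_1[col])

# The transformed column is much less skewed than the raw one
sns.distplot(train_df_1["RoomServiceLog"])
plt.show()
```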
Bivariate Analysis

After looking at every variable individually, we will now explore their relationships with the target variable. First, we will look at the relationship between the categorical variables and the target variable. To do this, we will create a dataframe for each categorical variable that stores the number of passengers transported and the percentage of passengers transported per category.

```python
HomePlanet_Transported = train_df_1.groupby('HomePlanet').aggregate({'Transported': 'sum',
                                                                     'PassengerId': 'size'}).reset_index()
HomePlanet_Transported['TransportedPercentage'] = HomePlanet_Transported['Transported'] / HomePlanet_Transported['PassengerId']

CryoSleep_Transported = train_df_1.groupby('CryoSleep').aggregate({'Transported': 'sum',
                                                                   'PassengerId': 'size'}).reset_index()
CryoSleep_Transported['TransportedPercentage'] = CryoSleep_Transported['Transported'] / CryoSleep_Transported['PassengerId']

Destination_Transported = train_df_1.groupby('Destination').aggregate({'Transported': 'sum',
                                                                       'PassengerId': 'size'}).reset_index()
Destination_Transported['TransportedPercentage'] = Destination_Transported['Transported'] / Destination_Transported['PassengerId']

VIP_Transported = train_df_1.groupby('VIP').aggregate({'Transported': 'sum',
                                                       'PassengerId': 'size'}).reset_index()
VIP_Transported['TransportedPercentage'] = VIP_Transported['Transported'] / VIP_Transported['PassengerId']
```

Now, let's see how the categorical variables relate to Transported.

```python
# Visualize categorical features vs target variable
plt.figure(figsize = (14, 15))
plt.subplot(221)
sns.barplot(x = "HomePlanet", y = "TransportedPercentage", data = HomePlanet_Transported, order = HomePlanet_Transported['HomePlanet'])
plt.subplot(222)
sns.barplot(x = "CryoSleep", y = "TransportedPercentage", data = CryoSleep_Transported, order = CryoSleep_Transported['CryoSleep'])
plt.subplot(223)
sns.barplot(x = "Destination", y = "TransportedPercentage", data = Destination_Transported, order = Destination_Transported['Destination'])
plt.subplot(224)
sns.barplot(x = "VIP", y = "TransportedPercentage", data = VIP_Transported, order = VIP_Transported['VIP'])
```

We can infer that:

- About 64% of the passengers from Europa were transported
- About 78% of the passengers in CryoSleep were transported
- A greater proportion of passengers debarking to 55 Cancri e were transported to another dimension than of those debarking to PSO J318.5-22 and TRAPPIST-1e
- About 38% of the passengers that paid for special VIP services were transported
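As an aside: because Transported is Boolean, the same proportions can be computed in one line with a grouped mean. This shortcut is equivalent to the dataframes built above, though the rest of the walkthrough keeps the explicit version:

```python
# The mean of a Boolean column is the fraction of True values, i.e. the transported rate
train_df_1.groupby('HomePlanet')['Transported'].mean()
```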
Next, let's look at how the CabinDeck and CabinSide columns relate to Transported. We will follow the same steps as above.

```python
CabinDeck_Transported = train_df_1.groupby('CabinDeck').aggregate({'Transported': 'sum',
                                                                   'PassengerId': 'size'}).reset_index()
CabinDeck_Transported['TransportedPercentage'] = CabinDeck_Transported['Transported'] / CabinDeck_Transported['PassengerId']

CabinSide_Transported = train_df_1.groupby('CabinSide').aggregate({'Transported': 'sum',
                                                                   'PassengerId': 'size'}).reset_index()
CabinSide_Transported['TransportedPercentage'] = CabinSide_Transported['Transported'] / CabinSide_Transported['PassengerId']

# Visualize Cabin features vs target variable
plt.figure(figsize = (14, 15))
plt.subplot(221)
sns.barplot(x = "CabinDeck", y = "TransportedPercentage", data = CabinDeck_Transported, order = CabinDeck_Transported['CabinDeck'])
plt.subplot(222)
sns.barplot(x = "CabinSide", y = "TransportedPercentage", data = CabinSide_Transported, order = CabinSide_Transported['CabinSide'])
```

- Cabin decks B and C have the highest percentage of passengers transported
- A greater proportion of passengers on Cabin side S were transported to another dimension than of those on Cabin side P

The PassengerId column takes the form gggg_pp, where gggg indicates the group the passenger is travelling with and pp their number within the group. We want to know how the number of people in a group relates to whether they were transported. So, we will extract a PassengerGroup feature from the PassengerId column, compute the number of people in each group, and then visualize how it relates to Transported.

```python
# Extract PassengerGroup column from PassengerId column
train_df_1["PassengerGroup"] = train_df_1["PassengerId"].str.split('_', expand = True)[0]

# Create dataframe No_People_In_PassengerGroup: each PassengerGroup and the number of passengers in it
No_People_In_PassengerGroup = train_df_1.groupby('PassengerGroup').aggregate({'PassengerId': 'size'}).reset_index()
No_People_In_PassengerGroup = No_People_In_PassengerGroup.rename(columns = {"PassengerId": "NoInPassengerGroup"})

train_df_1 = train_df_1.merge(No_People_In_PassengerGroup[["PassengerGroup", "NoInPassengerGroup"]], how = 'left', on = ['PassengerGroup'])

# Create dataframe NoInPassengerGroup_Transported: passengers transported per group size
NoInPassengerGroup_Transported = train_df_1.groupby('NoInPassengerGroup').aggregate({'Transported': 'sum',
                                                                                     'PassengerId': 'size'}).reset_index()
NoInPassengerGroup_Transported['TransportedPercentage'] = NoInPassengerGroup_Transported['Transported'] / NoInPassengerGroup_Transported['PassengerId']

# Visualize NoInPassengerGroup vs Transported
sns.barplot(x = "NoInPassengerGroup", y = "TransportedPercentage", data = NoInPassengerGroup_Transported)
```

There is no clear pattern in how the number of people in a group affects whether they were transported. So, we will instead look at how travelling alone or not affects whether a passenger was transported.
```python
# Flag passengers that are alone in their group
No_People_In_PassengerGroup["IsAlone"] = No_People_In_PassengerGroup["NoInPassengerGroup"].apply(lambda n: "Alone" if n == 1 else "Not Alone")
train_df_1 = train_df_1.merge(No_People_In_PassengerGroup[["PassengerGroup", "IsAlone"]], how = 'left', on = ['PassengerGroup'])

# Create dataframe IsAlone_Transported: percentage of passengers transported, alone vs not alone
IsAlone_Transported = train_df_1.groupby('IsAlone').aggregate({'Transported': 'sum',
                                                               'PassengerId': 'size'}).reset_index()
IsAlone_Transported['TransportedPercentage'] = IsAlone_Transported['Transported'] / IsAlone_Transported['PassengerId']

# Visualize IsAlone vs Transported
sns.barplot(x = "IsAlone", y = "TransportedPercentage", data = IsAlone_Transported, order = IsAlone_Transported['IsAlone'])
```

It seems more passengers that were not alone were transported to another dimension than passengers that were alone.

The Name column contains the first and last names of each passenger. So, let's extract the family name (last name) of each passenger to see whether family size affects whether passengers were transported.

```python
# Extract FamilyName column from Name column
train_df_1["FamilyName"] = train_df_1["Name"].str.split(' ', expand = True)[1]

# Create dataframe NoRelatives: each FamilyName and the number of relatives in the family
NoRelatives = train_df_1.groupby('FamilyName')['PassengerId'].count().reset_index()
NoRelatives = NoRelatives.rename(columns = {"PassengerId": "NoRelatives"})

train_df_1 = train_df_1.merge(NoRelatives[["FamilyName", "NoRelatives"]], how = 'left', on = ['FamilyName'])

train_df_1["FamilySizeCat"] = pd.cut(train_df_1.NoRelatives, bins = [0, 2, 5, 10, 18], labels = ['0 - 2', '3 - 5', '6 - 10', '11 - 18'])

# Create dataframe FamilySizeCat_Transported: each family size category and the percentage transported
FamilySizeCat_Transported = train_df_1.groupby('FamilySizeCat').aggregate({'Transported': 'sum',
                                                                           'PassengerId': 'size'}).reset_index()
FamilySizeCat_Transported['TransportedPercentage'] = FamilySizeCat_Transported['Transported'] / FamilySizeCat_Transported['PassengerId']

# Visualize FamilySizeCat vs Transported
sns.barplot(x = "FamilySizeCat", y = "TransportedPercentage", data = FamilySizeCat_Transported, order = FamilySizeCat_Transported['FamilySizeCat'])
```

A larger percentage of smaller families was transported than of larger families. This could be because smaller families are wealthier families. Let's see how family size relates to spending. To do this, we will sum the amounts each passenger billed at the Spaceship Titanic's luxury amenities, then plot the total against FamilySizeCat.

```python
# Create total spendings feature
train_df_1["TotalSpendings"] = train_df_1["FoodCourt"] + \
                               train_df_1["ShoppingMall"] + \
                               train_df_1["RoomService"] + \
                               train_df_1["Spa"] + \
                               train_df_1["VRDeck"]

# FamilySizeCat vs TotalSpendings
plt.figure(figsize = (7, 5))
sns.boxplot(data = train_df_1, x = "FamilySizeCat", y = "TotalSpendings")
plt.ylim([-800, 12000])
```

Our hypothesis seems to be correct: passengers with a smaller family size appear to be wealthier.

Now let's visualize the numerical independent variables with respect to the target variable.

```python
# Transported vs Age
plt.figure(figsize = (7, 5))
sns.violinplot(x = train_df_1["Transported"], y = train_df_1["Age"])
plt.ylim([-50, 200])
```

It looks like a greater percentage of passengers between the ages of 0 and about 4 was transported than of older passengers.
We will create a new column, AgeCat, to confirm whether more younger passengers were transported compared to older passengers.

```python
# Extract Age Category column from Age column
train_df_1["AgeCat"] = pd.cut(train_df_1.Age, bins = [0.0, 4.0, 12.0, 19.0, 40.0, 60.0, 80.0],
                              labels = ['0 - 4', '5 - 12', '13 - 19', '20 - 40', '41 - 60', '61 - 80'])

AgeCat_Transported = train_df_1.groupby('AgeCat').aggregate({'Transported': 'sum',
                                                             'PassengerId': 'size'}).reset_index()

# Create dataframe AgeCat_Transported: each age category and the percentage transported
AgeCat_Transported['TransportedPercentage'] = AgeCat_Transported['Transported'] / AgeCat_Transported['PassengerId']

# Visualize AgeCat vs Transported
sns.barplot(x = "AgeCat", y = "TransportedPercentage", data = AgeCat_Transported, order = AgeCat_Transported['AgeCat'])
```

We can infer from the plot above that:

- About 74% of passengers within the age range of 0-4 were transported
- About 60% of passengers within the age range of 5-12 were transported

Now, let's do the same for the remaining numerical independent variables.

```python
# RoomService, FoodCourt, ShoppingMall, Spa, VRDeck
plt.figure(figsize = (7, 5))
sns.violinplot(x = train_df_1["Transported"], y = train_df_1["RoomService"])
plt.ylim([-500, 2000])

plt.figure(figsize = (7, 5))
sns.violinplot(x = train_df_1["Transported"], y = train_df_1["FoodCourt"])
plt.ylim([-900, 8000])

plt.figure(figsize = (7, 5))
sns.violinplot(x = train_df_1["Transported"], y = train_df_1["ShoppingMall"])
plt.ylim([-900, 6000])

plt.figure(figsize = (7, 5))
sns.violinplot(x = train_df_1["Transported"], y = train_df_1["Spa"])
plt.ylim([-800, 2000])

plt.figure(figsize = (7, 5))
sns.violinplot(x = train_df_1["Transported"], y = train_df_1["VRDeck"])
plt.ylim([-800, 2000])
```

Observations:

- The bills of transported passengers are concentrated near zero.
- VRDeck, Spa and RoomService appear to have a similar distribution, while ShoppingMall and FoodCourt appear to have a similar distribution.

We have seen how family size affects expenditure. Now let's see how being in CryoSleep relates to expenditure.

```python
# CryoSleep vs TotalSpendings
plt.figure(figsize = (7, 5))
sns.boxplot(data = train_df_1, x = "CryoSleep", y = "TotalSpendings")
plt.ylim([-900, 14000])
```

The plot shows that passengers in CryoSleep have 0 expenditure.

Now let's see how VIP status affects expenditure.

```python
# VIP vs TotalSpendings
plt.figure(figsize = (7, 5))
sns.boxplot(data = train_df_1, x = "VIP", y = "TotalSpendings")
plt.ylim([-800, 12000])
```

Passengers with VIP status have a higher expenditure than passengers without it.

Let's also see how the age category relates to a passenger's total spending.

```python
# AgeCat vs TotalSpendings
plt.figure(figsize = (7, 5))
sns.boxplot(data = train_df_1, x = "AgeCat", y = "TotalSpendings")
plt.ylim([-800, 12000])
```

From the plot above it can be inferred that:

- Passengers within the age range of 0-12 had 0 expenditure
- Expenditure increases with age

4. Data Cleaning and Preprocessing

After exploring the variables in our data, we can now impute the missing values and treat the outliers. First, let's drop the columns we created for the exploratory data analysis.
```python
train_df_2 = train_df_1.copy()

# Drop features created during EDA
train_df_2 = train_df_2.drop(["PassengerGroup",
                              "CabinDeck",
                              "CabinNo.",
                              "CabinSide",
                              "FamilyName",
                              "NoRelatives",
                              "NoInPassengerGroup",
                              "AgeCat",
                              "FamilySizeCat",
                              "TotalSpendings"], axis = 1)
```

We will combine the train and test data to make cleaning and preprocessing easier.

```python
# Save the target variable from the train dataset in target
target = train_df_2["Transported"]

# Save the test PassengerId in test_id
test_id = test_df_1["PassengerId"]

# Drop the Transported variable from the train set
train_df_3 = train_df_2.drop(["Transported"], axis = 1)

# Join the train and test sets
data = pd.concat([train_df_3, test_df], axis = 0).reset_index(drop = True)
```

Let's look at the shape of our new dataset.

```python
# Print shape of data
print(data.shape)
```

```
(12970, 13)
```

The dataset has 12970 rows and 13 columns. Let's look at the percentage of values missing in each column.

```python
# View percentage of values missing in each column
round(data.isna().sum() * 100 / data.shape[0], 3)
```

```
PassengerId     0.000
HomePlanet      2.221
CryoSleep       2.390
Cabin           2.305
Destination     2.113
Age             2.082
VIP             2.282
RoomService     2.028
FoodCourt       2.228
ShoppingMall    2.359
Spa             2.190
VRDeck          2.066
Name            2.267
dtype: float64
```

There are missing values in every column except PassengerId, but no column is missing more than a few percent of its values. We will treat the missing values in the categorical columns first, imputing with the mode.

```python
# Get the categorical columns and store them in list_missing_cat_columns
data_1 = data.copy()

list_missing_cat_columns = list((data_1.select_dtypes(['object', 'category']).isna().sum() > 0).index)
list_missing_cat_columns
```

```
['PassengerId', 'HomePlanet', 'CryoSleep', 'Cabin', 'Destination', 'VIP', 'Name']
```

```python
# Fill categorical columns with the mode
for col in list_missing_cat_columns:
    data_1[col] = data_1[col].fillna(data_1[col].mode()[0])
```
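One caveat: because we fill on the combined frame, the test rows also influence the imputation statistics, which leaks a little information from the test set. A leakage-safer variant (a sketch, assuming the default RangeIndex so that the first len(train_df_3) rows are the train rows) computes the fill values on the train rows only; the same idea applies to the mean imputation of the numeric columns below:

```python
# Compute fill values from the train portion only, then apply them to the whole frame
n_train = len(train_df_3)
for col in list_missing_cat_columns:
    train_mode = data_1.loc[:n_train - 1, col].mode()[0]
    data_1[col] = data_1[col].fillna(train_mode)
```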
Now let's find a way to fill the missing values in the numerical features.

```python
# Get numeric columns with missing values and store them in list_missing_numeric_col
list_missing_numeric_col = list((data_1.select_dtypes(np.number).isna().sum() > 0).index)
list_missing_numeric_col
```

```
['Age', 'RoomService', 'FoodCourt', 'ShoppingMall', 'Spa', 'VRDeck']
```

We will start with the amenity bills. While performing EDA we saw that RoomService, FoodCourt, ShoppingMall, Spa and VRDeck total 0 if a passenger's age is less than 13 or if they are in CryoSleep, so let's create a function to handle that.

```python
# Fill the amenity columns based on Age and CryoSleep
def fill_nans_by_age_and_cryosleep(df):
    df["RoomService"] = np.where((df["Age"] < 13) | (df["CryoSleep"] == True), 0, df["RoomService"])
    df["FoodCourt"] = np.where((df["Age"] < 13) | (df["CryoSleep"] == True), 0, df["FoodCourt"])
    df["ShoppingMall"] = np.where((df["Age"] < 13) | (df["CryoSleep"] == True), 0, df["ShoppingMall"])
    df["Spa"] = np.where((df["Age"] < 13) | (df["CryoSleep"] == True), 0, df["Spa"])
    df["VRDeck"] = np.where((df["Age"] < 13) | (df["CryoSleep"] == True), 0, df["VRDeck"])

    return df

data_1 = fill_nans_by_age_and_cryosleep(data_1)
```

Now, let's fill the remaining missing values using the mean.

```python
# Fill numeric columns with the mean
for col in list_missing_numeric_col:
    data_1[col] = data_1[col].fillna(data_1[col].mean())

data_1.isna().sum()
```

```
PassengerId     0
HomePlanet      0
CryoSleep       0
Cabin           0
Destination     0
Age             0
VIP             0
RoomService     0
FoodCourt       0
ShoppingMall    0
Spa             0
VRDeck          0
Name            0
dtype: int64
```

All the missing values in the dataset have now been filled.

Outlier Treatment

As we saw in our univariate analysis, RoomService, FoodCourt, ShoppingMall, Spa and VRDeck contain outliers, and we have to treat them because the presence of outliers affects the distribution of our data. To do this we will clip the outliers at the 99% quantile.

```python
# Clip outliers at the 99% quantile
def clipping_quantile(dataframe, quantile_values = None, quantile = 0.99):
    df = dataframe.copy()
    if quantile_values is None:
        quantile_values = df[["RoomService", "FoodCourt", "ShoppingMall", "Spa", "VRDeck"]].quantile(quantile)
    for num_column in ["RoomService", "FoodCourt", "ShoppingMall", "Spa", "VRDeck"]:
        num_values = df[num_column].values
        threshold = quantile_values[num_column]
        num_values = np.where(num_values > threshold, threshold, num_values)
        df[num_column] = num_values
    return df

data_1 = clipping_quantile(data_1, None, 0.99)
```
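The quantile_values parameter exists so the thresholds can be supplied externally. If you prefer to derive them from the train rows only, mirroring the imputation caveat above, a sketch (again assuming the first len(train_df_3) rows are the train rows) looks like this:

```python
# Derive the clipping thresholds from the train portion only and apply them everywhere
n_train = len(train_df_3)
spend_cols = ["RoomService", "FoodCourt", "ShoppingMall", "Spa", "VRDeck"]
train_quantiles = data_1.loc[:n_train - 1, spend_cols].quantile(0.99)
data_1 = clipping_quantile(data_1, train_quantiles, 0.99)
```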
Our dataset is now clean!

5. Feature Extraction and Feature Selection

Based on our EDA, let's create a function that builds new features that might affect the target variable.

```python
def extract_features(df):
    df["PassengerGroup"] = (df["PassengerId"].str.split('_', expand = True))[0]

    No_People_In_PassengerGroup = df.groupby('PassengerGroup').aggregate({'PassengerId': 'size'}).reset_index()
    No_People_In_PassengerGroup = No_People_In_PassengerGroup.rename(columns = {"PassengerId": "NoInPassengerGroup"})
    # Create IsAlone feature
    No_People_In_PassengerGroup["IsAlone"] = No_People_In_PassengerGroup["NoInPassengerGroup"].apply(lambda n: "Alone" if n == 1 else "Not Alone")
    df = df.merge(No_People_In_PassengerGroup[["PassengerGroup", "IsAlone"]], how = 'left', on = ['PassengerGroup'])

    # Create CabinDeck feature
    df["CabinDeck"] = df["Cabin"].str.split('/', expand = True)[0]
    # Create DeckPosition feature
    df["DeckPosition"] = df["CabinDeck"].apply(lambda deck: "Lower" if deck in ('A', 'B', 'C', 'D') else "Upper")
    # Create CabinSide feature
    df["CabinSide"] = df["Cabin"].str.split('/', expand = True)[2]

    # Create Regular feature
    df["Regular"] = df["FoodCourt"] + df["ShoppingMall"]
    # Create Luxury feature
    df["Luxury"] = df["RoomService"] + df["Spa"] + df["VRDeck"]
    # Create TotalSpendings feature
    df["TotalSpendings"] = df["RoomService"] + df["FoodCourt"] + df["ShoppingMall"] + df["Spa"] + df["VRDeck"]

    Wealthiest_Deck = df.groupby('CabinDeck').aggregate({'TotalSpendings': 'sum', 'PassengerId': 'size'}).reset_index()
    # Create DeckAverageSpent feature
    Wealthiest_Deck['DeckAverageSpent'] = Wealthiest_Deck['TotalSpendings'] / Wealthiest_Deck['PassengerId']
    df = df.merge(Wealthiest_Deck[["CabinDeck", "DeckAverageSpent"]], how = 'left', on = ['CabinDeck'])

    df["FamilyName"] = df["Name"].str.split(' ', expand = True)[1]
    # Create NoRelatives feature
    NoRelatives = df.groupby('FamilyName')['PassengerId'].count().reset_index()
    NoRelatives = NoRelatives.rename(columns = {"PassengerId": "NoRelatives"})
    df = df.merge(NoRelatives[["FamilyName", "NoRelatives"]], how = 'left', on = ['FamilyName'])
    # Create FamilySizeCat feature
    df["FamilySizeCat"] = pd.cut(df.NoRelatives, bins = [0, 2, 5, 10, 300], labels = ['0 - 2', '3 - 5', '6 - 10', '11 - 300'])

    return df

data_2 = data_1.copy()
data_2 = extract_features(data_2)
```

Let's now drop the variables we used to create these features, since they are no longer relevant, to remove noise from the dataset.

```python
data_3 = data_2.copy()
irrelevant_columns = ["Cabin", "PassengerId", "RoomService", "FoodCourt", "ShoppingMall", "Spa",
                      "VRDeck", "Name", "PassengerGroup", "FamilyName"]
data_3 = data_3.drop(irrelevant_columns, axis = 1)

data_3.shape
```

```
(12970, 15)
```

Now, we will convert our categorical data into model-understandable numerical data.

```python
# One-hot encoding
data_3 = pd.get_dummies(data_3, columns = ['HomePlanet', 'CryoSleep', 'Destination', 'VIP', 'CabinSide', 'IsAlone'])

# Ordinal encoding
for col in ['CabinDeck', 'DeckPosition', 'FamilySizeCat']:
    data_3[col], _ = data_3[col].factorize()
```

Next, we will split the data back to get the train and test sets.

```python
# Split the data back into the train and test sets
data_4 = data_3.copy()
train_data_final = data_4.loc[:train_df.index.max(), :].copy()
test_data_final = data_4.loc[train_df.index.max() + 1:, :].reset_index(drop = True).copy()
```
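Note that this split relies on train_df still having its default RangeIndex, since .loc slices by label here. An equivalent, slightly more robust sketch splits by row count instead:

```python
# Split by row count rather than index labels
n_train = len(train_df)
train_data_final = data_4.iloc[:n_train].copy()
test_data_final = data_4.iloc[n_train:].reset_index(drop = True).copy()
```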
Let's print the shape of the train and test data to be sure we split the data correctly.

```python
# Print shape of final train data
print(train_data_final.shape)

# Print shape of final test data
print(test_data_final.shape)
```

```
(8693, 23)
(4277, 23)
```

6. Baseline Model Performance and Model Building

It is time to prepare the data for feeding into the models.

```python
X = train_data_final.copy()

# Save the target variable in y
y = target.astype(int)
```

Feature selection always plays a key role in model building. We will perform a χ² (chi-squared) test to retrieve the 22 best features as follows.

```python
# Univariate feature selection
chi_selector = SelectKBest(chi2, k = 22).fit(X, y)

chi_support = chi_selector.get_support()
chi_feature = X.loc[:, chi_support].columns
chi_feature
```

```
Index(['Age', 'CabinDeck', 'DeckPosition', 'Regular', 'Luxury',
       'TotalSpendings', 'DeckAverageSpent', 'NoRelatives', 'FamilySizeCat',
       'HomePlanet_Earth', 'HomePlanet_Europa', 'HomePlanet_Mars',
       'CryoSleep_False', 'CryoSleep_True', 'Destination_55 Cancri e',
       'Destination_TRAPPIST-1e', 'VIP_False', 'VIP_True', 'CabinSide_P',
       'CabinSide_S', 'IsAlone_Alone', 'IsAlone_Not Alone'],
      dtype='object')
```

Next, for our model we will use Random Forest, a tree-ensemble algorithm, and try to improve its accuracy.

```python
X = X[chi_feature]

# Baseline model
baseline_model = RandomForestClassifier(random_state = 1)
baseline_model.fit(X, y)
```

We will use the cross-validation score to estimate the accuracy of our baseline model.

```python
# Store the accuracy of the baseline model predictions in result
result = cross_val_score(baseline_model, X, y, cv = 20, scoring = "accuracy")

# Print mean and standard deviation of the baseline model's accuracy
print(np.mean(result))
print(np.std(result))
```

```
0.7889098998887654
0.01911345656998776
```

We got a mean accuracy of 78.9%. Now we will try to improve it by tuning the model's hyperparameters, using grid search to find optimized values. Grid search selects the best of a family of hyperparameters, parameterized by a grid of parameter values. We will tune max_depth, which sets the maximum depth of each tree, and n_estimators, which sets the number of trees used in the random forest.

```python
# Range for max_depth: 1 to 19 in steps of 2
# Range for n_estimators: 1 to 181 in steps of 20
paramgrid = {'max_depth': list(range(1, 20, 2)),
             'n_estimators': list(range(1, 200, 20))}
grid_search = GridSearchCV(RandomForestClassifier(random_state = 1), paramgrid)

# Fit the grid search model
grid_search.fit(X, y)

# Retrieve the optimized estimator
grid_search.best_estimator_
```

```
RandomForestClassifier(max_depth=11, n_estimators=101, random_state=1)
```

So the optimized value for max_depth is 11 and for n_estimators is 101. Building the final model with these optimized values and checking its cross-validated accuracy gives:

```
0.8047907728163567
0.018872624931449773
```

The model now has a mean accuracy of 80.5%, which is an improvement.
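The code for this step mirrors the baseline evaluation. A minimal sketch using the optimized values above (final_model is my name for the estimator, not from the original):

```python
# Final model with the tuned hyperparameters
final_model = RandomForestClassifier(max_depth = 11, n_estimators = 101, random_state = 1)
final_model.fit(X, y)

# Cross-validated accuracy of the tuned model
result = cross_val_score(final_model, X, y, cv = 20, scoring = "accuracy")
print(np.mean(result))
print(np.std(result))
```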
It's time to make predictions for the test dataset using our selected features.

7. Submission and Feature Importance

Before we make our submission, let's import the sample submission file to see the format our submission should have. As we can see, we only need PassengerId and Transported for the final submission. To build it, we will use the test set's PassengerId and our predictions, remembering to convert 0 to False and 1 to True (a sketch of this step, together with the feature-importance plot, is included at the end of the article).

Feature importance lets you understand the relationship between the features and the target variable. Let's plot the feature importances to see which features matter most to the model and which are irrelevant.

We can see from the plot that Luxury is the most important feature, followed by TotalSpendings and Regular. So, feature engineering helped us in predicting the target variable.

I hope you enjoyed reading. You can find my code on GitHub.
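For reference, here is a minimal sketch of both steps described above. It assumes final_model from the previous section, plus test_data_final, chi_feature and test_id defined earlier; the exact plotting details in the original may differ:

```python
# Predict on the test set using the selected features
test_X = test_data_final[chi_feature]
predictions = final_model.predict(test_X)

# Build the submission: PassengerId plus Transported as True/False
submission = pd.DataFrame({"PassengerId": test_id,
                           "Transported": predictions.astype(bool)})
submission.to_csv("submission.csv", index = False)

# Plot the feature importances of the final model
importances = pd.Series(final_model.feature_importances_, index = chi_feature).sort_values()
importances.plot.barh(figsize = (8, 8))
plt.xlabel("Feature importance")
plt.show()
```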