Group Project Final Report for Survival Rate on the Titanic

British Columbia Institute of Technology
COMP 7481 – Data Mining
April 7, 2009

Jonathan Dotto    Nai-Chun Hsu    Kevin Wong
(A00619956)       (A00585230)     (A00619711)

Table of Contents

Executive Summary
Introduction/Background Information
Problem Statement
General Approach
Limitations/Constraints
Conclusions

Executive Summary

The goal of this data mining application is to determine whether the women and children first policy was followed when the Titanic sank. Using the Classification method in Data Mining and the Decision Tree algorithm, we created a training model based on the women and children first policy. This model predicts that only adult women and children (both male and female) survive. We then validated the model using the dataset provided by Radford Neal from the University of Toronto.
The results show that the policy was followed fairly closely, but other factors also affected a person’s fate on the Titanic.

Introduction/Background Information

The Titanic was an Olympic-class passenger liner that struck an iceberg on April 14, 1912, and sank during its maiden voyage from Britain to the United States. The dataset consists of the gender, age (adult or child), class (1st, 2nd, or 3rd class passenger, or crew member), and survival outcome of each of the 2201 people who were on board. Major sources do not completely agree on the total number of people who were aboard the ship when it sank; our dataset was collected from the article "Report on the Loss of the ‘Titanic’ (S.S.)".

Problem Statement

The dataset chosen for this project consists of 2201 records, each describing one person who was on board the Titanic when it sank and whether that person survived. A maritime museum would like to analyze the data and develop a data mining project based on the correlation between the dataset attributes. The goal of the project is to show the success rate of the women and children first policy, which was in effect during that time period. Another goal of this data mining application is to analyze the percentages of the crew and of the 1st, 2nd, and 3rd class passengers who survived, which can be used to determine the effect that an individual’s social class had on survival rate.

General Approach

The 10 Steps of Data Mining Methodology in Berry & Linoff are followed throughout the project development.

1. Translate into DM problem

The purpose of this data mining project is to analyze the data and develop a model set to determine whether a person’s social class, sex, or age had an effect on his or her chance of survival on the Titanic.
2. Select Appropriate Data

The original source files (titanic.doc and titanic.dat) were obtained from the data archive of the on-line Journal of Statistics Education. The newer version of this dataset was compiled by Robert J. MacG. Dawson, who discussed it in the on-line article ‘The “Unusual Episode” Data Revisited’, Journal of Statistics Education, vol. 3, no. 3 (1995). In the newer version, carriage returns at the end of the lines were deleted, as was a line containing a period at the end of each file. The dataset was then converted for use in DELVE by Radford Neal in 1996. This is the version we use for this data mining project.

3. Get to Know the Data

The dataset was reconstructed from sources that were not completely clear; therefore, it may contain errors such as inconsistent data and false information.

Dataset Profile (Training Set):
Number of attributes: 4
Number of cases: 16

Dataset Profile (Test Set: Women and Children):
Number of attributes: 4
Number of cases: 534

Dataset Profile (Test Set: Original dataset):
Number of attributes: 4
Number of cases: 2201

Contributed by: Radford Neal, Professor, Dept of Statistics and Dept of Computer Science, University of Toronto
radfordneal@gmail.com

VARIABLE DESCRIPTIONS:

Column   Variable
1        Class (0 = crew, 1 = first, 2 = second, 3 = third)
10       Age (1 = adult, 0 = child)
19       Sex (1 = male, 0 = female)
28       Survived (1 = yes, 0 = no)

Refer to Appendix A for more information.

4. Create a Model Set

The Classification method in Data Mining and the Decision Tree algorithm are used to analyze the relationship between the attributes and the classes. The Classification method was chosen because the original problem statement can be easily transformed into two classes. These two classes represent people who survived and people who didn’t survive.
By applying them in the Data Mining application, we can build a model set based on the two classes using the Decision Tree algorithm. This algorithm is used because we expect the tree to be small, and a small decision tree can be understood and interpreted easily. The other reason we use the Decision Tree is that all attributes are categorical, which makes them easy to work with.

5. Fix Problems with Data

Data inconsistency and false information are the potential problems to deal with when developing a Data Mining application with this dataset. According to the dataset contributor, there were no missing values or duplicates. However, since this dataset has been compiled and reconstructed by different people, it may contain false information. This may affect the results when comparing the models, but we decided not to fix it because we have no way of knowing the validity of the given information in the first place.

6. Transform Data

Test Set: Women and Children

This test set is generated by deleting the “Survived” column from the original dataset. All the adult male cases in the original dataset are also removed. The resulting set is used to validate the training model and measure its accuracy.

7. Build Models

Training Model:

The training model (see Figure 1) was created in RapidMiner based on the women and children first policy. We created our own training set (see Appendix B) based on this policy. It includes all sixteen combinations of class, gender, and age that a person on the Titanic could have. In this data, everyone survives except those classified as adult males. Note that the training model created by RapidMiner did not include the class attribute (crew, 1st, 2nd, 3rd), since it had no effect on the survival outcome.
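The observation that the class attribute does not help on the policy-based training set can be checked directly with the Gini index. The following is a minimal Python sketch, not RapidMiner's implementation: the helper names `gini` and `gini_gain` are hypothetical, and the training rows are rebuilt from the rule stated in the text (everyone survives except adult males).

```python
from collections import Counter
from itertools import product

# Assumed reconstruction of the policy-based training set:
# every (class, sex, age) combination, labelled "no" only for adult males.
rows = [
    {"class": c, "sex": s, "age": a,
     "survived": "no" if (s == "male" and a == "adult") else "yes"}
    for c, s, a in product(["1st", "2nd", "3rd", "crew"],
                           ["male", "female"],
                           ["adult", "child"])
]


def gini(labels):
    """Gini impurity: 1 minus the sum of squared class proportions."""
    n = len(labels)
    return 1.0 - sum((count / n) ** 2 for count in Counter(labels).values())


def gini_gain(rows, attribute):
    """Impurity of the full set minus the weighted impurity after a split."""
    parent = gini([r["survived"] for r in rows])
    weighted = 0.0
    for value in {r[attribute] for r in rows}:
        subset = [r["survived"] for r in rows if r[attribute] == value]
        weighted += len(subset) / len(rows) * gini(subset)
    return parent - weighted


# Gain obtained by splitting the 16 rows on each attribute in turn.
gains = {attr: gini_gain(rows, attr) for attr in ("class", "sex", "age")}
for attr, gain in gains.items():
    print(f"{attr}: {gain:.4f}")
```

Under these assumptions, splitting on class leaves the impurity unchanged (a gain of 0), while splitting on sex or on age reduces it by 0.125 each, which is consistent with a tree that uses only sex and age.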
Criterion: GINI Index
Max Leaf Nodes: 2
Maximal Depth: 10
Confidence: 0.25

Figure 1 – Decision tree created by RapidMiner based on GINI index. This model represents the women and children first policy. Refer to Appendix B for the training set used to create this model.

Criterion: Accuracy
Max Leaf Nodes: 2
Maximal Depth: 10
Confidence: 0.25

Figure 2 – Decision tree created by RapidMiner based on accuracy. This model represents the likelihood of survival based on all 2201 cases.

Criterion: GINI Index
Max Leaf Nodes: 2
Maximal Depth: 10
Confidence: 0.25

Figure 3 – Decision tree created by RapidMiner based on GINI index. This model represents the likelihood of survival based on all 2201 cases.

8. Assess Models

In order to validate our models, we produced a confusion matrix for each model against the women and children test set we created and against the entire data set.

Figure 4 shows the confusion matrix for the model (Figure 1) we created (theorized on the women and children first policy) against the women and children test data. The accuracy of this model is 69.85% and the error rate is 30.15%.

Women/Children Tree Against Women/Children Test Set Data

                    Actually Survived   Actually Died   Class Precision
Predict Survived    373                 161             69.85%
Predict Died        0                   0               0.00%
Class Recall        100.00%             0.00%

Figure 4 – Women and children model (Figure 1) against women and children test set data. The accuracy of this model is 69.86% +/- 1.11%. The error rate of this model is 30.15%.

Figure 5 shows the confusion matrix for the model (Figure 1) we created (theorized on the women and children first policy) against the entire data set. The accuracy of this model is 77.33% and the error rate is 22.67%.
Women/Children Tree Against Entire Test Set Data

                    Actually Survived   Actually Died   Class Precision
Predict Survived    373                 161             69.85%
Predict Died        338                 1329            79.72%
Class Recall        52.46%              89.19%

Figure 5 – Women and children model (Figure 1) against the entire data set. The accuracy of this model is 77.33% +/- 3.11%. The error rate of this model is 22.67%.

Figure 6 shows the confusion matrix for the generated models (Figure 2/3) against the women and children test data. Note that both models produced the same results. The accuracy of this model is 76.98% and the error rate is 23.03%.

Entire Tree Against Women/Children Test Set Data

                    Actually Survived   Actually Died   Class Precision
Predict Survived    270                 20              93.10%
Predict Died        103                 141             57.79%
Class Recall        72.39%              87.58%

Figure 6 – Entire model (Figure 2/3) against women and children test set data. The accuracy of this model is 76.98% +/- 3.45%. The error rate of this model is 23.03%.

Figure 7 shows the confusion matrix for the generated models (Figure 2/3) against the entire data set. Note that both models produced the same results. The accuracy of this model is 79.06% and the error rate is 20.95%.

Entire Tree Against Entire Test Set Data

                    Actually Survived   Actually Died   Class Precision
Predict Survived    270                 20              93.10%
Predict Died        441                 1470            76.92%
Class Recall        37.97%              98.66%

Figure 7 – Entire model (Figure 2/3) against the entire data set. The accuracy of this model is 79.06% +/- 1.45%. The error rate of this model is 20.95%.

9. Deploy Models

Based on the confusion matrices in the previous section, Figures 8 and 9 summarize the results. When comparing the two trees on both test sets, the accuracy and the class precision for survivors increased when using the entire decision tree model.
This shows that other factors contributed to a person’s survival; in this case, the person’s class (crew, 1st, 2nd, 3rd) played an important role.

Women and Children (Test Set)

Model                                       Accuracy   Error Rate   Class Precision (Survived)   Class Precision (Died)
Women/Children Decision Tree (Figure 1)     69.85%     30.15%       69.85%                       0.00%
Entire Decision Tree (Figure 2/3)           76.97%     23.03%       93.10%                       57.79%

Figure 8 – Summary of the accuracy, error rate and class precision of the models based on the women and children only data.

Entire Data (Test Set)

Model                                       Accuracy   Error Rate   Class Precision (Survived)   Class Precision (Died)
Women/Children Decision Tree (Figure 1)     77.33%     22.67%       69.85%                       79.72%
Entire Decision Tree (Figure 2/3)           79.05%     20.95%       93.10%                       76.92%

Figure 9 – Summary of the accuracy, error rate and class precision of the models based on the entire data.

10. Assess Results

Based on the results from the previous sections:
- The women and children first policy was followed fairly closely.
- The class of a person played an important role.
- The majority of the women and children who died were third class.

Limitations/Constraints

1 False Information

Our dataset has been compiled and reconstructed by different people according to the original contributor; therefore, information contained in the dataset may be inconsistent between different versions. We have no way of knowing the validity of the information contained in the latest version of the dataset, so the project results may not fully reflect what really happened when the Titanic sank.

2 Limited Number of Attributes

Our dataset contains 3 attributes (sex, age, and class) to classify the survival of the people on the Titanic. Other factors could have affected their survival, such as their ethnicity and fitness level.
Some reports have noted that a greater percentage of British people died (regardless of their class) because they politely waited in line for the lifeboats.

Conclusions

After analyzing the survival outcomes of the people on the Titanic, it becomes clear that the policy of saving the women and children first did have a noticeable effect on the survival rate of that group. The women and children first decision tree (Figure 1) predicted the actual outcome of the women and children with an accuracy of 69.85%, which is well above the overall survival rate of 32.30% (711 of 2201) and indicates that women and children were prioritized over the adult men. However, the decision trees generated from the entire dataset (Figures 2 and 3) show that 3rd class passengers were more likely to die than to survive, regardless of their gender or age. Overall, the policy of saving the women and children first did give them a much greater chance of being saved, but, as was shown with the 3rd class passengers, other factors, including social class, also influenced the survival rates.
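The accuracy, error rate, class precision, and class recall quoted in Figures 4 to 9 can be recomputed from the four confusion matrix counts. The sketch below is an illustration in Python, not output from RapidMiner; the function name `confusion_metrics` and its argument names are hypothetical, and a class with no predictions is assumed to be reported as 0%, as in Figure 4.

```python
def confusion_metrics(tp, fp, fn, tn):
    """Metrics for a 2x2 confusion matrix with "survived" as the positive class.

    tp: predicted survived, actually survived
    fp: predicted survived, actually died
    fn: predicted died, actually survived
    tn: predicted died, actually died
    """
    def ratio(numerator, denominator):
        # Assumption: an empty row or column is reported as 0 rather than raising.
        return numerator / denominator if denominator else 0.0

    total = tp + fp + fn + tn
    accuracy = ratio(tp + tn, total)
    return {
        "accuracy": accuracy,
        "error_rate": 1.0 - accuracy,
        "precision_survived": ratio(tp, tp + fp),
        "precision_died": ratio(tn, tn + fn),
        "recall_survived": ratio(tp, tp + fn),
        "recall_died": ratio(tn, tn + fp),
    }


# Counts taken from Figure 5 (women and children tree against the entire data set).
figure5 = confusion_metrics(tp=373, fp=161, fn=338, tn=1329)
for name, value in figure5.items():
    print(f"{name}: {value * 100:.2f}%")
```

With the Figure 5 counts, this gives an accuracy of 77.33% and an error rate of 22.67%, in agreement with the values reported for that figure.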
Appendix A – Data Analysis

Table 1 - Number of People Breakdown

Attribute   Value        Count   Total
Class       Crew (0)     885
            First (1)    325
            Second (2)   285
            Third (3)    706     2201
Age         Adult (1)    2092
            Child (0)    109     2201
Sex         Male (1)     1731
            Female (0)   470     2201
Survived    Yes (1)      711
            No (0)       1490    2201

Table 2 - Survival Breakdown By Class, Age, Sex

            Class                                         Age                     Sex
Survived    Crew (0)   First (1)   Second (2)   Third (3)   Adult (1)   Child (0)   Male (1)   Female (0)
Yes         212        203         118          178         654         57          367        344
No          673        122         167          528         1438        52          1364       126

Table 3 - Crew Survival Breakdown by Age, Sex

            Age                     Sex
Survived    Adult (1)   Child (0)   Male (1)   Female (0)
Yes         212         0           192        20
No          673         0           670        3

Table 4 - First Class Survival Breakdown by Age, Sex

            Age                     Sex
Survived    Adult (1)   Child (0)   Male (1)   Female (0)
Yes         197         6           62         141
No          122         0           118        4

Table 5 - Second Class Survival Breakdown by Age, Sex

            Age                     Sex
Survived    Adult (1)   Child (0)   Male (1)   Female (0)
Yes         94          24          25         93
No          167         0           154        13

Table 6 - Third Class Survival Breakdown by Age, Sex

            Age                     Sex
Survived    Adult (1)   Child (0)   Male (1)   Female (0)
Yes         151         27          88         90
No          476         52          422        106

Appendix B – Training Set

Class   Gender   Age     Survived
1st     male     adult   no
1st     male     child   yes
1st     female   adult   yes
1st     female   child   yes
2nd     male     adult   no
2nd     male     child   yes
2nd     female   adult   yes
2nd     female   child   yes
3rd     male     adult   no
3rd     male     child   yes
3rd     female   adult   yes
3rd     female   child   yes
crew    male     adult   no
crew    male     child   yes
crew    female   adult   yes
crew    female   child   yes
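The training set in Appendix B follows a single rule, so it can be regenerated rather than typed by hand. The following is a minimal Python sketch under that assumption; the names `CLASSES`, `GENDERS`, `AGES`, and `policy_outcome` are hypothetical and are not part of the RapidMiner process used in the project.

```python
from itertools import product

CLASSES = ["1st", "2nd", "3rd", "crew"]
GENDERS = ["male", "female"]
AGES = ["adult", "child"]


def policy_outcome(gender, age):
    """Women and children first: only adult males are labelled as not surviving."""
    return "no" if gender == "male" and age == "adult" else "yes"


# One row per (class, gender, age) combination, in the same order as Appendix B.
training_set = [
    (cls, gender, age, policy_outcome(gender, age))
    for cls, gender, age in product(CLASSES, GENDERS, AGES)
]

# Print the rows as a plain-text table.
print(f"{'Class':<8}{'Gender':<9}{'Age':<8}Survived")
for cls, gender, age, survived in training_set:
    print(f"{cls:<8}{gender:<9}{age:<8}{survived}")
```

The sketch produces sixteen rows, four of which (the adult males of each class) are labelled "no".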