Group Project Final Report for
Survival Rate on the Titanic
British Columbia Institute of Technology
COMP 7481 – Data Mining
April 7, 2009
Jonathan Dotto (A00619956)
Nai-Chun Hsu (A00585230)
Kevin Wong (A00619711)
Table of Contents
Executive Summary
Introduction/Background Information
Problem Statement
General Approach
Limitations/Constraints
Conclusions
Executive Summary
The goal of this data mining application is to determine whether the women and children
first policy was followed when the Titanic sank. Using the Classification method in
Data Mining and the Decision Tree algorithm, we created a training model based on the
women and children first policy. In this model, only the adult women and the children
(both male and female) survive. We then validated the model using the dataset provided by
Radford Neal of the University of Toronto. The results show that the policy was followed
fairly closely, but that other factors also affected a person’s fate on the Titanic.
Introduction/Background Information
The Titanic was an Olympic-class passenger liner that struck an iceberg on April 14, 1912,
during its maiden voyage from Britain to the United States, and sank early the following
morning. The dataset consists of the gender, age (adult or child), class (1st, 2nd, 3rd class
passenger or a crew member), and survival outcome of each of the 2201 people who were on
board. Major sources do not completely agree on the total number of people who were aboard
the ship when it sank; our dataset was collected from the article "Report on the Loss of the
‘Titanic’ (S.S.)".
Problem Statement
The dataset chosen for this project consists of 2201 records, one for each person who was
on board the Titanic when it sank, and states whether that person survived. A maritime
museum would like to analyze the data and develop a data mining project based on the
correlation between the dataset attributes. The goal of the project is to show people the
success rate of the women and children first policy, which was in effect during that time
period. Another goal of this data mining application is to analyze the percentages of the
crew and of the 1st, 2nd, and 3rd class passengers who survived, which can be used to
determine the effect that an individual’s social class had on survival rate.
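The per-class percentages mentioned above can be sketched directly from the counts in Table 2 of Appendix A. This is only an illustration of the arithmetic, not the Rapid Miner process used in the project; the names counts_by_class and survival_rate are our own.

```python
# Survived / did-not-survive counts per class, copied from Table 2 in Appendix A.
counts_by_class = {
    "Crew": (212, 673),
    "First": (203, 122),
    "Second": (118, 167),
    "Third": (178, 528),
}


def survival_rate(survived: int, died: int) -> float:
    """Return the share of a group that survived, as a percentage."""
    return 100.0 * survived / (survived + died)


# Survival percentage for each class of person on board.
rates = {name: survival_rate(*counts) for name, counts in counts_by_class.items()}

for name, rate in rates.items():
    print(f"{name:<7}{rate:6.2f}%")
```

First class has the highest rate in this calculation and crew the lowest, with third class only slightly above crew.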
General Approach
The 10 Steps of Data Mining Methodology in Berry & Linoff will be followed throughout the
project development.
1. Translate into DM problem
The purpose of this data mining project is to analyze the data and develop a model set to
determine whether or not a passenger’s social class, sex or age has an effect on his or
her survival rate on the Titanic.
2. Select Appropriate Data
The original source files (titanic.doc and titanic.dat) were obtained from the data archive
of the on-line Journal of Statistics Education. The newer version of this dataset was
compiled by Robert J. MacG Dawson. He discussed this in the on-line article ‘The
“Unusual Episode” Data Revisited’, Journal of Statistics Education, vol. 3, no.3 (1995). In
the newer version, carriage returns at the end of the lines were deleted, as was a line
containing a period at the end of each file. The dataset was then converted for use in
DELVE by Radford Neal in 1996. This is the version we use for this data mining project.
3. Get to Know the Data
The dataset was reconstructed from sources that were not completely clear, so it may
contain errors such as inconsistent data and false information.
Dataset Profile (Training Set):
Number of attributes: 4
Number of cases: 16
Dataset Profile (Test Set: Women and Children):
Number of attributes: 4
Number of cases: 534
Dataset Profile (Test Set: Original dataset):
Number of attributes: 4
Number of cases: 2201
Contributed by: Radford Neal, Professor, Dept of Statistics and Dept of Computer Science,
University of Toronto radfordneal@gmail.com
VARIABLE DESCRIPTIONS:
Column   Variable   Coding
1        Class      0 = crew, 1 = first, 2 = second, 3 = third
10       Age        1 = adult, 0 = child
19       Sex        1 = male, 0 = female
28       Survived   1 = yes, 0 = no
Refer to Appendix A for more information.
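The numeric coding in the variable descriptions can be turned back into readable labels with a small lookup. The sketch below assumes that each record is a line of four whitespace-separated integer codes in the order Class, Age, Sex, Survived; the exact file layout is an assumption, and the example line is hypothetical rather than taken from the dataset.

```python
# Code tables copied from the variable descriptions.
CLASS_LABELS = {0: "crew", 1: "first", 2: "second", 3: "third"}
AGE_LABELS = {1: "adult", 0: "child"}
SEX_LABELS = {1: "male", 0: "female"}
SURVIVED_LABELS = {1: "yes", 0: "no"}


def decode_record(line: str) -> dict:
    """Decode one record (assumed: four whitespace-separated integer codes)."""
    cls, age, sex, survived = (int(token) for token in line.split())
    return {
        "class": CLASS_LABELS[cls],
        "age": AGE_LABELS[age],
        "sex": SEX_LABELS[sex],
        "survived": SURVIVED_LABELS[survived],
    }


# Hypothetical line, for illustration only.
example = decode_record("3        0        1        0")
print(example)
```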
4. Create a Model Set
The Classification method in Data Mining and the Decision Tree algorithm will be used to
analyze the relationship between the attributes and the classes. The Classification
method is chosen because the original problem statement can be easily transformed into
two classes: people who survived and people who didn’t survive. By applying them in the
Data Mining application, we can build a model set based on the two classes using the
Decision Tree algorithm. This algorithm is used because we expect the tree to be small,
and a small Decision Tree can be understood and interpreted easily. The other reason we
use the Decision Tree is that all attributes are categorical, which makes them easy to
work with.
5. Fix Problems with Data
Data inconsistency and false information are the potential problems to deal with when
developing a Data Mining application with this dataset. According to the dataset
contributor, there are no missing values or duplicates. However, since this dataset
has been compiled and reconstructed by different people, it potentially contains false
information. This may affect the results when comparing the models, but we decided not
to fix it because we have no way of verifying the given information in the first place.
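Although false information cannot be detected from the data alone, internal consistency can be checked. The sketch below is our own illustration (the helper find_inconsistencies is hypothetical): it verifies that the counts in Tables 1 and 2 of Appendix A add up to 2201 and agree with each other.

```python
EXPECTED_TOTAL = 2201

# Counts per category, copied from Table 1 in Appendix A.
table1 = {
    "Class": {"Crew": 885, "First": 325, "Second": 285, "Third": 706},
    "Age": {"Adult": 2092, "Child": 109},
    "Sex": {"Male": 1731, "Female": 470},
    "Survived": {"Yes": 711, "No": 1490},
}

# (survived, did not survive) per category, copied from Table 2 in Appendix A.
table2 = {
    "Class": {"Crew": (212, 673), "First": (203, 122),
              "Second": (118, 167), "Third": (178, 528)},
    "Age": {"Adult": (654, 1438), "Child": (57, 52)},
    "Sex": {"Male": (367, 1364), "Female": (344, 126)},
}


def find_inconsistencies() -> list:
    """Return a description of every total that does not add up."""
    problems = []
    for attribute, categories in table1.items():
        if sum(categories.values()) != EXPECTED_TOTAL:
            problems.append(f"{attribute} does not sum to {EXPECTED_TOTAL}")
    for attribute, categories in table2.items():
        for name, (yes, no) in categories.items():
            if yes + no != table1[attribute][name]:
                problems.append(f"{attribute}/{name} disagrees with Table 1")
    return problems


problems = find_inconsistencies()
print(problems if problems else "All totals are consistent.")
```

A check like this only shows that the tables agree arithmetically; it says nothing about whether the underlying records are historically accurate.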
6. Transform Data
Test Set: Women and Children
This test set is generated by removing all the adult male cases from the original dataset
and deleting the “Survived” column. It is used to validate the training model and measure
its accuracy.
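The transformation can be sketched as a simple filter. The rows below are hypothetical and only follow the dataset's coding (class, age, sex, survived); they are not taken from the real file. Keeping the removed "Survived" values in a separate list is our own assumption, made so that predictions can later be compared with the actual outcomes.

```python
# Hypothetical rows in the dataset's coding (class, age, sex, survived).
# They are illustrative only and are not taken from the real file.
records = [
    (1, 1, 1, 0),  # first class, adult, male, did not survive
    (3, 0, 1, 1),  # third class, child, male, survived
    (2, 1, 0, 1),  # second class, adult, female, survived
    (0, 1, 1, 1),  # crew, adult, male, survived
    (3, 1, 0, 0),  # third class, adult, female, did not survive
]

ADULT, MALE = 1, 1


def women_and_children(rows):
    """Drop adult males, then split the rest into attributes and outcomes."""
    kept = [row for row in rows if not (row[1] == ADULT and row[2] == MALE)]
    features = [row[:3] for row in kept]  # "Survived" column removed
    labels = [row[3] for row in kept]     # held back to measure accuracy
    return features, labels


features, labels = women_and_children(records)
print(features)
print(labels)
```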
7. Build Models
Training Model:
The training model (see Figure 1) was created through Rapid Miner based on the woman
and children first policy. We created our own training set data (see Appendix B) based
on this policy. It includes all sixteen combinations (class, gender, age) of a
person who was on the Titanic. In the data, everyone survived except those who were
classified as an adult male. Note that the training model created through Rapid Miner
did not include the class attribute (crew, 1st, 2nd, 3rd), since it had no effect on the
outcome for survival.
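The sixteen-row training set can be generated from the policy rule itself. The following is a minimal sketch of that rule, not the procedure we used to enter the data into Rapid Miner; the function name policy_outcome is our own.

```python
from itertools import product

CLASSES = ["1st", "2nd", "3rd", "crew"]
GENDERS = ["male", "female"]
AGES = ["adult", "child"]


def policy_outcome(gender: str, age: str) -> str:
    """Women and children first: everyone survives except adult males."""
    return "no" if gender == "male" and age == "adult" else "yes"


# One row per (class, gender, age) combination, in the order of Appendix B.
training_set = [
    (cls, gender, age, policy_outcome(gender, age))
    for cls, gender, age in product(CLASSES, GENDERS, AGES)
]

for row in training_set:
    print(*row)
```

Because the outcome depends only on gender and age, the class attribute carries no information in this training set, which is consistent with the tree ignoring it.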
Figure 1 – Decision tree created by Rapid Miner based on GINI index. This model represents the woman and children
first policy. Refer to Appendix B for the training set used to create this model.
Criterion: GINI Index; Max Leaf Notes: 2; Maximal Depth: 10; Confidence: 0.25
Figure 2 - Decision tree created by Rapid Miner based on accuracy. This model represents the likelihood of survival
based on all 2201 cases.
Criterion: Accuracy; Max Leaf Notes: 2; Maximal Depth: 10; Confidence: 0.25
Figure 3 - Decision tree created by Rapid Miner based on GINI index. This model represents the likelihood of survival
based on all 2201 cases.
Criterion: GINI Index; Max Leaf Notes: 2; Maximal Depth: 10; Confidence: 0.25
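To illustrate what the GINI index criterion measures, the sketch below compares the three attributes as candidates for the first split, using the counts in Table 2 of Appendix A. It uses the textbook definition of Gini impurity; Rapid Miner's own implementation (including pruning and the other parameters listed with the figures) may differ in detail, so this is an illustration rather than a reproduction of the trees.

```python
def gini(survived: int, died: int) -> float:
    """Gini impurity of a group with two outcomes."""
    total = survived + died
    if total == 0:
        return 0.0
    p = survived / total
    return 1.0 - p ** 2 - (1.0 - p) ** 2


def weighted_gini(groups) -> float:
    """Average impurity of the groups produced by a split, weighted by size."""
    groups = list(groups)
    total = sum(s + d for s, d in groups)
    return sum((s + d) / total * gini(s, d) for s, d in groups)


# (survived, did not survive) per category, copied from Table 2 in Appendix A.
splits = {
    "Class": [(212, 673), (203, 122), (118, 167), (178, 528)],
    "Age": [(654, 1438), (57, 52)],
    "Sex": [(367, 1364), (344, 126)],
}

parent = gini(711, 1490)  # impurity before any split
gains = {attr: parent - weighted_gini(groups) for attr, groups in splits.items()}
best = max(gains, key=gains.get)

for attr, gain in gains.items():
    print(f"{attr:<6} impurity decrease: {gain:.4f}")
print("Best first split:", best)
```

Under this definition, splitting on sex reduces impurity the most, followed by class and then age.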
8. Assess Models
In order to validate our models, we plotted a confusion matrix for each model against
the woman and children data set we created.
Figure 4 shows the confusion matrix for the model (Figure 1) we created (theorized on
the woman and children first policy) against the woman and children test data. The
accuracy of this model is 69.85% and the error rate is 30.15%.
Woman/Children Tree Against Woman/Children Test Set Data
                    Actually Survived   Actually Died   Class Precision
Predict Survived    373                 161             69.85%
Predict Died        0                   0               0.00%
Class Recall        100.00%             0.00%
Figure 4 – Woman and children model (Figure 1) against woman and children test set data. The accuracy of this model
is 69.86% +/- 1.11%. The error rate of this model is 30.15%.
Figure 5 shows the confusion matrix for the model (Figure 1) we created (theorized on
the woman and children first policy) against the entire data set. The accuracy of this
model is 77.33% and the error rate is 22.67%.
Woman/Children Tree Against Entire Test Set Data
                    Actually Survived   Actually Died   Class Precision
Predict Survived    373                 161             69.85%
Predict Died        338                 1329            79.72%
Class Recall        52.46%              89.19%
Figure 5 – Woman and children model (Figure 1) against the entire data set. The accuracy of this model is 77.33% +/-
3.11%. The error rate of this model is 22.67%.
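The accuracy, class precision and class recall values in these confusion matrices follow directly from the four cell counts. The sketch below recomputes them for Figure 5; the function names are our own, and treating an empty row as 0.00% is an assumption chosen to match how Figure 4 reports the "Predict Died" row. Values computed this way may differ in the last digit from the figures reported with a +/- range.

```python
def percentage(numerator: int, denominator: int) -> float:
    """Return numerator/denominator as a percentage, or 0.0 for an empty group."""
    return 100.0 * numerator / denominator if denominator else 0.0


def confusion_metrics(pred_s_act_s: int, pred_s_act_d: int,
                      pred_d_act_s: int, pred_d_act_d: int) -> dict:
    """Compute summary metrics from the four cells of a 2x2 confusion matrix."""
    total = pred_s_act_s + pred_s_act_d + pred_d_act_s + pred_d_act_d
    correct = pred_s_act_s + pred_d_act_d
    return {
        "accuracy": percentage(correct, total),
        "error_rate": percentage(total - correct, total),
        "precision_survived": percentage(pred_s_act_s, pred_s_act_s + pred_s_act_d),
        "precision_died": percentage(pred_d_act_d, pred_d_act_s + pred_d_act_d),
        "recall_survived": percentage(pred_s_act_s, pred_s_act_s + pred_d_act_s),
        "recall_died": percentage(pred_d_act_d, pred_s_act_d + pred_d_act_d),
    }


# Cell counts copied from Figure 5.
figure5 = confusion_metrics(373, 161, 338, 1329)

for name, value in figure5.items():
    print(f"{name:<20}{value:6.2f}%")
```

The same function can be applied to the counts in the other confusion matrices.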
Figure 6 shows the confusion matrix for the generated models (Figure 2/3) against the
woman and children test data. Note that both models produced the same results.
The accuracy of this model is 76.98% and the error rate is 23.03%.
Entire Tree Against Woman/Children Test Set Data
                    Actually Survived   Actually Died   Class Precision
Predict Survived    270                 20              93.10%
Predict Died        103                 141             57.79%
Class Recall        72.39%              87.58%
Figure 6 – Entire model (Figure 2/3) against woman and children test set data. The accuracy of this model is 76.98%
+/- 3.45%. The error rate of this model is 23.03%.
Figure 7 shows the confusion matrix for the generated models (Figure 2/3) against the
entire data set. Note that both models produced the same results. The accuracy of
this model is 79.06% and the error rate is 20.95%.
Entire Tree Against Entire Test Set Data
                    Actually Survived   Actually Died   Class Precision
Predict Survived    270                 20              93.10%
Predict Died        441                 1470            76.92%
Class Recall        37.97%              98.66%
Figure 7 – Entire model (Figure 2/3) against the entire data set. The accuracy of this model is 79.06% +/- 1.45%. The
error rate of this model is 20.95%.
9. Deploy Models
Figures 8 and 9 summarize the results of the confusion matrices in the previous section.
When comparing the two trees on both test sets, the accuracy and the precision for
classifying a person as a survivor increased when using the entire decision tree
model. This shows that other factors contributed to the survival of a person; in this
case, the person’s class (crew, 1st, 2nd, 3rd) played an important role.
Woman and Children (Test Set)
Model                                     Accuracy   Error Rate   Class Precision (Survived)   Class Precision (Died)
Woman/Children Decision Tree (Figure 1)   69.85%     30.15%       69.85%                       0.00%
Entire Decision Tree (Figure 2/3)         76.97%     23.03%       93.10%                       57.79%
Figure 8 - Summary of the accuracy, error rate and class precision of the models based on the woman and children only
data.
Entire Data (Test Set)
Model                                     Accuracy   Error Rate   Class Precision (Survived)   Class Precision (Died)
Woman/Children Decision Tree (Figure 1)   77.33%     22.67%       69.85%                       79.72%
Entire Decision Tree (Figure 2/3)         79.05%     20.95%       93.10%                       76.92%
Figure 9 - Summary of the accuracy, error rate and class precision of the models based on the entire data.
10. Assess Results
Based on the results from the previous sections:
- The woman and children first policy was followed fairly closely.
- The class of a person played an important role.
- The majority of the women and children who died were third class.
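The last point can be checked with a short calculation from Figure 4 and the tables in Appendix A. The sketch below is our own derivation: since girls are counted in both the female and the child totals, the number of boys who died is obtained by subtraction, and the variable names are illustrative.

```python
# Counts copied from Figure 4 and Appendix A (Tables 2 and 6).
wc_died_total = 161      # women and children who died (Figure 4)
female_died_all = 126    # females of any age who died (Table 2)
child_died_all = 52      # children who died (Table 2)
female_died_third = 106  # third class females who died (Table 6)
child_died_third = 52    # third class children who died (Table 6)

# Girls appear in both the female and the child totals, so the boys who
# died are the women-and-children deaths not already counted as female.
boys_died_all = wc_died_total - female_died_all

# Every child who died was in third class, so all of those boys were too.
assert child_died_third == child_died_all
wc_died_third = female_died_third + boys_died_all

share_third = 100.0 * wc_died_third / wc_died_total
print(f"{wc_died_third} of {wc_died_total} ({share_third:.2f}%) were third class")
```

The result matches the 141 cases in Figure 6 that the entire tree predicted as died and that actually died.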
Limitations/Constraints
1 False Information
According to the original contributor, our dataset has been compiled and reconstructed
by different people, so the information it contains may be inconsistent between
versions. Our limitation is that we have no way of verifying the information contained
in the latest version of the dataset. Therefore, the project results may not fully
reflect what really happened when the Titanic sank.
2 Limited Number of Attributes
Our dataset contains only 3 attributes (sex, age, and class) with which to classify the
survival of the people on the Titanic. Other factors, such as ethnicity and fitness
level, could have affected their survival. Some reports have noted that a greater
percentage of British people died (regardless of their class) because they politely waited
in line for the lifeboats.
Conclusions
After analyzing the dataset of the survival outcomes for the people on the Titanic, it
becomes clear that the policy of saving the women and children first did have a noticeable
effect on the survival rate of that group of people. The policy-based decision tree
(Figure 1) predicts that every woman and child survives, so its accuracy of 69.85% against
the actual outcomes means that 373 of the 534 women and children did survive, compared
with 711 of all 2201 people on board (about 32%). This indicates that women and children
were prioritized over the adult men. However, the decision trees generated from the entire
dataset (Figures 2 and 3) show that the 3rd class passengers were more likely to die than
to survive regardless of their gender or age. Overall, the policy of saving the women and
children first did give them a much greater chance of being saved, but as the 3rd class
passengers show, other factors, including social class, also influenced the survival rates
of the passengers.
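The comparison behind this conclusion can be reproduced from the confusion matrices. In the sketch below, the women and children counts come from Figure 4 and the adult male counts from the "Predict Died" row of Figure 5 (the policy-based tree predicts died only for adult males); the dictionary layout is our own.

```python
# (survived, did not survive) per group, copied from Figures 4 and 5.
groups = {
    "Women and children": (373, 161),
    "Adult men": (338, 1329),
}

# Survival percentage for each group.
rates = {
    name: 100.0 * survived / (survived + died)
    for name, (survived, died) in groups.items()
}

# Overall survival rate across everyone on board.
total_survived = sum(survived for survived, _ in groups.values())
total_people = sum(survived + died for survived, died in groups.values())
overall_rate = 100.0 * total_survived / total_people

for name, rate in rates.items():
    print(f"{name:<20}{rate:6.2f}%")
print(f"{'Everyone on board':<20}{overall_rate:6.2f}%")
```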
Appendix A – Data Analysis
Table 1 - Number of People Breakdown
Attribute   Category     Count   Total
Class       Crew (0)     885
            First (1)    325
            Second (2)   285
            Third (3)    706     2201
Age         Adult (1)    2092
            Child (0)    109     2201
Sex         Male (1)     1731
            Female (0)   470     2201
Survived    Yes (1)      711
            No (0)       1490    2201
Table 2 - Survival Breakdown By Class, Age, Sex
           Class                                         Age                     Sex
Survived   Crew (0)   First (1)   Second (2)   Third (3)   Adult (1)   Child (0)   Male (1)   Female (0)
Yes        212        203         118          178         654         57          367        344
No         673        122         167          528         1438        52          1364       126
Table 3 - Crew Survival Breakdown by Age, Sex
Crew
           Age                     Sex
Survived   Adult (1)   Child (0)   Male (1)   Female (0)
Yes        212         0           192        20
No         673         0           670        3
Table 4 - First Class Survival Breakdown by Age, Sex
First Class
           Age                     Sex
Survived   Adult (1)   Child (0)   Male (1)   Female (0)
Yes        197         6           62         141
No         122         0           118        4
Table 5 - Second Class Survival Breakdown by Age, Sex
Second Class
           Age                     Sex
Survived   Adult (1)   Child (0)   Male (1)   Female (0)
Yes        94          24          25         93
No         167         0           154        13
Table 6 - Third Class Survival Breakdown by Age, Sex
Third Class
           Age                     Sex
Survived   Adult (1)   Child (0)   Male (1)   Female (0)
Yes        151         27          88         90
No         476         52          422        106
Appendix B – Training Set
Class   Gender   Age     Survived
1st     male     adult   no
1st     male     child   yes
1st     female   adult   yes
1st     female   child   yes
2nd     male     adult   no
2nd     male     child   yes
2nd     female   adult   yes
2nd     female   child   yes
3rd     male     adult   no
3rd     male     child   yes
3rd     female   adult   yes
3rd     female   child   yes
crew    male     adult   no
crew    male     child   yes
crew    female   adult   yes
crew    female   child   yes