sample_3-mbus673_Titanic Sinking (Decision Tree)

Use of RapidMiner for Decision Tree Analysis
Using Data from the Titanic Sinking
Wes Morris
MBUS 673
Introduction
On April 15, 1912, the British passenger liner RMS Titanic sank in the North Atlantic [1].
After striking an iceberg shortly before midnight, the ship sank in less than 3 hours. The Titanic
was the largest ship afloat at the time, and its loss was one of the deadliest peacetime maritime
disasters in history. Of over 2,200 passengers and crew (approximately 1,317 passengers and 885
crew), approximately 710 survived, a loss of nearly 1,500 lives.
Much historical data is available regarding the passengers of the ill-fated voyage. Of
course there were no electronic systems available. Nevertheless, British efficiency resulted in
fairly complete written records of the passengers and the ticket information. Use of this data can
provide a valuable, if slightly macabre, learning tool for the RapidMiner program.
In this paper I will provide background information on business intelligence systems.
This will be followed by a “practical” example: use of RapidMiner’s decision tree technique
to investigate the ‘rules’ that were followed and to predict the survival of passengers
aboard the Titanic. Conclusions follow.
[1] Wikipedia article
Fall 2013
Business Intelligence
1/13
Background
Some background material on the concepts of business intelligence and data mining is
in order. I will briefly outline business intelligence, data warehousing, data mining, and some of
the key issues with implementation of business intelligence systems.
A. Business Intelligence
In today’s business environment, both the pace of change and the level of complexity are
increasing. Organizations need to be agile and must often make complex decisions quickly.
Making good decisions requires, among other things, data, information, and knowledge. “Business
Intelligence (BI) is an umbrella term that combines architectures, tools, databases, analytical
tools, applications, and methodologies.” [2] The term was first used in the mid-1990s. There are
four main components in a BI system: a data source (the data warehouse), a set of tools for
analyzing the data, rules and tools to monitor business performance (BPM), and a user interface.
A primary benefit of BI is the ability to provide the right information needed to make a decision,
at the right time. BI can provide decision support and in some cases even automate decision making.
Predictive analysis, such as a credit card company’s ability to detect and avert potential
fraud, is an example of BI at work. Because the data often include sensitive information about
individual customers, BI systems require both technical and procedural safeguards to govern access. Business
Intelligence and online analytical processing (OLAP) systems should be contrasted with
transaction processing systems, which are used to record the daily operational activities of the
[2] Turban, page 8.
organization. Because of their extreme complexity, BI systems are often purchased from
companies like IBM or Oracle. In some cases though, they can be developed in house. Either
way, they are typically quite expensive.
B. Data Warehousing
“In simple terms, a data warehouse (DW) is a pool of data produced to support decision
making.” [3] Data warehouses are:
- Subject oriented. Transaction databases are typically product oriented.
- Integrated. Data are gathered from many sources into a consistent format for presentation to
users.
- Time variant. Transactional databases often contain only the current business cycle of data.
Data warehouses always contain historical data.
- Nonvolatile. The data, once entered into the data warehouse, is not changed.
A data mart is a type of data warehouse. A data warehouse is typically large or very
large, with data for an entire enterprise. A data mart is a virtual or physical subset of the data
warehouse, with data for just one department or business unit. Data marts can be developed as
dependent on the data warehouse, or developed independently.
An important concept in data warehouse systems is that of metadata. Metadata is “data
about the data” that is useful in the reporting and management of the data itself. Consider a
[3] Turban, page 32.
simple file system. An Excel file will contain columns and rows of data. When it is stored on a
hard drive, information about the data is stored along with the data in the rows and
columns. Examples include the filename, creation date, modification date, document author,
and so forth.
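As a rough illustration of “data about the data,” the following sketch reads a few metadata fields that the file system keeps for a file. The helper name describe_file and the sample file passengers.csv are hypothetical, introduced only for this example. Fields such as document author live inside the file format itself and are not covered here.

```python
import os
import tempfile
from datetime import datetime, timezone

def describe_file(path):
    """Return a few metadata fields that the file system keeps about a file."""
    info = os.stat(path)
    return {
        "filename": os.path.basename(path),
        "size_bytes": info.st_size,
        # st_mtime is the last-modification time in seconds since the epoch.
        "modified": datetime.fromtimestamp(info.st_mtime, tz=timezone.utc),
    }

# Usage: write a tiny file, then read back the data about the data.
with tempfile.TemporaryDirectory() as folder:
    sample_path = os.path.join(folder, "passengers.csv")  # hypothetical file name
    with open(sample_path, "w", newline="") as handle:
        handle.write("name,age\n")
    metadata = describe_file(sample_path)

print(metadata["filename"], metadata["size_bytes"])
```

None of these values appear in the rows and columns of the file, yet they are what a reporting or management tool would use to find, sort, and audit it.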
Data warehouse architectures can be envisioned as two-tier or three-tier models. Furthermore,
alternative architectures have been proposed to govern the relationships between an enterprise
data warehouse and potentially multiple data marts.
An important part of the data warehouse is the set of ETL processes used to populate it: the
extraction of data from the original sources, the transformation of that data into a common
format, and the loading of that data into the data warehouse. These three processes together are
referred to as either ETL or data integration.
A data warehousing project is a complex endeavor for any organization. Enterprises undertake
such a project because the primary and secondary benefits outweigh the cost and complexity of
building and maintaining the system. A number of vendors make data warehouse systems available
as commercial packages. Though these are considered “off-the-shelf,” in reality much integration
work goes into making the data warehouse produce the desired benefits, even when commercial
software is used as a base platform. There are two primary approaches to the effort. The Inmon
model is a top-down, enterprise-wide model, also referred to as the EDW (Enterprise Data
Warehouse) approach. The Kimball model, while starting with an enterprise-wide vision, builds
from the bottom up, delivering data mart functionality to individual groups.
As the process of building the data warehouse and loading data and metadata commences, the
distinction between online transaction processing (OLTP) and online analytical processing
(OLAP) becomes sharper. OLTP databases are typically relational and the main data structure is
an indexed table. With OLAP, the primary conceptual object of storage is a cube of three (or
conceivably more) dimensions. Operations on OLTP tables have names such as join, select, and
delete. Operations on OLAP cubes have names like slice, dice, roll up, and drill down.
The process of building a data warehouse has been studied extensively, and common pitfalls
and traps have been identified so that they can be avoided. The lessons include, for example, the
importance of establishing service level agreements, of gaining end-user support, and of
selecting ETL tools carefully. There are many others.
Some data warehouses, over time, become massive repositories. Others load data with increasing
frequency, giving rise to the real-time data warehouse, in which up-to-the-minute data is
available to combine with historical information in order to produce real-time decisions of an
operational nature. In the future, it is expected that open source software will play a larger
role. The use of cloud computing and storage is also likely to increase. Data security and
protection become important issues to consider as data is migrated outside the confines of the
private data center.
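The ETL steps and two of the cube operations named above can be sketched in a few lines. This is only a minimal illustration under stated assumptions: the two source systems, their field names, and the sales figures are all invented, and a plain Python list stands in for the warehouse.

```python
from collections import defaultdict

# Two hypothetical source systems that describe the same kind of sale differently.
STORE_SYSTEM = [
    {"region": "north", "product": "tea", "amount": "30.00"},
    {"region": "south", "product": "tea", "amount": "7.25"},
]
WEB_SYSTEM = [
    {"Region": "SOUTH", "item": "coffee", "cents": 800},
]

def extract():
    """Extract: read raw rows from each source, tagged with where they came from."""
    for row in STORE_SYSTEM:
        yield "store", row
    for row in WEB_SYSTEM:
        yield "web", row

def transform(source, row):
    """Transform: convert a raw row into the warehouse's common format."""
    if source == "store":
        return {"region": row["region"], "product": row["product"],
                "amount": float(row["amount"])}
    return {"region": row["Region"].lower(), "product": row["item"],
            "amount": row["cents"] / 100}

def load(warehouse, row):
    """Load: append the cleaned row; existing rows are never modified."""
    warehouse.append(row)

warehouse = []
for source, raw_row in extract():
    load(warehouse, transform(source, raw_row))

def roll_up(rows, dimension):
    """Roll up: total the amounts along one dimension of the data."""
    totals = defaultdict(float)
    for row in rows:
        totals[row[dimension]] += row["amount"]
    return dict(totals)

def slice_rows(rows, dimension, value):
    """Slice: keep only the rows where one dimension has a fixed value."""
    return [row for row in rows if row[dimension] == value]

sales_by_region = roll_up(warehouse, "region")
tea_only = slice_rows(warehouse, "product", "tea")
print(sales_by_region)
print(len(tea_only))
```

A real OLAP engine precomputes and indexes such aggregates across many dimensions; the sketch only shows what the operations mean.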
For our case study example, in place of an extremely large data warehouse, we will be
using a data file based on the passengers of the Titanic to represent our data warehouse. More
details on the data set are given in the case study below.
C. Data Mining
Data mining is a way to develop business intelligence [4]. A variety of tools, methods, and
techniques are used to extract information useful to a business from data that may seem at first
glance to be of no value. Information about customers is valuable. In today’s technological
[4] Turban, pg. 131.
environment, information about customers can accumulate in large databases. The term
“mining” conveys the idea that digging valuable information out of that data collection is not
always easy or obvious. From a technical standpoint, the process involves the use of statistical
tools and techniques, as well as artificial intelligence, to extract the information of value. The
miners in this case are often non-technical business analysts or users who need information to
make better decisions. There are four basic pattern types that these tools can identify:
- Association finds commonly occurring groups of items. The market basket technique is a typical tool.
An example from our class discussion was the finding that beer and diapers are often bought together.
- Predictions identify future outcomes. The decision tree technique is an example.
- Clusters identify groups of objects based on characteristics. For example, segmenting
customers for a marketing campaign in novel ways.
- Sequential relationships are identifications of time-ordered events. For example, understanding
what events trigger a prospect to become a customer or a customer to develop repeat business
can be very valuable.
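To make the association pattern concrete, here is a small market basket sketch. The baskets are invented for illustration, and the calculation is simple pair support (the fraction of baskets containing both items), not a full association-rule algorithm.

```python
from collections import Counter
from itertools import combinations

# Hypothetical shopping baskets, invented for illustration.
BASKETS = [
    {"beer", "diapers", "chips"},
    {"beer", "diapers"},
    {"milk", "diapers", "beer"},
    {"milk", "bread"},
    {"bread", "chips"},
]

def pair_support(baskets):
    """Return the fraction of baskets that contain each pair of items."""
    counts = Counter()
    for basket in baskets:
        # Sorting gives every pair one canonical (alphabetical) ordering.
        for pair in combinations(sorted(basket), 2):
            counts[pair] += 1
    return {pair: count / len(baskets) for pair, count in counts.items()}

support = pair_support(BASKETS)
best_pair = max(support, key=support.get)
print(best_pair, support[best_pair])
```

In this toy data the pair that co-occurs most often is beer and diapers, which appears in three of the five baskets.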
The learning algorithms used in data mining can be classified as supervised or
unsupervised. Supervised learning uses training data in which the outcome is already known, so
that the algorithm can learn from past behavior the patterns needed to predict new behavior.
For example, our case study, discussed further below, includes the survived attribute so that the
decision tree algorithm can learn the relationship between that outcome and attributes of
each passenger such as gender and age. Another important distinction is whether the
mining is hypothesis driven or discovery driven. In the first case, the analyst has an idea and
is using data to test that idea. In the second, the analyst (or BI system) is searching for
previously unknown relationships and their potential business value.
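One way to see how a supervised decision tree learner uses the known outcome is to score candidate splits by information gain. The sketch below does this with the survived/perished counts reported in the case study of this paper. It is an illustration under an assumption: RapidMiner’s Decision Tree step lets the user choose among several splitting criteria, and this paper does not state which one was used, so plain information gain is used here only as a representative criterion.

```python
from math import log2

def entropy(counts):
    """Shannon entropy (in bits) of a list of class counts."""
    total = sum(counts)
    return -sum((c / total) * log2(c / total) for c in counts if c)

def information_gain(parent, children):
    """Entropy of the parent node minus the size-weighted entropy of its children."""
    total = sum(parent)
    weighted = sum(sum(child) / total * entropy(child) for child in children)
    return entropy(parent) - weighted

# (survived, perished) counts taken from the case study in this paper.
ALL_PASSENGERS = (500, 809)
BY_GENDER = [(339, 127), (161, 682)]                     # female, male
BY_AGE_BRACKET = [(94, 203), (45, 36), (359, 564), (2, 6)]  # infant, child, adult, elder

gain_gender = information_gain(ALL_PASSENGERS, BY_GENDER)
gain_age = information_gain(ALL_PASSENGERS, BY_AGE_BRACKET)
print(f"gender: {gain_gender:.3f} bits, age bracket: {gain_age:.3f} bits")
```

Under this criterion, splitting on gender reduces uncertainty about survival far more than splitting on the four age brackets, which is consistent with gender appearing at the root of the trees in the case study.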
RapidMiner is an example of a tool used to perform data mining on a data set. It is
available in free or commercial packages. It offers a wide variety of tools and an “easy to use”
interface for building data mining processes. More details are demonstrated in the case study below.
D. Business Intelligence - Implementation Issues
The database for our case study is relatively small, and the tool is free. In a real
business environment, though, the data sets are usually much larger and less well defined, and
the business intelligence systems, of which data mining is only one part, are very expensive to
acquire and implement. These systems can be built in house or acquired from commercial
providers. Either way, the cost for a large enterprise can be substantial, and the complexity of
implementation means that success is far from guaranteed. In this section I will briefly
outline some of the major issues surrounding the implementation and integration of business
intelligence systems in an enterprise.
There are four major implementation issues: integration, connections to data sources, the
use of on-demand BI, and the ethical and legal issues associated with the accumulation and
analysis of potentially sensitive information. There are both technical and managerial issues
related to the implementation of a BI system within an organization. Examples of technical
issues include scalability and performance, security, and the ability to read from disparate
sources of information into a common data warehouse. Examples of managerial issues include
business justification and cost-benefit analysis, legal issues, and ethical issues.
There are several types of integration that can occur. The need for integration is really
part of the benefit of the system as a whole. Systems with a tight integration with other systems,
for example, can offer real-time BI and decision support.
Traditional data warehouses and BI systems are built with customer-provided
infrastructure such as servers and storage arrays. New trends are allowing less capital-intensive
alternatives. BI can be delivered as a service (SaaS or software as a service) by firms that
operate the infrastructure and application on behalf of another firm.
There are legal and privacy issues associated with business intelligence, and some high-profile
cases have demonstrated the consumer backlash that can come from perceived (or real)
inappropriate use of sensitive information. Many companies now collect information that
consumers themselves consider private. Data on the location and travel patterns of
consumers equipped with mobile phones is not unusual. Questions of how much data can be
collected, who can have access to that data, and what purposes it can be used for are being
debated on both legal and ethical grounds. Current-generation business intelligence systems are
integrating with new sources of information, such as social media sites, and use Web 2.0 tools to
gather and analyze data about consumer patterns. The US government uses “business
intelligence” in counter-terrorism efforts, but as part of that effort collects and retains data on the
usage patterns of millions who are not terrorists. Right or wrong, “big data” is probably here to
stay; it will be incumbent on us collectively to make sure it does not turn out to be a malicious
“big brother.”
An Example
a) Case description
The file titanic3.xls can be located from multiple sources using a Google search [5]. The
data set contains over 1,300 rows of data, one per passenger, each with certain basic information.
This file will be used along with RapidMiner to explore the capabilities of the RapidMiner
product for data mining operations. On that terrible night, the captain gave the order to abandon
ship, reportedly adding ‘women and children first.’ Our case will use RapidMiner to build a
decision tree based on age and gender.
b) Objective from the case
The primary objective of the case is to gain experience with RapidMiner. It is a powerful
and flexible tool, though one with some complexity. We will attempt to produce a decision
tree on age, gender, and potentially other factors that impacted survivability. When the captain
gave the order to abandon ship, he reportedly added, “women and children first.” Did these
groups have a better chance to survive?
c) Findings/Analysis and Business Implication
The full RapidMiner flows and steps with screen shots are shown in a separate document
attached to this one and called “Titanic for Decision Tree: Screenshots and procedures.” It is
[5] For example, it is available from Carnegie Mellon University at
http://lib.stat.cmu.edu/S/Harrell/data/xls/titanic3.xls
clear that females stood a much higher chance of being saved that night than did
males. Running a simple decision tree on gender, we obtain the following baseline:
gender = female: survived {survived=339, perished=127}
gender = male: perished {survived=161, perished=682}
Nearly 73% of the women survived; less than 20% of the men did.
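The percentages above follow directly from the leaf counts of the baseline tree. A minimal sketch of the arithmetic, using only the counts reported above (the function name survival_rate is introduced here for illustration):

```python
def survival_rate(survived, perished):
    """Fraction of a group that survived."""
    return survived / (survived + perished)

# Leaf counts from the baseline tree: (survived, perished).
LEAVES = {"female": (339, 127), "male": (161, 682)}

rates = {gender: survival_rate(*counts) for gender, counts in LEAVES.items()}
for gender, rate in rates.items():
    print(f"{gender}: {rate:.1%}")
```

Each leaf of the tree is labeled with its majority outcome, which is why the female leaf reads “survived” and the male leaf reads “perished.”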
Considering the age of the passengers, I broke the age into 4 somewhat arbitrary groups:
Age Bracket    Upper Limit    Percentage Survival
infant         2.99 years     32%
child          15.99 years    56%
adult          69.99 years    39%
elder          99.99 years    25%
age = adult: perished {survived=359, perished=564}
age = child: survived {survived=45, perished=36}
age = elder: perished {survived=2, perished=6}
age = infant: perished {survived=94, perished=203}
It is not clear but possible that many infants (< 3 years of age for my test) perished
because of the cold or after the initial rescue. Clearly, children under 16 did stand a much better
chance of survival; this is the only age group where more than 50% of the passengers survived.
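The bracketing of ages can be sketched as a small lookup against the upper limits in the table above. This is an illustrative reconstruction rather than the RapidMiner step itself. In particular, the paper does not say how passengers with no recorded age were handled, so returning None for them is an assumption of this sketch.

```python
from bisect import bisect_left

# Upper limits (inclusive) and labels from the age bracket table.
UPPER_LIMITS = [2.99, 15.99, 69.99, 99.99]
LABELS = ["infant", "child", "adult", "elder"]

def age_bracket(age):
    """Map an age in years to its bracket; None for missing or out-of-range ages."""
    if age is None or age < 0 or age > UPPER_LIMITS[-1]:
        return None  # assumption: missing ages are left unbracketed
    return LABELS[bisect_left(UPPER_LIMITS, age)]

# (survived, perished) counts per bracket, as reported by the tree above.
COUNTS = {"infant": (94, 203), "child": (45, 36), "adult": (359, 564), "elder": (2, 6)}
rates = {bracket: s / (s + p) for bracket, (s, p) in COUNTS.items()}

for bracket in LABELS:
    print(f"{bracket}: {rates[bracket]:.0%}")
```

Recomputing the rates from the leaf counts reproduces the percentages in the table and confirms that the child bracket is the only one above 50%.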
It is very interesting to consider other classifications, or to consider classifications in
combination. Consider passenger class. Without additional tuning of the Decision Tree step,
age does not appear in the tree, and neither passenger class nor age plays much of a role in the
survival of men. For women, passenger class matters a great deal. In
fact, the survival rates of first, second, and third class female passengers are 97%, 89%, and 49%,
respectively. Being a woman of means helped on that night. Perhaps this is because of the
proximity of the cabins to the lifeboats.
gender = female
|   pclass = first: survived {survived=139, perished=5}
|   pclass = second: survived {survived=94, perished=12}
|   pclass = third: perished {survived=106, perished=110}
gender = male: perished {survived=161, perished=682}
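A decision tree of this kind is simply a set of nested rules, so it can be written out as ordinary code and scored against the counts it was built from. The sketch below compares the gender-only tree with the gender-and-class tree. The function names are introduced here for illustration, and the accuracy is measured on the same passengers used to build the tree, so it says nothing about performance on unseen data.

```python
def predict_gender_only(gender, pclass):
    """Baseline tree: women are predicted to survive, men to perish."""
    return "survived" if gender == "female" else "perished"

def predict_gender_and_class(gender, pclass):
    """Two-level tree: as above, except third-class women are predicted to perish."""
    if gender == "female":
        return "perished" if pclass == "third" else "survived"
    return "perished"

# (gender, pclass) -> (survived, perished), taken from the tree above.
# The tree does not split men by class, so they form a single group.
GROUPS = {
    ("female", "first"): (139, 5),
    ("female", "second"): (94, 12),
    ("female", "third"): (106, 110),
    ("male", "any"): (161, 682),
}

def accuracy(rule):
    """Fraction of passengers whose actual outcome matches the rule's prediction."""
    correct = total = 0
    for (gender, pclass), (survived, perished) in GROUPS.items():
        correct += survived if rule(gender, pclass) == "survived" else perished
        total += survived + perished
    return correct / total

print(f"gender only:      {accuracy(predict_gender_only):.1%}")
print(f"gender and class: {accuracy(predict_gender_and_class):.1%}")
```

The extra split changes the prediction only for third-class women, where the outcome was close to even (106 survived, 110 perished), so overall accuracy improves only slightly even though the survival rates by class differ sharply.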
Many more combinations are possible, some of which are explored in the supplementary
document attached to this one and called “Titanic for Decision Tree: Screenshots and procedures.”
d) Data set information
The data set contains 1,309 rows of data with 14 columns. Each row represents data on
one passenger. The column definitions or data fields are itemized in Table 1.
Field Name    Field Description
pclass        Passenger class: 1, 2, or 3
survived      0 (did not survive) or 1 (did survive)
name          Last name, First name format
sex           gender: male or female
age           in years
sibsp         siblings/spouses traveling together: numeric
parch         parents/children traveling together: numeric
ticket        ticket number
fare          fare price
cabin         text string indicating cabin assignment
embarked      port of embarkation: S, C, or Q
boat          lifeboat number
body          a number for recovered corpses
home.dest     text, home city of passenger listed on manifest
Clearly, not all columns are relevant to a decision tree. For example, the ticket number is not
likely to forecast survivability.
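Selecting the relevant columns can be sketched as follows. The two rows are invented placeholders in the layout of Table 1, not real passenger records, and the choice of feature columns (class, gender, age, and the two family-size fields) mirrors the attributes this paper reports using. The actual selection in RapidMiner was done through its interface.

```python
import csv
import io

# Two invented rows in the layout of Table 1 (not real passenger records).
SAMPLE_CSV = """pclass,survived,name,sex,age,sibsp,parch,ticket,fare,cabin,embarked,boat,body,home.dest
1,1,"Doe, Jane",female,29,0,0,A100,100.0,B5,S,2,,Example City
3,0,"Roe, John",male,,1,0,B200,8.05,,S,,,Example Town
"""

# Columns kept as inputs to the tree, with the type each should be converted to.
FEATURE_TYPES = {"pclass": int, "sex": str, "age": float, "sibsp": int, "parch": int}
LABEL = "survived"

def to_example(row):
    """Split one raw row into (features, label), dropping the irrelevant columns."""
    features = {
        name: (convert(row[name]) if row[name] else None)  # empty cell -> None
        for name, convert in FEATURE_TYPES.items()
    }
    label = "survived" if row[LABEL] == "1" else "perished"
    return features, label

examples = [to_example(row) for row in csv.DictReader(io.StringIO(SAMPLE_CSV))]
for features, label in examples:
    print(features, "->", label)
```

Columns such as ticket, name, and cabin are simply never copied into the feature dictionary, which is the code equivalent of deselecting them before the Decision Tree step.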
Conclusion
RapidMiner is a powerful free tool that provides data mining and business intelligence
capabilities to firms that otherwise would have no opportunity to make use of the
decision-making support of modern systems. With respect to the Titanic, it is reasonable to
conclude that women and children did in fact have a better chance of survival.
Determining the percentage of women who survived, or the effect of any single factor, does not
require RapidMiner or a decision tree. Going beyond simple statistical calculations
on a single variable, RapidMiner offers decision tree support to build rules predicting an outcome
based on multiple characteristics. In my example, I used age, gender, passenger class, and in
some cases whether the person was alone or with family to produce various decision trees. It is
not hard to imagine business applications, such as predicting customer repeat business behavior
or potential response to a marketing campaign, where the power of RapidMiner becomes
relevant to the workplace today.
References
1. http://en.wikipedia.org/wiki/Titanic. Background information on the Titanic came from the
Wikipedia page.
2. Turban, Efraim, et al. “Business Intelligence: A Managerial Approach,” 2nd Edition.