Use of RapidMiner for Decision Tree Analysis Using Data from the Titanic Sinking
Wes Morris
MBUS 673
Fall 2013 Business Intelligence

Introduction

On April 15, 1912, the British passenger liner RMS Titanic sank in the North Atlantic [1]. Having struck an iceberg shortly before midnight, the ship sank in less than 3 hours. The Titanic was the largest ship afloat at the time, and its loss was one of the largest seaborne disasters in peacetime history. Of over 2,200 passengers and crew (approximately 1,317 passengers and 885 crew), approximately 710 survived, for a loss of nearly 1,500 lives.

Much historical data is available regarding the passengers of the ill-fated voyage. Of course, no electronic record systems were available at the time. Nevertheless, British efficiency resulted in fairly complete written records of the passengers and their ticket information. This data can provide a valuable, if slightly macabre, learning tool for the RapidMiner program.

In this paper I first provide background information on business intelligence systems. This is followed by a practical example: use of RapidMiner's decision tree technique to investigate the ‘rules’ that were followed and to predict the survival of passengers aboard the Titanic. Conclusions follow.

[1] Wikipedia article.

Background

Some background material on the concepts of business intelligence and data mining is in order. I will briefly outline business intelligence, data warehousing, data mining, and some of the key issues with implementation of business intelligence systems.

A. Business Intelligence

In today’s business environment, both the pace of change and the level of complexity are increasing. Organizations need to be agile and to make often complex decisions quickly. Making good decisions requires, among other things, data, information, and knowledge.
“Business Intelligence (BI) is an umbrella term that combines architectures, tools, databases, analytical tools, applications, and methodologies.” [2] The term was first used in the mid-1990s. There are four main components in a BI system: a data source (the data warehouse), a set of tools for analyzing the data, rules and tools to monitor business performance (business performance management, or BPM), and a user interface.

A primary benefit of BI is the ability to provide the right information needed to make a decision at the right time. BI can provide decision support and in some cases even automate decision making. Predictive analysis, such as the ability of credit card companies to detect and avert potential fraud, is an example of BI at work. Because the data often include sensitive information about particular customers, BI systems require both technical and procedural safeguards to govern access.

Business intelligence and online analytical processing (OLAP) systems should be contrasted with transaction processing systems, which are used to record the daily operational activities of the organization. Because of their extreme complexity, BI systems are often purchased from companies like IBM or Oracle. In some cases, though, they can be developed in house. Either way, they are typically quite expensive.

[2] Turban, page 8.

B. Data Warehousing

“In simple terms, a data warehouse (DW) is a pool of data produced to support decision making.” [3] Data warehouses are:

- Subject oriented. Transaction databases, by contrast, are typically product oriented.
- Integrated. Data are gathered from many sources into a consistent format for presentation to users.
- Time variant. Transactional databases often contain only the current business cycle of data. Data warehouses always contain historical data.
- Nonvolatile. The data, once entered into the data warehouse, are not changed.
A data mart is a type of data warehouse. A data warehouse is typically large or very large, with data for an entire enterprise. A data mart is a virtual or physical subset of the data warehouse, with data for just one department or business unit. Data marts can be developed as dependent on the data warehouse, or developed independently.

An important concept in data warehouse systems is that of metadata. Metadata is “data about the data” that is useful in the reporting and management of the data itself. Consider a simple file system. An Excel file contains columns and rows of data. When it is stored on a hard drive, information about the data is stored along with the rows and columns themselves. Examples include the filename, creation date, modification date, and document author.

[3] Turban, page 32.

Data warehouse architectures can be envisioned as two-tier or three-tier models. Furthermore, alternative architectures have been proposed to govern the relationships between an enterprise data warehouse and potentially multiple data marts.

An important part of the data warehouse is the set of ETL processes used to populate it: the extraction of data from the original sources, the transformation of that data into a common format, and the loading of that data into the data warehouse. These three processes together are referred to as either ETL or data integration.

A data warehousing project is a complex endeavor for any organization. Enterprises undertake such a project because the primary and secondary benefits outweigh the cost and complexity of building and maintaining the system. A number of vendors make data warehouse systems available as commercial packages.
Though these packages are considered “off-the-shelf,” in reality much integration work goes into making the data warehouse produce the desired benefits, even when commercial software is used as a base platform.

There are two primary approaches to the effort. The Inmon model is a top-down, enterprise-wide model, also referred to as EDW (Enterprise Data Warehouse). The Kimball model, while starting with an enterprise-wide vision, builds from the bottom up, delivering data mart functionality to individual groups.

As the data warehouse is built and the loading of data and metadata commences, the distinction between online transaction processing (OLTP) and online analytical processing (OLAP) becomes sharper. OLTP databases are typically relational, and the main data structure is an indexed table. With OLAP, the primary conceptual object of storage is a cube of three (or conceivably more) dimensions. Operations on OLTP tables have names such as join, select, and delete. Operations on OLAP cubes have names like slice, dice, roll up, and drill down.

The process of building a data warehouse has undergone considerable study, and common pitfalls have been identified so that they can be avoided. The lessons learned include, for example, the importance of establishing service level agreements, of gaining end user support, and of ETL tool selection. There are many others.

Some data warehouses, over time, become massive repositories. Others load data with increasing frequency, giving rise to the real-time data warehouse, in which up-to-the-minute data is combined with historical information to support real-time decisions of an operational nature.

In the future, it is expected that open source software will play a larger role. Use of cloud computing and storage is also likely to increase.
Data security and protection become important issues to consider as data is migrated outside the confines of the private data center.

For our case study example, in place of an extremely large data warehouse, we will use a data file based on the passengers of the Titanic to represent our data warehouse. More details on the data set are given in the case study below.

C. Data Mining

Data mining is a way to develop business intelligence [4]. A variety of tools, methods, and techniques are used to extract information useful to a business from data that may seem at first glance to be of no value. Information about customers is valuable. In today’s technological environment, information about customers can accumulate in large databases. The term “mining” conveys the idea that digging valuable information out of that data collection is not always easy or evident. From a technical standpoint, the process involves the use of statistical tools and techniques, as well as artificial intelligence, to extract the information of value. The miners in this case are often non-technical business analysts or users who need information to make better decisions.

[4] Turban, pg. 131.

There are four basic pattern types that these tools can identify:

- Association finds commonly occurring groups of items. The market basket technique is a useful example. An example from our class discussion involved the finding that beer and diapers are often purchased together.
- Predictions identify future outcomes. The decision tree technique is an example.
- Clusters identify groups of objects based on characteristics. For example, customers can be segmented in novel ways for a marketing campaign.
- Sequential relationships identify time-ordered events. For example, understanding what events trigger a prospect to become a customer, or a customer to develop repeat business, can be very valuable.
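As an illustration of the association pattern in the list above, the following is a minimal Python sketch of the two measures commonly used in market basket analysis, support and confidence. It is not taken from RapidMiner or from the class material: the baskets are invented for illustration, and the function names are hypothetical.

```python
# Hypothetical market baskets, invented purely for illustration.
BASKETS = [
    {"beer", "diapers", "chips"},
    {"beer", "diapers"},
    {"diapers", "milk"},
    {"beer", "chips"},
    {"beer", "diapers", "milk"},
]


def count_containing(itemset, baskets):
    """Number of baskets that contain every item in itemset."""
    itemset = set(itemset)
    return sum(1 for basket in baskets if itemset <= basket)


def support(itemset, baskets):
    """Fraction of all baskets that contain every item in itemset."""
    return count_containing(itemset, baskets) / len(baskets)


def confidence(antecedent, consequent, baskets):
    """Of the baskets containing antecedent, the fraction that also contain consequent."""
    both = set(antecedent) | set(consequent)
    return count_containing(both, baskets) / count_containing(antecedent, baskets)


# Beer and diapers appear together in 3 of the 5 baskets.
print(support({"beer", "diapers"}, BASKETS))       # 0.6
# Of the 4 baskets with diapers, 3 also contain beer.
print(confidence({"diapers"}, {"beer"}, BASKETS))  # 0.75
```

A rule such as "diapers implies beer" is usually considered interesting only when both measures exceed thresholds chosen by the analyst.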
The learning algorithms used in data mining can be classified as supervised or unsupervised. Supervised learning uses training data with known outcomes, which allows the algorithm to learn from past behavior in order to determine suitable patterns or rules for predicting new behavior. For example, our case study, discussed further below, includes the survived attribute so that the decision tree algorithm can learn the relationship between that outcome and attributes of each passenger such as gender and age. Unsupervised learning, by contrast, works on data with no such outcome attribute.

One other important distinction is whether the mining is hypothesis driven or discovery driven. In the first case, the analyst has an idea and is using data to test that idea. In the second, the analyst (or BI system) is searching for previously unknown relationships and their potential business value.

RapidMiner is an example of a tool used to perform data mining on a data set. It is available in free and commercial packages. It offers a wide variety of tools and an “easy to use” interface for building data mining processes. More details are demonstrated in the case below.

D. Business Intelligence - Implementation Issues

The database for our case study is relatively small, and the tool for our case study is free. In a real business environment, though, the data sets are usually much larger and less well defined, and the business intelligence systems, of which data mining is only one part, are very expensive to acquire and implement. These systems can be built in house or acquired from commercial providers. Either way, the costs for a large enterprise can be substantial. The complexity of implementing these systems makes success less than guaranteed. In this section I briefly outline some of the major issues surrounding implementation and integration of business intelligence systems into an enterprise.
There are four major implementation issues: integration, connections to data sources, the use of on-demand BI, and the ethical and legal issues associated with the accumulation and analysis of potentially sensitive information.

There are both technical and managerial issues related to the implementation of a BI system within an organization. Examples of technical issues include scalability and performance, security, and the ability to read from disparate sources of information into a common data warehouse. Examples of managerial issues include business justification and cost-benefit analysis, legal issues, and ethical issues.

There are several types of integration that can occur. The need for integration is really part of the benefit of the system as a whole. Systems that are tightly integrated with other systems, for example, can offer real-time BI and decision support.

Traditional data warehouses and BI systems are built on customer-provided infrastructure such as servers and storage arrays. New trends allow less capital-intensive alternatives. BI can be delivered as a service (software as a service, or SaaS) by firms that operate the infrastructure and application on behalf of another firm.

There are legal and privacy issues associated with business intelligence, and some high-profile cases have demonstrated the consumer backlash that can come from perceived (or real) misuse of sensitive information. Many companies now collect information that consumers themselves consider private. The collection of data on the location and travel patterns of consumers equipped with mobile phones is not unusual. Questions of how much data can be collected, who can have access to that data, and what purposes it can be used for are being debated at both legal and ethical levels.
Current-generation business intelligence systems are integrating with new sources of information, such as social media sites, and use Web 2.0 tools to gather and analyze data about consumer patterns. The US government uses “business intelligence” in counter-terrorism efforts, but as part of that effort collects and retains data on the usage patterns of millions who are not terrorists. Right or wrong, “big data” is probably here to stay; it will be incumbent on us collectively to make sure it does not turn out to be a malicious “big brother.”

An Example

a) Case description

The file titanic3.xls can be located from multiple sources using a Google search [5]. The data set contains over 1,300 rows of data, one per passenger, each with certain basic information. This file will be used along with RapidMiner to explore the capabilities of the RapidMiner product for data mining operations. On that terrible night, the captain gave the order to abandon ship, reportedly adding ‘women and children first.’ Our case will attempt to use RapidMiner to produce a decision tree based on age and gender.

b) Objective from the case

The primary objective of the case is to gain experience with RapidMiner. It is a powerful and flexible tool, though with some level of complexity. We will attempt to produce a decision tree on age, gender, and potentially other factors that affected survival. Given the captain's reported order of “women and children first,” did these groups have a better chance to survive?
c) Findings/Analysis and Business Implication

The full RapidMiner flows and steps, with screenshots, are shown in a separate document attached to this one and called “Titanic for Decision Tree: Screenshots and procedures.”

[5] For example, it is available from Carnegie-Mellon University at http://lib.stat.cmu.edu/S/Harrell/data/xls/titanic3.xls

The analysis shows that females certainly stood a much higher chance of being saved that night than did males. Running a simple decision tree on gender, we obtain the following baseline:

gender = female: survived {survived=339, perished=127}
gender = male: perished {survived=161, perished=682}

Nearly 73% of the women survived; less than 20% of the men did.

Considering the age of the passengers, I broke the ages into 4 somewhat arbitrary groups:

Age Bracket   Upper Limit    Percentage Survival
infant        2.99 years     32%
child         15.99 years    56%
adult         69.99 years    39%
elder         99.99 years    25%

age = adult: perished {survived=359, perished=564}
age = child: survived {survived=45, perished=36}
age = elder: perished {survived=2, perished=6}
age = infant: perished {survived=94, perished=203}

It is not clear, but it is possible, that many infants (< 3 years of age for my test) perished because of the cold or after the initial rescue. One caution applies: the four brackets together account for all 1,309 passengers, so any passenger whose age is missing from the file was necessarily assigned to some bracket, and the unusually large infant bracket (297 passengers) may include such records. Clearly, children under 16 did stand a much better chance of survival; this is the only age group in which more than 50% of the passengers survived.

It is very interesting to consider other classifications, or to consider classifications in combination. Consider passenger class. Without additional tuning of the Decision Tree step, age makes no difference to the tree, and neither passenger class nor age plays more than a limited role in the survival of men.
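The percentages quoted above, and the tree's preference for gender over age, can be checked directly from the leaf counts in the tree output. The following is a minimal Python sketch, not RapidMiner's implementation. It assumes an entropy-based information gain criterion, which may differ from the criterion actually used in the Decision Tree step for this paper, and the function names are illustrative.

```python
from math import log2

# Leaf counts copied from the tree output above: (survived, perished).
GENDER = {"female": (339, 127), "male": (161, 682)}
AGE = {
    "adult": (359, 564),
    "child": (45, 36),
    "elder": (2, 6),
    "infant": (94, 203),
}


def survival_rate(survived, perished):
    """Fraction of a group that survived."""
    return survived / (survived + perished)


def entropy(survived, perished):
    """Shannon entropy, in bits, of the survived/perished outcome in a group."""
    total = survived + perished
    h = 0.0
    for count in (survived, perished):
        if count:  # an empty class contributes nothing
            p = count / total
            h -= p * log2(p)
    return h


def information_gain(split):
    """Entropy of the whole data set minus the size-weighted entropy of its groups."""
    total_survived = sum(s for s, _ in split.values())
    total_perished = sum(p for _, p in split.values())
    n = total_survived + total_perished
    weighted = sum((s + p) / n * entropy(s, p) for s, p in split.values())
    return entropy(total_survived, total_perished) - weighted


for name, split in (("gender", GENDER), ("age", AGE)):
    for value, (s, p) in split.items():
        print(f"{name} = {value}: {survival_rate(s, p):.0%} survived")
    print(f"information gain of splitting on {name}: {information_gain(split):.3f} bits")
```

Run as written, the gain for the gender split comes out more than an order of magnitude larger than the gain for the age brackets, which is consistent with an untuned tree splitting on gender and ignoring age.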
Among female passengers, however, passenger class matters: the survival rates of first, second, and third class female passengers are 97%, 89%, and 49%, respectively. Being a woman of means helped on that night. Perhaps this is because of the proximity of the cabins to the lifeboats.

gender = female
|   pclass = first: survived {survived=139, perished=5}
|   pclass = second: survived {survived=94, perished=12}
|   pclass = third: perished {survived=106, perished=110}
gender = male: perished {survived=161, perished=682}

Many more combinations are possible, some of which are explored in the supplementary document attached to this one and called “Titanic for Decision Tree: Screenshots and procedures.”

d) Data set information

The data set contains 1,309 rows of data with 14 columns. Each row represents data on one passenger. The column definitions, or data fields, are itemized in Table 1.

Table 1

Field Name   Field Description
pclass       Passenger class: 1, 2, or 3.
survived     0 (did not survive) or 1 (did survive).
name         Last name, First name format.
sex          Gender: male or female.
age          Age in years.
sibsp        Siblings/spouses traveling together: numeric.
parch        Parents/children traveling together: numeric.
ticket       Ticket number.
fare         Fare price.
cabin        Text string indicating cabin assignment.
embarked     Port of embarkation: S, C, or Q.
boat         Lifeboat number.
body         A number for recovered corpses.
home.dest    Text, home city of passenger listed on manifest.

Clearly, not all columns are relevant to the decision tree. For example, the ticket number is not likely to predict survival.

Conclusion

RapidMiner is a powerful free tool that provides data mining and business intelligence capabilities to firms that otherwise would have no opportunity to make use of the decision-making support of modern systems.
With respect to the Titanic, it is reasonable to conclude that women and children did in fact have a better chance of survival. Determining the percentage of women who survived, or the effect of any single factor, does not require RapidMiner or a decision tree. Much more interesting than simple statistical calculations on a single variable is RapidMiner's decision tree support for building rules that predict an outcome based on multiple characteristics. In my example, I used age, gender, passenger class, and in some cases whether the person was alone or with family to produce various decision trees. It is not hard to imagine business applications, such as predicting customer repeat business or potential response to a marketing campaign, where the power of RapidMiner becomes relevant to the workplace today.

References

1. http://en.wikipedia.org/wiki/Titanic. Background information on the Titanic came from the Wikipedia page.
2. Turban, Efraim; et al. “Business Intelligence: A Managerial Approach,” 2nd Edition.