Improving Prediction Accuracy through Data Mining

Grant Cahill
CS 700 Thesis Proposal

Table of Contents

Introduction
What is Data Mining?
MySQL
Java Data Mining
Unofficial Guide
Hotel Quote Database Design
Plan of Attack
    Optimize Database
    Incorporate Java Data Mining
Current Work
Conclusion
References

Introduction

Imagine this: a mother of three is rushing to plan her trip of a lifetime to Walt Disney World in Orlando, Florida.
She is both ecstatic and anxious at the idea of taking her family to this world mecca. There is only one catch: she has neither much money nor much time to plan this excursion. If only there were a way she could get projected costs for her trip without having to book a thing. Now, through the use of Data Mining, these types of projections are possible. One could predict the cost of, say, the hotels of interest for any particular time of year in order to find the most cost-effective time to stay. For my thesis I plan to use real-world data, courtesy of Len Testa and The Unofficial Guide to Walt Disney World, and analyze millions of records in a relational database in order to accurately predict hotel quotes for any given hotel listed in the database. This can be done through the use of several technologies, including MySQL, the Java Data Mining libraries, and the database connectivity APIs.

What is Data Mining?

Data Mining is a relatively recent form of research and analysis in the world of database technology. It gives the user the ability to discern trends and common attributes in large sets of data. One dictionary definition states, “Data Mining can be thought of as data processing using sophisticated data search capabilities and statistical algorithms to discover patterns and correlations in large preexisting databases; a way to discover new meaning in data.” Data Mining is often compared to such topics as artificial intelligence and machine learning because it involves not only knowledge discovery but also the ability to act upon that discovery.

The typical steps involved in successful Data Mining are:

1. Define your objective
2. Prepare your data
3. Configure the mining tasks/algorithms
4. Build the Data Mining model
5. Test the model
6. Report findings

Building on these steps, a simple example of data mining is one that is commonly seen in the world of retail.
Many retail stores want to know the wants and needs of their customers. Data Mining provides the resources to handle this seemingly unattainable task. A retail outlet can record the transactions of its customers in a large repository (a database, for example), analyze that data through Data Mining, and consequently predict the trends and desires of its customer base. For example, an online store could use Data Mining to offer suggestions when a previous customer browses its pages.

As you can see, Data Mining can be a fantastic tool for data analysis. However, it can also be extremely sophisticated and complicated, and it may require the help of additional professionals such as statisticians, database analysts, and software architects. It is intentionally iterative in design and thus must be revisited, or else the data and/or model become obsolete. Data Mining thus provides the tools necessary to solve the problem at hand: determining future pricing through analysis of historical data.

MySQL

MySQL is an open source relational database developed by MySQL AB, a European-based corporation. First released in 1995, MySQL is quickly becoming a well-known and widely accepted database. Not only does it provide remarkable speed and capacity, but it is also extremely inexpensive in terms of corporate licensing and support. Additionally, for those not distributing MySQL as part of a commercial product, MySQL is completely free under the GNU General Public License, which makes it extremely attractive to students and researchers. MySQL gives users the ability to create multiple user accounts, set up master-slave replication, back web hosting services, and run on multiprocessor systems. These are just a few of the capabilities available to MySQL operators.
Some of the additional highlights of MySQL include:

  - ACID (Atomicity-Consistency-Isolation-Durability) compliance when a transactional storage engine such as InnoDB is used
  - Stored procedures
  - Triggers
  - Cursors
  - Database connectivity with multiple programming languages including C/C++, C#, Java, Perl, PHP, and Ruby
  - Incorporation with the LAMP (Linux-Apache-MySQL-PHP) stack
  - SSL support
  - Query caching
  - Multiple storage engines

Thus, MySQL is an extremely good choice when it comes to database selection and design, which is why I will be incorporating it into my project.

Java Data Mining

As mentioned above, Data Mining can be an expensive, complicated, and time-consuming process. It involves many steps and elaborate algorithms in order to operate successfully. The Java community, however, has come up with its own standard API to handle Data Mining. Through Java Specification Request (JSR) 73, the Java Data Mining API has been made public and available for use. In order to use it, classes from the ‘javax.datamining’ packages must be imported into the specific class being written. The expert group involved in the creation of this API included such industry giants as Oracle, IBM, Sun Microsystems, and BEA Systems. A simple way to view how the Java Data Mining libraries fit into the big picture is as follows:

    [ Java API ]
    [ Data Mining Engine ]
    [ Mining Object Repository ]

In the above diagram, the Java API is the ‘javax.datamining’ library visible to the software engineer implementing the data mining solution. The Java Data Mining API can be thought of as an abstraction layer that gives the programmer an interface through which to interact with the Data Mining Engine (DME). This brings us to the next box in the diagram, the DME. The DME is where the core processing, that is, the execution of the data mining algorithm, occurs. Finally, we end with the Mining Object Repository. The Mining Object Repository is where the engine stores the mining objects it works with, such as models, settings, and tasks, along with the descriptions of the data to be analyzed.
As you can see, the Java Data Mining API is a good fit for processing and solving the Hotel Quotes problem.

Unofficial Guide

The Unofficial Guide is a great reference and a wonderful piece of work for anyone considering a trip to the central Florida area. Len Testa is one of the authors of The Unofficial Guide to Walt Disney World, and he has been gracious enough to assist me by providing the data necessary to complete my thesis. The Guide, as it is commonly known, has been in existence since 1986 and has helped more than two million families plan their ultimate vacation to Walt Disney World in Orlando, Florida. Along with its website, www.touringplans.com, The Unofficial Guide to Walt Disney World helps readers optimize both their time in the parks and their monetary expenditures. Through a talented team including, but not limited to, researchers, statisticians, and computer scientists, The Guide is able to provide its consumers with the most accurate and up-to-date information possible.

Hotel Quote Database Design

As part of its research, The Unofficial Guide maintains a database of hotel quotes for the Orlando area. This database is used to record the rates of numerous hotels throughout the central Florida region. The table holding all of the quotes has a structure similar to the following:

    Field        Type     Null  Key  Default     Extra
    ResortID     int(11)  NO    PRI  0
    VendorID     int(11)  NO    PRI  0
    RunDate      date     NO    PRI  0000-00-00
    NightOfStay  date     NO    PRI  0000-00-00
    Cost         float    NO    MUL  0

Here ‘ResortID’ and ‘VendorID’ are both tied to the actual names of resorts and vendors throughout the Orlando area, but for processing purposes these items are identified by an integer. Using ‘RunDate’, ‘NightOfStay’, and ‘Cost’, I will be able to extract the data necessary for successful prediction.
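To make the role of these fields more concrete, the following is a minimal sketch in plain Java of how rows shaped like the quote table could be grouped by resort and by the month of ‘NightOfStay’ to obtain an average nightly cost. The class and method names (QuoteAverager, Quote, averageCostByMonth), the reading of ‘RunDate’ as the day a quote was collected, and all sample values are assumptions made for illustration. It shows the kind of aggregation the fields allow and is not the prediction method the thesis will ultimately use.

```java
import java.time.LocalDate;
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

// Hypothetical sketch: class names, method names, and sample values are
// assumptions; only the field names mirror the quote table described above.
final class QuoteAverager {

    /** One row of the quote table, held in memory. */
    static final class Quote {
        final int resortId;
        final int vendorId;
        final LocalDate runDate;     // assumed: day the quote was collected
        final LocalDate nightOfStay; // night the quote applies to
        final double cost;

        Quote(int resortId, int vendorId, LocalDate runDate,
              LocalDate nightOfStay, double cost) {
            this.resortId = resortId;
            this.vendorId = vendorId;
            this.runDate = runDate;
            this.nightOfStay = nightOfStay;
            this.cost = cost;
        }
    }

    /** Average cost per calendar month (1-12) of NightOfStay for one resort. */
    static Map<Integer, Double> averageCostByMonth(List<Quote> quotes, int resortId) {
        Map<Integer, double[]> sums = new TreeMap<>(); // month -> {total, count}
        for (Quote q : quotes) {
            if (q.resortId != resortId) {
                continue; // ignore quotes for other resorts
            }
            int month = q.nightOfStay.getMonthValue();
            double[] acc = sums.computeIfAbsent(month, m -> new double[2]);
            acc[0] += q.cost;
            acc[1] += 1;
        }
        Map<Integer, Double> averages = new TreeMap<>();
        for (Map.Entry<Integer, double[]> e : sums.entrySet()) {
            averages.put(e.getKey(), e.getValue()[0] / e.getValue()[1]);
        }
        return averages;
    }

    /** Made-up rows used only to exercise the method. */
    static List<Quote> sampleQuotes() {
        List<Quote> quotes = new ArrayList<>();
        LocalDate run = LocalDate.of(2007, 6, 1);
        quotes.add(new Quote(1, 10, run, LocalDate.of(2007, 7, 4), 120.0));
        quotes.add(new Quote(1, 11, run, LocalDate.of(2007, 7, 5), 140.0));
        quotes.add(new Quote(1, 10, run, LocalDate.of(2008, 1, 10), 90.0));
        quotes.add(new Quote(2, 10, run, LocalDate.of(2007, 7, 4), 200.0));
        return quotes;
    }

    public static void main(String[] args) {
        System.out.println(averageCostByMonth(sampleQuotes(), 1));
    }
}
```

In practice the rows would come from the MySQL table rather than from an in-memory list; the list is used here only to keep the sketch self-contained.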
The ‘CREATE TABLE’ statement used to create the quote table, named ‘rateshistorical’, is as follows:

    mysql> show create table rateshistorical \G
    *************************** 1. row ***************************
           Table: rateshistorical
    Create Table: CREATE TABLE `rateshistorical` (
      `ResortID` int(11) NOT NULL default '0',
      `VendorID` int(11) NOT NULL default '0',
      `RunDate` date NOT NULL default '0000-00-00',
      `NightOfStay` date NOT NULL default '0000-00-00',
      `Cost` float NOT NULL default '0',
      PRIMARY KEY (`ResortID`,`VendorID`,`RunDate`,`NightOfStay`),
      KEY `IDX_VendorID` (`VendorID`),
      KEY `IDX_RunDate` (`RunDate`),
      KEY `IDX_NightOfStay` (`NightOfStay`),
      KEY `idx_cost` (`Cost`)
    ) ENGINE=MyISAM DEFAULT CHARSET=latin1

Plan of Attack

In order to complete my thesis, my plan is two-fold. First, I would like to optimize the database holding the hotel quote data. There are many ways to do this, and I have listed my primary options in the following subsection. Second, I would like to use data mining to analyze the hotel data and consequently predict trends in the data. Through this “trend analysis” I wish to be able to predict, with a high degree of accuracy, potential hotel quotes for any given hotel at any given time of year based upon historical data.

Optimize Database

As you can imagine, operating a database consisting of millions of records can introduce serious lag in performance. Not only does The Guide have to operate a large database, but this database also services a multitude of users as transactions are carried out throughout the day. Therefore, the first step I wish to pursue is to improve the throughput and the latency/response time of the hotel database. Throughput is typically measured in transactions per unit of time (second/minute/hour). There are several symptoms that show throughput is lagging, one of which is known as “starvation”. Starvation occurs when users wait too long for a response from the system in use.
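As a rough illustration of these measures, the sketch below computes throughput (transactions per second), a latency percentile, and the share of requests that waited longer than a chosen threshold, which is one simple way to flag starvation. The class name, the nearest-rank percentile method, the 200 ms threshold, and the sample latencies are all assumptions chosen for illustration; they are not measurements from the actual system.

```java
import java.util.Arrays;

// Hypothetical sketch: names, threshold, and sample values are assumptions.
final class ThroughputStats {

    /** Transactions completed per second over a wall-clock window. */
    static double throughputPerSecond(int transactions, double windowSeconds) {
        if (windowSeconds <= 0) {
            throw new IllegalArgumentException("window must be positive");
        }
        return transactions / windowSeconds;
    }

    /** Latency at percentile p (0 < p <= 100), using the nearest-rank method. */
    static double percentile(double[] latenciesMs, double p) {
        if (latenciesMs.length == 0 || p <= 0 || p > 100) {
            throw new IllegalArgumentException("need data and 0 < p <= 100");
        }
        double[] sorted = latenciesMs.clone(); // leave the caller's array untouched
        Arrays.sort(sorted);
        int rank = (int) Math.ceil(p / 100.0 * sorted.length);
        return sorted[rank - 1];
    }

    /** Fraction of requests that waited longer than the threshold. */
    static double fractionSlowerThan(double[] latenciesMs, double thresholdMs) {
        if (latenciesMs.length == 0) {
            return 0.0;
        }
        int slow = 0;
        for (double latency : latenciesMs) {
            if (latency > thresholdMs) {
                slow++;
            }
        }
        return (double) slow / latenciesMs.length;
    }

    public static void main(String[] args) {
        // Made-up latencies (ms) for ten requests observed over two seconds.
        double[] latencies = {12, 15, 11, 250, 14, 13, 900, 16, 12, 17};
        System.out.println("throughput/s: " + throughputPerSecond(latencies.length, 2.0));
        System.out.println("median ms:    " + percentile(latencies, 50));
        System.out.println("share > 200:  " + fractionSlowerThan(latencies, 200.0));
    }
}
```

Reporting a percentile and a slow-request share alongside the average is a deliberate choice in this sketch: a handful of very slow requests can leave the average looking healthy while some users are effectively starved.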
One of the problems with throughput/latency measurements is that often only wall clock time is taken into account, while other extraneous factors may be at play. Queuing theory also plays a major part in multi-user applications, where the user response time is equal to the queuing delay plus the service time. As the system approaches saturation, the queuing delay grows rapidly. In order to improve response time, either the queuing delay or the service time needs to be reduced. Because we are more concerned with the server side of the application, we will focus on the service time needed to process the query. Additional considerations in terms of database tuning include:

1. Storage engine
2. Query cache
3. Slow query log
4. Connection and thread settings
5. Hardware characteristics (memory, disk space, CPUs, etc.)
6. Query tuning
7. Table locks

Based on the above, in order to complete this task I plan to benchmark the current processing time required to process transactions on www.touringplans.com. These results will serve as the baseline against which I measure each optimization technique.

Incorporate Java Data Mining

In the second step of my thesis, I plan to use the Java Data Mining API, as described above, to accurately predict hotel quotes for any given hotel at any given time of year from the historical data provided by The Unofficial Guide to Walt Disney World. Once the data has been selected and prepared, a model needs to be built and then applied to the attributes, producing results from the execution of the algorithm. Such results include the predicted category, its probability, and its cost. Once a given prediction model/solution is selected, it must be iterated over and reviewed for accuracy. Additionally, over time, the given data mining solution to a specified problem will slowly become obsolete and need to be revisited.
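The accuracy review mentioned above can be sketched as follows. The example compares predicted quotes with actual quotes that were held back from model building and reports the mean absolute error and the mean absolute percentage error. The choice of these two metrics, the class and method names, and the sample values are assumptions for illustration; the proposal does not yet commit to a specific accuracy measure.

```java
// Hypothetical sketch: metric choice, names, and sample values are assumptions.
final class PredictionError {

    /** Mean absolute error, in the same units as the quotes. */
    static double meanAbsoluteError(double[] actual, double[] predicted) {
        checkInputs(actual, predicted);
        double total = 0.0;
        for (int i = 0; i < actual.length; i++) {
            total += Math.abs(actual[i] - predicted[i]);
        }
        return total / actual.length;
    }

    /** Mean absolute percentage error, as a percentage of the actual quote. */
    static double meanAbsolutePercentageError(double[] actual, double[] predicted) {
        checkInputs(actual, predicted);
        double total = 0.0;
        for (int i = 0; i < actual.length; i++) {
            if (actual[i] == 0.0) {
                throw new IllegalArgumentException("actual quote must be non-zero");
            }
            total += Math.abs((actual[i] - predicted[i]) / actual[i]);
        }
        return 100.0 * total / actual.length;
    }

    private static void checkInputs(double[] actual, double[] predicted) {
        if (actual.length == 0 || actual.length != predicted.length) {
            throw new IllegalArgumentException("arrays must be non-empty and equal in length");
        }
    }

    public static void main(String[] args) {
        // Made-up held-out quotes and the corresponding predictions.
        double[] actual = {100.0, 200.0, 150.0};
        double[] predicted = {110.0, 190.0, 150.0};
        System.out.println("MAE:  " + meanAbsoluteError(actual, predicted));
        System.out.println("MAPE: " + meanAbsolutePercentageError(actual, predicted) + "%");
    }
}
```

Re-running such a check as new quotes arrive is one way to notice when a model has drifted and needs to be rebuilt.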
To finish, the ‘javax.datamining’ library will be used to build a model for the prediction of hotel quotes across a number of resorts in the central Florida area.

Current Work

Talk about current work already out there.

Conclusion

In conclusion, I plan to center my thesis on the use of Data Mining in hopes of generating accurate predictions of hotel quotes in the Orlando, Florida area. Using technologies including MySQL, Connector/J, Java Data Mining, and the Eclipse IDE, I plan to solve the problem of hotel quote prediction/analysis. The Java Data Mining API (specifically ‘javax.datamining’) will be the main driver of my analysis, as I use it to generate models and predictions based upon the data stored in the MySQL tables.

References

http://en.wikipedia.org/wiki/Data_mining
http://dictionary.reference.com/browse/data%20mining
http://www.touringplans.com/
http://www.mysql.com/
http://en.wikipedia.org/wiki/Mysql
http://www.artima.com/lejava/articles/data_mining.html
MySQL Performance Tuning & Optimization Workshop
http://jcp.org/en/jsr/detail?id=73
Java Data Mining: Strategy, Standard, and Practice