Improving Prediction Accuracy through Data Mining
Grant Cahill
CS 700
Thesis Proposal
Table of Contents
Introduction
What is Data Mining?
MySQL
Java Data Mining
Unofficial Guide
Hotel Quote Database Design
Plan of Attack
    Optimize Database
    Incorporate Java Data Mining
Current Work
Conclusion
References
Introduction
Imagine a mother of three rushing to plan the trip of a lifetime to Walt Disney
World in Orlando, Florida. She is both ecstatic and anxious at the idea of taking her
family to this world-famous destination. There is only one catch: she has little money
and little time to plan the excursion. What if she could get projected
costs for her trip without having to book a thing? Through the use of Data Mining,
these types of projections are possible. One could predict the cost of, say, the hotels of
interest for any particular time of year in order to find the most cost-effective
time to stay. For my thesis I plan to use real-world data, courtesy of Len Testa and The
Unofficial Guide to Walt Disney World, and analyze millions of records in a relational
database in order to accurately predict hotel quotes for any hotel listed in the
database. This will be done using several technologies, including MySQL, the
Java Data Mining libraries, and database connectivity APIs.
What is Data Mining?
Data Mining is a relatively recent form of research and analysis in the world of database
technology. It gives the user the ability to discern trends and common
attributes in large sets of data. One dictionary definition states, “Data Mining
can be thought of as data processing using sophisticated data search capabilities and
statistical algorithms to discover patterns and correlations in large preexisting databases;
a way to discover new meaning in data.”
Data Mining is often compared to topics such as artificial intelligence and machine
learning because it involves not only knowledge discovery but also the ability to
act upon that discovery.
The typical steps involved in successful Data Mining are:
1. Define your objective
2. Prepare your data
3. Configure the mining tasks/algorithms
4. Build Data Mining model
5. Test model
6. Report findings
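The six steps above can be sketched end to end in a few lines of Java. This is only an illustrative sketch under stated assumptions: the class and method names are hypothetical, the cost figures are made up, and a plain average stands in for a real mining algorithm.

```java
import java.util.Arrays;

/** Hypothetical walk-through of the six steps using a trivial mean-value model. */
final class MiningStepsSketch {

    /** Step 4: "build" the model; here the model is just the mean of the training costs. */
    static double buildMeanModel(double[] trainingCosts) {
        return Arrays.stream(trainingCosts).average().orElse(Double.NaN);
    }

    /** Step 5: test the model with the mean absolute error on held-out costs. */
    static double meanAbsoluteError(double prediction, double[] heldOutCosts) {
        return Arrays.stream(heldOutCosts)
                .map(actual -> Math.abs(actual - prediction))
                .average()
                .orElse(Double.NaN);
    }

    public static void main(String[] args) {
        // Step 1: objective = predict the nightly cost of one resort (made-up numbers).
        // Step 2: prepare data = split historical costs into training and held-out sets.
        double[] training = {100.0, 110.0, 120.0};
        double[] heldOut = {105.0, 125.0};

        // Step 3: configure = choose the algorithm (a plain average in this sketch).
        double model = buildMeanModel(training);
        double error = meanAbsoluteError(model, heldOut);

        // Step 6: report findings.
        System.out.printf("predicted cost = %.2f, mean absolute error = %.2f%n", model, error);
    }
}
```

A real model would replace the average in step 4, but the surrounding loop of preparing, building, testing, and reporting stays the same.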
A simple example of Data Mining is one that is commonly seen
in the world of retail. Many retail stores want to know the wants
and needs of their customers, and Data Mining provides the means to handle this
seemingly unattainable task. A retail outlet can
record the transactions of its customers in a large repository (a database, for example),
analyze that data through Data Mining, and consequently predict the trends and
desires of its customer base. For example, an online store could use Data Mining to
offer suggestions when a returning customer browses its pages.
As you can see, Data Mining can be a fantastic tool for data analysis. However, it can
also be sophisticated and complicated, and it often requires the help of
additional professionals such as statisticians, database analysts, and software architects. It
is iterative by design and thus must be revisited, or else the data and/or model
become obsolete.
Data Mining thus provides the tools necessary to solve the problem at hand:
determining future pricing through analysis of historical data.
MySQL
MySQL is an open source relational database developed by MySQL AB, a European-based
corporation. Released in 1995, MySQL is quickly becoming a well-known and
widely accepted database. Not only does it provide remarkable speed and capacity, but it
is also inexpensive in terms of corporate licensing and support. Additionally,
for those not releasing MySQL as part of a commercial product, MySQL is completely
free under the GNU General Public License, which makes it attractive to
students and researchers.
MySQL allows users to create multiple user accounts, set up master-slave
replication, serve as the back end for web hosting services, and run on multiprocessor
systems. These are just a few of the capabilities available to MySQL operators. Some of
the additional highlights of MySQL include:
- ACID (Atomicity-Consistency-Isolation-Durability) compliance (with a transactional
  storage engine such as InnoDB)
- Stored procedures
- Triggers
- Cursors
- Database connectivity with multiple programming languages including C/C++,
  C#, Java, Perl, PHP, and Ruby
- Incorporation with the LAMP (Linux-Apache-MySQL-PHP) stack
- SSL support
- Query caching
- Multiple storage engines
Thus, MySQL is a strong choice when it comes to database selection and
design, which is why I will be incorporating it into my project.
Java Data Mining
As mentioned above, Data Mining can be an expensive, complicated, and time-consuming
process. It involves many steps and elaborate algorithms in order to
operate successfully. The Java community, however, has produced its own set of libraries to
handle Data Mining.
Through Java Specification Request (JSR) 73, the Java Data Mining libraries have been
made public and available for use. In order to access the Java Data Mining libraries, the
‘javax.datamining’ API must be imported into the class that uses it. The expert
group involved in the creation of this API included such industry giants as Oracle, IBM,
Sun Microsystems, and BEA Systems.
A simple way to view how the Java Data Mining libraries fit in to the big picture is as
follows:
[ Java API ]
      |
[ Data Mining Engine ]
      |
[ Mining Object Repository ]
In the above diagram, the Java API is the ‘javax.datamining’ library visible to the
software engineer implementing the data mining algorithm. The Java Data Mining API
can be thought of as an abstraction layer giving the programmer an interface
through which to interact with the Data Mining Engine (DME). This brings us to the next box in
the diagram, the DME. The DME is where the core processing, the execution of the data
mining algorithm, occurs. Finally, we end with the Mining Object Repository. It is in the Mining
Object Repository where mining objects such as models, settings, and tasks are persisted.
The Java Data Mining API is therefore a good fit for processing and
solving the hotel quote prediction problem.
Unofficial Guide
The Unofficial Guide is a great reference and a wonderful piece of work for anyone
considering a trip to the central Florida area. Len Testa is one of the authors of The
Unofficial Guide to Walt Disney World, and he has been gracious enough to assist me by
providing the data necessary to complete my thesis.
The Guide, as it is commonly known, has been in existence since 1986 and has helped
more than two million families plan their ultimate vacation to Walt Disney World, located
in Orlando, Florida. Along with its website, www.touringplans.com, the Unofficial
Guide to Walt Disney World helps readers optimize both their time in
the parks and their monetary expenditures. Through the use of a talented team
including, but not limited to, researchers, statisticians, and computer scientists, The Guide
is able to provide its readers with the most accurate and up-to-date information
possible.
Hotel Quote Database Design
As part of its research, The Unofficial Guide incorporates a database consisting of hotel
quotes in the Orlando area. This database is used to record hotel rates of numerous hotels
throughout the central Florida region. The table holding all of the quotes has a structure
similar to the following:
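+-------------+---------+------+-----+------------+-------+
| Field       | Type    | Null | Key | Default    | Extra |
+-------------+---------+------+-----+------------+-------+
| ResortID    | int(11) | NO   | PRI | 0          |       |
| VendorID    | int(11) | NO   | PRI | 0          |       |
| RunDate     | date    | NO   | PRI | 0000-00-00 |       |
| NightOfStay | date    | NO   | PRI | 0000-00-00 |       |
| Cost        | float   | NO   | MUL | 0          |       |
+-------------+---------+------+-----+------------+-------+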
Here the ‘ResortID’ and ‘VendorID’ are both tied to actual names of resorts and vendors
throughout the Orlando area, but for processing purposes these items are identified by an integer.
Through the use of ‘RunDate’, ‘NightOfStay’, and ‘Cost’, I will be able to extract the
data necessary for successful prediction.
Additionally, the ‘CREATE TABLE’ statement for the table described above, as reported
by MySQL, is as follows:
mysql> show create table rateshistorical \G
*************************** 1. row ***************************
Table: rateshistorical
Create Table: CREATE TABLE `rateshistorical` (
`ResortID` int(11) NOT NULL default '0',
`VendorID` int(11) NOT NULL default '0',
`RunDate` date NOT NULL default '0000-00-00',
`NightOfStay` date NOT NULL default '0000-00-00',
`Cost` float NOT NULL default '0',
PRIMARY KEY (`ResortID`,`VendorID`,`RunDate`,`NightOfStay`),
KEY `IDX_VendorID` (`VendorID`),
KEY `IDX_RunDate` (`RunDate`),
KEY `IDX_NightOfStay` (`NightOfStay`),
KEY `idx_cost` (`Cost`)
) ENGINE=MyISAM DEFAULT CHARSET=latin1
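To illustrate how these columns might feed a prediction, the following sketch holds rows in memory and averages ‘Cost’ per calendar month of ‘NightOfStay’ for one resort. It is a hypothetical sketch, not the project's code: the class and method names are invented for illustration, and the reading of ‘RunDate’ as the date a quote was collected is an assumption that the table definition does not confirm.

```java
import java.time.LocalDate;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;
import java.util.stream.Collectors;

/** Hypothetical in-memory view of rateshistorical rows. */
final class RateQuoteSketch {

    /** One quote; the fields mirror the columns of the rateshistorical table. */
    static final class RateQuote {
        final int resortId;
        final int vendorId;
        final LocalDate runDate;      // assumed: the date the quote was collected
        final LocalDate nightOfStay;  // the night the quoted price applies to
        final double cost;

        RateQuote(int resortId, int vendorId, LocalDate runDate,
                  LocalDate nightOfStay, double cost) {
            this.resortId = resortId;
            this.vendorId = vendorId;
            this.runDate = runDate;
            this.nightOfStay = nightOfStay;
            this.cost = cost;
        }
    }

    /** Average cost keyed by month (1-12) of NightOfStay, restricted to one resort. */
    static Map<Integer, Double> averageCostByMonth(List<RateQuote> quotes, int resortId) {
        return quotes.stream()
                .filter(q -> q.resortId == resortId)
                .collect(Collectors.groupingBy(
                        (RateQuote q) -> q.nightOfStay.getMonthValue(),
                        TreeMap::new,
                        Collectors.averagingDouble((RateQuote q) -> q.cost)));
    }
}
```

Monthly averages of this kind give only a naive seasonal baseline; they are a point of comparison for a mined model, not a substitute for one.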
Plan of Attack
In order to complete my thesis, my plan is two-fold. First, I would like to optimize the
database holding the hotel quote data. There are many ways to do this, and I have listed
my primary options in the following subsection. Second, I would like to use data mining to
analyze the hotel data and consequently predict trends in the data. Through this “trend
analysis” I wish to be able to predict, with a high degree of precision, potential hotel
quotes for any given hotel at any given time of year based upon historical data.
Optimize Database
As you can imagine, operating a database consisting of millions of records can introduce
serious performance lag. Not only does The Guide have to operate a large
database, but this database also services a multitude of users as transactions are carried out
throughout the day. Therefore, the first step I wish to pursue is to improve the throughput
and latency (response time) of the hotel database.
Throughput is typically measured in transactions per unit of time (second, minute, or hour).
There are several symptoms that show throughput is lagging, one of which is known as
“starvation”. Starvation occurs when users wait too long for a response from the
system in use. One of the problems with throughput and latency measurements is that often
only wall-clock time is taken into account while other extraneous factors may be at
play.
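As a rough illustration of these two measurements, the sketch below computes throughput over a wall-clock window and a tail percentile of recorded response times, since long tails are where starvation shows up. The class and method names are hypothetical, and the nearest-rank percentile is one common convention chosen here for simplicity.

```java
import java.util.Arrays;

/** Hypothetical helpers for summarizing measured transactions. */
final class ThroughputSketch {

    /** Transactions per second over a measured wall-clock window. */
    static double throughputPerSecond(long transactions, long elapsedMillis) {
        if (elapsedMillis <= 0) {
            throw new IllegalArgumentException("elapsedMillis must be positive");
        }
        return transactions * 1000.0 / elapsedMillis;
    }

    /** Nearest-rank percentile of response times; percent is in the range 0-100. */
    static long percentile(long[] responseMillis, int percent) {
        if (responseMillis.length == 0 || percent < 0 || percent > 100) {
            throw new IllegalArgumentException("need data and a percent in 0-100");
        }
        long[] sorted = responseMillis.clone();
        Arrays.sort(sorted);
        // Ceiling of percent/100 * n, computed in integer arithmetic.
        int rank = (percent * sorted.length + 99) / 100;
        return sorted[Math.max(rank, 1) - 1];
    }
}
```

For example, 1,200 transactions in 60,000 ms is 20 transactions per second, while a single 250 ms outlier among ten responses dominates the 100th percentile even though the median stays low.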
Queuing theory also plays a major part in multi-user applications, where the user response
time is equal to the queuing delay plus the service time. As the system
approaches saturation, the queuing delay grows rapidly. In order to improve response time,
either the queuing delay or the service time needs to be reduced. Because we are
more concerned with the server side of the application, we will focus on the service
time needed to process the query.
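The relationship between response time, queuing delay, and service time can be made concrete with a small calculation. The sketch below assumes the textbook M/M/1 model (a single server with random arrivals and service times); that model is an assumption made here for illustration and may not match the real workload. The class and method names are hypothetical.

```java
/** Hypothetical M/M/1 calculation: response time = queuing delay + service time. */
final class QueueingSketch {

    /** Mean time a request waits in the queue before service begins. */
    static double queueingDelay(double arrivalsPerSecond, double serviceSeconds) {
        double utilization = arrivalsPerSecond * serviceSeconds;
        if (utilization >= 1.0) {
            return Double.POSITIVE_INFINITY; // saturated: the queue grows without bound
        }
        return utilization * serviceSeconds / (1.0 - utilization);
    }

    /** Mean response time seen by the user. */
    static double responseTime(double arrivalsPerSecond, double serviceSeconds) {
        return queueingDelay(arrivalsPerSecond, serviceSeconds) + serviceSeconds;
    }
}
```

With a 0.1 s service time, 5 arrivals per second give a response time of about 0.2 s, while 9 arrivals per second give about 1 s. Under this model, shortening the service time also lowers utilization, so it shrinks the queuing delay as well.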
Additional considerations in terms of database tuning include:
1. Storage engine
2. Query cache
3. Slow query log
4. Connection and thread settings
5. Hardware characteristics (memory, disk space, CPUs, etc.)
6. Query tuning
7. Table locks
Based on the above, in order to complete this task I plan to take benchmarks of the
current time required to process transactions on www.touringplans.com.
These results will serve as the baseline against which each optimization technique is measured.
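One way such a before-and-after comparison could be set up is sketched below. The harness is hypothetical: the names are invented, the task being timed is left to the caller, and the median of several runs is used only as a simple guard against outliers.

```java
import java.util.Arrays;

/** Hypothetical timing harness for comparing a baseline against an optimized run. */
final class BenchmarkSketch {

    /** Runs the task several times and returns the median elapsed time in nanoseconds
     *  (the upper median when the number of runs is even). */
    static long medianNanos(Runnable task, int runs) {
        if (runs <= 0) {
            throw new IllegalArgumentException("runs must be positive");
        }
        long[] elapsed = new long[runs];
        for (int i = 0; i < runs; i++) {
            long start = System.nanoTime();
            task.run();
            elapsed[i] = System.nanoTime() - start;
        }
        Arrays.sort(elapsed);
        return elapsed[runs / 2];
    }

    /** Percentage by which the optimized time improves on the baseline time. */
    static double percentImprovement(double baseline, double optimized) {
        if (baseline <= 0) {
            throw new IllegalArgumentException("baseline must be positive");
        }
        return (baseline - optimized) / baseline * 100.0;
    }
}
```

For instance, a query that drops from 200 ms to 150 ms after a tuning change shows a 25 percent improvement over the baseline.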
Incorporate Java Data Mining
In the second step of my thesis, I plan to use the Java Data Mining API, as described
above, in order to accurately predict Hotel Quotes for any given hotel at any given time
of year through the historical data provided by The Unofficial Guide to Walt Disney
World.
Once the data has been selected and prepared, a model must be built and applied to the
attributes of interest. The results of applying such a model can include a predicted category,
a probability, and a cost. Once a given prediction model/solution is selected, it must be
iterated over and reviewed for accuracy. Additionally, over time, the given data mining
solution to a specified problem will slowly become obsolete and need to be revisited.
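A simple way to notice that a model is becoming obsolete is to compare its recent prediction error against the error measured when it was first validated. The sketch below is a hypothetical illustration using mean absolute percentage error; the names, the choice of metric, and the tolerance factor are all assumptions made here.

```java
/** Hypothetical check for when a prediction model should be rebuilt. */
final class ModelDriftSketch {

    /** Mean absolute percentage error; assumes every actual value is positive. */
    static double meanAbsolutePercentageError(double[] predicted, double[] actual) {
        if (predicted.length != actual.length || actual.length == 0) {
            throw new IllegalArgumentException("arrays must be non-empty and of equal length");
        }
        double sum = 0.0;
        for (int i = 0; i < actual.length; i++) {
            sum += Math.abs(actual[i] - predicted[i]) / actual[i];
        }
        return 100.0 * sum / actual.length;
    }

    /** True when the recent error exceeds the validation error by more than the given factor. */
    static boolean needsRebuild(double validationError, double recentError,
                                double toleranceFactor) {
        return recentError > validationError * toleranceFactor;
    }
}
```

With a tolerance factor of 1.25, a model validated at 10 percent error would be flagged for rebuilding once its recent error passes 12.5 percent.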
To finish, the ‘javax.datamining’ library will be used in generating an algorithm for the
prediction of hotel quotes across a number of resorts in the central Florida area.
Current Work
Talk about current work already out there.
Conclusion
In conclusion, I plan to center my thesis on the use of Data Mining in hopes of
generating accurate predictions of hotel quotes in the Orlando, Florida area. Through the
use of various technologies, including MySQL, Connector/J, Java Data Mining, and the
Eclipse IDE, I plan to solve the problem of hotel quote prediction and analysis. The Java Data
Mining API (specifically javax.datamining) will be the main driver of my analysis, as I
use it to generate models and predictions based upon the data derived from the
tables stored in MySQL.
References
http://en.wikipedia.org/wiki/Data_mining
http://dictionary.reference.com/browse/data%20mining
http://www.touringplans.com/
http://www.mysql.com/
http://en.wikipedia.org/wiki/Mysql
http://www.artima.com/lejava/articles/data_mining.html
MySQL Performance Tuning & Optimization Workshop
http://jcp.org/en/jsr/detail?id=73
Java Data Mining: Strategy, Standard, and Practice