Technical Review Paper: Data Mining

Due: 21 January 2009
Name: Mark Peterson
Advisor: Dr. David Keezer
Section: L04
Group: Automated Beverage Dispenser (DK-05)
Introduction
For nearly three decades, a growing share of information has been converted to
digital form, and large repositories of data have been built as a result. One challenge
persists as information continues to accumulate: how does one identify meaningful
relationships within such vast amounts of data? Data mining tackles this question. This
paper surveys data mining in the commercial sector, the current technology of data
mining, and the essentials needed to mine data.
Data Mining in the Commercial Sector
Data mining is applied in a variety of fields such as real estate, retail,
banking, telecommunications, and statistical analysis [1]. Essentially, data mining in the
commercial realm has become a tool that promotes efficiency for businesses. Indeed,
data mining software leader Angoss says, “companies are realizing the potential in their
data and looking for ways to leverage the information to gain that elusive competitive
advantage” [2]. Data mining software packages range widely in price. For example,
a single user license for Rapid-I’s RapidMiner costs between 1,499 Euros and 10,000
Euros [3]. Alternatively, researchers at the University of Illinois Urbana-Champaign
have created a free open-source software package called Illimine [4].
Technology of Data Mining
Research into data mining occurs at large corporations such as IBM [5] and at
universities [4]. The algorithms used in current research can be grouped into six
categories: Classification, Regression, Clustering, Change and Deviation
Detection, Summarization, and Dependency Modeling [6]. Illimine employs
Dependency Modeling methods such as the FP-growth (frequent pattern growth)
algorithm and CLOSET+ [4]. CLOSET+ uses divide-and-conquer (D&C) methods to guide
its depth-first search (DFS) [4]. Both DFS and D&C should be familiar to
readers who have had exposure to introductory computer science material.
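To make the combination of depth-first search and divide and conquer concrete, the following is a minimal Python sketch of frequent itemset mining by pattern growth. It is an illustration only, not the Illimine code and not CLOSET+ itself: it omits the FP-tree data structure and the closed-itemset pruning, and the function name `frequent_itemsets` and its parameters are assumptions introduced here. Each recursive call works on a smaller "projected" list of transactions (the divide step) and extends the current pattern by one item before exploring further (the depth-first step).

```python
from collections import Counter


def frequent_itemsets(transactions, min_support):
    """Return {frozenset(itemset): support} using depth-first pattern growth.

    Simplified sketch: plain projected transaction lists stand in for the
    FP-tree used by FP-growth, and no closed-itemset pruning is done.
    """
    results = {}

    def grow(prefix, projected):
        # Count how many transactions in the projected database hold each item.
        counts = Counter(item for t in projected for item in t)
        for item in sorted(counts):
            support = counts[item]
            if support < min_support:
                continue
            pattern = prefix + (item,)
            results[frozenset(pattern)] = support
            # Divide: keep only transactions containing `item`, and only the
            # items that sort after it, so each itemset is generated once.
            conditional = [
                {other for other in t if other > item}
                for t in projected
                if item in t
            ]
            # Conquer: recurse depth-first on the smaller subproblem.
            grow(pattern, conditional)

    grow((), [set(t) for t in transactions])
    return results
```

Calling `frequent_itemsets(baskets, 3)` on a list of shopping baskets would return every set of items that appears together in at least three baskets, along with the number of baskets containing it.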
Essentials for Implementation
The essentials needed to implement data mining techniques are: a body of data,
data mining software, sufficient computing power, and a meaningful way to report
results. Finding relationships in data cannot be useful if those relationships are not
reported in meaningful ways. One convenient way to convey the fundamentals of
data mining and its results is through a case study. Consider FiveThirtyEight.com’s success in
political analysis during the 2008 campaign season [7]. In presenting political statistics
“in new and exciting ways,” Nate Silver and his colleagues at FiveThirtyEight.com
employ spatial data mining techniques (more specifically, geospatial data mining
techniques) by creating detailed maps of election data [8]. FiveThirtyEight.com became
well known for its Electoral Projection Map. Win percentages were calculated for each
portion of the electoral map by using aggregate polling data, trending of numbers, and
regression analysis of geographic areas [8]. Based on the win percentages computed
from 10,000 daily simulations, the Electoral Projection Map displayed gradients of
shades between dark blue and dark red indicating the certainty of a political party’s
chances of winning each state (or district) [8]. In the realm of political news reporting,
FiveThirtyEight.com’s cutting edge process of relating vast quantities of raw statistical
data and conveying their underlying relationships (in other words, data mining) won
them national recognition such as a “Notable Narrative” selection by Harvard’s Nieman
Foundation for Journalism [7].
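The idea of deriving win percentages from repeated simulations and mapping them to a color gradient can be sketched in a few lines of Python. This is not FiveThirtyEight.com's model, which is only summarized above. The sketch assumes that each region's polling margin follows an independent normal distribution, and the function names, region labels, and red-to-blue color mapping are all hypothetical choices made for illustration.

```python
import random


def simulate_win_percentages(polls, n_sims=10_000, seed=0):
    """Estimate the fraction of simulated elections party A wins per region.

    `polls` maps a region name to (mean_margin, std_dev), where a positive
    margin favors party A. Independent normal margins are an assumption.
    """
    rng = random.Random(seed)
    wins = {region: 0 for region in polls}
    for _ in range(n_sims):
        for region, (mean_margin, std_dev) in polls.items():
            # Draw one simulated margin; party A wins if it is positive.
            if rng.gauss(mean_margin, std_dev) > 0:
                wins[region] += 1
    return {region: wins[region] / n_sims for region in polls}


def shade(win_fraction):
    """Map party A's win fraction to an (R, G, B) color.

    1.0 gives pure blue, 0.0 gives pure red, and values in between blend
    the two. This linear blend is an illustrative assumption.
    """
    blue = round(255 * win_fraction)
    red = round(255 * (1 - win_fraction))
    return (red, 0, blue)
```

Given a dictionary such as `{"Region A": (8.0, 3.0), "Region B": (0.0, 4.0)}` with made-up margins, `simulate_win_percentages` returns one win fraction per region, and `shade` turns each fraction into a color that a mapping tool could use to fill that region.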
[1] P. Duke. (2008, July). Geospatial Data Mining for Market Intelligence. Powell
Media, LLC. [Online]. Available: http://www.b-eye-network.com/view/7837
[2] Angoss Corp. KnowledgeSEEKER Datasheet. [Online]. Available:
http://www.angoss.com/files/docs/KnowledgeSEEKER.pdf
[3] Rapid-I GmbH. RapidMiner Enterprise. [Online]. Available:
http://rapid-i.com/content/blogcategory/36/135/lang,en/
[4] J. Han et al. Illimine Project. University of Illinois Urbana-Champaign Database
and Information Systems Laboratory. [Online]. Available:
http://illimine.cs.uiuc.edu
[5] IBM Research. Knowledge Discovery and Data Mining. IBM Corp. [Online].
Available: http://domino.research.ibm.com/comm/research.nsf/pages/r.kdd.html
[6] U. Fayyad et al. (1996, Fall). From Data Mining to Knowledge Discovery in
Databases. AI Magazine. [Online]. pp. 37-54. Available:
http://borg.cs.bilgi.edu.tr/aimag-kdd-overview-1996-Fayyad.pdf
[7] FiveThirtyEight. (2008, Aug.). FAQ and Statement of Methodology.
FiveThirtyEight.com. [Online]. Available:
http://www.fivethirtyeight.com/2008/03/frequently-asked-questions-lastrevised.html
[8] Nieman Narrative Digest. (2008). Narrative By Numbers. Nieman Foundation for
Journalism at Harvard University. [Online]. Available:
http://www.nieman.harvard.edu/digest/notable/538-blog.html