Technical Review Paper: Data Mining
Due: 21 January 2009
Name: Mark Peterson
Advisor: Dr. David Keezer
Section: L04
Group: Automated Beverage Dispenser (DK-05)

Introduction

For nearly three decades, increasing amounts of information have been converted to digital form. As a result, large repositories of data have been built. One challenge persists as information continues to accumulate: how does one distinguish meaningful relationships among such vast amounts of data? Data mining tackles this question. This paper reviews data mining in the commercial sector, the current technology of data mining, and the essentials needed to mine data.

Data Mining in the Commercial Sector

Data mining is applied in a variety of fields such as real estate, retail, banking, telecommunications, and statistical analysis [1]. In the commercial realm, data mining has become a tool that promotes efficiency for businesses. Indeed, data mining software leader Angoss says, “companies are realizing the potential in their data and looking for ways to leverage the information to gain that elusive competitive advantage” [2].

Data mining software packages range widely in price. For example, a single-user license for Rapid-I’s RapidMiner costs between 1,499 and 10,000 euros [3]. At the other end of the range, researchers at the University of Illinois Urbana-Champaign have created a free open-source software package called Illimine [4].

Technology of Data Mining

Research into data mining occurs at large corporations such as IBM [5] and at universities, as seen in [4]. The algorithms used in current research fall into six categories: Classification, Regression, Clustering, Change and Deviation Detection, Summarization, and Dependency Modeling [6]. Illimine employs Dependency Modeling methods such as the FPGrowth (frequent pattern growth) algorithm and CLOSET+ [4].
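
Frequent pattern mining, the task that FPGrowth addresses, looks for sets of items that occur together in at least a minimum number of records. The sketch below is not FPGrowth or CLOSET+ as implemented in Illimine. It is a simplified Python illustration that assumes each record is a small list of items, and it searches depth first by repeatedly narrowing the data to the records that contain the current itemset. The function name `frequent_itemsets`, the `min_support` parameter, and the grocery baskets are hypothetical and chosen only for this example.

```python
from collections import Counter


def frequent_itemsets(transactions, min_support):
    """Return {frozenset(itemset): support_count} for every frequent itemset.

    Simplified depth-first search over projected transaction lists.
    This is an illustrative sketch, not the FP-tree based FPGrowth algorithm.
    """
    results = {}

    def grow(prefix, projected):
        # Count how many projected transactions contain each remaining item.
        counts = Counter(item for t in projected for item in t)
        for item in sorted(counts):
            support = counts[item]
            if support < min_support:
                continue
            itemset = prefix | {item}
            results[frozenset(itemset)] = support
            # Divide: keep only transactions that contain `item`, and only the
            # items ordered after it, so each itemset is generated exactly once.
            conditional = [
                {other for other in t if other > item}
                for t in projected
                if item in t
            ]
            # Conquer: mine the smaller, conditional data set recursively.
            grow(itemset, conditional)

    grow(frozenset(), [set(t) for t in transactions])
    return results


# Hypothetical market-basket data, used only to exercise the sketch.
baskets = [
    ["bread", "milk"],
    ["bread", "butter", "milk"],
    ["butter", "milk"],
    ["bread", "butter"],
]
patterns = frequent_itemsets(baskets, min_support=2)
for itemset, support in sorted(
    patterns.items(), key=lambda kv: (len(kv[0]), sorted(kv[0]))
):
    print(sorted(itemset), support)
```

With a minimum support of 2, the three single items and the three pairs qualify, while the triple of bread, butter, and milk appears in only one basket and is dropped. Production algorithms avoid copying the projected records at each step by using compact structures such as a prefix tree.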
CLOSET+ guides its depth-first search (DFS) with divide-and-conquer (D&C) methods [4]. DFS and D&C methods should be familiar to readers who have had exposure to introductory computer science material.

Essentials for Implementation

The essentials needed to implement data mining techniques are a body of data, data mining software, sufficient computing power, and a meaningful way to report results. Relationships found in data are of little use if they are not reported in a meaningful way.

One convenient way to convey the fundamentals of data mining and its results is by case study. Consider FiveThirtyEight.com’s success in political analysis during the 2008 campaign season [7]. In presenting political statistics “in new and exciting ways,” Nate Silver and his colleagues at FiveThirtyEight.com employ spatial data mining techniques, more specifically geospatial data mining techniques, by creating detailed maps of election data [8]. FiveThirtyEight.com became well known for its Electoral Projection Map. Win percentages were calculated for each portion of the electoral map using aggregate polling data, trends in the numbers, and regression analysis of geographic areas [8]. Based on the win percentages obtained from 10,000 daily simulations, the Electoral Projection Map displayed gradients of shades between dark blue and dark red indicating the certainty of a political party’s chances of winning each state (or district) [8]. In the realm of political news reporting, FiveThirtyEight.com’s cutting-edge process of relating vast quantities of raw statistical data and conveying their underlying relationships (in other words, data mining) won it national recognition, such as a “Notable Narrative” designation from Harvard’s Nieman Foundation for Journalism [7].

[1] P. Duke. (2008, July). Geospatial Data Mining for Market Intelligence. Powell Media, LLC. [Online]. Available: http://www.b-eye-network.com/view/7837
[2] Angoss Corp. KnowledgeSEEKER Datasheet.
[Online]. Available: http://www.angoss.com/files/docs/KnowledgeSEEKER.pdf
[3] Rapid-I GmbH. RapidMiner Enterprise. [Online]. Available: http://rapid-i.com/content/blogcategory/36/135/lang,en/
[4] J. Han et al. Illimine Project. University of Illinois Urbana-Champaign Database and Information Systems Laboratory. [Online]. Available: http://illimine.cs.uiuc.edu
[5] IBM Research. Knowledge Discovery and Data Mining. IBM Corp. [Online]. Available: http://domino.research.ibm.com/comm/research.nsf/pages/r.kdd.html
[6] U. Fayyad et al. (1996, Fall). From Data Mining to Knowledge Discovery in Databases. AI Magazine. [Online]. pp. 37-54. Available: http://borg.cs.bilgi.edu.tr/aimag-kdd-overview-1996-Fayyad.pdf
[7] FiveThirtyEight. (2008, Aug.). FAQ and Statement of Methodology. FiveThirtyEight.com [Online]. Available: http://www.fivethirtyeight.com/2008/03/frequently-asked-questions-lastrevised.html
[8] Nieman Narrative Digest. (2008). Narrative By Numbers. Nieman Foundation for Journalism at Harvard University. [Online]. Available: http://www.nieman.harvard.edu/digest/notable/538-blog.html