A Data Mining Course for Computer Science: Primary Sources and Implementations
Dave Musicant
Saturday, March 4, 2006

Overview
  What is data mining?
  Why offer a course in data mining?
  Why focus on research papers in an undergraduate class?
  What topics do I cover?
  What research papers do I use in class?
  What assignments do I use?
  Does it work?

What is data mining?
  “The non-trivial discovery of novel, valid, comprehensible and potentially useful patterns from data” (Fayyad et al.)
  Data mining and machine learning are two sides of the same coin
    Data mining focuses more on larger datasets
    Machine learning focuses more on connections with artificial intelligence
    ... but there is much overlap between the two areas
  My course is titled “Machine Learning and Data Mining”

Why offer a course in data mining?
  Interesting applied area of CS that uses theoretical techniques
  Reinforces and introduces data structures and algorithms
    heaps, R-trees, graphs
  Privacy and ethics
  Personal ownership in assignments boosts student enthusiasm
    Students choose datasets in areas that interest them
  New field, yet accessible
    Can be done with only Data Structures as a prerequisite
  It’s my research area

Why research papers?
  One approach to the course is to use data mining software
  I wanted students to implement data mining algorithms
  Textbook support with a computer science focus is limited
  Lopez & Ludwig, University of Minnesota-Morris
  (I use Margaret Dunham’s text as a side reference)
  Primary sources provide a rich experience

Can it be done?
  With proper selection, papers are accessible to undergraduates
  Papers must be supplemented in the classroom
    e.g. specific topics in linear algebra, statistics
    directs classroom activity toward filling gaps and interpreting papers instead of parroting the reading

Topics, Papers, Assignments
  Each topic consists of one or more papers that are assigned to the students to read before class discussion.
  Students post to Caucus (electronic message board):
    something they didn’t understand, or something they found interesting
    a potential exam question
  Assignment follows class discussion
  Detailed references for all papers and datasets can be found in the paper

Topic 0: What is Data Mining?
  Paper: J. Friedman, “Data Mining and Statistics: What’s the Connection?”
    Entertaining and controversial
    Pokes fun at flaws on all sides
    Helps to ensure buy-in from computer science students (they haven’t been tricked into taking a stats course)
  Assignment: For the “census-income” dataset, determine:
    Number of records and features
    How many features are continuous, how many are nominal
    For continuous features: average, median, minimum, maximum, standard deviation
    2-dimensional scatter plots of two features at a time
    Interesting patterns

Topic 1: Classification and Regression
  Example: First Trimester Screening

  Training set (input data plus known classification):

    Patient ID   tissue (cm)   Chemical 1   Chemical 2   Diagnosis
    1            5             20           118          Positive
    2            3             15           130          Negative
    3            7             10           52           Negative
    4            2             30           100          Positive

  Use this training set to learn how to classify patients where the diagnosis is not known.

  Testing set (input data only):

    Patient ID   tissue (cm)   Chemical 1   Chemical 2   Diagnosis
    101          4             16           95           ?
    102          9             22           125          ?
    103          1             14           80           ?

  The input data is often easily obtained, whereas the classification is not.

Technique: Nearest Neighbor

    tissue (cm)   Chemical 1   Diagnosis
    5             20           Positive
    3             15           Negative
    7             10           Negative
    2             30           Positive

  Envision each example as a point in n-dimensional space
  Classify a test point the same as the nearest training point
  [Figure: scatter plot of the training examples, tissue (cm) against Chemical 1, with an unlabeled test point captioned “What am I?”]

Topic 1: Classification and Regression (continued)
  Focus on scalable nearest neighbor algorithms
  Paper: Roussopoulos et al., “Nearest Neighbor Queries”
    How to do NN efficiently when data doesn’t fit in core
    Requires R-trees (I cover in class)
  Assignment:
    Code up the traditional k-nearest neighbor algorithm, apply to census-income data
    Experiment with different distance metrics (1-norm, 2-norm, cosine)
    Experiment with different values of k
    Produce plots showing training and test set accuracies
    Interpret results

Topic 2: Clustering
  Sometimes referred to as unsupervised learning
  Goal: find clusters of similar data
  Less accurate than supervised learning, but quite useful when no training set is available
  Where are the clusters below? How many are there?
  [Figure: two scatter plots; axis labels tissue (cm), chemical 1, chemical 2]

Topic 2: Clustering (continued)
  Assignment:
    Find dataset of interest from UCI Repository
      iris plant, letter recognition, liver disorders, Pima Indians diabetes, Congressional voting records, wine recognition, zoo
      this dataset is used for most remaining assignments
      if dataset has a class label, discard it for this assignment
    Implement basic clustering algorithm (k-means)
    Try varying number of clusters
    Try two different techniques for initializing clusters
    Report and interpret results found

Topic 2: Clustering (continued)
  Paper: Bradley et al., “Scaling Clustering Algorithms to Large Databases”
    Describes “Scalable K-means” algorithm
    Class discussion around “data mining desiderata”
  Paper: Guha et al., “CURE: An Efficient Clustering Algorithm for Large Databases”
    Agglomerative clustering algorithm
    completely different approach
    Requires use of a heap (as I pose the assignment)
  Assignment:
    Implement stripped-down version of CURE
    Run on dataset, interpret results

Topic 3: Association Rules
  “Supermarket basket analysis”
  What items do people tend to buy together at the same time?
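To make the basket question concrete, here is a minimal Python sketch that counts, for every pair of items, the fraction of baskets containing both (usually called the support of the pair). The function name `pair_support` and the baskets themselves are hypothetical, invented purely for illustration; they are not course data, and this is only the pair-counting idea, not a full association rule miner.

```python
from collections import Counter
from itertools import combinations


def pair_support(baskets):
    """Map each unordered item pair to the fraction of baskets containing both items."""
    counts = Counter()
    for basket in baskets:
        # De-duplicate and sort so each pair is counted at most once per basket.
        for pair in combinations(sorted(set(basket)), 2):
            counts[frozenset(pair)] += 1
    total = len(baskets)
    return {pair: n / total for pair, n in counts.items()}


# Hypothetical baskets, invented for illustration only.
baskets = [
    ["bread", "milk"],
    ["bread", "diapers", "beer", "eggs"],
    ["milk", "diapers", "beer", "cola"],
    ["bread", "milk", "diapers", "beer"],
    ["bread", "milk", "diapers", "cola"],
]

support = pair_support(baskets)

# List the pairs that occur together in at least 60% of baskets.
for pair, s in sorted(support.items(), key=lambda kv: -kv[1]):
    if s >= 0.6:
        print(sorted(pair), round(s, 2))
```

Counting every pair directly is fine for a toy example but grows quickly with the number of distinct items, which is the scaling problem that motivates more careful candidate generation.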
  Paper: Agrawal et al., “Fast Algorithms for Mining Association Rules”
    presents classic Apriori algorithm (skim other portions of paper)
  Assignment: Implement the Apriori algorithm and run it on own dataset

Topic 4: Web Mining
  How does Google rank importance of web pages?
    Every page has a PageRank
    PageRank of a page is determined by the PageRank of the pages that link to it
    manifests itself as an eigenvalue problem
  Paper: Page et al., “The PageRank Citation Ranking: Bringing Order to the Web”
    describes basic version of Google PageRank algorithm
    cover eigenvalues in class
    exposure to linear algebra, numerical analysis

Topic 4: Web Mining (continued)
  Paper: Chakrabarti et al., “Mining the Link Structure of the World Wide Web”
    describes HITS algorithm for ranking web pages
    Google isn’t the only way to do it
    uses Latent Semantic Analysis, which requires singular value decomposition (cover in class)
  Assignment: Implement PageRank algorithm
    try it on archive of department website
      crawling for an assignment is dangerous
    sparse data representation
      hashing or other form of map for efficiency
    interpret results
  [Figure labels: hubs, authorities]

Topic 5: Collaborative Filtering
  a.k.a. Recommender Systems
  Paper: Breese et al., “Empirical Analysis of Predictive Algorithms for Collaborative Filtering”
  “I like Pink Floyd, Dream Theater, and Evanescence. Who should I be listening to?”
    Amazon.com, Yahoo! Launchcast
  Algorithms are nearest neighbor-like in flavor
    Involve averaging numerical scores
    Need to normalize for individual biases
  Students already working on final project, so no assignment

Topic 6: Ethical Issues in Data Mining
  Privacy concerns
  Good vs. evil uses of data mining
  Video: Ramakrishnan et al., “Data Mining: Good, Bad, or Just a Tool?”
    Panel discussion from KDD 2004
  Before watching video, students post to Caucus:
    how data mining could be exploited
    how this could be prevented (if possible)
  After watching video:
    follow-up commentary
  Pictures from conference website at http://www.acm.org/sigs/sigkdd/kdd2004/

Topic 6: Ethical Issues in Data Mining (continued)
  Students’ response to video was more engaged than I expected
  More problems than solutions are raised in video
    Students were frustrated that solutions weren’t clear
  Many students interested in issue of accountability
    If someone’s privacy is violated, who is responsible?
    “Who do I sue?”
  Lively class discussion

Final Project
  “Do almost anything you want regarding data mining, so long as I approve it”
  Find a paper and implement the algorithm within
  Find a dataset of interest and study it completely, using Weka and/or their own code from throughout the term
  Quantitative association rules
  Poker association rules
  Collaborative filtering (music, art)
  Attack KDD Cup problems
    KDD Cup 2005: identify categories for web search queries
    tried this once: tended to be too big for them in the time that I had
    could perhaps be done with right level of support

Conclusions
  Papers are the most memorable part of the course
  Caucus motivates reading papers
    Students find this a pain, but are thankful afterwards in evals
    Important to set deadline for posting a few hours before class so I have time to read
  Programming assignments work (mostly) well
  Students speak very positively about this in evaluations
  Significant prep time for me to fill in gaps
  Allow students to work in pairs if they wish
  Grading is difficult: unspecified details in algorithms, differing datasets
  All materials available on my website at http://www.mathcs.carleton.edu/faculty/dmusican/cs377s05
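
As a small illustration of the nearest neighbor technique from Topic 1, the following Python sketch classifies the three testing-set patients using the four training points from the Nearest Neighbor slide. It is a minimal sketch under stated assumptions, not the course's reference solution: it uses only the tissue and Chemical 1 features (as that slide does), leaves the features unscaled, and the names `knn_classify`, `euclidean`, and `manhattan` are hypothetical.

```python
from collections import Counter
import math

# Training points from the Nearest Neighbor slide: (tissue cm, Chemical 1) -> diagnosis.
TRAINING = [
    ((5, 20), "Positive"),
    ((3, 15), "Negative"),
    ((7, 10), "Negative"),
    ((2, 30), "Positive"),
]


def euclidean(a, b):
    """2-norm distance between two points."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def manhattan(a, b):
    """1-norm distance between two points."""
    return sum(abs(x - y) for x, y in zip(a, b))


def knn_classify(point, training, k=1, distance=euclidean):
    """Label `point` by majority vote among its k nearest training examples.

    Assumption: vote ties (possible when k is even) go to whichever tied
    label was seen first among the nearest examples.
    """
    nearest = sorted(training, key=lambda ex: distance(point, ex[0]))[:k]
    votes = Counter(label for _, label in nearest)
    return votes.most_common(1)[0][0]


# Patients 101-103 from the testing set, restricted to the same two features.
TESTING = [(101, (4, 16)), (102, (9, 22)), (103, (1, 14))]

for patient_id, features in TESTING:
    print(patient_id, knn_classify(features, TRAINING, k=1))
```

Passing `distance=manhattan` or a different `k` gives a starting point for the distance-metric and k experiments in the Topic 1 assignment. Because Chemical 1 spans a wider range than tissue, it dominates the unscaled distances here; whether to rescale features is a design choice this sketch leaves open.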