A Data Mining Course for Computer Science: Primary Sources and Implementations
Dave Musicant
Saturday, March 4, 2006

Overview

- What is data mining?
- Why offer a course in data mining?
- Why focus on research papers in an undergraduate class?
- What topics do I cover?
- What research papers do I use in class?
- What assignments do I use?
- Does it work?

What is data mining?

- "The non-trivial discovery of novel, valid, comprehensible and potentially useful patterns from data" (Fayyad et al.)
- Data mining and machine learning are two sides of the same coin
  - Data mining focuses more on larger datasets
  - Machine learning focuses more on connections with artificial intelligence
  - ... but there is much overlap between the two areas
- My course is titled "Machine Learning and Data Mining"
  - boosts student enthusiasm

Why offer a course in data mining?

- Interesting applied area of CS that uses theoretical techniques
- Reinforces and introduces data structures and algorithms
  - heaps, R-trees, graphs
- Privacy and ethics
- Personal ownership in assignments
  - Students choose datasets in areas that interest them
- New field, yet accessible
  - Can be done with only Data Structures as a prerequisite
- It's my research area

Why research papers? Can it be done?

- One approach to the course is to use data mining software
  - Lopez & Ludwig, University of Minnesota-Morris
- I wanted students to implement data mining algorithms
- Textbook support with a computer science focus is limited
  - (I use Margaret Dunham's text as a side reference)
- Primary sources provide a rich experience
- With proper selection, papers are accessible to undergraduates
- Papers must be supplemented in the classroom
  - e.g. specific topics in linear algebra, statistics
  - directs classroom activity toward filling gaps and interpreting papers instead of parroting the reading

Topics, Papers, Assignments

- Each topic consists of one or more papers that are assigned to the students to read before class discussion
- Students post to Caucus (electronic message board):
  - something they didn't understand, or something they found interesting
  - a potential exam question
- An assignment follows the class discussion
- Detailed references for all papers and datasets can be found in the paper

Topic 0: What is Data Mining?

- Paper: J. Friedman, "Data Mining and Statistics: What's the Connection?"
  - Entertaining and controversial
  - Pokes fun at flaws on all sides
  - Helps to ensure buy-in from computer science students (they haven't been tricked into taking a stats course)
- Assignment: For the "census-income" dataset, determine (a short exploratory sketch appears below):
  - Number of records and features
  - How many features are continuous, how many are nominal
  - For continuous features: average, median, minimum, maximum, standard deviation
  - 2-dimensional scatter plots of two features at a time
  - Interesting patterns
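A minimal sketch of the Topic 0 exploration, assuming the census-income data has been saved as a CSV file; the file name and the lack of a header row are assumptions on my part, not part of the original assignment handout.

```python
import pandas as pd
import matplotlib.pyplot as plt

df = pd.read_csv("census-income.csv", header=None)

# Number of records and features
print("records:", len(df), "features:", df.shape[1])

# Continuous vs. nominal features (inferred here from the column dtype)
continuous = df.select_dtypes(include="number").columns
nominal = [c for c in df.columns if c not in continuous]
print("continuous:", len(continuous), "nominal:", len(nominal))

# Summary statistics for the continuous features
for c in continuous:
    col = df[c]
    print(c, col.mean(), col.median(), col.min(), col.max(), col.std())

# 2-D scatter plot of two continuous features at a time (first pair shown)
if len(continuous) >= 2:
    df.plot.scatter(x=continuous[0], y=continuous[1])
    plt.show()
```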

Topic 1: Classification and Regression

- Example: First Trimester Screening

  Patient ID | tissue (cm) | Chemical 1 | Chemical 2 | Diagnosis
  1          | 5           | 20         | 118        | Positive
  2          | 3           | 15         | 130        | Negative
  3          | 7           | 10         | 52         | Negative
  4          | 2           | 30         | 100        | Positive

- Use this training set to learn how to classify patients where the diagnosis is not known:

  Patient ID | tissue (cm) | Chemical 1 | Chemical 2 | Diagnosis
  101        | 4           | 16         | 95         | ?
  102        | 9           | 22         | 125        | ?
  103        | 1           | 14         | 80         | ?

- [Diagram: input data (training set and testing set) vs. classification]
- The input data is often easily obtained, whereas the classification is not.

Technique: Nearest Neighbor

  tissue (cm) | Chemical 1 | Diagnosis
  5           | 20         | Positive
  3           | 15         | Negative
  7           | 10         | Negative
  2           | 30         | Positive

- Envision each example as a point in n-dimensional space
- Classify a test point the same as the nearest training point
- [Scatter plot of the training points in the tissue (cm) / Chemical 1 plane, with an unlabeled test point marked "What am I?"]

Topic 1: Classification and Regression

- Focus on scalable nearest neighbor algorithms
- Paper: Roussopoulos et al., "Nearest Neighbor Queries"
  - How to do NN efficiently when data doesn't fit in core
  - Requires R-trees (I cover these in class)
- Assignment: Code up the traditional k-nearest neighbor algorithm and apply it to the census-income data (a minimal sketch appears below)
  - Experiment with different distance metrics (1-norm, 2-norm, cosine)
  - Experiment with different values of k
  - Produce plots showing training and test set accuracies
  - Interpret results
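A minimal sketch of the traditional k-nearest neighbor classifier from the Topic 1 assignment; the function names and the plain-list data format are illustrative assumptions, not the course's required interface.

```python
import math
from collections import Counter

def distance(x, y, metric="2-norm"):
    """Distance between two feature vectors under the metrics named in the assignment."""
    if metric == "1-norm":
        return sum(abs(a - b) for a, b in zip(x, y))
    if metric == "2-norm":
        return math.sqrt(sum((a - b) ** 2 for a, b in zip(x, y)))
    if metric == "cosine":  # treat 1 - cosine similarity as a dissimilarity
        dot = sum(a * b for a, b in zip(x, y))
        nx = math.sqrt(sum(a * a for a in x))
        ny = math.sqrt(sum(b * b for b in y))
        return 1.0 - dot / (nx * ny) if nx and ny else 1.0
    raise ValueError(metric)

def knn_classify(train, test_point, k=3, metric="2-norm"):
    """train is a list of (features, label) pairs; returns the majority label
    among the k training points closest to test_point."""
    neighbors = sorted(train, key=lambda ex: distance(ex[0], test_point, metric))[:k]
    return Counter(label for _, label in neighbors).most_common(1)[0][0]

# Toy usage on the first-trimester screening example from the slides
train = [((5, 20), "Positive"), ((3, 15), "Negative"),
         ((7, 10), "Negative"), ((2, 30), "Positive")]
print(knn_classify(train, (4, 16), k=3))
```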

Topic 2: Clustering

- Sometimes referred to as unsupervised learning
- Goal: find clusters of similar data
- Less accurate than supervised learning, but quite useful when no training set is available
- Where are the clusters below? How many are there?
- [Two scatter plots: tissue (cm) vs. chemical 1, and tissue (cm) vs. chemical 2]

Topic 2: Clustering

- Assignment: Find a dataset of interest from the UCI Repository
  - iris plant, letter recognition, liver disorders, Pima Indians diabetes, Congressional voting records, wine recognition, zoo
  - this dataset is used for most remaining assignments
  - if the dataset has a class label, discard it for this assignment
- Implement a basic clustering algorithm (k-means); a minimal sketch appears below
- Try varying the number of clusters
- Try two different techniques for initializing clusters
- Report and interpret the results found
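A minimal sketch of the k-means assignment; random initialization and the fixed iteration cap are illustrative choices, not requirements from the course.

```python
import random

def kmeans(points, k, iterations=100, seed=0):
    """points is a list of numeric tuples; returns (centroids, assignments)."""
    rng = random.Random(seed)
    centroids = rng.sample(points, k)          # one simple initialization technique
    for _ in range(iterations):
        # Assign each point to its nearest centroid (squared Euclidean distance)
        clusters = [[] for _ in range(k)]
        assignments = []
        for p in points:
            j = min(range(k), key=lambda i: sum((a - b) ** 2 for a, b in zip(p, centroids[i])))
            clusters[j].append(p)
            assignments.append(j)
        # Recompute each centroid as the mean of its cluster
        new_centroids = [tuple(sum(dim) / len(c) for dim in zip(*c)) if c else centroids[i]
                         for i, c in enumerate(clusters)]
        if new_centroids == centroids:          # converged
            break
        centroids = new_centroids
    return centroids, assignments

# Toy usage: two obvious clusters
pts = [(1, 1), (1, 2), (2, 1), (8, 8), (8, 9), (9, 8)]
print(kmeans(pts, k=2))
```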

Topic 2: Clustering

- Paper: Bradley et al., "Scaling Clustering Algorithms to Large Databases"
  - Describes the "Scalable K-means" algorithm
  - Class discussion around the "data mining desiderata"
- Paper: Guha et al., "CURE: An Efficient Clustering Algorithm for Large Databases"
  - Agglomerative clustering algorithm: a completely different approach
  - Requires use of a heap (as I pose the assignment)
- Assignment: Implement a stripped-down version of CURE (a heap-based sketch of the agglomerative core appears below)
  - Run it on the dataset and interpret the results
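A heap-based sketch of the agglomerative core behind the CURE assignment. This is plain agglomerative clustering with centroid merging, not the full CURE algorithm (no representative points or shrinking); it is only meant to illustrate why a heap of candidate merges is useful.

```python
import heapq

def agglomerate(points, target_clusters):
    """Repeatedly merge the two closest clusters until target_clusters remain."""
    clusters = {i: [p] for i, p in enumerate(points)}   # cluster id -> member points
    def dist2(a, b):
        return sum((x - y) ** 2 for x, y in zip(a, b))
    def centroid(c):
        return tuple(sum(d) / len(c) for d in zip(*c))
    # Heap of (distance, id1, id2) over all current cluster pairs
    heap = [(dist2(centroid(clusters[i]), centroid(clusters[j])), i, j)
            for i in clusters for j in clusters if i < j]
    heapq.heapify(heap)
    next_id = len(points)
    while len(clusters) > target_clusters and heap:
        d, i, j = heapq.heappop(heap)
        if i not in clusters or j not in clusters:
            continue                                    # stale entry; one side already merged
        merged = clusters.pop(i) + clusters.pop(j)
        for k in clusters:                              # push distances to the new cluster
            heapq.heappush(heap, (dist2(centroid(merged), centroid(clusters[k])), next_id, k))
        clusters[next_id] = merged
        next_id += 1
    return list(clusters.values())

print(agglomerate([(1, 1), (1, 2), (8, 8), (9, 8), (5, 5)], target_clusters=2))
```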

Topic 3: Association Rules

- "Supermarket basket analysis"
- What items do people tend to buy together at the same time?
- Paper: Agrawal et al., "Fast Algorithms for Mining Association Rules"
  - presents the classic Apriori algorithm (skim other portions of the paper)
- Assignment: Implement the Apriori algorithm and apply it to your own dataset (a minimal sketch appears below)
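A minimal sketch of Apriori's frequent-itemset phase; the minimum support threshold and the transaction format are illustrative assumptions, and rule generation from the frequent itemsets is omitted for brevity.

```python
from itertools import combinations

def apriori(transactions, min_support):
    """transactions is a list of item sets; returns {frequent itemset: count}."""
    transactions = [frozenset(t) for t in transactions]
    # L1: frequent single items
    counts = {}
    for t in transactions:
        for item in t:
            counts[frozenset([item])] = counts.get(frozenset([item]), 0) + 1
    current = {s: c for s, c in counts.items() if c >= min_support}
    frequent = dict(current)
    k = 2
    while current:
        # Candidate generation: k-combinations of items seen in frequent (k-1)-itemsets,
        # pruned so every (k-1)-subset is frequent (a simplification of the paper's join step)
        items = sorted({i for s in current for i in s})
        candidates = {frozenset(c) for c in combinations(items, k)
                      if all(frozenset(sub) in current for sub in combinations(c, k - 1))}
        # Support counting over all transactions
        counts = {c: sum(1 for t in transactions if c <= t) for c in candidates}
        current = {s: c for s, c in counts.items() if c >= min_support}
        frequent.update(current)
        k += 1
    return frequent

baskets = [{"milk", "bread"}, {"milk", "bread", "eggs"}, {"bread", "eggs"}, {"milk", "eggs"}]
print(apriori(baskets, min_support=2))
```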

Topic 4: Web Mining

- How does Google rank the importance of web pages?
- Every page has a PageRank
  - The PageRank of a page is determined by the PageRank of the pages that link to it
  - manifests itself as an eigenvalue problem
- Paper: Page et al., "The PageRank Citation Ranking: Bringing Order to the Web"
  - describes the basic version of Google's PageRank algorithm
  - I cover eigenvalues in class
  - exposure to linear algebra, numerical analysis
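To make the eigenvalue bullet concrete, one common statement of the damped PageRank recurrence (my notation, not copied from the paper) is, with d the damping factor, B_u the set of pages linking to u, and N_v the out-degree of v:

  PR(u) = (1 - d) + d * sum over v in B_u of PR(v) / N_v

The ranks form a dominant eigenvector of the corresponding link matrix, which is why repeatedly applying this update (power iteration) converges to them.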

Topic 4: Web Mining

- Paper: Chakrabarti et al., "Mining the Link Structure of the World Wide Web"
  - describes the HITS algorithm (hubs and authorities) for ranking web pages
  - Google isn't the only way to do it
  - uses Latent Semantic Analysis, which requires singular value decomposition (covered in class)
- Assignment: Implement the PageRank algorithm (a sketch appears below)
  - try it on an archive of the department website (crawling for an assignment is dangerous)
  - sparse data representation
  - hashing or some other form of map for efficiency
  - interpret results
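A minimal sketch of the PageRank assignment using a dict-of-lists as the sparse link representation; the damping factor, tolerance, and graph format are illustrative assumptions rather than the assignment's exact specification.

```python
def pagerank(links, damping=0.85, tol=1e-8, max_iters=100):
    """links maps each page to the list of pages it links to.
    Returns a dict mapping page -> PageRank score (power iteration)."""
    pages = set(links) | {q for targets in links.values() for q in targets}
    n = len(pages)
    rank = {p: 1.0 / n for p in pages}
    for _ in range(max_iters):
        new_rank = {p: (1.0 - damping) / n for p in pages}
        for p in pages:
            targets = links.get(p, [])
            if targets:
                share = damping * rank[p] / len(targets)
                for q in targets:
                    new_rank[q] += share
            else:  # dangling page: spread its rank uniformly
                for q in pages:
                    new_rank[q] += damping * rank[p] / n
        if sum(abs(new_rank[p] - rank[p]) for p in pages) < tol:
            return new_rank
        rank = new_rank
    return rank

# Toy usage on a four-page graph
graph = {"A": ["B", "C"], "B": ["C"], "C": ["A"], "D": ["C"]}
print(pagerank(graph))
```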

Topic 5: Collaborative Filtering

- a.k.a. recommender systems
  - "I like Pink Floyd, Dream Theater, and Evanescence. Who should I be listening to?"
  - Amazon.com, Yahoo! Launchcast
- Paper: Breese et al., "Empirical Analysis of Predictive Algorithms for Collaborative Filtering"
  - Algorithms are nearest-neighbor-like in flavor
  - Involve averaging numerical scores
  - Need to normalize for individual biases
- Students are already working on the final project, so there is no assignment (a small sketch of the memory-based idea appears below)
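A small sketch of the memory-based idea behind the Breese et al. algorithms: predictions are mean-centered, correlation-weighted averages of other users' scores. The similarity weighting and the toy ratings are illustrative choices, not taken from the paper.

```python
import math

def predict(ratings, user, item):
    """ratings maps user -> {item: score}; predicts user's score for item."""
    user_mean = sum(ratings[user].values()) / len(ratings[user])
    num, den = 0.0, 0.0
    for other, scores in ratings.items():
        if other == user or item not in scores:
            continue
        common = set(ratings[user]) & set(scores)
        if not common:
            continue
        other_mean = sum(scores.values()) / len(scores)
        # Pearson-style similarity over the items both users rated
        cov = sum((ratings[user][i] - user_mean) * (scores[i] - other_mean) for i in common)
        var_u = math.sqrt(sum((ratings[user][i] - user_mean) ** 2 for i in common))
        var_o = math.sqrt(sum((scores[i] - other_mean) ** 2 for i in common))
        sim = cov / (var_u * var_o) if var_u and var_o else 0.0
        # Normalize for individual bias by working with deviations from each user's mean
        num += sim * (scores[item] - other_mean)
        den += abs(sim)
    return user_mean + num / den if den else user_mean

ratings = {"ann": {"floyd": 5, "theater": 4, "evanescence": 5},
           "bob": {"floyd": 4, "theater": 5, "opeth": 5},
           "cat": {"floyd": 2, "opeth": 1, "evanescence": 2}}
print(predict(ratings, "ann", "opeth"))
```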

Topic 6: Ethical Issues in Data Mining

- Privacy concerns
- Good vs. evil uses of data mining
- Video: Ramakrishnan et al., "Data Mining: Good, Bad, or Just a Tool?"
  - Panel discussion from KDD 2004
- Before watching the video, students post to Caucus:
  - how data mining could be exploited
  - how this could be prevented (if possible)
- After watching the video:
  - followup commentary
- Pictures from the conference website at http://www.acm.org/sigs/sigkdd/kdd2004/

Topic 6: Ethical Issues in Data Mining

- Student response to the video was more engaged than I expected
- More problems than solutions are raised in the video
  - Students were frustrated that solutions weren't clear
- Many students were interested in the issue of accountability
  - If someone's privacy is violated, who is responsible?
  - "Who do I sue?"
- Lively class discussion

Final Project

- "Do almost anything you want regarding data mining, so long as I approve it"
  - Find a paper and implement the algorithm within
  - Find a dataset of interest and study it completely, using Weka and/or their own code from throughout the term
  - Quantitative association rules
  - Poker association rules
  - Collaborative filtering (music, art)
- Attack KDD Cup problems
  - KDD Cup 2005: identify categories for web search queries
  - tried this once: tended to be too big for them in the time that I had
  - could perhaps be done with the right level of support

Conclusions

- Papers are the most memorable part of the course
- Caucus motivates reading the papers
  - Students find this a pain, but are thankful afterwards in evals
  - Important to set the posting deadline a few hours before class so I have time to read
- Programming assignments work (mostly) well
  - Students speak very positively about this in evaluations
  - Significant prep time for me to fill in gaps
  - Allow students to work in pairs if they wish
  - Grading is difficult: unspecified details in algorithms, differing datasets
- All materials available on my website at http://www.mathcs.carleton.edu/faculty/dmusican/cs377s05