ACM KDD Cup A Survey: 1997-2011 Qiang Yang

advertisement
ACM KDD Cup
A Survey: 1997-2011
Qiang Yang
(partly based on Xinyue Liu’s slides @SFU, and
Nathan Liu’s slides @hkust)
Hong Kong University of Science and Technology
1
About ACM KDDCUP


ACM KDD: Premiere Conference in knowledge discovery
and data mining
ACM KDDCUP:


Worldwide competition in conjunction with ACM KDD
conferences.
It aims at:



showcase the best methods for discovering higher-level
knowledge from data.
Helping to close the gap between research and industry
Stimulating further KDD research and development
2
Statistics
 Participation in KDD Cup grew steadily
 Average person-hours per submission: 204
Max person-hours per submission: 910
Year
Submissions
97 98
16
21
99
24
2000 2005 2011
30
32 1000+
3
KDD Cup 97


A classification task – to
predict financial services
industry (direct mail
response)
Winners



Charles Elkan, a Prof from
UC-San Diego with his
Boosted Naive Bayesian
(BNB)
Silicon Graphics, Inc with
their software MineSet
Urban Science Applications,
Inc. with their software gain,
Direct Marketing Selection
System
4
MineSet (Silicon Graphics Inc.)

A KDD tool that combines data access, transformation,
classification, and visualization.
5
KDD Cup 98: CRM Benchmark



URL:
www.kdnuggets.com/meetings/kd
d98/kdd-cup-98.html
A classification task – to analyze
fund raising mail responses to a
non-profit organization
Winners



Urban Science Applications,
Inc. with their software
GainSmarts.
SAS Institute, Inc. with their
software SAS Enterprise Miner
™
Quadstone Limited with their
software Decisionhouse ™
6
KDDCUP 1998 Results
$70,000
$65,000
$60,000
$55,000
$50,000
$45,000
$40,000
$35,000
$30,000
$25,000
$20,000
$15,000
$10,000
$5,000
$-
Maximum Possible Profit Line
($72,776 in profits with 4,873 mailed)
100%
90%
80%
Mail to Everyone Solution
($10,560 in profits with 96,367 mailed)
70%
60%
50%
GainSmarts
SAS/Enterprise Miner
Quadstone/Decisionhouse
40%
30%
20%
10%
0%
ACM KDD Cup 1999




URL:
www.cse.ucsd.edu/users/elkan/
kdresults.html
Problem
To detect network intrusion
and protect a computer network
from unauthorized users,
including perhaps insiders
Data: from DoD
Winners
 SAS Institute Inc. with their
software Enterprise Miner.
 Amdocs with their
Information Analysis
Environment
8
KDDCUP 2000: Data Set and Goal:
Data collected from

Gazelle.com, a legwear
and legcare Web retailer
 Pre-processed
Training set: 2 months
 Test sets: one month
 Data collected includes: 


Click streams
Order information
The goal – to design
models to support website personalization and
to improve the
profitability of the site by
increasing customer
response.
Questions - When given
a set of page views,



characterize heavy
spenders
characterize killer pages
characterize which product
brand a visitor will view in
the remainder of the
session?
9
KDD Cup 2001

3 Bioinformatics Tasks

Dataset 1: Prediction of
Molecular Bioactivity for
Drug Design


half a gigabyte when
uncompressed
Dataset 2: Prediction of
Gene/Protein Function (task
2) and Localization (task 3)


Dataset 2 is smaller and
easier to understand
7 megabytes uncompressed

A total of 136 groups
participated to produce
a total of 200
submitted predictions
over the 3 tasks: 114
for Thrombin, 41 for
Function, and 45 for
Localization.
10
2001 Winners

Task 1, Thrombin:



Jie Cheng (Canadian Imperial
Bank of Commerce).
Bayesian network learner and
classifier
Task 2, Function: Mark-A.
Krogel (University of
Magdeburg).



Task 2:


the genes of one particular
type of organism
A gene/protein can have
more than one function, but
only one localization.
Inductive Logic programming
Task 3, Localization: Hisashi
Hayashi, Jun Sese, and
Shinichi Morishita (University
of Tokyo).

K nearest neighbor
11

molecular
biology : Two tasks


Task 1: Document
extraction from
biological articles
Task 2: Classification of
proteins based on gene
deletion experiments

Winners:

Task 1: ClearForest and
Celera, USA


Yizhar Regev and Michal
Finkelstein
Task 2: Telstra
Research Laboratories
, Australia

Adam Kowalczyk and
Bhavani Raskutti
12
2003 KDDCUP

Information
Retrieval/Citation Mining of
Scientific research papers





based on a very large
archive of research papers
First Task: predict how many
citations each paper will receive
during the three months
leading up to the KDD 2003
conference
Second Task: a citation graph
of a large subset of the archive
from only the LaTex sources
Third Task: each paper's
popularity will be estimated
based on partial download logs
Last Task: devise their own
questions
13
2004 Tasks and Results

粒子物理学和同调蛋白
质预测(Particle
physics; plus protein
homology prediction)

两个子任务的冠军分别
为:David S. Vogel, Eric
Gottschalk, and
Morgan C. Wang以及
Bernhard Pfahringer,
Yan Fu (付岩),
RuiXiang Sun, Qiang
Yang (杨强), Simin He,
Chunli Wang, Haipeng
Wang, Shiguang Shan,
Junfa Liu, Wen Gao.
14
Past KDDCUP Overview: 2005-2010
Year
Host
Task
Technique
Winner
2005
Microsoft
Web query
categorization
Feature Engineering,
Ensemble
HKUST (沈抖,
杨强,等)
2006
Siemens
Pulmonary emboli
detection
Multi-instance, Non-IID
sample, Cost sensitive,
Class Imbalance, Noisy
data
AT&T, Budapest
University of
Technology &
Economics
2007
Netflix
Consumer
recommendation
Collaborative Filtering,
Time series, Ensemble
IBM Research,
Hungarian
Academy of
Sciences
2008
Siemens
Breast cancer
detection from
medical images
Ensemble, Class
imbalance, Score
calibration
IBM Research,
National Taiwan
University
2009
Orange
Customer
relationship
prediction in telecom
Feature selection,
Ensemble
IBM Research,
University of
Melbourne
2010
PSLC Data
Shop
Student performance
prediction in ELearning
Feature engineering,
Ensemble,
Collaborative filtering
National Taiwan
University (CJ
Lin, S. Lin, etc.)
KDDCUP’11 Dataset




11 years of data
Rated items are
 Tracks
 Albums
 Artists
 Genres
Items arranges in a taxonomy
Two tasks
Track 1
Track 2
#ratings
263M
63M
#items
625K
296K
#users
1M
249K
Items in a Taxonomy
Track 1 Details
Track 1 Highlights





Largest publicly available dataset
Large number of items (50 times more than
Netflix)
Extreme rating sparsity (20 times more
sparse than Netflix)
Taxonomy can help in combating sparsely
rated items.
Fine time stamps with both date and time
allow sophisticated temporal modeling.
Track 2 Details
Track 2 Highlights



Performance metric focus on ranking/
classification, which differs from traditional
collaborative filtering.
No validation data provided, need to selfconstruct binary labeled data from rating
data.
Unlike track 1, track 2 removed time stamps
to focus more than long term preference
rather than short term behaviors.
Submission Stats
Winners
Track 1
Track 2
1st place
National Taiwan University
National Taiwan University
2nd place
Commendo (Netflix Prize
Winnder)
Chinese Academy of Science,
Hulu Labs
3rd place
Hong Kong University of
Science and Technology,
Shanghai Jiaotong University
Commendo (Netflix Prize
Winnder)
Chinese Teams at KDDCUP (NTU,
CAS, HKUST)
Nathan Liu:
HKUST CSE
PhD student
Key Techniques

Track 1:






Blending of multiple techniques
Matrix factorization models
Nearest neighbor models
Restricted Bolzmann machines
Temporal modelings
Track 2:



Importance sampling of negative instances
Taxonomical modelings
Use of pairwise ranking objective functions
Summary

To place on top of KDDCUP requires




Team work
Expertise in domain knowledge as well as mathematical
tools
Often done by world famous institutes and companies
Recent trends:



Dataset increasingly more realistic
Participants increasingly more professional
Tasks are increasingly more difficult
26
Summary




KDD Cup is an excellent source to learn the
state-of-art KDD techniques
KDDCUP dataset often becomes the
standard benchmark for future research,
development and teaching
Top winners are highly regarded and
respected
References:
http://www.sigkdd.org/kddcup/index.php
27
Download