KDD-2010 Review - IEEE Entity Web Hosting

Data Mining in Practice:
Techniques and Practical Applications
Junling Hu
May 14, 2013
What is data mining?

Mining patterns from data

Is it statistics?
• Functional form?
• Computation speed concern?
• Data size
• Variable size
Is it machine learning?
• Big data issue
• New methods: network mining
Examples of data mining
Frequently bought together


Movie recommendation
More examples of data mining
Keyword suggestions



Genome & disease mining
Heart monitoring
Overview of data mining
Frequent pattern mining
Machine Learning
• Supervised
• Unsupervised
Stream mining
Recommender system
Graph mining
Unstructured data
• Text
• Audio
• Image and video
Big data technology
Frequent Pattern Mining
Diaper and Beer

Product assortment
Click behavior
Machine breakdown



The case of Amazon
User  Items
1     {Princess dress, crown, gloves, t-shirt}
2     {Princess dress, crown, gloves, pink dress, t-shirt}
3     {Princess dress, crown, gloves, pink dress, jeans}
4     {Princess dress, crown, gloves, pink dress}
5     {crown, gloves}
Count frequency of co-occurrence
Efficient algorithm
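
Counting co-occurrence can be sketched in a few lines. This is a brute-force illustration over the five baskets listed above, not the "efficient algorithm" the slide refers to (which the slide does not name); the function names and the `min_support` threshold are assumptions made for the example.

```python
from collections import Counter
from itertools import combinations

# The five users' baskets from the Amazon example.
baskets = [
    {"princess dress", "crown", "gloves", "t-shirt"},
    {"princess dress", "crown", "gloves", "pink dress", "t-shirt"},
    {"princess dress", "crown", "gloves", "pink dress", "jeans"},
    {"princess dress", "crown", "gloves", "pink dress"},
    {"crown", "gloves"},
]

def pair_counts(baskets):
    """Count how many baskets contain each unordered pair of items."""
    counts = Counter()
    for basket in baskets:
        # Sorting makes each pair appear in one canonical order.
        for pair in combinations(sorted(basket), 2):
            counts[pair] += 1
    return counts

def frequent_pairs(baskets, min_support):
    """Keep only the pairs found in at least min_support baskets."""
    return {p: c for p, c in pair_counts(baskets).items() if c >= min_support}

counts = pair_counts(baskets)
frequent = frequent_pairs(baskets, min_support=4)
print(counts[("crown", "gloves")])  # every user bought both: 5
print(sorted(frequent))
```

Enumerating every pair in every basket grows quickly with basket size, which is why practical systems prune candidates, for example by discarding any pair that contains an infrequent single item.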
Machine Learning Process
Machine Learning

Supervised

Unsupervised (clustering)
Binary classification
Input features: Checking, Duration (years), Savings ($k), Current Loans, Loan Purpose
Output class: Risky?

Data point  Checking  Duration (years)  Savings ($k)  Current Loans  Loan Purpose  Risky?
1           Yes       1                 10            Yes            TV            0
2           Yes       2                 4             No             TV            1
3           No        5                 75            No             Car           0
4           Yes       10                66            No             Repair        1
5           Yes       5                 83            Yes            Car           0
6           Yes       1                 11            No             TV            0
7           Yes       4                 99            Yes            Car           0
Classification (1)

Decision tree
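
A decision tree is grown by repeatedly choosing the feature whose split leaves the purest groups. The sketch below scores only the first split, using Gini impurity on the categorical columns of the loan example. The slide does not say which impurity measure it uses, so Gini is an assumption, and the row values are one reading of the flattened table.

```python
from collections import Counter, defaultdict

# Seven loan applications (categorical columns only); values are a reading
# of the slide's table, and "risky" is the output class.
rows = [
    {"checking": "Yes", "current_loans": "Yes", "purpose": "TV",     "risky": 0},
    {"checking": "Yes", "current_loans": "No",  "purpose": "TV",     "risky": 1},
    {"checking": "No",  "current_loans": "No",  "purpose": "Car",    "risky": 0},
    {"checking": "Yes", "current_loans": "No",  "purpose": "Repair", "risky": 1},
    {"checking": "Yes", "current_loans": "Yes", "purpose": "Car",    "risky": 0},
    {"checking": "Yes", "current_loans": "No",  "purpose": "TV",     "risky": 0},
    {"checking": "Yes", "current_loans": "Yes", "purpose": "Car",    "risky": 0},
]

def gini(labels):
    """Gini impurity: 0 for a pure group, larger for mixed groups."""
    n = len(labels)
    if n == 0:
        return 0.0
    return 1.0 - sum((c / n) ** 2 for c in Counter(labels).values())

def split_gini(rows, feature, target="risky"):
    """Size-weighted impurity of the groups produced by splitting on feature."""
    groups = defaultdict(list)
    for row in rows:
        groups[row[feature]].append(row[target])
    n = len(rows)
    return sum(len(g) / n * gini(g) for g in groups.values())

features = ["checking", "current_loans", "purpose"]
root_gini = gini([r["risky"] for r in rows])
scores = {f: split_gini(rows, f) for f in features}
best = min(scores, key=scores.get)  # lowest impurity wins the first split
print(root_gini, scores, best)
```

A full tree would apply the same scoring recursively inside each group and would also consider thresholds on the numeric columns (duration, savings).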
Classification (2): Neural network

Perceptron

Multi-layer neural network
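
The perceptron is the single-unit building block of a multi-layer network. As a minimal sketch, the code below trains one on the logical AND function with the classic error-driven update rule; the data set, learning rate and epoch count are illustrative choices, not taken from the slides.

```python
def predict(w, b, x):
    """Fire (1) when the weighted sum plus bias is positive, else 0."""
    return 1 if sum(wi * xi for wi, xi in zip(w, x)) + b > 0 else 0

def train_perceptron(samples, labels, epochs=20, lr=1.0):
    """Perceptron learning rule: nudge weights by the error on each sample."""
    w = [0.0] * len(samples[0])
    b = 0.0
    for _ in range(epochs):
        for x, y in zip(samples, labels):
            err = y - predict(w, b, x)  # -1, 0 or +1
            w = [wi + lr * err * xi for wi, xi in zip(w, x)]
            b += lr * err
    return w, b

# Logical AND is linearly separable, so the rule is guaranteed to converge.
X = [(0, 0), (0, 1), (1, 0), (1, 1)]
y = [0, 0, 0, 1]
w, b = train_perceptron(X, y)
print([predict(w, b, x) for x in X])
```

A single perceptron can only draw one straight boundary, which is the motivation for stacking units into multiple layers.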
Head pose detection
Support Vector Machine (SVM)

Search for a separating hyperplane
 Maximize margin
Perceived advantage of SVM

Transform data into higher dimension
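
The idea of lifting data into a higher dimension can be shown without training an SVM. In this sketch, 1-D points whose class depends on distance from zero cannot be split by any single threshold, but after the assumed mapping x -> (x, x^2) a hand-picked line separates them. The data, the mapping and the hyperplane are all illustrative; a real SVM would find the maximum-margin hyperplane itself, typically through a kernel.

```python
def lift(x):
    """Map a 1-D point into 2-D feature space: (x, x squared)."""
    return (x, x * x)

def linearly_separable_1d(xs, ys):
    """Check whether a single threshold on x separates the two classes."""
    ordered = sorted(xs)
    thresholds = [(a + b) / 2 for a, b in zip(ordered, ordered[1:])]
    for t in thresholds:
        for sign in (1, -1):
            if all((sign if x > t else -sign) == y for x, y in zip(xs, ys)):
                return True
    return False

points = [-3.0, -2.0, -0.5, 0.0, 0.5, 2.0, 3.0]
labels = [1, 1, -1, -1, -1, 1, 1]  # +1 far from zero, -1 near zero

# Hand-picked hyperplane in the lifted space: x^2 - 2 = 0 (not max-margin).
W, B = (0.0, 1.0), -2.0

def classify(x):
    score = sum(w * f for w, f in zip(W, lift(x))) + B
    return 1 if score > 0 else -1

print(linearly_separable_1d(points, labels))  # False in the original space
print([classify(x) for x in points] == labels)  # True after lifting
```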
Applications of SVM: Spam Filter
Input Features:
Transmission
• IP address -- 167.12.24.555
• Sender URL -- one-spam.com
Email header
• From -- “admin@one-spam.com”
• To -- “undisclosed”
• cc
Email Body
• # of paragraphs
• # words
Email structure
• # of attachments
• # of links
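
Turning an email into these input features might look like the sketch below, which uses Python's standard `email` module. The raw message, the `extract_features` helper and its field names are hypothetical; the sender IP is passed in separately because it comes from the transmission layer rather than the message text.

```python
import email
import re

# Hypothetical raw message that echoes the slide's example values.
RAW = (
    "From: admin@one-spam.com\n"
    "To: undisclosed\n"
    "Subject: You won\n"
    "\n"
    "Congratulations, you won a prize.\n"
    "\n"
    "Claim it at http://one-spam.com/prize today.\n"
)

def extract_features(raw, sender_ip):
    """Collect transmission, header, body and structure features."""
    msg = email.message_from_string(raw)
    # For a simple (non-multipart) message the payload is the body text.
    body = "" if msg.is_multipart() else msg.get_payload()
    paragraphs = [p for p in re.split(r"\n\s*\n", body) if p.strip()]
    attachments = sum(
        1 for part in msg.walk()
        if part.get_content_disposition() == "attachment"
    )
    return {
        "sender_ip": sender_ip,
        "from": msg["From"],
        "to": msg["To"],
        "has_cc": msg["Cc"] is not None,
        "n_paragraphs": len(paragraphs),
        "n_words": len(body.split()),
        "n_attachments": attachments,
        "n_links": len(re.findall(r"https?://", body)),
    }

features = extract_features(RAW, sender_ip="167.12.24.555")
print(features)
```

An SVM needs numeric vectors, so text fields such as the sender address would still have to be encoded (for example as indicator features) before training.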
Logistic regression



Advantage: Simple functional form
Can be parallelized
Large scale
Applications of logistic regression

Click prediction




Search ranking (web pages, products)
Online advertising
Recommendation
The model


Output: Click/no click
Input features:
page content,
search keyword,
User information
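
The click model above can be sketched as a logistic regression trained by batch gradient descent. The two binary features (whether the search keyword matches the page, whether the user is a returning visitor) and the six training rows are invented for illustration; real systems use far larger feature sets.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def predict_proba(w, b, x):
    """Probability of a click: sigmoid of a weighted sum of features."""
    return sigmoid(sum(wi * xi for wi, xi in zip(w, x)) + b)

def train(X, y, lr=0.5, epochs=2000):
    """Batch gradient descent on the average log-loss."""
    w = [0.0] * len(X[0])
    b = 0.0
    n = len(X)
    for _ in range(epochs):
        grad_w = [0.0] * len(w)
        grad_b = 0.0
        for x, target in zip(X, y):
            err = predict_proba(w, b, x) - target
            for j, xj in enumerate(x):
                grad_w[j] += err * xj
            grad_b += err
        w = [wj - lr * g / n for wj, g in zip(w, grad_w)]
        b -= lr * grad_b / n
    return w, b

# Hypothetical features: (keyword_matches_page, returning_user)
X = [(1, 1), (1, 0), (0, 1), (0, 0), (1, 1), (0, 0)]
y = [1, 1, 0, 0, 1, 0]  # 1 = click, 0 = no click
w, b = train(X, y)
print(predict_proba(w, b, (1, 1)), predict_proba(w, b, (0, 0)))
```

Each sample's gradient term is independent of the others, so the inner loop can be split across machines and summed, which is the sense in which the method parallelizes to large scale.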
Regression


Linear regression
Non-linear regression
Application:
• Stock price prediction
• Credit scoring
• Employment forecast
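
For the linear case, the best-fit line has a closed-form least-squares solution. The sketch below fits one input variable; the data points are made up so the answer is easy to check by hand.

```python
def fit_line(xs, ys):
    """Ordinary least squares for y = slope * x + intercept."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept

# Illustrative data lying exactly on y = 2x + 1.
xs = [1, 2, 3, 4]
ys = [3, 5, 7, 9]
slope, intercept = fit_line(xs, ys)
print(slope, intercept)  # 2.0 1.0
```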
History of Supervised learning
Semi-supervised learning

Application:

Speech dialog system
Unsupervised learning: Clustering

No labeled data

Methods

K-means
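
K-means alternates two steps: assign each point to its nearest centroid, then move each centroid to the mean of its points. A minimal sketch follows, with invented 2-D points and fixed starting centroids so the run is repeatable; production implementations usually choose the starting centroids more carefully.

```python
def dist2(a, b):
    """Squared Euclidean distance between two points."""
    return sum((ai - bi) ** 2 for ai, bi in zip(a, b))

def mean(points):
    """Coordinate-wise mean of a non-empty list of points."""
    n = len(points)
    return tuple(sum(coords) / n for coords in zip(*points))

def kmeans(points, centroids, iterations=10):
    centroids = list(centroids)
    clusters = []
    for _ in range(iterations):
        # Assignment step: each point joins its nearest centroid.
        clusters = [[] for _ in centroids]
        for p in points:
            nearest = min(range(len(centroids)),
                          key=lambda i: dist2(p, centroids[i]))
            clusters[nearest].append(p)
        # Update step: move each centroid to the mean of its cluster.
        centroids = [mean(c) if c else centroids[i]
                     for i, c in enumerate(clusters)]
    return centroids, clusters

points = [(1, 1), (1.5, 2), (1, 0.5), (8, 8), (9, 9.5), (8.5, 9)]
centroids, clusters = kmeans(points, centroids=[(1, 1), (8, 8)])
print(centroids)
print(clusters)
```

No labels are used anywhere: the grouping comes only from distances between points.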
Categories of machine learning
Applications of Clustering


Malware detection
Document clustering: Topic detection
Graphs in our life

Social network
Friend recommendation

Molecular compound
Drug discovery
Graph and its matrix representation
[Figure: graph with nodes 1 to 6]
Adjacency matrix

     1  2  3  4  5  6
1    0  1  0  0  0  1
2    1  0  1  1  0  0
3    0  1  0  1  1  0
4    0  1  1  0  1  0
5    0  0  1  1  0  1
6    1  0  0  0  1  0
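
Building the adjacency matrix from a list of edges takes only a few lines. The edge list below is read off the matrix above (a 1 in row i, column j means nodes i and j are linked); the helper name is an assumption for the example.

```python
def adjacency_matrix(n, edges):
    """Symmetric 0/1 matrix for an undirected graph with nodes 1..n."""
    A = [[0] * n for _ in range(n)]
    for u, v in edges:
        A[u - 1][v - 1] = 1
        A[v - 1][u - 1] = 1  # undirected: mirror every edge
    return A

edges = [(1, 2), (1, 6), (2, 3), (2, 4), (3, 4), (3, 5), (4, 5), (5, 6)]
A = adjacency_matrix(6, edges)
degrees = [sum(row) for row in A]  # row sums give each node's degree
for row in A:
    print(row)
print(degrees)
```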
The web graph
[Figure: Page 1, Page 2 and Page 3 as nodes, connected by hyperlinks labeled with anchor text]
PageRank as a steady state

Transition matrix P (each column sums to 1; 0.33 is 1/3 rounded)

     1     2     3     4     5     6
1    0     0.5   0.25  0     0     0.5
2    0.33  0     0.25  1     0     0
3    0.33  0.5   0     0     0.33  0
4    0     0     0.25  0     0.33  0
5    0     0     0.25  0     0     0.5
6    0.33  0     0     0     0.33  0

PageRank is a probability vector $\pi$ such that $P\pi = \pi$
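
One way to find the steady state is power iteration: start from a uniform vector and keep multiplying by the transition matrix until it stops changing. The sketch assumes the column-stochastic reading of the matrix (column j holds the probabilities of leaving page j) and uses exact thirds in place of the rounded 0.33. It omits the damping factor that practical PageRank adds, which the slide does not mention.

```python
# Transition matrix; entry P[i][j] is the probability of moving from
# page j+1 to page i+1, so every column sums to 1.
P = [
    [0,     1 / 2, 1 / 4, 0, 0,     1 / 2],
    [1 / 3, 0,     1 / 4, 1, 0,     0],
    [1 / 3, 1 / 2, 0,     0, 1 / 3, 0],
    [0,     0,     1 / 4, 0, 1 / 3, 0],
    [0,     0,     1 / 4, 0, 0,     1 / 2],
    [1 / 3, 0,     0,     0, 1 / 3, 0],
]

def step(P, vec):
    """One random-surfer step: multiply the matrix by the vector."""
    n = len(P)
    return [sum(P[i][j] * vec[j] for j in range(n)) for i in range(n)]

def power_iteration(P, iterations=1000):
    """Repeatedly apply P to a uniform start vector."""
    n = len(P)
    vec = [1 / n] * n
    for _ in range(iterations):
        vec = step(P, vec)
    return vec

rank = power_iteration(P)
for page, score in enumerate(rank, start=1):
    print(page, round(score, 3))
```

Because every column sums to 1, each step preserves the total probability, so the result remains a probability vector throughout.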
Discover influencers on Twitter

The Twitter graph



Node
Link
A PageRank approach: TwitterRank
[Figure: users 1 to 5 as nodes, connected by "Following" links]
Facebook graph search

Entity graph

Natural language search

“Restaurants liked by my friends”
Recommending a game
Recommendation in Travel site
Prediction Problems

Rating Prediction


Given how a user rated other items, predict the user’s rating for a given item
Top-N Recommendation
Given the list of items liked by a user, recommend new items that the user might like
Explicit vs. Implicit Feedback Data

Explicit feedback


Ratings and reviews
Implicit feedback (user behavior)

Purchase behavior: Recency, frequency, …

Browsing behavior: # of visits, time of visit, length of stay, clicks
Collaborative Filtering

Hypotheses
User/Item Similarities
• Similar users purchase similar items
• Similar items are purchased by similar users
Matching characteristics
• Match exists between user’s and item’s characteristics
User-User similarity
User’s movie rating

        Out of Africa  Star Wars  Air Force One  Liar, Liar
John    4              4          5              1
Adam    1              1          2              5
Laura   ?              4          5              2
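
Laura's missing rating can be estimated from users who rate like her. The sketch below measures user-user similarity with cosine similarity over the movies both users rated, then takes a similarity-weighted average. The slides do not name a similarity measure, so cosine is an assumption; mean-centered variants such as Pearson correlation are a common refinement.

```python
import math

ratings = {
    "John":  {"Out of Africa": 4, "Star Wars": 4, "Air Force One": 5, "Liar, Liar": 1},
    "Adam":  {"Out of Africa": 1, "Star Wars": 1, "Air Force One": 2, "Liar, Liar": 5},
    "Laura": {"Star Wars": 4, "Air Force One": 5, "Liar, Liar": 2},
}

def cosine(u, v):
    """Cosine similarity over the movies both users have rated."""
    common = [m for m in u if m in v]
    dot = sum(u[m] * v[m] for m in common)
    norm_u = math.sqrt(sum(u[m] ** 2 for m in common))
    norm_v = math.sqrt(sum(v[m] ** 2 for m in common))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0

def predict(target, movie):
    """Similarity-weighted average of other users' ratings for movie."""
    sims = {
        other: cosine(ratings[target], their)
        for other, their in ratings.items()
        if other != target and movie in their
    }
    total = sum(sims.values())
    return sum(s * ratings[o][movie] for o, s in sims.items()) / total

sim_john = cosine(ratings["Laura"], ratings["John"])
sim_adam = cosine(ratings["Laura"], ratings["Adam"])
guess = predict("Laura", "Out of Africa")
print(sim_john, sim_adam, guess)
```

Laura comes out more similar to John than to Adam, so John's rating pulls the estimate upward.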
Item-item similarity

               John  Adam  Laura
Out of Africa  4     1     ?
Star Wars      4     1     4
Air Force One  5     2     5
Liar, Liar     1     5     2
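
The item-item view turns the same table on its side: compare movies by how users rated them, then estimate Laura's rating for "Out of Africa" from her own ratings of similar movies. As before, cosine similarity and the weighted average are illustrative assumptions rather than the method used on the slide.

```python
import math

ratings = {
    "John":  {"Out of Africa": 4, "Star Wars": 4, "Air Force One": 5, "Liar, Liar": 1},
    "Adam":  {"Out of Africa": 1, "Star Wars": 1, "Air Force One": 2, "Liar, Liar": 5},
    "Laura": {"Star Wars": 4, "Air Force One": 5, "Liar, Liar": 2},
}

# Transpose the table: movie -> {user: rating}.
by_item = {}
for user, movies in ratings.items():
    for movie, r in movies.items():
        by_item.setdefault(movie, {})[user] = r

def cosine(a, b):
    """Cosine similarity over the users who rated both movies."""
    common = [u for u in a if u in b]
    dot = sum(a[u] * b[u] for u in common)
    norm_a = math.sqrt(sum(a[u] ** 2 for u in common))
    norm_b = math.sqrt(sum(b[u] ** 2 for u in common))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0

def predict(user, movie):
    """Weight the user's own ratings by each movie's similarity to movie."""
    sims = {m: cosine(by_item[movie], by_item[m]) for m in ratings[user]}
    total = sum(sims.values())
    return sum(s * ratings[user][m] for m, s in sims.items()) / total

sim_star_wars = cosine(by_item["Out of Africa"], by_item["Star Wars"])
guess = predict("Laura", "Out of Africa")
print(sim_star_wars, guess)
```

Item similarities depend only on the catalog's rating history, so they can be computed ahead of time and reused for every user.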
Application of item-item similarity

Amazon
SVD (Singular Value Decomposition)
Latent factors
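
A latent factor model represents each user and each item by a short vector and predicts a rating as their dot product. The sketch below learns two factors per user and movie from the rating table by stochastic gradient descent on the known entries only. This is a common stand-in for SVD when the matrix has missing values, not a true singular value decomposition; the hyperparameters (`k`, `lr`, `reg`, `epochs`) are illustrative guesses.

```python
import random

# Known ratings only; Laura's rating for "Out of Africa" is missing.
ratings = {
    ("John", "Out of Africa"): 4, ("John", "Star Wars"): 4,
    ("John", "Air Force One"): 5, ("John", "Liar, Liar"): 1,
    ("Adam", "Out of Africa"): 1, ("Adam", "Star Wars"): 1,
    ("Adam", "Air Force One"): 2, ("Adam", "Liar, Liar"): 5,
    ("Laura", "Star Wars"): 4, ("Laura", "Air Force One"): 5,
    ("Laura", "Liar, Liar"): 2,
}

def predict(P, Q, user, item):
    """Predicted rating: dot product of user and item factor vectors."""
    return sum(pu * qi for pu, qi in zip(P[user], Q[item]))

def factorize(ratings, k=2, lr=0.01, reg=0.02, epochs=5000, seed=0):
    """Learn k latent factors per user and item with regularized SGD."""
    rng = random.Random(seed)
    users = sorted({u for u, _ in ratings})
    items = sorted({i for _, i in ratings})
    P = {u: [rng.uniform(0.1, 1.0) for _ in range(k)] for u in users}
    Q = {i: [rng.uniform(0.1, 1.0) for _ in range(k)] for i in items}
    for _ in range(epochs):
        for (u, i), r in ratings.items():
            err = r - predict(P, Q, u, i)
            for f in range(k):
                pu, qi = P[u][f], Q[i][f]
                P[u][f] += lr * (err * qi - reg * pu)
                Q[i][f] += lr * (err * pu - reg * qi)
    return P, Q

P, Q = factorize(ratings)
rmse = (sum((r - predict(P, Q, u, i)) ** 2
            for (u, i), r in ratings.items()) / len(ratings)) ** 0.5
laura_guess = predict(P, Q, "Laura", "Out of Africa")
print(rmse, laura_guess)
```

The learned factors have no preset meaning; they are whatever two directions best explain the observed ratings.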
Application of Latent Factor Model

GetJar
Ranking-based recommendation
Application in LinkedIn

Ranking-based model
Thanks and Contact

Co-author: Patricia Hoffman
Contact: junlinghu@gmail.com

Twitter: @junling_tech