Deep Web Mining and Learning for Advanced Local Search
CS8803, Advisor: Prof. Liu
Yu Liu, Dan Hou, Zhigang Hua, Xin Sun, Yanbing Yu
Competitors

• Yahoo! Local
• Yelp
• CitySearch
• Google Local
• Yellow Pages

How to beat them?
Research Background

• Deep Web Crawling
• Sentiment Learning
• Sentiment Ranking Model
• Geo-credit Ranking Model
• Social Network for Businesses

Show Time!
Local Biz Space
Architecture

• Query-based Crawler
• Sentiment Learner
• Super Local-Search
• HTML Parser
• Apache Server
• JDBC
• Database
Tools

• Open-source social network platform
  - Elgg, OpenSocial
• LAMP server
  - Linux + Apache + MySQL + PHP
• Google Maps API
  - e.g., geocoding
Crawling Dynamic Pages
[Chart: Yahoo! Local keyword trend -- number of keywords (0-16,000) vs. time (1-20 min)]
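A minimal sketch of the query-based crawling idea in Python. The endpoint URL, query parameters, and seed keywords are hypothetical placeholders, not Yahoo! Local's actual interface:

    import time
    import urllib.parse
    import urllib.request

    # Hypothetical search endpoint standing in for the dynamic result pages.
    SEARCH_URL = "http://example.com/local/search?q={query}&page={page}"
    SEED_KEYWORDS = ["auto parts", "pizza", "dry cleaning"]

    def fetch(query, page):
        url = SEARCH_URL.format(query=urllib.parse.quote(query), page=page)
        with urllib.request.urlopen(url, timeout=10) as resp:
            return resp.read().decode("utf-8", errors="replace")

    for keyword in SEED_KEYWORDS:
        for page in range(1, 4):      # crawl the first few result pages per query
            html = fetch(keyword, page)
            print(keyword, page, len(html))
            time.sleep(1)             # politeness delay between requests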
Parsing Dynamic Pages
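A hedged sketch of the parsing step, assuming BeautifulSoup; the CSS class names below are illustrative assumptions, since the real pages' markup will differ:

    from bs4 import BeautifulSoup   # pip install beautifulsoup4

    def parse_listings(html):
        """Extract business records from one crawled result page."""
        soup = BeautifulSoup(html, "html.parser")
        records = []
        for div in soup.find_all("div", class_="listing"):   # hypothetical markup
            records.append({
                "name": div.find("a", class_="name").get_text(strip=True),
                "phone": div.find("span", class_="phone").get_text(strip=True),
                "rating": float(div.find("span", class_="rating").get_text(strip=True)),
            })
        return records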
Database Design

Business
  PK: Business_ID
  Name, Phone, Address, Description, Rating, Source, Latitude, Longitude

Reviewer
  PK: Reviewer_ID
  Name, URL

Category
  PK, FK1: Business_ID
  PK: Category

Review
  FK1: Business_ID
  FK2: Reviewer_ID
  Date, Rating, Content

Term
  PK: Term
  PK, FK1: Business_ID
  PK: Source
  Frequency, Norm_Frequency
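A sketch of this schema as executable DDL, run through SQLite in Python for self-containment; the project itself uses MySQL behind JDBC, so the concrete types and constraints here are assumptions:

    import sqlite3

    SCHEMA = """
    CREATE TABLE Business (
        Business_ID INTEGER PRIMARY KEY,
        Name TEXT, Phone TEXT, Address TEXT, Description TEXT,
        Rating REAL, Source TEXT, Latitude REAL, Longitude REAL
    );
    CREATE TABLE Reviewer (
        Reviewer_ID INTEGER PRIMARY KEY,
        Name TEXT, URL TEXT
    );
    CREATE TABLE Category (
        Business_ID INTEGER REFERENCES Business(Business_ID),
        Category TEXT,
        PRIMARY KEY (Business_ID, Category)
    );
    CREATE TABLE Review (
        Business_ID INTEGER REFERENCES Business(Business_ID),
        Reviewer_ID INTEGER REFERENCES Reviewer(Reviewer_ID),
        Date TEXT, Rating REAL, Content TEXT
    );
    CREATE TABLE Term (
        Term TEXT,
        Business_ID INTEGER REFERENCES Business(Business_ID),
        Source TEXT,
        Frequency INTEGER, Norm_Frequency REAL,
        PRIMARY KEY (Term, Business_ID, Source)
    );
    """

    conn = sqlite3.connect(":memory:")
    conn.executescript(SCHEMA)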
Sentiment Learning

Can we use ONE score to show how good or bad a store is?
Sentiment Learning

• Objective
  - Identify positive and negative opinions about a store
• Dataset
  - Reviews represented as bags of terms
  - Normalized TF-IDF features (see the sketch below)
• Two ways to represent sentiment
  - Simply average the review scores
    • but "what you think is good might be bad for me"
  - Manual labeling
    • 1 to 5 ("least satisfied" to "most satisfied")
    • consensus-based
    • time-accuracy tradeoff
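A minimal sketch of building the normalized TF-IDF features with scikit-learn; the toy reviews are placeholders for the crawled corpus:

    from sklearn.feature_extraction.text import TfidfVectorizer

    # Toy reviews standing in for the crawled review data.
    reviews = [
        "great food and wonderful service",
        "awful place, outdated and shabby",
        "nice staff but they got my order wrong",
    ]

    # Stop-word removal plus L2-normalized TF-IDF, as described above.
    vectorizer = TfidfVectorizer(stop_words="english", norm="l2")
    X = vectorizer.fit_transform(reviews)
    print(X.shape)                     # (number of reviews, vocabulary size)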
Dimension Reduction

• High dimensionality
  - 6,857 tokens
  - Memory limitations
  - Risk of over-fitting
• Dimension reduction
  - PCA (Principal Component Analysis)
    • an orthogonal linear transformation
    • transforms the data to a new coordinate system
    • retains the characteristics of the data set that contribute most to its variance
  - Keeps the most important features without losing generality
Principal Component Analysis

• Original dimension: 6,857
• Variance retained: 95%
• Different granularities:

  Manual labeling:
    Downsampling rate    Resulting dimension
    40%                  231
    20%                  135
    4%                   41

  Score averaging:
    Downsampling rate    Resulting dimension
    40%                  344
    20%                  211
    4%                   37
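A sketch of the PCA step with scikit-learn, using random data in place of the real 6,857-dimensional TF-IDF matrix; n_components=0.95 asks for enough components to retain 95% of the variance, matching the setting above:

    import numpy as np
    from sklearn.decomposition import PCA

    # Random stand-in for the real entity-term TF-IDF matrix.
    rng = np.random.default_rng(0)
    X = rng.random((300, 6857))

    # Keep enough principal components to retain 95% of the variance.
    pca = PCA(n_components=0.95)
    X_reduced = pca.fit_transform(X)
    print(X.shape, "->", X_reduced.shape)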
Sentiment Learning

• Features used for sentiment learning:
  - Vector Space Model (reviews/comments)
• Some keywords are related to sentiment:
  - Positive: good, happy, wonderful, excellent, awesome, great, ok, nice, etc.
  - Negative: bad, sad, ugly, outdated, shabby, stupid, wrong, awful, etc.
• Most words are unrelated to sentiment:
  - e.g., buy, take, go, iPod, apple, comment, etc.
  - These add noise to sentiment learning!
What Do We Do?

• How do we learn sentiment from a large, noisy feature set?
  - Vector Space Model: M x N entity-term matrix (e.g., 6,000 x 20,000)
  - Dimensionality reduction (PCA)
  - Supervised learning for sentiment
• Human labeling vs. average rating
  - An online entity usually has many reviews, each carrying a rating
    • the average rating is an alternative label for the entity
  - Manual labeling:
    • 1 (least satisfactory) to 5 (most satisfactory)
    • three people label each entity; the majority vote is adopted
Manual labeling vs. Average rating

• Machine learning

Sentiment Learning - Manual Labeling

• Around 300 entities from local search; 6,800 features after stop-word removal and stemming
• Tried different SVM kernels
• Avoiding over-fitting:
  - leave-one-out estimation (see the sketch below)
• Features are nonlinear:
  - the polynomial kernel achieves the best performance
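A sketch of the kernel comparison with leave-one-out estimation, using scikit-learn and random stand-in data; the real setup has roughly 300 entities with 1-5 labels:

    import numpy as np
    from sklearn.model_selection import LeaveOneOut, cross_val_score
    from sklearn.svm import SVC

    # Random stand-ins for the PCA-reduced entities and their 1-5 labels.
    rng = np.random.default_rng(0)
    X = rng.random((50, 37))
    y = rng.integers(1, 6, size=50)

    # Leave-one-out estimation limits over-fitting on a small labeled set.
    for kernel in ("linear", "rbf", "poly"):
        acc = cross_val_score(SVC(kernel=kernel), X, y, cv=LeaveOneOut()).mean()
        print(kernel, round(float(acc), 3))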
Sentiment Learning - Average Rating

[Charts: classification accuracy (0-90%) of linear, RBF, and polynomial SVM kernels on the high-dimensional features and after PCA (PCA-37, PCA-135, PCA-231 / PCA-41, PCA-211, PCA-344), comparing manual labeling with rate averaging]

• Manual labeling
  - training is more precise
  - labeling is more consistent
• Rate averaging
  - training is less precise
  - ratings are more random, e.g. average(5, 5, 1) = 3
What We Learned

• Dimensionality reduction is necessary
  - The term Vector Space Model (VSM) is inherently huge
• Human labeling is necessary
  - Sentiment learning involves subjective rather than objective judgment
  - User ratings are noisy because they are not consistent across people
  - More labeled data is needed
• Other methods to try:
  - Unsupervised learning (clustering)
    • Gaussian Mixture Model: an alternative way to learn sentiments, though the number of hidden sentiments is hard to know (see the sketch below)
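A sketch of the clustering alternative with scikit-learn's GaussianMixture. Since the number of hidden sentiments is unknown, this tries a few candidate values of k; scoring them with BIC is our assumption, not a method from the slides:

    import numpy as np
    from sklearn.mixture import GaussianMixture

    # Random stand-in for PCA-reduced review features.
    rng = np.random.default_rng(0)
    X = rng.random((300, 41))

    # Try a few candidate numbers of hidden sentiments.
    for k in (2, 3, 5):
        gmm = GaussianMixture(n_components=k, random_state=0).fit(X)
        print(k, round(gmm.bic(X), 1))   # lower BIC suggests a better fit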
How to Use the Learned Sentiments?

• Sentiment learning can improve the ranking of local search
  - The sentiment value is an important metric for ranking an entity
  - Local search results are influenced by sentiment
• Sentiment Ranking Model (SRM):
  - SentiRank = a * ContentSim + (1 - a) * SentiValue
    • the parameter is set empirically to a = 0.5
  - Analogous to PageRank-style blending:
    • Rank = b * ContentSim + (1 - b) * PageImportance
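The SRM score as a small Python function, a minimal sketch of the formula above:

    def senti_rank(content_sim, senti_value, a=0.5):
        """SRM: blend content similarity with the learned sentiment value.
        a = 0.5 is the empirical setting from the slide."""
        return a * content_sim + (1 - a) * senti_value

    print(senti_rank(0.8, 0.6))   # ~0.7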
Geocoding

• Geocoding of addresses
  - For example, the geo-center of the store AA National Auto Parts,
    located at 3410 Washington St, Phoenix, AZ 85009
  - Using the Geocode API, we get its exact latitude and longitude:
    (33.447708, -112.13246)
• Haversine-style formula for great-circle distance:
  - distance between two coordinate pairs on a sphere, e.g. from the store
    above to another point (lat, lng), in miles:

      distance = 3959 * acos( cos(radians(33.448)) * cos(radians(lat))
                              * cos(radians(lng) - radians(-112.132))
                              + sin(radians(33.448)) * sin(radians(lat)) )

    where 3959 is the Earth's radius in miles
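A Python sketch of the distance formula above; technically this is the spherical-law-of-cosines form commonly used in place of the haversine formula, and the second coordinate pair in the usage line is an arbitrary example point:

    from math import acos, cos, radians, sin

    EARTH_RADIUS_MILES = 3959

    def great_circle_miles(lat1, lng1, lat2, lng2):
        """Great-circle distance between two points on a sphere, in miles."""
        return EARTH_RADIUS_MILES * acos(
            cos(radians(lat1)) * cos(radians(lat2))
            * cos(radians(lng2) - radians(lng1))
            + sin(radians(lat1)) * sin(radians(lat2))
        )

    # Distance from the AA National Auto Parts geo-center to a nearby point.
    print(great_circle_miles(33.447708, -112.13246, 33.4484, -112.0740))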
Geo-Sentiment Ranking Model (GSRM)

• Three measurements
  1. Content similarity -- term frequency
  2. Sentiment value -- sentiment learning
  3. Geo-distance -- Google Maps API
• GSRM ranking model:

    rank ∝ (a * ContentSim + (1 - a) * SentiValue) / GeoDistance
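A sketch of the GSRM score in Python; the small epsilon guarding against a zero distance is our addition, not part of the slides:

    def gsrm_score(content_sim, senti_value, distance_miles, a=0.5):
        """GSRM sketch: the SRM blend discounted by geo-distance."""
        return (a * content_sim + (1 - a) * senti_value) / (distance_miles + 1e-6)

    # A nearby store with decent sentiment outranks a distant one with
    # better sentiment.
    print(gsrm_score(0.8, 0.6, 1.2), gsrm_score(0.8, 0.9, 8.0))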
Example


Thank You!
Q&A time