Web-page Classification through Summarization

Overview of Information Retrieval and Our Solutions
Qiang Yang
Department of Computer Science and Engineering
The Hong Kong University of Science and Technology
Hong Kong

Why Do We Need Information Retrieval (IR)?

More and more online information in general (Information Overload)
Many tasks rely on effective management and exploitation of information
Textual information plays an important role in our lives
Effective text management directly improves productivity

What is IR?

Narrow-sense:
  IR = Search engine technologies (Google/Yahoo!/Live Search)
  IR = Text matching/classification
Broad-sense: IR = Text information management:
  How to find useful information? (info. retrieval) (e.g., Yahoo!)
  How to organize information? (text classification) (e.g., automatically assign email to different folders)
  How to discover knowledge from text? (text mining) (e.g., discover correlation of events)

Difficulties

Huge Amount of Online Data
  Yahoo! has nearly 20 billion pages in its index (as collected at the beginning of 2005)
Different types of data
  Web-pages, emails, blogs, chat-room messages
Ambiguous Queries
  Short: 2-4 words
  Ambiguous: apple; bank…

Our Solutions

Query Classification


Query Expansion/Suggestion


SIGIR’04; CIKM’04; ICDM’04; ICDE’06; WWW’06; IPM (2007), DMKD (Vol. 12)
Document Summarization


Submission to SIGIR’07
Web page Classification/Clustering


Submissions to: SIGIR’07; AAAI’07; KDD’07
Entity Resolution


Champion of KDDCUP’05; TOIS (Vol. 24); SIGIR’06; KDD Explorations (Vol. 7)
SIGIR’05; IJCAI’07
Analysis of Blogs, Emails, Chatting-room messages

SIGIR’06; ICDM’06 (2); IJCAI’07
Outline

Query Classification (QC)
  Introduction
  Solution 1: Query/category enrichment
  Solution 2: Bridging classifiers
Entity Resolution
Summary of Other Work

Query Classification
Introduction

Web queries are difficult to manage:
  Short; Ambiguous; Evolving
Query Classification (QC) can help to understand queries better
  Vertical Search
  Re-rank search results
  Online Advertisements
Difficulties of QC (different from text classification)
  How to represent queries
  Target taxonomy is dynamic, e.g. online ads taxonomy
  Training data is difficult to collect

Problem Definition

Inspired by the KDDCUP’05 competition
  Classify a query into a ranked list of categories
  Queries are collected from real search engines
  Target categories are organized in a tree with each node being a category

Related Work

Document Classification
  Feature selection [Yang et al. 1997]
  Feature generation [Cai et al. 2003]
  Classification algorithms
    Naïve Bayes [McCallum and Nigam 1998]
    KNN [Yang 1999]
    SVM [Joachims 1999]
    ……
  An overall survey in [Sebastiani 2002]

Related Work

Query Classification/Clustering
  Classify Web queries by geographical locality [Gravano 2003]
  Classify queries according to their functional types [Kang 2003]
  Beitzel et al. studied topical classification as we do; however, they rely on manually classified data [Beitzel 2005]
  Beeferman and Wen each worked on query clustering using clickthrough data [Beeferman 2000; Wen 2001]

Related Work

Document/Query Expansion
  Borrow text from extra data sources
    Using hyperlinks [Glover 2002]
    Using implicit links from query logs [Shen 2006]
    Using existing taxonomies [Gabrilovich 2005]
  Query expansion [Manning 2007]
    Global methods: independent of the queries
    Local methods: using relevance feedback or pseudo-relevance feedback

Solutions

[Figure: two ways of connecting Queries to Target Categories]
Solution 1: Query/Category Enrichment
Solution 2: Bridging classifier

Solution 1: Query/Category Enrichment

Assumptions & Architecture
Query Enrichment
Classifiers
  Synonym-based classifiers
  Statistical classifiers
Experiments

Assumptions & Architecture

The intended meanings of Web queries should be reflected by the Web;
A set of objects exists that covers the target categories.
[Figure: The Architecture of Our Approach]
  Phase I (the training phase): Search Engine; Labels of Returned Pages → Construction of Synonym-based Classifiers; Text of Returned Pages → Construction of Statistical Classifier
  Phase II (the testing phase): Query → Classified results (one set per kind of classifier) → Final Results

Query enrichment

Textual information
  Title
  Snippet
  Full text
Category information
  Category

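Synonym-based classifiers

[Figure: a Query is linked to Pages 1 to 4, which carry intermediate categories $C_1^I$ to $C_4^I$; Category Mapping relates these to target categories $C_1^T$ to $C_3^T$, yielding the final category $C^*$]
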
Synonym-based classifiers

Map by Word Matching
  Direct Matching
  Extended Matching
    WordNet
    "Hardware" → "Hardware; Device; Equipment"
  High precision, low recall

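The two matching modes can be illustrated with a minimal sketch. It assumes category names are matched on shared words, and it uses a small hand-written synonym table (`SYNONYMS`, hypothetical) in place of real WordNet lookups; the function names are illustrative, not taken from the original system.

```python
# Hypothetical synonym table standing in for WordNet lookups.
SYNONYMS = {
    "hardware": {"hardware", "device", "equipment"},
}


def tokens(name):
    """Lowercase a category name and split it into a set of words."""
    return set(name.lower().replace("/", " ").split())


def direct_match(intermediate, target):
    """Direct matching: the two category names share at least one word."""
    return bool(tokens(intermediate) & tokens(target))


def extended_match(intermediate, target, synonyms=SYNONYMS):
    """Extended matching: expand the target words with synonyms, then match."""
    expanded = set()
    for word in tokens(target):
        expanded |= synonyms.get(word, {word})
    return bool(tokens(intermediate) & expanded)


if __name__ == "__main__":
    # "Mobile Device" shares no word with "Hardware", so only the
    # synonym-expanded comparison links the two names.
    print(direct_match("Mobile Device", "Hardware"))
    print(extended_match("Mobile Device", "Hardware"))
```

Expanding the target name trades some precision for recall: more intermediate categories are mapped, at the risk of looser matches.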
Statistical classifiers: SVM

Apply synonym-based classifiers to map Web pages from the intermediate taxonomy to the target taxonomy
Obtain <pages, target category> as the training data
Train SVM classifiers for the target categories

Statistical Classifier: SVM

Advantages
  [Figure] Circles (triangles) denote crawled pages
  Black ones are mapped to the two categories successfully
  The white ones fail to be mapped
  If a query happens to be represented by the white ones, it cannot be classified correctly by the synonym-based method, but SVM can
Disadvantages
  Recall can be higher, but precision may suffer
  Once the target taxonomy changes, we need to train the classifiers again

Putting them together: Ensemble of classifiers

Why ensemble?
  Two kinds of classifiers based on different mechanisms
  They can be complementary to each other
  Proper combination can improve the performance
Combination strategies
  EV (Use validation data)
  EN (No validation data)

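As a rough illustration of combining the two kinds of classifiers, the sketch below sums per-classifier category scores with one weight per classifier. Treating EN as equal weights and EV as weights derived from validation data is an assumption made here for illustration; the slides do not spell out the combination rule, and `combine` is a hypothetical helper.

```python
def combine(classifier_scores, weights=None):
    """Rank categories by a weighted sum of per-classifier scores.

    classifier_scores: list of dicts mapping category -> score, one per classifier.
    weights: one weight per classifier; None means equal weights
             (the assumed stand-in for the EN strategy).
    """
    if weights is None:
        weights = [1.0] * len(classifier_scores)
    total = {}
    for weight, scores in zip(weights, classifier_scores):
        for category, score in scores.items():
            total[category] = total.get(category, 0.0) + weight * score
    # Highest combined score first; ties broken alphabetically.
    return sorted(total, key=lambda c: (-total[c], c))


if __name__ == "__main__":
    synonym_based = {"Computers/Hardware": 0.9, "Shopping": 0.2}
    svm = {"Computers/Hardware": 0.4, "Computers/Software": 0.7}
    print(combine([synonym_based, svm]))              # equal weights
    print(combine([synonym_based, svm], [0.1, 1.0]))  # weights, e.g. from validation
```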
Experiment
--Data Sets & Eval. Criteria

Queries: from KDDCUP 2005
  800,000 queries,
  800 labeled; three labelers
Evaluation
  A: $\sum_i$ (# of queries correctly tagged as $c_i$)
  B: $\sum_i$ (# of queries tagged as $c_i$)
  C: $\sum_i$ (# of queries whose category is $c_i$)
  $\text{Precision} = \dfrac{A}{B}$, $\text{Recall} = \dfrac{A}{C}$
  $F1 = \dfrac{2 \times \text{Precision} \times \text{Recall}}{\text{Precision} + \text{Recall}}$
  $\text{Overall } F1 = \dfrac{1}{3} \sum_{i=1}^{3} (F1 \text{ against human labeler } i)$

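The evaluation criteria can be sketched in a few lines. The data layout (a dict from query to a set of categories) and the function names are assumptions for illustration; the counts A, B and C follow the definitions on this slide.

```python
def f1_against_labeler(predicted, gold):
    """Return (precision, recall, F1) of `predicted` against one labeler.

    predicted, gold: dicts mapping query -> set of category names.
    """
    # A: query/category pairs tagged correctly; B: all pairs tagged by the
    # system; C: all pairs assigned by the labeler.
    a = sum(len(cats & gold.get(q, set())) for q, cats in predicted.items())
    b = sum(len(cats) for cats in predicted.values())
    c = sum(len(cats) for cats in gold.values())
    precision = a / b if b else 0.0
    recall = a / c if c else 0.0
    if precision + recall == 0:
        return precision, recall, 0.0
    return precision, recall, 2 * precision * recall / (precision + recall)


def overall_f1(predicted, labelers):
    """Average the F1 obtained against each human labeler."""
    return sum(f1_against_labeler(predicted, g)[2] for g in labelers) / len(labelers)


if __name__ == "__main__":
    predicted = {"q1": {"a", "b"}, "q2": {"c"}}
    labeler = {"q1": {"a"}, "q2": {"c", "d"}}
    print(f1_against_labeler(predicted, labeler))
```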
Experiment: Quality of the Data Sets

Consistency between labelers
Performance of each labeler against the other labelers
The distribution of the labels assigned by the three labelers.

Experiment Results
--Direct vs. Extended Matching

Number of pages collected for training using different mapping methods
F1 of the synonym-based classifier and SVM

Experiment Results
--The number of assigned labels

[Figure: precision (Pre), recall (Rec) and F1 of S1, S2, S3, SVM, EN and EDP against the number of guessed labels (1 to 6)]

Experiment Results
-- Effect of Base Classifiers
Solutions

[Figure: two ways of connecting Queries to Target Categories]
Solution 1: Query/Category Enrichment
Solution 2: Bridging classifier

Solution 2: Bridging Classifiers

Our Algorithm
  Bridging Classifier
  Category Selection
Experiments
  Data Set and Evaluation Criteria
  Results and Analysis

Algorithm
--Bridging Classifier

Problem with Solution 1:
  The target taxonomy is fixed, and training needs to be repeated when it changes
Goal:
  Connect the target taxonomy and queries by taking an intermediate taxonomy as a bridge

Algorithm
--Bridging Classifier (Cont.)

How to connect?
  The relation between $C_i^T$ and $C_j^I$
  The relation between $q$ and $C_j^I$
  Prior prob. of $C_j^I$
  The relation between $q$ and $C_i^T$

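One reading consistent with the four quantities labeled on this slide is $p(C_i^T \mid q) \propto \sum_j p(C_i^T \mid C_j^I)\, p(q \mid C_j^I)\, p(C_j^I)$. The formula itself did not survive on the slide, so this form is an assumption here, and the sketch below only shows how such a score could be computed from toy probability tables (all names and numbers are hypothetical).

```python
def bridge_scores(p_target_given_inter, p_query_given_inter, p_inter):
    """Score every target category for one query via intermediate categories.

    p_target_given_inter: dict target -> dict intermediate -> p(C_i^T | C_j^I)
    p_query_given_inter:  dict intermediate -> p(q | C_j^I)
    p_inter:              dict intermediate -> p(C_j^I), the prior
    """
    scores = {}
    for target, row in p_target_given_inter.items():
        # Sum over intermediate categories j of
        # p(C_i^T | C_j^I) * p(q | C_j^I) * p(C_j^I).
        scores[target] = sum(
            p * p_query_given_inter.get(inter, 0.0) * p_inter.get(inter, 0.0)
            for inter, p in row.items()
        )
    return scores


if __name__ == "__main__":
    # Toy numbers, for illustration only.
    p_t = {"Hardware": {"I1": 0.8, "I2": 0.1}, "Software": {"I1": 0.2, "I2": 0.9}}
    p_q = {"I1": 0.5, "I2": 0.1}
    p_i = {"I1": 0.4, "I2": 0.6}
    scores = bridge_scores(p_t, p_q, p_i)
    print(sorted(scores, key=scores.get, reverse=True))
```

Because only the table relating target and intermediate categories mentions the target taxonomy, a changed target taxonomy requires recomputing that table rather than retraining a classifier on pages.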
Algorithm
--Bridging Classifier (Cont.)

Understand the Bridging Classifier
  Given $q$ and $C_i^T$, the terms that depend only on $q$ and $C_i^T$ are fixed
  The prior probability of $C_j^I$, which reflects the size of $C_j^I$, acts as a weighting factor
  The relation between $q$ and $C_i^T$ tends to be larger when $q$ and $C_i^T$ tend to belong to the same smaller intermediate categories

Algorithm
--Category Selection

Category Selection for Reducing Complexity
  Total Probability (TP)
  Mutual Information (MI)

Experiment
--Data Sets and Eval. Criteria

Intermediate taxonomy
  ODP: 1.5M Web pages, in 172,565 categories
[Table: Number of Categories on Different Levels]
[Table: Statistics of the Numbers of Documents in the Categories on Different Levels]

Experiment
--Result of Bridging Classifiers

All intermediate categories are used
Snippet only
Best result when n = 60
Improvement by 10.4% and 7.1% in terms of precision and F1 respectively, compared to the two previous approaches

Experiment
--Result of Bridging Classifiers

Performances of the Bridging Classifier with Different Granularity of Intermediate Taxonomy
Best results when using all intermediate categories
Reason:
  A category with larger granularity may be a mixture of several target categories
  It cannot be used to distinguish different target categories

Experiment
--Effect of category selection

When the category number is around 18,000, the bridging classifier is comparable to, if not better than, the previous approaches
MI works better than TP
  It favors the categories that are better at distinguishing the target categories

Entity Resolution
Definition: Reference & Entity

Tsz-Chiu Au, Dana S. Nau: The Incompleteness of Planning with Volatile External Information. ECAI 2006
Tsz-Chiu Au, Dana S. Nau: Maintaining Cooperation in Noisy Environments. AAAI 2006
[Figure: in each citation, an author name is a Name Reference to an Author Entity, and the venue string is a Venue Reference to a Journal/Conf. Entity]
Current Author Search

DBLP
CiteSeer
Google
Graphical Model

We convert Entity Resolution into a Graph Partition Problem
  Each node denotes a reference
  Each edge denotes the relation between two references
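To make the graph view concrete, here is a minimal sketch in which references are nodes, pairwise similarities are edge weights, and each group of references forms one entity. It swaps in a deliberately simple technique (keep edges above a threshold, then take connected components with union-find) as a stand-in for the actual partitioning method, which these slides do not detail; the threshold and the similarity values are hypothetical.

```python
def partition(references, similarity, threshold=0.5):
    """Group references whose pairwise similarity reaches `threshold`.

    references: list of reference strings (graph nodes).
    similarity: dict mapping (ref_a, ref_b) -> edge weight in [0, 1].
    """
    parent = {r: r for r in references}

    def find(r):
        # Follow parent links to the representative, halving the path.
        while parent[r] != r:
            parent[r] = parent[parent[r]]
            r = parent[r]
        return r

    for (a, b), weight in similarity.items():
        if weight >= threshold:
            parent[find(a)] = find(b)

    groups = {}
    for r in references:
        groups.setdefault(find(r), []).append(r)
    return sorted(sorted(g) for g in groups.values())


if __name__ == "__main__":
    refs = ["D. Nau", "Dana S. Nau", "Dana Nau", "T. Au"]
    sims = {
        ("D. Nau", "Dana S. Nau"): 0.9,
        ("Dana S. Nau", "Dana Nau"): 0.8,
        ("Dana Nau", "T. Au"): 0.1,
    }
    print(partition(refs, sims))
```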
How to measure the Reference Relation

Tsz-Chiu Au, Dana S. Nau: The Incompleteness of Planning with Volatile External Information. ECAI 2006
Ugur Kuter, Dana S. Nau: Using Domain-Configurable Search Control for Probabilistic Planning. AAAI 2005
[Figure: the two citations are related through Plaintext Similarity, Authors, Coauthors, Research Community and Research Area]
Features

F1: Title Similarity
F2: Coauthor Similarity
F3: Venue Similarity
F4: Research Community Overlap
F5: Research Area Overlap
Research Community Overlap

A1, A2 stand for two author name references
F4.1: Similarity(A1, A2) = Coauthors(Coauthors(A1)) ∩ Coauthors(Coauthors(A2))
F4.2: Similarity(A1, A2) = Venues(Coauthors(A1)) ∩ Venues(Coauthors(A2))
Coauthors(X) returns the coauthor name set of each author in set X
Venues(Y) returns the venue name set of each author in set Y
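Feature F4.1 can be sketched directly with set operations. The coauthor table below is a toy example with placeholder names (A1, A2, B, C, D), and the function names are illustrative assumptions.

```python
def coauthors(authors, coauthor_of):
    """Coauthors(X): union of the coauthor name sets of every author in X."""
    result = set()
    for author in authors:
        result |= coauthor_of.get(author, set())
    return result


def community_overlap(a1, a2, coauthor_of):
    """F4.1: Coauthors(Coauthors(A1)) ∩ Coauthors(Coauthors(A2))."""
    two_hop_1 = coauthors(coauthors({a1}, coauthor_of), coauthor_of)
    two_hop_2 = coauthors(coauthors({a2}, coauthor_of), coauthor_of)
    return two_hop_1 & two_hop_2


if __name__ == "__main__":
    # Toy coauthor table: A1 wrote with B, A2 wrote with C,
    # and both B and C wrote with D.
    coauthor_of = {
        "A1": {"B"},
        "A2": {"C"},
        "B": {"A1", "D"},
        "C": {"A2", "D"},
    }
    print(community_overlap("A1", "A2", coauthor_of))
```

A1 and A2 share no direct coauthor in this table, yet the two-hop sets overlap, which is the kind of evidence the community feature is meant to capture.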
Research Area Overlap

V1, V2 stand for two venue references
F5.1: Similarity(V1, V2) = Authors(Articles(V1)) ∩ Authors(Articles(V2))
F5.2: Similarity(V1, V2) = Articles(Authors(Articles(V1))) ∩ Articles(Authors(Articles(V2)))
Authors(X) returns the author name set of each article in set X
Articles(Y) returns the article set holding a reference of each element in set Y
System Framework

[Figure: system framework, with Similarity and Probability components]
Experiment Results

Our Dataset: 1000 references to 20 author entities from DBLP
Getoor’s Datasets
  CiteSeer: 2,892 author references to 1,165 author entities
  arXiv: 58,515 references to 9,200 author entities
F1 = 97.0%
Summary of Other Work
Summary of Other Work

Summarization using Conditional Random Fields (IJCAI ’07)
Thread Detection in Dynamic Text Message Streams (SIGIR ’06)
Implicit Links for Web Page Classification (WWW ’06)
Text Classification Improved by Multigram Models (CIKM ’06)
Latent Friend Mining from Blog Data (ICDM ’06)
Web-page Classification through Summarization (SIGIR ’04)

Summarization using Conditional Random Fields (IJCAI ’07)

Motivation
  Observation
  [Figure: Steps 1 to 3, each shown over sentences 1 to 6]
  Summarization → Sequence labeling
Solution: CRF
  [Figure: linear chain with observed sentences $x_{t-1}, x_t, x_{t+1}$ and unobserved labels $y_{t-1}, y_t, y_{t+1}$]
  Feature functions and parameters

Thread Detection in Dynamic Text Message Streams (SIGIR ’06)

Representation
  Content-based
  Structure-based
    Sentence Type; Personal Pronouns
Clustering

Implicit Links for Web Page Classification (WWW ’06)

Implicit link 1 (LI1)
  Assumption: a user tends to click the pages related to the issued query
  Definition: there is an LI1 between d1 and d2 if they are clicked by the same person through the same query
Implicit link 2 (LI2)
  Assumption: users tend to click related pages according to the same query
  Definition: there is an LI2 between d1 and d2 if they are clicked according to the same query

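The two definitions can be turned into a small sketch over a click log. The log format, a list of (user, query, page) tuples, is an assumption for illustration, as is the function name.

```python
from collections import defaultdict
from itertools import combinations


def implicit_links(click_log):
    """Return (LI1, LI2) as sets of page pairs.

    click_log: iterable of (user, query, page) click records.
    LI1 links pages clicked by the same user through the same query;
    LI2 links pages clicked through the same query by any users.
    """
    by_user_query = defaultdict(set)
    by_query = defaultdict(set)
    for user, query, page in click_log:
        by_user_query[(user, query)].add(page)
        by_query[query].add(page)

    def pairs(groups):
        links = set()
        for pages in groups.values():
            # Sorting gives each unordered pair one canonical form.
            links.update(combinations(sorted(pages), 2))
        return links

    return pairs(by_user_query), pairs(by_query)


if __name__ == "__main__":
    log = [
        ("u1", "jaguar", "d1"),
        ("u1", "jaguar", "d2"),
        ("u2", "jaguar", "d3"),
    ]
    li1, li2 = implicit_links(log)
    print(sorted(li1))
    print(sorted(li2))
```

Every LI1 pair is also an LI2 pair, since LI2 drops the same-user condition.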
Text Classification Improved by Multigram Models (CIKM ’06)

Training Stage: For each category
  Train an n-multigram model
  Train an n-gram model on the sequences
Test Stage: For a test document
  For each category, segment the document
  Calculate its probability under the corresponding n-gram model
  Assign the test document the category under which it has the largest probability

Latent Friend Mining from Blog Data (ICDM ’06)

Objective
  One way to build Web communities
  Find the people sharing similar interest with the target person
    “Interest” is reflected by their “writings”
    “Writings” are from their “blogs”
    These people may not know each other
    They are not linked as in previous study

Latent Friend Mining from Blog Data (Cont.)

Solutions
  Cosine Similarity-based method
    Calculate the cosine similarity between the contents of the blogs
  Topic Model-based method
    Find latent topics in the blogs using latent topic models and calculate the similarity at topic level
  Two-level similarity-based method
    First stage: use an existing topic hierarchy to get the topic distribution of a blogger’s blogs
    Second stage: use a detailed similarity comparison

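The cosine similarity-based method can be illustrated with a minimal sketch. It assumes each blogger's writing is one string, tokenized on whitespace into raw term counts; the real system may well have used a different weighting (for example TF-IDF), so this is an illustration only.

```python
import math
from collections import Counter


def cosine(text_a, text_b):
    """Cosine similarity between the bag-of-words vectors of two texts."""
    a = Counter(text_a.lower().split())
    b = Counter(text_b.lower().split())
    dot = sum(count * b[word] for word, count in a.items())
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(
        sum(v * v for v in b.values())
    )
    return dot / norm if norm else 0.0


if __name__ == "__main__":
    blog_1 = "web search and query classification"
    blog_2 = "query classification for web search engines"
    blog_3 = "cooking pasta at home"
    print(cosine(blog_1, blog_2))
    print(cosine(blog_1, blog_3))
```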
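Web-page Classification through Summarization (SIGIR ’04)

[Figure: a Summarizer (Description, LUHN, LSA, Page-layout analysis, Supervised, Combined) turns the Train set and Testing set into Train Summaries and Testing Summaries; a Classifier built from the Train Summaries is applied to the Testing Summaries to give the Result]
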
Thanks