Automatic Classification of Text Databases Through Query Probing

Panagiotis G. Ipeirotis
Luis Gravano
Columbia University
Mehran Sahami
E.piphany Inc.
Search-only Text Databases
Sources of valuable information
Hidden behind search interfaces
Non-crawlable
Example: Microsoft Support KB
Interacting With Searchable Text Databases
1. Searching: Metasearchers
2. Browsing: Use Yahoo-like directories
3. Browse & search: “Category-enabled” metasearchers
Searching Text Databases: Metasearchers
Select the good databases for a query
Evaluate the query at these databases
Combine the query results from the databases
Examples: MetaCrawler, SavvySearch, Profusion
Browsing Through Text Databases
Yahoo-like web directories:
InvisibleWeb.com
SearchEngineGuide.com
TheBigHub.com
Example from InvisibleWeb.com
Computers > Publications > ACM DL
Category-enabled metasearchers
User-defined category (e.g. Recipes)
Problem With the Current Classification Approach
Classification of databases is done manually
This requires a lot of human effort!
How to Classify Text Databases Automatically: Outline
Definition of classification
Strategies for classifying searchable databases through query probing
Initial experiments
Database Classification: Two Definitions
Coverage-based classification:
The database contains many documents about the category (e.g. Basketball)
Coverage: number of documents about this category
Specificity-based classification:
The database contains mainly documents about this category
Specificity: (number of documents about this category) / |DB|
Database Classification: An Example
Category: Basketball
Coverage-based classification
ESPN.com, NBA.com
Specificity-based classification
NBA.com, but not ESPN.com
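The two measures can be written as simple functions of per-category document counts. A minimal sketch, where the counts and database size are hypothetical illustrations, not numbers from the talk:

```python
def coverage(docs_in_category: int) -> int:
    """Coverage: absolute number of documents about the category."""
    return docs_in_category

def specificity(docs_in_category: int, db_size: int) -> float:
    """Specificity: fraction of the database that is about the category."""
    return docs_in_category / db_size

# Hypothetical basketball-focused site: 10,000 documents, 9,500 on Basketball.
print(coverage(9_500))             # 9500
print(specificity(9_500, 10_000))  # 0.95
```

A large general sports site could have the same (or higher) coverage for Basketball while having much lower specificity, which is why the two definitions can classify the same database differently.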
Categorizing a Text Database: Two Problems
Find the category of a given document
Find the category of all the documents inside the database
Categorizing Documents
Several text classifiers available
RIPPER (AT&T Research, William Cohen 1995)
Input: A set of pre-classified, labeled documents
Output: A set of classification rules
Categorizing Documents: RIPPER
Training set: Preclassified documents
“Linux as a web server”: Computers
“Linux vs. Windows: …”: Computers
“Jordan was the leader of Chicago Bulls”: Sports
“Smoking causes lung cancer”: Health
Output: Rule-based classifier
IF linux THEN Computers
IF jordan AND bulls THEN Sports
IF lung AND cancer THEN Health
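A minimal sketch of a rule-based classifier in this spirit; the rule encoding and function names are illustrative, not RIPPER's actual output format:

```python
from typing import Optional

# Each rule: (set of words that must all appear, category).
RULES = [
    ({"linux"}, "Computers"),
    ({"jordan", "bulls"}, "Sports"),
    ({"lung", "cancer"}, "Health"),
]

def classify(document: str) -> Optional[str]:
    """Return the category of the first rule whose words all occur."""
    words = set(document.lower().split())
    for antecedent, category in RULES:
        if antecedent <= words:  # subset test: all rule words present
            return category
    return None

print(classify("Jordan was the leader of the Chicago Bulls"))  # Sports
print(classify("Smoking causes lung cancer"))                  # Health
```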
Precision and Recall of the Document Classifier
During the training phase:
100 documents about computers
“Computer” rules matched 50 docs
From these 50 docs 40 were about computers
Precision = 40/50 = 0.8
Recall = 40/100 = 0.4
From Document to Database Classification
If we know the categories of all the documents, we are done!
But databases do not export such data!
How can we extract this information?
Our Approach: Query Probing
Design a small set of queries to probe the databases
Categorize the database based on the probing results
Designing and Implementing Query Probes
The probes should extract information about the categories of the documents in the database
Start with a document classifier (RIPPER)
Transform each rule into a query
IF lung AND cancer THEN health → +lung +cancer
IF linux THEN computers → +linux
Get number of matches for each query
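The rule-to-query transformation amounts to turning each conjunctive antecedent into a boolean AND query. A sketch using the common “+word” syntax; the helper name is mine, and real search interfaces vary in query syntax:

```python
def rule_to_query(antecedent):
    """Turn a conjunctive rule antecedent (a set of words) into a
    boolean AND query in '+word' syntax."""
    return " ".join("+" + word for word in sorted(antecedent))

print(rule_to_query({"lung", "cancer"}))  # +cancer +lung
print(rule_to_query({"linux"}))           # +linux
```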
Three Categories and Three Databases
Probe queries:
+linux → computers
+jordan +bulls → sports
+lung +cancer → health
Number of matches per database:

          ACM DL   NBA.com   PubMED
comp         336         0       16
sports         0      6674        0
health        18       103    81164
Using the Results for Classification
We use the probing results to estimate coverage and specificity values.

Coverage (COV):

          ACM DL   NBA.com   PubMED
comp         336         0       16
sports         0      6674        0
health        18       103    81164

Specificity (SPEC):

          ACM DL   NBA.com   PubMED
comp        0.95         0        0
sports         0     0.985        0
health      0.05     0.015      1.0
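The specificity values follow from the match counts by normalizing each database's column so it sums to one. A sketch (the function name is mine) that reproduces those estimates:

```python
# Match counts from probing: rows are categories, columns are
# ACM DL, NBA.com, PubMED.
counts = {
    "comp":   [336,     0,    16],
    "sports": [  0,  6674,     0],
    "health": [ 18,   103, 81164],
}

def specificity_estimates(counts):
    """Divide each count by its database's total number of matches."""
    n = len(next(iter(counts.values())))
    totals = [sum(row[j] for row in counts.values()) for j in range(n)]
    return {cat: [row[j] / totals[j] for j in range(n)]
            for cat, row in counts.items()}

spec = specificity_estimates(counts)
print(round(spec["comp"][0], 2))    # 0.95  (comp at ACM DL)
print(round(spec["sports"][1], 3))  # 0.985 (sports at NBA.com)
```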
Adjusting Query Results
Classifiers are not perfect!
Queries do not “retrieve” all the documents that belong to a category
Queries for one category “match” documents that do not belong to this category
From the training phase of the classifier we use precision and recall
Precision & Recall Adjustment
Computer-category:
Rule: “linux”, Precision = 0.7
Rule: “cpu”, Precision = 0.9
Recall (for all the rules) = 0.4
Probing with queries for “Computers”:
Query: +linux → X1 matches → 0.7·X1 correct matches
Query: +cpu → X2 matches → 0.9·X2 correct matches
From X1+X2 documents found:
Expect 0.7·X1 + 0.9·X2 to be correct
Expect (0.7·X1 + 0.9·X2)/0.4 total computer docs
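A sketch of this adjustment; the per-rule precisions and the category recall are the training-phase values from the slide, while the match counts X1 and X2 are hypothetical:

```python
def adjusted_total(matches_with_precision, recall):
    """Estimate the true number of category documents.
    matches_with_precision: list of (match_count, rule_precision) pairs."""
    correct = sum(x * p for x, p in matches_with_precision)
    return correct / recall  # scale up for documents the rules miss

# Hypothetical: +linux matched X1=100 docs, +cpu matched X2=200 docs.
print(adjusted_total([(100, 0.7), (200, 0.9)], recall=0.4))  # 625.0
```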
Initial Experiments
Used a collection of 20,000 newsgroup articles
Formed 5 categories:
Computers (comp.*)
Science (sci.*)
Hobbies (rec.*)
Society (soc.* + alt.atheism)
Misc (misc.forsale)
RIPPER trained with 10,000 newsgroup articles
Classifier: 29 rules, 32 words used
IF windows AND pc THEN Computers (precision~0.75)
IF satellite AND space THEN Science (precision~0.9)
Web Databases Probed
Using the newsgroup classifier we probed four web databases:
Cora (www.cora.jprc.com)
CS Papers archive (Computers)
American Scientist (www.amsci.org)
Science and technology magazine (Science)
All Outdoors (www.alloutdoors.com)
Articles about outdoor activities (Hobbies)
Religion Today (www.religiontoday.com)
News and discussion about religions (Society)
Results
[Figure: estimated specificity of Cora, American Scientist, AllOutdoors, and ReligionToday over the categories Computers, Science, Hobbies, Society, and Misc; bars annotated with the raw match counts from probing]
Only 29 queries per web site
No need for document retrieval!
Conclusions
Easy classification using only a small number of queries
No need for document retrieval
Only need a result like: “X matches found”
Not limited to search-only databases
Every searchable database can be classified this way
Not limited to topical classification
Current Issues
Comprehensive classification scheme
Representative training data
Future Work
Use a hierarchical classification scheme
Test different search interfaces
Boolean model
Vector-space model
Different capabilities
Compare with document sampling (Callan et al.’s work, SIGMOD 1999, adapted for the classification task)
Study classification efficiency when documents are accessible
Related Work
Gauch (JUCS 1996)
Etzioni et al. (JIIS 1997)
Hawking & Thistlewaite (TOIS 1999)
Callan et al. (SIGMOD 1999)
Meng et al. (CoopIS 1999)