Data Mining meets the Internet: Techniques for
Web Information Retrieval
Rajeev Rastogi
Internet Management Research
Bell laboratories, Murray Hill
The Web

Over 1 billion HTML pages, 15 terabytes

Wealth of information
– Bookstores, restaurants, travel, malls, dictionaries, news, stock
quotes, yellow & white pages, maps, markets, ...
– Diverse media types: text, images, audio, video
– Heterogeneous formats: HTML, XML, postscript, pdf, JPEG, MPEG,
MP3

Highly Dynamic
– 1 million new pages each day
– Average page changes in a few weeks

Graph structure with links between pages
– Average page has 7-10 links

Hundreds of millions of queries per day
Why is Web Information Retrieval
Important?

According to most predictions, the majority of human
information will be available on the Web in ten years

Effective information retrieval can aid in
– Research: Find all papers that use the primal-dual method to solve
the facility location problem
– Health/Medicine: What could be the reason for symptoms of “yellow
eyes”, high fever and frequent vomiting
– Travel: Find information on the tropical island of St. Lucia
– Business: Find companies that manufacture digital signal
processors (DSPs)
– Entertainment: Find all movies starring Marilyn Monroe between the
years 1960 and 1970
– Arts: Find all short stories written by Jhumpa Lahiri
Web Information Retrieval Model
[Architecture diagram: a Crawler fetches Web pages into a Repository managed by a Storage Server; an Indexer builds an Inverted Index over the documents (e.g., the terms "jaguar", "cat", "engine"); Clustering and Classification organize documents into a Topic Hierarchy (Root, News, Science, Business, Animals, Computers, Automobiles); a Web Server answers Search Queries against the index. Example repository documents: "The jaguar has a 4 liter engine" and "The jaguar, a cat, can run at speeds reaching 50 mph".]
Why is Web Information Retrieval
Difficult?

The Abundance Problem (99% of information of no interest
to 99% of people)
– Hundreds of irrelevant documents returned in response to a search
query

Limited Coverage of the Web (Internet sources hidden
behind search interfaces)
– Largest crawlers cover less than 18% of Web pages

The Web is extremely dynamic
– 1 million pages added each day



Very high dimensionality (thousands of dimensions)
Limited query interface based on keyword-oriented search
Limited customization to individual users
How can Data Mining Improve
Web Information Retrieval?

Latent Semantic Indexing (LSI)
– SVD-based method to improve precision and recall

Document clustering to generate topic hierarchies
– Hypergraph partitioning, STIRR, ROCK

Document classification to assign topics to new documents
– Naive Bayes, TAPER

Exploiting hyperlink structure to locate authoritative Web
pages
– HITS, Google, Web Trawling

Collaborative searching
– SearchLight

Image Retrieval
– QBIC, Virage, Photobook, WBIIS, WALRUS
Latent Semantic Indexing
Problems with Inverted Index
Approach

Synonymy
– Many ways to refer to the same object

Polysemy
– Most words have more than one distinct meaning
[Example term-document matrix over the terms animal, automobile, engine, jaguar, car, porsche, speed for Doc 1-3: the term "jaguar" appears both in documents about animals and in documents about automobiles (polysemy), while different documents use "car", "automobile", and "porsche" for the same concept (synonymy).]
LSI - Key Idea [DDF 90]

Apply SVD to the t x d term-by-document matrix X:

    X = T0 S0 D0'        (T0: t x m,  S0: m x m,  D0': m x d)

  T0 and D0 have orthonormal columns and S0 is diagonal

Ignoring very small singular values in S0 (keeping only the k largest values):

    X ≈ X̂ = T S D'       (T: t x k,  S: k x k,  D': k x d)

  The new matrix X̂ of rank k is closest to X in the least-squares sense
Comparing Documents and Queries

Comparing two documents
– Essentially the dot product of two column vectors of X̂:

      X̂' X̂ = D S² D'

– So one can consider the rows of the DS matrix as coordinates for
documents and take dot products in this space

Finding documents similar to query q with term vector Xq
– Derive a representation Dq for the query:  Dq = Xq' T S⁻¹
– The dot product of Dq S with the appropriate row of the DS matrix yields the
similarity between the query and a specific document
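As a concrete illustration, here is a minimal NumPy sketch of these steps; the toy term-document matrix, the choice k = 2, and the query vector are assumptions for illustration, not data from the talk:

```python
import numpy as np

# Hypothetical toy 0/1 term-by-document matrix X (t terms x d documents); purely illustrative.
X = np.array([
    [1, 1, 0],   # jaguar
    [0, 1, 0],   # cat
    [1, 0, 0],   # engine
    [0, 0, 1],   # porsche
], dtype=float)

k = 2                                                    # keep only the k largest singular values
T, s, Dt = np.linalg.svd(X, full_matrices=False)         # X = T0 S0 D0'
T_k, S_k, D_k = T[:, :k], np.diag(s[:k]), Dt[:k, :].T    # rank-k approximation X ~ T S D'

doc_coords = D_k @ S_k                                   # rows of D S: document coordinates

def query_coords(x_q):
    """Fold a query term vector into LSI space: Dq = Xq' T S^-1."""
    return x_q @ T_k @ np.linalg.inv(S_k)

q = np.array([1, 0, 1, 0], dtype=float)                  # query mentioning "jaguar" and "engine"
sims = doc_coords @ (S_k @ query_coords(q))              # dot products of Dq S with each row of D S
print(np.argsort(sims)[::-1])                            # documents ranked by similarity to the query
```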
LSI - Benefits
Reduces dimensionality of documents
– From tens of thousands (one dimension per keyword) to a few hundred
– Decreases storage overhead of index structures
– Speeds up retrieval of documents similar to a query

Makes search less “brittle”
– Captures semantics of documents
– Addresses problems of synonymy and polysemy
– Transforms document space from discrete to continuous
– Improves both search precision and recall
Document Clustering
Improve Search Using Topic
Hierarchies

Web directories (or topic hierarchies) provide a hierarchical
classification of documents (e.g., Yahoo!)
[Hierarchy figure: Yahoo home page at the root, with topics including Recreation, Business, Science, and News, and subtopics such as Travel, Sports, Companies, Finance, and Jobs]
Searches performed in the context of a topic restrict the search to
only the subset of Web pages related to that topic
Clustering can be used to generate topic hierarchies
Clustering

Given:
– Data points (documents) and number of desired clusters k
Group the data points (documents) into k clusters
– Data points (documents) within clusters are more similar than
across clusters

Document similarity measure

– Each document can be represented by vector with 0/1 value along
each word dimension
– Cosine of angle between document vectors is a measure of their
similarity, or (euclidean) distance between the vectors

Other applications:
– Customer segmentation
– Market basket analysis
k-means Algorithm




Choose k initial means
Assign each point to the cluster with the closest mean
Compute new mean for each cluster
Iterate until the k means stabilize
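A minimal sketch of this loop in Python; the dense NumPy representation of documents as vectors is an assumption for illustration:

```python
import numpy as np

def kmeans(points, k, iters=100, seed=0):
    """Minimal k-means sketch on an (n, d) NumPy array of document vectors."""
    rng = np.random.default_rng(seed)
    means = points[rng.choice(len(points), size=k, replace=False)]   # choose k initial means
    for _ in range(iters):
        # assign each point to the cluster with the closest mean
        dists = np.linalg.norm(points[:, None, :] - means[None, :, :], axis=2)
        labels = dists.argmin(axis=1)
        # compute the new mean for each cluster (keep the old mean if a cluster goes empty)
        new_means = np.array([points[labels == j].mean(axis=0) if np.any(labels == j) else means[j]
                              for j in range(k)])
        if np.allclose(new_means, means):    # iterate until the k means stabilize
            break
        means = new_means
    return labels, means
```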
Agglomerative Hierarchical
Clustering Algorithms

Initially each point is a distinct cluster

Repeatedly merge closest clusters until the number of clusters
becomes k
– Closest pair, under a cluster distance such as:
      dmean(Ci, Cj) = || mi - mj ||
      dmin(Ci, Cj)  = min { || p - q || : p ∈ Ci, q ∈ Cj }
  Likewise dave(Ci, Cj) and dmax(Ci, Cj)
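A toy sketch of the merge loop, assuming a small in-memory NumPy array of points and supporting the dmean (centroid) and dmin (single-link, MST-like) distances above:

```python
import numpy as np
from itertools import combinations

def agglomerative(points, k, dist="mean"):
    """Repeatedly merge the two closest clusters until only k remain.
    points: (n, d) NumPy array; dist="mean" uses centroid distance, dist="min" uses single link."""
    clusters = [[i] for i in range(len(points))]
    def d(a, b):
        if dist == "mean":
            return np.linalg.norm(points[a].mean(axis=0) - points[b].mean(axis=0))
        return min(np.linalg.norm(points[p] - points[q]) for p in a for q in b)
    while len(clusters) > k:
        # find the closest pair of clusters and merge them
        i, j = min(combinations(range(len(clusters)), 2),
                   key=lambda ij: d(clusters[ij[0]], clusters[ij[1]]))
        clusters[i] += clusters[j]
        del clusters[j]
    return clusters
```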
Agglomerative Hierarchical
Clustering Algorithms (Continued)
dmean: Centroid approach
dmin: Minimum Spanning Tree (MST) approach
[Figure: (a) clusters produced by the centroid approach, (b) clusters produced by the MST approach, (c) the correct clusters]
Drawbacks of Traditional
Clustering Methods

Traditional clustering methods are ineffective for clustering
documents
– Cannot handle thousands of dimensions
– Cannot scale to millions of documents

Centroid-based method splits large and non-hyperspherical
clusters
– Centers of subclusters can be far apart

MST-based algorithm is sensitive to outliers and to slight changes in
position
– Exhibits a chaining effect across strings of outliers

Using other similarity measures such as Jaccard coefficient
instead of euclidean distance does not help
Example - Centroid Method for
Clustering Documents

As
–
–
–
cluster size grows
The number of dimensions appearing in mean go up
Their value in the mean decreases
Thus, very difficult to distinguish two points that differ on few
dimensions
ripple effect
Database: {1, 2, 3, 5} {2, 3, 4, 5} {1, 4} {6}
(0.5,1,1,0.5,1,0)
(0.5,0,0,0.5,0,0.5)
(1,1,1,0,1,0) (0,1,1,1,1,0) (1,0,0,1,0,0) (0,0,0,0,0,1)

{1,4} and {6} are merged even though they have no
elements in common!
Itemset Clustering using
Hypergraph Partitioning [HKK 97]




Build a weighted hypergraph with frequent itemsets
– Hyperedge: each frequent itemset
– Weight of hyperedge: average of confidences of all association
rules generated from itemset
Hypergraph partitioning algorithm is used to cluster items
– Minimize sum of weights of cut hyperedges
Label customers with Item clusters by scoring
Assume that items defining clusters are disjoint!
STIRR - A System for Clustering
Categorical Attribute Values [GKR 98]



Motivated by spectral graph partitioning, a method for clustering
undirected graphs
Each distinct attribute value becomes a separate node v with
weight w(v)
Node weights w(v) are updated in each iteration as follows:
    for each tuple t = {v, u1, ..., uk} do
        x_t ← ( w(u1)^p + ... + w(uk)^p )^(1/p)
    w(v) ← Σ_t x_t
  Then the set of weights is updated so that it is orthonormal

Positive and negative weights in non-principal basins tend to
represent good partitions of the data
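A rough sketch of one such weight-update pass, assuming tuples of categorical value ids and a single weight vector; the full method maintains several orthonormal weight vectors (the basins), which this sketch only approximates with a simple renormalization:

```python
import numpy as np

def stirr_iteration(tuples, weights, p=1):
    """One STIRR-style weight update (sketch). `tuples` is a list of tuples of attribute-value
    ids; `weights` maps each value id to its current weight; p is the combining exponent."""
    new_w = {v: 0.0 for v in weights}
    for t in tuples:
        for v in t:
            others = [u for u in t if u != v]
            # x_t = (w(u1)^p + ... + w(uk)^p)^(1/p), combined over the other values in the tuple
            x_t = sum(weights[u] ** p for u in others) ** (1.0 / p)
            new_w[v] += x_t                      # w(v) = sum of x_t over the tuples containing v
    # renormalize the weight vector (sketch: unit norm rather than full orthonormalization)
    norm = np.sqrt(sum(x * x for x in new_w.values())) or 1.0
    return {v: x / norm for v, x in new_w.items()}
```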
ROCK [GRS 99]





Hierarchical clustering algorithm for categorical attributes
– Example: market basket customers
Use novel concept of links for merging clusters
– sim(pi, pj): similarity function that captures the closeness between pi
and pj
– pi and pj are said to be neighbors if sim(pi, pj) ≥ θ
– link(pi, pj): the number of common neighbors
At each step, merge the clusters/points with the largest number of links
– Points belonging to a single cluster will in general have a large
number of common neighbors
Random sampling used for scale up
In final labeling phase, each point on disk is assigned to cluster
with maximum neighbors
ROCK
Transactions drawn from <1, 2, 3, 4, 5>:
  {1, 2, 3} {1, 4, 5} {1, 2, 4} {2, 3, 4} {1, 2, 5}
  {2, 3, 5} {1, 3, 4} {2, 4, 5} {1, 3, 5} {3, 4, 5}

Transactions drawn from <1, 2, 6, 7>:
  {1, 2, 6} {1, 2, 7} {1, 6, 7} {2, 6, 7}

sim(T1, T2) = |T1 ∩ T2| / |T1 ∪ T2|,   neighbor threshold θ = 0.5

{1, 2, 6} and {1, 2, 7} have 5 links.
{1, 2, 3} and {1, 2, 6} have 3 links.
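A small sketch of the link computation behind this example, using the Jaccard similarity and threshold above (the brute-force pairwise layout is an assumption for illustration, not the paper's algorithm):

```python
from itertools import combinations

def jaccard(a, b):
    a, b = set(a), set(b)
    return len(a & b) / len(a | b)

def links(points, theta=0.5):
    """ROCK-style links: number of common neighbors of every pair, where two points
    are neighbors if their Jaccard similarity is at least theta."""
    n = len(points)
    nbr = [{j for j in range(n) if j != i and jaccard(points[i], points[j]) >= theta}
           for i in range(n)]
    return {(i, j): len(nbr[i] & nbr[j]) for i, j in combinations(range(n), 2)}

# The transactions from the slide: {1,2,6} and {1,2,7} share 5 neighbors here.
ts = [{1,2,3},{1,4,5},{1,2,4},{2,3,4},{1,2,5},{2,3,5},{1,3,4},{2,4,5},{1,3,5},{3,4,5},
      {1,2,6},{1,2,7},{1,6,7},{2,6,7}]
print(links(ts)[(10, 11)])   # links between {1,2,6} (index 10) and {1,2,7} (index 11) -> 5
```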
Clustering Algorithms for Numeric
Attributes
Scalable Clustering Algorithms
(From Database Community)
CLARANS
DBSCAN
BIRCH
CLIQUE
CURE
…….

Above algorithms can be used to cluster documents after
reducing their dimensionality using SVD
BIRCH [ZRL 96]



Pre-cluster data points using CF-tree
– CF-tree is similar to R-tree
– For each point
» CF-tree is traversed to find the closest cluster
» If the cluster is within epsilon distance, the point is absorbed into
the cluster
» Otherwise, the point starts a new cluster
Requires only single scan of data
Cluster summaries stored in CF-tree are given to main memory
hierarchical clustering algorithm
CURE [GRS 98]




Hierarchical algorithm for discovering arbitrary-shaped clusters
– Uses a small number of representatives per cluster
– Note:
  » Centroid-based: uses 1 point to represent a cluster => too little
    information, hyper-spherical clusters
  » MST-based: uses every point to represent a cluster => too much
    information, easily misled
Uses random sampling
Uses Partitioning
Labeling using representatives
Cluster Representatives
A Representative set of points:




Small in number : c
Distributed over the cluster
Each point in cluster is close to one representative
Distance between clusters:
smallest distance between representatives
Computing Cluster Representatives
Finding Scattered Representatives



We want to
– Distribute around the center of the cluster
– Spread well out over the cluster
– Capture the physical shape and geometry of the cluster
Use farthest point heuristic to scatter the points over the cluster
Shrink uniformly around the mean of the cluster
Computing Cluster Representatives
(Continued)
Shrinking the Representatives


Why do we need to alter the Representative Set?
– Too close to the boundary of cluster
Shrink uniformly around the mean (center) of the cluster

Document Classification
Classification



Given:
– Database of tuples (documents), each assigned a class label
Develop a model/profile for each class
– Example profile (good credit):
(25 <= age <= 40 and income > 40k) or (married = YES)
– Example profile (automobile):
Document contains a word from {car, truck, van, SUV, vehicle, scooter}
Other applications:
– Credit card approval (good, bad)
– Bank locations (good, fair, poor)
– Treatment effectiveness (good, fair, poor)
Naive Bayesian Classifier


Class c for new document d is the one for which Pr[c/d] is
maximum
Assume independent term occurrences in document
      Pr[d/c] = Π_{t ∈ d} φ(c, t)

  φ(c, t) - fraction of documents in class c that contain term t

Then, by Bayes rule

      Pr[c/d] = Pr[d/c] · Pr[c] / Σ_{c'} Pr[d/c'] · Pr[c']
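A minimal sketch of this classifier; the add-one smoothing for unseen terms is an assumption for robustness, not part of the slide:

```python
from collections import defaultdict

def train(docs_by_class):
    """Estimate phi(c, t), the fraction of documents in class c containing term t,
    with add-one smoothing (an assumption), plus the class priors Pr[c]."""
    total = sum(len(docs) for docs in docs_by_class.values())
    model = {}
    for c, docs in docs_by_class.items():
        counts = defaultdict(int)
        for d in docs:
            for t in set(d):
                counts[t] += 1
        phi = {t: (n + 1) / (len(docs) + 2) for t, n in counts.items()}
        model[c] = (len(docs) / total, phi, 1.0 / (len(docs) + 2))   # (prior, phi, unseen-term default)
    return model

def classify(d, model):
    """Class for document d is the one maximizing Pr[c] * prod_{t in d} phi(c, t) (Bayes rule)."""
    def score(item):
        prior, phi, default = item[1]
        s = prior
        for t in set(d):
            s *= phi.get(t, default)
        return s
    return max(model.items(), key=score)[0]
```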
Hierarchical Classifier (TAPER)
[CDA 97]

Class of new document d is leaf node c such that Pr[c/d] is
maximum
[Topic hierarchy figure: a path from the root c1 through nodes c_{i-1}, c_i down to a leaf c, where each edge is labeled with the cost -log Pr[c_i / c_{i-1}, d]]

  Pr[c1/d] = 1,   Pr[ci/d] = Pr[c_{i-1}/d] · Pr[ci/c_{i-1}, d]

  Pr[ci/c_{i-1}, d] can be computed using Bayes rule

The problem of computing c reduces to finding the leaf node c with the
least-cost path from the root c1 to c
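A rough sketch of the resulting least-cost path search over a topic tree; the tree-of-dicts layout and the cond_prob helper are assumptions for illustration, not TAPER's actual interfaces:

```python
import math

def best_leaf(root, doc, cond_prob):
    """Find the leaf c maximizing Pr[c/d] by minimizing the summed edge costs
    -log Pr[ci / c_{i-1}, d] from the root. `root` is a tree of dicts
    {"name": ..., "children": [...]}; cond_prob(child, parent, doc) returns
    Pr[child / parent, d]."""
    best = (math.inf, None)
    def walk(node, cost):
        nonlocal best
        if not node["children"]:                      # reached a leaf: record it if cheaper
            if cost < best[0]:
                best = (cost, node["name"])
            return
        for child in node["children"]:
            walk(child, cost - math.log(cond_prob(child, node, doc)))
        # (a real implementation would prune subtrees whose partial cost already exceeds the best leaf)
    walk(root, 0.0)
    return best[1]
```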
k-Nearest Neighbor Classifier



Assign to a point the label for majority of the k-nearest neighbors
For k=1, error rate never worse than twice the Bayes rate (unlimited
number of samples)
Scalability issues
– Use index to find k-nearest neighbors
– R-tree family works well up to 20 dimensions
– Pyramid tree for high-dimensional data
– Use SVD to reduce dimensionality of data set
– Use clusters to reduce the dataset size
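A brute-force sketch of the basic rule (a real system would use one of the indexing or dimensionality-reduction options listed above):

```python
import numpy as np
from collections import Counter

def knn_classify(query, points, labels, k=5):
    """Assign the majority label among the k nearest neighbors of `query`.
    points: (n, d) NumPy array; labels: sequence of n class labels."""
    dists = np.linalg.norm(points - query, axis=1)   # Euclidean distance to every point
    nearest = np.argsort(dists)[:k]                  # indices of the k closest points
    return Counter(labels[i] for i in nearest).most_common(1)[0][0]
```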
Decision Trees
Credit Analysis

  salary   education       label
  10000    high school     reject
  40000    undergraduate   accept
  15000    undergraduate   reject
  75000    graduate        accept
  18000    graduate        accept

Decision tree:
  salary < 20000?
    no  → accept
    yes → education in {graduate}?
            yes → accept
            no  → reject
Decision Tree Algorithms


Building phase
– Recursively split nodes using best splitting attribute for node
Pruning phase
– Smaller imperfect decision tree generally achieves better accuracy
– Prune leaf nodes recursively to prevent over-fitting
Decision Tree Algorithms

Classifiers from machine learning community:
– ID3
– C4.5
– CART

Classifiers for large databases:
– SLIQ, SPRINT
– PUBLIC
– SONAR
– Rainforest, BOAT
Decision Trees


Pros
– Fast execution time
– Generated rules are easy to interpret by humans
– Scale well for large data sets
– Can handle high dimensional data
Cons
– Cannot capture correlations among attributes
– Consider only axis-parallel cuts
Feature Selection

Choose a collection of keywords that help discriminate
between two or more sets of documents
– Fewer keywords help to speed up classification
– Improves classification accuracy by eliminating noise from
documents

Fisher’s discriminant (ratio of between-class to within-class scatter)

    Fisher(t) = ( μ(c1, t) - μ(c2, t) )²  /  Σ_{c ∈ {c1, c2}} (1/|c|) Σ_{d ∈ c} ( x(d, t) - μ(c, t) )²

    where μ(c, t) = (1/|c|) Σ_{d ∈ c} x(d, t),  and x(d, t) = 1 if d contains t (0 otherwise)
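A small sketch of computing this score for one candidate term, assuming documents are represented as sets of terms; the epsilon guard against division by zero is an assumption, not part of the formula:

```python
import numpy as np

def fisher_score(term, class1_docs, class2_docs):
    """Fisher discriminant for a term: squared difference of the per-class occurrence
    fractions divided by the summed within-class scatter."""
    def stats(docs):
        x = np.array([1.0 if term in d else 0.0 for d in docs])
        mu = x.mean()                               # mu(c, t): fraction of documents containing the term
        return mu, ((x - mu) ** 2).mean()           # within-class scatter (1/|c|) sum (x - mu)^2
    mu1, s1 = stats(class1_docs)
    mu2, s2 = stats(class2_docs)
    return (mu1 - mu2) ** 2 / (s1 + s2 + 1e-12)

# Rank candidate keywords by fisher_score and keep the top few as features.
```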
Exploiting Hyperlink Structure
HITS (Hyperlink-Induced Topic
Search) [Kle 98]


HITS uses hyperlink structure to identify authoritative Web
sources for broad-topic information discovery
Premise: Sufficiently broad topics contain communities
consisting of two types of hyperlinked pages:
– Authorities: highly-referenced pages on a topic
– Hubs: pages that “point” to authorities
– A good authority is pointed to by many good hubs; a good hub
points to many good authorities
[Figure: hub pages on the left linking to authority pages on the right]
HITS - Discovering Web
Communities

Discovering the community for a specific topic/query
involves the following steps
– Collect seed set of pages S (returned by search engine)
– Expand seed set to contain pages that point to or are pointed to by
pages in seed set
– Iteratively update hub weight h(p) and authority weight a(p) for each
page:
      a(p) = Σ_{q → p} h(q)          h(p) = Σ_{p → q} a(q)
– After a fixed number of iterations, pages with highest hub/authority
weights form core of community

Extensions proposed in Clever
– Assign links different weights based on relevance of link anchor text
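A compact sketch of the iterative hub/authority update above, assuming the expanded page set is given as a 0/1 adjacency matrix:

```python
import numpy as np

def hits(adj, iters=20):
    """Iteratively update hub and authority weights on a link matrix
    (adj[p, q] = 1 if page p links to page q), normalizing each round."""
    n = adj.shape[0]
    hubs, auths = np.ones(n), np.ones(n)
    for _ in range(iters):
        auths = adj.T @ hubs                 # a(p) = sum of h(q) over pages q pointing to p
        hubs = adj @ auths                   # h(p) = sum of a(q) over pages q that p points to
        auths /= np.linalg.norm(auths) or 1.0
        hubs /= np.linalg.norm(hubs) or 1.0
    return hubs, auths                       # highest-weight pages form the core of the community
```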
Google [BP 98]


Search engine that uses link structure to calculate a quality
ranking (PageRank) for each page
PageRank

      PageRank(v) = p + (1 - p) · Σ_{u → v} PageRank(u) / OutDegree(u)
– Can be calculated using a simple iterative algorithm, and
corresponds to principal eigenvector of the normalized link matrix
– Intuition: PageRank is the probability that a “random surfer” visits a
page
» Parameter p is probability that the surfer gets bored and starts on a new
random page
» (1-p) is the probability that the random surfer follows a link on current
page
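A simple iterative sketch of this computation on a 0/1 link matrix; the handling of pages with no out-links is an assumption for illustration:

```python
import numpy as np

def pagerank(adj, p=0.15, iters=50):
    """Iterative PageRank on a link matrix (adj[u, v] = 1 if u links to v):
    PR(v) = p + (1 - p) * sum over in-links u of PR(u) / OutDegree(u)."""
    n = adj.shape[0]
    out_deg = adj.sum(axis=1)
    pr = np.ones(n)
    for _ in range(iters):
        # contribution of each page: its rank spread over its out-links (0 if it has none)
        contrib = np.where(out_deg > 0, pr / np.maximum(out_deg, 1), 0.0)
        pr = p + (1 - p) * (adj.T @ contrib)   # each page v collects contributions from pages linking to it
    return pr
```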
Google - Features

In addition to PageRank, in order to improve search Google
also weighs keyword matches
– Anchor text
» Provide more accurate descriptions of Web pages
» Anchors exist for un-indexable documents (e.g., images)
– Font sizes of words in text
» Words in larger or bolder font are assigned higher weights

Google v/s HITS
– Google: PageRanks computed initially for Web Pages independent
of search query
– HITS: Hub and authority weights computed for different root sets in
the context of a particular search query
Trawling the Web for Emerging
Communities [KRR 98]

Co-citation: pages that are related are frequently referenced
together

Web communities are characterized by dense directed
bipartite subgraphs
[Figure: a bipartite core, i.e., a set of source pages that all link to the same set of destination pages]

Computing (i, j) bipartite cores
– Sort the edge list by source id and detect all source pages s with out-degree j
(let D be the set of destination pages that s points to)
– Compute the intersection S of the sets of source pages pointing to the
destination pages in D (using an index on destination id to generate each
source set)
– Output the bipartite core (S, D)
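A naive in-memory sketch of this enumeration for small edge lists; the paper's sorting, index-based generation, and iterative pruning are not reproduced here:

```python
from collections import defaultdict

def bipartite_cores(edges, i, j):
    """Enumerate candidate (i, j) bipartite cores from (source, dest) edges: for every
    source s with out-degree exactly j, intersect the source sets of its j destinations
    and report the core if at least i sources link to all of them."""
    out = defaultdict(set)
    sources_of = defaultdict(set)
    for s, d in edges:
        out[s].add(d)
        sources_of[d].add(s)
    cores = set()
    for s, dests in out.items():
        if len(dests) == j:
            S = set.intersection(*(sources_of[d] for d in dests))   # sources pointing to all of D
            if len(S) >= i:
                cores.add((frozenset(S), frozenset(dests)))
    return cores
```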
Using Hyperlinks to Improve
Classification [CDI 98]

Use text from neighbors when classifying Web page
– Ineffective because referenced pages may belong to different class

Use class information from pre-classified neighbors
– Choose class ci for which Pr(ci/Ni) is maximum (Ni is class labels of
all the neighboring documents)
– By Bayes rule, we choose ci to maximize Pr(Ni/ci) Pr(ci)
– Assuming independence of neighbor classes,
      Pr(Ni/ci) = Π_{j → i} Pr(cj / ci, j → i) · Π_{i → k} Pr(ck / ci, i → k)
Collaborative Search
SearchLight

Key Idea: Improve search by sharing information on URLs
visited by members of a community during search

Based on the concept of “search sessions”
– A search session is the search engine query (collection of
keywords) and the URLs visited in response to the query
– Possible to extract search sessions from the proxy logs

SearchLight maintains a database of (query, target URL)
pairs
– Target URL is heuristically chosen to be last URL in search session
for the query

In response to a search query, SearchLight displays URLs
from its database for the specified query
Image Retrieval
Similar Images



Given:
– A set of images
Find:
– All images similar to a given image
– All pairs of similar images
Sample applications:
– Medical diagnosis
– Weather prediction
– Web search engine for images
– E-commerce
Similar Image Retrieval Systems


QBIC, Virage, Photobook
Compute feature signature for each image
– QBIC uses color histograms
– WBIIS, WALRUS use wavelets

Use spatial index to retrieve database image whose signature is
closest to the query’s signature

QBIC drawbacks
– Computes single signature for entire image
– Thus, fails when images contain similar objects, but at different
locations or in varying sizes
– Color histograms cannot capture shape, texture and location
information (wavelets can!)
WALRUS Similarity Model [NRS 99]



WALRUS decomposes an image into regions
A single signature is stored for each region
Two images are considered to be similar if they have enough
similar region pairs
WALRUS (Step 1)

Generation of Signatures for Sliding Windows
– Each image is broken into sliding windows
– For the signature of each sliding window, use the s² coefficients from the
lowest frequency band of the Haar wavelet
– Naive algorithm: O(N · s_max²)
– Dynamic programming algorithm: O(N · S · log s_max)
   where N is the number of pixels in the image, s_max is the maximum window size,
   and S = Σ_{s ≤ s_max} s²
WALRUS (Step 2)

Clustering Sliding Windows
– Cluster the windows in the image using pre-clustering phase of
BIRCH
– Each cluster defines a region in the image.
– For each cluster, the centroid is used as a signature. (c.f. bounding
box)
WALRUS - Retrieval Results
[Figure: a query image and the similar images retrieved by WALRUS]
Automated Schema Extraction for XML Data:
The XTRACT System
XML Primer I



Standard for data representation and data exchange
– Unified, self-describing format for publishing/exchanging management
data across heterogeneous network/NM platforms
Looks like HTML but it isn’t
Collection of elements
– Atomic (raw character data)
– Composite (sequence of nested sub-elements)

Example
<book>
  <title>A Relational Model for Large Shared Data Banks</title>
  <author>
    <name>E.F. Codd</name>
    <affiliation>IBM Research</affiliation>
  </author>
</book>
XML Primer II

XML documents can be accompanied by Document Type
Descriptors (DTDs)

DTDs serve the role of the schema of the document

Specify a regular expression for every element

Example
<!ELEMENT book (title, author*)>
<!ELEMENT title (#PCDATA)>
<!ELEMENT author (name, affiliation)>
<!ELEMENT name (#PCDATA)>
<!ELEMENT affiliation (#PCDATA)>
The XTRACT System [GGR 00]

DTDs are of great practical importance
– Efficient storage of XML data collections
– Formulation and optimization of XML queries

However, DTDs are not mandatory => XML data may not be
accompanied by a DTD
– Automatically-generated XML documents (e.g., from relational databases or
flat files)
– DTD standards for many communities are still evolving

Goal of the XTRACT system:
Automated inference of DTDs from XML-document collections
Problem Formulation

Element types alphabet 

Infer DTD for each element type separately

Example sequences: instances of nested sub-elements
Only one level down in the hierarchy

Problem statement:
– Given a set of example sequences for element e
– Infer a “good” regular expression for e

Hard problem!!
– DTDs can comprise general, complex regular expressions
– Hard to quantify the notion of “goodness” for regular expressions
Example XML Documents
<book>
  <title></title>
  <author>
    <name></name>
    <affiliation></affiliation>
  </author>
</book>

<paper>
  <title></title>
  <author>
    <name></name>
    <affiliation></affiliation>
  </author>
  <conference></conference>
  <year></year>
</paper>

<book>
  <title></title>
  <author>
    <name></name>
    <address></address>
  </author>
  <author>
    <name></name>
    <address></address>
  </author>
  <editor></editor>
</book>
Example (Continued)
Simplified example sequences
<book>   :- <title><author>,
             <title><author><author><editor>
<paper>  :- <title><author><conference><year>
<author> :- <name><affiliation>,
             <name><affiliation>,
             <name><address>,
             <name><address>
Desirable solution:
<!ELEMENT book (title, author*, editor?)>
<!ELEMENT paper (title, author, conference, year)>
<!ELEMENT author (name, affiliation?, address?)>
DTD Inference Requirements

Requirements for a good DTD:
– Generalizes to intuitively correct but previously unseen examples
– It should be concise (i.e., small in size)
– It should be precise (i.e., not cover too many sequences not
contained in the set of examples)

Example: Consider the case
p :- ta, taa, taaa, ta, taaaa
  Candidate DTD              Concise    Precise
  ta | taa | taaa | taaaa    no         yes
  (t|a)*                     yes        no
  ta*                        yes        somewhat
The XTRACT Approach: MDL Principle

Minimum Description Length (MDL) quantifies and resolves the
tradeoff between DTD conciseness and preciseness

MDL principle: The best theory to infer from a set of data is the
one which minimizes the sum of
(A) the length of the theory, in bits, plus
(B) the length of the data, in bits, when encoded with the help of
the theory.

Part (A) captures conciseness, and
Part (B) captures preciseness
Overview of the XTRACT System
XTRACT consists of 3 subsystems
Input Sequences
I = { ab, abab, ac, ad, bc, bd, bbd, bbbe }
Generalization
subsystem
SG= I U { (ab)*, (a|b)*, b*d, b*e }
Factoring
subsystem
SF = SG U { (a|b)(c|d), b*(d|e) }
MDL subsystem
Inferred DTD: (ab)*| (a|b)(c|d) | b*(d|e)
MDL Subsystem

MDL principle: Minimize the sum of
– Theory description length, plus
– Data description length given the theory

In order to use MDL, need to:
– Define theory description length (candidate DTD)
– Define data description length (input sequences) given the theory
(candidate DTD)
– Solve the resulting minimization problem
MDL Subsystem - Encoding Scheme

Description Length of a DTD:
– Number of bits required to encode the DTD
– |DTD| · log |Σ ∪ {(, ), |, *}|

Description length of a sequence given a candidate DTD:
– Number of bits required to specify the sequence given DTD
– Use a sequence of encoding indices
– Encoding of a given a is the empty string ε
– Encoding of a given (a|b|c) is the index 0
– Encoding of aaa given a* is the index 3
– Example: Encoding of ababcabc given ((ab)*c)* is the sequence 2,2,1
MDL Encoding Example

Consider again the case
    p :- ta, taa, taaa, taaaa

  Candidate DTD              Data description (given the theory)    Total length (theory + data)
  ta | taa | taaa | taaaa    0, 1, 2, 3                             17 + 7 = 24
  (t|a)*                     201, 3011, 40111, 501111               6 + 21 = 27
  ta*                        1, 2, 3, 4                             3 + 7 = 10
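A toy sketch of this MDL comparison for the three candidates; the alphabet size and per-index bit costs are simplified assumptions, so the absolute bit counts differ from the slide, but the ranking (ta* wins) comes out the same:

```python
import math

ALPHABET_BITS = math.log2(6)   # log |Sigma U {(, ), |, *}| for a toy alphabet {t, a} (assumption)

def theory_bits(dtd):
    """Description length of the DTD itself: |DTD| * log |alphabet|."""
    return len(dtd) * ALPHABET_BITS

def total_bits(dtd, encodings):
    """Total MDL cost = theory bits + bits for the encoding indices of each example sequence."""
    data_bits = sum(len(e) for e in encodings) * ALPHABET_BITS   # crude: one symbol per index digit
    return theory_bits(dtd) + data_bits

candidates = {
    "ta|taa|taaa|taaaa": ["0", "1", "2", "3"],               # one alternative index per sequence
    "(t|a)*":            ["201", "3011", "40111", "501111"], # length followed by per-symbol indices
    "ta*":               ["1", "2", "3", "4"],               # repeat count of 'a' per sequence
}
best = min(candidates, key=lambda d: total_bits(d, candidates[d]))
print(best)   # "ta*" minimizes the sum, matching the slide's conclusion
```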
MDL Subsystem - Minimization
[Figure: bipartite graph linking the input sequences ta, taa, taaa, taaaa to candidate DTDs c1, c2 (ta*), c3, with edge weights w11, w12, ... giving the cost of encoding each sequence with each candidate]


Maps to the Facility Location Problem (NP-hard)
XTRACT employs fast heuristic algorithms proposed by the
Operations Research community
References






[BP 98] S. Brin and L. Page. The anatomy of a large-scale hypertextual Web
search engine. WWW7, 1998.
[CDA 97] S. Chakrabarti, B. Dom, R. Agrawal, and P. Raghavan. Scalable
feature selection, classification and signature generation for organizing large
text databases into hierarchical topic taxonomies. VLDB Journal, 1998.
[CDI 98] S. Chakrabarti, B. Dom, and P. Indyk. Enhanced hypertext
categorization using hyperlinks. ACM SIGMOD, 1998.
[CGR 00] K. Chakrabarti, M. Garofalakis, R. Rastogi, and K. Shim.
Approximate query processing using wavelets. VLDB, 2000.
[DDF 90] S. Deerwester, S. T. Dumais, G. W. Furnas, T. K. Landauer, and R.
Harshman. Indexing by latent semantic analysis. Journal of the American
Society for Information Science, 41(6), 1990.
[GGR 00] M. Garofalakis, A. Gionis, R. Rastogi, S. Seshadri, and K. Shim.
XTRACT: A system for extracting document type descriptors from XML
documents. ACM SIGMOD, 2000.
References (Continued)







[GKR 98] D. Gibson, J. Kleinberg, and P. Raghavan. Clustering categorical
data: An approach based on dynamical systems. VLDB, 1998.
[GRS 98] S. Guha, R. Rastogi, and K. Shim. CURE: An efficient clustering
algorithm for large databases. ACM SIGMOD, 1998.
[GRS 99] S. Guha, R. Rastogi, and K. Shim. ROCK: A robust clustering
algorithm for categorical attributes. ICDE, 1999.
[HKK 97] E. Han, G. Karypis, V. Kumar, and B. Mobasher. Clustering based on
association rule hypergraphs. DMKD Workshop, 1997.
[Kle 98] J. Kleinberg. Authoritative sources in a hyperlinked environment.
SODA, 1998.
[KRR 98] R. Kumar, P. Raghavan, S. Rajagopalan, and A. Tomkins. Trawling
the Web for emerging cyber-communities. WWW8, 1999.
[NRS 99] A. Natsev, R. Rastogi, and K. Shim. WALRUS: A similarity retrieval
algorithm for image databases. ACM SIGMOD, 1999.
[ZRL 96] T. Zhang, R. Ramakrishnan, and M. Livny. BIRCH: An efficient data
clustering method for very large databases. ACM SIGMOD, 1996.