Data Mining Meets the Internet: Techniques for Web Information Retrieval
Rajeev Rastogi
Internet Management Research
Bell Laboratories, Murray Hill

The Web
Over 1 billion HTML pages, 15 terabytes
Wealth of information
– Bookstores, restaurants, travel, malls, dictionaries, news, stock quotes, yellow & white pages, maps, markets, ...
– Diverse media types: text, images, audio, video
– Heterogeneous formats: HTML, XML, postscript, pdf, JPEG, MPEG, MP3
Highly dynamic
– 1 million new pages each day
– Average page changes in a few weeks
Graph structure with links between pages
– Average page has 7-10 links
Hundreds of millions of queries per day

Why is Web Information Retrieval Important?
According to most predictions, the majority of human information will be available on the Web in ten years
Effective information retrieval can aid in
– Research: find all papers that use the primal-dual method to solve the facility location problem
– Health/Medicine: what could be the reason for symptoms of "yellow eyes", high fever and frequent vomiting
– Travel: find information on the tropical island of St. Lucia
– Business: find companies that manufacture digital signal processors (DSPs)
– Entertainment: find all movies starring Marilyn Monroe between 1960 and 1970
– Arts: find all short stories written by Jhumpa Lahiri

Web Information Retrieval Model
(Figure: a crawler fetches documents into a repository on a storage server; an indexer builds an inverted index over the repository; clustering and classification organize the documents into a topic hierarchy (Root → News, Science, Business → Animals, Computers, Automobiles); the Web server answers search queries using the index and the hierarchy. The example documents "The jaguar has a 4-liter engine" and "The jaguar, a cat, can run at speeds reaching 50 mph" both contain the term "jaguar" but belong under Automobiles and Animals respectively.)
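To make the inverted-index component of the retrieval model above concrete, here is a minimal Python sketch. It is purely illustrative: the two toy documents are the "jaguar" examples from the figure, and the tokenizer and boolean-AND query are simplifications, not how any production engine works.

```python
from collections import defaultdict

# Toy document collection: the two "jaguar" examples from the figure above.
docs = {
    1: "The jaguar has a 4 liter engine",
    2: "The jaguar, a cat, can run at speeds reaching 50 mph",
}

def tokenize(text):
    return [w.strip(".,").lower() for w in text.split()]

# Build the inverted index: term -> set of doc ids containing the term.
index = defaultdict(set)
for doc_id, text in docs.items():
    for term in tokenize(text):
        index[term].add(doc_id)

def search(query):
    """Return ids of documents containing every query term (boolean AND)."""
    terms = tokenize(query)
    postings = [index.get(t, set()) for t in terms]
    return set.intersection(*postings) if postings else set()

print(search("jaguar engine"))   # {1}
print(search("jaguar"))          # {1, 2} -- both senses of 'jaguar' come back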
Why is Web Information Retrieval Difficult?
The abundance problem (99% of the information is of no interest to 99% of the people)
– Hundreds of irrelevant documents returned in response to a search query
Limited coverage of the Web (Internet sources hidden behind search interfaces)
– Largest crawlers cover less than 18% of Web pages
The Web is extremely dynamic
– 1 million pages added each day
Very high dimensionality (thousands of dimensions)
Limited query interface based on keyword-oriented search
Limited customization to individual users

How can Data Mining Improve Web Information Retrieval?
Latent Semantic Indexing (LSI)
– SVD-based method to improve precision and recall
Document clustering to generate topic hierarchies
– Hypergraph partitioning, STIRR, ROCK
Document classification to assign topics to new documents
– Naive Bayes, TAPER
Exploiting hyperlink structure to locate authoritative Web pages
– HITS, Google, Web trawling
Collaborative searching
– SearchLight
Image retrieval
– QBIC, Virage, Photobook, WBIIS, WALRUS

Latent Semantic Indexing

Problems with the Inverted Index Approach
Synonymy
– Many ways to refer to the same object
Polysemy
– Most words have more than one distinct meaning
(Figure: a term-by-document matrix over terms such as animal, automobile, engine, jaguar, car, porsche, speed; "car"/"automobile" referring to the same concept illustrates synonymy, while "jaguar" occurring in both an animal document and automobile documents illustrates polysemy.)

LSI - Key Idea [DDF 90]
Apply the SVD to the t x d term-by-document matrix X:
X = T0 S0 D0'
where T0 is t x m, S0 is m x m, and D0' is m x d; T0 and D0 have orthonormal columns and S0 is diagonal.
Ignoring the very small singular values in S0 (keeping only the k largest values) gives
X^ = T S D'   (T is t x k, S is k x k, D' is k x d)
The new matrix X^ of rank k is the closest rank-k matrix to X in the least-squares sense.

Comparing Documents and Queries
Comparing two documents
– Essentially the dot product of two column vectors of X^
– X^' X^ = D S^2 D', so the rows of the D S matrix can be taken as coordinates for the documents, and dot products computed in this space
Finding documents similar to a query q with term vector Xq
– Derive a representation Dq for the query: Dq = Xq' T S^-1
– The dot product of Dq S with the appropriate row of the D S matrix yields the similarity between the query and a specific document

LSI - Benefits
Reduces the dimensionality of documents
– From tens of thousands (one dimension per keyword) to a few hundred
Decreases the storage overhead of index structures
Speeds up retrieval of documents similar to a query
Makes search less "brittle"
– Captures the semantics of documents
– Addresses the problems of synonymy and polysemy
– Transforms the document space from discrete to continuous
– Improves both search precision and recall
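A small numpy sketch of the LSI construction just described, assuming a toy dense term-by-document matrix: it truncates the SVD to rank k and folds a query in with Dq = Xq' T S^-1, following the formulas above. The vocabulary and counts are made up for illustration.

```python
import numpy as np

# Toy term-by-document matrix X (t x d): rows = terms, columns = documents.
terms = ["jaguar", "engine", "car", "cat", "speed"]
X = np.array([
    [1, 1, 0],   # jaguar
    [1, 0, 1],   # engine
    [0, 1, 1],   # car
    [1, 0, 0],   # cat
    [1, 1, 0],   # speed
], dtype=float)

# X = T0 S0 D0' (full SVD); keep only the k largest singular values.
T0, s0, D0t = np.linalg.svd(X, full_matrices=False)
k = 2
T, S, Dt = T0[:, :k], np.diag(s0[:k]), D0t[:k, :]

# Rows of (D S) are the k-dimensional document coordinates.
doc_coords = Dt.T @ S

def query_coords(query_terms):
    """Fold a query into the LSI space: Dq = Xq' T S^-1."""
    xq = np.array([1.0 if t in query_terms else 0.0 for t in terms])
    return xq @ T @ np.linalg.inv(S)

dq = query_coords({"jaguar", "cat"}) @ S
sims = doc_coords @ dq            # dot products, as on the "comparing" slide
print(np.argsort(-sims))          # documents ranked by similarity to the query
```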
Document Clustering

Improve Search Using Topic Hierarchies
Web directories (or topic hierarchies) provide a hierarchical classification of documents (e.g., Yahoo!)
– Example: Yahoo home page → Recreation (Travel, Sports), Business (Companies, Finance, Jobs), Science, News
Searches performed in the context of a topic restrict the search to only the subset of Web pages related to that topic
Clustering can be used to generate topic hierarchies

Clustering
Given:
– Data points (documents) and the number of desired clusters k
Group the data points (documents) into k clusters
– Data points (documents) within a cluster are more similar to each other than to points in other clusters
Document similarity measure
– Each document can be represented by a vector with a 0/1 value along each word dimension
– The cosine of the angle between document vectors is a measure of their similarity, as is the (Euclidean) distance between the vectors
Other applications:
– Customer segmentation
– Market basket analysis

k-means Algorithm
Choose k initial means
Assign each point to the cluster with the closest mean
Compute the new mean for each cluster
Iterate until the k means stabilize

Agglomerative Hierarchical Clustering Algorithms
Initially each point is a distinct cluster
Repeatedly merge the closest clusters until the number of clusters becomes k
– Closest:
  d_mean(Ci, Cj) = || m_i - m_j ||   (m_i, m_j are the cluster means)
  d_min(Ci, Cj) = min { || p - q || : p in Ci, q in Cj }
– Likewise d_ave(Ci, Cj) and d_max(Ci, Cj)

Agglomerative Hierarchical Clustering Algorithms (Continued)
d_mean: centroid approach
d_min: minimum spanning tree (MST) approach
(Figure: clusters produced by (a) the centroid approach and (b) the MST approach, compared with (c) the correct clusters.)

Drawbacks of Traditional Clustering Methods
Traditional clustering methods are ineffective for clustering documents
– Cannot handle thousands of dimensions
– Cannot scale to millions of documents
The centroid-based method splits large and non-hyperspherical clusters
– Centers of subclusters can be far apart
The MST-based algorithm is sensitive to outliers and to slight changes in position
– Exhibits a chaining effect on strings of outliers
Using other similarity measures, such as the Jaccard coefficient instead of Euclidean distance, does not help

Example - Centroid Method for Clustering Documents
As the cluster size grows
– The number of dimensions appearing in the mean goes up
– Their values in the mean decrease
– Thus it becomes very difficult to distinguish two points that differ on only a few dimensions (a ripple effect)
Database: {1, 2, 3, 5}, {2, 3, 4, 5}, {1, 4}, {6}, i.e., the 0/1 vectors
(1,1,1,0,1,0)  (0,1,1,1,1,0)  (1,0,0,1,0,0)  (0,0,0,0,0,1)
Merging the first two documents gives centroid (0.5,1,1,0.5,1,0); merging the last two gives (0.5,0,0,0.5,0,0.5)
{1,4} and {6} are merged even though they have no elements in common!
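To see the ripple effect concretely, the following sketch replays agglomerative merging with the centroid distance d_mean on the four example documents above; the second merge it prints combines {1,4} and {6} even though they share no items. This is an illustrative script, not part of any of the cited systems.

```python
import numpy as np

# The four documents from the centroid example, as 0/1 vectors over items 1..6.
clusters = {
    "{1,2,3,5}": np.array([1., 1., 1., 0., 1., 0.]),
    "{2,3,4,5}": np.array([0., 1., 1., 1., 1., 0.]),
    "{1,4}":     np.array([1., 0., 0., 1., 0., 0.]),
    "{6}":       np.array([0., 0., 0., 0., 0., 1.]),
}
members = {name: [vec] for name, vec in clusters.items()}

def centroid(vecs):
    return np.mean(vecs, axis=0)

# Repeatedly merge the two clusters whose centroids are closest (d_mean).
while len(members) > 2:
    names = list(members)
    pairs = [(np.linalg.norm(centroid(members[a]) - centroid(members[b])), a, b)
             for i, a in enumerate(names) for b in names[i + 1:]]
    _, a, b = min(pairs)
    print(f"merging {a} and {b}")
    members[a + "+" + b] = members.pop(a) + members.pop(b)

print(list(members))   # second merge joins {1,4} and {6}
```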
Itemset Clustering Using Hypergraph Partitioning [HKK 97]
Build a weighted hypergraph from frequent itemsets
– Hyperedge: one for each frequent itemset
– Weight of a hyperedge: the average of the confidences of all association rules generated from the itemset
A hypergraph partitioning algorithm is used to cluster the items
– Minimize the sum of the weights of the cut hyperedges
Label customers with item clusters by scoring
Assumes that the items defining the clusters are disjoint!

STIRR - A System for Clustering Categorical Attribute Values [GKR 98]
Motivated by spectral graph partitioning, a method for clustering undirected graphs
Each distinct attribute value becomes a separate node v with weight w(v)
Node weights w(v) are updated in each iteration as follows:
– For each tuple t = {v, u1, ..., uk} containing v, compute x_t = (w(u1)^p + ... + w(uk)^p)^(1/p)
– Set w(v) = sum over such tuples t of x_t
– Update the set of weights so that it is orthonormal
Positive and negative weights in the non-principal basins tend to represent good partitions of the data

ROCK [GRS 99]
Hierarchical clustering algorithm for categorical attributes
– Example: market basket customers
Uses the novel concept of links for merging clusters
– sim(pi, pj): a similarity function that captures the closeness between pi and pj
– pi and pj are said to be neighbors if sim(pi, pj) exceeds a user-specified threshold
– link(pi, pj): the number of common neighbors of pi and pj
At each step, merge the clusters/points with the largest number of links
– Points belonging to a single cluster will in general have a large number of common neighbors
Random sampling is used for scale-up
In the final labeling phase, each point on disk is assigned to the cluster with the maximum number of neighbors

ROCK - Example
Transactions over {1, 2, 3, 4, 5}: {1,2,3} {1,4,5} {1,2,4} {2,3,4} {1,2,5} {2,3,5} {1,3,4} {2,4,5} {1,3,5} {3,4,5}
Transactions over {1, 2, 6, 7}: {1,2,6} {1,2,7} {1,6,7} {2,6,7}
sim(T1, T2) = |T1 ∩ T2| / |T1 ∪ T2|, with neighbor threshold 0.5
{1,2,6} and {1,2,7} have 5 links; {1,2,3} and {1,2,6} have 3 links

Clustering Algorithms for Numeric Attributes
Scalable clustering algorithms (from the database community)
– CLARANS, DBSCAN, BIRCH, CLIQUE, CURE, ...
The above algorithms can be used to cluster documents after reducing their dimensionality using the SVD

BIRCH [ZRL 96]
Pre-cluster data points using a CF-tree
– The CF-tree is similar to an R-tree
– For each point
  » The CF-tree is traversed to find the closest cluster
  » If the cluster is within epsilon distance, the point is absorbed into the cluster
  » Otherwise, the point starts a new cluster
Requires only a single scan of the data
The cluster summaries stored in the CF-tree are given to a main-memory hierarchical clustering algorithm

CURE [GRS 98]
Hierarchical algorithm for discovering clusters of arbitrary shape
– Uses a small number of representative points per cluster
– Note:
  » Centroid-based: uses 1 point to represent a cluster => too little information; finds only hyper-spherical clusters
  » MST-based: uses every point to represent a cluster => too much information; easily misled
Uses random sampling
Uses partitioning
Labeling using representatives

Cluster Representatives
A representative set of points:
– Small in number: c
– Distributed over the cluster
– Each point in the cluster is close to one representative
Distance between clusters: the smallest distance between representatives

Computing Cluster Representatives
Finding scattered representatives. We want to
– Distribute them around the center of the cluster
– Spread them well out over the cluster
– Capture the physical shape and geometry of the cluster
Use the farthest-point heuristic to scatter the points over the cluster
Shrink them uniformly around the mean of the cluster

Computing Cluster Representatives (Continued)
Shrinking the representatives. Why do we need to alter the representative set?
– The scattered points are too close to the boundary of the cluster
Shrink them uniformly around the mean (center) of the cluster
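A minimal sketch of the two steps just described for a single cluster: scatter c representatives with the farthest-point heuristic, then shrink them uniformly toward the cluster mean. The values of c and the shrink factor alpha below are arbitrary illustrative choices, not CURE's tuned parameters.

```python
import numpy as np

def cure_representatives(cluster, c=4, alpha=0.3):
    """Pick c well-scattered points, then shrink them toward the cluster mean."""
    mean = cluster.mean(axis=0)
    # Farthest-point heuristic: start from the point farthest from the mean,
    # then repeatedly add the point farthest from the already chosen set.
    reps = [cluster[np.linalg.norm(cluster - mean, axis=1).argmax()]]
    while len(reps) < min(c, len(cluster)):
        dists = np.min(
            [np.linalg.norm(cluster - r, axis=1) for r in reps], axis=0)
        reps.append(cluster[dists.argmax()])
    reps = np.array(reps)
    # Shrink uniformly toward the mean to pull representatives off the boundary.
    return reps + alpha * (mean - reps)

def cluster_distance(reps_a, reps_b):
    """Distance between clusters = smallest distance between representatives."""
    return min(np.linalg.norm(a - b) for a in reps_a for b in reps_b)

pts = np.array([[0., 0.], [0., 1.], [1., 0.], [1., 1.], [0.5, 0.5]])
print(cure_representatives(pts))
```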
Document Classification

Classification
Given:
– A database of tuples (documents), each assigned a class label
Develop a model/profile for each class
– Example profile (good credit): (25 <= age <= 40 and income > 40k) or (married = YES)
– Example profile (automobile): document contains a word from {car, truck, van, SUV, vehicle, scooter}
Other applications:
– Credit card approval (good, bad)
– Bank locations (good, fair, poor)
– Treatment effectiveness (good, fair, poor)

Naive Bayesian Classifier
The class c for a new document d is the one for which Pr[c|d] is maximum
Assume independent term occurrences in a document:
Pr[d|c] = product over terms t in d of theta(c, t)
– theta(c, t) is the fraction of documents in class c that contain term t
Then, by Bayes' rule,
Pr[c|d] = Pr[d|c] Pr[c] / (sum over c' of Pr[d|c'] Pr[c'])

Hierarchical Classifier (TAPER) [CDA 97]
The class of a new document d is the leaf node c of the topic hierarchy for which Pr[c|d] is maximum
Along the path c1 (root), c2, ..., c from the root to the leaf:
Pr[c1|d] = 1, and Pr[ci|d] = Pr[c_{i-1}|d] * Pr[ci | c_{i-1}, d]
Pr[ci | c_{i-1}, d] can be computed using Bayes' rule
Taking logarithms, maximizing Pr[c|d] amounts to minimizing the path cost, the sum of -log Pr[ci | c_{i-1}, d], so the problem of computing c reduces to finding the leaf node c with the least-cost path from the root c1 to c

k-Nearest Neighbor Classifier
Assign to a point the label of the majority of its k nearest neighbors
For k = 1, the error rate is never worse than twice the Bayes rate (with an unlimited number of samples)
Scalability issues
– Use an index to find the k nearest neighbors
– The R-tree family works well up to about 20 dimensions
– Pyramid tree for high-dimensional data
– Use the SVD to reduce the dimensionality of the data set
– Use clusters to reduce the data set size

Decision Trees
Credit analysis example:

salary   education       label
10000    high school     reject
40000    undergraduate   accept
15000    undergraduate   reject
75000    graduate        accept
18000    graduate        accept

Decision tree: test "salary < 20000" at the root; if no, accept; if yes, test "education in {graduate}"; if yes, accept; if no, reject.

Decision Tree Algorithms
Building phase
– Recursively split nodes using the best splitting attribute for the node
Pruning phase
– A smaller, imperfect decision tree generally achieves better accuracy
– Prune leaf nodes recursively to prevent over-fitting

Decision Tree Algorithms
Classifiers from the machine learning community:
– ID3, C4.5, CART
Classifiers for large databases:
– SLIQ, SPRINT
– PUBLIC
– SONAR
– RainForest, BOAT

Decision Trees
Pros
– Fast execution time
– Generated rules are easy for humans to interpret
– Scale well to large data sets
– Can handle high-dimensional data
Cons
– Cannot capture correlations among attributes
– Consider only axis-parallel cuts

Feature Selection
Choose a collection of keywords that help discriminate between two or more sets of documents
– Fewer keywords help to speed up classification
– Improves classification accuracy by eliminating noise from the documents
Fisher's discriminant (ratio of between-class to within-class scatter):
Fisher(t) = (mu(c1,t) - mu(c2,t))^2 / (sum over c in {c1, c2} of (1/|c|) * sum over d in c of (x(d,t) - mu(c,t))^2)
where mu(c,t) = (1/|c|) * sum over d in c of x(d,t), and x(d,t) = 1 if d contains t, 0 otherwise
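A small sketch of the Fisher score above for two classes of 0/1 document vectors; the toy documents and vocabulary are hypothetical. Note that a term which perfectly separates the two classes has zero within-class scatter and gets an infinite score under this binary formulation.

```python
import numpy as np

# 0/1 term-presence vectors for two toy classes (rows = documents).
terms = ["engine", "liter", "cat", "speed"]
c1 = np.array([[1, 1, 0, 1],     # automobile documents
               [1, 0, 0, 1],
               [1, 1, 0, 0]], dtype=float)
c2 = np.array([[0, 0, 1, 1],     # animal documents
               [0, 0, 1, 0]], dtype=float)

def fisher(t):
    """(mu(c1,t) - mu(c2,t))^2 divided by the summed within-class scatter."""
    mu1, mu2 = c1[:, t].mean(), c2[:, t].mean()
    scatter = ((c1[:, t] - mu1) ** 2).mean() + ((c2[:, t] - mu2) ** 2).mean()
    # Perfectly separating terms have zero scatter: give them an infinite score.
    return (mu1 - mu2) ** 2 / scatter if scatter > 0 else float("inf")

scores = sorted(((fisher(i), w) for i, w in enumerate(terms)), reverse=True)
for score, word in scores:
    print(f"{word:8s} {score:.2f}")   # 'engine' and 'cat' discriminate best
```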
Exploiting Hyperlink Structure

HITS (Hyperlink-Induced Topic Search) [Kle 98]
HITS uses hyperlink structure to identify authoritative Web sources for broad-topic information discovery
Premise: sufficiently broad topics contain communities consisting of two types of hyperlinked pages:
– Authorities: highly-referenced pages on a topic
– Hubs: pages that "point" to authorities
– A good authority is pointed to by many good hubs; a good hub points to many good authorities
(Figure: hub pages on one side pointing to authority pages on the other.)

HITS - Discovering Web Communities
Discovering the community for a specific topic/query involves the following steps
– Collect a seed set of pages S (returned by a search engine)
– Expand the seed set to contain pages that point to, or are pointed to by, pages in the seed set
– Iteratively update the hub weight h(p) and authority weight a(p) for each page:
  a(p) = sum over pages q with q -> p of h(q)
  h(p) = sum over pages q with p -> q of a(q)
– After a fixed number of iterations, the pages with the highest hub/authority weights form the core of the community
Extensions proposed in Clever
– Assign links different weights based on the relevance of the link's anchor text

Google [BP 98]
Search engine that uses link structure to calculate a quality ranking (PageRank) for each page
PageRank:
PageRank(u) = p + (1 - p) * sum over pages v with v -> u of PageRank(v) / OutDegree(v)
– Can be calculated using a simple iterative algorithm, and corresponds to the principal eigenvector of the normalized link matrix
– Intuition: PageRank is the probability that a "random surfer" visits a page
  » The parameter p is the probability that the surfer gets bored and starts on a new random page
  » (1 - p) is the probability that the random surfer follows a link on the current page

Google - Features
In addition to PageRank, Google also weighs keyword matches in order to improve search
– Anchor text
  » Provides more accurate descriptions of Web pages
  » Anchors exist for documents that cannot be indexed directly (e.g., images)
– Font sizes of words in the text
  » Words in a larger or bolder font are assigned higher weights
Google vs. HITS
– Google: PageRanks are computed once for all Web pages, independent of any search query
– HITS: hub and authority weights are computed for different root sets in the context of a particular search query
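A minimal power-iteration sketch of the PageRank recurrence from the Google slide above. It uses the normalized variant in which the random-jump term is p/n (so the ranks sum to 1) and handles dangling pages by spreading their rank uniformly; both are common simplifications, not necessarily the exact formulation in [BP 98]. The tiny four-page graph is made up.

```python
def pagerank(links, p=0.15, iters=50):
    """links: page -> list of pages it points to. Returns page -> PageRank."""
    pages = list(links)
    n = len(pages)
    rank = {u: 1.0 / n for u in pages}
    for _ in range(iters):
        new = {u: p / n for u in pages}          # random-jump term
        for u in pages:
            out = links[u]
            if not out:                          # dangling page: spread evenly
                for v in pages:
                    new[v] += (1 - p) * rank[u] / n
            else:
                for v in out:                    # follow-a-link term
                    new[v] += (1 - p) * rank[u] / len(out)
        rank = new
    return rank

web = {"a": ["b", "c"], "b": ["c"], "c": ["a"], "d": ["c"]}
print(sorted(pagerank(web).items(), key=lambda kv: -kv[1]))  # 'c' ranks highest
```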
Trawling the Web for Emerging Communities [KRR 98]
Co-citation: pages that are related are frequently referenced together
Web communities are characterized by dense directed bipartite subgraphs
(Figure: an (i, j) bipartite core — i source pages each pointing to the same j destination pages.)
Computing (i, j) bipartite cores
– Sort the edge list by source id and detect all source pages s with out-degree j (let D be the set of destination pages that s points to)
– Compute the intersection S of the sets of source pages pointing to the destination pages in D (using an index on destination id to generate each source set)
– Output the bipartite core (S, D)

Using Hyperlinks to Improve Classification [CDI 98]
Use the text from neighbors when classifying a Web page
– Ineffective, because referenced pages may belong to a different class
Use class information from pre-classified neighbors
– Choose the class ci for which Pr(ci|Ni) is maximum (Ni is the set of class labels of all the neighboring documents)
– By Bayes' rule, we choose ci to maximize Pr(Ni|ci) Pr(ci)
– Assuming independence of the neighbor classes,
  Pr(Ni|ci) = product over in-neighbors dj -> di of Pr(cj | ci, dj -> di) * product over out-neighbors di -> dk of Pr(ck | ci, di -> dk)

Collaborative Search

SearchLight
Key idea: improve search by sharing information on the URLs visited by members of a community during search
Based on the concept of "search sessions"
– A search session is the search engine query (a collection of keywords) and the URLs visited in response to the query
– It is possible to extract search sessions from proxy logs
SearchLight maintains a database of (query, target URL) pairs
– The target URL is heuristically chosen to be the last URL in the search session for the query
In response to a search query, SearchLight displays the URLs from its database for the specified query

Image Retrieval

Similar Images
Given:
– A set of images
Find:
– All images similar to a given image
– All pairs of similar images
Sample applications:
– Medical diagnosis
– Weather prediction
– Web search engine for images
– E-commerce

Similar Image Retrieval Systems
QBIC, Virage, Photobook
Compute a feature signature for each image
– QBIC uses color histograms
– WBIIS and WALRUS use wavelets
Use a spatial index to retrieve the database image whose signature is closest to the query's signature
QBIC drawbacks
– Computes a single signature for the entire image
– Thus, it fails when images contain similar objects, but at different locations or in varying sizes
– Color histograms cannot capture shape, texture and location information (wavelets can!)

WALRUS Similarity Model [NRS 99]
WALRUS decomposes an image into regions
A single signature is stored for each region
Two images are considered to be similar if they have enough similar region pairs

WALRUS (Step 1)
Generation of signatures for sliding windows
– Each image is broken into sliding windows
– For the signature of each sliding window, use the s^2 coefficients from the lowest frequency band of the Haar wavelet (s is the window size)
– Naive algorithm: O(N * s_max^2); dynamic programming algorithm: O(N * S * log s_max), where N is the number of pixels in the image, s_max is the maximum window size, and S is the number of window sizes

WALRUS (Step 2)
Clustering sliding windows
– Cluster the windows in the image using the pre-clustering phase of BIRCH
– Each cluster defines a region in the image
– For each cluster, the centroid is used as its signature (cf. a bounding box)

WALRUS - Retrieval Results
(Figure: a query image and the database images retrieved as similar by WALRUS.)
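For contrast with WALRUS's region signatures, here is a minimal sketch of the single, global color-histogram signature that the QBIC discussion above refers to, assuming small RGB images held as numpy uint8 arrays; the bin count and the Euclidean distance are illustrative choices, not QBIC's actual parameters.

```python
import numpy as np

def color_histogram(image, bins=4):
    """Global color-histogram signature: a (bins**3)-dimensional unit vector."""
    # Quantize each RGB channel into `bins` levels, then histogram the colors.
    quant = (image.astype(int) * bins) // 256          # values in 0..bins-1
    codes = quant[..., 0] * bins * bins + quant[..., 1] * bins + quant[..., 2]
    hist = np.bincount(codes.ravel(), minlength=bins ** 3).astype(float)
    return hist / hist.sum()

def distance(img_a, img_b):
    """Euclidean distance between the two global signatures."""
    return float(np.linalg.norm(color_histogram(img_a) - color_histogram(img_b)))

# Two toy 8x8 "images": one mostly red, one mostly blue.
red = np.zeros((8, 8, 3), dtype=np.uint8); red[..., 0] = 200
blue = np.zeros((8, 8, 3), dtype=np.uint8); blue[..., 2] = 200
print(distance(red, red), distance(red, blue))   # 0.0 versus a large distance
```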
Automated Schema Extraction for XML Data: The XTRACT System

XML Primer I
Standard for data representation and data exchange
– Unified, self-describing format for publishing/exchanging management data across heterogeneous network/NM platforms
Looks like HTML but it isn't
A collection of elements
– Atomic (raw character data)
– Composite (a sequence of nested sub-elements)
Example

<book>
  <title>A Relational Model for Large Shared Data Banks</title>
  <author>
    <name>E.F. Codd</name>
    <affiliation>IBM Research</affiliation>
  </author>
</book>

XML Primer II
XML documents can be accompanied by Document Type Descriptors (DTDs)
DTDs serve the role of the schema of the document
– They specify a regular expression for every element
Example

<!ELEMENT book (title, author*)>
<!ELEMENT title (#PCDATA)>
<!ELEMENT author (name, affiliation)>
<!ELEMENT name (#PCDATA)>
<!ELEMENT affiliation (#PCDATA)>

The XTRACT System [GGR 00]
DTDs are of great practical importance
– Efficient storage of XML data collections
– Formulation and optimization of XML queries
However, DTDs are not mandatory, so XML data may not be accompanied by a DTD
– Automatically-generated XML documents (e.g., from relational databases or flat files)
– DTD standards for many communities are still evolving
Goal of the XTRACT system: automated inference of DTDs from XML-document collections

Problem Formulation
Element types form the alphabet
Infer a DTD for each element type separately
Example sequences: instances of nested sub-elements
– Only one level down in the hierarchy
Problem statement:
– Given a set of example sequences for element e
– Infer a "good" regular expression for e
Hard problem!!
– DTDs can comprise general, complex regular expressions
– Need to quantify the notion of "goodness" for regular expressions

Example XML Documents

<book>
  <title></title>
  <author> <name></name> <affiliation></affiliation> </author>
</book>

<paper>
  <title></title>
  <author> <name></name> <affiliation></affiliation> </author>
  <conference></conference>
  <year></year>
</paper>

<book>
  <title></title>
  <author> <name></name> <address></address> </author>
  <author> <name></name> <address></address> </author>
  <editor></editor>
</book>

Example (Continued)
Simplified example sequences
<book>: <title><author>, <title><author><author><editor>
<paper>: <title><author><conference><year>
<author>: <name><affiliation>, <name><affiliation>, <name><address>, <name><address>
Desirable solution:
<!ELEMENT book (title, author*, editor?)>
<!ELEMENT paper (title, author, conference, year)>
<!ELEMENT author (name, affiliation?, address?)>

DTD Inference Requirements
Requirements for a good DTD:
– It generalizes to intuitively correct but previously unseen examples
– It should be concise (i.e., small in size)
– It should be precise (i.e., not cover too many sequences not contained in the set of examples)
Example: consider the case p :- ta, taa, taaa, ta, taaaa

Candidate DTD              Concise   Precise
ta | taa | taaa | taaaa    no        yes
(t|a)*                     yes       no
ta*                        yes       somewhat
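A small sketch of the conciseness/preciseness trade-off in the table above, writing each candidate DTD for p as a Python regular expression over the element names t and a. The "accepted strings up to length 5" count is only a crude stand-in for preciseness; the proper measure is the MDL formulation developed next.

```python
import re
from itertools import product

examples = ["ta", "taa", "taaa", "ta", "taaaa"]   # example sequences for <p>
candidates = ["ta|taa|taaa|taaaa", "(t|a)*", "ta*"]

def covers_examples(regex):
    """Does the candidate DTD accept every example sequence?"""
    return all(re.fullmatch(regex, s) for s in examples)

def language_size(regex, alphabet="ta", max_len=5):
    """Count accepted strings up to max_len: a crude proxy for (im)preciseness."""
    return sum(
        1
        for n in range(1, max_len + 1)
        for tup in product(alphabet, repeat=n)
        if re.fullmatch(regex, "".join(tup))
    )

for regex in candidates:
    print(f"{regex:20s} covers={covers_examples(regex)} "
          f"size={len(regex):2d} accepted(<=5)={language_size(regex)}")
```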
The XTRACT Approach: MDL Principle
The Minimum Description Length (MDL) principle quantifies and resolves the trade-off between DTD conciseness and preciseness
MDL principle: the best theory to infer from a set of data is the one which minimizes the sum of
(A) the length of the theory, in bits, plus
(B) the length of the data, in bits, when encoded with the help of the theory
Part (A) captures conciseness, and part (B) captures preciseness

Overview of the XTRACT System
XTRACT consists of 3 subsystems
Input sequences: I = { ab, abab, ac, ad, bc, bd, bbd, bbbe }
Generalization subsystem: SG = I ∪ { (ab)*, (a|b)*, b*d, b*e }
Factoring subsystem: SF = SG ∪ { (a|b)(c|d), b*(d|e) }
MDL subsystem
Inferred DTD: (ab)* | (a|b)(c|d) | b*(d|e)

MDL Subsystem
MDL principle: minimize the sum of
– The theory description length, plus
– The data description length given the theory
In order to use MDL, we need to:
– Define the theory description length (the candidate DTD)
– Define the data description length (the input sequences) given the theory (the candidate DTD)
– Solve the resulting minimization problem

MDL Subsystem - Encoding Scheme
Description length of a DTD:
– The number of bits required to encode the DTD
– |size of the DTD| * log |Σ ∪ {(, ), |, *}|, where Σ is the element alphabet
Description length of a sequence given a candidate DTD:
– The number of bits required to specify the sequence given the DTD
– Use a sequence of encoding indices
– The encoding of a given a is the empty string
– The encoding of a given (a|b|c) is the index 0
– The encoding of aaa given a* is the index 3
– Example: the encoding of ababcabc given ((ab)*c)* is the sequence 2, 2, 1

MDL Encoding Example
Consider again the case p :- ta, taa, taaa, taaaa

Theory                     Data description (given the theory)          Total length
ta | taa | taaa | taaaa    0, 1, 2, 3                                   17 + 7 = 24
(t|a)*                     2 0 1, 3 0 1 1, 4 0 1 1 1, 5 0 1 1 1 1       6 + 21 = 27
ta*                        1, 2, 3, 4                                   3 + 7 = 10

MDL Subsystem - Minimization
(Figure: a bipartite graph with the input sequences ta, taa, taaa, taaaa on one side and candidate DTDs c1, c2 = ta*, c3, ... on the other; the edge weights w_ij give the cost of encoding sequence i using candidate j.)
Maps to the Facility Location Problem (NP-hard)
XTRACT employs fast heuristic algorithms proposed by the Operations Research community
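A rough sketch of the description-length arithmetic behind the encoding example above. It counts DTD symbols for the theory cost and encoding indices for the data cost; these simplified counts convey the trade-off but do not reproduce the slide's exact totals, which use XTRACT's bit-level costs.

```python
# Simplified MDL comparison for element <p> with examples ta, taa, taaa, ta, taaaa.
# Theory cost = number of symbols in the DTD; data cost = number of encoding
# indices, a deliberately crude stand-in for the bit-level costs XTRACT uses.
examples = ["ta", "taa", "taaa", "ta", "taaaa"]

def data_cost(dtd, seq):
    if dtd == "ta|taa|taaa|taaaa":
        return 1                      # one index: which alternative was used
    if dtd == "(t|a)*":
        return 1 + len(seq)           # repetition count + one index per symbol
    if dtd == "ta*":
        return 1                      # repetition count for the trailing a*
    raise ValueError(dtd)

for dtd in ["ta|taa|taaa|taaaa", "(t|a)*", "ta*"]:
    theory = len(dtd)                 # symbols in the DTD itself
    data = sum(data_cost(dtd, s) for s in examples)
    print(f"{dtd:20s} theory={theory:2d} data={data:2d} total={theory + data}")
    # ta* achieves the smallest total, matching the slide's conclusion.
```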
References
[BP 98] S. Brin and L. Page. The anatomy of a large-scale hypertextual Web search engine. WWW7, 1998.
[CDA 97] S. Chakrabarti, B. Dom, R. Agrawal, and P. Raghavan. Scalable feature selection, classification and signature generation for organizing large text databases into hierarchical topic taxonomies. VLDB Journal, 1998.
[CDI 98] S. Chakrabarti, B. Dom, and P. Indyk. Enhanced hypertext categorization using hyperlinks. ACM SIGMOD, 1998.
[CGR 00] K. Chakrabarti, M. Garofalakis, R. Rastogi, and K. Shim. Approximate query processing using wavelets. VLDB, 2000.
[DDF 90] S. Deerwester, S. T. Dumais, G. W. Furnas, T. K. Landauer, and R. Harshman. Indexing by latent semantic analysis. Journal of the American Society for Information Science, 41(6), 1990.
[GGR 00] M. Garofalakis, A. Gionis, R. Rastogi, S. Seshadri, and K. Shim. XTRACT: A system for extracting document type descriptors from XML documents. ACM SIGMOD, 2000.
[GKR 98] D. Gibson, J. Kleinberg, and P. Raghavan. Clustering categorical data: An approach based on dynamical systems. VLDB, 1998.
[GRS 98] S. Guha, R. Rastogi, and K. Shim. CURE: An efficient clustering algorithm for large databases. ACM SIGMOD, 1998.
[GRS 99] S. Guha, R. Rastogi, and K. Shim. ROCK: A robust clustering algorithm for categorical attributes. ICDE, 1999.
[HKK 97] E.-H. Han, G. Karypis, V. Kumar, and B. Mobasher. Clustering based on association rule hypergraphs. DMKD Workshop, 1997.
[Kle 98] J. Kleinberg. Authoritative sources in a hyperlinked environment. SODA, 1998.
[KRR 98] R. Kumar, P. Raghavan, S. Rajagopalan, and A. Tomkins. Trawling the Web for emerging cyber-communities. WWW8, 1999.
[NRS 99] A. Natsev, R. Rastogi, and K. Shim. WALRUS: A similarity retrieval algorithm for image databases. ACM SIGMOD, 1999.
[ZRL 96] T. Zhang, R. Ramakrishnan, and M. Livny. BIRCH: An efficient data clustering method for very large databases. ACM SIGMOD, 1996.