Text & Web Mining
5/28/2016

Structured Data
So far we have focused on mining from structured data:
  Attribute    Value
  Outlook      Sunny
  Temperature  Hot
  Windy        Yes
  Humidity     High
  Play         Yes
Most data mining involves such data

Complex Data Types
Increased importance of complex data:
  Spatial data: includes geographic data and medical & satellite images
  Multimedia data: images, audio, & video
  Time-series data: for example banking data and stock exchange data
  Text data: word descriptions for objects
  World-Wide-Web: highly unstructured text and multimedia data

Text Databases
Many text databases exist in practice
  News articles
  Research papers
  Books
  Digital libraries
  E-mail messages
  Web pages
Growing rapidly in size and importance

Semi-Structured Data
Text databases are often semi-structured
Example:
  Structured attribute/value pairs: Title, Author, Publication_Date, Length, Category
  Unstructured: Abstract, Content

Handling Text Data
Modeling semi-structured data
Information Retrieval (IR) from unstructured documents
Text mining
  Compare documents
  Rank importance & relevance
  Find patterns or trends across documents

Information Retrieval
IR locates relevant documents
  Key words
  Similar documents
IR Systems
  On-line library catalogs
  On-line document management systems

Performance Measure
Two basic measures
\[ \text{precision} = \frac{|\{\text{Relevant}\} \cap \{\text{Retrieved}\}|}{|\{\text{Retrieved}\}|} \qquad \text{recall} = \frac{|\{\text{Relevant}\} \cap \{\text{Retrieved}\}|}{|\{\text{Relevant}\}|} \]
[Figure: Venn diagram of all documents, showing the relevant documents, the retrieved documents, and their overlap (relevant & retrieved)]

Retrieval Methods
Keyword-based IR
  E.g., “data and mining”
  Synonymy problem: a document may talk about “knowledge discovery” instead
  Polysemy problem: mining can mean different things
Similarity-based IR
  Set of common keywords
  Return the degree of relevance
  Problem: what is the similarity of “data mining” and “data analysis”?

Modeling a Document
Set of n documents and m terms
Each document is a vector $v \in \mathbb{R}^m$
The j-th coordinate of $v$ measures the association of the j-th term. Three possible weightings:
\[ v_j = \begin{cases} 1 & \text{if the $j$-th term occurs} \\ 0 & \text{otherwise} \end{cases} \]
\[ v_j = \begin{cases} r & \text{if the $j$-th term occurs} \\ 0 & \text{otherwise} \end{cases} \]
\[ v_j = \begin{cases} r/R & \text{if the $j$-th term occurs} \\ 0 & \text{otherwise} \end{cases} \]
Here r is the number of occurrences of the j-th term and R is the number of occurrences of any term.

Frequency Matrix
\[
\begin{array}{c|ccc}
\text{Term/document} & d_1 & \cdots & d_n \\ \hline
t_1 & v_1^{(1)} & \cdots & v_1^{(n)} \\
\vdots & \vdots & & \vdots \\
t_m & v_m^{(1)} & \cdots & v_m^{(n)}
\end{array}
\]

Similarity Measures
Cosine measure
\[ \operatorname{sim}(v^{(1)}, v^{(2)}) = \frac{v^{(1)} \cdot v^{(2)}}{|v^{(1)}|\,|v^{(2)}|} \]
The numerator is the dot product; the denominator is the product of the norms of the vectors

Example
Google search for “association mining”
Two of the documents retrieved:
  Idaho Mining Association: mining in Idaho (doc 1)
  Scalable Algorithms for Association mining (doc 2)
Using only the two terms
\[ v^{(1)} = (5, 17), \qquad v^{(2)} = (3, 3) \]
\[ \operatorname{sim}(v^{(1)}, v^{(2)}) = \frac{5 \cdot 3 + 17 \cdot 3}{\sqrt{5^2 + 17^2}\,\sqrt{3^2 + 3^2}} = \frac{15 + 51}{17.7 \cdot 4.2} = 0.88 \]

New Model
Add the term “data” to the document model
\[ v^{(1)} = (5, 17, 0), \qquad v^{(2)} = (3, 3, 3) \]
\[ \operatorname{sim}(v^{(1)}, v^{(2)}) = \frac{5 \cdot 3 + 17 \cdot 3 + 0 \cdot 3}{\sqrt{5^2 + 17^2 + 0^2}\,\sqrt{3^2 + 3^2 + 3^2}} = \frac{66}{92.1} = 0.72 \]

Frequency Matrix
\[ A = \begin{pmatrix} 5 & 3 \\ 17 & 3 \\ 0 & 3 \end{pmatrix} \]
Will quickly become large
Singular value decomposition $A = USV^T$ can be used to reduce it
\[ U = \begin{pmatrix} 0.3049 & 0.5254 & 0.7944 \\ 0.9517 & -0.1991 & -0.2336 \\ 0.0354 & 0.8272 & -0.5607 \end{pmatrix} \qquad S = \begin{pmatrix} 18.1232 & 0 \\ 0 & 3.5426 \\ 0 & 0 \end{pmatrix} \qquad V = \begin{pmatrix} 0.9769 & -0.2139 \\ 0.2139 & 0.9769 \end{pmatrix} \]

Association Analysis
Collect sets of keywords frequently used together and find associations among them
Apply any association rule algorithm to a database in the format {document_id, a_set_of_keywords}

Document Classification
Need already classified documents as training set
Induce a classification model
Any difference from before?
A set of keywords associated with a document has no fixed set of attributes or dimensions

Association-Based Classification
Classify documents based on associated, frequently occurring text patterns
Extract keywords and terms with IR and simple association analysis
Create a concept hierarchy of terms
Classify training documents into class hierarchies
Use association mining to discover associated terms to distinguish one class from another

Remember Generalized Association Rules
Taxonomy:
  Clothes
    Outerwear
      Jackets
      Ski Pants
    Shirts
  Footwear (ancestor of shoes and hiking boots)
    Shoes
    Hiking Boots
Generalized association rule $X \Rightarrow Y$ where no item in Y is an ancestor of an item in X

Classifiers
Let X be a set of terms
Let Anc(X) be those terms and their ancestor terms
Consider a rule $X \Rightarrow C$ and document d
If $X \subseteq \mathrm{Anc}(d)$ then $X \Rightarrow C$ covers d
A rule that covers d may be used to classify d (but only one can be used)

Procedure
Step 1: Generate all generalized association rules $X \Rightarrow C$, where X is a set of terms and C is a class, that satisfy minimum support.
Step 2: Rank the rules according to some rule ranking criterion
Step 3: Select rules from the list

Web Mining
The World Wide Web may have more opportunities for data mining than any other area
However, there are serious challenges:
  It is too huge
  Complexity of Web pages is greater than any traditional text document collection
  It is highly dynamic
  It has a broad diversity of users
  Only a tiny portion of the information is truly useful

Search Engines and Web Mining
Current technology: search engines
  Keyword-based indices
  Too many relevant pages
  Synonymy and polysemy problems
More challenging: web mining
  Web content mining
  Web structure mining
  Web usage mining

Web Content Mining
IR View
  View of Data: Unstructured; Semi-structured
  Main Data: Text documents; Hypertext documents
  Representation: Bag of words, n-grams; Terms, phrases; Concepts or ontology; Relational
  Methods: Machine Learning; Statistics
  Applications: Categorization; Clustering; Finding extraction rules; Finding patterns in text; User modeling
DB View
  View of Data: Semi-structured; Web site as DB
  Main Data: Hypertext documents
  Representation: Edge-labeled graph; Relational
  Methods: ILP; Association rules
  Applications: Finding frequent substructures; Web site schema discovery

Example: Classification of Web Documents
Assign a class to each document based on predefined topic categories
E.g., use Yahoo!’s taxonomy and associated documents for training
Keyword-based document classification
Keyword-based association analysis

Web Structure Mining
  View of Data: Links structure
  Main Data: Links structure
  Representation: Graph
  Methods: Proprietary algorithms
  Applications: Categorization; Clustering

Authoritative Web Pages
High quality relevant Web pages are termed authoritative
Explore linkages (hyperlinks)
  Linking to a Web page can be considered an endorsement of that page
  Those pages that are linked to frequently are considered authoritative
  (This has its roots in IR methods based on journal citations)

Structure via Hubs
A hub is a set of Web pages containing collections of links to authorities
There is a wide variety of hubs:
  Simple list of recommended links on a person’s home page
  Professional resource lists on commercial sites

HITS
Hyperlink-Induced Topic Search (HITS)
  Form a root set of pages using the query terms in an index-based search (200 pages)
  Expand into a base set by including all pages the root set links to (1000-5000 pages)
  Go into an iterative process to determine hubs and authorities

Calculating Weights
Authority weight (page p is pointed to by page q)
\[ a_p = \sum_{q \,:\, q \to p} h_q \]
Hub weight (page p points to page q)
\[ h_p = \sum_{q \,:\, p \to q} a_q \]

Adjacency Matrix
Let's number the pages {1,2,…,n}
The adjacency matrix is defined by
\[ A(i, j) = \begin{cases} 1 & \text{if page $i$ links to page $j$} \\ 0 & \text{otherwise} \end{cases} \]
By writing the authority and hub weights as vectors we have
\[ h = Aa, \qquad a = A^T h \]

Recursive Calculations
We now have
\[ h = Aa = AA^T h = \dots = (AA^T)^k h \]
\[ a = A^T h = A^T A a = \dots = (A^T A)^k a \]
By linear algebra theory, with the weight vectors normalized after each step, this converges to the principal eigenvectors of the two matrices $AA^T$ and $A^T A$

Output
The HITS algorithm finally outputs
  Short list of pages with high hub weights
  Short list of pages with high authority weights
Have not accounted for context

Applications
The Clever Project at IBM’s Almaden Labs
  Developed the HITS algorithm
Google
  Developed at Stanford
  Uses algorithms similar to HITS (PageRank)
  On-line version

Web Usage Mining
  View of Data: Interactivity
  Main Data: Server logs; Browser logs
  Representation: Relational table; Graph
  Methods: Machine learning; Statistics; Association rules
  Applications: Site construction, adaptation & management; Marketing; User modeling

Complex Data Types Summary
Emerging areas of mining complex data types:
Text mining can be done quite effectively, especially if the documents are semi-structured
Web mining is more difficult due to lack of such structure
  Data includes text documents, hypertext documents, link structure, and logs
  Need to rely on unsupervised learning, sometimes followed up with supervised learning such as classification
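
Worked check: cosine measure
The cosine measure from the Similarity Measures slide can be checked numerically against the two- and three-term vectors from the Example and New Model slides. This is a minimal sketch in plain Python; the function name cosine_similarity is an illustrative choice, not something from the slides.

```python
import math


def cosine_similarity(u, v):
    """Cosine of the angle between two equal-length term vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


# Term vectors for doc 1 and doc 2, taken from the slides.
two_terms = cosine_similarity([5, 17], [3, 3])
three_terms = cosine_similarity([5, 17, 0], [3, 3, 3])

print(round(two_terms, 2), round(three_terms, 2))  # prints: 0.88 0.72
```

Adding the third term lowers the similarity because only doc 2 contains it, so the extra coordinate lengthens one vector without adding to the dot product.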
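
Worked check: precision and recall
The two measures from the Performance Measure slide reduce to set operations. The sketch below assumes documents are identified by hashable ids; the ids d1 to d5 and the function name precision_recall are made up for illustration.

```python
def precision_recall(relevant, retrieved):
    """Return (precision, recall) for two collections of document ids."""
    relevant, retrieved = set(relevant), set(retrieved)
    hits = relevant & retrieved  # relevant AND retrieved
    precision = len(hits) / len(retrieved) if retrieved else 0.0
    recall = len(hits) / len(relevant) if relevant else 0.0
    return precision, recall


# Hypothetical query: four relevant documents, three retrieved, two in common.
precision, recall = precision_recall(
    relevant={"d1", "d2", "d3", "d4"},
    retrieved={"d3", "d4", "d5"},
)
print(precision, recall)
```

Here two of the three retrieved documents are relevant (precision 2/3) and two of the four relevant documents were found (recall 1/2).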
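
Worked check: rule coverage
The coverage test from the Classifiers slide, X ⊆ Anc(d), can be sketched with the clothes taxonomy from the Generalized Association Rules slide standing in for a term hierarchy. The PARENT dictionary and the helper names are assumptions made for this illustration.

```python
# Child -> parent links of the taxonomy (assumed encoding of the slide's tree).
PARENT = {
    "Jackets": "Outerwear",
    "Ski Pants": "Outerwear",
    "Outerwear": "Clothes",
    "Shirts": "Clothes",
    "Shoes": "Footwear",
    "Hiking Boots": "Footwear",
}


def with_ancestors(terms):
    """Anc(X): the given terms together with all of their ancestors."""
    closure = set(terms)
    for term in terms:
        while term in PARENT:
            term = PARENT[term]
            closure.add(term)
    return closure


def covers(rule_terms, document_terms):
    """A rule X => C covers document d when X is a subset of Anc(d)."""
    return set(rule_terms) <= with_ancestors(document_terms)


document = {"Jackets", "Hiking Boots"}
print(covers({"Outerwear", "Footwear"}, document))  # True
print(covers({"Shirts"}, document))                 # False
```

Expanding the document rather than the rule means each document is expanded once and can then be tested against many rules.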
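
Worked check: hub and authority iteration
The updates a = A^T h and h = A a from the Adjacency Matrix and Recursive Calculations slides can be iterated directly. The sketch below is plain Python under stated assumptions: the four-page link graph is invented, the iteration count is arbitrary, and each vector is scaled to unit length after every step so the values stay bounded. It does not build the root or base set.

```python
import math


def normalize(weights):
    """Scale a weight vector to unit Euclidean length (unchanged if all zero)."""
    norm = math.sqrt(sum(w * w for w in weights))
    return [w / norm for w in weights] if norm > 0 else list(weights)


def hits(adjacency, iterations=50):
    """Iterate a = A^T h, h = A a; adjacency[i][j] is 1 if page i links to page j."""
    n = len(adjacency)
    hub = [1.0] * n
    authority = [1.0] * n
    for _ in range(iterations):
        # Authority of p: sum of hub weights of the pages q that link to p.
        authority = normalize(
            [sum(adjacency[q][p] * hub[q] for q in range(n)) for p in range(n)]
        )
        # Hub weight of p: sum of authority weights of the pages q that p links to.
        hub = normalize(
            [sum(adjacency[p][q] * authority[q] for q in range(n)) for p in range(n)]
        )
    return hub, authority


# Hypothetical graph: page 0 links to pages 2 and 3, page 1 links to page 2.
LINKS = [
    [0, 0, 1, 1],
    [0, 0, 1, 0],
    [0, 0, 0, 0],
    [0, 0, 0, 0],
]
hub, authority = hits(LINKS)
print([round(w, 3) for w in hub])
print([round(w, 3) for w in authority])
```

In this graph page 0 ends up as the strongest hub and page 2 as the strongest authority, since page 2 is the only page both hubs point to.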