Business Intelligence Technologies – Data Mining
Lecture 7: Link Analysis & Web Mining

Agenda
• Content (text) mining
• Link structure mining
• Web usage mining
• Case discussion

Three Forms of Web Mining
Data available on the web, and the corresponding form of web mining:
• Content data -> text mining
• Link structure -> link analysis
• Web usage data -> web usage mining

Why Text Mining?
A significant proportion of information of great potential value is stored in documents:
• News stories pertaining to competitors, customers, and the business environment at large
• Technical reports on new technology
• Email communications with customers, partners, and within the organization
• Corporate documents embodying corporate knowledge and expertise
• Legal documents – automatic reasoning

Opportunities
Finding patterns in text:
• Identify and track trends in the industry
  - What are my competitors doing?
  - What relevant products are being developed?
  - What are the potential uses of my products?
• Identify emerging themes in collections of documents
  - Customer communications: cluster messages so that each segment identifies a common theme, such as complaints about a certain problem or queries about product features.
• Automated categorization of e-mails (spam filter!), web pages, and news stories

Text Mining as the Solution
• Information retrieval: locating and ranking documents of interest, where interest is expressed via a set of keywords
• Deeper mining: document categorization

Structuring Textual Information
• Many mining methods are designed to analyze structured data.
• If documents can be represented by a set of attributes, we can use existing data mining methods.
• How to represent a document? Build a structured representation, then apply data mining methods to find patterns among documents.

Document Representation
A document representation aims to capture what the document is about.
One possible approach: each row in the table represents a document, and each attribute indicates whether or not a term appears in the document.
Example:

  Terms        Camera  Digital  Memory  Pixel
  Document 1   1       1        0       1
  Document 2   1       1        0       0
  ...          ...     ...      ...     ...

Another approach: attributes represent the frequency with which a term appears in the document.
Example (term frequency table):

  Terms        Camera  Digital  Memory  Print
  Document 1   3       2        0       1
  Document 2   0       4        0       3
  ...          ...     ...      ...     ...

But a term is mentioned more times in longer documents. Therefore, use relative frequency (% of the document): no. of occurrences / no. of words in the document.

  Terms        Camera  Digital  Memory  Print
  Document 1   0.03    0.02     0       0.01
  Document 2   0       0.004    0       0.003
  ...          ...     ...      ...     ...

The TF/IDF Document Representation
• TF/IDF: Term Frequency and Inverse Document Frequency.
• An approach for weighting terms in a document based on the term's frequency in the document and in the document corpus (used to filter out common words, e.g. "important").
• A term receives a higher weight if it is a good descriptor for a particular document, i.e., if it appears frequently in the document but is infrequent in the corpus as a whole.
• Weights are determined by: W = tf * log(N / df), where
  - tf is the term's frequency in the document,
  - df is the number of documents in the corpus that contain the term,
  - N is the number of documents in the corpus.
• The result is the same document-term table as above, with TF/IDF weights as the entries.
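To make the TF/IDF weighting concrete, here is a minimal Python sketch that computes W = tf * log(N/df) for a toy corpus. The example documents and terms are invented for illustration; a real system would also handle tokenization, stemming, and stop words.

```python
import math
from collections import Counter

# Toy corpus (hypothetical documents, for illustration only).
docs = [
    "camera digital pixel camera zoom",
    "digital memory print print",
    "camera print paper",
]

tokenized = [d.split() for d in docs]

# tf: a term's relative frequency in the document (count / document length).
tf = [
    {term: count / len(tokens) for term, count in Counter(tokens).items()}
    for tokens in tokenized
]

# df: number of documents in the corpus that contain the term.
df = Counter(term for tokens in tokenized for term in set(tokens))
N = len(docs)

# TF/IDF weight: W = tf * log(N / df).
# Terms that appear in every document get weight 0, which filters out common words.
tfidf = [
    {term: freq * math.log(N / df[term]) for term, freq in doc_tf.items()}
    for doc_tf in tf
]

for i, weights in enumerate(tfidf, start=1):
    print(f"Document {i}:", {t: round(w, 3) for t, w in sorted(weights.items())})
```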
Text Mining Application 1: Association Rules
• After proper representation, data mining techniques can be applied to text, e.g. association rules, clustering, classification.
• Keyword-based association rules: treat keywords as items.

  Document No.  Item 1     Item 2     Item 3
  100           France     Iraq       US
  101           NASDAQ     NYSE       job
  102           Iraq       US         UK
  103           Microsoft  antitrust  OS
  104           Microsoft  antitrust  windows
  ...

  OR, as a binary term-document matrix:

  Doc No.  Microsoft  antitrust  France
  100      0          0          1
  101      0          0          0
  102      0          0          0
  103      1          1          0
  104      1          1          0
  ...

Text Mining Application 2: Finding Clusters of Similar Documents
Example clusters in a collection of customer messages:
• Requests for product information
• Complaints about a recent upgrade
• Inquiries about complementary products

How to determine if two documents are similar?
In order to retrieve documents similar to a given document, we need a measure of similarity.
Euclidean distance: the Euclidean distance between X = (x1, x2, x3, ..., xn) and Y = (y1, y2, y3, ..., yn) is defined as

  D(X, Y) = sqrt[ (x1 - y1)^2 + (x2 - y2)^2 + ... + (xn - yn)^2 ]

Example:
  Document A: (PDA = 0.3, wireless = 0.02, commerce = 0)
  Document B: (0.001, 0.004, 0)
  D(A, B) = sqrt[ (0.3 - 0.001)^2 + (0.02 - 0.004)^2 + (0 - 0)^2 ] ≈ 0.30

This can be used for document clustering (k-means) and classification (kNN).

FYI: Basic Measures for Text Retrieval
Most commonly used is the cosine measure of similarity between two documents X = (x1, x2, x3, ..., xn) and Y = (y1, y2, y3, ..., yn):

  sim(X, Y) = (X . Y) / (||X|| * ||Y||)

where X . Y = x1*y1 + x2*y2 + ... + xn*yn and ||X|| = sqrt(x1^2 + x2^2 + x3^2 + ... + xn^2).

Example: the similarity between X = (3, 2, 0, 1) and Y = (1, 4, 0, 0) is

  sim(X, Y) = (3*1 + 2*4 + 0*0 + 1*0) / ( sqrt(3^2 + 2^2 + 0^2 + 1^2) * sqrt(1^2 + 4^2 + 0^2 + 0^2) ) = 11 / (sqrt(14) * sqrt(17)) ≈ 0.71

Personalized Web Ad Delivery
Objective:
• Web content is dynamic, so ad placement must be automated.
• Improve the effectiveness of Web ads.
• Customize ad delivery so that the ad corresponds to the context the user is exploring (example: Google Gmail).
Solution:
• Represent each ad as a document with a set of keywords. For example, an ad for a hybrid car is represented by the keyword set: car, electric, environment, etc.
• Then deliver the ad to viewers of pages (i.e., documents) that resemble this description.

Text Mining Application 3: Text Classification

  Doc No.  earnings  jump   miss   Class (positive vs. negative)
  100      0.03      0      0.7    Negative
  101      0.2       0.003  0.5    Negative
  102      0.04      0.01   0.02   Positive
  103      0.2       0.4    0.01   Positive
  104      0.4       0.3    0.002  Positive
  ...

Applications of Text Classification
• Business intelligence: classifying news stories about competitors, new technologies, etc.
• Email messages: email from friends vs. spam.
• Classification of Web pages, e.g., customized delivery of news stories based on what the user considers interesting (what the user has viewed): build a classifier that automatically sorts news stories into interesting and not-interesting classes.
• Personalized Web ads.

Text Mining Application 4: Information Retrieval / Search Engines
• Locating and ranking documents of interest; interest is expressed via a set of keywords.
• Which documents satisfy a query? For the query "Iraq US", the most relevant documents are those in which the terms Iraq and US are central to the content, i.e., have a high weight in the TF/IDF representation.

Basic Measures for Information Retrieval
(Venn diagram: within the set of all documents, the retrieved documents and the relevant documents overlap in the "retrieved & relevant" region.)
How do we evaluate the quality of a search engine?
• Of the retrieved documents, some are relevant while others are not.
• Not every relevant document is retrieved.
Precision: the percentage of retrieved documents that are in fact relevant to the query (i.e., "correct" responses).

  Precision = no. of documents retrieved and relevant / no. of retrieved documents

Recall: the percentage of documents that are relevant to the query and were, in fact, retrieved.

  Recall = no. of documents retrieved and relevant / no. of relevant documents
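As a rough illustration of how retrieval and these two measures fit together, the following Python sketch ranks a toy set of TF/IDF-style vectors against a query by cosine similarity and then computes precision and recall for the retrieved set. The document vectors, the similarity threshold, and the relevance judgments are all invented for the example.

```python
import math

# Hypothetical TF/IDF-style weights over the terms (Iraq, US, NASDAQ).
doc_vectors = {
    "doc100": [0.40, 0.30, 0.00],
    "doc101": [0.50, 0.00, 0.20],  # mentions the terms, but judged not relevant
    "doc102": [0.35, 0.25, 0.00],
    "doc103": [0.02, 0.00, 0.60],  # relevant, but uses different vocabulary
}
query = [1.0, 1.0, 0.0]                       # query: "Iraq US"
relevant = {"doc100", "doc102", "doc103"}     # assumed ground-truth judgments

def cosine(x, y):
    """sim(X, Y) = (X . Y) / (||X|| * ||Y||)"""
    dot = sum(a * b for a, b in zip(x, y))
    norm = math.sqrt(sum(a * a for a in x)) * math.sqrt(sum(b * b for b in y))
    return dot / norm if norm else 0.0

# Retrieve every document whose similarity to the query exceeds a threshold.
threshold = 0.5
retrieved = {d for d, v in doc_vectors.items() if cosine(query, v) > threshold}

precision = len(retrieved & relevant) / len(retrieved) if retrieved else 0.0
recall = len(retrieved & relevant) / len(relevant)
print(f"retrieved={sorted(retrieved)} precision={precision:.2f} recall={recall:.2f}")
```

With these made-up numbers the engine retrieves doc100, doc101, and doc102, so precision and recall both come out to 2/3: the irrelevant doc101 hurts precision, and the missed doc103 hurts recall.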
What We Just Described Was Used by the First Generation of Search Engines
• TF/IDF-based techniques, evaluated by precision and recall.
• The Yahoo/AltaVista generation.
• What are the problems with these initial approaches?

Link Structure Analysis: Using Link Structure to Rank the Relevancy of Web Pages
• Traditional IR methods only examine the appearance of relevant terms, and often fail to account for:
  - the quality of the information in the retrieved documents,
  - the reliability of the source.
• Among the retrieved documents, we want to rank authoritative documents higher.
• Approach: mine the Web's link structure to identify authoritative web pages.

Identify Authoritative Web Pages
• The Web consists of pages and hyperlinks, and a lot of information is in the structure of the page linkages; hyperlinks contain rich latent human information.
• When an author creates a hyperlink pointing to another page, it can be viewed as an endorsement.
• The collective endorsement of a given page by different authors can help discover authoritative pages.
• Google uses the link structure of the Web to rank documents (PageRank).

Using Hubs to Identify Authoritative Web Pages
• A hub is a page pointing to many good authorities. A hub may not be an authority itself and may have very few links pointing to it, e.g., a web page pointing to many good sources of information on business intelligence.
• Yet a link from a hub to a page is valued more than a link from a regular page.
• An authority is a page pointed to by many good hubs.
• (Diagram: hub pages link to authority pages, which are in turn pointed to by many ordinary pages.) A small iterative sketch of this idea appears after the case questions below.

Overview of Search Engine

Web Usage Mining – Data
• Site-level usage data: log files
• User-level Web usage data: panel data

Site-Level Usage Data: Web Log Files
• A Web server is a program that processes incoming HTTP requests.
• Web servers send Web pages to the clients that request them.
• Each time the server sends something out to a client, the server stores "some details of what it just did" in files called log files.
• What *can* these log files possibly contain?

An Example of a Web Log File

  sniksnak.foobar.org - - [30/Feb/1996:06:03:24 -0800] "GET /film/logos/the.movies.main.gif HTTP/1.0" 200 278

Fields: host/IP, time stamp, retrieval method, path and file retrieved, protocol, HTTP completion code, bytes.
What can we learn from Web log data?

User-Level Usage Data
• Web browsing data collected at the user level (i.e., all web sites visited by a specific user).
• Panels of users run by market research companies (Nielsen NetRatings, comScore):
  - millions of users on the panel,
  - tracking software installed on the users' computers,
  - market reports generated from the panel data, e.g. search engine market shares, "social networking sites grow 47% year over year",
  - reports and data sets are sold.
• What can we learn from user-level usage data?

Case Discussion
Google:
1. What are the differences between AdWords and AdSense in terms of techniques, revenue sources, and issues?
2. How can Google leverage its strength in other channels of advertising, e.g. print, radio, TV?
MedNet:
1. What are the pros and cons for Windham of advertising on MedNet.com and Marvel?
2. What are the pros and cons of cost-per-thousand-impressions and click-through-rate pricing?
3. Is there a win-win solution for all three players?
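As promised in the link-analysis part of this lecture, here is a minimal Python sketch of the mutual reinforcement between hubs and authorities (the computation known as HITS). The link graph is made up for illustration, and this is not Google's PageRank, only the hub/authority idea described on the earlier slides.

```python
import math

# Hypothetical tiny web graph: page -> pages it links to.
links = {
    "hub1": ["auth1", "auth2", "auth3"],
    "hub2": ["auth1", "auth2"],
    "page": ["auth1"],
    "auth1": [],
    "auth2": ["page"],
    "auth3": [],
}
pages = list(links)

# Mutual reinforcement:
#   authority(p) = sum of hub scores of the pages linking to p
#   hub(p)       = sum of authority scores of the pages p links to
hub = {p: 1.0 for p in pages}
auth = {p: 1.0 for p in pages}

for _ in range(20):  # a few iterations suffice for a graph this small
    auth = {p: sum(hub[q] for q in pages if p in links[q]) for p in pages}
    hub = {p: sum(auth[q] for q in links[p]) for p in pages}
    # Normalize so the scores do not grow without bound.
    a_norm = math.sqrt(sum(v * v for v in auth.values()))
    h_norm = math.sqrt(sum(v * v for v in hub.values()))
    auth = {p: v / a_norm for p, v in auth.items()}
    hub = {p: v / h_norm for p, v in hub.items()}

print("authorities:", sorted(auth.items(), key=lambda kv: -kv[1]))
print("hubs:       ", sorted(hub.items(), key=lambda kv: -kv[1]))
```

On this toy graph, auth1 ends up with the highest authority score because it is pointed to by the strongest hubs, and hub1 ends up with the highest hub score because it points to the strongest authorities.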