Text & Web Mining
5/28/2016

Structured Data

So far we have focused on mining from structured data:

Attribute    Value
Outlook      Sunny
Temperature  Hot
Windy        Yes
Humidity     High
Play         Yes

Most data mining involves such data

Complex Data Types

Increased importance of complex data:
  Spatial data: includes geographic data and medical & satellite images
  Multimedia data: images, audio, & video
  Time-series data: for example, banking data and stock exchange data
  Text data: word descriptions for objects
  World-Wide-Web: highly unstructured text and multimedia data

Text Databases

Many text databases exist in practice
  News articles
  Research papers
  Books
  Digital libraries
  E-mail messages
  Web pages
Growing rapidly in size and importance

Semi-Structured Data


Text databases are often semi-structured
Example:
  Structured attribute/value pairs:
    Title
    Author
    Publication_Date
    Length
    Category
  Unstructured:
    Abstract
    Content

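As a rough illustration of the example above, a semi-structured document can be sketched as a record whose first fields are attribute/value pairs and whose last fields are free text. The class name `Document`, the field types, and all values below are invented for the sketch.

```python
from dataclasses import dataclass
from datetime import date


@dataclass
class Document:
    # Structured attribute/value pairs
    title: str
    author: str
    publication_date: date
    length: int  # assumed here to be a word count
    category: str
    # Unstructured text
    abstract: str
    content: str


# Hypothetical record, invented for illustration.
doc = Document(
    title="An Example Article",
    author="A. Author",
    publication_date=date(2016, 5, 28),
    length=1200,
    category="news",
    abstract="A short free-text summary.",
    content="The full free-text body of the article ...",
)

# Structured fields support exact queries; unstructured fields need text search.
is_recent_news = doc.category == "news" and doc.publication_date.year >= 2016
mentions_article = "article" in doc.content.lower()
print(is_recent_news, mentions_article)  # True True
```
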
Handling Text Data



Modeling semi-structured data
Information Retrieval (IR) from unstructured documents
Text mining
  Compare documents
  Rank importance & relevance
  Find patterns or trends across documents

Information Retrieval

IR locates relevant documents
  Key words
  Similar documents
IR Systems
  On-line library catalogs
  On-line document management systems

Performance Measure

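Two basic measures:

\[ \text{precision} = \frac{|\{\text{Relevant}\} \cap \{\text{Retrieved}\}|}{|\{\text{Retrieved}\}|} \]

\[ \text{recall} = \frac{|\{\text{Relevant}\} \cap \{\text{Retrieved}\}|}{|\{\text{Relevant}\}|} \]

[Figure: Venn diagram within the set of all documents, showing the relevant documents, the retrieved documents, and their overlap (relevant & retrieved)]
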
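A minimal sketch of the two measures, computed from sets of document ids. The ids and the helper name `precision_recall` are made up for the example.

```python
def precision_recall(relevant, retrieved):
    """Return (precision, recall) for two sets of document ids."""
    hits = relevant & retrieved  # documents that are both relevant and retrieved
    precision = len(hits) / len(retrieved) if retrieved else 0.0
    recall = len(hits) / len(relevant) if relevant else 0.0
    return precision, recall


# Hypothetical ids: 4 relevant documents, 5 retrieved, 3 in common.
relevant = {1, 2, 3, 4}
retrieved = {2, 3, 4, 7, 9}
precision, recall = precision_recall(relevant, retrieved)
print(f"precision = {precision:.2f}, recall = {recall:.2f}")
# precision = 0.60, recall = 0.75
```
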
Retrieval Methods

Keyword-based IR
  E.g., “data and mining”
  Synonymy problem: a document may talk about “knowledge discovery” instead
  Polysemy problem: mining can mean different things
Similarity-based IR
  Set of common keywords
  Return the degree of relevance
  Problem: what is the similarity of “data mining” and “data analysis”?

Modeling a Document


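Set of n documents and m terms
Each document is a vector $v$ in $\mathbb{R}^m$
The j-th coordinate of $v$ measures the association of the j-th term with the document. Three possible choices:

\[ v_j = \begin{cases} 1 & \text{if the } j\text{-th term occurs} \\ 0 & \text{otherwise} \end{cases} \]

\[ v_j = \begin{cases} r & \text{if the } j\text{-th term occurs} \\ 0 & \text{otherwise} \end{cases} \]

\[ v_j = \begin{cases} r/R & \text{if the } j\text{-th term occurs} \\ 0 & \text{otherwise} \end{cases} \]

Here $r$ is the number of occurrences of the j-th term in the document and $R$ is the total number of occurrences of all terms in the document.
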
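The three weightings can be sketched as follows. The function name, the toy document, and the term list are hypothetical, and R is assumed to count every token in the document, including words outside the term list.

```python
from collections import Counter


def document_vectors(tokens, terms):
    """Return the binary, count, and relative-frequency vectors of one document."""
    counts = Counter(tokens)
    total = len(tokens)  # R: occurrences of any term in the document (assumption)
    binary = [1 if counts[t] > 0 else 0 for t in terms]
    count = [counts[t] for t in terms]  # r, or 0 if the term is absent
    relative = [counts[t] / total if total else 0.0 for t in terms]
    return binary, count, relative


# Hypothetical toy document and term list.
terms = ["association", "mining", "data"]
tokens = "mining association rules from mining logs".split()
binary, count, relative = document_vectors(tokens, terms)
print(binary)    # [1, 1, 0]
print(count)     # [1, 2, 0]
print(relative)  # r / R for each term, with R = 6
```
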
Frequency Matrix
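\[
\begin{array}{c|ccc}
\text{Term/document} & d_1 & \cdots & d_n \\ \hline
t_1 & v_1^{(1)} & \cdots & v_1^{(n)} \\
\vdots & \vdots & & \vdots \\
t_m & v_m^{(1)} & \cdots & v_m^{(n)}
\end{array}
\]
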
Similarity Measures

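Cosine measure

\[ \operatorname{sim}\left(v^{(1)}, v^{(2)}\right) = \frac{v^{(1)} \cdot v^{(2)}}{\left\|v^{(1)}\right\| \, \left\|v^{(2)}\right\|} \]

The numerator is the dot product of the two vectors; the denominator is the product of their norms.
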
Example


Google search for “association mining”
Two of the documents retrieved:
  Idaho Mining Association: mining in Idaho (doc 1)
  Scalable Algorithms for Association mining (doc 2)
Using only the two terms:

\[ v^{(1)} = (5, 17), \qquad v^{(2)} = (3, 3) \]

\[ \operatorname{sim}\left(v^{(1)}, v^{(2)}\right) = \frac{(5, 17) \cdot (3, 3)}{\sqrt{5^2 + 17^2} \, \sqrt{3^2 + 3^2}} = \frac{15 + 51}{17.72 \cdot 4.24} \approx 0.88 \]

New Model

Add the term “data” to the document model

\[ v^{(1)} = (5, 17, 0), \qquad v^{(2)} = (3, 3, 3) \]

\[ \operatorname{sim}\left(v^{(1)}, v^{(2)}\right) = \frac{(5, 17, 0) \cdot (3, 3, 3)}{\sqrt{5^2 + 17^2 + 0^2} \, \sqrt{3^2 + 3^2 + 3^2}} = \frac{66}{92.1} \approx 0.72 \]

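The two worked examples can be checked with a few lines of code. This is a minimal sketch; the function name is chosen for illustration, and both vectors are assumed to be non-zero.

```python
import math


def cosine_similarity(u, v):
    """Cosine of the angle between two equal-length, non-zero term vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


two_terms = cosine_similarity([5, 17], [3, 3])
three_terms = cosine_similarity([5, 17, 0], [3, 3, 3])
print(round(two_terms, 2), round(three_terms, 2))  # 0.88 0.72
```

Adding the third term lowers the similarity because only doc 2 contains it, so the two vectors point in more different directions.
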
Frequency Matrix

\[ A = \begin{pmatrix} 5 & 3 \\ 17 & 3 \\ 0 & 3 \end{pmatrix} \]

The frequency matrix will quickly become large
Singular value decomposition $A = U S V^T$ can be used to reduce it

\[
U = \begin{pmatrix} 0.3049 & 0.5254 & -0.7944 \\ 0.9517 & -0.1991 & 0.2336 \\ 0.0354 & 0.8272 & 0.5607 \end{pmatrix}, \quad
S = \begin{pmatrix} 18.1232 & 0 \\ 0 & 3.5426 \\ 0 & 0 \end{pmatrix}, \quad
V = \begin{pmatrix} 0.9769 & -0.2139 \\ 0.2139 & 0.9769 \end{pmatrix}
\]

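A minimal sketch of the decomposition using NumPy's `numpy.linalg.svd`, followed by a rank-1 reduction. The choice of k = 1 is only for illustration, and the signs of the columns of U and V returned by the library may differ from those shown on the slide.

```python
import numpy as np

# Term/document frequency matrix from the example (rows: terms, columns: documents).
A = np.array([[5.0, 3.0],
              [17.0, 3.0],
              [0.0, 3.0]])

U, s, Vt = np.linalg.svd(A, full_matrices=True)
print(np.round(s, 4))  # singular values, largest first

# Reduce the model by keeping only the k largest singular values.
k = 1
A_k = U[:, :k] @ np.diag(s[:k]) @ Vt[:k, :]
print(np.round(A_k, 2))  # rank-1 approximation of A
```
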
Association Analysis


Collect sets of keywords that are frequently used together and find associations among them
Apply any association rule algorithm to a database in the format {document_id, a_set_of_keywords}

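As a sketch of the idea, the code below counts keyword pairs that occur together in at least a minimum number of documents. It is a brute-force count over pairs, not a full association rule algorithm such as Apriori, and the documents and keywords are invented.

```python
from collections import Counter
from itertools import combinations

# Hypothetical database in the format {document_id, a_set_of_keywords}.
documents = {
    1: {"data", "mining", "association"},
    2: {"data", "mining", "web"},
    3: {"data", "mining", "association", "rules"},
    4: {"web", "search"},
}


def frequent_pairs(documents, min_support):
    """Return keyword pairs that occur together in at least min_support documents."""
    counts = Counter()
    for keywords in documents.values():
        # Sort so that each pair is always counted under the same key.
        counts.update(combinations(sorted(keywords), 2))
    return {pair: n for pair, n in counts.items() if n >= min_support}


pairs = frequent_pairs(documents, min_support=2)
for pair, n in sorted(pairs.items()):
    print(pair, n)
```
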
Document Classification



Need already classified documents as training set
Induce a classification model
Any difference from before?
  A set of keywords associated with a document has no fixed set of attributes or dimensions

Association-Based Classification

Classify documents based on associated, frequently occurring text patterns
  Extract keywords and terms with IR and simple association analysis
  Create a concept hierarchy of terms
  Classify training documents into class hierarchies
  Use association mining to discover associated terms to distinguish one class from another

Remember Generalized Association Rules
Taxonomy:
  Clothes
    Outerwear
      Jackets
      Ski Pants
    Shirts
  Footwear (ancestor of shoes and hiking boots)
    Shoes
    Hiking Boots
Generalized association rule: $X \Rightarrow Y$ where no item in Y is an ancestor of an item in X

Classifiers





Let X be a set of terms
Let Anc(X) be those terms and their ancestor terms
Consider a rule $X \Rightarrow C$ and document d
If $X \subseteq \mathrm{Anc}(d)$ then $X \Rightarrow C$ covers d
A rule that covers d may be used to classify d (but only one can be used)

Procedure



Step 1: Generate all generalized association rules $X \Rightarrow C$, where X is a set of terms and C is a class, that satisfy minimum support.
Step 2: Rank the rules according to some rule ranking criterion
Step 3: Select rules from the list

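The coverage test and the rank-and-select steps can be sketched as below. The clothing taxonomy stands in for a term hierarchy; the rules, class labels, support and confidence values are invented, and ranking by confidence and then support is only one possible criterion, assumed here for illustration. Rule generation (Step 1) is not shown.

```python
# Term taxonomy as child -> parent (the clothing taxonomy, used as a stand-in).
PARENT = {
    "jackets": "outerwear",
    "ski pants": "outerwear",
    "outerwear": "clothes",
    "shirts": "clothes",
    "shoes": "footwear",
    "hiking boots": "footwear",
}


def anc(terms):
    """Anc(X): the given terms together with all of their ancestor terms."""
    result = set()
    for term in terms:
        while term is not None and term not in result:
            result.add(term)
            term = PARENT.get(term)
    return result


def covers(rule_terms, document_terms):
    """A rule X => C covers document d if X is a subset of Anc(d)."""
    return set(rule_terms) <= anc(document_terms)


def classify(document_terms, rules, default=None):
    """Classify with the highest-ranked covering rule (only one rule is used)."""
    # Assumed ranking criterion: higher confidence first, then higher support.
    ranked = sorted(rules, key=lambda r: (r[3], r[2]), reverse=True)
    for terms, label, _support, _confidence in ranked:
        if covers(terms, document_terms):
            return label
    return default


# Hypothetical rules (X, C, support, confidence), assumed to meet minimum support.
rules = [
    ({"outerwear"}, "winter sports", 0.20, 0.70),
    ({"footwear", "outerwear"}, "hiking", 0.10, 0.90),
    ({"shirts"}, "casual", 0.15, 0.60),
]
print(classify({"jackets", "hiking boots"}, rules))  # hiking
print(classify({"ski pants"}, rules))                # winter sports
```
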
Web Mining


The World Wide Web may have more opportunities for data mining than any other area
However, there are serious challenges:
  It is huge
  Complexity of Web pages is greater than that of any traditional text document collection
  It is highly dynamic
  It has a broad diversity of users
  Only a tiny portion of the information is truly useful

Search Engines → Web Mining

Current technology: search engines
  Keyword-based indices
  Too many relevant pages
  Synonymy and polysemy problems
More challenging: web mining
  Web content mining
  Web structure mining
  Web usage mining

Web Content Mining
IR View
  View of Data: unstructured, semi-structured
  Main Data: text documents, hypertext documents
  Representation: bag of words, n-grams; terms, phrases; concepts or ontology; relational
  Methods: machine learning, statistics
  Applications: categorization, clustering, finding extraction rules, finding patterns in text, user modeling
DB View
  View of Data: semi-structured, Web site as DB
  Main Data: hypertext documents
  Representation: edge-labeled graph, relational
  Methods: ILP, association rules
  Applications: finding frequent substructures, Web site schema discovery

Example: Classification of Web Documents

Assign a class to each document based on predefined topic categories
E.g., use Yahoo!’s taxonomy and associated documents for training
Keyword-based document classification
Keyword-based association analysis

Web Structure Mining
View of Data: link structure
Main Data: link structure
Representation: graph
Methods: proprietary algorithms
Applications: categorization, clustering

Authoritative Web Pages


High-quality relevant Web pages are termed authoritative
Explore linkages (hyperlinks)
  Linking to a Web page can be considered an endorsement of that page
  Those pages that are linked to frequently are considered authoritative
  (This idea has its roots in IR methods based on journal citations)

Structure via Hubs


A hub is a set of Web pages containing collections of links to authorities
There is a wide variety of hubs:
  Simple list of recommended links on a person’s home page
  Professional resource lists on commercial sites

HITS

Hyperlink-Induced Topic Search (HITS)
  Form a root set of pages using the query terms in an index-based search (200 pages)
  Expand into a base set by including all pages the root set links to (1000-5000 pages)
  Go into an iterative process to determine hubs and authorities

Calculating Weights

Authority weight

\[ a_p = \sum_{q:\, q \to p} h_q \]

(the sum runs over the pages q that point to page p)

Hub weight

\[ h_p = \sum_{q:\, p \to q} a_q \]

(the sum runs over the pages q that page p points to)

Adjacency Matrix



Let’s number the pages {1,2,…,n}
The adjacency matrix is defined by

\[ A(i, j) = \begin{cases} 1 & \text{if page } i \text{ links to page } j \\ 0 & \text{otherwise} \end{cases} \]

By writing the authority and hub weights as vectors we have

\[ h = A a, \qquad a = A^T h \]

Recursive Calculations

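We now have

\[ h = A a = A A^T h = \cdots = \left(A A^T\right)^k h \]

\[ a = A^T h = A^T A a = \cdots = \left(A^T A\right)^k a \]

By linear algebra theory, when the vectors are normalized after each step, these iterations converge to the principal eigenvectors of the two matrices $A A^T$ and $A^T A$
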
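The iteration can be sketched in a few lines of plain Python. The four-page link graph is hypothetical; the fixed iteration count and the normalization of both vectors to unit length after every round are assumptions of this sketch.

```python
import math


def normalize(vector):
    """Scale a vector to unit Euclidean length (left unchanged if it is all zeros)."""
    norm = math.sqrt(sum(x * x for x in vector))
    return [x / norm for x in vector] if norm else vector


def hits(adjacency, iterations=50):
    """Iterate a = A^T h and h = A a, normalizing both vectors each round."""
    n = len(adjacency)
    hubs = [1.0] * n
    authorities = [1.0] * n
    for _ in range(iterations):
        # a_p: sum of hub weights of the pages q that link to p
        authorities = [sum(adjacency[q][p] * hubs[q] for q in range(n)) for p in range(n)]
        # h_p: sum of authority weights of the pages q that p links to
        hubs = [sum(adjacency[p][q] * authorities[q] for q in range(n)) for p in range(n)]
        authorities = normalize(authorities)
        hubs = normalize(hubs)
    return hubs, authorities


# Hypothetical graph: adjacency[i][j] == 1 if page i links to page j.
adjacency = [
    [0, 0, 1, 1],  # page 0 links to pages 2 and 3
    [0, 0, 1, 0],  # page 1 links to page 2
    [0, 0, 0, 0],  # page 2 has no outgoing links
    [0, 0, 1, 0],  # page 3 links to page 2
]
hubs, authorities = hits(adjacency)
best_authority = max(range(len(authorities)), key=authorities.__getitem__)
best_hub = max(range(len(hubs)), key=hubs.__getitem__)
print(best_authority, best_hub)  # 2 0
```

Page 2 is linked to by every other page, so it receives the highest authority weight; page 0 links to both pages that have authority, so it receives the highest hub weight.
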
Output

The HITS algorithm finally outputs
  Short list of pages with high hub weights
  Short list of pages with high authority weights
Have not accounted for context

Applications

The Clever Project at IBM’s Almaden Labs
  Developed the HITS algorithm
Google
  Developed at Stanford
  Uses algorithms similar to HITS (PageRank)
  On-line version

Web Usage Mining
View of Data: interactivity
Main Data: server logs, browser logs
Representation: relational table, graph
Methods: machine learning, statistics, association rules
Applications: site construction, adaptation & management; marketing; user modeling

Complex Data Types Summary

Emerging areas of mining complex data types:
  Text mining can be done quite effectively, especially if the documents are semi-structured
  Web mining is more difficult due to lack of such structure
    Data includes text documents, hypertext documents, link structure, and logs
    Need to rely on unsupervised learning, sometimes followed up with supervised learning such as classification