Text mining refers to data mining using text documents as data.
There are many special techniques for pre-processing text documents to make them suitable for mining.
Most of these techniques are from the field of “Information Retrieval”.
Conceptually, information retrieval (IR) is the study of finding needed information. I.e., IR helps users find information that matches their information needs.
Historically, information retrieval is about document retrieval, emphasizing document as the basic unit.
Technically, IR studies the acquisition, organization, storage, retrieval, and distribution of information.
IR has become a center of focus in the Web era.
[IR process diagram: the user's information needs are translated into queries; the queries are matched against stored information (search/select); the query result is then evaluated — does the information found match the user's information needs?]
Word (token) extraction
Stop words
Stemming
Frequency counts
Many of the most frequently used words in English are worthless for IR and text mining; these words are called stop words.
e.g., the, of, and, to, …
Typically there are about 400 to 500 such words.
For an application, an additional domain-specific stop word list may be constructed.
Why do we need to remove stop words?
Reduce indexing (or data) file size: stop words account for 20-30% of total word counts.
Improve efficiency: stop words are not useful for searching or text mining, and they always produce a large number of hits.
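A minimal sketch of stop word removal; the stop word list here is an illustrative fragment, not a full 400-500 word list:

```python
# Minimal stop-word removal sketch (illustrative stop word list only).
STOP_WORDS = {"the", "of", "and", "to", "a", "in", "is", "it", "that", "for"}

def remove_stop_words(tokens):
    """Keep only tokens that are not stop words."""
    return [t for t in tokens if t.lower() not in STOP_WORDS]

tokens = "the cat sat on the mat and the dog".split()
print(remove_stop_words(tokens))   # ['cat', 'sat', 'on', 'mat', 'dog']
```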
Stemming: techniques used to find the root/stem of a word.
E.g., user, users, used, using → stem: use
engineering, engineered, engineer → stem: engineer
Usefulness:
improving effectiveness of IR and text mining by matching similar words;
reducing indexing size: combining words with the same root may reduce indexing size by as much as 40-50%.
Remove endings:
if a word ends with a consonant other than s, followed by an s, then delete the s.
if a word ends in es, drop the s.
if a word ends in ing, delete the ing unless the remaining word consists of only one letter or is "th".
if a word ends with ed, preceded by a consonant, delete the ed unless this leaves only a single letter.
…
Transform words:
if a word ends with "ies" but not "eies" or "aies", then replace "ies" with "y".
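A minimal sketch implementing just the simplified rules listed above (a real stemmer such as Porter's has many more rules):

```python
def simple_stem(word):
    """Apply only the simplified stemming rules listed above."""
    # Transform: "ies" -> "y" unless preceded by "e" or "a".
    if word.endswith("ies") and not word.endswith(("eies", "aies")):
        return word[:-3] + "y"
    # Remove "ing" unless only one letter or "th" would remain.
    if word.endswith("ing") and len(word) > 4 and word[:-3] != "th":
        return word[:-3]
    # Remove "ed" if preceded by a consonant, unless a single letter would remain.
    if word.endswith("ed") and len(word) > 3 and word[-3] not in "aeiou":
        return word[:-2]
    # Word ends in "es": drop the s.
    if word.endswith("es"):
        return word[:-1]
    # Word ends in a consonant other than s, followed by s: delete the s.
    if word.endswith("s") and len(word) > 1 and word[-2] not in "aeious":
        return word[:-1]
    return word

for w in ["users", "engineering", "studies", "engineered", "uses"]:
    print(w, "->", simple_stem(w))
```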
Count the number of times a word occurs in a document (term frequency).
Count the number of documents in a collection that contain a word (document frequency).
Occurrence frequencies indicate the relative importance of a word in a document:
if a word appears often in a document, the document likely "deals with" subjects related to the word.
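A minimal sketch of the two counts, assuming documents are already tokenized lists of words:

```python
from collections import Counter

docs = [["data", "mining", "text", "data"],
        ["text", "retrieval"],
        ["data", "retrieval", "retrieval"]]

# Term frequency: times each word occurs within one document.
tf = [Counter(doc) for doc in docs]

# Document frequency: number of documents containing each word.
df = Counter(word for doc in docs for word in set(doc))

print(tf[0]["data"])    # 2  (term frequency of "data" in the first document)
print(df["retrieval"])  # 2  (documents containing "retrieval")
```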
A document is represented as a vector: (W_1, W_2, …, W_n).
Binary:
W_i = 1 if the corresponding term i (often a word) is in the document; W_i = 0 if the term i is not in the document.
TF (Term Frequency):
W_i = tf_i, where tf_i is the number of times term i occurred in the document.
TF*IDF (Term Frequency * Inverse Document Frequency):
W_i = tf_i * idf_i = tf_i * log(N / df_i), where df_i is the number of documents containing term i and N is the total number of documents in the collection.
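A minimal sketch of TF*IDF weighting under the definition above (log base 10 is assumed here; the base varies across IR texts):

```python
import math
from collections import Counter

def tfidf_vectors(docs):
    """docs: list of tokenized documents. Returns a TF*IDF dict per document."""
    N = len(docs)
    df = Counter(word for doc in docs for word in set(doc))
    vectors = []
    for doc in docs:
        tf = Counter(doc)
        # W_i = tf_i * log(N / df_i)
        vectors.append({w: tf[w] * math.log10(N / df[w]) for w in tf})
    return vectors

docs = [["data", "mining", "text"], ["text", "retrieval"], ["data", "retrieval"]]
for v in tfidf_vectors(docs):
    print(v)
```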
Each indexing term is a dimension; an indexing term is normally a word.
Each document is a vector:
D_i = (t_{i1}, t_{i2}, t_{i3}, …, t_{in})
D_j = (t_{j1}, t_{j2}, t_{j3}, …, t_{jn})
Document similarity is defined as the cosine of the angle between the two vectors:

$$\mathrm{Similarity}(D_i, D_j) = \frac{\sum_{k=1}^{n} t_{ik}\, t_{jk}}{\sqrt{\sum_{k=1}^{n} t_{ik}^2}\;\sqrt{\sum_{k=1}^{n} t_{jk}^2}}$$
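A minimal sketch of the cosine similarity above for two equal-length term-weight vectors:

```python
import math

def cosine_similarity(d_i, d_j):
    """Cosine of the angle between two term-weight vectors of equal length."""
    dot = sum(a * b for a, b in zip(d_i, d_j))
    norm_i = math.sqrt(sum(a * a for a in d_i))
    norm_j = math.sqrt(sum(b * b for b in d_j))
    if norm_i == 0 or norm_j == 0:
        return 0.0
    return dot / (norm_i * norm_j)

print(round(cosine_similarity([1, 1, 0], [1, 1, 1]), 2))  # 0.82
```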
A query is a representation of the user's information needs.
Normally a list of words.
Query as a simple question in natural language:
the system translates the question into executable queries.
Query as a document:
"find documents similar to this one";
the system defines what similarity means.
A document space is defined by three terms:
hardware, software, users.
A set of documents is defined as:
A1=(1, 0, 0), A2=(0, 1, 0), A3=(0, 0, 1)
A4=(1, 1, 0), A5=(1, 0, 1), A6=(0, 1, 1)
A7=(1, 1, 1), A8=(1, 0, 1), A9=(0, 1, 1)
If the query is "hardware and software", what documents should be retrieved?
In Boolean query matching:
"AND": documents A4 and A7 will be retrieved.
"OR": A1, A2, A4, A5, A6, A7, A8, A9 will be retrieved.
In similarity matching (cosine):
q = (1, 1, 0)
S(q, A1)=0.71, S(q, A2)=0.71, S(q, A3)=0
S(q, A4)=1, S(q, A5)=0.5, S(q, A6)=0.5
S(q, A7)=0.82, S(q, A8)=0.5, S(q, A9)=0.5
Retrieved document set (with ranking) =
{A4, A7, A1, A2, A5, A6, A8, A9}
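A minimal sketch reproducing the cosine scores above for the example document space:

```python
import math

def cosine(q, d):
    dot = sum(a * b for a, b in zip(q, d))
    norm = math.sqrt(sum(a * a for a in q)) * math.sqrt(sum(b * b for b in d))
    return dot / norm if norm else 0.0

docs = {"A1": (1, 0, 0), "A2": (0, 1, 0), "A3": (0, 0, 1),
        "A4": (1, 1, 0), "A5": (1, 0, 1), "A6": (0, 1, 1),
        "A7": (1, 1, 1), "A8": (1, 0, 1), "A9": (0, 1, 1)}
q = (1, 1, 0)  # query "hardware and software"

scores = {name: cosine(q, vec) for name, vec in docs.items()}
ranked = [n for n in sorted(scores, key=scores.get, reverse=True) if scores[n] > 0]
print([(n, round(scores[n], 2)) for n in ranked])
# [('A4', 1.0), ('A7', 0.82), ('A1', 0.71), ('A2', 0.71),
#  ('A5', 0.5), ('A6', 0.5), ('A8', 0.5), ('A9', 0.5)]
```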
Relevance is a measurement of the outcome of a search or retrieval:
the judgment on what should or should not be retrieved.
There is no simple answer to what is relevant and what is not relevant: human users are needed.
It is difficult to define and subjective, depending on knowledge, needs, time, etc.
It is the central concept of information retrieval.
Given a query:
Are all retrieved documents relevant?
Have all the relevant documents been retrieved?
Measures for system performance:
The first question is about the precision of the search.
The second is about the completeness (recall) of the search.
                Relevant    Not relevant
Retrieved          a             b
Not retrieved      c             d

P = a / (a + b)        R = a / (a + c)
Precision measures how precise a search is: the higher the precision, the fewer unwanted documents.
Precision = (number of relevant documents retrieved) / (total number of documents retrieved)
Recall measures how complete a search is: the higher the recall, the fewer missing documents.
Recall = (number of relevant documents retrieved) / (number of all relevant documents in the database)
Theoretically, R and P do not depend on each other.
Practically,
high recall is achieved at the expense of precision;
high precision is achieved at the expense of recall.
When will P = 0? Only when none of the retrieved documents is relevant.
When will P = 1? Only when every retrieved document is relevant.
Depending on the application, you may want a higher precision or a higher recall.
[Figure: precision-recall curves for System A, System B, and System C; precision (P) on the vertical axis and recall (R) on the horizontal axis, both ranging from 0 to 1.0.]
Combining recall and precision: the F score,
F = 2PR / (R + P)
Breakeven point: when P = R.
These two measures are commonly used in text mining: classification and clustering.
Accuracy is not normally used in the text domain because the set of relevant documents is almost always very small compared to the set of irrelevant documents.
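A minimal sketch of precision, recall, and the F score computed from retrieved and relevant document sets:

```python
def precision_recall_f(retrieved, relevant):
    """Both arguments are sets of document ids."""
    hits = len(retrieved & relevant)           # a: relevant documents retrieved
    p = hits / len(retrieved) if retrieved else 0.0
    r = hits / len(relevant) if relevant else 0.0
    f = 2 * p * r / (p + r) if p + r else 0.0  # F = 2PR / (R + P)
    return p, r, f

retrieved = {"d1", "d2", "d3", "d4"}
relevant = {"d1", "d3", "d5"}
print(precision_recall_f(retrieved, relevant))  # (0.5, 0.666..., 0.571...)
```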
A Web crawler (robot) crawls the Web to collect all the pages.
Servers build a huge inverted index database and other indexing databases.
At query (search) time, search engines conduct different types of vector query matching.
The real differences among search engines are
their indexing weight schemes, their query processing methods, and their ranking algorithms.
None of these is published by any of the search engine firms.
Each doc j is a vector, one component for each term (= word).
We have a vector space:
terms are attributes (axes); the n docs live in this space.
Even with stop word removal and stemming, we may have 10,000+ dimensions, or even 1,000,000+.
Each training doc is a point (vector) labeled by its topic (= class)
Hypothesis: docs of the same topic form a contiguous region of space
Define surfaces to delineate topics in space
[Figure: training documents plotted in the vector space, with regions labeled Government, Science, and Arts delineated by surfaces.]
Given training documents, compute a prototype vector for each class.
Given a test doc, assign it to the topic whose prototype (centroid) is nearest using cosine similarity.
Construct a prototype vector c_j for each class j from the training document vectors:

$$\vec{c}_j = \frac{\alpha}{|C_j|} \sum_{\vec{d} \in C_j} \frac{\vec{d}}{\|\vec{d}\|} \;-\; \frac{\beta}{|D - C_j|} \sum_{\vec{d} \in D - C_j} \frac{\vec{d}}{\|\vec{d}\|}$$

α and β are parameters that adjust the relative impact of relevant and irrelevant training examples. Normally, α = 16 and β = 4.
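A minimal sketch of the prototype computation above using NumPy (NumPy and the toy vectors are assumptions of this sketch); alpha and beta follow the defaults on this slide:

```python
import numpy as np

def rocchio_prototype(class_docs, other_docs, alpha=16.0, beta=4.0):
    """class_docs / other_docs: 2-D arrays of document vectors (one row per doc)."""
    def mean_of_normalized(docs):
        norms = np.linalg.norm(docs, axis=1, keepdims=True)
        return (docs / norms).mean(axis=0)
    # alpha-weighted mean of in-class docs minus beta-weighted mean of out-of-class docs
    return alpha * mean_of_normalized(class_docs) - beta * mean_of_normalized(other_docs)

c_docs = np.array([[1.0, 2.0, 0.0], [2.0, 1.0, 0.0]])   # docs in class j
o_docs = np.array([[0.0, 0.0, 3.0], [0.0, 1.0, 2.0]])   # docs outside class j
print(rocchio_prototype(c_docs, o_docs))
```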
Given a set of training documents D, each document is considered an ordered list of words. Let w_{d_i,k} denote the word in position k of document d_i; each word is from the vocabulary V = <w_1, w_2, …, w_{|V|}>. Let C = {c_1, c_2, …, c_{|C|}} be the set of pre-defined classes.
There are two naïve Bayesian models:
one based on the multi-variate Bernoulli model (a word occurs or does not occur in a document), and
one based on the multinomial model (the number of word occurrences is considered).
$$P(w_t \mid c_j) = \frac{1 + \sum_{i=1}^{|D|} N(w_t, d_i)\, P(c_j \mid d_i)}{|V| + \sum_{s=1}^{|V|} \sum_{i=1}^{|D|} N(w_s, d_i)\, P(c_j \mid d_i)} \qquad (1)$$

$$P(c_j) = \frac{\sum_{i=1}^{|D|} P(c_j \mid d_i)}{|D|} \qquad (2)$$

where N(w_t, d_i) is the number of times the word w_t occurs in document d_i, and P(c_j | d_i) ∈ {0, 1} is given by the class label of the training document d_i.

$$P(c_j \mid d_i) = \frac{P(c_j) \prod_{k=1}^{|d_i|} P(w_{d_i,k} \mid c_j)}{\sum_{r=1}^{|C|} P(c_r) \prod_{k=1}^{|d_i|} P(w_{d_i,k} \mid c_r)} \qquad (3)$$
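A minimal multinomial naive Bayes sketch following equations (1)-(3); the toy documents, labels, and use of log probabilities are illustrative assumptions:

```python
import math
from collections import Counter, defaultdict

def train_multinomial_nb(labeled_docs):
    """labeled_docs: list of (tokens, class_label). Returns (priors, cond_probs, vocab)."""
    vocab = {w for tokens, _ in labeled_docs for w in tokens}
    docs_per_class = Counter(c for _, c in labeled_docs)
    word_counts = defaultdict(Counter)            # class -> word -> count
    for tokens, c in labeled_docs:
        word_counts[c].update(tokens)
    priors = {c: n / len(labeled_docs) for c, n in docs_per_class.items()}   # eq. (2)
    cond = {}
    for c in docs_per_class:
        total = sum(word_counts[c].values())
        # eq. (1): Laplace-smoothed word probabilities per class
        cond[c] = {w: (1 + word_counts[c][w]) / (len(vocab) + total) for w in vocab}
    return priors, cond, vocab

def classify(tokens, priors, cond, vocab):
    """Pick the class maximizing log P(c) + sum of log P(w|c) (eq. (3) up to normalization)."""
    def score(c):
        return math.log(priors[c]) + sum(math.log(cond[c][w]) for w in tokens if w in vocab)
    return max(priors, key=score)

docs = [(["cheap", "pills", "buy"], "spam"), (["meeting", "agenda"], "ham"),
        (["buy", "now"], "spam"), (["project", "meeting", "notes"], "ham")]
priors, cond, vocab = train_multinomial_nb(docs)
print(classify(["buy", "cheap", "now"], priors, cond, vocab))  # "spam"
```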
To classify document d into class c:
Define the k-neighborhood N as the k nearest neighbors of d.
Count the number of documents n in N that belong to c.
Estimate P(c|d) as n/k.
No training is needed (?). Classification time is linear in the training set size.
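A minimal sketch of kNN classification using cosine similarity, assuming documents are already represented as term-weight dicts (the toy training set is illustrative):

```python
import math
from collections import Counter

def cos(d1, d2):
    """Cosine similarity between two sparse term-weight dicts."""
    dot = sum(d1[w] * d2[w] for w in set(d1) & set(d2))
    n1 = math.sqrt(sum(v * v for v in d1.values()))
    n2 = math.sqrt(sum(v * v for v in d2.values()))
    return dot / (n1 * n2) if n1 and n2 else 0.0

def knn_classify(test_doc, training, k=3):
    """training: list of (term_weight_dict, class_label). Majority vote of k neighbors."""
    neighbors = sorted(training, key=lambda t: cos(test_doc, t[0]), reverse=True)[:k]
    votes = Counter(label for _, label in neighbors)
    return votes.most_common(1)[0][0]            # class with the largest n/k

training = [({"tax": 2, "law": 1}, "government"), ({"gene": 3, "cell": 1}, "science"),
            ({"vote": 1, "law": 2}, "government"), ({"atom": 2, "cell": 2}, "science")]
print(knn_classify({"law": 1, "tax": 1}, training, k=3))  # government
```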
[Figure: a test document plotted among the Government, Science, and Arts regions; what is P(science | test document)?]
Binary Classification
Consider 2 class problems
Assume linear separability for now:
in 2 dimensions, can separate by a line
in higher dimensions, need hyperplanes
Can find a separating hyperplane by linear programming (or iteratively, e.g., with a perceptron): in two dimensions the linear separator can be expressed as ax + by = c.
[Figure: find a, b, c such that ax + by ≥ c for red points and ax + by < c for green points.]
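A rough perceptron sketch that iteratively fits such a separator; the toy points, learning rate, and epoch count are illustrative assumptions:

```python
def perceptron(points, labels, epochs=100, lr=0.1):
    """points: list of (x, y); labels: +1 or -1. Returns (w, b) with w.p + b >= 0 for +1."""
    w, b = [0.0, 0.0], 0.0
    for _ in range(epochs):
        for (x, y), label in zip(points, labels):
            if label * (w[0] * x + w[1] * y + b) <= 0:   # misclassified: update
                w[0] += lr * label * x
                w[1] += lr * label * y
                b += lr * label
    return w, b

points = [(2, 3), (3, 3), (1, 1), (0, 0.5)]
labels = [1, 1, -1, -1]
print(perceptron(points, labels))
```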
Many common text classifiers are linear classifiers
Despite this similarity, there are large performance differences among them.
For separable problems, there is an infinite number of separating hyperplanes. Which one do you choose?
What to do for non-separable problems?
In general, there are lots of possible solutions for a, b, c.
Support Vector Machines (SVMs) find an optimal solution.
SVMs maximize the margin around the separating hyperplane.
The decision function is fully specified by a subset of the training samples, the support vectors.
This is a quadratic programming problem.
SVMs are very good for text classification.
[Figure: a separating hyperplane with the margin maximized; the points on the margin are the support vectors.]
Let the training examples be (x_i, y_i), i = 1, 2, …, n, where x_i is an n-dimensional vector and y_i is its class, -1 or +1.
The class represented by the subset with y_i = -1 and the class represented by the subset with y_i = +1 are linearly separable if there exists (w, b) such that
• w^T x_i + b ≥ 0 for y_i = +1
• w^T x_i + b < 0 for y_i = -1
The margin of separation m is the separation between the hyperplane w^T x + b = 0 and the closest data points (support vectors).
The goal of an SVM is to find the optimal hyperplane with the maximum margin of separation.
The decision boundary should be as far away from the data of both classes as possible: we maximize the margin m.
[Figure: two classes (Class 1 and Class 2) separated by a hyperplane with margin m.]
Thus, support vector machines (SVMs) are linear functions of the form f(x) = w^T x + b, where w is the weight vector and x is the input vector.
To find the linear function:

Minimize: $\frac{1}{2}\mathbf{w}^T \mathbf{w}$
Subject to: $y_i(\mathbf{w}^T \mathbf{x}_i + b) \ge 1,\; i = 1, 2, \ldots, n$

This is a quadratic programming problem.
To deal with cases where there may be no separating hyperplane due to noisy labels of both positive and negative training examples, the soft margin SVM is proposed:
Minimize: $\frac{1}{2}\mathbf{w}^T \mathbf{w} + C \sum_{i=1}^{n} \xi_i$
Subject to: $y_i(\mathbf{w}^T \mathbf{x}_i + b) \ge 1 - \xi_i$ and $\xi_i \ge 0$, for $i = 1, 2, \ldots, n$
C ≥ 0 is a parameter that controls the amount of training error allowed.
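A minimal usage sketch with scikit-learn, where the C parameter of LinearSVC plays the role of the soft-margin C above; scikit-learn and the toy documents are assumptions, not prescribed by the slides:

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.svm import LinearSVC

train_texts = ["tax law vote senate", "gene cell protein lab",
               "election law policy", "atom cell experiment"]
train_labels = ["government", "science", "government", "science"]

vectorizer = TfidfVectorizer()                  # TF*IDF document vectors
X = vectorizer.fit_transform(train_texts)
clf = LinearSVC(C=1.0)                          # C controls allowed training error
clf.fit(X, train_labels)

print(clf.predict(vectorizer.transform(["new law on tax policy"])))  # likely ['government']
```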
Support vectors:
1: margin support vectors
2: non-margin support vectors
3: non-margin support vectors
ξ_i = 0: correct
ξ_i < 1: correct (inside the margin)
ξ_i > 1: error
[Figure: points labeled 1, 2, and 3 around the margin illustrating the three cases.]
In general, complex real-world applications may not be expressed with linear functions.
Key idea: transform x_i into a higher-dimensional space to "make life easier".
Input space: the space the x_i are in.
Feature space: the space of φ(x_i) after the transformation.
[Figure: the mapping φ(·) sends points from the input space to the feature space.]
The mapping function φ(·) is used to project data into a higher-dimensional feature space:
x = (x_1, …, x_n) → φ(x) = (φ_1(x), …, φ_N(x))
With a higher-dimensional space, the data are more likely to be linearly separable.
In SVMs, the projection can be done implicitly rather than explicitly, because the optimization does not actually need the explicit projection.
It only needs a way to compute inner products between pairs of training examples (e.g., x, z).
Kernel: K(x, z) = <φ(x), φ(z)>
If you know how to compute K, you do not need to know φ.
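A minimal sketch illustrating this with the degree-2 polynomial kernel K(x, z) = (x·z)², whose explicit feature map contains all pairwise products of components; this specific kernel is an illustrative choice, not one prescribed by the slides:

```python
import itertools

def poly2_kernel(x, z):
    """K(x, z) = (x . z)^2, computed without leaving the input space."""
    return sum(a * b for a, b in zip(x, z)) ** 2

def poly2_features(x):
    """Explicit feature map phi(x): all products x_i * x_j."""
    return [x[i] * x[j] for i, j in itertools.product(range(len(x)), repeat=2)]

x, z = [1.0, 2.0, 3.0], [4.0, 5.0, 6.0]
explicit = sum(a * b for a, b in zip(poly2_features(x), poly2_features(z)))
print(poly2_kernel(x, z), explicit)   # both 1024.0: K(x, z) = <phi(x), phi(z)>
```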
SVMs are seen as the best-performing method by many.
The statistical significance of most results is not clear.
Kernels are an elegant and efficient way to map data into a better representation.
SVMs can be expensive to train (quadratic programming).
For text classification, a linear kernel is common and often sufficient.
We can still use the normal clustering techniques, e.g., partitioning and hierarchical methods.
Documents can be represented using the vector space model.
For the distance function, the cosine similarity measure is commonly used.
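A minimal clustering sketch using scikit-learn's KMeans over TF*IDF vectors; scikit-learn, the toy texts, and k = 2 are assumptions for illustration. Normalizing document vectors to unit length is a common way to make Euclidean k-means approximate cosine-based clustering:

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.cluster import KMeans

texts = ["tax law senate vote", "gene cell protein", "election policy law",
         "cell experiment lab", "court law ruling", "protein gene sequence"]

# TfidfVectorizer L2-normalizes document vectors by default, so Euclidean
# k-means on them behaves much like cosine-based clustering.
X = TfidfVectorizer().fit_transform(texts)
km = KMeans(n_clusters=2, n_init=10, random_state=0).fit(X)
print(km.labels_)   # e.g., two clusters roughly separating law vs. biology documents
```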
Text mining applies and adapts data mining techniques to text domain.
A significant amount of pre-processing is needed before mining, using information retrieval techniques.