Lecture Note for Week 11

Chapter 7: Document Preprocessing (textbook)
• Document preprocessing is a procedure which can
be divided mainly into five text operations (or
transformations):
(1) Lexical analysis of the text with the objective of
treating digits, hyphens, punctuation marks, and the
case of letters.
(2) Elimination of stop-words with the objective of
filtering out words with very low discrimination
values for retrieval purposes.
(3) Stemming of the remaining words with the
objective of removing affixes (i.e., prefixes and
suffixes) and allowing the retrieval of documents
containing syntactic variations of query terms (e.g.,
connect, connecting, connected, etc).
(4) Selection of index terms to determine which
words/stems (or groups of words) will be used as
indexing elements. Usually, the decision on whether
a particular word will be used as an index term is
related to the syntactic nature of the word. In fact,
nouns frequently carry more semantics than
adjectives, adverbs, and verbs.
(5) Construction of term categorization structures
such as a thesaurus, or extraction of structure
directly represented in the text, to allow the
expansion of the original query with related terms
(a procedure that is usually useful).
Lexical Analysis of the text
Task: convert a string of characters into a sequence of
words.
• Main task is to deal with spaces, e.g., multiple
spaces are treated as one space.
• Digits: ignoring numbers is a common approach. Special cases: 1999 and
2000, standing for specific years, are important. Mixed digits are
important, e.g., 510B.C. A 16-digit number might be a credit card number.
• Hyphens: "state-of-the-art" and
"state of the art" should be treated
as the same.
• Punctuation marks: remove them. Exception: 510B.C.
• Lower or upper case of letters: treated as the same.
• There are many exceptions, so the process is semi-automatic.
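The rules above can be sketched in a few lines of Python. This is a minimal illustration, not the textbook's procedure: the function name `tokenize` is made up for the example, and the special cases (years, credit card numbers) are deliberately not handled, which is why the real process stays semi-automatic.

```python
import re

def tokenize(text):
    """Turn raw text into a list of lowercase word tokens."""
    text = text.lower()                      # upper and lower case are treated as the same
    text = text.replace("-", " ")            # "state-of-the-art" -> "state of the art"
    text = re.sub(r"[^a-z0-9\s]", "", text)  # drop punctuation marks
    tokens = text.split()                    # split() also collapses multiple spaces
    # ignore pure numbers, but keep mixed tokens such as "510bc"
    return [t for t in tokens if not t.isdigit()]

print(tokenize("State-of-the-art   retrieval, since 510B.C. and 42!"))
# ['state', 'of', 'the', 'art', 'retrieval', 'since', '510bc', 'and']
```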
Elimination of Stopwords
• Words that appear too often are not useful
for IR.
• Stopwords: words that appear in more than
80% of the documents in the
collection are stopwords and are
filtered out as potential index words.
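A small sketch of the 80% rule, assuming documents are plain strings split on whitespace. The function name `find_stopwords` and the toy collection are invented for the illustration.

```python
def find_stopwords(docs, threshold=0.8):
    """Return the words that occur in more than `threshold` of the documents."""
    doc_freq = {}
    for doc in docs:
        for word in set(doc.lower().split()):   # count each word once per document
            doc_freq[word] = doc_freq.get(word, 0) + 1
    return {w for w, df in doc_freq.items() if df / len(docs) > threshold}

docs = ["the cat sat", "the dog ran", "the fox hid", "the hen ate", "the ant dug"]
stopwords = find_stopwords(docs)
print(stopwords)  # {'the'}
```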
Stemming
• A stem: the portion of a word which is
left after the removal of its affixes (i.e.,
prefixes or suffixes).
• Example: connect is the stem for
{connected, connecting, connection,
connections}
• Porter algorithm: uses a suffix list for
suffix stripping, e.g., s → (empty), sses → ss, etc.
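As a rough illustration of suffix stripping with a suffix list, the sketch below applies only the plural rules of the first step of the Porter algorithm (sses → ss, ies → i, ss → ss, s → empty). It is not the full algorithm, so it does not reduce "connection" to "connect"; the names `RULES` and `strip_plural` are made up here.

```python
# Suffix rules tried in order; the first matching suffix wins.
RULES = [("sses", "ss"), ("ies", "i"), ("ss", "ss"), ("s", "")]

def strip_plural(word):
    """Apply the first matching plural rule, or return the word unchanged."""
    for suffix, replacement in RULES:
        if word.endswith(suffix):
            return word[: len(word) - len(suffix)] + replacement
    return word

for w in ["caresses", "ponies", "caress", "connections"]:
    print(w, "->", strip_plural(w))
# caresses -> caress, ponies -> poni, caress -> caress, connections -> connection
```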
Index terms selection
• Identification of noun groups
• Treat nouns that appear close together as a single
component, e.g., computer science
Thesaurus
• Thesaurus: a precompiled list of important
words in a given domain of knowledge and
for each word in this list, there is a set of
related words.
• Vocabulary control in an information
retrieval system
• Thesaurus construction
– Manual construction
– Automatic construction
Vocabulary control
• Standard vocabulary for both indexing and
searching (for the constructors of the
system and the users of the system)
Objectives of vocabulary control
• To promote the consistent representation of
subject matter by indexers and
searchers, thereby avoiding the dispersion of
related materials.
• To facilitate the conduct of a comprehensive
search on some topic by linking together terms
whose meanings are related paradigmatically.
Thesaurus
• Unlike a common dictionary
– which lists words with their explanations
• May contain the words of a whole language
• Or may contain only the words of a specific domain
• Contains a lot of other information, especially the
relationships between words
– Classification of words in the language
– Word relationships such as synonyms and antonyms
On-Line Thesaurus
• http://www.thesaurus.com
• http://www.dictionary.com/
• http://www.cogsci.princeton.edu/~wn/
Dictionary vs. Thesaurus
Look up "information" at http://www.thesaurus.com
Dictionary
• in·for·ma·tion
n.
– Knowledge derived from study, experience, or instruction.
– Knowledge of specific events or situations that has been gathered or
received by communication; intelligence or news.
See Synonyms at knowledge.
– ......
Thesaurus
[Nouns] information, enlightenment,
acquaintance ……
[Verbs] tell; inform, inform of; acquaint,
acquaint with; impart, ……
[Adjectives] informed; reported;
published
Use of Thesaurus
• To control the terms used in indexing: for a
specific domain, only the terms in the
thesaurus are used as indexing terms
• To assist users in forming proper queries, using
the help information contained in the
thesaurus
Construction of Thesaurus
• Stemming can be used to reduce the size
of the thesaurus
• Can be constructed either manually or
automatically
WordNet: manually constructed
• WordNet® is an online lexical reference
system whose design is inspired by
current psycholinguistic theories of human
lexical memory. English nouns, verbs,
adjectives and adverbs are organized into
synonym sets, each representing one
underlying lexical concept. Different
relations link the synonym sets.
Relations in WordNet
Automatic Thesaurus Construction
• A variety of methods can be used in
constructing the thesaurus
• Term similarity can be used for
constructing the thesaurus
Complete Term Relation Method
        Term1  Term2  Term3  Term4  Term5  Term6  Term7  Term8
Doc1      0      4      0      0      0      2      1      3
Doc2      3      1      4      3      1      2      0      1
Doc3      3      0      0      0      3      0      3      0
Doc4      0      1      0      3      0      0      2      0
Doc5      2      2      2      3      1      4      0      2
The term-document relationship can be calculated using a variety of methods,
such as tf-idf.
Term similarity can be calculated based on the term-document relationship,
for example:
$Sim(Term_i, Term_j) = \sum_{k \in \text{all documents}} (DocTerm_{k,i})(DocTerm_{k,j})$
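The similarity formula can be checked directly against the term-document table. The sketch below hard-codes that table; the names `DOC_TERM` and `term_similarity` are chosen only for this example.

```python
# Rows are Doc1..Doc5, columns are Term1..Term8 (values from the table above).
DOC_TERM = [
    [0, 4, 0, 0, 0, 2, 1, 3],
    [3, 1, 4, 3, 1, 2, 0, 1],
    [3, 0, 0, 0, 3, 0, 3, 0],
    [0, 1, 0, 3, 0, 0, 2, 0],
    [2, 2, 2, 3, 1, 4, 0, 2],
]

def term_similarity(doc_term, i, j):
    """Sum over all documents k of DocTerm[k][i] * DocTerm[k][j]."""
    return sum(row[i] * row[j] for row in doc_term)

n_terms = len(DOC_TERM[0])
sim = [[term_similarity(DOC_TERM, i, j) for j in range(n_terms)]
       for i in range(n_terms)]
print(sim[0][1:])  # Term1 against Term2..Term8: [7, 16, 15, 14, 14, 9, 7]
```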
Complete Term Relation Method
        Term1  Term2  Term3  Term4  Term5  Term6  Term7  Term8
Term1     -      7     16     15     14     14      9      7
Term2     7      -      8     12      3     18      6     17
Term3    16      8      -     18      6     16      0      8
Term4    15     12     18      -      6     18      6      9
Term5    14      3      6      6      -      6      9      3
Term6    14     18     16     18      6      -      2     16
Term7     9      6      0      6      9      2      -      3
Term8     7     17      8      9      3     16      3      -
Set threshold to 10
Complete Term Relation Method
[Figure: graph over T1 to T8 in which two terms are connected when their similarity exceeds the threshold]
Group
T1,T3,T4,T6
T1,T5
T2,T4,T6
T2,T6,T8
T7
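One way to reproduce these groups is to connect two terms whenever their similarity exceeds the threshold and then list the maximal fully connected groups. The sketch below uses the Bron-Kerbosch procedure for that listing, which is a choice made for this illustration; the slides do not say which procedure was used.

```python
# Term-term similarities from the matrix above (diagonal set to 0).
SIM = [
    [0, 7, 16, 15, 14, 14, 9, 7],
    [7, 0, 8, 12, 3, 18, 6, 17],
    [16, 8, 0, 18, 6, 16, 0, 8],
    [15, 12, 18, 0, 6, 18, 6, 9],
    [14, 3, 6, 6, 0, 6, 9, 3],
    [14, 18, 16, 18, 6, 0, 2, 16],
    [9, 6, 0, 6, 9, 2, 0, 3],
    [7, 17, 8, 9, 3, 16, 3, 0],
]
THRESHOLD = 10

n = len(SIM)
# Two terms are related when their similarity exceeds the threshold.
neighbors = {i: {j for j in range(n) if j != i and SIM[i][j] > THRESHOLD}
             for i in range(n)}

def maximal_cliques(r, p, x, out):
    """Bron-Kerbosch: r = current group, p = candidates, x = already tried."""
    if not p and not x:
        out.append(sorted(r))
        return
    for v in list(p):
        maximal_cliques(r | {v}, p & neighbors[v], x & neighbors[v], out)
        p = p - {v}
        x = x | {v}

groups = []
maximal_cliques(set(), set(range(n)), set(), groups)
groups = sorted([f"T{i + 1}" for i in g] for g in groups)
print(groups)
# [['T1', 'T3', 'T4', 'T6'], ['T1', 'T5'], ['T2', 'T4', 'T6'], ['T2', 'T6', 'T8'], ['T7']]
```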
Indexing
• Arrangement of data (data structure) to
permit fast searching
• Which list is easier to search?
sow fox pig eel yak hen ant cat dog hog
ant cat dog eel fox hen hog pig sow yak
Creating inverted files
[Figure: Original Documents → Word Extraction (Word IDs, Document IDs) → Inverted Files]
Inverted file entries:
W1: d1,d2,d3
W2: d2,d4,d7,d9
Wn: di,…dn
Creating Inverted file
• Map the file names to file IDs
• Consider the following Original Documents
D1: The Department of Computer Science was established in 1984.
D2: The Department launched its first BSc(Hons) in Computer Studies in 1987.
D3: followed by the MSc in Computer Science which was started in 1991.
D4: The Department also produced its first PhD graduate in 1994.
D5: Our staff have contributed intellectually and professionally to the advancements in these fields.
Creating Inverted file
Stop words are shown in [brackets] (red on the original slide).
D1: [The] Department [of] Computer Science [was] established [in] 1984.
D2: [The] Department launched [its] [first] BSc(Hons) [in] Computer Studies [in] 1987.
D3: followed [by] [the] MSc [in] Computer Science [which] [was] started [in] 1991.
D4: [The] Department [also] produced [its] [first] PhD graduate [in] 1994.
D5: [Our] staff [have] contributed intellectually [and] professionally [to] [the] advancements [in] [these] fields.
Creating Inverted file
After stemming, lowercasing (optional), and deleting numbers (optional):
D1: depart comput scienc establish
D2: depart launch bsc hons comput studi
D3: follow msc comput scienc start
D4: depart produc phd graduat
D5: staff contribut intellectu profession advanc field
Creating Inverted file (unsorted)
Words        Documents
depart       d1,d2,d4
comput       d1,d2,d3
scienc       d1,d3
establish    d1
launch       d2
bsc          d2
hons         d2
studi        d2
follow       d3
msc          d3
start        d3
produc       d4
phd          d4
graduat      d4
staff        d5
contribut    d5
intellectu   d5
profession   d5
advanc       d5
field        d5
Creating Inverted file (sorted)
Words        Documents
advanc       d5
bsc          d2
comput       d1,d2,d3
contribut    d5
depart       d1,d2,d4
establish    d1
field        d5
follow       d3
graduat      d4
hons         d2
intellectu   d5
launch       d2
msc          d3
phd          d4
produc       d4
profession   d5
scienc       d1,d3
staff        d5
start        d3
studi        d2
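The step from the stemmed documents to the sorted inverted file can be sketched as follows. The names `STEMMED_DOCS` and `build_inverted_file` are illustrative, and a plain dictionary stands in for whatever file structure a real system would use.

```python
STEMMED_DOCS = {
    "d1": "depart comput scienc establish",
    "d2": "depart launch bsc hons comput studi",
    "d3": "follow msc comput scienc start",
    "d4": "depart produc phd graduat",
    "d5": "staff contribut intellectu profession advanc field",
}

def build_inverted_file(docs):
    """Map each word to the sorted list of document IDs that contain it."""
    index = {}
    for doc_id, text in docs.items():
        for word in text.split():
            postings = index.setdefault(word, [])
            if doc_id not in postings:      # record each document once per word
                postings.append(doc_id)
    # sort by word so that the file can be searched quickly
    return {word: sorted(index[word]) for word in sorted(index)}

inverted = build_inverted_file(STEMMED_DOCS)
for word, postings in inverted.items():
    print(word, ",".join(postings))
```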
Searching on Inverted File
• Binary search
– Used for small-scale collections
• Create a thesaurus and combine it with
techniques such as:
– Hashing
– B+tree
– Pointer to the address in the indexed file
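Binary search over a sorted word list can be sketched with Python's standard `bisect` module. Only a few entries of the inverted file are included, and the parallel-list layout (`WORDS`, `POSTINGS`) is an assumption made for the example.

```python
from bisect import bisect_left

# A few entries of the sorted inverted file, stored as parallel lists.
WORDS = ["advanc", "bsc", "comput", "depart", "scienc", "studi"]
POSTINGS = [["d5"], ["d2"], ["d1", "d2", "d3"], ["d1", "d2", "d4"], ["d1", "d3"], ["d2"]]

def lookup(word):
    """Binary-search the sorted word list; return the posting list or []."""
    pos = bisect_left(WORDS, word)
    if pos < len(WORDS) and WORDS[pos] == word:
        return POSTINGS[pos]
    return []

print(lookup("comput"))  # ['d1', 'd2', 'd3']
print(lookup("zebra"))   # []
```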
Huffman codes
• Binary character code: each character is
represented by a unique binary string.
• A data file can be coded in two ways:
                       a    b    c    d    e     f
frequency (%)          45   13   12   16   9     5
fixed-length code      000  001  010  011  100   101
variable-length code   0    101  100  111  1101  1100
For a file of 100 characters, the first way needs 100×3 = 300 bits. The second
needs 45×1 + 13×3 + 12×3 + 16×3 + 9×4 + 5×4 = 224 bits.
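The two totals can be recomputed from the frequency and codeword table, treating the percentages as counts in a 100-character file. The names below are chosen only for this check.

```python
FREQ = {"a": 45, "b": 13, "c": 12, "d": 16, "e": 9, "f": 5}
FIXED = {"a": "000", "b": "001", "c": "010", "d": "011", "e": "100", "f": "101"}
VARIABLE = {"a": "0", "b": "101", "c": "100", "d": "111", "e": "1101", "f": "1100"}

def total_bits(freq, code):
    """Number of bits needed: frequency times codeword length, summed over characters."""
    return sum(freq[ch] * len(code[ch]) for ch in freq)

print(total_bits(FREQ, FIXED))     # 300
print(total_bits(FREQ, VARIABLE))  # 224
```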
Variable-length code
• Some care is needed to read the code.
– 001011101 (codewords: a=0, b=00, c=01,
d=11)
– Where to cut? 00 can be read as either aa or
b.
• Prefixes of 0011: 0, 00, 001, and 0011.
• Prefix codes: no codeword is a prefix of
some other codeword (prefix-free).
• Prefix codes are simple to encode and
decode.
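Whether a set of codewords is prefix-free can be tested mechanically. In the sketch below (the function name is made up), the codewords are sorted so that any codeword that is a prefix of another ends up directly in front of a word that extends it.

```python
def is_prefix_free(codewords):
    """True if no codeword is a prefix of another codeword."""
    words = sorted(codewords)
    # after sorting, a prefix is immediately followed by a word that extends it
    return not any(b.startswith(a) for a, b in zip(words, words[1:]))

print(is_prefix_free(["0", "00", "01", "11"]))                    # False: 0 is a prefix of 00
print(is_prefix_free(["0", "101", "100", "111", "1101", "1100"]))  # True
```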
Using the codewords in the table to encode
and decode
                       a    b    c    d    e     f
frequency (%)          45   13   12   16   9     5
fixed-length code      000  001  010  011  100   101
variable-length code   0    101  100  111  1101  1100
• Encode: abc = 0.101.100 = 0101100
– (just concatenate the codewords.)
• Decode: 001011101 = 0.0.101.1101 = aabe
– (use the binary tree for the variable-length codewords below.)
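Tree for the fixed-length codewords (0 = left edge, 1 = right edge; each internal node shows the sum of the frequencies below it):
100
  0: 86
    0: 58
      0: a:45
      1: b:13
    1: 28
      0: c:12
      1: d:16
  1: 14
    0: 14
      0: e:9
      1: f:5
Tree for the variable-length codewords:
100
  0: a:45
  1: 55
    0: 25
      0: c:12
      1: b:13
    1: 30
      0: 14
        0: f:5
        1: e:9
      1: d:16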
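Encoding and decoding with the variable-length codewords can be sketched as below. Instead of walking the tree, the decoder collects bits until they match a codeword, which gives the same result because the code is prefix-free. The function names are chosen for the example.

```python
CODE = {"a": "0", "b": "101", "c": "100", "d": "111", "e": "1101", "f": "1100"}

def encode(text):
    """Concatenate the codewords of the characters."""
    return "".join(CODE[ch] for ch in text)

def decode(bits):
    """Read bits left to right, emitting a character whenever a codeword is complete."""
    reverse = {v: k for k, v in CODE.items()}
    out, current = [], ""
    for bit in bits:
        current += bit
        if current in reverse:      # prefix-free: the first match is the only match
            out.append(reverse[current])
            current = ""
    if current:
        raise ValueError("trailing bits do not form a codeword")
    return "".join(out)

print(encode("abc"))        # 0101100
print(decode("001011101"))  # aabe
```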
Binary tree
• In the tree of an optimal code, every nonleaf node has two children.
• The fixed-length code in our example is not
optimal: its tree has a node with only one child.
• The total number of bits required to encode a file is
$B(T) = \sum_{c \in C} f(c)\, d_T(c)$
– f(c): the frequency (number of occurrences) of c
in the file
– d_T(c): the depth of c's leaf in the tree
Constructing an optimal code
• Formal definition of the problem:
• Input: a set of characters C = {c1, c2, …, cn},
where each c ∈ C has frequency f[c].
• Output: a binary tree representing
codewords so that the total number of bits
required for the file is minimized.
• Huffman proposed a greedy algorithm to
solve the problem.
[Figure: steps (a) to (f) of the construction. N[left, right] denotes an internal node of frequency N whose left edge is labeled 0 and right edge 1.]
(a) f:5   e:9   c:12   b:13   d:16   a:45
(b) c:12   b:13   14[f:5, e:9]   d:16   a:45
(c) 14[f:5, e:9]   d:16   25[c:12, b:13]   a:45
(d) 25[c:12, b:13]   30[14[f:5, e:9], d:16]   a:45
(e) a:45   55[25[c:12, b:13], 30[14[f:5, e:9], d:16]]
(f) 100[a:45, 55[25[c:12, b:13], 30[14[f:5, e:9], d:16]]]
HUFFMAN(C)
1  n := |C|
2  Q := C
3  for i := 1 to n-1 do
4      z := ALLOCATE_NODE()
5      x := left[z] := EXTRACT_MIN(Q)
6      y := right[z] := EXTRACT_MIN(Q)
7      f[z] := f[x] + f[y]
8      INSERT(Q, z)
9  return EXTRACT_MIN(Q)
The Huffman Algorithm
• This algorithm builds the tree T corresponding to
the optimal code in a bottom-up manner.
• C is a set of n characters, and each character c
in C has a defined frequency f[c].
• Q is a priority queue, keyed on f, used to identify
the two least-frequent characters to merge
together.
• The result of the merger is a new object (internal
node) whose frequency is the sum of the frequencies of the two
objects.
Time complexity
• Lines 4-8 are executed n-1 times.
• Each heap operation in Lines 4-8 takes
O(lg n) time.
• Total time required is O(n lg n).
Note: The details of heap operation will not
be tested. Time complexity O(n lg n)
should be remembered.
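A runnable counterpart to HUFFMAN(C), sketched with Python's `heapq` as the priority queue. Representing internal nodes as `(left, right)` tuples and adding a tie-breaking counter are choices made for this example, not part of the pseudocode.

```python
import heapq
from itertools import count

def huffman(freq):
    """Build a Huffman tree bottom-up and return {character: codeword}."""
    if not freq:
        return {}
    tie = count()  # tie-breaker so the heap never has to compare tree nodes
    # heap entries: (frequency, tie-breaker, node); a node is a character
    # or a (left, right) pair for an internal node
    heap = [(f, next(tie), ch) for ch, f in freq.items()]
    heapq.heapify(heap)
    for _ in range(len(freq) - 1):
        fx, _tx, x = heapq.heappop(heap)   # EXTRACT_MIN -> left child
        fy, _ty, y = heapq.heappop(heap)   # EXTRACT_MIN -> right child
        heapq.heappush(heap, (fx + fy, next(tie), (x, y)))
    codes = {}

    def walk(node, prefix):
        if isinstance(node, tuple):
            walk(node[0], prefix + "0")
            walk(node[1], prefix + "1")
        else:
            codes[node] = prefix or "0"    # alphabet with a single character

    walk(heap[0][2], "")
    return codes

freq = {"a": 45, "b": 13, "c": 12, "d": 16, "e": 9, "f": 5}
codes = huffman(freq)
print(codes)
print(sum(freq[ch] * len(codes[ch]) for ch in freq))  # 224
```

Each `heappop` and `heappush` takes O(lg n) time and the loop runs n-1 times, which matches the O(n lg n) bound.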
Another example:
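[Figure: construction steps for the second example. N[left, right] denotes an internal node of frequency N whose left edge is labeled 0 and right edge 1.]
Start:  e:4   a:6   c:6   b:9   d:11
Step 1: c:6   b:9   10[e:4, a:6]   d:11
Step 2: 10[e:4, a:6]   d:11   15[c:6, b:9]
Step 3: 15[c:6, b:9]   21[10[e:4, a:6], d:11]
Step 4: 36[15[c:6, b:9], 21[10[e:4, a:6], d:11]]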
Correctness of Huffman's Greedy Algorithm
(Fun Part, not required)
• Again, we use our general strategy.
• Let x and y be the two characters in C
having the lowest frequencies (the first two
characters selected in the greedy algorithm).
• We will show the two properties:
1. There exists an optimal solution T_opt (a binary tree
representing codewords) such that x and y are
siblings in T_opt.
2. Let z be a new character with frequency
f[z] = f[x] + f[y] and C' = C − {x, y} ∪ {z}. Let T' be
an optimal tree for C'. Then we can get T_opt from
T' by replacing the leaf z with an internal node whose two children are x and y.
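Proof of Property 1
[Figure: T_opt, in which b and c are the lowest siblings, and T_new, obtained from T_opt by exchanging x with b and y with c.]
• Look at the lowest siblings in T_opt, say b and c.
• Exchange x with b and y with c to obtain T_new.
• B(T_opt) − B(T_new) ≥ 0, since f[x] and f[y] are the
smallest frequencies and b and c are at the greatest depth.
• Since T_opt is optimal, T_new is optimal too, and x and y are siblings in T_new.
• Property 1 is proved.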
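The inequality in the exchange step can be written out for a single swap. In this sketch, d(·) is the depth of a leaf in T_opt and T_1 is the tree after exchanging only x and b; both symbols are introduced here for the illustration. Only the terms for x and b change:

```latex
\begin{aligned}
B(T_{opt}) - B(T_1)
  &= f[x]\,d(x) + f[b]\,d(b) - f[x]\,d(b) - f[b]\,d(x) \\
  &= (f[b] - f[x])\,(d(b) - d(x)) \;\ge\; 0 .
\end{aligned}
```

Both factors are nonnegative: f[b] ≥ f[x] because x has the lowest frequency, and d(b) ≥ d(x) because b is a lowest leaf. The same computation applies to exchanging y with c.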
2. Let z be a new character with frequency
f[z] = f[x] + f[y] and C' = C − {x, y} ∪ {z}. Let T' be
an optimal tree for C'. Then we can get
T_opt from T' by replacing the leaf z with an internal node
whose two children are x and y.
Proof: Let T be the tree obtained from T' by
replacing z with the three nodes (an internal node with children x and y).
B(T) = B(T') + f[x] + f[y].   … (1)
(The codes for x and y are 1 bit longer than that of z.)
Now prove T = T_opt by contradiction.
If T is not optimal, then B(T) > B(T_opt).   … (2)
From Property 1, x and y are siblings in T_opt.
Thus, we can delete x and y from T_opt and get another tree T'' for C'.
B(T'') = B(T_opt) − f[x] − f[y] < B(T) − f[x] − f[y] = B(T'),
using (2) for the inequality and (1) for the last equality.
Thus, B(T'') < B(T'). Contradiction, since T' is optimal.