IT-522: Web Databases and Information Retrieval
Lecture 2
By Dr. Syed Noman Hasany
IR System Architecture
[Figure: IR system architecture. The user's information need enters as text through the User Interface. Text Operations convert both documents and queries into a logical view. Query Operations build the query, which Searching runs against the Index (an inverted file built by the Indexing module and maintained by the Database Manager over the Text Database). Ranking orders the Retrieved Docs into Ranked Docs, and User Feedback flows back into Query Operations.]
IR System Components
• Searching retrieves documents that contain a
given query token from the inverted index.
• Ranking scores all retrieved documents
according to a relevance metric.
• User Interface manages interaction with the
user:
– Query input and document output.
– Relevance feedback.
– Visualization of results.
• Query Operations transform the query to
improve retrieval:
– Query expansion using a thesaurus.
– Query transformation using relevance feedback.
Assignment 1a
• Describe the IR system architecture in detail, including the sequence of operations performed during information retrieval.
Document Preprocessing
Basic text processing techniques applied before an
index is created
• Tokenize document
• Remove stop words (also known as noise words)
• Perform stemming (saves indexing space and
improves recall)
• Create Index: data about the remaining keywords is then stored in a postings data structure (see the sketch below)
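A minimal sketch of this pipeline in Python, assuming NLTK's PorterStemmer is available; the stopword list here is a toy subset for illustration, not a real one:

import re
from nltk.stem import PorterStemmer  # assumes NLTK is installed

STOPWORDS = {"a", "the", "in", "to", "of", "i", "he", "she", "it", "are", "was"}  # toy list

stemmer = PorterStemmer()

def preprocess(text):
    # 1. Tokenize: case-fold and keep unbroken alphabetic strings.
    tokens = re.findall(r"[a-z]+", text.lower())
    # 2. Remove stopwords (noise words).
    tokens = [t for t in tokens if t not in STOPWORDS]
    # 3. Stem each remaining keyword.
    return [stemmer.stem(t) for t in tokens]

print(preprocess("Compression and compressed files were indexed."))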
Tokenization
• Analyze text into a sequence of discrete
tokens (words).
• Sometimes punctuation (e-mail), numbers
(1999), and case (Republican vs. republican)
can be a meaningful part of a token.
– However, frequently they are not.
• Simplest approach is to ignore all numbers and punctuation and use only case-insensitive, unbroken strings of alphabetic characters as tokens.
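A sketch of that simplest approach, showing what it discards for the e-mail, 1999, and Republican examples above:

import re

def tokenize(text):
    # Case-insensitive, unbroken alphabetic strings only: numbers and
    # punctuation are dropped, so "e-mail" becomes two tokens.
    return re.findall(r"[a-z]+", text.lower())

print(tokenize("E-mail the Republican HQ in 1999!"))
# ['e', 'mail', 'the', 'republican', 'hq', 'in']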
Tokenizing HTML
• Should text in HTML commands not typically
seen by the user be included as tokens?
– Words appearing in URLs.
– Words appearing in “meta text” of images.
• Simplest approach is to exclude all HTML tag information (between "<" and ">") from tokenization.
• More difficult to work with some formats like
.ps / .pdf
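A rough sketch of that simplest approach, dropping everything between "<" and ">" before tokenization; a real system would use an HTML parser and decide separately what to do with URLs and meta text:

import re

def strip_tags(html):
    # Replace each <...> span with a space so adjacent words stay separate.
    return re.sub(r"<[^>]*>", " ", html)

print(strip_tags('Click <a href="http://example.com">here</a> now'))
# 'Click  here  now'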
Stopwords
• Excluding high-frequency words (e.g. function words: “a”,
“the”, “in”, “to”; pronouns: “I”, “he”, “she”, “it”).
• Stopwords are language dependent
• For efficiency, store strings for stopwords in a hashtable to
recognize them in constant time.
• How to determine a list of stopwords?
– For English? – may use existing lists of stopwords
• WordNet stopword list
– For Spanish? Bulgarian?
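A sketch of the constant-time check using a hash-based set (again a toy subset, not a real stopword list):

STOPWORDS = {"a", "the", "in", "to", "i", "he", "she", "it"}

def is_stopword(token):
    # Hash-based set membership is expected O(1) per lookup.
    return token.lower() in STOPWORDS

print(is_stopword("The"), is_stopword("retrieval"))  # True False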
Lemmatization
• Reduce inflectional/variant forms to base form
• E.g.:
  – am, are, is → be
  – car, cars, car's, cars' → car
• the boy's cars are different colors → the boy car be different color
• How to do this?
  – Need a list of grammatical rules + a list of irregular words
  – children → child, spoken → speak, …
  – Practical implementation: use WordNet's morphstr function
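A sketch using NLTK's WordNet interface (assumed installed, with the WordNet data downloaded); NLTK's morphy() and WordNetLemmatizer are analogues of the morphstr function named above:

from nltk.corpus import wordnet as wn
from nltk.stem import WordNetLemmatizer

print(wn.morphy("cars"))                     # 'car' (grammatical rule)
lemmatizer = WordNetLemmatizer()
print(lemmatizer.lemmatize("are", pos="v"))  # 'be' (irregular-word list)
print(lemmatizer.lemmatize("children"))      # 'child'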
Stemming
• Reduce tokens to “root” form of words to
recognize morphological variation.
– “computer”, “computational”, “computation” all
reduced to same token “compute”
• Correct morphological analysis is language
specific and can be complex.
• Stemming “blindly” strips off known affixes
(prefixes and suffixes) in an iterative fashion.
Example (before and after stemming):
  Original: "for example compressed and compression are both accepted as equivalent to compress."
  Stemmed:  "for exampl compress and compress ar both accept as equival to compress."
Porter Stemmer
• Simple procedure for removing known
affixes in English without using a dictionary.
• Can produce unusual stems that are not
English words:
– “computer”, “computational”, “computation” all
reduced to same token “comput”
• May conflate (reduce to the same token)
words that are actually distinct.
• Does not recognize all morphological derivations.
Typical rules in Porter
• sses → ss
• ies → i
• ational → ate
• tional → tion
Term-document incidence
Which plays/docs of Shakespeare contain the words Brutus AND
Caesar but NOT Calpurnia?
           Antony &   Julius    The
           Cleopatra  Caesar    Tempest   Hamlet   Othello   Macbeth
Antony         1         1         0         0        0         1
Brutus         1         1         0         1        0         0
Caesar         1         1         0         1        1         1
Calpurnia      0         1         0         0        0         0
Cleopatra      1         0         0         0        0         0
mercy          1         0         1         1        1         1
worser         1         0         1         1        1         0

(Entry is 1 if the play contains the word, 0 otherwise.)
Incidence vectors
• So we have a 0/1 vector for each term.
• To answer the query: take the vectors for Brutus, Caesar, and Calpurnia (complemented) → bitwise AND.
• 110100 AND 110111 AND 101111 = 100100.
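The same query as Python bitwise arithmetic on the six-bit incidence vectors (leftmost bit = Antony and Cleopatra, rightmost = Macbeth):

brutus    = 0b110100
caesar    = 0b110111
calpurnia = 0b010000
mask      = 0b111111  # six plays -> six bits

result = brutus & caesar & (~calpurnia & mask)  # AND with complemented vector
print(format(result, "06b"))  # '100100': Antony and Cleopatra, Hamlet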
Bigger corpora
• Consider N = 1M documents, each with about 1K
terms.
• Avg 6 bytes/term including spaces/punctuation
  – about 6 GB of data in the documents (10^6 docs × 10^3 terms × 6 bytes)
• Say there are m = 500K distinct terms among
these.
Assignment 1b
• 500K x 1M matrix has half-a-trillion 0’s and 1’s.
• But it has no more than one billion 1’s.
– matrix is extremely sparse.
• What’s a better representation?
– We only record the 1 positions.
• Million = 10^6, Billion = 10^9, Trillion = 10^12
Assignment 1b: Why?
Postings Data Structures
• Data about keywords, documents, and especially the occurrences of keywords in documents needs to be stored in an appropriate postings data structure: this is the index in an IR system
• The details of what data needs to be stored depend
on the required functionality of the application
• Aim (as with all data structures) is to facilitate
efficient access to and processing of stored data
So, what’s wrong with this?
Doc1: car, traffic, lorry, fruit,
roads…
Doc2: boat, river, traffic,
vegetables…
Doc3: train, bread, railways…
…
…
Doc1,000,000: car, roads, traffic,
delays…
Inefficient retrieval! (finding a term means scanning every document)
Inverted index
• Indexing is a process by which a vocabulary of
keywords is assigned to all documents of a
corpus
• For each vocabulary item there is a list of
pointers to occurrences
• Speeds up searching for occurrences of words
Inverted index
• Maintain a list of docs for each term
• For each term T, we must store a list of all
documents that contain T.
• Do we use an array or a list for this?
Brutus    → 2  4  8  16  32  64  128
Calpurnia → 1  2  3  5  8  13  21  34
Caesar    → 13  16

What happens if the word Caesar is added to document 14?
Inverted index
• Linked lists generally preferred to arrays
  + Dynamic space allocation
  + Easy insertion of terms into postings lists
  − Space overhead of pointers

Dictionary      Postings lists
Brutus    → 2 → 4 → 8 → 16 → 32 → 64 → 128
Calpurnia → 1 → 2 → 3 → 5 → 8 → 13 → 21 → 34
Caesar    → 13 → 16
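A minimal sketch in Python, with sorted lists standing in for the linked postings lists (term and docID values taken from the example above):

index = {
    "brutus":    [2, 4, 8, 16, 32, 64, 128],
    "calpurnia": [1, 2, 3, 5, 8, 13, 21, 34],
    "caesar":    [13, 16],
}

def add_occurrence(term, doc_id):
    # Easy insertion, as with linked lists: e.g. Caesar added to document 14.
    postings = index.setdefault(term, [])
    if doc_id not in postings:
        postings.append(doc_id)
        postings.sort()  # keep postings in docID order

add_occurrence("caesar", 14)
print(index["caesar"])  # [13, 14, 16]

# Brutus AND Caesar = intersection of the two postings lists.
print(sorted(set(index["brutus"]) & set(index["caesar"])))  # [16]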
Inverted index construction
Documents to be indexed:  "Friends, Muslims, countrymen."
        ↓ Tokenizer
Token stream:             Friends  Muslims  Countrymen
        ↓ Linguistic modules
Modified tokens:          friend  muslim  countryman
        ↓ Indexer
Inverted index:
    friend     → 2 → 4
    muslim     → 1 → 2
    countryman → 13 → 16
Indexer steps
• Sequence of (Modified token, Document ID) pairs.
Doc 1: "I did enact Julius Caesar I was killed i' the Capitol; Brutus killed me."
Doc 2: "So let it be with Caesar. The noble Brutus hath told you Caesar was ambitious."

(term, docID) pairs in document order:
I:1  did:1  enact:1  julius:1  caesar:1  I:1  was:1  killed:1  i':1  the:1  capitol:1  brutus:1  killed:1  me:1
so:2  let:2  it:2  be:2  with:2  caesar:2  the:2  noble:2  brutus:2  hath:2  told:2  you:2  caesar:2  was:2  ambitious:2
• Sort by terms.

Sorted (term, docID) pairs:
ambitious:2  be:2  brutus:1  brutus:2  capitol:1  caesar:1  caesar:2  caesar:2  did:1  enact:1
hath:2  I:1  I:1  i':1  it:2  julius:1  killed:1  killed:1  let:2  me:1
noble:2  so:2  the:1  the:2  told:2  you:2  was:1  was:2  with:2
• Multiple term entries in a single document are merged.
• Frequency information is added.

Term       Doc#  Term freq
ambitious   2      1
be          2      1
brutus      1      1
brutus      2      1
capitol     1      1
caesar      1      1
caesar      2      2
did         1      1
enact       1      1
hath        2      1
I           1      2
i'          1      1
it          2      1
julius      1      1
killed      1      2
let         2      1
me          1      1
noble       2      1
so          2      1
the         1      1
the         2      1
told        2      1
you         2      1
was         1      1
was         2      1
with        2      1
• The result is split into a Dictionary file and a Postings file, with one entry per term in the Dictionary.
Dictionary file (one entry per term, each with a pointer to its postings list):

Term       N docs  Coll freq
ambitious    1        1
be           1        1
brutus       2        2
capitol      1        1
caesar       2        3
did          1        1
enact        1        1
hath         1        1
I            1        2
i'           1        1
it           1        1
julius       1        1
killed       1        2
let          1        1
me           1        1
noble        1        1
so           1        1
the          2        2
told         1        1
you          1        1
was          2        2
with         1        1

Postings file entries (docID, term freq), pointed to by the dictionary:
ambitious → (2,1)    be → (2,1)           brutus → (1,1)(2,1)
capitol → (1,1)      caesar → (1,1)(2,2)  did → (1,1)
enact → (1,1)        hath → (2,1)         I → (1,2)
i' → (1,1)           it → (2,1)           julius → (1,1)
killed → (1,2)       let → (2,1)          me → (1,1)
noble → (2,1)        so → (2,1)           the → (1,1)(2,1)
told → (2,1)         you → (2,1)          was → (1,1)(2,1)
with → (2,1)
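A sketch of these indexer steps end-to-end on the two documents above, with in-memory dicts standing in for the Dictionary and Postings files:

from collections import Counter, defaultdict

docs = {
    1: "i did enact julius caesar i was killed i' the capitol brutus killed me",
    2: "so let it be with caesar the noble brutus hath told you caesar was ambitious",
}

# 1. Sequence of (term, docID) pairs, sorted by term (docID breaks ties).
pairs = sorted((term, doc_id)
               for doc_id, text in docs.items()
               for term in text.split())

# 2. Merge duplicate pairs within a document, adding term frequencies.
freqs = Counter(pairs)

# 3. Split into postings (term -> [(docID, freq)]) and dictionary
#    (term -> (number of docs, collection frequency)).
postings = defaultdict(list)
for (term, doc_id), tf in sorted(freqs.items()):
    postings[term].append((doc_id, tf))
dictionary = {t: (len(p), sum(tf for _, tf in p)) for t, p in postings.items()}

print(dictionary["caesar"], postings["caesar"])  # (2, 3) [(1, 1), (2, 2)]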
How Good is an IR System?
• We need ways to measure how good an IR system is, i.e. evaluation metrics
• Systems should return relevant information items (texts, images, etc.); systems may rank the items in order of relevance
Two ways to measure the performance of an IR system:
Precision = “how many of the retrieved items are relevant?”
Recall = “how many of the items that should have been retrieved
were retrieved?”
• These should be objective measures.
• Both require humans to make decisions about what documents
are relevant for a given query
Calculating Precision and Recall
R = number of documents in the collection relevant to topic t
A(t) = number of documents returned by the system in response to query t
C = number of 'correct' (relevant) documents returned, i.e. the intersection of R and A(t)

PRECISION = ((C+1)/(A(t)+1))*100
RECALL = ((C+1)/(R+1))*100

(The +1 terms presumably guard against division by zero when A(t) or R is 0.)
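A sketch of these two formulas in Python; the numbers in the example call are hypothetical, not taken from the assignment below:

def precision(c, a_t):
    # C correct documents among A(t) returned.
    return (c + 1) / (a_t + 1) * 100

def recall(c, r):
    # C correct documents out of R relevant documents in the collection.
    return (c + 1) / (r + 1) * 100

# Hypothetical system: 8 relevant among 10 returned, 20 relevant exist.
print(round(precision(8, 10), 1), round(recall(8, 20), 1))  # 81.8 42.9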
Assignment 1c
• Amanda and Alex each need to choose an information retrieval system. Amanda works for an intelligence agency, so getting all possible information about a topic is important for the users of her system. Alex works for a newspaper, so getting some relevant information quickly is more important for the journalists using his system.
• See below the statistics for two information retrieval systems (Search4Facts and InfoULike) when they were used to retrieve documents from the same document collection in response to the same query: there were 100,000 documents in the collection, of which 50 were relevant to the given query. Which system would you advise Amanda to choose and which would you advise Alex to choose? Your decisions should be based on the evaluation metrics of precision and recall.

Search4Facts
• Number of Relevant Documents Returned = 12
• Total Number of Documents Returned = 15

InfoULike
• Number of Relevant Documents Returned = 48
• Total Number of Documents Returned = 295