Information Retrieval
Gertjan van Noord, based on the slides by Leonoor
van der Beek
2013
Lecture 1: introduction
Agenda for today
• Who’s who
• Intro to the course
• Chapter 1 of Introduction to Information Retrieval
• Homework/lab assignment
Intro to the course
• What is IR?
• What will we study and how?
• Objectives of the course
• Exercises and lab sessions
• Final exam
What is IR?
Individuals, administrations, organizations have lots
of digital information
• how to organize and store it?
• how to retrieve documents?
• how to retrieve info inside them?
An IR system is a tool to facilitate retrieval of such
information
Book’s definition
Information retrieval (IR) is
finding material (usually documents)
of an unstructured nature (usually text)
that satisfies an information need
from within large collections
(usually stored on computers).
... finding material (usually documents) ...
What else can you think of?
• parts of documents
• facts, like the day of birth of Rembrandt
• a book in the library
• a work of art in a museum
... from within large collections
(usually stored on computers)…
WWW? What else?
• Specific collections, like legal information
or scientific medical papers (Medline)
• Information on your own computer
• Information within a company
• Subsets of the WWW, such as a single domain
… of an unstructured nature (usually text)
Can you explain this?
Unstructured:
differences between text and databases
is a text document really unstructured?
how about XML?
Beyond text: images, sound, video, …
Database search vs. IR
Database search
• structured semantic info: fields, datatypes, validation, relations
• search of fields
• exact search for data
• order of found records: alphanumerical
IR
• no semantic structure, no fixed format, but: text structure, metadata, XML
• full-text search
• non-exact search for data or information
• order of found documents: often by similarity with the query
Book’s definition
Information retrieval (IR) is
finding material (usually documents)
of an unstructured nature (usually text)
that satisfies an information need
from within large collections
(usually stored on computers).
that satisfies an information need ...
What information needs can we discern?
Try to formulate some different types of goals of a
search
• facts and question answering
• definitions
• information on a subject
• retrieving a known document
And in web search?
User needs in web search
Navigational.
The immediate intent is to reach a particular
site.
Informational.
The intent is to acquire some information
assumed to be present on one or more web
pages.
Transactional.
The intent is to perform some web-mediated
activity.
Broder, A. 2002. A taxonomy of Web search. SIGIR Forum 36, no. 2, 3-10.
Translation of info need
Each information need has to be translated into the
"language" of the IR system
(Diagram labels: reality, document, info need, query)
Translation of info need
Query: Hilton, Paris
Translation of info need
Query: champagne
Translation of info need
Query: Rene Froger “Een eigen huis”
Translation of info need
Information need:
Query: ??
Are the results satisfying?
Search engines often produce a lot of results
When are you satisfied with the results?
How can we evaluate a system?
• the most relevant results are easy to find (on top
of the list, and/or sorted by subject, …)
• only a few results are not relevant
• new information
• info corroborated (more sources)
• relevant documents that I know are presented
Precision and recall
Key statistics for evaluation with a test set (a fixed set of
queries, a set of documents, and relevance judgments of the
documents for each query)
Precision: what fraction of the results are relevant to
the information need?
Recall: what fraction of the relevant documents in the
collection were returned by the system?
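A minimal sketch of both measures for a single query, assuming the system output and the relevance judgments are given as collections of docIDs. The function name and the example docIDs are made up for illustration.

```python
def precision_recall(retrieved, relevant):
    """Return (precision, recall) for one query.

    retrieved: docIDs returned by the system
    relevant:  docIDs judged relevant in the test set
    """
    retrieved, relevant = set(retrieved), set(relevant)
    hits = retrieved & relevant  # relevant documents that were returned
    precision = len(hits) / len(retrieved) if retrieved else 0.0
    recall = len(hits) / len(relevant) if relevant else 0.0
    return precision, recall


# Hypothetical query: 4 documents returned, 3 of them relevant;
# the collection holds 6 relevant documents in total.
p, r = precision_recall(retrieved=[1, 2, 3, 4], relevant=[2, 3, 4, 7, 8, 9])
print(f"precision = {p:.2f}, recall = {r:.2f}")  # precision = 0.75, recall = 0.50
```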
Precision and recall
But how relevant are precision and recall if you search
for, e.g., the date of birth of Vincent van Gogh?
Overview of an IR system
(book: Baeza-Yates and Ribeiro-Neto, Modern Information Retrieval)
Web site
Overview and exercises:
* http://www.let.rug.nl/vannoord/College/Zoekmachines/
* Nestor
Course Book
Introduction to Information Retrieval
C. D. Manning, P. Raghavan and H. Schütze
Online version
NB: the book is also used for the Information Retrieval
course
The book is written for CS students; we will skip
sections and exercises that are a bit too technical
Schedule for this course
wk 1 ch 1 Boolean retrieval, postings lists
wk 2 ch 2 decoding, tokenization and normalization,
sublinear postings list intersection
wk 3 ch 3 dictionaries, wildcards, spelling correction
wk 4 ch 6 scoring and term weighting, term and
document frequency weighting, vector space models
wk 5 ch 8 evaluation
wk 6 ch 21 link analysis, PageRank
wk 7 ch 9 relevance feedback and query analysis
HOW will we study the book?
Homework: read the chapter thoroughly
Lectures: overview of chapter
Labtime/homework: do exercises
Next lecture: remaining questions
Full slide presentations of the chapters by one of the
authors are available as well:
author's slides
Labtime
1. Exercises (from the book + more)
2. Try out simple techniques in Python
3. More...
4. More...
Course objectives
• knowledge of IR terminology
• insight into IR models and IR processes
• knowledge of methods of indexing, querying,
retrieving and ranking
• knowledge of methods of evaluation of IR systems
• practical experience with use, adaptation and testing
of some of the basic IR algorithms and techniques
Chapter 1: Boolean retrieval
1. General introduction on IR
2. Boolean systems
3. Representation of information
4. Retrieving documents
5. Efficiency aspects
Boolean retrieval
The first IR systems were Boolean systems
Queries are formulated with the Boolean
operators AND, OR and NOT:
• Brutus AND Caesar
• (Brutus OR Caesar) AND NOT Cleopatra
• Brutus OR (Caesar AND NOT Cleopatra)
• NOT Brutus
How about Google queries?
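The four example queries can be tried out with Python sets. This is only a sketch: the six docIDs and the documents each name occurs in are made up. AND becomes intersection, OR union, and NOT the complement within the collection.

```python
# Hypothetical collection of six docIDs and the documents each name occurs in.
all_docs = {1, 2, 3, 4, 5, 6}
brutus = {1, 2, 4}
caesar = {1, 2, 4, 5, 6}
cleopatra = {1}

q1 = brutus & caesar                              # Brutus AND Caesar
q2 = (brutus | caesar) & (all_docs - cleopatra)   # (Brutus OR Caesar) AND NOT Cleopatra
q3 = brutus | (caesar & (all_docs - cleopatra))   # Brutus OR (Caesar AND NOT Cleopatra)
q4 = all_docs - brutus                            # NOT Brutus

print(sorted(q1))  # [1, 2, 4]
print(sorted(q2))  # [2, 4, 5, 6]
print(sorted(q3))  # [1, 2, 4, 5, 6]
print(sorted(q4))  # [3, 5, 6]
```

Note that the second and third query differ only in bracketing, yet return different results.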
Information from documents
• Each document in the system needs a unique
docID
• Tokenization is the process of splitting a text
into separate tokens (not trivial!)
• For a simple boolean system we just need to
know which terms are present in which doc
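A deliberately naive sketch of these three points, assuming a made-up two-document collection. The tokenizer simply lowercases and keeps runs of letters and digits, which is not how a real system would do it. The output shows why tokenization is not trivial.

```python
import re

def tokenize(text):
    """Very naive tokenizer: lowercase, then keep runs of letters and digits."""
    return re.findall(r"[a-z0-9]+", text.lower())

# Hypothetical collection: each document gets a unique docID.
docs = {
    1: "Brutus killed Caesar.",
    2: "Caesar wasn't ambitious, said Antony.",
}

# For a Boolean system it is enough to know which terms occur in which doc.
terms_in = {doc_id: set(tokenize(text)) for doc_id, text in docs.items()}

print(tokenize(docs[2]))        # ['caesar', 'wasn', 't', 'ambitious', 'said', 'antony']
print("caesar" in terms_in[1])  # True
```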
Term document incidence matrix
            Doc 1  Doc 2  Doc 3  Doc 4
Antony        1      1      0      0
Brutus        1      1      1      0
Caesar        1      1      0      1
Cleopatra     1      0      0      0
Antony AND Brutus AND NOT Cleopatra?
in huge collections > 99% of entries are 0
not a good representation, no efficient processing
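A small sketch of how the query is answered on the incidence matrix, using the four rows of the slide as lists of bits. The helper names `bit_and` and `bit_not` are made up for this example.

```python
# Rows of the incidence matrix, one bit per document (Doc 1 .. Doc 4).
incidence = {
    "Antony":    [1, 1, 0, 0],
    "Brutus":    [1, 1, 1, 0],
    "Caesar":    [1, 1, 0, 1],
    "Cleopatra": [1, 0, 0, 0],
}

def bit_and(u, v):
    """Position-wise AND of two rows."""
    return [a & b for a, b in zip(u, v)]

def bit_not(u):
    """Position-wise complement of a row."""
    return [1 - a for a in u]

# Antony AND Brutus AND NOT Cleopatra
result = bit_and(bit_and(incidence["Antony"], incidence["Brutus"]),
                 bit_not(incidence["Cleopatra"]))
matching = [f"Doc {i}" for i, bit in enumerate(result, start=1) if bit]

print(result)    # [0, 1, 0, 0]
print(matching)  # ['Doc 2']
```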
Building an inverted file
1. Give DocIDs and tokenize the texts
2. Gather terms with their docID
3. Sort on terms and docID
4. Now list the unique terms with their
document frequency and link to the
postings list with docIDs
[term, docfreq] → postings list
[Caesar, 3] → [1,2,4]
Inverted file / index
term        df    postings list
Antony      2  →  1, 2
Brutus      3  →  1, 2, 3
Caesar      3  →  1, 2, 4
Cleopatra   1  →  1
Antony AND Brutus AND NOT Cleopatra?
efficient processing if sorted on DocID
simple merging algorithms for AND / OR
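A sketch of such merging for AND and AND NOT, assuming both postings lists are sorted on docID. The function names are made up; the postings are read off the incidence matrix shown earlier. Each list is walked once, so no docID is compared more often than needed.

```python
def intersect(p1, p2):
    """AND: merge two postings lists that are sorted on docID."""
    answer, i, j = [], 0, 0
    while i < len(p1) and j < len(p2):
        if p1[i] == p2[j]:
            answer.append(p1[i])
            i += 1
            j += 1
        elif p1[i] < p2[j]:
            i += 1
        else:
            j += 1
    return answer

def and_not(p1, p2):
    """AND NOT: keep the docIDs of p1 that do not occur in p2 (both sorted)."""
    answer, i, j = [], 0, 0
    while i < len(p1):
        if j == len(p2) or p1[i] < p2[j]:
            answer.append(p1[i])
            i += 1
        elif p1[i] == p2[j]:
            i += 1
            j += 1
        else:
            j += 1
    return answer

antony, brutus, cleopatra = [1, 2], [1, 2, 3], [1]
# Antony AND Brutus AND NOT Cleopatra
print(and_not(intersect(antony, brutus), cleopatra))  # [2]
```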
Distributive laws
a AND (b OR c) = (a AND b) OR (a AND c)
(a OR b) AND (c OR d) = ??
NOT(a OR b) = NOT(a) AND NOT(b)
NOT(a AND b) = ??
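Laws like these can be checked by brute force over all truth assignments. The sketch below uses a made-up helper, `equivalent`, and checks the two laws that are written out in full. The same helper can be used to test your own answer to the open questions.

```python
from itertools import product

def equivalent(f, g, n):
    """True if f and g agree on every assignment of n Boolean variables."""
    return all(f(*vals) == g(*vals)
               for vals in product([False, True], repeat=n))

# a AND (b OR c) = (a AND b) OR (a AND c)
print(equivalent(lambda a, b, c: a and (b or c),
                 lambda a, b, c: (a and b) or (a and c), 3))  # True

# NOT(a OR b) = NOT(a) AND NOT(b)
print(equivalent(lambda a, b: not (a or b),
                 lambda a, b: (not a) and (not b), 2))  # True

# For comparison, a non-law: AND and OR are not interchangeable.
print(equivalent(lambda a, b: a and b,
                 lambda a, b: a or b, 2))  # False
```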
Conjunctive and disjunctive queries
The outer level of processing can be either
conjunctive (AND) or disjunctive (OR):
• Conjunctive normal form:
• a conjunction of disjunctions
• (a OR NOT b) AND (c OR d) AND e
• Disjunctive normal form:
• a disjunction of conjunctions
• (a AND NOT b) OR (c AND d) OR e
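A sketch of how the two example queries could be processed on sets of docIDs: for the conjunctive form the unions are intersected, for the disjunctive form the intersections are united. The collection, the term-to-docID mapping and the helper `literal` are made up for illustration.

```python
# Hypothetical collection of five docIDs and the documents containing each term.
all_docs = {1, 2, 3, 4, 5}
docs_with = {"a": {1, 2}, "b": {2, 3}, "c": {1, 4}, "d": {5}, "e": {1, 2, 5}}

def literal(term, negated=False):
    """Documents matching a term, or the complement for NOT term."""
    return all_docs - docs_with[term] if negated else docs_with[term]

# CNF: (a OR NOT b) AND (c OR d) AND e  -> intersect the unions
cnf = ((literal("a") | literal("b", negated=True))
       & (literal("c") | literal("d"))
       & literal("e"))

# DNF: (a AND NOT b) OR (c AND d) OR e  -> unite the intersections
dnf = ((literal("a") & literal("b", negated=True))
       | (literal("c") & literal("d"))
       | literal("e"))

print(sorted(cnf))  # [1, 5]
print(sorted(dnf))  # [1, 2, 5]
```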
The order of the size
Example
f(x) = 2x^3 + 5x^2 + x + 9
This is a function of O(x^3):
if x grows to infinity the factor x^3 is what
really determines the size of the outcome, the
rest can be neglected
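A quick numerical check of this claim for f(x) = 2x^3 + 5x^2 + x + 9: dividing f(x) by x^3 shows how little the lower-order terms contribute once x gets large. The chosen values of x are arbitrary.

```python
def f(x):
    return 2 * x**3 + 5 * x**2 + x + 9

# The ratio f(x) / x^3 moves towards the constant 2 as x grows,
# so the x^3 term alone determines the order of growth.
for x in (10, 100, 1000, 10000):
    print(x, f(x) / x**3)
```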
The order of time complexity
Example
To find common elements in two ordered lists, the
number of steps depends on the size of both lists:
O(x + y) (linear)
Without ordering we need to check all combinations:
O(x * y) (quadratic)
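The difference can be made visible by counting comparison steps. This is a sketch with made-up function names and two arbitrary lists of 1000 numbers each: the merge needs at most x + y steps, the all-pairs version exactly x * y.

```python
def common_merge(xs, ys):
    """Common elements of two sorted lists; returns (result, number of steps)."""
    result, steps, i, j = [], 0, 0, 0
    while i < len(xs) and j < len(ys):
        steps += 1
        if xs[i] == ys[j]:
            result.append(xs[i])
            i += 1
            j += 1
        elif xs[i] < ys[j]:
            i += 1
        else:
            j += 1
    return result, steps

def common_all_pairs(xs, ys):
    """Same result without relying on order: compare every pair."""
    result, steps = [], 0
    for x in xs:
        for y in ys:
            steps += 1
            if x == y:
                result.append(x)
    return result, steps

xs = list(range(0, 2000, 2))  # 1000 even numbers
ys = list(range(0, 3000, 3))  # 1000 multiples of three

merged, merge_steps = common_merge(xs, ys)
paired, pair_steps = common_all_pairs(xs, ys)

print(merged == paired)                    # True
print(merge_steps <= len(xs) + len(ys))    # True
print(pair_steps)                          # 1000000
```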
Big O notation
• Used to classify algorithms by how they
respond (e.g. in their processing time or
working space requirements) to changes in
input
• Best case, worst case, average case?
• Big O represents the upper bound (worst case)
• Other symbols are used for the lower bound (Ω), tight
bound (Θ), …
Guidance/questions on the text
Write down and try to find explanations of terms you don’t
know
p4 do you know the sizes KB, MB, GB, etc.?
p5 fig 1.3: look back to fig 1.1
p7 what types of linguistic preprocessing do you see in the
examples in step 3?
p11/12 do you understand the algorithms?
Are you able to explain now what an inverted file is and
how it is constructed?
Homework
… is on the web site …