650-1 - University of Michigan

January 6, 2003
Information Retrieval
Handout #1
(C) 2003, The University of Michigan
1
Course Information
• Instructor: Dragomir R. Radev (radev@si.umich.edu)
• Office: 3080, West Hall Connector
• Phone: (734) 615-5225
• Office hours: TBA
• Course page: http://tangra.si.umich.edu/~radev/650/
• Class meets on Mondays, 1-4 PM in 409 West Hall
(C) 2003, The University of Michigan
2
Introduction
(C) 2003, The University of Michigan
3
Demos
• Google
• Vivísimo
• AskJeeves
• NSIR
• Lemur
• MG
(C) 2003, The University of Michigan
4
Syllabus (Part I)
CLASSIC IR
Week 1 The Concept of Information Need, IR Models, Vector models,
Boolean models
Week 2 Retrieval Evaluation, Precision and Recall, F-measure, Reference
collection, The TREC conferences
Week 3 Queries and Documents, Query Languages, Natural language
querying, Relevance feedback
Week 4 Indexing and Searching, Inverted indexes
Week 5 XML retrieval
Week 6 Language modeling approaches
(C) 2003, The University of Michigan
5
Syllabus (Part II)
WEB-BASED IR
Week 7 Crawling the Web, hyperlink analysis, measuring the Web
Week 8 Similarity and clustering, bottom-up and top-down paradigms
Week 9 Social network analysis for IR, Hubs and authorities, PageRank
and HITS
Week 10 Focused crawling, Resource discovery, discovering communities
Week 11 Question answering
Week 12 Additional topics, e.g., relevance transfer
Week 13 Project presentations
(C) 2003, The University of Michigan
6
Readings
BOOKS
Ricardo Baeza-Yates and Berthier Ribeiro-Neto, Modern Information Retrieval, Addison-Wesley/ACM Press, 1999
http://www.sims.berkeley.edu/~hearst/irbook/
Soumen Chakrabarti, Mining the Web, Morgan Kaufmann, 2002
http://www.cse.iitb.ac.in/~soumen/
PAPERS
Bharat and Broder "A technique for measuring the relative size and overlap of public Web search engines" WWW 1998
Barabasi and Albert "Emergence of scaling in random networks" Science (286) 509-512, 1999
Chakrabarti, van den Berg, and Dom "Focused Crawling" WWW 1999
Davison "Topical locality on the Web" SIGIR 2000
Dean and Henzinger "Finding related pages in the World Wide Web" WWW 1999
Jeong and Barabási "Diameter of the world wide web" Nature (401) 130-131, 1999
Hawking, Voorhees, Craswell, and Bailey "Overview of the TREC-8 Web Track" TREC 2000
Haveliwala "Topic-sensitive pagerank" WWW 2002
Lawrence and Giles "Accessibility of information on the Web" Nature (400) 107-109, 1999
Lawrence and Giles "Searching the World-Wide Web" Science (280) 98-100, 1998
Menczer "Links tell us about lexical and semantic Web content" arXiv 2001
Menczer "Growing and Navigating the Small World Web by Local Content”. Proc. Natl. Acad. Sci. USA 99(22) 2002
Page, Brin, Motwani, and Winograd "The PageRank citation ranking: Bringing order to the Web" Stanford TR, 1998
Radev, Fan, Qi, Wu and Grewal "Probabilistic Question Answering on the Web" WWW 2002
Radev et al. "Content Diffusion on the Web Graph"
CASE STUDIES (IR SYSTEMS)
Lemur, MG, Google, AskJeeves, NSIR
(C) 2003, The University of Michigan
7
Assignments
Homeworks:
The course will have three homework assignments in the form of
problem sets. Each problem set will include essay-type questions,
questions designed to show understanding of specific concepts, and
hands-on exercises involving existing IR engines.
Project:
The final course project can be done in three different formats:
(1) a programming project implementing a challenging and novel
information retrieval application,
(2) an extensive survey-style research paper providing an exhaustive
look at an area of IR, or
(3) a SIGIR-style experimental IR paper.
(C) 2003, The University of Michigan
8
Grading
• Three HW assignments (30%)
• Project (30%)
• Final (40%)
(C) 2003, The University of Michigan
9
Topics
• IR systems
• Evaluation methods
• Indexing, search, and retrieval
(C) 2003, The University of Michigan
10
Need for IR
• Advent of WWW - more than 3 Billion
documents indexed on Google
• How much information?
http://www.sims.berkeley.edu/research/projects/how-much-info/
• Search, routing, filtering
• User’s information need
(C) 2003, The University of Michigan
11
Some definitions of Information
Retrieval (IR)
Salton (1989): “Information-retrieval systems process files of
records and requests for information, and identify and retrieve from
the files certain records in response to the information requests. The
retrieval of particular records depends on the similarity between the
records and the queries, which in turn is measured by comparing
the values of certain attributes to records and information requests.”
Kowalski (1997): “An Information Retrieval System is a system
that is capable of storage, retrieval, and maintenance of
information. Information in this context can be composed of text
(including numeric and date data), images, audio, video, and other
multi-media objects.”
(C) 2003, The University of Michigan
12
Examples of IR systems
• Conventional (library catalog).
Search by keyword, title, author, etc.
• Text-based (Lexis-Nexis, Google, FAST).
Search by keywords. Limited search using queries in natural language.
• Multimedia (QBIC, WebSeek, SaFe)
Search by visual appearance (shapes, colors,… ).
• Question answering systems (AskJeeves, NSIR,
Answerbus)
Search in (restricted) natural language
(C) 2003, The University of Michigan
13
Types of queries (AltaVista)
Including or excluding words:
To make sure that a specific word is always included in your search topic, place the
plus (+) symbol before the key word in the search box. To make sure that a specific
word is always excluded from your search topic, place a minus (-) sign before the
keyword in the search box.
Example: To find recipes for cookies with oatmeal but without raisins, try
recipe cookie +oatmeal -raisin.
Expand your search using wildcards (*):
By typing an * at the end of a keyword, you can search for the word with multiple
endings.
Example: Try wish*, to find wish, wishes, wishful, wishbone, and wishy-washy.
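As an illustration only (not AltaVista's real query processor), the sketch below applies +term, -term, and trailing-wildcard constraints to a toy collection; the documents and the matches() helper are invented for the example.

# Hypothetical sketch of +term / -term / trailing-wildcard filtering.
# The documents and helper names are invented for illustration.

docs = {
    1: "oatmeal cookie recipe with raisin",
    2: "oatmeal cookie recipe without dried fruit",
    3: "chocolate chip cookie recipe",
}

def matches(text, required, excluded, prefixes):
    tokens = set(text.lower().split())
    if not all(term in tokens for term in required):
        return False                      # every +term must be present
    if any(term in tokens for term in excluded):
        return False                      # no -term may be present
    # a trailing wildcard like wish* matches any token starting with "wish"
    return all(any(tok.startswith(p) for tok in tokens) for p in prefixes)

# recipe cookie +oatmeal -raisin  (the unsigned terms would only affect ranking)
hits = [d for d, text in docs.items()
        if matches(text, required={"oatmeal"}, excluded={"raisin"}, prefixes=set())]
print(hits)   # -> [2]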
(C) 2003, The University of Michigan
16
Types of queries
AND (&)
Finds only documents containing all of the specified words or phrases.
Mary AND lamb
finds documents with both the word Mary and the word lamb.
OR (|)
Finds documents containing at least one of the specified words or phrases.
Mary OR lamb
finds documents containing either Mary or lamb. The found documents could contain both, but do not have to.
NOT (!)
Excludes documents containing the specified word or phrase.
Mary AND NOT lamb
finds documents with Mary but not containing lamb. NOT cannot stand alone--use it with another operator, like AND.
NEAR (~)
Finds documents containing both specified words or phrases within 10 words of each other.
Mary NEAR lamb
would find the nursery rhyme, but likely not religious or Christmas-related documents.
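An illustrative sketch (not any engine's actual code) of how AND, OR, AND NOT, and a 10-word NEAR could be evaluated over tokenized documents; the two example documents are invented.

# Toy evaluation of AND, AND NOT, and NEAR over token positions.
docs = {
    "rhyme":  "mary had a little lamb its fleece was white as snow",
    "report": "mary presented the annual budget report to the board while everyone "
              "waited and the roast lamb was finally served",
}

def contains(doc, term):
    return term in docs[doc].split()

def near(doc, t1, t2, window=10):
    toks = docs[doc].split()
    pos1 = [i for i, t in enumerate(toks) if t == t1]
    pos2 = [i for i, t in enumerate(toks) if t == t2]
    # true if some occurrence of t1 is within `window` words of some occurrence of t2
    return any(abs(i - j) <= window for i in pos1 for j in pos2)

for d in docs:
    print(d,
          contains(d, "mary") and contains(d, "lamb"),        # mary AND lamb
          contains(d, "mary") and not contains(d, "lamb"),    # mary AND NOT lamb
          near(d, "mary", "lamb"))                            # mary NEAR lamb

Only the nursery rhyme satisfies mary NEAR lamb; both documents satisfy mary AND lamb.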
(C) 2003, The University of Michigan
17
Mappings and abstractions
Reality → Data
Information need → Query
(C) 2003, The University of Michigan
From Korfhage’s book
18
Documents
• Not just printed paper
• collections vs. documents
• data structures: representations
• document surrogates: keywords, summaries
• encoding: ASCII, Unicode, etc.
(C) 2003, The University of Michigan
19
Typical IR system
• (Crawling)
• Indexing
• Retrieval
• User interface
(C) 2003, The University of Michigan
20
Sample queries (from Excite)
In what year did baseball become an offical sport?
play station codes . com
birth control and depression
government
"WorkAbility I"+conference
kitchen appliances
where can I find a chines rosewood
tiger electronics
58 Plymouth Fury
How does the character Seyavash in Ferdowsi's Shahnameh exhibit characteristics of a hero?
emeril Lagasse
Hubble
M.S Subalaksmi
running
(C) 2003, The University of Michigan
21
Size matters
• Typical document surrogate: 200 to 2000
bytes
• Book: up to 3 MB of data
• Stemming: computer, computational,
computing
(C) 2003, The University of Michigan
22
Key Terms Used in IR
• QUERY: a representation of what the user is looking for; it can be a list of words or a phrase.
• DOCUMENT: an information entity that the user wants to
retrieve
• COLLECTION: a set of documents
• INDEX: a representation of information that makes
querying easier
• TERM: word or concept that appears in a document or a
query
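To make INDEX and TERM concrete, here is a minimal inverted-index sketch (a toy collection and whitespace tokenization are assumed, not taken from the course).

# Minimal inverted index: maps each term to the set of documents containing it.
from collections import defaultdict

collection = {
    "d1": "computer information retrieval",
    "d2": "computer retrieval",
    "d3": "information",
}

index = defaultdict(set)
for doc_id, text in collection.items():
    for term in text.lower().split():
        index[term].add(doc_id)

print(sorted(index["information"]))   # -> ['d1', 'd3']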
(C) 2003, The University of Michigan
23
Other important terms
• Classification
• Cluster
• Similarity
• Information Extraction
• Term Frequency
• Inverse Document Frequency
• Precision
• Recall
• Inverted File
• Query Expansion
• Relevance
• Relevance Feedback
• Stemming
• Stopword
• Vector Space Model
• Weighting
• TREC/TIPSTER/MUC
(C) 2003, The University of Michigan
24
Query structures
• Query viewed as a document?
– Length
– repetitions
– syntactic differences
• Types of matches:
– exact
– range
– approximate
(C) 2003, The University of Michigan
25
Additional references on IR
• Gerard Salton, Automatic Text Processing, Addison-Wesley (1989)
• Gerald Kowalski, Information Retrieval Systems: Theory
and Implementation, Kluwer (1997)
• Gerard Salton and M. McGill, Introduction to Modern
Information Retrieval, McGraw-Hill (1983)
• C. J. van Rijsbergen, Information Retrieval, Butterworths
(1979)
• Ian H. Witten, Alistair Moffat, and Timothy C. Bell,
Managing Gigabytes, Van Nostrand Reinhold (1994)
• ACM SIGIR Proceedings, SIGIR Forum
• ACM conferences in Digital Libraries
(C) 2003, The University of Michigan
26
Related courses elsewhere
• Berkeley (Marti Hearst and Ray Larson)
http://www.sims.berkeley.edu/courses/is202/f00/
• Stanford (Chris Manning, Prabhakar
Raghavan, and Hinrich Schuetze)
http://www.stanford.edu/class/cs276a/
• Cornell (Jon Kleinberg)
http://www.cs.cornell.edu/Courses/cs685/2002fa/
• CMU (Yiming Yang and Jamie Callan)
http://la.lti.cs.cmu.edu/classes/11-741/
(C) 2003, The University of Michigan
27
Readings for weeks 1 – 3
• MIR (Modern Information Retrieval)
– Week 1
• Chapter 1 “Introduction”
• Chapter 2 “Modeling”
• Chapter 3 “Evaluation”
– Week 2
• Chapter 4 “Query languages”
• Chapter 5 “Query operations”
– Week 3
• Chapter 6 “Text and multimedia languages”
• Chapter 7 “Text operations”
• Chapter 8 “Indexing and searching”
(C) 2003, The University of Michigan
28
IR models
(C) 2003, The University of Michigan
29
Major IR models
• Boolean
• Vector
• Probabilistic
• Language modeling
• Fuzzy
• Latent semantic indexing
(C) 2003, The University of Michigan
30
Major IR tasks
• Ad-hoc
• Filtering and routing
• Question answering
• Spoken document retrieval
• Multimedia retrieval
(C) 2003, The University of Michigan
31
Venn diagrams
[Figure: Venn diagram of two documents D1 and D2, with regions labeled w, x, y, and z]
(C) 2003, The University of Michigan
32
Boolean model
[Figure: Venn diagram of two sets A and B]
(C) 2003, The University of Michigan
33
Boolean queries
restaurants AND (Mideastern OR vegetarian) AND inexpensive
• What types of documents are returned?
• Stemming
• thesaurus expansion
• inclusive vs. exclusive OR
• confusing uses of AND and OR
dinner AND sports AND symphony
4 OF (Pentium, printer, cache, PC, monitor, computer, personal)
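One plausible reading of the last query is quorum matching: return documents that contain at least 4 of the 7 listed terms. A sketch under that assumption (the example documents are invented):

# Quorum-style "4 OF (...)" query: keep documents containing at least k terms.
terms = {"pentium", "printer", "cache", "pc", "monitor", "computer", "personal"}

docs = {
    "ad1": "personal computer with pentium cpu large cache and flat monitor",
    "ad2": "printer ink cartridges on sale",
}

def quorum_match(text, terms, k):
    return len(terms & set(text.lower().split())) >= k

print([d for d, t in docs.items() if quorum_match(t, terms, k=4)])  # -> ['ad1']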
(C) 2003, The University of Michigan
34
Boolean queries
• Weighting (Beethoven AND sonatas)
• precedence
coffee AND croissant OR muffin
raincoat AND umbrella OR sunglasses
• Use of negation: potential problems
• Conjunctive and Disjunctive normal forms
• Full CNF and DNF
(C) 2003, The University of Michigan
35
Transformations
• De Morgan’s Laws:
NOT (A AND B) = (NOT A) OR (NOT B)
NOT (A OR B) = (NOT A) AND (NOT B)
• CNF or DNF?
– Reference librarians prefer CNF - why?
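A quick sanity check of De Morgan's laws on sets of document IDs (the collection and postings below are invented for illustration; NOT is complement with respect to the collection):

# De Morgan's laws verified on sets of document IDs.
collection = {1, 2, 3, 4, 5, 6}
A = {1, 2, 3}        # documents matching term A
B = {3, 4}           # documents matching term B

def NOT(s):
    return collection - s

assert NOT(A & B) == NOT(A) | NOT(B)   # NOT (A AND B) = (NOT A) OR (NOT B)
assert NOT(A | B) == NOT(A) & NOT(B)   # NOT (A OR B)  = (NOT A) AND (NOT B)
print("De Morgan's laws hold on this collection")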
(C) 2003, The University of Michigan
36
Boolean model
• Partition
• Partial relevance?
• Operators: AND, NOT, OR, parentheses
(C) 2003, The University of Michigan
37
Exercise
• D1 = "computer information retrieval"
• D2 = "computer retrieval"
• D3 = "information"
• D4 = "computer information"

• Q1 = "information ∧ retrieval"
• Q2 = "information ∧ ¬computer"
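A minimal sketch for checking the answers, assuming ∧ is Boolean AND and ¬ is complement with respect to the whole collection:

# Evaluate the two Boolean queries over D1-D4 using set operations.
docs = {
    "D1": "computer information retrieval",
    "D2": "computer retrieval",
    "D3": "information",
    "D4": "computer information",
}

def postings(term):
    return {d for d, text in docs.items() if term in text.split()}

all_docs = set(docs)
q1 = postings("information") & postings("retrieval")               # information ∧ retrieval
q2 = postings("information") & (all_docs - postings("computer"))   # information ∧ ¬computer
print(sorted(q1))   # -> ['D1']
print(sorted(q2))   # -> ['D3']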
(C) 2003, The University of Michigan
38
Exercise
Region  Authors present
  0     (none)
  1     Swift
  2     Shakespeare
  3     Shakespeare, Swift
  4     Milton
  5     Milton, Swift
  6     Milton, Shakespeare
  7     Milton, Shakespeare, Swift
  8     Chaucer
  9     Chaucer, Swift
 10     Chaucer, Shakespeare
 11     Chaucer, Shakespeare, Swift
 12     Chaucer, Milton
 13     Chaucer, Milton, Swift
 14     Chaucer, Milton, Shakespeare
 15     Chaucer, Milton, Shakespeare, Swift
((chaucer OR milton) AND (NOT swift)) OR ((NOT chaucer) AND (swift OR shakespeare))
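One way to check the answer is to encode each of the 16 regions by the authors it contains (as in the table above) and evaluate the Boolean expression directly; a minimal sketch:

# Which of the 16 regions satisfy
# ((chaucer OR milton) AND (NOT swift)) OR ((NOT chaucer) AND (swift OR shakespeare)) ?
authors = ["swift", "shakespeare", "milton", "chaucer"]   # bits 0..3, matching the table

def region_authors(r):
    return {a for i, a in enumerate(authors) if r & (1 << i)}

def satisfies(present):
    chaucer, milton = "chaucer" in present, "milton" in present
    swift, shakespeare = "swift" in present, "shakespeare" in present
    return ((chaucer or milton) and not swift) or ((not chaucer) and (swift or shakespeare))

print([r for r in range(16) if satisfies(region_authors(r))])   # the satisfying regions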
(C) 2003, The University of Michigan
39
Stop lists
• 250-300 most common words in English
account for 50% or more of a given text.
• Example: “the” and “of” represent 10% of
tokens. “and”, “to”, “a”, and “in” - another
10%. Next 12 words - another 10%.
• Moby Dick Ch.1: 859 unique words (types),
2256 word occurrences (tokens). Top 65
types cover 1132 tokens (> 50%).
• Token/type ratio: 2256/859 = 2.63
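As an aside, the same statistics can be computed for any text with a short sketch (the input file name is hypothetical; the counts will differ from the Moby Dick figures above):

# Count word types vs. tokens and the coverage of the most frequent types.
from collections import Counter
import re

text = open("moby_dick_ch1.txt").read().lower()   # hypothetical input file
tokens = re.findall(r"[a-z']+", text)
counts = Counter(tokens)

print("tokens:", len(tokens))
print("types:", len(counts))
print("token/type ratio:", len(tokens) / len(counts))

top65 = sum(c for _, c in counts.most_common(65))
print("coverage of top 65 types:", top65 / len(tokens))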
(C) 2003, The University of Michigan
40
Vector-based representation
[Figure: documents Doc 1, Doc 2, and Doc 3 plotted as vectors in a space whose axes are Term 1, Term 2, and Term 3]
(C) 2003, The University of Michigan
41
Vector queries
• Each document is represented as a vector
• non-efficient representations (bit vectors)
• dimensional compatibility
[Figure: a document represented as a vector of weights W1 ... W10 over collection terms C1 ... C10]
(C) 2003, The University of Michigan
42
The matching process
• Matching is done between a document and a
query - topicality
• document space
• characteristic function F(d) = {0,1}
• distance vs. similarity - mapping functions
• Euclidean distance, Manhattan distance,
Word overlap
(C) 2003, The University of Michigan
43
Vector-based matching
• The Cosine measure:

  σ(D,Q) = Σi (di × qi) / √( Σi di² × Σi qi² )
• Intrinsic vs. extrinsic measures
(C) 2003, The University of Michigan
44
Exercise
• Compute the cosine measures σ(D1,D2) and
σ(D1,D3) for the documents: D1 = <1,3>,
D2 = <100,300> and D3 = <3,1>
• Compute the corresponding Euclidean
distances.
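To check your answers, a minimal Python sketch of the cosine measure defined above and of the Euclidean distance:

# Cosine similarity and Euclidean distance for the exercise vectors.
import math

def cosine(d, q):
    num = sum(di * qi for di, qi in zip(d, q))
    return num / math.sqrt(sum(di * di for di in d) * sum(qi * qi for qi in q))

def euclidean(d, q):
    return math.sqrt(sum((di - qi) ** 2 for di, qi in zip(d, q)))

D1, D2, D3 = (1, 3), (100, 300), (3, 1)
print(cosine(D1, D2), euclidean(D1, D2))   # cosine = 1.0, distance ≈ 313.1
print(cosine(D1, D3), euclidean(D1, D3))   # cosine = 0.6, distance ≈ 2.83

Note that D1 and D2 point in exactly the same direction (cosine 1.0) even though they are far apart in Euclidean terms.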
(C) 2003, The University of Michigan
45
Matrix representations
• Term-document matrix (m x n)
• term-term matrix (m x m x n)
• document-document matrix (n x n)
• Example: 3,000,000 documents (n) with 50,000 terms (m)
• sparse matrices
• Boolean vs. integer matrices
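As an illustration of why sparsity matters, the sketch below derives a small document-document matrix on demand from a sparse term-document representation instead of storing a dense n x n array (the toy collection is invented; entries simply count shared terms):

# Document-document matrix derived from sparse document term sets.
from itertools import combinations

collection = {
    "d1": {"computer", "information", "retrieval"},
    "d2": {"computer", "retrieval"},
    "d3": {"information"},
}

doc_doc = {(a, b): len(collection[a] & collection[b])
           for a, b in combinations(sorted(collection), 2)}
print(doc_doc)   # -> {('d1', 'd2'): 2, ('d1', 'd3'): 1, ('d2', 'd3'): 0}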
(C) 2003, The University of Michigan
46
Zipf’s law
Rank x Frequency ≈ Constant

Rank  Term   Freq.   Z          Rank  Term   Freq.   Z
  1   the   69,971  0.070         6   in    21,341  0.128
  2   of    36,411  0.073         7   that  10,595  0.074
  3   and   28,852  0.086         8   is    10,099  0.081
  4   to    26,149  0.104         9   was    9,816  0.088
  5   a     23,237  0.116        10   he     9,543  0.095
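The Z column is consistent with Z = rank × frequency / N for a corpus of roughly N = 1,000,000 tokens; the sketch below recomputes it under that assumption (N is inferred from the table, not stated on the slide):

# Recompute the Z column as rank * frequency / N, with N assumed to be 1,000,000 tokens.
data = [
    (1, "the", 69_971), (2, "of", 36_411), (3, "and", 28_852),
    (4, "to", 26_149), (5, "a", 23_237), (6, "in", 21_341),
    (7, "that", 10_595), (8, "is", 10_099), (9, "was", 9_816), (10, "he", 9_543),
]
N = 1_000_000
for rank, term, freq in data:
    print(f"{rank:2d} {term:4s} {rank * freq / N:.3f}")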
(C) 2003, The University of Michigan
47
Evaluation
(C) 2003, The University of Michigan
48
Contingency table
               retrieved      not retrieved
relevant           w                x            n1 = w + x
not relevant       y                z
               n2 = w + y                            N

(C) 2003, The University of Michigan
49
Precision and Recall
Recall:    R = w / (w + x)

Precision: P = w / (w + y)
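As a quick illustration (the counts w=30, x=20, y=10, z=940 are made-up example values, not from the course):

# Precision and recall from the contingency counts defined above.
def precision_recall(w, x, y, z):
    recall = w / (w + x)       # fraction of relevant documents that were retrieved
    precision = w / (w + y)    # fraction of retrieved documents that are relevant
    return precision, recall

print(precision_recall(w=30, x=20, y=10, z=940))   # -> (0.75, 0.6)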
(C) 2003, The University of Michigan
50
Exercise
Go to Google (www.google.com) and search for documents on
Tolkien’s “Lord of the Rings”. Try different ways of phrasing
the query: e.g., Tolkien, “JRR Melville”, +”JRR Tolkien”
+”Lord of the Rings”, etc. For each query, compute the precision
(P) based on the first 10 documents returned by Google.
Note! Before starting the exercise, have a clear idea of what a
relevant document for your query should look like.
(C) 2003, The University of Michigan
51
 n   Doc. no   Relevant?   Recall   Precision
 1     588         x        0.2       1.00
 2     589         x        0.4       1.00
 3     576                  0.4       0.67
 4     590         x        0.6       0.75
 5     986                  0.6       0.60
 6     592         x        0.8       0.67
 7     984                  0.8       0.57
 8     988                  0.8       0.50
 9     578                  0.8       0.44
10     985                  0.8       0.40
11     103                  0.8       0.36
12     591                  0.8       0.33
13     772         x        1.0       0.38
14     990                  1.0       0.36

[From Salton’s book]
(C) 2003, The University of Michigan
52
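The running recall and precision columns in the table above can be reproduced with a short sketch (the relevant set is the five documents marked with "x" in the reconstruction):

# Recompute recall and precision after each rank position.
ranked = [588, 589, 576, 590, 986, 592, 984, 988, 578, 985, 103, 591, 772, 990]
relevant = {588, 589, 590, 592, 772}   # 5 relevant documents for this query

found = 0
for n, doc in enumerate(ranked, start=1):
    if doc in relevant:
        found += 1
    print(f"{n:2d} {doc} {found / len(relevant):.1f} {found / n:.2f}")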
P/R graph
[Figure: precision-recall curve for the ranked list above, with Recall on the x-axis and Precision on the y-axis, both plotted from 0 to 1.2]
(C) 2003, The University of Michigan
53
Issues
• Standard levels for P&R (0-100%)
• Interpolation
• Average P&R
• Average P at given “document cutoff values”
• F-measure: F = 2/(1/R+1/P)
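A sketch of the F-measure above together with interpolated precision at the 11 standard recall levels (the (recall, precision) points are taken from the earlier ranked-list exercise):

# F-measure and interpolated precision at standard recall levels (0.0 to 1.0).
def f_measure(p, r):
    return 2 / (1 / r + 1 / p)          # harmonic mean of precision and recall

# (recall, precision) at each point where a new relevant document is found
points = [(0.2, 1.00), (0.4, 1.00), (0.6, 0.75), (0.8, 0.67), (1.0, 0.38)]

def interpolated(level, points):
    # interpolated precision = maximum precision at any recall >= level
    return max(p for r, p in points if r >= level)

for level in [i / 10 for i in range(11)]:
    print(f"recall {level:.1f}: P_interp = {interpolated(level, points):.2f}")

print("F at R=0.8, P=0.67:", round(f_measure(0.67, 0.8), 2))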
(C) 2003, The University of Michigan
54
Relevance collections
• TREC adhoc collections, 2-6 GB
• TREC Web collections, 2-100GB
(C) 2003, The University of Michigan
55
Sample TREC query
<top>
<num> Number: 305
<title> Most Dangerous Vehicles
<desc> Description:
Which are the most crashworthy, and least crashworthy,
passenger vehicles?
<narr> Narrative:
A relevant document will contain information on the
crashworthiness of a given vehicle or vehicles that can be
used to draw a comparison with other vehicles. The
document will have to describe/compare vehicles, not
drivers. For instance, it should be expected that vehicles
preferred by 16-25 year-olds would be involved in more
crashes, because that age group is involved in more crashes.
I would view number of fatalities per 100 crashes to be more
revealing of a vehicle's crashworthiness than the number of
crashes per 100,000 miles, for example.
</top>
(C) 2003, The University of Michigan
LA031689-0177
FT922-1008
LA090190-0126
LA101190-0218
LA082690-0158
LA112590-0109
FT944-136
LA020590-0119
FT944-5300
LA052190-0048
LA051689-0139
FT944-9371
LA032390-0172
LA042790-0172
LA021790-0136
LA092289-0167
LA111189-0013
LA120189-0179
LA020490-0021
LA122989-0063
LA091389-0119
LA072189-0048
FT944-15615
LA091589-0101
LA021289-0208
56
<DOCNO> LA031689-0177 </DOCNO>
<DOCID> 31701 </DOCID>
<DATE><P>March 16, 1989, Thursday, Home Edition </P></DATE>
<SECTION><P>Business; Part 4; Page 1; Column 5; Financial Desk </P></SECTION>
<LENGTH><P>586 words </P></LENGTH>
<HEADLINE><P>AGENCY TO LAUNCH STUDY OF FORD BRONCO II AFTER HIGH RATE OF ROLL-OVER ACCIDENTS </P></HEADLINE>
<BYLINE><P>By LINDA WILLIAMS, Times Staff Writer </P></BYLINE>
<TEXT>
<P>The federal government's highway safety watchdog said Wednesday that the Ford Bronco II appears to be involved in more fatal roll-over
accidents than other vehicles in its class and that it will seek to determine if the vehicle itself contributes to the accidents. </P>
<P>The decision to do an engineering analysis of the Ford Motor Co. utility-sport vehicle grew out of a federal accident study of the
Suzuki Samurai, said Tim Hurd, a spokesman for the National Highway Traffic Safety Administration. NHTSA looked at Samurai accidents after
Consumer Reports magazine charged that the vehicle had basic design flaws. </P>
<P>Several Fatalities </P>
<P>However, the accident study showed that the "Ford Bronco II appears to have a higher number of single-vehicle, first event roll-overs,
particularly those involving fatalities," Hurd said. The engineering analysis of the Bronco, the second of three levels of investigation
conducted by NHTSA, will cover the 1984-1989 Bronco II models, the agency said. </P>
<P>According to a Fatal Accident Reporting System study included in the September report on the Samurai, 43 Bronco II single-vehicle
roll-overs caused fatalities, or 19 of every 100,000 vehicles. There were eight Samurai fatal roll-overs, or 6 per 100,000; 13 involving
the Chevrolet S10 Blazers or GMC Jimmy, or 6 per 100,000, and six fatal Jeep Cherokee roll-overs, for 2.5 per 100,000. After the
accident report, NHTSA declined to investigate the Samurai. </P>
...
</TEXT>
<GRAPHIC><P> Photo, The Ford Bronco II "appears to have a higher
number of single-vehicle, first event roll-overs," a federal official
said. </P></GRAPHIC>
<SUBJECT>
<P>TRAFFIC ACCIDENTS; FORD MOTOR CORP; NATIONAL HIGHWAY TRAFFIC SAFETY ADMINISTRATION; VEHICLE INSPECTIONS;
RECREATIONAL VEHICLES; SUZUKI MOTOR CO; AUTOMOBILE SAFETY </P>
</SUBJECT>
</DOC>
(C) 2003, The University of Michigan
57
TREC results
• http://trec.nist.gov/presentations/presentations.html
(C) 2003, The University of Michigan
58