Search Engine Technology

Information Retrieval
(4)
Prof. Dragomir R. Radev
radev@umich.edu
IR Winter 2010
…
7. Approximate string matching
…
Levenshtein edit distance
• Examples:
– Theatre -> theater
– Ghaddafi -> Qadafi
– Computer -> counter
• Edit distance (inserts, deletes,
substitutions)
– Edit transcript
• Done through dynamic programming
Recurrence relation
• Three dependencies
– D(i,0)=i
– D(0,j)=j
– D(i,j)=min[D(i-1,j)+1,D(i,j-1)+1,D(i-1,j-1)+t(i,j)]
• Simple edit distance:
– t(i,j) = 0 iff S1(i)=S2(j), and 1 otherwise
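The recurrence can be sketched in Python as follows. This is a minimal illustration that assumes unit costs for all three operations; the function name edit_distance is ours and does not come from the slides.

```python
def edit_distance(s1: str, s2: str) -> int:
    """Levenshtein distance via the D(i,j) recurrence (unit costs assumed)."""
    n, m = len(s1), len(s2)
    # D[i][j] = distance between the first i chars of s1 and the first j chars of s2
    D = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(n + 1):
        D[i][0] = i  # base case D(i,0) = i: delete i characters
    for j in range(m + 1):
        D[0][j] = j  # base case D(0,j) = j: insert j characters
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            t = 0 if s1[i - 1] == s2[j - 1] else 1
            D[i][j] = min(
                D[i - 1][j] + 1,      # deletion
                D[i][j - 1] + 1,      # insertion
                D[i - 1][j - 1] + t,  # match (t = 0) or substitution (t = 1)
            )
    return D[n][m]


print(edit_distance("theatre", "theater"))  # 2
print(edit_distance("vintner", "writers"))  # 5
```

The table takes O(nm) time and space; if only the distance is needed, keeping two rows at a time would be enough.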
Example
         W  R  I  T  E  R  S
      0  1  2  3  4  5  6  7
   0  0  1  2  3  4  5  6  7
V  1  1
I  2  2
N  3  3
T  4  4
N  5  5
E  6  6
R  7  7
Gusfield 1997
Example (cont’d)
         W  R  I  T  E  R  S
      0  1  2  3  4  5  6  7
   0  0  1  2  3  4  5  6  7
V  1  1  1  2  3  4  5  6  7
I  2  2  2  2  2  3  4  5  6
N  3  3  3  3  3  3  4  5  6
T  4  4  4  4  4  *
N  5  5
E  6  6
R  7  7
Gusfield 1997
Tracebacks
         W  R  I  T  E  R  S
      0  1  2  3  4  5  6  7
   0  0  1  2  3  4  5  6  7
V  1  1  1  2  3  4  5  6  7
I  2  2  2  2  2  3  4  5  6
N  3  3  3  3  3  3  4  5  6
T  4  4  4  4  4  *
N  5  5
E  6  6
R  7  7
Gusfield 1997
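A traceback can be sketched by walking from cell D(n,m) back to D(0,0) and recording which of the three dependencies produced each value. The sketch below assumes unit costs and uses the letters M (match), R (replace), I (insert), and D (delete) for the edit transcript; the function names and the tie-breaking order (diagonal first, then deletion, then insertion) are our own choices, so other equally short transcripts may exist.

```python
def edit_matrix(s1: str, s2: str) -> list:
    """Full dynamic-programming table for unit-cost edit distance."""
    n, m = len(s1), len(s2)
    D = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(n + 1):
        D[i][0] = i
    for j in range(m + 1):
        D[0][j] = j
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            t = 0 if s1[i - 1] == s2[j - 1] else 1
            D[i][j] = min(D[i - 1][j] + 1, D[i][j - 1] + 1, D[i - 1][j - 1] + t)
    return D


def edit_transcript(s1: str, s2: str) -> str:
    """One optimal edit transcript turning s1 into s2."""
    D = edit_matrix(s1, s2)
    i, j = len(s1), len(s2)
    ops = []
    while i > 0 or j > 0:
        if i > 0 and j > 0:
            t = 0 if s1[i - 1] == s2[j - 1] else 1
            if D[i][j] == D[i - 1][j - 1] + t:
                ops.append("M" if t == 0 else "R")  # diagonal move
                i, j = i - 1, j - 1
                continue
        if i > 0 and D[i][j] == D[i - 1][j] + 1:
            ops.append("D")  # move up: delete s1[i-1]
            i -= 1
        else:
            ops.append("I")  # move left: insert s2[j-1]
            j -= 1
    return "".join(reversed(ops))


transcript = edit_transcript("vintner", "writers")
print(transcript, sum(op != "M" for op in transcript))
```

The number of non-M operations in the transcript equals the edit distance, which gives a quick consistency check on the traceback.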
Weighted edit distance
• Used to emphasize the relative cost of
different edit operations
• Useful in bioinformatics
– Homology information
– BLAST
– Blosum
– http://eta.emblheidelberg.de:8000/misc/mat/blosum50.html
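Weighting can be sketched by replacing the constant costs in the recurrence with parameters. In the illustration below, ins_cost, del_cost, and sub_cost are hypothetical names, and the vowel-based substitution cost is a toy stand-in for a real scoring matrix such as BLOSUM; it is not taken from the slides.

```python
from typing import Callable, Optional


def weighted_edit_distance(
    s1: str,
    s2: str,
    ins_cost: float = 1.0,
    del_cost: float = 1.0,
    sub_cost: Optional[Callable[[str, str], float]] = None,
) -> float:
    """Edit distance with per-operation costs (all costs are assumptions)."""
    if sub_cost is None:
        # Default: the simple 0/1 substitution cost t(i,j)
        sub_cost = lambda a, b: 0.0 if a == b else 1.0
    n, m = len(s1), len(s2)
    D = [[0.0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        D[i][0] = D[i - 1][0] + del_cost
    for j in range(1, m + 1):
        D[0][j] = D[0][j - 1] + ins_cost
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            D[i][j] = min(
                D[i - 1][j] + del_cost,
                D[i][j - 1] + ins_cost,
                D[i - 1][j - 1] + sub_cost(s1[i - 1], s2[j - 1]),
            )
    return D[n][m]


VOWELS = set("aeiou")


def vowel_friendly(a: str, b: str) -> float:
    """Toy substitution cost: swapping one vowel for another is cheaper."""
    if a == b:
        return 0.0
    if a in VOWELS and b in VOWELS:
        return 0.5
    return 1.0


print(weighted_edit_distance("bat", "bet", sub_cost=vowel_friendly))  # 0.5
print(weighted_edit_distance("bat", "cat", sub_cost=vowel_friendly))  # 1.0
```

With the default arguments the function reduces to the simple edit distance.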
Links
• Web sites:
– http://www.merriampark.com/ld.htm
– http://odur.let.rug.nl/~kleiweg/lev/
• Demo:
– /home/cs6998/tools/editDistance/dp/l.pl theater theatre
– http://nayana.ece.ucsb.edu/imsearch/imsearch.html
Other methods
• Cosine
• Generation probabilities (language
modeling)
• (exp)KL-divergence
$D(p \,\|\, q) = \sum_{x} p(x) \log \frac{p(x)}{q(x)}$
with the conventions $0 \log \frac{0}{q} = 0$ and $p \log \frac{p}{0} = \infty$
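The KL-divergence and its two boundary conventions can be sketched as below. Distributions are represented as dictionaries from outcomes to probabilities, which is our own choice; an outcome missing from q is treated as having probability 0. One possible reading of "(exp)" is the similarity exp(-D), but that reading is an assumption and is not implemented here.

```python
import math


def kl_divergence(p: dict, q: dict) -> float:
    """D(p || q) = sum over x of p(x) * log(p(x) / q(x)), natural log."""
    total = 0.0
    for x, px in p.items():
        if px == 0.0:
            continue  # convention: 0 * log(0 / q) = 0
        qx = q.get(x, 0.0)
        if qx == 0.0:
            return math.inf  # convention: p * log(p / 0) = infinity
        total += px * math.log(px / qx)
    return total


p = {"a": 1.0}
q = {"a": 0.5, "b": 0.5}
print(kl_divergence(p, q))  # log 2, about 0.693
print(kl_divergence(q, p))  # inf, because q puts mass on "b" and p does not
```

The two calls give different values, which illustrates that the divergence is not symmetric.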
…
8. Query expansion
Relevance feedback
…
Query expansion
• Corpus-based: mine query logs
• NLP-based
• Vector-space relevance feedback
Relevance feedback
• Problem: initial query may not be the most
appropriate to satisfy a given information
need.
• Idea: modify the original query so that it
gets closer to the right documents in the
vector space
Relevance feedback
• Automatic
• Manual
• Method: identifying feedback terms
Q’ = a1Q + a2R - a3N
Often a1 = 1, a2 = 1/|R| and a3 = 1/|N|
Example
• Q = “safety minivans”
• D1 = “car safety minivans tests injury statistics” - relevant
• D2 = “liability tests safety” - relevant
• D3 = “car passengers injury reviews” - nonrelevant
• R=?
• N=?
• Q’ = ?
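One way to work through this example is sketched below. It assumes that documents are represented by raw term counts, that R and the non-relevant vector are the sums of the relevant and non-relevant document vectors, and that a1 = 1, a2 = 1/|R|, a3 = 1/|N|. The function name rocchio and the dictionary representation are ours; other weighting schemes (e.g. tf-idf) would give different numbers.

```python
from collections import Counter


def rocchio(query: str, relevant: list, nonrelevant: list) -> dict:
    """Q' = a1*Q + a2*R - a3*N over raw term counts (assumed weighting)."""
    q = Counter(query.split())
    r = Counter()
    for doc in relevant:
        r.update(doc.split())  # R: sum of relevant document vectors
    n = Counter()
    for doc in nonrelevant:
        n.update(doc.split())  # N: sum of non-relevant document vectors
    a1 = 1.0
    a2 = 1.0 / len(relevant) if relevant else 0.0
    a3 = 1.0 / len(nonrelevant) if nonrelevant else 0.0
    terms = set(q) | set(r) | set(n)
    # Counter returns 0 for missing terms, so every term gets a weight
    return {t: a1 * q[t] + a2 * r[t] - a3 * n[t] for t in terms}


new_query = rocchio(
    "safety minivans",
    ["car safety minivans tests injury statistics", "liability tests safety"],
    ["car passengers injury reviews"],
)
for term in sorted(new_query, key=lambda t: -new_query[t]):
    print(term, new_query[term])
```

Under these assumptions "safety" receives weight 2.0 and "minivans" 1.5, while "passengers" and "reviews" receive -1.0. Negative weights are often clipped to zero in practice.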
Pseudo relevance feedback
• Automatic query expansion
– Thesaurus-based expansion (e.g., using
latent semantic indexing – later…)
– Distributional similarity
– Query log mining
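Pseudo relevance feedback can be sketched as follows: rank the documents with the original query, treat the top k as if they were relevant, and add their most frequent new terms to the query. The scoring function (count of query-term occurrences), the parameters k and n_terms, and the function name are illustrative assumptions and not a method prescribed by the slides.

```python
from collections import Counter


def pseudo_relevance_expand(query: str, docs: list, k: int = 2, n_terms: int = 2) -> list:
    """Expand a query with frequent terms from the top-k ranked documents."""
    q_terms = query.split()

    def score(doc: str) -> int:
        counts = Counter(doc.split())
        return sum(counts[t] for t in q_terms)  # simple overlap score

    # Assume the k best-scoring documents are relevant (no user judgment)
    top_docs = sorted(docs, key=score, reverse=True)[:k]
    counts = Counter()
    for doc in top_docs:
        counts.update(t for t in doc.split() if t not in q_terms)
    # Most frequent first; ties broken alphabetically for determinism
    ranked = sorted(counts.items(), key=lambda kv: (-kv[1], kv[0]))
    return q_terms + [t for t, _ in ranked[:n_terms]]


docs = [
    "car safety minivans tests injury statistics",
    "liability tests safety",
    "car passengers injury reviews",
]
print(pseudo_relevance_expand("safety minivans", docs))
# ['safety', 'minivans', 'tests', 'car']
```

Because no relevance judgments are involved, a poor initial ranking can pull the expanded query toward off-topic terms.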
Examples
Lexical semantics (Hypernymy):
Book: publication, product, fact, dramatic composition, record
Computer: machine, expert, calculator, reckoner, figurer
Fruit: reproductive structure, consequence, product, bear
Politician: leader, schemer
Newspaper: press, publisher, product, paper, newsprint
Distributional clustering:
Book: autobiography, essay, biography, memoirs, novels
Computer: adobe, computing, computers, developed, hardware
Fruit: leafy, canned, fruits, flowers, grapes
Politician: activist, campaigner, politicians, intellectuals, journalist
Newspaper: daily, globe, newspapers, newsday, paper
Examples (query logs)
• Book: booksellers, bookmark, blue
• Computer: sales, notebook, stores, shop
• Fruit: recipes cake salad basket company
• Games: online play gameboy free video
• Politician: careers federal office history
• Newspaper: online website college information
• Schools: elementary high ranked yearbook
• California: berkeley san francisco southern
• French: embassy dictionary learn
[Otterbacher et al. HLT EMNLP 2005]
Final projects
• Two formats:
– A software system that performs a specific search-engine
related task. We will create a web page with all such code and
make it available to the IR community.
– A research experiment documented in the form of a paper. Look
at the proceedings of the SIGIR, WWW, or ACL conferences for
a sample format. I will encourage the authors of the most
successful papers to consider submitting them to one of the IR-related
conferences.
• Deliverables:
– System (code + documentation + examples) or Paper (+ code,
data)
– Poster (to be presented in class)
– Web page that describes the project.
Readings
• 4: MRS15, MRS16
• 5: MRS17
• 6: MRS18, MRS19