Information Retrieval (4)
Prof. Dragomir R. Radev
radev@umich.edu

IR Winter 2010
…
7. Approximate string matching
…

Levenshtein edit distance
• Examples:
  – Theatre -> theater
  – Ghaddafi -> Qadafi
  – Computer -> counter
• Edit distance (inserts, deletes, substitutions)
  – Edit transcript
• Done through dynamic programming

Recurrence relation
• Three dependencies
  – D(i,0) = i
  – D(0,j) = j
  – D(i,j) = min[D(i-1,j)+1, D(i,j-1)+1, D(i-1,j-1)+t(i,j)]
• Simple edit distance:
  – t(i,j) = 0 iff S1(i) = S2(j), and 1 otherwise

Example
             W  R  I  T  E  R  S
          0  1  2  3  4  5  6  7
     0    0  1  2  3  4  5  6  7
V    1    1
I    2    2
N    3    3
T    4    4
N    5    5
E    6    6
R    7    7
Gusfield 1997

Example (cont’d)
             W  R  I  T  E  R  S
          0  1  2  3  4  5  6  7
     0    0  1  2  3  4  5  6  7
V    1    1  1  2  3  4  5  6  7
I    2    2  2  2  2  3  4  5  6
N    3    3  3  3  3  3  4  5  6
T    4    4  4  4  4  *
N    5    5
E    6    6
R    7    7
Gusfield 1997

Tracebacks
(same partially filled table as in Example (cont’d))
Gusfield 1997

Weighted edit distance
• Used to emphasize the relative cost of different edit operations
• Useful in bioinformatics
  – Homology information
  – BLAST
  – Blosum
  – http://eta.emblheidelberg.de:8000/misc/mat/blosum50.html

Links
• Web sites:
  – http://www.merriampark.com/ld.htm
  – http://odur.let.rug.nl/~kleiweg/lev/
• Demo:
  – /home/cs6998/tools/editDistance/dp/l.pl theater theatre
  – http://nayana.ece.ucsb.edu/imsearch/imsearch.html

Other methods
• Cosine
• Generation probabilities (language modeling)
• (exp)KL-divergence
  $D(p \,\|\, q) = \sum_{x} p(x) \log \frac{p(x)}{q(x)}$
  with the conventions $0 \log \frac{0}{q} = 0$ and $p \log \frac{p}{0} = \infty$

…
8. Query expansion
   Relevance feedback
…

Query expansion
• Corpus-based: mine query logs
• NLP-based
• Vector-space relevance feedback

Relevance feedback
• Problem: the initial query may not be the most appropriate one to satisfy a given information need.
• Idea: modify the original query so that it gets closer to the right documents in the vector space

Relevance feedback
• Automatic
• Manual
• Method: identifying feedback terms
  Q’ = a1Q + a2R - a3N
  Often a1 = 1, a2 = 1/|R| and a3 = 1/|N|

Example
• Q = “safety minivans”
• D1 = “car safety minivans tests injury statistics” - relevant
• D2 = “liability tests safety” - relevant
• D3 = “car passengers injury reviews” - nonrelevant
• R = ?
• N = ?
• Q’ = ?

Pseudo relevance feedback
• Automatic query expansion
  – Thesaurus-based expansion (e.g., using latent semantic indexing; later…)
  – Distributional similarity
  – Query log mining

Examples
Lexical semantics (Hypernymy):
  Book: publication, product, fact, dramatic composition, record
  Computer: machine, expert, calculator, reckoner, figurer
  Fruit: reproductive structure, consequence, product, bear
  Politician: leader, schemer
  Newspaper: press, publisher, product, paper, newsprint
Distributional clustering:
  Book: autobiography, essay, biography, memoirs, novels
  Computer: adobe, computing, computers, developed, hardware
  Fruit: leafy, canned, fruits, flowers, grapes
  Politician: activist, campaigner, politicians, intellectuals, journalist
  Newspaper: daily, globe, newspapers, newsday, paper

Examples (query logs)
• Book: booksellers, bookmark, blue
• Computer: sales, notebook, stores, shop
• Fruit: recipes cake salad basket company
• Games: online play gameboy free video
• Politician: careers federal office history
• Newspaper: online website college information
• Schools: elementary high ranked yearbook
• California: berkeley san francisco southern
• French: embassy dictionary learn
[Otterbacher et al. HLT EMNLP 2005]

Final projects
• Two formats:
  – A software system that performs a specific search-engine-related task. We will create a web page with all such code and make it available to the IR community.
  – A research experiment documented in the form of a paper.
    Look at the proceedings of the SIGIR, WWW, or ACL conferences for a sample format. I will encourage the authors of the most successful papers to consider submitting them to one of the IR-related conferences.
• Deliverables:
  – System (code + documentation + examples) or Paper (+ code, data)
  – Poster (to be presented in class)
  – Web page that describes the project.

Readings
• 4: MRS15, MRS16
• 5: MRS17
• 6: MRS18, MRS19
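
Sketch: edit distance by dynamic programming
The recurrence relation from section 7 (boundary conditions D(i,0) = i and D(0,j) = j, plus the three-way minimum) can be written out as a short table-filling routine. What follows is a minimal Python sketch under the simple edit distance (t = 0 on a match, 1 otherwise); the function names edit_distance_table and edit_distance are hypothetical, chosen here for illustration, and this is not the l.pl demo listed under Links.

```python
def edit_distance_table(s1, s2):
    """Fill the DP table D, where D[i][j] is the edit distance between
    the first i characters of s1 and the first j characters of s2."""
    n, m = len(s1), len(s2)
    D = [[0] * (m + 1) for _ in range(n + 1)]

    # Boundary conditions: D(i,0) = i and D(0,j) = j.
    for i in range(n + 1):
        D[i][0] = i
    for j in range(m + 1):
        D[0][j] = j

    # General case: each cell depends on its three neighbours.
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            # Simple edit distance: t(i,j) = 0 iff the characters match.
            t = 0 if s1[i - 1] == s2[j - 1] else 1
            D[i][j] = min(
                D[i - 1][j] + 1,      # delete s1[i-1]
                D[i][j - 1] + 1,      # insert s2[j-1]
                D[i - 1][j - 1] + t,  # match or substitute
            )
    return D


def edit_distance(s1, s2):
    """Edit distance between the full strings (bottom-right cell)."""
    return edit_distance_table(s1, s2)[len(s1)][len(s2)]


if __name__ == "__main__":
    print(edit_distance("vintner", "writers"))  # 5
    print(edit_distance("theatre", "theater"))  # 2
```

Rows 1 to 3 of the table returned for ("vintner", "writers") agree with the filled rows of the Gusfield example above. A weighted edit distance would replace the constant costs 1 and t with operation-specific costs; that variant is not shown here.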
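
Sketch: vector-space relevance feedback
The update Q’ = a1Q + a2R - a3N from section 8 can be tried on the "safety minivans" example. This is a rough Python sketch resting on several assumptions that the slides do not state: documents are binary bag-of-words vectors, R and N are the sums of the relevant and nonrelevant document vectors, |R| and |N| are the numbers of documents in each group, and negative term weights are kept rather than clipped. The names term_vector and feedback_query are hypothetical.

```python
from collections import Counter


def term_vector(text):
    """Binary bag-of-words vector: each distinct term gets weight 1.
    (Assumption: the slides do not fix a term-weighting scheme.)"""
    return Counter(set(text.split()))


def feedback_query(query, relevant, nonrelevant, a1=1.0):
    """Compute Q' = a1*Q + a2*R - a3*N with a2 = 1/|R| and a3 = 1/|N|."""
    q = term_vector(query)

    # R and N: summed vectors of the relevant / nonrelevant documents.
    R = Counter()
    for doc in relevant:
        R.update(term_vector(doc))
    N = Counter()
    for doc in nonrelevant:
        N.update(term_vector(doc))

    # Guard against empty feedback sets to avoid division by zero.
    a2 = 1.0 / len(relevant) if relevant else 0.0
    a3 = 1.0 / len(nonrelevant) if nonrelevant else 0.0

    terms = set(q) | set(R) | set(N)
    # Counter returns 0 for missing terms, so no key checks are needed.
    return {t: a1 * q[t] + a2 * R[t] - a3 * N[t] for t in terms}


if __name__ == "__main__":
    new_q = feedback_query(
        "safety minivans",
        relevant=[
            "car safety minivans tests injury statistics",
            "liability tests safety",
        ],
        nonrelevant=["car passengers injury reviews"],
    )
    for term, weight in sorted(new_q.items(), key=lambda kv: -kv[1]):
        print(f"{term:12s} {weight:+.1f}")
```

Under these assumptions the query terms are reinforced (safety 2.0, minivans 1.5), terms seen only in relevant documents enter with positive weight (tests 1.0), and terms from the nonrelevant document are pushed negative (passengers -1.0, reviews -1.0). A different weighting scheme, such as tf-idf, would change the numbers but not the form of the update.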