
CIS 430 November 6, 2008
Emily Pitler
3

Named Entities

1 or 2 words


Ambiguous meaning
Ambiguous intent
4
5
Mei and Church, WSDM 2008
6

Beitzel et al., SIGIR 2004

America Online, one week in December 2003

Popular queries:
◦ 1.7 words on average

Overall:
◦ 2.2 words on average
7

Lempel and Moran, WWW 2003
AltaVista, summer 2001
7,175,151 queries
2,657,410 distinct queries

1,792,104 queries occurred only once (63.7%)

Most popular query: 31,546 times



8
Saraiva et al., SIGIR 2001
9
Lempel and Moran, WWW 2003
10
American Airlines? or Alcoholics Anonymous?
12



Clarity score ~ low ambiguity
Cronen-Townsend et al., SIGIR 2002
Compare a language model
◦ over the relevant documents for a query
◦ over all possible documents


The more different these are, the clearer
the query is
“programming perl” vs. “the”
13

Query Language Model

P(w \mid Q) = \sum_{D \in R} P(w \mid D)\, P(D \mid Q)

Collection Language Model (unigram)

P(w \mid \text{collection}) = \frac{\sum_{D \in \text{collection}} C_D(w)}{\sum_{D \in \text{collection}} C_D(\text{all words})}
14
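A minimal sketch of estimating these two models from token counts (the document representation, the P(D | Q) weights, and the lack of smoothing are assumptions, not from the slides):

```python
from collections import Counter

def collection_lm(docs):
    """Unigram collection model: P(w | collection) = count of w in the collection / total tokens."""
    counts = Counter(w for doc in docs for w in doc)
    total = sum(counts.values())
    return {w: c / total for w, c in counts.items()}

def query_lm(retrieved_docs, doc_weights):
    """Query model: P(w | Q) = sum over retrieved docs D of P(w | D) * P(D | Q)."""
    p_w_q = Counter()
    for doc, p_d_q in zip(retrieved_docs, doc_weights):
        doc_len = len(doc)
        for w, c in Counter(doc).items():
            p_w_q[w] += (c / doc_len) * p_d_q   # P(w | D) * P(D | Q)
    return dict(p_w_q)
```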


Relative entropy between the two distributions
Extra cost in bits of coding with Q when the true distribution is P

H(P) = -\sum_i P(i) \log P(i)

D_{KL}(P \| Q) = -\sum_i P(i) \log Q(i) - \left( -\sum_i P(i) \log P(i) \right)
15
D_{KL}(P \| Q) = \sum_i P(i) \log \frac{P(i)}{Q(i)}
16
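A small sketch of relative entropy over two distributions represented as dictionaries (the dict representation is an assumption):

```python
import math

def kl_divergence(p, q):
    """D_KL(P || Q) = sum_i P(i) log( P(i) / Q(i) ).

    p and q map outcomes to probabilities; assumes q[i] > 0 wherever p[i] > 0
    (otherwise the divergence is infinite)."""
    return sum(p_i * math.log(p_i / q[i]) for i, p_i in p.items() if p_i > 0)
```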
\text{Clarity score} = \sum_{w \in V} P(w \mid Q) \log_2 \frac{P(w \mid Q)}{P_{\text{coll}}(w)}
17
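Putting the pieces together, the clarity score is the KL divergence in bits between the query model and the collection model; a sketch, assuming the two models are dicts like the hypothetical ones above:

```python
import math

def clarity_score(p_w_q, p_coll):
    """Clarity = sum over w of P(w | Q) * log2( P(w | Q) / P_coll(w) ), i.e. KL divergence in bits."""
    return sum(p * math.log2(p / p_coll[w]) for w, p in p_w_q.items() if p > 0)
```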
18

Navigational
◦ greyhound bus
◦ compaq

Informational
◦ San Francisco
◦ normocytic anemia

Transactional
◦ britney spears lyrics
◦ download adobe reader
Broder SIGIR 2002
19


The more webpages that point to you, the
more important you are
The more important webpages point to you,
the more important you are

These intuitions led to PageRank

PageRank led to…
Page et al., 1998
22
[Figure: example web graph with washingtonpost.com, nytimes.com, cnn.com, mtv.com, vh1.com]
23

Assume our surfer is on a page

In the next time step she can:
◦ Choose a link on the current page uniformly at random, or
◦ Go somewhere else on the web uniformly at random

After a long time, what is the probability she
is on a given page?
24
Pages spread their probability over their outgoing links:

P(v) = \sum_{u \in B_v} \frac{P(u)}{\deg(u)}

where B_v is the set of pages that point to v and \deg(u) is the number of outgoing links of u.
25
26

Could also “get bored” with probability d and jump somewhere else completely:

P(v) = \frac{d}{N} + (1 - d) \sum_{u \in B_v} \frac{P(u)}{\deg(u)}
27
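A sketch of the random-surfer model as power iteration, using the damped update above (the graph representation, d = 0.15, the iteration count, and the handling of dangling pages are assumptions):

```python
def pagerank(out_links, d=0.15, iterations=50):
    """out_links: dict mapping each page to the list of pages it links to."""
    pages = list(out_links)
    n = len(pages)
    pr = {p: 1.0 / n for p in pages}            # start the surfer uniformly at random
    for _ in range(iterations):
        new_pr = {p: d / n for p in pages}      # "get bored" jump: d / N to every page
        for u, targets in out_links.items():
            if not targets:
                continue                        # dangling pages simply leak mass in this sketch
            share = (1 - d) * pr[u] / len(targets)
            for v in targets:
                new_pr[v] += share              # spread P(u) over u's outgoing links
        pr = new_pr
    return pr

# e.g. pagerank({"cnn.com": ["nytimes.com"], "nytimes.com": ["cnn.com"],
#                "mtv.com": ["vh1.com"], "vh1.com": ["mtv.com"]})
```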
28

Google, obviously
Given objects and links between them,
measures importance

Summarization (Erkan and Radev, 2004)

◦ Nodes = sentences, edges = thresholded cosine
similarity

Research (Mimno and McCallum, 2007)
◦ Nodes = people, edges = citations

Facebook?
29

Words on the page

Title

Domain

Anchor text—what other sites say when they
link to that page
31
Title: Ani Nenkova - Home
Domain: www.cis.upenn.edu
32






Ontology of webpages
Over 4 million webpages are categorized
Like WordNet for webpages
Search engines use this
Where is www.cis.upenn.edu?
Computers
◦ Computer Science
  ◦ Academic Departments
    ◦ North America
      ◦ United States
        ◦ Pennsylvania
33


What OTHER webpages say about your
webpage
Very good descriptions of what’s on a page
Link to:
www.cis.upenn.edu/~nenkova
“Ani Nenkova” is anchor text for
that page
34




10,000 documents
10 of them are relevant
What happens if you decide to return
absolutely nothing?
99.9% accuracy (9,990 of the 10,000 documents are correctly left out)
36



Standard metrics in Information Retrieval

Precision: Of what you return, how many are relevant?

\text{Precision} = \frac{|\text{Relevant} \cap \text{Retrieved}|}{|\text{Retrieved}|}

Recall: Of what is relevant, how many do you return?

\text{Recall} = \frac{|\text{Relevant} \cap \text{Retrieved}|}{|\text{Relevant}|}
37
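A direct translation of the two definitions, treating retrieved and relevant results as Python sets (the set representation is an assumption):

```python
def precision(retrieved, relevant):
    return len(retrieved & relevant) / len(retrieved) if retrieved else 0.0

def recall(retrieved, relevant):
    return len(retrieved & relevant) / len(relevant) if relevant else 0.0

# e.g. with retrieved = {"d1", "d2", "d3"} and relevant = {"d2", "d9"}:
# precision = 1/3, recall = 1/2
```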



Not always clear-cut binary classification:
relevant vs. not relevant
How do you measure recall over the whole
web?
How many of the 2.7 billion results will get
looked at? Which ones actually need to be
good?
38


Very relevant > Somewhat relevant > Not
relevant
Want most relevant documents to be ranked
first
\mathrm{DCG}_p = rel_1 + \sum_{i=2}^{p} \frac{rel_i}{\log_2 i}
NDCG = DCG / ideal ordering DCG

Ranges from 0 to 1
39

Proposed ordering (relevance grades): 4, 2, 0, 1

DCG = 4 + 2/log2(2) + 0/log2(3) + 1/log2(4)
◦ = 6.5

Ideal ordering: 4, 2, 1, 0
IDCG = 4 + 2/log2(2) + 1/log2(3) + 0/log2(4)
◦ = 6.63

NDCG = 6.5/6.63 = .98
40
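The same calculation as a short sketch (all logs base 2, using the DCG definition from the previous slide):

```python
import math

def dcg(rels):
    """DCG_p = rel_1 + sum_{i=2..p} rel_i / log2(i)."""
    return rels[0] + sum(r / math.log2(i) for i, r in enumerate(rels[1:], start=2))

def ndcg(rels):
    return dcg(rels) / dcg(sorted(rels, reverse=True))   # divide by DCG of the ideal ordering

print(round(ndcg([4, 2, 0, 1]), 2))   # 0.98, as in the worked example
```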





Documents: hundreds of words
Queries: 1 or 2 words, often ambiguous
It would be much easier to compare documents with other documents
How can we turn a query into a document?
Just find ONE relevant document, then use
that to find more
42





New Query = Original Query
+ Terms from Relevant Docs
- Terms from Irrelevant Docs
Original query = “train”
Relevant
◦ www.dog-obedience-training-review.com

Irrelevant
◦ http://en.wikipedia.org/wiki/Caboose

New query = train + 0.3*dog - 0.2*railroad
43
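A minimal sketch of this kind of query reweighting over term-weight dictionaries; the 0.3/0.2 weights and the vector representation are assumptions, and the update is essentially the classic Rocchio formula:

```python
from collections import defaultdict

def expand_query(query_vec, relevant_vecs, irrelevant_vecs, beta=0.3, gamma=0.2):
    """New query = original query + beta * avg(relevant docs) - gamma * avg(irrelevant docs)."""
    new_q = defaultdict(float, query_vec)
    for vecs, signed_weight in ((relevant_vecs, beta), (irrelevant_vecs, -gamma)):
        for vec in vecs:
            for term, w in vec.items():
                new_q[term] += signed_weight * w / len(vecs)
    return dict(new_q)

# e.g. expand_query({"train": 1.0}, [{"dog": 1.0}], [{"railroad": 1.0}])
# -> {"train": 1.0, "dog": 0.3, "railroad": -0.2}
```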

Explicit feedback
◦ Ask the user to mark relevant versus irrelevant
◦ Or, grade on a scale (like we saw for NDCG)

Implicit feedback
◦ Users see list of top 10 results, click on a few
◦ Assume clicked on pages were relevant, rest
weren’t

Pseudo-relevance feedback
◦ Do search, assume top results are relevant, repeat
44



Have query logs for millions of users
“hybrid car””toyota prius” is more likely
than “hybrid car”-> “flights to LA”
Find statistically significant pairs of queries
(Jones et. al. WWW 2006) using:
H1 : P(q2 | q1 )  P(q2 | q1 )
H 2 :P(q2 | q1 )  P(q2 | q1 )
L( H1 )
LLR  2 log
L( H 2 )
45
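A sketch of the log likelihood ratio test on query-pair counts, using binomial likelihoods (a Dunning-style LLR; the variable names and count conventions are assumptions):

```python
import math

def log_likelihood(k, n, p):
    """Log binomial likelihood of k successes in n trials with success probability p (0 < p < 1)."""
    return k * math.log(p) + (n - k) * math.log(1 - p)

def llr(k1, n1, k2, n2):
    """k1/n1: how often q2 follows q1;  k2/n2: how often q2 follows queries other than q1."""
    p = (k1 + k2) / (n1 + n2)               # H1: a single shared probability
    p1, p2 = k1 / n1, k2 / n2               # H2: two separate probabilities
    return -2 * (log_likelihood(k1, n1, p) + log_likelihood(k2, n2, p)
                 - log_likelihood(k1, n1, p1) - log_likelihood(k2, n2, p2))
```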


Make a bipartite graph of queries and URLs
Cluster (Beeferman and Berger, KDD 2000)
46
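A crude sketch of the idea: build the query-URL bipartite click graph and group queries that are connected through shared URLs. This uses simple connected components as a stand-in for the agglomerative clustering in Beeferman and Berger; the data format and everything else here are assumptions:

```python
from collections import defaultdict

def query_clusters(clicks):
    """clicks: list of (query, url) pairs; returns groups of queries linked via shared URLs."""
    url_to_queries = defaultdict(set)
    for q, url in clicks:
        url_to_queries[url].add(q)

    # Union-find over queries: two queries join a cluster if some URL was clicked for both.
    parent = {}
    def find(x):
        parent.setdefault(x, x)
        while parent[x] != x:
            parent[x] = parent[parent[x]]   # path halving
            x = parent[x]
        return x

    for queries in url_to_queries.values():
        queries = list(queries)
        for q in queries[1:]:
            parent[find(q)] = find(queries[0])

    clusters = defaultdict(set)
    for q, _ in clicks:
        clusters[find(q)].add(q)
    return list(clusters.values())
```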

Suggest queries in the same cluster
47


A lot of ambiguity is removed by knowing
who the searcher is
Lots of Fernando Pereiras
◦ I (Emily Pitler) only know one of them

Location matters
◦ “Thai restaurants” from me means “Thai restaurants
Philadelphia, PA”
49



Mei and Church, WSDM 2008
H(URL | Q) = H(URL, Q) - H(Q) = 23.88 - 21.14 = 2.74
H(URL | Q, IP) = H(URL, Q, IP) - H(Q, IP) = 27.17 - 26 = 1.17
50
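A sketch of how those conditional entropies could be estimated from (IP, query, URL) click records, using H(URL | Q) = H(URL, Q) - H(Q) (the record format is an assumption):

```python
import math
from collections import Counter

def entropy(observations):
    """H(X) = -sum p(x) log2 p(x), with p estimated from a list of observed values."""
    counts = Counter(observations)
    n = len(observations)
    return -sum((c / n) * math.log2(c / n) for c in counts.values())

def personalization_gain(records):
    """records: list of (ip, query, url) tuples from a query log."""
    h_url_given_q = entropy([(q, u) for _, q, u in records]) - entropy([q for _, q, _ in records])
    h_url_given_q_ip = entropy(records) - entropy([(ip, q) for ip, q, _ in records])
    return h_url_given_q, h_url_given_q_ip
```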
51

Powerset: trying to apply NLP to Wikipedia
52

Descriptive searches: “pictures of mountains”
◦ I don’t want a document with the words:
◦ {“picture”, “of”, “mountains”}



Link farms: trying to game PageRank
Spelling correction: a huge portion of queries
are misspelled
Ambiguity
53



Text normalization, documents as vectors,
document similarity, log likelihood ratio,
relative entropy, precision and recall, tf-idf,
machine learning…
Choosing relevant documents/content
Snippets = short summaries
54