INDRI - Overview
Don Metzler
Center for Intelligent Information Retrieval
University of Massachusetts, Amherst
Zoology 101

- Lemurs are primates found only in Madagascar
- 50 species (17 are endangered)
- Ring-tailed lemurs (lemur catta)

Zoology 101

- The indri is the largest type of lemur
- When first spotted the natives yelled “Indri! Indri!”
  - Malagasy for "Look! Over there!"
What is INDRI?

- INDRI is a “larger” version of the Lemur Toolkit
- Influences
  - INQUERY [Callan et al. ’92]
    - Inference network framework
    - Query language
  - Lemur [http://www-2.cs.cmu.edu/~lemur/]
    - Language modeling (LM) toolkit
  - Lucene [http://jakarta.apache.org/lucene/docs/index.html]
    - Popular off-the-shelf Java-based IR system
    - Based on heuristic retrieval models
- No IR system currently combines all of these features
Design Goals

- Off the shelf (Windows, *NIX, Mac platforms)
  - Separate download, compatible with Lemur
  - Simple to set up and use
  - Fully functional API w/ language wrappers for Java, etc…
- Robust retrieval model
  - Inference net + language modeling [Metzler and Croft ’04]
- Powerful query language
  - Extensions to INQUERY query language driven by requirements of QA, web search, and XML retrieval
  - Designed to be as simple to use as possible, yet robust
- Scalable
  - Highly efficient code
  - Distributed retrieval
Document Representation
<html>
<head>
<title>Department Descriptions</title>
</head>
<body>
The following list describes …
<h1>Agriculture</h1> …
<h1>Chemistry</h1> …
<h1>Computer Science</h1> …
<h1>Electrical Engineering</h1> …
…
<h1>Zoology</h1>
</body>
</html>
[Figure: the document above decomposed into contexts and extents]
- <title> context
  - extent 1: department descriptions
- <body> context
  - extent 1: the following list describes … <h1>agriculture</h1> …
- <h1> context
  - extent 1: agriculture
  - extent 2: chemistry
  - …
  - extent 36: zoology
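To make the context/extent decomposition concrete, here is a minimal illustrative Python sketch (not INDRI's actual parser; the class and variable names are invented) that splits an HTML document into per-tag contexts, each holding a list of extents:

# Illustrative only: split an HTML document into per-tag "contexts"
# (title, h1, body), each holding a list of "extents", mirroring the slide.
from html.parser import HTMLParser
from collections import defaultdict

class ContextExtractor(HTMLParser):
    TRACKED = {"title", "h1", "body"}          # contexts we care about

    def __init__(self):
        super().__init__()
        self.contexts = defaultdict(list)      # context name -> list of extents
        self.open_tags = []                    # stack of currently open tracked tags
        self.buffers = {}                      # tag -> text accumulated so far

    def handle_starttag(self, tag, attrs):
        if tag in self.TRACKED:
            self.open_tags.append(tag)
            self.buffers[tag] = []

    def handle_data(self, data):
        for tag in self.open_tags:             # text belongs to every open tracked tag
            self.buffers.setdefault(tag, []).append(data)

    def handle_endtag(self, tag):
        if tag in self.TRACKED and tag in self.buffers:
            words = " ".join(self.buffers.pop(tag)).split()
            self.contexts[tag].append(" ".join(words).lower())
            if self.open_tags and self.open_tags[-1] == tag:
                self.open_tags.pop()

html_doc = """<html><head><title>Department Descriptions</title></head>
<body>The following list describes ...
<h1>Agriculture</h1> ... <h1>Chemistry</h1> ... <h1>Zoology</h1></body></html>"""

parser = ContextExtractor()
parser.feed(html_doc)
for context, extents in parser.contexts.items():
    for i, extent in enumerate(extents, 1):
        print(f"<{context}> extent {i}: {extent}")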
Model

- Based on original inference network retrieval framework [Turtle and Croft ’91]
- Casts retrieval as inference in a simple graphical model
- Extensions made to original model
  - Incorporation of probabilities based on language modeling rather than tf.idf
  - Multiple language models allowed in the network (one per indexed context)
Model

[Figure: the INDRI inference network]
- Model hyperparameters α,β_title, α,β_body, α,β_h1 (observed)
- Document node D (observed)
- Context language models θ_title, θ_body, θ_h1
- Representation nodes r_1 … r_N for each context (terms, phrases, etc…)
- Belief nodes q_1, q_2 (#combine, #not, #max)
- Information need node I (a belief node)
Model

[Figure: the same inference network, highlighting the representation nodes r_1 … r_N]

P( r | θ )
- Probability of observing a term, phrase, or “concept” given a context language model
- Assume r ~ Bernoulli( θ )
  - r_i nodes are binary
  - “Model B” [Metzler, Lavrenko, Croft ’04]
- Nearly any model may be used here
  - tf.idf-based estimates (INQUERY)
  - Mixture models
Model

[Figure: the same inference network, highlighting the context language model nodes]

P( θ | α, β, D )
- Prior over context language model determined by α, β
- Assume P( θ | α, β ) ~ Beta( α, β )
  - Bernoulli’s conjugate prior
  - α_w = μ P( w | C ) + 1
  - β_w = μ P( ¬w | C ) + 1
  - μ is a free parameter
- Resulting estimate:

P( r_i | α, β, D ) = ∫ P( r_i | θ ) P( θ | α, β, D ) dθ = ( tf_w,D + μ P( w | C ) ) / ( |D| + μ )
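A minimal Python sketch of the smoothed estimate above, with invented function and variable names; the value of μ used here is just a commonly used default, not anything mandated by the model:

# Illustrative sketch of the smoothed term probability shown above:
# P(r_i | alpha, beta, D) = (tf_{w,D} + mu * P(w|C)) / (|D| + mu)

def smoothed_term_prob(tf_wD: int, doc_len: int, p_w_collection: float, mu: float = 2500.0) -> float:
    """Posterior estimate of observing term w given document D and hyperparameter mu."""
    return (tf_wD + mu * p_w_collection) / (doc_len + mu)

# Example: "dog" occurs 3 times in a 500-word document,
# and accounts for 0.1% of all collection terms.
print(smoothed_term_prob(tf_wD=3, doc_len=500, p_w_collection=0.001, mu=2500.0))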
Model

[Figure: the same inference network, highlighting the belief nodes]

P( q | r ) and P( I | r )
- Belief nodes are created dynamically based on the query
- Belief node CPTs are derived from standard link matrices
  - Combine evidence from parents in various ways
  - Allows fast inference by making marginalization computationally tractable
- Information need node is simply a belief node that combines all network evidence into a single value
- Documents are ranked according to P( I | α, β, D )
Example: #AND

Nodes A and B are parents of the belief node Q, with link matrix:

  a      b      P( Q = true | a, b )
  false  false  0
  false  true   0
  true   false  0
  true   true   1

P_#and( Q = true ) = Σ_{a,b} P( Q = true | A = a, B = b ) P( A = a ) P( B = b )
  = P( t | f, f )(1 - p_A)(1 - p_B) + P( t | f, t )(1 - p_A) p_B + P( t | t, f ) p_A (1 - p_B) + P( t | t, t ) p_A p_B
  = 0·(1 - p_A)(1 - p_B) + 0·(1 - p_A) p_B + 0·p_A (1 - p_B) + 1·p_A p_B
  = p_A p_B
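For illustration, a small Python sketch of the closed forms such link matrices yield: #and reduces to the product derived above, while #or and #not follow from the analogous standard inference-network link matrices. The function names are invented, not INDRI API calls:

# Illustrative closed forms for belief operators derived from their link
# matrices (as in the #AND example above). Each p is P(parent = true).
from math import prod

def p_and(parent_probs):
    # #and: only the all-true row of the link matrix has P(Q = true) = 1
    return prod(parent_probs)

def p_or(parent_probs):
    # #or: Q is true unless every parent is false
    return 1.0 - prod(1.0 - p for p in parent_probs)

def p_not(p):
    # #not: negation of a single parent belief
    return 1.0 - p

p_A, p_B = 0.8, 0.5
print(p_and([p_A, p_B]))   # ≈ 0.4 (= p_A * p_B, matching the derivation above)
print(p_or([p_A, p_B]))    # ≈ 0.9
print(p_not(p_A))          # ≈ 0.2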
Query Language

- Extension of INQUERY query language
- Structured query language
  - Term weighting
  - Ordered / unordered windows
  - Synonyms
- Additional features
  - Language modeling motivated constructs
  - Added flexibility to deal with fields via contexts
  - Generalization of passage retrieval (extent retrieval)
- Robust query language that handles many current language modeling tasks
Terms

Stemmed term
  Example: dog
  Matches: all occurrences of dog (and its stems)

Surface term
  Example: "dogs"
  Matches: exact occurrences of dogs (without stemming)

Term group (synonym group)
  Example: <"dogs" canine>
  Matches: all occurrences of dogs (without stemming) or canine (and its stems)

POS qualified term
  Example: <"dogs" canine>.NNS
  Matches: same as previous, except matches must also be tagged with the NNS POS tag
Proximity

#odN(e1 … em) or #N(e1 … em)
  Example: #od5(dog cat) or #5(dog cat)
  Matches: all occurrences of dog and cat appearing ordered within a window of 5 words

#uwN(e1 … em)
  Example: #uw5(dog cat)
  Matches: all occurrences of dog and cat that appear in any order within a window of 5 words

#phrase(e1 … em)
  Example: #phrase(#1(willy wonka) #uw3(chocolate factory))
  Matches: system dependent implementation (defaults to #odm)

#syntax:xx(e1 … em)
  Example: #syntax:np(fresh powder)
  Matches: system dependent implementation
Context Restriction

dog.title
  Matches: all occurrences of dog appearing in the title context

dog.title,paragraph
  Matches: all occurrences of dog appearing in both a title and a paragraph context (may not be possible)

<dog.title dog.paragraph>
  Matches: all occurrences of dog appearing in either a title context or a paragraph context

#5(dog cat).head
  Matches: all matching windows contained within a head context
Context Evaluation

dog.(title)
  Evaluated: the term dog evaluated using the title context as the document

dog.(title, paragraph)
  Evaluated: the term dog evaluated using the concatenation of the title and paragraph contexts as the document

dog.figure(paragraph)
  Evaluated: the term dog restricted to figure tags within the paragraph context
Belief Operators

  INQUERY        INDRI
  #sum / #and    #combine
  #wsum*         #weight
  #or            #or
  #not           #not
  #max           #max

* #wsum is still available in INDRI, but should be used with discretion
Extent Retrieval

#combine[section](dog canine)
  Evaluated: evaluates #combine(dog canine) for each extent associated with the section context

#combine[title, section](dog canine)
  Evaluated: same as previous, except it is evaluated for each extent associated with either the title context or the section context

#sum(#sum[section](dog))
  Evaluated: returns a single score that is the #sum of the scores returned from #sum(dog) evaluated for each section extent

#max(#sum[section](dog))
  Evaluated: same as previous, except returns the maximum score
Extent Retrieval Example
<document>
<section><head>Introduction</head>
Statistical language modeling allows formal
methods to be applied to information retrieval.
...
</section>
<section><head>Multinomial Model</head>
Here we provide a quick review of multinomial
language models.
...
</section>
<section><head>Multiple-Bernoulli Model</head>
We now examine two formal methods for
statistically modeling documents and queries
based on the multiple-Bernoulli distribution.
...
</section>
…
</document>
Query:
#combine[section]( dirichlet smoothing )
1. Treat each section extent as a “document”
2. Score each “document” according to #combine( … )
3. Return a ranked list of extents

  SCORE  DOCID   BEGIN  END
  0.50   IR-352     51  205
  0.35   IR-352    405  548
  0.15   IR-352      0   50
  …
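To connect these steps back to the model, here is an illustrative Python sketch (not INDRI internals) that treats each section extent as its own "document" and scores it with a #combine-style geometric mean of smoothed term probabilities, following the smoothing estimate sketched earlier. The extent statistics, collection probabilities, and function names are all invented for illustration:

# Illustrative sketch of extent retrieval: score each <section> extent as if
# it were a document, using a geometric mean of smoothed term probabilities.
from math import log, exp

def smoothed_term_prob(tf, doc_len, p_collection, mu=2500.0):
    return (tf + mu * p_collection) / (doc_len + mu)

def combine_score(query_terms, extent_counts, extent_len, collection_probs, mu=2500.0):
    """Geometric mean of per-term beliefs, i.e. exp of the average log belief."""
    log_beliefs = [
        log(smoothed_term_prob(extent_counts.get(t, 0), extent_len, collection_probs[t], mu))
        for t in query_terms
    ]
    return exp(sum(log_beliefs) / len(log_beliefs))

# Hypothetical extents of document IR-352, one per <section>, with term counts.
extents = [
    {"begin": 0,   "end": 50,  "counts": {"dirichlet": 0, "smoothing": 1}, "len": 50},
    {"begin": 51,  "end": 205, "counts": {"dirichlet": 2, "smoothing": 3}, "len": 154},
    {"begin": 405, "end": 548, "counts": {"dirichlet": 1, "smoothing": 1}, "len": 143},
]
collection_probs = {"dirichlet": 0.0001, "smoothing": 0.0005}   # made-up P(w|C)

query = ["dirichlet", "smoothing"]
ranked = sorted(
    ({"docid": "IR-352", **e,
      "score": combine_score(query, e["counts"], e["len"], collection_probs)}
     for e in extents),
    key=lambda r: r["score"], reverse=True)
for r in ranked:
    print(f"{r['score']:.4f}  {r['docid']}  {r['begin']:>4}  {r['end']:>4}")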
Example Tasks

- Ad hoc retrieval
  - Flat documents
  - SGML/XML documents
- Web search
  - Homepage finding
  - Known-item finding
- Question answering
- KL divergence based ranking
  - Query models
  - Relevance modeling
- Cluster-based language models
Ad Hoc Retrieval

- Flat documents
  - Query likelihood retrieval: q1 … qN ≡ #combine( q1 … qN )
- SGML/XML documents
  - Can either retrieve documents or extents
  - Context restrictions and context evaluations allow exploitation of document structure
Web Search

- Homepage / known-item finding
- Use mixture model of several document representations [Ogilvie and Callan ’03]

P( w | θ, D ) = λ_body P( w | θ_body ) + λ_inlink P( w | θ_inlink ) + λ_title P( w | θ_title )

- Example query: Yahoo!

#combine( #wsum( 0.2 yahoo.(body)
                 0.5 yahoo.(inlink)
                 0.3 yahoo.(title) ) )
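An illustrative Python sketch of the mixture above; the per-field probabilities are made up, and the weights are the ones from the example query:

# Illustrative sketch of mixing per-representation language models, as in
# the formula above: P(w | theta, D) = sum over fields f of lambda_f * P(w | theta_f).
def mixture_prob(word, field_models, field_weights):
    """field_models: field -> {word: P(w | theta_field)}; weights sum to 1."""
    return sum(field_weights[f] * field_models[f].get(word, 0.0) for f in field_weights)

# Hypothetical (made-up) per-field language models for a homepage.
field_models = {
    "body":   {"yahoo": 0.02, "search": 0.05},
    "inlink": {"yahoo": 0.30, "search": 0.01},
    "title":  {"yahoo": 0.50},
}
field_weights = {"body": 0.2, "inlink": 0.5, "title": 0.3}   # weights from the example query

print(mixture_prob("yahoo", field_models, field_weights))    # 0.2*0.02 + 0.5*0.30 + 0.3*0.50 = 0.304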
Question Answering

- More expressive passage- and sentence-level retrieval
- Example: Where was George Washington born?

#combine[sentence]( #1( george washington )
                    born #place( ? ) )

- Returns a ranked list of sentences containing the phrase George Washington, the term born, and a snippet of text tagged as a PLACE named entity
KL / Cross Entropy Ranking

- INDRI handles ranking via KL / cross entropy
  - Query models
  - Relevance modeling
- Relevance modeling [Lavrenko and Croft ’01]
  - Do initial query run and retrieve top N documents
  - Form relevance model P( w | θ_Q )
  - Formulate and rerun expanded query as:
    #weight( P( w1 | θ_Q ) w1 … P( w|V| | θ_Q ) w|V| )
  - Ranked list equivalent to scoring by KL( θ_Q || θ_D )
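A minimal Python sketch of the steps above, assuming the standard RM1-style relevance model estimate from Lavrenko and Croft ’01 (P( w | θ_Q ) proportional to the sum over the top-N documents of P( w | D ) P( Q | D )); all names and numbers are invented:

# Illustrative sketch of the relevance-modeling steps listed above.
from math import log
from collections import defaultdict

def relevance_model(top_docs):
    """top_docs: list of (doc_lm, query_likelihood) pairs from the initial run."""
    rm = defaultdict(float)
    for doc_lm, q_like in top_docs:
        for w, p in doc_lm.items():
            rm[w] += p * q_like               # RM1: P(w|D) weighted by P(Q|D)
    z = sum(rm.values())
    return {w: p / z for w, p in rm.items()}  # normalize to a distribution

def to_weight_query(rm, k=5):
    """Expanded query: #weight( P(w1|thetaQ) w1 ... ), truncated to the top-k terms."""
    top = sorted(rm.items(), key=lambda wp: wp[1], reverse=True)[:k]
    return "#weight( " + " ".join(f"{p:.4f} {w}" for w, p in top) + " )"

def kl_divergence(theta_q, theta_d):
    """KL(theta_Q || theta_D); documents with lower KL rank higher."""
    return sum(p * log(p / theta_d[w]) for w, p in theta_q.items() if p > 0)

# Made-up document language models and query likelihoods from an initial run.
top_docs = [({"dirichlet": 0.03, "smoothing": 0.05, "prior": 0.02}, 0.6),
            ({"smoothing": 0.04, "language": 0.06, "model": 0.05}, 0.4)]
rm = relevance_model(top_docs)
print(to_weight_query(rm))
print(kl_divergence(rm, defaultdict(lambda: 1e-6, {"smoothing": 0.05, "dirichlet": 0.02})))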
Cluster-Based LMs

- Cluster-based language models [Liu and Croft ’04]
- Smooths each document with a document-specific cluster model

P( w | θ, D ) = λ_D P( w | θ_D ) + (1 - λ_D)[ λ_Cluster P( w | θ_Cluster ) + (1 - λ_Cluster) P( w | C ) ]

- INDRI allows each document to specify (via metadata) a background model to smooth against
- Many other possible uses of this feature
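An illustrative sketch of the cluster-smoothed estimate above, with invented probabilities and interpolation weights:

# Illustrative sketch of the formula above: the document model is interpolated
# with a document-specific cluster model, which is itself smoothed with the collection.
def cluster_smoothed_prob(p_w_doc, p_w_cluster, p_w_collection,
                          lambda_doc=0.8, lambda_cluster=0.7):
    background = lambda_cluster * p_w_cluster + (1.0 - lambda_cluster) * p_w_collection
    return lambda_doc * p_w_doc + (1.0 - lambda_doc) * background

# Example: a term that is rare in the document but common in its cluster.
print(cluster_smoothed_prob(p_w_doc=0.001, p_w_cluster=0.02, p_w_collection=0.0005))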
Conclusions

- INDRI extends INQUERY and Lemur
  - Off the shelf
  - Scalable
  - Geared towards tagged (structured) documents
  - Employs robust inference net approach to retrieval
  - Extended query language can tackle many current retrieval tasks
Download