Lemur Application toolkit

Kanishka P Pathak
Bioinformatics
CIS 595
Introduction

A language model (LM) is a probabilistic
mechanism for generating text

In the past several years, there has been
significant interest in the use of language
modeling for text and natural language
processing tasks

We now have text information retrieval (IR)
based on statistical language modeling
Previous work

The first statistical language modeler was Claude
Shannon.

He thought of the human language as a
statistical source and …

He measured how well simple n-gram
models did at predicting and compressing
natural text.
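
To make the n-gram idea concrete, here is a minimal sketch of a bigram (2-gram) model that estimates how likely a word is to follow the previous one. The toy sentence and the function names `train_bigram` and `next_word_prob` are illustrative assumptions, not taken from Shannon's work or from Lemur.

```python
from collections import Counter, defaultdict

def train_bigram(tokens):
    """Count how often each word follows each preceding word."""
    counts = defaultdict(Counter)
    for prev, cur in zip(tokens, tokens[1:]):
        counts[prev][cur] += 1
    return counts

def next_word_prob(counts, prev, word):
    """Maximum-likelihood estimate of P(word | prev); 0.0 if prev was never seen."""
    total = sum(counts[prev].values())
    return counts[prev][word] / total if total else 0.0

# Toy text (hypothetical); a real model would be trained on a large corpus.
tokens = "the cat sat on the mat the cat ran".split()
model = train_bigram(tokens)
print(next_word_prob(model, "the", "cat"))
```

A model that assigns high probability to the words that actually occur predicts the text well, and by the same token can compress it well.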

For many years, language models were used in
speech recognition.

However, basic language modeling ideas have
been used in information retrieval for quite some
time.

Some of the previous models are:
naïve Bayes model
Robertson and Sparck Jones model
Their limitations….

Naïve Bayes
Suffers from the “Independence
Assumptions” it makes

RSJ
Distribution of query terms in “relevant” and
“non-relevant” documents
Turning the problem around

Ponte and Croft proposed using a smoothed document
unigram model to assign a score to a query

Berger and Lafferty built on this model.
Their approach:
“predict the input (i.e. the query)”
This opened up new ways to think about information retrieval….
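
The idea of scoring a document by how well its language model "predicts the query" can be sketched as follows. This is only an illustration: the slides do not say which smoothing method is used, so linear interpolation with the collection model is swapped in here as a common stand-in, and the mixing weight `lam`, the function name and the toy documents are all assumptions.

```python
import math
from collections import Counter

def query_log_likelihood(query, doc, collection, lam=0.5):
    """log P(query | doc) under a smoothed unigram model.

    Smoothing (an assumption, not from the slides): mix the document's
    term frequencies with collection-wide frequencies using weight lam.
    """
    doc_counts, coll_counts = Counter(doc), Counter(collection)
    doc_len, coll_len = len(doc), len(collection)
    score = 0.0
    for term in query:
        p_doc = doc_counts[term] / doc_len if doc_len else 0.0
        p_coll = coll_counts[term] / coll_len if coll_len else 0.0
        p = lam * p_doc + (1 - lam) * p_coll
        if p == 0.0:
            # Term absent from the whole collection: the query cannot be generated.
            return float("-inf")
        score += math.log(p)
    return score

# Hypothetical two-document collection.
docs = {
    "d1": "language models for information retrieval".split(),
    "d2": "speech recognition with language models".split(),
}
collection = [t for d in docs.values() for t in d]
query = "information retrieval".split()

ranking = sorted(
    docs,
    key=lambda d: query_log_likelihood(query, docs[d], collection),
    reverse=True,
)
print(ranking)
```

Smoothing matters because, without it, a document missing a single query term would get probability zero no matter how well it matches the rest of the query.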
Lemur



A lemur is a nocturnal,
monkey-like African
animal largely confined to
the island of Madagascar
The name was chosen
partly because of its
resemblance to “LM/IR”
and partly because the LM
community was an island
to the IR community
What is the Lemur project?

It is a research project carried out by the Computer
Science Department at the University of Massachusetts and
Carnegie Mellon University

It is sponsored by the Advanced Research and
Development Activity in Information Technology (ARDA)

It is designed to facilitate research in language modeling
and Information retrieval

It is written in C/C++ and runs under Unix as well as
Windows
Components and their interaction
The toolkit

The Lemur toolkit is available on the site
www-2.cs.cmu.edu/~lemur

To use the toolkit:
download → compile → execute
Examples of applications




Pre-processing:
  ParseQuery
  ParseToFile
Building/Adding Index:
  PushIndexer
  BuildBasicIndex
Retrieval/Evaluation:
  RetEval
  StructQueryEval
Summarization:
  BasicSummApp
  MMRSummApp
What do we need to run an
application?

Text documents in a format that
Lemur accepts (e.g. TREC format)

Parameter file
Document format in Lemur
There are five document formats supported
by Lemur:
TREC
WEB
CHINESE
CHINESECHAR
ARABIC
Example of a Document format
Say we take the document format “web”:
<DOC>
<DOCNO> any_number_here </DOCNO>
Text here
</DOC>
<DOC>
<DOCNO> any_number_here </DOCNO>
Text here
</DOC>
Example of Document format
<DOC>
<DOCNO> 251 </DOCNO>
Ballistic Cam Design
This paper presents a digital computer program
for the rapid calculation of manufacturing data
essential to the design of preproduction cams which
are utilized in ballistic computers of tank fire
control systems. The cam profile generated introduces
the superelevation angle required by tank main
armament for a particular type ammunition.
CACM November, 1961
Archambault, M.
CA611117 JB March 15, 1978 10:37 PM
</DOC>
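
To show how the `<DOC>` / `<DOCNO>` markup delimits documents, here is a minimal sketch of a reader for this layout. It is not Lemur's own parser; the function name `parse_docs` and the second sample document (number 252) are assumptions made for illustration.

```python
import re

# One <DOC> block: a <DOCNO> identifier followed by free text up to </DOC>.
DOC_RE = re.compile(
    r"<DOC>\s*<DOCNO>\s*(.*?)\s*</DOCNO>(.*?)</DOC>",
    re.DOTALL,
)

def parse_docs(text):
    """Return a list of (docno, body) pairs, one per <DOC>...</DOC> block."""
    return [(docno, body.strip()) for docno, body in DOC_RE.findall(text)]

# Sample input; document 252 is hypothetical.
sample = """<DOC>
<DOCNO> 251 </DOCNO>
Ballistic Cam Design
</DOC>
<DOC>
<DOCNO> 252 </DOCNO>
Text here
</DOC>"""

for docno, body in parse_docs(sample):
    print(docno, "->", body)
```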
Example of what a parameter file
looks like
Say we are creating a parameter file for the application
‘BuildBasicIndex’
The parameter file needs to have the following contents:
1. inputFile: the path to the source file
2. outputPrefix: a prefix name for your index
3. maxDocuments: maximum number of documents to
   index (default 1000000)
4. maxMemory: maximum amount of memory to be
   used for indexing (default 128MB)
E.g.:
inputFile=/usr/mydata/source;
outputPrefix=/usr/mydata/index;
maxDocuments=200000;
C:\lemur>BuildBasicIndex c:\lemur\buildpa
The index file generated is:
/usr/mydata/index.bsc
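
The `name=value;` layout of a parameter file can be read with a few lines of Python, which may help when generating or checking parameter files by script. This is a sketch only and not Lemur's own parser; the function name `parse_params` is hypothetical, and treating `inputFile` and `outputPrefix` as required is an assumption based on their having no listed default.

```python
def parse_params(text):
    """Parse 'name=value;' entries into a dict of strings."""
    params = {}
    for entry in text.split(";"):
        entry = entry.strip()
        if not entry:
            continue  # skip blank space after the final ';'
        key, sep, value = entry.partition("=")
        if not sep:
            raise ValueError(f"missing '=' in entry: {entry!r}")
        params[key.strip()] = value.strip()
    return params

# Assumption: these two have no default in the slides, so treat them as required.
REQUIRED = ("inputFile", "outputPrefix")

sample = """inputFile=/usr/mydata/source;
outputPrefix=/usr/mydata/index;
maxDocuments=200000;"""

params = parse_params(sample)
missing = [name for name in REQUIRED if name not in params]
print(params, missing)
```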
Contd….
Run the application with the parameter file as
the only argument
OR
as the first argument, if the application can take
other parameters from the command line
Example:
C:\lemur\lemur-2.0.3>BuildBasicIndex c:\lemur\parambasic.txt
OR
C:\lemur\lemur-2.0.3>BuildBasicIndex c:\lemur\parambasic.txt c:\lemur\source.txt
Where,
BuildBasicIndex is the application
parambasic.txt is a parameter file for BuildBasicIndex
source.txt is the file containing the source document
Lemur API
The Lemur API is intended to allow a programmer
to use the toolkit for special-purpose
applications that are not implemented in the
toolkit itself
The API interfaces are grouped at three different
levels:
1. Utility level
2. Indexer level
3. Retrieval level
API levels

Utility level : Includes common utilities such as
memory management, default exception
handler, program argument handler.

Indexer level : Converts the raw text into efficient
data structures so that the information (i.e. word
counts) may be accessed conveniently and
efficiently later.

Retrieval level: It is most useful for users who
want to build a prototype system or evaluation
system
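
As an illustration of what the indexer level does conceptually, the sketch below turns raw text into a structure from which word counts can be looked up quickly. It does not use or reproduce the Lemur API; the function name `build_index`, the whitespace tokenization and the two toy documents are assumptions.

```python
from collections import Counter, defaultdict

def build_index(docs):
    """Build an inverted index: term -> {doc_id: count of term in that doc}."""
    index = defaultdict(dict)
    for doc_id, text in docs.items():
        # Assumption: lowercase, whitespace-separated tokens.
        for term, count in Counter(text.lower().split()).items():
            index[term][doc_id] = count
    return index

# Hypothetical documents keyed by document number.
docs = {
    "251": "cam design for tank fire control",
    "252": "tank design",
}
index = build_index(docs)
print(index["tank"])    # which documents contain "tank", and how often
print(index["design"])
```

A retrieval-level component can then score documents from these counts without rereading the raw text.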
Future Developments
Summarizing
Filtering
Question Answering
Language generation

References

www-2.cs.cmu.edu/~lemur

Jay M. Ponte and W. Bruce Croft, “A Language Modeling
Approach to Information Retrieval” (CS, UMass
Amherst)
THANK YOU
Any questions?