Lemur Application Toolkit
Kanishka P Pathak
Bioinformatics CIS 595

Introduction
- A language model (LM) is a probabilistic mechanism for generating text
- In the past several years, there has been significant interest in the use of language modeling for text and natural language processing tasks
- We now have text information retrieval (IR) based on statistical language modeling

Previous work
- The first statistical modeler was Claude Shannon
- He thought of the human language as a statistical source and …
- He measured how well simple n-gram models did at predicting and compressing natural text
- For many years, language models were used in speech recognition
- However, basic language modeling ideas have been used in information retrieval for quite some time
- Some of the previous models are:
  - Naïve Bayes model
  - Robertson and Sparck Jones (RSJ) model

Their limitations…
- Naïve Bayes: suffers from the “independence assumptions” it makes
- RSJ: relies on the distribution of query terms in “relevant” and “non-relevant” documents

Turning the problem around
- Ponte and Croft proposed using a smoothed document unigram model to assign a score to a query
- Berger and J. Lafferty built on this model
- Their approach: “predict the input (i.e. the query)”
- This opened up new ways to think about information retrieval…

Lemur
- A lemur is a nocturnal, monkey-like African animal largely confined to the island of Madagascar
- The name was chosen partly because of its resemblance to LM/IR
- And partly because the LM community was an island to the IR community

What is the Lemur project?
- It is a research project being carried out by the Computer Science departments at the Univ.
  of Massachusetts and Carnegie Mellon University
- It is sponsored by the Advanced Research and Development Activity in Information Technology (ARDA)
- It is designed to facilitate research in language modeling and information retrieval
- It is written in C/C++ and runs under Unix as well as Windows

Components and their interaction

The toolkit
- The Lemur toolkit is available on the site www-2.cs.cmu.edu/~lemur
- To use the toolkit:
  - download
  - compile
  - execute

Examples of applications
- Pre-processing: ParseQuery, ParseToFile
- Building/Adding Index: PushIndexer, BuildBasicIndex
- Retrieval/Evaluation: RetEval, StructQueryEval
- Summarization: BasicSummApp, MMRSummApp

What do we need to run an application?
- Text documents in a format accepted by Lemur (TREC format)
- Parameter file

Document format in Lemur
- There are 5 document formats supported by Lemur:
  - TREC
  - WEB
  - CHINESE
  - CHINESECHAR
  - ARABIC

Example of a document format
Say we take the document format “web”:
<DOC>
<DOCNO> any_number_here </DOCNO>
Text here
</DOC>
<DOC>
<DOCNO> any_number_here </DOCNO>
Text here
</DOC>

Example of a document format
<DOC>
<DOCNO> 251 </DOCNO>
Ballistic Cam Design
This paper presents a digital computer program for the rapid calculation of manufacturing data essential to the design of preproduction cams which are utilized in ballistic computers of tank fire control systems. The cam profile generated introduces the superelevation angle required by tank main armament for a particular type ammunition.
CACM November, 1961
Archambault, M.
CA611117 JB March 15, 1978 10:37 PM
</DOC>

Example of what a parameter file looks like
- Say we are creating a parameter file for the application ‘BuildBasicIndex’
- The parameter file needs to have the following contents:
  1. inputFile : the path to the source file
  2. outputPrefix : a prefix name for your index
  3. maxDocuments : maximum number of documents to index (default 1000000)
  4. maxMemory : maximum amount of memory to be used for indexing (default 128MB)
Eg:
  inputFile=/usr/mydata/source;
  outputPrefix=/usr/mydata/index;
  maxDocuments=200000;

  C:\lemur>BuildBasicIndex c:\lemur\buildpa
The index file generated is: /usr/mydata/index.bsc

Contd…
- Run the application with the parameter file as the only argument, or as the first argument if the application can take other parameters from the command line
Example:
  C:\lemur\lemur-2.0.3>BuildBasicIndex c:\lemur\parambasic.txt
  OR
  C:\lemur\lemur-2.0.3>BuildBasicIndex c:\lemur\parambasic.txt c:\lemur\source.txt
Where,
  BuildBasicIndex is the application
  parambasic.txt is a parameter file for BuildBasicIndex
  source.txt is the file containing the source document

Lemur API
- The Lemur API is intended to allow a programmer to use the toolkit for special-purpose applications that are not implemented in the toolkit itself
- The API interfaces are grouped at three different levels:
  1. Utility level
  2. Indexer level
  3. Retrieval level

API levels
- Utility level: includes common utilities such as memory management, a default exception handler, and a program argument handler
- Indexer level: converts the raw text into efficient data structures so that the information (e.g. word counts) may be accessed conveniently and efficiently later
- Retrieval level: most useful for users who want to build a prototype system or an evaluation system

Future Developments
- Summarizing
- Filtering
- Question Answering
- Language generation

References
- www-2.cs.cmu.edu/~lemur
- A Language Modeling Approach to Information Retrieval, by Jay M Ponte and W.
  Bruce Croft (CS – UMass Amherst)

THANK YOU
Any questions?
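
Appendix: a sketch of query-likelihood scoring
The “turning the problem around” idea, scoring each document by how likely its smoothed unigram model is to generate the query, can be illustrated with a few lines of Python. This is only a minimal sketch under stated assumptions: the smoothing is plain linear interpolation with a collection model (a swapped-in, simpler choice, not the estimator from the Ponte and Croft paper), the tokenizer is whitespace splitting, and the names tokenize, query_log_likelihood, rank and lam are hypothetical. None of this is Lemur's API.

```python
import math
from collections import Counter


def tokenize(text):
    """Lowercase and split on whitespace (a deliberately crude tokenizer)."""
    return text.lower().split()


def query_log_likelihood(query, doc, collection_counts, collection_len, lam=0.5):
    """Return log P(query | doc) under a smoothed document unigram model.

    Assumption: smoothing is linear interpolation between the document
    model and the collection model, weighted by lam (illustrative only).
    """
    doc_counts = Counter(tokenize(doc))
    doc_len = sum(doc_counts.values())
    score = 0.0
    for term in tokenize(query):
        p_doc = doc_counts[term] / doc_len if doc_len else 0.0
        p_coll = collection_counts[term] / collection_len if collection_len else 0.0
        p = (1 - lam) * p_doc + lam * p_coll
        if p == 0.0:
            # The term never occurs anywhere in the collection.
            return float("-inf")
        score += math.log(p)
    return score


def rank(query, docs, lam=0.5):
    """Rank a {doc_id: text} mapping by query log-likelihood, best first."""
    collection_counts = Counter()
    for text in docs.values():
        collection_counts.update(tokenize(text))
    collection_len = sum(collection_counts.values())
    scored = [
        (doc_id, query_log_likelihood(query, text, collection_counts, collection_len, lam))
        for doc_id, text in docs.items()
    ]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)


# Toy collection, made up for illustration.
docs = {
    "d1": "ballistic cam design for tank fire control",
    "d2": "language models for information retrieval",
    "d3": "speech recognition with n-gram language models",
}
ranking = rank("language retrieval", docs)
```

On this toy collection the query “language retrieval” ranks d2 first, then d3, then d1. Smoothing is what lets d1 and d3 receive a finite score even though d1 contains neither query term and d3 contains only one of them.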