
Language Model for IR Using Collection Information
Rong Jin
Luo Si
Alex G. Hauptmann
Jamie Callan
School of Computer Science
Carnegie Mellon University
In this paper, we explore how to use metadata in
information retrieval tasks. We present a new language model
that takes advantage of the category information for
documents to improve retrieval accuracy. We compare the
new language model with the traditional language model on the
TREC4 dataset, where the collection information for documents is
obtained using the k-means clustering method. The new language
model outperforms the traditional language model, which supports
our claim.
Categories and Subject Descriptors
Language Models
Since the confidence in the relevance of a document to a query
depends on both the words in that document and the document's
category label, we write the probability P(Q|D,CD) in Equation (1)
as a linear combination of the probability P(Q|D) and the
probability P(Q|CD); linear interpolation is a common practice for
combining different sources of evidence:
$$P(Q \mid D, C_D) = \lambda P(Q \mid D) + (1 - \lambda) P(Q \mid C_D) \qquad (1)$$
where λ is a constant that indicates the importance of the words in
determining whether a document is relevant to a query.
The term P(Q|D) can be easily computed using the traditional
language model [2], which can be expressed as
$$P(Q \mid D) = \prod_{qw \in Q} \bigl(\alpha P(qw \mid D) + (1 - \alpha) P(qw \mid GE)\bigr) \qquad (2)$$
where α is the smoothing constant.
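The smoothed query likelihood of Equation (2) can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the function name `query_likelihood`, the use of maximum-likelihood word frequencies for both P(qw|D) and P(qw|GE), and the representation of the background model GE as a flat list of tokens are not taken from the paper. The parameter `alpha` plays the role of the smoothing constant.

```python
import math
from collections import Counter


def query_likelihood(query, doc, background, alpha=0.5):
    """Return P(Q|D) as a product of linearly smoothed word probabilities.

    query, doc, background: lists of word tokens (hypothetical inputs).
    alpha: smoothing constant weighting the document model.
    Assumption: both word models are maximum-likelihood estimates.
    """
    doc_counts = Counter(doc)
    bg_counts = Counter(background)
    log_p = 0.0
    for qw in query:
        p_doc = doc_counts[qw] / len(doc) if doc else 0.0
        p_bg = bg_counts[qw] / len(background) if background else 0.0
        p = alpha * p_doc + (1 - alpha) * p_bg
        if p == 0.0:
            # The word is unseen in both models, so the whole product is zero.
            return 0.0
        # Sum logs instead of multiplying to limit underflow on long queries.
        log_p += math.log(p)
    return math.exp(log_p)


doc = ["a", "b"]
background = ["a", "b", "c", "d"]
score = query_likelihood(["a", "c"], doc, background, alpha=0.5)
```

In this toy call the word "c" does not occur in the document, so its probability comes entirely from the background term; without smoothing the score would be zero.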
In traditional information retrieval, a document is usually assumed
to be a sequence of words, which is essentially the only evidence
the retrieval system uses to judge relevance. However, in the web
environment, documents contain much useful metadata, such as
topic information in XML format. It therefore becomes an
interesting question how to take advantage of the metadata of
documents to improve retrieval accuracy.
In this paper, we study one kind of metadata for documents, namely
category labels. The problem that we address is how to adapt
language models to make use of the category labels of
documents. The basic idea underlying the new language model
that we propose is to combine the evidence from the words
with the category labels to help predict whether a document is
relevant to the query. Documents within a category that is
relevant to a query have a better chance of being relevant than
those in a category that is clearly irrelevant to the query. The
crucial problem for the new language model is how to compute
the relevance of a category to a query and then use the computed
category relevance to help determine the relevance of
documents.
The traditional language model defines the similarity between a
document and a query as the probability P(Q|D), i.e. the
probability of generating query Q given the observation of the
document D [2]. When category labels of documents are
available, it would be appropriate to define the document-query
similarity as the probability P(Q|D,CD), where CD stands for the
unique category label for the document D.
The difficulty comes from the term P(Q|CD), which indicates the
relevance of the category CD to the query Q. To measure the
relevance of a category to a query, we can use the average
similarity score of all the documents in that category. Intuitively,
a category is believed to be relevant to a query if many of the
documents inside that category are relevant, while a category is
treated as irrelevant to a query when most of the documents within
that category have nothing to do with that query. Therefore, the
probability P(Q|C) can be expressed as the average of the
probability P(Q|D',C) over all the documents D' in the category C, or
$$P(Q \mid C) = \sum_{D' \in C} P(Q, D' \mid C) = \sum_{D' \in C} P(Q \mid D', C)\, P(D' \mid C) = \frac{1}{|C|} \sum_{D' \in C} P(Q \mid D', C) \qquad (3)$$
In the last step of Equation (3), we assume that probability
P(D’|C) is a uniform distribution and equal to 1/|C|, where |C| is
the number of documents in the category C.
As seen from Equations (1) and (3), there are two different pieces
of relevance information, namely the relevance information for
categories and the relevance information for documents. These
two sets of relevance information are intertwined
through Equations (1) and (3). As indicated by Equation (1), to
compute the relevance information for documents P(Q|D,CD), we
need to know the relevance information for categories P(Q|CD).
Meanwhile, in Equation (3), to obtain the relevance information
for categories P(Q|C), the relevance information for documents
P(Q|D,C) is required. Fortunately, it is not difficult to prove that
the only solution for P(Q|C) that satisfies both Equations (1)
and (3) is
$$P(Q \mid C) = \frac{1}{|C|} \sum_{D' \in C} P(Q \mid D') \qquad (4)$$
To prove that Equation (4) is the unique solution for P(Q|C), we
can simply substitute Equation (1) for P(Q|D',C) in Equation
(3), which results in Equation (5).
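$$\begin{aligned}
P(Q \mid C) &= \frac{1}{|C|} \sum_{D' \in C} \bigl(\lambda P(Q \mid D') + (1 - \lambda) P(Q \mid C)\bigr) \\
&= \frac{\lambda}{|C|} \sum_{D' \in C} P(Q \mid D') + \frac{1 - \lambda}{|C|} \sum_{D' \in C} P(Q \mid C) \\
&= \frac{\lambda}{|C|} \sum_{D' \in C} P(Q \mid D') + (1 - \lambda) P(Q \mid C)
\end{aligned} \qquad (5)$$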
By moving the term (1-λ)P(Q|C) from the right-hand side of the above
equation to the left-hand side and dividing both sides by λ
(assuming λ > 0), we obtain Equation (4).
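The consistency argument can also be checked numerically. The short sketch below uses made-up document scores for a single category and hypothetical helper names (`category_score`, `combined_score`). It computes the category score as the plain average of P(Q|D'), forms the combined document scores with the interpolation weight `lam`, and averages them again; the two averages should agree up to floating-point rounding.

```python
def category_score(doc_scores):
    """Average of the document scores P(Q|D') in one category."""
    return sum(doc_scores) / len(doc_scores)


def combined_score(p_q_d, p_q_c, lam):
    """Linear interpolation of document and category evidence."""
    return lam * p_q_d + (1 - lam) * p_q_c


# Made-up P(Q|D') values for the documents of one category (illustration only).
doc_scores = [0.02, 0.10, 0.06]
lam = 0.9

p_q_c = category_score(doc_scores)
combined = [combined_score(p, p_q_c, lam) for p in doc_scores]

# Averaging the combined scores should reproduce the category score.
average_combined = category_score(combined)
gap = abs(average_combined - p_q_c)
```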
Equation (1) together with Equation (4) is the foundation of the
new language model for retrieval.
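Putting the two equations together, a retrieval run could be sketched as follows. The function name `rerank_with_categories`, the dictionary-based inputs, and the example scores are illustrative assumptions rather than the authors' implementation. The sketch averages P(Q|D) within each category, interpolates with weight `lam`, and sorts the documents by the combined score.

```python
from collections import defaultdict


def rerank_with_categories(p_q_d, category_of, lam=0.9):
    """Rank documents by lam * P(Q|D) + (1 - lam) * P(Q|C_D).

    p_q_d: dict mapping doc id -> P(Q|D) (hypothetical precomputed scores).
    category_of: dict mapping doc id -> category label.
    """
    # Collect the document scores of each category.
    members = defaultdict(list)
    for doc_id, score in p_q_d.items():
        members[category_of[doc_id]].append(score)
    # Category score: average document score within the category.
    p_q_c = {c: sum(scores) / len(scores) for c, scores in members.items()}
    # Combined score: interpolate document and category evidence.
    combined = {
        doc_id: lam * score + (1 - lam) * p_q_c[category_of[doc_id]]
        for doc_id, score in p_q_d.items()
    }
    return sorted(combined.items(), key=lambda item: item[1], reverse=True)


# Made-up scores: d2 sits in a category with a strongly matching document (d3).
scores = {"d1": 0.05, "d2": 0.048, "d3": 0.2, "d4": 0.0}
labels = {"d1": "A", "d2": "B", "d3": "B", "d4": "A"}
ranking = [doc_id for doc_id, _ in rerank_with_categories(scores, labels, lam=0.9)]
```

In this toy example d2 overtakes d1, even though its own P(Q|D) is slightly lower, because its category as a whole matches the query better.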
To evaluate the effectiveness of the new language model, we
conduct a contrastive experiment comparing the new language
model with the traditional language model. Since we do not have a
collection with both manually assigned category labels and human
relevance judgments for documents, we use the TREC4 collection
as the test bed, where the category labels are automatically
computed using the k-means clustering method [3]. There are
altogether 100 different categories. Of course, the automatically
generated category labels for documents are not of high quality
and therefore may have a negative impact on the retrieval
performance of the new language model.
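For readers unfamiliar with k-means, the following is a minimal sketch of how pseudo category labels could be produced from document vectors. The paper does not describe the features, the distance measure, or the initialization, so everything here is an assumption: dense numeric vectors, squared Euclidean distance, centroids seeded with the first k vectors, and the hypothetical function name `kmeans_labels`. The toy call uses k = 2, whereas the experiment uses 100 categories.

```python
def squared_distance(a, b):
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def kmeans_labels(vectors, k, iterations=20):
    """Assign each vector a cluster id in range(k) with plain Lloyd iterations."""
    # Assumption: seed the centroids with the first k vectors (deterministic).
    centroids = [list(v) for v in vectors[:k]]
    labels = [0] * len(vectors)
    for _ in range(iterations):
        # Assignment step: pick the nearest centroid for every vector.
        labels = [
            min(range(k), key=lambda c: squared_distance(v, centroids[c]))
            for v in vectors
        ]
        # Update step: move each centroid to the mean of its members.
        for c in range(k):
            members = [v for v, lab in zip(vectors, labels) if lab == c]
            if members:  # keep the old centroid if a cluster is empty
                centroids[c] = [sum(col) / len(members) for col in zip(*members)]
    return labels


# Toy document vectors forming two well-separated groups.
vectors = [[0, 0], [10, 10], [0, 1], [10, 9], [1, 0]]
pseudo_labels = kmeans_labels(vectors, k=2)
```

The resulting cluster ids can then serve as the category label of each document when computing the category scores.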
The exact formula for the traditional language model is shown in
Equation (2), which is also used to compute P(Q|D) in the new
language model. The smoothing constant α is set to 0.5 for both
language models.
Since the weight λ reflects the importance of document words in
determining the document-query similarity, and in most cases
document words provide much more information than the
document labels, λ is usually set close to 1.
Table 1 lists the average precision and the precision at different
recall points for both the traditional language model and the new
language model with the weight λ set to 0.90. We also conducted
experiments varying λ from 0.8 to 0.95 and obtained results
similar to those listed in Table 1. When λ is lower than 0.8, the results
become worse. This is consistent with the intuition that document
words provide far more information than document labels, and
therefore the weight λ should be set close to 1.
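The two measures reported in Table 1 can be sketched as below. The paper does not spell out its exact definitions, so this assumes the usual TREC-style conventions: non-interpolated average precision, and interpolated precision at a recall level taken as the highest precision at any rank whose recall is at least that level. The function names and the toy ranking are illustrative.

```python
def average_precision(ranked_ids, relevant_ids):
    """Mean of the precision values at the ranks of the relevant documents."""
    relevant = set(relevant_ids)
    hits = 0
    total = 0.0
    for rank, doc_id in enumerate(ranked_ids, start=1):
        if doc_id in relevant:
            hits += 1
            total += hits / rank
    return total / len(relevant) if relevant else 0.0


def interpolated_precision(ranked_ids, relevant_ids, recall_levels):
    """Map each recall level to the best precision at recall >= that level."""
    relevant = set(relevant_ids)
    points = []  # (recall, precision) after each relevant document
    hits = 0
    for rank, doc_id in enumerate(ranked_ids, start=1):
        if doc_id in relevant:
            hits += 1
            points.append((hits / len(relevant), hits / rank))
    return {
        level: max((p for rec, p in points if rec >= level), default=0.0)
        for level in recall_levels
    }


# Toy ranking with two relevant documents, found at ranks 1 and 3.
ranked = ["a", "b", "c", "d"]
relevant = ["a", "c"]
ap = average_precision(ranked, relevant)
curve = interpolated_precision(ranked, relevant, [0.5, 1.0])
```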
As seen from Table 1, the new language model outperforms the
traditional language model in terms of both average precision and
precision at all recall points except 90% recall.
The quality of the category labels used in the experiments is not
very high, as we only use pseudo labels generated by the k-means
algorithm. We expect that the performance would
improve further if we could incorporate high-quality labels, such as
those generated by domain experts.
Table 1: Average precision and precision at different recall points for
both the traditional language model and the new language model on TREC4
Columns: Traditional LM, New LM. Rows: Avg. Precision and the precision at each recall point (values not preserved in this copy).
This work is related to prior work on combining
full-text and controlled vocabulary indexing [1], where category
labels of documents can be treated as manually assigned
keywords. The difference is that, in the work on
combining full-text and controlled vocabulary indexing,
retrieval accuracy is improved by the extra matches of query
words with the manually assigned keywords, while in our work
the improvement is obtained by using the estimated matching
probabilities of categories with respect to the query.
In this paper, we proposed a new language model for information
retrieval that is able to make use of document labels. In the
experiment with TREC4 data, where the category labels for
documents are simulated using the k-means clustering method, the
new language model outperforms the traditional language model.
We therefore conclude that the new language model is a
promising method for exploiting category information to improve
retrieval performance.
In principle, the proposed method can be extended to incorporate
other metadata. However, the problem of how to set
parameters such as the interpolation weights needs to be explored in future
work. Furthermore, since we only explore linear combination
as the way of combining evidence in this work, more studies
need to be carried out on different combination methods.
