English Language Coursework 2004/2005

Iva Stefanova Koevska, faculty no. 43829
Faculty of Mathematics and Informatics (FMI), Informatics, group 6, 1st year
Information Retrieval Modelling: The 2-Poisson Model
http://www.cs.utwente.nl/~hiemstra/papers/thesis.pdf (page 18, page 24)
1.1 Models of ranked retrieval
Models of ranked retrieval usually imply the use of some statistics on the terms, that
is, they somehow take into account the number of occurrences of terms in the documents or in
the index to compute rankings. Another key issue of models of ranked retrieval is automatic
query formulation. This addresses the difficulties non-expert users have with the Boolean
operators. Non-expert users should be able to enter a real natural language request, or possibly
just a couple of terms, without the use of operators. Both ranking and the fact that operators
are not mandatory are shared by the approaches presented in this section. For each model, some
pros and cons are identified.
1.2 The 2-Poisson model
Bookstein and Swanson (1974) studied the problem of developing a set of statistical
rules for the purpose of identifying the index terms of a document. They suggested that the
number of occurrences tf of terms in documents could be modelled by a mixture of two
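Poisson distributions as follows, where X is a random variable for the number of occurrences:
$$P(X = tf) = \lambda \frac{e^{-\mu_1} \mu_1^{tf}}{tf!} + (1 - \lambda) \frac{e^{-\mu_2} \mu_2^{tf}}{tf!}$$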
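As a rough illustration, a mixture of two Poisson distributions can be evaluated numerically. The sketch below assumes a mixing proportion `lam` (the share of documents in subset one) and Poisson means `mu1` and `mu2`. The function names and the parameter values are hypothetical and are not taken from Bookstein and Swanson.

```python
import math


def poisson_pmf(k: int, mu: float) -> float:
    """Probability of exactly k occurrences under a Poisson with mean mu."""
    return math.exp(-mu) * mu ** k / math.factorial(k)


def two_poisson_pmf(tf: int, lam: float, mu1: float, mu2: float) -> float:
    """Mixture: with probability lam the count follows Poisson(mu1),
    otherwise it follows Poisson(mu2)."""
    return lam * poisson_pmf(tf, mu1) + (1.0 - lam) * poisson_pmf(tf, mu2)


# Illustrative parameter values only (assumed, not estimated from data).
probs = [two_poisson_pmf(tf, lam=0.2, mu1=5.0, mu2=0.5) for tf in range(60)]
total = sum(probs)  # should be very close to 1 for a proper distribution
```

Summing the probabilities over a wide enough range of counts is a quick sanity check that the mixture is a proper distribution.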
The model assumes that the documents were created by a random stream of term
occurrences. For each term, the collection can be divided into two subsets. Documents in
subset one treat a subject referred to by a term to a greater extent than documents in subset
two. This is represented by λ, which is the proportion of the documents that belong to subset
one, and by the Poisson means µ1 and µ2 (µ1 ≥ µ2), which can be estimated from the mean
number of occurrences of the term in the respective subsets. For each term, the model needs
these three parameters, but unfortunately, it is unknown to which subset each document
belongs. The estimation of the three parameters should therefore be done iteratively by
applying e.g. the expectation maximisation algorithm or alternatively by the method of
moments as done by Harter (1975). If a document is taken at random from subset one, then
the probability of relevance of this document is assumed to be equal to, or higher than, the
probability of relevance of a document from subset two; because the probability of relevance
is assumed to be correlated with the extent to which a subject referred to by a term is treated,
and because µ1 ≥ µ2. Useful terms will make a good distinction between relevant and non-relevant documents, that is, both subsets will have very different Poisson means µ1 and µ2.
Therefore, Harter (1975) suggests the following measure of effectiveness of an index term
that can be used to rank the documents given a query.
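The formula itself is absent from this copy. Harter's measure is commonly stated as z = (µ1 − µ2) / sqrt(µ1 + µ2), and the sketch below assumes that form. The function name, the example terms and their Poisson means are hypothetical and serve only to show how a larger separation between the means gives a larger z.

```python
import math


def harter_z(mu1: float, mu2: float) -> float:
    """Separation of the two Poisson means, assuming z = (mu1 - mu2) / sqrt(mu1 + mu2)."""
    if mu1 + mu2 <= 0:
        raise ValueError("at least one mean must be positive")
    return (mu1 - mu2) / math.sqrt(mu1 + mu2)


# Hypothetical (mu1, mu2) pairs for two terms; not taken from any real collection.
term_means = {"retrieval": (6.0, 0.5), "the": (20.0, 19.0)}

# Terms ordered from most to least discriminating under this measure.
ranked = sorted(term_means, key=lambda t: harter_z(*term_means[t]), reverse=True)
```

Under these assumed values the content-bearing term separates the two subsets far better than the function word, even though the function word occurs more often.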
The 2-Poisson model's main advantage is that it does not need an additional term
weighting algorithm to be implemented. In this respect, the model contributed to the
understanding of information retrieval and inspired some researchers in developing new
models. The model's biggest problem, however, is the estimation of the parameters. For each
term there are three unknown parameters that cannot be estimated directly from the observed
data. Furthermore, despite the model's complexity, it still might not fit the actual data if the
term frequencies differ very much per document. Some studies therefore examine the use of
more than two Poisson functions, but this makes the estimation problem even more intractable
(Margulis 1993).
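To make the estimation problem concrete, the following is a minimal sketch of expectation maximisation for a single term, assuming the observed data is a list of per-document counts. The function names, the starting values and the example counts are all assumptions made for illustration. This is not Harter's method of moments, and it makes no claim about how well the model fits real collections.

```python
import math


def poisson_pmf(k, mu):
    """Probability of exactly k occurrences under a Poisson with mean mu."""
    return math.exp(-mu) * mu ** k / math.factorial(k)


def fit_two_poisson(counts, iterations=200):
    """Estimate (lam, mu1, mu2) for a two-Poisson mixture by EM."""
    n = len(counts)
    mean = sum(counts) / n
    # Crude starting guesses (an assumption): one mean above and one
    # below the overall mean, with equal mixing weights.
    lam, mu1, mu2 = 0.5, 1.5 * mean + 0.1, 0.5 * mean + 0.01
    for _ in range(iterations):
        # E-step: probability that each document belongs to subset one.
        resp = []
        for k in counts:
            a = lam * poisson_pmf(k, mu1)
            b = (1.0 - lam) * poisson_pmf(k, mu2)
            resp.append(a / (a + b) if a + b > 0 else 0.5)
        # M-step: re-estimate the proportion and the two weighted means.
        weight1 = sum(resp)
        weight2 = n - weight1
        if weight1 < 1e-12 or weight2 < 1e-12:
            break  # one component has vanished; stop updating
        lam = weight1 / n
        mu1 = sum(r * k for r, k in zip(resp, counts)) / weight1
        mu2 = sum((1.0 - r) * k for r, k in zip(resp, counts)) / weight2
    if mu1 < mu2:
        # Keep the convention that subset one has the larger mean.
        lam, mu1, mu2 = 1.0 - lam, mu2, mu1
    return lam, mu1, mu2


# Made-up counts: most documents barely mention the term, a few use it heavily.
counts = [0] * 60 + [1] * 30 + [2] * 10 + [7] * 5 + [8] * 10 + [9] * 5
lam, mu1, mu2 = fit_two_poisson(counts)
```

Even in this small example the result depends on the starting values and on how clearly the two groups are separated, which hints at why estimating three parameters per term is considered the model's main difficulty.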
Glossary
Index – a pointer; a sign or indication.
Term – each of the members of which a mathematical expression, a series of
quantities, etc., is composed.
Variable – (a symbol that represents) a quantity that may assume any given value or
set of values.
Subset – a set consisting of elements of a given set that can be the same as the given
set or smaller.
Extent – the space or degree to which a thing extends.
Proportion – comparative relation between things or magnitudes.
Mean number – an average number.
Respective subset – a subset that pertains individually to each of a number of things.
Parameter – a constant or variable term in a mathematical function that determines
the specific form of the function but not its general nature.
Iterate – to say or do something again.
Algorithm – a set of rules for solving a problem in a finite number of steps (as for
finding the greatest common divisor).
Correlate – to bring into mutual or reciprocal relation.
Implement – to put into effect according to a plan or procedure.
Intractable – not docile or manageable.
Statistic – a fact in the form of a number that shows information about something.
Exercise
1.The _________ shows the exact number of terms in set A.
A.point
B.signature
C.index
D.term
2.The two _________ “x” and “y” are used to form a system of equations which has a
single solution (0;0).
A.parameters
B.variables
C.constants
D.numbers
3.The cycle is _________ (1) according to the value of the _________ (2) “i”.
A1.used
B1.done
C1.repeated
D1.iterated
A2.parameter
B2.variable
C2.constant
D2.number
4.The values of the two parameters “a” and “b” are used to form a _________ which
represents the relation between the two variables “x” and “y”.
A.relation
B.distribution
C.proportion
D.equation
5.The elements of the two _________ (1) are _________ (2) in order to form the rule by
which the set is formed.
A1.sets
B1.equations
C1.lines
D1.subsets
A2.correlated
B2.formed
C2.subtracted
D2.used
6.The _________ (1) is formed out of the _________ (2) of the set.
A1.mean number
B1.median
C1.natural number
D1.zero set
A2.things
B2.members
C2.elements
D2.terms
7.The _________ (1) is _________ (2) only for the _________ (3) sets of numbers.
A1.operation
B1.way
C1.rule
D1.algorithm
A2.implemented
B2.done
C2.started
D2.undone
A3.supernatural
B3.natural
C3.respective
D3.opposite
8.The algorithm was _________ so the problem couldn’t be solved.
A.undone
B.intractable
C.docile
D.manageable
9.The _________ (1) of the terms in subset A is greater than the _________ (2) of the
terms in subset B.
A1.elements
B1.numbers
C1.values
D1.extent
A2.elements
B2.numbers
C2.values
D2.extent
Answer keys:
1.The _________ shows the exact number of terms in set A.
A.point
B.signature
C.index
D.term
2.The two _________ “x” and “y” are used to form a system of equations which has a
single solution (0;0).
A.parameters
B.variables
C.constants
D.numbers
3.The cycle is _________ (1) according to the value of the _________ (2) “i”.
A1.used
B1.done
C1.repeated
D1.iterated
A2.parameter
B2.variable
C2.constant
D2.number
4.The values of the two parameters “a” and “b” are used to form a _________ which
represents the relation between the two variables “x” and “y”.
A.relation
B.distribution
C.proportion
D.equation
5.The elements of the two _________ (1) are _________ (2) in order to form the rule by
which the set is formed.
A1.sets
B1.equations
C1.lines
D1.subsets
A2.correlated
B2.formed
C2.subtracted
D2.used
6.The _________ (1) is formed out of the _________ (2) of the set.
A1.mean number
B1.median
C1.natural number
D1.zero set
A2.things
B2.members
C2.elements
D2.terms
7.The _________ (1) is _________ (2) only for the _________ (3) sets of numbers.
A1.operation
B1.way
C1.rule
D1.algorithm
A2.implemented
B2.done
C2.started
D2.undone
A3.supernatural
B3.natural
C3.respective
D3.opposite
8.The algorithm was _________ so the problem couldn’t be solved.
A.undone
B.intractable
C.docile
D.manageable
9.The _________ (1) of the terms in subset A is greater than the _________ (2) of the
terms in subset B.
A1.elements
B1.numbers
C1.values
D1.extent
A2.elements
B2.numbers
C2.values
D2.extent