Iva Stefanova Koevska, faculty no. 43829
Faculty of Mathematics and Informatics (FMI), Informatics, group 6, 1st year
Coursework in English, 2004/2005

Information Retrieval Modelling: The 2-Poisson Model

http://www.cs.utwente.nl/~hiemstra/papers/thesis.pdf (page 18, page 24)

1.1 Models of ranked retrieval

Models of ranked retrieval usually rely on term statistics: to compute rankings, they take into account the number of occurrences of terms in the documents or in the index. Another key issue of models of ranked retrieval is automatic query formulation, which addresses the difficulties non-expert users have with Boolean operators. Non-expert users should be able to enter a real natural language request, or possibly just a couple of terms, without the use of operators. Both ranking and the fact that operators are not mandatory are shared by the approaches presented in this section. For each model, some pros and cons are identified.

1.2 The 2-Poisson model

Bookstein and Swanson (1974) studied the problem of developing a set of statistical rules for identifying the index terms of a document. They suggested that the number of occurrences tf of terms in documents could be modelled by a mixture of two Poisson distributions as follows, where X is a random variable for the number of occurrences.

$$P(X = tf) = \lambda \, \frac{e^{-\mu_1} \mu_1^{tf}}{tf!} + (1 - \lambda) \, \frac{e^{-\mu_2} \mu_2^{tf}}{tf!}$$

The model assumes that the documents were created by a random stream of term occurrences. For each term, the collection can be divided into two subsets. Documents in subset one treat the subject referred to by the term to a greater extent than documents in subset two. This is represented by λ, which is the proportion of the documents that belong to subset one, and by the Poisson means µ1 and µ2 (µ1 ≥ µ2), which can be estimated from the mean number of occurrences of the term in the respective subsets. For each term, the model needs these three parameters, but unfortunately it is unknown to which subset each document belongs.
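As a minimal sketch of the mixture described above (not code from the cited thesis), the Python functions below evaluate the probability of observing tf occurrences for given parameters. The function names, the parameter name `lam` for the proportion of documents in subset one, and the parameter values in the usage lines are illustrative assumptions. The last function applies Bayes' rule, a step the text does not state, to show that subset membership can only be inferred as a probability from the observed count.

```python
import math


def poisson_pmf(k: int, mu: float) -> float:
    """Probability of exactly k occurrences under a Poisson with mean mu."""
    return math.exp(-mu) * mu ** k / math.factorial(k)


def two_poisson_pmf(tf: int, lam: float, mu1: float, mu2: float) -> float:
    """P(X = tf) under a mixture of two Poissons.

    lam is the proportion of documents in subset one (assumed name),
    mu1 and mu2 are the Poisson means of subsets one and two.
    """
    return lam * poisson_pmf(tf, mu1) + (1.0 - lam) * poisson_pmf(tf, mu2)


def subset_one_posterior(tf: int, lam: float, mu1: float, mu2: float) -> float:
    """Bayes' rule: probability that a document with tf occurrences
    of the term belongs to subset one (illustrative addition)."""
    return lam * poisson_pmf(tf, mu1) / two_poisson_pmf(tf, lam, mu1, mu2)


if __name__ == "__main__":
    # Hypothetical parameter values, chosen only for illustration.
    lam, mu1, mu2 = 0.2, 4.0, 0.5
    for tf in range(6):
        p = two_poisson_pmf(tf, lam, mu1, mu2)
        post = subset_one_posterior(tf, lam, mu1, mu2)
        print(f"tf={tf}  P(X=tf)={p:.4f}  P(subset one | tf)={post:.4f}")
```

With µ1 larger than µ2, the posterior grows with tf: the more often a term occurs in a document, the more likely the document is to belong to subset one.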
The estimation of the three parameters should therefore be done iteratively, for example by applying the expectation maximisation algorithm, or alternatively by the method of moments as done by Harter (1975).

If a document is taken at random from subset one, then the probability of relevance of this document is assumed to be equal to, or higher than, the probability of relevance of a document from subset two, because the probability of relevance is assumed to be correlated with the extent to which the subject referred to by a term is treated, and because µ1 ≥ µ2. Useful terms will make a good distinction between relevant and nonrelevant documents, that is, both subsets will have very different Poisson means µ1 and µ2. Therefore, Harter (1975) suggests the following measure of effectiveness of an index term that can be used to rank the documents given a query.

$$z = \frac{\mu_1 - \mu_2}{\sqrt{\mu_1 + \mu_2}}$$

The 2-Poisson model's main advantage is that it does not need an additional term weighting algorithm to be implemented. In this respect, the model contributed to the understanding of information retrieval and inspired some researchers in developing new models. The model's biggest problem, however, is the estimation of the parameters. For each term there are three unknown parameters that cannot be estimated directly from the observed data. Furthermore, despite the model's complexity, it still might not fit the actual data if the term frequencies differ very much per document. Some studies therefore examine the use of more than two Poisson functions, but this makes the estimation problem even more intractable (Margulis 1993).

Glossary

Index – a pointer; a sign or indication.
Term – each of the members of which a mathematical expression, a series of quantities, etc. is composed.
Variable – (a symbol that represents) a quantity that may assume any given value or set of values.
Subset – a set consisting of elements of a given set that can be the same as the given set or smaller.
Extent – the space or degree to which a thing extends.
Proportion – comparative relation between things or magnitudes.
Mean number – an average number.
Respective subset – a subset that pertains individually to each of a number of things.
Parameter – a constant or variable term in a mathematical function that determines the specific form of the function but not its general nature.
Iterate – to say or do something again.
Algorithm – a set of rules for solving a problem in a finite number of steps (as for finding the greatest common divisor).
Correlate – to bring into mutual or reciprocal relation.
Implement – to put into effect according to a plan or procedure.
Intractable – not docile or manageable.
Statistic – a fact in the form of a number that shows information about something.

Exercise

1. The _________ shows the exact number of terms in set A.
   A. point   B. signature   C. index   D. term

2. The two _________ “x” and “y” are used to form a system of equations which has a single solution (0;0).
   A. parameters   B. variables   C. constants   D. numbers

3. The cycle is _________ (1) according to the value of the _________ (2) “i”.
   A1. used   B1. done   C1. repeated   D1. iterated
   A2. parameter   B2. variable   C2. constant   D2. number

4. The values of the two parameters “a” and “b” are used to form a _________ which represents the relation between the two variables “x” and “y”.
   A. relation   B. distribution   C. proportion   D. equation

5. The elements of the two _________ (1) are _________ (2) in order to form the rule by which the set is formed.
   A1. sets   B1. equations   C1. lines   D1. subsets
   A2. correlated   B2. formed   C2. subtracted   D2. used

6. The _________ (1) is formed out of the _________ (2) of the set.
   A1. mean number   B1. median   C1. natural number   D1. zero set
   A2. things   B2. members   C2. elements   D2. terms

7. The _________ (1) is _________ (2) only for the _________ (3) sets of numbers.
   A1. operation   B1. way   C1. rule   D1. algorithm
   A2. implemented   B2. done   C2. started   D2. undone
   A3. supernatural   B3. natural   C3. respective   D3. opposite

8. The algorithm was _________ so the problem couldn’t be solved.
   A. undone   B. intractable   C. docile   D. manageable

9. The _________ (1) of the terms in subset A is greater than the _________ (2) of the terms in subset B.
   A1. elements   B1. numbers   C1. values   D1. extent
   A2. elements   B2. numbers   C2. values   D2. extent

Answer keys:

1. The _________ shows the exact number of terms in set A.
   A. point   B. signature   C. index   D. term

2. The two _________ “x” and “y” are used to form a system of equations which has a single solution (0;0).
   A. parameters   B. variables   C. constants   D. numbers

3. The cycle is _________ (1) according to the value of the _________ (2) “i”.
   A1. used   B1. done   C1. repeated   D1. iterated
   A2. parameter   B2. variable   C2. constant   D2. number

4. The values of the two parameters “a” and “b” are used to form a _________ which represents the relation between the two variables “x” and “y”.
   A. relation   B. distribution   C. proportion   D. equation

5. The elements of the two _________ (1) are _________ (2) in order to form the rule by which the set is formed.
   A1. sets   B1. equations   C1. lines   D1. subsets
   A2. correlated   B2. formed   C2. subtracted   D2. used

6. The _________ (1) is formed out of the _________ (2) of the set.
   A1. mean number   B1. median   C1. natural number   D1. zero set
   A2. things   B2. members   C2. elements   D2. terms

7. The _________ (1) is _________ (2) only for the _________ (3) sets of numbers.
   A1. operation   B1. way   C1. rule   D1. algorithm
   A2. implemented   B2. done   C2. started   D2. undone
   A3. supernatural   B3. natural   C3. respective   D3. opposite

8. The algorithm was _________ so the problem couldn’t be solved.
   A. undone   B. intractable   C. docile   D. manageable

9. The _________ (1) of the terms in subset A is greater than the _________ (2) of the terms in subset B.
   A1. elements   B1. numbers   C1. values   D1. extent
   A2. elements   B2. numbers   C2. values   D2. extent