Abstract

Learning to Categorise Texts using Unsupervised Neural Nets
Khurshid Ahmad
Department of Computing,
University of Surrey
Guildford, Surrey.
k.ahmad@surrey.ac.uk
The categorisation of texts according to a conceptual scheme is important for storing, retrieving and routing
texts. A number of pragmatic categorisation schemes are currently in use: Reuters uses a complex scheme for
categorising news stories, and this scheme closely follows the layout of a quality newspaper; the Dewey
Decimal Classification system is used by information scientists and library professionals to classify books,
and this system reflects the layout of an encyclopaedia; journal editors and publishers use idiosyncratic schemes
that reflect the state of the discipline that one or more journals serve. It appears that the categories in any of the
conceptual schemes described above rely largely on the consensus of the individuals who publish texts and
those who actually use the published texts. The users may be critical of a scheme but are in general
powerless to do much about it.
I will outline a method that relies on a set of salient terms in a collection of texts – a corpus organised either
synchronically or diachronically – for generating a proximity map that shows clusters of similar texts. The
frequency distribution of single words in the collection is compared with the distribution of these words in
a reference corpus of general language; in the case of English one can use the British National Corpus or
corpora collated by dictionary publishers. The salient terms will have a very different distribution in the
collection and in the reference corpus (Ahmad 1995; Ahmad and Rogers 2001). The terms with the largest
difference are chosen as the dimensions of a vector describing each text: the presence or absence of each salient
term in each of the texts in the collection is denoted as ‘1’ or ‘0’. The vectors of the individual texts are
used to train a Kohonen self-organising feature map (SOFM). The resulting map clearly shows clusters of texts
that have a resonance with the text categories assigned to the individual texts by the (sub-)editors of the
publications in which the texts were published. I will describe how we have classified texts provided by
AP News Wire and Reuters using an SOFM. I will also report on a statistic developed at Surrey to describe
the goodness of each of the clusters within the SOFM. This method has been reported in Ahmad, Vrusias
and Ledford (2001).
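
As an illustrative sketch only, and not the implementation reported in Ahmad, Vrusias and Ledford (2001), the steps described above might be coded as follows. The use of a smoothed relative-frequency ratio as the measure of difference, whitespace tokenisation, all function names, and the learning-rate and neighbourhood schedules are assumptions made for the example.

```python
import math
import random
from collections import Counter


def salient_terms(collection, reference, k):
    """Return the k words whose relative frequency in the collection most
    exceeds their relative frequency in the reference corpus.
    Assumption: the 'difference' is measured as a ratio, with crude
    smoothing so that words absent from the reference do not divide by zero."""
    coll = Counter(w for text in collection for w in text.lower().split())
    ref = Counter(w for text in reference for w in text.lower().split())
    n_coll, n_ref = sum(coll.values()), sum(ref.values())

    def ratio(word):
        return (coll[word] / n_coll) / ((ref[word] + 1) / (n_ref + 1))

    # Ties are broken alphabetically so the ranking is deterministic.
    return sorted(coll, key=lambda w: (-ratio(w), w))[:k]


def binary_vector(text, terms):
    """Mark the presence (1) or absence (0) of each salient term in a text."""
    words = set(text.lower().split())
    return [1 if term in words else 0 for term in terms]


def best_matching_unit(weights, vector):
    """Grid position of the node whose weights are closest to the vector."""
    return min(weights, key=lambda node: sum(
        (w - x) ** 2 for w, x in zip(weights[node], vector)))


def train_som(vectors, rows, cols, epochs=50, seed=0):
    """Train a small rectangular Kohonen map on the given vectors.
    Assumption: linearly decaying learning rate and Gaussian neighbourhood."""
    rng = random.Random(seed)
    dim = len(vectors[0])
    weights = {(r, c): [rng.random() for _ in range(dim)]
               for r in range(rows) for c in range(cols)}
    radius0 = max(rows, cols) / 2
    for epoch in range(epochs):
        frac = epoch / epochs
        rate = 0.5 * (1 - frac)
        radius = max(radius0 * (1 - frac), 0.5)
        for vector in rng.sample(vectors, len(vectors)):
            bmu = best_matching_unit(weights, vector)
            for node, w in weights.items():
                dist2 = (node[0] - bmu[0]) ** 2 + (node[1] - bmu[1]) ** 2
                pull = rate * math.exp(-dist2 / (2 * radius ** 2))
                for i in range(dim):
                    w[i] += pull * (vector[i] - w[i])
    return weights
```

After training, calling `best_matching_unit` on each text's vector gives its position on the map, and texts that land on the same or neighbouring nodes form the clusters that can be compared with the editors' categories.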
We have recently used the SOFM trained on full texts to evaluate the quality of summaries of these texts
produced automatically by a summariser developed at Surrey by Mohammad Benbrahim (Ahmad and
Benbrahim 1995), Lena Tostevin and Lee Gillam (Ahmad, Tostevin and Gillam 2001) and Paolo Olivera.
The evaluation of automatically generated summaries is an open problem in text understanding, and a method for
automatically measuring the overlap between a text and its surrogate summary may help here.
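
The abstract does not say how the overlap is measured, so the following is only a hypothetical sketch of two plausible measures: the distance on the map between the node that best matches a text and the node that best matches its summary, and the fraction of the text's salient terms retained in the summary. The function names and the representation of the trained map as a dictionary from grid positions to weight lists are assumptions made for the example.

```python
import math


def best_matching_unit(weights, vector):
    """Grid position of the node whose weights are closest to the vector."""
    return min(weights, key=lambda node: sum(
        (w - x) ** 2 for w, x in zip(weights[node], vector)))


def map_displacement(weights, text_vector, summary_vector):
    """Euclidean grid distance between the nodes that a text and its
    summary are mapped to; 0.0 means both fall on the same node."""
    r1, c1 = best_matching_unit(weights, text_vector)
    r2, c2 = best_matching_unit(weights, summary_vector)
    return math.hypot(r1 - r2, c1 - c2)


def term_overlap(text_vector, summary_vector):
    """Fraction of the salient terms present in the text that are also
    present in the summary (both given as 0/1 vectors)."""
    present = sum(text_vector)
    shared = sum(1 for t, s in zip(text_vector, summary_vector) if t and s)
    return shared / present if present else 0.0
```

Under these assumptions, a summary that stays on or near the node of its source text and retains most of its salient terms would be judged a closer surrogate than one that moves to a distant part of the map.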
References
Ahmad, K., and Rogers, M. A. (2001). ‘Corpus Linguistics and Terminology Extraction’. In
(Eds.) Sue-Ellen Wright and Gerhard Budin. Handbook of Terminology Management (Volume 2).
Amsterdam & Philadelphia: John Benjamins Publishing Company. pp 725-760.
Ahmad, K., Vrusias, B., and Ledford, A. (2001) Choosing Feature Sets for Training and Testing
Self-Organising Maps: A Case Study. Neural Computing & Applications. Volume 10, pp 56-66.
Ahmad, K., Gillam, L., & Tostevin, L. (2001) Weirdness Indexing for Logical Document Extrapolation and
Retrieval (WILDER). In (Eds.) E.M. Voorhees and D.K. Harman. The 8th Text Retrieval Conference
(TREC-8). Washington: National Institute of Standards and Technology. pp 717-724
Ahmad, K. & Benbrahim, M. (1995) Text Summarisation: The role of lexical cohesion analysis. The New
Review of Document and Text Management, Vol. 1. pp.321-335.
Ahmad, K. (1995). Pragmatics of Specialist Terms and Terminology Management. In (Ed.) Petra Steffens.
Machine Translation and the Lexicon. (Invited talk at 3rd Int. EAMT Workshop, Heidelberg, Germany,
April 26-28, 1993.) Heidelberg (Germany): Springer. pp. 51-76.