Learning to Categorise Texts using Unsupervised Neural Nets

Khurshid Ahmad
Department of Computing, University of Surrey
Guildford, Surrey.
k.ahmad@surrey.ac.uk

The categorisation of texts according to a conceptual scheme is important for storing, retrieving and routing texts. A number of pragmatic categorisation schemes are currently in use. Reuters uses a complex scheme for categorising news stories, and this scheme closely follows the layout of a quality newspaper. The Dewey Decimal Classification system is used by information scientists and library professionals to classify books, and this system reflects the layout of an encyclopaedia. Journal editors and publishers use idiosyncratic schemes that reflect the state of the discipline that one or more journals serve. It appears that the categories in any of the conceptual schemes described above rely largely on the consensus of the individuals who publish texts and of those who actually use the published texts. Users may be critical of a scheme but are, in general, powerless to do much about it.

I will outline a method that relies on a set of salient terms in a collection of texts (a corpus organised either synchronically or diachronically) for generating a proximity map that shows clusters of similar texts. The frequency distribution of single words in the collection is compared with the distribution of these words in a reference corpus of general language; in the case of English one can use the British National Corpus or corpora collated by dictionary publishers. The salient terms will have a very different distribution in the collection and in the reference corpus (Ahmad 1995; Ahmad and Rogers 2001). The terms with the largest differences are chosen as the components of a vector describing each text: the presence or absence of each salient term in a text is denoted as ‘1’ or ‘0’. The vectors of the individual texts are used to train a Kohonen self-organising feature map (SOFM).
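
The selection, encoding and training steps described above might be sketched as follows. This is only an illustration under stated assumptions, not the implementation used at Surrey. The abstract does not give the formula used to compare the two frequency distributions, so the sketch uses a crudely smoothed ratio of relative frequencies. The function names (salient_terms, binary_vector, best_matching_unit, train_som), the grid size, the learning-rate and neighbourhood schedules and the toy texts are all hypothetical.

```python
import math
import random
from collections import Counter


def salient_terms(collection_tokens, reference_tokens, k):
    """Return the k words whose relative frequency in the collection most
    exceeds their relative frequency in the reference corpus."""
    coll = Counter(collection_tokens)
    ref = Counter(reference_tokens)
    coll_total = sum(coll.values())
    ref_total = sum(ref.values())

    def ratio(word):
        coll_rel = coll[word] / coll_total
        # Assumption: crude add-one smoothing so that words missing from
        # the reference corpus do not cause a division by zero.
        ref_rel = (ref[word] + 1) / (ref_total + 1)
        return coll_rel / ref_rel

    # Sort by descending ratio; break ties alphabetically.
    return sorted(coll, key=lambda w: (-ratio(w), w))[:k]


def binary_vector(text_tokens, terms):
    """Encode a text as 1/0 flags for presence/absence of each salient term."""
    present = set(text_tokens)
    return [1 if term in present else 0 for term in terms]


def best_matching_unit(weights, vector):
    """Return the grid position whose weight vector is closest to `vector`."""
    return min(
        weights,
        key=lambda node: sum((w - x) ** 2 for w, x in zip(weights[node], vector)),
    )


def train_som(vectors, rows, cols, epochs=50, seed=0):
    """Train a small rectangular self-organising map on the given vectors."""
    rng = random.Random(seed)
    dim = len(vectors[0])
    weights = {
        (r, c): [rng.random() for _ in range(dim)]
        for r in range(rows)
        for c in range(cols)
    }
    for epoch in range(epochs):
        frac = epoch / epochs
        rate = 0.5 * (1.0 - frac)                          # decaying learning rate
        radius = max(rows, cols) / 2 * (1.0 - frac) + 0.5  # shrinking neighbourhood
        for vector in vectors:
            bmu = best_matching_unit(weights, vector)
            for node, w in weights.items():
                dist2 = (node[0] - bmu[0]) ** 2 + (node[1] - bmu[1]) ** 2
                influence = math.exp(-dist2 / (2 * radius ** 2))
                for i in range(dim):
                    w[i] += rate * influence * (vector[i] - w[i])
    return weights


if __name__ == "__main__":
    # Toy data, invented for illustration only.
    texts = {
        "t1": "the bank raised interest rates".split(),
        "t2": "interest rates fell as the bank cut".split(),
        "t3": "the striker scored a late goal".split(),
    }
    reference = "the a of and in to as was it is late cut fell".split()
    collection = [tok for toks in texts.values() for tok in toks]
    terms = salient_terms(collection, reference, k=5)
    vectors = {name: binary_vector(toks, terms) for name, toks in texts.items()}
    som = train_som(list(vectors.values()), rows=3, cols=3)
    for name, vec in vectors.items():
        print(name, vec, best_matching_unit(som, vec))
```

Texts whose vectors land on the same or neighbouring grid positions would be read as belonging to one cluster. A real experiment would use a much larger reference corpus and a map sized to the collection.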
The resulting map clearly shows clusters of texts that resonate with the text categories assigned to the individual texts by the (sub-)editors of the publications in which the texts were published. I will describe how we have classified texts provided by AP News Wire and Reuters using an SOFM. I will also report on a statistic developed at Surrey to describe the goodness of each of the clusters within the SOFM. This method has been reported in Ahmad, Vrusias and Ledford (2001).

We have recently used the SOFM trained on full texts to evaluate the quality of summaries of these texts produced automatically by a summariser developed at Surrey by Mohammad Benbrahim (Ahmad and Benbrahim 1995), Lena Tostevin and Lee Gillam (Ahmad, Gillam and Tostevin 2001) and Paolo Olivera. The evaluation of automatically generated summaries is an open problem in text understanding, and a method for automatically measuring the overlap between a text and its surrogate summary may help here.

References

Ahmad, K., and Rogers, M. A. (2001). Corpus Linguistics and Terminology Extraction. In (Eds.) Sue-Ellen Wright and Gerhard Budin. Handbook of Terminology Management (Volume 2). Amsterdam & Philadelphia: John Benjamins Publishing Company. pp. 725-760.

Ahmad, K., Vrusias, B., and Ledford, A. (2001). Choosing Feature Sets for Training and Testing Self-Organising Maps: A Case Study. Neural Computing & Applications, Volume 10. pp. 56-66.

Ahmad, K., Gillam, L., and Tostevin, L. (2001). Weirdness Indexing for Logical Document Extrapolation and Retrieval (WILDER). In (Eds.) E.M. Voorhees and D.K. Harman. The 8th Text Retrieval Conference (TREC-8). Washington: National Institute of Standards and Technology. pp. 717-724.

Ahmad, K., and Benbrahim, M. (1995). Text Summarisation: The role of lexical cohesion analysis. The New Review of Document and Text Management, Vol. 1. pp. 321-335.

Ahmad, K. (1995). Pragmatics of Specialist Terms and Terminology Management. In (Ed.) Petra Steffens.
Machine Translation and the Lexicon. (Invited talk at 3rd Int. EAMT Workshop, Heidelberg, Germany, April 26-28, 1993.) Heidelberg (Germany): Springer. pp. 51-76.