AN AUTONOMOUSLY MAINTAINED CORPUS COLLECTION?

Khurshid Ahmad, February 2005

Text corpora are used in a number of different ways. Traditionally, corpora have been used for the study and analysis of language at different levels of linguistic description. Corpora have also been constructed for the specific purpose of acquiring knowledge for information extraction systems, knowledge-based systems and e-business systems, and they have been developed for cognitive neuropsychology purposes and for studying child language development. Speech corpora play a vital role in the specification, design and implementation of telephonic communication systems, and in the broadcast media. The creation of a corpus is a task that experts perform better than novices: the experts have an intuitive grasp of the inter-related notions of representativeness, balance, coverage, accuracy and impact. The categorisation of texts is an equally complex task: as those of us who have looked at library classification systems, news wire subject categories and museum classifications know, such schemes are always very different from what one's intuition may suggest.

In my talk, I will deal with two key issues. First, the design, assembly and analysis of corpora of sub-languages or special languages for automatically extracting terminology and ontology or, more accurately, a conceptual scheme for organising the terminology. I will draw upon my experience of the special languages of nuclear physics, sewer engineering, philosophy of science, and art & dance criticism. Second, I will discuss how texts can be ‘automatically’ categorised using neural computing systems, particularly self-organising maps. A good example here is the WEBSOM web site, which is being used to automatically categorise and retrieve texts in computer science and other disciplines. I will describe my recent experience of organising news wire texts automatically using a system that learns to categorise.
The texts used in both studies were captured from digital libraries and news wire services. The lessons drawn from the terminology/ontology studies and from the categorisation systems will be synthesised to describe a system that can learn to assemble, design, analyse and categorise texts automatically. This system will additionally have a ‘knowledge base’ comprising heuristics developed by expert corpus linguists. So, in the ‘fullness’ of time, we might have a system that can autonomously organise corpora by retrieving texts from the World Wide Web.
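
To make the idea of categorising texts with a self-organising map a little more concrete, the following is a minimal sketch only. It is not the WEBSOM implementation, nor the system described in the talk. It assumes a tiny hand-picked vocabulary, simple term-frequency vectors and a one-dimensional map; the function names (`bag_of_words`, `best_matching_unit`, `train_som`) and all parameter values are hypothetical choices made for illustration.

```python
import math
import random


def bag_of_words(text, vocabulary):
    """Return a length-normalised term-frequency vector over a fixed vocabulary."""
    tokens = text.lower().split()
    counts = [float(tokens.count(term)) for term in vocabulary]
    norm = math.sqrt(sum(c * c for c in counts)) or 1.0  # avoid division by zero
    return [c / norm for c in counts]


def best_matching_unit(weights, vector):
    """Return the index of the map node whose weights are closest to `vector`."""
    def squared_distance(node):
        return sum((w - v) ** 2 for w, v in zip(node, vector))
    return min(range(len(weights)), key=lambda i: squared_distance(weights[i]))


def train_som(vectors, n_nodes=4, epochs=200, seed=0):
    """Train a one-dimensional self-organising map (a line of nodes)."""
    rng = random.Random(seed)
    dim = len(vectors[0])
    weights = [[rng.random() for _ in range(dim)] for _ in range(n_nodes)]
    for epoch in range(epochs):
        progress = epoch / epochs
        rate = 0.5 * (1.0 - progress)                       # learning rate decays
        radius = max(n_nodes / 2.0 * (1.0 - progress), 0.5)  # neighbourhood shrinks
        for vector in vectors:
            winner = best_matching_unit(weights, vector)
            for i, node in enumerate(weights):
                # Nodes near the winner on the map are pulled more strongly.
                influence = math.exp(-((i - winner) ** 2) / (2.0 * radius ** 2))
                for d in range(dim):
                    node[d] += rate * influence * (vector[d] - node[d])
    return weights


if __name__ == "__main__":
    # Illustrative vocabulary and texts; not data from the studies described.
    vocabulary = ["reactor", "neutron", "dance", "ballet"]
    texts = [
        "reactor neutron reactor",
        "neutron reactor",
        "dance ballet dance",
        "ballet dance",
    ]
    vectors = [bag_of_words(t, vocabulary) for t in texts]
    som = train_som(vectors)
    for text, vector in zip(texts, vectors):
        print(best_matching_unit(som, vector), text)
```

After training, each text is assigned to the map node that matches it best, so texts with similar vocabulary tend to land on the same or neighbouring nodes. A realistic system would need a far larger vocabulary, a two-dimensional map and some weighting of terms, none of which is attempted here.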