Khurshid Ahmad
February 2005.
Text corpora are used in a number of different ways: traditionally corpora have been
used for the study and analysis of language at different levels of linguistic description;
corpora have been constructed for the specific purpose of acquiring knowledge for
information extraction systems, knowledge-based systems and e-business systems;
corpora have been developed for cognitive neuropsychology purposes and for
studying child language development. Speech corpora play a vital role in the
specification, design and implementation of telephonic communication systems and in
the broadcast media.
The creation of a corpus is a task that experts perform better than novices: the experts
have an intuitive grasp of the inter-related notions of representativeness, balance,
coverage, accuracy and impact. The categorisation of texts is an equally complex
task: as those of us who have looked at library classification systems, news wire
subject categories and museum classifications know, such schemes are often very
different from what one's intuition may suggest.
In my talk, I will deal with two key issues: first, the design, assembly and analysis of
corpora of sub-languages or special languages for automatically extracting
terminology and ontology, or more accurately, a conceptual scheme for organising the
terminology. I will draw upon my experience of the special language of nuclear
physics, sewer engineering, philosophy of science, and art & dance criticism. Second,
I will discuss how texts can be ‘automatically’ categorised using neural computing
systems, particularly self-organising maps. A good example here is the WEBSOM
web site that is being used to automatically categorise and retrieve texts in computer
science and other disciplines. I will describe my recent experience of organising news
wire texts automatically using a system that learns to categorise. The texts used in
both studies were captured from digital libraries and news wire services.
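
To make the idea of a system that 'learns to categorise' concrete, a self-organising map can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the WEBSOM system or the system described in the talk: the function names (bag_of_words, best_matching_unit, train_som), the one-dimensional chain of four nodes, the learning-rate and neighbourhood schedules, and the four toy texts are all invented for the example.

```python
import math
import random
from collections import Counter


def bag_of_words(texts):
    """Turn each text into a length-normalised term-frequency vector."""
    vocab = sorted({w for t in texts for w in t.lower().split()})
    vectors = []
    for t in texts:
        counts = Counter(t.lower().split())
        vec = [float(counts[w]) for w in vocab]
        norm = math.sqrt(sum(x * x for x in vec)) or 1.0
        vectors.append([x / norm for x in vec])
    return vocab, vectors


def best_matching_unit(weights, vec):
    """Index of the map node whose weight vector is closest to vec."""
    def dist2(w):
        return sum((a - b) ** 2 for a, b in zip(w, vec))
    return min(range(len(weights)), key=lambda i: dist2(weights[i]))


def train_som(vectors, n_nodes=4, epochs=200, seed=0):
    """Train a one-dimensional self-organising map (a chain of nodes)."""
    rng = random.Random(seed)  # fixed seed so the sketch is repeatable
    dim = len(vectors[0])
    weights = [[rng.random() for _ in range(dim)] for _ in range(n_nodes)]
    for epoch in range(epochs):
        frac = epoch / epochs
        rate = 0.5 * (1.0 - frac)                        # learning rate decays
        radius = max(n_nodes / 2 * (1.0 - frac), 0.5)    # neighbourhood shrinks
        for vec in rng.sample(vectors, len(vectors)):    # shuffled presentation
            bmu = best_matching_unit(weights, vec)
            for i, w in enumerate(weights):
                # Gaussian neighbourhood measured along the node chain
                h = math.exp(-((i - bmu) ** 2) / (2 * radius ** 2))
                for k in range(dim):
                    w[k] += rate * h * (vec[k] - w[k])
    return weights


# Toy texts, invented for illustration only.
texts = [
    "reactor neutron flux reactor core",
    "neutron flux in the reactor core",
    "dance ballet stage choreography",
    "ballet choreography on stage",
]
vocab, vectors = bag_of_words(texts)
weights = train_som(vectors)
# Each text is assigned to its best-matching node; texts sharing a node
# (or neighbouring nodes) are treated as belonging to related categories.
labels = [best_matching_unit(weights, v) for v in vectors]
print(labels)
```

No category labels are supplied at any point: the map is shaped only by the word distributions of the texts, which is what makes the approach attractive for news wire material whose subject categories may not match intuition. A realistic system would use a two-dimensional map, a far larger vocabulary and some form of term weighting.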
The lessons drawn from the terminology/ontology studies and from the categorisation
systems will be synthesised to describe a system that can learn to assemble, design,
analyse and categorise texts automatically. This system will additionally have a
‘knowledge base’ comprising heuristics developed by expert corpus linguists. So, in
the ‘fullness of time’, we might have a system that can autonomously organise corpora by
retrieving texts from the World Wide Web.