Introduction

Corpus linguistics is a part of computational linguistics and a part of general language studies. It deals with large collections of written and spoken texts, up to hundreds of millions of words. The main areas of interest in corpus linguistics are twofold. The first is building, maintaining and operationalizing such collections. The second is analysing and interpreting language features and phenomena that are usually not visible on a smaller scale.

Thus, the first part of this work presents computer technology that provides tools for language analysis and a formal description of the language as a system that can be digitized, stored and accessed with electronic tools. This requires designing effective storage systems and user-friendly interfaces that make the texts available for research and analysis. Storing letters and words was not a problem even for early computers, but storing descriptions of various aspects of words and finding multidimensional links between elements of the language is still challenging for language technologists. This is the domain of computer science. Methods of analysis come from statistics and matrix analysis, as the language is multidimensional and many aspects should be taken into consideration at the same time.

The second part deals with the interpretation of data obtained from a large body of evidence of language use. This concerns language as a social, cultural and psychological phenomenon. That is the domain of psycholinguistics, sociolinguistics, cultural studies and other linguistic areas of interest such as lexical studies, semantics, pragmatics, stylistics, speech analysis and, last but not least, teaching and translation.
Results in this area are both academic, that is, growth in knowledge about the language, its interrelations with other languages and the way it functions, and practical, in that they make databases and tools available to learners, teachers and translators, both human and machine.

The third part is an introduction to corpus analysis and statistics. Some basic concepts of corpus linguistics as an empirical domain are presented. Counts, frequencies and statistical tests illustrate the methodology of corpus analysis.

Last but not least, part four gives examples of the use and application of the tools in linguistics and suggests ideas for further research and development. With such a large collection of language in use at the researcher’s disposal, human imagination seems to be the only limit to research.

Part 1. Corpora

1. History of corpus linguistics

Linguists have always been interested in how people use languages, all the languages available to them: native, second and foreign languages they learn. These real-life acts of linguistic behaviour can be observed by investigating corpora. For years linguistics was seen as an empirical science. That is why research on collections of texts has a long tradition. Without computers, handling a collection of texts bigger than one book was extremely time-consuming. What is more, there was a fashionable tendency in linguistics against corpus linguistics, which prevailed for twenty years from the late 1950s. However, contrary to this popular trend, many linguists worked on corpora for thirty years. Their work enabled corpus linguistics to start developing fast in the 1980s. Under Chomsky’s influence, linguistics dealt for many years with “potential language”, i.e. the internal, not the real, language of a linguist. In fact it is a case study of a linguist’s potential language.
Comparing this area of interest to other social studies, it is really interesting to study people’s dreams, but it is much more interesting to observe their behaviour and actual acts and reactions. Both approaches can find their place in the field of research called linguistics and neither of them is superior to the other. Let us define a corpus linguist as a linguist whose main interest is language as an existing phenomenon, expressed orally or with the use of a tool, for example paper, screen, silk or stone. Neither an “armchair linguist” nor a corpus linguist can describe a language in its entirety. The former can always find utterances that have not been uttered, while the latter can find phrases actually used that the linguist had never imagined could be uttered. The limits in the first case are set by the imagination of the linguist, while in the second case by the basic reasons humans use language (Jacobson).

Technology makes corpora feasible for research without wasting researchers’ time on tediously sorting the words several times. It is worth pointing out that the first personal computers appeared in 1976. Constant interest in the real use of language and the availability of user-friendly tools for research are the key factors in the development of corpus linguistics, which gives a broader picture of what users do with the language, far beyond what a linguist can ever imagine.

Early corpus linguistics – before the Chomskyan revolution in the 1950s.
| Year | Researchers and a short description of their corpora and research | Area of study |
|------|------|------|
| 1897 | Käding created an 11-million-word corpus of German to study the frequency distribution and sequences of letters in German (5,000 analysts were used to study it) | spelling |
| 1898 | Preyer: studies on first language acquisition based on parental diaries | acquisition |
| 1921 | Thorndike’s word counts were important in defining the goals of the vocabulary movement in second language pedagogy | pedagogy |
| 1924 | Stern: studies on first language acquisition based on parental diaries | acquisition |
| 1931 | Palmer’s word counts were important in defining the goals of the vocabulary movement in second language pedagogy | pedagogy |
| 1940 | Fries and Traver: vocabulary lists for foreign learners | pedagogy |
| 1940 | Eaton: comparing the frequency of word meanings in Dutch, French, German, and Italian | comparative studies |
| 1947 | Bongers’ vocabulary lists for foreign learners | pedagogy |
| 1949 | Father Busa started creating a computerized corpus of medieval philosophy | computerized corpora |
| 1949 | Lorge: semantic frequency list | syntax and semantics |
| 1952 | Fries: descriptive grammar of English based on a corpus | syntax |
| 1954 | McCarthy: corpora of children’s utterances were gathered to establish the stages of linguistic development of a child | acquisition |
| 1956 | Gougenheim et al.: a study of high-frequency lexical and grammatical choices based on a transcribed corpus of spoken French | syntax and semantics |
| 1956 | Julliand started developing machine-readable corpora, sampling techniques, annotations | mechanolinguistics |
| 1957 | Firth, in a series of writings from the 1930s, 1940s and 1950s, established the basic terminology used in corpus linguistics. Neo-Firthians (Halliday, Hoey, Sinclair) continued the tradition. | corpus linguistics |
| 1960 | Quirk started the Survey of English Usage | grammar |
| 1960 | Francis and Kucera began the Brown Corpus | grammar |
| 1964 | Julliand and Chang Rodriguez: report on corpus construction | mechanolinguistics |
| 1967 | Father Busa finished his project on medieval philosophy | computerized corpora |
| 1970 | Bloom: longitudinal studies on language acquisition | acquisition |
| 1973 | Brown: longitudinal studies on language acquisition | acquisition |

Renaissance of corpus linguistics

| Year | Researchers and a short description of their corpora and research | Area of study |
|------|------|------|
| 1985 | Quirk et al.: A Comprehensive Grammar of the English Language | syntax |

2. Building a corpus

2.1. Definition of a corpus

The word corpus comes from Latin, where it means body. The usual plural is corpora; a newer plural is corpuses. A corpus is normally defined as a collection of texts, spoken and/or written, which has been designed and compiled according to a set of clearly defined criteria. Here are some definitions of corpora from various sources:

“A collection of texts, especially if complete and self-contained: the corpus of Anglo-Saxon verse. In linguistics and lexicography, a body of texts, utterances, or other specimens considered more or less representative of a language, and usually stored as an electronic database. Currently, computer corpora may store many millions of running words, whose features can be analysed by means of tagging (the addition of identifying and classifying tags to words and other formations) and the use of concordancing programs. Corpus linguistics studies data in any such corpus.” (The Oxford Companion to the English Language, ed.
McArthur & McArthur, 1992)

“A collection of linguistic data, either written texts or a transcription of recorded speech, which can be used as a starting-point of linguistic description or as a means of verifying hypotheses about a language.” (David Crystal, A Dictionary of Linguistics and Phonetics, Blackwell, 3rd Edition, 1991)

“A collection of naturally occurring language text, chosen to characterize a state or variety of a language.” (John Sinclair, Corpus, Concordance, Collocation, OUP, 1991)

2.2. Legal matters

Collections of texts are usually created for research and practical applications. Examples are presented in Part 4. Parts of the texts from the corpus are quoted in reports, dictionaries and other publications. Thus, copyright issues arise. The corpus holder should obtain permission from the authors of the texts to use them.

“Language cannot be invented; it can only be captured.” (Sinclair, 1997: 31) The number of utterances in any language is unlimited. However, the data collected is limited and finite. Thus the problem of representativeness arises.

2.3. Types of corpora

There are different categories in which corpora can be described. The first distinction can be drawn between a monitor corpus and a sample corpus (Sinclair, 1991: 23-26). The monitor corpus attempts to be a representative cross-section of the spoken and/or written language to be studied (e.g. the Bank of English (COBUILD) and the British National Corpus), and by its very nature it has to be very large (the Bank of English is about 400 million words of written and spoken texts and continues to grow). If monitor corpora are to be truly representative, they have to be continually updated with 'new' texts, and 'old' texts must be discarded. The sample corpus does not claim to be representative of the spoken and/or written language as a whole. Sample corpora are much more common and they are the norm in most corpus-based studies (e.g.
International Corpus of English and the Hong Kong Corpus of Spoken English at PolyU). In order to create a corpus, it is vital to establish a set of homogeneous criteria, which should be applied consistently to all items in the corpus. The type of corpus reflects the research aims to be investigated. There are various categories related to text origin, authorship, size, language, social and cultural contexts, time, etc. that need to be taken into consideration. Speech and sign language are inherently human, that is, humans do not need anything but their body to produce language. Written language and netspeak are tool-dependent, that is, one needs some surface and a tool to write with. The categories presented below show various aspects and dimensions of corpora.

Types of corpora based on text origin, with examples:

1. Spoken corpus recorded as sound files: media recordings - TV (various subtypes); media recordings - radio; private conversation; business conversation
2. Spoken corpus transcribed into written form: media recordings - TV (various subtypes); media recordings - radio; private conversation; business conversation
3. Written corpus: printed in a book (various subtypes); printed in a journal; private correspondence; business correspondence
4. Netspeak: www sites (various subtypes); electronic media; private e-mail; business e-mail

Types of corpora based on authors of the utterances:

| No. | Criterion | Type | Contrasting type |
|------|------|------|------|
| 1 | Gender | Female, spoken or written | Male, spoken or written |
| 2 | Age | Adult | Children, teenagers |
| 3 | Special features | Person or people with special educational needs (dyslexic, blind, mentally handicapped native speakers or non-native learners) | Person or people without language abnormalities |
| 4 | Unified group or culture | National | Subcultures (various types) |
| 5 | Nationality | One nation | Multinational |
| 6 | Linguistic origin | Monolingual | Bilingual or multilingual |
| 7 | Linguistic unity | Monolingual | Parallel bilingual or multilingual |
| 8 | Linguistic experience | Native | Learner |
| 9 | Number of authors | One | Many |
| 10 | Source of the text | Original creation | Translation |

Types of corpora based on their size and openness:

| No. | Type | Contrasting type |
|------|------|------|
| 1 | Monitor corpus | Sample corpus |
| 2 | Closed corpus (one writer’s productions, e.g. a Shakespeare corpus, a Mallory corpus) | Open corpus (can be developed) |

Types of corpora based on location, place, geographical features:

Language(s) spoken in a selected area: district, town, region, country.

Types of corpora based on content:

| Type | Contrasting type |
|------|------|
| Various, as in a monitor corpus | Restricted, e.g. one artist’s texts (writer, poet, singer), translations of legal documents, essays of students from one class |

Types of corpora based on time:

| No. | Criterion | Type | Contrasting type |
|------|------|------|------|
| 1 | Length of period of collection | Long time | Short time |
| 2 | Ways of collecting in time | Continuous | Sampled |
| 3 | History | Synchronic | Diachronic |
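The classification criteria in the tables above can be read as metadata fields attached to each corpus, so that corpora matching a research aim can be looked up by their type. The following is only a minimal sketch under that assumption: the class names, field names, the `select` helper and the three example records are all hypothetical and do not describe any existing corpus tool or real corpus.

```python
from dataclasses import dataclass
from enum import Enum


class Origin(Enum):
    """Text origin, following the four columns of the text-origin table."""
    SPOKEN_SOUND = "spoken, recorded as sound files"
    SPOKEN_TRANSCRIBED = "spoken, transcribed into written form"
    WRITTEN = "written"
    NETSPEAK = "netspeak"


class Coverage(Enum):
    MONITOR = "monitor"
    SAMPLE = "sample"


class History(Enum):
    SYNCHRONIC = "synchronic"
    DIACHRONIC = "diachronic"


@dataclass(frozen=True)
class CorpusProfile:
    """A corpus described along a few of the dimensions listed above."""
    name: str
    origin: Origin
    coverage: Coverage
    history: History
    is_open: bool          # open (can be developed) vs closed
    learner_authors: bool  # learner vs native authors
    translated: bool       # translation vs original creation


def select(profiles, **criteria):
    """Return the profiles whose fields equal every given criterion."""
    return [
        p for p in profiles
        if all(getattr(p, field) == value for field, value in criteria.items())
    ]


# Hypothetical records, invented purely for illustration.
PROFILES = [
    CorpusProfile("single-author corpus", Origin.WRITTEN, Coverage.SAMPLE,
                  History.SYNCHRONIC, is_open=False,
                  learner_authors=False, translated=False),
    CorpusProfile("class essays", Origin.WRITTEN, Coverage.SAMPLE,
                  History.SYNCHRONIC, is_open=True,
                  learner_authors=True, translated=False),
    CorpusProfile("radio transcripts", Origin.SPOKEN_TRANSCRIBED, Coverage.MONITOR,
                  History.DIACHRONIC, is_open=True,
                  learner_authors=False, translated=False),
]

closed_written = select(PROFILES, origin=Origin.WRITTEN, is_open=False)
print([p.name for p in closed_written])  # prints ['single-author corpus']
```

Only some of the criteria are modelled here. Further dimensions from the tables, such as gender, age or number of authors, could be added as extra fields in the same way.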