Corpora in TS

Corpus linguistics

A corpus is a large, more or less principled, possibly annotated collection of natural texts on computer. It is usually put together for a particular purpose and according to specific design criteria, in order to ensure that it is representative of a given area.

Notes
- 'large' may mean anything from 1m to 200m words
- 'principled': based on a principle, e.g. general vs. specialised (in terms of genre, register, subject matter etc.), written vs. spoken, synchronic vs. diachronic (historical), dialect or variant (BrE vs. AmE)
- 'texts' may be written or spoken (transcribed) or both

What are corpora good for?
- Looking at actual use: how people actually speak, write, translate (performance)
- Description rather than prescription
- Natural language rather than formal representations of language
- Escaping from reliance on native-speaker intuition (which may be biased)
- Variation of language features in relation to other (non-)linguistic features (genre, age, gender)
- Assessing (relative) frequency

Lexicography
- Terminology: extracting terms from a representative corpus of authentic texts
- Use tells us about meaning, e.g. big vs. large vs. great
- Ensuring all variants of a term are covered
- Demonstrating their linguistic behaviour and lexical patterns, e.g. collocations from KWIC (key word in context) concordances, including diachronic changes
- Showing (changes in) frequency of terms

Sociolinguistics
- Intuition on registers, situations etc. is unreliable or unavailable, e.g. how do doctors speak to patients?

Teaching
- Assessing the relative frequency of patterns (and thus the order of teaching), e.g. Present Simple vs. Present Continuous

Translation research
- Search for translation 'universals' such as explicitation, simplification, levelling etc.

Limitations
- Corpora are always incomplete
- Selection procedures may bias the sample: what constitutes a 'representative' sample is hard to define (e.g. the Slovene press – which papers, in what proportions, which parts of the paper etc.)
- Text fragments may paint a different picture from whole texts
- Some texts may be unavailable for copyright or other reasons
- Spoken language is much harder to deal with
- Some features are not susceptible to investigation, being hard to detect, e.g. ellipsis
- Nothing can be inferred from non-occurrence in a corpus
- Frequency and grammaticality are not the same thing (less/fewer people, he might of done it)
- Occurrence in a corpus does not in itself make an item standard/acceptable
- Corpora provide facts, but not all facts are significant or interesting
- A corpus is only a tool, not an end in itself (though it may involve a great deal of work)

Types of corpora for translation research

Parallel corpora (bilingual)
- Original SL texts in language A and their translations into language B
- Only as reliable as the translators (e.g. Evroterm) – translations from Slovene into English are particularly problematic
- Tell us how translators actually tackle translation problems in practice
- Search for translation norms

Multilingual corpora
- Sets of two or more monolingual corpora built using similar design criteria
- Study linguistic features/items in their home environment (not translations)
- Offer insights into the behaviour of 'equivalent' items and structures in different languages; help develop teaching materials
- Cannot provide answers to theoretical questions in TS
- NB: based on the assumption that there is a natural way of saying anything in any language, and all we have to do is find it in languages A and B. But "there are some things that cannot be said naturally in some languages" (Headland 1981, cited in Baker 1995: 233), e.g. Dober tek ('enjoy your meal') or Gospa Maja ('Mrs Maja') in English, Good afternoon in Slovene
- Rather than comparing languages A and B, or ST and TT, it makes more sense in TS research to compare regular text production with texts produced by translation.
Comparable corpora
- Two separate collections of texts in the same language (A):
  o original texts in language A (a monolingual corpus)
  o translations into language A from a given SL (or SLs)
- Both should cover a similar domain, variety of language and time span, and be of comparable length
- Help us identify patterns specific to translated texts, or with a significantly higher or lower frequency than in original texts (my impressions):
  o e.g. namely over-represented in Slo-Eng translations (from namreč)
  o e.g. on the one hand … on the other over-represented in Slo-Eng translations
  o e.g. therefore over-represented in Slo-Eng translations (because of zato)
  o e.g. organising words like although, nevertheless, furthermore, moreover under-represented in Slo-Eng translations
  o e.g. phrasal verbs under-represented in Slo-Eng translations
  o Baker's e.g. They claimed/said (that) they would – that probably used more in translations than in original texts
- Also used to support or refute hypotheses about the process of translation

Corpus analysis software such as WordSmith can be used to identify a range of textual features that would be laborious or extremely difficult to identify manually:
- Type/token ratio (see notes on quantitative analysis)
- Lexical density (ditto)
- Average word and sentence length
- Frequency lists – provide statistical evidence for stylistic features such as key words
- Concordance – a concordancer lists all occurrences of a word together with a number of preceding and following words – very helpful when looking for collocations

Quantitative analysis

Possible lines of research using comparable corpora:

1. Type-token ratio (TTR)
To a computer, a sequence of letters with an orthographic space on either side is a word, or token. We can say there are x tokens of a particular word, e.g. say, in a given corpus. The word-form say is a type, so we can say there are x tokens of the type say in a corpus of y million words. Corpora can be compared in these terms, e.g.
Krishnamurthy (1992): the overall type-token ratio for BBC World Service broadcasts is approx. 174, for The Times newspaper approx. 60 (i.e. each type occurs on average 174 times rather than 60). The former is thus more lexically restricted than the latter – explainable by the spoken-written difference?

Comparing translations with non-translations, we may find a higher type-token ratio in this tokens-per-type sense (i.e. proportionally fewer distinct types) and thus evidence of lexical simplification.

type-token ratio n. In the study of texts, the ratio of the number of different words, called types, to the total number of words, called tokens. For example, in a particular text the number of different words may be 1,000 and the total number of words 5,000, because common words such as the may occur several times; in this case the type-token ratio would be 1/5 or 0.2. (Note that this dictionary definition expresses the ratio the other way round from Krishnamurthy's figures above: types per token rather than tokens per type.) Used for investigating e.g. child language acquisition.

2. Lexical density
- Lexical words have 'content': nouns, adjectives and verbs
- Grammatical words organise language: prepositions, determiners, pronouns
- Lexical density = the percentage of lexical (as opposed to grammatical) words in a corpus
- The lexical density of written English is higher than that of spoken English: the former is edited and redrafted, with repetition and redundancy cut; the latter has more contextual clues and has to be interpreted in real time
- Lexical words belong to open sets and are therefore less predictable
- 'Difficult' texts have a higher lexical density and information load
- Is the lexical density of translations lower than that of originals?
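The two measures above can be sketched in a few lines of Python. This is only a minimal illustration, not how WordSmith computes them: tokenisation is naive pattern matching, the TTR is given in the dictionary sense (types per token), and the small stop-word set standing in for the closed class of 'grammatical words' is an assumption for demonstration – a real study would use a part-of-speech tagger.

```python
import re

def tokens(text):
    """Naive tokenisation: lowercased runs of letters/apostrophes."""
    return re.findall(r"[a-z']+", text.lower())

def type_token_ratio(text):
    """Types per token (dictionary sense: 1.0 means no word repeated)."""
    toks = tokens(text)
    return len(set(toks)) / len(toks)

# Toy stand-in for a closed list of grammatical (function) words;
# illustrative only - real studies would use POS tagging instead.
GRAMMATICAL = {"the", "a", "an", "of", "to", "in", "on", "it", "he", "she",
               "and", "or", "that", "this", "is", "are", "was", "were"}

def lexical_density(text):
    """Percentage of tokens that are lexical (content) words."""
    toks = tokens(text)
    lexical = [t for t in toks if t not in GRAMMATICAL]
    return 100 * len(lexical) / len(toks)

sample = "The cat sat on the mat and the dog sat on the cat"
print(type_token_ratio(sample))   # 7 types / 13 tokens
print(lexical_density(sample))    # 6 of 13 tokens are lexical
```

Dividing the token count by the type count instead would give the tokens-per-type figure used in the Krishnamurthy comparison above.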
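The concordancer described under corpus analysis software can likewise be sketched: a KWIC (key word in context) listing shows every occurrence of a node word aligned with a fixed window of words on either side. A minimal sketch, assuming simple whitespace tokenisation and no lemmatisation (the window size and punctuation handling are arbitrary choices):

```python
def kwic(text, node, window=3):
    """List every occurrence of `node` with `window` words of context
    on each side, aligned on the node word (a minimal KWIC concordance)."""
    words = text.split()
    lines = []
    for i, w in enumerate(words):
        if w.lower().strip(".,;:!?") == node.lower():
            left = " ".join(words[max(0, i - window):i])
            right = " ".join(words[i + 1:i + 1 + window])
            lines.append(f"{left:>30}  [{w}]  {right}")
    return lines

text = ("Corpora provide facts, but not all facts are significant. "
        "A corpus is only a tool, and the facts it provides need interpreting.")
for line in kwic(text, "facts"):
    print(line)
```

Scanning the aligned left and right contexts of many such lines is what makes collocations visible at a glance.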
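Impressions of over- or under-representation in comparable corpora (like the namely and therefore examples above) can be checked with relative-frequency counts: normalise raw counts to occurrences per million words in each corpus and compare. A toy sketch – the two mini-corpora here are invented placeholders, not real data:

```python
import re
from collections import Counter

def per_million(word, text):
    """Occurrences of `word` per million tokens of `text`."""
    toks = re.findall(r"[a-z]+", text.lower())
    return Counter(toks)[word] / len(toks) * 1_000_000

# Hypothetical stand-ins for an original-English corpus and a
# Slovene-to-English translation corpus (illustrative only).
originals = "we must act now and we must act together"
translated = "namely we must act now namely we must act together"

for name, corpus in [("originals", originals), ("translations", translated)]:
    print(name, per_million("namely", corpus))
```

With real comparable corpora, a large gap between the two normalised figures for the same item is the kind of evidence the over-representation claims above would rest on (ideally backed by a significance test).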