The Translational English Corpus: A practical approach to corpus building

Outline
• TEC and new developments
  – EDT Corpus
  – Humanities Corpus
• Corpus design
  – Representativeness
  – Balance
  – Size
• Corpus building
  – Identifying material
  – Scanning/Converting texts
  – Tagging & Annotation

A corpus of contemporary English translations: written texts translated into English from a variety of source languages
http://www.llc.manchester.ac.uk/ctis/research/english-corpus/

[Figure: bar chart of the number of books in each language for fiction and (auto)biography. Source languages shown: French, Arabic, German, Italian, Spanish, Brazilian…, Portuguese, Russian, Norwegian, Welsh, Catalan, Japanese, Latin American…, Polish, Slovene, Swedish, Tamil, Chinese, Finnish, Greek, Hebrew, Serbian, Vietnamese, Hopi.]

A set of software tools for investigating a wide range of issues to do with the language of translated texts.

Header File: contains metadata such as the title of the text, author, publisher, etc.
Text File: contains the actual data to be analysed.

– Sub-corpus Selection: allows you to select particular text files or groups of text files to search.
– Sort Tool: allows you to sort concordances to the left or right, and to specify the number of words between the search keywords.
– Corpus Tree Viewer: allows you to “grow” a tree for various keywords. The size of the text reflects frequency of occurrence in the corpus.

An electronic database of all material (to be) included in the TEC for the subcorpora of fiction and (auto)biography. The entry for each book includes most of the information in the header file, together with images of the book covers.

• A corpus of discourses on translation, for investigating the way in which translation and translators are conceptualised in society at different historical periods.
• No time, language or genre restriction: any material is included as long as it is written in English.
• Two types of material
  – Peritextual: material that accompanies the translation, e.g. prefaces, introductions, afterwords, etc.
  – Epitextual: published material (broadsheet and mainstream newspapers, literary magazines, etc.)
• Link with TEC

A corpus of translations into English of works by theorists in the humanities, e.g. philosophers, sociologists, literary theorists, etc.
Temporality: translations date from 1900 onwards, but the source texts do not have a time restriction.
* Multiple translations of the same book.

What is a corpus?
‘A collection of texts held in machine-readable form and capable of being analysed automatically or semi-automatically’ (Baker 1995)…
…and has certain characteristics:
  – Representativeness
  – Balance
  – Size

“a corpus is thought to be representative of the language variety it is supposed to represent, if the findings based on its contents can be generalised to the said language variety” (Leech 1991).
A corpus may focus on a particular genre/language/author/translator, etc.
Decisions about criteria for selection of texts

TEC Design
Material: English translations (whole texts)
Genres: Fiction, (auto)biography, in-flight magazines, news articles
Time of publication: Late 80s onwards
Place of publication: UK and USA

“a balanced corpus covers a wide range of texts which are supposed to be representative of the language variety under question” (McEnery et al. 2006).
Also, ‘internal’ balance, e.g.
  – Gender balance
  – Source language balance
  – Genre balance

A corpus needs to be adequate for the purposes for which it is intended. A bigger corpus is not necessarily more useful than a smaller one.
Factors that affect corpus size:
  – Purpose of the corpus
  – Availability of data
  – Copyright

• Research questions (purpose of the corpus)
  – Specialised corpora and corpora intended for morphosyntactic studies tend to be smaller than general corpora and corpora intended for lexical studies. Static corpora are also smaller than dynamic ones.
• Availability of data
  – The availability of suitable data (especially in machine-readable form), as well as the ease with which they can be identified, may affect the size of a corpus.
• Copyright
  – Copyright clearance can impede corpus development as well as the accessibility and availability of a corpus to a wide audience.
  – Copyright law varies internationally.
  – Fair dealing: no permission needed for short extracts not exceeding 400 words for prose (or a total of 800 words in a series of extracts, none exceeding 300 words).
  – Out-of-copyright material: author’s / translator’s lifetime + 70 years (UK).
  – If you’re in doubt, seek permission!
(McEnery et al. 2006)

“…We don't feel comfortable posting the entirety of both titles to your database, but would be willing to make half of both books available to your research center…We typically charge a fee of $150 per title for use of such a large portion.”

“…University Press is pleased to grant you non-exclusive, English language, world rights to reprint limits of fair use (under 300 words)…”

“We're delighted to learn of your interesting project, and pleased to grant you general permission to use all book reviews and blogs on our site. We'll be grateful if you can include a link to the site in the pieces you use.”

• Possible sources
  – Publishers’ websites and search engines, e.g. Farrar, Straus and Giroux, NYTimes
  – Publishing houses specialising in translation
  – Databases and national databases, e.g. Three Percent, LTI Korea
  – Internet, archives, etc.
• Problems
  – Search engine not well designed, e.g. The Telegraph
  – Need for specific material
  – In some cases, it is not indicated whether a text is a translation or not
  – For reviews: not always related to translation

• Scanning
  – Flat-bed scanner
  – Document feeder
• Paper and print quality
• Scanner settings: Resolution and Colour vs Greyscale

• OCR (Optical Character Recognition) Process
  – Language support
  – Accuracy
  – Font type
  – Document format
• Text File
  – Spelling errors
  – Character recognition errors (e.g. Tm instead of I’m)
  – Save as .txt file

Adds value to a corpus, makes it easier to extract information and prepares texts to be used with corpus software.
Factors that affect the extent of tagging/annotation (Olohan 2004):
• Purpose of the corpus
• Corpus software
• Accessibility of the corpus
• Technical expertise of the researcher

• POS (Part-of-Speech) Tagging
  – Marks up a word in a corpus as corresponding to a particular part of speech, based on both its definition and its context.
    E.g. John_NP0 loves_VVZ Mary_NP0 ._.
• Lemmatisation
  – Reduces the inflectional variants of words to their respective lemmas, i.e. the forms in which they appear in a dictionary.
    E.g. is, are, am -> BE
• Parsing
  – Marks the syntactic structure of each sentence.
    E.g. (S (NP (NNP John)) (VP (VBZ loves) (NP (NNP Mary))))

• Develop and use your own software
• Use existing corpus tools
  – TEC Tools
    For more information about how to use TEC Tools with local corpora, you can download the tutorial from the TEC webpage.
  – WordSmith Tools
    A collection of corpus linguistics tools
  – ParaConc
    A bilingual or multilingual concordancer
  – ….

“When a corpus is created, a compromise has often to be reached between ideal design criteria and practical constraints. However, while opportunistic choices may be justified, the limitations and distortions they introduce in the makeup of a corpus should not be forgotten when evaluating the results”.
(Zanettin 2011)

TEC website: http://www.llc.manchester.ac.uk/ctis/research/english-corpus/
TEC Email Address: [email protected]

Baker, Mona (1995) ‘Corpora in Translation Studies: An overview and some suggestions for future research’, Target 7(2): 223-243.
Leech, Geoffrey (1991) ‘The State of the Art in Corpus Linguistics’, in Karin Aijmer and Bengt Altenberg (eds) English Corpus Linguistics: Linguistic studies in honour of Jan Svartvik, London: Longman, pp. 8-29.
McEnery, Tony, Richard Xiao and Yukio Tono (2006) Corpus-based Language Studies, London and New York: Routledge.
Olohan, Maeve (2004) Introducing Corpora in Translation Studies, London and New York: Routledge.
Zanettin, Federico (2011) ‘Translation and Corpus Design’, SYNAPS 26: 14-23.
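
Supplementary sketch: reading POS-tagged text
The POS tagging example given earlier (John_NP0 loves_VVZ Mary_NP0 ._.) attaches each tag to its word with an underscore. For anyone following the "develop and use your own software" route, the Python sketch below shows one way such a line might be split back into word/tag pairs and the tags counted. The function name parse_tagged is hypothetical and is not part of TEC Tools; the sketch also assumes that tokens are separated by whitespace and that the tag follows the last underscore in each token.

```python
from collections import Counter


def parse_tagged(line, sep="_"):
    """Split a 'word_TAG word_TAG ...' line into (word, tag) pairs.

    Hypothetical helper for illustration; assumes whitespace-separated
    tokens with the tag after the last separator.
    """
    pairs = []
    for token in line.split():
        # rpartition splits on the LAST separator, so a word that itself
        # contains an underscore keeps it.
        word, found, tag = token.rpartition(sep)
        if not found:
            # No separator present: keep the token and mark the tag missing.
            word, tag = token, None
        pairs.append((word, tag))
    return pairs


tagged = "John_NP0 loves_VVZ Mary_NP0 ._."
pairs = parse_tagged(tagged)
tag_counts = Counter(tag for _, tag in pairs)

print(pairs)
print(tag_counts.most_common())
```

Splitting on the last underscore rather than the first is a design choice: it keeps the final token "._." intact as a full stop tagged as punctuation. Real tagged corpora may use other separators or XML markup, in which case the sep argument or the whole approach would need adjusting.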