Corpora in contrastive and translation studies

Using corpora in
contrastive and translation studies
Corpus Linguistics
Richard Xiao
lancsxiaoz@googlemail.com
Aims of this session
• Lecture
– Corpora in contrastive and translation studies
– Use of comparable and parallel corpora
– Case study: Translation universals, do they really
exist?
• Lab session
– CUC ParaConc and the Babel parallel corpus
• Closing
– Shedding of valedictory tears
Types of corpora: Some distinctions
• Monolingual versus multilingual corpora
• Parallel versus comparable corpora
• Comparable versus comparative corpora
Monolingual versus multilingual corpora
• Monolingual corpora
– A corpus that only involves one language
• Multilingual corpora
– A corpus that involves texts of more than one language
– A corpus covering two languages is conventionally known
as ‘bilingual’
• Multilingual corpora, in a narrow sense, must involve more than
two languages
• ‘Multilingual’ and ‘bilingual’ are often used interchangeably
– Parallel and comparable corpora
Parallel versus comparable corpora
• Terminological confusion centres around the terms
• For some scholars (e.g. Aijmer and Altenberg 1996;
Granger 1996: 38)
– Corpora composed of source texts in one language and their
translations in another language (or other languages) are
‘translation corpora’ while those comprising different
components sampled from different native languages using
comparable sampling techniques are called ‘parallel corpora’
• For many others (e.g. Baker 1993: 248, 1995, 1999; Barlow 1995,
2000: 110; Hunston 2002: 15; McEnery and Wilson 1996: 57;
McEnery, Xiao and Tono 2006)
– Corpora of the first type are labelled ‘parallel corpora’ while
those of the latter type are ‘comparable corpora’
Parallel versus comparable corpora
• In classifying corpora, the criteria used must be logical and
applied consistently
- We can say a corpus is a translation or a non-translation
corpus if the criterion of corpus content is used
- But if we choose to define corpus types by the criterion of
corpus form, we must use the criterion consistently
- We can say a corpus is parallel if the corpus contains source texts
and translations in parallel, or it is a comparable corpus if its
components or subcorpora are comparable by applying the same
sampling techniques and similar balance and coverage
- It is simply inconsistent and illogical to refer to corpora of
the first type as translation corpora by the criterion of content
while referring to corpora of the latter type as parallel corpora
by the criterion of form!
Multilingual vs. monolingual comparable corpora
• A common practice in TS is to compare a corpus of
translated texts (translational corpus) with a corpus
consisting of comparably sampled non-translated
texts in the same language
• The two sub-corpora form a monolingual
comparable corpus for translation research, as
opposed to a multilingual comparable corpus
composed of comparable texts for different
languages for cross-linguistic contrast
Comparative corpora
• Corpora containing different regional varieties
of the same language are not comparable
corpora
– E.g. the International Corpus of English (ICE), the
Brown family of corpora
– All corpora, as a resource for linguistic research,
have ‘always been pre-eminently suited for
comparative studies’ (Aarts 1998: ix), either
intralingually or interlingually
• Corpora of this kind are comparative corpora
Use of parallel & comparable corpora
• Parallel and comparable corpora “offer specific uses and
possibilities” for contrastive and translation studies
(Aijmer & Altenberg 1996: 12)
- they give new insights into the languages compared –
insights that are not likely to be gained from the study of
monolingual corpora;
- they can be used for a range of comparative purposes
and increase our knowledge of language-specific,
typological and cultural differences, as well as of
universal features;
- they illuminate differences between source texts and
translations, and between native and non-native texts;
- they can be used for a number of practical applications,
e.g. in lexicography, language teaching and translation.
Use of parallel & comparable corpora
• Used primarily for translation and contrastive
studies
• The two types of corpora have their own
characteristics, and serve different purposes
– Parallel corpora are useful in translation studies,
but they alone serve as a poor basis for cross-linguistic
contrast, because translations cannot
avoid the effect of translationese
– Comparable corpora are well suited for contrastive
research, but are less useful in translation studies
Using corpora in translation studies
• Parallel corpora
– Useful in exploring how an idea in one language is conveyed in
another language, thus providing indirect evidence for the study of
translation processes
– Indispensable for building statistical or example-based machine
translation (EBMT) systems, and for the development of bilingual
lexicons and translation memories
– Parallel concordancing is a useful tool for translators
• Comparable corpora
– Useful in improving the translator’s understanding of the subject field
and improving the quality of translation in terms of fluency, correct
term choice and idiomatic expressions in the chosen subject field
– Can also be used to build terminology banks
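The idea behind parallel concordancing can be sketched in a few lines. This is a minimal illustration only, assuming the corpus is already sentence-aligned and held as (source, target) pairs; the function name and the sample sentences are made up for the example and do not reflect the interface of ParaConc or any other tool.

```python
def parallel_concordance(aligned_pairs, query):
    """Return aligned (source, target) pairs whose source contains `query`."""
    return [(src, tgt) for src, tgt in aligned_pairs if query in src]

# Made-up, pre-aligned English-Chinese sentence pairs (illustration only).
aligned_pairs = [
    ("The meeting was postponed.", "会议推迟了。"),
    ("She chaired the meeting.", "她主持了会议。"),
    ("He left early.", "他很早就走了。"),
]

# Show every source sentence containing the search word next to its translation.
for src, tgt in parallel_concordance(aligned_pairs, "meeting"):
    print(src, "|", tgt)
```

A real concordancer would add word-level search on either side, sorting and context windows; the point here is only that alignment lets a hit in one language retrieve its counterpart in the other.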
Using corpora in translation studies
• Translational corpora
– Provide primary evidence in product-oriented Translation
Studies, and in studies of translation universals
– If corpora of this kind are encoded with sociolinguistic and
cultural parameters, they can also be used to study the
sociocultural environment of translations (e.g. functions of
translation in DTS)
• Monolingual corpora (source / target language)
– Raising the translator’s linguistic and cultural awareness in
general
– Providing a useful and effective reference tool for translators
– In combination with a parallel corpus to form a so-called
‘translation evaluation corpus’ that helps translator trainers
or critics to evaluate translations more effectively and
objectively
Corpus-based translation studies
• Laviosa (1998a)
– “the corpus-based approach is evolving, through theoretical
elaboration and empirical realisation, into a coherent, composite and
rich paradigm that addresses a variety of issues pertaining to theory,
description, and the practice of translation.”
• Hypotheses about translation universals can be tested with corpus data
(Baker 1993, 1995)
• Rapid development of corpus linguistics, esp. multilingual corpus research
in the early 1990s
• Increasing interest in Descriptive Translation Studies (Toury 1995)
• Tymoczko (1998)
– “Corpus Translation Studies is central to the way that Translation
Studies as a discipline will remain vital and move forward.”
• Meta 43/4 (1998); Kenny (2001); Laviosa (2002); Granger et al. (eds.)
(2003); Olohan (2004); Mauranen et al. (eds.) (2004); Kruger & Munday
(eds.) (2011); Hu (2011); Wang (2011); Xiao (2012)
The Holmes-Toury map
• Applied Translation Studies
• Descriptive Translation Studies
• Theoretical Translation Studies
Applied Translation Studies
• Three major contributions of corpora
• Corpus-assisted translating
– Bowker (1998: 631): ‘corpus-assisted translations are of a higher
quality with respect to subject field understanding, correct term
choice and idiomatic expressions.’
• Corpus-aided translation teaching and training
– Bernardini (1997): ‘large corpora concordancing’ (LCC) can help
students to develop ‘awareness’, ‘reflectiveness’ and
‘resourcefulness’, which are said to be the skills that distinguish
a translator from an unskilled amateur
• Development of translation tools
– Corpora, and especially aligned parallel corpora, are essential
for the development of translation technology such as machine
translation (MT) systems, and computer-aided translation (CAT)
tools
Descriptive Translation Studies
• Characterized by its emphasis on the study of
translation per se
– It seeks to answer the question of ‘why a translator translates
in this way’ rather than ‘how to translate’
• Baker (1993) predicted that the availability of large
corpora of both source and translated texts, together
with the development of the corpus-based approach,
would enable translation scholars to uncover the
nature of translation as a mediated communicative
event
Descriptive Translation Studies
• Three focuses (Holmes 1972/1988)
• Translation as a product
– Concerned with describing translation as a product by comparing
corpora of translated and non-translational native texts in the target
language
– Attempting to uncover evidence to support or reject the so-called
translation universal hypotheses
• Translation as a process
– Aims at revealing the thought processes that take place in the mind of
the translator while she or he is translating
– One possible approach for corpus-based DTS is to analyse, off-line, written
transcripts of translators verbalising their thoughts while translating, known
as Think-Aloud Protocols (TAPs)
– Studies of translation as a product provide indirect evidence about
translation as a process
• The function of translation
– The study of contexts rather than texts: function or impact of a
translation
– Relatively few function-oriented studies that are corpus-based
Theoretical Translation Studies
• Aims ‘to establish general principles by means of
which these phenomena can be explained and
predicted’ (Holmes 1988: 71)
– Closely related to, and often reliant on the empirical
findings produced by Descriptive Translation Studies
• One good battleground for using DTS findings to
pursue a general theory of translation is the hypothesis
of so-called translation universals (TUs) and its
related sub-hypotheses
– Sometimes referred to as the inherent features of
translational language, or ‘translationese’
TU: A focus of CBTS
• An important area of corpus-based TS over the past
decade
– Baker (1993, 1996); Chesterman (2004); Kenny (1998,
1999, 2000, 2001); Laviosa (1998b); Mauranen & Kujamäki
(2004); McEnery & Xiao (2002, 2007); Olohan (2004);
Olohan & Baker (2000); Øverås (1998); Pym (2005); Xiao
& Yue (2008); Xiao & Dai (2010); Xiao (2010, 2011, 2012)
• The Translational English Corpus (TEC)
– Manual
• http://www.llc.manchester.ac.uk/ctis/research/english-corpus/
– Software
• http://ronaldo.cs.tcd.ie/tec2/jnlp/
Features of translated English
• Laviosa (1998b): Four core patterns of lexical
use in translational English
- A relatively low proportion of lexical words
over function words
- A relatively high proportion of high-frequency words over low-frequency words
- A relatively great repetition of the most
frequent words
- Less variety in most frequently used words
Features of translated English
• Beyond the lexical level
– Simplification: “tendency to simplify the language used in translation”
(Baker 1996: 181-182)
• simpler language than target native language lexically / syntactically /
stylistically
– Normalization: “tendency to exaggerate features of the target
language and to conform to its typical patterns” (Baker 1996: 183)
• more “normal” than the target native language
– Explicitation: tendency in translations to “spell things out rather than
leave them implicit” (Baker 1996: 180)
• more frequent use of conjunctions, and increased cohesion in translated
text
– Sanitization: translated texts are “somewhat ‘sanitized’ versions of the
original” (Kenny 1998: 515)
• Lost or reduced connotational meaning in translation
• “TU hypotheses”
TU: A target of debate
• Is translational language different from target
native language?
• Translational language is at best an
unrepresentative special variant of the target
language because translations cannot possibly
avoid the effect of translationese
• e.g. Baker 1993; Gellerstam 1996; Hartmann 1985;
Laviosa 1997; McEnery & Wilson 2001; McEnery & Xiao
(2002, 2007); Teubert 1996
TU: A target of debate
• Are the features uncovered on the basis of
translational English generalizable to other translated
languages?
• Existing evidence has largely come from translational
English and related European languages
• If such features are to be generalized as “translation
universals”, the language pairs involved must not be
restricted to English and closely related languages
– Cheong’s (2006) study of English-Korean translation
contradicts even the least controversial explicitation
hypothesis
• Evidence from “genetically” distinct language pairs
such as English and Chinese is undoubtedly more
convincing, if not indispensable
The ZCTC corpus
• Created with the explicit aim of studying the features of
translated Chinese
• A translational counterpart of the Lancaster Corpus of
Mandarin Chinese (LCMC), a one-million-word balanced
corpus of native Chinese (McEnery & Xiao 2004)
– www.ling.lancs.ac.uk/corplang/lcmc/
• Five hundred 2,000-word text samples taken proportionally
from fifteen written text categories published in China in the
1990s
– www.ling.lancs.ac.uk/corplang/ZCTC/
LCMC / ZCTC corpus design
ZCTC vs. LCMC
Corpus markup and annotation
• CES-compliant XML
– CES: www.cs.vassar.edu/CES/
• Tokenization and POS tagging
– ICTCLAS2008: www.ictclas.org
• A precision rate of 98.54% for tokenization
• Paragraph, sentence, word token
• Encoded in Unicode (UTF-8)
Core patterns of lexical use
• Do the core patterns of lexical use Laviosa
(1998b) observes in translational English also
apply in translated Chinese?
• Same criteria and parameters as in Laviosa
(1998b)
– Lexical density
– Frequency profiles
– Mean sentence length
Lexical density
• The Stubbs-style lexical density: the ratio between
the number of lexical words (i.e. content words) and
the total number of words (Stubbs 1986: 33; 1996:
172)
– Measure of informational load
– Adopted in Laviosa (1998b)
• Lexical density measured by TTR or Standardized TTR
(STTR) (Scott 2004)
– Measure of lexical variability
– Commonly used in Corpus Linguistics
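The Stubbs-style measure defined above can be sketched as follows. This is a minimal illustration, assuming the text is already tokenized and POS-tagged; the tag labels and the choice of which tags count as lexical are hypothetical and would need to be mapped onto the tagset actually used.

```python
# Hypothetical set of tags treated as lexical (content) words; real tagsets
# are richer, so adjust this to the annotation scheme in use.
LEXICAL_TAGS = {"n", "v", "a", "d"}  # noun, verb, adjective, adverb
PUNCT_TAG = "w"                      # assumed tag for punctuation

def lexical_density(tagged_tokens, lexical_tags=LEXICAL_TAGS):
    """Return lexical words as a percentage of all words (punctuation excluded)."""
    words = [(w, t) for w, t in tagged_tokens if t != PUNCT_TAG]
    if not words:
        return 0.0
    lexical = sum(1 for _, t in words if t in lexical_tags)
    return 100.0 * lexical / len(words)

# Made-up tagged sentence: 2 of the 5 words carry a lexical tag.
sample = [("我", "r"), ("喜欢", "v"), ("这", "r"), ("本", "q"), ("书", "n"), ("。", "w")]
print(lexical_density(sample))
```

Where the line between lexical and function words is drawn (for instance, whether adverbs count as lexical) changes the score, so the same definition has to be applied to both corpora being compared.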
Stubbs-style lexical density
[Figure: lexical density (%) by genre (A–R and mean), LCMC vs. ZCTC]
• Mean LD is significantly greater in native than translational corpus (66.93%
vs. 61.59%, t = -4.94, p<0.001)
• In addition, the native Chinese corpus has a greater LD score in all
15 genres, and the difference is significant in every genre except M
– Translations make more frequent use of function words
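The slides report t values but do not say which t-test was used. As a hedged illustration of how two sets of per-text scores could be compared, the sketch below computes Welch's t statistic (independent samples, unequal variances); the test choice and the scores are assumptions made for the example, not the study's data.

```python
from math import sqrt
from statistics import mean, variance

def welch_t(sample_a, sample_b):
    """Welch's t statistic for two independent samples (unequal variances)."""
    se = sqrt(variance(sample_a) / len(sample_a) + variance(sample_b) / len(sample_b))
    return (mean(sample_a) - mean(sample_b)) / se

# Made-up per-text lexical density scores, for illustration only.
translated = [60.1, 62.3, 61.0, 59.8, 63.2]
native = [66.0, 67.5, 65.9, 68.1, 66.7]

t = welch_t(translated, native)
print(round(t, 2))  # negative when the first sample has the lower mean
```

The sign of t depends only on the order of the two samples; a negative value with the translated texts first matches the direction of the difference reported above. A p-value would additionally need the degrees of freedom and a t distribution, which a statistics package would supply.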
Standardized TTR
[Figure: standardized TTR by genre (A–R and mean), ZCTC vs. LCMC]
• Mean STTR is slightly greater in native than translation corpus (46.58 vs.
45.73): not significant (t = -0.573, p=0.571)
• The differences in most genres are also marginal
• Greater STTR scores can be found in both native (e.g. A) and translated (C)
Chinese genres
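A standardized TTR can be sketched as the mean of type-token ratios taken over consecutive fixed-size chunks. This is a minimal version, assuming a tokenized text; the default chunk size of 1,000 tokens and the decision to drop an incomplete final chunk are assumptions here, not settings reported in the slides.

```python
def sttr(tokens, chunk_size=1000):
    """Mean type-token ratio (%) over consecutive full chunks of tokens."""
    ratios = []
    # Step through the text chunk by chunk; a trailing partial chunk is skipped.
    for start in range(0, len(tokens) - chunk_size + 1, chunk_size):
        chunk = tokens[start:start + chunk_size]
        ratios.append(100.0 * len(set(chunk)) / chunk_size)
    if not ratios:
        raise ValueError("text is shorter than one chunk")
    return sum(ratios) / len(ratios)

# Toy input with a tiny chunk size; the trailing "g" falls outside a full chunk.
tokens = ["a", "b", "a", "c", "d", "d", "e", "f", "g"]
print(sttr(tokens, chunk_size=4))
```

Averaging over equal-sized chunks is what makes the score comparable across texts of different lengths, since a plain TTR falls as a text gets longer.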
Lexical-function ratio ≈ Stubbs LD
[Figure: lexical-function word ratio by genre (A–R and mean), ZCTC vs. LCMC]
• Mean ratio between lexical and function words is significantly greater in native
than translational corpus (2.08 vs. 1.64, t = -4.88, p<0.001)
• Also, native Chinese has a greater ratio in all genres, and the differences are
significant in nearly all genres (except for M)
• In line with Laviosa’s (1998b) initial hypothesis that translational language has a
relatively low proportion of lexical words over function words
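The link between the lexical-function ratio and Stubbs-style lexical density can be made explicit. A short derivation, assuming every word is counted as either lexical or function (the symbols L, F and r are introduced here for illustration):

```latex
\[
\mathrm{LD} = \frac{L}{L + F}, \qquad
r = \frac{L}{F} = \frac{\mathrm{LD}}{1 - \mathrm{LD}}
\]
\[
\frac{0.6693}{1 - 0.6693} \approx 2.02, \qquad
\frac{0.6159}{1 - 0.6159} \approx 1.60
\]
```

These values are close to, but slightly below, the reported mean ratios (2.08 and 1.64). That would be expected if the ratio was computed per text and then averaged, since r is a convex function of LD, but the slides do not state how the means were obtained.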
Frequency profiles of LCMC/ZCTC
• Laviosa’s (1998b) ‘list head’ or ‘high frequency
words’
– Wordlist items which individually account for at
least 0.10% of the total number of tokens in a
corpus
• The same criterion for high frequency words
in this study to ensure comparability
Frequency profiles
• The numbers of high frequency words are very similar in the two corpora
• High frequency words account for a considerably greater proportion of tokens in
the translational corpus (40.47% vs. 35.70%)
• High frequency words display a much greater repetition rate in translated Chinese
(3154.37 vs. 2870.37)
• Also the ratio between high- and low-frequency words is greater in the
translational corpus (0.6988 vs. 0.5659)
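A frequency profile of this kind can be sketched from a plain token list. In the sketch below, the 0.10% threshold for the list head comes from the definition above; the operationalisations of repetition rate (high-frequency tokens per high-frequency type) and of the high/low ratio (high-frequency tokens over all remaining tokens) are assumptions, and the exact calculations in the study may differ.

```python
from collections import Counter

def frequency_profile(tokens, threshold=0.001):
    """Summarise the list head: types at or above `threshold` of all tokens."""
    counts = Counter(tokens)
    total = sum(counts.values())
    if total == 0:
        raise ValueError("empty token list")
    head = {w: c for w, c in counts.items() if c / total >= threshold}
    head_tokens = sum(head.values())
    tail_tokens = total - head_tokens
    return {
        "head_types": len(head),
        "head_share_pct": 100.0 * head_tokens / total,
        "repetition_rate": head_tokens / len(head) if head else 0.0,
        "high_low_ratio": head_tokens / tail_tokens if tail_tokens else float("inf"),
    }

# Toy data with an artificially high threshold so that a list head exists.
tokens = ["的"] * 6 + ["是"] * 2 + ["书", "人"]
profile = frequency_profile(tokens, threshold=0.2)
print(profile)
```

On a real corpus the default threshold of 0.001 (0.10%) would be used, applied identically to both corpora.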
Mean sentence length vs. simplification
[Figure: mean sentence length by genre (A–R and mean), ZCTC vs. LCMC]
• Conflicting observations of mean sentence length as an indicator of
simplification (e.g. Laviosa 1998b vs. Malmkjaer 1997)
• The native Chinese corpus (LCMC) shows a marginally greater mean sentence
length: not significant (t = -1.41, p = 0.17)
• Mean sentence length is sensitive to genre variation and may not be reliable
as an indicator of simplification in translational Chinese
• (Mean sentence segment length)
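Mean sentence length and mean sentence segment length can both be sketched with one helper that differs only in which punctuation marks close a unit. The delimiter sets below are assumptions chosen for illustration; the slides do not list the marks used in the study.

```python
# Assumed delimiter sets (illustration only).
SENTENCE_END = {"。", "！", "？"}
SEGMENT_END = SENTENCE_END | {"，", "；", "："}

def mean_unit_length(tokens, delimiters):
    """Mean number of word tokens per unit, where units end at `delimiters`."""
    lengths, current = [], 0
    for tok in tokens:
        if tok in delimiters:
            if current:
                lengths.append(current)
            current = 0
        elif tok not in SEGMENT_END:  # other punctuation is not counted as a word
            current += 1
    if current:  # text that ends without a closing delimiter
        lengths.append(current)
    return sum(lengths) / len(lengths) if lengths else 0.0

# Made-up tokenized text: two sentences, three segments.
tokens = ["我", "来", "了", "，", "他", "走", "了", "。", "你", "呢", "？"]
print(mean_unit_length(tokens, SENTENCE_END))  # words per sentence
print(mean_unit_length(tokens, SEGMENT_END))   # words per sentence segment
```

Because Chinese sentences are often long chains of comma-separated clauses, the segment-level figure can behave quite differently from the sentence-level one, which is one reason to report both.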
Lexical use in translational Chinese
• Summary
- Analysis of lexical density and frequency profiles shows
that the four core patterns of lexical use in translational
English are essentially also applicable in translated
Chinese
- But mean sentence length is less reliable as an indicator
of simplification in translational Chinese
Explicitation: Connectives as a device?
• Perhaps the most studied topic in TU research
and also the least controversial hypothesis
• Chen (2006)
– Connectives are a device for explicitation in English-Chinese translation of popular science books
• Xiao and Yue (2008)
– Connectives are significantly more frequent in translational
than native Chinese fiction
• Question
– Can we generalize this finding from these specific genres
to Mandarin Chinese in general?
Conjunctions in ZCTC and LCMC
[Figure: frequency of conjunctions per 10,000 words by genre (A–R and mean), ZCTC vs. LCMC]
• Mean frequency of conjunctions is significantly greater in translational
than native corpus
– 306.42 and 243.23 instances per 10,000 tokens, LL=723.12 for 1 d.f., p<0.001
• In addition, genre-based distribution shows that most genres covered in
the corpora display a significantly more frequent use of conjunctions in
translational Chinese in spite of some genre-based subtleties (e.g. F, J)
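The LL figure above is a log-likelihood score with one degree of freedom. A minimal sketch of the usual two-corpus calculation follows, assuming raw frequencies and corpus sizes are available; the counts in the example are hypothetical, obtained by scaling the reported per-10,000 rates to two corpora of exactly one million tokens, so the result will not reproduce the reported value exactly.

```python
from math import log

def log_likelihood(freq_a, size_a, freq_b, size_b):
    """Two-corpus log-likelihood (G2) for one item, 1 degree of freedom."""
    total_freq = freq_a + freq_b
    # Expected frequencies if the item were equally likely in both corpora.
    expected_a = size_a * total_freq / (size_a + size_b)
    expected_b = size_b * total_freq / (size_a + size_b)
    ll = 0.0
    for observed, expected in ((freq_a, expected_a), (freq_b, expected_b)):
        if observed > 0:  # a zero count contributes nothing to the sum
            ll += observed * log(observed / expected)
    return 2 * ll

# Hypothetical counts: 306.42 and 243.23 per 10,000 scaled to 1,000,000 tokens.
score = log_likelihood(30642, 1_000_000, 24323, 1_000_000)
print(round(score, 2))
```

With one degree of freedom, a score above 10.83 corresponds to p < 0.001, so any value in the hundreds indicates a difference far beyond that threshold.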
Conjunctions of different frequency bands
[Figure: conjunctions by usage band (All, 0.0001%+, 0.0005%+, 0.001%+, 0.005%+, 0.01%+, 0.05%+, 0.1%+); y-axis: Frequency; ZCTC vs. LCMC]
• More conjunction types of high frequency bands (0.01% or above) are used in
translational corpus
• There are an equal number of conjunction types (56 types) of medium
frequency band (0.005%) in translational and native corpora
• Beyond this balance point, the native corpus displays a greater number of
conjunction types of low frequency band (0.001% or below)
• In line with observations about high vs. low frequency words
Conjunctions of different styles
• A closer comparison of the lists of frequent
conjunctions (0.001% or above) in their respective
corpus also sheds some new light on the simplification
hypothesis
– There are 91 and 99 types of frequent conjunctions in the
two corpora – 86 items overlap in the two lists
– Conjunctions on the translational list but not the native list are all
informal, colloquial and simple, and usually have more
formal alternatives (e.g. 虽然 for 虽说, 总之 for 总的来说)
– Conjunctions on the native but not translation list are
typically formal, literate and archaic (e.g. 故、可见、进而、
加之、固然、继而、非但、然、而后)
• These results provide evidence for the simplification
hypothesis but against the normalization hypothesis
Conclusions
• Results based on two comparable Chinese corpora suggest
that the core patterns of lexical use in translational English
are generally also applicable in translated Chinese
• Beyond the lexical level
– Mean sentence length is sensitive to genre variation and may not
be reliable as an indicator of simplification
– A comparison of frequent conjunctions in native and translated
Chinese shows that simpler forms tend to be used in translations
– In spite of some genre-based subtleties, conjunctions are more
frequently used in translational Chinese, which provides evidence
in favour of the explicitation hypothesis
• Corpus Translation Studies is a promising area of research
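• 《语料库翻译文库:英汉翻译中的汉语译文语料库研究》 [Corpus Translation Studies Series:
Corpus-Based Studies of Translational Chinese in English-Chinese Translation],
Shanghai Jiao Tong University Press (上海交通大学出版社), 2012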
CUC ParaConc
• Software demo…
Shedding of valedictory tears
lancsxiaoz@googlemail.com
xiaoz@zju.edu.cn