HOMENAJE A MERVYN SMALE - Universidad de Granada

(DIS)ADVANTAGES OF WORKING WITH A PARSED CORPUS: LOOKING FOR INDIRECT OBJECTS IN THE ICE-GB

CARMEN AGUILERA CARNERERO
Universidad de Granada

In recent decades, Corpus Linguistics has witnessed the proliferation of many varied corpora. Some of them add linguistic information to the plain text: a grammatical category (tagged corpora) or a syntactic function (parsed corpora). The latter are, in principle, quite useful for saving time and effort when retrieving syntactic structures for analysis. This paper deals with some of the problems that arise when a parsed corpus, the British component of the International Corpus of English, is used to analyse indirect objects in English. Two main sorts of problems were detected: our disagreement with the parsers of the corpus over some syntactic categories we find doubtful (‘dimonotransitivity’, ‘parataxis’ and ‘transitive complementation’), and several inconsistencies in the labelling of constituents in the corpus. The problems we faced in this particular task illustrate some of the difficulties of working with a parsed corpus and lead us to question the real utility of parsed corpora.

INTRODUCTION

Within the considerable development that Corpus Linguistics has undergone recently, two main approaches have been distinguished: the corpus-based approach and the corpus-driven approach. The two positions tackle the analysis of language from different angles, especially as regards methodology, with far-reaching implications for the way they engage in corpus analysis. Notably, they differ considerably in their attitude towards annotated corpora: corpus-based linguists view annotation favourably, whereas corpus-driven linguists distrust it.
As stated above, this paper deals with the obstacles encountered when working with a parsed corpus, the British component of the International Corpus of English (ICE), in a lexico-grammatical analysis of indirect objects in English. This study not only gave us a deeper insight into the syntactic-semantic nature of indirect objects in English, but also entailed intense reflection on the use of tagged and, above all, parsed corpora.

In what follows, we briefly survey the differences between the corpus-based and the corpus-driven approaches concerning the use of tagged and parsed corpora, and then introduce the main features of the International Corpus of English (ICE). We then deal with some of the difficulties faced when working with the ICE-GB and close with the conclusions reached in the light of our findings.

THE ROLE OF ANNOTATION IN CORPORA

As stated in the previous section, the study of language phenomena within Corpus Linguistics has given rise to two different approaches: the corpus-based approach and the corpus-driven approach. The positions differ in a considerable number of aspects, which have been clearly summarized by Tognini-Bonelli (1991:65ff):

One could argue that the two positions we are addressing with respect to corpus work, the corpus-based and the corpus-driven, reflect two opposed stances concerning this issue and while the corpus-based linguist attempts to insulate it, standardise it and reduce it, the corpus-driven linguists build it into the theoretical categories (s)he derives from the data. (Tognini-Bonelli 1991:67)

Aarts (2002:3) considers that the linguist’s preference for one methodology over the other depends on his/her answers to questions such as:

1. the type of evidence he/she thinks the corpus data provide
2. whether he/she considers there is room for data other than those included in the corpus
3. the role played by non-corpus data in previous linguistic research.

One of the main differences between these two perspectives lies in how they view the implications of assigning labels to particular corpus items. Whereas corpus-based linguists take annotation as a useful tool to help them in their analysis, corpus-driven linguists think it predetermines the researcher’s conception of the results, which, in their opinion, should emanate from the corpus itself, uncontaminated by previous research. This idea is overtly expressed by John Sinclair in the following passage:

[M]y reservations about annotation are quite specific, and concern only their inclusion in the resources around generic corpora. Because they impose one particular model of language on the corpus, they restrict the kind of research that can be done; because the practice of annotation normally requires human intervention, it is not a replicable process and therefore fails the first test of scientific method. Because the models imposed by current conventions of annotation are unlikely to be informed by corpus evidence, I believe researchers who use them are likely to make unnecessary problems for themselves. (Sinclair 2004:54)

A strong objection to this aseptic approach to language is raised by Mukherjee (2005:72):

[T]his dogmatic stance [corpus-driven approach] with its fixation about corpus data seems both unrealistic and implausible to me. It is unrealistic because any linguistic research activity stems from some sort of initial intuitions about language. [...] So, there always is some sort of theoretical preconception involved, and, what is more, even the avoidance of a priori theory is a theoretical preconception.
The distrust of intuition of CDL-methodology is also implausible since any corpus is compiled on the grounds of linguists’ informed intuitions about language in the first place.

We agree entirely with Mukherjee on his rejection of corpus-driven methodology. It is difficult to imagine the linguist’s brain as a tabula rasa, not biased at all by any linguistic preconception. Moreover, working with a corpus in itself implies support for a whole conception of language based on the observation of real language, as opposed to the Chomskyan tradition.

What we find not completely correct in Sinclair’s quotation is the corpus-driven assumption that all corpus-based linguists, by the mere fact of using a corpus, accept as dogma all the theoretical ideas expressed in it by the taggers and parsers. We cannot deny that working with a tagged and parsed corpus means that the researcher has to deal with material someone else has manipulated before, which may be a disadvantage if he or she does not agree with the decisions made by the tagger and/or parser. Nevertheless, we think this does not have to be so. Using a tagged and parsed corpus for your analysis does not necessarily mean that you have to share all the linguistic preconceptions that led the tagger/parser to make one decision or another. Once more, it is not a black-or-white issue. One could agree with, let’s say, the majority of decisions in the corpus, but not with every single choice that taggers and parsers make. In fact, agreement with the ideas embodied in the corpus confirms the linguist’s conceptions of particular linguistic problems, whereas disagreement on specific issues motivates debate and critical reflection in the researcher. As Mukherjee (2005: 79-80) explains:

There is a danger, therefore, that already available corpora with their syntactic annotation predetermine the linguistic theory of and research into syntax.
As a matter of fact, the reverse order should be aimed at; not that corpus annotation should influence linguistic research, but linguistic research questions should be the guideline for the corpus annotation.

Paradoxically, annotated corpora have also been accused of containing less information than non-annotated ones. According to Aarts (2002:9), the problem lies in regarding the information added by annotation as different in kind from the information contained in the corpus, since it is the result of a “descriptive framework that generated the tags”.

Within the category of annotated corpora, two main kinds can be distinguished: tagged corpora and parsed corpora. Tagged corpora assign a grammatical category to each lexical item in the corpus; parsed corpora add a syntactic function in the form of a tree. The advantages and disadvantages of these two types of corpora have been pointed out by Gilquin (2002:192). The structural information contained in parsed corpora is less readily available and, in the majority of cases, unreliable. If parsed corpora are analysed in detail (as the ICE-GB is), their main disadvantage is their small size and, therefore, their inadequacy for studying infrequent structures in the language.

Leaving aside the controversy over whether the linguist shares the grammatical and syntactic categories contained in the corpus, one considerable merit of a tagged and parsed corpus is that it simplifies the whole process of analysis, which is otherwise a time-consuming task. However, this advantage is counterbalanced by what is, in our view, one of the great dangers of an annotated corpus: the inaccuracy of the tags for grammatical and syntactic categories and, consequently, the unreliability of the results, as we will discuss below.
THE INTERNATIONAL CORPUS OF ENGLISH: ICE

The corpus chosen for this study is the British component of the International Corpus of English (ICE). The International Corpus of English was a project launched at London University in 1988 under the auspices of Professor Sidney Greenbaum, who felt the need to compare the linguistic varieties of England and the United States, both written and spoken, an aim he could not carry out satisfactorily using the corpora available at that time. [1] The ICE was compiled by the Survey of English Usage, at the University of London, and the software programmes needed to carry out the grammatical and syntactic analysis were developed by the TOSCA group at Nijmegen University. In particular, the software for using the corpus (ICECUP) was created by Aidan Quinn and Nick Porter at the Survey of English Usage.

[1] As Greenbaum (1991:83) explains, at that moment, there were some corpora devoted to the study of written English such as the Brown Corpus and the LOB corpus, as well as the London-Lund corpus specializing in spoken English, but there was no corpus that contained both modes so that written and spoken English could be compared.

The compilation of the British component was only the beginning; later on, other national teams joined the project [2] in every country in which English was the first language and in those countries in which, although not official, English was the language of administration, education or the law courts.

The ICE-GB corpus is not very large (1 million words). It comprises 500 texts of approximately 2000 words each (200 written and 300 spoken), belonging to different genres. All the material was collected between 1990 and 1993 and was produced by adults (18+ years) who were educated in the English language at least until they finished secondary school.
One of the features that makes the ICE-GB very appealing for researchers interested in syntax and grammar is that it is not only a tagged corpus but also a parsed one, which means that all clause constituents are associated with a grammatical category and also with a syntactic function in a tree. Besides, ICECUP allows the linguist to retrieve grammatical and syntactic information, thanks to the tool known as ‘fuzzy tree fragment’.

The compilation and annotation of the ICE-GB was a long and complicated process which covered the following stages (Nelson, Wallis and Aarts 2002:10ff):

1) Structural markup: Structural markup encodes features of the original texts that are lost when they are converted into a plain text file on a computer. In written texts, markup symbols are used to encode typographic features, such as boldface, italics and underlining, as well as structural features such as sentence boundaries, paragraph boundaries, and headings. In spoken texts, markup encodes sentence boundaries, speaker turns, overlapping strings, and pauses.

2) Part-of-speech tagging: During this stage, each lexical item was assigned a part-of-speech label or tag, such as ‘N’ for noun, or ‘V’ for verb. In addition to the main label, most tags carry additional information, which appears in brackets. With some modifications, the tagset is based on the classifications given in Quirk, Greenbaum, Leech, and Svartvik 1985. The tagger assigned one or more tags to each lexical item, and the output was manually checked at the Survey of English Usage. The checking stage involved choosing the correct tag for each item and removing the incorrect tags.

3) Parsing: This is the most important stage for a parsed corpus, since a syntactic function is assigned to every element in the clause.
The syntactic parsing was carried out automatically using the software created by the TOSCA group of the University of Nijmegen, but it was preceded by a pre-editing phase in which high frequency constructions were marked manually “in order to reduce the ambiguity of the input, and thereby reduce the number of decisions that the automatic parser would have to make” (Nelson et al. 2002:14). The TOSCA parser analysed 70% of the parsing units of the corpus, then the Survey Parser analysed the rest.

[2] For the time being, there are fifteen teams compiling their own national or regional components of the ICE in such different countries as Malaysia, Sri Lanka, Ghana or Ireland, to mention just a few. The components of Britain, New Zealand, India, Hong Kong, East Africa, Singapore and the Philippines have already been finished and are available.

In summary, the annotation of the ICE corpus was partly automatic and partly manual, the latter phase being intended to correct possible mistakes derived from the automatic tagging.

PROBLEMS DERIVED FROM WORKING WITH A PARSED CORPUS

As we have said in the previous section, the most outstanding characteristic of the ICE-GB is the fact that it is parsed, which, a priori, simplifies the work of linguists enormously, especially since querying the corpus takes just a few seconds and spares the linguist the manual work. Accordingly, any parsed corpus would seem to be the perfect way to ease and speed up the tedious but unavoidable part of any syntactic study: the compilation of data. However, what may at first be a great advantage for the researcher can turn out to be a disadvantage, mainly due to problems of a twofold nature:

a) Unidirectional problems, which arise from the inconsistencies of the corpus itself.
These are, from our point of view, the most serious difficulties, since they do not allow the linguist to rely on the results obtained from the search queries. A subsequent analysis is needed to check whether those results are right, so the researcher has to spend a lot of time on manual work to find the mismatches. As a consequence, the process previously shortened by the automatic searches becomes much slower.

b) Bidirectional problems, that is, problems the researcher may face as a result of his/her disagreement with the language categories existing in the corpus. These sorts of problems are of a theoretical nature.

In the case of the ICE-GB, one has to be particularly careful; indeed, the authors issue the following warning in the manual:

[W]e could not guarantee that all similar constructions would always be analysed in the same way throughout the corpus. In other words, while we could achieve accuracy in individual cases, we could not guarantee consistency across the whole corpus. (Nelson et al. 2002:17, emphasis added)

After thinking about this statement, our question is: how can the linguist work with a corpus that warns about the inconsistency of its own data? What then are the advantages (if any) that tagged and parsed corpora have to offer the linguist?

In the following section, we will concentrate on the aspects we have already brought up: the presumed ease of analysing search queries in terms of investment of time and simplification of work, the inaccuracy of the data, and the possibility of not sharing linguistic principles with the parsers of the corpus. These issues are discussed in relation to the problems we found when we undertook the study of the indirect objects complementing the six most frequent ditransitive verbs in the ICE-GB: give, tell, show, send, ask and offer (Mukherjee 2005).
The distribution of the examples analysed is shown in the following table:

Table 1. Distribution of the examples studied in the corpus

Ditransitive verb    Number of occurrences in the corpus
give                 1159
tell                 794
show                 639
send                 518
ask                  346
offer                198
Total                3654

Some of the results found in a previous sample made us aware beforehand of certain irregularities in the labelling of the grammatical and functional categories; we therefore carried out a lexical search of the complementation patterns of these verbs. In particular, the grammatical categories of dimonotransitivity, parataxis and transitive complements are open to question.

Dimonotransitivity

As Mukherjee (2005:78) rightly points out in his study of ditransitive verbs in English, the concept of transitivity in the ICE-GB is not a stable property of the verb and it is purely syntactic. This means that the number of elements present in the sentence dictates the character of the verb and, consequently, the kind of sentence. If there is no object, the verb (and sentence) will be intransitive. If there is one object, there are two possibilities: the verb may be monotransitive (complemented just by a direct object) or dimonotransitive (complemented only by an indirect object). Finally, if the verb is complemented by a direct object and an indirect object, the sentence will be ditransitive. It is as simple (or as difficult) as that.

Along these lines, the new category they coin, ‘dimonotransitive’, makes sense. They need a new label to designate the possibility of having sentences with just one object: an indirect object. Only a reduced group of verbs admit this pattern (Nelson et al. 2002:49): show, ask, assure, grant, inform, promise, reassure and tell. For example:

(1) When I asked her, she burst into tears. <ICE-GB: S1A-094#110>
(2) I’ll tell you tomorrow. <ICE-GB: S1A:099#396>
(3) Show me. <ICE-GB: S1A:042#119>

As our concept of (di)transitivity is semantic and not syntactic, we cannot accept the label ‘dimonotransitive’, since we do not consider it possible for the indirect object to be the sole semantic complement required by the verb, in the way that the direct object is with monotransitive verbs. We do not accept that there may be verbs which cognitively evoke the presence of just an indirect object: wherever there is an indirect object in a sentence, we have a ditransitive verb. This is not incompatible with admitting, as we do, the possibility of omitting the direct object, leaving only the indirect object explicitly expressed in the sentence.

A lexical search of the complementation of the most frequent ditransitive verbs in the corpus showed us the following examples:

(4) I’m asking you <ICE-GB: S1A-070#182>
(5) By asking people <ICE-GB: W2A-016#014>
(6) if he wanted anything he asked the nearest girl and firmly called his daughter Pamela, never Pig <ICE-GB: W2F-017#025>
(7) I’m not sure I ever got round to asking her <ICE-GB: S1A-023#182>
(8) Well, I’ll ask one of the stallholders down Chapel Street <ICE-GB: S1A-010#025>
(9) You’re meant to ask me <ICE-GB: S1A-017#146>
(10) Did you ask anybody there <ICE-GB: S1A-024#012>
(11) Oh, I suppose it’s a question a lot of people ask each other <ICE-GB: 050#028>
(12) I mean ask Nigel you know <ICE-GB: S1A-090#208>
(13) people have to be asked <ICE-GB: 078#207>
(14) Some candidates may also be asked to attend for interview or to take an entrance examination <ICE-GB: W2D-007#049>
(15) We hardly told anybody <ICE-GB: S2A-027#127>
(16) If he does then I hope that he will approach the Health and Safety executive and talk to them about why these numbers have changed because he would be told <ICE-GB: S1B-057#092>
(17) He’s called Basil in the stables and I’m told likes a pint of MacEwan with his feed <ICE-GB: S2A-011#064>
(18) He was not told <ICE-GB: S2B-046#038>
(19) And this way I can usually
discover proposed future programmes all long before I’d officially be told <ICE-GB: S1A-082#036>
(20) By the age of forty, he had risen to the position of managing director - a sign, as I supposed, that he possessed all those qualities of drive, initiative and enterprise which I am told are required for success in the world of commerce and industry <ICE-GB: W2F-011#043>

In all the examples above, the only object present in the sentence is considered to be a direct object and the verb phrase is considered monotransitive. After a brief comparison between these sentences and the ones used to illustrate the concept of dimonotransitivity, a question immediately arises: what are the differences between the underlined monotransitive constituents in examples (16)-(18) and the dimonotransitives in (1), (2) and (3)? All these examples have similar subjects (personal pronouns) and the same verb (tell). In other words, why have some of them been parsed as direct objects and others as indirect objects? If, according to the ICE-GB, dimonotransitive structures combine with only an indirect object, we could logically expect this to be the case throughout the whole corpus.

The main problem with this label is to find out which lexico-grammatical criteria the ICE uses to label some elements as indirect objects and which it uses to classify other constituents as direct objects. In all the cases, the only participant in the predicate is a noun phrase, has a [+ animate, + human] referent, and either occupies the immediate post-verbal position or, in some passive sentences (examples (13), (14), (16), (17), and (18)), is the subject of a passive clause. Semantically, in all the examples the only animate constituent of the predicate has the semantic role of recipient, that is, the participant who receives the entity transferred by the agent. The results provided by the corpus are therefore not reliable at all.
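The purely syntactic rule described above, and the kind of mismatch it produces, can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the function names, the use of 'OD' and 'OI' as shorthand for direct and indirect object, and the tiny hand-built hit list (taken from examples (1)-(4) and (15) as the text reports their parsing) are ours, not ICECUP output or any ICE-GB interface.

```python
from collections import defaultdict


def syntactic_transitivity(functions):
    """Label a clause purely from the object functions it contains.

    'OD' (direct object) and 'OI' (indirect object) are assumed shorthand;
    the rule mirrors the description in the text, not the parser's code.
    """
    has_od = "OD" in functions
    has_oi = "OI" in functions
    if has_od and has_oi:
        return "ditransitive"
    if has_od:
        return "monotransitive"
    if has_oi:
        return "dimonotransitive"
    return "intransitive"


# Hypothetical hit list: (example number, verb, function given to the sole object).
# In every case the sole object is a human noun phrase right after the verb.
hits = [
    (1, "ask", ["OI"]),
    (2, "tell", ["OI"]),
    (3, "show", ["OI"]),
    (4, "ask", ["OD"]),
    (15, "tell", ["OD"]),
]


def inconsistent_verbs(hits):
    """Return the verbs whose similar clauses received more than one label."""
    labels_by_verb = defaultdict(set)
    for _example, verb, functions in hits:
        labels_by_verb[verb].add(syntactic_transitivity(functions))
    return {
        verb: sorted(labels)
        for verb, labels in labels_by_verb.items()
        if len(labels) > 1
    }


print(inconsistent_verbs(hits))
```

With this sample, ask and tell are flagged because the same surface pattern is labelled both dimonotransitive and monotransitive, while show is not flagged only because the sample contains a single analysis for it. A real check would have to read the labels from the corpus queries rather than from a hand-written list.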
To make things worse, in the following sentence, the highlighted constituent is considered an adverbial instead of a direct object:

(21) the contribution of modern genetics has shown however that the genetic code is really a fundamental organising principle <ICE-GB: S1B-060#068>

In the next example, there is a confusion between the direct object (you, according to the ICE) and the indirect object (what, according to the ICE):

(22) And what Mr Lampitt told you was that he was interested in acquiring a business whereby he could bring that business into the centre of London in effect <ICE-GB: S1B-064#017>

The ICE-GB user’s manual is not very explicit when explaining the reasons which led the parsers to make one decision instead of another. It just mentions the formal categories that direct and indirect objects are related to (NP in the case of indirect objects, and NP, CL, AJP, REACT, INTERJEC and DISP for direct objects), as well as the sorts of verbs they go with: ditransitive and dimonotransitive verbs in relation to both direct and indirect objects, and monotransitive and complex transitive verbs in relation to direct objects.

The labelling of these categories is all the more surprising given that the ICE grammar is based on Quirk et al.’s grammar. Quirk et al. (1985:759) overtly mention the different lexical nature of the two constituents (usually animate in the case of indirect objects and prototypically inanimate in the case of direct objects) as one of the features distinguishing direct from indirect objects, a characteristic which has obviously not been taken into account by the ICE parsers to differentiate between these two types of constituents.

Parataxis

The label PARATAXIS is used with direct speech or reported speech and thoughts. It is assumed to have the clause level ‘main’ and is associated with the categories CL, DISP and NONCL, for example (Nelson et al.
2002:51,67):

(23) And he said oh yes I agree with you <ICE-GB: S1A: 005#025>
(24) So I said yes here <ICE-GB: 008#274>

The main problem we find with this category is the non-inclusion of the paratactic element within the main clause. By giving this sort of constituent the functional category ‘parataxis’, the parsers automatically treat it as an element apart, not integrated in the clause structure, that is, without any function in the clause. Furthermore, the non-integration of paratactic elements within the clause makes them part of a higher unit, that is, a superordinate element. In this sense, they would be in line with other types of constituents, such as disjuncts or conjuncts, elements which are clearly outside the scope of the clause.

We contend that the element labelled in the corpus as parataxis is part of the clause structure, usually having the function of direct object. This has two main implications:

1) The (di)monotransitive nature of the verb complemented by a paratactic constituent, in keeping with the principle of syntactic transitivity (i.e. one object → (di)monotransitivity).

2) The controversy about the functional nature of paratactic constituents: what sort of elements are they? Are they disjuncts, conjuncts or perhaps something different? Do they (disjuncts or adjuncts) share with paratactic elements any lexico-grammatical feature?
An analysis of the next examples makes us think that the function of the paratactic element is that of direct object of the clause:

(25) What I am sensing is my own dread, she told herself <ICE-GB: W2F-020#043>
(26) And lots of people ask me well why do you go on <ICE-GB: S1B-026#217>
(27) Rajiv perhaps best captured the imagination of the members of Congress when he told them: “India is an old country but a young nation: and like the young everywhere we are impatient” <ICE-GB: W2B-011#051>

Semantically, the highlighted constituents refer to the content of the verbs tell and ask respectively, and syntactically they could be transformed into indirect speech and replaced by a that/wh- clause: She told herself what she was sensing was her own dread, And lots of people ask me why I go on, and Rajiv told them India is an old country but a young nation. However, we found that the following sentence has been considered ditransitive, with the direct speech constituent acting as direct object:

(28) He had refused to increase child benefits or pay large old age pensions and told unemployed “if it isn’t hurting it’s not working” <ICE-GB: W2C-018#064> [bold face present in the original example]

The fact that an element in the clause is realized as direct speech is not, in our view, a clear enough reason to think of it as a different sort of constituent, still less as one that is not part of the predicator.

To entangle the whole taxonomy even more, the ICE does not admit the possibility of formal realization of the category PARATACTIC by a noun phrase, which prevents the next examples from being considered as such:

(29) Nonsense, she told herself <ICE-GB: W2F-020#024>
(30) “An act of bravado”, she’d told Mr Rainbow <ICE-GB: W2F-020#088>

The syntactic units in direct speech are labelled as ELE (element), a category which is defined by the taggers of the ICE as “an isolated element.
All phrases occurring within a NONCL have the function ELE.” Constituents labelled ELE have no role in the clause either: they too are elements outside the clause structure.

Transitive complement

In relation to the category ‘transitive complements’, Nelson et al. (2002:40ff) explain:

The transitivity of a verb is unclear in many instances where the main verb is transitive and is followed by a noun phrase that may be the subject of the nonfinite clause or the object of the host clause. In all such cases, we avoid deciding the type of transitivity by tagging the main verb “V (trans,...)” [...]

The problem therefore seems to lie with nonfinite clauses with an intervening nominal, which were simply considered transitive. Transitive complements occur with trans (transitive verbs) and are associated with the categories CL and DISP. However, the label trans is not applied

(a) if the verb is be: ‘one of my aims is to finish my PhD’
(b) if the nonfinite clause does not have an overt Subject: ‘I enjoy doing it’

In this respect, and taking into account the double nature of some of the constituents in these constructions (having one function within the subordinate clause and a different one with respect to the main clause), the most outstanding difficulty seems to be deciding whether what we have is just one constituent (i.e. the verb is monotransitive, as in the case of want) or two (i.e. it is ditransitive, as in the case of tell). Greenbaum clearly justifies the lack of a firm decision by the parsers of the ICE-GB:

We do not want to pre-empt the findings of those investigating this conspicuous example of syntactic gradience. We therefore avoid deciding the type of transitivity by tagging the verb simply as transitive. We leave it to researchers to weigh criteria and decide what distinctions to make.
(Greenbaum 1993:15)

Greenbaum’s quotation is rather striking, since one could argue that the parsers do pre-empt the findings for the rest of the categories of the ICE, some of which are still at the centre of heated linguistic debate, such as the possible formal realization of the indirect object by a prepositional phrase.

From our point of view, this drawback could easily have been solved by following Quirk et al.’s tests (1985:16.25-67) to determine the degree of mono-, di- or complex transitivity: possibility of replacement by a pronoun, answer to a wh- question, focus of a pseudo-cleft sentence, passivization of the subordinate clause, and retaining or dropping of the preposition to. This would have avoided proposing a new and confusing category, transitive complementation, which is systematically opposed to the already existing classes of (di)monotransitivity and ditransitivity.

The application of these tests to a verb such as tell in an example labelled as ‘transitive complement’ confirms its ditransitive nature:

(31) She told her Ministers at a Downing Street reception last night to work harder and argued that the most important thing for the Conservatives was to get the economy right <ICE-GB: 006#037>

a) Replacement by a pronoun: She told her ministers at a Downing Street reception last night something.
b) Answer to a wh- question: What did she tell her Ministers?
c) Focus of a pseudo-cleft sentence: What she told them was to work harder.
d) Passivization of the subordinate clause: They were told to work harder.
e) Retaining or dropping of the preposition to: She told them.
Other examples of transitive complements are:

(32) It’s told its eighteen thousand employees not to report for work <ICE-GB: S2B-015#074>
(33) He was told to use the normal exit and that caused “resentment and friction” <ICE-GB: 011#016>
(34) Her husband told her not to attend as a result the trial was impeded <ICE-GB: W2B-020#069>

According to the ICE manual, “In passive constructions, the tagging of the main verb is the same as it would be if the verb were active” (Nelson et al. 2002:40). This means that the examples below would have to be considered monotransitive in the active voice as well, since they have been parsed as monotransitive (neither transitive nor ditransitive):

(35) We’ve all been told to do it <ICE-GB:S1A-093#032>
(36) Fielding shows us giving the poor man a: “severe rebuke” concluding that “Every parish ought to keep their own poor” <ICE-GB: W1A-010#065>

Nevertheless, the following pair of sentences have been parsed as ditransitive in spite of having a similar structure to the previous pair, which are parsed as monotransitive:

(37) The Interior Ministry has told people to carry on with their work and that attempts to destabilise the country will be severely punished <ICE-GB: S2B-008#096>
(38) Probably there was a bullet in him somewhere , but when she tried to protest he told her brusquely to pad and cover where the blood was seeping through and leave everything else alone <ICE-GB: W2F-015#074>

CONCLUSIONS

In the light of all that we have said, some questions inevitably stand out: Do linguists need a tagged/parsed corpus to carry out syntactic analysis? What are the advantages of working with a parsed corpus? Does it really make the researcher’s work easier?
In our study of indirect objects in English, we have not found many advantages in working with a parsed corpus such as the ICE-GB, mainly because of some functional categories we find doubtful and the inconsistency revealed in the treatment of similar examples. Most of the problems encountered in the analysis of the corpus are rooted in the mistaken labelling of constituents, which prevents the linguist from retrieving the right information. In this sense, Gilquin (2002:207) says:

[T]he fully automatic retrieval of syntactic structures with no manual intervention is still something of an impossible dream for lack of suitable and/or reliable tools and corpora [...] the lack of the ideal parsed corpus (i.e. accurate, detailed and big enough) forces one to turn to a tagged corpus and use a method requiring more manual post-editing and yielding slightly less satisfactory results.

It is worth noting that the three categories we have called into question, namely dimonotransitivity, parataxis and transitive complementation, are not included either in Quirk et al.’s A comprehensive grammar of the English language (1985) or in A student’s grammar of the English language,³ in spite of the fact that Greenbaum (1993:13) recognises that these two reference grammars provided the basis for the categories in the ICE-GB.

However, does our experience imply that parsed corpora are not useful for this sort of study? Not necessarily. It just means that the ICE-GB is not appropriate for our needs, mainly on account of the inaccuracy of its parsing. All in all, it would have made no difference whether we used a parsed corpus such as the ICE-GB or a tagged-only corpus. Furthermore, in terms of time consumption and ease of use, working with the ICE-GB represented a much greater effort than working with a non-parsed corpus.

3. Not even in Greenbaum’s Oxford grammar of the English language (1996).
At least, we hope to have proved that Sinclair was totally wrong when he stated:

Each tagger will put into practice a policy for these categories that is more likely to be the result of expediency than the elaboration of a theory, and these decisions will affect a decade or more of research, without the users even being aware of them. Most researchers are content that someone has tagged the corpus, and they are not inquisitive as to how this was done, or what the shortcomings are. (Sinclair 2004:53, emphasis added)

REFERENCES

Aarts, J. 2002, “Does corpus linguistics exist? Some old and new issues”, in Leiv Egil Breivik and A. Hasselgren (eds.). From the COLT’s mouth...and others’. Language Corpora Studies. In honour of Anna-Brita Stenström. Amsterdam and New York: Rodopi.

Francis, G. 1993, “A corpus-driven approach to grammar. Principles, methods, and examples”, in G. Sampson and D. McCarthy (eds.). Corpus linguistics. Readings in a widening discipline. London and New York: Continuum.

Greenbaum, S. 1991, “ICE: The International Corpus of English”. English Today, 28 (7): 3-7.

Greenbaum, S. 1993, “The tagset for the International Corpus of English”, in C. Souter and E. Atwell (eds.). Corpus-based computational linguistics. Amsterdam: Rodopi, 11-24.

Greenbaum, S. 1996, The Oxford grammar of the English language. Oxford: Oxford University Press.

Greenbaum, S. and R. Quirk 1990, A student’s grammar of the English language. London: Longman.

Gilquin, G. 2002, “Automatic retrieval of syntactic structures. The quest for the Holy Grail”. International Journal of Corpus Linguistics, 7 (2): 183-214.

Mukherjee, J. 2005, English ditransitive verbs. Aspects of theory, description and a usage-based model. Amsterdam: Rodopi.

Nelson, G., S. Wallis and B. Aarts 2002, Exploring natural language. Working with the British component of the International Corpus of English. Amsterdam/Philadelphia: John Benjamins Publishing Company.

Quinn, A. and N. Porter 1994, “Investigating English usage with ICECUP”. English Today, 10 (3): 21-24.

Quirk, R., S. Greenbaum, G. Leech and I. Svartvik 1985, A comprehensive grammar of the English language. London: Longman.

Sinclair, J. 2004, “Intuition and annotation: the discussion continues”, in K. Aijmer and B. Altenberg (eds.). Advances in corpus linguistics. Papers from the 23rd International Conference on English Language Research on Computerized Corpora (ICAME 23). Amsterdam/New York: Rodopi, 39-59.

Tognini-Bonelli, E. 2001, Corpus linguistics at work. Amsterdam/Philadelphia: John Benjamins Publishing Company.
