Applied Linguistics & Foreign Language Teaching
Dr. Mei-hui Liu
Fanny Chang G99120009
Pre-Reading Questions for Session 9
Reading for the sixth class session (Nov 9, 2011): Schmitt (2010), Chapter 6: Corpus linguistics

* Questions:

1. The authors state that “a corpus refers to a large principled collection of natural texts” (p. 89). From your reading of pp. 91-92, what are some of those “principles” that should guide corpus construction?
Ans: Corpus creators collect their data from natural texts (e.g., texts from real occurrences such as daily conversations, newspapers, and speeches), so they must not draw on artificial sources (e.g., simulated conversations). Otherwise, the data will be of little value because they cannot reveal how language is really used. Therefore, corpus creators need to bear in mind such principles as deciding the purpose of the corpus (general or specialized), collecting samples from real contexts, and choosing the right tool to transfer raw texts onto computers.

2. What are some differences between a general corpus and a specialized corpus? (see pp. 91-92)
Ans: The major differences between a general corpus and a specialized corpus lie in the nature of their data. The goal of a general corpus is to include as many linguistic features as possible in order to meet the needs of researchers and language learners. For example, a general corpus may include about 100 million words, from which frequency lists, concordances, etc. can be produced for users. Specialized corpora support the same kinds of output (e.g., frequency lists), but they focus on more specific areas (e.g., child language). Therefore, general corpora usually comprise a larger amount of data than specialized corpora, whereas specialized corpora aim at particular areas.
As mentioned in the book, such corpora might aim to explore child language, teenage language, newspaper language, etc.

3. Why are corpora of written language much more common than those of spoken language? (see p. 94)
Ans: The most salient reason why written corpora are more common than spoken corpora is the way the texts are converted into electronic form. For written corpora, creators just need scanners and software to turn paper documents into electronic files. Creators of spoken corpora do not have such convenient equipment; they have to do the more tiring work of transcribing the recorded speech on a computer before it becomes an electronic file. In other words, spoken corpora require one more step of text conversion than written corpora. However, written corpora are not completely free of troublesome processes: scanned documents still need error correction and proofreading. Therefore, both spoken corpora and written corpora have their own difficulties to deal with. From this perspective, spoken corpora and written corpora actually go through similar steps.

4. According to pp. 94-95, what kinds of things can be encoded via markup/annotation/tagging?
Ans: Markup, annotation, and tagging essentially serve to enrich the information and value of a corpus. Moreover, they help users understand the data more fully. Without these techniques, a corpus can only be used to look up instances. Markup and annotation code different kinds of features: markup codes macro-level characteristics, while annotation codes micro-level features. With markup, structural features of written corpora such as titles, authors, places, and subheadings (i.e., background information about the data) are the ones likely to be encoded.
So, the use of markup gives a corpus additional information that helps users better understand the sources they are looking at. Annotation and tagging provide more specific information. In fact, tagging is one technique under annotation. The example of tagging discussed in the book is part-of-speech tagging, which is one kind of annotation. In this technique, each lexical item is labeled with its grammatical category. Therefore, those who need this information can carry out clearer investigations.

5. What is the benefit of adding such markup/annotation/tagging to the raw texts in a corpus?
Ans: The three techniques basically share the same objective: they all aim to provide additional information about the raw texts to help users understand them. However, they enrich users’ knowledge from different angles (e.g., markup tends to supply background information about a raw text, whereas annotation and tagging help users understand the linguistic features of lexical items). With the help of markup, corpus users can gain a more thorough understanding of the raw texts (e.g., knowing the speakers’ gender, age, mother tongue, and occupation in a spoken corpus). Annotation and tagging then give users information about individual lexical items. Therefore, whenever users have doubts while browsing, they can look at the unfamiliar items again to learn more about them.

6. On p. 98 & 100, the authors mention 2-3 things that corpus analysis can tell us. What are they?
Ans: The authors mention a few things such as frequency-of-occurrence information, word lists, concordancing packages, and concordance programs. Frequency-of-occurrence information tells users how often a word appears. For example, it shows users how frequently a word is used by displaying a count next to the word.
So, this information can also be used to compare the frequencies of two words. Word lists provide a helpful guideline for teachers who are deciding which words to teach. If a teacher has a hard time choosing target words for teaching airport English, the teacher can consult corpora to find the words that are frequently used in that area. Concordancing packages and concordance programs sound similar, but they have different functions. The former mainly present how a word is used in different contexts, so users can see a list of occurrences of the chosen word. The latter provide information about which words usually occur together. This information is very helpful because language learners can look up unfamiliar words and see what usually accompanies the searched lexical items.

7. The authors reviewed a number of corpus research studies on pp. 100-101. What are the other specific questions or topics you think would be interesting to investigate through corpus research?
Ans: I think investigating the uses and different types of written language in student assignments would be interesting and useful for college students. When I was a university freshman, I found that my peers and I often used spoken words in our writing (e.g., compositions and short essays). Because we were novice writers, we could not distinguish written language from spoken language. Though some words are applicable in both written and spoken contexts, there are still words that are more appropriate in only one of them. If a student uses a large number of spoken words in writing, his/her writing will look like a transcription instead of a composition. Therefore, this topic can help novice writers better understand the various types of written words.

8.
According to the authors, what are two general ways that corpora can be applied to language teaching? (see p. 102)
Ans: Corpora can be applied to classroom language teaching in two ways. The first aims to facilitate language teaching from the teacher’s perspective. For example, teachers can take corpora as a reference to adjust and check their teaching materials. If a teacher wants to teach spoken academic language to his/her students for formal speeches, he/she can use corpora to investigate the features and words that are frequently used in formal speech. Corpora actually save time for teachers because their data resources are vast. Teachers can prepare a rich lesson and may have more time to design classroom activities. The second way focuses more on learners’ interactions with corpora. However, sufficient computer equipment is needed in order to involve learners in the corpus environment. If this precondition is not met, teachers can instead print out corpus information for their students. For example, a printout of all the uses of the word ‘perceive’, with the various contexts and patterns associated with it, will help solve the problem of insufficient equipment.

9. Which of the example activities (see pp. 102-103) seem most interesting to you? Why?
Ans: There are many activities mentioned in the chapter, such as frequency lists and collocational activities, but I think concordancing is the most interesting one because of its complexity. A concordance lists the occurrences of a word in its contexts, which makes it possible to compare a series of near-synonyms. I think concordancing is complex because a word usually has several meanings by nature and, conversely, several words can share the same meaning to a certain degree (e.g., ‘speak’, ‘say’, ‘talk’, and ‘tell’ have similar meanings when translated into Chinese). Though some words may have similar meanings, their uses are mostly different.
For example, the four words ‘speak’, ‘say’, ‘talk’, and ‘tell’ are used in different contexts and for different purposes. Therefore, it is interesting to use corpus resources to investigate the different usages of words that have similar meanings. In this way, as a language learner, I can learn more accurately how a word should be used (e.g., if I want to say that I am able to use a language, I will use ‘speak’: I can speak Japanese).

10. Please note down two punch lines of this chapter.
(1). P. 101: Corpus-based studies of particular language features and comprehensive works such as The Longman Grammar of Spoken and Written English (Biber et al., 1999) will also serve language teachers well by providing a basis for deciding which language features and structures are important and also how various features and structures are used.
(2). P. It is worth noting here that the use of concordancing tasks in the classroom is a matter of some controversy - strongly advocated by those who favor an inductive or data-driven approach to learning (Johns, 1994), but criticized by others who argue that it is difficult to guide students appropriately and efficiently in the analysis of vast numbers of linguistic examples (Cook, 1998).

11. Please write down any questions or comments, if any, after doing the reading assignment.
If we (as teachers) want to incorporate language corpora into our teaching, we can introduce corpora by presenting frequency lists, patterns of use, etc. But is it really necessary to introduce frequency information together with newly introduced words? If yes, what benefits will language learners gain from knowing the frequency information?

Note: You don’t need to put answers here, but make sure that you: (1) understand what frequency list, concordance listing, and KWIC refer to; (2) try the hands-on activity and see the suggested answers in the back of the book.
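
To make the terms in the note above more concrete, here is a minimal Python sketch of a frequency list and a KWIC (key word in context) concordance listing. It is only an illustration under my own assumptions: the sample sentences are invented rather than taken from a real corpus or from the chapter, and the names `tokenize`, `frequency_list`, and `kwic` are hypothetical helpers, not functions of any concordancing package.

```python
import re
from collections import Counter

# Invented sample text (an assumption for illustration, not data from the chapter).
SAMPLE_TEXT = (
    "I can speak Japanese. She can speak to the manager. "
    "They say the flight is late. Please tell me the gate number. "
    "We talk about the flight every day."
)


def tokenize(text):
    """Lowercase the text and split it into simple word tokens."""
    return re.findall(r"[a-z']+", text.lower())


def frequency_list(tokens):
    """Return (word, count) pairs, most frequent first."""
    return Counter(tokens).most_common()


def kwic(tokens, keyword, window=3):
    """Return one display line per occurrence of `keyword`.

    Each line shows up to `window` tokens of context on either side.
    Context is taken from the flat token list, so it may cross
    sentence boundaries in this simplified version.
    """
    lines = []
    for i, token in enumerate(tokens):
        if token == keyword:
            left = " ".join(tokens[max(0, i - window):i])
            right = " ".join(tokens[i + 1:i + 1 + window])
            lines.append(f"{left:>25}  [{token}]  {right}")
    return lines


tokens = tokenize(SAMPLE_TEXT)

# Frequency list: the most frequent words with their counts.
print(frequency_list(tokens)[:5])

# KWIC concordance: every occurrence of 'speak' lined up in one column.
for line in kwic(tokens, "speak"):
    print(line)
```

Running `kwic` with ‘say’, ‘talk’, or ‘tell’ instead of ‘speak’ gives the kind of side-by-side comparison of similar words described in my answer to Question 9, although a real corpus would of course be needed for meaningful results.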