A keyword concordance and video retrieving system for the analysis of an authentic film corpus

Ming-tyi Wu
Dept. of Applied English, Southern Taiwan University of Technology

Kun-te Wang
Dept. of Engineering Science, National Cheng Kung University

Abstract

This paper introduces a multimedia database of films on DVD and a concordance and video retrieving system for analyzing it. The design of the system and its functions are illustrated, and a concordance of the sentence pattern “this is a…” is conducted to examine what is left out of an English syllabus based on simplified and de-contextualized sentence patterns. It is suggested that contextual information, background knowledge, cultural values, communicative intent and socio-cultural purpose are important factors in the expression of meaning, and that teachers should use stories on DVDs as vehicles to bring all these factors into a meaningful whole for effective learning. The system can also be used as a tool to further expand the contexts and contents of the learning materials.

Keywords: multimedia database, DVD, concordance, video retrieving

Introduction

Ever since Sinclair (1987, 1991) developed the COBUILD project for the collection and analysis of a very large corpus, dictionaries, grammars, and ELT materials have been produced from corpus studies. Corpus linguistics became a major descriptive area in both applied linguistics and linguistics in the 1990s (Kaplan & Grabe, 2002). Although a corpus is composed of real written and spoken texts, these texts have been transplanted from their original medium into another, and hence their distinguishing features no longer exist (Mishan, 2004). Widdowson (2000) also argued that the electronic form of the texts in a corpus de-contextualizes them, so that the core aspect of their authenticity is lost. Conrad (2002) asserted that computer tools that allow discourse-level studies of corpora were needed.
Thanks to the development of DVD technologies, authentic materials on DVDs no longer need to be transplanted; in fact, they contain video images, sound streams and texts that make them versatile for both discourse studies and language learning. In a paper examining the potential of DVD for language learning, Tschirner (2001) suggested seven principles culled from several areas of SLA research: (a) learning is situated; (b) input oriented learning of the FL is emphasized; (c) output oriented learning of the FL is encouraged; (d) the cultural component of FL competence is promoted; (e) a focus on form is fostered; (f) storage in memory of meaningful and situated sequences of sounds and words is promoted; and (g) affective needs of learners are taken care of (pp. 308-309).

In the case of EFL, there are hundreds of DVD titles available on the market. These titles are authentic in the sense that the target audience is native English speakers and their scripts are written by expert users of English. Checked against Tschirner’s (ibid.) principles stated above, we find the following. The various stories in these titles provide rich situations in which utterances occur. The video images and the sound streams can be comprehensible input for learners in that they are embedded in meaningful contexts. The fact that they can be played repeatedly encourages comprehension. They also comprise rich cultural components of American society. With the subtitles available on them, the linguistic elements in the utterances can be scrutinized. The interactive use of the DVD materials can enhance the storage of the contents in memory. Furthermore, the affective and emotional values highlighted in each of the titles may touch the audience.

A corpus of more than 500 DVD titles was established through a project sponsored by the Ministry of Education from 2001 to 2005 (Wu, 2006). It has been continuously updated since.
To date, it includes a total of more than 680 DVD titles. Such a corpus gives learners many opportunities to authenticate discourses through observation (Gavioli & Aston, 2001). This paper first introduces the corpus and then presents a text analysis and video retrieving system. Some example uses of both the corpus and the system are also discussed.

The DVD corpus

The word types and tokens for the titles in each category, obtained with the text analysis software Concordance (Version 3.3) developed by J.R.C Watt Company, are shown in Table 1.

Table 1. The general features of the film corpus

| Categories | No. of Titles | Word Types | Tokens | Tokens/No. of Titles |
|---|---|---|---|---|
| Cartoons | 76 | 16961 | 347815 | 4577 |
| Classics | 61 | 21862 | 600474 | 9844 |
| Films | 405 | 69207 | 3247475 | 8018 |
| TV Series | 148 | 37370 | 1150581 | 7774 |
| Total | 681 | 94654 | 5298378 | 7780 |

According to Table 1, cartoons contain fewer word types and fewer tokens than the other categories. This is because cartoons are produced mainly for children and hence the words used are somewhat simplified. The average number of tokens per cartoon title is 4577, which is also far lower than in the other categories, indicating that the stories in cartoons are shorter and less complicated in content. Classical movies, on the other hand, are much longer and comprise much more serious and complicated content, and hence their average of 9844 tokens per title is more than twice that of cartoons. The averages for feature films (8018) and TV series (7774) are lower than that of the classical movies but still far higher than that of the cartoons.

Although DVD titles can be rich sources for FL classrooms, with such large volumes of video texts it is rather difficult, if possible at all, to go through each of the titles and select appropriate sample utterances for the learners.
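The type and token figures in Table 1 were produced with the Concordance software, and the paper does not describe how that program tokenizes text. The following is therefore only a minimal Python sketch of how such counts could be obtained, assuming a simple lower-cased word tokenizer; the names `tokenize` and `corpus_stats` and the two sample scripts are hypothetical and not part of the system.

```python
import re
from collections import Counter


def tokenize(text):
    """Lower-case the text and return its word tokens.

    Assumption: a token is a run of letters or digits, optionally
    followed by an apostrophe suffix (so "don't" stays one token).
    """
    return re.findall(r"[a-z0-9]+(?:'[a-z]+)?", text.lower())


def corpus_stats(scripts):
    """Return (titles, word types, tokens, tokens per title) for a list of scripts."""
    counts = Counter()
    for script in scripts:
        counts.update(tokenize(script))
    titles = len(scripts)
    tokens = sum(counts.values())   # every running word
    types = len(counts)             # distinct word forms
    return titles, types, tokens, round(tokens / titles)


# Two tiny made-up "scripts" standing in for subtitle files.
sample_scripts = [
    "This is a book. This is a pen.",
    "I don't know that.",
]
print(corpus_stats(sample_scripts))  # (2, 9, 12, 6)
```

The last column of Table 1 is the same ratio computed per category; for cartoons, for example, 347815 tokens over 76 titles rounds to 4577.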
A device which can search for key words or phrases in the texts and locate the particular addresses where each occurrence appears in the videos is required, so that teachers can use it to select examples that meet their needs and students can use it for exploratory learning. In the following section, such a device developed by the authors is introduced.

The text analysis and video retrieving system

The film materials are authorized for non-profit educational purposes only, so they can be used only by registered users on campus. Registered users need to log in every time they want to use the system. Figure 1 shows the log-in screen.

Figure 1. The registration and log-in screen.

After logging in, the administrator can upload film lists, films in WMV format, and the scripts or subtitles of each film in both DKS and SMI formats. There are four categories at the moment: Movies, Cartoons, TV Series, and Documentaries. The system generates an overall film list from all the film lists, and this overall list is the default for concordance and video retrieving unless another list is specified. Figure 2 shows the interface the administrator uses to set up the contents of the system.

Figure 2. The administrator’s interface.

Teachers who want to use the system for research or for editing their own teaching materials, and learners who use it for exploratory learning, work with the user’s interface below (Figure 3).

Figure 3. User’s interface of the system.

The interface comprises four parts:
(a) A word list whose display the user can sort alphabetically or by frequency, in ascending or descending order.
(b) Concordance and video retrieving settings, including the keyword(s) and collocation(s) to be searched and the corpus selection.
(c) A search result window which allows users to select any item identified for presentation or editing.
(d) Links to net dictionary entries for the particular keyword, the system homepage and courses edited with the system.

At the top of the search result window, the concordance button leads to another window which shows the complete results of the particular search. Part of such a list is shown in Figure 4. In the list, the collocated words are listed at the top, and the display size of each varies according to the number of examples found. The MI (Mutual Information) values show how closely each of these collocated words is related to the searched word. A button is placed in front of each sample sentence identified; clicking it retrieves the location in the video where the sentence occurs and plays it along with the script, with the sentence in question highlighted.

Figure 4. Part of a concordance list.

Figure 5 below displays how the retrieved video is presented together with the subtitles in SMI format under the screen and those in DKS format to the right. The sample sentence identified is highlighted in red. This video and subtitle display window can be activated by (a) clicking a script file name on the search result list, (b) clicking an icon in the Edit column of the list (see Figure 3), or (c) clicking the icon in front of a sample sentence in the concordance list described above.

Figure 5. The video and subtitle display window.

When the window is opened through step (b) or (c) above, a button which leads to the course Editor’s window appears at the bottom of the video and subtitle display window (see the Show button in Figure 5 above). The Editor’s window allows users to select the discourse which contains the sample sentence and save it to a course. The right-hand window in Figure 6 below shows the five discourses selected under the topic of Pathetic.
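To make the concordance list and its MI values described above more concrete, the sketch below shows one way a keyword search with a right-hand collocation, and an MI score for a collocate, could be computed over subtitle lines. The paper does not state which formula or co-occurrence window the system uses, so this assumes the common pointwise definition MI = log2(f(k,c) × N / (f(k) × f(c))) with co-occurrence counted within a few tokens to the right of the keyword. The function names, the `span` parameter and the sample lines are all hypothetical.

```python
import math
import re
from collections import Counter


def tokenize(text):
    """Lower-case word tokens; an assumed tokenizer, not the system's own."""
    return re.findall(r"[a-z0-9]+(?:'[a-z]+)?", text.lower())


def concordance(lines, keyword, right_collocation=""):
    """Return (line index, line) for lines where the keyword is
    immediately followed by the words of the right-hand collocation."""
    pattern = [keyword.lower()] + tokenize(right_collocation)
    hits = []
    for index, line in enumerate(lines):
        tokens = tokenize(line)
        for start in range(len(tokens) - len(pattern) + 1):
            if tokens[start:start + len(pattern)] == pattern:
                hits.append((index, line))
                break  # one hit per line is enough for the list
    return hits


def mutual_information(lines, keyword, collocate, span=4):
    """Pointwise MI of keyword and collocate (assumed formula, see lead-in).

    Co-occurrence means the collocate appears within `span` tokens
    to the right of the keyword on the same line.
    """
    word_freq = Counter()
    pair_freq = 0
    for line in lines:
        tokens = tokenize(line)
        word_freq.update(tokens)
        for i, token in enumerate(tokens):
            if token == keyword:
                pair_freq += tokens[i + 1:i + 1 + span].count(collocate)
    if pair_freq == 0:
        return float("-inf")  # never seen together
    total = sum(word_freq.values())
    return math.log2(pair_freq * total / (word_freq[keyword] * word_freq[collocate]))


# Made-up subtitle lines for illustration.
sample_lines = [
    "This is a nightmare.",
    "I don't think this is a case of undue influence.",
    "Yeah, this is my street.",
    "Is this a joke?",
]
print([index for index, _ in concordance(sample_lines, "this", "is a")])  # [0, 1]
print(mutual_information(sample_lines, "this", "nightmare"))
```

In a system like the one described, each hit's line index would additionally be mapped to the subtitle time code so that the matching video segment can be played; that step is omitted here.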
By clicking the Play in English button at the bottom of the opened discourse window, the video of the particular discourse can be retrieved and presented in the video and SMI subtitle window (see the left-hand window in Figure 6). In this way, a series of topics can be edited, saved, and presented in a course.

Figure 6. The display window of a topic with the selected discourses.

At the current stage, one problem with the database is that some of the time codes that go with the subtitles need to be adjusted so that the video and the accompanying subtitles are more accurately synchronized. A problem with the system is that if too many users are retrieving at the same time, or too many keywords are requested at once, an error message may appear because the system is too busy to handle the requested tasks.

Concordance

Using the system described above, a concordance of the sentence pattern “This is a…” was conducted and a concordance list like the one in Figure 4 was obtained. The search used the keyword “this” with the collocation “is a” to the right of the keyword. The pattern was selected because the first English sentence the author ever learned in junior high school was “This is a book.”, and yet the author has never heard it in real life. In fact, not a single instance of that sentence was found in the movie corpus, which comprises 466 film titles with a total of 3.8 million words. Communicative intent and sociocultural purpose are lost in such a contrived sentence. In the concordance list, 335 examples of the sentence pattern were identified. These examples were further divided into 16 patterns (Table 2):

Table 2. Patterns in the concordance list of “This is a…”

| Patterns | Entries | Examples |
|---|---|---|
| This is a +N | 91 | [93] [ALEXANDER] This is a nightmare. |
| This is a +N+Prep Phrase | 29 | [75] I don’t think this is a case of undue influence. |
| This is a +N+Adj Phrase | 5 | [197] Well this is a thing unheard-of. |
| This is a +N+Adj Clause | 6 | [116] Please don’t say that. This is a mistake you’re making, Diana. |
| This is a +Adj N+N | 28 | [60] [BERGSTEIN] This is a wild-goose chase. |
| This is a +Adj N+N+Prep Phrase | 1 | [168] You think this is a treasure map for Cibola, don’t you? |
| This is a +Adj+N | 111 | [8] I’m sorry, sir. This is a private room. |
| This is a +Adj+N+Prep Phrase | 28 | [67] Fellas, this is a critical moment in his life. |
| This is a +Adj+N+Adj Phrase | 1 | [10] I have to warn you. This is a dangerous place full of vultures. |
| This is a +Adj+N+Adj Clause | 9 | [218] This is a nice street you live on. Yeah, this is my street. |
| This is a +Adj+N+Inf Phrase | 1 | [86] Carlos, this is a stupid fucking problem to have. |
| This is a +Adv+Adj+Prep Phrase | 2 | [220] I’m not gonna get this. This is a little too complicated for me. |
| This is a +Adv+Adj | 4 | [209] I think this is a little different. You’ll run circles round them. |
| This is a +Adv+Adj+N | 12 | [63] Oh god. Jesus Christ, Robert, this is a really bad idea. |
| This is a +Adv+Adj+N+Prep Phrase | 6 | [269] And this is a really good example of that. |
| This is a +Adv+Adj+N+Inf Phrase | 1 | [133] For me this is a very hard thing to say. |
| 16 patterns | 335 | |

Among the 335 examples, 91 (27%) have the same pattern as “This is a book”, i.e. “This is a +N”. The other examples (73%) all have one or more modifiers added to the noun. Using the edit function of the concordance and retrieving system discussed above, the discourses in which each of these examples is embedded can be retrieved and edited for further analysis or for language teaching and learning via exemplification (see Figure 6 above). The system is thus a computer tool of the kind Conrad (2002) called for, one that allows discourse-level studies and applications. Some of the retrieved discourses that contain the sentence pattern “this is a +N” are listed in Table 3.

Table 3.
Sample discourse with the embedded sentence of “this is a +N.”

Sample Discourse:
Wane: So, what does your wife think about this plan?
Arnold: My wife?
Wane: Yeah. Those are her cigarettes.
Arnold: You can keep things from your wife.
Wane: I don’t know.
Arnold: What? You’ve never deceived your wife?
Wane: Well, there are levels of deception, Arnold. I mean, this is a whopper.
Arnold: Oh.

Context, Background Knowledge & Cultural Values:
Arnold lost his job 8 months ago at a company where he had worked for 17 years.
Wane owns a company and cherishes his wife’s love for him.
Arnold kidnapped Wane for money.
Arnold’s wife smokes a lot.
To Wane, man and wife share everything except maybe some white lies; to Arnold, a man can keep things from his wife.
Kidnapping somebody for money is a felony, not something trivial.

Communicative Intent & Sociocultural Purpose:
Wane tries to bring Arnold’s wife in to loosen up the kidnapper.
To talk with the kidnapper in a way that brings back his humanity.

The example shows that even though no modifiers are added to the noun, meanings go far beyond the words. Contextual information plays a crucial role in the comprehension of the words. The richer the context, the more students will understand, and hence the more effective the learning. If the material is simplified to a mere sentence pattern, it becomes meaningless and will not make any sense to the learner. For these discourses to occur, a wide range of background knowledge and cultural values is involved, aside from the linguistic capabilities of the interlocutors. Communicative intent and sociocultural purpose are the initial reasons for a conversation to occur. For those patterns with one or more modifiers added to the nouns, the discourses in which they are embedded need to be examined as well. Table 4 presents some of the retrieved sample discourses:

Table 4. Sample discourses with modifiers added to the noun.
Pattern: +Adj N+N

Sample Discourse:
President Zia: So what I have been wondering is why your State Department would send someone here who thinks he understands the problem because I don’t think the prayers of the Texas Second Congressional District are going to turn the trick.
Charlie: Well, now I wasn’t sent here by the State Department, Mr. President. I was asked to come here by our friend in Houston. So this is a courtesy call.
President Zia: I don’t need courtesy. I need airplanes, guns and money.

Context, Background Knowledge & Cultural Values:
Zia is the President of Pakistan.
Charlie is a Congressman representing the Texas Second Congressional District.
Many Afghan refugees fled to Pakistan.
Whether to help the Afghan people is both a political and a humanitarian issue.

Communicative Intent & Sociocultural Purpose:
What Pakistan needs is someone who really understands the problem.
Charlie’s visit is not an official one.
Pakistan needs substantial help from the US.

Pattern: +Adj+N

Sample Discourse:
Joanne: Charlie. So sorry for keeping you waiting.
Charlie: Oh, it’s no problem, Joanne. This is Bonnie Bach.
Joanne: So nice to meet you.
Bonnie: It’s a pleasure meeting you, Mrs. Herring. This is a wonderful party.
Charlie: Why don’t you give us a few moments?
Bonnie: Yes, sir.
Joanne: Oh, Bobbie, if you could ask someone for a Bombay martini up, very dry?
Bonnie: Oh, I’m not a slave girl, actually. I’m the Congressman’s administrative assistant.
Joanne: Isn’t that wonderful for you?
Bonnie: Yes.
Joanne: Two olives, please. Tell them it’s for me, they’ll know.
Bonnie: Certainly.
Joanne: She doesn’t like me.
Charlie: Everybody likes you.
Joanne: She’s a liberal.
Charlie: Well, I’m a liberal.
Joanne: Not where it counts.

Context, Background Knowledge & Cultural Values:
Joanne Herring is a major donor to Charlie’s campaign.
She invited Charlie to a party in her house.
Bonnie is Charlie’s assistant.
Bombay is India’s most populous city and a martini from there is drier.
Joanne thinks being a congressman’s assistant is a good thing for Bonnie.
A liberal is someone who favors a political philosophy of progress and reform and the protection of civil liberties.

Communicative Intent & Sociocultural Purpose:
Joanne’s words “So nice to meet you” should be formulaic speech addressed to Bonnie, but what she really means is that she is glad to see Charlie.
Bonnie says “It’s a pleasure meeting you, Mrs. Herring. This is a wonderful party.” in order to stop Joanne from kissing Charlie. What she says is meant to show Joanne some attitude; it has nothing to do with the party.
Joanne has no right to ask Bonnie to do things for her.
Joanne notices Bonnie’s dislike for her, but she deliberately ignores it and maintains her superior status over her.

Pattern: +Adj+N+Prep Phrase

Sample Discourse:
Senior Officer 1: So if it isn’t over in four days, when will it be over? You.
Tom: Well, all I have to measure it by... is how long it took last time when Shark first came in.
Senior Officer 1: And how long did it take?
Tom: Ten months.
Senior Officer 2: Well, this is a great day for Adolf Hitler.
Senior Officer 1: Ten months?
Senior Officer 2: But you did break it.
Tom: Yes.
Senior Officer 2: How?
Tom: I’m afraid I can’t tell you that.
Senior Officer 2: I think it’s time we... I think it’s time I got back to London.

Context, Background Knowledge & Cultural Values:
The German Navy changed their old code, Shark, into a new one.
The British Navy is unable to decipher the new code.
London is anxious about the blackout of information from the German telex code.
They send Tom back from a hospital to help.
It took Tom ten months to break Shark.
How to break the German code is top secret.

Communicative Intent & Sociocultural Purpose:
The officer from London wants to find out how long it will take to break the new code.
It took Tom ten months to break Shark, so it will probably take about the same time to break the new one.
The British Navy will not be able to know where the German submarines are deployed and where they are heading.
Tom is not allowed to reveal the top secret of how to break a code.
The officers have to report to London about the situation they face.

In the discourses listed above, with the contextual information they require, the background knowledge and cultural values embedded in them, and the communicative intent and sociocultural purpose behind them, meanings go far beyond the simple sentence pattern itself. In the first discourse above, the word “courtesy” is a noun that serves as an adjective, and the sentence “This is a courtesy call” is used to cover the embarrassment caused by the president’s doubts about the purpose of Charlie’s visit. In the second case, the sentence “This is a wonderful party” has nothing to do with the party; it serves to stop Joanne’s intimate behavior towards the boss of the speaker, Bonnie. In the third one, “This is a great day for Adolf Hitler” means that if the German Navy knows that the British Navy is unable to break their new code, they will be safe and can deploy their submarines wherever they want, and the British Navy will be in big trouble.

If this study of a simple sentence pattern reveals that contextual information, background knowledge, cultural values, communicative intents and socio-cultural purposes are all important factors in comprehending the discourse in which the particular sentence is embedded, then what learners need are materials that bring all these into a meaningful whole rather than de-contextualized sentence patterns. In other words, using meaningful stories as vehicles is a better idea than having learners practice and memorize classified and simplified sentence patterns. Stories on DVDs are currently the best format for this purpose, in that they contain the video images, the sound streams and the text that make the contents more comprehensible.
If this is the case, then the entire English syllabus needs to be reformed so that the objectives highlighted in our curriculum guidelines, namely that students be able to comprehend, speak, read and write simple English, will not be in vain and end up in a hollow hole of poor English or non-English. This is a serious issue (although the sentence is a simple one).

Language exemplification

The idea of language exemplification has been repeatedly asserted by the author (see e.g. Wu, 2004, 2005; Wu & Young, 2006). It stems from the fact that the dynamic factors involved in meaning formation and utterance production are too complicated to be described. They can only be exemplified in a meaningful whole. Any descriptive, bottom-up, analytical approach, no matter how systematic, tends to be trapped in the hollow hole depicted above. The fact that the goals of our 9-year consecutive English curriculum have been reduced to 1000 words is the result of a vicious circle in such a hollow hole. Students need to be exposed to authentic video texts so that they can watch, hear, read, generalize, induce and memorize what is observed, and then, at a later stage, speak and write.

This can be achieved in many ways. On a small scale, teachers can select a few DVD titles from the library and turn them into interactive teaching materials with the storyline, video segments, annotated subtitles, exercises and online tools (Wu, 2006; see Figure 7 below).

Figure 7. An interactive DVD material presentation interface.

Another way is to select a series of DVD titles and organize them into an online course so that learners can access the materials both in class and at home, maximizing their exposure. For such larger-scale application, a teaching platform with a front-end account management device, a material presentation device, and after-class practice and rehearsal devices is required.
One such platform was designed for STUT English department teachers and students (Wu, 2005; see Figure 8 below).

Figure 8. An online teaching platform with both teacher and student’s interface.

Conclusion

The concordance and video retrieving system described in this paper is a large-scale corpus analysis tool for teachers. The system provides (a) keyword and collocation concordance for any vocabulary teachers encounter or want their students to learn, (b) net dictionary entries for the vocabulary, (c) subtitle and video retrieving for the particular vocabulary, and (d) a discourse selection and filing function for the exemplification of both the videos and the discourses. Concordance for sentence patterns and formulaic speech can also be performed with the system. For example, a concordance of the keyword “how” with the collocation “are you” to the right of the keyword finds all the cases of “how are you” in the corpus, and hence all the responses to it can be identified as well. Teachers can use the system to find out native speakers’ actual speech and the nuances in the different expressions. However, it is better to use selected stories on DVDs as vehicles so that learners can learn the target language in meaningful contexts and achieve better long-term results. It is suggested that teachers bring contextual information, background knowledge, cultural values, communicative intent and socio-cultural purposes together by using the stories as vehicles, so that learners can gradually build up their own sense of the target language via observation. The concordance and video retrieving system can be used as a tool to locate and retrieve related expressions from the database to further expand the contexts and contents of the learning materials.

References

Conrad, S. (2002). Corpus linguistic approaches for discourse analysis. Annual Review of Applied Linguistics, 22, 75-95.
Gavioli, L., & Aston, G. (2001). Enriching reality: Language corpora in language pedagogy. ELT Journal, 55(3), 238-246.
Kaplan, R.B., & Grabe, W. (2002). Discourse analysis. Journal of Second Language Writing, 11, 191-223.
Mishan, F. (2004). Authenticating corpora for language learning: a problem and its resolution. ELT Journal, 58(3), 219-227.
Sinclair, J. (1987). Looking up: An account of the COBUILD project in lexical computing. London: Collins.
Sinclair, J. (1991). Corpus, concordance, collocation. Oxford: Oxford University Press.
Tschirner, E. (2001). Language acquisition in the classroom: The role of digital video. Computer Assisted Language Learning, 14(3-4), 305-319.
Widdowson, H.G. (2000). On the limitations of linguistics applied. Applied Linguistics, 21(1), 3-25.
Wu, M-T. (2004). 從語文教學的本質來看語文教學所面臨的基本問題 [The basic problems facing language teaching, viewed from the nature of language teaching]. The Proceedings of 2004 International Conference on Language Education. Tainan: STUT.
Wu, M-T. (2005). Multimedia Materials and Online Learner Tools for English Learning: An Introduction to the STUT English Teaching and Learning System. The Proceedings of 2005 International Conference on Language Education. Tainan: STUT.
Wu, M-T., & Young, J-M. (2006). A theme-based multimedia English learning environment and the underlying language elements. TELL Journal, 3, 1-22.