A keyword concordance and video retrieving system for the analysis of an authentic
film corpus
Ming-tyi Wu
Dept. of Applied English, Southern Taiwan University of Technology
Kun-te Wang
Dept. of Engineering Science, National Cheng Kung University
Abstract
This paper introduces a multimedia database of films on DVD and a concordance and
video retrieving system for analyzing it. The design of the system and its functions
are illustrated, and a concordance of the sentence pattern “this is a…” is conducted
to examine what is left out of an English syllabus based on simplified and
de-contextualized sentence patterns. It is suggested that contextual information,
background knowledge, cultural values, communicative intent and socio-cultural
purpose are important factors in the expression of meaning, and that teachers should
use stories on DVDs as vehicles to bring all these factors into a meaningful whole for
effective learning. The system can also be used as a tool to further expand the
contexts and contents of the learning materials.
Keywords: multimedia database, DVD, concordance, video retrieving
Introduction
Ever since Sinclair (1987, 1991) developed the COBUILD project for the collection
and analysis of a very large corpus, dictionaries, grammars, and ELT materials have
been produced out of the corpus studies. Corpus linguistics became a major
descriptive area in both applied linguistics and linguistics in the 1990s (Kaplan &
Grabe, 2002). Although a corpus is composed of real written and spoken texts, these
texts have been transplanted from their original medium and incorporated into another,
and hence their distinguishing features no longer exist (Mishan, 2004).
Widdowson (2000) also argued that the electronic form of the texts in a corpus
de-contextualizes them, so that the core aspect of their authenticity is lost. Conrad
(2002) asserted that computer tools that allow discourse-level studies of corpora are
needed. Thanks to the development of DVD technology, authentic materials on DVDs no
longer need to be transplanted; they contain video images, sound streams and texts,
which make them versatile for both discourse studies and language learning.
In a paper which examined the potential of using DVD for language learning,
Tschirner (2001) suggested seven principles culled from several areas of SLA
research: (a) learning is situated; (b) input oriented learning of the FL is emphasized;
(c) output oriented learning of the FL is encouraged; (d) the cultural component of FL
competence is promoted; (e) a focus on form is fostered; (f) storage in memory of
meaningful and situated sequences of sounds and words is promoted; and (g) affective
needs of learners are taken care of (pp. 308-309). In the case of EFL, there are
hundreds of DVD titles available on the market. These titles are authentic, in the sense
that the target audience is native English speakers, and their scripts are written by
expert users of English.
Checked against Tschirner’s (ibid.) principles, these titles fare well. The various
stories in them provide rich situations in which utterances occur. The video images
and the sound streams can be comprehensible input for learners in that they are
embedded in meaningful contexts. The fact that they can be played repeatedly aids
comprehension. They also carry rich cultural components of American society. With the
subtitles available on them, the linguistic elements in the utterances can be
scrutinized. The interactive use of the DVD materials can enhance the storage of the
contents in memory. Furthermore, the affective and emotional values highlighted in
each of the titles may touch its audience.
A corpus of more than 500 DVD titles was established through a project sponsored by
the Ministry of Education from 2001 to 2005 (Wu, 2006). It has been updated
continuously since then and, to date, includes more than 680 DVD titles. Such a corpus
gives learners many opportunities to authenticate discourses through observation
(Gavioli & Aston, 2001). This paper first introduces the corpus and then presents a
text analysis and video retrieving system. Some example uses of both the corpus and
the system are also discussed.
The DVD corpus
The word types and tokens for the titles in each category, counted with the text
analysis software Concordance (Version 3.3) developed by J.R.C Watt Company, are shown
in Table 1.
Table 1. The general features of the film corpus
Categories   No. of Titles   Word Types   Tokens    Tokens/No. of Titles
Cartoons     76              16961        347815    4577
Classics     61              21862        600474    9844
Films        405             69207        3247475   8018
TV Series    148             37370        1150581   7774
Total        681             94654        5298378   7780
According to Table 1, cartoons contain fewer word types and fewer tokens than the
other categories. This is because cartoons are produced mainly for children and hence
the words used are somewhat simplified. The average number of tokens per cartoon
title, 4577, is also far lower than that of the other categories, which suggests that
the stories in cartoons are shorter and less complicated in content. Classic movies,
on the other hand, are much longer and comprise more serious and complicated content;
their average, 9844 tokens per title, is more than twice that of cartoons. The
averages for feature films, 8018, and TV series, 7774, are lower than that of the
classics but still far higher than that of the cartoons.
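
The three measures reported in Table 1 can be illustrated with a minimal sketch. It is
not the procedure used by the Concordance software; the tokenization rule (runs of
letters and apostrophes, lower-cased) and the function names are assumptions made for
illustration only, and the two sample "titles" are made up.

```python
import re
from collections import Counter


def tokenize(text):
    """Lower-case the text and split it into word tokens.

    Assumption: a token is a run of letters and apostrophes.
    """
    return re.findall(r"[a-z']+", text.lower())


def corpus_stats(scripts):
    """Return (word types, tokens, tokens per title) for a list of script texts."""
    counts = Counter()
    for script in scripts:
        counts.update(tokenize(script))
    tokens = sum(counts.values())   # every running word
    types = len(counts)             # distinct word forms
    per_title = tokens / len(scripts) if scripts else 0.0
    return types, tokens, per_title


# Two made-up "titles" standing in for subtitle files.
sample = [
    "This is a nightmare. This is a private room.",
    "I think this is a little different.",
]
print(corpus_stats(sample))
```

The last column of Table 1 is the same division applied to the reported figures; for
cartoons, 347815 tokens over 76 titles rounds to 4577.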
Although DVD titles can be rich sources for FL classrooms, with such large volumes
of video texts it is difficult, if possible at all, to go through each title and
select appropriate sample utterances for learners. What is required is a tool that can
search for key words or phrases in the texts and locate the position in the video
where each occurrence appears, so that teachers can select examples that meet their
needs and students can use it for exploratory learning. The following section
introduces such a tool, developed by the authors.
The text analysis and video retrieving system
The film materials are licensed for non-profit educational purposes only, so they can
be used only by registered users on campus. Registered users need to log in every time
they use the system. Figure 1 shows the log-in screen.
Figure 1. The registration and log-in screen.
After logging in, the administrator can upload film lists, films in WMV format, and
the scripts or subtitles of each film in both DKS and SMI formats. There are four
categories at the moment: Movies, Cartoons, TV Series, and Documentaries. The system
generates an overall film list from all the film lists, and this overall list is the
default for concordance and video retrieving unless another list is specified.
Figure 2 shows the interface the administrator uses to set up the contents of the
system.
Figure 2. The administrator’s interface.
Teachers who want to use the system for research or for editing their own teaching
materials, and learners who want to use it for exploratory learning, work with the
user’s interface shown in Figure 3.
Figure 3. User’s interface of the system.
The interface comprises four parts:
(a) A word list whose display can be sorted alphabetically or by frequency, in
ascending or descending order.
(b) A concordance and video retrieving settings window, including the keyword(s) and
collocation(s) to be searched, and corpus selection.
(c) A search result window which allows users to select any item identified for
presentation or editing.
(d) Links to net dictionary entries for the particular keyword, system homepage and
courses edited with the system.
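
The keyword and collocation search in part (b) can be sketched as follows. This is an
illustrative reconstruction under stated assumptions, not the system's actual code: the
function name `concordance` and its parameters are hypothetical, subtitles are assumed
to be available as a list of text lines, and the collocation is assumed to follow the
keyword immediately on its right.

```python
import re


def concordance(lines, keyword, right_collocation=""):
    """Return (line_number, line) pairs in which `keyword` is immediately
    followed by the words of `right_collocation` (case-insensitive)."""
    words = [keyword] + right_collocation.split()
    # Whole words only, separated by any amount of whitespace.
    pattern = re.compile(
        r"\b" + r"\s+".join(re.escape(w) for w in words) + r"\b",
        re.IGNORECASE,
    )
    return [(number, line)
            for number, line in enumerate(lines, start=1)
            if pattern.search(line)]


# Made-up subtitle lines for illustration.
sample_lines = [
    "I'm sorry, sir. This is a private room.",
    "Is this yours?",
    "And this is a really good example of that.",
    "This isn't a game.",
]
for number, line in concordance(sample_lines, "this", "is a"):
    print(number, line)
```

Matching on word boundaries keeps "This isn't a game." out of the results, since
"isn't" is not the word "is" followed by "a".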
At the top of the search result window, the concordance button leads to another
window which shows the complete results of the search. Part of such a list is shown in
Figure 4. In the list, the collocated words are listed at the top, and the size of
each varies according to the number of examples found. The MI (Mutual Information)
values show how closely each collocated word is associated with the searched word. A
button precedes each sample sentence; clicking it retrieves and plays the point in the
video where the sentence occurs, along with the scripts, with the sentence in question
highlighted.
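
The paper does not state which formula the system uses for its MI values. A common
choice in corpus linguistics is pointwise mutual information, MI = log2(f(x,y) * N /
(f(x) * f(y))), where f(x,y) is the number of co-occurrences of the two words, f(x) and
f(y) are their individual frequencies, and N is the corpus size in tokens. The sketch
below assumes that formula; the window size `span` and the function name are likewise
assumptions.

```python
import math
from collections import Counter


def mutual_information(tokens, node, collocate, span=4):
    """Pointwise MI between `node` and `collocate`.

    A co-occurrence is counted whenever `collocate` appears within `span`
    tokens to either side of an occurrence of `node`.
    """
    n = len(tokens)
    freq = Counter(tokens)
    joint = 0
    for i, token in enumerate(tokens):
        if token != node:
            continue
        window = tokens[max(0, i - span):i] + tokens[i + 1:i + 1 + span]
        joint += window.count(collocate)
    if joint == 0:
        return float("-inf")  # the two words never co-occur
    return math.log2(joint * n / (freq[node] * freq[collocate]))


# Made-up token sequence for illustration.
tokens = "this is a nightmare and this is a mistake but that was fine".split()
print(mutual_information(tokens, "this", "is", span=1))
```

A higher value means the two words occur together more often than their individual
frequencies would predict; raw MI is known to favor rare words, so it is usually read
alongside the raw counts.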
Figure 4. Part of a concordance list.
Figure 5 displays how the retrieved video is presented together with the subtitles in
SMI format under the screen and those in DKS format to the right. The sample sentence
identified is highlighted in red. This video and subtitle display window can be
activated by (a) clicking a script file name on the search result list, (b) clicking
an icon in the Edit column of the list (see Figure 3), or (c) clicking an icon in
front of a sample sentence in the concordance list described above.
Figure 5. The video and subtitle display window.
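
Locating the point in a video where a sentence occurs depends on the time codes stored
with the subtitles. The sketch below assumes that the SMI files follow the SAMI
convention of `<SYNC Start=milliseconds>` cues; it is not the system's own parser, the
function names are hypothetical, and the sample markup is made up. The DKS format is
not covered.

```python
import re

# One cue: a SYNC tag with its start time, then everything up to the next cue.
SYNC_RE = re.compile(
    r'<SYNC\s+Start\s*=\s*"?(\d+)"?[^>]*>(.*?)(?=<SYNC|</BODY>|\Z)',
    re.IGNORECASE | re.DOTALL,
)
TAG_RE = re.compile(r"<[^>]+>")


def parse_sami(sami_text):
    """Return a list of (start_ms, caption_text) cues, skipping empty cues."""
    cues = []
    for start, body in SYNC_RE.findall(sami_text):
        text = TAG_RE.sub(" ", body).replace("&nbsp;", " ")
        text = " ".join(text.split())
        if text:
            cues.append((int(start), text))
    return cues


def find_start_times(cues, phrase):
    """Start times (ms) of every cue whose text contains `phrase`."""
    phrase = phrase.lower()
    return [start for start, text in cues if phrase in text.lower()]


sample_sami = """<SAMI><BODY>
<SYNC Start=61000><P Class=ENUSCC>I'm sorry, sir.
<SYNC Start=62500><P Class=ENUSCC>This is a private room.
<SYNC Start=64000><P Class=ENUSCC>&nbsp;
</BODY></SAMI>"""

cues = parse_sami(sample_sami)
print(find_start_times(cues, "this is a"))
```

The returned start time is the offset to which a player would seek in order to begin
playback at the matched sentence.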
After step (b) or (c) above, a button which leads to the course Editor’s window
appears at the bottom of the video and subtitle display window (see the Show button in
Figure 5 above). The Editor’s window allows users to select the discourse which
contains the sample sentence and save it to a course. The right-hand window in
Figure 6 below shows five discourses selected under the topic of Pathetic. Clicking
the Play in English button at the bottom of an opened discourse window retrieves the
video of that discourse and presents it in the video and SMI subtitle window (see the
left-hand window in Figure 6). In this way, a series of topics can be edited, saved,
and presented in a course.
Figure 6. The display window of a topic with the selected discourses.
At the current stage, one problem with the database is that some of the time codes that
go with the subtitles need to be adjusted so that the video and the accompanying
subtitles are more accurately synchronized. A problem with the system is that if too
many users are retrieving at the same time, or too many keywords are requested at
once, the system may become too busy to handle the requests and return an error
message.
Concordance
Using the system described above, a concordance of the sentence pattern “This is a…”
was conducted and a concordance list like that in Figure 4 was obtained. The search
used the key word “this” and the collocation “is a” to the right of the keyword. The
pattern was selected because the first English sentence the author ever learned in
junior high school was “This is a book.”, and yet the author has never heard it in
real life. In fact, not a single example of that sentence was found in the movie
corpus, which comprises 466 film titles with a total of 3.8 million words.
Communicative intent and sociocultural purpose are lost in such a contrived sentence.
On the concordance list, 335 examples of the sentence pattern were identified. These
examples were further divided into 16 patterns (Table 2):
Table 2. Patterns in the concordance list of “This is a…”
Patterns                           Entries  Examples
This is a +N                       91       [93] [ALEXANDER] This is a nightmare.
This is a +N+Prep Phrase           29       [75] I don’t think this is a case of undue influence.
This is a +N+Adj Phrase            5        [197] Well this is a thing unheard-of.
This is a +N+Adj Clause            6        [116] Please don’t say that. This is a mistake you’re making, Diana.
This is a +Adj N+N                 28       [60] [BERGSTEIN] This is a wild-goose chase.
This is a +Adj N+N+Prep Phrase     1        [168] You think this is a treasure map for Cibola, don’t you?
This is a +Adj+N                   111      [8] I’m sorry, sir. This is a private room.
This is a +Adj+N+Prep Phrase       28       [67] Fellas, this is a critical moment in his life.
This is a +Adj+N+Adj Phrase        1        [10] I have to warn you. This is a dangerous place full of vultures.
This is a +Adj+N+Adj Clause        9        [218] This is a nice street you live on. Yeah, this is my street.
This is a +Adj+N+Inf Phrase        1        [86] Carlos, this is a stupid fucking problem to have.
This is a +Adv+Adj+Prep Phrase     2        [220] I’m not gonna get this. This is a little too complicated for me.
This is a +Adv+Adj                 4        [209] I think this is a little different. You’ll run circles round them.
This is a +Adv+Adj+N               12       [63] Oh god. Jesus Christ, Robert, this is a really bad idea.
This is a +Adv+Adj+N+Prep Phrase   6        [269] And this is a really good example of that.
This is a +Adv+Adj+N+Inf Phrase    1        [133] For me this is a very hard thing to say.
16 patterns                        335
Among the 335 examples, 91 (27%) have the same pattern as “this is a book”, i.e.
“this is a +N.” All the other examples (73%) have one or more modifiers added to the
noun. Using the edit function provided by the concordance and retrieving system
discussed above, the discourse in which each of these examples is embedded can be
retrieved and edited for further analysis or for language teaching and learning via
exemplification (see Figure 6 above). The system is thus a computer tool of the kind
Conrad (2002) called for, one that allows discourse-level studies and applications.
Some of the retrieved discourses that contain the sentence pattern “this is a +N.” are
listed in Table 3.
Table 3. Sample discourse with the embedded sentence of “this is a +N.”
Sample Discourse
    Wane: So, what does your wife think about this plan?
    Arnold: My wife?
    Wane: Yeah. Those are her cigarettes.
    Arnold: You can keep things from your wife.
    Wane: I don't know.
    Arnold: What? You've never deceived your wife?
    Wane: Well, there are levels of deception, Arnold. I mean, this is a whopper.
    Arnold: Oh.
Context, Background Knowledge & Cultural Values
    Arnold lost his job 8 months ago at a company he had worked for for 17 years.
    Wane owns a company and cherishes his wife’s love for him.
    Arnold kidnapped Wane for money.
    Arnold’s wife smokes a lot.
    To Wane, man and wife share everything except maybe some white lies; to Arnold, a
    man can keep things from his wife.
    Kidnapping somebody for money is a felony, not something trivial.
Communicative Intent & Sociocultural Purpose
    Wane tries to bring Arnold’s wife in to loosen up the kidnapper.
    To talk with the kidnapper in a way that brings back the humanistic nature in him.
The example shows that even though no modifiers are added to the noun, the meaning
goes far beyond the words. Contextual information plays a crucial role in the
comprehension of the words. The richer the context, the more students will understand
and hence the more effective learning is. If the material is reduced to a mere
sentence pattern, it makes little sense to the learner. For such discourses to occur,
a wide range of background knowledge and cultural values is involved in addition to
the linguistic capabilities of the interlocutors. The communicative intent and
sociocultural purpose are the reason the conversation occurs in the first place.
For those patterns with one or more modifiers added to the nouns, the discourses in
which they are embedded need to be examined as well. Table 4 presents some of the
retrieved sample discourses:
Table 4. Sample discourses with modifiers added to the noun.

Pattern: +Adj N+N
Sample Discourse
    President Zia: So what I have been wondering is why your State Department would
    send someone here who thinks he understands the problem because I don’t think the
    prayers of the Texas Second Congressional District are going to turn the trick.
    Charlie: Well, now I wasn’t sent here by the State Department, Mr. President. I
    was asked to come here by our friend in Houston. So this is a courtesy call.
    President Zia: I don’t need courtesy. I need airplanes, guns and money.
Context, Background Knowledge & Cultural Values
    Zia is the President of Pakistan.
    Charlie is a Congressman representing the Texas Second Congressional District.
    Many Afghan refugees fled to Pakistan.
Communicative Intent & Sociocultural Purpose
    What Pakistan needs is someone who really understands the problem.
    Charlie’s visit is not an official one.
    Pakistan needs substantial help from the US.
    Whether to help the Afghan people is both a political and a humanitarian issue.

Pattern: +Adj+N
Sample Discourse
    Joanne: Charlie. So sorry for keeping you waiting.
    Charlie: Oh, it's no problem, Joanne. This is Bonnie Bach.
    Joanne: So nice to meet you.
    Bonnie: It's a pleasure meeting you, Mrs. Herring. This is a wonderful party.
    Charlie: Why don't you give us a few moments?
    Bonnie: Yes, sir.
    Joanne: Oh, Bobbie, if you could ask someone for a Bombay martini up, very dry?
    Bonnie: Oh, I'm not a slave girl, actually. I'm the Congressman's administrative
    assistant.
    Joanne: Isn't that wonderful for you?
    Bonnie: Yes.
    Joanne: Two olives, please. Tell them it's for me, they'll know.
    Bonnie: Certainly.
    Joanne: She doesn't like me.
    Charlie: Everybody likes you.
    Joanne: She's a liberal.
    Charlie: Well, I'm a liberal.
    Joanne: Not where it counts.
Context, Background Knowledge & Cultural Values
    Joanne Herring is a major donor for Charlie’s campaign.
    She invited Charlie to a party in her house.
    Bonnie is Charlie’s assistant.
    Bombay is India’s most populated city and martini from there is dryer.
    Joanne thinks being a congressman’s assistant is a good thing for Bonnie.
    A liberal is someone who favors a political philosophy of progress and reform and
    the protection of civil liberties.
Communicative Intent & Sociocultural Purpose
    Joanne’s words “So nice to meet you” should be a formulaic speech to Bonnie, but
    what she really means is that she is glad to see Charlie.
    Bonnie says “It's a pleasure meeting you, Mrs. Herring. This is a wonderful
    party.” in order to stop Joanne’s kiss to Charlie. What she says is meant to give
    Joanne some kind of attitude; it has nothing to do with the party.
    Joanne has no right to ask Bonnie to do things for her.
    Joanne notices Bonnie’s dislike for her, but she deliberately ignores it and
    maintains her superior status.

Pattern: +Adj+N+Prep Phrase
Sample Discourse
    Senior Officer 1: So if it isn't over in four days, when will it be over? You.
    Tom: Well, all I have to measure it by... is how long it took last time when Shark
    first came in.
    Senior Officer 1: And how long did it take?
    Tom: Ten months.
    Senior Officer 2: Well, this is a great day for Adolf Hitler.
    Senior Officer 1: Ten months?
    Senior Officer 2: But you did break it.
    Tom: Yes.
    Senior Officer 2: How?
    Tom: I'm afraid I can't tell you that.
    Senior Officer 2: I think it's time we... I think it's time I got back to London.
Context, Background Knowledge & Cultural Values
    The German Navy changed their old code, Shark, into a new one.
    The British Navy is unable to decipher the new code.
    London is anxious about the blackout of information from the Germans’ telex code.
    They send Tom back from a hospital to help.
    It took Tom ten months to break Shark.
    How to break the German code is top secret.
Communicative Intent & Sociocultural Purpose
    The officer from London wants to find out how long it will take to break the new
    code.
    It took Tom ten months to break Shark, so it will probably take about the same
    time to break the new one.
    The British Navy will not be able to know where the German submarines are deployed
    and where they are heading.
    Tom is not allowed to reveal the top secret of how to break a code.
    The officers have to report to London about the situation they face.
As the contextual information, background knowledge and cultural values, and
communicative intent and sociocultural purpose listed for these discourses show, the
meanings go far beyond the simple sentence pattern itself. In the first discourse, the
word “courtesy” is a noun that serves as an adjective, and the sentence “This is a
courtesy call” is used to smooth over the embarrassment caused by the president’s
doubts about the purpose of Charlie’s visit. In the second, the sentence “This is a
wonderful party” has nothing to do with the party; it serves to interrupt Joanne’s
intimate behavior towards Charlie, the boss of the speaker, Bonnie. In the third,
“This is a great day for Adolf Hitler” means that if the German Navy knows that the
British Navy is unable to break its new code, its submarines will be safe and can be
deployed wherever it wants, and the British Navy will be in big trouble.
This study of a simple sentence pattern shows that contextual information, background
knowledge, cultural values, communicative intent and socio-cultural purpose are all
important factors in comprehending the discourse in which a particular sentence is
embedded. What learners need, then, are materials that bring all these into a
meaningful whole rather than de-contextualized sentence patterns. In other words,
using meaningful stories as vehicles is a better idea than having learners practice
and memorize classified and simplified sentence patterns. Stories on DVDs are
currently the best format for this purpose in that they contain video images, sound
streams and text, which make the contents more comprehensible. If this is the case,
the entire English syllabus needs to be reformed so that the objectives highlighted in
our curriculum guidelines, that students be able to comprehend, speak, read and write
simple English, will not be in vain and fall into a hollow hole of poor English or
non-English in the end. This is a serious issue (although the sentence is a simple
one).
Language exemplification
The idea of language exemplification has been repeatedly asserted by the author (see
e.g. Wu, 2004, 2005; Wu & Young, 2006). It stems from the fact that the dynamic
factors involved in meaning formation and utterance production are too complicated to
be described; they can only be exemplified in a meaningful whole. Any descriptive,
bottom-up, analytical approach, no matter how systematic, tends to be trapped in the
hollow hole depicted above. The fact that the goals of our 9-year consecutive English
curriculum have been reduced to 1000 words is the result of a vicious circle in such a
hollow hole.
Students need to be exposed to the authentic video texts so that they can watch, hear,
read, generalize, induce and memorize what is observed and then, at a later stage, they
are able to speak and write. This can be achieved in many ways. On a small scale,
teachers can select a few DVD titles from the library and turn them into interactive
teaching materials with the storyline, video segments, annotated subtitles, exercises
and online tools (Wu, 2006; see Figure 7 below).
Figure 7. An interactive DVD material presentation interface.
Another way is to select a series of DVD titles and organize them into an online
course so that learners have access to the materials both in class and at home,
maximizing their exposure. For such larger-scale applications, a teaching platform
with a front-end account management device, a material presentation device, and
after-class practice and rehearsal devices is required. One such platform was designed
for STUT English department teachers and students (Wu, 2005; see Figure 8 below).
Figure 8. An online teaching platform with teacher and student interfaces.
Conclusion
The concordance and video retrieving system described in this paper is a large-scale
corpus analysis tool for teachers. The system provides (a) key word and collocation
concordance for any vocabulary they encounter or want their students to learn, (b) net
dictionary entries for the vocabulary, (c) subtitle and video retrieval for the
vocabulary, and (d) a discourse selection and filing function for the exemplification
of both the videos and the discourses. Concordance for sentence patterns and formulaic
speech can also be performed with the system. For example, a concordance of the
keyword “how” with the collocation “are you” to the right of the keyword finds all the
cases of “how are you” in the corpus, and hence all the responses to it can be
identified as well. Teachers can use the system to examine native speakers’ actual
speech and the nuances of different expressions. However, it is better to use selected
stories on DVDs as vehicles so that learners can learn the target language in
meaningful context and achieve better long-term results.
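
The “how are you” example can be sketched as follows, under the simplifying
assumption that each response is the subtitle line immediately following the matched
one. The function name `responses_to` and the sample dialogue are made up for
illustration; in real subtitles a response may share a line with the question or come
several lines later.

```python
import re


def responses_to(utterances, phrase):
    """Pair each utterance containing `phrase` with the utterance that follows it."""
    pattern = re.compile(
        r"\b" + r"\s+".join(re.escape(w) for w in phrase.split()) + r"\b",
        re.IGNORECASE,
    )
    pairs = []
    for current, following in zip(utterances, utterances[1:]):
        if pattern.search(current):
            pairs.append((current, following))
    return pairs


# Made-up dialogue lines for illustration.
dialogue = [
    "Hey, how are you?",
    "Can't complain.",
    "How are you holding up?",
    "Badly.",
]
for question, answer in responses_to(dialogue, "how are you"):
    print(question, "->", answer)
```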
It is suggested that teachers bring all the contextual information, background
knowledge, cultural values, communicative intent and socio-cultural purposes
together by using the stories as vehicles so that learners can build up their own sense
of the target language gradually via observation. The concordance and video
retrieving system can be used as a tool to locate and retrieve from the database the
related expressions to further expand the contexts and contents of the learning
materials.
References
Conrad, S. (2002). Corpus linguistic approaches for discourse analysis. Annual
Review of Applied Linguistics, 22, 75-95.
Gavioli, L., & Aston, G. (2001). Enriching reality: Language corpora in language
pedagogy. ELT Journal, 55(3), 238-246.
Kaplan, R.B., & Grabe, W. (2002). Discourse analysis. Journal of Second Language
Writing, 11, 191-223.
Mishan, F. (2004). Authenticating corpora for language learning: A problem and its
resolution. ELT Journal, 58(3), 219-227.
Sinclair, J. (1987). Looking up: An account of the COBUILD project in lexical
computing. London: Collins.
Sinclair, J. (1991). Corpus, concordance, collocation. Oxford: Oxford University
Press.
Tschirner, E. (2001). Language acquisition in the classroom: The role of digital video.
Computer Assisted Language Learning, 14(3-4), 305-319.
Widdowson, H.G. (2000). On the limitations of linguistics applied. Applied
Linguistics, 21(1), 3-25.
Wu, M-T. (2004). 從語文教學的本質來看語文教學所面臨的基本問題 [The basic problems
facing language teaching, viewed from the nature of language teaching]. The
Proceedings of 2004 International Conference on Language Education.
Tainan: STUT.
Wu, M-T. (2005). Multimedia Materials and Online Learner Tools for English
Learning: An Introduction to the STUT English Teaching and Learning
System. The Proceedings of 2005 International Conference on Language
Education. Tainan: STUT.
Wu, M-T., & Young, J-M. (2006). A theme-based multimedia English learning
environment and the underlying language elements. TELL Journal, 3, 1-22.