Corpora in the ESL Classroom

Using English Language Corpora in
the ESL Classroom
I-TESOL Conference
October 12th, 2012
Brent A. Green
Salt Lake Community College
• Personal and professional interests in using
corpora in language teaching
• Goals of Workshop: Participants will learn how
to access and use on-line written and spoken
English corpora to help them prepare course
materials and assessments, increase
understanding of English language structures, and
engage students in data-driven learning tasks.
• What is a corpus?
– A large database of language
• What is a concordancer?
– A software program that allows you to search the database
for particular words or phrases
• What is classroom concordancing?
– A teaching approach in which concordance data are used
in the language classroom to help learners notice and
practice language patterns and use. This teaching
approach is sometimes referred to as Data-driven Learning
(DDL). Learners are driven by authentic language data,
presented in the form of concordance lines, to act as
“linguistic detectives” and find answers to their linguistic
queries (Johns 1988; 1991a, b).
• What are concordance lines?
– Examples of words or phrases presented so that
the item under investigation is aligned in the
middle of the page, with its left and right
contexts on either side (often referred to as
KWIC format).
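To make the KWIC layout concrete, here is a minimal Python sketch of a concordancer. It is only an illustration: the function name `kwic`, the `width` parameter, and the sample text are assumptions made for this example, and real tools such as the COCA interface do far more (tagging, sorting, genre filters).

```python
import re

def kwic(text, keyword, width=30):
    """Return concordance lines with the keyword aligned in the middle (KWIC)."""
    # Collapse line breaks and extra spaces so each context stays on one line.
    flat = " ".join(text.split())
    pattern = re.compile(r"\b" + re.escape(keyword) + r"\b", re.IGNORECASE)
    lines = []
    for m in pattern.finditer(flat):
        left = flat[max(0, m.start() - width):m.start()]
        right = flat[m.end():m.end() + width]
        # Right-align the left context so every keyword starts in the same column.
        lines.append(f"{left:>{width}} {m.group(0)} {right:<{width}}")
    return lines

# Invented sample text, used only to show the output format.
sample = ("My father used to exercise every morning. "
          "When I was young we used to walk to school, "
          "and my sister used to complain about the cold.")

for line in kwic(sample, "used to"):
    print(line)
```

Each printed line shows one occurrence of the search phrase in the same column, which is what lets learners scan the left and right contexts for patterns.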
Key Word in Context (KWIC)
• Example of KWIC from the Corpus of
Contemporary American English (COCA)
Three-Dimensional Framework
Larsen-Freeman 1991
What do we look for?
• Lexicography
– What are the meanings associated with a particular word?
– What is the frequency of a word relative to other related words?
– What non-linguistic association patterns does a particular
word have (e.g., to registers, historical periods, dialects)?
– What words commonly co-occur with a particular word,
and what is the distribution of these “collocational”
sequences across registers?
– How are the senses and uses of a word distributed
– How are seemingly synonymous words used and
distributed in different ways?
(Biber et al., 1998)
What do we look for?
• Grammatical structures (if or that clauses, causatives,
• Discourse functions (making suggestions, introducing
a speaker, etc.)
How does one begin examining corpus
You need the following
1. A language-related question that arises out of
the text, your own observations or curiosity, or
the observations and curiosity of your students.
2. A corpus of language that contains contexts
which are similar to your learners’ target
language learning domains.
3. Pedagogically sound principles in accessing and
applying corpus data.
Corpus-based Research
1. Research question
2. Extensive review of the literature
3. Summary of experts across form, meaning,
and use categories
4. Comparison of experts against spoken and
written corpora
5. Reformulation and expansion of existing
Corpus-based Teaching
• Syllabus design and evaluation
– Student-based corpora
– Student texts
• Material preparation
• Teacher-student collaboration
• Student research
• What is it?
– The Corpus of Contemporary American English (COCA) is the
largest freely-available on-line corpus of English
• Who created it?
– It was created by Mark Davies of Brigham Young University in
• How many words does it contain?
– The corpus contains more than 450 million words of text and is
equally divided among spoken, fiction, popular magazines,
newspapers, and academic texts.
Information adapted from
• What type of searches can I do with COCA?
– The interface allows you to search for exact words or phrases,
wildcards, lemmas, part of speech, or any combinations of
these. You can search for surrounding words (collocates) within
a ten-word window.
– The corpus also allows you to easily limit searches by frequency
and compare the frequency of words, phrases, and grammatical
• What else can you do?
– You can also easily carry out semantically based queries of the
corpus. For example, you can contrast and compare the
collocates of two related words to determine the difference in
meaning or use between these words.
– You can find the frequency and distribution of synonyms for
nearly 60,000 words, compare their frequency in
different genres, and use these word lists as part of other
– Finally, you can easily create your own lists of semantically
related words and then use them directly as part of the query.
Corpus-based Practice
• Before you look for the collocates of each of the
words deep, run, smile, and fairly -- what would you
guess are the best collocates -- in other words,
surrounding words that really help to "define" these
• Are there any that are surprises in what you see in the
Corpus-based Practice
• Compare the collocates of the two
words democrats and republicans. According to
these texts (from newspapers, magazines, TV
talk shows, etc),
• Any possible media bias here?
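Comparing the collocates of two words, as in this task, can be sketched in a few lines of Python. This is an illustration under stated assumptions, not how COCA computes its results: the function names, the small window size, and the toy text (built around small and little rather than real corpus data) are all invented for the example.

```python
from collections import Counter
import re

def collocates(text, node, window=2):
    """Count words found within `window` tokens to the left or right of `node`."""
    tokens = re.findall(r"[a-z']+", text.lower())
    counts = Counter()
    for i, tok in enumerate(tokens):
        if tok == node:
            start = max(0, i - window)
            # Note: this simple window ignores sentence boundaries.
            counts.update(tokens[start:i] + tokens[i + 1:i + 1 + window])
    return counts

def compare_collocates(text, word_a, word_b, window=2):
    """Return the collocates that appear with only one of the two words."""
    a = collocates(text, word_a, window)
    b = collocates(text, word_b, window)
    only_a = {w: c for w, c in a.items() if w not in b}
    only_b = {w: c for w, c in b.items() if w not in a}
    return only_a, only_b

# Invented toy text; a real comparison would use corpus data.
toy_text = ("She saved a small amount of money. He needed a little bit of luck. "
            "A small number of students came. The little girl smiled a little bit.")

only_small, only_little = compare_collocates(toy_text, "small", "little")
print("only with small: ", only_small)
print("only with little:", only_little)
```

Words shared by both lists (such as a and of) drop out, leaving the collocates that distinguish the two words. Learners can then discuss what those differences suggest about meaning or use.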
Corpus-based Practice
• Compare the frequency of second vs secondly in
academic texts. Which one would you guess is
more frequent?
• What issues do we have when we make this
Corpus-based Practice
• Compare the adjectives used to
describe women and men.
• Does this reflect biases in contemporary
American culture?
Corpus-based Practice
• Using the web interface, you can search by
– phrases: nooks and crannies, or faint + noun (faint [n*])
– lemmas (all forms of a word), like sing ([sing]) or tall ([tall])
– wildcards (un*ly or r?n*)
– more complex searches: un-X-ed adjectives (un*ed.[j*]) or
verb + any word + a form of ground ([vv*] * [ground])
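The wildcard searches above can be tried out on a local word list with a short Python sketch. It assumes that * stands for any run of characters (including none) and ? for exactly one character. The function name `wildcard_search` and the word list are invented for this example, and nothing here queries COCA itself.

```python
from fnmatch import fnmatchcase

def wildcard_search(pattern, words):
    """Return the words that match a wildcard pattern (* = any run, ? = one character)."""
    return [w for w in words if fnmatchcase(w.lower(), pattern.lower())]

# Invented word list, used only to show which forms each pattern picks up.
words = ["unlikely", "unfairly", "until", "run", "running",
         "rings", "rain", "ran", "uncle"]

print(wildcard_search("un*ly", words))  # ['unlikely', 'unfairly']
print(wildcard_search("r?n*", words))   # ['run', 'running', 'rings', 'ran']
```

Note that r?n* excludes rain, because the ? slot allows only a single character between r and n.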
Types of Concordance-based Tasks
• The teacher selects words or phrases to be investigated, usually taken
from observations or information presented in the course text.
• The teacher and the learners agree on the language to be studied.
• The learners form their own questions.
• The teacher retrieves and selects concordance lines, and designs
concordance-based tasks with different degrees of control.
• The teacher and the learners browse the corpus and examine the
language data together.
• The learners browse the corpus independently. There is no structured
or controlled task.
• The teacher provides clues and hints to help learners complete
concordance tasks, or guides learners to a generalization or
• The teacher comments on and helps refine the learners'
generalizations.
• There is very little interference from the teacher in the
generalization process.
Adapted from Sripicharn 2003
Teacher-Centered Tasks
• Example #1
– Used to and would in the habitual past
• On Your Own
– Hedges (kind of, sort of, like)
– Say, talk, tell
Erades (1943)
• It may be safely said that in language a
difference of form always corresponds to a
difference in meaning and whenever more
than one construction is—theoretically—
possible, they never wholly and under all
circumstances denote the same thing. The
first axiom of all valid linguistic thinking is that
in language nothing can serve as a substitute
for something else.
Would vs. Used to Example
• Briefly discuss the differences between the two
sentences with a partner
(a) My father used to exercise every morning
(b) My father would exercise every morning
• One difference is that (a) can signal only
habitual past action, whereas (b) can also be
conditional given an appropriate context (e.g., “If he
had time”).
Would vs. Used to Example
• Steps
– think about the context when the structure
• personal narrative
– find corpus data that matches that context
• American Dreams (Studs Terkel)
• Switchboard
– search for target structures using a
concordancing program
• MonoConc
Would vs. Used to Example
• Steps cont.
– Look for patterns in form, meaning, and use
• In what ways, if any, are the forms the same or
• In what ways, if any, are the meanings different or
similar? (look carefully at surrounding context)
• In what ways, if any, are the structures used
differently? (look carefully at surrounding context)
– Create sample worksheets or tests for students
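The last step, turning retrieved sentences into a worksheet, can be sketched as a simple gap-fill generator in Python. This is a rough illustration only: the function name `make_gap_fill` is invented, the first two sentences are the pair discussed earlier, and the third is an invented distractor. The pattern also matches conditional would, so the teacher still has to check each item by hand.

```python
import re

# Target structures to blank out; matching is case-insensitive.
TARGETS = re.compile(r"\b(used to|would)\b", re.IGNORECASE)

def make_gap_fill(sentences):
    """Blank out 'used to' / 'would' and return (worksheet items, answer key)."""
    worksheet, answers = [], []
    for sentence in sentences:
        found = TARGETS.findall(sentence)
        if not found:
            continue  # skip sentences without a target structure
        worksheet.append(TARGETS.sub("______", sentence))
        answers.append([f.lower() for f in found])
    return worksheet, answers

sentences = [
    "My father used to exercise every morning.",
    "My father would exercise every morning.",
    "We walked to school.",  # invented sentence with no target form
]

items, key = make_gap_fill(sentences)
for number, item in enumerate(items, start=1):
    print(f"{number}. {item}")
print("Answer key:", key)
```

Because both example sentences produce the same blanked item, the output also shows why surrounding context matters: learners need more than a single sentence to choose between the two forms.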
• How many words?
– approximately 1.8 million words (190 hours)
• What is the focus?
– Contemporary university speech within the University of
Michigan, in Ann Arbor, Michigan.
• Who are the speakers?
– Speakers represented in the corpus include faculty, staff,
and all levels of students, and both native and non-native
• What are the speech events?
– The corpus includes the following speech events: small
and large lectures (62), public interdisciplinary or
departmental colloquia (13), discussion sections (9),
student presentations (11), seminars (8), undergraduate
lab sessions (8), lab group and other meetings (6),
one-on-one tutorials (3), office hours (8), advising consultations (5),
dissertation defenses (4), study groups (8), interviews (3),
campus/museum tours (2), and service encounters (2).
On Your Own: Teacher-centered Task
• Say, Talk, or Tell
– Characteristics
• Transitive vs. intransitive vs. ditransitive
• Used in spoken language
• Idiomatic expressions
– Tasks
• Search MICASE for tokens of these forms
– Cut and paste example sentences from MICASE into MS Word.
» Ask learners to examine the forms
» Assess learners' ability to get the forms correct
• Search for idiomatic expressions
– Cut and paste examples of idiomatic forms
– Ask learners key questions about the examples
Example of Teacher-centered Tasks
Sample sentences and Idioms
Example of a collaborative task from
(Hartmann & Blass, 2000)
• Click on the link below to begin your search
• Using the form, meaning, and use handout—
take notes on our discussion with softening
phrases such as I think, In my opinion, It seems
to me, others?
• The learners form their own questions
• The learners browse the corpus independently; there
is no structured or controlled task
• There is very little interference from the teacher in
the generalization process
• Now it is your turn to answer those structure-related
questions that have been bothering you for years!
• Corpus of Contemporary American English (COCA)
Other tasks
• Utilizing the audio features
• Browsing the corpus to find specific speech
• MICASE activities for learners
Two examples of a student with a TA during
office hours
S1: okay
S2: you feel th- as though you're in a lab or, [S1: yeah almost ] <LAUGH>
it's a little a little bit a little bit odd. okay. uh, the reason i asked you
to come in is that, i- i'm looking at the grades and i'm looking at at
this paper and, you're at the point where i don't want you to, fall off
the edge. uh and and get a grade that's not gonna be, supportive. it
seems to me that you know that you've been in touch with things in
the class and that i, i liked what you did with your poem to change it
which wasn't_ which must have involved a fair amount of work.
[S1: (i don't know) ] to, you know to get that in a different order and
to get the system ba- was it a lot of work?
S1: mm, it wasn't too much it didn't take me too long to just, use the same
word i just, i'd say the hardest part yeah was changing the sentences.
trying to make 'em all fit again. [S2: okay ] but it wasn't too bad.
S2: okay. but the rhythm seemed to work right and, [S1: mhm ] it it really
did, come out to be a sus- sestina and one of the effects of the
sestina is that, since you're using those words over and over again
they they tend to acquire different meanings they tend to to just, they
sound different in different combinations [S1: mhm ] and and they
mean something. but let's look at this [S1: kay ] um, because i think
that that part of what's happening here, is that is that you're using a
lot of words where few words would work. where you don't really
need that that many words to say what you want to to say. and there
are some cases where you're where you're looking, or where you
seem to be saying something um, and i think i know what i know
what you want to say, but because you've sort of, you've given me
more than than i need you're really disguising the meaning [S1:
mkay ] rather than bringing the meaning out. so that, if y- if you
look at this sentence and if you just r- read that sentence aloud.
(R. C. Simpson, S. L. Briggs, J. Ovens, and J. M. Swales, 2002)
Victor: Do you have a few minutes?
Pam: Sure, I’m Pam.
Victor: I’m Victor
Pam: Hi Victor. Have a seat. How can I help you?
Victor: Well I’m in Dr. Sears’ American Lit class…and I’m having a lotta
trouble with that poetry unit. I’m thinking of dropping the class
Pam: Oh. I hate to tell you, but Friday was the last day to drop.
Victor: Oh no. I knew I should have dropped last week.
Pam: Well, it’s all right. Let’s see what we can do to get you through the
class. Guess literature isn’t your thing, huh?
Victor: It’s just this unit on poetry. I did okay with short stories.
Pam: What’s giving you problems?
Victor: I just don’t get a lot of this modern stuff. It just doesn’t seem like
poetry to me.
Pam: What exactly bothers you?
Victor: I understood the poems by Robert Frost and Maya Angelou, but
the poems in last night’s homework don’t rhyme or have rhythm or
(Hartmann & Blass, 2000)
Favorite Corpus Web Site
• Michael Barlow’s Corpus Linguistics Site
Other Links
• Spoken Corpora
– MICASE: R. C. Simpson, S. L. Briggs, J. Ovens, and J.
M. Swales. (2002) The Michigan Corpus of
Academic Spoken English. Ann Arbor, MI: The
Regents of the University of Michigan.
– Linguistic Data Consortium University of
– The Corpus of Contemporary American English
Mark Davies, Brigham Young University
– American National Corpus
– British National Corpus also available through
Mark Davies Corpus website
• Spoken Language Resources
– Bygate, M. (1998) Theoretical perspectives on speaking.
Annual Review of Applied Linguistics 18, p. 20-42
– Burns, A. (1998) Teaching speaking. Annual Review of
Applied Linguistics 18, p. 102-123
– Burns, A. & Joyce, H. (2002). Focus on speaking. Sydney:
National Center for English Language Teaching and
– McCarthy, M. (1998). Spoken language & applied
linguistics. Cambridge: Cambridge University Press.
– Celce-Murcia, M., & Larsen-Freeman, D. (1999). The
grammar book: An ESL/EFL teacher's course (2nd ed.).
Boston, MA: Heinle & Heinle.
• Corpus Linguistics Texts
– Biber, D., Conrad, S., & Reppen, R. (1998). Corpus
linguistics: Investigating language structure and use.
Cambridge: Cambridge University Press.
– Partington, A. (1998) Patterns and Meanings: Using
corpora for English language research and teaching. John
– Tribble, C. & Jones, G. (1997). Concordances in the
classroom: using corpora. A resource guide for teachers
[new edition]. Houston, TX: Athelstan
• MICASE Tips and Tutorials
• Other References
– Erades, P. A. (1943). The case against provisional It. English Studies, 25, 169-176
– Hartmann, P. & Blass, L. (2000). Quest: Listening and speaking in the academic
world, Book 3. New York: McGraw Hill.
– Johns, T. F. (1988). Whence and whither classroom concordancing? In T. Bongaerts,
P. de Haan, S. Lobbe, & H. Wekker (eds.) Computer applications in language
learning, p. 9-27. USA: Foris Publications.
– Johns, T. F. (1991) Should you be persuaded: Two examples of Data-driven learning.
In T.F. Johns & P. King (eds.) ELR Journal Vol. 4 Classroom concordancing (p. 27-46).
Birmingham CESL: The University of Birmingham Press
– Johns, T. F. (1997). Contexts: The background, development, and trialling of a
concordance-based CALL program. In A. Wichmann, S. Fligelstone, T. McEnery, &
G. Knowles (eds.) Teaching and language corpora. London: Longman.
– Riggenbach, H. (1999). Discourse analysis in the language classroom: Vol. 1. The
spoken language. Ann Arbor, MI: University of Michigan Press.
– Sripicharn, P. (2003). Implementing collaborative concordancing between teacher
and learners in the writing class. Paper presented at the 5th CULI International
Conference, Bangkok, Thailand.