Sketch Engine

Talking about your homework
• News story?
– What made you choose…?
• One of your words?
– What made you choose…?
• (Give your vocabulary books to another
student. He or she will test you. Give clues!)
– What does ___ mean?
– How many senses does ___ have?
– What words does ___ collocate with?
– How many grammar frames does it occur in?
– Can you remember the example sentence?
Today
• Review and discuss homework
• Introduction to language corpus study
• Start on Food and diet
Giving and following
instructions
• Student 1: Give step-by-step
instructions for finding the top
collocations and best sentences
• Student 2: Use a laptop computer and
follow the instructions exactly
Introduction to corpus
linguistics
Simon Smith & Adam Kilgarriff
Plan for today
• Short review of corpus basics
• 4 ages of corpus research
– From pre-computer age, to SkE
• Functions of SkE
• Demonstration of SkE in use
Quiz
• What’s a (linguistic) corpus?
• What does the Latin word mean?
• What are corpora?
• What’s the BNC?
• How big is the British National Corpus?
• What is the advantage of having a very
big corpus?
• What can corpora be used for?
5 major uses for linguistic
corpora
• Language learning and teaching
• Theoretical research on Language and
Linguistics
• Literary research and analysis
• Language technology
• Lexicography (= dictionary making)
– Cobuild, Longman, …
– All learner dictionaries now use corpora
How do you make a dictionary?
(What sources can you use?)
• Use your own knowledge of words
• Ask all your friends for their
knowledge
• Consult other dictionaries
– and copy them
• Read thousands of books
– and take lots of notes
• Use a corpus
Four ages of corpus
research (in lexicography)
Age 1: Pre-computer
Age 2: KWIC concordance (KWIC=?)
Age 3: Corpus query tools, e.g. Sketch Engine
Kilgarriff, Lexical Computing
Taiwan, Dec 2006
Age 1:
Pre-computer
First Oxford English Dictionary (1860):
• 20 million index cards
– a word (usually rare) and a citation
Age 2: KWIC Concordance
arity, which will be used to take a  party of under-privileged child
 from outside. You are invited to a  party and after a couple of drin
tion, we believe politicians of all  parties will listen to our views
ould be reaching agreement with all  parties concerned, as to which
 lack people. I have certainly been  party to one or two discussions
. These should be discussed by both  parties before entering into the
presents They had hosted a cocktail  party at Kensington palace, for
akes. By midnight the end-of-course  party is in full swing, but mo
e should be a right for the injured  party to terminate the contract.
 by the Safran Peoples ' Liberation  Party. This presents the powerf
s. Ahead I could see the rest of my  party plodding towards the final
 cial ethic. The two main political  parties - the Tories and the Lib
ritish successes in Perth The small  party of British players competi
 to help control. One member of the  party went to summon the rescue
 rket society fashion magazine. The  party was held at his flat which
 security and secrecy than any Tory  Party Conference : it seems that
Age 2 (~1980-1990): KWIC
Concordances
Using computers
• List of lines which contain a keyword
• The keyword is in the middle
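A KWIC (keyword in context) listing like the one on the previous slide can be produced with a few lines of code. The sketch below is only an illustration, not how any particular concordancer is implemented: the function name `kwic`, the `width` parameter and the sample text (stitched together from fragments of the slide's concordance lines) are all assumptions made for this example.

```python
import re

def kwic(text, pattern, width=35):
    """Return one concordance line per match of `pattern` in `text`.

    Each line has `width` characters of left context (right-aligned),
    then the matched keyword, then `width` characters of right context,
    so the keyword starts in the same column on every line.
    """
    flat = " ".join(text.split())  # collapse newlines and repeated spaces
    lines = []
    for m in re.finditer(pattern, flat, flags=re.IGNORECASE):
        left = flat[max(0, m.start() - width):m.start()]
        right = flat[m.end():m.end() + width]
        lines.append(f"{left:>{width}}{m.group(0)}{right}")
    return lines

# Made-up sample text, assembled from fragments shown on the slide.
sample = (
    "They had hosted a cocktail party at Kensington palace. "
    "The two main political parties - the Tories and the Liberals. "
    "One member of the party went to summon the rescue team."
)

# The pattern matches both "party" and "parties".
for line in kwic(sample, r"\bpart(?:y|ies)\b", width=30):
    print(line)
```

Sorting the returned lines by the text to the right (or left) of the keyword is the usual next step, because it groups similar uses together.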
The coloured pens method
 1  arity, which will be used to take a  party of under-privileged children to D
 2   from outside. You are invited to a  party and after a couple of drinks you d
 3  tion, we believe politicians of all  parties will listen to our views.
 4  ould be reaching agreement with all  parties concerned, as to which events,
 5   lack people. I have certainly been  party to one or two discussions amongst
 6  . These should be discussed by both  parties before entering into the relatio
 7  presents They had hosted a cocktail  party at Kensington palace, for example
 8  akes. By midnight the end-of-course  party is in full swing, but most cadet
 9  e should be a right for the injured  party to terminate the contract. A mana
10   by the Safran Peoples ' Liberation  Party. This presents the powerful neigh
11  s. Ahead I could see the rest of my  party plodding towards the final slope t
12   cial ethic. The two main political  parties - the Tories and the Liberals
13  ritish successes in Perth The small  party of British players competing in th
14   to help control. One member of the  party went to summon the rescue team and
15   rket society fashion magazine. The  party was held at his flat which was a l
16   security and secrecy than any Tory  Party Conference : it seems that bootleg
1 political association
2 social event
3 group of people
4 person in an agreement/dispute
5 to be party to something...
Age 2: limitations
as corpora get bigger:
too much data
• 50 lines for a word: read all
• 500 lines: could read all, takes a long time
• 5000 lines: impossible
Why do corpora keep getting
bigger? (anyone?)
• Improvements in technology
– Price of storage is going down
– Speed of access is going up
• Representativeness
– Small corpus → many examples of
common words, maybe
– But not enough examples of unusual
words
Lexical distribution
• What’s the most common word in English?
• What % does it make up of a whole corpus?
• The 100 most common words make up __% of
all the words in a corpus?
• The 7500 most common words make up __%
• Answers:
– The, 5%, 45% and 90%
• So:
– you need massive corpora, if you want to really
represent rare words properly
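The percentages above can be checked on any text by counting word frequencies. The following is a minimal sketch, assuming a very simple tokenizer; the names `tokenize` and `coverage` and the toy sentence are invented for this example. A text this small will not reproduce the 5% / 45% / 90% figures, which only emerge from a corpus of millions of words.

```python
import re
from collections import Counter

def tokenize(text):
    """Lowercase the text and split it into simple word tokens."""
    return re.findall(r"[a-z']+", text.lower())

def coverage(tokens, top_n):
    """Fraction of all tokens that belong to the top_n most frequent words."""
    counts = Counter(tokens)
    covered = sum(freq for _, freq in counts.most_common(top_n))
    return covered / len(tokens)

# Toy text; replace it with the contents of a real corpus file.
text = "the cat sat on the mat and the dog sat on the log"
tokens = tokenize(text)

print("most common word:", Counter(tokens).most_common(1)[0])
for n in (1, 3, 5):
    print(f"top {n} words cover {coverage(tokens, n):.0%} of the text")
```

Running the same function with `top_n` set to 100 and 7500 on a large corpus is one way to test the answers on the slide.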
Limitation of KWIC analysis
• As corpora get bigger: too much data
– 50 lines for a word: read all
– 500 lines: could read all, takes a long time
– 5000 lines: no
• Instead, look at a Word Sketch from
Sketch Engine
– a statistical summary of word usage
– shows most common collocates
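To give a feel for what "a statistical summary" means, the sketch below ranks the words that occur next to a chosen word using logDice, the association score that Sketch Engine documents for its word sketches: 14 + log2(2·f(x,y) / (f(x) + f(y))). This is a toy version under stated assumptions. It counts neighbours in a plain word window, whereas a real word sketch groups collocates by grammatical relation; the function name `collocates`, its parameters and the sample sentence are invented for illustration.

```python
import math
from collections import Counter

def collocates(tokens, node, window=2, min_freq=1):
    """Rank words found within +/- `window` tokens of `node` by logDice."""
    word_freq = Counter(tokens)
    pair_freq = Counter()
    for i, tok in enumerate(tokens):
        if tok != node:
            continue
        lo, hi = max(0, i - window), min(len(tokens), i + window + 1)
        for j in range(lo, hi):
            if j != i:
                pair_freq[tokens[j]] += 1  # count each neighbour of the node word
    scored = []
    for word, f_xy in pair_freq.items():
        if f_xy < min_freq:
            continue
        # logDice = 14 + log2(2 * f(x,y) / (f(x) + f(y)))
        log_dice = 14 + math.log2(2 * f_xy / (word_freq[node] + word_freq[word]))
        scored.append((word, f_xy, round(log_dice, 2)))
    return sorted(scored, key=lambda row: row[2], reverse=True)

# Made-up sample; a real sketch needs a corpus of millions of words.
tokens = ("strong tea and powerful car and strong tea "
          "with strong coffee and powerful engine").split()

for word, freq, score in collocates(tokens, "strong", window=1):
    print(word, freq, score)
```

With real data the `min_freq` threshold matters: very rare pairs get unreliable scores, so they are usually filtered out before ranking.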
Functions of SkE
• KWIC concordance
– Sorting, filtering etc
• Word sketch
• Automatic thesaurus
• Sketch difference
– discriminate near-synonyms
Lexical approach to language
learning
• Lewis (1993) and Schmitt (2000) say
– the vocab is stored in the brain in collocations
– Bacon is stored near eggs
– 蛋 (egg) is stored near 炒飯 (fried rice)
– scotch is stored with whisky
• Saying strong car or powerful tea or broken
house seems very “foreign”
From www.teachingenglish.org - a lexical
approach activity, based on a story text
Unit 7
Food and diet
Fruit and veg
• What fruit and veg do you like?
• How many servings of fruit and veg do
you eat each day?
• Is that enough?
• Do you have a good diet?
– What other kinds of good food do you
often eat?
– What junk food do you eat?
Food pyramid (p 119)
• Label the pyramid in your book
• Look at http://mypyramid.gov
• Compare the two
– See http://www.mypyramid.gov/downloads/MyPyramid_Anatomy.pdf
• Use the website to see how much you
should be eating from each food
group.
– Now, comment on your diet!
Genetically modified food
• (picture)
• Guess answers to p 123a
• Read