Talking about your homework
• News story?
  – What made you choose…?
• One of your words?
  – What made you choose…?
• (Give your vocabulary books to another student. He or she will test you. Give clues!)
  – What does ___ mean?
  – How many senses does ___ have?
  – What words does ___ collocate with?
  – How many grammar frames does it occur in?
  – Can you remember the example sentence?

Today
• Review and discuss homework
• Introduction to language corpus study
• Start on Food and diet

Giving and following instructions
• Student 1: Give step-by-step instructions for finding the top collocations and best sentences
• Student 2: Use a laptop computer and follow the instructions exactly

Introduction to corpus linguistics
Simon Smith & Adam Kilgarriff

Plan for today
• Short review of corpus basics
• 4 ages of corpus research
  – From pre-computer age, to SkE
• Functions of SkE
• Demonstration of SkE in use

Quiz
• What’s a (linguistic) corpus?
• What does the Latin word mean?
• What are corpora?
• What’s the BNC?
• How big is the British National Corpus?
• What is the advantage of having a very big corpus?
• What can corpora be used for?

5 major uses for linguistic corpora
• Language learning and teaching
• Theoretical research on Language and Linguistics
• Literary research and analysis
• Language technology
• Lexicography (= dictionary making)
  – Cobuild, Longman, …
  – All learner dictionaries now use corpora

How do you make a dictionary? (What sources can you use?)
• Use your own knowledge of words
• Ask all your friends for their knowledge
• Consult other dictionaries
  – and copy them
• Read thousands of books
  – and take lots of notes
• Use a corpus

Four ages of corpus research (in lexicography)
Age 1: Pre-computer
Age 2: KWIC concordance (KWIC=?)
Age 3: Corpus query tools e.g.
Sketch Engine

Kilgarriff, Lexical Computing Taiwan, Dec 2006

Age 1: Pre-computer
First Oxford English Dictionary (1860):
• 20 million index cards
  – a word (usually rare) and a citation

Age 2: KWIC Concordance
arity, which will be used to take a party of under-privileged child
 from outside. You are invited to a party and after a couple of drin
tion, we believe politicians of all parties will listen to our views
ould be reaching agreement with all parties concerned, as to which
 lack people. I have certainly been party to one or two discussions
. These should be discussed by both parties before entering into the
presents They had hosted a cocktail party at Kensington palace, for
akes. By midnight the end-of-course party is in full swing, but mo
e should be a right for the injured party to terminate the contract.
 by the Safran Peoples ' Liberation Party. This presents the powerf
s. Ahead I could see the rest of my party plodding towards the final
 cial ethic. The two main political parties - the Tories and the Lib
ritish successes in Perth The small party of British players competi
 to help control. One member of the party went to summon the rescue
 rket society fashion magazine. The party was held at his flat which
 security and secrecy than any Tory Party Conference : it seems that

Age 2 (~1980-1990): KWIC Concordances
• Using computers
• List of lines which contain a keyword
• The keyword is in the middle

The coloured pens method
 1  arity, which will be used to take a party of under-privileged children to D
 2   from outside. You are invited to a party and after a couple of drinks you d
 3  tion, we believe politicians of all parties will listen to our views. &equo
 4  ould be reaching agreement with all parties concerned, as to which events,
 5   lack people. I have certainly been party to one or two discussions amongst
 6  . These should be discussed by both parties before entering into the relatio
 7  presents They had hosted a cocktail party at Kensington palace, for example
 8  akes. By midnight the end-of-course party is in full swing, but most cadet
 9  e should be a right for the injured party to terminate the contract. A mana
10   by the Safran Peoples ' Liberation Party. This presents the powerful neigh
11  s. Ahead I could see the rest of my party plodding towards the final slope t
12   cial ethic. The two main political parties - the Tories and the Liberals
13  ritish successes in Perth The small party of British players competing in th
14   to help control. One member of the party went to summon the rescue team and
15   rket society fashion magazine. The party was held at his flat which was a l
16   security and secrecy than any Tory Party Conference : it seems that bootleg

Senses:
1 political association
2 social event
3 group of people
4 person in an agreement/dispute
5 to be party to something...

Age 2: limitations
As corpora get bigger: too much data
• 50 lines for a word: read all
• 500 lines: could read all, takes a long time
• 5000 lines: impossible

Why do corpora keep getting bigger? (anyone?)
• Improvements in technology
  – Price of storage is going down
  – Speed of access is going up
• Representativeness
  – Small corpus: many examples of common words, maybe
  – But not enough examples of unusual words

Lexical distribution
• What’s the most common word in English?
• What % does it make up of a whole corpus?
• The 100 most common words make up __% of all the words in a corpus?
• The 7500 most common words make up __%
• Answers:
  – The, 5%, 45% and 90%
• So:
  – you need massive corpora, if you want to really represent rare words properly

Limitation of KWIC analysis
• As corpora get bigger: too much data
  – 50 lines for a word: read all
  – 500 lines: could read all, takes a long time
  – 5000 lines: no
• Instead, look at a Word Sketch from Sketch Engine
  – a statistical summary of word usage
  – shows most common collocates

Functions of SkE
• KWIC concordance
  – Sorting, filtering etc
• Word sketch
• Automatic thesaurus
• Sketch difference
  – discriminate near-synonyms

Lexical approach to language learning
• Lewis (1993) and Schmitt (2000) say
  – the vocab is stored in the brain in collocations
  – Bacon is stored near eggs
  – 蛋 (egg) is stored near 炒飯 (fried rice)
  – scotch is stored with whisky
• Saying strong car or powerful tea or broken house seems very “foreign”

From www.teachingenglish.org - a lexical approach activity, based on a story text

Unit 7 Food and diet

Fruit and veg
• What fruit and veg do you like?
• How many servings of fruit and veg do you eat each day?
• Is that enough?
• Do you have a good diet?
  – What other kinds of good food do you often eat?
  – What junk food do you eat?

Food pyramid (p 119)
• Label the pyramid in your book
• Look at http://mypyramid.gov
• Compare the two
  – See http://www.mypyramid.gov/downloads/MyPyramid_Anatomy.pdf
• Use the website to see how much you should be eating from each food group.
  – Now, comment on your diet!

Genetically modified food
• (picture)
• Guess answers to p 123a
• Read