Corpus Processing and NLP
Madrid 2010 Kilgarriff: Corpus Processing and NLP

What is NLP?
• Natural Language Processing
  – natural language vs. computer languages
• Other names
  – Computational Linguistics
    • emphasizes scientific not technological
  – Language Engineering
    • official European Union term, ca 1996-99
  – Human Language Technology (HLT)
    • preferred EU and US Government term
  – Language Technology

NLP and linguistics
LING: supply ideas, interpret results
NLP: test theories, expose gaps; plus turn into technology

Example: regular morphology
LINGUISTICS:
  – Rules: stems -> inflected forms
NLP:
  – program the rules
  – apply rules to a lexicon of stems
  – Is the output correct? Errors?
LINGUISTICS:
  – refine the theory
Needed for: web search, spell-checkers, machine translation, speech recognition systems etc.

Application areas
• web search
  – Basic search
  – Filtering results
• spelling and grammar checking
• machine translation (MT)
• talking to computers
  – speech processing as well
• information extraction (IE)
  – finding facts in a database of documents; populating a database, answering questions

How can NLP make better dictionaries?
By pre-processing a corpus:
• tokenization
• sentence splitting
• lemmatization
• POS-tagging
• parsing
Each step builds on predecessors

Tokenization
“identifying the words”
from: he didn't arrive.
to: He did n’t arrive .

Automatic tokenization
• Western writing systems
  – easy! space is separator
• Chinese, Japanese, some other writing systems
  – do not use word-separator
  – hard
    • like POS-tagging (below)

Why isn't space=separator enough (even for English)?
• what is a space
  – linebreaks, paragraph breaks, tabs
• Punctuation
  – characters do not form parts of words but may be attached to words (with no spaces)
    • brackets, quotation marks
• Hyphenation
  – is co-op one word or two? is well-managed?

Sentence splitting
“identifying the sentences”
from: he didn't arrive.
to: He did n’t arrive .
to: <s> He did n’t arrive . </s>

Lemmatization
Mapping from text-word to lemma
Lemma: help (verb)

text-word    lemma
help         help (v)
helps        help (v)
helping      help (v)
helped       help (v)

Lemmatization
Mapping from text-word to lemma
Lemmas: help (verb), help (noun), helping (noun)

text-word    lemma
help         help (v), help (n)
helps        help (v), help (n)**
helping      help (v), helping (n)
helped       help (v)
helpings     helping (n)

**help (n): usually a mass noun, but part of compound home help which is a count noun, taking the "s" ending.

Lemmatization
Dictionary entries are for lemmas, so lemmatization is required for a match between text-word and dictionary-word.

Lemmatization
• Searching by lemma
  – English: little inflection
  – French: 36 forms per verb
  – Finno-Ugric: 2000.
• Not always wanted:
  – English royalty
    • singular: kings and queens
    • plural royalties: payments to authors

Automatic lemmatization
• Write rules:
  – if word ends in "ing", delete "ing";
  – if the remainder is verb lemma, add to list of possible lemmas
• If detailed grammar available, use it
• full lemma list is also required
  – Often available from dictionary companies

Part-of-speech (POS) tagging
“identifying parts of speech”
from: he didn't arrive.
to: …
to:
<s>
He        PNP     pers pronoun
did       VVD     past tense verb
n’t       XNOT    not
arrive    VV      base form of verb
.         C       punctuation
</s>

Tagsets
• The set of part-of-speech tags to choose between
  – Basic: noun, verb, pronoun …
  – Advanced: examples - CLAWS English tagset
    • NN2    plural noun
    • VVG    -ing form of lexical verb
• Based on linguistics of the language.

POS-tagging: why?
• Use grammar when searching
  – Nouns modified by buckle
  – Verbs that buckle is object of

POS-tagging: how?
• Big topic for computational linguistics
  – well understood
  – taggers available for major languages
• Some taggers use lemmatized input, others do not
• Methods
  – constraint-based: set of rules of the form
    if previous word is "the" and VERB is one of the possibilities, delete VERB
  – Statistical:
    • Machine learning from tagged corpus
    • Various methods
    • Ref: Manning and Schütze, Foundations of Statistical Natural Language Processing, MIT Press 1999.

Parsing
• Find the structure:
  – Phrase structure (trees)
    [Figure: phrase-structure tree for "The cat sat on the mat"]
  – Dependency structure (links)
    [Figure: dependency links for "The cat sat on the mat"]

Automatic parsing
• Big topic
  – see Jurafsky and Martin or other NLP textbook
• Many methods too slow for large corpora
• Sketch Engine usually uses “shallow parsing”
  – Patterns of POS-tags
  – Regular expressions

Regular expressions
• Search for any pattern
• Very useful in lots of places
• Exercises
  – http://www.sketchengine.co.uk/exercises/regex

Summary
• What is NLP?
• How can it help?
  – Tokenizing
  – Sentence splitting
  – Lemmatizing
  – POS-tagging
  – Parsing

Exercise
• A sentence of your language
• A tagset of your language
• Tokenize
• For each word, decide
  – What is the lemma (doesn’t apply in Chinese)
  – Which tag applies

Word         Lemma       Tag
Visiting     visit       VVG
relatives    relative    NN2
…
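
For checking exercise answers by hand, the first three pre-processing steps (tokenization, sentence splitting, rule-based lemmatization) can be sketched in a few lines of Python. This is an illustration only, not the Sketch Engine pipeline. The function names, the toy VERB_LEMMAS lexicon, the treatment of hyphenated words as single tokens, and the "ed" and "s" suffix rules are assumptions added here; the slides state only the "ing" rule.

```python
import re

# Hypothetical toy lexicon of verb lemmas. The slides note that a full
# lemma list is required in practice; this stand-in is for illustration.
VERB_LEMMAS = {"help", "arrive", "visit"}


def tokenize(text):
    """Split text into word and punctuation tokens.

    Assumptions: the clitic n't becomes its own token (did + n't), and
    hyphenated forms such as co-op are kept as one token.
    """
    # Detach the clitic and normalize the apostrophe to ASCII.
    text = re.sub(r"n['’]t\b", " n't", text)
    # Match the clitic, then (possibly hyphenated) words, then any
    # single punctuation character.
    return re.findall(r"n't|\w+(?:-\w+)*|[^\w\s]", text)


def split_sentences(tokens):
    """Group tokens into sentences, closing one at each ., ! or ?"""
    sentences, current = [], []
    for tok in tokens:
        current.append(tok)
        if tok in {".", "!", "?"}:
            sentences.append(current)
            current = []
    if current:  # trailing tokens with no final punctuation
        sentences.append(current)
    return sentences


def candidate_lemmas(word, verb_lemmas=VERB_LEMMAS):
    """Return possible verb lemmas for a text-word.

    Follows the slide rule: strip a suffix and keep the remainder if it
    is a known verb lemma. The "ed" and "s" suffixes are an assumption.
    """
    word = word.lower()
    candidates = []
    if word in verb_lemmas:
        candidates.append(word)
    for suffix in ("ing", "ed", "s"):
        if word.endswith(suffix):
            stem = word[: -len(suffix)]
            if stem in verb_lemmas and stem not in candidates:
                candidates.append(stem)
    return candidates


# Small demonstration on the slide's example sentence.
for sentence in split_sentences(tokenize("He didn't arrive. Visiting relatives helps.")):
    print("<s>", " ".join(sentence), "</s>")
```

The lemmatizer handles only verb readings and plain suffix stripping, so it finds "visit" for "Visiting" but nothing for "arriving" (the remainder "arriv" is not a lemma) and it does not propose the noun lemmas "helping" or "relative". Those gaps are the kind of error the slides suggest feeding back into the rules.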