School of Computing, Faculty of Engineering
NLP/CL: Review
Eric Atwell, Language Research Group (with thanks to other contributors)

Objectives of the module
On completion of this module, students should be able to:
- understand theory and terminology of empirical modelling of natural language;
- understand and use algorithms, resources and techniques for implementing and evaluating NLP systems;
- be familiar with some of the main language engineering application areas;
- appreciate why unrestricted natural language processing is still a major research task.

In a nutshell:
• Why NLP is difficult: language is a complex system
• How to solve it? Corpus-based machine-learning approaches
• Motivation: applications of “The Language Machine”

The main sub-areas of linguistics
◮ Phonetics and Phonology: the study of linguistic sounds or speech.
◮ Morphology: the study of the meaningful components of words.
◮ Syntax: the study of the structural relationships between words.
◮ Semantics: the study of meanings of words, phrases, sentences.
◮ Discourse: the study of linguistic units larger than a single utterance.
◮ Pragmatics: the study of how language is used to accomplish goals.

Python, NLTK, WEKA
Python: a good programming language for NLP
• Interpreted
• Object-oriented
• Easy to interface to other things (text files, web, DBMS)
• Good stuff from: Java, Lisp, Tcl, Perl
• Easy to learn
• FUN!
NLTK: the Natural Language Toolkit for Python, with demos and tutorials
WEKA: machine learning toolkit; classifiers, e.g. J48 decision trees

Why is NLP difficult?
Computers are not brains
• There is evidence that much of language understanding is built into the human brain
Computers do not socialize
• Much of language is about communicating with people
Key problems:
• Representation of meaning and hidden structure
• Language presupposes knowledge about the world
• Language is ambiguous: a message can have many interpretations
• Language presupposes communication between people

Ambiguity: Grammar (PoS) and Meaning
• Iraqi Head Seeks Arms
• Juvenile Court to Try Shooting Defendant
• Teacher Strikes Idle Kids
• Kids Make Nutritious Snacks
• British Left Waffles on Falkland Islands
• Red Tape Holds Up New Bridges
• Bush Wins on Budget, but More Lies Ahead
• Hospitals are Sued by 7 Foot Doctors
(Headlines leave out punctuation and function-words.)
Lynne Truss, 2003. Eats, Shoots & Leaves: The Zero Tolerance Approach to Punctuation

The Role of Memorization
Children learn words quickly
• Around age two they learn about 1 word every 2 hours (roughly 9 words a day)
• Often only one exposure is needed to associate a meaning with a word
• They make mistakes, e.g. overgeneralization: “I goed to the store.”
• Exactly how they do this is still under study
Adult vocabulary
• Typical adult: about 60,000 words
• Literate adults: about twice that

But there is too much to memorize!
• establish
• establishment (e.g. the establishment of the Church of England as the official state church)
• disestablishment
• antidisestablishment
• antidisestablishmentarian
• antidisestablishmentarianism: a political philosophy opposed to the separation of church and state
MAYBE we don’t remember every word separately; MAYBE we remember MORPHEMES and how to combine them.
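As a toy illustration of that idea, the Python sketch below stores a handful of morphemes instead of whole word forms and peels known affixes off a long word. The affix lists, the root lexicon and the greedy stripping rule are all invented for illustration; real analysers (e.g. the Porter stemmer or PC-Kimmo, covered later) are far more sophisticated.

    # Toy morphological segmentation: store a few MORPHEMES and how they combine,
    # rather than memorising every word form in the establish... family.
    # The affix lists and root lexicon below are illustrative assumptions only.
    PREFIXES = ["anti", "dis"]
    SUFFIXES = ["ism", "arian", "ment"]
    ROOTS = {"establish"}

    def segment(word):
        """Greedily strip known prefixes and suffixes until a known root remains."""
        prefixes, suffixes = [], []
        stripped = True
        while stripped and word not in ROOTS:
            stripped = False
            for p in PREFIXES:
                if word.startswith(p):
                    prefixes.append(p)
                    word = word[len(p):]
                    stripped = True
                    break
            for s in SUFFIXES:
                if word.endswith(s) and len(word) > len(s):
                    suffixes.insert(0, s)
                    word = word[:-len(s)]
                    stripped = True
                    break
        return prefixes + [word] + suffixes

    print(segment("antidisestablishmentarianism"))
    # ['anti', 'dis', 'establish', 'ment', 'arian', 'ism']

Six stored morphemes cover the whole word family above; the economy is the point, not the toy rule itself.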
Rationalism v Empiricism
Rationalism: the doctrine that knowledge is acquired by reason without regard to experience (Collins English Dictionary)
Noam Chomsky, 1957, Syntactic Structures
- argued that we should build models through introspection
- a language model is a set of rules thought up by an expert
Like “Expert Systems”…
Chomsky thought data was full of errors; better to rely on linguists’ intuitions…

Empiricism v Rationalism
Empiricism: the doctrine that all knowledge derives from experience (Collins English Dictionary)
The field was stuck for quite some time: rationalist linguistic models built for a specific example did not generalise.
A new approach started around 1990: Corpus Linguistics
• Well, not really new, but from the 50s to the 80s researchers didn’t have the text, disk space, or GHz
Main idea: machine learning from CORPUS data
How to do corpus linguistics:
• Get a large text collection (a corpus; plural: corpora)
• Compute statistical models over the words/PoS/parses/… in the corpus
Surprisingly effective

Example Problem
Grammar checking example: which word to use, <principal> or <principle>?
Empirical solution: look at which words surround each use:
• I am in my third year as the principal of Anamosa High School.
• School-principal transfers caused some upset.
• This is a simple formulation of the quantum mechanical uncertainty principle.
• Power without principle is barren, but principle without power is futile. (Tony Blair)

Using Very Large Corpora
Keep track of which words are the neighbours of each spelling in well-edited text, e.g.:
• Principal: “high school”
• Principle: “rule”
At grammar-check time, choose the spelling best predicted by the probability of co-occurring with the surrounding words (a toy co-occurrence sketch appears further below).
No need to “understand the meaning”!?
Surprising results:
• Log-linear improvement even up to a billion words!
• Getting more data is better than fine-tuning algorithms!

The Effects of LARGE Datasets
From Banko & Brill, 2001. Scaling to Very Very Large Corpora for Natural Language Disambiguation, Proc ACL

Corpus, word tokens and types
Corpus: text selected by language, genre, domain, …
Examples: Brown, LOB, BNC, Penn Treebank, MapTask, CCA, …
Corpus annotation: text headers, PoS, parses, …
Corpus size is the number of words, which depends on tokenisation
We can count word tokens, word types, and the type-token distribution
Lexeme/lemma is the “root form”, v inflections (be v am/is/was…)

Tokenization and Morphology
Tokenization: by whitespace, regular expressions
Problem cases: “It’s”, “data-base”, “New York”, …
Jabberwocky shows we can break words into morphemes
Morpheme types: root/stem, affix, clitic
• Derivational vs. Inflectional
• Regular vs. Irregular
• Concatenative vs. Templatic (root-and-pattern)
Morphological analysers: Porter stemmer, Morphy, PC-Kimmo
Morphology by lookup: CatVar, CELEX, OALD++

Corpus word-counts and n-grams
FreqDist counts of tokens and their distribution can be useful
• e.g. find the main characters in Gutenberg texts
• e.g. compare word-lengths in different languages
Humans can predict the next word…
N-gram models are based on counts in a large corpus
Auto-generate a story… (but the generator gets stuck in a local maximum)
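Here is the promised co-occurrence sketch for principal/principle. The neighbour counts are invented for illustration (a real grammar checker would gather them from a large, well-edited corpus), and the smoothing is the simplest possible.

    # Toy confusable-word chooser: pick "principal" vs "principle" from the words
    # around the gap.  The neighbour counts below are invented for illustration;
    # a real system would collect them from a large, well-edited corpus.
    NEIGHBOUR_COUNTS = {
        "principal": {"school": 120, "high": 95, "deputy": 40, "uncertainty": 1},
        "principle": {"uncertainty": 80, "rule": 60, "moral": 45, "school": 2},
    }

    def choose_spelling(context_words, candidates=("principal", "principle")):
        """Score each candidate by its (add-one smoothed) co-occurrence counts."""
        def score(candidate):
            counts = NEIGHBOUR_COUNTS[candidate]
            total = 1.0
            for word in context_words:
                total *= counts.get(word, 0) + 1  # add-one smoothing for unseen neighbours
            return total
        return max(candidates, key=score)

    print(choose_spelling(["high", "school"]))       # -> principal
    print(choose_spelling(["uncertainty", "rule"]))  # -> principle

No meaning is “understood” anywhere: the decision is pure counting, which is exactly why adding more corpus data keeps improving it.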
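The “auto-generate a story” demo above can be sketched with NLTK’s ConditionalFreqDist, much as in the NLTK book’s bigram generation example (the Genesis corpus must be downloaded on first use):

    import nltk
    # nltk.download('genesis')   # fetch the corpus on first use

    # Build a bigram model: for each word, count which words follow it in the corpus.
    words = nltk.corpus.genesis.words('english-kjv.txt')
    cfd = nltk.ConditionalFreqDist(nltk.bigrams(words))

    def generate(cfd, word, n=20):
        """Greedy generation: always pick the single most frequent next word."""
        output = []
        for _ in range(n):
            output.append(word)
            word = cfd[word].max()
        return ' '.join(output)

    print(generate(cfd, 'living'))

The greedy choice soon starts repeating itself (typically something like “… of the land of the land …”): the generator is stuck in a local maximum. Sampling from the distribution instead of always taking the maximum gives more varied text.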
Word-counts follow Zipf’s Law
Zipf’s law applies to a word type-token frequency distribution: frequency is proportional to the inverse of the rank in a ranked list:
f * r = k, where f is frequency, r is rank, and k is a constant
i.e. a few very common words, a small-to-medium number of middle-frequency words, and a long tail of low-frequency words (frequency 1)
Chomsky argued against corpus evidence because it is finite and limited compared to introspection; Zipf’s law shows that many words/structures appear only once or not at all in any given corpus, arguably supporting the argument that corpus evidence is limited compared to introspection.

Kilgarriff’s Sketch Engine
Sketch Engine shows a Word Sketch, a list of collocates: words co-occurring with the target word more frequently than predicted by their independent probabilities
A lexicographer can colour-code groups of related collocates, indicating different senses or meanings of the target word
With a large corpus the lexicographer should find all current senses, better than relying on intuition/introspection
Large user-base of experience; used in the development of several published dictionaries for English
For minority languages with few existing corpus resources, Sketch Engine is combined with Web-Bootcat so that lexicographers can collect their own Web-as-Corpus

Parts of Speech
Parts of Speech group words into grammatical categories … and separate different functions of a word
In English, many words are ambiguous: 2 or more PoS-tags
Very simple tagger: everything is NN
Better PoS-taggers: unigram, bigram, trigram, Brill, …

Training and Testing of Machine Learning Algorithms
Algorithms that “learn” from data see a set of examples and try to generalize from them.
Training set:
• the examples trained on
Test set:
• also called held-out data or unseen data
• use this for testing your algorithm
• must be separate from the training set
• otherwise, you cheated!
“Gold Standard”:
• a test set that a community has agreed on and uses as a common benchmark; use it for final evaluation

Grammar and Parsing
Context-Free Grammars and Constituency
Some common CFG phenomena for English:
• Sentence-level constructions
• NP, PP, VP
• Coordination
• Subcategorization
Top-down and Bottom-up Parsing
Problems with context-free grammars and parsers

Parse-trees show the syntactic structure of sentences
Key constituents: S, NP, VP, PP
You can draw a parse-tree and the corresponding CFG
Problems with Context-Free Grammar:
• Coordination: X → X and X is a meta-rule, not a strict CFG rule
• Agreement: needs duplicate CFG rules for singular/plural etc.
• Subcategorization: needs separate CFG non-terminals for transitive/intransitive/…
• Movement: the object/subject of a verb may be “moved” in questions
• Dependency parsing captures deeper semantics but is harder
• Parsing: top-down v bottom-up v combined
• Ambiguity causes backtracking, so a CHART PARSER stores partial parses

Parsing sentences left-to-right
The horse raced past the barn
[S [NP [A the] [N horse] NP] [VP [V raced] [PP [I past] [NP [A the] [N barn] NP] PP] VP] S]
The horse raced past the barn fell
[S [NP [NP [A the] [N horse] NP] [VP [V raced] [PP [I past] [NP [A the] [N barn] NP] PP] VP] NP] [VP [V fell] VP] S]

Chunking or Shallow Parsing
Break text up into non-overlapping contiguous subsets of tokens.
Shallow parsing or chunking is useful for:
• Entity recognition: people, locations, organizations
• Studying linguistic patterns: gave NP; gave up NP in NP; gave NP NP; gave NP to NP
• Prosodic phrase breaks (pauses in speech)
Chunking can ignore complex structure when it is not relevant
Chunking can be done via regular expressions over PoS-tags
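As an aside on the garden-path parses above, here is a minimal chart-parsing sketch with NLTK. The toy CFG is invented for illustration (with 'raced' listed both as a main verb and as a reduced-relative participle), and the category labels follow common NLTK-style names (Det, P) rather than the slide’s A and I:

    import nltk

    # A toy CFG for the garden-path example.  Grammar invented for illustration:
    # 'raced' appears both as a main verb (V) and as a participle (Vpart) that can
    # post-modify a noun, which is what makes the second sentence parseable.
    grammar = nltk.CFG.fromstring("""
        S      -> NP VP
        NP     -> Det N | Det N VPpart
        VP     -> V | V PP
        VPpart -> Vpart PP
        PP     -> P NP
        Det    -> 'the'
        N      -> 'horse' | 'barn'
        V      -> 'raced' | 'fell'
        Vpart  -> 'raced'
        P      -> 'past'
    """)

    parser = nltk.ChartParser(grammar)   # the chart stores partial parses
    for sent in (['the', 'horse', 'raced', 'past', 'the', 'barn'],
                 ['the', 'horse', 'raced', 'past', 'the', 'barn', 'fell']):
        print(' '.join(sent))
        for tree in parser.parse(sent):
            print(tree)

Both sentences parse: the first uses 'raced' as the main verb, while the second packs “the horse raced past the barn” into one NP (here via a VPpart post-modifier) with 'fell' as the main verb, mirroring the bracketings above.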
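And a minimal NLTK chunking sketch using a regular expression over PoS-tags, as just mentioned. The NP pattern is the usual textbook one (optional determiner, any adjectives, then a noun); the tokenizer and tagger models may need an nltk.download on first use.

    import nltk
    # first use may need: nltk.download('punkt'); nltk.download('averaged_perceptron_tagger')

    sentence = "The little yellow dog barked at the cat in the garden"
    tagged = nltk.pos_tag(nltk.word_tokenize(sentence))

    # An NP chunk is defined purely by a regular expression over PoS-tags:
    # optional determiner (DT), any number of adjectives (JJ), then a noun (NN).
    chunker = nltk.RegexpParser("NP: {<DT>?<JJ>*<NN>}")
    print(chunker.parse(tagged))

The output should group “The little yellow dog”, “the cat” and “the garden” into NP chunks while leaving the verbs and prepositions unchunked: flat, non-overlapping structure, with no full parse needed.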
Information Extraction
Partial parsing gives us NP chunks…
IE: Named Entity Recognition
• people, places, companies, dates, etc.
In a cohesive text, some NPs refer to the same thing/person
Needed: an algorithm for NP coreference resolution, e.g. “Hudda”, “ten of hearts”, “Mrs Anthrax”, “she” all refer to the same Named Entity

Semantics: Word Sense Disambiguation
e.g. mouse (animal / PC-interface)
• It’s a (very) hard task
• Humans are very good at it
• Computers are not
• Active field of research for over 50 years
• Mistakes in disambiguation have negative results
• Beginning to be of practical use
• Desirable skill (Google, M$)

Machine learning v cognitive modelling
NLP has been successful using ML from data, without “linguistic / cognitive models”
Supervised ML: given labelled data (e.g. PoS-tagged text to train a PoS-tagger, to tag new text in the style of the training text)
Unsupervised ML: no labelled data (e.g. clustering words with “similar contexts” gives PoS-tag categories)
Unsupervised ML is harder, but increasingly successful!

NLP applications
• Machine Translation
• Localization: adapting text (e.g. ads) to the local language
• Information Retrieval (Google, etc.)
• Information Extraction
• Detecting terrorist activities
• Understanding the Quran
• …
For more, see The Language Machine

And Finally…
Any final questions?
Feedback please (e.g. email me)
Good luck in the exam!
Look at past exam papers BUT note changes in the topics covered
And if you do use NLP in your career, please let me know…