Natural Language Processing
Dr. Tulasi Prasad Sariki, SCOPE, VIT Chennai
Natural Language Processing CSE4022

ABOUT ME
Dr. Tulasi Prasad Sariki
PhD from VIT Chennai; M.Tech from JNTU, Hyderabad; B.Tech from JNTU, Hyderabad
Present: Associate Professor, SCOPE, VIT, Chennai
Past: Assistant Professor, School of Computing, SASTRA University
Interests: Natural Language Processing, Machine Learning, Data Science

Course Objectives
● To introduce the fundamental concepts and techniques of Natural Language Processing for analyzing words based on morphology and corpora.
● To examine NLP models and interpret algorithms for the classification of NLP sentences using both the traditional, symbolic approach and the more recent statistical approach.
● To get acquainted with the algorithmic description of the main language levels that includes morphology, syntax, semantics, and pragmatics for information retrieval and machine translation applications.

Credits: Some of the slides are borrowed from Dr. Diyi Yang, College of Computing, Georgia Tech.

Expected Course Outcomes
1. Understand the principles of, and process, human languages such as English and other Indian languages using computers.
2. Create corpus linguistics resources based on a digestive approach (text corpus method).
3. Demonstrate understanding of state-of-the-art algorithms and techniques for text-based processing of natural language with respect to morphology.
4. Perform POS tagging for a given natural language.
5. Select a suitable language modelling technique based on the structure of the language.
6. Check the syntactic and semantic correctness of sentences using grammars and labelling.
7. Develop computational methods for real-world applications and explore deep learning based NLP.

Mode of Evaluation
Theory: CAT-1, CAT-2, Quiz-1, Quiz-2, Digital Assignment, Final Assessment Test (FAT)
Project: Review-1 (at CAT-1 time), Review-2 (at CAT-2 time), Review-3 (before FAT)

Books
Text Book(s)
1. Daniel Jurafsky and James H. Martin, "Speech and Language Processing", 3rd edition, Prentice Hall, 2009.
Reference Books
1. Chris Manning and Hinrich Schütze, "Foundations of Statistical Natural Language Processing", 2nd edition, MIT Press, Cambridge, MA, 2003.
2. Nitin Indurkhya and Fred J. Damerau, "Handbook of Natural Language Processing", 2nd edition, CRC Press, 2010.
3. James Allen, "Natural Language Understanding", Pearson Publication, 8th edition, 2012.

Course Contents
01 Introduction to NLP
02 Text Processing
03 Morphology
04 Lexical Syntax
05 Language Modeling
06 Syntax & Semantics
07 Applications of NLP

01 Overview of Natural Language Processing

Outline
● Introduction to NLP
● Ambiguity
● Need for Natural Language Processing
● Natural Languages vs. Computer Languages
● Why Natural Language Processing?
● Linguistic Levels of Analysis
● Basic terms / terminology in NLP
● Different Tasks in NLP

Introduction
What is a Language?
• A system for the expression of thoughts, feelings, etc., by the use of spoken sounds or conventional symbols (Collins Dictionary).
Why do we need Language?
• The ability to speak and write for communication is one of the most fundamental aspects of human behavior.
Communication With Machines: Conversational Agents
Conversational agents contain:
● Speech recognition
● Language analysis
● Dialogue processing
● Information retrieval
● Text to speech
Examples: Google Assistant, Apple Siri, Amazon Alexa, Microsoft Cortana.

Introduction
● Computers would be a lot more useful if they could handle our email, do our library research, talk to us…
● But they are fazed by natural human language (ambiguity).
● How can we tell computers about language? (Or help them learn it as kids do?)

Natural Language Processing: Ambiguity
● "I saw the man on the hill with a telescope" (saw: past tense of "see")
● List the reasonable interpretations.

What is Natural Language Processing?
The field of study that focuses on the interactions between human language and computers is called NLP.

Need for Natural Language Processing
• Huge amounts of data
• Applications for processing large amounts of text require NLP expertise

Natural Language Processing?
Natural Language
• Refers to the language spoken by people, e.g. English, Telugu, Tamil, as opposed to artificial languages like C++, Java, etc.
Natural Language Processing
• Applications that deal with natural language in one way or another.
Computational Linguistics
• Doing linguistics on computers
• More on the linguistic side than NLP, but closely related

Natural Languages vs. Computer Languages
• Ambiguity is the primary difference between natural and computer languages.
• Formal programming languages are designed to be unambiguous, i.e. they can be defined by a grammar that produces a unique parse tree for each sentence in the language.

Why Natural Language Processing?
• kJfmmfj mmmvvv nnnffn333
• Uj iheale eleee mnster vensi credur
• Baboi oi cestnitze
• Coovoel2^ ekk; ldsllk lkdf vnnjfj?
• Fgmflmllk mlfm kfre xnnn!

Computers Lack Knowledge!
• Computers "see" text in English the same way you saw the text above!
• People have no trouble understanding language:
– Common sense knowledge
– Reasoning capacity
– Experience
• Computers have:
– No common sense knowledge
– No reasoning capacity

Where does it fit in the CS taxonomy?
A nested view: Computational Linguistics sits within NLP, NLP within AI, and AI within CS.

Linguistic Levels of Analysis

Basic terms / terminology in NLP
• Token: Before any real processing can be done on the input text, it needs to be segmented into linguistic units such as words, punctuation, numbers, or alphanumerics. These units are known as tokens.
• Sentence: An ordered sequence of tokens.
• Tokenization: The process of splitting a sentence into its constituent tokens. Whitespace makes tokenization relatively easy (and uninteresting) for languages like English. However, for languages such as Chinese and Arabic, the task is more difficult since there are no explicit word boundaries.

Basic terms / terminology in NLP
• Corpus: A body of text, usually containing a large number of sentences.
• Corpora: A collection of texts; the plural form of corpus.
• Bilingual corpus: A collection of texts in which each text appears in two languages.
• Dialogue: Communicative linguistic activity in which at least two speakers or agents participate.
• n-gram: A sequence of n tokens.
• Semantics: The study of linguistic meaning.

Basic terms / terminology in NLP
• Part-of-speech (POS) tag: A word can be classified into one or more of a set of lexical or part-of-speech categories such as noun, verb, adjective, and article. A POS tag is a symbol representing such a lexical category, e.g. NN (noun), VB (verb), JJ (adjective), AT (article).
• POS tagging: Given a sentence and a set of POS tags, a common language processing task is to automatically assign a POS tag to each word in the sentence. For example, given the sentence "The ball is red", the output of a POS tagger would be The/AT ball/NN is/VB red/JJ.
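To make the terminology above concrete, here is a minimal sketch of tokenization, sentence segmentation, n-grams, and POS tagging using the NLTK library. This is an illustrative sketch, not part of the course material: it assumes NLTK is installed (pip install nltk) and that the named data packages are available; note that NLTK's tagger uses Penn Treebank tags (DT, NN, VBZ, JJ), which differ from the AT/NN/VB/JJ set shown on the slide.

```python
# Minimal sketch (assumes: pip install nltk). Resource names below are the
# classic ones; newer NLTK releases may use slightly different package names.
import nltk

nltk.download("punkt")                       # sentence/word tokenizer models
nltk.download("averaged_perceptron_tagger")  # POS tagger model

text = "The ball is red. Mr. Smith saw the man on the hill with a telescope."

sentences = nltk.sent_tokenize(text)         # sentence segmentation
tokens = nltk.word_tokenize(sentences[0])    # word tokenization
tagged = nltk.pos_tag(tokens)                # POS tagging (Penn Treebank tags)

print(sentences)  # note: the abbreviation "Mr." does not end a sentence
print(tagged)     # roughly: [('The', 'DT'), ('ball', 'NN'), ('is', 'VBZ'), ('red', 'JJ'), ('.', '.')]

# n-grams (sequences of n tokens) are easy to derive from the token list:
bigrams = list(nltk.ngrams(tokens, 2))
print(bigrams)    # [('The', 'ball'), ('ball', 'is'), ('is', 'red'), ('red', '.')]
```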
Basic terms / terminology in NLP
• Parse tree: A tree defined over a given sentence that represents the syntactic structure of the sentence as defined by a formal grammar.
• Computational morphology: Natural languages consist of a very large number of words that are built from basic building blocks known as morphemes (or stems), the smallest linguistic units possessing meaning.
• Parsing: In the parsing task, a parser constructs the parse tree for a given sentence. Some parsers assume the existence of a set of grammar rules in order to parse, but recent parsers are smart enough to deduce parse trees directly from the given data using complex statistical models.

Applications of NLP
Information Extraction (IE): Identify phrases in language that refer to specific types of entities and relations in text.
Named entity recognition is the task of identifying names of people, places, organizations, etc. in text.
Michael Dell is the CEO of Dell Computer Corporation and lives in Texas. (people, organizations, places)
Relation extraction identifies specific relations between entities.
Michael Dell is the CEO of Dell Computer Corporation and lives in Texas.

Applications of NLP
Question Answering: Directly answer natural language questions based on information presented in a corpus of textual documents (e.g. the web).
When was Barack Obama born? (factoid) August 4, 1961
Who was president when Barack Obama was born? John F. Kennedy
How many presidents have there been since Barack Obama was born? 9

Applications of NLP
Text Summarization
Article: With a split decision in the final two primaries and a flurry of super delegate endorsements, Sen. Barack Obama sealed the Democratic presidential nomination last night after a grueling and history-making campaign against Sen. Hillary Rodham Clinton that will make him the first African American to head a major-party ticket. Before a chanting and cheering audience in St. Paul, Minn., the first-term senator from Illinois savored what once seemed an unlikely outcome to the Democratic race with a nod to the marathon that was ending and to what will be another hard-fought battle, against Sen. John McCain, the presumptive Republican nominee….
Summary: Senator Barack Obama was declared the presumptive Democratic presidential nominee.

Applications of NLP
Machine Translation (MT): Translate a sentence from one natural language to another. Hasta la vista, bebé ⇒ See you later, baby.
Information Retrieval

Applications of NLP
Text Categorization

Applications of NLP
Natural Language Interfaces
Spell Checking

Applications of NLP
Sentiment Analysis
Automatic Lyrics Generation

Applications of NLP
Plagiarism Detection
Speech Recognition

Why is NLP Hard?
● Contextual words and phrases and homonyms
● Synonyms
● Irony and sarcasm
● Ambiguity
● Errors in text or speech
● Colloquialisms and slang
● Domain-specific language
● Low-resource languages
● Lack of research and development
● Sparsity

Contextual words and phrases and homonyms
● I ran to the store because we ran out of milk.
● Can I run something past you real quick?
● The house is looking really run down.

Synonyms
● Synonyms can lead to issues similar to contextual understanding because we use many different words to express the same idea.
● Some of these words may convey exactly the same meaning.
● Some may differ only in degree (small, little, tiny, minute).
● Different people use synonyms to denote slightly different meanings within their personal vocabulary.

Sarcasm

Ambiguity
Ambiguity in NLP refers to sentences and phrases that potentially have two or more possible interpretations.
● Lexical ambiguity: a word that could be used as a verb, noun, or adjective.
● Syntactic ambiguity: a sentence that can be assigned more than one grammatical structure (parse).
● Semantic ambiguity: the interpretation of a sentence in context.

Errors in Text or Speech
● Misspelled or misused words can create problems for text analysis.
● Autocorrect and grammar correction applications can handle common mistakes, but don't always understand the writer's intention.

Colloquialisms and slang

Sparsity
● Regardless of how large our corpus is, there will be a lot of infrequent words.
● This means we need to find clever ways to estimate probabilities for things we have rarely or never seen.

Expressivity
● Not only can one form have different meanings (ambiguity), but the same meaning can be expressed with different forms:
○ She gave the book to Tom vs. She gave Tom the book
○ Some kids popped by vs. A few children visited
○ Is that window still open? vs. Please close the window

Fields connected with NLP
● Machine learning
● Linguistics (including psycho-, socio-, descriptive, and theoretical)
● Cognitive science
● Information theory
● Logic
● Data science
● Political science
● Psychology
● Economics
● Education

Factors Changing the NLP Landscape
1. Increases in computing power
2. The rise of the web, then the social web
3. Advances in machine learning
4. Advances in understanding of language in social context

J Component - Instructions
● 2-page Project Proposal (10%)
● 4-page Midway Report (10%)
● 8-page Final Report (20%)
● Project Presentation (60%)
○ Review-1 (10%)
○ Review-2 (20%)
○ Review-3 (30%)
All submissions should be in double-column ACL 2022 format. [Link]

THANKS! Do you have any questions?
tulasiprasad.sariki@vit.ac.in
www.learnersdesk.weebly.com

CSE4022 Natural Language Processing
Topic: Text Processing
Prof. Tulasi Prasad Sariki, SCSE, VIT Chennai Campus
www.learnersdesk.weebly.com

Contents
Text Processing
Text Preprocessing
Challenges in Text Preprocessing
Types of Writing Systems

Text Processing
In the linguistic analysis of a digital natural language text, it is necessary to clearly define the characters, words, and sentences in any document.

Text Preprocessing
The task of converting a raw text file, essentially a sequence of digital bits, into a well-defined sequence of linguistically meaningful units:
◦ at the lowest level, characters representing the individual graphemes in a language's written system,
◦ words consisting of one or more characters,
◦ sentences consisting of one or more words.
Text preprocessing is an essential part of any NLP system, since the characters, words, and sentences identified at this stage are the fundamental units passed to all further processing stages.

Text / Word Segmentation
Text segmentation is the process of converting a well-defined text corpus into its component words and sentences. This is a very important task for work at the morphology and syntax levels of NLP.
Word segmentation breaks up the sequence of characters in a text by locating the word boundaries, the points where one word ends and another begins. For computational linguistics purposes, the words thus identified are frequently referred to as tokens, and word segmentation is also known as tokenization.

Sentence Segmentation / Text Normalization
Sentence segmentation is the process of identifying sentence boundaries between words in different sentences. Since most written languages have punctuation marks that occur at sentence boundaries, sentence segmentation is frequently referred to as sentence boundary detection or sentence boundary disambiguation.
Text normalization is a related step that involves merging different written forms of a token into a canonical normalized form; for example, a document may contain the equivalent tokens "Mr.", "Mr", "mister", and "Mister" that would all be normalized to a single form.
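The two steps just described can be illustrated with a rough rule-based sketch in plain Python. This is not the method the slides later present (they discuss learned classifiers); it only shows the idea of splitting on sentence-final punctuation while respecting an abbreviation list, and of mapping token variants to a canonical form. The abbreviation list and normalization table are illustrative assumptions, not exhaustive resources.

```python
# A rough, rule-based sketch of sentence boundary detection and token
# normalization. ABBREVIATIONS and NORMALIZE are toy, illustrative tables.
import re

ABBREVIATIONS = {"mr.", "mrs.", "dr.", "prof.", "dist.", "corp.", "u.s."}
NORMALIZE = {"mr": "Mr.", "mr.": "Mr.", "mister": "Mr."}

def split_sentences(text):
    """Split on ., ?, ! followed by whitespace, unless the preceding token is a known abbreviation."""
    sentences, start = [], 0
    for match in re.finditer(r"[.?!]\s+", text):
        last_word = text[start:match.start() + 1].split()[-1].lower()
        if last_word in ABBREVIATIONS:
            continue                       # e.g. "Dr." does not end a sentence
        sentences.append(text[start:match.end()].strip())
        start = match.end()
    if start < len(text):
        sentences.append(text[start:].strip())
    return sentences

def normalize(token):
    """Map equivalent written forms to one canonical form."""
    return NORMALIZE.get(token.lower(), token)

text = "Dr. Smith met Mister Jones. They talked for an hour."
print(split_sentences(text))  # ['Dr. Smith met Mister Jones.', 'They talked for an hour.']
print([normalize(t) for t in "Mister Mr mr.".split()])  # ['Mr.', 'Mr.', 'Mr.']
```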
Challenges of Text Preprocessing
The type of writing system (script) used for a language is the most important factor for determining the best approach to text preprocessing. A writing system needs:
◦ at least one set of defined base elements or symbols, individually termed characters and collectively called a script;
◦ at least one set of rules and conventions (orthography) understood and shared by a community, which arbitrarily assigns meaning to the base elements (graphemes), their ordering, and their relations to one another;
◦ at least one language (generally spoken) whose constructions are represented and able to be recalled by the interpretation of these elements and rules.

Classification of Writing Systems

Logographic Writing Systems
In a logographic writing system, in theory, each symbol (word or morpheme) represents one idea.
Example: Chinese (characters for Friendship, Happiness, Fortune, Prosperity).

Logophonetic Writing Systems
Definition: there are two major types of signs, ones denoting morphemes and ones denoting sounds.
Examples: Egyptian, Japanese, and Sumerian.

Abugida
South Asian scripts such as Brahmi and its descendants fit into both syllabary and alphabet: they are syllabic because the basic sign contains a consonant and a vowel.
Greek had CV, CVC, CCVC, and CVCC syllable structures, so the Greeks invented a way to break syllables down into consonants and vowels.

Alphabetic
A system of consonant and vowel symbols that, either individually or in combinations, represent the speech sounds of a written language.
Example: English.

Abjad or Consonantal Alphabet
Alphabetic writing systems in which only the consonants in words are written, and the vowels are left out.
Examples: Hebrew, Arabic.

Character Representation
How can characters in languages be represented? At its lowest level, a computer-based text or document is merely a sequence of digital bits in a file. The first essential task is to interpret these bits as characters of a writing system of a natural language.

Unicode
Fundamentally, computers just deal with numbers. They store letters and other characters by assigning a number to each one. Before Unicode was invented, there were hundreds of different encoding systems for assigning these numbers. Even for a single language like English, no single encoding was adequate for all the letters, punctuation, and technical symbols in common use.
These encoding systems also conflict with one another. That is, two encodings can use the same number for two different characters, or use different numbers for the same character.
Unicode covers all the characters for all the writing systems of the world, modern and ancient. It also includes technical symbols, punctuation, and many other characters used in writing text. The Unicode Standard is intended to support the needs of all types of users, whether in business or academia, using mainstream or minority scripts.

Types of Encoding
Two types of encoding:
Character encoding ◦ ASCII, ISCII, Unicode
Font encoding ◦ Eenadu, Vaartha, Kumudam, Daily Thanthi

ASCII Features
American Standard Code for Information Interchange
7-bit code; the 8th bit is unused (or used as a parity bit)
2^7 = 128 codes
Two general types of codes:
◦ 95 are "graphic" codes (displayable on a console)
◦ 33 are "control" codes (control features of the console or communications channel)

ISCII (Indian Script Code for Information Interchange)
ISCII is a coding scheme for representing the various writing systems of India. It encodes the main Indic scripts and a Roman transliteration. The supported scripts are: Assamese, Bengali (Bangla), Devanagari, Gujarati, Gurmukhi, Kannada, Malayalam, Oriya, Tamil, and Telugu.
One motivation for the use of a single encoding is the idea that it will allow easy transliteration from one writing system to another.

Unicode
Unicode is a computing industry standard for the consistent encoding, representation, and handling of text expressed in most of the world's writing systems. The version current when these slides were prepared (Unicode 7.0) contains more than 110,000 characters covering 100 scripts and various symbols. Unicode can be implemented by different character encodings; the most commonly used encodings are UTF-8 and UTF-16.
http://www.unicodetables.com/
http://www.unicode.org/

Structural Differences with ISCII
Unicode is stateless:
◦ No shifting to get different scripts
◦ Each character has a unique number
Unicode is uniform:
◦ No extension bytes necessary
◦ All characters coded in the same space
Yudit is a free Unicode plain-text editor for Unix-like systems. http://www.yudit.org/

Type vs Token
Example: consider the sentence "A rose is a rose is a rose".
There are three word types in the sentence: "rose", "is", and "a". There are eight word tokens.

Telugu
Devanagari

Font
A font file is a binary file that contains glyphs, or "pictures", of symbols representing the building blocks of a displayable character set. Depending on the language, multiple glyphs can comprise a single character.
Code Table
In basic terms, a code table is a two-column list that maps a numerical value to a glyph. The most widely used code table is Unicode.

Font Encoding
Encoding values are drawn from a code table. There are many different encoding types to choose from depending on the application: UTF-8, UTF-16 (UCS-2, where UCS is the Universal Character Set), and UTF-32 (UCS-4).
This allows us to generate displays of text strings in many different languages by using fonts which contain the glyphs corresponding to their alphabet. The computer system takes each code and displays the glyph associated with it on a monitor or printer.

Font
The glyphs may be viewed as the building blocks for the letter to be displayed: by placing the glyphs one after another, the required display is generated.
Fonts also incorporate a feature whereby some of the glyphs may be defined to have zero width even though they extend over a horizontal range. Thus, when the system places a zero-width glyph next to another, the two are superimposed, permitting more complex shapes to be generated, such as accented letters.
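A short sketch can tie together two points made above: the type/token distinction from the "A rose is a rose is a rose" example, and the fact that the same abstract characters are serialized differently by UTF-8 and UTF-16. The specific byte values shown are for the Telugu letter U+0C05 and follow directly from the encoding rules.

```python
# Type/token counting and character-encoding comparison (illustrative sketch).
from collections import Counter

sentence = "A rose is a rose is a rose"
tokens = sentence.lower().split()     # 8 word tokens
types = Counter(tokens)               # 3 word types: 'a', 'rose', 'is'
print(len(tokens), len(types))        # 8 3
print(types)                          # Counter({'a': 3, 'rose': 3, 'is': 2})

# The same character can take a different number of bytes per encoding.
# Telugu letter 'అ' (U+0C05): 3 bytes in UTF-8, 2 bytes in UTF-16.
ch = "\u0c05"
print(ch.encode("utf-8"))       # b'\xe0\xb0\x85'
print(ch.encode("utf-16-be"))   # b'\x0c\x05'
```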
Sentence
A sentence is a group of words that are put together to mean something. A sentence is the basic unit of language which expresses a complete thought. It does this by following the grammatical rules of syntax.
Sentence Boundary Disambiguation
◦ People use . ? and !
◦ Sometimes ;
◦ End-of-sentence marks are overloaded.

Sentence Boundary Disambiguation
English employs whitespace between most words and punctuation marks at sentence boundaries, but neither feature is sufficient to segment the text completely and unambiguously.
Tibetan and Vietnamese both explicitly mark syllable boundaries, either through layout or by punctuation, but neither marks word boundaries.
Written Chinese and Japanese have adopted punctuation marks for sentence boundaries, but neither denotes word boundaries.
The period is the most ambiguous marker: it also appears in decimals, e-mail addresses, abbreviations, initials in names, and honorific titles.

Sentence Boundary Disambiguation
For example:
U.S. Dist. Judge Charles L. Powell denied motions made by defense attorneys Monday in Portland's insurance fraud trial.
Of the handful of painters that Austria has produced in the 20th century, only one, Oskar Kokoschka, is widely known in the U.S. This state of unawareness may not last much longer.
Sentence boundary detection by humans is tedious, slow, error-prone, and extremely difficult to codify. Algorithmic sentence boundary detection is therefore a necessity: POS tagging and syntactic analysis can only be done on sentences.

Related Work
As of 1997: "identifying sentences has not received as much attention as it deserves." [Reynar and Ratnaparkhi, 1997]
"Although sentence boundary disambiguation is essential . . ., it is rarely addressed in the literature and there are few public-domain programs for performing the segmentation task." [Palmer and Hearst, 1997]
Two approaches:
◦ Rule-based approach
◦ Machine-learning-based approach

Related Work
Rule based
◦ Regular expressions [Cutting, 1991]
◦ Mark Wasson converted a grammar into a finite automaton with 1,419 states and 18,002 transitions.
◦ Lexical endings of words: [Müller, 1980] uses a large word list.
Machine-learning-based approach
◦ [Riley, 1989] uses regression trees.
◦ [Palmer and Hearst, 1997] use decision trees or a neural network.

Maximum Entropy Approach
Potential sentence boundaries are identified by scanning the text for sequences of characters separated by whitespace (tokens) containing one of the symbols !, . or ?.
The system that focused on maximizing performance used the following hints, or contextual "templates":
◦ The Prefix
◦ The Suffix
◦ The presence of particular characters in the Prefix or Suffix
◦ Whether the Candidate is an honorific (e.g. Ms., Dr., Prof.)

Maximum Entropy Approach
◦ Whether the Candidate is a corporate designator (e.g. Corp., M.L.A., M.L.C.)
◦ Features of the word to the left of the Candidate
◦ Features of the word to the right of the Candidate
The templates specify only the form of the information. The exact information used by the maximum entropy model for the potential sentence boundary marked by Corp. in the example sentence would be:
◦ ANLP Corp. chairman Dr. Smith resigned.
◦ PreviousWordIsCapitalized, Prefix=Corp, Suffix=NULL, PrefixFeature=CorporateDesignator.

Maximum Entropy Approach
For each potential sentence boundary token (., ?, and !), we estimate a joint probability distribution p of the token and its surrounding context (both of which are denoted by c) occurring as an actual sentence boundary. The distribution is given by:
p(b, c) = π ∏_{j=1}^{k} α_j^{f_j(b, c)}
where b ∈ {no, yes}, π is a normalization constant, the α_j are the unknown parameters of the model, and each α_j corresponds to a feature f_j. Thus the probability of seeing an actual sentence boundary in the context c is given by p(yes, c).

Corpus
A corpus is a large collection of text covering different domains, styles, territorial and social variants of usage, etc.
A corpus is a collection of pieces of language that are selected and ordered according to explicit linguistic criteria in order to be used as a sample of the language.
A corpus provides grammarians, lexicographers, and others with a better description of a language.

Chomsky's Critique of Corpus-Based Methods
1. Corpora model performance, while linguistics is aimed at the explanation of competence.
If you define linguistics that way, linguistic theories will never be able to deal with actual, messy data.
2. Natural language is in principle infinite, whereas corpora are finite, so many examples will be missed.
An excellent point, which needs to be understood by anyone working with a corpus. But does that mean corpora are useless? Introspection is unreliable (prone to performance factors), and pretty useless with small and unrepresentative data. Insights from a corpus might lead to generalization/induction beyond the corpus, if the corpus is a good sample of the "text population".
3. Ungrammatical examples won't be available in a corpus.
Depends on the corpus, e.g., spontaneous speech, language learners, etc.

Corpus
Corpus analysis provides lexical information, morpho-syntactic information, and syntactic as well as semantic information.
Varieties of corpora:
◦ Raw corpus
◦ POS tagged
◦ Parsed
◦ Multilingual aligned
◦ Spoken language
◦ Semantic tagged

Corpus
Raw corpus: the texts are segmented into sentences and paragraphs.
◦ Reuters corpus (180 million words)
◦ CIIL corpus (3 million words for 10 major Indian languages)
POS tagged corpus: texts in the corpus are annotated with part-of-speech tag information.
◦ BNC tagged corpus (100 million words, CLAWS tagset)

Corpus
Parsed: each sentence is annotated with a phrase-structure parse marking the boundaries of sentence, clause, phrase, and coordinated word constituents.
◦ Lancaster Parsed Corpus (British English)
◦ Susanne parsed corpus
Semantic corpus: sense-tagged corpus.
◦ The FrameNet lexical database contains around 1,200 semantic frames, 13,000 lexical units (a pairing of a word with a meaning; polysemous words are represented by several lexical units) and over 190,000 example sentences.

Corpus
Multilingual aligned: identification of the corresponding sentences in multiple languages, aligned with one another.
◦ CRATER: Multilingual Aligned Annotated Corpus (English, French, Spanish)
◦ JRC-Acquis Multilingual Parallel Corpus: a collection of parallel texts in 22 languages: Bulgarian, Czech, Danish, German, Greek, English, Spanish, Estonian, Finnish, etc.
◦ Parallel corpora are useful for all types of cross-lingual research.

Uses of Corpora
Lexicography / terminology
Linguistics / computational linguistics
Dictionaries & grammars (Collins Cobuild English Dictionary for Advanced Learners; Longman Grammar of Spoken and Written English)
Critical Discourse Analysis
- Study texts in social context
- Analyze texts to show underlying ideological meanings and assumptions
- Analyze texts to show how other meanings and ways of talking could have been used… and therefore the ideological implications of the ways that things were stated

Uses of Corpora
Literary studies
Translation practice and theory
Language teaching / learning
ESL Teaching (English as a Second Language)
LSP Teaching (Language for Specific Purposes)

Type-Token Analysis
Each distinct word form is a type and each occurrence of a type counts as a token.

What is a corpus and why are corpora important tools?
Kristina Nilsson Björkenstam, Computational Linguistics, Stockholm University

1. Introduction
In 2012, the Republican candidate for US president, Mitt Romney, tried to defend himself against allegations that he was too liberal by saying: "But I was a severely conservative Republican governor." (Mitt Romney, CPAC 2012-02-10)
People, both within the Republican Party and outside, got upset because of the phrase "severely conservative". It didn't feel right. It was perceived as negative, almost as if Romney didn't want to be a conservative. But if we look up the adjective "severely" in Webster's dictionary we find the following definitions:
1. harsh or strict, unsparing, stern
2. serious, grave, forbidding
3. conforming strictly to a rule or standard
4. extremely plain or simple
5. keen, violent, intense
6. difficult, rigorous
It would seem that Romney used the word "severely" in sense #3, "conforming strictly to a rule or standard". If this is an example of correct usage of the word, why did people (especially within his own party) get so upset with him? What did he do wrong?
The answer is that there is more to language than dictionary definitions. If we look up the word "severely" in the 450 million word Corpus of Contemporary American English (COCA, http://corpus.byu.edu/coca), we find that this word typically co-occurs with words like:
• damaged
• injured, wounded, ill, depressed
• disabled
• limited, restricted, limit
• punished, beaten
• criticized
• affected
This pattern of co-occurrence with mostly negatively charged words is the reason why "severely" has negative connotations, and this is why people reacted to the phrase "severely conservative" (see the blog entry "Severely X" by Mark Liberman at Language Log for more on this quote: http://languagelog.ldc.upenn.edu/nll/?p=3762). This blending of features of one set of words (e.g., "damaged", "injured", and "depressed") with another word ("severely") through frequent co-occurrence is called semantic prosody (Louw, 1993), and this is one of the aspects of language that can be studied by analyzing language production in collections of language samples, so-called corpora.

2. What is a corpus?
A corpus is a collection of natural language (text, and/or transcriptions of speech or signs) constructed with a specific purpose. While most available corpora are text only, there are a growing number of multimodal corpora, including sign language corpora.
A multimodal corpus is "a computer-based collection of language and communication-related material drawing on more than one sensory modality or on more than one production modality" (Allwood, 2007:207), where sensory modalities include sight, hearing, touch, smell or taste, and production modalities include, e.g., speech, signs, eye gaze, body posture, and gestures. That is, a multimodal corpus is a collection of video and/or audio recordings of people communicating.
But any collection of audio and video is not a corpus. Firstly, the audio-visual material should be carefully selected, and the content must be described using meta-data. Secondly, the material should be analyzed and described with transcriptions and annotations in a standardized format.
Ideally, a corpus is a set of language production samples designed to be representative of a language (or sub-language) through careful selection -- not a randomly collected set of data. How representative a corpus is, given a particular research question, is determined by the balance and sampling of the corpus. We can think of representativeness as the answer to the question: how well does this corpus describe relevant aspects of the language? In order to create a general corpus, language samples produced by both men and women, of all ages, from different parts of the area where the language is spoken, etc., should be included. The same principles regarding representativeness, balance and sampling are relevant for both text and multimodal corpora, and there is a large body of work on corpus design to draw from (for an introduction see e.g., McEnery, Xiao & Tono, 2006, and Allwood, 2007).
There are different ways to go about selecting data. One way is to focus on language as a "product" and sample different types of language material, e.g., dialogue or monologue, or scripted or spontaneous speech. Another way is to focus on the "producer" of language, and to choose informants based on speaker characteristics such as age, gender, social class, first/second language, level of education, profession, and regional background. In some cases, e.g., when recording communication between co-workers in a specific work place, the informants are selected because they work there, and not based on speaker characteristics. Such corpora are specialized rather than general, but speaker characteristics are still important when analyzing the data. In the case of multimodal corpora, an important aspect is whether the recordings were made in a naturalistic setting, in a studio, or in the real world. There is also a difference between unobtrusive observation of an activity (e.g., a parent and a child playing with a set of toys at home), and recordings of people performing a task according to instructions (e.g., two adults discussing a movie in a lab setting).
Corpus selection is important not only for corpus builders but also for corpus users, because the set of questions that can be investigated depends on the composition of the corpus. Let me give you an example: The Alcohol Language Corpus (ALC; Schiel, Heinrich & Barfüßer, 2011) is a specialized corpus consisting of speech samples from 162 speakers of German (85 male and 77 female police officers). The recordings were made in a car. Each speaker was recorded both sober and drunk (with alcohol level as meta-data), speaking in two different speaking styles: scripted (that is, as instructed, e.g., reciting addresses and registration numbers), and spontaneous (e.g., descriptions, question-answer, and dialog).
Using this corpus, Schiel and colleagues (2011) show that there is a gender difference in spontaneous speech: men talk less when drunk, whereas women are not affected. They also show that both men and women make more speech errors when drunk. That is, using this corpus we can study how speech performance is affected by alcohol, and this knowledge can be used, e.g., to devise new methods to stop drunk drivers. But there are inherent limitations to this corpus that affect how it can be used: the recordings were made in cars, all participants are police officers, and the speakers are drunk in some of the recordings. This, of course, means that we cannot use this corpus to study, e.g., German in general. For that purpose, we need a balanced corpus consisting of language material produced by men and women of all ages, from all regions, with different speaker characteristics.

3. What corpora can tell us...
The first thing I do when I get access to a new corpus is to explore the content using some basic methods, typically by counting the words. We can find out which words are the most frequent in the corpus, and by ranking the words by corpus frequency we can study the distribution of the vocabulary of the corpus. By using normalized frequencies, we can make comparisons between different corpora. We can, e.g., compare the vocabulary frequency distribution of English (British National Corpus, http://www.natcorp.ox.ac.uk/), Swedish (Stockholm-Umeå Corpus, http://spraakbanken.gu.se/eng/resource/suc3), and Swedish Sign Language (Swedish Sign Language Corpus, http://www.ling.su.se/forskning/forskningsprojekt/teckenspr%C3%A5k/teckenspr%C3%A5kskorpus).
By using basic corpus linguistic tools, either built-in web interface tools for corpora such as COCA or BNC, or software such as AntConc (www.antlab.sci.waseda.ac.jp/), we can also look at recurring sequences of words or signs, either as sequences of tokens (called n-grams) or as collocations. Starting with basic methods such as these, we can move on to study many aspects of language production using both quantitative and qualitative methods.

4. ... and what corpora cannot tell us
However, there are limitations to what corpora can tell us.
• No negative evidence: just because a word or a sign does not occur in a corpus (however large and well balanced) does not mean that the word or sign never can occur in the language. However, a representative corpus can show us what is central and typical in a language.
• The findings of a study can tell us something about the subset of language that is included in that corpus, but not necessarily about language as a whole. However, if the corpus is representative of the language we are interested in, we can make careful generalizations about the language.
• A corpus can rarely provide explanations, and thus most corpus studies combine quantitative and qualitative work. Sometimes other methods, such as questionnaires, eye gaze or EEG experiments, are better suited to answer a particular question. Sometimes a descriptive corpus study can give new ideas on what to look for using other methods.
To summarize: make sure that you select the right corpus for your study, find out as much as you can about the corpus, take the characteristics and limitations of the corpus into account, and make careful generalizations!

5. Why are corpora important tools?
Corpus analysis provides quantitative, reusable data, and an opportunity to test and challenge our ideas and intuitions about language. Further, analysis applied to corpora as transcriptions or other types of linguistic annotation can be checked for consistency and inter-annotator agreement, and the annotated corpus can be reviewed and reused by others. Corpora are essential in particular for the study of spoken and signed language: while written language can be studied by examining the text, speech, signs and gestures disappear when they have been produced, and thus we need multimodal corpora in order to study interactive face-to-face communication.

References
Allwood, J. 2007. Multimodal Corpora. In: Lüdeling, A. & M. Kytö (eds) Corpus Linguistics: An International Handbook. Mouton de Gruyter, Berlin: 207-225.
Davies, M. 2008. The Corpus of Contemporary American English: 450 million words, 1990-present. URL: http://corpus.byu.edu/coca/
Louw, B. 1993. Irony in the Text or Insincerity in the Writer? The Diagnostic Potential of Semantic Prosodies. In: Baker, M., Francis, G. & Tognini-Bonelli, E. (eds) Text and Technology. Philadelphia/Amsterdam: John Benjamins.
McEnery, T., R. Xiao & Y. Tono. 2006. Corpus-Based Language Studies: An Advanced Resource Book. Taylor & Francis US.
Schiel, F., C. Heinrich & S. Barfüßer. 2011. Alcohol Language Corpus. In: Language Resources and Evaluation, Vol. 45. Springer, Berlin, New York.
The British National Corpus, version 3 (BNC XML Edition). 2007. Distributed by Oxford University Computing Services on behalf of the BNC Consortium. URL: http://www.natcorp.ox.ac.uk/
The Stockholm-Umeå Corpus, version 3.0. 2012. Distributed by the Swedish Language Bank at Gothenburg University. URL: http://spraakbanken.gu.se/eng/resource/suc3
The Swedish Sign Language Corpus. 2013. Distributed by the Section for Sign Language at Stockholm University. URL: http://www.ling.su.se/english/research/research-projects/signlanguage

Sentence Segmentation and Decision Trees
Sentence Segmentation
• ! and ? are relatively unambiguous
• The period "." is quite ambiguous:
• Sentence boundary
• Abbreviations like Inc. or Dr.
• Numbers like .02% or 4.3
• Build a binary classifier that:
• Looks at a "."
• Decides EndOfSentence/NotEndOfSentence
• Classifiers: hand-written rules, regular expressions, or machine learning

Determining if a word is end-of-sentence: a Decision Tree

More sophisticated decision tree features
• Case of the word with ".": Upper, Lower, Cap, Number
• Case of the word after ".": Upper, Lower, Cap, Number
• Numeric features
• Length of the word with "."
• Probability(word with "." occurs at end of sentence)
• Probability(word after "." occurs at beginning of sentence)

Implementing Decision Trees
• A decision tree is just an if-then-else statement
• The interesting research is choosing the features
• Setting up the structure is often too hard to do by hand
• Hand-building is only possible for very simple features and domains
• For numeric features, it's too hard to pick each threshold
• Instead, the structure is usually learned by machine learning from a training corpus

Decision Trees and other classifiers
• We can think of the questions in a decision tree as features that could be exploited by any kind of classifier:
• Logistic regression
• SVM
• Neural nets
• etc.

CSE4022 Natural Language Processing
Topic: Morphology
Dr. Tulasi Prasad Sariki, SCOPE, VIT Chennai
www.learnersdesk.weebly.com

Contents
What is Morphology
Where Morphology is useful?
Classification of Morphemes
Properties of affixes
Representation / Terminology
Example

What is Morphology?
A morpheme is a minimal meaning-bearing unit in a language. Morphemes are abstract concepts denoting entities or relationships.
Morphology is the study of the internal structure of words. In natural languages, words are made up of meaningful subunits called morphemes.
Morphological parsing is the task of recognizing the morphemes inside a word, e.g., hands, foxes, children.

Where is Morphology useful?
Machine translation
Information retrieval
Lexicography
Any further processing (e.g., part-of-speech tagging)

Observations about words and their structure
Some words can be divided into parts which still have meaning.
Many words have meaning by themselves, but some words have meaning only when used with other words.
Some of the parts into which words can be divided can stand alone as words, but others cannot.
These word-parts that can occur only in combination must be combined in the correct way.
Languages create new words systematically.

Classification of Morphemes
Lexical morphemes are those that have meaning by themselves (more accurately, they have sense). Nouns, verbs, and adjectives ({boy}, {buy}, {big}) are typical lexical morphemes.
Grammatical morphemes specify a relationship between other morphemes, but the distinction is not all that well defined. Prepositions, articles, and conjunctions ({of}, {the}, {but}) are grammatical morphemes.

Classification of Morphemes
Free morphemes: morphemes which can stand by themselves as separate words, e.g. structure, like, go, work, friend, etc.
Bound morphemes: morphemes which cannot normally stand alone but need to be attached to other forms, e.g. re-, -ed, -s, -ing, etc.
- A bound morpheme is a unit of meaning which can only exist alongside a free morpheme.
- Bound morphemes operate in the connection processes by means of derivation, inflection, and compounding.

Classification of Morphemes
We can usefully divide morphemes into two classes:
◦ Root or lexeme: the core meaning-bearing units
◦ Affixes: bits and pieces that adhere to stems to change their meanings and grammatical functions
◦ Prefix: un-, anti-, etc.
◦ Suffix: -ity, -ation, etc.
◦ Infix: inserted inside the stem; English has almost no true infixes
◦ Circumfix: a discontinuous morph composed of two parts which embrace the base element (live → en-live-en → enliven)

Properties of roots
◦ Main part of a word
◦ Must be at least one in a word
◦ In English, limited to two in a word (simple words have one, compound words have two)
◦ Can occur independently
◦ Tend to have richer, more specific semantic content
◦ Position is relatively free with respect to other roots, e.g. photograph vs. telephoto

Properties of affixes
◦ Subordinate part of the word
◦ Not necessarily present; some words occur without any
◦ Multiple affixes can occur in a word
◦ Are dependent (bound) elements
◦ Have more "schematic" (non-specific) content; often a grammar-like function
◦ Can either precede or follow their roots (prefixes and suffixes, respectively)
◦ Position for a given affix with respect to the root is fixed

Example
Given word: Unbreakable. How many morphemes?
It comprises three morphemes:
un- (a bound morpheme signifying "not")
-break- (the root, a free morpheme)
-able (a bound morpheme signifying "can be done").
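A toy sketch of morphological parsing for the "unbreakable" example above is shown below. Real morphological analyzers are usually built with finite-state transducers and a lexicon; here the prefix and suffix lists are purely illustrative assumptions, and the output also exposes the spelling-change problem (happy → happi) that later stemming rules must deal with.

```python
# Illustrative affix-stripping sketch; PREFIXES/SUFFIXES are toy lists.
PREFIXES = ["un", "re", "dis", "anti", "pre"]
SUFFIXES = ["able", "ness", "ing", "ed", "er", "ly", "s"]

def parse(word):
    morphemes, stem = [], word
    for p in PREFIXES:                               # strip at most one prefix
        if stem.startswith(p) and len(stem) > len(p) + 2:
            morphemes.append(p + "-")
            stem = stem[len(p):]
            break
    suffixes, changed = [], True
    while changed:                                   # strip suffixes from the right
        changed = False
        for s in SUFFIXES:
            if stem.endswith(s) and len(stem) > len(s) + 2:
                suffixes.insert(0, "-" + s)
                stem = stem[: -len(s)]
                changed = True
                break
    return morphemes + [stem] + suffixes

print(parse("unbreakable"))   # ['un-', 'break', '-able']
print(parse("unhappiness"))   # ['un-', 'happi', '-ness']  (spelling change not undone)
```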
Representation / Terminology
Morphological: girls = {girl} + {s}
Semantic: {girl} = [-adult; -male; +human, ...]; {s} = {PLU} = [plural]
Braces { } indicate a morpheme. Square brackets [ ] indicate a semantic characterization. Italics indicate a lexical item.
Two different morphemes may be pronounced the same way, e.g. -er in buyer and shorter: on a verb it is the agentive morpheme {AG}; on an adjective it is the comparative morpheme {COMP}.

Morphemes and Words
Combine morphemes to create words.
Inflectional morphology: the combination of a stem and a morpheme results in a word of the same class; it usually fills a syntactic feature such as agreement. E.g., plural -s, past tense -ed.
Derivational morphology: the combination of a stem and a morpheme usually results in a word of a different class; the meaning of the new word may be hard to predict. E.g., +ation in words such as computerization.

Inflectional Morphology
Inflection is a morphological process that adapts existing words so that they function effectively in sentences without changing the category of the base morpheme.
Word stem + grammatical morpheme: cat + s
Only for nouns, verbs, and some adjectives.
Nouns
◦ Plural: rules for regular forms: +s, +es; irregular: mouse-mice, ox-oxen.
Rules for exceptions: e.g. -y -> -ies, as in butterfly-butterflies.

Inflectional Morphology (verbs)
Regular inflected forms (stem / -s form / -ing form / -ed form (past)):
walk / walks / walking / walked
thrash / thrashes / thrashing / thrashed
try / tries / trying / tried
map / maps / mapping / mapped

Inflectional Morphology (verbs)
Irregular inflected forms (stem / -s form / -ing form / -ed form (past) / -ed form (participle)):
eat / eats / eating / ate / eaten
catch / catches / catching / caught / caught
cut / cuts / cutting / cut / cut

Inflectional Morphology (verbs)
The suffix -s functions in the present simple as the third person marking of the verb: to work – he work-s.
The suffix -ed functions in the past simple as the past tense marker in regular verbs: to love – lov-ed.
The suffixes -ed (regular verbs) and -en (for some irregular verbs) function in the marking of the past participle: to study → studied / to eat → eaten.
The suffix -ing functions in the marking of the present participle: to eat – eating / to study – studying.

Inflectional Morphology (nouns)
Regular nouns (stem / -s form (plural)): cat / cats, hand / hands.
Irregular nouns: child / children, ox / oxen.
The suffix -s functions in the marking of the plural of nouns: dog – dogs.
The suffix -s also functions as a possessive marker: Laura – Laura's book.

Regular vs Irregular
It is a little complicated by the fact that some words misbehave (refuse to follow the rules):
◦ mouse/mice, goose/geese, ox/oxen
◦ go/went, fly/flew
The terms regular and irregular are used to refer to words that follow the rules and those that don't.

Inflectional Morphology (adjectives)
The suffix -er functions as the comparative marker: quick – quicker.
The suffix -est functions as the superlative marker: quick – quickest.
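The regular noun inflection rules and irregular exceptions above can be captured in a few lines. The sketch below is illustrative only (plural nouns, not verbs), and the exception table is a small assumption standing in for a real lexicon of irregular forms.

```python
# Sketch of regular plural rules with an exception table checked first.
IRREGULAR_PLURALS = {"mouse": "mice", "ox": "oxen", "child": "children", "goose": "geese"}

def pluralize(noun):
    if noun in IRREGULAR_PLURALS:                      # irregular forms win
        return IRREGULAR_PLURALS[noun]
    if noun.endswith("y") and noun[-2] not in "aeiou":
        return noun[:-1] + "ies"                       # butterfly -> butterflies
    if noun.endswith(("s", "x", "z", "ch", "sh")):
        return noun + "es"                             # box -> boxes
    return noun + "s"                                  # cat -> cats, hand -> hands

for n in ["cat", "hand", "butterfly", "box", "mouse", "ox"]:
    print(n, "->", pluralize(n))
```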
Derivational Morphology
Derivation is concerned with the way morphemes are connected to existing lexical forms as affixes. We distinguish affixes of two principal types:
◦ Prefixes: attached at the beginning of a lexical item or base morpheme, e.g. un-, pre-, post-, dis-, im-, etc.
◦ Suffixes: attached at the end of a lexical item, e.g. -age, -ing, -ful, -able, -ness, -hood, -ly, etc.

Examples of Derivational Morphology
Lexical item (free morpheme): like (verb)
like + prefix (bound morpheme) dis- = dislike (verb)
like + suffix -able = likeable; + prefix un- = unlikeable; + suffix -ness = unlikeableness
like + prefix un- = unlike; + suffix -ness = unlikeness
like + suffix -ly = likely; + suffix -hood = likelihood; + prefix un- = unlikelihood

Derivational Morphology
Derivational affixes can cause semantic change.
The prefix pre- means before; post- means after; un- means not; re- means again.
Prefix = fixed before; unhappy = not happy = sad; retell = tell again.
The prefix de- added to a verb conveys a sense of subtraction; dis- and un- have a sense of negativity: to decompose, to defame, to uncover, to discover.

Derivational Morphology
Derivational affixes can mark category change.
For nouns (suffix / base verb or adjective / derived noun):
-ation / computerize (V) / computerization
-ee / appoint (V) / appointee
-er / kill (V) / killer
-ness / fuzzy (A) / fuzziness
For adjectives (suffix / base verb or noun / derived adjective):
-al / computation (N) / computational
-able / embrace (V) / embraceable
-less / care (N) / careless
-ful / care (N) / careful

Derivational Morphology
Clitics are usually weak forms of functional elements (verb full form → clitic):
am → 'm; is → 's; are → 're; will → 'll; have → 've; has → 's; had → 'd; would → 'd

Derivational Processes
1. Derivation (or derivational affixation, affixation): antiintellectualism
2. Compounding: combine two or more morphemes to form new words: bathroom, blackboard
3. Reduplication: full or partial repetition of a morpheme: dilly-dally, zig-zag
4. Blending: parts of the words that are combined are deleted: fantastic + fabulous -> fantabulous

Derivational Processes
5. Clipping: part of a word has been clipped off: prof, lab, doc
6. Acronyms: abbreviate a longer term by taking the initial letters: WHO -> World Health Organization
7. Back formation: a word (usually a noun) is reduced to form another word of a different type (usually a verb): television -> televise, babysitter -> babysit

Derivational Processes
8. Extension of word formation rules: part of a word is treated as a morpheme though it's not: workaholic
9. Functional shift (conversion): a change in the part of speech: computer users today use a mouse and bookmark an Internet address
10. Proper names -> common words: Xerox -> photocopy, JCB -> Proclainer

Derivational Processes
11. Coining: creating a completely new free morpheme: googol -> 10^100
12. Onomatopoeia: words imitate sounds in nature: tick-tock, quack
13. Borrowing: the taking over of words from other languages (French to English): brigade, ballet, bigot

Derivational Processes
Many paths are possible. Start with compute:
computer -> computerize -> computerization
computer -> computerize -> computerizable
computation -> computational
But not all paths/operations are equally good (allowable?):
clue -> *clueable
happy -> unhappy, but sad -> *unsad

Derivational Processes
Morphotactics: concerned with the ordering of morphemes, i.e. the restrictions in place on the ordering of morphemes.
antiintellectualism: morphemes anti-, intellect, -al, -ism, combined as anti + intellect + al + ism.
Morphophonemics: focuses on the sound changes that take place in morphemes when they combine to form words, e.g., the vowel changes in "sleep" and "slept", "bind" and "bound", "vain" and "vanity", and the consonant alternations in "knife" and "knives".
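The derivation chains above (like → likely → likelihood → unlikelihood) can be reproduced with a tiny sketch that applies affixes one at a time, plus a single spelling rule (final y → i before a consonant-initial suffix). This is a toy illustration under those assumptions, not a real morphological grammar, and it does not check the semantic or phonological restrictions discussed on the next slide.

```python
# Toy derivational affixation with one orthographic adjustment rule.
def add_suffix(stem, suffix):
    if stem.endswith("y") and suffix[0] not in "aeiou":
        stem = stem[:-1] + "i"        # likely + hood -> likelihood
    return stem + suffix

def add_prefix(prefix, stem):
    return prefix + stem

word = "like"
word = add_suffix(word, "ly")         # likely (adjective)
word = add_suffix(word, "hood")       # likelihood (noun)
word = add_prefix("un", word)         # unlikelihood
print(word)                           # unlikelihood
```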
Derivational Processes
Semantics: In English, un- cannot attach to adjectives that already have a negative connotation:
unhappy vs. *unsad
unhealthy vs. *unsick
unclean vs. *undirty
Phonology: In English, -er cannot attach to words of more than two syllables:
great, greater
happy, happier
competent, *competenter
elegant, *eleganter

Inflectional vs Derivational
Lexical category: Inflectional affixes do not change the lexical category of the word. Derivational affixes often change the lexical category of the word.
Location: Inflectional affixes tend to occur outside derivational affixes. Derivational affixes tend to occur next to the root.
Type of meaning: Inflectional affixes contribute syntactically conditioned information, such as number, gender, or aspect. Derivational affixes contribute lexical meaning.
Affixes used: Inflectional affixes occur with all or most members of a class of stems. Derivational affixes are restricted to some, but not all, members of a class of stems.
Productivity: Inflectional affixes may be used to coin new words of the same type. Derivational affixes may eventually lose their meaning and usually cannot be used to coin new terms.
Grounding: Inflectional affixes create forms that are fully grounded and able to be integrated into discourse. Derivational affixes create forms that are not necessarily fully grounded and may require inflectional operations before they can be integrated into discourse.

Stemming
Stemming algorithms strip off word affixes and yield the stem only, with no additional information (like plural, 3rd person, etc.); they are used, e.g., in web search engines.
Stemming is one technique to provide ways of finding morphological variants of search terms. It is used to improve retrieval effectiveness and to reduce the size of indexing files.
Reduce tokens to the "root" form of words to recognize morphological variation: "computer", "computational", "computation" are all reduced to the same token, "compute".

Stemming
Criteria for judging stemmers:
Correctness
◦ Overstemming: too much of a term is removed.
◦ Understemming: too little of a term is removed.
Retrieval effectiveness
◦ Measured with recall and precision, and on their speed, size, and so on.
Compression performance

Types of stemming algorithms
Table lookup approach
Successor variety
n-gram stemmers
Affix removal stemmers

Table lookup approach
Store a table of all index terms and their stems, so terms from queries and indexes can be stemmed very fast.
Problems:
◦ There is no such data for English, and some terms are domain dependent.
◦ The storage overhead for such a table, though trading size for time is sometimes warranted.

Successor Variety
Determine word and morpheme boundaries based on the distribution of phonemes in a large body of utterances.
The successor variety of a string is the number of different characters that follow it in words in some body of text.
The successor variety of substrings of a term will decrease as more characters are added, until a segment boundary is reached.

Successor Variety Example
Test word: READABLE
Corpus: ABLE, APE, BEATABLE, FIXABLE, READ, READABLE, READING, READS, RED, ROPE, RIPE

Successor Variety Example
Cutoff method
◦ Some cutoff value is selected and a boundary is identified whenever the cutoff value is reached.
Peak and plateau method
◦ A segment break is made after a character whose successor variety exceeds that of the characters immediately preceding and following it.
Complete method
Entropy method

Successor Variety
Two criteria used to evaluate the various segmentation methods:
1. the number of correct segment cuts divided by the total number of cuts
2. the number of correct segment cuts divided by the total number of true boundaries
the number of correct segment cuts divided by the total number of true boundaries After segmenting, if the first segment occurs in more than 12 words in the corpus, it is probably a prefix. The successor variety stemming process has three parts 1. determine the successor varieties for a word 2. segment the word using one of the methods 3. select one of the segments as the stem MORPHOLOGY 40 n-gram stemmers Association measures are calculated between pairs of terms based on shared unique digrams. statistics => st ta at ti is st ti ic cs unique digrams = at cs ic is st ta ti statistical => st ta at ti is st ti ic ca al unique digrams = al at ca ic is st ta ti Dice’s coefficient (similarity) S 2C 2*6 .80 A B 78 A and B are the numbers of unique digrams in the first and the second words. C is the number of unique digrams shared by A and B. MORPHOLOGY 41 n-gram stemmers Similarity measures are determined for all pairs of terms in the database, forming a similarity matrix Once such a similarity matrix is available, terms are clustered using a single link clustering method MORPHOLOGY 42 Affix Removal Stemmers Affix removal algorithms remove suffixes and/or prefixes from terms leaving a stem ◦ If a word ends in “ies” but not ”eies” or ”aies ” (Harman 1991) Then “ies” -> “y” ◦ If a word ends in “es” but not ”aes” , or ”ees ” or “oes” Then “es” -> “e” ◦ If a word ends in “s” but not ”us” or ”ss ” Then “s” -> “NULL” MORPHOLOGY 43 The Porter Stemmer Online Demo: http://9ol.es/porter_js_demo.html Typical rules in Porter stemmer sses→ss (caresses -> caress ) ies→i (ponies -> poni, ties -> ti) ational→ate tional→tion ing →ε (motoring -> motor) MORPHOLOGY 44 Conditions on the stem 1. The measure , denoted m ,of a stem is based on its alternate vowel-consonant sequences. [C] ( VC ) m [V] Square brackets indicate an optional occurrence. Measure M=0 M=1 M=2 Example TR,EE,TREE,Y,BY TROUBLE,OATS,TREES,IVY TROUBLES,PRIVATE,OATEN MORPHOLOGY E.g., Troubles C V CVC 45 Conditions on the stem 2.*<X> ---the stem ends with a given letter X 3.*v*---the stem contains a vowel 4.*d ---the stem ends in double consonant 5.*o ---the stem ends with a consonant-vowel-consonant, sequence, where the final consonant is not w, x or y 6. *s --- the stem ends with a given letter S MORPHOLOGY 46 Step1 SSES -> SS caresses -> caress IES -> I ponies -> poni ties -> ti SS -> SS caress -> caress S -> є cats -> cat MORPHOLOGY 47 Step2a (m>1) EED -> EE Condition verified: agreed -> agree Condition not verified: feed -> feed (*V*) ED -> є Condition verified: plastered -> plaster Condition not verified: bled -> bled (*V*) ING -> є Condition verified: motoring -> motor MORPHOLOGY Condition not verified: sing -> sing 48 Step2b (These rules are ran if second or third rule in 2a apply) AT-> ATE conflat(ed) -> conflate BL -> BLE Troubl(ing) -> trouble (*d & ! 
(*L or *S or *Z)) -> single letter Condition verified: hopp(ing) -> hop, Condition not verified: fall(ing) -> fall (m=1 & *o) -> E Condition verified: fil(ing) -> file Condition not verified: fail -> fail MORPHOLOGY 49 Steps 3 and 4 Step 3: Y Elimination (*V*) Y -> I Condition verified: happy -> happi Condition not verified: sky -> sky Step 4: Derivational Morphology, I (m>0) ATIONAL -> ATE (m>0) IZATION -> IZE Relational -> relate generalization-> generalize (m>0) BILITI -> BLE sensibiliti -> sensible MORPHOLOGY 50 Steps 5 and 6 Step 5: Derivational Morphology, II Step 6: Derivational Morphology, III (m>0) ICATE -> IC (m>0) ANCE -> є triplicate -> triplic allowance-> allow (m>0) FUL -> є (m>0) ENT -> є hopeful -> hope dependent-> depend (m>0) NESS -> є (m>0) IVE -> є goodness -> good effective -> effect MORPHOLOGY 51 Step7 (cleanup) Step 7a (m>1) E -> є probate -> probat (m=1 & !*o) NESS -> є goodness -> good Step 7b (m>1 & *d & *L) -> single letter Condition verified: controll -> control MORPHOLOGY Condition not verified: roll -> roll 52 MORPHOLOGY 53 CSE4022: Natural Language Processing Topic: Text Classification Dr. Tulasi Prasad Sariki, SCOPE, VIT Chennai Campus www.learnersdesk.weebly.com Is this spam? From: "" <takworlld@hotmail.com> Subject: real estate is the only way... gem oalvgkay Anyone can buy real estate with no money down Stop paying rent TODAY ! There is no need to spend hundreds or even thousands for similar courses I am 22 years old and I have already purchased 6 properties using the methods outlined in this truly INCREDIBLE ebook. Change your life NOW ! ================================================= Click here to order: http://www.wholesaledaily.com/sales/nmd.htm ================================================= TEXT CLASSIFICATION Classification Given: ❑A description of an instance, x∈X, where X is the instance language or instance space. ❑Issue: how to represent text documents. ❑A fixed set of categories: C = {c1, c2,…, cn} Determine: ❑The category of x: c(x)∈C, where c(x) is a categorization function whose domain is X and whose range is C. ❑We want to know how to build categorization functions (“classifiers”). 
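A minimal sketch of what such a categorization function c: X -> C looks like in code, using a hand-written keyword rule in the spirit of the rule-based systems discussed on the next slides; the cue list and threshold are purely illustrative.

# A categorization function c: X -> C as a plain Python function.
# Here X = raw document strings and C = {"spam", "not-spam"}; the keyword
# rule below is illustrative only (a real rule base would be far larger).
SPAM_CUES = {"buy", "order", "click here", "no money down", "incredible"}

def c(document: str) -> str:
    """Assign a category from C to an instance x in X."""
    text = document.lower()
    hits = sum(cue in text for cue in SPAM_CUES)
    return "spam" if hits >= 2 else "not-spam"

print(c("Anyone can buy real estate with no money down. Click here to order!"))  # spam
print(c("The seminar on morphology starts at 10 am."))                           # not-spam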
TEXT CLASSIFICATION Examples Labels are most often topics such as Yahoo-categories e.g., "finance," "sports," "news>world>asia>business" Labels may be genres e.g., "editorials" "movie-reviews" "news“ Labels may be opinion e.g., “like”, “hate”, “neutral” Labels may be domain-specific binary e.g., “spam” : “not-spam”, e.g., “contains adult language” :“doesn’t” TEXT CLASSIFICATION Classification Methods Manual classification ❑ ❑ ❑ ❑ Used by Yahoo!, Looksmart, about.com, Medline Very accurate when job is done by experts Consistent when the problem size and team is small Difficult and expensive to scale Automatic document classification ❑ ❑ ❑ ❑ Hand-coded rule-based systems E.g., assign category if document contains a given boolean combination of words Accuracy is often very high if a rule has been carefully refined over time by an expert Building and maintaining these rules is expensive TEXT CLASSIFICATION Classification Methods Supervised learning of a document-label assignment function ❑ Many systems partly rely on machine learning ❑ ❑ ❑ ❑ ❑ k-Nearest Neighbors (simple, powerful) Naive Bayes (simple, common method) Support-vector machines (new, more powerful) Requires hand-classified training data But data can be built up (and refined) by amateurs Note that many commercial systems use a mixture of methods TEXT CLASSIFICATION Bayesian Methods ❑ Learning and classification methods based on probability theory. ❑ Bayes theorem plays a critical role in probabilistic learning and classification. ❑ Build a generative model that approximates how data is produced ❑ Uses prior probability of each category given no information about an item. ❑ Categorization produces a posterior probability distribution over the possible categories given a description of an item. TEXT CLASSIFICATION Bayes’ theorem Bayes theorem provides a way of calculating the posterior probability, P(c|x), from P(c), P(x), and P(x|c). Naive Bayes classifier assume that the effect of the value of a predictor (x) on a given class (c) is independent of the values of other predictors. This assumption is called class conditional independence. ● ● ● ● P(c|x) is the posterior probability of class (target) given predictor (attribute). P(c) is the prior probability of class. P(x|c) is the likelihood which is the probability of predictor given class. P(x) is the prior probability of predictor. In ZeroR model there is no predictor, in OneR model we try to find the single best predictor, naive Bayesian includes all predictors using Bayes' rule and the independence assumptions between predictors TEXT CLASSIFICATION Example The posterior probability can be calculated by first, constructing a frequency table for each attribute against the target. Then, transforming the frequency tables to likelihood tables and finally use the Naive Bayesian equation to calculate the posterior probability for each class. The class with the highest posterior probability is the outcome of prediction. 
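A minimal from-scratch sketch of the recipe just described: count word frequencies per class, turn them into add-one smoothed likelihoods, and multiply the prior by the likelihoods (in log space). It mirrors the worked kill/bomb/kidnap example on the following slides rather than any particular library's API; the toy corpus here is made up.

from collections import Counter, defaultdict
from math import log

# Multinomial Naive Bayes with add-one (Laplace) smoothing.
# Training data format: list of (list_of_tokens, class_label) pairs.
def train(docs):
    class_docs = Counter(label for _, label in docs)       # document counts per class
    word_counts = defaultdict(Counter)                     # per-class word frequencies
    vocab = set()
    for tokens, label in docs:
        word_counts[label].update(tokens)
        vocab.update(tokens)
    priors = {c: n / len(docs) for c, n in class_docs.items()}
    return priors, word_counts, vocab

def predict(tokens, priors, word_counts, vocab):
    scores = {}
    for c in priors:
        n_c = sum(word_counts[c].values())                 # total tokens seen in class c
        # log P(c) + sum_k log P(x_k | c), with add-one smoothing
        scores[c] = log(priors[c]) + sum(
            log((word_counts[c][w] + 1) / (n_c + len(vocab))) for w in tokens)
    return max(scores, key=scores.get), scores

# Toy usage (hypothetical two-document corpus):
docs = [(["kill", "bomb", "kidnap"], "terrorism"), (["music", "movie"], "entertainment")]
priors, counts, vocab = train(docs)
print(predict(["bomb", "music", "kidnap"], priors, counts, vocab)[0])   # terrorism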
TEXT CLASSIFICATION Example: posterior probability
TEXT CLASSIFICATION Example: the final probability
(These slides show, as figures, the frequency table for each attribute, the corresponding likelihood tables, and the final posterior calculation; the same recipe is carried out numerically in the example below.)
TEXT CLASSIFICATION The Naïve Bayes Classifier
(Figure: class node Flu with feature nodes X1 runny nose, X2 sinus, X3 cough, X4 fever, X5 muscle-ache.)
Conditional Independence Assumption: features are independent of each other given the class.
TEXT CLASSIFICATION Learning the Model
First attempt: maximum likelihood estimates
❑ Simply use the frequencies in the data
Smoothing to avoid overfitting: add 1 to every count and add the number of values of Xi to the denominator (add-one / Laplace smoothing).
TEXT CLASSIFICATION Naïve Bayes: Learning
From the training corpus, extract the Vocabulary.
Calculate the required P(cj) and P(xk | cj) terms:
▪ For each cj in C do
▪ docsj <- subset of documents for which the target class is cj
▪ P(cj) <- |docsj| / |total number of documents|
■ Textj <- single document containing all docs in docsj
■ for each word xk in Vocabulary
■ nk <- number of occurrences of xk in Textj
■ P(xk | cj) <- (nk + 1) / (n + |Vocabulary|), where n is the total number of word positions in Textj
TEXT CLASSIFICATION Example
Training:
Document  Kill  Bomb  Kidnap  Music  Movie  TV  Class
Doc1      2     1     3       0      0      1   Terrorism
Doc2      1     1     1       0      0      0   Terrorism
Doc3      1     1     2       0      1      0   Terrorism
Doc4      0     1     0       2      1      1   Entertainment
Doc5      0     0     1       1      1      0   Entertainment
Doc6      0     0     0       2      2      0   Entertainment
Testing:
Document  Kill  Bomb  Kidnap  Music  Movie  TV  Class
Doc7      2     1     2       0      0      1   ?
TEXT CLASSIFICATION Example
|V| = 6 (number of vocabulary words); ni = total number of word occurrences in the documents of class Ci; P(Ci) = number of documents in the class / number of all documents.
Class  P(Ci)  ni  P(Kill|Ci)  P(Bomb|Ci)  P(Kidnap|Ci)  P(Music|Ci)  P(Movie|Ci)  P(TV|Ci)
T      0.5    15  0.2380      0.1904      0.3333        0.0476       0.0952       0.0952
E      0.5    12  0.0555      0.1111      0.1111        0.3333       0.2777       0.1111
P(Kill | T) = ((2 + 1 + 1) + 1) / (15 + |V|) = 5 / 21 = 0.2380
P(T | W) = P(T) * P(Kill|T) * P(Bomb|T) * P(Kidnap|T) * P(Music|T) * P(Movie|T) * P(TV|T), each likelihood raised to the count of that word in Doc7
P(E | W) = P(E) * P(Kill|E) * P(Bomb|E) * P(Kidnap|E) * P(Music|E) * P(Movie|E) * P(TV|E), each likelihood raised to the count of that word in Doc7
TEXT CLASSIFICATION Example
P(T | W) = 0.5 * (0.2380)^2 * (0.1904)^1 * (0.3333)^2 * (0.0476)^0 * (0.0952)^0 * (0.0952)^1 = 5.7047 x 10^-5
P(E | W) = 0.5 * (0.0555)^2 * (0.1111)^1 * (0.1111)^2 * (0.3333)^0 * (0.2777)^0 * (0.1111)^1 = 2.3456 x 10^-7
Since P(T | W) is the higher value, Document 7 is classified into the Terrorism class.
TEXT CLASSIFICATION END
CSE4022 Natural Language Processing Topic: Regular Expressions & Automata
Prof. Tulasi Prasad Sariki, SCSE, VIT Chennai Campus www.learnersdesk.weebly.com
Contents
Regular Expressions
Regular Expression Patterns
Operator precedence
Applications
Regular Expressions in MS-Word
Finite Automata
FSA / FST
Applications of FSA & FST
REGULAR EXPRESSIONS AND AUTOMATA Regular Expressions
In computer science, a RE is a language used for specifying text search strings.
A regular expression is a formula in a special language that is used for specifying a simple class of strings.
Formally, a regular expression is an algebraic notation for characterizing a set of strings.
RE search requires
◦ a pattern that we want to search for, and
◦ a corpus of texts to search through.
REGULAR EXPRESSIONS AND AUTOMATA Regular Expressions
A RE search function will search through the corpus, returning all texts that contain the pattern.
◦ In a Web search engine, they might be the entire documents or Web pages.
◦ In a word-processor, they might be individual words, or lines of a document.
◦ E.g., the UNIX grep command
Regular expressions are case sensitive. We will use Perl-based syntax for representation.
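The Perl-style patterns used on the following slides can also be run with Python's re module, whose syntax is very close; a minimal grep-like search over a toy corpus (the example texts are taken from the pattern tables that follow).

import re

corpus = [
    "interesting links to woodchucks and lemurs",
    "Mary Ann stopped by Mona's",
    "all our pretty songs",
]

def re_search(pattern, texts):
    """Return the texts containing at least one match for `pattern` (case sensitive)."""
    compiled = re.compile(pattern)
    return [t for t in texts if compiled.search(t)]

print(re_search(r"woodchucks", corpus))   # first text only
print(re_search(r"song", corpus))         # matches inside "songs" as well
print(re_search(r"Song", corpus))         # [] -- regular expressions are case sensitive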
REGULAR EXPRESSIONS AND AUTOMATA Regular Expressions Disjunctions [abc] Ranges [A-Z] Negations [^Ss] Optional characters ? and * Wild cards . Anchors ^ and $, also \b and \B Disjunction, grouping, and precedence | REGULAR EXPRESSIONS AND AUTOMATA Regular Expression Patterns regular expression example pattern matched /woodchucks/ “interesting links to woodchucks and lemurs” /a/ “Mary Ann stopped by Mona’s” /Claire says,/ Dagmar, my gift please,” Claire says,” /song/ “all our pretty songs” /!/ “You’ve left the burglar behind again!” said Nori REGULAR EXPRESSIONS AND AUTOMATA Regular Expression Patterns The use of the brackets [] to specify a disjunction of characters. Regular Expression /[wW]oodchuck/ /[abc]/ /[0123456789]/ Match Woodchuck or woodchuck “a”, “b”, or “c” Any digit REGULAR EXPRESSIONS AND AUTOMATA Regular Expression Patterns The use of the brackets [] plus the dash - to specify a range. Regular expression match sample pattern /[A-Z]/ any uppercase letter this is Linguistics 5981 /[0-9]/ any single digit this is Linguistics 5981 /[1 2 3 4 5 6 7 8 9 0]/ any single digit this is Linguistics 5981 REGULAR EXPRESSIONS AND AUTOMATA Regular Expression Patterns To search for negation, i.e. a character that I do NOT want to find we use the caret: [^] Regular expression match sample pattern /[^A-Z]/ not an uppercase letter this is Linguistics 5981 /[^L l]/ neither L nor l this is Linguistics 5981 /[^\.]/ not a period this is Linguistics 598 Special characters: \* \. \? \n \t an asterisk a period a question mark a newline a tab “L*I*N*G*U*I*S*T*I*C*S” “Dr.Doolittle” “Is this Linguistics 5981 ?” REGULAR EXPRESSIONS AND AUTOMATA Regular Expression Patterns To search for optional characters we use the question mark: [?] Regular expression match sample pattern /colou?r/ colour or color beautiful colour To search for any number of a certain character we use the Kleene star: [*] Regular expression match /a*/ any string of zero or more “a”s /aa*/ at least one a but also any number of “a”s REGULAR EXPRESSIONS AND AUTOMATA Regular Expression Patterns To look for at least one character of a type we use the Kleene “+”: Regular expression match /[0-9]+/ a sequence of digits Any combination is possible Regular expression match /[ab]*/ zero or more “a”s or “b”s /[0-9] [0-9]*/ any integer (= a string of digits) REGULAR EXPRESSIONS AND AUTOMATA Regular Expression Patterns The “.” is a very special character -> so-called wildcard Regular expression /b.ll/ match sample pattern any character between b and ll ball, bell, bull, bill The /. / symbol is called a wildcard : it matches any single character. For example, the regular expression /s.ng/ matches the following English words: sang, sing, song, sung. Note that /./ will match and not only alphabetic characters, but also numeric and whitespace characters. Consequently, /s.ng/ will also match non-words such as s3ng. The pattern /....berry/ finds words like cranberry. REGULAR EXPRESSIONS AND AUTOMATA Regular Expression Patterns Anchors (start of line: “^”, end of line:”$”) Regular expression match sample pattern /^Linguistics/ “Linguistics” at the beginning of a line Linguistics is fun. /linguistics\.$/ “linguistics” at the end of a line We like linguistics. Anchors (word boundary: “\b”, non-boundary:”\B”) Regular expression match sample pattern /\bthe\b/ “the” alone This is the place. /\Bthe\B/ “the” included This is my mother. 
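The pattern classes tabulated above can be tried directly in Python's re module; a short demo on illustrative test strings (drawn from the examples above).

import re

# Disjunction, ranges, negation, optionality, Kleene star/plus, wildcard, anchors.
print(re.findall(r"[wW]oodchuck", "Woodchuck vs woodchuck"))   # ['Woodchuck', 'woodchuck']
print(re.findall(r"[0-9]", "this is Linguistics 5981"))        # ['5', '9', '8', '1']
print(re.findall(r"[^A-Za-z ]", "Dr. Doolittle?"))             # ['.', '?'] -- negated class
print(re.findall(r"colou?r", "color and colour"))              # optional u: both spellings
print(re.findall(r"[0-9][0-9]*", "room 42, floor 7"))          # ['42', '7'], same as [0-9]+
print(re.findall(r"b.ll", "ball bell bill bull"))              # wildcard: all four words
print(bool(re.search(r"^Linguistics", "Linguistics is fun."))) # True -- anchored at line start
print(re.findall(r"\bthe\b", "This is the mother"))            # ['the'] -- not the 'the' in "mother"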
REGULAR EXPRESSIONS AND AUTOMATA Regular Expression Patterns More on alternative characters: the pipe symbol: “|” (disjunction) Regular expression match sample pattern /colou?r/ colour or color beautiful colour /progra(m|mme)/ program or programme linguistics program REGULAR EXPRESSIONS AND AUTOMATA Predefined Character class Character class Description \d A digit. Equivalent to[0-9]. \D A non-digit. Equivalent to [^0-9]. \s A whitespace character. Equivalent to [ \t\n\x0B\f\r]. \S A nonwhitespace character. Equivalent to[^\s]. \w A word character. Equivalent to [a-zA-Z_0-9]. \W A non-word character. Equivalent to [^\w]. REGULAR EXPRESSIONS AND AUTOMATA Boundary matchers Boundary Matcher Description ^ The beginning of a line $ The end of a line \b A word boundary \B A nonword boundary \A The beginning of the text \G The end of the previous match \Z The end of the text (but for the final line terminator, if any) \z The end of the text REGULAR EXPRESSIONS AND AUTOMATA Quantifiers Character Description {n} n is a nonnegative integer. Matches exactly n times. For example, 'o{2}' does not match the 'o' in "Bob," but matches the two o's in "food". {n,} n is a nonnegative integer. Matches at least n times. For example, 'o{2,}' does not match the "o" in "Bob" and matches all the o's in "foooood". 'o{1,}' is equivalent to 'o+'. 'o{0,}' is equivalent to 'o*'. {n,m} M and n are nonnegative integers, where n <= m. Matches at least n and at most m times. For example, "o{1,3}" matches the first three o's in "fooooood". 'o{0,1}' is equivalent to 'o REGULAR EXPRESSIONS AND AUTOMATA Operator precedence A regular expression is evaluated from left to right and follows an order of precedence, much like an arithmetic expression. The following table illustrates, from highest to lowest, the order of precedence of the various regular expression operators: Operator(s) Description \ Escape (), (?:), (?=), [] Parentheses and Brackets *, +, ?, {n}, {n,}, {n,m} Quantifiers ^, $, \anymetacharacter, anycharacter Anchors and Sequences | Alternation REGULAR EXPRESSIONS AND AUTOMATA Operator precedence Characters have higher precedence than the alternation operator, which allows 'm|food' to match "m" or "food". To match "mood" or "food", use parentheses to create a subexpression, which results in '(m|f)ood'. 
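The predefined classes, quantifiers and precedence rules above behave the same way in Python's re module; a short demo on made-up test strings.

import re

print(re.findall(r"\d{2}-\d{5}", "ID 12-34567, not 1-234"))   # ['12-34567'] -- {n} quantifier
print(re.search(r"o{1,3}", "fooooood").group())               # 'ooo' -- at most three o's
print(re.findall(r"\w+", "to be, or not to be!"))             # word characters: the six words
print(re.findall(r"\S+", "a  tab\tand spaces"))               # non-whitespace runs
print(re.findall(r"m|food", "mood food"))                     # ['m', 'food'] -- '|' binds loosely
print(re.findall(r"(?:m|f)ood", "mood food"))                 # ['mood', 'food'] -- grouped
# (?: ) is a non-capturing group, so findall returns the whole match rather than the group.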
REGULAR EXPRESSIONS AND AUTOMATA Applications Regular Expressions for the Java Programming Language • java.util.regex for enabling the use of regular expressions Applications • Simple word replacement • Email validation • Removal of control characters from a file • File searching REGULAR EXPRESSIONS AND AUTOMATA Example write a Perl regular expression to match the English article “the”: /the/ /[tT]he/ missed ‘The’ included ‘the’ in ‘others’ /\b[tT]he\b/ Missed ‘the25’ ‘the_’ /[^a-zA-Z][tT]he[^a-zA-Z]/ Missed ‘The’ at the beginning of a line /(^|[^a-zA-Z])[tT]he[^a-zA-Z]/ REGULAR EXPRESSIONS AND AUTOMATA Example Write a regular expression that will match “any PC with more than 500MHz and 32 Gb of disk space for less than $1000”: Price ◦ ◦ ◦ ◦ /$[0-9]+/ /$[0-9]+\.[0-9][0-9]/ /$[0-9]+(\.[0-9][0-9])?/ /\b$[0-9]+(\.[0-9][0-9])?\b/ # whole dollars # dollars and cents #cents optional #word boundaries REGULAR EXPRESSIONS AND AUTOMATA Example Specifications for processor speed ◦ /\b[0-9]+ *(MHz|[Mm]egahertz|Ghz|[Gg]igahertz)\b/ Memory size ◦ /\b[0-9]+ *(Mb|[Mm]egabytes?)\b/ ◦ /\b[0-9](\.[0-9]+) *(Gb|[Gg]igabytes?)\b/ Vendors ◦ /\b(Win95|WIN98|WINNT|WINXP *(NT|95|98|2000|XP)?)\b/ ◦ /\b(Mac|Macintosh|Apple)\b/ REGULAR EXPRESSIONS AND AUTOMATA Example Expression Matches /^\s*$/ Match a blank line. /\d{2}-\d{5}/ Validate an ID number consisting of 2 digits, a hyphen, and an additional 5 digits. /<\s*(\S+)(\s[^>]*)?>[\s\S]*<\s*\/\1\s*>/ Match an HTML tag. REGULAR EXPRESSIONS AND AUTOMATA Regular Expressions in MS-Word ? and * The two most basic wildcard characters are ? and *. ? is used to represent a single character and * represents any number of characters. s?t will find sat, set, sit, sat and any other combination of 3 characters beginning with “s” and ending with “t”. Ex: inset. s*t will find all the above, but will also find “secret”, “serpent”, “sailing boat” and“sign over document”, etc. @ @ is used to find one or more occurrences of the previous character. For example, lo@t will find lot or loot, ful@ will find ful or full etc. REGULAR EXPRESSIONS AND AUTOMATA Regular Expressions in MS-Word <> <s*t> would find “secret” and “serpent” and “sailing boat”, but not “sailing boats” or “sign over documents”. It will also find “'set” in “tea-set” , but not “set” in “toolset”. The <> tags can be used in pairs, as above; or individually. ful@> will find “full” and the appropriate part of “wilful”, but will not find “wilfully”. REGULAR EXPRESSIONS AND AUTOMATA Regular Expressions in MS-Word [] Square brackets are always used in pairs and are used to identify specific characters or ranges of characters. [abc] will find any of the letters a, b, or c. [ A-Z] will find any upper case letter. [ 13579] will find any odd digit. \ If you wish to search for a character that has a special meaning in wildcard searches – the obvious example being “?” – then you can do so by putting a backslash in front of it. [\?] will not find “\” followed by any character; but will find “?” REGULAR EXPRESSIONS AND AUTOMATA Regular Expressions in MS-Word [!] [!] is very similar to [ ] except in this case it finds any character not listed in the box so [!o] would find every character except “o”. You can use ranges of characters in exactly the same was as with [ ], thus [!A-Z] will find everything except upper case letters. REGULAR EXPRESSIONS AND AUTOMATA Regular Expressions in MS-Word {} Curly brackets are used for counting occurrences of the previous character or expression. 
{n} This finds exactly the number “n” of occurrences of the previous character (so for example, a{2} will find “aa”). {n,m} finds text containing between “n” and “m” occurrences of the previous character or expression; so a{2,3} will find “aa” and “aaa”, but only the first 3 characters in “aaaa” ). REGULAR EXPRESSIONS AND AUTOMATA Regular Expressions in MS-Word () Round brackets have no effect on the search pattern, but are used to divide the pattern into logical sequences where you wish to re-assemble those sequences in a different order during the replace – or to replace only part of that sequence. They must be used in pairs and are addressed by number in the replacement. Eg: (Tulasi) (Prasad) replaced by \2 \1 (note the spaces in the search and replace strings) – will produce Prasad Tulasi or replaced by \2 alone will give Prasad. ^ The ^ (“caret”) character is not specific to wildcard searches but it sometimes has to be used slightly differently from normal, when searching for wildcards. REGULAR EXPRESSIONS AND AUTOMATA Finite Automata The regular expression is more than just a convenient meta-language for text searching. Any regular expression can be implemented as a finite-state automaton. Symmetrically, any finite-state automaton can be described with a regular expression. Regular expression is one way of characterizing a particular kind of formal language called a regular language. Both regular expressions and finite-state automata can be used to describe regular languages. REGULAR EXPRESSIONS AND AUTOMATA Finite Automata The relationship between finite state automata, regular expression, and regular language Finite state automata (Computataional Device) Regular Expression (Descriptive Notation) Regular language (Set of Objects) REGULAR EXPRESSIONS AND AUTOMATA What is a Finite-State Automaton? An alphabet of symbols, A finite set of states, A transition function from states and symbols to states, A distinguished member of the set of states called the start state, and A distinguished subset of the set of states called final states. 
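A minimal sketch of these five components in code, using a dictionary as the transition table; it implements the /baa+!/ sheeptalk machine and the table-driven D-RECOGNIZE loop described on the following slides. The representation is illustrative, not a library API.

# Sheeptalk FSA for /baa+!/ : states q0..q4, alphabet {b, a, !}, start q0, final {q4}.
TRANSITIONS = {
    ("q0", "b"): "q1",
    ("q1", "a"): "q2",
    ("q2", "a"): "q3",
    ("q3", "a"): "q3",   # self-loop: any number of extra a's
    ("q3", "!"): "q4",
}
START, FINAL = "q0", {"q4"}

def d_recognize(tape):
    """Table-driven recognizer in the style of D-RECOGNIZE: accept or reject."""
    state = START
    for symbol in tape:
        state = TRANSITIONS.get((state, symbol))   # empty table cell -> reject
        if state is None:
            return "reject"
    return "accept" if state in FINAL else "reject"

for word in ["baa!", "baaaa!", "ba!", "baa", "abba"]:
    print(word, d_recognize(word))
# baa! and baaaa! are accepted; the rest are rejected.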
FSAs recognize the regular languages represented by regular expressions.
Directed graph with labeled nodes and arc transitions.
REGULAR EXPRESSIONS AND AUTOMATA Formally
An FSA is a 5-tuple consisting of
Q: a finite set of N states q0, q1, …, qN
Σ: a finite input alphabet of symbols
q0: the start state
F: the set of final states, F ⊆ Q
δ(q, i): a transition function mapping Q x Σ to Q
REGULAR EXPRESSIONS AND AUTOMATA FSA Accepter
Input string -> Finite Automaton -> Output: "Accept" or "Reject"
REGULAR EXPRESSIONS AND AUTOMATA Transition Graph
(Figure: an example finite accepter with states q0-q4, a trap state q5 carrying an a,b self-loop, initial state q0 and final state q4; the animation frames show the input string a b b a being read one symbol at a time until it ends in the final state. Output: Accepted.)
REGULAR EXPRESSIONS AND AUTOMATA Using an FSA to Recognize Sheep Talk
Sheep language can be defined as any string from the following (infinite) set:
baa! baaa! baaaa! baaaaa! ....
The regular expression for this kind of sheeptalk is /baa+!/
All REs can be represented as FSAs.
(Figure: sheeptalk FSA with transitions b: q0 -> q1, a: q1 -> q2, a: q2 -> q3, a: q3 -> q3, !: q3 -> q4, where q4 is the final state.)
REGULAR EXPRESSIONS AND AUTOMATA State Transition Table for Sheep Talk
State        Input b   a   !
q0                 q1  Ø   Ø
q1                 Ø   q2  Ø
q2                 Ø   q3  Ø
q3                 Ø   q3  q4
q4 (final)         Ø   Ø   Ø
REGULAR EXPRESSIONS AND AUTOMATA Algorithm
function D-RECOGNIZE(tape, machine) returns accept or reject
  index <- beginning of tape
  current-state <- initial state of machine
  loop
    if end of input has been reached then
      if current-state is an accept state then return accept
      else return reject
    elseif transition-table[current-state, tape[index]] is empty then
      return reject
    else
      current-state <- transition-table[current-state, tape[index]]
      index <- index + 1
REGULAR EXPRESSIONS AND AUTOMATA Using an FSA to Recognize Sheep Talk
An FSA recognizes (accepts) the strings of a regular language: baa! baaa! baaaa! …
(Figure: trace of the FSA on the tape b a a a !, visiting states q0, q1, q2, q3, q3, q4.)
REGULAR EXPRESSIONS AND AUTOMATA Adding a fail state to the FSA
(Figure: the same sheeptalk machine with an explicit fail state qF; every missing transition is redirected to qF.)
REGULAR EXPRESSIONS AND AUTOMATA Adding an else arc
REGULAR EXPRESSIONS AND AUTOMATA Adding an ϵ Transition
(Figure: the sheeptalk FSA with an ϵ transition from q3 back to q2, an alternative to the a self-loop on q3.)
REGULAR EXPRESSIONS AND AUTOMATA Example FSA
An FSA for the words of English numbers 1-99.
REGULAR EXPRESSIONS AND AUTOMATA FSA for NLP
Word Recognition
Dictionary Lookup
Spelling Conventions
REGULAR EXPRESSIONS AND AUTOMATA Word Recognition
A word recognizer takes a string of characters as input and returns "yes" or "no" according as the word is or is not in a given set.
Solves the membership problem, e.g. spell checking, Scrabble (unordered concatenation).
Approximate methods
Has the right set of letters (any order).
Has the right sounds (Soundex).
Random (suprimposed) coding (Unix Spell) REGULAR EXPRESSIONS AND AUTOMATA Word Recognition Exact Methods Hashing Search (linear, binary ...) Digital search (“Tries”) Finite-state automata REGULAR EXPRESSIONS AND AUTOMATA Dictionary Lookup Dictionary lookup takes a string of characters as input and returns “yes” or “no” according as the word is or is not in a given set and returns information about the word. Lookup Methods Approximate — guess the information If it ends in “ed”, it’s a past-tense verb. Exact — store the information for finitely many words Table Lookup Hash Search REGULAR EXPRESSIONS AND AUTOMATA Finite State Transducers A finite state transducer essentially is a finite state automaton that works on two (or more) tapes. The most common way to think about transducers is as a kind of ``translating machine''. They read from one of the tapes and write onto the other. a:b q0 a:b at the arc means that in this transition the transducer reads a from the first tape and writes b onto the second. REGULAR EXPRESSIONS AND AUTOMATA Finite State Transducers Transducer behaves as follows in the different modes. generation mode: It writes a string of as on one tape and a string bs on the other tape. Both strings have the same length. recognition mode: It accepts when the word on the first tape consists of exactly as many as as the word on the second tape consists of bs. translation mode (left to right): It reads as from the first tape and writes an b for every a that it reads onto the second tape. translation mode (right to left): It reads bs from the second tape and writes an a for every f that it reads onto the first tape. relator mode: Computes relations between sets REGULAR EXPRESSIONS AND AUTOMATA FST vs FSA FSA can act as a FST can act as a Recognizer Generator 5 tuple Representation Equivalent to regular languages Recognizer Generator Translator Set relator 7 tuple Representation Equivalent to regular relations REGULAR EXPRESSIONS AND AUTOMATA FST Operations Inversion: Switching input and output labels If T maps from I to O, T-1 maps from O to I Composition: If T1 is a transducer from I1 to O1 and T2 is a transducer from I2 to O2, then T1 T2 is a transducer from I1 to O2. REGULAR EXPRESSIONS AND AUTOMATA FST for NLP Tokenization Morphological analysis Transliteration Parsing Translation Speech recognition Spoken language understanding REGULAR EXPRESSIONS AND AUTOMATA REGULAR EXPRESSIONS AND AUTOMATA CSE4022 Natural Language Processing Topic: Text Summarization Pr o f. Tu l asi Pr a sa d S a ri ki , S C SE, V I T Ch ennai Ca mpus www. l earn ersd esk.weeb l y. com Rapid growth of data The problem: ◦ 4 Billion URLs indexed by Google ◦ 200 TB of data on the Web [Lyman and Varian 03] Possible approaches: ◦ information retrieval ◦ document clustering ◦ information extraction ◦ visualization ◦ question answering ◦ text summarization TEXT SUMMARIZATION Text Summarization Automatic Text Summarization ◦ No User interaction, system will return the condensed or summarized form. Query Specific Summarization ◦ User interaction will be there he/she will supply some input as keywords based on that summary will be generated. TEXT SUMMARIZATION Automatic Text Summarization Automatic text summarization is the technique where a computer automatically creates an abstract or gist of one or more text documents. Text summarization is the process of distilling the most important information from a source (or sources) to produce an abridged version for a particular user (or users)and task (or tasks). 
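A minimal sketch of the simplest extractive recipe described on the following slides (Luhn-style word-frequency scoring: score each sentence by the corpus frequency of its content words and keep the top N). The stop-word list, toy document and scoring details are deliberate simplifications.

import re
from collections import Counter

STOP = {"the", "a", "an", "of", "to", "and", "is", "in", "for", "on"}

def summarize(text, n=2):
    """Return the n highest-scoring sentences, kept in their original order."""
    sentences = re.split(r"(?<=[.!?])\s+", text.strip())
    words = [w for w in re.findall(r"[a-z]+", text.lower()) if w not in STOP]
    freq = Counter(words)
    def score(sent):
        return sum(freq[w] for w in re.findall(r"[a-z]+", sent.lower()) if w not in STOP)
    ranked = sorted(range(len(sentences)), key=lambda i: score(sentences[i]), reverse=True)[:n]
    return " ".join(sentences[i] for i in sorted(ranked))

doc = ("Automatic summarization distils the most important information from a source. "
       "Extraction picks sentences from the source itself. "
       "Abstraction rewrites the information in new words. "
       "Most practical systems rely on extraction.")
print(summarize(doc, n=2))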
TEXT SUMMARIZATION Typical 3 Stages of Summarization 1. Topic Identification: find/extract the most important material 2. Topic Interpretation: compress it 3. Summary Generation: say it in your own words …as easy as that! TEXT SUMMARIZATION Aspects that describe Summaries Source (Input): ◦ Source: single-document vs. multi-document ◦ Language: mono-lingual vs. multi-lingual vs. cross-lingual ◦ Genre: news vs. technical report vs. scientific paper etc. ◦ Specificity: domain-specific vs. general ◦ Length: short (1–2 pages) vs. long (> 2 pages) ◦ Media: text, graphics, audio, video, multi-media etc. TEXT SUMMARIZATION Aspects that describe Summaries Purpose: ◦ Use: generic vs. query-oriented (aimed to a specific information need) ◦ Purpose: what the summary is used for (e.g. alert, preview, inform, digest, provide biographical information) ◦ Audience: untargeted vs. targeted (aimed at a specific audience) Composition (Output): ◦ Derivation: extract vs. abstract ◦ Format: running text, tables, geographical displays, time lines, charts, illustrations etc. ◦ Partiality: neutral vs. evaluative (adding sentiment/values) TEXT SUMMARIZATION Query-Driven vs. Text-Driven Top-down: Query-driven focus ◦ Criteria of interest encoded as search specs. ◦ System uses specs to filter or analyze text portions. ◦ Examples: templates with slots with semantic characteristics; term lists of important terms. Bottom-up: Text-driven focus ◦ Generic importance metrics encoded as strategies. ◦ System applies strategies over rep of whole text. ◦ Examples: degree of connectedness in semantic graphs; frequency of occurrence of tokens. TEXT SUMMARIZATION Extract not Abstract Extraction is much easier than abstraction Abstraction needs understanding and rewriting Most automatic summarization tools makes extracts not abstracts Uses original sentences or part of sentences to create ”abstract” TEXT SUMMARIZATION Some Extraction Methods General method: score each sentence; choose best sentence(s) Scoring techniques: ◦ Position in the text: lead method; optimal position policy; title/heading method ◦ Cue phrases in sentences ◦ Word frequencies throughout the text ◦ Cohesion: links among words; word co-occurrence; coreference; lexical chains ◦ Information Extraction: parsing and analysis TEXT SUMMARIZATION Word Frequency[Luhn58] Steps: ◦ Count all word occurrences (after stemming) ◦ Ignore extreme frequencies. ◦ Give every word a score according to frequency. ◦ Calculate the importance of each sentence as the sum of its word scores. ◦ Take the physical distance between important words into consideration. ◦ Extract the N sentences with the highest scores. TEXT SUMMARIZATION Position: Title-Based Method Words in titles summarization. and headings are positively relevant to Shown to be statistically valid at 99% level of significance (Edmundson, 68). Empirically shown to be useful in summarization systems. TEXT SUMMARIZATION Cue words and phrases Baxendale (1958) identified two sets of phrases ◦ bonus phrases -> that tend to signal when a sentence is a likely candidate for inclusion in a summary ◦ stigma phrases -> that tend to signal when it is definitely not a candidate, respectively. ‘Bonus phrases’ such as "in summary", "in conclusion", and superlatives such as "the best", "the most important" can be good indicators of important content. 
‘stigma phrases’ such as hardly and impossible may indicate nonimportant sentences TEXT SUMMARIZATION Cue words and phrases Cue words and phrases, such as "in conclusion", "important", "in this paper", "this paper", "this article", "this document", and "we conclude“ etc. can be very useful to determine signals of relevance or irrelevance. During processing, the Cue Phrase Module simply rewards each sentence containing a cue phrase with an appropriate score (constant per cue phrase) and penalizes those containing stigma phrases. TEXT SUMMARIZATION Multiple Methods Cue-Phrase Method: Some phrases imply significance: “significant”, “impossible”, ”hardly”, etc. Key Method: Word frequencies, like Luhn(for ATS). Title Method: Titles are important, and so are the words they contain sentences are play major role in summary. Location Method: First and Last sentences of a paragraph, sentences following titles. TEXT SUMMARIZATION Multiple Methods The Sentence importance is calculate as a linear combination of the different methods: Sentence Score = ß1 Cue + ß2 Key + ß3 Title + ß4 Location. Adjust the coefficients to control each methods significance. TEXT SUMMARIZATION Cohesion: Lexical chains method But Mr. Kenny’s move speeded up work on a machine which uses microcomputers to control the rate at which an anaesthetic is pumped into the blood of patients undergoing surgery. Such machines are nothing new. But Mr. Kenny’s device uses two personal-computers to achieve much closer monitoring of the pump feeding the anaesthetic into the patient. Extensive testing of the equipment has sufficiently impressed the authorities which regulate medical equipment in Britain, and, so far, four other countries, to make this the first such machine to be licensed for commercial sale to hospitals. TEXT SUMMARIZATION Lexical chains-based method Assumes that important sentences are those that are ‘traversed’ by strong chains (Barzilay and Elhadad, 97). ◦ Strength(C) = length(C) - #DistinctOccurrences(C) ◦ For each chain, choose the first sentence that is traversed by the chain and that uses a representative set of concepts from that chain. TEXT SUMMARIZATION Lexical Chains(Barzilay97) Lexical Chain : A chain of semantically related words TEXT SUMMARIZATION Lin - set of summarization methods Sentence order: Sentence order in text gives the importance of the sentences. First sentence highest ranking last sentence lowest ranking. Title: Sentences containing words in the title get high score. Term frequency (tf): Open class terms which are frequent in the text are more important than the less frequent. Open class terms are words that change over time. Position score: The assumption is that certain genres put important sentences in fixed positions. For example. Newspaper articles has most important terms in the 4 first paragraphs. TEXT SUMMARIZATION Lin - set of summarization methods Query signature: The query of the user affect the summary in the way that the extract will contain these words. Sentence length: The sentence length implies which sentence is the most important. Average lexical connectivity: Number terms shared with other sentences. The assumption is that a sentence that share more terms with other sentences is more important. Numerical data: Sentences containing numerical data obtain boolean value 1 (is scored higher ) than the ones without numerical values. 
TEXT SUMMARIZATION Lin - set of summarization methods Proper name: Sentences containing proper names will be given higher score Weekdays and Months: Sentences containing Weekdays and Months will be given higher score Quotation: Sentences containing quotations might be important for certain questions from user First sentence: First sentence of each paragraphs are the most important sentences TEXT SUMMARIZATION Lin - set of summarization methods Decision tree combination function: All the above parameters were put into decision tree and trained on set of texts and manual summarized texts. Simple combination function: All the above parameter were normalized and put in a combination function with no special weighting. TEXT SUMMARIZATION TEXT SUMMARIZATION CSE4022 Natural Language Processing Topic: PartsOfSpeach Tagging D r. Tu l a s i Pr a s a d S a r i k i , SCOPE, VIT Chennai Campus w w w. l e a r n e r s d e s k . w e e b l y. c o m Definition The process of assigning a part-of-speech or other lexical class marker to each word in a corpus. WORDS TAGS the mom kissed the boy on the cheek N V P DET POS TAGGING 2 Definition Annotate each word in a sentence with a part-of-speech marker. Lowest level of syntactic analysis. Useful for subsequent syntactic parsing and word sense disambiguation. Example John saw the saw and decided to take it to the table. NNP VBD DT NN CC VBD TO VB PRP IN DT NN POS TAGGING 3 An Example WORD LEMMA the mom kissed the boy on the cheek the mom kiss the boy on the cheek TAG +DET +NOUN +VPAST +DET +NOUN +PREP +DET +NOUN POS TAGGING 4 English POS Tagsets Original Brown corpus used a large set of 87 POS tags. Most common in NLP today is the Penn Treebank set of 45 tags. Reduced from the Brown set for use in the context of a parsed corpus (i.e. treebank). The C5 tagset used for the British National Corpus (BNC) has 61 tags. POS TAGGING 5 Word Classes Basic word classes: Noun, Verb, Adjective, Adverb, Preposition, … Open vs. Closed classes ◦ Open: ◦ Nouns, Verbs, Adjectives, Adverbs. ◦ Why “open”? ◦ Closed: ◦ determiners: a, an, the ◦ pronouns: she, he, I ◦ prepositions: on, under, over, near, by, … POS TAGGING 6 Closed vs. Open Class Closed class categories are composed of a small, fixed set of grammatical function words for a given language. prepositions: on, under, over, … particles: up, down, on, off, … determiners: a, an, the, … pronouns: she, who, I, .. conjunctions: and, but, or, … auxiliary verbs: can, may should, … POS TAGGING 7 Closed vs. Open Class Open class categories have large number of words and new ones are easily invented. Nouns new nouns: Internet, website, URL, CD-ROM, email, newsgroup, bitmap, modem, multimedia New verbs have also : download, upload, reboot, right-click, doubleclick, Verbs (Google), Adjectives (geeky) Abverb (chompingly) POS TAGGING 8 English Parts of Speech (Nouns) Noun (person, place or thing) Singular (NN): dog, fork Plural (NNS): dogs, forks Proper (NNP, NNPS): John, Springfields Personal pronoun (PRP): I, you, he, she, it Wh-pronoun (WP): who, what POS TAGGING 9 English Parts of Speech (Nouns) Proper nouns (Penn, Philadelphia, Davidson) English capitalizes these. Common nouns (the rest). 
Count nouns and mass nouns Count: have plurals, get counted: goat/goats, Mass: don’t get counted (snow, salt, water,) POS TAGGING 10 English Parts of Speech (Verbs) Verb (actions and processes) Base, infinitive (VB): eat Past tense (VBD): ate Gerund (VBG): eating Past participle (VBN): eaten Non 3rd person singular present tense (VBP): eat 3rd person singular present tense: (VBZ): eats Modal (MD): should, can To (TO): to (to eat) POS TAGGING 11 English Parts of Speech (Adjectives) Adjective (modify nouns, identify properties or qualities of nouns) Basic (JJ): red, tall Comparative (JJR): redder, taller Superlative (JJS): reddest, tallest Adjective ordering restrictions in English: Old blue book, not Blue old book the 44th president a green product a responsible investment the dumbest, worst leader POS TAGGING 12 English Parts of Speech (Adverbs) Adverb (modify verbs) Basic (RB): quickly Comparative (RBR): quicker Superlative (RBS): quickest Unfortunately, John walked home extremely slowly yesterday Directional/locative adverbs (here, downhill) Degree adverbs (extremely, very, somewhat) Manner adverbs (slowly, slinkily, delicately) Temporal adverbs (yesterday, tomorrow) POS TAGGING 13 English Parts of Speech (Determiner) Is a word that occurs together with a noun or noun phrase and serves to express the reference of that noun or noun phrase in the context. That is, a determiner may indicate whether the noun is referring to a definite or indefinite element of a class, to a closer or more distant element, to an element belonging to a specified person or thing, to a particular number or quantity, etc. POS TAGGING 14 English Parts of Speech(Determiner) Common kinds of determiners include definite and indefinite articles (the, a, an) demonstratives (this, that, these) possessive determiners (my, their) quantifiers (many, few , several). POS TAGGING 15 English Parts of Speech ( preposition) Preposition (IN): a word governing, and usually preceding, a noun or pronoun and expressing a relation to another word or element in the clause, as in ‘the man on the platform’, ‘she arrived after dinner’. Ex: on, in, by, to, with POS TAGGING 16 English Parts of Speech Coordinating Conjunction (CC): that connects words, sentences, phrases or clauses. the truth of nature, and the power of giving interest Ex: and, but, or. Particle (RP): a particle is a function word that must be associated with another word or phrase to impart meaning, i.e., does not have its own lexical definition. Ex: off (took off), up (put up) POS TAGGING 17 POS tagging POS Tagging is a process that attaches each word in a sentence with a suitable tag from a given set of tags. Tagging is the assignment of a single part-of-speech tag to each word (and punctuation marker) in a corpus. The set of tags is called the Tag-set. Standard Tag-set : Penn Treebank (for English). POS TAGGING 18 POS tagging There are so many parts of speech, potential distinctions we can draw. To do POS tagging, we need to choose a standard set of tags to work with. Could pick very coarse tag sets. N, V, Adj, Adv. More commonly used set is finer grained (Penn TreeBank, 45 tags) PRP$, WRB, WP$, VBG POS TAGGING 19 POS Tag Ambiguity Deciding on the correct part of speech can be difficult even for people. In English : I bank1 on the bank2 on the river bank3 for my transactions. Bank1 is verb, the other two banks are nouns In Hindi : “Khaanaa” : can be noun (food) or verb (to eat) POS TAGGING 20 Measuring Ambiguity POS TAGGING 21 How Hard is POS Tagging? 
About 11% of the word types in the Brown corpus are ambiguous with regard to part of speech But they tend to be very common words 40% of the word tokens are ambiguous POS TAGGING 22 Penn TreeBank POS Tagset POS TAGGING 23 Using the Penn Tagset The/DT grand/JJ jury/NN commmented/VBD on/IN a/DT number/NN of/IN other/JJ topics/NNS ./. Prepositions and subordinating conjunctions marked IN (“although/IN I/PRP..”) Except the preposition/complementizer “to” is just marked “TO”. POS TAGGING 24 Process List all possible tag for each word in sentence. Choose best suitable tag sequence. Example ”People jump high”. People : Noun/Verb jump : Noun/Verb high : Noun/Verb/Adjective We can start with probabilities. POS TAGGING 25 Example POS TAGGING 26 Why POS POS tell us a lot about a word (and the words near it). E.g, adjectives often followed by nouns personal pronouns often followed by verbs possessive pronouns by nouns Pronunciations depends on POS, e.g. object (first syllable NN, second syllable VM), content, discount First step in many NLP applications POS TAGGING 27 Rule-Based Tagging Start with a dictionary. Assign all possible tags to words from the dictionary. Write rules by hand to selectively remove tags. Leaving the correct tag for each word. POS TAGGING 28 Step1: Start with a Dictionary she: PRP promised: VBN,VBD to: back: TO VB, JJ, RB, NN the: DT bill: NN, VB Etc… for the ~100,000 words of English with more than 1 tag POS TAGGING 29 Step2: Assign Every Possible Tag NN RB VBN JJ TOVB VB PRP VBD DT NN She promised to back the bill POS TAGGING 30 Step3: Write Rules to Eliminate Tags Eliminate VBN if VBD is an option when VBN|VBD follows “<start> PRP” NN RB VBN JJ TOVB VB PRP VBD DT NN She promised to back the bill POS TAGGING 31 POS TAGGING 32 END POS TAGGING 33 CSE528 Natural Language Processing Venue:ADB-405 Pr o f. Tu l asi Pr a sa d S a ri ki , S C SE, V I T Ch ennai Ca mpus www. l earn ersd esk.weeb l y. com Topic: Lexical Relations Contents Lexical Semantics Lexeme and Lexicon Relationships among Lexemes & senses LEXICAL RELATIONS Lexical Semantics Lexical semantics covers theories of the classification and decomposition of word meaning, the differences and similarities in lexical semantic structure between different languages, and the relationship of word meaning to sentence meaning and syntax. Lexical semantics explores Whether the meaning of a lexical unit is established by looking at its neighborhood in the semantic net , or if the meaning is already locally contained in the lexical unit. Another topic that is explored is the mapping of words to concepts. LEXICAL RELATIONS Lexeme and Lexicon Lexeme: A lexeme is the minimal unit of language which has a semantic interpretation and embodies a distinct cultural concept. Lexicon: Finite list of Lexeme. Orthographic form & Phonological form to refer to the appropriate form part of lexeme. Sense is one of the meanings of a word. LEXICAL RELATIONS Relationships among Lexemes & senses Homonymy Polysemy Synonymy Antonymy Hypernomy Hyponomy Meronomy LEXICAL RELATIONS Homonymy A homonym is, in the strict sense, one of a group of words that share the same spelling and pronunciation but may have different meanings. 
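The relations listed in this outline (synonymy, antonymy, hyponymy/hypernymy, meronymy), defined on the slides that follow, can be explored programmatically with NLTK's WordNet interface. A small sketch, assuming the WordNet data has been downloaded with nltk.download('wordnet'); the exact senses printed depend on the installed WordNet version.

from nltk.corpus import wordnet as wn
# import nltk; nltk.download('wordnet')   # one-time download of the WordNet data

car = wn.synset("car.n.01")            # one sense of "car"
print(car.lemma_names())               # synonyms sharing this sense (e.g. auto, automobile)
print(car.hypernyms())                 # more general senses: car is a kind of ...
print(car.part_meronyms())             # parts of a car (engine, wheel, ...)

print(len(wn.synsets("bank")))         # many senses: the homonymy/polysemy of "bank"

good = wn.synset("good.a.01")
print(good.lemmas()[0].antonyms())     # antonym lemma(s), e.g. "bad"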
Lexemes that share a form Phonological, orthographic or both But have unrelated, distinct meanings LEXICAL RELATIONS Homonymy Example: Bat (wooden stick-like thing) vs Bat (flying scary mammal thing) bank (financial institution) vs bank (riverside) Homophones: Write and right Piece and peace LEXICAL RELATIONS Homonymy causes problems for NLP applications Text-to-Speech Same orthographic form but different phonological form bass vs bass Information retrieval Different meanings same orthographic form QUERY: bat care Machine Translation Speech recognition LEXICAL RELATIONS Polysemy The bank is constructed from red brick. I withdrew the money from the bank . Are those the same sense? Which sense of bank is this? Is it distinct from (homonymous with) the river bank sense? A single lexeme with multiple related meanings (bank the building, bank the financial institution) Most non-rare words have multiple meanings LEXICAL RELATIONS Polysemy Verbs tend more to polysemy Distinguishing polysemy from homonymy. Example: Consider the verb got "procure" (I got the tickets), "become" (she got scared), "understand" (I get it) . LEXICAL RELATIONS Synonyms Word that have the same meaning in some or all contexts. couch / sofa big / large automobile / car vomit / throw up Water / H20 Two lexemes are synonyms if they can be successfully substituted for each other in all situations LEXICAL RELATIONS Synonymy is a relation between senses rather than words Consider the words big and large Are they synonyms? How big is that plane? Would I be flying on a large or small plane? How about here: Miss Nelson, for instance, became a kind of big sister to Benjamin. ?Miss Nelson, for instance, became a kind of large sister to Benjamin. Why? big has a sense that means being older, or grown up large lacks this sense LEXICAL RELATIONS Antonyms Senses that are opposites with respect to one feature of their meaning Otherwise, they are very similar! dark / light short / long hot / cold up / down in / out More formally: antonyms can Define a binary opposition or at opposite ends of a scale (long/short, fast/slow) Be reversives: rise/fall, up/down LEXICAL RELATIONS Hyponymy One sense is a hyponym of another if the first sense is more specific, denoting a subclass of the other car is a hyponym of vehicle dog is a hyponym of animal mango is a hyponym of fruit Conversely vehicle is a hypernym/superordinate of car animal is a hypernym of dog fruit is a hypernym of mango LEXICAL RELATIONS Hypernymy more formally superordinate vehicle fruit furniture mammal hyponym car mango chair dog Extensional: The class denoted by the superordinate extensionally includes the class denoted by the hyponym Entailment: A sense A is a hyponym of sense B if being an A entails being a B Hyponymy is usually transitive A hypo B and B hypo C entails A hypo C LEXICAL RELATIONS Meronomy A meronomy or partonomy is a type of hierarchy that deals with part– whole relationships. relation between two concepts (classes) or objects (instances) – ‘whole’ and ‘part’ class-level : every instance of the whole includes one or more instances of the part instance-level : the whole instance includes the part instance LEXICAL RELATIONS Meronomy A wife has a husband. A husband is a part of a wife. A sound has a pitch and a volume. A volume is a part of a sound. Changing nappies is part of being a mother. Being a mother has changing nappies. A book has pages. 
A page is a part of a book LEXICAL RELATIONS Meronomy Cars have parts: engine, headlight, wheel Engines have parts: crankcase, carburetor Headlights have parts: headlight bulb, reflector A meronomy is a partial ordering of concept types by the part– whole relation LEXICAL RELATIONS Meronomy Axioms: Transitive – "Parts of parts are parts of the whole" – if A is part of B and B is part of C, then A is part of C. Reflexive – "Everything is part of itself" – A is part of A. Antisymmetric – "Nothing is a part of its parts" – if A is part of B and A != B then B is not part of A. LEXICAL RELATIONS LEXICAL RELATIONS An introduction to part-of-speech tagging and the Hidden Markov Model Let’s go back into the times when we had no language to communicate. The only way we had was sign language. That’s how we usually communicate with our dog at home, right? When we tell him, “We love you, Jimmy,” he responds by wagging his tail. This doesn’t mean he knows what we are actually saying. Instead, his response is simply because he understands the language of emotions and gestures more than words. We as humans have developed an understanding of a lot of nuances of the natural language more than any animal on this planet. That is why when we say “I LOVE you, honey” vs when we say “Lets make LOVE, honey” we mean different things. Since we understand the basic difference between the two phrases, our responses are very different. It is these very intricacies in natural language understanding that we want to teach to a machine. What this could mean is when your future robot dog hears “I love you, Jimmy”, he would know LOVE is a Verb. He would also realize that it’s an emotion that we are expressing to which he would respond in a certain way. And maybe when you are telling your partner “Lets make LOVE”, the dog would just stay out of your business ?. This is just an example of how teaching a robot to communicate in a language known to us can make things easier. The primary use case being highlighted in this example is how important it is to understand the difference in the usage of the word LOVE, in different contexts. Part-of-Speech Tagging From a very small age, we have been made accustomed to identifying part of speech tags. For example, reading a sentence and being able to identify what words act as nouns, pronouns, verbs, adverbs, and so on. All these are referred to as the part of speech tags. Let’s look at the Wikipedia definition for them: In corpus linguistics, part-of-speech tagging (POS tagging or PoS tagging or POST), also called grammatical tagging or word-category disambiguation, is the process of marking up a word in a text (corpus) as corresponding to a particular part of speech, based on both its definition and its context — i.e., its relationship with adjacent and related words in a phrase, sentence, or paragraph. A simplified form of this is commonly taught to school-age children, in the identification of words as nouns, verbs, adjectives, adverbs, etc. POS tagging is not something that is generic. It is quite possible for a single word to have a different part of speech tag in different sentences based on different contexts. That is why it is impossible to have a generic mapping for POS tags. Words often have more than one POS: back • The back door = JJ • On my back = NN • Win the voters back = RB • Promised to back the bill = VB As you can see, it is not possible to manually find out different part-of-speech tags for a given corpus. 
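The point about "back" can be checked directly with NLTK, using the same calls that appear later in this article. The exact output depends on the tagger model, but a well-trained tagger should reproduce roughly the JJ / NN / RB / VB pattern listed above.

import nltk
from nltk import word_tokenize, pos_tag
# nltk.download('punkt'); nltk.download('averaged_perceptron_tagger')  # one-time setup,
# resource names may vary slightly with your NLTK version

for sent in ["The back door", "On my back", "Win the voters back", "Promised to back the bill"]:
    print(pos_tag(word_tokenize(sent)))
# Expected (per the example above): back = JJ, NN, RB and VB respectively.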
New types of contexts and new words keep coming up in dictionaries in various languages, and manual POS tagging is not scalable in itself. That is why we rely on machine-based POS tagging. Before proceeding further and looking at how part-of-speech tagging is done, we should look at why POS tagging is necessary and where it can be used. Why Part-of-Speech tagging? Part-of-Speech tagging in itself may not be the solution to any particular NLP problem. It is however something that is done as a pre-requisite to simplify a lot of different problems. Let us consider a few applications of POS tagging in various NLP tasks. Text to Speech Conversion Let us look at the following sentence: They refuse to permit us to obtain the refuse permit. The word refuse is being used twice in this sentence and has two different meanings here. refUSE (/rəˈfyo͞oz/)is a verb meaning “deny,” while REFuse(/ˈrefˌyo͞os/) is a noun meaning “trash” (that is, they are not homophones). Thus, we need to know which word is being used in order to pronounce the text correctly. (For this reason, text-to-speech systems usually perform POS-tagging.) Have a look at the part-of-speech tags generated for this very sentence by the NLTK package. >>> text = word_tokenize("They refuse to permit us to obtain the refuse permit") >>> nltk.pos_tag(text) [('They', 'PRP'), ('refuse', 'VBP'), ('to', 'TO'), ('permit', 'VB'), ('us', 'PRP'),('to', 'TO'), ('obtain', 'VB'), ('the', 'DT'), ('refuse', 'NN'), ('permit', 'NN')] As we can see from the results provided by the NLTK package, POS tags for both refUSE and REFuse are different. Using these two different POS tags for our text to speech converter can come up with a different set of sounds. Similarly, let us look at yet another classical application of POS tagging: word sense disambiguation. Word Sense Disambiguation Let’s talk about this kid called Peter. Since his mother is a neurological scientist, she didn’t send him to school. His life was devoid of science and math. One day she conducted an experiment, and made him sit for a math class. Even though he didn’t have any prior subject knowledge, Peter thought he aced his first test. His mother then took an example from the test and published it as below. Words often occur in different senses as different parts of speech. For example: • She saw a bear. • Your efforts will bear fruit. The word bear in the above sentences has completely different senses, but more importantly one is a noun and other is a verb. Rudimentary word sense disambiguation is possible if you can tag words with their POS tags. Word-sense disambiguation (WSD) is identifying which sense of a word (that is, which meaning) is used in a sentence, when the word has multiple meanings. Try to think of the multiple meanings for this sentence: Time flies like an arrow Here are the various interpretations of the given sentence. The meaning and hence the part-ofspeech might vary for each word. As we can clearly see, there are multiple interpretations possible for the given sentence. Different interpretations yield different kinds of part of speech tags for the words.This information, if available to us, can help us find out the exact version / interpretation of the sentence and then we can proceed from there. The above example shows us that a single sentence can have three different POS tag sequences assigned to it that are equally likely. That means that it is very important to know what specific meaning is being conveyed by the given sentence whenever it’s appearing. 
This is word sense disambiguation, as we are trying to find out THE sequence. These are just two of the numerous applications where we would require POS tagging. There are other applications as well which require POS tagging, like Question Answering, Speech Recognition, Machine Translation, and so on. Now that we have a basic knowledge of different applications of POS tagging, let us look at how we can go about actually assigning POS tags to all the words in our corpus. Types of POS taggers POS-tagging algorithms fall into two distinctive groups: • Rule-Based POS Taggers • Stochastic POS Taggers E. Brill’s tagger, one of the first and most widely used English POS-taggers, employs rule-based algorithms. Let us first look at a very brief overview of what rule-based tagging is all about. Rule-Based Tagging Automatic part of speech tagging is an area of natural language processing where statistical techniques have been more successful than rule-based methods. Typical rule-based approaches use contextual information to assign tags to unknown or ambiguous words. Disambiguation is done by analyzing the linguistic features of the word, its preceding word, its following word, and other aspects. For example, if the preceding word is an article, then the word in question must be a noun. This information is coded in the form of rules. Example of a rule: If an ambiguous/unknown word X is preceded by a determiner and followed by a noun, tag it as an adjective. Defining a set of rules manually is an extremely cumbersome process and is not scalable at all. So we need some automatic way of doing this. The Brill’s tagger is a rule-based tagger that goes through the training data and finds out the set of tagging rules that best define the data and minimize POS tagging errors. The most important point to note here about Brill’s tagger is that the rules are not hand-crafted, but are instead found out using the corpus provided. The only feature engineering required is a set of rule templates that the model can use to come up with new features. Let’s move ahead now and look at Stochastic POS tagging. Stochastic Part-of-Speech Tagging The term ‘stochastic tagger’ can refer to any number of different approaches to the problem of POS tagging. Any model which somehow incorporates frequency or probability may be properly labelled stochastic. The simplest stochastic taggers disambiguate words based solely on the probability that a word occurs with a particular tag. In other words, the tag encountered most frequently in the training set with the word is the one assigned to an ambiguous instance of that word. The problem with this approach is that while it may yield a valid tag for a given word, it can also yield inadmissible sequences of tags. An alternative to the word frequency approach is to calculate the probability of a given sequence of tags occurring. This is sometimes referred to as the n-gram approach, referring to the fact that the best tag for a given word is determined by the probability that it occurs with the n previous tags. This approach makes much more sense than the one defined before, because it considers the tags for individual words based on context. The next level of complexity that can be introduced into a stochastic tagger combines the previous two approaches, using both tag sequence probabilities and word frequency measurements. This is known as the Hidden Markov Model (HMM). Before proceeding with what is a Hidden Markov Model, let us first look at what is a Markov Model. 
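Before looking at Markov models, here is a minimal sketch of the simplest stochastic tagger just described: give each word the tag it was seen with most often in a tagged training corpus (a unigram baseline). The tiny training list below is made up purely for illustration.

from collections import Counter, defaultdict

# Unigram "most frequent tag" baseline; unseen words fall back to NN.
training = [("the", "DT"), ("back", "JJ"), ("door", "NN"), ("on", "IN"),
            ("my", "PRP$"), ("back", "NN"), ("win", "VB"), ("the", "DT"),
            ("voters", "NNS"), ("back", "RB"), ("back", "NN")]

counts = defaultdict(Counter)
for word, tag in training:
    counts[word][tag] += 1
most_frequent = {w: c.most_common(1)[0][0] for w, c in counts.items()}

def unigram_tag(tokens):
    return [(t, most_frequent.get(t, "NN")) for t in tokens]

print(unigram_tag(["the", "back", "door"]))
# 'back' always gets its single most frequent tag (here NN), even when JJ would be right --
# exactly the weakness that motivates scoring whole tag sequences (the n-gram / HMM approach).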
That will help us understand what the term "hidden" means in HMMs.

Markov Model

Say that there are only three kinds of weather conditions, namely:
• Rainy
• Sunny
• Cloudy

Now, since our young friend Peter is a small kid, he loves to play outside. He loves it when the weather is sunny, because all his friends come out to play, and he hates rainy weather for obvious reasons. Every day his mother observes the weather in the morning (that is when he usually goes out to play), and, like always, Peter comes up to her right after getting up and asks her what the weather is going to be like. Since she is a responsible parent, she wants to answer that question as accurately as possible. But the only thing she has is a set of observations taken over multiple days of how the weather has been. How does she predict today's weather based on what the weather has been for the past N days?

Say you have a sequence like this: Sunny, Rainy, Cloudy, Cloudy, Sunny, Sunny, Sunny, Rainy.

So the weather on any given day can be in any of the three states. Let's say we decide to use a Markov chain model to solve this problem. Using the data that we have, we can construct a state diagram with labelled transition probabilities, and compute the probability of today's weather given the previous observations by using the Markov property. A Markov chain is essentially the simplest Markov model, that is, a model that obeys the Markov property. The Markov property says that the distribution of a random variable in the future depends solely on the current state; none of the earlier states has any impact on the future states. Applying the Markov property to the weather example, the probability of tomorrow's weather being Sunny depends solely on today's weather and not on yesterday's.
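As a minimal sketch of the Markov property at work, the snippet below scores a weather sequence using only day-to-day transition probabilities. The probability values and the uniform start distribution are assumptions for illustration, since the labelled state diagram from the original material is not reproduced here.

# Assumed transition probabilities P(tomorrow | today); illustrative only.
transition = {
    'Sunny':  {'Sunny': 0.6, 'Rainy': 0.2, 'Cloudy': 0.2},
    'Rainy':  {'Sunny': 0.3, 'Rainy': 0.4, 'Cloudy': 0.3},
    'Cloudy': {'Sunny': 0.4, 'Rainy': 0.3, 'Cloudy': 0.3},
}
start = {'Sunny': 1/3, 'Rainy': 1/3, 'Cloudy': 1/3}  # assumed uniform start

def sequence_probability(states):
    """P(s1, ..., sn) under the Markov property: each day depends only on
    the previous day, never on the days before that."""
    prob = start[states[0]]
    for prev, curr in zip(states, states[1:]):
        prob *= transition[prev][curr]
    return prob

print(sequence_probability(['Sunny', 'Rainy', 'Cloudy', 'Cloudy']))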
Let us now proceed and see what is hidden in Hidden Markov Models.

Hidden Markov Model

It's the small kid Peter again, and this time he's going to pester his new caretaker, which is you. (Oops!) As a caretaker, one of your most important tasks is to tuck Peter into bed and make sure he is sound asleep. Once you've tucked him in, you want to make sure he's actually asleep and not up to some mischief. You cannot, however, enter the room again, as that would surely wake him up. So all you have to go on are the noises that might come from the room: either the room is quiet or there is noise coming from it. These are your observations. Peter's mother, before leaving you to this nightmare, said: May the sound be with you :)

His mother has given you a state diagram with states, observations, and probabilities. Note that there is no direct correlation between sound from the room and Peter being asleep. There are two kinds of probabilities that we can read off the state diagram:
• One is the emission probabilities, which represent the probability of making a certain observation given a particular state. For example, P(noise | awake) = 0.5 is an emission probability.
• The other is the transition probabilities, which represent the probability of moving to another state given the current state. For example, P(asleep | awake) = 0.4 is a transition probability.

The Markov property applies to this model as well, so do not complicate things too much. Markov, your saviour, said: don't go too much into the history. The Markov property, as applicable to the example considered here, is that the probability of Peter being in a state depends ONLY on the previous state. There is a clear flaw in this assumption: if Peter has been awake for an hour, the probability of him falling asleep is higher than if he has been awake for just five minutes. So history does matter, and the Markov state-machine model is not completely correct; it is merely a simplification. The Markov property, although wrong, makes the problem very tractable.

We usually observe longer stretches of the child being awake and being asleep. If Peter is awake now, the probability of him staying awake is higher than the probability of him going to sleep; hence the 0.6 and 0.4 in the diagram: P(awake | awake) = 0.6 and P(asleep | awake) = 0.4.

Before actually trying to solve the problem at hand using HMMs, let's relate this model to the task of part-of-speech tagging.

HMMs for Part-of-Speech Tagging

We know that to model any problem using a Hidden Markov Model, we need a set of observations and a set of possible states, and that the states in an HMM are hidden. In the part-of-speech tagging problem, the observations are the words themselves in the given sequence, while the hidden states are the POS tags of the words. The transition probabilities are of the form P(VP | NP), that is, the probability of the current word having the tag Verb Phrase given that the previous tag was Noun Phrase. The emission probabilities are of the form P(john | NP) or P(will | VP), that is, the probability that the word is, say, John given that the tag is Noun Phrase. Note that this is just an informal modelling of the problem, meant to provide a very basic understanding of how the part-of-speech tagging problem can be modelled using an HMM.

How do we solve this?

Coming back to our problem of taking care of Peter: we have an initial state (Peter was awake when you tucked him into bed), and after that you recorded a sequence of observations, namely noise or quiet, at different time steps. Using this set of observations and the initial state, you want to find out whether Peter will be awake or asleep after, say, N time steps. We draw all possible transitions starting from the initial state. There is an exponential number of branches as we keep moving forward, so the model grows exponentially after a few time steps, even without considering any observations.

If we had the sequence of states, we could calculate the probability of the sequence. But we don't have the states; all we have is a sequence of observations. This is why the model is called a Hidden Markov Model: the actual states over time are hidden.
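The following is a minimal, hedged sketch of the caretaker HMM. P(awake | awake) = 0.6, P(asleep | awake) = 0.4 and P(noise | awake) = 0.5 come from the example above; the remaining transition and emission values are assumptions added so the sketch is runnable. It scores an observation sequence by brute-force enumeration of every hidden state sequence, which is exactly the exponential blow-up described above (the forward algorithm avoids it, but that is beyond this sketch).

from itertools import product

states = ['awake', 'asleep']
transition = {'awake':  {'awake': 0.6, 'asleep': 0.4},
              'asleep': {'awake': 0.2, 'asleep': 0.8}}   # 'asleep' row assumed
emission   = {'awake':  {'noise': 0.5, 'quiet': 0.5},
              'asleep': {'noise': 0.1, 'quiet': 0.9}}    # 'asleep' row assumed

def observation_probability(observations, initial_state='awake'):
    """Sum P(observations, hidden sequence) over all hidden sequences.
    The number of sequences is 2 ** len(observations), i.e. exponential."""
    total = 0.0
    for hidden in product(states, repeat=len(observations)):
        prob, prev = 1.0, initial_state
        for state, obs in zip(hidden, observations):
            prob *= transition[prev][state] * emission[state][obs]
            prev = state
        total += prob
    return total

print(observation_probability(['quiet', 'quiet', 'noise']))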
Chapter 1 Multiword Expressions

Contents:
1.1 Introduction
1.2 Linguistic Properties of MWEs
1.3 Types of MWE
1.4 MWE Classification
1.5 Research Issues
1.6 Summary
Bibliography

[Source: Baldwin, Timothy and Su Nam Kim (2010) Multiword Expressions, in Nitin Indurkhya and Fred J. Damerau (eds.) Handbook of Natural Language Processing, Second Edition, CRC Press, Boca Raton, USA, pp. 267-292.]

1.1 Introduction

Languages are made up of words, which combine via morphosyntax to encode meaning in the form of phrases and sentences. While it may appear relatively innocuous, the question of what constitutes a "word" is a surprisingly vexed one. First, are dog and dogs two separate words, or variants of a single word? The traditional view from lexicography and linguistics is to treat them as separate inflected wordforms of the lexeme dog, as any difference in the syntax/semantics of the two words is predictable from the general process of noun pluralisation in English. Second, what is the status of expressions like top dog and dog days? A speaker of English who knew top, dog and day in isolation, but had never been exposed to these two expressions, would be hard put to predict their semantics of "person who is in charge" and "period of inactivity", respectively. (All glosses in this chapter are taken from WordNet 3.0 (Fellbaum 1998).) To be able to retrieve the semantics of these expressions, they must have lexical status of some form in the mental lexicon, which encodes their particular semantics. Expressions such as these, which have surprising properties not predicted by their component words, are referred to as multiword expressions (MWEs). (Terms which are largely synonymous with "multiword expression" are "multiword unit", "multiword lexical item", "phraseological unit" and "fixed expression"; there is also variation in the hyphenation of "multiword", with "multi-word" in common use.)

The focus of this chapter is the precise nature and types of MWEs, and the current state of MWE research in NLP. Armed with our informal description of MWEs, let's first motivate this chapter with a brief overview of the range of MWEs, and the complexities associated with them. We return to define MWEs formally in Section 1.2. (1a)–(1b) include a number of MWEs, underlined in the original:

(1) a. In a nutshell, the administrator can take advantage of the database's many features through a single interface.
    b. You should also jot down the serial number of your television video.

As we can see, analogously to simple words, MWEs can occur in a wide range of lexical and syntactic configurations (e.g. nominal, verbal and adverbial). Semantically, we can observe different effects: in some cases (e.g. serial number and television video), the component words preserve their original semantics, but the MWE encodes extra semantics (e.g. the fact that a television video is a single-unit device, and usually designed to be portable); in other cases (e.g.
in a nutshell, meaning "summed up briefly"), the semantics of one or more of the component words has no obvious bearing on the semantics of the MWE. While all of the MWE examples we have seen to date have occurred as contiguous units, this is not always the case:

(2) a. She likes to take a long bath for relaxation after exams.
    b. Kim hates to put her friends out.

For example, in (2a), long is an internal modifier and not a component of the base MWE take a bath, as there is nothing surprising about the syntax of the modified MWE or the resulting semantics (c.f. take a short/leisurely/warm/mud/... bath).

How big an issue are MWEs, though? The number of MWEs is estimated to be of the same order of magnitude as the number of simplex words in a speaker's lexicon (Jackendoff 1997; Tschichold 1998; Pauwels 2000). At the type level, therefore, MWEs are as much of an issue as simple words. Added to this, new (types of) MWE are continuously created as languages evolve (e.g. shock and awe, carbon footprint, credit crunch) (Gates 1988; Tschichold 1998; Fazly, Cook, and Stevenson 2009). Crosslingually, MWEs have been documented across a broad spectrum of the world's languages (see the companion web site for this chapter for a detailed listing of references). In fact, MWEs are such an efficient way of providing nuance and facilitating lexical expansion with a relatively small simplex lexicon that it is highly doubtful that any language would evolve without MWEs of some description.

MWEs are broadly used to enhance fluency and understandability, or to mark the register/genre of language use (Fillmore, Kay, and O'Connor 1988; Liberman and Sproat 1992; Nunberg, Sag, and Wasow 1994; Dirven 2001). For example, MWEs can make language more or less informal/colloquial (c.f. London Underground vs. Tube, and piss off vs. annoy). Regionally, MWEs vary considerably. For example, take away and take out are identical in meaning, but the former is the preferred expression in British/Australian English, while the latter is the preferred expression in American English. Other examples are phone box vs. phone booth, lay the table vs. set the table, and no through road vs. not a through street, respectively.

There is a modest body of research on modelling MWEs which has been integrated into NLP applications, e.g. for the purposes of fluency, robustness or better understanding of natural language. One area where MWEs have traditionally been used heavily (either explicitly or implicitly) is machine translation, as a means of capturing subtle syntactic, semantic and pragmatic effects in the source and target languages (Miyazaki, Ikehara, and Yokoo 1993; Gerber and Yang 1997; Melamed 1997; Matsuo, Shirai, Yokoo, and Ikehara 1997). Understanding MWEs has broad utility in tasks ranging from syntactic disambiguation to conceptual (semantic) comprehension. Explicit lexicalised MWE data helps simplify the syntactic structure of sentences that include MWEs, and conversely, a lack of MWE lexical items in a precision grammar is a significant source of parse errors (Baldwin, Bender, Flickinger, Kim, and Oepen 2004). Additionally, it has been shown that accurate recognition of MWEs influences the accuracy of semantic tagging (Piao, Rayson, Archer, Wilson, and McEnery 2003), and word alignment in machine translation (MT) can be improved through a specific handling of the syntax and semantics of MWEs (Venkatapathy and Joshi 2006).
1.2 Linguistic Properties of MWEs

We adopt the following formal definition of multiword expression, following Sag, Baldwin, Bond, Copestake, and Flickinger (2002):

(3) Multiword expressions (MWEs) are lexical items that: (a) can be decomposed into multiple lexemes; and (b) display lexical, syntactic, semantic, pragmatic and/or statistical idiomaticity.

In languages such as English, the conventional interpretation of the requirement of decomposability into lexemes is that MWEs must in themselves be made up of multiple whitespace-delimited words. For example, marketing manager is potentially a MWE as it is made up of two lexemes (marketing and manager), while fused words such as lighthouse are conventionally not classified as MWEs. (In practice, a significant subset of research on English noun compounds (see Section 1.3.1) has considered both fused and whitespace-separated expressions.) In languages such as German, the high productivity of compound nouns such as Kontaktlinse "contact lens" (the concatenation of Kontakt "contact" and Linse "lens"), without whitespace delimitation, means that we tend to relax this restriction and allow for single-word MWEs. In non-segmenting languages such as Japanese and Chinese (Baldwin and Bond 2002; Xu, Lu, and Li 2006), we are spared this artificial consideration. The ability to decompose an expression into multiple lexemes is still applicable, however, and leads to the conclusion, e.g., that fukugō-hyōgen "multiword expression" is a MWE (both fukugō "compound" and hyōgen "expression" are standalone lexemes), but buchō "department head" is not (bu "department" is a standalone lexeme, but chō "head" is not).

The second requirement on a MWE is for it to be idiomatic. We provide a detailed account of idiomaticity in its various manifestations in the following section.

1.2.1 Idiomaticity

In the context of MWEs, idiomaticity refers to markedness or deviation from the basic properties of the component lexemes, and applies at the lexical, syntactic, semantic, pragmatic, and/or statistical levels. A given MWE is often idiomatic at multiple levels (e.g. syntactic, semantic and statistical in the case of by and large), as we return to illustrate in Section 1.2.3. Closely related to the notion of idiomaticity is compositionality, which we consider to be the degree to which the features of the parts of a MWE combine to predict the features of the whole. While compositionality is often construed as applying exclusively to semantic idiomaticity (hence by "non-compositional MWE", researchers tend to mean a semantically idiomatic MWE), in practice it can apply across all the same levels as idiomaticity. Below, we present an itemised account of each sub-type of idiomaticity.

1.2.1.1 Lexical Idiomaticity

Lexical idiomaticity occurs when one or more components of an MWE are not part of the conventional English lexicon. For example, ad hoc is lexically marked in that neither of its components (ad and hoc) is a standalone English word. (Note that the idiomaticity is diminished if the speaker has knowledge of the Latin origins of the term; also, while the component words don't have status as standalone lexical items, they do occur in other MWEs, e.g. ad nauseam and post hoc.) Lexical idiomaticity inevitably results in syntactic and semantic idiomaticity, because there is no lexical knowledge associated directly with the parts from which to predict the behaviour of the MWE. As such, it is one of the most clear-cut and predictive properties of MWEhood.

1.2.1.2 Syntactic Idiomaticity

Syntactic idiomaticity occurs when the syntax of the MWE is not derived directly from that of its components (Katz and Postal 2004; Chafe 1968; Bauer 1983; Sag, Baldwin, Bond, Copestake, and Flickinger 2002).
For example, by and large is syntactically idiomatic in that it is adverbial in nature, but made up of the anomalous coordination of a preposition (by) and an adjective (large). On the other hand, take a walk is not syntactically marked, as it is a simple verb–object combination which is derived transparently from a transitive verb (take) and a countable noun (walk). Syntactic idiomaticity can also occur at the constructional level, in classes of MWEs having syntactic properties which are differentiated from those of their component words, e.g. verb-particle constructions (Section 1.3.2.1) and determinerless prepositional phrases (Section 1.3.3.2).

1.2.1.3 Semantic Idiomaticity

Semantic idiomaticity is the property of the meaning of a MWE not being explicitly derivable from its parts (Katz and Postal 2004; Chafe 1968; Bauer 1983; Sag, Baldwin, Bond, Copestake, and Flickinger 2002). For example, middle of the road usually signifies "non-extremism, especially in political views", which we could not readily predict from either middle or road. On the other hand, to and fro is not semantically marked, as its semantics is fully predictable from its parts. Many cases are not as clear cut as these, however. The semantics of blow hot and cold ("constantly change opinion"), for example, is partially predictable from blow ("move" and hence "change"), but not as immediately from hot and cold. There are also cases where the meanings of the parts are transparently inherited but there is additional semantic content which has no overt realisation. One such example is bus driver where, modulo the effects of word sense disambiguation, bus and driver both have their expected meanings, but there is additionally the default expectation that a bus driver is "one who drives a bus" and not "one who drives like a bus" or "an object for driving buses with", for example.

Closely related to the issue of semantic idiomaticity is the notion of figuration, i.e. the property of the components of a MWE having some metaphoric (e.g. take the bull by the horns), hyperbolic (e.g. not worth the paper it's printed on) or metonymic (e.g. lend a hand) meaning in addition to their literal meaning (Fillmore, Kay, and O'Connor 1988; Nunberg, Sag, and Wasow 1994). As an illustration of decomposability via metaphorical figuration, consider the English idiom spill the beans. Assuming a formal semantic representation of reveal'(secret') for the MWE, we can coerce the semantics of spill and beans into reveal' and secret', respectively, to arrive at a figurative interpretation of the MWE semantics. A compositionality analysis would not be able to predict this regularity, as these senses of spill and beans are not readily available outside this particular MWE. Predictably, MWEs vary in the immediacy of their decomposability (with get the nod being more transparently decomposable than spill the beans, e.g.), and not all MWEs are decomposable (c.f. kick the bucket). We return to discuss the interaction between decomposability and syntactic flexibility in Section 1.3.2.4. One intriguing aspect of semantic idiomaticity is that higher-usage MWEs are generally perceived to be less semantically idiomatic, or at least more readily decomposable (Keysar and Bly 1995).
1.2.1.4 Pragmatic Idiomaticity

Pragmatic idiomaticity is the condition of a MWE being associated with a fixed set of situations or a particular context (Kastovsky 1982; Jackendoff 1997; Sag, Baldwin, Bond, Copestake, and Flickinger 2002). Good morning and all aboard are examples of pragmatic MWEs: the first is a greeting associated specifically with mornings (which is not to say that it can't be used ironically at other times of the day!), and the second is a command associated with the specific situation of a train station or dock, and the imminent departure of a train or ship. Pragmatically idiomatic MWEs are often ambiguous with (non-situated) literal readings; e.g. good morning can mean "pleasant morning" (c.f. Kim had a good morning).

1.2.1.5 Statistical Idiomaticity

Statistical idiomaticity occurs when a particular combination of words occurs with markedly high frequency, relative to the component words or alternative phrasings of the same expression (Cruse 1986; Sag, Baldwin, Bond, Copestake, and Flickinger 2002). Table 1.1 presents an illustration of statistical idiomaticity, adapted from Cruse (1986, p281). The example is based on a cluster of near-synonym adjectives (flawless, immaculate, impeccable and spotless) and their affinity to pre-modify a range of nouns. For a given pairing of adjective and noun, we indicate the compatibility in the form of discrete markers: "+" indicates a positive lexical affinity, "?" a neutral lexical affinity, and "−" a negative lexical affinity. For example, immaculate has a strong lexical affinity with performance (i.e. immaculate performance is a relatively common expression), whereas spotless has a negative affinity with credentials (i.e. spotless credentials is relatively infrequent). There may, of course, be phonological, semantic or other grounds for particular adjective–noun combinations being more or less frequent; statistical idiomaticity is simply an observation of the relative frequency of a given combination. It is also important to note that statistical idiomaticity is a continuously-graded phenomenon, and our predictions about lexical affinity in Table 1.1 are most naturally interpreted as a ranking of the propensity for each of the adjectives to occur as a pre-modifier of record; for example, impeccable and spotless are more probable choices than immaculate, which is in turn more probable than flawless.

Table 1.1: Examples of statistical idiomaticity ("+" = strong lexical affinity, "?" = neutral lexical affinity, "−" = negative lexical affinity). The table pairs the adjectives flawless, immaculate, impeccable and spotless (rows) with a range of nouns including condition, credentials, hair, house, logic and timing (columns); the individual cell values are not reliably recoverable from this copy and are omitted.

Another striking case of statistical idiomaticity is with binomials such as black and white (as in black and white television), where the reverse noun ordering does not preserve the lexicalised semantics of the word combination (c.f. ?white and black television) (Benor and Levy 2006). The arbitrariness of the preferred ordering in English is poignantly illustrated by its being reversed in other languages, e.g. shirokuro "white and black" and blanco y negro "white and black" in Japanese and Spanish, respectively.
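Lexical affinity of the kind summarised in Table 1.1 is usually quantified with an association measure over corpus counts. Below is a minimal sketch using pointwise mutual information (PMI) over a toy word list; the "corpus" and the resulting scores are invented purely for illustration and do not reproduce the judgements in the table.

import math
from collections import Counter

# Toy data standing in for real corpus counts.
corpus = ("impeccable timing " * 8 + "impeccable logic " * 6 +
          "spotless house " * 7 + "flawless logic " * 5 +
          "spotless timing " + "flawless house ").split()

unigrams = Counter(corpus)
bigrams = Counter(zip(corpus, corpus[1:]))
n_uni, n_bi = len(corpus), len(corpus) - 1

def pmi(w1, w2):
    """log2 of P(w1 w2) / (P(w1) * P(w2)); higher means stronger affinity."""
    if bigrams[(w1, w2)] == 0:
        return float('-inf')
    p_pair = bigrams[(w1, w2)] / n_bi
    return math.log2(p_pair / ((unigrams[w1] / n_uni) * (unigrams[w2] / n_uni)))

for pair in [('impeccable', 'timing'), ('spotless', 'timing'), ('flawless', 'house')]:
    print(pair, round(pmi(*pair), 2))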
Statistical idiomaticity relates closely to the notion of institutionalisation (a.k.a. conventionalisation), i.e. a particular word combination coming to be used to refer to a given object (Fernando and Flavell 1981; Bauer 1983; Nunberg, Sag, and Wasow 1994; Sag, Baldwin, Bond, Copestake, and Flickinger 2002). For example, traffic light is the conventionalised descriptor for "a visual signal to control the flow of traffic at intersections". There is no a priori reason why it shouldn't instead be called a traffic director or intersection regulator, but the simple fact of the matter is that it is not referred to using either of those expressions; instead, traffic light was historically established as the canonical term for referring to the object. Similarly, it is an arbitrary fact of the English language that we say many thanks and not *several thanks, and salt and pepper in preference to pepper and salt (which is not to say there weren't grounds for the selection of the canonical form at its genesis, e.g. for historical, crosslingual or phonological reasons). We term these anti-collocations of the respective MWEs (Pearce 2001): lexico-syntactic variants of MWEs which have unexpectedly low frequency, and which in doing so contrastively highlight the statistical idiomaticity of the target expression. (The term anti-collocation originated in the context of collocation research, see Section 1.2.4; while noting the potential for confusion, we use it in the broader context of MWEs as a tool for analysing the statistical idiomaticity of a candidate MWE relative to alternative forms of the same basic expression.)

1.2.2 Other Properties of MWEs

Other common properties of MWEs are: crosslingual variation, single-word paraphrasability, proverbiality and prosody. Unlike idiomaticity, where some form of idiomaticity is a necessary feature of MWEs, these other properties are neither necessary nor sufficient. Prosody relates to semantic idiomaticity, while the other properties are independent of idiomaticity as described above.

• Crosslingual variation

There is remarkable variation in MWEs across languages (Villavicencio, Baldwin, and Waldron 2004). In some cases, there is direct lexico-syntactic correspondence for a crosslingual MWE pair with similar semantics. For example, in the red has a direct lexico-syntactic correlate in Portuguese with the same semantics: no vermelho, where no is the contraction of in and the, vermelho means red, and both idioms are prepositional phrases (PPs). Others have identical syntax but differ lexically. For example, in the black corresponds to no azul ("in the blue") in Portuguese, with a different choice of colour term (blue instead of black). More obtusely, bring the curtain down on corresponds to the Portuguese botar um ponto final em (lit. "put the final dot in"), with similar syntactic make-up but radically different lexical composition. Other MWEs again are lexically similar but syntactically differentiated. For example, in a corner (e.g. The media has him in a corner) and encurralado ("cornered") are semantically equivalent but realised by different constructions: a PP in English and an adjective in Portuguese. There are of course many MWEs which have no direct translation equivalent in a second language. For example, the Japanese MWE zoku-giin, meaning "legislators championing the causes of selected industries", has no direct translation in English (Tanaka and Baldwin 2003). Equally, there are terms which are realised as MWEs in one language but single-word lexemes in another, such as interest rate and its Japanese equivalent riritsu.
• Single-word paraphrasability

Single-word paraphrasability is the observation that significant numbers of MWEs can be paraphrased with a single word (Chafe 1968; Gibbs 1980; Fillmore, Kay, and O'Connor 1988; Liberman and Sproat 1992; Nunberg, Sag, and Wasow 1994). While some MWEs are single-word paraphrasable (e.g. leave out = omit), others are not (e.g. look up = ?). Also, MWEs with arguments can sometimes be paraphrasable (e.g. take off clothes = undress), just as non-MWEs comprised of multiple words can be single-word paraphrasable (e.g. drop sharply = plummet).

• Proverbiality

Proverbiality is the ability of a MWE to "describe and implicitly to explain a recurrent situation of particular social interest in virtue of its resemblance or relation to a scenario involving homely, concrete things and relations" (Nunberg, Sag, and Wasow 1994). For example, verb-particle constructions and idioms are often indicators of more informal situations (e.g. piss off is an informal form of annoy, and drop off is an informal form of fall asleep).

• Prosody

MWEs can have distinct prosody, i.e. stress patterns, from compositional language (Fillmore, Kay, and O'Connor 1988; Liberman and Sproat 1992; Nunberg, Sag, and Wasow 1994). For example, when the components do not make an equal contribution to the semantics of the whole, MWEs can be prosodically marked: soft spot is prosodically marked (due to the stress on soft rather than spot), although first aid and red herring are not. Note that prosodic marking can equally occur with non-MWEs, such as dental operation.

1.2.3 Testing an Expression for MWEhood

Above, we described five different forms of idiomaticity, along with a number of other properties of MWEs. We bring these together in categorising a selection of MWEs in Table 1.2.

Table 1.2: Classification of MWEs in terms of their idiomaticity

Expression         Lexical  Syntactic  Semantic  Pragmatic  Statistical
all aboard            −         −         −          +           +
bus driver            −         −         +          −           +
by and large          −         +         +          −           +
kick the bucket       −         −         +          −           −
look up               −         −         +          −           +
shock and awe         −         −         −          +           +
social butterfly      −         −         +          −           +
take a walk           −         −         +          −           ?
to and fro            ?         +         −          −           +
traffic light         −         −         +          −           +
eat chocolate         −         −         −          −           −

Taking the example of the verb-particle construction look up (in the sense of "seek information from", as in Kim looked the word up in the dictionary), we first observe that it is made up of multiple words (look and up), and thus satisfies the first requirement in our MWE definition. In terms of idiomaticity: (1) it is not lexically idiomatic, as both look and up are part of the standard English lexicon; (2) while it has peculiar syntax relative to its component words, in up being separable from look, this is a general property of transitive verb-particle constructions (see Section 1.3.2.1) rather than of this particular word combination, so it is not syntactically idiomatic; (3) it is semantically idiomatic, as the semantics of "seek information from" is not predictable from the standard semantics of look and up; (4) it is not pragmatically idiomatic, as it doesn't generally evoke a particular situation; and (5) it is statistically marked, as it contrasts with anti-collocations such as *see/watch up (under the constraint that up is a particle; examples such as see you up the road occur readily, but are not considered to be anti-collocations as up is there a (transitive) preposition) and is a relatively frequent expression in English. That is, it is semantically and statistically idiomatic; in combination with its multiword composition, this is sufficient to classify it as a MWE.
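The test just applied to look up can be written down directly from definition (3): an expression is an MWE if it decomposes into multiple lexemes and shows at least one form of idiomaticity. The sketch below encodes a few of the profiles discussed above; it is an illustration of the definition, not a tool from the chapter, and the "?" cells of Table 1.2 are simplified to False.

# Idiomaticity profiles (lexical, syntactic, semantic, pragmatic, statistical).
profiles = {
    'look up':       dict(lexical=False, syntactic=False, semantic=True,  pragmatic=False, statistical=True),
    'by and large':  dict(lexical=False, syntactic=True,  semantic=True,  pragmatic=False, statistical=True),
    'eat chocolate': dict(lexical=False, syntactic=False, semantic=False, pragmatic=False, statistical=False),
}

def is_mwe(expression, profile):
    """Definition (3): multiple lexemes plus at least one form of idiomaticity."""
    return len(expression.split()) > 1 and any(profile.values())

for expression, profile in profiles.items():
    print(expression, '->', 'MWE' if is_mwe(expression, profile) else 'not an MWE')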
In Table 1.2, kick the bucket (in the sense of "die") has only one form of idiomaticity (semantic), while all the other examples have at least two forms of idiomaticity. Traffic light, for example, is statistically idiomatic in that it is both a common expression in English and stands in opposition to anti-collocations such as *vehicle light/traffic lamp, and it is semantically idiomatic in that the particular semantics of "a visual signal to control the flow of traffic" is not explicitly represented in the component words (e.g. interpretations such as "a visual signal to indicate the flow of traffic", "a device for lighting the way of traffic" or "a lamp which indicates the relative flow of data", which are predicted by the component words, are not readily available). Other noteworthy claims about idiomaticity are: shock and awe is pragmatically idiomatic because of its particular association with the commencement of the Iraq War in 2003; take a walk is semantically idiomatic because this sense of take is particular to this and other light verb constructions (see Section 1.3.2.3), and distinct from the literal sense of the verb; and to and fro is syntactically idiomatic because of the relative syntactic opacity of the antiquated fro, and (somewhat) lexically idiomatic as fro is used almost exclusively in the context of to and fro. (Words such as this, which occur only as part of a fixed expression, are known variously as cranberry words or bound words (Aronoff 1976; Moon 1998; Trawiński, Sailer, Soehn, Lemnitzer, and Richter 2008); other examples are tenterhooks and caboodle, and the expressions that contain them are often termed cranberry expressions, e.g. on tenterhooks and the whole caboodle.)

Table 1.2 includes one negative example: eat chocolate. While it satisfies the requirement for multiword decomposability (i.e. it is made up of more than one word), it clearly lacks lexical, syntactic, semantic and pragmatic idiomaticity. We would claim that it is also not statistically idiomatic. One possible counter-argument could be that eat is one of the most common verbs associated with chocolate, but the same argument could be made for almost any foodstuff in combination with eat. Possible anti-collocations such as consume chocolate or munch on chocolate are also perfectly acceptable.

1.2.4 Collocations and MWEs

A common term in NLP which relates closely to our discussion of MWEs is collocation. A widely-used definition of collocation is "an arbitrary and recurrent word combination" (Benson 1990), or in our terms, a statistically idiomatic MWE (especially one of high frequency). While there is considerable variation between individual researchers, collocations are often distinguished from "idioms" or "non-compositional phrases" on the grounds that they are not syntactically idiomatic, and that if they are semantically idiomatic, it is through a relatively transparent process of figuration or metaphor (Choueka 1988; Lin 1998; McKeown and Radev 2000; Evert 2004). Additionally, much work on collocations focuses exclusively on predetermined constructional templates (e.g. adjective–noun or verb–noun collocations). Of the examples in Table 1.2, social butterfly is an uncontroversial instance of a collocation, but look up and to and fro would tend not to be classified as collocations.
As such, collocations form a proper subset of MWEs.

1.2.5 A Word on Terminology and Related Fields

It is worth making mention of a number of terms which relate to MWEs. The term idiom varies considerably in its usage, from any kind of multiword item to only those MWEs which are semantically idiomatic; even here, there are those who consider idioms to be MWEs which are exclusively semantically idiomatic (also sometimes termed pure idioms), and those who restrict the term to particular syntactic sub-types of semantically idiomatic MWEs (Fillmore, Kay, and O'Connor 1988; Nunberg, Sag, and Wasow 1994; Moon 1998; Huddleston and Pullum 2002). To avoid confusion, we will avoid using this term in this chapter.

The field of terminology has a rich history of research on multiword terms, which relates closely to MWEs (Sager 1990; Justeson and Katz 1995; Frantzi, Ananiadou, and Mima 2000; Kageura, Daille, Nakagawa, and Chien 2004). The major difference is that terminology research is primarily interested in identifying and classifying technical terms specific to a particular domain (both MWE and simplex lexemes) and predicting patterns of variation in those terms. It is thus broader in scope than MWE research in the sense that simple lexemes can equally be technical terms, and narrower in the sense that non-technical MWEs are not of interest to the field.

Phraseology is another field with a rich tradition relating to MWEs (Cowie and Howarth 1996; Cowie 2001). It originally grew out of the work of Mel'čuk and others in Russia on Meaning-Text Theory (Mel'čuk and Polguère 1987), but more recently has taken on elements from the work of Sinclair and others in the context of corpus linguistics and corpus-based lexicography (Sinclair 1991). Phraseology is primarily interested in the description and functional classification of MWEs (including "sentence-like" units, such as phrases and quotations), from a theoretical perspective.

1.3 Types of MWE

In this section, we detail a selection of the major MWE types which have received particular attention in the MWE literature. We will tend to focus on English MWEs for expository purposes, but provide tie-ins to corresponding MWEs in other languages where possible.

1.3.1 Nominal MWEs

Nominal MWEs are one of the most common MWE types, in terms of token frequency, type frequency, and their occurrence across the world's languages (Tanaka and Baldwin 2003; Lieber and Štekauer 2009). In English, the primary type of nominal MWE is the noun compound (NC), where two or more nouns combine to form a N̄, such as golf club or computer science department (Lauer 1995; Sag, Baldwin, Bond, Copestake, and Flickinger 2002; Huddleston and Pullum 2002); the rightmost noun in the NC is termed the head noun (i.e. club and department, respectively) and the remaining component(s) the modifier(s) (i.e. golf and computer science, respectively). (The norm amongst Germanic languages such as Danish, Dutch, German, Norwegian and Swedish is in fact for noun compounds to be realised as a single compound word (Bauer 2001): solar cell, for example, is zonnecel in Dutch, Solarzelle in German, and solcell in Swedish; see Section 1.2 for comments on their compatibility with our definition of MWE.) Within NCs, there is the subset of compound nominalisations, where the head is deverbal (e.g. investor hesitation or stress avoidance). There is also the broader class of nominal MWEs where the modifiers aren't restricted to being nominal, but can also be verbs (usually present or past participles, such as connecting flight or hired help) or adjectives (e.g. open secret). To avoid confusion, we will term this broader set of nominal MWEs nominal compounds.
In Romance languages such as Italian, there is the additional class of complex nominals which include a preposition or other marker between the nouns, such as succo di limone "lemon juice" and porta a vetri "glass door". (Our use of the term complex nominal for MWEs of the form N P N should not be confused with that of Levi (1978), which included NCs and nominal compounds.)

One property of noun compounds which has put them in the spotlight of NLP research is their underspecified semantics. For example, while sharing the same head, there is little semantic commonality between nut tree, clothes tree and family tree: a nut tree is a tree which bears edible nuts; a clothes tree is a piece of furniture shaped somewhat like a tree, for hanging clothes on; and a family tree is a graphical depiction of the genealogical history of a family (which can be shaped like a tree). In each case, the meaning of the compound relates (if at times obtusely!) to a sense of both the head and the modifier, but the precise relationship is highly varied and not represented explicitly in any way. Furthermore, while it may be possible to argue that these are all lexicalised noun compounds with explicit semantic representations in the mental lexicon, native speakers generally have reasonably sharp intuitions about the semantics of novel compounds. For example, a bed tree is most plausibly a tree that beds are made from or perhaps for sleeping in, and a reflection tree could be a tree for reflecting in/near or perhaps the reflected image of a tree. Similarly, context can evoke irregular interpretations of high-frequency compounds (Downing 1977; Spärck Jones 1983; Copestake and Lascarides 1997; Gagné, Spalding, and Gorrie 2005). This suggests that there is a dynamic interpretation process that takes place, which complements encyclopedic information about lexicalised compounds.

One popular approach to capturing the semantics of compound nouns is via a finite set of relations. For example, orange juice, steel bridge and paper hat could all be analysed as belonging to the make relation, where the head is made from the modifier. This observation has led to the development of a bewildering range of semantic relation sets of varying sizes, based on abstract relations (Vanderwende 1994; Barker and Szpakowicz 1998; Rosario and Hearst 2001; Moldovan, Badulescu, Tatu, Antohe, and Girju 2004; Nastase, Sayyad-Shirabad, Sokolova, and Szpakowicz 2006), direct paraphrases, e.g. using prepositions or verbs (Lauer 1995; Lapata 2002; Grover, Lapata, and Lascarides 2004; Nakov 2008), or various hybrids of the two (Levi 1978; Vanderwende 1994; Ó Séaghdha 2008). This style of approach has been hampered by issues including low inter-annotator agreement (especially for larger semantic relation sets), coverage over data from different domains, the impact of context on interpretation, how to deal with "fringe" instances which don't quite fit any of the relations, and how to deal with interpretational ambiguity (Downing 1977; Spärck Jones 1983; Ó Séaghdha 2008).
An additional area of interest with nominal MWEs (especially noun compounds) is the syntactic disambiguation of MWEs with 3 or more terms. For example, glass window cleaner can be syntactically analysed as either (glass (window cleaner)) (i.e. "a window cleaner made of glass", or similar) or ((glass window) cleaner) (i.e. "a cleaner of glass windows"). Syntactic ambiguity impacts on both the semantic interpretation and the prosody of the MWE. The task of disambiguating syntactic ambiguity in nominal MWEs is called bracketing. We return to discuss the basic approaches to bracketing in Section 1.5.3.

1.3.2 Verbal MWEs

1.3.2.1 Verb-particle constructions

Verb-particle constructions (VPCs, also sometimes termed particle verbs or phrasal verbs) are made up of a verb and an obligatory particle, typically in the form of an intransitive preposition (e.g. play around, take off), but also including adjectives (e.g. cut short, band together) and verbs (e.g. let go, let fly) (Bolinger 1976; Jackendoff 1997; Huddleston and Pullum 2002; McIntyre 2007). English VPCs relate closely to particle verbs (a.k.a. separable verbs) in languages such as German (Lüdeling 2001), Dutch (Booij 2002) and Estonian (Kaalep and Muischnek 2008), but the construction has its own peculiarities in each language which go beyond the bounds of this chapter. To avoid confusion, we will focus exclusively on English VPCs in our discussion here. The distinguishing properties of English VPCs are:

• Transitive VPCs can occur in either the joined (e.g. Kim put on the sweater) or split (e.g. Kim put the sweater on) word order in the case that the object NP is not pronominal.
• Transitive VPCs must occur in the split word order if the object NP is pronominal (e.g. Kim polished it off vs. *Kim polished off it).
• Manner adverbs do not readily occur between the verb and particle, in both intransitive and transitive VPCs (e.g. ?*Kim played habitually around, *Kim made quickly up her mind). Note that there is a small set of degree adverbs that readily premodify particles, notably right (e.g. My turn is coming right up) and back (e.g. Kim put the sweater back on).

All of these properties are defined at the construction level and are common to all VPCs, however, begging the question of where the idiomaticity comes in that allows us to define them as MWEs. The answer is, in the main, semantic and statistical idiosyncrasy. For example, the semantics of polish in polish off (e.g. polish off dessert, polish off the hitman, polish off my homework) is differentiated from that of the simplex lexeme. Conversely, swallow down (e.g. swallow down the drink) preserves the semantics of both swallow and down (i.e. the liquid is swallowed, and as a result goes down [the oesophagus]), and is thus conventionally not considered to be a MWE.

VPCs are highly frequent in English text, but the distribution is highly skewed towards a minority of the VPC types, with the majority of VPCs occurring very infrequently (Baldwin 2005a). This is bad news if we want to build a parser with full coverage, e.g., as we need to capture the long tail of VPC types. Compounding the problem, the construction is highly productive. For example, the completive up (e.g. eat/finish/rest/... up) can combine productively with a large array of action verbs to form a VPC with predictable syntax and semantics, which we could never hope to exhaustively list.
Having said this, there are large numbers of semantically idiomatic VPCs which need to be recorded in the lexicon if we wish to capture their semantics correctly. Even here, VPCs populate the spectrum of compositionality relative to their components (Lidner 1983; Brinton 1985; Jackendoff 2002; Bannard, Baldwin, and Lascarides 2003; McCarthy, Keller, and Carroll 2003; Cook and Stevenson 2006), so while some VPCs are clear candidates for lexicalisation in terms of their semantic idiomaticity (e.g. make out, as in Kim made out the cheque to Sandy or Kim and Sandy made out), others are semantically closer to the semantics of their component words (e.g. check out, blow over) and to some degree derivable from their component words. One approach to representing this continuum of VPC semantics is that of Bannard, Baldwin, and Lascarides (2003), who subclassify VPCs into four compositionality classes based on the independent semantic contribution of the verb and particle: (1) the VPC inherits its semantics from both the verb and the particle (i.e. is not semantically idiomatic); (2) the VPC inherits semantics from the verb only; (3) the VPC inherits semantics from the particle only; and (4) the VPC inherits semantics from neither the verb nor the particle. A second approach is to employ a one-dimensional classification of holistic VPC compositionality, e.g. in the form of an integer scale from 0 to 10 (McCarthy, Keller, and Carroll 2003).

1.3.2.2 Prepositional verbs

Prepositional verbs (PVs) relate closely to VPCs in being comprised of a verb and a selected preposition, with the crucial difference that the preposition is transitive (e.g. refer to, look for) (Jackendoff 1973; O'Dowd 1998; Huddleston and Pullum 2002; Baldwin 2005b; Osswald, Helbig, and Hartrumpf 2006). English PVs occur in two basic forms: (1) fixed preposition PVs (e.g. come across, grow on), where there is a hard constraint of the verb and selected preposition being strictly adjacent; and (2) mobile preposition PVs (e.g. refer to, send for), where the selected preposition is adjacent to the verb in the canonical word order, but undergoes limited syntactic alternation. For example, mobile preposition PVs allow limited coordination of PP objects (e.g. refer to the book and to the DVD vs. *come across the book and across the DVD), and the NP object of the selected preposition can be passivised (e.g. the book was referred to vs. *I was grown on by the book). PVs are highly frequent in general text, and notoriously hard to distinguish from VPCs and simple verb–preposition combinations, e.g. in parsing applications.

1.3.2.3 Light-Verb Constructions

Light-verb constructions (LVCs) are made up of a verb and a noun complement, often in the indefinite singular form (Jespersen 1965; Abeillé 1988; Miyagawa 1989; Grefenstette and Tapanainen 1994; Hoshi 1994; Sag, Baldwin, Bond, Copestake, and Flickinger 2002; Huddleston and Pullum 2002; Butt 2003; Stevenson, Fazly, and North 2004). The name of the construction comes from the verb being semantically bleached or "light", in the sense that its contribution to the meaning of the LVC is relatively small in comparison with that of the noun complement. In fact, the contribution of the light verb is so slight that in many cases the LVC can be paraphrased with the verbal form of the noun complement (e.g. take a walk vs. walk, or take a photograph vs. photograph).
LVCs are also sometimes termed verb-complement pairs (Tan, Kan, and Cui 2006) or support verb constructions (Calzolari, Fillmore, Grishman, Ide, Lenci, MacLeod, and Zampolli 2002). The following are the principal light verbs in English:

• do, e.g. do a demo, do a drawing, do a report
• give, e.g. give a wave, give a sigh, give a kiss
• have, e.g. have a rest, have a drink, have pity (on)
• make, e.g. make an offer, make an attempt, make a mistake
• take, e.g. take a walk, take a bath, take a photograph

There is some disagreement in the scope of the term LVC, most notably in the membership of verbs which can be considered "light". Calzolari, Fillmore, Grishman, Ide, Lenci, MacLeod, and Zampolli (2002), e.g., argued that the definition of LVCs (or support verb constructions in their terms) should be extended to include: (1) verbs that combine with an event noun (deverbal or otherwise) where the subject is a participant in the event most closely identified with the noun (e.g. ask a question); and (2) verbs with subjects that belong to some scenario associated with the full understanding of the event type designated by the object noun (e.g. keep a promise).

Morphologically, the verb in LVCs inflects, but the noun complement tends to have fixed number and a preference for determiner type. For example, make amends undergoes full verbal inflection (make/makes/made/making amends), but the noun complement cannot be singular (e.g. *make amend); note, though, that in other cases the noun complement can be either singular or plural, e.g. take a bath vs. take baths. Syntactically, LVCs are highly flexible, undergoing passivisation (e.g. an offer was made), extraction (e.g. How many offers did Kim make?) and internal modification (e.g. make an irresistible offer). On the other hand, there are hard constraints on which light verbs a given noun complement can be combined with (c.f. *give/do/put/take an offer), noting that some noun complements combine with multiple light verbs (e.g. do/give a demo), often with different semantics (e.g. make a call vs. take a call vs. have a call). Also, which light verb a given noun will combine with to form an LVC is often consistent across semantically-related noun clusters (e.g. give a cry/moan/howl vs. *take a cry/moan/howl). LVCs occur across a large number of the world's languages, including Japanese (Grimshaw and Mester 1988; Baldwin and Bond 2002), Korean (Ahn 1991), Hindi (Mohanan 1994) and Persian (Karimi-Doostan 1997).

1.3.2.4 Verb–Noun Idiomatic Combinations

Verb–Noun Idiomatic Combinations (VNICs, also known as VP idioms) are composed of a verb and a noun in direct object position, and are (at least) semantically idiomatic (e.g. kick the bucket, shoot the breeze) (Nunberg, Sag, and Wasow 1994; Fellbaum 2002; Sag, Baldwin, Bond, Copestake, and Flickinger 2002; Fazly, Cook, and Stevenson 2009). They are a notable subclass of MWEs because of their crosslingual occurrence, and their high lexical and semantic variability. VNICs (along with other semantically idiomatic MWEs) are often categorised into two groups based on their semantic decomposability (see Section 1.2.1.3) (Nunberg, Sag, and Wasow 1994; Riehemann 2001). With decomposable VNICs, given the interpretation of the VNIC, it is possible to associate components of the VNIC with distinct elements of the VNIC interpretation, based on semantics not immediately accessible from the component lexemes.
Assuming an interpretation of spill the beans such as reveal'(x, secret') (where reveal' is a two-place predicate and x binds to the subject), e.g., we could analyse spill as having the semantics of reveal' and beans as having the semantics of secret', through a process of figuration. Other examples of decomposable VNICs are pull strings (c.f. exert'(x, influence')) and touch a nerve (c.f. cause'(x, reaction')). With non-decomposable VNICs (e.g. get the hang (of), kick the bucket), such a semantic decomposition is not possible. The reason we make this distinction is that decomposable VNICs tend to be syntactically flexible, in a manner predicted by the nature of the semantic decomposition; non-decomposable VNICs, on the other hand, tend not to be syntactically flexible (Cruse 1986; Nunberg, Sag, and Wasow 1994; Jackendoff 1997; Sag, Baldwin, Bond, Copestake, and Flickinger 2002). For example, spill the beans can be passivised (It's a shame the beans were spilled) and internally modified (AT&T spilled the Starbucks beans), similarly to a conventional verb–direct object pair (c.f. Sandy is loved by Kim and Kim loves the inimitable Sandy); this is predicted by its decomposability.

VNICs generally occur with low frequency, but are notoriously hard to distinguish from literal usages of the same word combination (e.g. Kim made a face at the policeman vs. Kim made a face in pottery class). An accurate means of disambiguation is thus important in tasks which require semantic interpretation, but is generally hampered by low volumes of training data.

1.3.3 Prepositional MWEs

1.3.3.1 Determinerless Prepositional Phrases

Determinerless prepositional phrases (PP-Ds) are MWEs that are made up of a preposition and a singular noun without a determiner (Quirk, Greenbaum, Leech, and Svartvik 1985; Huddleston and Pullum 2002; Sag, Baldwin, Bond, Copestake, and Flickinger 2002; Baldwin, Beavers, Van Der Beek, Bond, Flickinger, and Sag 2006). Syntactically, PP-Ds are highly diverse, and display differing levels of syntactic markedness, productivity and modifiability (Chander 1998; Ross 1995). That is, some PP-Ds are non-productive (e.g. on top vs. *on bottom) and non-modifiable (e.g. on top vs. *on table top), whereas others are fully productive (e.g. by car/foot/bus/...) and highly modifiable (e.g. at high expense, on summer vacation). In fact, while some PP-Ds are optionally modifiable (e.g. on vacation vs. on summer vacation), others require modification (e.g. *at level vs. at eye level, and *at expense vs. at company expense) (Baldwin, Beavers, Van Der Beek, Bond, Flickinger, and Sag 2006). Syntactically-marked PP-Ds can be highly productive (Ross 1995; Grishman, Macleod, and Myers 1998). For example, by combines with a virtually unrestricted array of countable nouns (e.g. by bus/car/taxi/...) but less readily with uncountable nouns (e.g. *by information/linguistics/...). Semantically, PP-Ds impose a certain degree of semantic markedness on the noun (Haspelmath 1997; Himmelmann 1998; Stvan 1998; Bond 2005). For example, in combines with uncountable nouns which refer to a social institution (e.g. school, church, prison but not information) to form syntactically unmarked PP-Ds with marked semantics, in the sense that only the social-institution sense of the noun is evoked (e.g. in school/church/prison/... vs. *in information) (Baldwin, Beavers, Van Der Beek, Bond, Flickinger, and Sag 2006).
PP-Ds occur with surprising frequency and cause problems during parsing and generation, in terms of achieving the right balance between over- and under-generation (Baldwin, Bender, Flickinger, Kim, and Oepen 2004).

1.3.3.2 Complex prepositions

Another common form of prepositional MWE is complex prepositions (e.g. on top of, in addition to), and other forms of complex markers (Villada Moirón 2005; Tsuchiya, Shime, Takagi, Utsuro, Uchimoto, Matsuyoshi, Sato, and Nakagawa 2006; Trawiński, Sailer, and Soehn 2006). Complex prepositions can take the form of fixed MWEs (e.g. in addition to), or alternatively semi-fixed MWEs, for example optionally allowing internal modification (e.g. with (due/particular/special/...) regard to) or determiner insertion (e.g. on (the) top of).

1.4 MWE Classification

In developing a lexicon of MWEs, it is crucially important to develop a classification which captures the general properties of MWE classes, but at the same time allows for the encoding of information particular to a given MWE instance. In this section, we present a commonly-used high-level classification, based particularly on the syntactic and semantic properties of MWEs, as outlined in Figure 1.1 (Bauer 1983; Sag, Baldwin, Bond, Copestake, and Flickinger 2002).

Figure 1.1: A classification of MWEs. The figure is a tree: MWE branches into Lexicalised Phrase and Institutionalised Phrase; Lexicalised Phrase branches into fixed, semi-fixed and syntactically-flexible expressions; semi-fixed expressions cover non-decomposable VNICs and nominal MWEs; syntactically-flexible expressions cover VPCs, LVCs and decomposable VNICs.

The classification of MWEs into lexicalised phrases and institutionalised phrases hinges on whether the MWE is lexicalised (i.e. explicitly encoded in the lexicon), or a simple collocation (i.e. only statistically idiomatic). Lexicalised phrases are MWEs with lexical, syntactic, semantic or pragmatic idiomaticity. Lexicalised phrases can be further split into fixed expressions (e.g. ad hoc, at first), semi-fixed expressions (e.g. spill the beans, car dealer, Chicago White Sox) and syntactically-flexible expressions (e.g. add up, give a demo).

• Fixed expressions are fixed strings that undergo neither morphosyntactic variation nor internal modification, often due to fossilisation of what was once a compositional phrase. For example, by and large is not morphosyntactically modifiable (e.g. *by and larger) or internally modifiable (e.g. *by and very large). Non-modifiable determinerless prepositional phrases such as on air are also fixed expressions.

• Semi-fixed expressions are lexically-variable MWEs that have hard restrictions on word order and composition, but undergo some degree of lexical variation such as inflection (e.g. kick/kicks/kicked/kicking the bucket vs. *the bucket was kicked), variation in reflexive pronouns (e.g. in her/his/their shoes) and determiner selection (e.g. The Beatles vs. a Beatles album; the determiner the in The Beatles is obligatory when The Beatles forms a noun phrase, i.e. Beatles can only be quantified by the, but where Beatles forms a N̄, as in [NP a [N' [N' Beatles'] album]], the lexical item is realised without a determiner). Non-decomposable VNICs (e.g. kick the bucket, shoot the breeze) and nominal MWEs (e.g. attorney general, part of speech) are also classified as semi-fixed expressions.

• Syntactically-flexible expressions are MWEs which undergo syntactic variation, such as VPCs, LVCs and decomposable VNICs.
the flexibility varies significantly across construction types. VPCs, for example, are syntactically flexible with respect to the word order of the particle and NP in transitive usages: hand in the paper vs. hand the paper in. They are also usually compatible with internal modification, even for intransitive VPCs: the plane took right off. LVCs (e.g. give a demo) undergo full syntactic variation, including passivisation (e.g. a demo was given), extraction (e.g. how many demos did he give?) and internal modification (e.g. give a clear demo). Decomposable VNICs are also syntactically flexible to some degree, although the exact form of syntactic variation is predicted by the nature of their semantic decomposability.

Note that many of our MWE construction types can be assigned to a unique sub-category of lexicalised phrase, namely: non-decomposable VNICs, NCs, VPCs and LVCs. Determinerless PPs, on the other hand, cut across all three sub-categories: non-modifiable PP-Ds (e.g. at first) are fixed expressions, PP-Ds with strict constraints on modifiability (e.g. at level) are semi-fixed expressions, and highly-productive PP-Ds (e.g. as president/coach/father of the bride/...) are syntactically-flexible.

The class of institutionalised phrases corresponds to MWEs which are exclusively statistically idiomatic, as described in Section 1.2.4. Examples include salt and pepper and many thanks.

1.5 Research Issues

The major NLP tasks relating to MWEs are: (1) identifying and extracting MWEs from corpus data, and disambiguating their internal syntax, and (2) interpreting MWEs. Increasingly, these tasks are being pipelined with parsers and applications such as machine translation (Venkatapathy and Joshi 2006; Zhang, Kordoni, Villavicencio, and Idiart 2006; Blunsom 2007). Depending on the type of MWE, the relative import of these syntactic and semantic tasks varies. For example, with noun compounds, the identification and extraction tasks are relatively trivial, whereas interpretation is considerably more difficult. Below, we discuss the challenges and review the key research on MWEs in NLP. For a listing of relevant resources (especially datasets and toolkits), we refer the reader to the companion web site (http://...).

1.5.1 Identification

Identification is the task of determining individual occurrences of MWEs in running text. The task is at the token (instance) level, such that we may identify 50 distinct occurrences of pick up in a given corpus. To give an example of an identification task, given the corpus fragment in (4) (taken from “The Frog Prince”, a children’s story), we might identify the MWEs in (4):

(4) One fine evening a young princess put on her bonnet and clogs, and went out to take a walk by herself in a wood; ... she ran to pick it up; ...

In MWE identification, a key challenge is in differentiating between MWEs and literal usages for word combinations such as make a face which can occur in both usages (Kim made a face at the policeman [MWE] vs. Kim made a face in pottery class [non-MWE]). Syntactic ambiguity is also a major confounding factor, e.g. in identifying VPCs in contexts such as Have the paper in today. For example, in the sentence Kim signed in the room, there is ambiguity between a VPC interpretation (sign in = “check in/announce arrival”) and an intransitive verb + PP interpretation (“Kim performed the act of signing in the room”).
MWE identification has tended to take the form of customised methods for particular MWE construction types and languages (e.g. English VPCs, LVCs and VNICs), but there have been attempts to develop generalised techniques, as outlined below.

Perhaps the most obvious method of identifying MWEs is via a part-of-speech (POS) tagger, chunker or parser, in the case that the lexical information required to identify MWEs is contained within the parser output. For example, in the case of VPCs, there is a dedicated tag for (prepositional) particles in the Penn POS tagset, such that VPC identification can be performed simply by POS tagging a text, identifying all particle tags, and further identifying the head verb associated with each particle (e.g. by looking left for the first main verb, within a word window of fixed size) (Baldwin and Villavicencio 2002; Baldwin 2005a). Similarly, a chunker or phrase structure parser can be used to identify constructions such as noun compounds or VPCs (McCarthy, Keller, and Carroll 2003; Lapata and Lascarides 2003; Kim and Baldwin to appear). This style of approach is generally not able to distinguish MWE and literal usages of a given word combination, however, as they are not differentiated in their surface syntax. Deep parsers which have lexical entries for MWEs and disambiguate to the level of lexical items are able to make this distinction, however, via supertagging or full parsing (Baldwin, Bender, Flickinger, Kim, and Oepen 2004; Blunsom 2007).

Another general approach to MWE identification is to treat literal and MWE usages as different senses of a given word combination. This then allows for the application of word sense disambiguation (WSD) techniques to the identification problem. As with WSD research, both supervised (Patrick and Fletcher 2005; Hashimoto and Kawahara 2008) and unsupervised (Birke and Sarkar 2006; Katz and Giesbrecht 2006; Sporleder and Li 2009) approaches have been applied to the identification task. The key assumption in unsupervised approaches has been that literal usages will be contextually similar to simplex usages of the component words (e.g. kick and bucket in the case of kick the bucket). Mirroring the findings from WSD research, supervised methods tend to be more accurate, but have the obvious drawback that they require large numbers of annotated literal and idiomatic instances of a given MWE to work. Unsupervised techniques are therefore more generally applicable.

A third approach, targeted particularly at semantically idiomatic MWEs, is to assume that MWEs occur: (a) in canonical forms, or (b) only in particular syntactic configurations, and do not undergo the same level of syntactic variation as literal usages. This relates to our claims in Section 1.3.2.4 relating to non-decomposable VNICs, where the prediction is that VNICs such as kick the bucket will not passivise or be internally modifiable. If we have a method of identifying the limits of syntactic variability of a given MWE, therefore, we can assume that any usage which falls outside these (e.g. kicked a bucket) must be literal. The problem, then, is identifying the degree of syntactic variability of a given MWE. This can be performed manually, in flagging individual MWE lexical items with predictions of what variations a given MWE can undergo (Li, Zhang, Niu, Jiang, and Srihari 2003; Hashimoto, Sato, and Utsuro 2006).
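Before turning to alternatives that reduce this manual overhead, here is a minimal sketch of the simpler POS-tag-based VPC identification described earlier in this section, using NLTK. The window size, tag names and example sentence are illustrative assumptions; a real system would also need to cope with tagger errors.

```python
import nltk

# Requires: nltk.download('punkt'), nltk.download('averaged_perceptron_tagger')

def identify_vpcs(sentence, window=3):
    """Identify candidate VPC tokens: find particle tags (RP) and look
    left for the nearest main verb within a fixed-size window."""
    tagged = nltk.pos_tag(nltk.word_tokenize(sentence))
    vpcs = []
    for i, (word, tag) in enumerate(tagged):
        if tag == "RP":                              # (prepositional) particle
            for j in range(i - 1, max(i - 1 - window, -1), -1):
                if tagged[j][1].startswith("VB"):    # first verb to the left
                    vpcs.append((tagged[j][0], word))
                    break
    return vpcs

print(identify_vpcs("She handed the paper in and took off."))
# e.g. [('handed', 'in'), ('took', 'off')] -- subject to tagger errors
```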
An alternative which alleviates the manual overhead associated with hand annotation is to use unsupervised learning to predict the “canonical” configurations for a given MWE, which can optionally be complemented with a supervised model to identify literal usages which are used in one of the canonical MWE configurations (e.g. Kim kicked the bucket in frustration, and stormed out of the room) (Fazly, Cook, and Stevenson 2009).

In research to date, good results have been achieved for particular MWEs, especially English VPCs. However, proposed methods have tended to rely heavily on existing resources such as parsers and hand-crafted lexical resources, and be tuned to particular MWE types.

1.5.2 Extraction

MWE extraction is a type-level task, wherein the MWE lexical items attested in a predetermined corpus are extracted out into a lexicon. For example, we may wish to know whether a given corpus provides evidence for a given verb take and preposition off combining to form a VPC (i.e. take off). To illustrate the difference between identification and extraction, identification would involve the determination of the individual occurrences of take off (e.g. each of the 240 in a given corpus), whereas extraction would involve the decision about whether take off occurred in the corpus or not (irrespective of the number of occurrences). Clearly there is a close connection between the two tasks, in that if we have identified one or more occurrences of a given MWE we can extract it as a MWE, and conversely, if we have extracted a given MWE, we must be able to identify at least one occurrence in the corpus.

The motivation for MWE extraction is generally lexicon development and expansion, e.g. recognising newly-formed MWEs (e.g. ring tone or shock and awe) or domain-specific MWEs. Extracting MWEs is relevant to any lexically-driven application, such as grammar engineering or information extraction. Depending on the particular application, it may be necessary to additionally predict lexical properties of a given MWE, e.g. its syntactic or semantic class. In addition, it is particularly important for productive MWEs or domains which are rich in technical terms (e.g. bus speed or boot up in the IT domain). MWE extraction is difficult for many of the same reasons as MWE identification, namely syntactic flexibility and ambiguity.

There has been a strong focus on the development of general-purpose techniques for MWE extraction, particularly in the guise of collocation extraction (see Section 1.2.4). The dominating view here is that extraction can be carried out via association measures such as pointwise mutual information or the t-test, based on analysis of the frequency of occurrence of a given word combination, often in comparison with the frequency of occurrence of the component words (Church and Hanks 1989; Smadja 1993; Frantzi, Ananiadou, and Mima 2000; Evert and Krenn 2001; Pecina 2008). Association measures provide a score for each word combination, which forms the basis of a ranking of MWE candidates. Final extraction, therefore, consists of determining an appropriate cut-off in the ranking, although evaluation is often carried out over the full ranking. Collocation extraction techniques have been applied to a wide range of extraction tasks over a number of languages, with the general finding that it is often unpredictable which association measure will work best for a given task.
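The following is a minimal sketch of the association-measure approach to collocation extraction, using NLTK's collocation finder with pointwise mutual information; the corpus, the frequency cut-off and the choice of measure are illustrative only.

```python
import nltk
from nltk.collocations import BigramAssocMeasures, BigramCollocationFinder

# Requires: nltk.download('brown'); any tokenised corpus would do.
words = (w.lower() for w in nltk.corpus.brown.words(categories="news"))

measures = BigramAssocMeasures()
finder = BigramCollocationFinder.from_words(words)
finder.apply_freq_filter(5)              # drop very low-frequency candidates

# Rank bigram candidates by pointwise mutual information; extraction then
# amounts to choosing a cut-off point in this ranking.
for bigram, score in finder.score_ngrams(measures.pmi)[:10]:
    print(bigram, round(score, 2))
```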
Partly as a result of this unpredictability, recent research has focused on building supervised classifiers to combine the predictions of a number of association measures, and shown that this leads to consistently superior results to any single association measure (Pecina 2008). It has also been shown that this style of approach works most effectively when combined with POS tagging or parsing, and strict filters on the type of MWE that is being extracted (e.g. adjective–noun or verb–noun: Justeson and Katz (1995), Pecina (2008)). It is worth noting that association measures have generally been applied to (continuous) word n-grams, or less frequently, pre-determined dependency types in the output of a parser. Additionally, collocational extraction techniques tend to require a reasonable number of token occurrences of a given word combination to operate reliably, which we cannot always assume (Baldwin 2005a; Fazly 2007).

A second approach to MWE extraction, targeted specifically at semantically and statistically idiomatic MWEs, is to extend the general association measure approach to include substitution (Lin 1999; Schone and Jurafsky 2001; Pearce 2001). For example, in assessing the idiomaticity of red tape, explicit comparison is made with lexically-related candidates generated by component word substitution, such as yellow tape or red strip. Common approaches to determining substitution candidates for a given component word are (near-)synonymy—e.g. based on resources such as WordNet—and distributional similarity. Substitution can also be used to generate MWE candidates, and then check for their occurrence in corpus data. For example, if clear up is a known (compositional) VPC, it is reasonable to expect that VPCs such as clean/tidy/unclutter/... up are also VPCs (Villavicencio 2005). That is not to say that all of these occur as MWEs, however (c.f. *unclutter up), so an additional check for corpus attestation is usually used in this style of approach.

A third approach, also targeted at semantically idiomatic MWEs, is to analyse the relative similarity between the context of use of a given word combination and its component words (Schone and Jurafsky 2001; Stevenson, Fazly, and North 2004; Widdows and Dorow 2005). Similar to the unsupervised WSD-style approach to MWE identification (see Section 1.5.1), the underlying hypothesis is that semantically idiomatic MWEs will occur in markedly different lexical contexts to their component words. A bag-of-words representation is commonly used to model the combined lexical context of all usages of a given word or word combination. By interpreting this context model as a vector, it is possible to compare lexical contexts, e.g. via simple cosine similarity (Widdows 2005). In order to reduce the effects of data sparseness, dimensionality reduction is often carried out over the word space prior to comparison (Schütze 1997). The same approach has also been applied to extract LVCs, based on the assumption that the noun complements in LVCs are often deverbal (e.g. bath, proposal, walk), and that the distribution of nouns in PPs post-modifying noun complements in genuine LVCs (e.g. (make a) proposal of marriage) will be similar to that of the object of the underlying verb (e.g. propose marriage) (Grefenstette and Teufel 1995).
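As a rough illustration of the general context-comparison idea (not any particular published system), the sketch below builds bag-of-words context vectors for a word combination and one of its component words over a toy, invented corpus, and compares them with cosine similarity:

```python
from collections import Counter
from math import sqrt

def context_vector(target, sentences, window=5):
    """Bag-of-words context vector: counts of words occurring within
    `window` tokens of the target expression across the corpus."""
    vec, n = Counter(), len(target)
    for sent in sentences:
        toks = sent.lower().split()
        for i in range(len(toks) - n + 1):
            if toks[i:i + n] == target:
                vec.update(toks[max(0, i - window):i] + toks[i + n:i + n + window])
    return vec

def cosine(v1, v2):
    dot = sum(v1[w] * v2[w] for w in v1)
    norm = sqrt(sum(c * c for c in v1.values())) * sqrt(sum(c * c for c in v2.values()))
    return dot / norm if norm else 0.0

# A low similarity between the contexts of the word combination and of its
# component word is taken as evidence of semantic idiomaticity.
corpus = ["sadly the old man kicked the bucket last night",
          "she kicked the ball towards the bucket by the door"]
print(round(cosine(context_vector(["kicked", "the", "bucket"], corpus),
                   context_vector(["kicked"], corpus)), 2))
```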
In the LVC case, therefore, the assumption is that LVCs will be distributionally similar to the base verb form of the noun complement, whereas with the original extraction method, the assumption was that semantically idiomatic MWEs are dissimilar to their component words.

A fourth approach is to perform extraction on the basis of implicit identification. That is, (possibly noisy) token-level statistics can be fed into a type-level classifier to predict whether there have been genuine instances of a given MWE in the corpus. An example of this style of approach is to use POS taggers, chunkers and parsers to identify English VPCs in different syntactic configurations, and feed the predictions of the various preprocessors into the final extraction classifier (Baldwin 2005a). Alternatively, a parser can be used to identify PPs with singular nouns, and semantically idiomatic PP-Ds extracted from among them based on distributional (dis)similarity of occurrences with and without determiners across a range of prepositions (van der Beek 2005).

A fifth approach is to use syntactic fixedness as a means of extracting MWEs, based on the assumption that semantically idiomatic MWEs undergo syntactic variation (e.g. passivisation or internal modification) less readily than simple verb–noun combinations (Bannard 2007; Fazly, Cook, and Stevenson 2009).

In addition to general-purpose extraction techniques, linguistic properties of particular MWE construction types have been used in extraction. For example, the fact that a given verb–preposition combination occurs as a noun (e.g. takeoff, clip-on) is a strong predictor of that combination occurring as a VPC (Baldwin 2005a).

One bottleneck in MWE extraction is the token frequency of the MWE candidate. With a few notable exceptions (e.g. Baldwin 2005a; Fazly, Cook, and Stevenson 2009), MWE research has tended to ignore low-frequency MWEs, e.g. by applying a method only to word combinations which occur at least N times in a corpus.

1.5.3 Internal Syntactic Disambiguation

As part of the process of MWE identification and extraction, for some MWE types it is necessary to disambiguate the internal syntax of individual MWEs. A prominent case of this in English is noun compounds with 3 or more terms. For example, glass window cleaner has two possible interpretations,[15] corresponding to the two possible bracketings of the compound: (1) “a cleaner of glass windows” (= [[glass window] cleaner]), and (2) “a cleaner of windows, made of glass” (= [glass [window cleaner]]). In this case, the first analysis (left bracketing) is the correct one, but movie car chase, e.g., is right bracketing (= [movie [car chase]]). The process of disambiguating the syntax of an NC is called bracketing.

The most common approach to bracketing is based on statistical analysis of the components of competing analyses. In the adjacency model, for a ternary NC N1 N2 N3, a comparison is made of the frequencies of the two modifier–head pairings extracted from the two analyses, namely N1 N2 and N1 N3 in the left bracketing case, and N2 N3 and N1 N3 in the right bracketing case; as N1 N3 is common to both, in practice, N1 N2 is compared directly with N2 N3. A left bracketing analysis is selected in the case that N1 N2 is judged to be more likely, otherwise a right bracketing analysis is selected (Marcus 1980).
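A minimal sketch of the adjacency model follows; the count() function is assumed to be backed by corpus bigram frequencies (or association scores), and the toy counts below are invented purely for illustration.

```python
def bracket_adjacency(n1, n2, n3, count):
    """Adjacency model for a ternary noun compound N1 N2 N3.
    `count(a, b)` is assumed to return a corpus frequency (or association
    score) for the modifier-head pair (a, b)."""
    # N1-N3 appears in both analyses, so only N1-N2 vs. N2-N3 matters.
    if count(n1, n2) >= count(n2, n3):
        return f"[[{n1} {n2}] {n3}]"    # left bracketing
    return f"[{n1} [{n2} {n3}]]"        # right bracketing

# Invented counts, for illustration only (not from a real corpus):
toy_counts = {("glass", "window"): 120, ("window", "cleaner"): 80}
print(bracket_adjacency("glass", "window", "cleaner",
                        lambda a, b: toy_counts.get((a, b), 0)))
# -> [[glass window] cleaner]
```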
In the dependency model, the NC is instead decomposed into the dependency tuples of N1 N2 and N2 N3 in the case of left bracketing, and N2 N3 and N1 N3 in the case of right bracketing; once again, the dependency N2 N3 is common to both, and can be ignored. In the instance that N1 N2 is more likely than N1 N3, the model prefers a left bracketing analysis, otherwise a right bracketing analysis is selected (Lauer 1995). While the dependency model tends to outperform the adjacency model, the best-performing models take features derived from both, along with various syntactic and semantic features (Nakov and Hearst 2005; Vadas and Curran 2008).

[15] More generally, for an n-item noun compound, the number of possible interpretations is defined by the Catalan number C_n = (1/(n+1)) (2n choose n).

1.5.4 MWE Interpretation

The semantic interpretation of MWEs is usually performed in one of two ways: (1) relative to a generalised semantic inventory (compatible with both simplex words and MWEs, such as WordNet); and (2) based on a set of semantic relations capturing the semantic interplay between component words. When interpreting VPCs or lexicalised PP-Ds, e.g., the former approach would be more appropriate (e.g. to capture the fact that bow out is synonymous with withdraw, both of which are troponyms of retire). Nominal MWEs and productive PP-Ds, on the other hand, are more amenable to interpretation by semantic relations (e.g. to capture the semantics of apple pie in terms of the make relation, as in “pie made from apple(s)”).

One common approach to MWE interpretation is via component similarity, i.e. comparison of the components of a MWE with corresponding components of annotated MWEs, or alternatively with simplex words. For example, a novel NC can be interpreted by identifying training NCs with similar modifier and head nouns (e.g. in interpreting grape extract, grape would be compared with similar modifiers, and extract with similar heads), as determined relative to a lexical resource or via distributional similarity. We can then extrapolate from the closely-matching training NCs to predict the interpretation of the novel NC (Vanderwende 1994; Moldovan, Badulescu, Tatu, Antohe, and Girju 2004; Kim and Baldwin 2005; Nastase, Sayyad-Shirabad, Sokolova, and Szpakowicz 2006; Kim and Baldwin 2007b; Ó Séaghdha 2008). Alternatively, we may employ contextual similarity to compare a VPC with its simplex verb, to determine if they are sufficiently similar that the VPC can be interpreted compositionally from the verb (Baldwin, Bannard, Tanaka, and Widdows 2003; McCarthy, Keller, and Carroll 2003; Cook and Stevenson 2006).

Crosslinguistic evidence can also be valuable when interpreting MWEs. For example, analysis of what preposition is used in different Romance languages to translate a given English MWE can provide valuable insights into the range of possible interpretations for the English MWE (Girju 2009). Conversely, semantically idiomatic MWEs can be detected from parallel corpus data by identifying translation divergences in the component words’ lexical choice (Melamed 1997). For example, knowledge that balance and sheet are most often translated as équilibre and feuille, respectively, in French, and yet balance sheet is translated as bilan, suggests that balance sheet is semantically idiomatic.
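To make the component-similarity approach to NC interpretation concrete, here is a minimal sketch using WordNet similarity via NLTK; the training compounds and their relation labels (MAKE, TIME) are invented purely for illustration, and the similarity function is a crude stand-in for the measures used in the cited work.

```python
from nltk.corpus import wordnet as wn   # requires: nltk.download('wordnet')

def sim(w1, w2):
    """Crude noun-noun similarity: best Wu-Palmer score over noun synsets."""
    s1, s2 = wn.synsets(w1, "n"), wn.synsets(w2, "n")
    return max(((a.wup_similarity(b) or 0.0) for a in s1 for b in s2), default=0.0)

# Hypothetical annotated training compounds (relation labels invented):
training = {("apple", "pie"): "MAKE", ("morning", "exercise"): "TIME"}

def interpret(modifier, head):
    """Choose the relation of the training NC whose modifier and head are
    most similar (in WordNet) to the novel compound's components."""
    return max(training.items(),
               key=lambda kv: sim(modifier, kv[0][0]) + sim(head, kv[0][1]))[1]

print(interpret("grape", "extract"))   # e.g. MAKE ("extract made from grape(s)")
```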
One popular approach to determining the underlying semantic relation associated with a MWE is to identify surface realisations or paraphrases associated with each semantic class (Lapata 2002; Grover, Lapata, and Lascarides 2004; Kim and Baldwin 2006; Nicholson and Baldwin 2006; Nakov and Hearst 2008). For example, in the case of compound nominalisations, there are the two primary classes of subject and object, based on whether the modifier acts as the subject (e.g. investor hesitation = “investor hesitates”) or object (e.g. product replacement = “replace (the) product”) of the base verb form of the deverbal head. For a given compound nominalisation and base verb form, it is possible to analyse the relative occurrence of the modifier as subject or object of the base verb, and select the interpretation which is most commonly observed (Lapata 2002; Grover, Lapata, and Lascarides 2004; Nicholson and Baldwin 2006).

Another methodology which has been applied to the interpretation task with success is analysis of the co-occurrence properties of the MWE components. For example, the semantics of particles in VPCs can be interpreted by analysing what types of verbs can combine with a given particle (Cook and Stevenson 2006; Kim and Baldwin 2007a). Similarly, Japanese compound verbs (V-V combinations) can be interpreted by observing what set of verbs each of the component verbs combines with to form a compound verb, optionally including the semantic class of the resulting compound verb (Uchiyama, Baldwin, and Ishizaki 2005).

One overarching assumption made in most semantic interpretation tasks is that it is possible to arrive at a compositional interpretation for each MWE via its component words. Ideally, we of course need to identify instances of semantic idiomaticity, motivating the need for methods which can model the relative compositionality or decomposability of MWEs (Lin 1999; Baldwin, Bannard, Tanaka, and Widdows 2003; McCarthy, Keller, and Carroll 2003; McCarthy, Venkatapathy, and Joshi 2007).

While there has been a healthy interest in MWE interpretation, research has suffered from lack of agreement on semantic inventories, and the relative unavailability of annotated data. One very positive step towards redressing this situation was a shared task at SemEval-2007, on interpreting nominal MWEs in English (Girju, Nakov, Nastase, Szpakowicz, Turney, and Yuret 2007), and an upcoming SemEval-2010 task on the multi-way classification of semantic relations between pairs of nominals. In practice, the SemEval-2007 task took a pair of nouns in a fixed sentential context and attempted to determine if they were interpretable using a set of semantic relations compatible with NCs. As such, the task wasn’t specifically on NC interpretation, but NC interpretation methods could be evaluated over the dataset (Kim and Baldwin 2008; Ó Séaghdha 2008). Crucially, the task organisers chose to sidestep the controversy surrounding the precise membership of a broad-coverage set of semantic relations, and instead focused on relations where there is relatively high agreement between researchers. They additionally defused the question of interpretational overlap/ambiguity of a given nominal, by designing the task as a series of binary sub-tasks, where a prediction had to be made about each nominal’s compatibility with a given semantic relation (ignoring whether or not it was also compatible with other relations).
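Returning to the subject/object paraphrase test for compound nominalisations described at the start of this section, the sketch below counts how often the modifier occurs as subject vs. direct object of the base verb, using spaCy dependency labels; the model name and the toy sentences are assumptions, and a real system would run over a large corpus.

```python
import spacy

nlp = spacy.load("en_core_web_sm")      # assumes this model is installed

def subject_or_object(modifier, base_verb, sentences):
    """Count how often `modifier` occurs as subject vs. direct object of
    `base_verb`, and interpret the compound nominalisation accordingly."""
    subj = obj = 0
    for doc in nlp.pipe(sentences):
        for tok in doc:
            if tok.lemma_ == modifier and tok.head.lemma_ == base_verb:
                if tok.dep_ == "nsubj":
                    subj += 1
                elif tok.dep_ in ("dobj", "obj"):
                    obj += 1
    return "subject" if subj >= obj else "object"

sentences = ["Investors hesitated after the announcement.",
             "The company replaced the product last year."]
print(subject_or_object("investor", "hesitate", sentences))   # typically: subject
```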
1.6 Summary

MWEs are an integral part of language: vast in number and highly varied in nature. They are defined by idiomaticity at the lexical, syntactic, semantic, pragmatic and statistical levels, and occur in a myriad of different constructions in the world’s languages. In addition to providing a brief foray into the linguistic complexities of MWEs, we have detailed the key MWE construction types in MWE research, and outlined various approaches to the primary computational challenges associated with MWEs, namely: identification, extraction and interpretation.

We have deliberately not provided a survey of MWE resources in this paper, choosing instead to maintain an up-to-the-moment snapshot of the field on the companion website at http://.... For those interested in pursuing MWE research, we recommend this as your first port of call. For readers who are interested in further reading on MWEs, we particularly recommend the following works: (Moon 1998; McKeown and Radev 2000; Cowie 2001; Sag, Baldwin, Bond, Copestake, and Flickinger 2002; Villavicencio, Bond, Korhonen, and McCarthy 2005).

Bibliography

Abeillé, A. (1988). Light verb constructions and extraction out of NP in a tree adjoining grammar. In Papers of the 24th Regional Meeting of the Chicago Linguistics Society. Ahn, H.-D. (1991). Light verbs, VP-movement, Negation and Clausal Structure in Korean and English. Ph. D. thesis, University of Wisconsin-Madison. Aronoff, M. (1976). Word Formation in Generative Grammar. Cambridge, USA: MIT Press. Baldwin, T. (2005a). The deep lexical acquisition of English verb-particles. Computer Speech and Language, Special Issue on Multiword Expressions 19 (4), 398–414. Baldwin, T. (2005b). Looking for prepositional verbs in corpus data. In Proceedings of the 2nd ACL-SIGSEM Workshop on the Linguistic Dimensions of Prepositions and their Use in Computational Linguistics Formalisms and Applications, Colchester, UK, pp. 115–126. Baldwin, T., C. Bannard, T. Tanaka, and D. Widdows (2003). An empirical model of multiword expression decomposability. In Proceedings of the ACL-2003 Workshop on Multiword Expressions: Analysis, Acquisition and Treatment, Sapporo, Japan, pp. 89–96. Baldwin, T., J. Beavers, L. Van Der Beek, F. Bond, D. Flickinger, and I. A. Sag (2006). In search of a systematic treatment of determinerless PPs. In P. Saint-Dizier (Ed.), Syntax and Semantics of Prepositions. Springer. Baldwin, T., E. M. Bender, D. Flickinger, A. Kim, and S. Oepen (2004). Road-testing the English Resource Grammar over the British National Corpus. In Proceedings of the 4th International Conference on Language Resources and Evaluation (LREC 2004), Lisbon, Portugal, pp. 2047–2050. Baldwin, T. and F. Bond (2002). Multiword expressions: Some problems for Japanese NLP. In Proceedings of the 8th Annual Meeting of the Association for Natural Language Processing (Japan), Keihanna, Japan, pp. 379–382. Baldwin, T. and A. Villavicencio (2002). Extracting the unextractable: A case study on verb-particles. In Proceedings of the 6th Conference on Natural Language Learning (CoNLL-2002), Taipei, Taiwan, pp. 98–104. Bannard, C. (2007). A measure of syntactic flexibility for automatically identifying multiword expressions in corpora. In Proceedings of the ACL-2007 Workshop on A Broader Perspective on Multiword Expressions, Prague, Czech Republic, pp. 1–8. Bannard, C., T. Baldwin, and A. Lascarides (2003). A statistical approach to the semantics of verb-particles.
In Proceedings of the ACL2003 Workshop on Multiword Expressions: analysis, acquisition and treatment, Sapporo, Japan, pp. 65–72. Barker, K. and S. Szpakowicz (1998). Semi-automatic recognition of noun modifier relationships. In Proceedings of the 17th International Conference on Computational Linguistics (COLING-1998), Montreal, Canada, pp. 96–102. Bauer, L. (1983). English Word-formation. Cambridge, UK: Cambridge University Press. Bauer, L. (2001). Compounding. In M. Haspelmath (Ed.), Language Typology and Language Universals. The Hague, Netherlands: Mouton de Gruyter. Benor, S. B. and R. Levy (2006). The chicken or the egg? a probabilistic analysis of english binomials. Language 82 (2), 233–278. Benson, M. (1990). Collocations and general-purpose dictionaries. International Journal of Lexicography 3 (1), 23–35. Birke, J. and A. Sarkar (2006). A clustering approach for the nearly unsupervised recoginition of nonliteral language. In Proceedings of the 11th Conference of the EACL (EACL 2006), Trento, Italy, pp. 329–336. 30 Handbook of NLP Blunsom, P. (2007). Structured Classication for Multilingual Natural Language Processing. Ph. D. thesis, University of Melbourne. Bolinger, D. (1976). The Phrasal Verb in English. Boston, USA: Harvard University Press. Bond, F. (2005). Translating the Untranslatable: A solution to the Problem of Generating English Determiners. CSLI Studies in Computational Linguistics. CSLI Publications. Booij, G. (2002). Separable complex verbs in Dutch: A case of periphrastic word formation. In N. Dehé, R. Jackendoff, A. McIntyre, and S. Urban (Eds.), Verb-particle explorations, pp. 21–41. Berlin, Germany / New York, USA: Mouton de Gruyter. Brinton, L. (1985). Verb particles in English: Aspect or aktionsart. Studia Linguistica 39, 157–168. Butt, M. (2003). The light verb jungle. In Proceedings of the Workshop on Multi-Verb Constructions, Trondheim, Norway, pp. 1–49. Calzolari, N., C. Fillmore, R. Grishman, N. Ide, A. Lenci, C. MacLeod, and A. Zampolli (2002). Towards best practice for multiword expressions in computational lexicons. In Proceedings of the 3rd International Conference on Language Resources and Evaluation (LREC 2002), Las Palmas, Canary Islands, pp. 1934–1940. Chafe, W. L. (1968). Idiomaticity as an anomaly in the Chomskyan paradigm. Foundations of Language 4, 109–127. Chander, I. (1998). Automated postediting of documents. Ph. D. thesis, University of Southern California. Choueka, Y. (1988). Lookin for needles in a haystack or locating interesting collocational expressions in large textual databases. In Proceedings of RIAO, pp. 43–38. Church, K. W. and P. Hanks (1989). Word assication norms, mutual information and lexicography. In Proceedings of the 27th Annual Meeting of the Association of Computational Linguistics (ACL-1989), Vancouver, Canada, pp. 76–83. Cook, P. and S. Stevenson (2006). Classifying particle semantics in English verb-particle constructions. In Proceedings of the ACL-2006 Workshop on Multiword Expressions: Identifying and Exploiting Underlying Properties, Sydney, Australia, pp. 45–53. Copestake, A. and A. Lascarides (1997). Integrating symbolic and statistical representations: The lexicon pragmatics interface. In Proceedings of the 35th Annual Meeting of the Association of Coomputational Linguistics and 8th Conference of the European Chapter of Association of Computational Linguistics (ACL/EACL-1997), Madrid, Spain, pp. 136–143. Multiword Expressions 31 Cowie, A. (Ed.) (2001). Phraseology : Theory, Analysis, and Applications. 
Oxford, UK: Oxford University Press. Cowie, A. P. and P. A. Howarth (1996). Phraseology – a select bibliography. International Journal of Lexicography 9 (1), 38–51. Cruse, A. D. (1986). Lexical Semantics. Cambridge, UK: Cambridge University Press. Dirven, R. (2001). The metaphoric in recent cognitive approaches to English phrasal verbs. metaphorik.de 1, 39–54. Downing, P. (1977). On the creation and use of English compound nouns. Language 53 (4), 810–842. Evert, S. (2004). The Statistics of Word Cooccurrences: Word Pairs and Collocations. Ph. D. thesis, University of Stuttgart. Evert, S. and B. Krenn (2001). Methods for the qualitative evaluation of lexical association measures. In Proceedings of the 39th Annual Meeting of the ACL and 10th Conference of the EACL (ACL-EACL 2001), Toulouse, France, pp. 188–195. Fazly, A. (2007). Automatic Acquisition of Lexical Knowledge about Multiword Predicates. Ph. D. thesis, University of Toronto. Fazly, A., P. Cook, and S. Stevenson (2009). Unsupervised type and token identification of idiomatic expressions. Computational Linguistics 35 (1), 61–103. Fellbaum, C. (Ed.) (1998). WordNet, An Electronic Lexical Database. Cambridge, Massachusetts, USA: MIT Press. Fellbaum, C. (2002). VP idioms in the lexicon: Topics for research using a very large corpus. In Proceedings of the KONVENS 2002 Conference, Saarbrücken, Germany. Fernando, C. and R. Flavell (1981). On idioms. Exeter: University of Exeter. Fillmore, C., P. Kay, and M. C. O’Connor (1988). Regularity and idiomaticity in grammatical constructions. Language 64, 501–538. Frantzi, K., S. Ananiadou, and H. Mima (2000). Automatic recognition of multi-word terms:. the C-value/NC-value method. International Journal on Digital Libraries 3 (2), 115–130. Gagné, C. L., T. L. Spalding, and M. C. Gorrie (2005). Sentential context and the interpretation of familiar open-compounds and novel modifiernoun phrases. Language and Speech 28 (2), 203–221. Gates, E. (1988). The treatment of multiword lexemes in some current dictionaries of English. Snell-Hornby. 32 Handbook of NLP Gerber, L. and J. Yang (1997). Systran MT dictionary development. In Proceedings of the Sixth Machine Translation Summit (MT Summit VI), San Diego, USA. Gibbs, R. W. (1980). Spilling the beans on understanding and memory for idioms in conversation. Memory and Cognition 8 (2), 149–156. Girju, R. (2009). The syntax and semantics of prepositions in the task of automatic interpretation of nominal phrases and compounds: A crosslinguistic study. Computational Linguistics 35 (2). Girju, R., P. Nakov, V. Nastase, S. Szpakowicz, P. Turney, and D. Yuret (2007). Semeval-2007 task 04: Classification of semantic relations between nominals. In Proceedings of the 4th International Workshop on Semantic Evaluations, Prague, Czech Republic, pp. 13–18. Grefenstette, G. and P. Tapanainen (1994). What is a word, what is a sentence? problems of tokenization. In Proceedings of the 3rd Conference on Computational Lexicography and Text Research, Budapest, Hungary, pp. 79–87. Grefenstette, G. and S. Teufel (1995). A corpus-based method for automatic identification of support verbs for nominalizations. In Proceedings of the 7th European Chapter of Association of Computational Linguistics (EACL-1995), Dublin, Ireland, pp. 98–103. Grimshaw, J. and A. Mester (1988). Light verbs and theta-marking. Linguistic Inquiry 19 (2), 205–232. Grishman, R., C. Macleod, and A. Myers (1998). COMLEX syntax reference manual. Grover, C., M. Lapata, and A. Lascarides (2004). 
A comparison of parsing technologies for the biomedical domain. Journal of Natural Language Engineering 1 (1), 1–38. Hashimoto, C. and D. Kawahara (2008). Construction of an idiom corpus and its application to idiom identification based on WSD incorporating idiom-specific features. In Proceedings of the 2008 Conference on Empirical Methods in Natural Language Processing (EMNLP 2008), Honolulu, USA, pp. 992–1001. Hashimoto, C., S. Sato, and T. Utsuro (2006). Japanese idiom recognition: Drawing a line between literal and idiomatic meanings. In Proceedings of the COLING/ACL 2006 Interactive Poster System, Sydney, Australia, pp. 353–360. Haspelmath, M. (1997). From Space to Time in The World’s Languages. Munich, Germany: Lincorn Europa. Hoshi, H. (1994). Passive, Causive, and Light Verbs: A Study of Theta Role Assignment. Ph. D. thesis, University of Connecticut. Multiword Expressions 33 Huddleston, R. and G. K. Pullum (2002). The Cambridge Grammar of the English Language. Cambridge, UK: Cambridge University Press. Jackendoff, R. (1973). The base rules for prepositional phrases. In A Festschrift for Morris Halle, pp. 345–356. New York, USA: Rinehart and Winston. Jackendoff, R. (1997). The Architecture of the Language Faculty. Cambridge, USA: MIT Press. Jackendoff, R. (2002). Foundations of Language. Oxford, UK: Oxford University Press. Jespersen, O. (1965). A Modern English Grammar on Historical Principles, Part VI, Morphology. London, UK: George Allen and Unwin Ltd. Justeson, J. S. and S. M. Katz (1995). Technical terminology: Some linguistic properties and an algorithm for identification in text. Natural Language Engineering 1 (1), 9–27. Kaalep, H.-J. and K. Muischnek (2008). Multi-word verbs of Estonian: a database and a corpus. In Proceedings of the LREC 2008 Workshop: Towards a Shared Task for Multiword Expressions (MWE 2008), Marrakech, Morocco, pp. 23–26. Kageura, K., B. Daille, H. Nakagawa, and L.-F. Chien (2004). Recent trends in computational terminology. Terminology 10 (1), 1–21. Karimi-Doostan, G. H. (1997). Light Verb Construction in Persian. Ph. D. thesis, University of Essex. Kastovsky, D. (1982). Bagel/Francke. Wortbildung und Semantik. Dusseldorf: Katz, G. and E. Giesbrecht (2006). Automatic identification of noncompositional multi-word expressions using latent semantic analysis. In Proceedings of the ACL-2006 Workshop on Multiword Expressions: Identifying and Exploiting Underlying Properties, Sydney, Australia, pp. 28–35. Katz, J. J. and P. M. Postal (2004). Semantic interpretation of idioms and sentences containing them. In Quarterly Progress Report (70), MIT Research Laboratory of Electronics, pp. 275–282. MIT Press. Keysar, B. and B. Bly (1995). Intuitions of the transparency of idioms: Can one keep a secret by spilling the beans? Journal of Memory and Language 34 (1), 89–109. Kim, S. N. and T. Baldwin (2005). Automatic interpretation of compound nouns using WordNet similarity. In Proceedings of the 2nd International Joint Conference on Natural Language Processing (IJCNLP-05), Jeju, Korea, pp. 945–956. 34 Handbook of NLP Kim, S. N. and T. Baldwin (2006). Interpreting semantic relations in noun compounds via verb semantics. In Proceedings of the COLING/ACL 2006 Interactive Poster System, Sydney, Australia, pp. 491–498. Kim, S. N. and T. Baldwin (2007a). Detecting compositionality of English verb-particle constructions using semantic similarity. In Proceedings of Conference of the Pacific Association for Computational Linguistics, Melbourne, Australia, pp. 40–48. Kim, S. N. 
and T. Baldwin (2007b). Disambiguating noun compounds. In Proceedings of 22nd AAAI Conference on Artificial Intelligenc, Vancouver, Canada, pp. 901–906. Kim, S. N. and T. Baldwin (2008). Benchmarking noun compound interpretation. In Proceedings of 3rd International Joint Conference on Natual Language Processing (IJCNLP-2008), Hyderabad, India, pp. 569–576. Kim, S. N. and T. Baldwin (to appear). How to pick out token instances of English verb-particle constructions. Language Resources and Evaluation. Lapata, M. (2002). The disambiguation of nominalizations. Computational Linguistics 28 (3), 357–388. Lapata, M. and A. Lascarides (2003). Detecting novel compounds: The role of distributional evidence. In Proceedings of the 11th Conference of the European Chapter for the Association of Computational Linguistics (EACL-2003), Budapest, Hungary, pp. 235–242. Lauer, M. (1995). Designing Statistical Language Learners: Experiments on Noun Compounds. Ph. D. thesis, Macquarie University. Levi, J. (1978). The Syntax and Semantics of Complex Nominals. New York, USA: Academic Press. Li, W., X. Zhang, C. Niu, Y. Jiang, and R. K. Srihari (2003). An expert lexicon approach to identifying English phrasal verbs. In Proceedings of the ACL2003 Workshop on Multiword Expressions: analysis, acquisition and treatment, Sapporo, Japan, pp. 513–520. Liberman, M. and R. Sproat (1992). The stress and structure of modified noun phrases in English. In I. A. Sag and A. Szabolcsi (Eds.), Lexical Matters – CSLI Lecture Notes No. 24. Stanford, USA: CSLI Publications. Lidner, S. (1983). A lexico-semantic analysis of English verb particle constructions with OUT and UP. Ph. D. thesis, University of Indiana at Bloomington. Lieber, R. and P. Štekauer (Eds.) (2009). The Oxford Handbook of Compounding. Oxford University Press. Lin, D. (1998). Extracting collocations from text corpora. In Proceedings of the 1st Workshop on Computational Terminology, Montreal, Canada. Multiword Expressions 35 Lin, D. (1999). Automatic identification of non-compositional phrases. In Proceedings of the 37th Annual Meeting of the Association for Computational Linguistics, College Park, USA, pp. 317–324. Lüdeling, A. (2001). On Particle Verbs and Similar Constructions in German. Stanford, USA: CSLI Publications. Marcus, M. (1980). A Theory of Syntactic Recognition for Natural Language. Cambridge, USA: MIT Press. Matsuo, Y., S. Shirai, A. Yokoo, and S. Ikehara (1997). Direct parse tree translation in cooperation with the transfer method. In D. Joneas and H. Somers (Eds.), New Methods in Language Processing, pp. 229–238. London, UK: UCL Press. McCarthy, D., B. Keller, and J. Carroll (2003). Detecting a continuum of compositionality in phrasal verbs. In Proceedings of the ACL2003 Workshop on Multiword Expressions: analysis, acquisition and treatment, Sapporo, Japan, pp. 73–80. McCarthy, D., S. Venkatapathy, and A. Joshi (2007). Detecting compositionality of verb-object combinations using selectional preferences. In Proceedings of the 200 Joint Conference on Empirical Methods in Natural Language Processing and Computational Natural Language Learning (EMNLP-CoNLL), pp. 369–379. McIntyre, A. (2007). Particle verbs and argument structure. Language and Linguistics Compass 1 (4), 350–367. McKeown, K. R. and D. R. Radev (2000). Collocations. In R. Dale, H. Moisl, and H. Somers (Eds.), A Handbook of Natural Language Processing, Chapter 15. Marcel Dekker. Melamed, I. D. (1997). Automatic discovery of non-compositional compounds in parallel data. 
In Proceedings of the 2nd Conference on Empirical Methods in Natural Language Processing (EMNLP-97), Providence, USA, pp. 97–108. Mel’čuk, I. A. and A. Polguère (1987). A formal lexicon in the MeaningText Theory (or how to do lexica with words). Computational Linguistics 13 (3–4), 261–275. Mimmelmann, N. P. (1998). Regularity in irregularity: Article use in adpositional phrases. Linguistic Typology 2, 315–353. Miyagawa, S. (1989). Light verbs and the ergative hypothesis. Linguistic Inquiry 20, 659–668. Miyazaki, M., S. Ikehara, and A. Yokoo (1993). Combined word retrieval for bilingual dictionary based on the analysis of compound word. Transactions of the Information Processing Society of Japan 34 (4), 743–754. (in Japanese). 36 Handbook of NLP Mohanan, T. (1994). Argument Structure in Hindi. Stanford, USA: CSLI Publications. Moldovan, D., A. Badulescu, M. Tatu, D. Antohe, and R. Girju (2004). Models for the semantic classification of noun phrases. In Proceedings of HLT-NAACL 2004: Workshop on Computational Lexical Semantics, Boston, USA, pp. 60–67. Moon, R. E. (1998). Fixed Expressions and Idioms in English: A Corpusbased Approach. Oxford, UK: Oxford University Press. Nakov, P. (2008). Noun compound interpretation using paraphrasing verbs: Feasibility study. In Proceedings of the 13th International Conference on Artificial Intelligence: Methodology, Systems, Applications (AIMSA’08), Varna, Bulgaria, pp. 103–117. Nakov, P. and M. Hearst (2005). Search engine statistics beyond the ngram: Application to noun compound bracketting. In Proceedings of the 9th Conference on Computational Natural Language Learning (CoNLL2005), Ann Arbor, USA, pp. 17–24. Nakov, P. and M. A. Hearst (2008). Solving relational similarity problems using the web as a corpus. In Proceedings of the 46th Annual Meeting of the ACL: HLT, Columbus, USA, pp. 452–460. Nastase, V., J. Sayyad-Shirabad, M. Sokolova, and S. Szpakowicz (2006). Learning noun-modifier semantic relations with corpus-based and WordNet-based features. In Proceedings of the 21st National Conference on Artificial Intelligence (AAAI), Boston, USA, pp. 781–787. Nicholson, J. and T. Baldwin (2006). Interpretation of compound nominalisations using corpus and web statistics. In Proceedings of the COLING/ACL 2006 Workshop on Multiword Expressions: Identifying and Exploiting Underlying Properties, Sydney, Australia, pp. 54–61. Nunberg, G., I. A. Sag, and T. Wasow (1994). Idioms. Language 70, 491– 538. O’Dowd, E. M. (1998). Prepositions and Particles in English. Oxford University Press. Ó Séaghdha, D. (2008). Learning compound noun semantics. Ph. D. thesis, Computer Laboratory, University of Cambridge. Osswald, R., H. Helbig, and S. Hartrumpf (2006). The representation of German prepositional verbs in a semantically based computer lexicon. In Proceedings of the 5th International Conference on Language Resources and Evaluation (LREC 2006), Genoa, Italy. Patrick, J. and J. Fletcher (2005). Classifying verb particle constructions by verb arguments. In Proceedings of the Second ACL-SIGSEM Workshop on the Linguistic Dimensions of Prepositions and their Use in Computational Linguistics Formalisms and Applications, Colchester, UK, pp. 200–209. Multiword Expressions 37 Pauwels, P. (2000). Put, set, lay, and place: a cognitve linguistic approach to verbal meaning. Munich, Germany: Lincom Europa. Pearce, D. (2001). Synonymy in collocation extraction. 
In Proceedings of the NAACL 2001 Workshop on WordNet and Other Lexical Resources: Applications, Extensions and Customizations, Pittsburgh, USA, pp. 41– 46. Pecina, P. (2008). Lexical Association Measures. Ph. D. thesis, Charles University. Piao, S., P. Rayson, D. Archer, A. Wilson, and T. McEnery (2003). Extracting multiword expressions wth a semantic tagger. In Proceedings of the ACL2003 Workshop on Multiword Expressions: analysis, acquisition and treatment, Sapporo, Japan, pp. 49–56. Quirk, R., S. Greenbaum, G. Leech, and J. Svartvik (1985). A Comprehensive Grammar of the English Language. London, UK: Longman. Riehemann, S. (2001). A Constructional Approach to Idioms and Word Formation. Ph. D. thesis, Stanford University. Rosario, B. and M. Hearst (2001). Classifying the semantic relations in noun compounds via a domain-specific lexical hierarchy. In Proceedings of the 6th Conference on Empirical Methods in Natural Language Processing (EMNLP-2001), Pittsburgh, Pennsylvania, USA, pp. 82–90. Ross, H. (1995). Defective noun phrases. In In Papers of the 31st Regional Meeting of the Chicago Linguistics Society, Chicago, Illinois, USA, pp. 398–440. Sag, I. A., T. Baldwin, F. Bond, A. Copestake, and D. Flickinger (2002). Multiword expressions: A pain in the neck for NLP. In Proceedings of the 3rd International Conference on Intelligent Text Processing and Computational Linguistics (CICLing-2002), Mexico City, Mexico, pp. 1–15. Sager, J. C. (1990). A Practical Course in Terminology Processing. Amsterdam, Netherlands / Philadelphia, USA: John Benjamins. Schone, P. and D. Jurafsky (2001). Is knowledge-free induction of multiword unit dictionary headwords a solved problem? In Proceedings of the 6th Conference on Empirical Methods in Natural Language Processing (EMNLP 2001), Hong Kong, China, pp. 100–108. Schütze, H. (1997). Ambiguity Resolution in Language Learning. Stanford, USA: CSLI Publications. Sinclair, J. (1991). Corpus, Concordance, Collocation. Oxford, UK: Oxford University Press. Smadja, F. (1993). Retrieving collocations from text: Xtract. Computational Linguistics 19 (1), 143–77. 38 Handbook of NLP Spärck Jones, K. (1983). Compound noun interpretation problems. Englewood Cliffes, USA: Prentice-Hall. Sporleder, C. and L. Li (2009). Unsupervised recognition of literal and nonliteral use of idiomatic expressions. In Proceedings of the 12th Conference of the EACL (EACL 2009), Athens, Greece, pp. 754–762. Stevenson, S., A. Fazly, and R. North (2004). Statistical measures of the semi-productivity of light verb constructions. In Proceedings of the 2nd ACL Workshop on Multiword Expressions: Integrating Processing, Barcelona, Spain, pp. 1–8. Stvan, L. S. (1998). The Semantics and Pragmatics of Bare Singular Noun Phrases. Ph. D. thesis, Northwestern University. Tan, Y. F., M.-Y. Kan, and H. Cui (2006). Extending corpus-based identification of light verb constructions using a supervised learning framework. In Proceedings of the EACL 2006 Workshop on Multi-word-expressions in a multilingual context (MWEmc), Trento, Italy. Tanaka, T. and T. Baldwin (2003). Noun-noun compound machine translation a feasibility study on shallow processing. In Proceedings of the ACL 2003 Workshop on Multiword Expressions: Analysis, Acquisition and Treatment, Sapporo, Japan, pp. 17–24. Trawiński, B., M. Sailer, and J.-P. Soehn (2006). Combinatorial aspects of collocational prepositional phrases. In P. Saint-Dizier (Ed.), Computational Linguistics Dimensions of Syntax and Semantics of Prepositions. 
Dordrecht, Netherlands: Kluwer Academic. Trawiński, B., M. Sailer, J.-P. Soehn, L. Lemnitzer, and F. Richter (2008). Cranberry expressions in English and in German. In Proceedings of the LREC 2008 Workshop: Towards a Shared Task for Multiword Expressions (MWE 2008), Marrakech, Morocco, pp. 35–38. Tschichold, C. (1998). Multi-word Units in Natural Language Processing. Ph. D. thesis, University of Basel. Tsuchiya, M., T. Shime, T. Takagi, T. Utsuro, K. Uchimoto, S. Matsuyoshi, S. Sato, and S. Nakagawa (2006). Chunking Japanese compound functional expressions by machine learning. In Proceedings of the EACL 06 Workshop on Multi-word-expressions in a Multilingual Context, Trento, Italy, pp. 25–32. Uchiyama, K., T. Baldwin, and S. Ishizaki (2005). Disambiguating Japanese compound verbs. Computer Speech and Language, Special Issue on Multiword Expressions 19 (4), 497–512. Vadas, D. and J. R. Curran (2008). Parsing noun phrase structure with CCG. In Proceedings of the 46th Annual Meeting of the ACL: HLT, Columbus, USA, pp. 335–343. Multiword Expressions 39 van der Beek, L. (2005). The extraction of determinerless PPs. In Proceedings of the Second ACL-SIGSEM Workshop on the Linguistic Dimensions of Prepositions and their Use in Computational Linguistics Formalisms and Applications, Colchester, UK, pp. 190–199. Vanderwende, L. (1994). Algorithm for automatic interpretation of noun sequences. In Proceedings of the 15th Conference on Computational linguistics, Kyoto, Japan, pp. 782–788. Venkatapathy, S. and A. Joshi (2006). Using information about multi-word expressions for the word-alignment task. In Proceedings of the COLING/ACL 2006 Workshop on Multiword Expressions: Identifying and Exploiting Underlying Properties, Sydney, Australia, pp. 53–60. Villada Moirón, B. (2005). Data-driven identification of fixed expressions and their modifiability. Ph. D. thesis, Alfa-Informatica, University of Groningen. Villavicencio, A. (2005). The availability of verb-particle constructions in lexical resources: How much is enough? Computer Speech and Language, Special Issue on Multiword Expressions 19 (4), 415–432. Villavicencio, A., T. Baldwin, and B. Waldron (2004). A multilingual database of idioms. In Proceedings of the 4th International Conference on Language Resources and Evaluation (LREC 2004), Lisbon, Portugal, pp. 1127–1130. Villavicencio, A., F. Bond, A. Korhonen, and D. McCarthy (2005). Introduction to the special issue on multiword expressions: Having a crack at a hard nut. Computer Speech and Language, Special Issue on Multiword Expressions 19 (4), 365–377. Widdows, D. (2005). Geometry and Meaning. Stanford, USA: CSLI Publications. Widdows, D. and B. Dorow (2005). Automatic extraction of idioms using graph analysis and asymmetric lexicosyntactic patterns. In Proceedings of the ACL 2004 Workshop on Deep Lexical Acquisition, Ann Arbor, USA, pp. 48–56. Xu, R., Q. Lu, and S. Li (2006). The design and construction of a Chinese collocation bank. In Proceedings of the 5th International Conference on Language Resources and Evaluation (LREC 2006), Genoa, Italy. Zhang, Y., V. Kordoni, A. Villavicencio, and M. Idiart (2006). Automated multiword expression prediction for grammar engineering. In Proceedings of the Workshop on Multiword Expressions: Identifying and Exploiting Underlying Properties, Sydney, Australia, pp. 36–44. Association for Computational Linguistics. What is Named Entity Recognition (NER) Applications and Uses? 
NER, short for Named Entity Recognition, is a standard Natural Language Processing problem which deals with information extraction. The primary objective is to locate and classify named entities in text into predefined categories such as the names of persons, organizations, locations, events, expressions of times, quantities, monetary values, percentages, etc. To put it simply, NER deals with extracting real-world entities from text, such as a person, an organization, or an event. Named Entity Recognition is also known simply as entity identification, entity chunking, and entity extraction. Named entity tags are quite similar to POS (part-of-speech) tags.

NER using NLTK

NLTK is a standard Python library with prebuilt functions and utilities for ease of use and implementation. It is one of the most used libraries for natural language processing and computational linguistics. Recognizing named entities in a large corpus can be a challenging task, but NLTK has a built-in method ‘nltk.ne_chunk()’ that can recognize the various entity types shown in the table below:

NE Type – Examples
ORGANIZATION – Georgia-Pacific Corp., WHO
PERSON – Eddy Bonte, President Obama
LOCATION – Murray River, Mount Everest
DATE – June, 2008-06-29
TIME – two fifty a.m, 1:30 p.m.
MONEY – 175 million Canadian Dollars, GBP 10.40
PERCENT – twenty pct, 18.75 %
FACILITY – Washington Monument, Stonehenge
GPE – South-East Asia, Midlothian

IOB tagging

The IOB format (short for inside, outside, beginning) is a tagging format that is used for tagging tokens in a chunking task such as named-entity recognition. These tags are similar to part-of-speech tags but give us information about the location of the word in the chunk. The IOB tagging system contains tags of the form:
• B-{CHUNK_TYPE} – for the word at the beginning of the chunk
• I-{CHUNK_TYPE} – for words inside the chunk
• O – for words outside any chunk

NER using SpaCy

SpaCy is an open-source library for advanced Natural Language Processing, written in Python and Cython. It can be used to build information extraction or natural language understanding systems, or to pre-process text for deep learning. Some of the features provided by spaCy are Tokenization, Part-of-Speech (PoS) Tagging, Text Classification, and Named Entity Recognition, which we are going to use here. SpaCy provides a default model that can recognize a wide range of named or numerical entities, including person, organization, language, event, etc. Apart from these default entities, we can also add arbitrary classes to the NER model by training the model to update it with newer examples. SpaCy’s named entity recognition has been trained on the OntoNotes 5 corpus and recognizes entity types such as PERSON, NORP, ORG, GPE, LOC, DATE, TIME, MONEY and PERCENT.

Applications of NER

NER, short for Named Entity Recognition, has a wide range of applications in the field of Natural Language Processing and Information Retrieval. A few such examples are listed below:

Classifying content for news providers: A large amount of online content is generated by news and publishing houses on a daily basis, and managing it correctly can be a challenging task for human workers. Named Entity Recognition can automatically scan entire articles and help in identifying and retrieving the major people, organizations, and places discussed in them. Articles are thus automatically categorized into defined hierarchies, and the content is also much more easily discovered.
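Before turning to further applications, here is a minimal sketch of the NLTK pipeline and the IOB conversion described above; the example sentence is the one reused later in this section, and the exact entity labels depend on the NLTK models installed.

```python
import nltk
from nltk.chunk import tree2conlltags

# Requires: nltk.download('punkt'), nltk.download('averaged_perceptron_tagger'),
#           nltk.download('maxent_ne_chunker'), nltk.download('words')
sentence = ("Mark Zuckerberg is one of the founders of Facebook, "
            "a company from the United States.")

tokens = nltk.word_tokenize(sentence)
tagged = nltk.pos_tag(tokens)        # ne_chunk expects POS-tagged tokens
tree = nltk.ne_chunk(tagged)         # nested Tree with PERSON/ORGANIZATION/GPE... nodes
print(tree)

# Flatten the tree into IOB (B-/I-/O) tags, one triple per token:
print(tree2conlltags(tree)[:4])
# e.g. [('Mark', 'NNP', 'B-PERSON'), ('Zuckerberg', 'NNP', 'I-PERSON'), ...]
```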
Automatically Summarizing Resumes: You might have come across various tools that scan your resume and retrieve important information such as Name, Address, Qualification, etc. from it. The majority of such tools use NER software, which helps them retrieve this information. One of the challenging tasks faced by HR departments across companies is to evaluate a gigantic pile of resumes to shortlist candidates. Many of these resumes are excessively detailed, and most of the information in them is irrelevant to the evaluator. Using an NER model, the information relevant to the evaluator can be easily retrieved, thereby simplifying the effort required to shortlist candidates from a pile of resumes.

Optimizing Search Engine Algorithms: When designing a search engine algorithm, it would be inefficient and computationally expensive to search for an entire query across the millions of articles and websites online. An alternative is to run an NER model on the articles once and store the entities associated with them permanently. Then, for a quick and efficient search, the key tags in the search query can be compared with the tags associated with the website articles.

Powering Recommendation Systems: NER can be used in developing algorithms for recommender systems that make suggestions based on our search history or on our present activity. This is achieved by extracting the entities associated with the content in our history or previous activity and comparing them with the labels assigned to other, unseen content. Thus we frequently see content that matches our interests.

Simplifying Customer Support: Usually, a company gets tons of customer complaints and feedback on a daily basis, and going through each one of them and recognizing the concerned parties is not an easy task. Using NER we can recognize relevant entities in customer complaints and feedback, such as product specifications, department, or company branch location, so that the feedback is classified accordingly and forwarded to the appropriate department responsible for the identified product.

How Does Named Entity Recognition Work?

When we read a text, we naturally recognize named entities like people, values, locations, and so on. For example, in the sentence “Mark Zuckerberg is one of the founders of Facebook, a company from the United States” we can identify three types of entities:
• “Person”: Mark Zuckerberg
• “Company”: Facebook
• “Location”: United States

For computers, however, we need to help them recognize entities first so that they can categorize them. This is done through machine learning and Natural Language Processing (NLP). NLP studies the structure and rules of language and creates intelligent systems capable of deriving meaning from text and speech, while machine learning helps machines learn and improve over time.

To learn what an entity is, an NER model needs to be able to detect a word, or string of words that form an entity (e.g. New York City), and know which entity category it belongs to. So first, we need to create entity categories, like Name, Location, Event, Organization, etc., and feed an NER model relevant training data. Then, by tagging some word and phrase samples with their corresponding entities, you’ll eventually teach your NER model how to detect entities itself.

How to Do Named Entity Recognition

The easiest way to get started with named entity recognition is using an API.
Basically, you can choose between two types:
• Open-source named entity recognition APIs
• SaaS named entity recognition APIs

Open-source named entity recognition APIs
Open-source APIs are for developers: they are free, flexible, and entail a gentle learning curve. Here are a few options:
• Stanford Named Entity Recognizer (SNER): this Java tool developed by Stanford University is considered the standard library for entity extraction. It is based on Conditional Random Fields (CRF) and provides pre-trained models for extracting person, organization, location, and other entities.
• SpaCy: a Python framework known for being fast and very easy to use. It has an excellent statistical system that you can use to build customized NER extractors.
• Natural Language Toolkit (NLTK): this suite of libraries for Python is widely used for NLP tasks. NLTK has its own classifier to recognize named entities, called ne_chunk, but also provides a wrapper to use the Stanford NER tagger in Python.

SaaS named entity recognition APIs
SaaS tools are ready-to-use, low-code, and cost-effective solutions. They are also easy to integrate with other popular platforms. MonkeyLearn, for example, is a text analysis SaaS platform that can be used for different NLP tasks, one of which is named entity recognition. You can use MonkeyLearn's ready-built API to integrate pre-trained entity extraction models, or you can easily build your own custom named entity extractor.

Different blocks present in a typical NER model
A typical NER model consists of the following three blocks:
1. Noun phrase identification: this step deals with extracting all the noun phrases from a text with the help of dependency parsing and part-of-speech tagging.
2. Phrase classification: in this step, all the extracted noun phrases are classified into their respective categories. To disambiguate locations, the Google Maps API can be a very good resource, and to identify person names or company names, the open databases from DBpedia and Wikipedia can be used. Apart from this, we can also build lookup tables and dictionaries by combining information from different sources.
3. Entity disambiguation: sometimes entities are misclassified, so creating a validation layer on top of the results becomes useful. Knowledge graphs can be exploited for this purpose. Some of the popular knowledge graphs are the Google Knowledge Graph, IBM Watson, and Wikipedia.

Deep understanding of NER with an example
How does Named Entity Recognition work? After reading a text, we naturally recognize named entities such as people, values, locations, and so on. For example, consider the following sentence:
Sentence: Sundar Pichai, the CEO of Google Inc. is walking in the streets of California.
From the above sentence, we can identify three named entities:
• ("person": "Sundar Pichai")
• ("org": "Google Inc.")
• ("location": "California")
To do the same thing with computers, we need to help them recognize entities first so that they can categorize them. For this we take the help of machine learning and Natural Language Processing (NLP). The roles of the two are:
• NLP: studies the structure and rules of language and forms intelligent systems that are capable of deriving meaning from text and speech.
• Machine Learning: helps machines learn and improve over time.
To learn what an entity is, an NER model needs to be able to detect a word or string of words that forms an entity (e.g. California) and decide which entity category it belongs to. So, as a concluding step, we can say that the heart of any NER model is a two-step process:
• Detect a named entity
• Categorize the entity
So first we need to create entity categories, like Name, Location, Event, Organization, etc., and feed an NER model relevant training data. Then, by tagging some samples of words and phrases with their corresponding entities, we will eventually teach our NER model to detect the entities and categorize them.

CSE4022 Natural Language Processing
Topic: Syntax
Prof. Tulasi Prasad Sariki, SCSE, VIT Chennai Campus
www.learnersdesk.weebly.com

Contents
What is Syntax? Where does it fit? Simplified View of Linguistics. Grammatical Analysis Techniques.

What is Syntax?
Syntax is the study of the structure of language. It refers to the way words are arranged together and the relationships between them. Syntax is the study of the system of rules and categories that underlies sentence formation; it studies the combination of words into phrases, clauses and sentences, and describes how sentences and their constituents are structured.
Roughly, the goal is to relate surface form (what we perceive when someone says something) to meaning. Specifically, the goal is to relate an interface to the morphological component to an interface to a semantic component. (Note: the interface to the morphological component may look like written text.) The representational device is the tree structure.

Where does it fit?
Syntax sits between the lexicon and semantics.

Simplified View of Linguistics
/waddyasai/ → (phonology) → /waddyasai/ → (morphology) → what did you say → (syntax) → structure with subj = you, obj = what → (semantics) → λx. say(you, x)

Acronyms used in structural descriptions of natural language
S = sentence/clause, N = (a single) noun, NP = noun phrase, V = verb, VP = verb phrase, AUX = auxiliary verb, AJ/ADJ = adjective, ADJP = adjective phrase, ADV = adverb, ADVP = adverb phrase, DET = determiner, CONJ = conjunction, COMP = complementizer, PRO = pro-constituent, PUNC = punctuation.

Examples
S = sentence/clause: Does the dog chase the cat?
N = (a single) noun: dog
NP = noun phrase: the old dog
V = verb: chase
VP = verb phrase: chase the cat
AUX = auxiliary verb: does
AJ/ADJ = adjective: old
ADJP = adjective phrase: old and gray
ADV = adverb: happily
ADVP = adverb phrase: once upon a time
DET = determiner: the
CONJ = conjunction: and
COMP = complementizer: what
PRO = pro-constituent: he
PUNC = punctuation: ?

Grammatical Analysis Techniques
Two main devices:
• Breaking up a string: sequential, hierarchical, transformational
• Labeling the constituents: morphological, categorial, functional

Sequential Breaking up
That student solved the problems.
that + student + solve + ed + the + problem + s

Sequential Breaking up and Morphological Labeling
That student solved the problems.
that (word) + student (word) + solve (stem) + ed (affix) + the (word) + problem (stem) + s (affix)

Sequential Breaking up and Categorial Labeling
This boy can solve the problem.
this (Det) boy (N) can (Aux) solve (V) the (Det) problem (N)
They called her a taxi.
They (Pron) call+ed (V + Affix) her (Pron) a (Det) taxi (N)

Sequential Breaking up and Functional Labeling
They called her a taxi.
Reading 1: They (Subject) called (Verbal) her (Direct Object) a taxi (Indirect Object)
Reading 2: They (Subject) called (Verbal) her (Indirect Object) a taxi (Direct Object)

Hierarchical Breaking up
Old men and women
Reading 1: [Old] [men and women], where [men and women] = [men] [and] [women]
Reading 2: [Old men] [and] [women], where [Old men] = [Old] [men]

Hierarchical Breaking up and Categorial Labeling
Poor John ran away.
[S [NP [A Poor] [N John]] [VP [V ran] [Adv away]]]

Hierarchical Breaking up and Functional Labeling
Immediate Constituent (IC) Analysis. Construction types in terms of the function of the constituents:
• Predication (subject + predicate)
• Modification (modifier + head)
• Complementation (verbal + complement)
• Subordination (subordinator + dependent unit)
• Coordination (independent unit + coordinator)

Syntax as defined by Bloomfield
Syntax is the study of free forms that are composed entirely of free forms. The central notions of his theory are form classes and constituent structures.

Form-Classes
A set of forms displaying similar or identical grammatical features is said to constitute a form-class, e.g. 'walk', 'come', 'run', 'jump' belong to the form-class of infinitive expressions; 'John', 'the boys', 'Mr. Smith' belong to the form-class of nominative substantive expressions. Form-classes are similar to the traditional parts of speech, and one and the same form can belong to more than one form class.

Form-Classes (contd.)
The criterion for form-class membership is substitutability. In a sentence like "John went to the Church", 'John' can be substituted with 'children', 'Mr. Smith' or 'the boys' (as these are syntactically equivalent to each other and display identical grammatical features). Thus, form classes are sets of forms, any one of which may be substituted for any other in a given construction. The smaller forms into which a larger form may be analyzed are its constituents, and the larger form is a construction.

Example of the Constituents of a Construction
The phrase "poor John" is a construction analyzable into, or composed of, the constituents "poor" and "John." Similarly, the phrase "lost his watch" is composed of "lost," "his," and "watch", all of which may be described as constituents of the construction, put together in a linear order.

Constituency
Sentences or phrases can be analyzed as being composed of a number of somewhat smaller units called constituents (e.g. a noun phrase might consist of a determiner and a noun), and this constituent analysis can be continued until no further subdivisions are possible. The major divisions that can be made are the immediate constituents; the ultimate constituents are the irreducible elements of the construction resulting from such an analysis.

Immediate Constituents
An immediate constituent is the daughter of some larger unit that constitutes a construction. Immediate constituents are often further reducible, but there exists no intermediate unit between them that is a constituent of the same construction, e.g. in the construction 'poor John', 'poor' and 'John' are immediate constituents.

Constructions
Subordinating constructions: constructions in which only one immediate constituent is of the same form class as the whole construction, e.g. 'poor John', 'fresh milk'.
The constituent that is syntactically equivalent to the whole construction is described as the head, and its partner is described as the modifier: thus, in "poor John," the form "John" is the head, and "poor" is its modifier.

Constructions (contd.)
Coordinating constructions: constructions in which both constituents are of the same form class as the whole construction, e.g. 'men and women', 'boys and girls'. In "men and women" the immediate constituents are the word "men" and the word "women," each of which is syntactically equivalent to "men and women."

Immediate Constituent Structure
The organization of the units of a sentence (its immediate constituents), both in terms of their hierarchical arrangement and their linear order. IC structure can be represented in the form of a tree diagram, or using labeled bracketing, with each analytic decision represented by a pair of square brackets at the appropriate points in the construction.

Immediate Constituent Structure (contd.)
'Poor John lost his watch' is not just a linear sequence of five words. It can be analyzed into the immediate constituents 'poor John' and 'lost his watch', and each of these constituents is analyzable into its own immediate constituents. The ultimate constituents of the whole construction are 'poor', 'John', 'lost', 'his', 'watch'.

In 'poor John', 'poor' and 'John' are constituents as well as immediate constituents, as there is no intermediate unit between them that is a constituent of the same construction. Similarly, in 'lost his watch', 'lost', 'his' and 'watch' are all constituents, but not all of them are immediate constituents.

In 'lost his watch', 'his' and 'watch' combine to make the intermediate construction 'his watch', and 'his watch' then combines with 'lost' to give 'lost his watch'. So 'his' and 'watch' are the constituents of 'his watch', while 'lost' and 'his watch' are the immediate constituents of 'lost his watch'.

Representing Immediate Constituent Structure
The constituent structure of the whole sentence can be represented by means of labeled bracketing, e.g.
[ [ [Poor] [John] ] [ [lost] [ [his] [watch] ] ] ]
or by an equivalent tree diagram over the words poor, John, lost, his, watch.

Representing Immediate Constituent Structure (contd.)
Labeled bracketing using category symbols:
[ [ [Poor]ADJ [John]N ]NP [ [lost]V [ [his]PRON [watch]N ]NP ]VP ]S
'Poor' – ADJ, 'John' – N, 'lost' – V, 'his' – PRON, 'watch' – N; 'Poor John' – NP, 'his watch' – NP, 'lost his watch' – VP, 'Poor John lost his watch' – S.

Immediate Constituent Structure using a Tree Diagram
[S [NP [ADJ Poor] [N John]] [VP [V lost] [NP [PRON his] [N watch]]]]

Importance of the notion of Immediate Constituent
It helps to account for the syntactic ambiguity of certain constructions. A classic example is the phrase "old men and women," which may be interpreted in two different ways:
1. One reading associates "old" with "men and women"; the immediate constituents are "old" and "men and women".
2. The second reading associates "old" just with "men"; the immediate constituents are "old men" and "women".

Predication
The part of a sentence or clause containing a verb and stating something about the subject.
[Birds]subject [fly]predicate — Subject: Birds; Predicate: fly.

Modification
[A]modifier [flower]head; John [slept]head [in the room]modifier — Subject: John; Predicate: Head slept + Modifier in the room.

Complementation
He [saw]verbal [a lake]complement — Subject: He; Predicate: Verbal saw + Complement a lake. Complements are required to complete the meaning of a sentence or a part of a sentence.

Subordination
John slept [in]subordinator [the room]dependent unit — Subject: John; Predicate: Head slept + Modifier (Subordinator in + Dependent Unit the room). Subordination is a way of combining sentences that makes one sentence more important than the other.

Coordination
[John came in time]independent unit [but]coordinator [Mary was not ready]independent unit. Coordination is a way of adding sentences together.

An Example
In the morning, the sky looked much brighter.
Modifier: "in the morning" (Subordinator in + Dependent Unit the morning); Subject: "the sky" (Modifier the + Head sky); Predicate: Verbal looked + Complement "much brighter" (Modifier much + Head brighter).

Hierarchical Breaking up and Categorial / Functional Labeling
Hierarchical breaking up coupled with categorial / functional labeling is a very powerful device, but there are ambiguities which demand something more powerful. E.g. "Love of God":
• Someone loves God
• God loves someone
Categorial labeling: Love of God = Noun Phrase (love) + Prepositional Phrase (of God). Functional labeling: Head love + Modifier (Sub of + DU God). The same description is assigned to both readings.

Types of Generative Grammar
• Finite State Model (sequential)
• Phrase Structure Model (sequential + hierarchical) + (categorial)
• Transformational Model (sequential + hierarchical + transformational) + (categorial + functional)

Phrase Structure Grammar (PSG)
A phrase-structure grammar G consists of a four-tuple (V, T, S, P):
• V is a finite set of alphabet symbols (the vocabulary), e.g. N, V, A, Adv, P, NP, VP, AP, AdvP, PP, student, sing, etc.
• T is a finite set of terminal symbols, T ⊆ V, e.g. student, sing, etc.
• S is a distinguished non-terminal symbol, also called the start symbol, S ∈ V.
• P is a set of productions. A small sketch of such a grammar written with NLTK follows.
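As a small illustration of the (V, T, S, P) definition (my own sketch, assuming NLTK is installed, and not part of the slides), the toy grammar used later for "The boy hit the ball" can be written down with nltk.CFG. The quoted strings are the terminals T, the symbols on the left-hand sides are drawn from V, and S is the start symbol.

import nltk

# P, the set of productions; S is the start symbol; quoted strings are terminals.
grammar = nltk.CFG.fromstring("""
    S   -> NP VP
    NP  -> Det N
    VP  -> V NP
    Det -> 'the'
    N   -> 'boy' | 'ball'
    V   -> 'hit'
""")

print(grammar.start())                # S
for production in grammar.productions():
    print(production)                 # lists every rule in P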
Noun Phrases
John (NP → N); the student (NP → Det N); the intelligent student (NP → Det AdjP N).
his first five PhD students (NP → Det Ord Quant N N): his (Det) first (Ord) five (Quant) PhD (N) students (N).
The five best students of my class (NP → Det Quant AP N PP): the (Det) five (Quant) best (AP) students (N) of my class (PP).

Verb Phrases
can sing (VP → Aux V); can hit the ball (VP → Aux V NP).
Can give a flower to Mary (VP → Aux V NP PP): can (Aux) give (V) a flower (NP) to Mary (PP).
may make John the chairman (VP → Aux V NP NP): may (Aux) make (V) John (NP) the chairman (NP).
may find the book very interesting (VP → Aux V NP AP): may (Aux) find (V) the book (NP) very interesting (AP).

Prepositional Phrases
in the classroom (PP → P NP); near the river (PP → P NP).

Adjective Phrases
intelligent (AP → A); very honest (AP → Degree A); fond of sweets (AP → A PP).
very worried that she might have done badly in the assignment (AP → Degree A S'): very (Degree) worried (A) that she might have done badly in the assignment (S').

Phrase Structure Rules
The boy hit the ball.
Rewrite rules:
(i) S → NP VP
(ii) NP → Det N
(iii) VP → V NP
(iv) Det → the
(v) N → boy, ball
(vi) V → hit
We interpret each rule X → Y as the instruction "rewrite X as Y".

Derivation
The boy hit the ball.
Sentence
→ NP + VP                        (i)
→ Det + N + VP                   (ii)
→ Det + N + V + NP               (iii)
→ The + N + V + NP               (iv)
→ The + boy + V + NP             (v)
→ The + boy + hit + NP           (vi)
→ The + boy + hit + Det + N      (ii)
→ The + boy + hit + the + N      (iv)
→ The + boy + hit + the + ball   (v)

PSG Parse Tree
The boy hit the ball:
[S [NP [Det the] [N boy]] [VP [V hit] [NP [Det the] [N ball]]]]

PSG Parse Tree
John wrote those words in the Book of Proverbs:
[S [NP [PropN John]] [VP [V wrote] [NP those words] [PP [P in] [NP [NP the book] [PP of proverbs]]]]]

Penn POS Tags
John wrote those words in the Book of Proverbs.
[ John/NNP ] wrote/VBD [ those/DT words/NNS ] in/IN [ the/DT Book/NN ] of/IN [ Proverbs/NNS ]

Penn Treebank
John wrote those words in the Book of Proverbs.
(S (NP-SBJ (NP John))
   (VP wrote
       (NP those words)
       (PP-LOC in
           (NP (NP-TTL (NP the Book)
                       (PP of (NP Proverbs)))))))

PSG Parse Tree
Official trading in the shares will start in Paris on Nov 6:
[S [NP [NP [AP [A official]] [N trading]] [PP in the shares]] [VP [Aux will] [V start] [PP in Paris] [PP on Nov 6]]]

Penn POS Tags
Official trading in the shares will start in Paris on Nov 6.
[ Official/JJ trading/NN ] in/IN [ the/DT shares/NNS ] will/MD start/VB in/IN [ Paris/NNP ] on/IN [ Nov./NNP 6/CD ]

Penn Treebank
Official trading in the shares will start in Paris on Nov 6.
( (S (NP-SBJ (NP Official trading)
             (PP in (NP the shares)))
     (VP will
         (VP start
             (PP-LOC in (NP Paris))
             (PP-TMP on (NP (NP Nov 6)))))))

Penn POS Tag Set
Adjective: JJ; Adverb: RB; Cardinal Number: CD; Determiner: DT; Preposition: IN; Coordinating Conjunction: CC; Subordinating Conjunction: IN; Singular Noun: NN; Plural Noun: NNS; Personal Pronoun: PP; Proper Noun: NP; Verb base form: VB; Modal verb: MD; Verb (3sg Pres): VBZ; Wh-determiner: WDT; Wh-pronoun: WP.

Techniques of Grammatical Analysis
What is Grammar?
• A theory of language
• A theory of the competence of a native speaker
• An explicit model of competence
• A finite set of rules

What are the requirements?
A model of competence:
– should be able to generate the infinite set of grammatical sentences of the language
– should not generate any ungrammatical ones
– should be able to account for ambiguities
– if two sentences are understood to have the same meaning, the grammar should give the same structure for both at some level
– if two sentences are understood to have different internal relationships, the grammar should assign different structural descriptions

Techniques of Grammatical Analysis
Two main devices:
– Breaking up a string: sequential, hierarchical, transformational
– Labeling the constituents: morphological, categorial, functional
A grammar may combine any of these devices for grammatical analysis:
– Sequential breaking up, with morphological, categorial, or functional labeling
– Hierarchical breaking up, with categorial or functional labeling

Sequential Breaking up
This device breaks up a sentence into a sequence of morphemes:
the + boy + s + kill + ed + the + un + happy + rat + s
raam + ne + dande + se + saap + ko + maar + aa
If a sentence is understood in two different ways, a grammar should give two corresponding descriptions:
They + can + fish
• They are able to fish
• They put fish in cans

Sequential Breaking up and Morphological Labeling
After breaking up a sentence, we can give a morphological label to each unit:
the (word) + boy (stem) + s (affix) + kill (stem) + ed (affix) + the (word) + un (affix) + happy (stem) + rat (stem) + s (affix)
raam (word) + ne (affix) + dande (word) + se (affix) + saap (word) + ko (affix) + maar (word) + aa (affix)

Sequential Breaking up and Categorial Labeling
This boy can solve the problem: this (Det) boy (N) can (Aux) solve (V) the (Det) problem (N)
They can fish: They (Pronoun) can (Aux) fish (Verb), or They (Pronoun) can (Verb) fish (Noun)
They called her a taxi: They (Pronoun) called (Verb) her (Pronoun) a (Article) taxi (Noun)
– They said she was a taxi
– They called a taxi for her

Sequential Breaking up and Functional Labeling
They called her a taxi:
They (Subject) called (Verbal) her (IO) a taxi (DO), or They (Subject) called (Verbal) her (DO) a taxi (Object Complement)
Old men and women: Old (modifier) men (Head) and (coordinator) women (Head)

Hierarchical Breaking up
This device breaks up a sentence into its constituents at different hierarchies or levels:
Old men and women → [Old] [men and women], with [men and women] = [men] [and] [women]; or [Old men] [and] [women], with [Old men] = [Old] [men]

Hierarchical Breaking up and Categorial Labeling
Poor John ran away: [NP [ADJ Poor] [N John]] [VP [V ran] [ADV away]]

Hierarchical Breaking up and Functional Labeling
Immediate Constituent Analysis. Construction types in terms of the function of the constituents:
– Predication (subject + predicate)
– Modification (modifier + head)
– Complementation (verbal + complement)
– Subordination (subordinator + dependent unit)
– Coordination (independent unit + coordinator + independent unit)
In the morning, the sky looked much brighter: Modifier "in the morning" (Sub in + DU the morning); Subject "the sky" (Mod the + Head sky); Predicate: Verbal looked + Complement "much brighter" (Mod much + Head brighter).
Hierarchical Breaking up and Categorial / Functional Labeling
Hierarchical breaking up coupled with categorial / functional labeling is a very powerful device, but there are ambiguities which demand something more powerful.
Love of God
• Someone loves God
• God loves someone
Categorial labeling: Love of God = Noun Phrase (love) + Prepositional Phrase (of God). Functional labeling: Head love + Modifier (Sub of + DU God).

Generative Grammar
A generative grammar
– generates all the grammatical sentences of the language
– rejects all the ungrammatical ones
It is free to choose and combine any of the techniques of breaking up and labeling. Depending upon what tools the grammar combines, we have different types of grammar.

Types of Generative Grammar
• Finite State Model (sequential)
• Phrase Structure Model (sequential + hierarchical)
• Transformational Model (sequential + hierarchical + transformational)

Phrase Structure Model
Inadequacies:
– Ambiguity
– Paraphrase relationship
– Constructional homonymy

Ambiguity
If a sentence is understood to have two meanings, a grammar should give two corresponding structural descriptions. PS grammar fails to meet this demand.
The shooting of the hunters
– The hunters shot someone
– Someone shot the hunters
PS grammar assigns only one structure: [NP [Det the] [N shooting] [PP [P of] [NP [Det the] [N hunters]]]]

Paraphrase Relationship
If two sentences are understood to have the same meaning, a grammar should give the same structural description for the two sentences at some level. PS grammar fails to meet this demand. Examples:
– The boy slapped the girl: [S [NP [Det The] [N boy]] [VP [V slapped] [NP [Det the] [N girl]]]]
– The girl was slapped by the boy: [S [NP [Det The] [N girl]] [VP [Aux was] [V slapped] [PP [P by] [NP the boy]]]]

Constructional Homonymy
If two sentences are not understood the same way, a grammar should give two structural descriptions. PS grammar cannot do this:
– The audience was asked to leave by the side-door
– The audience was asked to leave by the chairman
PS grammar assigns both the same structure: [S [NP [Det The] [N audience]] [VP [Aux was] [V asked] [Inf.Phr [Inf to] [VP [V leave] [PP [P by] [NP [Det the] [N side-door / chairman]]]]]]]

Transformational Model
A generative grammar that makes use of all three devices (sequential, hierarchical and transformational) is called a transformational grammar.
PS model: makes use of PS rules exclusively; the PS rules generate the surface structure.
Transformational model: makes use of PS rules and transformational rules; the PS rules generate the deep structure (DS), and the DS is converted into the surface structure by transformational rules.
PS grammar: PS rules → Surface Structure.
Transformational grammar: PS rules → Deep Structure → (Transformations) → Surface Structure.

Why wasn't he punished?
PS rules:
S → NP VP
VP → V NP PP
V → Aux V
Aux → (Tense)(Modal)(Perf)(Prog)
Tense → {pres/past}
NP → (PreDet)(Det)(Ord)(Quan)(AP) N
Lexical substitution gives the deep structure: Someone + past + punish + him + for some reason.

Why wasn't he punished? (continued)
Transformations applied in sequence: passivization, agent deletion, negation, neg. contraction, wh- substitution, interrogation, wh- fronting, affix switch.
– Passivization: He + past + be + en + punish + by someone + for some reason
– Agent deletion: He + past + be + en + punish + for some reason
– Negation: He + past + be + not + en + punish + for some reason
– Neg. contraction: He + past + be'not + en + punish + for some reason
– Wh- substitution: He + past + be'not + en + punish + why
– Interrogation: past + be'not + he + en + punish + why
– Wh- fronting: why + past + be'not + he + en + punish
– Affix switch: why + was'not + he + punished → Why wasn't he punished?

CSE4022 Natural Language Processing
Topic: Basic Parsing Strategy
Prof. Tulasi Prasad Sariki, SCSE, VIT Chennai Campus
www.learnersdesk.weebly.com

Simple CFG for ATIS English
Grammar:
S → NP VP
S → Aux NP VP
S → VP
NP → Pronoun
NP → Proper-Noun
NP → Det Nominal
Nominal → Noun
Nominal → Nominal Noun
Nominal → Nominal PP
VP → Verb
VP → Verb NP
VP → VP PP
PP → Prep NP
Lexicon:
Det → the | a | that | this
Noun → book | flight | meal | money
Verb → book | include | prefer
Pronoun → I | he | she | me
Proper-Noun → Houston | NWA
Aux → does
Prep → from | to | on | near | through

A Fragment of English Grammar
S → NP VP
VP → V NP
NP → NNP | ART N
NNP → Ram
V → ate | saw
ART → a | an | the
N → rice | apple | movie

Derivation
S is a special symbol called the start symbol. At each step there are multiple choice points.
S => NP VP             (rewrite S)
  => NNP VP            (rewrite NP)
  => Ram VP            (rewrite NNP)
  => Ram V NP          (rewrite VP)
  => Ram ate NP        (rewrite V)
  => Ram ate ART N     (rewrite NP)
  => Ram ate the N     (rewrite ART)
  => Ram ate the rice  (rewrite N)

Two Strategies: Top-Down and Bottom-Up
Top-down: start with S and generate the sentence.
Bottom-up: start with the words in the sentence and use the rewrite rules backwards to reduce the sequence of symbols to S.
The derivation above showed the top-down strategy.

Bottom-Up Derivation
Ram ate the rice
=> NNP ate the rice  (rewrite Ram)
=> NNP V the rice    (rewrite ate)
=> NNP V ART rice    (rewrite the)
=> NNP V ART N       (rewrite rice)
=> NP V ART N        (rewrite NNP)
=> NP V NP           (rewrite ART N)
=> NP VP             (rewrite V NP)
=> S

Parsing Algorithm
A procedure that "searches" through the grammatical rules to find a combination that generates a tree which stands for the structure of the sentence. A sketch of both strategies using NLTK parsers is given below.
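As a concrete sketch of the two strategies (assuming NLTK; not part of the slides), the "Ram ate the rice" fragment can be parsed top-down with RecursiveDescentParser and bottom-up with ShiftReduceParser. Note that the shift-reduce parser does not backtrack, so on some grammars it may fail to find a parse even when one exists.

import nltk

grammar = nltk.CFG.fromstring("""
    S   -> NP VP
    NP  -> NNP | ART N
    VP  -> V NP
    NNP -> 'Ram'
    V   -> 'ate' | 'saw'
    ART -> 'a' | 'an' | 'the'
    N   -> 'rice' | 'apple' | 'movie'
""")
sentence = "Ram ate the rice".split()

# Top-down: start from S, expand rules depth-first, backtrack on failure.
for tree in nltk.RecursiveDescentParser(grammar).parse(sentence):
    print(tree)

# Bottom-up: shift words onto a stack and reduce handles until S is obtained.
for tree in nltk.ShiftReduceParser(grammar).parse(sentence):
    print(tree)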
Parsing as Search (State Space: AND–OR Graph)
The grammar can be viewed as an AND–OR graph: S is an AND node over NP and VP; NP is an OR node over NNP and ART N; VP is an AND node over V and NP. The leaves have links to words in the language, e.g. ART is an OR node over a, an, the.

Top-Down Parsing
Depth-first search on the AND–OR graph. Data structures:
• Open List (OL): nodes to be expanded
• Closed List (CL): expanded nodes
• Input List (IL): words of the sentence to be parsed
• Moving Head (MH): walks over the IL

Trace of Top-Down Parsing (Ram ate the rice)
T0:  OL = S; CL = empty; MH at the start of the IL.
T1:  OL = NP VP; CL = S.
T2:  OL = NNP, ART N, VP; CL = S NP.
T3:  OL = ART N, VP; CL = S NP NNP; MH has consumed "Ram".
T4:  OL = N, VP; CL = S NP NNP ART*  (* indicates a 'useless' expansion).
T5:  OL = VP; CL = S NP NNP ART* N*.
T6:  OL = V NP; CL = S NP NNP ART* N*.
T7:  OL = NP; CL = S NP NNP ART* N* V; MH has consumed "ate".
T8:  OL = NNP, ART N; CL = S NP NNP ART* N* V NP.
T9:  OL = ART N; CL = S NP NNP ART* N* V NNP*.
T10: OL = N; CL = S NP NNP ART* N* V NNP* ART; MH has consumed "the".
T11: OL = empty; CL = S NP NNP ART* N* V NNP* ART N; MH at the end of the IL.
Successful termination: OL empty AND MH at the end of the IL.

Bottom-Up Parsing
Basic idea: refer to words from the lexicon, obtain all POSs for each word, and keep combining until S is obtained.

Implementation of Bottom-Up Parsing
Implemented through a stack: push words onto the stack, look for a "handle" to reduce to a non-terminal, and terminate when the start symbol is on the stack and the sentence is exhausted.

Trace of Bottom-Up Parsing (book the flight)
T0: stack empty; MH at "book".
Push 'book'; advance the input pointer.
Reduce 'book' to V.
Push 'the'; advance the input pointer.
Reduce 'the' to ART.
Push 'flight'; advance the pointer.
Reduce 'flight' to N.
Reduce 'ART N' to 'NP'.
Reduce 'V NP' to 'S'; termination: S on the stack and input exhausted.
Parsing Example
Target parse for "book that flight":
[S [VP [Verb book] [NP [Det that] [Nominal [Noun flight]]]]]

Top Down Parsing (book that flight)
Starting from S, the parser tries each expansion in turn. S → NP VP fails: NP → Pronoun, NP → ProperNoun and NP → Det Nominal all fail against "book". S → Aux NP VP fails as well. With S → VP, the expansion VP → Verb matches "book" but leaves input unconsumed, so the parser backtracks to VP → Verb NP; for the NP, Pronoun and ProperNoun fail against "that", and NP → Det Nominal finally succeeds with Det → that, Nominal → Noun, Noun → flight, yielding the target parse.

Bottom Up Parsing (book that flight)
Starting from the words, "book" can be reduced to either Noun or Verb. The Noun reading leads to Nominal and then to dead ends such as Nominal → Nominal Noun and Nominal → Nominal PP that never connect with the rest of the sentence. The Verb reading allows "that flight" to be reduced to NP (Det + Nominal, with Noun → flight); VP → Verb alone does not cover the input, and attempts such as VP → VP PP also fail, until VP → Verb NP and finally S → VP succeed, yielding the same parse.

Top Down vs. Bottom Up
Top-down never explores options that will not lead to a full parse, but can explore many options that never connect to the actual sentence. Bottom-up never explores options that do not connect to the actual sentence, but can explore options that can never lead to a full parse. The relative amounts of wasted search depend on how much the grammar branches in each direction.

Efficiency Issues
To reuse the work already done for constituent subtrees.
Example of inefficiency: consider the sentence "the train from Chennai to Vizag via Nellore".
Grammar:
S → NP
NP → NP PP | ART N | NNP
ART → the
N → train
NNP → Chennai | Vizag | Nellore
PP → P NP
P → from | to | via

Possible False Steps
the train from Chennai to Vizag via Nellore
A: push "the"; reduce to "ART"; push "train"; reduce to "N"; reduce "ART N" to "NP"; reduce "NP" to "S".

Possible False Steps
Perform A, and then, for "from Chennai to Vizag via Nellore":
B: push "from"; reduce to "P"; push "Chennai"; reduce to "NNP"; reduce to "NP"; reduce "P NP" to "PP"; reduce "NP PP" to "NP"; reduce "NP" to "S".

Possible False Steps
Similarly for "... to Vizag via Nellore" and "... via Nellore". Shift-reduce conflicts occur for S → NP and NP → NP PP: should "NP" be reduced to "S", or should one search for a "PP" and then get a bigger "NP"?

Reduplication of Work
Each time the parser backtracks, the subtrees for "the train", "from Chennai", "to Vizag" and "via Nellore" (and the NPs and Ss built from them) are rebuilt.

Number of Repetitions for Subtree Computation
the train — 4 times
from Chennai — 3 times
the train from Chennai — 3 times
to Vizag — 2 times
the train from Chennai to Vizag — 2 times
via Nellore — 1 time
the train from Chennai to Vizag via Nellore — 1 time
Can the subtrees already computed be reused?

Chart Parsing: Earley Algorithm (Dynamic Programming based)
Sentence: book the flight
Grammar:
S → NP VP | VP
NP → ART N | NNP
VP → V | V NP
ART → a | an | the
N → book | flight
V → book | include

Definitions
CHART is the data structure that stores the record of matched constituents and expected constituents through the use of dotted rules. A dotted rule is of the form A → B • C, where B is the matched constituent and C is the expected constituent.

Definitions
PREDICTOR is the procedure that records, by transitive closure, the set of dotted rules for a given state of the input processing. SCANNER is the procedure that consumes the next input token. COMPLETER is the procedure that takes a dotted rule for which the dot is at the rightmost end and advances the dots of the rules for which a matched constituent was awaited.

Illustration of the Algorithm
Example: "0 book 1 the 2 flight 3"

Chart[0] (fill the chart using transitive closure)
State  Dotted Rule     Position  Comment
S0     S' → • S        [0,0]     Dummy start state
S1     S → • NP VP     [0,0]     Predictor
S2     S → • VP        [0,0]     Predictor
S3     NP → • ART N    [0,0]     Predictor
S4     NP → • NNP      [0,0]     Predictor
S5     VP → • V        [0,0]     Predictor
S6     VP → • V NP     [0,0]     Predictor

Chart[1]
S7     V → book •      [0,1]     Scanner (consume the token "book")
S8     VP → V •        [0,1]     Completer
S9     S → VP •        [0,1]     Completer (complete waiting constituents)
S10    S' → S •        [0,1]     Completer (but not termination!)
S11    VP → V • NP     [0,1]     Completer
S12    NP → • ART N    [1,1]     Predictor (new waiting constituents come into the chart)
S13    NP → • NNP      [1,1]     Predictor

Chart[2]
S14    ART → the •     [1,2]     Scanner
S15    NP → ART • N    [1,2]     Completer

Chart[3]
S16    N → flight •    [2,3]     Scanner
S17    NP → ART N •    [1,3]     Completer
S18    VP → V NP •     [0,3]     Completer
S19    S → VP •        [0,3]     Completer
S20    S' → S •        [0,3]     Successful termination

Earley parser
In computer science, the Earley parser is an algorithm for parsing strings that belong to a given context-free language, though (depending on the variant) it may suffer problems with certain nullable grammars. The algorithm, named after its inventor, Jay Earley, is a chart parser that uses dynamic programming; it is mainly used for parsing in computational linguistics.
It was first introduced in his dissertation (and later appeared in abbreviated, more legible form in a journal). Earley parsers are appealing because they can parse all context-free languages, unlike LR parsers and LL parsers, which are more typically used in compilers but which can only handle restricted classes of languages. The Earley parser executes in cubic time, O(n³), in the general case, where n is the length of the parsed string, in quadratic time, O(n²), for unambiguous grammars, and in linear time for almost all LR(k) grammars. It performs particularly well when the rules are written left-recursively.

Earley Recognizer
The following algorithm describes the Earley recognizer. The recognizer can be easily modified to create a parse tree as it recognizes, and in that way can be turned into a parser.

The algorithm
In the following descriptions, α, β, and γ represent any string of terminals/nonterminals (including the empty string), X and Y represent single nonterminals, and a represents a terminal symbol. Earley's algorithm is a top-down dynamic programming algorithm. In the following, we use Earley's dot notation: given a production X → αβ, the notation X → α • β represents a condition in which α has already been parsed and β is expected. Input position 0 is the position prior to input. Input position n is the position after accepting the nth token. (Informally, input positions can be thought of as locations at token boundaries.) For every input position, the parser generates a state set. Each state is a tuple (X → α • β, i), consisting of
• the production currently being matched (X → α β)
• our current position in that production (represented by the dot)
• the position i in the input at which the matching of this production began: the origin position
(Earley's original algorithm included a look-ahead in the state; later research showed this to have little practical effect on the parsing efficiency, and it has subsequently been dropped from most implementations.)
The state set at input position k is called S(k). The parser is seeded with S(0) consisting of only the top-level rule. The parser then repeatedly executes three operations: prediction, scanning, and completion.
• Prediction: For every state in S(k) of the form (X → α • Y β, j) (where j is the origin position as above), add (Y → • γ, k) to S(k) for every production in the grammar with Y on the left-hand side (Y → γ).
• Scanning: If a is the next symbol in the input stream, for every state in S(k) of the form (X → α • a β, j), add (X → α a • β, j) to S(k+1).
• Completion: For every state in S(k) of the form (X → γ •, j), find states in S(j) of the form (Y → α • X β, i) and add (Y → α X • β, i) to S(k).
It is important to note that duplicate states are not added to the state set, only new ones. These three operations are repeated until no new states can be added to the set. The set is generally implemented as a queue of states to process, with the operation to be performed depending on what kind of state it is.

Pseudocode (adapted from Jurafsky and Martin):
function EARLEY-PARSE(words, grammar)
    ENQUEUE((γ → • S, 0), chart[0])
    for i ← from 0 to LENGTH(words) do
        for each state in chart[i] do
            if INCOMPLETE?(state) then
                if NEXT-CAT(state) is a nonterminal then
                    PREDICTOR(state, i, grammar)   // non-terminal
                else do
                    SCANNER(state, i)              // terminal
            else do
                COMPLETER(state, i)
        end
    end
    return chart

procedure PREDICTOR((A → α • B, i), j, grammar)
    for each (B → γ) in GRAMMAR-RULES-FOR(B, grammar) do
        ADD-TO-SET((B → • γ, j), chart[j])
    end

procedure SCANNER((A → α • B, i), j)
    if B ⊂ PARTS-OF-SPEECH(word[j]) then
        ADD-TO-SET((B → word[j], i), chart[j + 1])
    end

procedure COMPLETER((B → γ •, j), k)
    for each (A → α • B β, i) in chart[j] do
        ADD-TO-SET((A → α B • β, i), chart[k])
    end

Example
Consider the following simple grammar for arithmetic expressions:
<P> ::= <S>                      # the start rule
<S> ::= <S> "+" <M> | <M>
<M> ::= <M> "*" <T> | <T>
<T> ::= "1" | "2" | "3" | "4"
With the input: 2 + 3 * 4
This is the sequence of state sets, shown as (state no.) Production (Origin) # Comment:

S(0): • 2 + 3 * 4
(1)  P → • S         (0)  # start rule
(2)  S → • S + M     (0)  # predict from (1)
(3)  S → • M         (0)  # predict from (1)
(4)  M → • M * T     (0)  # predict from (3)
(5)  M → • T         (0)  # predict from (3)
(6)  T → • number    (0)  # predict from (5)

S(1): 2 • + 3 * 4
(1)  T → number •    (0)  # scan from S(0)(6)
(2)  M → T •         (0)  # complete from (1) and S(0)(5)
(3)  M → M • * T     (0)  # complete from (2) and S(0)(4)
(4)  S → M •         (0)  # complete from (2) and S(0)(3)
(5)  S → S • + M     (0)  # complete from (4) and S(0)(2)
(6)  P → S •         (0)  # complete from (4) and S(0)(1)

S(2): 2 + • 3 * 4
(1)  S → S + • M     (0)  # scan from S(1)(5)
(2)  M → • M * T     (2)  # predict from (1)
(3)  M → • T         (2)  # predict from (1)
(4)  T → • number    (2)  # predict from (3)

S(3): 2 + 3 • * 4
(1)  T → number •    (2)  # scan from S(2)(4)
(2)  M → T •         (2)  # complete from (1) and S(2)(3)
(3)  M → M • * T     (2)  # complete from (2) and S(2)(2)
(4)  S → S + M •     (0)  # complete from (2) and S(2)(1)
(5)  S → S • + M     (0)  # complete from (4) and S(0)(2)
(6)  P → S •         (0)  # complete from (4) and S(0)(1)

S(4): 2 + 3 * • 4
(1)  M → M * • T     (2)  # scan from S(3)(3)
(2)  T → • number    (4)  # predict from (1)

S(5): 2 + 3 * 4 •
(1)  T → number •    (4)  # scan from S(4)(2)
(2)  M → M * T •     (2)  # complete from (1) and S(4)(1)
(3)  M → M • * T     (2)  # complete from (2) and S(2)(2)
(4)  S → S + M •     (0)  # complete from (2) and S(2)(1)
(5)  S → S • + M     (0)  # complete from (4) and S(0)(2)
(6)  P → S •         (0)  # complete from (4) and S(0)(1)

The state (P → S •, 0) represents a completed parse. This state also appears in S(3) and S(1), which are complete sentences.
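The following is a compact, illustrative recognizer in Python written to mirror the predictor / scanner / completer scheme above (my own sketch, not the slides' or the textbook's code). The grammar dictionary encodes the arithmetic example, and the dummy GAMMA production plays the role of γ → S.

from collections import namedtuple

# A state is (lhs, rhs, dot, origin): lhs -> rhs with the dot before rhs[dot],
# and the match of this production started at input position 'origin'.
State = namedtuple("State", "lhs rhs dot origin")

def earley_recognize(words, grammar, start="S"):
    chart = [set() for _ in range(len(words) + 1)]
    chart[0].add(State("GAMMA", (start,), 0, 0))              # dummy start state
    for k in range(len(words) + 1):
        agenda = list(chart[k])
        while agenda:
            st = agenda.pop()
            if st.dot < len(st.rhs):                          # incomplete state
                nxt = st.rhs[st.dot]
                if nxt in grammar:                            # PREDICTOR (non-terminal)
                    for rhs in grammar[nxt]:
                        new = State(nxt, tuple(rhs), 0, k)
                        if new not in chart[k]:
                            chart[k].add(new)
                            agenda.append(new)
                elif k < len(words) and words[k] == nxt:      # SCANNER (terminal)
                    chart[k + 1].add(State(st.lhs, st.rhs, st.dot + 1, st.origin))
            else:                                             # COMPLETER (dot at the end)
                for prev in list(chart[st.origin]):
                    if prev.dot < len(prev.rhs) and prev.rhs[prev.dot] == st.lhs:
                        new = State(prev.lhs, prev.rhs, prev.dot + 1, prev.origin)
                        if new not in chart[k]:
                            chart[k].add(new)
                            agenda.append(new)
    return State("GAMMA", (start,), 1, 0) in chart[len(words)]

# The arithmetic grammar from the example (terminals are plain token strings).
grammar = {
    "S": [["S", "+", "M"], ["M"]],
    "M": [["M", "*", "T"], ["T"]],
    "T": [["1"], ["2"], ["3"], ["4"]],
}
print(earley_recognize("2 + 3 * 4".split(), grammar))         # True
print(earley_recognize("2 + + 4".split(), grammar))           # False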
CHAPTER 3
N-gram Language Models

"You are uniformly charming!" cried he, with a smile of associating and now and then I bowed and they perceived a chaise and four to wish for.
Random sentence generated from a Jane Austen trigram model

Predicting is difficult—especially about the future, as the old quip goes. But how about predicting something that seems much easier, like the next few words someone is going to say? What word, for example, is likely to follow
Please turn your homework ...
Hopefully, most of you concluded that a very likely word is in, or possibly over, but probably not refrigerator or the. In the following sections we will formalize this intuition by introducing models that assign a probability to each possible next word. The same models will also serve to assign a probability to an entire sentence.
Such a model, for example, could predict that the following sequence has a much higher probability of appearing in a text:
all of a sudden I notice three guys standing on the sidewalk
than does this same set of words in a different order:
on guys all I of notice sidewalk three a sudden standing the
Why would you want to predict upcoming words, or assign probabilities to sentences? Probabilities are essential in any task in which we have to identify words in noisy, ambiguous input, like speech recognition. For a speech recognizer to realize that you said I will be back soonish and not I will be bassoon dish, it helps to know that back soonish is a much more probable sequence than bassoon dish. For writing tools like spelling correction or grammatical error correction, we need to find and correct errors in writing like Their are two midterms, in which There was mistyped as Their, or Everything has improve, in which improve should have been improved. The phrase There are will be much more probable than Their are, and has improved than has improve, allowing us to help users by detecting and correcting these errors.
Assigning probabilities to sequences of words is also essential in machine translation. Suppose we are translating a Chinese source sentence:
他 向 记者 介绍了 主要 内容
He to reporters introduced main content
As part of the process we might have built the following set of potential rough English translations:
he introduced reporters to the main contents of the statement
he briefed to reporters the main contents of the statement
he briefed reporters on the main contents of the statement
A probabilistic model of word sequences could suggest that briefed reporters on is a more probable English phrase than briefed to reporters (which has an awkward to after briefed) or introduced reporters to (which uses a verb that is less fluent English in this context), allowing us to correctly select the boldfaced sentence above.
Probabilities are also important for augmentative and alternative communication (AAC) systems (Trnka et al. 2007, Kane et al. 2017). People often use such AAC devices if they are physically unable to speak or sign but can instead use eye gaze or other specific movements to select words from a menu to be spoken by the system. Word prediction can be used to suggest likely words for the menu.
Models that assign probabilities to sequences of words are called language models or LMs. In this chapter we introduce the simplest model that assigns probabilities to sentences and sequences of words, the n-gram. An n-gram is a sequence of n words: a 2-gram (which we'll call bigram) is a two-word sequence of words like "please turn", "turn your", or "your homework", and a 3-gram (a trigram) is a three-word sequence of words like "please turn your", or "turn your homework". We'll see how to use n-gram models to estimate the probability of the last word of an n-gram given the previous words, and also to assign probabilities to entire sequences. In a bit of terminological ambiguity, we usually drop the word "model", and use the term n-gram (and bigram, etc.) to mean either the word sequence itself or the predictive model that assigns it a probability. In later chapters we'll introduce more sophisticated language models like the RNN LMs of Chapter 9.

3.1 N-Grams
Let's begin with the task of computing P(w|h), the probability of a word w given some history h.
Suppose the history h is "its water is so transparent that" and we want to know the probability that the next word is the:

P(the | its water is so transparent that).    (3.1)

One way to estimate this probability is from relative frequency counts: take a very large corpus, count the number of times we see its water is so transparent that, and count the number of times this is followed by the. This would be answering the question "Out of the times we saw the history h, how many times was it followed by the word w", as follows:

P(the | its water is so transparent that) = C(its water is so transparent that the) / C(its water is so transparent that)    (3.2)

With a large enough corpus, such as the web, we can compute these counts and estimate the probability from Eq. 3.2. You should pause now, go to the web, and compute this estimate for yourself. While this method of estimating probabilities directly from counts works fine in many cases, it turns out that even the web isn't big enough to give us good estimates in most cases. This is because language is creative; new sentences are created all the time, and we won't always be able to count entire sentences. Even simple extensions of the example sentence may have counts of zero on the web (such as "Walden Pond's water is so transparent that the"; well, used to have counts of zero).

Similarly, if we wanted to know the joint probability of an entire sequence of words like its water is so transparent, we could do it by asking "out of all possible sequences of five words, how many of them are its water is so transparent?" We would have to get the count of its water is so transparent and divide by the sum of the counts of all possible five-word sequences. That seems rather a lot to estimate! For this reason, we'll need to introduce more clever ways of estimating the probability of a word w given a history h, or the probability of an entire word sequence W.

Let's start with a little formalizing of notation. To represent the probability of a particular random variable Xi taking on the value "the", or P(Xi = "the"), we will use the simplification P(the). We'll represent a sequence of N words either as w1 ... wn or w1:n (so the expression w1:n−1 means the string w1, w2, ..., wn−1). For the joint probability of each word in a sequence having a particular value P(X = w1, Y = w2, Z = w3, ..., W = wn) we'll use P(w1, w2, ..., wn). Now how can we compute probabilities of entire sequences like P(w1, w2, ..., wn)? One thing we can do is decompose this probability using the chain rule of probability:

P(X1 ... Xn) = P(X1) P(X2|X1) P(X3|X1:2) ... P(Xn|X1:n−1)
             = ∏_{k=1..n} P(Xk|X1:k−1)    (3.3)

Applying the chain rule to words, we get

P(w1:n) = P(w1) P(w2|w1) P(w3|w1:2) ... P(wn|w1:n−1)
        = ∏_{k=1..n} P(wk|w1:k−1)    (3.4)

The chain rule shows the link between computing the joint probability of a sequence and computing the conditional probability of a word given previous words. Equation 3.4 suggests that we could estimate the joint probability of an entire sequence of words by multiplying together a number of conditional probabilities. But using the chain rule doesn't really seem to help us! We don't know any way to compute the exact probability of a word given a long sequence of preceding words, P(wn | w1:n−1). As we said above, we can't just estimate by counting the number of times every word occurs following every long string, because language is creative and any particular context might have never occurred before!
The intuition of the n-gram model is that instead of computing the probability of a word given its entire history, we can approximate the history by just the last few words. The bigram model, for example, approximates the probability of a word given all the previous words P(wn|w1:n−1) by using only the conditional probability of the preceding word P(wn|wn−1). In other words, instead of computing the probability

P(the | Walden Pond's water is so transparent that)    (3.5)

we approximate it with the probability

P(the | that)    (3.6)

When we use a bigram model to predict the conditional probability of the next word, we are thus making the following approximation:

P(wn|w1:n−1) ≈ P(wn|wn−1)    (3.7)

The assumption that the probability of a word depends only on the previous word is called a Markov assumption. Markov models are the class of probabilistic models that assume we can predict the probability of some future unit without looking too far into the past. We can generalize the bigram (which looks one word into the past) to the trigram (which looks two words into the past) and thus to the n-gram (which looks n − 1 words into the past). Thus, the general equation for this n-gram approximation to the conditional probability of the next word in a sequence is

P(wn|w1:n−1) ≈ P(wn|wn−N+1:n−1)    (3.8)

Given the bigram assumption for the probability of an individual word, we can compute the probability of a complete word sequence by substituting Eq. 3.7 into Eq. 3.4:

P(w1:n) ≈ ∏_{k=1..n} P(wk|wk−1)    (3.9)

How do we estimate these bigram or n-gram probabilities? An intuitive way to estimate probabilities is called maximum likelihood estimation or MLE. We get the MLE estimate for the parameters of an n-gram model by getting counts from a corpus, and normalizing the counts so that they lie between 0 and 1 (for probabilistic models, normalizing means dividing by some total count so that the resulting probabilities fall legally between 0 and 1). For example, to compute a particular bigram probability of a word y given a previous word x, we'll compute the count of the bigram C(xy) and normalize by the sum of all the bigrams that share the same first word x:

P(wn|wn−1) = C(wn−1 wn) / Σ_w C(wn−1 w)    (3.10)

We can simplify this equation, since the sum of all bigram counts that start with a given word wn−1 must be equal to the unigram count for that word wn−1 (the reader should take a moment to be convinced of this):

P(wn|wn−1) = C(wn−1 wn) / C(wn−1)    (3.11)

Let's work through an example using a mini-corpus of three sentences. We'll first need to augment each sentence with a special symbol <s> at the beginning of the sentence, to give us the bigram context of the first word. We'll also need a special end-symbol </s>. (The end-symbol is needed to make the bigram grammar a true probability distribution: without an end-symbol, the sentence probabilities for all sentences of a given length would sum to one, and the model would define an infinite set of probability distributions, with one distribution per sentence length. See Exercise 3.5.)

<s> I am Sam </s>
<s> Sam I am </s>
<s> I do not like green eggs and ham </s>

Here are the calculations for some of the bigram probabilities from this corpus:

P(I|<s>) = 2/3 = .67      P(Sam|<s>) = 1/3 = .33     P(am|I) = 2/3 = .67
P(</s>|Sam) = 1/2 = 0.5   P(Sam|am) = 1/2 = .5       P(do|I) = 1/3 = .33
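A small sketch (illustrative, not from the text) of how these MLE bigram estimates can be computed directly from the three-sentence mini-corpus, padding each sentence with <s> and </s>:

from collections import Counter

corpus = [
    "I am Sam",
    "Sam I am",
    "I do not like green eggs and ham",
]

unigrams = Counter()
bigrams = Counter()
for sentence in corpus:
    tokens = ["<s>"] + sentence.split() + ["</s>"]
    unigrams.update(tokens)
    bigrams.update(zip(tokens, tokens[1:]))

def bigram_prob(w_prev, w):
    # MLE estimate: P(w | w_prev) = C(w_prev w) / C(w_prev)   (Eq. 3.11)
    return bigrams[(w_prev, w)] / unigrams[w_prev]

print(round(bigram_prob("<s>", "I"), 2))     # 0.67
print(round(bigram_prob("I", "am"), 2))      # 0.67
print(round(bigram_prob("Sam", "</s>"), 2))  # 0.5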
For the general case of MLE n-gram parameter estimation:

P(w_n \mid w_{n-N+1:n-1}) = \frac{C(w_{n-N+1:n-1}\, w_n)}{C(w_{n-N+1:n-1})}    (3.12)

Equation 3.12 (like Eq. 3.11) estimates the n-gram probability by dividing the observed frequency of a particular sequence by the observed frequency of a prefix. This ratio is called a relative frequency. We said above that this use of relative frequencies as a way to estimate probabilities is an example of maximum likelihood estimation or MLE. In MLE, the resulting parameter set maximizes the likelihood of the training set T given the model M (i.e., P(T | M)). For example, suppose the word Chinese occurs 400 times in a corpus of a million words like the Brown corpus. What is the probability that a random word selected from some other text of, say, a million words will be the word Chinese? The MLE of its probability is 400/1,000,000 or .0004. Now .0004 is not the best possible estimate of the probability of Chinese occurring in all situations; it might turn out that in some other corpus or context Chinese is a very unlikely word. But it is the probability that makes it most likely that Chinese will occur 400 times in a million-word corpus. We present ways to modify the MLE estimates slightly to get better probability estimates in Section 3.4.

Let's move on to some examples from a slightly larger corpus than our 14-word example above. We'll use data from the now-defunct Berkeley Restaurant Project, a dialogue system from the last century that answered questions about a database of restaurants in Berkeley, California (Jurafsky et al., 1994). Here are some text-normalized sample user queries (a sample of 9332 sentences is on the website):

can you tell me about any good cantonese restaurants close by
mid priced thai food is what i'm looking for
tell me about chez panisse
can you give me a listing of the kinds of food that are available
i'm looking for a good place to eat breakfast
when is caffe venezia open during the day

Figure 3.1 shows the bigram counts from a piece of a bigram grammar from the Berkeley Restaurant Project. Note that the majority of the values are zero. In fact, we have chosen the sample words to cohere with each other; a matrix selected from a random set of seven words would be even more sparse.

          i     want   to    eat   chinese  food  lunch  spend
i         5     827    0     9     0        0     0      2
want      2     0      608   1     6        6     5      1
to        2     0      4     686   2        0     6      211
eat       0     0      2     0     16       2     42     0
chinese   1     0      0     0     0        82    1      0
food      15    0      15    0     1        4     0      0
lunch     2     0      0     0     0        1     0      0
spend     1     0      1     0     0        0     0      0

Figure 3.1 Bigram counts for eight of the words (out of V = 1446) in the Berkeley Restaurant Project corpus of 9332 sentences. Zero counts are in gray.

Figure 3.2 shows the bigram probabilities after normalization (dividing each cell in Fig. 3.1 by the appropriate unigram for its row, taken from the following set of unigram counts):

i: 2533   want: 927   to: 2417   eat: 746   chinese: 158   food: 1093   lunch: 341   spend: 278

          i        want   to       eat      chinese  food     lunch    spend
i         0.002    0.33   0        0.0036   0        0        0        0.00079
want      0.0022   0      0.66     0.0011   0.0065   0.0065   0.0054   0.0011
to        0.00083  0      0.0017   0.28     0.00083  0        0.0025   0.087
eat       0        0      0.0027   0        0.021    0.0027   0.056    0
chinese   0.0063   0      0        0        0        0.52     0.0063   0
food      0.014    0      0.014    0        0.00092  0.0037   0        0
lunch     0.0059   0      0        0        0        0.0029   0        0
spend     0.0036   0      0.0036   0        0        0        0        0

Figure 3.2 Bigram probabilities for eight words in the Berkeley Restaurant Project corpus of 9332 sentences. Zero probabilities are in gray.
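A small hedged sketch (not the textbook's code) of the row normalization just described, using a few of the counts from Fig. 3.1 and the unigram counts above:

```python
# Turn bigram counts into probabilities by dividing each count by the
# unigram count of its first word (Eq. 3.11).
bigram_counts = {("i", "want"): 827, ("want", "to"): 608, ("want", "chinese"): 6}
unigram_counts = {"i": 2533, "want": 927}

bigram_probs = {
    (w1, w2): c / unigram_counts[w1]
    for (w1, w2), c in bigram_counts.items()
}
print(round(bigram_probs[("i", "want")], 2))   # 0.33, as in Fig. 3.2
print(round(bigram_probs[("want", "to")], 2))  # 0.66
```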
Here are a few other useful probabilities:

P(i|<s>) = 0.25        P(english|want) = 0.0011
P(food|english) = 0.5  P(</s>|food) = 0.68

Now we can compute the probability of sentences like I want English food or I want Chinese food by simply multiplying the appropriate bigram probabilities together, as follows:

P(<s> i want english food </s>)
  = P(i|<s>) P(want|i) P(english|want) P(food|english) P(</s>|food)
  = .25 × .33 × .0011 × 0.5 × 0.68
  = .000031

We leave it as Exercise 3.2 to compute the probability of i want chinese food. What kinds of linguistic phenomena are captured in these bigram statistics? Some of the bigram probabilities above encode some facts that we think of as strictly syntactic in nature, like the fact that what comes after eat is usually a noun or an adjective, or that what comes after to is usually a verb. Others might be a fact about the personal assistant task, like the high probability of sentences beginning with the words I. And some might even be cultural rather than linguistic, like the higher probability that people are looking for Chinese versus English food.

Some practical issues: Although for pedagogical purposes we have only described bigram models, in practice it's more common to use trigram models, which condition on the previous two words rather than the previous word, or 4-gram or even 5-gram models, when there is sufficient training data. Note that for these larger n-grams, we'll need to assume extra contexts to the left and right of the sentence end. For example, to compute trigram probabilities at the very beginning of the sentence, we use two pseudo-words for the first trigram (i.e., P(I|<s><s>)).

We always represent and compute language model probabilities in log format as log probabilities. Since probabilities are (by definition) less than or equal to 1, the more probabilities we multiply together, the smaller the product becomes. Multiplying enough n-grams together would result in numerical underflow. By using log probabilities instead of raw probabilities, we get numbers that are not as small. Adding in log space is equivalent to multiplying in linear space, so we combine log probabilities by adding them. The result of doing all computation and storage in log space is that we only need to convert back into probabilities if we need to report them at the end; then we can just take the exp of the logprob:

p_1 \times p_2 \times p_3 \times p_4 = \exp(\log p_1 + \log p_2 + \log p_3 + \log p_4)    (3.13)

3.2 Evaluating Language Models

The best way to evaluate the performance of a language model is to embed it in an application and measure how much the application improves. Such end-to-end evaluation is called extrinsic evaluation. Extrinsic evaluation is the only way to know if a particular improvement in a component is really going to help the task at hand. Thus, for speech recognition, we can compare the performance of two language models by running the speech recognizer twice, once with each language model, and seeing which gives the more accurate transcription.

Unfortunately, running big NLP systems end-to-end is often very expensive. Instead, it would be nice to have a metric that can be used to quickly evaluate potential improvements in a language model. An intrinsic evaluation metric is one that measures the quality of a model independent of any application. For an intrinsic evaluation of a language model we need a test set.
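Picking up the log-probability point from Eq. 3.13 above: a minimal sketch (not from the text) of computing the i want english food probability by summing logs; the numbers are the bigram probabilities quoted earlier.

```python
import math

# Sum log probabilities instead of multiplying raw probabilities, then
# exponentiate only when reporting the final result (Eq. 3.13).
bigram_probs = [0.25, 0.33, 0.0011, 0.5, 0.68]
logprob = sum(math.log(p) for p in bigram_probs)
print(math.exp(logprob))  # ≈ 3.1e-05, the same product, without underflow risk
```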
As with many of the statistical models in our field, the probabilities of an n-gram model come from the corpus it is trained on, the training set or training corpus. We can then measure the quality of an n-gram model by its performance on some unseen data called the test set or test corpus. We will also sometimes call test sets and other datasets that are not in our training sets held out corpora because we hold them out from the training data. So if we are given a corpus of text and want to compare two different n-gram models, we divide the data into training and test sets, train the parameters of both models on the training set, and then compare how well the two trained models fit the test set.

But what does it mean to "fit the test set"? The answer is simple: whichever model assigns a higher probability to the test set—meaning it more accurately predicts the test set—is a better model. Given two probabilistic models, the better model is the one that has a tighter fit to the test data or that better predicts the details of the test data, and hence will assign a higher probability to the test data.

Since our evaluation metric is based on test set probability, it's important not to let the test sentences into the training set. Suppose we are trying to compute the probability of a particular "test" sentence. If our test sentence is part of the training corpus, we will mistakenly assign it an artificially high probability when it occurs in the test set. We call this situation training on the test set. Training on the test set introduces a bias that makes the probabilities all look too high, and causes huge inaccuracies in perplexity, the probability-based metric we introduce below.

Sometimes we use a particular test set so often that we implicitly tune to its characteristics. We then need a fresh test set that is truly unseen. In such cases, we call the initial test set the development test set or devset. How do we divide our data into training, development, and test sets? We want our test set to be as large as possible, since a small test set may be accidentally unrepresentative, but we also want as much training data as possible. At the minimum, we would want to pick the smallest test set that gives us enough statistical power to measure a statistically significant difference between two potential models. In practice, we often just divide our data into 80% training, 10% development, and 10% test. Given a large corpus that we want to divide into training and test, test data can either be taken from some continuous sequence of text inside the corpus, or we can remove smaller "stripes" of text from randomly selected parts of our corpus and combine them into a test set.

3.2.1 Perplexity

In practice we don't use raw probability as our metric for evaluating language models, but a variant called perplexity. The perplexity (sometimes called PP for short) of a language model on a test set is the inverse probability of the test set, normalized by the number of words. For a test set W = w_1 w_2 ... w_N:

PP(W) = P(w_1 w_2 \ldots w_N)^{-1/N} = \sqrt[N]{\frac{1}{P(w_1 w_2 \ldots w_N)}}    (3.14)

We can use the chain rule to expand the probability of W:

PP(W) = \sqrt[N]{\prod_{i=1}^{N} \frac{1}{P(w_i \mid w_1 \ldots w_{i-1})}}    (3.15)

Thus, if we are computing the perplexity of W with a bigram language model, we get:

PP(W) = \sqrt[N]{\prod_{i=1}^{N} \frac{1}{P(w_i \mid w_{i-1})}}    (3.16)

Note that because of the inverse in Eq. 3.15, the higher the conditional probability of the word sequence, the lower the perplexity.
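A hedged sketch of Eq. 3.16 (not from the text): perplexity of a token sequence under a bigram model, computed in log space. The function `bigram_prob(w, prev)` is an assumed callable supplied by the caller and must be non-zero for every bigram in the test data.

```python
import math

def bigram_perplexity(test_tokens, bigram_prob):
    """Perplexity of `test_tokens` under a bigram model (Eq. 3.16).
    If `test_tokens` begins with <s>, the number of bigrams scored here equals
    the token count N described above (the words plus </s>, but not <s>)."""
    log_sum, n = 0.0, 0
    for prev, w in zip(test_tokens, test_tokens[1:]):
        log_sum += math.log(bigram_prob(w, prev))  # accumulate log P(w | prev)
        n += 1
    return math.exp(-log_sum / n)                  # exp of average negative log prob
```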
Thus, minimizing perplexity is equivalent to maximizing the test set probability according to the language model. What we generally use for the word sequence in Eq. 3.15 or Eq. 3.16 is the entire sequence of words in some test set. Since this sequence will cross many sentence boundaries, we need to include the begin- and end-sentence markers <s> and </s> in the probability computation. We also need to include the end-of-sentence marker </s> (but not the beginning-of-sentence marker <s>) in the total count of word tokens N.

There is another way to think about perplexity: as the weighted average branching factor of a language. The branching factor of a language is the number of possible next words that can follow any word. Consider the task of recognizing the digits in English (zero, one, two, ..., nine), given that (both in some training set and in some test set) each of the 10 digits occurs with equal probability P = 1/10. The perplexity of this mini-language is in fact 10. To see that, imagine a test string of digits of length N, and assume that in the training set all the digits occurred with equal probability. By Eq. 3.15, the perplexity will be

PP(W) = P(w_1 w_2 \ldots w_N)^{-1/N} = \left(\left(\tfrac{1}{10}\right)^{N}\right)^{-1/N} = \left(\tfrac{1}{10}\right)^{-1} = 10    (3.17)

But suppose that the number zero is really frequent and occurs far more often than other numbers. Let's say that 0 occurs 91 times in the training set, and each of the other digits occurred once. Now we see the following test set: 0 0 0 0 0 3 0 0 0 0. We should expect the perplexity of this test set to be lower since most of the time the next number will be zero, which is very predictable, i.e. has a high probability. Thus, although the branching factor is still 10, the perplexity or weighted branching factor is smaller. We leave this exact calculation as exercise 12.

We see in Section 3.7 that perplexity is also closely related to the information-theoretic notion of entropy. Finally, let's look at an example of how perplexity can be used to compare different n-gram models. We trained unigram, bigram, and trigram grammars on 38 million words (including start-of-sentence tokens) from the Wall Street Journal, using a 19,979 word vocabulary. We then computed the perplexity of each of these models on a test set of 1.5 million words with Eq. 3.16. The table below shows the perplexity of a 1.5 million word WSJ test set according to each of these grammars.

             Unigram   Bigram   Trigram
Perplexity   962       170      109

As we see above, the more information the n-gram gives us about the word sequence, the lower the perplexity (since as Eq. 3.15 showed, perplexity is related inversely to the likelihood of the test sequence according to the model).

Note that in computing perplexities, the n-gram model P must be constructed without any knowledge of the test set or any prior knowledge of the vocabulary of the test set. Any kind of knowledge of the test set can cause the perplexity to be artificially low. The perplexity of two language models is only comparable if they use identical vocabularies.

An (intrinsic) improvement in perplexity does not guarantee an (extrinsic) improvement in the performance of a language processing task like speech recognition or machine translation. Nonetheless, because perplexity often correlates with such improvements, it is commonly used as a quick check on an algorithm. But a model's improvement in perplexity should always be confirmed by an end-to-end evaluation of a real task before concluding the evaluation of the model.
3.3 Generalization and Zeros

The n-gram model, like many statistical models, is dependent on the training corpus. One implication of this is that the probabilities often encode specific facts about a given training corpus. Another implication is that n-grams do a better and better job of modeling the training corpus as we increase the value of N.

We can visualize both of these facts by borrowing the technique of Shannon (1951) and Miller and Selfridge (1950) of generating random sentences from different n-gram models. It's simplest to visualize how this works for the unigram case. Imagine all the words of the English language covering the probability space between 0 and 1, each word covering an interval proportional to its frequency. We choose a random value between 0 and 1 and print the word whose interval includes this chosen value. We continue choosing random numbers and generating words until we randomly generate the sentence-final token </s>. We can use the same technique to generate bigrams by first generating a random bigram that starts with <s> (according to its bigram probability). Let's say the second word of that bigram is w. We next choose a random bigram starting with w (again, drawn according to its bigram probability), and so on.

To give an intuition for the increasing power of higher-order n-grams, Fig. 3.3 shows random sentences generated from unigram, bigram, trigram, and 4-gram models trained on Shakespeare's works.

1-gram: –To him swallowed confess hear both. Which. Of save on trail for are ay device and rote life have
        –Hill he late speaks; or! a more to leg less first you enter
2-gram: –Why dost stand forth thy canopy, forsooth; he is this palpable hit the King Henry. Live king. Follow.
        –What means, sir. I confess she? then all sorts, he is trim, captain.
3-gram: –Fly, and will rid me these news of price. Therefore the sadness of parting, as they say, 'tis done.
        –This shall forbid it should be branded, if renown made it empty.
4-gram: –King Henry. What! I will go seek the traitor Gloucester. Exeunt some of the watch. A great banquet serv'd in;
        –It cannot be but so.

Figure 3.3 Eight sentences randomly generated from four n-grams computed from Shakespeare's works. All characters were mapped to lower-case and punctuation marks were treated as words. Output is hand-corrected for capitalization to improve readability.

The longer the context on which we train the model, the more coherent the sentences. In the unigram sentences, there is no coherent relation between words or any sentence-final punctuation. The bigram sentences have some local word-to-word coherence (especially if we consider that punctuation counts as a word). The trigram and 4-gram sentences are beginning to look a lot like Shakespeare. Indeed, a careful investigation of the 4-gram sentences shows that they look a little too much like Shakespeare. The words It cannot be but so are directly from King John. This is because, not to put the knock on Shakespeare, his oeuvre is not very large as corpora go (N = 884,647, V = 29,066), and our n-gram probability matrices are ridiculously sparse. There are V^2 = 844,000,000 possible bigrams alone, and the number of possible 4-grams is V^4 = 7 × 10^17. Thus, once the generator has chosen the first 4-gram (It cannot be but), there are only five possible continuations (that, I, he, thou, and so); indeed, for many 4-grams, there is only one continuation.
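As a hedged sketch of the Shannon-style sampling procedure described above (not the original authors' code, and assuming the bigram counts are stored as a dict of per-context counters):

```python
import random

def generate_bigram_sentence(bigram_counts, max_len=20):
    """Sample a sentence from a bigram model: starting from <s>, repeatedly
    draw the next word in proportion to its bigram count with the previous
    word, stopping at </s>. `bigram_counts[w]` is assumed to map each context
    word w (including <s>) to a dict/Counter of follower counts."""
    word, sentence = "<s>", []
    for _ in range(max_len):
        followers = bigram_counts[word]
        words, counts = zip(*followers.items())
        word = random.choices(words, weights=counts)[0]  # sample ∝ bigram count
        if word == "</s>":
            break
        sentence.append(word)
    return " ".join(sentence)
```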
To get an idea of the dependence of a grammar on its training set, let's look at an n-gram grammar trained on a completely different corpus: the Wall Street Journal (WSJ) newspaper. Shakespeare and the Wall Street Journal are both English, so we might expect some overlap between our n-grams for the two genres. Fig. 3.4 shows sentences generated by unigram, bigram, and trigram grammars trained on 40 million words from WSJ.

1-gram: Months the my and issue of year foreign new exchange's september were recession exchange new endorsed a acquire to six executives
2-gram: Last December through the way to preserve the Hudson corporation N. B. E. C. Taylor would seem to complete the major central planners one point five percent of U. S. E. has already old M. X. corporation of living on information such as more frequently fishing to keep her
3-gram: They also point to ninety nine point six billion dollars from two hundred four oh six three percent of the rates of interest stores as Mexico and Brazil on market conditions

Figure 3.4 Three sentences randomly generated from three n-gram models computed from 40 million words of the Wall Street Journal, lower-casing all characters and treating punctuation as words. Output was then hand-corrected for capitalization to improve readability.

Compare these examples to the pseudo-Shakespeare in Fig. 3.3. While they both model "English-like sentences", there is clearly no overlap in generated sentences, and little overlap even in small phrases. Statistical models are likely to be pretty useless as predictors if the training sets and the test sets are as different as Shakespeare and WSJ.

How should we deal with this problem when we build n-gram models? One step is to be sure to use a training corpus that has a similar genre to whatever task we are trying to accomplish. To build a language model for translating legal documents, we need a training corpus of legal documents. To build a language model for a question-answering system, we need a training corpus of questions.

It is equally important to get training data in the appropriate dialect or variety, especially when processing social media posts or spoken transcripts. For example, some tweets will use features of African American Language (AAL)—the name for the many variations of language used in African American communities (King, 2020). Such features include words like finna—an auxiliary verb that marks immediate future tense—that don't occur in other varieties, or spellings like den for then, in tweets like this one (Blodgett and O'Connor, 2017):

(3.18) Bored af den my phone finna die!!!

while tweets from varieties like Nigerian English have markedly different vocabulary and n-gram patterns from American English (Jurgens et al., 2017):

(3.19) @username R u a wizard or wat gan sef: in d mornin - u tweet, afternoon - u tweet, nyt gan u dey tweet. beta get ur IT placement wiv twitter

Matching genres and dialects is still not sufficient. Our models may still be subject to the problem of sparsity. For any n-gram that occurred a sufficient number of times, we might have a good estimate of its probability. But because any corpus is limited, some perfectly acceptable English word sequences are bound to be missing from it. That is, we'll have many cases of putative "zero probability n-grams" that should really have some non-zero probability.
Consider the words that follow the bigram denied the in the WSJ Treebank3 corpus, together with their counts:

denied the allegations: 5
denied the speculation: 2
denied the rumors: 1
denied the report: 1

But suppose our test set has phrases like:

denied the offer
denied the loan

Our model will incorrectly estimate that P(offer|denied the) is 0! These zeros—things that don't ever occur in the training set but do occur in the test set—are a problem for two reasons. First, their presence means we are underestimating the probability of all sorts of words that might occur, which will hurt the performance of any application we want to run on this data. Second, if the probability of any word in the test set is 0, the entire probability of the test set is 0. By definition, perplexity is based on the inverse probability of the test set. Thus if some words have zero probability, we can't compute perplexity at all, since we can't divide by 0!

3.3.1 Unknown Words

The previous section discussed the problem of words whose bigram probability is zero. But what about words we simply have never seen before? Sometimes we have a language task in which this can't happen because we know all the words that can occur. In such a closed vocabulary system the test set can only contain words from this lexicon, and there will be no unknown words. This is a reasonable assumption in some domains, such as speech recognition or machine translation, where we have a pronunciation dictionary or a phrase table that are fixed in advance, and so the language model can only use the words in that dictionary or phrase table.

In other cases we have to deal with words we haven't seen before, which we'll call unknown words, or out of vocabulary (OOV) words. The percentage of OOV words that appear in the test set is called the OOV rate. An open vocabulary system is one in which we model these potential unknown words in the test set by adding a pseudo-word called <UNK>.

There are two common ways to train the probabilities of the unknown word model <UNK>. The first one is to turn the problem back into a closed vocabulary one by choosing a fixed vocabulary in advance:
1. Choose a vocabulary (word list) that is fixed in advance.
2. Convert in the training set any word that is not in this set (any OOV word) to the unknown word token <UNK> in a text normalization step.
3. Estimate the probabilities for <UNK> from its counts just like any other regular word in the training set.

The second alternative, in situations where we don't have a prior vocabulary in advance, is to create such a vocabulary implicitly, replacing words in the training data by <UNK> based on their frequency. For example we can replace by <UNK> all words that occur fewer than n times in the training set, where n is some small number, or equivalently select a vocabulary size V in advance (say 50,000) and choose the top V words by frequency and replace the rest by <UNK>. In either case we then proceed to train the language model as before, treating <UNK> like a regular word.

The exact choice of <UNK> model does have an effect on metrics like perplexity. A language model can achieve low perplexity by choosing a small vocabulary and assigning the unknown word a high probability. For this reason, perplexities should only be compared across language models with the same vocabularies (Buck et al., 2014).
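Returning to the frequency-based <UNK> strategy described above, a minimal hedged sketch (the threshold value is just an illustrative choice):

```python
from collections import Counter

def build_vocab(train_tokens, min_count=2):
    """Keep words seen at least `min_count` times; everything else will be
    mapped to <UNK> (the frequency-based strategy described above)."""
    counts = Counter(train_tokens)
    return {w for w, c in counts.items() if c >= min_count}

def apply_unk(tokens, vocab):
    """Replace out-of-vocabulary tokens with the <UNK> pseudo-word."""
    return [w if w in vocab else "<UNK>" for w in tokens]

train = "the cat sat on the mat the cat slept".split()
vocab = build_vocab(train, min_count=2)
print(apply_unk("the dog sat".split(), vocab))  # ['the', '<UNK>', '<UNK>']
```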
3.4 Smoothing

What do we do with words that are in our vocabulary (they are not unknown words) but appear in a test set in an unseen context (for example they appear after a word they never appeared after in training)? To keep a language model from assigning zero probability to these unseen events, we'll have to shave off a bit of probability mass from some more frequent events and give it to the events we've never seen. This modification is called smoothing or discounting. In this section and the following ones we'll introduce a variety of ways to do smoothing: Laplace (add-one) smoothing, add-k smoothing, stupid backoff, and Kneser-Ney smoothing.

3.4.1 Laplace Smoothing

The simplest way to do smoothing is to add one to all the bigram counts, before we normalize them into probabilities. All the counts that used to be zero will now have a count of 1, the counts of 1 will be 2, and so on. This algorithm is called Laplace smoothing. Laplace smoothing does not perform well enough to be used in modern n-gram models, but it usefully introduces many of the concepts that we see in other smoothing algorithms, gives a useful baseline, and is also a practical smoothing algorithm for other tasks like text classification (Chapter 4).

Let's start with the application of Laplace smoothing to unigram probabilities. Recall that the unsmoothed maximum likelihood estimate of the unigram probability of the word w_i is its count c_i normalized by the total number of word tokens N:

P(w_i) = \frac{c_i}{N}

Laplace smoothing merely adds one to each count (hence its alternate name add-one smoothing). Since there are V words in the vocabulary and each one was incremented, we also need to adjust the denominator to take into account the extra V observations. (What happens to our P values if we don't increase the denominator?)

P_{\text{Laplace}}(w_i) = \frac{c_i + 1}{N + V}    (3.20)

Instead of changing both the numerator and denominator, it is convenient to describe how a smoothing algorithm affects the numerator, by defining an adjusted count c*. This adjusted count is easier to compare directly with the MLE counts and can be turned into a probability like an MLE count by normalizing by N. To define this count, since we are only changing the numerator in addition to adding 1, we'll also need to multiply by a normalization factor N/(N + V):

c^*_i = (c_i + 1)\,\frac{N}{N + V}    (3.21)

We can now turn c^*_i into a probability P^*_i by normalizing by N.

A related way to view smoothing is as discounting (lowering) some non-zero counts in order to get the probability mass that will be assigned to the zero counts. Thus, instead of referring to the discounted counts c*, we might describe a smoothing algorithm in terms of a relative discount d_c, the ratio of the discounted counts to the original counts:

d_c = \frac{c^*}{c}

Now that we have the intuition for the unigram case, let's smooth our Berkeley Restaurant Project bigrams. Figure 3.5 shows the add-one smoothed counts for the bigrams in Fig. 3.1.

          i     want   to    eat   chinese  food  lunch  spend
i         6     828    1     10    1        1     1      3
want      3     1      609   2     7        7     6      2
to        3     1      5     687   3        1     7      212
eat       1     1      3     1     17       3     43     1
chinese   2     1      1     1     1        83    2      1
food      16    1      16    1     2        5     1      1
lunch     3     1      1     1     1        2     1      1
spend     2     1      2     1     1        1     1      1

Figure 3.5 Add-one smoothed bigram counts for eight of the words (out of V = 1446) in the Berkeley Restaurant Project corpus of 9332 sentences. Previously-zero counts are in gray.
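A minimal sketch (not from the text) of the add-one unigram estimate in Eq. 3.20:

```python
from collections import Counter

def laplace_unigram_probs(tokens, vocab):
    """Add-one smoothed unigram probabilities P_Laplace(w) = (c_w + 1) / (N + V)."""
    counts = Counter(tokens)
    N, V = len(tokens), len(vocab)
    return {w: (counts[w] + 1) / (N + V) for w in vocab}

probs = laplace_unigram_probs("to be or not to be".split(),
                              {"to", "be", "or", "not", "if"})
print(round(probs["to"], 2), round(probs["if"], 2))  # 0.27 0.09 (unseen "if" gets mass)
```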
Figure 3.6 shows the add-one smoothed probabilities for the bigrams in Fig. 3.2. Recall that normal bigram probabilities are computed by normalizing each row of counts by the unigram count:

P(w_n \mid w_{n-1}) = \frac{C(w_{n-1} w_n)}{C(w_{n-1})}    (3.22)

For add-one smoothed bigram counts, we need to augment the unigram count by the number of total word types in the vocabulary V:

P^*_{\text{Laplace}}(w_n \mid w_{n-1}) = \frac{C(w_{n-1} w_n) + 1}{\sum_w (C(w_{n-1} w) + 1)} = \frac{C(w_{n-1} w_n) + 1}{C(w_{n-1}) + V}    (3.23)

Thus, each of the unigram counts given in the previous section will need to be augmented by V = 1446. The result is the smoothed bigram probabilities in Fig. 3.6.

          i        want     to       eat      chinese  food     lunch    spend
i         0.0015   0.21     0.00025  0.0025   0.00025  0.00025  0.00025  0.00075
want      0.0013   0.00042  0.26     0.00084  0.0029   0.0029   0.0025   0.00084
to        0.00078  0.00026  0.0013   0.18     0.00078  0.00026  0.0018   0.055
eat       0.00046  0.00046  0.0014   0.00046  0.0078   0.0014   0.02     0.00046
chinese   0.0012   0.00062  0.00062  0.00062  0.00062  0.052    0.0012   0.00062
food      0.0063   0.00039  0.0063   0.00039  0.00079  0.002    0.00039  0.00039
lunch     0.0017   0.00056  0.00056  0.00056  0.00056  0.0011   0.00056  0.00056
spend     0.0012   0.00058  0.0012   0.00058  0.00058  0.00058  0.00058  0.00058

Figure 3.6 Add-one smoothed bigram probabilities for eight of the words (out of V = 1446) in the BeRP corpus of 9332 sentences. Previously-zero probabilities are in gray.

It is often convenient to reconstruct the count matrix so we can see how much a smoothing algorithm has changed the original counts. These adjusted counts can be computed by Eq. 3.24. Figure 3.7 shows the reconstructed counts.

c^*(w_{n-1} w_n) = \frac{[C(w_{n-1} w_n) + 1] \times C(w_{n-1})}{C(w_{n-1}) + V}    (3.24)

          i      want   to     eat    chinese  food   lunch  spend
i         3.8    527    0.64   6.4    0.64     0.64   0.64   1.9
want      1.2    0.39   238    0.78   2.7      2.7    2.3    0.78
to        1.9    0.63   3.1    430    1.9      0.63   4.4    133
eat       0.34   0.34   1      0.34   5.8      1      15     0.34
chinese   0.2    0.098  0.098  0.098  0.098    8.2    0.2    0.098
food      6.9    0.43   6.9    0.43   0.86     2.2    0.43   0.43
lunch     0.57   0.19   0.19   0.19   0.19     0.38   0.19   0.19
spend     0.32   0.16   0.32   0.16   0.16     0.16   0.16   0.16

Figure 3.7 Add-one reconstituted counts for eight words (of V = 1446) in the BeRP corpus of 9332 sentences. Previously-zero counts are in gray.

Note that add-one smoothing has made a very big change to the counts. C(want to) changed from 609 to 238! We can see this in probability space as well: P(to|want) decreases from .66 in the unsmoothed case to .26 in the smoothed case. Looking at the discount d (the ratio between new and old counts) shows us how strikingly the counts for each prefix word have been reduced; the discount for the bigram want to is .39, while the discount for Chinese food is .10, a factor of 10! The sharp change in counts and probabilities occurs because too much probability mass is moved to all the zeros.

3.4.2 Add-k smoothing

One alternative to add-one smoothing is to move a bit less of the probability mass from the seen to the unseen events. Instead of adding 1 to each count, we add a fractional count k (.5? .05? .01?). This algorithm is therefore called add-k smoothing.

P^*_{\text{Add-k}}(w_n \mid w_{n-1}) = \frac{C(w_{n-1} w_n) + k}{C(w_{n-1}) + kV}    (3.25)

Add-k smoothing requires that we have a method for choosing k; this can be done, for example, by optimizing on a devset. Although add-k is useful for some tasks (including text classification), it turns out that it still doesn't work well for language modeling, generating counts with poor variances and often inappropriate discounts (Gale and Church, 1994).
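A hedged sketch of Eq. 3.25; the value of k used here is an illustrative placeholder, not a recommendation (it would normally be tuned on a devset).

```python
def add_k_bigram_prob(w, prev, bigram_counts, unigram_counts, V, k=0.05):
    """Add-k smoothed bigram probability (Eq. 3.25):
    P*(w | prev) = (C(prev w) + k) / (C(prev) + k * V)."""
    return (bigram_counts.get((prev, w), 0) + k) / (unigram_counts[prev] + k * V)

# Example with the Berkeley Restaurant Project counts quoted earlier (V = 1446):
print(round(add_k_bigram_prob("to", "want",
                              {("want", "to"): 608}, {"want": 927}, V=1446), 2))  # 0.61
```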
3.4.3 Backoff and Interpolation

The discounting we have been discussing so far can help solve the problem of zero frequency n-grams. But there is an additional source of knowledge we can draw on. If we are trying to compute P(w_n | w_{n-2} w_{n-1}) but we have no examples of a particular trigram w_{n-2} w_{n-1} w_n, we can instead estimate its probability by using the bigram probability P(w_n | w_{n-1}). Similarly, if we don't have counts to compute P(w_n | w_{n-1}), we can look to the unigram P(w_n). In other words, sometimes using less context is a good thing, helping to generalize more for contexts that the model hasn't learned much about.

There are two ways to use this n-gram "hierarchy". In backoff, we use the trigram if the evidence is sufficient, otherwise we use the bigram, otherwise the unigram. In other words, we only "back off" to a lower-order n-gram if we have zero evidence for a higher-order n-gram. By contrast, in interpolation, we always mix the probability estimates from all the n-gram estimators, weighing and combining the trigram, bigram, and unigram counts.

In simple linear interpolation, we combine different order n-grams by linearly interpolating all the models. Thus, we estimate the trigram probability P(w_n | w_{n-2} w_{n-1}) by mixing together the unigram, bigram, and trigram probabilities, each weighted by a λ:

\hat{P}(w_n \mid w_{n-2} w_{n-1}) = \lambda_1 P(w_n \mid w_{n-2} w_{n-1}) + \lambda_2 P(w_n \mid w_{n-1}) + \lambda_3 P(w_n)    (3.26)

such that the λs sum to 1:

\sum_i \lambda_i = 1    (3.27)

In a slightly more sophisticated version of linear interpolation, each λ weight is computed by conditioning on the context. This way, if we have particularly accurate counts for a particular bigram, we assume that the counts of the trigrams based on this bigram will be more trustworthy, so we can make the λs for those trigrams higher and thus give that trigram more weight in the interpolation. Equation 3.28 shows the equation for interpolation with context-conditioned weights:

\hat{P}(w_n \mid w_{n-2} w_{n-1}) = \lambda_1(w_{n-2:n-1}) P(w_n \mid w_{n-2} w_{n-1}) + \lambda_2(w_{n-2:n-1}) P(w_n \mid w_{n-1}) + \lambda_3(w_{n-2:n-1}) P(w_n)    (3.28)

How are these λ values set? Both the simple interpolation and conditional interpolation λs are learned from a held-out corpus. A held-out corpus is an additional training corpus that we use to set hyperparameters like these λ values, by choosing the λ values that maximize the likelihood of the held-out corpus. That is, we fix the n-gram probabilities and then search for the λ values that—when plugged into Eq. 3.26—give us the highest probability of the held-out set. There are various ways to find this optimal set of λs. One way is to use the EM algorithm, an iterative learning algorithm that converges on locally optimal λs (Jelinek and Mercer, 1980).

In a backoff n-gram model, if the n-gram we need has zero counts, we approximate it by backing off to the (N-1)-gram. We continue backing off until we reach a history that has some counts. In order for a backoff model to give a correct probability distribution, we have to discount the higher-order n-grams to save some probability mass for the lower order n-grams. Just as with add-one smoothing, if the higher-order n-grams aren't discounted and we just used the undiscounted MLE probability, then as soon as we replaced an n-gram which has zero probability with a lower-order n-gram, we would be adding probability mass, and the total probability assigned to all possible strings by the language model would be greater than 1!
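A minimal sketch of simple linear interpolation (Eq. 3.26). The three probability functions and the λ weights are assumptions supplied by the caller; in practice the weights would be tuned on a held-out set.

```python
def interpolated_prob(w, prev2, prev1, p_tri, p_bi, p_uni,
                      lambdas=(0.6, 0.3, 0.1)):
    """Linear interpolation of trigram, bigram, and unigram estimates.
    `p_tri`, `p_bi`, `p_uni` are assumed callables returning probabilities;
    `lambdas` are placeholder weights that must sum to 1."""
    l1, l2, l3 = lambdas
    return (l1 * p_tri(w, prev2, prev1)   # λ1 · P(w | prev2 prev1)
            + l2 * p_bi(w, prev1)         # λ2 · P(w | prev1)
            + l3 * p_uni(w))              # λ3 · P(w)
```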
In addition to this explicit discount factor, we'll need a function α to distribute this probability mass to the lower order n-grams. This kind of backoff with discounting is also called Katz backoff. In Katz backoff we rely on a discounted probability P* if we've seen this n-gram before (i.e., if we have non-zero counts). Otherwise, we recursively back off to the Katz probability for the shorter-history (N-1)-gram. The probability for a backoff n-gram P_BO is thus computed as follows:

P_{BO}(w_n \mid w_{n-N+1:n-1}) =
  P^*(w_n \mid w_{n-N+1:n-1}),                              if C(w_{n-N+1:n}) > 0
  \alpha(w_{n-N+1:n-1})\,P_{BO}(w_n \mid w_{n-N+2:n-1}),    otherwise.    (3.29)

Katz backoff is often combined with a smoothing method called Good-Turing. The combined Good-Turing backoff algorithm involves quite detailed computation for estimating the Good-Turing smoothing and the P* and α values.

3.5 Kneser-Ney Smoothing

One of the most commonly used and best performing n-gram smoothing methods is the interpolated Kneser-Ney algorithm (Kneser and Ney 1995, Chen and Goodman 1998). Kneser-Ney has its roots in a method called absolute discounting. Recall that discounting of the counts for frequent n-grams is necessary to save some probability mass for the smoothing algorithm to distribute to the unseen n-grams.

To see this, we can use a clever idea from Church and Gale (1991). Consider an n-gram that has count 4. We need to discount this count by some amount. But how much should we discount it? Church and Gale's clever idea was to look at a held-out corpus and just see what the count is for all those bigrams that had count 4 in the training set. They computed a bigram grammar from 22 million words of AP newswire and then checked the counts of each of these bigrams in another 22 million words. On average, a bigram that occurred 4 times in the first 22 million words occurred 3.23 times in the next 22 million words. Fig. 3.8 from Church and Gale (1991) shows these counts for bigrams with c from 0 to 9.

Bigram count in training set:   0          1      2     3     4     5     6     7     8     9
Bigram count in held-out set:   0.0000270  0.448  1.25  2.24  3.23  4.21  5.23  6.21  7.21  8.26

Figure 3.8 For all bigrams in 22 million words of AP newswire of count 0, 1, 2, ..., 9, the counts of these bigrams in a held-out corpus also of 22 million words.

Notice in Fig. 3.8 that except for the held-out counts for 0 and 1, all the other bigram counts in the held-out set could be estimated pretty well by just subtracting 0.75 from the count in the training set! Absolute discounting formalizes this intuition by subtracting a fixed (absolute) discount d from each count. The intuition is that since we have good estimates already for the very high counts, a small discount d won't affect them much. It will mainly modify the smaller counts, for which we don't necessarily trust the estimate anyway, and Fig. 3.8 suggests that in practice this discount is actually a good one for bigrams with counts 2 through 9. The equation for interpolated absolute discounting applied to bigrams:

P_{\text{AbsoluteDiscounting}}(w_i \mid w_{i-1}) = \frac{C(w_{i-1} w_i) - d}{\sum_v C(w_{i-1} v)} + \lambda(w_{i-1})\,P(w_i)    (3.30)

The first term is the discounted bigram, and the second term is the unigram with an interpolation weight λ. We could just set all the d values to .75, or we could keep a separate discount value of 0.5 for the bigrams with counts of 1.
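A hedged sketch of Eq. 3.30 under explicit assumptions: the unigram distribution and the interpolation weight λ(w_{i-1}) are supplied by the caller, the context word is assumed to have been seen in training, and a max(·, 0) guard is added to keep the discounted term non-negative.

```python
def absolute_discount_bigram(w, prev, bigram_counts, unigram_probs, lam, d=0.75):
    """Interpolated absolute discounting for bigrams (Eq. 3.30), using the 0.75
    discount suggested by Fig. 3.8. `bigram_counts` maps (prev, w) to counts;
    `unigram_probs[w]` and the weight function `lam(prev)` are assumed inputs."""
    context_total = sum(c for (w1, _), c in bigram_counts.items() if w1 == prev)
    discounted = max(bigram_counts.get((prev, w), 0) - d, 0) / context_total
    return discounted + lam(prev) * unigram_probs[w]
```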
Kneser-Ney discounting (Kneser and Ney, 1995) augments absolute discounting with a more sophisticated way to handle the lower-order unigram distribution. Consider the job of predicting the next word in this sentence, assuming we are interpolating a bigram and a unigram model:

I can't see without my reading ________ .

The word glasses seems much more likely to follow here than, say, the word Kong, so we'd like our unigram model to prefer glasses. But in fact it's Kong that is more common, since Hong Kong is a very frequent word. A standard unigram model will assign Kong a higher probability than glasses. We would like to capture the intuition that although Kong is frequent, it is mainly only frequent in the phrase Hong Kong, that is, after the word Hong. The word glasses has a much wider distribution.

In other words, instead of P(w), which answers the question "How likely is w?", we'd like to create a unigram model that we might call P_CONTINUATION, which answers the question "How likely is w to appear as a novel continuation?". How can we estimate this probability of seeing the word w as a novel continuation, in a new unseen context? The Kneser-Ney intuition is to base our estimate of P_CONTINUATION on the number of different contexts word w has appeared in, that is, the number of bigram types it completes. Every bigram type was a novel continuation the first time it was seen. We hypothesize that words that have appeared in more contexts in the past are more likely to appear in some new context as well. The number of times a word w appears as a novel continuation can be expressed as:

P_{\text{CONTINUATION}}(w) \propto |\{v : C(vw) > 0\}|    (3.31)

To turn this count into a probability, we normalize by the total number of word bigram types. In summary:

P_{\text{CONTINUATION}}(w) = \frac{|\{v : C(vw) > 0\}|}{|\{(u', w') : C(u'w') > 0\}|}    (3.32)

An equivalent formulation based on a different metaphor is to use the number of word types seen to precede w (Eq. 3.31 repeated):

P_{\text{CONTINUATION}}(w) \propto |\{v : C(vw) > 0\}|    (3.33)

normalized by the number of words preceding all words, as follows:

P_{\text{CONTINUATION}}(w) = \frac{|\{v : C(vw) > 0\}|}{\sum_{w'} |\{v : C(vw') > 0\}|}    (3.34)

A frequent word (Kong) occurring in only one context (Hong) will have a low continuation probability.

The final equation for Interpolated Kneser-Ney smoothing for bigrams is then:

P_{KN}(w_i \mid w_{i-1}) = \frac{\max(C(w_{i-1} w_i) - d, 0)}{C(w_{i-1})} + \lambda(w_{i-1})\,P_{\text{CONTINUATION}}(w_i)    (3.35)

The λ is a normalizing constant that is used to distribute the probability mass we've discounted:

\lambda(w_{i-1}) = \frac{d}{\sum_v C(w_{i-1} v)}\,|\{w : C(w_{i-1} w) > 0\}|    (3.36)

The first term, \frac{d}{\sum_v C(w_{i-1} v)}, is the normalized discount. The second term, |\{w : C(w_{i-1} w) > 0\}|, is the number of word types that can follow w_{i-1} or, equivalently, the number of word types that we discounted; in other words, the number of times we applied the normalized discount.
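A hedged sketch (not the original algorithm's code) of the continuation probability in Eq. 3.32, computed from a dict of bigram counts; the tiny count dictionary below is made up to echo the Hong Kong / glasses intuition.

```python
from collections import defaultdict

def continuation_probs(bigram_counts):
    """P_CONTINUATION(w) = (# distinct words preceding w) / (# bigram types)."""
    preceders = defaultdict(set)
    for (prev, w), c in bigram_counts.items():
        if c > 0:
            preceders[w].add(prev)
    total_bigram_types = sum(len(s) for s in preceders.values())
    return {w: len(s) / total_bigram_types for w, s in preceders.items()}

counts = {("Hong", "Kong"): 100, ("reading", "glasses"): 3,
          ("sun", "glasses"): 2, ("wine", "glasses"): 1}
pc = continuation_probs(counts)
print(pc["Kong"], pc["glasses"])  # 0.25 vs 0.75: glasses completes more bigram types
```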
The general recursive formulation is as follows:

P_{KN}(w_i \mid w_{i-n+1:i-1}) = \frac{\max(c_{KN}(w_{i-n+1:i}) - d, 0)}{\sum_v c_{KN}(w_{i-n+1:i-1}\,v)} + \lambda(w_{i-n+1:i-1})\,P_{KN}(w_i \mid w_{i-n+2:i-1})    (3.37)

where the definition of the count c_KN depends on whether we are counting the highest-order n-gram being interpolated (for example trigram if we are interpolating trigram, bigram, and unigram) or one of the lower-order n-grams (bigram or unigram if we are interpolating trigram, bigram, and unigram):

c_{KN}(\cdot) =
  count(\cdot)              for the highest order
  continuationcount(\cdot)  for lower orders    (3.38)

The continuation count is the number of unique single word contexts for ·. At the termination of the recursion, unigrams are interpolated with the uniform distribution, where the parameter ε is the empty string:

P_{KN}(w) = \frac{\max(c_{KN}(w) - d, 0)}{\sum_{w'} c_{KN}(w')} + \lambda(\varepsilon)\,\frac{1}{V}    (3.39)

If we want to include an unknown word <UNK>, it's just included as a regular vocabulary entry with count zero, and hence its probability will be a lambda-weighted uniform distribution λ(ε)/V.

The best performing version of Kneser-Ney smoothing is called modified Kneser-Ney smoothing, and is due to Chen and Goodman (1998). Rather than use a single fixed discount d, modified Kneser-Ney uses three different discounts d_1, d_2, and d_{3+} for n-grams with counts of 1, 2 and three or more, respectively. See Chen and Goodman (1998, p. 19) or Heafield et al. (2013) for the details.

3.6 Huge Language Models and Stupid Backoff

By using text from the web or other enormous collections, it is possible to build extremely large language models. The Web 1 Trillion 5-gram corpus released by Google includes various large sets of n-grams, including 1-grams through 5-grams from all the five-word sequences that appear at least 40 times in 1,024,908,267,229 words of text from publicly accessible Web pages in English (Franz and Brants, 2006). Google has also released Google Books Ngrams corpora with n-grams drawn from their book collections, including another 800 billion tokens of n-grams from Chinese, English, French, German, Hebrew, Italian, Russian, and Spanish (Lin et al., 2012a). Smaller but more carefully curated n-gram corpora for English include the million most frequent n-grams drawn from the COCA (Corpus of Contemporary American English) 1 billion word corpus of American English (Davies, 2020). COCA is a balanced corpus, meaning that it has roughly equal numbers of words from different genres: web, newspapers, spoken conversation transcripts, fiction, and so on, drawn from the period 1990-2019, and has the context of each n-gram as well as labels for genre and provenance.

Some example 4-grams from the Google Web corpus:

4-gram                        Count
serve as the incoming         92
serve as the incubator        99
serve as the independent      794
serve as the index            223
serve as the indication       72
serve as the indicator        120
serve as the indicators       45

Efficiency considerations are important when building language models that use such large sets of n-grams. Rather than store each word as a string, it is generally represented in memory as a 64-bit hash number, with the words themselves stored on disk. Probabilities are generally quantized using only 4-8 bits (instead of 8-byte floats), and n-grams are stored in reverse tries.
N-grams can also be shrunk by pruning, for example only storing n-grams with counts greater than some threshold (such as the count threshold of 40 used for the Google n-gram release) or using entropy to prune less-important n-grams (Stolcke, 1998). Another option is to build approximate language models using techniques like Bloom filters (Talbot and Osborne 2007, Church et al. 2007). Finally, efficient language model toolkits like KenLM (Heafield 2011, Heafield et al. 2013) use sorted arrays, efficiently combine probabilities and backoffs in a single value, and use merge sorts to efficiently build the probability tables in a minimal number of passes through a large corpus.

Although with these toolkits it is possible to build web-scale language models using full Kneser-Ney smoothing, Brants et al. (2007) show that with very large language models a much simpler algorithm may be sufficient. The algorithm is called stupid backoff. Stupid backoff gives up the idea of trying to make the language model a true probability distribution. There is no discounting of the higher-order probabilities. If a higher-order n-gram has a zero count, we simply backoff to a lower order n-gram, weighed by a fixed (context-independent) weight. This algorithm does not produce a probability distribution, so we'll follow Brants et al. (2007) in referring to it as S:

S(w_i \mid w_{i-k+1:i-1}) =
  \frac{count(w_{i-k+1:i})}{count(w_{i-k+1:i-1})},    if count(w_{i-k+1:i}) > 0
  \lambda\,S(w_i \mid w_{i-k+2:i-1}),                 otherwise    (3.40)

The backoff terminates in the unigram, which has probability S(w) = count(w)/N. Brants et al. (2007) find that a value of 0.4 worked well for λ.

3.7 Advanced: Perplexity's Relation to Entropy

We introduced perplexity in Section 3.2.1 as a way to evaluate n-gram models on a test set. A better n-gram model is one that assigns a higher probability to the test data, and perplexity is a normalized version of the probability of the test set. The perplexity measure actually arises from the information-theoretic concept of cross-entropy, which explains otherwise mysterious properties of perplexity (why the inverse probability, for example?) and its relationship to entropy. Entropy is a measure of information. Given a random variable X ranging over whatever we are predicting (words, letters, parts of speech, the set of which we'll call χ) and with a particular probability function, call it p(x), the entropy of the random variable X is:

H(X) = -\sum_{x \in \chi} p(x) \log_2 p(x)    (3.41)

The log can, in principle, be computed in any base. If we use log base 2, the resulting value of entropy will be measured in bits. One intuitive way to think about entropy is as a lower bound on the number of bits it would take to encode a certain decision or piece of information in the optimal coding scheme.

Consider an example from the standard information theory textbook Cover and Thomas (1991). Imagine that we want to place a bet on a horse race but it is too far to go all the way to Yonkers Racetrack, so we'd like to send a short message to the bookie to tell him which of the eight horses to bet on. One way to encode this message is just to use the binary representation of the horse's number as the code; thus, horse 1 would be 001, horse 2 010, horse 3 011, and so on, with horse 8 coded as 000. If we spend the whole day betting and each horse is coded with 3 bits, on average we would be sending 3 bits per race. Can we do better?
Suppose that the spread is the actual distribution of the bets placed and that we represent it as the prior probability of each horse as follows:

Horse 1: 1/2    Horse 2: 1/4    Horse 3: 1/8    Horse 4: 1/16
Horse 5: 1/64   Horse 6: 1/64   Horse 7: 1/64   Horse 8: 1/64

The entropy of the random variable X that ranges over horses gives us a lower bound on the number of bits and is

H(X) = -\sum_{i=1}^{8} p(i) \log p(i) = -\tfrac{1}{2}\log\tfrac{1}{2} - \tfrac{1}{4}\log\tfrac{1}{4} - \tfrac{1}{8}\log\tfrac{1}{8} - \tfrac{1}{16}\log\tfrac{1}{16} - 4\left(\tfrac{1}{64}\log\tfrac{1}{64}\right) = 2 \text{ bits}    (3.42)

A code that averages 2 bits per race can be built with short encodings for more probable horses, and longer encodings for less probable horses. For example, we could encode the most likely horse with the code 0, and the remaining horses as 10, then 110, 1110, 111100, 111101, 111110, and 111111.

What if the horses are equally likely? We saw above that if we used an equal-length binary code for the horse numbers, each horse took 3 bits to code, so the average was 3. Is the entropy the same? In this case each horse would have a probability of 1/8. The entropy of the choice of horses is then

H(X) = -\sum_{i=1}^{8} \tfrac{1}{8} \log \tfrac{1}{8} = -\log \tfrac{1}{8} = 3 \text{ bits}    (3.43)

Until now we have been computing the entropy of a single variable. But most of what we will use entropy for involves sequences. For a grammar, for example, we will be computing the entropy of some sequence of words W = {w_0, w_1, w_2, ..., w_n}. One way to do this is to have a variable that ranges over sequences of words. For example we can compute the entropy of a random variable that ranges over all finite sequences of words of length n in some language L as follows:

H(w_1, w_2, \ldots, w_n) = -\sum_{W_1^n \in L} p(W_1^n) \log p(W_1^n)    (3.44)

We could define the entropy rate (we could also think of this as the per-word entropy) as the entropy of this sequence divided by the number of words:

\frac{1}{n} H(W_1^n) = -\frac{1}{n} \sum_{W_1^n \in L} p(W_1^n) \log p(W_1^n)    (3.45)

But to measure the true entropy of a language, we need to consider sequences of infinite length. If we think of a language as a stochastic process L that produces a sequence of words, and allow W to represent the sequence of words w_1, ..., w_n, then L's entropy rate H(L) is defined as

H(L) = \lim_{n \to \infty} \frac{1}{n} H(w_1, w_2, \ldots, w_n) = -\lim_{n \to \infty} \frac{1}{n} \sum_{W \in L} p(w_1, \ldots, w_n) \log p(w_1, \ldots, w_n)    (3.46)

The Shannon-McMillan-Breiman theorem (Algoet and Cover 1988, Cover and Thomas 1991) states that if the language is regular in certain ways (to be exact, if it is both stationary and ergodic),

H(L) = \lim_{n \to \infty} -\frac{1}{n} \log p(w_1 w_2 \ldots w_n)    (3.47)

That is, we can take a single sequence that is long enough instead of summing over all possible sequences. The intuition of the Shannon-McMillan-Breiman theorem is that a long-enough sequence of words will contain in it many other shorter sequences and that each of these shorter sequences will reoccur in the longer sequence according to their probabilities.

A stochastic process is said to be stationary if the probabilities it assigns to a sequence are invariant with respect to shifts in the time index. In other words, the probability distribution for words at time t is the same as the probability distribution at time t + 1. Markov models, and hence n-grams, are stationary. For example, in a bigram, P_i is dependent only on P_{i-1}. So if we shift our time index by x, P_{i+x} is still dependent on P_{i+x-1}.
But natural language is not stationary, since as we show in Chapter 12, the probability of upcoming words can be dependent on events that were arbitrarily distant and time dependent. Thus, our statistical models only give an approximation to the correct distributions and entropies of natural language. To summarize, by making some incorrect but convenient simplifying assumptions, we can compute the entropy of some stochastic process by taking a very long sample of the output and computing its average log probability.

Now we are ready to introduce cross-entropy. The cross-entropy is useful when we don't know the actual probability distribution p that generated some data. It allows us to use some m, which is a model of p (i.e., an approximation to p). The cross-entropy of m on p is defined by

H(p, m) = \lim_{n \to \infty} -\frac{1}{n} \sum_{W \in L} p(w_1, \ldots, w_n) \log m(w_1, \ldots, w_n)    (3.48)

That is, we draw sequences according to the probability distribution p, but sum the log of their probabilities according to m. Again, following the Shannon-McMillan-Breiman theorem, for a stationary ergodic process:

H(p, m) = \lim_{n \to \infty} -\frac{1}{n} \log m(w_1 w_2 \ldots w_n)    (3.49)

This means that, as for entropy, we can estimate the cross-entropy of a model m on some distribution p by taking a single sequence that is long enough instead of summing over all possible sequences.

What makes the cross-entropy useful is that the cross-entropy H(p, m) is an upper bound on the entropy H(p). For any model m:

H(p) \leq H(p, m)    (3.50)

This means that we can use some simplified model m to help estimate the true entropy of a sequence of symbols drawn according to probability p. The more accurate m is, the closer the cross-entropy H(p, m) will be to the true entropy H(p). Thus, the difference between H(p, m) and H(p) is a measure of how accurate a model is. Between two models m1 and m2, the more accurate model will be the one with the lower cross-entropy. (The cross-entropy can never be lower than the true entropy, so a model cannot err by underestimating the true entropy.)

We are finally ready to see the relation between perplexity and cross-entropy as we saw it in Eq. 3.49. Cross-entropy is defined in the limit as the length of the observed word sequence goes to infinity. We will need an approximation to cross-entropy, relying on a (sufficiently long) sequence of fixed length. This approximation to the cross-entropy of a model M = P(w_i | w_{i-N+1} ... w_{i-1}) on a sequence of words W is

H(W) = -\frac{1}{N} \log P(w_1 w_2 \ldots w_N)    (3.51)

The perplexity of a model P on a sequence of words W is now formally defined as 2 raised to the power of this cross-entropy:

Perplexity(W) = 2^{H(W)}
              = P(w_1 w_2 \ldots w_N)^{-1/N}
              = \sqrt[N]{\frac{1}{P(w_1 w_2 \ldots w_N)}}
              = \sqrt[N]{\prod_{i=1}^{N} \frac{1}{P(w_i \mid w_1 \ldots w_{i-1})}}    (3.52)

3.8 Summary

This chapter introduced language modeling and the n-gram, one of the most widely used tools in language processing.
• Language models offer a way to assign a probability to a sentence or other sequence of words, and to predict a word from preceding words.
• n-grams are Markov models that estimate words from a fixed window of previous words. n-gram probabilities can be estimated by counting in a corpus and normalizing (the maximum likelihood estimate).
• n-gram language models are evaluated extrinsically in some task, or intrinsically using perplexity.
• The perplexity of a test set according to a language model is the geometric mean of the inverse test set probability computed by the model.
• Smoothing algorithms provide a more sophisticated way to estimate the probability of n-grams. Commonly used smoothing algorithms for n-grams rely on lower-order n-gram counts through backoff or interpolation.
• Both backoff and interpolation require discounting to create a probability distribution.
• Kneser-Ney smoothing makes use of the probability of a word being a novel continuation. The interpolated Kneser-Ney smoothing algorithm mixes a discounted probability with a lower-order continuation probability.

Bibliographical and Historical Notes

The underlying mathematics of the n-gram was first proposed by Markov (1913), who used what are now called Markov chains (bigrams and trigrams) to predict whether an upcoming letter in Pushkin's Eugene Onegin would be a vowel or a consonant. Markov classified 20,000 letters as V or C and computed the bigram and trigram probability that a given letter would be a vowel given the previous one or two letters. Shannon (1948) applied n-grams to compute approximations to English word sequences. Based on Shannon's work, Markov models were commonly used in engineering, linguistic, and psychological work on modeling word sequences by the 1950s. In a series of extremely influential papers starting with Chomsky (1956) and including Chomsky (1957) and Miller and Chomsky (1963), Noam Chomsky argued that "finite-state Markov processes", while a possibly useful engineering heuristic, were incapable of being a complete cognitive model of human grammatical knowledge.

CSE4022 Natural Language Processing
Topic: Machine Translation
Dr. Tulasi Prasad Sariki, SCSE, VIT Chennai Campus
www.learnersdesk.weebly.com

Contents
● History of Machine Translation
● Machine Translation: Where are we today?
● How Does MT Work?
● Core Challenges of MT
● Rule-based vs. Data-driven Approaches to MT
● Statistical MT (SMT)
● Major Sources of Translation Problems
● Speech to speech translation

History of Machine Translation
● 1946: MT is one of the 1st conceived applications of modern computers (Alan Turing)
● 1954: The "Georgetown Experiment" demonstrations of Russian-English MT
● Late 1950s and early 1960s: MT fails to scale up to "real" systems
● 1966: ALPAC Report: MT recognized as an extremely difficult, "AI-complete" problem. Funding disappears
● 1968: SYSTRAN founded
● 1985: CMU "Center for Machine Translation" (CMT) founded
● Late 1980s and early 1990s: Field dominated by rule-based approaches – KBMT, KANT, Eurotra, etc.
● 1992: "Noisy Channel" Statistical MT models invented by IBM (CANDIDE)
● Mid 1990s: First major DARPA MT Program. PANGLOSS
● Late 1990s: Major Speech-to-Speech MT demonstrations: C-STAR
● 1999: JHU Summer Workshop results in GIZA
● 2000s: Large DARPA Funding Programs – TIDES and GALE
● 2003: Och et al introduce Phrase-based SMT. PHARAOH
● 2006: Google Translate is launched
● 2007: Koehn et al release MOSES
● 2008: a text/SMS translation service for mobiles in Japan
● 2009: mobile phone with built-in speech-to-speech translation facility for English and Japanese
● 2012: Google announced that Google Translate

MT: Where are we today?
● Age of Internet & Globalization – great demand for translation services and MT
● Multiple official languages of UN, EU, Canada, etc.
● Commercial demand from increasing number of global enterprises (Microsoft, IBM, Intel, Apple, E-bay, Amazon, etc.)
• The language and translation services business sector was estimated at $15 billion worldwide in 2008 and is growing at a healthy pace
• Economic incentive and demand are still focused primarily on G-8 languages, but growing in emerging markets (BRIC: Brazil, Russia, India, China), Arabic, and more…

MT: Where are we today?
• Some fairly decent commercial products are in the market for these language pairs, primarily the product of rule-based systems after many years of development
• A new generation of data-driven "statistical" MT: Google, Microsoft, Language Weaver
• Web-based (mostly free) MT services: Google, Babelfish, and others
• Pervasive MT between many language pairs is still non-existent, but Google is trying to change that!

How Does MT Work?
All modern MT approaches are based on building translations for complete sentences by putting together smaller pieces of translation.
Core questions:
• What are these smaller pieces of translation? Where do they come from?
• How does MT put these pieces together?
• How does the MT system pick the correct (or best) translation among many options?

Core Challenges of MT
Ambiguity and language divergences:
• Human languages are highly ambiguous, and differently so in different languages
• Ambiguity exists at all "levels": lexical, syntactic, semantic, language-specific constructions and idioms
Amount of required knowledge:
• Translation equivalencies for vast vocabularies
• Syntactic knowledge (how to map the syntax of one language to another), plus more complex language divergences (semantic differences, constructions and idioms, etc.)
• How do you acquire and construct a knowledge base that big that is (even mostly) correct and consistent?

Rule-based vs. Data-driven Approaches to MT
What are the pieces of translation? Where do they come from?
• Rule-based: large-scale "clean" word translation lexicons, manually constructed over time by experts
• Data-driven: broad-coverage word and multi-word translation lexicons, learned automatically from available sentence-parallel corpora
How does MT put these pieces together?
• Rule-based: large collections of rules, manually developed over time by human experts, that map structures from the source to the target language
• Data-driven: a computer algorithm that explores millions of possible ways of putting the small pieces together, looking for the translation that statistically looks best

Rule-based vs. Data-driven Approaches to MT
How does the MT system pick the correct (or best) translation among many options?
• Rule-based: human experts encode preferences among the rules, designed to prefer the creation of better translations
• Data-driven: a variety of fitness and preference scores, many of which can be learned from available training data, are used to model a total score for each of the millions of possible translation candidates; the algorithm then selects and outputs the best-scoring translation (a toy scoring sketch appears after the next slide)

Rule-based vs. Data-driven Approaches to MT
Why have the data-driven approaches become so popular?
• Increasing amounts of sentence-parallel data are constantly being created on the web
• Advances in machine learning algorithms
• The computational power of today's computers can train systems on these massive amounts of data and can perform the massive search-based translation computations needed when translating new texts
• Building and maintaining rule-based systems is too difficult, expensive and time-consuming
• In many scenarios, data-driven MT actually works better!
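As a rough illustration of the data-driven answer to "how does the system pick the best translation?", the sketch below scores a handful of candidate translations with a weighted combination of a translation-model score and a language-model score and outputs the best one. The candidates, scores and weights are all invented for illustration; a real SMT decoder scores millions of candidates with many more features, and the weights would be tuned automatically (e.g., by MERT, discussed below) rather than set by hand.

```python
# Hypothetical candidate translations with made-up model scores (log-probabilities).
# tm = translation-model score (how well the candidate matches the source),
# lm = language-model score (how fluent the target string is).
candidates = [
    ("house the is small", {"tm": -1.2, "lm": -9.5}),
    ("the house is small", {"tm": -1.4, "lm": -3.1}),
    ("the home is little", {"tm": -2.6, "lm": -4.0}),
]

# Feature weights; in a real system these would be learned on development data.
weights = {"tm": 1.0, "lm": 0.8}

def total_score(features):
    """Weighted combination of the fitness scores for one candidate."""
    return sum(weights[name] * value for name, value in features.items())

best_translation, best_features = max(candidates, key=lambda c: total_score(c[1]))
print("best:", best_translation, "score:", round(total_score(best_features), 2))
```

The point is only the pattern: assign every candidate a total fitness score and output the argmax.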
Statistical MT (SMT)
• Data-driven; the most dominant approach in current MT research
• Proposed by IBM in the early 1990s: a direct, purely statistical model for MT
• Evolved from word-level translation to phrase-based translation
Main ideas:
• Training: statistical "models" of word and phrase translation equivalence are learned automatically from bilingual parallel sentences, creating a bilingual "database" of translations
• Decoding: new sentences are translated by a program (the decoder), which matches the source words and phrases with the database of translations and searches the "space" of all possible translation combinations

Statistical MT (SMT)
Main steps in training phrase-based statistical MT:
• Create a sentence-aligned parallel corpus
• Word alignment: train word-level alignment models (GIZA++)
• Phrase extraction: extract phrase-to-phrase translation correspondences using heuristics (Moses); a simplified sketch of this step appears a few slides below
• Minimum Error Rate Training (MERT): optimize translation system parameters on development data to achieve the best translation performance
Attractive: completely automatic, no manual rules, much reduced manual labor

Statistical MT (SMT)
Main drawbacks:
• Translation accuracy levels vary widely
• Effective only with large volumes (several mega-words) of parallel text
• Broad domain, but domain-sensitive
• Viable only for a limited number of language pairs!
Still, impressive progress in the last 5-10 years!

Statistical MT: Major Challenges
Current approaches are too naïve and "direct":
• Good at learning word-to-word and phrase-to-phrase correspondences from data
• Not good enough at learning how to combine these pieces and reorder them properly during translation
• Learning general rules requires much more complicated algorithms and computer processing of the data
• The space of translations that is "searched" often doesn't contain a perfect translation
• The fitness scores that are used aren't good enough to always assign better scores to the better translations; we don't always find the best translation even when it's there!
• MERT is brittle, problematic and metric-dependent!

Statistical MT: Major Challenges
Solutions:
• Google's solution: more and more data!
• Research solution: "smarter" algorithms and learning methods

Rule-based vs. Data-driven MT
Sample outputs for the same source text (the slide contrasts a rule-based system with a data-driven system):
• "We thank all participants of the whole world for their comical and creative drawings; to choose the victors was not easy task!"
• "We thank all the participants from around the world for their designs cocasses and creative; selecting winners was not easy!"
• "Click here to see work of winning European of these two months, and use it to look at what the winning of USA sent us."
• "Click here to see the artwork of winners European of these two months, and disclosure to look at what the winners of the US have been sending."

Major Sources of Translation Problems
• Lexical differences: multiple possible translations for an SL word, or difficulties expressing the SL word's meaning in a single TL word
• Structural differences: the syntax of the SL is different from the syntax of the TL: word order, sentence and constituent structure
• Differences in mappings of syntax to semantics: meaning in the TL is conveyed using a different syntactic structure than in the SL
• Idioms and constructions
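Before turning to how these challenges are tackled, here is the simplified phrase-extraction sketch promised in the SMT training steps above: given a single (invented) word-aligned sentence pair, it keeps exactly those source/target span pairs whose alignment links are consistent with each other. Real toolkits such as Moses use a more complete version of this heuristic (including handling of unaligned words), so treat this purely as an illustration of the idea.

```python
# A hypothetical word-aligned sentence pair (indices into the two token lists).
src = ["das", "haus", "ist", "klein"]
tgt = ["the", "house", "is", "small"]
alignment = {(0, 0), (1, 1), (2, 2), (3, 3)}   # (src_index, tgt_index) links

def extract_phrases(src, tgt, alignment, max_len=3):
    """Very simplified consistent-phrase extraction (no unaligned-word extension)."""
    phrases = []
    for s1 in range(len(src)):
        for s2 in range(s1, min(s1 + max_len, len(src))):
            # Target positions linked to the source span [s1, s2].
            tgt_points = [j for (i, j) in alignment if s1 <= i <= s2]
            if not tgt_points:
                continue
            t1, t2 = min(tgt_points), max(tgt_points)
            # Consistency: no alignment link may connect a word inside the
            # target span [t1, t2] to a word outside the source span [s1, s2].
            if any(t1 <= j <= t2 and not (s1 <= i <= s2) for (i, j) in alignment):
                continue
            if t2 - t1 < max_len:
                phrases.append((" ".join(src[s1:s2 + 1]), " ".join(tgt[t1:t2 + 1])))
    return phrases

for src_phrase, tgt_phrase in extract_phrases(src, tgt, alignment):
    print(f"{src_phrase!r:25} -> {tgt_phrase!r}")
```

Counting phrase pairs like these over millions of aligned sentence pairs is what produces the bilingual "database" of translations that the decoder later searches.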
How to Tackle the Core Challenges
• Manual labor: thousands of person-years of human experts developing large word and phrase translation lexicons and translation rules. Example: Systran's RBMT systems.
• Lots of parallel data: data-driven approaches for finding word and phrase correspondences automatically from large amounts of sentence-aligned parallel texts. Example: statistical MT systems.
• Learning approaches: learn translation rules automatically from small amounts of human-translated and word-aligned data. Example: AVENUE's Statistical XFER approach.
• Simplify the problem: build systems that are limited-domain or constrained in other ways. Examples: CATALYST, NESPOLE!.

State-of-the-Art in MT
What users want:
• General purpose (any text)
• High quality (human level)
• Fully automatic (no user intervention)
We can meet any 2 of these 3 goals today, but not all three at once:
• FA + HQ: Knowledge-Based MT (KBMT)
• FA + GP: Corpus-Based (Example-Based) MT
• GP + HQ: Human-in-the-loop (post-editing)

Types of MT Applications
• Assimilation: multiple source languages, uncontrolled style/topic. General-purpose MT, no customization. (Google Translate)
• Dissemination: one source language, controlled style, single topic/domain. Customized RBMT or SMT. (Safaba)
• Communication: lower quality may be okay, but system robustness and real-time operation are required. (Jibbigo)

Approaches to MT: The Vauquois MT Triangle
[Diagram: the Vauquois triangle, with Direct translation at the base, Transfer in the middle, and Interlingua at the apex, with Analysis on the source side and Generation on the target side. It is illustrated with the Italian sentence "Mi chiamo Alon Lavie" translated as "My name is Alon Lavie"; the interlingua representation is Give-information+personal-data (name=alon_lavie), and the transfer level maps a source parse of the form [s [vp accusative_pronoun verb proper_name]] to a target parse [s [np [possessive_pronoun "name"]] [vp "be" proper_name]].]

Direct Approaches
• No intermediate stage in the translation
• The first MT systems, developed in the 1950s-60s (assembly-code programs): morphology, bilingual dictionary lookup, local reordering rules; "word-for-word, with some local word-order adjustments"
• Modern approaches: phrase-based statistical MT (SMT), example-based MT (EBMT)

EBMT Paradigm
New sentence (source): "Yesterday, 200 delegates met with President Clinton."
Matches to source found:
• "Yesterday, 200 delegates met behind closed doors…" → "Gestern trafen sich 200 Abgeordnete hinter verschlossenen…"
• "Difficulties with President Clinton…" → "Schwierigkeiten mit Praesident Clinton…"
Alignment (sub-sentential):
• "Yesterday, 200 delegates met behind closed doors…" → "Gestern trafen sich 200 Abgeordnete hinter verschlossenen…"
• "Difficulties with President Clinton over…" → "Schwierigkeiten mit Praesident Clinton…"
Translated sentence (target): "Gestern trafen sich 200 Abgeordnete mit Praesident Clinton."
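The EBMT example above can be imitated with a very small sketch: match sub-sentential fragments of the new input against a tiny "example base" of aligned fragment pairs and stitch the corresponding target fragments together. The two fragment pairs below are the ones suggested by the slide's alignment step; the greedy matcher and the toy preprocessing are invented simplifications, and real EBMT systems use far more sophisticated matching and recombination.

```python
# Tiny example base of aligned sub-sentential fragments (English -> German),
# taken from the EBMT example above.
fragment_pairs = [
    ("yesterday , 200 delegates met", "gestern trafen sich 200 abgeordnete"),
    ("with president clinton", "mit praesident clinton"),
]

def ebmt_translate(sentence, fragment_pairs):
    """Greedy left-to-right matching of known source fragments."""
    tokens = sentence.lower().replace(",", " ,").replace(".", "").split()
    output, i = [], 0
    while i < len(tokens):
        for src, tgt in fragment_pairs:
            src_tokens = src.split()
            if tokens[i:i + len(src_tokens)] == src_tokens:
                output.append(tgt)          # emit the aligned target fragment
                i += len(src_tokens)
                break
        else:
            output.append(f"<{tokens[i]}>")  # no match found: pass the word through
            i += 1
    return " ".join(output)

print(ebmt_translate("Yesterday, 200 delegates met with President Clinton.",
                     fragment_pairs))
# -> gestern trafen sich 200 abgeordnete mit praesident clinton
```

Running it on the source sentence from the slide reproduces the target sentence shown there.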
Analysis and Generation: Main Steps
Analysis:
• Morphological analysis (word-level) and POS tagging
• Syntactic analysis and disambiguation (produce a syntactic parse tree)
• Semantic analysis and disambiguation (produce symbolic frames or a logical-form representation)
• Map to a language-independent interlingua
Generation:
• Generate the semantic representation in the TL
• Sentence planning: generate the syntactic structure and lexical selections for concepts
• Surface-form realization: generate the correct forms of words

Transfer Approaches
Syntactic transfer:
• Analyze the SL input sentence into its syntactic structure (parse tree)
• Transfer the SL parse tree to a TL parse tree (various formalisms for the mappings)
• Generate the TL sentence from the TL parse tree
Semantic transfer:
• Analyze the SL input into a language-specific semantic representation (e.g., case frames, logical form)
• Transfer the SL semantic representation to a TL semantic representation
• Generate the syntactic structure and then the surface sentence in the TL

Transfer Approaches (Pros & Cons)
Syntactic transfer:
◦ No need for semantic analysis and generation
◦ Syntactic structures are general, not domain specific: less domain dependent, can handle open domains
◦ Requires a word translation lexicon
Semantic transfer:
◦ Requires deeper analysis and generation; a symbolic representation of concepts and predicates is difficult to construct for open or unlimited domains
◦ Can better handle non-compositional meaning structures, so it can be more accurate
◦ No word translation lexicon: generation in the TL proceeds from symbolic concepts

Knowledge-based Interlingual MT
The classic "deep" Artificial Intelligence approach:
• Analyze the source language into a detailed symbolic representation of its meaning
• Generate this meaning in the target language
"Interlingua": one single meaning representation for all languages
Nice in theory, but extremely difficult in practice:
• What kind of representation?
• What is the appropriate level of detail to represent?
• How to ensure that the interlingua is in fact universal?

Interlingua versus Transfer
With an interlingua, we need only N parsers/generators instead of N² transfer systems. For example, with six languages a transfer architecture needs 6 × 5 = 30 directed transfer systems, while an interlingua architecture needs only 6 analyzers and 6 generators.
[Diagram: six languages L1-L6 fully interconnected by pairwise transfer systems on the left, versus the same six languages each connected only to a central interlingua on the right.]

Multi-Engine MT
• Apply several MT engines to each input in parallel
• Create a combined translation from the individual translations
• Goal: combine strengths and avoid weaknesses, along all dimensions: domain limits, quality, development time/cost, run-time speed, etc.
• Various approaches to the problem

Speech-to-Speech MT
Speech just makes MT (much) more difficult:
• Spoken language is messier: false starts, filled pauses, repetitions, out-of-vocabulary words; lack of punctuation and explicit sentence boundaries
• Current speech technology is far from perfect: speech recognition and synthesis are needed in the foreign languages
• Robustness: MT quality degradation should be proportional to SR quality
• Tight integration: rather than separate sequential tasks, can SR + MT be integrated in ways that improve end-to-end performance?

END