Natural Language Processing (CSE4022)
Dr. Tulasi Prasad Sariki
SCOPE, VIT Chennai
ABOUT ME
Dr. Tulasi Prasad Sariki
PhD from VIT Chennai
M. Tech from JNTU, Hyderabad
B. Tech from JNTU, Hyderabad
Present: Associate Professor, SCOPE, VIT, Chennai
Past: Assistant Professor, School of Computing, SASTRA University
Natural Language Processing
Machine Learning
Data Science
Course Objectives
● To introduce the fundamental concepts and techniques of Natural Language Processing for analyzing words based on morphology and corpora.
● To examine NLP models and interpret algorithms for the classification of NLP sentences using both the traditional symbolic approach and the more recent statistical approach.
● To get acquainted with the algorithmic description of the main language levels, including morphology, syntax, semantics, and pragmatics, for information retrieval and machine translation applications.
Credits: Some of the slides borrowed from Dr. Diyi Yang at College of Computing - Georgia Tech
Expected Course Outcome
1. Understand the principles of, and process, human languages such as English and other Indian languages using computers.
2. Create corpus linguistics resources based on a text-corpus approach.
3. Demonstrate understanding of state-of-the-art algorithms and techniques for text-based processing of natural language with respect to morphology.
4. Perform POS tagging for a given natural language.
5. Select a suitable language modelling technique based on the structure of the language.
6. Check the syntactic and semantic correctness of sentences using grammars and labelling.
7. Develop computational methods for real-world applications and explore deep-learning-based NLP.
Mode of Evaluation
Theory:
● CAT-1
● CAT-2
● Quiz-1
● Quiz-2
● Digital Assignment
● Final Assessment Test (FAT)
Project:
● Review-1 : CAT-1 Time
● Review-2 : CAT-2 Time
● Review-3 : Before FAT
Books
● Text Book(s)
1. Daniel Jurafsky and James H. Martin, "Speech and Language Processing", 3rd edition, Prentice Hall, 2009.
● Reference Books
1. Chris Manning and Hinrich Schütze, "Foundations of Statistical Natural Language Processing", 2nd edition, MIT Press, Cambridge, MA, 2003.
2. Nitin Indurkhya and Fred J. Damerau, "Handbook of Natural Language Processing", 2nd edition, CRC Press, 2010.
3. James Allen, "Natural Language Understanding", Pearson, 8th edition, 2012.
Course Contents
01 Introduction to NLP
02 Text Processing
03 Morphology
04 Lexical Syntax
05 Language Modeling
06 Syntax & Semantics
07 Applications of NLP
01: Overview of Natural Language Processing
Outline
● Introduction to NLP
● Ambiguity
● Need for Natural Language Processing
● Natural Languages vs. Computer Languages
● Why Natural Language Processing?
● Linguistic Levels of Analysis
● Basic terms / terminology in NLP
● Different Tasks in NLP
Introduction
What is a language?
• A system for the expression of thoughts, feelings, etc., by the use of spoken sounds or conventional symbols. – Collins Dictionary
Why do we need language?
• The ability to speak and write in order to communicate is one of the most fundamental aspects of human behavior.
Communication With Machines
Conversational Agents
Conversational agents contain:
● Speech recognition
● Language analysis
● Dialogue processing
● Information retrieval
● Text to speech
Examples: Google Assistant, Apple Siri, Amazon Alexa, Microsoft Cortana
Introduction
● Computers would be a lot more useful if they could handle our email, do our library research, talk to us…
● But they are fazed by natural human language (ambiguity).
● How can we tell computers about language? (Or help them learn it as kids do?)
Natural Language Processing
Ambiguity
● "I saw the man on the hill with a telescope."
● List the reasonable interpretations. ("saw" here is the past tense of "see".)
What is Natural Language Processing?
The field of study that focuses on the interactions between human language and computers is called NLP.
[Figure: interaction between humans and computers]
Need for Natural Language Processing ?
• Huge amounts of data
• Applications for processing large amounts of texts require NLP expertise
Natural Language Processing ?
Natural Language?
• Refers to the language spoken by people, e.g. English, Telugu, Tamil, as
opposed to artificial languages, like C++, Java, etc.
Natural Language Processing
• Applications that deal with natural language in one way or another.
Computational Linguistics
• Doing linguistics on computers
• More on the linguistic side than NLP, but closely related
Natural Languages vs. Computer Languages
• Ambiguity is the primary difference between natural and computer
languages.
• Formal programming languages are designed to be unambiguous, i.e.
they can be defined by a grammar that produces a unique parse tree for
each sentence in the language.
Why Natural Language Processing?
• kJfmmfj mmmvvv nnnffn333
• Uj iheale eleee mnster vensi credur
• Baboi oi cestnitze
• Coovoel2^ ekk; ldsllk lkdf vnnjfj?
• Fgmflmllk mlfm kfre xnnn!
Computers Lack Knowledge!
• Computers "see" text in English the same way you have just seen the previous text!
• People have no trouble understanding language
– Common sense knowledge
– Reasoning capacity
– Experience
• Computers have
– No common sense knowledge
– No reasoning capacity
Where does it fit in the CS taxonomy?
[Figure: taxonomy – CS ⊃ AI ⊃ NLP, alongside Computational Linguistics]
Linguistic Levels of Analysis
Basic terms / terminology in NLP
• Token: Before any real processing can be done on the input text, it
needs to be segmented into linguistic units such as words,
punctuation, numbers or alphanumeric. These units are known as
tokens.
• Sentence: An ordered sequence of tokens.
• Tokenization: The process of splitting a sentence into its constituent tokens. Whitespace makes tokenization relatively easy (and uninteresting) in languages like English. However, for languages such as Chinese and Arabic the task is harder, since word boundaries are not explicitly marked (Chinese does not use spaces between words, and Arabic attaches clitics to their host words).
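As a minimal illustration (not part of the slides), whitespace splitting and a trained tokenizer can be compared using NLTK, assuming the library and its punkt data are installed:

```python
# Minimal tokenization sketch (assumes nltk is installed and punkt data is available).
import nltk

text = "Mr. Smith paid $2.50 for the U.S. edition; he didn't complain."

# Naive whitespace tokenization: punctuation stays glued to words.
whitespace_tokens = text.split()

# NLTK's word tokenizer separates punctuation and clitics such as "n't".
nltk_tokens = nltk.word_tokenize(text)

print(whitespace_tokens)
print(nltk_tokens)
```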
Basic terms / terminology in NLP
Corpus: A body of text, usually containing a large number of sentences.
Corpora: Collection of texts. Plural form of corpus.
Bilingual corpus: A collection of texts in which each text appears in two
languages.
Dialogue: Communicative linguistic activity in which at least two speakers or
agents participate.
n-gram : A sequence of n tokens.
Semantics: The study of linguistic meaning.
Basic terms / terminology in NLP
Part-of-speech (POS) Tag: A word can be classified into one or more of a set of lexical or part-of-speech categories such as nouns, verbs, adjectives, and articles.
A POS tag is a symbol representing such a lexical category – NN (Noun), VB (Verb), JJ (Adjective), AT (Article).
POS Tagging: Given a sentence and a set of POS tags, a common language processing task is to automatically assign a POS tag to each word in the sentence.
For example, given the sentence "The ball is red", the output of a POS tagger would be The/AT ball/NN is/VB red/JJ.
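A hedged sketch of automatic POS tagging with NLTK (the tool choice is an assumption of this example, not prescribed by the slides). NLTK's default tagger uses the finer-grained Penn Treebank tagset rather than the AT/NN/VB/JJ symbols above:

```python
# Minimal POS-tagging sketch (assumes nltk plus its punkt and tagger data are installed).
import nltk

sentence = "The ball is red"
tokens = nltk.word_tokenize(sentence)

# Penn Treebank tags: DT (determiner), NN (noun), VBZ (3rd-person verb), JJ (adjective).
print(nltk.pos_tag(tokens))
# Expected output: [('The', 'DT'), ('ball', 'NN'), ('is', 'VBZ'), ('red', 'JJ')]
```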
Basic terms / terminology in NLP
Parse Tree: A tree defined over a given sentence that represents the
syntactic structure of the sentence as defined by a formal grammar.
Computational Morphology: Natural languages consist of a very large
number of words that are built upon basic building blocks known as
morphemes (or stems), the smallest linguistic units possessing meaning.
Parsing: In the parsing task, a parser constructs the parse tree for a given
sentence. Some parsers assume the existence of a set of grammar rules in
order to parse but recent parsers are smart enough to deduce the parse
trees directly from the given data using complex statistical models.
Applications of NLP
Information Extraction (IE): Identify phrases in language that refer to specific types of entities and relations in text.
Named entity recognition is the task of identifying names of people, places, organizations, etc. in text:
  Michael Dell [person] is the CEO of Dell Computer Corporation [organization] and lives in Texas [place].
Relation extraction identifies specific relations between entities, e.g. in the same sentence: CEO-of(Michael Dell, Dell Computer Corporation), lives-in(Michael Dell, Texas).
Applications of NLP
Question Answering: Directly answer natural language questions
based on information presented in a corpus of textual documents
(e.g. the web).
When was Barack Obama born? (factoid)
August 4, 1961
Who was president when Barack Obama was born?
John F. Kennedy
How many presidents have there been since Barack Obama was born?
9
Applications of NLP
Text Summarization
Article:
With a split decision in the final two primaries and a flurry of super delegate
endorsements, Sen. Barack Obama sealed the Democratic presidential nomination last night after
a grueling and history-making campaign against Sen. Hillary Rodham Clinton that will make him
the first African American to head a major-party ticket. Before a chanting and cheering audience
in St. Paul, Minn., the first-term senator from Illinois savored what once seemed an unlikely
outcome to the Democratic race with a nod to the marathon that was ending and to what will be
another hard-fought battle, against Sen. John McCain, the presumptive Republican nominee….
Summary:
Senator Barack Obama was declared the presumptive Democratic presidential nominee.
Applications of NLP
Machine Translation (MT)
Translate a sentence from one natural language to another.
Hasta la vista, bebé ⇒ See you later, baby.
Information Retrieval
Applications of NLP
Text Categorization
Applications of NLP
Natural Language Interfaces
Spell Checking
Applications of NLP
Sentiment Analysis
Automatic Lyrics Generation
Applications of NLP
Plagiarism Detection
Speech Recognition
Why is NLP Hard?
● Contextual words and phrases and homonyms
● Synonyms
● Irony and sarcasm
● Ambiguity
● Errors in text or speech
● Colloquialisms and slang
● Domain-specific language
● Low-resource languages
● Lack of research and development
● Sparsity
Contextual words and phrases and homonyms
● I ran to the store because we ran out of milk.
● Can I run something past you real quick?
● The house is looking really run down.
Synonyms
● Synonyms can lead to issues similar to contextual understanding
because we use many different words to express the same idea.
● Some of these words may convey exactly the same meaning.
● Some may be levels of complexity (small, little, tiny, minute).
● Different people use synonyms to denote slightly different meanings
within their personal vocabulary.
Sarcasm
Ambiguity
Ambiguity in NLP refers to sentences and phrases that potentially have two or more possible interpretations.
● Lexical ambiguity: a word that could be used as a verb, noun, or adjective.
● Syntactic ambiguity: a sentence that can be parsed in more than one way (e.g. "I saw the man on the hill with a telescope").
● Semantic ambiguity: the interpretation of a sentence in context.
Errors in Text or Speech
● Misspelled or misused words can create problems for text analysis.
● Autocorrect and grammar correction applications can handle common mistakes, but don't always understand the writer's intention.
Colloquialisms and slang
Sparsity
● Regardless of how large our corpus is, there will be a lot of infrequent
words.
● This means we need to find clever ways to estimate probabilities for
things we have rarely or never seen.
Expressivity
● Not only can one form have different meanings (ambiguity), but the same meaning can be expressed with different forms:
  ○ She gave the book to Tom vs. She gave Tom the book
  ○ Some kids popped by vs. A few children visited
  ○ Is that window still open? vs. Please close the window
Fields connected with NLP
● Machine learning
● Linguistics (including psycho-, socio-, descriptive, and theoretical)
● Cognitive science
● Information theory
● Logic
● Data science
● Political science
● Psychology
● Economics
● Education
Factors Changing NLP Landscape
1. Increases in computing power
2. The rise of the web, then the social web
3. Advances in machine learning
4. Advances in understanding of language in social context
J Component – Instructions
● 2-page Project Proposal (10%)
● 4-page Midway Report (10%)
● 8-page Final Report (20%)
● Project Presentation (60%)
  ○ Review-1 (10%)
  ○ Review-2 (20%)
  ○ Review-3 (30%)
All submissions should be in double-column ACL 2022 format. [Link]
THANKS!
Do you have any questions?
tulasiprasad.sariki@vit.ac.in
www.learnersdesk.weebly.com
CSE4022
Natural Language Processing
Topic: Text Processing
Prof. Tulasi Prasad Sariki,
SCSE, VIT Chennai Campus
www.learnersdesk.weebly.com
Contents
◦ Text Processing
◦ Text Preprocessing
◦ Challenges in Text Preprocessing
◦ Types of Writing Systems
Text Processing
In the linguistic analysis of a digital natural language text, it is necessary to
clearly define the characters, words, and sentences in any document.
Text Preprocessing
The task of converting a raw text file, essentially a sequence of digital bits,
into a well-defined sequence of linguistically meaningful units:
◦ at the lowest level, characters representing the individual graphemes in a
language's written system,
◦ words consisting of one or more characters,
◦ sentences consisting of one or more words.
Text preprocessing is an essential part of any NLP system, since the
characters, words, and sentences identified at this stage are the
fundamental units passed to all further processing stages.
Text / Word segmentation
Text segmentation is the process of converting a well-defined text corpus
into its component words and sentences.
This is a very important task for work on the morphology and syntax levels of
NLP.
Word segmentation breaks up the sequence of characters in a text by
locating the word boundaries, the points where one word ends and
another begins.
For computational linguistics purposes, the words thus identified are
frequently referred to as tokens, and word segmentation is also known as
tokenization.
Sentence Segmentation / Text Normalization
Sentence segmentation is the process of identifying sentence boundaries
between words in different sentences.
Since most written languages have punctuation marks that occur at
sentence boundaries, sentence segmentation is frequently referred to as
sentence boundary detection or sentence boundary disambiguation.
Text normalization is a related step that involves merging different written
forms of a token into a canonical normalized form; for example, a
document may contain the equivalent tokens “Mr.”, “Mr”, “mister”, and
“Mister” that would all be normalized to a single form.
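A minimal text-normalization sketch in Python; the mapping table is an illustrative assumption, not a standard resource:

```python
# Minimal text-normalization sketch: map variant written forms of a token
# to one canonical form (the mapping table here is an illustrative assumption).
CANONICAL_FORMS = {"mr.": "Mr.", "mr": "Mr.", "mister": "Mr."}

def normalize_token(token):
    return CANONICAL_FORMS.get(token.lower(), token)

print([normalize_token(t) for t in ["Mr.", "Mr", "mister", "Mister", "Smith"]])
# ['Mr.', 'Mr.', 'Mr.', 'Mr.', 'Smith']
```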
Challenges of Text Preprocessing
The type of writing system (SCRIPT) used for a language is the most
important factor for determining the best approach to text preprocessing.
It needs:
◦ at least one set of defined base elements or symbols, individually termed
characters and collectively called a script;
◦ at least one set of rules and conventions (orthography) understood and
shared by a community, which arbitrarily assigns meaning to the base
elements (graphemes), their ordering and relations to one another;
◦ at least one language (generally spoken) whose constructions are represented
and able to be recalled by the interpretation of these elements and rules
Classification of Systems
Logographic writing systems
In a logographic writing system, in theory, each symbol (word or morpheme) represents one idea. Example: Chinese.
[Figure: Chinese characters glossed Friendship, Happiness, Fortune, Prosperity]
Logophonetic Writing Systems
Definition: there are two major types of signs, ones denoting morphemes and ones denoting sounds. Examples: Egyptian, Japanese, and Sumerian.
Abugida
South Asian scripts such as Brahmi and its descendants fit into both
syllabary and alphabet.
It is syllabic because the basic sign contains a consonant and a vowel.
Greek had CV, CVC, CCVC, CVCC syllable structures, so the Greeks invented a way to
break syllables down into consonants and vowels.
Alphabetic
A system of consonant and vowel symbols that, either individually or in
combinations, represent the speech sounds of a written language (ex)
English
Abjad or Consonantal Alphabet
alphabetic writing systems in which only the consonants in words are
written, and the vowels are left out (ex) Hebrew, Arabic
Character representation
How Characters in languages can be represented?
At its lowest level, a computer-based text or document is merely a
sequence of digital bits in a file.
The first essential task is to interpret these bits as characters of a
writing system of a natural language.
Unicode
Fundamentally, computers just deal with numbers. They store letters and
other characters by assigning a number for each one.
Before Unicode was invented, there were hundreds of different encoding
systems for assigning these numbers.
Even for a single language like English no single encoding was adequate
for all the letters, punctuation, and technical symbols in common use.
Unicode
These encoding systems also conflict with one another. That is, two
encodings can use the same number for two different characters, or use
different numbers for the same character.
Unicode covers all the characters for all the writing systems of the world,
modern and ancient. It also includes technical symbols, punctuations, and
many other characters used in writing text. The Unicode Standard is
intended to support the needs of all types of users, whether in business or
academia, using mainstream or minority scripts.
Types of Encoding
Two types of encoding:
Character encoding
◦ ASCII, ISCII, Unicode
Font encoding
◦ Eenadu, Vaartha, Kumudam, Daily Thanthi
ASCII Features
American Standard Code for Information Interchange
7-bit code
8th bit is unused (or used for a parity bit)
2^7 = 128 codes
Two general types of codes:
◦ 95 are “Graphic” codes (displayable on a console)
◦ 33 are “Control” codes (control features of the console or communications
channel)
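A quick Python illustration of the 7-bit ASCII range (illustrative example, not from the slides):

```python
# Quick illustration of 7-bit ASCII: code points 0-127, split into graphic and control codes.
for ch in ["A", "a", "0", " ", "~"]:
    print(ch, "->", ord(ch))          # graphic (printable) codes

print(chr(65), chr(97))               # 'A' 'a'
print(ord("\n"), ord("\t"))           # 10, 9: examples of control codes
```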
ISCII (Indian Standard Code for Information Interchange)
It is a coding scheme for representing
various writing systems of India. It
encodes the main Indic scripts and a
Roman transliteration.
The supported scripts are: Assamese,
Bengali (Bangla), Devanagari, Gujarati,
Gurmukhi, Kannada, Malayalam,
Oriya, Tamil, and Telugu.
One motivation for the use of a single
encoding is the idea that it will allow
easy transliteration from one writing
system to another.
Unicode
Unicode is a computing industry standard for the consistent encoding,
representation and handling of text expressed in most of the world's
writing systems.
The latest version (Unicode 7.0) of Unicode contains a collection of more than
110,000 characters covering 100 scripts and various symbols.
Unicode can be implemented by different character encodings. The most
commonly used encodings are UTF-8, UTF-16.
http://www.unicodetables.com/
http://www.unicode.org/
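A minimal Python sketch of one Unicode string expressed in two of the encodings mentioned above (the Telugu example word is an illustrative assumption):

```python
# Minimal illustration of one Unicode string encoded with different encodings.
text = "నమస్తే"   # "Namaste" in Telugu script

utf8_bytes = text.encode("utf-8")
utf16_bytes = text.encode("utf-16")

print([hex(cp) for cp in map(ord, text)])  # Unicode code points of each character
print(len(utf8_bytes), len(utf16_bytes))   # same text, different byte lengths
print(utf8_bytes.decode("utf-8") == text)  # decoding restores the original string
```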
Structural Differences with ISCII
Unicode is stateless:
◦ No shifting to get different scripts
◦ Each character has a unique number
Unicode is uniform:
◦ No extension bytes necessary
◦ All characters coded in the same space
Yudit is a free Unicode plain-text editor for Unix-like systems.
http://www.yudit.org/
Type vs Token
Example:
Consider the sentence below
A rose is a rose is a rose
There are three word types in
the sentence: "rose", "is" & "a".
There are eight word tokens
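The same type/token count can be reproduced with a few lines of Python (illustrative sketch):

```python
# Counting word types vs word tokens for the example sentence above.
from collections import Counter

tokens = "A rose is a rose is a rose".lower().split()
types = Counter(tokens)

print(len(tokens))  # 8 word tokens
print(len(types))   # 3 word types: 'a', 'rose', 'is'
print(types)        # Counter({'a': 3, 'rose': 3, 'is': 2})
```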
Telugu
Devanagari
Font
A font file is a binary file that contains glyphs, or “pictures”, of symbols
representing the building blocks of a displayable character set.
Depending on the language, multiple glyphs can comprise a single
character.
Code Table
In basic terms, a code table is a two column list that maps a numerical
value to a glyph. The most widely used code table is Unicode
Font
Encoding
Encoding values are “stored” from a code table. There are many different
encoding types to choose from depending on the application.
UTF-8
UTF-16 (UCS(universal character set)-2)
UTF-32 (UCS(universal character set)-4)
Allows us to generate displays of text strings in many different languages
by using fonts which contain the glyphs corresponding to their alphabet
The computer system takes each code and displays the glyph associated
with it which is displayed on a monitor or printed out.
Font
The glyphs may be viewed as the building blocks for the letter to be
displayed where, by placing the glyphs one after another, the required
display is generated.
Fonts also incorporate a feature whereby some of the glyphs may be
defined to have zero width even though they extend over a horizontal
range
Thus when the system places a zero width glyph next to another, the two
are superimposed and thus permit more complex shapes to be generated,
such as accented letters.
Sentence
A sentence is a group of words that are put together to mean something.
A sentence is the basic unit of language which expresses a complete
thought.
It does this by following the grammatical rules of syntax.
Sentence Boundary Disambiguation
◦ People use . ? and !
◦ Sometimes ;
◦ End-of-sentence marks are overloaded.
Sentence Boundary Disambiguation
English employs whitespace between most words and punctuation marks
at sentence boundaries, but neither feature is sufficient to segment the
text completely and unambiguously.
Tibetan and Vietnamese both explicitly mark syllable boundaries, either
through layout or by punctuation, but neither marks word boundaries.
Written Chinese and Japanese have adopted punctuation marks for
sentence boundaries, but neither denotes word boundaries.
Period - most ambiguous. Decimals, e-mail addresses, abbreviations,
initials in names, honorific titles.
Sentence Boundary Disambiguation
For example:
U.S. Dist. Judge Charles L. Powell denied motions made by defense attorneys
Monday in Portland's insurance fraud trial. Of the handful of painters that
Austria has produced in the 20th century, only one, Oskar Kokoschka, is widely
known in U.S. This state of unawareness may not last much longer.
Sentence boundary detection by humans is tedious, slow, error-prone, and
extremely difficult to codify.
Algorithmic syntactic sentence boundary detection is a necessity.
POS tagging and syntax can be done on sentences
Related Work
As of 1997:
“identifying sentences has not received as much attention as it deserves.”
[Reynar and Ratnaparkhi1997]
“Although sentence boundary disambiguation is essential . . ., it is rarely
addressed in the literature and there are few public-domain programs for
performing the segmentation task.” [Palmer and Hearst1997]
Two approaches
◦ Rule based approach
◦ Machine-learning-based approach
Related Work
Rule based
◦ Regular expressions
◦ [Cutting1991]
◦ Mark Wasson converted a grammar into a finite automaton with 1419 states
and 18002 transitions.
◦ Lexical endings of words
◦ [Müller1980] uses a large word list.
Machine-learning-based approach
◦ [Riley1989] uses regression trees.
◦ [Palmer and Hearst1997] uses decision trees or neural network.
Maximum Entropy Approach
Potential sentence boundaries are identified by scanning the text for
sequences of characters separated by whitespace (tokens)
containing one of the symbols !, . or ?.
The system that focused on maximizing performance used the
following hints, or contextual "templates":
The Prefix, The Suffix
The presence of particular characters in the Prefix or Suffix
Whether the Candidate is an honorific (e.g. Ms., Dr., Prof.)
Maximum Entropy Approach
Whether the Candidate is a corporate designator (e.g. Corp., M.L.A.,
M.L.C.)
Features of the word left of the Candidate
Features of the word right of the Candidate
The templates specify only the form of the information. The exact information
used by the maximum entropy model for the potential sentence boundary
marked by Corp. in Example sentence would be:
◦ ANLP Corp. chairman Dr. Smith resigned.
◦ PreviousWordIsCapitalized, Prefix=Corp, Suffix=NULL,
PrefixFeature=CorporateDesignator.
Maximum Entropy Approach
For each potential sentence boundary token (., ?, and !), we estimate a joint
probability distribution p of the token and its surrounding context, both of which
are denoted by c, occurring as an actual sentence boundary.
The distribution is given by:
p(b, c) = π ∏_{j=1}^{k} α_j^{f_j(b, c)}
where b ∈ {no, yes}, the α_j's are the unknown parameters of the model, and each α_j corresponds to a feature f_j.
Thus the probability of seeing an actual sentence boundary in the context c is
given by p(yes, c).
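A hedged sketch of such a boundary classifier using logistic regression (a maximum-entropy model) over template-style features; scikit-learn, the feature names, and the toy labels are assumptions for illustration, not the original system:

```python
# Hedged sketch: a maximum-entropy (logistic regression) boundary classifier over
# the kinds of contextual features described above. The toy examples and labels
# are illustrative assumptions, not the original training data.
from sklearn.feature_extraction import DictVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

# Each candidate "." is described by template features; label "yes" = real boundary.
candidates = [
    {"prefix": "Corp", "prefix_is_corporate": True, "next_capitalized": True},
    {"prefix": "Dr",   "prefix_is_honorific": True, "next_capitalized": True},
    {"prefix": "resigned", "next_capitalized": True},
    {"prefix": "red",      "next_capitalized": True},
]
labels = ["no", "no", "yes", "yes"]

model = make_pipeline(DictVectorizer(), LogisticRegression())
model.fit(candidates, labels)

test = {"prefix": "Prof", "prefix_is_honorific": True, "next_capitalized": True}
print(model.predict_proba([test]))  # probabilities for classes ['no', 'yes'] in context c
```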
Corpus
Corpus is a large collection of text covering different domains, styles,
territorial and social variants of usage etc.
A corpus is a collection of pieces of language that are selected and
ordered according to explicit linguistic criteria in order to be used as a
sample of the language.
A corpus provides grammarians, lexicographers, and others a better
description of a language.
Chomsky’s Critique of Corpus-Based Methods
1. Corpora model performance, while linguistics is aimed at the explanation of
competence
If you define linguistics that way, linguistic theories will never be able to deal with
actual, messy data
2. Natural language is in principle infinite, whereas corpora are finite, so many
examples will be missed
Excellent point, which needs to be understood by anyone working with a corpus.
But does that mean corpora are useless?
Introspection is unreliable (prone to performance factors), and pretty useless with
small and unrepresentative data.
Insights from a corpus might lead to generalization/induction beyond the corpus– if
the corpus is a good sample of the “text population”
3. Ungrammatical examples won’t be available in a corpus
Depends on the corpus, e.g., spontaneous speech, language learners, etc.
Corpus
Corpora analysis provide lexical information, morpho-syntactic
information, syntactic as well as semantic information.
Variety of Corpus
◦ Raw corpus
◦ POS tagged
◦ Parsed
◦ Multilingual aligned
◦ Spoken language
◦ Semantic tagged
Corpus
Raw Corpus
The texts are segmented into sentences and paragraphs
◦ Reuters corpus (180 million words)
◦ CIIL corpus (3 million words for 10 major Indian languages)
POS Tagged Corpus
Texts in the corpus are annotated with part-of-speech (POS) tag information
◦ BNC tagged corpus (100 million words, CLAWS tagset)
Corpus
Parsed
Each sentence is annotated with a phrase-structure parse marking the
boundaries of sentence, clause, phrase and coordinated word constituents.
◦ Lancaster Parsed Corpus (British English)
◦ Susanne parsed corpus
Semantic Corpus
Sense tagged corpus
◦ The FrameNet lexical database contains around 1,200 semantic frames, 13,000 lexical
units (a pairing of a word with a meaning; polysemous words are represented by
several lexical units) and over 190,000 example sentences
Corpus
Multilingual aligned
Identification of the corresponding sentences in multiple languages and align
them
◦ CRATER:- Multilingual Aligned Annotated Corpus (English, French, Spanish)
◦ JRC-Acquis Multilingual Parallel Corpus: collection of parallel texts in the following 22
languages: Bulgarian, Czech, Danish, German, Greek, English, Spanish, Estonian, Finnish
etc.
◦ Parallel corpora are useful for all types of cross-lingual research
Uses of Corpora
Lexicography / terminology
Linguistics / computational linguistics
Dictionaries & grammars (Collins COBUILD English Dictionary for Advanced Learners;
Longman Grammar of Spoken and Written English)
Critical Discourse Analysis
- Study texts in social context
- Analyze texts to show underlying ideological meanings and assumptions
- Analyze texts to show how other meanings and ways of talking could have
been used….and therefore the ideological implications of the ways that things
were stated
Uses of Corpora
Literary studies
Translation practice and theory
Language teaching / learning
ESL Teaching (English as Second Language)
LSP Teaching (Language for Specific Purposes)
Type-Token Analysis
Each distinct word form is a type and each occurrence of a type counts as
a token.
What is a corpus and why are corpora important tools?
Kristina Nilsson Björkenstam
Computational Linguistics, Stockholm University
1. Introduction
In 2012, the Republican candidate for US president, Mitt Romney, tried to defend himself against
allegations that he was too liberal by saying:
"But I was a severely conservative Republican governor.” (Mitt Romney, CPAC 2012-02-10)
People, both within the Republican Party and outside, got upset because of the phrase "severely
conservative". It didn't feel right. It was perceived as negative, almost as if Romney didn't want to
be a conservative. But if we look up the adjective "severely" in Webster's dictionary we find the
following definitions:
1. harsh or strict, unsparing, stern
2. serious, grave, forbidding
3. conforming strictly to a rule or standard
4. extremely plain or simple
5. keen, violent, intense
6. difficult, rigorous
It would seem that Romney used the word “severely” in sense #3, "conforming strictly to a rule or
standard". If this is an example of correct usage of the word, why did people (especially within his
own party) get so upset with him? What did he do wrong?
The answer is that there is more to language than dictionary definitions. If we look up the word
"severely" in the 450 million word Corpus of Contemporary American English, 1 we find that this
word typically co-occurs with words like:
• damaged
• injured, wounded, ill, depressed
• disabled
• limited, restricted, limit
• punished, beaten
• criticized
• affected
This pattern of co-occurrence with mostly negatively charged words is the reason why "severely" has negative connotations, and this is why people reacted to the phrase "severely conservative".[2] This blending of features of one set of words (e.g., "damaged", "injured", and "depressed") with another word ("severely") through frequent co-occurrence is called semantic prosody (Louw, 1993), and this is one of the aspects of language that can be studied by analyzing language production in collections of language samples, so-called corpora.

[1] COCA. URL: http://corpus.byu.edu/coca
[2] See the blog entry "Severely X" by Mark Liberman at Language Log for more on this quote. URL: http://languagelog.ldc.upenn.edu/nll/?p=3762
2. What is a corpus?
A corpus is a collection of natural language (text, and/or transcriptions of speech or signs)
constructed with a specific purpose. While most available corpora are text only, there are a
growing number of multimodal corpora, including sign language corpora.
A multimodal corpus is ”a computer-based collection of language and communication-related
material drawing on more than one sensory modality or on more than one production modality”
(Allwood, 2007:207), where sensory modalities include sight, hearing, touch, smell or taste, and
production modalities e.g., speech, signs, eye gaze, body posture, and gestures. That is, a
multimodal corpus is a collection of video and/or audio recordings of people communicating. But
any collection of audio and video is not a corpus. Firstly, the audio-visual material should be
carefully selected, and the content must be described using meta-data. Secondly, the material
should be analyzed and described with transcriptions and annotations in a standardized format.
Ideally, a corpus is a set of language production samples designed to be representative of a
language (or sub-language) through careful selection -- not a randomly collected set of data. How
representative a corpus is, given a particular research question, is determined by the balance and
sampling of the corpus. We can think of representativeness as the answer to the question: how
well does this corpus describe relevant aspects of the language? In order to create a general
corpus, language samples produced by both men and women, of all ages, from different parts of
the area where the language is spoken, etc., should be included.
The same principles regarding representativeness, balance and sampling are relevant for both text
and multimodal corpora, and there is a large body of work on corpus design to draw from (for an
introduction see e.g., (McEnery, Xiao & Tono, 2006) and (Allwood, 2007)). There are different ways
to go about selecting data. One way is to focus on language as a "product" and sample different
types of language material, e.g., dialogue or monologue, or scripted or spontaneous speech.
Another way is to focus on the “producer” of language, and to choose informants based on
speaker characteristics such as age, gender, social class, first/second language, level of education,
profession, and regional background. In some cases, e.g., when recording communication between
co-workers in a specific work place, the informants are selected because they work there, and not
based on speaker characteristics. Such corpora are specialized rather than general, but speaker
characteristics are still important when analyzing the data.
In the case of multimodal corpora, an important aspect is whether the recordings were made in a
naturalistic setting in a studio, or in the real world. There is also a difference between unobtrusive
observation of an activity (e.g., a parent and a child playing with a set of toys at home), and
recordings of people performing a task according to instructions (e.g., two adults discussing a
movie in a lab setting).
Corpus selection is important not only for corpus builders but also for corpus users,
because the set of questions that can be investigated depends on the composition of the corpus.
Let me give you an example: The Alcohol Language Corpus (ALC; Schiel, Heinrich & Barfüßer, 2011)
is a specialized corpus consisting of speech samples from 162 speakers of German (85 male and 77
female police officers). The recordings were made in a car. Each speaker was recorded both sober
and drunk (with alcohol level as meta-data), speaking in two different speaking styles: scripted,
(that is, as instructed, e.g., reciting addresses and registration numbers), and spontaneous (e.g.,
descriptions, question-answer, and dialog).
Using this corpus, Schiel and colleagues (2011) show that there is a gender difference in
spontaneous speech: men talk less when drunk, whereas women are not affected. They also show
that both men and women make more speech errors when drunk. That is, using this corpus we can
study how speech performance is affected by alcohol, and this knowledge can be used e.g., to
devise new methods to stop drunk drivers. But there are inherent limitations to this corpus that
affect how this corpus can be used: the recordings were made in cars, all participants are police
officers, and the speakers are drunk in some of the recordings. This, of course, means that we
cannot use this corpus to study e.g., German in general. For that purpose, we need a balanced
corpus consisting of language material produced by men and women of all ages, from all regions,
with different speaker characteristics.
3. What corpora can tell us...
The first thing I do when I get access to a new corpus is to explore the content using some basic
methods, typically by counting the words. We can find out which words are the most frequent in
the corpus, and by ranking the words by corpus frequency we can study the distribution of the
vocabulary of the corpus. By using normalized frequencies, we can make comparisons between
different corpora. We can e.g., compare the vocabulary frequency distribution of English (British
National Corpus[3]), Swedish (Stockholm-Umeå Corpus[4]), and Swedish Sign Language (Swedish Sign
Language Corpus[5]).
By using basic corpus linguistic tools, either built-in web interface tools for corpora such as COCA
or BNC, or software such as AntConc,[6] we can also look at recurring sequences of words or signs,
either as sequences of tokens (called n-grams) or as collocations.
Starting with basic methods such as these, we can move on to study many aspects of language
production using both quantitative and qualitative methods.
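A minimal Python sketch of this kind of basic exploration (frequencies, normalized frequencies, and bigrams), assuming a toy corpus for illustration:

```python
# Minimal corpus-exploration sketch: word frequencies, normalized frequencies and bigrams.
from collections import Counter

corpus_text = "a rose is a rose is a rose and a rose is red"
tokens = corpus_text.split()

freq = Counter(tokens)
total = len(tokens)
normalized = {w: c / total for w, c in freq.items()}  # relative frequencies, comparable across corpora
bigrams = Counter(zip(tokens, tokens[1:]))            # recurring two-token sequences (2-grams)

print(freq.most_common(3))
print(normalized["rose"])
print(bigrams.most_common(2))
```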
4. ... and what corpora cannot tell us
However, there are limitations to what corpora can tell us.
• No negative evidence: just because a word or a sign does not occur in a corpus (however large and well balanced) does not mean that the word or sign never can occur in the language. However, a representative corpus can show us what is central and typical in a language.
• The findings of a study can tell us something about the subset of language that is included in that corpus, but not necessarily about language as a whole. However, if the corpus is representative of the language we are interested in we can make careful generalizations about the language.
• A corpus can rarely provide explanations, and thus most corpus studies combine quantitative and qualitative work. Sometimes other methods, such as questionnaires, eye gaze or EEG experiments are better suited to answer a particular question. Sometimes a descriptive corpus study can give new ideas on what to look for using other methods.

[3] BNC. URL: http://www.natcorp.ox.ac.uk/
[4] SUC. URL: http://spraakbanken.gu.se/eng/resource/suc3
[5] SSLC. URL: http://www.ling.su.se/forskning/forskningsprojekt/teckenspr%C3%A5k/teckenspr%C3%A5kskorpus
[6] AntConc. URL: www.antlab.sci.waseda.ac.jp/
To summarize: make sure that you select the right corpus for your study, find out as much as you
can about the corpus, take the characteristics and limitations of the corpus into account, and make
careful generalizations!
5. Why are corpora important tools?
Corpus analysis provides quantitative, reusable data, and an opportunity to test and challenge our
ideas and intuitions about language. Further, analysis applied to corpora as transcriptions or other
types of linguistic annotation can be checked for consistency and inter-annotator agreement, and
the annotated corpus can be reviewed and reused by others.
Corpora are essential in particular for the study of spoken and signed language: while written
language can be studied by examining the text, speech, signs and gestures disappear when they
have been produced and thus, we need multimodal corpora in order to study interactive face-to-face communication.
References
Allwood, J. 2007. Multimodal Corpora. In: Lüdeling, A. & M. Kytö (eds) Corpus Linguistics. An
International Handbook. Mouton de Gruyter. Berlin: 207-225
Davies, M. 2008. The Corpus of Contemporary American English: 450 million words, 1990-present.
URL: http://corpus.byu.edu/coca/
Louw, B. 1993. Irony in the Text or Insincerity in the Writer? The Diagnostic Potential of Semantic
Prosodies. In Baker, M., Francis, G. & Tognini-Bonelli, E. (eds) "Text and Technology".
Philadelphia/Amsterdam: John Benjamins.
McEnery, T., R. Xiao & Y. Tono. 2006. Corpus-Based Language Studies: An Advanced Resource Book.
Taylor & Francis US.
Schiel F, C. Heinrich & S. Barfüßer. 2011. Alcohol Language Corpus. In: Language Resources and
Evaluation, Springer, Berlin, New York, Vol 45.
The British National Corpus, version 3 (BNC XML Edition). 2007. Distributed by Oxford University
Computing Services on behalf of the BNC Consortium. URL: http://www.natcorp.ox.ac.uk/
The Stockholm-Umeå Corpus, version 3.0. 2012. Distributed by the Swedish Language Bank at
Gothenburg University. URL: http://spraakbanken.gu.se/eng/resource/suc3
The Swedish Sign Language Corpus. 2013. Distributed by the Section for Sign Language at
Stockholm University. URL: http://www.ling.su.se/english/research/research-projects/signlanguage
Sentence Segmentation
and Decision Trees
Sentence Segmentation
• !, ? are relatively unambiguous
• Period “.” is quite ambiguous
• Sentence boundary
• Abbreviations like Inc. or Dr.
• Numbers like .02% or 4.3
• Build a binary classifier
• Looks at a “.”
• Decides EndOfSentence/NotEndOfSentence
• Classifiers: hand-written rules, regular expressions, or machine-learning
Determining if a word is end-of-sentence: a Decision Tree
More sophisticated decision tree features
• Case of word with “.”: Upper, Lower, Cap, Number
• Case of word after “.”: Upper, Lower, Cap, Number
• Numeric features
• Length of word with “.”
• Probability(word with "." occurs at end-of-sentence)
• Probability(word after "." occurs at beginning-of-sentence)
Implementing Decision Trees
• A decision tree is just an if-then-else statement
• The interesting research is choosing the features
• Setting up the structure is often too hard to do by hand
• Hand-building only possible for very simple features, domains
• For numeric features, it’s too hard to pick each threshold
• Instead, structure usually learned by machine learning from a training corpus
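As a minimal sketch of the "decision tree as if-then-else" idea (the rules and abbreviation list below are illustrative assumptions, not learned from a corpus):

```python
# Minimal hand-written "decision tree" for end-of-sentence detection.
# The rules mirror the features listed above; thresholds and the abbreviation
# list are illustrative assumptions, not values learned from training data.
ABBREVIATIONS = {"dr.", "mr.", "ms.", "prof.", "inc.", "corp.", "etc."}

def is_end_of_sentence(word_with_period, next_word):
    w = word_with_period.lower()
    if w in ABBREVIATIONS:                      # abbreviations like "Dr." rarely end a sentence
        return False
    if any(ch.isdigit() for ch in w):           # numbers like "4.3" or ".02%"
        return False
    if next_word and next_word[0].isupper():    # capitalized next word suggests a boundary
        return True
    return False

print(is_end_of_sentence("Dr.", "Smith"))       # False
print(is_end_of_sentence("resigned.", "The"))   # True
```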
Decision Trees and other classifiers
• We can think of the questions in a decision tree
as features that could be exploited by any kind of
classifier:
• Logistic regression
• SVM
• Neural Nets
• etc.
CSE4022
Natural Language Processing
Topic: Morphology
Dr. Tulasi Prasad Sariki
SCOPE, VIT Chennai
www.learnersdesk.weebly.com
Contents
◦ What is Morphology?
◦ Where is Morphology useful?
◦ Classification of Morphemes
◦ Properties of affixes
◦ Representation / Terminology
◦ Example
What is Morphology?
Morpheme is a minimal meaning-bearing unit in a language.
Morphemes are abstract concepts denoting entities or relationships.
Morphology is the study of the internal structure of words.
In natural languages, words are made up of meaningful subunits
called morphemes.
Morphological parsing is the task of recognizing the morphemes
inside a word
e.g., hands, foxes, children
Where is Morphology useful?
Machine translation
Information retrieval
Lexicography
Any further processing (e.g., part-of-speech tagging)
Observations about words and their structure
Some words can be divided into parts which still have meaning
Many words have meaning by themselves. But some words have meaning
only when used with other words
Some of the parts into which words can be divided can stand alone as
words. But others cannot
These word-parts that can occur only in combination must be combined in
the correct way
Languages create new words systematically
Classification of Morphemes
Lexical morphemes are those that have meaning by themselves
(more accurately, they have sense).
Nouns, verbs, adjectives ({boy}, {buy}, {big}) are typical lexical
morphemes.
Grammatical morphemes specify a relationship between other
morphemes. But the distinction is not all that well defined.
Prepositions, articles, and conjunctions ({of}, {the}, {but}) are grammatical morphemes.
Classification of Morphemes
Free morphemes – morphemes which can stand by themselves as
separate words,
e.g. structure, like, go, work, friend etc.
Bound morphemes – morphemes which cannot normally stand alone but
need to be attached to other forms,
e.g. re-, -ed, -s, -ing etc.
– a unit of meaning which can only exist alongside a free morpheme.
– Bound morphemes operate in word-formation processes by means of
derivation, inflection, and compounding.
Classification of Morphemes
We can usefully divide morphemes into two classes
◦ Root or Lexeme: The core meaning-bearing units
◦ Affixes: Bits and pieces that adhere to stems to change their meanings
and grammatical functions
◦ Prefix: un-, anti-, etc.
◦ Suffix: -ity, -ation, etc.
◦ Infix: inserted inside the stem; English has almost no true infixes
◦ Circumfix: a discontinuous morph composed of two parts which
embrace the base element (live → en-live-en → enliven)
Properties of roots
◦ Main part of word
◦ Must be at least one in a word
◦ In English, limited to two in a word
◦ (simple words have one, compound words have two);
◦ Can occur independently
◦ Tend to have richer, more specific semantic content
◦ Position is relatively free with respect to other roots
◦ E.g. photograph vs. telephoto
Properties of affixes
◦ Subordinate part of word
◦ Not necessarily present--some words occur without any
◦ Multiple affixes can occur in a word
◦ Are dependent (bound) elements
◦ Have more "schematic" (non-specific) content; often grammar-like
function
◦ Can either precede or follow their roots (prefixes and suffixes, respectively)
◦ Position for a given affix with respect to root is fixed
Example
Given word: Unbreakable
How many morphemes?
It comprises three morphemes:
un- (a bound morpheme signifying "not")
-break- (the root, a free morpheme)
-able (a bound morpheme, a suffix, signifying "can be done").
Representation / Terminology
Morphological: girls = {girl} + {s}
Semantic: {girl} = [-adult; -male; +human, ...] + {s} = {PLU} = [plural]
Braces, { } indicate a morpheme. Square brackets, [ ]
indicate a semantic characterization. Italics indicate a lexical item.
Two different morphemes may be pronounced the same way.
Ex: –er in buyer (verb + agentive morpheme {AG}) and shorter (adjective + comparative morpheme {COMP})
Morphemes and Words
Combine morphemes to create words.
Inflectional Morphology
Combination of stem and morpheme resulting in word of same class
Usually fills a syntactic feature such as agreement
E.g., plural –s, past tense -ed
Derivational Morphology
Combination of stem and morpheme usually results in a word of a different class. The meaning of the new word may be hard to predict.
E.g., +ation in words such as computerization
Inflectional Morphology
Inflection is a morphological process that adapts existing words so
that they function effectively in sentences without changing the
category of the base morpheme.
Word stem + grammatical morpheme cat + s
Only for nouns, verbs, and some adjectives
Nouns
◦ Plural:
Rules for regular nouns: +s, +es
Irregular: mouse – mice; ox – oxen
Rules for exceptions: e.g. -y -> -ies, as in butterfly – butterflies
Inflectional Morphology (verbs)
Regular inflected forms:
stem            : walk    | thrash    | try    | map
-s form         : walks   | thrashes  | tries  | maps
-ing form       : walking | thrashing | trying | mapping
-ed form (past) : walked  | thrashed  | tried  | mapped
Inflectional Morphology (verbs)
Irregular inflected forms:
stem                  : eat    | catch    | cut
-s form               : eats   | catches  | cuts
-ing form             : eating | catching | cutting
-ed form (past)       : ate    | caught   | cut
-ed form (participle) : eaten  | caught   | cut
Inflectional Morphology (verbs)
The suffix –s functions in the present simple as the third person marker of the verb
◦ to work – he work-s
The suffix –ed functions in the past simple as the past tense marker in regular verbs
◦ to love – lov-ed
The suffixes –ed (regular verbs) and –en (for some irregular verbs) function in the marking of the past participle
◦ to study – studied / to eat – eaten
The suffix –ing functions in the marking of the present participle
◦ to eat – eating / to study – studying
Inflectional Morphology (nouns)
Regular nouns (cat, hand):
stem             : cat  | hand
-s form (plural) : cats | hands
Irregular nouns (child, ox):
stem             : child    | ox
-s form (plural) : children | oxen
The suffix –s functions in the marking of the plural of nouns: dog – dogs
The suffix –s functions as a possessive marker: Laura – Laura's book.
Regular vs Irregular
It is a little complicated by the fact that some words misbehave (refuse to
follow the rules)
◦ Mouse/mice, goose/geese, ox/oxen
◦ Go/went, fly/flew
The terms regular and irregular are used to refer to words that follow the
rules and those that don’t.
Inflectional Morphology (Adjectives)
The suffix –er functions as comparative marker: quick – quicker
The suffix –est functions as superlative marker: quick - quickest
Derivational Morphology
Derivation is concerned with the way morphemes are connected to
existing lexical forms as affixes.
We distinguish affixes in two principal types
◦ Prefixes – attached at the beginning of a lexical item or base morpheme
– ex: un-, pre-, post-, dis-, im-, etc.
◦ Suffixes – attached at the end of a lexical item
– ex: -age, -ing, -ful, -able, -ness, -hood, -ly, etc.
Examples of Derivational Morphology
Lexical item (free morpheme): like (verb) + prefix (bound morpheme) dis- = dislike (verb)
like + suffix –able = likeable; + prefix un- = unlikeable; + suffix –ness = unlikeableness
like + prefix un- = unlike; + suffix –ness = unlikeness
like + suffix –ly = likely; + suffix –hood = likelihood; + prefix un- = unlikelihood
Derivational Morphology
Derivational affixes can cause semantic change
Prefix pre- means before; post- means after; un- means not, re- means
again.
Prefix = fixed before; Unhappy = not happy = sad; Retell = tell again.
Prefix de- added to a verb conveys a sense of subtraction; dis- and un- have a sense of negativity.
to decompose; to defame; to uncover; to discover.
Derivational Morphology
Derivational affixes can mark category change
For Nouns:
Suffix | Base Verb / Adjective | Derived Noun
-ation | Computerize (V)       | Computerization
-ee    | Appoint (V)           | Appointee
-er    | Kill (V)              | Killer
-ness  | Fuzzy (A)             | Fuzziness
For Adjectives:
Suffix | Base Verb / Noun | Derived Adjective
-al    | Computation (N)  | Computational
-able  | Embrace (V)      | Embraceable
-less  | Care (N)         | Careless
-ful   | Care (N)         | Careful
Derivational Morphology
Verb clitics are usually weak forms of functional elements:
Full Form | Clitic
am        | 'm
is        | 's
are       | 're
will      | 'll
have      | 've
has       | 's
had       | 'd
would     | 'd
Derivational Processes
1. Derivation: (or Derivational affixation, Affixation)
antiintellectualism
2. Compounding: combine two or more morphemes to form new words
bathroom, blackboard
3. Reduplication: full or partial repetition of a morpheme
dilly-dally, zig-zag
4. Blending: parts of the words that are combined are deleted
fantastic + fabulous -> fantabulous
Derivational Processes
5. Clipping: part of a word has been clipped off
Prof , lab, doc
6. Acronyms: abbreviate a longer term by taking the initial letters
World Health Organization -> WHO
7. Back formation: A word (usually a noun) is reduced to form another word of a
different type (usually a verb)
television -> televise
babysitter -> babysit
Derivational Processes
8. Extension of word formation rules : Part of a word is treated as a morpheme
though it’s not
workaholic
9. Functional shift (Conversion): A change in the part of speech
computer users today use a mouse and bookmark an Internet address
10. Proper names -> common words
Xerox -> photocopy
JCB -> proclainer (excavator)
Derivational Processes
11. Coining: Creating a completely new free morpheme
googol -> 10^100
12. Onomatopoeia: words imitate sounds in nature
tick-tock, quack
13. Borrowing: The taking over of words from other languages, e.g. French to English:
brigade, ballet, bigot
Derivational Processes
Many paths are possible.
Start with compute
Computer -> computerize -> computerization
Computer -> computerize -> computerizable
Computation -> computational
But not all paths/operations are equally good (allowable?):
Clue -> *clueable
Happy -> unhappy
Sad -> *unsad
Derivational Processes
Morphotactics
Morphotactics is concerned with ordering of morphemes.
There are restrictions on the ordering of morphemes:
antiintellectualism has the morphemes anti-, intellect, -al, -ism, which must be ordered as anti + intellect + al + ism.
Morphophonemics:
Focus on the sound changes that take place in morphemes when they combine
to form words.
e.g., the vowel changes in “sleep” and “slept,” “bind” and “bound,” “vain” and
“vanity,” and the consonant alternations in “knife” and “knives,”.
Derivational Processes
Semantics: In English, un- cannot attach to adjectives that
already have a negative connotation:
Unhappy vs. *unsad
Unhealthy vs. *unsick
Unclean vs. *undirty
Phonology: In English, -er cannot attach to words of more than two
syllables
great, greater
Happy, happier
Competent, *competenter
Elegant, *eleganter
Inflectional vs Derivational
Lexical category:
  Inflectional: Do not change the lexical category of the word.
  Derivational: Often change the lexical category of the word.
Location:
  Inflectional: Tend to occur outside derivational affixes.
  Derivational: Tend to occur next to the root.
Type of meaning:
  Inflectional: Contribute syntactically conditioned information, such as number, gender, or aspect.
  Derivational: Contribute lexical meaning.
Affixes used:
  Inflectional: Occur with all or most members of a class of stems.
  Derivational: Are restricted to some, but not all, members of a class of stems.
Productivity:
  Inflectional: May be used to coin new words of the same type.
  Derivational: May eventually lose their meaning and usually cannot be used to coin new terms.
Grounding:
  Inflectional: Create forms that are fully grounded and able to be integrated into discourse.
  Derivational: Create forms that are not necessarily fully grounded and may require inflectional operations before they can be integrated into discourse.
Stemming
Stemming algorithms strip off word affixes and yield the stem only, with no additional information (like plural, 3rd person, etc.); they are used, e.g., in web search engines.
Stemming is one technique that provides ways of finding morphological variants of search terms.
It is used to improve retrieval effectiveness and to reduce the size of indexing files.
Tokens are reduced to the "root" form of words to recognize morphological variation:
"computer", "computational", "computation" are all reduced to the same token, "compute".
Stemming
Criteria for judging stemmers
Correctness
◦ Overstemming: too much of a term is removed.
◦ Understemming: too little of a term is removed.
Retrieval effectiveness
◦ Measured with recall and precision, and on their speed, size, and so on
Compression performance
Type of stemming algorithms
Table lookup approach
Successor Variety
n-gram stemmers
Affix Removal Stemmers
Table lookup approach
Store a table of all index terms and their stems, so terms from
queries and indexes could be stemmed very fast.
Problems
◦ There is no such data for English. Or some terms are domain dependent.
◦ The storage overhead for such a table, though trading size for time is
sometimes warranted.
Successor Variety
Determine word and morpheme boundaries based on the distribution of
phonemes in a large body of utterances.
The successor variety of a string is the number of different characters that
follow it in words in some body of text.
The successor variety of substrings of a term will decrease as more
characters are added until a segment boundary is reached.
Successor Variety Example
Test Word: READABLE
Corpus: ABLE, APE, BEATABLE, FIXABLE, READ, READABLE, READING, READS,
RED, ROPE, RIPE
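A minimal Python sketch computing successor varieties for this test word over the small corpus (illustrative, not the full stemming procedure):

```python
# Minimal successor-variety sketch for the example above (READABLE against the small corpus).
corpus = ["ABLE", "APE", "BEATABLE", "FIXABLE", "READ", "READABLE",
          "READING", "READS", "RED", "ROPE", "RIPE"]

def successor_variety(prefix, words):
    """Number of distinct characters that follow `prefix` in the corpus words."""
    successors = {w[len(prefix)] for w in words
                  if w.startswith(prefix) and len(w) > len(prefix)}
    return len(successors)

word = "READABLE"
for i in range(1, len(word) + 1):
    prefix = word[:i]
    print(prefix, successor_variety(prefix, corpus))
# "R" is followed by E, I, O (variety 3); "READ" by A, I, S (variety 3);
# the variety then drops sharply, suggesting a segment boundary after "READ".
```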
Successor Variety Example
cutoff method
◦ some cutoff value is selected and a boundary is identified whenever the
cutoff value is reached
peak and plateau method
◦ segment break is made after a character whose successor variety exceeds that
of the characters immediately preceding and following it
complete method
entropy method
Successor Variety
Two criteria used to evaluate various segmentation methods
1. the number of correct segment cuts divided by the total number of cuts
2. the number of correct segment cuts divided by the total number of true
boundaries
After segmenting, if the first segment occurs in more than 12 words in the
corpus, it is probably a prefix.
The successor variety stemming process has three parts
1. determine the successor varieties for a word
2. segment the word using one of the methods
3. select one of the segments as the stem
n-gram stemmers
Association measures are calculated between pairs of terms based on shared
unique digrams.
statistics => st ta at ti is st ti ic cs
unique digrams = at cs ic is st ta ti
statistical => st ta at ti is st ti ic ca al
unique digrams = al at ca ic is st ta ti
Dice's coefficient (similarity):
S = 2C / (A + B) = (2 × 6) / (7 + 8) = 0.80
where A and B are the numbers of unique digrams in the first and the second words, and C is the number of unique digrams shared by the two words.
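A small Python sketch reproducing the digram counts and the Dice coefficient for this example:

```python
# Minimal digram-based similarity sketch for the example above.
def unique_digrams(term):
    """Set of distinct adjacent character pairs in a term."""
    return {term[i:i + 2] for i in range(len(term) - 1)}

def dice_coefficient(term1, term2):
    a, b = unique_digrams(term1), unique_digrams(term2)
    return 2 * len(a & b) / (len(a) + len(b))

print(sorted(unique_digrams("statistics")))   # 7 unique digrams
print(sorted(unique_digrams("statistical")))  # 8 unique digrams
print(dice_coefficient("statistics", "statistical"))  # 0.8
```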
n-gram stemmers
Similarity measures are determined for all pairs of terms in the database,
forming a similarity matrix
Once such a similarity matrix is available, terms are clustered using a
single link clustering method
Affix Removal Stemmers
Affix removal algorithms remove suffixes and/or prefixes from terms, leaving a stem
◦ If a word ends in "ies" but not "eies" or "aies" (Harman 1991)
   then "ies" -> "y"
◦ If a word ends in "es" but not "aes", "ees" or "oes"
   then "es" -> "e"
◦ If a word ends in "s" but not "us" or "ss"
   then "s" -> NULL
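The three rules above translate directly into code; this is a minimal sketch of that simple "s-removal" stemmer (Harman 1991 style):

```python
# Minimal sketch of the simple s-removal rules listed above.
def simple_s_stemmer(word):
    if word.endswith("ies") and not word.endswith(("eies", "aies")):
        return word[:-3] + "y"
    if word.endswith("es") and not word.endswith(("aes", "ees", "oes")):
        return word[:-1]            # "es" -> "e": drop only the final "s"
    if word.endswith("s") and not word.endswith(("us", "ss")):
        return word[:-1]
    return word

print(simple_s_stemmer("ponies"))   # pony
print(simple_s_stemmer("cats"))     # cat
print(simple_s_stemmer("caress"))   # caress (unchanged, ends in "ss")
```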
MORPHOLOGY
43
The Porter Stemmer
Online Demo: http://9ol.es/porter_js_demo.html
Typical rules in Porter stemmer
sses→ss (caresses -> caress )
ies→i (ponies -> poni, ties -> ti)
ational→ate
tional→tion
ing →ε (motoring -> motor)
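To experiment with the full rule set, NLTK ships a PorterStemmer whose output should agree with the rule examples above (assuming NLTK is installed):

from nltk.stem import PorterStemmer

stemmer = PorterStemmer()
for word in ["caresses", "ponies", "ties", "motoring", "relational"]:
    print(word, "->", stemmer.stem(word))
# e.g. caresses -> caress, ponies -> poni, motoring -> motor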
MORPHOLOGY
44
Conditions on the stem
1. The measure, denoted m, of a stem is based on its alternating vowel-consonant sequences:
[C](VC)^m[V]   (square brackets indicate an optional occurrence)

Measure | Examples
m = 0   | TR, EE, TREE, Y, BY
m = 1   | TROUBLE, OATS, TREES, IVY
m = 2   | TROUBLES, PRIVATE, OATEN

E.g., TROUBLES = C (VC)(VC), so m = 2.
MORPHOLOGY
Conditions on the stem
2. *<X> --- the stem ends with a given letter X
3. *v* --- the stem contains a vowel
4. *d --- the stem ends in a double consonant
5. *o --- the stem ends with a consonant-vowel-consonant sequence, where the final consonant is not w, x or y
6. *S --- the stem ends with the letter S
MORPHOLOGY
46
Step1
SSES -> SS
caresses -> caress
IES -> I
ponies -> poni
ties -> ti
SS -> SS
caress -> caress
S -> є
cats -> cat
MORPHOLOGY
47
Step2a
(m>0) EED -> EE
Condition verified: agreed -> agree
Condition not verified: feed -> feed
(*V*) ED -> є
Condition verified: plastered -> plaster
Condition not verified: bled -> bled
(*V*) ING -> є
Condition verified: motoring -> motor
Condition not verified: sing -> sing
MORPHOLOGY
48
Step2b
(These rules are run only if the second or third rule in Step 2a applied)
AT-> ATE
conflat(ed) -> conflate
BL -> BLE
Troubl(ing) -> trouble
(*d & ! (*L or *S or *Z)) -> single letter
Condition verified: hopp(ing) -> hop,
Condition not verified: fall(ing) -> fall
(m=1 & *o) -> E
Condition verified: fil(ing) -> file
Condition not verified: fail -> fail
MORPHOLOGY
49
Steps 3 and 4
Step 3: Y Elimination
(*v*) Y -> I
Condition verified: happy -> happi
Condition not verified: sky -> sky
Step 4: Derivational Morphology, I
(m>0) ATIONAL -> ATE    relational -> relate
(m>0) IZATION -> IZE    generalization -> generalize
(m>0) BILITI -> BLE     sensibiliti -> sensible
MORPHOLOGY
50
Steps 5 and 6
Step 5: Derivational Morphology, II
(m>0) ICATE -> IC    triplicate -> triplic
(m>0) FUL -> є       hopeful -> hope
(m>0) NESS -> є      goodness -> good
Step 6: Derivational Morphology, III
(m>0) ANCE -> є      allowance -> allow
(m>0) ENT -> є       dependent -> depend
(m>0) IVE -> є       effective -> effect
MORPHOLOGY
51
Step7 (cleanup)
Step 7a
(m>1) E -> є
probate -> probat
(m=1 & !*o) E -> є
cease -> ceas
Step 7b
(m>1 & *d & *L) -> single letter
Condition verified: controll -> control
Condition not verified: roll -> roll
MORPHOLOGY
52
CSE4022: Natural Language Processing
Topic: Text Classification
Dr. Tulasi Prasad Sariki,
SCOPE, VIT Chennai Campus
www.learnersdesk.weebly.com
Is this spam?
From: "" <takworlld@hotmail.com>
Subject: real estate is the only way... gem oalvgkay
Anyone can buy real estate with no money down
Stop paying rent TODAY !
There is no need to spend hundreds or even thousands for similar courses
I am 22 years old and I have already purchased 6 properties using the
methods outlined in this truly INCREDIBLE ebook.
Change your life NOW !
=================================================
Click here to order: http://www.wholesaledaily.com/sales/nmd.htm
=================================================
TEXT CLASSIFICATION
Classification
Given:
❑A description of an instance, x∈X, where X is the instance language or
instance space.
❑Issue: how to represent text documents.
❑A fixed set of categories:
C = {c1, c2,…, cn}
Determine:
❑The category of x: c(x)∈C, where c(x) is a categorization function whose
domain is X and whose range is C.
❑We want to know how to build categorization functions (“classifiers”).
TEXT CLASSIFICATION
Examples
Labels are most often topics such as Yahoo-categories
e.g., "finance," "sports," "news>world>asia>business"
Labels may be genres
e.g., "editorials" "movie-reviews" "news“
Labels may be opinion
e.g., “like”, “hate”, “neutral”
Labels may be domain-specific binary
e.g., “spam” : “not-spam”, e.g., “contains adult language” :“doesn’t”
TEXT CLASSIFICATION
Classification Methods
Manual classification
❑ Used by Yahoo!, Looksmart, about.com, Medline
❑ Very accurate when job is done by experts
❑ Consistent when the problem size and team is small
❑ Difficult and expensive to scale
Automatic document classification
❑ Hand-coded rule-based systems
❑ E.g., assign category if document contains a given boolean combination of words
❑ Accuracy is often very high if a rule has been carefully refined over time by an expert
❑ Building and maintaining these rules is expensive
TEXT CLASSIFICATION
Classification Methods
Supervised learning of a document-label assignment function
❑ Many systems partly rely on machine learning
❑ k-Nearest Neighbors (simple, powerful)
❑ Naive Bayes (simple, common method)
❑ Support-vector machines (new, more powerful)
❑ Requires hand-classified training data
❑ But data can be built up (and refined) by amateurs
Note that many commercial systems use a mixture of methods
TEXT CLASSIFICATION
Bayesian Methods
❑ Learning and classification methods based on probability theory.
❑ Bayes theorem plays a critical role in probabilistic learning and
classification.
❑ Build a generative model that approximates how data is produced
❑ Uses prior probability of each category given no information about
an item.
❑ Categorization produces a posterior probability distribution over the
possible categories given a description of an item.
TEXT CLASSIFICATION
Bayes’ theorem
Bayes theorem provides a way of calculating the posterior probability, P(c|x),
from P(c), P(x), and P(x|c). Naive Bayes classifier assume that the effect of the
value of a predictor (x) on a given class (c) is independent of the values of other
predictors. This assumption is called class conditional independence.
● P(c|x) is the posterior probability of the class (target) given the predictor (attribute).
● P(c) is the prior probability of the class.
● P(x|c) is the likelihood, i.e. the probability of the predictor given the class.
● P(x) is the prior probability of the predictor.
In the ZeroR model there is no predictor; in the OneR model we try to find the single best predictor; naive Bayes includes all predictors, using Bayes' rule and the independence assumption between predictors.
TEXT CLASSIFICATION
Example
The posterior probability can be calculated
by first, constructing a frequency table for
each attribute against the target.
Then, transforming the frequency tables to
likelihood tables and finally use the Naive
Bayesian equation to calculate the
posterior probability for each class.
The class with the highest posterior
probability is the outcome of prediction.
TEXT CLASSIFICATION
Example: posterior probability
TEXT CLASSIFICATION
Example: posterior probability
TEXT CLASSIFICATION
Example: posterior probability
TEXT CLASSIFICATION
Example: The final probability
TEXT CLASSIFICATION
The Naïve Bayes Classifier
[Figure: a class node Flu with arrows to the feature nodes X1 = runny nose, X2 = sinus, X3 = cough, X4 = fever, X5 = muscle-ache]
Conditional Independence Assumption: features are independent of each other given the class
TEXT CLASSIFICATION
Learning the Model
First attempt: maximum likelihood estimates
❑ Simply use the frequencies in the data:
  P(cj) = count(C = cj) / N
  P(xi | cj) = count(Xi = xi, C = cj) / count(C = cj)
Smoothing to avoid overfitting (add-one / Laplace smoothing):
  P(xi | cj) = (count(Xi = xi, C = cj) + 1) / (count(C = cj) + k), where k = # of values of Xi
TEXT CLASSIFICATION
Naïve Bayes: Learning
From training corpus, extract Vocabulary
Calculate required P(cj) and P(xk | cj) terms
▪ For each cj in C do
  ■ docsj ← subset of documents for which the target class is cj
  ■ P(cj) ← |docsj| / |total number of documents|
  ■ Textj ← single document containing all docsj concatenated
  ■ n ← total number of word occurrences in Textj
  ■ for each word xk in Vocabulary
    ■ nk ← number of occurrences of xk in Textj
    ■ P(xk | cj) ← (nk + 1) / (n + |Vocabulary|)
TEXT CLASSIFICATION
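The training loop above can be written out directly. Below is a compact Python sketch (an assumed implementation, not the slide's own code) of multinomial Naive Bayes with add-one smoothing over bag-of-words documents:

from collections import Counter, defaultdict
import math

def train_nb(docs):
    # docs: list of (list_of_words, class_label) pairs.
    vocab = {w for words, _ in docs for w in words}
    class_docs = defaultdict(list)
    for words, c in docs:
        class_docs[c].append(words)

    priors, cond = {}, {}
    for c, doc_list in class_docs.items():
        priors[c] = len(doc_list) / len(docs)
        counts = Counter(w for words in doc_list for w in words)
        n = sum(counts.values())
        # Add-one smoothing over the whole vocabulary.
        cond[c] = {w: (counts[w] + 1) / (n + len(vocab)) for w in vocab}
    return vocab, priors, cond

def classify(words, vocab, priors, cond):
    scores = {}
    for c in priors:
        score = math.log(priors[c])
        score += sum(math.log(cond[c][w]) for w in words if w in vocab)
        scores[c] = score
    return max(scores, key=scores.get)

Feeding in the six training documents from the worked example below as (word-list, class) pairs should reproduce the conditional probabilities shown there.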
Example
Training:

Document Name | Kill | Bomb | Kidnap | Music | Movie | TV | Class Name
Doc1          |  2   |  1   |   3    |   0   |   0   |  1 | Terrorism
Doc2          |  1   |  1   |   1    |   0   |   0   |  0 | Terrorism
Doc3          |  1   |  1   |   2    |   0   |   1   |  0 | Terrorism
Doc4          |  0   |  1   |   0    |   2   |   1   |  1 | Entertainment
Doc5          |  0   |  0   |   1    |   1   |   1   |  0 | Entertainment
Doc6          |  0   |  0   |   0    |   2   |   2   |  0 | Entertainment

Testing:

Document Name | Kill | Bomb | Kidnap | Music | Movie | TV | Class Name
Doc7          |  2   |  1   |   2    |   0   |   0   |  1 | ?

TEXT CLASSIFICATION
Example

|V| -> number of vocabulary terms (here 6)
ni -> total number of keyword occurrences in the documents of class Ci
P(Ci) -> number of documents in class Ci / total number of documents

C | P(Ci) | ni | P(Kill/Ci) | P(Bomb/Ci) | P(Kidnap/Ci) | P(Music/Ci) | P(Movie/Ci) | P(TV/Ci)
T |  0.5  | 15 |   0.2380   |   0.1904   |    0.3333    |    0.0476   |    0.0952   |  0.0952
E |  0.5  | 12 |   0.0555   |   0.1111   |    0.1111    |    0.3333   |    0.2777   |  0.1111

P(Kill / T) = ((2 + 1 + 1) + 1) / (15 + |V|) = 5 / 21 = 0.2380

P(T / W) = P(T) * P(Kill/T) * P(Bomb/T) * P(Kidnap/T) * P(Music/T) * P(Movie/T) * P(TV/T), each factor raised to the keyword's count in Doc7
P(E / W) = P(E) * P(Kill/E) * P(Bomb/E) * P(Kidnap/E) * P(Music/E) * P(Movie/E) * P(TV/E), likewise
TEXT CLASSIFICATION
Example
P(T/W) = 0.5 * (0.2380)^2 * (0.1904)^1 * (0.3333)^2 * (0.0476)^0 * (0.0952)^0 * (0.0952)^1 = 5.7047 x 10^-5
P(E/W) = 0.5 * (0.0555)^2 * (0.1111)^1 * (0.1111)^2 * (0.3333)^0 * (0.2777)^0 * (0.1111)^1 = 2.3456 x 10^-7
Since P(T/W) is the higher value, Document 7 is classified into the Terrorism class.
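As a quick check of the arithmetic, this short Python sketch (assumed code) recomputes both scores from the per-class keyword counts in the training table above:

# Per-class keyword totals taken from the training table; |V| = 6.
T = {"Kill": 4, "Bomb": 3, "Kidnap": 6, "Music": 0, "Movie": 1, "TV": 1}   # n = 15
E = {"Kill": 0, "Bomb": 1, "Kidnap": 1, "Music": 5, "Movie": 4, "TV": 1}   # n = 12
V = 6
doc7 = {"Kill": 2, "Bomb": 1, "Kidnap": 2, "Music": 0, "Movie": 0, "TV": 1}

def score(counts, n, prior=0.5):
    p = prior
    for word, freq in doc7.items():
        p *= ((counts[word] + 1) / (n + V)) ** freq   # add-one smoothing
    return p

print(score(T, 15))   # ~5.7e-05  -> Terrorism
print(score(E, 12))   # ~2.4e-07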
TEXT CLASSIFICATION
END
TEXT CLASSIFICATION
CSE4022
Natural Language Processing
Topic: Regular Expressions & Automata
Prof. Tulasi Prasad Sariki,
SCSE, VIT Chennai Campus
www.learnersdesk.weebly.com
Contents
 Regular Expressions
 Regular Expression Patterns
 Operator precedence
 Applications
 Regular Expressions in MS-Word
 Finite Automata
 FSA / FST
 Applications of FSA & FST
REGULAR EXPRESSIONS AND AUTOMATA
Regular Expressions
In computer science, RE is a language used for specifying text search string.
A regular expression is a formula in a special language that is used for
specifying a simple class of string.
Formally, a regular expression is an algebraic notation for characterizing a
set of strings.
RE search requires
◦ a pattern that we want to search for, and
◦ a corpus of texts to search through.
REGULAR EXPRESSIONS AND AUTOMATA
Regular Expressions
A RE search function will search through the corpus returning all
texts that contain the pattern.
◦ In a Web search engine, they might be the entire documents or Web
pages.
◦ In a word-processor, they might be individual words, or lines of a
document.
◦ E.g., the UNIX grep command
Regular expressions are case sensitive.
We will use Perl based syntax for representation.
REGULAR EXPRESSIONS AND AUTOMATA
Regular Expressions
Disjunctions [abc]
Ranges [A-Z]
Negations [^Ss]
Optional characters ? and *
Wild cards .
Anchors ^ and $, also \b and \B
Disjunction, grouping, and precedence |
REGULAR EXPRESSIONS AND AUTOMATA
Regular Expression Patterns
Regular Expression | Example Pattern Matched
/woodchucks/       | “interesting links to woodchucks and lemurs”
/a/                | “Mary Ann stopped by Mona’s”
/Claire says,/     | “Dagmar, my gift please,” Claire says,
/song/             | “all our pretty songs”
/!/                | “You’ve left the burglar behind again!” said Nori
REGULAR EXPRESSIONS AND AUTOMATA
Regular Expression Patterns
The use of the brackets [] to specify a disjunction of characters.
Regular Expression | Match
/[wW]oodchuck/     | Woodchuck or woodchuck
/[abc]/            | “a”, “b”, or “c”
/[0123456789]/     | any digit
REGULAR EXPRESSIONS AND AUTOMATA
Regular Expression Patterns
The use of the brackets [] plus the dash - to specify a range.
Regular Expression      | Match                | Sample Pattern
/[A-Z]/                 | any uppercase letter | this is Linguistics 5981
/[0-9]/                 | any single digit     | this is Linguistics 5981
/[1 2 3 4 5 6 7 8 9 0]/ | any single digit     | this is Linguistics 5981
REGULAR EXPRESSIONS AND AUTOMATA
Regular Expression Patterns
To search for negation, i.e. a character that I do NOT want to find, we use the caret: [^]

Regular Expression | Match                   | Sample Pattern
/[^A-Z]/           | not an uppercase letter | this is Linguistics 5981
/[^L l]/           | neither L nor l         | this is Linguistics 5981
/[^\.]/            | not a period            | this is Linguistics 598

Special characters:
\* | an asterisk     | “L*I*N*G*U*I*S*T*I*C*S”
\. | a period        | “Dr.Doolittle”
\? | a question mark | “Is this Linguistics 5981 ?”
\n | a newline       |
\t | a tab           |
REGULAR EXPRESSIONS AND AUTOMATA
Regular Expression Patterns
To search for optional characters we use the question mark: ?

Regular Expression | Match           | Sample Pattern
/colou?r/          | colour or color | beautiful colour

To search for any number of a certain character we use the Kleene star: *

Regular Expression | Match
/a*/               | any string of zero or more “a”s
/aa*/              | at least one “a”, but also any number of “a”s
REGULAR EXPRESSIONS AND AUTOMATA
Regular Expression Patterns
To look for at least one character of a type we use the Kleene “+”:

Regular Expression | Match
/[0-9]+/           | a sequence of digits

Any combination is possible:

Regular Expression | Match
/[ab]*/            | zero or more “a”s or “b”s
/[0-9][0-9]*/      | any integer (= a string of digits)
REGULAR EXPRESSIONS AND AUTOMATA
Regular Expression Patterns
The “.” is a very special character, the so-called wildcard.

Regular Expression | Match                          | Sample Pattern
/b.ll/             | any character between b and ll | ball, bell, bull, bill

The /./ symbol is called a wildcard: it matches any single character. For example, the regular expression /s.ng/ matches the following English words: sang, sing, song, sung.

Note that /./ will match not only alphabetic characters, but also numeric and whitespace characters. Consequently, /s.ng/ will also match non-words such as s3ng. The pattern /....berry/ finds words like cranberry.
REGULAR EXPRESSIONS AND AUTOMATA
Regular Expression Patterns
Anchors (start of line: “^”, end of line: “$”)

Regular Expression | Match                                    | Sample Pattern
/^Linguistics/     | “Linguistics” at the beginning of a line | Linguistics is fun.
/linguistics\.$/   | “linguistics.” at the end of a line      | We like linguistics.

Anchors (word boundary: “\b”, non-boundary: “\B”)

Regular Expression | Match                     | Sample Pattern
/\bthe\b/          | “the” as a word on its own | This is the place.
/\Bthe\B/          | “the” inside another word  | This is my mother.
REGULAR EXPRESSIONS AND AUTOMATA
Regular Expression Patterns
More on alternative characters: the pipe symbol: “|” (disjunction)
Regular expression
match
sample pattern
/colou?r/
colour or color
beautiful colour
/progra(m|mme)/
program or programme
linguistics program
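The slides use Perl-style notation, but the same constructs carry over directly to most regex engines. A small assumed Python session using the re module to try a few of the patterns above:

import re

text = "The colour of Woodchuck fur is brown; woodchuck burrows are deep."

print(re.findall(r"[wW]oodchuck", text))     # ['Woodchuck', 'woodchuck']
print(re.findall(r"colou?r", text))          # ['colour']
print(re.findall(r"\b[0-9]+\b", "Linguistics 5981 meets at 10"))   # ['5981', '10']
print(bool(re.search(r"^The", text)))        # True: anchored at the start of the line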
REGULAR EXPRESSIONS AND AUTOMATA
Predefined Character class
Character Class | Description
\d | A digit. Equivalent to [0-9].
\D | A non-digit. Equivalent to [^0-9].
\s | A whitespace character. Equivalent to [ \t\n\x0B\f\r].
\S | A non-whitespace character. Equivalent to [^\s].
\w | A word character. Equivalent to [a-zA-Z_0-9].
\W | A non-word character. Equivalent to [^\w].
REGULAR EXPRESSIONS AND AUTOMATA
Boundary matchers
Boundary Matcher | Description
^  | The beginning of a line
$  | The end of a line
\b | A word boundary
\B | A non-word boundary
\A | The beginning of the text
\G | The end of the previous match
\Z | The end of the text (but for the final line terminator, if any)
\z | The end of the text
REGULAR EXPRESSIONS AND AUTOMATA
Quantifiers
Character | Description
{n}   | n is a nonnegative integer. Matches exactly n times. For example, 'o{2}' does not match the 'o' in "Bob", but matches the two o's in "food".
{n,}  | n is a nonnegative integer. Matches at least n times. For example, 'o{2,}' does not match the "o" in "Bob" but matches all the o's in "foooood". 'o{1,}' is equivalent to 'o+'. 'o{0,}' is equivalent to 'o*'.
{n,m} | m and n are nonnegative integers, where n <= m. Matches at least n and at most m times. For example, "o{1,3}" matches the first three o's in "fooooood". 'o{0,1}' is equivalent to 'o?'.
REGULAR EXPRESSIONS AND AUTOMATA
Operator precedence
A regular expression is evaluated from left to right and follows an order of
precedence, much like an arithmetic expression.
The following table illustrates, from highest to lowest, the order of precedence
of the various regular expression operators:
Operator(s)                           | Description
\                                     | Escape
(), (?:), (?=), []                    | Parentheses and brackets
*, +, ?, {n}, {n,}, {n,m}             | Quantifiers
^, $, \anymetacharacter, anycharacter | Anchors and sequences
|                                     | Alternation
REGULAR EXPRESSIONS AND AUTOMATA
Operator precedence
Characters have higher precedence than the alternation operator, which
allows 'm|food' to match "m" or "food". To match "mood" or "food", use
parentheses to create a subexpression, which results in '(m|f)ood'.
REGULAR EXPRESSIONS AND AUTOMATA
Applications
Regular Expressions for the Java Programming Language
• java.util.regex for enabling the use of regular expressions
Applications
• Simple word replacement
• Email validation
• Removal of control characters from a file
• File searching
REGULAR EXPRESSIONS AND AUTOMATA
Example
write a Perl regular expression to match the English article “the”:
/the/                           missed ‘The’
/[tT]he/                        included ‘the’ in ‘others’
/\b[tT]he\b/                    missed ‘the25’, ‘the_’
/[^a-zA-Z][tT]he[^a-zA-Z]/      missed ‘The’ at the beginning of a line
/(^|[^a-zA-Z])[tT]he[^a-zA-Z]/
REGULAR EXPRESSIONS AND AUTOMATA
Example
Write a regular expression that will match “any PC with more than
500MHz and 32 Gb of disk space for less than $1000”:
Price
◦ /$[0-9]+/                      # whole dollars
◦ /$[0-9]+\.[0-9][0-9]/          # dollars and cents
◦ /$[0-9]+(\.[0-9][0-9])?/       # cents optional
◦ /\b$[0-9]+(\.[0-9][0-9])?\b/   # word boundaries
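A practical note when trying these: in most regex engines a literal dollar sign has to be escaped as \$, since a bare $ is the end-of-line anchor. A small assumed Python sketch of the "cents optional" pattern:

import re

price = re.compile(r"\$[0-9]+(\.[0-9][0-9])?\b")
for s in ["on sale for $999", "only $1499.99 today", "around 800 dollars"]:
    m = price.search(s)
    print(s, "->", m.group(0) if m else None)
# $999, $1499.99, None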
REGULAR EXPRESSIONS AND AUTOMATA
Example
Specifications for processor speed
◦ /\b[0-9]+ *(MHz|[Mm]egahertz|Ghz|[Gg]igahertz)\b/
Memory size
◦ /\b[0-9]+ *(Mb|[Mm]egabytes?)\b/
◦ /\b[0-9](\.[0-9]+) *(Gb|[Gg]igabytes?)\b/
Vendors
◦ /\b(Win95|WIN98|WINNT|WINXP *(NT|95|98|2000|XP)?)\b/
◦ /\b(Mac|Macintosh|Apple)\b/
REGULAR EXPRESSIONS AND AUTOMATA
Example
Expression | Matches
/^\s*$/ | Match a blank line.
/\d{2}-\d{5}/ | Validate an ID number consisting of 2 digits, a hyphen, and an additional 5 digits.
/<\s*(\S+)(\s[^>]*)?>[\s\S]*<\s*\/\1\s*>/ | Match an HTML tag.
REGULAR EXPRESSIONS AND AUTOMATA
Regular Expressions in MS-Word
? and *
The two most basic wildcard characters are ? and *.
? is used to represent a single character and * represents any number of characters.
 s?t will find sat, set, sit, and any other combination of 3 characters beginning with “s” and ending with “t”. Ex: inset.
 s*t will find all the above, but will also find “secret”, “serpent”, “sailing boat” and “sign over document”, etc.
@
 @ is used to find one or more occurrences of the previous character.
 For example, lo@t will find lot or loot, ful@ will find ful or full etc.
REGULAR EXPRESSIONS AND AUTOMATA
Regular Expressions in MS-Word
<>
 <s*t> would find “secret” and “serpent” and “sailing boat”, but not “sailing boats” or “sign over documents”. It will also find “set” in “tea-set”, but not “set” in “toolset”.
 The <> tags can be used in pairs, as above; or individually.
 ful@> will find “full” and the appropriate part of “wilful”, but will not find “wilfully”.
REGULAR EXPRESSIONS AND AUTOMATA
Regular Expressions in MS-Word
[]
Square brackets are always used in pairs and are used to identify specific characters
or ranges of characters.
 [abc] will find any of the letters a, b, or c.
 [A-Z] will find any upper case letter.
 [13579] will find any odd digit.
\
 If you wish to search for a character that has a special meaning in wildcard searches
– the obvious example being “?” – then you can do so by putting a backslash in front
of it.
 [\?] will not find “\” followed by any character; but will find “?”
REGULAR EXPRESSIONS AND AUTOMATA
Regular Expressions in MS-Word
[!]
 [!] is very similar to [ ] except in this case it finds any character not listed in the box
so [!o] would find every character except “o”.
 You can use ranges of characters in exactly the same was as with [ ], thus [!A-Z] will
find everything except upper case letters.
REGULAR EXPRESSIONS AND AUTOMATA
Regular Expressions in MS-Word
{}
 Curly brackets are used for counting occurrences of the previous character or
expression.
 {n} This finds exactly the number “n” of occurrences of the previous character (so
for example, a{2} will find “aa”).
 {n,m} finds text containing between “n” and “m” occurrences of the previous
character or expression; so a{2,3} will find “aa” and “aaa”, but only the first 3
characters in “aaaa” ).
REGULAR EXPRESSIONS AND AUTOMATA
Regular Expressions in MS-Word
()
 Round brackets have no effect on the search pattern, but are used to divide the
pattern into logical sequences where you wish to re-assemble those sequences in a
different order during the replace – or to replace only part of that sequence.
 They must be used in pairs and are addressed by number in the replacement.
 Eg: (Tulasi) (Prasad) replaced by \2 \1 (note the spaces in the search and replace
strings) – will produce Prasad Tulasi or replaced by \2 alone will give Prasad.
^
 The ^ (“caret”) character is not specific to wildcard searches but it sometimes has to
be used slightly differently from normal, when searching for wildcards.
REGULAR EXPRESSIONS AND AUTOMATA
Finite Automata
 The regular expression is more than just a convenient meta-language for text
searching.
 Any regular expression can be implemented as a finite-state automaton.
 Symmetrically, any finite-state automaton can be described with a regular
expression.
 Regular expression is one way of characterizing a particular kind of formal
language called a regular language.
 Both regular expressions and finite-state automata can be used to describe
regular languages.
REGULAR EXPRESSIONS AND AUTOMATA
Finite Automata
The relationship between finite state automata, regular expression, and
regular language
Finite state automata (computational device)
Regular expression (descriptive notation)
Regular language (set of objects)
REGULAR EXPRESSIONS AND AUTOMATA
What is a Finite-State Automaton?
 An alphabet of symbols,
 A finite set of states,
 A transition function from states and symbols to states,
 A distinguished member of the set of states called the start state, and
 A distinguished subset of the set of states called final states.
 FSA recognize the regular languages represented by regular expressions
 Directed graph with labeled nodes and arc transitions
REGULAR EXPRESSIONS AND AUTOMATA
Formally
 FSA is a 5-tuple consisting of
 Q: a finite set of N states q0, q1, …, qN
 Σ: a finite input alphabet of symbols
 q0: the start state
 F: the set of final states, F ⊆ Q
 δ(q,i): a transition function mapping Q x Σ to Q
REGULAR EXPRESSIONS AND AUTOMATA
FSA Accepter
[Figure: an FSA acts as an accepter: given an input string, the finite automaton outputs “Accept” or “Reject”]
REGULAR EXPRESSIONS AND AUTOMATA
Transition Graph
[Figure: transition graph of a finite accepter for abba. Initial state q0; transitions q0 -a-> q1 -b-> q2 -b-> q3 -a-> q4, where q4 is the final (“accept”) state; all other transitions lead to the trap state q5, which loops on a, b]
REGULAR EXPRESSIONS AND AUTOMATA
Initial Configuration and Reading Input
[Figure: the accepter processes the input string a b b a one symbol at a time, moving q0 -a-> q1 -b-> q2 -b-> q3 -a-> q4; after the last symbol it is in the final state q4. Output: Accepted]
REGULAR EXPRESSIONS AND AUTOMATA
Using an FSA to Recognize Sheep talk
REGULAR EXPRESSIONS AND AUTOMATA
Using an FSA to Recognize Sheep Talk
Sheep language can be defined as any string from the following (infinite) set:
baa!
baaa!
baaaa!
baaaaa!
....
The regular expression for this kind of sheeptalk is /baa+!/
All REs can be represented as FSAs.
[Figure: FSA for sheeptalk with states q0 -b-> q1 -a-> q2 -a-> q3 -!-> q4, and a self-loop labelled a on q3]
REGULAR EXPRESSIONS AND AUTOMATA
State Transition Table for Sheep Talk
State | Input: b | Input: a | Input: !
0     |    1     |    Ø     |    Ø
1     |    Ø     |    2     |    Ø
2     |    Ø     |    3     |    Ø
3     |    Ø     |    3     |    4
4:    |    Ø     |    Ø     |    Ø
(The colon on state 4 marks it as the accepting state.)
REGULAR EXPRESSIONS AND AUTOMATA
Algorithm
function D-RECOGNIZE(tape,machine) returns accept or reject
index <- Beginning of tape
current-state <- Initial state of machine
loop
if End of input has been reached then
if current-state is an accept state then
return accept
else
return reject
elseif transition-table[current-state,tape[index]] is empty then
return reject
else
current-state <- transition-table[current-state,tape[index]]
index <- index +1
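The pseudocode translates almost line for line into Python. The sketch below (an assumed implementation, not from the slides) encodes the sheep-talk transition table as a dictionary and rejects as soon as it hits an empty table cell:

transitions = {
    (0, "b"): 1,
    (1, "a"): 2,
    (2, "a"): 3,
    (3, "a"): 3,
    (3, "!"): 4,
}

def d_recognize(tape, transitions, start=0, accept=(4,)):
    state = start
    for symbol in tape:
        if (state, symbol) not in transitions:   # empty table cell -> reject
            return "reject"
        state = transitions[(state, symbol)]
    return "accept" if state in accept else "reject"

print(d_recognize("baaa!", transitions))   # accept
print(d_recognize("ba!", transitions))     # reject (needs at least two a's)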
REGULAR EXPRESSIONS AND AUTOMATA
Using an FSA to Recognize Sheep Talk
 FSA recognizes (accepts) strings of a regular language
 baa!
 baaa!
 baaaa!
…
[Figure: the sheeptalk FSA q0 -b-> q1 -a-> q2 -a-> q3 (self-loop a) -!-> q4]
 Tracing the execution of the FSA on some sheep talk:
[Figure: reading the tape b a a a !, the machine passes through the states q0, q1, q2, q3, q3 and ends in the accepting state q4]
REGULAR EXPRESSIONS AND AUTOMATA
Adding a fail state to FSA
[Figure: the sheeptalk FSA augmented with an explicit fail state qF: every input symbol that has no legal transition from the current state is redirected to qF, from which the machine can never reach the accepting state q4]
REGULAR EXPRESSIONS AND AUTOMATA
Adding an else arch
REGULAR EXPRESSIONS AND AUTOMATA
Adding ϵ Transition
[Figure: a version of the sheeptalk automaton q0 -b-> q1 -a-> q2 -a-> q3 -!-> q4 in which an ϵ (empty-string) transition from q3 back to q2 replaces the self-loop used earlier to accept one or more a's]
REGULAR EXPRESSIONS AND AUTOMATA
Example FSA
An FSA for the words of English numbers 1-99
REGULAR EXPRESSIONS AND AUTOMATA
FSA for NLP
 Word Recognition
 Dictionary Lookup
 Spelling Conventions
REGULAR EXPRESSIONS AND AUTOMATA
Word Recognition
A word recognizer takes a string of characters as input and returns “yes”
or “no” according as the word is or is not in a given set.
Solves the membership problem.
 e.g. Spell Checking, Scrabble(Un-ordered Concatenation)
Approximate methods
 Has right set of letters (any order).
 Has right sounds (Soundex).
 Random (superimposed) coding (Unix Spell)
REGULAR EXPRESSIONS AND AUTOMATA
Word Recognition
Exact Methods
 Hashing
 Search (linear, binary ...)
 Digital search (“Tries”)
 Finite-state automata
REGULAR EXPRESSIONS AND AUTOMATA
Dictionary Lookup
Dictionary lookup takes a string of characters as input and returns “yes” or “no”
according as the word is or is not in a given set and returns information about
the word.
Lookup Methods
Approximate — guess the information
 If it ends in “ed”, it’s a past-tense verb.
 Exact — store the information for finitely many words
 Table Lookup
 Hash
 Search
REGULAR EXPRESSIONS AND AUTOMATA
Finite State Transducers
A finite state transducer essentially is a finite state automaton that works
on two (or more) tapes. The most common way to think about
transducers is as a kind of ``translating machine''. They read from one of
the tapes and write onto the other.
[Figure: a one-state transducer: state q0 with a looping arc labelled a:b]
a:b at the arc means that in this transition the transducer reads a from the
first tape and writes b onto the second.
REGULAR EXPRESSIONS AND AUTOMATA
Finite State Transducers
Transducer behaves as follows in the different modes.
 generation mode: It writes a string of a's on one tape and a string of b's on the other tape. Both strings have the same length.
 recognition mode: It accepts when the word on the first tape consists of exactly as many a's as the word on the second tape consists of b's.
 translation mode (left to right): It reads a's from the first tape and writes a b onto the second tape for every a that it reads.
 translation mode (right to left): It reads b's from the second tape and writes an a onto the first tape for every b that it reads.
 relator mode: Computes relations between sets
REGULAR EXPRESSIONS AND AUTOMATA
FST vs FSA
FSA can act as a
 Recognizer
 Generator
 5-tuple representation
 Equivalent to regular languages

FST can act as a
 Recognizer
 Generator
 Translator
 Set relator
 7-tuple representation
 Equivalent to regular relations
REGULAR EXPRESSIONS AND AUTOMATA
FST Operations
 Inversion: Switching input and output labels
 If T maps from I to O, T-1 maps from O to I
 Composition:
 If T1 is a transducer from I1 to O1 and T2 is a transducer from I2 to O2,
then T1  T2 is a transducer from I1 to O2.
REGULAR EXPRESSIONS AND AUTOMATA
FST for NLP
 Tokenization
 Morphological analysis
 Transliteration
 Parsing
 Translation
 Speech recognition
 Spoken language understanding
REGULAR EXPRESSIONS AND AUTOMATA
REGULAR EXPRESSIONS AND AUTOMATA
CSE4022
Natural Language Processing
Topic: Text Summarization
Prof. Tulasi Prasad Sariki,
SCSE, VIT Chennai Campus
www.learnersdesk.weebly.com
Rapid growth of data
The problem:
◦ 4 Billion URLs indexed by Google
◦ 200 TB of data on the Web [Lyman and Varian 03]
Possible approaches:
◦ information retrieval
◦ document clustering
◦ information extraction
◦ visualization
◦ question answering
◦ text summarization
TEXT SUMMARIZATION
Text Summarization
Automatic Text Summarization
◦ No User interaction, system will return the condensed or summarized
form.
Query Specific Summarization
◦ User interaction will be there he/she will supply some input as
keywords based on that summary will be generated.
TEXT SUMMARIZATION
Automatic Text Summarization
Automatic text summarization is the technique where a computer
automatically creates an abstract or gist of one or more text documents.
Text summarization is the process of distilling the most important
information from a source (or sources) to produce an abridged version for
a particular user (or users)and task (or tasks).
TEXT SUMMARIZATION
Typical 3 Stages of Summarization
1. Topic Identification: find/extract the most important material
2. Topic Interpretation: compress it
3. Summary Generation: say it in your own words
…as easy as that!
TEXT SUMMARIZATION
Aspects that describe Summaries
Source (Input):
◦ Source: single-document vs. multi-document
◦ Language: mono-lingual vs. multi-lingual vs. cross-lingual
◦ Genre: news vs. technical report vs. scientific paper etc.
◦ Specificity: domain-specific vs. general
◦ Length: short (1–2 pages) vs. long (> 2 pages)
◦ Media: text, graphics, audio, video, multi-media etc.
TEXT SUMMARIZATION
Aspects that describe Summaries
Purpose:
◦ Use: generic vs. query-oriented (aimed to a specific information need)
◦ Purpose: what the summary is used for (e.g. alert, preview, inform, digest,
provide biographical information)
◦ Audience: untargeted vs. targeted (aimed at a specific audience)
Composition (Output):
◦ Derivation: extract vs. abstract
◦ Format: running text, tables, geographical displays, time lines, charts,
illustrations etc.
◦ Partiality: neutral vs. evaluative (adding sentiment/values)
TEXT SUMMARIZATION
Query-Driven vs. Text-Driven
Top-down: Query-driven focus
◦ Criteria of interest encoded as search specs.
◦ System uses specs to filter or analyze text portions.
◦ Examples: templates with slots with semantic characteristics; term lists of
important terms.
Bottom-up: Text-driven focus
◦ Generic importance metrics encoded as strategies.
◦ System applies strategies over rep of whole text.
◦ Examples: degree of connectedness in semantic graphs; frequency of
occurrence of tokens.
TEXT SUMMARIZATION
Extract not Abstract
Extraction is much easier than abstraction
Abstraction needs understanding and rewriting
Most automatic summarization tools make extracts, not abstracts
They use original sentences or parts of sentences to create the ”abstract”
TEXT SUMMARIZATION
Some Extraction Methods
General method: score each sentence; choose best sentence(s)
Scoring techniques:
◦ Position in the text: lead method; optimal position policy; title/heading
method
◦ Cue phrases in sentences
◦ Word frequencies throughout the text
◦ Cohesion: links among words; word co-occurrence; coreference; lexical chains
◦ Information Extraction: parsing and analysis
TEXT SUMMARIZATION
Word Frequency[Luhn58]
Steps:
◦ Count all word occurrences (after stemming)
◦ Ignore extreme frequencies.
◦ Give every word a score according to frequency.
◦ Calculate the importance of each sentence as the sum of its word scores.
◦ Take the physical distance between important words into consideration.
◦ Extract the N sentences with the highest scores.
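A bare-bones Python sketch of this frequency-based scoring (assumed code, ignoring stemming, stop-word removal, the extreme-frequency cut-offs and the distance-between-keywords refinement):

import re
from collections import Counter

def summarize(text, n_sentences=2):
    # Split into sentences, count word frequencies over the whole text,
    # score each sentence as the sum of its word frequencies,
    # and return the n highest-scoring sentences.
    sentences = re.split(r"(?<=[.!?])\s+", text.strip())
    freq = Counter(re.findall(r"\w+", text.lower()))
    def score(sentence):
        return sum(freq[w] for w in re.findall(r"\w+", sentence.lower()))
    return sorted(sentences, key=score, reverse=True)[:n_sentences]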
TEXT SUMMARIZATION
Position: Title-Based Method
Words in titles and headings are positively relevant to summarization.
Shown to be statistically valid at the 99% level of significance (Edmundson, 68).
Empirically shown to be useful in summarization systems.
TEXT SUMMARIZATION
Cue words and phrases
Baxendale (1958) identified two sets of phrases
◦ bonus phrases -> that tend to signal when a sentence is a likely candidate for
inclusion in a summary
◦ stigma phrases -> that tend to signal when it is definitely not a candidate,
respectively.
‘Bonus phrases’ such as "in summary", "in conclusion", and superlatives
such as "the best", "the most important" can be good indicators of
important content.
‘Stigma phrases’ such as “hardly” and “impossible” may indicate non-important sentences.
TEXT SUMMARIZATION
Cue words and phrases
Cue words and phrases, such as "in conclusion", "important", "in this
paper", "this paper", "this article", "this document", and "we conclude“
etc. can be very useful to determine signals of relevance or irrelevance.
During processing, the Cue Phrase Module simply rewards each sentence
containing a cue phrase with an appropriate score (constant per cue
phrase) and penalizes those containing stigma phrases.
TEXT SUMMARIZATION
Multiple Methods
Cue-Phrase Method: Some phrases imply significance: “significant”,
“impossible”, ”hardly”, etc.
Key Method: Word frequencies, as in Luhn (for ATS).
Title Method: Titles are important, and so are the words they contain; sentences containing title words play a major role in the summary.
Location Method: First and Last sentences of a paragraph, sentences
following titles.
TEXT SUMMARIZATION
Multiple Methods
The sentence importance is calculated as a linear combination of the different methods:
Sentence Score = ß1 Cue + ß2 Key + ß3 Title + ß4 Location.
Adjust the coefficients to control each method's significance.
TEXT SUMMARIZATION
Cohesion: Lexical chains method
But Mr. Kenny’s move speeded up work on a machine which uses microcomputers to control the rate at which an anaesthetic is pumped into the blood
of patients undergoing surgery. Such machines are nothing new. But Mr. Kenny’s
device uses two personal-computers to achieve much closer monitoring of the
pump feeding the anaesthetic into the patient. Extensive testing of the
equipment has sufficiently impressed the authorities which regulate medical
equipment in Britain, and, so far, four other countries, to make this the first such
machine to be licensed for commercial sale to hospitals.
TEXT SUMMARIZATION
Lexical chains-based method
Assumes that important sentences are those that are ‘traversed’ by
strong chains (Barzilay and Elhadad, 97).
◦ Strength(C) = length(C) - #DistinctOccurrences(C)
◦ For each chain, choose the first sentence that is traversed by the chain
and that uses a representative set of concepts from that chain.
TEXT SUMMARIZATION
Lexical Chains(Barzilay97)
Lexical Chain : A chain of semantically related words
TEXT SUMMARIZATION
Lin - set of summarization methods
Sentence order: Sentence order in the text indicates the importance of the sentences; the first sentence gets the highest ranking, the last sentence the lowest.
Title: Sentences containing words in the title get high score.
Term frequency (tf): Open class terms which are frequent in the text are more important than the less frequent ones. Open class terms belong to word classes (nouns, verbs, adjectives, adverbs) that readily admit new members.
Position score: The assumption is that certain genres put important sentences in fixed positions. For example, newspaper articles have the most important content in the first 4 paragraphs.
TEXT SUMMARIZATION
Lin - set of summarization methods
Query signature: The query of the user affects the summary in the way that the extract will contain these query words.
Sentence length: The sentence length implies which sentence is the most
important.
Average lexical connectivity: Number terms shared with other sentences.
The assumption is that a sentence that share more terms with other
sentences is more important.
Numerical data: Sentences containing numerical data obtain boolean value 1 and are scored higher than the ones without numerical values.
TEXT SUMMARIZATION
Lin - set of summarization methods
Proper name: Sentences containing proper names will be given
higher score
Weekdays and Months: Sentences containing Weekdays and Months
will be given higher score
Quotation: Sentences containing quotations might be important for
certain questions from user
First sentence: First sentence of each paragraphs are the most
important sentences
TEXT SUMMARIZATION
Lin - set of summarization methods
Decision tree combination function: All the above parameters were
put into decision tree and trained on set of texts and manual
summarized texts.
Simple combination function: All the above parameter were
normalized and put in a combination function with no special
weighting.
TEXT SUMMARIZATION
TEXT SUMMARIZATION
CSE4022
Natural Language Processing
Topic: Parts-of-Speech Tagging
Dr. Tulasi Prasad Sariki,
SCOPE, VIT Chennai Campus
www.learnersdesk.weebly.com
Definition
The process of assigning a part-of-speech or other lexical class
marker to each word in a corpus.
[Figure: the words “the mom kissed the boy on the cheek”, each assigned a tag from the set {N, V, P, DET}]
POS TAGGING
2
Definition
 Annotate each word in a sentence with a part-of-speech marker.
 Lowest level of syntactic analysis.
 Useful for subsequent syntactic parsing and word sense disambiguation.
 Example
John saw the saw and decided to take it to the table.
NNP VBD DT NN CC VBD TO VB PRP IN DT NN
POS TAGGING
3
An Example
WORD    LEMMA   TAG
the     the     +DET
mom     mom     +NOUN
kissed  kiss    +VPAST
the     the     +DET
boy     boy     +NOUN
on      on      +PREP
the     the     +DET
cheek   cheek   +NOUN
POS TAGGING
4
English POS Tagsets
 Original Brown corpus used a large set of 87 POS tags.
 Most common in NLP today is the Penn Treebank set of 45 tags.
Reduced from the Brown set for use in the context of a parsed corpus
(i.e. treebank).
 The C5 tagset used for the British National Corpus (BNC) has 61 tags.
POS TAGGING
5
Word Classes
Basic word classes: Noun, Verb, Adjective, Adverb, Preposition, …
Open vs. Closed classes
◦ Open:
◦ Nouns, Verbs, Adjectives, Adverbs.
◦ Why “open”?
◦ Closed:
◦ determiners: a, an, the
◦ pronouns: she, he, I
◦ prepositions: on, under, over, near, by, …
POS TAGGING
6
Closed vs. Open Class
Closed class categories are composed of a small, fixed set of
grammatical function words for a given language.
 prepositions: on, under, over, …
 particles: up, down, on, off, …
 determiners: a, an, the, …
 pronouns: she, who, I, ..
 conjunctions: and, but, or, …
 auxiliary verbs: can, may should, …
POS TAGGING
7
Closed vs. Open Class
Open class categories have large number of words and new ones are
easily invented.
 Nouns: new nouns appear all the time, e.g. Internet, website, URL, CD-ROM, email, newsgroup, bitmap, modem, multimedia
 Verbs: new verbs too, e.g. download, upload, reboot, right-click, double-click, (to) Google
 Adjectives (geeky)
 Adverbs (chompingly)
POS TAGGING
8
English Parts of Speech (Nouns)
Noun (person, place or thing)
 Singular (NN): dog, fork
 Plural (NNS): dogs, forks
 Proper (NNP, NNPS): John, Springfields
 Personal pronoun (PRP): I, you, he, she, it
 Wh-pronoun (WP): who, what
POS TAGGING
9
English Parts of Speech (Nouns)
Proper nouns (Penn, Philadelphia, Davidson)
 English capitalizes these.
Common nouns (the rest).
Count nouns and mass nouns
 Count: have plurals, get counted: goat/goats,
 Mass: don’t get counted (snow, salt, water,)
POS TAGGING
10
English Parts of Speech (Verbs)
Verb (actions and processes)
 Base, infinitive (VB): eat
 Past tense (VBD): ate
 Gerund (VBG): eating
 Past participle (VBN): eaten
 Non 3rd person singular present tense (VBP): eat
 3rd person singular present tense: (VBZ): eats
 Modal (MD): should, can
 To (TO): to (to eat)
POS TAGGING
11
English Parts of Speech
(Adjectives)
Adjective (modify nouns, identify properties or qualities of nouns)
 Basic (JJ): red, tall
 Comparative (JJR): redder, taller
 Superlative (JJS): reddest, tallest
Adjective ordering restrictions in English:
Old blue book, not Blue old book
 the 44th president
 a green product
 a responsible investment
 the dumbest, worst leader
POS TAGGING
12
English Parts of Speech
(Adverbs)
Adverb (modify verbs)
 Basic (RB): quickly
 Comparative (RBR): quicker
 Superlative (RBS): quickest
Unfortunately, John walked home extremely slowly yesterday
 Directional/locative adverbs (here, downhill)
 Degree adverbs (extremely, very, somewhat)
 Manner adverbs (slowly, slinkily, delicately)
 Temporal adverbs (yesterday, tomorrow)
POS TAGGING
13
English Parts of Speech
(Determiner)
Is a word that occurs together with a noun or noun phrase and serves to
express the reference of that noun or noun phrase in the context.
That is, a determiner may indicate whether the noun is referring to a
definite or indefinite element of a class, to a closer or more distant
element, to an element belonging to a specified person or thing, to a
particular number or quantity, etc.
POS TAGGING
14
English Parts of
Speech(Determiner)
Common kinds of determiners include
 definite and indefinite articles (the, a, an)
 demonstratives (this, that, these)
 possessive determiners (my, their)
 quantifiers (many, few , several).
POS TAGGING
15
English Parts of Speech
( preposition)
Preposition (IN): a word governing, and usually preceding, a noun or
pronoun and expressing a relation to another word or element in the
clause, as in ‘the man on the platform’, ‘she arrived after dinner’.
Ex: on, in, by, to, with
POS TAGGING
16
English Parts of Speech
Coordinating Conjunction (CC): a word that connects words, sentences, phrases or clauses.
the truth of nature, and the power of giving interest
Ex: and, but, or.
Particle (RP): a particle is a function word that must be associated with
another word or phrase to impart meaning, i.e., does not have its own
lexical definition.
Ex: off (took off), up (put up)
POS TAGGING
17
POS tagging
 POS Tagging is a process that attaches each word in a sentence
with a suitable tag from a given set of tags.
 Tagging is the assignment of a single part-of-speech tag to each
word (and punctuation marker) in a corpus.
The set of tags is called the Tag-set.
 Standard Tag-set : Penn Treebank (for English).
POS TAGGING
18
POS tagging
 There are so many parts of speech, potential distinctions we can
draw.
 To do POS tagging, we need to choose a standard set of tags to
work with.
 Could pick very coarse tag sets.
N, V, Adj, Adv.
 More commonly used set is finer grained (Penn TreeBank, 45 tags)
PRP$, WRB, WP$, VBG
POS TAGGING
19
POS Tag Ambiguity
 Deciding on the correct part of speech can be difficult even for
people.
 In English : I bank1 on the bank2 on the river bank3 for my
transactions.
 Bank1 is verb, the other two banks are nouns
 In Hindi :
 “Khaanaa” : can be noun (food) or verb (to eat)
POS TAGGING
20
Measuring Ambiguity
POS TAGGING
21
How Hard is POS Tagging?
 About 11% of the word types in the Brown corpus are ambiguous
with regard to part of speech
 But they tend to be very common words
 40% of the word tokens are ambiguous
POS TAGGING
22
Penn TreeBank POS Tagset
POS TAGGING
23
Using the Penn Tagset
 The/DT grand/JJ jury/NN commmented/VBD on/IN a/DT
number/NN of/IN other/JJ topics/NNS ./.
 Prepositions and subordinating conjunctions marked IN
(“although/IN I/PRP..”)
 Except the preposition/complementizer “to” is just marked
“TO”.
POS TAGGING
24
Process
 List all possible tag for each word in sentence.
 Choose best suitable tag sequence.
 Example
 ”People jump high”.
 People : Noun/Verb
 jump : Noun/Verb
 high : Noun/Verb/Adjective
 We can start with probabilities.
POS TAGGING
25
Example
POS TAGGING
26
Why POS
 POS tell us a lot about a word (and the words near it).
 E.g, adjectives often followed by nouns
 personal pronouns often followed by verbs
 possessive pronouns by nouns
 Pronunciation depends on POS, e.g.
 OBject (noun, stress on the first syllable) vs obJECT (verb, stress on the second syllable); similarly content, discount
 First step in many NLP applications
POS TAGGING
27
Rule-Based Tagging
 Start with a dictionary.
 Assign all possible tags to words from the dictionary.
 Write rules by hand to selectively remove tags.
 Leaving the correct tag for each word.
POS TAGGING
28
Step1: Start with a Dictionary
she:       PRP
promised:  VBN, VBD
to:        TO
back:      VB, JJ, RB, NN
the:       DT
bill:      NN, VB
Etc… for the ~100,000 words of English with more than 1 tag
POS TAGGING
29
Step2: Assign Every Possible Tag
[Figure: every word is assigned all of its possible tags:
She: PRP; promised: VBN, VBD; to: TO; back: VB, JJ, RB, NN; the: DT; bill: NN, VB]
POS TAGGING
30
Step3: Write Rules to Eliminate Tags
Eliminate VBN if VBD is an option when VBN|VBD follows “<start> PRP”
[Figure: in the tag lattice for “She promised to back the bill”, the VBN candidate for “promised” is eliminated, leaving VBD; a small sketch of this rule follows below]
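A toy Python sketch (assumed, not from the slides) of applying that single elimination rule to the candidate-tag lattice above:

candidates = {
    "She":      {"PRP"},
    "promised": {"VBN", "VBD"},
    "to":       {"TO"},
    "back":     {"VB", "JJ", "RB", "NN"},
    "the":      {"DT"},
    "bill":     {"NN", "VB"},
}
tokens = ["She", "promised", "to", "back", "the", "bill"]

# Rule: eliminate VBN if VBD is also an option and the previous token
# (here the sentence-initial word) can only be tagged PRP.
for prev, word in zip(tokens, tokens[1:]):
    tags = candidates[word]
    if {"VBN", "VBD"} <= tags and candidates[prev] == {"PRP"}:
        tags.discard("VBN")

print(candidates["promised"])   # {'VBD'}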
POS TAGGING
31
POS TAGGING
32
END
POS TAGGING
33
CSE528
Natural Language Processing
Venue:ADB-405
Prof. Tulasi Prasad Sariki,
SCSE, VIT Chennai Campus
www.learnersdesk.weebly.com
Topic: Lexical Relations
Contents
 Lexical Semantics
 Lexeme and Lexicon
 Relationships among Lexemes & senses
LEXICAL RELATIONS
Lexical Semantics
Lexical semantics covers theories of the classification and decomposition
of word meaning, the differences and similarities in lexical semantic
structure between different languages, and the relationship of word
meaning to sentence meaning and syntax.
Lexical semantics explores
 Whether the meaning of a lexical unit is established by looking at its
neighborhood in the semantic net , or if the meaning is already locally
contained in the lexical unit.
 Another topic that is explored is the mapping of words to concepts.
LEXICAL RELATIONS
Lexeme and Lexicon
Lexeme: A lexeme is the minimal unit of language which has a
semantic interpretation and embodies a distinct cultural concept.
Lexicon: Finite list of Lexeme.
Orthographic form & Phonological form to refer to the appropriate
form part of lexeme.
Sense is one of the meanings of a word.
LEXICAL RELATIONS
Relationships among Lexemes & senses
 Homonymy
 Polysemy
 Synonymy
 Antonymy
 Hypernomy
 Hyponomy
 Meronomy
LEXICAL RELATIONS
Homonymy
A homonym is, in the strict sense, one of a group of words that share
the same spelling and pronunciation but may have different
meanings.
 Lexemes that share a form
 Phonological, orthographic or both
 But have unrelated, distinct meanings
LEXICAL RELATIONS
Homonymy
Example:
 Bat (wooden stick-like thing) vs Bat (flying scary mammal thing)
 bank (financial institution) vs bank (riverside)
 Homophones:
 Write and right
 Piece and peace
LEXICAL RELATIONS
Homonymy causes problems for NLP
applications
Text-to-Speech
 Same orthographic form but different phonological form
 bass vs bass
Information retrieval
 Different meanings same orthographic form
 QUERY: bat care
Machine Translation
Speech recognition
LEXICAL RELATIONS
Polysemy
The bank is constructed from red brick.
I withdrew the money from the bank .
Are those the same sense?
Which sense of bank is this?
 Is it distinct from (homonymous with) the river bank sense?
A single lexeme with multiple related meanings (bank the building,
bank the financial institution)
Most non-rare words have multiple meanings
LEXICAL RELATIONS
Polysemy
 Verbs tend more to polysemy
 Distinguishing polysemy from homonymy.
 Example: Consider the verb got
"procure" (I got the tickets),
"become" (she got scared),
"understand" (I get it) .
LEXICAL RELATIONS
Synonyms
Word that have the same meaning in some or all contexts.
 couch / sofa
 big / large
 automobile / car
 vomit / throw up
 Water / H20
Two lexemes are synonyms if they can be successfully substituted for
each other in all situations
LEXICAL RELATIONS
Synonymy is a relation between senses
rather than words
Consider the words big and large
Are they synonyms?
 How big is that plane?
 Would I be flying on a large or small plane?
How about here:
 Miss Nelson, for instance, became a kind of big sister to Benjamin.
 ?Miss Nelson, for instance, became a kind of large sister to Benjamin.
Why?
 big has a sense that means being older, or grown up
 large lacks this sense
LEXICAL RELATIONS
Antonyms
Senses that are opposites with respect to one feature of their meaning
Otherwise, they are very similar!
 dark / light
 short / long
 hot / cold
 up / down
 in / out
More formally: antonyms can
 Define a binary opposition or at opposite ends of a scale (long/short, fast/slow)
 Be reversives: rise/fall, up/down
LEXICAL RELATIONS
Hyponymy
One sense is a hyponym of another if the first sense is more specific,
denoting a subclass of the other
 car is a hyponym of vehicle
 dog is a hyponym of animal
 mango is a hyponym of fruit
Conversely
 vehicle is a hypernym/superordinate of car
 animal is a hypernym of dog
 fruit is a hypernym of mango
LEXICAL RELATIONS
Hypernymy more formally
superordinate: vehicle   fruit   furniture   mammal
hyponym:       car       mango   chair       dog
Extensional:
 The class denoted by the superordinate
 extensionally includes the class denoted by the hyponym
Entailment:
 A sense A is a hyponym of sense B if being an A entails being a B
Hyponymy is usually transitive
 A hypo B and B hypo C entails A hypo C
LEXICAL RELATIONS
Meronomy
A meronomy or partonomy is a type of hierarchy that deals with part–
whole relationships.
relation between two concepts (classes) or objects (instances) – ‘whole’
and ‘part’
class-level
: every instance of the whole includes one or more instances of the part
instance-level
: the whole instance includes the part instance
LEXICAL RELATIONS
Meronomy
A wife has a husband.
A husband is a part of a wife.
A sound has a pitch and a volume.
A volume is a part of a sound.
Changing nappies is part of being a mother.
Being a mother has changing nappies.
A book has pages.
A page is a part of a book
LEXICAL RELATIONS
Meronomy
Cars have parts: engine, headlight, wheel
 Engines have parts: crankcase, carburetor
 Headlights have parts: headlight bulb, reflector
A meronomy is a partial ordering of concept types by the part–
whole relation
LEXICAL RELATIONS
Meronomy
Axioms:
Transitive – "Parts of parts are parts of the whole" – if A is part of B and B
is part of C, then A is part of C.
Reflexive – "Everything is part of itself" – A is part of A.
Antisymmetric – "Nothing is a part of its parts" – if A is part of B and A !=
B then B is not part of A.
LEXICAL RELATIONS
LEXICAL RELATIONS
An introduction to part-of-speech tagging and the Hidden Markov Model
Let’s go back into the times when we had no language to communicate. The only way we had was
sign language. That’s how we usually communicate with our dog at home, right? When we tell him,
“We love you, Jimmy,” he responds by wagging his tail. This doesn’t mean he knows what we are
actually saying. Instead, his response is simply because he understands the language of emotions
and gestures more than words.
We as humans have developed an understanding of a lot of nuances of the natural language more
than any animal on this planet. That is why when we say “I LOVE you, honey” vs when we say
“Lets make LOVE, honey” we mean different things. Since we understand the basic difference
between the two phrases, our responses are very different. It is these very intricacies in natural
language understanding that we want to teach to a machine.
What this could mean is when your future robot dog hears “I love you, Jimmy”, he would know
LOVE is a Verb. He would also realize that it’s an emotion that we are expressing to which he
would respond in a certain way. And maybe when you are telling your partner “Lets make LOVE”,
the dog would just stay out of your business.
This is just an example of how teaching a robot to communicate in a language known to us can
make things easier.
The primary use case being highlighted in this example is how important it is to understand the
difference in the usage of the word LOVE, in different contexts.
Part-of-Speech Tagging
From a very small age, we have been made accustomed to identifying part of speech tags. For
example, reading a sentence and being able to identify what words act as nouns, pronouns, verbs,
adverbs, and so on. All these are referred to as the part of speech tags.
Let’s look at the Wikipedia definition for them:
In corpus linguistics, part-of-speech tagging (POS tagging or PoS tagging or POST),
also called grammatical tagging or word-category disambiguation, is the process of
marking up a word in a text (corpus) as corresponding to a particular part of speech,
based on both its definition and its context — i.e., its relationship with adjacent and
related words in a phrase, sentence, or paragraph. A simplified form of this is commonly
taught to school-age children, in the identification of words as nouns, verbs, adjectives,
adverbs, etc.
POS tagging is not something that is generic. It is quite possible for a single word to have a
different part of speech tag in different sentences based on different contexts. That is why it is
impossible to have a generic mapping for POS tags.
Words often have more than one POS: back
• The back door = JJ
• On my back = NN
• Win the voters back = RB
• Promised to back the bill = VB
As you can see, it is not possible to manually find out different part-of-speech tags for a given
corpus. New types of contexts and new words keep coming up in dictionaries in various languages,
and manual POS tagging is not scalable in itself. That is why we rely on machine-based POS
tagging.
Before proceeding further and looking at how part-of-speech tagging is done, we should look at
why POS tagging is necessary and where it can be used.
Why Part-of-Speech tagging?
Part-of-Speech tagging in itself may not be the solution to any particular NLP problem. It is
however something that is done as a pre-requisite to simplify a lot of different problems. Let us
consider a few applications of POS tagging in various NLP tasks.
Text to Speech Conversion
Let us look at the following sentence:
They refuse to permit us to obtain the refuse permit.
The word refuse is being used twice in this sentence and has two different meanings here.
refUSE (/rəˈfyo͞oz/) is a verb meaning “deny,” while REFuse (/ˈrefˌyo͞os/) is a noun meaning “trash”
(that is, they are not homophones). Thus, we need to know which word is being used in order to
pronounce the text correctly. (For this reason, text-to-speech systems usually perform POS tagging.)
Have a look at the part-of-speech tags generated for this very sentence by the NLTK package.
>>> import nltk
>>> from nltk import word_tokenize
>>> text = word_tokenize("They refuse to permit us to obtain the refuse permit")
>>> nltk.pos_tag(text)
[('They', 'PRP'), ('refuse', 'VBP'), ('to', 'TO'), ('permit', 'VB'), ('us', 'PRP'),
 ('to', 'TO'), ('obtain', 'VB'), ('the', 'DT'), ('refuse', 'NN'), ('permit', 'NN')]
As we can see from the results provided by the NLTK package, POS tags for both refUSE and
REFuse are different. Using these two different POS tags for our text to speech converter can come
up with a different set of sounds.
Similarly, let us look at yet another classical application of POS tagging: word sense
disambiguation.
Word Sense Disambiguation
Let’s talk about this kid called Peter. Since his mother is a neurological scientist, she didn’t send
him to school. His life was devoid of science and math.
One day she conducted an experiment, and made him sit for a math class. Even though he didn’t
have any prior subject knowledge, Peter thought he aced his first test. His mother then took an
example from the test and published it as below.
Words often occur in different senses as different parts of speech. For example:
• She saw a bear.
• Your efforts will bear fruit.
The word bear in the above sentences has completely different senses, but more importantly one is
a noun and other is a verb. Rudimentary word sense disambiguation is possible if you can tag words
with their POS tags.
Word-sense disambiguation (WSD) is identifying which sense of a word (that is, which meaning) is
used in a sentence, when the word has multiple meanings.
Try to think of the multiple meanings for this sentence:
Time flies like an arrow
Here are the various interpretations of the given sentence. The meaning, and hence the part-of-speech, might vary for each word.
As we can clearly see, there are multiple interpretations possible for the given sentence. Different
interpretations yield different kinds of part of speech tags for the words.This information, if
available to us, can help us find out the exact version / interpretation of the sentence and then we
can proceed from there.
The above example shows us that a single sentence can have three different POS tag sequences
assigned to it that are equally likely. That means that it is very important to know what specific
meaning is being conveyed by the given sentence whenever it’s appearing. This is word sense
disambiguation, as we are trying to find out THE sequence.
These are just two of the numerous applications where we would require POS tagging. There are
other applications as well which require POS tagging, like Question Answering, Speech
Recognition, Machine Translation, and so on.
Now that we have a basic knowledge of different applications of POS tagging, let us look at how
we can go about actually assigning POS tags to all the words in our corpus.
Types of POS taggers
POS-tagging algorithms fall into two distinctive groups:
• Rule-Based POS Taggers
• Stochastic POS Taggers
E. Brill’s tagger, one of the first and most widely used English POS-taggers, employs rule-based
algorithms. Let us first look at a very brief overview of what rule-based tagging is all about.
Rule-Based Tagging
Automatic part of speech tagging is an area of natural language processing where statistical
techniques have been more successful than rule-based methods.
Typical rule-based approaches use contextual information to assign tags to unknown or ambiguous
words. Disambiguation is done by analyzing the linguistic features of the word, its preceding word,
its following word, and other aspects.
For example, if the preceding word is an article, then the word in question must be a noun. This
information is coded in the form of rules.
Example of a rule:
If an ambiguous/unknown word X is preceded by a determiner and followed by a noun,
tag it as an adjective.
Defining a set of rules manually is an extremely cumbersome process and is not scalable at all. So
we need some automatic way of doing this.
Brill’s tagger is a rule-based tagger that goes through the training data and learns the set of tagging rules that best fit the data while minimizing POS tagging errors. The most important point to note about Brill’s tagger is that the rules are not hand-crafted; they are induced from the corpus provided. The only feature engineering required is a set of rule templates that the model can instantiate to come up with new rules.
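A minimal sketch of this idea using NLTK’s transformation-based (Brill) tagger follows; the corpus slice, the stock brill24() rule templates and the rule limit are illustrative choices, not settings prescribed by Brill’s original work.

import nltk                      # assumes nltk.download('treebank') has been run
from nltk.corpus import treebank
from nltk.tag import UnigramTagger
from nltk.tag.brill import brill24
from nltk.tag.brill_trainer import BrillTaggerTrainer

train_sents = treebank.tagged_sents()[:3000]

# Start from a simple baseline tagger, then let the trainer induce
# transformation rules (instantiated from the rule templates) that
# correct the baseline's errors on the training data.
baseline = UnigramTagger(train_sents)
trainer = BrillTaggerTrainer(baseline, brill24())
brill_tagger = trainer.train(train_sents, max_rules=100)

print(brill_tagger.tag("They refuse to permit us to obtain the refuse permit".split()))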
Let’s move ahead now and look at Stochastic POS tagging.
Stochastic Part-of-Speech Tagging
The term ‘stochastic tagger’ can refer to any number of different approaches to the problem of POS
tagging. Any model which somehow incorporates frequency or probability may be properly labelled
stochastic.
The simplest stochastic taggers disambiguate words based solely on the probability that a word
occurs with a particular tag. In other words, the tag encountered most frequently in the training set
with the word is the one assigned to an ambiguous instance of that word. The problem with this
approach is that while it may yield a valid tag for a given word, it can also yield inadmissible
sequences of tags.
An alternative to the word frequency approach is to calculate the probability of a given sequence of
tags occurring. This is sometimes referred to as the n-gram approach, referring to the fact that the
best tag for a given word is determined by the probability that it occurs with the n previous tags.
This approach makes much more sense than the one defined before, because it considers the tags for
individual words based on context.
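The sketch below contrasts the two stochastic approaches just described using NLTK’s off-the-shelf taggers; the treebank sample and the train/test split are arbitrary choices for illustration.

import nltk                      # assumes nltk.download('treebank') has been run
from nltk.corpus import treebank
from nltk.tag import UnigramTagger, BigramTagger

train_sents = treebank.tagged_sents()[:3000]
test_sents = treebank.tagged_sents()[3000:3100]

# 1. Word-frequency approach: each word simply gets its most frequent training tag.
unigram = UnigramTagger(train_sents)

# 2. n-gram approach: the chosen tag also depends on the previous tag,
#    backing off to the unigram tagger for unseen contexts.
bigram = BigramTagger(train_sents, backoff=unigram)

print("unigram accuracy:", unigram.accuracy(test_sents))   # .evaluate() on older NLTK versions
print("bigram accuracy:", bigram.accuracy(test_sents))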
The next level of complexity that can be introduced into a stochastic tagger combines the previous
two approaches, using both tag sequence probabilities and word frequency measurements. This is
known as the Hidden Markov Model (HMM).
Before looking at what a Hidden Markov Model is, let us first look at what a Markov model is. That will help clarify the meaning of the term “Hidden” in HMMs.
Markov Model
Say that there are only three kinds of weather conditions, namely
• Rainy
• Sunny
• Cloudy
Now, since our young friend we introduced above, Peter, is a small kid, he loves to play outside. He
loves it when the weather is sunny, because all his friends come out to play in the sunny conditions.
He hates the rainy weather for obvious reasons.
Every day, his mother observes the weather in the morning (that is when he usually goes out to play) and, like always, Peter comes up to her right after getting up and asks her to tell him what the weather is going to be like. Since she is a responsible parent, she wants to answer that question as accurately as possible. But the only thing she has is a set of observations taken over multiple days as to how the weather has been.
How does she make a prediction of the weather for today based on what the weather has been for
the past N days?
Say you have a sequence. Something like this:
Sunny, Rainy, Cloudy, Cloudy, Sunny, Sunny, Sunny, Rainy
So, the weather for any given day can be in any of the three states.
Let’s say we decide to use a Markov Chain Model to solve this problem. Now using the data that we
have, we can construct the following state diagram with the labelled probabilities.
In order to compute the probability of today’s weather given N previous observations, we will use
the Markovian Property.
Markov Chain is essentially the simplest known Markov model, that is it obeys the Markov
property.
The Markov property says that the distribution of the next state depends solely on the current state; none of the earlier states has any additional impact on the future. Formally, P(q_t | q_(t-1), q_(t-2), ..., q_1) = P(q_t | q_(t-1)).
For a much more detailed explanation of how Markov chains work, refer to a standard reference on Markov chains.
Also, have a look at the following example to see how the probability of the current state can be computed using this formula, taking the Markov property into account.
Apply the Markov property in the following example.
We can clearly see that as per the Markov property, the probability of tomorrow's weather being
Sunny depends solely on today's weather and not on yesterday's .
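Since the state diagram itself is not reproduced here, the following toy sketch uses made-up transition probabilities just to show how the Markov property lets us score a weather sequence.

# Made-up transition probabilities for the three weather states (illustrative only).
transition = {
    "Sunny":  {"Sunny": 0.6, "Rainy": 0.2, "Cloudy": 0.2},
    "Rainy":  {"Sunny": 0.3, "Rainy": 0.4, "Cloudy": 0.3},
    "Cloudy": {"Sunny": 0.4, "Rainy": 0.3, "Cloudy": 0.3},
}
start = {"Sunny": 1 / 3, "Rainy": 1 / 3, "Cloudy": 1 / 3}

def sequence_probability(states):
    # Markov property: P(s1, ..., sn) = P(s1) * product of P(s_t | s_(t-1))
    prob = start[states[0]]
    for prev, cur in zip(states, states[1:]):
        prob *= transition[prev][cur]
    return prob

print(sequence_probability(["Sunny", "Rainy", "Cloudy", "Cloudy"]))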
Let us now proceed and see what is hidden in the Hidden Markov Models.
Hidden Markov Model
It’s the small kid Peter again, and this time he’s gonna pester his new caretaker — which is you.
(Ooopsy!!)
As a caretaker, one of the most important tasks for you is to tuck Peter into bed and make sure he is
sound asleep. Once you’ve tucked him in, you want to make sure he’s actually asleep and not up to
some mischief.
You cannot, however, enter the room again, as that would surely wake Peter up. So all you have to go on are the sounds coming from the room: either the room is quiet or there is noise coming from the room. These are your observations.
Peter’s mother, before leaving you to this nightmare, said:
May the sound be with you :)
His mother has given you the following state diagram. The diagram has some states, observations,
and probabilities.
Note that there is no direct correlation between sound from the room and Peter being asleep.
There are two kinds of probabilities that we can see from the state diagram.
• One is the emission probabilities, which represent the probabilities of making certain
observations given a particular state. For example, we have P(noise | awake) = 0.5
. This is an emission probability.
• The other is the transition probabilities, which represent the probability of transitioning to
another state given a particular state. For example, we have P(asleep | awake) =
0.4 . This is a transition probability.
The Markovian property applies in this model as well. So do not complicate things too much.
Markov, your savior said:
Don’t go too much into the history…
The Markov property, as would be applicable to the example we have considered here, would be
that the probability of Peter being in a state depends ONLY on the previous state.
But there is a clear flaw in the Markov property. If Peter has been awake for an hour, then the
probability of him falling asleep is higher than if he has been awake for just 5 minutes. So, history
matters. Therefore, the Markov state machine-based model is not completely correct. It’s merely a
simplification.
The Markov property, although wrong, makes this problem very tractable.
We usually observe longer stretches of the child being awake and being asleep. If Peter is awake now, the probability of him staying awake is higher than that of him going to sleep. Hence the 0.6 and 0.4 in the above diagram: P(awake | awake) = 0.6 and P(asleep | awake) = 0.4.
Before actually trying to solve the problem at hand using HMMs, let’s relate this model to the task
of Part of Speech Tagging.
HMMs for Part of Speech Tagging
We know that to model any problem using a Hidden Markov Model we need a set of observations
and a set of possible states. The states in an HMM are hidden.
In the part of speech tagging problem, the observations are the words themselves in the given
sequence.
As for the states, which are hidden, these would be the POS tags for the words.
The transition probabilities would be something like P(VP | NP), that is, the probability of the current word having a Verb Phrase tag given that the previous tag was a Noun Phrase.
Emission probabilities would be P(john | NP) or P(will | VP), that is, the probability that the word is, say, John given that its tag is a Noun Phrase.
Note that this is just an informal modeling of the problem to provide a very basic understanding of
how the Part of Speech tagging problem can be modeled using an HMM.
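For completeness, NLTK ships a supervised HMM trainer that estimates exactly these transition and emission probabilities from tagged data; the following is a hedged sketch (the corpus slice and the sentence are illustrative, and the unsmoothed default estimator can behave poorly on unseen words).

import nltk                      # assumes nltk.download('treebank') has been run
from nltk.corpus import treebank
from nltk.tag import hmm

train_sents = treebank.tagged_sents()[:3000]

# train_supervised() estimates the transition probabilities P(tag_i | tag_i-1)
# and the emission probabilities P(word | tag) from the tagged sentences.
trainer = hmm.HiddenMarkovModelTrainer()
hmm_tagger = trainer.train_supervised(train_sents)

print(hmm_tagger.tag("They refuse to permit us to obtain the refuse permit".split()))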
How do we solve this?
Coming back to our problem of taking care of Peter.
Irritated, are we?
Our problem here was that we have an initial state: Peter was awake when you tucked him into bed.
After that, you recorded a sequence of observations, namely noise or quiet, at different time-steps.
Using this set of observations and the initial state, you want to find out whether Peter will be awake or asleep after, say, N time steps.
We draw all possible transitions starting from the initial state. An exponential number of branches comes out as we keep moving forward, so the model grows exponentially after only a few time steps, even before considering any observations.
If we had the sequence of states, we could calculate the probability of that sequence. But we don’t have the states; all we have is a sequence of observations. This is why the model is referred to as a Hidden Markov Model: the actual states over time are hidden.
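To make the idea concrete, here is a toy forward-algorithm sketch for the Peter example. Only P(awake | awake) = 0.6, P(asleep | awake) = 0.4 and P(noise | awake) = 0.5 come from the text above; the remaining probabilities are assumed purely for illustration.

transition = {
    "awake":  {"awake": 0.6, "asleep": 0.4},
    "asleep": {"awake": 0.2, "asleep": 0.8},   # assumed values
}
emission = {
    "awake":  {"noise": 0.5, "quiet": 0.5},
    "asleep": {"noise": 0.1, "quiet": 0.9},    # assumed values
}

def forward(observations):
    # Peter is awake with certainty when we tuck him in (the initial state).
    alpha = {"awake": 1.0, "asleep": 0.0}
    for obs in observations:
        # alpha[s] = sum over previous states p of alpha[p] * P(s | p) * P(obs | s)
        alpha = {s: sum(alpha[p] * transition[p][s] for p in alpha) * emission[s][obs]
                 for s in alpha}
    return alpha

alpha = forward(["quiet", "quiet", "noise"])
total = sum(alpha.values())
print({state: round(value / total, 3) for state, value in alpha.items()})  # P(state | observations)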
Chapter 1
Multiword Expressions
1.1 Introduction
1.2 Linguistic Properties of MWEs
1.3 Types of MWE
1.4 MWE Classification
1.5 Research Issues
1.6 Summary
Bibliography

1.1 Introduction
Languages are made up of words, which combine via morphosyntax to encode meaning in the form of phrases and sentences. While it may appear
relatively innocuous, the question of what constitutes a “word” is a surprisingly vexed one. First, are dog and dogs two separate words, or variants of
a single word? The traditional view from lexicography and linguistics is to
treat them as separate inflected wordforms of the lexeme dog, as any difference in the syntax/semantics of the two words is predictable from the general
process of noun pluralisation in English. Second, what is the status of expressions like top dog and dog days? A speaker of English who knew top,
dog and day in isolation but had never been exposed to these two expressions
would be hard put to predict the semantics of “person who is in charge” and
“period of inactivity”, respectively.1 To be able to retrieve the semantics of
these expressions, they must have lexical status of some form in the mental
lexicon, which encodes their particular semantics. Expressions such as these
which have surprising properties not predicted by their component words are
referred to as multiword expressions (MWEs).2 The focus of this chapter is the
precise nature and types of MWEs, and the current state of MWE research
in NLP.
Armed with our informal description of MWEs, let’s first motivate this
chapter with a brief overview of the range of MWEs, and complexities associated with them. We return to define MWEs formally in Section 1.2.

1 All glosses in this paper are taken from WordNet 3.0 (Fellbaum 1998).
2 Terms which are largely synonymous with “multiword expression” are “multiword unit”, “multiword lexical item”, “phraseological unit” and “fixed expression”; there is also variation in the hyphenation of “multiword”, with “multi-word” in common use.

Baldwin, Timothy and Su Nam Kim (2010) Multiword Expressions, in Nitin Indurkhya and Fred J. Damerau (eds.) Handbook of Natural Language Processing, Second Edition, CRC Press, Boca Raton, USA, pp. 267-292.
(1a)–(1b) include a number of MWEs, underlined.
(1)
a. In a nutshell, the administrator can take advantage of the database’s
many features through a single interface.
b. You should also jot down the serial number of your television video.
As we can see, analogously to simple words, MWEs can occur in a wide range
of lexical and syntactic configurations (e.g. nominal, verbal and adverbial).
Semantically, we can observe different effects: in some cases (e.g. serial number
and television video), the component words preserve their original semantics,
but the MWE encodes extra semantics (e.g. the fact that a television video is
a single-unit device, and usually designed to be portable); in other cases (e.g.
in a nutshell , meaning “summed up briefly”), the semantics of one or more of
the component words has no obvious bearing on the semantics of the MWE.
While all of the MWE examples we have seen to date have occurred as
contiguous units, this is not always the case:
(2)
a. She likes to take a long bath for relaxation after exams.
b. Kim hates to put her friends out.
For example, in (2a), long is an internal modifier and not a component of the
base MWE take a bath, as there is nothing surprising about the syntax of the
modified MWE or the resulting semantics (c.f. take a short/leisurely/warm/mud/...
bath).
How big an issue are MWEs, though? The number of MWEs is estimated
to be of the same order of magnitude as the number of simplex words in a
speaker’s lexicon (Jackendoff 1997; Tschichold 1998; Pauwels 2000). At the
type level, therefore, MWEs are as much of an issue as simple words. Added
to this, new (types of) MWE are continuously created as languages evolve
(e.g. shock and awe, carbon footprint, credit crunch) (Gates 1988; Tschichold
1998; Fazly, Cook, and Stevenson 2009).
Crosslingually, MWEs have been documented across a broad spectrum of
the world’s languages (see the companion web site for this chapter for a detailed listing of references). In fact, MWEs are such an efficient way of providing nuance and facilitating lexical expansion with a relatively small simplex
lexicon, it is highly doubtful that any language would evolve without MWEs
of some description.
MWEs are broadly used to enhance fluency and understandability, or mark
the register/genre of language use (Fillmore, Kay, and O’Connor 1988; Liberman and Sproat 1992; Nunberg, Sag, and Wasow 1994; Dirven 2001). For
example, MWEs can make language more or less informal/colloquial (c.f. London Underground vs. Tube, and piss off vs. annoy). Regionally, MWEs vary
considerably. For example, take away and take out are identical in meaning,
but the former is the preferred expression in British/Australian English, while
the latter is the preferred expression in American English. Other examples
are phone box vs. phone booth, lay the table vs. set the table, and no through
road vs. not a through street, respectively.
There is a modest body of research on modelling MWEs which has been integrated into NLP applications, e.g. for the purposes of fluency, robustness or
better understanding of natural language. One area where MWEs have traditionally been used heavily (either explicitly or implicitly) is machine translation, as a means of capturing subtle syntactic, semantic and pragmatic effects
in the source and target languages (Miyazaki, Ikehara, and Yokoo 1993; Gerber and Yang 1997; Melamed 1997; Matsuo, Shirai, Yokoo, and Ikehara 1997).
Understanding MWEs has broad utility in tasks ranging from syntactic disambiguation to conceptual (semantic) comprehension. Explicit lexicalised MWE
data helps simplify the syntactic structure of sentences that include MWEs,
and conversely, a lack of MWE lexical items in a precision grammar is a significant source of parse errors (Baldwin, Bender, Flickinger, Kim, and Oepen
2004). Additionally, it has been shown that accurate recognition of MWEs
influences the accuracy of semantic tagging (Piao, Rayson, Archer, Wilson,
and McEnery 2003), and word alignment in machine translation (MT) can be
improved through a specific handling of the syntax and semantics of MWEs
(Venkatapathy and Joshi 2006).
1.2 Linguistic Properties of MWEs
We adopt the following formal definition of multiword expression, following
(Sag, Baldwin, Bond, Copestake, and Flickinger 2002):
(3) Multiword expressions (MWEs) are lexical items that: (a) can be
decomposed into multiple lexemes; and (b) display lexical, syntactic,
semantic, pragmatic and/or statistical idiomaticity
In languages such as English, the conventional interpretation of the requirement of decomposability into lexemes is that MWEs must in themselves be
made up of multiple whitespace-delimited words. For example, marketing
manager is potentially a MWE as it is made up of two lexemes (marketing
and manager ), while fused words such as lighthouse are conventionally not
classified as MWEs.3 In languages such as German, the high productivity of
compound nouns such as Kontaktlinse “contact lens” (the concatenation of
Kontakt “contact” and Linse “lens”), without whitespace delimitation, means
that we tend to relax this restriction and allow for single-word MWEs. In
non-segmenting languages such as Japanese and Chinese (Baldwin and Bond
3 In practice, a significant subset of research on English noun compounds (see Section 1.3.1) has considered both fused and whitespace-separated expressions.
2002; Xu, Lu, and Li 2006), we are spared this artificial consideration. The
ability to decompose an expression into multiple lexemes is still applicable,
however, and leads to the conclusion, e.g. that fukugō-hyōgen “multiword expression” is a MWE (both fukugō “compound” and hyōgen “expression” are
standalone lexemes), but buchō “department head” is not (bu “department”
is a standalone lexeme, but chō “head” is not).
The second requirement on a MWE is for it to be idiomatic. We provide a
detailed account of idiomaticity in its various manifestations in the following
section.
1.2.1 Idiomaticity
In the context of MWEs, idiomaticity refers to markedness or deviation from
the basic properties of the component lexemes, and applies at the lexical,
syntactic, semantic, pragmatic, and/or statistical levels. A given MWE is
often idiomatic at multiple levels (e.g. syntactic, semantic and statistical in
the case of by and large), as we return to illustrate in Section 1.2.3.
Closely related to the notion of idiomaticity is compositionality , which we
consider to be the degree to which the features of the parts of a MWE combine
to predict the features of the whole. While compositionality is often construed
as applying exclusively to semantic idiomaticity (hence by “non-compositional
MWE”, researchers tend to mean a semantically-idiomatic MWE), in practice
it can apply across all the same levels as idiomaticity. Below, we present an
itemised account of each sub-type of idiomaticity.
1.2.1.1 Lexical Idiomaticity
Lexical idiomaticity occurs when one or more components of an MWE are
not part of the conventional English lexicon. For example, ad hoc is lexically
marked in that neither of its components (ad and hoc) are standalone English
words.4 Lexical idiomaticity inevitably results in syntactic and semantic idiomaticity because there is no lexical knowledge associated directly with the
parts from which to predict the behaviour of the MWE. As such, it is one of
the most clear-cut and predictive properties of MWEhood.
1.2.1.2 Syntactic Idiomaticity
Syntactic idiomaticity occurs when the syntax of the MWE is not derived
directly from that of its components (Katz and Postal 2004; Chafe 1968; Bauer
1983; Sag, Baldwin, Bond, Copestake, and Flickinger 2002). For example, by
and large is syntactically idiomatic in that it is adverbial in nature, but
made up of the anomalous coordination of a preposition (by) and an adjective
4 Note that the idiomaticity is diminished if the speaker has knowledge of the Latin origins of the term. Also, while the component words don’t have status as standalone lexical items, they do occur in other MWEs (e.g. ad nauseum, post hoc).
(large). On the other hand, take a walk is not syntactically marked as it is a simple verb–object combination which is derived transparently from a transitive verb (take) and a countable noun (walk). Syntactic idiomaticity can also occur at the constructional level, in classes of MWEs having syntactic properties which are differentiated from their component words, e.g. verb particle constructions (Section 1.3.2.1) and determinerless prepositional phrases (Section 1.3.3.2).
1.2.1.3 Semantic Idiomaticity
Semantic idiomaticity is the property of the meaning of a MWE not being
explicitly derivable from its parts (Katz and Postal 2004; Chafe 1968; Bauer
1983; Sag, Baldwin, Bond, Copestake, and Flickinger 2002). For example,
middle of the road usually signifies “non-extremism, especially in political
views”, which we could not readily predict from either middle or road . On
the other hand, to and fro is not semantically marked as its semantics is fully
predictable from its parts. Many cases are not as clear cut as these, however.
The semantics of blow hot and cold (“constantly change opinion”), for example, is partially predictable from blow (“move” and hence “change”), but not
as immediately from hot and cold . There are also cases where the meanings of
the parts are transparently inherited but there is additional semantic content
which has no overt realisation. One such example is bus driver where, modulo the effects of word sense disambiguation, bus and driver both have their
expected meanings, but there is additionally the default expectation that a
bus driver is “one who drives a bus” and not “one who drives like a bus” or
“an object for driving buses with”, for example.
Closely related to the issue of semantic idiomaticity is the notion of figuration, i.e. the property of the components of a MWE having some metaphoric
(e.g. take the bull by the horns), hyperbolic (e.g. not worth the paper it’s
printed on) or metonymic (e.g. lend a hand ) meaning in addition to their
literal meaning (Fillmore, Kay, and O’Connor 1988; Nunberg, Sag, and Wasow 1994). As an illustration of decomposability via metaphorical figuration,
consider the English idiom spill the beans. Assuming a formal semantic representation of reveal′ (secret′ ) for the MWE, we can coerce the semantics of
spill and beans into reveal′ and secret′ , respectively, to arrive at a figurative interpretation of the MWE semantics. A compositionality analysis would
not be able to predict this regularity as these senses for spill and beans are
not readily available outside this particular MWE. Predictably, MWEs vary
in the immediacy of their decomposability — with get the nod being more
transparently decomposable than spill the beans, e.g. — and not all MWEs
are decomposable (c.f. kick the bucket). We return to discuss the interaction
between decomposability and syntactic flexibility in Section 1.3.2.4.
One intriguing aspect of semantic idiomaticity is that higher-usage MWEs
are generally perceived to be less semantically idiomatic, or at least more
readily decomposable (Keysar and Bly 1995).
1.2.1.4 Pragmatic Idiomaticity
Pragmatic idiomaticity is the condition of a MWE being associated with a
fixed set of situations or a particular context (Kastovsky 1982; Jackendoff
1997; Sag, Baldwin, Bond, Copestake, and Flickinger 2002). Good morning
and all aboard are examples of pragmatic MWEs: the first is a greeting associated specifically with mornings5 and the second is a command associated
with the specific situation of a train station or dock, and the imminent departure of a train or ship. Pragmatically idiomatic MWEs are often ambiguous
with (non-situated) literal translations; e.g. good morning can mean “pleasant
morning” (c.f. Kim had a good morning).
1.2.1.5 Statistical Idiomaticity
Statistical idiomaticity occurs when a particular combination of words occurs
with markedly high frequency, relative to the component words or alternative
phrasings of the same expression (Cruse 1986; Sag, Baldwin, Bond, Copestake,
and Flickinger 2002). For example, in Table 1.1, we present an illustration
of statistical idiomaticity, adapted from Cruse (1986, p281). The example is
based on the cluster of near-synonym adjectives (flawless, immaculate, impeccable and spotless), and their affinity to pre-modify a range of nouns. For
a given pairing of adjective and noun, we indicate the compatibility in the
form of discrete markers (“+” indicates a positive lexical affinity, “?” indicates a neutral lexical affinity, and “−” indicates a negative lexical affinity).
For example, immaculate has a strong lexical affinity with performance (i.e.
immaculate performance is a relatively common expression), whereas spotless
has a negative affinity with credentials (i.e. spotless credentials is relatively
infrequent). There may, of course, be phonological, semantic or other grounds
for particular adjective–noun combinations being more or less frequent; statistical idiomaticity is simply an observation of the relative frequency of a
given combination. It is also important to note that statistical idiomaticity is
a continuously-graded phenomenon, and our predictions about lexical affinity
in Table 1.1 are most naturally interpreted as a ranking of the propensity
for each of the adjectives to occur as a pre-modifier of record ; for example,
impeccable and spotless are more probable choices than immaculate, which is
in turn more probable than flawless.
Another striking case of statistical idiomaticity is with binomials such as
black and white — as in black and white television — where the reverse noun
ordering does not preserve the lexicalised semantics of the word combination
(c.f. ?white and black television) (Benor and Levy 2006). The arbitrariness
of the preferred noun order in English is poignantly illustrated by it being
reversed in other languages, e.g. shirokuro “white and black” and blanco y
negro “white and black” in Japanese and Spanish, respectively.
5 Which is not to say that it can’t be used ironically at other times of the day!
                 flawless   immaculate   impeccable   spotless
condition           +           −            +           +
credentials         −           −            +           −
hair                −           +            ?           −
house               ?           +            ?           +
logic               +           −            +           −
timing              ?           +            +           −

Table 1.1: Examples of statistical idiomaticity (“+” = strong lexical affinity, “?” = neutral lexical affinity, “−” = negative lexical affinity)
Statistical idiomaticity relates closely to the notion of institutionalisation
(a.k.a. conventionalisation), i.e. a particular word combination coming to be
used to refer a given object (Fernando and Flavell 1981; Bauer 1983; Nunberg,
Sag, and Wasow 1994; Sag, Baldwin, Bond, Copestake, and Flickinger 2002).
For example, traffic light is the conventionalised descriptor for “a visual signal
to control the flow of traffic at intersections”. There is no a priori reason why
it shouldn’t instead be called a traffic director or intersection regulator , but
the simple matter of the fact is that it is not referred to using either of those
expressions; instead, traffic light was historically established as the canonical
term for referring to the object. Similarly, it is an arbitrary fact of the English language that we say many thanks and not ∗ several thanks, and salt and
pepper in preference to pepper and salt.6 We term these anti-collocations of
the respective MWEs (Pearce 2001): lexico-syntactic variants of MWEs which
have unexpectedly low frequency, and in doing so, contrastively highlight the
statistical idiomaticity of the target expression.7
6 Which is not to say there weren’t grounds for the selection of the canonical form at its genesis, e.g. for historical, crosslingual or phonological reasons.
7 The term anti-collocation originated in the context of collocation research (see Section 1.2.4). While noting the potential for confusion, we use it in the broader context of MWEs as a tool for analysing the statistical idiomaticity of a candidate MWE relative to alternative forms of the same basic expression.

1.2.2 Other Properties of MWEs

Other common properties of MWEs are: crosslingual variation, single-word paraphrasability, proverbiality and prosody. Unlike idiomaticity, where some form of idiomaticity is a necessary feature of MWEs, these other properties are neither necessary nor sufficient. Prosody relates to semantic idiomaticity, while the other properties are independent of idiomaticity as described above.

• Crosslingual variation
There is remarkable variation in MWEs across languages (Villavicencio,
Baldwin, and Waldron 2004). In some cases, there is direct lexicosyntactic correspondence for a crosslingual MWE pair with similar semantics. For example, in the red has a direct lexico-syntactic correlate
in Portuguese with the same semantics: no vermelho, where no is the
contraction of in and the, vermelho means red , and both idioms are
prepositional phrases (PPs). Others have identical syntax but differ
lexically. For example, in the black corresponds to no azul (“in the
blue”) in Portuguese, with a different choice of colour term (blue instead of black). More obtusely, bring the curtain down on corresponds
to the Portuguese botar um ponto final em (lit. “put the final dot in”),
with similar syntactic make-up but radically different lexical composition. Other MWEs again are lexically similar but syntactically differentiated. For example, in a corner (e.g. The media has him in a
corner ) and encurralado (“cornered”) are semantically equivalent but
realised by different constructions – a PP in English and an adjective in
Portuguese.
There are of course many MWEs which have no direct translation equivalent in a second language. For example, the Japanese MWE zoku-giiN ,
meaning “legislators championing the causes of selected industries” has no
direct translation in English (Tanaka and Baldwin 2003). Equally, there
are terms which are realised as MWEs in one language but single-word
lexemes in another, such as interest rate and its Japanese equivalent
riritsu.
• Single-word paraphrasability
Single-word paraphrasability is the observation that significant numbers
of MWEs can be paraphrased with a single word (Chafe 1968; Gibbs
1980; Fillmore, Kay, and O’Connor 1988; Liberman and Sproat 1992;
Nunberg, Sag, and Wasow 1994). While some MWEs are single-word
paraphrasable (e.g. leave out = omit), others are not (e.g. look up =
? ). Also, MWEs with arguments can sometimes be paraphrasable (e.g.
take off clothes = undress), just as non-MWEs comprised of multiple
words can be single-word paraphrasable (e.g. drop sharply = plummet).
• Proverbiality
Proverbiality is the ability of a MWE to “describe and implicitly to explain a recurrent situation of particular social interest in virtue of its
resemblance or relation to a scenario involving homely, concrete things
and relations” (Nunberg, Sag, and Wasow 1994). For example, verb
particle constructions and idioms are often indicators of more informal
situations (e.g. piss off is an informal form of annoy, and drop off is an
informal form of fall asleep).
• Prosody
MWEs can have distinct prosody, i.e. stress patterns, from compositional language (Fillmore, Kay, and O’Connor 1988; Liberman and Sproat 1992; Nunberg, Sag, and Wasow 1994). For example, when the components do not make an equal contribution to the semantics of the whole, MWEs can be prosodically marked, e.g. soft spot is prosodically marked (due to the stress on soft rather than spot), although first aid and red herring are not. Note that prosodic marking can equally occur with non-MWEs, such as dental operation.

                   Lexical  Syntactic  Semantic  Pragmatic  Statistical
all aboard            −         −          −         +          +
bus driver            −         −          +         −          +
by and large          −         +          +         −          +
kick the bucket       −         −          +         −          +
look up               −         −          +         −          +
shock and awe         −         −          −         +          +
social butterfly      −         −          +         −          +
take a walk           −         −          +         −          ?
to and fro            ?         +          −         −          +
traffic light         −         −          +         −          +
eat chocolate         −         −          −         −          −

Table 1.2: Classification of MWEs in terms of their idiomaticity
1.2.3 Testing an Expression for MWEhood
Above, we described five different forms of idiomaticity, along with a number of other properties of MWEs. We bring these together in categorising a
selection of MWEs in Table 1.2.
Taking the example of the verb particle construction look up (in the sense
of “seek information from”, as in Kim looked the word up in the dictionary),
we first observe that it is made up of multiple words (look and up), and thus
satisfies the first requirement in our MWE definition. In terms of idiomaticity:
(1) it is not lexically idiomatic, as both look and up are part of the standard
English lexicon; (2) while it has peculiar syntax relative to its component
words, in up being separable from look , this is a general property of transitive
verb particle constructions (see Section 1.3.2.1) rather than this particular
word combination, so it is not syntactically idiomatic; (3) it is semantically
idiomatic, as the semantics of “seek information from” is not predictable from
the standard semantics of look and up; (4) it is not pragmatically idiomatic,
as it doesn’t generally evoke a particular situation; and (5) it is statistically
marked, as it contrasts with anti-collocations such as *see/watch up 8 and is
a relatively frequent expression in English. That is, it is semantically and
statistically idiomatic; in combination with its multiword composition, this is
sufficient to classify it as a MWE.
In Table 1.2, kick the bucket (in the sense of “die”) has only one form of
idiomaticity (semantic), while all the other examples have at least two forms
of idiomaticity. Traffic light, for example, is statistically idiomatic in that
it is both a common expression in English and stands in opposition to anti-collocations such as *vehicle light/traffic lamp, and it is semantically idiomatic
in that the particular semantics of “a visual signal to control the flow of traffic”
is not explicitly represented in the component words (e.g. interpretations such
as “a visual signal to indicate the flow of traffic”, “a device for lighting the
way of traffic” or “a lamp which indicates the relative flow of data” which are
predicted by the component words are not readily available). Other noteworthy claims about idiomaticity are: shock and awe is pragmatically idiomatic
because of its particular association with the commencement of the Iraq War
in 2003; take a walk is semantically idiomatic because this sense of take is
particular to this and other light verb constructions (see Section 1.3.2.3), and
distinct from the literal sense of the verb; and to and fro is syntactically idiomatic because of the relative syntactic opacity of the antiquated fro, and
(somewhat) lexically idiomatic as it is used almost exclusively in the context
of to and fro.9
Table 1.2 includes one negative example: eat chocolate. While it satisfies
the requirement for multiword decomposability (i.e. it is made up of more
than one word), it clearly lacks lexical, syntactic, semantic and pragmatic
idiomaticity. We would claim that it is also not statistically idiomatic. One
possible counter-argument could be that eat is one of the most common verbs
associated with chocolate, but the same argument could be made for almost
any foodstuff in combination with eat. Possible anti-collocations such as consume chocolate or munch on chocolate are also perfectly acceptable.
8 Under the constraint that up is a particle; examples such as see you up the road occur readily, but are not considered to be anti-collocations as up is a (transitive) preposition.
9 Words such as this which occur only as part of a fixed expression are known variously as cranberry words or bound words (Aronoff 1976; Moon 1998; Trawiński, Sailer, Soehn, Lemnitzer, and Richter 2008) (other examples are tenterhooks and caboodle), and the expressions that contain them are often termed cranberry expressions (e.g. on tenterhooks and the whole caboodle).

1.2.4 Collocations and MWEs

A common term in NLP which relates closely to our discussion of MWEs is collocation. A widely-used definition for collocation is “an arbitrary and recurrent word combination” (Benson 1990), or in our terms, a statistically idiomatic MWE (esp. of high frequency). While there is considerable variation between individual researchers, collocations are often distinguished from
“idioms” or “non-compositional phrases” on the grounds that they are not
syntactically idiomatic, and if they are semantically idiomatic, it is through
a relatively transparent process of figuration or metaphor (Choueka 1988;
Lin 1998; McKeown and Radev 2000; Evert 2004). Additionally, much work
on collocations focuses exclusively on predetermined constructional templates
(e.g. adjective–noun or verb–noun collocations). In Table 1.2, e.g. social butterfly is an uncontroversial instance of a collocation, but look up and to and
fro would tend not to be classified as collocations. As such, collocations form
a proper subset of MWEs.
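As a rough, hedged illustration of how statistically idiomatic word pairs can be surfaced automatically (this is not part of the chapter), NLTK’s collocation finder can rank bigrams from a corpus by an association measure such as pointwise mutual information; the corpus and thresholds below are arbitrary choices.

import nltk                      # assumes nltk.download('brown') has been run
from nltk.corpus import brown
from nltk.collocations import BigramAssocMeasures, BigramCollocationFinder

measures = BigramAssocMeasures()
finder = BigramCollocationFinder.from_words(w.lower() for w in brown.words())
finder.apply_freq_filter(5)                    # drop very rare pairs, which inflate PMI
print(finder.nbest(measures.pmi, 15))          # top bigram candidates by PMI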
1.2.5 A Word on Terminology and Related Fields
It is worth making mention of a number of terms which relate to MWEs.
The term idiom varies considerably in its usage, from any kind of multiword
item to only those MWEs which are semantically idiomatic; even here, there
are those who consider idioms to be MWEs which are exclusively semantically idiomatic (also sometimes termed pure idioms), and those who restrict
the term to particular syntactic sub-types of semantically idiomatic MWEs
(Fillmore, Kay, and O’Connor 1988; Nunberg, Sag, and Wasow 1994; Moon
1998; Huddleston and Pullum 2002). To avoid confusion, we will avoid using
this term in this chapter.
The field of terminology has a rich history of research on multiword terms,
which relates closely to MWEs (Sager 1990; Justeson and Katz 1995; Frantzi,
Ananiadou, and Mima 2000; Kageura, Daille, Nakagawa, and Chien 2004).
The major difference is that terminology research is primarily interested in
identifying and classifying technical terms specific to a particular domain
(both MWE and simplex lexemes) and predicting patterns of variation in
those terms. It is thus broader in scope than MWEs in the sense that simple lexemes can equally be technical terms, and narrower in the sense than
non-technical MWEs are not of interest to the field.
Phraseology is another field with a rich tradition history relating to MWEs
(Cowie and Howarth 1996; Cowie 2001). It originally grew out of the work of
Mel’čuk and others in Russia on Meaning-Text Theory (Mel’čuk and Polguère
1987), but more recently has taken on elements from the work of Sinclair
and others in the context of corpus linguistics and corpus-based lexicography
(Sinclair 1991). Phraseology is primarily interested in the description and
functional classification of MWEs (including “sentence-like” units, such as
phrases and quotations), from a theoretical perspective.
1.3 Types of MWE
In this section, we detail a selection of the major MWE types which have
received particular attention in the MWE literature. We will tend to focus on
English MWEs for expository purposes, but provide tie-ins to corresponding
MWEs in other languages where possible.
1.3.1 Nominal MWEs
Nominal MWEs are one of the most common MWE types, in terms of token frequency, type frequency, and their occurrence in the world’s languages
(Tanaka and Baldwin 2003; Lieber and Štekauer 2009). In English, the primary type of nominal MWE is the noun compound (NC), where two or more
nouns combine to form a N̄, such as golf club or computer science department
(Lauer 1995; Sag, Baldwin, Bond, Copestake, and Flickinger 2002; Huddleston and Pullum 2002); the rightmost noun in the NC is termed the head
noun (i.e. club and department, respectively), and the remaining component(s) the modifier(s) (i.e. golf and computer science, respectively).10 Within
NCs, there is the subset of compound nominalisations, where the head is deverbal (e.g. investor hesitation or stress avoidance). There is also the broader
class of nominal MWEs where the modifiers aren’t restricted to be nominal,
but can also be verbs (usually present or past participles, such as connecting
flight or hired help) or adjectives (e.g. open secret). To avoid confusion, we will
term this broader set of nominal MWEs nominal compounds. In Romance languages such as Italian, there is the additional class of complex nominals which
include a preposition or other marker between the nouns, such as succo di
limone “lemon juice” and porta a vetri “glass door”.11
One property of noun compounds which has put them in the spotlight of
NLP research is their underspecified semantics. For example, while sharing
the same head, there is little semantic commonality between nut tree, clothes
tree and family tree: a nut tree is a tree which bears edible nuts; a clothes
tree is a piece of furniture shaped somewhat like a tree, for hanging clothes
on; and a family tree is a graphical depiction of the genealogical history of a
family (which can be shaped like a tree). In each case, the meaning of the
compound relates (if at times obtusely!) to a sense of both the head and
the modifier, but the precise relationship is highly varied and not represented
10 In fact, the norm amongst Germanic languages (e.g. Danish, Dutch, German, Norwegian and Swedish) is for noun compounds to be realised as a single compound word (Bauer 2001). Solar cell, for example, is zonnecel in Dutch, Solarzelle in German, and solcell in Swedish. See Section 1.2 for comments on their compatibility with our definition of MWE.
11 Our use of the term complex nominal for MWEs of form N P N should not be confused with that of Levi (1978), which included NCs and nominal compounds.
explicitly in any way. Furthermore, while it may be possible to argue that
these are all lexicalised noun compounds with explicit semantic representations in the mental lexicon, native speakers generally have reasonably sharp
intuitions about the semantics of novel compounds. For example, a bed tree
is most plausibly a tree that beds are made from or perhaps for sleeping in,
and a reflection tree could be a tree for reflecting in/near or perhaps the reflected image of a tree. Similarly, context can evoke irregular interpretations
of high-frequency compounds (Downing 1977; Spärck Jones 1983; Copestake
and Lascarides 1997; Gagné, Spalding, and Gorrie 2005). This suggests that
there is a dynamic interpretation process that takes place, which complements
encyclopedic information about lexicalised compounds.
One popular approach to capturing the semantics of compound nouns is
via a finite set of relations. For example, orange juice, steel bridge and paper
hat could all be analysed as belonging to the make relation, where head
is made from modifier. This observation has led to the development of
a bewildering range of semantic relation sets of varying sizes, based on abstract relations (Vanderwende 1994; Barker and Szpakowicz 1998; Rosario
and Hearst 2001; Moldovan, Badulescu, Tatu, Antohe, and Girju 2004; Nastase, Sayyad-Shirabad, Sokolova, and Szpakowicz 2006), direct paraphrases,
e.g. using prepositions or verbs (Lauer 1995; Lapata 2002; Grover, Lapata,
and Lascarides 2004; Nakov 2008), or various hybrids of the two (Levi 1978;
Vanderwende 1994; Ó Séaghdha 2008). This style of approach has been hampered by issues including low inter-annotator agreement (especially for larger
semantic relation sets), coverage over data from different domains, the impact
of context on interpretation, how to deal with “fringe” instances which don’t
quite fit any of the relations, and how to deal with interpretational ambiguity
(Downing 1977; Spärck Jones 1983; Ó Séaghdha 2008).
An additional area of interest with nominal MWEs (especially noun compounds) is the syntactic disambiguation of MWEs with 3 or more terms. For
example, glass window cleaner can be syntactically analysed as either (glass
(window cleaner)) (i.e. “a window cleaner made of glass”, or similar) or ((glass
window) cleaner) (i.e. “a cleaner of glass windows”). Syntactic ambiguity impacts on both the semantic interpretation and prosody of the MWE. The task
of disambiguating syntactic ambiguity in nominal MWEs is called bracketing .
We return to discuss the basic approaches to bracketing in Section 1.5.3.
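As a toy illustration of one simple bracketing strategy (not taken from this chapter), the adjacency model compares how strongly the left pair and the right pair of a three-word compound are associated, e.g. via corpus bigram counts; the counts below are invented.

# Invented bigram counts; in practice these would be estimated from a large corpus.
bigram_count = {
    ("glass", "window"): 320,
    ("window", "cleaner"): 150,
}

def bracket(w1, w2, w3):
    # Left-bracket ((w1 w2) w3) if (w1, w2) is the more strongly associated pair,
    # otherwise right-bracket (w1 (w2 w3)).
    left = bigram_count.get((w1, w2), 0)
    right = bigram_count.get((w2, w3), 0)
    return f"(({w1} {w2}) {w3})" if left > right else f"({w1} ({w2} {w3}))"

print(bracket("glass", "window", "cleaner"))   # -> ((glass window) cleaner) with these counts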
1.3.2 Verbal MWEs
1.3.2.1 Verb-particle constructions
Verb-particle constructions (VPCs, also sometimes termed particle verbs or
phrasal verbs) are made up of a verb and an obligatory particle, typically
in the form of an intransitive preposition (e.g. play around , take off ), but
including adjectives (e.g. cut short, band together ) and verbs (e.g. let go, let
fly) (Bolinger 1976; Jackendoff 1997; Huddleston and Pullum 2002; McIntyre 2007). English VPCs relate closely to particle verbs (a.k.a. separable verbs) in
languages such as German (Lüdeling 2001), Dutch (Booij 2002) and Estonian
(Kaalep and Muischnek 2008), but the construction has its own peculiarities
in each language which go beyond the bounds of this chapter. To avoid
confusion, we will focus exclusively on English VPCs in our discussion here.
The distinguishing properties of English VPCs are:
• Transitive VPCs can occur in either the joined (e.g. Kim put on the
sweater ) or split (e.g. Kim put the sweater on) word order in the case
that the object NP is not pronominal
• Transitive VPCs must occur in the split word order if the object NP is
pronominal (e.g. Kim polished it off vs. ∗ Kim polished off it).
• Manner adverbs do not readily occur between the verb and particle,
in both intransitive and transitive VPCs (e.g. ?∗ Kim played habitually
around, ∗ Kim made quickly up her mind ). Note, there is a small set of
degree adverbs that readily premodify particles, notably right (e.g. My
turn is coming right up) and back (e.g. Kim put the sweater back on)
All of these properties are defined at the construction level and common to
all VPCs, however, begging the question of where the idiomaticity comes in
that allows us to define them as MWEs. The answer is, in the main, semantic
and statistical idiosyncrasy. For example, the semantics of polish in polish
off (e.g. polish off dessert, polish off the hitman, polish off my homework ) is
differentiated from that of the simplex lexeme. Conversely, swallow down (e.g.
swallow down the drink ) preserves the semantics of both swallow and down
(i.e. the liquid is swallowed, and as a result goes down [the oesophagus]), and
is thus conventionally not considered to be a MWE.
VPCs are highly frequent in English text, but the distribution is highly
skewed towards a minority of the VPC types, with the majority of VPCs occurring very infrequently (Baldwin 2005a). This is bad news if we want to
build a parser with full coverage, e.g., as we need to capture the long tail of
VPC types. Compounding the problem, the construction is highly productive.
For example, the completive up (e.g. eat/finish/rest/... up) can combine productively with a large array of action verbs to form a VPC with predictable
syntax and semantics, which we could never hope to exhaustively list. Having said this, there are large numbers of semantically-idiomatic VPCs which
need to be recorded in the lexicon if we wish to capture their semantics correctly. Even here, VPCs populate the spectrum of compositionality relative
to their components (Lidner 1983; Brinton 1985; Jackendoff 2002; Bannard,
Baldwin, and Lascarides 2003; McCarthy, Keller, and Carroll 2003; Cook and
Stevenson 2006), so while some VPCs are clear candidates for lexicalisation
in terms of their semantic idiomaticity (e.g. make out, as in Kim made out
the cheque to Sandy or Kim and Sandy made out), others are semantically
closer to the semantics of their component words (e.g. check out, blow over )
and to some degree derivable from their component words. One approach
to representing this continuum of VPC semantics is that of Bannard, Baldwin, and Lascarides (2003), who subclassify VPCs into four compositionality
classes based on the independent semantic contribution of the verb and particle: (1) the VPC inherits its semantics from the verb and particle (i.e. is
not semantically idiomatic); (2) the VPC inherits semantics from the verb
only; (3) the VPC inherits semantics from the particle only; and (4) the VPC
inherits semantics from neither the verb nor the particle. A second approach
is to employ a one-dimensional classification of holistic VPC compositionality
(e.g. in the form of an integer scale of 0 to 10 (McCarthy, Keller, and Carroll
2003)).
1.3.2.2 Prepositional verbs
Prepositional verbs (PVs) relate closely to VPCs in being comprised of a
verb and selected preposition, with the crucial difference that the preposition
is transitive (e.g. refer to, look for ) (Jackendoff 1973; O’Dowd 1998; Huddleston and Pullum 2002; Baldwin 2005b; Osswald, Helbig, and Hartrumpf
2006). English PVs occur in two basic forms: (1) fixed preposition PVs (e.g.
come across, grow on), where there is a hard constraint of the verb and selected preposition being strictly adjacent; and (2) mobile preposition PVs (e.g.
refer to, send for ), where the selected preposition is adjacent to the verb in
the canonical word order, but undergoes limited syntactic alternation. For
example, mobile preposition PVs allow limited coordination of PP objects
(e.g. refer to the book and to the DVD vs. ∗ come across the book and across
the DVD), and the NP object of the selected preposition can be passivised
(e.g. the book was referred to vs. ∗ I was grown on by the book ).
PVs are highly frequent in general text, and notoriously hard to distinguish from VPCs and simple verb–preposition combinations, e.g. in parsing
applications.
1.3.2.3 Light-Verb Constructions
Light-verb constructions (i.e. LVCs) are made up of a verb and a noun complement, often in the indefinite singular form (Jespersen 1965; Abeillé 1988;
Miyagawa 1989; Grefenstette and Tapanainen 1994; Hoshi 1994; Sag, Baldwin, Bond, Copestake, and Flickinger 2002; Huddleston and Pullum 2002;
Butt 2003; Stevenson, Fazly, and North 2004). The name of the construction
comes from the verb being semantically bleached or “light”, in the sense that
its contribution to the meaning of the LVC is relatively small in comparison with that of the noun complement. In fact, the contribution of the light
verb is so slight that in many cases, the LVC can be paraphrased with the
verbal form of the noun complement (e.g. take a walk vs. walk or take a photograph vs. photograph). LVCs are also sometimes termed verb-complement pairs
(Tan, Kan, and Cui 2006) or support verb constructions (Calzolari, Fillmore,
Grishman, Ide, Lenci, MacLeod, and Zampolli 2002).
The following are the principal light verbs in English:
• do, e.g. do a demo, do a drawing, do a report
• give, e.g. give a wave, give a sigh, give a kiss
• have, e.g. have a rest, have a drink , have pity (on)
• make, e.g. make an offer , make an attempt, make a mistake
• take, e.g. take a walk , take a bath, take a photograph
There is some disagreement in the scope of the term LVC, most notably in
the membership of verbs which can be considered “light”. Calzolari, Fillmore,
Grishman, Ide, Lenci, MacLeod, and Zampolli (2002), e.g., argued that the
definition of LVCs (or support verb constructions in their terms) should be
extended to include: (1) verbs that combine with an event noun (deverbal
or otherwise) where the subject is a participant in the event most closely
identified with the noun (e.g. ask a question); and (2) verbs with subjects
that belong to some scenario associated with the full understanding of the
event type designated by the object noun (e.g. keep a promise).
Morphologically, the verb in LVCs inflects but the noun complement tends
to have fixed number and a preference for determiner type. For example, make
amends undergoes full verbal inflection (make/makes/made/making amends),
but the noun complement cannot be singular (e.g. ∗ make amend).12 Syntactically, LVCs are highly flexible, undergoing passivization (e.g. an offer was
made), extraction (e.g. How many offers did Kim make? ) and internal modification (e.g. make an irresistible offer ). On the other hand, there are hard
constraints on what light verbs a given noun complement can be combined
with (c.f. ∗ give/do/put/take an offer ), noting that some noun complements
combine with multiple light verbs (e.g. do/give a demo), often with different semantics (e.g. make a call vs. take a call vs. have a call ). Also, what
light verb a given noun will combine with to form an LVC is often consistent
across semantically-related noun clusters (e.g. give a cry/moan/howl vs. ∗ take
a cry/moan/howl ).
LVCs occur across a large number of the world’s languages, including Japanese
(Grimshaw and Mester 1988; Baldwin and Bond 2002), Korean (Ahn 1991),
Hindi (Mohanan 1994) and Persian (Karimi-Doostan 1997).
1.3.2.4 Verb–Noun Idiomatic Combinations
Verb–Noun Idiomatic Combinations (VNICs, also known as VP idioms)
are composed of a verb and noun in direct object position, and are (at
12 But also note other examples where the noun complement can be either singular or plural, e.g. take a bath vs. take baths.
least) semantically idiomatic (e.g. kick the bucket, shoot the breeze) (Nunberg, Sag, and Wasow 1994; Fellbaum 2002; Sag, Baldwin, Bond, Copestake,
and Flickinger 2002; Fazly, Cook, and Stevenson 2009). They are a notable
subclass of MWE because of their crosslingual occurrence, and high lexical
and semantic variability.
VNICs (along with other semantically idiomatic MWEs) are often categorised into two groups, based on their semantic decomposability (see Section 1.2.1.3) (Nunberg, Sag, and Wasow 1994; Riehemann 2001). With decomposable VNICs, given the interpretation of the VNIC, it is possible to associate
components of the VNIC with distinct elements of the VNIC interpretation,
based on semantics not immediately accessible from the component lexemes.
Assuming an interpretation of spill the beans such as reveal’(x,secret’),13
e.g., we could analyse spill as having the semantics of reveal’ and beans having the semantics of secret’, through a process of figuration. Other examples
of decomposable VNICs are pull strings (c.f. exert’(x,influence’)) and
touch a nerve (c.f. cause’(x,reaction’)). With non-decomposable VNICs
(e.g. get the hang (of ), kick the bucket), such a semantic decomposition is not
possible. The reason we make this distinction is that decomposable VNICs
tend to be syntactically flexible, in a manner predicted by the nature of the
semantic decomposition; non-decomposable VNICs, on the other hand, tend
not to be syntactically flexible (Cruse 1986; Nunberg, Sag, and Wasow 1994;
Jackendoff 1997; Sag, Baldwin, Bond, Copestake, and Flickinger 2002). For
example, spill the beans can be passivised (It’s a shame the beans were spilled )
and internally modified (AT&T spilled the Starbucks beans), similarly to a conventional verb–direct object pair (c.f. Sandy is loved by Kim and Kim loves
the inimitable Sandy); this is predicted by its decomposability.
VNICs generally occur with low frequency, but are notoriously hard to distinguish from literal usages of the same word combination (e.g. Kim made a face
at the policeman vs. Kim made a face in pottery class). An accurate means
of disambiguation is thus important in tasks which require semantic interpretation, but generally fraught by low volumes of training data.
1.3.3 Prepositional MWEs
1.3.3.1 Determinerless-Prepositional Phrases
Determinerless prepositional phrases (PP-Ds) are MWEs that are made up
of a preposition and a singular noun without a determiner (Quirk, Greenbaum, Leech, and Svartvik 1985; Huddleston and Pullum 2002; Sag, Baldwin, Bond, Copestake, and Flickinger 2002; Baldwin, Beavers, Van Der Beek,
Bond, Flickinger, and Sag 2006).
Syntactically, PP-Ds are highly diverse, and display differing levels of syntactic markedness, productivity and modifiability (Chander 1998; Ross 1995).
13 I.e., reveal’ is a 2-place predicate, with x binding to the subject.
That is, some PP-Ds are non-productive (e.g. on top vs. ∗ on bottom) and non-modifiable
(e.g. by car/foot/bus/...) and highly modifiable (e.g. at high expense, on
summer vacation). In fact, while some PP-Ds are optionally modifiable (e.g.
on vacation vs. on summer vacation), others require modification (e.g. ∗ at
level vs. at eye level , and ∗ at expense vs. at company expense) (Baldwin,
Beavers, Van Der Beek, Bond, Flickinger, and Sag 2006).
Syntactically-marked PP-Ds can be highly productive (Ross 1995; Grishman, Macleod, and Myers 1998). For example, by combines with a virtually
unrestricted array of countable nouns (e.g. by bus/car/taxi/...) but less readily with uncountable nouns (e.g. ∗ by information/linguistics/...).
Semantically, PP-Ds have a certain degree of semantic markedness on the
noun (Haspelmath 1997; Himmelmann 1998; Stvan 1998; Bond 2005). For
example, in combines with uncountable nouns which refer to a social institution (e.g. school, church, prison but not information) to form syntactically-unmarked PP-Ds with marked semantics, in the sense that only the social
institution sense of the noun is evoked (e.g. in school/church/prison/... vs. ∗in information) (Baldwin, Beavers, Van Der Beek, Bond, Flickinger, and Sag 2006).
PP-Ds occur with surprising frequency and cause problems during parsing
and generation, in terms of achieving the right balance between over- and
under-generation (Baldwin, Bender, Flickinger, Kim, and Oepen 2004).
1.3.3.2 Complex prepositions
Another common form of prepositional MWE is complex prepositions (e.g.
on top of , in addition to), and other forms of complex markers (Villada
Moirón 2005; Tsuchiya, Shime, Takagi, Utsuro, Uchimoto, Matsuyoshi, Sato,
and Nakagawa 2006; Trawiński, Sailer, and Soehn 2006). Complex prepositions can take the form of fixed MWEs (e.g. in addition to), or alternatively
semi-fixed MWEs, for example optionally allowing internal modification (e.g.
with (due/particular/special/...) regard to) or determiner insertion (e.g. on
(the) top of ).
1.4 MWE Classification
In developing a lexicon of MWEs, it is crucially important to develop a
classification which captures the general properties of MWE classes, but at the
same time allows for the encoding of information particular to a given MWE
instance. In this section, we present a commonly-used high-level classification,
based particularly on the syntactic and semantic properties of MWEs outlined
in Figure 1.1 (Bauer 1983; Sag, Baldwin, Bond, Copestake, and Flickinger 2002).
FIGURE 1.1: A classification of MWEs
  MWE
    Lexicalised Phrase
      fixed
      semi-fixed (non-decomposable VNICs, nominal MWEs)
      syntactically-flexible (VPCs, LVCs, decomposable VNICs)
    Institutionalised Phrase
The classification of MWEs into lexicalised phrases and institutionalised phrases
hinges on whether the MWE is lexicalised (i.e. explicitly encoded in the lexicon), or a simple collocation (i.e. only statistically idiomatic).
Lexicalised phrases are MWEs with lexical, syntactic, semantic or pragmatic
idiomaticity. Lexicalised phrases can be further split into: fixed expressions
(e.g. ad hoc, at first), semi-fixed expressions (e.g. spill the beans, car dealer ,
Chicago White Sox) and syntactically-flexible expressions (e.g. add up, give a
demo).
• fixed expressions are fixed strings that undergo neither morphosyntactic
variation nor internal modification, often due to fossilisation of what
was once a compositional phrase. For example, by and large is not morphosyntactically modifiable (e.g. ∗ by and larger) or internally modifiable
(e.g. ∗ by and very large). Non-modifiable determinerless prepositional
phrases such as on air are also fixed expressions.
• semi-fixed expressions are lexically-variable MWEs that have hard restrictions on word order and composition, but undergo some degree
of lexical variation such as inflection (e.g. kick/kicks/kicked/kicking the
bucket vs. ∗ the bucket was kicked ), variation in reflexive pronouns (e.g.
in her/his/their shoes) and determiner selection (e.g. The Beatles vs. a
Beatles album 14 ). Non-decomposable VNICs (e.g. kick the bucket, shoot
the breeze) and nominal MWEs (e.g. attorney general , part of speech)
are also classified as semi-fixed expressions.
14 The determiner the in The Beatles is obligatory in the case that The Beatles forms a noun phrase (i.e. Beatles can only be quantified by the), but in cases where Beatles forms a N̄, e.g. in [NP a [N′ [N′ Beatles’] album]], the lexical item is realized without a determiner.
• syntactically flexible expressions are MWEs which undergo syntactic variation, such as VPCs, LVCs and decomposable VNICs. The nature of the flexibility varies significantly across construction types. VPCs, for
example, are syntactically flexible with respect to the word order of the
particle and NP in transitive usages: hand in the paper vs. hand the
paper in. They are also usually compatible with internal modification,
even for intransitive VPCs: the plane took right off. LVCs (e.g. give a
demo) undergo full syntactic variation, including passivisation (e.g. a
demo was given), extraction (e.g. how many demos did he give? ) and
internal modification (e.g. give a clear demo ). Decomposable VNICs
are also syntactically flexible to some degree, although the exact form
of syntactic variation is predicted by the nature of their semantic decomposability.
Note that many of our MWE construction types can be assigned to a unique
sub-category of lexicalised phrase, namely: non-decomposable VNICs, NCs,
VPCs and LVCs. Determinerless PPs, on the other hand, cut across all three
sub-categories: non-modifiable PP-Ds (e.g. at first) are fixed expressions,
PP-Ds with strict constraints on modifiability (e.g. at level ) are semi-fixed
expressions, and highly-productive PP-Ds (e.g. as president/coach/father of
the bride/...) are syntactically-flexible.
The class of institutionalised phrases corresponds to MWEs which are exclusively statistically idiomatic, as described in Section 1.2.4. Examples include
salt and pepper and many thanks.
1.5 Research Issues
The major NLP tasks relating to MWEs are: (1) identifying and extracting
MWEs from corpus data, and disambiguating their internal syntax, and (2)
interpreting MWEs. Increasingly, these tasks are being pipelined with parsers
and applications such as machine translation (Venkatapathy and Joshi 2006;
Zhang, Kordoni, Villavicencio, and Idiart 2006; Blunsom 2007).
Depending on the type of MWE, the relative import of these syntactic
and semantic tasks varies. For example, with noun compounds, the identification and extraction tasks are relatively trivial, whereas interpretation
is considerably more difficult. Below, we discuss the challenges and review
the key research on MWEs in NLP. For a listing of relevant resources (especially datasets and toolkits), we refer the reader to the companion web site
(http://...).
1.5.1 Identification
Identification is the task of determining individual occurrences of MWEs in
running text. The task is at the token (instance) level, such that we may
identify 50 distinct occurrences of pick up in a given corpus. To give an
example of an identification task, given the corpus fragment in (4) (taken
from “The Frog Prince”, a children’s story), we might identify the MWEs in (4), namely put on, take a walk and pick up:
(4) One fine evening a young princess put on her bonnet and clogs, and
went out to take a walk by herself in a wood; ... she ran to pick it up;
...
In MWE identification, a key challenge is in differentiating between MWEs
and literal usages for word combinations such as make a face which can
occur in both usages (Kim made a face at the policeman [MWE] vs. Kim
made a face in pottery class [non-MWE]). Syntactic ambiguity is also a major confounding factor, e.g. in identifying VPCs in contexts such as Have the
paper in today. For example, in the sentence Kim signed in the room, there
is ambiguity between a VPC interpretation (sign in = “check in/announce
arrival”) and an intransitive verb + PP interpretation (“Kim performed the
act of signing in the room”).
MWE identification has tended to take the form of customised methods for
particular MWE construction types and languages (e.g. English VPCs, LVCs
and VNICs), but there have been attempts to develop generalised techniques,
as outlined below.
Perhaps the most obvious method of identifying MWEs is via a part-ofspeech (POS) tagger, chunker or parser, in the case that lexical information
required to identify MWEs is contained within the parser output. For example, in the case of VPCs, there is a dedicated tag for (prepositional) particles
in the Penn POS tagset, such that VPC identification can be performed simply by POS tagging a text, identifying all particle tags, and further identifying
the head verb associated with each particle (e.g. by looking left for the first
main verb, within a word window of fixed size) (Baldwin and Villavicencio
2002; Baldwin 2005a). Similarly, a chunker or phrase structure parser can be
used to identify constructions such as noun compounds or VPCs (McCarthy,
Keller, and Carroll 2003; Lapata and Lascarides 2003; Kim and Baldwin to appear).
This style of approach is generally not able to distinguish MWE and literal
usages of a given word combination, however, as they are not differentiated in
their surface syntax. Deep parsers which have lexical entries for MWEs and
disambiguate to the level of lexical items are able to make this distinction,
however, via supertagging or full parsing (Baldwin, Bender, Flickinger, Kim,
and Oepen 2004; Blunsom 2007).
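By way of illustration, the following is a minimal sketch (in Python, using NLTK) of the shallow tag-based strategy described above for English VPCs; the five-word window and the simple verb test are illustrative choices rather than part of any of the cited methods, and the tagger output is of course noisy.

import nltk  # requires the 'punkt' and 'averaged_perceptron_tagger' resources

def find_vpcs(sentence, window=5):
    """Return (verb, particle) pairs found via POS tags alone."""
    tagged = nltk.pos_tag(nltk.word_tokenize(sentence))  # Penn tags; particles are 'RP'
    vpcs = []
    for i, (word, tag) in enumerate(tagged):
        if tag == "RP":  # found a particle
            # look left for the nearest verb within the window
            for j in range(i - 1, max(i - 1 - window, -1), -1):
                if tagged[j][1].startswith("VB"):
                    vpcs.append((tagged[j][0], word))
                    break
    return vpcs

print(find_vpcs("She handed the paper in and the plane took right off."))
# e.g. [('handed', 'in'), ('took', 'off')], subject to tagger errors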
Another general approach to MWE identification is to treat literal and
MWE usages as different senses of a given word combination. This then allows for the application of word sense disambiguation (WSD) techniques to the
identification problem. As with WSD research, both supervised (Patrick and
Fletcher 2005; Hashimoto and Kawahara 2008) and unsupervised (Birke and
Sarkar 2006; Katz and Giesbrecht 2006; Sporleder and Li 2009) approaches
have been applied to the identification task. The key assumption in unsupervised approaches has been that literal usages will be contextually similar to
simplex usages of the component words (e.g. kick and bucket in the case of kick
the bucket). Mirroring the findings from WSD research, supervised methods
tend to be more accurate, but have the obvious drawback that they require
large numbers of annotated literal and idiomatic instances of a given MWE
to work. Unsupervised techniques are therefore more generally applicable.
A third approach, targeted particularly at semantically idiomatic MWEs, is
to assume that MWEs occur: (a) in canonical forms, or (b) only in particular
syntactic configurations, and do not undergo the same level of syntactic variation as literal usages. This relates to our claims in Section 1.3.2.4 relating
to non-decomposable VNICs, where the prediction is that VNICs such as kick
the bucket will not passivise or be internally modifiable. If we have a method
of identifying the limits of syntactic variability of a given MWE, therefore,
we can assume that any usage which falls outside these (e.g. kicked a bucket)
must be literal. The problem, then, is identifying the degree of syntactic
variability of a given MWE. This can be performed manually, in flagging individual MWE lexical items with predictions of what variations a given MWE
can undergo (Li, Zhang, Niu, Jiang, and Srihari 2003; Hashimoto, Sato, and
Utsuro 2006). An alternative which alleviates the manual overhead associated
with hand annotation is to use unsupervised learning to predict the “canonical” configurations for a given MWE, which can optionally be complemented
with a supervised model to identify literal usages which are used in one of the
canonical MWE configurations (e.g. Kim kicked the bucket in frustration, and
stormed out of the room) (Fazly, Cook, and Stevenson 2009).
In research to date, good results have been achieved for particular MWEs,
especially English VPCs. However, proposed methods have tended to rely
heavily on existing resources such as parsers and hand-crafted lexical resources, and be tuned to particular MWE types.
1.5.2 Extraction
MWE extraction is a type-level task, wherein the MWE lexical items attested in a predetermined corpus are extracted out into a lexicon. For example, we may wish to know whether a given corpus provides evidence for a
given verb take and preposition off combining to form a VPC (i.e. take off ).
To illustrate the difference between identification and extraction, identification would involve the determination of the individual occurrences of take off
(e.g. each of the 240 in a given corpus), whereas extraction would involve the
decision about whether take off occurred in the corpus or not (irrespective
of the number of occurrences). Clearly there is a close connection between
the two tasks, in that if we have identified one or more occurrences of a given
MWE we can extract it as a MWE, and conversely, if we have extracted a
given MWE, we must be able to identify at least one occurrence in the corpus.
The motivation for MWE extraction is generally lexicon development and
expansion, e.g. recognising newly-formed MWEs (e.g. ring tone or shock and
awe) or domain-specific MWEs.
Extracting MWEs is relevant to any lexically-driven application, such as
grammar engineering or information extraction. Depending on the particular
application, it may be necessary to additionally predict lexical properties of a
given MWE, e.g. its syntactic or semantic class. In addition, it is particularly
important for productive MWEs or domains which are rich in technical terms
(e.g. bus speed or boot up in the IT domain). MWE extraction is difficult for
many of the same reasons as MWE identification, namely syntactic flexibility
and ambiguity.
There has been a strong focus on the development of general-purpose techniques for MWE extraction, particularly in the guise of collocation extraction
(see Section 1.2.4). The dominating view here is that extraction can be carried out via association measures such as pointwise mutual information or the
t-test, based on analysis of the frequency of occurrence of a given word combination, often in comparison with the frequency of occurrence of the component
words (Church and Hanks 1989; Smadja 1993; Frantzi, Ananiadou, and Mima
2000; Evert and Krenn 2001; Pecina 2008). Association measures provide a
score for each word combination, which forms the basis of a ranking of MWE
candidates. Final extraction, therefore, consists of determining an appropriate cut-off in the ranking, although evaluation is often carried out over the
full ranking.
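As a minimal sketch of this style of approach, the snippet below (using NLTK's collocations module) ranks bigram candidates by pointwise mutual information; the file name corpus.txt, the frequency filter of 5 and the top-20 cut-off are all hypothetical choices.

import nltk
from nltk.collocations import BigramAssocMeasures, BigramCollocationFinder

# hypothetical corpus file; any list of tokens will do
tokens = nltk.word_tokenize(open("corpus.txt").read().lower())

finder = BigramCollocationFinder.from_words(tokens)
finder.apply_freq_filter(5)  # ignore very rare word combinations

# rank candidate MWEs by pointwise mutual information
for pair, score in finder.score_ngrams(BigramAssocMeasures.pmi)[:20]:
    print(pair, round(score, 2))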
Collocation extraction techniques have been applied to a wide range of
extraction tasks over a number of languages, with the general finding that it is
often unpredictable which association measure will work best for a given task.
As a result, recent research has focused on building supervised classifiers to
combine the predictions of a number of association measures, and has shown that
this leads to consistently better results than any single association measure
(Pecina 2008). It has also been shown that this style of approach works most
effectively when combined with POS tagging or parsing, and strict filters on
the type of MWE that is being extracted (e.g. adjective–noun or verb–
noun: Justeson and Katz (1995), Pecina (2008)). It is worth noting that
association measures have generally been applied to (continuous) word n-grams, or less frequently, pre-determined dependency types in the output of
a parser. Additionally, collocational extraction techniques tend to require
a reasonable number of token occurrences of a given word combination to
operate reliably, which we cannot always assume (Baldwin 2005a; Fazly 2007).
A second approach to MWE extraction, targeted specifically at semantically
and statistically idiomatic MWEs, is to extend the general association measure approach to include substitution (Lin 1999; Schone and Jurafsky 2001;
Pearce 2001). For example, in assessing the idiomaticity of red tape, explicit
comparison is made with lexically-related candidates generated by component
word substitution, such as yellow tape or red strip. Common approaches to
determining substitution candidates for a given component word are (near)synonymy—e.g. based on resources such as WordNet—and distributional
similarity.
Substitution can also be used to generate MWE candidates, and then
check for their occurrence in corpus data. For example, if clear up is a
known (compositional) VPC, it is reasonable to expect that VPCs such as
clean/tidy/unclutter/... up are also VPCs (Villavicencio 2005). That is not
to say that all of these occur as MWEs, however (c.f. ∗ unclutter up), so an
additional check for corpus attestation is usually used in this style of approach.
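A minimal sketch of the substitution step, assuming NLTK's WordNet interface, is given below; variant candidates are generated by swapping a component word for its WordNet synonyms, and the subsequent corpus attestation check is omitted.

from nltk.corpus import wordnet as wn  # requires the 'wordnet' resource

def substitution_variants(word, partner):
    """Variants of (word, partner) with `word` replaced by its WordNet synonyms."""
    variants = set()
    for synset in wn.synsets(word):
        for lemma in synset.lemma_names():
            lemma = lemma.replace("_", " ").lower()
            if lemma != word:
                variants.add((lemma, partner))
    return variants

# for an idiomatic MWE such as red tape, variants like ('crimson', 'tape')
# should be rare or unattested in corpus data
print(sorted(substitution_variants("red", "tape")))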
A third approach, also targeted at semantically idiomatic MWEs, is to analyse the relative similarity between the context of use of a given word combination and its component words (Schone and Jurafsky 2001; Stevenson, Fazly,
and North 2004; Widdows and Dorow 2005). Similar to the unsupervised
WSD-style approach to MWE identification (see Section 1.5.1), the underlying hypothesis is that semantically idiomatic MWEs will occur in markedly
different lexical contexts to their component words. A bag of words representation is commonly used to model the combined lexical context of all usages
of a given word or word combination. By interpreting this context model
as a vector, it is possible to compare lexical contexts, e.g. via simple cosine
similarity (Widdows 2005). In order to reduce the effects of data sparseness,
dimensionality reduction is often carried out over the word space prior to
comparison (Schütze 1997).
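The snippet below is a minimal sketch of this idea using scikit-learn, with toy context strings standing in for the corpus-derived bags of words, and without the dimensionality reduction step.

from sklearn.feature_extraction.text import CountVectorizer
from sklearn.metrics.pairwise import cosine_similarity

# toy aggregated contexts of usage (in practice, collected from a corpus)
context_mwe = "funeral died sad hospital family news"      # kick the bucket
context_kick = "ball football player goal leg foot"        # simplex kick
context_bucket = "water sand plastic handle fill spade"    # simplex bucket

X = CountVectorizer().fit_transform([context_mwe, context_kick, context_bucket])

# low similarity to the component words suggests semantic idiomaticity
print(cosine_similarity(X[0], X[1]))  # MWE vs. kick
print(cosine_similarity(X[0], X[2]))  # MWE vs. bucket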
The same approach has also been applied to extract LVCs, based on the
assumption that the noun complements in LVCs are often deverbal (e.g. bath,
proposal, walk ), and that the distribution of nouns in PPs post-modifying
noun complements in genuine LVCs (e.g. (make a) proposal of marriage) will
be similar to that of the object of the underlying verb (e.g. propose marriage)
(Grefenstette and Teufel 1995). Here, therefore, the assumption is that LVCs
will be distributionally similar to the base verb form of the noun complement, whereas with the original extraction method, the assumption was that
semantically idiomatic MWEs are dissimilar to their component words.
A fourth approach is to perform extraction on the basis of implicit identification. That is, (possibly noisy) token-level statistics can be fed into a
type-level classifier to predict whether there have been genuine instances of a
given MWE in the corpus. An example of this style of approach is to use POS
taggers, chunkers and parsers to identify English VPCs in different syntactic configurations, and feed the predictions of the various preprocessors into
the final extraction classifier (Baldwin 2005a). Alternatively, a parser can be
used to identify PPs with singular nouns, and semantically idiomatic PP-Ds
extracted from among them based on distributional (dis)similarity of occurrences with and without determiners across a range of prepositions (van der
Beek 2005).
A fifth approach is to use syntactic fixedness as a means of extracting
MWEs, based on the assumption that semantically idiomatic MWEs undergo
syntactic variation (e.g. passivisation or internal modification) less readily
than simple verb–noun combinations (Bannard 2007; Fazly, Cook, and Stevenson 2009).
In addition to general-purpose extraction techniques, linguistic properties of
particular MWE construction types have been used in extraction. For example, the fact that a given verb–preposition combination occurs as a noun (e.g.
takeoff, clip-on) is a strong predictor of that combination occurring
as a VPC (Baldwin 2005a).
One bottleneck in MWE extraction is the token frequency of the MWE
candidate. With a few notable exceptions (e.g. Baldwin 2005a; Fazly, Cook,
and Stevenson 2009), MWE research has tended to ignore low-frequency
MWEs, e.g. by applying a method only to word combinations which occur at
least N times in a corpus.
1.5.3 Internal Syntactic Disambiguation
As part of the process of MWE identification and extraction, for some
MWE types it is necessary to disambiguate the internal syntax of individual
MWEs. A prominent case of this in English is noun compounds with 3 or more
terms. For example, glass window cleaner has two possible interpretations,15
corresponding to the two possible bracketings of the compound: (1) “a cleaner
of glass windows” (= [[glass window] cleaner]), and (2) “a cleaner of windows,
made of glass” (= [glass [window cleaner]]). In this case, the first case (of
left bracketing) is the correct analysis, but movie car chase, e.g., is right
bracketing (= [movie [car chase]]). The process of disambiguating the syntax
of an NC is called bracketing.
The most common approach to bracketing is based on statistical analysis of
the components of competing analyses. In the adjacency model, for a ternary
NC N1 N2 N3 , a comparison is made of the frequencies of the two modifier–
head pairings extracted from the two analyses, namely N1 N2 and N1 N3 in
the left bracketing case, and N2 N3 and N1 N3 in the right bracketing case;
as N1 N3 is common to both, in practice, N1 N2 is compared directly with N2
N3 . A left bracketing analysis is selected in the case that N1 N2 is judged to
be more likely, otherwise a right bracketing analysis is selected (Marcus 1980).
In the dependency model, the NC is instead decomposed into the dependency
tuples of N1 N2 and N2 N3 in the case of left bracketing, and N2 N3 and
N1 N3 in the case of right bracketing; once again, the dependency N2 N3 is
common to both, and can be ignored. In the instance that N1 N2 is more
likely than N1 N3 , the model prefers a left bracketing analysis, otherwise
a right bracketing analysis is selected (Lauer 1995). While the dependency
model tends to outperform the adjacency model, the best-performing models
take features derived from both along with various syntactic and semantic
features (Nakov and Hearst 2005; Vadas and Curran 2008).
15 More generally, for an n item noun compound, the number of possible interpretations is defined by the Catalan number $C_n = \frac{1}{n+1}\binom{2n}{n}$.
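The following is a minimal sketch of the two models; the count() function and its toy frequencies are hypothetical stand-ins for corpus-derived modifier–head (or dependency) counts.

def count(a, b):
    """Hypothetical corpus frequency of the noun pair (a, b)."""
    toy_counts = {("glass", "window"): 120, ("window", "cleaner"): 310,
                  ("glass", "cleaner"): 45}
    return toy_counts.get((a, b), 0)

def bracket_adjacency(n1, n2, n3):
    # compare N1-N2 with N2-N3 (the N1-N3 pairing is shared by both analyses)
    return "left" if count(n1, n2) > count(n2, n3) else "right"

def bracket_dependency(n1, n2, n3):
    # compare N1-N2 with N1-N3 (the N2-N3 dependency is shared by both analyses)
    return "left" if count(n1, n2) > count(n1, n3) else "right"

print(bracket_adjacency("glass", "window", "cleaner"))   # 'right' on these toy counts
print(bracket_dependency("glass", "window", "cleaner"))  # 'left' on these toy counts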
1.5.4 MWE Interpretation
The semantic interpretation of MWEs is usually performed in one of two
ways: (1) relative to a generalised semantic inventory (compatible with both
simplex words and MWEs, such as WordNet); and (2) based on a set of
semantic relations capturing semantic interplay between component words.
When interpreting VPCs or lexicalised PP-Ds, e.g., the former approach would
be more appropriate (e.g. to capture the fact that bow out is synonymous
with withdraw , both of which are troponyms of retire). Nominal MWEs and
productive PP-Ds, on the other hand, are more amenable to interpretation
by semantic relations (e.g. to capture the semantics of apple pie in terms of
the make relation, as in “pie made from apple(s)”).
One common approach to MWE interpretation is via component similarity,
i.e. comparison of the components of a MWE with corresponding components
of annotated MWEs, or alternatively with simplex words. For example, a
novel NC can be interpreted by identifying training NCs with similar modifier
and head nouns (e.g. in interpreting grape extract, grape would be compared
with similar modifiers, and extract with similar heads), as determined relative
to a lexical resource or via distributional similarity. We can then extrapolate
from the closely-matching training NCs to predict the interpretation of the
novel NC (Vanderwende 1994; Moldovan, Badulescu, Tatu, Antohe, and Girju
2004; Kim and Baldwin 2005; Nastase, Sayyad-Shirabad, Sokolova, and Szpakowicz 2006; Kim and Baldwin 2007b; Ó Séaghdha 2008). Alternatively, we
may employ contextual similarity to compare a VPC with its simplex verb, to
determine if they are sufficiently similar that the VPC can be interpreted compositionally from the verb (Baldwin, Bannard, Tanaka, and Widdows 2003;
McCarthy, Keller, and Carroll 2003; Cook and Stevenson 2006).
Crosslinguistic evidence can also provide valuable evidence when interpreting MWEs. For example, analysis of what preposition is used in different
Romance languages to translate a given English MWE can provide valuable
insights into the range of possible interpretations for the English MWE (Girju
2009). Conversely, semantically idiomatic MWEs can be detected from parallel corpus data by identifying translation divergences in the component words’
lexical choice (Melamed 1997). For example, knowledge that balance and sheet
are most often translated as équilibre and feuille, respectively, in French, and
yet balance sheet is translated as bilan, suggests that balance sheet is semantically idiomatic.
One popular approach to determining the underlying semantic relation associated with a MWE is to identify surface realisations or paraphrases associated with each semantic class (Lapata 2002; Grover, Lapata, and Lascarides
2004; Kim and Baldwin 2006; Nicholson and Baldwin 2006; Nakov and Hearst
2008). For example, in the case of compound nominalisations, there are the
two primary classes of subject and object, based on whether the modifier
acts as the subject (e.g. investor hesitation = “investor hesitates”) or object
(e.g. product replacement = “replace (the) product”) of the base verb form of
the deverbal head. For a given compound nominalisation and base verb form,
it is possible to analyse the relative occurrence of the modifier as subject or
object of the base verb, and select the interpretation which is most commonly
observed (Lapata 2002; Grover, Lapata, and Lascarides 2004; Nicholson and
Baldwin 2006).
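As a minimal sketch of this idea, the decision reduces to comparing two corpus counts; subj_count and obj_count below are hypothetical stand-ins for counts of the modifier occurring as subject or object of the base verb.

def interpret_nominalisation(modifier, base_verb, subj_count, obj_count):
    """Choose the 'subject' or 'object' reading from relative corpus counts."""
    if subj_count(modifier, base_verb) >= obj_count(modifier, base_verb):
        return "subject"
    return "object"

# e.g. investor hesitation: "investor hesitates" should outnumber "hesitate (the) investor"
print(interpret_nominalisation("investor", "hesitate",
                               lambda m, v: 87,   # toy subject count
                               lambda m, v: 2))   # toy object count -> 'subject'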
Another methodology which has been applied to the interpretation task
with success is analysis of the co-occurrence properties of the MWE components. For example, the semantics of particles in VPCs can be interpreted by
analysing what types of verbs can combine with a given particle (Cook and
Stevenson 2006; Kim and Baldwin 2007a). Similarly, Japanese compound
verbs (V-V combinations) can be interpreted by observing what set of verbs
each of the component verbs combines with to form a compound verb, optionally including the semantic class of the resulting compound verb (Uchiyama,
Baldwin, and Ishizaki 2005).
One overarching assumption made in most semantic interpretation tasks is
that it is possible to arrive at a compositional interpretation for each MWE
via its component words. Ideally, we of course need to identify instances of
semantic idiomaticity, motivating the need for methods which can model the
relative compositionality or decomposability of MWEs (Lin 1999; Baldwin,
Bannard, Tanaka, and Widdows 2003; McCarthy, Keller, and Carroll 2003;
McCarthy, Venkatapathy, and Joshi 2007).
While there has been a healthy interest in MWE interpretation, research
has suffered from lack of agreement on semantic inventories, and the relative
unavailability of annotated data. One very positive step towards redressing
this situation was a shared task at SemEval-2007, on interpreting nominal
MWEs in English (Girju, Nakov, Nastase, Szpakowicz, Turney, and Yuret
2007), and an upcoming SemEval-2010 task on the multi-way classification of
semantic relations between pairs of nominals. In practice, the SemEval-2007
task took a pair of nouns in a fixed sentential context and attempted to determine if they were interpretable using a set of semantic relations compatible
with NCs. As such, the task wasn’t specifically on NC interpretation, but NC
interpretation methods could be evaluated over the dataset (Kim and Baldwin
2008; Ó Séaghdha 2008). Crucially, the task organisers chose to sidestep the
controversy surrounding the precise membership of a broad-coverage set of
semantic relations, and instead focused on relations where there is relatively
high agreement between researchers. They additionally defused the question
of interpretational overlap/ambiguity of a given nominal, by designing the
task as a series of binary sub-tasks, where a prediction had to be made about
each nominal’s compatibility with a given semantic relation (ignoring whether
or not it was also compatible with other relations).
1.6 Summary
MWEs are an integral part of language: vast in number and highly varied in nature. They are defined by idiomaticity at the lexical, syntactic,
semantic, pragmatic and statistical levels, and occur in a myriad of different
constructions in the world’s languages. In addition to providing a brief foray
into the linguistic complexities of MWEs, we have detailed the key MWE construction types
in MWE research, and outlined various approaches to the primary computational challenges associated with MWEs, namely: identification, extraction
and interpretation.
We have deliberately not provided a survey of MWE resources in this paper, choosing instead to maintain an up-to-the-moment snapshot of the field
on the companion website at http://.... For those interested in pursuing
MWE research, we recommend this as your first port of call. For readers
who are interested in further reading on MWEs, we particularly recommend
the following works: (Moon 1998; McKeown and Radev 2000; Cowie 2001;
Sag, Baldwin, Bond, Copestake, and Flickinger 2002; Villavicencio, Bond,
Korhonen, and McCarthy 2005).
Bibliography
Abeillé, A. (1988). Light verb constructions and extraction out of NP in a
tree adjoining grammar. In Papers of the 24th Regional Meeting of the
Chicago Linguistics Society.
Ahn, H.-D. (1991). Light verbs, VP-movement, Negation and Clausal Structure in Korean and English. Ph. D. thesis, University of Wisconsin-Madison.
Aronoff, M. (1976). Word Formation in Generative Grammar. Cambridge,
USA: MIT Press.
Baldwin, T. (2005a). The deep lexical acquisition of English verb-particles.
Computer Speech and Language, Special Issue on Multiword Expressions 19 (4), 398–414.
Baldwin, T. (2005b). Looking for prepositional verbs in corpus data. In
Proceedings of the 2nd ACL-SIGSEM Workshop on the Linguistic Dimensions of Prepositions and their Use in Computational Linguistics
Formalisms and Applications, Colchester, UK, pp. 115–126.
Baldwin, T., C. Bannard, T. Tanaka, and D. Widdows (2003). An empirical
model of multiword expression decomposability. In Proceedings of the
ACL-2003 Workshop on Multiword Expressions: Analysis, Acquisition
and Treatment, Sapporo, Japan, pp. 89–96.
Baldwin, T., J. Beavers, L. Van Der Beek, F. Bond, D. Flickinger, and
I. A. Sag (2006). In search of a systematic treatment of determinerless
PPs. In P. Saint-Dizier (Ed.), Syntax and Semantics of Prepositions.
Springer.
Baldwin, T., E. M. Bender, D. Flickinger, A. Kim, and S. Oepen (2004).
Road-testing the English Resource Grammar over the British National
Corpus. In Proceedings of the 4th International Conference on Language
Resources and Evaluation (LREC 2004), Lisbon, Portugal, pp. 2047–
2050.
Baldwin, T. and F. Bond (2002). Multiword expressions: Some problems
for Japanese NLP. In Proceedings of the 8th Annual Meeting of the
Association for Natural Language Processing (Japan), Keihanna, Japan,
pp. 379–382.
Baldwin, T. and A. Villavicencio (2002). Extracting the unextractable: A
case study on verb-particles. In Proceedings of the 6th Conference on
Natural Language Learning (CoNLL-2002), Taipei, Taiwan, pp. 98–104.
Bannard, C. (2007). A measure of syntactic flexibility for automatically
identifying multiword expressions in corpora. In Proceedings of the ACL2007 Workshop on A Broader Perspective on Multiword Expressions,
Prague, Czech Republic, pp. 1–8.
Bannard, C., T. Baldwin, and A. Lascarides (2003). A statistical approach
to the semantics of verb-particles. In Proceedings of the ACL2003 Workshop on Multiword Expressions: analysis, acquisition and treatment,
Sapporo, Japan, pp. 65–72.
Barker, K. and S. Szpakowicz (1998). Semi-automatic recognition of noun
modifier relationships. In Proceedings of the 17th International Conference on Computational Linguistics (COLING-1998), Montreal, Canada,
pp. 96–102.
Bauer, L. (1983). English Word-formation. Cambridge, UK: Cambridge
University Press.
Bauer, L. (2001). Compounding. In M. Haspelmath (Ed.), Language Typology and Language Universals. The Hague, Netherlands: Mouton de
Gruyter.
Benor, S. B. and R. Levy (2006). The chicken or the egg? A probabilistic
analysis of English binomials. Language 82 (2), 233–278.
Benson, M. (1990). Collocations and general-purpose dictionaries. International Journal of Lexicography 3 (1), 23–35.
Birke, J. and A. Sarkar (2006). A clustering approach for the nearly unsupervised recognition of nonliteral language. In Proceedings of the 11th
Conference of the EACL (EACL 2006), Trento, Italy, pp. 329–336.
Blunsom, P. (2007). Structured Classification for Multilingual Natural Language Processing. Ph. D. thesis, University of Melbourne.
Bolinger, D. (1976). The Phrasal Verb in English. Boston, USA: Harvard
University Press.
Bond, F. (2005). Translating the Untranslatable: A solution to the Problem of Generating English Determiners. CSLI Studies in Computational
Linguistics. CSLI Publications.
Booij, G. (2002). Separable complex verbs in Dutch: A case of periphrastic
word formation. In N. Dehé, R. Jackendoff, A. McIntyre, and S. Urban
(Eds.), Verb-particle explorations, pp. 21–41. Berlin, Germany / New
York, USA: Mouton de Gruyter.
Brinton, L. (1985). Verb particles in English: Aspect or aktionsart. Studia
Linguistica 39, 157–168.
Butt, M. (2003). The light verb jungle. In Proceedings of the Workshop on
Multi-Verb Constructions, Trondheim, Norway, pp. 1–49.
Calzolari, N., C. Fillmore, R. Grishman, N. Ide, A. Lenci, C. MacLeod, and
A. Zampolli (2002). Towards best practice for multiword expressions in
computational lexicons. In Proceedings of the 3rd International Conference on Language Resources and Evaluation (LREC 2002), Las Palmas,
Canary Islands, pp. 1934–1940.
Chafe, W. L. (1968). Idiomaticity as an anomaly in the Chomskyan
paradigm. Foundations of Language 4, 109–127.
Chander, I. (1998). Automated postediting of documents. Ph. D. thesis, University of Southern California.
Choueka, Y. (1988). Looking for needles in a haystack or locating interesting
collocational expressions in large textual databases. In Proceedings of
RIAO, pp. 43–38.
Church, K. W. and P. Hanks (1989). Word association norms, mutual information and lexicography. In Proceedings of the 27th Annual Meeting of
the Association of Computational Linguistics (ACL-1989), Vancouver,
Canada, pp. 76–83.
Cook, P. and S. Stevenson (2006). Classifying particle semantics in English
verb-particle constructions. In Proceedings of the ACL-2006 Workshop
on Multiword Expressions: Identifying and Exploiting Underlying Properties, Sydney, Australia, pp. 45–53.
Copestake, A. and A. Lascarides (1997). Integrating symbolic and statistical
representations: The lexicon pragmatics interface. In Proceedings of the
35th Annual Meeting of the Association of Computational Linguistics
and 8th Conference of the European Chapter of Association of Computational Linguistics (ACL/EACL-1997), Madrid, Spain, pp. 136–143.
Cowie, A. (Ed.) (2001). Phraseology : Theory, Analysis, and Applications.
Oxford, UK: Oxford University Press.
Cowie, A. P. and P. A. Howarth (1996). Phraseology – a select bibliography.
International Journal of Lexicography 9 (1), 38–51.
Cruse, A. D. (1986). Lexical Semantics. Cambridge, UK: Cambridge University Press.
Dirven, R. (2001). The metaphoric in recent cognitive approaches to English
phrasal verbs. metaphorik.de 1, 39–54.
Downing, P. (1977). On the creation and use of English compound nouns.
Language 53 (4), 810–842.
Evert, S. (2004). The Statistics of Word Cooccurrences: Word Pairs and
Collocations. Ph. D. thesis, University of Stuttgart.
Evert, S. and B. Krenn (2001). Methods for the qualitative evaluation of
lexical association measures. In Proceedings of the 39th Annual Meeting of the ACL and 10th Conference of the EACL (ACL-EACL 2001),
Toulouse, France, pp. 188–195.
Fazly, A. (2007). Automatic Acquisition of Lexical Knowledge about Multiword Predicates. Ph. D. thesis, University of Toronto.
Fazly, A., P. Cook, and S. Stevenson (2009). Unsupervised type and token
identification of idiomatic expressions. Computational Linguistics 35 (1),
61–103.
Fellbaum, C. (Ed.) (1998). WordNet, An Electronic Lexical Database. Cambridge, Massachusetts, USA: MIT Press.
Fellbaum, C. (2002). VP idioms in the lexicon: Topics for research using
a very large corpus. In Proceedings of the KONVENS 2002 Conference,
Saarbrücken, Germany.
Fernando, C. and R. Flavell (1981). On idioms. Exeter: University of Exeter.
Fillmore, C., P. Kay, and M. C. O’Connor (1988). Regularity and idiomaticity in grammatical constructions. Language 64, 501–538.
Frantzi, K., S. Ananiadou, and H. Mima (2000). Automatic recognition of
multi-word terms: the C-value/NC-value method. International Journal
on Digital Libraries 3 (2), 115–130.
Gagné, C. L., T. L. Spalding, and M. C. Gorrie (2005). Sentential context
and the interpretation of familiar open-compounds and novel modifier-noun phrases. Language and Speech 28 (2), 203–221.
Gates, E. (1988). The treatment of multiword lexemes in some current dictionaries of English. Snell-Hornby.
Gerber, L. and J. Yang (1997). Systran MT dictionary development. In
Proceedings of the Sixth Machine Translation Summit (MT Summit VI),
San Diego, USA.
Gibbs, R. W. (1980). Spilling the beans on understanding and memory for
idioms in conversation. Memory and Cognition 8 (2), 149–156.
Girju, R. (2009). The syntax and semantics of prepositions in the task of
automatic interpretation of nominal phrases and compounds: A crosslinguistic study. Computational Linguistics 35 (2).
Girju, R., P. Nakov, V. Nastase, S. Szpakowicz, P. Turney, and D. Yuret
(2007). Semeval-2007 task 04: Classification of semantic relations between nominals. In Proceedings of the 4th International Workshop on
Semantic Evaluations, Prague, Czech Republic, pp. 13–18.
Grefenstette, G. and P. Tapanainen (1994). What is a word, what is a
sentence? problems of tokenization. In Proceedings of the 3rd Conference
on Computational Lexicography and Text Research, Budapest, Hungary,
pp. 79–87.
Grefenstette, G. and S. Teufel (1995). A corpus-based method for automatic
identification of support verbs for nominalizations. In Proceedings of
the 7th European Chapter of Association of Computational Linguistics
(EACL-1995), Dublin, Ireland, pp. 98–103.
Grimshaw, J. and A. Mester (1988). Light verbs and theta-marking. Linguistic Inquiry 19 (2), 205–232.
Grishman, R., C. Macleod, and A. Myers (1998). COMLEX syntax reference manual.
Grover, C., M. Lapata, and A. Lascarides (2004). A comparison of parsing
technologies for the biomedical domain. Journal of Natural Language
Engineering 1 (1), 1–38.
Hashimoto, C. and D. Kawahara (2008). Construction of an idiom corpus
and its application to idiom identification based on WSD incorporating
idiom-specific features. In Proceedings of the 2008 Conference on Empirical Methods in Natural Language Processing (EMNLP 2008), Honolulu,
USA, pp. 992–1001.
Hashimoto, C., S. Sato, and T. Utsuro (2006). Japanese idiom recognition:
Drawing a line between literal and idiomatic meanings. In Proceedings of
the COLING/ACL 2006 Interactive Poster System, Sydney, Australia,
pp. 353–360.
Haspelmath, M. (1997). From Space to Time in The World’s Languages.
Munich, Germany: Lincom Europa.
Hoshi, H. (1994). Passive, Causative, and Light Verbs: A Study of Theta
Role Assignment. Ph. D. thesis, University of Connecticut.
Huddleston, R. and G. K. Pullum (2002). The Cambridge Grammar of the
English Language. Cambridge, UK: Cambridge University Press.
Jackendoff, R. (1973). The base rules for prepositional phrases. In A
Festschrift for Morris Halle, pp. 345–356. New York, USA: Rinehart
and Winston.
Jackendoff, R. (1997). The Architecture of the Language Faculty. Cambridge, USA: MIT Press.
Jackendoff, R. (2002). Foundations of Language. Oxford, UK: Oxford University Press.
Jespersen, O. (1965). A Modern English Grammar on Historical Principles,
Part VI, Morphology. London, UK: George Allen and Unwin Ltd.
Justeson, J. S. and S. M. Katz (1995). Technical terminology: Some linguistic properties and an algorithm for identification in text. Natural
Language Engineering 1 (1), 9–27.
Kaalep, H.-J. and K. Muischnek (2008). Multi-word verbs of Estonian: a
database and a corpus. In Proceedings of the LREC 2008 Workshop:
Towards a Shared Task for Multiword Expressions (MWE 2008), Marrakech, Morocco, pp. 23–26.
Kageura, K., B. Daille, H. Nakagawa, and L.-F. Chien (2004). Recent trends
in computational terminology. Terminology 10 (1), 1–21.
Karimi-Doostan, G. H. (1997). Light Verb Construction in Persian. Ph. D.
thesis, University of Essex.
Kastovsky, D. (1982). Wortbildung und Semantik. Dusseldorf: Bagel/Francke.
Katz, G. and E. Giesbrecht (2006). Automatic identification of noncompositional multi-word expressions using latent semantic analysis.
In Proceedings of the ACL-2006 Workshop on Multiword Expressions:
Identifying and Exploiting Underlying Properties, Sydney, Australia, pp.
28–35.
Katz, J. J. and P. M. Postal (2004). Semantic interpretation of idioms
and sentences containing them. In Quarterly Progress Report (70), MIT
Research Laboratory of Electronics, pp. 275–282. MIT Press.
Keysar, B. and B. Bly (1995). Intuitions of the transparency of idioms:
Can one keep a secret by spilling the beans? Journal of Memory and
Language 34 (1), 89–109.
Kim, S. N. and T. Baldwin (2005). Automatic interpretation of compound
nouns using WordNet similarity. In Proceedings of the 2nd International
Joint Conference on Natural Language Processing (IJCNLP-05), Jeju,
Korea, pp. 945–956.
Kim, S. N. and T. Baldwin (2006). Interpreting semantic relations in noun
compounds via verb semantics. In Proceedings of the COLING/ACL
2006 Interactive Poster System, Sydney, Australia, pp. 491–498.
Kim, S. N. and T. Baldwin (2007a). Detecting compositionality of English
verb-particle constructions using semantic similarity. In Proceedings of
Conference of the Pacific Association for Computational Linguistics,
Melbourne, Australia, pp. 40–48.
Kim, S. N. and T. Baldwin (2007b). Disambiguating noun compounds. In
Proceedings of the 22nd AAAI Conference on Artificial Intelligence, Vancouver, Canada, pp. 901–906.
Kim, S. N. and T. Baldwin (2008). Benchmarking noun compound interpretation. In Proceedings of the 3rd International Joint Conference on Natural
Language Processing (IJCNLP-2008), Hyderabad, India, pp. 569–576.
Kim, S. N. and T. Baldwin (to appear). How to pick out token instances
of English verb-particle constructions. Language Resources and Evaluation.
Lapata, M. (2002). The disambiguation of nominalizations. Computational
Linguistics 28 (3), 357–388.
Lapata, M. and A. Lascarides (2003). Detecting novel compounds: The
role of distributional evidence. In Proceedings of the 11th Conference of
the European Chapter for the Association of Computational Linguistics
(EACL-2003), Budapest, Hungary, pp. 235–242.
Lauer, M. (1995). Designing Statistical Language Learners: Experiments
on Noun Compounds. Ph. D. thesis, Macquarie University.
Levi, J. (1978). The Syntax and Semantics of Complex Nominals. New York,
USA: Academic Press.
Li, W., X. Zhang, C. Niu, Y. Jiang, and R. K. Srihari (2003). An expert
lexicon approach to identifying English phrasal verbs. In Proceedings of
the ACL2003 Workshop on Multiword Expressions: analysis, acquisition
and treatment, Sapporo, Japan, pp. 513–520.
Liberman, M. and R. Sproat (1992). The stress and structure of modified
noun phrases in English. In I. A. Sag and A. Szabolcsi (Eds.), Lexical
Matters – CSLI Lecture Notes No. 24. Stanford, USA: CSLI Publications.
Lindner, S. (1983). A lexico-semantic analysis of English verb particle constructions with OUT and UP. Ph. D. thesis, University of Indiana at
Bloomington.
Lieber, R. and P. Štekauer (Eds.) (2009). The Oxford Handbook of Compounding. Oxford University Press.
Lin, D. (1998). Extracting collocations from text corpora. In Proceedings of
the 1st Workshop on Computational Terminology, Montreal, Canada.
Lin, D. (1999). Automatic identification of non-compositional phrases. In
Proceedings of the 37th Annual Meeting of the Association for Computational Linguistics, College Park, USA, pp. 317–324.
Lüdeling, A. (2001). On Particle Verbs and Similar Constructions in German. Stanford, USA: CSLI Publications.
Marcus, M. (1980). A Theory of Syntactic Recognition for Natural Language. Cambridge, USA: MIT Press.
Matsuo, Y., S. Shirai, A. Yokoo, and S. Ikehara (1997). Direct parse tree
translation in cooperation with the transfer method. In D. Jones and
H. Somers (Eds.), New Methods in Language Processing, pp. 229–238.
London, UK: UCL Press.
McCarthy, D., B. Keller, and J. Carroll (2003). Detecting a continuum
of compositionality in phrasal verbs. In Proceedings of the ACL2003
Workshop on Multiword Expressions: analysis, acquisition and treatment, Sapporo, Japan, pp. 73–80.
McCarthy, D., S. Venkatapathy, and A. Joshi (2007). Detecting compositionality of verb-object combinations using selectional preferences. In
Proceedings of the 2007 Joint Conference on Empirical Methods in Natural Language Processing and Computational Natural Language Learning
(EMNLP-CoNLL), pp. 369–379.
McIntyre, A. (2007). Particle verbs and argument structure. Language and
Linguistics Compass 1 (4), 350–367.
McKeown, K. R. and D. R. Radev (2000). Collocations. In R. Dale,
H. Moisl, and H. Somers (Eds.), A Handbook of Natural Language Processing, Chapter 15. Marcel Dekker.
Melamed, I. D. (1997). Automatic discovery of non-compositional compounds in parallel data. In Proceedings of the 2nd Conference on Empirical Methods in Natural Language Processing (EMNLP-97), Providence,
USA, pp. 97–108.
Mel’čuk, I. A. and A. Polguère (1987). A formal lexicon in the Meaning-Text Theory (or how to do lexica with words). Computational Linguistics 13 (3–4), 261–275.
Himmelmann, N. P. (1998). Regularity in irregularity: Article use in adpositional phrases. Linguistic Typology 2, 315–353.
Miyagawa, S. (1989). Light verbs and the ergative hypothesis. Linguistic
Inquiry 20, 659–668.
Miyazaki, M., S. Ikehara, and A. Yokoo (1993). Combined word retrieval
for bilingual dictionary based on the analysis of compound word. Transactions of the Information Processing Society of Japan 34 (4), 743–754.
(in Japanese).
Mohanan, T. (1994). Argument Structure in Hindi. Stanford, USA: CSLI
Publications.
Moldovan, D., A. Badulescu, M. Tatu, D. Antohe, and R. Girju (2004).
Models for the semantic classification of noun phrases. In Proceedings
of HLT-NAACL 2004: Workshop on Computational Lexical Semantics,
Boston, USA, pp. 60–67.
Moon, R. E. (1998). Fixed Expressions and Idioms in English: A Corpus-based Approach. Oxford, UK: Oxford University Press.
Nakov, P. (2008). Noun compound interpretation using paraphrasing verbs:
Feasibility study. In Proceedings of the 13th International Conference on Artificial Intelligence: Methodology, Systems, Applications
(AIMSA’08), Varna, Bulgaria, pp. 103–117.
Nakov, P. and M. Hearst (2005). Search engine statistics beyond the n-gram: Application to noun compound bracketing. In Proceedings of the
9th Conference on Computational Natural Language Learning (CoNLL2005), Ann Arbor, USA, pp. 17–24.
Nakov, P. and M. A. Hearst (2008). Solving relational similarity problems
using the web as a corpus. In Proceedings of the 46th Annual Meeting
of the ACL: HLT, Columbus, USA, pp. 452–460.
Nastase, V., J. Sayyad-Shirabad, M. Sokolova, and S. Szpakowicz (2006).
Learning noun-modifier semantic relations with corpus-based and
WordNet-based features. In Proceedings of the 21st National Conference on Artificial Intelligence (AAAI), Boston, USA, pp. 781–787.
Nicholson, J. and T. Baldwin (2006). Interpretation of compound nominalisations using corpus and web statistics. In Proceedings of the COLING/ACL 2006 Workshop on Multiword Expressions: Identifying and
Exploiting Underlying Properties, Sydney, Australia, pp. 54–61.
Nunberg, G., I. A. Sag, and T. Wasow (1994). Idioms. Language 70, 491–
538.
O’Dowd, E. M. (1998). Prepositions and Particles in English. Oxford University Press.
Ó Séaghdha, D. (2008). Learning compound noun semantics. Ph. D. thesis,
Computer Laboratory, University of Cambridge.
Osswald, R., H. Helbig, and S. Hartrumpf (2006). The representation of
German prepositional verbs in a semantically based computer lexicon. In
Proceedings of the 5th International Conference on Language Resources
and Evaluation (LREC 2006), Genoa, Italy.
Patrick, J. and J. Fletcher (2005). Classifying verb particle constructions by
verb arguments. In Proceedings of the Second ACL-SIGSEM Workshop
on the Linguistic Dimensions of Prepositions and their Use in Computational Linguistics Formalisms and Applications, Colchester, UK, pp.
200–209.
Pauwels, P. (2000). Put, set, lay, and place: a cognitive linguistic approach
to verbal meaning. Munich, Germany: Lincom Europa.
Pearce, D. (2001). Synonymy in collocation extraction. In Proceedings of
the NAACL 2001 Workshop on WordNet and Other Lexical Resources:
Applications, Extensions and Customizations, Pittsburgh, USA, pp. 41–
46.
Pecina, P. (2008). Lexical Association Measures. Ph. D. thesis, Charles University.
Piao, S., P. Rayson, D. Archer, A. Wilson, and T. McEnery (2003). Extracting multiword expressions with a semantic tagger. In Proceedings of
the ACL2003 Workshop on Multiword Expressions: analysis, acquisition
and treatment, Sapporo, Japan, pp. 49–56.
Quirk, R., S. Greenbaum, G. Leech, and J. Svartvik (1985). A Comprehensive Grammar of the English Language. London, UK: Longman.
Riehemann, S. (2001). A Constructional Approach to Idioms and Word
Formation. Ph. D. thesis, Stanford University.
Rosario, B. and M. Hearst (2001). Classifying the semantic relations in noun
compounds via a domain-specific lexical hierarchy. In Proceedings of the
6th Conference on Empirical Methods in Natural Language Processing
(EMNLP-2001), Pittsburgh, Pennsylvania, USA, pp. 82–90.
Ross, H. (1995). Defective noun phrases. In Papers of the 31st Regional
Meeting of the Chicago Linguistics Society, Chicago, Illinois, USA, pp.
398–440.
Sag, I. A., T. Baldwin, F. Bond, A. Copestake, and D. Flickinger (2002).
Multiword expressions: A pain in the neck for NLP. In Proceedings
of the 3rd International Conference on Intelligent Text Processing and
Computational Linguistics (CICLing-2002), Mexico City, Mexico, pp.
1–15.
Sager, J. C. (1990). A Practical Course in Terminology Processing. Amsterdam, Netherlands / Philadelphia, USA: John Benjamins.
Schone, P. and D. Jurafsky (2001). Is knowledge-free induction of multiword unit dictionary headwords a solved problem? In Proceedings of
the 6th Conference on Empirical Methods in Natural Language Processing (EMNLP 2001), Hong Kong, China, pp. 100–108.
Schütze, H. (1997). Ambiguity Resolution in Language Learning. Stanford,
USA: CSLI Publications.
Sinclair, J. (1991). Corpus, Concordance, Collocation. Oxford, UK: Oxford
University Press.
Smadja, F. (1993). Retrieving collocations from text: Xtract. Computational Linguistics 19 (1), 143–77.
Spärck Jones, K. (1983). Compound noun interpretation problems. Englewood Cliffs, USA: Prentice-Hall.
Sporleder, C. and L. Li (2009). Unsupervised recognition of literal and nonliteral use of idiomatic expressions. In Proceedings of the 12th Conference
of the EACL (EACL 2009), Athens, Greece, pp. 754–762.
Stevenson, S., A. Fazly, and R. North (2004). Statistical measures of
the semi-productivity of light verb constructions. In Proceedings of the
2nd ACL Workshop on Multiword Expressions: Integrating Processing,
Barcelona, Spain, pp. 1–8.
Stvan, L. S. (1998). The Semantics and Pragmatics of Bare Singular Noun
Phrases. Ph. D. thesis, Northwestern University.
Tan, Y. F., M.-Y. Kan, and H. Cui (2006). Extending corpus-based identification of light verb constructions using a supervised learning framework.
In Proceedings of the EACL 2006 Workshop on Multi-word-expressions
in a multilingual context (MWEmc), Trento, Italy.
Tanaka, T. and T. Baldwin (2003). Noun-noun compound machine translation: a feasibility study on shallow processing. In Proceedings of the
ACL 2003 Workshop on Multiword Expressions: Analysis, Acquisition
and Treatment, Sapporo, Japan, pp. 17–24.
Trawiński, B., M. Sailer, and J.-P. Soehn (2006). Combinatorial aspects of
collocational prepositional phrases. In P. Saint-Dizier (Ed.), Computational Linguistics Dimensions of Syntax and Semantics of Prepositions.
Dordrecht, Netherlands: Kluwer Academic.
Trawiński, B., M. Sailer, J.-P. Soehn, L. Lemnitzer, and F. Richter (2008).
Cranberry expressions in English and in German. In Proceedings of the
LREC 2008 Workshop: Towards a Shared Task for Multiword Expressions (MWE 2008), Marrakech, Morocco, pp. 35–38.
Tschichold, C. (1998). Multi-word Units in Natural Language Processing.
Ph. D. thesis, University of Basel.
Tsuchiya, M., T. Shime, T. Takagi, T. Utsuro, K. Uchimoto, S. Matsuyoshi,
S. Sato, and S. Nakagawa (2006). Chunking Japanese compound functional expressions by machine learning. In Proceedings of the EACL 06
Workshop on Multi-word-expressions in a Multilingual Context, Trento,
Italy, pp. 25–32.
Uchiyama, K., T. Baldwin, and S. Ishizaki (2005). Disambiguating Japanese
compound verbs. Computer Speech and Language, Special Issue on Multiword Expressions 19 (4), 497–512.
Vadas, D. and J. R. Curran (2008). Parsing noun phrase structure with
CCG. In Proceedings of the 46th Annual Meeting of the ACL: HLT,
Columbus, USA, pp. 335–343.
van der Beek, L. (2005). The extraction of determinerless PPs. In Proceedings of the Second ACL-SIGSEM Workshop on the Linguistic Dimensions of Prepositions and their Use in Computational Linguistics
Formalisms and Applications, Colchester, UK, pp. 190–199.
Vanderwende, L. (1994). Algorithm for automatic interpretation of noun
sequences. In Proceedings of the 15th Conference on Computational linguistics, Kyoto, Japan, pp. 782–788.
Venkatapathy, S. and A. Joshi (2006). Using information about multi-word
expressions for the word-alignment task. In Proceedings of the COLING/ACL 2006 Workshop on Multiword Expressions: Identifying and
Exploiting Underlying Properties, Sydney, Australia, pp. 53–60.
Villada Moirón, B. (2005). Data-driven identification of fixed expressions
and their modifiability. Ph. D. thesis, Alfa-Informatica, University of
Groningen.
Villavicencio, A. (2005). The availability of verb-particle constructions in
lexical resources: How much is enough? Computer Speech and Language, Special Issue on Multiword Expressions 19 (4), 415–432.
Villavicencio, A., T. Baldwin, and B. Waldron (2004). A multilingual
database of idioms. In Proceedings of the 4th International Conference
on Language Resources and Evaluation (LREC 2004), Lisbon, Portugal,
pp. 1127–1130.
Villavicencio, A., F. Bond, A. Korhonen, and D. McCarthy (2005). Introduction to the special issue on multiword expressions: Having a crack at
a hard nut. Computer Speech and Language, Special Issue on Multiword
Expressions 19 (4), 365–377.
Widdows, D. (2005). Geometry and Meaning. Stanford, USA: CSLI Publications.
Widdows, D. and B. Dorow (2005). Automatic extraction of idioms using
graph analysis and asymmetric lexicosyntactic patterns. In Proceedings
of the ACL 2004 Workshop on Deep Lexical Acquisition, Ann Arbor,
USA, pp. 48–56.
Xu, R., Q. Lu, and S. Li (2006). The design and construction of a Chinese
collocation bank. In Proceedings of the 5th International Conference on
Language Resources and Evaluation (LREC 2006), Genoa, Italy.
Zhang, Y., V. Kordoni, A. Villavicencio, and M. Idiart (2006). Automated
multiword expression prediction for grammar engineering. In Proceedings of the Workshop on Multiword Expressions: Identifying and Exploiting Underlying Properties, Sydney, Australia, pp. 36–44. Association for Computational Linguistics.
What is Named Entity Recognition (NER)? Applications and Uses
NER, short for Named Entity Recognition, is a standard Natural Language Processing problem
which deals with information extraction. The primary objective is to locate and classify named
entities in text into predefined categories such as the names of persons, organizations, locations,
events, expressions of time, quantities, monetary values, percentages, etc.
To put it simply, NER deals with extracting real-world entities from text, such as a person, an
organization, or an event. Named Entity Recognition is also known as entity identification,
entity chunking, and entity extraction. The resulting entity tags are quite similar to POS (part-of-speech) tags.
NER using NLTK
NLTK is a standard Python library with prebuilt functions and utilities for ease of use and
implementation. It is one of the most widely used libraries for natural language processing and
computational linguistics.
Recognizing named entities in a large corpus can be a challenging task, but NLTK has a built-in
method, nltk.ne_chunk(), that can recognize the entity types shown in the table below:
NE Type        Examples
ORGANIZATION   Georgia-Pacific Corp., WHO
PERSON         Eddy Bonte, President Obama
LOCATION       Murray River, Mount Everest
DATE           June, 2008-06-29
TIME           two fifty a.m., 1:30 p.m.
MONEY          175 million Canadian dollars, GBP 10.40
PERCENT        twenty pct, 18.75 %
FACILITY       Washington Monument, Stonehenge
GPE            South-East Asia, Midlothian
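A minimal sketch of how nltk.ne_chunk() can be used in practice. The sentence is illustrative, and the NLTK data packages named in the download loop are assumptions (their names can vary slightly across NLTK versions):

import nltk

# One-time downloads of the models that ne_chunk depends on.
for pkg in ["punkt", "averaged_perceptron_tagger", "maxent_ne_chunker", "words"]:
    nltk.download(pkg, quiet=True)

sentence = "Eddy Bonte visited the Washington Monument in June."
tokens = nltk.word_tokenize(sentence)      # tokenize
tagged = nltk.pos_tag(tokens)              # POS-tag (Penn tags)
tree = nltk.ne_chunk(tagged)               # chunk named entities into a Tree

# Print each recognized entity with its NE type (PERSON, FACILITY, GPE, ...).
for subtree in tree.subtrees():
    if subtree.label() != "S":
        print(subtree.label(), " ".join(word for word, tag in subtree.leaves()))

The exact entities reported depend on the pre-trained chunker, so the output should be treated as indicative rather than definitive.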
IOB tagging
The IOB format (short for inside, outside, beginning) is a tagging format used for labelling
tokens in a chunking task such as named-entity recognition. These tags are similar to part-of-speech
tags, but they additionally tell us where a word sits inside a chunk. The IOB tagging system
contains tags of the form:
• B-{CHUNK_TYPE} – for the first word of a chunk
• I-{CHUNK_TYPE} – for subsequent words inside the chunk
• O – for words outside any chunk
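As a small illustration (the tagging below is hand-written for this sentence, not the output of any particular tool):

# Hand-labelled IOB tags for "Mark Zuckerberg is one of the founders of Facebook".
iob_tagged = [
    ("Mark", "B-PERSON"), ("Zuckerberg", "I-PERSON"),
    ("is", "O"), ("one", "O"), ("of", "O"), ("the", "O"),
    ("founders", "O"), ("of", "O"), ("Facebook", "B-ORG"),
]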
NER using SpaCy
SpaCy is an open-source library for advanced Natural Language Processing written in Python
and Cython. It can be used to build information extraction or natural language understanding
systems, or to pre-process text for deep learning. Some of the features provided by spaCy are
tokenization, part-of-speech (PoS) tagging, text classification, and named entity recognition,
which is the one we are going to use here.
SpaCy provides a default model that can recognize a wide range of named or numerical entities,
including person, organization, language, event, etc. Apart from these default entities, we can
also add arbitrary classes to the NER model by training it on newly annotated examples.
SpaCy's named entity recognition has been trained on the OntoNotes 5 corpus and recognizes
entity types such as PERSON, ORG, GPE, DATE, MONEY, and PERCENT.
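A short sketch using spaCy's default English pipeline. It assumes the small model en_core_web_sm has been installed (python -m spacy download en_core_web_sm); the predicted labels depend on that model:

import spacy

nlp = spacy.load("en_core_web_sm")   # pre-trained pipeline with an NER component
doc = nlp("Sundar Pichai, the CEO of Google Inc., is walking in the streets of California.")

for ent in doc.ents:
    # ent.text is the entity span, ent.label_ the predicted type (PERSON, ORG, GPE, ...)
    print(ent.text, ent.label_)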
Applications of NER
NER has a wide range of applications in the fields of Natural
Language Processing and Information Retrieval. A few examples are listed below:
Classifying content for news providers: A large amount of online content is generated by
news and publishing houses on a daily basis, and managing it correctly can be a challenging task
for human workers. Named Entity Recognition can automatically scan entire articles and help in
identifying and retrieving the major people, organizations, and places discussed in them. Articles
are thus automatically categorized into defined hierarchies, and the content is also much more easily discovered.
Automatically Summarizing Resumes: You might have come across various tools that scan your
resume and retrieve important information such as name, address, and qualifications from it.
The majority of such tools use NER to retrieve this information. One
of the most challenging tasks faced by HR departments across companies is evaluating a
gigantic pile of resumes to shortlist candidates. Many of these resumes are excessively detailed, and
most of the information in them is irrelevant to the evaluator. Using an NER model, the
information relevant to the evaluator can be retrieved easily, thereby simplifying the
effort required to shortlist candidates from a pile of resumes.
Optimizing Search Engine Algorithms: When designing a search engine algorithm, it would be
inefficient and computationally expensive to search for an entire query across the millions of articles and
websites online. An alternative is to run an NER model over the articles once and store the entities
associated with them permanently. Then, for a quick and efficient search, the key tags in the search
query can be compared with the tags associated with the website articles.
Powering Recommendation systems: NER can be used in developing algorithms for
recommender systems that make suggestions based on our search history or on our present activity.
This is achieved by extracting the entities associated with the content in our history or previous
activity and comparing them with the label assigned to other unseen content. Thus we frequently
see the content of our interest.
Simplifying Customer Support: Usually, a company gets tons of customer complaints and
feedback on a daily basis, and going through each one of them and recognizing the concerned
parties is not an easy task. Using NER we can recognize relevant entities in customer complaints
and feedback such as Product specifications, department, or company branch location so that the
feedback is classified accordingly and forwarded to the appropriate department responsible for the
identified product.
How Does Named Entity Recognition Work?
When we read a text, we naturally recognize named entities like people, values, locations, and so
on. For example, in the sentence “Mark Zuckerberg is one of the founders of Facebook, a company
from the United States” we can identify three types of entities:
• “Person”: Mark Zuckerberg
• “Company”: Facebook
• “Location”: United States
For computers, however, we need to help them recognize entities first so that they can categorize
them.
This is done through machine learning and Natural Language Processing (NLP).
NLP studies the structure and rules of language and creates intelligent systems capable of deriving
meaning from text and speech, while machine learning helps machines learn and improve over time.
To learn what an entity is, an NER model needs to be able to detect a word, or string of words that
form an entity (e.g. New York City), and know which entity category it belongs to.
So first, we need to create entity categories, like Name, Location, Event, Organization, etc., and
feed an NER model relevant training data. Then, by tagging some word and phrase samples with
their corresponding entities, you’ll eventually teach your NER model how to detect entities itself.
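The exact training format depends on the toolkit, but the underlying idea is always the same: pair raw text with character-offset spans for each entity. A toolkit-neutral sketch, in which the texts, offsets, and category names are purely illustrative:

# Each training example pairs a text with (start, end, label) character spans.
# Offsets are end-exclusive, Python-style.
TRAIN_DATA = [
    ("Mark Zuckerberg is one of the founders of Facebook",
     {"entities": [(0, 15, "PERSON"), (42, 50, "ORG")]}),
    ("Apple opened a new office in New York City",
     {"entities": [(0, 5, "ORG"), (29, 42, "GPE")]}),
]

# Sanity check: the offsets should cover exactly the intended entity strings.
for text, annotation in TRAIN_DATA:
    for start, end, label in annotation["entities"]:
        print(label, "->", text[start:end])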
How to Do Named Entity Recognition
The easiest way to get started with named entity recognition is using an API.
Basically, you can choose between two types:
• Open-source named entity recognition APIs
• SaaS named entity recognition APIs
Open-Source named entity recognition APIs
Open-source APIs are for developers: they are free, flexible, and entail a gentle learning curve. Here
are a few options:
• Stanford Named Entity Recognizer (SNER): this Java tool developed by Stanford
University is considered a standard library for entity extraction. It is based on Conditional
Random Fields (CRF) and provides pre-trained models for extracting person, organization,
location, and other entities.
• SpaCy: a Python framework known for being fast and very easy to use. It has an excellent
statistical system that you can use to build customized NER extractors.
• Natural Language Toolkit (NLTK): this suite of libraries for Python is widely used for NLP
tasks. NLTK has its own named-entity classifier, ne_chunk, and also
provides a wrapper for using the Stanford NER tagger from Python.
SaaS named entity recognition APIs
SaaS tools are ready-to-use, low-code, and cost-effective solutions. Plus, they are easy to integrate
with other popular platforms.
MonkeyLearn, for example, is a text analysis SaaS platform that you can use for different NLP
tasks, one of which is named entity recognition. You can use MonkeyLearn's ready-built API to
integrate pre-trained entity extraction models, or you can easily build your own custom named
entity extractor also.
Different blocks present in a Typical NER model
A typical NER model consists of the following three blocks:
1. Noun Phrase Identification
This step deals with extracting all the noun phrases from a text with the help of dependency parsing
and part-of-speech tagging.
2. Phrase Classification
In this step, all the noun phrases extracted above are classified into their
respective categories. To disambiguate locations, the Google Maps API can be a good resource,
and to identify person or company names, open databases such as DBpedia and Wikipedia can
be used. Apart from this, we can also build lookup tables and dictionaries by combining
information from different sources.
3. Entity Disambiguation
Entities are sometimes misclassified, so creating a validation layer on top
of the results becomes useful. Knowledge graphs can be exploited for this purpose. Some
of the popular knowledge graphs are:
Google Knowledge Graph
IBM Watson
Wikipedia
Deep understanding of NER with an Example
How does Named Entity Recognition work?
As we can observe, after reading a particular text we naturally recognize named
entities such as people, values, locations, and so on.
For example, consider the following sentence:
Sentence: Sundar Pichai, the CEO of Google Inc. is walking in the streets of
California.
From the above sentence, we can identify three types of entities: (Named Entities)
• ( “person”: “Sundar Pichai” ),
• (“org”: “Google Inc.”),
• (“location”: “California”).
To do the same thing with computers, however, we first need to help them recognize entities so
that they can categorize them. For this we can take the help of machine learning and Natural
Language Processing (NLP).
Let's discuss the role each of these plays when implementing NER:
• NLP: It studies the structure and rules of language and forms intelligent systems that are
capable of deriving meaning from text and speech.
• Machine Learning: It helps machines learn and improve over time.
To learn what an entity is, a NER model needs to be able to detect a word or string of words that
form an entity (e.g. California) and decide which entity category it belongs to.
So, as a concluding step we can say that the heart of any NER model is a two-step process:
• Detect a named entity
• Categorize the entity
So first, we need to create entity categories, like Name, Location, Event, Organization, etc., and
feed a NER model relevant training data.
Then, by tagging some samples of words and phrases with their corresponding entities, we’ll
eventually teach our NER model to detect the entities and categorize them.
CSE4022
Natural Language Processing
Topic: Syntax
Prof. Tulasi Prasad Sariki,
SCSE, VIT Chennai Campus
www.learnersdesk.weebly.com
Contents
• What is Syntax?
• Where does it fit?
• Simplified View of Linguistics
• Grammatical Analysis Techniques
INTRODUCTION TO SYNTAX
What is Syntax?
• Syntax is the study of the structure of language.
• It refers to the way words are arranged together, and the relationships between them.
• Syntax is the study of the system of rules and categories that underlies sentence formation.
• Syntax is the study of the combination of words into phrases, clauses and sentences.
• Syntax describes how sentences and their constituents are structured.
INTRODUCTION TO SYNTAX
What is Syntax?
• Roughly, the goal is to relate surface form (what we perceive when someone says something) to meaning.
• More specifically, the goal is to relate an interface to the morphological component to an interface to a semantic component.
• Note: the interface to the morphological component may look like written text.
• The representational device is the tree structure.
INTRODUCTION TO SYNTAX
Where does it fit ?
Semantics
Syntax
Lexicon
INTRODUCTION TO SYNTAX
Simplified View of Linguistics
(Pipeline figure) The input /waddyasai/ is mapped by Phonology and Morphology to the word
sequence "what did you say"; Syntax assigns it a structure with "say" as head, "you" as subject
and "what" as object; and Semantics yields a representation such as P[ λx. say(you, x) ].
INTRODUCTION TO SYNTAX
Acronyms used in structural descriptions
of natural language
S=sentence/clause
ADJP=adjective phrase
N=(a single) noun
ADV=adverb
NP=noun phrase
ADVP=adverb phrase
V=verb
DET=determiner
VP=verb phrase
CONJ=conjunction
AUX=auxiliary verb
COMP=complementizer
AJ/ADJ=adjective
PRO=pro-constituent
PUNC=punctuation
INTRODUCTION TO SYNTAX
Examples
S = sentence/clause:       Does the dog chase the cat?
N = (a single) noun:       dog
NP = noun phrase:          the old dog
V = verb:                  chase
VP = verb phrase:          chase the cat
AUX = auxiliary verb:      does
AJ/ADJ = adjective:        old
ADJP = adjective phrase:   old and gray
INTRODUCTION TO SYNTAX
Examples
ADV = adverb:              happily
ADVP = adverb phrase:      once upon a time
DET = determiner:          the
CONJ = conjunction:        and
COMP = complementizer:     what
PRO = pro-constituent:     he
PUNC = punctuation:        ?
INTRODUCTION TO SYNTAX
Grammatical Analysis Techniques
Two main devices
Breaking up a String
• Sequential
• Hierarchical
• Transformational
Labeling the Constituents
• Morphological
• Categorial
• Functional
INTRODUCTION TO SYNTAX
Sequential Breaking up
That student solved the problems.
that + student + solve + ed + the + problem + s
INTRODUCTION TO SYNTAX
Sequential Breaking up and
Morphological Labeling
That student solved the problems.
that (word) + student (word) + solve (stem) + ed (affix) + the (word) + problem (stem) + s (affix)
INTRODUCTION TO SYNTAX
Sequential Breaking up and
Categorial Labeling
This boy can solve the problem.
this/Det  boy/N  can/Aux  solve/V  the/Det  problem/N
They called her a taxi.
they/Pron  call/V  ed/Affix  her/Pron  a/Det  taxi/N
INTRODUCTION TO SYNTAX
Sequential Breaking up and
Functional Labeling
Reading 1: They (Subject) called (Verbal) her (Indirect Object) a taxi (Direct Object)
Reading 2: They (Subject) called (Verbal) her (Direct Object) a taxi (Object Complement)
INTRODUCTION TO SYNTAX
Hierarchical Breaking up
Old men and women
Reading 1: [ Old [ men and women ] ]
Reading 2: [ [ Old men ] and women ]
INTRODUCTION TO SYNTAX
Hierarchical Breaking up and Categorial
Labeling
Poor John ran away.
[NP [A Poor] [N John]] [VP [V ran] [Adv away]]
INTRODUCTION TO SYNTAX
Hierarchical Breaking up and Functional Labeling
• Immediate Constituent (IC) Analysis
• Construction types in terms of the function of the constituents:
  • Predication        (subject + predicate)
  • Modification       (modifier + head)
  • Complementation    (verbal + complement)
  • Subordination      (subordinator + dependent unit)
  • Coordination       (independent unit + coordinator)
INTRODUCTION TO SYNTAX
Syntax as defined by Bloomfield
It is the study of free forms that are composed entirely of free forms.
Central notions of his theory
 Form classes and
 Constituent Structures
INTRODUCTION TO SYNTAX
Form-Classes
Form-Class – A set of forms displaying similar or identical
grammatical features is said to constitute a form-class, e.g.
‘Walk’, ‘come’, ‘run’, ‘jump’ - belong to the form-class of infinitive
expressions.
‘John’, ‘the boys’, ‘Mr. Smith’ – belong to the form-class of
nominative substantive expressions.
Form-Classes are similar to the traditional parts of speech.
One and the same form can belong to more than one form class.
INTRODUCTION TO SYNTAX
Form-Classes (contd.)
Criterion for form-class membership – Substitutability
In a sentence like – “John went to the Church”,
‘John’ can be substituted with ‘children’, ‘Mr. Smith’ or ‘the boys’
(as these are syntactically equivalent to each other and display
identical grammatical features).
Thus, form classes are sets of forms, any one of which may be
substituted for any other in a given construction.
The smaller forms into which a larger form may be analyzed are its
constituents, and the larger form is a construction.
INTRODUCTION TO SYNTAX
Example of the Constituents of a Construction
The phrase "poor John" is a construction analyzable into, or
composed of, the constituents "poor" and "John."
Similarly, the phrase "lost his watch" is composed of - "lost," "his,"
and "watch"-- all of which may be described as constituents of the
construction put together in a linear order.
INTRODUCTION TO SYNTAX
Constituency
Sentences or phrases can be analyzed as being composed of a
number of somewhat smaller units called constituents
(e.g. a Noun Phrase might consist of a determiner and a noun), and
This constituent analysis can be continued until no further
subdivisions are possible.
The major divisions that can be made are Immediate Constituents.
Ultimate Constituents - The irreducible elements of the construction
resulting from such an analysis.
INTRODUCTION TO SYNTAX
Immediate Constituents
An immediate constituent is a daughter of some larger unit that
constitutes a construction. Immediate constituents are often further
reducible.
There exists no intermediate unit between them that is a constituent
of the same construction e.g.
in a construction ‘poor John,’ ‘poor’ and ‘John’ are immediate
constituents.
INTRODUCTION TO SYNTAX
Constructions
Subordinating Constructions - Constructions in which only one
immediate constituent is of the same form class as the whole
construction e.g. ‘poor John’, ‘fresh milk’.
The constituent that is syntactically equivalent to the whole
construction is described as the head, and its partner is described as
the modifier: thus, in "poor John," the form "John" is the head, and
"poor" is its modifier.
INTRODUCTION TO SYNTAX
Constructions (contd.)
Coordinating Constructions - Constructions in which both
constituents are of the same form class as the whole construction
e.g. ‘men and women’, ‘boys and girls’
“Men and women," in which, it may be assumed, the immediate
constituents are the word "men" and the word "women," each of
which is syntactically equivalent to "men and women."
INTRODUCTION TO SYNTAX
Immediate Constituent Structure
The organization of the units of a sentence (its immediate
constituents) both in terms of their hierarchical arrangement and
their linear order.
IC Structure can be represented in the form of a tree diagram or
Using labeled bracketing, each analytic decision being represented
by a pair of square brackets at the appropriate points in the
construction.
INTRODUCTION TO SYNTAX
Immediate Constituent Structure (contd.)
‘Poor John lost his watch’ is not just a linear sequence of five words.
It can be analyzed into the immediate constituents – ‘poor John’ and
‘lost his watch’
And each of these constituents is analyzable into its own immediate
constituents.
The Ultimate Constituents of the whole construction are- ‘poor’,
‘John’, ‘lost’, ‘his’, ‘watch’
INTRODUCTION TO SYNTAX
Immediate Constituent Structure (contd.)
In ‘poor John’ –
‘poor’ and ‘John’ are constituents as well as
Immediate constituents as there is no intermediate unit between
them that is a constituent of the same construction.
Similarly, in ‘lost his watch’ –
‘lost’, ‘his’ and ‘watch’ are constituents
Not all of them are immediate constituents.
INTRODUCTION TO SYNTAX
Immediate Constituent Structure (contd.)
In ‘lost his watch’ –
‘his’ and ‘watch’ combine to make the intermediate construction
called ‘his watch’
‘his watch’ now combines with ‘lost’ to give
‘lost his watch’.
‘his’ and ‘watch’ are the constituents of ‘his watch’ and
‘lost’ and ‘his watch’ are immediate constituents of ‘lost his watch’
INTRODUCTION TO SYNTAX
Representing Immediate Constituent Structure
The constituent structure of the whole sentence can
be represented by means of labeled bracketing, e.g.
[ [ [Poor] [John] ] [ [lost] [ [his] [watch] ] ] ]
or by using a tree diagram for the same sentence, poor John lost his watch.
INTRODUCTION TO SYNTAX
Representing Immediate Constituent Structure
(contd.)
Labeled bracketing using category symbols:
[ [ [Poor]ADJ [John]N ]NP [ [lost]V [ [his]PRON [watch]N ]NP ]VP ]S
'Poor' – ADJ       'Poor John' – NP
'John' – N         'his watch' – NP
'lost' – V         'lost his watch' – VP
'his' – PRON       'Poor John lost his watch' – S
'watch' – N
INTRODUCTION TO SYNTAX
Immediate Constituent Structure using Tree
Diagram
[S [NP [ADJ Poor] [N John]] [VP [V lost] [NP [PRON his] [N watch]]]]
INTRODUCTION TO SYNTAX
Importance of the notion of Immediate
Constituent
It helps to account for the syntactic ambiguity of certain
constructions.
A classic example is the phrase "old men and women," which may be
interpreted in two different ways:
1. One associates "old" with "men and women"; the immediate
constituents are "old" and "men and women".
2. The second associates "old" just with "men"; the immediate
constituents are "old men" and "women".
INTRODUCTION TO SYNTAX
Predication
The part of a sentence or clause containing a verb and stating
something about the subject.
[Birds]subject [fly]predicate
[S [Subject Birds] [Predicate fly]]
INTRODUCTION TO SYNTAX
Modification
[A]modifier [flower]head
John [slept]head [in the room]modifier
[S [Subject John] [Predicate [Head slept] [Modifier in the room]]]
INTRODUCTION TO SYNTAX
Complementation
He [saw]verbal [a lake]complement
[S [Subject He] [Predicate [Verbal saw] [Complement a lake]]]
Complements are required to complete the meaning of a sentence or a part of a sentence.
INTRODUCTION TO SYNTAX
Subordination
John slept [in]subordinator [the room]dependent unit
[S [Subject John] [Predicate [Head slept] [Modifier [Subordinator in] [Dependent Unit the room]]]]
Subordination is a way of combining sentences that makes one sentence more important than the other.
INTRODUCTION TO SYNTAX
Coordination
[John came in time] independent unit [but]coordinator [Mary was not ready] independent unit
[S [Independent Unit John came in time] [Coordinator but] [Independent Unit Mary was not ready]]
Coordination is a way of adding sentences together.
INTRODUCTION TO SYNTAX
An Example
In the morning, the sky looked much brighter.
[S [Modifier [Subordinator In] [DU [Modifier the] [Head morning]]] [Subject [Modifier the] [Head sky]] [Predicate [Verbal looked] [Complement [Modifier much] [Head brighter]]]]
INTRODUCTION TO SYNTAX
Hierarchical Breaking up and
Categorial / Functional Labeling
Hierarchical Breaking up coupled with Categorial /Functional
Labeling is a very powerful device.
But there are ambiguities which demand something more powerful.
E.g., Love of God
Someone loves God
God loves someone
INTRODUCTION TO SYNTAX
Hierarchical Breaking up with Categorial Labeling:
[NP [N love] [PP [P of] [NP God]]]
Hierarchical Breaking up with Functional Labeling:
[Head love] [Modifier [Subordinator of] [DU God]]
Both readings receive the same structure, so these devices alone cannot distinguish them.
INTRODUCTION TO SYNTAX
Types of Generative Grammar
• Finite State Model (sequential)
• Phrase Structure Model (sequential + hierarchical) + (categorial)
• Transformational Model (sequential + hierarchical + transformational) + (categorial + functional)
INTRODUCTION TO SYNTAX
Phrase Structure Grammar (PSG)
A phrase-structure grammar G consists of a four-tuple (V, T, S, P):
V is a finite set of symbols (the vocabulary)
◦ e.g., N, V, A, Adv, P, NP, VP, AP, AdvP, PP, student, sing, etc.
T is a finite set of terminal symbols, T ⊆ V
◦ e.g., student, sing, etc.
S is a distinguished non-terminal symbol, also called the start symbol, S ∈ V
P is a set of productions.
INTRODUCTION TO SYNTAX
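One minimal way to hold such a four-tuple in code; the toy vocabulary and productions below are purely illustrative:

# G = (V, T, S, P): vocabulary, terminals (T subset of V), start symbol (S in V), productions.
V = {"S", "NP", "VP", "Det", "N", "V", "the", "student", "sings"}
T = {"the", "student", "sings"}
S = "S"
P = {
    "S":   [["NP", "VP"]],
    "NP":  [["Det", "N"]],
    "VP":  [["V"]],
    "Det": [["the"]],
    "N":   [["student"]],
    "V":   [["sings"]],
}
assert T <= V and S in V   # the definitional constraints from the slide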
Noun Phrases
John
the student
NP
NP
the intelligent student
NP
N
Det
N
John
the
student
Det
AdjP
N
the intelligent student
INTRODUCTION TO SYNTAX
Noun Phrase
his first five PhD students
NP
Det
Ord
Quant
N
his
first
five
PhD
INTRODUCTION TO SYNTAX
N
students
Noun Phrase
The five best students of my class
NP
Det
Quant
the
five
AP
N
best students
INTRODUCTION TO SYNTAX
PP
of my class
Verb Phrases
can sing
can hit the ball
VP
VP
Aux
V
Aux
V
NP
can
sing
can
hit
the ball
INTRODUCTION TO SYNTAX
Verb Phrase
Can give a flower to Mary
VP
Aux
can
V
NP
give a flower
INTRODUCTION TO SYNTAX
PP
to Mary
Verb Phrase
may make John the chairman
VP
Aux
may
V
NP
make John
NP
the chairman
INTRODUCTION TO SYNTAX
Verb Phrase
may find the book very interesting
VP
Aux
V
NP
may
find
the book
AP
very interesting
INTRODUCTION TO SYNTAX
Prepositional Phrases
in the classroom
near the river
PP
PP
P
NP
P
NP
in
the classroom
near
the river
INTRODUCTION TO SYNTAX
Adjective Phrases
intelligent
very honest
fond of sweets
AP
AP
AP
A
Degree
intelligent
very
A
A
honest fond
INTRODUCTION TO SYNTAX
PP
of sweets
Adjective Phrase
very worried that she might have done badly in the assignment
AP
Degree
very
S’
A
worried
that she might have done badly in the
assignment
INTRODUCTION TO SYNTAX
Phrase Structure Rules
The boy hit the ball.
Rewrite Rules:
(i)   S   → NP VP
(ii)  NP  → Det N
(iii) VP  → V NP
(iv)  Det → the
(v)   N   → boy, ball
(vi)  V   → hit
We interpret each rule X → Y as the instruction "rewrite X as Y".
INTRODUCTION TO SYNTAX
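These rewrite rules can be typed almost verbatim into NLTK and handed to its chart parser; a small sketch (assuming NLTK is installed; the printed tree format may differ slightly between versions):

import nltk

grammar = nltk.CFG.fromstring("""
S   -> NP VP
NP  -> Det N
VP  -> V NP
Det -> 'the'
N   -> 'boy' | 'ball'
V   -> 'hit'
""")

parser = nltk.ChartParser(grammar)
for tree in parser.parse("the boy hit the ball".split()):
    print(tree)   # (S (NP (Det the) (N boy)) (VP (V hit) (NP (Det the) (N ball))))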
Derivation
The boy hit the ball.
Sentence
NP + VP                           (i)
Det + N + VP                      (ii)
Det + N + V + NP                  (iii)
The + N + V + NP                  (iv)
The + boy + V + NP                (v)
The + boy + hit + NP              (vi)
The + boy + hit + Det + N         (ii)
The + boy + hit + the + N         (iv)
The + boy + hit + the + ball      (v)
INTRODUCTION TO SYNTAX
PSG Parse Tree
The boy hit the ball.
S
NP
VP
Det
N
the
boy
NP
V
Det
N
the
ball
hit
INTRODUCTION TO SYNTAX
PSG Parse Tree
S
NP
PropN
VP
V
John wrote those words in the Book of
Proverbs.
PP
NP
NP
P
John wrote
those
words
in
NP
the
book
INTRODUCTION TO SYNTAX
PP
of
proverbs
Penn POS Tags
John wrote those words in the Book of Proverbs.
[John/NNP ]
wrote/VBD
[ those/DT words/NNS ]
in/IN
[ the/DT Book/NN ]
of/IN
[ Proverbs/NNS ]
INTRODUCTION TO SYNTAX
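Tags like these can be reproduced with NLTK's default POS tagger; a small sketch (it needs the 'punkt' and 'averaged_perceptron_tagger' data packages, and the tagger's output may differ from the slide on some tokens):

import nltk

tokens = nltk.word_tokenize("John wrote those words in the Book of Proverbs.")
print(nltk.pos_tag(tokens))
# e.g. [('John', 'NNP'), ('wrote', 'VBD'), ('those', 'DT'), ('words', 'NNS'), ...]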
Penn Treebank
John wrote those words in the Book of Proverbs.
(S (NP-SBJ (NP John))
   (VP wrote
       (NP those words)
       (PP-LOC in
           (NP (NP-TTL (NP the Book)
                       (PP of
                           (NP Proverbs)))))))
INTRODUCTION TO SYNTAX
PSG Parse Tree
S
NP
VP
NP
PP
Aux
AP
Official trading in the
shares will start in Paris
on Nov 6.
N
P
V
PP
PP
NP
A
official trading in
the shares will
start in Paris
INTRODUCTION TO SYNTAX
on Nov 6
Penn POS Tags
Official trading in the shares will start in Paris on Nov 6.
[ Official/JJ trading/NN ]
in/IN
[ the/DT shares/NNS ]
will/MD start/VB in/IN
[ Paris/NNP ]
on/IN
[ Nov./NNP 6/CD ]
INTRODUCTION TO SYNTAX
Penn Treebank
Official trading in the shares will start in Paris on Nov 6.
( (S (NP-SBJ (NP Official trading)
             (PP in
                 (NP the shares)))
     (VP will
         (VP start
             (PP-LOC in
                 (NP Paris))
             (PP-TMP on
                 (NP Nov 6))))))
INTRODUCTION TO SYNTAX
Penn POS Tag Set
Adjective: JJ                      Plural Noun: NNS
Adverb: RB                         Personal Pronoun: PP
Cardinal Number: CD                Proper Noun: NP
Determiner: DT                     Verb base form: VB
Preposition: IN                    Modal verb: MD
Coordinating Conjunction: CC       Verb (3sg Pres): VBZ
Subordinating Conjunction: IN      Wh-determiner: WDT
Singular Noun: NN                  Wh-pronoun: WP
INTRODUCTION TO SYNTAX
Techniques of Grammatical
Analysis
What is Grammar?
• A theory of language
• A theory of competence of a native
speaker
• An explicit model of competence
• A finite set of rules
What are the requirements?
• A model of competence
– Should be able to generate an infinite set of
grammatical sentences of the language
– Should not generate any ungrammatical ones
– Should be able to account for ambiguities
– If two sentences are understood to have same
meaning, the grammar should give the same structure
for both at some level
– If two sentences are understood to have different
internal relationship, the grammar should assign
different structural description
Techniques of Grammatical Analysis
• Two main devices
– Breaking up a string
• Sequential
• Hierarchical
• Transformational
– Labeling the constituents
• Morphological
• Categorial
• Functional
• A grammar may combine any of these devices
for grammatical analysis.
Grammatical Analysis Techniques
Breaking up
Sequential
Transformational
Hierarchical
Labeling
Morphological
Functional
Categorial
Breaking up and Labeling
– Sequential Breaking up
• Sequential Breaking up and Morphological labeling
• Sequential Breaking up and Categorial labeling
• Sequential Breaking up and Functional labeling
– Hierarchical Breaking up
• Hierarchical Breaking up and Categorial Labeling
• Hierarchical Breaking up and Functional Labeling
Sequential Breaking up
• This device breaks up a sentence into a sequence of
morphemes
the + boy + s + kill + ed + the + un + happy + rat + s
raam + ne + dande + se + saap + ko + maar + aa
• If a sentence is understood in two different ways, a
grammar should give two corresponding descriptions
– They + can + fish
• They are able to fish
• They put fish in cans
Sequential Breaking up and
Morphological labeling
After breaking up a sentence, we can give morphological
labels to each units
the + boy + s + kill + ed + the + un + happy + rat + s
word
stem affix
stem
affix
word
affix stem
stem
affix
raam + ne + dande + se + saap + ko + maar + aa
word
affix
word
affix
word
affix
word
affix
Sequential Breaking up and
Categorical Labeling
• This boy can solve the problem
S
This
boy
Det
N
can
Aux
solve
V
the
Det
problem
N
Sequential Breaking up and
Categorical Labeling (…continued)
• They
can
Pronoun Aux
• They
Pronoun Verb
fish
Verb
can
fish
Noun
Sequential Breaking up and
Categorical Labeling (…continued)
• They
called
her
a
taxi
Pronoun Verb Pronoun Article Noun
– They said she was a taxi
– They called a taxi for her
Sequential Breaking up and
Functional labeling
They
called
her
a taxi
Subject
Verbal
IO
DO
called
her
a taxi
They
Subject Verbal
DO
Object
Complement
Sequential Breaking up and
Functional labeling (…continued)
Old
men
modifier
Head
and
women
coordinator Head
Hierarchical Breaking up
• This device breaks up a sentence into its
constituents at different hierarchies or
levels
Old men and
women
Old
men and women
men
and
women
Old men and
women
Old men
Old
men
and
women
Hierarchical Breaking up and
Categorial Labeling
Poor
ADJ
NP
John
ran
away
N
V
ADV
VP
Hierarchical Breaking up and
Functional Labeling
• Immediate Constituent Analysis
• Construction types in terms of the function
of the constituents:
– Predication
(subject + predicate)
– Modification
(modifier + head)
– Complementation
(verbal + complement)
– Subordination (subordinator + dependent unit)
– Coordination (independent unit + coordinator
+ independent unit)
In the morning, the sky
looked much brighter
Mod
Head
Mod
Mod Head
Verbal
Sub
DU
Modifier
Head
Subject
Complement
Predicate
Head
Hierarchical Breaking up and
Categorial /Functional Labeling
• Hierarchical Breaking up coupled with Categorial
/Functional Labeling is a very powerful device
• But there are ambiguities which demand
something more powerful
Love of God
• Someone loves God
• God loves someone
Love of God
Noun
Phrase
love
Love of God
Prepositional
Phrase
of
God
Head
love
Modifier
Sub-
DU
of
God
Generative Grammar
• A generative grammar
– generates all the grammatical sentences of the
language
– rejects all the ungrammatical ones
• It is free to choose and combine any of the
techniques of breaking up and labeling
• Depending upon what tools the grammar
combines, we have different types of grammar
Types of Generative Grammar
•
Finite State Model
(sequential)
•
Phrase Structure Model
(sequential + hierarchical)
•
Transformational Model
(sequential + hierarchical +
transformational)
Phrase Structure Model
• Inadequacies
– Ambiguity
– Paraphrase Relationship
– Constructional Homonymy
Ambiguity
• If a sentence is understood to have two
meanings, a grammar should give two
corresponding structural descriptions
• PS grammar fails to meet this demand
• The shooting of the hunters
– The hunter shot someone
– Someone shot the hunters
NP
PP
NP
Det
the
N
shooting
P
of
NP
Det
N
the
hunters
Paraphrase Relationship
• If two sentences are understood to have the
same meaning, a grammar should give the
same structural description for the two
sentences at some level
• PS grammar fails to meet this demand
• Examples
– The boy slapped the girl
– The girl was slapped by the boy
S
VP
NP
Det
The
N
boy
V
slapped
NP
Det
N
the
girl
S
VP
NP
Det
The
N
girl
V
PP
Aux
V
P
was
slapped
by
NP
the
boy
Constructional Homonymy
• If two sentences are not understood the same
way, a grammar should give two structural
descriptions
• PS grammar cannot do this
– The audience was asked to leave by the side-door
– The audience was asked to leave by the chairman
S
VP
NP
Det
V
N
Aux
Inf. Phr
V
VP
Inf.
V
PP
P
NP
Det
The
audience was
asked
to
leave
by the
N
Sidedoor
S
VP
NP
Det
V
N
Aux
Inf. Phr
V
VP
Inf.
V
PP
P
NP
Det
The
audience was
asked
to
leave
by the
N
Chairman
Transformational Model
• A generative grammar that makes use of all three devices
  - sequential,
  - hierarchical, and
  - transformational
  is called a transformational grammar.
PS model
• Makes use of PS
rules exclusively
• PS rules generate
surface structure
Transformational
model
• Makes use of PS
rules and
transformational rules
• PS rules generate DS
• DS is converted into
the surface structure
by transformational
rules
PS grammar
Surface
Structure
PS rules
Transformational grammar
PS rules
Deep
Structure
Transformation
Surface
Structure
Why wasn’t he punished?
• PS rules:
S     → NP VP
VP    → V NP PP
V     → Aux V
Aux   → (Tense) (Modal) (Perf) (Prog)
Tense → {pres/past}
NP    → (PreDet) (Det) (Ord) (Quant) (AP) N
Lexical substitution
S
VP
NP
VG
Aux
NP
PP
V
Tense
Someone
past
punish
him
for some
reason
Why wasn’t he punished?
(…continued)
• Transformations
– Passivization
– Agent deletion
– Negation
– Neg. Contraction
– Wh- substitution
– Interrogation
– Wh- fronting
– Affix switch
Passivization
S
VP
NP
VG
Aux
Tense
He
past
PP
V
Pass
be en punish
P
PP
NP
N
by someone
for
Some
reason
Agent
Deletion
S
VP
NP
VG
Aux
Tense
He
past
PP
V
Pass
be en punish
for some reason
Negation
S
VP
NP
VG
Aux
Tense
He
past
PP
V
Pass
be not
en punish
for some reason
Neg.
Contraction
S
VP
NP
VG
Aux
Tense
He
past
PP
V
Pass
be’not
en punish
for some reason
WhSubstitution
S
VP
NP
VG
Aux
Tense
He
past
WH
V
Pass
be’not
en punish
why
Interrogative
S
Aux
VP
NP
VG
WH
V
Tense
past
be’not
he
en
punish
why
WH
Aux
S
Wh-fronting
NP
VP
VG
V
Tense
why
past
be’not
he
en
punish
WH
Aux
S
Affix switch
NP
VP
VG
Tense
V
be’not
past
punish
why
was’not
he
en
punished
CSE4022
Natural Language Processing
Topic: Basic Parsing Strategy
Prof. Tulasi Prasad Sariki,
SCSE, VIT Chennai Campus
www.learnersdesk.weebly.com
Simple CFG for ATIS English
Lexicon
Grammar
S → NP VP
S → Aux NP VP
S → VP
NP → Pronoun
NP → Proper-Noun
NP → Det Nominal
Nominal → Noun
Nominal → Nominal Noun
Nominal → Nominal PP
VP → Verb
VP → Verb NP
VP → VP PP
PP → Prep NP
Det → the | a | that | this
Noun → book | flight | meal | money
Verb → book | include | prefer
Pronoun → I | he | she | me
Proper-Noun → Houston | NWA
Aux → does
Prep → from | to | on | near | through
PARSING
2
A Fragment of English Grammar
S   → NP VP
VP  → V NP
NP  → NNP | ART N
NNP → Ram
V   → ate | saw
ART → a | an | the
N   → rice | apple | movie
PARSING
3
Derivation
• S is a special symbol called start symbol.
S => NP VP
(rewrite S; multiple choice points arise at each rewrite)
=> NNP VP
(rewrite NP)
=> Ram VP
(rewrite NNP)
=> Ram V NP
(rewrite VP)
=> Ram ate NP
(rewrite V)
=> Ram ate ART N
(rewrite NP)
=> Ram ate the N
(rewrite ART)
=> Ram ate the rice
(rewrite N)
PARSING
4
Two Strategies : Top-Down & Bottom-Up
Top down : Start with S and generate the sentence.
Bottom up : Start with the words in the sentence and use the
rewrite rules backwards to reduce the sequence of symbols to
produce S.
Previous slide showed top-down strategy.
PARSING
5
Bottom-Up Derivation
Ram ate the rice
=>
NNP ate the rice
(rewrite Ram)
=>
NNP V the rice
(rewrite ate)
=>
NNP V ART rice
(rewrite the)
=>
NNP V ART N
(rewrite rice)
=>
NP V ART N
(rewrite NNP)
=>
NP V NP
(rewrite ART N)
=>
NP VP
(rewrite V NP)
=>
S
PARSING
6
Parsing Algorithm
A procedure that “searches” through the grammatical rules to find a
combination that generates a tree which stands for the structure of
the sentence.
PARSING
7
Parsing as Search (State Space : AND–OR Graph)
S
NP
NNP
VP
ART N
V
NP
NNP ART N
The leaves have links to words in the language, e.g.,
 AND node
ART
a
an
 OR node
the
PARSING
8
Top-Down Parsing
DFS on the AND-OR graph
Data structures:
 Open List (OL): Nodes to be expanded
 Closed List (CL): Expanded Nodes
 Input List (IL): Words of sentence to be parsed
 Moving Head (MH): Walks over the IL
PARSING
9
Trace of Top-Down Parsing
Initial Condition (T0)
OL
S
CL (empty)
Ram ate the rice
IL
MH
PARSING
10
Trace of Top-Down Parsing
T1:
NP
VP
OL
CL
S
Ram ate the rice
IL
MH
PARSING
11
Trace of Top-Down Parsing
T2:
NNP ART N VP
S NP
OL
CL
Ram ate the rice
IL
MH
PARSING
12
Trace of Top-Down Parsing
T3:
OL
ART N VP
S NP NNP
CL
Ram ate the rice
IL
MH (portion of Input consumed)
PARSING
13
Trace of Top-Down Parsing
T4:
N
OL
VP
S NP NNP ART*
CL
Ram ate the rice
IL
MH
(* indicates ‘useless’ expansion)
PARSING
14
Trace of Top-Down Parsing
T5:
OL
VP
S NP NNP ART* N*
CL
Ram ate the rice
IL
MH
PARSING
15
Trace of Top-Down Parsing
T6:
OL
V NP
S NP NNP ART* N*
CL
Ram ate the rice
IL
MH
PARSING
16
Trace of Top-Down Parsing
T7:
OL
NP
S NP NNP ART* N* V
CL
Ram ate the rice
IL
MH
PARSING
17
Trace of Top-Down Parsing
T8:
NNP ART N
OL
S NP NNP ART* N* V NP CL
Ram ate the rice
IL
MH
PARSING
18
Trace of Top-Down Parsing
T9:
ART
N
OL
S NP NNP ART* N* V NNP* CL
Ram ate the rice
IL
MH
PARSING
19
Trace of Top-Down Parsing
T10:
OL
N
S NP NNP ART* N* V NNP * ART
Ram ate the rice
CL
IL
MH
PARSING
20
Trace of Top-Down Parsing
T11:
OL
S NP NNP ART* N* V NNP* ART N
Ram ate the rice
CL
IL
MH
Successful Termination: OL empty AND MH at the
end of IL.
PARSING
21
Bottom-Up Parsing
Basic idea:
 Refer to words from the lexicon.
 Obtain all POSs for each word.
 Keep combining until S is obtained.
PARSING
22
Implementation of Bottom-up Parsing
 Through a stack
 Push words into the stack
 Look for a “handle” to reduce to a non-terminal
 Termination by “start symbol on stack” and “end of sentence”.
PARSING
23
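A very small sketch of this stack discipline in Python, using the grammar of the trace that follows ("book the flight"). It shifts each word as its part of speech and reduces greedily whenever the top of the stack matches a right-hand side, which suffices for this example but is not a general shift-reduce parser:

# Grammar rules as (right-hand side, left-hand side) pairs, plus a tiny lexicon.
RULES = [(("ART", "N"), "NP"), (("V", "NP"), "S")]
LEXICON = {"book": "V", "the": "ART", "flight": "N"}

def shift_reduce(words):
    stack = []
    for word in words:
        stack.append(LEXICON[word])              # shift (word replaced by its POS)
        reduced = True
        while reduced:                           # reduce while any handle matches
            reduced = False
            for rhs, lhs in RULES:
                if tuple(stack[-len(rhs):]) == rhs:
                    del stack[-len(rhs):]
                    stack.append(lhs)
                    reduced = True
    return stack

print(shift_reduce("book the flight".split()))   # ['S'] -> successful parse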
Trace of Bottom-Up Parsing
T0
book the flight
MH
PARSING
24
Trace of Bottom-Up Parsing
Push ‘book’; advance input pointer
book the flight
MH
book
PARSING
25
Trace of Bottom-Up Parsing
Reduce ‘book’
book the flight
MH
V
PARSING
26
Trace of Bottom-Up Parsing
Push ‘the’; advance input pointer
book the flight
MH
the
V
PARSING
27
Trace of Bottom-Up Parsing
Reduce ‘the’
book the flight
MH
ART
V
PARSING
28
Trace of Bottom-Up Parsing
Push ‘flight’; advance pointer
book the flight
flight
MH
ART
V
PARSING
29
Trace of Bottom-Up Parsing
Reduce ‘flight’
book the flight
N
MH
ART
V
PARSING
30
Trace of Bottom-Up Parsing
Reduce ‘ART N’ by ‘NP’
book the flight
MH
NP
V
PARSING
31
Trace of Bottom-Up Parsing
Reduce ‘V NP’ by ‘S’; termination by S on stack and input exhausted.
book the flight
MH
S
PARSING
32
Parsing Example
book that flight
[S [VP [Verb book] [NP [Det that] [Nominal [Noun flight]]]]]
PARSING
33
Top Down Parsing (trace for "book that flight")
Starting from S, the parser tries S → NP VP with NP → Pronoun, NP → ProperNoun and
NP → Det Nominal; each fails because "book" is none of these. S → Aux NP VP also fails.
It then tries S → VP: VP → Verb matches "book" but leaves "that flight" unconsumed, so the
parser backtracks to VP → Verb NP. Under NP, Pronoun and ProperNoun fail on "that", and
NP → Det Nominal finally succeeds with Det → that, Nominal → Noun, Noun → flight,
giving the parse [S [VP [Verb book] [NP [Det that] [Nominal [Noun flight]]]]].
PARSING
55
Bottom Up Parsing (trace for "book that flight")
Starting from the words, "book" is first tried as a Noun and successively built into Nominal,
Nominal Noun and Nominal PP constituents and then into an NP; all of these dead-end because
nothing can combine them with the rest of the sentence. The parser then tries "book" as a Verb,
reduces "that flight" via Det and Noun/Nominal to an NP, combines Verb and NP into a VP
(after discarding an unsuccessful VP PP attempt), and finally reduces the VP to S, giving
[S [VP [Verb book] [NP [Det that] [Nominal [Noun flight]]]]].
PARSING
77
Top Down vs. Bottom Up
 Top down never explores options that will not lead to a full parse, but
can explore many options that never connect to the actual sentence.
 Bottom up never explores options that do not connect to the actual
sentence but can explore options that can never lead to a full parse.
 Relative amounts of wasted search depend on how much the grammar
branches in each direction.
PARSING
78
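NLTK ships reference implementations of both strategies, which can be compared directly on the toy grammar used above. A sketch, assuming NLTK is installed (RecursiveDescentParser is top-down with backtracking; ShiftReduceParser is bottom-up without backtracking, so it can miss parses on some grammars):

import nltk

grammar = nltk.CFG.fromstring("""
S   -> NP VP
VP  -> V NP
NP  -> NNP | ART N
NNP -> 'Ram'
V   -> 'ate' | 'saw'
ART -> 'a' | 'an' | 'the'
N   -> 'rice' | 'apple' | 'movie'
""")

sentence = "Ram ate the rice".split()

top_down  = nltk.RecursiveDescentParser(grammar)   # top-down, depth-first
bottom_up = nltk.ShiftReduceParser(grammar)        # bottom-up, no backtracking

print(list(top_down.parse(sentence)))
print(list(bottom_up.parse(sentence)))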
Efficiency Issues
To reuse the work already done for constituent subtrees.
Ex of inefficiency: Consider the sentence the train from Chennai to Vizag via Nellore
Grammar:
S   → NP
NP  → NP PP | ART N | NNP
ART → the
N   → train
NNP → Chennai | Vizag | Nellore
PP  → P NP
P   → from | to | via
PARSING
79
Possible False Steps
the train from Chennai to
Vizag via Nellore
push “the”; reduce to “ART”;
push “train”; reduce to “N”;
A
reduce “ART N” to “NP”;
reduce “NP” to “S”.
PARSING
80
Possible False Steps
Perform A, and then
from Chennai to Vizag via Nellore
NP
push “from”; reduce to “P”;
push “Chennai”; reduce to “NNP”;
reduce to “NP”; reduce “P NP” to “PP”;
B
Reduce “NP PP” to “NP”;
reduce “NP” to “S”.
PARSING
81
Possible False Steps
Similarly for
“……….. to Vizag via Nellore”
and
“…….. via Nellore”
Shift reduce conflicts occur for
S  NP
NP  NP PP
Should “NP” be reduced to “S” or should one search for “PP” and then get bigger “NP”?
PARSING
82
Reduplication of Work
the train
NP
S
from Chennaii
PP
NP
S
to Vizag
PP
NP
NP
S
via Nellore
PP
S
PARSING
83
# of Repetitions for Subtree Computation
the train                                      4 times
from Chennai                                   3 times
the train from Chennai                         3 times
to Vizag                                       2 times
the train from Chennai to Vizag                2 times
via Nellore                                    1 time
the train from Chennai to Vizag via Nellore    1 time
Can the subtrees already computed be reused?
PARSING
84
Chart Parsing : Earley Algorithm
(Dynamic Programming based)
Sentence: book the flight
Grammar:
S   → NP VP | VP
NP  → ART N | NNP
VP  → V | V NP
ART → a | an | the
N   → book | flight
V   → book | include
PARSING
85
Definitions
CHART is the data structure that stores the record of matched constituents and expected
constituents through the use of dotted rules.
A dotted rule is of the form
A → B • C
where B is the portion already matched and C is the constituent still expected.
PARSING
86
Definitions
PREDICTOR is the procedure that records by transitive closure the set of dotted rules for a given
state of the input processing.
SCANNER is the procedure that consumes the next input token.
COMPLETER is the procedure that
◦ takes a dotted rule for which the dot is at the rightmost end and
◦ advances the dots for the rules for which a matched constituent was awaiting completion.
PARSING
87
Illustration of the Algorithm
Example: "0 book 1 the 2 flight 3"
Chart[0]
State  Dotted Rule       Position  Comment
S0     S' → • S          [0,0]     Dummy start state
S1     S  → • NP VP      [0,0]     Predictor
S2     NP → • ART N      [0,0]     Predictor
S3     NP → • NNP        [0,0]     Predictor
S5     S  → • VP         [0,0]     Predictor
S6     VP → • V          [0,0]     Predictor
S7     VP → • V NP       [0,0]     Predictor
(fill the chart using transitive closure)
PARSING
88
Example: "0 book 1 the 2 flight 3"
Chart[1]
S8     V  → book •       [0,1]     Scanner (consume input token "book")
S9     VP → V •          [0,1]     Completer
S10    S  → VP •         [0,1]     Completer (complete waiting constituents)
S11    S' → S •          [0,1]     Completer (but not termination!)
S12    VP → V • NP       [0,1]     Completer
S13    NP → • ART N      [1,1]     Predictor (new waiting constituents enter the chart)
S14    NP → • NNP        [1,1]     Predictor
PARSING
89
Example: "0 book 1 the 2 flight 3"
Chart[2]
S15    ART → the •       [1,2]     Scanner
S16    NP  → ART • N     [1,2]     Completer
PARSING
90
Example: "0 book 1 the 2 flight 3"
Chart[3]
S17    N  → flight •     [2,3]     Scanner
S18    NP → ART N •      [1,3]     Completer
S19    VP → V NP •       [0,3]     Completer
S20    S  → VP •         [0,3]     Completer
S21    S' → S •          [0,3]     Successful termination
PARSING
91
PARSING
92
Earley parser
In computer science, the Earley parser is an algorithm for parsing strings that belong to a given context-free
language, though (depending on the variant) it may suffer problems with certain nullable grammars. The algorithm,
named after its inventor, Jay Earley, is a chart parser that uses dynamic programming; it is mainly used for parsing in
computational linguistics. It was first introduced in his dissertation (and later appeared in abbreviated, more legible
form in a journal).
Earley parsers are appealing because they can parse all context-free languages, unlike LR
parsers and LL parsers, which are more typically used in compilers but which can only handle restricted classes of
languages. The Earley parser executes in cubic time, O(n³), in the general case, where n is the length of the parsed
string, in quadratic time, O(n²), for unambiguous grammars, and in linear time for almost all LR(k) grammars. It
performs particularly well when the rules are written left-recursively.
Earley Recognizer
The following algorithm describes the Earley recognizer. The recognizer can be easily modified to create a parse tree
as it recognizes, and in that way can be turned into a parser.
The algorithm
In the following descriptions, α, β, and γ represent any string of terminals/nonterminals (including the empty string),
X and Y represent single nonterminals, and a represents a terminal symbol.
Earley's algorithm is a top-down dynamic programming algorithm. In the following, we use Earley's dot notation:
given a production X → αβ, the notation X → α • β represents a condition in which α has already been parsed and β
is expected.
Input position 0 is the position prior to input. Input position n is the position after accepting the nth token.
(Informally, input positions can be thought of as locations at token boundaries.) For every input position, the parser
generates a state set. Each state is a tuple (X → α • β, i), consisting of
• the production currently being matched (X → α β)
• our current position in that production (represented by the dot)
• the position i in the input at which the matching of this production began: the origin position
(Earley's original algorithm included a look-ahead in the state; later research showed this to have little practical
effect on the parsing efficiency, and it has subsequently been dropped from most implementations.)
The state set at input position k is called S(k). The parser is seeded with S(0) consisting of only the top-level rule.
The parser then repeatedly executes three operations: prediction, scanning, and completion.
• Prediction: For every state in S(k) of the form (X → α • Y β, j) (where j is the origin position as above), add (Y
→ • γ, k) to S(k) for every production in the grammar with Y on the left-hand side (Y → γ).
• Scanning: If a is the next symbol in the input stream, for every state in S(k) of the form (X → α • a β, j), add (X
→ α a • β, j) to S(k+1).
• Completion: For every state in S(k) of the form (X → γ •, j), find states in S(j) of the form (Y → α • X β, i) and
add (Y → α X • β, i) to S(k).
It is important to note that duplicate states are not added to the state set, only new ones. These three operations are
repeated until no new states can be added to the set. The set is generally implemented as a queue of states to process,
with the operation to be performed depending on what kind of state it is.
Pseudocode
Adapted from Speech and Language Processing by Daniel Jurafsky and James H. Martin

function EARLEY-PARSE(words, grammar)
    ENQUEUE((γ → •S, 0), chart[0])
    for i ← from 0 to LENGTH(words) do
        for each state in chart[i] do
            if INCOMPLETE?(state) then
                if NEXT-CAT(state) is a nonterminal then
                    PREDICTOR(state, i, grammar)        // non-terminal
                else do
                    SCANNER(state, i)                   // terminal
            else do
                COMPLETER(state, i)
        end
    end
    return chart

procedure PREDICTOR((A → α•B, i), j, grammar)
    for each (B → γ) in GRAMMAR-RULES-FOR(B, grammar) do
        ADD-TO-SET((B → •γ, j), chart[j])
    end

procedure SCANNER((A → α•B, i), j)
    if B ∈ PARTS-OF-SPEECH(word[j]) then
        ADD-TO-SET((B → word[j], i), chart[j + 1])
    end

procedure COMPLETER((B → γ•, j), k)
    for each (A → α•Bβ, i) in chart[j] do
        ADD-TO-SET((A → αB•β, i), chart[k])
    end
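A compact Python rendering of this recognizer, specialized to the "book the flight" grammar used in the chart-parsing example earlier. It is a sketch: it assumes a grammar without empty (nullable) productions and represents each state as a (lhs, rhs, dot, origin) tuple:

GRAMMAR = {
    "S":  [["NP", "VP"], ["VP"]],
    "NP": [["ART", "N"], ["NNP"]],
    "VP": [["V"], ["V", "NP"]],
}
LEXICON = {
    "book": {"V", "N"}, "flight": {"N"}, "the": {"ART"},
    "a": {"ART"}, "an": {"ART"}, "include": {"V"},
}

def earley_recognize(words, start="S"):
    # chart[k] holds states (lhs, rhs, dot, origin)
    chart = [set() for _ in range(len(words) + 1)]
    chart[0].add(("GAMMA", (start,), 0, 0))                  # dummy start state
    for k in range(len(words) + 1):
        agenda = list(chart[k])
        while agenda:
            lhs, rhs, dot, origin = agenda.pop()
            if dot < len(rhs):                               # incomplete state
                nxt = rhs[dot]
                if nxt in GRAMMAR:                           # PREDICTOR
                    for prod in GRAMMAR[nxt]:
                        new = (nxt, tuple(prod), 0, k)
                        if new not in chart[k]:
                            chart[k].add(new); agenda.append(new)
                elif k < len(words) and nxt in LEXICON.get(words[k], set()):
                    chart[k + 1].add((nxt, (words[k],), 1, k))   # SCANNER
            else:                                            # COMPLETER
                for l2, r2, d2, o2 in list(chart[origin]):
                    if d2 < len(r2) and r2[d2] == lhs:
                        new = (l2, r2, d2 + 1, o2)
                        if new not in chart[k]:
                            chart[k].add(new); agenda.append(new)
    return ("GAMMA", (start,), 1, 0) in chart[-1]

print(earley_recognize("book the flight".split()))           # True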
Example
Consider the following simple grammar for arithmetic expressions:
<P> ::= <S>                       # the start rule
<S> ::= <S> "+" <M> | <M>
<M> ::= <M> "*" <T> | <T>
<T> ::= "1" | "2" | "3" | "4"
With the input:
2 + 3 * 4
This is the sequence of state sets:
(state no.) Production (Origin) # Comment
-----------------------------------------
S(0): • 2 + 3 * 4
(1)  P → • S          (0)    # start rule
(2)  S → • S + M      (0)    # predict from (1)
(3)  S → • M          (0)    # predict from (1)
(4)  M → • M * T      (0)    # predict from (3)
(5)  M → • T          (0)    # predict from (3)
(6)  T → • number     (0)    # predict from (5)

S(1): 2 • + 3 * 4
(1)  T → number •     (0)    # scan from S(0)(6)
(2)  M → T •          (0)    # complete from (1) and S(0)(5)
(3)  M → M • * T      (0)    # complete from (2) and S(0)(4)
(4)  S → M •          (0)    # complete from (2) and S(0)(3)
(5)  S → S • + M      (0)    # complete from (4) and S(0)(2)
(6)  P → S •          (0)    # complete from (4) and S(0)(1)

S(2): 2 + • 3 * 4
(1)  S → S + • M      (0)    # scan from S(1)(5)
(2)  M → • M * T      (2)    # predict from (1)
(3)  M → • T          (2)    # predict from (1)
(4)  T → • number     (2)    # predict from (3)

S(3): 2 + 3 • * 4
(1)  T → number •     (2)    # scan from S(2)(4)
(2)  M → T •          (2)    # complete from (1) and S(2)(3)
(3)  M → M • * T      (2)    # complete from (2) and S(2)(2)
(4)  S → S + M •      (0)    # complete from (2) and S(2)(1)
(5)  S → S • + M      (0)    # complete from (4) and S(0)(2)
(6)  P → S •          (0)    # complete from (4) and S(0)(1)

S(4): 2 + 3 * • 4
(1)  M → M * • T      (2)    # scan from S(3)(3)
(2)  T → • number     (4)    # predict from (1)

S(5): 2 + 3 * 4 •
(1)  T → number •     (4)    # scan from S(4)(2)
(2)  M → M * T •      (2)    # complete from (1) and S(4)(1)
(3)  M → M • * T      (2)    # complete from (2) and S(2)(2)
(4)  S → S + M •      (0)    # complete from (2) and S(2)(1)
(5)  S → S • + M      (0)    # complete from (4) and S(0)(2)
(6)  P → S •          (0)    # complete from (4) and S(0)(1)
The state (P → S •, 0) represents a completed parse. This state also appears in S(3) and S(1), which are complete
sentences.
Other Reference Materials
• Aycock, John; Horspool, R. Nigel (2002). "Practical Earley Parsing". The Computer Journal 45 (6). pp. 620–630.
doi: 10.1093/comjnl/45.6.620 (http://dx.doi.org/10.1093/comjnl/45.6.620).
• Leo, Joop M. I. M. (1991), "A general context-free parsing algorithm running in linear time on every LR(k)
grammar without using lookahead", Theoretical Computer Science 82 (1): 165–176, doi:
10.1016/0304-3975(91)90180-A (http://dx.doi.org/10.1016/0304-3975(91)90180-A), MR 1112117 (http://
www.ams.org/mathscinet-getitem?mr=1112117).
• Tomita, Masaru (1984). "LR parsers for natural languages". COLING. 10th International Conference on
Computational Linguistics. pp. 354–357.
External links
C Implementations
• 'early' (http://cocom.sourceforge.net/ammunition-13.html) An Earley parser C -library.
• 'C Earley Parser' (https://bitbucket.org/abki/c-earley-parser/src) An Earley parser C. Wikipedia:Link rot
Java Implementations
• PEN (http://linguateca.dei.uc.pt/index.php?sep=recursos) A Java library that implements the Earley
algorithm.
• Pep (http://www.ling.ohio-state.edu/~scott/#projects-pep) A Java library that implements the Earley
algorithm and provides charts and parse trees as parsing artifacts.
• (http://www.cs.umanitoba.ca/~comp4190/Earley/Earley.java) A Java implementation of Earley parser.
Perl Implementations
• Marpa::R2 (https://metacpan.org/module/Marpa::R2) and Marpa::XS (https://metacpan.org/module/
Marpa::XS), Perl modules. Marpa (http://jeffreykegler.github.com/Marpa-web-site/) is an Earley's algorithm
that includes the improvements made by Joop Leo, and by Aycock and Horspool.
• Parse::Earley (https://metacpan.org/module/Parse::Earley) A Perl module that implements Jay Earley's original
algorithm.
Python Implementations
• Charty (http://www.cavar.me/damir/charty/python/) a Python implementation of an Earley parser.
• NLTK (http://nltk.org/) a Python toolkit that has an Earley parser.
• Spark (http://pages.cpsc.ucalgary.ca/~aycock/spark/) an Object Oriented "little language framework" for
Python that implements an Earley parser.
• earley3.py (http://github.com/tomerfiliba/tau/blob/master/earley3.py) A stand-alone implementation of the
algorithm in less than 150 lines of code, including generation of the parsing-forest and samples.
Common Lisp Implementations
• CL-EARLEY-PARSER (http://www.cliki.net/CL-EARLEY-PARSER) A Common Lisp library that
implements an Earley parser.
Scheme/Racket Implementations
• Charty-Racket (http://www.cavar.me/damir/charty/scheme/) A Scheme / Racket implementation of an Earley
parser.
Resources
• The Accent compiler-compiler (http://accent.compilertools.net/Entire.html)
CHAPTER
3
N-gram Language Models
“You are uniformly charming!” cried he, with a smile of associating and now
and then I bowed and they perceived a chaise and four to wish for.
Random sentence generated from a Jane Austen trigram model
Predicting is difficult—especially about the future, as the old quip goes. But how
about predicting something that seems much easier, like the next few words someone
is going to say? What word, for example, is likely to follow
Please turn your homework ...
Hopefully, most of you concluded that a very likely word is in, or possibly over,
but probably not refrigerator or the. In the following sections we will formalize
this intuition by introducing models that assign a probability to each possible next
word. The same models will also serve to assign a probability to an entire sentence.
Such a model, for example, could predict that the following sequence has a much
higher probability of appearing in a text:
all of a sudden I notice three guys standing on the sidewalk
than does this same set of words in a different order:
on guys all I of notice sidewalk three a sudden standing the
Why would you want to predict upcoming words, or assign probabilities to sentences? Probabilities are essential in any task in which we have to identify words in
noisy, ambiguous input, like speech recognition. For a speech recognizer to realize
that you said I will be back soonish and not I will be bassoon dish, it helps to know
that back soonish is a much more probable sequence than bassoon dish. For writing
tools like spelling correction or grammatical error correction, we need to find and
correct errors in writing like Their are two midterms, in which There was mistyped
as Their, or Everything has improve, in which improve should have been improved.
The phrase There are will be much more probable than Their are, and has improved
than has improve, allowing us to help users by detecting and correcting these errors.
Assigning probabilities to sequences of words is also essential in machine translation. Suppose we are translating a Chinese source sentence:
他 向 记者
介绍了
主要 内容
He to reporters introduced main content
As part of the process we might have built the following set of potential rough
English translations:
he introduced reporters to the main contents of the statement
he briefed to reporters the main contents of the statement
he briefed reporters on the main contents of the statement
A probabilistic model of word sequences could suggest that briefed reporters on
is a more probable English phrase than briefed to reporters (which has an awkward
to after briefed) or introduced reporters to (which uses a verb that is less fluent
English in this context), allowing us to correctly select the boldfaced sentence above.
Probabilities are also important for augmentative and alternative communication systems (Trnka et al. 2007, Kane et al. 2017). People often use such AAC
devices if they are physically unable to speak or sign but can instead use eye gaze or
other specific movements to select words from a menu to be spoken by the system.
Word prediction can be used to suggest likely words for the menu.
Models that assign probabilities to sequences of words are called language models or LMs. In this chapter we introduce the simplest model that assigns probabilities to sentences and sequences of words, the n-gram. An n-gram is a sequence
of n words: a 2-gram (which we’ll call bigram) is a two-word sequence of words
like “please turn”, “turn your”, or ”your homework”, and a 3-gram (a trigram) is
a three-word sequence of words like “please turn your”, or “turn your homework”.
We’ll see how to use n-gram models to estimate the probability of the last word of
an n-gram given the previous words, and also to assign probabilities to entire sequences. In a bit of terminological ambiguity, we usually drop the word “model”,
and use the term n-gram (and bigram, etc.) to mean either the word sequence itself
or the predictive model that assigns it a probability. In later chapters we’ll introduce
more sophisticated language models like the RNN LMs of Chapter 9.
3.1 N-Grams
Let’s begin with the task of computing P(w|h), the probability of a word w given
some history h. Suppose the history h is “its water is so transparent that” and we
want to know the probability that the next word is the:
P(the|its water is so transparent that).    (3.1)
One way to estimate this probability is from relative frequency counts: take a
very large corpus, count the number of times we see its water is so transparent that,
and count the number of times this is followed by the. This would be answering the
question “Out of the times we saw the history h, how many times was it followed by
the word w”, as follows:
P(the|its water is so transparent that) = C(its water is so transparent that the) / C(its water is so transparent that)    (3.2)
With a large enough corpus, such as the web, we can compute these counts and
estimate the probability from Eq. 3.2. You should pause now, go to the web, and
compute this estimate for yourself.
While this method of estimating probabilities directly from counts works fine in
many cases, it turns out that even the web isn’t big enough to give us good estimates
in most cases. This is because language is creative; new sentences are created all the
time, and we won’t always be able to count entire sentences. Even simple extensions
of the example sentence may have counts of zero on the web (such as “Walden
Pond’s water is so transparent that the”; well, used to have counts of zero).
Similarly, if we wanted to know the joint probability of an entire sequence of
words like its water is so transparent, we could do it by asking “out of all possible
sequences of five words, how many of them are its water is so transparent?” We
would have to get the count of its water is so transparent and divide by the sum of
the counts of all possible five word sequences. That seems rather a lot to estimate!
For this reason, we’ll need to introduce more clever ways of estimating the probability of a word w given a history h, or the probability of an entire word sequence
W . Let’s start with a little formalizing of notation. To represent the probability
of a particular random variable Xi taking on the value “the”, or P(Xi = “the”), we
will use the simplification P(the). We’ll represent a sequence of N words either as
w1 . . . wn or w1:n (so the expression w1:n−1 means the string w1 , w2 , ..., wn−1 ). For the
joint probability of each word in a sequence having a particular value P(X = w1 ,Y =
w2 , Z = w3 , ...,W = wn ) we’ll use P(w1 , w2 , ..., wn ).
Now how can we compute probabilities of entire sequences like P(w1 , w2 , ..., wn )?
One thing we can do is decompose this probability using the chain rule of probability:
P(X_1 ... X_n) = P(X_1) P(X_2|X_1) P(X_3|X_{1:2}) ... P(X_n|X_{1:n-1}) = ∏_{k=1}^{n} P(X_k|X_{1:k-1})    (3.3)

Applying the chain rule to words, we get

P(w_{1:n}) = P(w_1) P(w_2|w_1) P(w_3|w_{1:2}) ... P(w_n|w_{1:n-1}) = ∏_{k=1}^{n} P(w_k|w_{1:k-1})    (3.4)
The chain rule shows the link between computing the joint probability of a sequence
and computing the conditional probability of a word given previous words. Equation 3.4 suggests that we could estimate the joint probability of an entire sequence of
words by multiplying together a number of conditional probabilities. But using the
chain rule doesn’t really seem to help us! We don’t know any way to compute the exact probability of a word given a long sequence of preceding words, P(w_n|w_{1:n-1}).
As we said above, we can’t just estimate by counting the number of times every word
occurs following every long string, because language is creative and any particular
context might have never occurred before!
The intuition of the n-gram model is that instead of computing the probability of
a word given its entire history, we can approximate the history by just the last few
words.
The bigram model, for example, approximates the probability of a word given
all the previous words P(wn |w1:n−1 ) by using only the conditional probability of the
preceding word P(wn |wn−1 ). In other words, instead of computing the probability
P(the|Walden Pond’s water is so transparent that)    (3.5)
we approximate it with the probability
P(the|that)    (3.6)
When we use a bigram model to predict the conditional probability of the next word,
we are thus making the following approximation:
P(w_n|w_{1:n-1}) ≈ P(w_n|w_{n-1})    (3.7)
The assumption that the probability of a word depends only on the previous word is
called a Markov assumption. Markov models are the class of probabilistic models
that assume we can predict the probability of some future unit without looking too
far into the past. We can generalize the bigram (which looks one word into the past)
to the trigram (which looks two words into the past) and thus to the n-gram (which
looks n − 1 words into the past).
Thus, the general equation for this n-gram approximation to the conditional
probability of the next word in a sequence is
P(w_n|w_{1:n-1}) ≈ P(w_n|w_{n-N+1:n-1})    (3.8)
Given the bigram assumption for the probability of an individual word, we can compute the probability of a complete word sequence by substituting Eq. 3.7 into Eq. 3.4:
P(w_{1:n}) ≈ ∏_{k=1}^{n} P(w_k|w_{k-1})    (3.9)
How do we estimate these bigram or n-gram probabilities? An intuitive way to
estimate probabilities is called maximum likelihood estimation or MLE. We get
the MLE estimate for the parameters of an n-gram model by getting counts from a
corpus, and normalizing the counts so that they lie between 0 and 1.¹
For example, to compute a particular bigram probability of a word y given a
previous word x, we’ll compute the count of the bigram C(xy) and normalize by the
sum of all the bigrams that share the same first word x:
P(w_n|w_{n-1}) = C(w_{n-1} w_n) / ∑_w C(w_{n-1} w)    (3.10)
We can simplify this equation, since the sum of all bigram counts that start with
a given word wn−1 must be equal to the unigram count for that word wn−1 (the reader
should take a moment to be convinced of this):
P(w_n|w_{n-1}) = C(w_{n-1} w_n) / C(w_{n-1})    (3.11)
Let’s work through an example using a mini-corpus of three sentences. We’ll
first need to augment each sentence with a special symbol <s> at the beginning
of the sentence, to give us the bigram context of the first word. We’ll also need a
special end-symbol </s>.²
<s> I am Sam </s>
<s> Sam I am </s>
<s> I do not like green eggs and ham </s>
Here are the calculations for some of the bigram probabilities from this corpus
P(I|<s>) = 2/3 = .67        P(Sam|<s>) = 1/3 = .33      P(am|I) = 2/3 = .67
P(Sam|am) = 1/2 = .5        P(do|I) = 1/3 = .33         P(</s>|Sam) = 1/2 = .5
¹ For probabilistic models, normalizing means dividing by some total count so that the resulting probabilities fall legally between 0 and 1.
² We need the end-symbol to make the bigram grammar a true probability distribution. Without an end-symbol, the sentence probabilities for all sentences of a given length would sum to one. This model would define an infinite set of probability distributions, with one distribution per sentence length. See Exercise 3.5.
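To make the worked example concrete, here is a minimal Python sketch (not taken from the text; whitespace tokenization and the helper name bigram_prob are illustrative assumptions) that reproduces the MLE bigram estimates above:

from collections import defaultdict

# Mini-corpus from the text, padded with <s> and </s> as described above.
corpus = [
    "I am Sam",
    "Sam I am",
    "I do not like green eggs and ham",
]

bigram_counts = defaultdict(int)
context_counts = defaultdict(int)

for sentence in corpus:
    tokens = ["<s>"] + sentence.split() + ["</s>"]
    for w1, w2 in zip(tokens, tokens[1:]):
        bigram_counts[(w1, w2)] += 1
        context_counts[w1] += 1  # how often w1 appears as a bigram context

def bigram_prob(w1, w2):
    """MLE estimate P(w2 | w1) = C(w1 w2) / C(w1), as in Eq. 3.11."""
    if context_counts[w1] == 0:
        return 0.0
    return bigram_counts[(w1, w2)] / context_counts[w1]

print(bigram_prob("<s>", "I"))     # 2/3 ≈ .67
print(bigram_prob("I", "am"))      # 2/3 ≈ .67
print(bigram_prob("Sam", "</s>"))  # 1/2 = .5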
For the general case of MLE n-gram parameter estimation:
P(w_n|w_{n-N+1:n-1}) = C(w_{n-N+1:n-1} w_n) / C(w_{n-N+1:n-1})    (3.12)
Equation 3.12 (like Eq. 3.11) estimates the n-gram probability by dividing the
observed frequency of a particular sequence by the observed frequency of a prefix.
This ratio is called a relative frequency. We said above that this use of relative
frequencies as a way to estimate probabilities is an example of maximum likelihood
estimation or MLE. In MLE, the resulting parameter set maximizes the likelihood
of the training set T given the model M (i.e., P(T |M)). For example, suppose the
word Chinese occurs 400 times in a corpus of a million words like the Brown corpus.
What is the probability that a random word selected from some other text of, say, a million words will be the word Chinese? The MLE of its probability is 400/1,000,000 or .0004. Now .0004 is not the best possible estimate of the probability of Chinese
occurring in all situations; it might turn out that in some other corpus or context
Chinese is a very unlikely word. But it is the probability that makes it most likely
that Chinese will occur 400 times in a million-word corpus. We present ways to
modify the MLE estimates slightly to get better probability estimates in Section 3.4.
Let’s move on to some examples from a slightly larger corpus than our 14-word
example above. We’ll use data from the now-defunct Berkeley Restaurant Project,
a dialogue system from the last century that answered questions about a database
of restaurants in Berkeley, California (Jurafsky et al., 1994). Here are some text-normalized sample user queries (a sample of 9332 sentences is on the website):
can you tell me about any good cantonese restaurants close by
mid priced thai food is what i’m looking for
tell me about chez panisse
can you give me a listing of the kinds of food that are available
i’m looking for a good place to eat breakfast
when is caffe venezia open during the day
Figure 3.1 shows the bigram counts from a piece of a bigram grammar from the
Berkeley Restaurant Project. Note that the majority of the values are zero. In fact,
we have chosen the sample words to cohere with each other; a matrix selected from
a random set of seven words would be even more sparse.
         i     want  to    eat   chinese  food  lunch  spend
i        5     827   0     9     0        0     0      2
want     2     0     608   1     6        6     5      1
to       2     0     4     686   2        0     6      211
eat      0     0     2     0     16       2     42     0
chinese  1     0     0     0     0        82    1      0
food     15    0     15    0     1        4     0      0
lunch    2     0     0     0     0        1     0      0
spend    1     0     1     0     0        0     0      0
Figure 3.1 Bigram counts for eight of the words (out of V = 1446) in the Berkeley Restaurant Project corpus of 9332 sentences. Zero counts are in gray.
Figure 3.2 shows the bigram probabilities after normalization (dividing each cell
in Fig. 3.1 by the appropriate unigram for its row, taken from the following set of
unigram probabilities):
i     want  to    eat   chinese  food  lunch  spend
2533  927   2417  746   158      1093  341    278
         i        want     to       eat      chinese  food     lunch    spend
i        0.002    0.33     0        0.0036   0        0        0        0.00079
want     0.0022   0        0.66     0.0011   0.0065   0.0065   0.0054   0.0011
to       0.00083  0        0.0017   0.28     0.00083  0        0.0025   0.087
eat      0        0        0.0027   0        0.021    0.0027   0.056    0
chinese  0.0063   0        0        0        0        0.52     0.0063   0
food     0.014    0        0.014    0        0.00092  0.0037   0        0
lunch    0.0059   0        0        0        0        0.0029   0        0
spend    0.0036   0        0.0036   0        0        0        0        0
Figure 3.2 Bigram probabilities for eight words in the Berkeley Restaurant Project corpus of 9332 sentences. Zero probabilities are in gray.
Here are a few other useful probabilities:
P(i|<s>) = 0.25
P(food|english) = 0.5
P(english|want) = 0.0011
P(</s>|food) = 0.68
Now we can compute the probability of sentences like I want English food or
I want Chinese food by simply multiplying the appropriate bigram probabilities together, as follows:
P(<s> i want english food </s>)
  = P(i|<s>) P(want|i) P(english|want) P(food|english) P(</s>|food)
  = .25 × .33 × .0011 × 0.5 × 0.68
  = .000031
We leave it as Exercise 3.2 to compute the probability of i want chinese food.
What kinds of linguistic phenomena are captured in these bigram statistics?
Some of the bigram probabilities above encode some facts that we think of as strictly
syntactic in nature, like the fact that what comes after eat is usually a noun or an
adjective, or that what comes after to is usually a verb. Others might be a fact about
the personal assistant task, like the high probability of sentences beginning with
the words I. And some might even be cultural rather than linguistic, like the higher
probability that people are looking for Chinese versus English food.
Some practical issues: Although for pedagogical purposes we have only described
bigram models, in practice it’s more common to use trigram models, which condition on the previous two words rather than the previous word, or 4-gram or even
5-gram models, when there is sufficient training data. Note that for these larger n-grams, we’ll need to assume extra contexts to the left and right of the sentence end. For example, to compute trigram probabilities at the very beginning of the sentence, we use two pseudo-words for the first trigram (i.e., P(I|<s><s>)).
We always represent and compute language model probabilities in log format
as log probabilities. Since probabilities are (by definition) less than or equal to
1, the more probabilities we multiply together, the smaller the product becomes.
Multiplying enough n-grams together would result in numerical underflow. By using
log probabilities instead of raw probabilities, we get numbers that are not as small.
3.2
•
E VALUATING L ANGUAGE M ODELS
35
Adding in log space is equivalent to multiplying in linear space, so we combine log
probabilities by adding them. The result of doing all computation and storage in log
space is that we only need to convert back into probabilities if we need to report
them at the end; then we can just take the exp of the logprob:
p_1 × p_2 × p_3 × p_4 = exp(log p_1 + log p_2 + log p_3 + log p_4)    (3.13)
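As a quick illustration of Eq. 3.13 (a small sketch, reusing the bigram probabilities from the i want english food example purely as sample numbers), summing log probabilities recovers the same value as multiplying raw probabilities while staying in a numerically safer range:

import math

probs = [0.25, 0.33, 0.0011, 0.5, 0.68]

product = 1.0
for p in probs:
    product *= p                               # multiply in linear space

log_sum = sum(math.log(p) for p in probs)      # add in log space

print(product)                 # ≈ 3.1e-05
print(math.exp(log_sum))       # same value, recovered from log space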
3.2 Evaluating Language Models
The best way to evaluate the performance of a language model is to embed it in
an application and measure how much the application improves. Such end-to-end
evaluation is called extrinsic evaluation. Extrinsic evaluation is the only way to
know if a particular improvement in a component is really going to help the task
at hand. Thus, for speech recognition, we can compare the performance of two
language models by running the speech recognizer twice, once with each language
model, and seeing which gives the more accurate transcription.
Unfortunately, running big NLP systems end-to-end is often very expensive. Instead, it would be nice to have a metric that can be used to quickly evaluate potential
improvements in a language model. An intrinsic evaluation metric is one that measures the quality of a model independent of any application.
For an intrinsic evaluation of a language model we need a test set. As with many
of the statistical models in our field, the probabilities of an n-gram model come from
the corpus it is trained on, the training set or training corpus. We can then measure
the quality of an n-gram model by its performance on some unseen data called the
test set or test corpus. We will also sometimes call test sets and other datasets that
are not in our training sets held out corpora because we hold them out from the
training data.
So if we are given a corpus of text and want to compare two different n-gram
models, we divide the data into training and test sets, train the parameters of both
models on the training set, and then compare how well the two trained models fit the
test set.
But what does it mean to “fit the test set”? The answer is simple: whichever
model assigns a higher probability to the test set—meaning it more accurately
predicts the test set—is a better model. Given two probabilistic models, the better
model is the one that has a tighter fit to the test data or that better predicts the details
of the test data, and hence will assign a higher probability to the test data.
Since our evaluation metric is based on test set probability, it’s important not to
let the test sentences into the training set. Suppose we are trying to compute the
probability of a particular “test” sentence. If our test sentence is part of the training
corpus, we will mistakenly assign it an artificially high probability when it occurs
in the test set. We call this situation training on the test set. Training on the test
set introduces a bias that makes the probabilities all look too high, and causes huge
inaccuracies in perplexity, the probability-based metric we introduce below.
Sometimes we use a particular test set so often that we implicitly tune to its
characteristics. We then need a fresh test set that is truly unseen. In such cases, we
call the initial test set the development test set or devset. How do we divide our
data into training, development, and test sets? We want our test set to be as large
as possible, since a small test set may be accidentally unrepresentative, but we also
want as much training data as possible. At the minimum, we would want to pick
the smallest test set that gives us enough statistical power to measure a statistically
significant difference between two potential models. In practice, we often just divide
our data into 80% training, 10% development, and 10% test. Given a large corpus
that we want to divide into training and test, test data can either be taken from some
continuous sequence of text inside the corpus, or we can remove smaller “stripes”
of text from randomly selected parts of our corpus and combine them into a test set.
3.2.1 Perplexity
In practice we don’t use raw probability as our metric for evaluating language models, but a variant called perplexity. The perplexity (sometimes called PP for short)
of a language model on a test set is the inverse probability of the test set, normalized
by the number of words. For a test set W = w_1 w_2 ... w_N:

PP(W) = P(w_1 w_2 ... w_N)^{-1/N} = (1 / P(w_1 w_2 ... w_N))^{1/N}    (3.14)
We can use the chain rule to expand the probability of W :
PP(W) = (∏_{i=1}^{N} 1 / P(w_i|w_1 ... w_{i-1}))^{1/N}    (3.15)
Thus, if we are computing the perplexity of W with a bigram language model,
we get:
PP(W) = (∏_{i=1}^{N} 1 / P(w_i|w_{i-1}))^{1/N}    (3.16)
Note that because of the inverse in Eq. 3.15, the higher the conditional probability of the word sequence, the lower the perplexity. Thus, minimizing perplexity is
equivalent to maximizing the test set probability according to the language model.
What we generally use for word sequence in Eq. 3.15 or Eq. 3.16 is the entire sequence of words in some test set. Since this sequence will cross many sentence
boundaries, we need to include the begin- and end-sentence markers <s> and </s>
in the probability computation. We also need to include the end-of-sentence marker
</s> (but not the beginning-of-sentence marker <s>) in the total count of word tokens N.
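Here is a minimal sketch of Eq. 3.16 in Python (an illustration, not the text’s code): it assumes a bigram_prob(prev, word) function like the one sketched earlier and a token list already padded with <s> and </s>, and it counts </s> but not <s> in N, as described above.

import math

def bigram_perplexity(tokens, bigram_prob):
    """Perplexity of a padded token sequence under a bigram model (Eq. 3.16)."""
    log_prob = 0.0
    for prev, word in zip(tokens, tokens[1:]):
        p = bigram_prob(prev, word)
        if p == 0.0:
            return float("inf")   # a single zero makes perplexity undefined (Section 3.3)
        log_prob += math.log(p)
    n = len(tokens) - 1           # predicted tokens, including </s> but not <s>
    return math.exp(-log_prob / n)

# Example call: bigram_perplexity(["<s>", "i", "want", "chinese", "food", "</s>"], bigram_prob)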
There is another way to think about perplexity: as the weighted average branching factor of a language. The branching factor of a language is the number of possible next words that can follow any word. Consider the task of recognizing the digits
in English (zero, one, two, ..., nine), given that (both in some training set and in some test set) each of the 10 digits occurs with equal probability P = 1/10. The perplexity of
this mini-language is in fact 10. To see that, imagine a test string of digits of length
N, and assume that in the training set all the digits occurred with equal probability.
By Eq. 3.15, the perplexity will be
PP(W) = P(w_1 w_2 ... w_N)^{-1/N} = ((1/10)^N)^{-1/N} = (1/10)^{-1} = 10    (3.17)
But suppose that the number zero is really frequent and occurs far more often than other numbers. Let’s say that 0 occurs 91 times in the training set, and each of the other digits occurs 1 time. Now we see the following test set: 0 0
0 0 0 3 0 0 0 0. We should expect the perplexity of this test set to be lower since
most of the time the next number will be zero, which is very predictable, i.e. has
a high probability. Thus, although the branching factor is still 10, the perplexity or
weighted branching factor is smaller. We leave this exact calculation as exercise 12.
We see in Section 3.7 that perplexity is also closely related to the information-theoretic notion of entropy.
Finally, let’s look at an example of how perplexity can be used to compare different n-gram models. We trained unigram, bigram, and trigram grammars on 38
million words (including start-of-sentence tokens) from the Wall Street Journal, using a 19,979 word vocabulary. We then computed the perplexity of each of these
models on a test set of 1.5 million words with Eq. 3.16. The table below shows the
perplexity of a 1.5 million word WSJ test set according to each of these grammars.
            Unigram  Bigram  Trigram
Perplexity  962      170     109
As we see above, the more information the n-gram gives us about the word
sequence, the lower the perplexity (since as Eq. 3.15 showed, perplexity is related
inversely to the likelihood of the test sequence according to the model).
Note that in computing perplexities, the n-gram model P must be constructed
without any knowledge of the test set or any prior knowledge of the vocabulary of
the test set. Any kind of knowledge of the test set can cause the perplexity to be
artificially low. The perplexity of two language models is only comparable if they
use identical vocabularies.
An (intrinsic) improvement in perplexity does not guarantee an (extrinsic) improvement in the performance of a language processing task like speech recognition
or machine translation. Nonetheless, because perplexity often correlates with such
improvements, it is commonly used as a quick check on an algorithm. But a model’s
improvement in perplexity should always be confirmed by an end-to-end evaluation
of a real task before concluding the evaluation of the model.
3.3 Generalization and Zeros
The n-gram model, like many statistical models, is dependent on the training corpus.
One implication of this is that the probabilities often encode specific facts about a
given training corpus. Another implication is that n-grams do a better and better job
of modeling the training corpus as we increase the value of N.
We can visualize both of these facts by borrowing the technique of Shannon
(1951) and Miller and Selfridge (1950) of generating random sentences from different n-gram models. It’s simplest to visualize how this works for the unigram
case. Imagine all the words of the English language covering the probability space
between 0 and 1, each word covering an interval proportional to its frequency. We
choose a random value between 0 and 1 and print the word whose interval includes
this chosen value. We continue choosing random numbers and generating words
until we randomly generate the sentence-final token </s>. We can use the same
technique to generate bigrams by first generating a random bigram that starts with
<s> (according to its bigram probability). Let’s say the second word of that bigram
is w. We next choose a random bigram starting with w (again, drawn according to its
bigram probability), and so on.
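The same procedure is easy to express in code. Below is a minimal sketch (the dictionary layout and function name are illustrative assumptions, not the text’s): bigram_probs maps each context word to a dictionary of next-word probabilities, and we keep sampling until we draw </s>.

import random

def generate_sentence(bigram_probs, max_len=20):
    """Sample a sentence from a bigram model, starting from <s>."""
    word = "<s>"
    sentence = []
    while len(sentence) < max_len:
        next_dist = bigram_probs[word]
        words = list(next_dist.keys())
        weights = list(next_dist.values())
        word = random.choices(words, weights=weights, k=1)[0]
        if word == "</s>":
            break
        sentence.append(word)
    return " ".join(sentence)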
To give an intuition for the increasing power of higher-order n-grams, Fig. 3.3
shows random sentences generated from unigram, bigram, trigram, and 4-gram
models trained on Shakespeare’s works.
1-gram
–To him swallowed confess hear both. Which. Of save on trail for are ay device and rote life have
–Hill he late speaks; or! a more to leg less first you enter
2-gram
–Why dost stand forth thy canopy, forsooth; he is this palpable hit the King Henry. Live king. Follow.
–What means, sir. I confess she? then all sorts, he is trim, captain.
3-gram
–Fly, and will rid me these news of price. Therefore the sadness of parting, as they say, ’tis done.
–This shall forbid it should be branded, if renown made it empty.
4-gram
–King Henry. What! I will go seek the traitor Gloucester. Exeunt some of the watch. A great banquet serv’d in;
–It cannot be but so.
Figure 3.3 Eight sentences randomly generated from four n-grams computed from Shakespeare’s works. All characters were mapped to lower-case and punctuation marks were treated as words. Output is hand-corrected for capitalization to improve readability.
The longer the context on which we train the model, the more coherent the sentences. In the unigram sentences, there is no coherent relation between words or any
sentence-final punctuation. The bigram sentences have some local word-to-word
coherence (especially if we consider that punctuation counts as a word). The trigram and 4-gram sentences are beginning to look a lot like Shakespeare. Indeed, a
careful investigation of the 4-gram sentences shows that they look a little too much
like Shakespeare. The words It cannot be but so are directly from King John. This is
because, not to put the knock on Shakespeare, his oeuvre is not very large as corpora go (N = 884,647, V = 29,066), and our n-gram probability matrices are ridiculously sparse. There are V² = 844,000,000 possible bigrams alone, and the number of possible 4-grams is V⁴ = 7 × 10¹⁷. Thus, once the generator has chosen the first 4-gram
(It cannot be but), there are only five possible continuations (that, I, he, thou, and
so); indeed, for many 4-grams, there is only one continuation.
To get an idea of the dependence of a grammar on its training set, let’s look at an
n-gram grammar trained on a completely different corpus: the Wall Street Journal
(WSJ) newspaper. Shakespeare and the Wall Street Journal are both English, so
we might expect some overlap between our n-grams for the two genres. Fig. 3.4
shows sentences generated by unigram, bigram, and trigram grammars trained on
40 million words from WSJ.
1-gram: Months the my and issue of year foreign new exchange’s september were recession exchange new endorsed a acquire to six executives
2-gram: Last December through the way to preserve the Hudson corporation N. B. E. C. Taylor would seem to complete the major central planners one point five percent of U. S. E. has already old M. X. corporation of living on information such as more frequently fishing to keep her
3-gram: They also point to ninety nine point six billion dollars from two hundred four oh six three percent of the rates of interest stores as Mexico and Brazil on market conditions
Figure 3.4 Three sentences randomly generated from three n-gram models computed from 40 million words of the Wall Street Journal, lower-casing all characters and treating punctuation as words. Output was then hand-corrected for capitalization to improve readability.
Compare these examples to the pseudo-Shakespeare in Fig. 3.3. While they both
model “English-like sentences”, there is clearly no overlap in generated sentences,
and little overlap even in small phrases. Statistical models are likely to be pretty useless as predictors if the training sets and the test sets are as different as Shakespeare
and WSJ.
How should we deal with this problem when we build n-gram models? One step
is to be sure to use a training corpus that has a similar genre to whatever task we are
trying to accomplish. To build a language model for translating legal documents,
we need a training corpus of legal documents. To build a language model for a
question-answering system, we need a training corpus of questions.
It is equally important to get training data in the appropriate dialect or variety,
especially when processing social media posts or spoken transcripts. For example
some tweets will use features of African American Language (AAL)— the name
for the many variations of language used in African American communities (King,
2020). Such features include words like finna—an auxiliary verb that marks immediate future tense —that don’t occur in other varieties, or spellings like den for then,
in tweets like this one (Blodgett and O’Connor, 2017):
(3.18) Bored af den my phone finna die!!!
while tweets from varieties like Nigerian English have markedly different vocabulary and n-gram patterns from American English (Jurgens et al., 2017):
(3.19) @username R u a wizard or wat gan sef: in d mornin - u tweet, afternoon - u
tweet, nyt gan u dey tweet. beta get ur IT placement wiv twitter
Matching genres and dialects is still not sufficient. Our models may still be
subject to the problem of sparsity. For any n-gram that occurred a sufficient number
of times, we might have a good estimate of its probability. But because any corpus is
limited, some perfectly acceptable English word sequences are bound to be missing
from it. That is, we’ll have many cases of putative “zero probability n-grams” that
should really have some non-zero probability. Consider the words that follow the
bigram denied the in the WSJ Treebank3 corpus, together with their counts:
denied the allegations: 5
denied the speculation: 2
denied the rumors: 1
denied the report: 1
But suppose our test set has phrases like:
denied the offer
denied the loan
Our model will incorrectly estimate that the P(offer|denied the) is 0!
These zeros— things that don’t ever occur in the training set but do occur in
the test set—are a problem for two reasons. First, their presence means we are
underestimating the probability of all sorts of words that might occur, which will
hurt the performance of any application we want to run on this data.
Second, if the probability of any word in the test set is 0, the entire probability
of the test set is 0. By definition, perplexity is based on the inverse probability of the
test set. Thus if some words have zero probability, we can’t compute perplexity at
all, since we can’t divide by 0!
3.3.1 Unknown Words
The previous section discussed the problem of words whose bigram probability is
zero. But what about words we simply have never seen before?
Sometimes we have a language task in which this can’t happen because we know
all the words that can occur. In such a closed vocabulary system the test set can
only contain words from this lexicon, and there will be no unknown words. This is
a reasonable assumption in some domains, such as speech recognition or machine
translation, where we have a pronunciation dictionary or a phrase table that are fixed
in advance, and so the language model can only use the words in that dictionary or
phrase table.
In other cases we have to deal with words we haven’t seen before, which we’ll
call unknown words, or out of vocabulary (OOV) words. The percentage of OOV
words that appear in the test set is called the OOV rate. An open vocabulary system
is one in which we model these potential unknown words in the test set by adding a
pseudo-word called <UNK>.
There are two common ways to train the probabilities of the unknown word
model <UNK>. The first one is to turn the problem back into a closed vocabulary one
by choosing a fixed vocabulary in advance:
1. Choose a vocabulary (word list) that is fixed in advance.
2. Convert in the training set any word that is not in this set (any OOV word) to
the unknown word token <UNK> in a text normalization step.
3. Estimate the probabilities for <UNK> from its counts just like any other regular
word in the training set.
The second alternative, in situations where we don’t have a prior vocabulary in advance, is to create such a vocabulary implicitly, replacing words in the training data
by <UNK> based on their frequency. For example we can replace by <UNK> all words
that occur fewer than n times in the training set, where n is some small number, or
equivalently select a vocabulary size V in advance (say 50,000) and choose the top V words by frequency and replace the rest by <UNK>. In either case we then proceed
to train the language model as before, treating <UNK> like a regular word.
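A minimal sketch of the second strategy in Python (the threshold, function name, and data layout are illustrative assumptions): replace every training word that occurs fewer than min_count times with <UNK>, then train as usual.

from collections import Counter

def replace_rare_words(sentences, min_count=2):
    """Map words occurring fewer than min_count times to <UNK>."""
    counts = Counter(w for sent in sentences for w in sent)
    vocab = {w for w, c in counts.items() if c >= min_count}
    return [[w if w in vocab else "<UNK>" for w in sent] for sent in sentences]

train = [["i", "want", "chinese", "food"], ["i", "want", "thai", "food"]]
print(replace_rare_words(train))
# [['i', 'want', '<UNK>', 'food'], ['i', 'want', '<UNK>', 'food']]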
The exact choice of <UNK> model does have an effect on metrics like perplexity.
A language model can achieve low perplexity by choosing a small vocabulary and
assigning the unknown word a high probability. For this reason, perplexities should
only be compared across language models with the same vocabularies (Buck et al.,
2014).
3.4 Smoothing
What do we do with words that are in our vocabulary (they are not unknown words)
but appear in a test set in an unseen context (for example they appear after a word
they never appeared after in training)? To keep a language model from assigning
zero probability to these unseen events, we’ll have to shave off a bit of probability
mass from some more frequent events and give it to the events we’ve never seen.
This modification is called smoothing or discounting. In this section and the following ones we’ll introduce a variety of ways to do smoothing: Laplace (add-one)
smoothing, add-k smoothing, stupid backoff, and Kneser-Ney smoothing.
3.4.1 Laplace Smoothing
The simplest way to do smoothing is to add one to all the bigram counts, before
we normalize them into probabilities. All the counts that used to be zero will now
have a count of 1, the counts of 1 will be 2, and so on. This algorithm is called
Laplace smoothing. Laplace smoothing does not perform well enough to be used
in modern n-gram models, but it usefully introduces many of the concepts that we
see in other smoothing algorithms, gives a useful baseline, and is also a practical
smoothing algorithm for other tasks like text classification (Chapter 4).
Let’s start with the application of Laplace smoothing to unigram probabilities.
Recall that the unsmoothed maximum likelihood estimate of the unigram probability
of the word wi is its count ci normalized by the total number of word tokens N:
P(w_i) = c_i / N
Laplace smoothing merely adds one to each count (hence its alternate name add-one smoothing). Since there are V words in the vocabulary and each one was incremented, we also need to adjust the denominator to take into account the extra V
observations. (What happens to our P values if we don’t increase the denominator?)
P_Laplace(w_i) = (c_i + 1) / (N + V)    (3.20)
Instead of changing both the numerator and denominator, it is convenient to
describe how a smoothing algorithm affects the numerator, by defining an adjusted
count c∗ . This adjusted count is easier to compare directly with the MLE counts and
can be turned into a probability like an MLE count by normalizing by N. To define
this count, since we are only changing the numerator in addition to adding 1, we’ll also need to multiply by a normalization factor N/(N + V):

c*_i = (c_i + 1) · N / (N + V)    (3.21)
We can now turn c∗i into a probability Pi∗ by normalizing by N.
A related way to view smoothing is as discounting (lowering) some non-zero
counts in order to get the probability mass that will be assigned to the zero counts.
Thus, instead of referring to the discounted counts c∗ , we might describe a smoothing algorithm in terms of a relative discount dc , the ratio of the discounted counts to
the original counts:
d_c = c* / c

Now that we have the intuition for the unigram case, let’s smooth our Berkeley Restaurant Project bigrams. Figure 3.5 shows the add-one smoothed counts for the bigrams in Fig. 3.1.
         i     want  to    eat   chinese  food  lunch  spend
i        6     828   1     10    1        1     1      3
want     3     1     609   2     7        7     6      2
to       3     1     5     687   3        1     7      212
eat      1     1     3     1     17       3     43     1
chinese  2     1     1     1     1        83    2      1
food     16    1     16    1     2        5     1      1
lunch    3     1     1     1     1        2     1      1
spend    2     1     2     1     1        1     1      1
Figure 3.5 Add-one smoothed bigram counts for eight of the words (out of V = 1446) in the Berkeley Restaurant Project corpus of 9332 sentences. Previously-zero counts are in gray.
Figure 3.6 shows the add-one smoothed probabilities for the bigrams in Fig. 3.2.
Recall that normal bigram probabilities are computed by normalizing each row of
counts by the unigram count:
P(w_n|w_{n-1}) = C(w_{n-1} w_n) / C(w_{n-1})    (3.22)

For add-one smoothed bigram counts, we need to augment the unigram count by the number of total word types in the vocabulary V:

P*_Laplace(w_n|w_{n-1}) = (C(w_{n-1} w_n) + 1) / ∑_w (C(w_{n-1} w) + 1) = (C(w_{n-1} w_n) + 1) / (C(w_{n-1}) + V)    (3.23)
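Eq. 3.23 is one line of code. The sketch below is illustrative (it assumes count dictionaries like those in the earlier bigram sketch); the sample call matches Fig. 3.6, where C(want to) = 608, C(want) = 927, and V = 1446.

def laplace_bigram_prob(w1, w2, bigram_counts, unigram_counts, V):
    """Add-one smoothed bigram probability (Eq. 3.23)."""
    return (bigram_counts.get((w1, w2), 0) + 1) / (unigram_counts.get(w1, 0) + V)

# (608 + 1) / (927 + 1446) ≈ 0.26, the smoothed P(to|want) in Fig. 3.6.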
Thus, each of the unigram counts given in the previous section will need to be
augmented by V = 1446. The result is the smoothed bigram probabilities in Fig. 3.6.
         i        want     to       eat      chinese  food     lunch    spend
i        0.0015   0.21     0.00025  0.0025   0.00025  0.00025  0.00025  0.00075
want     0.0013   0.00042  0.26     0.00084  0.0029   0.0029   0.0025   0.00084
to       0.00078  0.00026  0.0013   0.18     0.00078  0.00026  0.0018   0.055
eat      0.00046  0.00046  0.0014   0.00046  0.0078   0.0014   0.02     0.00046
chinese  0.0012   0.00062  0.00062  0.00062  0.00062  0.052    0.0012   0.00062
food     0.0063   0.00039  0.0063   0.00039  0.00079  0.002    0.00039  0.00039
lunch    0.0017   0.00056  0.00056  0.00056  0.00056  0.0011   0.00056  0.00056
spend    0.0012   0.00058  0.0012   0.00058  0.00058  0.00058  0.00058  0.00058
Figure 3.6 Add-one smoothed bigram probabilities for eight of the words (out of V = 1446) in the BeRP corpus of 9332 sentences. Previously-zero probabilities are in gray.
It is often convenient to reconstruct the count matrix so we can see how much a
smoothing algorithm has changed the original counts. These adjusted counts can be
computed by Eq. 3.24. Figure 3.7 shows the reconstructed counts.
c*(w_{n-1} w_n) = [C(w_{n-1} w_n) + 1] × C(w_{n-1}) / (C(w_{n-1}) + V)    (3.24)
         i      want   to     eat    chinese  food   lunch  spend
i        3.8    527    0.64   6.4    0.64     0.64   0.64   1.9
want     1.2    0.39   238    0.78   2.7      2.7    2.3    0.78
to       1.9    0.63   3.1    430    1.9      0.63   4.4    133
eat      0.34   0.34   1      0.34   5.8      1      15     0.34
chinese  0.2    0.098  0.098  0.098  0.098    8.2    0.2    0.098
food     6.9    0.43   6.9    0.43   0.86     2.2    0.43   0.43
lunch    0.57   0.19   0.19   0.19   0.19     0.38   0.19   0.19
spend    0.32   0.16   0.32   0.16   0.16     0.16   0.16   0.16
Figure 3.7 Add-one reconstituted counts for eight words (of V = 1446) in the BeRP corpus of 9332 sentences. Previously-zero counts are in gray.
Note that add-one smoothing has made a very big change to the counts. C(want to)
changed from 609 to 238! We can see this in probability space as well: P(to|want)
decreases from .66 in the unsmoothed case to .26 in the smoothed case. Looking at
the discount d (the ratio between new and old counts) shows us how strikingly the
counts for each prefix word have been reduced; the discount for the bigram want to
is .39, while the discount for Chinese food is .10, a factor of 10!
The sharp change in counts and probabilities occurs because too much probability mass is moved to all the zeros.
3.4.2 Add-k smoothing
One alternative to add-one smoothing is to move a bit less of the probability mass
from the seen to the unseen events. Instead of adding 1 to each count, we add a fractional count k (.5? .05? .01?). This algorithm is therefore called add-k smoothing.
P*_Add-k(w_n|w_{n-1}) = (C(w_{n-1} w_n) + k) / (C(w_{n-1}) + kV)    (3.25)
Add-k smoothing requires that we have a method for choosing k; this can be
done, for example, by optimizing on a devset. Although add-k is useful for some
tasks (including text classification), it turns out that it still doesn’t work well for
language modeling, generating counts with poor variances and often inappropriate
discounts (Gale and Church, 1994).
3.4.3 Backoff and Interpolation
The discounting we have been discussing so far can help solve the problem of zero
frequency n-grams. But there is an additional source of knowledge we can draw on.
If we are trying to compute P(wn |wn−2 wn−1 ) but we have no examples of a particular
trigram wn−2 wn−1 wn , we can instead estimate its probability by using the bigram
probability P(wn |wn−1 ). Similarly, if we don’t have counts to compute P(wn |wn−1 ),
we can look to the unigram P(wn ).
In other words, sometimes using less context is a good thing, helping to generalize more for contexts that the model hasn’t learned much about. There are two ways
to use this n-gram “hierarchy”. In backoff, we use the trigram if the evidence is
sufficient, otherwise we use the bigram, otherwise the unigram. In other words, we
only “back off” to a lower-order n-gram if we have zero evidence for a higher-order
n-gram. By contrast, in interpolation, we always mix the probability estimates from
all the n-gram estimators, weighing and combining the trigram, bigram, and unigram
counts.
In simple linear interpolation, we combine different order n-grams by linearly interpolating all the models. Thus, we estimate the trigram probability P(wn |wn−2 wn−1 )
by mixing together the unigram, bigram, and trigram probabilities, each weighted
by a λ :
P̂(w_n|w_{n-2} w_{n-1}) = λ_1 P(w_n|w_{n-2} w_{n-1}) + λ_2 P(w_n|w_{n-1}) + λ_3 P(w_n)    (3.26)
such that the λs sum to 1:

∑_i λ_i = 1    (3.27)
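A minimal sketch of Eq. 3.26 (the three probability functions and the λ values are illustrative placeholders; in practice the λs are tuned on held-out data, as discussed below):

def interpolated_prob(w1, w2, w3, p_uni, p_bi, p_tri, lambdas=(0.5, 0.3, 0.2)):
    """Simple linear interpolation; lambdas = (trigram, bigram, unigram) weights."""
    l1, l2, l3 = lambdas
    return l1 * p_tri(w1, w2, w3) + l2 * p_bi(w2, w3) + l3 * p_uni(w3)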
In a slightly more sophisticated version of linear interpolation, each λ weight is
computed by conditioning on the context. This way, if we have particularly accurate
counts for a particular bigram, we assume that the counts of the trigrams based on
this bigram will be more trustworthy, so we can make the λ s for those trigrams
higher and thus give that trigram more weight in the interpolation. Equation 3.28
shows the equation for interpolation with context-conditioned weights:
P̂(w_n|w_{n-2} w_{n-1}) = λ_1(w_{n-2:n-1}) P(w_n|w_{n-2} w_{n-1}) + λ_2(w_{n-2:n-1}) P(w_n|w_{n-1}) + λ_3(w_{n-2:n-1}) P(w_n)    (3.28)
How are these λ values set? Both the simple interpolation and conditional interpolation λ s are learned from a held-out corpus. A held-out corpus is an additional
training corpus that we use to set hyperparameters like these λ values, by choosing
the λ values that maximize the likelihood of the held-out corpus. That is, we fix
the n-gram probabilities and then search for the λ values that—when plugged into
Eq. 3.26—give us the highest probability of the held-out set. There are various ways
to find this optimal set of λ s. One way is to use the EM algorithm, an iterative
learning algorithm that converges on locally optimal λ s (Jelinek and Mercer, 1980).
In a backoff n-gram model, if the n-gram we need has zero counts, we approximate it by backing off to the (N-1)-gram. We continue backing off until we reach a
history that has some counts.
In order for a backoff model to give a correct probability distribution, we have
to discount the higher-order n-grams to save some probability mass for the lower
order n-grams. Just as with add-one smoothing, if the higher-order n-grams aren’t
discounted and we just used the undiscounted MLE probability, then as soon as we
replaced an n-gram which has zero probability with a lower-order n-gram, we would
be adding probability mass, and the total probability assigned to all possible strings
by the language model would be greater than 1! In addition to this explicit discount
factor, we’ll need a function α to distribute this probability mass to the lower order
n-grams.
This kind of backoff with discounting is also called Katz backoff. In Katz backoff we rely on a discounted probability P∗ if we’ve seen this n-gram before (i.e., if
we have non-zero counts). Otherwise, we recursively back off to the Katz probability for the shorter-history (N-1)-gram. The probability for a backoff n-gram PBO is
thus computed as follows:

P_BO(w_n|w_{n-N+1:n-1}) = P*(w_n|w_{n-N+1:n-1}),                       if C(w_{n-N+1:n}) > 0
                        = α(w_{n-N+1:n-1}) P_BO(w_n|w_{n-N+2:n-1}),    otherwise    (3.29)
Katz backoff is often combined with a smoothing method called Good-Turing.
The combined Good-Turing backoff algorithm involves quite detailed computation
for estimating the Good-Turing smoothing and the P∗ and α values.
3.5 Kneser-Ney Smoothing
One of the most commonly used and best performing n-gram smoothing methods
is the interpolated Kneser-Ney algorithm (Kneser and Ney 1995, Chen and Goodman 1998).
Kneser-Ney has its roots in a method called absolute discounting. Recall that
discounting of the counts for frequent n-grams is necessary to save some probability
mass for the smoothing algorithm to distribute to the unseen n-grams.
To see this, we can use a clever idea from Church and Gale (1991). Consider
an n-gram that has count 4. We need to discount this count by some amount. But
how much should we discount it? Church and Gale’s clever idea was to look at a
held-out corpus and just see what the count is for all those bigrams that had count
4 in the training set. They computed a bigram grammar from 22 million words of
AP newswire and then checked the counts of each of these bigrams in another 22
million words. On average, a bigram that occurred 4 times in the first 22 million
words occurred 3.23 times in the next 22 million words. Fig. 3.8 from Church and
Gale (1991) shows these counts for bigrams with c from 0 to 9.
Bigram count in training set    Bigram count in heldout set
0                               0.0000270
1                               0.448
2                               1.25
3                               2.24
4                               3.23
5                               4.21
6                               5.23
7                               6.21
8                               7.21
9                               8.26
Figure 3.8 For all bigrams in 22 million words of AP newswire of count 0, 1, 2,...,9, the
counts of these bigrams in a held-out corpus also of 22 million words.
Notice in Fig. 3.8 that except for the held-out counts for 0 and 1, all the other
bigram counts in the held-out set could be estimated pretty well by just subtracting
0.75 from the count in the training set! Absolute discounting formalizes this intuition by subtracting a fixed (absolute) discount d from each count. The intuition is
that since we have good estimates already for the very high counts, a small discount
d won’t affect them much. It will mainly modify the smaller counts, for which we
don’t necessarily trust the estimate anyway, and Fig. 3.8 suggests that in practice this
discount is actually a good one for bigrams with counts 2 through 9. The equation
for interpolated absolute discounting applied to bigrams:
P_AbsoluteDiscounting(w_i|w_{i-1}) = (C(w_{i-1} w_i) − d) / ∑_v C(w_{i-1} v) + λ(w_{i-1}) P(w_i)    (3.30)
The first term is the discounted bigram, and the second term is the unigram with
an interpolation weight λ . We could just set all the d values to .75, or we could keep
a separate discount value of 0.5 for the bigrams with counts of 1.
Kneser-Ney discounting (Kneser and Ney, 1995) augments absolute discounting with a more sophisticated way to handle the lower-order unigram distribution.
Consider the job of predicting the next word in this sentence, assuming we are interpolating a bigram and a unigram model.
I can’t see without my reading __________ .
The word glasses seems much more likely to follow here than, say, the word
Kong, so we’d like our unigram model to prefer glasses. But in fact it’s Kong that is
more common, since Hong Kong is a very frequent word. A standard unigram model
will assign Kong a higher probability than glasses. We would like to capture the
intuition that although Kong is frequent, it is mainly only frequent in the phrase Hong
Kong, that is, after the word Hong. The word glasses has a much wider distribution.
In other words, instead of P(w), which answers the question “How likely is
w?”, we’d like to create a unigram model that we might call PCONTINUATION , which
answers the question “How likely is w to appear as a novel continuation?”. How can
we estimate this probability of seeing the word w as a novel continuation, in a new
unseen context? The Kneser-Ney intuition is to base our estimate of PCONTINUATION
on the number of different contexts word w has appeared in, that is, the number of
bigram types it completes. Every bigram type was a novel continuation the first time
it was seen. We hypothesize that words that have appeared in more contexts in the
past are more likely to appear in some new context as well. The number of times a
word w appears as a novel continuation can be expressed as:
P_CONTINUATION(w) ∝ |{v : C(vw) > 0}|    (3.31)
To turn this count into a probability, we normalize by the total number of word
bigram types. In summary:
P_CONTINUATION(w) = |{v : C(vw) > 0}| / |{(u′, w′) : C(u′ w′) > 0}|    (3.32)
An equivalent formulation based on a different metaphor is to use the number of
word types seen to precede w (Eq. 3.31 repeated):
P_CONTINUATION(w) ∝ |{v : C(vw) > 0}|    (3.33)
normalized by the number of words preceding all words, as follows:
P_CONTINUATION(w) = |{v : C(vw) > 0}| / ∑_{w′} |{v : C(vw′) > 0}|    (3.34)
A frequent word (Kong) occurring in only one context (Hong) will have a low
continuation probability.
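A minimal sketch of Eq. 3.34 (the data layout and function name are illustrative): count, for each word, how many distinct words have been seen to precede it, and normalize by the total number of bigram types.

from collections import defaultdict

def continuation_probs(bigrams):
    """P_CONTINUATION(w) from an iterable of observed (w1, w2) bigram tokens."""
    preceding_types = defaultdict(set)       # w -> set of distinct words seen before w
    for w1, w2 in bigrams:
        preceding_types[w2].add(w1)
    total_bigram_types = sum(len(s) for s in preceding_types.values())
    return {w: len(s) / total_bigram_types for w, s in preceding_types.items()}

# "Kong" may be frequent, but if it only ever follows "Hong" it contributes a
# single bigram type, so its continuation probability stays low.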
The final equation for Interpolated Kneser-Ney smoothing for bigrams is then:
P_KN(w_i|w_{i-1}) = max(C(w_{i-1} w_i) − d, 0) / C(w_{i-1}) + λ(w_{i-1}) P_CONTINUATION(w_i)    (3.35)
The λ is a normalizing constant that is used to distribute the probability mass we’ve discounted:

λ(w_{i-1}) = (d / ∑_v C(w_{i-1} v)) · |{w : C(w_{i-1} w) > 0}|    (3.36)

The first term, d / ∑_v C(w_{i-1} v), is the normalized discount. The second term, |{w : C(w_{i-1} w) > 0}|, is the number of word types that can follow w_{i-1} or, equivalently, the number of word types that we discounted; in other words, the number of times we applied the normalized discount.

The general recursive formulation is as follows:
P_KN(w_i|w_{i-n+1:i-1}) = max(c_KN(w_{i-n+1:i}) − d, 0) / ∑_v c_KN(w_{i-n+1:i-1} v) + λ(w_{i-n+1:i-1}) P_KN(w_i|w_{i-n+2:i-1})    (3.37)
where the definition of the count cKN depends on whether we are counting the
highest-order n-gram being interpolated (for example trigram if we are interpolating
trigram, bigram, and unigram) or one of the lower-order n-grams (bigram or unigram
if we are interpolating trigram, bigram, and unigram):
c_KN(·) = count(·)               for the highest order
        = continuationcount(·)   for lower orders    (3.38)
The continuation count is the number of unique single word contexts for ·.
At the termination of the recursion, unigrams are interpolated with the uniform
distribution, where the parameter ε is the empty string:
P_KN(w) = max(c_KN(w) − d, 0) / ∑_{w′} c_KN(w′) + λ(ε) · 1/V    (3.39)
If we want to include an unknown word <UNK>, it’s just included as a regular vocabulary entry with count zero, and hence its probability will be a lambda-weighted uniform distribution λ(ε)/V.
The best performing version of Kneser-Ney smoothing is called modified Kneser-Ney smoothing, and is due to Chen and Goodman (1998). Rather than use a single
fixed discount d, modified Kneser-Ney uses three different discounts d1 , d2 , and
d3+ for n-grams with counts of 1, 2 and three or more, respectively. See Chen and
Goodman (1998, p. 19) or Heafield et al. (2013) for the details.
3.6 Huge Language Models and Stupid Backoff
By using text from the web or other enormous collections, it is possible to build
extremely large language models. The Web 1 Trillion 5-gram corpus released by
Google includes various large sets of n-grams, including 1-grams through 5-grams
from all the five-word sequences that appear in at least 40 distinct books from
1,024,908,267,229 words of text from publicly accessible Web pages in English
(Franz and Brants, 2006). Google has also released Google Books Ngrams corpora with n-grams drawn from their book collections, including another 800 billion
tokens of n-grams from Chinese, English, French, German, Hebrew, Italian, Russian, and Spanish (Lin et al., 2012a). Smaller but more carefully curated n-gram corpora for English include the million most frequent n-grams drawn from the COCA (Corpus of Contemporary American English) 1 billion word corpus of American English (Davies, 2020). COCA is a balanced corpus, meaning that it has roughly equal numbers of words from different genres (web, newspapers, spoken conversation transcripts, fiction, and so on) drawn from the period 1990-2019, and it has the context of each n-gram as well as labels for genre and provenance.
Some example 4-grams from the Google Web corpus:
4-gram                      Count
serve as the incoming       92
serve as the incubator      99
serve as the independent    794
serve as the index          223
serve as the indication     72
serve as the indicator      120
serve as the indicators     45
Efficiency considerations are important when building language models that use
such large sets of n-grams. Rather than store each word as a string, it is generally
represented in memory as a 64-bit hash number, with the words themselves stored
on disk. Probabilities are generally quantized using only 4-8 bits (instead of 8-byte
floats), and n-grams are stored in reverse tries.
N-grams can also be shrunk by pruning, for example only storing n-grams with
counts greater than some threshold (such as the count threshold of 40 used for the
Google n-gram release) or using entropy to prune less-important n-grams (Stolcke,
1998). Another option is to build approximate language models using techniques
like Bloom filters (Talbot and Osborne 2007, Church et al. 2007). Finally, efficient language model toolkits like KenLM (Heafield 2011, Heafield et al. 2013) use
sorted arrays, efficiently combine probabilities and backoffs in a single value, and
use merge sorts to efficiently build the probability tables in a minimal number of
passes through a large corpus.
Although with these toolkits it is possible to build web-scale language models
using full Kneser-Ney smoothing, Brants et al. (2007) show that with very large language models a much simpler algorithm may be sufficient. The algorithm is called
stupid backoff. Stupid backoff gives up the idea of trying to make the language
model a true probability distribution. There is no discounting of the higher-order
probabilities. If a higher-order n-gram has a zero count, we simply backoff to a
lower order n-gram, weighed by a fixed (context-independent) weight. This algorithm does not produce a probability distribution, so we’ll follow Brants et al. (2007)
in referring to it as S:

S(w_i|w_{i-k+1:i-1}) = count(w_{i-k+1:i}) / count(w_{i-k+1:i-1})    if count(w_{i-k+1:i}) > 0
                     = λ S(w_i|w_{i-k+2:i-1})                       otherwise    (3.40)
The backoff terminates in the unigram, which has probability S(w) = count(w)/N. Brants et al. (2007) find that a value of 0.4 worked well for λ.
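A minimal sketch of stupid backoff for trigrams (the count dictionaries and names are illustrative; λ = 0.4 follows Brants et al. (2007)). Note that the result is a score, not a probability:

def stupid_backoff_score(w1, w2, w3, trigram_counts, bigram_counts,
                         unigram_counts, total_words, lam=0.4):
    """Stupid backoff score S(w3 | w1, w2), following Eq. 3.40."""
    tri = trigram_counts.get((w1, w2, w3), 0)
    if tri > 0:
        return tri / bigram_counts[(w1, w2)]
    bi = bigram_counts.get((w2, w3), 0)
    if bi > 0:
        return lam * bi / unigram_counts[w2]
    return lam * lam * unigram_counts.get(w3, 0) / total_words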
3.7 Advanced: Perplexity’s Relation to Entropy
We introduced perplexity in Section 3.2.1 as a way to evaluate n-gram models on
a test set. A better n-gram model is one that assigns a higher probability to the
test data, and perplexity is a normalized version of the probability of the test set.
The perplexity measure actually arises from the information-theoretic concept of
cross-entropy, which explains otherwise mysterious properties of perplexity (why
the inverse probability, for example?) and its relationship to entropy. Entropy is a
measure of information. Given a random variable X ranging over whatever we are
predicting (words, letters, parts of speech, the set of which we’ll call χ) and with a
particular probability function, call it p(x), the entropy of the random variable X is:
H(X) = −∑_{x∈χ} p(x) log_2 p(x)    (3.41)
The log can, in principle, be computed in any base. If we use log base 2, the
resulting value of entropy will be measured in bits.
One intuitive way to think about entropy is as a lower bound on the number of
bits it would take to encode a certain decision or piece of information in the optimal
coding scheme.
Consider an example from the standard information theory textbook Cover and
Thomas (1991). Imagine that we want to place a bet on a horse race but it is too
far to go all the way to Yonkers Racetrack, so we’d like to send a short message to
the bookie to tell him which of the eight horses to bet on. One way to encode this
message is just to use the binary representation of the horse’s number as the code;
thus, horse 1 would be 001, horse 2 010, horse 3 011, and so on, with horse 8 coded
as 000. If we spend the whole day betting and each horse is coded with 3 bits, on
average we would be sending 3 bits per race.
Can we do better? Suppose that the spread is the actual distribution of the bets
placed and that we represent it as the prior probability of each horse as follows:
Horse 1: 1/2     Horse 5: 1/64
Horse 2: 1/4     Horse 6: 1/64
Horse 3: 1/8     Horse 7: 1/64
Horse 4: 1/16    Horse 8: 1/64
The entropy of the random variable X that ranges over horses gives us a lower
bound on the number of bits and is
H(X) = -\sum_{i=1}^{8} p(i) \log_2 p(i)
     = -\tfrac{1}{2}\log_2\tfrac{1}{2} - \tfrac{1}{4}\log_2\tfrac{1}{4} - \tfrac{1}{8}\log_2\tfrac{1}{8} - \tfrac{1}{16}\log_2\tfrac{1}{16} - 4\bigl(\tfrac{1}{64}\log_2\tfrac{1}{64}\bigr)
     = 2 \text{ bits}    (3.42)
A code that averages 2 bits per race can be built with short encodings for more
probable horses, and longer encodings for less probable horses. For example, we
could encode the most likely horse with the code 0, and the remaining horses as 10,
then 110, 1110, 111100, 111101, 111110, and 111111.
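The 2-bit figure can be checked directly; here is a short sketch computing Eq. 3.42 for the betting distribution above.

from math import log2

# Prior probabilities of the eight horses from the example above.
p = [1/2, 1/4, 1/8, 1/16, 1/64, 1/64, 1/64, 1/64]

H = -sum(pi * log2(pi) for pi in p)
print(H)   # 2.0 bits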
What if the horses are equally likely? We saw above that if we used an equal-length binary code for the horse numbers, each horse took 3 bits to code, so the average was 3. Is the entropy the same? In this case each horse would have a probability of 1/8. The entropy of the choice of horses is then
H(X) = -\sum_{i=1}^{8} \tfrac{1}{8}\log_2\tfrac{1}{8} = -\log_2\tfrac{1}{8} = 3 \text{ bits}    (3.43)
Until now we have been computing the entropy of a single variable. But most of
what we will use entropy for involves sequences. For a grammar, for example, we
will be computing the entropy of some sequence of words W = {w0 , w1 , w2 , . . . , wn }.
One way to do this is to have a variable that ranges over sequences of words. For
example we can compute the entropy of a random variable that ranges over all finite
sequences of words of length n in some language L as follows:

H(w_1, w_2, \ldots, w_n) = -\sum_{W_1^n \in L} p(W_1^n) \log p(W_1^n)    (3.44)
We could define the entropy rate (we could also think of this as the per-word entropy) as the entropy of this sequence divided by the number of words:

\frac{1}{n} H(W_1^n) = -\frac{1}{n} \sum_{W_1^n \in L} p(W_1^n) \log p(W_1^n)    (3.45)
But to measure the true entropy of a language, we need to consider sequences of
infinite length. If we think of a language as a stochastic process L that produces a
sequence of words, and allow W to represent the sequence of words w1 , . . . , wn , then
L's entropy rate H(L) is defined as

H(L) = \lim_{n \to \infty} \frac{1}{n} H(w_1, w_2, \ldots, w_n)
     = -\lim_{n \to \infty} \frac{1}{n} \sum_{W \in L} p(w_1, \ldots, w_n) \log p(w_1, \ldots, w_n)    (3.46)
The Shannon-McMillan-Breiman theorem (Algoet and Cover 1988, Cover and
Thomas 1991) states that if the language is regular in certain ways (to be exact, if it
is both stationary and ergodic),
H(L) = \lim_{n \to \infty} -\frac{1}{n} \log p(w_1 w_2 \ldots w_n)    (3.47)
That is, we can take a single sequence that is long enough instead of summing
over all possible sequences. The intuition of the Shannon-McMillan-Breiman theorem is that a long-enough sequence of words will contain in it many other shorter
sequences and that each of these shorter sequences will reoccur in the longer sequence according to their probabilities.
A stochastic process is said to be stationary if the probabilities it assigns to a
sequence are invariant with respect to shifts in the time index. In other words, the
probability distribution for words at time t is the same as the probability distribution
at time t + 1. Markov models, and hence n-grams, are stationary. For example, in
a bigram, Pi is dependent only on Pi−1 . So if we shift our time index by x, Pi+x is
still dependent on Pi+x−1 . But natural language is not stationary, since as we show
in Chapter 12, the probability of upcoming words can be dependent on events that
were arbitrarily distant and time dependent. Thus, our statistical models only give
an approximation to the correct distributions and entropies of natural language.
To summarize, by making some incorrect but convenient simplifying assumptions, we can compute the entropy of some stochastic process by taking a very long
sample of the output and computing its average log probability.
Now we are ready to introduce cross-entropy. The cross-entropy is useful when
we don’t know the actual probability distribution p that generated some data. It
allows us to use some m, which is a model of p (i.e., an approximation to p). The
cross-entropy of m on p is defined by
H(p, m) = \lim_{n \to \infty} -\frac{1}{n} \sum_{W \in L} p(w_1, \ldots, w_n) \log m(w_1, \ldots, w_n)    (3.48)
That is, we draw sequences according to the probability distribution p, but sum
the log of their probabilities according to m.
Again, following the Shannon-McMillan-Breiman theorem, for a stationary ergodic process:

H(p, m) = \lim_{n \to \infty} -\frac{1}{n} \log m(w_1 w_2 \ldots w_n)    (3.49)
This means that, as for entropy, we can estimate the cross-entropy of a model
m on some distribution p by taking a single sequence that is long enough instead of
summing over all possible sequences.
What makes the cross-entropy useful is that the cross-entropy H(p, m) is an upper bound on the entropy H(p). For any model m:
H(p) ≤ H(p, m)
(3.50)
This means that we can use some simplified model m to help estimate the true entropy of a sequence of symbols drawn according to probability p. The more accurate
m is, the closer the cross-entropy H(p, m) will be to the true entropy H(p). Thus,
the difference between H(p, m) and H(p) is a measure of how accurate a model is.
Between two models m1 and m2 , the more accurate model will be the one with the
lower cross-entropy. (The cross-entropy can never be lower than the true entropy, so
a model cannot err by underestimating the true entropy.)
We are finally ready to see the relation between perplexity and cross-entropy as we saw it in Eq. 3.49. Cross-entropy is defined in the limit as the length of the observed word sequence goes to infinity. We will need an approximation to cross-entropy, relying on a (sufficiently long) sequence of fixed length. This approximation to the cross-entropy of a model M = P(w_i | w_{i-N+1} ... w_{i-1}) on a sequence of words W is
H(W) = -\frac{1}{N} \log P(w_1 w_2 \ldots w_N)    (3.51)
The perplexity of a model P on a sequence of words W is now formally defined as 2 raised to the power of this cross-entropy:
Perplexity(W) = 2^{H(W)}
              = P(w_1 w_2 \ldots w_N)^{-1/N}
              = \sqrt[N]{\frac{1}{P(w_1 w_2 \ldots w_N)}}
              = \sqrt[N]{\prod_{i=1}^{N} \frac{1}{P(w_i \mid w_1 \ldots w_{i-1})}}    (3.52)
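A small sketch tying Eq. 3.51 to Eq. 3.52: given per-word conditional probabilities assigned by some model to a test sequence (the probabilities below are made up), perplexity is 2 raised to the average negative log2 probability, which equals the N-th root of the inverse sequence probability.

from math import log2, prod

# Hypothetical per-word probabilities P(w_i | w_1 ... w_{i-1}) on a test sequence.
word_probs = [0.1, 0.2, 0.05, 0.3]
N = len(word_probs)

H = -sum(log2(p) for p in word_probs) / N      # cross-entropy estimate, Eq. 3.51
perplexity = 2 ** H                            # Eq. 3.52
print(perplexity)

# Equivalent form: N-th root of the inverse probability of the whole sequence.
print(prod(word_probs) ** (-1 / N))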
3.8 Summary
This chapter introduced language modeling and the n-gram, one of the most widely
used tools in language processing.
• Language models offer a way to assign a probability to a sentence or other
sequence of words, and to predict a word from preceding words.
• n-grams are Markov models that estimate words from a fixed window of previous words. n-gram probabilities can be estimated by counting in a corpus
and normalizing (the maximum likelihood estimate).
• n-gram language models are evaluated extrinsically in some task, or intrinsically using perplexity.
• The perplexity of a test set according to a language model is the geometric
mean of the inverse test set probability computed by the model.
• Smoothing algorithms provide a more sophisticated way to estimate the probability of n-grams. Commonly used smoothing algorithms for n-grams rely on
lower-order n-gram counts through backoff or interpolation.
• Both backoff and interpolation require discounting to create a probability distribution.
• Kneser-Ney smoothing makes use of the probability of a word being a novel
continuation. The interpolated Kneser-Ney smoothing algorithm mixes a
discounted probability with a lower-order continuation probability.
Bibliographical and Historical Notes
The underlying mathematics of the n-gram was first proposed by Markov (1913),
who used what are now called Markov chains (bigrams and trigrams) to predict
whether an upcoming letter in Pushkin’s Eugene Onegin would be a vowel or a consonant. Markov classified 20,000 letters as V or C and computed the bigram and
trigram probability that a given letter would be a vowel given the previous one or
two letters. Shannon (1948) applied n-grams to compute approximations to English
word sequences. Based on Shannon’s work, Markov models were commonly used in
engineering, linguistic, and psychological work on modeling word sequences by the
1950s. In a series of extremely influential papers starting with Chomsky (1956) and
including Chomsky (1957) and Miller and Chomsky (1963), Noam Chomsky argued
that “finite-state Markov processes”, while a possibly useful engineering heuristic, were incapable of being a complete cognitive model of human grammatical knowledge.
CSE4022
Natural Language Processing
Topic: Machine Translation
Dr. Tulasi Prasad Sariki,
SCSE, VIT Chennai Campus
www.learnersdesk.weebly.com
Contents
● History of Machine Translation
● Machine Translation: Where are we today?
● How Does MT Work?
● Core Challenges of MT
● Rule-based vs. Data-driven Approaches to MT
● Statistical MT (SMT)
● Major Sources of Translation Problems
● Speech-to-speech translation
History of Machine Translation
1946: MT is one of the first conceived applications of modern computers (Alan Turing)
1954: The “Georgetown Experiment” demonstration of Russian-English MT
Late 1950s and early 1960s: MT fails to scale up to “real” systems
1966: ALPAC Report: MT recognized as an extremely difficult, “AI-complete”
problem. Funding disappears
1968: SYSTRAN founded
1985: CMU “Center for Machine Translation” (CMT) founded
Late 1980s and early 1990s: Field dominated by rule-based approaches – KBMT,
KANT, Eurotra, etc.
History of Machine Translation
1992: “Noisy Channel” Statistical MT models invented by IBM (CANDIDE)
Mid 1990s: First major DARPA MT Program. PANGLOSS
Late 1990s: Major Speech-to-Speech MT demonstrations: C-STAR
1999: JHU Summer Workshop results in GIZA
2000s: Large DARPA Funding Programs – TIDES and GALE
2003: Och et al. introduce Phrase-based SMT. PHARAOH
2006: Google Translate is launched
2007: Koehn et al. release MOSES
2008: A text/SMS translation service for mobile phones is launched in Japan
2009: A mobile phone with a built-in speech-to-speech translation facility for English and Japanese is released
2012: Google announces that Google Translate translates roughly enough text to fill one million books per day
MT: Where are we today?
Age of Internet & Globalization – great demand for translation services and MT
● Multiple official languages of UN, EU, Canada, etc.
● Commercial demand from an increasing number of global enterprises (Microsoft, IBM, Intel, Apple, eBay, Amazon, etc.)
 Language and translation services business sector estimated at $15 Billion
worldwide in 2008 and growing at a healthy pace
Economic incentive and demand is still focused primarily within G-8 languages,
but growing in emerging markets (BRIC: Brazil, Russia, India, China), Arabic, and
more…
MT: Where are we today?
Some fairly decent commercial products in the market for these language pairs
● Primarily a product of rule-based systems after many years of development
● New generation of data-driven “statistical” MT: Google, Microsoft, Language Weaver
Web-based (mostly free) MT services: Google, Babelfish, others…
Pervasive MT between many language pairs still non-existent, but Google is
trying to change that!
How Does MT Work?
All modern MT approaches are based on building translations for
complete sentences by putting together smaller pieces of translation
Core Questions:
● What are these smaller pieces of translation?
● Where do they come from?
● How does MT put these pieces together?
● How does the MT system pick the correct (or best) translation among many options?
Core Challenges of MT
Ambiguity and Language Divergences:
Human languages are highly ambiguous, and differently in different
languages
Ambiguity at all “levels”: lexical, syntactic, semantic, language-specific
constructions and idioms
Amount of required knowledge:
Translation equivalencies for vast vocabularies
Syntactic knowledge (how to map syntax of one language to another), plus
more complex language divergences (semantic differences, constructions and
idioms, etc.)
How do you acquire and construct such a large knowledge base that is (even mostly) correct and consistent?
Rule-based vs. Data-driven Approaches to MT
What are the pieces of translation? Where do they come from?
Rule-based: large-scale “clean” word translation lexicons, manually
constructed over time by experts
Data-driven: broad-coverage word and multi-word translation lexicons,
learned automatically from available sentence-parallel corpora
How does MT put these pieces together?
Rule-based: large collections of rules, manually developed over time by
human experts, that map structures from the source to the target language
Data-driven: a computer algorithm that explores millions of possible ways of
putting the small pieces together, looking for the translation that statistically
looks best
Rule-based vs. Data-driven Approaches to MT
How does the MT system pick the correct (or best) translation among many
options?
Rule-based: Human experts encode preferences among the rules designed to
prefer creation of better translations
Data-driven: a variety of fitness and preference scores, many of which can be
learned from available training data, are used to model a total score for each
of the millions of possible translation candidates; algorithm then selects and
outputs the best scoring translation
Rule-based vs. Data-driven Approaches to MT
Why have the data-driven approaches become so popular?
Increasing amounts of sentence-parallel data are constantly being created
on the web
Advances in machine learning algorithms
Computational power of today’s computers can train systems on these
massive amounts of data and can perform these massive search-based
translation computations when translating new texts
Building and maintaining rule-based systems is too difficult, expensive and
time-consuming
In many scenarios, it actually works better!
Statistical MT (SMT)
Data-driven, most dominant approach in current MT research
Proposed by IBM in early 1990s: a direct, purely statistical, model for MT
Evolved from word-level translation to phrase-based translation
Main Ideas:
Training: statistical “models” of word and phrase translation equivalence are
learned automatically from bilingual parallel sentences, creating a bilingual
“database” of translations
Decoding: new sentences are translated by a program (the decoder), which
matches the source words and phrases with the database of translations, and
searches the “space” of all possible translation combinations.
Statistical MT (SMT)
Main steps in training phrase-based statistical MT:
Create a sentence-aligned parallel corpus
Word Alignment: train word-level alignment models (GIZA++)
Phrase Extraction: extract phrase-to-phrase translation correspondences
using heuristics (Moses)
Minimum Error Rate Training (MERT): optimize translation system
parameters on development data to achieve best translation performance
Attractive: completely automatic, no manual rules, much reduced manual
labor
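To make the decoding idea from the previous slide concrete, here is a toy Python sketch; the German source phrases, phrase table entries, and language-model scores are all invented for illustration, and real decoders such as Moses also handle reordering and search over a vastly larger space.

from math import log

# Toy phrase table: source phrase -> list of (target phrase, P(target | source)).
phrase_table = {
    "das haus": [("the house", 0.7), ("the home", 0.3)],
    "ist klein": [("is small", 0.8), ("is little", 0.2)],
}

# Toy target language-model probabilities for complete candidate sentences.
lm_prob = {
    "the house is small": 0.010,
    "the house is little": 0.002,
    "the home is small": 0.004,
    "the home is little": 0.001,
}

def decode(source_phrases):
    """Enumerate phrase-by-phrase candidates and return the highest-scoring one."""
    candidates = [("", 0.0)]            # (partial translation, log score so far)
    for src in source_phrases:
        candidates = [
            ((prefix + " " + tgt).strip(), score + log(p_t))
            for prefix, score in candidates
            for tgt, p_t in phrase_table[src]
        ]
    # Add the language-model score and pick the best complete sentence.
    best = max(candidates, key=lambda c: c[1] + log(lm_prob[c[0]]))
    return best[0]

print(decode(["das haus", "ist klein"]))   # -> "the house is small"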
Statistical MT (SMT)
Main drawbacks:
Translation accuracy levels vary widely
Effective only with large volumes (several mega-words) of parallel text
Broad domain, but domain-sensitive
Viable only for a limited number of language pairs!
Impressive progress in last 5-10 years!
Statistical MT: Major Challenges
Current approaches are too naïve and “direct”:
Good at learning word-to-word and phrase-to-phrase correspondences from
data
Not good enough at learning how to combine these pieces and reorder them
properly during translation
Learning general rules requires much more complicated algorithms and
computer processing of the data
The space of translations that is “searched” often doesn’t contain a perfect
translation
The fitness scores that are used aren’t good enough to always assign better
scores to the better translations; we don’t always find the best translation even
when it’s there!
MERT is brittle, problematic and metric-dependent!
Statistical MT: Major Challenges
Solutions:
Google solution: more and more data!
Research solution: “smarter” algorithms and learning methods
Rule-based vs. Data-driven MT
Example outputs for the same source sentences, one from a rule-based system and one from a data-driven system:

Pair 1:
"We thank all participants of the whole world for their comical and creative drawings; to choose the victors was not easy task!"
"We thank all the participants from around the world for their designs cocasses and creative; selecting winners was not easy!"

Pair 2:
"Click here to see work of winning European of these two months, and use it to look at what the winning of USA sent us."
"Click here to see the artwork of winners European of these two months, and disclosure to look at what the winners of the US have been sending."
Major Sources of Translation Problems
Lexical Differences:
Multiple possible translations for SL word, or difficulties expressing SL word
meaning in a single TL word
Structural Differences:
Syntax of SL is different than syntax of the TL: word order, sentence and
constituent structure
Differences in Mappings of Syntax to Semantics:
Meaning in TL is conveyed using a different syntactic structure than in the SL
Idioms and Constructions
How to Tackle the Core Challenges
Manual Labor: 1000s of person-years of human experts developing large word and
phrase translation lexicons and translation rules.
Example: Systran’s RBMT systems.
Lots of Parallel Data:
data-driven approaches for finding word and phrase
correspondences automatically from large amounts of sentence-aligned parallel texts.
Example: Statistical MT systems.
Learning Approaches: learn translation rules automatically from small amounts of
human translated and word-aligned data. Example: AVENUE’s Statistical XFER approach.
Simplify the Problem: build systems that are limited-domain or constrained in other
ways. Examples: CATALYST, NESPOLE!.
State-of-the-Art in MT
What users want:
General purpose (GP): any text
High quality (HQ): human level
Fully automatic (FA): no user intervention
We can meet any 2 of these 3 goals today, but not all three at once:
FA + HQ: Knowledge-Based MT (KBMT)
FA + GP: Corpus-Based (Example-Based) MT
GP + HQ: Human-in-the-loop (Post-editing)
Types of MT Applications:
Assimilation: multiple source languages, uncontrolled style/topic. General-purpose MT, no customization. (Google Translate)
Dissemination: one source language, controlled style, single topic/domain.
Customized RBMT or SMT. (Safaba)
Communication: lower quality may be okay, but system robustness and real-time operation are required. (Jibbigo)
Approaches to MT: Vauquois MT Triangle
[Vauquois triangle, illustrated with “Mi chiamo Alon Lavie” → “My name is Alon Lavie”:
Direct: word-for-word mapping between the source and target sentences
Transfer: source analysis [s [vp accusative_pronoun "chiamo" proper_name]] mapped to target structure [s [np [possessive_pronoun "name"]] [vp "be" proper_name]], followed by Generation
Interlingua: Give-information+personal-data (name=alon_lavie), reached by Analysis and rendered by Generation]
Direct Approaches
No intermediate stage in the translation
First MT systems developed in the 1950s-60s (assembly code programs)
Morphology, bi-lingual dictionary lookup, local reordering rules
“Word-for-word, with some local word-order adjustments”
Modern Approaches:
Phrase-based Statistical MT (SMT)
Example-based MT (EBMT)
EBMT Paradigm
New Sentence (Source): Yesterday, 200 delegates met with President Clinton.

Matches to Source Found:
"Yesterday, 200 delegates met behind closed doors…" ↔ "Gestern trafen sich 200 Abgeordnete hinter verschlossenen…"
"Difficulties with President Clinton…" ↔ "Schwierigkeiten mit Praesident Clinton…"

Alignment (Sub-sentential):
"Yesterday, 200 delegates met behind closed doors…" ↔ "Gestern trafen sich 200 Abgeordnete hinter verschlossenen…"
"Difficulties with President Clinton over…" ↔ "Schwierigkeiten mit Praesident Clinton…"

Translated Sentence (Target): Gestern trafen sich 200 Abgeordnete mit Praesident Clinton.
Analysis and Generation Main Steps
Analysis:
Morphological analysis (word-level) and POS tagging
Syntactic analysis and disambiguation (produce syntactic parse-tree)
Semantic analysis and disambiguation (produce symbolic frames or logical
form representation)
Map to language-independent Interlingua
Generation:
Generate semantic representation in TL
Sentence Planning: generate syntactic structure and lexical selections for
concepts
Surface-form realization: generate correct forms of words
Transfer Approaches
Syntactic Transfer:
Analyze SL input sentence to its syntactic structure (parse tree)
Transfer SL parse-tree to TL parse-tree (various formalisms for mappings)
Generate TL sentence from the TL parse-tree
Semantic Transfer:
Analyze SL input to a language-specific semantic representation (i.e., Case
Frames, Logical Form)
Transfer SL semantic representation to TL semantic representation
Generate syntactic structure and then surface sentence in the TL
Transfer Approaches (Pros & Cons)
Syntactic Transfer:
◦ No need for semantic analysis and generation
◦ Syntactic structures are general, not domain specific → less domain dependent, can handle open domains
◦ Requires word translation lexicon
Semantic Transfer:
◦ Requires deeper analysis and generation, symbolic representation of concepts and predicates → difficult to construct for open or unlimited domains
◦ Can better handle non-compositional meaning structures → can be more accurate
◦ No word translation lexicon – generate in TL from symbolic concepts
Knowledge-based Interlingual MT
The classic “deep” Artificial Intelligence approach:
Analyze the source language into a detailed symbolic representation of its
meaning
Generate this meaning in the target language
“Interlingua”: one single meaning representation for all languages
Nice in theory, but extremely difficult in practice:
What kind of representation?
What is the appropriate level of detail to represent?
How to ensure that the interlingua is in fact universal?
Interlingua versus Transfer
With interlingua, need only N parsers/generators instead of N² transfer systems:
[Diagram: six languages L1–L6. With transfer, every pair of languages needs its own transfer system; with an interlingua, each language connects only to the single interlingua.]
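For example, with the six languages in the diagram, pairwise transfer requires on the order of N × (N − 1) = 6 × 5 = 30 directed transfer modules, whereas an interlingua design needs only N analyzers plus N generators, i.e. 6 + 6 = 12 components.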
Multi-Engine MT
Apply several MT engines to each input in parallel
Create a combined translation from the individual
translations
Goal is to combine strengths and avoid weaknesses, along all dimensions: domain limits, quality, development time/cost, run-time speed, etc.
Various approaches to the problem
Speech-to-Speech MT
Speech just makes MT (much) more difficult:
Spoken language is messier
False starts, filled pauses, repetitions, out-of-vocabulary words
Lack of punctuation and explicit sentence boundaries
Current Speech technology is far from perfect
Need for speech recognition and synthesis in foreign languages
Robustness: MT quality degradation should be proportional to SR quality
Tight Integration: rather than separate sequential tasks, can SR + MT be integrated in ways that improve end-to-end performance?
END