21 - School of Computing

School of Computing
FACULTY OF ENGINEERING
NLP/CL: Review
Eric Atwell, Language Research Group
(with thanks to other contributors)
Objectives of the module
On completion of this module, students should be able to:
- understand theory and terminology of empirical modelling of
natural language;
- understand and use algorithms, resources and techniques
for implementing and evaluating NLP systems;
- be familiar with some of the main language engineering
application areas;
- appreciate why unrestricted natural language processing is
still a major research task.
In a nutshell:
Why NLP is difficult: language is a complex system
How to solve it? Corpus-based machine-learning approaches
Motivation: applications of “The Language Machine”
The main sub-areas of linguistics
◮ Phonetics and Phonology: The study of linguistic sounds or speech.
◮ Morphology: The study of the meaningful components of words.
◮ Syntax: The study of the structural relationships between words.
◮ Semantics: The study of meanings of words, phrases, sentences.
◮ Discourse: The study of linguistic units larger than a single utterance.
◮ Pragmatics: The study of how language is used to accomplish goals.
Python, NLTK, WEKA
Python: A good programming language for NLP
• Interpreted
• Object-oriented
• Easy to interface to other things (text files, web, DBMS)
• Good stuff from: Java, Lisp, Tcl, Perl
• Easy to learn
• FUN!
Python NLTK: Natural Language Tool Kit with demos and tutorials
WEKA: Machine Learning toolkit: Classifiers, eg J48 Decision Trees
Why is NLP difficult?
Computers are not brains
• There is evidence that much of language understanding is built into
the human brain
Computers do not socialize
• Much of language is about communicating with people
Key problems:
• Representation of meaning and hidden structure
• Language presupposes knowledge about the world
• Language is ambiguous: a message can have many interpretations
• Language presupposes communication between people
Ambiguity:
Grammar (PoS) and Meaning
Iraqi Head Seeks Arms
Juvenile Court to Try Shooting Defendant
Teacher Strikes Idle Kids
Kids Make Nutritious Snacks
British Left Waffles on Falkland Islands
Red Tape Holds Up New Bridges
Bush Wins on Budget, but More Lies Ahead
Hospitals are Sued by 7 Foot Doctors
(Headlines leave out punctuation and function-words)
Lynne Truss, 2003. Eats, Shoots & Leaves:
The Zero Tolerance Approach to Punctuation
The Role of Memorization
Children learn words quickly
• Around age two they learn about 1 word every 2 hours.
• (Or 9 words/day)
• Often only need one exposure to associate meaning with word
• Can make mistakes, e.g., overgeneralization
“I goed to the store.”
• Exactly how they do this is still under study
Adult vocabulary
• Typical adult: about 60,000 words
• Literate adults: about twice that.
But there is too much to memorize!
establish
establishment: the Church of England as the official state church
disestablishment
antidisestablishment
antidisestablishmentarian
antidisestablishmentarianism: a political philosophy that is opposed to
the separation of church and state
MAYBE we don’t remember every word separately;
MAYBE we remember MORPHEMES and how to combine them
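The morpheme idea can be pictured with a small sketch that peels known prefixes and suffixes off a word. The affix lists below are a hand-picked assumption covering only the "establish" word family above; they are not a real morphological lexicon, and greedy stripping like this would over-split many ordinary words.

```python
# Hypothetical, hand-picked affix lists covering only the word family above.
PREFIXES = ["anti", "dis"]
SUFFIXES = ["ism", "arian", "ment"]

def segment(word):
    """Split a word into prefixes + stem + suffixes by greedy affix stripping."""
    prefixes, suffixes = [], []
    stripped = True
    while stripped:
        stripped = False
        for p in PREFIXES:
            # Only strip if a reasonably long stem remains.
            if word.startswith(p) and len(word) > len(p) + 2:
                prefixes.append(p)
                word = word[len(p):]
                stripped = True
        for s in SUFFIXES:
            if word.endswith(s) and len(word) > len(s) + 2:
                suffixes.insert(0, s)
                word = word[:-len(s)]
                stripped = True
    return prefixes + [word] + suffixes

print(segment("antidisestablishmentarianism"))
print(segment("disestablishment"))
```

Six stored morphemes are enough to rebuild every word in the list, which is the point of the MAYBE above.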
Rationalism v Empiricism
Rationalism: the doctrine that knowledge is acquired by
reason without regard to experience (Collins English Dictionary)
Noam Chomsky, 1957 Syntactic Structures
-Argued that we should build models through introspection:
-A language model is a set of rules thought up by an expert
Like “Expert Systems”…
Chomsky thought data was full of errors, better to rely on
linguists’ intuitions…
Empiricism v Rationalism
Empiricism: the doctrine that all knowledge derives from
experience (Collins English Dictionary)
The field was stuck for quite some time: rationalist
linguistic models for a specific example did not generalise.
A new approach started around 1990: Corpus Linguistics
• Well, not really new, but in the 50’s to 80’s, they didn’t have the text, disk
space, or GHz
Main idea: machine learning from CORPUS data
How to do corpus linguistics:
• Get large text collection (a corpus; plural: several corpora)
• Compute statistical models over the words/PoS/parses/… in the corpus
Surprisingly effective
Example Problem
Grammar checking example:
Which word to use?
<principal> <principle>
Empirical solution: look at which words surround each use:
• I am in my third year as the principal of Anamosa High School.
• School-principal transfers caused some upset.
• This is a simple formulation of the quantum mechanical uncertainty
principle.
• Power without principle is barren, but principle without power is
futile. (Tony Blair)
Using Very Large Corpora
Keep track of which words are the neighbors of each spelling in
well-edited text, e.g.:
• Principal: “high school”
• Principle: “rule”
At grammar-check time, choose the spelling best predicted by the
probability of co-occurring with surrounding words.
No need to “understand the meaning” !?
Surprising results:
• Log-linear improvement even to a billion words!
• Getting more data is better than fine-tuning algorithms!
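A minimal sketch of the neighbour-counting idea, using only the four example sentences as a toy "corpus". The window size, the add-one smoothing and the function names are assumptions made for illustration; a real system would be trained on a very large corpus, which is the point of the results above.

```python
import math
from collections import Counter

# Toy "well-edited" corpus: the four example sentences, lower-cased.
TRAINING = [
    "i am in my third year as the principal of anamosa high school",
    "school principal transfers caused some upset",
    "this is a simple formulation of the quantum mechanical uncertainty principle",
    "power without principle is barren but principle without power is futile",
]
CANDIDATES = ("principal", "principle")

def train(sentences, window=4):
    """Count the words seen within `window` positions of each candidate spelling."""
    counts = {c: Counter() for c in CANDIDATES}
    for sentence in sentences:
        words = sentence.split()
        for i, w in enumerate(words):
            if w in counts:
                neighbours = words[max(0, i - window):i] + words[i + 1:i + 1 + window]
                counts[w].update(n for n in neighbours if n not in CANDIDATES)
    return counts

def choose(counts, context_words):
    """Pick the spelling whose neighbour counts best predict the context."""
    vocab = set().union(*counts.values())
    def score(c):
        # Add-one smoothing so unseen neighbours do not give log(0).
        total = sum(counts[c].values()) + len(vocab)
        return sum(math.log((counts[c][w] + 1) / total) for w in context_words)
    return max(CANDIDATES, key=score)

counts = train(TRAINING)
print(choose(counts, ["high", "school"]))
print(choose(counts, ["mechanical", "uncertainty"]))
```

Nothing in the code represents what either word means; it only compares co-occurrence counts.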
The Effects of LARGE Datasets
From Banko & Brill, 2001. Scaling to Very Very Large
Corpora for Natural Language Disambiguation, Proc ACL
Corpus, word tokens and types
Corpus: text selected by language, genre, domain, …
Brown, LOB, BNC, Penn Treebank, MapTask, CCA, …
Corpus Annotation: text headers, PoS, parses, …
Corpus size is no. of words – depends on tokenisation
We can count word tokens, word types, type-token distribution
Lexeme/lemma is “root form”, v inflections (be v am/is/was…)
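The token/type distinction can be made concrete with a few lines of Python. This is a sketch on a made-up sentence with naive whitespace tokenisation; different tokenisation choices would give different counts.

```python
from collections import Counter

text = "the cat sat on the mat and the dog sat too"
tokens = text.split()       # word tokens: every running word
types = Counter(tokens)     # word types: distinct word forms, with their counts

print(len(tokens))                # number of tokens
print(len(types))                 # number of types
print(len(types) / len(tokens))   # type-token ratio
print(types.most_common(3))       # top of the type-token distribution
```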
Tokenization and Morphology
Tokenization - by whitespace, regular expressions
Problems: It’s, data-base, New York, …
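The two tokenisation strategies can be compared on a sentence built from the problem cases. The regular expression is one possible choice, assumed here for illustration; it is not a standard tokeniser.

```python
import re

text = "It's a data-base in New York."

# Whitespace tokenisation leaves punctuation glued to the last word.
print(text.split())

# One possible regular expression (an assumption, not a standard):
# keep internal hyphens and apostrophes inside a word,
# and split off any other punctuation mark as its own token.
TOKEN = re.compile(r"\w+(?:[-']\w+)*|[^\w\s]")
print(TOKEN.findall(text))
```

Neither approach treats "New York" as a single unit; multi-word expressions need a lexicon or a later chunking step.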
Jabberwocky shows we can break words into morphemes
Morpheme types: root/stem, affix, clitic
Derivational vs. Inflectional
Regular vs. Irregular
Concatinative vs. Templatic (root-and-pattern)
Morphological analysers: Porter stemmer, Morphy, PC-Kimmo
Morphology by lookup: CatVar, CELEX, OALD++
Corpus word-counts and n-grams
FreqDist counts of tokens and their distribution can be useful
Eg find main characters in Gutenberg texts
Eg compare word-lengths in different languages
Humans can predict the next word …
N-gram models are based on counts in a large corpus
Auto-generate a story ... (but gets stuck in local maximum)
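A frequency distribution and a bigram model can both be sketched with the standard library alone. collections.Counter stands in here for NLTK's FreqDist, and the tiny corpus is invented, so the output only illustrates the mechanics.

```python
from collections import Counter, defaultdict

tokens = "the cat sat on the mat and the cat sat on the rug".split()

# Frequency distribution of tokens.
freq = Counter(tokens)
print(freq.most_common(1))

# Bigram model: for each word, count which words follow it.
following = defaultdict(Counter)
for w1, w2 in zip(tokens, tokens[1:]):
    following[w1][w2] += 1

def generate(start, length):
    """Greedy generation: always emit the most frequent next word."""
    out = [start]
    for _ in range(length - 1):
        if out[-1] not in following:
            break   # no observed continuation
        out.append(following[out[-1]].most_common(1)[0][0])
    return out

print(" ".join(generate("the", 8)))
```

Because the generator always takes the single most likely continuation, it cycles through "the cat sat on" forever, which is the "stuck" behaviour noted above. Sampling the next word in proportion to its count is one way out of the loop.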
Word-counts follow Zipf’s Law
Zipf’s law applies to a word type-token frequency distribution:
frequency is proportional to the inverse of the rank in a ranked
list
f*r = k where f is frequency, r is rank, k is a constant
ie a few very common words, a small to medium number of
middle-frequency words, and a long tail of low-frequency words
(many occurring only once)
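The relation f*r = k can be checked on any token list by ranking the types and multiplying. The sample below is artificial and chosen so that the product comes out roughly constant; on a real corpus the fit is only approximate.

```python
from collections import Counter

def zipf_table(tokens):
    """Return (rank, word, frequency, rank * frequency) rows, most frequent first."""
    ranked = Counter(tokens).most_common()
    return [(r, w, f, r * f) for r, (w, f) in enumerate(ranked, start=1)]

# Artificial sample with frequencies 6, 3, 2, 1.
sample = "a a a a a a b b b c c d".split()
for row in zipf_table(sample):
    print(row)
```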
Chomsky argued against corpus evidence as it is finite and
limited compared to introspection; Zipf’s law shows that many
words/structures appear only once or not at all in a given corpus,
which arguably supports the argument that corpus evidence is
limited compared to introspection
Kilgarriff’s Sketch Engine
Sketch Engine shows a Word Sketch or list of collocates:
words co-occurring with the target word more frequently than
predicted by independent probabilities
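"More frequently than predicted by independent probabilities" can be turned into a score. The sketch below uses pointwise mutual information (PMI), one common association measure; it is an assumption chosen for illustration and is not claimed to be the statistic Sketch Engine itself uses. The corpus is invented and the probability estimates are deliberately crude.

```python
import math
from collections import Counter

def pmi_collocates(tokens, target, window=2):
    """Score words near `target` by PMI = log2( P(target, w) / (P(target) * P(w)) ).

    PMI > 0 means the pair co-occurs more often than independence predicts.
    """
    n = len(tokens)
    word_freq = Counter(tokens)
    pair_freq = Counter()
    for i, w in enumerate(tokens):
        if w == target:
            pair_freq.update(tokens[max(0, i - window):i] + tokens[i + 1:i + 1 + window])
    scores = {}
    for c, joint in pair_freq.items():
        p_joint = joint / n
        p_indep = (word_freq[target] / n) * (word_freq[c] / n)
        scores[c] = math.log2(p_joint / p_indep)
    # Highest-scoring collocates first.
    return sorted(scores.items(), key=lambda kv: -kv[1])

tokens = ("strong tea and strong coffee but the powerful computer "
          "and the powerful engine").split()
collocates = pmi_collocates(tokens, "strong", window=1)
print(collocates)
```

"tea" and "coffee" score above "and" because "and" is frequent elsewhere in the text, so its appearance next to "strong" is less surprising.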
A lexicographer can colour-code groups of related collocates
indicating different senses or meanings of the target word
With a large corpus the lexicographer should find all current
senses, better than relying on intuition/introspection
Large user-base of experience, used in development of
several published dictionaries for English
For minority languages with few existing corpus resources,
Sketch Engine is combined with Web-Bootcat to enable
lexicographers to collect their own Web-as-Corpus
Parts of Speech
Parts of Speech: groups words into grammatical categories
… and separates different functions of a word
In English, many words are ambiguous: 2 or more PoS-tags
Very simple tagger: everything is NN
Better PoS-taggers: unigram, bigram, trigram, Brill, …
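The step from the "everything is NN" tagger to a unigram tagger can be sketched in plain Python. The tagged sample and the simplified tagset are hypothetical, and the function names are made up for this example; NLTK provides ready-made classes for both taggers.

```python
from collections import Counter, defaultdict

# Tiny hand-tagged training sample (hypothetical data, simplified tagset).
train = [("the", "DT"), ("dog", "NN"), ("runs", "VB"), ("the", "DT"),
         ("run", "NN"), ("was", "VB"), ("long", "JJ"), ("dogs", "NN"),
         ("run", "VB"), ("run", "VB")]

def default_tag(words):
    """Very simple tagger: everything is NN."""
    return [(w, "NN") for w in words]

def train_unigram(tagged):
    """Unigram tagger: remember each word's most frequent tag."""
    counts = defaultdict(Counter)
    for word, tag in tagged:
        counts[word][tag] += 1
    return {w: c.most_common(1)[0][0] for w, c in counts.items()}

def unigram_tag(model, words):
    # Back off to NN for words never seen in training.
    return [(w, model.get(w, "NN")) for w in words]

model = train_unigram(train)
print(default_tag(["the", "dog", "run"]))
print(unigram_tag(model, ["the", "dog", "run", "cat"]))
```

The ambiguous word "run" gets VB because that tag is more frequent in the sample; a bigram tagger would also look at the previous tag to decide.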
Training and Testing of Machine Learning Algorithms
Algorithms that “learn” from data see a set of examples
and try to generalize from them.
Training set:
• Examples trained on
Test set:
• Also called held-out data and unseen data
• Use this for testing your algorithm
• Must be separate from the training set
• Otherwise, you cheated!
“Gold Standard”
• A test set that a community has agreed on and uses as a common
benchmark – use for final evaluation
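The train/test discipline can be sketched as two small helpers. The names split and accuracy, the 20% test fraction and the toy even/odd data are all assumptions made for illustration.

```python
import random

def split(examples, test_fraction=0.2, seed=0):
    """Shuffle a copy of the examples and hold out a fraction as the test set."""
    shuffled = list(examples)
    random.Random(seed).shuffle(shuffled)
    n_test = int(len(shuffled) * test_fraction)
    return shuffled[n_test:], shuffled[:n_test]

def accuracy(predict, test_set):
    """Fraction of (input, gold_label) pairs that `predict` labels correctly."""
    correct = sum(1 for x, gold in test_set if predict(x) == gold)
    return correct / len(test_set)

# Toy labelled data: numbers labelled "even" or "odd".
examples = [(n, "even" if n % 2 == 0 else "odd") for n in range(10)]
train_set, test_set = split(examples)

# The two sets must not overlap; otherwise, you cheated.
assert not set(train_set) & set(test_set)

print(len(train_set), len(test_set))
print(accuracy(lambda n: "even" if n % 2 == 0 else "odd", test_set))
```

Fixing the random seed keeps the split reproducible, so different algorithms are compared on the same held-out items.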
Grammar and Parsing
Context-Free Grammars and Constituency
Some common CFG phenomena for English
• Sentence-level constructions
• NP, PP, VP
• Coordination
• Subcategorization
Top-down and Bottom-up Parsing
Problems with context-free grammars and parsers
Parse-trees show syntactic structure of sentences
Key constituents: S, NP, VP, PP
You can draw a parse-tree and corresponding CFG
Problems with Context-Free Grammar:
• Coordination: X → X and X is a Meta-Rule, not strict CFG rule
• Agreement: needs duplicate CFG rules for singular/plural etc
• Subcategorization: needs separate CFG non-terminals for trans/intrans/…
• Movement: object/subject of verb may be “moved” in questions
• Dependency parsing captures deeper semantics but is harder
• Parsing: top-down v bottom-up v combined
• Ambiguity causes backtracking, so CHART PARSER stores partial parses
Parsing sentences left-to-right
The horse raced past the barn
[S [NP [A the] [N horse] NP]
[VP [V raced] [PP [I past] [NP [A the] [N barn] NP] PP]
VP] S]
The horse raced past the barn fell
[S [NP [NP [A the] [N horse] NP]
[VP [V raced] [PP [I past] [NP [A the] [N barn] NP]
PP] VP] NP]
[VP [V fell] VP] S]
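The first bracketed parse can be reproduced with a small chart recogniser. The sketch below uses the CKY algorithm, one bottom-up chart method, over a toy grammar written for this example. The grammar is an assumption: it reuses the category labels from the brackets above but covers only the simple sentence.

```python
from collections import defaultdict

# Toy grammar in Chomsky Normal Form (an assumption made for this sketch).
# Keys are (left child, right child); values are the parent category.
BINARY = {("NP", "VP"): "S", ("A", "N"): "NP", ("V", "PP"): "VP", ("I", "NP"): "PP"}
LEXICON = {"the": "A", "horse": "N", "barn": "N", "raced": "V", "past": "I"}

def cky(words):
    """Fill a chart where chart[i, j] holds the categories spanning words[i:j]."""
    n = len(words)
    chart = defaultdict(set)
    for i, w in enumerate(words):
        if w in LEXICON:
            chart[i, i + 1].add(LEXICON[w])
    for width in range(2, n + 1):
        for i in range(n - width + 1):
            j = i + width
            for k in range(i + 1, j):   # split point between the two children
                for left in chart[i, k]:
                    for right in chart[k, j]:
                        if (left, right) in BINARY:
                            chart[i, j].add(BINARY[left, right])
    return chart

words = "the horse raced past the barn".split()
chart = cky(words)
print("S" in chart[0, len(words)])
```

Each partial constituent is stored once in the chart and reused, so nothing is re-parsed after a wrong guess. Handling "The horse raced past the barn fell" would need extra rules for the reduced relative clause (NP → NP VP) and a lexical entry for "fell", which this toy grammar leaves out.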
Chunking or Shallow Parsing
Break text up into non-overlapping contiguous subsets of tokens.
Shallow parsing or Chunking is useful for:
Entity recognition
• people, locations, organizations
Studying linguistic patterns
• gave NP
• gave up NP in NP
• gave NP NP
• gave NP to NP
Prosodic phrase breaks – pauses in speech
Can ignore complex structure when not relevant
Chunking can be done via regular expressions over PoS-tags
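A regular expression over PoS-tags can be sketched by encoding the tag sequence as a string. The NP pattern (optional determiner, any adjectives, one or more nouns) and the tagged sentence are illustrative assumptions; NLTK offers a chunk parser built on the same idea.

```python
import re

def np_chunks(tagged):
    """Find NP chunks matching the tag pattern DT? JJ* NN+ (an illustrative pattern)."""
    # Encode the tag sequence as one string, "<TAG>" per token,
    # so that a regular expression can match over tags.
    tag_string = "".join(f"<{tag}>" for _, tag in tagged)
    pattern = re.compile(r"(?:<DT>)?(?:<JJ>)*(?:<NN>)+")
    chunks = []
    for m in pattern.finditer(tag_string):
        start = tag_string[:m.start()].count("<")   # token index where the chunk starts
        length = m.group().count("<")               # number of tokens in the chunk
        chunks.append([w for w, _ in tagged[start:start + length]])
    return chunks

sentence = [("the", "DT"), ("little", "JJ"), ("dog", "NN"), ("barked", "VBD"),
            ("at", "IN"), ("the", "DT"), ("cat", "NN")]
print(np_chunks(sentence))
```

The chunks are non-overlapping and contiguous, and the verb and preposition are simply left outside any chunk.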
Information Extraction
Partial parsing gives us NP chunks…
IE: Named Entity Recognition
People, places, companies, dates etc.
In a cohesive text, some NPs refer to the same thing/person
Needed: an algorithm for NP coreference resolution; eg:
“Hudda”, “ten of hearts”, “Mrs Anthrax”, “she” all refer to the
same Named Entity
Semantics:
Word Sense Disambiguation
e.g. mouse (animal /PC-interface)
• It’s a hard task (Very)
• Humans very good at it
• Computers not
• Active field of research for over 50 years
• Mistakes in disambiguation have negative results
• Beginning to be of practical use
• Desirable skill (Google, Microsoft)
Machine learning v cognitive modelling
NLP has been successful using ML from data, without
“linguistic / cognitive models”
Supervised ML: given labelled data (eg PoS-tagged text to
train PoS-tagger, to tag new text in the style of training text)
Unsupervised ML: no labelled data (eg clustering words with
“similar contexts” gives PoS-tag categories)
Unsupervised ML is harder, but increasingly successful!
NLP applications
Machine Translation
Localization: adapting text (e.g. ads) to local language
Information Retrieval (Google, etc)
Information Extraction
Detecting Terrorist Activities
Understanding the Quran
…
For more, see The Language Machine
And Finally…
Any final questions?
Feedback please (eg email me)
Good luck in the exam!
Look at past exam papers
BUT note changes in topics covered
And if you do use NLP in your career, please let me know…