16 - School of Computing

School of Computing
FACULTY OF ENGINEERING
Chunking: Shallow Parsing
Eric Atwell, Language Research Group
Shallow Parsing
Break text up into non-overlapping contiguous subsets of
tokens.
• Also called chunking, partial parsing, light parsing.
What is it useful for? – semantic patterns
• Finding key “meaning-elements”: Named Entity Recognition
• people, locations, organizations
• Studying linguistic patterns, e.g. semantic patterns of verbs
• gave NP
• gave up NP in NP
• gave NP NP
• gave NP to NP
• Can ignore complex structure when not relevant
A Relationship between
Segmenting and Labeling
Tokenization segments the text
Tagging labels the text
Shallow parsing does both simultaneously.
Chunking vs. Full Syntactic Parsing
“G.K. Chesterton, author of The Man who was Thursday”
Representations for Chunks
IOB tags
• Inside, outside, and begin
• In English, the start of a phrase is often marked by a function-word
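As a rough illustration of the IOB representation, the sketch below converts chunk spans into per-token tags. The function name `chunks_to_iob`, the `(start, end, label)` span format and the `B-NP`/`I-NP`/`O` tag spelling are assumptions made for this example, not something the slides prescribe.

```python
def chunks_to_iob(tokens, chunks):
    """Label each token as Begin, Inside or Outside of a chunk.

    tokens: list of words.
    chunks: non-overlapping (start, end, label) spans, end exclusive
            (a hypothetical format chosen for this sketch).
    """
    tags = ["O"] * len(tokens)          # everything starts outside any chunk
    for start, end, label in chunks:
        tags[start] = "B-" + label      # first token of the chunk
        for i in range(start + 1, end):
            tags[i] = "I-" + label      # remaining tokens of the chunk
    return list(zip(tokens, tags))


tokens = ["fruit", "flies", "like", "a", "banana"]
np_chunks = [(0, 2, "NP"), (3, 5, "NP")]
for word, tag in chunks_to_iob(tokens, np_chunks):
    print(word, tag)
```

Each token gets exactly one tag, so the chunking can be stored and learned as an ordinary tagging problem.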
Representations for Chunks
Trees
• Chunk structure is a two-level tree that spans the entire text, containing
both chunks and non-chunks
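The two-level tree can be sketched with plain nested Python structures. In this hypothetical encoding the root is a `(label, children)` pair, a chunk is a `(label, words)` pair, and a non-chunk token is a bare string; the function name `iob_to_tree` is likewise an assumption for illustration.

```python
def iob_to_tree(tagged, root="S"):
    """Build a two-level tree from (word, IOB-tag) pairs.

    Children of the root are either bare words (non-chunks)
    or (label, [words]) pairs (chunks).
    """
    children = []
    current = None                      # the chunk being built, if any
    for word, tag in tagged:
        if tag == "O":
            children.append(word)
            current = None
            continue
        prefix, label = tag.split("-", 1)
        # Open a new chunk on B-, or on a stray I- with nothing to continue.
        if prefix == "B" or current is None or current[0] != label:
            current = (label, [])
            children.append(current)
        current[1].append(word)
    return (root, children)


tagged = [("fruit", "B-NP"), ("flies", "I-NP"), ("like", "O"),
          ("a", "B-NP"), ("banana", "I-NP")]
print(iob_to_tree(tagged))
```

The tree never nests deeper than root, chunk, word, which is what distinguishes it from a full syntactic parse.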
CoNLL Corpus: training data for
Machine Learning of chunking
From the Conference on Computational Natural Language Learning (CoNLL)
Shared-task competition from 2000
Goal: create machine learning methods that improve performance on the
chunking task
CoNLL Corpus
Data in IOB format from the Wall Street Journal (WSJ):
• Word POS-tag IOB-tag
• Training set: 8936 sentences
• Test set: 2012 sentences
Tags from the Brill tagger
• Penn Treebank Tags
Evaluation measure: F-score
• 2*precision*recall / (recall+precision)
• Baseline: select the chunk tag most frequently associated with each
POS tag, F=77.07
• Best score in the contest was F=94.13
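The baseline and the F-score can be sketched together on toy data. Everything below is illustrative: the training pairs and the test sentence are invented for the example (they are not CoNLL data), the helper names are hypothetical, and unseen POS tags are assumed to default to `O`. Scoring counts a predicted chunk as correct only if its span and label both match a gold chunk exactly.

```python
from collections import Counter, defaultdict


def train_baseline(pairs):
    """Map each POS tag to the chunk tag it most often co-occurs with."""
    counts = defaultdict(Counter)
    for pos, iob in pairs:
        counts[pos][iob] += 1
    return {pos: tags.most_common(1)[0][0] for pos, tags in counts.items()}


def iob_to_spans(tags):
    """Collect (start, end, label) chunk spans from a sequence of IOB tags."""
    spans = set()
    start = label = None
    for i, tag in enumerate(list(tags) + ["O"]):    # sentinel closes last chunk
        if tag.startswith("I-") and label == tag[2:]:
            continue                                # current chunk continues
        if label is not None:
            spans.add((start, i, label))            # close the open chunk
        if tag == "O":
            start = label = None
        else:
            start, label = i, tag[2:]               # B-, or I- with no open chunk
    return spans


def f_score(gold_tags, predicted_tags):
    """Return (precision, recall, F) over whole chunks."""
    gold, pred = iob_to_spans(gold_tags), iob_to_spans(predicted_tags)
    correct = len(gold & pred)
    precision = correct / len(pred) if pred else 0.0
    recall = correct / len(gold) if gold else 0.0
    if precision + recall == 0:
        return precision, recall, 0.0
    return precision, recall, 2 * precision * recall / (precision + recall)


# Toy (POS tag, chunk tag) training pairs, invented for illustration.
toy_pairs = [("DT", "B-NP"), ("NN", "I-NP"), ("VBP", "O"), ("DT", "B-NP"),
             ("JJ", "I-NP"), ("NN", "I-NP"), ("NN", "B-NP"), ("IN", "O")]
model = train_baseline(toy_pairs)

# "fruit flies like a ripe banana" with hand-assigned tags.
test_pos = ["NN", "NNS", "VBP", "DT", "JJ", "NN"]
gold = ["B-NP", "I-NP", "O", "B-NP", "I-NP", "I-NP"]
predicted = [model.get(pos, "O") for pos in test_pos]   # unseen POS -> "O"
print(predicted)
print(f_score(gold, predicted))
```

Because scoring is per chunk rather than per token, a single wrong tag inside a chunk costs the whole chunk in both precision and recall.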
Chunking with Regular Expressions
This time we write regexes over POS tags rather than characters
• <DT><JJ>?<NN>
• <NN.*>
• <JJ|NN>+
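Compile them with ChunkRule() (in current NLTK: nltk.chunk.regexp.ChunkRule,
which also takes a description string)
• rule = ChunkRule('<DT|NN>+', 'chunk determiners and nouns')
• chunkparser = RegexpChunkParser([rule], chunk_label='NP')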
Resulting object is a (sort-of) parse tree
• Top-level node called S
• Chunks are labelled NP
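To show what a regular expression over tags amounts to, here is a minimal standard-library sketch. It is not NLTK's implementation: the helpers `tag_pattern_to_regex` and `regex_chunk` are hypothetical, the example sentence is invented, and the sketch assumes tags contain no regex metacharacters (so a tag such as `PRP$` would need escaping).

```python
import re


def tag_pattern_to_regex(pattern):
    """Turn a tag pattern like '<DT><JJ>?<NN>' into a regex over '<DT><NN>...'."""
    def convert(match):
        # Inside <...>, '.' must not run across a tag boundary.
        inner = match.group(1).replace(".", "[^<>]")
        return "(?:<(?:" + inner + ")>)"
    return re.compile(re.sub(r"<([^<>]+)>", convert, pattern))


def regex_chunk(tagged, pattern, label="NP"):
    """Group tokens whose tag sequence matches the pattern into chunks.

    tagged: list of (word, pos). Returns a flat list whose items are either
    (word, pos) pairs or (label, [(word, pos), ...]) chunks.
    """
    regex = tag_pattern_to_regex(pattern)
    tag_string = "".join("<" + pos + ">" for _, pos in tagged)

    # Map the character offset of each tag boundary to a token index.
    offsets, position = {}, 0
    for index, (_, pos) in enumerate(tagged):
        offsets[position] = index
        position += len(pos) + 2
    offsets[position] = len(tagged)

    result, cursor = [], 0
    for match in regex.finditer(tag_string):
        if match.start() == match.end():
            continue                        # ignore empty matches
        start, end = offsets[match.start()], offsets[match.end()]
        result.extend(tagged[cursor:start])             # tokens outside chunks
        result.append((label, tagged[start:end]))       # the chunk itself
        cursor = end
    result.extend(tagged[cursor:])
    return result


sentence = [("the", "DT"), ("little", "JJ"), ("cat", "NN"), ("sat", "VBD"),
            ("on", "IN"), ("the", "DT"), ("mat", "NN")]
for item in regex_chunk(sentence, "<DT><JJ>?<NN>"):
    print(item)
```

The output is the same two-level structure as the chunk tree: chunks labelled NP sitting alongside unchunked tokens.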
Chunking with Regular Expressions
Rule application is sensitive to order
Chinking
Specify what does not go into a chunk.
• Much like defining punctuation as whatever is not alphanumeric or
whitespace.
• Can be more difficult to think about.
Simple chink-chunk approach:
function vs. content word classes
Regular expressions for chunks and chinks CAN get complex
BUT the whole point is to be simpler than full parsing!
SO: use a simple model which works “reasonably well”
(then tidy up afterwards…)
Chunk = nominal content-word (noun)
Chink = others (verb, pronoun, determiner, preposition,
conjunction) (+adjective, adverb as a borderline category)
Example
Fruit flies like a banana
fruit\N flies\N like\V a\A banana\N
[fruit flies] like a [banana]
[S [NP fruit\N flies\N NP]
[VP like\V
[NP a\A banana\N NP]
VP]
S]
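The chink-chunk idea in this example can be sketched in a few lines: every maximal run of chunk-class words becomes a chunk, and everything else is a chink. The function names and the choice of `{"N"}` as the only chunk class are assumptions following the simple model above; adjectives and adverbs, the borderline category, are left as chinks here.

```python
CHUNK_TAGS = {"N"}      # nominal content words; all other tags are chinks


def chink_chunk(tagged, chunk_tags=CHUNK_TAGS):
    """Group each maximal run of chunk-class words into a list."""
    groups = []
    for word, tag in tagged:
        if tag in chunk_tags:
            if groups and isinstance(groups[-1], list):
                groups[-1].append(word)     # extend the current chunk
            else:
                groups.append([word])       # start a new chunk
        else:
            groups.append(word)             # a chink stays a bare word
    return groups


def bracket(groups):
    """Render chunks in square brackets, as on the slide."""
    return " ".join(
        "[" + " ".join(g) + "]" if isinstance(g, list) else g for g in groups
    )


tagged = [("fruit", "N"), ("flies", "N"), ("like", "V"), ("a", "A"), ("banana", "N")]
print(bracket(chink_chunk(tagged)))
```

The result depends entirely on the PoS tags supplied, so a different tagging of the same words yields a different chunking.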
An alternative parse
This sentence is grammatically ambiguous:
Fruit flies like a banana
fruit\N flies\N like\V a\A banana\N
[fruit flies] like a [banana]
fruit\N flies\V like\I a\A banana\N
[fruit] flies like a [banana]
cf: “bank robbers like a chase” v “bread bakes in an oven”
[S [NP fruit\N NP]
[VP flies\V
[PP like\I [NP a\A banana\N NP] PP]
VP]
S]
Ambiguity leads to more rules
fruit\N flies\N like\V a\A banana\N
[fruit flies] like a [banana]
fruit\N flies\V like\I a\A banana\N
[fruit] flies like a [banana]
BUT what about: Time flies like an arrow - time\N, time\V
time\N flies\N like\V an\A arrow\N [time flies] like an [arrow]
time\N flies\V like\I an\A arrow\N [time] flies like an [arrow]
time\V flies\N like\I an\A arrow\N time [flies] like an [arrow]
3rd PoS-tagging gives ambiguous parse
Chunking can predict prosodic breaks
http://www.acm.org/crossroads/
An Approach for Detecting Prosodic Phrase Boundaries in
Spoken English by Claire Brierley and Eric Atwell
Summary
Shallow parsing is useful for:
Entity recognition
• people, locations, organizations
Studying linguistic patterns
• gave NP
• gave up NP in NP
• gave NP NP
• gave NP to NP
Prosodic phrase breaks – pauses in speech
Can ignore complex structure when not relevant
Chink-chunk approach: “quick-and-dirty” chunking, content v function PoS
Chink-chunk parsing is simpler than context-free grammar parsing!