Corpus Linguistics and Machine Learning Techniques
CS 4705
Review
• What do we know about so far?
– Words (stems and affixes, roots and
templates,…)
– Ngrams (simple word sequences)
– POS (e.g. nouns, verbs, adverbs, adjectives,
determiners, articles, …)
Some Additional Things We Could Find
• Named Entities
– Persons
– Company Names
– Locations
– Dates
What useful things can we do with this knowledge?
• Find sentence boundaries, abbreviations
• Find Named Entities (person names, company
names, telephone numbers, addresses,…)
• Find topic boundaries and classify articles into
topics
• Identify a document’s author and their opinion on
the topic, pro or con
• Answer simple questions (factoids)
• Do simple summarization/compression
But first, we need corpora…
• Online collections of text and speech
• Some examples
– Brown Corpus
– Wall Street Journal and AP News
– ATIS, Broadcast News
– TDT (Topic Detection and Tracking)
– Switchboard, Call Home
– TRAINS, FM Radio, BDC Corpus
– Hansards’ parallel corpus of French and English
– And many private research collections
Next, we pose a question…the dependent variable
• Binary questions:
– Is this word followed by a sentence boundary or not?
– A topic boundary?
– Does this word begin a person name? End one?
– Should this word or sentence be included in a
summary?
• Classification:
– Is this document about medical issues? Politics?
Religion? Sports? …
• Predicting continuous variables:
– How loud or high should this utterance be produced?
Finding a suitable corpus and preparing it for
analysis
• Which corpora can answer my question?
– Do I need to get them labeled to do so?
• Dividing the corpus into training and test corpora
– To develop a model, we need a training corpus
• overly narrow corpus: doesn’t generalize
• overly general corpus: doesn't reflect the task or domain
– To demonstrate how general our model is, we need a
test corpus to evaluate the model
• Development test set vs. held out test set
– To evaluate our model we must choose an evaluation
metric
• Accuracy
• Precision, recall, F-measure,…
• Cross validation
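The metrics above can be computed directly from parallel lists of gold and predicted labels. A minimal Python sketch (the function name and toy labels are illustrative, not from the slides):

```python
def precision_recall_f1(gold, predicted, positive="y"):
    """Precision, recall, and F-measure for one class, from
    parallel lists of gold and predicted labels."""
    tp = sum(1 for g, p in zip(gold, predicted) if g == positive and p == positive)
    fp = sum(1 for g, p in zip(gold, predicted) if g != positive and p == positive)
    fn = sum(1 for g, p in zip(gold, predicted) if g == positive and p != positive)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return precision, recall, f

# Toy sentence-boundary labels for six tokens ("y" = boundary)
gold      = ["n", "y", "n", "n", "y", "n"]
predicted = ["n", "y", "y", "n", "n", "n"]
p, r, f = precision_recall_f1(gold, predicted)  # one hit, one miss, one false alarm
```

With one true positive, one false positive, and one false negative, precision, recall, and F all come out to 0.5 on this toy data.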
Then we build the model…
• Identify the dependent variable: what do we want to predict or
classify?
– Does this word begin a person name? Is this word within a person
name?
– Is this document about sports? The weather? International news?
???
• Identify the independent variables: what features might help to predict
the dependent variable?
– What is this word’s POS? What is the POS of the word before it?
After it?
– Is this word capitalized? Is it followed by a ‘.’?
– Does ‘hockey’ appear in this document?
– How far is this word from the beginning of its sentence?
• Extract the values of each variable from the corpus by some automatic
means
A Sample Feature Vector for Sentence-Ending
Detection

  WordID    POS    Cap?   ',' After?   Dist/Sbeg   End?
  Clinton   N      y      n            1           n
  won       V      n      n            2           n
  easily    Adv    n      y            3           n
  but       Conj   n      n            4           n
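Feature values like these can be extracted automatically, as the previous slide suggests. A minimal Python sketch (the tokenization and the punctuation handling are assumptions made for illustration):

```python
def extract_features(tokens, pos_tags):
    """Build (word, POS, capitalized?, comma-after?, distance-from-
    sentence-start) tuples for each word token."""
    rows = []
    for i, (word, pos) in enumerate(zip(tokens, pos_tags)):
        if word in {",", "."}:  # punctuation supplies context, not rows
            continue
        comma_after = i + 1 < len(tokens) and tokens[i + 1] == ","
        # distance counts words (not punctuation) up to and including this one
        dist = sum(1 for w in tokens[: i + 1] if w not in {",", "."})
        rows.append((word, pos,
                     "y" if word[0].isupper() else "n",
                     "y" if comma_after else "n",
                     dist))
    return rows

tokens = ["Clinton", "won", "easily", ",", "but"]
tags = ["N", "V", "Adv", ",", "Conj"]
rows = extract_features(tokens, tags)  # reproduces the table rows above
```

On this input the third row comes out as ("easily", "Adv", "n", "y", 3), matching the table.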
An Example: Finding Caller Names in
Voicemail (SCANMail)
• Motivated by interviews, surveys and usage logs
of heavy users:
– Hard to scan new msgs to find those you need to deal
with quickly
– Hard to find msg you want in archive
– Hard to locate information you want in any msg
• How could we help?
SCANMail Architecture
[Architecture diagram: Caller → SCANMail → Subscriber]
Corpus Collection
• Recordings collected from 138 AT&T Labs
employees’ mailboxes
• 100 hours; 10K msgs; 2500 speakers
• Gender balanced; 12% non-native speakers
• Mean message duration 36.4 secs, median 30.0 secs
• Hand-transcribed and annotated with caller id,
gender, age, entity demarcation (names, dates,
telnos)
• Also recognized using ASR engine
Transcription and Bracketing
[ Greeting: hi R ] [ CallerID: it's me ] give me a
call [ um ] right away cos there's [ .hn ] I guess
there's some [ .hn ] change [ Date: tomorrow ]
with the nursery school and they [ um ] [ .hn ]
anyway they had this idea [ cos ] since I think
J's the only one staying [ Date: tomorrow ] for
play club so they wanted to they suggested that
[ .hn ] well J2 actually offered to take J home
with her and then would she
would meet you back at the synagogue at [ Time:
five thirty ] to pick her up [ .hn ] [ uh ] so I don't
know how you feel about that otherwise M_ and
one other teacher would stay and take care of
her till [ Date: five thirty tomorrow ] but if you [
.hn ] I wanted to know how you feel before I tell
her one way or the other so call me [ .hn ] right
away cos I have to get back to her in about an
hour so [ .hn ] okay [ Closing: bye ] [ .nhn ] [ .onhk ]
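The labeled spans in a bracketed transcript like the one above (Greeting, CallerID, Date, Time, ...) can be pulled out with a simple pattern. A sketch assuming flat, non-nested brackets and the `Label: text` convention shown; unlabeled noise markers like `[ um ]` and `[ .hn ]` carry no colon-terminated label and are ignored:

```python
import re

# A labeled span is "[ Label: text ]"; the text may not contain brackets.
SPAN = re.compile(r"\[\s*(\w+):\s*([^][]*?)\s*\]")

def labeled_spans(transcript):
    """Return (label, text) pairs for every labeled span."""
    return SPAN.findall(transcript)

msg = ("[ Greeting: hi R ] [ CallerID: it's me ] give me a call [ um ] "
       "right away cos there's some change [ Date: tomorrow ] "
       "with the nursery school")
spans = labeled_spans(msg)
```

On this fragment the result is [("Greeting", "hi R"), ("CallerID", "it's me"), ("Date", "tomorrow")].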
SCANMail Demo
http://www.avatarweb.com/scanmail/
Audix extension: demo
Audix password: (null)
Information Extraction (Martin Jansche and
Steve Abney)
• Goals: extract key information from msgs to
present in headers
• Approach:
– Supervised learning from transcripts (phone
#’s, caller self-ids)
– Combine Machine Learning techniques with
simpler alternatives, e.g. hand-crafted rules
– Two stage approaches
– Features exploit structure of key elements (e.g.
length of phone numbers) and of surrounding
context (e.g. self-ids tend to occur at beginning
of msg)
Telephone Number Identification
• Rules convert all numbers to standard digit format
• Predict start of phone number with rules
– This step over-generates
– Prune with decision-tree classifier
• Best features:
– Position in msg
– Lexical cues
– Length of digit string
• Performance:
– .94 F on human-labeled transcripts
– .95 F on ASR transcripts
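The two-stage idea (rules overgenerate candidate digit strings, then a classifier prunes them) can be sketched as follows. Here a simple length filter stands in for the decision-tree pruner, and all names and the digit-word table are illustrative:

```python
# Spoken number words -> digits (stage-1 normalization rule)
DIGITS = {"oh": "0", "zero": "0", "one": "1", "two": "2", "three": "3",
          "four": "4", "five": "5", "six": "6", "seven": "7",
          "eight": "8", "nine": "9"}

def candidate_numbers(words):
    """Overgenerate: every maximal run of digit tokens is a candidate."""
    runs, current = [], []
    for w in (DIGITS.get(w, w) for w in words):
        if w.isdigit():
            current.append(w)
        elif current:
            runs.append("".join(current))
            current = []
    if current:
        runs.append("".join(current))
    return runs

def prune(candidates):
    """Stand-in for the decision-tree pruner: keep digit strings whose
    length matches a US phone number (7 or 10 digits)."""
    return [c for c in candidates if len(c) in (7, 10)]

words = "call me back at nine one four five five five one two one two thanks".split()
numbers = prune(candidate_numbers(words))
```

The run of ten digit words here survives the length filter as "9145551212"; a real pruner would also weigh position in the message and lexical cues such as "call me back at".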
Caller Self-Identifications
• Predict start of id with classifier
– 97% of id’s begin 1-7 words into msg
• Then predict length of phrase
– Majority are only 2-4 words long
• Avoid risk of relying on correct speech recognition for
names
• Best cues to end of phrase are a few common words
– ‘I’, ‘could’, ‘please’
– No actual names: they over-fit the data
• Performance
– .71 F on human-labeled transcripts
– .70 F on ASR transcripts
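A crude heuristic version of the two predictions above (start position, then phrase length bounded by a few common stop-cue words) might look like this; the rules, word lists, and names are illustrative stand-ins for the actual classifiers:

```python
def self_id_span(tokens, max_start=7, max_len=4):
    """Guess the caller self-id span: start within the first few words
    (97% of ids begin 1-7 words in), end before a common cue word."""
    stop_cues = {"I", "could", "please"}
    # Crude start rule: the word right after an opening greeting, else word 0
    start = 0
    for i, w in enumerate(tokens[:max_start]):
        if w.lower() in {"hi", "hello", "hey"}:
            start = i + 1
            break
    end = start
    while end < len(tokens) and end - start < max_len and tokens[end] not in stop_cues:
        end += 1
    return tokens[start:end]

msg = "hi it's Jane could you call me back".split()
span = self_id_span(msg)
```

On this message the span comes out as ["it's", "Jane"]: the greeting fixes the start, and "could" ends the phrase, so no actual name list is needed (names would overfit, as noted above).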
Introduction to Weka
Download