CS 4705
• What do we know about so far?
– Words
(stems and affixes, roots and templates,…)
– POS
(e.g. nouns, verbs, adverbs, adjectives, determiners, articles, …)
– Named Entities
(e.g. Person Names)
– Ngrams
(simple word sequences)
– Syntactic Constituents (NPs, VPs, Ss,…)
What useful things can we do with only this knowledge?
• Find sentence boundaries, abbreviations
• Find Named Entities (person names, company names, telephone numbers, addresses,…)
• Find topic boundaries and classify articles into topics
• Identify a document’s author and their opinion on the topic, pro or con
• Answer simple questions (factoids)
• Do simple summarization/compression
• Online collections of text and speech
• Some examples
– Brown Corpus
– Wall Street Journal and AP News
– ATIS, Broadcast News
– TDT
– Switchboard, Call Home
– TRAINS, FM Radio, BDC Corpus
– Hansards’ parallel corpus of French and English
– And many private research collections
Next, we pose a question: the dependent variable
• Binary questions:
– Is this word followed by a sentence boundary or not?
– A topic boundary?
– Does this word begin a person name? End one?
– Should this word or sentence be included in a summary?
• Other classification:
– Is this document about medical issues? Politics?
Religion? Sports? …
• Predicting continuous variables:
– How loud or high should this utterance be produced?
Finding a suitable corpus and preparing it for analysis
• Which corpora can answer my question?
– Do I need to get them labeled to do so?
• Dividing the corpus into training and test corpora
– To develop a model, we need a training corpus
• overly narrow corpus: doesn’t generalize
• overly general corpus: doesn't reflect the task or domain
– To demonstrate how general our model is, we need a test corpus to evaluate the model
• Development test set vs. held out test set
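A minimal sketch of such a split, in Python (the 80/10/10 proportions and the function name are illustrative assumptions, not a prescription):

```python
import random

def split_corpus(examples, train=0.8, dev=0.1, seed=0):
    """Shuffle labeled examples and split them into a training set,
    a development test set, and a held-out test set."""
    rng = random.Random(seed)          # fixed seed so the split is reproducible
    shuffled = examples[:]
    rng.shuffle(shuffled)
    n = len(shuffled)
    n_train = int(n * train)
    n_dev = int(n * dev)
    return (shuffled[:n_train],                     # training corpus
            shuffled[n_train:n_train + n_dev],      # development test set
            shuffled[n_train + n_dev:])             # held-out test set

train_set, dev_set, test_set = split_corpus(list(range(100)))
```

The development set is used to tune the model during development; the held-out set is touched only once, for the final evaluation.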
– To evaluate our model we must choose an evaluation metric
• Accuracy
• Precision, recall, F-measure,…
• Cross validation
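For a binary task, precision, recall, and F-measure can be computed as follows (a from-scratch sketch; libraries such as scikit-learn provide the same metrics):

```python
def precision_recall_f(gold, predicted):
    """Precision, recall, and F1 for binary labels (1 = positive class)."""
    tp = sum(1 for g, p in zip(gold, predicted) if g == 1 and p == 1)
    fp = sum(1 for g, p in zip(gold, predicted) if g == 0 and p == 1)
    fn = sum(1 for g, p in zip(gold, predicted) if g == 1 and p == 0)
    precision = tp / (tp + fp) if tp + fp else 0.0   # of what we predicted, how much was right
    recall = tp / (tp + fn) if tp + fn else 0.0      # of what was there, how much we found
    f = (2 * precision * recall / (precision + recall)
         if precision + recall else 0.0)             # harmonic mean of the two
    return precision, recall, f
```

Accuracy alone can be misleading when the positive class is rare (e.g. sentence boundaries), which is why precision and recall are reported separately.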
• Again, identify the dependent variable: what do we want to predict or classify?
– Does this word begin a person name? Is this word within a person name?
• Identify the independent variables: what features might help to predict the dependent variable?
– What is this word’s POS? What is the POS of the word before it? After it?
– Is this word capitalized? Is it followed by a ‘.’?
– How far is this word from the beginning of its sentence?
• Extract the values of each variable from the corpus by some automatic means
A Sample Feature Vector for Sentence-Ending Detection

WordID   POS   Cap?  ,After?  Dist/Sbeg  End?
Clinton  N     y     n        1          n
won      V     n     n        2          n
easily   Adv   n     y        3          n
but      Conj  n     n        4          n
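Vectors like the ones above can be extracted automatically. A minimal Python sketch (the feature names and the toy token list are illustrative assumptions, not the course's actual code):

```python
def features(tokens, pos_tags, i):
    """Illustrative feature vector for token i of one sentence."""
    return {
        "pos": pos_tags[i],                      # the word's POS tag
        "cap": tokens[i][0].isupper(),           # Cap? -- is the word capitalized?
        "comma_after": (i + 1 < len(tokens)
                        and tokens[i + 1] == ","),  # ,After? -- comma follows?
        "dist_from_start": i + 1,                # Dist/Sbeg (1-based token position)
    }

tokens = ["Clinton", "won", "easily", ",", "but"]
tags = ["N", "V", "Adv", ",", "Conj"]
vectors = [features(tokens, tags, i) for i in range(len(tokens))]
```

A real extractor would also look at wider context windows and decide whether punctuation tokens count toward distance; this sketch counts every token.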
• Problems with voicemail today:
– Hard to scan new msgs to find those you need to deal with quickly
– Hard to find the msg you want in the archive
– Hard to locate the information you want within a msg
• How could we help?
[Architecture diagram: Caller → SCANMail → Subscriber]
• Recordings collected from 138 AT&T Labs employees’ mailboxes
• 100 hours; 10K msgs; 2500 speakers
• Gender balanced; 12% non-native speakers
• Mean message duration 36.4 secs, median 30.0 secs
• Hand-transcribed and annotated with caller id, gender, age, entity demarcation (names, dates, telnos)
• Also recognized using ASR engine
[ Greeting: hi R ] [ CallerID: it's me ] give me a call [ um ] right away cos there's [ .hn ] I guess there's some [ .hn ] change [ Date: tomorrow ] with the nursery school and they [ um ] [ .hn ] anyway they had this idea [ cos ] since I think
J's the only one staying [ Date: tomorrow ] for play club so they wanted to they suggested that
[ .hn ] well J2 actually offered to take J home with her and then would she
would meet you back at the synagogue at [ Time: five thirty ] to pick her up [ .hn ] [ uh ] so I don't know how you feel about that otherwise M_ and one other teacher would stay and take care of her till [ Date: five thirty tomorrow ] but if you [
.hn ] I wanted to know how you feel before I tell her one way or the other so call me [ .hn ] right away cos I have to get back to her in about an hour so [ .hn ] okay [ Closing: bye ] [ .nhn ] [ .onhk ]
(Martin Jansche and Steve Abney)
• Goals: extract key information from msgs to present in headers
• Approach:
– Supervised learning from transcripts (phone #'s, caller self-ids)
– Combine Machine Learning techniques with simpler alternatives, e.g. hand-crafted rules
– Two stage approaches
– Features exploit structure of key elements (e.g. length of phone numbers) and of surrounding context (e.g. self-ids tend to occur at beginning of msg)
• Rules convert all numbers to standard digit format
• Predict start of phone number with rules
– This step over-generates
– Prune with decision-tree classifier
• Best features:
– Position in msg
– Lexical cues
– Length of digit string
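A toy version of the two-stage idea, where the regex and the 7-or-10-digit pruning rule are simple stand-ins for the system's hand-crafted rules and decision-tree classifier:

```python
import re

def candidate_digit_spans(text):
    """Stage 1 (deliberately over-generates): every maximal run of
    digits, spaces, and dashes in the message text."""
    return [(m.start(), m.group())
            for m in re.finditer(r"\d[\d\s-]*\d|\d", text)]

def prune(candidates):
    """Stage 2, a stand-in for the decision-tree classifier: keep only
    candidates whose digit count looks like a phone number (7 or 10)."""
    keep = []
    for start, span in candidates:
        n_digits = sum(c.isdigit() for c in span)
        if n_digits in (7, 10):
            keep.append((start, span))
    return keep

msg = "call me back at 555 123 4567 before 5 tomorrow"
phone_numbers = prune(candidate_digit_spans(msg))
```

Stage 1 also picks up the lone "5" in "before 5 tomorrow"; stage 2 discards it, which is the point of the over-generate-then-prune design.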
• Performance:
– .94 F on human-labeled transcripts
– .95 F on ASR
• Predict start of id with classifier
– 97% of id’s begin 1-7 words into msg
• Then predict length of phrase
– Majority are only 2-4 words long
• Avoid risk of relying on correct speech recognition for names
• Best cues to end of phrase are a few common words
– ‘I’, ‘could’, ‘please’
– No actual names: they over-fit the data
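The combination of cues can be sketched as a simple heuristic. The start-cue words and thresholds below are invented for illustration; the actual system learns both the start position and the phrase length with classifiers:

```python
CUE_WORDS = {"i", "could", "please"}   # common words that tend to end a self-id

def extract_self_id(tokens, max_start=7, max_len=4):
    """Sketch: assume the self-id starts within the first max_start words
    and ends just before a cue word (or after max_len words)."""
    for start in range(min(max_start, len(tokens))):
        # hypothetical start cues standing in for the learned start classifier
        if tokens[start].lower() in ("this", "it's", "hi"):
            end = start + 1
            while (end < len(tokens) and end - start < max_len
                   and tokens[end].lower() not in CUE_WORDS):
                end += 1
            return tokens[start:end]
    return None                         # no self-id found in the message
```

Note the heuristic never matches on actual names, mirroring the finding that name features over-fit the data.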
• Performance
– .71 F on human-labeled
– .70 F on ASR