CS 4705
• What do we know about so far?
– Words
(stems and affixes, roots and templates,…)
– POS
(e.g. nouns, verbs, adverbs, adjectives, determiners, articles, …)
– Named Entities
(e.g. Person Names)
– Ngrams
(simple word sequences)
– Syntactic Constituents (NPs, VPs, Ss,…)
What useful things can we do with only this knowledge?
• Find sentence boundaries, abbreviations
• Find Named Entities (person names, company names, telephone numbers, addresses,…)
• Find topic boundaries and classify articles into topics
• Identify a document’s author and their opinion on the topic, pro or con
• Answer simple questions (factoids)
• Do simple summarization/compression
• Online collections of text and speech
• Some examples
– Brown Corpus
– Wall Street Journal and AP News
– ATIS, Broadcast News
– TDT
– Switchboard, Call Home
– TRAINS, FM Radio, BDC Corpus
– Hansards’ parallel corpus of French and English
– And many private research collections
Next, we pose a question: the dependent variable
• Binary questions:
– Is this word followed by a sentence boundary or not?
– A topic boundary?
– Does this word begin a person name? End one?
– Should this word or sentence be included in a summary?
• Other classification:
– Is this document about medical issues? Politics?
Religion? Sports? …
• Predicting continuous variables:
– How loud or high should this utterance be produced?
Finding a suitable corpus and preparing it for analysis
• Which corpora can answer my question?
– Do I need to get them labeled to do so?
• Dividing the corpus into training and test corpora
– To develop a model, we need a training corpus
• overly narrow corpus: doesn’t generalize
• overly general corpus: doesn't reflect the task or domain
– To demonstrate how general our model is, we need a test corpus to evaluate the model
• Development test set vs. held out test set
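A minimal sketch of such a split, in Python (the 80/10/10 proportions and the function name are illustrative assumptions, not a prescription):

```python
import random

def split_corpus(examples, train=0.8, dev=0.1, seed=0):
    """Shuffle labeled examples and split them into a training set,
    a development test set, and a held-out test set."""
    rng = random.Random(seed)          # fixed seed so the split is reproducible
    shuffled = examples[:]
    rng.shuffle(shuffled)
    n = len(shuffled)
    n_train = int(n * train)
    n_dev = int(n * dev)
    return (shuffled[:n_train],                     # training corpus
            shuffled[n_train:n_train + n_dev],      # development test set
            shuffled[n_train + n_dev:])             # held-out test set

train_set, dev_set, test_set = split_corpus(list(range(100)))
```

The development set is used to tune the model during development; the held-out set is touched only once, for the final evaluation.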
– To evaluate our model we must choose an evaluation metric
• Accuracy
• Precision, recall, F-measure,…
• Cross validation
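For a binary task, precision, recall, and F-measure can be computed as follows (a from-scratch sketch; libraries such as scikit-learn provide the same metrics):

```python
def precision_recall_f(gold, predicted):
    """Precision, recall, and F1 for binary labels (1 = positive class)."""
    tp = sum(1 for g, p in zip(gold, predicted) if g == 1 and p == 1)
    fp = sum(1 for g, p in zip(gold, predicted) if g == 0 and p == 1)
    fn = sum(1 for g, p in zip(gold, predicted) if g == 1 and p == 0)
    precision = tp / (tp + fp) if tp + fp else 0.0   # of what we predicted, how much was right
    recall = tp / (tp + fn) if tp + fn else 0.0      # of what was there, how much we found
    f = (2 * precision * recall / (precision + recall)
         if precision + recall else 0.0)             # harmonic mean of the two
    return precision, recall, f
```

Accuracy alone can be misleading when the positive class is rare (e.g. sentence boundaries), which is why precision and recall are reported separately.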
• Again, identify the dependent variable: what do we want to predict or classify?
– Does this word begin a person name? Is this word within a person name?
• Identify the independent variables: what features might help to predict the dependent variable?
– What is this word’s POS? What is the POS of the word before it? After it?
– Is this word capitalized? Is it followed by a ‘.’?
– How far is this word from the beginning of its sentence?
• Extract the values of each variable from the corpus by some automatic means
A Sample Feature Vector for Sentence-Ending Detection

WordID   POS   Cap?  ,After?  Dist/Sbeg  End?
Clinton  N     y     n        1          n
won      V     n     n        2          n
easily   Adv   n     y        3          n
but      Conj  n     n        4          n
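Vectors like the ones above can be extracted automatically. A minimal Python sketch (the feature names and the toy token list are illustrative assumptions, not the course's actual code):

```python
def features(tokens, pos_tags, i):
    """Illustrative feature vector for token i of one sentence."""
    return {
        "pos": pos_tags[i],                      # the word's POS tag
        "cap": tokens[i][0].isupper(),           # Cap? -- is the word capitalized?
        "comma_after": (i + 1 < len(tokens)
                        and tokens[i + 1] == ","),  # ,After? -- comma follows?
        "dist_from_start": i + 1,                # Dist/Sbeg (1-based token position)
    }

tokens = ["Clinton", "won", "easily", ",", "but"]
tags = ["N", "V", "Adv", ",", "Conj"]
vectors = [features(tokens, tags, i) for i in range(len(tokens))]
```

A real extractor would also look at wider context windows and decide whether punctuation tokens count toward distance; this sketch counts every token.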
• Problems with voicemail today:
– Hard to scan new msgs to find those you need to deal with quickly
– Hard to find the msg you want in the archive
– Hard to locate the information you want within a msg
• How could we help?
[Architecture diagram: Caller → SCANMail → Subscriber]
• Recordings collected from 138 AT&T Labs employees’ mailboxes
• 100 hours; 10K msgs; 2500 speakers
• Gender balanced; 12% non-native speakers
• Mean message duration 36.4 secs, median 30.0 secs
• Hand-transcribed and annotated with caller id, gender, age, entity demarcation (names, dates, telnos)
• Also recognized using ASR engine
[ Greeting: hi R ] [ CallerID: it's me ] give me a call [ um ] right away cos there's [ .hn ] I guess there's some [ .hn ] change [ Date: tomorrow ] with the nursery school and they [ um ] [ .hn ] anyway they had this idea [ cos ] since I think
J's the only one staying [ Date: tomorrow ] for play club so they wanted to they suggested that
[ .hn ] well J2 actually offered to take J home with her and then would she
would meet you back at the synagogue at [ Time: five thirty ] to pick her up [ .hn ] [ uh ] so I don't know how you feel about that otherwise M_ and one other teacher would stay and take care of her till [ Date: five thirty tomorrow ] but if you [
.hn ] I wanted to know how you feel before I tell her one way or the other so call me [ .hn ] right away cos I have to get back to her in about an hour so [ .hn ] okay [ Closing: bye ] [ .nhn ] [ .onhk ]
(Martin Jansche and Steve Abney)
• Goals: extract key information from msgs to present in headers
• Approach:
– Supervised learning from transcripts (phone #'s, caller self-ids)
– Combine Machine Learning techniques with simpler alternatives, e.g. hand-crafted rules
– Two stage approaches
– Features exploit structure of key elements (e.g. length of phone numbers) and of surrounding context (e.g. self-ids tend to occur at beginning of msg)
• Rules convert all numbers to standard digit format
• Predict start of phone number with rules
– This step over-generates
– Prune with decision-tree classifier
• Best features:
– Position in msg
– Lexical cues
– Length of digit string
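A toy version of the two-stage idea, where the regex and the 7-or-10-digit pruning rule are simple stand-ins for the system's hand-crafted rules and decision-tree classifier:

```python
import re

def candidate_digit_spans(text):
    """Stage 1 (deliberately over-generates): every maximal run of
    digits, spaces, and dashes in the message text."""
    return [(m.start(), m.group())
            for m in re.finditer(r"\d[\d\s-]*\d|\d", text)]

def prune(candidates):
    """Stage 2, a stand-in for the decision-tree classifier: keep only
    candidates whose digit count looks like a phone number (7 or 10)."""
    keep = []
    for start, span in candidates:
        n_digits = sum(c.isdigit() for c in span)
        if n_digits in (7, 10):
            keep.append((start, span))
    return keep

msg = "call me back at 555 123 4567 before 5 tomorrow"
phone_numbers = prune(candidate_digit_spans(msg))
```

Stage 1 also picks up the lone "5" in "before 5 tomorrow"; stage 2 discards it, which is the point of the over-generate-then-prune design.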
• Performance:
– .94 F on human-labeled transcripts
– .95 F on ASR
• Predict start of id with classifier
– 97% of id’s begin 1-7 words into msg
• Then predict length of phrase
– Majority are only 2-4 words long
• Avoid risk of relying on correct speech recognition for names
• Best cues to end of phrase are a few common words
– ‘I’, ‘could’, ‘please’
– No actual names: they over-fit the data
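The combination of cues can be sketched as a simple heuristic. The start-cue words and thresholds below are invented for illustration; the actual system learns both the start position and the phrase length with classifiers:

```python
CUE_WORDS = {"i", "could", "please"}   # common words that tend to end a self-id

def extract_self_id(tokens, max_start=7, max_len=4):
    """Sketch: assume the self-id starts within the first max_start words
    and ends just before a cue word (or after max_len words)."""
    for start in range(min(max_start, len(tokens))):
        # hypothetical start cues standing in for the learned start classifier
        if tokens[start].lower() in ("this", "it's", "hi"):
            end = start + 1
            while (end < len(tokens) and end - start < max_len
                   and tokens[end].lower() not in CUE_WORDS):
                end += 1
            return tokens[start:end]
    return None                         # no self-id found in the message
```

Note the heuristic never matches on actual names, mirroring the finding that name features over-fit the data.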
• Performance
– .71 F on human-labeled
– .70 F on ASR