Corpus Linguistics and Machine Learning Techniques CS 4705



Review

• What have we covered so far?

– Words (stems and affixes, roots and templates, …)

– POS (e.g. nouns, verbs, adverbs, adjectives, determiners, articles, …)

– Named Entities (e.g. person names)

– Ngrams (simple word sequences)

– Syntactic Constituents (NPs, VPs, Ss, …)

What useful things can we do with only this knowledge?

• Find sentence boundaries, abbreviations

• Find Named Entities (person names, company names, telephone numbers, addresses,…)

• Find topic boundaries and classify articles into topics

• Identify a document’s author and their opinion on the topic, pro or con

• Answer simple questions (factoids)

• Do simple summarization/compression

But first, we need corpora…

• Online collections of text and speech

• Some examples

– Brown Corpus

– Wall Street Journal and AP News

– ATIS, Broadcast News

– TDT

– Switchboard, Call Home

– TRAINS, FM Radio, BDC Corpus

– Hansards’ parallel corpus of French and English

– And many private research collections

Next, we pose a question…the dependent variable

• Binary questions:

– Is this word followed by a sentence boundary or not?

– A topic boundary?

– Does this word begin a person name? End one?

– Should this word or sentence be included in a summary?

• Other classification:

– Is this document about medical issues? Politics? Religion? Sports? …

• Predicting continuous variables:

– How loud or high should this utterance be produced?

Finding a suitable corpus and preparing it for analysis

• Which corpora can answer my question?

– Do I need to get them labeled to do so?

• Dividing the corpus into training and test corpora

– To develop a model, we need a training corpus

• overly narrow corpus: doesn’t generalize

• overly general corpus: doesn't reflect the task or domain

– To demonstrate how general our model is, we need a test corpus to evaluate the model

• Development test set vs. held-out test set

– To evaluate our model we must choose an evaluation metric

• Accuracy

• Precision, recall, F-measure,…

• Cross validation
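
A minimal sketch of this methodology, assuming scikit-learn and toy data (the features, labels, and classifier here are stand-ins, not anything from the slides): hold out a test set, report precision/recall/F, then cross-validate.

```python
# Sketch of the evaluation methodology above (toy data, hypothetical task
# "is this word followed by a sentence boundary?").
import numpy as np
from sklearn.model_selection import train_test_split, cross_val_score
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import precision_recall_fscore_support

rng = np.random.default_rng(0)
X = rng.random((100, 4))          # 100 examples, 4 features (toy values)
y = rng.integers(0, 2, 100)       # binary dependent variable

# Hold out a test set; a development split of the training data can be used for tuning.
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=0)

clf = DecisionTreeClassifier().fit(X_train, y_train)

# Precision, recall, and F-measure (F = 2PR / (P + R)) on the held-out test set.
p, r, f, _ = precision_recall_fscore_support(y_test, clf.predict(X_test), average="binary")
print(f"P={p:.2f} R={r:.2f} F={f:.2f}")

# 5-fold cross-validation over the whole corpus as an alternative estimate.
print(cross_val_score(DecisionTreeClassifier(), X, y, cv=5, scoring="f1"))
```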

Then we build the model…

• Again, identify the dependent variable: what do we want to predict or classify?

– Does this word begin a person name? Is this word within a person name?

• Identify the independent variables: what features might help to predict the dependent variable?

– What is this word’s POS? What is the POS of the word before it? After it?

– Is this word capitalized? Is it followed by a ‘.’?

– How far is this word from the beginning of its sentence?

• Extract the values of each variable from the corpus by some automatic means

A Sample Feature Vector for Sentence Ending Detection

WordID   POS    Cap?   ,After?   Dist/Sbeg   End?
Clinton  N      y      n         1           n
won      V      n      n         2           n
easily   Adv    n      y         3           n
but      Conj   n      n         4           n
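
As a sketch of the "extract the values of each variable by some automatic means" step, the hypothetical helper below recomputes the table's feature columns from a tokenized sentence (POS tags are supplied by hand here; in practice they would come from a tagger):

```python
# Sketch only: extract_features is a made-up helper, not from the slides.
def extract_features(tokens, tags):
    rows, dist = [], 0
    for i, (word, tag) in enumerate(zip(tokens, tags)):
        if word in {",", ".", "?", "!"}:   # punctuation feeds features, not rows
            continue
        dist += 1
        rows.append({
            "WordID": word,
            "POS": tag,
            "Cap?": "y" if word[0].isupper() else "n",
            ",After?": "y" if i + 1 < len(tokens) and tokens[i + 1] == "," else "n",
            "Dist/Sbeg": dist,   # distance from the beginning of the sentence
            "End?": "n",         # gold label: none of these words end the sentence
        })
    return rows

# "Clinton won easily, but ..." reproduces the table above.
for row in extract_features(["Clinton", "won", "easily", ",", "but"],
                            ["N", "V", "Adv", ",", "Conj"]):
    print(row)
```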

An Example: Finding Caller Names in Voicemail

SCANMail

• Motivated by interviews, surveys and usage logs of heavy users:

– Hard to scan new msgs to find those you need to deal with quickly

– Hard to find msg you want in archive

– Hard to locate information you want in any msg

• How could we help?

SCANMail Architecture

[Architecture diagram: Caller → SCANMail → Subscriber]

Corpus Collection

• Recordings collected from 138 AT&T Labs employees’ mailboxes

• 100 hours; 10K msgs; 2500 speakers

• Gender balanced; 12% non-native speakers

• Mean message duration 36.4 secs, median 30.0 secs

• Hand-transcribed and annotated with caller id, gender, age, entity demarcation (names, dates, telnos)

• Also recognized using ASR engine

Transcription and Bracketing

[ Greeting: hi R ] [ CallerID: it's me ] give me a call [ um ] right away cos there's [ .hn ] I guess there's some [ .hn ] change [ Date: tomorrow ] with the nursery school and they [ um ] [ .hn ] anyway they had this idea [ cos ] since I think J's the only one staying [ Date: tomorrow ] for play club so they wanted to they suggested that [ .hn ] well J2 actually offered to take J home with her and then would she would meet you back at the synagogue at [ Time: five thirty ] to pick her up [ .hn ] [ uh ] so I don't know how you feel about that otherwise M_ and one other teacher would stay and take care of her till [ Date: five thirty tomorrow ] but if you [ .hn ] I wanted to know how you feel before I tell her one way or the other so call me [ .hn ] right away cos I have to get back to her in about an hour so [ .hn ] okay [ Closing: bye ] [ .nhn ] [ .onhk ]
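
A small sketch (not from the slides) of recovering the labeled spans from this bracketing format with a regular expression; unlabeled noise markers like [ um ] and [ .hn ] carry no "Label:" prefix and are simply skipped:

```python
import re

# Matches "[ Label: text ]"; brackets without a colon-terminated label do not match.
SPAN = re.compile(r"\[\s*([A-Za-z]+):\s*([^\]]*?)\s*\]")

msg = "[ Greeting: hi R ] [ CallerID: it's me ] give me a call [ um ] [ Date: tomorrow ]"
for label, text in SPAN.findall(msg):
    print(label, "->", text)
# Greeting -> hi R
# CallerID -> it's me
# Date -> tomorrow
```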

SCANMail Demo: http://www.avatarweb.com/scanmail/

Audix extension: demo

Audix password: (null)

Information Extraction (Martin Jansche and Steve Abney)

• Goals: extract key information from msgs to present in headers

• Approach:

– Supervised learning from transcripts (phone #'s, caller self-ids)

– Combine Machine Learning techniques with simpler alternatives, e.g. hand-crafted rules

– Two-stage approaches

– Features exploit structure of key elements (e.g. length of phone numbers) and of surrounding context (e.g. self-ids tend to occur at beginning of msg)

Telephone Number Identification

• Rules convert all numbers to standard digit format

• Predict start of phone number with rules

– This step over-generates

– Prune with decision-tree classifier

• Best features:

– Position in msg

– Lexical cues

– Length of digit string

• Performance:

– .94 F on human-labeled transcripts

– .95 F on ASR transcripts
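
A sketch of the two-stage idea above: rules over-generate candidate phone-number starts, and a decision tree prunes them using the best features listed. The rules, cue words, and feature functions here are simplified stand-ins, not the actual SCANMail components.

```python
import re
from sklearn.tree import DecisionTreeClassifier

CUES = {"number", "call", "reach", "extension"}   # hypothetical lexical cues

def candidate_starts(tokens):
    """Stage 1 (over-generates): every digit token could start a number."""
    return [i for i, t in enumerate(tokens) if re.fullmatch(r"\d+", t)]

def prune_features(tokens, i):
    """Stage 2 features: position in msg, nearby lexical cue, digit-run length."""
    run = 0
    while i + run < len(tokens) and tokens[i + run].isdigit():
        run += 1
    cue = any(t.lower() in CUES for t in tokens[max(0, i - 3):i])
    return [i / len(tokens), int(cue), run]

toks = "you can reach me at 9 7 3 5 5 5 0 1 2 3 thanks".split()
cands = candidate_starts(toks)
X = [prune_features(toks, i) for i in cands]
y = [int(i == 5) for i in cands]          # toy labels: only index 5 truly starts one
clf = DecisionTreeClassifier().fit(X, y)  # prune candidates with the tree
print([i for i, keep in zip(cands, clf.predict(X)) if keep])
```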

Caller Self-Identifications

• Predict start of id with classifier

– 97% of id’s begin 1-7 words into msg

• Then predict length of phrase

– Majority are only 2-4 words long

• Avoid risk of relying on correct speech recognition for names

• Best cues to end of phrase are a few common words

– ‘I’, ‘could’, ‘please’

– No actual names: they over-fit the data

• Performance

– .71 F on human-labeled

– .70 F on ASR
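
A toy sketch of the two predictions above, with made-up feature functions (start_features and length_features are hypothetical, not SCANMail code); it encodes the offset and length regularities and the common-word end cues from the slides:

```python
END_CUES = {"i", "could", "please"}   # common words signaling the end of a self-id

def start_features(tokens, i):
    # 97% of ids begin 1-7 words into the msg, so offset is the key feature.
    return [i, int(1 <= i <= 7)]

def length_features(tokens, i, k):
    # Most ids are 2-4 words; an end cue right after the span is strong evidence.
    nxt = tokens[i + k].lower() if i + k < len(tokens) else ""
    return [k, int(2 <= k <= 4), int(nxt in END_CUES)]

toks = "hi it's jane from accounting could you call me back".split()
print(start_features(toks, 1))        # candidate start at word 1: "it's"
print(length_features(toks, 1, 4))    # 4-word span followed by cue word "could"
```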

Introduction to Weka

Download
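
Weka reads datasets in its plain-text ARFF format. A minimal sketch (the relation, attribute, and file names are my own) writing the sentence-ending feature vectors from the earlier table as an ARFF file Weka can load:

```python
# Write the four feature vectors from the sentence-ending table as ARFF.
ARFF = """\
@relation sentence-ending
@attribute pos {N,V,Adv,Conj}
@attribute cap {y,n}
@attribute comma_after {y,n}
@attribute dist_sbeg numeric
@attribute end {y,n}
@data
N,y,n,1,n
V,n,n,2,n
Adv,n,y,3,n
Conj,n,n,4,n
"""
with open("sentence_ending.arff", "w") as f:
    f.write(ARFF)
```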