CS 4705: Corpus Linguistics and Machine Learning Techniques

Review
• What have we covered so far?
  – Words (stems and affixes, roots and templates, …)
  – Ngrams (simple word sequences)
  – POS (e.g. nouns, verbs, adverbs, adjectives, determiners, articles, …)

Some Additional Things We Could Find
• Named Entities
  – Persons
  – Company Names
  – Locations
  – Dates

What useful things can we do with this knowledge?
• Find sentence boundaries and abbreviations
• Find Named Entities (person names, company names, telephone numbers, addresses, …)
• Find topic boundaries and classify articles into topics
• Identify a document's author and the author's opinion on the topic, pro or con
• Answer simple (factoid) questions
• Do simple summarization/compression

But first, we need corpora…
• Online collections of text and speech
• Some examples:
  – Brown Corpus
  – Wall Street Journal and AP News
  – ATIS, Broadcast News
  – TDT
  – Switchboard, CallHome
  – TRAINS, FM Radio, BDC Corpus
  – Hansard parallel corpus of French and English
  – And many private research collections

Next, we pose a question: the dependent variable
• Binary questions:
  – Is this word followed by a sentence boundary or not?
  – A topic boundary?
  – Does this word begin a person name? End one?
  – Should this word or sentence be included in a summary?
• Classification:
  – Is this document about medical issues? Politics? Religion? Sports? …
• Predicting continuous variables:
  – How loud or how high should this utterance be produced?

Finding a suitable corpus and preparing it for analysis
• Which corpora can answer my question?
  – Do I need to get them labeled to do so?
• Dividing the corpus into training and test corpora
  – To develop a model, we need a training corpus
    • An overly narrow corpus doesn't generalize
    • An overly general corpus doesn't reflect the task or domain
  – To demonstrate how general our model is, we need a test corpus to evaluate the model
    • Development test set vs. held-out test set
  – To evaluate our model, we must choose an evaluation metric
    • Accuracy
    • Precision, recall, F-measure, …
    • Cross-validation

Then we build the model…
• Identify the dependent variable: what do we want to predict or classify?
  – Does this word begin a person name? Is this word within a person name?
  – Is this document about sports? The weather? International news?
• Identify the independent variables: what features might help to predict the dependent variable?
  – What is this word's POS? What is the POS of the word before it? After it?
  – Is this word capitalized? Is it followed by a '.'?
  – Does 'hockey' appear in this document?
  – How far is this word from the beginning of its sentence?
• Extract the values of each variable from the corpus by some automatic means

A Sample Feature Vector for Sentence-Ending Detection

  WordID    POS    Cap?   ',' After?   Dist/Sbeg   End?
  Clinton   N      y      n            1           n
  won       V      n      n            2           n
  easily    Adv    n      y            3           n
  but       Conj   n      n            4           n
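To make the table concrete, here is a minimal sketch of how such feature vectors might be extracted automatically. The Token and FeatureVector types and the labeling heuristic are illustrative assumptions, not part of the course's actual code; a real pipeline would take POS tags from a tagger and the End? labels from the annotated corpus.

```java
import java.util.ArrayList;
import java.util.List;

/**
 * Sketch of extracting the feature vector from the slide above:
 * word identity, POS, capitalization, following comma, and distance
 * from the sentence start (Dist/Sbeg).
 */
public class FeatureExtractor {

    /** A pre-tagged token; in practice this comes from a POS tagger. */
    record Token(String word, String pos) {}

    /** One row of the feature table, plus the label to predict. */
    record FeatureVector(String word, String pos, boolean capitalized,
                         boolean commaAfter, int distFromSentenceStart,
                         boolean sentenceEnd) {}

    static List<FeatureVector> extract(List<Token> tokens) {
        List<FeatureVector> rows = new ArrayList<>();
        int dist = 1; // 1-based distance from the sentence beginning
        for (int i = 0; i < tokens.size(); i++) {
            Token t = tokens.get(i);
            boolean commaAfter = i + 1 < tokens.size()
                    && tokens.get(i + 1).word().equals(",");
            // Toy labeling heuristic for illustration only: in the real
            // setting the label comes from hand annotation.
            boolean end = i + 1 < tokens.size()
                    && tokens.get(i + 1).word().equals(".");
            if (!t.word().equals(",") && !t.word().equals(".")) {
                rows.add(new FeatureVector(t.word(), t.pos(),
                        Character.isUpperCase(t.word().charAt(0)),
                        commaAfter, dist++, end));
            }
            if (end) dist = 1; // reset the counter at a sentence boundary
        }
        return rows;
    }
}
```

Run on the tokens "Clinton won easily , but …", this produces exactly the four rows of the table above.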
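The evaluation-metric slide above lists precision, recall, and F-measure, and they recur in the SCANMail results below, so as a quick reference here is how they are computed from a classifier's decision counts. The counts themselves are made up for illustration.

```java
/** Precision, recall, and F-measure from raw decision counts. */
public class Metrics {
    public static void main(String[] args) {
        // Hypothetical counts for a sentence-boundary detector:
        int tp = 90;  // boundaries correctly detected
        int fp = 10;  // non-boundaries labeled as boundaries
        int fn = 30;  // boundaries the system missed

        double precision = (double) tp / (tp + fp);   // 0.90
        double recall    = (double) tp / (tp + fn);   // 0.75
        // F-measure: harmonic mean of precision and recall (beta = 1)
        double f1 = 2 * precision * recall / (precision + recall);

        System.out.printf("P=%.2f R=%.2f F=%.2f%n", precision, recall, f1);
    }
}
```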
An Example: Finding Caller Names in Voicemail (SCANMail)
• Motivated by interviews, surveys, and usage logs of heavy voicemail users:
  – Hard to scan new msgs to find the ones you need to deal with quickly
  – Hard to find the msg you want in the archive
  – Hard to locate the information you want within any msg
• How could we help?

SCANMail Architecture
(architecture diagram: caller → SCANMail → subscriber)

Corpus Collection
• Recordings collected from 138 AT&T Labs employees' mailboxes
• 100 hours; 10K msgs; 2,500 speakers
• Gender balanced; 12% non-native speakers
• Mean message duration 36.4 secs, median 30.0 secs
• Hand-transcribed and annotated with caller ID, gender, age, and entity demarcation (names, dates, telephone numbers)
• Also recognized using an ASR engine

Transcription and Bracketing
[ Greeting: hi R ] [ CallerID: it's me ] give me a call [ um ] right away cos there's [ .hn ] I guess there's some [ .hn ] change [ Date: tomorrow ] with the nursery school and they [ um ] [ .hn ] anyway they had this idea [ cos ] since I think J's the only one staying [ Date: tomorrow ] for play club so they wanted to they suggested that [ .hn ] well J2 actually offered to take J home with her and then would she would meet you back at the synagogue at [ Time: five thirty ] to pick her up [ .hn ] [ uh ] so I don't know how you feel about that otherwise M_ and one other teacher would stay and take care of her till [ Date: five thirty tomorrow ] but if you [ .hn ] I wanted to know how you feel before I tell her one way or the other so call me [ .hn ] right away cos I have to get back to her in about an hour so [ .hn ] okay [ Closing: bye ] [ .nhn ] [ .onhk ]

SCANMail Demo
• http://www.avatarweb.com/scanmail/
• Audix extension: demo
• Audix password: (null)

Information Extraction (Martin Jansche and Steve Abney)
• Goals: extract key information from msgs to present in headers
• Approach:
  – Supervised learning from transcripts (phone #'s, caller self-ids)
  – Combine machine learning techniques with simpler alternatives, e.g. hand-crafted rules
  – Two-stage approaches
  – Features exploit the structure of key elements (e.g. the length of phone numbers) and of the surrounding context (e.g. self-ids tend to occur at the beginning of a msg)

Telephone Number Identification
• Rules convert all numbers to a standard digit format
• Predict the start of a phone number with rules
  – This step over-generates
  – Prune the candidates with a decision-tree classifier
• Best features:
  – Position in the msg
  – Lexical cues
  – Length of the digit string
• Performance:
  – .94 F on human-labeled transcripts
  – .95 F on ASR transcripts

Caller Self-Identifications
• Predict the start of the self-id with a classifier
  – 97% of self-ids begin 1–7 words into the msg
• Then predict the length of the phrase
  – The majority are only 2–4 words long
• This avoids relying on correct speech recognition of names
• The best cues to the end of the phrase are a few common words
  – 'I', 'could', 'please'
  – No actual names: they over-fit the data
• Performance:
  – .71 F on human-labeled transcripts
  – .70 F on ASR transcripts

Introduction to Weka
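As a preview of what Weka looks like in practice, the sketch below loads a hypothetical ARFF file of feature vectors like those built earlier, trains Weka's J48 decision-tree learner (the same kind of classifier used above to prune phone-number candidates), and reports 10-fold cross-validation results. The file name and attribute layout are assumptions for illustration.

```java
import java.util.Random;

import weka.classifiers.Evaluation;
import weka.classifiers.trees.J48;
import weka.core.Instances;
import weka.core.converters.ConverterUtils.DataSource;

/**
 * Trains a J48 decision tree on a hypothetical ARFF file of feature
 * vectors and evaluates it with 10-fold cross-validation, following
 * the training/test methodology discussed above.
 */
public class WekaDemo {
    public static void main(String[] args) throws Exception {
        Instances data = new DataSource("boundaries.arff").getDataSet();
        data.setClassIndex(data.numAttributes() - 1); // last column = End?

        J48 tree = new J48(); // Weka's C4.5-style decision-tree learner
        Evaluation eval = new Evaluation(data);
        eval.crossValidateModel(tree, data, 10, new Random(1));

        System.out.println(eval.toSummaryString());
        // Per-class precision, recall, and F-measure (class index 0):
        System.out.printf("P=%.2f R=%.2f F=%.2f%n",
                eval.precision(0), eval.recall(0), eval.fMeasure(0));
    }
}
```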