Robust Semantics, Information Extraction, and Information Retrieval
CS 4705

Problems with Syntax-Driven Semantics
• Syntactic structures often don't fit semantic structures very well
 – Important semantic elements are often distributed very differently in the trees of sentences that mean 'the same'
   I like soup. Soup is what I like.
 – Parse trees contain many structural elements not clearly important to making semantic distinctions
 – Syntax-driven semantic representations are sometimes pretty verbose
   V --> serves {λx λy ∃e Isa(e,Serving) ∧ Server(e,y) ∧ Served(e,x)}

Alternatives?
• Semantic Grammars
• Information Extraction
• Information Retrieval

Semantic Grammars
• An alternative to modifying syntactic grammars to deal with semantics too
• Define grammars specifically in terms of the semantic information we want to extract
 – Domain specific: rules correspond directly to entities and activities in the domain
   I want to go from Boston to Baltimore on Thursday, September 24th
 – Greeting --> {Hello | Hi | Um…}
 – TripRequest --> Need-spec travel-verb from City to City on Date

Predicting User Input
• Semantic grammars rely upon knowledge of the task and (sometimes) constraints on what the user can do, and when
 – This allows them to handle very sophisticated phenomena
   I want to go to Boston on Thursday. I want to leave from there on Friday for Baltimore.
   TripRequest --> Need-spec travel-verb from City on Date for City
   A dialogue postulate maps the filler for 'from-city' to the prespecified from-city

Priming User Input
• Users will tend to use the vocabulary they hear from the system
• Explicit training vs. implicit training
• Training the user vs. retraining the system

Drawbacks of Semantic Grammars
• Lack of generality
 – A new one for each application
 – Large cost in development time
• Can be very large, depending on how much coverage you want
• If users go outside the grammar, things may break disastrously
   I want to leave from my house.
   I want to talk to someone human.

Some examples
• Semantic grammars

Information Extraction
• Another 'robust' alternative
• Idea: 'extract' particular types of information from arbitrary text or transcribed speech
• Examples:
 – Named entities: people, places, organizations, times, dates
   <Organization>MIPS</Organization> Vice President <Person>John Hime</Person>
 – MUC evaluations
• Domains: medical texts, broadcast news (terrorist reports), voicemail, …

Appropriate Where Semantic Grammars and Syntactic Parsers Are Not
• Appropriate where information needs are very specific and specifiable in advance
 – Question answering systems, gisting of news or mail, …
 – Job ads, financial information, terrorist attacks
• Input is too complex and far-ranging to build semantic grammars
• But full-blown syntactic parsers are impractical
 – Too much ambiguity for arbitrary text: 50 parses or none at all
 – Too slow for real-time applications

Information Extraction Techniques
• Often use a set of simple templates or frames with slots to be filled in from the input text
 – Ignore everything else
   My number is 212-555-1212.
   The inventor of the wiggleswort was Capt. John T. Hart.
   The king died in March of 1932.
• Context (neighboring words, capitalization, punctuation) provides cues to help fill in the appropriate slots
• How to do better than everyone else?
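A minimal sketch (not from the lecture) of template-style slot filling with regular expressions, in the spirit of the examples above. The patterns and names (PHONE, DATE, fill_slots) are hypothetical and cover only the two toy sentences; real systems use many more contextual cues.

import re

# Illustrative patterns for two slots; everything else in the text is ignored.
PHONE = re.compile(r'\b(\d{3})[-. ](\d{3})[-. ](\d{4})\b')
DATE = re.compile(r'\b(January|February|March|April|May|June|July|August|'
                  r'September|October|November|December)\s+of\s+(\d{4})\b')

def fill_slots(text):
    """Return a dict of slot fillers found in the text; ignore everything else."""
    slots = {}
    m = PHONE.search(text)
    if m:
        slots['phone_number'] = '-'.join(m.groups())
    m = DATE.search(text)
    if m:
        slots['date'] = m.group(1) + ' ' + m.group(2)
    return slots

print(fill_slots('My number is 212-555-1212.'))       # {'phone_number': '212-555-1212'}
print(fill_slots('The king died in March of 1932.'))  # {'date': 'March 1932'}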
The IE Process
• Given a corpus and a target set of items to be extracted:
 – Clean up the corpus
 – Tokenize it
 – Do some hand labeling of target items
 – Extract some simple features
   • POS tags
   • Phrase chunks
   • …
 – Do some machine learning to associate features with target items, or derive this association by intuition
 – Use e.g. FSTs, simple or cascaded, to iteratively annotate the input, eventually identifying the slot fillers

IE in SCANMail: Audio Browsing and Retrieval for Voicemail
• Motivated by interviews, surveys and usage logs identifying problems of heavy voicemail users:
 – It's hard to quickly scan through new messages to find the ones you need to deal with (e.g. during a meeting break)
 – It's hard to find the message you want in your archive
 – It's hard to locate the information you want in any message (e.g. the telephone number, caller name)

SCANMail Architecture
[Architecture diagram: Caller → SCANMail → Subscriber]

Corpus Details
• Recordings collected from 138 voicemail boxes of AT&T Labs employees
• 100 hours; 10,000 messages; 2,500 speakers
• Gender balanced; 12% non-native speakers
• Mean message duration 36.4 secs, median 30.0 secs
• Hand-transcribed and annotated with caller id, gender, age, and entity demarcation (names, dates, telephone numbers)

Transcription (sample annotated message)
gender F
age A
caller_name NA
native_speaker N
speech_pathology N
sample_rate 8000
label 0 804672 "[ Greeting: hi R__ ] [ CallerID: it's me ] give me a call [ um ] right away cos there's [ .hn ] I guess there's some [ .hn ] change [ Date: tomorrow ] with the nursery school and they [ um ] [ .hn ] anyway they had this idea [ cos ] since I think J__'s the only one staying [ Date: tomorrow ] for play club so they wanted to they suggested that [ .hn ] well J2__ actually offered to take J__ home with her and then would she would meet you back at the synagogue at [ Time: five thirty ] to pick her up [ .hn ] [ uh ] so I don't know how you feel about that otherwise Miriam and one other teacher would stay and take care of her till [ Date: five thirty tomorrow ] but if you [ .hn ] I wanted to know how you feel before I tell her one way or the other so call me [ .hn ] right away cos I have to get back to her in about an hour so [ .hn ] okay [ Closing: bye [ .nhn ] [ .onhk ] ]"
duration "50.3 seconds"

Demo
• SCANMail demo: http://www.fancentral.org/~isenhour/scanmail/demo.html
 – Audix extension: 8380
 – Audix password: (null)

SCANMail Demo: Number Extraction

SCANMail Access Devices
• PC
• Pocket PC
• Dataphone
• Voice Phone
• Flash E-mail

Finding Phone Numbers and Caller IDs (Jansche & Abney '02)
• Goals: extract key information from messages to present in headers, working from ASR transcripts
• Approach:
 – Supervised learning from transcripts (phone #'s, caller self-ids)
   • Hand-crafted rules (good recall) propose candidates
   • A statistical classifier (decision tree) prunes bad candidates
 – Features exploit the structure of the key elements (e.g. the length of phone numbers) and the surrounding context (e.g. self-ids occur at the beginning of the msg)
• Location is key
 – Predict 1 = phrase begins, 2 = inside phrase, 3 = neither
• Phone numbers:
 – Rules convert candidates to a standard digit format
 – Predict the start with rules and prune with the classifier
 – Features: position in the msg and lexical cues, plus the length of the digit string (.94 F on human-labeled transcripts; .95 F on ASR)
• Self-ids:
 – Predict the start (97% begin 1-7 words into the msg) and then the length of the phrase (majority 2-4 words)
 – Avoids the risk of relying on correct recognition of names
 – Good lexical cues to the end of the phrase ('I', 'could', 'please') (.71 F on human-labeled transcripts; .70 F on ASR)
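A minimal sketch (not the actual SCANMail code) of the propose-and-prune idea above: a high-recall rule proposes every run of digit tokens as a candidate phone number, and a decision tree prunes candidates using simple position, length, and lexical-cue features. The rule, the features, the toy training data, and the use of scikit-learn's DecisionTreeClassifier are all illustrative assumptions.

from sklearn.tree import DecisionTreeClassifier

def propose_candidates(tokens):
    """High-recall rule: every maximal run of digit tokens is a candidate."""
    cands, i = [], 0
    while i < len(tokens):
        if tokens[i].isdigit():
            j = i
            while j < len(tokens) and tokens[j].isdigit():
                j += 1
            cands.append((i, j))          # candidate span [i, j)
            i = j
        else:
            i += 1
    return cands

def features(tokens, span):
    start, end = span
    digits = sum(len(t) for t in tokens[start:end])
    return [
        start / max(len(tokens), 1),                        # relative position in the message
        digits,                                             # length of the digit string
        int('number' in tokens[max(0, start - 3):start]),   # lexical cue just before the span
    ]

# Toy labeled candidates (feature vectors); 1 = real phone number, 0 = not.
X_train = [[0.5, 10, 1], [0.9, 10, 1], [0.6, 2, 0], [0.1, 4, 0]]
y_train = [1, 1, 0, 0]
clf = DecisionTreeClassifier(random_state=0).fit(X_train, y_train)

msg = "hi it's me my number is 212 555 1212 call me back".split()
for span in propose_candidates(msg):
    if clf.predict([features(msg, span)])[0] == 1:
        print('phone number:', ' '.join(msg[span[0]:span[1]]))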
Information Retrieval
• How is it related to NLP?
 – It operates on language (speech or text)
 – Does it use linguistic information?
   • Stemming
   • Bag-of-words approach
   • Very simple analyses
 – Does it make use of document formatting?
   • Headlines, punctuation, captions
• Collection: a set of documents
• Term: a word or phrase
• Query: a set of terms

But… what is a term?
• Stop list
• Stemming
• Homonymy, polysemy, synonymy

Vector Space Model
• Simple versions represent documents and queries as feature vectors, with one binary feature for each term in the collection
 – Is a term t in this document or in this query or not?
   D = (t1, t2, …, tn)
   Q = (t1, t2, …, tn)
• Similarity metric: how many terms does a query share with each candidate document?
• Weighted terms: term-by-document matrix
   D = (wt1, wt2, …, wtn)
   Q = (wt1, wt2, …, wtn)
• How do we compare the vectors?
 – Normalize each term weight by the number of terms in the document: how important is each t in D?
 – Compute the dot product between the vectors to see how similar they are
 – Cosine of the angle between them: 1 = identical; 0 = no common terms
• How do we get the weights?
 – Term frequency (tf): how often does t occur in D?
 – Inverse document frequency (idf): # docs / # docs term t occurs in
   idf_i = log(N / n_i)
 – tf·idf weighting: the weight of term i in doc j is the product of the frequency of i in j and the idf of i in the collection
   w_i,j = tf_i,j × idf_i
 – (A toy tf·idf / cosine / P-R-F sketch appears after the Summary slide)

Evaluating IR Performance
• Precision: # relevant docs returned / total # docs returned -- how often are you right when you say this document is relevant?
• Recall: # relevant docs returned / # relevant docs in the collection -- how many of the relevant documents do you find?
• F-measure combines P and R:
   F = 2PR / (P + R)
• Are P and R equally important?

Improving Queries
• Relevance feedback: users rate the retrieved docs
• Query expansion: many techniques
 – e.g. add the top N docs retrieved to the query and resubmit the expanded query
• Term clustering: cluster rows of terms to produce synonyms and add them to the query

IR Tasks
• Ad hoc retrieval: 'normal' IR
• Routing/categorization: assign a new doc to one of a predefined set of categories
• Clustering: divide a collection into N clusters
• Segmentation: segment a text into coherent chunks
• Summarization: compress a text by extracting summary items
• Question-answering: find a stretch of text containing the answer to a question

Combining IR and IE for QA
• Information extraction

Summary
• Many approaches to 'robust' semantic analysis
 – Semantic grammars targeting particular domains
   Utterance --> Yes/No Reply
   Yes/No Reply --> Yes-Reply | No-Reply
   Yes-Reply --> {yes, yeah, right, ok, "you bet", …}
 – Information extraction techniques targeting specific tasks
   • Extracting information about terrorist events from news
 – Information retrieval techniques --> more like NLP
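A toy sketch (not from the lecture) of the vector-space material above: tf·idf weights with idf_i = log(N / n_i), cosine similarity for ranking, and precision / recall / F-measure for evaluation. The documents, the query, and the set of 'relevant' documents are all invented for illustration.

import math
from collections import Counter

docs = {
    'd1': 'the cat sat on the mat',
    'd2': 'the dog chased the cat',
    'd3': 'stocks fell sharply on monday',
}
query = 'cat on mat'

def tokenize(text):
    return text.lower().split()

N = len(docs)
df = Counter(t for text in docs.values() for t in set(tokenize(text)))
idf = {t: math.log(N / n) for t, n in df.items()}          # idf_i = log(N / n_i)

def tfidf(text):
    tf = Counter(tokenize(text))
    return {t: f * idf.get(t, 0.0) for t, f in tf.items()}  # w_i,j = tf_i,j * idf_i

def cosine(u, v):
    dot = sum(w * v.get(t, 0.0) for t, w in u.items())
    norm = math.sqrt(sum(w * w for w in u.values())) * math.sqrt(sum(w * w for w in v.values()))
    return dot / norm if norm else 0.0

qvec = tfidf(query)
ranked = sorted(docs, key=lambda d: cosine(qvec, tfidf(docs[d])), reverse=True)
print('ranking:', ranked)                  # ['d1', 'd2', 'd3']

# Evaluate the top-2 returned documents against the made-up relevance set.
relevant, returned = {'d1'}, set(ranked[:2])
P = len(relevant & returned) / len(returned)
R = len(relevant & returned) / len(relevant)
F = 2 * P * R / (P + R) if P + R else 0.0
print('P=%.2f R=%.2f F=%.2f' % (P, R, F))  # P=0.50 R=1.00 F=0.67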