Robust Semantics, Information Extraction, and Information Retrieval CS 4705

Problems with Syntax-Driven Semantics
• Syntactic structures often don’t fit semantic
structures very well
– Important semantic elements often distributed very
differently in trees for sentences that mean ‘the same’
I like soup. Soup is what I like.
– Parse trees contain many structural elements not clearly
important to making semantic distinctions
– Syntax-driven semantic representations are sometimes pretty verbose
V --> serves
{λx λy ∃e Isa(e, Serving) ∧ Server(e, y) ∧ Served(e, x)}
Alternatives?
• Semantic Grammars
• Information Extraction
• Information Retrieval
Semantic Grammars
• Alternative to modifying syntactic grammars to
deal with semantics too
• Define grammars specifically in terms of the
semantic information we want to extract
– Domain specific: Rules correspond directly to entities
and activities in the domain
I want to go from Boston to Baltimore on Thursday,
September 24th
– Greeting --> {Hello|Hi|Um…}
– TripRequest --> Need-spec travel-verb from City to City on Date
Predicting User Input
• Semantic grammars rely upon knowledge of the
task and (sometimes) constraints on what the user
can do, when
– Allows them to handle very sophisticated phenomena
I want to go to Boston on Thursday.
I want to leave from there on Friday for Baltimore.
TripRequest --> Need-spec travel-verb from City on Date for City
Dialogue postulate maps filler for ‘from-city’ to prespecified from-city
Priming User Input
• Users will tend to use the vocabulary they hear
from the system
• Explicit training vs. implicit training
• Training the user vs. retraining the system
Drawbacks of Semantic Grammars
• Lack of generality
– A new one for each application
– Large cost in development time
• Can be very large, depending on how much
coverage you want
• If users go outside the grammar, things may break
disastrously
I want to leave from my house.
I want to talk to someone human.
Some examples
• Semantic grammars
Information Extraction
• Another ‘robust’ alternative
• Idea: ‘extract’ particular types of information from
arbitrary text or transcribed speech
• Examples:
– Named entities: people, places, organizations, times,
dates
• <Organization>MIPS</Organization> Vice President <Person>John Hime</Person>
– MUC evaluations
• Domains: Medical texts, broadcast news (terrorist
reports), voicemail,...
Appropriate where Semantic Grammars and
Syntactic Parsers are Not
• Appropriate where information needs are very specific and specifiable in advance
– Question answering systems, gisting of news or mail…
– Job ads, financial information, terrorist attacks
• Input too complex and far-ranging to build
semantic grammars
• But full-blown syntactic parsers are impractical
– Too much ambiguity for arbitrary text
– 50 parses or none at all
– Too slow for real-time applications
Information Extraction Techniques
• Often use a set of simple templates or frames with
slots to be filled in from input text
– Ignore everything else
– My number is 212-555-1212.
– The inventor of the wiggleswort was Capt. John T.
Hart.
– The king died in March of 1932.
• Context (neighboring words, capitalization,
punctuation) provides cues to help fill in the
appropriate slots
• How to do better than everyone else?
The IE Process
• Given a corpus and a target set of items to be
extracted:
– Clean up the corpus
– Tokenize it
– Do some hand labeling of target items
– Extract some simple features
• POS tags
• Phrase Chunks …
– Do some machine learning to associate features with target items, or derive this association by intuition
– Use e.g. FSTs, simple or cascaded to iteratively
annotate the input, eventually identifying the slot fillers
IE in SCANMail: Audio Browsing and
Retrieval for Voicemail
• Motivated by interviews, surveys and usage logs
identifying problems of heavy voicemail users:
– It’s hard to quickly scan through new messages to find
the ones you need to deal with (e.g. during a meeting
break)
– It’s hard to find the message you want in your archive
– It’s hard to locate the information you want in any
message (e.g. the telephone number, caller name)
SCANMail Architecture
[Architecture diagram: Caller, SCANMail, Subscriber]
Corpus Details
• Recordings collected from 138 voicemail
boxes of AT&T Labs employees
• 100 hours; 10,000 messages; 2500 speakers
• Gender balanced; 12% non-native speakers
• Mean message duration 36.4 secs, median
30.0 secs
• Hand-transcribed and annotated with caller id,
gender, age, entity demarcation (names, dates,
telephone numbers)
Transcription
gender F
age A
caller_name NA
native_speaker N
speech_pathology N
sample_rate 8000
label 0 804672 " [ Greeting: hi R__ ] [ CallerID: it's me ] give me a call [ um ] right
away cos there's [ .hn ] I guess there's some [ .hn ] change [ Date: tomorrow ] with
the nursery school and they [ um ] [ .hn ] anyway they had this idea [ cos ] since I
think J__'s the only one staying [ Date: tomorrow ] for play club so they wanted to
they suggested that [ .hn ] well J2__ actually offered to take J__ home with her
and then would she would meet you back at the synagogue at [ Time: five thirty ] to
pick her up [ .hn ] [ uh ] so I don't know how you feel about that otherwise Miriam
and one other teacher would stay and take care of her till [ Date: five thirty
tomorrow ] but if you [ .hn ] I wanted to know how you feel before I tell her one
way or the other so call me [ .hn ] right away cos I have to get back to her in about
an hour so [ .hn ] okay [ Closing: bye [ .nhn ] [ .onhk ] ]"
duration "50.3 seconds"
Demo
SCANMail demo
Audix extension: 8380
Audix password: (null)
http://www.fancentral.org/~isenhour/scanmail/demo.html
SCANMail Demo: Number Extraction
SCANMail Access Devices
[Diagram: PC, Pocket PC, Dataphone, Voice Phone, Flash E-mail]
Finding Phone Numbers and Caller IDs
(Jansche & Abney ‘02)
• Goals: extract key information from ASR transcripts of msgs to present in message headers
• Approach:
– Supervised learning from transcripts (phone #’s, caller
self-ids)
• Hand-crafted rules (good recall) propose candidates
• Statistical classifier (decision tree) prunes bad
candidates
– Features exploit structure of key elements (e.g. length of phone numbers) and surrounding context (e.g. self-ids occur at beginning of msg)
• Location is key
• Predict 1 = phrase begin, 2 = in phrase, 3 = neither
• Phone numbers:
– Rules convert to standard digit format
– Predict start with rules and prune with classifier
– Position in msg and lexical cues plus length of digit string as
features (.94 F on human-labeled; .95 F on ASR)
• Self-ids:
– Predict start position (97% begin 1-7 words into msg) and then length of phrase (majority 2-4 words)
– Avoid risk of relying on correct recognition for names
– Good lexical cues to end of phrase (‘I’, ‘could’, ‘please’) (.71
F on human-labeled; .70 F on ASR)
Information Retrieval
• How related to NLP?
– Operates on language (speech or text)
– Does it use linguistic information?
• Stemming
• Bag-of-words approach
• Very simple analyses
– Does it make use of document formatting?
• Headlines, punctuation, captions
• Collection: a set of documents
• Term: a word or phrase
• Query: a set of terms
But…what is a term?
• Stop list
• Stemming
• Homonymy, polysemy, synonymy
Vector Space Model
• Simple versions represent documents and queries
as feature vectors, one binary feature for each term
in collection
• Is a term t in this document or in this query or not?
D = (t1,t2,…,tn)
Q = (t1,t2,…,tn)
• Similarity metric: how many terms does a query share with each candidate document?
• Weighted terms: term-by-document matrix
D = (wt1,wt2,…,wtn)
Q = (wt1,wt2,…,wtn)
• How do we compare the vectors?
– Normalize each term weight by the number of terms in
the document: how important is each t in D?
– Compute dot product between vectors to see how
similar they are
– Cosine of angle: 1 = identity; 0 = no common terms
• How do we get the weights?
– Term frequency (tf): how often does t occur in D?
– Inverse document frequency (idf): # docs / # docs term t occurs in:
idf_i = log(N / n_i)
– tf.idf weighting: weight of term i for doc j is the product of the frequency of i in j and the idf of i in the collection:
w_i,j = tf_i,j × idf_i
Evaluating IR Performance
• Precision: # rel docs returned / total # docs returned -- how often are you right when you say this document is relevant?
• Recall: # rel docs returned / # rel docs in collection -- how many of the relevant documents do you find?
• F-measure combines P and R: F = 2PR / (P + R)
• Are P and R equally important?
Improving Queries
• Relevance feedback: users rate retrieved docs
• Query expansion: many techniques
– e.g. add top N docs retrieved to query and resubmit
expanded query
• Term clustering: cluster rows of terms to produce
synonyms and add to query
IR Tasks
• Ad hoc retrieval: ‘normal’ IR
• Routing/categorization: assign new doc to one of
predefined set of categories
• Clustering: divide a collection into N clusters
• Segmentation: segment text into coherent chunks
• Summarization: compress a text by extracting
summary items
• Question-answering: find a stretch of text
containing the answer to a question
Combining IR and IE for QA
• Information extraction
Summary
• Many approaches to ‘robust’ semantic analysis
– Semantic grammars targeting particular domains
Utterance --> Yes/No Reply
Yes/No Reply --> Yes-Reply | No-Reply
Yes-Reply --> {yes, yeah, right, ok, "you bet", …}
– Information extraction techniques targeting specific
tasks
• Extracting information about terrorist events from
news
– Information retrieval techniques --> more like NLP