Robust Semantics, Information Extraction, and Information Retrieval CS 4705

Problems with Syntax-Driven Semantics
• Syntactic structures often don’t fit semantic
structures very well
– Important semantic elements often distributed very
differently in trees for sentences that mean ‘the same’
I like soup. Soup is what I like.
– Parse trees contain many structural elements not clearly
important to making semantic distinctions
– Syntax-driven semantic representations are sometimes
pretty verbose
V --> serves
{λx λy ∃e Isa(e, Serving) ∧ Server(e, y) ∧ Served(e, x)}
Semantic Grammars
• Alternative to modifying syntactic grammars to
deal with semantics too
• Define grammars specifically in terms of the
semantic information we want to extract
– Domain specific: Rules correspond directly to entities
and activities in the domain
I want to go from Boston to Baltimore on Thursday,
September 24th
– Greeting --> {Hello|Hi|Um…}
– TripRequest --> Need-spec travel-verb from City to City
on Date
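A minimal sketch of what a rule like TripRequest might look like in code, assuming a tiny hand-written city list and a regex in place of a real grammar formalism (all names and patterns below are illustrative only):

```python
import re

# Toy travel-domain semantic grammar: the rule names domain slots directly
# (from_city, to_city, date) rather than syntactic categories like NP or VP.
# The city list, travel verbs, and date pattern are illustrative only.
CITY = r"(?P<{slot}>Boston|Baltimore|Denver|Chicago)"
DATE = r"(?P<date>(?:Monday|Tuesday|Wednesday|Thursday|Friday)(?:,\s*\w+ \d+\w*)?)"
TRIP_REQUEST = re.compile(
    r"I (?:want|need|would like) to (?:go|fly|travel) from "
    + CITY.format(slot="from_city")
    + r" to "
    + CITY.format(slot="to_city")
    + r" on "
    + DATE,
    re.IGNORECASE,
)

def parse_trip_request(utterance):
    """Return the filled slots if the utterance matches the TripRequest rule."""
    m = TRIP_REQUEST.search(utterance)
    return m.groupdict() if m else None

print(parse_trip_request(
    "I want to go from Boston to Baltimore on Thursday, September 24th"))
# {'from_city': 'Boston', 'to_city': 'Baltimore', 'date': 'Thursday, September 24th'}
```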
Predicting User Input
• Rely on knowledge of task and (sometimes)
constraints on what the user can do
– Can handle very sophisticated phenomena
I want to go to Boston on Thursday.
I want to leave from there on Friday for Baltimore.
TripRequest --> Need-spec travel-verb from City on Date
for City
Dialogue postulate maps filler for ‘from-city’ to prespecified to-city
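One way to read the "dialogue postulate" above is as a rule over the dialogue state that fills a missing slot from context. The sketch below is an assumption about how that might be coded, with an invented state representation; it resolves "from there" to the previous request's to-city.

```python
# Hypothetical dialogue state left by an earlier TripRequest.
state = {"to_city": "Boston", "date": "Thursday"}

def apply_from_there_postulate(new_request, state):
    """If the new request's from-city is missing or 'there', map it to the
    previously specified to-city (the dialogue postulate on the slide)."""
    if new_request.get("from_city") in (None, "there"):
        new_request["from_city"] = state.get("to_city")
    return new_request

# "I want to leave from there on Friday for Baltimore."
request = {"from_city": "there", "date": "Friday", "to_city": "Baltimore"}
print(apply_from_there_postulate(request, state))
# {'from_city': 'Boston', 'date': 'Friday', 'to_city': 'Baltimore'}
```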
Priming User Input
• Users will tend to use the vocabulary they hear
from the system: lexical entrainment (Clark &
Brennan ’96)
– Reference to objects: the scarey M&M man
– Re-use of system prompt vocabulary/syntax:
Please tell me where you would like to leave/depart
from.
Where would you like to leave/depart from?
• Explicit training vs. implicit training
• Training the user vs. retraining the system
Drawbacks of Semantic Grammars
• Lack of generality
– A new one for each application
– Large cost in development time
• Can be very large, depending on how much
coverage you want them to have
• If users go outside the grammar, things may break
disastrously
I want to leave from my house at 10 a.m.
I want to talk to a person.
Information Retrieval
• How related to NLP?
– Operates on language (speech or text)
– Does it use linguistic information?
• Stemming
• Bag-of-words approach
• Very simple analyses
– Does it make use of document formatting?
• Headlines, punctuation, captions
• Collection: a set of documents
• Term: a word or phrase
• Query: a set of terms
But…what is a term?
• Stop list
• Stemming
• Homonymy, polysemy, synonymy
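A rough sketch of what these term-normalization steps do in practice, with a hand-picked stop list and a crude suffix stripper standing in for a real stop list and stemmer (e.g. Porter's):

```python
# Illustrative stop list and suffix stripper; real systems use standard stop
# lists and a proper stemmer. Note that homonymy, polysemy, and synonymy are
# not addressed by this kind of surface normalization.
STOP_WORDS = {"the", "a", "an", "of", "in", "and", "or", "to", "is"}

def crude_stem(word):
    for suffix in ("ing", "ed", "es", "s"):
        if word.endswith(suffix) and len(word) > len(suffix) + 2:
            return word[: -len(suffix)]
    return word

def terms(text):
    tokens = [t.lower() for t in text.split()]
    return [crude_stem(t) for t in tokens if t not in STOP_WORDS]

print(terms("the banks of the river flooded"))   # ['bank', 'river', 'flood']
```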
Vector Space Model
• Simple versions represent documents and queries
as feature vectors, one binary feature for each term
in collection
• Is a term t in this document or in this query or not?
D = (t1,t2,…,tn)
Q = (t1,t2,…,tn)
• Similarity metric: how many terms does a query
share with each candidate document?
• Weighted terms: term-by-document matrix
D = (wt1,wt2,…,wtn)
Q = (wt1,wt2,…,wtn)
• How can we compare docs of different lengths?
– Normalize each term weight by the number of terms in
the document: how important is each t in D?
• How do we compare the vectors?
– Compute dot product between query vector and each
doc vector to see how similar they are
– Cosine of angle between vectors: 1 = identical; 0 = no
common terms
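The comparison just described, as a short sketch: one weight per vocabulary term, dot product, and length normalization via the cosine. The four-term vocabulary is made up for illustration.

```python
import math

def cosine(d, q):
    """Cosine of the angle between document vector d and query vector q:
    1.0 for vectors pointing the same way, 0.0 when no terms are shared."""
    dot = sum(dw * qw for dw, qw in zip(d, q))
    norm_d = math.sqrt(sum(w * w for w in d))
    norm_q = math.sqrt(sum(w * w for w in q))
    return dot / (norm_d * norm_q) if norm_d and norm_q else 0.0

# One weight per term in the collection vocabulary (here only 4 terms).
doc   = [2, 0, 1, 3]   # e.g. term counts in the document
query = [1, 0, 0, 1]   # binary query vector
print(round(cosine(doc, query), 3))   # ~0.945
```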
TfIdf
• How can we improve the weights?
– How important is each word in identifying the
document it is in?
– Term frequency (tf_i,j): how often does term i occur in
doc j?
– Inverse document frequency (idf_i): # docs / # docs
term i occurs in:
idf_i = log(N / n_i)
– tf·idf weighting: weight of term i for doc j is the
product of the frequency of i in j and its idf score in the
collection:
w_i,j = tf_i,j · idf_i
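A small sketch of the weighting above, w_i,j = tf_i,j · idf_i with idf_i = log(N / n_i); the three toy "documents" are invented.

```python
import math
from collections import Counter

docs = ["the cat sat on the mat",
        "the dog sat on the log",
        "cats and dogs"]
tokenized = [d.split() for d in docs]
N = len(tokenized)

# n_i: number of documents that contain term i
doc_freq = Counter()
for tokens in tokenized:
    doc_freq.update(set(tokens))

def tf_idf(term, doc_tokens):
    """w_i,j = tf_i,j * idf_i, with idf_i = log(N / n_i)."""
    tf = doc_tokens.count(term)
    n_i = doc_freq[term]
    return tf * math.log(N / n_i) if n_i else 0.0

print(tf_idf("cat", tokenized[0]))   # ~1.10: 'cat' occurs in only one doc
print(tf_idf("the", tokenized[0]))   # ~0.81: 'the' occurs in 2 of 3 docs, lower idf
```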
Evaluating IR Performance
• Precision: #relevant docs returned/total #docs
returned -- how often are you right when you say
this document is relevant?
• Recall: #relevant docs returned/#relevant docs in
collection -- how many of the relevant documents
do you find?
• F-measure combines P and R:
F = 2PR / (P + R)
• Are P and R equally important?
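The definitions above in a few lines of code, including the balanced F-measure F = 2PR/(P + R); weighting P and R differently gives the F_beta variant hinted at by the last bullet.

```python
def precision_recall_f(returned, relevant):
    """returned, relevant: sets of document ids."""
    hits = len(returned & relevant)
    p = hits / len(returned) if returned else 0.0
    r = hits / len(relevant) if relevant else 0.0
    f = 2 * p * r / (p + r) if p + r else 0.0
    return p, r, f

# 10 docs returned, 4 of them relevant, 8 relevant docs in the collection
returned = set(range(10))
relevant = {0, 1, 2, 3, 20, 21, 22, 23}
print(precision_recall_f(returned, relevant))   # (0.4, 0.5, 0.444...)
```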
Improving Queries
• Relevance feedback: users rate retrieved docs
• Query expansion: many techniques
– add top N docs retrieved to query and resubmit
expanded query
– WordNet
• Term clustering: cluster rows of terms in the term-by-document
matrix to produce synonyms and add to query
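A sketch of the simplest version of the first technique (pseudo-relevance feedback): treat the top retrieved documents as relevant, add their most frequent new terms to the query, and resubmit. The document representation here (lists of tokens) is assumed for illustration.

```python
from collections import Counter

def expand_query(query_terms, top_docs, k=5):
    """Add the k most frequent non-query terms from the top retrieved
    documents (each a list of tokens) to the query."""
    counts = Counter(t for doc in top_docs for t in doc if t not in query_terms)
    return list(query_terms) + [t for t, _ in counts.most_common(k)]

top_docs = [["baltimore", "flight", "fare", "schedule"],
            ["flight", "fare", "airline"]]
print(expand_query(["baltimore", "flight"], top_docs, k=2))
# ['baltimore', 'flight', 'fare', 'schedule']
```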
IR Tasks
• Ad hoc retrieval: ‘normal’ IR
• Routing/categorization: assign new doc to one of
predefined set of categories
• Clustering: divide a collection into N clusters
• Segmentation: segment text into coherent chunks
• Summarization: compress a text by extracting
summary items or eliminating less relevant items
• Question-answering: find a span of text (within
some window) containing the answer to a question
Information Extraction
• A ‘robust’ semantic method
• Idea: ‘extract’ particular types of information from
arbitrary text or transcribed speech
• Examples:
– Named entities: people, places, organizations, times,
dates
• <Organization> MIPS</Organization> Vice
President <Person>John Hime</Person>
– MUC evaluations
• Domains: Medical texts, broadcast news (terrorist
reports), …
Appropriate where Semantic Grammars and
Syntactic Parsers are not
• Appropriate where information needs are very
specific and specifiable in advance
– Question answering systems, gisting of news or mail…
– Job ads, financial information, terrorist attacks
• Input too complex and far-ranging to build
semantic grammars
• But full-blown syntactic parsers are impractical
– Too much ambiguity for arbitrary text
– 50 parses or none at all
– Too slow for real-time applications
Information Extraction Techniques
• Often use a set of simple templates or frames with
slots to be filled in from input text
– Ignore everything else
– My number is 212-555-1212.
– The inventor of the wiggleswort was Capt. John T.
Hart.
– The king died in March of 1932.
• Context (neighboring words, capitalization,
punctuation) provides cues to help fill in the
appropriate slots
• How to do better than everyone else?
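A minimal sketch of the template-and-slots idea: a few illustrative regexes fill slots such as a phone number or a date of death, and everything else in the input is simply ignored (the patterns and slot names are invented).

```python
import re

# Illustrative slot-filling patterns; text that matches nothing is ignored.
PATTERNS = {
    "phone_number": re.compile(r"[Mm]y number is (\d{3}-\d{3}-\d{4})"),
    "death_date":   re.compile(r"[Tt]he king died in (\w+ of \d{4})"),
    "inventor":     re.compile(r"[Tt]he inventor of the \w+ was (.+?)\.?$"),
}

def fill_slots(text):
    slots = {}
    for name, pattern in PATTERNS.items():
        m = pattern.search(text)
        if m:
            slots[name] = m.group(1)
    return slots

print(fill_slots("My number is 212-555-1212."))
print(fill_slots("The king died in March of 1932."))
print(fill_slots("The inventor of the wiggleswort was Capt. John T. Hart."))
```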
The IE Process
• Given a corpus and a target set of items to be
extracted:
– Clean up the corpus
– Tokenize it
– Do some hand labeling of target items
– Extract some simple features
• POS tags
• Phrase Chunks …
– Do some machine learning to associate features with
target items, or derive this association by intuition
– Use e.g. FSTs, simple or cascaded, to iteratively
annotate the input, eventually identifying the slot fillers
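The cascaded-FST step can be approximated with layered regular-expression passes: an early pass marks low-level pieces (a date, a name-like chunk), and a later pass combines marked pieces into a slot filler. The toy patterns and tags below are illustrative, not the patterns any real system uses.

```python
import re

def annotate(text):
    """Rough stand-in for a cascade of finite-state transducers: each pass
    rewrites the text, adding brackets that later passes build on."""
    # Pass 1: low-level pieces (a date, a capitalized-word sequence).
    text = re.sub(r"\b\w+ \d{1,2}(?:st|nd|rd|th)?, \d{4}\b", r"[DATE \g<0>]", text)
    text = re.sub(r"\b(?:[A-Z][a-z]+ ){1,2}[A-Z][a-z]+\b", r"[NAME \g<0>]", text)
    # Pass 2: combine marked pieces into a slot filler.
    text = re.sub(r"\[NAME ([^\]]+)\] died on \[DATE ([^\]]+)\]",
                  r"[DEATH victim=\1 date=\2]", text)
    return text

print(annotate("John Hart died on March 3, 1932."))
# [DEATH victim=John Hart date=March 3, 1932].
```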
Domain-Specific IE from the Web (Patwardhan &
Riloff ’06)
• The Problem:
– IE systems typically domain-specific – a new extraction
procedure for every task
– Supervised learning depends on hand annotation for
training
• Goals:
– Acquire domain specific texts automatically on the Web
– Identify domain-specific IE patterns automatically
• Approach:
– Start with a set of seed IE patterns learned from a
hand-labeled corpus
– Use these to identify relevant documents on the web
– Find new seed patterns in the retrieved documents
MUC-4 IE Task
• Corpus:
– 1700 news stories about terrorist events in Latin
America
– Answer keys about information that should be extracted
• Problems:
– All upper case
– 50% of texts irrelevant
– Stories may describe multiple events
• Best results:
– 50-70% precision and recall with hand-built
components
– 41-44% recall and 49-51% precision with automatically
generated templates
Procedure
• Apply pre-defined syntactic patterns to a training corpus of
documents for which relevant/irrelevant judgments known
• Count how often partial lexicalizations of each (e.g. <subj>
was killed) appear in relevant vs. irrelevant documents
• Rank patterns based on association with domain
(frequency in domain documents vs. non-domain
documents)
• Manually review patterns and assign thematic
roles to those deemed useful
– From 40K+ patterns --> 291
• Now find similar web documents
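The ranking step (frequency in relevant vs. irrelevant documents) is often implemented with an AutoSlog-TS-style score, relevance rate times log of frequency; the sketch below uses that scoring as an assumption about the details rather than the exact formula from the slides.

```python
import math

def rank_patterns(counts):
    """counts: {pattern: (freq in relevant docs, freq in irrelevant docs)}.
    Score = relevance_rate * log2(total frequency) -- an AutoSlog-TS-style
    ranking, assumed here for illustration."""
    scored = []
    for pattern, (rel, irrel) in counts.items():
        total = rel + irrel
        if total == 0:
            continue
        relevance_rate = rel / total
        scored.append((relevance_rate * math.log2(total), pattern))
    return sorted(scored, reverse=True)

counts = {"<subj> was killed": (120, 10),
          "<subj> was seen":   (40, 200)}
print(rank_patterns(counts))   # '<subj> was killed' ranks far higher
```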
Domain Corpus Creation
• Create IR queries by crossing names of 5 terrorist
organizations (e.g. Al Qaeda, IRA) with 16
terrorist activities (e.g. assassinated, bombed,
hijacked, wounded) --> 80 queries
– Restricted to CNN, English documents
– Eliminated TV transcripts
– Yield from 2 runs: 6,182 documents
– Cleaned corpus: 5,618 documents
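The 80 queries are just the cross product of organization names and activity terms, roughly as below, with two short placeholder lists standing in for the paper's 5 organizations and 16 activities.

```python
from itertools import product

# Placeholder lists; the actual query set crosses 5 organizations with
# 16 activity terms, giving 80 queries.
organizations = ["Al Qaeda", "IRA"]
activities = ["assassinated", "bombed", "hijacked", "wounded"]

queries = [f"{org} {act}" for org, act in product(organizations, activities)]
print(len(queries), queries[:3])
# 8 ['Al Qaeda assassinated', 'Al Qaeda bombed', 'Al Qaeda hijacked']
```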
Learning Domain-Specific Patterns
• Hypothesis: new extraction patterns co-occurring
with seed patterns from training corpus will also
be associated with terrorism
• Generate all extraction patterns in CNN corpus
(147,712)
• Compute correlation of each with seed patterns
based on frequency of co-occurrence in same
sentence – keep those occurring more often than
chance with some seed
• Rank new patterns by their seed correlations
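A sketch of the co-occurrence step under one simple reading: count how often each candidate pattern fires in the same sentence as some seed, and keep candidates that co-occur more often than an independence (chance) estimate. The chance estimate and data layout here are assumptions for illustration.

```python
def cooccurring_candidates(sentences, seeds, candidates):
    """sentences: list of sets of pattern ids that fired in each sentence.
    Keep candidates whose co-occurrence with any seed exceeds the
    chance (independence) estimate."""
    n = len(sentences)
    seed_hits = sum(1 for s in sentences if s & seeds)
    kept = {}
    for cand in candidates:
        cand_hits = sum(1 for s in sentences if cand in s)
        both = sum(1 for s in sentences if cand in s and s & seeds)
        expected = (seed_hits / n) * (cand_hits / n) * n   # chance co-occurrence
        if both > expected:
            kept[cand] = both / expected
    return kept

sentences = [{"seedA", "candX"}, {"seedA", "candX"}, {"candX"},
             {"candY"}, {"seedB"}, {"candY"}]
print(cooccurring_candidates(sentences, {"seedA", "seedB"}, {"candX", "candY"}))
# {'candX': 1.33...}; candY never co-occurs with a seed and is dropped
```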
Highly Ranked Patterns
• Filter: Measure semantic affinity: how often does this
pattern extract an entity of a particular category (e.g.
victim, target)?
• Compute semantic affinity for each extraction pattern wrt 6
categories: target, victim plus distractors: perpetrator,
organization, weapon, other
– E.g. frequency of extracting target / frequency of extracting any of
the 6 categories, weighted by log probability of target
• Remove patterns not strongly associated with
desired classes:
• Evaluate on MUC-4
– Baseline:
• Recall 64%/Precision 43% on targets
• Recall 50%/Precision 52% on victims
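The semantic-affinity filter can be sketched as below. The formula is a paraphrase of the slide's description (fraction of a pattern's extractions that fall in the category, weighted by a log term) and should be read as an assumption about the details, not the paper's exact definition.

```python
import math

def semantic_affinity(extraction_counts, category):
    """extraction_counts: {category: # entities of that category this pattern
    extracted}, over the 6 categories. Affinity for `category` is taken here as
    (freq of category / freq over all categories) * log2(freq of category)."""
    f_cat = extraction_counts.get(category, 0)
    f_all = sum(extraction_counts.values())
    if f_cat == 0 or f_all == 0:
        return 0.0
    return (f_cat / f_all) * math.log2(f_cat)

counts = {"victim": 24, "target": 3, "perpetrator": 2, "other": 1}
print(semantic_affinity(counts, "victim"))   # ~3.67: above the 3.0 threshold
print(semantic_affinity(counts, "target"))   # ~0.16: filtered out
```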
Results for Web-Learned Patterns
• Use 396 terrorism extraction patterns learned from
MUC training set as seeds
• Produce ranked list of new patterns from the web
using a semantic affinity threshold of 3.0
• Choose top N (50-300) patterns to add to seed set
• Performance:
Combining IR and IE for QA
• Information extraction: AQUA
Summary
• Many approaches to ‘robust’ semantic analysis
– Semantic grammars targeting particular domains
Utterance --> Yes/No Reply
Yes/No Reply --> Yes-Reply | No-Reply
Yes-Reply --> {yes, yeah, right, ok, "you bet", …}
– Information extraction techniques targeting specific
tasks
• Extracting information about terrorist events from
news
– Information retrieval techniques --> more like NLP