
Università di Pisa
NL search: hype or reality?
Giuseppe Attardi
Dipartimento di Informatica
Università di Pisa
With H. Zaragoza, J. Atserias, M. Ciaramita of Yahoo!
Research Barcelona
Hakia
Hakia’s Aims and Benefits
Hakia is building the Web's new "meaning-based" search engine with the sole purpose of improving search relevancy and interactivity, pushing the current boundaries of Web search.
The benefits to the end user are search efficiency, richness of information, and time savings.
Hakia’s Promise
The basic promise is to bring search results by meaning match - similar to the human brain's cognitive skills - rather than by the mere occurrence (or popularity) of search terms.
Hakia's new technology is a radical departure from the conventional indexing approach, because indexing has severe limitations in handling full-scale semantic search.
Hakia’s Appeal
Hakia's capabilities will appeal to all Web searchers - especially those engaged in research on knowledge-intensive subjects, such as medicine, law, finance, science, and literature.
Hakia “meaning-based” search
Ontological Semantics

A formal and comprehensive linguistic theory of meaning in natural language
A set of resources, including:
– a language-independent ontology of 8,000 interrelated concepts
– an ontology-based English lexicon of 100,000 word senses
– an ontological parser which "translates" every sentence of the text into its text meaning representation
– an acquisition toolbox which ensures the homogeneity of the ontological concepts and lexical entries produced by different acquirers of limited training
OntoSem Lexicon Example
Bow
(bow-n1
  (cat n)
  (anno (def "instrument for archery"))
  (syn-struc ((root $var0) (cat n)))
  (sem-struc (bow))
)
(bow-n2
  (cat n)
  (anno (def "part of string-instruments"))
  (syn-struc ((root $var0) (cat n)))
  (sem-struc (stringed-instrument-bow))
)
Lexicon (Bow)
(bow-v1
  (cat v)
  (anno (def "to give in to someone or something"))
  (syn-struc ((subject ((root $var2) (cat np)))
              (root $var0) (cat v)
              (pp-adjunct ((root to)
                           (cat prep)
                           (obj ((root $var3) (cat np)))))))
  (sem-struc (yield-to
              (agent (value ^$var2))
              (caused-by (value ^$var3))))
)
QDEX
QDEX extracts all possible queries that can be asked of a Web page, at various lengths and in various forms.
These queries (sequences) become gateways to the originating documents, paragraphs and sentences during retrieval.

QDEX vs Inverted Index
An inverted index has a huge "active" data set prior to a query from the user.
Enriching this data set with semantic equivalences (concept relations) would further increase the operational burden exponentially.
QDEX has a tiny active set for each query, so semantic associations can easily be handled on the fly.

QDEX combinatorics



The critical point in the QDEX system is to be able to decompose sentences into a handful of meaningful sequences without getting lost in the combinatorial explosion.
For example, a sentence with 8 significant words can generate over a billion sequences (of 1, 2, 3, 4, 5, and 6 words), of which only a few dozen make sense by human inspection.
The challenge is how to reduce a billion possibilities to the few dozen that make sense. hakia uses OntoSem technology to meet this challenge.
Semantic Rank





A pool of relevant paragraphs comes from the QDEX system for the given query terms.
Final relevancy is determined by an advanced sentence analysis and a concept match between the query and the best sentence of each paragraph.
Morphological and syntactic analyses are also performed.
No keyword matching or Boolean algebra is involved.
The credibility and age of the Web page are also taken into account.
PowerSet
Powerset Demo

NL Questions on Wikipedia:
– What companies did IBM acquire?
– Which company did IBM acquire in 1989?
Google queries on Wikipedia:
– Same queries
– Poorer results
Try yourself
Who acquired IBM?
IBM acquisitions 1996
IBM acquisitions
What do liberal democrats say about healthcare?
– 1.4 million matches
Problems

The parser from Xerox is a quite sophisticated constituent parser:
– it produces all possible parse trees
– it is fairly slow
Workaround: index only the most relevant portion of the Web.
Reality
Semantic Document Analysis

Question Answering
– Return precise answers to natural language queries
Relation Extraction
Intent Mining
– assess the attitude of the document author with respect to a given subject
– Opinion mining: the attitude is a positive or negative opinion
Semantic Retrieval Approaches


Used in QA, Opinion Retrieval, etc.
Typical 2-stage approach:
1. Perform IR and rank by topic relevance
2. Postprocess results with filters and rerank
Generally slow:
– requires several minutes to process each query
Single stage approach

Single-stage approach:
– Enrich the index with opinion tags
– Perform normal retrieval with a custom ranking function
Proved effective at the TREC 2006 Blog Opinion Mining Task.
Enriched Index for TREC Blog

Overlay words with tags
[Figure: token positions 1-5 for "music is a touch lame", with the tag NEGATIVE overlaid at the position of the opinion word "lame"; other example tokens and tags shown: soundtrack, little, weak, bit, plate, ART]
Enhanced Queries
music NEGATIVE:lame
music NEGATIVE:*
Achieved 3rd best P@5 at the TREC Blog Track 2006.
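As an illustration of the overlay idea, here is a minimal sketch (not IXE code; the post/tagged helpers and data structures are invented for the example) of a positional index where a tag is posted at the same position as the word it annotates, so that a clause like NEGATIVE:lame reduces to intersecting two posting lists on (doc, pos):

// Minimal sketch: annotation tags are indexed as terms at the same position
// as the word they annotate. Names and structures are illustrative assumptions.
#include <cstdio>
#include <map>
#include <set>
#include <string>
#include <utility>
#include <vector>

using Posting = std::pair<int, int>;                    // (doc id, token position)
using Index   = std::map<std::string, std::vector<Posting>>;

// Add one token and its overlay tags at the same position.
void post(Index& ix, int doc, int pos, const std::string& word,
          const std::vector<std::string>& tags) {
    ix[word].push_back({doc, pos});
    for (const auto& t : tags)                          // tag overlaid, same position
        ix[t].push_back({doc, pos});
}

// Documents containing `tag` and `word` at the same position ("TAG:word").
std::set<int> tagged(const Index& ix, const std::string& tag, const std::string& word) {
    std::set<Posting> tagPos(ix.at(tag).begin(), ix.at(tag).end());
    std::set<int> docs;
    for (const auto& p : ix.at(word))
        if (tagPos.count(p)) docs.insert(p.first);
    return docs;
}

int main() {
    Index ix;
    // doc 0: "music is a touch lame", with NEGATIVE overlaid on "lame"
    post(ix, 0, 1, "music", {});
    post(ix, 0, 5, "lame", {"NEGATIVE"});
    for (int d : tagged(ix, "NEGATIVE", "lame")) std::printf("doc %d\n", d);
}

A full query such as music NEGATIVE:lame would additionally intersect the resulting documents with the postings of "music".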
Enriched Inverted Index
Inverted Index

Stored compressed
– ~1 byte per term occurrence

Efficient intersection operation
– O(n) where n is the length of the shortest postings list
– Using skip lists further reduces cost
Size: ~1/8 of the original text
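The ~1 byte per occurrence figure is what delta (gap) encoding plus variable-byte compression typically yields; a minimal sketch of that scheme (an assumption for illustration, not IXE's actual codec):

// Delta + variable-byte encoding for a sorted postings list.
#include <cstdint>
#include <vector>

// Encode sorted positions as gaps, 7 bits per byte, high bit marks the last byte.
std::vector<uint8_t> vbyte_encode(const std::vector<uint32_t>& positions) {
    std::vector<uint8_t> out;
    uint32_t prev = 0;
    for (uint32_t p : positions) {
        uint32_t gap = p - prev;          // small gaps -> usually a single byte
        prev = p;
        while (gap >= 128) {
            out.push_back(gap & 0x7F);
            gap >>= 7;
        }
        out.push_back(gap | 0x80);        // terminator byte
    }
    return out;
}

std::vector<uint32_t> vbyte_decode(const std::vector<uint8_t>& bytes) {
    std::vector<uint32_t> positions;
    uint32_t value = 0, shift = 0, prev = 0;
    for (uint8_t b : bytes) {
        value |= uint32_t(b & 0x7F) << shift;
        if (b & 0x80) {                   // last byte of this gap
            prev += value;
            positions.push_back(prev);
            value = 0; shift = 0;
        } else {
            shift += 7;
        }
    }
    return positions;
}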
Small Adaptive Set Intersection
Example posting lists:
world: 3, 9, 12, 20, 40, 47
wide:  1, 8, 10, 25, 40, 41
web:   2, 4, 6, 21, 30, 35, 40
(the intersection is document 40)
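A simplified sketch of the intersection idea on lists like the ones above (plain vectors and binary search stand in for IXE's compressed posting lists and cursors; this is not the exact SASI algorithm):

// Adaptive intersection of k sorted posting lists: probe each list for the
// current candidate; whenever a probe overshoots, the larger value becomes
// the new candidate, so the work adapts to how the data interleaves.
#include <algorithm>
#include <vector>

std::vector<int> intersect(std::vector<std::vector<int>> lists) {
    std::vector<int> result;
    if (lists.empty()) return result;
    // Start from the shortest list: it bounds the number of candidates.
    std::sort(lists.begin(), lists.end(),
              [](const auto& a, const auto& b) { return a.size() < b.size(); });
    std::vector<size_t> pos(lists.size(), 0);
    while (pos[0] < lists[0].size()) {
        int candidate = lists[0][pos[0]];
        bool in_all = true;
        for (size_t i = 1; i < lists.size(); ++i) {
            // Binary search for the first element >= candidate.
            auto it = std::lower_bound(lists[i].begin() + pos[i], lists[i].end(), candidate);
            pos[i] = it - lists[i].begin();
            if (it == lists[i].end()) return result;     // one list exhausted
            if (*it != candidate) {                      // overshoot: new candidate
                pos[0] = std::lower_bound(lists[0].begin() + pos[0], lists[0].end(), *it)
                         - lists[0].begin();
                in_all = false;
                break;
            }
        }
        if (in_all) {
            result.push_back(candidate);
            ++pos[0];
        }
    }
    return result;
}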
IXE Search Engine Library


C++ OO architecture
Fast indexing
– Sort-based inversion
Fast search
– Efficient algorithms and data structures
– Query Compiler
  • Small Adaptive Set Intersection
– Suffix array with supra index
– Memory mapped index files
Programmable API library
Template metaprogramming
Object Store Data Base
IXE Performance

TREC TeraByte 2005:
– 2nd fastest
– 2nd best P@5
Query Processing

Query compiler
– One cursor on posting lists for each node
– CursorWord, CursorAnd, CursorOr, CursorPhrase
QueryCursor.next(Result& min)
– Returns the first result r >= min
A single operator handles all kinds of queries, e.g. proximity.
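A sketch of the cursor idiom: every node of the compiled query exposes next(min), and conjunction is obtained by driving all children to agree on a candidate. Class names follow the slide, but the signatures and code are illustrative assumptions, not IXE's API:

#include <limits>
#include <utility>
#include <vector>

using Result = long;                              // e.g. a document id
const Result NO_RESULT = std::numeric_limits<Result>::max();

struct QueryCursor {
    virtual Result next(Result min) = 0;          // first result >= min
    virtual ~QueryCursor() = default;
};

// Leaf: walks a sorted posting list.
struct CursorWord : QueryCursor {
    std::vector<Result> postings;
    size_t i = 0;
    explicit CursorWord(std::vector<Result> p) : postings(std::move(p)) {}
    Result next(Result min) override {
        while (i < postings.size() && postings[i] < min) ++i;
        return i < postings.size() ? postings[i] : NO_RESULT;
    }
};

// Conjunction: repeatedly asks every child for a result >= the current
// candidate; when they all agree, that document matches.
struct CursorAnd : QueryCursor {
    std::vector<QueryCursor*> children;
    Result next(Result min) override {
        Result candidate = min;
        for (size_t i = 0; i < children.size(); ) {
            Result r = children[i]->next(candidate);
            if (r == NO_RESULT) return NO_RESULT;
            if (r == candidate) ++i;              // this child agrees
            else { candidate = r; i = 0; }        // restart with the larger candidate
        }
        return candidate;
    }
};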
IXE Composability
[Class diagram: Collection<DocInfo> with DocInfo (name, date, size); Collection<PassageDoc> with PassageDoc (text, boundaries); Cursor; QueryCursor with next(); PassageQueryCursor with next()]
Passage Retrieval
Documents are split into passages
Matches are searched for in a passage and in ± n nearby passages
Results are ranked passages
Efficiency requires a special store for passage boundaries
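A possible way to use the stored passage boundaries (an assumption for illustration, not IXE's actual code): binary-search the match offset in the sorted boundary array to find its passage, then take the ± n neighbouring passages as ranking candidates:

#include <algorithm>
#include <vector>

// boundaries[i] = start offset of passage i, sorted ascending.
int passage_of(const std::vector<int>& boundaries, int match_offset) {
    // Index of the last boundary <= match_offset.
    auto it = std::upper_bound(boundaries.begin(), boundaries.end(), match_offset);
    return std::max(0, int(it - boundaries.begin()) - 1);
}

// Candidate passages for ranking: the matching passage and its n neighbours.
std::vector<int> candidates(const std::vector<int>& boundaries, int match_offset, int n) {
    int p = passage_of(boundaries, match_offset);
    std::vector<int> out;
    for (int i = std::max(0, p - n); i <= std::min(int(boundaries.size()) - 1, p + n); ++i)
        out.push_back(i);
    return out;
}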

QA Using Dependency Relations
Build dependency trees for both question and answer
Determine the similarity of corresponding paths in the dependency trees of question and answer
PiQASso Answer Matching
Question: What metal has the highest melting point?
Candidate sentence: Tungsten is a very dense material and has the highest melting point of any metal.
1. Parsing
2. Answer type check: expected type SUBSTANCE
3. Relation extraction: <tungsten, material, pred>, <tungsten, has, subj>, <point, has, obj>, …
4. Matching distance
5. Distance filtering
6. Popularity ranking
ANSWER: Tungsten
QA Using Dependency Relations
Further developed by Cui et al. at NUS
Score computed by a statistical translation model
Second best at TREC 2004
Wikipedia Experiment

Tagged Wikipedia with:
– POS
– LEMMA
– NE (WSJ, IEER)
– WN Super Senses
– Anaphora
– Parsing (head, dependency)
Tools Used
SST tagger [Ciaramita & Altun]
DeSR dependency parser [Attardi & Ciaramita]
– Fast: 200 sentences/sec
– Accurate: 90% UAS
Dependency Parsing
Produces dependency trees
Word-to-word dependency relations
Far easier to understand and to annotate
[Figure: dependency tree for "Rolls-Royce Inc. said it expects its sales to remain steady", with arcs labeled SUBJ, OBJ, MOD, TO]
Classifier-based Shift-Reduce Parsing
[Figure: parser configuration over the tagged sentence "He/PP saw/VVD a/DT girl/NN with/IN a/DT telescope/NNS ./SENT", with pointers to the top of the stack and the next input token; the classifier chooses among the actions Left, Right, Shift]
CoNLL 2007 Results
Language    UAS     LAS
Catalan     92.20   87.64
Chinese     86.73   86.86
English     86.99   85.85
Italian     85.54   81.34
Czech       83.40   77.37
Turkish     83.56   76.87
Arabic      82.53   72.66
Hungarian   81.81   76.81
Greek       80.75   73.92
Basque      76.86   69.84
EvalIta 2007 Results
Collection    UAS     LAS
Cod. Civile   91.37   79.13
Newspaper     85.49   76.62
(Best statistical parser)
Experiment
Experimental data sets
Wikipedia
 Yahoo! Answers

English Wikipedia Indexing
Original size: 4.4 GB
Number of articles: 1,400,000
Tagging time: ~3 days (6 days with previous tools)
Parsing time: 40 hours
Indexing time: 9 hours (8 days with UIMA + Lucene)
Index size: 3 GB
Metadata: 12 GB

Scaling Indexing
Highly parallelizable
Using Hadoop in streaming mode

Example (partial)
TERM      POS    LEMMA     WNSS                 HEAD  DEP
The       DT     the       0                    2     NMOD
Tories    NNPS   tory      B-noun.person        3     SUB
won       VBD    win       B-verb.competition   0     VMOD
this      DT     this      0                    5     NMOD
election  NN     election  B-noun.act           3     OBJ
Stacked View
    TERM      POS    LEMMA     WNSS                 HEAD  DEP
1   The       DT     the       0                    2     NMOD
2   Tories    NNPS   tory      B-noun.person        3     SUB
3   won       VBD    win       B-verb.competition   0     VMOD
4   this      DT     this      0                    5     NMOD
5   election  NN     election  B-noun.act           3     OBJ
Implementation

Special version of Passage Retrieval
Tags are overlaid on words:
– treated as terms at the same position as the corresponding word
– not counted, to avoid skewing TF/IDF
– given an ID in the lexicon
Retrieval is fast:
– a few msec per query on a 10 GB index
Provided as both a Linux library and a Windows DLL
Java Interface
Generated using SWIG
Results accessible through a ResultIterator
List of terms or tags for a sentence generated on demand

Proximity queries

Did France win the World Cup?
proximity 15 [MORPH/win:* DEP/SUB:france 'world cup']
– Born in the French territory of New Caledonia, he was a vital player in the French team that won the 1998 World Cup and was on the squad, but played just one game, as France won Euro 2000.
– France repeated the feat of Argentina in 1998, by taking the title as they won their home 1998 World Cup, beating Brazil.
– Both England (1966) and France (1998) won their only World Cups whilst playing as host nations.
Proximity queries

Who won the World Cup in 1998?
proximity 13 [MORPH/win:* DEP/SUB:* 'world cup' WSJ/DATE:1998]
– With the French national team, Dugarry won World Cup 1998 and Euro 2000.
– He captained Arsenal and won the World Cup with France in 1998.
Did France win the World Cup in 2002?
proximity 30 [MORPH/win:* DEP/SUB:france 'world cup' WSJ/DATE:2002]
No result.

Who won it in 2002?
proximity 6 [MORPH/win:* DEP/SUB:* 'world cup' 2002]
– He has 105 caps for Brazil, and helped his country win the World Cup in 2002 after finishing second in 1998.
– 2002 - Brazil wins the Football World Cup becoming the first team to win the trophy 5 times
Dependency Queries


deprel [ pattern headPattern ]
Semantics: the clause matches any document that contains a match for pattern whose head matches headPattern
Implementation:
  search for pattern
  for each match at (doc, pos)
    find h = head(doc, pos)
    find a match for headPattern at (doc, h±2)
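A self-contained sketch of this loop over toy in-memory structures (the index layout and names are assumptions for illustration, not IXE's implementation):

#include <map>
#include <string>
#include <vector>

struct Match { int doc; int pos; };

struct ToyIndex {
    std::map<std::string, std::vector<Match>> postings;  // pattern -> (doc, pos) hits
    std::map<int, std::vector<int>> heads;                // doc -> head position of each token

    bool matches_at(const std::string& pattern, int doc, int pos) const {
        auto it = postings.find(pattern);
        if (it == postings.end()) return false;
        for (const Match& m : it->second)
            if (m.doc == doc && m.pos == pos) return true;
        return false;
    }

    // deprel [ pattern headPattern ]: documents with a match for pattern
    // whose head matches headPattern, within a tolerance of 2 positions.
    std::vector<int> deprel(const std::string& pattern, const std::string& headPattern) const {
        std::vector<int> docs;
        auto it = postings.find(pattern);
        if (it == postings.end()) return docs;
        for (const Match& m : it->second) {
            int h = heads.at(m.doc)[m.pos];
            for (int p = h - 2; p <= h + 2; ++p)
                if (matches_at(headPattern, m.doc, p)) { docs.push_back(m.doc); break; }
        }
        return docs;
    }
};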
Finding heads
How can head(doc, pos) be found?
Solution: store the HEAD positions in a special posting list.
A posting list stores the positions where a term occurs in a document.
The HEADS posting list stores the head of each term in a document.
Finding Heads
To retrieve head(doc, pos), one accesses the posting list of HEADS for doc and extracts the pos-th item.
Posting lists are efficient since they are stored compressed on disk and accessed through memory mapping.
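One way the lookup could look with a memory-mapped store (the flat uint32 layout and per-document offset table are assumptions for illustration; IXE keeps its posting lists compressed, and error handling is omitted):

#include <fcntl.h>
#include <sys/mman.h>
#include <sys/stat.h>
#include <unistd.h>
#include <cstdint>
#include <vector>

struct HeadsFile {
    const uint32_t* data = nullptr;
    std::vector<size_t> doc_offset;          // start (in words) of each document's heads,
                                             // filled from index metadata (not shown)

    explicit HeadsFile(const char* path) {
        int fd = open(path, O_RDONLY);
        struct stat st;
        fstat(fd, &st);
        data = static_cast<const uint32_t*>(
            mmap(nullptr, st.st_size, PROT_READ, MAP_SHARED, fd, 0));
        close(fd);                           // the mapping stays valid after close
    }

    // Head position of the pos-th token of document doc.
    uint32_t head(int doc, int pos) const {
        return data[doc_offset[doc] + pos];  // only the needed page is touched
    }
};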

Dependency Paths

deprel [ pattern0 pattern1 … patterni]

Note: opposite direction from XPath
Multiple Tags

DEP/SUB:MORPH/insect:*
Dependency queries
Who won the elections?
deprel [ election won ]
deprel [ DEP/OBJ:election MORPH/win:* ]
– The Scranton/Shafer team won the election over Philadelphia mayor Richardson Dilworth and Shafer became the state's lieutenant

Collect

What are the causes of death?
deprel [ from MORPH/die:* ]
– She died from throat cancer in Sherman Oaks, California.
– Wilson died from AIDS.

Demo

Deep Search on Wikipedia
– Web interface
– Queries with tags and deprel
Browsing of Deep Search results
– Sentences are collected
– A graph of sentences/entities is created
  • WebGraph [Boldi-Vigna]
– Results are clustered by the most frequent entities
Issues

Dependency relations are crude for English (30 in total)
– SUB, OBJ, NMOD
Better for Catalan (168)
– distinguish time/location/cause adverbials
The relation might not be direct
– e.g. "die from cancer"
Queries cannot express the SUB/OBJ relationship
Semantic Relations?
Example: "The movie is not a masterpiece" (a Target-Opinion relation)
Or a few general relation types?
– Directly/Indirectly
– Affirmative/Negative/Dubitative
– Active/Passive

Translating Queries
Compile the NL query into the query syntax
Learn from examples, e.g. Yahoo! Answers

Generic Quadruple
(Subject, Object, Verb, Mode)
Support searching for quadruples
Rank based on distance

Related Work
Chakrabarti proposes to use proximity queries on a Web index built with Lucene and UIMA.

We are hiring
Three projects starting:
1. Semantic Search on the Italian Wikipedia
   – 2 research fellowships, Fond. Cari Pisa
2. Deep Search
   – 2 PostDocs, Yahoo! Research
3. Machine Translation
   – 2 PostDocs
Questions
Discussion

Are there other uses besides search?
– better query refinement
– semantic clustering
– Vanessa Murdock's aggregated result visualization
Interested in getting access to the resource for experimentation?
Should relation types be learned?
Will it scale?

Download