NL search: hype or reality?
Giuseppe Attardi
Dipartimento di Informatica, Università di Pisa
With H. Zaragoza, J. Atserias, M. Ciaramita of Yahoo! Research Barcelona

Hakia

Hakia's Aims and Benefits
Hakia is building the Web's new "meaning-based" search engine with the sole purpose of improving search relevancy and interactivity, pushing the current boundaries of Web search. The benefits to the end user are search efficiency, richness of information, and time savings.

Hakia's Promise
The basic promise is to bring search results by meaning match (similar to the human brain's cognitive skills) rather than by the mere occurrence (or popularity) of search terms. Hakia's new technology is a radical departure from the conventional indexing approach, because indexing has severe limitations in handling full-scale semantic search.

Hakia's Appeal
Hakia's capabilities will appeal to all Web searchers, especially those engaged in research on knowledge-intensive subjects such as medicine, law, finance, science, and literature.

Hakia "meaning-based" search

Ontological Semantics
– A formal and comprehensive linguistic theory of meaning in natural language
– A set of resources, including:
  • a language-independent ontology of 8,000 interrelated concepts
  • an ontology-based English lexicon of 100,000 word senses
  • an ontological parser which "translates" every sentence of the text into its text meaning representation
  • an acquisition toolbox which ensures the homogeneity of the ontological concepts and lexical entries produced by acquirers with limited training

OntoSem Lexicon Example: Bow

  (bow-n1
    (cat n)
    (anno (def "instrument for archery"))
    (syn-struc ((root $var0) (cat n)))
    (sem-struc (bow)))

  (bow-n2
    (cat n)
    (anno (def "part of string-instruments"))
    (syn-struc ((root $var0) (cat n)))
    (sem-struc (stringed-instrument-bow)))

Lexicon (Bow)

  (bow-v1
    (cat v)
    (anno (def "to give in to someone or something"))
    (syn-struc
      ((subject ((root $var2) (cat np)))
       (root $var0)
       (cat v)
       (pp-adjunct ((root to) (cat prep) (obj ((root $var3) (cat np)))))))
    (sem-struc
      (yield-to
        (agent (value ^$var2))
        (caused-by (value ^$var3)))))

QDEX
QDEX extracts all possible queries that can be asked of a Web page, at various lengths and in various forms. These queries (sequences) become gateways to the originating documents, paragraphs and sentences during retrieval.

QDEX vs Inverted Index
An inverted index has a huge "active" data set prior to any query from the user. Enriching this data set with semantic equivalences (concept relations) would further increase the operational burden in an exponential manner. QDEX has a tiny active set for each query, so semantic associations can easily be handled on the fly.

QDEX combinatorics
The critical point in the QDEX system is to decompose sentences into a handful of meaningful sequences without getting lost in the combinatorial explosion. For example, a sentence with 8 significant words can generate over a billion sequences (of 1 to 6 words), of which only a few dozen make sense to human inspection. The challenge is to reduce a billion possibilities to the few dozen that make sense. hakia uses OntoSem technology to meet this challenge.
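To make the size of that candidate space concrete, here is a minimal sketch, and not hakia's algorithm (the OntoSem-based pruning is precisely the part left out): it merely counts the order-preserving subsequences of up to 6 of the significant words, already 246 candidates for 8 words before any reorderings or semantic variants are considered.

    #include <iostream>
    #include <string>
    #include <vector>

    // Recursively extend 'current' with words that follow position 'start',
    // counting every order-preserving subsequence of length 1..maxLen.
    static void enumerate(const std::vector<std::string>& words,
                          size_t start,
                          std::vector<std::string>& current,
                          size_t maxLen,
                          size_t& count) {
        if (!current.empty())
            ++count;            // one candidate; a real system would test here
                                // whether the sequence is a meaningful query
        if (current.size() == maxLen)
            return;
        for (size_t i = start; i < words.size(); ++i) {
            current.push_back(words[i]);
            enumerate(words, i + 1, current, maxLen, count);
            current.pop_back();
        }
    }

    int main() {
        std::vector<std::string> significant = {"w1", "w2", "w3", "w4",
                                                "w5", "w6", "w7", "w8"};
        std::vector<std::string> current;
        size_t count = 0;
        enumerate(significant, 0, current, 6, count);
        std::cout << significant.size() << " significant words -> " << count
                  << " order-preserving subsequences of length 1-6\n";
        return 0;
    }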
Semantic Rank
– A pool of relevant paragraphs comes from the QDEX system for the given query terms
– Final relevancy is determined by an advanced sentence analysis and a concept match between the query and the best sentence of each paragraph
– Morphological and syntactic analyses are also performed
– No keyword matching or Boolean algebra is involved
– The credibility and age of the Web page are also taken into account

Powerset

Powerset Demo
NL questions on Wikipedia:
– What companies did IBM acquire?
– Which company did IBM acquire in 1989?

Google query on Wikipedia
– Same queries, poorer results
– Try yourself:
  • Who acquired IBM?
  • IBM acquisitions 1996
  • IBM acquisitions
  • What do liberal democrats say about healthcare (1.4 million matches)

Problems
The parser from Xerox is a quite sophisticated constituent parser:
– it produces all possible parse trees
– it is fairly slow
Workaround: index only the most relevant portion of the Web.

Reality

Semantic Document Analysis
– Question Answering: return precise answers to natural language queries
– Relation Extraction
– Intent Mining: assess the attitude of the document author with respect to a given subject
  • Opinion Mining: the attitude is a positive or negative opinion

Semantic Retrieval Approaches
Used in QA, opinion retrieval, etc. The typical two-stage approach:
1. Perform IR and rank by topic relevance
2. Postprocess the results with filters and rerank
Generally slow: it requires several minutes to process each query.

Single-stage approach
– Enrich the index with opinion tags
– Perform normal retrieval with a custom ranking function
Proved effective at the TREC 2006 Blog Opinion Mining Task.

Enriched Index for TREC Blog
Overlay words with tags.
[Figure: words of sample blog sentences at index positions 1-5, with tags such as NEGATIVE (e.g. on "lame" and "weak") and ART overlaid at the positions of the corresponding words]
Enhanced queries:
– music NEGATIVE:lame
– music NEGATIVE:*
Achieved the 3rd best P@5 at the TREC Blog Track 2006.

Enriched Inverted Index

Inverted Index
– Stored compressed: ~1 byte per term occurrence
– Efficient intersection operation:
  • O(n), where n is the length of the shortest postings list
  • using skip lists further reduces the cost
– Size: ~1/8 of the original text

Small Adaptive Set Intersection
[Figure: posting lists for "world" (3, 9, 12, 20, 40, 47), "wide" (1, 8, 10, 25, 40, 41) and "web" (2, 4, 6, 21, 30, 35, 40); the only document common to all three lists is 40]

IXE Search Engine Library
– C++ OO architecture
– Fast indexing: sort-based inversion
– Fast search:
  • efficient algorithms and data structures
  • query compiler with Small Adaptive Set Intersection
  • suffix array with supra index
  • memory-mapped index files
– Programmable API library
– Template metaprogramming
– Object Store database

IXE Performance
TREC TeraByte 2005:
– 2nd fastest
– 2nd best P@5

Query Processing
– Query compiler:
  • one cursor on the posting lists for each query node
  • CursorWord, CursorAnd, CursorOr, CursorPhrase
– QueryCursor.next(Result& min) returns the first result r >= min
– A single operator for all kinds of queries, e.g. proximity

IXE Composability
[Class diagram: DocInfo (name, date, size), PassageDoc (text, boundaries), Collection<DocInfo>, Collection<PassageDoc>, and the cursor hierarchy Cursor, QueryCursor, PassageQueryCursor, each providing next()]

Passage Retrieval
– Documents are split into passages
– Matches are searched for in passages (± n nearby)
– Results are ranked passages
– Efficiency requires a special store for passage boundaries

QA Using Dependency Relations
– Build dependency trees for both question and answer
– Determine the similarity of corresponding paths in the dependency trees of question and answer
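A minimal sketch of this kind of matching, under simplifying assumptions: dependency relations are represented as <dependent, head, relation> triples like the PiQASso ones shown below, and similarity is just the number of question triples that also appear in the answer (with the wh-phrase treated as a wildcard). PiQASso's actual pipeline adds answer-type checking, distance filtering and popularity ranking; all names in the code are illustrative.

    #include <cstdio>
    #include <string>
    #include <vector>

    // A dependency relation as a <dependent, head, relation> triple,
    // in the spirit of the PiQASso triples below, e.g. <tungsten, has, subj>.
    struct Relation {
        std::string dependent;  // e.g. "tungsten"
        std::string head;       // e.g. "has"
        std::string relation;   // e.g. "subj"
    };

    // Count how many question relations also appear among the answer relations;
    // "?" in the question stands for the wh-phrase and matches any dependent.
    static int sharedRelations(const std::vector<Relation>& question,
                               const std::vector<Relation>& answer) {
        int shared = 0;
        for (const Relation& q : question) {
            for (const Relation& a : answer) {
                bool dependentOk = q.dependent == "?" || q.dependent == a.dependent;
                if (dependentOk && q.head == a.head && q.relation == a.relation) {
                    ++shared;
                    break;
                }
            }
        }
        return shared;
    }

    int main() {
        // "What metal has the highest melting point?"
        std::vector<Relation> question = {
            {"?",     "has", "subj"},   // the wh-phrase is the subject of "has"
            {"point", "has", "obj"},
        };
        // "Tungsten is a very dense material and has the highest
        //  melting point of any metal."
        std::vector<Relation> answer = {
            {"tungsten", "material", "pred"},
            {"tungsten", "has",      "subj"},
            {"point",    "has",      "obj"},
        };
        std::printf("shared relations: %d of %zu\n",
                    sharedRelations(question, answer), question.size());
        return 0;
    }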
PiQASso Answer Matching
[Diagram: the answer-matching pipeline, shown for the question "What metal has the highest melting point?" against the candidate sentence "Tungsten is a very dense material and has the highest melting point of any metal.":
1. Parsing
2. Answer type check (expected type SUBSTANCE)
3. Relation extraction (e.g. <tungsten, material, pred>, <tungsten, has, subj>, <point, has, obj>, ...)
4. Matching distance
5. Distance filtering
6. Popularity ranking
yielding the answer "Tungsten"]

QA Using Dependency Relations
– Further developed by Cui et al. at NUS
– Score computed by a statistical translation model
– Second best at TREC 2004

Wikipedia Experiment
Tagged Wikipedia with:
– POS
– LEMMA
– NE (WSJ, IEER)
– WordNet Super Senses
– Anaphora
– Parsing (head, dependency)

Tools Used
– SST tagger [Ciaramita & Altun]
– DeSR dependency parser [Attardi & Ciaramita]
  • fast: 200 sentences/sec
  • accurate: 90% UAS

Dependency Parsing
– Produces dependency trees
– Word-word dependency relations
– Far easier to understand and to annotate
[Figure: dependency tree of "Rolls-Royce Inc. said it expects its sales to remain steady", with arcs labeled SUBJ, OBJ, MOD and TO]

Classifier-based Shift-Reduce Parsing
[Figure: parser configuration over "He/PP saw/VVD a/DT girl/NN with/IN a/DT telescope/NNS ./SENT", with the top of the stack and the next input token marked and the possible actions Left, Right, Shift]

CoNLL 2007 Results
Language    UAS    LAS
Catalan     92.20  87.64
Chinese     86.73  86.86
English     86.99  85.85
Italian     85.54  81.34
Czech       83.40  77.37
Turkish     83.56  76.87
Arabic      82.53  72.66
Hungarian   81.81  76.81
Greek       80.75  73.92
Basque      76.86  69.84

EvalIta 2007 Results (best statistical parser)
Collection   UAS    LAS
Cod. Civile  91.37  79.13
Newspaper    85.49  76.62

Experiment
Experimental data sets:
– Wikipedia
– Yahoo! Answers

English Wikipedia Indexing
– Original size: 4.4 GB
– Number of articles: 1,400,000
– Tagging time: ~3 days (6 days with previous tools)
– Parsing time: 40 hours
– Indexing time: 9 hours (8 days with UIMA + Lucene)
– Index size: 3 GB
– Metadata: 12 GB

Scaling Indexing
– Highly parallelizable
– Using Hadoop in streaming mode

Example (partial)
TERM      POS   LEMMA     WNSS                HEAD  DEP
The       DT    the       0                   2     NMOD
Tories    NNPS  tory      B-noun.person       3     SUB
won       VBD   win       B-verb.competition  0     VMOD
this      DT    this      0                   5     NMOD
election  NN    election  B-noun.act          3     OBJ

Stacked View
        1     2              3                   4     5
TERM    The   Tories         won                 this  election
POS     DT    NNPS           VBD                 DT    NN
LEMMA   the   tory           win                 this  election
WNSS    0     B-noun.person  B-verb.competition  0     B-noun.act
HEAD    2     3              0                   5     3
DEP     NMOD  SUB            VMOD                NMOD  OBJ

Implementation
– A special version of Passage Retrieval
– Tags are overlaid on words (a minimal sketch appears after the proximity-query examples below):
  • treated as terms at the same position as the corresponding word
  • not counted, to avoid skewing TF/IDF
  • given an ID in the lexicon
– Retrieval is fast: a few msec per query on a 10 GB index
– Provided as both a Linux library and a Windows DLL

Java Interface
– Generated using SWIG
– Results accessible through a ResultIterator
– The list of terms or tags for a sentence is generated on demand

Proximity queries
Did France win the World Cup?
  proximity 15 [MORPH/win:* DEP/SUB:france 'world cup']
Results:
– Born in the French territory of New Caledonia, he was a vital player in the French team that won the 1998 World Cup and was on the squad, but played just one game, as France won Euro 2000.
– France repeated the feat of Argentina in 1998, by taking the title as they won their home 1998 World Cup, beating Brazil.
– Both England (1966) and France (1998) won their only World Cups whilst playing as host nations.

Proximity queries
Who won the World Cup in 1998?
  proximity 13 [MORPH/win:* DEP/SUB:* 'world cup' WSJ/DATE:1998]
Results:
– With the French national team, Dugarry won World Cup 1998 and Euro 2000.
– He captained Arsenal and won the World Cup with France in 1998.

Did France win the World Cup in 2002?
  proximity 30 [MORPH/win:* DEP/SUB:france 'world cup' WSJ/DATE:2002]
No result.

Who won it in 2002?
  proximity 6 [MORPH/win:* DEP/SUB:* 'world cup' 2002]
Results:
– He has 105 caps for Brazil, and helped his country win the World Cup in 2002 after finishing second in 1998.
– 2002 - Brazil wins the Football World Cup becoming the first team to win the trophy 5 times
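The MORPH/, WNSS/, DEP/ and WSJ/ prefixes in these queries are the overlaid tags from the Implementation slide above. As a minimal sketch of how a tagged sentence might be turned into postings (an assumed, simplified posting format, not IXE's actual indexing code), the tokens of the Stacked View can be emitted as word terms plus tag terms at the same position:

    #include <cstdio>
    #include <string>
    #include <vector>

    // One token of the Stacked View above; the HEAD column is omitted here,
    // since head positions go into a separate HEADS posting list (see below).
    struct Token {
        std::string form, pos, lemma, wnss, dep;
    };

    struct Posting {
        std::string term;
        int position;
    };

    // Emit one posting for the word and one for each overlaid tag, all at the
    // same position; an enriched index would mark the tag postings so that
    // they do not contribute to TF/IDF.
    static void emitPostings(const std::vector<Token>& sentence,
                             std::vector<Posting>& out) {
        for (size_t i = 0; i < sentence.size(); ++i) {
            const Token& t = sentence[i];
            int pos = static_cast<int>(i) + 1;
            out.push_back({t.form, pos});
            out.push_back({"MORPH/" + t.lemma, pos});
            if (t.wnss != "0")
                out.push_back({"WNSS/" + t.wnss, pos});
            out.push_back({"DEP/" + t.dep, pos});
        }
    }

    int main() {
        std::vector<Token> sentence = {
            {"The",      "DT",   "the",      "0",                  "NMOD"},
            {"Tories",   "NNPS", "tory",     "B-noun.person",      "SUB"},
            {"won",      "VBD",  "win",      "B-verb.competition", "VMOD"},
            {"this",     "DT",   "this",     "0",                  "NMOD"},
            {"election", "NN",   "election", "B-noun.act",         "OBJ"},
        };
        std::vector<Posting> postings;
        emitPostings(sentence, postings);
        for (const Posting& p : postings)
            std::printf("%-28s @ %d\n", p.term.c_str(), p.position);
        return 0;
    }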
Dependency Queries
  deprel [ pattern headPattern ]
Semantics: the clause matches any document that contains a match for pattern whose head matches headPattern.
Implementation:
– search for pattern
– for each match at (doc, pos), find h = head(doc, pos)
– find a match for headPattern at (doc, h ± 2)

Finding heads
How to find head(doc, pos)? Solution: store the HEAD positions in a special posting list. A posting list stores the positions where a term occurs in a document; the HEADS posting list stores the head of each term in a document.

Finding Heads
To retrieve head(doc, pos), one accesses the HEADS posting list for doc and extracts the pos-th item. Posting lists are efficient since they are stored compressed on disk and accessed through memory mapping. (A minimal sketch of this evaluation strategy appears at the end of these notes.)

Dependency Paths
  deprel [ pattern0 pattern1 … patterni ]
Note: opposite direction from XPath.

Multiple Tags
  DEP/SUB:MORPH/insect:*

Dependency queries
Who won the elections?
  deprel [ election won ]
  deprel [ DEP/OBJ:election MORPH/win:* ]
Result:
– The Scranton/Shafer team won the election over Philadelphia mayor Richardson Dilworth and Shafer became the state's lieutenant ...

What are the causes of death?
  deprel [ from MORPH/die:* ]
Results:
– She died from throat cancer in Sherman Oaks, California.
– Wilson died from AIDS.

Demo
Deep Search on Wikipedia:
– Web interface
– Queries with tags and deprel
Browsing of Deep Search results:
– Sentences are collected
– A graph of sentences/entities is created using WebGraph [Boldi-Vigna]
– Results are clustered by the most frequent entities

Issues
– Dependency relations are crude for English (30 in total): SUB, OBJ, NMOD, ...
– Better for Catalan (168 relations): they distinguish time/location/cause adverbials
– The relation might not be direct, e.g. "die from cancer"
– Queries can't express the SUB/OBJ relationship

Semantic Relations?
– "The movie is not a masterpiece": Target-Opinion
– Or a few general relation types?
  • Directly/Indirectly
  • Affirmative/Negative/Dubitative
  • Active/Passive

Translating Queries
– Compile the NL query into query syntax
– Learn from examples, e.g. Yahoo! Answers
– Generic quadruple V S O M (Verb, Subject, Object, Mode)
– Support searching for quadruples
– Rank based on distance

Related Work
– Chakrabarti proposes to use proximity queries on a Web index built with Lucene and UIMA

We are hiring
Three projects are starting:
1. Semantic Search on the Italian Wikipedia: 2 research grants (Fond. Cari Pisa)
2. Deep Search: 2 postdocs (Yahoo! Research)
3. Machine Translation: 2 postdocs

Questions / Discussion
– Are there other uses than search?
  • better query refinement
  • semantic clustering
  • Vanessa Murdock's aggregated result visualization
– Interested in getting access to the resource for experimentation?
– Should relation types be learned?
– Will it scale?
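As promised under Finding Heads, here is a minimal sketch of the deprel evaluation strategy (search for pattern, look up the head position in the HEADS posting list, then match headPattern within two positions of that head). The in-memory Document structure and all names are illustrative assumptions, not the IXE implementation.

    #include <cstdio>
    #include <map>
    #include <string>
    #include <vector>

    // One indexed document: term -> positions (1-based), plus the HEADS list,
    // where heads[i-1] is the head position of the token at position i
    // (0 for the root), mirroring the HEADS posting list described above.
    struct Document {
        std::map<std::string, std::vector<unsigned>> postings;
        std::vector<unsigned> heads;
    };

    // head(doc, pos): extract the pos-th item of the HEADS list.
    static unsigned headOf(const Document& doc, unsigned pos) {
        return (pos >= 1 && pos <= doc.heads.size()) ? doc.heads[pos - 1] : 0;
    }

    // deprel [ pattern headPattern ]: the document matches if some occurrence
    // of 'pattern' has a head whose position is within +-2 of an occurrence of
    // 'headPattern' (the slack mirrors the "h +- 2" step above).
    static bool deprelMatch(const Document& doc,
                            const std::string& pattern,
                            const std::string& headPattern) {
        auto p = doc.postings.find(pattern);
        auto h = doc.postings.find(headPattern);
        if (p == doc.postings.end() || h == doc.postings.end())
            return false;
        for (unsigned pos : p->second) {
            unsigned head = headOf(doc, pos);
            if (head == 0)
                continue;                      // root has no head
            for (unsigned hpos : h->second)
                if (hpos + 2 >= head && hpos <= head + 2)
                    return true;
        }
        return false;
    }

    int main() {
        // "The Tories won this election", indexed with lowercased word terms
        // only, for brevity.
        Document doc;
        doc.postings = {
            {"the", {1}}, {"tories", {2}}, {"won", {3}},
            {"this", {4}}, {"election", {5}},
        };
        doc.heads = {2, 3, 0, 5, 3};
        std::printf("deprel [ election won ]: %s\n",
                    deprelMatch(doc, "election", "won") ? "match" : "no match");
        return 0;
    }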