On Beyond Hypertext: Searching in Graphs Containing Documents, Words, and Data

William W. Cohen
Machine Learning Department + Language Technology Institute + Center for Bioimage Informatics + Joint CMU-Pitt Program in Bioinformatics
Carnegie Mellon University

joint work with: Einat Minkov (CMU), Andrew Ng (Stanford)

Outline
• Motivation: why I'm interested in
  – structured data that is partly text;
  – structured data represented as graphs;
  – measuring similarity of nodes in graphs
• Contributions:
  – a simple query language for graphs;
  – experiments on natural types of queries;
  – techniques for learning to answer queries of a certain type better

"A Little Knowledge is A Dangerous Thing" [A. Pope, 1709]
• Three centuries later, we've learned that a lot of knowledge is also sort of dangerous...
  ... so how do we deal with information overload?

One approach: adding structure to unstructured information
[news article, October 14, 2002, 4:00 a.m. PT, annotated by an IE system]
... by recognizing entity names... "For years, Microsoft Corporation CEO Bill Gates railed against the economic philosophy of open-source software with Orwellian fervor, denouncing its communal licensing as a 'cancer' that stifled technological innovation."
... and relationships between them... "Today, Microsoft claims to 'love' the open-source concept, by which software code is made public to encourage improvement and development by outside programmers. Gates himself says Microsoft will gladly disclose its crown jewels--the coveted code behind the Windows operating system--to select customers. 'We can be open source. We love the concept of shared source,' said Bill Veghte, a Microsoft VP. 'That's a super-important shift for us in terms of code access.' Richard Stallman, founder of the Free Software Foundation, countered saying..."

IE output:
  NAME               TITLE     ORGANIZATION
  Bill Gates         CEO       Microsoft
  Bill Veghte        VP        Microsoft
  Richard Stallman   founder   Free Soft..

One approach: adding structure to unstructured information — further examples:
  [Carvalho & Cohen, SIGIR 05; Cohen, Carvalho & Mitchell, EMNLP 04]
  [Mitchell et al, CEAS 2004]
  [McCallum et al, IJCAI 05]

Is converting unstructured data to structured data enough?

Limitations of structured data
Example information needs: "What is the email address for the person named 'Halevy' mentioned in this presentation?"  "What files from my home machine will I need for this meeting?"  "What people will attend this meeting?"
• Diversity: many different types of information from many different sources, that arise to fill many different needs.
• Uncertainty: information from many sources (like IE programs or the web) need not be correct.
• Complexity of interaction: formulating 'information needs' as structured queries to a DB can be difficult... especially for a heterogeneous DB with a complex/changing schema.
  – How do you discover & access the tens or hundreds of databases?
  – How do you understand & combine the hundreds of schemata, with thousands of fields?
  – How can you include many diverse sources of information in a single database?
  – How do you relate the thousands or millions of entity identifiers from the different databases? When are two entities the same? When is referent(oid1) = referent(oid2)?

When are two entities the same?
One "entity", many names:
• Bell Labs [1925]
• Bell Telephone Labs
• AT&T Bell Labs
• AT&T Labs
• AT&T Labs—Research
• AT&T Labs Research, Shannon Laboratory
• Shannon Labs
• Bell Labs Innovations
• Lucent Technologies/Bell Labs Innovations
"History of Innovation: From 1925 to today, AT&T has attracted some of the world's greatest scientists, engineers and developers..." [www.research.att.com]
"Bell Labs Facts: Bell Laboratories, the research and development arm of Lucent Technologies, has been operating continuously since 1925..." [bell-labs.com]
Is there a definition of 'entity identity' that is user- and purpose-independent?

When are two entities the same?
"Buddhism rejects the key element in folk psychology: the idea of a self (a unified personal identity that is continuous through time)... King Milinda and Nagasena (the Buddhist sage) discuss... personal identity... Milinda gradually realizes that 'Nagasena' (the word) does not stand for anything he can point to: ... not... the hairs on Nagasena's head, nor the hairs of the body, nor the 'nails, teeth, skin, muscles, sinews, bones, marrow, kidneys, ...' etc... Milinda concludes that 'Nagasena' doesn't stand for anything... If we can't say what a person is, then how do we know a person is the same person through time? ... There's really no you, and if there's no you, there are no beliefs or desires for you to have... The folk psychology picture is profoundly misleading and believing it will make you miserable." –S. LaFave

Traditional approach: Linkage Queries
• Uncertainty about what to link must be decided by the integration system, not the end user.

WHIRL vision:
  SELECT R.a,S.a,S.b,T.b FROM R,S,T WHERE R.a=S.a and S.b=T.b
• Link items as needed by the query Q.
• Example links for Q (columns R.a | S.a | S.b | T.b):
  – Strongest links (those agreeable to most users): Anhai | Anhai | Doan | Doan;  Dan | Dan | Weld | Weld
  – Weaker links (those agreeable to some users): William | Will | Cohen | Cohn;  Steve | Steven | Minton | Mitton
  – Even weaker links: William | David | Cohen | Cohn

WHIRL vision: DB1 + DB2 ≠ DB
  SELECT R.a,S.a,S.b,T.b FROM R,S,T WHERE R.a~S.a and S.b~T.b   (~ means TFIDF-similar)
• Link items as needed by the query Q.
• Incrementally produce a ranked list of possible links, with "best matches" first (the same example links as above, ranked from strongest to weakest). The user (or downstream process) decides how much of the list to generate and examine.
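The "~" join above relies on a general-purpose text similarity measure. Below is a minimal sketch of a TF-IDF/cosine soft join that returns a ranked list of candidate links, best matches first. The tokenization, the particular weighting scheme, and the example strings (name variants from the Bell Labs slide) are illustrative assumptions, not the WHIRL implementation.

```python
import math
from collections import Counter

def _tokens(s):
    return s.lower().replace("&", " ").split()

def tfidf_vectors(all_strings):
    """Simple TF-IDF vectors over a shared vocabulary (illustrative weighting)."""
    docs = [_tokens(s) for s in all_strings]
    n = len(docs)
    df = Counter(t for d in docs for t in set(d))
    vecs = []
    for d in docs:
        tf = Counter(d)
        v = {t: (1 + math.log(c)) * math.log((n + 1) / df[t]) for t, c in tf.items()}
        norm = math.sqrt(sum(w * w for w in v.values())) or 1.0
        vecs.append({t: w / norm for t, w in v.items()})
    return vecs

def cosine(u, v):
    return sum(w * v.get(t, 0.0) for t, w in u.items())

def soft_join(left, right, k=5):
    """Rank candidate links between two lists of strings, best matches first."""
    vecs = tfidf_vectors(left + right)
    lv, rv = vecs[:len(left)], vecs[len(left):]
    pairs = [(round(cosine(lv[i], rv[j]), 3), left[i], right[j])
             for i in range(len(left)) for j in range(len(right))]
    return sorted(pairs, reverse=True)[:k]

# Name variants from the talk; the user decides where to cut off the ranked list.
print(soft_join(["AT&T Bell Labs", "Bell Labs Innovations"],
                ["Bell Telephone Labs", "AT&T Labs Research", "Shannon Labs"]))
```

Note that pairs like Cohen/Cohn, which share no whole token, need a softer string distance (e.g. the Soft/TFIDF measure mentioned later in the talk, which combines TF-IDF weighting with a per-token Jaro-style similarity).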
Outline (revisited)
• Motivation: structured data that is partly text — similarity!; structured data represented as graphs; measuring similarity of nodes in graphs.
• Contributions: a simple query language for graphs; experiments on natural types of queries; techniques for learning to answer queries of a certain type better.
Take-away so far: there are general-purpose, fast, robust similarity measures for text, which are useful for data integration... and hence for combining information from multiple sources.

Limitations of structured data (revisited)
Example information needs: the email address for "Halevy", the files needed for a meeting, the people who will attend a meeting.
• Diversity: many different types of information from many different sources, that arise to fill many different needs.
• Uncertainty: information from many sources (like IE programs or the web) need not be correct.
• Complexity of interaction: formulating 'information needs' as queries to a DB can be difficult... especially a heterogeneous one.
How can you exploit structure without understanding the structure?

Schema-free structured search
• DataSpot (DTL)/Mercado Intuifind [VLDB 98]
• Proximity Search [VLDB 98]
• Information units (linked Web pages) [WWW10]
• Microsoft DBExplorer, Microsoft English Query
• BANKS (Browsing ANd Keyword Search) [Chakrabarti & others, VLDB 02, VLDB 05]

BANKS: Basic Data Model
• Database is modeled as a graph
  – Nodes = tuples
  – Edges = references between tuples
    • edges are directed
    • foreign key, inclusion dependencies, ...
• User need not know the organization of the database to formulate queries.

BANKS: Keyword search
[diagram: the paper "MultiQuery Optimization" connected by writes/author edges to Charuta, S. Sudarshan, and Prasan Roy]

BANKS: Answer to Query
Query: "sudarshan roy"
Answer: a subtree from the graph — the paper "MultiQuery Optimization" connected by writes/author edges to S. Sudarshan and Prasan Roy.

A not-quite-so-basic data model
• All information is modeled as a graph
  – Nodes = tuples or documents or strings or words
  – Edges = references between nodes
    • edges are directed, labeled and weighted
    • foreign key, inclusion dependencies, ...
    • doc/string D to word contained by D (TFIDF weighted, perhaps)
    • word W to doc/string containing W (inverted index)
    • [string S to strings 'similar to' S]

Outline (revisited)
• Motivation: structured data that is partly text — similarity!; structured data represented as graphs — all sorts of information can be poured into this model; next, measuring similarity of nodes in graphs.

Yet another schema-free query language
• Assume data is encoded in a graph with:
  – a node for each object x
  – a type for each object x, T(x)
  – an edge for each binary relation r: x → y
• Node similarity: queries are of this form — given type t* and node x, find y such that T(y) = t* and y ~ x.
• We'd like to construct a general-purpose similarity function x ~ y for objects in the graph.
• We'd also like to learn many such functions for different specific tasks (like "who should attend a meeting").

Similarity of Nodes in Graphs
Given type t* and node x, find y such that T(y) = t* and y ~ x.
• Similarity defined by a "damped" version of PageRank.
• Similarity between nodes x and y — "random surfer model": from a node z,
  – with probability α, stop and "output" z
  – pick an edge label r using Pr(r | z) ... e.g. uniform
  – pick y uniformly from { y' : z → y' with label r }
  – repeat from node y ...
• Similarity x ~ y = Pr( "output" y | start at x ).
• Intuitively, x ~ y is the sum of the weights of all paths from x to y, where the weight of a path decreases exponentially with its length.

Example: [diagram] the strings "William W. Cohen, CMU" and "Dr. W. W. Cohen" are both linked to word nodes (cohen, dr, william, w, cmu), so they are connected by many length-2 paths through the words they share. The [string S to strings 'similar to' S] edges are therefore optional — strings that are similar in TFIDF/cosine distance will already be "nearby" in the graph.
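A minimal sketch of the damped ("lazy") random walk described above, computed exactly by propagating a probability distribution for a bounded number of steps. The graph encoding (node → {edge label → neighbors}), the uniform choice of labels and neighbors, and the helper names are assumptions for illustration; this is not the GHIRL implementation.

```python
from collections import defaultdict

def walk_similarity(graph, start, alpha=0.5, k=10):
    """Similarity x~y = Pr("output" y | start at x) under the lazy walk:
    at each step, stop with probability alpha and output the current node;
    otherwise pick an edge label uniformly, then a neighbor uniformly under
    that label.  graph maps node -> {edge_label -> [neighbor, ...]}."""
    walking = {start: 1.0}       # probability mass still walking
    output = defaultdict(float)  # probability mass already "output"
    for _ in range(k):
        nxt = defaultdict(float)
        for z, p in walking.items():
            output[z] += alpha * p
            labels = graph.get(z, {})
            if not labels:                       # dead end: remaining mass stops here
                output[z] += (1.0 - alpha) * p
                continue
            p_label = (1.0 - alpha) * p / len(labels)
            for neighbors in labels.values():
                for y in neighbors:
                    nxt[y] += p_label / len(neighbors)
        walking = nxt
    for z, p in walking.items():                 # finite walk: emit whatever is left
        output[z] += p
    return dict(output)

def typed_query(graph, node_type, t_star, x, top=5):
    """Given type t* and node x, rank nodes y with T(y) = t* by y ~ x."""
    sims = walk_similarity(graph, x)
    hits = [(score, y) for y, score in sims.items()
            if node_type.get(y) == t_star and y != x]
    return sorted(hits, reverse=True)[:top]
```

This exact propagation touches on the order of K·b·N node-edge pairs in the worst case, which is why the talk later turns to sampling approximations for large graphs.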
Similarity of Nodes in Graphs (continued)
• Random surfer on graphs:
  – a natural extension of PageRank
  – closely related to Lafferty's heat diffusion kernel, but generalized to directed graphs
  – somewhat amenable to learning the parameters of the walk (gradient search, with various optimization metrics):
    • Toutanova, Manning & Ng, ICML 2004
    • Nie et al, WWW 2005
    • Xi et al, SIGIR 2005
  – can be sped up and adapted to longer walks by sampling approaches to matrix multiplication (e.g. Lewis & E. Cohen, SODA 1998), similar to particle filtering
  – our current implementation (GHIRL): Lucene + Sleepycat with extensive use of memory caching (sampling approaches visit many nodes repeatedly)

The BANKS example in this language
Query: "sudarshan roy"
Answer: a subtree from the graph — the paper "MultiQuery Optimization" connected by writes/author edges to S. Sudarshan and Prasan Roy. In the query language above this corresponds to a conjunction of typed similarity queries, e.g. y: paper(y) & y~"sudarshan" AND y~"roy".

Evaluation on Personal Information Management Tasks [Minkov et al, SIGIR 2006]
Many tasks can be expressed as simple, non-conjunctive search queries in this framework (recall the example needs: the email address for "Halevy", the files needed for a meeting, the people who will attend a meeting). Such as:
• Person name disambiguation in email (novel)
• Threading [e.g. Lewis & Knowles 97; Diehl, Getoor & Namata, 2006]
• Finding email-address aliases given a person's name (novel)
• Finding relevant meeting attendees (novel)
Also consider a generalization: replace the single start node x with Vq, a distribution over nodes.

Email as a graph
[diagram] Node types include files (messages), terms, dates, email addresses, and person names. Edge types include sent-from, sent-to, sent-date, alias, in-file, in-subject (and their inverses: sf_inv, st_inv, sd_inv, a_inv, if_inv, is_inv), plus +1_day edges between consecutive date nodes.

Person Name Disambiguation
[diagram: the term "andy" appears in several files, which connect to the person "Andrew Johns"]
Q: "who is Andy?"
• Given: a term that is known to be a personal name but is not mentioned 'as is' in a header (otherwise the problem is easy).
• Output: ranked person nodes.
* This task is complementary to person name annotation in email (E. Minkov, R. Wang, W. Cohen, Extracting Personal Names from Emails: Applying Named Entity Recognition to Informal Text, HLT/EMNLP 2005).

Corpora and Datasets
(a) corpora; (b) types of names. [tables] Example nicknames: Dave for David, Kai for Keiko, Jenny for Qing.

Person Name Disambiguation
1. Baseline: string matching (& common nicknames)
   • Find persons whose names are similar to the name term (Jaro).
   • Successful in many cases, but not for some nicknames.
   • Cannot handle ambiguity (arbitrary choice).
2. Graph walk: term
   • Vq: the name term node (2 steps).
   • Models co-occurrences.
   • Cannot handle ambiguity (picks the dominant person).
3. Graph walk: term + file
   • Vq: the name term node + the file node (2 steps); see the sketch after this list.
   • The file node is naturally available context: solves the ambiguity problem!
   • But incorporates additional noise.
4. Graph walk: term + file, reranked using learning. Rerank the output of (3) using:
   • path-describing features
   • 'source count': do the paths originate from a single source node or from two?
   • string similarity
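Method (3) starts the walk from a distribution Vq over two seed nodes rather than from a single node. By linearity this is just a weighted combination of single-source walks, so it can reuse the walk_similarity sketch above. The tiny email graph, node names, and weights below are illustrative stand-ins (following the edge types of the "Email as a graph" slide), not the actual corpus.

```python
from collections import defaultdict
# assumes walk_similarity from the earlier sketch is in scope

def walk_from_distribution(graph, vq, alpha=0.5, k=2):
    """Pr(output y | start drawn from Vq) = sum_x Vq(x) * Pr(output y | start x)."""
    out = defaultdict(float)
    for x, weight in vq.items():
        for y, s in walk_similarity(graph, x, alpha=alpha, k=k).items():
            out[y] += weight * s
    return dict(out)

# Toy graph for "who is Andy?" in the context of message msg1 (hypothetical data).
node_type = {"msg1": "file", "andy": "term",
             "aj@example.com": "email-address", "Andrew Johns": "person"}
email_graph = {
    "msg1": {"has-term": ["andy"], "sent-from": ["aj@example.com"]},
    "andy": {"has-term-inv": ["msg1"]},
    "aj@example.com": {"alias": ["Andrew Johns"], "sent-from-inv": ["msg1"]},
    "Andrew Johns": {"alias-inv": ["aj@example.com"]},
}

vq = {"andy": 0.5, "msg1": 0.5}       # seed the walk with the name term and the file
scores = walk_from_distribution(email_graph, vq, k=3)
people = sorted(((s, y) for y, s in scores.items()
                 if node_type.get(y) == "person"), reverse=True)
# people now contains "Andrew Johns" with a positive score: the file node supplies
# the context that links the nickname term to a specific person.
```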
Results
[charts] Accuracy of the baseline (string match + nicknames), the graph walk from the name term, the graph walk from {name, file}, and the results after learning-to-rank, on the evaluation corpora (including the Enron executives' mailboxes).

Learning
• There is no single "best" measure of similarity: how can you learn to better rank graph nodes for a particular task?
• Learning methods for graph walks:
  – The parameters of the walk can be adjusted using gradient descent methods (Diligenti et al, IJCAI 2005).
  – We explored a node re-ranking approach, which can take advantage of a wider range of features (and is complementary to parameter tuning).
• Features of a candidate answer y describe the set of paths from the query x to y.

Re-ranking overview
Boosting-based reranking, following Collins and Koo (Computational Linguistics, 2005).
A training example includes:
  – a ranked list of li nodes,
  – each node represented through m features,
  – at least one known correct node.
Scoring function — a linear combination of the original score y~x and the features:
  score(y) = w0 · log Pr(y | Vq) + Σ_{j=1..m} wj · fj(y)
Find the weights w that minimize the boosted exponential loss:
  ExpLoss = Σ_i Σ_{k≥2} exp( score(y_{i,k}) − score(y_{i,1}) ),  where y_{i,1} is the correct node of example i.
This requires binary features and has a closed-form formula for finding the best feature and update (delta) in each iteration.

Path-describing Features
• The set of paths to a target node within k steps is recovered in full.
[diagram: a small graph over nodes x1...x5 unrolled for k = 0, 1, 2; e.g. the paths reaching x3 at k=2 include x2→x1→x3 and x4→x1→x3]
• 'Edge unigram' features: was edge type l used in reaching x from Vq?
• 'Edge bigram' features: were edge types l1 and l2 used (in that order) in reaching x from Vq?
• 'Top edge bigram' features: were edge types l1 and l2 used (in that order) in reaching x from Vq, among the top two highest-scoring paths?

Results
[charts]

Threading
Threading is an interesting problem because:
• There are often irregularities in thread structural information, so thread discourse should be captured using an intelligent approach (D.E. Lewis and K.A. Knowles, Threading email: A preliminary study, Information Processing and Management, 1997).
• Threading information can improve message categorization into topical folders (B. Klimt and Y. Yang, The Enron corpus: A new dataset for email classification research, ECML, 2004).
• Adjacent messages in a thread can be assumed to be among the most similar messages in the corpus; threading is therefore related to the general problem of finding similar messages in a corpus.
The task: given a message, retrieve the adjacent messages in its thread.

Some intuition
[diagram, built up over several slides] A message (file x) is connected to the adjacent messages in its thread through shared content, the social network of senders and recipients, and the timeline.

Threading: experiments
1. Baseline: TF-IDF similarity — treat all the available information (header & body) as text.
2. Graph walk: uniform — start from the file node, 2 steps, uniform edge weights.
3. Graph walk: random — start from the file node, 2 steps, random edge weights (best of 10).
4. Graph walk: reranked — rerank the output of (3) using the graph-describing features (sketched below).
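A minimal sketch of how the path-describing features and the reranking score above might be computed. The path representation (a list of edge-label sequences per candidate), the feature encoding, and the example weights are illustrative assumptions; the actual system learns the weights with boosting (Collins & Koo) rather than taking them as given.

```python
import math

def path_features(paths, top_paths=None):
    """Binary features from the edge-label sequences that reach a candidate node.

    paths: e.g. [["sent-from", "sent-to-inv"], ["has-term", "has-term-inv"]].
    Produces 'edge unigram' and 'edge bigram' features; 'top edge bigram'
    features are taken only from the two highest-scoring paths, if given."""
    feats = set()
    for p in paths:
        feats.update(("unigram", l) for l in p)
        feats.update(("bigram", l1, l2) for l1, l2 in zip(p, p[1:]))
    for p in (top_paths or [])[:2]:
        feats.update(("top-bigram", l1, l2) for l1, l2 in zip(p, p[1:]))
    return feats

def rerank_score(walk_score, feats, weights, w0=1.0):
    """score(y) = w0 * log(original walk score) + sum_j w_j * f_j(y)."""
    return w0 * math.log(walk_score) + sum(weights.get(f, 0.0) for f in feats)

# Hypothetical weights that prefer reply-like evidence (cf. the highly-ranked
# edge bigrams listed just below, e.g. sent-from followed by an inverse sent-to):
weights = {("bigram", "sent-from", "sent-to-inv"): 2.0,
           ("bigram", "has-term", "has-term-inv"): 0.5}
feats = path_features([["sent-from", "sent-to-inv"], ["has-term", "has-term-inv"]])
print(rerank_score(0.02, feats, weights))
```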
Results
[charts] Highly-ranked edge bigrams:
• sent-from → sent-to⁻¹
• date-of → date-of⁻¹
• has-term → has-term⁻¹

Finding Meeting Attendees
Extended graph contains 2 months of calendar data. [results figure]

Main Contributions
• Presented an extended similarity measure incorporating non-textual objects
• Finite lazy random walks to perform typed search
• A re-ranking paradigm to improve on graph-walk results
• Instantiation of this framework for email
• Defined and evaluated novel tasks for email

Another Task that Can Be Formulated as a Graph Query: GeneId-Ranking
• Given: a biomedical paper abstract
• Find: the geneId for every gene mentioned in the abstract
• Method: from paper x, produce a ranked list of geneIds y by x ~ y
• Background resources:
  – a "synonym list": geneId → { name1, name2, ... }
  – one or more protein NER systems
  – training/evaluation data: pairs of (paper, {geneId1, ..., geneIdn})

Sample abstracts and synonyms
[figure: an abstract shown with its true labels and the output of an NER extractor]
Example synonym-list entries:
• MGI:96273: Htr1a; 5-hydroxytryptamine (serotonin) receptor 1A; 5-HT1A receptor
• MGI:104886: Gpx5; glutathione peroxidase 5; Arep; ...
• 52,000+ entries for mouse, 35,000+ for fly

Graph for the task...
[diagram] Abstracts (e.g. file:doc115) are linked by hasProtein edges to extracted protein mentions ("HT1A", "CA1", "HT1"); mentions and synonym strings ("5-HT1A receptor", "Htr1a", "eIF-1A") are linked by hasTerm edges to shared term nodes (term:HT, term:1, term:A, term:hippocampus, term:CA, ...); terms are also linked back to the abstracts containing them (inFile); synonym strings are linked by synonym edges to geneIds (MGI:46273, MGI:95298, ...). A second version of the graph also links geneIds to the noisy training abstracts they label (file:doc214, file:doc523, ..., file:doc6273).
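A minimal sketch of how the layered graph just described could be assembled and queried with the walk_similarity sketch from earlier. The tokenization into letter/digit runs mirrors the term:HT, term:1, term:A nodes in the figure; the builder function, the tiny synonym list, and the single mention are illustrative stand-ins, not the Biocreative setup.

```python
import re
# assumes walk_similarity from the earlier sketch is in scope

def _terms(s):
    """Split a string into lowercased letter runs and digit runs
    (mirroring the term:HT, term:1, term:A nodes in the figure)."""
    return ["term:" + t.lower() for t in re.findall(r"[A-Za-z]+|\d+", s)]

def build_gene_graph(abstract_id, mentions, synonym_list):
    """abstract -hasProtein-> mention -hasTerm-> term <-hasTerm- synonym -synonym-> geneId
    (every edge also gets an inverse so the walk can move in both directions)."""
    g = {}
    def add(u, label, v):
        g.setdefault(u, {}).setdefault(label, []).append(v)
        g.setdefault(v, {}).setdefault(label + "-inv", []).append(u)

    for m in mentions:                       # NER output for this abstract
        add(abstract_id, "hasProtein", m)
        for t in _terms(m):
            add(m, "hasTerm", t)
    for gene_id, names in synonym_list.items():
        for name in names:
            add(name, "synonym", gene_id)
            for t in _terms(name):
                add(name, "hasTerm", t)
    return g

# Synonym entries taken from the slide; the mention list is a stand-in for real NER output.
synonyms = {"MGI:96273": ["Htr1a", "5-HT1A receptor"],
            "MGI:104886": ["Gpx5", "glutathione peroxidase 5"]}
g = build_gene_graph("file:doc115", ["HT1A"], synonyms)

scores = walk_similarity(g, "file:doc115", alpha=0.5, k=6)
ranked_gene_ids = sorted(((s, y) for y, s in scores.items()
                          if y.startswith("MGI:")), reverse=True)
# MGI:96273 outranks MGI:104886 because the mention "HT1A" shares more term
# nodes with its synonyms, i.e. more short paths carry probability mass to it.
```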
Experiments
• Data: Biocreative Task 1B
  – mouse: 10,000 training abstracts, 250 devtest (using the first 150 for now); 50,000+ geneIds; the graph has 525,000+ nodes
• NER systems:
  – likelyProtein: trained on yapex.train using off-the-shelf NER tools (Minorthird)
  – possibleProtein: same, but modified (on yapex.test) to optimize F3 rather than F1 (rewards recall over precision)

Experiments with NER

                     Token                  Span
                     Precision   Recall     Precision   Recall    F1
  yapex.test
    likely             94.9       64.8        87.2       62.1    72.5
    possible           49.0       97.4        47.2       82.5    60.0
  mouse
    likely             81.6       31.3        66.7       26.8    45.3
    possible           43.9       88.5        30.4       56.6    39.6
    dictionary         50.1       46.9        24.5       43.9    31.4

Experiments with Graph Search
• Baseline method:
  – extract entities of type x
  – for each string of type x, find the best-matching synonym, and then its geneId
    • consider only synonyms sharing >= 1 token
    • Soft/TFIDF distance
    • break ties randomly
  – rank geneIds by the number of times they are reached
    • rewards multiple mentions (even via alternate synonyms)
• Evaluation: average, over 50 test documents, of
  – non-interpolated average precision (plausible for curators)
  – max F1 over all cutoffs

Experiments with Graph Search (mouse dataset)

                                   MAP     maxF1
  likelyProtein + softTFIDF        45.0     58.1
  possibleProtein + softTFIDF      62.6     74.9
  graph walk                       51.3     64.3

Baseline vs Graph walk
• Baseline includes:
  – softTFIDF distances from NER entity to gene synonyms
  – knowledge that the "shortcut" path doc → entity → synonym → geneId is important
• Graph includes: IDF effects, correlations, training data, etc.
• Proposed graph extension: add softTFIDF and "shortcut" edges.
• Learning and reranking:
  – start with "local" features fi(e) of edges e = u → v
  – for an answer y, compute expectations E( fi(e) | start = x, end = y )
  – use the expectations as feature values and the voted perceptron (Collins, 2002) as the learning-to-rank method

Experiments with Graph Search (mouse dataset)

                                   MAP     maxF1
  likelyProtein + softTFIDF        45.0     58.1
  possibleProtein + softTFIDF      62.6     74.9
  graph walk                       51.3     64.3
  walk + extra links               73.0     80.7
  walk + extra links + learning    79.7     83.9

Hot off the presses
• Ongoing work: learn an NER system from pairs of (document, geneIdList)
  – much easier to obtain training data than documents in which every occurrence of every gene name is highlighted (the usual NER training data)
  – obtains F1 of 71.4 on mouse data (vs 45.3 by training on YAPEX data, which is from a different distribution)
  – joint work with Richard Wang, Bob Frederking, Anthony Tomasic

Experiments with Graph Search (mouse dataset)

                                   MAP (Yapex-trained)   MAP (MGI-trained)
  likelyProtein + softTFIDF               45.0                 72.7
  possibleProtein + softTFIDF             62.6                 65.7
  graph walk                              51.3                 54.4
  walk + extra links                      73.0                 76.7
  walk + extra links + learning           79.7                 84.2

Summary
• Contributions:
  – a very simple query language for graphs, based on a diffusion-kernel (damped PageRank, ...) similarity metric
  – experiments on natural types of queries:
    • finding likely meeting attendees
    • finding related documents (email threading)
    • disambiguating person and gene/protein entity names
  – techniques for learning to answer queries:
    • reranking using expectations of simple, local features
    • tuning performance to a particular "similarity"

Summary
• Some open problems:
  – scalability & efficiency:
    • a K-step walk on a node-node graph with fan-out b is O(KbN)
    • accurate sampling takes on the order of 1 minute for 10 steps with O(10^6) nodes (see the sketch below)
  – faster, better learning methods:
    • combine re-ranking with learning the parameters of the graph walk
  – add language modeling, topic modeling:
    • extend the graph to include models as well as data
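The scalability bullet above is what motivates the sampling approaches mentioned earlier in the talk ("similar to particle filtering"). A minimal Monte Carlo sketch under the same lazy-walk model as before; the walker count and helper names are illustrative assumptions, not the GHIRL implementation.

```python
import random
from collections import Counter

def sampled_walk(graph, start, alpha=0.5, k=10, n_walkers=2000, rng=random):
    """Monte Carlo estimate of Pr(output y | start x): run independent lazy walks
    and count where they stop.  Cost grows with n_walkers and k, not with the
    total number of nodes in the graph."""
    stops = Counter()
    for _ in range(n_walkers):
        z = start
        for _ in range(k):
            labels = graph.get(z)
            if not labels or rng.random() < alpha:
                break                              # stop and "output" z
            label = rng.choice(list(labels))       # uniform over edge labels
            z = rng.choice(labels[label])          # uniform over neighbors under that label
        stops[z] += 1                              # node where this walker stopped
    return {y: c / n_walkers for y, c in stops.items()}
```

Because most of the probability mass concentrates on a few nearby nodes, a modest number of walkers is usually enough to recover the top of the ranking, with accuracy degrading gracefully for low-probability nodes.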