Information Extraction and
Reasoning with
Heterogeneous Data
Machine Learning Department and Language Technologies Institute
School of Computer Science
Carnegie Mellon University

joint work with:
Einat Minkov, Andrew Ng, Richard Wang, Anthony Tomasic, Bob Frederking
– The True Names problem
– The schema problem
– The GHIRL system
– PIM
– Information extraction and BioText entity normalization
Dwar Ev ceremoniously soldered the final connection with gold. The eyes of a dozen television cameras watched him and the subether bore throughout the universe a dozen pictures of what he was doing.
He straightened and nodded to Dwar Reyn, then moved to a position beside the switch that would complete the contact when he threw it. The switch that would connect, all at once, all of the monster computing machines of all the populated planets in the universe - ninety-six billion planets - into the supercircuit that would connect them all into one supercalculator, one cybernetics machine that would combine all the knowledge of all the galaxies.
Dwar Reyn spoke briefly to the watching and listening trillions. Then after a moment’s silence he said, “Now, Dwar Ev.”
Dwar Ev threw the switch. There was a mighty hum, the surge of power from ninety-six billion planets. Lights flashed and quieted along the miles-long panel.
Dwar Ev stepped back and drew a deep breath. “The honour of asking the first questions is yours, Dwar Reyn.”
“Thank you,” said Dwar Reyn. “It shall be a question which no single cybernetics machine has been able to answer.”
He turned to face the machine. “Is there a God?”
The mighty voice answered without hesitation, without the clicking of a single relay.
“Yes, now there is a god.”
Sudden fear flashed on the face of Dwar Ev. He leaped to grab the switch.
A bolt of lightning from the cloudless sky struck him down and fused the switch shut.
‘Answer’ by Fredric Brown.
©1954, Angels and Spaceships
[Figure: the “ceremonial soldering” example. One path is query retrieval over the raw text. The other path runs information extraction (IE) over the text, producing facts: advisor(wc,vc), advisor(yh,tm), affil(wc,mld), affil(vc,lti), fn(wc,``William”), fn(vc,``Vitor”); inference over these facts AND the query X: advisor(wc,Y) & affil(X,lti)? yields the answer {X=em; X=vc}.]
Data Integration 101:
Queries
Problems:
1. Heterogeneous databases have different schemas.
2. Heterogeneous databases have different object identifiers.
Linkage
Queries
Problems:
1. Heterogeneous databases have different schemas.
2. Heterogeneous databases have different object identifiers.
• …but usually contain some readable description or name of the objects → record linkage
• Bell Labs [1925]
• Bell Telephone Labs
• AT&T Bell Labs
• A&T Labs
• AT&T Labs—Research
• AT&T Labs Research, Shannon Laboratory
• Shannon Labs
• Bell Labs Innovations
• Lucent Technologies/Bell Labs Innovations
History of Innovation: From 1925 to today, AT&T has attracted some of the world's greatest scientists, engineers and developers….
[www.research.att.com]
Bell Labs Facts: Bell Laboratories, the research and development arm of Lucent Technologies, has been operating continuously since 1925…
[bell-labs.com]
In the once upon a time days of the First Age of Magic, the prudent sorcerer regarded his own true name as his most valued possession but also the greatest threat to his continued good health, for--the stories go--once an enemy, even a weak unskilled enemy, learned the sorcerer's true name, then routine and widely known spells could destroy or enslave even the most powerful. As times passed, and we graduated to the Age of Reason and thence to the first and second industrial revolutions, such notions were discredited. Now it seems that the Wheel has turned full circle (even if there never really was a First Age) and we are back to worrying about true names again:
The first hint Mr. Slippery had that his own True Name might be known--and, for that matter, known to the Great Enemy--came with the appearance of two black
Lincolns humming up the long dirt driveway ... Roger Pollack was in his garden weeding, had been there nearly the whole morning.... Four heavy-set men and a hard-looking female piled out, started purposefully across his well-tended cabbage patch.…
This had been, of course, Roger Pollack's great fear. They had discovered Mr.
Slippery's True Name and it was Roger Andrew Pollack TIN/SSAN 0959-34-2861.
True Names, by Vernor Vinge
Traditional approach: Linkage first, then Queries.
Uncertainty about what to link must be decided by the linkage/integration system, not the end user.
WHIRL vision:
SELECT R.a,S.a,S.b,T.b FROM R,S,T
WHERE R.a=S.a and S.b=T.b
Link items as needed by query Q.
[Figure: soft joins between columns R.a, S.a, S.b, T.b.
Strongest links (those agreeable to most users): “Anhai Doan” ~ “Anhai Doan”, “Dan Weld” ~ “Dan Weld”.
Weaker links (those agreeable to some users): “William Cohen” ~ “Will Cohn”, “Steve Minton” ~ “Steven Mitton”.
Even weaker links: “William Cohen” ~ “David Cohn”.]
WHIRL vision: DB1 + DB2 ≠ DB

SELECT R.a,S.a,S.b,T.b FROM R,S,T
WHERE R.a~S.a and S.b~T.b   (~ means TFIDF-similar)

Link items as needed by query Q.
Incrementally produce a ranked list of possible links, with “best matches” first. User
(or downstream process) decides how much of the list to generate and examine.
[Figure: ranked soft-join links over columns R.a, S.a, S.b, T.b, best matches first: “Anhai Doan” ~ “Anhai Doan”, “Dan Weld” ~ “Dan Weld”, “William Cohen” ~ “Will Cohn”, “Steve Minton” ~ “Steven Mitton”, “William Cohen” ~ “David Cohn”.]
• Assume two relations:
– review(movieTitle, reviewText): archive of reviews
– listing(theatre, movieTitle, showTimes, …): now showing
[Table: review]
movieTitle                                    reviewText
The Hitchhiker’s Guide to the Galaxy, 2005    This is a faithful re-creation of the original radio series – not surprisingly, as Adams wrote the screenplay …
Men in Black, 1997                            Will Smith does an excellent job in this …
Space Balls, 1987                             Only a die-hard Mel Brooks fan could claim to enjoy …
…                                             …
[Table: listing]
theatre               movieTitle              showTimes
The Senator Theater   Star Wars Episode III   1:00, 4:15, & 7:30pm
The Rotunda Cinema    Cinderella Man          1:00, 4:30, & 7:30pm
…                     …                       …
[movie domain]
SELECT * FROM review as r
WHERE r.text~’sci fi comedy’
(like standard ranked retrieval of “sci-fi comedy”)

SELECT * FROM review as r, listing as s
WHERE r.title~s.title and r.text~’sci fi comedy’
(best answers: the titles are similar to each other – e.g., “Hitchhiker’s Guide to the Galaxy” and “The Hitchhiker’s Guide to the Galaxy, 2005” – and the review text is similar to “sci-fi comedy”)
• Similarity is based on TFIDF: rare words are most important.
• Search for high-ranking answers uses inverted indices…
[Figure: review titles – The Hitchhiker’s Guide to the Galaxy, 2005; Men in Black, 1997; Space Balls, 1987; … vs. listing titles – Star Wars Episode III; Hitchhiker’s Guide to the Galaxy; Cinderella Man; …]
• Similarity is based on TFIDF: rare words are most important.
• Search for high-ranking answers uses inverted indices…
– It is easy to find the (few) items that match on “important” terms.
– Search for strong matches can prune away “unimportant” terms.
[Figure: titles broken into word tokens, e.g. “The | Hitchhiker’s | Guide | to | the | Galaxy | , | 2005”. Years are common in the review archive, so they have low weight. An inverted index maps each word to the documents containing it: “hitchhiker” maps to only a few documents, while “the” maps to movie001, movie003, movie007, movie008, movie013, movie018, movie023, movie0031, …]
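The TFIDF weighting and inverted-index lookup sketched above can be written in a few lines of Python. This is a minimal illustration under simplified assumptions (toy tokenization, unit-normalized vectors), not the WHIRL implementation:

```python
import math
from collections import defaultdict

def tfidf_vectors(docs):
    """Unit-normalized TFIDF vectors for a list of token lists."""
    n = len(docs)
    df = defaultdict(int)          # document frequency of each word
    for doc in docs:
        for w in set(doc):
            df[w] += 1
    vecs = []
    for doc in docs:
        v = {w: doc.count(w) * math.log(n / df[w]) for w in set(doc)}
        norm = math.sqrt(sum(x * x for x in v.values())) or 1.0
        vecs.append({w: x / norm for w, x in v.items()})
    return vecs

def build_index(vecs):
    """Inverted index: word -> list of (doc id, weight)."""
    index = defaultdict(list)
    for i, v in enumerate(vecs):
        for w, x in v.items():
            index[w].append((i, x))
    return index

def best_matches(query_vec, index):
    """Score every doc sharing a term with the query, best first."""
    scores = defaultdict(float)
    for w, x in query_vec.items():
        for i, y in index.get(w, []):
            scores[i] += x * y   # cosine, since vectors are unit-normalized
    return sorted(scores.items(), key=lambda p: -p[1])
```

Because only documents sharing at least one query term are ever scored, common low-weight words can be pruned without changing the top of the ranking, which is the point made above.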
WHIRL* vision: very radical, everything was inter-dependent
SELECT R.a,S.a,S.b,T.b FROM R,S,T
WHERE R.a~S.a and S.b~T.b   (~ means TFIDF-similar)

Link items as needed by query Q.

Incrementally produce a ranked list of possible links, with “best matches” first. User (or downstream process) decides how much of the list to generate and examine.

[Figure: ranked soft-join links over columns R.a, S.a, S.b, T.b: “Anhai Doan” ~ “Anhai Doan”, “Dan Weld” ~ “Dan Weld”, “William Cohen” ~ “Will Cohn”, “Steve Minton” ~ “Steven Mitton”, “William Cohen” ~ “David Cohn”.]
To make SQL-like queries, the user must understand the schema of the underlying DB (and hence someone must understand DB1, DB2, DB3, ...)
* WHIRL = Word-Based Heterogeneous Information Representation Language
– The True Name problem & soft matching
– The schema problem & graph databases
– The GHIRL system
– PIM
– BioText entity normalization
• Database is modeled as a graph
– Nodes = tuples
– Edges = references between tuples
• foreign key, inclusion dependencies, ..
• Edges are directed.
User need not know organization of database to formulate queries.
BANKS: Keyword search…
[Figure: graph with paper node “MultiQuery Optimization” connected by writes/author edges to author nodes “S. Sudarshan”, “Prasan Roy”, and “Charuta”. Query: “sudarshan roy”. Answer: the subtree of the graph connecting the paper “MultiQuery Optimization” via writes/author edges to S. Sudarshan and Prasan Roy.]
• Database is modeled as a graph
– Nodes = tuples
– Edges = references between tuples
• edges are directed.
• foreign key, inclusion dependencies, ..
• All information (not just the database) is modeled as a graph
– Nodes = tuples, documents, strings, or words
– Edges = references between nodes
• edges are directed, labeled, and weighted
• foreign key, inclusion dependencies, ...
• doc/string D to word contained by D (TFIDF weighted, perhaps)
• word W to doc/string containing W (inverted index)
• [string S to strings ‘similar to’ S]
Similarity in a BANKS-like system
• Assume data is encoded in a graph with:
– a node for each object x
– a type for each object x, T(x)
– an edge for each binary relation r: x → y
• Node similarity: given type t* and node x, find y such that T(y)=t* and y~x.
• We’d like to construct a general-purpose similarity function x~y for objects in the graph.
• We’d also like to learn many such functions for different specific tasks (like “who should attend a meeting”).
• Similarity defined by a heat-diffusion kernel.
• Similarity between nodes x and y:
– “Random surfer model”: from a node z,
• with probability α, stop and “output” z
• otherwise pick an edge label r using Pr(r | z) ... e.g. uniform
• pick y uniformly from { y’ : z → y’ with label r }
• repeat from node y ....
– Similarity x~y = Pr(“output” y | start at x)
• Intuitively, x~y is the sum of the weights of all paths from x to y, where the weight of a path decreases exponentially with its length.
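The random-surfer definition above can be approximated by straightforward Monte-Carlo simulation. The sketch below is illustrative (a toy graph encoding and a uniform choice over edge labels, as suggested above); it estimates x~y by counting where sampled walks stop:

```python
import random
from collections import defaultdict

def walk_similarity(graph, x, alpha=0.3, n_walks=5000, seed=0):
    """Estimate x~y = Pr("output" y | start at x) by sampling walks.

    graph: node -> {edge_label: [target nodes]}.
    At node z: with probability alpha (or if z has no outgoing edges),
    stop and "output" z; otherwise pick an edge label uniformly,
    then a target node uniformly, and continue the walk.
    """
    rng = random.Random(seed)
    counts = defaultdict(int)
    for _ in range(n_walks):
        z = x
        while rng.random() >= alpha and graph.get(z):
            label = rng.choice(sorted(graph[z]))
            z = rng.choice(graph[z][label])
        counts[z] += 1
    return {y: c / n_walks for y, c in counts.items()}
```

When every node has outgoing edges, a path of length k is reached with probability proportional to (1−α)^k, matching the intuition that path weight decays exponentially with length.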
• All information is modeled as a graph
– Nodes = tuples, documents, strings, or words
– Edges = references between nodes
• edges are directed, labeled, and weighted
• foreign key, inclusion dependencies, ...
• doc/string D to word contained by D (TFIDF weighted, perhaps)
• word W to doc/string containing W (inverted index)
• [string S to strings ‘similar to’ S] — optional: strings that are similar in TFIDF/cosine distance will still be “nearby” in the graph (connected by many length-2 paths). E.g. “William W. Cohen, CMU” and “Dr. W. W. Cohen” share word nodes such as cohen and w.
– a natural extension to PageRank
– closely related to Lafferty’s heat-diffusion kernel
• but generalized to directed graphs
– somewhat amenable to learning the parameters of the walk (gradient search, with various optimization metrics):
• Toutanova, Manning & Ng, ICML 2004
• Nie et al, WWW 2005
• Xi et al, SIGIR 2005
– can be sped up and adapted to longer walks by sampling approaches to matrix multiplication (e.g. Lewis & E. Cohen, SODA 1998), similar to particle filtering
– our current implementation (GHIRL): Lucene + Sleepycat with extensive use of memory caching (sampling approaches visit many nodes repeatedly)
– The True Names problem
– The schema problem
– The GHIRL system
– PIM – personal information management tasks
– BioText entity normalization
[Figure: email graph fragment. A file node is connected by an on_date edge to 1.22.00, to the sender address “Chris.germany@enron.com” with alias “Chris”, to the recipient “Mgermany@ch2m.com” (person “Melissa Germany”), and by has_subj_term / has_term edges to word nodes such as “yo”, “I’m”, “you”, “work”, “where”.]
• A directed graph
• A node carries an entity type
• An edge carries a relation type
• Edges are bi-directional (so the graph is cyclic)
• Nodes inter-connect via linked entities.
• Graph G:
– nodes x, y, z
– node types T(x), T(y), T(z)
– edge labels ℓ
– parameters θ
• Edge weight x → y:
a. pick an outgoing edge label ℓ (according to a probability distribution over labels)
b. pick node y uniformly among the ℓ-children of x
• Similarity is defined by lazy graph walks over k steps. Given:
– a stay probability γ (larger values favor shorter paths)
– a transition matrix M
– an initial node distribution Vq
the output node distribution after k steps is Vq · (γI + (1−γ)M)^k.
• We use this platform to perform SEARCH for related items in the graph: a query is an initial distribution Vq over nodes and a desired output type Tout.
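Concretely, with stay probability γ the lazy transition matrix is M′ = γI + (1−γ)M, and the output distribution is Vq·(M′)^k. A sparse, dictionary-based sketch (illustrative only; the per-edge transition probabilities are assumed already computed, and dangling nodes simply keep their mass):

```python
def lazy_walk(transitions, v0, gamma=0.5, k=3):
    """Output distribution of a k-step lazy graph walk.

    transitions: node -> {target: prob}  (each row sums to 1)
    v0: initial distribution Vq, node -> prob
    At each step, stay put with probability gamma, otherwise follow
    the transition matrix: M' = gamma*I + (1-gamma)*M.
    """
    v = dict(v0)
    for _ in range(k):
        nxt = {}
        for z, p in v.items():
            targets = transitions.get(z, {})
            nxt[z] = nxt.get(z, 0.0) + gamma * p
            if not targets:               # dangling node: keep the mass in place
                nxt[z] += (1 - gamma) * p
            for y, q in targets.items():
                nxt[y] = nxt.get(y, 0.0) + (1 - gamma) * p * q
        v = nxt
    return v
```

Ranking the resulting distribution restricted to nodes of type Tout gives the search behavior described above.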
• Learn how to better rank graph nodes for a particular task.
• The parameters can be adjusted using gradient-descent methods (Diligenti et al, IJCAI 2005) – also current work.
• Here we use a re-ranking approach (Collins and Koo, Computational Linguistics, 2005)
– Boosting-based learning-to-rank; takes advantage of ‘global’ features
• A training example includes:
– a ranked list of l_i nodes,
– each node represented through m features,
– at least one known correct node.
• Features describe the graph-walk paths.
• The full set of paths to a target node within k steps can be recovered.
[Figure: nodes x1…x5 over walk steps K=0, K=1, K=2. Paths(x3, k=2): x2 → x3; x2 → x1 → x3; x4 → x1 → x3; x2 → x2 → x3.]
• ‘Edge unigrams’: was edge type ℓ used in reaching x from Vq?
• ‘Edge bigrams’: were edge types ℓ1 and ℓ2 used (in that order) in reaching x from Vq?
• ‘Top edge bigrams’: were edge types ℓ1 and ℓ2 used (in that order) in reaching x from Vq, among the top two highest-scoring paths?
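These indicator features are simple to compute once the paths are recovered. A minimal sketch, assuming paths are given as lists of edge labels sorted with the highest-scoring path first (the labels below are illustrative, not the actual corpus schema):

```python
from collections import Counter

def path_features(paths):
    """Binary unigram/bigram edge-label features for a node's walk paths.

    paths: list of paths, each a list of edge labels used to reach
    the node from the query distribution Vq, highest-scoring first.
    """
    feats = Counter()
    for path in paths:
        for l in set(path):                     # edge unigrams
            feats["unigram:%s" % l] = 1
        for l1, l2 in zip(path, path[1:]):      # edge bigrams
            feats["bigram:%s,%s" % (l1, l2)] = 1
    for path in paths[:2]:                      # top edge bigrams:
        for l1, l2 in zip(path, path[1:]):      # top-2 paths only
            feats["top-bigram:%s,%s" % (l1, l2)] = 1
    return dict(feats)
```

The resulting feature vector is what the boosting-based reranker consumes, one vector per candidate node in the ranked list.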
– Person Name Disambiguation
– Threading
“who is Andy?”
• Given:
– a term that is known to be a personal name
– that is not mentioned ‘as is’ in the header (otherwise, easy)
• Output:
– ranked person nodes.
[Figure: graph fragment linking the name term “andy” through file nodes to candidate person nodes.]

Corpora (Mgmt. Game, Sager-E, Shapiro-R) – Files: 821, 1632, 978; Nodes: 6248, 9753, 13174.
• Example name terms: Andy (Andrew), Kai (Keiko), Jenny
• Two-fold problem:
– Map terms to person nodes ( co-occurrence )
– Disambiguation ( context )
Dataset sizes (train / test examples per corpus): 80 / 26, 51 / 49, 11 / 11.
1. Baseline: string matching (& common nicknames)
– Find persons whose names are similar to the name term (Jaro measure)
– Uses lexical similarity only.
2. Graph walk: Term
– Vq: the name-term node
– Captures co-occurrence.
3. Graph walk: Term + File
– Vq: name-term + file nodes
– Captures co-occurrence and ambiguity, but incorporates additional noise.
4. Graph walk: Term + File, reranked
– Re-rank (3) using: path-describing features, a ‘source count’ feature (do the paths originate from a single source node or both?), and string similarity.
[Figures: percentage of correct answers (0–100%) vs. rank (1–10) on the Mgmt. game corpus, one curve added per method.]
Mgmt. Game           MAP    Δ      Acc.   Δ
Baseline             49.0   -      41.3   -
Graph - T            72.6   48%    61.3   48%
Graph - T + F        66.3   35%    48.8   18%
Reranked - T         85.6   75%    72.5   76%
Reranked - T + F     89.0   82%    83.8   103%

Enron: Sager-E       MAP    Δ      Acc.   Δ
Baseline             67.5   -      38.8   -
Graph - T            82.8   23%    63.3   65%
Graph - T + F        61.7   -9%    38.8   1%
Reranked - T         83.2   23%    65.3   71%
Reranked - T + F     88.9   32%    77.6   103%

Enron: Shapiro-R     MAP    Δ      Acc.   Δ
Baseline             60.8   -      39.2   -
Graph - T            84.1   38%    66.7   70%
Graph - T + F        56.5   -7%    41.2   5%
Reranked - T         87.9   45%    68.6   75%
Reranked - T + F     85.5   41%    80.4   105%
Another Task that Can be Formulated as a Graph Query: GeneId-Ranking
• Given:
– a biomedical paper abstract
• Find:
– the geneId for every gene mentioned in the abstract
• Method uses:
– from paper x, a ranked list of geneIds y: x~y
– a “synonym list”: geneId → { name1, name2, ... }
– one or more protein NER systems
– training/evaluation data: pairs of (paper, {geneId1, ...., geneIdn})
Sample abstracts and synonyms
• MGI:96273
– Htr1a
– 5-hydroxytryptamine (serotonin) receptor 1A
– 5-HT1A receptor
• MGI:104886
– Gpx5
– glutathione peroxidase 5
– Arep
– ...
• 52,000+ for mouse, 35,000+ for fly
[Figure: a graph over abstracts, proteins, terms, synonyms, and geneIds. Abstract file:doc115 has hasProtein edges to extracted strings “HT1A”, “HT1”, “CA1”; strings such as “5-HT1A receptor” have hasTerm edges to word nodes term:HT, term:1, term:A, term:CA, term:hippocampus; synonym strings such as “Htr1a” and “eIF-1A” have synonym edges to geneId nodes MGI:46273, MGI:95298, … and inFile edges back to abstracts. Noisy training abstracts: file:doc214, file:doc523, file:doc6273, …]
– mouse: 10,000 training abstracts, 250 devtest (using the first 150 for now); 50,000+ geneIds; graph has 525,000+ nodes
– likelyProtein: trained on yapex.train using off-the-shelf NER systems (Minorthird)
– possibleProtein: same, but modified (on yapex.test) to optimize F3 rather than F1 (rewards recall over precision)
yapex.test:
            Token P   Token R   Span P   Span R   Span F1
likely      94.9      64.8      87.2     62.1     72.5
possible    49.0      97.4      47.2     82.5     60.0

mouse:
            Token P   Token R   Span P   Span R   F1
likely      81.6      31.3      66.7     26.8     45.3
possible    43.9      88.5      30.4     56.6     39.6
dictionary  50.1      46.9      24.5     43.9     31.4
Experiments with Graph Search
• Baseline:
– extract entities of type protein
– for each extracted string, find the best-matching synonym, then its geneId
• consider only synonyms sharing >=1 token
• Soft/TFIDF distance
• break ties randomly
– rank geneIds by the number of times they are reached
• rewards multiple mentions (even via alternate synonyms)
• Evaluation: average, over 50 test documents, of
– non-interpolated average precision (plausible for curators)
– max F1 over all cutoffs
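Both evaluation measures are easy to state precisely. A short sketch of each, over a ranked geneId list and a set of correct geneIds:

```python
def average_precision(ranked, relevant):
    """Non-interpolated average precision of a ranked list."""
    hits, total = 0, 0.0
    for i, g in enumerate(ranked, 1):
        if g in relevant:
            hits += 1
            total += hits / i      # precision at each relevant rank
    return total / len(relevant) if relevant else 0.0

def max_f1(ranked, relevant):
    """Best F1 over all cutoffs of the ranked list."""
    best, hits = 0.0, 0
    for i, g in enumerate(ranked, 1):
        if g in relevant:
            hits += 1
        p, r = hits / i, hits / len(relevant)
        if p + r:
            best = max(best, 2 * p * r / (p + r))
    return best
```

For example, ranking ["MGI:1", "MGI:2", "MGI:3"] against the correct set {MGI:1, MGI:3} gives average precision (1/1 + 2/3)/2 = 5/6 and a best F1 of 0.8, reached at cutoff 3.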
Experiments with Graph Search (mouse eval dataset):
                              MAP    max F1
likelyProtein + softTFIDF     45.0   58.1
possibleProtein + softTFIDF   62.6   74.9
graph walk*                   51.3   64.3

* The graph walk is an approximate 10-step walk through the 1,000,000+ node graph, done using a particle-filtering approach (with 500 particles).
• The baseline includes:
– softTFIDF distances from NER entity to gene synonyms
– knowledge that the “shortcut” path doc → entity → synonym → geneId is important
• The graph includes:
– IDF effects, correlations, training data, etc.
• Proposed graph extension:
– add softTFIDF and “shortcut” edges
• Learning and reranking:
– start with “local” features f_i(e) of edges e = u → v
– for answer y, compute expectations: E(f_i(e) | start = x, end = y)
– use the expectations as feature values and the voted perceptron (Collins, 2002) as the learning-to-rank method.
Experiments with Graph Search (mouse eval dataset):
                               MAP    average max F1
likelyProtein + softTFIDF      45.0   58.1
possibleProtein + softTFIDF    62.6   74.9
graph walk                     51.3   64.3
walk + extra links             73.0   80.7
walk + extra links + learning  79.7   83.9
– much easier to obtain training data than documents in which every occurrence of every gene name is highlighted (the usual NER training data)
– obtains F1 of 71.1 on mouse data (vs. 45.3 when training on YAPEX data, which is from a different distribution)
NER from weakly labeled documents
– Lists of geneIds + synonyms for each geneId
– Conventionally-labeled documents:
• hand-labeled 50 abstracts from the BioCreative devtest set
– Weakly-labeled documents:
• documents + geneIds of all genes mentioned (250 Medline abstracts from the BioCreative devtest set)
• documents + geneIds of all genes mentioned (5,000 Medline abstracts from the BioCreative “training” set)
– In principle very cheap to obtain
NER from weakly labeled documents
• Methods:
– Edit-distance based features:
• For each token t, f_D(t) = max_{s∈W, s’∈D} sim(s, s’), where W is the set of sliding windows that include t, and D is a dictionary of gene synonyms.
– Soft matching:
• extract as a gene all longest subsequences of tokens t with f_D(t) > θ
– Weak-label grounding:
• for each document d, let A(d) be the aliases of all geneIds that weakly label d. Soft-match d against A(d) to create a conventionally-labeled document that can be used to train a conventional NER system.
– Sentence filtering:
• discard sentences with no grounded entities.
– Single-document repetition (SDR) postpass:
• if s is a gene name in d, s’ is a substring of d, and sim(s’, s) > θ’, then make s’ a gene name also.
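The edit-distance features and soft matching above can be sketched compactly. This is an illustrative sketch only: difflib's ratio stands in for the edit-distance-based sim used in the talk, and the window size and threshold θ are arbitrary choices:

```python
from difflib import SequenceMatcher

def sim(a, b):
    # stand-in string similarity; the talk uses an edit-distance-based sim
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()

def token_scores(tokens, dictionary, max_win=3):
    """f_D(t): best similarity of any window containing t to any synonym."""
    scores = [0.0] * len(tokens)
    for i in range(len(tokens)):
        for j in range(i + 1, min(i + 1 + max_win, len(tokens) + 1)):
            window = " ".join(tokens[i:j])
            best = max(sim(window, s) for s in dictionary)
            for t in range(i, j):           # every token in the window
                scores[t] = max(scores[t], best)
    return scores

def soft_extract(tokens, dictionary, theta=0.8):
    """Extract maximal token runs with f_D(t) > theta as gene mentions."""
    scores = token_scores(tokens, dictionary)
    spans, start = [], None
    for i, sc in enumerate(scores + [0.0]):  # sentinel closes a final run
        if sc > theta and start is None:
            start = i
        elif sc <= theta and start is not None:
            spans.append(" ".join(tokens[start:i]))
            start = None
    return spans
```

Running soft_extract over a weakly labeled document d with dictionary A(d) is exactly the grounding step: the extracted spans become conventional NER labels for training.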
[Tables: NER from weakly labeled documents, with GED = global edit-distance features; document counts 45, 200, 1000, 1200 and 45, 57, 1000, 1057.]
Experiments with Graph Search (mouse eval dataset):
                               MAP (Yapex-trained)   MAP (MGI-trained)
likelyProtein + softTFIDF      45.0                  72.7
possibleProtein + softTFIDF    62.6                  65.7
graph walk                     51.3                  54.4
walk + extra links             73.0                  76.7
walk + extra links + learning  79.7                  84.2
mouse blind test data (eval-set numbers in parentheses):
                               MAP (Yapex-trained)   Max F1 (Yapex-trained)
likelyProtein + softTFIDF      (45.0) 36.8           (58.1) 42.1
possibleProtein + softTFIDF    (62.6) 61.1           (74.9) 67.2
graph walk                     (51.3) 64.0           (64.3) 69.5
walk + extra links + learning  (79.7) 71.1           (83.9) 75.5
Experiments with Graph Search (mouse blind test data):
                               MAP (Yapex)   Max F1 (Yapex)   MAP (MGI)   Max F1 (MGI)
likelyProtein + softTFIDF      36.8          42.1
possibleProtein + softTFIDF    61.1          67.2
graph walk                     64.0          69.5
walk + extra links + learning  71.1          75.5             80.1        83.7
mouse blind test data, walk + extra links + learning:
MAP: 71.1 (Yapex-trained) vs. 80.1 (MGI-trained)
Average Max F1: 75.5 (Yapex-trained) vs. 83.7 (MGI-trained)
– a very simple query language for graphs, based on a diffusion-kernel similarity metric
• combines ideas from graph databases and
– experiments on natural types of queries:
• finding likely meeting attendees
• finding related documents (email threading)
• disambiguating person and gene/protein entity names
– techniques for learning to answer queries
• reranking using expectations of simple, local features
• tune performance to a particular “similarity”