Information Extraction and
Reasoning with
Heterogeneous Data
Machine Learning Department and Language Technologies Institute
School of Computer Science
Carnegie Mellon University

joint work with:
Einat Minkov, Andrew Ng, Richard Wang, Anthony Tomasic, Bob Frederking
– The True Names problem
– The schema problem
– The GHIRL system
– PIM
– Information extraction and BioText entity normalization
Dwar Ev ceremoniously soldered the final connection with gold. The eyes of a dozen television cameras watched him and the subether bore throughout the universe a dozen pictures of what he was doing.
He straightened and nodded to Dwar Reyn, then moved to a position beside the switch that would complete the contact when he threw it. The switch that would connect, all at once, all of the monster computing machines of all the populated planets in the universe - ninety-six billion planets - into the supercircuit that would connect them all into one supercalculator, one cybernetics machine that would combine all the knowledge of all the galaxies.
Dwar Reyn spoke briefly to the watching and listening trillions. Then after a moment’s silence he said, “Now, Dwar Ev.”
Dwar Ev threw the switch. There was a mighty hum, the surge of power from ninety-six billion planets. Lights flashed and quieted along the miles-long panel.
Dwar Ev stepped back and drew a deep breath. “The honour of asking the first questions is yours, Dwar Reyn.”
“Thank you,” said Dwar Reyn. “It shall be a question which no single cybernetics machine has been able to answer.”
He turned to face the machine. “Is there a God?”
The mighty voice answered without hesitation, without the clicking of a single relay.
“Yes, now there is a god.”
Sudden fear flashed on the face of Dwar Ev. He leaped to grab the switch.
A bolt of lightning from the cloudless sky struck him down and fused the switch shut.
‘Answer’ by Fredric Brown.
©1954, Angels and Spaceships
[Figure: the “ceremonial soldering” example. One path is query retrieval over the raw text. The other path runs information extraction (IE) over the text, producing facts: advisor(wc,vc), advisor(yh,tm), affil(wc,mld), affil(vc,lti), fn(wc,``William”), fn(vc,``Vitor”); inference over these facts AND the query X: advisor(wc,Y) & affil(X,lti)? yields the answer {X=em; X=vc}.]
Data Integration 101:
Queries
Problems:
1. Heterogeneous databases have different schemas.
2. Heterogeneous databases have different object identifiers.
Linkage
Queries
Problems:
1. Heterogeneous databases have different schemas.
2. Heterogeneous databases have different object identifiers.
• …but usually contain some readable description or name of the objects → record linkage
• Bell Labs [1925]
• Bell Telephone Labs
• AT&T Bell Labs
• A&T Labs
• AT&T Labs—Research
• AT&T Labs Research, Shannon Laboratory
• Shannon Labs
• Bell Labs Innovations
• Lucent Technologies/Bell Labs Innovations
History of Innovation: From 1925 to today, AT&T has attracted some of the world's greatest scientists, engineers and developers….
[www.research.att.com]
Bell Labs Facts: Bell Laboratories, the research and development arm of Lucent Technologies, has been operating continuously since 1925…
[bell-labs.com]
In the once upon a time days of the First Age of Magic, the prudent sorcerer regarded his own true name as his most valued possession but also the greatest threat to his continued good health, for--the stories go--once an enemy, even a weak unskilled enemy, learned the sorcerer's true name, then routine and widely known spells could destroy or enslave even the most powerful. As times passed, and we graduated to the Age of Reason and thence to the first and second industrial revolutions, such notions were discredited. Now it seems that the Wheel has turned full circle (even if there never really was a First Age) and we are back to worrying about true names again:
The first hint Mr. Slippery had that his own True Name might be known--and, for that matter, known to the Great Enemy--came with the appearance of two black
Lincolns humming up the long dirt driveway ... Roger Pollack was in his garden weeding, had been there nearly the whole morning.... Four heavy-set men and a hard-looking female piled out, started purposefully across his well-tended cabbage patch.…
This had been, of course, Roger Pollack's great fear. They had discovered Mr.
Slippery's True Name and it was Roger Andrew Pollack TIN/SSAN 0959-34-2861.
True Names, by Vernor Vinge
Traditional approach: Linkage first, then Queries.
Uncertainty about what to link must be decided by the linkage/integration system, not the end user.
WHIRL vision:
SELECT R.a,S.a,S.b,T.b FROM R,S,T
WHERE R.a=S.a and S.b=T.b
Link items as needed by query Q.
[Figure: soft joins between columns R.a, S.a, S.b, T.b.
Strongest links (those agreeable to most users): “Anhai Doan” ~ “Anhai Doan”, “Dan Weld” ~ “Dan Weld”.
Weaker links (those agreeable to some users): “William Cohen” ~ “Will Cohn”, “Steve Minton” ~ “Steven Mitton”.
Even weaker links: “William Cohen” ~ “David Cohn”.]
WHIRL vision: DB1 + DB2 ≠ DB

SELECT R.a,S.a,S.b,T.b FROM R,S,T
WHERE R.a~S.a and S.b~T.b   (~ means TFIDF-similar)

Link items as needed by query Q.
Incrementally produce a ranked list of possible links, with “best matches” first. User
(or downstream process) decides how much of the list to generate and examine.
[Figure: ranked soft-join links over columns R.a, S.a, S.b, T.b, best matches first: “Anhai Doan” ~ “Anhai Doan”, “Dan Weld” ~ “Dan Weld”, “William Cohen” ~ “Will Cohn”, “Steve Minton” ~ “Steven Mitton”, “William Cohen” ~ “David Cohn”.]
• Assume two relations:
– review(movieTitle, reviewText): archive of reviews
– listing(theatre, movieTitle, showTimes, …): now showing
[Table: review]
movieTitle                                    reviewText
The Hitchhiker’s Guide to the Galaxy, 2005    This is a faithful re-creation of the original radio series – not surprisingly, as Adams wrote the screenplay …
Men in Black, 1997                            Will Smith does an excellent job in this …
Space Balls, 1987                             Only a die-hard Mel Brooks fan could claim to enjoy …
…                                             …
[Table: listing]
theatre               movieTitle              showTimes
The Senator Theater   Star Wars Episode III   1:00, 4:15, & 7:30pm
The Rotunda Cinema    Cinderella Man          1:00, 4:30, & 7:30pm
…                     …                       …
[movie domain]
SELECT * FROM review as r
WHERE r.text~’sci fi comedy’
(like standard ranked retrieval of “sci-fi comedy”)

SELECT * FROM review as r, listing as s
WHERE r.title~s.title and r.text~’sci fi comedy’
(best answers: the titles are similar to each other – e.g., “Hitchhiker’s Guide to the Galaxy” and “The Hitchhiker’s Guide to the Galaxy, 2005” – and the review text is similar to “sci-fi comedy”)
• Similarity is based on TFIDF: rare words are most important.
• Search for high-ranking answers uses inverted indices…
[Figure: review titles – The Hitchhiker’s Guide to the Galaxy, 2005; Men in Black, 1997; Space Balls, 1987; … vs. listing titles – Star Wars Episode III; Hitchhiker’s Guide to the Galaxy; Cinderella Man; …]
• Similarity is based on TFIDF: rare words are most important.
• Search for high-ranking answers uses inverted indices…
– It is easy to find the (few) items that match on “important” terms.
– Search for strong matches can prune away “unimportant” terms.
[Figure: titles broken into word tokens, e.g. “The | Hitchhiker’s | Guide | to | the | Galaxy | , | 2005”. Years are common in the review archive, so they have low weight. An inverted index maps each word to the documents containing it: “hitchhiker” maps to only a few documents, while “the” maps to movie001, movie003, movie007, movie008, movie013, movie018, movie023, movie0031, …]
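The TFIDF weighting and inverted-index lookup sketched above can be written in a few lines of Python. This is a minimal illustration under simplified assumptions (toy tokenization, unit-normalized vectors), not the WHIRL implementation:

```python
import math
from collections import defaultdict

def tfidf_vectors(docs):
    """Unit-normalized TFIDF vectors for a list of token lists."""
    n = len(docs)
    df = defaultdict(int)          # document frequency of each word
    for doc in docs:
        for w in set(doc):
            df[w] += 1
    vecs = []
    for doc in docs:
        v = {w: doc.count(w) * math.log(n / df[w]) for w in set(doc)}
        norm = math.sqrt(sum(x * x for x in v.values())) or 1.0
        vecs.append({w: x / norm for w, x in v.items()})
    return vecs

def build_index(vecs):
    """Inverted index: word -> list of (doc id, weight)."""
    index = defaultdict(list)
    for i, v in enumerate(vecs):
        for w, x in v.items():
            index[w].append((i, x))
    return index

def best_matches(query_vec, index):
    """Score every doc sharing a term with the query, best first."""
    scores = defaultdict(float)
    for w, x in query_vec.items():
        for i, y in index.get(w, []):
            scores[i] += x * y   # cosine, since vectors are unit-normalized
    return sorted(scores.items(), key=lambda p: -p[1])
```

Because only documents sharing at least one query term are ever scored, common low-weight words can be pruned without changing the top of the ranking, which is the point made above.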
WHIRL* vision: very radical, everything was inter-dependent
SELECT R.a,S.a,S.b,T.b FROM R,S,T
WHERE R.a~S.a and S.b~T.b   (~ means TFIDF-similar)

Link items as needed by query Q.

Incrementally produce a ranked list of possible links, with “best matches” first. User (or downstream process) decides how much of the list to generate and examine.

[Figure: ranked soft-join links over columns R.a, S.a, S.b, T.b: “Anhai Doan” ~ “Anhai Doan”, “Dan Weld” ~ “Dan Weld”, “William Cohen” ~ “Will Cohn”, “Steve Minton” ~ “Steven Mitton”, “William Cohen” ~ “David Cohn”.]
To make SQL-like queries, the user must understand the schema of the underlying DB (and hence someone must understand DB1, DB2, DB3, ...)
* WHIRL = Word-Based Heterogeneous Information Representation Language
– The True Name problem & soft matching
– The schema problem & graph databases
– The GHIRL system
– PIM
– BioText entity normalization
• Database is modeled as a graph
– Nodes = tuples
– Edges = references between tuples
• foreign key, inclusion dependencies, ..
• Edges are directed.
User need not know organization of database to formulate queries.
BANKS: Keyword search…
[Figure: graph with paper node “MultiQuery Optimization” connected by writes/author edges to author nodes “S. Sudarshan”, “Prasan Roy”, and “Charuta”. Query: “sudarshan roy”. Answer: the subtree of the graph connecting the paper “MultiQuery Optimization” via writes/author edges to S. Sudarshan and Prasan Roy.]
• Database is modeled as a graph
– Nodes = tuples
– Edges = references between tuples
• edges are directed.
• foreign key, inclusion dependencies, ..
• All information (not just the database) is modeled as a graph
– Nodes = tuples, documents, strings, or words
– Edges = references between nodes
• edges are directed, labeled, and weighted
• foreign key, inclusion dependencies, ...
• doc/string D to word contained by D (TFIDF weighted, perhaps)
• word W to doc/string containing W (inverted index)
• [string S to strings ‘similar to’ S]
Similarity in a BANKS-like system
• Assume data is encoded in a graph with:
– a node for each object x
– a type for each object x, T(x)
– an edge for each binary relation r: x → y
• Node similarity: given type t* and node x, find y such that T(y)=t* and y~x.
• We’d like to construct a general-purpose similarity function x~y for objects in the graph.
• We’d also like to learn many such functions for different specific tasks (like “who should attend a meeting”).
• Similarity defined by a heat-diffusion kernel.
• Similarity between nodes x and y:
– “Random surfer model”: from a node z,
• with probability α, stop and “output” z
• otherwise pick an edge label r using Pr(r | z) ... e.g. uniform
• pick y uniformly from { y’ : z → y’ with label r }
• repeat from node y ....
– Similarity x~y = Pr(“output” y | start at x)
• Intuitively, x~y is the sum of the weights of all paths from x to y, where the weight of a path decreases exponentially with its length.
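The random-surfer definition above can be approximated by straightforward Monte-Carlo simulation. The sketch below is illustrative (a toy graph encoding and a uniform choice over edge labels, as suggested above); it estimates x~y by counting where sampled walks stop:

```python
import random
from collections import defaultdict

def walk_similarity(graph, x, alpha=0.3, n_walks=5000, seed=0):
    """Estimate x~y = Pr("output" y | start at x) by sampling walks.

    graph: node -> {edge_label: [target nodes]}.
    At node z: with probability alpha (or if z has no outgoing edges),
    stop and "output" z; otherwise pick an edge label uniformly,
    then a target node uniformly, and continue the walk.
    """
    rng = random.Random(seed)
    counts = defaultdict(int)
    for _ in range(n_walks):
        z = x
        while rng.random() >= alpha and graph.get(z):
            label = rng.choice(sorted(graph[z]))
            z = rng.choice(graph[z][label])
        counts[z] += 1
    return {y: c / n_walks for y, c in counts.items()}
```

When every node has outgoing edges, a path of length k is reached with probability proportional to (1−α)^k, matching the intuition that path weight decays exponentially with length.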
• All information is modeled as a graph
– Nodes = tuples, documents, strings, or words
– Edges = references between nodes
• edges are directed, labeled, and weighted
• foreign key, inclusion dependencies, ...
• doc/string D to word contained by D (TFIDF weighted, perhaps)
• word W to doc/string containing W (inverted index)
• [string S to strings ‘similar to’ S] — optional: strings that are similar in TFIDF/cosine distance will still be “nearby” in the graph (connected by many length-2 paths). E.g. “William W. Cohen, CMU” and “Dr. W. W. Cohen” share word nodes such as cohen and w.
– a natural extension to PageRank
– closely related to Lafferty’s heat-diffusion kernel
• but generalized to directed graphs
– somewhat amenable to learning the parameters of the walk (gradient search, with various optimization metrics):
• Toutanova, Manning & Ng, ICML 2004
• Nie et al, WWW 2005
• Xi et al, SIGIR 2005
– can be sped up and adapted to longer walks by sampling approaches to matrix multiplication (e.g. Lewis & E. Cohen, SODA 1998), similar to particle filtering
– our current implementation (GHIRL): Lucene + Sleepycat with extensive use of memory caching (sampling approaches visit many nodes repeatedly)
– The True Names problem
– The schema problem
– The GHIRL system
– PIM – personal information management tasks
– BioText entity normalization
[Figure: email graph fragment. A file node is connected by an on_date edge to 1.22.00, to the sender address “Chris.germany@enron.com” with alias “Chris”, to the recipient “Mgermany@ch2m.com” (person “Melissa Germany”), and by has_subj_term / has_term edges to word nodes such as “yo”, “I’m”, “you”, “work”, “where”.]
• A directed graph
• A node carries an entity type
• An edge carries a relation type
• Edges are bi-directional (so the graph is cyclic)
• Nodes inter-connect via linked entities.
• Graph G:
– nodes x, y, z
– node types T(x), T(y), T(z)
– edge labels ℓ
– parameters θ
• Edge weight x → y:
a. pick an outgoing edge label ℓ (according to a probability distribution over labels)
b. pick node y uniformly among the ℓ-children of x
• Similarity is defined by lazy graph walks over k steps. Given:
– a stay probability γ (larger values favor shorter paths)
– a transition matrix M
– an initial node distribution Vq
the output node distribution after k steps is Vq · (γI + (1−γ)M)^k.
• We use this platform to perform SEARCH for related items in the graph: a query is an initial distribution Vq over nodes and a desired output type Tout.
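Concretely, with stay probability γ the lazy transition matrix is M′ = γI + (1−γ)M, and the output distribution is Vq·(M′)^k. A sparse, dictionary-based sketch (illustrative only; the per-edge transition probabilities are assumed already computed, and dangling nodes simply keep their mass):

```python
def lazy_walk(transitions, v0, gamma=0.5, k=3):
    """Output distribution of a k-step lazy graph walk.

    transitions: node -> {target: prob}  (each row sums to 1)
    v0: initial distribution Vq, node -> prob
    At each step, stay put with probability gamma, otherwise follow
    the transition matrix: M' = gamma*I + (1-gamma)*M.
    """
    v = dict(v0)
    for _ in range(k):
        nxt = {}
        for z, p in v.items():
            targets = transitions.get(z, {})
            nxt[z] = nxt.get(z, 0.0) + gamma * p
            if not targets:               # dangling node: keep the mass in place
                nxt[z] += (1 - gamma) * p
            for y, q in targets.items():
                nxt[y] = nxt.get(y, 0.0) + (1 - gamma) * p * q
        v = nxt
    return v
```

Ranking the resulting distribution restricted to nodes of type Tout gives the search behavior described above.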
• Learn how to better rank graph nodes for a particular task.
• The parameters can be adjusted using gradient-descent methods (Diligenti et al, IJCAI 2005) – also current work.
• Here we use a re-ranking approach (Collins and Koo, Computational Linguistics, 2005)
– Boosting-based learning-to-rank; takes advantage of ‘global’ features
• A training example includes:
– a ranked list of l_i nodes,
– each node represented through m features,
– at least one known correct node.
• Features describe the graph-walk paths.
• The full set of paths to a target node within k steps can be recovered.
[Figure: nodes x1…x5 over walk steps K=0, K=1, K=2. Paths(x3, k=2): x2 → x3; x2 → x1 → x3; x4 → x1 → x3; x2 → x2 → x3.]
• ‘Edge unigrams’: was edge type ℓ used in reaching x from Vq?
• ‘Edge bigrams’: were edge types ℓ1 and ℓ2 used (in that order) in reaching x from Vq?
• ‘Top edge bigrams’: were edge types ℓ1 and ℓ2 used (in that order) in reaching x from Vq, among the top two highest-scoring paths?
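These indicator features are simple to compute once the paths are recovered. A minimal sketch, assuming paths are given as lists of edge labels sorted with the highest-scoring path first (the labels below are illustrative, not the actual corpus schema):

```python
from collections import Counter

def path_features(paths):
    """Binary unigram/bigram edge-label features for a node's walk paths.

    paths: list of paths, each a list of edge labels used to reach
    the node from the query distribution Vq, highest-scoring first.
    """
    feats = Counter()
    for path in paths:
        for l in set(path):                     # edge unigrams
            feats["unigram:%s" % l] = 1
        for l1, l2 in zip(path, path[1:]):      # edge bigrams
            feats["bigram:%s,%s" % (l1, l2)] = 1
    for path in paths[:2]:                      # top edge bigrams:
        for l1, l2 in zip(path, path[1:]):      # top-2 paths only
            feats["top-bigram:%s,%s" % (l1, l2)] = 1
    return dict(feats)
```

The resulting feature vector is what the boosting-based reranker consumes, one vector per candidate node in the ranked list.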
– Person Name Disambiguation
– Threading
“who is Andy?”
• Given:
– a term that is known to be a personal name
– that is not mentioned ‘as is’ in the header (otherwise, easy)
• Output:
– ranked person nodes.
[Figure: graph fragment linking the name term “andy” through file nodes to candidate person nodes.]

Corpora (Mgmt. Game, Sager-E, Shapiro-R) – Files: 821, 1632, 978; Nodes: 6248, 9753, 13174.
• Example name terms: Andy (Andrew), Kai (Keiko), Jenny
• Two-fold problem:
– Map terms to person nodes ( co-occurrence )
– Disambiguation ( context )
Dataset sizes (train / test examples per corpus): 80 / 26, 51 / 49, 11 / 11.
1. Baseline: string matching (& common nicknames)
– Find persons whose names are similar to the name term (Jaro measure)
– Uses lexical similarity only.
2. Graph walk: Term
– Vq: the name-term node
– Captures co-occurrence.
3. Graph walk: Term + File
– Vq: name-term + file nodes
– Captures co-occurrence and ambiguity, but incorporates additional noise.
4. Graph walk: Term + File, reranked
– Re-rank (3) using: path-describing features, a ‘source count’ feature (do the paths originate from a single source node or both?), and string similarity.
[Figures: percentage of correct answers (0–100%) vs. rank (1–10) on the Mgmt. game corpus, one curve added per method.]
Mgmt. Game           MAP    Δ      Acc.   Δ
Baseline             49.0   -      41.3   -
Graph - T            72.6   48%    61.3   48%
Graph - T + F        66.3   35%    48.8   18%
Reranked - T         85.6   75%    72.5   76%
Reranked - T + F     89.0   82%    83.8   103%

Enron: Sager-E       MAP    Δ      Acc.   Δ
Baseline             67.5   -      38.8   -
Graph - T            82.8   23%    63.3   65%
Graph - T + F        61.7   -9%    38.8   1%
Reranked - T         83.2   23%    65.3   71%
Reranked - T + F     88.9   32%    77.6   103%

Enron: Shapiro-R     MAP    Δ      Acc.   Δ
Baseline             60.8   -      39.2   -
Graph - T            84.1   38%    66.7   70%
Graph - T + F        56.5   -7%    41.2   5%
Reranked - T         87.9   45%    68.6   75%
Reranked - T + F     85.5   41%    80.4   105%
Another Task that Can be Formulated as a Graph Query: GeneId-Ranking
• Given:
– a biomedical paper abstract
• Find:
– the geneId for every gene mentioned in the abstract
• Method uses:
– from paper x, a ranked list of geneIds y: x~y
– a “synonym list”: geneId → { name1, name2, ... }
– one or more protein NER systems
– training/evaluation data: pairs of (paper, {geneId1, ...., geneIdn})
Sample abstracts and synonyms
• MGI:96273
– Htr1a
– 5-hydroxytryptamine (serotonin) receptor 1A
– 5-HT1A receptor
• MGI:104886
– Gpx5
– glutathione peroxidase 5
– Arep
– ...
• 52,000+ for mouse, 35,000+ for fly
[Figure: a graph over abstracts, proteins, terms, synonyms, and geneIds. Abstract file:doc115 has hasProtein edges to extracted strings “HT1A”, “HT1”, “CA1”; strings such as “5-HT1A receptor” have hasTerm edges to word nodes term:HT, term:1, term:A, term:CA, term:hippocampus; synonym strings such as “Htr1a” and “eIF-1A” have synonym edges to geneId nodes MGI:46273, MGI:95298, … and inFile edges back to abstracts. Noisy training abstracts: file:doc214, file:doc523, file:doc6273, …]
– mouse: 10,000 training abstracts, 250 devtest (using the first 150 for now); 50,000+ geneIds; graph has 525,000+ nodes
– likelyProtein: trained on yapex.train using off-the-shelf NER systems (Minorthird)
– possibleProtein: same, but modified (on yapex.test) to optimize F3 rather than F1 (rewards recall over precision)
yapex.test:
            Token P   Token R   Span P   Span R   Span F1
likely      94.9      64.8      87.2     62.1     72.5
possible    49.0      97.4      47.2     82.5     60.0

mouse:
            Token P   Token R   Span P   Span R   F1
likely      81.6      31.3      66.7     26.8     45.3
possible    43.9      88.5      30.4     56.6     39.6
dictionary  50.1      46.9      24.5     43.9     31.4
Experiments with Graph Search
• Baseline:
– extract entities of type protein
– for each extracted string, find the best-matching synonym, then its geneId
• consider only synonyms sharing >=1 token
• Soft/TFIDF distance
• break ties randomly
– rank geneIds by the number of times they are reached
• rewards multiple mentions (even via alternate synonyms)
• Evaluation: average, over 50 test documents, of
– non-interpolated average precision (plausible for curators)
– max F1 over all cutoffs
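Both evaluation measures are easy to state precisely. A short sketch of each, over a ranked geneId list and a set of correct geneIds:

```python
def average_precision(ranked, relevant):
    """Non-interpolated average precision of a ranked list."""
    hits, total = 0, 0.0
    for i, g in enumerate(ranked, 1):
        if g in relevant:
            hits += 1
            total += hits / i      # precision at each relevant rank
    return total / len(relevant) if relevant else 0.0

def max_f1(ranked, relevant):
    """Best F1 over all cutoffs of the ranked list."""
    best, hits = 0.0, 0
    for i, g in enumerate(ranked, 1):
        if g in relevant:
            hits += 1
        p, r = hits / i, hits / len(relevant)
        if p + r:
            best = max(best, 2 * p * r / (p + r))
    return best
```

For example, ranking ["MGI:1", "MGI:2", "MGI:3"] against the correct set {MGI:1, MGI:3} gives average precision (1/1 + 2/3)/2 = 5/6 and a best F1 of 0.8, reached at cutoff 3.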
Experiments with Graph Search (mouse eval dataset):
                              MAP    max F1
likelyProtein + softTFIDF     45.0   58.1
possibleProtein + softTFIDF   62.6   74.9
graph walk*                   51.3   64.3

* The graph walk is an approximate 10-step walk through the 1,000,000+ node graph, done using a particle-filtering approach (with 500 particles).
• The baseline includes:
– softTFIDF distances from NER entity to gene synonyms
– knowledge that the “shortcut” path doc → entity → synonym → geneId is important
• The graph includes:
– IDF effects, correlations, training data, etc.
• Proposed graph extension:
– add softTFIDF and “shortcut” edges
• Learning and reranking:
– start with “local” features f_i(e) of edges e = u → v
– for answer y, compute expectations: E(f_i(e) | start = x, end = y)
– use the expectations as feature values and the voted perceptron (Collins, 2002) as the learning-to-rank method.
Experiments with Graph Search (mouse eval dataset):
                               MAP    average max F1
likelyProtein + softTFIDF      45.0   58.1
possibleProtein + softTFIDF    62.6   74.9
graph walk                     51.3   64.3
walk + extra links             73.0   80.7
walk + extra links + learning  79.7   83.9
– much easier to obtain training data than documents in which every occurrence of every gene name is highlighted (the usual NER training data)
– obtains F1 of 71.1 on mouse data (vs. 45.3 when training on YAPEX data, which is from a different distribution)
NER from weakly labeled documents
– Lists of geneIds + synonyms for each geneId
– Conventionally-labeled documents:
• hand-labeled 50 abstracts from the BioCreative devtest set
– Weakly-labeled documents:
• documents + geneIds of all genes mentioned (250 Medline abstracts from the BioCreative devtest set)
• documents + geneIds of all genes mentioned (5,000 Medline abstracts from the BioCreative “training” set)
– In principle very cheap to obtain
NER from weakly labeled documents
• Methods:
– Edit-distance based features:
• For each token t, f_D(t) = max_{s∈W, s’∈D} sim(s, s’), where W is the set of sliding windows that include t, and D is a dictionary of gene synonyms.
– Soft matching:
• extract as a gene all longest subsequences of tokens t with f_D(t) > θ
– Weak-label grounding:
• for each document d, let A(d) be the aliases of all geneIds that weakly label d. Soft-match d against A(d) to create a conventionally-labeled document that can be used to train a conventional NER system.
– Sentence filtering:
• discard sentences with no grounded entities.
– Single-document repetition (SDR) postpass:
• if s is a gene name in d, s’ is a substring of d, and sim(s’, s) > θ’, then make s’ a gene name also.
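The edit-distance features and soft matching above can be sketched compactly. This is an illustrative sketch only: difflib's ratio stands in for the edit-distance-based sim used in the talk, and the window size and threshold θ are arbitrary choices:

```python
from difflib import SequenceMatcher

def sim(a, b):
    # stand-in string similarity; the talk uses an edit-distance-based sim
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()

def token_scores(tokens, dictionary, max_win=3):
    """f_D(t): best similarity of any window containing t to any synonym."""
    scores = [0.0] * len(tokens)
    for i in range(len(tokens)):
        for j in range(i + 1, min(i + 1 + max_win, len(tokens) + 1)):
            window = " ".join(tokens[i:j])
            best = max(sim(window, s) for s in dictionary)
            for t in range(i, j):           # every token in the window
                scores[t] = max(scores[t], best)
    return scores

def soft_extract(tokens, dictionary, theta=0.8):
    """Extract maximal token runs with f_D(t) > theta as gene mentions."""
    scores = token_scores(tokens, dictionary)
    spans, start = [], None
    for i, sc in enumerate(scores + [0.0]):  # sentinel closes a final run
        if sc > theta and start is None:
            start = i
        elif sc <= theta and start is not None:
            spans.append(" ".join(tokens[start:i]))
            start = None
    return spans
```

Running soft_extract over a weakly labeled document d with dictionary A(d) is exactly the grounding step: the extracted spans become conventional NER labels for training.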
[Tables: NER from weakly labeled documents, with GED = global edit-distance features; document counts 45, 200, 1000, 1200 and 45, 57, 1000, 1057.]
Experiments with Graph Search (mouse eval dataset):
                               MAP (Yapex-trained)   MAP (MGI-trained)
likelyProtein + softTFIDF      45.0                  72.7
possibleProtein + softTFIDF    62.6                  65.7
graph walk                     51.3                  54.4
walk + extra links             73.0                  76.7
walk + extra links + learning  79.7                  84.2
mouse blind test data (eval-set numbers in parentheses):
                               MAP (Yapex-trained)   Max F1 (Yapex-trained)
likelyProtein + softTFIDF      (45.0) 36.8           (58.1) 42.1
possibleProtein + softTFIDF    (62.6) 61.1           (74.9) 67.2
graph walk                     (51.3) 64.0           (64.3) 69.5
walk + extra links + learning  (79.7) 71.1           (83.9) 75.5
Experiments with Graph Search (mouse blind test data):
                               MAP (Yapex)   Max F1 (Yapex)   MAP (MGI)   Max F1 (MGI)
likelyProtein + softTFIDF      36.8          42.1
possibleProtein + softTFIDF    61.1          67.2
graph walk                     64.0          69.5
walk + extra links + learning  71.1          75.5             80.1        83.7
mouse blind test data, walk + extra links + learning:
MAP: 71.1 (Yapex-trained) vs. 80.1 (MGI-trained)
Average Max F1: 75.5 (Yapex-trained) vs. 83.7 (MGI-trained)
– a very simple query language for graphs, based on a diffusion-kernel similarity metric
• combines ideas from graph databases and
– experiments on natural types of queries:
• finding likely meeting attendees
• finding related documents (email threading)
• disambiguating person and gene/protein entity names
– techniques for learning to answer queries
• reranking using expectations of simple, local features
• tune performance to a particular “similarity”