Information Extraction


Information Extraction from the

World Wide Web

CSE 454

Based on Slides by

William W. Cohen

Carnegie Mellon University

Andrew McCallum

University of Massachusetts Amherst

From KDD 2003

Course Overview

(Course-map slide: Machine Learning – Introduction, Scaling, Bag of Words, Naïve Bayes, Hadoop, Search, Self-Supervised Learning, Overfitting, Ensembles, Co-training, Clustering Topics)

Administrivia

• Proposals

– Due Thursday 10/22

– Pre-screen

– Time Estimation / Pert Chart

• Scaling Up: Way Up

– Today 3:30 – EE 105

– David “Pablo” Cohen – Google News

– “Content + Connection” Algorithms; Attribute Factoring

© Daniel S. Weld

Example: The Problem

(Slide figure: Martin Baker, a person seeking a genomics job, matched against employers' job-posting forms.)

Example: A Solution


Extracting Job Openings from the Web:

foodscience.com-Job2

JobTitle: Ice Cream Guru

Employer: foodscience.com

JobCategory: Travel/Hospitality

JobFunction: Food Services

JobLocation: Upper Midwest

Contact Phone: 800-488-2611

DateExtracted: January 8, 2001

Source: www.foodscience.com/jobs_midwest.html

OtherCompanyJobs: foodscience.com-Job1


What is “Information Extraction”

As a task: Filling slots in a database from sub-segments of text.

October 14, 2002, 4:00 a.m. PT

For years, Microsoft Corporation CEO Bill Gates railed against the economic philosophy of open-source software with Orwellian fervor, denouncing its communal licensing as a "cancer" that stifled technological innovation.

Today, Microsoft claims to "love" the open-source concept, by which software code is made public to encourage improvement and development by outside programmers. Gates himself says Microsoft will gladly disclose its crown jewels--the coveted code behind the Windows operating system--to select customers.

"We can be open source. We love the concept of shared source," said Bill Veghte, a Microsoft VP. "That's a super-important shift for us in terms of code access."

Richard Stallman, founder of the Free Software Foundation, countered saying…


IE

NAME              TITLE     ORGANIZATION
Bill Gates        CEO       Microsoft
Bill Veghte       VP        Microsoft
Richard Stallman  founder   Free Soft..

What is “Information Extraction”

As a family of techniques:

Information Extraction = segmentation + classification + association + clustering


Applied to the news story above, the segmented and classified mentions (coreferent mentions then clustered) are:

Microsoft Corporation, CEO, Bill Gates, Microsoft, Gates, Microsoft, Bill Veghte, Microsoft, VP, Richard Stallman, founder, Free Software Foundation


IE in Context

(Pipeline slide:)

Document collection → Spider → Filter by relevance → IE (Segment, Classify, Associate, Cluster) → Load DB → Database → Query, Search; Data mine

Supporting steps: Create ontology; Label training data; Train extraction models.

IE History

Pre-Web

• Mostly news articles
– De Jong's FRUMP [1982]
• Hand-built system to fill Schank-style "scripts" from news wire
– Message Understanding Conference (MUC) DARPA ['87-'95], TIPSTER ['92-'96]
• Most early work dominated by hand-built models
– E.g., SRI's FASTUS, hand-built FSMs.
– But by the 1990's, some machine learning: Lehnert, Cardie, Grishman, and then HMMs: Elkan [Leek '97], BBN [Bikel et al. '98]

Web

• AAAI '94 Spring Symposium on "Software Agents"
– Much discussion of ML applied to the Web: Maes, Mitchell, Etzioni.
• Tom Mitchell's WebKB, '96
– Build KBs from the Web.
• Wrapper Induction
– First by hand, then ML: [Doorenbos '96], [Soderland '96], [Kushmerick '97], …

What makes IE from the Web Different?

Less grammar, but more formatting & linking

Newswire example:

Apple to Open Its First Retail Store in New York City

MACWORLD EXPO, NEW YORK--July 17, 2002--Apple's first retail store in New York City will open in Manhattan's SoHo district on Thursday, July 18 at 8:00 a.m. EDT. The SoHo store will be Apple's largest retail store to date and is a stunning example of Apple's commitment to offering customers the world's best computer shopping experience.

"Fourteen months after opening our first retail store, our 31 stores are attracting over 100,000 visitors each week," said Steve Jobs, Apple's CEO. "We hope our SoHo store will surprise and delight both Mac and PC users who want to see everything the Mac can do to enhance their digital lifestyles."

Web examples: www.apple.com/retail, www.apple.com/retail/soho, www.apple.com/retail/soho/theatre.html

The directory structure, link structure, formatting & layout of the Web is its own new grammar.

Landscape of IE Tasks (1/4):

Pattern Feature Domain

• Text paragraphs without formatting, e.g.:

Astro Teller is the CEO and co-founder of BodyMedia. Astro holds a Ph.D. in Artificial Intelligence from Carnegie Mellon University, where he was inducted as a national Hertz fellow. His M.S. in symbolic and heuristic computation and B.S. in computer science are from Stanford University. His work in science, literature and business has appeared in international media from the New York Times to CNN to NPR.

• Grammatical sentences and some formatting & links
• Non-grammatical snippets, rich formatting & links
• Tables

Landscape of IE Tasks (2/4):

Pattern Scope

• Web site specific (formatting), e.g. Amazon.com book pages
• Genre specific (layout), e.g. resumes
• Wide, non-specific (language), e.g. university names

Landscape of IE Tasks (3/4):

Pattern Complexity

E.g., word patterns:

• Closed set – U.S. states: "He was born in Alabama…"; "The big Wyoming sky…"

• Regular set – U.S. phone numbers: "Phone: (413) 545-1323"; "The CALD main office can be reached at 412-268-1299"

• Complex pattern – U.S. postal addresses: "University of Arkansas, P.O. Box 140, Hope, AR 71802"; "Headquarters: 1128 Main Street, 4th Floor, Cincinnati, Ohio 45210"

• Ambiguous patterns, needing context and many sources of evidence – person names: "…was among the six houses sold by Hope Feldman that year."; "Pawel Opalinski, Software Engineer at WhizBang Labs."

Landscape of IE Tasks (4/4):

Pattern Combinations

Jack Welch will retire as CEO of General Electric tomorrow. The top role at the Connecticut company will be filled by Jeffrey Immelt.

Single entity ("named entity" extraction):
  Person: Jack Welch
  Person: Jeffrey Immelt
  Location: Connecticut

Binary relationship:
  Relation: Person-Title
    Person: Jack Welch
    Title: CEO
  Relation: Company-Location
    Company: General Electric
    Location: Connecticut

N-ary record:
  Relation: Succession
    Company: General Electric
    Title: CEO
    Out: Jack Welch
    In: Jeffrey Immelt

Evaluation of Single Entity Extraction

TRUTH:

Michael Kearns and Sebastian Seung will start Monday’s tutorial, followed by Richard M. Karpe and Martin Cooke .

PRED:

Michael Kearns and Sebastian Seung will start Monday ’s tutorial, followed by Richard M. Karpe and Martin Cooke .

Precision = (# correctly predicted segments) / (# predicted segments) = 2/6

Recall = (# correctly predicted segments) / (# true segments) = 2/4

F1 = harmonic mean of Precision & Recall = 1 / ( ((1/P) + (1/R)) / 2 )
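In code, the slide's arithmetic is a few lines. A minimal sketch (names are illustrative), treating predictions and truth as sets of (start, end) token spans:

```python
def prf1(predicted, true):
    """predicted, true: sets of (start, end) token spans."""
    correct = len(predicted & true)              # exact-boundary matches only
    p = correct / len(predicted) if predicted else 0.0
    r = correct / len(true) if true else 0.0
    f1 = 2 * p * r / (p + r) if p + r else 0.0   # harmonic mean
    return p, r, f1

# The slide's example: 6 predicted segments, 4 true, 2 exactly correct
# gives P = 1/3, R = 1/2, F1 = 0.4.
```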

State of the Art Performance

• Named entity recognition

– Person, Location, Organization, …

– F1 in high 80’s or low- to mid-90’s

• Binary relation extraction

– Contained-in (Location1, Location2); Member-of (Person1, Organization1)

– F1 in 60's, 70's, or 80's

• Wrapper induction

– Extremely accurate performance obtainable

– Human effort (~30min) required on each site


Landscape of IE Techniques (1/1):

Models

All illustrated on the sentence "Abraham Lincoln was born in Kentucky.":

• Lexicons: test each candidate string for membership in a list (Alabama, Alaska, …, Wisconsin, Wyoming).
• Classify pre-segmented candidates: given a candidate segment, ask a classifier "which class?"
• Sliding window: run a classifier over each window of the text, trying alternate window sizes.
• Boundary models: classify positions as BEGIN or END of a field, then pair them up.
• Finite state machines: find the most likely state sequence for the token sequence.
• Context-free grammars: parse the sentence (e.g., NNP NNP V V P NP into NP, VP, PP, S) and classify constituents.
• …and beyond

Any of these models can be used to capture words, formatting, or both.

Landscape:

Focus of this Tutorial

• Pattern complexity: closed set → regular → complex → ambiguous
• Pattern feature domain: words → words + formatting → formatting
• Pattern scope: site-specific → genre-specific → general
• Pattern combinations: entity → binary → n-ary
• Models: lexicon, regex, window, boundary, FSM, CFG

Sliding Windows


Extraction by Sliding Window

E.g., looking for the seminar location:

GRAND CHALLENGES FOR MACHINE LEARNING

Jaime Carbonell

School of Computer Science

Carnegie Mellon University

3:30 pm

7500 Wean Hall

Machine learning has evolved from obscurity in the 1970s into a vibrant and popular discipline in artificial intelligence during the 1980s and 1990s. As a result of its success and growth, machine learning is evolving into a collection of related disciplines: inductive concept acquisition, analytic learning in problem solving (e.g. analogy, explanation-based learning), learning theory (e.g. PAC learning), genetic algorithms, connectionist learning, hybrid systems, and so on.

CMU UseNet Seminar Announcement



A “Naïve Bayes” Sliding Window Model

[Freitag 1997]

Example window over "… 00:pm Place : Wean Hall Rm 5409 Speaker : Sebastian Thrun …": tokens w_{t-m} … w_{t-1} form the prefix, w_t … w_{t+n} the contents, and w_{t+n+1} … w_{t+n+m} the suffix.

• Estimate Pr(LOCATION | window) using Bayes rule
• Try all "reasonable" windows (vary length, position)
• Assume independence for length, prefix words, suffix words, content words
• Estimate from data quantities like Pr("Place" in prefix | LOCATION)
• If Pr("Wean Hall Rm 5409" = LOCATION) is above some threshold, extract it.

Other examples of sliding windows: [Baluja et al. 2000] (decision tree over individual words & their context)
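A toy sketch of this model may help; it follows the independence assumptions above, but the class name, vocabulary size, and smoothing constants are illustrative choices, not from [Freitag 1997]:

```python
import math
from collections import defaultdict

VOCAB = 50_000  # assumed vocabulary size for add-one smoothing

class NBWindowModel:
    """Naive Bayes over a window's length, prefix, contents, and suffix."""

    def __init__(self):
        self.tables = {"prefix": defaultdict(int),
                       "content": defaultdict(int),
                       "suffix": defaultdict(int)}
        self.length = defaultdict(int)
        self.n = 0  # number of true fields seen in training

    def train(self, examples):
        # examples: (prefix_words, content_words, suffix_words) of true fields
        for pre, mid, suf in examples:
            self.n += 1
            self.length[len(mid)] += 1
            for part, words in zip(("prefix", "content", "suffix"),
                                   (pre, mid, suf)):
                for w in words:
                    self.tables[part][w] += 1

    def _logp(self, part, w):
        # estimates quantities like Pr("Place" in prefix | LOCATION)
        table = self.tables[part]
        return math.log((table[w] + 1) / (sum(table.values()) + VOCAB))

    def log_score(self, pre, mid, suf):
        # log Pr(window | LOCATION), independence over length and words
        s = math.log((self.length[len(mid)] + 1) / (self.n + 100))
        for part, words in zip(("prefix", "content", "suffix"),
                               (pre, mid, suf)):
            s += sum(self._logp(part, w) for w in words)
        return s
```

To extract, slide over all "reasonable" windows, score each, and keep those above threshold.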

“Naïve Bayes” Sliding Window Results

Domain: CMU UseNet Seminar Announcements

GRAND CHALLENGES FOR MACHINE LEARNING

Jaime Carbonell

School of Computer Science

Carnegie Mellon University

3:30 pm

7500 Wean Hall

Machine learning has evolved from obscurity in the 1970s into a vibrant and popular discipline in artificial intelligence during the 1980s and 1990s. As a result of its success and growth, machine learning is evolving into a collection of related disciplines: inductive concept acquisition, analytic learning in problem solving (e.g. analogy, explanation-based learning), learning theory (e.g. PAC learning), genetic algorithms, connectionist learning, hybrid systems, and so on.

Field        F1
Person Name  30%
Location     61%
Start Time   98%

Realistic sliding-window-classifier IE

• What windows to consider?

– all windows containing as many tokens as the shortest example, but no more tokens than the longest example

• How to represent a classifier? It might:

– Restrict the length of window;

– Restrict the vocabulary or formatting used before/after/inside window;

– Restrict the relative order of tokens, etc.

• Learning Method

– SRV: Top-Down Rule Learning [Freitag AAAI '98]

– Rapier: Bottom-Up [Califf & Mooney, AAAI '99]

Rapier: results – precision/recall

(Precision/recall results chart omitted.)

Rule-learning approaches to sliding-window classification: Summary

• SRV, Rapier, and WHISK

[Soderland KDD ‘97]

– Representations for classifiers allow restriction of the relationships between tokens, etc.

– Representations are carefully chosen subsets of even more powerful representations based on logic programming (ILP and Prolog)

– Use of these "heavyweight" representations is complicated, but seems to pay off in results

• Can simpler representations for classifiers work?

BWI: Learning to detect boundaries

[Freitag & Kushmerick, AAAI 2000]

• Another formulation: learn three probabilistic classifiers:

– START(i) = Prob(position i starts a field)

– END(j) = Prob(position j ends a field)

– LEN(k) = Prob(an extracted field has length k)

• Then score a possible extraction (i, j) by START(i) * END(j) * LEN(j-i)

• LEN(k) is estimated from a histogram
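A small sketch of this scoring rule; the probability callbacks stand in for the boosted detectors described on the next slide, so everything here is illustrative rather than the AAAI-2000 implementation:

```python
def extract(tokens, p_start, p_end, len_hist, threshold=0.5):
    """Return spans (i, j) scored by START(i) * END(j) * LEN(j - i).

    p_start, p_end: position -> probability callbacks
    len_hist: dict mapping field length k -> estimated Prob(length = k)
    """
    spans = []
    for i in range(len(tokens)):
        for j in range(i + 1, len(tokens) + 1):
            score = p_start(i) * p_end(j) * len_hist.get(j - i, 0.0)
            if score > threshold:
                spans.append((i, j, score))
    return spans
```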

BWI: Learning to detect boundaries

• BWI uses boosting to find "detectors" for START and END

• Each weak detector has a BEFORE and an AFTER pattern (on tokens before/after position i).

• Each "pattern" is a sequence of
– tokens and/or
– wildcards like: anyAlphabeticToken, anyNumber, …

• The weak learner for "patterns" uses greedy search (+ lookahead) to repeatedly extend a pair of empty BEFORE, AFTER patterns

BWI: Learning to detect boundaries

Field        F1
Person Name  30%
Location     61%
Start Time   98%

Problems with Sliding Windows and Boundary Finders

• Decisions in neighboring parts of the input are made independently from each other.

– Naïve Bayes Sliding Window may predict a "seminar end time" before the "seminar start time".

– It is possible for two overlapping windows to both be above threshold.

– In a boundary-finding system, left boundaries are laid down independently from right boundaries, and their pairing happens as a separate step.

Finite State Machines


Hidden Markov Models (HMMs): the standard sequence model in genomics, speech, NLP, …

(Slide figures: a finite state model with transitions between states, and the equivalent graphical model with states S_{t-1}, S_t, S_{t+1} emitting observations O_{t-1}, O_t, O_{t+1}.)

Generates:

State sequence s_1, s_2, …
Observation sequence o_1, o_2, o_3, o_4, o_5, o_6, o_7, o_8

$$P(\vec{s}, \vec{o}) = \prod_{t=1}^{|\vec{o}|} P(s_t \mid s_{t-1}) \, P(o_t \mid s_t)$$

Parameters: for all states S = {s_1, s_2, …}

Start state probabilities: P(s_1)
Transition probabilities: P(s_t | s_{t-1})
Observation (emission) probabilities: P(o_t | s_t), usually a multinomial over an atomic, fixed alphabet

Training: maximize the probability of the training observation sequences.

IE with Hidden Markov Models

Given a sequence of observations:

Yesterday Pedro Domingos spoke this example sentence.

and a trained HMM whose states include person name, location name, and background:

Find the most likely state sequence (Viterbi):

$$\arg\max_{\vec{s}} P(\vec{s}, \vec{o})$$

Yesterday Pedro Domingos spoke this example sentence.

Any words said to be generated by the designated "person name" state are extracted as a person name:

Person name: Pedro Domingos
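The Viterbi step is compact enough to sketch. Here transition and emission tables are toy dictionaries of log-probabilities; states would be "person name", "location name", "background", etc. (the function shape is illustrative, not from the tutorial):

```python
import math

def viterbi(obs, states, log_start, log_trans, log_emit):
    """Most likely state sequence: argmax_s P(s, o), computed in log space.

    log_start[s], log_trans[s_prev][s], log_emit[s][word] are log-probs;
    unseen words get a crude floor of -20.
    """
    best = [{s: log_start[s] + log_emit[s].get(obs[0], -20.0)
             for s in states}]
    back = [{}]
    for t in range(1, len(obs)):
        best.append({})
        back.append({})
        for s in states:
            prev, score = max(
                ((p, best[t - 1][p] + log_trans[p][s]) for p in states),
                key=lambda x: x[1])
            best[t][s] = score + log_emit[s].get(obs[t], -20.0)
            back[t][s] = prev
    s = max(states, key=lambda x: best[-1][x])   # best final state
    path = [s]
    for t in range(len(obs) - 1, 0, -1):         # trace back
        s = back[t][s]
        path.append(s)
    return list(reversed(path))
```

Words labeled "person name" in the returned path are then extracted.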

HMM Example: “Nymble”

Task: Named Entity Extraction [Bikel et al. 1998], [BBN "IdentiFinder"]

(Slide figure: states start-of-sentence and end-of-sentence surrounding Person, Org, five other name classes, and Other.)

Train on ~500k words of news wire text.

Results:
  Case   Language   F1
  Mixed  English    93%
  Upper  English    91%
  Mixed  Spanish    90%

Transition probabilities: P(s_t | s_{t-1}, o_{t-1}), backing off to P(s_t | s_{t-1}), then P(s_t).

Observation probabilities: P(o_t | s_t, s_{t-1}) or P(o_t | s_t, o_{t-1}), backing off to P(o_t | s_t), then P(o_t).

Other examples of shrinkage for HMMs in IE: [Freitag and McCallum '99]
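The back-off structure amounts to interpolating a specific conditional estimate with progressively coarser ones. A sketch with fixed illustrative weights (Nymble derives its interpolation weights from data counts, which this does not show):

```python
def backoff_transition(counts2, counts1, counts0, s_prev, o_prev, s,
                       lambdas=(0.6, 0.3, 0.1)):
    """Estimate P(s_t | s_{t-1}, o_{t-1}), backing off to P(s_t | s_{t-1})
    and P(s_t). Each counts* maps a context tuple to {state: count}."""
    def mle(table, key):
        dist = table.get(key, {})
        total = sum(dist.values())
        return dist.get(s, 0) / total if total else 0.0
    l2, l1, l0 = lambdas
    return (l2 * mle(counts2, (s_prev, o_prev))
            + l1 * mle(counts1, (s_prev,))
            + l0 * mle(counts0, ()))
```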

We want More than an Atomic View of Words

Would like a richer representation of text: many arbitrary, overlapping features of the words, e.g.:

– identity of word
– ends in "-ski"
– is capitalized
– is part of a noun phrase
– is in a list of city names
– is "Wisniewski"
– is under node X in WordNet
– is in bold font
– is indented
– is in hyperlink anchor
– last person name was female
– next two words are "and Associates"

(Slide figure: these features attach to observations O_{t-1}, O_t, O_{t+1} beneath states S_{t-1}, S_t, S_{t+1}; e.g., the word emitted at O_{t-1} ends in "-ski".)

Problems with Richer Representation and a Joint Model

These arbitrary features are not independent:

– Multiple levels of granularity (chars, words, phrases)
– Multiple dependent modalities (words, formatting, layout)
– Past & future

Two choices:

1. Model the dependencies. Each state would have its own Bayes net. But we are already starved for training data!

2. Ignore the dependencies. This causes "over-counting" of evidence (à la naïve Bayes). Big problem when combining evidence, as in Viterbi!

Conditional Sequence Models

• We prefer a model that is trained to maximize a conditional probability rather than a joint probability, P(s|o) instead of P(s,o):

– Can examine features, but is not responsible for generating them.

– Don't have to explicitly model their dependencies.

– Don't "waste modeling effort" trying to generate what we are given at test time anyway.

Conditional Finite State Sequence Models

[McCallum, Freitag & Pereira, 2000]

[Lafferty, McCallum, Pereira 2001]

From HMMs to CRFs

$\vec{s} = s_1, s_2, \ldots, s_n$    $\vec{o} = o_1, o_2, \ldots, o_n$

Joint:

$$P(\vec{s}, \vec{o}) = \prod_{t=1}^{|\vec{o}|} P(s_t \mid s_{t-1}) \, P(o_t \mid s_t)$$

Conditional:

$$P(\vec{s} \mid \vec{o}) = \frac{1}{P(\vec{o})} \prod_{t=1}^{|\vec{o}|} P(s_t \mid s_{t-1}) \, P(o_t \mid s_t)$$

$$= \frac{1}{Z(\vec{o})} \prod_{t=1}^{|\vec{o}|} \Phi_s(s_t, s_{t-1}) \, \Phi_o(o_t, s_t),
\qquad \text{where } \Phi_o(t) = \exp\Big(\sum_k \lambda_k f_k(s_t, o_t)\Big)$$

(A super-special case of Conditional Random Fields.)

Conditional Random Fields

[Lafferty, McCallum, Pereira 2001]

1. FSM special case: linear chain among unknowns, parameters tied across time steps, with observations $\vec{o} = o_t, o_{t+1}, o_{t+2}, o_{t+3}, o_{t+4}$ and hidden states $S_t, S_{t+1}, S_{t+2}, S_{t+3}, S_{t+4}$:

$$P(\vec{s} \mid \vec{o}) = \frac{1}{Z(\vec{o})} \prod_{t=1}^{|\vec{o}|} \exp\Big(\sum_k \lambda_k f_k(s_t, s_{t-1}, \vec{o}, t)\Big)$$

2. In general: CRFs = "conditionally-trained Markov network," with arbitrary structure among the unknowns.

3. Relational Markov Networks [Taskar, Abbeel, Koller 2002]: parameters tied across hits from SQL-like queries ("clique templates").

4. Markov Logic Networks [Richardson & Domingos]: weights on arbitrary logical sentences.
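Z(o) sums over exponentially many state sequences, but on the linear chain it is computable with the forward algorithm. A sketch in log space; log_potential(s, s_prev, obs, t), returning sum_k lambda_k f_k, is an assumed callback, not something defined in the tutorial:

```python
import math

def _logsumexp(xs):
    m = max(xs)
    return m + math.log(sum(math.exp(x - m) for x in xs))

def log_Z(obs, states, log_potential):
    """log Z(o): forward algorithm over the linear chain."""
    # alpha[s] = log of summed weight of all prefixes ending in state s;
    # s_prev is None at t = 0 (a start-of-sequence convention)
    alpha = {s: log_potential(s, None, obs, 0) for s in states}
    for t in range(1, len(obs)):
        alpha = {s: _logsumexp([alpha[p] + log_potential(s, p, obs, t)
                                for p in states])
                 for s in states}
    return _logsumexp(list(alpha.values()))
```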

Feature Functions

An example feature function $f_k(s_t, s_{t-1}, \vec{o}, t)$:

$$f_{\mathrm{Capitalized},\, s_i,\, s_j}(s_t, s_{t-1}, \vec{o}, t) =
\begin{cases} 1 & \text{if Capitalized}(o_t) \wedge s_i = s_{t-1} \wedge s_j = s_t \\
0 & \text{otherwise} \end{cases}$$

o = Yesterday Pedro Domingos spoke this example sentence.

(Slide figure: observations $o_1, \ldots, o_7$, each assigned a state from $\{s_1, s_2, s_3, s_4\}$.)

$$f_{\mathrm{Capitalized},\, s_1,\, s_3}(s_1, s_2, \vec{o}, 2) = 1$$
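That feature function translates to code almost verbatim. The state names below are illustrative, and the slide's 1-based t = 2 is index 1 in 0-based Python:

```python
def make_capitalized_feature(s_prev_required, s_required):
    """f_{Capitalized, s_i, s_j}: fires when o_t is capitalized and the
    state bigram (s_{t-1}, s_t) matches (s_i, s_j)."""
    def f(s_t, s_prev, obs, t):
        return 1 if (obs[t][0].isupper()
                     and s_prev == s_prev_required
                     and s_t == s_required) else 0
    return f

obs = "Yesterday Pedro Domingos spoke this example sentence .".split()
f = make_capitalized_feature("background", "person-name")
print(f("person-name", "background", obs, 1))  # o_2 = "Pedro" -> prints 1
```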

Learning Parameters of CRFs

Maximize the log-likelihood of parameters $\Lambda = \{\lambda_k\}$ given training data $D$:

$$L_\Lambda = \sum_{(\vec{s},\vec{o}) \in D} \log\!\left[\frac{1}{Z(\vec{o})} \prod_{t=1}^{|\vec{o}|} \exp\Big(\sum_k \lambda_k f_k(s_t, s_{t-1}, \vec{o}, t)\Big)\right] - \sum_k \frac{\lambda_k^2}{2\sigma^2}$$

Log-likelihood gradient:

$$\frac{\partial L_\Lambda}{\partial \lambda_k} = \sum_{(\vec{s},\vec{o}) \in D} \#_k(\vec{s}, \vec{o}) \;-\; \sum_i \sum_{\vec{s}\,'} P_\Lambda(\vec{s}\,' \mid \vec{o}^{(i)}) \, \#_k(\vec{s}\,', \vec{o}^{(i)}) \;-\; \frac{\lambda_k}{\sigma^2}$$

where $\#_k(\vec{s}, \vec{o}) = \sum_t f_k(s_t, s_{t-1}, \vec{o}, t)$

Methods:

• iterative scaling (quite slow)

• conjugate gradient (much faster) [Sha & Pereira 2002], [Malouf 2002]

• limited-memory quasi-Newton methods, BFGS (super fast)
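A sketch of how these pieces wire together: empirical minus expected feature counts, plus the Gaussian-prior term, handed to an off-the-shelf limited-memory quasi-Newton optimizer. The empirical_counts and expected_counts helpers (the latter is the forward-backward computation) are assumed, not shown:

```python
import numpy as np
from scipy.optimize import minimize

def train_crf(data, n_features, empirical_counts, expected_counts, sigma=1.0):
    emp = empirical_counts(data)  # vector: sum over D of #_k(s, o)

    def neg_ll_and_grad(lam):
        # expected_counts returns (E[#_k] under P_Lambda, total log Z,
        # total score of the true paths) for the current parameters
        exp_counts, log_Z_total, score_total = expected_counts(data, lam)
        ll = score_total - log_Z_total - np.sum(lam ** 2) / (2 * sigma ** 2)
        grad = emp - exp_counts - lam / sigma ** 2
        return -ll, -grad  # minimize the negative log-likelihood

    result = minimize(neg_ll_and_grad, np.zeros(n_features),
                      jac=True, method="L-BFGS-B")
    return result.x
```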

General CRFs vs. HMMs

• More general and expressive modeling technique

• Comparable computational efficiency

• Features may be arbitrary functions of any or all observations

• Parameters need not fully specify generation of observations; require less training data

• Easy to incorporate domain knowledge

• State means only "state of process", vs. "state of process" and "observational history I'm keeping"


Person name Extraction

[McCallum 2001, unpublished]


Features in Experiment

• Capitalized: Xxxxx
• Mixed caps: XxXxxx
• All caps: XXXXX
• Initial cap: X….
• Contains digit: xxx5
• All lowercase: xxxx
• Initial: X
• Punctuation: .,:;!(), etc.
• Period: .
• Comma: ,
• Apostrophe: '
• Dash: -
• Preceded by HTML tag
• Character n-gram classifier says string is a person name (80% accurate)
• In stopword list (the, of, their, etc.)
• In honorific list (Mr, Mrs, Dr, Sen, etc.)
• In person suffix list (Jr, Sr, PhD, etc.)
• In name particle list (de, la, van, der, etc.)
• In Census lastname list; segmented by P(name)
• In Census firstname list; segmented by P(name)
• In locations lists (states, cities, countries)
• In company name list ("J. C. Penny")
• In list of company suffixes (Inc, & Associates, Foundation)
• Hand-built FSM person-name extractor says yes (precision/recall ~ 30/95)
• Conjunctions of all previous feature pairs, evaluated at the current time step
• Conjunctions of all previous feature pairs, evaluated at the current step and one step ahead
• All previous features, evaluated two steps ahead
• All previous features, evaluated one step behind

Total number of features = ~500k
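The conjunction features at the end of this list can be sketched directly; the atomic features below are a tiny illustrative subset of the ones above:

```python
from itertools import combinations

def atomic_features(obs, t):
    w = obs[t] if 0 <= t < len(obs) else ""
    feats = set()
    if w[:1].isupper():
        feats.add("capitalized")
    if w.isupper():
        feats.add("all-caps")
    if any(c.isdigit() for c in w):
        feats.add("contains-digit")
    if w.lower() in {"mr", "mrs", "dr", "sen"}:
        feats.add("honorific")
    return feats

def conjunction_features(obs, t, offsets=(0, 1)):
    # conjoin every pair of (feature, offset)-tagged atomic features,
    # e.g. "honorific@+0&capitalized@+1"
    tagged = {f"{f}@{d:+d}"
              for d in offsets for f in atomic_features(obs, t + d)}
    return {f"{a}&{b}" for a, b in combinations(sorted(tagged), 2)}
```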

Training and Testing

• Trained on 65k words from 85 pages, 30 different companies’ web sites.

• Training takes 4 hours on a 1 GHz Pentium.

• Training precision/recall is 96% / 96%.

• Tested on different set of web pages with similar size characteristics.

• Testing precision is 92 – 95%, recall is 89 – 91%.


Table Extraction from Government Reports

Cash receipts from marketings of milk during 1995 at $19.9 billion dollars, was slightly below 1994. Producer returns averaged $12.93 per hundredweight,

$0.19 per hundredweight below 1994. Marketings totaled 154 billion pounds,

1 percent above 1994. Marketings include whole milk sold to plants and dealers as well as milk sold directly to consumers.

An estimated 1.56 billion pounds of milk were used on farms where produced,

8 percent less than 1994. Calves were fed 78 percent of this milk with the remainder consumed in producer households.

Milk Cows and Production of Milk and Milkfat:

United States, 1993-95

--------------------------------------------------------------------------------

: : Production of Milk and Milkfat 2/

: Number :-------------------------------------------------------

Year : of : Per Milk Cow : Percentage : Total

:Milk Cows 1/:-------------------: of Fat in All :------------------

: : Milk : Milkfat : Milk Produced : Milk : Milkfat

--------------------------------------------------------------------------------

: 1,000 Head --- Pounds --Percent Million Pounds

:

1993 : 9,589 15,704 575 3.66 150,582 5,514.4

1994 : 9,500 16,175 592 3.66 153,664 5,623.7

1995 : 9,461 16,451 602 3.66 155,644 5,694.3

--------------------------------------------------------------------------------

1/ Average number during year, excluding heifers not yet fresh.

2/ Excludes milk sucked by calves.


Table Extraction from Government Reports

[Pinto, McCallum, Wei, Croft, 2003]

100+ documents from www.fedstats.gov


CRF

Labels:

• Non-Table

• Table Title

• Table Header

• Table Data Row

• Table Section Data Row

• Table Footnote

• ... (12 in all)

Features:

• Percentage of digit chars

• Percentage of alpha chars

• Indented

• Contains 5+ consecutive spaces

• Whitespace in this line aligns with prev.

• ...

• Conjunctions of all previous features (see the sketch below)
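These line-level features are easy to picture in code. A rough sketch with illustrative names (the actual feature set in [Pinto, McCallum, Wei, Croft, 2003] is richer):

```python
import re

def line_features(line, prev_line=""):
    n = max(len(line), 1)
    return {
        "pct-digit-chars": sum(c.isdigit() for c in line) / n,
        "pct-alpha-chars": sum(c.isalpha() for c in line) / n,
        "indented": line.startswith((" ", "\t")),
        "has-5-consecutive-spaces": "     " in line,
        # crude alignment test: runs of whitespace start at the same columns
        "aligns-with-prev": bool(
            {m.start() for m in re.finditer(r"\s{2,}", line)}
            & {m.start() for m in re.finditer(r"\s{2,}", prev_line)}),
    }
```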

Table Extraction Experimental Results

[Pinto, McCallum, Wei, Croft, 2003]

Line labels, percent correct:

  HMM                      65%
  Stateless MaxEnt         85%
  CRF w/out conjunctions   52%
  CRF                      95%

Named Entity Recognition

Reuters stories on international news. Train on ~300k words.

CRICKET - MILLNS SIGNS FOR BOLAND

CAPE TOWN 1996-08-22

South African provincial side Boland said on Thursday they had signed Leicestershire fast bowler David Millns on a one year contract. Millns, who toured Australia with England A in 1992, replaces former England all-rounder Phillip DeFreitas as Boland's overseas professional.

Labels and examples:

  PER:  Yayuk Basuki, Innocent Butare
  ORG:  3M, KDP, Leicestershire
  LOC:  Leicestershire, Nirmal Hriday, The Oval
  MISC: Java, Basque, 1,000 Lakes Rally

Automatically Induced Features

[McCallum 2003]

Index  Feature
0      inside-noun-phrase (o_{t-1})
5      stopword (o_t)
20     capitalized (o_{t+1})
75     word=the (o_t)
100    in-person-lexicon (o_{t-1})
200    word=in (o_{t+2})
500    word=Republic (o_{t+1})
711    word=RBI (o_t) & header=BASEBALL
1027   header=CRICKET (o_t) & in-English-county-lexicon (o_t)
1298   company-suffix-word (firstmention_{t+2})
4040   location (o_t) & POS=NNP (o_t) & capitalized (o_t) & stopword (o_{t-1})
4945   moderately-rare-first-name (o_{t-1}) & very-common-last-name (o_t)
4474   word=the (o_{t-2}) & word=of (o_t)

Named Entity Extraction Results

[McCallum & Li, 2003]

Method                                                   F1    # parameters
BBN's IdentiFinder, word features                        79%   ~500k
CRFs, word features, w/out feature induction             80%   ~500k
CRFs, many features, w/out feature induction             75%   ~3 million
CRFs, many candidate features, with feature induction    90%   ~60k

Inducing State-Transition Structure

– K-reversible grammars [Chidlovskii, 2000]

– Structure learning for HMMs + IE [Seymore et al. 1999], [Freitag & McCallum 2000]

Limitations of Finite State Models

• Finite state models have a linear structure

• Web documents have a hierarchical structure

– Are we suffering by not modeling this structure more explicitly?

• How can one learn a hierarchical extraction model?


IE Resources

• Data

– RISE, http://www.isi.edu/~muslea/RISE/index.html

– Linguistic Data Consortium (LDC)

• Penn Treebank, Named Entities, Relations, etc.

– http://www.biostat.wisc.edu/~craven/ie

– http://www.cs.umass.edu/~mccallum/data

• Code

– TextPro, http://www.ai.sri.com/~appelt/TextPro

– MALLET, http://www.cs.umass.edu/~mccallum/mallet

– SecondString, http://secondstring.sourceforge.net/

• Both

– http://www.cis.upenn.edu/~adwait/penntools.html


References

[Bikel et al 1997] Bikel, D.; Miller, S.; Schwartz, R.; and Weischedel, R. Nymble: a high-performance learning name-finder. In Proceedings of ANLP'97, pp. 194-201.

[Califf & Mooney 1999] Califf, M. E.; Mooney, R.: Relational Learning of Pattern-Match Rules for Information Extraction, in Proceedings of the Sixteenth National Conference on Artificial Intelligence (AAAI-99).

[Cohen, Hurst, Jensen, 2002] Cohen, W.; Hurst, M.; Jensen, L.: A flexible learning system for wrapping tables and lists in HTML documents. Proceedings of The Eleventh International World Wide Web Conference (WWW-2002).

[Cohen, Kautz, McAllester 2000] Cohen, W.; Kautz, H.; McAllester, D.: Hardening soft information sources. Proceedings of the Sixth International Conference on Knowledge Discovery and Data Mining (KDD-2000).

[Cohen, 1998] Cohen, W.: Integration of Heterogeneous Databases Without Common Domains Using Queries Based on Textual Similarity, in Proceedings of ACM SIGMOD-98.

[Cohen, 2000a] Cohen, W.: Data Integration using Similarity Joins and a Word-based Information Representation Language, ACM Transactions on Information Systems, 18(3).

[Cohen, 2000b] Cohen, W.: Automatically Extracting Features for Concept Learning from the Web, Machine Learning: Proceedings of the Seventeenth International Conference (ML-2000).

[Collins & Singer 1999] Collins, M.; and Singer, Y. Unsupervised models for named entity classification. In Proceedings of the Joint SIGDAT Conference on Empirical Methods in Natural Language Processing and Very Large Corpora, 1999.

[De Jong 1982] De Jong, G. An Overview of the FRUMP System. In: Lehnert, W. & Ringle, M. H. (eds), Strategies for Natural Language Processing. Lawrence Erlbaum, 1982, 149-176.

[Freitag 98] Freitag, D.: Information extraction from HTML: application of a general machine learning approach, Proceedings of the Fifteenth National Conference on Artificial Intelligence (AAAI-98).

[Freitag, 1999] Freitag, D. Machine Learning for Information Extraction in Informal Domains. Ph.D. dissertation, Carnegie Mellon University.

[Freitag 2000] Freitag, D.: Machine Learning for Information Extraction in Informal Domains, Machine Learning 39(2/3): 99-101 (2000).

[Freitag & Kushmerick, 2000] Freitag, D.; Kushmerick, N.: Boosted Wrapper Induction. Proceedings of the Seventeenth National Conference on Artificial Intelligence (AAAI-2000).

[Freitag & McCallum 1999] Freitag, D. and McCallum, A. Information extraction using HMMs and shrinkage. In Proceedings of the AAAI-99 Workshop on Machine Learning for Information Extraction. AAAI Technical Report WS-99-11.

[Kushmerick, 2000] Kushmerick, N.: Wrapper Induction: efficiency and expressiveness, Artificial Intelligence, 118, pp. 15-68.

[Lafferty, McCallum & Pereira 2001] Lafferty, J.; McCallum, A.; and Pereira, F.: Conditional Random Fields: Probabilistic Models for Segmenting and Labeling Sequence Data, in Proceedings of ICML-2001.

[Leek 1997] Leek, T. R. Information extraction using hidden Markov models. Master's thesis, UC San Diego.

[McCallum, Freitag & Pereira 2000] McCallum, A.; Freitag, D.; and Pereira, F.: Maximum entropy Markov models for information extraction and segmentation, in Proceedings of ICML-2000.

[Miller et al 2000] Miller, S.; Fox, H.; Ramshaw, L.; Weischedel, R.: A Novel Use of Statistical Parsing to Extract Information from Text. Proceedings of the 1st Annual Meeting of the North American Chapter of the ACL (NAACL), pp. 226-233.
