Information Extraction,
Social Network Analysis,
Structured Topic Models & Influence Mapping
(with an advertisement for Entity Resolution and Resource-bounded Information Gathering)
Andrew McCallum
mccallum@cs.umass.edu
Information Extraction & Synthesis Laboratory
Department of Computer Science
University of Massachusetts
Joint work with Aron Culotta, Charles Sutton, Wei Li, Chris Pal, Pallika Kanani, Gideon
Mann, Natasha Mohanty, Xuerui Wang.
Goals
• Quickly understand and analyze contents of large volume of text + other data
  – browse topics
  – navigate connections
  – discover & see patterns
• Assess a data source to determine relevance
• Browse data newly acquired from the field
• Navigate your own data
• Discover structure and patterns
• Assess impact and influence
Clustering words into topics with
Latent Dirichlet Allocation
[Blei, Ng, Jordan 2003]
Generative Process:
For each document:
  Sample a distribution over topics, θ        Example: 70% Iraq war, 30% US election
  For each word in the document:
    Sample a topic, z                          Example: Iraq war
    Sample a word from the topic, w            Example: “bombing”
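As a concrete sketch of this generative process in Python: the two topics, the four-word vocabulary, and all probabilities below are made-up toys for illustration, not values from any fitted model.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy setup (hypothetical): 2 topics over a 4-word vocabulary.
vocab = ["bombing", "troops", "ballot", "campaign"]
topics = np.array([[0.6, 0.4, 0.0, 0.0],   # an "Iraq war"-like topic
                   [0.0, 0.0, 0.5, 0.5]])  # a "US election"-like topic
alpha = np.ones(2)                          # symmetric Dirichlet prior

def generate_document(n_words):
    theta = rng.dirichlet(alpha)            # per-document distribution over topics
    words = []
    for _ in range(n_words):
        z = rng.choice(len(theta), p=theta)        # sample a topic
        w = rng.choice(len(vocab), p=topics[z])    # sample a word from the topic
        words.append(vocab[w])
    return theta, words

theta, doc = generate_document(10)
```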
Example topics induced from a large collection of text        [Tenenbaum et al]

JOB: WORK, JOBS, CAREER, EXPERIENCE, EMPLOYMENT, OPPORTUNITIES, WORKING, TRAINING, SKILLS, CAREERS, POSITIONS, FIND, POSITION, FIELD, OCCUPATIONS, REQUIRE, OPPORTUNITY, EARN, ABLE
SCIENCE: STUDY, SCIENTISTS, SCIENTIFIC, KNOWLEDGE, WORK, RESEARCH, CHEMISTRY, TECHNOLOGY, MANY, MATHEMATICS, BIOLOGY, FIELD, PHYSICS, LABORATORY, STUDIES, WORLD, SCIENTIST, STUDYING, SCIENCES
BALL: GAME, TEAM, FOOTBALL, BASEBALL, PLAYERS, PLAY, FIELD, PLAYER, BASKETBALL, COACH, PLAYED, PLAYING, HIT, TENNIS, TEAMS, GAMES, SPORTS, BAT, TERRY
FIELD: MAGNETIC, MAGNET, WIRE, NEEDLE, CURRENT, COIL, POLES, IRON, COMPASS, LINES, CORE, ELECTRIC, DIRECTION, FORCE, MAGNETS, BE, MAGNETISM, POLE, INDUCED
STORY: STORIES, TELL, CHARACTER, CHARACTERS, AUTHOR, READ, TOLD, SETTING, TALES, PLOT, TELLING, SHORT, FICTION, ACTION, TRUE, EVENTS, TELLS, TALE, NOVEL
MIND: WORLD, DREAM, DREAMS, THOUGHT, IMAGINATION, MOMENT, THOUGHTS, OWN, REAL, LIFE, IMAGINE, SENSE, CONSCIOUSNESS, STRANGE, FEELING, WHOLE, BEING, MIGHT, HOPE
DISEASE: BACTERIA, DISEASES, GERMS, FEVER, CAUSE, CAUSED, SPREAD, VIRUSES, INFECTION, VIRUS, MICROORGANISMS, PERSON, INFECTIOUS, COMMON, CAUSING, SMALLPOX, BODY, INFECTIONS, CERTAIN
WATER: FISH, SEA, SWIM, SWIMMING, POOL, LIKE, SHELL, SHARK, TANK, SHELLS, SHARKS, DIVING, DOLPHINS, SWAM, LONG, SEAL, DIVE, DOLPHIN, UNDERWATER
Social Network in an Email Dataset
Author-Recipient-Topic SNA model
[McCallum, Corrada, Wang, 2005]
Topic choice depends on:
  - author
  - recipient
Enron Email Corpus
• 250k email messages
• 23k people
Date: Wed, 11 Apr 2001 06:56:00 -0700 (PDT)
From: debra.perlingiere@enron.com
To: steve.hooser@enron.com
Subject: Enron/TransAltaContract dated Jan 1, 2001
Please see below. Katalin Kiss of TransAlta has requested an
electronic copy of our final draft? Are you OK with this? If
so, the only version I have is the original draft without
revisions.
DP
Debra Perlingiere
Enron North America Corp.
Legal Department
1400 Smith Street, EB 3885
Houston, Texas 77002
dperlin@enron.com
Topics, and prominent senders / receivers, discovered by ART [McCallum et al 2005]
(Topic names assigned by hand.)
Topics, and prominent senders / receivers
discovered by ART
Beck = “Chief Operations Officer”
Dasovich = “Government Relations Executive”
Shapiro = “Vice President of Regulatory Affairs”
Steffes = “Vice President of Government Affairs”
Comparing Role Discovery
Connection strength (A, B) is computed from:
  Traditional SNA: distribution over recipients
  ART: distribution over authored topics (conditioned on recipients)
  Author-Topic: distribution over authored topics
Comparing Role Discovery
Tracy Geaconne / Dan McCarty
  Traditional SNA: similar roles
  ART: different roles
  Author-Topic: different roles
Geaconne = “Secretary”; McCarty = “Vice President”
Comparing Role Discovery
Lynn Blair / Kimberly Watson
  Traditional SNA: different roles
  ART: very similar
  Author-Topic: very different
Blair = “Gas pipeline logistics”; Watson = “Pipeline facilities planning”
ART: Roles but not Groups
Enron TransWestern Division:
  Traditional SNA: block structured
  ART: not block structured
  Author-Topic: not block structured
Two Relations with Different Attributes
Student Roster: Adams, Bennett, Carter, Davis, Edwards, Frederking

Academic Admiration: Acad(A, B), Acad(C, B), Acad(A, D), Acad(C, D), Acad(B, E), Acad(D, E), Acad(B, F), Acad(D, F), Acad(E, A), Acad(F, A), Acad(E, C), Acad(F, C)

Social Admiration: Soci(A, B), Soci(A, D), Soci(A, F), Soci(B, A), Soci(B, C), Soci(B, E), Soci(C, B), Soci(C, D), Soci(C, F), Soci(D, A), Soci(D, C), Soci(D, E), Soci(E, B), Soci(E, D), Soci(E, F), Soci(F, A), Soci(F, C), Soci(F, E)

[Adjacency matrices: ordering the students A, C, B, D, E, F block-diagonalizes the academic relation into three groups G1 = {A, C}, G2 = {B, D}, G3 = {E, F}; ordering them A, C, E, B, D, F block-diagonalizes the social relation into two groups G1 = {A, C, E}, G2 = {B, D, F}.]
The Group-Topic Model:
Discovering Groups and Topics Simultaneously
[Plate diagram: for each relation type (B in all), a topic t drawn from a Uniform prior; each entity’s group assignment g drawn from a topic-conditioned multinomial with a Dirichlet prior; each observed relation value v (over S² entity pairs) drawn from a per-group-pair binomial with a Beta prior; each of the Nb words w drawn from a per-topic multinomial with a Dirichlet prior (T topics).]
Inference and Estimation
Gibbs Sampling:
- Many random variables can be integrated out
- Easy to implement
- Reasonably fast
We assume the relationship is symmetric.
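To illustrate what integrating out the multinomials buys, here is a minimal collapsed Gibbs sampler for plain LDA, the simplest relative of the models above (not the Group-Topic sampler itself); the corpus and hyperparameters are toys:

```python
import numpy as np

rng = np.random.default_rng(0)

# Collapsed Gibbs for LDA: theta and phi are integrated out, so only the
# topic assignments z are resampled, from count statistics.
docs = [[0, 1, 2, 2], [2, 3, 3, 1]]        # toy corpus: word ids per document
T, V, alpha, beta = 2, 4, 0.5, 0.1

z = [[int(rng.integers(T)) for _ in doc] for doc in docs]
ndt = np.zeros((len(docs), T))             # document-topic counts
ntw = np.zeros((T, V))                     # topic-word counts
nt = np.zeros(T)                           # topic totals
for d, doc in enumerate(docs):
    for i, w in enumerate(doc):
        ndt[d, z[d][i]] += 1; ntw[z[d][i], w] += 1; nt[z[d][i]] += 1

for sweep in range(50):
    for d, doc in enumerate(docs):
        for i, w in enumerate(doc):
            t = z[d][i]                    # remove current assignment
            ndt[d, t] -= 1; ntw[t, w] -= 1; nt[t] -= 1
            p = (ndt[d] + alpha) * (ntw[:, w] + beta) / (nt + V * beta)
            t = int(rng.choice(T, p=p / p.sum()))
            z[d][i] = t                    # record new assignment
            ndt[d, t] += 1; ntw[t, w] += 1; nt[t] += 1
```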
Dataset #1:
U.S. Senate
• 16 years of voting records in the US Senate (1989 – 2005)
• a Senator may respond Yea or Nay to a resolution
• 3423 resolutions with text attributes (index terms)
• 191 Senators in total across 16 years
S.543
Title: An Act to reform Federal deposit insurance, protect the deposit insurance
funds, recapitalize the Bank Insurance Fund, improve supervision and regulation
of insured depository institutions, and for other purposes.
Sponsor: Sen Riegle, Donald W., Jr. [MI] (introduced 3/5/1991) Cosponsors (2)
Latest Major Action: 12/19/1991 Became Public Law No: 102-242.
Index terms: Banks and banking, Accounting, Administrative fees, Cost control, Credit, Deposit insurance, Depressed areas, and 110 other terms.
Adams (D-WA), Nay Akaka (D-HI), Yea Bentsen (D-TX), Yea Biden (D-DE), Yea
Bond (R-MO), Yea Bradley (D-NJ), Nay Conrad (D-ND), Nay ……
Topics Discovered (U.S. Senate)
Mixture of Unigrams:
  Education: education, school, aid, children, drug, students, elementary, prevention
  Energy: energy, power, water, nuclear, gas, petrol, research, pollution
  Military Misc.: government, military, foreign, tax, congress, aid, law, policy
  Economic: federal, labor, insurance, aid, tax, business, employee, care

Group-Topic Model:
  Economic: labor, insurance, tax, congress, income, minimum, wage, business
  Social Security + Medicare: social, security, insurance, medical, care, medicare, disability, assistance
  Education + Domestic: education, school, federal, aid, government, tax, energy, research
  Foreign: foreign, trade, chemicals, tariff, congress, drugs, communicable, diseases
Groups Discovered (US Senate)
Groups from topic Education + Domestic
Senators who change coalition the most, dependent on topic.
e.g. Senator Shelby (D-AL) votes
  with the Republicans on Economic,
  with the Democrats on Education + Domestic,
  with a small group of maverick Republicans on Social Security + Medicare.
Dataset #2:
The UN General Assembly
• Voting records of the UN General Assembly (1990 – 2003)
• A country may choose to vote Yes, No, or Abstain
• 931 resolutions with text attributes (titles)
• 192 countries in total
• Also experiments later with resolutions from 1960 – 2003
Vote on Permanent Sovereignty of Palestinian People, 87th plenary meeting
The draft resolution on permanent sovereignty of the Palestinian people in the
occupied Palestinian territory, including Jerusalem, and of the Arab population in
the occupied Syrian Golan over their natural resources (document A/54/591) was
adopted by a recorded vote of 145 in favour to 3 against with 6 abstentions:
In favour: Afghanistan, Argentina, Belgium, Brazil, Canada, China, France,
Germany, India, Japan, Mexico, Netherlands, New Zealand, Pakistan, Panama,
Russian Federation, South Africa, Spain, Turkey, and other 126 countries.
Against: Israel, Marshall Islands, United States.
Abstain: Australia, Cameroon, Georgia, Kazakhstan, Uzbekistan, Zambia.
Topics Discovered (UN)
Mixture of Unigrams:
  Everything Nuclear: nuclear, weapons, use, implementation, countries
  Human Rights: rights, human, palestine, situation, israel
  Security in Middle East: occupied, israel, syria, security, calls

Group-Topic Model:
  Nuclear Non-proliferation: nuclear, states, united, weapons, nations
  Nuclear Arms Race: nuclear, arms, prevention, race, space
  Human Rights: rights, human, palestine, occupied, israel
Groups Discovered (UN)
The country lists for each group are ordered by 2005 GDP (PPP); only 5 countries are shown for groups with more than 5 members.
Groups and Topics, Trends over Time (UN)
I call these...
Structured Topic Models
Models that combine text analysis
with other structured data:
people, senders, receivers,
organizations, votes,
time, locations, materials, ...
Improve Basic Infrastructure
of Topic Models
• Incorporate time
• Finer-grained, more interpretable topics
by representing topic correlations
• Discover relevant phrases
• Map influence and impact
Groups and Topics, Trends over Time (UN)
Want to Model Trends over Time
• Pattern appears only briefly
– Capture its statistics in focused way
– Don’t confuse it with patterns elsewhere in time
• Is prevalence of topic growing or waning?
• How do roles, groups, influence shift over time?
Topics over Time (TOT)
[Wang, McCallum, KDD 2006]
[Plate diagram: for each of D documents, a multinomial over topics θ drawn from a Dirichlet prior; for each of the Nd words, a topic index z, a word w from the per-topic multinomial over words (T topics, Dirichlet prior), and a timestamp t from the per-topic Beta over time; uniform prior on the Beta parameters.]
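A sketch of TOT’s per-token generative step; the topic, word, and Beta parameters below are invented toys, chosen so that topic 0 is an “early” topic and topic 1 a “late” one:

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical parameters: per-document topic multinomial, per-topic word
# multinomials over a 3-word vocab, and per-topic Betas over normalized time.
theta = np.array([0.7, 0.3])
phi = np.array([[0.5, 0.5, 0.0],
                [0.0, 0.2, 0.8]])
beta_params = [(2.0, 8.0), (8.0, 2.0)]   # topic 0 peaks early, topic 1 late

def sample_token():
    z = rng.choice(2, p=theta)           # topic index
    w = rng.choice(3, p=phi[z])          # word from the topic
    a, b = beta_params[z]
    t = rng.beta(a, b)                   # timestamp from the topic's Beta
    return z, w, t

tokens = [sample_token() for _ in range(1000)]
```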
State of the Union Address
208 addresses delivered between January 8, 1790 and January 29, 2002.
To increase the number of documents, we split the addresses into paragraphs and treated them as ‘documents’. One-line paragraphs were excluded. Stop words were removed.
• 17,156 ‘documents’
• 21,534 words
• 669,425 tokens
Our scheme of taxation, by means of which this needless surplus is taken
from the people and put into the public Treasury, consists of a tariff or
duty levied upon importations from abroad and internal-revenue taxes levied
upon the consumption of tobacco and spirituous and malt liquors. It must be
conceded that none of the things subjected to internal-revenue taxation
are, strictly speaking, necessaries. There appears to be no just complaint
of this taxation by the consumers of these articles, and there seems to be
nothing so well able to bear the burden without hardship to any portion of
the people.
1910
Comparing TOT against LDA
TOT versus LDA on my email
Topic Distributions Conditioned on Time
Topic mass (vertical height) over time, in NIPS conference papers.
Discovering Group Structure Trends over Time
Group model without time vs. group model with time.
[Plate diagrams: a multinomial distribution over groups; a group id per entity; an observed (absent / present) relation per entity pair, drawn from a per-group-pair binomial; in the model with time, each relation also carries a timestamp drawn from a per-group beta over time; G groups.]
Improve Basic Infrastructure
of Topic Models
• Incorporate time
• Finer-grained, more interpretable topics
by representing topic correlations
• Discover relevant phrases
• Map influence and impact
Latent Dirichlet Allocation
[Blei, Ng, Jordan, 2003]
[Plate diagram: Dirichlet prior α over the per-document topic distribution θ; a topic z for each of the n words; each word w drawn from the per-topic multinomial over words φ (T topics, Dirichlet prior β); N documents.]

LDA 20, “images, motion, eyes”: visual, model, motion, field, object, image, images, objects, fields, receptive, eye, position, spatial, direction, target, vision, multiple, figure, orientation, location

LDA 100, “motion” (+ some generic): motion, detection, field, optical, flow, sensitive, moving, functional, detect, contrast, light, dimensional, intensity, computer, mt, measures, occlusion, temporal, edge, real
Pachinko Machine
Pachinko Allocation Model (PAM)
[Li, McCallum, 2006]
Model structure: a directed acyclic graph (DAG); at each interior node, a Dirichlet over its children; words at the leaves.
For each document: sample a multinomial from each Dirichlet.
For each word in this document: starting from the root, sample a child from successive nodes, down to a leaf; generate the word at the leaf.
Like a Polya tree, but DAG-shaped, with an arbitrary number of children.
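The per-word sampling walk can be sketched as follows; the DAG, its child distributions, and the leaf vocabularies are all hypothetical (note the shared child, which a tree could not express):

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical DAG: a root, two super-topics, three leaf topics;
# topicB has two parents.
children = {"root": ["super1", "super2"],
            "super1": ["topicA", "topicB"],
            "super2": ["topicB", "topicC"]}
child_probs = {"root": [0.5, 0.5],       # stand-ins for per-document multinomials
               "super1": [0.9, 0.1],
               "super2": [0.3, 0.7]}
leaf_words = {"topicA": ["ball", "game"],
              "topicB": ["field", "magnet"],
              "topicC": ["story", "tale"]}

def sample_word():
    node = "root"
    while node in children:                          # walk down to a leaf
        kids = children[node]
        node = kids[rng.choice(len(kids), p=child_probs[node])]
    return str(rng.choice(leaf_words[node]))         # emit a word at the leaf

words = [sample_word() for _ in range(20)]
```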
Pachinko Allocation Model
[Li, McCallum, 2006]
– Top level: distributions over distributions over topics...
– Middle levels: distributions over topics; mixtures, representing topic correlations.
– Leaves: distributions over words (like “LDA topics”).
Some interior nodes could contain one multinomial, used for all documents (i.e. a very peaked Dirichlet).
Pachinko Allocation Model
[Li, McCallum, 2006]
– Estimate all these Dirichlets from data.
– Estimate model structure from data (number of nodes, and connectivity).
Pachinko Allocation Special Cases
Latent Dirichlet Allocation: a PAM whose root’s children are the word distributions (a two-level DAG) is exactly LDA.
Inference – Gibbs Sampling
The super-topic and sub-topic assignments for each word are jointly sampled:

  P(z_w2 = t_k, z_w3 = t_p | D, z_-w, α, β) ∝
      (n_1k^(d) + α_1k) / (n_1^(d) + Σ_k′ α_1k′)      ... P(t_k)
    × (n_kp^(d) + α_kp) / (n_k^(d) + Σ_p′ α_kp′)      ... P(t_p | t_k)
    × (n_pw + β_w) / (n_p + Σ_m β_m)                  ... P(w | t_p)

[Plate diagram: Dirichlet parameters α2, α3 over per-document multinomials θ2, θ3; topic indices z2, z3 jointly sampled; word w from per-topic multinomial φ (Dirichlet prior β); n words per document, N documents, T topics.]
Dirichlet parameters α are estimated with moment matching
Example Topics

LDA 20, “images, motion, eyes”: visual, model, motion, field, object, image, images, objects, fields, receptive, eye, position, spatial, direction, target, vision, multiple, figure, orientation, location

LDA 100, “motion” (some generic): motion, detection, field, optical, flow, sensitive, moving, functional, detect, contrast, light, dimensional, intensity, computer, mt, measures, occlusion, temporal, edge, real

PAM 100, “motion”: motion, video, surface, surfaces, figure, scene, camera, noisy, sequence, activation, generated, analytical, pixels, measurements, assigne, advance, lated, shown, closed, perceptual

PAM 100, “eyes”: eye, head, vor, vestibulo, oculomotor, vestibular, vary, reflex, vi, pan, rapid, semicircular, canals, responds, streams, cholinergic, rotation, topographically, detectors, ning

PAM 100, “images”: image, digit, faces, pixel, surface, interpolation, scene, people, viewing, neighboring, sensors, patches, manifold, dataset, magnitude, transparency, rich, dynamical, amounts, tor
Blind Topic Evaluation
• Randomly select 25 similar pairs of topics generated from PAM and LDA
• 5 people
• Each asked to “select the topic in each pair that you find more semantically coherent.”

Topic counts:   LDA   PAM
  5 votes        0     5
  >= 4 votes     3     8
  >= 3 votes     9    16
Examples
PAM (5 votes): control, systems, robot, adaptive, environment, goal, state, controller
LDA (0 votes): control, systems, based, adaptive, direct, con, controller, change
Examples
PAM (4 votes): motion, image, detection, images, scene, vision, texture, segmentation
LDA (1 vote): image, motion, images, multiple, local, generated, noisy, optical
Examples
PAM (4 votes): signals, source, separation, eeg, sources, blind, single, event
LDA (1 vote): signal, signals, single, time, low, source, temporal, processing

PAM (1 vote): algorithm, learning, algorithms, gradient, convergence, function, stochastic, weight
LDA (4 votes): algorithm, algorithms, gradient, convergence, stochastic, line, descent, converge
Topic Correlations
Likelihood Comparison
• Varying the number of topics: PAM supports ~5x more topics than LDA
Improve Basic Infrastructure
of Topic Models
• Incorporate time
• Finer-grained, more interpretable topics
by representing topic correlations
• Discover relevant phrases
• Map influence and impact
Topic Interpretability
[Wang, McCallum 2005]
LDA: algorithms, algorithm, genetic, problems, efficient
Topical N-grams: genetic algorithms, genetic algorithm, evolutionary computation, evolutionary algorithms, fitness function
See also: [Steyvers, Griffiths, Newman, Smyth 2005]
Topical N-gram Model
[Plate diagram: for each word position i, a topic z_i and a uni-/bi-gram status y_i; the word w_i is drawn either from the topic’s unigram distribution (T unigram multinomials over W words) or, when y_i indicates a bigram, from a distribution conditioned on the previous word and the topic (W × T bigram multinomials); D documents.]
Features of Topical N-Grams model
• Easily trained by Gibbs sampling
– Can run efficiently on millions of words
• Topic-specific phrase discovery
– “white house” has special meaning as a phrase in the politics topic...
– ...but not in the real estate topic.
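A sketch of the uni-/bi-gram switch: a per-position status variable y decides whether a word is drawn from the topic’s unigram distribution or attaches to the previous word as a phrase. All names and probabilities here are toy assumptions (a politics-like topic in which “white house” tends to form a phrase):

```python
import numpy as np

rng = np.random.default_rng(0)

vocab = ["white", "house", "politics"]
unigram = {0: [0.3, 0.3, 0.4]}                 # topic 0's unigram multinomial
bigram = {(0, "white"): [0.05, 0.9, 0.05]}     # P(w | prev="white", topic 0)
p_bigram_after = {(0, "white"): 0.8}           # P(y=1 | prev="white", topic 0)

def generate(n, topic=0):
    words, prev = [], None
    for _ in range(n):
        # status y: does this word attach to the previous word as a phrase?
        y = rng.random() < p_bigram_after.get((topic, prev), 0.0)
        dist = bigram[(topic, prev)] if y else unigram[topic]
        prev = vocab[rng.choice(len(vocab), p=dist)]
        words.append(prev)
    return words

tokens = generate(200)
```

With these toy values, “house” follows “white” far more often than the unigram distribution alone would produce.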
Topic Comparison

LDA: learning, optimal, reinforcement, state, problems, policy, dynamic, action, programming, actions, function, markov, methods, decision, rl, continuous, spaces, step, policies, planning

Topical N-grams (2): reinforcement learning, optimal policy, dynamic programming, optimal control, function approximator, prioritized sweeping, finite-state controller, learning system, reinforcement learning rl, function approximators, markov decision problems, markov decision processes, local search, state-action pair, markov decision process, belief states, stochastic policy, action selection, upright position, reinforcement learning methods

Topical N-grams (1): policy, action, states, actions, function, reward, control, agent, q-learning, optimal, goal, learning, space, step, environment, system, problem, steps, sutton, policies
Topic Comparison

LDA: word, system, recognition, hmm, speech, training, performance, phoneme, words, context, systems, frame, trained, speaker, sequence, speakers, mlp, frames, segmentation, models

Topical N-grams (2): speech recognition, training data, neural network, error rates, neural net, hidden markov model, feature vectors, continuous speech, training procedure, continuous speech recognition, gamma filter, hidden control, speech production, neural nets, input representation, output layers, training algorithm, test set, speech frames, speaker dependent

Topical N-grams (1): speech, word, training, system, recognition, hmm, speaker, performance, phoneme, acoustic, words, context, systems, frame, trained, sequence, phonetic, speakers, mlp, hybrid
Improve Basic Infrastructure
of Topic Models
• Incorporate time
• Finer-grained, more interpretable topics
by representing topic correlations
• Discover relevant phrases
• Map influence and impact
Previous Systems
[Diagram: Research Papers connected by a Cites relation.]

More Entities and Relations
[Diagram: Research Papers, People, Grants, Venues, Universities, and Groups, connected by relations such as Cites and Expertise.]
Topical Transfer
Citation counts from one topic to another.
Map “producers and consumers”
Topical Bibliometric Impact Measures
[Mann, Mimno, McCallum, 2006]
• Topical Citation Counts
• Topical Impact Factors
• Topical Longevity
• Topical Precedence
• Topical Diversity
• Topical Transfer
Topical Transfer
Transfer from Digital Libraries to other topics:

  Other topic       Cit’s  Paper Title
  Web Pages           31   Trawling the Web for Emerging Cyber-Communities, Kumar, Raghavan, ... 1999.
  Computer Vision     14   On being ‘Undigital’ with digital cameras: extending the dynamic...
  Video               12   Lessons learned from the creation and deployment of a terabyte digital video libr..
  Graphs              12   Trawling the Web for Emerging Cyber-Communities
  Web Pages           11   WebBase: a repository of Web pages
Topical Diversity
Papers that had the most influence across many other fields...
Topical diversity = entropy of the topic distribution among papers that cite this paper (this topic).
[Figure: example papers with high diversity vs. low diversity.]
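The entropy computation behind topical diversity is straightforward; a sketch with hypothetical citation counts:

```python
import numpy as np

# Topical diversity as described above: entropy of the topic distribution
# among the papers that cite a given paper. Counts below are hypothetical.
def topical_diversity(citing_topic_counts):
    p = np.asarray(citing_topic_counts, dtype=float)
    p = p / p.sum()
    p = p[p > 0]                            # drop empty topics
    return float(-(p * np.log2(p)).sum())   # entropy in bits

high = topical_diversity([10, 10, 10, 10])  # cited evenly from 4 fields -> 2.0 bits
low = topical_diversity([40, 0, 0, 0])      # cited from a single field -> 0.0 bits
```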
Topical Bibliometric Impact Measures
[Mann, Mimno, McCallum, 2006]
• Topical Citation Counts
• Topical Impact Factors
• Topical Longevity
• Topical Precedence
• Topical Diversity
• Topical Transfer
Topical Precedence
Within a topic, what are the earliest papers
that received more than n citations?
Speech Recognition:
Some experiments on the recognition of speech, with one and two ears,
E. Colin Cherry (1953)
Spectrographic study of vowel reduction,
B. Lindblom (1963)
Automatic Lipreading to enhance speech recognition,
Eric D. Petajan (1965)
Effectiveness of linear prediction characteristics of the speech wave for...,
B. Atal (1974)
Automatic Recognition of Speakers from Their Voices,
B. Atal (1976)
Topical Precedence
Within a topic, what are the earliest papers
that received more than n citations?
Information Retrieval:
On Relevance, Probabilistic Indexing and Information Retrieval,
Kuhns and Maron (1960)
Expected Search Length: A Single Measure of Retrieval Effectiveness Based
on the Weak Ordering Action of Retrieval Systems,
Cooper (1968)
Relevance feedback in information retrieval,
Rocchio (1971)
Relevance feedback and the optimization of retrieval effectiveness,
Salton (1971)
New experiments in relevance feedback,
Ide (1971)
Automatic Indexing of a Sound Database Using Self-organizing Neural Nets,
Feiten and Gunzel (1982)
Topical Transfer Through Time
• Can we predict which research topics
will be “hot” at the ICML conference next year?
• ...based on
– the hot topics in “neighboring” venues last year
– learned “neighborhood” distances for venue pairs
How do Ideas Progress Through
Social Networks?
Hypothetical Example:
“AdaBoost”
SIGIR
(Info. Retrieval)
COLT
ICML
ICCV
(Vision)
ACL
(NLP)
Topic Prediction Models
Static Model
Transfer Model
Linear regression and ridge regression are used for coefficient training.
Preliminary Results
[Plot: mean squared prediction error (smaller is better) for the Transfer Model, vs. the number of venues used for prediction.]
The Transfer Model with ridge regression is a good predictor.
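As a sketch of this kind of coefficient training (not the actual experiment), the closed-form ridge solution on synthetic venue data looks like:

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic data: rows are years; columns are neighboring venues' topic mass
# in year t; y is the target venue's topic mass in year t+1.
n_years, n_venues = 30, 5
X = rng.random((n_years, n_venues))
true_w = np.array([0.6, 0.3, 0.0, 0.0, 0.1])            # hypothetical transfer weights
y = X @ true_w + 0.01 * rng.standard_normal(n_years)    # small observation noise

lam = 0.1
# Closed-form ridge solution: w = (X'X + lam*I)^{-1} X'y
w = np.linalg.solve(X.T @ X + lam * np.eye(n_venues), X.T @ y)
mse = float(np.mean((X @ w - y) ** 2))
```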
Toward More Detailed, Structured Data
Structured Topic Models
Leveraging Text in Social Network Analysis
[Pipeline: Document collection → IE (Segment, Classify, Associate, Cluster) → Database → Data Mining (Discover patterns: entity types, links / relations, events) → Actionable knowledge (Prediction, Outlier detection, Decision support).]
Extract structured data about entities, relations, events.
Joint Inference
[Pipeline: Spider → Filter → Document collection → IE (Segment, Classify, Associate, Cluster) → Database → Data Mining (Discover patterns: entity types, links / relations, events) → Actionable knowledge (Prediction, Outlier detection, Decision support); uncertainty info flows forward from IE, and emerging patterns feed back from Data Mining.]
Solution: Unified Model
[Pipeline: Spider → Filter → Document collection → a single Probabilistic Model spanning IE (Segment, Classify, Associate, Cluster) and Data Mining (Discover patterns: entity types, links / relations, events) → Actionable knowledge (Prediction, Outlier detection, Decision support).]
Discriminatively-trained undirected graphical models:
– Conditional Random Fields [Lafferty, McCallum, Pereira]
– Conditional PRMs [Koller…], [Jensen…], [Getoor…], [Domingos…]
Complex inference and learning: just what we researchers like to sink our teeth into!
(Linear Chain) Conditional Random Fields
[Lafferty, McCallum, Pereira 2001]
Undirected graphical model, trained to maximize
conditional probability of output sequence given input sequence
Finite state model / graphical model:

  FSM states (output sequence):   ...  OTHER   PERSON   OTHER    ORG      TITLE  ...
                                      y_{t-1}   y_t    y_{t+1}  y_{t+2}  y_{t+3}
  observations (input sequence):  ...  said    Jones     a     Microsoft   VP    ...
                                      x_{t-1}   x_t    x_{t+1}  x_{t+2}  x_{t+3}

  p(y | x) = (1/Z_x) Π_t Φ(y_t, y_{t-1}, x, t)

  where Φ(y_t, y_{t-1}, x, t) = exp( Σ_k λ_k f_k(y_t, y_{t-1}, x, t) )
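A sketch of this distribution in code: random stand-in log-potentials, with Z_x computed by the forward algorithm so that the probabilities of all label sequences sum to one. Nothing here is the actual trained model:

```python
import numpy as np
from itertools import product

rng = np.random.default_rng(0)

n_states, n_steps = 3, 4
# log Phi(y_t, y_{t-1}, x, t) as random stand-ins, indexed [t, y_{t-1}, y_t];
# a real CRF would compute these as sum_k lambda_k * f_k(y_t, y_{t-1}, x, t).
log_phi = rng.standard_normal((n_steps, n_states, n_states))

def log_Z(log_phi):
    alpha = log_phi[0, 0]                       # dummy start state 0 at t = 0
    for t in range(1, len(log_phi)):
        m = alpha[:, None] + log_phi[t]         # alpha_{t-1}(i) + log_phi[t, i, j]
        mx = m.max(axis=0)
        alpha = mx + np.log(np.exp(m - mx).sum(axis=0))
    mx = alpha.max()
    return mx + np.log(np.exp(alpha - mx).sum())

def log_prob(y, log_phi):
    score = log_phi[0, 0, y[0]]
    score += sum(log_phi[t, y[t - 1], y[t]] for t in range(1, len(y)))
    return score - log_Z(log_phi)

# sanity check: p(y | x) sums to one over all label sequences
total = sum(np.exp(log_prob(list(y), log_phi))
            for y in product(range(n_states), repeat=n_steps))
```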
Wide-spread interest, positive experimental results in many applications:
• Noun phrase, Named entity [HLT’03], [CoNLL’03]
• Protein structure prediction [ICML’04]
• IE from Bioinformatics text [Bioinformatics ’04], …
• Asian word segmentation [COLING’04], [ACL’04]
• IE from Research papers [HLT’04]
• Object classification in images [CVPR ’04]
Table Extraction from Government Reports
Cash receipts from marketings of milk during 1995 at $19.9 billion dollars, was
slightly below 1994. Producer returns averaged $12.93 per hundredweight,
$0.19 per hundredweight below 1994. Marketings totaled 154 billion pounds,
1 percent above 1994. Marketings include whole milk sold to plants and dealers
as well as milk sold directly to consumers.
An estimated 1.56 billion pounds of milk were used on farms where produced,
8 percent less than 1994. Calves were fed 78 percent of this milk with the
remainder consumed in producer households.
Milk Cows and Production of Milk and Milkfat:
United States, 1993-95
-------------------------------------------------------------------------------:
:
Production of Milk and Milkfat 2/
:
Number
:------------------------------------------------------Year
:
of
:
Per Milk Cow
:
Percentage
:
Total
:Milk Cows 1/:-------------------: of Fat in All :-----------------:
: Milk : Milkfat : Milk Produced : Milk : Milkfat
-------------------------------------------------------------------------------: 1,000 Head
--- Pounds --Percent
Million Pounds
:
1993
:
9,589
15,704
575
3.66
150,582 5,514.4
1994
:
9,500
16,175
592
3.66
153,664 5,623.7
1995
:
9,461
16,451
602
3.66
155,644 5,694.3
-------------------------------------------------------------------------------1/ Average number during year, excluding heifers not yet fresh.
2/ Excludes milk sucked by calves.
Table Extraction from Government Reports
[Pinto, McCallum, Wei, Croft, 2003 SIGIR]
100+ documents from www.fedstats.gov
Labels:
CRF
milk during 1995 at $19.9 billion dollars,
turns averaged $12.93 per hundredweight,
94.
Marketings totaled 154 billion pounds,
s include whole milk sold to plants and
consumers.
of milk were used on farms where produced,
s were fed 78 percent of this milk with the
ouseholds.
Production of Milk and Milkfat:
ted States, 1993-95
--------------------------------------------
Production of Milk and Milkfat 2/
--------------------------------------------
Milk Cow
:
Percentage
------------: of Fat in All
:
Total
:--------------
Labels (12 in all):
• Non-Table
• Table Title
• Table Header
• Table Data Row
• Table Section Data Row
• Table Footnote
• ...

Features:
• Percentage of digit chars
• Percentage of alpha chars
• Indented
• Contains 5+ consecutive spaces
• Whitespace in this line aligns with prev.
• ...
• Conjunctions of all previous features, time offset: {0,0}, {-1,0}, {0,1}, {1,2}.
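A few of the listed features are easy to sketch as predicates on a raw line (paraphrased names, not the exact feature set used in the paper):

```python
# Sketch of line-level features for table extraction.
def line_features(line):
    chars = [c for c in line if not c.isspace()]
    n = max(len(chars), 1)
    return {
        "pct_digit": sum(c.isdigit() for c in chars) / n,
        "pct_alpha": sum(c.isalpha() for c in chars) / n,
        "indented": line.startswith((" ", "\t")),
        "big_gap": " " * 5 in line,          # contains 5+ consecutive spaces
    }

feats = line_features("     9,589     15,704     575     3.66")
```

A data row like the one above scores high on digit percentage, is indented, and contains large whitespace gaps, which is what lets the CRF separate table rows from running text.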
Table Extraction Experimental Results
[Pinto, McCallum, Wei, Croft, 2003 SIGIR]
Line labels, percent correct:
  HMM               65 %
  Stateless MaxEnt  85 %
  CRF               95 %
IE from Research Papers
[McCallum et al ‘99]
IE from Research Papers
Field-level F1:
  Hidden Markov Models (HMMs) [Seymore, McCallum, Rosenfeld, 1999]   75.6
  Support Vector Machines (SVMs) [Han, Giles, et al, 2003]           89.7
  Conditional Random Fields (CRFs) [Peng, McCallum, 2004]            93.9   (error reduced 40%)
Named Entity Recognition
CRICKET MILLNS SIGNS FOR BOLAND
CAPE TOWN 1996-08-22
South African provincial side
Boland said on Thursday they
had signed Leicestershire fast
bowler David Millns on a one
year contract.
Millns, who toured Australia with
England A in 1992, replaces
former England all-rounder
Phillip DeFreitas as Boland's
overseas professional.
Labels and examples:
  PER:  Yayuk Basuki, Innocent Butare
  ORG:  3M, KDP, Cleveland
  LOC:  Cleveland, Nirmal Hriday, The Oval
  MISC: Java, Basque, 1,000 Lakes Rally
Named Entity Extraction Results
[McCallum & Li, 2003, CoNLL]
Method                     F1
HMMs: BBN’s Identifinder   73%
CRFs                       90%
MALLET
Machine Learning for LanguagE Toolkit
• ~80k lines of Java
• Based on experience with previous toolkits: Rainbow, WhizBang, GATE, Weka.
• Document classification, information extraction, clustering, co-reference, cross-document co-reference, POS tagging, shallow parsing, relational classification, sequence alignment, structured topic models, social network analysis with text.
• Infrastructure for pipelining feature extraction and processing steps.
• Many ML basics in common, convenient framework: naïve Bayes, MaxEnt, Boosting, SVMs; Dirichlets, Conjugate Gradient.
• Advanced ML algorithms: Conditional Random Fields, BFGS, Expectation Propagation, …
• Unlike other general toolkits (e.g. Weka), MALLET scales to millions of features and millions of training examples, as needed for NLP.
• Now being used in many universities & companies all over the world: MIT, CMU, UPenn, Berkeley, UTexas, Purdue, Oregon State, UWash, UMass, Google, Yahoo, BAE; also in the UK, Germany, and France.
Semi-Supervised Learning
• Labeled data is expensive
– Especially for sequence modeling tasks
– POS tagging, word segmentation, NER
• Unlabeled data is abundant
  – The Web
  – Newswire
  – Other internal reports
  – etc.
HMM-LDA Model
[Griffiths, et al. 2004]
• Distinguish between
semantic words and syntactic words
Experiments
• Dataset
  – Wall Street Journal (WSJ) collection labeled with part-of-speech tags: 2,312 documents, 38,665 unique words, and 1.2M word tokens.
• 50 topics and 40 syntactic classes
• Gibbs sampling
  – 40 samples, with a lag of 100 iterations between them and an initial burn-in period of 4000 iterations.
Sample Syntactic Clusters

  make 0.0279      of 0.7448     way 0.0172        last 0.0767
  sell 0.0210      in 0.0828     agreement 0.0140  first 0.0740
  buy 0.0174       for 0.0355    price 0.0136      next 0.0479
  take 0.0164      from 0.0239   time 0.0121       york 0.0433
  get 0.0157       and 0.0238    bid 0.0103        third 0.0424
  do 0.0155        to 0.0185     effort 0.0100     past 0.0368
  pay 0.0152       ; 0.0096      position 0.0098   this 0.0361
  go 0.0113        with 0.0073   meeting 0.0098    dow 0.0295
  give 0.0104      that 0.0055   offer 0.0093      federal 0.0288
  provide 0.0086   or 0.0039     day 0.0092        fiscal 0.0262

Table 1: Sample syntactic word clusters; each column displays the top 10 words in one cluster and their probabilities.
Sample Semantic Clusters

  bank 0.0918        computer 0.0610    jaguar 0.0824   ad 0.0314
  loans 0.0327       computers 0.0301   ford 0.0641     advertising 0.0298
  banks 0.0291       ibm 0.0280         gm 0.0353       agency 0.0268
  loan 0.0289        data 0.0200        shares 0.0249   brand 0.0181
  thrift 0.0264      machines 0.0191    auto 0.0172     ads 0.0177
  assets 0.0235      technology 0.0182  express 0.0144  saatchi 0.0162
  savings 0.0220     software 0.0176    maker 0.0136    brands 0.0142
  federal 0.0179     digital 0.0173     car 0.0134      account 0.0120
  regulators 0.0146  systems 0.0169     share 0.0128    industry 0.0106
  debt 0.0142        business 0.0151    saab 0.0116     clients 0.0105

Table 2: Sample semantic word clusters; each column displays the top 10 words in one cluster and their probabilities.
POS Tagging
• Features
  – Word unigrams and bigrams
  – Spelling features
  – Word suffixes
  – Cluster features
    • HMM-LDA: the most likely class assignment for each word over all the samples
    • HC: bit-string prefixes of lengths 8, 12, 16 and 20
• CRFs
Evaluation Results
(a) 10k labeled words, OOV rate = 24.46%
  Error(%)   No Clusters   Hierarchical    HMM-LDA
  Overall    10.04         9.46 (5.78)     8.56 (14.74)
  OOV        22.32         21.56 (3.40)    18.49 (17.16)

(b) 30k labeled words, OOV rate = 15.31%
  Error(%)   No Clusters   Hierarchical    HMM-LDA
  Overall    6.08          5.85 (3.78)     5.40 (11.18)
  OOV        17.34         17.35 (-0.00)   15.01 (13.44)

(c) 50k labeled words, OOV rate = 12.49%
  Error(%)   No Clusters   Hierarchical    HMM-LDA
  Overall    5.34          5.12 (4.12)     4.78 (10.30)
  OOV        16.36         16.21 (0.92)    14.45 (11.67)

Up to 18% reduction in error.
Desired Future Work
• Add more “structured data types” to topic models.
• Leverage Pachinko Allocation to learn topic hierarchies
and topic correlations in time.
• New type of topic model
– fast enough to work on streaming data
– more naturally combines many data modalities
(add more “structured data types” together)
– topics defined by both positive and negative features
• Use structured topic models to help predict influence
and impact.
• Extremely low-supervision training of information
extractors. Discover interesting entity/relation classes.
End of Talk
Download