Constrained Conditional Models Tutorial



Jingyu Chen, Xiao Cheng

INTRODUCTION

Main ideas:

• Idea 1: Modeling

Separate modeling and problem formulation from algorithms

Similar to the philosophy of probabilistic modeling

• Idea 2: Inference

Keep model simple, make expressive decisions (via constraints)

Unlike probabilistic modeling, where models become more expressive

Inject background knowledge

• Idea 3: Learning

Expressive structured decisions can be supported by simply learned models

Global Inference can be used to amplify the simple models (and even minimal supervision).

Task of interest: Structured Prediction

• Common formulation

• e.g. HMM, CRF, Structured Perceptron etc.

$\arg\max_y w^T f(x, y)$

• Covers a lot of NLP problems:

• Parsing; Semantic Parsing; Summarization; Transliteration; Coreference resolution, Textual Entailment…

• IE problems:

• Entities, relations, attributes…

• How to improve without incurring performance issues?

Pipeline?

• Very crude approximation to the real problem, propagates error.

• Ignores dependency 𝜙(𝑥, 𝑦) :

• e.g. in relation extraction, the label of an entity depends on the relation it is involved in, and the relation label depends on the labels of its arguments.

Model Formulation

• Typical models: $\arg\max_y w^T f(x, y)$

• With CCM we choose: $\arg\max_y w^T f(x, y) - \rho^T d(x, y)$

• $w^T f(x, y)$: local dependency, e.g. HMM, CRF

• $\rho$: penalty; $d(x, y)$: violation measure

• Annotations on the penalty term: regularization; constraint expressivity

Multiclass Problem: Ideal classification, can be expressed through constraints

One v. All approximation:

Implementations

$\arg\max_y w^T f(x, y) - \rho^T d(x, y)$

• Modeling: objective function, constrained optimization

• Inference: Integer Linear Programming solver; exact ILP, heuristic search, relaxation, dynamic programming

• Learning: learn $w$ and $\rho$; they can be learnt jointly or separately, with semi-supervised learning, etc.

How do we use CCM to learn?

EXAMPLE 1: JOINT INFERENCE-BASED LEARNING

Constrained HMM in Information Extraction

Typical work flow

• Define basic classifiers

• Define constraints as linear inequalities

• Combine the two into an objective function

HMM CCM Example

• Information extraction without prior knowledge

• Use HMM: $\arg\max_y w^T \phi(x, y)$

HMM CCM Example

HMM prediction on an example citation, segmented into the fields AUTHOR, TITLE, EDITOR, BOOKTITLE, TECH-REPORT, INSTITUTION, DATE:

Lars Ole Andersen . Program analysis and specialization for the C Programming language . PhD thesis . DIKU , University of Copenhagen , May 1994 .

Violates a lot of natural constraints

HMM CCM Example

• Each field must be a consecutive list of words and can appear at most once in a citation.

• State transitions must occur on punctuation marks.

• The citation can only start with AUTHOR or EDITOR.

• The words pp., pages correspond to PAGE.

• Four-digit numbers of the form 20xx or 19xx are DATE.

• Quotations can appear only in TITLE.
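A minimal sketch of how some of these constraints could be turned into a violation measure $d(x, y)$, one count per constraint. The function name, the punctuation set, and the reading of "transition on punctuation" as "the token before the label change is a punctuation mark" are assumptions made for illustration, not details from the slides.

```python
import re

# Assumed punctuation set; the slides do not list the marks.
PUNCT = {".", ",", ";", ":", "?", "!"}

def count_violations(tokens, labels):
    """Return violation counts d_k(x, y) for four of the citation constraints."""
    assert len(tokens) == len(labels)
    d = {"start": 0, "contiguous": 0, "transition": 0, "date": 0}
    # The citation can only start with AUTHOR or EDITOR.
    if labels and labels[0] not in ("AUTHOR", "EDITOR"):
        d["start"] += 1
    seen = set()
    for i, lab in enumerate(labels):
        changed = i > 0 and labels[i - 1] != lab
        if i == 0 or changed:
            # A field that re-opens after being closed is not one consecutive block.
            if lab in seen:
                d["contiguous"] += 1
            seen.add(lab)
        # State transitions must occur on punctuation marks (assumed reading).
        if changed and tokens[i - 1] not in PUNCT:
            d["transition"] += 1
        # Four-digit 19xx/20xx tokens should be DATE.
        if re.fullmatch(r"(19|20)\d\d", tokens[i]) and lab != "DATE":
            d["date"] += 1
    return d
```

Each count can then be multiplied by its penalty and subtracted from the model score of a candidate labeling.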

HMM CCM Example

• How do we use constraints with HMM?

• Standard HMM:

• Learn the probability of the sequence of labels 𝑌 and input 𝑋 :

• Inference, taking the most likely label sequence:
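For a first-order HMM, these two steps are usually written as follows. This is the textbook form, given as a sketch; the start state $y_0$ and the index $n$ are notational assumptions.

```latex
% Joint probability of a label sequence Y = y_1..y_n and input X = x_1..x_n
P(X, Y) = \prod_{i=1}^{n} P(y_i \mid y_{i-1}) \, P(x_i \mid y_i)
% Inference: most likely label sequence (y_0 is a fixed start state)
\hat{Y} = \arg\max_{Y} P(Y \mid X) = \arg\max_{Y} P(X, Y)
```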

HMM CCM Example

• New objective function involving constraints

• Penalize the probability of sequence if it violates constraint

Penalty for each time the constraint is violated

HMM CCM Example

• Transform to linear model
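A sketch of the penalized objective and its linear form, written to match the $w^T f - \rho^T d$ notation used earlier. The per-constraint index $k$ and the reading of $f$ as transition and emission counts are notational assumptions, not taken from the slides.

```latex
\hat{Y} = \arg\max_{Y} \; \log P(X, Y) - \sum_{k} \rho_k \, d_k(X, Y)
% For an HMM, \log P(X, Y) is a sum of log transition and log emission
% probabilities, so it is linear in the corresponding counts f(X, Y):
\log P(X, Y) = w^{T} f(X, Y)
\quad\Longrightarrow\quad
\hat{Y} = \arg\max_{Y} \; w^{T} f(X, Y) - \rho^{T} d(X, Y)
```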

HMM CCM Example

• We need to learn the new parameters $w$, $\rho$ that maximize the scoring function

• Although the scoring function is no longer a log likelihood of the dataset, it is still a smooth concave function with a unique global maximum, where the gradient is zero.

HMM CCM Example

Learning the penalties reduces to counting: estimate the probability of each constraint being violated.
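A minimal sketch of estimating penalties by counting. It assumes that the penalty for a constraint is the negative log-odds of that constraint being violated in labeled data; this specific form, the function name, and the additive smoothing are assumptions for illustration and should be checked against the original paper.

```python
import math

def estimate_penalties(violation_counts, num_sequences, smoothing=1.0):
    """Estimate one penalty per constraint from labeled data by counting.

    violation_counts[k] is the number of gold sequences that violate constraint k.
    """
    rho = {}
    for name, violated in violation_counts.items():
        # Smoothed relative frequency of violation (smoothing is an assumption).
        p = (violated + smoothing) / (num_sequences + 2 * smoothing)
        # Assumed form: rho = -log(P(violated) / P(not violated)).
        rho[name] = -math.log(p / (1.0 - p))
    return rho
```

Under this assumed form, a constraint that is almost never violated in the data receives a large penalty, and one violated half of the time receives a penalty of zero.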

HMM CCM Example

Can this paradigm be generalized?

Are there other ways to learn?

TRAINING PARADIGMS

Training paradigms

$\arg\max_y w^T f(x, y) - \rho^T d(x, y)$

Decompose into: Learn, Inference

Prior knowledge: Features vs. Constraints

                        Feature                       Constraint
Data dependent          Yes                           No (if not learnt)
Learnable               Yes                           Yes
Size                    Large                         Small
Improvement approach    Higher order model            Post-processing for L+I
Domain                  $X \times Y \to \mathbb{R}$   $X \times Y \to \{0,1\}$
Penalty type            Soft                          Hard & Soft
Common usage            Local                         Global
Formulation             Propositional / $\phi(x)$     FOL / $\phi(x, y)$

Comparison with MLN

• In MLN, constraints are formulated as explicit probabilities, jointly with the overall distribution:

• e.g. $P(C_a) = 1;\ \forall b \neq a,\ P(C_b) = 0$

• Constraints in CCM are formulated as linear inequalities

• e.g. $a \leq C \leq a$

• Theoretically the same, very different in practice

Training paradigms

• Learning + Inference : Train with some constraints, apply all constraints only in inference

• No need to retrain an existing system

• Fast and modular

• Inference-Based Training : Train jointly with constraints and dependencies (e.g. Graphical Models)

• Better for strong interactions between 𝑦

• Other training paradigms:

• Pipeline-like sequential model [Roth, Small, Titov: AI&Stat’09]

• Constraints Driven Learning (CODL) [Chang et al. ’07, ’12]

Which paradigm is better?

Algorithmic view of the differences

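For each iteration
  For each $(X, Y_{GOLD})$ in the training data
    $Y_{PRED} = \arg\max_y w^T f(x, y) - \rho^T d(x, y)$
    If $Y_{PRED} \neq Y_{GOLD}$
      $\lambda = \lambda + F(X, Y_{GOLD}) - F(X, Y_{PRED})$
    endif
  endfor
endfor

IBT: the constrained prediction $Y_{PRED} = \arg\max_y w^T f(x, y) - \rho^T d(x, y)$ is used inside the training loop.

L+I: the same constrained $\arg\max$ is used for prediction, but, per the paradigm description, the full set of constraints is applied only at inference, after training.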
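A minimal sketch of the two paradigms with a structured perceptron on a toy tagging problem. The label set, the feature templates, the "label B appears at most once" constraint, and the brute-force arg max are all hypothetical choices made for illustration. The L+I branch simply drops the penalty term during training, following the description of L+I as applying constraints only in inference.

```python
from itertools import product
from collections import Counter

LABELS = ("A", "B")  # hypothetical label set

def features(x, y):
    """Local feature counts F(x, y): emission and transition indicators."""
    f = Counter()
    for i, (tok, lab) in enumerate(zip(x, y)):
        f[("emit", tok, lab)] += 1
        if i > 0:
            f[("trans", y[i - 1], lab)] += 1
    return f

def violations(x, y):
    """d(x, y) for a hypothetical constraint: label B appears at most once."""
    return max(0, list(y).count("B") - 1)

def predict(w, x, rho=0.0):
    """Brute-force arg max over y of w^T f(x, y) - rho * d(x, y)."""
    def score(y):
        local = sum(w[k] * v for k, v in features(x, y).items())
        return local - rho * violations(x, y)
    return max(product(LABELS, repeat=len(x)), key=score)

def train(data, epochs=5, rho=0.0, use_constraints=False):
    """Structured perceptron; use_constraints=True is IBT, False is L+I training."""
    w = Counter()
    for _ in range(epochs):
        for x, y_gold in data:
            y_pred = predict(w, x, rho if use_constraints else 0.0)
            if y_pred != tuple(y_gold):
                w.update(features(x, y_gold))    # promote gold features
                w.subtract(features(x, y_pred))  # demote predicted features
    return w
```

With L+I, the weights come from `train(data)` and the penalty is passed only at prediction time, as in `predict(w, x, rho=5.0)`. With IBT, the same penalty is also passed to `train`.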

L+I vs. IBT tradeoffs

[Figure: L+I vs. IBT tradeoffs; label: # of Features]

In some cases problems are hard due to lack of training data: semi-supervised learning.

Choice of paradigm

• IBT:

• Better when the interaction between output labels is strong

• L+I:

• Faster computationally

• Modular, no need to retrain existing classifier and works with simple models such as 𝜙(𝑥)

PARADIGM 2: LEARNING + INFERENCE

An example with Entity-Relation Extraction

Entity-Relation Extraction

[RothYi07]

Dole's wife, Elizabeth, is a native of N.C.

Entities: $E_1$ = Dole, $E_2$ = Elizabeth, $E_3$ = N.C.; relations: $R_{12}$, $R_{23}$

Decision time inference

Entity-Relation Extraction

[RothYi07]

• Formulation 1: Joint Global Model

Intractable to learn; it needs to be decomposed

Entity-Relation Extraction

[RothYi07]

• Formulation 2: Local learning + global inference

Entity-Relation Extraction

[RothYi07]

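Entities: Dole ($E_1$), Elizabeth ($E_2$), N.C. ($E_3$)

Relation variables: $R_{12}$, $R_{21}$, $R_{23}$, $R_{32}$, $R_{13}$, $R_{31}$

Cost function:

$c_{\{E_1 = per\}} \cdot x_{\{E_1 = per\}} + c_{\{E_1 = loc\}} \cdot x_{\{E_1 = loc\}} + \dots + c_{\{R_{12} = spouse\_of\}} \cdot x_{\{R_{12} = spouse\_of\}} + \dots + c_{\{R_{12} = \emptyset\}} \cdot x_{\{R_{12} = \emptyset\}}$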

Entity-Relation Extraction

[RothYi07]

Exactly one label for each relation and entity

Relation and entity type constraints

Integral constraints, in effect boolean

Entity-Relation Extraction

[RothYi07]

• Each entity is either a person, organization or location:

$x_{\{E_1 = per\}} + x_{\{E_1 = loc\}} + x_{\{E_1 = org\}} + x_{\{E_1 = \emptyset\}} = 1$

• $(R_{12} = spouse\_of) \rightarrow (E_1 = person) \wedge (E_2 = person)$:

$x_{\{R_{12} = spouse\_of\}} \leq x_{\{E_1 = per\}}$

$x_{\{R_{12} = spouse\_of\}} \leq x_{\{E_2 = per\}}$
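A minimal sketch of the joint inference step for $E_1$, $E_2$ and $R_{12}$, solved by brute-force enumeration rather than an ILP solver. The scores are made-up numbers standing in for local classifier outputs, and the `born_in` label is a hypothetical second relation type.

```python
from itertools import product

ENTITY_TYPES = ("per", "loc", "org", "none")
RELATION_TYPES = ("spouse_of", "born_in", "none")  # born_in is hypothetical

# Hypothetical local classifier scores c_{variable = label} (illustrative only).
scores = {
    "E1": {"per": 0.5, "loc": 0.6, "org": 0.1, "none": 0.0},
    "E2": {"per": 0.9, "loc": 0.1, "org": 0.0, "none": 0.0},
    "R12": {"spouse_of": 0.8, "born_in": 0.1, "none": 0.1},
}

def satisfies_constraints(e1, e2, r12):
    """(R12 = spouse_of) -> (E1 = per) and (E2 = per)."""
    if r12 == "spouse_of":
        return e1 == "per" and e2 == "per"
    return True

def joint_inference(scores):
    """Return the best consistent assignment and its total score."""
    best, best_score = None, float("-inf")
    # Enumerating one label per variable enforces the "exactly one label" constraints.
    for e1, e2, r12 in product(ENTITY_TYPES, ENTITY_TYPES, RELATION_TYPES):
        if not satisfies_constraints(e1, e2, r12):
            continue
        total = scores["E1"][e1] + scores["E2"][e2] + scores["R12"][r12]
        if total > best_score:
            best, best_score = (e1, e2, r12), total
    return best, best_score
```

With these numbers the local classifiers alone would label $E_1$ as `loc`, which is inconsistent with `spouse_of`. The joint search instead selects `per` for $E_1$ because that keeps the high-scoring relation label.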

Entity-Relation Extraction

[RothYi07]

• Entity classification results

Entity-Relation Extraction

[RothYi07]

• Relation identification results


INNER WORKINGS OF INFERENCE

Constraints Encoding

• Atoms

$X_{isCapitalized} = true$ is encoded as $X_{isCapitalized} \leq 1 \wedge X_{isCapitalized} \geq 1$

• Existential quantification

$\exists X_{argument}$ is encoded as $X_{numberOfArgument} \geq 1$

• Negation

• $1 - X = \neg X$

• Conjunction

• Disjunction
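Conjunction and disjunction over 0/1 variables have standard linear encodings. The sketch below states them with an auxiliary variable `z` (an assumed name) and checks them against the truth tables.

```python
from itertools import product

def conj_ok(a, b, z):
    """Linear encoding of z = a AND b over 0/1 variables."""
    return z <= a and z <= b and z >= a + b - 1

def disj_ok(a, b, z):
    """Linear encoding of z = a OR b over 0/1 variables."""
    return z >= a and z >= b and z <= a + b

def check_encodings():
    """Verify both encodings against the Boolean truth tables."""
    for a, b, z in product((0, 1), repeat=3):
        assert conj_ok(a, b, z) == (z == (a & b))
        assert disj_ok(a, b, z) == (z == (a | b))
    return True
```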

Integer Linear Programming (ILP)

• Powerful tool, very general

• NP-hard even in binary case, but efficient for most NLP problems

• If ILP cannot solve the problem efficiently, we can fall back to approximate solutions using heuristic search


SENTENCE

COMPRESSION

Sentence Compression Example

Modelling Compression with Discourse Constraints, James Clarke and Mirella Lapata, COLING/SCL 2006

• 1. What is sentence compression?

• Sentence compression is commonly expressed as a word deletion problem: given an input sentence of words $W = w_1, w_2, \dots, w_n$, the aim is to produce a compression by removing any subset of these words (Knight and Marcu 2002).

A trigram language model: maximize a scoring function by ILP:

• $p_i$: word $i$ starts the compression

• $q_{i,j}$: sequence $w_i, w_j$ ends the compression

• $x_{i,j,k}$: trigram $w_i, w_j, w_k$ is in the compression

• $y_i$: word $i$ is in the compression

Each $p$, $q$, $x$, $y$ is either 0 or 1.

Sentential Constraints:

• 1. Disallow the inclusion of modifiers without their head words.

• 2. Require the presence of modifiers when the head is retained in the compression.

• 3. If a verb is present in the compression, then so are its arguments.
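A minimal sketch of constraint 1 in a word-deletion setting. It replaces the trigram language model score with a hypothetical per-word score and uses brute-force search instead of ILP, so it only illustrates how the constraint $y_{modifier} \leq y_{head}$ prunes candidate compressions. All names and numbers are assumptions.

```python
from itertools import product

def compress(words, word_scores, modifier_of, min_len=1):
    """Pick y in {0,1}^n maximizing sum(score_i * y_i) subject to y_mod <= y_head.

    modifier_of maps a modifier's index to the index of its head word.
    """
    best, best_score = None, float("-inf")
    for y in product((0, 1), repeat=len(words)):
        if sum(y) < min_len:
            continue
        # Constraint 1: a modifier may be kept only if its head is kept.
        if any(y[m] > y[h] for m, h in modifier_of.items()):
            continue
        total = sum(s * yi for s, yi in zip(word_scores, y))
        if total > best_score:
            best, best_score = y, total
    return [w for w, yi in zip(words, best) if yi]
```

For example, with words `["the", "big", "dog", "barked"]`, scores `[-0.1, 0.3, -0.5, 1.0]`, and both "the" and "big" modifying "dog", the unconstrained optimum keeps "big" and "barked". The constrained optimum keeps only "barked".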

Modifier Constraint Example


Sentential Constraints:

• 4. Preserve personal pronouns in the compressed output.

Discourse Constraints:

• 1. The center of a sentence is retained in the compression, and the entity realised as the center in the following sentence is also retained.

• The center of a sentence is the entity with the highest rank.

• Entities may be ranked by many features.

• Example: grammatical role (subjects > objects > others).

Discourse Constraints:

• 2. Lexical chain constraints:

• A lexical chain is a sequence of semantically related words.

• Often the longest lexical chain is the most important chain.

SEMANTIC ROLE LABELING

Semantic Role labeling Example:

• What is SRL?

• SRL identifies all constituents that fill a semantic role, and determines their roles.

General information:

• Both models (the argument identifier and the argument classifier) are trained with SNoW.

• Idea: maximize the scoring function

SRL: Argument Identification

• Use a learning scheme that utilizes two classifiers, one to predict the beginnings of possible arguments, and the other the ends. The predictions are combined to form argument candidates.

• Why:

• When only shallow parsing is available, the system does not have constituents to begin with. Therefore, conceptually, the system has to consider all possible subsequences.
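A minimal sketch of combining begin and end predictions into argument candidates. The score lists, the threshold, and the optional length cap are hypothetical; the slides do not say how the two classifiers' outputs are thresholded.

```python
def argument_candidates(begin_scores, end_scores, threshold=0.5, max_len=None):
    """Combine begin/end predictions into candidate spans (start, end), inclusive."""
    begins = [i for i, s in enumerate(begin_scores) if s >= threshold]
    ends = [j for j, s in enumerate(end_scores) if s >= threshold]
    spans = []
    for i in begins:
        for j in ends:
            # A span must end at or after its start; optionally cap its length.
            if j >= i and (max_len is None or j - i + 1 <= max_len):
                spans.append((i, j))
    return spans
```

Pairing predicted begins with predicted ends keeps the candidate set far smaller than the set of all possible subsequences.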

SRL: List of features

• POS tags

• Length

• Verb class

• Head word and POS tag of the head word

• Position

• Path

• Chunk pattern

• Clause relative position

• Clause coverage

• NEG

• MOD

SRL: Constraints

• 1. Arguments cannot overlap with the predicate.

• 2. Arguments cannot exclusively overlap with the clauses.

• 3. If a predicate is outside a clause, its arguments cannot be embedded in that clause.

• 4. No overlapping or embedding arguments.

• 5. No duplicate argument classes for core arguments.

• Note: conjunction is an exception.

• [A0 I] [V left ] [A1 my pearls] [A2 to my daughter] and [A1 my gold] [A2 to my son].

SRL: Constraints

• 6. If an argument is a reference to some other argument arg, then this referenced argument must exist in the sentence.

• 7. If there is a C-arg argument, then there has to be an arg argument; in addition, the C-arg argument must occur after arg.

• The label C-arg is used to specify the continuity of the arguments.

• 8. Given a specific verb, some argument types should never occur.
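A minimal sketch of checking two of the constraints (4 and 5) on a candidate labeling. The span representation and function name are assumptions, and the conjunction exception to constraint 5 is not handled.

```python
# PropBank-style core argument labels.
CORE = {"A0", "A1", "A2", "A3", "A4", "A5"}

def violates_srl_constraints(arguments):
    """arguments: list of (start, end, label) with inclusive token indices."""
    spans = sorted(arguments)
    # Constraint 4: no overlapping or embedding arguments.
    # After sorting by start, any overlap shows up between neighbours.
    for (s1, e1, _), (s2, e2, _) in zip(spans, spans[1:]):
        if s2 <= e1:
            return True
    # Constraint 5: no duplicate argument classes for core arguments.
    core_labels = [lab for _, _, lab in arguments if lab in CORE]
    return len(core_labels) != len(set(core_labels))
```

In an ILP formulation these checks become linear inequalities over indicator variables. Here they are written as a direct filter on a proposed set of arguments.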

SRL Results:

QA

• Questions?
