

Learner Error Corpora

Grammatical Error Detection

Grammatical Error Correction

Evaluation of Error Detection/Correction Systems

[Diagram: Native Language and Foreign Language, with the learner's Interlanguage arising from the transformation between them]
A learner corpus is a computerized textual database of
the language produced by foreign language learners
 Benefits

 Researchers will have access to learners’ interlanguage
 May lead to development of language learning tools

Error tagged corpora
 Deals with real errors made by language learners

Well formed corpora
 Language corpora with well formed constructs
 BNC, WSJ
 N-gram corpora

Artificial error corpora
 Error tagged corpora are expensive
 Well formed corpora do not deal with errors
 Artificially inject errors into well-formed corpora to obtain error corpora

NUCLE : NUS Corpus of Learner English
 About 1,400 essays from university-level students
with 1.2 million words.
 Completely annotated with error categories and
corrections.
 Annotation performed by English instructors at
NUS Centre for English Language Communication
(CELC).

Annotation Task
 Select arbitrary, contiguous text spans using the
cursor to identify grammatical errors.
 Classify errors by choosing an error tag from a dropdown menu.
 Correct errors by typing the correction into a text box.
 Comment to give additional explanations if necessary.
Writing, Annotation, and Marking Platform (WAMP)
27 error categories with 13 error groups

NICT-JLE
 “Error Annotation for Corpus of Japanese Learner
English” – Izumi et al.
 CoNLL shared task data
▪ http://www.comp.nus.edu.sg/~nlp/conll13st.html
 HOO data
▪ http://clt.mq.edu.au/research/projects/hoo/hoo2012/index.html

Precision Grammar: formal grammar
designed to distinguish ungrammatical from
grammatical sentences.
 Constraint Dependency Grammar (CDG)
▪ every grammatical rule is given as a constraint on word-to-word modifications
Resource: Structural disambiguation with constraint propagation, Hiroshi Maruyama,
ACL’90

CDG grammar G = ⟨Σ, R, L, C⟩
 Σ: finite set of terminal symbols (words)
 R = {r1, r2, …, rk}: finite set of role-ids
 L = {a1, a2, …, at}: finite set of labels
 C: constraint that an assignment A should satisfy

A sentence s = w1, w2, … wn is a finite string over Σ.
Each word i in a sentence s has k roles r1(i), r2(i), …, rk(i).
 Roles are variables that take a value ⟨a, d⟩, where a ∈ L is a label and the modifiee d is either a position 1 ≤ d ≤ n or the special symbol nil.
[Diagram: each word W1, W2, …, Wn carries roles r1, r2, …, rk]
Analysis of a sentence → assigning values to the n × k roles.

Definitions
 Assuming x is an rj role of word i:
▪ pos(x) ⇒ the position i
▪ rid(x) ⇒ the role-id rj
▪ lab(x) ⇒ the label of x
▪ mod(x) ⇒ the modifiee of x
▪ word(i) ⇒ the terminal symbol occurring at position i

A constraint
 ∀x1, x2, …, xp : role; P1 ∧ P2 ∧ … ∧ Pm, where
▪ x1, x2, …, xp range over the set of roles in an assignment A
▪ each Pi is a subformula over the vocabulary
▪ Variables: x1, x2, …, xp
▪ Constants: Σ ∪ L ∪ R ∪ {nil, 1, 2, …}
▪ Function symbols: word, pos, rid, lab, mod
▪ Predicate symbols: =, <, >, ∈
▪ Logical connectors: ∧, ∨, ¬, ⇒

Definitions
 The arity of a subformula 𝑃𝑖 depends on the
number of variables that it contains
 The degree of the grammar is the size of the set of role-ids (R).

A non-null string 𝑠 over the alphabet Σ is
generated iff there exists an assignment
𝐴 that satisfies the constraint 𝐶.


G1 = ⟨Σ1, R1, L1, C1⟩
 Σ1 = {D, N, V}
 R1 = {governor}
 L1 = {DET, SUBJ, ROOT}
 C1 = ∀x, y : role; P11 ∧ P12 ∧ P13 ∧ P14
• P11: “A determiner (D) modifies a noun (N) on the right with the label DET”
  word(pos(x)) = D ⇒ (lab(x) = DET, word(mod(x)) = N, pos(x) < pos(mod(x)))
• P12: “A noun modifies a verb (V) on the right with the label SUBJ”
  word(pos(x)) = N ⇒ (lab(x) = SUBJ, word(mod(x)) = V, pos(x) < pos(mod(x)))
• P13: “A verb modifies nothing and its label should be ROOT”
  word(pos(x)) = V ⇒ (lab(x) = ROOT, mod(x) = nil)
• P14: “No two words can modify the same word with the same label.”
  mod(x) = mod(y), lab(x) = lab(y) ⇒ x = y

[A1]D [dog2]N [runs3]V
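As a concrete illustration, the sketch below (plain Python, written for this example only) checks whether the single-role assignment for "A dog runs" satisfies the four constraints of G1; each position maps to its (label, modifiee) value.

words = {1: "D", 2: "N", 3: "V"}                      # [A1]D [dog2]N [runs3]V
assignment = {1: ("DET", 2), 2: ("SUBJ", 3), 3: ("ROOT", None)}

def satisfies_C1(words, A):
    for p, (lab, mod) in A.items():
        if words[p] == "D" and not (mod and words[mod] == "N" and lab == "DET" and p < mod):
            return False                              # P11
        if words[p] == "N" and not (mod and words[mod] == "V" and lab == "SUBJ" and p < mod):
            return False                              # P12
        if words[p] == "V" and not (lab == "ROOT" and mod is None):
            return False                              # P13
    for p, vp in A.items():                           # P14: no two roles share label and modifiee
        for q, vq in A.items():
            if p != q and vp == vq:
                return False
    return True

print(satisfies_C1(words, assignment))                # True -> the sentence is generated by G1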

CDG parsing
 Assigning values to 𝑛 × 𝑘 roles from a finite set
𝐿 × {𝑛𝑖𝑙, 1,2 … , 𝑛}
 A constraint satisfaction problem (CSP)
 Use constraint propagation or filtering to solve
CSP
▪ Form an initial constraint network using a “core”
grammar.
▪ Remove local inconsistencies by filtering.
▪ If any ambiguity remains, add new constraints and go to
Step 2.
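A rough sketch of the filtering step, assuming each role has a domain of candidate ⟨label, modifiee⟩ values and constraints are binary predicates over pairs of role values; this is a simplified AC-3-style loop, not the exact procedure of the paper. The toy usage below rebuilds the PP-attachment domains from the example that follows, with a "modification links must not cross" constraint; as the slide states, that network is already arc consistent, so nothing is removed.

def revise(domains, x, y, consistent):
    # drop values of role x that have no supporting value in role y
    removed = False
    for vx in list(domains[x]):
        if not any(consistent(x, vx, y, vy) for vy in domains[y]):
            domains[x].discard(vx)
            removed = True
    return removed

def filter_network(domains, consistent):
    # propagate deletions until every remaining value has support (arc consistency)
    changed = True
    while changed:
        changed = False
        for x in domains:
            for y in domains:
                if x != y and revise(domains, x, y, consistent):
                    changed = True
    return domains

position = {"PP3": 3, "PP4": 4, "PP5": 5}
domains = {r: {("LOC", 1)} | {("POSTMOD", d) for d in range(2, p)} for r, p in position.items()}

def no_crossing(x, vx, y, vy):
    (a1, a2), (b1, b2) = sorted((position[x], vx[1])), sorted((position[y], vy[1]))
    return not (a1 < b1 < a2 < b2 or b1 < a1 < b2 < a2)

print(filter_network(domains, no_crossing))           # domains unchanged: already arc consistent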
Put the block on the floor on the table in the room
𝑉1 𝑁𝑃2 𝑃𝑃3 𝑃𝑃4 𝑃𝑃5

G2 = ⟨Σ2, R2, L2, C2⟩
 Σ2 = {V, NP, PP}
 R2 = {governor}
 L2 = {ROOT, OBJ, LOC, POSTMOD}
 C2 = ∀x, y : role; P
[Diagram: the binary constraints relate pos(x), mod(x), pos(y) and mod(y), e.g. modification links must not cross]

Total number of possible parse trees?
 Catalan number
 Explicit representation is not feasible
 Constraint network for implicit representation of the parse trees
[Constraint network for “Put the block on the floor on the table in the room” (V1 NP2 PP3 PP4 PP5): V1 {Rnil}, NP2 {O1}, PP3 {L1, P2}, PP4 {L1, P2, P3}, PP5 {L1, P2, P3, P4}, where Rnil = <ROOT, nil>, O1 = <OBJ, 1>, L1 = <LOC, 1>, Pi = <POSTMOD, i>]
A constraint network is said to be arc consistent
if, for any constraint matrix, there are no rows
and no columns that contain only zeros
 A value corresponding to an all-zero row or column is removed from the solution
 Removing one value may make other values inconsistent
 The removal process is propagated until the network becomes arc consistent
 The network in the example is arc consistent


Two more constraints
 fe(pos(x)) → extracts semantic features of x
Put the block on the floor on the table in the room
V1 NP2 PP3 PP4 PP5
[Constraint network snapshots: starting from V1 {Rnil}, NP2 {O1}, PP3 {L1, P2}, PP4 {L1, P2, P3}, PP5 {L1, P2, P3, P4}, the two additional constraints and repeated filtering progressively shrink the domains until only PP3 {P2}, PP4 {L1}, PP5 {P4} remain]
Put the block on the floor on the table in the room
→ “the block” = OBJ, “on the floor” = POSTMOD (of the block), “on the table” = LOC (of put), “in the room” = POSTMOD (of the table)


All constraints are treated with the same priority
 Failure to satisfy the set of specified constraints marks an utterance as ungrammatical
 But grammaticality in natural language is graded
 Weighted Constraint Dependency Grammar (WCDG) can model robustness, the ability to deal with unexpected and possibly erroneous input

Different error detection tasks
 Grammatical vs Ungrammatical
 Detecting errors for targeted categories
▪ Preposition errors
▪ Article errors
 Agnostic to error category

Approaches
 Error detection as classification
 Error detection as sequence labelling

Generic steps
 Decide on the error category
 Pick a learning algorithm
 Identify discriminative features
 Train the algorithm with training data
▪ Error corpora
▪ The model encodes the error contexts
▪ Flags an error on detecting a matching context in the learner response
▪ Well-formed corpora
▪ Learns the ideal model for the targeted categories
▪ Flags an error in case of a mismatch
▪ Artificial error corpora

Type of preposition errors
 Selection error [They arrived to the town]
 Extraneous use [They came to outside]
 Omission error [He is fond this book]

Tasks
 Classifier prediction
 Training a model
 What are the features?
Resource: The Ups and Downs of Preposition Error Detection in ESL Writing, Tetreault and
Chodorow, COLING’08


Cast the error detection task as a classification problem
Given a trained classifier and a context:
 The system outputs a probability distribution over all prepositions
 Compare the weight of the system’s top preposition with the writer’s preposition

Error occurs when:
 Writer’s preposition ≠ classifier’s prediction
 And the difference in probabilities exceeds a threshold
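A minimal sketch of this decision rule; the probabilities and threshold below are invented for illustration, while a real system would take the distribution from the MaxEnt model.

def flag_preposition_error(writer_prep, prob_dist, threshold=0.2):
    # prob_dist: preposition -> probability from the classifier
    top = max(prob_dist, key=prob_dist.get)
    if top == writer_prep:
        return False
    # flag only if the top choice beats the writer's choice by more than the threshold
    return prob_dist[top] - prob_dist.get(writer_prep, 0.0) > threshold

probs = {"of": 0.55, "with": 0.05, "in": 0.15, "at": 0.10, "by": 0.15}   # made-up numbers
print(flag_preposition_error("with", probs))   # True: “He is fond with beer” is flagged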

Develop a training set of error-annotated ESL
essays (millions of examples?):
 Too labor intensive to be practical

Alternative:
 Train on millions of examples of proper usage

Determining how “close to correct” writer’s
preposition is

Prepositions are influenced by:
 Words in the local context, and how they interact
with each other (lexical)
 Syntactic structure of context
 Semantic interpretation
1. Extract lexical and syntactic features from well-formed (native) text
2. Train a MaxEnt model on the feature set to output a probability distribution over a set of prepositions
3. Evaluate on an error-annotated ESL corpus by:
 Comparing the system’s preposition with the writer’s preposition
 If unequal, using thresholds to determine the “correctness” of the writer’s preposition
Feature | Description
PV      | Prior verb
PN      | Prior noun
FH      | Headword of the following phrase
FP      | Following phrase
TGLR    | Middle trigram (POS + words)
TGL     | Left trigram
TGR     | Right trigram
BGL     | Left bigram

Example: “He will take our place in the line.”
→ PV = take, PN = place, FH = line; TGLR spans the immediate context of the preposition “in”
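A rough sketch of extracting a few of these features, assuming tokens and POS tags are already available (no parser, so FH is approximated by the first following noun); the function name and window choices are illustrative.

def prep_features(tokens, tags, i):
    # i = index of the preposition under consideration
    feats = {}
    prior_verbs = [t for t, g in zip(tokens[:i], tags[:i]) if g.startswith("V")]
    prior_nouns = [t for t, g in zip(tokens[:i], tags[:i]) if g.startswith("N")]
    feats["PV"] = prior_verbs[-1] if prior_verbs else None
    feats["PN"] = prior_nouns[-1] if prior_nouns else None
    following_nouns = [t for t, g in zip(tokens[i + 1:], tags[i + 1:]) if g.startswith("N")]
    feats["FH"] = following_nouns[0] if following_nouns else None   # crude headword guess
    feats["TGLR"] = " ".join(tokens[i - 1:i + 2])                   # word part of the middle trigram
    return feats

tokens = "He will take our place in the line".split()
tags = ["PRP", "MD", "VB", "PRP$", "NN", "IN", "DT", "NN"]
print(prep_features(tokens, tags, 5))   # PV=take, PN=place, FH=line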


MaxEnt does not model the interactions
between features
Build “combination” features of the head
nouns and commanding verbs
 PV, PN, FH

3 types: word, tag, word+tag
 Each type has four possible combinations
 Maximum of 12 features
Class   | Components | +Combo:word
p-N     | FH         | line
N-p-N   | PN-FH      | place-line
V-p-N   | PV-FH      | take-line
V-N-p-N | PV-PN-FH   | take-place-line

“He will take our place in the line.”

Typical way that non-native speakers check if
usage is correct:
 “Google” the phrase and alternatives



Google N-gram corpus
Queries provided frequency data for the
+Combo features
Top three prepositions per query were used
as features for ME model
 Maximum of 12 Google features
Class   | Combo:word      | Google features
p-N     | line            | P1 = on, P2 = in, P3 = of
N-p-N   | place-line      | P1 = in, P2 = on, P3 = of
V-p-N   | take-line       | P1 = on, P2 = to, P3 = into
V-N-p-N | take-place-line | P1 = in, P2 = on, P3 = after

“He will take our place in the line”
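A sketch of how such Google features could be derived, with made-up counts standing in for Google N-gram lookups; the function and candidate list are illustrative, not the paper's exact query procedure.

ngram_counts = {   # hypothetical counts for the pattern "take <prep> line"
    ("take", "on", "line"): 9000, ("take", "to", "line"): 7000,
    ("take", "into", "line"): 3000, ("take", "in", "line"): 2500,
    ("take", "of", "line"): 400,
}

def top3_preps(verb, noun, counts, candidates=("in", "on", "of", "to", "into", "at")):
    scored = sorted(((counts.get((verb, p, noun), 0), p) for p in candidates), reverse=True)
    return [p for _, p in scored[:3]]

print(top3_preps("take", "line", ngram_counts))   # ['on', 'to', 'into'] -> features P1, P2, P3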

Thresholds allow the system to skip cases where the top-ranked preposition and what the student wrote differ by less than a pre-specified amount
[Bar charts of classifier probabilities: for “He is fond with beer”, the probability of “of” far exceeds that of the writer’s “with”, so the system flags an error; for “My sister usually gets home around 3:00”, the writer’s “around” is close to the top-ranked preposition, so the system flags it as OK]

Errors consist of a sub-sequence of tokens in a longer token sequence.
 Some of the sub-sequences are errors while others are not
 Advantage: error-category independent

Sequence modelling tasks in NLP
 Parts-of-speech tagging
 Information extraction
Resource: High-Order Sequence Modeling for Language Learner Error Detection, Michael
Gamon, 6th Workshop on Innovative Use of NLP for Building Educational Applications
Many NLP problems can be viewed as sequence
labeling.
 Each token in a sequence is assigned a label.
 Labels of tokens are dependent on the labels of
other tokens in the sequence, particularly their
neighbors (not i.i.d).

[Diagram: a token sequence (foo, bar, blam, zonk, zonk, bar, blam) in which each token receives a label]
Slides from Raymond J. Mooney


Annotate each word in a sentence with a
part-of-speech.
Lowest level of syntactic analysis.
John/PN saw/V the/Det saw/N and/Con decided/V to/Part take/V it/Pro to/Prep the/Det table/N

Useful for subsequent syntactic parsing
and word sense disambiguation.


Identify phrases in language that refer to specific types of
entities and relations in text.
Named Entity Recognition (NER) is task of identifying
names of people, places, organizations, etc. in text.
people organizations places
 Michael Dell is the CEO of Dell Computer Corporation and lives in
Austin Texas.

Extract pieces of information relevant to a specific
application, e.g. used car ads:
make model year mileage price
 For sale, 2002 Toyota Prius, 20,000 mi, $15K or best offer.
Available starting July 30, 2006.

Classify each token independently, but use as input features information about the surrounding tokens (a sliding window).
John saw the saw and decided to take it to the table.
The classifier is applied at each position in turn, producing:
John → PN, saw → V, the → Det, saw → N, and → Conj, decided → V, to → Part, take → V, it → Pro, to → Prep, the → Det, table → N
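A minimal sketch of sliding-window feature extraction (window size and feature names are illustrative); any off-the-shelf classifier could then be trained on these per-token feature dictionaries.

def window_features(tokens, i, size=2):
    # features for token i: the token itself plus its neighbours within the window
    feats = {"w0": tokens[i].lower()}
    for k in range(1, size + 1):
        feats[f"w-{k}"] = tokens[i - k].lower() if i - k >= 0 else "<s>"
        feats[f"w+{k}"] = tokens[i + k].lower() if i + k < len(tokens) else "</s>"
    return feats

tokens = "John saw the saw and decided to take it to the table".split()
print(window_features(tokens, 3))   # features for the second "saw"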

Hidden Markov Model
 A finite-state automaton with stochastic state transitions and observations
 Start from a state → emit an observation → transition to a new state → emit an observation → … → final state
▪ State transition probability P(s|s′)
▪ Observation probability P(o|s)
▪ Initial state distribution P0(s)
[Diagram: state s_{t−1} → state s_t, which emits observation o_t]
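For concreteness, a compact Viterbi decoder for such an HMM; all probabilities below are invented toy values.

import math

def viterbi(obs, states, P0, trans, emit):
    # V[t][s] = best log-probability of any state sequence ending in s after t+1 observations
    V = [{s: math.log(P0[s]) + math.log(emit[s].get(obs[0], 1e-12)) for s in states}]
    back = [{}]
    for t in range(1, len(obs)):
        V.append({}); back.append({})
        for s in states:
            prev = max(states, key=lambda sp: V[t - 1][sp] + math.log(trans[sp][s]))
            V[t][s] = V[t - 1][prev] + math.log(trans[prev][s]) + math.log(emit[s].get(obs[t], 1e-12))
            back[t][s] = prev
    last = max(states, key=lambda s: V[-1][s])
    path = [last]
    for t in range(len(obs) - 1, 0, -1):
        path.append(back[t][path[-1]])
    return list(reversed(path))

states = ["N", "V"]
P0 = {"N": 0.6, "V": 0.4}
trans = {"N": {"N": 0.3, "V": 0.7}, "V": {"N": 0.8, "V": 0.2}}
emit = {"N": {"dog": 0.5, "runs": 0.1}, "V": {"dog": 0.05, "runs": 0.6}}
print(viterbi(["dog", "runs"], states, P0, trans, emit))   # ['N', 'V']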

Maximum Entropy Markov Model (MEMM)
 Combines the transition and observation functions into a single function P(s|s′, o)
[Diagram: state s_{t−1} → state s_t, now conditioned on the observation o_t as well]

NER annotation convention
 “O” → outside an NE
 “B” → beginning of an NE
 “I” → inside an NE
Michael/B Dell/I is/O the/O CEO/O of/O Dell/B Computer/I Corporation/I and/O lives/O in/O Austin/B Texas/I

Learner error annotation
 Only “O” and “I” are used
▪ Most of the error spans are short

Language model features
 How close or far is the learner’s utterance from
ideal language usage?

String features
 Whether a token is capitalized (initial capitalization or all capitals)
 Token length in characters
 Number of tokens in the sentence

Linguistic analysis feature
 Features from constituency parse tree


All features are calculated for each token 𝑤 of
the tokens 𝑤1 , 𝑤2 , … 𝑤𝑘 in a sentence
Basic LM features
 All LM scores are log probabilities
 Unigram probability of w
 Average n-gram probability of all n-grams in the sentence that contain w:
   Avg = [ Σ_{i=1..n} Σ_{j=pos(w)−i+1..pos(w)} P(w_j … w_{j+i−1}) ] / [ Σ_{i=1..n} Σ_{j=pos(w)−i+1..pos(w)} 1 ]
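A sketch of this averaged feature, where lm_logprob is a stand-in for a real language-model lookup (here a dummy that only illustrates the indexing).

def avg_ngram_logprob(tokens, pos_w, lm_logprob, max_n=3):
    # average the scores of every n-gram (n = 1..max_n) that contains position pos_w
    scores = []
    for n in range(1, max_n + 1):
        for j in range(pos_w - n + 1, pos_w + 1):
            if j >= 0 and j + n <= len(tokens):
                scores.append(lm_logprob(tokens[j:j + n]))
    return sum(scores) / len(scores)

print(avg_ngram_logprob("he go to school".split(), 1, lambda ng: -2.0 * len(ng)))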

Ratio features
 Tokens that are part of an unlikely combination of otherwise likely smaller n-grams → error
 ratio = (average probability of the x-grams containing w) / (average probability of the y-grams containing w), where x > y

Drop features
 Drop or increase in n-gram probability across token w:
   Δ(w_i) = P(w_i w_{i+1}) − P(w_{i−1} w_i)

Entropy delta features
 forward entropy(w_i) = LM_Score(w_i … w_n) / #tokens(w_i … w_n)
 backward entropy(w_i) = LM_Score(w_0 … w_i) / #tokens(w_0 … w_i)
 forward sliding entropy(w_i) = LM_Score(w_i … w_n) − LM_Score(w_{i+1} … w_n)
 Backward sliding entropy is defined similarly
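These features are straightforward to compute once an LM scoring function is available; a sketch, with lm_score and bigram_logprob as stand-ins returning log-probabilities.

def forward_entropy(tokens, i, lm_score):
    return lm_score(tokens[i:]) / len(tokens[i:])

def backward_entropy(tokens, i, lm_score):
    return lm_score(tokens[:i + 1]) / len(tokens[:i + 1])

def forward_sliding_entropy(tokens, i, lm_score):
    return lm_score(tokens[i:]) - lm_score(tokens[i + 1:])

def drop(tokens, i, bigram_logprob):
    # change in bigram probability across token i
    return bigram_logprob(tokens[i], tokens[i + 1]) - bigram_logprob(tokens[i - 1], tokens[i])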

A “good” n-gram is likely to have a much higher probability than an n-gram with the same tokens in random order
 ratio_to_random = P(n-gram containing w) / (sum of the unigram probabilities of the tokens that the n-gram contains)
 Minimum ratio to random
 Average ratio to random
 Overall ratio to random = sum1 / sum2, where
▪ sum1 = Σ_{i=2..n} Σ_{j=pos(w)−i+1..pos(w)} P(w_j … w_{j+i−1})
▪ sum2 = sum of the unigram probabilities of the tokens that the above n-grams contain

Overlap to adjacent ratio
 An erroneous word may cause n-grams that contain the word to be less likely than adjacent n-grams not containing the word
 ratio = (sum of the probabilities of n-grams including w_i) / (sum of the probabilities of n-grams adjacent to w_i but excluding it)

Features extracted from syntactic parse trees
 Label of the parent and grandparent node (some
of the labels denote complex constructs , e.g.,
SBAR )
 number of sibling nodes
 number of siblings of the parent
 length of path to the root
GEC Approaches
 Rule-based
 Classification
 Language modelling
 SMT
 Hybrid

Whole sentence error correction
 Pipeline based approach
▪ Design classifiers for different error categories
▪ Deploy classifiers independently
▪ Relations between errors are ignored
▪ Example: A cats runs
▪ An article classifier may propose to delete ‘a’
▪ A noun number classifier may propose to change ‘cats’ to ‘cat’
Resource: Grammatical Error Correction Using Integer Linear Programming, Yuanbin Wu
and Hwee Tou Ng

Joint Inference
 Errors interact in most cases
 Errors need to be corrected jointly
 Steps
▪ For every possible correction, a score (how grammatical the result is) is assigned to the corrected sentence
▪ The set of corrections resulting in the maximum score is selected
Integer Linear Programming
  maximize   c^T x         (objective function)
  subject to Ax ≤ b        (constraints)
             x ≥ 0, x ∈ Z  (output variable space)

GEC → “Given an input sentence, choose a set of corrections which results in the best output sentence”

ILP formulation of GEC
 Encode the output space using integer variables.
▪ Corrections that a word needs
 Express inference objective as a linear objective
function.
▪ Maximize the grammaticality of corrections
 Introduce constraints to refine the feasible output space
▪ Constraints guarantee that the corrections do not
conflict with each other

What corrections at which positions?
 Location of the error
 Error type
 Correction

First-order variables
 Z^k_{l,p} ∈ {0,1}, where
▪ p ∈ {1, 2, …, |s|} is a position in the sentence s
▪ l ∈ L is an error type
▪ k ∈ {1, 2, …, C(l)} is a correction of type l
 Z^k_{l,p} = 1 → the word at position p should be corrected to k, which is of error type l
 Z^k_{l,p} = 0 → the correction k is not applied to the word at position p
 Z^ε_{l,p} = 1 → deletion of the word at position p
Objective: to find the best set of corrections
 The number of combinations of corrections is exponential

Approximate with a decomposability assumption
 Measuring the output quality of multiple corrections can be decomposed into measuring the quality of the individual corrections

Let Z^k_{l,p} take s to s′, and let w_{l,p,k} ∈ ℝ measure the grammaticality of s′:
  max Σ_{l,p,k} w_{l,p,k} · Z^k_{l,p}


For an individual correction Z^k_{l,p}, the quality of s′ depends on
 Language model score: h(s′, LM)
 Classifier confidence: f(s′, t)
 Disagreement score: g(s′, t)
▪ Difference between the maximum confidence score and the score of the word that is being corrected

w_{l,p,k} = ν_LM · h(s′, LM) + Σ_{t∈E} λ_t · f(s′, t) + Σ_{t∈E} μ_t · g(s′, t)

Constraint to avoid conflicts
 For each error type l, only one output k is allowed at any applicable position p:
   Σ_k Z^k_{l,p} = 1   for all applicable (l, p)

Final ILP formulation
  max  Σ_{l,p,k} w_{l,p,k} · Z^k_{l,p}
  s.t. Σ_k Z^k_{l,p} = 1   for all applicable (l, p)
       Z^k_{l,p} ∈ {0,1}
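A toy version of this ILP for the running example "A cats sat on the mat" (introduced below), using the PuLP library as one possible solver; the variable names and the numeric weights are invented purely for illustration.

import pulp   # assumed available: pip install pulp

corrections = {("Art", 1): ["a", "the", "eps"], ("Noun", 2): ["singular", "plural"]}
weights = {("Art", 1, "a"): 0.6, ("Art", 1, "the"): 0.3, ("Art", 1, "eps"): 0.2,
           ("Noun", 2, "singular"): 0.7, ("Noun", 2, "plural"): 0.4}   # made-up w_{l,p,k}

prob = pulp.LpProblem("gec_toy", pulp.LpMaximize)
Z = {(l, p, k): pulp.LpVariable(f"Z_{l}_{p}_{k}", cat="Binary")
     for (l, p), ks in corrections.items() for k in ks}

prob += pulp.lpSum(weights[key] * var for key, var in Z.items())          # objective
for (l, p), ks in corrections.items():                                    # one output per (l, p)
    prob += pulp.lpSum(Z[(l, p, k)] for k in ks) == 1
prob += Z[("Art", 1, "a")] + Z[("Noun", 2, "plural")] <= 1                # article-noun agreement

prob.solve()
print([key for key, var in Z.items() if var.value() == 1])
# -> keep "a" and make the noun singular: "A cat sat on the mat"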
A cats sat on the mat
Possible corrections and related variables

Constraint
 Z^a_{Art,1} + Z^{the}_{Art,1} + Z^ε_{Art,1} = 1

Computing weights
 Language model score, classifier confidence score, disagreement score
 Classifiers: article (ART), preposition (PREP), noun number (NOUN)
 Correction Z^{singular}_{Noun,2}: s → s′ = “A cat sat on the mat.”
 Weight for Z^{singular}_{Noun,2}:
   w_{Noun,2,singular} = ν_LM · h(s′, LM)
     + λ_ART · f(s′, ART) + λ_PREP · f(s′, PREP) + λ_NOUN · f(s′, NOUN)
     + μ_ART · g(s′, ART) + μ_PREP · g(s′, PREP) + μ_NOUN · g(s′, NOUN)
 f(s′, ART) = ½ [f_t(s′[1], 1, ART) + f_t(s′[5], 5, ART)] = ½ [f_t(a, 1, ART) + f_t(the, 5, ART)]
 g(s′, ART) = max(g1, g2), where
   g1 = max_k f_t(k, 1, ART) − f_t(a, 1, ART)
   g2 = max_k f_t(k, 5, ART) − f_t(the, 5, ART)

Putting everything together:
 w_{Noun,2,singular} = ν_LM · h(s′, LM)
   + (λ_ART / 2) [f_t(a, 1, ART) + f_t(the, 5, ART)]
   + λ_PREP · f_t(on, 4, PREP)
   + (λ_NOUN / 2) [f_t(cat, 2, NOUN) + f_t(mat, 6, NOUN)]
   + μ_ART · g(s′, ART) + μ_PREP · g(s′, PREP) + μ_NOUN · g(s′, NOUN)

Modification count constraints
 The major part of the sentence is grammatical
▪ The number of modifications allowed for a particular error category l can be constrained to N_l:
▪ Σ_{p,k} Z^k_{Art,p} ≤ N_Art, where k ≠ s[p]
▪ Σ_{p,k} Z^k_{Prep,p} ≤ N_Prep, where k ≠ s[p]
▪ Σ_{p,k} Z^k_{Noun,p} ≤ N_Noun, where k ≠ s[p]

Article-Noun agreement constraints
 A noun in plural form cannot have a (or an) as its article
▪ A noun in plural form is mutually exclusive with having the article a (or an):
▪ Z^a_{Art,p1} + Z^{plural}_{Noun,p2} ≤ 1

Dependency relation constraints
 Subject-verb relation and determiner-noun relation
▪ If w belongs to a set of verbs or determiners (are, were, these, all) that take a plural noun, then the noun n is required to be plural:
▪ Z^{plural}_{Noun,p} = 1

A motivating case
 A cat sat on the mat (s) → Cats sat on the mat (s′)
 The corrections Z^ε_{Art,1} and Z^{plural}_{Noun,2} together take s → s_I → s′
 w_{Art,1,ε} will be small due to the missing article
 w_{Noun,2,plural} will be small due to the low LM score of “A cats”

Relaxing the decomposability assumption
 Combine multiple corrections into a single correction
▪ Instead of considering the corrections A/ε and cat/cats separately, consider “A cat”/“ε cats” together
▪ Higher order variables

Let Z = {Z_u}, Z_u = Z^k_{l,p} for all l, p, k, be the set of first-order variables
Let w_u = w_{l,p,k} be the weight of Z_u = Z^k_{l,p}
A second-order variable:
 X_{u,v} = Z_u ∧ Z_v, with Z_u ≜ Z^{k1}_{l1,p1} and Z_v ≜ Z^{k2}_{l2,p2}

Example: X_{u,v} = Z^ε_{Art,1} ∧ Z^{plural}_{Noun,2}; X_{u,v} takes s → s′ = “Cats sat on the mat.”

The weight for a second-order variable has the same form as for first-order variables
 Why? The combined correction still yields a single corrected sentence s′
 w_{u,v} = ν_LM · h(s′, LM) + Σ_{t∈E} λ_t · f(s′, t) + Σ_{t∈E} μ_t · g(s′, t)

New constraints for enforcing consistency
between first and second order variables

New objective function

Statistical Machine Translation for GEC
 Ê = arg max_E P(E) · P(F|E)
 Model GEC as SMT
▪ E = corrected sentence, F = erroneous sentence
▪ Parallel corpora: learner error corpora

GEC is only as good as the SMT system
 Increasing the size of parallel corpora covering the targeted types of errors is expensive

A workaround
 SMT systems are considered to be meaning preserving
 Generate alternate surface renderings of the meaning expressed in the erroneous sentence (round-trip translation)
 Select the most fluent one
Resource: Exploring Grammatical Error Correction with Not-So-Crummy Machine Translation,
Madnani et al.
[Diagram: the erroneous sentence is translated into pivot languages PL1 … PLn by bilingual MT systems 1 … n; the round-trip translations RT1 … RTn are combined, and the best candidate is selected]

Find the most fluent alternative
 Use an n-gram language model

Issue
 Language model does not care about preserving sentence
meaning
 No single translation is error free in general

To increase the likelihood of whole-sentence
correction
 Combine evidence of corrections produced by
each independent translation model

Steps: combination-based approach
 Align (original, round-trip translation) pairs
 Combine the aligned pairs to form a word lattice
 Decode for the best candidate

The task: Align each sentence pair <original,
round trip translation>
 Alignment:
▪ For a (hypothesis, reference) pair perform some edit
operations that transform a hypothesis sentence to a
reference one
▪ Each edit operation involves a cost
▪ Best alignment is that with minimal cost
▪ Also used as machine translation metric

Word Error Rate (WER)
 Levenshtein distance between a <hypothesis, reference> pair
 Edit operations: match, insertion, deletion and substitution
 Fails to model reordering of words or phrases in translation

Translation Edit Rate (TER)
 Introduce shift operation
Resource: TERp System Description, Snover et al.
Toolkit: TER Compute, http://www.cs.umd.edu/~snover/tercom/
REF: saudi arabia denied this week information published in the american new york times
HYP: this week the saudis denied information published in the new york times

 WER is too harsh when the output is merely reordered relative to the reference
 With WER, no credit is given to the system when it generates the right string in the wrong place
 TER shifts reflect the editing action of moving the string from one location to another

Aligned, with mismatches in capitals and unaligned positions marked ****:
REF: **** **** SAUDI ARABIA denied THIS WEEK information published in the AMERICAN new york times
HYP: THIS WEEK THE  SAUDIS  denied **** **** information published in the ******** new york times
Shifting “this week” to after “denied” re-aligns the two strings:
REF: SAUDI ARABIA denied this week information published in the AMERICAN new york times
HYP: @ THE SAUDIS denied [this week] information published in the ******** new york times

Edits:
 Shift “this week” to after “denied”
 Substitute “Saudi Arabia” for “the Saudis”
 Insert “American”

Total: 1 shift, 2 substitutions, 1 insertion

Optimal sequence of edits (with shifts) is very
expensive to find

Use a greedy search to select the set of shifts
 At each step, calculate min-edit (Levenshtein) distance (number
of insertions, deletions, substitutions) using dynamic
programming
 Choose shift that most reduces min-edit distance
 Repeat until no shift remains that reduces min-edit distance

After all shifting is complete, the number of
edits is the number of shifts plus the
remaining edit distance
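The inner step of the greedy shift search is a plain min-edit (Levenshtein) distance; a standard dynamic-programming sketch over token lists.

def min_edit_distance(hyp, ref):
    m, n = len(hyp), len(ref)
    d = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(m + 1):
        d[i][0] = i                                   # delete all remaining hypothesis words
    for j in range(n + 1):
        d[0][j] = j                                   # insert all remaining reference words
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            sub = 0 if hyp[i - 1] == ref[j - 1] else 1
            d[i][j] = min(d[i - 1][j] + 1,            # deletion
                          d[i][j - 1] + 1,            # insertion
                          d[i - 1][j - 1] + sub)      # substitution or match
    return d[m][n]

print(min_edit_distance("this week the saudis denied".split(),
                        "saudi arabia denied this week".split()))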
REF: DOWNER SAID " IN THE END , ANY bad AGREEMENT will NOT be an agreement we CAN SIGN . "
HYP: HE OUT " EVENTUALLY , ANY WAS *** bad , will *** be an agreement we WILL SIGNED . ”

Constraints on shifts:
 The shifted words must match the reference words in the destination position exactly
 The word sequence of the hypothesis in the original position and the corresponding reference words must not match
 The word sequence of the reference that corresponds to the destination position must be misaligned before the shift


TER-Plus (TERp)
 Three more edit operations
▪ Stem match, synonym match, phrase substitution
 allows shifts if the words being shifted are exactly
the same, are synonyms, stems or paraphrases of
each other, or any such combination
Example: aligning the original sentence with one round-trip translation
  Original:   both experience and books are very important about living .
  Round trip: related to the life experiences and the books are very important .

Edit operations: [I] insertion, [D] deletion, [S] substitution, [M] match, [T] stem match, [Y] WordNet synonym, * shift

In the TERp alignment, “related”, “to” and “the” are insertions, “experiences” is a stem match with “experience”, “life” is a shifted WordNet synonym of “living”, and the remaining tokens are matches, substitutions or deletions.

The task: combine all the translations using their alignments to the original sentence
 We need a data structure for combination: Word
Lattice

Word Lattice
 a directed acyclic graph with a single start point
and edges labeled with a word and weight.
 a word lattice can represent an exponential
number of sentences in polynomial space

1. Create the backbone of the lattice using the original sentence
[Diagram: node 1 →(both/1) node 2 →(experience/1) node 3 →(and/1) node 4 → …]

For all round trip translations, map the
alignments to the lattice
 Action for an edit: each insertion, deletion, substitution, stemming, synonymy and paraphrase operation leads to the creation of new nodes
 Action for match: Duplicate nodes are merged
 Combining weights: Edges produced by different
translations between same pair of nodes are
merged and their weights are added
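A minimal lattice sketch under these rules (the class and method names are made up for illustration); edges proposed by different translations between the same pair of nodes are merged by adding their weights.

from collections import defaultdict

class WordLattice:
    def __init__(self, backbone):
        # backbone: the original sentence; one edge of weight 1 per token
        self.edges = defaultdict(float)               # (src, dst, word) -> weight
        for i, w in enumerate(backbone):
            self.edges[(i, i + 1, w)] += 1.0
        self.n = len(backbone)

    def add_alternative(self, src, dst, word, weight=1.0):
        # an insertion / substitution / paraphrase edge from an aligned round-trip translation
        self.edges[(src, dst, word)] += weight

lat = WordLattice("both experience and books are very important about living".split())
lat.add_alternative(8, 9, "life")        # one translation proposes living -> life
lat.add_alternative(8, 9, "life")        # a second translation proposes the same edge: weights add
lat.add_alternative(3, 4, "the books")   # another proposes a phrase substitution for "books"
print(dict(lat.edges))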
Original: Both experience and books are very important about living.
Russian: And the experience, and a very important book about life.
[TERp alignment of the original sentence with the Russian round-trip translation, marked with the same edit operations as above]
An FST is a tuple (Q, Σ, Γ, I, F, δ, P)
 Q: a finite set of states
 Σ: a finite set of input symbols
 Γ: a finite set of output symbols
 I: Q → R+ (initial-state probabilities)
 F: Q → R+ (final-state probabilities)
 δ ⊆ Q × (Σ ∪ {ε}) × (Γ ∪ {ε}) × Q: the transition relation between states
 P: δ → R+ (transition probabilities)
                 Acceptors (FSAs)                    Transducers (FSTs)
Unweighted       {false, true}: grammatical?         strings: markup, correction, translation
Weighted         numbers: how grammatical?           (string, number) pairs: good markups,
                 better, how likely?                 good corrections, good translations

Weighted acceptors answer "accepted with what confidence?"; weighted transducers answer "accepted and translated with what confidence?".
[Small example automata: an acceptor over symbols a, c; a weighted acceptor with arcs a/.5, c/.7; a transducer with arcs a:x, ε:y, c:z; and a weighted transducer with arcs a:x/.5, ε:y/.5, c:z/.7]
Greedy best-first
 Output: “Both experience and books are very important about life”

1-Best
 Convert the TERp lattice edge weights to edge costs by multiplying the weights by −1
 Find the output as the shortest path in the TERp lattice
 Output: “Both experience and the books are very important about life” (cost: −59)
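Because the lattice is acyclic, the 1-best path can be found with a simple dynamic program over node order; a sketch reusing the toy lattice representation from the earlier example, with edge weights negated into costs.

def best_path(edges, n_nodes):
    # edges: (src, dst, word) -> weight; nodes assumed numbered 0..n_nodes left to right
    best = {0: (0.0, [])}
    for node in range(1, n_nodes + 1):
        candidates = []
        for (src, dst, word), weight in edges.items():
            if dst == node and src in best:
                cost, words = best[src]
                candidates.append((cost - weight, words + [word]))   # cost = -weight
        if candidates:
            best[node] = min(candidates)
    return best[n_nodes]

cost, words = best_path(lat.edges, lat.n)    # lat: the WordLattice sketched earlier
print(" ".join(words), cost)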

Language Model ranked
 Find n-best (lowest cost) list from TERp lattice
 Rank the list using n-gram language model
 Suggest top ranked candidate as correction

Product Re-ranked
 Find n-best (lowest cost) list from TERp lattice
 Multiply cost of each hypothesis with its LM score
 Rank the hypotheses by the product and choose the best

Language Model Composition
 Convert edge weights in the TERp lattice into
probabilities
▪ Weighted Finite State Transducer (WFST)
representation (𝑊𝐹𝑆𝑇𝑙𝑎𝑡𝑡𝑖𝑐𝑒 )
 Train an n-gram finite state language model in
WFST (𝑊𝐹𝑆𝑇𝐿𝑀 )
 Compose: 𝑊𝐹𝑆𝑇𝑐𝑜𝑚𝑝 = 𝑊𝐹𝑆𝑇𝑙𝑎𝑡𝑡𝑖𝑐𝑒 ∘ 𝑊𝐹𝑆𝑇𝐿𝑀
 shortest path through 𝑊𝐹𝑆𝑇𝑐𝑜𝑚𝑝 is suggested as
correction
Relevant Toolkit: OpenFst Toolkit, http://openfst.org/twiki/bin/view/FST/WebHome

Resources
 OpenFst Toolkit
▪ http://openfst.org/twiki/bin/view/FST/WebHome
 Tutorials
▪ http://www.openfst.org/twiki/pub/FST/FstSltTutorial/part1.pdf
▪ http://www.openfst.org/twiki/pub/FST/FstSltTutorial/part2.pdf

Compile
 fstcompile -isymbols=T.isyms -osymbols=T.osyms T.txt T.fst

Printing
 fstprint -isymbols=T.isyms -osymbols=T.osyms T.fst > T.txt

Drawing
 fstdraw -isymbols=T.isyms -osymbols=T.osyms T.fst > T.dot

Visualization format
 dot -Tpng T.dot > T.png
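For reference, the text files these commands expect look roughly as follows (an illustrative two-arc transducer): each arc line is "src dst ilabel olabel [weight]", a lone state number marks a final state, and the symbol tables map each symbol to an integer id with <eps> as 0.

T.txt:
0 1 a x 0.5
1 2 b y 0.3
2

T.isyms:
<eps> 0
a 1
b 2

T.osyms:
<eps> 0
x 1
y 2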
$ fstunion a.fst b.fst out.fst
$ fstconcat a.fst b.fst out.fst

[Diagram: function composition f ∘ g over strings such as abc, ab?d, abcd; arcs written as (state, input, output, weight, next state), e.g. (1, c, a, 0.3, 1) and (2, a, b, 0.6, 2)]
Learner error corpora

Grammatical error detection

Grammatical error correction

Evaluating error detection and correction systems