Computational Models of Text Quality

Ani Nenkova
University of Pennsylvania
ESSLLI 2010, Copenhagen
1
The ultimate text quality
application
 Imagine your favorite text editor


◦ With spell-checker and grammar checker
◦ But also functions that tell you
  ◦ “Word W is repeated too many times”
  ◦ “‘Fill the gap’ is a cliché”
  ◦ “You might consider using this more figurative expression”
  ◦ “This sentence is unclear and hard to read”
  ◦ “What is the connection between these two sentences?”
  ◦ …
2
Currently
 It is our friends who give such feedback
 Often conflicting
 We might agree that a text is good, but find it
hard to explain exactly why
 Computational linguistics should have some
answers
 Though far from offering a complete solution
yet
3
In this course
 We will overview research dealing with
various aspects of text quality
 A unified approach does not yet exist, but many proposals
◦ have been tested on corpus data
◦ have been integrated in applications
4
Current applications: education
 Grading student writing
 Is this a good essay?
 One of the graders of SAT and GRE essays is in
fact a machine! [1]
http://www.ets.org/research/capabilities/automated_scoring
 Providing appropriate reading material
 Is this text good for a particular user?
 Appropriate grade level
 Appropriate language competency in L2 [2,3]
http://reap.cs.cmu.edu/
5
Current applications: information
retrieval
 Particularly user generated content
 Questions and answers on the web
 Blogs and comments
 Searching over such content poses new
problems [4]

What is a good question/answer/comment?
http://answers.yahoo.com/
 Relevant for general IR as well
 Of the many relevant documents, some are better written
6
Current applications: NLP
 Models of text quality
 lead to improved systems [5]
 offer possibilities for automatic evaluation [6]
 Automatic summarization
 Select important content and organize it in a well-written text
 Language generation
 Select, organize and present content on
document, paragraph, sentence and phrase level
 Machine translation
7
Text quality factors
 Interesting
 Style (clichés, figurative language)
 Vocabulary use
 Grammatical and fluent sentences
 Coherent and easy to understand
 In most types of writing, well-written means clear and
easy to understand. Not necessarily so in literary
works.
 Problems with clarity of instructions motivated a fair
amount of early work.
8
Early work: keep in mind these
predate modern computers!
 Common words are easier to understand
 stentorian vs. loud
 myocardial infarction vs. heart attack
 Common words are short
◦ Standard readability metrics
  ◦ percentage of words not among the N most frequent
  ◦ average number of syllables per word
 Syntactically simple sentences are easier to understand
◦ average number of words per sentence
[Flesch-Kincaid, Automated Readability Index, Gunning-Fog, SMOG, Coleman-Liau]
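As an illustration of how such metrics combine word length and sentence length, here is a minimal sketch of the Flesch-Kincaid Grade Level formula. The syllable counter is a crude vowel-group heuristic assumed for the example (real implementations usually rely on a pronunciation dictionary), and the function names are hypothetical.

```python
import re

def count_syllables(word: str) -> int:
    """Rough heuristic: one syllable per group of consecutive vowels."""
    groups = re.findall(r"[aeiouy]+", word.lower())
    return max(1, len(groups))

def flesch_kincaid_grade(text: str) -> float:
    """Flesch-Kincaid Grade Level from words/sentence and syllables/word."""
    sentences = [s for s in re.split(r"[.!?]+", text) if s.strip()]
    words = re.findall(r"[A-Za-z]+", text)
    if not sentences or not words:
        raise ValueError("text must contain at least one sentence and one word")
    syllables = sum(count_syllables(w) for w in words)
    return (0.39 * len(words) / len(sentences)
            + 11.8 * syllables / len(words)
            - 15.59)

easy = flesch_kincaid_grade("He had a heart attack.")
hard = flesch_kincaid_grade("He had a myocardial infarction.")
```

With this heuristic the second sentence gets a higher grade level than the first, purely because its words have more syllables.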
9
Modern equivalents
 Language models
 Word probabilities from a large collection
http://www.speech.cs.cmu.edu/SLM_info.html
 Features derived from syntactic parse
[2,7,8,9]




Parse tree height
Number of subordinating conjunctions
Number of passive voice constructions
Number of noun and verb phrases
10
Language models
 Unigram and bigram language models
 Really, just huge tables
 Smoothing necessary to account for unseen
words
$$p(w) = \frac{n_w}{N}$$
$$p(w_1 \mid w_2) = \frac{n_{w_2 w_1}}{n_{w_2}}$$
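The "huge tables" view can be sketched in a few lines. This is only an illustration: it uses add-one smoothing as one simple way to reserve probability for unseen words (the slides do not say which smoothing method is meant), and the class and method names are hypothetical.

```python
from collections import Counter

class BigramModel:
    """Unigram and bigram count tables with add-one smoothing (an assumption)."""

    def __init__(self, tokens):
        self.unigrams = Counter(tokens)
        self.bigrams = Counter(zip(tokens, tokens[1:]))
        self.total = len(tokens)
        # One extra vocabulary slot stands in for all unseen words.
        self.vocab_size = len(self.unigrams) + 1

    def p_unigram(self, w):
        # Smoothed version of n_w / N.
        return (self.unigrams[w] + 1) / (self.total + self.vocab_size)

    def p_bigram(self, w, prev):
        # Smoothed version of n_{prev w} / n_{prev}.
        return (self.bigrams[(prev, w)] + 1) / (self.unigrams[prev] + self.vocab_size)

model = BigramModel("the cat sat on the mat".split())
p_seen = model.p_bigram("cat", "the")    # bigram observed once
p_unseen = model.p_bigram("dog", "the")  # never observed, still non-zero
```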
11
Features from language models
 Assessing the readability of text t consisting
of m words, for intended audience class c
 Number of out of vocabulary words in the text
with respect to the language model for c
 Text likelihood and perplexity
$$L(t) = P(c)\,P(w_1 \mid c) \cdots P(w_m \mid c)$$
$$PP = 2^{H(t \mid c)}$$
$$H(t \mid c) = -\frac{1}{m} \log_2 P(t \mid c)$$
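A minimal sketch of these features for one audience class c, assuming a unigram model with add-one smoothing trained on a token list for that class. The class prior P(c) is left out, so the likelihood here is P(t | c). The function name and the toy data are hypothetical.

```python
import math
from collections import Counter

def lm_features(text_tokens, class_tokens):
    """Out-of-vocabulary count, log-likelihood, entropy and perplexity of a text."""
    counts = Counter(class_tokens)
    # Add-one smoothing with one extra slot for unseen words (an assumption).
    denom = len(class_tokens) + len(counts) + 1
    log2_p = sum(math.log2((counts[w] + 1) / denom) for w in text_tokens)
    entropy = -log2_p / len(text_tokens)          # H(t|c) = -(1/m) log2 P(t|c)
    return {
        "oov": sum(1 for w in text_tokens if w not in counts),
        "log2_likelihood": log2_p,
        "entropy": entropy,
        "perplexity": 2 ** entropy,               # PP = 2^H(t|c)
    }

features = lm_features(["a", "b", "z"], ["a", "a", "b", "c"])
```

To predict a grade level one would compute these features once per class model and compare them, for example by picking the class with the lowest perplexity.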
12
Application to grade level prediction
Collins-Thompson and Callan, NAACL 2004 [10]
13
14
Results on predicting grade level
Schwarm and Ostendorf, ACL 2005 [11]
 Flesch-Kincaid Grade Level index
◦ number of syllables per word
◦ sentence length
 Lexile
◦ word frequency
◦ sentence length
 SVM features
◦ language models and syntax
15
Models of text coherence
 Global coherence
◦ Overall document organization
 Local coherence
◦ Adjacent sentences
16
Text structure can be learnt in an
unsupervised manner
 Human-written examples from a domain
[Figure: example documents with segments labelled location/time, damage, magnitude, relief efforts]
17
Content model
Barzilay & Lee, 2004 [5]
 Hidden Markov Model (HMM)-based



◦ States: clusters of related sentences (“topics”)
◦ Transition prob.: sentence precedence in corpus
◦ Emission prob.: bigram language model
$$p(s_{i+1}, h_{i+1} \mid s_i, h_i) = p_t(h_{i+1} \mid h_i) \cdot p_e(s_{i+1} \mid h_{i+1})$$
◦ $p_t$: transition from previous topic
◦ $p_e$: generating sentence in current topic
[Figure: earthquake reports with topics location, magnitude, relief efforts, casualties]
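To make the factorization concrete, the sketch below scores a sequence of (sentence, topic) pairs as a sum of log transition and log emission terms. All numbers are made-up toy parameters, and the emission model is a unigram stand-in for the bigram language model; a real content model learns both from a corpus and infers the topics rather than taking them as given.

```python
import math

# Hypothetical toy parameters, for illustration only.
TRANSITION = {
    ("START", "location"): 0.9, ("START", "casualties"): 0.1,
    ("location", "location"): 0.2, ("location", "casualties"): 0.8,
    ("casualties", "location"): 0.3, ("casualties", "casualties"): 0.7,
}
TOPIC_WORDS = {
    "location": {"struck": 0.4, "near": 0.4},
    "casualties": {"killed": 0.5, "injured": 0.3},
}
UNSEEN = 0.01  # floor probability for words a topic has not seen

def log_emit(sentence, topic):
    """log p_e(s | h), using a unigram stand-in for the bigram model."""
    words = sentence.lower().split()
    return sum(math.log(TOPIC_WORDS[topic].get(w, UNSEEN)) for w in words)

def sequence_log_prob(sentences, topics):
    """Sum of log p_t(h_i | h_{i-1}) + log p_e(s_i | h_i) over the text."""
    total, prev = 0.0, "START"
    for sentence, topic in zip(sentences, topics):
        total += math.log(TRANSITION[(prev, topic)]) + log_emit(sentence, topic)
        prev = topic
    return total

original = sequence_log_prob(["Quake struck near city", "Ten killed"],
                             ["location", "casualties"])
shuffled = sequence_log_prob(["Ten killed", "Quake struck near city"],
                             ["casualties", "location"])
```

The emission terms are identical in both orders, so the difference between the two scores comes entirely from the topic transitions.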
18
Generating Wikipedia articles
Sauper and Barzilay, 2009 [12]
 Articles on diseases and American film actors
 Create templates of subtopics

Focus only on subtopic level structure
◦ Use paragraphs from documents on the web
19
Template creation
 Cluster similar headings
◦ signs and symptoms, symptoms, early symptoms…
 Choose k clusters
◦ average number of subtopics in that domain
 Find majority ordering for the clusters

Biography        Diseases
Early life       Symptoms
Career           Causes
Personal life    Diagnosis
Death            Treatment
20
Extraction of excerpts and ranking
 Candidates for a subtopic
◦ Paragraphs from top 10 pages of search results
 Measure relevance of candidates for that subtopic
◦ Features ~ unigrams, bigrams, number of sentences…
21
Need to control redundancy across subtopics
 Integer Linear Program
 Variables
◦ One per excerpt (value 1 if chosen, 0 otherwise)
 Objective
◦ Minimize sum of the ranks of the excerpts chosen
 Constraints
◦ Cosine similarity between any selected pair <= 0.5
◦ One excerpt per subtopic
[Figure: ranked excerpts 1-5 for the subtopics causes, symptoms, diagnosis, treatment]
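The objective and constraints can be illustrated without an ILP solver. The sketch below simply enumerates every combination of one excerpt per subtopic, which is only feasible for tiny inputs and is not how the original system solves the problem. Ranks are 0-based, cosine similarity is computed over raw word counts, and the candidate texts are invented for the example.

```python
import itertools
import math
from collections import Counter

def cosine(a, b):
    """Cosine similarity between two texts using bag-of-words counts."""
    va, vb = Counter(a.split()), Counter(b.split())
    dot = sum(va[w] * vb[w] for w in va)
    norm = (math.sqrt(sum(c * c for c in va.values()))
            * math.sqrt(sum(c * c for c in vb.values())))
    return dot / norm if norm else 0.0

def select_excerpts(candidates, max_sim=0.5):
    """candidates maps subtopic -> excerpts ranked best first.

    Returns the lowest rank-sum choice of one excerpt per subtopic whose
    selected pairs all have cosine similarity <= max_sim.
    """
    subtopics = list(candidates)
    best, best_cost = None, None
    rank_ranges = (range(len(candidates[s])) for s in subtopics)
    for ranks in itertools.product(*rank_ranges):
        chosen = [candidates[s][r] for s, r in zip(subtopics, ranks)]
        if any(cosine(x, y) > max_sim
               for x, y in itertools.combinations(chosen, 2)):
            continue  # redundancy constraint violated
        cost = sum(ranks)
        if best_cost is None or cost < best_cost:
            best, best_cost = dict(zip(subtopics, chosen)), cost
    return best, best_cost

candidates = {
    "causes": ["virus causes fever cough", "bacteria spread by contact"],
    "symptoms": ["virus causes fever cough", "rash and joint pain"],
}
best, cost = select_excerpts(candidates)
```

Here the top-ranked excerpt is the same for both subtopics, so the redundancy constraint forces one subtopic to fall back to its second candidate.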
22
Linguistic models of coherence
[Halliday and Hasan, 1976] [13]
 Coherent text is characterized by the
presence of various types of cohesive links
that facilitate text comprehension
 Reference and lexical reiteration
 Pronouns, definite descriptions, semantically
related words
 Discourse relations (conjunction)
 I closed the window because it started raining.
 Substitution (one) or ellipses (do)
23
Referential coherence
 Centering theory
◦ tracking focus of attention across adjacent sentences [14, 15, 16, 17]
 Syntactic form of references
◦ Particularly first and subsequent mention [18, 19], pronominalization
 Lexical chains
◦ Identifying and tracking topics within a text [20, 21, 22, 23]
24
Discourse relations
 Explicit vs. implicit
◦ Explicit: signaled by a discourse connective
  I stayed home because I had a headache.
◦ Implicit: inferred without the presence of a connective
  I took my umbrella. [Because] The forecast was for rain in the afternoon.
25
Lexical chains
 Often discussed as a cohesion indicator and implemented in systems, but not used in text quality tasks
◦ Find all words that refer to the same topic
◦ Find the correct sense of the words
LexChainer Tool: http://www1.cs.columbia.edu/nlp/tools.cgi [23]
 Applications: summarization, IR, spell checking, hypertext construction
John bought a Jaguar. He loves the car.
LC = {jaguar, car, engine, it}
26
Centering theory ingredients
(Grosz et al, 1995)
 Deals with local coherence
◦ What happens to the flow from sentence to sentence
◦ Does not deal with global structuring of the text (paragraphs/segments)
 Defines coherence as an estimate of the processing load required to “understand” the text
27
Processing load
 Upon hearing a sentence, a person
◦ Expends cognitive effort to interpret the expressions in the utterance
◦ Integrates the meaning of the utterance with that of the previous sentence
◦ Creates some expectations about what might come next
28
Example
(1) John met his friend Mary today.
(2) He was surprised to see her.
(3) He thought she is still in Italy.

 Form of referring expressions
◦ Anaphora needs to be resolved
◦ “Create” a discourse entity at first mention with full noun phrase
 Creating expectations
29
Creating and meeting expectations
(1) a. John went to his favorite music store to buy a piano.
b. He had frequented the store for many years.
c. He was excited that he could finally buy a piano.
d. He arrived just as the store was closing for the day.
(2) a. John went to his favorite music store to buy a piano.
b. It was a store John had frequented for many years.
c. He was excited that he could finally buy a piano.
d. It was closing just as John arrived.
30
Interpreting pronouns
a. Terry really goofs sometimes.
b. Yesterday was a beautiful day and he was excited about trying out his new sailboat.
c. He wanted Tony to join him on a sailing expedition.
d. He called him at 6am.
e. He was sick and furious at being woken up so early.
31
Basic centering definitions
 Centers of an utterance
◦ Set of entities serving to link that utterance to the other utterances in the discourse segment that contains it
◦ Not words or phrases themselves
◦ Semantic interpretations of noun phrases
32
Types of centers
 Forward looking centers
An ordered set of entities
 What could we expect to hear about next
 Ordered by salience as determined by grammatical function
 Subject > Indirect object > Object > Others
 John gave the textbook to Mary.
 Cf = {John, Mary, textbook}
 Preferred center Cp
 The highest ranked forward looking center
 High expectation that the next utterance in the segment will
be about Cp

33
Backward looking center
 Single backward looking center, Cb(U)
◦ For each utterance other than the segment-initial one
 The backward looking center of utterance Un+1
connects with one of the forward looking
centers of Un
 Cb(Un+1) is the most highly ranked element
from Cf(Un) that is also realized in Un+1
34
Centering transitions ordering
                        Cb(Un+1) = Cb(Un)      Cb(Un+1) != Cb(Un)
                        OR Cb(Un) = [?]
Cb(Un+1) = Cp(Un+1)     continue               smooth-shift
Cb(Un+1) != Cp(Un+1)    retain                 rough-shift
35
Centering constraints
 There is precisely one backward-looking
center Cb(Un)
 Cb(Un+1) is the highest-ranked element of
Cf(Un) that is realized in Un+1
36
Centering rules
 If some element of Cf(Un) is realized as a
pronoun in Un+1 then so is Cb(Un+1)
 Transitions are not equally preferred
◦ continue > retain > smooth-shift > rough-shift
37
Centering analysis
 Terry really goofs sometimes.

Cf={Terry}, Cb=?, undef
 Yesterday was a beautiful day and he was excited about
trying out his new sailboat.
 Cf={Terry,sailboat}, Cb=Terry, continue
 He wanted Tony to join him in a sailing expedition.

Cf={Terry, Tony, expedition}, Cb=Terry, continue
 He called him at 6am.

Cf={Terry,Tony}, Cb=Terry, continue
38
 Tony was sick and furious at being woken up so early.

Cf={Tony}, Cb=Tony, smooth shift
 He told Terry to get lost and hung up.

Cf={Tony,Terry}, Cb=Tony, continue
 Of course, Terry hadn’t intended to upset Tony.

Cf={Terry,Tony}, Cb = Tony, retain
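The analysis above can be reproduced mechanically once each utterance is given as its ranked Cf list. The sketch below assumes the Cf lists are supplied by hand (extracting and ranking entities automatically is the hard part) and treats an undefined previous Cb like an unchanged one, as in the transition table. The function names are hypothetical.

```python
def backward_center(cf_prev, cf_cur):
    """Cb(Un+1): highest-ranked element of Cf(Un) that is realized in Un+1."""
    for entity in cf_prev:
        if entity in cf_cur:
            return entity
    return None

def transition(cb_prev, cb_cur, cp_cur):
    """Classify the transition into Un+1 from the two Cb values and Cp(Un+1)."""
    if cb_cur is None:
        return None  # no backward-looking center at all
    same_cb = cb_prev is None or cb_cur == cb_prev
    if cb_cur == cp_cur:
        return "continue" if same_cb else "smooth-shift"
    return "retain" if same_cb else "rough-shift"

def analyze(cf_lists):
    """Return (Cb, transition) for each utterance; each Cf list must be non-empty."""
    results, cb_prev = [], None
    for i, cf in enumerate(cf_lists):
        if i == 0:
            cb, label = None, None  # segment-initial utterance
        else:
            cb = backward_center(cf_lists[i - 1], cf)
            label = transition(cb_prev, cb, cf[0])  # cf[0] is Cp
        results.append((cb, label))
        cb_prev = cb
    return results

terry_story = [
    ["Terry"],
    ["Terry", "sailboat"],
    ["Terry", "Tony", "expedition"],
    ["Terry", "Tony"],
    ["Tony"],
    ["Tony", "Terry"],
    ["Terry", "Tony"],
]
labels = [label for _, label in analyze(terry_story)]
```

The resulting labels match the hand analysis: three continues, a smooth-shift when Tony takes over, another continue, and a final retain.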
39
Rough shifts in evaluation of
writing skills (Miltsakaki and Kukich, 2000) [15]
 Automatic grading of essays by E-rater
 Syntactic variety
 Represented by features that quantify the
occurrence of clause types
 Clear transitions
 Cue phrases in certain syntactic constructions
 Existence of main and supporting points
 Appropriateness of the vocabulary content of the
essay
 What about local coherence?
40
Essay score model
 Human score available
 E-rater prediction available
 Percentage of rough-shifts in each essay:
analysis done manually
 Negative correlation between the human
score and the percentage of rough-shifts
41
 Linear multi-factor regression
 Approximate the human score as a linear function
of the e-rater prediction and the percentage of
rough-shifts
 Adding rough shifts significantly improves the
model of the score
◦ 0.5 improvement on a 1-6 scale
 How easy/difficult would it be to fully automate
the rough-shift variable?
42
Variants of centering and
application to information ordering
 Karamanis et al., 2009 [16] is the most comprehensive
overview of variants of centering theory and an
evaluation of centering in a specific task related to
text quality
43
Information ordering task
 Given a set of sentences/clauses, what is the
best presentation?
◦ Take a newspaper article and jumble the
sentences: the result will be much more difficult
to read than the original
 Negative examples constructed by randomly
permuting the original
 Criteria for deciding which of two orderings is
better
◦ Centering would definitely be applicable
44
Centering variations
 Continuity (NOCB=lack of continuity)
 Cf(Un) and Cf(Un+1) share at least one element
 Coherence
 Cb(Un) = Cb(Un+1)
 Salience
 Cb(U) = Cp(U)
 Cheapness (fulfilled expectations)
 Cb (Un+1) = Cp(Un)
45
Metrics of coherence
 M.NOCB (no continuity)
 M.CHEAP (expectations not met)
 M.KP sum of the violations of continuity,
cheapness, coherence and salience
 M.BFP prefers orderings whose transitions rank
higher under Rule 2 (the transition ordering)
46
Experimental methodology
 Gold-standard ordering
◦ The original order of the text (object description, news article)
◦ Assume that other orderings are inferior
 Classification error rate
◦ Percentage of orderings that score better than the gold-standard
+ 0.5 * percentage of the orderings that score the same
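A small sketch of this evaluation for the NOCB metric. It assumes each sentence is represented by its set of entities, counts an adjacent pair with no shared entity as a NOCB violation, and enumerates all alternative orderings exhaustively, which only works for short texts (longer texts would need a random sample of permutations). The function names and toy text are hypothetical.

```python
import itertools

def nocb_count(ordering):
    """Number of adjacent sentence pairs that share no entity (lower is better)."""
    return sum(1 for a, b in zip(ordering, ordering[1:]) if not (a & b))

def classification_error(gold, metric=nocb_count):
    """Share of alternative orderings scoring better than gold, plus half the ties."""
    n = len(gold)
    identity = tuple(range(n))
    gold_score = metric(gold)
    better = same = total = 0
    for perm in itertools.permutations(range(n)):
        if perm == identity:
            continue  # skip the gold ordering itself
        score = metric([gold[i] for i in perm])
        total += 1
        if score < gold_score:
            better += 1
        elif score == gold_score:
            same += 1
    return (better + 0.5 * same) / total

gold = [{"a"}, {"a", "b"}, {"b"}]
error = classification_error(gold)
```

The function returns a fraction rather than a percentage. For the toy text only the fully reversed ordering ties with the gold standard, so the error is 0.5 out of 5 alternatives.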
47
Results
 NOCB gives best results
◦ Significantly better than the other metrics
◦ Consistent results for three different corpora
  ◦ Museum artifact descriptions (2)
  ◦ News
  ◦ Airplane accidents
 M.BFP is the second best metric
48
49
Entity grid
(Barzilay and Lapata, 2005, 2008)
 Inspired by centering
◦ Tracks entities across adjacent sentences, as well as their syntactic positions
 Much easier to compute from raw text
Brown Coherence Toolkit
http://www.cs.brown.edu/~melsner/manual.html
50
Entity grid: applications
 Several applications, with very good results
 Information ordering
 Comparing the coherence of pairs of summaries
 Distinguishing readability levels
◦ Child vs. adult
◦ Improves over Petersen & Ostendorf
51
Entity grid example
1 [The Justice Department]S is conducting an [anti-trust trial]O against [Microsoft
Corp.]X with [evidence]X that [the company]S is increasingly attempting to crush
[competitors]O.
2 [Microsoft]O is accused of trying to forcefully buy into [markets]X where [its own
products]S are not competitive enough to unseat [established brands]O.
3 [The case]S revolves around [evidence]O of [Microsoft]S aggressively pressuring
[Netscape]O into merging [browser software]O.
4 [Microsoft]S claims [its tactics]S are commonplace and good economically.
5 [The government]S may file [a civil suit]O ruling that [conspiracy]S to curb
[competition]O through [collusion]X is [a violation of the Sherman Act]O.
6 [Microsoft]S continues to show [increased earnings]O despite [the trial]X.
52
Entity grid representation
53
16 entity grid features
 The probability of each type of transition in the text
◦ Four syntactic distinctions: S, O, X, _
◦ 4 x 4 = 16 possible transitions between adjacent sentences
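A minimal sketch of how the 16 features could be computed, assuming each sentence has already been reduced to a mapping from entity to syntactic role (parsing and coreference are taken as given). The three-sentence input is a simplified, hypothetical fragment, not the full grid for the Microsoft example.

```python
from collections import Counter
from itertools import product

ROLES = "SOX_"  # subject, object, other, absent

def entity_grid(sentences):
    """sentences: list of {entity: role}. Returns {entity: [role per sentence]}."""
    entities = sorted({e for s in sentences for e in s})
    return {e: [s.get(e, "_") for s in sentences] for e in entities}

def transition_features(grid):
    """Probability of each of the 16 role transitions between adjacent sentences."""
    counts = Counter()
    for column in grid.values():
        counts.update(zip(column, column[1:]))
    total = sum(counts.values())
    return {a + b: (counts[(a, b)] / total if total else 0.0)
            for a, b in product(ROLES, repeat=2)}

sentences = [
    {"Microsoft": "S", "trial": "O"},
    {"Microsoft": "O"},
    {"Microsoft": "S", "trial": "X"},
]
grid = entity_grid(sentences)
features = transition_features(grid)
```

Each text becomes a fixed-length vector of 16 probabilities regardless of its length, which is what makes the representation convenient for a classifier or ranker.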
54
Type of reference and info ordering
(Elsner and Charniak, 2008)
 Entity grid features not concerned with how
an entity is mentioned
 Discourse old vs. discourse new
Kent Wells, a BP senior vice president said on Saturday during a
technical briefing that the current cap, which has a looser fit and
has been diverting about 15,000 barrels of oil a day to a
drillship, will be replaced with a new one in 4 to 7 days.
The new cap will take 4 to 7 days to be installed, and in case the
new cap is not effective, Mr. Wells said engineers were
prepared to replace it with an improved version of the current
cap.
55
 The probability of a given sequence of
discourse new and old realizations gives a
further indication about ordering
 Similarly, pronouns should have reasonable
antecedents
 Adding both models to the entity grid
improves performance on the information
ordering task
56
Sentence Ordering
 n sentences
◦ Output from a generation or summarization system
 Find most coherent ordering
◦ n! permutations
 With local coherence metrics
◦ Adjacent sentence flow
◦ Finding best ordering is NP-complete
  ◦ Reduction from Traveling Salesman Problem
57
Word co-occurrence model
(Lapata, ACL 2003; Soricut and Marcu, 2005) [23,24]
 Idea from statistical machine translation

Alignment models
[Figure: alignment of parallel English and French sentences]
John went to a restaurant. / John est allé à un restaurant.
He ordered fish. / Il ordonna de poisson.
The waiter was very attentive. / Le garçon était très attentif.
…
[Figure: the same idea applied to adjacent sentences of one text]
We ate at a restaurant yesterday. We also ordered some take away.
John went to a restaurant. He ordered fish. The waiter was very attentive. John gave him a huge tip.
…
P(ordered | restaurant)
P(fish | poisson)
P(waiter | ordered)
P(tip | waiter)
…
58
Discourse (coherence) relations
 Only recently have empirical results shown
that discourse relations are predictive of text
quality (Pitler and Nenkova, 2008)
59
PDTB discourse relations annotations
 Largest corpus of annotated discourse relations
http://www.seas.upenn.edu/~pdtb/
 Four broad classes of relations
◦ Contingency
◦ Comparison
◦ Temporal
◦ Expansion
 Explicit and implicit
60
Implicit and explicit relations
(E1) He is very tired because he played tennis all morning.
(E2) He is not very strong but he can run amazingly fast.
(E3) We had some tea in the afternoon and later went to a
restaurant for a big dinner
(I1) I took my umbrella this morning. [because] The forecast
was for rain.
(I2) She is never late for meetings. [but] He always arrives
10 minutes late.
(I3) She woke up early. [afterwards] She had breakfast and
went for a walk in the park.
61
What is the relative importance of
factors in determining text quality?
 Competent readers (native English speakers)
◦ graduate students at Penn
◦ Wall Street Journal texts
 30 texts rated on a scale of 1 to 5
◦ How well-written is this article?
◦ How well does the text fit together?
◦ How easy was it to understand?
◦ How interesting is the article?
62
 Several judgments for each text
◦ Final quality score was the average
 Scores range from 1.5 to 4.33
◦ Mean 3.2
63
 Which of the many indicators will work best?
◦ Research studies usually focus on only one or two
 How do indicators combine?
 Metrics
◦ Correlation coefficient
◦ Accuracy of pair-wise ranking prediction
64
Correlation coefficients between
assessor ratings and different
features
65
Baseline measures
 Average Characters/Word

r = -.0859 (p = .6519)
 Average Words/Sentence

r = .1637 (p = .3874)
 Max Words/Sentence

r = .0866 (p = .6489)
 Article length

r = -.3713 (p = .0434)
66
Vocabulary factors
 Language model probability of the article
$$\prod_{w} p(w \mid M)^{C(w)}$$
$$\sum_{w} C(w) \log p(w \mid M)$$
◦ C(w) is the number of times word w occurs in the article
 M estimated from PTB (WSJ)
 M estimated from general news (NEWS)
67
Correlations with ‘well-written’
assessment
 Log likelihood, WSJ
 r = .3723 (p = .0428)
 Log likelihood, NEWS
 r= .4497 (p = .0127)
 Log likelihood with length, WSJ
 r = .3732 (p = .0422)
 Log likelihood with length, NEWS
 r = .6359, p = .0002
68
Syntactic features
 Average parse tree height

r = -.0634 (p = .7439)
 Avr. number of noun phrases per sentence

r = .2189 (p = .2539)
 Average SBARs

r = .3405 (p = .0707)
 Avr. number of verb phrases per sentence

r = .4213 (p = .0228)
69
Elements of lexical cohesion
 Avr. cosine similarity between adjacent sents
 r = -.1012 (p = .5947)
 Avr. word overlap between adjacent sentences
 r = -.0531, p = .7806
 Avr. Noun+Pronoun Overlap
 r = .0905, p = .6345
 Avr. # Pronouns/Sent
 r = .2381, p = .2051
 Avr # Definite Articles
 r = .2309, p = .2196
70
Correlation with ‘well-written’ score
 Prob. of S-S transition

r = -.1287 (p = .5059)
 Prob. of S-O transition

r = -.0427 (p = .8261)
 Prob. of S-X transition

r = -.1450 (p = .4529)
 Prob. of S-N transition

r = .3116 (p = .0999)
 Prob. of O-S transition

r = .1131 (p = .5591)
 Prob. of O-O transition

r = .0825 (p = .6706)
 Prob. of O-X transition

r = .0744 (p = .7014)
 Prob. of O-N transition

r = .2590 (p = .1749)
71
 Prob. of X-S transition

r = .1732 (p = .3688)
 Prob. of X-O transition

r = .0098 (p = .9598)
 Prob. of X-X transition

r = -.0655 (p = .7357)
 Prob. of X-N transition

r = .1319 (p = .4953)
 Prob. of N-S transition

r = .1898 (p = .3242)
 Prob. of N-O transition

r = .2577 (p = .1772)
 Prob. of N-X transition

r = .1854 (p = .3355)
 Prob. of N-N transition

r = -.2349 (p = .2200)
72
Well-writtenness and discourse
 Log likelihood of discourse rels
 r = .4835 (p = .0068)
 # of discourse relations
 r = -.2729 (p = .1445)
 Log likelihood of rels with # of rels
 r = .5409 (p = .0020)
 # of relations with # of words
 r = .3819 (p = .0373)
 Explicit relations only
 r = .1528 (p = .4203)
 Implicit relations only
 r = .2403 (p = .2009)
73
Summary: significant factors
 Log likelihood of discourse relations
 r = .4835
 Log likelihood , NEWS
 r = .4497
 Average verb phrases per sentence
 r = .4213
 Log likelihood, WSJ
 r = .3723
 Number of words
 r = -.3713
74
Text quality prediction as ranking
 Every pair of texts with ratings differing by 0.5
 Features are the difference of feature values
for each text
 Task: predict which of the two articles has
higher text quality score
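The construction of training examples can be sketched as follows. The sketch assumes that "differing by 0.5" means a rating gap of at least 0.5 (the threshold is a parameter), and the feature vectors and ratings are invented. Any binary classifier could then be trained on the resulting difference vectors.

```python
from itertools import combinations

def pairwise_examples(texts, min_gap=0.5):
    """texts: list of (feature_vector, rating).

    Returns (difference_vector, label) pairs, where label is 1 if the first
    text of the pair has the higher rating and 0 otherwise.
    """
    examples = []
    for (feats_a, rating_a), (feats_b, rating_b) in combinations(texts, 2):
        if abs(rating_a - rating_b) < min_gap:
            continue  # ratings too close to call one text better
        diff = [x - y for x, y in zip(feats_a, feats_b)]
        examples.append((diff, 1 if rating_a > rating_b else 0))
    return examples

# Hypothetical texts: two feature values each, plus an average rating.
texts = [([1.0, 2.0], 4.0), ([0.5, 3.0], 3.0), ([0.0, 1.0], 3.2)]
examples = pairwise_examples(texts)
```

Pairs whose ratings are too close are dropped, which is why the second and third texts never form an example.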
75
Prediction accuracy (10-fold cross
validation)
 None (Majority Class) 50.21%
 Number of words 65.84%
 ALL 88.88%
◦ Grid only 79.42%
◦ Log likelihood of discourse rels 77.77%
◦ Avg. VPs per sentence 69.54%
◦ Log likelihood, NEWS 66.25%
76
Findings
 Complex interplay between features
 Entity grid features not significantly correlated
with ‘well-written’ score but very useful for the
ranking task
 Discourse information is very helpful
◦ But here we used gold-standard annotations
◦ Developing an automatic classifier is underway
77
Implicit and explicit discourse
relations
Class          Explicit   Implicit
Comparison     69%        31%
Contingency    47%        53%
Temporal       80%        20%
Expansion      42%        58%
78
Sense classification based on
connectives only
 Four-way classification
 Explicit relations only
◦ 93% accuracy
 All relations (implicit+explicit)
◦ 75% accuracy
 Implicit relations are the real challenge
79
Explicit discourse relations, tasks
Pitler and Nenkova, 2009 [25]
 Discourse vs. non-discourse use
◦ I will be happier once the semester is over.
◦ I have been to Ohio once.
 Relation sense
◦ Contingency, comparison, temporal, expansion
◦ I haven’t been to Paris since I went there on a school trip in 1998. [Temporal]
◦ I haven’t been to Antarctica since it is very far away. [Contingency]
80
Penn Discourse Treebank
 Largest available annotated corpus of discourse relations
◦ Penn Treebank WSJ articles
◦ 18,459 explicit discourse relations
◦ 100 connectives
 “although” 91% discourse vs. “or” 3% discourse
81
Discourse Usage Experiments
 Positive examples: discourse connectives
 Negative examples: same strings in PDTB,
unannotated
 10-fold cross validation
 Maximum Entropy classifier
82
Discourse Usage Results
83
84
Sense Disambiguation: Comparison,
Contingency, Expansion, or Temporal?
Features                    Accuracy
Connective                  93.67%
Connective + Syntax         94.15%
Interannotator Agreement    94%
85
Tool
 Automatic annotation of discourse use and
sense of discourse connectives
Discourse Connectives Tagger
http://www.cis.upenn.edu/~epitler/discourse.html
86
What about implicit relations?
 Is there hope to have a usable tool soon?
 Early studies on unannotated data gave
reason for optimism
 But when recently tested on the PDTB, their
performance is poor
◦ Accuracy on contingency, comparison and temporal is below 50%
87
Classify implicits and explicits
together
 Not easy to infer from combined results how
early systems performed on implicits
◦ As we saw, one can get reasonable overall
performance by doing nothing for explicits
 Same sentence [26]
 Graphbank corpus: doesn’t distinguish
implicit and explicit [27]
88
Classify on large unannotated corpus
 Create artificial implicits by deleting the
connective [28, 29, 30]
◦ I am in Europe, but I live in the United States.
 First proposed by Marcu and Echihabi, 2002
 Very good initial results
 Accuracy of distinguishing between two rels, >75%
 But these were on balanced classes
 Not the case in real text
 Not tested on real implicits (but see [30,29])
89
Experiments with PDTB
 Pitler et al., ACL 2009 [31]
◦ Wide variety of features to capture semantic opposition and parallelism
 Lin et al., EMNLP 2009 [32]
◦ (Lexicalized) syntactic features
 Results improve over baselines, better
understanding of features, but the classifiers
are not suitable for application in real tasks
90
Word pairs as features
 Most basic feature for implicits
◦ Span 1: I am a little tired
◦ Span 2: there is a 13 hour time difference
 I_there, I_is, …, tired_time, tired_difference
Marcu and Echihabi, 2002 [28]
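The feature extraction itself is just a cross product of the words in the two spans. A minimal sketch, assuming whitespace tokenization and an underscore-joined feature name (both are illustrative choices):

```python
from itertools import product

def word_pair_features(span1, span2):
    """All (word from span 1, word from span 2) pairs as string features."""
    return {f"{a}_{b}" for a, b in product(span1.split(), span2.split())}

features = word_pair_features("I am a little tired",
                              "there is a 13 hour time difference")
```

Two short spans of 5 and 7 words already give 35 features, which is why this representation needs large amounts of training data.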
91
Intuition: with large amounts of data,
will find semantically-related pairs
 The recent explosion of country funds mirrors
the “closed-end fund mania” of the 1920s, Mr.
Foot says, when narrowly focused funds grew
wildly popular.
 They fell into oblivion after the 1929 crash.
92
Meta error analysis of prior work
 Using just content words reduces
performance (but has steeper learning curve)
◦ Marcu and Echihabi, 2002 [28]
 Nouns and adjectives don’t help at all
◦ Lapata and Lascarides, 2004 [33]
 Filtering out stopwords lowers results
◦ Blair-Goldensohn et al., 2007 [29]
93
Word pairs experiments
Pitler et al., 2009 [31]
 Synthetic implicits: Cause/Contrast/None
◦ Explicit instances from Gigaword with connective deleted
◦ Because → Cause, But → Contrast
◦ At least 3 sentences apart → None [Blair-Goldensohn et al., 2007]
 Random selection
◦ 5,000 Cause
◦ 5,000 Other
 Computed information gain of word pairs
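Information gain for a single word-pair feature can be sketched as the drop in label entropy after splitting the examples on whether the feature is present. The four examples below are invented for illustration and far smaller than the 5,000 per class used in the experiment.

```python
import math
from collections import Counter

def entropy(labels):
    """Shannon entropy (in bits) of a list of class labels."""
    n = len(labels)
    return -sum(c / n * math.log2(c / n) for c in Counter(labels).values())

def information_gain(examples, feature):
    """examples: list of (set_of_features, label)."""
    labels = [y for _, y in examples]
    present = [y for feats, y in examples if feature in feats]
    absent = [y for feats, y in examples if feature not in feats]
    conditional = sum(len(part) / len(labels) * entropy(part)
                      for part in (present, absent) if part)
    return entropy(labels) - conditional

# Hypothetical toy data.
examples = [
    ({"but_because"}, "Cause"),
    ({"but_because"}, "Cause"),
    ({"the_a"}, "Other"),
    ({"the_a"}, "Other"),
]
gain_useful = information_gain(examples, "but_because")
gain_useless = information_gain(examples, "never_seen")
```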
94
Function words have highest
information gain
But…Didn’t we
remove the
connective?
95
“but” signals “Not-Comparison” in
synthetic data
 The government says it has reached most
isolated townships by now, but because
roads are blocked, getting anything but basic
food supplies to people remains difficult.
◦ but because → Contingency, not Comparison
96
Results: Word pairs
 Pairs of words from the two text spans
 What doesn’t work
◦ Training on synthetic implicits
 What really works
◦ Use synthetic implicits for feature selection
◦ Train on PDTB
97
Best Results: f-scores
Relation       F-score   (Baseline)
Comparison     21.96     (17.13)
Contingency    47.13     (31.10)
Expansion      76.41     (63.84)
Temporal       16.76     (16.21)
Comparison/Contingency baseline: synthetic implicits word pairs
Expansion/Temporal baseline: real implicits word pairs
98
Further experiments using context
 Results from classifying each relation independently
◦ Naïve Bayes, MaxEnt, AdaBoost
 Since context features were helpful, tried CRF
 6-way classification, word pairs as features
◦ Naïve Bayes accuracy: 43.27%
◦ CRF accuracy: 44.58%
99
Do we need more coherence factors?
Louis and Nenkova, 2010 [34]
 If we had perfect co-reference and discourse
relation information, would we be able to
explain local discourse coherence?
 Our recent corpus study indicates the answer
is NO
 30% of adjacent sentences in the same
paragraph in PDTB
◦ Neither share an entity nor have an implicit
comparison, contingency or temporal relation
 Lexical chains?
100
References
[1] Burstein, J. & Chodorow, M. (in press). Progress and new directions in technology for
automated essay evaluation. In R. Kaplan (Ed.), The Oxford handbook of applied linguistics
(2nd Ed.). New York: Oxford University Press.
[2] Heilman, M., Collins-Thompson, K., Callan, J., and Eskenazi, M. (2007). Combining Lexical
and Grammatical Features to Improve Readability Measures for First and Second Language
Texts. Proceedings of the Human Language Technology Conference. Rochester, NY.
[3] S. Petersen and M. Ostendorf, “A machine learning approach to reading level assessment,”
Computer, Speech and Language, vol. 23, no. 1, pp. 89-106, 2009
[4] Finding High Quality Content in Social Media, Eugene Agichtein, Carlos Castillo, Debora
Donato, Aristides Gionis, Gilad Mishne, ACM Web Search and Data Mining Conference
(WSDM), 2008
[5] Regina Barzilay and Lillian Lee, Catching the Drift: Probabilistic Content Models, with
Applications to Generation and Summarization, HLT-NAACL 2004: Proceedings of the Main
Conference, pp113—120, 2004
[6] Emily Pitler, Annie Louis and Ani Nenkova, Automatic Evaluation of Linguistic Quality in Multi-Document Summarization, Proceedings of ACL 2010
[7] Schwarm, S. E. and Ostendorf, M. 2005. Reading level assessment using support vector
machines and statistical language models. In Proceedings of ACL 2005.
[8] Jieun Chae, Ani Nenkova: Predicting the Fluency of Text with Shallow Structural Features:
Case Studies of Machine Translation and Human-Written Text. In proceedings of EACL
2009: 139-147
[9] Charniak, E. and Johnson, M. 2005. Coarse-to-fine n-best parsing and MaxEnt discriminative
reranking. In Proceedings of ACL 2005.
[10] K. Collins-Thompson and J. Callan. (2004). A language modeling approach to predicting
reading difficulty. Proceedings of HLT/NAACL 2004.
[11] Sarah E. Schwarm and Mari Ostendorf. Reading Level Assessment Using Support Vector
Machines and Statistical Language Models. In Proceedings of ACL, 2005.
[12] Automatically generating Wikipedia articles: A structure-aware approach, C. Sauper and R.
Barzilay, ACL-IJCNLP 2009
[13] Halliday, M. A. K., and Ruqaiya Hasan. 1976. Cohesion in English. London: Longman
[14] B. Grosz, A. Joshi, and S. Weinstein. 1995. Centering: a framework for modelling the local
coherence of discourse. Computational Linguistics, 21(2):203–226
[15] E. Miltsakaki and K. Kukich. 2000. The role of centering theory’s rough-shift in the teaching
and evaluation of writing skills. In Proceedings of ACL’00, pages 408–415.
[16] Karamanis, N., Mellish, C., Poesio, M., and Oberlander, J. 2009. Evaluating centering for
information ordering using corpora. Comput. Linguist. 35, 1 (Mar. 2009), 29-46.
[17] Regina Barzilay, Mirella Lapata, “Modeling Local Coherence: An Entity-based Approach”,
Computational Linguistics, 2008.
[18] Ani Nenkova, Kathleen McKeown: References to Named Entities: a Corpus Study. HLT-NAACL 2003
[19] Micha Elsner, Eugene Charniak: Coreference-inspired Coherence Modeling. ACL (Short
Papers) 2008: 41-44
[20] Morris, J. and Hirst, G. 1991. Lexical cohesion computed by thesaural relations as an
indicator of the structure of text. Comput. Linguist. 17, 1 (Mar. 1991), 21-48.
[21] Regina Barzilay and Michael Elhadad, “Text summarizations with lexical chains”, In
Inderjeet Mani and Mark Maybury, editors, Advances in Automatic Text Summarization. MIT
Press, 1999.
[22] Silber, H. G. and McCoy, K. F. 2002. Efficiently computed lexical chains as an intermediate
representation for automatic text summarization. Comput. Linguist. 28, 4 (Dec. 2002), 487-496.
[23] Mirella Lapata, Probabilistic Text Structuring: Experiments with Sentence Ordering,
Proceedings of ACL 2003.
[24] Discourse generation using utility-trained coherence models, R. Soricut & D. Marcu,
COLING-ACL 2006
[25] Emily Pitler and Ani Nenkova. Using Syntax to Disambiguate Explicit Discourse Connectives
in Text. Proceedings of ACL, short paper, 2009
[26] Radu Soricut and Daniel Marcu. 2003. Sentence Level Discourse Parsing using Syntactic
and Lexical Information. Proceedings of the Human Language Technology and North
American Association for Computational Linguistics Conference (HLT/NAACL-2003)
[27] Ben Wellner, James Pustejovsky, Catherine Havasi, Roser Sauri and Anna Rumshisky.
Classification of Discourse Coherence Relations: An Exploratory Study using Multiple
Knowledge Sources. In Proceedings of the 7th SIGDIAL Workshop on Discourse and
Dialogue
[28] Daniel Marcu and Abdessamad Echihabi (2002). An Unsupervised Approach to Recognizing
Discourse Relations. Proceedings of the 40th Annual Meeting of the Association for
Computational Linguistics (ACL-2002)
[29] Sasha Blair-Goldensohn, Kathleen McKeown, Owen Rambow: Building and Refining
Rhetorical-Semantic Relation Models. HLT-NAACL 2007: 428-435
[30] Sporleder, C. and Lascarides, A. 2008. Using automatically labelled examples to classify
rhetorical relations: An assessment. Nat. Lang. Eng. 14, 3 (Jul. 2008), 369-416.
[31] Emily Pitler, Annie Louis, and Ani Nenkova. Automatic Sense Prediction for Implicit
Discourse Relations in Text. Proceedings of ACL, 2009.
[32] Ziheng Lin, Min-Yen Kan and Hwee Tou Ng (2009). Recognizing Implicit Discourse
Relations in the Penn Discourse Treebank. In Proceedings of EMNLP
[33] Lapata, Mirella and Alex Lascarides. 2004. Inferring Sentence-internal Temporal Relations.
In Proceedings of the North American Chapter of the Association for Computational
Linguistics, 153-160.
[34] Annie Louis and Ani Nenkova, Creating Local Coherence: An Empirical Assessment,
Proceedings of NAACL-HLT 2010
106