1
Dr. T. Florian Jaeger
My father
My friends who have voluntarily given me their Chinglish essays
People at HLP lab
2
1) Meanwhile, Brent crude hit an all-time peak of $112.73 before falling back.
2) Prices initially rose when the report was released with traders reacting to news that inventories were lower than expected.
3) US light, sweet crude oil rose to a fresh high of $114.95 before slipping back to $112.63.
3
US light, sweet crude oil rose to a fresh high of $114.95 before slipping back to $112.63.
Meanwhile, Brent crude hit an all-time peak of $112.73 before falling back.
Prices initially rose when the report was released with traders reacting to news that inventories were lower than expected.
4
If humans try to communicate in the most efficient way, they should produce language as follows:
Action
Put less information into words or sentences that have little prior context, and more into those that come later
Goal
Keep the flow of information uniform
Humans are viewed as rational agents who optimize the flow of information in language production
5
Uniform Information Density (UID)
6
An engineering perspective
The most efficient way of communicating through a noisy channel is to send information at a constant rate (Shannon 1948).
7
No good models of the information of a sentence in context exist
Methods from natural language processing provide reasonably good estimates of out-of-context information of sentences
8
Intuitively, less contextual information is available at the beginning of a discourse.
If speakers/writers communicate efficiently, early sentences should be made more predictable (easier for listeners).
The out-of-context information at the beginning of a discourse should be lower than later in the discourse.
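The reasoning on this slide can be written out with a standard information-theoretic identity. The symbols are introduced here for illustration and do not come from the slides: X_i is the i-th sentence of a discourse, C_i is the discourse context that precedes it, H is entropy, and I is mutual information.

```latex
\begin{align*}
H(X_i) &= H(X_i \mid C_i) + I(X_i ; C_i) \\
H(X_i \mid C_i) &\approx \text{const (UID)}, \quad
I(X_i ; C_i) \text{ grows with } i
\;\Longrightarrow\; H(X_i) \text{ increases with } i
\end{align*}
```

The left-hand side is the out-of-context information, which is the quantity an n-gram model can estimate. If the in-context term stays roughly constant while context accumulates, the out-of-context term has to rise with sentence position.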
9
10
Genzel & Charniak (2002) provided evidence for the hypothesis of uniform information by analyzing English discourse.
They found that:
◦ The information of sentences increases with sentence number in a discourse.
◦ The increase is due to both lexical (what words are used) and non-lexical (how words are used) factors.
11
Evaluate UID on Chinese written corpora by measuring information content.
Evaluate UID on a Chinese English (Chinglish) corpus
Ultimately: why is Chinese English harder to understand for native English speakers, but relatively easy for native Chinese speakers?
12
13
Four corpora are used
◦ XIN – Beijing Xinhua News
◦ SINO – Taiwan Sinorama Magazine
◦ HK – Hong Kong News (too little data)
◦ VOA – Voice of America Chinese News
We build n-gram language models to measure the (un)predictability of written Chinese sentences.
14
二十 年 前 , 许多 中国 家庭 的 梦想 是 拥有 一 部 电话 。
Twenty year ago, many Chinese family ‘s dream is have a piece telephone.
Trigrams
二十 年 前
年 前 ,
前 , 许多
, 许多 中国
部 电话 。
P( 二十 年 前 ) = 0.1%
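The trigram idea on this slide can be sketched in a few lines of Python. This is a minimal illustration, not the language models used in the study: the function names (`trigrams`, `train`, `sentence_information`), the add-one smoothing, and the one-sentence toy training set are all assumptions made for the example.

```python
import math
from collections import Counter

def trigrams(tokens):
    """Return all consecutive word triples of a segmented sentence."""
    return [tuple(tokens[i:i + 3]) for i in range(len(tokens) - 2)]

def train(sentences):
    """Count trigrams and their two-word contexts over segmented sentences."""
    tri, ctx, vocab = Counter(), Counter(), set()
    for tokens in sentences:
        vocab.update(tokens)
        for t in trigrams(tokens):
            tri[t] += 1
            ctx[t[:2]] += 1
    return tri, ctx, vocab

def sentence_information(tokens, tri, ctx, vocab):
    """Mean of -log2 P(w3 | w1, w2) over the sentence, in bits.

    Probabilities use add-one smoothing (an assumption for this sketch).
    """
    grams = trigrams(tokens)
    if not grams:
        return 0.0
    total = 0.0
    for t in grams:
        p = (tri[t] + 1) / (ctx[t[:2]] + len(vocab))
        total += -math.log2(p)
    return total / len(grams)

# Toy usage: train on the example sentence alone (hypothetical data).
sentence = "二十 年 前 , 许多 中国 家庭 的 梦想 是 拥有 一 部 电话 。".split()
tri, ctx, vocab = train([sentence])
info = sentence_information(sentence, tri, ctx, vocab)
```

A real model would be trained on a large corpus and would use a stronger smoothing method; the sketch only shows how per-sentence information follows from trigram probabilities.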
15
Lexicalized part-of-speech n-gram
二十_CD 年_M 前_LC ,_PU 许多_CD 中国_NR 家庭_NN 的_DEG 梦想_NN 是_VC 拥有_VV 一_CD 部_M 电话_NN 。_PU
16
With respect to an entire document
◦ Sentence effect in a document
◦ Paragraph effect in a document
17
18
19
With respect to the immediate containing domain of the linguistic unit in question.
Predictors
1. Sentence position in paragraph
2. Paragraph position in document
3. Word position in sentence
Multiple regression on the above three predictors
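A multiple regression on the three predictors above could be set up as in the sketch below. It assumes NumPy and ordinary least squares; the function name `fit_position_effects` and the toy numbers are hypothetical and are not data or results from the study.

```python
import numpy as np

def fit_position_effects(sent_pos, para_pos, word_pos, information):
    """Ordinary least squares fit of
    information ~ intercept + sentence pos + paragraph pos + word pos.

    Returns the coefficients in the order
    [intercept, sentence, paragraph, word].
    """
    X = np.column_stack([
        np.ones(len(information)),
        sent_pos,
        para_pos,
        word_pos,
    ]).astype(float)
    y = np.asarray(information, dtype=float)
    coef, *_ = np.linalg.lstsq(X, y, rcond=None)
    return coef

# Synthetic toy data (made up for illustration): information is generated
# from known coefficients, so the fit should recover them.
sent_pos = np.array([1, 2, 3, 1, 2, 3, 1, 2])
para_pos = np.array([1, 1, 1, 2, 2, 2, 3, 3])
word_pos = np.array([1, 2, 1, 3, 2, 4, 1, 3])
information = 5.0 + 0.2 * sent_pos + 0.1 * para_pos + 0.05 * word_pos

coef = fit_position_effects(sent_pos, para_pos, word_pos, information)
```

Under UID, positive coefficients on the position predictors would indicate that out-of-context information rises with position in the containing unit.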
20
21
Information goes up and converges (after removal of early words)
Limited amount of context information available.
22
二十 年 前 , 许多 中国 家庭 的 梦想 是 拥有 一 部 电话 。
Twenty year ago, many Chinese family ‘s dream is have a piece telephone.
Trigrams
二十 年 前
年 前 ,
前 , 许多
, 许多 中国
部 电话 。
23
We replicated Genzel & Charniak’s study on Chinese corpora.
◦ No sentence effect within documents was found.
◦ However:
The paragraph effect within documents is consistent with UID.
A sentence effect within paragraphs was also found.
Due to the size of data, effects are observable only early in discourse (viable cut-offs are low).
24
We are the first to look at the effect of word position within sentences.
◦ Information content increases with word position.
◦ Context estimation leads to early convergence.
Does the increase of information occur only locally in Chinese?
◦ Current data seem to support this idea.
25
Writing style? Could be.
◦ Chinese – Summarization & Expansion
◦ English – Narrative style
26
A collection of English essays written by native Chinese speakers.
◦ Corpus of English as a Second Language (CESL)
We trained a language model on the Brown Corpus (American English) and used it to measure the information content of Chinese English sentences.
27
XIN: - p<0.001*** CESL: - p=0.0167 *
28
The average information content is much higher in Chinese English (8.2~8.4) than in Chinese (4.5~5.0).
It is also higher than the information content of English, which converges at 7.0 bits (Piantadosi, CUNY 2008).
29
Chinese, English, and Chinglish
◦ Globally, Chinglish essays also fail to exhibit the information distribution predicted by UID.
◦ Further studies needed to discover more properties of Chinglish.
Possible reasons that explain why Chinglish is harder to understand
◦ Higher information content
◦ Again, writing style
30
31