Human Language Processing Lab

Brain and Cognitive Sciences

Ting Qian

1

Dr. T. Florian Jaeger

My father

My friends who have voluntarily given me their Chinglish essays

People at HLP lab

2

1)

Meanwhile, Brent crude hit an all-time peak of $112.73 before falling back.

2)

Prices initially rose when the report was released with traders reacting to news that inventories were lower than expected.

3)

US light, sweet crude oil rose to a fresh high of $114.95 before slipping back to $112.63.

3

US light, sweet crude oil rose to a fresh high of $114.95 before slipping back to $112.63.

Meanwhile, Brent crude hit an all-time peak of $112.73 before falling back.

Prices initially rose when the report was released with traders reacting to news that inventories were lower than expected.

4

If humans try to communicate in the most efficient way, they should produce language:

Action: by putting less information into words or sentences with little prior context, and more later on

Goal: to ensure the increase of information is uniform

Humans as rational agents who optimize the flow of information in language production

5

Uniform Information Density (UID)

6

An engineering perspective

The most efficient way of communicating through a noisy channel is to send information at a constant rate (Information Theory, Shannon 1948).

7

No good models of the information of a sentence in context exist

Methods from natural language processing provide reasonably good estimates of out-of-context information of sentences

8

Intuitively, less contextual information is available at the beginning of a discourse.

If speakers/writers communicate efficiently, early sentences should be made more predictable (easier for listeners).

The out-of-context information at the beginning of a discourse should be lower than later in the discourse.
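
The step from the first two points to the third can be written out with a standard entropy identity. This is a simplified sketch of the kind of argument made in Genzel & Charniak (2002), not a formula from these slides; the symbols are assumptions introduced here: X_i is the i-th sentence of a discourse and C_i is all the context that precedes it.

```latex
% In-context information = out-of-context information minus what the context already tells us
H(X_i \mid C_i) = H(X_i) - I(X_i ; C_i)
```

If the in-context term on the left stays roughly constant across i (the UID assumption), and the mutual information I(X_i; C_i) grows as context accumulates (an assumption, since little context exists early on), then the out-of-context term H(X_i) has to grow with i as well.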

9

10

Genzel & Charniak (2002) provided evidence for the hypothesis of uniform information by analyzing English discourse.

They found that:

◦ The information of sentences increases with sentence number in a discourse.

◦ The effect of increase is due to both lexical (what words are used) and non-lexical (how words are used) factors.

11

Evaluate UID on Chinese written corpora by measuring information content.

Evaluate UID on a Chinese English (Chinglish) corpus

Ultimately: why is Chinese English harder to understand for native English speakers, but relatively easy for native Chinese speakers?

12

13

Four corpora are used

◦ XIN – Beijing Xinhua News

◦ SINO – Taiwan Sinorama Magazine

◦ HK – Hong Kong News (too little data)

◦ VOA – Voice of America Chinese News

We build n-gram language models to measure the (un)predictability of written Chinese sentences.

14

二十 年 前 , 许多 中国 家庭 的 梦想 是 拥有 一 部 电话 。

Twenty year ago, many Chinese family ‘s dream is have a piece telephone.

Trigrams

二十 年 前

年 前 ,

前 , 许多

, 许多 中国

部 电话 。

P( 二十 年 前 ) = 0.1%
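
The trigram idea above can be turned into a per-word information estimate. The following is a minimal sketch, not the model used in this study: it assumes add-one smoothing, hypothetical `<s>` and `</s>` padding symbols, and a toy training set made of the single example sentence. All function names are illustrative.

```python
import math
from collections import Counter

def ngrams(tokens, n=3):
    """Return every run of n consecutive tokens as a tuple."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

def pad(tokens):
    """Add hypothetical boundary symbols so the first words have a context."""
    return ["<s>", "<s>"] + tokens + ["</s>"]

def train(sentences):
    """Count trigrams and their two-word contexts over segmented sentences."""
    tri, ctx, vocab = Counter(), Counter(), set()
    for tokens in sentences:
        for gram in ngrams(pad(tokens)):
            tri[gram] += 1
            ctx[gram[:2]] += 1
            vocab.add(gram[2])
    return tri, ctx, vocab

def bits_per_word(tokens, model):
    """Average surprisal (-log2 P) per predicted token, add-one smoothed."""
    tri, ctx, vocab = model
    v = len(vocab) + 1  # one extra slot for unseen words (an assumption)
    grams = ngrams(pad(tokens))
    total = 0.0
    for gram in grams:
        p = (tri[gram] + 1) / (ctx[gram[:2]] + v)
        total += -math.log2(p)
    return total / len(grams)

# The word-segmented example sentence from the slide.
sentence = "二十 年 前 , 许多 中国 家庭 的 梦想 是 拥有 一 部 电话 。".split()
model = train([sentence])
print(ngrams(sentence)[0], ngrams(sentence)[-1])
print(bits_per_word(sentence, model))
```

Averaging over tokens keeps long and short sentences comparable. A real model would be trained on a full corpus with a better smoothing method; add-one smoothing is used here only because it is short.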

15

Lexicalized part-of-speech n-gram

二十_CD 年_M 前_LC ,_PU 许多_CD 中国_NR 家庭_NN 的_DEG 梦想_NN 是_VC 拥有_VV 一_CD 部_M 电话_NN 。_PU

16

With respect to an entire document

◦ Sentence effect in a document

◦ Paragraph effect in a document

17

18

19

With respect to the immediate containing domain of the linguistic unit in question.

Predictors

1. Sentence position in paragraph

2. Paragraph position in document

3. Word position in sentence

Multiple regression on the above three predictors
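
A multiple regression of this shape can be sketched in a few lines. The slides do not say which software or exact model specification was used, so this is only an illustration under stated assumptions: ordinary least squares with an intercept, solved through the normal equations. The function name `ols` and the toy numbers are hypothetical; the data are synthetic and noise-free, made up only to exercise the code, and are not results from the study.

```python
def ols(rows, y):
    """Ordinary least squares via the normal equations (X'X) b = X'y.

    rows: one tuple of predictor values per observation.
    Returns [intercept, b1, b2, ...].
    """
    X = [[1.0] + [float(v) for v in r] for r in rows]  # add intercept column
    k = len(X[0])
    # Augmented matrix [X'X | X'y].
    A = [[sum(x[i] * x[j] for x in X) for j in range(k)]
         + [sum(x[i] * yi for x, yi in zip(X, y))]
         for i in range(k)]
    # Gauss-Jordan elimination with partial pivoting.
    for col in range(k):
        pivot = max(range(col, k), key=lambda r: abs(A[r][col]))
        if abs(A[pivot][col]) < 1e-12:
            raise ValueError("predictors are collinear")
        A[col], A[pivot] = A[pivot], A[col]
        scale = A[col][col]
        A[col] = [v / scale for v in A[col]]
        for r in range(k):
            if r != col:
                factor = A[r][col]
                A[r] = [a - factor * b for a, b in zip(A[r], A[col])]
    return [A[i][k] for i in range(k)]

# Synthetic predictors: (sentence position in paragraph,
#                        paragraph position in document,
#                        word position in sentence).
rows = [(s, p, w) for s in range(1, 4) for p in range(1, 4) for w in range(1, 4)]
# Synthetic response built from arbitrary coefficients, no noise.
y = [4.0 + 0.3 * s + 0.2 * p + 0.1 * w for s, p, w in rows]

coefs = ols(rows, y)
print(coefs)
```

Fitting all three predictors together estimates each position effect while holding the other two fixed, which is the reason for using one multiple regression rather than three separate ones.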

20

21

Information goes up and converges (after removal of early words)

Limited amount of context information available.

22

二十 年 前 , 许多 中国 家庭 的 梦想 是 拥有 一 部 电话 。

Twenty year ago, many Chinese family ‘s dream is have a piece telephone.

Trigrams

二十 年 前

年 前 ,

前 , 许多

, 许多 中国

部 电话 。

23

We replicated Genzel & Charniak’s study on Chinese corpora.

◦ Sentence effect within documents is not found.

◦ However:

▪ Paragraph effect within documents is consistent with UID.

▪ Sentence effect within paragraphs is also found.

Because the data set is small, effects are observable only early in the discourse (viable cut-offs are low).

24

We are the first to look at the effect of word position within sentences.

◦ Information content increases with word position.

◦ Context estimation leads to early convergence.

Does the increase of information occur only locally in Chinese?

◦ Current data seem to support this idea.

25

Writing style? Could be.

◦ Chinese – Summarization & Expansion

◦ English – Narrative style

26

A collection of English essays written by native Chinese speakers.

◦ Corpus of English as a Second Language (CESL)

We trained a language model on the Brown Corpus (American English) and used the model to measure the information content of Chinese English sentences.

27

XIN: - p<0.001*** CESL: - p=0.0167 *

28

The average information content is much higher in Chinese English (8.2 to 8.4) than in Chinese (4.5 to 5.0).

It is also higher than the information content of English, which converges at 7.0 bits (Piantadosi, CUNY 2008).

29

Chinese, English, and Chinglish

◦ Globally, Chinglish essays also fail to exhibit the information distribution predicted by UID.

◦ Further studies needed to discover more properties of Chinglish.

Possible reasons that explain why Chinglish is harder to understand

◦ Higher information content

◦ Again, writing style

30

Questions?

31
