homework

Some Background:

These sentences are sample alternative outputs from a translation system. The system often has to choose one alternative to present as the “correct” answer, but as you can see, the options are a complex web of probabilities and alignments.

The given probabilities combine two factors: a translation model and a language model. The translation model estimates the probability that a Spanish word translates to a given English word, independent of context.

The language model calculates probabilities of English words irrespective of the Spanish – it assigns higher probabilities to sequences that sound better in English.

As a result, the same word-to-word translation can have different probabilities in two sentences, because the overall probability depends on context.

For example, the English sequence “in the world” is more likely than “about the world”, so the word “world” will have a higher probability in the first sequence than in the second, based on the language model component.
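The effect of the language model can be illustrated with a small sketch. All numbers below are hypothetical, and combining the two models by simple multiplication is an assumption made for illustration; the actual system may combine them differently.

```python
# Hypothetical values, chosen only to illustrate the idea.

# Translation model: probability that "mundo" translates to "world".
# It is context independent, so it is the same in both sentences.
tm_mundo_world = 0.95

# Language model: probability of "world" given the two preceding English words.
# It is context dependent, so it differs between the two sequences.
lm_world_given = {
    ("in", "the"): 0.20,
    ("about", "the"): 0.05,
}

def combined_score(tm_prob, lm_prob):
    """Assumed combination: the product of the two model probabilities."""
    return tm_prob * lm_prob

score_in = combined_score(tm_mundo_world, lm_world_given[("in", "the")])
score_about = combined_score(tm_mundo_world, lm_world_given[("about", "the")])

# The translation model term is identical, so the difference comes
# entirely from the language model.
print(score_in > score_about)
```

Under these assumed values, “world” scores higher after “in the” than after “about the”, even though the word-to-word translation is the same.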

Unannotated data:

El español es la lengua más hablada del mundo tras el chino mandarín por el número de hablantes que la tienen como lengua materna.

Spanish is the language most spoken about the world following the Chinese Mandarin for the number of speakers who take it as a mother tongue.

Spanish is the most spoken language in the world after Chinese Mandarin by the number of speakers that have as their mother tongue.

The Spanish is the language more spoken of the world after the Chinese Mandarin by the number of speaking has that is like maternal language.

The Spanish is her tongue further talked of the humanity behind the Chinese she’d send by the numeral of they talk which her has as tongue lying-in hospital.

Spanish is the language most spoken about the world after the Chinese Mandarin by the number of speakers who take it as a mother tongue.

Annotated data:

El 1 español 2 es 3 la 4 lengua 5 más 6 hablada 7 del 8 mundo 9 tras 10 el 11 chino 12 mandarín 13 por 14 el 15 número 16 de 17 hablantes 18 que 19 la 20 tienen 21 como 22 lengua 23 materna 24 .

Spanish 1,2 (0.90) is 3 (0.90) the 4 (0.94) language 5 (0.39) most 6 (0.25) spoken 7 (0.52) [about the] 8 (0.30) world 9 (0.93) following 10 (0.64) the 11 (0.72) Chinese 12 (0.87) Mandarin 13 (0.25) for 14 (0.24) the 15 (0.77) number 16 (0.79) of 17 (0.99) speakers 18 (0.82) who 19 (0.80) take 21 (0.21) it 20 (0.96) [as a] 22 (0.46) mother 24 (0.35) tongue 23 (0.45)

Spanish 1,2 (0.90) is 3 (0.90) the 4 (0.83) most 6 (0.55) spoken 7 (0.73) language 5 (0.44) [in the] 8 (0.89) world 9 (0.94) after 10 (0.73) Chinese 11,12 (0.80) Mandarin 13 (0.41) by 14 (0.41) the 15 (0.79) number 16 (0.94) of 17 (0.75) speakers 18 (0.84) that 19 (0.62) have 21 (0.22) as 22 (0.40) their 20 (0.10) mother 24 (0.37) tongue 23 (0.60) .

The 1 (0.90) Spanish 2 (0.90) is 3 (0.90) the 4 (0.94) language 5 (0.38) more 6 (0.22) spoken 7 (0.46) [of the] 8 (0.82) world 9 (0.93) after 10 (0.67) the 11 (0.70) Chinese 12 (0.92) Mandarin 13 (0.32) by 14 (0.33) the 15 (0.79) number 16 (0.94) of 17 (0.50) speaking 18 (0.44) has 21 (0.21) that 19 (0.91) it 20 (0.19) like 22 (0.75) maternal 24 (0.98) language 23 (0.30) .

The 1 (0.90) Spanish 2 (0.91) is 3 (0.87) her 4 (0.52) tongue 5 (0.65) further 6 (0.36) talked 7 (0.73) [of the] 8 (0.52) humanity 9 (0.30) behind 10 (0.19) the 11 (0.84) Chinese 12 (0.99) [she'd send] 13 (0.03) by 14 (0.37) the 15 (0.83) numeral 16 (0.82) of 17 (0.92) [they talk] 18 (0.11) which 19 (0.67) her 20 (0.33) has 21 (0.52) as 22 (0.25) tongue 23 (0.30) [lying-in hospital] 24 (0.16) .

Spanish 1,2 (0.90) is 3 (0.90) the 4 (0.94) language 5 (0.39) most 6 (0.25) spoken 7 (0.52) [about the] 8 (0.30) world 9 (0.93) after 10 (0.67) the 11 (0.69) Chinese 12 (0.92) Mandarin 13 (0.32) by 14 (0.33) the 15 (0.79) number 16 (0.94) of 17 (0.50) speakers 18 (0.82) who 19 (0.80) take 21 (0.21) it 20 (0.96) [as a] 22 (0.46) mother 24 (0.35) tongue 23 (0.25) .

Key:

Superscripts (Spanish): Word number

Superscripts (English): Corresponding Spanish word(s); numbers in parentheses are confidence scores (estimated probability that the indicated Spanish words translate correctly to the given English words).
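To work with the annotated data programmatically, the notation in the key can be read with a short parser. This is a minimal sketch: the function name `parse_annotated` and the regular expression are illustrative assumptions, not part of the assignment, and the sample string is the opening of the second annotated sentence.

```python
import re

# One annotated token: an English word or a [bracketed phrase], then the
# aligned Spanish word number(s), then the confidence score in parentheses.
TOKEN = re.compile(r"(\[[^\]]+\]|\S+)\s+(\d+(?:,\d+)*)\s+\((\d*\.?\d+)\)")

def parse_annotated(line):
    """Return a list of (english, spanish_word_numbers, confidence) tuples."""
    return [
        (
            m.group(1).strip("[]"),                    # drop phrase brackets
            [int(n) for n in m.group(2).split(",")],   # e.g. "1,2" -> [1, 2]
            float(m.group(3)),
        )
        for m in TOKEN.finditer(line)
    ]

sample = (
    "Spanish 1,2 (0.90) is 3 (0.90) the 4 (0.83) most 6 (0.55) "
    "spoken 7 (0.73) language 5 (0.44) [in the] 8 (0.89) world 9 (0.94)"
)
tokens = parse_annotated(sample)

# List the words whose confidence falls below an arbitrary 0.5 threshold.
low_confidence = [english for english, _, conf in tokens if conf < 0.5]
print(low_confidence)
```

The 0.5 threshold is arbitrary; the same parsed tuples could also be used to compare alternatives word by word, for example the confidence assigned to Spanish word 8 across the candidate sentences.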
