Andy Chin - 2013 PNC

PNC2013
Kyoto University
December 10-11 2013
New Language Resources for Cantonese Linguistics Research:
A Linguistic Corpus of Mid-20th Century Hong Kong Cantonese
Andy C. Chin
The Hong Kong Institute of Education
andychin@ied.edu.hk
Outline
• Why “Cantonese”?
• Research on early Cantonese (19th to mid-20th C): diachronic development
• The corpus
– Source of data
– Demonstration of search engine
Cantonese in Hong Kong
Cantonese
• One of the dialects of the Chinese language family
• Although classified as a dialect, Cantonese serves as a lingua franca in Hong Kong, Macau and most parts of Guangdong Province, China
Use of Cantonese
“Cantonese” in early Hong Kong
• A fishing village
• Population in 1851: ~33,000
– Four major ethnic groups:
• Guangfu 廣府 (本地 'local')
• Danjia 蛋家 (seafaring people)
• Hakka 客家
• Min 閩語 (鶴佬 Hoklo / 潮州 Chiuchow)
• Their languages are mutually unintelligible
Given the long history of Cantonese in HK
• We are interested in understanding its development over the past 200 years
• Are there any differences between early
Cantonese and modern Cantonese?
• How can we capture these differences?
Diachronic studies of Cantonese
• Two approaches
– Apparent time approach
– Real time approach
Apparent time approach
• Age-stratified variation in a linguistic form is often indicative of a change in progress
– 75 vs. 50 vs. 25 y/o → changes over 50 years
– What about the language of 200 years ago?
– Language change: can we assume that a speaker still speaks the language of his or her time?
• If two speakers show no difference with respect to a linguistic feature, does it mean that there has been no change?
Real time approach
• Samples the population over an extended period of time (longitudinal study)
• Collects data produced in the period concerned
Limitations on Research in Cantonese
• Cantonese is a vernacular language
• Spoken data is needed
• Are there any records of early 19th-century Cantonese?
– Spoken data vs. written records
With these early materials,
• We are able to reconstruct the early stage
of the Cantonese language (about 200
years ago)
• Some of the linguistic features are very
different from those in modern Cantonese
Previous research on Cantonese
• Neutral Qs
• Directional complements
• Aspect markers
• Demonstratives
• Phonology
• Verb complement
• Comparative construction
• Lexicon (sociolinguistics)
• Dative verb GIVE
• Sentence-final particles
• Grammar of the late Qing period
• …
Furthermore,
• Some linguistic changes took place or were completed around the mid-20th century
– Dative marker: 過 → 畀 (送本書過/畀佢 'give him/her a book')
– Neutral Q: 你去睇戲唔呀 → 你去唔去睇戲呀 ('Are you going to see a movie?')
– …
• New and old features might co-exist in the mid-20th C
Timeline: Morrison (1828) → Chao (1947): 120 years; Chao (1947) → 2013: ~66 years
Existing Cantonese corpora
1. The Hong Kong Cantonese Child Language Corpus
2. The Hong Kong Bilingual Child Language Corpus
3. Hong Kong Cantonese Corpus
4. The Hong Kong Cantonese Adult Language Corpus
5. 19th Century Cantonese Corpus
Source of corpus data
– Real time vs. apparent time
– Naturally occurring data
– HK Cantonese movies (粵語長片, Cantonese feature films)
http://corpus.ied.edu.hk/hkcc/
HK Movie Industry in mid-20th C.
Year          No. of Cantonese movies   No. of PTH movies
1952 - 1955   627                       222
1956 - 1960   963                       314
1961 - 1965   928                       206
1966 - 1970   361                       286
Total         2879                      1028
Source of data: Chung (2004: 177)
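The figures in this table can be checked and compared with a few lines of Python. The sketch below only re-enters the numbers shown on the slide; the variable and function names are illustrative, and "PTH" is taken to mean Putonghua, which is an assumption.

```python
# Movie counts per period, copied from the slide (Chung 2004: 177).
# Each value is (Cantonese movies, PTH movies); PTH is assumed to mean Putonghua.
counts = {
    "1952-1955": (627, 222),
    "1956-1960": (963, 314),
    "1961-1965": (928, 206),
    "1966-1970": (361, 286),
}


def cantonese_share(cantonese, pth):
    """Proportion of Cantonese movies among Cantonese + PTH movies."""
    return cantonese / (cantonese + pth)


# Column totals, which should agree with the "Total" row of the table.
total_cantonese = sum(c for c, _ in counts.values())
total_pth = sum(p for _, p in counts.values())

for period, (c, p) in counts.items():
    print(f"{period}: {cantonese_share(c, p):.1%} Cantonese")
print(f"Total: {total_cantonese} Cantonese, {total_pth} PTH")
```

The totals reproduce the last row of the table, and the per-period shares make the drop in Cantonese output in 1966 - 1970 easier to see than the raw counts do.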
About the corpus
• 21 movies have been transcribed in Chinese characters (~200k characters)
• Word segmentation
• Search engine (14 movies, since Apr 2012)
– http://corpus.ied.edu.hk/hkcc/
– 350+ registered users
Search criteria
• Characters or words (segmented units)
• Cantonese pronunciation
• Movie names
• Names of speakers
• Gender of speakers
• …
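To make the criteria above concrete, here is a minimal sketch of how a segmented transcript could be filtered by word, movie, speaker and gender. It is not the corpus's actual search engine: the `Utterance` record, the `search` function, the speaker labels, the gender values and the word segmentation are all hypothetical, and the pronunciation criterion is left out. Only the movie title and the sentences come from the slides.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Utterance:
    movie: str        # movie name
    speaker: str      # speaker label (placeholder values below)
    gender: str       # "F" or "M" (placeholder values below)
    words: List[str]  # segmented units (segmentation here is illustrative)


def search(utterances, word=None, movie=None, speaker=None, gender=None):
    """Return the utterances that satisfy every criterion that is not None."""
    hits = []
    for u in utterances:
        if word is not None and word not in u.words:
            continue
        if movie is not None and u.movie != movie:
            continue
        if speaker is not None and u.speaker != speaker:
            continue
        if gender is not None and u.gender != gender:
            continue
        hits.append(u)
    return hits


# Toy data: sentences from the slides; speakers, genders and segmentation are invented.
corpus = [
    Utterance("契爺艷史", "Speaker A", "M", ["你", "位", "千金", "有", "讀書", "冇", "呀"]),
    Utterance("契爺艷史", "Speaker B", "F", ["重", "要", "畀", "錢", "過", "人"]),
    Utterance("契爺艷史", "Speaker A", "M", ["咪", "可以", "快啲", "還清", "啲", "債", "畀", "人"]),
]

for hit in search(corpus, word="畀"):
    print(hit.movie, hit.speaker, "".join(hit.words))
```

Matching on segmented units rather than raw characters is what lets a query for 畀 as a word avoid hits where the character is merely part of a longer unit.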
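契爺艷史 (1952)
• Yes-No question
– VP-Neg: 你位千金有讀書冇呀? ('Has your daughter been to school?')
– V-Neg-VO: 呢道係咪有位黃小姐? ('Is there a Miss Wong here?')
• Dative marker
– 重要畀錢過人? ('And we still have to give money to someone?')
– 咪可以快啲還清啲債畀人? ('Then couldn't we pay off the debts to people sooner?')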
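The two yes-no question patterns can be told apart mechanically in simple cases. The sketch below is a rough regular-expression heuristic, not the method used for the corpus: the function name `classify_question` and both patterns are assumptions, it only handles single-character verbs repeated around 唔 plus the fused form 係咪, and it ignores word segmentation entirely.

```python
import re

# V-Neg-V: a character repeated around 唔 (e.g. 去唔去), or the fused form 係咪.
# Other fused forms are not covered in this sketch.
V_NEG_V = re.compile(r"(.)唔\1|係咪")

# VP-Neg: the negator 唔 or 冇 at the end of the sentence, optionally
# followed by the particle 呀 and a question mark.
VP_NEG = re.compile(r"(唔|冇)呀?[?？]?$")


def classify_question(sentence):
    """Label a sentence as 'V-Neg-V', 'VP-Neg' or 'other' (heuristic only)."""
    if V_NEG_V.search(sentence):
        return "V-Neg-V"
    if VP_NEG.search(sentence):
        return "VP-Neg"
    return "other"


examples = [
    "你去睇戲唔呀",
    "你去唔去睇戲呀",
    "你位千金有讀書冇呀?",
    "呢道係咪有位黃小姐?",
    "重要畀錢過人?",
]
for s in examples:
    print(classify_question(s), s)
```

A heuristic like this would over- and under-match on real transcripts (for example, disyllabic verbs or a sentence-final 唔 that is not a question marker), so its output would need manual checking before counting old and new forms across movies.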
Some challenges
• Quality of speech
• Overlap of speech
• Representations of colloquial vocabulary
• Parts-of-speech: How many types?
• Discourse features
• …
Acknowledgments
• ECS research grants, RGC:
– Linguistic Analysis of Mid-20th Century Hong Kong Cantonese by Constructing an Annotated Spoken Corpus (2013/2015)
• HKIEd Internal Research Grants:
– RG41/2010-2011: Spoken Corpus Construction and Linguistic Analysis of Mid-20th Century Cantonese
– RG62/12-13R: A Preliminary Linguistic Analysis of Mid-20th Century Cantonese from a Corpus-based Approach
Demonstration
• http://corpus.ied.edu.hk/hkcc/