PowerPoint slides

The ‘London Corpora’ projects
- the benefits of hindsight: some lessons for diachronic corpus design
Sean Wallis
Survey of English Usage
University College London
s.wallis@ucl.ac.uk
Motivating questions
• What is meant by the phrase ‘a balanced
corpus’?
– How do sampling decisions made by corpus builders
affect the type of research questions that may be
asked of the data?
• Reviewing ICE-GB and DCPSE:
– Should the data have been more sociolinguistically representative, by social class and region?
– Should texts have been stratified: sampled so that
speakers of all categories of gender and age were
(equally) represented in each genre?
ICE-GB
• British Component of ICE (the International Corpus of English)
• Corpus of speech and writing (1990-1992)
– 60% spoken, 40% written; 1 million words; orthographically
transcribed speech, marked up, tagged and fully parsed
• Sampling principles
– International sampling scheme, including broad range of
spoken and written categories
– But:
• Adults who had completed secondary education
• ‘British corpus’ geographically limited
– speakers mostly from London / SE UK (or sampled there)
DCPSE
• Diachronic Corpus of Present-day Spoken
English (late 1950s - early 1990s)
– 800,000 words (nominal)
– London-Lund component annotated as ICE-GB
• orthographically transcribed and fully parsed
• Created from subsamples of LLC and ICE-GB
– Matching numbers of texts in text categories
– Not sampled over equal duration
• LLC (1958-1977)
• ICE-GB (1990-1992)
– Text passages in LLC larger than ICE-GB
• LLC (5,000 words)
• ICE-GB (2,000 words)
• But text passages may include subtexts
– telephone calls and newspaper articles are frequently short
DCPSE
• Representative?
– Text categories of unequal size
– Broad range of text types sampled
– Not balanced by speaker demography
text category               LLC (1960s)       ICE-GB (1990s)    TOTAL
formal face-to-face         46,291 (51)       39,201 (58)       85,492 (109)
informal face-to-face       207,852 (146)     176,244 (398)     384,096 (544)
telephone conversations     25,645 (110)      19,455 (30)       45,100 (140)
broadcast discussions       43,620 (47)       42,002 (101)      85,622 (148)
broadcast interviews        20,359 (12)       21,385 (26)       41,744 (38)
spontaneous commentary      45,765 (50)       48,539 (60)       94,304 (110)
parliamentary language      10,081 (14)       10,226 (58)       20,307 (72)
legal cross-examination     5,089 (4)         4,249 (5)         9,338 (9)
assorted spontaneous        10,111 (8)        10,767 (5)        20,878 (13)
prepared speech             30,564 (14)       32,180 (71)       62,744 (85)
TOTAL                       445,377 (450)     404,248 (818)     849,625 (1,268)
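The point that text categories are of unequal size, yet roughly matched between LLC and ICE-GB, can be checked directly from the word counts in the table above. The following is only a minimal sketch: the dictionary and function names are illustrative choices, and the bracketed counts from the table are left out.

```python
# Word counts per text category, copied from the DCPSE table above.
LLC = {
    "formal face-to-face": 46_291,
    "informal face-to-face": 207_852,
    "telephone conversations": 25_645,
    "broadcast discussions": 43_620,
    "broadcast interviews": 20_359,
    "spontaneous commentary": 45_765,
    "parliamentary language": 10_081,
    "legal cross-examination": 5_089,
    "assorted spontaneous": 10_111,
    "prepared speech": 30_564,
}
ICE_GB = {
    "formal face-to-face": 39_201,
    "informal face-to-face": 176_244,
    "telephone conversations": 19_455,
    "broadcast discussions": 42_002,
    "broadcast interviews": 21_385,
    "spontaneous commentary": 48_539,
    "parliamentary language": 10_226,
    "legal cross-examination": 4_249,
    "assorted spontaneous": 10_767,
    "prepared speech": 32_180,
}


def shares(counts):
    """Return each category's proportion of the subcorpus word total."""
    total = sum(counts.values())
    return {category: n / total for category, n in counts.items()}


llc_shares = shares(LLC)
ice_shares = shares(ICE_GB)

# Print the two proportions side by side for each text category.
for category in LLC:
    print(f"{category:26s} LLC {llc_shares[category]:.3f}   ICE-GB {ice_shares[category]:.3f}")
```

Informal face-to-face conversation makes up a little under half of each subcorpus, while legal cross-examination accounts for about one per cent, so any unweighted corpus-wide figure is dominated by conversation.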
A balanced corpus?
• Corpora are reusable experimental datasets
– Data collection (sampling) should avoid limiting
future research goals
– Samples should be representative
• What are they representative of?
• Quantity vs. quality
– Large/lighter annotation vs. small/richer
– Are larger corpora more (easily) representative?
• Problems for historical corpora
– Can we add samples to make the corpus more
representative?
“Representativeness”
• Do we mean representative...
– of the language? (“random sample”)
• A sample in the corpus is a genuine random sample
of the type of text in the language
– of text types? (“broad”)
• Effort made to include examples of all types of
language (“text types”), including speech contexts
– of speaker types? (“stratified”)
• Sampling decisions made to include equal numbers (by
gender, age, geography, etc.) of participants in each text
category
• Should subdivide data independently (stratification)
Stratified sampling
• Ideal
– Corpus independently
subdivided by each variable
– Equal subdivisions?
• Not required
• Independent variables =
constant probability in each
subset
– e.g. proportion of words spoken
by women not affected by text
genre
– e.g. same ratio of women:men
in age groups, etc.
• What is the reality?
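The independence criterion on this slide (a constant proportion in each subset, without requiring subsets of equal size) can be sketched in a few lines. All the word counts below are hypothetical, invented purely for illustration, and the function names are not taken from any corpus tool.

```python
# HYPOTHETICAL word counts, invented for illustration (not ICE-GB or DCPSE figures).
# Each genre maps to (words by women, words by men).
balanced = {
    "conversation": (30_000, 60_000),
    "broadcast": (10_000, 20_000),
    "lectures": (5_000, 10_000),
}
skewed = {
    "conversation": (30_000, 60_000),
    "broadcast": (10_000, 20_000),
    "lectures": (0, 15_000),
}


def female_share(counts):
    """Proportion of words contributed by women in each genre."""
    return {genre: f / (f + m) for genre, (f, m) in counts.items()}


def share_gap(counts):
    """Largest minus smallest female share across genres.

    A gap of 0 means the share is constant, i.e. gender is
    independent of genre in this sample.
    """
    values = list(female_share(counts).values())
    return max(values) - min(values)


print(share_gap(balanced))  # genres differ in size but share is constant
print(share_gap(skewed))    # one genre has no words by women
```

In the `balanced` example the genres are very different sizes, yet the female share is the same in each, which is all that independence requires. In `skewed`, a single all-male genre is enough to break it.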
ICE-GB: gender / written-spoken
• Proportion of words in each category spoken
by women and men
– The authors of some texts are unspecified
– Some written material may be jointly authored
– female/male ratio varies slightly (=0.02)
[Figure: bar chart of the proportion p (0 to 1) of words by female and male contributors, for written, spoken and TOTAL]
ICE-GB: gender / spoken genres
• Gender variation in spoken subcategories
[Figure: bar chart of the proportion p (0 to 1) of words by female and male speakers in each spoken subcategory and for TOTAL spoken. Categories shown:
dialogue, private: direct conversations, telephone calls
dialogue, public: broadcast discussions, broadcast interviews, business transactions, classroom lessons, legal cross-examinations, parliamentary debates
mixed: broadcast news
monologue, scripted: broadcast talks, non-broadcast speeches
monologue, unscripted: demonstrations, legal presentations, spontaneous commentaries, unscripted speeches]
ICE-GB: gender / written genres
• Gender variation in written genres
[Figure: bar chart of the proportion p (0 to 1) of words by female, male and <author unknown/joint> writers in each written genre and for TOTAL written. Categories shown:
non-printed, correspondence: business letters, social letters
non-printed, non-professional writing: student examination scripts, untimed student essays
printed, academic writing: humanities, natural sciences, social sciences, technology
printed, creative writing: novels/stories
printed, instructional writing: administrative/regulatory, skills/hobbies
printed, non-academic writing: humanities, natural sciences, social sciences, technology
printed, persuasive writing: press editorials
printed, reportage: press news reports]
ICE-GB
• Sampling was not stratified across variables
– Women contribute 1/3 of corpus words
– Some genres are all male (where specified)
• speech: spontaneous commentary, legal presentation
• academic writing: technology, natural sciences
• non-academic writing: technology, social science
– Is this representative?
– When we compare
• technology writing with creative writing
• academic writing with student essays
– are we also finding gender effects?
– Difficult to compensate for absent data in analysis!
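One way to see why absent data is hard to compensate for is to look for empty cells before comparing genres. The sketch below uses hypothetical word counts (only the genre labels echo the slide), and `empty_strata` and `comparable` are illustrative helper names, not functions of any existing corpus software.

```python
# HYPOTHETICAL word counts, invented for illustration (not actual ICE-GB figures).
counts = {
    "creative writing": {"female": 12_000, "male": 9_000},
    "technology": {"female": 0, "male": 20_000},
    "student essays": {"female": 8_000, "male": 11_000},
}


def empty_strata(counts, levels=("female", "male")):
    """List every (genre, level) cell that contains no words."""
    return [
        (genre, level)
        for genre, by_level in counts.items()
        for level in levels
        if by_level.get(level, 0) == 0
    ]


def comparable(counts, genre_a, genre_b, levels=("female", "male")):
    """True if both genres have data at every level.

    Only then can a genre comparison be repeated within each
    level to check whether a difference is really a gender effect.
    """
    return all(
        counts[genre].get(level, 0) > 0
        for genre in (genre_a, genre_b)
        for level in levels
    )


print(empty_strata(counts))
print(comparable(counts, "technology", "creative writing"))
print(comparable(counts, "creative writing", "student essays"))
```

Where a cell is empty, no reweighting at the analysis stage can supply the missing speakers or writers: a difference between the two genres cannot be separated from a difference between the genders.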
DCPSE: gender / genre
• DCPSE has a simpler genre categorisation
– also divided by time
[Figure: bar chart of the proportion p (0 to 1) of words by female and male speakers in each DCPSE text category and for TOTAL. Categories shown: face-to-face conversations (formal, informal), telephone conversations, broadcast discussions, broadcast interviews, spontaneous commentary, parliamentary language, legal cross-examination, assorted spontaneous, prepared speech]
DCPSE: gender / time
• DCPSE has a simpler genre categorisation
– also divided by time
• note the gap
[Figure: proportion p (0 to 1) by gender plotted against time, by year from 1958 to 1992]
DCPSE: genre / time
• Proportion in each spoken genre, over time
– sampled by matching LLC and ICE-GB overall
• this is a ‘stratified sample’ (but only LLC:ICE-GB)
• uneven sampling over 5-year periods (within LLC)
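[Figure: proportion p (0 to 0.6) of words in each spoken genre plotted against time, 1960 to 1990 in 5-year steps. Series labelled: formal face-to-face, informal face-to-face, prepared speech, spontaneous commentary, telephone conversations. The ICE-GB proportions are marked as the target for LLC]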
DCPSE
• LLC sampling not stratified
– Issue not considered, data collected over
extended period
– Some data was surreptitiously recorded
• DCPSE matched samples by ‘genre’
– Same text category sizes in ICE-GB and LLC
– But problems in LLC (and ICE) percolate
• No stratification by speaker
– Result: difficult, and sometimes impossible, to
separate speaker-demographic effects from
text-category effects
Conclusions
• Ideal would be that:
– the corpus was “representative” in all 3 ways:
• a genuine random sample
• a broad range of text types
• a stratified sampling of speakers
– But these principles are unlikely to be compatible
• e.g. speaker age and utterance context
• Some compensatory approaches may be
employed at research (data analysis) stage
– what about absent or atypical data?
– what if we have few speakers/writers?
• So...
Conclusions
• …pay attention to stratification in deciding
which texts to include in subcategories
– consider replacing texts in outlying categories
• …justify and document the non-inclusion of a
stratum with evidence
– e.g. “there are no published articles attributable to
authors of this age in this time period”