Slides - Natural Language Processing Group

…challenges in
investigating Keyness
Mike Scott,
School of English
University of Liverpool
Keyness
NLP Group Computer Science
University of Sheffield
15 February 2008
1
Purpose
1. To explore the notion of keyness
2. and its implications in corpus-based study
3. and to consider concgrams
4. and key concgrams
5. all with reference to WordSmith
Overview
Keyness, as a new territory, looks promising
and has attracted colonists and prospectors.
It generally appears to give robust indications
of the text’s aboutness together with
indicators of style.
the text’s aboutness
Issues
• the issue of text section v. text v. corpus v. subcorpus
• statistical questions: what exactly can be claimed?
• how to choose a reference corpus
• handling related forms such as antonyms
Machine and Human KWs
• Rigotti and Rocci (2002) warn that machine
identification of key words omits all interpretation
of the writer’s intentions, cannot get at cultural
implications and does not spot the congruity of the
meanings of each section with the next.
metaphors
• “In our view, a natural language text, slippery and
vague as it may be, is not a stone soup where words
float free, tied only to their multiple associations
within a Foucoultian discourse” (Rigotti and Rocci,
2002)
Of course it doesn’t actually
understand…
… or know what is “correct”
… only look at what is found in text
… or context
… whether marked up or not …
Context?
Levels of Context
Physical environment
If so
• what is the status of the “key words” one may
identify and what is to be done with them?
Issues
1. the issue of text section v. text v. corpus v. subcorpus
2. statistical questions: what exactly can be claimed?
3. how to choose a reference corpus
4. handling related forms such as antonyms
5. what is the status of the “key words” one may
   identify and what is to be done with them?
text section v. text v. corpus v. subcorpus
• text section: levels 1-5
• text: level 6
• corpus: levels 7 & 8
But these are often not clearly
differentiated
• “text”, level 6: with or without mark-up,
images, sounds?
• what do we mean by section, chapter (4) and
other non-linguistically defined categories?
• is text itself mutating?
Internet text
Wikipedia homepage (part)
Wikipedia homepage (part)
Wikipedia article
(3 parts of same article)
Wikipedia discussion
• from History of the stall article
• latest contributor, “Talk” section
statistical issues
• p value is a well-established standard, relying on the
notion of chance, random effects
• but
  • if you run lots of comparisons some will spuriously (by
  chance) appear significant
  • if we’re operating at the level of word or cluster, text
  itself doesn’t consist of randomly ordered words
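The first caveat can be made concrete with a little arithmetic. The sketch below assumes independent tests at a fixed threshold; the function names and the figures (10,000 tests at 0.05) are illustrative and not from the talk, and the Bonferroni correction is shown only as one common, conservative remedy. Independence is itself the kind of assumption the second caveat calls into question.

```python
def expected_false_positives(n_tests: int, alpha: float) -> float:
    """Expected number of tests that look significant purely by chance."""
    return n_tests * alpha


def familywise_error_rate(n_tests: int, alpha: float) -> float:
    """Probability of at least one spurious result among independent tests."""
    return 1.0 - (1.0 - alpha) ** n_tests


def bonferroni_threshold(n_tests: int, alpha: float) -> float:
    """Per-test threshold under the (conservative) Bonferroni correction."""
    return alpha / n_tests


if __name__ == "__main__":
    n_tests, alpha = 10_000, 0.05  # illustrative: one test per word type
    print(expected_false_positives(n_tests, alpha))
    print(familywise_error_rate(n_tests, alpha))
    print(bonferroni_threshold(n_tests, alpha))
```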
Implication
• there is no statistical defence of the whole set of KWs
• but only of each one
• comparing KW p values is not advisable
the South Utsere farmer
• 3 problems
  • storm
  • wind
  • drought
• 3 crops
  • bananas
  • barley
  • chick-peas
choosing a reference corpus
• using a mixed-bag RC, the larger the RC the better, but a
moderate-sized RC may suffice.
• the keyword procedure is fairly robust.
• KWs identified even by an obviously absurd RC can be plausible
indicators of aboutness, which reinforces the conclusion that
keyword analysis is robust.
• genre-specific RCs identify rather different KWs
• the aboutness of a text may not be one thing but numerous
different ones.
Scott (forthcoming)
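The slides do not say which statistic lies behind a keyness score. As a minimal sketch, assuming the two-term log-likelihood (G2) statistic that is widely used in keyword analysis, the score for one word depends on four counts: its frequency in the study text, the size of the study text, its frequency in the RC and the size of the RC. The function name and the counts in the usage lines are invented for illustration.

```python
import math


def log_likelihood(freq_text: int, size_text: int,
                   freq_ref: int, size_ref: int) -> float:
    """Two-term G2 for one word: observed v. expected frequency in text and RC."""
    total_freq = freq_text + freq_ref
    total_size = size_text + size_ref
    # expected frequencies if the word were equally common in both
    expected_text = size_text * total_freq / total_size
    expected_ref = size_ref * total_freq / total_size
    g2 = 0.0
    for observed, expected in ((freq_text, expected_text),
                               (freq_ref, expected_ref)):
        if observed > 0:  # a zero count contributes nothing
            g2 += observed * math.log(observed / expected)
    return 2.0 * g2


if __name__ == "__main__":
    # invented counts: the same text word scored against two RCs of different size
    print(log_likelihood(50, 1_000, 100, 10_000))
    print(log_likelihood(50, 1_000, 10_000, 1_000_000))
```

Changing only the RC counts changes the score, which is one way to see why the choice of RC matters.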
related forms such as antonyms
status of the “key words”
Concgram section
What is a “concgram”?
• For years it has been easy to search for or identify
consecutive clusters (n-grams) such as
  • AT THE END OF
  • MERRY CHRISTMAS
  • TERM TIME.
• It has also been possible to find non-consecutive linkages
such as STRONG within the horizons of TEA by adapting
searches to find context words.
• In such cases we might get
  • ...STRONG blah blah blah TEA...
  • ...TEA blah blah blah blah STRONG...
  • etc.
• The concgram procedure takes a whole corpus of text and finds all sorts
of combinations like
  • ...STRONG blah blah blah TEA...
  • ...TEA blah blah blah blah STRONG...
  • ...STRONG TEA...
  • ...TEA STRONG...
• whether consecutive or not...
• a sequence (n-gram)
• within a concordance span.
• (“skip-gram” is used (Wilks 2005) to describe non-contiguous word
associations but doesn’t include TEA […] STRONG)
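The STRONG / TEA idea can be sketched for a single pair of words. This is a minimal illustration, not WordSmith's or ConcGram's code: the function name `find_pair`, the simple tokenizer and the default span of 5 are all assumptions.

```python
import re


def find_pair(text: str, word_a: str, word_b: str, span: int = 5):
    """Return (position_a, position_b) token positions where word_a and
    word_b co-occur within `span` tokens of each other, in either order."""
    tokens = re.findall(r"[a-z']+", text.lower())
    a, b = word_a.lower(), word_b.lower()
    hits = []
    for i, token in enumerate(tokens):
        if token != a:
            continue
        # look both left and right of each instance of word_a
        lo, hi = max(0, i - span), min(len(tokens), i + span + 1)
        for j in range(lo, hi):
            if j != i and tokens[j] == b:
                hits.append((i, j))
    return hits


if __name__ == "__main__":
    print(find_pair("I like my tea really very strong", "strong", "tea"))
```

Because the window extends on both sides of each instance, STRONG ... TEA and TEA ... STRONG are both found, as are the adjacent forms.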
Cheng, Greaves & Warren (2006)
• For our purposes, a ‘concgram’ is all of the permutations of
constituency variation and positional variation generated
by the association of two or more words. This means that
the associated words comprising a particular concgram
may be the source of a number of ‘collocational patterns’
(Sinclair 2004:xxvii). In fact, the hunt for what we term
‘concgrams’ has a fairly long history dating back to the
1980s (Sinclair 2005, personal communication) when the
Cobuild team at the University of Birmingham led by
Professor John Sinclair attempted, with limited success, to
devise the means to automatically search for non-contiguous
sequences of associated words.
• Cheng, Greaves & Warren (2006:414)
ConcGram (©) aims to be
• "a search-engine, which on top of the capability to
handle constituency variation (i.e. AB, ACB), also
handles positional variation (i.e. AB, BA), conducts
fully automated searches, and searches for word
associations of any size." (2006:413)
• WSConcGram is developed in homage to this idea.
The goal
Cheng, Greaves & Warren (2006:426)
A Problem
• Greaves (2007) reported that ConcGram requires
months using numerous linked PCs to generate the
5-word concgrams based on a corpus of some 5 million
words.
• There are a lot of combinations to take into account…
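A rough count suggests why. The sketch below assumes a span of 5 words on each side of a node word (an assumed value) and counts how many candidate word sets of each size contain that node; multiplying by the corpus size gives a crude upper bound, since each set is counted once per member word.

```python
import math


def candidate_sets_per_token(span: int, size: int) -> int:
    """Number of `size`-word sets containing a given node word, when the
    other words are chosen from `span` neighbours on each side."""
    neighbours = 2 * span
    return math.comb(neighbours, size - 1)


if __name__ == "__main__":
    span, corpus_tokens = 5, 5_000_000  # span is an assumed value
    for size in range(2, 6):
        per_token = candidate_sets_per_token(span, size)
        print(size, per_token, per_token * corpus_tokens)
```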
Implementation in WordSmith
• The plan: to process a corpus of adequate size (say 10
million words or more)
• find all instances of all frequently co-occurring words
• co-occurring within a given span
• identify them as potential
  • pairs (sup ... with)
  • triplets (light ... an ... dark)
  • quadruplets (with ... the ... of ... war)
  • quintuplets (eyes ... are ... full ... of ... tears)
  • etc. (a ... light ... condition ... in ... beauty ... dark)
• determine whether they are significantly associated
Stages
• design the procedure
• plan human interface aspects
Procedures and Routines
WS3-5’s WordList index function already
knew how to process a corpus and identify
all instances of each word in it...
Index
“A kingdom for a stage, princes to act
And monarchs to behold the swelling
scene.” (Henry V)
file of Types
word        type number
A           1
kingdom     2
for         3
stage       4
princes     5
to          6
act         7
and         8
monarchs    9
behold      10
Huge file of records containing token data:
word_type_number,
next_known_token,
file_number,
file_byte_position
R1, R2, R3 ... RN
R1’s word_type_number=1; (a)
R2’s word_type_number=2; (kingdom)
R3’s word_type_number=3; (for)
R4’s word_type_number=1; (a)
R5’s word_type_number=4; (stage)
R6’s word_type_number=5; (princes)
R7’s word_type_number=6; (to)
R8’s word_type_number=7; (act)
R9’s word_type_number=8; (and)
R10’s word_type_number=9; (monarchs)
R11’s word_type_number=6; (to)
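The two files can be sketched as follows. This is an illustration of the record layout above, not WordSmith's actual code: `TokenRecord` and `build_index` are hypothetical names, `next_known_token` is assumed to point at the next record of the same type (-1 if there is none), and a character offset stands in for the byte position.

```python
from dataclasses import dataclass


@dataclass
class TokenRecord:
    word_type_number: int
    next_known_token: int    # assumed: index of the next record of this type, -1 if none
    file_number: int
    file_byte_position: int  # character offset used here as a stand-in


def build_index(text: str, file_number: int = 1):
    """Return (types, records): the type table and one record per token."""
    types = {}      # lower-cased word -> type number, in order of first appearance
    records = []
    last_seen = {}  # type number -> index of its most recent record
    position = 0
    for raw in text.split():
        start = text.index(raw, position)
        position = start + len(raw)
        word = raw.strip(".,;:!?\"'“”‘’()").lower()
        if not word:
            continue
        number = types.setdefault(word, len(types) + 1)
        if number in last_seen:
            # link the previous instance of this type to the new one
            records[last_seen[number]].next_known_token = len(records)
        last_seen[number] = len(records)
        records.append(TokenRecord(number, -1, file_number, start))
    return types, records


if __name__ == "__main__":
    sample = ("A kingdom for a stage, princes to act "
              "And monarchs to behold the swelling scene.")
    types, records = build_index(sample)
    print([r.word_type_number for r in records])
```

Following `next_known_token` from the first instance of a type visits every instance of that word without scanning the whole token file.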
WSConcGram procedures (1)
• process the index, looking at each instance of words
above a certain threshold frequency (e.g. 5 instances)
• considering all its neighbours within a given span (e.g.
5)
• finding all pairs repeated more than a threshold
number of times
• saving in a file data on where in the corpus each pair is
to be found
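These steps can be sketched in memory, with a dictionary standing in for the file of pair locations. The function name, the parameter names and the reading of the thresholds (a word qualifies at `min_word_freq` or more instances; a pair must occur more than `pair_threshold` times) are assumptions for illustration.

```python
from collections import Counter, defaultdict


def find_pairs(tokens, min_word_freq=5, span=5, pair_threshold=1):
    """Return {(word_a, word_b): [(earlier_pos, later_pos), ...]} for pairs of
    frequent words co-occurring within `span` tokens more than
    `pair_threshold` times."""
    word_freq = Counter(tokens)
    locations = defaultdict(list)
    for i, word in enumerate(tokens):
        if word_freq[word] < min_word_freq:
            continue
        # look rightwards only, so each co-occurrence is recorded once
        for j in range(i + 1, min(len(tokens), i + span + 1)):
            other = tokens[j]
            if other == word or word_freq[other] < min_word_freq:
                continue
            key = tuple(sorted((word, other)))  # order-independent pair
            locations[key].append((i, j))
    return {pair: where for pair, where in locations.items()
            if len(where) > pair_threshold}


if __name__ == "__main__":
    tokens = "strong tea and strong tea or tea strong".split()
    print(find_pairs(tokens, min_word_freq=2, span=2, pair_threshold=1))
```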
WSConcGram procedures (2)
• sort the file of pairs
• process it, finding overlaps,
  • e.g. where HOW and MATTER and NOW are all found
  within the default span
• sorting the resulting triplets, quadruplets, etc and
• storing them in another file
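One way to find such overlaps is to look for three token positions that are all linked to one another by stored pairs. The sketch below handles triplets only and is an assumption about how the step might work, not the WSConcGram routine; `merge_pairs_into_triplets` and its input layout are hypothetical.

```python
from collections import defaultdict
from itertools import combinations


def merge_pairs_into_triplets(tokens, pair_locations):
    """pair_locations: {(word_a, word_b): [(pos1, pos2), ...]}.
    Returns {(w1, w2, w3): [(p1, p2, p3), ...]} for positions that are
    pairwise linked, i.e. every two of the three form a stored pair."""
    partners = defaultdict(set)  # position -> positions it is paired with
    for locations in pair_locations.values():
        for p, q in locations:
            partners[p].add(q)
            partners[q].add(p)
    triplets = defaultdict(set)
    for p in list(partners):
        for q, r in combinations(sorted(partners[p]), 2):
            if r not in partners[q]:
                continue  # q and r are not themselves a stored pair
            positions = tuple(sorted((p, q, r)))
            words = tuple(sorted(tokens[i] for i in positions))
            if len(set(words)) == 3:
                triplets[words].add(positions)
    # sort for stable output, as the slide's sorting step suggests
    return {words: sorted(found) for words, found in sorted(triplets.items())}


if __name__ == "__main__":
    tokens = ["how", "now", "matter"]
    pairs = {("how", "now"): [(0, 1)],
             ("how", "matter"): [(0, 2)],
             ("matter", "now"): [(1, 2)]}
    print(merge_pairs_into_triplets(tokens, pairs))
```

The same idea extends to quadruplets and beyond by requiring every pair within the larger set to be linked.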
WSConcGram Files
• shakespeare.types
• shakespeare.tokens
• shakespeare.base_pairs
• shakespeare.base_index
• shakespeare.base_index_cg
Human Interface Aspects
• sorting concgrams by frequency and alphabetically
• displaying the root word types
• choosing concgram forms
• clustering them in trees
• filtering according to
  • statistical properties
  • required word(s)
  • other needs
• copy, save as .txt, print, print preview
• concordance selected concgrams
Problem areas
• computing the association statistics correctly
• clustering each concgram
• showing & hiding parts of a concgram
• ordering them in a tree structure
References
• Berber Sardinha, Tony, 1999. Using Key Words in Text Analysis: practical aspects.
DIRECT Papers 42, LAEL, Catholic University of São Paulo.
• Berber Sardinha, Tony, 2004. Lingüística de Corpus. Barueri: Manole.
• Cheng, Winnie, Chris Greaves & Martin Warren, 2006. “From n-gram to skipgram to
concgram”, International Journal of Corpus Linguistics, Vol. 11, No. 4, pp. 411-433.
• Greaves, Chris, 2007. Demonstration of ConcGram. Keyness in Text conference, Certosa
di Pontignano, Tuscany, Italy, 26-30 June 2007.
• Culpeper, J., 2002. 'Computers, language and characterisation: An Analysis of six
characters in Romeo and Juliet'. In: U. Melander-Marttala, C. Östman and M. Kytö (eds.),
Conversation in Life and in Literature: Papers from the ASLA Symposium, Association
Suedoise de Linguistique Appliquée (ASLA), 15. Universitetstryckeriet: Uppsala, pp. 11-30.
• Kemppanen, Hannu, 2004. Keywords and Ideology in Translated History Texts: A
Corpus-based Analysis. Across Languages and Cultures 5 (1), pp. 89-106.
• Rigotti, Eddo and Andrea Rocci, 2002. From Argument Analysis to Cultural Keywords
(and back again). http://www.ils.com.unisi.ch/articoli-rigotti-rocci-keywordspublished.pdf
(accessed May 2007). In F. H. van Eemeren et al, Proceedings of the 5th
Conference of the International Society for the Study of Argumentation. Amsterdam:
SicSat. pp. 903-908.
• Scott, M., 1996 with new versions in 1997, 1999, 2004, Wordsmith Tools, Oxford: Oxford
University Press.
• Scott, M., 1997a. "PC Analysis of Key Words -- and Key Key Words", System, Vol. 25,
No. 1, pp. 1-13.
• Scott, M., 1997b. "The Right Word in the Right Place: Key Word Associates in Two
Languages", AAA - Arbeiten aus Anglistik und Amerikanistik, Vol. 22, No. 2, pp. 239-252.
• Scott, M., 2000a. ‘Focusing on the Text and Its Key Words’, in L. Burnard & T. McEnery (eds.),
Rethinking Language Pedagogy from a Corpus Perspective, Volume 2. Frankfurt: Peter Lang, pp.
103-122.
• Scott, M., 2000b. Reverberations of an Echo, in B. Lewandowska-Tomaszczyk & P.J. Melia (eds.)
PALC’99: Practical Applications in Language Corpora. Lodz Studies in Language, Volume 1.
Frankfurt: Peter Lang, pp. 49-68.
• Scott, M., 2001. ‘Mapping Key Words to Problem and Solution’ in M. Scott & G. Thompson (eds.)
Patterns of Text: in honour of Michael Hoey, Amsterdam: Benjamins, pp. 109-127.
• Scott, M., 2002. ‘Picturing the key words of a very large corpus and their lexical upshots – or
getting at the Guardian’s view of the world’ in B. Kettemann & G. Marko (eds.) Teaching and
Learning by Doing Corpus Analysis, Amsterdam: Rodopi, pp. 43-50 and cd-rom within the cover
of the book.
• Scott, M., 2006. "The Importance of Key Words for LSP" in Arnó Macià, E., A. Soler Cervera & C.
Rueda Ramos (eds.), Information Technology in Languages for Specific Purposes: issues and
prospects. New York: Springer, pp. 231-243.
• Scott, M. (forthcoming) In Search of a Bad Reference Corpus. AHRC Methods Network.
• Scott, M. & Tribble, C., 2006. Textual Patterns: keyword and corpus analysis in language
education, Amsterdam: Benjamins.
• Seale, C., Charteris-Black, J., Ziebland, S., 2006. Gender, cancer experience and internet use: a
comparative keyword analysis of interviews and online cancer support groups. Social Science and
Medicine, 62, 10: 2577-2590.
• Tribble, Chris, 1999. "Genres, keywords, teaching: towards a pedagogic account of the language
of project proposals" in L. Burnard & A. McEnery (eds.) Rethinking Language Pedagogy from a
Corpus Perspective: Papers from the Third International Conference on Teaching and Language
Corpora, (Lodz Studies in Language). Hamburg: Peter Lang.
• Wilks, Yorick, 2005. REVEAL: the notion of anomalous texts in a very large corpus. Tuscan Word
Centre International Workshop. Certosa di Pontignano, Tuscany, Italy, 31 June–3 July 2005 (cited
in Cheng et al.)