Corpus analysis

Corpus analysis (2)
Corpus Linguistics
Richard Xiao
lancsxiaoz@googlemail.com
Outline of the session
• Lecture
– Keyword
– Reference corpus
– Key keyword
• Practical
– WST keyword
– AntConc keyword
– Wmatrix keyword / key concept
– Extra: keyword analysis with CQPweb
2
What is a keyword?
• Keywords are those words whose frequency is
exceptionally high (positive keywords) or low
(negative keywords) in comparison with a
reference corpus
– Keywords usually refer to positive keywords
– But negative keywords are equally interesting
(see Xiao and McEnery 2005)
• Negative keywords appear at the very end of the
keyword listing, in a different colour, in WordSmith
• They are omitted automatically from a keywords
database used for key keyword analysis and from
keyword plots
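The positive/negative distinction can be illustrated with a minimal sketch, assuming we only compare relative frequencies. The function name and the counts are hypothetical, and this shows the direction of the difference only; whether a difference counts as "exceptional" is decided by a significance statistic in the actual tools.

```python
def keyness_direction(target_freq, target_size, ref_freq, ref_size):
    """Say whether a word is relatively more or less frequent in the
    target corpus than in the reference corpus."""
    target_rate = target_freq / target_size   # relative frequency in target
    ref_rate = ref_freq / ref_size            # relative frequency in reference
    if target_rate > ref_rate:
        return "positive"
    if target_rate < ref_rate:
        return "negative"
    return "neutral"


# Hypothetical counts, for illustration only:
# a 6,000-word speech against a 100-million-word reference corpus.
print(keyness_direction(30, 6_000, 10_000, 100_000_000))     # positive
print(keyness_direction(50, 6_000, 6_000_000, 100_000_000))  # negative
```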
3
Why keyword analysis?
• Indicating the ‘aboutness’ (Scott 1999) of a
particular text or corpus
– Content analysis, discourse analysis
• Also revealing the salient features which
are functionally related to a particular
genre (Xiao and McEnery 2005)
– Genre analysis, stylistic analysis
4
How to do keyword analysis
• Make a wordlist of the target corpus
• Locate or make a wordlist of a reference corpus
– Scott (2005) “In search of a bad reference corpus”
• http://www.methodsnetwork.ac.uk/redist/pdf/es1_05scott.pdf
– The reference corpus is usually larger than the target
corpus
– The appropriateness of a reference corpus depends on
your research questions!
• Compare the frequency of each item in the two
wordlists to extract keywords – done automatically
• Analyse and interpret keywords – you will do it!
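The steps above can be sketched in a few lines of Python. This is a rough illustration, not what WordSmith does internally: it assumes a crude regular-expression tokeniser and uses the log-likelihood statistic as the keyness measure (an assumption here; the slides do not name a statistic). The function names `wordlist`, `log_likelihood` and `keywords` are made up for this example.

```python
import math
import re
from collections import Counter


def wordlist(text):
    """Step 1/2: count lowercase word tokens (a deliberately crude tokeniser)."""
    return Counter(re.findall(r"[a-z']+", text.lower()))


def log_likelihood(a, b, c, d):
    """Log-likelihood for a word occurring a times in a target corpus of
    c tokens and b times in a reference corpus of d tokens."""
    e1 = c * (a + b) / (c + d)   # expected frequency in the target corpus
    e2 = d * (a + b) / (c + d)   # expected frequency in the reference corpus
    ll = 0.0
    if a > 0:
        ll += a * math.log(a / e1)
    if b > 0:
        ll += b * math.log(b / e2)
    return 2 * ll


def keywords(target, reference):
    """Step 3: compare every item in the two wordlists.
    Positive scores mark positive keywords, negative scores negative ones."""
    c, d = sum(target.values()), sum(reference.values())
    rows = []
    for word in set(target) | set(reference):
        a, b = target.get(word, 0), reference.get(word, 0)
        score = log_likelihood(a, b, c, d)
        if a / c < b / d:        # relatively rarer in the target corpus
            score = -score
        rows.append((word, a, b, score))
    # Highest keyness first; negative keywords end up at the bottom
    return sorted(rows, key=lambda r: r[3], reverse=True)


# Toy texts, for illustration only
target = wordlist("tax tax tax the economy the people")
reference = wordlist("the the the the cat sat on the mat with the people")
for word, a, b, score in keywords(target, reference)[:3]:
    print(word, a, b, round(score, 2))
```

Step 4, interpreting the list, still has to be done by the analyst.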
5
Keywords in the party speeches
• Target corpus – just one text
– David Cameron's speech at the Conservative
conference (5 October 2011, Manchester)
• http://www.bbc.co.uk/news/uk-politics-15189614
• Local copy available (David_speech Unicode text)
- download and unzip the file into a file folder:
www.fass.lancs.ac.uk/projects/corpus/data/workshop3texts.zip
• Reference corpus
– The 100-million-word BNC: download and unzip (local
copy available)
www.lexically.net/downloads/version4/BNC_World.zip
• Tool
– WST Keyword
6
Wordlist of David’s speech
7
Creating keyword list
8
Keyword extraction in progress
Warning: It can take time if you have loaded two large wordlists
9
Keywords in David’s speech
What do these
keywords tell us?
Negative keyword
10
Keyword: Plot view
11
What company do keywords keep?
12
Why “marriage”?
13
Key clusters
Similar to word clusters,
but only keywords are used.
14
Key keywords
• A key keyword is one which is "key" in more than
one of a number of related texts
– The more texts it is "key" in, the more "key key" it is
– Can avoid extracting keywords which are unusually
frequent in only a small number of files
• Can be created automatically, and are as simple to
extract as keywords
• n.b. Negative keywords are omitted automatically
from a key keyword list
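The idea of a key keyword can be sketched as a simple count over per-text keyword lists. This is a minimal illustration assuming the positive keywords of each text have already been extracted; the file names, the words and the `min_texts` parameter are hypothetical.

```python
from collections import Counter


def key_keywords(keyword_lists, min_texts=2):
    """Count the number of texts in which each word is a (positive) keyword
    and keep the words that are key in at least min_texts texts."""
    counts = Counter()
    for kws in keyword_lists.values():
        counts.update(set(kws))          # count each word once per text
    ranked = sorted(counts.items(), key=lambda item: (-item[1], item[0]))
    return [(word, n) for word, n in ranked if n >= min_texts]


# Hypothetical positive keyword lists for three related texts
keyword_lists = {
    "speech_a.txt": ["economy", "deficit", "marriage"],
    "speech_b.txt": ["economy", "deficit", "nhs"],
    "speech_c.txt": ["economy", "schools"],
}
print(key_keywords(keyword_lists))
# [('economy', 3), ('deficit', 2)]
```

Words that are key in only one file ("marriage", "nhs", "schools") drop out, which is the point of the procedure.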
15
Making a batch wordlist
Specify a folder where you can write
16
Batch making keyword lists
17
Batch making keyword lists
Specify a folder where you can write
18
Making a KW database
19
Key keywords
key coverage of the corpus
An "associate" is a keyword
that appears in the same text
20
Keyword in AntConc
target corpus
reference corpus
21
Keyword in AntConc
Key words in David's
speech (in relation to
Ed's speech)
22
Wmatrix: Keywords and key concepts
• POS and semantic tagging
• Keyword / key concept analysis in Cameron’s
speech in comparison with Miliband’s speech
• Copy and paste the speeches into two separate
text files
– http://www.bbc.co.uk/news/uk-politics-15189614
– http://www.labour.org.uk/ed-milibands-speech-tolabour-party-conference
• Save the two texts as David_speech.txt and
Ed_speech.txt
www.fass.lancs.ac.uk/projects/corpus/data/workshop3texts.zip
23
Wmatrix: Keywords and key concepts
• Log in using your zhejiangxx account
– http://ucrel.lancs.ac.uk/wmatrix3.html
24
Tagging Wizard
25
Tagging in progress
26
Tagging result
27
Labour frequency list
28
KWIC concordance
29
“My folders”
Upload and tag
Ed’s speech
…and click on “My
folders”
Warning: Your folder view may look different!
30
Open David_speech folder and select
Ed_speech in “Keyword compared to”
dropdown box
31
Keyword list to download!
32
Keyword cloud –
even more interesting!
33
David’s key concepts
(“Key concepts compared to”)
34
Keyword analysis in online corpora
• Using Lancaster’s CQPweb to compare British
English (LOB+FLOB) and American English
(Brown + Frown)
• Log in to CQPweb
– http://cqpweb.lancs.ac.uk
• A similar analysis can be done at BFSU’s
CQPweb corpus hub (different corpora)
– http://124.193.83.252/cqp/
– Account: ID=pass=test
35
Creating subcorpora
36
Creating subcorpus BrE
37
Creating subcorpus AmE
38
Making wordlists
39
Wordlist available now
40
Computing keywords
You can adjust the statistical measure, cut-off
point, and minimum frequency according to your
research purposes.
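The effect of the cut-off point and minimum frequency can be sketched as a simple filter. This is an illustration only, assuming a log-likelihood score per word (compared against chi-square critical values with 1 degree of freedom); the row layout, the function name and the scores are hypothetical and do not reproduce CQPweb's actual options.

```python
# Chi-square critical values (1 degree of freedom) for common cut-off points
CRITICAL_VALUES = {0.05: 3.84, 0.01: 6.63, 0.001: 10.83, 0.0001: 15.13}


def filter_keywords(rows, p=0.001, min_freq=3):
    """Keep rows (word, target_freq, score) that reach the significance
    cut-off and occur at least min_freq times in the target corpus."""
    threshold = CRITICAL_VALUES[p]
    return [row for row in rows if row[1] >= min_freq and row[2] >= threshold]


# Hypothetical keyness scores, for illustration only
rows = [("colour", 120, 95.2), ("lorry", 2, 14.0), ("whilst", 40, 7.1)]
print(filter_keywords(rows))                      # strict cut-off
print(filter_keywords(rows, p=0.01, min_freq=1))  # more lenient settings
```

A stricter cut-off gives a shorter, safer list; a more lenient one surfaces more candidates that need manual checking.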
41
Keywords in BrE and AmE
42
Download