Using AntConc to explore a small corpus of emails
In this seminar we will be investigating a small corpus (approx. 12,000 words) of email messages (referred to as EMAIL in this handout). We will mainly be doing this by comparing our email corpus with a corpus of written British English called FLOB.
1) Start up AntConc.
2) We have to set AntConc to ignore tags, which are special codes inside the files. To do this, click on Global Settings on the menu bar, then click on Tag Settings, and then select Hide Tags (all other options are fine). Then click Apply.
3) We also need to tell AntConc to include numbers (we’ll see why later). To do this, click on Global Settings, click on Token (Word) Definition, and select Number in the Number Token Classes box. Then click on Apply.
4) Now we need to tell AntConc which file we’re working with. Select File and then Open File(s)… A standard file-open box will appear. Navigate to the file that contains the EMAIL corpus we will be using today, which should be on your PC desktop. Click on EMAILtagged.txt, which will highlight it in blue. Then click on Open. You will see that EMAILtagged.txt is now listed in the main AntConc window.
Word Frequency Lists
5) Create a word frequency list. Click on the Word List tab. Make sure you tick the Treat all data as lowercase box (note: this is very important; if you don’t tick it, AntConc will treat the and THE as if they were two different words). Now click Start. Use the AntConc Word List screen to fill in Table 1 below. Write down the top 20 words in the EMAIL corpus and their % frequencies in the table (we’ve actually done some of them for you to save time). The % figure is calculated as follows:
% frequency = (raw frequency ÷ total no. of word tokens) × 100
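As a rough illustration of the calculation above, here is a minimal Python sketch that builds a word list with raw and % frequencies. The function name `word_list`, the simple tokenising rule, and the sample sentence are all assumptions made for this example; AntConc's own token definition may differ, and the sample text is invented rather than taken from EMAIL.

```python
import re
from collections import Counter

def word_list(text):
    """Return (word, raw frequency, % frequency) rows, most frequent first."""
    # Lowercase everything so that "the" and "THE" count as one word,
    # and treat any run of letters or digits as a token (an assumption).
    tokens = re.findall(r"[a-z0-9]+", text.lower())
    total = len(tokens)
    counts = Counter(tokens)
    # % frequency = raw frequency / total number of word tokens * 100
    return [(w, n, round(n / total * 100, 2)) for w, n in counts.most_common()]

# Toy text invented for illustration; it is not part of the EMAIL corpus.
sample = "The meeting is at 2. I will see you at the meeting, and I hope to see you soon."
rows = word_list(sample)
for word, raw, pct in rows[:5]:
    print(f"{word:<10}{raw:>4}{pct:>8.2f}")
```

Because the figures are percentages of each corpus's own total, the same function could be run on two texts of very different sizes and the results compared directly.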
The AntConc display tells you the total number of word tokens in the file: look at the info bar above the word list.

Table 1: Top 20 words in EMAIL and FLOB (blank cells are for you to fill in)

Rank  EMAIL word  Freq   %       FLOB word  Freq    %
1     the                        the        64470   6.28
2     i                          of         33951   3.31
3     to          373    2.93    and        27159   2.64
4     and         300    2.36    to         26940   2.62
5     a           291    2.29    a          22973   2.24
6     you                        in         20737   2.02
7     it                         that       10748   1.05
8     of                         is         10378   1.01
9     in          148    1.16    was        10135   0.99
10    s                  1.09    it         9626    0.94
11    on          130    1.02    for        9307    0.91
12    that        130    1.02    he         7961    0.78
13    for         129    1.01    s          7621    0.74
14    have                       as         7470    0.73
15    we                         with       7099    0.69
16    do          111    0.87    on         7089    0.69
17    n                          i          7011    0.68
18    t                          be         6497    0.63
19    is          102    0.80    his        5794    0.56
20    not                        by         5392    0.53

QUESTION>> Why have we gone to the trouble of working out % frequencies?
The table also contains the top 20 words from the FLOB corpus of written English. Compare the words and frequencies for the two corpora and make a note of any differences/similarities. Do your results allow you to say anything about the language of emails?
6) You may have noticed that the EMAIL word list contains the letters s, n, and t. In order to understand why a particular item appears in a wordlist, it’s often useful to investigate using concordances. A concordance is a list of all of the occurrences of a certain word, shown in the context of the sentence(s) that word occurs in. You can look at a concordance simply by clicking on any word in the word frequency list (your cursor should change to a ‘pointing-finger’ icon when you move it over a word). Alternatively, you can click on the Concordance tab to take you to the concordance screen. Then type the word you want in the box labelled Search Term, and click Start.
TASK>> Even though you may be able to guess, use concordances to find the meaning of s, n and t in the EMAIL corpus. Are there any issues regarding the s item in the wordlist?
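To show what a concordance is doing under the bonnet, here is a minimal Python sketch. The function `concordance`, the `width` parameter and the sample text are assumptions made for illustration only; this is not how AntConc itself is implemented, and the sketch ignores sentence boundaries.

```python
import re

def concordance(text, target, width=3):
    """Return (left context, keyword, right context) for each hit of target."""
    # Same simplified tokenising rule as before: lowercase letters and digits.
    tokens = re.findall(r"[a-z0-9]+", text.lower())
    lines = []
    for i, tok in enumerate(tokens):
        if tok == target:
            left = " ".join(tokens[max(0, i - width):i])
            right = " ".join(tokens[i + 1:i + 1 + width])
            lines.append((left, tok, right))
    return lines

# Toy text invented for illustration; it is not part of the EMAIL corpus.
sample = "I have to go now. We have a meeting at 2. Do you have the notes?"
hits = concordance(sample, "have")
for left, kw, right in hits:
    # Right-align the left context so the keyword lines up in a column.
    print(f"{left:>20}  {kw}  {right}")
```

Lining the keyword up in a central column is what makes recurring patterns on either side of it easy to spot.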
7) Further investigation of have. The % frequency and ranking of have are higher in EMAIL than in FLOB. This difference in frequencies could mean that ‘have’ is an interesting word to investigate in EMAIL, and we’re going to do this using the sort and the cluster functions. From the wordlist, click on have to get a concordance. We can sort the concordance into alphabetical order of what comes before or after have. The control for this is at the bottom of the window. You can use the up and down buttons to change the basis of the concordance sort. 1R means one word to the right of have, 1L means one word to the left of have, and so on. For now, we want the 1R sort, so select 1R and then press Sort.
TASK>> This sorted format should help you to notice more easily how have is used in EMAIL. Make a note of any patterns you observe. Are there any patterns in the word class that follows have?
8) It is also sometimes useful to look at clusters. Click on the Clusters tab. Set the cluster max and min size to 10 and 2 respectively, and set the min cluster frequency to 3. Click on Start.
TASK>> Make a note of any patterns of usage of have that you notice when looking at clusters.
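The 1R sort and the cluster count described above can be sketched in Python as follows. The function names, the default sizes, and the decision to count only clusters that begin with the search word are assumptions for this example (AntConc lets you choose the position of the search term), and the sample text is invented.

```python
import re
from collections import Counter

def tokenize(text):
    # Simplified rule: lowercase, runs of letters or digits are tokens.
    return re.findall(r"[a-z0-9]+", text.lower())

def sort_1r(tokens, target):
    """Concordance hits as (1L, target, 1R), sorted alphabetically on 1R."""
    hits = []
    for i, tok in enumerate(tokens):
        if tok == target:
            left = tokens[i - 1] if i > 0 else ""
            right = tokens[i + 1] if i + 1 < len(tokens) else ""
            hits.append((left, tok, right))
    return sorted(hits, key=lambda h: h[2])

def clusters(tokens, target, min_size=2, max_size=3, min_freq=2):
    """Count word sequences starting with target that occur >= min_freq times."""
    counts = Counter()
    for i, tok in enumerate(tokens):
        if tok == target:
            for n in range(min_size, max_size + 1):
                if i + n <= len(tokens):
                    counts[" ".join(tokens[i:i + n])] += 1
    return [(c, f) for c, f in counts.most_common() if f >= min_freq]

# Toy text invented for illustration; it is not part of the EMAIL corpus.
sample = ("I have to go now. We have a meeting at 2. "
          "You have to reply soon. They have a plan.")
tokens = tokenize(sample)
for left, kw, right in sort_1r(tokens, "have"):
    print(f"{left:>6}  {kw}  {right}")
print(clusters(tokens, "have"))
```

Sorting on 1R groups identical right-hand neighbours together, while the minimum frequency threshold filters out one-off sequences so that only recurring clusters remain.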
9) Now we’re going to compare the wordlists for EMAIL and FLOB using the Keyword List function. This allows us to find out which words appear more or less often in the EMAIL corpus than would be expected by chance alone when compared against the reference corpus (FLOB). These words are called “keywords”. Click on the Keyword List tab to access the comparison tool.
We have to tell AntConc what we want to compare EMAIL with. This is called setting the reference corpus: the “reference corpus” is whatever our standard of comparison is. In this case our standard of comparison is FLOB, a corpus of British written English. We set this up using Tool Preferences. Click on Tool Preferences on the menu bar, then click on Keyword List.
[Diagram: the Keyword List preferences, with callouts showing where to access the Keyword List preferences, the box that must be ticked, where to click to select your reference corpus, and where the files you’ve selected appear.]
First, tick the “treat all data as lowercase” box. This is very important! Then click Choose Files. This takes you to a file-open box just like the one you used to load EMAIL, only this time we are loading a reference corpus. Select flob.txt and press Open, and the file’s name will appear in the preferences window. Then click Apply.
Now press Start, and AntConc will generate a list of words which are statistically more common in the EMAIL corpus than in FLOB. (This might take a few seconds.)
[Screenshot: example keyword list.]
The keywords are sorted by their keyness, a statistical measure of how unlikely it is that a word’s over-representation in EMAIL is down to chance alone. The higher the keyness score, the more confident we can be that the keyword is a characteristic of the data rather than a chance occurrence. A text might have hundreds of keywords, but often the twenty or thirty words with the highest keyness are the most useful.
10) The next stage in keyword analysis is to try to understand why the words in the list are key. Remember, keywords are just words that appear (statistically) more in one text or corpus than they do in another (comparison) text or corpus. The statistical test that calculates the keyness merely indicates that the difference is real and not just a fluke. It is up to you as analysts to work out whether that statistical keyness equates to linguistic salience, i.e. are the keywords telling you anything interesting about the EMAIL data they come from? We therefore need to look at how the words are used in more context. Keywords, then, can be seen as a list of possibilities for further, more focused, analysis of a text or corpus.
11) Further investigation of soon.
TASK>> Use the sort and cluster functions (described in the have example above) to find the patterns of usage of soon in EMAIL (tip: try sorting one word to the left of soon).
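To make the idea of keyness from steps 9 and 10 more concrete, here is a minimal Python sketch of one commonly used keyness statistic, log-likelihood. This handout does not say which statistic AntConc is set to use, so treat the choice of log-likelihood as an assumption; the function name and all the figures in the usage example are invented for illustration.

```python
import math

def log_likelihood(freq_study, size_study, freq_ref, size_ref):
    """Two-term log-likelihood score for one word in two corpora."""
    combined = freq_study + freq_ref
    # Expected frequencies if the word were equally common in both corpora.
    expected_study = size_study * combined / (size_study + size_ref)
    expected_ref = size_ref * combined / (size_study + size_ref)
    score = 0.0
    # A zero count contributes nothing (0 * log 0 is taken as 0).
    if freq_study > 0:
        score += freq_study * math.log(freq_study / expected_study)
    if freq_ref > 0:
        score += freq_ref * math.log(freq_ref / expected_ref)
    return 2 * score

# Invented figures: a word seen 50 times in a 12,000-word study corpus
# and 100 times in a 1,000,000-word reference corpus.
print(round(log_likelihood(50, 12000, 100, 1000000), 2))
# Same relative frequency in both corpora: the score is (close to) zero.
print(round(log_likelihood(12, 12000, 1000, 1000000), 2))
```

The score grows as the observed frequencies move further from what equal usage in both corpora would predict, which is why a larger value gives more confidence that the difference is not a fluke. Note that the score alone does not say which corpus the word is more frequent in; comparing the % frequencies tells you the direction.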
12) Further investigation of 2. Return to the keyword list by clicking on the Keyword List tab. Next, click on 2 to get a concordance (you might have to scroll down a bit to find it).
QUESTION>> How is 2 used in EMAIL? (You’ll probably have guessed the answer to this.)
Now click on the Concordance Plot tab to see the dispersion of 2 across EMAIL.
QUESTION>> What does this tell us about the usage of 2 in EMAIL?
13) TASK>> Using concordance lines and any of the other tools in AntConc, investigate any of the other keywords from your list.
Does your investigation reveal anything about the language use in EMAIL that you didn’t find when you compared the wordlists?
14) General discussion points:
What are the problems with drawing conclusions about emails in general from our findings today?
Is this a representative corpus of emails?
Is FLOB a good choice of comparison corpus?
What could we do to make a future research project into emails better?
Did you find, through the course of this seminar, any other problems with this kind of investigation?