Corpora and Statistical Methods
Tutorial 3
1 Introduction
The aims of this tutorial are (a) to familiarise you with the SketchEngine, a web
interface to a number of corpora; and (b) to apply some of the concepts about Zipfian
distributions and morphological productivity to actual data.
1.1 Accessing the SketchEngine
The SketchEngine is online at http://beta.sketchengine.co.uk/auth/. When you go to
this URL, you will be prompted for a username and password, which the lecturer
should provide.
(An alternative is to create a trial account. This is free but lasts for only a month.)
1.2 Tools you will need
Throughout this tutorial, you will be using the Corpus Query Language (CQL).
This is a language which mixes regular expressions with a special syntax to make
elaborate queries of corpus data. Examples include:
o Finding all adjectives in the corpus which end in –ity
o Finding all occurrences of the verb kill which are followed by a determiner, an
adjective, and a noun in that order.
A tutorial about CQL and regular expressions can be found in a separate file,
downloadable from the website.
1.3 Accessing a corpus
Once you log in, you will see a selection of different corpora for many different
languages. These include some well-known ones, such as the BNC. There are also a
number of web corpora (i.e. corpora built by scraping the web). We shall use one of
these, namely ukWaC (UK Web as Corpus). (Please use “ukWaC”, not “ukWaC v1.0
old”, which also appears on the menu.)
Here is some useful information about this corpus:
o Corpus size: ca. 1.5 billion words
o Vocabulary size: ca. 11.2 million
o No. of hapax legomena: 1,949,571 individual lemmas
Click on the ukWaC link. You will be taken to a search form as shown below.
Note:
1. You can make simple word/phrase queries by typing them in the Query box;
2. If you click Query Type on the left menu (bottom), a drop-down menu appears
which among other things allows you to choose between a simple query, a
lemma query (for all morphological forms of a specific lemma), or a CQL
query (which allows for complex searches for word/lemma sequences with
part of speech tags).
2 Creating CQL queries
For this part of the tutorial, you’ll practise using CQL. Be sure to select the CQL
option from the drop-down Query Type box, as shown above. You’ll need to refer to
the tagset used in the corpus. This is the tagset originally developed for the Penn
Treebank corpus, and a full listing is provided here:
 http://trac.sketchengine.co.uk/wiki/tagsets/penn
Q1. Using the examples from the CQL tutorial, write CQL queries for the following
(the first has been done for you):
a. Nouns ending in the suffix –ness (e.g. goodness)
CQL = [word=".+ness$" & tag="NN"]
b. Adjectives preceded by a determiner (e.g. the) and ending with the suffix –ous
(e.g. the calamitous...)
c. Adverbs ending in –ly (e.g. slowly) and followed by a verb (e.g. slowly ran)
d. Adjectives starting with the negative prefix in- or im- and ending in the suffix
–ous (e.g. impecunious)
e. Complex adjectives involving the prefix non- (with the hyphen)
f. Complex adjectives involving the prefix well- (with the hyphen)
Try out the queries in the CQL box. Do you get the desired results or does your
query overgenerate?
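To see why a query can overgenerate, it may help to mimic a single CQL token constraint outside the SketchEngine. The sketch below is only an illustration in Python, not the SketchEngine's implementation. It assumes that each attribute pattern must match the whole attribute value, and the token list and the helper name matches are made up for the example.

```python
import re

# Toy (word, tag) tokens, invented for illustration; tags follow the Penn convention.
tokens = [
    ("goodness", "NN"),
    ("witness", "NN"),
    ("harness", "NN"),
    ("witness", "VB"),
    ("kindnesses", "NNS"),
    ("ness", "NN"),
]

def matches(token, word_pattern, tag_pattern):
    """Return True if both patterns match the whole word and the whole tag.

    Assumption: like a CQL attribute test, each pattern must cover the
    entire attribute value, which re.fullmatch imitates.
    """
    word, tag = token
    return (re.fullmatch(word_pattern, word) is not None
            and re.fullmatch(tag_pattern, tag) is not None)

# Rough analogue of the worked example for nouns ending in -ness.
hits = [word for word, tag in tokens if matches((word, tag), r".+ness", r"NN")]
print(hits)
```

In this toy list, witness and harness are returned alongside goodness, although neither is a noun formed from an adjective plus –ness. This is the kind of overgeneration to look out for in your own results.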
Note: The search returns what is known as a KWIC (“key word in context”)
concordance, i.e. a list of the matched patterns together with the context in which
each occurs. You can see the whole context of a specific match by clicking on the
matched word or phrase itself, which is highlighted in red as shown below.
2.1 Creating a frequency list
Once you have a page of results, you can generate a frequency list, by clicking Node
forms under Frequency in the left menu (circled in the diagram above), which gives
you the types matched and their frequency.
Q2. Construct a frequency list for each of the last two queries above (Q1.e & Q1.f).
Eyeball the data:
a. What are the characteristics of the distribution?
b. Judging by the list of types, do you think that these are productive
morphological processes? Are there some cases where the complex adjectives
seem to be non-compositional?
Note: It may be easier to save the frequency lists to your desktop and load them into a
spreadsheet or statistics program, like Excel or SPSS. You can do this by using the
Save button (see below). This takes you to a form. You can leave all the fields as they
are, but be sure to set a large value for the maximum number of lines to save
(1 million should do it); otherwise you won’t save all the data.
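If you prefer a script to a spreadsheet, a saved frequency list can also be inspected with a few lines of Python. The layout of the saved file is not described in this tutorial, so the sketch below assumes one "form<TAB>frequency" pair per line. The sample data and the function names are invented for illustration. It counts types, tokens and hapax legomena, and estimates the slope of log frequency against log rank, which is close to -1 for an ideal Zipfian distribution.

```python
import math

# Made-up sample in the ASSUMED export layout: "form<TAB>frequency" per line.
sample_export = """non-existent\t120
non-profit\t60
non-linear\t40
non-trivial\t30
non-fluffy\t1
non-purple\t1
"""

def parse_frequency_list(text):
    """Parse 'form<TAB>frequency' lines into (form, count) pairs, most frequent first."""
    pairs = []
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue  # skip blank lines
        form, count = line.rsplit("\t", 1)
        pairs.append((form, int(count)))
    return sorted(pairs, key=lambda pair: pair[1], reverse=True)

def loglog_slope(counts):
    """Least-squares slope of log(frequency) against log(rank), ranks starting at 1."""
    xs = [math.log(rank) for rank in range(1, len(counts) + 1)]
    ys = [math.log(count) for count in counts]
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    return sxy / sxx

freq_list = parse_frequency_list(sample_export)
counts = [count for _, count in freq_list]
hapaxes = [form for form, count in freq_list if count == 1]

print("types:", len(freq_list), "tokens:", sum(counts), "hapaxes:", len(hapaxes))
print("log-log slope:", round(loglog_slope(counts), 2))
```

To use this on real data, replace sample_export with the contents of your saved file, after checking that the file really has the assumed layout.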
2.2 Computing productivity measures
Q3. Count the number of individual hapax legomena for each of your two adjective
queries from Q2 above. Based on raw counts, do they differ? What does this suggest
to you regarding their productivity?
Q4. Based on your frequency lists, compute the realised, expanding and potential
productivity coefficients for each of the two processes.
a. Do they come out roughly equal on any of the measures or are there
substantial differences? How do you interpret these results?
b. For each of the two morphological processes, what do you observe about the
three measures? Do you think they are roughly the same, or are they very
different?
c. (Slightly more challenging) For each of the two cases, compute a pairwise
correlation between the three measures (i.e. a correlation between realised and
expanding productivity, expanding and potential productivity, and realised and
potential productivity). You can use Pearson’s correlation (denoted r) for
this. What do you observe? How should a correlation be interpreted?
o Note: if you’ve never computed a correlation, don’t worry, we’ll
discuss it in class. It’s worth trying, however. You can find information
about Pearson’s r here:
http://en.wikipedia.org/wiki/Pearson_product-moment_correlation_coefficient
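The calculations in Q3 and Q4 can be sketched in Python as follows. This is a minimal illustration, not a definitive implementation. It assumes the usual definitions of the three measures: realised productivity is the number of types in the category, expanding productivity is the category's hapaxes divided by all hapaxes in the corpus, and potential productivity is the category's hapaxes divided by the category's tokens. Check these against the definitions given in the lectures. The toy counts and the function names are invented; the corpus-wide hapax count is the ukWaC figure quoted in Section 1.3.

```python
import math

UKWAC_HAPAXES = 1_949_571  # corpus-wide hapax count for ukWaC, as quoted in this tutorial

def productivity_measures(freqs, corpus_hapaxes):
    """Compute the three measures for one category, under the ASSUMED definitions.

    freqs: mapping from type to its token frequency within the category.
    corpus_hapaxes: number of hapax legomena in the whole corpus.
    """
    n_types = len(freqs)                                        # types in the category
    n_tokens = sum(freqs.values())                              # tokens in the category
    n_hapaxes = sum(1 for f in freqs.values() if f == 1)        # hapaxes in the category
    return {
        "realised": n_types,
        "expanding": n_hapaxes / corpus_hapaxes,
        "potential": n_hapaxes / n_tokens,
    }

def pearson_r(xs, ys):
    """Pearson's r for two equally long lists of paired observations."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)

# Invented toy frequency list for one process.
toy_counts = {"non-existent": 120, "non-profit": 60, "non-fluffy": 1, "non-purple": 1}
measures = productivity_measures(toy_counts, UKWAC_HAPAXES)
print(measures)

# Pearson's r needs several paired observations; these vectors are placeholders.
print(pearson_r([1, 2, 3], [2, 4, 6]))
```

Note that pearson_r needs several paired values per measure. How you obtain those pairs is left open here, and the vectors in the last line are placeholders only.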
Q5. Repeat the calculations for each process, but this time run your queries
on the BNC (you’ll need to return to the Home menu to select it). Note that the BNC
is a much smaller corpus. Moreover, it contains texts only up to the early 1990s
(whereas ukWaC is much more recent).
a. Are there more or fewer types for the two morphological processes in the
BNC, compared to ukWaC?
b. Do the two morphological processes come out equally productive based on the
BNC data? Why (not)?