LIN 3098 Corpus Linguistics
Practical Task III
1 Introduction
In today’s practical session, we’ll be using the Brown Corpus, a corpus of written
American English (ca. 1 million words). The purpose of the practical is mainly to gain
some hands-on experience with corpora marked up using the Extensible Markup
Language (XML).
2 The data
The corpus files are located in this folder:
o \\10.254.64.9\iol-shared
You can paste the address in Windows Explorer. You’ll be prompted for a username
and password.
o Enter your ITS username preceded by CSC\ (e.g. CSC\agat1).
o Enter your ITS password.
2.1 Contents of the folder
This folder contains two sub-folders:
o Text\ contains the original corpus files, in text format
o Xml\ contains the same corpus files in XML format
There is also a file called CONTENTS, which explains what the different filenames
mean. Read the CONTENTS file first! (You can open it in Notepad, Word or any other
program that can read plain text.)
For the practical today, you will also need to consult the list of part-of-speech tags
used in the corpus. A full listing of the symbols used can be found here:
o http://www.comp.leeds.ac.uk/ccalas/tagsets/brown.html
3 Eyeballing the data and getting used to XML
3.1 The text version
Subfolder Text\ contains the corpus files in a simple text format.
Open a couple of these files and take a look at them.
Answer the following questions:
1. Is there any way, within these text files, of identifying individual sentences?
2. What is the format in which tokens and tags are marked up?
3.2 The XML version
Subfolder Xml\ contains the same files in XML format. Open the same files you
looked at earlier, but this time in their XML versions.
XML files can be opened in a web browser like Firefox or Internet Explorer. Once
they are open, you’ll be able to see the XML document tree. Browse through it,
keeping the tagset listing in front of you. Make sure you understand what the different
XML tags mean.
Observe that the markup indicates sentence boundaries, word boundaries, and the
parts of speech of different words.
Answer the following questions, and discuss your answers with a partner:
3. Every XML document has a root node. What is the root of the documents in
this corpus?
4. You’ll notice that each token in the corpus is marked up using tags of the
following kind: <WORD CAT="...">word</WORD>. In this markup:
a. What is the tag name?
b. What is the attribute name?
5. On the tag listing, you’ll notice some compound tags, that is, symbols that
stand for two parts of speech. These usually have the format “TAG1+TAG2”,
for example:
o <WORD CAT="dt+bez">that's</WORD>
Why do you think this was done?
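If you would rather explore the XML files with a short script than in a browser, the following is a minimal sketch in Python using the standard library module xml.etree.ElementTree. Only the WORD element and its CAT attribute are taken from the markup shown above. The element names DOC and S, and the sample tokens other than "that's", are placeholders invented for illustration and are not the corpus's actual structure.

```python
import xml.etree.ElementTree as ET

# Hypothetical sample: DOC and S are placeholder element names.
# Only <WORD CAT="..."> comes from the handout.
SAMPLE = """
<DOC>
  <S>
    <WORD CAT="dt+bez">that's</WORD>
    <WORD CAT="at">the</WORD>
    <WORD CAT="nn">answer</WORD>
  </S>
</DOC>
"""

def read_tokens(xml_text):
    """Return (word, [tag parts]) pairs for every WORD element."""
    root = ET.fromstring(xml_text)
    tokens = []
    for word in root.iter("WORD"):
        # A compound tag such as "dt+bez" is split into its component tags.
        parts = word.get("CAT", "").split("+")
        tokens.append((word.text, parts))
    return tokens

tokens = read_tokens(SAMPLE)
for text, parts in tokens:
    print(text, parts)
```

To try this on a real corpus file, ET.parse(path).getroot() could be used in place of ET.fromstring, with the path pointing to one of the files in the Xml\ subfolder.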
4 Developing a part of speech tagset
Here’s a sample sentence in Maltese, with word-by-word glosses.
Ilbieraħ   ,  il-  Gvern       Malti    għamel  talba    għal  iżjed  għajnuna
yesterday  ,  the  government  Maltese  made    request  for   more   help
Your task is to develop a small POS tagset for this text.
NB: If you have difficulty with Maltese, feel free to do the same for a sentence of
a different language.
What sorts of POS tags would you need to mark up this text? In the case of Maltese,
think especially of:
o the major categories (noun, verb etc)
o their grammatical features (person, number, gender, etc)
o the fact that some tokens, such as the definite article, are attached (cliticised)
to a host
1. Write down the list of tags you would use.
2. Imagine that the above sentence (or an equivalent one in your preferred
language) is the only sentence in an entire document. Write the tagged version
of the document in XML format. (Remember that an XML document must have a
root node.)
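Once you have written your XML by hand, you may want to confirm that it is well-formed. The following is a minimal sketch in Python, using the standard library's xml.etree.ElementTree, that reports the name of the root node or signals that parsing failed. The element names (TEXT, S, W), the attribute name (POS) and the tag values are all hypothetical and used for illustration only. They are not a suggested answer to the task.

```python
import xml.etree.ElementTree as ET

def check_document(xml_text):
    """Return the root element's name, or None if the text is not well-formed XML."""
    try:
        root = ET.fromstring(xml_text)
    except ET.ParseError:
        return None
    return root.tag

# Hypothetical element and tag names, for illustration only.
good = '<TEXT><S><W POS="DET">the</W><W POS="N">cat</W></S></TEXT>'
# Two top-level elements and no single root: not a well-formed document.
no_root = '<S><W POS="DET">the</W></S><S><W POS="N">cat</W></S>'

print(check_document(good))
print(check_document(no_root))
```

The second string fails because it has two top-level elements, which is exactly the situation the reminder about the root node is meant to prevent.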