LIN 3098 Corpus Linguistics
Practical Task III

1 Introduction

In today's practical session, we'll be using the Brown Corpus, a corpus of written American English (ca. 1 million words). The main purpose of the practical is to gain some hands-on experience with corpora marked up using the Extensible Markup Language (XML).

2 The data

The corpus files are located in this folder:

o \\10.254.64.9\iol-shared

You can paste the address into Windows Explorer. You'll be prompted for a username and password:

o Enter your ITS username, preceded by CSC\ (e.g. CSC\agat1).
o Enter your ITS password.

2.1 Contents of the folder

This folder contains two sub-folders:

o Text\ contains the original corpus files, in text format
o Xml\ contains the same corpus files in XML format

There is also a file called CONTENTS, which explains what the different filenames mean. Read the CONTENTS file first! (You can open it in Notepad, Word or any other program that can read text.)

For the practical today, you will also need to consult the list of part-of-speech tags used in the corpus. A full listing of the symbols used can be found here:

o http://www.comp.leeds.ac.uk/ccalas/tagsets/brown.html

3 Eyeballing the data and getting used to XML

3.1 The text version

The Text\ sub-folder contains the corpus files in a simple text format. Open a couple of these files and take a look at them. Answer the following questions:

1. Is there any way, within these text files, of identifying individual sentences?
2. What is the format in which tokens and tags are marked up?

3.2 The XML version

The Xml\ sub-folder contains the same files in XML format. Open the same files you looked at earlier, but this time in their XML version. XML files can be opened in a web browser such as Firefox or Internet Explorer. Once a file is open, you'll be able to see the XML document tree. Browse through it, keeping the tagset listing in front of you. Make sure you understand what the different XML tags mean.
Observe that the markup indicates sentence boundaries, word boundaries, and the parts of speech of the individual words. Answer the following questions, and discuss your answers with a partner:

3. Every XML document has a root node. What is the root of the documents in this corpus?
4. You'll notice that each token in the corpus is marked up using tags of the following kind: <WORD CAT="...">word</WORD>. In this markup:
   a. What is the tag name?
   b. What is the attribute name?
5. On the tag listing, you'll notice some compound tags, that is, symbols that stand for two parts of speech. These usually have the format "TAG1+TAG2", for example:

   o <WORD CAT="dt+bez">that's</WORD>

   Why do you think this was done?

4 Developing a part of speech tagset

Here's a sample sentence in Maltese, with word-by-word glosses:

Ilbieraħ   ,  il-  Gvern       Malti    għamel  talba    għal  iżjed  għajnuna
yesterday  ,  the  government  Maltese  made    request  for   more   help

Your task is to develop a small POS tagset for this text. NB: If you have difficulty with Maltese, feel free to do the same for a sentence in a different language.

What sorts of POS tags would you need to mark up this text? In the case of Maltese, think especially of:

o the major categories (noun, verb, etc.)
o their grammatical features (person, number, gender, etc.)
o the fact that some tokens, such as the definite article, are attached (cliticised) to a host

1. Write down the list of tags you would use.
2. Imagine that the above sentence (or an equivalent one in your preferred language) is the only sentence in an entire document. Write the tagged version of the document in XML format. (Remember that an XML document must have a root node.)
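
If you'd like to check your XML by program rather than by eye, the following is a minimal Python sketch using the standard library's xml.etree.ElementTree module. It only illustrates the <WORD CAT="..."> markup described above and the single-root requirement. The element names DOC and S, the sample sentence, and the function names read_tokens and is_well_formed are assumptions made up for this illustration; they are not necessarily the names used in the corpus files, and the sample is not a model answer to the task.

```python
import xml.etree.ElementTree as ET

# Hypothetical mini-document. DOC (root) and S (sentence) are placeholder
# element names chosen for this sketch; only WORD and CAT come from the handout.
SAMPLE = """<DOC>
  <S>
    <WORD CAT="dt+bez">that's</WORD>
    <WORD CAT="at">the</WORD>
    <WORD CAT="nn">request</WORD>
  </S>
</DOC>"""


def read_tokens(xml_text):
    """Return the root tag name and a list of (word, CAT value) pairs."""
    root = ET.fromstring(xml_text)
    # iter("WORD") visits every WORD element anywhere below the root.
    tokens = [(w.text, w.get("CAT")) for w in root.iter("WORD")]
    return root.tag, tokens


def is_well_formed(xml_text):
    """Return True if the text parses as XML, False otherwise."""
    try:
        ET.fromstring(xml_text)
        return True
    except ET.ParseError:
        return False
```

Calling read_tokens(SAMPLE) gives the name of the root element together with each word and the value of its CAT attribute. Passing is_well_formed a text with two top-level elements and no enclosing root returns False, which is a quick way to confirm the root-node requirement on your own document.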