Ch. 2

McEnery & Wilson Ch 2. "What is a corpus and what is in it?"

FIND ANSWERS TO THESE QUESTIONS

1. CORPORA VS. MACHINE-READABLE TEXTS [quite important]

Is a corpus "any body of text"?

1.1. sampling and representativeness

What is sampling and why is it necessary?

How can we ensure that a corpus is maximally representative?

Can, in your opinion, a corpus really represent the whole of a language?

1.2. Finite size

Are all corpora finite in size?

Is the BNC a monitor corpus?

1.3. Machine-readable form?

Have all corpora been machine-readable?

1.4. A standard reference

What do McE & W mean by a corpus being a 'standard reference'?

2. TEXT ENCODING AND ANNOTATION

What is the difference between an unannotated corpus and an annotated corpus?

Name the 2-3 most important (in your view) of Leech's maxims of annotation.

What potential conflict of interest exists between a corpus end-user and its annotator?

2.1. FORMATS OF ANNOTATION

What are COCOA references?

In general terms, what is the difference between the COCOA approach and a more formal corpus annotation style, such as that recommended by the TEI (Text Encoding Initiative)?

Is the BNC a TEI-conformant corpus?

Do the TEI guidelines constrain the corpus processing methods (e.g. by recommending specific software)?

Is it easy to concordance a heavily-annotated corpus, in your opinion?

2.2. TYPES OF ANNOTATION

Which forms of linguistic annotation are the most common? What might explain their predominance?

2.2.1. TEXTUAL AND EXTRA-TEXTUAL INFORMATION [quite important]

What is the difference between textual / extra-textual annotation and linguistic annotation?

2.2.2. ORTHOGRAPHY

Why is the encoding of orthography a potential problem?

Why can transcription of spoken data be an encoding problem?

2.2.3. LINGUISTIC ANNOTATIONS [important, bar the heaviest technical sections]

a) PART OF SPEECH ANNOTATION

What is a POS-tag?

How can POS tagging help with the disambiguation of homographs?

What is a tagset?

What is the difference between a portmanteau tag and a ditto tag?

What is the current success rate of automatic POS taggers?

What do McE & W mean by the divisibility of POS tag names?

In what context do McE & W discuss 'reduced tagsets'?

b) LEMMATISATION

What is lemmatisation?

Can you think of a reason why lemmatisation is not widely applied in corpora?

c) PARSING

How is morphosyntactic tagging different from syntactic parsing?

What are treebanks?

What is the most common technique for the marking of syntactic patterns in a corpus?

What is the difference between full parsing and skeleton parsing?

Can parsing be performed automatically?

d) SEMANTICS

What are the two types of semantic annotation?

e) DISCOURSAL AND TEXT LINGUISTIC ANNOTATION

Provide a few examples of discourse phenomena that can be tagged in a corpus.

What is anaphoric annotation?

f) PHONETIC TRANSCRIPTION and g) PROSODY

What is the difference between phonetic transcription and annotation of prosody in a corpus?

What is prosody? How does it differ from 'semantic prosody'?

h) PROBLEM-ORIENTED TAGGING

How does problem-oriented tagging differ from other, standard forms of corpus annotation?

3. MULTILINGUAL CORPORA

What is the difference between parallel corpora and comparable corpora?

5. STUDY QUESTIONS [fairly important]

Consider question 3 and propose an answer: Can you see any disadvantages to annotated corpora?
