Fanny Meunier Computer tools for the analysis of learner corpora

(Source: Granger, S. (ed.) 1998. Learner English on Computer. Longman. Chapter 2)
FIND ANSWERS TO THE FOLLOWING QUESTIONS
1. Raw and annotated data
> What is a raw corpus and a tagged (= annotated) corpus?
> Name at least 3 forms of (linguistic, PK) annotation.
1.1 POS tagging
> What is the average success rate of automatic POS taggers?
> What is meant by the complexity/refinement of a tagset?
> Do we need special POS taggers for tagging interlanguage data?
> How can one determine the best tagger for a given research purpose?
1.2 Syntactic parsing
> What is (syntactic) parsing?
> What is the connection between POS tagging and parsing?
> Can parsing be performed automatically?
> What is 'skeleton' parsing?
> What is 'partial parsing'?
1.3 Semantic tagging
> Do we have automatic semantic taggers?
> Why would semantic tagging be of service to CLC research?
1.4 Discoursal tagging
> Are discourse taggers available?
1.5 Error tagging
> How can spellchecking help with the editing of non-spelling errors?
> Can error editing be automated?
> What is the potential advantage of research on error-tagged corpora?
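NOTE: error annotation schemes vary from project to project; the sketch below uses an invented `<err type='...'>` markup and invented type codes purely to show how an error-tagged corpus can be queried automatically. It is not Meunier's or Granger's actual tagset.

```python
import re

# Hypothetical error-tagged sentence; the markup and the type codes
# (GA, LS) are invented for illustration only
tagged = "I have <err type='GA'>a</err> good news and <err type='LS'>actual</err> problems"

# Once errors are tagged, counting and sorting them by type is mechanical
errors = re.findall(r"<err type='(\w+)'>(.*?)</err>", tagged)
print(errors)
```

This is the kind of query that makes error-tagged corpora attractive: the tags turn error retrieval into a simple search rather than a manual re-reading of the texts.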
2. Working with software tools to analyse interlanguage
2.1 General statistics
2.1.1 Word counting
> What is the advantage of word statistics obtained from the WORDS programme described by Meunier over simple word counts available e.g. in Microsoft Word?
> Can we use different word counting tools when comparing different corpora?
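NOTE: different tools give different totals for the same text because they tokenize differently. A minimal Python sketch (my own illustration; the two tokenization rules are assumptions, not those of WORDS or Microsoft Word):

```python
import re

text = "It's a well-known fact - isn't it?"

# One tool might simply split on whitespace, counting the lone dash
# and attached punctuation as words...
whitespace_tokens = text.split()

# ...while another treats apostrophes and hyphens as word-internal
# and drops punctuation, giving a different total for the same text
regex_tokens = re.findall(r"[A-Za-z]+(?:['-][A-Za-z]+)*", text)

print(len(whitespace_tokens), len(regex_tokens))  # 7 vs 6
```

This is why comparisons across corpora should use the same counting tool (and settings) throughout.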
2.1.2 Word/sentence statistics
> What is the difference between a sentence and a T-unit?
> What is the difference between non-native-speaker (NNS) and native-speaker (NS) writers' use of varied sentence lengths?
2.2 Lexical analysis
2.2.1 Frequency analysis
> How can frequency lists be applied to discover facts about learner language?
> What is the dispersion / distribution of an item in a corpus and why is this information useful?
> In general terms, what is a comparison of wordlists? (NOTE: WordSmith Tools is available at IFA on the local network and in the 603 multimedia lab, should you need any of its features; I hope to be able to give you a demo soon.)
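NOTE: a toy illustration of a frequency list plus a crude dispersion measure (here simply the number of corpus parts an item occurs in); the data and the range-based measure are my own assumptions, not Meunier's:

```python
from collections import Counter

# A toy "corpus" divided into parts (e.g. different texts or writers);
# the sentences are invented sample data
parts = [
    "the cat sat on the mat".split(),
    "the dog barked at the cat".split(),
    "a bird sang in a tree".split(),
]

# Overall frequency list across the whole corpus
freq = Counter(tok for part in parts for tok in part)

# Crude dispersion: in how many parts does each word occur?
dispersion = {w: sum(w in part for part in parts) for w in freq}

print(freq.most_common(2))
print(dispersion["the"], dispersion["bird"])
```

An item that is frequent overall but confined to one text or one writer tells a different story from one spread evenly across the corpus, which is why dispersion information matters alongside raw frequency.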
2.2.2 Context analysis
> NOTE: you should by now be familiar with all kinds of queries presented here; the specific syntax conventions described by Meunier are typical of WordSmith Tools and a few other off-line concordancers. Internet-based tools often require their own scripting conventions (so-called 'regular expressions') for 'multiple-item queries'.
> What is a stoplist?
> How does Meunier differentiate between 'collocation facilities' and 'collocation generators'?
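NOTE: to make the 'multiple-item query' and stoplist ideas concrete, here is a small Python sketch using the `re` module; the query pattern and the stoplist contents are invented for illustration and do not reproduce WordSmith's own syntax:

```python
import re

sentence = "I have made a lot of mistakes but I have also made progress"

# A 'multiple-item query' as a regular expression: 'make/made' followed
# by 'mistake(s)' with up to three intervening words
pattern = re.compile(r"\b(make|made)\b(?:\s+\w+){0,3}\s+mistakes?\b")
match = pattern.search(sentence)
print(match.group(0) if match else "no match")

# A stoplist filters out high-frequency function words, e.g. before
# listing the collocates of a node word
stoplist = {"i", "a", "of", "but", "have", "also"}
content_words = [w for w in sentence.lower().split() if w not in stoplist]
print(content_words)
```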
2.2.3 Lexical variation analysis
> What is the type-token ratio (TTR)?
> Can we use TTR to compare corpora/texts of different lengths?
> According to Meunier, why is TTR not a discriminating feature between NS and NNS writers?
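NOTE: the length-dependence of TTR is easy to demonstrate. The mini-texts and the fixed chunk size below are my own choices, but the chunked mean is similar in spirit to the standardised TTR offered by some corpus tools:

```python
def ttr(tokens):
    # Type-token ratio: number of distinct word forms / total word count
    return len(set(tokens)) / len(tokens)

# Repeating a text multiplies the tokens but not the types,
# so plain TTR inevitably drops as a text gets longer
short = "the cat sat on the mat".split()
long_ = short * 10

print(round(ttr(short), 2), round(ttr(long_), 2))

def standardized_ttr(tokens, chunk=6):
    # Mean TTR over fixed-size chunks: comparable across text lengths
    chunks = [tokens[i:i + chunk] for i in range(0, len(tokens) - chunk + 1, chunk)]
    return sum(ttr(c) for c in chunks) / len(chunks)

print(round(standardized_ttr(long_), 2))
```

The same vocabulary yields a TTR of 0.83 at 6 tokens but 0.08 at 60 tokens, which is why raw TTR cannot be used to compare texts of different lengths.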
2.2.4 Other lexical measures
> What is lexical density (LD)?
> Does LD depend on the length of a corpus?
> How can lexical sophistication be measured automatically?
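NOTE: lexical density is generally computed as the proportion of content (lexical) words among all words; the closed-class word list below is a hypothetical stand-in for the POS-based classification a real tool would use:

```python
# Hypothetical function-word (closed-class) list; a real tool would
# identify content vs function words from POS tags instead
FUNCTION_WORDS = {"the", "a", "an", "is", "are", "of", "in", "on", "and", "to"}

def lexical_density(tokens):
    # Proportion of content words among all running words
    content = [t for t in tokens if t not in FUNCTION_WORDS]
    return len(content) / len(tokens)

tokens = "the students wrote long essays in the quiet library today".split()
print(round(lexical_density(tokens), 2))  # 7 content words / 10 tokens
```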
2.3 Grammatical analysis
> What three techniques of querying a POS-tagged corpus can be applied to probe grammar use in a corpus?
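NOTE: once a corpus is POS-tagged (e.g. in word_TAG format), grammar queries reduce to pattern matching on tags alone, on words alone, or on word + tag combinations. The tiny sample and the tag names below are invented for illustration, not taken from any particular tagset:

```python
import re

# A hypothetical POS-tagged sample in word_TAG format
tagged = "She_PRON has_AUX been_AUX working_VERB hard_ADV ._PUNCT"

# Query by tag alone: retrieve all verbal forms regardless of the word
verbs = re.findall(r"(\w+)_(?:VERB|AUX)", tagged)
print(verbs)

# Query by word + tag combination: 'working' only when tagged as a verb,
# which disambiguates it from e.g. nominal uses
hits = re.findall(r"working_VERB", tagged)
print(len(hits))
```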
2.4 Syntactic analysis
> At what stage of advancement is corpus-based syntactic analysis?
3. Conclusion