2.4. Purpose of collecting a corpus
Each corpus is designed, collected and maintained to serve the designer’s purpose.
The general purposes for creating corpora can be categorized as follows:
1) Research: description of the language in use.
2) Business:
   a) publishing dictionaries and reference books;
   b) teaching: evaluating curricula, improving methods of teaching, producing teaching materials;
   c) translation: producing translator's aids.
A detailed description of these purposes, together with examples of research, is presented in Part 4.
2.5. Storing a corpus
Collections of written texts are stored in digital electronic form; this is why corpora are sometimes called machine-readable. Information technology allows users to access the data quickly via user-friendly interfaces.
A spoken corpus has to be recorded as sound and transcribed into text, usually without phonetic and prosodic detail. The written form of a spoken corpus is convenient for lexical and syntactic analysis of spoken language.
Nowadays it still seems impossible to analyse a spoken corpus in the form in which it is stored, because sound concordance programs have not been invented yet. There is a need for a sound concordance program and a sound annotation tool, and they will be developed sooner or later, as speech technology is now developing quite fast.
Written or spoken texts have to be transformed into digital data. Each language has its characteristic alphabet. Computers understand numbers only, and Unicode provides a unique number for every character in every language, which makes it a tool for encoding and storing texts in any language. To learn more about Unicode, visit http://www.unicode.org/ .
The names and codes of the Polish diacritics are the following:
00D3  Ó  Latin capital letter O with acute
00F3  ó  Latin small letter o with acute
0104  Ą  Latin capital letter A with ogonek
0105  ą  Latin small letter a with ogonek
0106  Ć  Latin capital letter C with acute
0107  ć  Latin small letter c with acute
0118  Ę  Latin capital letter E with ogonek
0119  ę  Latin small letter e with ogonek
0141  Ł  Latin capital letter L with stroke
0142  ł  Latin small letter l with stroke
0143  Ń  Latin capital letter N with acute
0144  ń  Latin small letter n with acute
015A  Ś  Latin capital letter S with acute
015B  ś  Latin small letter s with acute
0179  Ź  Latin capital letter Z with acute
017A  ź  Latin small letter z with acute
017B  Ż  Latin capital letter Z with dot above
017C  ż  Latin small letter z with dot above
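As an illustration (not part of the original list), the table above can be regenerated directly from the Unicode character database; the short Python sketch below uses only the standard unicodedata module:

    import unicodedata

    # Polish diacritics, given as Unicode code points (hexadecimal).
    CODE_POINTS = [0x00D3, 0x00F3, 0x0104, 0x0105, 0x0106, 0x0107,
                   0x0118, 0x0119, 0x0141, 0x0142, 0x0143, 0x0144,
                   0x015A, 0x015B, 0x0179, 0x017A, 0x017B, 0x017C]

    for cp in CODE_POINTS:
        ch = chr(cp)                    # the character itself
        name = unicodedata.name(ch)     # its official Unicode name
        print(f"{cp:04X}  {ch}  {name}")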
Plain text is the core of any written corpus, but it is hard to get at, since reading from a screen is not as convenient as reading from paper. There is a need for adding various markers, e.g. semantic or syntactic ones, that facilitate access to the data. Whatever is done with a corpus, the original plain text must be safely preserved. Original plain sound corpora should be protected even more carefully, because they still have to wait for new tools.
2.6. Annotations
2.6.1. Leech’s maxims
Plain text is called unannotated. Annotations, that is extra information, are added in order to retrieve and search the data faster and in a purposeful way.
In 1993 Leech formulated seven maxims concerning annotation:
1. It should be possible to remove the annotation from an annotated corpus in
order to revert to the raw corpus. At times this can be a simple process – for example,
removing every character after an underscore, so that "Claire_NP1 collects_VVZ
shoes_NN2" becomes "Claire collects shoes" (a minimal script for this is sketched after
the list). However, the prosodic annotation of the London-Lund corpus is interspersed
within words – for example, "g/oing" indicates a rising pitch on the first syllable of the
word "going" – meaning that the original words cannot be so easily reconstructed.
2. It should be possible to extract the annotations by themselves from the text. This
is the flip side of maxim 1. Taking points 1 and 2 together, the annotated corpus should
allow the maximum flexibility for manipulation by the user.
3. The annotation scheme should be based on guidelines which are available to the
end user. Most corpora have a manual which contains full details of the annotation
scheme and guidelines issued to the annotators. This enables the user to understand fully
what each instance of annotation represents without resorting to guesswork, and in cases
of ambiguity to understand why a particular annotation decision was made at that point.
You might want to look briefly at an example of the guidelines for part-of-speech
annotation of the BNC corpus
http://www.comp.lancs.ac.uk/computing/users/eiamjw/claws/claws7.html although this
page has restricted access.
4. It should be made clear how and by whom the annotation was carried out. A
corpus may be annotated manually, either by a single person or by a number of different
people; alternatively the annotation may be carried out automatically by a computer
program whose output may or may not be corrected by human beings.
5. The end user should be made aware that the corpus annotation is not infallible,
but simply a potentially useful tool. Any act of corpus annotation is, by definition, also
an act of interpretation, either of the structure of the text or of its content.
6. Annotation schemes should be based as far as possible on widely agreed and
theory-neutral principles. For example, parsed corpora often adopt a basic context-free
phrase structure grammar rather than implementing a narrower specific grammatical
theory such as Chomsky's Principles and Parameters framework.
7. No annotation scheme has the a priori right to be considered as a standard.
Standards emerge through practical consensus.
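A minimal Python sketch of maxims 1 and 2, assuming the underscore tagging format quoted in maxim 1:

    import re

    tagged = "Claire_NP1 collects_VVZ shoes_NN2"

    # Maxim 1: strip every tag to revert to the raw text.
    raw = re.sub(r"_\S+", "", tagged)
    print(raw)     # Claire collects shoes

    # Maxim 2 (the flip side): extract the annotations by themselves.
    tags = re.findall(r"_(\S+)", tagged)
    print(tags)    # ['NP1', 'VVZ', 'NN2']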
Although there are no fixed standards for annotation, some conventions have developed "through practical consensus". As Kahrel, Barnett and Leech (1997) pointed out: "Standardization of annotation practices can ensure that an annotated corpus can be used to its greatest potential". Standards may be developed on two levels: standard encoding of corpora and annotation, and standard annotation of corpora.
2.6.2. Types of annotations
Although there are no clear and obligatory standards, there are different types of annotation that proved to be useful in various corpora. McEnery and Wilson (2001) developed the following classification:
- Part-of-speech annotation
- Lemmatisation
- Parsing
- Semantic annotation
- Discoursal and text linguistic annotation
  o Pragmatic/stylistic annotation
- Phonetic transcription
- Prosody
- Problem-oriented tagging
Garside, Leech and McEnery (1997) also suggest orthographic annotation. However, orthography is the graphic representation of a text, not its linguistic interpretation, although in some cases graphic devices such as italics may mark a linguistic function.
What follows are examples of annotations listed above.
Part-of-speech annotation
Part-of-speech annotation identifies and marks the part of speech of each word.
Perdita&NN1-NP0; ,&PUN; covering&VVG; the&AT0; bottom&NN1; of&PRF; the&AT0; lorries&NN2;
with&PRP; straw&NN1; to&TO0; protect&VVI; the&AT0; ponies&NN2; '&POS; feet&NN2; ,&PUN;
suddenly&AV0; heard&VVD-VVN; Alejandro&NN1-NP0; shouting&VVG; that&CJT; she&PNP;
better&AV0; dig&VVB; out&AVP; a&AT0; pair&NN0; of&PRF; clean&AJ0; breeches&NN2; and&CJC;
polish&VVB; her&DPS; boots&NN2; ,&PUN; as&CJS; she&PNP; 'd&VM0; be&VBI; playing&VVG;
in&PRP; the&AT0; match&NN1; that&DT0; afternoon&NN1; .&PUN;
The codes used above are:
AJ0: general adjective
AT0: article, neutral for number
AV0: general adverb
AVP: prepositional adverb
CJC: co-ordinating conjunction
CJS: subordinating conjunction
CJT: that conjunction
DPS: possessive determiner
DT0: singular determiner
NN0: common noun, neutral for number
NN1: singular common noun
NN2: plural common noun
NP0: proper noun
POS: genitive marker
PNP: pronoun
PRF: of
PRP: preposition
PUN: punctuation
TO0: infinitive to
VBI: be
VM0: modal auxiliary
VVB: base form of lexical verb
VVD: past tense form of lexical verb
VVG: -ing form of lexical verb
VVI: infinitive form of lexical verb
VVN: past participle form of lexical verb
Source: McEnery http://www.ling.lancs.ac.uk/monkey/ihe/linguistics/corpus2/2fra1.htm
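A short Python sketch, assuming the word&TAG; format of the sample above, that splits the annotated text into (word, tag) pairs:

    import re

    tagged = ("Perdita&NN1-NP0; ,&PUN; covering&VVG; the&AT0; "
              "bottom&NN1; of&PRF; the&AT0; lorries&NN2;")

    # Each token has the shape word&TAG; - capture both parts.
    pairs = re.findall(r"(\S+)&([A-Z0-9-]+);", tagged)
    print(pairs[:4])
    # [('Perdita', 'NN1-NP0'), (',', 'PUN'), ('covering', 'VVG'), ('the', 'AT0')]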
Lemmatisation
Lemmatisation groups the inflected forms of a word, such as forms in -ed or -ing, under its base form, the lemma.
A lexeme is a unit of meaning; the variant forms of a lexeme make up its lemma. For example, goes, going, gone and went all belong to the lemma of go.
Not many corpora are lemmatised.
Examples by McEnery available at:
http://www.ling.lancs.ac.uk/monkey/ihe/linguistics/corpus2/2fra1.htm
See the example below: the first column contains text references, the second part-of-speech tags, the third the actual words from the text, and the fourth the lemmatised forms.
N12:0510g  PPHS1m  He        he
N12:0510h  VVDv    studied   study
N12:0510i  AT      the       the
N12:0510j  NN1c    problem   problem
N12:0510k  IF      for       for
N12:0510m  DD221   a         a
N12:0510n  DD222   few       few
N12:0510p  NNT2    seconds   second
N12:0520a  CC      and       and
N12:0520b  VVDv    thought   think
N12:0520c  IO      of        of
N12:0520d  AT1     a         a
N12:0520e  NNc     means     means
N12:0520f  IIb     by        by
N12:0520g  DDQr    which     which
N12:0520h  PPH1    it        it
N12:0520i  VMd     might     may
N12:0520j  VB0     be        be
N12:0520k  VVNt    solved    solve
N12:0520m  YF      +.        -
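A hedged sketch of automatic lemmatisation with the NLTK toolkit (one of the Natural Language Processing tools mentioned in section 7.1), applied to the variants of GO from the definition above:

    # pip install nltk; then fetch the WordNet data once:
    #   import nltk; nltk.download('wordnet')
    from nltk.stem import WordNetLemmatizer

    lemmatizer = WordNetLemmatizer()

    # The WordNet lemmatiser needs a part of speech; 'v' marks verbs.
    for form in ["goes", "going", "gone", "went"]:
        print(form, "->", lemmatizer.lemmatize(form, pos="v"))
    # Every form is mapped back to the lemma: go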
Parsing
Full parsing provides a detailed analysis of the structure of sentences.
Skeleton parsing uses a limited set of syntactic constituent types and ignores, for example,
the internal structure of certain constituent types.
Study the examples by McEnery at:
http://www.ling.lancs.ac.uk/monkey/ihe/linguistics/corpus2/2fra1.htm
Full parsing
[S[Ncs another_DT new_JJ style_NN feature_NN Ncs] [Vzb is_BEZ Vzb] [Ns the_AT1
[NN/JJ& wine-glass_NN [JJ+ or_CC flared_JJ JJ+]NN/JJ&] heel_NN ,_, [Fr[Nq
which_WDT Nq] [Vzp was_BEDZ shown_VBN Vzp] [Tn[Vn teamed_VBN Vn] [R up_RP
R] [P with_INW [Np[JJ/JJ/NN& pointed_JJ ,_, [JJ- squared_JJ JJ-] ,_, [NN+ and_CC
chisel_NN NN+]JJ/JJ/NN&] toes_NNS Np]P]Tn]Fr]Ns] ._. S]
This example was taken from the Lancaster-Leeds treebank available at McEnery’s website
http://www.ling.lancs.ac.uk/monkey/ihe/linguistics/corpus2/2fra1.htm
The syntactic constituent structure is indicated by nested pairs of labelled square brackets,
and the words have part-of-speech tags attached to them. The syntactic constituent labels
used are:
& whole coordination
+ subordinate conjunct, introduced
- subordinate conjunct, not introduced
Fr relative phrase
JJ adjective phrase
Ncs noun phrase, count noun singular
Np noun phrase, plural
Nq noun phrase, wh- word
Ns noun phrase, singular
P prepositional phrase
R adverbial phrase
S sentence
Tn past participle phrase
Vn verb phrase, past participle
Vzb verb phrase, third person singular to be
Vzp verb phrase, passive third person singular
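A simplified Python sketch (an illustration, not the original treebank software) that walks this bracket notation and prints the nesting, assuming constituent labels contain only letters and the characters & + / -:

    import re

    parsed = ("[Ns the_AT1 [NN/JJ& wine-glass_NN "
              "[JJ+ or_CC flared_JJ JJ+]NN/JJ&] heel_NN Ns]")

    # Tokens: '[LABEL' opens a constituent, 'LABEL]' closes one,
    # anything else is a word_TAG pair.
    tokens = re.findall(r"\[[A-Za-z&+/-]+|[A-Za-z&+/-]+\]|\S+", parsed)

    depth = 0
    for token in tokens:
        if token.startswith("["):
            depth += 1
            print("  " * depth + "begin " + token[1:])
        elif token.endswith("]"):
            print("  " * depth + "end   " + token[:-1])
            depth -= 1
        else:
            word, _, tag = token.rpartition("_")
            print("  " * (depth + 1) + word + " /" + tag)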
Skeleton Parsing
[S& [P For_IF [N the_AT members_NN2 [P of_IO [N this_DD1 university_NNL1
N]P]N]P] [N this_DD1 charter_NN1 N] [V enshrines_VVZ [N a_AT1 victorious_JJ
principle_NN1 N]V]S&] ;_; and_CC [S+[N the_AT fruits_NN2 [P of_IO [N that_DD1
victory_NN1 N]P]N] [V can_VM immediately_RR be_VB0 seen_VVN [P in_II [N the_AT
international_JJ community_NNJ [P of_IO [N scholars_NN2 N]P] [Fr that_CST [V
has_VHZ graduated_VVN here_RL today_RT V]Fr]N]P]V]S+] ._.
This example was taken from the Spoken English Corpus.
The two examples are similar, but in the case of skeleton parsing all noun phrases are simply
labelled with the letter N, whereas in the example of full parsing there are several types of
noun phrases which are distinguished according to features such as plurality. The only
constituent labels used in the skeleton parsing example are:
Fr relative clause
N noun phrase
P prepositional phrase
S& 1st main conjunct of a compound sentence
S+ 2nd main conjunct of a compound sentence
V verb phrase
Automatic parsing is not as effective as part-of-speech annotation, and human post-editing of parser output is necessary. However, human parsing is inconsistent, particularly when ambiguities occur.
Semantic marking
Based on: McEnery at:
http://www.ling.lancs.ac.uk/monkey/ihe/linguistics/corpus2/2semant.htm
The example below (Wilson 1996) is intended to give the reader an idea of the types of categories used in semantic annotation:
And       00000000
the       00000000
soldiers  23241000
platted   21072000
a         00000000
crown     21110400
of        00000000
thorns    13010000
and       00000000
put       21072000
it        00000000
on        00000000
his       00000000
head      21030000
and       00000000
they      00000000
put       21072000
on        00000000
him       00000000
a         00000000
purple    31241100
robe      21110321

The numeric codes stand for:

00000000  Low content word (and, the, a, of, on, his, they etc.)
13010000  Plant life in general
21030000  Body and body parts
21072000  Object-oriented physical activity (e.g. put)
21110321  Men's clothing: outer clothing
21110400  Headgear
23241000  War and conflict: general
31241100  Colour
The semantic categories are represented by 8-digit numbers. The scheme above is based on that used by Schmidt (1993) and has a hierarchical structure, in that it is made up of three top-level categories, which are themselves subdivided, and so on.
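A small Python sketch (an illustration of the lookup idea, not the original scheme's software) showing how such numeric tags can be stored and resolved:

    SEMTAGS = {
        "00000000": "Low content word",
        "13010000": "Plant life in general",
        "21030000": "Body and body parts",
        "21072000": "Object-oriented physical activity",
        "21110321": "Men's clothing: outer clothing",
        "21110400": "Headgear",
        "23241000": "War and conflict: general",
        "31241100": "Colour",
    }

    def describe(code: str) -> str:
        """Look up an 8-digit semantic tag; trailing zeros mark unused levels."""
        return SEMTAGS.get(code, "unknown category")

    tagged_words = [("soldiers", "23241000"), ("crown", "21110400"),
                    ("purple", "31241100")]
    for word, code in tagged_words:
        print(f"{word:10} {code}  {describe(code)}")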
Discoursal and text linguistic annotation
Source: McEnery http://www.ling.lancs.ac.uk/monkey/ihe/linguistics/corpus2/2discour.htm
Discourse tags
Stenström (1984) annotated the London-Lund spoken corpus with 16 "discourse tags". They
included categories such as:
"apologies" e.g. sorry, excuse me
"greetings" e.g. hello
"hedges" e.g. kind of, sort of thing
"politeness" e.g. please
"responses" e.g. really, that's right
Despite their potential role in the analysis of discourse, these kinds of annotation have never become widely used, possibly because the categories are context-dependent and their identification in texts is a greater source of dispute than for other forms of linguistic phenomena. Thus, annotations at the levels of discourse and text are rarely used.
Anaphoric annotation
Cohesion is the vehicle by which elements in a text are linked together, through the use of pronouns, repetition, substitution and other devices. Halliday and Hasan's Cohesion in English (1976) was considered a turning point in linguistics, as it was the most influential account of cohesion. Anaphoric annotation is the marking of pronoun reference; the pronoun system can only be described and understood by reference to large amounts of empirical data, in other words, corpora.
Anaphoric annotation currently has to be carried out by human analysts; indeed, one of the aims of such annotation is to produce data with which computer programs can be trained to carry out the task. There are only a few instances of corpora which have been anaphorically annotated; one of these is the Lancaster/IBM anaphoric treebank, an example of which is given below:
A039 1 v (1 [N Local_JJ atheists_NN2 N] 1) [V want_VV0 (2 [N the_AT (9 Charlotte_NP1
9) Police_NN2 Department_NNJ N] 2) [Ti to_TO get_VV0 rid_VVN of_IO (3 [N <REF=2
its_APP$ chaplain_NN1 N] 3) ,_, [N {{3 the_AT Rev._NNSB1 Dennis_NP1 Whitaker_NP1 3}} ,_,
38_MC N]N]Ti]V] ._.
The above text has been part-of-speech tagged and skeleton parsed, as well as anaphorically
annotated. The following codes explain the annotation:
(1 1) etc. – noun phrase which enters into a relationship with anaphoric elements in the text
<REF=2 – referential anaphor; the number indicates the noun phrase which it refers to - here
it refers to noun phrase number 2, the Charlotte Police Department
{{3 3}} – noun phrase entering into an equivalence relationship with a preceding noun phrase; here the Rev. Dennis Whitaker is identified as being the same referent as noun phrase number 3, its chaplain
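A minimal sketch (purely illustrative, not the treebank's software) of how such anaphoric links can be represented once extracted from the annotation:

    # Noun phrases indexed by the numbers used in the annotation.
    noun_phrases = {
        2: "the Charlotte Police Department",
        3: "its chaplain",
    }

    # <REF=2 style anaphors: (anaphor, index of its antecedent).
    anaphors = [("its", 2)]
    # {{3 ... 3}} style equivalences: (phrase, index it corefers with).
    equivalences = [("the Rev. Dennis Whitaker", 3)]

    for anaphor, idx in anaphors:
        print(f"'{anaphor}' refers to NP {idx}: {noun_phrases[idx]}")
    for phrase, idx in equivalences:
        print(f"'{phrase}' is the same referent as NP {idx}: {noun_phrases[idx]}")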
Phonetic transcription
Phonetic transcription can be done only by trained and skilled humans, not by computers. Thus it is costly and time-consuming.
Prosody
Prosody describes the stress, intonation and rhythm of a sentence.
Examples by McEnery
http://www.ling.lancs.ac.uk/monkey/ihe/linguistics/corpus2/2fra1.htm
The example below is taken from the London-Lund corpus:
1 8 14 1470 1 1 A 11 ^what a_bout a cigar\ette# .
1 8 15 1480 1 1 A 20 *((4 sylls))*
1 8 14 1490 1 1 B 11 *I ^w\on't have one th/anks#* - - -
1 8 14 1500 1 1 A 11 ^aren't you .going to sit d/own# -
1 8 14 1510 1 1 B 11 ^[/\m]#
1 8 14 1520 1 1 A 11 ^have my _coffee in p=eace# - - -
1 8 14 1530 1 1 B 11 ^quite a nice .room to !s\it in ((actually))#
1 8 14 1540 1 1 B 11 *^\isn't* it#
1 5 15 1550 1 1 A 11 *^y/\es#* - - -
The codes used in this example are:
# end of tone group
^ onset
/ rising nuclear tone
\ falling nuclear tone
/\ rise-fall nuclear tone
_ level nuclear tone
[] enclose partial words and phonetic symbols
. normal stress
! booster: higher pitch than preceding prominent syllable
= booster: continuance
(( )) unclear
* * simultaneous speech
- pause of one stress unit
Judgments on prosody are subjective and inconsistent, just as those on discourse.
Problem-oriented tagging
According to Leech’s maxims, every researcher can add annotations relevant to their
research. It is important to keep this category in mind; however, it is impossible to make any
general statements about it.
2.6.3. Initiatives and projects in the standardization of corpus annotation
There have been several projects aimed at the standardisation of corpus annotation. Here are some of the more prominent ones, carried out at UCREL (University Centre for Computer Corpus Research on Language), Lancaster University.
Information about current and previous projects can be found at:
http://www.comp.lancs.ac.uk/computing/research/ucrel/projects.html
CLAWS (Constituent-Likelihood Automatic Word-Tagging System)
This is a system for assigning to each word in a text an unambiguous indication of the grammatical class to which the word belongs.
The system consists of five separate stages of tagging:
1. A pre-editing phase: preparing the text for tagging, partly automatic, partly manual (PREEDIT).
2. Tag assignment: adding a set of candidate tags to each word, without looking at the context (WORDEDIT).
3. Idiom tagging: looking at specific contexts to limit the tags (IDIOMTAG).
4. Tag disambiguation: assigning the most probable tag to words that received more than one tag at stage 2 (CHAINPROBS).
5. A post-editing phase: a manual process in which errors made by the computer are corrected, plus a reformatting stage (LOBFORMAT).
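A hedged Python sketch of this five-stage architecture (an illustration of the pipeline idea only, not the actual CLAWS code; the function bodies and the tiny lexicon are invented for the example):

    def preedit(text):            # 1. PREEDIT: prepare the raw text (simplified)
        return text.split()

    LEXICON = {"the": ["AT0"], "dog": ["NN1"], "barks": ["NN2", "VVZ"]}

    def assign_tags(words):       # 2. WORDEDIT: all candidate tags, no context
        return [(w, LEXICON.get(w, ["NN1"])) for w in words]

    def idiom_tag(tagged):        # 3. IDIOMTAG: fixed phrases could be retagged here
        return tagged

    def disambiguate(tagged):     # 4. CHAINPROBS: the real system uses tag-sequence
        #    probabilities; this placeholder just keeps the first candidate.
        return [(w, tags[0]) for w, tags in tagged]

    def postedit(tagged):         # 5. LOBFORMAT: manual correction + reformatting
        return " ".join(f"{w}_{t}" for w, t in tagged)

    print(postedit(disambiguate(idiom_tag(assign_tags(preedit("the dog barks"))))))
    # the_AT0 dog_NN1 barks_NN2   (stage 4 would really choose VVZ from context)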
TEI
The Text Encoding Initiative (TEI) is sponsored by the Association for Computational Linguistics, the Association for Literary and Linguistic Computing, and the Association for Computers and the Humanities.
Its aim is to develop standardised formats for machine-readable texts.
TEI uses the document markup language SGML (Standard Generalized Markup Language).
For more information about the SGML visit http://www.w3.org/MarkUp/SGML/
The corpus encoding standard is available at http://www.cs.vassar.edu/CES/
3. Examples of corpora
3.1. Internet resources with monolingual corpora
Rapid development in corpus linguistics, parallel to developments in communication technologies, has resulted in a large number of websites with links to various corpora. Building a corpus and researching it is still a time-consuming business; thus existing corpora are maintained with care and protection. It is easy, however, to create a website with links to corpora, and it takes seconds to move a website from one server to another. Therefore, the aim of this chapter is to present existing corpora in many languages and links to them. It should be made clear that the author is aware that by the time the book is complete some of the links will have disappeared, and by the time the reader gets the text plenty of them will be inactive; but there will be others – even better and more complete – that will have been launched. So, instead of getting annoyed with an inactive link, Dear Reader, use a search engine (such as google.com, kartoo.com, iboogie.com, hotbot.com), type a key phrase such as "corpus linguistics", "multilingual corpora" or "language corpus", and you will find the latest news about corpora.
An extensive list of corpora is available at:
http://www.essex.ac.uk/linguistics/clmt/w3c/corpus_ling/content/corpora/list/index2.html
Examples of corpora in various languages:
1. English
- The Bank of English – written and spoken English (used extensively by researchers and for the COBUILD series of English language books)
- The BNC (British National Corpus) – written and spoken British English (used extensively by researchers, and by the Oxford University Press, Chambers and Longman publishing houses)
- CANCODE (Cambridge Nottingham Corpus of the Discourse of English) – spoken British English (used extensively by researchers and Cambridge University Press)
- ICE (International Corpus of English) – international varieties of spoken and written English (most of the corpus is not yet available)
- Brown University Corpus & LOB (Lancaster-Oslo-Bergen) Corpus – parallel corpora of written texts (but now rather outdated)
- London-Lund Corpus (Survey of English Usage) – spoken British English (used very extensively by researchers, but now quite old)
- Santa Barbara Corpus – spoken American English (most of the corpus is not yet available)
- Hong Kong Corpus of Spoken English (still being compiled; 1 million out of the target 1.5 million words have been collected so far)
- ICAME (International Computer Archive of Modern English) – a centre which aims to coordinate and facilitate the sharing of computer-based corpora
- Translational English Corpus (CTIS) – the first and only computerised corpus of translated English in the world (currently approaching 10 million words). It has spearheaded the development of a unique research methodology which has informed the work of several PhD students and various research programmes around the world. Online access: http://ronaldo.cs.tcd.ie/tec/jnlp/
2. French
- Association des Bibliophiles Universels – various French literary works. Online access: http://cedric.cnam.fr/ABU/
- American and French Research on the Treasury of the French Language (ARTFL) – a 150-million-word corpus of various genres of French. You have to be a member to use it (but membership is fairly cheap). Available at: http://humanities.uchicago.edu/ARTFL/ARTFL.html
3. German
- COSMAS Corpus, available at: http://corpora.ids-mannheim.de/~cosmas/ – large (over a billion words!) online-searchable German and Austrian corpora. This is the publicly available part of the 1.85-billion-word Mannheimer Corpus Collection: http://www.ids-mannheim.de/kt/corpora.shtml
- NEGRA Corpus – Saarland University Syntactically Annotated Corpus of German Newspaper Texts. Available free of charge to academics. 20,000 sentences, tagged and with syntactic structures. Online access: http://www.coli.uni-sb.de/sfb378/negra-corpus/
4. Polish
- PWN Corpus, available at: http://www.pwn.com.pl
- PELCRA Corpora
- IPI PAN Corpus: www.korpus.pl
- Spoken corpus collected by Agnieszka Otwinowska-Kasztelanic
- Polish Virtual Library (Polska Biblioteka Internetowa), accessible at: http://www.pbi.edu.pl/
5. Russian
- Library of Russian Internet Libraries – various literary works. Online access: http://www.orc.ru/~patrikey/liblib/enauth.htm
6. Spanish and Portuguese
- TychoBrahe Parsed Corpus of Historical Portuguese – over a million words of Portuguese from different historical periods, some of it morphologically analysed/tagged. Available free of charge at: http://www.ime.usp.br/~tycho/corpus/
- Folha de S. Paulo newspaper – 4 annual CD-ROMs with full text. Online access: http://www.publifolha.com.br/
- COMPARA – Portuguese-English parallel corpus (in general, various resources at the Linguateca site). Available at: http://www.linguateca.pt/COMPARA/Welcome.html
3.2. Multilingual and parallel corpora
- OPUS – an open-source parallel corpus, aligned, in many languages, based on free Linux etc. manuals. Online access: http://logos.uio.no/opus/
- Searchable Canadian Hansard French-English parallel texts (1986-1993). Online access: http://www-rali.iro.umontreal.ca/TransSearch/TS-simple-uen.cgi
- European Union web server – parallel text in all EU languages. Online access: http://europa.eu.int/
- TELRI CD-ROMs. Online access: http://www.telri.bham.ac.uk/cdrom.html
- Parallel and other text in central and eastern European languages. Online access: http://stp.ling.uu.se/~corpora/
What follows is an overview of information about parallel corpora that is available on the web.
- ACL SIGLEX Parallel Corpora. Online access: http://www.clres.com/paracorp.html
  A collection of links to publicly available parallel corpora. The collection is maintained by Ken Litkowski of the ACL Special Interest Group on the Lexicon.
- The BAF Corpus. Online access: http://www-rali.iro.umontreal.ca/arc-a2/BAF/
  An aligned French-English corpus containing approximately 450,000 words per language from different sources. Supplied by Michel Simard of the Laboratoire de Recherche Appliquée en Linguistique Informatique in Montreal, Canada (in French).
- INTERSECT: Parallel Corpora and Contrastive Linguistics. Online access: http://www.brighton.ac.uk/edusport/languages/html/intersect.html
  A project at the University of Brighton, United Kingdom, in which parallel texts in French and English are being constructed and analysed.
- The Lingua Project. Online access: http://spraakbanken.gu.se/lb/pedant/parabank/node4.html
  An excellent description of the Lingua Parallel Concordancing Project, which aims at managing a multilingual corpus to ease students' and teachers' work in second language learning. 11 organisations from 6 different countries participate in this project.
- Linguistic Data Consortium. Online access: http://www.ldc.upenn.edu/
  The Linguistic Data Consortium at the University of Pennsylvania, USA supplies a big parallel corpus of United Nations texts in English, French and Spanish.
- Michael Barlow's Parallel Corpora Page. Online access: http://www.ruf.rice.edu/~barlow/para.html
  An overview page about global research on parallel corpora. Michael Barlow also maintains a general Corpus Linguistics page at: http://www.ruf.rice.edu/~barlow/corpus.html
3.3. Other Internet resources – potential corpora
- Project Gutenberg
- The Oxford Text Archive
- CETH – Directory of Electronic Text Centers
- The EServer: Accessible Online Publishing
- IPL Online Text Collection (Internet Public Library)
- BiVir – Virtual Library (universal literature in the Galician language)
- Galician Virtual Library (Galician literature)
- Vercial Project (Portuguese literature)
- The WWW Virtual Library: Linguistics
- The Applied Linguistics WWW Virtual Library
- The World Lecture Hall
- Virtual Reference Collection
Conclusion
There are many free and commercial resources available on the Internet. There is still a need for more.
Part 2. Tools for computerizing a corpus
Corpora are machine-readable; nowadays they are all stored as digital data. In order to make use of them, we need tools for searching, annotating and analysing. Such tools must have two main features: they have to be user-friendly for researchers – linguists – and they have to be effective in operating on textual (not numerical) data.
Creating such tools is a challenge for teams of linguists and computer program analysts, as well as for designers; it is a part of computational linguistics. On the one hand, a language should be described in a way that can be stored in digital form. This is not easy, taking into consideration a language's complexity, its flexibility, ambiguity, capacity for an unlimited number of utterances, change over time, and the unlimited creativity of its users. What is more, linguistic researchers have to identify their needs clearly in order to provide programmers with clear guidelines for the project design.
On the other hand, the interface has to be user-friendly and intuitive to operate, even for an inexperienced computer user. Access to the data has to be fast and reliable. Operating on large collections of texts in various languages, annotating them, and then searching and analysing them according to different research questions is a challenge for software analysts. As some readers may not have any background in computer science, let me remind them that the only thing a computer can do is perform instructions – nothing else.
Rundell and Stock (1992) stated that software empowers linguists to “focus their creative
energies to doing what machines cannot do”.
In this part, various tools for extracting linguistic information from corpora are presented: concordance programs, parsers, taggers and other tools used in corpus linguistics.
4. Concordance programs
4.1. The nature of words
Studying the environments, linguistic and computerised – the nature of texts:

Linguistic terms        Computer-readable form
Letters                 Characters
Words                   Words
Phrases                 Lines
Clauses
Sentences
Paragraphs
The logical organization is the same but the physical form is different.
Concordance programs make it possible to see the context of an individual keyword. A keyword (node word) is a string of characters. A concordance program finds all occurrences of the keyword in the corpus and presents the results of the search in an appropriate format.
There are various display options available:
- variable-length KWIC (KeyWord In Context)
- sentence context
- paragraph context
- whole-text browsing
Concordance software can be insensitive to case, so that, for example, the and The are recognised as the same word. Multiple keyword patterns use a wildcard symbol (*), which stands for any sequence of characters following (or preceding) the letters given. For example, the keyword part* gives the following results: parted, partiality, participated, particulars, partly, party, etc.
Sorting options are the following:
- the sequence in which the lines occurred in the original texts
- the keyword itself, if wildcards are used
- the words to the right or to the left of the keyword (e.g. without a direct object on the right, or with a direct object)
- a user-allocated category (e.g. nouns or verbs on the right)
Output options are as follows:
- direct output to a printer
- output to a file
Other options include:
- line numbers
- text references
- frequency of collocations
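A minimal Python sketch of a KWIC concordancer with wildcard support (an illustration of the idea, not any of the programs listed below):

    import re

    def kwic(text, pattern, width=20):
        """Print keyword-in-context lines; '*' matches any trailing characters."""
        regex = re.compile(r"\b" + pattern.replace("*", r"\w*") + r"\b",
                           re.IGNORECASE)   # case-insensitive, like most tools
        for m in regex.finditer(text):
            left = text[max(0, m.start() - width):m.start()]
            right = text[m.end():m.end() + width]
            print(f"{left:>{width}} [{m.group()}] {right:<{width}}")

    corpus = "They parted at the party, partly because the parting was sad."
    kwic(corpus, "part*")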
4.2. Examples of software
Here are some examples of software:
a. WordSmith Tools demo 3.0 – the demo version, with a limited number of output lines, is free: http://www1.oup.co.uk/elt/catalogue/Multimedia/WordSmithTools3.0/download.html
   WordSmith Tools demo 4.0: http://www.lexically.net/wordsmith/version4/
   Its modules include:
   i. Concord
   ii. WordList
   iii. KeyWords
b. MicroConcord – free demo: http://www.liv.ac.uk/~ms2928/software/
c. MonoConc – demo: http://www.camsoftpartners.co.uk/monoconc.htm and http://www.athel.com/mono.html#monopro
d. Simple Concordance Program – free: http://www.textworld.com/scp/
e. AntConc – free: http://www.antlab.sci.waseda.ac.jp/
4.3. Internet concordancers
1. WebCorp http://www.webcorp.org.uk/index.html
2. Glossanet http://glossa.fltr.ucl.ac.be/
3. KwicFinder http://www.kwicfinder.com/KWiCFinder.html
4. WebConc http://www.niederlandistik.fu-berlin.de/cgi-bin/webconc.cgi?sprache=en&art=google
5. Lexware Culler http://82.182.103.45/lexware/concord/culler.html
5. Taggers
a. Brill's part-of-speech tagger
Online access: http://www.cs.jhu.edu/~brill/
b. Free CLAWS WWW trial service
Online access: http://www.comp.lancs.ac.uk/computing/research/ucrel/claws/trial.html
c. Free trial service at Conexor
Online access: http://www.connexor.com/demos/index.html
6. Parsers
a. Link grammar
Online access: http://www.link.cs.cmu.edu/link/
b. Apple pie parser
Online access: http://nlp.cs.nyu.edu/app/
c. ENGCG: Constraint Grammar Parser of English
Online access: http://www.lingsoft.fi/cgi-bin/engcg
7. Other tools
7.1. NLP tools
Here are examples of various NLP (Natural Language Processing) tools available on the Internet:
- Paul Nation's RANGE. Online access: http://www.vuw.ac.nz/lals/staff/paul-nation/nation.aspx
- Kolokacje: http://www.mimuw.edu.pl/polszczyzna/
- Software Tools: http://lingo.lancs.ac.uk/devotedto/corpora/software.htm
- TextLadder: http://www.readingenglish.net/software/ordinarylicense.htm
- Tools for Natural Language Processing
- Multilingual Corpus Toolkit (on a CD)
- Paul Nation's VocabProfile: http://www.er.uqam.ca/nobel/r21270/cgi-bin/webfreqs/web_vp.cgi and http://www.er.uqam.ca/nobel/r21270/cgi-bin/webfreqs/vp_research.html
- WordCruncher
- Natural Language Toolkit: http://linguistlist.org/issues/15/15-610.html#1
- TACT: Textual Analysis Computing Tools: http://www.chass.utoronto.ca/tact/TACT/tact0.html and http://www.indiana.edu/~letrs/help-services/QuickGuides/about-tact.html
- TACTweb: http://tactweb.humanities.mcmaster.ca/tactweb/doc/tact.htm
- Text analysis software: http://etext.lib.virginia.edu/textual.html
- Provalis Research analysis software: http://www.simstat.com/home.html
- WordStat v4.0 – Content Analysis & Text Mining module for Simstat and QDA Miner
- Protan – content analysis: http://www.psor.ucl.ac.be/protan/protanae.html
7.2. Potential tools for NLP
- SIMSTAT v2.5 – statistical software
- QDA Miner v1.0 – qualitative data analysis software
- ORIANA v2.0 – statistical analysis for circular data
- MVSP v3.1 – MultiVariate Statistical Program (PCA, correspondence, cluster, etc.)
- PracticeMill v1.2 (for teachers, trainers & students)
- ITALASSI v1.1 – interaction viewer (free)
- STATITEM v1.2 – classical item analysis module
- EASY FACTOR ANALYSIS v4.0 – principal component/factor analysis
7.3. Tools for Data-Driven Learning (DDL)
Tom Cobb's The Compleat Lexical Tutor: http://www.corpus-linguistics.de/ddl/ddl.html (free).