
NLP class notes

Any corpus that is not regularly updated rapidly becomes unrepresentative.
The relevance of permanence in corpus design actually depends on how we view a corpus, that
is, whether a corpus should be viewed as a static or dynamic language model.
The static view typically applies to a sample corpus, whereas a dynamic view applies to a
monitor corpus.
A monitor corpus is primarily designed to track changes from different periods.
Monitor corpora are constantly (e.g., annually, monthly, or even daily) supplemented with fresh
material and keep increasing in size.
CORPORA
Types of corpora
GENERAL
SPECIALIZED
While both general and specialized corpora should be representative of a language or language
variety, they have different criteria for representativeness.
The representativeness of a general corpus depends heavily on sampling from a broad range of
genres, whereas the representativeness of a specialized corpus, at the lexical level at least, can
be measured by the degree of closure.
For a general corpus, it is understandable that it should cover, proportionally, as many text types
as possible so that the corpus is maximally representative of the language or language variety it
is supposed to represent.
Closure/saturation for a particular linguistic feature (e.g., size of lexicon) of a language variety
(e.g., computer manuals) means that the feature appears to be finite or is subject to very limited
variation beyond a certain point.
To measure the saturation of a corpus, the corpus is first divided into segments of equal size
based on its tokens.
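As a rough illustration of this segment-based measure, closure can be tracked by counting how many previously unseen word types each successive segment contributes; the tiny corpus and segment size below are toy assumptions, not a real measurement procedure.

```python
# Sketch: measuring lexical closure/saturation by splitting a corpus into
# equal-sized token segments and tracking new-vocabulary growth.
# `tokens` is a hypothetical pre-tokenized corpus (a list of strings).

def closure_curve(tokens, segment_size):
    """Return the number of previously unseen word types added by each segment."""
    seen = set()
    new_types_per_segment = []
    for start in range(0, len(tokens), segment_size):
        segment = tokens[start:start + segment_size]
        new_types = {t for t in segment if t not in seen}
        new_types_per_segment.append(len(new_types))
        seen.update(new_types)
    return new_types_per_segment

# A specialized corpus approaches saturation when later segments contribute
# few or no new types.
curve = closure_curve("the cat sat on the mat the cat ran".split(), 3)
print(curve)  # [3, 2, 1] -> fewer new types in later segments
```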
A general corpus that contains both written and spoken data (e.g., the BNC) is balanced.
Balance and representativeness are important considerations in corpus design; they depend on
the research question and the ease with which data can be captured, and thus must be
interpreted in relative terms.
Any claim of corpus balance is largely an act of faith rather than a statement of fact as, at
present, there is no reliable scientific measure of corpus balance.
SAMPLING
A sample is assumed to be representative if what we find for the sample also holds for the
general population.
In the statistical sense, samples are scaled-down versions of a larger population.
Given that language is infinite whereas a corpus is finite in size, sampling is unavoidable in
corpus creation.
Corpus representativeness and balance are closely associated with sampling.
Sampling must be done properly so that the resulting corpus is as balanced and representative
as possible.
CORPUS DESIGN
In corpus design, a population can be defined in terms of
Language production,
Language reception, or
Language as a product.
The first two designs are basically demographically oriented, as they use the demographic
distribution (e.g., age, sex, social class) of the individuals who produce/receive language data to
define the population (e.g., the BNC).
The Brown and LOB corpora were created using the criterion of language as a product.
SAMPLING TECHNIQUES
Simple random sampling:
All sampling units within the sampling frame are numbered and the sample is chosen by use of
a table of random numbers.
Stratified random sampling:
It first divides the whole population into relatively homogeneous groups (so called strata) and
then samples each stratum at random.
The number of samples across text categories should be proportional to their frequencies
and/or weights in the target population in order for the resulting corpus to be considered
representative.
In the Brown and LOB corpora, for example, the target population for each corpus was first
grouped into 15 text categories, such as news reportage, academic prose, and different types of
fiction; samples were then drawn from each text category.
Demographic sampling, which first categorizes sampling units in the population on the basis of
speaker/writer age, sex, and social class, is also a type of stratified sampling.
Biber (1993) observes that a stratified sample is never less representative than a simple
random sample.
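The two techniques above can be contrasted in a short sketch; the population of categorized texts, the category names, and the counts are invented for illustration.

```python
import random

# Sketch: simple random vs. proportional stratified sampling over a
# hypothetical population of texts labelled with their category (stratum).

population = ([("news", f"news_{i}") for i in range(600)]
              + [("fiction", f"fic_{i}") for i in range(300)]
              + [("academic", f"acad_{i}") for i in range(100)])

def simple_random_sample(pop, n):
    """Every unit has an equal chance; strata proportions may drift."""
    return random.sample(pop, n)

def stratified_sample(pop, n):
    """Sample each stratum at random, proportionally to its population share."""
    strata = {}
    for category, text in pop:
        strata.setdefault(category, []).append((category, text))
    sample = []
    for members in strata.values():
        k = round(n * len(members) / len(pop))
        sample.extend(random.sample(members, k))
    return sample

sample = stratified_sample(population, 100)
counts = {c: sum(1 for cat, _ in sample if cat == c)
          for c in ("news", "fiction", "academic")}
print(counts)  # {'news': 60, 'fiction': 30, 'academic': 10}
```

The stratified sample reproduces the population proportions exactly, whereas a simple random sample of 100 would only approximate them.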
SAMPLE SIZE
With written language, should we sample
● Full texts (i.e., whole documents)?
Useful in text linguistics, but raises copyright issues; the coverage of a corpus
including full texts may not be as balanced.
● Text chunks?
Initial, middle, or end chunks.
In creating a balanced,representative corpus, stratified random sampling is preferred over
simple random sampling.
For written texts, a text typology should be established based on external criteria.
For spoken data, demographic sampling is appropriate.
DATA CAPTURE AND COPYRIGHT
Electronic data is preferred over paper-based material in building DIY corpora.
The world wide web (www) is an important source of machine- readable data for many
languages.
Some university libraries, e.g.:
http://lib.virginia.edu/digital/collections/text/
http://onlinebooks.library.upenn.edu/
provide large amounts of publicly accessible electronic texts.
DATA DOWNLOAD
A number of tools can be used for downloading data in one go.
● Grab-a-Site or HTTrack
● WebGetter (part of the WordSmith Tools)
● Multilingual Corpus Toolkit
A major issue in data collection is copyright:
copyright-free material is usually old and may not be useful.
CORPORA
In terms of purpose, corpora are typically of two types:
● Commercial: creators of commercial corpora usually reach an agreement with copyright
holders as to how the profit will be shared.
● Non-profit-making academic research.
Copyright issues in corpus creation are complex and unavoidable.
“Proceed with caution.”
CORPUS MARKUP AND ANNOTATION
Data collected using a sampling frame forms a raw corpus, so it needs to be processed
before use.
Spoken data needs to be transcribed from audio/video recordings;
Written texts may need to be rendered machine readable, if they are not already, by
keyboarding or OCR scanning.
In order to extract linguistic information from a corpus, such information must first of all be
encoded in the corpus, a process that is technically known as "corpus annotation".
Corpus markup is a system of standard codes inserted into a document stored in electronic
form to provide information about the text itself (i.e., text metadata) and govern formatting,
printing, or other processing (i.e., structural organization).
Two types of markup:
● Metadata markup can be embedded in the same document or stored in a separate but
linked document (embedded vs. stand-alone annotation),
● Structural markup has to be embedded in the text.
One of the earliest markup schemes was COCOA.
COCOA references consist of a set of attribute names and values enclosed in angle brackets.
E.g., <A WILLIAM SHAKESPEARE>, where A (author) is the attribute name and WILLIAM
SHAKESPEARE is the attribute value.
COCOA references encode only a limited set of features, such as authors, titles, and dates.
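A minimal sketch of pulling COCOA-style references out of a text with a regular expression; the assumption that attribute names are single capital letters is made only for illustration, as is the example text.

```python
import re

# Sketch: extracting COCOA-style references (an attribute name plus value
# in angle brackets, e.g. <A WILLIAM SHAKESPEARE>) from a document.
COCOA_REF = re.compile(r"<([A-Z])\s+([^>]+)>")

text = "<A WILLIAM SHAKESPEARE> <T HAMLET> To be, or not to be..."
refs = {name: value for name, value in COCOA_REF.findall(text)}
print(refs)  # {'A': 'WILLIAM SHAKESPEARE', 'T': 'HAMLET'}
```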
METADATA MARKUP SCHEMES
The Dublin Core Metadata Initiative (DCMI; Dekkers and Weibel 2003),
The Open Language Archives Community (OLAC; Bird and Simons 2000),
The ISLE Metadata Initiative (IMDI; Wittenburg et al. 2002),
The Text Encoding Initiative (TEI; Sperberg-McQueen and Burnard 2002), and
The Corpus Encoding Standard (CES; Ide and Priest-Dorman 2000).
DCMI provides 15 elements used primarily to describe authored web resources.
OLAC is an extension of the DCMI, which introduces refinements to narrow down the semantic
scope of DCMI elements and adds an extra element to describe the language(s) covered by the
resource.
IMDI applies to multimedia corpora and lexical resources as well.
There is currently no widely agreed standard way of representing metadata;
the most influential schemes in corpus building are TEI and CES.
TEI
The Text Encoding Initiative was sponsored by three major academic associations concerned
with humanities computing:
● The Association for Computational Linguistics (ACL),
● The Association for Literary and Linguistic Computing (ALLC), and
● The Association for Computers and the Humanities (ACH).
The aim of the TEI guidelines is to facilitate data exchange by standardizing the markup or
encoding of information stored in electronic form.
In TEI, each individual text (referred to as a "document") consists of two parts:
● a header (typically providing text metadata) and
● a body (i.e., the text itself), which are in turn composed of different "elements."
In a TEI header (tagged as <teiHeader>), there are four principal elements (Burnard 2002):
A file description (tagged as <fileDesc>) containing a full bibliographic description of an
electronic file
An encoding description (tagged as <encodingDesc>), which describes the relationship
between an electronic text and the source or sources from which it was derived.
A text profile (tagged as <profileDesc>), containing a detailed description of non-bibliographic
aspects of a text, especially the languages and sublanguages used, the situation in which it was
produced, and the participants and their setting.
A revision description (tagged as <revisionDesc>), which records the change history of the file.
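A hand-written toy TEI header, parsed with Python's standard library, illustrates the header/element structure described above; the header content below is invented for illustration and is not a complete or validated TEI document.

```python
import xml.etree.ElementTree as ET

# Sketch: reading the principal elements of a minimal, hand-made TEI header.
tei_header = """<teiHeader>
  <fileDesc><titleStmt><title>Sample text</title></titleStmt></fileDesc>
  <encodingDesc><p>Transcribed from a print edition.</p></encodingDesc>
  <profileDesc><langUsage><language ident="en">English</language></langUsage></profileDesc>
</teiHeader>"""

root = ET.fromstring(tei_header)
# The bibliographic title lives inside the file description.
print(root.find("fileDesc/titleStmt/title").text)  # Sample text
# The header's children are the principal elements themselves.
print([child.tag for child in root])  # ['fileDesc', 'encodingDesc', 'profileDesc']
```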
TEI
The TEI scheme can be expressed using a number of different formal languages.
The first editions used the Standard Generalized Markup Language (SGML); the more recent
editions can be expressed in the Extensible Markup Language (XML).
SGML and XML are very similar,both defining a representation scheme for texts in electronic
form, which is device and system independent.
SGML is a very powerful markup language, but that power comes with complexity.
XML is a simplified subset of SGML intended to make SGML easy enough for use on the web.
Hence, while all XML documents are valid SGML documents, the reverse is not true.
CES
The Corpus Encoding Standard (CES) is designed specifically for the encoding of language
corpora.
CES can be described as a "simplified" TEI in that it includes only the subset of the TEI tag set
relevant to corpus-based work.
While it simplifies the TEI specification, CES also extends the TEI guidelines by adding new
elements not covered in TEI, specifying the precise values for some attributes, marking
required/recommended/optional elements, and explicating detailed semantics for elements
relevant to language engineering.
CES covers three principal types of markup:
a) document-wide markup, which uses more or less the same tags as TEI to provide a
bibliographic description of the document, encoding description, etc.
b) gross-structural markup, which encodes structural units of text (such as volumes, chapters,
etc.) down to the level of the paragraph (but also including footnotes, titles, headings, tables,
figures, etc.) and specifies normalization to recommended character sets and entities
c) markup for sub-paragraph structures, including sentences, quotations, words,
abbreviations, names, dates, terms, cited words, etc.
CES specifies a minimal encoding level that a corpus must achieve to be considered
standardized in terms of descriptive representation as well as general architecture.
Three levels of text standardization are specified in CES.
a)The metalanguage level
b)The syntactic level
c) The semantic level
CES was developed and recommended by the Expert Advisory Group on Language Engineering
Standards (EAGLES) as a TEI-compliant application of SGML that could serve as a widely
accepted set of encoding standards for corpus-based work.
CES is available in both SGML and XML versions.
The XML version, referred to as XCES, has also developed support for additional types of
annotation and resources, including discourse/dialogue, lexicons, and speech.
Note, however, that while metalanguages such as SGML and XML usually follow a standard
system of attribute names, this may not necessarily be the case for implementation standards
built on top of them, such as TEI and CES.
VECTOR SEMANTICS AND EMBEDDINGS
What do words mean?
N-gram or text classification methods we’ve seen so far
● Words are just strings (or indices wi in a vocabulary list)
● That’s not very satisfactory!
Introductory logic classes:
● The meaning of “dog” is DOG; cat is CAT
∀x DOG(x) → MAMMAL(x)
Old linguistic joke by Barbara Partee in 1967:
● Q: What's the meaning of life?
● A: LIFE
DESIDERATA
What should a theory of word meaning do for us?
Let’s look at some desiderata
From lexical semantics, the linguistic study of word meaning.
Lemmas and senses
mouse (N)
1. any of numerous small rodents…
2. a hand-operated device that controls a cursor…
A sense (or word sense) is a discrete representation of one aspect of a word's meaning.
Relations between senses: synonymy
Synonyms have the same meaning in some or all contexts.
❖ water/H2O
❖ couch/sofa
❖ vomit/throw up
Note that there are probably no examples of perfect synonymy
● Even if many aspects of meaning are identical
● Still may differ based on politeness, slang, register, genre, etc.
Relation: synonymy?
The linguistic principle of contrast:
Difference in form → difference in meaning
Relation: similarity
Words with similar meanings: not synonyms, but sharing some element of meaning
Car, bicycle
Cow, horse
Ask humans how similar two words are:

word 1    word 2        similarity
vanish    disappear     9.8
behave    obey          7.3
belief    impression    5.95
Relation: word relatedness
Also called “word association”
Words can be related in any way, perhaps via a semantic frame or field.
Coffee, tea: similar
Coffee, cup: related, not similar
Semantic field
Words that
● Cover a particular semantic domain
● Bear structured relations with each other.
Hospitals
surgeon, scalpel, nurse, anesthetic, hospital
Restaurants
waiter, menu, plate, food, chef
Houses
door, roof, kitchen, family, bed
Relation: antonymy
Senses that are opposites with respect to only one feature of meaning
Otherwise, they are very similar!
dark/light
short/long
fast/slow
rise/fall
Hot/cold
up/down
in/out
More formally: antonyms can
Define a binary opposition or be at opposite ends of a scale
long/short, fast/slow
Be reversives:
rise/fall, up/down
Connotation (sentiment)
Words have affective meanings
Positive connotation (happy)
Negative connotation (sad)
Connotations can be subtle:
positive connotation: copy, replica, reproduction
Negative connotation: fake, knockoff, forgery
Evaluation (sentiment!)
Positive evaluation (great, love)
Negative evaluation (terrible, hate)
Words seem to vary along 3 affective dimensions:
Valence: the pleasantness of the stimulus
Arousal: the intensity of emotion provoked by the stimulus
Dominance: the degree of control exerted by the stimulus
Dimension    Word (high)   Score    Word (low)   Score
valence      love          1.000    toxic        0.008
valence      happy         1.000    nightmare    0.005
arousal      elated        0.960    mellow       0.069
arousal      frenzy        0.965    napping      0.046
dominance    powerful      0.991    weak         0.045
dominance    leadership    0.983    empty        0.081
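A sketch of how such per-dimension scores might be used: average the scores of a text's words over one dimension. The tiny lexicon below reuses only the valence examples from the table; it is a toy stand-in, not a real affective lexicon.

```python
# Sketch: averaging valence scores for the words of a text, using a toy
# lexicon built from the example scores above.
valence = {"love": 1.000, "happy": 1.000, "toxic": 0.008, "nightmare": 0.005}

def mean_valence(words, lexicon):
    """Mean score over the words found in the lexicon (None if none found)."""
    scores = [lexicon[w] for w in words if w in lexicon]
    return sum(scores) / len(scores) if scores else None

print(mean_valence("a happy love story".split(), valence))  # 1.0
```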
VECTOR SEMANTICS
Computational models of word meaning
Can we build a theory of how to represent word meaning, that accounts for at least some of the
desiderata?
We’ll introduce vector semantics
The standard model in language processing!
Handles many of our goals!
LUDWIG WITTGENSTEIN
PI #43:
“The meaning of a word is its use in the language”
Let's define words by their usages.
One way to define “usage”
Words are defined by their environments ( the words around them)
Zellig Harris (1954):
If A and B have almost identical environments, we say that they are synonyms.
Ongchoi
Idea 1: define meaning by linguistic distribution
Idea 2: meaning as a point in multidimensional space
Defining meaning as a point in space based on distribution
Each word = a vector (not just a string "good" or an index "w45")
Similar words are “nearby in semantic space”
We build this space automatically by seeing which words are nearby in text.
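Harris's idea can be sketched by counting which words occur near which: words are represented by the counts of their context words. The sentences, window size, and function names below are toy assumptions.

```python
from collections import Counter

# Sketch: building count-based word vectors from co-occurrence within a
# +/-2 word context window, following the distributional idea.
def cooccurrence_vectors(sentences, window=2):
    vectors = {}
    for sentence in sentences:
        for i, word in enumerate(sentence):
            # Words to the left and right of position i, within the window.
            context = sentence[max(0, i - window):i] + sentence[i + 1:i + 1 + window]
            vectors.setdefault(word, Counter()).update(context)
    return vectors

sents = [["I", "drink", "tea"], ["I", "drink", "coffee"]]
vecs = cooccurrence_vectors(sents)
# "tea" and "coffee" end up with identical vectors: near-identical
# environments, so (per Harris) they behave like near-synonyms.
print(vecs["tea"])     # Counter({'I': 1, 'drink': 1})
print(vecs["coffee"])  # Counter({'I': 1, 'drink': 1})
```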
We define meaning of a word as a vector
Called an “embedding” because it’s embedded into a space (see textbook)
The standard way to represent meaning in NLP
Every modern NLP algorithm uses embeddings as the representation of word meaning
Fine-grained model of meaning with similarity
Intuition: why vectors?
Consider sentiment analysis:
With words, a feature is a word identity
Feature 5: 'the previous word was "terrible"'
Requires exact same word to be in training and test
With embeddings:
Feature is a word vector:
"The previous word was vector [35,22,17,...]"
Now in the test set we might see a similar vector [34,21,14]
We can generalize similar but unseen words!!!
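This generalization can be made concrete with cosine similarity, a standard way to compare word vectors; the two toy vectors are the ones from the text above.

```python
import math

# Sketch: cosine similarity shows why the unseen test vector is treated
# much like the training one.
def cosine(u, v):
    """Cosine of the angle between vectors u and v."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

train_vec = [35, 22, 17]
test_vec = [34, 21, 14]
print(round(cosine(train_vec, test_vec), 3))  # 0.999 -> nearly the same direction
```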
We’ll discuss 2 kinds of embeddings
Tf-idf (term frequency-inverse document frequency)
● Information retrieval workhorse!
● A common baseline model
● Sparse vectors
● Words are represented by ( a simple function of) the counts of nearby words
Word2vec
● Dense vectors
● Representation is created by training a classifier to predict whether a word is likely to
appear nearby
● Later we’ll discuss extensions called contextual embeddings
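The tf-idf weighting mentioned above can be sketched as follows, using raw term frequency times log10 inverse document frequency (one common variant; real systems often log-scale the term frequency as well). The mini document collection is invented for illustration.

```python
import math

# Sketch: tf-idf weight of a term in one document of a tiny collection.
docs = [
    "sweet sorrow sweet love".split(),
    "love is blind".split(),
    "parting is such sweet sorrow".split(),
]

def tf_idf(term, doc, docs):
    """Raw term frequency x log10 inverse document frequency."""
    tf = doc.count(term)                          # count in this document
    df = sum(1 for d in docs if term in d)        # documents containing the term
    idf = math.log10(len(docs) / df)              # rarer terms get higher weight
    return tf * idf

# "sweet": tf=2 in doc 0, df=2 of 3 docs -> weight 2 * log10(3/2)
print(round(tf_idf("sweet", docs[0], docs), 3))
```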