Any corpus that is not regularly updated rapidly becomes unrepresentative. The relevance of permanence in corpus design depends on how we view a corpus, that is, whether a corpus should be viewed as a static or a dynamic language model. The static view typically applies to a sample corpus, whereas the dynamic view applies to a monitor corpus. A monitor corpus is primarily designed to track changes across different periods. Monitor corpora are constantly (e.g., annually, monthly, or even daily) supplemented with fresh material and keep increasing in size.

TYPES OF CORPORA: GENERAL AND SPECIALIZED
While both general and specialized corpora should be representative of a language or language variety, they have different criteria for representativeness. The representativeness of a general corpus depends heavily on sampling from a broad range of genres, whereas the representativeness of a specialized corpus, at the lexical level at least, can be measured by the degree of closure. A general corpus should cover, proportionally, as many text types as possible so that it is maximally representative of the language or language variety it is supposed to represent.
Closure/saturation for a particular linguistic feature (e.g., the size of the lexicon) of a language variety (e.g., computer manuals) means that the feature appears to be finite or is subject to very limited variation beyond a certain point. To measure the saturation of a corpus, the corpus is first divided into segments of equal size based on its tokens; the segments are then added one at a time, and if a feature such as the lexicon yields ever fewer new items as segments are added, the corpus is saturated (closed) for that feature.
A general corpus that contains both written and spoken data (e.g., the BNC) is balanced. Balance and representativeness are important considerations in corpus design, but they depend on the research question and on the ease with which data can be captured, and thus must be interpreted in relative terms. Any claim of corpus balance is largely an act of faith rather than a statement of fact as, at present, there is no reliable scientific measure of corpus balance.

SAMPLING
A sample is assumed to be representative if what we find for the sample also holds for the general population. In the statistical sense, samples are scaled-down versions of a larger population. Given that language is infinite whereas a corpus is finite in size, sampling is unavoidable in corpus creation. Corpus representativeness and balance are closely associated with sampling, so sampling must be done properly for the resulting corpus to be as balanced and representative as possible.

CORPUS DESIGN
In corpus design, a population can be defined in terms of:
● language production,
● language reception, or
● language as a product.
The first two designs are basically demographically oriented, as they use the demographic distribution (e.g., age, sex, social class) of the individuals who produce/receive language data to define the population (e.g., the BNC). The Brown and LOB corpora were created using the criterion of language as a product.

SAMPLING TECHNIQUES
● Simple random sampling: all sampling units within the sampling frame are numbered, and the sample is chosen by use of a table of random numbers.
● Stratified random sampling: the whole population is first divided into relatively homogeneous groups (so-called strata), and each stratum is then sampled at random. A minimal sketch of both techniques follows.
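The difference between the two techniques can be made concrete with a short sketch. This is a toy illustration, not a recipe from the literature: the function names and the (text_id, genre) sampling frame are invented for the example.

```python
import random
from collections import defaultdict

def simple_random_sample(frame, n):
    # All sampling units are (implicitly) numbered; n are drawn at random.
    return random.sample(frame, n)

def stratified_random_sample(frame, n_per_stratum, key):
    # Divide the population into relatively homogeneous strata,
    # then sample each stratum at random.
    strata = defaultdict(list)
    for unit in frame:
        strata[key(unit)].append(unit)
    return {stratum: random.sample(units, min(n_per_stratum, len(units)))
            for stratum, units in strata.items()}

# Toy sampling frame: 300 texts, each tagged with a genre.
frame = [(text_id, genre)
         for text_id in range(100)
         for genre in ("news", "fiction", "academic")]

print(simple_random_sample(frame, 5))
print(stratified_random_sample(frame, 2, key=lambda unit: unit[1]))
```

Note that the stratified version guarantees that every genre is represented in the sample, which is the intuition behind Biber's observation cited below.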
The number of samples across text categories should be proportional to their frequencies and/or weights in the target population in order for the resulting corpus to be considered representative. In the Brown and LOB corpora, for example, the target population for each corpus was first grouped into fifteen text categories, such as news reportage, academic prose, and different types of fiction; samples were then drawn from each text category. Demographic sampling, which first categorizes sampling units in the population on the basis of speaker/writer age, sex, and social class, is also a type of stratified sampling. Biber (1993) observes that a stratified sample is never less representative than a simple random sample.

SAMPLE SIZE
With written language, should we sample:
● full texts (i.e., whole documents)? These are useful in text linguistics, but they raise copyright issues, and the coverage of a corpus including full texts may not be as balanced; or
● text chunks? These may be initial, middle, or end chunks.
In creating a balanced, representative corpus, stratified random sampling is preferred over simple random sampling. For written texts, a text typology established on the basis of external criteria is appropriate; for spoken data, demographic sampling is appropriate.

DATA CAPTURE AND COPYRIGHT
Electronic data is preferred over paper-based material in building DIY corpora. The World Wide Web (WWW) is an important source of machine-readable data for many languages. Some university libraries provide large amounts of publicly accessible electronic texts, e.g.:
● http://lib.virginia.edu/digital/collections/text/
● http://onlinebooks.library.upenn.edu/

DATA DOWNLOAD
A number of tools can be used for downloading data in one go:
● Grab-a-Site or HTTrack,
● the WordSmith Tools Webgetter, and
● the Multilingual Corpus Toolkit.
A major issue in data collection is copyright: copyright-free material is usually old and may not be useful. In terms of purpose, corpora are typically of two types:
● corpora built for commercial purposes, whose creators usually reach an agreement with copyright holders as to how the profit will be shared; and
● corpora built for non-profit-making academic research.
Copyright issues in corpus creation are complex and unavoidable: "Proceed with caution."

CORPUS MARKUP AND ANNOTATION
Data collected using a sampling frame forms a raw corpus, which needs to be processed before use. Spoken data needs to be transcribed from audio/video recordings; written texts may need to be rendered machine-readable, if they are not already, by keyboarding or OCR scanning. In order to extract linguistic information from a corpus, such information must first of all be encoded in the corpus, a process technically known as "corpus annotation".
Corpus markup is a system of standard codes inserted into a document stored in electronic form to provide information about the text itself (i.e., text metadata) and to govern formatting, printing, or other processing (i.e., structural organization). There are two types of markup:
● metadata markup, which can be embedded in the same document or stored in a separate but linked document (embedded vs stand-alone annotation); and
● structural markup, which has to be embedded in the text.
One of the earliest markup schemes was COCOA. COCOA references consist of a set of attribute names and values enclosed in angle brackets, e.g. <A WILLIAM SHAKESPEARE>, where A (author) is the attribute name and WILLIAM SHAKESPEARE is the attribute value.
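Because COCOA references follow a fixed <NAME VALUE> pattern, they are easy to extract mechanically. A minimal sketch, assuming single-letter attribute names; the attribute map is illustrative, not the full COCOA scheme:

```python
import re

# Map single-letter COCOA attribute names to readable labels
# (this mapping is illustrative, not exhaustive).
ATTRIBUTES = {"A": "author", "T": "title", "D": "date"}

def parse_cocoa(text):
    """Extract COCOA-style <NAME VALUE> references from a document."""
    references = {}
    for name, value in re.findall(r"<([A-Z])\s+([^>]+)>", text):
        references[ATTRIBUTES.get(name, name)] = value.strip()
    return references

sample = "<A WILLIAM SHAKESPEARE> <T HAMLET> To be, or not to be..."
print(parse_cocoa(sample))
# -> {'author': 'WILLIAM SHAKESPEARE', 'title': 'HAMLET'}
```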
COCOA references, however, encode only a limited set of features, such as authors, titles, and dates.

METADATA MARKUP SCHEMES
● the Dublin Core Metadata Initiative (DCMI; Dekkers and Weibel 2003),
● the Open Language Archives Community (OLAC; Bird and Simons 2000),
● the ISLE Metadata Initiative (IMDI; Wittenburg et al. 2002),
● the Text Encoding Initiative (TEI; Sperberg-McQueen and Burnard 2002), and
● the Corpus Encoding Standard (CES; Ide and Priest-Dorman 2000).
DCMI provides 15 elements used primarily to describe authored web resources. OLAC is an extension of the DCMI, which introduces refinements to narrow down the semantic scope of DCMI elements and adds an extra element to describe the language(s) covered by the resource. IMDI applies to multimedia corpora and lexical resources as well. There is currently no widely agreed standard way of representing metadata; the most influential schemes in corpus building are TEI and CES.

TEI
The Text Encoding Initiative was sponsored by three major academic associations concerned with humanities computing:
● the Association for Computational Linguistics (ACL),
● the Association for Literary and Linguistic Computing (ALLC), and
● the Association for Computers and the Humanities (ACH).
The aim of the TEI guidelines is to facilitate data exchange by standardizing the markup or encoding of information stored in electronic form. In TEI, each individual text (referred to as a "document") consists of two parts, a header (typically providing text metadata) and a body (i.e., the text itself), which are in turn composed of different "elements". In a TEI header (tagged as <teiHeader>), there are four principal elements (Burnard 2002):
● a file description (tagged as <fileDesc>), containing a full bibliographic description of the electronic file;
● an encoding description (tagged as <encodingDesc>), which describes the relationship between an electronic text and the source or sources from which it was derived;
● a text profile (tagged as <profileDesc>), containing a detailed description of non-bibliographic aspects of a text, especially the languages and sublanguages used, the situation in which it was produced, and the participants and their setting; and
● a revision description (tagged as <revisionDesc>), which records the history of changes made to the file.
The TEI scheme can be expressed using a number of different formal languages. The first editions used the Standard Generalized Markup Language (SGML); the more recent editions can be expressed in the Extensible Markup Language (XML). SGML and XML are very similar: both define a representation scheme for texts in electronic form which is device- and system-independent. SGML is a very powerful markup language, but associated with that power is complexity. XML is a simplified subset of SGML intended to make SGML easy enough for use on the web. Hence, while all XML documents are valid SGML documents, the reverse is not true.
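Since recent TEI editions are XML, a header can be inspected with any XML parser. A minimal sketch using Python's standard xml.etree.ElementTree; the toy document below is invented for illustration (real TEI files declare a namespace and are far richer):

```python
import xml.etree.ElementTree as ET

# A toy TEI-style document, invented for this example.
doc = """<TEI>
  <teiHeader>
    <fileDesc><title>Sample corpus text</title></fileDesc>
    <encodingDesc><p>Transcribed from a print source</p></encodingDesc>
    <profileDesc><langUsage>English</langUsage></profileDesc>
    <revisionDesc><change>Header added</change></revisionDesc>
  </teiHeader>
  <text><body><p>The text itself goes here.</p></body></text>
</TEI>"""

root = ET.fromstring(doc)
header = root.find("teiHeader")
# Print the content of each of the four principal header elements.
for name in ("fileDesc", "encodingDesc", "profileDesc", "revisionDesc"):
    element = header.find(name)
    print(name, "->", "".join(element.itertext()).strip())
```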
CES
The Corpus Encoding Standard (CES) is designed specifically for the encoding of language corpora. CES can be described as "simplified" TEI in that it includes only the subset of the TEI tag set relevant to corpus-based work. While it simplifies the TEI specification, CES also extends the TEI guidelines by adding new elements not covered in TEI, specifying the precise values for some attributes, marking required/recommended/optional elements, and explicating detailed semantics for elements relevant to language engineering. CES covers three principal types of markup:
a) document-wide markup, which uses more or less the same tags as TEI to provide a bibliographic description of the document, an encoding description, etc.;
b) gross structural markup, which encodes structural units of text (such as volumes, chapters, etc.) down to the level of the paragraph (but also including footnotes, titles, headings, tables, figures, etc.) and specifies normalization to recommended character sets and entities; and
c) markup for sub-paragraph structures, including sentences, quotations, words, abbreviations, names, dates, terms and cited words, etc.
CES specifies a minimal encoding level that a corpus must achieve to be considered standardized in terms of descriptive representation as well as general architecture. Three levels of text standardization are specified in CES:
a) the metalanguage level,
b) the syntactic level, and
c) the semantic level.
CES was developed and recommended by the Expert Advisory Group on Language Engineering Standards (EAGLES) as a TEI-compliant application of SGML that could serve as a widely accepted set of encoding standards for corpus-based work. CES is available in both SGML and XML versions; the XML version, referred to as XCES, has also developed support for additional types of annotation and resources, including discourse/dialogue, lexicon, and speech. On the other hand, while metalanguages such as SGML and XML usually follow the system of attribute names and values, outside implementation standards such as TEI and CES this may not necessarily be the case.

VECTOR SEMANTICS AND EMBEDDINGS
What do words mean? In the N-gram or text classification methods we have seen so far:
● words are just strings (or indices wᵢ in a vocabulary list);
● that's not very satisfactory!
In introductory logic classes, the meaning of "dog" is DOG and "cat" is CAT:
∀x DOG(x) → MAMMAL(x)
An old linguistic joke by Barbara Partee in 1967:
● Q: What's the meaning of life?
● A: LIFE

DESIDERATA
What should a theory of word meaning do for us? Let's look at some desiderata from lexical semantics, the linguistic study of word meaning.

Lemmas and senses
mouse (N)
1. any of numerous small rodents…
2. a hand-operated device that controls a cursor…
Here mouse is the lemma, and each numbered meaning is a sense: a sense (or "concept") is the meaning component of a word.

Relations between senses: synonymy
Synonyms have the same meaning in some or all contexts:
❖ water / H2O
❖ couch / sofa
❖ vomit / throw up
Note that there are probably no examples of perfect synonymy: even if many aspects of meaning are identical, synonyms may still differ based on politeness, slang, register, genre, etc. This reflects the linguistic principle of contrast: a difference in form implies a difference in meaning.

Relation: similarity
Words with similar meanings: not synonyms, but sharing some element of meaning (car/bicycle, cow/horse). We can ask humans how similar two words are:
vanish / disappear 9.8
behave / obey 7.3
belief / impression 5.95

Relation: word relatedness
Also called "word association": words can be related in any way, perhaps via a semantic frame or field. Coffee and tea are similar; coffee and cup are related, but not similar.

Semantic field
Words that cover a particular semantic domain and bear structured relations with each other:
● hospitals: surgeon, scalpel, nurse, anesthetic, hospital
● restaurants: waiter, menu, plate, food, chef
● houses: door, roof, kitchen, family, bed

Relation: antonymy
Senses that are opposites with respect to only one feature of meaning; otherwise, they are very similar!
dark/light, short/long, fast/slow, rise/fall, hot/cold, up/down, in/out
More formally, antonyms can:
● define a binary opposition or be at opposite ends of a scale (long/short, fast/slow), or
● be reversives (rise/fall, up/down).
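These sense relations are catalogued in resources like WordNet and can be explored programmatically. A minimal sketch using NLTK's WordNet interface; this assumes NLTK is installed and fetches the WordNet data on first run (depending on your NLTK version you may also need nltk.download('omw-1.4')), and the exact senses printed depend on the installed WordNet version:

```python
import nltk
nltk.download("wordnet", quiet=True)  # one-time fetch of the WordNet data
from nltk.corpus import wordnet as wn

# Lemmas and senses: each synset is one sense of the lemma "mouse"
# (the rodent, the pointing device, etc.).
for synset in wn.synsets("mouse", pos=wn.NOUN):
    print(synset.name(), "->", synset.definition())

# Synonymy: lemmas that share a synset, e.g. couch/sofa.
print(wn.synsets("sofa")[0].lemma_names())

# Antonymy: opposites are recorded on individual lemmas.
for lemma in wn.synsets("long", pos=wn.ADJ)[0].lemmas():
    if lemma.antonyms():
        print(lemma.name(), "<->", lemma.antonyms()[0].name())
```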
Connotation (sentiment)
Words have affective meanings: positive connotation (happy) or negative connotation (sad). Connotations can be subtle:
● positive connotation: copy, replica, reproduction
● negative connotation: fake, knockoff, forgery
Evaluation (sentiment!):
● positive evaluation (great, love)
● negative evaluation (terrible, hate)
Words seem to vary along 3 affective dimensions:
● valence: the pleasantness of the stimulus
● arousal: the intensity of emotion provoked by the stimulus
● dominance: the degree of control exerted by the stimulus

             word        score   word        score
valence      love        1.000   toxic       0.008
             happy       1.000   nightmare   0.005
arousal      elated      0.960   mellow      0.069
             frenzy      0.965   napping     0.046
dominance    powerful    0.991   weak        0.045
             leadership  0.983   empty       0.081

VECTOR SEMANTICS
Computational models of word meaning: can we build a theory of how to represent word meaning that accounts for at least some of the desiderata? We'll introduce vector semantics, the standard model in language processing, which handles many of our goals.
Ludwig Wittgenstein, PI §43: "The meaning of a word is its use in the language." Let's define words by their usages. One way to define "usage": words are defined by their environments (the words around them). Zellig Harris (1954): if A and B have almost identical environments, we say that they are synonyms. (Think of the textbook's ongchoi example: you can infer what the unfamiliar word ongchoi means purely from the contexts in which it occurs.)
● Idea 1: define meaning by linguistic distribution.
● Idea 2: define meaning as a point in multidimensional space.
Defining meaning as a point in space based on distribution: each word = a vector (not just "good" or "w45"). Similar words are "nearby in semantic space", and we build this space automatically by seeing which words are nearby in text. We define the meaning of a word as a vector, called an "embedding" because it's embedded into a space (see textbook). This is the standard way to represent meaning in NLP: every modern NLP algorithm uses embeddings as the representation of word meaning. It gives a fine-grained model of meaning with similarity.

Intuition: why vectors?
Consider sentiment analysis:
● With words, a feature is a word identity, e.g. Feature 5: "the previous word was 'terrible'". This requires the exact same word to appear in both training and test data.
● With embeddings, a feature is a word vector: "the previous word was the vector [35,22,17,...]". Now in the test set we might see a similar vector [34,21,14], so we can generalize to similar but unseen words!

We'll discuss 2 kinds of embeddings:
● tf-idf (term frequency-inverse document frequency): the information retrieval workhorse and a common baseline model; sparse vectors; words are represented by (a simple function of) the counts of nearby words.
● word2vec: dense vectors; the representation is created by training a classifier to predict whether a word is likely to appear nearby. Later we'll discuss extensions called contextual embeddings.
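To make the tf-idf idea concrete, here is a minimal sketch on a toy three-document corpus. The documents, the log-scaled term-frequency variant, and the function names are all illustrative; see the textbook for the exact formulas used there.

```python
import math
from collections import Counter

docs = {
    "d1": "sweet sorrow love love".split(),
    "d2": "good love fool".split(),
    "d3": "battle soldier fool".split(),
}

def tf_idf(docs):
    """Weight each term in each document by log-scaled tf times idf."""
    N = len(docs)
    # Document frequency: how many documents contain each term.
    df = Counter(term for words in docs.values() for term in set(words))
    vectors = {}
    for name, words in docs.items():
        counts = Counter(words)
        vectors[name] = {
            term: (1 + math.log10(count)) * math.log10(N / df[term])
            for term, count in counts.items()
        }
    return vectors

def cosine(u, v):
    """Cosine similarity between two sparse vectors stored as dicts."""
    dot = sum(weight * v.get(term, 0.0) for term, weight in u.items())
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0

vectors = tf_idf(docs)
print(cosine(vectors["d1"], vectors["d2"]))  # share "love" -> similarity > 0
print(cosine(vectors["d1"], vectors["d3"]))  # no shared weighted terms -> 0.0
```

A word2vec model would instead learn dense vectors by training a classifier on word-context pairs; off-the-shelf implementations are available in libraries such as Gensim.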