Building terminology and conceptual (ontological) systems from text corpora?

Khurshid Ahmad
Professor of Artificial Intelligence, Department of Computing, University of Surrey, Surrey, ENGLAND.
Dial-a-Corpus, Tuscan Word Center/University of Siena, Certosa di Pontignano, Siena, Italy. June 28th, 2005.

KNOWLEDGE, TEXT & LEARNING
Terminology relates to the conceptual systems of specialist disciplines: some argue that a term denotes a concept, whilst others deny this correspondence.
Conceptual systems are organised by definition: a system is organisation per se.

KNOWLEDGE, TEXT & LEARNING
Building a specialist corpus:
• Use search engines: think of keywords carefully;
• Use crawlers and design bots;
• Visit journal sites; book sites; popular science and Sunday newspaper supplements; patent documents (free from the US PTO); newsletters; course and conference announcements; scientific biographies;
• Balance the corpus.

KNOWLEDGE, TEXT & LEARNING
Building a specialist corpus:
• Collect data;
• Organise data;
• Analyse data;
• Visualise results.

KNOWLEDGE, TEXT & LEARNING
Development of Visual Evidence Thesaurus
The collateral texts are written texts or speech (fragments) closely or loosely related to an image or to objects within the image. The collateral texts are special language texts and comprise keywords that may help in indexing and retrieving the images.
PICTURE CAPTION: and Davis was armed with a 9mm Browning High Power, 9mm caliber semiautomatic pistol with an obliterated serial number.
A small firearm with a more or less curved stock, adapted to be held in, and fired with, one hand.
Firearm found on top of a table.
9 mm Browning high-power pistol.

BUILDING A SPECIALIST CORPUS
Surrey Forensic Science Corpus (0.58 Million words)
• Descriptions of images by scene of crime experts;
• FBI, UK and Australian literature on scene of crime practice;
• Research papers on evidence-based policing;
• Brochures and marketing documents on crime labs;
• Newspaper reports relating to scenes of crime;
• Descriptions of courses and conferences;
• Mainly American English texts.

BUILDING A SPECIALIST CORPUS
Organise Texts
• Build a conceptual system for differentiating texts:
• Imaginative, Informative, Instructional;
• Author attributes (Age, Nationality, Gender, Language);
• Publication attributes (Date of Publication, Audience, ...);
• Sub-domain categories;
• Categorise texts on your filestore.

BUILDING A SPECIALIST CORPUS: Data on the Web, Reuters Financial News
• Classification of documents into a fixed number of predefined categories.
• Each document can be in multiple, one, or no category.
• Building text classifiers by hand is difficult and time consuming.
• An automatic text classifier can help with the problem.
[Chart: percentage of documents (0% to 30%) against number of codes per document (1 to 27)]

BUILDING A SPECIALIST CORPUS: Data on the Web, Reuters Financial News
• TREC-AP text collection: AP newswire stories from TREC/TIPSTER, a total of 242,918 AP stories from 1988-1990, combined with 20 categories (Lewis & Gale, 1994).
• Reuters Collection Volume 1 (RCV1): 806,791 stories from 1997-1998, combined with:
  • 126 codes for topics,
  • 870 codes for industry and
  • 266 codes for regions (Lewis, 1992).

Topic codes
Corporate/industrial (CCAT)
  C11    strategy/plans
  C12    legal/judicial
  C13    regulation/policy
  C14    share listings
  C15    performance
  C151   accounts/earnings
  C1511  annual results
  C152   comment/forecasts
Economics (ECAT)
  E11    economic performance
  E12    monetary/economic
  E121   money supply
  E13    inflation/prices
Government/social (GCAT)
  G11    social affairs
  G111   health/safety
  G112   social security
  G113   education/research
Markets (MCAT)
  M11    equity markets
  M12    bond markets
  M13    money markets
  M131   interbank markets

Industry codes
  I0        agriculture, forestry and fishing
  I00       agriculture, forestry and fishing
  I000      agriculture, forestry and fishing
  I01       agriculture and horticulture
  I010      agriculture and horticulture
  I0100     agriculture and horticulture
  I01000    agriculture and horticulture
  I01001    agriculture
  I0100105  cattle farming
  I0100107  egg production

Region codes
  AARCT   Antarctica
  ABDBI   Abu Dhabi
  AFGH    Afghanistan
  AFRICA  Africa
  AJMN    Ajman
  ALB     Albania
  ALG     Algeria
  AMSAM   American Samoa

Terminology Resources for Categorising Texts
Stock; Preferred Stock; Common Stock; Junior Equity.
See www.Investorwords.com

BUILDING A SPECIALIST CORPUS: Organising Data on the Web, Reuters Financial News
See: WEBSOM

Definitions: Terminology
Etymologically, the doctrine or scientific study of terms; in use almost always, the system of terms belonging to any science or subject; technical terms collectively; nomenclature.
Hence terminological (adj.), pertaining to terminology; terminological inexactitude, a humorous expression for a falsehood; terminologically (adv.); terminologist, one versed in terminology.

Definitions: Special Language
•The special language of focussed, single-minded pursuits: science, technology, sports, politics, philosophy, ...
•A natural language privileges persons; in contrast, the “splinter of ordinary language” that we call [specialised] scientific discourse privileges a world of objects, processes, happenings, events.

Definitions: Special Language
•The ‘identificatory force’ of subject position in grammar [of Indo-European languages] is reserved for speakers and their fellow creatures.
•The ‘identificatory force’ of subject position in the grammar of specialist discourse is reserved for objects, processes, happenings, events.

Definitions: Special Language
A note on creativity or terminicide
‘That science has become more difficult for nonspecialists to understand is a truth universally acknowledged’. The choice of words in a journal paper is very different to that in a quality newspaper, and it obscures the work of the scientists.

Source             Lexical Difficulty
Nature             55.5
Science            44.8
Cell               31.6
Physics Today      13.3
New Scientist      4.0
Quality Newspaper  0.0

Donald Hayes (1992) ‘The growing inaccessibility of science’. Nature. Vol 356, pp 739-740

Definitions: Special Language
A note on creativity or terminicide

Source                                Lexical Difficulty
Quality Newspaper                     0.0
Popular Science (Discover)            -4.7
Fiction                               -19.3
Nat. History magazine (Ranger Rick)   -22.6
Children’s fiction                    -27.4
Farm-workers talking to cows          -63.8

TEXT TECHNOLOGY?
• DERIVE information from text;
• FEED information into a conceptual or mathematical model;
• USE the model for analysis or prediction.

HERMENEUTICS & CORPUS LINGUISTICS?
Languages are constantly in flux.
The corpus linguist explores the discourse as a system that can be explained without referring to a discourse-external reality or to the mental state of the members of the discourse community.
Teubert, Wolfgang (2003). Writing, hermeneutics and corpus linguistics. Logos and Language Vol. IV (no. 2) pp 1-17.

HERMENEUTICS & CORPUS LINGUISTICS?
Interpretament
Texts are responses to previous texts, and these texts are then responded to in turn; the cycle continues, hence the diachronic dimension.

HERMENEUTICS & CORPUS LINGUISTICS
Interpretament
• Encyclopaedias deal with monosemic concepts.
• Dictionaries deal with polysemic words.
• Text corpora deal with neologisms, especially compounds and abbreviations, and retronyms.
Explanation of a concept becomes the paraphrase of a phrase and thus a discrete internal object.
HERMENEUTICS & CORPUS LINGUISTICS
Interpretament
Where will you find the evidence of use, definition, and elaboration of terms like:
• inclusive learning environment (e-Learning)
• Borromean Halo Nuclei (Radioactive Nuclear Beam Physics)
• honeycombed catalytic converter (Automotive Engineering)
• individualist weak supervenience (Philosophy of Science)
• indoor blood videotaping (Forensic Science)
EXCEPT IN A TEXT CORPUS?

Lexicogenesis: Diachronic Semantic Inversion?
Term/‘Concept’, with the account given Before and After:
• Perception: One sees
  Before: beams coming from an object (Aristotle)
  After: beams leaving the observer's eyes (Pythagoras)
• Motion: Objects move due to
  Before: an in-built tendency to move (Aristotle)
  After: something exerts 'attraction' (Galileo)
• Solar Cycle: Sunrise is caused by
  Before: a rising Sun (Brahe)
  After: a turning earth (Kepler)
• Combustion: Burning an object in air leads to
  Before: loss of phlogiston (Priestley)
  After: the addition of oxygen (Lavoisier)
• Heartbeat: Blood circulation is caused by
  Before: an explosion during diastole of the heart (Descartes)
  After: a compression during systole of the heart (Harvey)
• Species: The distinction between species is
  Before: an absolute phenomenon that has been determined in the past (Linnaeus)
  After: a contemporaneous phenomenon with borders between the species (Darwin)
Verschuuren, G. M. N. (1986). Investigating the Life Sciences: An Introduction to the Philosophy of Science. Oxford: Pergamon Press.

Lexicogenesis: Synchronic Semantic Change
19 Meanings of Paradigm in The Structure of Scientific Revolutions, Kuhn (1970)?
Page x    a universally recognised scientific achievement
Page 2    a successful metaphysical speculation
Page 2    a concrete scientific achievement
Page 4    a set of beliefs
Page 10   like a textbook
Page 14   like an analogy
Page 17   a myth
Page 23   like an accepted judicial decision
Page 23   a grammatical paradigm

Lexicogenesis: Synchronic Semantic Change
19 Meanings of Paradigm in The Structure of Scientific Revolutions, Kuhn (1970)?
pp 37, 76    conceptual and instrumental tools
pp 59, 60    a device or type of instrumentation
Page 63      like a gestalt figure
Page 85      an anomalous pack
Page 91      like a set of political institutions
Page 102     a standard
Page 108     a map
pp 117-121   a new way of seeing things
Page 120     an organising principle governing perception
Page 128     something which determines a large area of reality

Lexicogenesis: Diachronic Semantic Change
The establishment of atom
1477  An atom is a hypothetical body, so small as to be incapable of further division; and thus to be one of the ultimate particles of nature.
1650  Physical Atoms: The supposed ultimate particles in which matter actually exists (without reference to its stability).
1819  Chemical Atoms: The smallest particles in which the elements combine, or are known to possess the properties of a particular element.

LEXICOGENESIS & KNOWLEDGE
The origins, evolution and obsolescence of concepts and conceptual systems are a hotly debated subject: for some, concepts are ethereal, and for others, concepts relate directly to our sensory experience.

LEXICOGENESIS & KNOWLEDGE
More recently, my colleagues in computing, especially those in artificial intelligence and the semantic web, have started to use the term ontology in a creative way.

LEXICOGENESIS & ONTOLOGY?
Ontology: etymologically, the essence of being.
There are those for whom an ontology is a list of terms, perhaps organised as a thesaurus; an ontology can be found very easily provided we find the logical basis of science, philosophy and thought.

LEXICOGENESIS & TEXT?
If any essence or trace of the knowledge of the individuals is left behind, then it is usually found in documents comprising words, illustrations and drawings, and mathematical and other symbols.

TERMINOLOGY in TEXT
The use of terminology distinguishes one domain from another; different concepts are emphasised in different domains.
The same concept may be referred to by different names.
The frequencies of words in a text carry a signature: if the text is specialist then a select few terms are used repeatedly.
Everyday or general language texts seldom carry a signature.

TERMINOLOGY in TEXT
First ten most frequent terms in Springer-Verlag’s medical text corpus (Muchmore; N = 1.08 million words).

Word      Absolute frequency  Relative frequency  Relative frequency (%)
the       68451               0.063315            6.3%
of        55661               0.051484            5.1%
and       34248               0.031678            3.2%
in        30035               0.027781            2.8%
a         21268               0.019672            2.0%
to        19988               0.018488            1.8%
with      14455               0.01337             1.3%
is        12333               0.011408            1.1%
for       10311               0.009537            1.0%
patients  9448                0.008739            0.9%

TERMINOLOGY in TEXT
First twenty most frequent terms in Springer-Verlag’s medical text corpus (Muchmore; N = 1.08 million words).

Word      Relative frequency (%)    Word   Relative frequency (%)
the       6.3%                      was    0.77%
of        5.1%                      be     0.72%
and       3.2%                      are    0.66%
in        2.8%                      as     0.63%
a         2.0%                      by     0.63%
to        1.8%                      were   0.57%
with      1.3%                      an     0.51%
is        1.1%                      on     0.49%
for       1.0%                      or     0.49%
patients  0.9%                      this   0.47%
TOTAL     25.5%                     TOTAL  6.30%

TERMINOLOGY in TEXT
The use of terminology distinguishes one domain from another; different concepts are emphasised in different domains.
British National Corpus (BNC), c. 90 Million words; 5 words comprise 16.5% of the corpus.

Word  BNC: Absolute frequency  BNC: Relative frequency
the   6181374                  6.2%
of    2938675                  2.9%
and   2680037                  2.7%
to    2557635                  2.6%
a     2148608                  2.1%

TERMINOLOGY in TEXT
The Muchmore Corpus (N = 1.08 million words): the 50 most frequent tokens in bands of ten, with the relative frequency of each band and the number of open class words (OCW) in it.

Tokens                                                                    Relative frequency  No. of OCW
the, of, and, in, a, to, with, is, for, patients                          25.5%               1
was, be, are, as, by, were, an, on, or, this                              6.3%                0
that, after, from, treatment, not, we, at, can, clinical, have            3.4%                2
which, has, patient, these, results, therapy, been, cases, it, study      2.4%                5
all, during, disease, only, years, between, may, no, diagnosis, surgery   1.8%                4
TOTAL                                                                     39.5                11

TERMINOLOGY in TEXT
Language of the British National Corpus (c. 1980’s)

Tokens                                                      Relative frequency  No. of OCW
the, of, and, a, in, to, it, is, was, to                    21.28%              0
i, for, you, he, be, with, on, that, by, at                 6.66%               0
are, not, this, but, 's, they, his, from, had, she          4.35%               0
which, or, we, an, n't, 's, were, that, been, have          3.25%               0
their, has, would, what, will, there, if, can, all, her     2.42%               0
TOTAL                                                       37.96%              0

TERMINOLOGY in TEXT
The weirdness ratio: the asymmetry in the distribution of a word in a special language corpus and a general language corpus.

\[ \mathrm{weirdness}(term) = \frac{f_{special} / N_{special}}{f_{general} / N_{general}} \]

Here f is the frequency of the term and N the total number of words in the special and general language corpus respectively.
If weirdness >> 1 then the term is a specialist term; if weirdness < 1 then it is not a specialist term.

TERMINOLOGY in TEXT
First ten most frequent terms in Springer-Verlag’s medical text corpus (Muchmore; N = 1.08 million words) compared with the BNC.

Word      Relative frequency (%)  Weirdness ratio
the       6.3%                    1.02
of        5.1%                    1.75
and       3.2%                    1.18
in        2.8%                    1.48
a         2.0%                    0.92
to        1.8%                    0.72
with      1.3%                    2.05
is        1.1%                    1.14
for       1.0%                    1.12
patients  0.9%                    50.46  (weird word)
TOTAL     25.5%

TERMINOLOGY in TEXT
The frequencies of words in a text carry a signature: if the text is specialist then a select few terms are used repeatedly.

TERMINOLOGY in TEXT
Specialist text corpora, collections of systematically organised texts, have been used in studying the language of linguistics, and in the manual creation of terms in systemic linguistics and in theoretical linguistics. The works of David Crystal, Reinhard Hartmann, Alex de Joia and Adrian Stenton, Robert Trask and Kirsten Malmkjær all refer to the manual analysis of collections of texts for extracting, validating and elaborating linguistic terms.

TERMINOLOGY in TEXT
Special languages deal with a range of named or designated entities: objects, events, acts, processes, abstractions and generalisations, to name but a few. These entities may have different qualities and quantities, may behave differently, and the behaviour may be further sub-classified.
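The weirdness ratio defined in the slides above can be sketched in a few lines of Python. This is a minimal illustration, not the System Quirk implementation: the function name `weirdness` is hypothetical, the counts for "the" are taken from the frequency tables in the slides, and the corpus sizes (1.08 million and 100 million words) are the rounded figures quoted in the deck. Returning infinity for a word absent from the general corpus is an assumption that mirrors the INF entries shown for neologisms later in the deck.

```python
def weirdness(f_special, n_special, f_general, n_general):
    """Relative frequency in the special corpus divided by that in the general corpus."""
    rel_special = f_special / n_special
    rel_general = f_general / n_general
    if rel_general == 0:
        # Assumption: a word unseen in the general corpus is treated as infinitely weird.
        return float("inf")
    return rel_special / rel_general


# Counts for "the" from the slides; corpus sizes are the rounded values quoted there.
N_MUCHMORE = 1_080_000
N_BNC = 100_000_000

w_the = weirdness(68451, N_MUCHMORE, 6181374, N_BNC)
w_unseen = weirdness(12, N_MUCHMORE, 0, N_BNC)  # hypothetical count for a word absent from the BNC

print(f"the: {w_the:.2f}")
print(f"unseen word: {w_unseen}")
```

A closed class word such as "the" comes out close to 1, while an open class word that is concentrated in the special corpus yields a ratio far above 1.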
Special language vocabularies largely comprise nouns, adjectives, (full) verbs, and adverbs; these are sometimes referred to as words of the open classes, classes whose stock is constantly changing. Words in each of the open classes have approximately ‘the same grammatical properties and structural possibilities’ (Quirk et al 1985:72).

TERMINOLOGY & ONTOLOGY in TEXT
• Specialist languages have larger vocabularies and ‘restricted’ syntax (Gerr 1943), show burstiness (Justeson & Katz 1992), or weirdness (Malinowski 1930, Ahmad 1995);
• Specialist languages are governed, in part, by local grammars (Harris 1991, Ahmad et al 2003, 2004);
• Specialist languages show morphological productivity (after Bauer 2000), e.g. frequent use of plurals, blending, and collocation patterns involving a restricted number of heads (Smadja 1991, Ahmad et al 2001, 2002);
• Specialist languages have in-built lexical semantic cues that include enumeration of the members of a sub-class (Hearst 1992) and part-whole relationships (after Cruse 1986, Ahmad et al 2003).

TERMINOLOGY in TEXT
The determiners, conjunctions, (primary and modal) verbs and pronouns (not frequently used in special languages) belong to the closed word classes; their stock is seldom renewed, if at all.
It has been remarked that ‘statistical data can confirm that special languages have a higher rate of repetition of lexical items than general language texts’ (Sager, Dungworth and McDonald 1981).
Specialist texts comprise a large number of frequently used nouns (or nominals), which in many ways form the signature of the subject domain in which they are used.

TERMINOLOGY in TEXT
Recall that a corpus comprises the evidence of how a language is being used at various levels of description.
Specialist text corpora can be distinguished from general language texts at different levels of linguistic description: at the level of word usage (lexical), at the level of phrases and sentences (grammatical), at the level of meaning (semantics), and at the level of intentions (pragmatics).

TERMINOLOGY in TEXT
The use of terminology distinguishes one domain from another; different concepts are emphasised in different domains.
The same concept may be referred to by different names.
The frequencies of words in a text carry a signature: if the text is specialist then a select few terms are used repeatedly.
Everyday, general language texts seldom carry a signature.
Texts in modern nuclear physics can be identified by the signature:
SINGLE TERMS: nuclei, nucleus, nuclear, neutron, electrons, scattering, particle, particles, nucleon
COMPOUND TERMS: kinetic energy, nuclear structure, angular momentum, nn-transition, nuclear reactions, target nucleus

TERMINOLOGY in TEXT
The use of terminology distinguishes one domain from another; different concepts are emphasised in different domains.
The same concept may be referred to by different names.
The frequencies of words in a text carry a signature: if the text is specialist then a select few terms are used repeatedly.
Everyday, general language texts seldom carry a signature.
Texts in modern linguistics can be identified by the signature:
SINGLE TERMS: gender, nouns, agreement, noun, form, case, language, structure, semantic
COMPOUND TERMS: network morphology, noun phrase, gender system, gender agreement, gender systems, noun phrases, semantic agreement, complex demonstratives, lexemic hierarchy

TERMINOLOGY in TEXT
The use of terminology distinguishes one domain from another; different concepts are emphasised in different domains.
The same concept may be referred to by different names.
The frequencies of words in a text carry a signature: if the text is specialist then a select few terms are used repeatedly.
Everyday, general language texts seldom carry a signature.
Texts in forensic science can be identified by the signature:
SINGLE TERMS: evidence, crime, scene, forensic, police, identification, case, court, analysis, time, information, blood
COMPOUND TERMS: crime scene, forensic evidence, court case, blood analysis, earprint, fingerprint, crime scenes

TERMINOLOGY in TEXT
The use of terminology distinguishes one domain from another; different concepts are emphasised in different domains.
The same concept may be referred to by different names.
The frequencies of words in a text carry a signature: if the text is specialist then a select few terms are used repeatedly.
Everyday, general language texts seldom carry a signature.
In texts of all specialist domains, a few repeatedly used terms form the SIGNATURE. These terms are used PRODUCTIVELY: in plural form, as (heads of) compounds, and in derivative forms.
nucleus: nuclei (PL.), nuclear (Adjective); stable/unstable nuclei; halo/closed shell nuclei; nuclear force/reaction; nuclear matter
crime: crime, criminal, crimes, criminals, criminalistics, criminology, criminalist(s), criminological, criminality; crime scene; crime of passion; property crime

TERMINOLOGY in TEXT
The use of terminology distinguishes one domain from another; different concepts are emphasised in different domains.
We compare the frequency of every word in our corpus (descriptions and specialist texts) with that of a standard general language corpus (the British National Corpus).

TERMINOLOGY in TEXT
The use of terminology distinguishes one domain from another; different concepts are emphasised in different domains.
Surrey Forensic Science Corpus (SFSC) = 0.58 Million words; 5 words comprise 18% of the corpus.

Word  SFSC: Absolute frequency  SFSC: Relative frequency
the   39718                     6.8%
of    21387                     3.7%
and   15491                     2.7%
to    14830                     2.5%
a     14217                     2.4%

TERMINOLOGY in TEXT
The use of terminology distinguishes one domain from another; different concepts are emphasised in different domains.
British National Corpus (BNC) = 100 Million words; 5 words comprise 16.5% of the corpus.

Word  BNC: Absolute frequency  BNC: Relative frequency
the   6181374                  6.2%
of    2938675                  2.9%
and   2680037                  2.7%
to    2557635                  2.6%
a     2148608                  2.1%

TERMINOLOGY in TEXT
The use of terminology distinguishes one domain from another; different concepts are emphasised in different domains.
British National Corpus (BNC) = 100 Million words; Surrey Forensic Science Corpus (SFSC) = 0.58 Million words.

Word  SFSC: Relative frequency  BNC: Relative frequency  SFSC/BNC: WEIRDNESS
the   6.8%                      6.2%                     1.1
of    3.7%                      2.9%                     1.2
and   2.7%                      2.7%                     1.0
to    2.5%                      2.6%                     1.0
a     2.4%                      2.1%                     1.1

The 5 words have about the same distribution in the two corpora. These are the so-called closed class words, or grammatical words, and one may expect to find them with about the same frequency in both, since both corpora comprise English language texts. There is no weirdness in the use of these words in the Forensic Science corpus.

TERMINOLOGY in TEXT
The use of terminology distinguishes one domain from another; different concepts are emphasised in different domains.
British National Corpus (BNC) = 100 Million words; Surrey Forensic Science Corpus (SFSC) = 0.58 Million words.

Word      SFSC: Relative frequency  BNC: Relative frequency  SFSC/BNC: WEIRDNESS
evidence  0.47%                     0.021%                   22
crime     0.40%                     0.007%                   57
scene     0.27%                     0.007%                   40
forensic  0.25%                     0.001%                   473
police    0.25%                     0.028%                   9

The 5 words do not have the same distribution in the two corpora. These are the so-called open class words, or lexical words. For every 22 instances of evidence in the Surrey corpus there is only one instance of this word in the BNC, relative to the size of each corpus.
And forensic is the most weird: 473 instances in the Surrey corpus as opposed to only one in the BNC.

TERMINOLOGY in TEXT
The use of terminology distinguishes one domain from another; different concepts are emphasised in different domains.
British National Corpus (BNC) = 100 Million words; Surrey Forensic Science Corpus (SFSC) = 0.58 Million words.

Word        SFSC: Relative frequency  BNC: Relative frequency
bitemark    0.0187%                   0%
earprint    0.0137%                   0%
accelerant  0.0115%                   0%
pyrolysis   0.0139%                   0.00001%
ballistics  0.0146%                   0.00002%

SFSC/BNC: WEIRDNESS values shown for the two words that occur in the BNC: 634, 1263.
The first three words DO NOT EXIST in the BNC: these are the so-called neologisms, or new words. Pyrolysis and ballistics are both rarely used words in the BNC.

TERMINOLOGY in TEXT
In our corpus of texts written by Chomsky, comprising a sample of Syntactic Structures (c. 9500 words), Aspects of the Theory of Syntax (chapters 1 and 2, c. 9309 words) and Government and Binding (chapters 1 and 3, c. 24897 words), we see that the closed class words dominate this 43,000 word corpus. The 10 most frequent words are closed class words (the, of, in, to, a, is, and, that, be, as) and make up more than a quarter of the total text. There are 32 open class words in Chomsky’s texts, and one can see in Table 3 that grammar dominates the discussion in Chomsky, not only through the term itself but through its variants grammatical and grammars.

TERMINOLOGY in TEXT
Language Change of Mr Chomsky, 1957 to 1981

Tokens                                                                  Relative frequency  No. of OCW
the, of, in, to, a, is, and, that, be, as                               27.54%              0
it, for, we, are, this, i, case, by, not, on                            7.51%               1
with, s, theory, np, or, grammar, have, these, such, will               4.86%               3
an, but, if, from, which, some, can, alpha, may, language               3.78%               2
there, grammatical, other, structure, more, only, t, one, binding, no   2.78%               3
TOTAL                                                                   46.47%              9

TERMINOLOGY in TEXT
Language Change in Mr Chomsky, 1957 to 1981
Relative frequency (%) of each word in Syntactic Structures (1957, 9504 words), Aspects of the Theory of Syntax (1965, 9309 words) and Government & Binding (1981, 24807 words).

Word             Syntactic Structures  Aspects    Government & Binding
grammar          0.55                  0.74       0.31
grammars         0.24                  0.24       0.15
grammatical      0.88                  0.39       0.06
grammatically    0.01                  Not found  Not found
grammaticalness  Not found             0.12       0.09
ungrammatical    Not found             0.09       0.03

ONTOLOGY in TEXT
Once the single terms, especially weird terms, are identified, System Quirk finds candidate compound terms by computing collocation statistics between the single terms and other open class words in the entire CORPUS.
Here is a list of the most weird terms, alphabetically ordered (American English spelling).

TERMINOLOGY in TEXT
Language of the Tunneling Diodes (c. 1980’s)
Tunnelling Diodes: High-speed devices yet to be manufactured

TERMINOLOGY in TEXT
Language of the Nanotube Corpus (1 Million Words; 1990-2000): Journal Papers; Patent Applications; Book Chapters; Conference Announcements

Word            BNC (per 100 million words)  Nanotube corpus (per million words)  Weirdness
nanowires       0                            619                                  INF
nanoparticles   1                            829                                  81996
nanowire        0                            360                                  INF
nanotube        2                            969                                  47921
nanoscale       0                            268                                  INF
nanoparticle    0                            232                                  INF
nanotubes       5                            1379                                 27279
nanostructures  0                            212                                  INF
mwnts           0                            176                                  INF
nanorods        0                            159                                  INF
nanocrystals    2                            395                                  19534

BUILDING A THESAURUS
Surrey Nanotube Corpus = 1.09 Million words. Carbon is the 15th most frequent word in our corpus; we compute its downwards collocates.
Collocations with carbon (frequency of 1506) in the Surrey Nanoscale science corpus. Columns -5 to 5 give the frequency of the collocate at each position relative to carbon.

Collocate     Freq  -5  -4  -3  -2  -1  1    2  3  4  5
nanotubes     690   8   8   9   2   0   647  6  0  7  3
nanotube      252   3   2   2   0   0   229  2  1  5  8
singlewalled  77    0   0   1   1   75  0    0  0  0  0
aligned       94    1   1   3   5   74  0    1  1  3  5
multiwalled   70    1   1   2   0   59  0    0  1  5  1
amorphous     58    1   1   6   0   46  0    1  1  0  2
atoms         51    1   2   0   1   42  0    1  3  1  0

BUILDING A THESAURUS
Collocations with carbon nanotubes (frequency of 647) in the Surrey Nanoscale science corpus.
Collocate     Frequency  -5  -4  -3  -2  -1  1  2  3  4
singlewalled  73         0   0   1   1   71  0  0  0  0
aligned       63         1   1   1   5   48  0  0  2  4
multiwalled   53         0   0   1   0   46  0  0  5  1
properties, multiwall (cells interleaved): 60 1 4 1 5 32 0 34 0 1 0 1 0 0 6 2 30 0 2 0 0

LSP: Collocation Patterns

Phrase     Collocate    k-score   -5  -4  -3  -2  -1   1    2   3  4  5
Field      magnetic     24.02078  4   0   0   0   367  0    0   1  4  1
Field      electric     16.75592  2   0   1   0   257  0    0   0  1  3
Energy     fermi        13.57791  0   1   0   1   90   0    0   1  5  2
Energy     kinetic      5.297875  1   0   0   0   40   0    0   0  0  0
electron   dimensional  10.20212  1   7   0   2   65   0    0   0  2  2
electron   tunneling    10.06765  2   1   5   3   6    49   5   1  3  3
tunneling  resonant     25.88729  2   5   5   13  295  0    4   2  6  6
tunneling  electron     5.845766  3   3   1   5   49   7    3   5  1  2
tunneling  diodes       2.982692  0   1   0   0   0    36   0   1  3  1
tunneling  diode        2.59579   2   0   0   0   0    34   0   1  0  0
current    density      12.12932  1   0   1   1   0    98   1   0  1  2
current    voltage      17.72585  10  7   7   7   2    90   7   7  9  6
Barrier    double       19.90954  2   3   1   0   191  0    1   0  1  0
Barrier    height       5.159374  2   4   2   0   0    38   4   0  1  3
quantum    wells        25.73331  1   1   0   1   0    251  0   0  2  2
quantum    multiple     6.539368  0   1   0   0   64   0    1   2  0  0
quantum    structures   6.337326  3   2   3   0   1    3    44  5  5  0

LSP: Collocation Patterns

Phrase     Collocate    k-score  -5  -4  -3  -2  -1   1   2  3  4  5  Total
tunneling  resonant     25.9     2   5   5   13  295  0   4  2  6  6  338
field      magnetic     24.0     4   0   0   0   367  0   0  1  4  1  377
field      electric     16.8     2   0   1   0   257  0   0  0  1  3  264
energy     Fermi        13.6     0   1   0   1   90   0   0  1  5  2  100
electron   dimensional  10.2     1   7   0   2   65   0   0  0  2  2  79
electron   tunneling    10.1     2   1   5   3   6    49  5  1  3  3  78
tunneling  electron     5.8      3   3   1   5   49   7   3  5  1  2  79
energy     kinetic      5.3      1   0   0   0   40   0   0  0  0  0  41
tunneling  diodes       3.0      0   1   0   0   0    36  0  1  3  1  42

LSP: (Re)Collocation

Phrase                Collocate   k-score   -5  -4  -3  -2  -1  1   2   3  4  5
magnetic field        transverse  4.208123  0   0   0   0   17  0   0   1  0  0
magnetic field        parallel    9.754802  0   1   0   0   5   12  10  9  2  0
resonant tunneling    diodes      6.760186  0   0   0   0   0   31  0   1  0  1
resonant tunneling    diode       5.21645   0   0   0   0   0   25  0   1  0  0
double barrier        resonant    7.566908  1   8   4   2   0   27  0   2  1  1
double barrier        tunneling   8.607712  3   2   13  4   1   1   24  1  2  1
current density       j           5.782593  0   1   0   0   0   19  1   1  0  0
current density       k           3.175241  0   0   0   0   0   0   13  0  0  0
quantum wells         multiple    5.569255  0   1   0   0   17  1   0   0  0  0
quantum wells         coupled     5.569255  0   0   1   2   13  2   1   0  0  0
dimensional electron  gases       3.444534  0   0   0   0   0   8   0   0  0  0
dimensional electron  2deg        1.321555  0   1   0   0   0   0   3   0  0  0

LSP: (ReRe)Collocation
Statistically discover the extent of compounding of a term.

Phrase                   Collocate   k-score   -5  -4  -3  -2  -1  1   2  3  4  5
double barrier resonant  tunneling   6.894571  1   0   3   1   0   24  0  0  0  0
double barrier resonant  structures  1.317459  0   0   0   0   0   0   5  2  0  0

Machine Generated results
unipolar resonant tunnelling diode; bipolar resonant tunnelling diodes
• high-frequency characteristics of the bipolar light-emitting resonant tunnelling diode are compared to the unipolar resonant tunnelling diode and the resonant interband tunnelling diode.
• The high-frequency characteristics of bipolar resonant tunnelling diodes are experimentally investigated at room temperature.
• High-frequency capacitance of bipolar resonant tunnelling diodes
triplebarrier diodes; triplebarrier diode
• k: From the calculated ||M22||^2 the fundamental quantities that determine the resonant tunnelling diode characteristics (transmission coefficients, resonant conditions, and full widths at half maximum of resonant peaks) are calculated for double- and triple-barrier diodes.
• k: The same calculation method is then also applied to a calculation of ||M22||^2 for a triple barrier diode.

Continued Evolution: a future concept hierarchy?
(Terms found with low frequencies in current texts)
tunneling diode
resonant tunneling diode
unipolar resonant tunneling diode
Same thing?
interband resonant tunneling diode
resonant interband tunneling diode - RITD
delta doped resonant tunneling diode
double-barrier resonant tunneling diode
quantum well resonant tunneling diode
bipolar light-emitting resonant tunneling diode
interband double barrier tunneling diode

An ALGORITHM FOR TERMINOLOGY AND ONTOLOGY EXTRACTION

Collocation strength:
\[ k_{ij} = \frac{f_{ij} - \bar{f}}{\sigma} \ge k_0 \]

Collocation spread:
\[ U_i = \frac{1}{10} \sum_{j=1}^{10} \left( p_i^j - \bar{p}_i \right)^2 \ge U_0 \]

Peakedness:
\[ p_i^j \ge \bar{p}_i + \left( k_1 \times \sqrt{U_i} \right) \]

Metrics: \( (k_0, k_1, U_0) = (1, 1, 10) \)

Ahmad, Khurshid, and Rogers, Margaret A. (2001). ‘Corpus Linguistics and Terminology Extraction’. In (Eds.) Sue-Ellen Wright and Gerhard Budin. Handbook of Terminology Management (Volume 2). Amsterdam & Philadelphia: John Benjamins Publishing Company. pp 725-760.
Smadja, Frank. (1994). Retrieving Collocations from Text: Xtract. In (Ed.) Susan Armstrong (Warwick). Using Large Corpora. Cambridge, Massachusetts & London, England: MIT Press. pp 143-177.

ONTOLOGY in TEXT
Collocates of the weird term EARPRINT + collocation statistics

ONTOLOGY in TEXT
Collocates of the weird term EARPRINT IDENTIFICATION

ONTOLOGY in TEXT
A hierarchy of EARPRINT collocates

ONTOLOGY in TEXT
An inheritance hierarchy of EARPRINT collocates

ONTOLOGY in TEXT
The hierarchy is rendered into an RDF description which can be exported by System Quirk to other (web-enabled) applications.
[Figure: hierarchy of nanotube terms with frequencies, level by level: nanotubes (1378); carbon nanotubes (647), z nanotubes (24); aligned carbon nanotubes (48), multiwalled carbon nanotubes (46), single-wall carbon nanotubes (24), multiwall carbon nanotubes (46); multiwalled carbon nanotubes mwnts (13), single-wall carbon nanotubes swnts (4); vertically aligned carbon nanotubes (15), vertically aligned carbon kai (15)]

ONTOLOGY in TEXT
An inheritance hierarchy of EARPRINT collocates exported to a knowledge representation system, PROTEGE

ONTOLOGY in TEXT
A multiple inheritance hierarchy of EARPRINT collocates now exported to a knowledge representation workbench, PROTEGE: Rubbish!

ONTOLOGY in TEXT
Editing a multiple inheritance hierarchy of EARPRINT collocates now exported to a knowledge representation workbench, PROTEGE

ONTOLOGY in TEXT
Edited multiple inheritance hierarchy of EARPRINT collocates now exported to a knowledge representation workbench, PROTEGE

ONTOLOGY in TEXT
Automatic extraction of terms and their relationship to other terms (ontology) from texts using System Quirk and Stanford Uni’s PROTÉGÉ. The system can reason over the hierarchy and infer (new) facts.
[Figure: inheritance tree with the node labels EVIDENCE TRACE EVIDENCE BLOOD INORGANIC FIBRE FIBRE MANUFACTURED POLYMERIC FIBRE DNA DYE FIBRE]

ONTOLOGY in TEXT
The production of the inheritance tree shows the conceptual organisation of terms in a specialist domain: a thesaurus, in other words.
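One simple way to approximate such an inheritance tree from a list of compound terms is to attach each term to the longest shorter term that forms its tail, since English compounds are usually head-final. The sketch below is only an illustration under that assumption; it is not the procedure used by System Quirk, and the function name `build_hierarchy` is hypothetical. The terms are the tunneling diode compounds listed earlier in the deck.

```python
def build_hierarchy(terms):
    """Map each term to its nearest broader term, or None for a root.

    Assumption: the broader term is the longest proper word-suffix of the
    compound that is itself in the term list (head-final compounding).
    """
    term_set = set(terms)
    parent = {}
    for term in terms:
        words = term.split()
        parent[term] = None
        # Drop modifiers from the left, longest remaining suffix first.
        for start in range(1, len(words)):
            candidate = " ".join(words[start:])
            if candidate in term_set:
                parent[term] = candidate
                break
    return parent


terms = [
    "tunneling diode",
    "resonant tunneling diode",
    "unipolar resonant tunneling diode",
    "interband resonant tunneling diode",
    "resonant interband tunneling diode",
    "quantum well resonant tunneling diode",
]

hierarchy = build_hierarchy(terms)
for term, broader in hierarchy.items():
    print(f"{term} -> {broader}")
```

Note that the heuristic places "interband resonant tunneling diode" and "resonant interband tunneling diode" under different parents, which is exactly the kind of case that needs a human editor to decide whether the two terms denote the same thing.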
Such a conceptual organisation reflects the ontological commitment of the domain community, that is, how the conventions of organising concepts have evolved and been adhered to. This commitment is often, misleadingly, just called ontology.

ONTOLOGY in TEXT
The terminology and the ontological commitment thus identified will be used to create the visual evidence thesaurus.
No assumptions were made about the terminology or ontology, apart from the assumption that the way in which specialists write descriptions is unique to the specialism: weirdness and local grammar.

Sentiment & Market Analysis
Reuters Financial Services: Streaming Data and News Service

Sentiment & Market Analysis
News Analysis: service for extracting MARKET SENTIMENT.
Correlation: Market sentiment correlation with financial time series.

Fusing Qualitative and Quantitative Data Analysis
We have developed a Sentiment and Time Series: Financial analysis system (SATISFI) for visualising and correlating the sentiment and instrument time series, both as text (and numbers) and graphically.
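As a rough illustration of what correlating a sentiment series with an instrument series can mean, the sketch below computes a Pearson correlation coefficient between two equally long series. It is an assumption that a Pearson coefficient is an appropriate measure here; the slides do not say which statistic SATISFI uses. The function name `pearson` and both toy series are invented for the example.

```python
from math import sqrt


def pearson(xs, ys):
    """Pearson correlation coefficient of two equally long numeric series."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("series must have the same length (at least 2)")
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        raise ValueError("a constant series has no defined correlation")
    return cov / sqrt(var_x * var_y)


# Invented toy data: daily counts of positive sentiment words in news,
# and the daily change (%) of a financial instrument on the same days.
sentiment = [3, 5, 2, 8, 7, 4]
returns = [0.1, 0.4, -0.2, 0.9, 0.5, 0.0]

r = pearson(sentiment, returns)
print(f"correlation: {r:.2f}")
```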
Afterword
Specialists use language in an idiosyncratic fashion:
• Repeat lexical items comprising the specific vocabulary of a subject domain;
• Invent new words;
• Borrow words from other domains;
• Re-define words or terms.
Such processes contribute significantly to the organisation and communication of tacit and explicit knowledge.

Afterword
In order to investigate innovation or creativity, we have developed a computer-based method that compares the relative occurrence of single words in an English scientific paper (or a collection or corpus of papers) with the occurrence of the words in a representative sample of contemporary English language.
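The collocation statistics used earlier in the deck (strength, spread and peakedness, after Smadja's Xtract) can be sketched as follows. This is a minimal illustration under stated assumptions, not the System Quirk code: the function names are hypothetical, the standard deviation is taken as the population value, and the strength scores are computed over only three collocates of "tunneling", so they will not reproduce the k-scores in the tables. The positional counts are copied from the "LSP: Collocation Patterns" table.

```python
from math import sqrt

POSITIONS = [-5, -4, -3, -2, -1, 1, 2, 3, 4, 5]


def strength(freqs):
    """z-score of each collocate's total frequency (population standard deviation assumed)."""
    mean = sum(freqs.values()) / len(freqs)
    sigma = sqrt(sum((f - mean) ** 2 for f in freqs.values()) / len(freqs))
    return {w: (f - mean) / sigma if sigma else 0.0 for w, f in freqs.items()}


def spread(hist):
    """Variance of a collocate's frequency over the ten positions (U)."""
    mean = sum(hist) / len(hist)
    return sum((p - mean) ** 2 for p in hist) / len(hist)


def peaks(hist, k1=1.0):
    """Positions whose count is at least mean + k1 * sqrt(U)."""
    mean = sum(hist) / len(hist)
    threshold = mean + k1 * sqrt(spread(hist))
    return [pos for pos, p in zip(POSITIONS, hist) if p >= threshold]


# Positional counts for three collocates of "tunneling", from the table in the deck.
histograms = {
    "resonant": [2, 5, 5, 13, 295, 0, 4, 2, 6, 6],
    "electron": [3, 3, 1, 5, 49, 7, 3, 5, 1, 2],
    "diodes": [0, 1, 0, 0, 0, 36, 0, 1, 3, 1],
}

k = strength({w: sum(h) for w, h in histograms.items()})
for word, hist in histograms.items():
    print(word, round(k[word], 2), round(spread(hist), 2), peaks(hist))
```

With the thresholds quoted in the deck, a collocate is kept when its spread is at least 10 and it has a peak position; here "resonant" peaks immediately before "tunneling" and "diodes" immediately after it, which suggests the compound "resonant tunneling diodes".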