Building terminology and
conceptual (ontological)
systems from text corpora?
Khurshid Ahmad
Professor of Artificial Intelligence,
Department of Computing,
University of Surrey
Surrey, ENGLAND.
Dial-a-Corpus, Tuscan Word Center/University of Siena,
Certosa di Pontignano, Siena, Italy.
June 28th, 2005.
KNOWLEDGE, TEXT &
LEARNING
Terminology relates to the conceptual
systems of specialist disciplines:


some argue that a term denotes a concept
whilst
others deny this correspondence
Conceptual systems are organised by
definition: they are systems, and a
system is organisation per se.
KNOWLEDGE, TEXT &
LEARNING
Building a specialist corpus:

Use search engines –think of keywords
carefully; Use crawlers and design bots.
Visit
• Journal sites; Book sites; Popular Science &
Sunday Newspaper Supplement;
• (Patent Documents – free from US PTO)
• Newsletters; Course & Conference
Announcements;
• Scientific Biography;
Balance the corpus
KNOWLEDGE, TEXT &
LEARNING
Building a specialist corpus:




Collect data;
Organise data
Analyse data
Visualise results
KNOWLEDGE, TEXT & LEARNING
Development of Visual Evidence Thesaurus
• The collateral texts – written texts or speech (fragments) closely or
loosely related to an image or objects within the image.
• The collateral texts are special language texts and comprise keywords
that may help in indexing and retrieving the images.
[Figure: a PICTURE with its collateral texts]
CAPTION: "and Davis was armed with a 9mm Browning High Power, 9mm caliber semiautomatic pistol with an obliterated serial number."
Collateral text: "A small firearm with a more or less curved stock, adapted to be held in, and fired with, one hand"
Collateral text: "Firearm found on top of a table"
Collateral text: "9 mm Browning high-power pistol"
BUILDING A SPECIALIST CORPUS
Surrey Forensic Science Corpus
( 0.58 Million words)
• Descriptions of Images by Scene of Crime Experts;
• FBI, UK, Australian Literature on Scene of Crime Practice;
• Research Papers on Evidence-based policing
• Brochures and Marketing documents on Crime Labs;
• Newspaper reports relating to scene of crime
• Descriptions of courses and conferences;
• Mainly American English texts
BUILDING A SPECIALIST CORPUS
Organise Texts
• Build a conceptual system for differentiating texts:
• Imaginative, Informative, Instructional;
• Author Attribute (Age, Nationality, Gender,
Language)
• Publication attributes (Date of Publication,
Audience….)
• Sub-domain categories
•Categorise texts on your filestore
BUILDING A SPECIALIST CORPUS:
Data on the Web – Reuters Financial News



• Classification of documents into a fixed number of predefined categories
• Each document can be in multiple, one, or no category
• Building text classifiers by hand is difficult and time consuming
• Automatic text classifiers can help with this problem
[Figure: percentage of documents (0% to 30%) plotted against the number of codes per document (1 to 27).]
BUILDING A SPECIALIST CORPUS:
Data on the Web – Reuters Financial News
• TREC-AP text collection: AP newswire stories from the TREC/TIPSTER, a total of
242,918 AP stories from 1988-1990 combined with 20 categories (Lewis & Gale, 1994)
• Reuters Collection Volume 1 (RCV1): 806,791 stories from 1997-1998 combined with:
• 126 codes for topics,
• 870 codes for industry and
• 266 codes for regions (Lewis, 1992)

Topic codes
Corporate/industrial (CCAT)
C11 strategy/plans
C12 legal/judicial
C13 regulation/policy
C14 share listings
C15 performance
C151 accounts/earnings
C1511 annual results
C152 comment/forecasts
Economics (ECAT)
E11 economic performance
E12 monetary/economic
E121 money supply
E13 inflation/prices
Government/social (GCAT)
G11 social affairs
G111 health/safety
G112 social security
G113 education/research
Markets (MCAT)
M11 equity markets
M12 bond markets
M13 money markets
M131 interbank markets

Industry codes
I0 agriculture, forestry and fishing
I00 agriculture, forestry and fishing
I000 agriculture, forestry and fishing
I01 agriculture and horticulture
I010 agriculture and horticulture
I0100 agriculture and horticulture
I01000 agriculture and horticulture
I01001 agriculture
I0100105 cattle farming
I0100107 egg production

Region codes
AARCT Antarctica
ABDBI Abu Dhabi
AFGH Afghanistan
AFRICA Africa
AJMN Ajman
ALB Albania
ALG Algeria
AMSAM American Samoa
Terminology Resources
for Categorising Texts
Stock
Preferred Stock
Common Stock
See www.Investorwords.com
Junior Equity
BUILDING A SPECIALIST CORPUS:
Organising Data on the Web – Reuters Financial News
See: WEBSOM
Definitions: Terminology


Etymologically, the doctrine or scientific study
of terms; in use almost always the system of
terms belonging to any science or subject;
technical terms collectively; nomenclature.
Hence terminological (adj.) pertaining to
terminology; terminological inexactitude, a
humorous expression for a falsehood;
terminologically (adv. ); terminologist, one
versed in terminology.
Definitions: Special Language
•The special language of focussed, single
minded pursuits: Science, technology, sports,
politics, philosophy,……
•A natural language privileges persons ; in
contrast the “splinter of ordinary language”
that we call [specialised] scientific discourse
privileges a world of objects, processes,
happenings, events.
Definitions: Special Language
•The ‘identificatory force’ of subject
position in grammar [of Indo-European
languages] is reserved for speakers and
their fellow creatures.
•The ‘identificatory force’ of subject
position in grammar of specialist
discourse is reserved for objects,
processes, happenings, events
Definitions: Special Language A note on creativity or terminicide
‘That science has become more difficult for nonspecialists to
understand is a truth universally acknowledged’.
The choice of words in a journal paper is very different to that in a
quality newspaper – obscuring the work of the scientists.
Source               Lexical Difficulty
Nature               55.5
Science              44.8
Cell                 31.6
Physics Today        13.3
New Scientist         4.0
Quality Newspaper     0.0
Donald Hayes (1992) ‘The growing inaccessibility of science’. Nature. Vol 356, pp 739-740
Definitions: Special Language –
A note on creativity or terminicide
Source                                 Lexical Difficulty
Quality Newspaper                        0.0
Popular Science (Discover)              -4.7
Fiction                                -19.3
Nat. History magazine (Ranger Rick)    -22.6
Children’s fiction                     -27.4
Farm-workers talking to cows           -63.8
TEXT TECHNOLOGY?
DERIVE information from text;
FEED information into a
conceptual or mathematical
model;
USE model for analysis or
prediction
HERMENEUTICS & CORPUS
LINGUISTICS?
Languages are constantly in flux
The corpus linguist explores the discourse
as a system that can be explained without
referring to a discourse external reality or
to the mental state of the members of the
discourse community.
Teubert, Wolfgang (2003). Writing, hermeneutics and corpus linguistics. Logos
and Language Vol.IV (no. 2) pp 1-17.
HERMENEUTICS & CORPUS
LINGUISTICS?
Interpretament
Texts are responses to
previous texts and the texts are
then responded to in turn and
the cycle continues; hence
the diachronic dimension
HERMENEUTICS &
CORPUS LINGUISTICS
Interpretament
Encyclopaedias
deal with monosemic concepts
Dictionaries
deal with polysemic words
Text corpora
deal with neologisms, especially
compounds and abbreviations,
and retronyms.
Explanation of a concept becomes the
paraphrase of a phrase and thus a
discrete internal object.
HERMENEUTICS &
CORPUS LINGUISTICS
Interpretament
Where will you find the evidence of use,
definition, and elaboration of terms like:
• inclusive learning environment (e-Learning)
• Borromean Halo Nuclei (Radioactive Nuclear Beam Physics)
• honeycombed catalytic converter (Automotive Engineering)
• individualist weak supervenience (Philosophy of Science)
• indoor blood videotaping (Forensic Science)
EXCEPT IN A TEXT CORPUS?
Lexicogenesis:
Diachronic Semantic Inversion?
Term/‘Concept’, with its description Before and After the change:

Perception: One sees
    Before: beams coming from an object (Aristotle)
    After: beams leaving the observer's eyes (Pythagoras)
Motion: Objects move due to
    Before: an in-built tendency to move (Aristotle)
    After: something exerts 'attraction' (Galileo)
Solar Cycle: Sunrise is caused by
    Before: a rising Sun (Brahe)
    After: a turning earth (Kepler)
Combustion: Burning an object in air leads to
    Before: a loss of phlogiston (Priestley)
    After: the addition of oxygen (Lavoisier)
Heartbeat: Blood circulation is caused by
    Before: an explosion during diastole of the heart (Descartes)
    After: a compression during systole of the heart (Harvey)
Species: The distinction between species is
    Before: an absolute phenomenon that has been determined in the past (Linnaeus)
    After: a contemporaneous phenomenon with borders between the species (Darwin)
Verschuuren, G. M. N. (1986). Investigating the Life Sciences: An Introduction to the
Philosophy of Science. Oxford: Pergamon Press.
Lexicogenesis:
Synchronic Semantic Change
19 Meanings of Paradigm in The Structure of Scientific
Revolutions, Kuhn (1970)?
Page x    a universally recognised scientific achievement
Page 2    a successful metaphysical speculation
Page 2    a concrete scientific achievement
Page 4    a set of beliefs
Page 10   like a textbook
Page 14   like an analogy
Page 17   a myth
Page 23   like an accepted judicial decision
Page 23   a grammatical paradigm
Lexicogenesis:
Synchronic Semantic Change
19 Meanings of Paradigm in The Structure of Scientific Revolutions, Kuhn
(1970)?
pp37, 76   conceptual and instrumental tools
pp59, 60   a device or type of instrumentation
Page 63    like a gestalt figure
Page 85    an anomalous pack
Page 91    like a set of political institutions
Page 102   a standard
Page 108   a map
117-121    a new way of seeing things
Page 120   an organising principle governing perception
Page 128   something which determines a large area of reality
Lexicogenesis:
Diachronic Semantic Change
The establishment of atom
1477
1650
1819
An atom is a hypothetical body, so small as to
be incapable of further division; and thus to
be one of the ultimate particles of nature.
Physical Atoms: The supposed ultimate
particles in which matter actually exists
(without reference to its stability).
Chemical Atoms: The smallest particles in
which the elements combine, or are known to
possess the properties of a particular element.
LEXICOGENESIS &
KNOWLEDGE
The origins, evolution and
obsolescence of concepts and
conceptual systems are a hotly
debated subject:
• for some, concepts are ethereal,
but
• for others, concepts relate directly
to our sensual experience.

LEXICOGENESIS &
KNOWLEDGE
More recently, my
colleagues in computing,
especially those in artificial
intelligence and semantic
web, have started to use the
term ontology in a creative
way.
LEXICOGENESIS &
ONTOLOGY?
Ontology: etymologically, the essence of
being
There are those for whom an
ontology
• is a list of terms, perhaps organised as a
thesaurus, and those for whom
• an ontology can be found very easily
provided we find the logical basis of
science, philosophy and thought.

LEXICOGENESIS & TEXT?
If any essence or trace of the
knowledge of the individuals is left
behind then it is usually found in
documents, comprising words,
illustrations and drawings,
mathematical and other symbols.
TERMINOLOGY in TEXT




The use of terminology distinguishes one
domain from another; different concepts are
emphasised in different domains.
The same concept may be referred to by
different names;
The frequency of words in a text carries a
signature – if the text is specialist then a
select few terms are repeatedly used;
Everyday or general language texts seldom
carry a signature.
TERMINOLOGY in TEXT
The ten most frequent words in the Springer-Verlag medical
text corpus (Muchmore; N = 1.08 million words)
Word       Absolute Frequency   Relative Frequency   Relative Frequency (%)
the        68451                0.063315             6.3%
of         55661                0.051484             5.1%
and        34248                0.031678             3.2%
in         30035                0.027781             2.8%
a          21268                0.019672             2.0%
to         19988                0.018488             1.8%
with       14455                0.01337              1.3%
is         12333                0.011408             1.1%
for        10311                0.009537             1.0%
patients    9448                0.008739             0.9%
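A frequency table of this kind can be sketched in a few lines of Python. This is an illustration only: the tokeniser (a lower-cased regular expression), the function name `frequency_table` and the sample sentence are assumptions made for the sketch, not the procedure used to build the Muchmore figures.

```python
from collections import Counter
import re


def frequency_table(text, top=10):
    """Return (word, absolute frequency, relative frequency) for the most frequent words."""
    # Assumed tokenisation: lower-case alphabetic strings, optional internal apostrophe.
    tokens = re.findall(r"[a-z]+(?:'[a-z]+)?", text.lower())
    counts = Counter(tokens)
    n = len(tokens)  # corpus size N in tokens
    return [(word, count, count / n) for word, count in counts.most_common(top)]


# Hypothetical miniature "corpus", used only to show the output shape.
sample = "the patients in the study and the patients in the control group"
for word, absolute, relative in frequency_table(sample, top=3):
    print(f"{word:10s} {absolute:6d} {relative:8.4f} {relative:6.1%}")
```

On a real corpus the same function would be applied to the concatenated text files; a different tokeniser will give slightly different counts.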
TERMINOLOGY in TEXT
The first ten and the next ten most frequent words in the Springer-Verlag medical
text corpus (Muchmore; N = 1.08 million words)
Word       Relative Frequency (%)    Word    Relative Frequency (%)
the        6.3%                      was     0.77%
of         5.1%                      be      0.72%
and        3.2%                      are     0.66%
in         2.8%                      as      0.63%
a          2.0%                      by      0.63%
to         1.8%                      were    0.57%
with       1.3%                      an      0.51%
is         1.1%                      on      0.49%
for        1.0%                      or      0.49%
patients   0.9%                      this    0.47%
TOTAL      25.5%                     TOTAL   6.30%
TERMINOLOGY in TEXT

British National Corpus (BNC)
c. 100 Million words;
5 words comprise 16.5% of the Corpus
Word   BNC: Absolute Frequency   BNC: Relative Frequency
the    6181374                   6.2%
of     2938675                   2.9%
and    2680037                   2.7%
to     2557635                   2.6%
a      2148608                   2.1%
TERMINOLOGY in TEXT
The Muchmore Corpus, N = 1.08 million words (OCW = open class words)
Tokens (in rank order, ten at a time)                                        Cumulative Relative Frequency   No. of OCW
the, of, and, in, a, to, with, is, for, patients                             25.5%                           1
was, be, are, as, by, were, an, on, or, this                                 6.3%                            0
that, after, from, treatment, not, we, at, can, clinical, have               3.4%                            2
which, has, patient, these, results, therapy, been, cases, it, study         2.4%                            5
all, during, disease, only, years, between, may, no, diagnosis, surgery      1.8%                            4
TOTAL                                                                        39.5%                           11
TERMINOLOGY in TEXT
Language of the British National Corpus (c. 1980’s) (OCW = open class words)
Tokens (in rank order, ten at a time)                       Cumulative Relative Frequency   No. of OCW
the, of, and, a, in, to, it, is, was, to                    21.28%                          0
i, for, you, he, be, with, on, that, by, at                 6.66%                           0
are, not, this, but, 's, they, his, from, had, she          4.35%                           0
which, or, we, an, n't, 's, were, that, been, have          3.25%                           0
their, has, would, what, will, there, if, can, all, her     2.42%                           0
TOTAL                                                       37.96%                          0
TERMINOLOGY in TEXT
The weirdness ratio: The asymmetry in the
distribution of a word in a special language
corpus and a general language corpus
\[ \mathrm{weirdness}(term) = \frac{f_{special} / N_{special}}{f_{general} / N_{general}} \]
where f is the frequency of the term and N the total number of words in the respective corpus.
If weirdness >> 1 then a specialist term;
If weirdness < 1 then not a specialist term.
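The ratio can be computed directly from the four counts. The sketch below assumes one convention the formula leaves open: a term that is absent from the general corpus gets an infinite ratio (the value INF is used for such terms elsewhere in these slides). The counts for "the" come from the Surrey Forensic Science Corpus and BNC tables in these slides; the corpus sizes are the rounded figures quoted there, so the result is approximate. The counts for "earprint" are invented for illustration.

```python
import math


def weirdness(f_special, n_special, f_general, n_general):
    """Ratio of a term's relative frequency in a special corpus to that in a general corpus."""
    if f_general == 0:
        # Assumed convention: a term absent from the general corpus is maximally weird.
        return math.inf
    return (f_special / n_special) / (f_general / n_general)


# "the": 39718 occurrences in about 0.58 million words (Surrey corpus),
# 6181374 occurrences in about 100 million words (BNC).
w_the = weirdness(39718, 580_000, 6_181_374, 100_000_000)
print(round(w_the, 1))

# Hypothetical counts for a term that does not occur in the general corpus.
w_new = weirdness(79, 580_000, 0, 100_000_000)
print(w_new)
```

A closed class word such as "the" comes out close to 1, which is the behaviour the threshold rule above relies on.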
TERMINOLOGY in TEXT
The ten most frequent words in the Springer-Verlag medical
text corpus (Muchmore; N = 1.08 million words), compared with the BNC
Word       Relative Frequency (%)   Weirdness Ratio
the        6.3%                     1.02
of         5.1%                     1.75
and        3.2%                     1.18
in         2.8%                     1.48
a          2.0%                     0.92
to         1.8%                     0.72
with       1.3%                     2.05
is         1.1%                     1.14
for        1.0%                     1.12
patients   0.9%                     50.46
TOTAL      25.5%
\[ \mathrm{weirdness}(term) = \frac{f_{special} / N_{special}}{f_{general} / N_{general}} \]
Weird word: patients
TERMINOLOGY in TEXT

The frequency of words in a text carries a
signature – if the text is specialist then a
select few terms are repeatedly used;
TERMINOLOGY in TEXT
Specialist text corpora, collections of
systematically organised texts have been used
in studying the language of linguistics, and in
the manual creation of terms in systemic
linguistics and in theoretical linguistics.
The works of David Crystal, Reinhard
Hartmann, Alex de Joia and Adrian Stanton,
Robert Trask and Kirsten Malmjkær all refer
to the manual analysis of collections of texts
for extracting, validating and elaborating
linguistic terms.
TERMINOLOGY in TEXT
Special languages deal with a range of named or
designated entities: objects, events, acts, processes,
abstractions and generalisations, to name but a few.
These entities may have different qualities and
quantities, may behave differently, and the behaviour
may be further sub-classified.
Special language vocabularies largely comprise
nouns, adjectives, (full-) verbs, and adverbs; these are
sometimes referred to as words of the open classes;
classes whose stock is constantly changing.
Words in each of the open classes have approximately
‘the same grammatical properties and structural
possibilities’ (Quirk et al 1985:72).
TERMINOLOGY & ONTOLOGY in
TEXT
Specialist languages have larger vocabularies and
‘restricted’ syntax (Gerr 1943), show burstiness
(Justeson & Katz 1992), or weirdness (Malinowski
1930, Ahmad 1995);
Specialist languages are governed, in part, by local
grammars (Harris 1991, Ahmad et al 2003, 2004);
Specialist languages show morphological productivity
(after Bauer 2000) e.g. frequent use of plurals,
blending, and collocation patterns involving a
restricted number of heads (Smadja 1991, Ahmad et al,
2001, 2002)
Specialist languages have in-built lexical semantic cues
that include enumeration of the members of a sub-class
(Hearst 1992), part-whole relationships (after Cruse
1986, Ahmad et al 2003).
TERMINOLOGY in TEXT
The determiners, conjunctions, (primary and modal)
verbs and pronouns (not frequently used in special
languages), belong to the closed word classes; their
stock is seldom renewed, if at all.
It has been remarked that ‘statistical data can
confirm that special languages have a higher rate of
repetition of lexical items than general language texts’
(Sager, Dungworth and McDonald 1981)
Specialist texts comprise a large number of frequently
used nouns (or nominals) and, in many ways, form
the signature of the subject domain in which they are
used.
TERMINOLOGY in TEXT
Recall that a corpus comprises the evidence of
how a language is being used at various levels
of description.
Specialist text corpora can be distinguished
from general language texts at different
levels of linguistic description:
at the level of word usage (lexical),
at the level of phrases and sentences (grammatical),
at the level of meaning (semantics), &
at the level of intentions (pragmatics).
TERMINOLOGY in TEXT




Texts in modern nuclear physics can be
identified by the signature:
SINGLE TERMS:
nuclei, nucleus, nuclear, neutron,
electrons, scattering, particle,
particles, nucleon,
&
COMPOUND TERMS:
kinetic energy, nuclear structure,
angular momentum, nn-transition,
nuclear reactions, target nucleus
TERMINOLOGY in TEXT




Texts in modern linguistics can be
identified by the signature:
SINGLE TERMS:
gender, nouns, agreement,
noun, form, case,
language, structure, semantic,
&
COMPOUND TERMS:
network morphology, noun phrase,
gender system, gender agreement,
gender systems, noun phrases,
semantic agreement,
complex demonstratives,
lexemic hierarchy,
TERMINOLOGY in TEXT




Texts in forensic science can be identified
by the signature:
SINGLE TERMS:
evidence, crime, scene,
forensic, police, identification
case, court, analysis,
time, information, blood
&
COMPOUND TERMS:
crime scene, forensic evidence,
court case, blood analysis,
earprint, fingerprint, crime scenes
TERMINOLOGY in TEXT




Texts in all specialist domains show a few
repeatedly used terms that form the SIGNATURE.
These terms are used PRODUCTIVELY – in
plural form, as (heads of) compounds, and
in derivative forms
nucleus → nuclei (PL.), nuclear (Adjective);
stable/unstable nuclei;
halo/closed shell nuclei;
nuclear force/reaction;
nuclear matter
crime → crime, criminal, crimes,
criminals, criminalistics,
criminology, criminalist(s),
criminological, criminality;
crime scene; crime of
passion; property crime;
TERMINOLOGY in TEXT


The use of terminology
distinguishes one domain from
another; different concepts are
emphasised in different domains.
We compare the
frequency of every word in
our corpus (descriptions
and specialist texts) with
that of a standard general
language corpus (the
British National Corpus)
TERMINOLOGY in TEXT

Surrey Forensic Science Corpus (SFSC) = 0.58 Million words;
5 words comprise 18% of the Corpus
Word   SFSC: Absolute Frequency   SFSC: Relative Frequency
the    39718                      6.8%
of     21387                      3.7%
and    15491                      2.7%
to     14830                      2.5%
a      14217                      2.4%
TERMINOLOGY in TEXT

British National Corpus (BNC)
= 100 Million words;
5 words comprise 16.5% of the Corpus
Word   BNC: Absolute Frequency   BNC: Relative Frequency
the    6181374                   6.2%
of     2938675                   2.9%
and    2680037                   2.7%
to     2557635                   2.6%
a      2148608                   2.1%
TERMINOLOGY in TEXT

British National Corpus (BNC)
= 100 Million words;
Surrey Forensic Science Corpus (SFSC)
= 0.58 Million words;
Word   SFSC: Relative Frequency   BNC: Relative Frequency   SFSC/BNC: WEIRDNESS
the    6.8%                       6.2%                      1.1
of     3.7%                       2.9%                      1.2
and    2.7%                       2.7%                      1.0
to     2.5%                       2.6%                      1.0
a      2.4%                       2.1%                      1.1
The 5 words have about the same distribution in the two corpora. These
are the so-called closed class words, or grammatical words, and one expects
to find them with about the same frequency because both corpora consist of
English language texts. There is no weirdness in the use of these words in the
Forensic Science corpus.
TERMINOLOGY in TEXT

British National Corpus (BNC)
= 100 Million words;
Surrey Forensic Science Corpus (SFSC)
= 0.58 Million words;
Word       SFSC: Relative Frequency   BNC: Relative Frequency   SFSC/BNC: WEIRDNESS
evidence   0.47%                      0.021%                    22
crime      0.40%                      0.007%                    57
scene      0.27%                      0.007%                    40
forensic   0.25%                      0.001%                    473
police     0.25%                      0.028%                    9
The 5 words do not have the same distribution in the two corpora. These
are the so-called open class words, or lexical words. Relative to corpus size,
evidence is 22 times as frequent in the Surrey corpus as in
the BNC. And forensic is the most weird: 473 times as frequent in the Surrey
Corpus as in the BNC.
TERMINOLOGY in TEXT

British National Corpus (BNC)
= 100 Million words;
Surrey Forensic Science Corpus (SFSC)
= 0.58 Million words;
Word         SFSC: Relative Frequency   BNC: Relative Frequency   SFSC/BNC: WEIRDNESS
bitemark     0.0187%                    0%                        (not in BNC)
earprint     0.0137%                    0%                        (not in BNC)
accelerant   0.0115%                    0%                        (not in BNC)
pyrolysis    0.0139%                    0.00001%                  634
ballistics   0.0146%                    0.00002%                  1263
The first three words DO NOT EXIST in the BNC: These are the so-called
neologisms, or new words. Pyrolysis & ballistics are both rarely used
words in the BNC.
TERMINOLOGY in TEXT
In our corpus of texts written by Chomsky, comprising a sample
of Syntactic Structures (c.9500 words), Aspects of the Theory of
Syntax (chapters 1 and 2, c. 9309 words) and Government and
Binding (chapters 1 and 3 c.24897 words), we see that the closed
class words dominate this 43,000 word corpus.
First 10 most frequent words are the closed class words - the, of,
in, to, a, is, and, that, be, as and make up more than a quarter of
the total text.
There are 32 open class words in Chomsky’s texts, and one can
see in Table 3 that grammar dominates the discussion in
Chomsky, not only through the term itself but through its
variants grammatical and grammars.
TERMINOLOGY in TEXT
Language Change of Mr Chomsky 1957 to 1981 (OCW = open class words)
Tokens (in rank order, ten at a time)                                      Cumulative Relative Frequency   No. of OCW
the, of, in, to, a, is, and, that, be, as                                  27.54%                          0
it, for, we, are, this, i, case, by, not, on                               7.51%                           1
with, s, theory, np, or, grammar, have, these, such, will                  4.86%                           3
an, but, if, from, which, some, can, alpha, may, language                  3.78%                           2
there, grammatical, other, structure, more, only, t, one, binding, no      2.78%                           3
TOTAL                                                                      46.47%                          9
TERMINOLOGY in TEXT
Language Change in Mr Chomsky 1957 to 1981
Word              Syntactic Structures    Aspects of Theory of Syntax   Government & Binding
                  (9504 words) 1957       (9309 words) 1965             (24807 words) 1981
                  %                       %                             %
grammar           0.55                    0.74                          0.31
grammars          0.24                    0.24                          0.15
grammatical       0.88                    0.39                          0.06
grammatically     0.01                    Not found                     Not found
grammaticalness   Not found               0.12                          0.09
ungrammatical     Not found               0.09                          0.03
ONTOLOGY in TEXT
• Once the single terms, especially weird terms, are identified then
System Quirk finds candidate compound terms by computing
collocation statistics between the single terms and other open class
words in the entire CORPUS.
• Here is a list of most weird terms alphabetically ordered
[Figure: list of the most weird terms; annotation: American English Spelling]
TERMINOLOGY in TEXT
Language of the Tunneling Diodes (c. 1980’s)
Tunnelling Diodes: High-speed devices yet to be
manufactured
TERMINOLOGY in TEXT
Language of the Nanotube Corpus (1 Million Words; 1990-2000): Journal
Papers; Patent Applications; Book Chapters; Conference Announcements
Word             British National Corpus   Nanotube corpus   Weirdness
                 (words/100 million)       (words/million)
nanowires        0                         619               INF
nanoparticles    1                         829               81996
nanowire         0                         360               INF
nanotube         2                         969               47921
nanoscale        0                         268               INF
nanoparticle     0                         232               INF
nanotubes        5                         1379              27279
nanostructures   0                         212               INF
mwnts            0                         176               INF
nanorods         0                         159               INF
nanocrystals     2                         395               19534
BUILDING A THESAURUS
Surrey Nanotube Corpus
= 1.09 Million words;
Carbon is the 15th most frequent word in our corpus; we
compute its downwards collocates.
Collocations with carbon (frequency of 1506) in the Surrey
Nanoscale science corpus, by position relative to carbon.
Collocate      Freq    -5   -4   -3   -2   -1     1    2    3    4    5
nanotubes      690      8    8    9    2    0   647    6    0    7    3
nanotube       252      3    2    2    0    0   229    2    1    5    8
singlewalled    77      0    0    1    1   75     0    0    0    0    0
aligned         94      1    1    3    5   74     0    1    1    3    5
multiwalled     70      1    1    2    0   59     0    0    1    5    1
amorphous       58      1    1    6    0   46     0    1    1    0    2
atoms           51      1    2    0    1   42     0    1    3    1    0
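A positional collocation table like the one for carbon can be sketched as follows. This is a minimal illustration, assuming a pre-tokenised corpus held in a Python list and a window of five words either side of the node word; the function name `positional_collocates` and the sample sentence are made up for the example and are not part of System Quirk.

```python
from collections import defaultdict


def positional_collocates(tokens, node, window=5):
    """For each collocate of `node`, count its occurrences at every offset in the window."""
    table = defaultdict(lambda: defaultdict(int))
    for i, token in enumerate(tokens):
        if token != node:
            continue
        for offset in range(-window, window + 1):
            j = i + offset
            # Skip the node itself and positions outside the text.
            if offset == 0 or j < 0 or j >= len(tokens):
                continue
            table[tokens[j]][offset] += 1
    return table


# Hypothetical miniature corpus.
tokens = ("aligned carbon nanotubes and multiwalled carbon nanotubes "
          "contain carbon atoms").split()
table = positional_collocates(tokens, "carbon")

for collocate in ("nanotubes", "multiwalled", "atoms"):
    row = table[collocate]
    print(collocate, sum(row.values()), dict(sorted(row.items())))
```

A collocate that piles up at one offset (such as +1 for nanotubes after carbon) is a candidate component of a compound term; a collocate spread evenly over the window is not.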
BUILDING A THESAURUS
Collocations with carbon nanotubes
(frequency of 647) in the Surrey Nanoscale science
corpus, by position relative to the phrase.
Collocate      Frequency   -5   -4   -3   -2   -1    1    2    3    4
singlewalled   73           0    0    1    1   71    0    0    0    0
aligned        63           1    1    1    5   48    0    0    2    4
multiwalled    53           0    0    1    0   46    0    0    5    1
properties     60           1    4    1    5   32    0    0    6    2
multiwall      34           0    1    0    1   30    0    2    0    0
LSP: Collocation Patterns
Phrase      Collocate     k-score     -5   -4   -3   -2    -1     1    2    3    4    5
Field       magnetic      24.02078     4    0    0    0   367     0    0    1    4    1
Field       electric      16.75592     2    0    1    0   257     0    0    0    1    3
Energy      fermi         13.57791     0    1    0    1    90     0    0    1    5    2
Energy      kinetic       5.297875     1    0    0    0    40     0    0    0    0    0
electron    dimensional   10.20212     1    7    0    2    65     0    0    0    2    2
electron    tunneling     10.06765     2    1    5    3     6    49    5    1    3    3
tunneling   resonant      25.88729     2    5    5   13   295     0    4    2    6    6
tunneling   electron      5.845766     3    3    1    5    49     7    3    5    1    2
tunneling   diodes        2.982692     0    1    0    0     0    36    0    1    3    1
tunneling   diode         2.59579      2    0    0    0     0    34    0    1    0    0
current     density       12.12932     1    0    1    1     0    98    1    0    1    2
current     voltage       17.72585    10    7    7    7     2    90    7    7    9    6
Barrier     double        19.90954     2    3    1    0   191     0    1    0    1    0
Barrier     height        5.159374     2    4    2    0     0    38    4    0    1    3
quantum     wells         25.73331     1    1    0    1     0   251    0    0    2    2
quantum     multiple      6.539368     0    1    0    0    64     0    1    2    0    0
quantum     structures    6.337326     3    2    3    0     1     3   44    5    5    0
LSP: Collocation Patterns
Phrase      Collocate     k-score   -5   -4   -3   -2    -1    1    2    3    4    5   Total
tunneling   resonant      25.9       2    5    5   13   295    0    4    2    6    6   338
field       magnetic      24.0       4    0    0    0   367    0    0    1    4    1   377
field       electric      16.8       2    0    1    0   257    0    0    0    1    3   264
energy      Fermi         13.6       0    1    0    1    90    0    0    1    5    2   100
electron    dimensional   10.2       1    7    0    2    65    0    0    0    2    2    79
electron    tunneling     10.1       2    1    5    3     6   49    5    1    3    3    78
tunneling   electron      5.8        3    3    1    5    49    7    3    5    1    2    79
energy      kinetic       5.3        1    0    0    0    40    0    0    0    0    0    41
tunneling   diodes        3.0        0    1    0    0     0   36    0    1    3    1    42
LSP: (Re)Collocation
Phrase                 Collocate    k-score    -5   -4   -3   -2   -1    1    2    3    4    5
magnetic field         transverse   4.208123    0    0    0    0   17    0    0    1    0    0
magnetic field         parallel     9.754802    0    1    0    0    5   12   10    9    2    0
resonant tunneling     diodes       6.760186    0    0    0    0    0   31    0    1    0    1
resonant tunneling     diode        5.21645     0    0    0    0    0   25    0    1    0    0
double barrier         resonant     7.566908    1    8    4    2    0   27    0    2    1    1
double barrier         tunneling    8.607712    3    2   13    4    1    1   24    1    2    1
current density        j            5.782593    0    1    0    0    0   19    1    1    0    0
current density        k            3.175241    0    0    0    0    0    0   13    0    0    0
quantum wells          multiple     5.569255    0    1    0    0   17    1    0    0    0    0
quantum wells          coupled      5.569255    0    0    1    2   13    2    1    0    0    0
dimensional electron   gases        3.444534    0    0    0    0    0    8    0    0    0    0
dimensional electron   2deg         1.321555    0    1    0    0    0    0    3    0    0    0
LSP: (ReRe)Collocation
Statistically discover the extent of
compounding of a term
Phrase                    Collocate    k-score    -5   -4   -3   -2   -1    1    2    3    4    5
double barrier resonant   tunneling    6.894571    1    0    3    1    0   24    0    0    0    0
double barrier resonant   structures   1.317459    0    0    0    0    0    0    5    2    0    0
Machine Generated results
unipolar resonant tunnelling diode
bipolar resonant tunnelling diodes
high-frequency characteristics of the bipolar light-emitting resonant tunnelling diode are compared to the unipolar resonant tunnelling diode and the resonant interband tunnelling diode .
The high-frequency characteristics of bipolar resonant tunnelling diodes are experimentally investigated at room temperature .
High-frequency capacitance of bipolar resonant tunnelling diodes
triplebarrier diodes
triplebarrier diode
k: From the calculated ||M22||2 the fundamental quantities that determine the resonant tunnelling diode characteristics —transmission coefficients , resonant conditions , and full widths at half maximum of resonant peaks —are calculated for double- and triple-barrier diodes .
k: The same calculation method is then also applied to a calculation of ||M22||2 for a triple barrier diode .
Continued Evolution: a future concept hierarchy? (Terms
found with low frequencies in current texts)
tunneling diode
resonant tunneling diode
unipolar resonant tunneling diode
Same thing?
interband resonant tunneling diode
resonant interband tunneling diode - RITD
delta doped resonant tunneling diode
double-barrier resonant tunneling diode
quantum well resonant tunneling diode
bipolar light-emitting resonant tunneling diode
interband double barrier tunneling diode
An ALGORITHM FOR TERMINOLOGY AND
ONTOLOGY EXTRACTION
Coll. Spread:
\[ U_i = \frac{\sum_{j=1}^{10} \left( p_i^j - \bar{p}_i \right)^2}{10} \ge U_0 \]
Ahmad, Khurshid., and Rogers, Margaret A. (2001). ‘Corpus Linguistics and Terminology Extraction’.
In (Eds. ) Sue-Ellen Wright and Gerhard Budin. Handbook of Terminology Management (Volume 2).
Amsterdam & Philadelphia: John Benjamins Publishing Company. pp 725-760.
Smadja, Frank. (1994). Retrieving Collocations from Text: Xtract. In (Ed.) Susan Armstrong(Warwick). Using Large Corpora. Cambridge, Massachusetts & London, England: MIT Press. pp143-177.
An ALGORITHM FOR TERMINOLOGY AND
ONTOLOGY EXTRACTION
Collocation Strength:
\[ k_{ij} = \frac{f_{ij} - \bar{f}}{\sigma} \ge k_0 \]
An ALGORITHM FOR TERMINOLOGY AND
ONTOLOGY EXTRACTION
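Coll. Strength:
\[ k_{ij} = \frac{f_{ij} - \bar{f}}{\sigma} \ge k_0 \]
Coll. Spread:
\[ U_i = \frac{\sum_{j=1}^{10} \left( p_i^j - \bar{p}_i \right)^2}{10} \ge U_0 \]
Peakedness:
\[ p_i^j \ge \bar{p}_i + \left( k_1 \times \sqrt{U_i} \right) \]
Metrics:
\[ (k_0, k_1, U_0) = (1, 1, 10) \]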
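The three tests can be sketched in Python. The sketch rests on the usual Xtract reading of the symbols, which is an assumption here: the mean frequency and the standard deviation are taken over all collocates of the node word, and each collocate has ten positional counts. The function name `filter_collocates` is hypothetical. The positional counts are the seven collocates of carbon listed in these slides; because the mean and deviation are computed over those seven rows only, the strength values will differ from the k-scores a full corpus would give.

```python
import math
from statistics import mean, pstdev

OFFSETS = [-5, -4, -3, -2, -1, 1, 2, 3, 4, 5]


def filter_collocates(position_counts, k0=1.0, k1=1.0, u0=10.0):
    """Keep collocates that pass the strength and spread tests; report their peak offsets."""
    freqs = {w: sum(p) for w, p in position_counts.items()}
    f_bar = mean(list(freqs.values()))
    sigma = pstdev(list(freqs.values()))
    kept = {}
    for w, p in position_counts.items():
        strength = (freqs[w] - f_bar) / sigma if sigma else 0.0
        p_bar = freqs[w] / len(p)
        spread = sum((x - p_bar) ** 2 for x in p) / len(p)
        # A position is a peak if its count is at least k1 standard deviations above the mean.
        threshold = p_bar + k1 * math.sqrt(spread)
        peaks = [o for o, x in zip(OFFSETS, p) if x >= threshold]
        if strength >= k0 and spread >= u0:
            kept[w] = {"strength": strength, "spread": spread, "peaks": peaks}
    return kept


# Positional counts for collocates of "carbon", copied from the table in these slides.
carbon = {
    "nanotubes":    [8, 8, 9, 2, 0, 647, 6, 0, 7, 3],
    "nanotube":     [3, 2, 2, 0, 0, 229, 2, 1, 5, 8],
    "singlewalled": [0, 0, 1, 1, 75, 0, 0, 0, 0, 0],
    "aligned":      [1, 1, 3, 5, 74, 0, 1, 1, 3, 5],
    "multiwalled":  [1, 1, 2, 0, 59, 0, 0, 1, 5, 1],
    "amorphous":    [1, 1, 6, 0, 46, 0, 1, 1, 0, 2],
    "atoms":        [1, 2, 0, 1, 42, 0, 1, 3, 1, 0],
}
kept = filter_collocates(carbon)
for word, stats in kept.items():
    print(word, stats["peaks"])
```

With only seven rows in the sample, the strength test is very selective; over a full collocate list the mean frequency is far lower and more collocates pass.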
ONTOLOGY in TEXT
 Collocates of the the weird term EARPRINT + collocation statistics
ONTOLOGY in TEXT
 Collocates of the weird term EARPRINT IDENTIFICATION
ONTOLOGY in TEXT
 A hierarchy of EARPRINT collocates:
ONTOLOGY in TEXT
 An inheritance hierarchy of EARPRINT collocates:
ONTOLOGY in TEXT
 The hierarchy is rendered into an RDF description which can be
exported by System Quirk to other (web-enabled) applications:
nanotubes (1378)
  carbon nanotubes (647)
    aligned carbon nanotubes (48)
      vertically aligned carbon nanotubes (15)
      vertically aligned carbon kai (15)
    multiwalled carbon nanotubes (46)
      multiwalled carbon nanotubes mwnts (13)
    single-wall carbon nanotubes (24)
      single-wall carbon nanotubes swnts (4)
    multiwall carbon nanotubes (46)
  z nanotubes (24)
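A term hierarchy of this kind can be written out as RDF using `rdfs:subClassOf` statements. The sketch below emits Turtle text with the standard library only; the `ex:` namespace, the helper names `local_name` and `to_turtle`, and the parent links in the example are assumptions for illustration, and the output is not claimed to match the RDF that System Quirk actually produces.

```python
import re

# rdfs: is the standard RDF Schema namespace; ex: is a placeholder namespace.
PREFIXES = (
    "@prefix rdfs: <http://www.w3.org/2000/01/rdf-schema#> .\n"
    "@prefix ex: <http://example.org/terms#> .\n"
)


def local_name(term):
    """Turn a multiword term into a safe local name, e.g. 'a b-c' -> 'a_b_c'."""
    return re.sub(r"[^A-Za-z0-9]+", "_", term).strip("_")


def to_turtle(parents):
    """parents maps each term to its broader term (None for the root)."""
    lines = [PREFIXES]
    for term, parent in parents.items():
        name = local_name(term)
        lines.append(f'ex:{name} a rdfs:Class ; rdfs:label "{term}" .')
        if parent is not None:
            lines.append(f"ex:{name} rdfs:subClassOf ex:{local_name(parent)} .")
    return "\n".join(lines)


# Assumed parent links between three of the nanotube terms.
example = {
    "nanotubes": None,
    "carbon nanotubes": "nanotubes",
    "aligned carbon nanotubes": "carbon nanotubes",
}
turtle = to_turtle(example)
print(turtle)
```

Because the hierarchy is expressed with RDF Schema vocabulary, any RDF-aware tool can load it without knowing how the terms were extracted.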
ONTOLOGY in TEXT
 An inheritance hierarchy of EARPRINT collocates exported to a
knowledge representation system PROTEGE:
ONTOLOGY in TEXT
 A multiple inheritance hierarchy of EARPRINT collocates now
exported to a knowledge representation workbench PROTEGE:
Rubbish!
ONTOLOGY in TEXT
 Editing a multiple inheritance hierarchy of EARPRINT collocates
now exported to a knowledge representation workbench PROTEGE
ONTOLOGY in TEXT
 Edited multiple inheritance hierarchy of EARPRINT collocates now
exported to a knowledge representation workbench PROTEGE
ONTOLOGY in TEXT
 Automatic Extraction of terms and their
relationship to other terms (ontology) from texts
using System Quirk and Stanford Uni’s PROTÉGÉ.
 The system can reason over the hierarchy and
infer (new) facts.
[Figure: inheritance tree; node labels as extracted: EVIDENCE, TRACE EVIDENCE, BLOOD, INORGANIC, FIBRE, FIBRE, MANUFACTURED, POLYMERIC FIBRE, DNA, DYE FIBRE]
ONTOLOGY in TEXT
 The production of the inheritance tree shows the
conceptual organisation of terms in a specialist
domain – a thesaurus in other words.
 Such a conceptual organisation reflects the
ontological commitment of the domain community,
that is, how the conventions for organising
concepts have evolved and are adhered to. This is
often misleadingly just called ontology.
ONTOLOGY in TEXT
 The terminology and the ontological commitment
thus identified will be used to create the visual
evidence thesaurus.
 No assumptions were made about the terminology
or ontology – apart from the assumption that the
way in which specialists write descriptions is
unique to the specialism: weirdness and local
grammar.
Sentiment & Market Analysis
Reuters Financial Services Streaming Data and
News Service
Sentiment & Market Analysis
News Analysis: service for extracting
MARKET SENTIMENT.
Correlation: Market sentiment
correlation with financial time series.
Fusing Qualitative and
Quantitative Data Analysis
 We have developed a Sentiment and Time Series:
Financial analysis system (SATISFI) for visualising
and correlating the sentiment and instrument time
series, both as text (and numbers) and graphically.
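Correlating a sentiment series with an instrument series can be illustrated with a Pearson coefficient. The sketch below is an assumption-laden toy: it scores each text by counting positive minus negative words from small hand-made word sets, which is not necessarily how SATISFI derives sentiment, and all names and numbers are hypothetical.

```python
from math import sqrt


def sentiment_score(text, positive, negative):
    """Count positive words minus negative words in a whitespace-split text."""
    words = text.lower().split()
    return sum(w in positive for w in words) - sum(w in negative for w in words)


def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length numeric series."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two equal-length series with at least two points")
    mx, my = sum(xs) / len(xs), sum(ys) / len(ys)
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    vx = sum((x - mx) ** 2 for x in xs)
    vy = sum((y - my) ** 2 for y in ys)
    if vx == 0 or vy == 0:
        raise ValueError("correlation is undefined for a constant series")
    return cov / sqrt(vx * vy)


# Invented daily sentiment scores and instrument returns (in percent).
sentiment = [2, -1, 3, 0, -2]
returns = [0.5, -0.2, 0.8, 0.1, -0.6]
r = pearson(sentiment, returns)
print(f"correlation = {r:.3f}")
```

A real analysis would also have to align the timestamps of news items and prices, and would probably look at lagged correlations, since sentiment may lead or follow the market.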
Afterword
Specialists use language in an
idiosyncratic fashion:
• Repeat lexical items comprising the specific vocabulary of a subject domain
• Invent new words
• Borrow words from other domains
• Re-define words or terms
Such processes contribute significantly
to the organisation and communication
of tacit and explicit knowledge.
Afterword
In order to investigate innovation
or creativity, we have developed a
computer-based method that
compares the relative occurrence
of single words in an English scientific paper (or a collection or
corpus of papers) with the
occurrence of the words in a
representative sample of
contemporary English language.
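The comparison described here can be written as a ratio of relative frequencies, often called weirdness. The sketch below is a minimal illustration: the add-one smoothing of the general-language count (so that words absent from the general sample do not cause division by zero) is an assumption, and the two token lists are invented toy data.

```python
from collections import Counter


def weirdness(special_tokens, general_tokens):
    """Relative frequency in the specialist text divided by the
    (add-one smoothed) relative frequency in the general sample."""
    if not special_tokens or not general_tokens:
        raise ValueError("both token lists must be non-empty")
    fs, fg = Counter(special_tokens), Counter(general_tokens)
    ns, ng = len(special_tokens), len(general_tokens)
    return {w: (c / ns) / ((fg[w] + 1) / ng) for w, c in fs.items()}


# Toy data: a specialist fragment and a general-language fragment.
special = "the earprint was lifted from the door the earprint".split()
general = "the cat sat on the mat and the dog sat by the door".split()

scores = weirdness(special, general)
for word, score in sorted(scores.items(), key=lambda kv: -kv[1]):
    print(f"{word:10s} {score:.2f}")
```

Words that are far more frequent in the specialist text than in general language, such as "earprint" in this toy example, receive a high score and become candidate terms, while function words such as "the" score close to or below one.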