Interactive Linguistics and Distributed Grammar

MAIN TOPICS
1. Linguistics - State of the Art
2. What is Interactive Modelling ?
3. Knowledge Discovery in Databases (KDD)
4. Interactive Linguistics
5. The SEMANA software
6. Data Representation
7. Attributive & Relational Knowledge
8. Some Results of Formal Concept Analysis (FCA)
LINGUISTICS
(State of the Art)
“Form” and “Matter”
in Structural Linguistics
OBJECT | APPROACH | THEORY
FORM (Structures): universal, homogeneous types | Deductive: Synthesis using rules | L = (W, G): Language is a set of sentences generated by grammar rules G from words W (Prediction)
MATTER (Data): specific, heterogeneous instances | Inductive: Analysis of analogies | L = (W, L): Language is a set of sentences L analysed as words W (Explanation)
ALTMANN G. (1987) "The Levels of Linguistic Investigation", Theoretical Linguistics, vol. 14, edited by H. Schnelle, W. de Gruyter, Berlin - New York
Structural and Computational Linguistics
Structural Linguistics | Computational Linguistics
FORM (Structures): THEORY-oriented Linguistics (Formal Generative Linguistics) | Natural Language Processing (Lexical-Functional Grammars, Unification Grammars, Logic Grammars)
MATTER (Data): DATA-oriented Linguistics (Linguistic Typology) | Human Language Technology (Corpus Linguistics, Lexicons and Thesauri - WordNet, FrameNet etc.)
INTERACTIVE LINGUISTICS
André WLODARCZYK
What is Interactive Modelling ?
Standard Research Procedure
Meta-Theory
Domain of interest
Researcher
Theory
Modeling Tasks in a loop
Meta-Theory (formal)
Meta-Abstraction (Extend the meta-theory)
DOMAIN of Interest
Metaphor
Abstraction (Describe raw data)
Verification (Evaluate the Model/Theory)
Model (formal)
Simplification (Compress the Model)
Formalization stage I (Interpret the Meta-Theory)
Theory (formal)
Formalization stage II (Interpret the Theory)
KNOWLEDGE DISCOVERY
IN DATABASES
(KDD)
Characteristics of KDD
1. Tasks (visualization, classification, clustering, regression etc.)
2. Structure of the Model adapted to data (it determines the limits of what will be compared or revealed)
3. Evaluation function (adequacy / correspondence and generalization problems)
4. Search or Optimization Methods (heart of data exploration algorithms)
5. Data Management Techniques (tools for data accumulation and indexing).
HAND David, MANNILA Heikki & SMYTH Padhraic (2001) Principles of Data Mining, Massachusetts Institute of Technology, USA
KDD Procedure
From Data Mining to Knowledge Discovery in Databases by Usama Fayyad, Gregory Piatetsky-Shapiro, and Padhraic Smyth, AI Magazine 1997 (American Association for Artificial Intelligence)
INTERACTIVE LINGUISTICS
KDD in the Linguistic Domain
In language studies, this method results from the integration of Text Mining and Data Analysis (Data Mining).
NLP (Natural Language Processing)
Data Base Management Systems
Corpus Linguistics
Interactive Linguistics
HLT (Human Language Technologies)
Automated Discovery Systems
Text Mining and Data Mining
Data Mining
Text Mining
1. This task needs active involvement on the part of the researcher.
2. This task is automatic.
3. Interactive tasks with KDD algorithms (Rough Set, FCA, etc.)
From Data Mining to Knowledge Discovery in Databases by Usama Fayyad, Gregory Piatetsky-Shapiro, and Padhraic Smyth, AI Magazine 1997 (American Association for Artificial Intelligence)
OBJECTS – APPROACHES – TASKS
Objects: Text Data | Symbolic Data
Approaches: Corpus Linguistics | Interactive Linguistics
Tasks: Text Document Exploration (Text Mining) | Linguistic Knowledge Extraction (Data Mining)
1. Selection
2a. Preprocessing
2b. Filtering
3. Transformation
4. Analysis
5. Evaluation
The SEMANA software
Architecture of SEMANA
Application for Apple, Windows-PC and Linux computers
Dynamic DB Builder: Data sheets, Data coding, Data storage
Attribute Editor: Discretisation, Logical scaling …
Tree Builder: Aid to code structuring, Charts (various formats)
“Multi-valued tables”
Rough Set Theory: Upper approximation, Lower approximation, Reducts, Core, Discriminating power (Pawlak, Skowron & Polkowski)
Decision Logic: Minimal rules, Attribute strength (Bolc, Cytowski & Stacewicz)
“One-valued tables”
Formal Concept Analysis: Galois lattice, “central concepts” (Wille, Ganter)
Statistical tools: Correlation Matrix, Correspondence Factor Analysis, Hierarchical Classifications (Benzécri)
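The lower and upper approximations listed under Rough Set Theory can be illustrated with a minimal sketch. The function name `approximations`, the toy table and its attribute names are assumptions made for illustration; nothing here reflects how SEMANA itself is implemented.

```python
from collections import defaultdict

def approximations(table, attributes, target):
    """Return (lower, upper) approximations of `target` under `attributes`."""
    # Objects with identical values on the chosen attributes are
    # indiscernible and fall into the same block.
    blocks = defaultdict(set)
    for obj, row in table.items():
        key = tuple(row[a] for a in attributes)
        blocks[key].add(obj)

    lower, upper = set(), set()
    for block in blocks.values():
        if block <= target:      # block lies entirely inside the target
            lower |= block
        if block & target:       # block overlaps the target
            upper |= block
    return lower, upper

# Hypothetical multi-valued table (an "information system").
table = {
    "w1": {"animate": "yes", "number": "sg"},
    "w2": {"animate": "yes", "number": "sg"},
    "w3": {"animate": "no", "number": "sg"},
    "w4": {"animate": "no", "number": "pl"},
}
target = {"w1", "w3"}
lower, upper = approximations(table, ["animate", "number"], target)
print(sorted(lower), sorted(upper))
```

Here `w1` and `w2` cannot be told apart, so `w1` belongs only to the upper approximation, while `w3` is certain and belongs to both.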
Logical Analyses in SEMANA
Symbolic Data Analysis
Set Theory
& Lattice Theory
FCA
Formal Concept
Analysis
Logics
Interpretation
Evaluation
Proposition
Logic
Formal Contexts Lattices
Decision Logic
Decision Rules
Rudolf WILLE, 1982
RST
RFCA
Rough Set Theory
Zdzisław PAWLAK,
1981
RSA
DL
Statistical Data Analysis (STAT)
1. Factor Analysis
2. Bottom-up Hierarchical classification
3. Various Other Tools for Data Analysis & Data Processing : Approximation, Analogy, Discrimination Power, Imputation etc.
Data Representation
Assignment of Attributes to Objects
Objects := Attributes
*) Context by Wille R. – 1982 (Formal Concept Analysis) → “Single-valued charts”
**) Information System by Pawlak Z. – 1982 (Rough Set Theory) → “Multi-valued charts”
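The difference between the two chart types can be sketched in a few lines: a multi-valued information system assigns each object one value per attribute, while a single-valued context only records whether an object has an attribute. One common way to move from the first to the second is nominal scaling, where every attribute-value pair becomes a binary attribute. The function name `nominal_scale` and the sample values are hypothetical, and this is not necessarily the scaling procedure SEMANA applies.

```python
def nominal_scale(info_system):
    """Turn a multi-valued table into a single-valued (binary) context.

    Each pair (attribute, value) becomes one binary attribute named
    "attribute=value"; an object has it iff it carries that value.
    """
    context = {}
    for obj, row in info_system.items():
        context[obj] = {f"{attr}={value}" for attr, value in row.items()}
    return context

# Hypothetical multi-valued chart.
info_system = {
    "Jim": {"Eyes": "blue", "Face": "round"},
    "Max": {"Eyes": "brown", "Face": "round"},
}
context = nominal_scale(info_system)
for obj, attrs in context.items():
    print(obj, sorted(attrs))
```

The resulting context can be fed to Formal Concept Analysis, whereas the original table is the natural input for Rough Set Theory.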
Objects and Attributes
SYSTEM NAME          OBJECT             ATTRIBUTE   RELATION             Author
Data Base            Argument           Predicate   Relation             Codd, 1969
Chu Space            Point/Individual   State       Assignment ( := )    Barr & Chu, 1979
                     Element            Predicate   Dual Relationship    Pratt
Context              Extent             Intent      Assignment ( := )    Wille, 1982
Information System   Object             Attribute   Assignment ( := )    Pawlak, 1982
General System       Object             Sort        Signature            Pogonowski, 1982
Classification       Token              Type        Satisfaction ( ⊫ )   Barwise & Seligman, 1997
Institution          Sentence           Model       Satisfaction ( ⊫ )   Goguen, 2004
Similarity & Distinction
similarity 1: ALL features are common
distinction 1: NO features are common
similarity 2: SOME features are common
distinction 2: SOME features are NOT common
Following the definitions of ‘Similarity’ and ‘Distinction’ by Jerzy Pogonowski (1991),
Linguistic Oppositions, UAM Scientific Editions, Poznań, pp. 125
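The four relations can be stated over feature sets, as in the sketch below. The function name `compare` and the set-theoretic reading of each relation are assumptions based on the wording of this slide, not a transcription of Pogonowski's formal definitions. The feature sets are taken from the “Family Resemblance” context that appears later in these slides.

```python
def compare(a, b):
    """Evaluate the four relations for two feature sets."""
    a, b = set(a), set(b)
    return {
        "similarity 1": a == b,          # ALL features are common
        "distinction 1": not (a & b),    # NO features are common
        "similarity 2": bool(a & b),     # SOME features are common
        "distinction 2": bool(a ^ b),    # SOME features are NOT common
    }

JIM = {"BigEars", "BlueEyes", "RoundFace", "Bald"}
BOB = {"BigEars", "FlatNose", "RoundFace", "Bald"}
JOHN = {"BlueEyes", "Bald"}
MAX = {"FlatNose", "RoundFace"}

print(compare(JIM, BOB))    # share some features, differ on others
print(compare(JOHN, MAX))   # share nothing
```

Under this reading, similarity 1 and distinction 1 are the strong cases, and similarity 2 and distinction 2 are the weak ones.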
Identity and Difference
Identity ← strong Similarity ... weak Similarity
Difference ← strong Distinction ... weak Distinction
Linguistic Opposition
Comparison of signs can be analysed in two dual continua
which have identity (equivalence) and difference as their
extreme cases.
Close Senses: strong Similarity, weak Distinction
Distant Senses: weak Similarity, strong Distinction
Oppositions : Morphemes are organised by pairs of similarity and distinction.
Structural linguists proposed 3 kinds of oppositions: privative (binary),
equipollent (multi-value) and gradual (degree).
Sign • Semion • Sense • Usage
The Sign is a structure with usages U as objects and descriptions F as formulae.
Sign = < U, F >
Let F be a set of atomic formulae, F = {ϕ, χ, ψ, …}, and let Φ be a subset of formulae in F (i.e. Φ ⊆ F). Let X = {x, y, z, …} be a subset of U (i.e. X ⊆ U).
We define a Semion as a formal concept in the sense of FCA (Formal Concept Analysis): a substructure consisting of a pair of sets, a set of uses X and a set of formulae Φ.
Semion = < X, Φ > where X ⊆ U and Φ ⊆ F
The Usage of a semion is defined as its Extent (extension), i.e. a typified set (class) of uses.
||Φ||Sign = {x ∈ X : x ⊨Sign Φ}
The Sense of a semion is defined as its Intent (intension).
||X||Sign = {ϕ ∈ Φ : ϕ ⊨Sign X}
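As an illustration of these definitions, the sketch below represents a Sign as a mapping from usages to the atomic formulae they satisfy and checks whether a pair of sets is a Semion, i.e. a formal concept. The usage names `u1`..`u3`, the formula names and the helper functions are hypothetical. The extent is computed over all usages of the Sign, which is one reading of the definition above.

```python
def extent(sign, formulae):
    """Usages of the sign that satisfy every formula in `formulae`."""
    return {u for u, satisfied in sign.items() if formulae <= satisfied}

def intent(sign, usages):
    """Formulae satisfied by every usage in `usages`."""
    all_formulae = set().union(*sign.values())
    return {f for f in all_formulae if all(f in sign[u] for u in usages)}

def is_semion(sign, usages, formulae):
    """A pair <usages, formulae> is a semion iff each set determines the other."""
    return extent(sign, formulae) == usages and intent(sign, usages) == formulae

# Hypothetical sign: three usages described by atomic formulae.
sign = {
    "u1": {"phi", "chi"},
    "u2": {"phi"},
    "u3": {"chi", "psi"},
}

print(is_semion(sign, {"u1", "u2"}, {"phi"}))   # both sets close on each other
print(is_semion(sign, {"u1"}, {"phi"}))         # "phi" also holds of u2
```

The Usage of the semion is then `extent(sign, formulae)` and its Sense is `intent(sign, usages)`.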
Attributive & Relational
Knowledge
Semantic Network
Collins & Quillian (1969)
Collins, A. M., & Quillian, M. R. (1969). Retrieval Time from Semantic Memory. Journal of
Verbal Learning and Verbal Behavior, 8, 240-247.
Connectionist (neural) System
Rumelhart & Todd (1993)
Some kinds of Attributive Knowledge are similar to Connectionist knowledge
Two Conditions:
1. The initial System must contain Alternative Attributes (multi-valued attributes)
2. The initial System must be complete (i.e. all cases present)
Relational Knowledge
Example: Ontology Coordination using
Information Flow Logic
Marco Schorlemmer and Yannis Kalfoglou, « A Channel-Theoretic Foundation for Ontology
Coordination »
Attributive Knowledge
If ‘Ohio’ is big, then it is said to be a “river” in English.
If ‘Ohio’ is a tributary, then it is said to be a “rivière” in French.
Some Results of FCA
(Formal Concept Analysis)
“Family Resemblance”
Lattice of Formal Concepts
fca     BigEars  BlueEyes  FlatNose  RoundFace  Bald
Jim     x        x         0         x          x
John    0        x         0         0          x
Bob     x        0         x         x          x
Max     0        0         x         x          0
****************************************
CENTRAL FORMAL CONCEPT :
C5 {Jim,Bob},{BigEars,RoundFace,Bald}
****************************************
LIST OF FORMAL CONCEPTS
C1 {},{BigEars,BlueEyes,FlatNose,RoundFace,Bald}
C2 {Bob},{BigEars,FlatNose,RoundFace,Bald}
C3 {Bob,Max},{FlatNose,RoundFace}
C4 {Jim},{BigEars,BlueEyes,RoundFace,Bald}
C5 {Jim,Bob},{BigEars,RoundFace,Bald}
C6 {Jim,Bob,Max},{RoundFace}
C7 {Jim,John},{BlueEyes,Bald}
C8 {Jim,John,Bob},{Bald}
C9 {Jim,John,Bob,Max},{}
****************************************
2 intensional master-concept(s) :
C2 {Bob},{BigEars,FlatNose,RoundFace,Bald}
C4 {Jim},{BigEars,BlueEyes,RoundFace,Bald}
****************************************
2 extensional master-concept(s) :
C6 {Jim,Bob,Max},{RoundFace}
C8 {Jim,John,Bob},{Bald}
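The list of formal concepts can be recomputed from the “Family Resemblance” context with a brute-force sketch: take every subset of objects, collect the attributes they share, and close the pair. This is only practical for small contexts and is not presented as the algorithm SEMANA uses. Reading the “master-concepts” as the concepts directly above the bottom (intensional) and directly below the top (extensional) is an assumption that happens to match the output above.

```python
from itertools import combinations

CONTEXT = {
    "Jim":  {"BigEars", "BlueEyes", "RoundFace", "Bald"},
    "John": {"BlueEyes", "Bald"},
    "Bob":  {"BigEars", "FlatNose", "RoundFace", "Bald"},
    "Max":  {"FlatNose", "RoundFace"},
}
ATTRIBUTES = frozenset().union(*CONTEXT.values())

def intent(objects):
    """Attributes shared by every object in `objects`."""
    shared = set(ATTRIBUTES)
    for obj in objects:
        shared &= CONTEXT[obj]
    return frozenset(shared)

def extent(attributes):
    """Objects that have every attribute in `attributes`."""
    return frozenset(o for o, attrs in CONTEXT.items() if attributes <= attrs)

def formal_concepts():
    """Close every subset of objects; duplicates collapse in the set."""
    concepts = set()
    for size in range(len(CONTEXT) + 1):
        for objects in combinations(CONTEXT, size):
            shared = intent(objects)
            concepts.add((extent(shared), shared))
    return concepts

concepts = formal_concepts()
bottom = min(concepts, key=lambda c: len(c[0]))   # smallest extent
top = max(concepts, key=lambda c: len(c[0]))      # largest extent

def covers(upper, lower):
    """True if `upper` is directly above `lower` (nothing in between)."""
    if not lower[0] < upper[0]:
        return False
    return not any(lower[0] < c[0] < upper[0] for c in concepts)

atoms = {c for c in concepts if covers(c, bottom)}    # assumed: intensional master-concepts
coatoms = {c for c in concepts if covers(top, c)}     # assumed: extensional master-concepts

for ext, itt in sorted(concepts, key=lambda c: len(c[0])):
    print(sorted(ext), sorted(itt))
```

The enumeration yields nine concepts, matching C1 to C9 in the listing.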
“Family Resemblance”
Multi-base Classes
fca     BigEars  BlueEyes  FlatNose  RoundFace  Bald
Jim     x        x         0         x          x
John    0        x         0         0          x
Bob     x        0         x         x          x
Max     0        0         x         x          0
You have selected the formal concepts:
C2 {Bob},{BigEars,FlatNose,RoundFace,Bald}
C4 {Jim},{BigEars,BlueEyes,RoundFace,Bald}
--------------------------------------
The meet is:
C1 {},{BigEars,BlueEyes,FlatNose,RoundFace,Bald}
The join is:
C5 {Jim,Bob},{BigEars,RoundFace,Bald}
Similarity index = 0.75 according to Zhao, Wang & Halang, 2006 (after Tversky's model)
CLASS STRUCTURE
Class 1: Bob
C2 {Bob},{BigEars,FlatNose,RoundFace,Bald}
C3 {Bob,Max},{FlatNose,RoundFace}
Class 2: Jim
C4 {Jim},{BigEars,BlueEyes,RoundFace,Bald}
C7 {Jim,John},{BlueEyes,Bald}
Class 3: Jim,Bob
C5 {Jim,Bob},{BigEars,RoundFace,Bald}
C6 {Jim,Bob,Max},{RoundFace}
C8 {Jim,John,Bob},{Bald}
All Formal Concepts included
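The meet and join of the two selected concepts can be reproduced with a short sketch: the meet intersects the extents and closes, the join intersects the intents and closes. The slides do not state the formula behind the similarity index, so the `tversky` function below is only an assumption: a plain Tversky ratio over the two intents with both weights set to 0.5, which happens to give 0.75 for C2 and C4. The measure of Zhao, Wang & Halang may be defined differently.

```python
CONTEXT = {
    "Jim":  {"BigEars", "BlueEyes", "RoundFace", "Bald"},
    "John": {"BlueEyes", "Bald"},
    "Bob":  {"BigEars", "FlatNose", "RoundFace", "Bald"},
    "Max":  {"FlatNose", "RoundFace"},
}
ATTRIBUTES = frozenset().union(*CONTEXT.values())

def intent(objects):
    """Attributes shared by every object in `objects`."""
    shared = set(ATTRIBUTES)
    for obj in objects:
        shared &= CONTEXT[obj]
    return frozenset(shared)

def extent(attributes):
    """Objects that have every attribute in `attributes`."""
    return frozenset(o for o, attrs in CONTEXT.items() if attributes <= attrs)

def meet(c1, c2):
    """Greatest common subconcept: common objects, then their shared attributes."""
    objects = c1[0] & c2[0]
    return (objects, intent(objects))

def join(c1, c2):
    """Least common superconcept: common attributes, then all objects having them."""
    attributes = c1[1] & c2[1]
    return (extent(attributes), attributes)

def tversky(c1, c2, alpha=0.5, beta=0.5):
    """Tversky ratio over intents (assumed reading of the similarity index)."""
    common = len(c1[1] & c2[1])
    return common / (common + alpha * len(c1[1] - c2[1]) + beta * len(c2[1] - c1[1]))

C2 = (frozenset({"Bob"}), intent({"Bob"}))
C4 = (frozenset({"Jim"}), intent({"Jim"}))

print(meet(C2, C4))
print(join(C2, C4))
print(tversky(C2, C4))
```

With C2 and C4 selected, the meet has an empty extent (C1) and the join is the concept of Jim and Bob (C5), as in the output above.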
Double Binary Opposition
ły2
-HUM
li1
+HUM
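Psy stały. (“The dogs stood.”)
Pociągi stały. (“The trains stood.”)
Dzieci stały. (“The children stood.”)
Ludzie stali. (“The people stood.”)
Matka i dziecko stali. (“The mother and the child stood.”)
Panie stały. (“The ladies stood.”)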
ły1
+FEM
li2
-FEM
Panowie stali. (“The gentlemen stood.”)
Converse Opposition
(my old name : “Boomerang Opposition”)
In a BASE UTTERANCE :
- “GA” (ga1) is a marker of the Attention-driven Phrase (Subject with the status ‘New’)
- “WA” (wa2) is a marker of the Attention-driven Phrase (Subject with the status ‘non-New’)
In an EXTENDED UTTERANCE :
- “WA” (wa1) is a marker of the Attention-driven Phrase (Topic with the status ‘Old’)
- “GA” (ga2) is a marker of the Attention-driven Phrase (Focus with the status ‘New’)
[Diagram: ga1 (New) and wa2 (- New); wa1 (Old) and ga2 (- Old); mappings GA ---> WA and WA ---> GA]
WA ⇄ GA infomorphism
Infomorphism of the Japanese ‘wa’ and ‘ga’ particles
[Diagram: WA (wa1, wa2) and GA (ga1, ga2), linked by the mappings WA ⟶ GA and GA ⟶ WA; patterns shown: OLD + wa + NEW, OLD + wa + OLD, NEW + ga + OLD, NEW + ga + NEW]
The MIC Theory
Conceptual Lattice View
http://www.celta.paris-sorbonne.fr
© André WLODARCZYK