MAIN TOPICS
1. Linguistics - State of the Art
2. What is Interactive Modelling?
3. Knowledge Discovery in Databases (KDD)
4. Interactive Linguistics
5. The SEMANA software
6. Data Representation
7. Attributive & Relational Knowledge
8. Some Results of Formal Concept Analysis (FCA)

LINGUISTICS (State of the Art)

"Form" and "Matter" in Structural Linguistics

OBJECT             APPROACH                                      THEORY
FORM (Structures)  Deductive: synthesis using rules;             L = (W, G): Language is a set of sentences generated
                   homogeneous, universal types                  by grammar rules G from words W. Aim: Prediction.
MATTER (Data)      Inductive: analysis of analogies;             L = (W, L): Language is a set of sentences L
                   heterogeneous, specific instances             analysed as words W. Aim: Explanation.

ALTMANN G. (1987) "The Levels of Linguistic Investigation", Theoretical Linguistics, vol. 14, edited by H. Schnelle, W. de Gruyter, Berlin - New York.

Structural and Computational Linguistics

                   Structural Linguistics                        Computational Linguistics / Natural Language Processing
FORM (Structures)  THEORY-oriented Linguistics                   Lexicon-Functional Grammars, Unification Grammars,
                   (Formal Generative Linguistics)               Logic Grammars
MATTER (Data)      DATA-oriented Linguistics                     Human Language Technology (Corpus Linguistics,
                   (Linguistic Typology)                         Lexicons and Thesauri - WordNet, FrameNet etc.)

INTERACTIVE LINGUISTICS - André WLODARCZYK

What is Interactive Modelling?

Standard Research Procedure: the Researcher relates a Domain of interest to a Theory by way of a Meta-Theory.

Modelling tasks performed in a loop:
- Abstraction (describe the raw data of the DOMAIN of interest, via metaphor)
- Simplification (compress the Model (formal))
- Formalization stage I (interpret the Meta-Theory (formal))
- Formalization stage II (interpret the Theory (formal))
- Meta-Abstraction (extend the meta-theory)
- Verification (evaluate the Model/Theory)

KNOWLEDGE DISCOVERY IN DATABASES (KDD)

Characteristics of KDD:
1. Tasks (visualization, classification, clustering, regression etc.)
2. Structure of the Model adapted to the data (it determines the limits of what can be compared or revealed)
3. Evaluation function (adequacy/correspondence and generalization problems)
4. Search or Optimization Methods (the heart of data exploration algorithms)
5. Data Management Techniques (tools for data accumulation and indexation)

HAND David, MANNILA Heikki & SMYTH Padhraic (2001) Principles of Data Mining, The MIT Press, Cambridge, Massachusetts, USA.

KDD Procedure
(after "From Data Mining to Knowledge Discovery in Databases" by Usama Fayyad, Gregory Piatetsky-Shapiro & Padhraic Smyth, AI Magazine, 1997, American Association for Artificial Intelligence)

INTERACTIVE LINGUISTICS: KDD in the Linguistic Domain

In language studies this method results from the integration of Text Mining with Data Mining (analysis). It brings together NLP (Natural Language Processing), Database Management Systems, Corpus Linguistics, HLT (Human Language Technologies) and Automated Discovery Systems into Interactive Linguistics.

Text Mining and Data Mining (annotations to the KDD diagram):
1. This task needs active involvement on behalf of the researcher.
2. This task is automatic.
3. Interactive tasks with KDD algorithms (Rough Sets, FCA, etc.)

OBJECTS - APPROACHES - TASKS

OBJECTS        APPROACHES               TASKS
Text Data      Corpus Linguistics       Text Document Exploration (Text Mining)
Symbolic Data  Interactive Linguistics  Linguistic Knowledge Extraction (Data Mining)

The KDD procedure comprises the following steps (a sketch of such a pipeline is given after the list):
1. Selection
2a. Preprocessing
2b. Filtering
3. Transformation
4. Analysis
5. Evaluation
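Below is a minimal Python sketch of the five-step KDD procedure applied to a toy linguistic data set. All function names and the toy corpus are hypothetical illustrations, not SEMANA code and not the Fayyad et al. implementation; they only show how the steps chain together.

# A minimal sketch of the KDD procedure on linguistic data.
# All names (select, preprocess, transform, ...) are hypothetical.
from collections import Counter

RAW_CORPUS = [
    "Psy staly .", "Pociagi staly .", "Ludzie stali .",
    "Panie staly .", "Panowie stali .",
]

def select(corpus, keyword):
    """1. Selection: keep only the sentences relevant to the study."""
    return [s for s in corpus if keyword in s]

def preprocess(sentences):
    """2a/2b. Preprocessing and filtering: tokenise and drop punctuation."""
    return [[tok for tok in s.split() if tok.isalpha()] for s in sentences]

def transform(token_lists):
    """3. Transformation: recode tokens as attribute-value data (here, verb endings)."""
    return [{"verb_ending": toks[-1][-2:]} for toks in token_lists]

def analyse(records):
    """4. Analysis (data mining): count how often each attribute value occurs."""
    return Counter(r["verb_ending"] for r in records)

def evaluate(model, records):
    """5. Evaluation: a crude adequacy measure (coverage of the data by the model)."""
    covered = sum(1 for r in records if r["verb_ending"] in model)
    return covered / len(records)

if __name__ == "__main__":
    selected = select(RAW_CORPUS, "sta")
    records = transform(preprocess(selected))
    model = analyse(records)
    print(model, evaluate(model, records))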
The SEMANA software

Architecture of SEMANA
Application for Apple, Windows-PC and Linux computers.

- Dynamic DB Builder: data sheets, data coding, data storage
- Attribute Editor: discretisation, logical scaling, ...
- Tree Builder: aid to code structuring; charts (various formats)

"Multi-valued tables": Rough Set Theory, Decision Logic
- Upper approximation, lower approximation, reducts, core, discriminating power (Pawlak, Skowron & Polkowski)
- Minimal rules, attribute strength (Bolc, Cytowski & Stacewicz)

"One-valued tables": Formal Concept Analysis
- Galois lattice, "central concepts" (Wille, Ganter)

Statistical tools
- Correlation Matrix, Correspondence Factor Analysis, Hierarchical Classifications (Benzécri)

Logical Analyses in SEMANA (Symbolic Data Analysis)

Grounded in Set Theory & Lattice Theory, with two logics for interpretation and evaluation:
- FCA (Formal Concept Analysis): propositional logic, formal contexts, lattices (Rudolf WILLE, 1982)
- RST (Rough Set Theory): decision logic, decision rules (Zdzisław PAWLAK, 1981)

Statistical Data Analysis & Data Processing
1. Factor Analysis
2. Bottom-up Hierarchical Classification
3. Various other tools for data analysis: approximation, analogy, discrimination power, imputation etc.

Data Representation

Assignment of Attributes to Objects: Objects := Attributes
*) Context, by Wille R. (1982, Formal Concept Analysis): "single-valued charts"
**) Information System, by Pawlak Z. (1982, Rough Set Theory): "multi-valued charts"

Objects and Attributes

SYSTEM NAME          OBJECT            ATTRIBUTE   RELATION              AUTHOR
Data Base            Argument          Predicate   Relation              Codd, 1969
Chu Space            Point/Individual  State       Assignment ( := )     Barr & Chu, 1979
Chu Space            Element           Predicate   Dual Relationship     Pratt
Context              Extent            Intent      Assignment ( := )     Wille, 1982
Information System   Object            Attribute   Assignment ( := )     Pawlak, 1982
General System       Object            Sort        Signature             Pogonowski, 1982
Classification       Token             Type        Satisfaction ( ⊨ )    Barwise & Seligman, 1997
Institution          Sentence          Model       Satisfaction ( ⊨ )    Goguen, 2004

Similarity & Distinction

similarity 1: ALL features are common        distinction 1: NO features are common
similarity 2: SOME features are common       distinction 2: SOME features are NOT common

Following the definitions of 'Similarity' and 'Distinction' by Jerzy Pogonowski (1991), Linguistic Oppositions, UAM Scientific Editions, Poznań, p. 125.

Identity and Difference

Comparison of signs can be analysed in two dual continua which have identity (equivalence) and difference as their extreme cases:
- Similarity ranges from strong (close senses) to weak (distant senses); identity is its extreme case.
- Distinction ranges from weak (close senses) to strong (distant senses); difference is its extreme case.

Linguistic Opposition

Oppositions: morphemes are organised in pairs by similarity and distinction. Structural linguists proposed 3 kinds of oppositions: privative (binary), equipollent (multi-valued) and gradual (degree).

Sign • Semion • Sense • Usage

The Sign is a structure with usages U as objects and descriptions F as formulae:
Sign = < U, F >

Let F = {ϕ, χ, ψ, ...} be a set of atomic formulae and let Φ be a subset of formulae in F (i.e. Φ ⊆ F). Let X = {x, y, z, ...} be a subset of U (i.e. X ⊆ U). We define a Semion as a formal concept in the sense of FCA (Formal Concept Analysis), i.e. a substructure pairing a set of uses X with a set of formulae Φ:
Semion = < X, Φ >  where X ⊆ U and Φ ⊆ F

The Usage of a semion is defined as its Extent (extension), i.e. a typified set (class) of uses:
||Φ||_Sign = { x ∈ U : x ⊨_Sign Φ }

The Sense of a semion is defined as its Intent (intension):
||X||_Sign = { ϕ ∈ F : X ⊨_Sign ϕ }
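Since a Semion is a formal concept, the two FCA derivation operators behind the Extent and Intent definitions can be sketched in a few lines of Python. This is not SEMANA code; the toy context below is the "Family Resemblance" table used further on in these slides, and the function names are my own.

# A minimal sketch of the FCA derivation operators ||Phi|| and ||X||.
CONTEXT = {
    "Jim":  {"BigEars", "BlueEyes", "RoundFace", "Bald"},
    "John": {"BlueEyes", "Bald"},
    "Bob":  {"BigEars", "FlatNose", "RoundFace", "Bald"},
    "Max":  {"FlatNose", "RoundFace"},
}
ALL_ATTRIBUTES = set().union(*CONTEXT.values())

def extent(formulae):
    """||Phi|| = the set of objects (uses) satisfying every formula in Phi."""
    return {obj for obj, attrs in CONTEXT.items() if formulae <= attrs}

def intent(objects):
    """||X|| = the set of formulae satisfied by every object (use) in X."""
    result = ALL_ATTRIBUTES
    for obj in objects:
        result = result & CONTEXT[obj]
    return result

# A formal concept (Semion) is a pair <X, Phi> with extent(Phi) == X and
# intent(X) == Phi, e.g. the "central" concept C5 of the example below:
X = {"Jim", "Bob"}
Phi = {"BigEars", "RoundFace", "Bald"}
assert extent(Phi) == X and intent(X) == Phi
print(sorted(X), sorted(Phi))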
Attributive & Relational Knowledge

Semantic Network: Collins & Quillian (1969)
Collins, A. M., & Quillian, M. R. (1969). Retrieval Time from Semantic Memory. Journal of Verbal Learning and Verbal Behavior, 8, 240-247.

Connectionist (neural) System: Rumelhart & Todd (1993)

Some kinds of Attributive Knowledge are similar to connectionist knowledge under two conditions:
1. the initial System must contain alternative (multi-valued) attributes;
2. the initial System must be complete (i.e. all cases present).

Relational Knowledge
Example: Ontology Coordination using Information Flow Logic.
Marco Schorlemmer and Yannis Kalfoglou, "A Channel-Theoretic Foundation for Ontology Coordination".

Attributive Knowledge
Example: because the 'Ohio' is big, it is called a "river" in English; because it is a tributary, it is called a "rivière" in French.

Some Results of FCA (Formal Concept Analysis)

"Family Resemblance": Lattice of Formal Concepts

Formal context (x = attribute present, 0 = attribute absent):

        BigEars  BlueEyes  FlatNose  RoundFace  Bald
Jim     x        x         0         x          x
John    0        x         0         0          x
Bob     x        0         x         x          x
Max     0        0         x         x          0

LIST OF FORMAL CONCEPTS
C1 {},{BigEars,BlueEyes,FlatNose,RoundFace,Bald}
C2 {Bob},{BigEars,FlatNose,RoundFace,Bald}
C3 {Bob,Max},{FlatNose,RoundFace}
C4 {Jim},{BigEars,BlueEyes,RoundFace,Bald}
C5 {Jim,Bob},{BigEars,RoundFace,Bald}
C6 {Jim,Bob,Max},{RoundFace}
C7 {Jim,John},{BlueEyes,Bald}
C8 {Jim,John,Bob},{Bald}
C9 {Jim,John,Bob,Max},{}

CENTRAL FORMAL CONCEPT:
C5 {Jim,Bob},{BigEars,RoundFace,Bald}

2 intensional master-concepts:
C2 {Bob},{BigEars,FlatNose,RoundFace,Bald}
C4 {Jim},{BigEars,BlueEyes,RoundFace,Bald}

2 extensional master-concepts:
C6 {Jim,Bob,Max},{RoundFace}
C8 {Jim,John,Bob},{Bald}

"Family Resemblance": Multi-base Classes

For the same formal context, the selected formal concepts are:
C2 {Bob},{BigEars,FlatNose,RoundFace,Bald}
C4 {Jim},{BigEars,BlueEyes,RoundFace,Bald}

Their meet is: C1 {},{BigEars,BlueEyes,FlatNose,RoundFace,Bald}
Their join is: C5 {Jim,Bob},{BigEars,RoundFace,Bald}
Similarity index = 0.75 according to Zhao, Wang & Halang, 2006 (after Tversky's model); a sketch of this computation is given after the class structure below.

CLASS STRUCTURE (all formal concepts included):
Class 1: Bob
  C2 {Bob},{BigEars,FlatNose,RoundFace,Bald}
  C3 {Bob,Max},{FlatNose,RoundFace}
Class 2: Jim
  C4 {Jim},{BigEars,BlueEyes,RoundFace,Bald}
  C7 {Jim,John},{BlueEyes,Bald}
Class 3: Jim, Bob
  C5 {Jim,Bob},{BigEars,RoundFace,Bald}
  C6 {Jim,Bob,Max},{RoundFace}
  C8 {Jim,John,Bob},{Bald}
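The sketch below computes a Tversky-style (ratio model) similarity between the intents of two formal concepts. Assuming equal weights alpha = beta = 0.5, it reproduces the 0.75 index reported above for C2 and C4; the exact weighting used by Zhao, Wang & Halang (2006) may differ, so this is an illustration, not their formula.

# Tversky ratio-model similarity between concept intents (illustrative).
def tversky_similarity(intent_a, intent_b, alpha=0.5, beta=0.5):
    """|A & B| / (|A & B| + alpha*|A - B| + beta*|B - A|)."""
    common = len(intent_a & intent_b)
    only_a = len(intent_a - intent_b)
    only_b = len(intent_b - intent_a)
    return common / (common + alpha * only_a + beta * only_b)

C2_intent = {"BigEars", "FlatNose", "RoundFace", "Bald"}
C4_intent = {"BigEars", "BlueEyes", "RoundFace", "Bald"}

print(tversky_similarity(C2_intent, C4_intent))  # 0.75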
Double Binary Opposition

The Polish plural past-tense endings -ły / -li form a double binary opposition (English glosses added):

-ły2 [-HUM]: Psy stały. (The dogs stood.)  Pociągi stały. (The trains stood.)  Dzieci stały. (The children stood.)
-li1 [+HUM]: Ludzie stali. (The people stood.)  Matka i dziecko stali. (The mother and child stood.)
-ły1 [+FEM]: Panie stały. (The ladies stood.)
-li2 [-FEM]: Panowie stali. (The gentlemen stood.)

Converse Opposition (my old name: "Boomerang Opposition")

In a BASE UTTERANCE:
- "GA" (ga1) is a marker of the Attention-driven Phrase (Subject with the status 'New').
- "WA" (wa2) is a marker of the Attention-driven Phrase (Subject with the status 'non-New').

In an EXTENDED UTTERANCE:
- "WA" (wa1) is a marker of the Attention-driven Phrase (Topic with the status 'Old').
- "GA" (ga2) is a marker of the Attention-driven Phrase (Focus with the status 'New').

The two converse mappings GA ---> WA and WA ---> GA relate the 'New' and 'Old' statuses of the base and extended utterances; together they form an infomorphism WA ⇄ GA.

Infomorphism of the Japanese 'wa' and 'ga' particles

[Diagram: the classification WA (tokens wa1, wa2; types OLD + wa + NEW, OLD + wa + OLD) and the classification GA (tokens ga1, ga2; types NEW + ga + OLD, NEW + ga + NEW) are connected by a pair of maps, WA ⟶ GA on types and GA ⟶ WA on tokens, forming an infomorphism.]

The MIC Theory: Conceptual Lattice View

http://www.celta.paris-sorbonne.fr  © André WLODARCZYK
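The sketch below spells out the infomorphism condition of Barwise & Seligman (1997), which the 'wa'/'ga' diagram relies on: a pair of maps going in opposite directions (on types and on tokens) such that down(b) satisfies a type iff b satisfies its image. The token/type assignments and the two maps are illustrative assumptions of mine, not the actual MIC-theory classifications.

# A minimal sketch of the Barwise & Seligman infomorphism condition,
# instantiated with hypothetical 'wa'/'ga' classifications.
WA = {  # classification WA: tokens classified by 'wa'-utterance types (assumed)
    "wa1": {"OLD + wa + OLD"},
    "wa2": {"OLD + wa + NEW"},
}
GA = {  # classification GA: tokens classified by 'ga'-utterance types (assumed)
    "ga1": {"NEW + ga + NEW"},
    "ga2": {"NEW + ga + OLD"},
}

up = {   # on types: typ(WA) -> typ(GA) (assumed mapping)
    "OLD + wa + NEW": "NEW + ga + NEW",
    "OLD + wa + OLD": "NEW + ga + OLD",
}
down = {  # on tokens: tok(GA) -> tok(WA) (assumed mapping)
    "ga1": "wa2",
    "ga2": "wa1",
}

def is_infomorphism(src, dst, up, down):
    """Fundamental property: down(b) |= alpha  iff  b |= up(alpha), for all b, alpha."""
    src_types = set().union(*src.values())
    return all(
        (alpha in src[down[b]]) == (up[alpha] in dst[b])
        for b in dst
        for alpha in src_types
    )

print(is_infomorphism(WA, GA, up, down))  # True under these assumptions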