Text Mining to Identify Trends in Testing Autonomous Systems

“Autonomous systems have the power of self-governance – the ability to act independently of human control and in unrehearsed conditions”

Goal: Research communities are characterized by their vocabulary, which maps directly to the concepts the community uses at a given point in time. The Autonomous Systems T&E community is characterized by its published vocabulary and by the concepts inferred while observing and preserving autonomous systems testing. To understand the concepts used by the autonomous systems community, and to mine knowledge about the concepts it commonly refers to, we aim to text mine the literature and identify the topics that have evolved over the last decade in conjunction with T&E.

Motivation: Autonomous systems are already present in the global market, and their penetration is projected to increase. Because they interact with humans without human supervision, there is a need to thoroughly understand how to test autonomous systems and to ensure their safe operation.

Fig.1. Approach to explore the evolution of themes in describing Autonomous Systems
Fig.2. Text Mining approach for realizing project goal
Fig.3. Concept Relationship Identification
Fig.4. Concept Mining of Testing Autonomous Systems

Text Mining & Analytics
• Also known as text data mining.
• The process of examining a large set of textual resources to generate new information.
• Related to text retrieval, an essential component in text mining.

Text Data – Subjective Sensors
• Sensors monitor the real world and report data.
• Text data is generated by humans acting as subjective sensors.

Diagram:
Real World -> (Sense) -> Sensor -> (Report) -> Data
    Weather  -> Thermometer -> 3°C, 15°F, etc.
    Location -> Geo Sensor  -> 14°N, 120°W, ...
Real World -> (Perceive) -> Human Sensor -> (Express) -> Knowledge

Why do Text Mining?
• The intent is to turn text data into high-quality information or actionable knowledge.
• Generate new information.
• To identify patterns and relationships within a large body of text that would otherwise be extremely difficult or time-consuming to discover.

Applications of Text Mining
• Enterprise Business Intelligence
• Healthcare/Medical Records
• National Security
• Scientific Discovery
• Sentiment Analysis Tools
• Natural Language Service
• Publishing
• Automated Ad Placement
• Information Access
• Social Media Monitoring

Text Mining Process
1. Problem Definition & Specific Goals
2. Identify Text to be Collected
3. Text Organization
4. Feature Extraction
5. Analysis
6. Insight, Recommendation/Output

Goal
Mining knowledge about T&E of autonomous systems
• Infer the concepts and methodologies discussed in testing of autonomous systems
• Trends in industry and academia
Diagram labels: Observe/Perceive, Text Data/Knowledge, Express

Data Selection & Gathering
• Research papers/articles on autonomous systems testing
• 60 articles downloaded over set time frames to prevent overlap

Data Processing & Transformation (Cleaning)
• Create a corpus of data
• Remove special characters (/, \, |, @, etc.)
• Convert to lower case and remove numbers
• Remove punctuation
• Remove stop words

Removing Stop Words
• Stop words are the most commonly used words in the English language. Example: I, me, myself, we, our, ourselves, you, yours, etc.
• Similarly, any given word can be iteratively removed from the corpus to refine the data.

Result after cleaning: 60 documents, 21,760 terms

Bag-of-Words Representation
A simplified representation of text as a bag of words, disregarding grammar and word order but keeping multiplicity.

Obtaining Concept Frequencies

Network Map of Concepts
• Visualize which terms occur together frequently.
• Helps to resolve an important keyword with a central theme.
1. A term-document matrix is created.
2. Next, an adjacency matrix is created as an inner product of the term-document matrix to determine the number of times each pair of terms appears together in a document.

Network Map of Words

Topic Identification
1. Identify the number of expected topics (based on a best informed estimate, or trial and error).
2. Randomly assign every term to a topic.
3. Update the topic assignment of the term based on:
   1. How prevalent is the word across the topic?
   2. How prevalent are the topics in the document?

Topics Identified

Future: Topic Proportions Over Time
• Aggregation of mean topic proportions per year over a corpus.
The visualization shows that topics around the relation between the federal government and the states as well as inner conflicts clearly dominate the first

Thank you

LDA Iteration
Topic assignment z_i of each word token w_i in document d_i after iterations 1, 2, and 1000:

  i    w_i           d_i   z_i (iter. 1)   z_i (iter. 2)   z_i (iter. 1000)
  1    MATHEMATICS   1     2               2               2
  2    KNOWLEDGE     1     2               1               2
  3    RESEARCH      1     1               1               2
  4    WORK          1     2               2               1
  5    MATHEMATICS   1     1               2               2
  6    RESEARCH      1     2               2               2
  7    WORK          1     2               2               2
  8    SCIENTIFIC    1     1               1               1
  9    MATHEMATICS   1     2               2               2
  10   WORK          1     1               2               2
  11   SCIENTIFIC    2     1               1               2
  12   KNOWLEDGE     2     1               2               2
  ...  ...           ...   ...             ...             ...
  50   JOY           5     2               1               1

Within an iteration the assignments are updated one token at a time (i = 1, 2, 3, 4, ...), each time asking:

What's the most likely topic for w_i in d_i?
• How likely would d_i choose topic j?
  (words in d_i assigned to topic j) / (words in d_i assigned to any topic)
• How likely would topic j generate word w_i?
  (count of instances where w_i is assigned to topic j) / (count of all words assigned to topic j)
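
The update walked through in the LDA iteration slides can be sketched as a small collapsed Gibbs sampler. This is a minimal illustration, not the implementation used in the project: the function name `gibbs_lda`, the toy documents, and the smoothing values `alpha` and `beta` are all assumptions added here. The smoothing terms are not on the slides; they keep every weight positive so that a topic with a zero count can still be sampled.

```python
import random
from collections import defaultdict


def gibbs_lda(docs, num_topics, iterations=200, alpha=0.1, beta=0.01, seed=0):
    """Collapsed Gibbs sampling over tokenized documents (lists of words).

    alpha and beta are assumed smoothing values, not taken from the slides.
    """
    rng = random.Random(seed)
    vocab_size = len({w for doc in docs for w in doc})
    topics = list(range(num_topics))

    doc_topic = [[0] * num_topics for _ in docs]                # words in d assigned to topic j
    topic_word = [defaultdict(int) for _ in range(num_topics)]  # times w is assigned to topic j
    topic_total = [0] * num_topics                              # all words assigned to topic j
    z = []                                                      # z[d][i]: topic of token i in doc d

    # Randomly assign every token to a topic.
    for d, doc in enumerate(docs):
        z.append([])
        for w in doc:
            j = rng.randrange(num_topics)
            z[d].append(j)
            doc_topic[d][j] += 1
            topic_word[j][w] += 1
            topic_total[j] += 1

    # Revisit each token in turn and resample its topic.
    for _ in range(iterations):
        for d, doc in enumerate(docs):
            for i, w in enumerate(doc):
                old = z[d][i]
                # Take the current token out of the counts.
                doc_topic[d][old] -= 1
                topic_word[old][w] -= 1
                topic_total[old] -= 1

                # Weight for topic j: (how likely topic j generates w) times
                # (how likely document d chooses topic j). The document-side
                # denominator is the same for every j, so it is omitted.
                weights = [
                    (topic_word[j][w] + beta) / (topic_total[j] + vocab_size * beta)
                    * (doc_topic[d][j] + alpha)
                    for j in topics
                ]
                new = rng.choices(topics, weights=weights)[0]

                z[d][i] = new
                doc_topic[d][new] += 1
                topic_word[new][w] += 1
                topic_total[new] += 1

    return z, doc_topic, topic_word, topic_total


# Toy corpus (hypothetical) reusing words from the slides.
docs = [
    ["mathematics", "knowledge", "research", "work", "mathematics", "research"],
    ["scientific", "knowledge", "research", "work"],
    ["joy", "work", "joy", "scientific"],
]
z, doc_topic, topic_word, topic_total = gibbs_lda(docs, num_topics=2)
```

After the run, `z` holds one topic index per token, in the same layout as the z_i columns on the slides, and `doc_topic[d]` gives the per-document topic counts from which topic proportions can be computed.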
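
The cleaning steps and the Network Map of Concepts steps (term-document matrix, then adjacency matrix) can be sketched end to end in plain Python. This is an illustrative sketch under stated assumptions: the helper names `clean`, `term_document_matrix`, and `adjacency`, the toy documents, and the short stop-word list are hypothetical and not taken from the project. One design choice is also assumed: the matrix is binarized before the inner product, so each entry counts the number of documents in which both terms appear.

```python
import re
from collections import Counter

# Small illustrative stop-word subset (assumed); real lists are much longer.
STOP_WORDS = {"i", "me", "myself", "we", "our", "ourselves", "you", "yours",
              "the", "of", "and", "a", "is", "are", "to", "in"}


def clean(text):
    """Lower-case, strip special characters, numbers and punctuation, drop stop words."""
    text = text.lower()
    text = re.sub(r"[^a-z\s]", " ", text)  # keep only letters and whitespace
    return [tok for tok in text.split() if tok not in STOP_WORDS]


def term_document_matrix(raw_docs):
    """Return (terms, tdm) where tdm[t][d] is the count of term t in document d."""
    bags = [Counter(clean(doc)) for doc in raw_docs]  # bag-of-words per document
    terms = sorted(set().union(*bags))
    tdm = [[bag[term] for bag in bags] for term in terms]
    return terms, tdm


def adjacency(tdm):
    """Inner product of the binarized term-document matrix with itself.

    Entry [a][b] is the number of documents containing both term a and term b.
    """
    binary = [[1 if count > 0 else 0 for count in row] for row in tdm]
    n = len(binary)
    return [
        [sum(x * y for x, y in zip(binary[a], binary[b])) for b in range(n)]
        for a in range(n)
    ]


# Toy documents (hypothetical), standing in for the 60 downloaded articles.
raw_docs = [
    "Testing autonomous systems requires simulation.",
    "Autonomous systems: safety testing in 2019!",
    "Simulation supports safety verification.",
]
terms, tdm = term_document_matrix(raw_docs)
adj = adjacency(tdm)
```

The off-diagonal entries of `adj` are the edge weights of a network map: terms with a large shared count would be drawn close together or with a heavier link.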