Uploaded by SA Akundi

Text Mining

Text Mining to Identify Trends in Testing Autonomous Systems
“Autonomous systems have the power of self-governance – the ability to act independently of
human control and in unrehearsed conditions”
Goal: Research communities are characterized by their vocabulary, which maps directly to the
concepts the community uses at a given point in time. The Autonomous Systems Test and
Evaluation (T&E) community is likewise characterized by its published vocabulary and by the
concepts inferred while observing and preserving autonomous systems testing. To understand the
concepts used by the autonomous systems community, and to mine knowledge about the concepts it
refers to most often, we aim to text mine the literature and identify the topics that have
evolved over the last decade in conjunction with T&E.
Motivation: The presence of autonomous systems, and their projected growth in the global
market, creates a need to understand thoroughly how to test autonomous systems and how to
ensure their safe operation, because they interact with humans without human supervision.
Fig. 1. Approach to explore the evolution of themes in describing Autonomous Systems
Fig. 2. Text Mining approach for realizing project goal
Fig. 3. Concept Relationship Identification
Fig. 4. Concept Mining of Testing Autonomous Systems
Text Mining & Analytics
• Also known as Text Data Mining.
• The process of examining a large set of textual resources to generate new information.
• Related to text retrieval, which is an essential component of text mining.
Text Data – Subjective Sensors
• Sensors monitor the real world and report data
• Text data is generated by humans acting as subjective sensors
[Diagram: physical sensors compared with human sensors]
Real World -> Sense -> Sensor -> Report -> Data
Weather -> Thermometer -> 3°C, 15°F, etc.
Location -> Geo Sensor -> 14°N, 120°W, ...
Real World -> Perceive -> Human Sensor -> Express -> Knowledge
Why do Text Mining?
• The intent is to turn text data into high-quality, actionable knowledge.
• Generate new information.
• Identify patterns and relationships within a large body of texts that would otherwise be
extremely difficult or time-consuming to discover.
Applications of Text Mining
• Enterprise Business Intelligence
• Healthcare/Medical Records
• National Security
• Scientific Discovery
• Sentiment Analysis Tools
• Natural Language Service
• Publishing
• Automated Ad Placement
• Information Access
• Social Media Monitoring
Text Mining Process
1. Problem Definition & Specific Goals
2. Identify Text to be Collected
3. Text Organization
4. Feature Extraction
5. Analysis
6. Insight, Recommendation/Output
Goal
Mining Knowledge about T&E of Autonomous Systems
[Diagram] Testing of Autonomous Systems, Trends in Industry and Academia -> Observe/Perceive -> Express -> Text Data/Knowledge -> Infer concepts and methodologies talked about
Data Selection & Gathering
• Research Papers/Articles on Autonomous Systems Testing
• 60 Articles downloaded over set time frames to prevent overlap
Data Processing & Transformation (Cleaning)
1. Create a corpus of data
2. Remove special characters (/, \, |, @, etc.)
3. Convert to lower case and remove numbers
4. Remove punctuation
5. Remove stop words
Stop words are the most commonly used words in the English language, for example: I, me,
myself, we, our, ourselves, you, yours, etc. Similarly, any given word can be iteratively
removed from the corpus to refine the data.
Result: 60 documents, 21,760 terms
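The cleaning steps above can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the tooling used for the project: the function name `clean`, the tiny `STOP_WORDS` set, and the sample sentence are all made up for the example, and a real pipeline would use a much fuller stop-word list.

```python
import re
import string

# Illustrative stop-word list only (an assumption); real lists are much longer.
STOP_WORDS = {"i", "me", "my", "myself", "we", "our", "ours", "ourselves",
              "you", "your", "yours", "the", "a", "an", "of", "and", "in", "to"}

# Translation table that turns every ASCII punctuation mark into a space.
PUNCT_TABLE = str.maketrans(string.punctuation, " " * len(string.punctuation))

def clean(text, extra_stop_words=()):
    """Apply the cleaning steps to one document and return its tokens."""
    text = re.sub(r"[/\\|@]", " ", text)   # remove special characters
    text = text.lower()                    # convert to lower case
    text = re.sub(r"\d+", " ", text)       # remove numbers
    text = text.translate(PUNCT_TABLE)     # remove punctuation
    stop = STOP_WORDS | set(extra_stop_words)
    return [tok for tok in text.split() if tok not in stop]

sample = "Testing of Autonomous Systems @ 2018: safety/assurance cases!"
print(clean(sample))
# Any further word can be dropped iteratively by passing it as an extra stop word.
print(clean(sample, extra_stop_words={"cases"}))
```

The `extra_stop_words` argument mirrors the note that any given word can be removed iteratively to refine the corpus.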
Bag-of-Words Representation
A simplified representation of text as a bag
of words, disregarding grammar and
word order but keeping multiplicity
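As a small illustration of the bag-of-words idea, the sketch below counts tokens with Python's `collections.Counter`. The helper name `bag_of_words` and the two token lists are hypothetical; the point is that two texts with the same words in a different order produce the same bag, while repeated words keep their counts.

```python
from collections import Counter

def bag_of_words(tokens):
    """Map each distinct token to the number of times it occurs."""
    return Counter(tokens)

# Two made-up token lists with the same words in a different order.
tokens_a = "testing autonomous systems requires testing".split()
tokens_b = "systems testing requires autonomous testing".split()

bag_a = bag_of_words(tokens_a)
bag_b = bag_of_words(tokens_b)

print(bag_a.most_common(1))  # [('testing', 2)]: multiplicity is kept
print(bag_a == bag_b)        # True: word order is disregarded
```

Summing such counters over all documents gives corpus-wide term frequencies.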
Obtaining Concept Frequencies
Network Map of Concepts
• Visualize which terms occur together frequently.
• Helps to relate an important keyword to a central theme.
1. A term-document matrix is created.
2. Next, an adjacency matrix is created by taking the inner product of the term-document
matrix with its transpose, which measures how often each pair of terms appears together
in a document.
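The two steps above can be sketched in plain Python. The example assumes a binary term-document matrix (1 if the term occurs in the document, 0 otherwise), so that each entry of the inner product counts the documents in which two terms co-occur; the slides do not say whether raw counts or presence flags were used. The function names and the three toy documents are made up for illustration.

```python
def term_document_matrix(docs):
    """Return (terms, matrix); matrix[t][d] is 1 if term t occurs in document d."""
    terms = sorted({tok for doc in docs for tok in doc})
    matrix = [[1 if term in doc else 0 for doc in docs] for term in terms]
    return terms, matrix

def adjacency_matrix(tdm):
    """Inner product of the term-document matrix with its transpose."""
    n = len(tdm)
    return [[sum(a * b for a, b in zip(tdm[i], tdm[j])) for j in range(n)]
            for i in range(n)]

# Toy corpus of already-cleaned documents (hypothetical).
docs = [["autonomous", "systems", "testing"],
        ["autonomous", "vehicles", "testing"],
        ["systems", "safety"]]

terms, tdm = term_document_matrix(docs)
adj = adjacency_matrix(tdm)

# Off-diagonal entries above zero become weighted edges of the network map.
edges = [(terms[i], terms[j], adj[i][j])
         for i in range(len(terms)) for j in range(i + 1, len(terms))
         if adj[i][j] > 0]
for edge in edges:
    print(edge)
```

The diagonal of `adj` holds each term's document frequency, which is usually ignored when drawing the network.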
Network Map of Words
Topic Identification
1. Identify the number of expected topics (based on a best informed estimate, or trial and
error).
2. Randomly assign every term to a topic.
3. Update the topic assignment of each term based on:
1. How prevalent the word is in each topic.
2. How prevalent each topic is in the document.
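The three steps above read like collapsed Gibbs sampling for LDA, so the following is a minimal sketch of that technique under that assumption. It is not the implementation used for the project. The function name `lda_gibbs`, the smoothing values `alpha` and `beta`, the iteration count, and the toy corpus are all choices made for this example.

```python
import random

def lda_gibbs(docs, n_topics, n_iter=200, alpha=0.1, beta=0.01, seed=0):
    """Collapsed Gibbs sampling for LDA; returns token topics and per-document counts."""
    rng = random.Random(seed)
    vocab = sorted({w for doc in docs for w in doc})
    n_vocab = len(vocab)
    # Step 2: randomly assign every token to a topic.
    z = [[rng.randrange(n_topics) for _ in doc] for doc in docs]
    # Count tables used by the update rule.
    word_topic = {w: [0] * n_topics for w in vocab}  # times word w has topic k
    topic_total = [0] * n_topics                     # all tokens with topic k
    doc_topic = [[0] * n_topics for _ in docs]       # tokens in doc d with topic k
    for d, doc in enumerate(docs):
        for i, w in enumerate(doc):
            k = z[d][i]
            word_topic[w][k] += 1
            topic_total[k] += 1
            doc_topic[d][k] += 1
    # Step 3: repeatedly resample the topic of every token.
    for _ in range(n_iter):
        for d, doc in enumerate(docs):
            for i, w in enumerate(doc):
                k = z[d][i]
                # Take the token's current assignment out of the counts.
                word_topic[w][k] -= 1
                topic_total[k] -= 1
                doc_topic[d][k] -= 1
                # (prevalence of the word in topic t) * (prevalence of topic t in doc).
                # The document-length denominator is the same for every topic,
                # so it is omitted from the unnormalized weights.
                weights = [
                    (word_topic[w][t] + beta) / (topic_total[t] + n_vocab * beta)
                    * (doc_topic[d][t] + alpha)
                    for t in range(n_topics)
                ]
                k = rng.choices(range(n_topics), weights=weights)[0]
                z[d][i] = k
                word_topic[w][k] += 1
                topic_total[k] += 1
                doc_topic[d][k] += 1
    return z, doc_topic

# Toy corpus (hypothetical): two documents about research, two about play.
docs = [
    ["mathematics", "knowledge", "research", "work", "mathematics", "research"],
    ["scientific", "knowledge", "research", "mathematics", "work"],
    ["joy", "play", "game", "joy", "ball", "play"],
    ["game", "ball", "joy", "play", "game"],
]
z, doc_topic = lda_gibbs(docs, n_topics=2)
for d, counts in enumerate(doc_topic):
    print(d, counts)  # number of tokens in document d assigned to each topic
```

Because the sampler is random, topic labels are arbitrary and results vary with the seed. In practice the number of topics is tuned by trial and error, as step 1 notes.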
Topics Identified
Future: Topic Proportions Over Time
• Aggregation of mean topic proportions per year over the corpus.
The visualization shows that topics around the relation between the federal
government and the states as well as inner conflicts clearly dominate the first
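The per-year aggregation mentioned in the bullet above could be computed as in the sketch below. It assumes each document already has a publication year and a vector of topic proportions from a fitted topic model. The function name and the three sample documents are invented for illustration.

```python
from collections import defaultdict

def mean_topic_proportions_by_year(doc_years, doc_topic_props):
    """Average the per-document topic proportions within each year."""
    sums = {}
    counts = defaultdict(int)
    for year, props in zip(doc_years, doc_topic_props):
        if year not in sums:
            sums[year] = [0.0] * len(props)
        sums[year] = [s + p for s, p in zip(sums[year], props)]
        counts[year] += 1
    return {year: [s / counts[year] for s in sums[year]] for year in sorted(sums)}

# Hypothetical input: publication year and topic proportions of three documents.
doc_years = [2015, 2015, 2016]
doc_topic_props = [[0.5, 0.5], [0.25, 0.75], [1.0, 0.0]]

trend = mean_topic_proportions_by_year(doc_years, doc_topic_props)
for year, props in trend.items():
    print(year, props)
```

Plotting one line per topic against the year then gives the topic-proportions-over-time view.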
Thank you
LDA
Topic assignment zi of each word wi in document di after iteration 1:

  i    wi            di   zi (iteration 1)
  1    MATHEMATICS   1    2
  2    KNOWLEDGE     1    2
  3    RESEARCH      1    1
  4    WORK          1    2
  5    MATHEMATICS   1    1
  6    RESEARCH      1    2
  7    WORK          1    2
  8    SCIENTIFIC    1    1
  9    MATHEMATICS   1    2
  10   WORK          1    1
  11   SCIENTIFIC    2    1
  12   KNOWLEDGE     2    1
  ...  ...           ...  ...
  50   JOY           5    2
LDA: iteration 2
To choose a new topic for word wi in document di, the following counts are used for each topic j:
• Count of instances where wi is assigned to topic j
• Count of all words assigned to topic j
• Words in di assigned to topic j
• Words in di assigned to any topic
LDA: iteration 2
What is the most likely topic for wi in di?
• How likely would di choose topic j?
• How likely would topic j generate word wi?
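The counts and questions on these LDA slides match the terms of the collapsed Gibbs sampling update commonly used for LDA. The formula itself did not survive in the slide text, so the following is a reconstruction under that assumption. The count symbols n, the smoothing parameters α and β, the vocabulary size W, and the number of topics T are notation introduced here, not taken from the slides.

```latex
P(z_i = j \mid \mathbf{z}_{-i}, w_i, d_i) \;\propto\;
\underbrace{\frac{n^{(w_i)}_{-i,j} + \beta}{n^{(\cdot)}_{-i,j} + W\beta}}_{\text{how likely topic } j \text{ generates } w_i}
\;\cdot\;
\underbrace{\frac{n^{(d_i)}_{-i,j} + \alpha}{n^{(d_i)}_{-i,\cdot} + T\alpha}}_{\text{how likely } d_i \text{ chooses topic } j}
```

Here n^{(w_i)}_{-i,j} is the count of instances where wi is assigned to topic j, n^{(\cdot)}_{-i,j} is the count of all words assigned to topic j, n^{(d_i)}_{-i,j} is the number of words in di assigned to topic j, and n^{(d_i)}_{-i,\cdot} is the number of words in di assigned to any topic. All counts exclude the current token i.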
LDA
Topic assignment zi of each word wi in document di after iterations 1, 2, ..., 1000:

  i    wi            di   zi (iter. 1)   zi (iter. 2)   ...   zi (iter. 1000)
  1    MATHEMATICS   1    2              2              ...   2
  2    KNOWLEDGE     1    2              1              ...   2
  3    RESEARCH      1    1              1              ...   2
  4    WORK          1    2              2              ...   1
  5    MATHEMATICS   1    1              2              ...   2
  6    RESEARCH      1    2              2              ...   2
  7    WORK          1    2              2              ...   2
  8    SCIENTIFIC    1    1              1              ...   1
  9    MATHEMATICS   1    2              2              ...   2
  10   WORK          1    1              2              ...   2
  11   SCIENTIFIC    2    1              1              ...   2
  12   KNOWLEDGE     2    1              2              ...   2
  ...  ...           ...  ...            ...            ...   ...
  50   JOY           5    2              1              ...   1