L.A.S.I. Feasibility Presentation Linguistic Analysis for Subject Identification November 12, 2012

L.A.S.I.
Linguistic Analysis for Subject Identification
November 12, 2012
Feasibility Presentation
Presented by: CS410 Red Group
Outline
• Team Red Staff Chart
• Introduction
• Societal Problem
• Case Study
• Proposed Solution
• Major Component Diagram
• Algorithm
• The Competition
• Risk
• Conclusion
410 Red Group
Team Red Staff Chart
Scott Minter: Project Co-Leader, Software Specialist
Brittany Johnson: Project Co-Leader, Documentation Specialist
Dustin Patrick: Algorithm Specialist, Expert Liaison
Richard Owens: Documentation Specialist, Communication Specialist
Aluan Haddad: Algorithm Specialist, Software Specialist
Erik Rogers: Marketing Specialist, GUI Developer
What is a theme?
A specific and distinctive quality,
characteristic, or concern. [1]
[1] "Theme," Merriam-Webster
What are you looking for when you are
identifying a theme?
5 W’s & 1 H
• Who
• What
• When
• Where
• Why
• How
Bill’s stove was broken. He had
been saying for months that he
would go to the appliance store to
buy a new one. He had some free
time yesterday, so he drove to the
store to buy a new stove.
Who: Bill
What: He travelled to some place
When: Yesterday
Where: The store
Why: To buy a stove because his broke
How: By driving
The Theme from the 5 W’s & 1 H
Bill drove to the store yesterday to
buy a new stove because his broke.
Why are themes important?
• Comprehension
• Summarization
• Assists in communication between people
Societal Problem
It is difficult for people to identify a common
theme over a large set of documents in a timely,
consistent, and objective manner.
How long does it take?
• Finding a theme over multiple documents is a
time-consuming process.
• The average reading speed of an adult is 250
words per minute. [2]
[2] Thomas, "What Is the Average Reading Speed and the Best Rate of Reading?"
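To make the time cost concrete, the 250 words-per-minute figure can be turned into a rough estimate. This is only a back-of-the-envelope sketch: the document count and document length below are hypothetical numbers chosen for illustration, not figures from the case study, and `reading_hours` is a name made up for this example.

```python
WORDS_PER_MINUTE = 250  # average adult reading speed cited on this slide


def reading_hours(num_documents: int, words_per_document: int,
                  words_per_minute: int = WORDS_PER_MINUTE) -> float:
    """Estimate the hours needed to read every document once."""
    total_words = num_documents * words_per_document
    return total_words / words_per_minute / 60


# Hypothetical workload (an assumption): 50 documents of 5,000 words each.
hours = reading_hours(50, 5_000)
print(f"{hours:.1f} hours")  # prints "16.7 hours"
```

The estimate covers a single read-through only; comparing and interrelating the documents would add to it.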
Consistency and Objectivity
• The criteria for evaluation may vary from person
to person.
• Large quantities of documents must be mentally
digested, assessed, and interrelated.
Dr. Patrick Hester
Ph.D. from Vanderbilt University, 2007
Major: Risk and Reliability Engineering and Management
“My research interests include multi-objective
decision making under uncertainty,
probabilistic and non probabilistic
uncertainty analysis, critical infrastructure
protection, and decision making using
modeling and simulation.” [3]
- Dr. Hester
[3] Patrick Hester website
• Dr. Hester is a systems analyst and researcher
▫ He must:
  ▪ Conduct extensive research
  ▪ Quickly become familiar with client systems
  ▪ Formulate concise, objective assessments
• LASI will help with all of this
Assessment Improvement Design (A.I.D.)
• A preliminary problem statement is identified from the
documents
• The problem statement is then used to find Critical
Operational Issues (COIs)
• COIs are used to find Measures of Effectiveness
(MOEs)
• MOEs are used to find Measures of Performance
(MOPs)
Current Method
[Flowchart. Steps: Customer Contact; Situational Awareness Meeting; Document Gathering Process; Document Analysis; Problem Statement Presentation; Continue on to the rest of the A.I.D. Process; Client Goes Elsewhere. Decisions (yes/no): Will NCSOSE be needed?; Is Customer satisfied?]
LASI: Linguistic Analysis for Subject Identification
[Figure labels: LASI, THEMES]
Our Proposed Solution
• LASI is a linguistic analysis decision support tool
used to help determine a common theme across
multiple documents. It is our goal with LASI to:
▫ accurately find themes
▫ be system efficient
▫ provide consistent results
What do we mean by “linguistic analysis”?
The contextual study of written works and how the
words combine to form an overall meaning.
Linguistic analysis involves
Syntactic
• Logical grammar
• Statistical Data
• Alphabetical Frequencies
• Word Counts
• Parts of Speech
• Word Dependencies
Semantic
• Relating syntactic
structures to language-independent meanings
• Extracting meaning and
conceptual arguments
• Summarization
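Two of the syntactic statistics listed above, word counts and alphabetical frequencies, are simple enough to sketch directly. The snippet below is an illustration only and not LASI code; the function name `syntactic_stats` and the sample sentence are assumptions made for this example.

```python
import re
from collections import Counter


def syntactic_stats(text: str) -> tuple[Counter, Counter]:
    """Return (word counts, letter frequencies) for a piece of text."""
    lowered = text.lower()
    # Words are runs of letters and apostrophes; punctuation is ignored.
    word_counts = Counter(re.findall(r"[a-z']+", lowered))
    # Alphabetical frequency: how often each letter occurs.
    letter_counts = Counter(ch for ch in lowered if ch.isalpha())
    return word_counts, letter_counts


sample = "Bill drove to the store to buy a new stove."
words, letters = syntactic_stats(sample)
print(words.most_common(3))
print(letters.most_common(3))
```

Parts of speech and word dependencies need a real tagger and parser, so they are left out of this sketch.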
The Wills and Will Nots of LASI
What LASI Will Do
• Analyze multiple documents to
find common themes
What LASI Will Not Do
• Provide a concise synopsis
• Provide a single theme
• Provide statistical data to help
a user make a decision
Who Would This Appeal To?
• Researchers
• Consultants
• Academics
• Students
Benefits To The Customer
• Time saving
• Objective output
• Consistent output
• Cost saving solution
How does LASI fit into our Case Study?
Before LASI
[Flowchart. Steps: Customer Contact; Situational Awareness Meeting; Document Gathering Process; Document Analysis; Problem Statement Presentation; Continue on to the rest of the A.I.D. Process; Client Goes Elsewhere. Decisions (yes/no): Will NCSOSE be needed?; Is the Customer satisfied?]
After LASI
[Flowchart. Steps: Customer Contact; Situational Awareness Meeting; Document Gathering Process; LASI-Aided Document Analysis; Problem Statement Presentation; Continue on to the rest of the A.I.D. Process; Client Goes Elsewhere. Decisions (yes/no): Will NCSOSE be needed?; Is the Customer satisfied?]
Major Functional Components
Hardware
High End Notebook PC
- Computation: Quad-Core CPU
- Primary Memory: 8.0 GB DDR3 RAM
- Document Storage: Solid State Storage
~$1500 USD
Software
Algorithm:
Extrapolates the most likely congruence of themes and ideas
across all documents in the input domain
User Interface:
- Multi-Level Views
- Weighted Phrase List
- Detailed Breakdown
- Step by Step Justification
Linguistic Analysis Algorithm
Primary Analysis: Word Count and Syntactic Assessment
• Traverse Document in Word-Wise Manner
• Identify Corresponding Parts of Speech
• Determine Frequency by Grammatical Role
Secondary Analysis: Associative Identification
• Bind Pronouns to Nouns, Updating Frequency
• Bind Adjectives to Nouns
• Identify Potential Noun Phrases
Tertiary Analysis: Semantic Relationship Assessment
• Identify Potential Synonyms
• Assess Potential Subject-Object-Verb Relationships
• Output List of Weighted Themes
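The three analysis stages can be pictured with a deliberately tiny sketch. It is not the LASI algorithm: the hand-written word lists stand in for a real part-of-speech tagger and thesaurus, the pronoun rule (bind to the most recent person) is a crude assumption, and every name in it (`PEOPLE`, `THINGS`, `SYNONYMS`, `weighted_themes`) is hypothetical.

```python
import re
from collections import Counter

# Hypothetical toy lexicons (assumptions) replacing a real tagger and thesaurus.
PEOPLE = {"bill"}
THINGS = {"stove", "store"}
SYNONYMS = {"oven": "stove", "shop": "store"}  # variant -> canonical noun
PERSON_PRONOUNS = {"he", "him", "his"}


def weighted_themes(text: str) -> list[tuple[str, int]]:
    """Return nouns with frequency weights, most frequent first."""
    counts: Counter = Counter()
    last_person = None
    # Primary analysis: traverse the document word by word.
    for word in re.findall(r"[a-z]+", text.lower()):
        # Tertiary analysis (simplified): fold known synonyms into one noun.
        word = SYNONYMS.get(word, word)
        if word in PEOPLE:
            counts[word] += 1
            last_person = word
        elif word in THINGS:
            counts[word] += 1
        # Secondary analysis: bind a pronoun to the most recent person,
        # updating that noun's frequency.
        elif word in PERSON_PRONOUNS and last_person is not None:
            counts[last_person] += 1
    return counts.most_common()


text = "Bill's stove was broken. He drove to the shop to buy a new oven."
print(weighted_themes(text))
```

Even in this toy form, the pronoun "He" raises the weight of "bill", and "shop" and "oven" are counted toward "store" and "stove". That is the effect the binding and synonym steps are meant to have on the weighted theme list.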
[Figure: Milestone diagram]
The Competition
• WordStat
• Stanford CoreNLP
• ReadMe
• AutoMap
Risk Matrix
Customer Risks
C1 -- Product Interest
C2 -- Maintenance
C3 -- Trust
Technical Risks
T1 -- System Limitations
T2 -- Scanned Text Recognition
T3 -- Jargon Recognition
T4 -- Illegal Character Handling
Customer Risks
C1. Product Interest
Probability 2 Impact 4
Mitigation: LASI offers unique functionality and user friendliness.
C2. Maintenance
Probability 3 Impact 2
Mitigation: LASI will be a free, open source application allowing
the community to maintain and extend it over time.
C3. Trust
Probability 3 Impact 3
Mitigation: LASI will provide a step by step breakdown of output
analysis and algorithm reasoning.
Technical Risks
T1. System Limitations
Probability 4 Impact 2
Mitigation: LASI will be designed from the ground up in native C++
for memory- and CPU-efficient code.
T2. Scanned Text Recognition
Probability 4 Impact 3
Mitigation: LASI will implement an optical character recognition
algorithm to handle scanned text.
Technical Risks
T3. Jargon Recognition
Probability 3 Impact 2
Mitigation: LASI will have domain-specific dictionaries and feature
intuitive contextual inference.
T4. Illegal Character Handling
Probability 4 Impact 2
Mitigation: LASI will provide contextual inference, synonym
recognition, and statistical methods.
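The probability and impact ratings on the risk slides can be combined into one number for ranking. The slides do not state a scoring rule, so the product of probability and impact used below is an assumption (a common convention for risk matrices), and `rank_risks` is a name made up for this sketch.

```python
# (probability, impact) ratings copied from the customer and technical risk slides.
RISKS = {
    "C1 Product Interest": (2, 4),
    "C2 Maintenance": (3, 2),
    "C3 Trust": (3, 3),
    "T1 System Limitations": (4, 2),
    "T2 Scanned Text Recognition": (4, 3),
    "T3 Jargon Recognition": (3, 2),
    "T4 Illegal Character Handling": (4, 2),
}


def rank_risks(risks: dict[str, tuple[int, int]]) -> list[tuple[str, int]]:
    """Sort risks by probability * impact (an assumed rule), highest first."""
    scored = {name: prob * impact for name, (prob, impact) in risks.items()}
    return sorted(scored.items(), key=lambda item: item[1], reverse=True)


for name, score in rank_risks(RISKS):
    print(f"{score:>2}  {name}")
```

Under this assumed rule, T2 (Scanned Text Recognition) scores highest at 12, followed by C3 (Trust) at 9.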
Conclusion
• LASI is feasible.
• LASI is a decision support tool, not a
decision-making tool.
• Success would have implications for a wide range of
fields of study and professions.
• For LASI to succeed, the output needs to be
immediately usable and the interface user-friendly.
References
1. "Theme." Def. 1b. Merriam-Webster. N.p., n.d. Web. 19 Oct. 2012.
<http://www.merriam-webster.com/dictionary/theme>.
2. Thomas, Mark. "What Is the Average Reading Speed and the Best Rate of
Reading?" Web. 19 Oct. 2012.
<http://www.healthguidance.org/entry/13263/1/What-Is-the-Average-Reading-Speed-and-the-Best-Rate-of-Reading.html>.
3. "Patrick Hester." Old Dominion University. N.p., n.d. Web. 24 Sept. 2012.
<http://www.odu.edu/directory/people/p/pthester>.
Osinski, Stanislaw, and Dawid Weiss. Carrot2. 13 Aug. 2012. Web. 25 Sept. 2012.
<http://project.carrot2.org>.
"WordStat." Provalis Research. Web. 24 Sept. 2012.
<http://provalisresearch.com/products/content-analysis-software/>.
"ReadMe: Software for Automated Content Analysis." Web. 24 Sept. 2012.
<http://gking.harvard.edu/node/4520/rbuild_documentation/readme.pdf>.
"AlchemyAPI Overview." AlchemyAPI. N.p., n.d. Web. 19 Oct. 2012.
<http://www.alchemyapi.com/api/>.
"AutoMap." Project. N.p., n.d. Web. 19 Oct. 2012.
<http://www.casos.cs.cmu.edu/projects/automap/>.
"CL Research Home Page." CL Research Home Page. N.p., n.d. Web. 19 Oct. 2012.
<http://www.clres.com/>.