Knowledge Modeling from Software Documentation

By Madhuri Gopal, G.S Mahalakshmi, V.Vani Vijayan
Agenda:
• Objective
• Project Overview
• Design Principles
• Technology Stack
• Approach and Methodology
• Execution Framework
• Modules Covered
• Results
Objective
The objective of this presentation is to understand the nuances of converting
existing software documentation into an intelligent knowledge representation.
Project Overview:
Background
• Traditional development, deployment and maintenance of conventional software
applications require higher quality with shorter time-to-market cycles to
achieve customer delight.
• This involves a formal, explicit and conventional representation of the
knowledge base shared across stakeholders.
• Existing SDLC documents do not support intelligent extraction and
interpretation, either for downstream applications or for enhancements.
• There is a growing need for effective and efficient utilization of software
artifacts to deliver enhanced traceability as needs change.
Challenges in the existing systems
• More than 90% of existing software documentation is in the form of text.
• Knowledge engineers create knowledge representations from scratch, which
makes reuse and enhancement of existing representations difficult.
• Existing knowledge representation techniques require domain knowledge and
have a steep learning curve.
• Differences in the conceptualization of the domain model lead to
inconsistencies in its representation.
Design Principles
Open/Closed Principle
Software entities such as classes, modules and functions should
be open for extension but closed for modification.
Dependency Inversion Principle
• High-level modules should not depend on low-level modules.
Both should depend on abstractions.
• Abstractions should not depend on details. Details should depend on
abstractions.
Design Principles Contd..
Single Responsibility Principle
A class should have only one reason to change.
Liskov Substitution Principle
Derived types must be completely substitutable for their base types.
Technology Stack
The architecture followed is a two-tier architecture.
Front-end: Java
Back-end: Files
Development Hardware
Processor: Intel(R) Core™ 2 Duo CPU T6400 @ 2.00 GHz
Memory (RAM): 4 GB
System type: 32-bit Operating System
Tools used
CoreNLP: Stanford package for Natural Language Processing (NLP)
ConExp: open-source tool for creating formal concept lattices
Approach and Methodology
• The software prototyping (incremental prototyping) methodology is used for
development.
• The final product is built as separate prototypes.
• At the end, the separate prototypes are merged into an overall design.
• Steps are:
a) Identification of basic requirements
b) Development of the initial prototype
c) Review of the prototype
d) Revision and enhancement of the prototype
Overall Architecture
Modules covered
1. Part-of-Speech (POS) Tagging using a Maximum Entropy based tagger algorithm.
2. Lemmatization to reduce the relevant terms extracted by POS Tagging to their
lemma forms.
3. Named Entity Recognition (NER) using Conditional Random Fields (CRF) with
Gibbs sampling for entity identification and extraction.
4. Parsing to determine the grammatical structure w.r.t. Formal Parsed Grammar
using a factored model.
Modules covered contd….
5. Coreference Resolution using tiers of deterministic models to determine the
relative importance of different terms.
6. Querying and Manipulation of natural language text.
7. Formal Concept Analysis to derive the relationships between the attributes
and the objects, and also among the attributes.
8. Conversion of the formal concept lattice to XML for extraction of the
knowledge representation.
Input Sources
Software engineering documents that are part of the MIL-STD-498 Software
Development Standard are used as input, consisting of:
• Computer Operation Manual (COM)
• Computer Programming Manual (CPM)
• Database Design Description (DBDD)
• Firmware Support Manual (FSM)
• Interface Design Description (IDD)
• Interface Requirements Specification (IRS)
• Operational Concept Description (OCD)
• Software Center Operator Manual (SCOM)
• Software Design Description (SDD)
• Software User Manual (SUM)
• Software Version Description (SVD)
Input Sources Contd..
• Software Development Plan (SDP)
• Software Input/Output Manual (SIOM)
• Software Installation Plan (SIP)
• Software Product Specification (SPS)
• Software Requirements Specification (SRS)
• System/Subsystem Design Description (SSDD)
• System/Subsystem Specification (SSS)
• Software Test Description (STD)
• Software Test Plan (STP)
• Software Test Report (STR)
• Software Transition Plan (STrP)
Algorithm
Step 1: Tagger1 = POS_Tagging_Function(SRS)
        Tagger2 = POS_Tagging_Function(SDD)
        Tagger3 = POS_Tagging_Function(STD)
Step 2: Lemma_Form1 = Lemma_construction(Tagger1)
        Lemma_Form2 = Lemma_construction(Tagger2)
        Lemma_Form3 = Lemma_construction(Tagger3)
Step 3: NER1 = CRF_Gibbs_Function(Lemma_Form1)
        NER2 = CRF_Gibbs_Function(Lemma_Form2)
        NER3 = CRF_Gibbs_Function(Lemma_Form3)
Step 4: Parse1 = Parser(NER1)
        Parse2 = Parser(NER2)
        Parse3 = Parser(NER3)
Algorithm Contd..
Step 5: CoRef1 = Coreference_Resolution(Parse1)
        CoRef2 = Coreference_Resolution(Parse2)
        CoRef3 = Coreference_Resolution(Parse3)
Step 6: TREE_NODE = Query_Manipulation_function(CoRef1, CoRef2, CoRef3)
Step 7: Concept_Lattice = FCA(context, concept, TREE_NODE)
Step 8: XML_DOC = XML_Convert(Concept_Lattice)
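The data flow of Steps 1 to 6 can be sketched as a chain of stages applied to
each document independently, followed by a single merge. This is only an
illustration of the control flow: the stage functions here are hypothetical
placeholders that record their own name, not the CoreNLP annotators used in
the project, and run_per_document is a name introduced for this sketch.

```python
from functools import reduce
from typing import Callable, Dict, List

Stage = Callable[[str], str]


def run_per_document(documents: Dict[str, str], stages: List[Stage]) -> Dict[str, str]:
    """Apply the same ordered stages (Steps 1-5) to every document separately."""
    return {
        name: reduce(lambda data, stage: stage(data), stages, text)
        for name, text in documents.items()
    }


def placeholder(label: str) -> Stage:
    """Hypothetical stand-in for a real NLP stage: it only wraps its input."""
    return lambda data: f"{label}({data})"


# Steps 1-5 in order; each document flows through its own chain.
STAGES = [placeholder(n) for n in ("POS", "Lemma", "NER", "Parse", "CoRef")]
annotated = run_per_document({"SRS": "SRS", "SDD": "SDD", "STD": "STD"}, STAGES)

# Step 6 is the only point where the three results are combined.
tree_node = placeholder("Query")(", ".join(annotated[d] for d in ("SRS", "SDD", "STD")))
print(annotated["SDD"])
```

Keeping each document's chain separate until Step 6 makes it harder to
accidentally feed one document's intermediate result into another document's
later stage.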
Implementation Steps
The algorithm is mapped to the following series of steps:
• Collection of existing software documents
a) Software Requirements Specification (SRS)
This document contains a set of use cases that describe system-user
interaction, along with non-functional requirements such as design
constraints and quality standards.
b) Software Design Description (SDD)
The SDD shows how the software system will be structured, representing the
software components, interfaces and data necessary for the implementation
phase.
c) Software Test Description (STD)
It specifies the form of a set of documents for use in different stages of
software testing.
Implementation Steps contd…
• Extraction of relevant knowledge from the SRS, SDD and STD using the
following sequence of natural language processing steps:
  • POS Tagging
  • Lemmatization
  • Named Entity Recognition
  • Syntactic Parsing
  • Coreference Resolution
Input: SRS, SDD, STD
Output: Annotated Text Corpora (Annotated SRS, Annotated SDD, Annotated STD)
Implementation Steps contd…
Querying and Manipulation of annotated text corpora and conversion to tree
data structures
• This step uses query manipulation tools to extract the relevant knowledge
from the annotated text corpora.
• The verb subject, verb object and PP complement pairs are extracted, and the
syntactic dependencies between verb subject, verb and verb object, and between
verb and PP complement, are exploited to derive a meaningful hierarchical
relationship.
Input: Annotated SRS, SDD, STD
Output: Tree Data Structure Representation
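As a rough illustration of this step, the sketch below groups dependency
triples under their governing verb to form a small tree. The triple format
(verb, relation, dependent), the relation names and the sample sentence are
assumptions made for the example; the slides do not specify the query tools'
actual output format.

```python
from collections import defaultdict


def build_verb_tree(triples):
    """Group (verb, relation, dependent) triples into verb -> relation -> dependents."""
    tree = defaultdict(lambda: defaultdict(list))
    for verb, relation, dependent in triples:
        # Skip repeated mentions so each dependent appears once per relation.
        if dependent not in tree[verb][relation]:
            tree[verb][relation].append(dependent)
    # Convert to plain dicts so the result is easy to print or serialize.
    return {verb: dict(children) for verb, children in tree.items()}


# Hypothetical triples for "The user submits the form to the server."
triples = [
    ("submit", "subject", "user"),
    ("submit", "object", "form"),
    ("submit", "pp_complement", "to server"),
    ("submit", "object", "form"),  # duplicate from a second mention
]
verb_tree = build_verb_tree(triples)
print(verb_tree)
```

Each verb becomes a root node whose children are its subject, object and PP
complement, which is the hierarchical relationship the next step consumes.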
Implementation Steps contd…
Formation of Concept Lattice using Formal Concept Analysis
• The hierarchical information and syntactic dependencies obtained by NLP give
a relationship in which the verbs act as the set of objects, and the
verb-subject, verb-object and verb-PP complement act as the set of attributes.
• This relationship is written in the form of a matrix and given as input to
ConExp, which transforms the matrix into a concept lattice.
Input: Tree Data Structure Representation
Output: Formal Concept Lattice
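To make the object/attribute matrix concrete, the sketch below enumerates the
formal concepts of a tiny context by closing the objects' attribute sets under
intersection. It is a naive illustration and does not reproduce ConExp's
algorithm or file format; the verbs and the "subj:"/"obj:" attribute labels
are invented for the example.

```python
def formal_concepts(context):
    """Return all (extent, intent) pairs of a context {object: set_of_attributes}."""
    all_attributes = frozenset().union(*context.values())
    # Every intent is an intersection of object attribute sets; the empty
    # intersection is the full attribute set, so start from it.
    intents = {all_attributes}
    for attrs in context.values():
        intents |= {intent & attrs for intent in intents}
    concepts = []
    for intent in intents:
        # The extent is every object that has all attributes of the intent.
        extent = frozenset(o for o, attrs in context.items() if intent <= attrs)
        concepts.append((extent, intent))
    # Largest extents first, so the top of the lattice comes first.
    return sorted(concepts, key=lambda c: (-len(c[0]), sorted(c[1])))


# Hypothetical context: verbs as objects, their dependents as attributes.
context = {
    "submit": {"subj:user", "obj:form"},
    "validate": {"subj:system", "obj:form"},
    "store": {"subj:system", "obj:record"},
}
concepts = formal_concepts(context)
for extent, intent in concepts:
    print(sorted(extent), sorted(intent))
```

In this example the first concept holds all three verbs with no shared
attribute, and the last holds no verb with every attribute, matching the top
and bottom elements of a concept lattice.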
Formal Concept Lattice
• The topmost element represents objects that have no attributes.
• The bottommost element represents objects that have all attributes.
• Nodes in blue indicate objects.
• Nodes in orange indicate attributes.
Implementation Steps contd…
Conversion of formal concept lattice to XML
• The set of all attributes and their values is extracted for each object.
• This provides an intermediate representation of the concept hierarchy before
it is transformed into a knowledge representation.
Input: Formal Concept Lattice
Output: XML Format
Implementation Steps contd…
Pseudocode for Conversion of formal concept lattice to XML
• Let n be the total number of objects and m be the total number of attributes
For j = 1 to n
    For k = 1 to m
        For each object Ij and attribute Ak that is an attribute of Ij,
        form the XML element with head = Ij and list of attributes Ak
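The nested loop above could be realized along the following lines with
Python's standard xml.etree.ElementTree module. The element names (concepts,
object, attribute) and the choice to store the object name in a name
attribute rather than as the tag are assumptions of this sketch; the slides
do not define the XML schema.

```python
import xml.etree.ElementTree as ET


def lattice_to_xml(objects, attributes, incidence):
    """Emit one XML element per object, listing the attributes it has.

    incidence is a set of (object, attribute) pairs, i.e. the crosses of
    the object/attribute matrix.
    """
    root = ET.Element("concepts")
    for obj in objects:                      # For j = 1 to n
        element = ET.SubElement(root, "object", name=obj)
        for attr in attributes:              # For k = 1 to m
            if (obj, attr) in incidence:     # Ak is an attribute of Ij
                ET.SubElement(element, "attribute").text = attr
    return ET.tostring(root, encoding="unicode")


# Hypothetical one-object example.
xml_doc = lattice_to_xml(
    objects=["submit"],
    attributes=["subj:user", "obj:form"],
    incidence={("submit", "subj:user"), ("submit", "obj:form")},
)
print(xml_doc)
```

Storing the object name in an attribute avoids producing invalid XML when an
object name contains spaces or other characters not allowed in tag names.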
Conclusion
• Software documentation practices vary among different organizations.
• 53% of organizations deliver consistent software to the maintenance phase.
• 16% update their documentation at all levels.
• 53% of organizations have user manuals consistent with the system state.
• 42% revise and modify regression test case repositories.
• 11% achieve full traceability among system documents, and only 5% have
achieved traceability of change.
On average, software cost savings of 10-15% are expected, depending on the
size and complexity of the software documentation.
Thank You