Semantic Information System Blueprint

A BUSINESS PROCESS STUDY FOR IMPLEMENTING SEMANTIC INFORMATION SYSTEM AND A ROADMAP FOR ACTION
Copyright Veda Semantics August 2013
This document is the output of a detailed study carried out by Veda Semantics during the period MMM-YYYY. During this
period, various stakeholders were contacted through structured interactions. The information thus collected was analyzed by
Veda Semantics, and the results are presented in the following report.
Cover Sheet
Sl. # | Heading | Description
1. | Company | ABC Company
2. | Industry | Banking Industry
3. | Size | Revenue: $ NN Million Year 2012, Projection $ NN Million Year 2013; People:
4. | Primary Location | City, State, USA
5. | Business |
6. | SIS Impact | <Summary of impact>; Revenue; Cost; Potential Data Points (Terms, Frequency)
7. | Unstructured Data Sources | <List of data sources>
Disclaimer
This report should be used for informational purposes only. Vendor and product selections should be made based on
multiple information sources, face-to-face meetings, customer reference checking, product demonstrations and proof of
concept applications.
The statistical information contained in this Report is a summary of the opinions expressed in the responses of individuals
who have responded to the questionnaire, and does not represent a scientific sampling of any kind.
Veda Semantics shall not be liable for the content of this Report, the study results, or for any damages incurred or
alleged to be incurred by any of the companies included in the Report as a result of its content.
Reproduction and distribution of this publication in any form without prior written permission is forbidden.
Table of Contents
1 PREAMBLE
2 THE THREE STAGES REQUIRED FOR DERIVING MEANING FROM DATA
3 THE CONCEPT OF SEMANTIC INFORMATION SYSTEM (“SIS”)
4 ABOUT THIS STUDY
5 METHODOLOGY ADOPTED FOR THIS STUDY
6 ABOUT THE BUSINESS – BUSINESS NEEDS
7 UNSTRUCTURED DATA SOURCES
8 CURRENT TECHNOLOGY USAGE
9 DATA VIEWS REQUIRED – ONTOLOGIES, NLP AND LINKAGES
10 INFRASTRUCTURE NEEDS
11 DIAGRAMMATIC REPRESENTATION OF OVERALL SYSTEM
12 SAMPLE SIS DASHBOARD
13 RETURN ON INVESTMENT
14 PROPOSED PROJECT PLAN
15 CHALLENGES AND LIMITATIONS
16 CHANGES REQUIRED TO ORGANIZATION STRUCTURE
1 PREAMBLE
The early focus of information technology was on recording information in a structured way using software applications. There were
two primary reasons for this:
• Technology was not yet capable of analysing unstructured information at speed and scale
• Perhaps as a consequence of the above, structured data was considered the only relevant data for decision making
However, today’s technologies are different. Structured data analysis has become deep and complex, and has been supplemented by BI
tools, user dashboards and easy visualizations. Consequently, users are finding it easier to answer the “What” question with near
real-time data, without a heavy dependence on the IT team.
On the other hand, today’s business realities are also different. It is no longer sufficient to answer the ‘What’ question about the
business’ own data. To gain competitive advantage and make the business agile, it is equally important to:
• Answer the ‘What’ question for:
o competitors (eg what markets are they developing, what are their new products and services, what is their
promotion and pricing strategy)
o regulators (eg what regulation is being considered, what are its implications, what action is being taken by
industry bodies, etc)
o customers (eg what is being said about their needs, what new products have they liked, etc)
o suppliers (eg what are their limitations, what alternatives are available, etc)
o many more stakeholders
• Answer the ‘Why’ question for various aspects within the business (eg why have sales dropped, why are employees demotivated,
why has the project failed, etc) so that this information can also feed into predictive models
• Answer the ‘What if’ questions easily, eg which departments will be impacted by a process change, what related
processes will change, and what second order derivative processes will be impacted.
Data to answer these kinds of questions is usually unstructured, and estimates suggest that this information could be 75% or more
of the total information available for decision making in a company. However, most current technologies (eg SQL databases) are not
geared towards answering these questions.
Hence, newer techniques are required to analyze the information, both in terms of how to answer the questions, and how to efficiently
retrieve the answers. Using a structured semantic framework and other Big Data techniques one can make use of the unstructured data
and converge the relevant information into existing core business solutions or workflows, making them richer and more meaningful.
Semantic technology (based on Natural Language Processing, mathematics, graph theory and statistics) is complex and not easy for a
non-technical person to grasp. It may also be confused with Text Analysis, an older technique that has been employed by various
companies for basic applications and knowledge management. Despite these potential concerns, this whitepaper attempts to describe in
simple terms how unstructured data is processed to derive meaningful information that can be used.
2 THE THREE STAGES REQUIRED FOR DERIVING MEANING FROM DATA
For data to become information, it must first be collected efficiently and comprehensively, then organized and processed, and
finally presented to the user or passed into another system, eg a BI system. A gap in any stage could lead to information being
incomplete or erroneous. The following paragraphs describe Veda’s expertise in these areas:
Stage 1: Collection of Data
Collection of raw data is the first stage. Data from various sources, such as social media sites, news sites, internal
applications, emails and blogs, is crawled to check whether any relevant direct or implied entity mentions exist.
Once the relevant sources are crawled, data is extracted using various techniques. Examples include Natural Language Processing
techniques such as Entity Recognition, which automatically identifies the types of entities mentioned in a document, and a sentiment
engine, which pulls out sentiments from the text and can attach each sentiment to a specific attribute of the entity being talked about.
Further, data can also be extracted using classification algorithms that identify themes and concepts mentioned in the text.
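The extraction step described above can be illustrated with a minimal sketch. It is not Veda's implementation: it assumes small hand-written lexicons (ENTITIES, ATTRIBUTES, SENTIMENT, all hypothetical) in place of trained NLP models, and attaches a sentiment score to an attribute when a sentiment word occurs within two tokens of it.

```python
import re

# Hypothetical lexicons; a production system would use trained NLP models.
ENTITIES = {"acme bank": "ORGANIZATION", "goldcard": "PRODUCT"}
ATTRIBUTES = {"fees", "service", "app"}
SENTIMENT = {"high": -1, "slow": -1, "poor": -1, "great": 1, "fast": 1, "friendly": 1}


def extract(text):
    """Return entity mentions and (attribute, score) pairs per relevant sentence."""
    results = []
    for sentence in re.split(r"(?<=[.!?])\s+", text.strip()):
        lowered = sentence.lower()
        words = re.findall(r"[a-z]+", lowered)
        entities = [(name, kind) for name, kind in ENTITIES.items() if name in lowered]
        if not entities:
            continue  # keep only sentences that mention a tracked entity
        scored = []
        for i, word in enumerate(words):
            if word in ATTRIBUTES:
                # Sum sentiment words found within two tokens of the attribute.
                window = words[max(0, i - 2): i + 3]
                scored.append((word, sum(SENTIMENT.get(w, 0) for w in window)))
        results.append({"sentence": sentence, "entities": entities, "attributes": scored})
    return results


sample = "Acme Bank has great service. The weather is nice. GoldCard fees are high."
mentions = extract(sample)
```

In this sketch the second sentence is dropped because it mentions no tracked entity, while the other two yield an entity together with a scored attribute. A fixed token window is a deliberately crude stand-in for the attribute-level sentiment attachment described in the text.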
Veda Semantics has used its strong R&D team to develop many tools that help:
• crawl various sources of data external to the organization (eg news, social media, blogs, etc)
• crawl semi-structured / unstructured information sources within the organization (eg logs, emails, docs, PDFs, etc)
• extract meaningful data from the documents (eg relevant sentences, metatagging information, words / phrases of concern,
sentiment being expressed, etc)
• store data in repositories in standard formats
These tools are compatible with many platforms that are commonly used for data storage. With their help, an organization-wide
layer of information extraction capability can be created that works in near real time and requires minimal or no manual effort.
Stage 2: Organization of Data
Once various mentions of data about multiple items have been extracted, the data has to be organized in a smart and meaningful manner.
This is done using ontologies. Ontologies are effectively a multifaceted lens through which a business views its data. Unlike a SQL-based
approach, which allows only for structured (and therefore limited) relationships, ontologies can help create multiple relationships, at
times crisscrossing each other. For example, competitors can be mapped under the relationship ‘competitors’, and their competing products
can be mapped both to them and to the business’ own competing products. Consequently, a reference to a competitor product would be
captured and stored in relation to both the competitor and the business’ product it competes with. This also means that this information
can be accessed meaningfully whether a query is made for competitors (say at a corporate level) or for information related to competitive
products (say at a BU level).
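The competitor example can be sketched as a small graph of (subject, relation, object) triples. This is an illustrative assumption about how such an ontology might be held in memory, not a description of Veda's tools; the class name Ontology and all entity and relation names are hypothetical.

```python
from collections import defaultdict


class Ontology:
    """A tiny in-memory graph of (subject, relation, object) triples."""

    def __init__(self):
        self.triples = set()
        self.by_node = defaultdict(set)

    def add(self, subject, relation, obj):
        triple = (subject, relation, obj)
        self.triples.add(triple)
        # Index the triple under both ends so either side can be queried.
        self.by_node[subject].add(triple)
        self.by_node[obj].add(triple)

    def related(self, node):
        """Return every triple in which the node takes part, sorted."""
        return sorted(self.by_node[node])


# All names below are hypothetical placeholders.
onto = Ontology()
onto.add("OurBank", "has_competitor", "RivalBank")
onto.add("RivalBank", "offers", "RivalCard")
onto.add("OurBank", "offers", "OurCard")
onto.add("RivalCard", "competes_with", "OurCard")

corporate_view = onto.related("RivalBank")  # query at a corporate level
bu_view = onto.related("OurCard")           # query at a BU level
```

Because the competitor product is linked both to its owner and to the product it competes with, it is reachable from either query, which is the crisscrossing relationship the paragraph describes.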
Veda technology allows a user to:
• quickly build an ontology
• see the ontology concepts in graph or other forms
• edit and query it
• provide auto classification
In other words, various technology tools developed by Veda help users create a connected and classified picture from the various data bits
collected in the Collection stage.
Stage 3: Presentation of Data
The data that has been collected and organized can be queried and presented to the user at any time. Report formats can be created /
configured at the user’s discretion. Output in standard formats like OWL and RDF ensures that the results can also be fed into other
BI systems.
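As an illustration of output in a standard format, the sketch below writes relationships as RDF in the N-Triples syntax, where each line holds a subject, predicate and object IRI followed by a full stop. The namespace http://example.org/sis/ and the entity names are hypothetical, and the sketch assumes names contain no spaces or characters that would need escaping in an IRI.

```python
BASE = "http://example.org/sis/"  # hypothetical namespace, for illustration only


def to_ntriples(triples, base=BASE):
    """Serialize (subject, predicate, object) name triples as N-Triples lines."""
    lines = []
    for subject, predicate, obj in sorted(triples):
        # Each term becomes an IRI in angle brackets; the line ends with " ."
        lines.append(f"<{base}{subject}> <{base}{predicate}> <{base}{obj}> .")
    return "\n".join(lines)


triples = [
    ("RivalCard", "competesWith", "OurCard"),
    ("OurBank", "hasCompetitor", "RivalBank"),
]
print(to_ntriples(triples))
```

A line-oriented format such as this can be loaded by any RDF-aware tool, which is what makes it a convenient hand-off point to other systems.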
The tools under each phase can be used across industries and various types of applications, and therefore, together allow for the creation
of a system for analysis of unstructured information in the same manner in which ERP systems store and analyse structured data. This
concept is called the Semantic Information System.
3 THE CONCEPT OF SEMANTIC INFORMATION SYSTEM (“SIS”)
The overall concept of an SIS is simple: the right information to the right person at the right time. Of course, the information being
talked about is unstructured data that flows into the organization at high velocity, is generated internally in high volumes, is available
in various types, and would otherwise go completely unanalysed. An SIS helps build a robust, made-for-purpose platform that can
automate the information retrieval process. An SIS is in effect “a one stop solution for all unstructured data relevant for an
organization”.
An SIS is most powerful when it covers the full organization, but it can also be set up for parts of the organization, at some loss of
power. A limited application nevertheless allows the benefits of the SIS to be understood in a live situation, and
can help establish a clear RoI metric. As more areas of the organization are brought under the SIS, the power increases as connections
between hitherto seemingly unconnected elements can be made and analysed.
An SIS has various benefits:
• Real-time aggregation, analysis and retrieval of data
• Time and cost reduction, due to automation and domain ontologies
• Risk mitigation, due to automation and timely information
• A comprehensive outlook for decision making
• An organized central repository of information that allows easy retrieval
4 ABOUT THIS STUDY
This study was conducted by Veda Semantics for <ABC> as part of an initial consulting project to help assess the feasibility of an SIS,
identify critical areas that could yield near term results, assess costs and time to completion, and establish RoI benchmarks. This report
has been prepared after ___ days of effort, including interviews, data analysis, discussions and questionnaires. We thank the personnel
of <ABC> who contributed to making this study possible.
5 METHODOLOGY ADOPTED FOR THIS STUDY
Information for this study was gathered through the following specific approaches:
1. Initial questionnaire: An initial questionnaire was completed by <ABC>, allowing for a focused approach for further investigation to be
formed by Veda Semantics.
2. Preparatory work: Veda Semantics business analysts and <ABC> formed a clear action plan for two weeks to gather more information
from various personnel leading to this report. In parallel, <ABC> formed a Steering and Execution committee to help Veda Semantics
prepare this report.
3. Interviews: Interviews were conducted with selected stakeholders and important users to understand their goals and expectations. These
were either face-to-face or long-distance conversations / screen sharing sessions. Standard questionnaires were prepared and used in
most interviews. This helped us understand information needs and pain areas in greater detail.
4. Document gathering: This helped Veda Semantics study various documents relating to the technology systems being used by <ABC>,
which would have a bearing on the SIS. This was further supplemented by reviews with the IT team.
5. Weekly milestone meetings: These meetings with the Steering Committee formed for this exercise helped make corrections to the
course of action.
6. Workshop: A final workshop with various stakeholders was used to fine tune this report and recommendations, clearly laying out areas
of impact, costs, timelines, challenges and approach.
6 ABOUT THE BUSINESS – BUSINESS NEEDS
During our discussion with various business users, the following critical business needs were identified, which current structured data techniques
are not able to meet fully. These needs have been tabulated below:
Kind of information | Why needed | How currently done | Criticality | Possible sources of information | Department
This ‘needs matrix’ helped us shortlist some areas that are critical for business users and could have a significant impact on the business. It
also helped us gain a broad sense of the data sources to look at in more detail (highlighted below).
7 UNSTRUCTURED DATA SOURCES
Usually, unstructured data sources can be:
• Both within and outside the organization
• Access controlled or freely available
In this case, we identified the following data sources as critical for <ABC>.
Data source | Structured / unstructured / semi structured | Single repository or distributed | Internal / external | Free / Access controlled | Format of documents (eg Word, PDF, txt) | Estimated quantum (MB, number) | Why relevant (mapped to above table)
Based on a preliminary analysis, we find that:
• Data essential for meeting information needs for Business challenge <x> is spread across multiple internal data sources
• Data essential for Department <q> is mostly external, while that needed by Department <s> usually lies within other departments of
<ABC>.
• Data sources for solving Problem <F> are a mix of free and controlled sources. A discussion with the relevant individuals revealed that
it is possible to provide ‘permitted access’ after a manual review in some cases. In other cases, the query must be routed to a
designated officer who would use the unstructured data repository to provide relevant answers, without disclosing the contents of full
documents where access is limited.
We believe that Veda tools could be used to crawl all the information sources and extract metadata from them. However, Veda
tools may not be able to extract information from the following kinds of content:
• Table or graph information from PDF files
• Information from Excel files unless they are in specified formats
• Graphs etc from PPTs
• Information from titles, headers and footers, which in some cases could be mixed with the main text
8 CURRENT TECHNOLOGY USAGE
We find that currently <ABC> uses the following technology stack that impacts the information stored in the unstructured data repositories we
identified, as well as in processing and presenting data to users.
Data source | Data presentation | Technology used | Relevant details | Integration challenges if any
Based on the above, it appears that integrating results provided by the SIS could benefit from <the visualization dashboard that
<ABC> already owns, with only limited modifications>. It also appears that the SIS architecture must be compliant with <DEF> standards,
which the organization plans to adopt by 2015 under its roadmap. The implications of this on SIS are: <TBD>
9 DATA VIEWS REQUIRED – ONTOLOGIES, NLP AND LINKAGES
For the critical issues highlighted by departments, and for which data sources have been identified, we have done a preliminary assessment to
determine what kind of information, once extracted and organized, would be useful to solve the business challenge at hand, as well as help in
addressing needs that may come up going forward.
Based on this analysis, we believe that the following kinds of analyses will be primarily needed for specific purposes. This does not of course
mean that the SIS platform will not allow other kinds of analyses to be conducted. It simply reflects our best recommendation of the kind of
analysis that may help arrive at an answer for the specific problem quickly.
Kind of information | Available in (sources) | What is required to be extracted | Primary semantic analyses required | Remarks
It must also be noted that an ontology will be required to answer the questions of the type <W> mentioned above. While there are various
ways in which ontologies can be drawn, highlighting various aspects of the organization, we recommend that at least the following ontologies
with the following parameters be drawn:
Ontology for | Required because | Primary aspect | Estimated timeframe | Challenges if any
As an example, the ontology in Point 1 above will ensure that <TVD> is mapped not only to <DER>, but also to <WES>. This means that it may
be possible to query <WES> for questions on <TVD>, which has hitherto not been possible. A diagrammatic representation of a basic ontology is
provided below:
10 INFRASTRUCTURE NEEDS
Based on our estimation, the following infrastructure will be required for implementation of SIS Phase 1, to address the
critical business challenges mentioned above.
Infrastructure | Required for | Estimated cost | Estimated Specs
We understand that of the above, the organization already has spare capacity in the case of <B>, and hence incremental investment need not
be made for it at this stage.
11 DIAGRAMMATIC REPRESENTATION OF OVERALL SYSTEM
We have provided below a diagrammatic representation of the overall architecture and information flow:
12 SAMPLE SIS DASHBOARD
Based on the needs analysis, we have provided below a sample SIS dashboard that users may be able to see after completion of Phase 1 of the
SIS project.
13 RETURN ON INVESTMENT
For each customer and usage context of the Semantic Information System, a broad Return on Investment
analysis has been made to weigh cost against benefit. For this purpose, the cost of the SIS has been
allocated based on the criticality of each use case, and the benefits have been estimated based on
interviews with business users. Where such estimation was not possible, a benefit forecast has
been prepared based on assumptions that are attached to this report.
Sequence | Usecase Title | Department | Impact / RoI | Hardware costs ($) | Service costs ($) | License cost ($) | Total cost ($)
1.
2.
3.
4.
5.
6.
7.
8.
As a result of this analysis, we believe that a potential map that shows RoI for different use cases can be
prepared, as follows. Based on this, it appears that Usecases 1, 2 and 3 could be good candidates for early
adoption of an SIS approach, which could then extend to Usecases 4, 5 and 6 as well. It may be noted that
under this approach, the cost allocations for the Usecases may change since Phase 1 may absorb various one
time costs, automatically reducing the costs for Phase 2.
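The sequencing logic can be sketched as follows. This is a simplified assumption, not the method used in the study: it sums the three cost columns of the table into a total, computes RoI as net benefit divided by total cost, and orders use cases by RoI alone, ignoring effort and time. The class UseCase, its field names and all figures are hypothetical placeholders.

```python
from dataclasses import dataclass


@dataclass
class UseCase:
    title: str
    impact: float         # estimated benefit ($), placeholder value
    hardware_cost: float
    service_cost: float
    license_cost: float

    @property
    def total_cost(self):
        return self.hardware_cost + self.service_cost + self.license_cost

    @property
    def roi(self):
        # Simple RoI: net benefit divided by total cost.
        return (self.impact - self.total_cost) / self.total_cost


def recommended_sequence(use_cases):
    """Order use cases from highest to lowest RoI."""
    return sorted(use_cases, key=lambda u: u.roi, reverse=True)


# Placeholder figures for illustration only; they are not drawn from the study.
cases = [
    UseCase("Usecase A", impact=300.0, hardware_cost=50.0, service_cost=40.0, license_cost=10.0),
    UseCase("Usecase B", impact=150.0, hardware_cost=50.0, service_cost=40.0, license_cost=10.0),
    UseCase("Usecase C", impact=500.0, hardware_cost=100.0, service_cost=80.0, license_cost=20.0),
]
order = [u.title for u in recommended_sequence(cases)]
```

If Phase 1 absorbs one-time costs, the cost fields of later use cases would be reduced before re-running the ranking, which may change the order.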
[Figure: Recommended Sequence of Execution. Axes: Effort & Time, Impact / RoI. Plotted points: use cases 1 to 6.]
14 PROPOSED PROJECT PLAN
If <ABC> decides to implement the SIS for Use cases 1-3 or 1-6, the following Project Plan provides an estimate
of the time and activities required for completion. Please note that this will be heavily influenced by the
availability and involvement of <ABC> personnel, and hence we request that a specific team be formed for this
purpose, as was done for the report preparation phase. We will be happy to work with you on this.
15 CHALLENGES AND LIMITATIONS
We would also like to highlight some of the challenges that we see upfront, in the implementation and usage
exercise. These have been based on the study of technology, kinds of documents and sources of documents
that are being used by <ABC>, coupled with the business problems that are being sought to be solved. Critical
limitations of the SIS system will be:
Limitation | Reason therefor | What will <ABC> need to do | Example of what can / cannot be expected
16 CHANGES REQUIRED TO ORGANIZATION STRUCTURE
Based on our discussion, the current data requests flow from BU users to the IT team, who then process them
within given SLAs. In the case of <D> and <R> divisions, some BU users access the BI system and pull in data
directly. For the SIS concept to be successful, we recommend the following small changes in this system:
• A specialized team be set up within IT that is exclusively geared to draw insights from unstructured
data
• At least one BU resource from each impacted BU works closely with this team for the initial month,
highlighting crucial business drivers for the BU
• Apart from configuring the SIS system to send out auto updates based on BU needs, a monthly ‘mock
exercise’ must be conducted between IT and BUs to see if critical insights are being captured or missed
from the unstructured data. The SIS data feeds can be changed as needed based on such inputs, or the
kinds of analysis being run could be altered.
• Every 3 months, a log of issues should be raised to the vendor, and BU, IT and the vendor teams could
discuss possible ways to capture more information as needed. With time, as the system and the
processes around it stabilize, this will not be required.