SEMANTIC INFORMATION SYSTEM BLUEPRINT

A BUSINESS PROCESS STUDY FOR IMPLEMENTING SEMANTIC INFORMATION SYSTEM AND A ROADMAP FOR ACTION

Copyright Veda Semantics August 2013

This document is the output of a detailed study carried out by Veda Semantics during the period MMM-YYYY. During this period, various stakeholders were contacted through structured interactions. The information thus collected was analyzed by Veda Semantics, and this report presents the results of that analysis.

Cover Sheet

Sl. # | Heading | Description
1. | Company | ABC Company
2. | Industry | Banking Industry
3. | Size | Revenue $ NN Million Year 2012, Projection $ NN Million Year 2013; People
4. | Primary Location | City, State, USA
5. | Business |
6. | SIS Impact | <Summary of impact> Revenue Cost Potential Data Points Terms Frequency
7. | Unstructured Data Sources | <List of data sources>

Disclaimer

This report should be used for informational purposes only. Vendor and product selections should be made based on multiple information sources, face-to-face meetings, customer reference checking, product demonstrations and proof of concept applications. The statistical information contained in this Report is a summary of the opinions expressed by individuals who responded to the questionnaire, and does not represent a scientific sampling of any kind. Veda Semantics shall not be liable for the content of this Report, the study results, or for any damages incurred or alleged to be incurred by any of the companies included in the Report as a result of its content. Reproduction and distribution of this publication in any form without prior written permission is forbidden.
Table of Contents

1 PREAMBLE
2 THE THREE STAGES REQUIRED FOR DERIVING MEANING FROM DATA
3 THE CONCEPT OF SEMANTIC INFORMATION SYSTEM (“SIS”)
4 ABOUT THIS STUDY
5 METHODOLOGY ADOPTED FOR THIS STUDY
6 ABOUT THE BUSINESS – BUSINESS NEEDS
7 UNSTRUCTURED DATA SOURCES
8 CURRENT TECHNOLOGY USAGE
9 DATA VIEWS REQUIRED – ONTOLOGIES, NLP AND LINKAGES
10 INFRASTRUCTURE NEEDS
11 DIAGRAMMATIC REPRESENTATION OF OVERALL SYSTEM
12 SAMPLE SIS DASHBOARD
13 RETURN ON INVESTMENT
14 PROPOSED PROJECT PLAN
15 CHALLENGES AND LIMITATIONS
16 CHANGES REQUIRED TO ORGANIZATION STRUCTURE

1 PREAMBLE

The early focus of information technology was to record information in a structured way using a software application. The primary reasons for this were twofold:

- Technology was not yet able to analyse unstructured information at speed and scale
- Perhaps as a consequence of the above, structured data was considered the only relevant data for decision making

However, today’s technologies are different. Structured data analysis has become deep and complex, and has been supplemented by BI tools, user dashboards and easy visualizations. Consequently, users are finding it easier to answer the ‘What’ question with near-real-time data, without a heavy dependence on the IT team.

On the other hand, today’s business realities are also different. No longer is it sufficient to answer the ‘What’ question related to the data of the business.
To gain competitive advantage and make the business agile, it is equally important to:

- Know the ‘What’ question for:
  o competitors (eg what markets are they developing, what are their new products and services, what is their promotion and pricing strategy)
  o regulators (what regulation is being considered, what are its implications, what action is being taken by industry bodies, etc)
  o customers (what is being said about their needs, what new products have they liked, etc)
  o suppliers (what are their limitations, what alternatives are available, etc)
  o many more stakeholders
- Know the ‘Why’ question for various aspects within the business (eg why have sales dropped, why are employees demotivated, why has the project failed, etc) so that this information can also feed into predictive models
- Be able to answer the ‘What if’ questions easily, eg which departments will be impacted by a process change, what related processes will change, and what second order derivative processes will be impacted.

Data to answer these kinds of questions is usually unstructured, and estimates suggest that this information could be 75% or more of the total information available for decision making in a company. However, most current technologies (eg SQL databases) are not geared towards answering these questions. Hence, newer techniques are required to analyze the information, both in terms of how to answer the questions, and how to efficiently retrieve the answers.

Using a structured semantic framework and other Big Data techniques, one can make use of the unstructured data and converge the relevant information into existing core business solutions or workflows, making them richer and more meaningful.

Semantic technology (based on Natural Language Processing, mathematics / graph theory / statistics) is complex and not easy for a non-technical person to grasp.
It may also be confused with Text Analysis, an older technique that has been employed by various companies for basic applications and knowledge management. Despite these potential concerns, this whitepaper attempts to simplify and describe how unstructured data is processed to derive meaningful information that can be used.

2 THE THREE STAGES REQUIRED FOR DERIVING MEANING FROM DATA

For data to become information, it must first be collected efficiently and comprehensively; it must then be organized and processed; and finally, it must be presented to the user or find its way into another system, eg a BI system. A gap in any area could lead to information being incomplete or erroneous. The following paragraphs talk about Veda’s expertise in these areas:

Stage 1: Collection of Data

Collection of raw data is the first stage, in which various sources (such as social media sites, news sites, internal applications, emails and blogs) are crawled to check whether any relevant direct or implied entity mentions exist.

Once the relevant sources are crawled, data is extracted using various techniques. Examples include Natural Language Processing techniques like Entity Recognition, which automatically identifies the types of entities mentioned in a document, or a Sentiment engine, which pulls out sentiments from the text, including an ability to attach the sentiment to a specific attribute of the entity being talked about. Further, data can also be extracted using classification algorithms that identify themes and concepts mentioned in the text.
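The extraction step described above can be illustrated with a deliberately small sketch. It is not Veda's implementation: it stands in for Entity Recognition with a plain lookup list and for the Sentiment engine with a word-score lexicon, and it attaches sentiment at sentence level rather than to a specific attribute of the entity. The entity names, entity types and lexicon entries are all hypothetical and exist only for illustration.

```python
import re
from dataclasses import dataclass

# Hypothetical lookup list: entity name -> entity type (illustrative only).
GAZETTEER = {
    "acme bank": "COMPETITOR",
    "home loan": "PRODUCT",
    "central regulator": "REGULATOR",
}

# Hypothetical sentiment lexicon: word -> polarity score.
LEXICON = {"good": 1, "great": 1, "fast": 1, "poor": -1, "slow": -1, "expensive": -1}


@dataclass
class Mention:
    entity: str
    entity_type: str
    sentence: str
    sentiment: int


def split_sentences(text):
    """Naive sentence splitter: break after ., ! or ? followed by whitespace."""
    return [s.strip() for s in re.split(r"(?<=[.!?])\s+", text) if s.strip()]


def extract_mentions(text):
    """Return one Mention per (sentence, known entity) pair found in the text."""
    mentions = []
    for sentence in split_sentences(text):
        lowered = sentence.lower()
        words = re.findall(r"[a-z']+", lowered)
        # Sentence-level score: sum of the lexicon scores of its words.
        score = sum(LEXICON.get(w, 0) for w in words)
        for name, entity_type in GAZETTEER.items():
            if name in lowered:
                mentions.append(Mention(name, entity_type, sentence, score))
    return mentions


doc = (
    "Acme Bank launched a new home loan. "
    "Customers say the home loan approval is slow and expensive."
)
mentions = extract_mentions(doc)
for m in mentions:
    print(m.entity_type, m.entity, m.sentiment)
```

A production pipeline would replace the lookup list with a trained entity recognizer and the lexicon with a proper sentiment model; the sketch only shows the shape of the output, namely entity mentions tagged with a type, the source sentence and a sentiment value.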
Veda Semantics has developed many tools using its strong R&D team to help:

- crawl various sources of data external to the organization (eg news, social media, blogs, etc)
- crawl semi-structured / unstructured information sources within the organization (eg logs, emails, docs, PDFs, etc)
- extract meaningful data from the documents (eg relevant sentences, metatagging information, words / phrases of concern, sentiment being expressed, etc)
- store data in repositories in standard formats

The expertise is compatible with many platforms that are commonly used for data storage. With the help of these tools, an organization-wide layer of information extraction capability can be created, operating in near real time and with minimal or no manual effort.

Stage 2: Organization of Data

Having extracted various mentions of data about multiple items, the data has to be organized in a smart and meaningful manner. This is done using Ontologies. Ontologies are effectively a multifaceted lens through which a business views its data. Unlike a SQL technique, which allows for only structured (and therefore limited) relationships, ontologies can help create multiple relationships, at times crisscrossing each other.

For example, competitors can be mapped under the relationship ‘competitors’, and their products can be mapped both to them and to the business’ own competing products. Consequently, a reference to a competitor product would be captured and stored in relation to both the competitor and the business’ product it competes with. This also means that this information can be accessed meaningfully if a query is made either for competitors (say at a corporate level), or for information related to competitive products (say at a BU level).
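The competitor example above can be sketched as a small store of (subject, relation, object) triples. This is an illustrative sketch only, not Veda's tooling: the class `TinyOntology`, the relation names (`has_competitor`, `offers`, `competes_with`, `mentions`) and the node names are all assumptions made for the example.

```python
from collections import defaultdict


class TinyOntology:
    """A minimal in-memory store of (subject, relation, object) triples."""

    def __init__(self):
        self._out = defaultdict(set)  # subject -> {(relation, object)}
        self._in = defaultdict(set)   # object  -> {(relation, subject)}

    def add(self, subject, relation, obj):
        self._out[subject].add((relation, obj))
        self._in[obj].add((relation, subject))

    def objects(self, subject, relation):
        """All objects linked from `subject` by `relation`."""
        return sorted(o for r, o in self._out[subject] if r == relation)

    def subjects(self, relation, obj):
        """All subjects linked to `obj` by `relation`."""
        return sorted(s for r, s in self._in[obj] if r == relation)


# Hypothetical business, competitor and products.
onto = TinyOntology()
onto.add("OurBank", "has_competitor", "RivalBank")
onto.add("OurBank", "offers", "OurHomeLoan")
onto.add("RivalBank", "offers", "RivalHomeLoan")
onto.add("RivalHomeLoan", "competes_with", "OurHomeLoan")
# A crawled document that refers to the competitor's product.
onto.add("news-item-17", "mentions", "RivalHomeLoan")


def related_mentions(onto, node):
    """Documents mentioning the node, anything it offers, or anything competing with it."""
    targets = {node}
    targets.update(onto.objects(node, "offers"))
    targets.update(onto.subjects("competes_with", node))
    return sorted({doc for t in targets for doc in onto.subjects("mentions", t)})


corporate_view = related_mentions(onto, "RivalBank")   # query by competitor
bu_view = related_mentions(onto, "OurHomeLoan")        # query by own product
print(corporate_view, bu_view)
```

The single stored mention is reachable from both the competitor and the business' own product, which is the crisscrossing relationship the paragraph describes; a fixed relational schema would typically need a separate join path designed in advance for each of these views.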
Veda technology allows a user to:

- quickly build an ontology
- see the ontology concepts in graph or other forms
- edit and query it
- provide auto classification

In other words, various technology tools developed by Veda help users create a connected and classified picture from the various data bits collected in the collection stage.

Stage 3: Presentation of Data

The data that has been collected and organized can be queried and presented to the user at any time. Report formats can be created / configured at user discretion. Output in standard formats like OWL and RDF ensures that the results can be fed into other BI systems as well.

The tools under each stage can be used across industries and various types of applications, and therefore together allow for the creation of a system for analysis of unstructured information in the same manner in which ERP systems store and analyse structured data. This concept is called the Semantic Information System.

3 THE CONCEPT OF SEMANTIC INFORMATION SYSTEM (“SIS”)

The overall concept of an SIS is simple: the right information to the right person at the right time. Of course, the information being talked about is unstructured data, flowing into the organization at high velocity, generated internally in high volumes, and available in various types, which would otherwise go completely unanalysed. An SIS helps build a robust, made-for-purpose platform that can automate the information retrieval process.

An SIS is in effect “a one stop solution for all unstructured data relevant for an organization”.

An SIS is most powerful when it covers the full organization, but it can also be set up for parts of the organization, at some loss of power. However, a limited application allows for the benefits of the SIS to be understood in a live situation, and can help establish a clear RoI metric.
As more areas of the organization are brought under the SIS, the power increases as connections between hitherto seemingly unconnected elements can be made and analysed.

An SIS has various benefits:

- Real-time aggregation, analysis and retrieval of data.
- Time and cost reduction, due to automation and domain ontologies.
- Risk mitigation, due to automation and timely information.
- A comprehensive outlook for decision making.
- An organized central repository of information, allowing easy retrieval.

4 ABOUT THIS STUDY

This study was conducted by Veda Semantics for <ABC> as part of an initial consulting project to help assess the feasibility of an SIS, identify critical areas that could yield near term results, assess costs and time to completion, and establish RoI benchmarks.

This report has been prepared after ___ days of effort, including interviews, data analysis, discussions and questionnaires. We thank the various personnel of <ABC> who have contributed to making this study possible.

5 METHODOLOGY ADOPTED FOR THIS STUDY

Information for this study was gathered through the following specific approaches:

1. Initial questionnaire: An initial questionnaire was completed by <ABC>, allowing Veda Semantics to form a focused approach for further investigation.
2. Preparatory work: Veda Semantics business analysts and <ABC> formed a clear two-week action plan to gather more information from various personnel, leading to this report. In parallel, <ABC> formed a Steering and Execution committee to help Veda Semantics prepare this report.
3. Interviews: Interviews were conducted with selected stakeholders and important users to understand the goals and expectations. These were either face to face or long distance conversations / screen sharing sessions. Standard questionnaires were prepared and used in most interviews.
This helped understand information needs and also the pain areas in greater detail.
4. Document gathering: This helped Veda Semantics study various documents relating to the technology systems being used by <ABC>, which would have a bearing on the SIS. This was further supplemented by reviews with the IT team.
5. Weekly milestone meetings: These meetings with the Steering Committee formed for this exercise helped make corrections to the course of action.
6. Workshop: A final workshop with various stakeholders was used to fine tune this report and recommendations, clearly laying out areas of impact, costs, timelines, challenges and approach.

6 ABOUT THE BUSINESS – BUSINESS NEEDS

During our discussion with various business users, the following critical business needs were identified, which current structured data techniques are not able to meet fully. These needs have been tabulated below:

Kind of information | Why needed | How currently done | Criticality | Possible sources of information | Department

This ‘needs matrix’ helped us shortlist some areas that are critical for business users and could add significant impact to the business. They also helped us gain a broad sense of data sources to look at in more detail (highlighted below).

7 UNSTRUCTURED DATA SOURCES

Usually, unstructured data sources can be:

- Both within and outside the organization
- Access controlled or freely available

In the instant case, we identified the following data sources critical for <ABC>.
Data source | Structured / unstructured / semi structured | Single repository or distributed | Internal / external | Free / Access controlled | Format of documents (eg Word, PDF, txt) | Estimated quantum (MB, number) | Why relevant (mapped to above table)

Based on a preliminary analysis, we find that:

- Data essential for meeting information needs for Business challenge <x> is spread across multiple internal data sources
- Data essential for Department <q> is mostly external, while that needed by Department <s> usually lies within other departments of <ABC>.
- Data sources for solving Problem <F> are a mix of free and controlled sources. A discussion with the relevant individuals revealed that it is possible to provide ‘permitted access’ after a manual review in some cases. In other cases, the query must be routed to a designated officer who would use the unstructured data repository to provide relevant answers, without disclosing the contents of full documents where access is limited.

We believe that Veda tools could be used to crawl all the information sources, and extract metadata information from them. However, Veda tools may not be able to extract the following kinds of information:

- Table or graph information from PDF files
- Information from Excel files, unless they are in specified formats
- Graphs etc from PPTs
- Information from titles, headers and footers, which in some cases could be mixed with the main text

8 CURRENT TECHNOLOGY USAGE

We find that <ABC> currently uses the following technology stack, which affects the information stored in the unstructured data repositories we identified, as well as the processing and presentation of data to users.
Data source | Data presentation | Technology used | Relevant details | Integration challenges if any

Based on the above, it appears that integrating results provided by SIS could benefit from <the existing visualization dashboard that <ABC> already owns, with only limited modifications>. It also appears that the SIS architecture must be compliant with <DEF> standards, which the organization is moving to and has planned on its roadmap by 2015. The implications of this for the SIS are: <TBD>

9 DATA VIEWS REQUIRED – ONTOLOGIES, NLP AND LINKAGES

For the critical issues highlighted by departments, and for which data sources have been identified, we have done a preliminary assessment to determine what kind of information, once extracted and organized, would be useful to solve the business challenge at hand, as well as help in addressing needs that may come up going forward.

Based on this analysis, we believe that the following kinds of analyses will be primarily needed for specific purposes. This does not of course mean that the SIS platform will not allow other kinds of analyses to be conducted. It simply reflects our best recommendation of the kind of analysis that may help arrive quickly at an answer for the specific problem.

Kind of information | Available in (sources) | What is required to be extracted | Primary semantic analyses required | Remarks

It must also be noted that an ontology will be required to answer the questions of the type <W> mentioned above.
While there are various ways in which ontologies can be drawn, highlighting various aspects of the organization, we recommend that at least the following ontologies with the following parameters be drawn:

Ontology for | Required because | Primary aspect | Estimated timeframe | Challenges if any

As an example, the ontology in Point 1 above will ensure that <TVD> is mapped not only to <DER>, but also to <WES>. This means that it may be possible to query <WES> for questions on <TVD>, which has not been possible so far.

A diagrammatic representation of a basic ontology is provided below:

10 INFRASTRUCTURE NEEDS

Based on our estimation, the following infrastructure will be required for implementation of SIS Phase 1, to address the critical business challenges mentioned above.

Infrastructure | Required for | Estimated cost | Estimated specs

We understand that of the above, the organization already has spare capacity in the case of <B>, and hence incremental investment need not be made for it at this stage.

11 DIAGRAMMATIC REPRESENTATION OF OVERALL SYSTEM

We have provided below a diagrammatic representation of the overall architecture and information flow:

12 SAMPLE SIS DASHBOARD

Based on the needs analysis, we have provided below a sample SIS dashboard that users may be able to see after completion of Phase 1 of the SIS project.

13 RETURN ON INVESTMENT

Based on each customer and usage context of the Semantic Information System, a broad Return on Investment analysis has been made to justify the costs against the benefits.
For this purpose, the cost of the SIS has been allocated based on criticality of use case, and the estimated benefits have been ascertained based on interviews with business users. Where such estimation was not possible, an estimated benefit forecast has been prepared based on assumptions that have been attached to this report.

Sequence | Usecase Title | Department | Impact / RoI | Hardware costs ($) | Service costs ($) | License cost ($) | Total cost ($)
1. |
2. |
3. |
4. |
5. |
6. |
7. |
8. |

As a result of this analysis, we believe that a potential map that shows RoI for different use cases can be prepared, as follows.

[Figure: Recommended Sequence of Execution. Usecases 1 to 6 plotted by Effort & Time against Impact / RoI.]

Based on this, it appears that Usecases 1, 2 and 3 could be good candidates for early adoption of an SIS approach, which could then extend to Usecases 4, 5 and 6 as well. It may be noted that under this approach, the cost allocations for the Usecases may change, since Phase 1 may absorb various one-time costs, automatically reducing the costs for Phase 2.

14 PROPOSED PROJECT PLAN

If <ABC> decides to implement the SIS for Usecases 1-3 or 1-6, the following Project Plan provides an estimate of the time and activities required for completion. Please note that this will be heavily influenced by the availability and involvement of <ABC> personnel, and hence we request that a dedicated team be formed for this purpose, as was done for the report preparation phase. We will be happy to work with you on this.

15 CHALLENGES AND LIMITATIONS

We would also like to highlight some of the challenges that we see upfront in the implementation and usage exercise. These are based on the study of the technology, kinds of documents and sources of documents that are being used by <ABC>, coupled with the business problems that <ABC> is seeking to solve.
Critical limitations of the SIS will be:

Limitation | Reason | What will <ABC> need to do | Example of what can / cannot be expected

16 CHANGES REQUIRED TO ORGANIZATION STRUCTURE

Based on our discussions, data requests currently flow from BU users to the IT team, which then processes them within given SLAs. In the case of the <D> and <R> divisions, some BU users access the BI system and pull in data directly.

For the SIS concept to be successful, we recommend the following small changes in this system:

- A specialized team should be set up within IT that is exclusively geared to draw insights from unstructured data
- At least one BU resource from each impacted BU should work closely with this team for the initial month, highlighting crucial business drivers for the BU
- Apart from configuring the SIS to send out auto updates based on BU needs, a monthly ‘mock exercise’ must be conducted between IT and BUs to see if critical insights are being captured or missed from the unstructured data. The SIS data feeds can be changed as needed based on such inputs, or the kinds of analysis being run could be altered.
- Every 3 months, a log of issues should be raised to the vendor, and the BU, IT and vendor teams could discuss possible ways to capture more information as needed. With time, as the system and the processes around it stabilize, this will not be required.