Connecting Knowledge Silos using Federated Text Mining
Guy Singh
Senior Manager, Product & Strategic Alliances
©2014 Linguamatics Ltd

Data Silos
• Structured, semi-structured or unstructured content
• Separate interfaces to access content
• Cannot query across the silos, or exchange content
[Diagram: Internal Content, External Content]

Possible Approaches
[Diagram: Connecting Silos, via Federated Text Mining, Data Warehousing, Linked Data (RDF), Workflow Integration]

Workflow Integration using Tools
• If each data source has an API, can link together using specific tools for each data source
• Can program particular workflows pulling information together from different data sources
• Advantages
  – Can perform complex data manipulation
  – Can exploit structure in data sources, or use I2E to transform the unstructured data
• Disadvantages
  – Workflows are fixed: can’t easily navigate and explore connections between data

Connecting via Linked Data
• Transform databases to RDF or provide a conversion layer
• Advantages
  – Standardizes data format
  – Can exploit structure in structured data sources
  – Can use I2E to transform unstructured data into RDF
  – Can reason with the RDF
• Disadvantages
  – Transformations are fixed
  – Have to predict what information you need from the unstructured text
    • typically pull out a small proportion of the original information

Using Data Warehouses
• Integrate the data together into a data warehouse
  – Extract, Transform and Load (ETL) each data source into a new database
• Advantages
  – Allows users to perform a single query across all the content
  – Can use I2E to pull information out of unstructured text
  – Can combine with human curation so the warehouse contains checked content
• Disadvantages
  – ETL can be a time-consuming and expensive process
  – Lose information
    • have to predict what information you need from the unstructured text
      – typically pull out a small proportion of the original information
    • transformation of discrete fields can lose finer distinctions

Federated Text Mining for Data Silos
• Use I2E to make data available for search, navigation, linking
  – Keep data in original format without any data loss
  – I2E queries become the conversion layer, dynamically transforming data into the format we want when we need it
  – Ontologies convert between different identifiers, or different languages
  – Configurable: just change the queries
• Use other methods when their strengths are required
  – RDF for reasoning with results
  – Workflow tools for complex data analysis and manipulation
  – Data warehouses for curated data

Road to Federated Text Mining
[Diagram: Data Normalization, Link the Content Servers, Merge Results, Federated Text Mining]

Data Normalization – Virtual Indexes
[Diagram: Pathology Reports Index, Journal Abstracts Index, Virtual Index]

Data Normalization – Document Structure
[Diagram: Journal Abstracts, Pathology Reports]

Data Normalization – Entities
[Diagram: Journal Abstracts, Pathology Reports, Combined (Normalized)]

I2E 4.1/4.2: Single Client, Multiple Results
[Diagram labels: internal network, external network, I2E Server 1, I2E Server 2, Linked server, Internal Documents, FDA Drug Labels]

Each Server supplying separate set of results
[Diagram: Content Server 1, Content Server 2, Content Server 3, Content Server 4]
Merge into a single set of results

I2E Federated Text Mining
Extract and connect data in any format, wherever it resides
Connected Knowledge
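
The three steps on the road to federated text mining (normalize the data, link the content servers, merge the results) can be illustrated with a small Python sketch. This is not the I2E API: the `Hit` record, the `ONTOLOGY` dictionary, the `federated_search` function and the two stub servers are all hypothetical names invented for illustration, and the sample hits are made-up toy data. It only shows the general idea of sending one query to several sources and merging hits under a shared, ontology-normalized entity name.

```python
from dataclasses import dataclass
from typing import Callable, Dict, Iterable, List

# Hypothetical stand-in for an ontology: maps source-specific names
# or synonyms to one preferred term.
ONTOLOGY: Dict[str, str] = {
    "acetylsalicylic acid": "aspirin",
    "asa": "aspirin",
    "aspirin": "aspirin",
    "paracetamol": "acetaminophen",
    "acetaminophen": "acetaminophen",
}


@dataclass(frozen=True)
class Hit:
    source: str   # which content server produced the hit
    entity: str   # entity exactly as written in the source
    snippet: str  # supporting text


# In this sketch a "server" is any callable taking a query and returning hits.
Server = Callable[[str], Iterable[Hit]]


def normalize(entity: str) -> str:
    """Map an entity to its preferred term; unknown entities are lowercased."""
    key = entity.strip().lower()
    return ONTOLOGY.get(key, key)


def federated_search(servers: List[Server], query: str) -> Dict[str, List[Hit]]:
    """Send one query to every server and merge hits by normalized entity."""
    merged: Dict[str, List[Hit]] = {}
    for server in servers:
        for hit in server(query):
            merged.setdefault(normalize(hit.entity), []).append(hit)
    return merged


# Stub servers with toy data; they ignore the query for simplicity.
def pathology_server(query: str) -> List[Hit]:
    return [Hit("pathology reports", "ASA", "toy snippet A")]


def abstracts_server(query: str) -> List[Hit]:
    return [
        Hit("journal abstracts", "Acetylsalicylic acid", "toy snippet B"),
        Hit("journal abstracts", "Paracetamol", "toy snippet C"),
    ]


results = federated_search([pathology_server, abstracts_server], "analgesic")
for term, hits in sorted(results.items()):
    print(term, [h.source for h in hits])
```

In this sketch the source data is never rewritten: each hit keeps its original wording, and normalization happens only when results are merged, which mirrors the point that queries and ontologies act as the conversion layer.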