Approach Document for Extraction of Data for VIVO
Indo-US Cancer Collaboratory: A VIVO Pilot
VIVO Project Team

Document Revision History

Version 1.0 (29th June 2012): Extraction of Data from Identified URLs for VIVO Project. Editors: Swati Mehta, Vivek Koul, Srikanth Jaggari
Version 2.0 (19th July 2012): Addition of Process. Editors: Swati Mehta, Vivek Koul, Srikanth Jaggari
Version 3.0 (23rd July 2012): Process Diagrams Attached. Editors: Swati Mehta, Vivek Koul, Srikanth Jaggari

Table of Contents

Document Revision History
1. Introduction
2. Objective
3. Scope
4. Data Extraction and Ingestion Approach
4.1. Ingestion of RDF Data
4.2. Ingestion of CSV Data
4.2.1. Manual Data Ingestion
4.2.2. Data Ingestion Using Harvester
4.3. Extraction & Ingestion of Data from Websites
5. References

Table of Figures

Figure 1: Ontology Editing Form
Figure 2: Create Class
Figure 3: Data Property Form
Figure 4: Create Models
Figure 5: Convert CSV to RDF
Figure 6: Execute SPARQL Construct Query
Figure 7: Add RDF Data
Figure 8: RDF Upload
Figure 9: Data Ingestion using Harvester

1. Introduction

VIVO is a semantic web application that enables the discovery of research and scholarship across disciplines in an institution. It is populated with detailed profiles of faculty and researchers, displaying items such as publications, teaching, service, and professional affiliations, and it provides powerful search functionality for locating people and information within or across institutions. A VIVO environment (http://cdac-ohsl-vivo.cdac.in/) has been established to identify researchers in both countries with a view to creating team science consortia. Profiles are largely created via automated data feeds, but can be customized to suit the needs of the individual.

2. Objective

This document describes the approach for pulling data/information from already available cancer-related websites through an Information Extraction process, and for ingesting that data into VIVO.

3. Scope

The scope of this document is to use cancer-related websites containing data in the form of HTML pages, PDFs, CSV or OWL/RDF files related to research areas, researcher profiles, organization information, publication details, etc.
The work is limited to developing a specific extractor interface per website, owing to the lack of uniformity among websites and the unstructured information present on some of them.

4. Data Extraction and Ingestion Approach

Data can be extracted from websites using the HTML pages, CSV or OWL/RDF files that some of the websites provide. One simple approach: if the data is already available in a format that VIVO accepts, such as CSV or RDF, we can download it and ingest it directly. Otherwise, we need to develop an extractor that extracts the fields present in a particular document/website, depending on the structure of the document or data. The limitation of this approach is that each document/HTML page may present data in a different format; we can overcome this by creating an extractor that handles most of the common elements and then adding features to it depending on the data present in each source. In this approach, data extraction from sources and ingestion into VIVO is done using the following methods.

4.1. Ingestion of RDF Data

a) Navigate to the Site Administration page.
b) Select Add/Remove RDF.
c) Browse to the RDF file.
d) Select N3 as the import format*.
e) The confirmation should state 'Added RDF from file people_rdf. Added n statements.'

* The output engine uses N3, not RDF/XML. Here n is the number of statements and people_rdf is the RDF file name.

4.2. Ingestion of CSV Data

CSV data can be ingested in two ways:
- Manual Data Ingestion
- Data Ingestion Using Harvester

4.2.1. Manual Data Ingestion:

The following are the steps to ingest data manually.

Process steps:
a) Create a local ontology.
b) Create workspace models for ingesting and constructing data.
c) Pull the external data file into RDF.
d) Map tabular data into the ontology format.
e) Construct the ingested entities using the map of properties.
f) Load data to the current web model.

Step 1: Create a Local Ontology

When VIVO is first installed it comes with 12 ontologies preloaded.
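A local ontology is itself just a small set of RDF statements, in the same N3 notation that the Add/Remove RDF tool of Section 4.1 accepts. As a sketch only (the namespace, class name and property name below are illustrative assumptions, not VIVO defaults), the kind of class and data property created in this step look like this:

@prefix owl:   <http://www.w3.org/2002/07/owl#> .
@prefix rdf:   <http://www.w3.org/1999/02/22-rdf-syntax-ns#> .
@prefix rdfs:  <http://www.w3.org/2000/01/rdf-schema#> .
@prefix local: <http://cdac-ohsl-vivo.cdac.in/ontology/cancer#> .

# The local ontology, one class, and one data property it defines.
local:  rdf:type owl:Ontology .

local:CancerResearcher
    rdf:type   owl:Class ;
    rdfs:label "Cancer Researcher" .

local:researchArea
    rdf:type    owl:DatatypeProperty ;
    rdfs:label  "research area" ;
    rdfs:domain local:CancerResearcher .

The Site Admin forms used in the following steps produce statements of roughly this shape, so nothing here needs to be written by hand.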
To accommodate data source information, create a local class within a local ontology and create data properties within that ontology.

Create a Local Ontology:
1. Select the 'Site Admin' link in the top right corner to return to the Site Administration page.
2. Select 'Ontology List' in the Ontology Editor section.
3. Select 'Add new ontology'.
4. Input the information as noted in the figure below and select 'Create New Record'.

Figure 1: Ontology Editing Form

Create a Local Class in the Local Ontology:
1. Go to Site Admin, and then select Class hierarchy.
2. Click on 'Add new class'.
3. Fill in the data as follows and select 'Create New Record'.

Figure 2: Create Class

Create a New Data Property:
1. Select the newly created ontology.
2. Select 'Datatype Properties Defined in This Namespace'.
3. Select 'Add new data property'.
4. Add the information as follows and select 'Create New Record'.

Figure 3: Data Property Form

Step 2: Create Workspace Models

Working with VIVO requires pulling data into a model and transforming the data to create the proper assertions.
1. To perform this step, select "Ingest Tools" from the Advanced Tools Menu.
2. Select "Manage Jena Models".
3. Select "Create Model", then type in a name for your model (it can be anything).

Figure 4: Create Models

Step 3: Pull External Data File into RDF

The tool for ingesting data from external sources can be found under the Ingest Menu as "Convert CSV to RDF".
1. Select 'Ingest Menu' in Site Admin.
2. Click on 'Convert CSV to RDF'.
3. Browse to the CSV file.
4. The 'Namespace in which to generate properties' should follow the format of your ontology.
5. The 'Class Local Name for Resources' should be an existing class in an ontology, and the data properties should exist in that class.
6. Select "Convert".

Figure 5: Convert CSV to RDF

With this we have converted the CSV data into RDF. We can check the RDF conversion as follows: from the Ingest Menu, select 'Manage Jena Models' and then select the ingest model's 'output model'.
This file can be opened with a text editor such as WordPad to see the triples created for your CSV file.

Step 4: Map Tabular Data onto the Ontology

In this step we create a SPARQL query by gathering the URIs for the properties, classes, and associations that we are going to make, along with the URIs for the ingested data.

Ingested data URIs: the easiest way to obtain the URIs that were created for each of the columns in your CSV file is to open the 'output model' RDF data as described above. A description of the format is given in the figure below. The predicates are the properties we need for our SPARQL query.

Step 5: Construct the Ingested Entities

When writing the SPARQL query, ensure that it uses the URIs constructed during the CSV-to-RDF conversion and the properties we want the data to be placed into. Click on 'Execute SPARQL CONSTRUCT'. Constructing new entities in SPARQL starts with the basic frame:

Construct {
  ***
}
Where {
  ***
}

Figure 6: Execute SPARQL Construct Query

We then need to write statements that will create triples in VIVO in the basic format (subject predicate object). The period '.' separates triples from each other. Variables are represented by a question mark '?' before the variable name.

Example for people.csv:

Construct {
  ?person <http://www.w3.org/1999/02/22-rdf-syntax-ns#type> <Class's Namespace URI> .
  ?person <Data Property URI Namespace> ?DataPropertyName .
  ***
}
Where {
  ?person <URI from the RDF file saved in WordPad> ?PropertyName .
  ***
}

Example for organization.csv:

Construct {
  ?org <http://www.w3.org/1999/02/22-rdf-syntax-ns#type> <Class's Namespace URI> .
  ?org <Data Property URI Namespace> ?DataPropertyName .
  ***
}
Where {
  ?org <URI from the RDF file saved in WordPad> ?PropertyName .
  ***
}

Next:
1. Select the source model ('csv-ingest').
2. Select the destination model ('csv-construct').
3. Select "Execute CONSTRUCT".

Upon completion, the system will report 'n statements CONSTRUCTed', where n is the number of statements constructed.
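To make the skeletons above concrete, here is a hedged example of what a filled-in CONSTRUCT query might look like for a people.csv containing a research-area column. Every URI below is an assumption made for illustration (in particular, the predicate in the Where clause depends on the namespace your Convert CSV to RDF run actually generated), so each URI must be replaced with the one found in your own output model:

CONSTRUCT {
  # Type each ingested row as the local class created in Step 1
  ?person <http://www.w3.org/1999/02/22-rdf-syntax-ns#type>
          <http://cdac-ohsl-vivo.cdac.in/ontology/cancer#CancerResearcher> .
  # Copy the raw column value onto the local data property
  ?person <http://cdac-ohsl-vivo.cdac.in/ontology/cancer#researchArea> ?area .
}
WHERE {
  # Predicate generated by Convert CSV to RDF for the research-area column
  ?person <http://cdac-ohsl-vivo.cdac.in/csv-ingest#researchArea> ?area .
}

Running this with 'csv-ingest' as the source model and 'csv-construct' as the destination produces one rdf:type triple and one data-property triple per row of the original CSV.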
Step 6: Load to Webapp

The final step is to output the final model and add it into the current model.
a) From the Ingest Menu, select "Manage Jena Models".
b) Select "output model" for the desired construct model.
c) Save the resulting file.
d) Navigate back to the Site Administration page.
e) Select Add/Remove RDF.
f) Browse to the file previously saved.
g) Select N3 as the import format*.
h) The confirmation should state 'Added RDF from file people_rdf. Added n statements.'

* The output engine uses N3, not RDF/XML.

Figure 7: Add RDF Data
Figure 8: RDF Upload

4.2.2. Data Ingestion Using Harvester:

The following are the steps to ingest data using the Harvester.
a) Navigate to the Site Administration page.
b) Select 'Ingest Tools'.
c) Click on 'Harvest Person Data from CSV'.
d) Upload your CSV file.
e) Select 'Harvest'.

With this the data is harvested successfully.

Figure 9: Data Ingestion using Harvester

4.3. Extraction & Ingestion of Data from Websites

The process of extracting relevant data/information for the "Indo-US Cancer Collaboratory: A VIVO Pilot" from cancer-related websites on the World Wide Web is largely dependent on the structure of the web pages. Depending upon the structure and design of the web pages, the following techniques have been adopted:
- DEiXTo, a web data extraction tool
- A webpage-specific extraction tool developed by C-DAC
- Manual extraction

For this pilot project, listed below are the cancer center websites we have considered for extracting relevant data/information.
Using DEiXTo (web data extraction tool):
- Dana-Farber: http://www.dana-farber.org/
- Fred Hutchinson Cancer Center: http://www.fhcrc.org/en/labs/profiles.html
- International Oncology: http://internationaloncology.com

Webpage-specific extraction tool developed by C-DAC:
- Bangalore Institute of Oncology: www.hcgoncology.com/hcg-bio

Manual extraction:
- TATA Memorial Center: http://tmc.gov.in/
- Amala Institute of Medical Sciences: http://www.amalaims.org/cancer_research_centre_amala_facultiesandstaff.php

5. References

1) http://sourceforge.net/apps/mediawiki/vivo/index.php?title=Data_Ingest_Guide
2) http://vivoweb.org/
3) http://en.wikipedia.org/wiki/Web_scraping
4) http://deixto.com/