Approach Document for Extraction of Data for VIVO
Indo-US Cancer Collaboratory: A VIVO Pilot
VIVO Project Team
Document Revision History

Version Number | Date | Version Description | Editor
1.0 | 29th June 2012 | Extraction of Data from Identified URLs for VIVO Project | Swati Mehta, Vivek Koul, Srikanth Jaggari
2.0 | 19th July 2012 | Addition of Process | Swati Mehta, Vivek Koul, Srikanth Jaggari
3.0 | 23rd July 2012 | Process Diagrams Attached | Swati Mehta, Vivek Koul, Srikanth Jaggari
Table of Contents

Document Revision History
1. Introduction
2. Objective
3. Scope
4. Data Extraction and Ingestion Approach
4.1. Ingestion of RDF Data
4.2. Ingestion of CSV Data
4.2.1. Manual Data Ingestion
4.2.2. Data Ingestion Using Harvester
4.3. Extraction & Ingestion of Data from Websites
5. References
Table of Figures

Figure 1: Ontology Editing Form
Figure 2: Create Class
Figure 3: Data Property Form
Figure 4: Create Models
Figure 5: Convert CSV to RDF
Figure 6: Execute SPARQL Construct Query
Figure 7: Add RDF Data
Figure 8: RDF Upload
Figure 9: Data Ingestion using Harvester
1. Introduction
VIVO is a semantic web application that enables the discovery of research and scholarship across disciplines in an institution. It is populated with detailed profiles of faculty and researchers, displaying items such as publications, teaching, service, and professional affiliations, and it provides powerful search functionality for locating people and information within or across institutions.
A VIVO environment (http://cdac-ohsl-vivo.cdac.in/) has been established to identify researchers in both countries with a view to creating team science consortia.
Profiles are largely created via automated data feeds, but can be customized to suit the needs of the individual.
2. Objective
This document describes the approach for pulling data/information from already available cancer-related websites through an information extraction process, and for ingesting that data into VIVO.
3. Scope
The scope of this document is the use of cancer-related websites containing data in the form of HTML pages, PDFs, CSV, or OWL/RDF files related to research areas, researcher profiles, organization information, publication details, etc. Owing to the lack of uniformity among websites, and to the unstructured information on some of them, the work is limited to developing a specific extractor interface for each particular website.
4. Data Extraction and Ingestion Approach
Data can be extracted from websites using the HTML pages, CSV, or OWL/RDF files provided by some of the websites.
One simple approach is that, if the data is already present in a format VIVO needs, such as CSV or RDF, we can download the data and ingest it directly.
The other approach is to develop an extractor that extracts the fields present in a particular document/website, depending upon the structure of the document or data. The limitation of this approach is that each document/HTML page may contain data in a different format; we can overcome this by creating an extractor that extracts most of the common fields and then adding a few other features to the extractor depending upon the data present in the source.
In this approach, data extraction from sources and ingestion into VIVO will be done using the following methods:
4.1. Ingestion of RDF Data
a) Navigate to the Site Administration page.
b) Select 'Add/Remove RDF'.
c) Browse to the RDF file.
d) Select N3 as the import format*.
e) The confirmation should state 'Added RDF from file people_rdf. Added n statements.'
* The output engine uses N3, not RDF/XML; n is the number of statements and people_rdf is the RDF file name.
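For illustration, a minimal sketch of what such an N3 file might contain; the resource URI, namespace, and name below are placeholders rather than actual project data:

    @prefix foaf: <http://xmlns.com/foaf/0.1/> .
    @prefix rdfs: <http://www.w3.org/2000/01/rdf-schema#> .

    # A single person resource; vivo.example.edu is a placeholder host.
    <http://vivo.example.edu/individual/person001>
        a foaf:Person ;
        rdfs:label "Rao, Asha" .

Uploading a file with these two triples would report 'Added 2 statements.'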
4.2. Ingestion of CSV Data
CSV data can be ingested in two ways:
• Manual Data Ingestion
• Data Ingestion Using Harvester
4.2.1. Manual Data Ingestion:
The following are the steps to ingest data manually:
a) Create a local ontology.
b) Create workspace models for ingesting and constructing data.
c) Pull the external data file into RDF.
d) Map the tabular data onto the ontology format.
e) Construct the ingested entities using the map of properties.
f) Load the data into the current web model.
Step 1: Create a Local Ontology
When VIVO is first installed, it comes with 12 ontologies preloaded. To accommodate data source information, create a local class and data properties within a local ontology.
Create Local Ontology:
1. Select the 'Site Admin' link in the top right corner to return to the Site Administration page.
2. Select 'Ontology List' in the Ontology Editor section.
3. Select 'Add new ontology'.
4. Input the information as noted in the figure below and select 'Create New Record'.
Figure 1: Ontology Editing Form
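Purely as an illustration of the form shown in Figure 1 (the name, namespace, and prefix below are assumptions used throughout this document's examples, not the project's actual values):

    Ontology name:    Local VIVO Ontology
    Namespace URI:    http://vivo.example.edu/ontology/local#
    Namespace prefix: local

The namespace URI should end in '/' or '#' so that class and property local names can be appended to it.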
Create a Local Class in Local Ontology:
1. Go to Site Admin, and then select Class hierarchy.
2. Click on 'Add new class'.
3. Fill in the data as follows and select 'Create New Record'.
Figure 2: Create Class
Create a New Data Property:
1. Select the newly created ontology.
2. Select 'Datatype Properties Defined in This Namespace'.
3. Select 'Add new data property'.
4. Add the information as follows and select 'Create New Record'.
Figure 3: Data Property Form
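For example (hypothetical entries consistent with the namespace assumed above; the exact field labels vary slightly between VIVO versions, so treat these as a sketch of Figure 3 rather than its actual contents):

    Local name for property: name
    Property namespace:      http://vivo.example.edu/ontology/local#
    Label for display:       Name

Repeating this for each CSV column to be ingested (e.g. email, designation) creates the data properties used in the mapping steps below.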
Step 2: Create Workspace Models
Working with VIVO requires pulling data into a model and transforming the data to
create the proper assertions.
1. To perform this step, select 'Ingest Tools' from the Advanced Tools menu.
2. Select 'Manage Jena Models'.
3. Select 'Create Model', then type in a name for your model (it can be anything).
Figure 4: Create Models
Step 3: Pull external data file into RDF
The tool for ingesting data from external sources can be found under the Ingest Menu as 'Convert CSV to RDF'.
1. Select the 'Ingest Menu' in Site Admin.
2. Click on 'Convert CSV to RDF'.
3. Browse to the CSV file.
4. The 'Namespace in which to generate properties' should follow the format of your ontology.
5. The 'Class Local Name for Resources' should be an existing class in an ontology, and the data properties should exist in this class.
6. Select 'Convert'.
Figure 5: Convert CSV to RDF
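As a hypothetical input for this step, a small people.csv might look like the following; the rows and column headers are illustrative only, chosen to match the data properties assumed in Step 1:

    name,email,designation
    Asha Rao,arao@example.org,Research Officer
    Ravi Kumar,rkumar@example.org,Senior Researcher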
With this, the CSV data is converted into RDF. The conversion can be checked as follows:
From the Ingest Menu, select 'Manage Jena Models' and then select the previous ingest model's 'output model'. This file can be opened with WordPad to see the triples created from your CSV file.
Step 4: Map Tabular Data onto Ontology
In this step we create a SPARQL query, gathering the URIs for the properties, classes, and associations that we are going to make, and the URIs for the ingested data.
Ingested Data URIs:
The easiest method of obtaining the URIs that were created for each of the columns in your CSV file is to open the 'output model' RDF data as described above. A description of the format is given in the figure below. The predicates are the properties we need for our SPARQL query.
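As a sketch of what the output model could contain for the hypothetical people.csv above (VIVO generates its own resource URIs, so the n1 below is a stand-in; the predicate namespace and the class local name Researcher are the ones assumed earlier):

    <http://vivo.example.edu/harvested/n1>
        <http://www.w3.org/1999/02/22-rdf-syntax-ns#type> <http://vivo.example.edu/ontology/local#Researcher> ;
        <http://vivo.example.edu/ontology/local#name> "Asha Rao" ;
        <http://vivo.example.edu/ontology/local#email> "arao@example.org" .

The predicate URIs (local#name, local#email) are what the WHERE clause of the CONSTRUCT query must match.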
Step 5: Construct the Ingested Entities
When writing the SPARQL query, ensure that it uses the URIs constructed by the CSV-to-RDF step and the properties into which we want the data to be placed.
Click on 'Execute SPARQL CONSTRUCT'. Constructing new entities in SPARQL starts with the basic frame:
CONSTRUCT
{
***
}
WHERE
{
***
}
Figure 6: Execute SPARQL Construct Query
We then need to create statements that will create triples in VIVO in the basic format (subject predicate object). A period '.' separates triples from each other. Variables are represented by a question mark '?' before the name of the variable.
Example for people.csv:
CONSTRUCT
{
?person <http://www.w3.org/1999/02/22-rdf-syntax-ns#type> <class's namespace URI> .
?person <data property URI> ?propertyValue .
***
}
WHERE
{
?person <URI from the RDF file we saved in WordPad> ?propertyValue .
***
}
Example for organization.csv:
CONSTRUCT
{
?org <http://www.w3.org/1999/02/22-rdf-syntax-ns#type> <class's namespace URI> .
?org <data property URI> ?propertyValue .
***
}
WHERE
{
?org <URI from the RDF file we saved in WordPad> ?propertyValue .
***
}
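Filled in for the hypothetical people.csv used above, the query could look like the following (the local namespace and class are the ones assumed earlier; foaf:Person and rdfs:label are standard vocabulary that VIVO recognizes):

    CONSTRUCT
    {
        ?person <http://www.w3.org/1999/02/22-rdf-syntax-ns#type> <http://xmlns.com/foaf/0.1/Person> .
        ?person <http://www.w3.org/2000/01/rdf-schema#label> ?name .
    }
    WHERE
    {
        ?person <http://vivo.example.edu/ontology/local#name> ?name .
    }

Each resource from the ingest model is thereby retyped as a foaf:Person, with its name column used as the display label.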
Next:
• Select the source model ('csv-ingest').
• Select the destination model ('csv-construct').
• Select 'Execute CONSTRUCT'.
• Upon completion, the system will report 'n statements CONSTRUCTed'.
* n is the number of statements constructed.
Step 6: Load to Webapp
The final step is to output the final model and add it into the current model.
a) From the Ingest Menu, select 'Manage Jena Models'.
b) Select 'output model' for the desired construct model.
c) Save the resulting file.
d) Navigate back to the Site Administration page.
e) Select 'Add/Remove RDF'.
f) Browse to the file previously saved.
g) Select N3 as the import format*.
h) The confirmation should state 'Added RDF from file people_rdf. Added n statements.'
* The output engine uses N3, not RDF/XML.
Figure 7: Add RDF Data
Figure 8: RDF Upload
4.2.2. Data Ingestion Using Harvester:
The following are the steps to ingest data using the Harvester:
a) Navigate to the Site Administration page.
b) Select 'Ingest Tools'.
c) Click on 'Harvest Person Data from CSV'.
d) Upload your CSV file.
e) Select 'Harvest'.
With this, the data can be harvested successfully.
Figure 9: Data Ingestion using Harvester
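The CSV must follow the column layout the harvester expects; the harvest page itself provides the authoritative template to download. Purely as an illustrative sketch (these column names are assumptions, not the actual template):

    firstName,lastName,email
    Asha,Rao,arao@example.org
    Ravi,Kumar,rkumar@example.org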
4.3. Extraction & Ingestion of Data from Websites
The process of extracting relevant data/information for the 'Indo-US Cancer Collaboratory: A VIVO Pilot' from cancer-related websites on the World Wide Web is largely dependent on the structure of the web pages. Depending upon the structure and design of the web pages, the following techniques have been adopted:
• DEiXTo: a web data extraction tool
• A webpage-specific extraction tool developed by C-DAC
• Manual extraction
For this pilot project, the cancer center websites listed below have been considered for extracting relevant data/information.
Using DEiXTo (web data extraction tool):
• Dana-Farber: http://www.dana-farber.org/
• Fred Hutchinson Cancer Center: http://www.fhcrc.org/en/labs/profiles.html
• International Oncology: http://internationaloncology.com
Webpage-specific extraction tool developed by C-DAC:
• Bangalore Institute of Oncology: www.hcgoncology.com/hcg-bio
Manual extraction:
• TATA Memorial Center: http://tmc.gov.in/
• Amala Institute of Medical Sciences: http://www.amalaims.org/cancer_research_centre_amala_facultiesandstaff.php
5. References
1) http://sourceforge.net/apps/mediawiki/vivo/index.php?title=Data_Ingest_Guide
2) http://vivoweb.org/
3) http://en.wikipedia.org/wiki/Web_scraping
4) http://deixto.com/