Distributed RDF Data Store on HBase

Project by:
• Anuj Shetye
• Vinay Boddula

Project Overview
• Introduction
• Motivation
• HBase
• Our work
• Evaluation
• Related work
• Future work and conclusion

Introduction
• RDF datasets keep growing, and an RDF graph is much larger than a traditional graph: the cardinality of its vertices and edges is much larger.
• Large data stores are therefore required, for the following reasons:
  – Fast and efficient querying
  – Scalability issues

Motivation
• Research has been done on mapping RDF datasets onto relational databases, for example Virtuoso and Jena SDB. But the dataset is stored centrally, i.e. on one server.
• Example: Jena SDB maps RDF triples into a relational database.
  – Scalability
• Some systems store RDF data as one large graph, but on a single node, for example Jena TDB.
  – Scalability
• HBase is an open-source, distributed, sorted-map datastore modelled on Google Bigtable.

Contd...
• HBase is a NoSQL database.
• High scalability, highly fault tolerant.
• Fast read/write.
• Dynamic database.
• Hadoop and other apps integrated.
• Column-family-oriented data layout.
• Max data size: ~1 PB.
• Read/write limits: millions of queries per second.

Who uses HBase/Bigtable
• Adobe, Facebook, Twitter, Yahoo, Gmail, Google Maps, etc.

Hadoop Ecosystem
[Figure: Hadoop ecosystem. Source: Cloudera]

Our Project
• Our project creates a distributed data storage capability for RDF using HBase.
• We developed a system that takes the N-Triples file of an RDF graph as input and stores the triples in HBase as key-value pairs using MapReduce jobs.
• The schema is simple:
  – predicates become column families
  – subjects are the row keys
  – objects are the values

System Architecture
[Figure: architecture diagram with the labels I/p File, Mapper, MR Job (three boxes), HBase Data Store]

Data Model
Logical view as 'Records'

Row key   Data
Anuj      hasAdvisor: {'Dr. Miller'}
          workedFor: {'UGA'}
Vinay     hasAdvisor: {'Dr. Ramaswamy'}
          hasPapers: {'Paper 1', 'Paper 2'}
          workedFor: {'IBM', 'UGA'}

Data Model contd..
Physical Model

hasAdvisor column family
Row key   Column key   Timestamp   Value
Anuj      hasAdvisor   T1          Dr. Miller
Vinay     hasAdvisor   T2          Dr. Ramaswamy

hasPaper column family
Row key   Column key   Timestamp   Value
Vinay     hasPaper     T2          Paper1
Vinay     hasPaper     T1          Paper2

workedFor column family
Row key   Column key   Timestamp   Value
Anuj      workedFor    T1          'UGA'
Vinay     workedFor    T3          'UGA'
Vinay     workedFor    T2          'IBM'

Two major issues can be solved using HBase:
• Data insertion
• Data updates
Versioning is possible (timestamps).
Bulk loading of data. Two types:
• Complete bulk load (HBase File Formatter, our approach)
• Incremental bulk load

Evaluations
• We will talk about this during the demo.

Related Work
• CumulusRDF: Linked Data Management on Nested Key-Value Stores, which appeared in SSWS 2011, works on distributed key-value indexing on data stores. They used Cassandra as the data store.
• Apache Cassandra is currently capable of storing RDF data and has an adapter to store data in a distributed management system.

Future Work and Conclusion
• Our future work lies in developing an efficient interface for SPARQL, as querying with an SQL-like language such as Hive is slower in HBase.
• The system was tested on a single node; testing it on multiple nodes would therefore be the ultimate test of efficiency.

Questions?
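
Appendix: data model sketch

The schema from the "Our Project" and "Data Model" slides (subject as row key, predicate as column family, object as value) can be illustrated with a small Python sketch. It is not the project's MapReduce code and makes no HBase calls; it only shows how N-Triples lines might map to the logical 'Records' view. The regular expression, the helper names (local_name, parse_triple, to_cells, logical_view), and the http://example.org/ IRIs are all assumptions made for illustration, and the parser handles only plain IRIs and simple quoted literals.

```python
import re
from collections import defaultdict

# Simplified N-Triples pattern: <subject> <predicate> <object> .  or  "literal" .
# Assumption: no blank nodes, language tags, or datatype suffixes.
TRIPLE_RE = re.compile(
    r'^\s*<([^>]*)>\s+<([^>]*)>\s+(?:<([^>]*)>|"((?:[^"\\]|\\.)*)")\s*\.\s*$'
)


def local_name(iri):
    """Hypothetical helper: keep the part after the last '/' or '#'."""
    return re.split(r'[/#]', iri)[-1]


def parse_triple(line):
    """Return (subject, predicate, object) or None if the line does not match."""
    match = TRIPLE_RE.match(line)
    if match is None:
        return None
    subj, pred, obj_iri, obj_literal = match.groups()
    obj = local_name(obj_iri) if obj_iri is not None else obj_literal
    return local_name(subj), local_name(pred), obj


def to_cells(lines):
    """Yield one (row_key, column_family, value) tuple per parsed triple."""
    for line in lines:
        triple = parse_triple(line)
        if triple is not None:
            # subject -> row key, predicate -> column family, object -> value
            yield triple


def logical_view(cells):
    """Group cells into the logical view: row key -> column family -> values."""
    records = defaultdict(lambda: defaultdict(set))
    for row_key, family, value in cells:
        records[row_key][family].add(value)
    return records


# Made-up sample input mirroring the rows shown on the Data Model slide.
SAMPLE = [
    '<http://example.org/Anuj> <http://example.org/hasAdvisor> "Dr. Miller" .',
    '<http://example.org/Anuj> <http://example.org/workedFor> "UGA" .',
    '<http://example.org/Vinay> <http://example.org/workedFor> "IBM" .',
    '<http://example.org/Vinay> <http://example.org/workedFor> "UGA" .',
]

records = logical_view(to_cells(SAMPLE))
for row_key in sorted(records):
    for family in sorted(records[row_key]):
        print(row_key, family, sorted(records[row_key][family]))
```

In a real load, each tuple from to_cells would become one HBase cell, with the timestamp assigned at write time; multiple objects for the same subject and predicate (such as Vinay's two workedFor values) would need distinct column keys or versions, and the slides do not say which the project uses.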