BRAHMS – RDF storage
Maciej Janik and Krzysztof Kochut
November 9th, 2005
ISWC 2005 – Galway, Ireland
Work supported by the National Science Foundation Grant No. IIS-0325464, entitled “SemDIS: Discovering Complex Relationships in the Semantic Web”.
Computer Science Department, University of Georgia

Outline
• Motivation
• Design objectives
• Details of design and implementation
• Tests and results
• Future work

What is BRAHMS?
• BRAHMS
– main-memory RDF/S storage that offers high performance for accessing RDF/S data
– developed for the needs of the SemDis project
• SemDis [1] project
– model, discover and reason about complex relationships between entities in the Semantic Web
– infrastructure with ontology support
– query and ranking algorithms
[1] http://lsdis.cs.uga.edu/projects/semdis/

SemDis project overview
[Diagram; recoverable labels: BRAHMS, RDF/S]

Motivation for BRAHMS
• Why?
– need for simple path searches with limited length (hop limit) on a large ontology, e.g. to answer the question “how are two resources/entities related?”
– tested systems did not offer sufficient speed or could not handle large ontologies using the main-memory model
– query and ranking algorithms (computationally intensive) require high-performance ontology storage

How are resources related?
[Example graph: nodes rstart, r1 to r13 and rend; labeled entities Maria Shriver, Arnold Schwarzenegger, Edward Kennedy, Bill Clinton and Democratic Convention 2000; edge labels “spouse of”, “relative of” and “spoke at”]
Bidirectional breadth-first search of simple paths on the instance base

Design objectives for BRAHMS
• Offer high performance for basic operations used in graph traversal algorithms.
• Capable of handling big ontologies (100s of Mbytes to many Gbytes).
• Handle RDF / RDFS.
• Distinguish between schema and instance level.
• Provide a framework for testing different semantic association discovery algorithms.

Design decisions
• Performance requirements
– use main memory for storage: fastest access
– create indexes for operations used in graph traversal algorithms
– use C/C++ for the implementation instead of Java
– instead of string URIs, use a simple type [int] as resource identifiers
• Ontology size
– compact representation for handling large ontologies
– leave some memory for algorithms

Design decisions
• Handle RDF/S
– simplify the design: do not include or check logic or constraints imposed by OWL
• Separate instance base from schema
– represent instances, schema classes and properties as different object types
– have specific methods to access schema or instances
– different types of objects require different types of statements

Separated instance base and schema
[Diagram: schema classes c1 to c5 above an instance base of resources r1 to r8, linked by rdf:type]

Object types in BRAHMS
[Diagram: the Subject, Predicate and Object of a statement are drawn from the object types InstanceNode, Literal, SchemaClass, SchemaClassLiteral, SchemaProperty and SchemaPropertyLiteral]

Design decisions
• Framework for algorithms
– create a rich API of basic operations to access RDF/S data
• Consequences of design decisions
– compact knowledge base to minimize memory usage, no memory fragmentation
– use contiguous memory blocks and make them read-only
– create a snapshot of memory structures for fast start-up (parse* once, use many times)
– handle taxonomy in a special way.
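The design decisions above (integer identifiers instead of string URIs, contiguous read-only tables, indexes for traversal) can be illustrated with a minimal sketch. It is written in Python for brevity, although BRAHMS itself is implemented in C/C++, and every name in it (`TinyStore`, `outgoing`, the `ex:` URIs) is hypothetical and not part of the BRAHMS API.

```python
from array import array


class TinyStore:
    """Toy read-only triple store (illustration only, not the BRAHMS API).

    URIs are interned to dense integer ids 0..N-1, statements live in flat
    integer columns, and a per-subject (start, length) record locates each
    subject's fragment of the sorted statement table.
    """

    def __init__(self, triples):
        self.uri_of = []   # id -> URI
        self.id_of = {}    # URI -> id
        # Convert every triple to ids, drop duplicates, sort by (s, p, o).
        rows = sorted({tuple(self._intern(u) for u in t) for t in triples})
        self.subj = array("l", (r[0] for r in rows))
        self.pred = array("l", (r[1] for r in rows))
        self.obj = array("l", (r[2] for r in rows))
        # Index tables are addressed directly by resource id.
        n = len(self.uri_of)
        self.start = array("l", [0] * n)
        self.length = array("l", [0] * n)
        for pos, s in enumerate(self.subj):
            if self.length[s] == 0:
                self.start[s] = pos
            self.length[s] += 1

    def _intern(self, uri):
        # Assign the next free integer id the first time a URI is seen.
        if uri not in self.id_of:
            self.id_of[uri] = len(self.uri_of)
            self.uri_of.append(uri)
        return self.id_of[uri]

    def outgoing(self, subject_id):
        """Yield (predicate_id, object_id) pairs: O(1) lookup, then a linear walk."""
        begin = self.start[subject_id]
        for pos in range(begin, begin + self.length[subject_id]):
            yield self.pred[pos], self.obj[pos]


demo = TinyStore([
    ("ex:a", "ex:knows", "ex:b"),
    ("ex:a", "ex:knows", "ex:c"),
    ("ex:b", "ex:knows", "ex:c"),
])
demo_edges = [(demo.uri_of[p], demo.uri_of[o])
              for p, o in demo.outgoing(demo.id_of["ex:a"])]
```

Because the tables never change after loading, a structure like this could be written to disk and mapped back in as a snapshot, which is the "parse once, use many times" idea named in the list above.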
(*) Redland’s Raptor is used as the RDF/S parser: http://librdf.org/raptor

Taxonomy handling
[Diagram: class hierarchy over C1 to C12, with sorted identifier lists labeled Ancestors (C1, C3, C5, C6) and Descendants (C8, C9, C10, C11, C12)]
• subClassOf or subPropertyOf handled separately from statements
• direct parents / children are given from RDF
• ancestors and descendants are calculated and kept in the snapshot
• information is kept as a sorted list of identifiers
– checking whether a list contains an element is O(log2 n)

Taxonomy and instances
[Diagram: classes C1, C2 and C3 with k sorted instance lists (r1, r3, r8, r9), (r2, r4, r7) and (r5, r6, r10), merged in O(n log2 k) into one sorted list r1 to r10 of length n]
• each class holds a list of its direct instances (assigned by rdf:type)
• to get instances of all children / descendants, union the lists of their instances with the current class's list
• also easy to get instances of all descendants
• instance lists are sorted, so merge is O(n log2 k)
– merge needed to get a list without duplicates
– bonus: result list is sorted

Key to BRAHMS speed: Tables, Indexes and Iterators
• Tables
– extensive use of tables for values, identifiers and indexes; resources kept as object values in tables
– each resource type (instance, class, ...)
has contiguous identifiers from 0 to N-1
– table can be indexed directly by an identifier
• Index
– get the proper reference to a table fragment (starting index and length) for a given resource to feed the iterator
• Iterator
– walk through a fragment of a prepared table to access values

Table structures
[Diagram; recoverable labels: Resources table (0 to r-1) with hash code, start and length fields; Statement order index (0 to s-1); Statement table (0 to s-1) holding statements St1, St2 and St3; Value table index (0 to r-1); Values (0 to n-1)]

Indexes in Brahms
A statement (triple) has a statement id, subject id, predicate id and object id.
• simple (direct) indexes
– Subject -> Predicate, Object
– Subject -> Object, Predicate
– Predicate -> Subject, Object
– Predicate -> Object, Subject
– Object -> Subject, Predicate
– Object -> Predicate, Subject
• composite (calculated) indexes
– Subject, Predicate -> Object
– Subject, Object -> Predicate
– Predicate, Object -> Subject
(diagram label: in Brahms snapshot)

Why are all 6 simple indexes needed?
SPARQL example:
SELECT ?x ?y ?z
WHERE { ?x <P1> ?y .
        ?y <P2> ?z
}
• use iterators: P1 in (Object, Subject) order and P2 in (Subject, Object) order, both sorted by “y”
• intersection of two sorted lists
• time: O(n) [if no duplicates in lists]

Simple index and iterator
Subject -> Object, Predicate
[Diagram: the subject id (Sid=5) selects a record (start idx, length) in the subject index record table in O(1) time; the record delimits a fragment of the statement order index (0 to s-1) that references the statements Sta = (S5, P2, O1), Stb = (S5, P1, O2) and Stc = (S5, P1, O1) in the statement table]

Composite index and iterator
Subject + Predicate -> Object
[Diagram: uses the simple index Subject -> Predicate, Object; the fragment for Sid=5 is located through the subject index record table, then a binary search for P3 inside that fragment takes O(log2 length) time; example statements Sta = (S5, P3, O1), Stb = (S5, P3, O3) and Stc = (S5, P3, O2)]

Test results
• Tested datastores
– Jena (2.1)
– Sesame (1.1)
– Redland (1.0.0)
• As a testbed, algorithms for k-hop limited semantic associations were used
– depth-first search
– bidirectional breadth-first search
• Datasets
– SWETO [1]: small [1.2 MB], medium [14 MB], big [255 MB]
– Lehigh University benchmark [2]: Univ(50, 0) [556 MB]
– synthetic dataset [14 MB] for a Business – Sports – Entertainment ontology (generated with TOntoGen [3])
[1] Aleman-Meza et al., SWETO: Large-Scale Semantic Web Test-bed. In 16th International Conference on Software Engineering and Knowledge Engineering (SEKE2004): Workshop on Ontology in Action, (Banff, Canada, 2004).
[2] Guo, Y., Pan, Z. and Heflin, J., An Evaluation of Knowledge Base Systems for Large OWL Datasets. In Third International Semantic Web Conference, (Hiroshima, Japan, 2004), Springer, 274-288.
[3] http://lsdis.cs.uga.edu/projects/semdis/tontogen/

Test datasets
• SWETO
– dataset extracted from the Internet using Semagix Freedom [1] technology
– edge distribution follows the link distribution typical of the Internet
• Lehigh University Benchmark
– synthetically generated dataset, from a small University ontology
• Synthetic Business – Sports – Entertainment
– synthetically generated dataset of a combined business, entertainment and sports ontology
– high-connectivity graph with uniform distribution of edges
[1] Freedom is a product of Semagix, http://www.semagix.com/

Test dataset statistics
Dataset name                     Instance statements   Instance nodes   Avg node degree   RDF file size
SWETO medium                     59,105                55,876           2.116             14 MB
SWETO big                        1,553,112             813,479          3.818             255 MB
Univ(50, 0), Lehigh University   3,298,813             1,082,818        6.093             556 MB
Bus–Sports–Ent, TOntoGen         45,000                29,889           3.011             13 MB
[Figure: degree distribution, log / log scale]

Results – memory usage
Memory usage for RDF file load [MB] (x = out of memory)
                      Small SWETO   Big SWETO   Small Synthetic   Univ(50, 0)
Jena                  112           1730        79                1828
Sesame                112           1477        112               1422
Redland, no IDX       147           2674        71                2432
Redland, IDX          210           2924        99                x
BRAHMS - Initial      32            356         31                509
BRAHMS - Load dump    20            270         10                501

Results - timing
Search time on small SWETO, time [s] (x = out of memory)
Association length [relations]   9      10     11     12
DFS Jena                         77     104    174    x
DFS Sesame                       2      2      11     54
DFS Redland                      16     27     52     204
DFS BRAHMS                       0.5    0.5    3      7
bi-BFS Jena                      5      5      5.5    5.5
bi-BFS Sesame                    0.2    0.2    0.2    0.2
bi-BFS Redland                   0.1    0.1    0.1    0.2
bi-BFS BRAHMS                    0.1    0.1    0.1    0.1
Found paths                      47     61     61     289

Results - timing
bi-BFS on synthetic Business-Sports-Entertainment, time [sec]
Association length [relations]   9       10        11          12
Jena                             12.8    39.9      59.3        847
Sesame                           1.8     11.9      25.7        386
Redland                          0.43    2.6       5.2         64.8
BRAHMS                           0.1     0.5       1.9         38
Found paths                      8559    131009    1680943     24392420
Chart annotations at length 12 (time relative to BRAHMS): Jena x 22.29, Sesame x 10.16, Redland x 1.70

Results - timing
bi-BFS search on Univ(50, 0), time [s] (Jena: out of memory; Redland: out of memory during load)
Association length [relations]   6       7        8         9           10
Jena                             x       x        x         x           x
Sesame                           17.3    41.8     86.5      726.2       28111
Redland                          x       x        x         x           x
BRAHMS                           0.1     0.8      2.28      22.41       309.9
Number of paths                  1,506   15,339   667,901   8,812,652   298,990,413

Results - timing
bi-BFS on Univ(10,0) - 100 MB file, time [sec] (x = out of memory)
Association length [relations]   5      6      7       8        9           10           11
Jena                             13     25     71      108      x           x            x
Sesame                           9.38   15     113     175      721         1442         18341
Redland                          0.12   0.19   1.63    3.26     20          49           427
BRAHMS                           0.01   0.02   0.21    0.47     3.89        23           336
Paths                            5      319    4,988   97,868   1,401,886   22,876,121   319,574,607

Results - timing
bi-BFS search on Univ(700,0) - 6.5 GB file (found paths plotted on a log scale)
Association length [relations]   4      5      6        7           8
BRAHMS time [sec]                0.02   0.15   0.33     46.42       308.87
Found paths                      32     205    94,152   1,271,857   314,116,239

BRAHMS - today and tomorrow
• Today
– implemented (most of) SPARQL over BRAHMS
– BRAHMS successfully used as storage for funded projects in the LSDIS lab
    • Insider Threat project [1]
    • Peer-to-Peer Semantic Association Discovery [2]
• Future
– create context representation in Brahms
– design and create a new querying model that uses context and association discovery
[1] B. Aleman-Meza et al., An Ontological Approach to the Document Access Problem of Insider Threat, Proceedings of the IEEE Intl. Conference on Intelligence and Security Informatics (ISI-2005), May 19-20, 2005
[2] M. Perry et
al., "Peer-to-Peer Discovery of Semantic Associations", Second International Workshop on Peer-to-Peer Knowledge Management, San Diego, CA, July 17, 2005

Thank you
SemDis project: http://lsdis.cs.uga.edu/project/semdis
BRAHMS page: http://lsdis.cs.uga.edu/project/semdis/brahms
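
Supplementary sketch: sorted-list operations
The taxonomy and index slides rely on three operations over sorted identifier lists: an O(log2 n) membership check, an O(n log2 k) merge of k instance lists without duplicates, and an O(n) intersection of two duplicate-free lists. The following is a minimal sketch of those operations in Python, offered only as an illustration; the function names are hypothetical and are not taken from BRAHMS, which is implemented in C/C++.

```python
import heapq
from bisect import bisect_left


def contains(sorted_ids, x):
    """Binary-search membership test on a sorted list: O(log n)."""
    i = bisect_left(sorted_ids, x)
    return i < len(sorted_ids) and sorted_ids[i] == x


def merge_unique(*sorted_lists):
    """k-way merge of sorted lists into one sorted, duplicate-free list.

    heapq.merge keeps a heap of size k, so the cost is O(n log k)
    for n elements in total.
    """
    out = []
    for x in heapq.merge(*sorted_lists):
        if not out or out[-1] != x:
            out.append(x)
    return out


def intersect(a, b):
    """Intersection of two sorted, duplicate-free lists in one linear pass."""
    i = j = 0
    out = []
    while i < len(a) and j < len(b):
        if a[i] < b[j]:
            i += 1
        elif a[i] > b[j]:
            j += 1
        else:
            out.append(a[i])
            i += 1
            j += 1
    return out


# Instance lists as drawn on the "Taxonomy and instances" slide (r1 -> 1, ...).
all_instances = merge_unique([1, 3, 8, 9], [2, 4, 7], [5, 6, 10])
```

The intersection is the step that the join in the SPARQL example depends on: with both inputs ordered by the shared variable, the two iterators only ever move forward.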