BRAHMS – RDF storage
Maciej Janik and Krzysztof Kochut
November 9th, 2005
ISWC 2005 – Galway, Ireland
Work supported by the National Science Foundation Grant
No. IIS-0325464, entitled “SemDIS: Discovering Complex
Relationships in the Semantic Web”.
Computer Science Department
University of Georgia
Outline
• Motivation
• Design objectives
• Details of design and implementation
• Tests and results
• Future work
What is BRAHMS?
• BRAHMS
– main-memory RDF/S storage that offers high
performance for accessing RDF/S data
– developed for the needs of the SemDis project
• SemDis [1] project
– model, discover and reason about complex
relationships between entities in the Semantic Web
– infrastructure with ontology support
– query and ranking algorithms
[1] http://lsdis.cs.uga.edu/projects/semdis/
SemDis project overview
[Figure: SemDis project overview diagram; BRAHMS appears as the RDF/S storage component]
Motivation for BRAHMS
• Why?
– need for simple path searches with limited
length (hop limit) on large ontologies, e.g. to
answer the question: “how are two
resources/entities related?”
– the tested systems did not offer sufficient speed or
could not handle large ontologies using the
main-memory model
– query and ranking algorithms (computationally
intensive) require high-performance ontology
storage
How are resources related?
Example:
[Figure: graph of resources r1 to r13 between rstart and rend, with one path highlighted]
– Arnold Schwarzenegger, spouse of, Maria Shriver
– Maria Shriver, relative of, Edward Kennedy
– Edward Kennedy, spoke at, Democratic Convention 2000
– Bill Clinton, spoke at, Democratic Convention 2000
Bidirectional breadth-first search
of simple paths on the instance base
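The hop-limited search named on this slide can be sketched as follows. This is a minimal illustration in Python, not the BRAHMS implementation (which is written in C/C++): it grows simple half-paths breadth-first from both endpoints and joins them where they meet. The function names, the dictionary-based graph and the choice to treat relations as undirected are assumptions made for the example.

```python
from collections import defaultdict


def half_paths(graph, source, max_hops):
    """Breadth-first enumeration of simple paths that start at `source`
    and use at most `max_hops` edges, grouped by their last node."""
    by_end = defaultdict(list)
    by_end[source].append((source,))
    frontier = [(source,)]
    for _ in range(max_hops):
        next_frontier = []
        for path in frontier:
            for neighbor in graph.get(path[-1], ()):
                if neighbor not in path:  # keep the path simple
                    extended = path + (neighbor,)
                    by_end[neighbor].append(extended)
                    next_frontier.append(extended)
        frontier = next_frontier
    return by_end


def simple_paths(graph, start, end, hop_limit):
    """All simple paths from `start` to `end` with at most `hop_limit` edges,
    found by joining half-paths grown from both ends."""
    forward = half_paths(graph, start, (hop_limit + 1) // 2)
    backward = half_paths(graph, end, hop_limit // 2)
    found = set()  # the same path can be produced by different split points
    for meeting, forward_list in forward.items():
        for fwd in forward_list:
            for bwd in backward.get(meeting, ()):
                # the two halves may share only the meeting node
                if set(fwd[:-1]).isdisjoint(bwd):
                    found.add(fwd + tuple(reversed(bwd[:-1])))
    return sorted(found, key=lambda path: (len(path), path))


# Example data taken from the slide; relations are treated as undirected here.
edges = [
    ("Arnold Schwarzenegger", "Maria Shriver"),
    ("Maria Shriver", "Edward Kennedy"),
    ("Edward Kennedy", "Democratic Convention 2000"),
    ("Bill Clinton", "Democratic Convention 2000"),
]
graph = defaultdict(set)
for left, right in edges:
    graph[left].add(right)
    graph[right].add(left)

for path in simple_paths(graph, "Arnold Schwarzenegger", "Bill Clinton", 4):
    print(" - ".join(path))
```

Splitting the hop limit between the two directions keeps each half-search shallow, which is the usual reason to prefer a bidirectional search over a single deep traversal.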
Design objectives for BRAHMS
• Offer high performance for the basic operations
used in graph traversal algorithms.
• Handle large ontologies
(hundreds of megabytes to many gigabytes).
• Handle RDF / RDFS.
• Distinguish between the schema and instance
levels.
• Provide a framework for testing different
semantic association discovery algorithms.
Design decisions
• Performance requirements
– use main memory for storage – fastest access
– create indexes for operations used in graph
traversal algorithms
– use C/C++ in implementation instead of Java
– instead of string URIs, use simple type [int] as
resource identifiers.
• Ontology size
– compact representation for handling large
ontologies – leave some memory for algorithms
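The decision to replace string URIs with integer identifiers can be illustrated with a small interning table. This is only a sketch in Python under assumed names (`UriDictionary`, `intern`, `uri`); the slides do not show how BRAHMS assigns identifiers in its C/C++ code.

```python
class UriDictionary:
    """Hypothetical helper that maps each URI string to a dense integer id."""

    def __init__(self):
        self._id_by_uri = {}
        self._uri_by_id = []

    def intern(self, uri):
        """Return the id of `uri`, assigning the next free id on first use."""
        ident = self._id_by_uri.get(uri)
        if ident is None:
            ident = len(self._uri_by_id)
            self._id_by_uri[uri] = ident
            self._uri_by_id.append(uri)
        return ident

    def uri(self, ident):
        """Translate an id back to its URI, e.g. when printing results."""
        return self._uri_by_id[ident]


dictionary = UriDictionary()
first = dictionary.intern("http://example.org/r1")   # example URIs, not from the slides
second = dictionary.intern("http://example.org/r2")
again = dictionary.intern("http://example.org/r1")
```

Because identifiers are handed out contiguously from 0, they can later be used directly as positions in tables, and comparing two resources becomes an integer comparison.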
Design decisions
• Handle RDF/S
– simplify the design: do not include or
check the logic and constraints imposed by OWL
• Separate instance base from schema
– represent instances, schema classes and
properties as different object types
– provide specific methods to access the schema or the
instances
– different types of objects require different
types of statements
Separated instance base and schema
[Figure: schema layer with classes c1 to c5 above the instance base with resources r1 to r8; instances are linked to schema classes by rdf:type]
Object types in BRAHMS
Statement types by object type:
Subject          Predicate        Object
InstanceNode     SchemaProperty   InstanceNode or Literal
SchemaClass      SchemaProperty   SchemaClass or SchemaClassLiteral
SchemaProperty   SchemaProperty   SchemaProperty or SchemaPropertyLiteral
Design decisions
• Framework for algorithms
– create a rich API of basic operations to access
RDF/S data
• Consequences of design decisions
– compact knowledge base to minimize memory
usage, no memory fragmentation: use
contiguous memory blocks → make it read-only
– create a snapshot of the memory structures for fast
start-up (parse* once, use many times)
– handle taxonomy in a special way
(*) Redland’s Raptor is used as the RDF/S parser – http://librdf.org/raptor
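The "parse once, use many times" idea can be sketched with a contiguous integer table that is written to disk and read back without any parsing. This is an illustrative Python sketch only; the file layout (an element count followed by the raw values) and the function names are assumptions, not the BRAHMS snapshot format.

```python
import os
import tempfile
from array import array


def save_snapshot(path, table):
    """Write one contiguous table of 64-bit integers, preceded by its length."""
    with open(path, "wb") as handle:
        array("q", [len(table)]).tofile(handle)
        table.tofile(handle)


def load_snapshot(path):
    """Read the table back in one block, without re-parsing any RDF."""
    with open(path, "rb") as handle:
        header = array("q")
        header.fromfile(handle, 1)
        table = array("q")
        table.fromfile(handle, header[0])
    return table


ids = array("q", [3, 1, 4, 1, 5])  # stands in for a table built while parsing
with tempfile.TemporaryDirectory() as folder:
    snapshot_path = os.path.join(folder, "snapshot.bin")
    save_snapshot(snapshot_path, ids)
    restored = load_snapshot(snapshot_path)
```

A read-only store makes this simple: since nothing changes after loading, the on-disk image and the in-memory tables can have the same shape.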
Taxonomy handling
[Figure: class hierarchy C1 to C12; for one class the diagram lists its ancestors (C1, C3, C5, C6) and its descendants (C8, C9, C10, C11, C12)]
• subClassOf and subPropertyOf are handled
separately from statements
• direct parents / children
are given by the RDF data
• ancestors and descendants
are calculated and kept in the
snapshot
• information is kept as a
sorted list of identifiers, so
checking whether a list contains an
element is O(log2 n)
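The taxonomy handling above can be sketched as follows: ancestors are computed once from the direct parents and stored as sorted identifier lists, so a membership test is a binary search. This is a minimal Python illustration; the function names and the small hierarchy are made up for the example, and the sketch assumes the hierarchy has no cycles.

```python
from bisect import bisect_left


def all_ancestors(direct_parents):
    """Map every class id to the sorted list of all its ancestors.
    `direct_parents` maps a class id to the ids of its direct parents.
    Assumes an acyclic hierarchy."""
    ancestors = {}

    def collect(class_id):
        if class_id not in ancestors:
            found = set()
            for parent in direct_parents.get(class_id, ()):
                found.add(parent)
                found.update(collect(parent))
            ancestors[class_id] = sorted(found)
        return ancestors[class_id]

    for class_id in direct_parents:
        collect(class_id)
    return ancestors


def contains(sorted_ids, ident):
    """Binary search in a sorted identifier list: O(log2 n)."""
    position = bisect_left(sorted_ids, ident)
    return position < len(sorted_ids) and sorted_ids[position] == ident


# Hypothetical hierarchy: class 8 is below 6, which is below 3 and 5, both below 1.
direct_parents = {2: [1], 3: [1], 5: [1], 6: [3, 5], 8: [6]}
ancestors = all_ancestors(direct_parents)
is_subclass = contains(ancestors[8], 5)  # is class 8 a (transitive) subclass of 5?
```

Descendant lists can be built the same way from the direct children; both are computed once and then only read.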
Taxonomy and instances
[Figure: three classes (C1, C2, C3) with the sorted direct-instance lists r1 r3 r8 r9 / r5 r6 r10 / r2 r4 r7; merging the k lists gives the sorted list r1 r2 r3 r4 r5 r6 r7 r8 r9 r10 with n elements in O(n log2 k)]
• each class holds a list of its
direct instances (assigned
by rdf:type)
• to get the instances of all
children / descendants,
union their instance lists
with the list of the current class
• instances of all descendants
are just as easy to get
• instance lists are sorted, so the
merge is O(n log2 k) for k lists
with n elements in total
– merge needed to get a list
without duplicates
– bonus: the result list is sorted
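The merge of sorted instance lists can be sketched with a heap-based k-way merge, which performs O(n log2 k) comparisons for k lists holding n identifiers in total. The Python below is an illustration with an assumed function name, not code from BRAHMS.

```python
import heapq


def merge_instances(sorted_lists):
    """Union of k sorted identifier lists, returned sorted and without duplicates."""
    merged = []
    for ident in heapq.merge(*sorted_lists):
        # equal identifiers arrive next to each other, so one comparison removes duplicates
        if not merged or merged[-1] != ident:
            merged.append(ident)
    return merged


# Instance lists from the slide (r1 is written as 1, and so on).
class_instances = [[1, 3, 8, 9], [5, 6, 10], [2, 4, 7]]
merged = merge_instances(class_instances)
```

Because the output is itself sorted, it can be fed into further merges or intersections without an extra sorting step.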
Key to BRAHMS speed
Tables, Indexes and Iterators
• Tables
– extensive use of tables for values, identifiers and indexes
– resources kept as object values in tables
– each resource type (instance, class, ...) has contiguous
identifiers from 0 to N-1
– table can be indexed directly by an identifier
• Index
– get proper reference to table fragment (starting index and
length) for given resource to feed iterator
• Iterator
– walk through fragment of prepared table to access values
Table structures
[Figure: table structures. Labels in the diagram: Resources (entries 0 to r-1) with hash code, start and length; value table index; Values (entries 0 to n-1); Statement table (entries 0 to s-1) holding statements St1, St2, St3; Statement order index (entries 0 to s-1)]
Indexes in Brahms
Statement (triple): Statement id, Subject id, Predicate id, Object id
• simple (direct) indexes, in the Brahms snapshot
– Subject → Predicate, Object
– Subject → Object, Predicate
– Predicate → Subject, Object
– Predicate → Object, Subject
– Object → Subject, Predicate
– Object → Predicate, Subject
• composite (calculated) indexes
– Subject, Predicate → Object
– Subject, Object → Predicate
– Predicate, Object → Subject
Why are all 6 simple indexes needed?
SPARQL example:
SELECT ?x ?y ?z
WHERE { ?x <P1> ?y . ?y <P2> ?z }
• use iterators
– P1 → OS (sorted by "y")
– P2 → SO (sorted by "y")
• intersection of two sorted lists
– time: O(n) [if no duplicates in lists]
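The linear-time intersection used to join the two patterns on ?y can be sketched with two cursors that advance over the sorted lists. This is an illustrative Python sketch; the function name and the sample identifiers are assumptions, and the lists are assumed to be sorted and free of duplicates, as the slide notes.

```python
def intersect_sorted(left, right):
    """Intersection of two sorted, duplicate-free id lists in O(n) steps."""
    result = []
    i = j = 0
    while i < len(left) and j < len(right):
        if left[i] < right[j]:
            i += 1
        elif left[i] > right[j]:
            j += 1
        else:
            result.append(left[i])
            i += 1
            j += 1
    return result


# Hypothetical ids: objects reached through P1 and subjects that have P2.
p1_objects = [2, 4, 7, 9]
p2_subjects = [1, 4, 9, 12]
join_values = intersect_sorted(p1_objects, p2_subjects)
```

Having both orderings available for every predicate is what lets each side of the join arrive already sorted by the shared variable.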
Simple index and iterator
Subject → Object, Predicate
time: O(1)
[Figure: the resource (subject) identifier Sid = 5 selects a record in the subject index record table (entries 0 to n-1). The record holds the start and length of a fragment of the statement order index (entries 0 to s-1). The fragment lists Stc, Sta, Stb, which refer to the statement table (entries 0 to s-1): Sta = (S5, P2, O1), Stb = (S5, P1, O2), Stc = (S5, P1, O1).]
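The simple index on this slide can be sketched as three plain tables: the statement table, a statement order index sorted by (subject, object, predicate), and one record per subject holding the start and length of its fragment. This Python sketch uses assumed function names and is not the BRAHMS code; only the three statements for subject 5 come from the slide.

```python
def build_subject_index(statements, subject_count):
    """Return (order, records): `order` lists statement ids sorted by
    (subject, object, predicate); `records[s]` is [start, length] of the
    fragment of `order` that belongs to subject s."""
    order = sorted(
        range(len(statements)),
        key=lambda i: (statements[i][0], statements[i][2], statements[i][1]),
    )
    records = [[0, 0] for _ in range(subject_count)]
    for position, statement_id in enumerate(order):
        record = records[statements[statement_id][0]]
        if record[1] == 0:
            record[0] = position  # first statement of this subject
        record[1] += 1
    return order, records


def iterate_subject(statements, index, subject_id):
    """Direct lookup of the subject's record, then a walk over its fragment."""
    order, records = index
    start, length = records[subject_id]
    for position in range(start, start + length):
        yield statements[order[position]]


# (subject, predicate, object) ids: Sta, Stb, Stc from the slide plus one extra statement.
statements = [(5, 2, 1), (5, 1, 2), (5, 1, 1), (0, 1, 2)]
index = build_subject_index(statements, subject_count=6)
for_subject_5 = list(iterate_subject(statements, index, 5))
```

Finding the fragment costs one table access because subject identifiers are contiguous and index the record table directly.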
Composite index and iterator
Subject + Predicate → Object
use: Subject → Predicate, Object
time: O(log2 length)
[Figure: for Sid = 5 the subject index record gives the start and length of a fragment of the statement order index; a binary search for predicate P3 within that fragment locates the statements Sta = (S5, P3, O1), Stb = (S5, P3, O3), Stc = (S5, P3, O2) in the statement table.]
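A composite lookup can be sketched on top of the Subject → Predicate, Object ordering: the subject's fragment is found directly, and a binary search inside it finds the run of statements with the requested predicate. The Python below is an illustration with assumed function names, not the BRAHMS implementation; the extra statements beyond the three on the slide are made up.

```python
def build_spo_index(statements, subject_count):
    """Return (order, starts): statement ids sorted by (subject, predicate, object)
    and prefix sums so that subject s owns order[starts[s]:starts[s + 1]]."""
    order = sorted(range(len(statements)), key=lambda i: statements[i])
    starts = [0] * (subject_count + 1)
    for subject, _, _ in statements:
        starts[subject + 1] += 1
    for ident in range(subject_count):
        starts[ident + 1] += starts[ident]
    return order, starts


def first_not_below(statements, order, begin, end, predicate_id):
    """Binary search: first position in [begin, end) whose predicate is >= predicate_id."""
    while begin < end:
        middle = (begin + end) // 2
        if statements[order[middle]][1] < predicate_id:
            begin = middle + 1
        else:
            end = middle
    return begin


def objects_of(statements, index, subject_id, predicate_id):
    """Objects of all statements with the given subject and predicate."""
    order, starts = index
    begin, end = starts[subject_id], starts[subject_id + 1]
    first = first_not_below(statements, order, begin, end, predicate_id)
    last = first_not_below(statements, order, first, end, predicate_id + 1)
    return [statements[order[position]][2] for position in range(first, last)]


# Sta, Stb, Stc from the slide, followed by two hypothetical statements.
statements = [(5, 3, 1), (5, 3, 3), (5, 3, 2), (5, 1, 7), (2, 3, 9)]
index = build_spo_index(statements, subject_count=6)
objects = objects_of(statements, index, subject_id=5, predicate_id=3)
```

The search cost depends only on the length of the subject's fragment, which matches the O(log2 length) bound stated on the slide.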
Test results
• Tested datastores
– Jena (2.1)
– Sesame (1.1)
– Redland (1.0.0)
• As a testbed, algorithms for k-hop limited semantic
associations were used
– Depth-first search
– Bidirectional breadth-first search
• Datasets
– SWETO [1] – small [1.2 MB], medium [14 MB], big [255 MB]
– Lehigh University benchmark [2] – Univ(50, 0) [556 MB]
– Synthetic dataset [14 MB] for Business – Sports –
Entertainment ontology (generated with TOntoGen [3])
[1] Aleman-Meza et al., SWETO: Large-Scale Semantic Web Test-bed. In 16th International Conference on Software Engineering
and Knowledge Engineering (SEKE2004): Workshop on Ontology in Action (Banff, Canada, 2004).
[2] Guo, Y., Pan, Z. and Heflin, J., An Evaluation of Knowledge Base Systems for Large OWL Datasets. In Third International
Semantic Web Conference (Hiroshima, Japan, 2004), Springer, 274-288.
[3] http://lsdis.cs.uga.edu/projects/semdis/tontogen/
Test datasets
• SWETO
– dataset extracted from the Internet using Semagix
Freedom [1] technology
– edge distribution follows the distribution of links typical of the
Internet
• Lehigh University Benchmark
– synthetically generated dataset, from a small university
ontology
• Synthetic Business – Sports – Entertainment
– synthetically generated dataset from a combined business,
entertainment and sports ontology
– highly connected graph with a uniform distribution of
edges
[1] Freedom is a product of Semagix, http://www.semagix.com/
Test dataset statistics
Dataset name                    Instance statements   Instance nodes   Avg node degree   RDF file size
SWETO medium                    59,105                55,876           2.116             14 MB
SWETO big                       1,553,112             813,479          3.818             255 MB
Univ(50, 0), Lehigh University  3,298,813             1,082,818        6.093             556 MB
Bus–Sports–Ent, TOntoGen        45,000                29,889           3.011             13 MB
[Figure column: degree distribution of each dataset, log/log scale]
Results – memory usage
Memory usage for RDF file load [MB]
Datastore            Small SWETO   Big SWETO   Small Synthetic   Univ(50,0)
Jena                 112           1730        79                1828
Sesame               112           1477        112               1422
Redland, no IDX      147           2674        71                2432
Redland, IDX         210           2924        99                x
BRAHMS - Initial     32            356         31                509
BRAHMS - Load dump   20            270         10                501
x = out of memory
Results - timing
Search time on small SWETO, time [s]
Association length [relations]   9      10     11     12
DFS Jena                         77     104    174    x
DFS Sesame                       2      2      11     54
DFS Redland                      16     27     52     204
DFS BRAHMS                       0.5    0.5    3      7
bi-BFS Jena                      5      5      5.5    5.5
bi-BFS Sesame                    0.2    0.2    0.2    0.2
bi-BFS Redland                   0.1    0.1    0.1    0.2
bi-BFS BRAHMS                    0.1    0.1    0.1    0.1
Found paths                      47     61     61     289
x = out of memory
Results - timing
bi-BFS on synthetic Business-Sports-Entertainment, time [sec]
Association length [relations]   9       10        11          12
Jena                             12.8    39.9      59.3        847
Sesame                           1.8     11.9      25.7        386
Redland                          0.43    2.6       5.2         64.8
BRAHMS                           0.1     0.5       1.9         38
Found paths                      8559    131009    1680943     24392420
Time relative to BRAHMS at length 12: Jena x 22.29, Sesame x 10.16, Redland x 1.70
Results - timing
bi-BFS search on Univ(50, 0), time [s]
Association length [relations]   6       7        8         9           10
Jena                             x       x        x         x           x
Sesame                           17.3    41.8     86.5      726.2       28111
Redland                          x       x        x         x           x
BRAHMS                           0.1     0.8      2.28      22.41       309.9
Number of paths                  1,506   15,339   667,901   8,812,652   298,990,413
x = out of memory (Jena) or out of memory during load (Redland)
Results - timing
bi-BFS on Univ(10,0) - 100 MB file, time [sec]
Association length [relations]   5      6      7       8        9           10           11
Jena                             13     25     71      108      x           x            x
Sesame                           9.38   15     113     175      721         1442         18341
Redland                          0.12   0.19   1.63    3.26     20          49           427
BRAHMS                           0.01   0.02   0.21    0.47     3.89        23           336
Paths                            5      319    4,988   97,868   1,401,886   22,876,121   319,574,607
x = out of memory
Results - timing
bi-BFS search on Univ(700,0) - 6.5 GB file
Association length [relations]   4      5      6        7           8
BRAHMS time [sec]                0.02   0.15   0.33     46.42       308.87
Found paths                      32     205    94,152   1,271,857   314,116,239
[Figure: the chart plots time on a linear axis and found paths on a log scale]
BRAHMS - today and tomorrow
• Today
– Implemented (most of) SPARQL over BRAHMS
– BRAHMS successfully used as storage for
funded projects in the LSDIS lab
• Insider Threat project [1]
• Peer-to-Peer Semantic Association Discovery [2]
• Future
– create context representation in BRAHMS
– design and create a new querying model that
uses context and association discovery
[1] B. Aleman-Meza et al., An Ontological Approach to the Document Access Problem of Insider Threat, Proceedings of the IEEE
Intl. Conference on Intelligence and Security Informatics (ISI-2005), May 19-20, 2005
[2] M. Perry et al., "Peer-to-Peer Discovery of Semantic Associations", Second International Workshop on Peer-to-Peer Knowledge
Management, San Diego, CA, July 17, 2005
Thank you
SemDis project
http://lsdis.cs.uga.edu/project/semdis
BRAHMS page
http://lsdis.cs.uga.edu/project/semdis/brahms