Regions of Interest

  What’s in a ROI?
  Use cases
  Requirements
  Current Storage System
  Problems
  Alternative Storage

ROI
  Geometry
  Measurements
  ROI on Channel
  Annotations
    ▪ ROI
    ▪ Measurement
    ▪ Links

User created ROI
  Measurement tools
HCS generated ROI
  Automatic
External
  External analysis
  Particle Tracking
Other
  Templates
  ROIs without images

Human generated
  More interactions
    ▪ Merge, Propagate, Split, Delete
  Measurements
    ▪ Geometry
    ▪ Intensity
    ▪ Path
  ROI/ROI Links
  Tags mostly on ROI
  Write Many/Read Many

HCS Generated ROI
  Lots of ROI
  Attached to Channel
  Measurements Attached
    ▪ Multiple measurements
  Tags on ROI, Measurements
    ▪ Analysis, results and meta.
  Write Once, Read Many

External
  Tool can generate ROI (+ scripts)
  Can be tagged
  Links (ROI/ROI, ROI/Image)
  Results can be in any format

Other
  ROI need not be attached to image
  Template to define other ROI

N-Dimensional Data
  Storage of image data is simple
  ROI are more complex
    ▪ Database entry, file format
  We don’t just want to store in HDF

Database
  ROI
  ROI Annotations
PyTables
  Mask ROI
  Measurements

PyTables
  ROI are heterogeneous
  Concurrency
  Python behind a core service call
  Measurements are optimal
  Tagging is an issue
    ▪ Inside file
    ▪ Multiple annotations reported to be slow

Database
  ROI can be stored in database
  Mask data can be an issue
  Tagging in an RDBMS is not ideal
  Many more annotations than we’d like
  Link to external source for measurements

Key-Value Pair Stores
  Berkeley DB
  Project Voldemort
  Tokyo Cabinet
Document DB
  MongoDB
  CouchDB
Graph DB
  Neo4j
  InfoGrid
Table DB
  Cassandra
  Hypertable
  HBase

Other opinions on the storage solutions
  MongoDB vs CouchDB, Cassandra, ..
  CouchDB vs MongoDB
  Pros and cons of MongoDB
  Digg on Cassandra
  What is a supercolumn
  Cassandra talk
  Indexing nodes in Neo4j

Document Database
  NoSQL movement
  Schemaless
  No Tables
    ▪ Collections of like data
  No Joins
    ▪ Document is equivalent of row of data
    ▪ Distributed file system (GridFS)

Pros and Cons

Pros
  It has bindings to numerous languages (C++, C#, Java, Python, ...)
  Allows storage, indexing and linking of any user data
  Annotations are now very easy and efficient
  Has mechanisms for schema upgrade
  Dynamic queries
  Replication
  Sharding
  Map-Reduce framework
  Fast
  GridFS is a distributed file storage mechanism within Mongo
  Easy to install

Cons
  Schemaless; data integrity will need to be worked on
  Graph structures not inherently supported

Deployments
  SourceForge http://sourceforge.net/
  BusinessInsider http://www.businessinsider.com/
  New York Times http://www.nytimes.com/
  Disqus http://www.disqus.com/

Human Interaction
  Merge, Propagate, Split ✓
  Geometry ✓
  Intensity ✓
  Path ✓
  ROI/ROI Links ✓
  Tags ✓
HCS
  Many ROI ✓
  Tags on ROI ✓
  Tags on Measurement ✓
  Tables of Measurements ✓
Externally Generated
  Tags ✓
  ROI/ROI Links, ROI/Image Links
  Many formats, unknown types ✓
Other
  N-Dimensional ROI ✓
  Hierarchical Structures ✓

    from pymongo import MongoClient

    # MongoClient replaces the older Connection class in current PyMongo.
    connection = MongoClient()
    db = connection['databaseName']
    collection = db['collectionName']
    # One ROI document holding two Ellipse shapes, each with its own tags.
    collection.insert_one({
        "tags": [],
        "label": "MyROI",
        "shapes": [
            {"tags": [{"tag": "foo1", "namespace": "bob"}],
             "rx": 17, "ry": 17, "label": None, "cy": 75, "cx": 3,
             "t": 0, "z": 0, "type": "Ellipse", "id": 3},
            {"tags": [{"tag": "foo2", "namespace": "bob"}],
             "rx": 10, "ry": 16, "label": None, "cy": 82, "cx": 45,
             "t": 0, "z": 0, "type": "Ellipse", "id": 5},
        ],
        "type": "Roi",
        "id": 565,
    })

Find ROI with tag foofoo and shapes with tag foo1

    from pymongo import MongoClient

    connection = MongoClient()
    db = connection['databaseName']
    collection = db['collectionName']
    # Dotted keys reach into the embedded shape and tag documents.
    collection.find({"shapes.tags.tag": "foo1", "tags.tag": "foofoo"})

Find ROI shapes with tag containing mitosis

    import re
    from pymongo import MongoClient

    connection = MongoClient()
    db = connection['databaseName']
    collection = db['collectionName']
    # A compiled pattern is sent as a regular expression; a plain
    # '/.../i' string would only match that literal text.
    collection.find({"shapes.tags.tag": re.compile(".*mitosis.*", re.IGNORECASE)})

Graph Database
  Uses nodes to represent objects
  User specifies relationships between nodes
  Allows complex traversal of node structures

Pros
  Handles graph structures nicely
  Transactional
  Supported by Gremlin
  Native RDF
    http://components.neo4j.org/neordf-sail/
  Easy to install

Cons
  No C++ language binding
  Not distributed
  Tables are not so easily modeled
  Difficult to query on node contents

Deployments
  The Swedish Defence forces http://www.mil.se
  Windh Technologies http://www.windh.com
  Flextoll http://www.flextoll.se

    // Relationship types used to link OMERO objects in the graph.
    public enum OMERORelations implements RelationshipType {
        ASSOCIATE, DERIVE, AGGREGATE, COMPOSE
    }

    // Node for the source image.
    Node image = neo.createNode();
    image.setProperty("IObject", imageI);
    image.setProperty("id", imageI.getId().getValue());
    image.setProperty("name", imageI.getName().getValue());

    // Node for the image derived from it.
    Node derivedImage = neo.createNode();
    derivedImage.setProperty("IObject", derivedImageI);
    derivedImage.setProperty("id", derivedImageI.getId().getValue());
    derivedImage.setProperty("name", derivedImageI.getName().getValue());

    // Link the two nodes and record how the derived image was produced.
    Relationship relationship = image.createRelationshipTo(
        derivedImage, OMERORelations.DERIVE);
    relationship.setProperty("type", "ROI");
    relationship.setProperty("operation", "crop");
    relationship.setProperty("roi", cropRoiI);

Human Interaction
  Merge, Propagate, Split ✓
  Geometry
  Intensity
  Path ✓
  ROI/ROI Links ✓
  Tags
HCS
  Many ROI ✓
  Tags on ROI ✓
  Tags on Measurement ✓
  Tables of Measurements
Externally Generated
  Tags ✓
  ROI/ROI Links, ROI/Image Links ✓
  Many formats, unknown types
Other
  N-Dimensional ROI
  Hierarchical Structures ✓

Cassandra

Based on Google’s BigTable design: a complex implementation of a key/value store that represents a table. A sophisticated toolset is required to get the most out of this solution; for instance, Google created Sawzall to query its system, and Digg has released a Python library for working with Cassandra called LazyBoy. It works by creating a table whose columns are grouped into column families, so that like data (for example, Ellipse ROI) lives in the same column family.
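The column-family layout described above can be pictured as nested maps: row key, then column family, then column. The following is a minimal in-memory sketch in Python, not Cassandra's API. The row-key format, the family names ("Ellipse", "Rectangle", "Tags") and the helper functions are assumptions made for illustration; the ellipse values are borrowed from the MongoDB example.

```python
from collections import defaultdict


def make_table():
    # Hypothetical model: table[row_key][column_family][column] = value
    return defaultdict(lambda: defaultdict(dict))


def put(table, row_key, family, column, value):
    """Store one column value; rows and families are created on demand."""
    table[row_key][family][column] = value


def get_family(table, row_key, family):
    """Return a copy of the columns of one family in one row ({} if absent)."""
    # .get() is used so that a lookup never creates empty rows or families.
    return dict(table.get(row_key, {}).get(family, {}))


rois = make_table()

# Like data lives in the same column family: all ellipse geometry
# goes under "Ellipse".
for column, value in {"cx": 3, "cy": 75, "rx": 17, "ry": 17}.items():
    put(rois, "roi:565:shape:3", "Ellipse", column, value)

# A second family on the same row holds its tags (tag -> namespace).
put(rois, "roi:565:shape:3", "Tags", "foo1", "bob")

# A different row can use entirely different columns and families.
put(rois, "roi:566:shape:1", "Rectangle", "x", 10)
put(rois, "roi:566:shape:1", "Rectangle", "width", 40)

print(get_family(rois, "roi:565:shape:3", "Ellipse"))
```

Reading one family of one row is a direct key lookup in this model, whereas anything that cuts across rows (for example, all shapes carrying a given tag) has no direct path, which is consistent with the querying concerns raised for table stores.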
Pros
  Quick
  Handles heterogeneous data well
  Can manage distributed data
  Different rows can have different columns
  Map/Reduce
  Focus on writes, not reads
  Scales nicely
  Easy to install

Cons
  Not simple to work with
  Building hierarchical structures
  Sorting
  Querying
    ▪ Ad hoc queries are bad; Digg still uses MySQL for certain queries.
  Have to manage secondary indexes (K/V)
  Version 0.5

Deployments
  Facebook (maybe) http://www.facebook.com
  Digg http://www.digg.com

Human Interaction
  Merge, Propagate, Split ✓
  Geometry ✓
  Intensity ✓
  Path
  ROI/ROI Links
  Tags ✓
HCS
  Many ROI ✓
  Tags on ROI ✓
  Tags on Measurement ✓
  Tables of Measurements ✓
Externally Generated
  Tags ✓
  ROI/ROI Links, ROI/Image Links ✓
  Many formats, unknown types
Other
  N-Dimensional ROI ✓
  Hierarchical Structures

Hypertable

An implementation of Google’s BigTable: a complex implementation of a key/value store that represents a table. A sophisticated toolset is required to get the most out of this solution; for instance, Google created Sawzall to query its system. Hypertable has a query language called HQL. It works by creating a table whose columns are grouped into column families, so that like data (for example, Ellipse ROI) lives in the same column family.
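Both table stores in this deck list secondary indexes (or secondary keys) among their cons. When rows can only be fetched by key, a lookup such as "all ROI carrying a given tag" needs an index that the application writes and keeps in step itself. The following is a minimal sketch of that bookkeeping using plain Python dictionaries; the class `TaggedRoiStore`, its methods and the row keys are hypothetical and do not correspond to any Cassandra or Hypertable API.

```python
from collections import defaultdict


class TaggedRoiStore:
    """Toy key/value store with a hand-maintained tag index (hypothetical)."""

    def __init__(self):
        self.rows = {}                     # primary data: row key -> columns
        self.tag_index = defaultdict(set)  # secondary index: tag -> row keys

    def put(self, row_key, columns, tags=()):
        tags = tuple(tags)
        # Drop stale index entries before the row is overwritten.
        for old_tag in self.rows.get(row_key, {}).get("tags", ()):
            self.tag_index[old_tag].discard(row_key)
        self.rows[row_key] = dict(columns, tags=tags)
        # Every write to the row is paired with writes to the index.
        for tag in tags:
            self.tag_index[tag].add(row_key)

    def find_by_tag(self, tag):
        """Return the row keys currently carrying the tag, sorted."""
        return sorted(self.tag_index.get(tag, ()))


store = TaggedRoiStore()
store.put("roi:565", {"type": "Ellipse"}, tags=["foo1", "mitosis"])
store.put("roi:566", {"type": "Ellipse"}, tags=["mitosis"])
print(store.find_by_tag("mitosis"))

# Re-tagging a row has to update the index as well.
store.put("roi:565", {"type": "Ellipse"}, tags=["foo2"])
print(store.find_by_tag("mitosis"))
```

The point of the sketch is the cost: each tag change turns into several writes, and the index and the data can drift apart if one of those writes is missed, which is what "have to manage secondary indexes" amounts to in practice.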
Pros
  Quick
  Handles heterogeneous data well
  Different rows can have different columns
  Can manage distributed data
  Map/Reduce
  Scales nicely
  Easy to install

Cons
  GPL License
  Building hierarchical structures
  Docs are weak
  HQL works for simple queries only
  Map/Reduce for other work
  Limit of 255 column families
  Secondary keys

Deployments
  Rediff http://www.rediff.com
  Zvents http://www.zvents.com/

Human Interaction
  Merge, Propagate, Split ✓
  Geometry ✓
  Intensity ✓
  Path
  ROI/ROI Links
  Tags ✓
HCS
  Many ROI ✓
  Tags on ROI ✓
  Tags on Measurement ✓
  Tables of Measurements ✓
Externally Generated
  Tags ✓
  ROI/ROI Links, ROI/Image Links ✓
  Many formats, unknown types
Other
  N-Dimensional ROI ✓
  Hierarchical Structures

Why do we have an RDBMS?
  We don’t normalise the data
  Each import will normalise on:
    ▪ Image, ObjectiveSettings, LogicalChannel, LightSettings, DetectorSettings.
  Object Penalty
  Difference between normalisation and view