Neuroimaging Databases: A Data Engineering Perspective Amarnath Gupta University of California San Diego

Three Queries
Emp(eID, name, degree, salary).
Project(pID, start_date, end_date, status).
Dept(dID, name, mgrID).
Works_For(pID, eID, location).
1. Which employees have a Ph.D. degree and work in the San Francisco office?
    select E.eID
    from emp E, works_for W
    where E.degree = 'Ph.D' and
          E.eID = W.eID and
          W.location = 'San Francisco'
2. Find a pair of employees who always work on the same project in the same location?
    select E1.eID, E2.eID
    from emp E1, emp E2, works_for W1, works_for W2
    where E1.eID = W1.eID and
          E2.eID = W2.eID and
          W1.pID = W2.pID and
          W1.location = W2.location and
          E1.eID != E2.eID
3. In La Jolla SEARS, find all employees E who earn more than the average manager’s salary (over all departments), and list the managers M who earn less than E.
    select E.eID, M.eID
    from emp E, emp M, dept D
    where E.salary > (
              select avg(salary)
              from emp E2, dept D2
              where E2.eID = D2.mgrID
          ) and M.eID = D.mgrID and
          E.salary > M.salary
    group by E.eID
IMAGE03, Edinburgh
Now Try These Queries
1. In mice, which ‘calcium binding’ proteins are found in the
brain region ‘hippocampus’?
2. Find protein pairs that act as voltage-gated channels
and are always co-localized in the region “cerebellum”.
3. In mouse-strain X, find all brain regions R which
express more α-synuclein than the average α-synuclein
expression level over all other brain regions, and list the
brain regions S that express less α-synuclein than R.
A. Why are these queries inherently harder?
B. Why is it a very hard task to build systems that
would answer queries like these and produce
scientifically valid results?
The Data Modeling Problem
Lack of disciplined abstraction in
modeling the data
Large Scale Brain Maps
• Custom high-precision montaging stage
• 40 X 30 image panels
• 40X 1.3 oil objective
• 800 Mb full resolution TIFF
The Molecular Distribution Case
• Protein localization queries
– Which proteins are found more in the granule cell layer
than in the Purkinje cell layer?
– Are proteins P1 and P2 always co-localized, sometimes
co-localized or never co-localized in the cerebellum?
– Which proteins follow the distribution pattern CA1 >
(basal ganglia ~ deep cerebellar nuclei) > CA3 ?
• The abstract model
– Array Data Model (Libkin, Machlin, Wong 1996)
– Histogram Data Model (Santini, Gupta 1999)
• A molecular distribution can be modeled as a “block histogram”
where the “base dimensions” are in R^2 (or R^3)
• A cell in the histogram can contain a tuple (or a vector) of
aggregate values
Block Histogram as an ADT
Abstract Data Types
type image { id: identifier,
             picture: blob,
             regions: set(region),
             color_block_histogram: 2Darray(histogram)
};
type region { label: string,
              shape: polygon
};
type histogram { variable_name: string,
                 value: 1Darray(bucket)
};
type bucket { start_bucket: integer,
              end_bucket: integer,
              count: integer
};
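The abstract data types above can be mirrored as plain Python dataclasses. This is only an illustrative sketch: the polygon encoding (a list of vertices), the `total` helper, and the sample values are assumptions, not part of the original model.

```python
from dataclasses import dataclass
from typing import List, Tuple


@dataclass
class Bucket:
    start_bucket: int
    end_bucket: int
    count: int


@dataclass
class Histogram:
    variable_name: str
    value: List[Bucket]  # 1D array of buckets

    def total(self) -> int:
        # Hypothetical helper: total count over all buckets.
        return sum(b.count for b in self.value)


@dataclass
class Region:
    label: str
    shape: List[Tuple[float, float]]  # polygon as a vertex list (assumed encoding)


@dataclass
class Image:
    id: str
    picture: bytes  # stands in for the blob
    regions: List[Region]
    color_block_histogram: List[List[Histogram]]  # 2D array of histograms


# Toy instance: one image whose block histogram is a 1 x 1 grid.
h = Histogram("protein_amt", [Bucket(0, 9, 3), Bucket(10, 19, 5)])
img = Image(
    id="img-001",
    picture=b"",
    regions=[Region("CA1", [(0.0, 0.0), (1.0, 0.0), (1.0, 1.0)])],
    color_block_histogram=[[h]],
)
```

Each cell of the 2D array holds a full histogram, which is what lets a later query aggregate over an arbitrary spatial cut.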
Querying Block Histograms
• Which proteins follow the distribution pattern CA1 >
(basal ganglia ~ deep cerebellar nuclei) > CA3 ?
– cut: histogram × polygon → histogram
– agg: agg_func × histogram × attribute_name → number
– sim_dist: number × number → number
select protein
from brain_level_protein_distributions D, mouse_atlas M
where a1 is agg(avg, D.pd_hist.cut(M.ca1_poly), protein_amt) and
      a2 is agg(avg, D.pd_hist.cut(M.bg_poly), protein_amt) and
      a3 is agg(avg, D.pd_hist.cut(M.dcn_poly), protein_amt) and
      a4 is agg(avg, D.pd_hist.cut(M.ca3_poly), protein_amt) and
      sim_dist(a1, a2) > 0.2 and sim_dist(a2, a3) < 0.1 and
      sim_dist(a3, a4) > 0.2
Similar models on Volumes and Surfaces are being developed
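The three operators can be sketched in Python over a toy block histogram. Everything concrete here is an assumption made for illustration: the histogram is a dict from grid cell to attribute values, a cell belongs to a cut when its center lies inside the polygon, and `sim_dist` is taken to be a signed relative difference (the slides do not define it).

```python
from typing import Callable, Dict, List, Sequence, Tuple

Cell = Tuple[int, int]                      # (row, col) of a block
BlockHistogram = Dict[Cell, Dict[str, float]]
Polygon = Sequence[Tuple[float, float]]     # (x, y) vertices


def point_in_polygon(x: float, y: float, poly: Polygon) -> bool:
    # Standard ray-casting test.
    inside = False
    n = len(poly)
    for i in range(n):
        x1, y1 = poly[i]
        x2, y2 = poly[(i + 1) % n]
        if (y1 > y) != (y2 > y):
            x_cross = x1 + (y - y1) * (x2 - x1) / (y2 - y1)
            if x < x_cross:
                inside = not inside
    return inside


def cut(hist: BlockHistogram, poly: Polygon) -> BlockHistogram:
    # Keep the cells whose center falls inside the polygon (assumed rule).
    return {(r, c): v for (r, c), v in hist.items()
            if point_in_polygon(c + 0.5, r + 0.5, poly)}


def avg(values: List[float]) -> float:
    return sum(values) / len(values)


def agg(func: Callable[[List[float]], float],
        hist: BlockHistogram, attribute_name: str) -> float:
    values = [cell[attribute_name] for cell in hist.values()]
    if not values:
        raise ValueError("cut produced an empty histogram")
    return func(values)


def sim_dist(a: float, b: float) -> float:
    # Assumed definition: signed difference relative to the larger magnitude.
    denom = max(abs(a), abs(b))
    return 0.0 if denom == 0 else (a - b) / denom


# Toy 4 x 4 histogram: left half holds 10.0, right half holds 4.0.
hist = {(r, c): {"protein_amt": 10.0 if c < 2 else 4.0}
        for r in range(4) for c in range(4)}
left_poly = [(0, 0), (2, 0), (2, 4), (0, 4)]
right_poly = [(2, 0), (4, 0), (4, 4), (2, 4)]

a_left = agg(avg, cut(hist, left_poly), "protein_amt")
a_right = agg(avg, cut(hist, right_poly), "protein_amt")
left_dominates = sim_dist(a_left, a_right) > 0.2
```

With this signed reading, a threshold such as `> 0.2` expresses "clearly more than", while an approximate-equality test would compare the absolute value against a small bound.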
The Representation Selection Problem
Often multiple representations of
the data are created for different
purposes, but the queries are over
the “generic” data
Surface Representations
Neuroscientists use different representations
of the cortex surfaces for different purposes
Fiducial representation: as-exact-as-possible
representation of the cortex, with all the
folds and the creases of the actual surface.
Allows the measurement of all geometric
quantities of interest, including differential
properties (Gaussian curvature, ...),
but most quantities are difficult to
compute, as they require the
integration of the local properties
of the surface.
Spherical map: the cortex can be
projected on the surface of a sphere in
a way that preserves (approximately) the
distances between points. This representation
affords the efficient computation of
distances, areas, and topological relations, but
not of properties related to the curvature of
the surface.
Flat map: preserves the area of the regions,
but introduces cuts so that distances and
topological properties can’t be computed.
All these representations are
stored in the database, but
scientists ask questions on a
conceptual model based on the
fiducial representation. How can
we rewrite the query to make
optimal use of the available
representations?
Configuration
<RewriteConfiguration>
  <!-- Declaration of the types that can be replaced -->
  <ReplaceTypes>
    <Type name="Cortex"/>
    <Type name="Spherical"/>
  </ReplaceTypes>
  <!-- Conversion of attributes between representations -->
  <AttributeConversionTable>
    <ConversionSpec type="Cortex">
      <Attribute name="A">
        <Translation type="Spherical">$.Q</Translation>
        <Translation type="Flat">$.A</Translation>
      </Attribute>
    </ConversionSpec>
    <ConversionSpec type="Spherical">
      <Attribute name="AREA">
        <Translation type="Flat">$.A*2</Translation>
      </Attribute>
    </ConversionSpec>
  </AttributeConversionTable>
  <!-- Function-type table: for each function of the geometric data
       cartridge, lists the various representations, and the feasibility
       of computing that function with the given data type.
       Feasibility=0 means that the function can't be computed with
       data of that type. -->
  <FunctionParameterTable>
    <Function name="Area">
      <Type name="Cortex" feasibility="2"/>
      <Type name="Spherical" feasibility="5"/>
      <Type name="Flat" feasibility="8"/>
    </Function>
    <Function name="Connectivity">
      <Type name="Cortex" feasibility="5"/>
      <Type name="Spherical" feasibility="5"/>
      <Type name="Flat" feasibility="0"/>
    </Function>
  </FunctionParameterTable>
  <!-- Query rewriting strategy -->
  <Strategy>
    <Step type="replacement" threshold="2"/>
    <Step type="consolidation" />
  </Strategy>
</RewriteConfiguration>
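As a rough illustration of how a rewriter might consume the function-type table, the sketch below parses a fragment of the configuration with Python's standard `xml.etree.ElementTree` and ranks the representations usable for each function. The helper names `load_feasibility` and `usable_representations` are hypothetical and not part of the system described here.

```python
import xml.etree.ElementTree as ET

# Fragment copied from the configuration's function-type table.
CONFIG_FRAGMENT = """
<FunctionParameterTable>
  <Function name="Area">
    <Type name="Cortex" feasibility="2"/>
    <Type name="Spherical" feasibility="5"/>
    <Type name="Flat" feasibility="8"/>
  </Function>
  <Function name="Connectivity">
    <Type name="Cortex" feasibility="5"/>
    <Type name="Spherical" feasibility="5"/>
    <Type name="Flat" feasibility="0"/>
  </Function>
</FunctionParameterTable>
"""


def load_feasibility(xml_text):
    """Return {function name: {representation: feasibility}}."""
    root = ET.fromstring(xml_text.strip())
    table = {}
    for fn in root.findall("Function"):
        table[fn.get("name")] = {
            t.get("name"): int(t.get("feasibility"))
            for t in fn.findall("Type")
        }
    return table


def usable_representations(table, function):
    """Representations with feasibility > 0, most feasible first."""
    scores = table[function]
    usable = [rep for rep, f in scores.items() if f > 0]
    # sorted() is stable, so ties keep their order from the configuration.
    return sorted(usable, key=lambda rep: -scores[rep])


table = load_feasibility(CONFIG_FRAGMENT)
area_choices = usable_representations(table, "Area")
connectivity_choices = usable_representations(table, "Connectivity")
```

Here `Flat` drops out of `connectivity_choices` because its feasibility is 0, matching the rule that such a function cannot be computed on that representation.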
Variable replacement-step 1
VLDB 2002
[Figure: query tree (select / from / where) before and after the rewrite. After the rewrite the new variable a appears in the query, and the where clause holds b=a AND F(b).]
Insertion of the new variable a
Variable replacement-step 2
[Figure: query tree (select / from / where) before and after the rewrite. The conjunct b=a is kept, and F(b) becomes F(replace(a)).]
During consolidation, every other function that can be
efficiently computed using the variable a (which has already
been inserted) will be computed using it.
Scenario
R(α, β)
Replace if:
The current representation has efficiency less than α AND there is a representation with
efficiency at least β

                Fiducial   Spherical   Flat
Gauss              6           8         1
Area               2           5         8
Connectivity       5           6         0

Query:
select *
from Cortex c
where (Connectivity(c.TOPO) = 2 AND Gauss(c.PEAKS) < 2)
AND Area(c.PTS) < 100
Strategy 1:
1. R(3,3): Area -> Flat
Strategy 2:
1. R(8,8): Gauss -> Spherical, Area->Flat
Strategy 3:
1. R(8,8): Gauss -> Spherical, Area->Flat
2. C:
3. R(6,6): Connectivity -> Spherical
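The replacement rule can be checked against the three strategies with a few lines of Python. This is a sketch under two assumptions: when a replacement fires, the most efficient representation is chosen (the rule itself only requires that one with efficiency at least β exists), and the consolidation step C is omitted.

```python
# Efficiency table from the scenario: function -> representation -> score.
EFFICIENCY = {
    "Gauss":        {"Fiducial": 6, "Spherical": 8, "Flat": 1},
    "Area":         {"Fiducial": 2, "Spherical": 5, "Flat": 8},
    "Connectivity": {"Fiducial": 5, "Spherical": 6, "Flat": 0},
}


def replace_step(assignment, alpha, beta):
    """Apply R(alpha, beta) to a {function: representation} assignment."""
    result = dict(assignment)
    for func, current in assignment.items():
        scores = EFFICIENCY[func]
        best = max(scores, key=scores.get)  # assumed tie-break: best score
        if scores[current] < alpha and scores[best] >= beta:
            result[func] = best
    return result


# The query is posed on the fiducial (Cortex) representation.
start = {"Gauss": "Fiducial", "Area": "Fiducial", "Connectivity": "Fiducial"}

strategy1 = replace_step(start, 3, 3)
strategy2 = replace_step(start, 8, 8)
strategy3 = replace_step(replace_step(start, 8, 8), 6, 6)
```

Under R(8,8) Connectivity stays on the fiducial representation because no alternative reaches efficiency 8; only the later R(6,6) step moves it to the spherical map.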
A Thought
• Multimedia databases advocated the need to
query by features and k-NN queries
• The mainstream DBs haven’t quite “bought”
the idea of features
• Is this the time to think how attribute-value
based querying and feature-based querying
would work together?
The Semantic Rewriting Problem
The user prefers to query on a high-level
schema (remember “conceptual query
languages”?)
So the system should rewrite the query on the
logical schema but the rewriting should be
semantically sound
A Deception
• The scientific question
– Are proteins P1 and P2 always co-localized, sometimes
co-localized or never co-localized in the cerebellum?
• The database queries
– Find all images I such that
• anatomic structure A is observed in I
• A is cerebellum OR part-of(A, cerebellum)
• R_P1 is a region where P1 is found in I
• R_P2 is a region where P2 is found in I
• boundary(A) overlaps boundary(R_P1)
• boundary(A) overlaps boundary(R_P2)
– Find all images I1, I2 such that
• anatomic structure A1 is observed in I1
• anatomic structure A2 is observed in I2
• A1 is cerebellum OR part-of(A1, cerebellum)
• part-of(A2, A1)
• R_P1 is a region where P1 is found in I1
• R_P2 is a region where P2 is found in I2
• boundary(A1) overlaps boundary(R_P1)
• boundary(A2) overlaps boundary(R_P2)
– Count the number of images I
– Similarly find other images where
• P1 is present but P2 is not in the same regions
– Report the ratios
An External “Knowledge Source”
ANATOM Domain Map
SSDBM 2000
Using the Ontology
• SMOP – a simple matter of (query) planning?
– Rewrite the query with the ontology source O, and write a rule to
execute the O.part_of predicate first
• Semantic Correctness
– Purkinje cells are part of the cerebellum
– dendrite is a compartment of the (generic) neuron
– Should the images be selected if
• Image I has P1, P2 in a region marked ‘dendrite’ ?
• Image I has P1 in a region labeled ‘dendrite’ and P2 in a different
region also marked ‘dendrite’?
• Image I1 has P1 in a region marked ‘Purkinje Cell’ and I2 has P2 in a
region marked ‘Purkinje cell dendrite’?
• Image I1 has P1 in a region marked ‘SER’ and P2 in a region marked
‘Spine’, both covered by a larger region marked ‘dendrite’?
• How can these cases be automatically taken care of in
the query rewriting process?
The Ontology Search Problem
(aside from the subsumption problem)
The Ontology can be viewed as a large graph
where the edges denote relations. These edges
may have many labels with widely different
semantics. We need to perform meaningful
graph-search over them.
Graph-Structured Knowledge Sources
• Taxonomies are often directed and acyclic
– Querying labeled graphs
• A large fragment of the ontologies we encounter are DAGs
where edges are often transitive
• We represent DAGs in a relational structure
– Each node carries its DFS traversal numbers
– Ancestor and Descendant operations become range queries
– Left biased Numbering scheme
» Merge nodes: have pointers to all parents
» Other nodes: have pointers to leftmost parents
» Parent pointers carry edge labels
– Path Expressions are evaluated using an extension of the
PathStack algorithm (Srivastava et al, 2001)
» Adds linear (in the number of variables of the path
expression) complexity over PathStack
Current 2003
What about more general graphs?
What about graphs where the edge labels have specific semantics?
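The idea that DFS traversal numbers turn ancestor and descendant tests into range checks can be sketched as follows. The example handles only the tree case; merge nodes with several parents, which the left-biased scheme addresses, are not modeled, and the toy part-of hierarchy is purely illustrative.

```python
def dfs_number(children, root):
    """Map each node to (pre-order number, largest pre-order number in its subtree)."""
    intervals = {}
    counter = 0

    def visit(node):
        nonlocal counter
        start = counter
        counter += 1
        for child in children.get(node, []):
            visit(child)
        intervals[node] = (start, counter - 1)

    visit(root)
    return intervals


def is_descendant(intervals, node, ancestor):
    # A proper descendant's number lies strictly inside the ancestor's range.
    a_start, a_end = intervals[ancestor]
    return a_start < intervals[node][0] <= a_end


def descendants(intervals, ancestor):
    # In a relational encoding this would be a range query on the numbers.
    a_start, a_end = intervals[ancestor]
    return sorted(n for n, (s, _) in intervals.items() if a_start < s <= a_end)


# Toy part-of tree (illustrative only).
children = {
    "brain": ["cerebellum", "hippocampus"],
    "cerebellum": ["granule cell layer", "Purkinje cell layer"],
    "hippocampus": ["CA1", "CA3"],
}
intervals = dfs_number(children, "brain")
```

Stored as two integer columns per node, the intervals let a query such as "all parts of the hippocampus" run as a simple range predicate instead of a recursive traversal.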
Modeling Interactions
(Towards a “Disease Map”)
• An interaction in a graph is
– A labeled edge
• regulates(A,B)
– A parameterized edge
• regulates(up)(A,B)
– The specialization of an edge
• activates(A,B,phosphorylation)::regulates(A,B)
– A conditional edge
• inhibits(A,B,deacetylation) ← binds_to (C,A) ∧
exists((low(nitrogen)):condition)
– A complex edge
• inhibits(binding(A,B), binding(C,D))
– A state transition
• releases(Byck1p,Tpk1p)
PRECOND
bound(Byck1p,Tpk1p)
THEN
binds_to (cAMP, Byck1p)
POSTCOND
bound(Byck1p,cAMP)
AND
free(Tpk1p)
– …
[Figure: small graph drawings of each edge type: A regulates B; A regulates(up) B; A activates B as a specialization of A’ regulates B’; a conditional inhibits edge with a binds to edge; inhibits(proc, proc) between two binds to edges; F(A,B,e).]
The Feasible Rewriting Problem
If sources admit limited access patterns, can
feasible plans be constructed?
A Touch of Theory
(Nash and Ludäscher, 2003)
• Web sources, functions and web services can be
modeled as relations with limited access patterns
• Planning an arbitrary Union of Conjunctive
Queries (UCQ) with negation
– Checking feasibility is equivalent to checking
containment for UCQ and is hence Π^P_2-complete
– Plan computation for UCQ queries can be
approximated by producing an underestimate and an
overestimate of the query and deferring the
feasibility check
– Complete answers can be obtained even if the parts of
the plan are not answerable
• partial results are produced when some of the conjuncts are
feasible
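To make limited access patterns concrete, the sketch below tests whether the atoms of one conjunctive query can be ordered so that every input argument is bound before it is needed. Note the hedge: this checks orderability only, which is a sufficient but not necessary condition for feasibility, and it is not the planning algorithm of Nash and Ludäscher. The source names `atlas` and `cut` and their patterns are hypothetical.

```python
def executable_order(atoms, patterns):
    """Greedily order atoms so that all input positions are bound.

    atoms:    list of (relation, args), with args a tuple of variable names
    patterns: relation -> string with 'i' (input) or 'o' (output) per position
    Returns the ordered list of atoms, or None if no such order exists.
    """
    bound = set()
    remaining = list(atoms)
    order = []
    progress = True
    while remaining and progress:
        progress = False
        for atom in remaining:
            rel, args = atom
            inputs = [a for a, p in zip(args, patterns[rel]) if p == "i"]
            if all(a in bound for a in inputs):
                order.append(atom)
                bound.update(args)       # a call binds all of its variables
                remaining.remove(atom)
                progress = True
                break
    return order if not remaining else None


# Hypothetical sources: the atlas can be scanned freely, while the
# histogram cutter needs a polygon as input.
patterns = {"atlas": "oo", "cut": "io"}
query = [("cut", ("P", "H")), ("atlas", ("R", "P"))]

plan = executable_order(query, patterns)
no_plan = executable_order([("cut", ("P", "H"))], patterns)
```

The first query is orderable by calling `atlas` before `cut`; the second has no way to bind `P`, so no ordering of its atoms is executable.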
The Execution Planning Problem
Remote, Distributed Functions, and
Data Movement
(where Data Engineering meets the
Grid Environments)
Planning Queries with Functions
where a1 is agg(avg, D.pd_hist.cut(M.ca1_poly), protein_amt) and …
sim_dist(a1, a2) > 0.2 and …
Standard Mediator
• X0 ← select ca1_poly
from M @AtlasSource
• X1 ← D.pd_hist.cut(X0)
@Datacutter
• a1 ← avg(X0,
protein_amt)
@Mediator
• temp_store(a1)
@MediatorStore
Distributed System over the Grid
• Create transaction T1(
X0 ← select ca1_poly from M
@AtlasSource
Store X0 into $V1 @ AtlasWrapper)
• Create transaction T2(
X1 ← D.pd_hist.cut(fetch(X0, $T0))
@Datacutter
Store X1 into $V2 @TempStore)
• a1 ← avg(X0, protein_amt)
• temp_store(a1) @MediatorStore
Planning Queries with Functions
Distributed System over the Grid with GridService Catalog
• Create transaction T1(
X0 ← select ca1_poly from M @AtlasSource
Store X0 into $V1 @ AtlasWrapper)
• Create transaction T2(
ServiceCatalog.lookup(histogram_cutting_service, $resource,
$paramList)
R1 ← constructRequest ((X1 ← D.pd_hist.cut(fetch(X0, $T0))),
$resource, $paramList)
X1 ← ExecuteRequest(R1))
• Create transaction T3(
S1 ← getSize(X1)
ServiceCatalog.lookup(dataStorageService, S1, $resource,
$params2)
R2 ← constructRequest ((Store X1 into $V2), $resource,
$params2))
How do you plan (and cost estimate) the operations?
The “Goodness of Result” Problem
The query retrieves information from the
information sources.
The Result Processor may need to estimate the
“quality” of the results with respect to a
reference
Two Viewpoints
• The application person
– Send the result retrieved
• Case 1
– To a statistical package and compute standard statistics S1…Sk
• Case 2
– To a program that generates a specialized random set of data
and matches the statistical significance of the retrieved results
• The database person
– For these applications
• Can we perform the queries on a sample rather than the entire data?
Any guidelines on the sampling method?
• Can we use approximations instead of producing exact answers?
• Should we find only “interesting” or “most frequent” data by using
data mining algorithms?
• Can we package the descriptive statistics that a DBMS can compute
to make the overall work more efficient?
• Can the use of user-defined aggregates (cf. ATLAS project at UCLA)
help eliminate the statistical package?
In Essence
• A tour of a few “database-y” problems we have
encountered so far in our work with
Neuroimaging and associated information
– Still scratching the surface of most problems
• The help of forward-thinking domain scientists
has been the most crucial asset in figuring out
the problems at a deeper-than-usual level
• We, the database scientists, need to be “cross-thinkers” who
venture beyond our own domain of
specific expertise to develop a holistic approach
to these problems
• There are many more exciting problems – let’s go
get them!!
Acknowledging
• Maryann Martone
– who always asks hard questions I don’t know how to answer
• Bertram Ludäscher
– who has finally convinced me that “theory” is more practical than
I thought
• Simone Santini
– the feature-man who (almost) always wins the argument on any
technical matter
• Animesh Ray
– the geneticist, who is forcing me to learn and think about
process interactions and models of complex phenomena
• Mark Ellisman
– the godfather who excels at making offers we can’t refuse
• The staff and students who make it happen