The DCC Research Agenda Peter Buneman Digital Curation Centre

Digital Curation Centre
The DCC Research Agenda
Peter Buneman
Research Director
Digital Curation Centre
and
Professor of Database Systems
School of Informatics
University of Edinburgh
The Digital Curation Centre
“… to provide a focus for research into curation issues and to
promote expertise and good practice for the management of all
cultural, scholarly and research outputs in digital format.”
[edited mission statement]
• UK funding from JISC and EPSRC e-science programme
• Partners:
– University of Edinburgh (leader)
– University of Bath (UKOLN)
– CCLRC
– University of Glasgow
• Research led by Edinburgh Database Group
Organisation to Engage & Collaborate
[Diagram labels: curation organisations (e.g. DPC); communities of practice: users; community support & outreach; Collaborative Associates Network of Data Organisations; service definition & delivery; management & admin support; research; research collaborators; development; co-ordination; testbeds & tools; industry; standards bodies]
What is Digital Curation?
• Preserving stuff?
– Librarians and archivists
– Scientists (with huge amounts
of regular experimental data)
• Publishing stuff?
– Publishers of “reference” data:
compendia, dictionaries,
bibliographies, gazetteers, etc.
– Scientists (with lots of complex
annotated data)
Both communities call themselves “curators” but at first
sight they have almost orthogonal concerns
Their concerns look orthogonal, but…
• Shouldn’t the “publishers” be
concerned about the long-term
usefulness of their findings?
• The “preservers” do more than
preserve – they classify and annotate.
– Shouldn’t they publish (and preserve)
their own work?
As you dig deeper you find that there is a lot of
commonality.
Database Technology is Central
Much/most scientific data is now in databases
• They often do not contain source experimental data.
Sometimes just annotation/metadata
• They borrow extensively from, and refer to, other
databases
• You are now judged by your databases as well as your
(paper) publications!!
• These databases are built and maintained with a great
deal of human or computational effort.
What makes a database?
– it has internal structure or it changes.
Size alone doesn’t qualify, but data formats do!
The DCC Research Agenda
• Data integration and publishing
– Slowly coming to market. Publishing in community formats is a new twist
• Annotation
– Everybody agrees this is important. No-one understands it.
• Metadata extraction
– Semantic or otherwise, it’s a key part of annotation
• Archiving and Appraisal
– What do we do about databases – they change!
• Legal issues
– Can we at least help to clarify what is going on?
• Provenance and data quality
– Again, we don’t fully understand it.
• Organisational dynamics of repositories
• Economic analyses of curation
• Ontologies, performance, registries, structure evolution…
Some active topics by the Edinburgh
Database Group
• Archiving Scientific Databases
• Keys and Digital Object Identifiers
• Data publishing
• Data Security
• Provenance and Annotation
Archiving (preserving) databases
• How do you preserve something that
changes every hour or minute?
– Important for the scientific record – someone
might have cited your data at time t.
• Current practice
– Create versions (how often?)
– Log changes
– Use diffs
– Do nothing (common!)
A Sequence of Versions
Pushing time down
This relies on a deterministic / keyed model
[Driscoll, Sarnak, Sleator, Tarjan: “Making Data Structures Persistent.” ]
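A minimal sketch of what "pushing time down" could look like under a keyed model. It is an illustration, not the archiver described in these slides. It assumes each version is a nested dict whose keys identify children, that leaves are hashable values, and that a node never switches between being a dict and being a leaf. The names `merge` and `retrieve` are hypothetical. For simplicity every node stores its own list of version times; the optimisation of inheriting times from the parent is left out.

```python
def merge(node, version, t):
    """Merge the snapshot `version`, taken at time t, into the archive `node`."""
    node.setdefault("times", []).append(t)
    if isinstance(version, dict):
        children = node.setdefault("children", {})
        for key, sub in version.items():
            # The key tells us which archived child this part of the version matches.
            merge(children.setdefault(key, {}), sub, t)
    else:
        # Leaf: record at which times this value was present.
        node.setdefault("values", {}).setdefault(version, []).append(t)


def retrieve(node, t):
    """Rebuild the snapshot that was merged at time t (one scan of the archive)."""
    if "children" in node:
        return {
            key: retrieve(child, t)
            for key, child in node["children"].items()
            if t in child["times"]
        }
    for value, times in node["values"].items():
        if t in times:
            return value
    return None


# Hypothetical data: two versions of a tiny keyed database.
v1 = {"gene1": {"name": "abc", "len": 10}}
v2 = {"gene1": {"name": "abc", "len": 12}, "gene2": {"name": "xyz", "len": 5}}

archive = {}
merge(archive, v1, 1)
merge(archive, v2, 2)

# Temporal query on one object: every value gene1.len has had, and when.
history = archive["children"]["gene1"]["children"]["len"]["values"]
```

Unchanged data such as gene1.name is stored once with the times [1, 2], which is why the archive can stay close to the size of the current version when changes are small.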
[Figure: 100 days of OMIM, two panels (uncompressed and compressed). Y axis: size (bytes) × 10^6. Legend: archive, inc diff, version, compressed inc diff, compressed archive.]
Uncompressed
• Green line: sizes of 100 versions
• Red line: size of cumulative archive. Top right red point contains whole of green line
Compressed
• Archive size is between 0.94 and 1 times the compressed diff repository size (XMill(archive) versus gzip(inc diff))
The Bottom Line
• Can archive a whole year of Swissprot or OMIM
with < 15% overhead (size of current file)
• Retrieval is a linear scan
• Works well with compression to less than 30% of
current file. Archive is an XML file
• Archive as often as you like! (Almost)
• Works well with indexing
• Permits temporal queries on objects
How do we cite data?
• A URL or citation to an article is already
unsatisfactory.
– DCC client complaint: “I spend a lot of time
searching [electronic documents or digital libraries]
for the part that is relevant to the citation.”
• The problem is much worse when you are citing
something in a very large database.
• How do you use a citation to locate data?
• How do you ensure that the citation persists?
– Connections with DB archiving and DOIs
Location is typically informative?
• File and directory names that contain data
/timit/train/dr1/fcjf0/sa1.wav
corpus: timit
type: training
dialect-region:1
sex: f
speaker-id: cjf0
sentence-id: sa1
file-type: waveform
• Compound keys traditionally indicated location:
BL MS Cotton Nero A.ix
Manuscript in the British Library, which used to be in the library of a Mr. Cotton [which burnt down] under a statue of Nero, top shelf, nine books along from the left.
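As an illustration of data hidden in file and directory names, the TIMIT path above can be unpacked into the listed fields. This is a sketch: the function name `parse_timit_path` and the two lookup tables are hypothetical, and only the expansions shown on the slide (train, wav) are included. Other codes are passed through unchanged.

```python
from pathlib import PurePosixPath

# Expansions taken from the example on the slide; anything else is kept as-is.
TYPE_NAMES = {"train": "training"}
FILE_TYPES = {"wav": "waveform"}


def parse_timit_path(path):
    """Split a path like /timit/train/dr1/fcjf0/sa1.wav into named fields."""
    _root, corpus, split, region, speaker, filename = PurePosixPath(path).parts
    stem, _dot, ext = filename.partition(".")
    return {
        "corpus": corpus,
        "type": TYPE_NAMES.get(split, split),
        "dialect-region": region[len("dr"):],  # "dr1" -> "1"
        "sex": speaker[0],                     # first letter of the speaker directory
        "speaker-id": speaker[1:],
        "sentence-id": stem,
        "file-type": FILE_TYPES.get(ext, ext),
    }


fields = parse_timit_path("/timit/train/dr1/fcjf0/sa1.wav")
```

The path is acting as a compound key: each component only makes sense relative to the ones before it.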
Keys for XML
• Implicit keys are ubiquitous in scientific data formats
(easily converted to XML)
• Some proposals for key specifications in XML work
(DTD IDs, XML-Schema)
• “Deep citation” in digital libraries.
– Persistent identifiers for some small element of a
large collection
• Natural consequence of translating back from
deterministic model to XML (node-labeled)
• Interactions with data models/formats
Relative keys
General form: Q{P1, ..., Pn}.Q'{P'1, ..., P'n'} ...
Example:
book{name}.chapter{number}.verse{number}
– in chapter{number}, number specifies the chapter only within a book
– in verse{number}, number specifies the verse only within a chapter
Also:
bible{}.book{name}.chapter{number}.verse{number}
empty key: at most one bible node
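A small sketch of how a relative key such as the one above might be checked against an XML document. It assumes that key values are stored as XML attributes, which the slides do not specify. The functions `parse_key_spec` and `key_violations` and the sample document are hypothetical.

```python
import re
import xml.etree.ElementTree as ET


def parse_key_spec(spec):
    """'book{name}.chapter{number}' -> [('book', ['name']), ('chapter', ['number'])]"""
    return [
        (tag, [f.strip() for f in fields.split(",") if f.strip()])
        for tag, fields in re.findall(r"(\w+)\{([^}]*)\}", spec)
    ]


def key_violations(parent, steps, path=()):
    """Return the key paths of nodes that repeat a key value under the same parent."""
    if not steps:
        return []
    (tag, fields), rest = steps[0], steps[1:]
    seen, found = set(), []
    for child in parent.findall(tag):
        key = tuple(child.get(f) for f in fields)  # an empty key gives () for every node
        here = path + ((tag, key),)
        if key in seen:
            found.append(here)
        seen.add(key)
        # Uniqueness of the next step is checked only within this child.
        found.extend(key_violations(child, rest, here))
    return found


DOC = ET.fromstring("""
<bible>
  <book name="Genesis">
    <chapter number="1"><verse number="1"/><verse number="2"/></chapter>
    <chapter number="2"><verse number="1"/></chapter>
  </book>
  <book name="Exodus">
    <chapter number="1"><verse number="1"/><verse number="1"/></chapter>
  </book>
</bible>
""")

SPEC = parse_key_spec("book{name}.chapter{number}.verse{number}")
violations = key_violations(DOC, SPEC)
```

Chapter 1 appears in both books without conflict because the chapter key is relative to its book; only the repeated verse inside one chapter is reported. A path of key values that passes this check identifies exactly one node, which is what a "deep citation" needs.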
Keys and file formats
Remember: structured files are databases!
• Understanding and
registering formats is
only a first step
• The real issue is still
integration and
transformation.
• Keys and other
constraints may help
Data exchange on the Web
[Diagram labels: Web, DTD, XML, Q: XML view, DB1, DB2]
All members of a community agree on a DTD and then exchange data.
XML Publishing:
• mapping relational data to XML
• conforming to the predefined DTD
How do we transmit incremental changes?
Security in Databases and XML
• Current approaches “all or nothing”
– How do you stop applications compromising
security?
• Next approach – mark individual data items
– Makes the problem even worse!
• New approach – security based on the
structure of the database/document
– Static guarantees
– Greater efficiency
Annotation, Provenance
• So much scientific data is now in
databases that scientists are starting to
communicate by annotating data.
• Also data is increasingly copied
between databases. How do you know
where your data came from?
• These two topics are closely related.
Understanding Provenance
• Provenance is a major problem in
scientific databases, but we lack
– tools for recording it
– fundamental understanding of the issues
• How is provenance passed through
database queries?
• How can we automatically record
provenance when we update
databases?
Two kinds of provenance?
composer:
name          born   period
J.S. Bach     1685   baroque
G.F. Handel   1685   baroque
W.A. Mozart   1756   classical

SELECT name, born
FROM composer

SELECT name, born
FROM composer
WHERE born < (SELECT AVG(born) FROM composer)

Output:
name          born
J.S. Bach     1685
. . .         …

Why is this element in the output?
Where does this element come from?
Why and Where
• Why-provenance of an output tuple d
– the set of all witnesses for d
– a witness for d is a minimal set of source tuples
which “proves” that d exists in the output
– For positive queries -- a set of tuples in the
source whose deletion causes d to disappear
• Where-provenance of output data d
– the set of all source locations whose contents
are copied to d
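The two definitions can be made concrete for the simplest case: a selection and projection over a single table, with duplicates eliminated. This is a sketch under that restriction. The function `select_project` and the location format (table, row index, column) are hypothetical, and queries with aggregation such as the averaging query shown earlier are not covered.

```python
COMPOSER = [
    {"name": "J.S. Bach", "born": 1685, "period": "baroque"},
    {"name": "G.F. Handel", "born": 1685, "period": "baroque"},
    {"name": "W.A. Mozart", "born": 1756, "period": "classical"},
]


def select_project(table, table_name, columns, predicate=lambda row: True):
    """SELECT DISTINCT columns FROM table WHERE predicate, with provenance.

    Returns {output tuple: {"why": [witness, ...], "where": {column: locations}}}.
    """
    results = {}
    for i, row in enumerate(table):
        if not predicate(row):
            continue
        values = tuple(row[c] for c in columns)
        entry = results.setdefault(
            values, {"why": [], "where": {c: set() for c in columns}}
        )
        # Each qualifying source row alone is a minimal witness for the output tuple.
        entry["why"].append({(table_name, i)})
        # The output value in column c was copied from this source location.
        for c in columns:
            entry["where"][c].add((table_name, i, c))
    return results


by_year = select_project(COMPOSER, "composer", ["born"])
```

Here the output tuple (1685,) has two witnesses, one for Bach's row and one for Handel's. Deleting either row leaves it in the output; deleting both makes it disappear.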
Annotation and Provenance
• Simple connection. Provenance
information is a form of annotation.
• Fundamental connection: annotations
need to spread along lines of provenance.
• Annotation Systems:
– BioDAS (Distributed Annotation System) (L. Stein et al.)
• annotate on genome sequences
• notion of location is specific to genome
– Annotea (W3C)
• annotate web pages; location is defined with XPointer
– Third Voice (now defunct)
The annotation issue is complex
• Should our queries be “annotation conscious”?
SELECT name, age
FROM employee
WHERE age = 50
SELECT name, 50 as age
FROM employee
WHERE age = 50
• What are we annotating?
Name   Shoesize   Hatsize
Joe    8          47
…      …          ...
Candidate annotations on the 47: "47 is prime"; "47 is too low"
• New theories and models are needed!
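The two employee queries above return the same values, yet they differ if annotations follow copied data. The sketch below shows one possible propagation rule, not a settled answer: annotations travel with values that are copied, and a constant written in the query starts with none. The table contents, the annotation text and all function names are hypothetical.

```python
# Each cell is a pair (value, set of annotations).
EMPLOYEE = [
    {"name": ("Joe", set()), "age": (50, {"checked by payroll"})},
    {"name": ("Ann", set()), "age": (41, set())},
]


def copy_age(rows):
    """SELECT name, age FROM employee WHERE age = 50 (age is copied from the source)."""
    return [
        {"name": r["name"], "age": r["age"]}
        for r in rows
        if r["age"][0] == 50
    ]


def constant_age(rows):
    """SELECT name, 50 AS age FROM employee WHERE age = 50 (age is a fresh constant)."""
    return [
        {"name": r["name"], "age": (50, set())}
        for r in rows
        if r["age"][0] == 50
    ]


def values(rows):
    """Strip the annotations, leaving only the plain query result."""
    return [{col: cell[0] for col, cell in r.items()} for r in rows]


copied = copy_age(EMPLOYEE)
constant = constant_age(EMPLOYEE)
```

Under this rule `values(copied)` and `values(constant)` are identical, but only `copied` still carries the annotation on Joe's age. Whether two equivalent queries should be allowed to differ in this way is exactly the open question.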
Edinburgh Database Group/DCC Research
Core
• Rajendra Bose*
• Peter Buneman*
• James Cheney*
• Byron Choi
• Wenfei Fan
• Cong Gao
• Floris Geerts
• Xibei Jia
• Christoph Koch
• Robert Hutchison
• Savvas Makalias
• Tasos Kementsietsidis
• Margaret McGinley*
• Joseph Spadavecchia
• Stratis Viglas
Associates
• Douglas Armstrong*
• Malcolm Atkinson*
• Peter Burnhill*
• Kousha Etessami
• Robert Mann*
• Robin Rice
Recent and future visitors (DCC & DBG)
• Michael Lesk*
• Zhenxin Wu*
• Renee Miller*
• Jim Frew
* -- involved with DCC
DCC and DB group have positions in databases,
digital curation, XML, web technology, fundamentals.
Edinburgh is a great place to live!!
Top-rated department. World-class database research. Good connections with logical foundations, scientific DBs, distributed computation (Grid).
Contact
Peter Buneman
opb@inf.ed.ac.uk