The Research Agenda Peter Buneman Research Director Digital Curation Centre

Digital Curation Centre
a centre of expertise in data curation and preservation
The Research Agenda
Peter Buneman
Research Director
Digital Curation Centre
and
School of Informatics
University of Edinburgh
Funders:
What is Digital Curation?
• Preserving stuff?
– Librarians and archivists
– Scientists (with huge amounts of regular experimental data)
• Publishing stuff?
– Publishers of “reference” data: compendia, dictionaries, bibliographies, gazetteers, etc.
– Scientists (with lots of complex annotated data)
Both communities call themselves “curators”, but at first sight they have almost orthogonal concerns.
Their concerns look orthogonal, but…
• Shouldn’t the “publishers” be concerned about the long-term usefulness of their findings?
• The “preservers” do more than preserve: they classify and annotate.
– Shouldn’t they publish (and preserve) their own work?
As you dig deeper you find that there is a lot of commonality.
Curated Databases are Central
Much/most scientific data is now in databases
• They often do not contain source experimental data. Sometimes just annotation/metadata
• They borrow extensively from, and refer to, other databases
• You are now judged by your databases as well as your (paper) publications!!
• These databases are built and maintained with a great deal of human or computational effort.
What makes a database?
– It has internal structure or it changes. Size alone doesn’t qualify
The Research Agenda
• Data integration and publishing
– Slowly coming to market. Publishing in community formats is a new twist
• Annotation
– Everybody agrees this is important. No-one understands it.
• Metadata extraction
– Semantic or otherwise, it’s a key part of annotation
• Archiving and Appraisal
– What do we do about databases – they change!
• Legal issues
– Can we at least help to clarify what is going on?
• Provenance and data quality
– Again, we don’t fully understand it.
• Organisational dynamics of repositories
• Economic analyses of curation
• Ontologies, performance, registries, structure evolution…
Archiving (preserving) databases
• How do you preserve something that changes every hour or minute?
– Important for the scientific record: someone might have cited your data at time t.
• Current practice
– Create versions (how often?)
– Log changes
– Use diffs
– Do nothing (common!)
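As an illustration of the "use diffs" option above, here is a minimal sketch of a diff repository: the first version is stored whole and every later version is stored as a delta against its predecessor. The class and function names (DiffRepository, make_delta, apply_delta) are hypothetical, and a version is assumed to be a list of text lines; real tools differ in detail.

```python
from difflib import SequenceMatcher

def make_delta(old, new):
    """Return the edits that turn the line list `old` into `new`."""
    ops = SequenceMatcher(a=old, b=new, autojunk=False).get_opcodes()
    # Keep only changed regions; unchanged runs are implied by the old version.
    return [(i1, i2, new[j1:j2]) for tag, i1, i2, j1, j2 in ops if tag != "equal"]

def apply_delta(old, delta):
    """Rebuild the newer version from `old` plus a delta from make_delta."""
    out, pos = [], 0
    for i1, i2, lines in delta:
        out.extend(old[pos:i1])  # unchanged lines before this edit
        out.extend(lines)        # inserted or replacement lines
        pos = i2                 # skip lines that were deleted or replaced
    out.extend(old[pos:])
    return out

class DiffRepository:
    """Hypothetical repository: first version in full, then one delta per commit."""

    def __init__(self, first_version):
        self.base = list(first_version)
        self.deltas = []
        self._latest = list(first_version)

    def commit(self, new_version):
        self.deltas.append(make_delta(self._latest, new_version))
        self._latest = list(new_version)

    def checkout(self, n):
        """Return version n (0 is the first) by replaying n deltas."""
        version = list(self.base)
        for delta in self.deltas[:n]:
            version = apply_delta(version, delta)
        return version
```

Note the trade-off this makes visible: storage is small, but retrieving version n means replaying n deltas, and nothing in the deltas says which object a changed line belongs to.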
A Sequence of Versions
Pushing time down
This relies on a deterministic / keyed model
[Driscoll, Sarnak, Sleator, Tarjan: “Making Data Structures Persistent.”]
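One way to picture "pushing time down" is sketched below: instead of timestamping whole versions, all versions are merged into one structure and each keyed value carries the list of versions in which it was present. This is only a sketch under stated assumptions, not the archiver itself: the KeyedArchive class, its method names, and the flat {record key: {field: value}} model are hypothetical simplifications of a keyed hierarchical model, and records with no fields are not represented.

```python
from collections import defaultdict

class KeyedArchive:
    """Merge successive versions of a keyed database into one structure.

    Each (record key, field) value records the version numbers in which it
    held; this only works because keys identify "the same" object across
    versions.
    """

    def __init__(self):
        # (record key, field name) -> {value: [version numbers]}
        self.entries = defaultdict(dict)
        self.versions = 0

    def add_version(self, db):
        """Merge db = {record_key: {field: value}}; return its version number."""
        t = self.versions
        for key, record in db.items():
            for field, value in record.items():
                self.entries[(key, field)].setdefault(value, []).append(t)
        self.versions += 1
        return t

    def retrieve(self, t):
        """Rebuild version t with one linear scan over the archive."""
        db = {}
        for (key, field), values in self.entries.items():
            for value, times in values.items():
                if t in times:
                    db.setdefault(key, {})[field] = value
        return db

    def history(self, key, field):
        """Temporal query: every value the field has had, with its versions."""
        return dict(self.entries.get((key, field), {}))
```

A value that never changes is stored once however many versions are merged, and the history of a single object can be read off without reconstructing any version.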
[Figure: 100 days of OMIM; vertical axis: size (bytes) x 10^6. Legend: archive, inc diff, version, compressed inc diff, compressed archive. Curve labels: gzip(inc diff), XMill(archive)]
Uncompressed
• Archive size is
– 1.01 times diff repository size
– 1.04 times size of largest version
Compressed
• Archive size is between 0.94 and 1 times compressed diff repository size
• gzip: Unix compression tool
• XMill: XML compression tool
The Bottom Line
• Can archive a whole year of Swissprot or OMIM with < 15% overhead (size of current file)
• Retrieval is a linear scan
• Works well with compression to less than 30% of current file. Archive is an XML file
• Archive as often as you like! (Almost)
• Works well with indexing
• Permits temporal queries on objects
How do we cite data?
• A URL or citation to an article is already unsatisfactory.
– DCC client complaint: “I spend a lot of time searching [electronic documents] for the part that is relevant to the citation.”
• The problem is much worse when you are citing something in a very large database.
• How do you use a citation to locate data?
• How do you ensure that the citation persists?
– Connections with DB archiving and DOIs
Location is typically informative?
• File and directory names that contain data
/timit/train/dr1/fcjf0/sa1.wav
corpus: timit
type: training
dialect-region:1
sex: f
speaker-id: cjf0
sentence-id: sa1
file-type: waveform
• Compound keys traditionally indicated location:
BL MS Cotton Nero A.ix
Manuscript in the British Library, which used to be in the library of a Mr. Cotton [which burnt down] under a statue of Nero, top shelf, nine books along from the left.
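The reading of the TIMIT path above can be sketched as a small parser that turns the location back into named fields. The function name parse_timit_path and the two lookup tables are hypothetical, and the layout is assumed to be exactly the six components of the example path.

```python
from pathlib import PurePosixPath

# Assumed expansions, taken only from the single example path above.
TYPE_NAMES = {"train": "training"}
FILE_TYPES = {".wav": "waveform"}

def parse_timit_path(path):
    """Decode the data carried by file and directory names."""
    # parts is ('/', 'timit', 'train', 'dr1', 'fcjf0', 'sa1.wav') for the example
    _, corpus, split, region, speaker, filename = PurePosixPath(path).parts
    name = PurePosixPath(filename)
    return {
        "corpus": corpus,
        "type": TYPE_NAMES.get(split, split),
        # 'dr1' -> '1'
        "dialect-region": region[2:] if region.startswith("dr") else region,
        # the first letter of the speaker directory encodes the sex
        "sex": speaker[0],
        "speaker-id": speaker[1:],
        "sentence-id": name.stem,
        "file-type": FILE_TYPES.get(name.suffix, name.suffix.lstrip(".")),
    }

fields = parse_timit_path("/timit/train/dr1/fcjf0/sa1.wav")
```

The path is in effect a compound key: move or rename the file and the data it carried is lost, which is why citing by location alone is fragile.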
Keys for XML
• Implicit keys are ubiquitous in scientific data formats (easily converted to XML)
• Some proposals for key specifications in XML work (DTD IDs, XML-Schema)
• “Deep citation” in digital libraries.
• Natural consequence of translating back from deterministic model to XML (node-labeled)
• Interactions with data models/formats
Relative keys
General form: Q{P1, ..., Pn}.Q’{P’1, ..., P’n’}...
Example:
book{name}.chapter{number}.verse{number}
– number specifies chapter only within book
– number specifies verse only within chapter
Also:
bible{}.book{name}.chapter{number}.verse{number}
– empty key: at most one bible node
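The relative key book{name}.chapter{number}.verse{number} can be illustrated with a small sketch that checks the key on an XML tree and then uses it to resolve a citation. The sample document, the placeholder verse texts, and the function names (key_value, satisfies, locate) are all hypothetical; the root element plays the role of the single bible{} node.

```python
import xml.etree.ElementTree as ET

# Hypothetical sample data; verse texts are placeholders, not real content.
DOC = ET.fromstring("""
<bible>
  <book><name>Genesis</name>
    <chapter><number>1</number>
      <verse><number>1</number><text>G1:1</text></verse>
      <verse><number>2</number><text>G1:2</text></verse>
    </chapter>
    <chapter><number>2</number>
      <verse><number>1</number><text>G2:1</text></verse>
    </chapter>
  </book>
  <book><name>Exodus</name>
    <chapter><number>1</number>
      <verse><number>1</number><text>E1:1</text></verse>
    </chapter>
  </book>
</bible>
""")

# One (element name, key paths) pair per level of the relative key.
KEY_SPEC = [("book", ["name"]), ("chapter", ["number"]), ("verse", ["number"])]

def key_value(node, key_paths):
    return tuple(node.findtext(p) for p in key_paths)

def satisfies(parent, spec):
    """True if key values are unique among siblings at every level."""
    if not spec:
        return True
    (tag, key_paths), rest = spec[0], spec[1:]
    children = parent.findall(tag)
    values = [key_value(c, key_paths) for c in children]
    if len(set(values)) != len(values):
        return False  # two siblings share a key value
    return all(satisfies(c, rest) for c in children)

def locate(root, citation):
    """Follow a citation (one tuple of key values per level) to a node, or None."""
    node = root
    for (tag, key_paths), wanted in zip(KEY_SPEC, citation):
        matches = [c for c in node.findall(tag)
                   if key_value(c, key_paths) == tuple(wanted)]
        if len(matches) != 1:
            return None
        node = matches[0]
    return node

verse = locate(DOC, [("Genesis",), ("2",), ("1",)])
```

Chapter number 1 occurs under both books, which is fine: the number only has to be unique within its book, and that is exactly what makes the key relative.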
Keys and file formats
Remember: structured files are databases!
• Understanding and registering formats is only a first step
• The real issue is still integration and transformation.
• Keys and other constraints may help
Data exchange on the Web
[Diagram labels: Web, DTD, XML, Q: XML view, DB1, DB2]
All members of a community (industry) agree on a DTD and then exchange data w.r.t. it: e-commerce, healthcare, ...
XML Publishing:
• mapping relational data to XML
• conforming to the predefined DTD
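The two XML publishing steps above can be sketched as follows: rows of a relational table are mapped to elements whose shape is fixed in advance by an agreed DTD. Everything concrete here is an assumption made for illustration: the products table, the catalog/item/sku/price element names, and the conforms function, which is a hand-written structural check standing in for real DTD validation.

```python
import sqlite3
import xml.etree.ElementTree as ET

# Hypothetical agreed DTD:
#   <!ELEMENT catalog (item*)>
#   <!ELEMENT item (sku, price)>

def publish(conn):
    """Map each row of the (hypothetical) products table to an <item>."""
    root = ET.Element("catalog")
    for sku, price in conn.execute("SELECT sku, price FROM products ORDER BY sku"):
        item = ET.SubElement(root, "item")
        ET.SubElement(item, "sku").text = sku
        ET.SubElement(item, "price").text = f"{price:.2f}"
    return root

def conforms(root):
    """Structural check for the DTD above; not a general DTD validator."""
    return root.tag == "catalog" and all(
        item.tag == "item" and [child.tag for child in item] == ["sku", "price"]
        for item in root
    )

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE products (sku TEXT PRIMARY KEY, price REAL)")
conn.executemany("INSERT INTO products VALUES (?, ?)", [("A1", 9.5), ("B2", 12.0)])

tree = publish(conn)
xml_text = ET.tostring(tree, encoding="unicode")
```

The direction matters: the DTD is fixed by the community first, so the mapping has to be written to hit that target rather than simply dumping the tables.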
Progress report on DCC research
(funding period: -2 weeks)
• Four new research fellows at Edinburgh:
– Mags McGinley (legal practice) IP, copyright in databases
– James Cheney (Cornell) Programming Languages, Digital Libraries, XML compression
– Tasos Kementsietsidis (Toronto) Data integration, P2P databases
– Rajendra Bose (UCSB) Earth sciences data. “Workflow” provenance in scientific data.
• At UKOLN
– Michael Day, metadata and interoperability
• At CCLRC
– Shoaib Sufi, data portals and metadata
• At Glasgow
– Position in metadata extraction advertised
Progress report on DCC research
(continued)
• Pleasant DCC space (thanks to Edina and Informatics) to house DCC and database group.
• Collaboration with
– biologists (EBI & Edinburgh) on data publishing and
– astronomers (Edinburgh) on XML manipulation & representation of large data sets.
• First DCC research visitor (Michael Lesk)
• Work with partners in progress on
– annotation
– DOIs
Please join us!!!
DCC has research positions in databases, digital curation, XML, web technology, fundamentals.
Edinburgh is a great place!!
Contact
Peter Buneman
opb@inf.ed.ac.uk
Top-rated department. World-class database group. Good connections with logical foundations, scientific DBs, distributed computation (Grid)