Digital Curation Centre
The DCC Research Agenda

Peter Buneman
Research Director, Digital Curation Centre
and Professor of Database Systems, School of Informatics, University of Edinburgh

The Digital Curation Centre
“… to provide a focus for research into curation issues and to promote expertise and good practice for the management of all cultural, scholarly and research outputs in digital format.” [edited mission statement]
• UK funding from JISC and EPSRC e-science programme
• Partners:
– University of Edinburgh (leader)
– University of Bath (UKOLN)
– CCLRC
– University of Glasgow
• Research led by Edinburgh Database Group

Organisation to Engage & Collaborate
[Diagram; labels only: curation organisations, e.g. DPC; communities of practice: users; community support & outreach; Collaborative Associates Network of Data Organisations; service definition & delivery; management & admin support; research; research collaborators; development co-ordination; testbeds & tools; Industry standards bodies]

What is Digital Curation?
• Preserving stuff?
– Librarians and archivists
– Scientists (with huge amounts of regular experimental data)
• Publishing stuff?
– Publishers of “reference” data: compendia, dictionaries, bibliographies, gazetteers, etc.
– Scientists (with lots of complex annotated data)
Both communities call themselves “curators”, but at first sight they have almost orthogonal concerns.

Their concerns look orthogonal, but…
• Shouldn’t the “publishers” be concerned about the long-term usefulness of their findings?
• The “preservers” do more than preserve: they classify and annotate.
– Shouldn’t they publish (and preserve) their own work?
As you dig deeper you find that there is a lot of commonality.

Database Technology is Central
Much/most scientific data is now in databases
• They often do not contain source experimental data.
Sometimes just annotation/metadata
• They borrow extensively from, and refer to, other databases
• You are now judged by your databases as well as your (paper) publications!!
• These databases are built and maintained with a great deal of human or computational effort.
What makes a database? It has internal structure or it changes. Size alone doesn’t qualify, but data formats do!

The DCC Research Agenda
• Data integration and publishing
– Slowly coming to market. Publishing in community formats is a new twist
• Annotation
– Everybody agrees this is important. No-one understands it.
• Metadata extraction
– Semantic or otherwise, it’s a key part of annotation
• Archiving and Appraisal
– What do we do about databases? They change!
• Legal issues
– Can we at least help to clarify what is going on?
• Provenance and data quality
– Again, we don’t fully understand it.
• Organisational dynamics of repositories
• Economic analyses of curation
• Ontologies, performance, registries, structure evolution…

Some active topics by the Edinburgh Database Group
• Archiving Scientific Databases
• Keys and Digital Object Identifiers
• Data publishing
• Data Security
• Provenance and Annotation

Archiving (preserving) databases
• How do you preserve something that changes every hour or minute?
– Important for the scientific record: someone might have cited your data at time t.
• Current practice
– Create versions (how often?)
– Log changes
– Use diffs
– Do nothing (common!)

A Sequence of Versions

Pushing time down
This relies on a deterministic / keyed model
[Driscoll, Sarnak, Sleator, Tarjan: “Making Data Structures Persistent.”]

Uncompressed
• Green line: sizes of 100 versions
• Red line: size of cumulative archive.
Top right red point contains whole of green line

Compressed
[Figure: 100 days of OMIM. Axis: Size (bytes) × 10^6. Legend: archive, inc diff, version, compressed inc diff, compressed archive. Plot labels: gzip(inc diff), XMill(archive)]
Archive size is between 0.94 and 1 times the compressed diff repository size.

The Bottom Line
• Can archive a whole year of Swissprot or OMIM with < 15% overhead (size of current file)
• Retrieval is a linear scan
• Works well with compression to less than 30% of current file. Archive is an XML file
• Archive as often as you like! (Almost)
• Works well with indexing
• Permits temporal queries on objects

How do we cite data?
• A URL or citation to an article is already unsatisfactory.
– DCC client complaint: “I spend a lot of time searching [electronic documents or digital libraries] for the part that is relevant to the citation.”
• The problem is much worse when you are citing something in a very large database.
• How do you use a citation to locate data?
• How do you ensure that the citation persists?
– Connections with DB archiving and DOIs

Location is typically informative?
• File and directory names that contain data
    /timit/train/dr1/fcjf0/sa1.wav
    corpus: timit
    type: training
    dialect-region: 1
    sex: f
    speaker-id: cjf0
    sentence-id: sa1
    file-type: waveform
• Compound keys traditionally indicated location:
    BL MS Cotton Nero A.ix
    Manuscript in the British Library, which used to be in the library of a Mr. Cotton [which burnt down] under a statue of Nero, top shelf, nine books along from the left.

Keys for XML
• Implicit keys are ubiquitous in scientific data formats (easily converted to XML)
• Some proposals for key specifications in XML work (DTD IDs, XML-Schema)
• “Deep citation” in digital libraries.
– Persistent identifiers for some small element of a large collection
• Natural consequence of translating back from deterministic model to XML (node-labeled)
• Interactions with data models/formats

Relative keys
General form: Q{P1, ..., Pn}.Q’{P’1, ..., P’n’}. ...
Example: book{name}.chapter{number}.verse{number}
– number specifies chapter only within book
– number specifies verse only within chapter
Also: bible{}.book{name}.chapter{number}.verse{number}
– empty key: at most one bible node

Keys and file formats
Remember: structured files are databases!
• Understanding and registering formats is only a first step
• The real issue is still integration and transformation.
• Keys and other constraints may help

Data exchange on the Web
[Diagram; labels only: Web, DTD, XML, XML, Q: XML view, DB1, DB2]
All members of a community agree on a DTD and then exchange data
XML Publishing:
• mapping relational data to XML
• conforming to the predefined DTD
How do we transmit incremental changes?

Security in Databases and XML
• Current approaches: “all or nothing”
– How do you stop applications compromising security?
• Next approach: mark individual data items
– Makes the problem even worse!
• New approach: security based on the structure of the database/document
– Static guarantees
– Greater efficiency

Annotation, Provenance
• So much scientific data is now in databases that scientists are starting to communicate by annotating data.
• Also, data is increasingly copied between databases. How do you know where your data came from?
• These two topics are closely related.

Understanding Provenance
• Provenance is a major problem in scientific databases, but we lack
– tools for recording it
– fundamental understanding of the issues
• How is provenance passed through database queries?
• How can we automatically record provenance when we update databases?

Two kinds of provenance?
Source table (composer):
    name          born   period
    J.S. Bach     1685   baroque
    G.F. Handel   1685   baroque
    W.A. Mozart   1756   classical
Queries:
    SELECT name, born FROM composer
    SELECT name, born FROM composer WHERE born < (SELECT AVG(born) FROM composer)
Output:
    name          born
    J.S. Bach     1685
    …             …
Why is this element in the output?
Where does this element come from?
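
The two queries on this slide can be tried out directly. The following is a minimal sketch, assuming an in-memory SQLite database and reading the slide's "AVERAGE born" as a parenthesised AVG(born) subquery; the function names are hypothetical and not part of the talk. The last helper shows one naive way to answer "where does this element come from?" for the copy-only query, by tagging each output cell with the (table, rowid, column) it was copied from. It is an illustration only, not the group's provenance model.

```python
import sqlite3


def build_db():
    """Load the composer table from the slide into an in-memory database."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE composer (name TEXT, born INTEGER, period TEXT)")
    conn.executemany(
        "INSERT INTO composer VALUES (?, ?, ?)",
        [
            ("J.S. Bach", 1685, "baroque"),
            ("G.F. Handel", 1685, "baroque"),
            ("W.A. Mozart", 1756, "classical"),
        ],
    )
    return conn


def all_composers(conn):
    """First query: a plain projection that copies values from the source."""
    return conn.execute(
        "SELECT name, born FROM composer ORDER BY rowid"
    ).fetchall()


def born_before_average(conn):
    """Second query: membership in the output depends on the whole table."""
    return conn.execute(
        "SELECT name, born FROM composer "
        "WHERE born < (SELECT AVG(born) FROM composer) ORDER BY rowid"
    ).fetchall()


def where_locations(conn):
    """Naive where-provenance for the projection (assumed representation):
    map each output cell to the source location it was copied from."""
    locations = {}
    rows = conn.execute("SELECT rowid, name FROM composer ORDER BY rowid")
    for rowid, name in rows:
        locations[(name, "name")] = ("composer", rowid, "name")
        locations[(name, "born")] = ("composer", rowid, "born")
    return locations


if __name__ == "__main__":
    conn = build_db()
    print(all_composers(conn))
    print(born_before_average(conn))
    print(where_locations(conn)[("J.S. Bach", "born")])
```

In the second query, whether a row appears depends on every row that contributes to the average, which is why "why is this element in the output?" is a different question from "where was this value copied from?".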

Why and Where
• Why-provenance of an output tuple d
– the set of all witnesses for d
– a witness for d is a minimal set of source tuples which “proves” that d exists in the output
– for positive queries: a set of tuples in the source whose deletion causes d to disappear
• Where-provenance of output data d
– the set of all source locations whose contents are copied to d

Annotation and Provenance
• Simple connection: provenance information is a form of annotation.
• Fundamental connection: annotations need to spread along lines of provenance.
• Annotation Systems:
– BioDAS (Distributed Annotation Server) (L. Stein et al.)
    • annotate on genome sequences
    • notion of location is specific to genome
– Annotea (W3C)
    • annotate web pages; location is defined with XPointer
– Third voice (now defunct)

The annotation issue is complex
• Should our queries be “annotation conscious”?
    SELECT name, age FROM employee WHERE age = 50
    SELECT name, 50 as age FROM employee WHERE age = 50
• What are we annotating?
    Name   Shoesize   Hatsize
    Joe    8          47
    …      …          ...
    Annotations on the value 47: “47 is prime”, “47 is too low”
• New theories and models are needed!

Edinburgh Database Group/DCC Research
Core
• Rajendra Bose*
• Peter Buneman*
• James Cheney*
• Byron Choi
• Wenfei Fan
• Cong Gao
• Floris Geerts
• Xibei Jia
• Christoph Koch
• Robert Hutchison
• Savvas Makalias
• Tasos Kementsietsidis
• Margaret McGinley*
• Joseph Spadavecchia
• Stratis Viglas
Associates
• Douglas Armstrong*
• Malcolm Atkinson*
• Peter Burnhill*
• Kousha Etessami
• Robert Mann*
• Robin Rice
Recent and future visitors (DCC & DBG)
• Michael Lesk*
• Zhenxin Wu*
• Renee Miller*
• Jim Frew
* = involved with DCC

DCC and DB group have positions in databases, digital curation, XML, web technology, fundamentals.
Edinburgh is a great place to live!!
Contact Peter Buneman opb@inf.ed.ac.uk
Top-rated department. World-class database research. Good connections with logical foundations, scientific DBs, distributed computation (Grid)