Digital Curation Centre
a centre of expertise in data curation and preservation

The Research Agenda

Peter Buneman
Research Director, Digital Curation Centre
and School of Informatics, University of Edinburgh

What is Digital Curation?
• Preserving stuff?
  – Librarians and archivists
  – Scientists (with huge amounts of regular experimental data)
• Publishing stuff?
  – Publishers of “reference” data: compendia, dictionaries, bibliographies, gazetteers, etc.
  – Scientists (with lots of complex annotated data)
Both communities call themselves “curators” but at first sight they have almost orthogonal concerns

Their concerns look orthogonal, but…
• Shouldn’t the “publishers” be concerned about the long-term usefulness of their findings?
• The “preservers” do more than preserve – they classify and annotate.
  – Shouldn’t they publish (and preserve) their own work?
As you dig deeper you find that there is a lot of commonality.

Curated Databases are Central
Much/most scientific data is now in databases
• They often do not contain source experimental data. Sometimes just annotation/metadata
• They borrow extensively from, and refer to, other databases
• You are now judged by your databases as well as your (paper) publications!!
• These databases are built and maintained with a great deal of human or computational effort.
What makes a database? – it has internal structure or it changes. Size alone doesn’t qualify

The Research Agenda
• Data integration and publishing
  – Slowly coming to market. Publishing in community formats is a new twist
• Annotation
  – Everybody agrees this is important. No-one understands it.
• Metadata extraction
  – Semantic or otherwise, it’s a key part of annotation
• Archiving and Appraisal
  – What do we do about databases – they change!
• Legal issues
  – Can we at least help to clarify what is going on?
• Provenance and data quality
  – Again, we don’t fully understand it.
• Organisational dynamics of repositories
• Economic analyses of curation
• Ontologies, performance, registries, structure evolution…

Archiving (preserving) databases
• How do you preserve something that changes every hour or minute?
  – Important for the scientific record – someone might have cited your data at time t.
• Current practice
  – Create versions (how often?)
  – Log changes
  – Use diffs
  – Do nothing (common!)

A Sequence of Versions

Pushing time down
This relies on a deterministic / keyed model
[Driscoll, Sarnak, Sleator, Tarjan: “Making Data Structures Persistent.”]

100 days of OMIM
Uncompressed
• Archive size is
  – 1.01 times diff repository size
  – 1.04 times size of largest version
Compressed
• Archive size is between 0.94 and 1 times compressed diff repository size
• gzip – Unix compression tool
• XMill – XML compression tool
[Figure: size (bytes) × 10^6 over 100 days of OMIM. Legend: archive; inc diff; version; compressed inc diff, gzip(inc diff); compressed archive, XMill(archive)]

The Bottom Line
• Can archive a whole year of Swissprot or OMIM with < 15% overhead (size of current file)
• Retrieval is a linear scan
• Works well with compression to less than 30% of current file. Archive is an XML file
• Archive as often as you like! (Almost)
• Works well with indexing
• Permits temporal queries on objects

How do we cite data?
• A URL or citation to an article is already unsatisfactory.
  – DCC client complaint: “I spend a lot of time searching [electronic documents] for the part that is relevant to the citation.”
• The problem is much worse when you are citing something in a very large database.
• How do you use a citation to locate data?
• How do you ensure that the citation persists?
  – Connections with DB archiving and DOIs

Location is typically informative?
• File and directory names that contain data
  /timit/train/dr1/fcjf0/sa1.wav
    corpus: timit
    type: training
    dialect-region: 1
    sex: f
    speaker-id: cjf0
    sentence-id: sa1
    file-type: waveform
• Compound keys traditionally indicated location: BL MS Cotton Nero A.ix
  Manuscript in the British Library, which used to be in the library of a Mr. Cotton [which burnt down] under a statue of Nero, top shelf, nine books along from the left.

Keys for XML
• Implicit keys are ubiquitous in scientific data formats (easily converted to XML)
• Some proposals for key specifications in XML work (DTD IDs, XML-Schema)
• “Deep citation” in digital libraries.
• Natural consequence of translating back from deterministic model to XML (node-labeled)
• Interactions with data models/formats

Relative keys
General form: Q{P1, ... , Pn }. Q’{P’1, ... , P’n’ } ...
Example: book{name}.chapter{number}.verse{number}
  number specifies chapter only within book
  number specifies verse only within chapter
Also: bible{}.book{name}.chapter{number}.verse{number}
  empty key: at most one bible node

Keys and file formats
Remember: structured files are databases!
• Understanding and registering formats is only a first step
• The real issue is still integration and transformation.
• Keys and other constraints may help

Data exchange on the Web
[Diagram labels: Web, DTD, XML, Q: XML view, DB1, DB2]
All members of a community (industry) agree on a DTD and then exchange data w.r.t. it: e-commerce, healthcare, ...
XML Publishing:
• mapping relational data to XML
• conforming to the predefined DTD

Progress report on DCC research (funding period: -2 weeks)
• Four new research fellows at Edinburgh:
  – Mags McGinley (legal practice) IP, copyright in databases
  – James Cheney (Cornell) Programming Languages, Digital Libraries, XML compression
  – Tasos Kemensietsidis (Toronto) Data integration, P2P databases
  – Rajendra Bose (UCSB) Earth sciences data. “Workflow” provenance in scientific data.
• At UKOLN
  – Michael Day, metadata and Interoperability
• At CCLRC
  – Shoaib Sufi, data portals and metadata
• At Glasgow
  – Position in metadata extraction advertised

Progress report on DCC research (continued)
• Pleasant DCC space (thanks to Edina and Informatics) to house DCC and database group.
• Collaboration with
  – biologists (EBI & Edinburgh) on data publishing and
  – astronomers (Edinburgh) on XML manipulation & representation of large data sets.
• First DCC research visitor (Michael Lesk)
• Work with partners in progress on
  – annotation
  – DOIs

Please join us!!!
DCC has research positions in databases, digital curation, XML, web technology, fundamentals.
Edinburgh is a great place!!
Contact Peter Buneman opb@inf.ed.ac.uk
Top-rated department. World-class database group. Good connections with logical foundations, scientific DBs, distributed computation (Grid)
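
Supplementary sketch: pushing time down

Returning to the archiving slides, the idea of "pushing time down" in a keyed model can be illustrated roughly as below. This is a minimal sketch under stated assumptions, not the archiver behind the OMIM and Swissprot figures: it assumes each version is a flat dictionary from key to value, whereas the slides describe the archive as an XML file. The names `merge_version`, `retrieve` and `history` and the sample records are hypothetical.

```python
from typing import Dict, List, Set

# Assumed model: one version of the database maps each key to a value.
Version = Dict[str, str]
# Assumed archive layout: key -> value -> set of version numbers in which
# that value was present. The timestamp sits on the element, not the file.
Archive = Dict[str, Dict[str, Set[int]]]


def merge_version(archive: Archive, version_no: int, version: Version) -> None:
    """Merge one version into the archive by stamping each keyed element."""
    for key, value in version.items():
        archive.setdefault(key, {}).setdefault(value, set()).add(version_no)


def retrieve(archive: Archive, version_no: int) -> Version:
    """Rebuild a single version with a linear scan over the archive."""
    result: Version = {}
    for key, values in archive.items():
        for value, stamps in values.items():
            if version_no in stamps:
                result[key] = value
    return result


def history(archive: Archive, key: str) -> Dict[str, List[int]]:
    """Temporal query on one object: which versions held which value?"""
    return {value: sorted(stamps) for value, stamps in archive.get(key, {}).items()}


# Illustrative data only: three successive versions of a tiny keyed database.
versions = [
    {"gene1": "A", "gene2": "B"},
    {"gene1": "A", "gene2": "C"},
    {"gene1": "A", "gene3": "D"},
]
archive: Archive = {}
for number, version in enumerate(versions, start=1):
    merge_version(archive, number, version)
```

An unchanged element such as `gene1` is stored once with the stamps {1, 2, 3}, which is the intuition behind the small overhead quoted on "The Bottom Line" slide; `retrieve(archive, 2)` scans the whole archive, and `history(archive, "gene2")` is a temporal query on a single object.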