The Research Agenda Peter Buneman Research Director Digital Curation Centre

Digital Curation Centre: a centre of expertise in data curation and preservation
The Research Agenda
Peter Buneman, Research Director, Digital Curation Centre and School of Informatics, University of Edinburgh

What is Digital Curation?
• Preserving stuff?
  – Librarians and archivists
  – Scientists (with huge amounts of regular experimental data)
• Publishing stuff?
  – Publishers of “reference” data: compendia, dictionaries, bibliographies, gazetteers, etc.
  – Scientists (with lots of complex annotated data)
Both communities call themselves “curators” but at first sight they have almost orthogonal concerns

Their concerns look orthogonal, but…
• Shouldn’t the “publishers” be concerned about the long-term usefulness of their findings?
• The “preservers” do more than preserve – they classify and annotate.
  – Shouldn’t they publish (and preserve) their own work?
As you dig deeper you find that there is a lot of commonality.

Curated Databases are Central
Much/most scientific data is now in databases
• They often do not contain source experimental data. Sometimes just annotation/metadata
• They borrow extensively from, and refer to, other databases
• You are now judged by your databases as well as your (paper) publications!!
• These databases are built and maintained with a great deal of human or computational effort.
What makes a database? – it has internal structure or it changes. Size alone doesn’t qualify

The Research Agenda
• Data integration and publishing
  – Slowly coming to market. Publishing in community formats is a new twist
• Annotation
  – Everybody agrees this is important. No-one understands it.
• Metadata extraction
  – Semantic or otherwise, it’s a key part of annotation
• Archiving and Appraisal
  – What do we do about databases – they change!
• Legal issues
  – Can we at least help to clarify what is going on?
• Provenance and data quality
  – Again, we don’t fully understand it.
• Organisational dynamics of repositories
• Economic analyses of curation
• Ontologies, performance, registries, structure evolution…

Archiving (preserving) databases
• How do you preserve something that changes every hour or minute?
  – Important for the scientific record – someone might have cited your data at time t.
• Current practice
  – Create versions (how often?)
  – Log changes
  – Use diffs
  – Do nothing (common!)

A Sequence of Versions

Pushing time down
This relies on a deterministic / keyed model
[Driscoll, Sarnak, Sleator, Tarjan: “Making Data Structures Persistent.”]

100 days of OMIM
Uncompressed
• Archive size is
  – 1.01 times diff repository size
  – 1.04 times size of largest version
Compressed
• archive size between 0.94 and 1 times compressed diff repository size
• gzip - unix compression tool
• XMill - XML compression tool
[Figure: size (bytes × 10^6) over 100 days of OMIM. Legend: archive, inc diff, version, compressed inc diff, compressed archive. Curve labels: gzip(inc diff), XMill(archive)]

The Bottom Line
• Can archive a whole year of Swissprot or OMIM with < 15% overhead (size of current file)
• Retrieval is a linear scan
• Works well with compression to less than 30% of current file. Archive is an XML file
• Archive as often as you like! (Almost)
• Works well with indexing
• Permits temporal queries on objects

How do we cite data?
• A URL or citation to an article is already unsatisfactory.
  – DCC client complaint: “I spend a lot of time searching [electronic documents] for the part that is relevant to the citation.”
• The problem is much worse when you are citing something in a very large database.
• How do you use a citation to locate data?
• How do you ensure that the citation persists?
  – Connections with DB archiving and DOIs

Location is typically informative?
• File and directory names that contain data
  /timit/train/dr1/fcjf0/sa1.wav
  corpus: timit
  type: training
  dialect-region: 1
  sex: f
  speaker-id: cjf0
  sentence-id: sa1
  file-type: waveform
• Compound keys traditionally indicated location:
  BL MS Cotton Nero A.ix
  Manuscript in the British Library, which used to be in the library of a Mr. Cotton [which burnt down] under a statue of Nero, top shelf, nine books along from the left.

Keys for XML
• Implicit keys are ubiquitous in scientific data formats (easily converted to XML)
• Some proposals for key specifications in XML work (DTD IDs, XML-Schema)
• “Deep citation” in digital libraries.
• Natural consequence of translating back from deterministic model to XML (node-labeled)
• Interactions with data models/formats

Relative keys
General form: Q{P1, ..., Pn}.Q'{P'1, ..., P'n'} ...
Example: book{name}.chapter{number}.verse{number}
  number specifies chapter only within book
  number specifies verse only within chapter
Also: bible{}.book{name}.chapter{number}.verse{number}
  empty key: at most one bible node

Keys and file formats
Remember: structured files are databases!
• Understanding and registering formats is only a first step
• The real issue is still integration and transformation.
• Keys and other constraints may help

Data exchange on the Web
[Diagram labels: Web, DTD, XML, Q: XML view, DB1, DB2]
All members of a community (industry) agree on a DTD and then exchange data w.r.t. it: e-commerce, healthcare, ...
XML Publishing:
• mapping relational data to XML
• conforming to the predefined DTD

Progress report on DCC research (funding period: -2 weeks)
• Four new research fellows at Edinburgh:
  – Mags McGinley (legal practice) IP, copyright in databases
  – James Cheney (Cornell) Programming Languages, Digital Libraries, XML compression
  – Tasos Kemensietsidis (Toronto) Data integration, P2P databases
  – Rajendra Bose (UCSB) Earth sciences data. “Workflow” provenance in scientific data.
• At UKOLN
  – Michael Day, metadata and interoperability
• At CCLRC
  – Shoaib Sufi, data portals and metadata
• At Glasgow
  – Position in metadata extraction advertised

Progress report on DCC research (continued)
• Pleasant DCC space (thanks to Edina and Informatics) to house DCC and database group.
• Collaboration with
  – biologists (EBI & Edinburgh) on data publishing and
  – astronomers (Edinburgh) on XML manipulation & representation of large data sets.
• First DCC research visitor (Michael Lesk)
• Work with partners in progress on
  – annotation
  – DOIs

Please join us!!!
DCC has research positions in databases, digital curation, XML, web technology, fundamentals.
Edinburgh is a great place!!
Contact Peter Buneman opb@inf.ed.ac.uk
Top-rated department. World-class database group. Good connections with logical foundations, scientific DBs, distributed computation (Grid)
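
Sketch: relative keys in code
The relative key from the slides, book{name}.chapter{number}.verse{number}, can be illustrated with a small Python sketch. Everything named here (Node, check_relative_key, resolve, the placeholder "id" values) is hypothetical and chosen for illustration only; it is not taken from the talk or from any DCC software. The sketch assumes a simple tree of labelled nodes with attribute values.

```python
from dataclasses import dataclass, field


@dataclass
class Node:
    """A labelled node with attribute values and children (hypothetical model)."""
    label: str
    attrs: dict = field(default_factory=dict)
    children: list = field(default_factory=list)


def check_relative_key(parent, label, key_attrs):
    """True if no two children of `parent` with this label share the same
    values for `key_attrs`. With an empty `key_attrs`, at most one such
    child may exist (the 'empty key' case)."""
    seen = set()
    for child in parent.children:
        if child.label != label:
            continue
        value = tuple(child.attrs.get(a) for a in key_attrs)
        if value in seen:
            return False
        seen.add(value)
    return True


def resolve(root, steps):
    """Follow a citation given as (label, {key: value}) steps from `root`.
    Returns the cited node, or None if a step has no unique match."""
    node = root
    for label, wanted in steps:
        matches = [
            c for c in node.children
            if c.label == label
            and all(c.attrs.get(k) == v for k, v in wanted.items())
        ]
        if len(matches) != 1:  # nothing found, or the key is violated
            return None
        node = matches[0]
    return node


# Placeholder data: chapter number 1 occurs in both books, which is fine
# because `number` identifies a chapter only within its book.
bible = Node("bible", children=[
    Node("book", {"name": "Genesis"}, [
        Node("chapter", {"number": 1}, [
            Node("verse", {"number": 1, "id": "g-1-1"}),
            Node("verse", {"number": 2, "id": "g-1-2"}),
        ]),
        Node("chapter", {"number": 2}, [
            Node("verse", {"number": 1, "id": "g-2-1"}),
        ]),
    ]),
    Node("book", {"name": "Exodus"}, [
        Node("chapter", {"number": 1}, [
            Node("verse", {"number": 1, "id": "e-1-1"}),
        ]),
    ]),
])

citation = [("book", {"name": "Exodus"}),
            ("chapter", {"number": 1}),
            ("verse", {"number": 1})]
cited = resolve(bible, citation)
```

A citation is then just the list of key values along the path, so it stays meaningful for as long as the keys are preserved, whatever the physical layout of the data.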
