Why Don’t Scientists Use Databases?
Peter Buneman
Division of Informatics, University of Edinburgh
Digital Libraries grant IIS 98-17444 (NSF, DARPA, NLM, LoC, NEH, NASA)
http://db.cis.upenn.edu
http://db.cis.upenn.edu/Research/provenance.html
NeSC, 25 April 2002

Why Don’t Scientists Use Relational Databases Much?
Thanks to:
• The ontologists and astronomers at Edinburgh
• The database and bio-informatics groups at Penn
• Aleri Inc.
Special thanks (material stolen from):
• Sanjeev Khanna, Wang-Chiew Tan, Keishi Tajima, Susan Davidson, Fidel Salas

Scientific Data is Ubiquitous
• 500 or so public molecular biology databases
  – much discovery in silico
• Vast amounts of satellite imagery
  – maintaining it is very expensive
• Terabytes of astronomical data (not image data)
• Linguistic corpora are essential research tools
  – also in terabytes

Relational DBs -- tabular data

  Id   Name        Address
  123  H. Simpson  Springfield
  456  L. Simpson  Springfield
  321  A. Jones    London

  Title       Dept     Teacher
  Algorithms  CompSci  Dr. Deadhead
  Voltaire    French   Prof. lePew
  Geometry    Math     Dr. Obtuse

  Id   Course      Grade
  456  Geometry    A
  123  Algorithms  D
  456  Voltaire    A
  321  Geometry    B
  321  Algorithms  C

• Useful information is obtained by combining tables.
• Efficient algorithms for
  – combining and indexing tables
  – transaction processing (updates and multiple users)
• Relational databases are a multi giga-$ industry

Reasons for Mismatch
• Scientific data sets are too large (image data, huge analyses)
• Scientific data is too complex
• Relational databases don’t work well with arrays and scientific computation
• Schema evolution and history are important
• Databases are too expensive

Swissprot - a curated database
ID   11SB_CUCMA STANDARD; PRT; 480 AA.
AC   P13744;
DT   01-JAN-1990 (REL. 13, CREATED)
DT   01-JAN-1990 (REL. 13, LAST SEQUENCE UPDATE)
DT   01-NOV-1990 (REL. 16, LAST ANNOTATION UPDATE)
DE   11S GLOBULIN BETA SUBUNIT PRECURSOR.
OS   CUCURBITA MAXIMA (PUMPKIN) (WINTER SQUASH).
OC   EUKARYOTA; PLANTA; EMBRYOPHYTA; ANGIOSPERMAE; DICOTYLEDONEAE;
OC   VIOLALES; CUCURBITACEAE.
RN   [1]
RP   SEQUENCE FROM N.A.
RC   STRAIN=CV. KUROKAWA AMAKURI NANKIN;
RX   MEDLINE; 88166744.
RA   HAYASHI M., MORI H., NISHIMURA M., AKAZAWA T., HARA-NISHIMURA I.;
RL   EUR. J. BIOCHEM. 172:627-632(1988).
RN   [2]
RP   SEQUENCE OF 22-30 AND 297-302.
RA   OHMIYA M., HARA I., MASTUBARA H.;
RL   PLANT CELL PHYSIOL. 21:157-167(1980).
CC   -!- FUNCTION: THIS IS A SEED STORAGE PROTEIN.
CC   -!- SUBUNIT: HEXAMER; EACH SUBUNIT IS COMPOSED OF AN ACIDIC AND A BASIC
CC       CHAIN DERIVED FROM A SINGLE PRECURSOR AND LINKED BY A DISULFIDE
CC       BOND.
CC   -!- SIMILARITY: TO OTHER 11S SEED STORAGE PROTEINS (GLOBULINS).
DR   EMBL; M36407; G167492; -.
DR   PIR; S00366; FWPU1B.
DR   PROSITE; PS00305; 11S_SEED_STORAGE; 1.
KW   SEED STORAGE PROTEIN; SIGNAL.
FT   SIGNAL        1     21
FT   CHAIN        22    480       11S GLOBULIN BETA SUBUNIT.
FT   CHAIN        22    296       GAMMA CHAIN (ACIDIC).
FT   CHAIN       297    480       DELTA CHAIN (BASIC).
FT   MOD_RES      22     22       PYRROLIDONE CARBOXYLIC ACID.
FT   DISULFID    124    303       INTERCHAIN (GAMMA-DELTA) (POTENTIAL).
FT   CONFLICT     27     27       S -> E (IN REF. 2).
FT   CONFLICT     30     30       E -> S (IN REF. 2).
SQ   SEQUENCE 480 AA; 54625 MW; D515DD6E CRC32;
     MARSSLFTFL CLAVFINGCL SQIEQQSPWE FQGSEVWQQH RYQSPRACRL ENLRAQDPVR
     RAEAEAIFTE VWDQDNDEFQ CAGVNMIRHT IRPKGLLLPG FSNAPKLIFV AQGFGIRGIA
     EAFQIDGGLV RKLKGEDDER DRIVQVDEDF EVLLPEKDEE ERSRGRYIES ESESENGLEE
     TICTLRLKQN IGRSVRADVF NPRGGRISTA NYHTLPILRQ VRLSAERGVL YSNAMVAPHY
     TVNSHSVMYA TRGNARVQVV DNFGQSVFDG EVREGQVLMI PQNFVVIKRA SDRGFEWIAF
     KTNDNAITNL LAGRVSQMRM LPLGVLSNMY RISREEAQRL KYGQQEMRVL SPGRSQGRRE

(1 of ~100,000 entries)

Metadata
...
OS   CUCURBITA MAXIMA (PUMPKIN) (WINTER SQUASH).
OC   EUKARYOTA; PLANTA; EMBRYOPHYTA; ANGIOSPERMAE; DICOTYLEDONEAE;
OC   VIOLALES; CUCURBITACEAE.
RN   [1]
RP   SEQUENCE FROM N.A.
RC   STRAIN=CV. KUROKAWA AMAKURI NANKIN;
RX   MEDLINE; 88166744.
RA   HAYASHI M., MORI H., NISHIMURA M., AKAZAWA T., HARA-NISHIMURA I.;
RL   EUR. J. BIOCHEM. 172:627-632(1988).
...

Data ???
     MARSSLFTFL CLAVFINGCL SQIEQQSPWE FQGSEVWQQH RYQSPRACRL ENLRAQDPVR
     RAEAEAIFTE VWDQDNDEFQ CAGVNMIRHT IRPKGLLLPG FSNAPKLIFV AQGFGIRGIA
     IPGCAETYQT DLRRSQSAGS AFKDQHQKIR PFREGDLLVV PAGVSHWMYN RGQSDLVLIV
     FADTRNVANQ IDPYLRKFYL AGRPEQVERG VEEWERSSRK GSSGEKSGNI FSGFADEFLE
     ...
//

The following slides highlight parts of the same entry.

Record (inadequate) of history
DT   01-JAN-1990 (REL. 13, CREATED)
DT   01-JAN-1990 (REL. 13, LAST SEQUENCE UPDATE)
DT   01-NOV-1990 (REL. 16, LAST ANNOTATION UPDATE)

Hierarchical data. Order important.
RN   [1]
RP   SEQUENCE FROM N.A.
RC   STRAIN=CV. KUROKAWA AMAKURI NANKIN;
RX   MEDLINE; 88166744.
RA   HAYASHI M., MORI H., NISHIMURA M., AKAZAWA T., HARA-NISHIMURA I.;
RL   EUR. J. BIOCHEM. 172:627-632(1988).
RN   [2]
RP   SEQUENCE OF 22-30 AND 297-302.
RA   OHMIYA M., HARA I., MASTUBARA H.;
RL   PLANT CELL PHYSIOL. 21:157-167(1980).

Tree data (recursive query processing?)
OC   EUKARYOTA; PLANTA; EMBRYOPHYTA; ANGIOSPERMAE; DICOTYLEDONEAE;
OC   VIOLALES; CUCURBITACEAE.

Array indices (array operations?)
FT   SIGNAL        1     21
FT   CHAIN        22    480       11S GLOBULIN BETA SUBUNIT.
FT   CHAIN        22    296       GAMMA CHAIN (ACIDIC).
FT   CHAIN       297    480       DELTA CHAIN (BASIC).
FT   MOD_RES      22     22       PYRROLIDONE CARBOXYLIC ACID.
FT   DISULFID    124    303       INTERCHAIN (GAMMA-DELTA) (POTENTIAL).
FT   CONFLICT     27     27       S -> E (IN REF. 2).
FT   CONFLICT     30     30       E -> S (IN REF. 2).

Structure in comments = schema evolution
CC   -!- FUNCTION: THIS IS A SEED STORAGE PROTEIN.
CC   -!- SUBUNIT: HEXAMER; EACH SUBUNIT IS COMPOSED OF AN ACIDIC AND A BASIC
CC       CHAIN DERIVED FROM A SINGLE PRECURSOR AND LINKED BY A DISULFIDE
CC       BOND.
CC   -!- SIMILARITY: TO OTHER 11S SEED STORAGE PROTEINS (GLOBULINS).
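The points made on the Swissprot slides (order matters, the OC lines form a path in a tree, the FT lines carry array indices) can be illustrated with a small Python sketch. It is only a sketch under stated assumptions: the function names (`parse_lines`, `taxonomy`, `features`, `covering`) are hypothetical, and the sample assumes each line is a two-letter code followed by whitespace and content, since the exact column layout was lost from the slide text.

```python
from collections import namedtuple

# Hypothetical record type for one FT (feature table) line.
Feature = namedtuple("Feature", ["key", "start", "end", "description"])

# A few lines of the entry; the spacing is an assumed layout.
SAMPLE = """\
ID   11SB_CUCMA     STANDARD;      PRT;   480 AA.
OC   EUKARYOTA; PLANTA; EMBRYOPHYTA; ANGIOSPERMAE; DICOTYLEDONEAE;
OC   VIOLALES; CUCURBITACEAE.
FT   SIGNAL        1     21
FT   CHAIN        22    480       11S GLOBULIN BETA SUBUNIT.
FT   CHAIN        22    296       GAMMA CHAIN (ACIDIC).
FT   CHAIN       297    480       DELTA CHAIN (BASIC).
//
"""


def parse_lines(text):
    """Return (code, content) pairs in file order; order is significant."""
    pairs = []
    for raw in text.splitlines():
        if not raw.strip() or raw.startswith("//"):
            continue  # skip blank lines and the end-of-entry marker
        code, _, content = raw.partition(" ")
        pairs.append((code, content.strip()))
    return pairs


def taxonomy(pairs):
    """Join the OC lines into one root-to-leaf path of taxa."""
    joined = " ".join(content for code, content in pairs if code == "OC")
    return [t.strip() for t in joined.rstrip(".").split(";") if t.strip()]


def features(pairs):
    """Parse FT lines into Feature records with integer positions."""
    result = []
    for code, content in pairs:
        if code != "FT":
            continue
        parts = content.split(None, 3)
        description = parts[3] if len(parts) > 3 else ""
        result.append(Feature(parts[0], int(parts[1]), int(parts[2]), description))
    return result


def covering(feats, position):
    """An index-style query: which features span a sequence position?"""
    return [f for f in feats if f.start <= position <= f.end]


pairs = parse_lines(SAMPLE)
path = taxonomy(pairs)
feats = features(pairs)
at_25 = covering(feats, 25)
```

Even this toy reader needs an ordered list, a tree path and integer ranges, none of which maps directly onto a single flat table.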
To turn Swissprot into tables requires:
• 20 - 30 tables
  – nothing extraordinary by relational standards, but
  – huge query to reconstruct original form
• Invented keys
• Queries on order and arrays
• Recursive query processing
• Also need to deal with schema evolution

Curated Databases
• Useful scientific databases are often curated: they are created/maintained with a great deal of “manual” labour.
[Figure: “Database people’s idea of what happens” (DB1, DB2 and the query “select xyz from pqr where abc”) contrasted with “What really happens”]

Database Inter-dependence is Complex
[Figure: dependencies among GERD, EpoDB, TRRD, BEAD, TransFac, GenBank, GAIA and Swissprot]

Three new topics
• Annotation
  – how do I annotate a data element, and how is this passed through queries?
• Archiving
  – how do we keep all the old versions of a database?
• Vertical partitioning
  – combining databases and vector processing

Data Annotation (Khanna, Tan)
• Some databases (e.g. biology and linguistics) are designed to accommodate annotations
• Also a need for ad hoc (unanticipated) annotations.
  – How are annotations communicated?
  – How are they passed through queries?
• No general techniques or principles.

Sharing annotations (courtesy of Wang-Chiew Tan)

NYRestaurants (Source Table)
  Restaurant          Cost  Type      Zip
  Peacock Alley       $$$   French    10022
  Bull & Bear         $$$   Seafood   10022
  Pacifica            $     Chinese   10013
  Soho Kitchen & Bar  $     American  10022

Annotations shown on the slide:
• “Serves fine French Cuisine in elegant setting. Jackets required.”
• “Extensive wine list!”
• “Yummy chicken curry!!”

All Restaurants (View 1)
  Restaurant          Cost  Type
  Peacock Alley       $$$   French
  Bull & Bear         $$$   Seafood
  Pacifica            $     Chinese
  Soho Kitchen & Bar  $     American

Cheap Restaurants (View 2)
  Restaurant          Cost  Type
  Pacifica            $     Chinese
  Soho Kitchen & Bar  $     American

Annotation looks simple but ...
• Computing how an annotation should move through a query is intractable
• Equivalent queries may not carry annotations in the same way
• New insights are needed!

How do we Build Archival Databases? [Khanna, Tajima, Tan]
• Many scientific databases keep archives. It’s important to preserve the state of knowledge as it was in the past
• Archive frequently: space consuming
• Archive infrequently: delay in getting recent information published.

The dangers of electronic documents
Report of a DOE “bioinformatics summit” ca. 1994
http://www.ornl.gov/hgmis/publicat/miscpubs/bioinfo/inf_rep2.html#AppenI

Then:
APPENDIX I: SAMPLE QUESTIONS FOR A FEDERATED DATABASE
Continued HGP progress will depend in part upon the ability of genome databases to answer increasingly complex queries that span multiple community databases. Some examples of such queries are given in this appendix. Note, however, until a fully relationalized sequence database is available, none of the queries in this appendix can be answered. ...

Now:
APPENDIX I: SAMPLE QUESTIONS FOR A FEDERATED DATABASE
Continued HGP progress will depend in part upon the ability of genome databases to answer increasingly complex queries that span multiple community databases. Some examples of such queries are given in this appendix. Note, however, until a fully atomized sequence database is available (i.e., no data stored in ASCII text fields), none of the queries in this appendix can be answered. ...

(No archive/edition! No footnote!)
Examples from Bioinformatics
• Swissprot. New version produced every four months.
  – Old versions are kept.
  – Difficult to get at most recent data
• OMIM. New version produced every day
  – Old versions are not kept
  – Impossible to reconstruct past states of the data

Current approaches use “diff”

  Line  Version 1                       Version 2
  1     <DB>                            <DB>
  2     <Person>                        <Person>
  3     <Name>Joe</>                    <Name>Jane</>
  4     <DateOfBirth>March</>           <DateOfBirth>May</>
  5     <Address>South Street</>        <Address>South Street</>
  6     <Zip>12345</>                   <Zip>12345</>
  7     </>                             </>
  8     <Person>                        <Person>
  9     <Name>Jane</>                   <Name>Joe</>
  10    <DateOfBirth>May</>             <DateOfBirth>March</>
  11    <Address>Pine Street</>         <Address>Pine Street</>
  12    <Zip>67890</>                   <Zip>67890</>
  13    </>                             </>
  14    </>                             </>

Output of line diff (versions 1-2):
  3,4c
  <Name>Jane</>
  <DateOfBirth>May</>
  9,10c
  <Name>Joe</>
  <DateOfBirth>March</>

need to preserve “object continuity” through time

A Sequence of Versions

“Pushing” time down
[Driscoll, Sarnak, Sleator, Tarjan: “Making Data Structures Persistent.”]

Experimental Results (OMIM)
[Plot: size (bytes) x 10^6 against number of versions. Legend: archive, inc diff, version, compressed inc diff, compressed archive. Curve labels: gzip(inc diff), XMill(archive).]
Uncompressed
• Archive size is
  – 1.01 times diff repository size
  – 1.04 times size of largest version
Compressed
• archive size is between 0.94 and 1 times compressed diff repository size
• gzip - unix compression tool
• XMill - XML compression tool
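To make the contrast with line diff concrete, here is a toy Python sketch of merging versions by key, so that a change is recorded against the object it belongs to. It is not the archiver described in the talk, which works on XML. The function names (`merge_version`, `snapshot`), the choice of Name as the key, and the dictionary layout are all assumptions made for illustration. The data are the two versions from the “diff” slide.

```python
def merge_version(archive, version_no, records):
    """Merge one version into the archive in place.

    records: key -> {field: value}
    archive: key -> {field: [(value, [version numbers])]}
    Each value is stored once, tagged with the versions in which it holds.
    """
    for key, fields in records.items():
        entry = archive.setdefault(key, {})
        for field, value in fields.items():
            history = entry.setdefault(field, [])
            if history and history[-1][0] == value:
                history[-1][1].append(version_no)  # unchanged: extend the tag
            else:
                history.append((value, [version_no]))  # changed: new value
    return archive


def snapshot(archive, version_no):
    """Reconstruct the records as they were in a given version."""
    result = {}
    for key, entry in archive.items():
        fields = {}
        for field, history in entry.items():
            for value, versions in history:
                if version_no in versions:
                    fields[field] = value
        if fields:
            result[key] = fields
    return result


# The two versions from the slide, keyed by Name (an assumed key).
v1 = {
    "Joe": {"DateOfBirth": "March", "Address": "South Street", "Zip": "12345"},
    "Jane": {"DateOfBirth": "May", "Address": "Pine Street", "Zip": "67890"},
}
v2 = {
    "Jane": {"DateOfBirth": "May", "Address": "South Street", "Zip": "12345"},
    "Joe": {"DateOfBirth": "March", "Address": "Pine Street", "Zip": "67890"},
}

archive = {}
merge_version(archive, 1, v1)
merge_version(archive, 2, v2)
joe_address = archive["Joe"]["Address"]
```

Under these assumptions the archive records that Joe’s address and zip changed between versions while his date of birth did not, whereas the line diff reports that the names and dates of birth on lines 3-4 and 9-10 changed.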
Experimental Results (Swissprot)
[Plot: size (bytes) x 10^6 against number of versions. Legend: archive, inc diff, version, compressed inc diff, compressed archive.]
Uncompressed
• Archive size is
  – 1.08 times diff repository size
  – 1.92 times size of largest version
Compressed
• archive size is between 0.59 and 1 times compressed diff repository size
• gzip - unix compression tool
• XMill - XML compression tool

The Bottom Line
• We have built an archiver, using XML as the base format
• We can build a year of archives (archive as often as you like) for a 14% increase on the size of the most recent database
• Based on keys -- preserves object history
• Works well with compression
• Obtaining an old archive is no more expensive than getting the current version.

Vertical partitioning
• An old idea revisited
• Fusion of array processing languages and database query languages
• Substantial use on Wall Street!!!

Conventional Storage
Rows are stored contiguously. Order is not preserved (Horizontal partitioning)
[Figure: disk pages]

Problems with conventional storage
• Unanticipated queries will probably read the whole database
    SELECT AVG(SQRT(shoe_size)) FROM employee WHERE hat_size > shoe_size
  (this only needs two fields)
• Order of rows is “random” and does not support order-sensitive functions: moving window averages, convolutions, etc.

Vertical partitioning (vectorisation)
Columns are stored contiguously. Order is preserved (Vertical partitioning)
[Figure: disk pages]

Advantages of Vertical Partitioning
• Faster queries.
  – A query that reads 2 columns in 100 does 2% of the i/o (i/o cost dominates)
  – A few columns can often reside in memory.
• Computation on order
• Can use both SQL and vector processing languages
• Downside: deletions are horribly expensive.
  – but deletions are uncommon in scientific DBs
• Vertical partitioning can also be performed on hierarchical structures -- like Swissprot -- and XML

Many other issues
• Heterogeneous data integration
  – a perennial problem
  – can it be done by the end-users?
• Distributed query evaluation against redundant, constrained data.
• Data provenance
• Data streams
• and many more
All these involve hard, fundamental problems in Computer Science
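As a rough illustration of the vertical partitioning slides, the Python sketch below keeps each column as its own list, answers the example query about shoe and hat sizes by touching only two columns, and computes a moving window average that relies on stored order. The table contents, the column names and the helper names (`avg_sqrt_shoe`, `moving_average`) are invented for the example; a real column store would keep the columns on disk pages rather than in Python lists.

```python
import math

# One list per column; position i in every list belongs to the same row.
# The values are made up for illustration.
employee = {
    "name": ["a", "b", "c", "d"],
    "hat_size": [7.0, 6.5, 9.0, 10.0],
    "shoe_size": [4.0, 9.0, 9.0, 9.0],
}


def avg_sqrt_shoe(columns):
    """Average sqrt(shoe_size) over rows where hat_size > shoe_size.

    Only the two columns the query mentions are read.
    """
    hats, shoes = columns["hat_size"], columns["shoe_size"]
    selected = [math.sqrt(s) for h, s in zip(hats, shoes) if h > s]
    return sum(selected) / len(selected) if selected else None


def moving_average(values, window):
    """Order-sensitive computation: mean of each run of `window` values."""
    return [
        sum(values[i:i + window]) / window
        for i in range(len(values) - window + 1)
    ]


query_result = avg_sqrt_shoe(employee)
smoothed = moving_average(employee["hat_size"], 2)
```

Deleting a row in this layout means removing one position from every column list, which hints at why the slides call deletions expensive.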