What should DBs (or DB researchers) do?
by not Peter Buneman

A database analyst walks into a bar and goes up to two tables. "Hi. Can I join you?"

Principles of Data-Intensive Research: We Need Some
James Cheney

Why isn't someone in CS solving your problem?
• They don't know about it
• They don't understand it
• It's too domain-specific
• It's not in their interest
  • high risk, low reward
• Too similar to a problem solved already in the 70s

Hamming
• I went home one Friday after finishing a problem, and curiously enough I wasn't happy; I was depressed. I could see life being a long sequence of one problem after another after another. After quite a while of thinking I decided, "No, I should be in the mass production of a variable product. I should be concerned with all of next year's problems, not just the one in front of my face."
http://www.cs.virginia.edu/~robins/YouAndYourResearch.pdf

My background
• Programming languages
• Logic
• Formal methods
• Obviously, completely irrelevant to DIR, right?

Clear as cloud
• There is a lack of clarity about what the real problems are
• Exacerbated by:
  • discipline boundaries
  • cognitive dissonance
  • thanklessness of bridge-building
  • absence of theoretical basis for "real" DIR

Hoare
When any new language design project is nearing completion, there is always a mad rush to get new features added before standardization. The rush is mad indeed, because it leads into a trap from which there is no escape. A feature which is omitted can always be added later, when its design and its implications are well understood. A feature which is included before it is fully understood can never be removed later.
C. A. R. Hoare, Turing Award lecture, 1980

Some tarpits
• DIR now involves software engineering, DB admin, data modeling, language design
• In many cases, by amateurs (no offense!)
F. Brooks: The Mythical Man-Month. Read it!

Re-inventing the pothole, not the wheel

Is this a problem?
• Would you use a bridge designed by an engineer who had never heard of the calculus?

Revenge of Chicken Little
• Remember the Software Crisis?
• Never really went away
• No one has ever been killed by a data tsunami
• But injury, loss routinely caused by bad software
• We've learned to live with this

The Data Deluge
• Bytes are not information
• If they're all zeros, do they matter?
• C. E. Shannon: A Mathematical Theory of Communication. Read it!
• Good models -> high compressibility
• But danger of confirmation bias

Complexity
• Many dimensions

              MB                                      TB
Clear spec    PL, algorithms                          Databases
Fuzzy spec    Software engineering, security, GOFAI   DIR? Modern AI?

Complexity
• Many dimensions

                                Theory driven                    Data driven
Mature (known unknown)          Physics, chemistry               Astronomy, earth science
Developing (unknown unknown)    Classical bio, social/economic   Bioinformatics, e-SocSci

Transparent
• Some DIR challenges are (conceptually) straightforward applications of CS
  • machine learning
  • databases
  • algorithms
• Which is great!
• But only tip of iceberg

Opaque
• Other challenges are conceptually opaque
  • many possible solutions
  • but not clear which solution is "right"
• Provenance, metadata (IMO) prime examples
  • Almost every speaker mentioned it
  • Few gave any details

CS view of computers
[Figure; surviving panel label: "Rest of world"]

The dreaded P-word
• Almost every talk mentioned provenance
• Almost every usage had a different meaning
• Is there a common core?
  • Something that could be standardized?
• Or is this premature?

W3C Provenance Incubator Group
• http://www.w3.org/2005/Incubator/prov/wiki/

Metadata
• "One person's data is another's metadata"
• Please let us move beyond these cliches!
• AFAICT, metadata is really code for data that...
  • someone else neglected to record
  • and you need (for integration, reuse, etc.)
• No silver bullet. Or regular bullet. Or gun.

Summary
• Think about what the actual problem is
• Don't sweat the exabytes
• Many dimensions of complexity
• Need case studies, formal models of essential problems (metadata, provenance)
• Models feed into systems, which can be evaluated against needs
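
Aside: a minimal sketch of the "bytes are not information" point behind "Don't sweat the exabytes". It assumes Python and its standard zlib module, neither of which appears in the talk; the helper names byte_entropy and compression_ratio are made up for illustration. It compares a block of zeros with pseudo-random bytes of the same length: identical byte counts, very different information content.

```python
import math
import random
import zlib
from collections import Counter


def byte_entropy(data: bytes) -> float:
    """Empirical Shannon entropy in bits per byte (order-0 model only)."""
    if not data:
        return 0.0
    n = len(data)
    counts = Counter(data)
    # H = sum over byte values of p * log2(1/p), with p = count / n
    return sum((c / n) * math.log2(n / c) for c in counts.values())


def compression_ratio(data: bytes) -> float:
    """Compressed size divided by original size, using zlib (DEFLATE)."""
    return len(zlib.compress(data, 9)) / len(data)


# Illustrative inputs (assumptions, not data from the talk).
n = 100_000
zeros = bytes(n)                      # n zero bytes
rng = random.Random(0)                # fixed seed so runs are repeatable
noise = bytes(rng.getrandbits(8) for _ in range(n))

for name, blob in (("zeros", zeros), ("noise", noise)):
    print(f"{name}: {len(blob)} bytes, "
          f"entropy {byte_entropy(blob):.3f} bits/byte, "
          f"compressed to {compression_ratio(blob):.2%} of original")
```

The order-0 entropy used here is itself only a crude model. A model that captures more of the structure in the data would compress it further, which is the sense of "good models -> high compressibility".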