What should DBs (or DB researchers) do? by not peter buneman

A database analyst walks into a bar
and goes up to two tables.
"Hi. Can I join you?"
Principles of Data-Intensive Research: We Need Some
James Cheney
Why isn't someone in CS
solving your problem?
• They don't know about it
• They don't understand it
• It's too domain-specific
• It's not in their interest
• high risk, low reward
• Too similar to a problem solved already in
the 70s
Hamming
I went home one Friday after finishing a problem,
and curiously enough I wasn't happy; I was
depressed. I could see life being a long sequence
of one problem after another after another. After
quite a while of thinking I decided, "No, I should
be in the mass production of a variable product. I
should be concerned with all of next year's
problems, not just the one in front of my face."
http://www.cs.virginia.edu/~robins/YouAndYourResearch.pdf
My background
• Programming languages
• Logic
• Formal methods
• Obviously, completely irrelevant to DIR,
right?
Clear as cloud
• There is a lack of clarity about what the
real problems are
• Exacerbated by:
• discipline boundaries
• cognitive dissonance
• thanklessness of bridge-building
• absence of theoretical basis for "real" DIR
Hoare
When any new language design project is nearing
completion, there is always a mad rush to get new
features added before standardization. The rush is mad
indeed, because it leads into a trap from which there is
no escape. A feature which is omitted can always be
added later, when its design and its implications are
well understood. A feature which is included before it is
fully understood can never be removed later.
(C. A. R. Hoare, Turing Award lecture, 1980)
Some tarpits
• DIR now involves software engineering, DB
admin, data modeling, language design
• In many cases, by amateurs (no offense!)
• F. Brooks: The Mythical Man-Month. Read it!
• Re-inventing the pothole, not the wheel
Is this a problem?
• Would you use a bridge designed by an engineer
who never heard of the calculus?
Revenge of Chicken Little
• Remember the Software Crisis?
• Never really went away
• No one has ever been killed by a data
tsunami
• But injury, loss routinely caused by bad
software
• We've learned to live with this
The Data Deluge
• Bytes are not information
• If they're all zeros, do they matter?
• C.E. Shannon - A Mathematical Theory of
Communication. Read it!
• Good models -> high compressibility
• But danger of confirmation bias
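A minimal sketch of the "bytes are not information" point, using only the Python standard library. The helper names (entropy_bits_per_byte, compression_ratio) and the 1 MB test size are assumptions made for illustration; they do not come from the talk. The sketch compares a buffer of zeros with a buffer of random bytes: both hold the same number of bytes, yet the zeros have zero empirical entropy and compress to almost nothing, while the random bytes barely compress at all. Note that incompressible is not the same as useful: random noise also tells you nothing, which is one way to read the warning about models and confirmation bias.

```python
import math
import os
import zlib
from collections import Counter


def entropy_bits_per_byte(data: bytes) -> float:
    """Empirical (order-0) Shannon entropy of the byte histogram, in bits per byte."""
    if not data:
        return 0.0
    n = len(data)
    # Sum of p * log2(1/p) over the byte values that actually occur.
    return sum((c / n) * math.log2(n / c) for c in Counter(data).values())


def compression_ratio(data: bytes) -> float:
    """Compressed size divided by original size, using zlib at its highest level."""
    return len(zlib.compress(data, 9)) / len(data)


size = 1_000_000          # hypothetical sample size: 1 MB per buffer
zeros = bytes(size)       # one million zero bytes
noise = os.urandom(size)  # one million random bytes

for name, blob in [("zeros", zeros), ("noise", noise)]:
    print(name,
          round(entropy_bits_per_byte(blob), 3),
          round(compression_ratio(blob), 4))
```

The order-0 entropy here ignores any structure between bytes, so it is only a rough proxy for what a good model of the data could achieve; zlib stands in for "a model" in the same loose sense.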
Complexity
• Many dimensions
             MB                        TB
Clear spec   PL, algorithms            Databases
Fuzzy spec   Software engineering,     DIR? Modern AI?
             security, GOFAI
Complexity
• Many dimensions
                    Theory driven      Data driven
Mature              Physics,           Astronomy,
(known unknown)     chemistry          earth science
Developing          Classical bio      Bioinformatics,
(unknown unknown)   Social/economic    e-SocSci
Transparent
• Some DIR challenges are (conceptually)
straightforward applications of CS
• machine learning
• databases
• algorithms
• Which is great!
• But only tip of iceberg
Opaque
• Other challenges are conceptually opaque
• many possible solutions
• but not clear which solution is "right"
• Provenance, metadata (IMO) prime examples
• Almost every speaker mentioned it
• Few gave any details
CS view of computers
[Figure: three panels, each labeled "Rest of world" and "+"]
The dreaded P-word
• Almost every talk mentioned provenance
• Almost every usage had a different meaning
• Is there a common core?
• Something that could be standardized?
• Or is this premature?
• W3C Provenance Incubator Group
• http://www.w3.org/2005/Incubator/prov/wiki/
Metadata
• "One person's data is another's metadata"
• Please let us move beyond these cliches!
• AFAICT, "metadata" is really code for data that...
• someone else neglected to record
• and you need (for integration, reuse, etc).
• No silver bullet. Or regular bullet. Or gun.
Summary
• Think about what the actual problem is
• Don't sweat the exabytes
• Many dimensions of complexity
• Need case studies, formal models of
essential problems (metadata, provenance)
• Models feed into systems, which can be
evaluated against needs