CS 540
Database Management Systems
Lecture 4: Project topics overview
Outline
• How to choose a project topic?
• Broad topic areas
• How to pick a project, write your report, and present
your work?
• Overview of sample project topics
What is a good research problem?
• A good research problem is a solvable challenge that is well
connected to a real world need/problem.
• Real world challenges vs. imaginary challenges
– Not all challenges are interesting (to society)
– Real world challenges are always interesting to work on
– Imaginary challenges may (happen to) be interesting
– Spend your effort solving interesting challenges so that
you’ll contribute more to society
• However, not all real world problems are challenges; some are
straightforward to solve.
• Not all challenges/problems are solvable (with limited resources,
time, money, tools, etc.)
Identify a Good Research Problem
[Figure: quadrant chart; horizontal axis Impact/Usefulness (unknown to known), vertical axis Level of Challenge (low risk to high risk)]
• High impact, high risk (hard): good long-term research problems
• High impact, low risk (easy): good short-term research problems
• Low impact, difficult: often publishable, but not good research problems
• Low impact, low risk: bad research problems; generally not publishable
• Unknown impact: good applications, but not interesting for research
• A course project belongs in the known-impact, low-risk region
Landscape of data management
[Figure: RDBMSs sit at the low end of three axes: scale, data complexity, and query capability (from exact matching through inexact matching to inferences/mining)]
DB and related areas
[Figure: overlapping circles for Databases, Text Information Management (Information Retrieval), Multimedia Information Management, and Data Mining/Machine Learning, with Web/Bio Information Management spanning their intersections]
Map of general topic areas
[Figure: the same circles (Databases, IR, Multimedia, Data Mining), labeled with topic areas at their cores and intersections: Core/Traditional DB; DB+IR; Multimedia DB; Web/Bio DB applications; Data Mining/Decision Support]
The big challenge
“... Our biggest challenge is a unification of approximate and
exact reasoning. Most of us come from the exact-reasoning
world – but most of our clients are asking questions with
approximate or probabilistic answers….”
– Jim Gray [SIGMOD 2004]
How to do a bad project and give a bad
presentation!
Slides from “How to Have a Bad Career!” by
David A. Patterson
How to Do a Bad Project?
Let Complexity Be Your Guide (Confuse Thine Enemies)
• Best compliment:
“It’s so complicated, I can’t understand the ideas”
• Easier to claim credit for subsequent good ideas
– If no one understands, how can they contradict your
claim?
• It’s easier to be complicated
• If it were not unsimple then how could distinguished
colleagues in departments around the world be
positively appreciative of both your extraordinary
intellectual grasp of the nuances of issues as well as
the depth of your contribution?
How to Do a Bad Project?
Never be Proven Wrong
• Avoid Implementing
• Avoid Quantitative Experiments
– If you’ve got good intuition, who needs
experiments?
– Takes too long to measure
• Avoid Benchmarks
• Projects whose payoff is ≥ 20 years away
give you 19 safe years
How to Do a Bad Project?
Use the Computer Scientific Method
Obsolete Scientific Method:
• Hypothesis
• Sequence of experiments
• Change 1 parameter per experiment
• Prove/disprove hypothesis
• Document for others to reproduce results

Computer Scientific Method:
• Hunch
• 1 experiment, changing all parameters
• Discard if it doesn’t support the hunch
• Why waste time? We know this
5 Commandments for Bad Writing
I. Thou shalt not define terms, nor explain anything.
– That’s why there are dictionaries; it insults the readers.
II. Thou shalt replace “will do” with “have done”.
– After all, someone is likely to build it in the next 2 to 3 years.
III. Thou shalt not mention drawbacks to your approach.
– That’s not your job; let others find the flaws.
IV. Thou shalt not reference any papers.
– If they were good people, they’d be at your institution.
V. Thou shalt write before implementing.
– Since nothing is implemented yet, you can claim the highest performance.
7 Talk Commandments for a Bad Talk
I. Thou shalt not illustrate.
II. Thou shalt not covet brevity.
– Do you want to continue the stereotype that engineers can’t write? Always
use complete sentences, never just key words. If possible, use whole
paragraphs and read every word.
III. Thou shalt not print large.
– Be humble; use a small font. Important people sit in front.
IV. Thou shalt not use color.
V. Thou shalt cover thy naked slides.
VI. Thou shalt not skip slides in a long talk.
– You prepared the slides; people came for your whole talk; so just talk
faster.
VII. Thou shalt not practice.
– Why waste research time practicing a talk?
Following all the commandments
• We describe the philosophy and design of the control flow machine, and present the results of detailed simulations of the performance of a single processing element. Each factor is compared with the measured performance of an advanced von Neumann computer running equivalent code. It is shown that the control flow processor compares favorably in the program.
• We present a denotational semantics for a logic program to construct a control flow for the logic program. The control flow is defined as an algebraic manipulator of idempotent substitutions and it virtually reflects the resolution deductions. We also present a bottom-up compilation of medium grain clusters from a fine grain control flow graph. We compare the basic block and the dependence sets algorithms that partition control flow graphs into clusters.
• A hierarchical macro-control-flow computation allows them to exploit the coarse grain parallelism inside a macrotask, such as a subroutine or a loop, hierarchically. We use a hierarchical definition of macrotasks, a parallelism extraction scheme among macrotasks defined inside an upper layer macrotask, and a scheduling scheme which assigns hierarchical macrotasks on hierarchical clusters.
• We apply a parallel simulation scheme to a real problem: the simulation of a control flow architecture, and we compare the performance of this simulator with that of a sequential one. Moreover, we investigate the effect of modeling the application on the performance of the simulator. Our study indicates that parallel simulation can reduce the execution time significantly if appropriate modeling is used.
• We have demonstrated that to achieve the best execution time for a control flow program, the number of nodes within the system and the type of mapping scheme used are particularly important. In addition, we observe that a large number of subsystem nodes allows more actors to be fired concurrently, but the communication overhead in passing control tokens to their destination nodes causes the overall execution time to increase substantially.
• The relationship between the mapping scheme employed and locality effect in a program are discussed. The mapping scheme employed has to exhibit a strong locality effect in order to allow efficient execution.
• Medium grain execution can benefit from a higher output bandwidth of a processor and finally, a simple superscalar processor with an issue rate of ten is sufficient to exploit the internal parallelism of a cluster. Although the technique does not exhaustively detect all possible errors, it detects nontrivial errors with a worst-case complexity quadratic to the system size. It can be automated and applied to systems with arbitrary loops and nondeterminism.
Following all the commandments
How to Do a Bad Poster
David Patterson
University of California
Berkeley, CA 94720
Our compiling strategy is to exploit coarse-grain
parallelism at function application level: and the
function application level parallelism is
implemented by fork-join mechanism. The
compiler translates source programs into control
flow graphs based on analyzing flow of control,
and then serializes instructions within graphs
according to flow arcs such that function
applications, which have no control dependency,
are executed in parallel.
[The remaining poster panels repeat the abstract paragraphs from the previous slide as an unbroken wall of text.]
Alternatives to Bad Papers
• Do opposite of Bad Paper commandments
Define terms, distinguish “will do” vs “have done”,
mention drawbacks, real performance, reference other papers.
• Find related work
• First read Strunk and White, then follow these steps:
1. 1-page paper outline, with tentative page budget per section
2. Paragraph map
• 1 topic phrase/sentence per paragraph
3. (Re)Write draft
• Long figure captions can contain details
• Use tables to contain facts that would make dreary prose
4. Read aloud, spell check & grammar check
5. Get feedback from friends and critics on draft; go to 3.
• www.cs.berkeley.edu/~pattrsn/talks/writingtips.html
Alternatives to Bad Talk
• Do opposite of Bad Talk commandments
I. Thou shalt not illustrate.
II. Thou shalt not covet brevity.
III. Thou shalt not print large.
IV. Thou shalt not use color.
V. Thou shalt cover thy naked slides.
VI. Thou shalt not skip slides in a long talk.
VII. Thou shalt not practice.
• Allocate 2 minutes per slide, leave time for questions
• Don’t over animate
• Do dry runs with friends/critics for feedback,
– including tough audience questions
• Record a practice talk (audio or video)
– Don’t memorize speech, but have notes ready
Sample Project Topics
Query and visualize RDF data
• Many graph datasets are in Resource Description
Framework (RDF) format
– Also called linked data
• RDF database
– set of triplets:
subject predicate object
• The number and size of data sets are rapidly growing.
• Wikidata, DBPedia, FOAF, Knowledge graph, …
• You may find datasets at linkeddata.org, rdfdata.org, …
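A minimal sketch of a triple store in code, using the rdflib Python library; the namespace and resources are made up for illustration:

    # Build a tiny RDF graph: every fact is a (subject, predicate, object) triple.
    from rdflib import Graph, Namespace, Literal

    EX = Namespace("http://example.org/")
    g = Graph()
    g.add((EX.alice, EX.knows, EX.bob))
    g.add((EX.alice, EX.name, Literal("Alice")))
    g.add((EX.bob, EX.name, Literal("Bob")))

    # No schema was declared anywhere: new predicates can be added at will.
    g.add((EX.bob, EX.worksAt, EX.OSU))

    for s, p, o in g:
        print(s, p, o)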
Query and visualize RDF data
• RDF database
– No prescribed schema:
• easy to create and extend: semantic Web standard
• hard to formulate queries!
• query processing is relatively inefficient.
• RDF data management systems / triple stores
– Public: Apache Jena, KiWi, …
– Proprietary: IBM DB2, Oracle, …
• SPARQL query language
– Similar to SQL
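To sketch the SQL resemblance, here is a SPARQL query run with rdflib over the toy graph g from the previous sketch; it is roughly analogous to SELECT name FROM people WHERE knows = 'bob' in SQL:

    query = """
    PREFIX ex: <http://example.org/>
    SELECT ?name WHERE {
        ?person ex:knows ex:bob .
        ?person ex:name ?name .
    }
    """
    for row in g.query(query):
        print(row.name)   # prints "Alice"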
Query and visualize RDF data
• Create an easy to use query interface for RDF data
– some work on keyword search over RDF
• low precision, slow
– You may combine SPARQL with some keyword
search features (a naive version is sketched below).
– Query suggestion, auto-completion, … for SPARQL or
keyword queries.
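To make the low-precision point concrete, here is a deliberately naive keyword search over the toy rdflib graph from above: every literal containing the keyword matches, with no ranking at all:

    from rdflib import Literal

    def keyword_search(g, keyword):
        kw = keyword.lower()
        for s, p, o in g:
            # Match any triple whose object literal mentions the keyword.
            if isinstance(o, Literal) and kw in str(o).lower():
                yield (s, p, o)

    for triple in keyword_search(g, "alice"):
        print(triple)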
Query and visualize RDF data
• The results of RDF queries are usually not easy to
understand
– Large graphs
• Create an interface that summarizes the results
– Show the most important/relevant nodes/links first
– User can navigate over results
– You may do this for the whole database
• It helps users to understand the structure of the data
and specify queries.
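One simple notion of "most important first", sketched in Python over the rdflib graph g from the earlier sketches: rank resources by degree, i.e., the number of triples they appear in (a real summarizer might use PageRank or another centrality measure instead):

    from collections import Counter

    degree = Counter()
    for s, p, o in g:
        degree[s] += 1
        degree[o] += 1

    # Show the five best-connected nodes first.
    for node, d in degree.most_common(5):
        print(node, d)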
Query and visualize RDF data
• Create an interaction interface over RDF
– Users usually interact with the database over a long period
of time
• Submit query => explore the result => formulate the next
query => explore the result => …
– The interface makes it easier for users to formulate
queries based on the current results.
• Keeps a history of previous queries
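A minimal sketch of such a session object; the interface is hypothetical, and a real one would also keep the result graphs for reuse:

    class RDFSession:
        """Runs SPARQL queries and remembers what was asked before."""
        def __init__(self, graph):
            self.graph = graph
            self.history = []   # (query, result count) pairs

        def run(self, sparql):
            results = list(self.graph.query(sparql))
            self.history.append((sparql, len(results)))
            return results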
Querying relational data
• Most users do not know the schema and content of
their relational databases.
• Create an interface that helps users write SQL
queries
– Query completion and suggestion
– Create visualization of the schema
• Show more important tables at a higher level.
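Schema introspection is the natural starting point for such an interface. A sketch against SQLite's catalog (other DBMSs expose information_schema instead); the database file name is made up:

    import sqlite3

    conn = sqlite3.connect("example.db")   # hypothetical database
    cur = conn.cursor()
    cur.execute("SELECT name FROM sqlite_master WHERE type = 'table'")
    for (table,) in cur.fetchall():
        cols = [row[1] for row in cur.execute(f"PRAGMA table_info({table})")]
        fks = cur.execute(f"PRAGMA foreign_key_list({table})").fetchall()
        # A visualization could rank tables by how many foreign keys point
        # at them; here we just list columns and outgoing foreign keys.
        print(table, cols, f"{len(fks)} outgoing foreign keys")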
Data independence
• The relational model is not access path independent
• How can you make SQL more access path
independent?
– Map the schema of the query to the schema of the
database.
database schema: EmpManager(E,M,D)
user assumes the schema: Emp(E, D), Manager(M, D)
user query: select E from Emp =>
transformed query: select E from EmpManager
– Try all possible schemas.
• Slow!
• Data independent learning and inference
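A toy sketch of the schema-mapping idea in Python: rewrite a query posed against the schema the user assumes (Emp(E, D), Manager(M, D)) into one against the stored schema (EmpManager(E, M, D)). A real system would do this through view definitions; the string rewrite below is only to make the mapping concrete:

    mapping = {"Emp": "EmpManager", "Manager": "EmpManager"}

    def rewrite(query, mapping):
        # Replace each table name the user assumed with its stored table.
        return " ".join(mapping.get(tok, tok) for tok in query.split())

    print(rewrite("select E from Emp", mapping))
    # -> select E from EmpManager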
Visualize relational data
• Create a visualization engine for SQL queries
– Many users like to see charts and visualizations instead of
tables.
– Visualization engines do not normally work with
relational databases.
• Create an interactive query interface for SQL
– Keeps a history of previous queries
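A sketch of the smallest possible version: run a SQL query with pandas and hand the result to matplotlib instead of printing a table. The database file, table, and columns are made up:

    import sqlite3
    import pandas as pd
    import matplotlib.pyplot as plt

    conn = sqlite3.connect("example.db")   # hypothetical database
    df = pd.read_sql_query(
        "SELECT dept, COUNT(*) AS n FROM Emp GROUP BY dept", conn)
    df.plot.bar(x="dept", y="n")           # a chart instead of a table
    plt.show()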
Data preparation
• Most data scientists spend about 80% of their time
on data preparation!
– Transforming data from one form to another
• Most data sets are in spreadsheets, flat files, XML, HTML
tables, …
• We have to transform them to relational or RDF form.
– Cleaning data
• Removing meaningless values, apply constraints, …
– ….
• Currently, most data preparation is done manually
(a typical manual step is sketched below).
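A sketch of such a step done with pandas; the file name, columns, and constraints are invented for illustration:

    import pandas as pd

    df = pd.read_csv("survey.csv")                 # hypothetical flat file
    df = df.dropna(subset=["age"])                 # drop rows missing a value
    df = df[(df["age"] >= 0) & (df["age"] < 120)]  # apply a sanity constraint
    df["name"] = df["name"].str.strip().str.title()
    df.to_csv("survey_clean.csv", index=False)     # now loadable into a DBMS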
Help users prepare their data
• Example: Data Wrangler (now part of Trifacta)
• http://vis.stanford.edu/wrangler/app/
Help users prepare their data
• Pick a widely used data format
– spreadsheets, JSON, XML, log files, …
• Define natural and basic transformation
operations for this format
– Cleaning, re-organizing, transforming to
relational or RDF format
– Design a transformation interface
• Design a Domain Specific Language (DSL).
• Predict/suggest transformation operations
(a toy DSL is sketched below)
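A toy sketch of such a DSL in Python, in the Wrangler spirit: a program is a list of named operations applied to rows of a spreadsheet-like table. The operation names and semantics are invented for illustration:

    def split(rows, col, sep):
        # Split one column into several, e.g. "Smith,Alice" -> Smith | Alice.
        return [r[:col] + r[col].split(sep) + r[col + 1:] for r in rows]

    def drop_empty(rows):
        return [r for r in rows if any(cell.strip() for cell in r)]

    ops = {"split": split, "drop_empty": drop_empty}
    program = [("split", {"col": 0, "sep": ","}), ("drop_empty", {})]

    rows = [["Smith,Alice", "CS"], ["", ""], ["Lee,Bob", "EE"]]
    for name, args in program:
        rows = ops[name](rows, **args)
    print(rows)   # [['Smith', 'Alice', 'CS'], ['Lee', 'Bob', 'EE']]

A predictor would then suggest the next (name, args) pair from examples of the user's edits.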
Theory projects
• Read some papers and approaches on a problem,
analyze, compare, and/or extend them.
– High technical depth / theory.
– You may slightly extend an approach.
• Schema equivalency
– One can represent the same data in different schemas:
• Emp(E, D), Manager(M, D) vs. EmpManager(E,M,D)
– Given two relational schemas, how can we find
out if they represent the same information?
• Representation dependence in probabilistic inference, J.
Halpern, JAIR, 2004.
• Relative information capacity of simple relational schema,
R. Hull, PODS, 1984.
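A concrete instance of the question, checked on toy data in Python: decompose EmpManager(E, M, D) into Emp(E, D) and Manager(M, D), then test whether the natural join on D recovers the original relation (here it does, but only because each department has a single manager; in general the decomposition loses information):

    emp_manager = {("alice", "carol", "db"), ("bob", "carol", "db"),
                   ("dave", "erin", "os")}

    # Project onto the two smaller schemas.
    emp     = {(e, d) for (e, m, d) in emp_manager}
    manager = {(m, d) for (e, m, d) in emp_manager}

    # Natural join on the shared attribute D.
    rejoined = {(e, m, d) for (e, d) in emp
                          for (m, d2) in manager if d == d2}
    print(rejoined == emp_manager)   # True for this instance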
Good project
• Technically deep
– More than building some forms over a database
• Novel
– Has some new ideas
• Effectively presented
• All in the scope of a term!
Project timeline
• Proposal due 1/19
– Group members, brief description of the problem.
• Midterm presentation due 2/3 – 2/4
– Clear definition of the problem, initial work and
result, plan for the rest of the term.
– A practice for final presentation!
• Final presentation 8/4 – 10/4
– Final results, analysis of the results.
• Final report 11/4
What you should do
• Form teams.
• Evaluate possible topics for your project.
• Talk to the instructors and TAs.
• Submit your project proposal.
What is next
• Database system implementation
– DBMS architecture, storage, and access methods
• You have two papers to review
– rather short papers!