CS 540 Database Management Systems
Lecture 4: Project topics overview

Outline
• How to choose a project topic?
• Broad topic areas
• How to pick a project, write your report, and present your work?
• Overview of sample project topics

What is a good research problem?
• A good research problem is a solvable challenge that is well connected to a real-world need/problem.
• Real-world challenges vs. imaginary challenges
  – Not all challenges are interesting (to society)
  – Real-world challenges are always interesting to work on
  – Imaginary challenges may (happen to) be interesting
  – Spend your effort on interesting challenges so that you contribute more to society
• However, not all real-world problems are challenges; some are straightforward to solve.
• Not all challenges/problems are solvable (with limited resources, time, money, tools, etc.)

Identify a Good Research Problem
[Figure: chart with axes "Level of Challenges" and "Impact/Usefulness"]
• High impact, high risk (hard): good long-term research problems
• High impact, low risk (easy): good short-term research problems
• Low impact, difficult: often publishable, but not good research problems
• Low impact, low risk: bad research problems; generally not publishable
• Other labels on the chart: Unknown; Known; Good applications (not interesting for research); Course project

Landscape of data management
[Figure: axes "Query capability" (Exact Matching, Inexact Matching, Inferences/Mining), "Scale", and "Data complexity"; RDBMS marked]

DB and related areas
[Figure: Databases; Text Information Management (Information Retrieval); Multimedia Information Management; Web/Bio Information Management; Data Mining/Machine Learning]

Map of general topic areas
[Figure: Core/Traditional DB; Databases; IR; DB+IR; Multimedia; Multimedia DB; Web/Bio; Web/Bio DB; Data Mining; Data Mining, Decision Support; Applications]

The big challenge
“... Our biggest challenge is a unification of approximate and exact reasoning.
Most of us come from the exact-reasoning world – but most of our clients are asking questions with approximate or probabilistic answers….”
- Jim Gray [SIGMOD 2004]

How to do a bad project and give a bad presentation!
Slides from “How to Have a Bad Career!” by David A. Patterson

How to Do a Bad Project? Let Complexity Be Your Guide (Confuse Thine Enemies)
• Best compliment: “It’s so complicated, I can’t understand the ideas”
• Easier to claim credit for subsequent good ideas
  – If no one understands, how can they contradict your claim?
• It’s easier to be complicated
• If it were not unsimple then how could distinguished colleagues in departments around the world be positively appreciative of both your extraordinary intellectual grasp of the nuances of issues as well as the depth of your contribution?

How to Do a Bad Project? Never be Proven Wrong
• Avoid Implementing
• Avoid Quantitative Experiments
  – If you’ve got good intuition, who needs experiments?
  – Takes too long to measure
• Avoid Benchmarks
• Projects whose payoff is ≥ 20 years away give you 19 safe years

How to Do a Bad Project? Use the Computer Scientific Method
Obsolete Scientific Method
• Hypothesis
• Sequence of experiments
• Change 1 parameter/exp.
• Prove/Disprove Hypothesis
• Document for others to reproduce results
Computer Scientific Method
• Hunch
• 1 experiment & change all parameters
• Discard if doesn’t support hunch
• Why waste time? We know this

5 Commandments for Bad Writing
I. Thou shalt not define terms, nor explain anything.
  – That’s why there are dictionaries; it insults the readers.
II. Thou shalt replace “will do” with “have done”.
  – After all, someone is likely to build it in the next 2 to 3 years.
III. Thou shalt not mention drawbacks to your approach.
  – That’s not your job; let others find the flaws.
IV. Thou shalt not reference any papers.
  – If they were good people, they’d be at your institution.
V. Thou shalt write before implementing.
  – highest performance.

7 Talk Commandments for a Bad Talk
I. Thou shalt not illustrate.
II. Thou shalt not covet brevity.
  – Do you want to continue the stereotype that engineers can't write? Always use complete sentences, never just key words. If possible, use whole paragraphs and read every word.
III. Thou shalt not print large.
  – Be humble -- use a small font. Important people sit in front.
IV. Thou shalt not use color.
V. Thou shalt cover thy naked slides.
VI. Thou shalt not skip slides in a long talk.
  – You prepared the slides; people came for your whole talk; so just talk faster.
VII. Thou shalt not practice.
  – Why waste research time practicing a talk?

Following all the commandments
• We describe the philosophy and design of the control flow machine, and present the results of detailed simulations of the performance of a single processing element. Each factor is compared with the measured performance of an advanced von Neumann computer running equivalent code. It is shown that the control flow processor compares favorably in the program.
• We present a denotational semantics for a logic program to construct a control flow for the logic program. The control flow is defined as an algebraic manipulator of idempotent substitutions and it virtually reflects the resolution deductions. We also present a bottom-up compilation of medium grain clusters from a fine grain control flow graph. We compare the basic block and the dependence sets algorithms that partition control flow graphs into clusters.
• A hierarchical macro-control-flow computation allows them to exploit the coarse grain parallelism inside a macrotask, such as a subroutine or a loop, hierarchically. We use a hierarchical definition of macrotasks, a parallelism extraction scheme among macrotasks defined inside an upper layer macrotask, and a scheduling scheme which assigns hierarchical macrotasks on hierarchical clusters.
• We apply a parallel simulation scheme to a real problem: the simulation of a control flow architecture, and we compare the performance of this simulator with that of a sequential one. Moreover, we investigate the effect of modeling the application on the performance of the simulator. Our study indicates that parallel simulation can reduce the execution time significantly if appropriate modeling is used.
• We have demonstrated that to achieve the best execution time for a control flow program, the number of nodes within the system and the type of mapping scheme used are particularly important. In addition, we observe that a large number of subsystem nodes allows more actors to be fired concurrently, but the communication overhead in passing control tokens to their destination nodes causes the overall execution time to increase substantially.
• The relationship between the mapping scheme employed and locality effect in a program are discussed. The mapping scheme employed has to exhibit a strong locality effect in order to allow efficient execution.
• Medium grain execution can benefit from a higher output bandwidth of a processor and finally, a simple superscalar processor with an issue rate of ten is sufficient to exploit the internal parallelism of a cluster. Although the technique does not exhaustively detect all possible errors, it detects nontrivial errors with a worst-case complexity quadratic to the system size. It can be automated and applied to systems with arbitrary loops and nondeterminism.

Following all the commandments
How to Do a Bad Poster
David Patterson
University of California
Berkeley, CA 94720

Our compiling strategy is to exploit coarse-grain parallelism at function application level: and the function application level parallelism is implemented by fork-join mechanism.
The compiler translates source programs into control flow graphs based on analyzing flow of control, and then serializes instructions within graphs according to flow arcs such that function applications, which have no control dependency, are executed in parallel. We have demonstrated that to achieve the best execution time for a control flow program, the number of nodes within the system and the type of mapping scheme used are particularly important. In addition, we observe that a large number of subsystem nodes allows more actors to be fired concurrently, but the communication overhead in passing control tokens to their destination nodes causes the overall execution time to increase substantially. We describe the philosophy and design of the control flow machine, and present the results of detailed simulations of the performance of a single processing element. Each factor is compared with the measured performance of an advanced von Neumann computer running equivalent code. It is shown that the control flow processor compares favorably in the program. A hierarchical macro-control-flow computation allows them to exploit the coarse grain parallelism inside a macrotask, such as a subroutine or a loop, hierarchically. We use a hierarchical definition of macrotasks, a parallelism extraction scheme among macrotasks defined inside an upper layer macrotask, and a scheduling scheme which assigns hierarchical macrotasks on hierarchical clusters. The relationship between the mapping scheme employed and locality effect in a program are discussed. The mapping scheme employed has to exhibit a strong locality effect in order to allow efficient execution. We assess the average number of instructions in a cluster and the reduction in matching operations compared with fine grain control flow execution. We present a denotational semantics for a logic program to construct a control flow for the logic program. 
The control flow is defined as an algebraic manipulator of idempotent substitutions and it virtually reflects the resolution deductions. We also present a bottom-up compilation of medium grain clusters from a fine grain control flow graph. We compare the basic block and the dependence sets algorithms that partition control flow graphs into clusters. We apply a parallel simulation scheme to a real problem: the simulation of a control flow architecture, and we compare the performance of this simulator with that of a sequential one. Moreover, we investigate the effect of modeling the application on the performance of the simulator. Our study indicates that parallel simulation can reduce the execution time significantly if appropriate modeling is used. Medium grain execution can benefit from a higher output bandwidth of a processor and finally, a simple superscalar processor with an issue rate of ten is sufficient to exploit the internal parallelism of a cluster. Although the technique does not exhaustively detect all possible errors, it detects nontrivial errors with a worst-case complexity quadratic to the system size. It can be automated and applied to systems with arbitrary loops and nondeterminism.

Alternatives to Bad Papers
• Do the opposite of the Bad Paper commandments
  – Define terms, distinguish “will do” vs. “have done”, mention drawbacks, report real performance, reference other papers.
• Find related work
• First read Strunk and White, then follow these steps:
  1. 1-page paper outline, with tentative page budget/section
  2. Paragraph map
     • 1 topic phrase/sentence per paragraph
  3. (Re)Write draft
     • Long captions/figure can contain details
     • Use tables to contain facts that make dreary prose
  4. Read aloud, spell check & grammar check
  5. Get feedback from friends and critics on draft; go to 3.
• www.cs.berkeley.edu/~pattrsn/talks/writingtips.html

Alternatives to Bad Talk
• Do the opposite of the Bad Talk commandments
  I. Thou shalt not illustrate. II. Thou shalt not covet brevity. III.
  Thou shalt not print large. IV. Thou shalt not use color. V. Thou shalt cover thy naked slides. VI. Thou shalt not skip slides in a long talk. VII. Thou shalt not practice.
• Allocate 2 minutes per slide, leave time for questions
• Don’t over-animate
• Do dry runs with friends/critics for feedback,
  – including tough audience questions
• Record a practice talk (audio or video)
  – Don’t memorize speech, but have notes ready

Sample Project Topics

Query and visualize RDF data
• Many graph datasets are in Resource Description Framework (RDF) format
  – Also called linked data
• RDF database
  – set of triples: subject predicate object
• The number and size of data sets are rapidly growing.
• Wikidata, DBpedia, FOAF, Knowledge Graph, …
• You may find datasets at linkeddata.org, rdfdata.org, …

Query and visualize RDF data
• RDF database
  – No prescribed schema:
    • easy to create and extend: semantic Web standard
    • hard to formulate queries!
    • query processing is relatively inefficient.
• RDF data management systems / triple stores
  – Public: Apache Jena, KiWi, …
  – Proprietary: IBM DB2, Oracle, …
• SPARQL query language
  – Similar to SQL

Query and visualize RDF data
• Create an easy-to-use query interface for RDF data
  – some work on keyword search over RDF
    • low precision, slow
  – You may combine SPARQL with some keyword search features.
  – Query suggestion, auto-completion, … for SPARQL or keyword queries.

Query and visualize RDF data
• The results of RDF queries are usually not easy to understand
  – Large graphs
• Create an interface that summarizes the results
  – Show the most important/relevant nodes/links first
  – User can navigate over results
  – You may do this for the whole database
    • It helps users to understand the structure of the data and specify queries.
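
To make the RDF slides above concrete, here is a minimal sketch in plain Python (no triple store, no SPARQL engine). It stores a database as a set of subject/predicate/object triples, answers a single triple pattern with wildcards, and ranks nodes by degree as one possible stand-in for "most important nodes first". The data, the names `TRIPLES`, `match`, and `top_nodes`, and the choice of degree as the importance measure are all assumptions made for illustration; a real project would likely use a triple store such as Apache Jena and a better ranking.

```python
from collections import Counter

# Hypothetical toy dataset: each triple is (subject, predicate, object).
TRIPLES = [
    ("alice", "knows", "bob"),
    ("alice", "worksAt", "AcmeCorp"),
    ("bob", "knows", "carol"),
    ("carol", "worksAt", "AcmeCorp"),
    ("bob", "worksAt", "AcmeCorp"),
]


def match(triples, s=None, p=None, o=None):
    """Return the triples matching one pattern; None acts as a wildcard."""
    return [
        t for t in triples
        if (s is None or t[0] == s)
        and (p is None or t[1] == p)
        and (o is None or t[2] == o)
    ]


def top_nodes(triples, k=3):
    """Rank nodes by degree (number of triples in which they appear).

    Degree is only an assumed proxy for importance in this sketch.
    """
    degree = Counter()
    for s, _, o in triples:
        degree[s] += 1
        degree[o] += 1
    # Descending degree, then name, so ties are broken deterministically.
    ranked = sorted(degree.items(), key=lambda kv: (-kv[1], kv[0]))
    return ranked[:k]


# Who works at AcmeCorp? Roughly: SELECT ?s WHERE { ?s worksAt AcmeCorp }
employees = [s for s, _, _ in match(TRIPLES, p="worksAt", o="AcmeCorp")]

# Nodes an interface might show first when summarizing the graph.
summary = top_nodes(TRIPLES, k=3)
```

A summarization interface could start from `summary` and let the user expand a node by calling `match` with that node as subject or object.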

Query and visualize RDF data
• Create an interaction interface over RDF
  – Users usually interact with the database over a long period of time
    • Submit query => explore the result => formulate the next query => explore the result => …
  – The interface makes it easier for users to formulate queries based on the current results.
    • Keeps a history of previous queries

Querying relational data
• Most users do not know the schema and content of their relational databases.
• Create an interface that helps users write SQL queries
  – Query completion and suggestion
  – Create a visualization of the schema
    • More important tables at a higher level.

Data independence
• The relational model is not access-path independent
• How can you make SQL more access-path independent?
  – Map the schema of the query to the schema of the database.
      database schema: EmpManager(E, M, D)
      schema the user assumes: Emp(E, D), Manager(M, D)
      user query: select E from Emp
      => transformed query: select E from EmpManager
  – Try all possible schemas.
    • Slow!
• Data independent learning and inference

Visualize relational data
• Create a visualization engine for SQL queries
  – Many users like to see charts and visualizations instead of tables.
  – Visualization engines do not normally work with relational databases.
• Create an interactive query interface for SQL
  – Keeps a history of previous queries

Data preparation
• Most data scientists spend about 80% of their time on data preparation!
  – Transforming data from one form to another
    • Most data sets are in spreadsheets, flat files, XML, HTML tables, …
    • We have to transform them to relational or RDF form.
  – Cleaning data
    • Removing meaningless values, applying constraints, …
  – ….
• Currently most data preparation is done manually.
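
The two data preparation steps named above (transforming a flat file to relational form, and cleaning meaningless values under constraints) can be sketched with the Python standard library. Everything specific here is an assumption for illustration: the sample data in `RAW`, the set of placeholder strings in `MISSING`, the table layout, and the helper names `clean_rows` and `load`.

```python
import csv
import io
import sqlite3

# Hypothetical raw export: stray whitespace, a placeholder value, a bad number.
RAW = """name,dept,salary
 Ann ,Sales,52000
Bob,N/A,48000
Cid,Eng,not available
Dee,Eng,61000
"""

# Strings treated as "meaningless" in this sketch (an assumption).
MISSING = {"", "n/a", "na", "null", "-"}


def clean_rows(text):
    """Yield (name, dept, salary) tuples; meaningless values become None."""
    for row in csv.DictReader(io.StringIO(text)):
        name = row["name"].strip()
        dept = row["dept"].strip()
        dept = None if dept.lower() in MISSING else dept
        try:
            salary = int(row["salary"].strip())
        except ValueError:
            salary = None  # unparseable number: keep the row, drop the value
        yield (name, dept, salary)


def load(text):
    """Load the cleaned rows into an in-memory relational table."""
    conn = sqlite3.connect(":memory:")
    # The CHECK constraint is one example of "applying constraints".
    conn.execute(
        "CREATE TABLE emp (name TEXT NOT NULL, dept TEXT, "
        "salary INTEGER CHECK (salary IS NULL OR salary >= 0))"
    )
    conn.executemany("INSERT INTO emp VALUES (?, ?, ?)", clean_rows(text))
    return conn


conn = load(RAW)
rows = conn.execute("SELECT name, dept, salary FROM emp ORDER BY name").fetchall()
```

Each hand-written rule in `clean_rows` is the kind of step a data preparation tool would try to suggest or predict instead of leaving it to the user.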

Help users prepare their data
• Example: Data Wrangler (now part of Trifacta)
• http://vis.stanford.edu/wrangler/app/

Help users prepare their data
• Pick a widely used data format
  – spreadsheets, JSON, XML, log files, …
• Define natural and basic transformation operations for this format
  – Cleaning, re-organizing, transforming to relational or RDF format
  – Design a transformation interface
    • Design a Domain Specific Language (DSL).
• Predict/suggest transformation operations

Theory projects
• Read some papers and approaches on a problem; analyze, compare, and/or extend them.
  – High technical depth / theory.
  – You may slightly extend one approach.
• Schema equivalency
  – One can represent the same data in different schemas:
    • Emp(E, D), Manager(M, D) vs. EmpManager(E, M, D)
  – Given two relational schemas, how can we find out whether they represent the same information?
    • Representation dependence in probabilistic inference, J. Halpern, JAIR, 2004.
    • Relative information capacity of simple relational schema, R. Hull, PODS, 1984.

Good project
• Technically deep
  – More than building some forms over a database
• Novel
  – Has some new ideas
• Effectively presented
• All in the scope of a term!

Project timeline
• Proposal due 1/19
  – Group members, brief description of the problem.
• Midterm presentation due 2/3 – 2/4
  – Clear definition of the problem, initial work and result, plan for the rest of the term.
  – A practice for the final presentation!
• Final presentation 8/4 – 10/4
  – Final results, analysis of the results.
• Final report 11/4

What you should do
• Form teams.
• Evaluate possible topics for your project.
• Talk to the instructors and TAs.
• Submit your project proposal.

What is next
• Database system implementation
  – DBMS architecture, storage, and access methods
• You have two papers to review
  – rather short papers!
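
Worked example for the Emp/Manager schemas used in the data independence and schema equivalency topics: one simple way to let a user query Emp(E, D) and Manager(M, D) while the database stores EmpManager(E, M, D) is to define the assumed schema as views. The sketch below uses SQLite from the Python standard library; the sample rows are made up, and the view-based mapping is only one possible approach, not the method the project is expected to use. It also checks whether joining the two views on D reproduces EmpManager, which holds for this data only because each department has a single manager (an assumption of the sample data).

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    -- Stored schema (from the slide): EmpManager(E, M, D).
    CREATE TABLE EmpManager (E TEXT, M TEXT, D TEXT);
    INSERT INTO EmpManager VALUES
        ('ann', 'max', 'sales'),
        ('bob', 'max', 'sales'),
        ('cid', 'zoe', 'eng');

    -- Schema the user assumes, expressed as views over the stored table.
    CREATE VIEW Emp AS SELECT DISTINCT E, D FROM EmpManager;
    CREATE VIEW Manager AS SELECT DISTINCT M, D FROM EmpManager;
""")

# The user's query against the assumed schema.
user_rows = conn.execute("SELECT E FROM Emp ORDER BY E").fetchall()

# The transformed query from the slide, against the stored schema.
direct_rows = conn.execute("SELECT E FROM EmpManager ORDER BY E").fetchall()

# Round trip: does Emp joined with Manager on D give back EmpManager?
# True here only because every department has exactly one manager.
rejoined = conn.execute(
    "SELECT Emp.E, Manager.M, Emp.D FROM Emp "
    "JOIN Manager ON Emp.D = Manager.D ORDER BY Emp.E"
).fetchall()
original = conn.execute("SELECT E, M, D FROM EmpManager ORDER BY E").fetchall()
```

If a department had two managers, `rejoined` would contain extra rows, which is one way to see that the two schemas do not always carry the same information.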