Wrapup
Amol Deshpande
CMSC424

DBMS at a glance
  Data Models
    Conceptual representation of the data
  Data Retrieval
    How to ask questions of the database
    How to answer those questions
  Data Storage
    How/where to store data, how to access it
  Data Integrity
    Manage crashes, concurrency
    Manage semantic inconsistencies
  Not a fully disjoint categorization!

DBMS at a glance
  Data Models
    E/R Model, Relational model
    Very simple and hence effective
      Easy to make things complicated, very hard to keep them simple
      No other data model has survived for so long
    What is the future of XML?

DBMS at a glance
  Data Retrieval
    How to ask questions of the database
      Declarative languages are great
        Hide complexity from users, can optimize things, can evolve easily
      SQL
        – More or less declarative
    How to answer those questions
      Parsing --> Optimization --> Processing
      Operators: Hashing, sorting, joins, aggregation
      Data structures
        – Hash indexes: Good for equality queries
        – Tree indexes: For everything else
      Optimization: Complex, but key piece of a database system

DBMS at a glance
  Data Storage
    How/where to store data, how to access it
      Need to be cognizant of the memory hierarchy
        Memory is cheap, disk is very expensive to access
        Further, disk is cheap to access sequentially, much more expensive to access randomly
          – Many of our decisions are influenced by this
      RAID: Surviving failures
      Accessing data: Indexes
    What happens if a new form of storage comes along with different properties (say, holographic storage)?
      We will need to rethink the tradeoffs, but we now know the approach

DBMS at a glance
  Data Integrity
    Manage crashes, concurrency
      Transactions, 2-phase locking
      Write-ahead logging
      DBMS pretty much the last word on concurrency/recovery
        OSs don’t come close to supporting anything like that
    Manage semantic inconsistencies
      Normalization, FDs
      Not easy to identify tools, but we have learned how to think about them
        – Try to capture them in the E/R diagram as much as possible

Motivation: Data Overload
  We began the first lecture by discussing the data overload
  Huge amounts of data generated every day
    Much faster than our ability to process it
  Increasing ability to capture more enterprise data
  Web, blogs, RSS feeds, etc.
  Multimedia
    – Flickr and cellphone cameras have led a revolution in how people take pictures
    – Videos will be next
    – Not hard to imagine capturing every moment of your life
  Sensor/RFID data
    – Tiny sensors/RFID tags are just beginning to become ubiquitous
    – Billions of these, each generating a tiny amount of data every second, is still too much
  Biological/Scientific data

Motivation: Data Overload
  Relational databases help for structured data
  But increasingly not sufficient
    The things we want to do with data can’t be expressed in SQL
      E.g. with biological data, the web
    Too much unstructured data
    Distributed data generation creates additional headaches
      Almost impossible to try to collect the data in one location
  Making sense of this requires not only advances in data processing, but also in data understanding/mining
    Interdisciplinary efforts

Some Lessons from RDBMS
  But can use the lessons learned from developing RDBMS
  Data independence / abstraction is good
    Hide details, even if initially it leads to inefficiency
  Look for structure
    Even seemingly highly unstructured data might have structure
  Look for patterns in usage
    Relational databases are fast because query processing is predictable
      – Unlike, say, OS workloads, which are very hard to optimize for
    If you can identify patterns, you can probably optimize them
  Declarative languages are great
    Say what you want, not how to get it
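
To make "say what you want, not how to get it" concrete, here is a minimal sketch using Python's standard sqlite3 module. The `emp` table, its columns, the sample rows, and the helper names are all hypothetical, invented for illustration only. The same question (average salary per department) is answered once declaratively in SQL and once with a hand-written loop.

```python
import sqlite3
from collections import defaultdict

# Hypothetical sample data for illustration: (name, dept, salary)
ROWS = [
    ("alice", "db", 120),
    ("bob", "db", 100),
    ("carol", "os", 90),
    ("dave", "os", 110),
    ("erin", "ml", 130),
]


def make_db():
    """Create an in-memory database holding the sample rows."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE emp (name TEXT, dept TEXT, salary INTEGER)")
    conn.executemany("INSERT INTO emp VALUES (?, ?, ?)", ROWS)
    return conn


def declarative_avg(conn):
    """Declarative: state the result we want; the engine chooses the plan."""
    cur = conn.execute(
        "SELECT dept, AVG(salary) FROM emp GROUP BY dept ORDER BY dept"
    )
    return cur.fetchall()


def imperative_avg(rows):
    """Imperative: spell out the grouping and averaging step by step."""
    totals = defaultdict(lambda: [0, 0])  # dept -> [salary sum, row count]
    for _name, dept, salary in rows:
        totals[dept][0] += salary
        totals[dept][1] += 1
    return [(dept, s / n) for dept, (s, n) in sorted(totals.items())]


if __name__ == "__main__":
    conn = make_db()
    print(declarative_avg(conn))
    print(imperative_avg(ROWS))
    # Changing the physical design leaves the query text untouched.
    conn.execute("CREATE INDEX emp_dept ON emp(dept)")
    print(declarative_avg(conn))
```

Both versions return the same rows. The difference is that the loop fixes one evaluation strategy, while the SQL text stays the same even after an index is added. That is the data independence point from the slides: the storage layout can change without rewriting the question.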