Chapters 2 and 12 review

BTM 382 Database Management
Chapter 2: Data models
Chapter 12.12-13: CAP and Hadoop
Chitu Okoli
Associate Professor in Business Technology Management
John Molson School of Business, Concordia University, Montréal
Models and data models
What is a model?
 A model is a simplified way to describe or explain a
complex reality
 A model helps people communicate and work simply
yet effectively when talking about and manipulating
complex real-world phenomena
Scientific models
Image sources:
http://www.redorbit.com/education/reference_library/space_1/universe/2574692/geocentric_model/
http://hendrianusthe.wordpress.com/2012/06/21/heliocentric-vs-geocentric/
Conceptual models
Image sources:
http://info563.malagaclasses.info/strategy-it-2/
http://fivewhys.wordpress.com/2012/05/22/business-model-innovation/
Importance of Data Models
Communication tool
Give an overall view of the database
Organize data for various users
Are an abstraction for the creation of a well-designed database
The Evolution of Data Models
Obsolete models:
Hierarchical and network models
The Relational Model
 Uses key concepts from mathematical relations (tables)
 “Relational” in “relational model” means “tables” (mathematical
relations), not “relationships”
 Table (relations)
 Matrix consisting of row/column intersections
 Relations have well defined methods (queries) for combining
their data members
 Selecting (reading) and joining (combining) data is defined based
on rigorous mathematical principles
 Relational database management system (RDBMS)
 Relations were originally too advanced for 1970s computing
power
 As computing power increased, simplicity of the model prevailed
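To make "selecting" and "joining" concrete, here is a minimal sketch using Python's built-in sqlite3 module. The customer and invoice tables and their rows are hypothetical, invented only for this illustration.

```python
import sqlite3

# Hypothetical tables and data, invented for illustration only.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE customer (cust_id INTEGER PRIMARY KEY, name TEXT);
    CREATE TABLE invoice (
        inv_id INTEGER PRIMARY KEY,
        cust_id INTEGER REFERENCES customer(cust_id),
        total REAL
    );
    INSERT INTO customer VALUES (1, 'Ana'), (2, 'Ben');
    INSERT INTO invoice VALUES (10, 1, 50.0), (11, 1, 25.0), (12, 2, 40.0);
""")

# Selecting: read the rows of one table that satisfy a condition.
big_invoices = conn.execute(
    "SELECT inv_id FROM invoice WHERE total > 30 ORDER BY inv_id"
).fetchall()

# Joining: combine rows of two tables that share a key value.
totals = conn.execute("""
    SELECT c.name, SUM(i.total)
    FROM customer AS c JOIN invoice AS i ON i.cust_id = c.cust_id
    GROUP BY c.name ORDER BY c.name
""").fetchall()
conn.close()
```

Both queries return plain tables of rows, which is the point of the model: the result of combining relations is itself a relation.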
The Entity Relationship Model
 Very detailed specification of relationships and their properties
 Enhancement of the relational model
 Relations (tables) become entities
 Entity relationship diagram (ERD)
 Uses graphic representations to model database
components
 Many variations for notation exist
 In this class, we use the Crow’s Foot notation
The Object-Oriented Data Model (OODM)
 Addresses “impedance mismatch” problem of the ER model
 The ER model’s view of data (tables) and programmers’ view of data
(objects in OOP) are completely different
 This mismatch makes database programming painful, especially for very
complex data structures
 OODM uses object-oriented programming concepts to store data
 Objects represent nouns (entities or records)
 Objects have attributes (properties or fields) with values (data)
 Objects have methods (operations or functions)
 Classes group similar objects using a hierarchy and inheritance
 In an OODBMS, the data retrieval and storage closely mirrors the data
structures that programmers use, and so programming complex objects
is much easier than with the ER model
 More advanced forms support the Extended Relational Data Model,
Object/Relational DBMS, and XML data structures
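The impedance mismatch can be sketched in a few lines of Python. The Order and LineItem classes and the two row layouts below are hypothetical; the sketch only shows the translation code a programmer has to write when a nested object must be stored as flat table rows, which is the work an OODBMS aims to remove.

```python
from dataclasses import dataclass, field

@dataclass
class LineItem:
    product: str
    quantity: int

@dataclass
class Order:
    order_id: int
    customer: str
    items: list[LineItem] = field(default_factory=list)

def to_rows(order):
    """Flatten one object into rows for two hypothetical tables."""
    order_row = (order.order_id, order.customer)
    item_rows = [(order.order_id, it.product, it.quantity) for it in order.items]
    return order_row, item_rows

def from_rows(order_row, item_rows):
    """Rebuild the object from its rows."""
    order_id, customer = order_row
    items = [LineItem(p, q) for (oid, p, q) in item_rows if oid == order_id]
    return Order(order_id, customer, items)

# The programmer thinks in objects...
order = Order(1, "Ana", [LineItem("pen", 2), LineItem("ink", 1)])
# ...but the tables hold flat rows, so every save and load needs translation.
order_row, item_rows = to_rows(order)
rebuilt = from_rows(order_row, item_rows)
```

With only two levels of nesting the translation is short; it grows quickly as the data structures become more complex.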
OODBMS vs. RDBMS
https://youtu.be/kORTgvfHl4g
Big Data and NoSQL
Explaining Big Data
https://youtu.be/7D1CQ_LOizA
Big Data
 Volume
 Huge amounts of data (terabytes and petabytes),
especially from the Internet
 Velocity
 Organizations need to process the huge amounts of
data rapidly, just as with smaller databases
 Variety
 Wide variety of data, much of it unstructured and even
changing in structure
How do you handle Big Data?
The problem with RDBMSs
1. Scale up: use more powerful, expensive servers
 But RDBMS is very computing intensive
 Big data would require much faster, more capable,
more expensive computers, and even those are not
enough for big data
2. Scale out: use many cheap distributed servers
 But RDBMS is slow with distributed processing
 Consistency is the biggest problem: guaranteeing
consistency (which RDBMS is great at) is slow
 Slow infrastructure isn’t good enough for big data
What is NoSQL?
https://www.youtube.com/watch?v=qUV2j3XBRHc
NoSQL databases to the Big Data rescue
 “NoSQL” means:
 Non-relational or non-RDBMS
 Also “Not only SQL”—a few in fact do support SQL
 It is not one model; it is many different models that are not
relational data models
 Scale out (many cheap distributed servers) instead of scale up
 High scalability
 Support distributed database architectures
 High availability
 Rapid performance for big data, including unstructured and sparse data
 Fault tolerance
 Continue to work even if some servers in the cluster fail
 Emphasis is on high performance rather than transaction
consistency
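Scaling out and fault tolerance can be illustrated with a toy key-value store. Everything here (the ToyCluster class, three in-memory "servers", two copies per key) is an assumption made for the sketch and does not describe any particular NoSQL product.

```python
import hashlib

class ToyCluster:
    """Hypothetical key-value store spread over several in-memory 'servers'."""

    def __init__(self, node_count=3, replicas=2):
        self.nodes = [dict() for _ in range(node_count)]
        self.alive = [True] * node_count
        self.replicas = replicas

    def _owners(self, key):
        # A stable hash picks the first node; the following nodes hold copies.
        start = int(hashlib.md5(key.encode()).hexdigest(), 16) % len(self.nodes)
        return [(start + i) % len(self.nodes) for i in range(self.replicas)]

    def put(self, key, value):
        # Store the value on every owner that is currently running.
        for n in self._owners(key):
            if self.alive[n]:
                self.nodes[n][key] = value

    def get(self, key):
        # Read from the first owner that is still running.
        for n in self._owners(key):
            if self.alive[n] and key in self.nodes[n]:
                return self.nodes[n][key]
        raise KeyError(key)

cluster = ToyCluster()
cluster.put("user:42", {"name": "Ana"})
first_owner = cluster._owners("user:42")[0]
cluster.alive[first_owner] = False      # simulate one server failing
survived = cluster.get("user:42")       # still answered by the replica
```

Adding capacity means adding more cheap nodes to the list rather than buying a bigger server, and a single failure does not make the data unreachable.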
Types of NoSQL databases
Also see:
Picking the Right
NoSQL Database Tool
Image sources:
https://www.linkedin.com/pulse/20140823125259-38485481-nosql-databases-where-i-can-use?trk=sushi_topic_posts
http://www.monitis.com/blog/2011/05/22/picking-the-right-nosql-database-tool/
Disadvantages of NoSQL
 Complex programming is required
 “NoSQL” means you lose the ease-of-use and structural
independence of SQL
 There is often no built-in implementation of relationships in
the database—you have to program relationships yourself in
code
 Data might sometimes be inconsistent
 No guarantee of transaction integrity
 Entity integrity and referential integrity not guaranteed
 The data you retrieve at any given moment might be
wrong… but it will eventually become OK
 This is the price to pay for rapid performance in a distributed
database
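A rough sketch of what "programming relationships yourself" can look like. The two document collections are modeled as plain Python dicts with hypothetical keys and fields; a real document store would have its own API, which is not shown here.

```python
# Hypothetical document collections, modeled as plain Python dicts.
customers = {
    "c1": {"name": "Ana"},
    "c2": {"name": "Ben"},
}
orders = {
    "o1": {"customer_id": "c1", "total": 50.0},
    "o2": {"customer_id": "c9", "total": 20.0},  # points at no existing customer
}

def orders_with_names(orders, customers):
    """Hand-written 'join': look up each order's customer in code."""
    result = []
    for order_id, order in sorted(orders.items()):
        customer = customers.get(order["customer_id"])
        name = customer["name"] if customer else None
        result.append((order_id, name, order["total"]))
    return result

def dangling_orders(orders, customers):
    """Hand-written referential-integrity check."""
    return sorted(oid for oid, o in orders.items()
                  if o["customer_id"] not in customers)

joined = orders_with_names(orders, customers)
broken = dangling_orders(orders, customers)
```

Nothing stopped order "o2" from being stored with an invalid customer reference; in an RDBMS a foreign key constraint would have rejected it.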
The CAP theorem for distributed databases
 CAP stands for:
 Consistency: All nodes see the same data
 Availability: A request always gets a response (success or failure)
 Partition tolerance: Even if a node fails or nodes cannot
communicate with one another, the system can still function
 A distributed database can guarantee only two of the three
CAP characteristics, never all three at the same time
 However, over time, it might be able to provide all three
 NoSQL databases are distributed, and so the CAP theorem
restricts them to providing BASE, not ACID
Image source: PRWEB
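The CAP trade-off can be simulated with two replicas and a switch. The TwoNodeStore class and its "CP" and "AP" modes are hypothetical names for this sketch: during a partition, one mode refuses the write (giving up availability), the other accepts it and lets one replica go stale (giving up consistency).

```python
class Replica:
    def __init__(self):
        self.value = "v1"

class TwoNodeStore:
    """Toy store with two replicas; 'mode' is a hypothetical CP/AP switch."""

    def __init__(self, mode):
        self.mode = mode
        self.a, self.b = Replica(), Replica()
        self.partitioned = False

    def write(self, value):
        # Writes arrive at replica A.
        if self.partitioned and self.mode == "CP":
            return "error: cannot reach replica B"   # gives up availability
        self.a.value = value
        if not self.partitioned:
            self.b.value = value                     # normal replication
        return "ok"

    def read_from_b(self):
        return self.b.value

    def heal(self):
        # When the network recovers, B catches up with A.
        self.partitioned = False
        self.b.value = self.a.value

cp, ap = TwoNodeStore("CP"), TwoNodeStore("AP")
for store in (cp, ap):
    store.partitioned = True

cp_result = cp.write("v2")       # refused, so both replicas still agree
ap_result = ap.write("v2")       # accepted, but replica B is now stale
ap_stale = ap.read_from_b()
ap.heal()
ap_after_heal = ap.read_from_b()
```

The last two reads show the "over time" point: once the partition heals, the AP store becomes consistent again.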
ACID versus BASE
 A relational database guarantees the ACID properties:
 Atomicity, Consistency, Isolation, Durability
 In short, a set of SQL statements (called a transaction) will
either all work, or all fail—no half way success, and the result
will not corrupt the database
 A price to pay: results might be somewhat slow
 A NoSQL database does not guarantee ACID; it only
guarantees BASE properties:
 Basically Available, Soft-state, Eventual consistency
 In short, at any given moment, not everything might be
consistent, but the database will eventually get consistent
 In return, these imperfect results are delivered fast
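The "either all work, or all fail" behaviour of an ACID transaction can be seen with Python's built-in sqlite3 module. The account table, its CHECK constraint, and the transfer function are hypothetical, made up for this sketch.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE account (name TEXT PRIMARY KEY, "
             "balance REAL CHECK (balance >= 0))")
conn.execute("INSERT INTO account VALUES ('ana', 100), ('ben', 0)")
conn.commit()

def transfer(conn, src, dst, amount):
    """Move money in one transaction: both updates apply, or neither does."""
    try:
        with conn:  # commits on success, rolls back if an exception is raised
            conn.execute("UPDATE account SET balance = balance + ? WHERE name = ?",
                         (amount, dst))
            conn.execute("UPDATE account SET balance = balance - ? WHERE name = ?",
                         (amount, src))
        return True
    except sqlite3.IntegrityError:
        return False

ok = transfer(conn, "ana", "ben", 500)   # violates the CHECK on ana's balance
balances = dict(conn.execute("SELECT name, balance FROM account"))
```

The first update (crediting ben) succeeded on its own, but because the second one failed, the whole transaction was rolled back and ben's balance is unchanged. A BASE system would typically not give this guarantee across servers.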
Summary and conclusions of various
data models
Distributed Database Spectrum
Table 12.8
Sacrifices availability to ensure
consistency and isolation
Historical outline of data models
Which data model should you use?
 Hierarchical or network models
 Obsolete—no one uses these any longer
 Entity-relationship model
 Almost always
 90% or more of professional database situations
 Object-oriented database
 When you have very complex data structures, you need rapid
performance, and it helps achieve organizational objectives
 Source: Barry & Associates, Inc
 When data structures are so complex that organizing data as tables
causes headaches in programming retrieval and storage
 NoSQL
 When you have vast amounts of unstructured data and you need
rapid performance
 When speed is more important than data consistency
Sources
 Most of the slides are adapted from Database
Systems: Design, Implementation and
Management by Carlos Coronel and Steven Morris.
11th edition (2015) published by Cengage Learning.
ISBN 13: 978-1-285-19614-5
 Other sources are noted on the slides themselves