Uploaded by Tinashe Tivatyi

Lecture 1 - Introduction

Lecture 1 - Introduction
• A structured set of data held in a computer, especially one
that is accessible in various ways.
• An electronic system that allows data to be easily
accessed, manipulated and updated.
• It is used by an organization as a method of storing,
managing and retrieving information. Modern databases
are managed using a database management system
• Examples include: ORACLE, SQL Server, MySQL,
Microsoft Access and MongoDB.
Relational database
• A database structured to recognize relations between
stored items of information.
• A set of formally described tables from which data can be
accessed or reassembled in many different ways without
having to reorganize the database tables.
• The standard user and application programming interface
(API) of a relational database is the Structured Query
Language (SQL).
SQL query language
• A programming language designed
to manage data stored in relational
• SQL operates through simple,
declarative statements.
• This keeps data accurate and secure,
and it helps maintain the integrity of
databases, regardless of size.
Big data
• Big Data is a phrase used to refer to a massive volume of both
structured and unstructured data that is so large it is difficult to
process using traditional database and software techniques.
• In most enterprise scenarios the volume of data is too big or it moves
too fast or it exceeds current processing capacity.
• The data set is very large and may be analysed computationally to
reveal patterns, trends, and associations, especially relating to human
behaviour and interactions.
– The volume of available data is huge
– The data keeps growing at a staggering rate
– Data comes from a variety of sources in different formats.
Distributed Database Systems
• A distributed system is a collection of independant
computers which are interconnected through a network.
A distributed database
• A database that consists of two or more files located in different sites
either on the same network or on entirely different networks.
• Portions of the database are stored in multiple physical locations and
processing is distributed among multiple database nodes.
• Databases in the collection are logically interrelated with each other. Often they represent a
single logical database.
• Data is physically stored across multiple sites. Data in each site can be managed by a DBMS
independent of the other sites.
• The processors in the sites are connected via a network. They do not have any multiprocessor
Distributed database management system
• DDBMS: The software that manages the DDB and provides a
mechanism that makes the distribution transparent to the user.
Promises of DDBMS
• Transparency:
– To what extent should the distributed system appear to the user as a single system?
– Distributed systems should be perceived by users and application programmers as a whole rather
than as a collection of cooperating components.
• Reliability / Availability
- If one component is down, yet the system continues to be operational
• Performance
– tasks execute efficiently independently of the location of invocation or of participating resources
• Scalability- Allowing database processing systems to cope with volume, veracity, velocity and
variety aspects that big data brings into the system.
- The cost of putting together a system of smaller computers is much less than the cost
of a single machine with equivelant power.
Transparencies in DDBMS
• Distribution Transparency
Allows user to perceive database as single, logical entity.
Fragmentation Transparency
Location Transparency
Replication Transparency
Local Mapping Transparency
If DDBMS exhibits distribution transparency, user does not need to know:
– Data is fragmented (fragmentation transparency),
– Location of data items (location transparency),
– Replication of fragments (replication transparency).
• Naming transparency
• Objects should be uniquely identifiable throughout their life
regardless of their system of origin and current location.
• Each item in a DDB must have a unique name.
• This type of transparency allows for local autonomy.
Concurrency Transparency
• All transactions must execute independently and be logically
consistent with results obtained if transactions executed one
at a time, in some arbitrary serial order.
Transaction Transparency
• Ensures that all distributed transactions maintain distributed
database’s integrity and consistency.
• Distributed transaction accesses data stored at more than
one location.
• Each transaction is divided into number of sub-transactions,
one for each site that has to be accessed.
• DDBMS must ensure the indivisibility of both the global
transaction and each subtransactions.
Distributed DBMS environment
Factors Encouraging DDBMS
Distributed Nature of Organizational Units − Most organizations in the current times are subdivided into
multiple units that are physically distributed over the globe. Each unit requires its own set of local data. Thus,
the overall database of the organization becomes distributed.
Need for Sharing of Data − The multiple organizational units often need to communicate with each other and
share their data and resources. This demands common databases or replicated databases that should be used
in a synchronized manner.
Support for Both OLTP and OLAP − Online Transaction Processing (OLTP) and Online Analytical Processing
(OLAP) work upon diversified systems which may have common data. Distributed database systems aid both
these processing by providing synchronized data.
Database Recovery − One of the common techniques used in DDBMS is replication of data across different
sites. Replication of data automatically helps in data recovery if database in any site is damaged. Users can
access data from other sites while the damaged site is being reconstructed. Thus, database failure may
become almost inconspicuous to users.
Support for Multiple Application Software − Most organizations use a variety of application software each
with its specific database support. DDBMS provides a uniform functionality for using the same data among
different platforms.
Advantages of Distributed Databases
Following are the advantages of distributed databases over centralized databases.
• Modular Development − If the system needs to be expanded to new locations or new units, in centralized
database systems, the action requires substantial efforts and disruption in the existing functioning. However,
in distributed databases, the work simply requires adding new computers and local data to the new site and
finally connecting them to the distributed system, with no interruption in current functions.
• More Reliable − In case of database failures, the total system of centralized databases comes to a halt.
However, in distributed systems, when a component fails, the functioning of the system continues may be at
a reduced performance. Hence DDBMS is more reliable.
• Better Response − If data is distributed in an efficient manner, then user requests can be met from local data
itself, thus providing faster response. On the other hand, in centralized systems, all queries have to pass
through the central computer for processing, which increases the response time.
• Lower Communication Cost − In distributed database systems, if data is located locally where it is mostly
used, then the communication costs for data manipulation can be minimized. This is not feasible in
centralized systems.
Complications introduced by distribution
Replication of data
– A distributed database can be designed so that the entire database, or portions of it, reside at different sites of a
computer network.
– It is not essential that every site on the network contain the database; it is only essential that there be more than one site
where the database resides.
– The possible duplication of data items is mainly due to reliability and efficiency considerations.
– Consequently, the distributed database system is responsible for
(1) choosing one of the stored copies of the requested data for access in case of retrievals, and
(2) making sure that the effect of an update is reflected on each and every copy of that data item.
Second, if some sites fail (e.g., by either hardware or software malfunction), or if some communication links fail
(making some of the sites unreachable) while an update is being executed, the system must make sure that the
effects will be reflected on the data residing at the failing or unreachable sites as soon as the system can
recover from the failure.
The third point is that since each site cannot have instantaneous information on the actions currently being
carried out at the other sites, the synchronization of transactions on multiple sites is considerably harder than
for a centralized system.
• Need for complex and expensive software − DDBMS demands complex and
often expensive software to provide data transparency and co-ordination across
the several sites.
• Processing overhead − Even simple operations may require a large number of
communications and additional calculations to provide uniformity in data across
the sites.
• Data integrity − The need for updating data in multiple sites pose problems of
data integrity.
• Overheads for improper data distribution − Responsiveness of queries is
largely dependent upon proper data distribution. Improper data distribution often
leads to very slow response to user requests.
• A parallel system is one in which multiple processors
have direct access to shared memory.
Why use parallel computing?
• Fundamental limits on single processor speed (High-end
CPU is expensive / rises sharply )
• Disparity between CPU & memory speeds (Moore's Law)
• Distributed data communications
– Many processors working together (Save time – wall clock time)
– Solve larger problems – larger than one processor’s CPU and
memory can handle
• Need for very large scale computing platforms
Why use parallel computing
• Solve larger problems – larger than one processor’s CPU and memory
can handle.
• Provide concurrency – do multiple things at the same time: online
access to databases, search engine.
• Taking advantage of non-local resources – using computing resources
on a wide area network, or even internet (grid computing).
• Cost saving – using multiple “cheap” computing resources instead of a
high-end CPU.
• Overcoming memory constraints – for large problems, using memories
of multiple computers may overcome the memory constraint obstacle.
Parallel Databases
• Typical system
- Fast interconnect
- Homogeneous software
(and even hardware)
• Typical goals
- High performance
- Transparency
• Typical system properties
- Geographical distribution
- Heterogeneity and autonomy
• Typical goals
- Data sharing
- Disconnected operation
Some commercial distributed relational databases
Teradata Database
Actian Matrix
Amazon Redshift
Sybase IQ
Microsoft Pdw
Netezza (company)
Vertica Database
Splice Machine - The Hadoop RDBMS
Some Distributed Database Open Source Projects
Awesome Bigdata
Talent Plan
Yugabyte Db
Awesome Blockchains
Awesome Distributed Systems
Bitconch Core
Mysql Notes
Py Swirld
Dble Docs Cn