
Database management

Traditional file-based approach
The term 'file-based approach' refers to the situation where data is stored in one or more
separate computer files defined and managed by different application programs. Typically,
for example, the details of customers may be stored in one file, orders in another, etc.
Computer programs access the stored files to perform the various tasks required by the
business. Each program, or sometimes a related set of programs, is called a computer
application. For example, all of the programs associated with processing customers' orders
are referred to as the order processing application. The file-based approach might have
application programs that deal with purchase orders, invoices, sales and marketing, suppliers,
customers, employees, and so on.
Limitations

Data duplication: Each program stores its own separate files. If the same data is to be
accessed by different programs, then each program must store its own copy of the
same data.

Data inconsistency: If the data is kept in different files, there could be problems when
an item of data needs updating, as it will need to be updated in all the relevant files; if
this is not done, the data will be inconsistent, and this could lead to errors.

Difficult to implement data security: Data is stored in different files by different
application programs. This makes it difficult and expensive to implement
organisation-wide security procedures on the data.
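To make the duplication and inconsistency problems concrete, here is a small illustrative sketch in Python. The customer records and file names are hypothetical, held in memory so the example is self-contained; in a real file-based system each application would maintain its own physical file.

```python
import csv
import io

# Two applications each keep their own copy of customer data
# (hypothetical files, represented here as in-memory strings).
orders_file = io.StringIO("cust_id,name,city\n1,Ada Lopez,Leeds\n")
invoices_file = io.StringIO("cust_id,name,city\n1,Ada Lopez,York\n")  # stale copy

orders = list(csv.DictReader(orders_file))
invoices = list(csv.DictReader(invoices_file))

# The same customer appears in both files with conflicting addresses:
# nothing in the file-based approach forces the two copies to agree.
inconsistent = orders[0]["city"] != invoices[0]["city"]
```

Here the customer's city was updated in one application's file but not the other, which is exactly the data inconsistency problem described above.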
The database approach
The database approach is an improvement on the shared file solution as the use of a database
management system (DBMS) provides facilities for querying, data security and integrity, and
allows simultaneous access to data by a number of different users. At this point we should
explain some important terminology:

Database: A database is a collection of related data.

Database management system: The term 'database management system', often
abbreviated to DBMS, refers to a software system used to create and manage
databases. The software of such systems is complex, consisting of a number of
different components, which are described later in this chapter. The term database
system is usually an alternative term for database management system.

System catalogue/Data dictionary: The description of the data held in the database, maintained by the database management system.

Database application: Database application refers to a program, or related set of
programs, which uses the database management system to perform the computer-related tasks of a particular business function, such as order processing.
One of the benefits of the database approach is that the problem of physical data dependence
is resolved; this means that the underlying structure of a data file can be changed without the
application programs needing amendment. This is achieved by a hierarchy of levels of data
specification. Each such specification of data in a database system is called a schema. The
different levels of schema provided in database systems are described below. Further details
of what is included within each specific schema are discussed later in the chapter.
The Systems Planning and Requirements Committee of the American National Standards
Institute encapsulated the concept of schema in its three-level database architecture model,
known as the ANSI/SPARC architecture, which is shown in the diagram below:
Three-level architecture
The ANSI/SPARC model is a three-level database architecture with a hierarchy of levels,
from the users and their applications at the top, down to the physical storage of data at the
bottom. The characteristics of each level, represented by a schema, are now described.
The external schema
The external schemas describe the database as it is seen by the user, and the user applications.
The external schema maps onto the conceptual schema, which is described below.
There may be many external schemas, each reflecting a simplified model of the world, as
seen by particular applications. External schemas may be modified, or new ones created,
without the need to make alterations to the physical storage of data. The interface between the
external schema and the conceptual schema can be amended to accommodate any such
changes.
The external schema allows the application programs to see as much of the data as they
require, while excluding other items that are not relevant to that application. In this way, the
external schema provides a view of the data that corresponds to the nature of each task.
The external schema is more than a subset of the conceptual schema. While items in the
external schema must be derivable from the conceptual schema, this could be a complicated
process, involving computation and other activities.
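One common way a DBMS realises an external schema is through a view. The sketch below uses Python's built-in sqlite3 module with hypothetical table, column, and view names: a payroll application is given a view exposing only the columns it needs, while the full staff table (the conceptual-level description) holds more.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE staff (id INTEGER PRIMARY KEY, name TEXT, salary REAL)")
conn.execute("INSERT INTO staff VALUES (1, 'B. Iyer', 42000), (2, 'C. Okafor', 51000)")

# An "external schema" for the payroll application: it sees only id and
# salary; a staff-directory application could be given a different view
# of the same underlying data.
conn.execute("CREATE VIEW payroll_view AS SELECT id, salary FROM staff")

payroll_rows = conn.execute("SELECT id, salary FROM payroll_view ORDER BY id").fetchall()
```

The view can be redefined, or new views added for other applications, without touching the stored table, which is the flexibility the external level provides.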
The conceptual schema
The conceptual schema describes the universe of interest to the users of the database system.
For a company, for example, it would provide a description of all of the data required to be
stored in a database system. From this organisation-wide description of the data, external
schemas can be derived to provide the data for specific users or to support particular tasks.
At the level of the conceptual schema we are concerned with the data itself, rather than
storage or the way data is physically accessed on disk. The definition of storage and access
details is the preserve of the internal schema.
The internal schema
A database will have only one internal schema, which contains definitions of the way in
which data is physically stored. The interface between the internal schema and the conceptual
schema identifies how an element in the conceptual schema is stored, and how it may be
accessed.
If the internal schema is changed, this will need to be addressed in the interface between the
internal and the conceptual schemas, but the conceptual and external schemas will not need to
change. This means that changes in physical storage devices such as disks, and changes in the
way files are organised on storage devices, are transparent to users and application programs.
In distinguishing between 'logical' and 'physical' views of a system, it should be noted that the
difference could depend on the nature of the user. While 'logical' describes the user angle, and
'physical' relates to the computer view, database designers may regard relations (for staff
records) as logical and the database itself as physical. This may contrast with the perspective
of a systems programmer, who may consider data files as logical in concept, but their
implementation on magnetic disks in cylinders, tracks and sectors as physical.
Physical data independence
In a database environment, if there is a requirement to change the structure of a particular file
of data held on disk, this will be recorded in the internal schema. The interface between the
internal schema and the conceptual schema will be amended to reflect this, but there will be
no need to change the conceptual or external schemas. This means that any such change of
physical data storage is transparent to users and application programs. This approach removes
the problem of physical data dependence.
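This can be illustrated with Python's built-in sqlite3 module (the table and data are hypothetical): adding an index is a change at the internal, storage level, yet the application's query text is unchanged and returns the same answer.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE orders (id INTEGER, customer TEXT)")
conn.executemany("INSERT INTO orders VALUES (?, ?)",
                 [(1, "Acme"), (2, "Globex"), (3, "Acme")])

query = "SELECT COUNT(*) FROM orders WHERE customer = 'Acme'"
before = conn.execute(query).fetchone()[0]

# A change at the internal (storage) level: add an index. The application's
# query does not change, and it still returns the same result.
conn.execute("CREATE INDEX idx_customer ON orders(customer)")
after = conn.execute(query).fetchone()[0]
```

The index may change how the DBMS physically locates the rows, but that is invisible to the application, which is physical data independence in miniature.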
Logical data independence
Any changes to the conceptual schema can be isolated from the external schema and the
internal schema; such changes will be reflected in the interface between the conceptual
schema and the other levels. This achieves logical data independence. What this means,
effectively, is that changes can be made at the conceptual level, where the overall model of an
organisation's data is specified, and these changes can be made independently of both the
physical storage level, and the external level seen by individual users. The changes are
handled by the interfaces between the conceptual schema (the middle layer) and the physical
and external layers.
Benefits of the database approach
The benefits of the database approach are as follows:

Ease of application development: The programmer is no longer burdened with designing,
building and maintaining master files.

Minimal data redundancy: All data files are integrated into a composite data structure. In
practice, not all redundancy is eliminated, but at least the redundancy is controlled. Thus
inconsistency is reduced.

Enforcement of standards: The database administrator can define standards for names,
etc.

Data sharing: New applications can use existing data definitions.

Physical data independence: Data descriptions are independent of the application
programs. This makes program development and maintenance an easier task. Data is
stored independently of the program that uses it.

Logical data independence: Data can be viewed in different ways by different users.

Better modelling of real-world data: Databases are based on semantically rich data
models that allow the accurate representation of real-world information.

Uniform security and integrity controls: Security control ensures that applications can only
access the data they are required to access. Integrity control ensures that the database
represents what it purports to represent.

Economy of scale: Concentration of processing, control personnel and technical expertise.
Risks of the database approach

New specialised personnel: Need to hire or train new personnel e.g. database
administrators and application programmers.

Need for explicit backup.

Organisational conflict: Different departments have different information needs and data
representation.

Large size: Often needs very large amounts of processing power and storage.

Expensive: Software and hardware expenses.

High impact of failure: Concentration of processing and resources makes an organisation
vulnerable if the system fails for any length of time.
The role of the data administrator
It is important that the data administrator is aware of any issues that may affect the handling and
use of data within the organisation. Data administration includes the responsibility for determining
and publicising policy and standards for data naming and data definition conventions, access
permissions and restrictions for data and processing of data, and security issues.
Difference between File System and DBMS:

Structure: A file system is software that manages and organises the files in a storage medium within a computer, whereas a DBMS is software for managing a database.

Data redundancy: Redundant data can be present in a file system, whereas in a DBMS redundancy is controlled.

Backup and recovery: A file system does not provide backup and recovery of data if it is lost, whereas a DBMS provides backup and recovery facilities.

Query processing: There is no efficient query processing in a file system, whereas a DBMS provides efficient query processing.

Consistency: There is less data consistency in a file system, whereas a DBMS gives greater data consistency through processes such as normalisation.

Complexity: A file system is less complex, whereas a DBMS is more complex to handle.

Security constraints: File systems provide less security, whereas a DBMS has more security mechanisms.

Cost: A file system is less expensive, whereas a DBMS has a comparatively higher cost.

Data independence: There is no data independence in a file system, whereas in a DBMS data independence exists.

User access: In a file system only one user can access data at a time, whereas in a DBMS multiple users can access data at the same time.
Data Warehousing
Data warehousing is a collection of tools and techniques through which knowledge can be
derived from large amounts of data. This supports the decision-making process and improves
an organisation's information resources.
A data warehouse is essentially a database of specialised data structures that allows
relatively quick and easy performance of complex queries over a large amount of data.
It is created from multiple heterogeneous sources.
Characteristics of Data Warehousing

Integrated

Time variant

Non-volatile
The purpose of a data warehouse is to support the decision-making process. It makes
information easily accessible, as reports can be generated from the data warehouse. It
usually contains historical data derived from transactional data, but can also include
data from other sources. A data warehouse is always kept separate from transactional
data.
We have multiple data sources, to which we apply an ETL process: we Extract data from
each data source, then Transform it according to defined rules, and then Load the data
into the desired destination, thus creating a data warehouse.
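The ETL steps above can be sketched in Python as follows. The source rows, the transformation rule, and the warehouse structure are all hypothetical, chosen only to show the extract, transform, and load stages in order.

```python
# Extract: rows pulled from two hypothetical source systems.
extracted = [
    {"src": "shop", "amount": "10.50", "date": "2023-01-05"},
    {"src": "web",  "amount": "7.25",  "date": "2023-01-06"},
]

def transform(row):
    # Transform: an example rule - amounts arrive as strings from the
    # sources and must become floats in the warehouse's common shape.
    return {"source": row["src"], "amount": float(row["amount"]), "date": row["date"]}

# Load: append the transformed rows into the destination, held here
# as a simple list standing in for a warehouse table.
warehouse = []
for row in extracted:
    warehouse.append(transform(row))

total = sum(r["amount"] for r in warehouse)
```

Once loaded into the common shape, the rows from heterogeneous sources can be queried together, which is the point of the warehouse.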
Data Mining
Data mining refers to extracting knowledge from large amounts of data. The data
sources can include databases, data warehouse, web etc.
Knowledge discovery is an iterative sequence:

Data cleaning – remove noisy and inconsistent data.

Data integration – combine multiple data sources into one.

Data selection – select only the data relevant to the analysis.

Data transformation – transform data into a form appropriate for mining.

Data mining – apply methods to extract data patterns.

Pattern evaluation – identify the interesting patterns in the data.

Knowledge representation – use visualisation and knowledge representation
techniques to present the mined knowledge.
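Several of these steps (cleaning, selection, transformation, mining) can be sketched as a chain of small Python functions. The data and the "interesting pattern" criterion are hypothetical, for illustration only.

```python
# Hypothetical raw rows; one has a missing quantity.
raw = [{"item": "milk", "qty": 2, "note": "x"},
       {"item": "milk", "qty": None, "note": "y"},
       {"item": "bread", "qty": 1, "note": "z"}]

def clean(rows):
    # Data cleaning: drop rows with missing values.
    return [r for r in rows if r["qty"] is not None]

def select(rows):
    # Data selection: keep only the fields relevant to the analysis.
    return [{"item": r["item"], "qty": r["qty"]} for r in rows]

def transform(rows):
    # Data transformation: aggregate quantities per item.
    totals = {}
    for r in rows:
        totals[r["item"]] = totals.get(r["item"], 0) + r["qty"]
    return totals

def mine(totals):
    # Data mining: extract a simple "pattern" - the best-selling item.
    return max(totals, key=totals.get)

pattern = mine(transform(select(clean(raw))))
```

Each function mirrors one bullet above; in practice each stage is far more sophisticated, but the pipeline shape is the same.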
What kinds of data can be mined?

Database data

Data warehouse data

Transactional data
Scope of Data mining

Automated prediction of trends and behaviours: Data mining automates the
process of finding predictive information in large databases. For example, a
marketing company might use data mining on past promotional mailings to
identify the targets most likely to maximise the return on future mailings.

Automated discovery of previously unknown patterns: Data mining sweeps
through the database and identifies previously hidden patterns. For example,
in a retail store, data mining can go through the entire database and find
patterns of items that are usually bought together.
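The "items bought together" idea can be sketched with a simple pair count in Python. The transactions are hypothetical; real market-basket mining uses measures such as support and confidence rather than raw counts.

```python
from itertools import combinations
from collections import Counter

# Hypothetical transactions: each basket is the set of items in one sale.
transactions = [
    {"bread", "butter"},
    {"bread", "butter", "jam"},
    {"bread", "jam"},
]

# Count how often each pair of items appears in the same basket.
pair_counts = Counter()
for basket in transactions:
    for pair in combinations(sorted(basket), 2):
        pair_counts[pair] += 1

top_count = pair_counts.most_common(1)[0][1]
```

Pairs with high counts are candidate "bought together" patterns that a retailer might act on (shelf placement, promotions).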
The difference between data mining and data warehousing.
Data Mining

It is a process used to determine data patterns.

It can be understood as a general method to extract useful data from a set of
data.

Data is analysed repeatedly in this process.

It is done by business entrepreneurs and engineers to extract meaningful
data.

It uses many techniques, including pattern recognition, to identify patterns in data.

It helps detect unwanted errors that may occur in the system.

It is cost-efficient in comparison to other statistical data processing
techniques.

It isn’t completely accurate, since real-world data is never ideal.
Data Warehousing

It is a database system that has been designed to perform analytics.

It combines all the relevant data into a single module.

The process of data warehousing is done by engineers.

Here, data is stored in a periodic manner.

In this process, data is extracted and stored in a location for ease of reporting.

It is updated at regular intervals of time.

This is why it is used by major companies, in order to stay up-to-date.

It helps simplify every type of data for business.

Data loss is possible if the data required for analysis is not integrated to the
data warehouse.

It stores large amounts of historical data that helps the user in analysing the
trends and seasonality to make further predictions.
Data Mining
Data mining is the process of discovering meaningful new correlations, patterns, and
trends by sifting through a large amount of data stored in repositories, using pattern
recognition technologies as well as statistical and mathematical techniques. It is the
analysis of observational datasets to find unsuspected relationships and to
summarize the data in novel ways that are both understandable and useful to the
data owner.
Data mining can include the use of several types of software packages including
analytics tools. It can be automated, or it can be largely labor-intensive, where
individual workers send specific queries for information to an archive or database.
Generally, data mining defines operations that contain relatively sophisticated search
operations that return focused and definite results. For instance, a data mining tool
can view through dozens of years of accounting data to find a definite column of
expenses or accounts receivable for a specific operating year.
Big Data
Big Data refers to vast sets of structured, semi-structured, and unstructured data,
often measured in terabytes or more. It is complex to process such a large amount of
data on an individual system: the computer's RAM must hold the interim computations
during processing and analysis, so the processing steps take a long time on a single
system, and the system may not work correctly due to overload.
Big data sets are those that outgrow the simple types of database and data-handling
structures that were used in earlier times, when big data was more expensive and less
feasible. For instance, sets of data that are too large to be easily handled in a
Microsoft Excel spreadsheet can be described as big data sets.
The comparison between Data Mining and Big Data:

Definition: Data mining is the process of discovering meaningful new correlations, patterns, and trends by sifting through a large amount of data stored in repositories, using pattern recognition technologies as well as statistical and mathematical techniques. Big Data is an all-inclusive term for the collection and subsequent analysis of significantly large data sets that can contain hidden data or insights that could not be found using traditional methods and tools; the amount of data is too great for traditional computing systems to handle and analyse.

Purpose: The purpose of data mining is to find patterns, anomalies, and correlations in a large store of data. The purpose of big data analysis is to discover insights from data sets that are diverse, complex, and of massive scale.

Applications: Data mining use cases include financial services, airlines and trucking companies, the healthcare sector, telecommunications and utilities, media and entertainment, e-commerce, education, IoT, and so on. Big data acts as a base for machine learning and artificial intelligence applications worldwide.
How do data warehousing and OLAP relate to data mining?
Data warehouses and data marts are used in a broad range of applications. Business
executives use the data in data warehouses and data marts to perform data analysis
and make strategic decisions. In some firms, data warehouses are used as an integral
element of a plan-execute-assess "closed-loop" feedback system for enterprise
administration.
Data warehouses are used widely in banking and financial services, consumer
goods and retail distribution sectors, and controlled manufacturing, including
demand-based production. Generally, the longer a data warehouse has been in use,
the more it will have developed. This evolution takes place throughout various
phases.
Initially, a data warehouse is generally used for generating reports and
answering predefined queries. It can then be used to analyze summarized and detailed
information, with the results presented in the form of reports and charts.
Later, the data warehouse is used for strategic objectives, implementing
multidimensional analysis and sophisticated slice-and-dice operations.
Finally, the data warehouse can be employed for knowledge discovery and strategic
decision-making using data mining tools. In this framework, the tools for data
warehousing can be classified into access and retrieval tools, database reporting
tools, data analysis tools, and data mining tools.
Business users need the means to understand what exists in the data
warehouse (through metadata), how to access the contents of the data warehouse,
how to examine the contents using analysis tools, and how to present the results of
such analysis.
There are three kinds of data warehouse applications: information
processing, analytical processing, and data mining.
Information processing − It provides querying, basic statistical analysis, and
reporting using crosstabs, tables, charts, or graphs. A recent trend in data
warehouse information processing is to build low-cost Web-based access tools that are
then integrated with Web browsers.
Analytical processing − It provides basic OLAP operations, including slice-and-dice,
drill-down, roll-up, and pivoting. It usually operates on historical data in
both summarized and detailed forms. The major strength of online analytical
processing over information processing is the multidimensional analysis of data
warehouse data.
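Roll-up, for example, aggregates away a dimension. A minimal Python sketch on hypothetical sales figures, rolling (region, month) detail up to region totals as an OLAP tool would:

```python
from collections import defaultdict

# Hypothetical sales held at (region, month) detail.
detail = {
    ("North", "Jan"): 100, ("North", "Feb"): 120,
    ("South", "Jan"): 80,  ("South", "Feb"): 90,
}

# Roll-up: drop the month dimension and sum, leaving per-region totals.
rollup = defaultdict(int)
for (region, month), amount in detail.items():
    rollup[region] += amount
rollup = dict(rollup)
```

Drill-down is the inverse operation, moving from the region totals back to the monthly detail.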
Data mining − It provides knowledge discovery by finding hidden patterns and
associations, building analytical models, implementing classification and prediction,
and displaying the mining results using visualization tools.
Because data mining involves more automated and deeper analysis than OLAP, it is
expected to have broader applications. Data mining can help business managers find
and reach more suitable customers, and gain critical business insights that can
help drive market share and raise profits.