Traditional file-based approach

The term 'file-based approach' refers to the situation where data is stored in one or more separate computer files defined and managed by different application programs. Typically, for example, the details of customers may be stored in one file, orders in another, and so on. Computer programs access the stored files to perform the various tasks required by the business. Each program, or sometimes a related set of programs, is called a computer application. For example, all of the programs associated with processing customers' orders are referred to as the order processing application. The file-based approach might have application programs that deal with purchase orders, invoices, sales and marketing, suppliers, customers, employees, and so on.

Limitations

- Data duplication: Each program stores its own separate files. If the same data is to be accessed by different programs, then each program must store its own copy of the same data.
- Data inconsistency: If the same data is kept in different files, there can be problems when an item of data needs updating, as it must be updated in all the relevant files; if this is not done, the data becomes inconsistent, and this can lead to errors.
- Difficult to implement data security: Because data is stored in different files by different application programs, it is difficult and expensive to implement organisation-wide security procedures on the data.

The database approach

The database approach is an improvement on the shared file solution, as the use of a database management system (DBMS) provides facilities for querying, data security and integrity, and allows simultaneous access to data by a number of different users. At this point we should explain some important terminology:

- Database: A database is a collection of related data.
- Database management system: The term 'database management system', often abbreviated to DBMS, refers to a software system used to create and manage databases. The software of such systems is complex, consisting of a number of different components, which are described later in this chapter. The term 'database system' is often used as an alternative.
- System catalogue/Data dictionary: The description of the data held in the database, maintained by the database management system itself.
- Database application: A database application is a program, or related set of programs, which uses the database management system to perform the computer-related tasks of a particular business function, such as order processing.

One of the benefits of the database approach is that the problem of physical data dependence is resolved; this means that the underlying structure of a data file can be changed without the application programs needing amendment. This is achieved by a hierarchy of levels of data specification. Each such specification of data in a database system is called a schema. The different levels of schema provided in database systems are described below. Further details of what is included within each specific schema are discussed later in the chapter.

The Systems Planning and Requirements Committee of the American National Standards Institute encapsulated the concept of schema in its three-level database architecture model, known as the ANSI/SPARC architecture.

Three-level architecture

The ANSI/SPARC model is a three-level database architecture with a hierarchy of levels, from the users and their applications at the top, down to the physical storage of data at the bottom.
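As a rough preview of how the three levels can surface in a relational DBMS, the sketch below uses Python's sqlite3 module (all table, view, and index names are invented for illustration): the CREATE TABLE corresponds broadly to the conceptual schema, the CREATE VIEW to one external schema, and the CREATE INDEX to a physical choice recorded at the internal level.

```python
import sqlite3

conn = sqlite3.connect(":memory:")

# Conceptual level: an organisation-wide description of the data itself.
conn.execute("""CREATE TABLE customers (
    id INTEGER PRIMARY KEY,
    name TEXT NOT NULL,
    city TEXT,
    credit_limit REAL)""")

# External level: a simplified view for one application -- order processing
# needs names and cities, but has no reason to see credit limits.
conn.execute("""CREATE VIEW customer_contacts AS
    SELECT id, name, city FROM customers""")

# Internal level: a physical access-path decision (an index on city),
# invisible to any program that queries only the view above.
conn.execute("CREATE INDEX idx_customers_city ON customers (city)")
```

This is a simplification, since a real internal schema covers far more than indexes, but it conveys how the three levels separate what applications see from how data is stored.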
The characteristics of each level, represented by a schema, are now described.

The external schema

The external schemas describe the database as it is seen by the user and by the user applications. The external schema maps onto the conceptual schema, which is described below. There may be many external schemas, each reflecting a simplified model of the world as seen by particular applications. External schemas may be modified, or new ones created, without the need to make alterations to the physical storage of data. The interface between the external schema and the conceptual schema can be amended to accommodate any such changes. The external schema allows the application programs to see as much of the data as they require, while excluding other items that are not relevant to that application. In this way, the external schema provides a view of the data that corresponds to the nature of each task. The external schema is more than a subset of the conceptual schema: while items in the external schema must be derivable from the conceptual schema, this derivation can be a complicated process, involving computation and other activities.

The conceptual schema

The conceptual schema describes the universe of interest to the users of the database system. For a company, for example, it would provide a description of all of the data required to be stored in a database system. From this organisation-wide description of the data, external schemas can be derived to provide the data for specific users or to support particular tasks. At the level of the conceptual schema we are concerned with the data itself, rather than with storage or the way data is physically accessed on disk. The definition of storage and access details is the preserve of the internal schema.

The internal schema

A database will have only one internal schema, which contains definitions of the way in which data is physically stored. The interface between the internal schema and the conceptual schema identifies how an element in the conceptual schema is stored and how it may be accessed. If the internal schema is changed, this must be addressed in the interface between the internal and the conceptual schemas, but the conceptual and external schemas will not need to change. This means that changes in physical storage devices such as disks, and changes in the way files are organised on storage devices, are transparent to users and application programs.

In distinguishing between 'logical' and 'physical' views of a system, it should be noted that the difference can depend on the nature of the user. While 'logical' broadly describes the user's angle and 'physical' the computer's view, database designers may regard relations (a staff records table, say) as logical and the database files themselves as physical. This may contrast with the perspective of a systems programmer, who may consider data files as logical in concept, but their implementation on magnetic disks in cylinders, tracks and sectors as physical.

Physical data independence

In a database environment, if there is a requirement to change the structure of a particular file of data held on disk, this will be recorded in the internal schema. The interface between the internal schema and the conceptual schema will be amended to reflect this, but there will be no need to change the external schema. This means that any such change of physical data storage is transparent to users and application programs. This approach removes the problem of physical data dependence.
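Here is a minimal sketch of physical data independence, assuming the same hypothetical customers table and customer_contacts view as above: the application function is written once against the external view, and a later internal-level change (creating an index) needs no change to it.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE customers (id INTEGER PRIMARY KEY, name TEXT, city TEXT)")
conn.execute("CREATE VIEW customer_contacts AS SELECT id, name, city FROM customers")
conn.execute("INSERT INTO customers VALUES (1, 'Ada', 'London'), (2, 'Ravi', 'Leeds')")

def contacts_in(city):
    # The application program: written against the external view only.
    return conn.execute(
        "SELECT name FROM customer_contacts WHERE city = ?", (city,)).fetchall()

print(contacts_in("London"))   # works without any index

# Internal-schema change: add a physical access path. The application
# function above is untouched -- the change is transparent to it.
conn.execute("CREATE INDEX idx_city ON customers (city)")
print(contacts_in("London"))   # same program, same result, different access path
```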
Logical data independence

Any changes to the conceptual schema can be isolated from the external schema and the internal schema; such changes will be reflected in the interface between the conceptual schema and the other levels. This achieves logical data independence. What this means, effectively, is that changes can be made at the conceptual level, where the overall model of an organisation's data is specified, independently of both the physical storage level and the external level seen by individual users. The changes are handled by the interfaces between the conceptual (middle) layer and the physical and external layers.

Benefits of the database approach

- Ease of application development: The programmer is no longer burdened with designing, building and maintaining master files.
- Minimal data redundancy: All data files are integrated into a composite data structure. In practice, not all redundancy is eliminated, but at least the redundancy is controlled; thus inconsistency is reduced.
- Enforcement of standards: The database administrator can define standards for names, etc. Data can be shared, and new applications can use existing data definitions.
- Physical data independence: Data descriptions are independent of the application programs, which makes program development and maintenance easier. Data is stored independently of the programs that use it.
- Logical data independence: Data can be viewed in different ways by different users.
- Better modelling of real-world data: Databases are based on semantically rich data models that allow the accurate representation of real-world information.
- Uniform security and integrity controls: Security control ensures that applications can only access the data they are required to access; integrity control ensures that the database represents what it purports to represent.
- Economy of scale: Concentration of processing, control personnel and technical expertise.

Risks of the database approach

- New specialised personnel: The need to hire or train new personnel, e.g. database administrators and application programmers.
- Need for explicit backup and recovery procedures.
- Organisational conflict: Different departments have different information needs and data representations.
- Large size: A DBMS often needs large amounts of storage and processing power.
- Expensive: Software and hardware expenses.
- High impact of failure: The concentration of processing and resources makes an organisation vulnerable if the system fails for any length of time.

The role of the data administrator

It is important that the data administrator is aware of any issues that may affect the handling and use of data within the organisation. Data administration includes the responsibility for determining and publicising policy and standards for data naming and data definition conventions, access permissions and restrictions for data and the processing of data, and security issues.

Difference between File System and DBMS

| Basis | File System | DBMS |
| --- | --- | --- |
| Structure | Software that manages and organizes the files in a storage medium within a computer. | Software for managing the database. |
| Data Redundancy | Redundant data can be present. | Redundancy is controlled and minimised. |
| Backup and Recovery | Does not provide backup and recovery of data if it is lost. | Provides backup and recovery of data even if it is lost. |
| Query Processing | No efficient query processing. | Efficient query processing. |
| Consistency | Less data consistency. | More data consistency, through processes such as normalization. |
| Complexity | Less complex. | More complex to handle than a file system. |
| Security Constraints | Provides less security. | Has more security mechanisms. |
| Cost | Less expensive. | Comparatively higher cost. |
| Data Independence | No data independence. | Data independence exists. |
| User Access | Only one user can access data at a time. | Multiple users can access data at a time. |
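The query-processing row above can be made concrete with a small sketch. The file layout (a hypothetical orders.csv with customer_id and amount columns) and the orders table are invented for illustration; the point is that the file-based program must read and filter every record itself, while the DBMS accepts a declarative query and plans the access on the program's behalf.

```python
import csv, sqlite3

# File-based approach: the application scans and filters records itself.
def total_for_customer_file(path, customer_id):
    total = 0.0
    with open(path, newline="") as f:
        for row in csv.DictReader(f):        # every row is read and inspected
            if row["customer_id"] == customer_id:
                total += float(row["amount"])
    return total

# Database approach: the DBMS plans and executes the query, and can use
# an index on customer_id without any change to this code.
def total_for_customer_db(conn, customer_id):
    return conn.execute(
        "SELECT SUM(amount) FROM orders WHERE customer_id = ?",
        (customer_id,)).fetchone()[0]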
Data Warehousing

Data warehousing is a collection of tools and techniques through which additional knowledge can be derived from a large amount of data. This helps with the decision-making process and with improving information resources. A data warehouse is basically a database of unique data structures that allows relatively quick and easy performance of complex queries over a large amount of data. It is created from multiple heterogeneous sources.

Characteristics of data warehousing:
- Integrated
- Time-variant
- Non-volatile

The purpose of a data warehouse is to support the decision-making process. It makes information easily accessible, as reports can be generated from the data warehouse. It usually contains historical data derived from transactional data, but can also include data from other sources. A data warehouse is always kept separate from transactional data. Multiple data sources feed an ETL process: data is Extracted from each data source, Transformed according to a set of rules, and then Loaded into the desired destination, thus creating the data warehouse.

Data Mining

Data mining refers to extracting knowledge from large amounts of data. The data sources can include databases, data warehouses, the web, etc. Knowledge discovery is an iterative sequence:

- Data cleaning: remove inconsistent data.
- Data integration: combine multiple data sources into one.
- Data selection: select only the relevant data to be analysed.
- Data transformation: transform the data into a form appropriate for mining.
- Data mining: apply methods to extract data patterns.
- Pattern evaluation: identify the interesting patterns in the data.
- Knowledge representation: use visualization and knowledge representation techniques to present the mined knowledge.

What kind of data can be mined?
- Database data
- Data warehouse data
- Transactional data

Scope of data mining

- Automated prediction of trends and behaviours: Data mining automates the process of finding predictive information in large databases. For example, consider a marketing company: data mining can use past promotional mailings to identify the targets most likely to maximise the return on future mailings.
- Automated discovery of previously unknown patterns: Data mining sweeps through the database and identifies previously hidden patterns. For example, in a retail store, data mining can go through the entire transaction database and find the items that are usually bought together, as sketched below.
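Here is a minimal sketch of that bought-together discovery, using an invented set of transactions: it counts how often each pair of items co-occurs across baskets, which is the counting step that underlies association-rule methods such as Apriori.

```python
from itertools import combinations
from collections import Counter

# Hypothetical transactions: each set is the contents of one basket.
transactions = [
    {"bread", "butter", "milk"},
    {"bread", "butter"},
    {"milk", "eggs"},
    {"bread", "butter", "eggs"},
]

# Count co-occurrences of every item pair across all baskets.
pair_counts = Counter()
for basket in transactions:
    for pair in combinations(sorted(basket), 2):
        pair_counts[pair] += 1

# The most frequent pairs are candidate "bought together" patterns.
for pair, count in pair_counts.most_common(3):
    print(pair, count)   # e.g. ('bread', 'butter') 3
```

A real mining tool would also compute measures such as support and confidence to evaluate which patterns are genuinely interesting, as described in the pattern evaluation step above.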
The difference between data mining and data warehousing

Data mining:
- It is a process used to determine data patterns.
- It can be understood as a general method of extracting useful information from a set of data.
- Data is analysed repeatedly in this process.
- It is done by business entrepreneurs and engineers to extract meaningful information.
- It uses many techniques, including pattern recognition, to identify patterns in data.
- It helps detect unwanted errors that may occur in the system.
- It is cost-efficient in comparison to other statistical data-processing techniques.
- It isn't completely accurate, since nothing is ideal in the real world.

Data warehousing:
- It is a database system designed to support analytics.
- It combines all the relevant data into a single module.
- The process of data warehousing is done by engineers.
- Data is stored in a periodic manner.
- In this process, data is extracted and stored in one location for ease of reporting.
- It is updated at regular intervals of time, which is why major companies use it to stay up-to-date.
- It helps simplify every type of data for the business.
- Data loss is possible if the data required for analysis is not integrated into the data warehouse.
- It stores large amounts of historical data, which helps the user analyse trends and seasonality to make further predictions.

More formally, data mining is the process of discovering meaningful new correlations, patterns, and trends by sifting through a large amount of data stored in repositories, using pattern recognition technologies as well as statistical and mathematical techniques. It is the analysis of observational datasets to find unsuspected relationships and to summarize the data in novel ways that are both understandable and useful to the data owner. Data mining can involve the use of several types of software packages, including analytics tools. It can be automated, or it can be largely labor-intensive, where individual workers send specific queries for information to an archive or database. Generally, data mining describes operations that involve relatively sophisticated searches returning focused and specific results. For instance, a data mining tool might search through dozens of years of accounting data to find a specific column of expenses or accounts receivable for a particular operating year.

Big Data

Big Data refers to vast collections of structured, semi-structured, and unstructured data, often measured in terabytes. Processing such a large amount of data on an individual system is difficult: the machine's RAM has to hold the interim computations during processing and analysis, the processing takes a long time on a single system, and the system may fail to work correctly due to overload. Big data sets are those that outgrow the simple databases and data-handling architectures of earlier times, when big data was more expensive and less feasible. For instance, sets of data that are too large to be handled straightforwardly in a Microsoft Excel spreadsheet can be described as big data sets.

The comparison between Data Mining and Big Data:

| Basis | Data Mining | Big Data |
| --- | --- | --- |
| Definition | The process of discovering meaningful new correlations, patterns, and trends by sifting through a large amount of data stored in repositories, using pattern recognition technologies as well as statistical and mathematical techniques. | An all-inclusive term for the collection and subsequent analysis of significantly huge data sets that can contain hidden insights that could not be found using traditional methods and tools; the amount of data is too great for traditional computing systems to handle and analyze. |
| Purpose | To find patterns, anomalies, and correlations in a large store of data. | To discover insights from data sets that are diverse, complex, and of massive scale. |
| Applications | Use cases include financial services, airlines and trucking companies, the healthcare sector, telecommunications and utilities, media and entertainment, e-commerce, education, IoT, etc. | It acts as a base for machine learning and artificial intelligence applications worldwide. |
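To make the single-system limitation described above concrete, the following sketch (assuming a hypothetical events.csv with an amount column) streams a file row by row instead of loading it into RAM, so only the running totals ever occupy memory; distributed frameworks scale the same idea across many machines.

```python
import csv

# Streaming aggregation: keeps only running totals in memory, so the file
# can be far larger than the machine's RAM.
def total_amount(path):
    total, rows = 0.0, 0
    with open(path, newline="") as f:
        for row in csv.DictReader(f):      # rows are read one at a time
            total += float(row["amount"])
            rows += 1
    return total, rows

# Usage (with the hypothetical file in place):
# grand_total, row_count = total_amount("events.csv")
```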
How do data warehousing and OLAP relate to data mining?

Data warehouses and data marts are used in a broad range of applications. Business executives use the data in data warehouses and data marts to perform data analysis and make strategic decisions. In some firms, data warehouses are used as an integral element of a plan-execute-assess "closed-loop" feedback system for enterprise management. Data warehouses are used widely in banking and financial services, consumer goods and retail distribution sectors, and controlled manufacturing, including demand-based production.

Generally, the longer a data warehouse has been in use, the more it will have evolved. This evolution takes place in several phases. Initially, the data warehouse is mainly used for generating reports and answering predefined queries. It can then be used to analyze summarized and detailed information, where the results are presented in the form of reports and charts. Later, the data warehouse is used for strategic purposes, performing multidimensional analysis and sophisticated slice-and-dice operations. Finally, the data warehouse may be employed for knowledge discovery and strategic decision-making using data mining tools. In this framework, the tools for data warehousing can be classified into access and retrieval tools, database reporting tools, data analysis tools, and data mining tools.

Business users need the means to understand what exists in the data warehouse (through metadata), how to access the contents of the data warehouse, how to examine the contents using analysis tools, and how to present the results of such analysis. There are three kinds of data warehouse applications: information processing, analytical processing, and data mining.

- Information processing: supports querying, basic statistical analysis, and reporting using crosstabs, tables, charts, or graphs. A recent trend in information processing is to construct low-cost Web-based access tools that are integrated with Web browsers.
- Analytical processing: supports basic OLAP operations, including slice-and-dice, drill-down, roll-up, and pivoting. It usually operates on historical data in both summarized and detailed form. The major strength of online analytical processing over information processing is the multidimensional analysis of data warehouse data.
- Data mining: supports knowledge discovery by finding hidden patterns and associations, building analytical models, performing classification and prediction, and presenting the mining results using visualization tools.

Because data mining involves more automated and deeper analysis than OLAP, it is expected to have broader applications. Data mining can help business managers find and reach more suitable customers, and gain critical business insights that can help drive market share and raise profits.
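As a small illustration of the OLAP operations named above, the sketch below uses Python's sqlite3 module and an invented sales table: roll-up and drill-down are changes in grouping granularity, and a slice fixes one dimension before aggregating.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE sales (region TEXT, city TEXT, month TEXT, amount REAL)")
conn.executemany("INSERT INTO sales VALUES (?, ?, ?, ?)", [
    ("North", "Leeds",  "Jan", 120.0),
    ("North", "York",   "Jan",  80.0),
    ("North", "Leeds",  "Feb", 150.0),
    ("South", "London", "Jan", 300.0),
])

# Roll-up: aggregate to a coarser level (city -> region).
print(conn.execute(
    "SELECT region, SUM(amount) FROM sales GROUP BY region").fetchall())

# Drill-down: move to a finer level (region -> region, city).
print(conn.execute(
    "SELECT region, city, SUM(amount) FROM sales GROUP BY region, city").fetchall())

# Slice: fix one dimension (month = 'Jan') and aggregate the rest.
print(conn.execute(
    "SELECT region, SUM(amount) FROM sales WHERE month = 'Jan' GROUP BY region").fetchall())
```

A dedicated OLAP server precomputes and stores many such aggregations in a data cube, but the operations themselves reduce to grouped aggregation of the kind shown here.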