BACS 485: DATABASE PROCESSING CONCEPTS

I. PRIMARY POINTS OF LECTURE SERIES

   View data as a valuable corporate resource
   Be familiar with the concept of IRM (information resource management)
   Basic components of a database system
   Comparison of file processing and database processing
   Advantages/disadvantages of file and database processing
   Costs and risks of the DBMS approach

II. DATA AS A CORPORATE RESOURCE

In the future, the companies that succeed will be those that treat data as a valuable corporate resource. In the past, data was thought of as the private domain of functional areas. This was a natural outgrowth of the technology of the time. More formally stated, this is the concept of IRM...

A. Information Resource Management

Twenty-five years ago this aspect of database was not taught; you just jumped right into the technical aspects. Not so now. Database must be viewed in the larger context as part of the corporate information resource. One of the primary purposes of information is to support managerial decision making, and the success of a company depends on the quality (and speed) of management decisions. This is a fairly new way to view database, and it greatly affects the way systems are designed and thought of. The main point is that you need to design databases within the overall context of the company (you can no longer ignore corporate structure and build strictly to functional specifications).

III. FILE VERSUS DATABASE PROCESSING ENVIRONMENTS

Current database systems are the end result of 40 years of evolution. Knowing the history of computer systems helps you understand why things are as they are and where they may be going. It also helps you avoid re-making the same mistakes.

A. Brief History

In the beginning there were no computers. Businesses relied on paper systems and written procedures. Some very clever methods were devised (and are still used today). The problems with these systems were that they (partial list):

   could not grow infinitely
   were not responsive or flexible enough
   depended upon dedicated (life-long) workers
   did not provide companies with a competitive edge

One result of these manual systems was that each functional area "owned" its data and became very possessive of it. That was acceptable for the time.

When computers were introduced, they gave a way to automate manual tasks. (In fact, the title "computer" used to apply to a person who did manual computations.) This allowed a company to:

   speed up tasks (without infinite personnel)
   get better task integrity (accuracy)
   grow bigger

This was a big deal, with much fanfare. People expected a lot. The big companies were the first to buy computers. The early computers were very expensive and not very user-friendly. They cost so much that they were applied to the areas where the chance of success (i.e., payback of the huge investment) was best. The quickest payback (and the best PR) came from automating independent application systems that took over the jobs of clerical workers and solved a specific problem -- for example, payroll, accounts payable, purchasing, etc. This gave early computer systems a bad reputation for putting people out of work (they still have that reputation with some people).

The early system designers tried to program the computer exactly the same as the manual system. This was easier, but it tended to prolong the fiction of functional data ownership. The system and its data were owned by the department responsible for the function, and the data was considered a private resource.
Computer technology advanced and systems grew more powerful, but the functional users still considered data a private resource (a fiefdom). BAD!!! Also, each time a new application was developed, new files were defined. It took years for people to realize (and really start to do something about) treating data as a corporate resource that needs to be managed. This was one of the factors that started the database movement.

Imagine a situation where money is not managed in a company. Each department has a pile it controls: no sharing, no cross-department information about the money. If a big purchase came up, there would be no way to consolidate money to make it. It is ridiculous to consider, but that is exactly the situation of the traditional approach.

Let's take a closer look at the traditional file approach...

B. Traditional File Processing Approach

In the traditional approach, computer systems are designed for and used by individual departments for specific tasks. Each department maintains and "owns" its own data, so little sharing takes place between departments (only the sharing that is mandated and required). Each new system usually builds its own data files rather than trying to access existing ones, because it is too tough to coordinate modifications and requirements. A central DP shop provides the programming service for the functional departments (but it is still the department's data). The DP shop must somehow manage all programming needs for the whole company without the benefit of a global plan or centralized authority. Very tough to do! Large backlogs characterized this setup. Impatient departments sometimes built their own mini-DP shops to handle their own programming needs, but that seldom worked because what little data sharing existed went away.

1. Advantages

There are 2 transient advantages to the traditional approach (they go away quickly). Be aware that I am not defending these methods. I just want you to realize that when these systems initially developed, there were valid reasons (advantages).

The 1st advantage...

a) Simple (natural)

The traditional file processing method was the easiest way for systems to develop given the state of the art at the time. The people in the 50's weren't stupid. They realized that the situation was not ideal; it's just that the computer technology was limited (e.g., tapes and cards were available, not disks). (Later, it stayed popular because of momentum and the expense of change.) Also, the systems that were automated were not the real money-makers for the company. Marketing and production make money; payroll and general ledger just keep track of it. There were lots of easier ways to improve company performance than by improving computerized accounting systems (e.g., work on mission-critical manual systems). So, for the time period, traditional systems were the natural way to start. Not so today!

The 2nd advantage of traditional systems...

b) Cheap (initially)

Building integrated systems is much more complex than building stand-alone (traditional) systems. Thus, it costs more to build integrated systems than stand-alone ones. Computers were normally bought to save money, so it did not make sense to spend your "profit" on unnecessary integration. Integration was viewed as unnecessary because they had gotten along fine without it in the manual systems (or so they thought). Once a computerized manual system was working, it was cheaper to leave it alone than to severely modify it.
(They did not know yet that 80% of all programming is maintenance.)

2. Disadvantages of traditional systems

The disadvantages of the traditional file approach far outweigh the "so-called advantages". This is why database was developed.

The 1st disadvantage...

a) Distributed Ownership

In traditional systems each functional department "owned" its own data. This is the "my data" syndrome. They feel it is their personal property (like office furniture) to do with as they please. Often they are hesitant to let anyone else in the company see it (because then their power would be diminished). Merging data became a political process. This does not help global competitiveness. There are 3 negative aspects to distributed ownership...

(1) Uncontrolled Redundancy

Each new system required new files to be built (since no one would share existing data even if the technology existed to make it possible). Lots of data was stored multiple times -- once for each system that needed access to it. This was bad because:

   disk space is wasted
   data must be input several times
   data must be updated several times (multiple occurrences)
   data becomes inconsistent (next section)

Everyone realized that redundancy was a problem, but no one knew how to fix it short of giving up control of "their data".

The 2nd aspect...

(2) Inconsistent Data

This is the worst of the problems -- it requires special attention. When you store data several times (in several places) you WILL have inconsistencies no matter how careful you are. The worst case is where you miss an update. Then you have two (or more) versions of the data... which is correct? A more subtle inconsistency is when the update is slightly incorrect (a misspelled name or a different abbreviation) or the timing of the update is incorrect (a raise applied after the payroll period...). Since each copy of the data is "owned" by a different group, it is really no one's responsibility to make sure that it is globally correct. Responsibility falls between the cracks. This allows multiple versions of the truth to be stored in the computer. It undermines the users' confidence in a database if it appears to give incorrect answers. It can also mess up customer relations if you send goods to the wrong address or bill from an incorrect invoice. (A hypothetical sketch of this drift appears below.)
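To make the drift concrete, here is a minimal SQL sketch. All table and column names (SALES_CUSTOMER, BILLING_CUSTOMER, cust_id, ...) are hypothetical, invented for illustration; the point is simply that two departmental copies of the same fact can silently diverge.

    -- Two departments each keep their own copy of customer data.
    CREATE TABLE SALES_CUSTOMER (
        cust_id  INT PRIMARY KEY,
        name     VARCHAR(40),
        address  VARCHAR(60)
    );

    CREATE TABLE BILLING_CUSTOMER (
        cust_id  INT PRIMARY KEY,
        name     VARCHAR(40),
        address  VARCHAR(60)
    );

    -- Sales learns of a move and updates its copy only.
    UPDATE SALES_CUSTOMER
    SET    address = '12 Elm St.'
    WHERE  cust_id = 1001;

    -- The two "truths" now disagree; which address is correct?
    SELECT s.cust_id, s.address AS sales_addr, b.address AS billing_addr
    FROM   SALES_CUSTOMER s
    JOIN   BILLING_CUSTOMER b ON b.cust_id = s.cust_id
    WHERE  s.address <> b.address;

Nothing in the traditional setup can say which copy is the truth; that is exactly the responsibility that falls between the cracks.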
The 3rd negative aspect of distributed ownership...

(3) Islands of Information

A problem that redundancy and the "my data" syndrome cause is that groups develop their own data environments that are independently maintained. These users become self-sufficient and cut off from others. They are not willing to share or receive information. The result is not only redundant data, but redundant tools and effort too.

The 2nd disadvantage of traditional systems...

b) Limited Data Sharing

Since everyone has their own copy of the data, no one shares. This wastes disk space and perpetuates the cycle of independent file systems. For example, suppose you are designing a new system that could use data already defined and maintained by other departments. Since they are unwilling to share (and it would require some rewrite of their working system), you decide to take the easy way out and just re-enter the data. This is politically easier and simpler for the moment, but it is very short-sighted. It delays fixing the problem.

The 3rd disadvantage of traditional systems...

c) Inflexibility

Independent islands of information do not easily cooperate. When changes are needed it becomes a political nightmare. People have to call in favors and make deals to get things done. If quick action and responsiveness are required -- forget it. This is where traditional file systems really start to hurt companies. It causes loss of business to more efficient competition. In addition, changes to file structures cause a ripple effect in all the programs that access that data. For example, 9-digit zip codes were a big headache. Independent file-based systems tend to be designed for a specific purpose; they do not easily allow unplanned processing.

The 1st negative aspect of inflexibility is...

(1) No Data Independence

This is very important. When a data definition or processing changes in a traditional file system, it requires many modifications. The reason is similar to the data redundancy discussion above: in this case, the data definition is redundantly stored, and the way the data are processed is also stored redundantly. So you have to make changes in many places to accommodate a "simple" data change. When you can make data modifications without changing programs, you have "data independence".

The 2nd aspect of inflexibility...

(2) Ad Hoc Queries Constrained

Another result of inflexibility is that it is difficult to get information in any format other than the standard format. For instance, say a manager is asked (off the top of her head) "how many red widgets were sold on the east coast to people under 12 years old?" Chances are that a special report does not exist with this information. (This is called an "ad hoc" query -- a one-time-only query.) To get the information, a special report must be generated. Programmer time is usually backlogged for months, so it is easier to pull the information by hand from old reports. This defeats the purpose of the computer. The data is out there and would be fairly easy to collect if the proper tools existed. More recently, tools have been built to help fix this problem (called 4th generation languages). These make it easier for non-programmers to do ad hoc queries, but they still require in-depth knowledge about where and how the data are stored. Most managers don't know (or care to know) this. (The sketch below shows what such a one-time query looks like when the right tools do exist.)
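For contrast, here is the manager's question expressed as a single ad hoc query against an integrated database. The schema (SALES, PRODUCT, CUSTOMER and their columns) is hypothetical, invented for illustration; the point is that with shared, structured data and a query language, the answer is a few lines rather than a months-long programming request.

    -- "How many red widgets were sold on the east coast
    --  to people under 12 years old?"
    SELECT COUNT(*)
    FROM   SALES    s
    JOIN   PRODUCT  p ON p.product_id = s.product_id
    JOIN   CUSTOMER c ON c.cust_id    = s.cust_id
    WHERE  p.name   = 'widget'
      AND  p.color  = 'red'
      AND  c.region = 'EAST'
      AND  c.age    < 12;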
The 4th disadvantage of traditional systems...

d) Poor Standard Enforcement

This is another result of distributed ownership. Standards are things like:

   size of data fields (5 or 9 digit zip codes?)
   abbreviations used (AV. or AVE. for avenue?)
   timing of changes (when do you close the books?)
   who can access data, and when?
   how do you name data fields?
   how do you calculate accounting discounts?
   what operating system do we use (UNIX or DOS?)
   what file format is best (ASCII or EBCDIC?)
   etc...

Standards are one of the main things that turn a group of independent file systems into a corporate database. Traditional file systems have very poor (if any) standard enforcement, primarily because no one is in charge of the data. No rules exist, so you get a free-for-all. There are 3 aspects to poor standards enforcement...

(1) Processing

Processing standards tell groups how and when to transform data into information. They can include the obvious "how do you calculate accounting discounts" but can also include more global questions like:

   what hardware is used
   what software is used
   what network protocol is used
   etc...

Traditional file systems develop in the absence of any centralized control, hence few processing standards are enforced unless people informally get together or there is a strong push from management toward commonality.

The 2nd aspect of poor standards enforcement...

(2) Storage

The way data are stored also requires standards. The obvious storage standards include:

   how long are the data fields
   what type are the data fields
   what are your abbreviations

Some more subtle standards include:

   what type of disk drive do you use
   what encoding scheme is used
   is data compression or encryption used

Traditional file-based systems do not generally have any storage standards because no central control is exerted to make them conform.

The 3rd aspect of standards enforcement...

(3) Naming

Finally, naming schemes are a type of standard not often found in traditional file-based systems. Because each group builds its files in isolation, the way data are named is inconsistent. This means that even if users wanted to share data, they would have to figure out what it is called first.

The last disadvantage of traditional systems...

e) Excessive Costs

Traditional file-based systems cost more in the long run. They may seem cheaper when you are developing them, but the combined effect of all the previously stated problems eventually adds up. I group the cost problems into 2 general categories...

(1) Low Programmer Productivity

All the complexity of determining file names, working out standards, and dealing with data independence problems results in LOW PROGRAMMER PRODUCTIVITY. The complexity of trying to share data slows down programmers. Also, if they have to "reinvent the wheel" every time they do something, the result is wasted effort. Finally, programmer productivity is lower because much time is spent in maintenance (next topic).

(2) Excessive Program Maintenance

Lack of data independence and the inflexibility of traditional systems mean that LOTS of effort is spent just to keep the systems working. Statistics show that 70% to 90% of all programmer effort is spent maintaining existing programs -- not building new systems. This does not advance the state of the company; it just keeps them even. Everyone suffers from the high cost of maintenance. Programmers hate it (lousy job), managers dislike it because it is not productive, and users dislike it because of the backlogs. BAD!! This is a prime factor in low programmer productivity and can be tied directly to the lack of data sharing and centralized control.

C. Database Approach

1. Life Cycle Connection

Nolan divided the life cycle of an organization's information systems into 6 stages:

   STAGE 1: INITIATION   - DP first introduced
   STAGE 2: CONTAGION    - spreads rapidly, automating basic functions
   STAGE 3: CONTROL      - costs go up, so more control is needed
   STAGE 4: INTEGRATION  - pull together independent systems (try)
   --> STAGE 5: ARCHITECTURE - develop an enterprise-wide data model (plan)
   STAGE 6: DISTRIBUTION - distribute data and processing

The traditional approach occupies stages 1 through 4. As the organization's information system goals advanced, more money was spent. Traditional DP processes were concerned with automating existing systems. At about stage 4 (integration), companies started experimenting with the database approach. This approach combines (integrates) the islands of information into a useful centralized data repository. Note that it initially saved money (disk space, documentation, specification...) but gradually grew more expensive as more was required from the information system. The point is that the database approach is a natural outgrowth of the file processing system -- an evolution. As managers expect more from their information systems, the database approach makes more sense.
As the database approach comes on-line, things become possible that could not be done without a database. This makes going back to the old ways impossible (if you want to stay competitive).

Let's look at the characteristics of the database approach in detail...

2. Database Approach Characteristics

As hinted above, the database approach does away with the "my data" syndrome. It collects all data into an organized repository. The approach brings a new way of thinking and some new tools that permit the control of redundancy, improved flexibility, and centralized data control. Three features characterize the database approach:

   1. Shareability of data
   2. Centralized data management and control
   3. Adaptability

Here's a summary of the approach. The database approach begins by building a model of the organization (called an enterprise model). The model summarizes the structure of the data used (and needed) in the company. It shows the data entities and the relationships between them. (In real life, you seldom have a global enterprise model; rather, you have a collection of smaller system models. Their collected whole is equivalent to the enterprise model idea.) The structure is carefully built so that all users' needs are taken care of with a minimum of redundancy. (Thus, sharing and redundancy are both solved.) Next, you describe this structure to a piece of software (called a DBMS) that can maintain both the structure and the data in its format. Finally, you load the data and use the DBMS to access and maintain it and its structure. It is the role of the DBMS to retrieve and store the data. The DBMS also does its best to make sure the data are accurate (not necessarily correct). It is all very complex -- more so than this summary would indicate.

There are certain advantages to the database approach...

3. Advantages

Basically, the advantages of the database approach fix the disadvantages of the traditional approach (on purpose). There should be no surprises here.

The 1st advantage...

a) Data Integration (conceptual model)

For the first time, companies may see exactly what data they have and need. This sounds simple, but it is surprising how many companies don't know what data they have and require. Knowing this can lead to some efficiencies (e.g., why are we storing this information that no one wants?). It allows streamlining of requirements and systems if the effort is applied.

The 2nd advantage of the database approach...

b) Controlled Redundancy

Very obvious. If you only store it once, you do away with redundancy and its related problems. Example: if you only store an employee address in one place, then a single update is effective for all users (see the sketch below). You cannot get rid of ALL redundancy -- there are times when it is a good idea to keep redundant data (we'll talk about that later in the semester) -- but you can minimize it. This has the side benefit of saving disk space and requiring fewer maintenance programs.
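A minimal sketch of the single-copy idea, using a hypothetical EMPLOYEE table (all names invented for illustration): one update, and every department that reads employee addresses sees the new value.

    -- One shared copy of employee data, not one file per department.
    CREATE TABLE EMPLOYEE (
        emp_id   INT PRIMARY KEY,
        name     VARCHAR(40),
        address  VARCHAR(60)
    );

    -- One update, visible to payroll, benefits, mailing, etc.
    UPDATE EMPLOYEE
    SET    address = '44 Oak Ave.'
    WHERE  emp_id = 1042;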
The 3rd advantage of the database approach...

c) Data Sharing

Since the data is only stored in one place, you must share it. Actually, sharing is a very important advantage on its own. It means that you truly are managing your data and treating it like a company asset. However, when data are shared, new problems arise that were not present in traditional systems -- like concurrency control. We'll talk about that later.

The 4th advantage...

d) Data Integrity

Your book calls this "data consistency"; there is more to it than that. It refers to the problem of inconsistent data in file-based systems. If data redundancy is reduced, then data is only stored in one place, so the multiple-inconsistent-version problem is fixed. Integrity is the assurance that the data are:

   accurate
   complete
   timely
   reliable

It also deals with the notion that the data accurately represent the real-world system they attempt to model. In other words, integrity is concerned with obvious things like:

   names spelled correctly
   proper account numbers
   all social security numbers are unique

And also with less obvious things like:

   the customer discount applied to the proper customer
   an employee does not make more money than their boss

Data integrity is a very important part of a DBMS (and a part that most do not do very well). A sketch of how such rules can be declared appears below.
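Here is a hedged sketch of how a relational DBMS can carry some of these integrity rules in the schema itself. The table and column names are hypothetical. Simple rules fit in declarative constraints; a cross-row rule like "no employee earns more than their boss" generally needs a trigger or assertion, and support for those varies by product.

    CREATE TABLE EMPLOYEE (
        emp_id   INT PRIMARY KEY,
        ssn      CHAR(9) NOT NULL UNIQUE,        -- all SSNs unique
        name     VARCHAR(40) NOT NULL,
        salary   DECIMAL(9,2) CHECK (salary > 0),
        boss_id  INT REFERENCES EMPLOYEE(emp_id)
    );

    -- Cross-row rule: find employees who out-earn their boss.
    -- (Enforcing this automatically requires a trigger or an SQL
    --  assertion, which most products do not fully support.)
    SELECT e.emp_id, e.name
    FROM   EMPLOYEE e
    JOIN   EMPLOYEE b ON b.emp_id = e.boss_id
    WHERE  e.salary > b.salary;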
The 5th advantage of the database approach...

e) Standard Enforcement

Standards enforcement is an important (and often ignored) part of the database approach. Since you have centralized control of data, it is easier to enforce standards. For example:

   naming standards (e.g., the first 3 characters are the system name)
   field definition standards (MM/DD/YY for dates)
   usage standards (no payroll data access on weekends)
   processing standards (a common encryption algorithm)
   etc...

The DBA is the person (or persons) responsible for standards enforcement. If they do a thorough job, the database is much more usable.

The 6th advantage...

f) Centralized Control

Centralized control is sort of a side benefit of the database approach (not the direct reason for using the approach, but a side effect). Most of the benefits of centralized control are themselves advantages of the database approach, but centralized control also provides some benefits that you might not immediately think of.

The 1st benefit of centralized control...

(1) Security/Privacy

You can better control data security and privacy if you have centralized control. Security is the assurance that data is kept from unauthorized users. Privacy is the assurance that data is not misused by those users. (Similar concepts, but not identical.) Having all the data in one spot within the conceptual model makes this easier.

The 2nd benefit of centralized control...

(2) Integrity

Integrity = accurate, complete, timely, reliable. As mentioned before, integrity is much easier if you have a centralized body in charge of the data and the standards. Traditional file systems developed integrity through consistent "good programming practices" and voluntary cooperation among departments. This, in part, explains why some companies were more successful at system integration than others using similar equipment and doing similar processing (i.e., their informal network was better).

The 3rd benefit of centralized control...

(3) Backup/Recovery

Backup and recovery are often not done properly in file-based systems because they are not thought of as necessary. (Amateur systems managers don't realize the importance of backups.) Ironically, backups are about the most important thing you can do. When departments handled their own data, backup and recovery was done in a spotty and inconsistent manner. Some people were good at it and others were not. The reason more problems did not occur was that companies usually had only 1 computer center (for all their independent file-based systems), and the computer center handled backup and recovery without the departments' knowledge. Backup and recovery are much easier when database methods are used because the DBMS can handle the details (consistently).

The 4th benefit of centralized control...

(4) Standards

Obviously, standards are easier when they are centrally dictated. You cannot have standards without some central body (formal or informal) coming up with them.

The 7th advantage of the database approach...

g) Flexibility/Responsiveness

The database approach is much more flexible because you can access data without concern for where or how it is stored. This idea is very close to the idea of data independence. Some books also mention that database systems provide multiple retrieval paths to each data item. This is true (but misses the prime point about data independence). By default, a flexible system is also responsive. That is, you can make changes more quickly because there is less to change (less detail to worry about). Specifically, the flexibility of the database approach helps with...

(1) Ad Hoc Queries

Query languages have been developed that non-programmers can use to make one-time-only (ad hoc) queries. In this way they can satisfy their unique information needs without using traditional programming services (i.e., avoid the backlog). GOOD!! However, this still does not fix the problem that users must have in-depth knowledge of data names and where the data is stored. 4th generation languages (which is what most DBMS query tools are) help, but not quite as much as the sales literature promises.

The 2nd benefit from better flexibility...

(2) Multiple User Views

A side benefit of the database approach is called "views". A view is a single user's perspective of the database. Briefly, users do not need to see the whole database. They typically only need to see a small portion (the same portion that the traditional file would have given them). You define this database subset with the DBMS mechanism called "view definition". (We'll do this later.) Views enhance flexibility because they have great data independence. You can modify the underlying data without disturbing the view, and change the view definition without modifying the programs that use it (with one exception -- you can't remove data that the program directly uses). A sketch of a view definition appears below.
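A brief sketch of a view definition, again with hypothetical names: payroll sees only the columns it needs, and a column added to the base table later does not disturb programs that read the view. (ALTER TABLE syntax varies slightly by product.)

    -- Payroll's window onto the shared EMPLOYEE table.
    CREATE VIEW PAYROLL_EMP AS
        SELECT emp_id, name, salary
        FROM   EMPLOYEE;

    -- Later schema change: the view (and its users) are unaffected.
    ALTER TABLE EMPLOYEE ADD COLUMN parking_spot VARCHAR(10);

    SELECT emp_id, salary FROM PAYROLL_EMP;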
The 8th advantage of the database approach...

h) Data Independence

Traditional programming builds the data definition and the data processing into the same unit (called a program). This means that whenever the data definition changes, you have to modify each program that uses it. BAD!!! The database approach separates the data definition and the data access methods from the programs that process the data. This provides data independence (the ability to change the data structure without modifying the programs that use the data).

The 9th advantage of the database approach...

i) Cost Savings

The database approach saves money in the long run. However, as we will see later, a good DBMS is expensive, so this benefit is weak and a little harder to prove. Database cost savings come from 2 general areas...

(1) Easier Application Development

Much of the work in traditional application development is defining the data and writing code to access it. Also, a good deal of traditional application programming was devoted to integrity checks. The database approach minimizes this because the data are already defined and the access is up to the DBMS. Also, the integrity constraints can be built into the DBMS. This speeds up application development. However, it slows down analysis and build time because you have to do all that work up front. (No free lunches.) Code re-use is another advantage of the database approach. In some systems, you can build generic routines and have the DBMS store them for later re-use. There is no need to reinvent the wheel and re-test these routines once you know they work. This idea was common in traditional file-based systems too, but it is easier in the database approach.

The 2nd area where you get cost savings from the DBMS approach...

(2) Reduced Maintenance

Since the data, the data definition, and the application programs in a DBMS are all independent, you can change any one without affecting the others (data independence). This makes program maintenance (and testing) easier and quicker (less money). Old-style systems forced the programmer to be concerned with "side effects" (unplanned effects of a program change); even simple modifications required extensive planning and testing (big $). The database approach does not totally do away with the problem of maintenance side effects, but at least it minimizes them. The end effect is better programmer productivity.

All is not rosy for database, though. There are some problems...

4. Disadvantages

This semester we will not dwell on the problems of the database approach. The advantages of database greatly outweigh the problems. However, I want you to have a realistic view of DBMS from this course (we are not "selling" DBMSs here).

a) Specialized Personnel and Hardware (big $)

Although the database approach is cheaper in the long run due to improved programmer productivity (and other reasons), it has some up-front costs that are not trivial. These costs are often overlooked by anxious users. You don't want to be surprised. For example...

(1) DBA, DA, Special Programmers

If you plan on using existing people, you will have to send them to training classes on your DBMS product. These classes cost between $600 and $900 a day (plus expenses) and can last weeks. You can also do on-site training, but that is also very expensive. If you want to hire people already trained, they cost more than normal programmers (because they have had the training classes or experience). Either way, you cannot avoid the training cost and the subsequent loss of productivity while people learn. In addition, you will have to hire (or train) a database administrator (DBA). This is a very specialized person (or group of people) and they are not cheap ($50,000 to $150,000 per year, normally).

The 2nd area of increased costs...

(2) Bigger Disk, CPU, Memory

As far as hardware is concerned, you will probably have to upgrade your CPU, buy more fast RAM, and buy more disk drives. This costs big $. And it is in addition to the cost of the DBMS itself, which typically runs between $15,000 and $100,000 (or more).

The 3rd reason for increased cost...

(3) Potentially Slower Processing

Once they spend all this money, many people are surprised that the overall system throughput is slower than with their old file-based system (even with the beefed-up CPU). The reason is that the machine is having to do a lot more (e.g., concurrency control, security, integrity, recovery, ...). You want these things; it is just that people are surprised that they slow down the machine. The real gain in DBMS comes from programming cost savings, better data consistency, data sharing, and synergy.

The 2nd disadvantage of the database approach...
b) Data Concurrency Problems

Along with the advantages of data sharing come several problems. These problems are unique to systems that allow more than one functional area to use the same data at the same time.

The 1st data sharing problem...

(1) Deadlock / Livelock

When data are shared, the DBMS must be careful to allow only one user at a time to change a piece of data. If multiple users could change it at the same time, it would be chaos. The DBMS prevents this by giving sole update access to one user at a time; anybody else who needs the data must wait until the first is through. Deadlock is a condition where 2 (or more) users are each waiting for the other to release a piece of data. Neither can continue until the other gives up the data it holds. Deadlock (a.k.a. deadly embrace) will continue until you turn the computer off or until one of the users "gives in" -- imagine a 4-way stop with two cars each waiting for the other. Traditionally the DBMS looks for deadlock and forces one user to release their data and start over. Livelock is a condition where a user cannot proceed for a long time because they are never given access to needed data. This differs from deadlock in that they do not hold anything that anyone else wants; they are just "low on the priority list", so they never get access to a popular piece of data. Again, the DBMS has ways to fix this problem. Both problems are directly related to data sharing (and use lots of CPU cycles).

The 2nd problem directly related to data sharing (concurrency)...

(2) Transaction Rollback

One way the DBMS fixes deadlock is through "transaction rollback". Basically, a rollback is the process of "undoing" everything that a user has done. (A "transaction" is an arbitrary unit of work on the database.) When the DBMS selects a "victim", it UNDOes everything that user has done (including releasing locks on data items). Once this is done, the other users can continue. Rollbacks also use lots of CPU cycles and are normally the result of data contention (via data sharing). A sketch of the deadlock pattern and the resulting rollback appears below.
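Here is a sketch of the classic deadlock pattern, using a hypothetical ACCOUNT table (names invented for illustration). Two transactions update the same two rows in opposite orders; each ends up waiting for the lock the other holds, and the DBMS breaks the tie by rolling one of them back.

    -- Interleaved execution of two transactions, A and B:
    --
    --   A: BEGIN;
    --   A: UPDATE ACCOUNT SET bal = bal - 50 WHERE acct_id = 1;  -- A locks row 1
    --   B: BEGIN;
    --   B: UPDATE ACCOUNT SET bal = bal - 20 WHERE acct_id = 2;  -- B locks row 2
    --   A: UPDATE ACCOUNT SET bal = bal + 50 WHERE acct_id = 2;  -- A waits for B
    --   B: UPDATE ACCOUNT SET bal = bal + 20 WHERE acct_id = 1;  -- B waits for A: deadlock!
    --
    -- The DBMS detects the cycle, picks a victim (say B), and in effect issues:
    --
    --   B: ROLLBACK;  -- B's updates are undone, its locks released; A proceeds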
The 3rd problem of the database approach...

c) Data Security Problems

This is one of the few areas where the traditional approach was better. When you combine all the traditional file-based systems, you also combine all the security concerns of that data. Strangely enough, the traditional file-based approach had fairly good security because people guarded their data very carefully. Some of the security problems are not new (e.g., backup/recovery, authorization); they are just bigger. Other problems spring up specifically because a DBMS is present -- for example, the potential for catastrophic failure. All in all, DBMS security is not new, just more intense.

The 1st aspect of DBMS security is failure potential...

(1) Catastrophic Failure Possible

When you put all your eggs in one basket, you risk losing them all. Translation: lack of redundancy can easily lead to losing everything if you are not careful. Not only are the data not redundantly stored, but the processing is also performed on a single set of hardware. Dangerous! When you depend on one system (the DBMS) to perform all data retrieval, storage, backup/recovery, integrity, ..., you are very vulnerable to a failure of that system. On a less catastrophic note, a single bad application can sometimes bring down the whole DBMS. This sort of thing rarely happened on old file-based systems.

The 2nd aspect of security problems present in a DBMS...

(2) Elaborate Backup/Recovery

This generally comes under the heading of security. You can provide backup tapes, backup disks, backup CPUs, ... all the way up to totally redundant sites. The more security you buy, the more it costs. This is not really that different from file-based systems, since most of the big ones ran at a single hardware site. The problem is that all users are down when you do backups, not just a few functions. (Note: some fancy DBMSs, like ORACLE, do incremental backups so that users are not overly inconvenienced.)

The 4th disadvantage of the database approach...

d) Organizational Conflict

A shared database requires a consensus on data definition and ownership. This is where the DBA really earns their money; it takes a good politician to navigate this minefield. This is a political concern. Users from separate functional areas must agree on who owns the data, who can maintain it, and when and how it is processed and distributed. Textbooks normally minimize the political aspects of data processing in the organization. You should remember that politics always follows large budgets and power. This is true in the DP area.

The 5th database disadvantage...

e) Complex Conceptual Design

It is more complex to understand (and program) a system when you must be concerned with the entire enterprise model. DBMS programmers must be concerned with transaction interaction, timing, and a host of problems not present in old file-based systems. To cope, few companies actually have a full enterprise model. Most have subsets of the model that link logically related sub-systems. This is a practical way to approach extreme complexity. It is not necessarily good; it is just the way things are until we have better tools to help the programmer deal with extreme detail (e.g., CASE tools, SEEs, ...).

The 6th database disadvantage...

f) Extra Cost of DBMS on Program Development

Having a DBMS in place can slow down certain aspects of the system life cycle. If you ever evaluate a DBMS, you should realize this. 1st of all, you have testing difficulties...

(1) Testing Difficulties

You usually cannot test applications directly against the live database because doing so may damage it (again, all eggs in one basket...). This means that you must either make an extract of the full database or develop a subset of the database strictly for testing purposes. This takes time and disk space. It also may not adequately test the problems that often occur under "load" (multiple users).

The 2nd extra cost of program development...

(2) Complex Programming Environment

This is related to the "complex conceptual design" and "special programmers" topics above. Basically, DBMS products are more complex to use than file-based systems, and it takes time to learn them. Also, a full enterprise model of a company is usually too complex for a single programmer to comprehend without aids (CASE tools, ...). Taken together, you lose some of your programmer productivity (in the short run) due to the complexity of the DBMS.

All in all, however, the advantages FAR OUTWEIGH the disadvantages. The database approach is the way of the future. NEXT, let's examine the component parts of a DBMS...

IV. COMPONENTS OF A DATABASE SYSTEM

A DBMS is not one big program. It is made up of several large, complex sub-systems and several groups of people (classes of users).
To begin, the basic functions of a DBMS are:

   data storage, retrieval, and update
   a user-accessible catalog
   transaction support
   concurrency control services
   recovery services
   authorization services
   support for data communications
   integrity services
   services to promote data independence
   utility services (reorganization, statistics, monitoring, ...)

These diverse functions are handled by the collection of sub-systems discussed below.

The 1st component of a DBMS...

A. Data (raw)

The main purpose of a DBMS is to store and retrieve data, so this is obviously the most important component. There are 2 general categories of data:

   1. actual (real)
   2. virtual (derived)

Actual data is stored on disk and can be read and written just as in traditional systems (although you don't worry about storage details). Virtual data is not actually stored anywhere. Rather, the derivation method is stored, and when you require the information it is created from existing data. Example: employee age would not really be stored in the database. Rather, the birth date and the current date would be used to calculate it; that way it is always current. Another example: if you want the sum of a purchase order, the sum would not actually be on disk -- it would be calculated "on the fly". (See the sketch below.)
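A hedged sketch of virtual data in SQL, with hypothetical tables; date functions like CURRENT_DATE and EXTRACT vary by product. Only the derivation is stored -- here as a query and a view -- so the values are always current.

    -- Employee age derived from the stored birth date (age itself is not stored).
    -- Approximate: a full calculation would also compare month and day.
    SELECT emp_id,
           EXTRACT(YEAR FROM CURRENT_DATE) - EXTRACT(YEAR FROM birth_date) AS age
    FROM   EMPLOYEE;

    -- Purchase order total computed on the fly from its line items.
    CREATE VIEW PO_TOTAL AS
        SELECT po_id, SUM(qty * unit_price) AS total
        FROM   PO_LINE
        GROUP BY po_id;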
The 2nd component of a DBMS...

B. Structure (Repository, Schema)

Raw data is not really very useful by itself. You cannot have any relationships among data items in raw data (it is just a bunch of data items). What is missing is the structure that ties the pieces together. We will be working with this all semester. It is a very powerful aspect of the DBMS approach, and it is also one of the main complexities that must be dealt with. There are several basic concepts associated with structure:

   structure within a "record"
   structure between "records"
   cardinality (1:1, 1:N, M:N) - the nature of a relationship
   degree (how many entities participate)
   generalization/specialization
   primary key
   uniqueness
   referential integrity
   user-defined integrity rules

A generic term for information about the structure of data is "metadata". In database terminology, the part of the DBMS that defines the structure is called the "schema". The schema is stored in the DBMS component called the "catalog" or the repository. Again, we will get to most of this later; this is just an introduction.

The 3rd DBMS component...

C. DBMS

The DataBase Management System (DBMS) is the generic name for the collection of subsystems used to create, maintain, and provide controlled access to data. DBMSs range in complexity from small PC-DBMS products (Access, dBase IV, Rbase, ...) costing a few hundred dollars to large mainframe products (ORACLE, IBM DB2) costing several hundred thousand dollars.

1. DBMS Engine

The central component of the DBMS: a module that provides access to the physical structure and the data, and coordinates all of the other functions performed by the DBMS. It is the central control module that receives requests from users, determines where the data is and how to get it, and issues physical I/O requests to the computer operating system. It also provides some miscellaneous services such as memory and buffer management, index maintenance, and disk management.

The 2nd sub-component...

2. Interface Subsystem

Provides facilities for users and applications to access the various components of the DBMS. Most DBMS products provide a variety of languages and interfaces to satisfy the different types of users and the different sub-systems that must be accessed. The following are common interfaces (some are missing in the smaller/cheaper products). Note that some DBMSs combine the functions of several interfaces into a single sub-system (e.g., SQL is DDL, DML, and DCL combined).

The 1st interface component (of 6)...

a) Definition Language (DDL) (structure)

Used to define and maintain the database structures (e.g., records, tables, files, views, indexing, ...). Specifically, the DDL defines:

   all data items (type, specification, ...)
   all record types (tables in the relational model)
   the relationships among record types (not in the relational model)
   user views (or subschemas)

The DDL is used to define the conceptual database and turn it into a physical database.

The 2nd interface component...

b) Query Language (DML)

Used to manipulate and query the data itself, typically from a host program or as ad hoc commands in interactive mode. For example, you could select a subset of data based upon some query criteria using the DML. In some database systems the DML also provides the commands to "navigate" the data structure.

The 3rd interface component...

c) Control Language (DCL)

Used to grant and revoke access rights to individuals (and groups). A "right" is the privilege to perform a data manipulation operation. For example, the DBA can grant a clerk the right to access and delete INVENTORY records (but not to update them). Another related concept is a database "role". A role is a predefined set of access rights and privileges that can be assigned to a user. When the definition of the role changes, all users assigned to that role get the updated access rights. Again, not all DBMSs treat the DDL, DML, and DCL as separate interfaces, but all 3 functions must be present in the DBMS. (A combined sketch of all three sub-languages appears below.)
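Here, in one hedged sketch, are the three sub-languages side by side as they appear in SQL. The names are hypothetical and exact GRANT/ROLE syntax varies by product, but the division of labor is the point: DDL defines structure, DML manipulates data, DCL controls rights.

    -- DDL: define structure.
    CREATE TABLE INVENTORY (
        item_id  INT PRIMARY KEY,
        descr    VARCHAR(40),
        on_hand  INT
    );

    -- DML: manipulate and query the data.
    INSERT INTO INVENTORY VALUES (1, 'red widget', 500);
    SELECT item_id, on_hand FROM INVENTORY WHERE on_hand < 100;

    -- DCL: the clerk may read and delete, but not update.
    CREATE ROLE inv_clerk;
    GRANT SELECT, DELETE ON INVENTORY TO inv_clerk;
    GRANT inv_clerk TO clerk_smith;

The last two statements also show the role idea from above: change what inv_clerk may do, and every user assigned that role is updated at once.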
The 4th interface component...

d) Graphic Interface (QBE)

Optional. Some modern DBMSs provide a graphical representation of the data structure (usually a table) that allows you to select which data items to query on and the conditions for selection. Normally this feature is found in the Query By Example sub-systems of relational DBMSs. The graphic interface makes it easier for non-technical users to make complex queries. It is also handy because it is a common interface that can cross diverse DBMS systems.

The 5th interface component...

e) Forms Interface

Optional. A screen-oriented form is presented to the user, who responds by filling in the blanks. The result is that the DBMS uses the form you design to input and output data.

The 6th interface component...

f) High-Level Programming Language (interface)

Programmers need to be able to access data via a high-level language. This could be old-style (3rd generation) languages like COBOL, FORTRAN, or Pascal, or newer 4th generation languages like Toolbox, Mapper, Easytrieve, ... . Most big mainframe products (Ingres, Oracle, Informix, ...) include a 4GL as part of the DBMS. Studies have shown that applications built using 4GLs are completed 10 to 20 times faster than with traditional 3rd generation languages. (Note: the code does not run faster; it is just debugged sooner.) The interface is usually achieved by adding a few extra commands (verbs) to the standard language and having a pre-processor translate these verbs into DBMS calls (using the CALL format of the specific operating system). This method works well because the user does not need to know the complexities of operating system calls, and the resulting code is somewhat portable. Interfaces to the "old" languages are needed because there is a lot of code and programmer experience out there that cannot be ignored.

The 3rd sub-component of the DBMS...

3. Repository Dictionary Subsystem

The word "repository" relates back to the Information Resource Management concept mentioned earlier. The data in the database should be treated as a corporate resource, and this resource must be managed. The repository is more than a "data dictionary" or "catalog". It is the central place where you store:

   system documentation
   data structure
   project life cycle information
   conceptual model information
   etc...

CASE tools use it extensively. A DBMS system component must be present to manage and control access to the repository; to a certain extent, the DCL does part of this and CASE tools do part. The subsystem provides facilities for recording, storing, and processing descriptions of an organization's data and data processing resources. This is a pretty new idea (still being defined by industry).

The 4th sub-component...

4. Data Integrity Subsystem

Provides facilities for managing the integrity of data in the database and the integrity of metadata in the repository. There are 4 important functions:

   intrarecord integrity - enforce constraints on data item values and types within each record in the database
   referential integrity - enforce the validity of references between records in the database
   user-defined integrity - business rules (arbitrary) that must be upheld (e.g., an employee can't make more than their boss)
   concurrency control - assuring the validity of data when multiple users access it simultaneously (we'll get into this more later)

This is a traditional weak spot in most commercial DBMSs. (A referential integrity sketch appears below.)
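A short sketch of the referential integrity function, reusing the hypothetical INVENTORY table from the interface sketch above: with a foreign key declared, the DBMS itself rejects order lines that point at nonexistent items.

    CREATE TABLE ORDER_LINE (
        order_id  INT,
        item_id   INT NOT NULL,
        qty       INT CHECK (qty > 0),       -- intrarecord integrity
        FOREIGN KEY (item_id)
            REFERENCES INVENTORY(item_id)    -- referential integrity
    );

    -- Rejected by the DBMS if item 999 does not exist in INVENTORY:
    INSERT INTO ORDER_LINE (order_id, item_id, qty) VALUES (1, 999, 5);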
The 5th sub-component...

5. Security Management Subsystem

A subsystem that provides facilities to protect and control access to the database. The 2 most important aspects of security are:

   securing data from unauthorized access
   protecting it against disasters

The first is done through passwords, views, and protection levels; encryption is also widely used. The second aspect uses backups, logs, before and after images, disaster recovery plans, etc.

The 6th sub-component...

6. Backup/Recovery Subsystem

A subsystem that logs transactions and database changes and periodically makes backup copies of the database. This is done so you don't lose data in the event of a problem. There are different levels of problems that backup/recovery prepares for. They range from redoing a transaction that was rolled back due to a concurrency conflict (minor) to totally restoring the database after the computer center is destroyed (major).

The 7th sub-component...

7. Application Development Subsystem

Optional. Provides facilities so that end users and/or programmers can develop complete database applications. Some use elaborate CASE tools as well as screen and report generators to create full applications with minimal work. Others help write code from sketchy specifications. In any event, this is an aid to non-technical users and a way to beef up programmer productivity.

The 8th sub-component...

8. Performance Management Subsystem

The DBA needs some way to determine whether the DBMS is performing well. These tools (often called "monitoring utilities") give the DBA the information needed to tune DBMS performance. Example: a monitoring utility can find data items that are accessed frequently enough to need an index. It can also be used to determine whether a data item needs to be on a faster disk drive or possibly replicated.

The 4th DBMS component...

D. CASE Tools

Computer-Aided Software Engineering (CASE) tools are used for automating software development and maintenance tasks. Basically, you use a computer program to automate the software life-cycle work. This is fairly new and not widely used in industry, but you will have some contact with it during your career. Depending upon the CASE tool, it can automate:

   feasibility analysis
   requirements definition
   logical design
   prototyping
   programming and testing
   implementation
   maintenance

In the future, most large software projects will use CASE tools (probably hooked into the data repository subsystem). Some of the advantages of using CASE tools are:

   improved productivity in development
   improved quality through automated checking
   automatic preparation and update of documentation
   encouragement of prototyping and incremental development
   automatic preparation of program code from the requirements definition
   reduced maintenance effort

The 5th DBMS component...

E. Application Programs

These are the specific procedures (programs) that manipulate the data in the database. They are written for the specific applications needed by the business; you don't buy them as part of the database. Some are fairly predictable (payroll, accounts payable, general ledger) while others can be quite arbitrary and specialized. The main point is that they are specifically written for the business using business-defined requirements. The best thing the DBMS can do to help is to provide tools that help write and maintain the programs.

The 6th DBMS component (a ROLE)...

F. Data Administrators

The next 3 "components" could better be called "database roles". They are people who perform certain tasks. This includes the DBA and the DA. The two positions are responsible for the technical and political management of the DBMS and its use. The DBA is the person responsible for managing the organization's data, with the ultimate authority and responsibility for ensuring that the data is secure, accurate, and timely. Specifically, the DBA:

   builds and maintains the data dictionary
   resolves disputes about data usage, control, ...
   designs and maintains the conceptual and physical database
   monitors the database for performance tuning
   performs backups and recoveries
   etc...

Sometimes the DBA position is held by a group of people, and sometimes the technical side is handled by a Data Administrator (DA) while the political side is handled by the DBA.

The 2nd "role"...

G. Systems Developers

These are the programmers who write the application code to meet specifications. They are typically well versed in the use of the DBMS tools and in the function that they are programming. Seldom are systems developers the same people who use the system. This is partly due to the specialized nature of the work and partly due to the need to provide security by separating the users and the designers of the system.

The 3rd "role"...

H. End Users

Finally, the end users are those who have a need to access the data. Generally, end users are non-technical and do not use most of the DBMS components (at least they don't know they are using them). To the end user, the DBMS should look no worse than the file-based system it replaced (or you haven't done your job correctly). Hopefully it looks better.

Copyright Jay Lightfoot, Ph.D.