File Handling
Year One Information Processing

TABLE OF CONTENTS

FILE HANDLING
LOGICAL ORGANISATION
    Serial Access
    Features of Serial Files
    Sequential Access
    Features of Sequential Files
    Random or Direct Access
        Hash Coding
    Advantages of Hash Coding
    Disadvantages of Hash Coding
    Indexed Sequential
    Fully Indexed Files
    Indexed Sequential Files
    Advantages of Indexed Sequential Files
    Disadvantages of Indexed Sequential Files
PHYSICAL FILE ORGANISATION
    Serial File Organisation
    Sequential File Organisation
    Random or Direct Access File Organisation
    Index-Sequential File Organisation
    Criteria for Selecting File Organisation
    Updating Files
        Batch Processing
        Sequential File Updating
        Indexed File Updating
        Large Index Problems
        Index Sequential Access Method
    Role of the Operating System in Input and Output of Files

File Handling

On completing this handout, you will have learned about:

- The logical and physical organisation of files.
- Serial and sequential file handling methods.
- Direct and index sequential files.
- Creating, reading, writing and deleting records from a variety of file structures.
- Creating code to carry out the above operations.

Logical Organisation

A file is logically organised as follows:

    File:    Header | Record | Record | ... | Record | Trailer
    Record:  Data item (or field) | Data item (or field) | ... | Data item (or field)

A record is a collection of data that belongs together, e.g. all the data about an individual person. A data item is an individual field of a record and usually contains one piece of data, e.g. a date, first name or age. Fields are collected together to form records, and records are collected together to form a file. A file is therefore made up of records containing fields.

Once a program has processed data, the data can be stored permanently, allowing later retrieval. It is stored on magnetic tape (a sequential device) or on direct access devices such as magnetic disk (floppy or hard disk) and optical media (CD-ROM, CD-R, CD-RW, DVD etc.).

    Characteristics of a Sequential Device          Characteristics of a Random or Direct Access Device
    Slow.                                           Fast.
    Inexpensive.                                    Expensive.
    Access time is dependent on current position.   Has an almost constant access time.

Serial Access

When this type of file organisation is used, each record is stored one after the other, with no regard to any logical order. It is the simplest form of file organisation. This technique is normally used for storing records for further processing.

    001 | 003 | 006 | 004 | 002 | 005

Features of Serial Files

1. Easy to implement on magnetic tape.
2. Generally slow access.
3. Usually used for further processing (e.g. sorting) of records.
4. Used mainly as temporary files to store transaction data.
5. Suitable for batch search servicing, i.e. we can group together a number of requests and process them as a group.
6. Not a suitable file organisation for on-line access because it is too slow.
7. Not suitable as a master file, as the whole file has to be searched for a particular record, starting at the beginning.

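The behaviour of a serial file can be sketched in a few lines of Python. This is only an illustration under stated assumptions: an in-memory text stream stands in for the tape, each record is a line holding two comma-separated fields, and the function names `append_record` and `serial_search` are invented for this example. The surnames are borrowed from the student table used later in this handout.

```python
import io

def append_record(stream, key, surname):
    """Add a record at the end of the file, in arrival order (no sorting)."""
    stream.seek(0, io.SEEK_END)
    stream.write(f"{key},{surname}\n")

def serial_search(stream, wanted_key):
    """Search from the beginning; return (record, number_of_records_read)."""
    stream.seek(0)
    records_read = 0
    for line in stream:
        records_read += 1
        key, surname = line.rstrip("\n").split(",")
        if key == wanted_key:
            return (key, surname), records_read
    return None, records_read  # the whole file was read without a match

# Records arrive in the order shown in the diagram: 001 003 006 004 002 005.
tape = io.StringIO()
for key, surname in [("001", "George"), ("003", "Adams"), ("006", "Patterson"),
                     ("004", "Murray"), ("002", "Hugh"), ("005", "Sinclair")]:
    append_record(tape, key, surname)

found, reads = serial_search(tape, "002")
print(found, reads)
```

Record 002 is only reached after every record stored before it has been read, and a search for a key that is not on file reads all six records. This is why serial files suit batch work rather than on-line enquiries.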
Sequential Access

Just as in serial organisation, records are stored one after the other, but they are sorted into a key sequence. This is less flexible and more organised than a serial file. Records are kept in some pre-defined order, e.g. names stored alphabetically or records stored numerically. Retrieval is achieved by scanning the entries in the same order, e.g. 001, 002, 003, 004, 005 etc., so if we want record number 200 then records 001 to 199 have to be scanned first.

    001 | 002 | 003 | 004 | 005 | 006

The principal advantage is rapid access to sets of records, e.g. if the nth record has just been accessed, the (n+1)th record can be accessed very quickly. Hence, in a sequential access file you can read and write information sequentially, starting from the beginning of the file.

Features of Sequential Files

1. Records are stored in a pre-defined order.
2. Sequential access to successive records.
3. Suited to magnetic tape.
4. To maintain the sequential order, updating becomes a more complicated and difficult task. Records will usually need to be moved by one place in order to add (slot in) a record in the proper sequential order. Deleting records will usually require that records be shifted back one place to avoid gaps in the sequence.
5. Very useful for transaction processing where the hit rate is very high, e.g. payroll systems, where the whole file is processed, as this is quick and efficient.
6. Access times are still too slow (no better on average than serial) to be useful in on-line applications.

Random or Direct Access

    001 | 002 | 003 | 004 | 005 | 006

Records are accessed directly, allowing records to be read in any order. For example, to read record 005 you just jump directly to it. Hence, in a random access data file you can read or write information anywhere in the file. This implies that the medium being used allows a jump to any point in the file. In practice, this requires some form of disk storage.
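One common way to make this jump possible is to give every record the same length, so that the position of record n can be calculated rather than searched for. The Python sketch below assumes a made-up record layout (a 2-byte ID followed by a 20-byte surname) and uses an in-memory byte stream in place of a disk file; `write_record` and `read_record` are illustrative names, not part of any particular file system.

```python
import io
import struct

# Assumed fixed-length layout: 2-byte unsigned ID + 20-byte surname = 22 bytes.
RECORD = struct.Struct("<H20s")

def write_record(f, record_number, surname):
    """Jump to the slot for this record number and write it there."""
    f.seek((record_number - 1) * RECORD.size)
    f.write(RECORD.pack(record_number, surname.encode("ascii")))

def read_record(f, record_number):
    """Jump straight to the record without reading the ones before it."""
    f.seek((record_number - 1) * RECORD.size)
    rec_id, raw = RECORD.unpack(f.read(RECORD.size))
    return rec_id, raw.rstrip(b"\x00").decode("ascii")

disk = io.BytesIO()
for number, surname in enumerate(
        ["George", "Hugh", "Adams", "Murray", "Sinclair", "Patterson"], start=1):
    write_record(disk, number, surname)

print(read_record(disk, 5))       # read record 005 directly
write_record(disk, 2, "Hughes")   # overwrite record 002 in place
print(read_record(disk, 2))
```

Reading record 005 costs one seek and one read, whatever its position in the file, and record 002 can be rewritten without disturbing its neighbours.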
Direct addressing is not favoured if the demand is primarily for sequential processing.

Hash Coding

The purpose of hash coding is to enable direct retrieval of desired records without the need to search files or indices. A hashing algorithm is applied to one of the keys of the record (e.g. driving licence number, student number or National Insurance number) and converts the key to an address by mathematical or logical calculation.

Direct addressing is used when records have to be searched frequently in an unpredictable fashion. Examples are the sale of spare parts in a garage, or the sale of goods in shops where details about individual items have to be available simultaneously, in a random fashion, at many points (the check-out lanes in a supermarket).

There are many techniques for hash coding. One method is to divide the primary key by a prime number and use the remainder as the address. For instance, suppose we have a fairly large list of students whose names are associated with other details, and we would like to use the student number as the primary key. We would divide the student number by a prime number, say 97, and use the remainder as the storage location in the file. For example, take student number 1069 and divide it by 97: the remainder is 2, which is the location of that student's record. The remainder will always be between 0 and 96, which gives us 97 potential locations for records.

Once records are stored in this fashion, retrieval simply involves supplying a student number, which the hashing algorithm uses to locate the desired student record.

Advantages of Hash Coding

1. Rapid access to records in a direct fashion. It does not make use of large index tables and dictionaries, and therefore response times are very fast.

Disadvantages of Hash Coding

1. Collisions require the creation of an overflow area. Two keys can sometimes calculate to the same address.
2. In the above example, if there is a student number 3300, division by 97 will also produce a remainder of 2. However, we may already have a record (say 1069) in storage location 2. In this case, the extra record will have to be kept in an overflow area.
3. So if hashing maps more than one record to the same location, response time may increase because the overflow area has to be searched whenever the key found at the hash address does not match the key we are looking for.
4. Storage space can be wasted if there are not enough records to occupy the reserved spaces. For example, if we are using 97 as the prime divisor, there should be close to 97 records to go into these predetermined locations. If we choose to divide by 9713, there should be around 9713 records to optimise the use of storage space.
5. The table of locations almost always reveals that records are not kept in sequential order by key. Indeed, the records are kept in a pseudo-random fashion, so sequential processing of such a file can raise awkward problems. Suppose we wish to produce a sequential list of student numbers; then, for efficiency, we have to keep a separate, sequentially sorted copy.

Hash coding is therefore not used in applications that involve frequent sequential processing of records. A more suitable technique would be to use an indexed sequential file organisation.

Indexed Sequential

This system organises the file into sequential order, usually based on a key field, similar in principle to the sequential access file. However, it is also possible to access records directly by using a separate index file. An indexed file system consists of a pair of files: one holding the data and one storing an index to that data. The index file stores the addresses of the records held in the main file. More than one index may be created for a data file, e.g. a library may have its books stored on computer with indices on author, subject and class mark.
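The pairing of a data file with one or more indices can be sketched as follows. This is a simplified Python illustration, not a description of any real library system: the book records, the `|` field separator, and the names `build_indexed_file` and `read_at` are all assumptions made for the example. Each index stores the byte position (address) of records in the data file; one index is keyed on class mark and a second on author.

```python
import io

def build_indexed_file(records):
    """Write records to a data file and build two indices of byte offsets."""
    data_file = io.BytesIO()      # stands in for the main data file on disk
    key_index = {}                # class mark -> offset of that record
    author_index = {}             # author -> offsets of all their records
    for class_mark, author, title in records:
        offset = data_file.tell()
        data_file.write(f"{class_mark}|{author}|{title}\n".encode("utf-8"))
        key_index[class_mark] = offset
        author_index.setdefault(author, []).append(offset)
    return data_file, key_index, author_index

def read_at(data_file, offset):
    """Fetch one record directly from its stored address."""
    data_file.seek(offset)
    return tuple(data_file.readline().decode("utf-8").rstrip("\n").split("|"))

# Hypothetical records, stored in class-mark order.
books = [("C10", "Adams", "Tape Systems"),
         ("C20", "Murray", "Disk Systems"),
         ("C30", "Adams", "Index Design")]
data_file, key_index, author_index = build_indexed_file(books)

by_key = read_at(data_file, key_index["C20"])
by_author = [read_at(data_file, off) for off in author_index["Adams"]]
print(by_key)
print(by_author)
```

Because the data file itself stays in key order, it can still be read from start to finish for sequential processing, while either index gives direct access to individual records.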

There are two types of indexed files:

- Fully Indexed
- Indexed Sequential

Fully Indexed Files

An index to a fully indexed file contains an entry for every single record stored in the main file. The records are indexed on some key, e.g. student number. Very large files will have correspondingly large indices, and the index to a large file may be split into different index levels. When records are added to such a file, the index (or indices) must also be updated to include their relative position and to change the relative position of any other records involved.

Indexed Sequential Files

This is basically a mixture of sequential and indexed file organisation techniques. Records are held in sequential order and can be accessed randomly through an index. These files therefore share the merits of both systems, enabling sequential or direct access to the data.

The index to these files operates by storing the highest record key in given cylinders and tracks. Note how this organisation gives the index a tree structure. This type of file organisation requires a direct access device, such as a hard disk.

Indexed sequential file organisation is very useful where records are often retrieved randomly and are also processed in (sequential) key order. Banks may use this organisation for their auto-bank machines, i.e. customers randomly access their accounts throughout the day, and at the end of the day the bank can update the whole file sequentially.

Advantages of Indexed Sequential Files

1. Allows records to be accessed directly or sequentially.
2. Direct access ability provides vastly superior (average) access times.

Disadvantages of Indexed Sequential Files

1. The fact that several tables must be stored for the index makes for a considerable storage overhead.
2. As the items are stored in a sequential fashion, this adds complexity to the addition and deletion of records.
   Because frequent updating can be very inefficient, especially for large files, batch updates are often performed.

Physical File Organisation

There are various ways in which a file is physically stored on a tape or disk. The information is initially mapped onto physical blocks, and eventually onto the tracks and sectors of a disk.

At Hope, we keep records of students, and each student has a unique identification number that is used as a primary key field, e.g. 10052329. For illustration purposes we will assume that Hope only has 999 students, catering for a range of IDs from 001 to 999. The following file will be used to demonstrate the different file organisations:

    Student_ID_Number   Student_Surname
    001                 George
    002                 Hugh
    003                 Adams
    004                 Murray
    005                 Sinclair
    006                 Patterson
    …                   …
    999                 Cookson

Serial File Organisation

In order to access data within a serial file, a pointer is used.

Sequential File Organisation

Our student file can be sorted by ID and stored on magnetic tape.

    001 | 002 | 003 | 004 | 005 | 006

In order to access record 005, 'Sinclair', the R/W (read/write) head, which is positioned at the beginning of the file, would need to read records 001 through to 004 first. If we held 999 student records on file, accessing the last record would take a long time. Sequential file organisation can also be implemented on a disk, but this does not take advantage of the disk's ability to go straight to a record, so direct access is the preferred method of file storage and retrieval there.

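The cost of reaching record 005 in a key-ordered file can be sketched in Python as below. The list of tuples stands in for the tape, and `sequential_read` is a name invented for this example. The function counts how many records have to be passed over, and it also shows one benefit of key order: the scan can stop as soon as a larger key is seen, because the wanted record cannot appear later.

```python
def sequential_read(records, wanted_id):
    """Scan key-ordered records from the start.

    Returns (record, passed_over), where passed_over is the number of
    records read before the scan stopped.
    """
    passed_over = 0
    for student_id, surname in records:
        if student_id == wanted_id:
            return (student_id, surname), passed_over
        if student_id > wanted_id:
            break  # keys are in order, so the record is not on file
        passed_over += 1
    return None, passed_over

# The first six students from the table above, in ID order.
students = [("001", "George"), ("002", "Hugh"), ("003", "Adams"),
            ("004", "Murray"), ("005", "Sinclair"), ("006", "Patterson")]

record, passed_over = sequential_read(students, "005")
print(record, passed_over)
```

Records 001 to 004 are read before 005 is reached. With 999 records on file, the last record would only be found after 998 others had been passed over.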
Random or Direct Access File Organisation

Transferring the above example onto a disk would result in the following:

    [Figure: a disk surface divided into eight sectors (Sector 0 to Sector 7) and concentric tracks (Track 00, Track 01, Track 02).]

Added to the disk is an index, which is loaded into RAM and defines the relationship between the primary key and the corresponding disk address:

    Index Record   Disk Address
                   Track   Sector
    001            00      0
    002            00      1
    003            00      2
    004            00      3
    005            00      4

The index tells the disk R/W head where to look for the data (track and sector). The R/W head goes directly to the correct track, waits for the correct sector to rotate under the head and then retrieves the student's record.

Due to the size of the index (holding, in our case, pointers to 999 records and their disk addresses), a compromise sometimes has to be reached between direct and sequential file organisation.

Index-Sequential File Organisation

Storing a full index would require a large amount of memory. To avoid this and reduce the size of our index file, we could simply omit the last digit of the key, as shown below:

    Index Record   Disk Address
                   Track   Sector
    00             00      0
    00             00      1
    00             00      2
    00             00      3
    00             00      4
    …              …
    01             01      …
    02             02

This time-space compromise would reduce the demand on memory and the time spent processing the data. If we were to look for record 010, we would have immediate access to it provided there are no other IDs within the same region (i.e. 011 to 019); otherwise the records in that region would be accessed sequentially through the index until the required record is reached.

Criteria for Selecting File Organisation

The choice depends on a range of factors. There are four main criteria to be considered when choosing a file organisation technique:

- File use ratio (hit rate)
- File volatility
- File size
- User requirements

1. File use ratio (file activity)

If we divide the number of records that are accessed (within a specified process or period) by the total number of records in the file, we obtain the file use ratio (hit rate). If the ratio is high, the majority of records are used regularly, which means sequential or serial file organisation may be the appropriate method. On the other hand, if the ratio is low (say 5% to 10%), the implication is that the ability to retrieve a desired record quickly is crucial, and direct file organisation should therefore be recommended. Three examples illustrate file activity:

a. Payroll production: an example of a high activity file. In most, if not all, organisations the production of payroll and payslips is a regular event, which can be either weekly or monthly. Such an application requires processing of all or nearly all the employee records, and therefore the file use ratio will be close to or equal to one (100%). Sequential file organisation is therefore preferred.

b. Customer accounts in banks: an example of a medium activity file. This is an application in which both random and sequential access are required. For example, several customers should be able to withdraw cash from cash dispensers simultaneously and randomly, and the bank should be able to update all customer accounts periodically by sequential processing. Indexed sequential file organisation may therefore be the most suited to this type of application.

c. Airline ticket reservations: an example of a low activity file. In most cases only one record is accessed at a time. This record is required quickly, and therefore direct access is most appropriate.

Calculating the file use ratio

To calculate the file use ratio we need to know the number of records accessed and the number of records in the file.

Examples: A file has 8,000 records, 250 of which are accessed and updated per week.
File use ratio = 250 / 8000 = 0.03125 per week (very low)

If 4100 records are accessed per week:
File use ratio = 4100 / 8000 = 0.5125 per week (medium)

If all but 400 records are accessed weekly, i.e. 7600 are accessed per week:
File use ratio = 7600 / 8000 = 0.95 per week (very high)

2. File volatility

This indicates how often files require modification and updating, e.g. insertions and deletions. Highly volatile files are not usually indexed, as this would entail excessive overheads in updating the index and file too frequently. Indexing is used when the data is fairly stable.

3. File size

When files are large, serial and sequential location techniques give longer access times. Large files are therefore usually indexed or direct files.

4. User requirements

The main factor that concerns most users is how they access the files: batch access or interactive access. If they are happy to use batch access, then sequential file organisation is likely to be appropriate, providing the file activity is reasonable. If the user needs to operate interactively, then direct access will be required, which means indexed or hash-coded files. Access or response times may also influence the user: direct access (indexed or hashed) is faster, except when data is only accessed sequentially. Hash coding is quickest but provides no means for sequential searching.

Minor Criteria

- The type of storage device available, e.g. magnetic tape will only allow serial/sequential access.
- The ease (or complexity) of actually implementing the file organisation technique with the data concerned.
- The availability, features and cost of software to handle the organisation technique preferred according to the other factors.

A Note on Physical and Relative Addresses

Records are of no use to us unless we can retrieve the data they store. To retrieve records we must know where they are stored. There are two ways of indicating the location in which they are stored:

1. Physical Addresses

Eventually all addressing must map to physical addresses. These tell us the actual physical location of the record on the storage medium, e.g. on a magnetic disk we would need to know the cylinder, track and sector that hold the record.

    [Figure: Physical Address. A disk cylinder.]

Records are written onto the disk starting with Track 1 on surface 1, then Track 1 on surface 2, then Track 1 on surface 3. To expand, records could be stored as follows:

    Cylinder   Surface   Sector   Record
    1          1         9        128936A
    2          7         1        117237X
    3          4         8        456233C
    1          4         9        980763A

2. Relative Addresses

Modern file organisation techniques usually use relative addressing. The address is given according to the record's position in the file and not its physical location on the storage device. Thus the 56th record in a file would have a logical address of 56, quite independent of its physical location. Relative addresses must be converted to physical ones at some point for the computer to find the record.

    Relative Address   Record
    1                  68-768
    2                  68-888
    3                  97-023
    4                  98-222

File Content

Files will have very different contents according to the work that they are created to assist. The number and type of users may also have an effect. We will briefly discuss four general possibilities:

1. Private one-user files
   These are created to be used by one operator or one body. They often hold data for just one job.

2. Private database files
   These store data for a group of related users, e.g. managers in an organisation. Several programs may well operate on the same database file(s), e.g. a student file may be used to produce student identity cards, update course and exam results, and produce mail shots.

3. Public files
   These are also called shared files. They are created so that users of a common computing service can all access each other's files, either in part or in their entirety, as specified by the producers of the files.

4. Public database files
   These are also called databanks and are databases that are open to public enquiry. They usually concentrate on a particular field such as medicine, law or finance. Often they are not a free service but charge a subscription or registration fee and/or charge for usage.

Updating Files

The files in an information system are classified by six functions:

1. Master File
   Contains permanent records that are updated by adding, deleting or editing data.

2. Transaction File
   Contains records of changes, additions and deletions made to a master file, which may be summarised before storage in the master file. A key field is selected for sorting the records in a transaction file before updating the master file.

3. Table File
   Contains a table of static data, e.g. tax rates, that is referenced by one of the other types of files.

4. Report File
   Contains information that has been prepared by the user for display or spooling to a printer, e.g. the output of the maintenance run of a Pascal program.

5. Control File
   A small file containing file handling records.

6. History File
   Backup files from past runs.

Batch Processing

In batch processing, data is stored during working hours and then copied to a secondary storage medium, such as a magnetic tape or server, during the evening or whenever the computer is idle. Batch processing usually requires the use of the computer or a peripheral device for an extended period of time. Once the batch job begins, it continues until it is done or until an error occurs.

Sequential File Updating

Processing data organised sequentially usually uses a master file and a transaction file that contains modifications (additions, changes and deletions) for the master file. The traditional method of merging the transaction file and the old master file uses the batch processing approach.
This implies that updates to the old master file 'build up' in the transaction file, where they are sorted into the same key field order as the master file for the merge run.

    Transaction File + Old Master File  ->  File update process  ->  New Master File

Indexed File Updating

Disc storage allows sequential file organisation, but it is more appropriate to use methods that can take advantage of the direct access organisation of files on discs. Indexing methods vary and can be much more complicated than sequential processing, but fortunately operating systems usually supply indexing routines. The simplest indexing method is to keep a list (usually stored in main memory) of the key field and the record number of the associated record.

The ability to access an indexed file as either a sequential or a direct access file is important to the database concept. Inverted files contain multiple indices, but if more than two key fields (the primary key and the secondary key) are used, then updating the file takes more time and effort. As records are inserted and deleted, all the indices affected must be changed. The inverted file organisation contains an inversion index for each secondary key field; the index holds all the values of the key field and a pointer to the records in the main file that contain those key values.

Large Index Problems

Partial Index Structure

If the size of the index file must remain small, then a partial index, with one entry per file segment rather than one entry per record, may be used.

    [Figure: A file with a partial index. The index has two columns, Key and Segment, with entries such as B -> 1 and D -> 2, continuing up to segment 13. The file is divided into segments holding records A and B, C and D, E and F, … Y and Z.]

If file growth is anticipated, then only a fraction of each track on the disk is initially filled, so that room is left to add new records.
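A partial index lookup can be sketched in Python as follows. The layout is an assumption based on the figure: 26 records keyed A to Z, held two per segment, with the index storing only the highest key in each segment. The name `lookup` and the record contents are invented for the example. The index is searched first to pick the segment, and only that segment is then scanned sequentially.

```python
from bisect import bisect_left
from string import ascii_uppercase

SEGMENT_SIZE = 2  # assumed number of records per segment

# Records A..Z in key order, split into segments of two records each.
records = [(letter, f"record {letter}") for letter in ascii_uppercase]
segments = [records[i:i + SEGMENT_SIZE]
            for i in range(0, len(records), SEGMENT_SIZE)]

# Partial index: one entry per segment, holding that segment's highest key.
partial_index = [segment[-1][0] for segment in segments]

def lookup(key):
    """Return (segment_number, data) for a key, or None if it is not on file."""
    position = bisect_left(partial_index, key)  # first segment whose top key >= key
    if position == len(segments):
        return None
    for record_key, data in segments[position]:  # short scan inside one segment
        if record_key == key:
            return position + 1, data            # segments are numbered from 1
    return None

print(len(partial_index), partial_index[:3])
print(lookup("C"))
print(lookup("Z"))
```

The index holds 13 entries instead of 26, at the price of a short sequential scan inside the chosen segment.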
The file design includes consideration of the load factor of each track or surface of the disk.

    Load factor = (number of key values to be stored) / (number of file positions)

Even a small load factor may eventually lead to a segment outgrowing its track, requiring that the segment be split into smaller units or that the overflow records be stored elsewhere on the disk.

Index Sequential Access Method

A popular indexed file process used on personal computers is the Indexed Sequential Access Method (ISAM). This "cylinder and surface" indexing method is based upon the physical characteristics of magnetic disk storage. Its implementation in Pascal uses relative files for rapid access to individual records; however, only one index, for the address of the surface, is used instead of two indices for both the cylinder and surface addresses. The surface addresses are stored in the index file in terms of relative record numbers.

Role of the Operating System in Input and Output of Files

Third-generation procedural programming languages such as Pascal and C have file organisation statements for the reading and writing of records by the operating system. Before you can access a file, you must open it via a system routine that allocates some memory for the file control block and record buffers. Programming languages have statements that are effectively calls to this system routine. C is slightly different, since it has no specific I/O statements; instead it has a standard set of library routines for handling I/O.

After a program has finished with a file, it must tell the operating system so by using a close statement. This is important so that the operating system will flush any data that has been buffered in memory but not yet physically written to the file, and so that it can release the memory set aside for the File Control Block (FCB) and the buffers.
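The open, write and close sequence looks much the same in most languages. The short sketch below uses Python rather than the Pascal or C mentioned above, and the file name `students.txt` and its contents are made up for the example. Python's `open` asks the operating system for the file and sets up a buffer, and `close` flushes that buffer and releases the file.

```python
import os
import tempfile

# A scratch directory so the example does not touch any real files.
path = os.path.join(tempfile.mkdtemp(), "students.txt")

f = open(path, "w")        # open: the system sets up the file and a buffer
f.write("001,George\n")    # the data may still be sitting in the buffer
f.close()                  # close: buffered data is flushed, resources released

# A 'with' block closes the file automatically, even if an error occurs.
with open(path) as f:
    contents = f.read()

print(repr(contents), f.closed)
```

Forgetting the close step risks losing whatever is still held in the buffer, which is why many languages offer a construct that closes the file automatically.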