1 File Processing -- Chapter 14

2 What You Will Learn
• Create files
• Write files
• Read files
• Update files

3 Introduction
• Storage of data in variables is temporary
– when the program is done running, or the computer is turned off, the data is gone!
• Data is stored permanently (more or less) on secondary storage devices
– magnetic disks
– optical disks
– tapes
• We consider the creation & use of files of data

4 The Data Hierarchy
• Lowest level of data storage is in binary digits -- 0s and 1s -- bits
• Bits grouped together into bytes
• Bytes grouped into characters
• Characters and/or bytes grouped together to form fields
• Fields grouped together to form records
• Records grouped to form files

5 Storage of Sequential Files
• Any type of serial access storage device
– originally cards
– magnetic tape
– even punched paper tape!
• Now would be placed on direct access devices (disks)

6 Sequential Devices -- Some History
• Tape Drives

7 Data Stored on Tape
• Interblock gap exists for the purpose of acceleration and deceleration of the tape past the read/write head
• Result is a large percentage of tape length wasted on the gaps
8 Data Stored on Tape
• Records blocked in groups to save the space wasted by the required gaps

9 Characteristics of Sequential Systems
• Two types of files
– Master files -- files with relatively permanent data
– Transaction files -- files of transitory records
• Updating master files
– use transaction files
– combine (merge) old master with transaction file
– create new master file
– Old Master and Transaction file are the backup for the New Master

10 Creating a Sequential File
• #include <fstream.h>
– this is the class which enables file I/O
• Specify an object of type ofstream
• Use the open command
#include <fstream.h>
…
ofstream outfile;
outfile.open ("sales.txt");

11 Creating a Sequential File
• The open command creates a file on the disk with the name specified in the parameter
• If the file did not exist on the disk, it is created
• If the file existed before, it is overwritten
– If you wish to prevent this, other measures must be taken

12 Creating a Sequential File
• The open command establishes a "line of communication" from the running program to the operating system and the disk drive
• Also possible to specify
ofstream outfile ("sales.txt", ios::app);
– for appending to the end of an existing file

13 File Open Modes
ios::app       Write all output to end of file
ios::ate       Open for output at end, can also write anywhere in file
ios::in        Open file for input
ios::out       Open file for output
ios::trunc     Discard file's contents if it exists
ios::nocreate  If file does NOT exist, operation fails
ios::noreplace If file DOES exist, operation fails

14 Creating a Sequential File
• After creating an ofstream object and attempting to open it, the program can test for a successful open
– if (!outfile) …
• When output to the file is complete, make sure to close the file
outfile.close( );
– This flushes the file buffer, makes sure all data gets physically to the disk

15 Reading Data from a Sequential Access File
• Create and open the file
ifstream infile ("sales.txt", ios::in);
– Note: ifstream is for input; ios::in is the default
• Possible to place the input command in the while loop condition
while (infile >> item_num >> qty) …
– when no more data exists for extraction, >> returns false and the loop ends
– See Fig 14.7 on text CD ROM

16 Reading Data from a Sequential Access File
• Retrieving sequential data => the program starts reading from the beginning of the file
– the program keeps track of where it is in the file with a file position "pointer"
– the file position pointer is the byte number of the next byte in the file (stream) to be read or written
• Possible to manipulate the file position pointer
– infile.seekg(n) or outfile.seekp(n)

17 Reading Data from a Sequential Access File
• Reposition the file position pointer to the beginning of the file
outfile.seekp(0); // seekp for write only
infile.seekg(0);  // seekg for read only
• Argument for seekp, seekg is a long integer
– specifies the number of bytes to move the file position pointer
– default is bytes from the beginning of the file

18 Reading Data from a Sequential Access File
• Second possible argument specifies the direction of the move of the file position pointer
– ios::cur => bytes forward from the current position
– ios::end => bytes back from the end of the file
• What would this do?
outfile.seekp(0, ios::end);
– Places the file position pointer at the end of the file (0 bytes back from the end)

19 Reading Data from a Sequential Access File
• Possible to request where you are in the file -- the location of the file position pointer
fpp = infile.tellg( );
– similar .tellp( ) for an output file
• View example in Figure 14.8, pg 715 on text CD audio

20 Updating Sequential Files
• Cannot be done with text files
– names, string fields are variable length
– would overwrite surrounding data if we changed the contents of the text file
• Consider the sequence
– read a line of data (numbers, names)
– edit the data, change numbers, names
– reposition the file position pointer
– rewrite new data …. Problem???
• Does NOT use insert mode like a text editor!
21 Files of Records
• Must use files containing structures (structs)
– each record is exactly the same size
• Now we use the different .read and .write commands
– these do unformatted I/O
– we are reading bytes straight into the record structure
• Enables us to reposition the file position pointer and write over an old record precisely

22 Files of Records
• Cast the address of sales_rec to a pointer to char
infile.read ((char*)&sales_rec, sizeof(sales_rec));
or
outfile.write((char*)&sales_rec, sizeof(sales_rec));
• Specify the number of bytes to be transferred between the file and the address specified by the pointer

23 Files of Records
• Text uses a different version of the cast …
.write(reinterpret_cast<const char*>(&sales_rec), sizeof(sales_rec));
• Note Figure 14.11, pg 719 … listen to text CD audio description

24 Updating Master Files
Old Master File + Transaction File => Update Program => New Master + Exception Report

25 Updating Master Files
• Transaction file contains
– records to be added
– modifications to existing records on the old master
– orders for deletions of records on the old master
• Possible errors
– Add => record for specified ID already exists
– Modify => record does not exist
– Delete => record does not exist

26 Program Flow Chart
• Open files, read 1st records, then compare keys
• Trans key > OM key
– write OM record to NM file; Trans key is yet to be matched
• Trans key == OM key => check type of Trans
– Modify => make changes
– Delete => go on (do not copy the record)
– Other (Add) => error
• Trans key < OM key => check type of Trans
– Add => OK
– Other => error

27-42 Sequential Update (worked example, condensed from the slide animation)
Transaction File: 10 M, 20 M, 25 I, 30 D, 35 D, 40 A, 50 M, 55 M, 80 M
Old Master:       10 20 30 40 50 60 70 80 90 100
Processing record by record (* = modified):
New Master:  10* 20* 25 40 50* 60 70 80* 90 100
Error Rpt:   ?35 D  (no record 35 to delete)
             ?40 A  (record 40 already exists)
             ?55 M  (no record 55 to modify)

43 Editing Transaction Data
• Plan on the existence of bad data
• Common edit checks
– reasonableness check (type, # of digits)
– range check
– value check (small set of possible values)
– check digits, self-checking numbers
– proper case

44 Writing Data Randomly to a Random Access File
• Done in the context of
– finding a record on the file,
– reading it,
– manipulating it (make changes, update),
– then writing it back to (over) the original position in the file
• Program must make sure to cover all these steps

45 Writing Data Randomly to a Random Access File
• Ask the user what record is desired
– remember they are numbered starting with 0
• Do a seekg to that record
sales_file.seekg( sizeof (sales_rec) * rec_num );
• Read the record, then alter, update
sales_file.read( … )
• Do a seekp -- Why??? Back to the original location, because the read advanced the file position pointer
sales_file.seekp( … )
• Do a .write back to the file
sales_file.write( … )
• Note fig 14.12, pg 721

46 Reading Data Sequentially from a Random Access File
• Open the file
• Prime the pump with an initial read
• Use
while (file_var) {
   process
   try next read
}
• If reading only, no need to close the file
• If doing any writing to the file, should close
• See Fig.
14.14 on the text CD ROM

47 Input/Output of Objects
• Suppose we create classes and wish to save objects of those classes in a file
• Actually only the data items of a class get written to or read from a file
• The function members of the class are available only to the running program
• Typically you would let a structure be the private data
• See Fig 14.15 on the text CD ROM

48 Input/Output of Objects
• Consider having a file containing objects of different types
– must know what kind of object is coming so as to read it into the correct structure
• Possible to write a flag immediately prior to each record, specifying what type of record (structure) it is
• Then read the flag first and use a switch statement to determine which read to perform

49 Procedures for Data Processing
• Instructions for operations personnel
• Prevent errors in the update of sequential files
• Examples
– multiphased updates
– hardcopy output reviewed and authorized by a supervisor
– transaction counts
– batch totals
– backup and recovery

50 Direct Access Devices
• Records can be read/written in any order
– without referencing preceding records
• More like a CD or LP record
– contrast sequential access, like a video or audio cassette

51 Need for Direct Access Files
• Banking transactions
• Credit card authorization
• Inventory check in a parts store
• Basically ==> whenever batching of records is impractical or infeasible

52 On Line Systems
• When the user is in direct communication with the CPU
• Many direct access systems are on line, but not all!
• "Direct Access" and "on line" are NOT synonyms

53 Disk Components
• Cylinders
• Tracks
• Sectors

54 Disk Components
• Read/Write heads move in and out
• Possible to have multiple fixed heads
• Need a head for each track surface
• Same amount of data on inside as on outside tracks

55 Disk Components
• Track
• Sector
• Inner sectors have the same amount of data as outer sectors (why?)
• Outer sectors are less dense (bits per inch)

56 Timing Components
• Access motion time (seek time)
– motion of the read/write heads between tracks
– this is the most lengthy of the three times
• Rotational delay
• Data transfer time

57 Hard Drive Head Tolerance
• Head rides on a cushion of air

58 Accessing Direct Access Files
• Mapping relationship between a key field and an address on disk
• Function R(key) ==> address on disk
• Part of the mapping process is done by the application
• Part is done by the Operating System

59 Absolute Addressing
• Key value is the absolute physical address on the disk
– (cylinder, track, sector, offset)
• Advantages -- simple, fast
• Disadvantages
– requires low level knowledge of the physical device
– typical key values not appropriate
– device dependent
– file reorganization requires key change

60 Absolute Addressing
• Very low level
• Used on mainframes back in the 70's
– IBM 1130
• Very hard to do on anything we use today
• Operating System will rarely let us get at specific disk locations

61 Relative Addressing
• Key value is the relative position of the record in the file
• Operating system takes care of locating this offset into the file
• Advantages
– simple, the record number gets the program there
– device independent
• Disadvantages
– address/space dependent (when you reorganize the file, the key values will change)
– typical keys are not entirely appropriate (a record number) -- but better than a physical address

62 Relative Addressing
• R(rec#) = rec# (an identity function)
• Use the seekg command in C++
• Example
infile.seekg (count * sizeof (tenants)); // count is the record number desired
• File "pointer" is positioned ready to read the record desired

63 Directory Lookup
• Keep a directory or table of ordered pairs
– (key value, rec#)
• Options
– keep in sorted order for binary search
– random order -- sequential search
– tree structure for quick search

64 Directory Lookup
• Problems: when the table gets quite large, it is too big for an in-memory data structure
– must be a file
– slower to search through the file
• When you reorganize the file, the directory is rebuilt

65 Directory Lookup
• Advantages
– appropriate key values
– device independent
– address space independent (key values not changed when the file is reorganized, the directory takes care of it)
– relatively fast if in memory (can have a cache of part of the directory in memory)

66 Address Calculation Techniques
• Most often used in a relative addressing file system
• Perform a calculation on the key value
• Result is a relative address
Key Value => Hash Function => Relative Addr

67 Address Calculation
• R(key) = "predictively random" address
• Example, in a file of slots 0 through 10: R(abc) = 3, R(xyz) = 7, R(pdq) = 1

68 Address Calculation
• Other names
– scatter storage techniques
– randomizing techniques
– key-to-address transformation
– hashing

69 Hashing Problems
• Collisions
• Consider K1 != K2 but R(K1) == R(K2)
• K1 and K2 are called synonyms
• We must eventually solve this problem

70 Hashing
• Advantages
– natural key values
– device independent (move a file from floppy to hard drive, no reorganization needed)
– address space independent (key values not changed when we reorganize the file)
– directory search not needed

71 Hashing
• Disadvantages
– file reorganization -- must change the hashing function
– cost of processing time (although relatively minimal)
– processing of collision resolution can be significant (will involve I/O)
– requires wasted file space (you must set aside space for more records in the file than are actually used)

72 Performance Considerations of the Hashing Function
• Distribution of the key values actually in use
– all hash to one area?
– ideal is a uniform distribution
• Number of live records in the file compared to the size of the file
• Measurement of the "live fullness" of the file ==> the Load Factor

73 Hashing Performance
• Load factor > 0.7 or 0.8 causes the probability of collision to get high and slows down the algorithm
• Must compute the size of file needed for the maximum number of records anticipated
Load factor = (# of live records in file) / (max # of records in file)

74 Types of Hashing Functions
• Division remainder
Address = key % divisor
• Divisor places a limitation on the size of the file
• Choose the divisor large enough
• Nice to choose the divisor to be a prime (or at least have large prime divisors)

75 Types of Hashing Functions
• Mid-Square Hashing
– square the key
– specified digits are extracted
– Example: 3428² = 11751184, extract the middle digits => key hashes to 511
– number of digits extracted determines the file size required: N digits => 10^N slots

76 Types of Hashing Functions
• Hashing by folding
– key value partitioned into a number of parts (each with the same # of digits)
– each partition is folded and summed with the others
– truncated result is the relative address
• Example: key = 0001 2345 6789
Fold and add: 1000 + 2345 + 9876 = 13221 ==> rel addr = 3221

77 Collision Resolution -- Linear Probing
• Adding new records
• Hash to the record, check if occupied
– if occupied, go "next door"
• Keep going next door until you find an empty record
• If you find a record with the same key, then this is an illegal add

78 Linear Probing -- Fetching Records
• Going back to fetch a record
• Seek to the record specified by the hash function
• If the keys do not match, this must be a collision
• Keep looking "next door" until the keys match
• Alternatively, if you find a blank record, then you are searching for a record that does not exist

79 Example
• Assume the divisor is 13 (slots 0 through 12)
• Hash the records with the following keys:
124, 315, 87, 302, 137, 94, 188, 328
• Now go back and find some of the same keys. How is linear probing used now?
80 Linear Probing -- Deleting a Record
• What happens when one of a sequence of "next door" records is deleted?
• If we blank the record (set fields to zeros and null strings)
– we try to fetch a record, linear probing, and hit this blank record
– a blank record implies the record for which we search does not exist
– but it might be later on in the sequence of linear probes

81 Linear Probing -- Deleting a Record
• Solution: do NOT blank the record
• Set one of the fields to a sentinel value
strcpy (name, "DELETED");
• Records so marked can be used again (the next time we are adding a record)

82 Collision Resolution -- Double Hashing
• When a collision occurs, hash again
• Use the same function with an altered key, or a secondary hash function
• Could hash to the same file or to an overflow file
• What to do when you get a "double collision"?
– may have to ultimately resort to linear probing

83 Collision Resolution -- Synonym Chaining
• Use a linear probe but have a link field
• Link field points to the next synonym record
• Saves multiple sequential file accesses through a clump of records

84 Synonym Chaining
• R(abc) = 3, R(xyz) = 4, R(pdq) = 1
• R(wow) = 3, but wow gets probed to slot 5
– make a link from 3 to 5
• Resulting table (slots 0-10): pdq in 1, abc in 3, xyz in 4, wow in 5

85 Collision Resolution -- Bucket Addressing
• Hash to groups of records called "buckets"
• A collision hashes to another record in the same bucket
• If the bucket fills up ==> "spillover"
– into the next bucket
– or into an overflow area
• Buckets can be kept in order as they arrive, or sorted