FILE PROCESSING CONCEPT

1 Outline
• Introduction
• Primary Key
• Classification of Data Files
  – By Content
  – By Mode of Processing
  – By Organization of Files: Serial, Sequential, Indexed Sequential, Random
• Transformation Method
• Q&A

2 Introduction
File processing is a computer programming term that refers to the use of computer files to store data in persistent memory (permanent storage). Variables and arrays provide only temporary storage of data. File processing is a useful alternative to a database only where the information will be accessed by a single user, where speed of data input is vital, and where the amount of data being stored is relatively small.

3 Elements of a Computer File
• A file is a collection of information stored on magnetic media, optical disks or pen drives. Data files are similar in concept. Files can be created, updated and processed.
• A file contains logical records; a record contains fields; a field contains characters.
• There are two categories of record: logical and physical. A logical record refers to each line of data in a file. A physical record is one or more logical records read into or written from main memory as a unit of information.

4 File, Record, Field, Character
[Figure: a file of student records, e.g. "Mimi, HND IS, Single, 25", "Anna, HND IS, Single, 24", "Ena, HND CP, Married, 28", "Minor, HND CP, Single, 24". The file is made up of records (REC-1 … REC-n), each record is made up of fields (Field1 … Fieldn, e.g. "Minor", "HND CP", "Single", "24"), and each field is made up of characters (Char1 … Charn).]

5 Fixed and Variable Length Records
The number of characters grouped into a field can vary from field to field in a record. There are two types of record:
• Fixed length: each record has a fixed length, e.g. 90 characters. Fields not completely filled are padded with space characters, resulting in wasted space.
• Variable length: the fields of a record vary in size according to the data contained in them. Special characters called field separators are used to indicate the start and end of each record.
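The fixed-length layout above can be sketched in Python with the standard struct module. This is a minimal illustration, not the slides' method: the field widths (a 10-character name, a 6-character course, a one-byte age) are assumptions chosen for the example, and the space padding is exactly the wasted space the slide mentions.

```python
import struct

# Hypothetical fixed-length student record: 10-byte name, 6-byte course,
# 1-byte age. Every record occupies the same number of bytes.
RECORD_FMT = "10s6sB"
RECORD_SIZE = struct.calcsize(RECORD_FMT)   # 17 bytes per record

def pack_record(name, course, age):
    # Short fields are padded with spaces to their fixed width (wasted space).
    return struct.pack(RECORD_FMT,
                       name.ljust(10).encode(),
                       course.ljust(6).encode(),
                       age)

def unpack_record(raw):
    name, course, age = struct.unpack(RECORD_FMT, raw)
    # Strip the space padding back off when reading the record.
    return name.decode().rstrip(), course.decode().rstrip(), age

rec = pack_record("Mimi", "HND IS", 25)
print(len(rec))            # 17 -- same size regardless of field contents
print(unpack_record(rec))  # ('Mimi', 'HND IS', 25)
```

Because every record is the same size, the byte offset of record n is simply n * RECORD_SIZE, which is what later makes direct (random) access possible.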
6 File Contents
• The information contained in a file is related to one specific kind of detail. Different files are used to store different types of details; different types of details are not mixed into a single file.
• Records are not usually transferred to and from main memory as single logical records, but are grouped together as a block of logical records. When read, records are stored temporarily in a buffer.
• A file normally ends with an "end of file" marker.

7 Primary Key
A file always contains a primary key (a field of the record that has a unique value) to uniquely identify a particular record. A primary key is made up of one field or a combination of two or more fields of the record. The primary key allows easier and quicker search and retrieval of a particular record by matching the search key against the primary key.

8 Classification of Data Files
The way data files are used depends on the contents, the mode of processing, and the organisation of the file.

9 Classification by Content
There are six basic categories:
1. Master File
2. Transaction File
3. Index File
4. Table File
5. Archival/History File
6. Backup File

10 Master and Transaction Files
• Master files contain permanent information of the current-status type, used for basic identification and accumulation of certain statistical data, e.g. product file, staff file, customer file.
• Transaction files contain all the data and activities included on the master file. The accumulated records are used to update the master file, e.g. invoices, purchase orders. The updating method is batch.

11 Index and Table Files
• Index files actually consist of a pair of files: one holding the data and one storing an index to that data. They are used to indicate the location of specific records in other files (usually the master file) using an index key or address.
• Table files hold static reference data used during processing, e.g. a pay-rate table for the preparation of payroll.

12 Archival/History Files
• Often termed master files, these contain non-current statistical data, used to create comparative reports, pay commission, etc.
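The primary-key idea can be sketched as follows. The customer records and field names are hypothetical, chosen only to show how a unique key allows direct retrieval by matching the search key against the primary key, rather than scanning record by record.

```python
# Hypothetical customer records; "cust_id" plays the role of the primary key.
records = [
    {"cust_id": "C001", "name": "Mimi", "status": "Single"},
    {"cust_id": "C002", "name": "Anna", "status": "Single"},
    {"cust_id": "C003", "name": "Ena",  "status": "Married"},
]

# Build a lookup from primary key to record. If two records shared a key,
# the dict would silently collapse them -- hence the uniqueness check.
by_key = {r["cust_id"]: r for r in records}
assert len(by_key) == len(records), "primary key values must be unique"

# Retrieval is a single key match, not a scan of every record.
print(by_key["C002"]["name"])   # Anna
```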
• Archival/history files are normally updated periodically and involve a large volume of data; non-current files are stored in the file library.
• Backup files are used when the current master file is destroyed.

13 Classification by Mode of Processing
• Input: data is loaded into the CPU, processed, and the output placed in another file.
• Output: data is processed and written onto another file.
• Overlay: a record is accessed, loaded into the CPU, updated, and written back to its original location (overwriting the original value).

14 File Organization
File organization is how records are stored, processed and accessed. It has three functions:
1. Storage of records.
2. Maintenance of files (updating, editing, deleting).
3. Retrieval of required items (searching).

15 Types of File Organization
There are several types of file organization:
1. Serial
2. Sequential
3. Indexed Sequential
4. Random

16 Serial Organization
• The simplest form of file organization.
• Records are not kept in any pre-determined order; they are positioned one after another, and new records are added to the bottom of the file regardless of what the existing records contain.
• This technique is normally used for storing records for further processing (e.g. sorting), and is normally applied to storage on magnetic tape.
• Accessing records is very slow.

17 Sequential Organization
• More organised than a serial file: records are kept in some pre-defined order, usually the order of the primary key, e.g. book data stored alphabetically by author.
• It is not necessary to search the whole file to establish that a record is not present.
• It is less flexible in that, if we are looking for books by authors whose names begin with N, we still need to scan from A until we come to N.

18 Limitations of Sequential Files
• Data cannot be modified in place without the risk of destroying other data in the file. E.g. if the name "Sam" needs to be changed to "Shaun", the old name cannot simply be overwritten: the new record contains more characters than the original, so the characters beyond the 'a' in "Shaun" would overwrite the beginning of the next sequential record.
• Sequential files are suitable for storage on magnetic tape.
• Sequential access is not usually used to update records in place.
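The searching behaviour of a sequential (key-ordered) file can be sketched in Python. The book/author keys are illustrative; the point is that because the records are sorted, the scan can stop as soon as it passes the position where the key would be, whereas a serial file must always be read to the end to prove a record is absent.

```python
# A minimal sketch of searching a sequential file whose records are kept
# in primary-key (author name) order.
def sequential_find(sorted_records, key):
    for rec_key, data in sorted_records:   # read records in stored order
        if rec_key == key:
            return data                    # found the matching record
        if rec_key > key:
            return None                    # passed the slot: key not present
    return None                            # reached end of file

books = [("Adams", "Book A"), ("Baker", "Book B"), ("Nolan", "Book N")]
print(sequential_find(books, "Baker"))    # Book B
print(sequential_find(books, "Carter"))   # None -- stops early at "Nolan"
```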
Instead, the entire file is usually rewritten, which requires processing every record in the file to update a single record.
NOTE: in both serial and sequential files, an individual record can only be found by reading through the file until the required key value is located.

19 Indexed Sequential Organization
• Basically a hybrid of the sequential and random file organization techniques (it uses both sequential and random access methods). Often referred to as ISAM (Indexed Sequential Access Method).
• Records are maintained in key sequence, but an index structure is built on top of the actual data.
• The index to a (large) file may be split into different index levels: an index of indexes. The master index is the highest-level index and contains pointers to the lower-level indexes.

20 Using the Index
• Locating a particular record means following the index tree from the master index down to the data block containing the target record; that block is then read and scanned for the record with the matching key.
• This organisation may be useful for auto-bank machines: customers randomly access their accounts throughout the day, and at the end of the day the bank can update the whole file sequentially.
• One drawback of this organisation is that several index tables must be stored, which makes for a considerable storage overhead.

21 Example: Single-Level Index
[Figure: locating the record whose key is 050E. The index lists the last record key held in each data block: block 2 → 048E, block 3 → 050K, block 4 → 051D. Since 050E falls between 048E and 050K, block 3 (keys 049A, 049T, 050E, 050K) is read and scanned to locate the target record.]

22 Example: Multi-Level Index
[Figure: locating record 100, whose key is 053X, through a two-level structure. The master index maps each low-level index to its last record key (e.g. low-level index 2 → 053X, 3 → 098E, 4 → 122A). Low-level index 2 maps data blocks to their last keys (block 158 → 052A, block 159 → 058E, block 160 → 063X). Since 053X falls between 052A and 058E, block 159 (keys 052C, 053X, 056J, 058E) is read to locate the target record.]

23 Random Organization
• Records are normally fixed in length and are accessed directly, without searching through the preceding records.
• Data can be inserted into a randomly accessed file without destroying other data, and previously stored data can be updated or deleted without rewriting the entire file, e.g. in airline reservation systems and banking systems.
Since every record is the same length, the computer can quickly calculate (as a function of the record key) the exact location of a record relative to the beginning of the file.

24 Block Address Calculation
• A random file uses a block-address calculation algorithm: given a record key as input, the algorithm returns the block number where the record is stored.
• The problem is how to store data efficiently, so that given a record key the storage location can be found. Keys are unlikely to run sequentially, so the file has clusters and gaps. For example, if storage is determined by the alphabetical order of the first letter of the customer name, some letters are common (e.g. A, B, D) while others are not (e.g. Q, X).
• A good algorithm is therefore needed to generate uniformly distributed, consistent addresses: a hashing algorithm.

25 Hashing Techniques
There are five major techniques for hash coding: division, truncation, extraction, folding and randomizing. All of them aim to generate a uniformly distributed set of addresses that map the keys onto the storage area as evenly as possible. The best known and most used technique is division: the primary key is divided by a positive integer, usually a prime number approximately equal to the number of available addresses, and the remainder is used as the address.

26 Simple Hash Functions
Here are some relatively simple hash functions that have been used:
• The division-remainder method: the number of items in the table is estimated, and that number is used as a divisor of each original value or key to extract a quotient and a remainder. The remainder is the hashed value. (Since this method is liable to produce a number of collisions, any search mechanism must be able to recognise a collision and offer an alternate search mechanism.)
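The division-remainder method above can be sketched in a few lines. The table size of 97 is an assumed prime chosen for illustration, standing in for "approximately the number of available addresses"; the collision shown is exactly why an alternate search mechanism is needed.

```python
# Division-remainder hashing: address = key mod table_size.
TABLE_SIZE = 97   # assumed prime, roughly the number of available addresses

def division_hash(key):
    # The remainder after dividing by the table size is the address.
    return key % TABLE_SIZE

print(division_hash(1234))               # 70, since 1234 = 12 * 97 + 70
# Distinct keys can map to the same address -- a collision:
print(division_hash(1234 + TABLE_SIZE))  # 70 again
```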
• Folding: this method divides the original value (digits, in this case) into several parts, adds the parts together, and then uses the last four digits (or some other arbitrary number of digits that will work) as the hashed value or key.

27 More Simple Hash Functions
• Radix transformation: where the value or key is digital, the number base (or radix) can be changed, resulting in a different sequence of digits. (For example, a decimal-numbered key could be transformed into a hexadecimal-numbered key.) High-order digits can be discarded to fit a hash value of uniform length.
• Digit rearrangement: simply take part of the original value or key, such as the digits in positions 3 through 6, reverse their order, and then use that sequence of digits as the hash value or key.
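The folding and digit-rearrangement methods above can be sketched as follows. The nine-digit key, the three-digit part size, and the choice of positions 3 through 6 are all assumptions made for the example, matching the arbitrary choices the descriptions allow.

```python
# Folding: split the key's digits into parts, add the parts, and keep
# the last few digits of the sum as the hashed value.
def folding_hash(key, part_len=3, keep=4):
    digits = str(key)
    parts = [int(digits[i:i + part_len])
             for i in range(0, len(digits), part_len)]
    return int(str(sum(parts))[-keep:])

# Digit rearrangement: take the digits in positions 3-6 and reverse them.
def digit_rearrangement_hash(key):
    digits = str(key)
    return int(digits[2:6][::-1])   # positions 3-6, reversed

print(folding_hash(123456789))             # 123 + 456 + 789 = 1368
print(digit_rearrangement_hash(123456789)) # "3456" reversed -> 6543
```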