Transaction Files

- Store details of transactions during a period
- The period can be a day, a week or a month
- Transient: once processed, they can be discarded

Master Files

- "Permanent"
- Kept up to date by applying transaction files
    o Unchanging data, e.g. in a payroll file: name, address
    o Changing data, e.g. gross pay to date

Types of File Organisation

The choice of file organisation depends upon the intended use:

- what proportion of records is to be processed each time the file is updated;
- whether individual records need to be quickly accessible.

There are four main types of file organisation:

- Serial
- Sequential
- Random
- Indexed sequential

Serial

- Records are stored one after another
- Records are stored in any order (e.g. as they're entered)
- To find a record, each record must be read in turn from the start of the file
- New records are added at the end of the file

How do you think a record could be deleted from a serial file?
Copy to a new file, skipping the record to be deleted, then delete the original file and rename the new file!

Sequential

- Records are stored one after another
- Records are sorted into primary key sequence
- To find a record, each record must be read in turn from the start of the file

A sequential file is particularly suitable when all records in a file need to be processed (e.g. payroll file).

How do you think a record is added to a sequential file?
Copy to a new file until the insertion point is reached, write the record to be inserted, copy the rest of the file, then delete the original file and rename the new file!

How do you think a record is deleted from a sequential file?
The same way as for a serial file!

Random

- Also known as hash or direct files
- Records are located by disc address or relative position in the file
- An algorithm (the hash function) is used to convert the primary key to the address

An example of a hash function is the division/remainder method. The primary key is divided by the number of addresses in the file, and the remainder of the division is the address of the record.

A problem with hashing is that more than one primary key may map to the same address.
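The division/remainder method can be sketched in a few lines of Python. This is only an illustration: the function name hash_address, the file size of 100 addresses and the sample keys are assumptions made for the example, not values from these notes.

```python
def hash_address(primary_key: int, num_addresses: int) -> int:
    """Division/remainder hash: the remainder is used as the record address."""
    if num_addresses <= 0:
        raise ValueError("num_addresses must be positive")
    return primary_key % num_addresses


# Hypothetical file with 100 addresses (0 to 99) and three sample keys.
NUM_ADDRESSES = 100
sample_keys = [1234, 5678, 2234]

addresses = {key: hash_address(key, NUM_ADDRESSES) for key in sample_keys}

for key, address in addresses.items():
    print(f"key {key} -> address {address}")

# Keys 1234 and 2234 both leave remainder 34, so they share an address.
same_address = addresses[1234] == addresses[2234]
```

Here keys 1234 and 2234 both map to address 34, which shows how two different primary keys can end up at the same address.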
A clash of this kind is called a collision (also known as aliasing or a synonym). A random file has to have a method of dealing with collisions.

A good hash function should be designed so as:

- to minimise collisions;
- to be fast to calculate;
- to generate any of the available addresses.

Indexed Sequential

- An index records the highest primary key stored in each block of records.
- Within each block, the records are stored in sequence of primary key.
- Space is provided in each block so that more records can be added in the correct sequence.
- An overflow area (usually at the end of the file) is used for records which will not fit into the correct block. A pointer to the location of the record in the overflow area is left in the block.

Because the records are indexed, but also stored sequentially, they can be accessed either randomly or sequentially. The file combines the advantages of random and sequential files.

File Roles

- Master: contains permanent data, some of which is updated regularly by transaction files.
- Transaction: contains details of changes which occur during a transaction period.
- Reference: contains data updated infrequently, often from an outside source.

Updating Files

A master file is updated by a transaction file. Each record to be updated is read into memory, updated and then written back to the file.

A sequential master file cannot be updated in place. It must be read, modified and written to a new file, as described previously for adding a record to a sequential file.

When a sequential master file is updated, the transaction file is first sorted into the same primary key sequence, so that only one pass through the files is needed. The steps are:

1. A record is read from the master file.
2. A record is read from the transaction file.
3. The primary keys are compared. If the transaction file key is greater, the current master record does not need to be updated, so it is written unchanged to the new master file and the next record is read from the master file. This step is then repeated.
4. If the keys are equal, the master record is updated from the transaction record and then written to the new master file.

The operation then repeats from step 1 until there are no further records in the master file.

As each update by a transaction file produces a new master file, the master file from before the update still exists, so there are two versions. Usually two previous versions of the master file are kept as backup, and these "generations" of master file are often known as Grandfather (oldest), Father and Son (most recent).

If the master file is indexed sequential or random, it can be updated without copying to a new master file. This is called update by overlay.

Criteria for Choice of File Organisation

The way a file is organised determines how it can be accessed.

- A sequential file can only be accessed sequentially.
- A random file is accessed randomly. It could be accessed sequentially, but this would be unusual (and slower than with a sequential file).
- An indexed sequential file can be accessed either sequentially or randomly.

The following factors should be considered when choosing file organisation:

- what response time is required;
- must information be absolutely up to date;
- can requests for information be batched and processed together;
- is information required in sequence;
- what is the most suitable storage medium;
- what happens if data is lost?

Hit rate

This is the proportion of records accessed in any one pass through a file. A high hit rate would suggest use of a sequential organisation; a low hit rate would favour random organisation.

Text Files

A text file consists of lines of alphanumeric characters. That's it! No control codes (except carriage return and line feed), nothing else!

Examples of text files include:

- Program source code
- Simple text files
- Script files
- HTML source code
- Configuration files

Non-text Files

Non-text files are often called binary files. They contain a sequence of arbitrary codes.
A binary file can contain anything (that can be represented in binary code).

Examples of non-text files include:

- Object code
- Application specific data files
- Image files

What sort of file do you think a Word ".doc" file is? (A non-text file)

File Structure

A file can be considered as a collection of records. Each record represents information about one object. E.g. in a file used to store names and addresses, each name and its associated address would be one record.

A record can be further subdivided into fields. In the case of the address book these may be Title, Forename etc.

Primary Key

Within a file, a record must have a unique identifier. This is usually one field in the record which can be guaranteed to be unique. Where none of the fields can be guaranteed unique on its own, a special field with a serial number is often added.

The unique field is called the primary key, and is required so that each record can be located or selected unambiguously.

Secondary Key

Other fields in a record can be defined as secondary keys. These are not unique, but may be used to quickly locate groups of records.

Record Types

The records in a file can be either fixed length or variable length.

Fixed Length Records

- Number of fields is the same in all records
- Length of each field is the same in all records

Advantages

- file processing is simple
- easy to estimate file size

Disadvantages

- inefficient use of storage space

Variable Length Records

- Number of fields can vary from record to record
- Length of each field can vary from record to record
- End of each field and record needs to be indicated, either by a special character or by a size at the start of each field/record

Advantages

- compact, no waste of space
- flexible, as many fields as needed

Disadvantages

- file processing is complex
- hard to estimate file size

File Size Estimation

In principle the size of a file of fixed size records can be estimated by multiplying the expected number of records by the number of bytes in each record.
In practice it is not quite so simple. Data is stored on disc or tape in blocks. Each block is of a fixed size determined by the storage device, e.g. a disc drive may store data in 512 byte blocks.

When records are written to a storage device, only a whole number of records can be put into each block of storage. Some space in each block will be used for information about the size and number of records; the rest will be wasted.

Therefore, to estimate file size, first calculate how many records fit in each block. Then use this to calculate the number of blocks which will be used on the storage device.

Example

A storage device has a block size of 512 bytes. A file of 1200 records of size 108 bytes is to be stored. Estimate the file size.

1. Calculate the number of records per block: 512 / 108 = 4.74, rounded down to 4
2. Calculate the number of blocks: 1200 / 4 = 300
3. Multiply the number of blocks by the block size: 300 * 0.5 kB = 150 kB
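The same calculation can be sketched in Python. The function name estimate_file_size and its parameter names are assumptions made for this illustration. Like the worked example, the sketch assumes that records never span two blocks and it ignores any space a block uses for its own bookkeeping.

```python
def estimate_file_size(num_records: int, record_size: int, block_size: int):
    """Return (records per block, number of blocks, total bytes on the device)."""
    if record_size > block_size:
        raise ValueError("a record must fit inside one block")
    # Only whole records fit in a block, so round down.
    records_per_block = block_size // record_size
    # A partly filled final block still occupies a whole block, so round up.
    num_blocks = -(-num_records // records_per_block)
    return records_per_block, num_blocks, num_blocks * block_size


# Figures from the worked example: 1200 records of 108 bytes, 512 byte blocks.
per_block, blocks, total_bytes = estimate_file_size(1200, 108, 512)
print(per_block, blocks, total_bytes / 1024, "kB")
```

With the example figures this gives 4 records per block, 300 blocks and 153600 bytes, which is 150 kB when 1 kB is taken as 1024 bytes. The unblocked estimate of 1200 * 108 = 129600 bytes is noticeably smaller, which shows how much space is lost at the end of each block.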