Index-Sequential File Organisation

File Handling
Year One Information Processing
FILE HANDLING
TABLE OF CONTENTS
FILE HANDLING
LOGICAL ORGANISATION
SERIAL ACCESS
FEATURES OF SERIAL FILES
SEQUENTIAL ACCESS
FEATURES OF SEQUENTIAL FILES
RANDOM OR DIRECT ACCESS
Hash Coding
ADVANTAGES OF HASH CODING
DISADVANTAGES OF HASH CODING
INDEXED SEQUENTIAL
FULLY INDEXED FILES
INDEXED SEQUENTIAL FILES
ADVANTAGES OF INDEXED SEQUENTIAL FILES
DISADVANTAGES OF INDEXED SEQUENTIAL FILES
PHYSICAL FILE ORGANISATION
SERIAL FILE ORGANISATION
SEQUENTIAL FILE ORGANISATION
RANDOM OR DIRECT ACCESS FILE ORGANISATION
INDEX-SEQUENTIAL FILE ORGANISATION
CRITERIA FOR SELECTING FILE ORGANISATION
UPDATING FILES
Batch Processing
Sequential File Updating
Indexed File Updating
Large Index Problems
Index Sequential Access Method
ROLE OF THE OPERATING SYSTEM IN INPUT AND OUTPUT OF FILES
File Handling
On completing this handout, you will have learned about:

The logical and physical organisation of files.

Serial and sequential file handling methods.

Direct and index sequential files.

Creating, reading, writing and deleting records from a variety of file structures.

Creating code to carry out the above operations.
Logical Organisation
A file is logically organised as follows:
[Figure: a file is made up of records, bracketed by a header and a trailer; each record is made up of data items (or fields).]
A record is a collection of data that belongs together, e.g. all the data about an individual
person. A data item is an individual field of a record and usually contains one piece of data,
e.g. a date, first name, age. These fields are collected together to form records and records
are collected together to form a file. Therefore, a file is made up of records containing fields.
Once a program has processed data it can be permanently stored, allowing later retrieval.
This data is stored on magnetic tape (a sequential device), or on magnetic disk (floppy or hard
disk) or optical media (CD-ROM, CD-R, CD-RW, DVD etc.), which are direct access devices.
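The file, record and field hierarchy described above can be sketched in code. This is only an illustration: the record type StudentRecord and its three fields are hypothetical names chosen for the example, not part of any particular file system.

```python
from dataclasses import dataclass, fields


@dataclass
class StudentRecord:
    # Each attribute is one data item (field) of the record.
    # The field names are assumptions made for this sketch.
    student_id: str
    first_name: str
    age: int


# A file is a collection of records; a list stands in for the file here.
student_file = [
    StudentRecord("001", "Ann", 19),
    StudentRecord("002", "Ben", 20),
]

# The names of the fields that make up every record in the file.
field_names = [f.name for f in fields(StudentRecord)]
```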
Characteristics of a Sequential Device          Characteristics of a Random or Direct Access Device
Slow.                                           Fast.
Inexpensive.                                    Expensive.
Access time is dependent on current position.   Have an almost constant access time.
Serial Access
When this type of file organisation is used each record is stored, one after the other, with no
regard to any logical order. It is the simplest form of file organisation. This type of
technique is normally used for storing records for further processing.
001  003  006  004  002  005
Features of Serial Files
1. Easy to implement on magnetic tape.
2. Generally slow access.
3. Usually used for further processing (e.g. sorting) of records.
4. Are used mainly as temporary files to store transaction data.
5. This type of organisation is suitable for batch search servicing i.e. we can group
together a number of requests and process them as a group.
6. It is not a suitable file organisation for on-line access because it is too slow.
7. Not suitable as a master file as the whole file has to be searched for a particular
record, starting at the beginning.
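The behaviour described in the list above can be sketched as follows. This is a minimal illustration, assuming a Python list stands in for the serial file and each record is a small dictionary with a hypothetical "key" entry; it shows why a search must start at the beginning and may have to read the whole file.

```python
def serial_search(records, wanted_key):
    """Scan from the start; return (record, number of records read)."""
    for reads, record in enumerate(records, start=1):
        if record["key"] == wanted_key:
            return record, reads
    # Not found: every record in the file had to be read.
    return None, len(records)


# Records are simply appended in arrival order, with no regard to key order.
serial_file = []
for key in ["001", "003", "006", "004", "002", "005"]:
    serial_file.append({"key": key})

found, reads = serial_search(serial_file, "002")
missing, reads_missing = serial_search(serial_file, "007")
```

Record 002 happens to be the fifth record stored, so five reads are needed to find it, and a search for a key that is not present reads all six records.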
Sequential Access
Just as in serial organisation, records are stored one after the other but are sorted using a key
sequence. This is less flexible and more organised than a serial file. In this method, records
are kept in some pre-defined order e.g. names stored alphabetically, or records stored
numerically. Retrieval is achieved by scanning the entries in the same order e.g. 001, 002,
003, 004, 005 etc., so if we want record number 200 then records 001 to 199 have to be
scanned first.
001  002  003  004  005  006
The principal advantage is rapid access to sets of records e.g. if the nth record has just been
accessed the (n+1)th record can be accessed very quickly.
Hence, in a sequential access file you can read and write information sequentially, starting
from the beginning of the file.
Features of Sequential files
1. Records stored in pre-defined order.
2. Sequential access to successive records.
3. Suited to magnetic tape.
4. To maintain the sequential order updating becomes a more complicated and difficult task.
Records will usually need to be moved by one place in order to add (slot in) a record in
the proper sequential order. Deleting records will usually require that records be shifted
back one place to avoid gaps in the sequence.
5. Very useful for transaction processing where the hit rate is very high e.g. payroll
systems, when the whole file is processed as this is quick and efficient.
6. Access times are still too slow (no better on average than serial) to be useful in on-line
applications.
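Point 4 above, the cost of keeping the file in key order, can be sketched as follows. This is an illustration only: a Python list of integer keys stands in for the sequential file, and the function names are hypothetical. Each function reports how many existing records had to move.

```python
import bisect


def insert_record(keys, new_key):
    """Slot new_key into key order; return how many records moved up."""
    position = bisect.bisect_left(keys, new_key)
    shifted = len(keys) - position
    keys.insert(position, new_key)
    return shifted


def delete_record(keys, old_key):
    """Remove old_key; return how many records moved back to close the gap."""
    position = bisect.bisect_left(keys, old_key)
    if position == len(keys) or keys[position] != old_key:
        raise KeyError(old_key)
    del keys[position]
    return len(keys) - position


sequential_file = [1, 2, 4, 5, 6]
shifted_on_insert = insert_record(sequential_file, 3)   # 4, 5 and 6 move up
shifted_on_delete = delete_record(sequential_file, 2)   # 3, 4, 5 and 6 move back
```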
Random or Direct Access
001  002  003  004  005  006
Records are accessed directly, allowing records to be read in any order. For example, to read
record 005 you just jump directly to it.
Hence, in a random access data file you can read or write information anywhere in the file.
This implies that the medium being used allows a jump to any point in the file. In practice,
this requires some form of disk storage.
Direct addressing is not favoured if the demand is primarily for sequential processing.
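A minimal sketch of jumping directly to a record is given below. It assumes fixed-length records (a 3-byte key and a 12-byte surname, sizes chosen only for this example) so that the position of record n can be calculated, and it uses an in-memory io.BytesIO object as a stand-in for a disk file.

```python
import io
import struct

# Assumed layout for this sketch: 3-byte key followed by a 12-byte surname.
RECORD_FORMAT = "<3s12s"
RECORD_SIZE = struct.calcsize(RECORD_FORMAT)  # 15 bytes per record


def write_record(f, number, key, surname):
    # Jump straight to the slot for this record number (numbered from 1).
    f.seek((number - 1) * RECORD_SIZE)
    f.write(struct.pack(RECORD_FORMAT, key.encode(), surname.encode()))


def read_record(f, number):
    f.seek((number - 1) * RECORD_SIZE)
    key, surname = struct.unpack(RECORD_FORMAT, f.read(RECORD_SIZE))
    # struct pads short strings with zero bytes; strip them on the way out.
    return key.decode(), surname.rstrip(b"\0").decode()


data_file = io.BytesIO()
surnames = ["George", "Hugh", "Adams", "Murray", "Sinclair", "Patterson"]
for n, surname in enumerate(surnames, start=1):
    write_record(data_file, n, f"{n:03d}", surname)

# Read record 005 without touching records 001 to 004.
record_5 = read_record(data_file, 5)
```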
Hash Coding
The purpose of hash coding is to enable direct retrieval of desired records without the need to
search files or indices.
The hashing algorithm is first applied to one of the keys of the record (e.g. driving licence
number, student number or National Insurance number), which converts the key to an
address, by mathematical or logical calculations. Direct addressing is used when records
have to be searched frequently in an unpredictable fashion. For example, the sale of spare
parts in a garage or the sale of goods in shops where details about individual items have to be
made available simultaneously in a random fashion at many points (checkout lanes in a
supermarket).
There are many techniques for hash coding. One method is to divide the primary key by a
prime number and use the remainder as the address.
For instance, suppose we have a fairly large list of students whose names are associated with
other details. We would like to use student number as the primary key. In this case we
would divide the student number by a prime number, say 97, and use the remainder as the
storage location in the file. (For example, let's take a Student No, 1069, and divide it by 97.
We get a remainder of 2, which is the location of that student record.) The remainder will be
between 0 and 96. This gives us 97 potential locations for records.
Once records are stored in this fashion, retrieval simply involves supplying a student number,
which will be used by the hashing algorithm to locate the desired student record.
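The division-remainder method described above can be sketched as follows. The function names and the in-memory table are assumptions made for this example; collisions are deliberately ignored here.

```python
PRIME = 97  # the divisor used in the example above


def hash_address(student_number):
    """Division-remainder hashing: the remainder is the storage location."""
    return student_number % PRIME


# A table with one slot per possible remainder (0 to 96).
table = [None] * PRIME


def store(student_number, details):
    table[hash_address(student_number)] = (student_number, details)


def retrieve(student_number):
    slot = table[hash_address(student_number)]
    if slot is not None and slot[0] == student_number:
        return slot[1]
    return None


store(1069, "details for student 1069")
address = hash_address(1069)
```

Retrieval recomputes the same remainder from the student number, so no search of the file or of an index is needed.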
Advantages of Hash Coding
1. Rapid access to records in a direct fashion. It doesn't make use of large index tables and
dictionaries and therefore response times are very fast.
Disadvantages of Hash Coding
1. Collision requires the creation of overflow area. Two keys can sometimes calculate to the
same address.
2. In the above example if there is a student number 3300, division by 97 will produce a
remainder 2. However, we may already have a record (say 1069) in storage location 2.
In this case, the extra record will have to be kept in an overflow area.
3. So if hashing produces the same location for more than one record, response time may
increase because of the necessity to search the overflow area when the key in the hash
address does not match the key we are looking for.
4. Sometimes storage space can be wasted if there are not enough records to occupy the
reserved spaces. For example, if we are using 97 as the prime divisor there should be close
to 97 records to go into these predetermined locations. If we choose to divide by 9713
there should be around 9713 records to optimise the use of storage space.
5. The table of locations almost always reveals that records are not kept in sequential order
by the key. Indeed the records are kept in a pseudo random fashion. Therefore sequential
processing of such a file can raise awkward problems. Suppose we wish to produce a
sequential list of student numbers; then, for efficiency, we have to keep a separate
sequentially sorted copy. Hash coding is therefore not used in applications that involve
frequent sequential processing of records. A more suitable technique would be to use an
indexed sequential file organisation.
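The collision between student numbers 1069 and 3300 can be sketched as follows. This is one simple way of handling an overflow area, assuming a single unsorted overflow list that is searched serially and at most one record per key; the class and method names are hypothetical.

```python
PRIME = 97


class HashedFile:
    def __init__(self):
        self.home = [None] * PRIME   # one home location per remainder
        self.overflow = []           # records whose home location was taken

    def store(self, key, details):
        address = key % PRIME
        if self.home[address] is None:
            self.home[address] = (key, details)
        else:
            # Collision: the home location is occupied by another record.
            self.overflow.append((key, details))

    def retrieve(self, key):
        """Return (details, number of locations examined)."""
        probes = 1
        slot = self.home[key % PRIME]
        if slot is not None and slot[0] == key:
            return slot[1], probes
        # Key at the hash address does not match: search the overflow area.
        for stored_key, details in self.overflow:
            probes += 1
            if stored_key == key:
                return details, probes
        return None, probes


students = HashedFile()
students.store(1069, "first")
students.store(3300, "second")   # 3300 % 97 is also 2
```

Record 1069 is found in one look, while record 3300 needs a second look in the overflow area, which is the increase in response time described in point 3.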
Indexed Sequential
This system organises the file into sequential order, usually based on a key field, similar in
principle to the sequential access file. However, it is also possible to directly access records
by using a separate index file. An indexed file system consists of a pair of files: one holding
the data and one storing an index to that data. The index file will store the addresses of the
records stored on the main file.
There may be more than one index created for a data file e.g. a library may have its books
stored on computer with indices on author, subject and class mark.
There are two types of indexed files:

Fully Indexed

Indexed Sequential
Fully Indexed Files
An index to a fully indexed file will contain an entry for every single record stored on the
main file. The records will be indexed on some key e.g. student number. Very large files
will have correspondingly large indices.
The index to a (large) file may be split into different index levels.
When records are added to such a file, the index (or indices) must also be updated to include
their relative position and change the relative position of any other records involved.
Indexed Sequential Files
This is basically a mixture of sequential and indexed file organisation techniques. Records
are held in sequential order and can be accessed randomly through an index. Thus, these files
share the merits of both systems enabling sequential or direct access to the data.
The index to these files operates by storing the highest record key in given cylinders and
tracks.
Note how this organisation gives the index a tree structure.
Obviously this type of file organisation will require a direct access device, such as a hard
disk.
Indexed sequential file organisation is very useful where records are often retrieved randomly
and are also processed in (sequential) key order. Banks may use this organisation for their
auto-bank machines i.e. customers randomly access their accounts throughout the day and at
the end of the day the banks can update the whole file sequentially.
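The idea of an index that stores the highest record key in each unit of storage can be sketched as follows. This is a simplified, single-level illustration: small Python lists called blocks stand in for tracks, and the keys are made up for the example.

```python
import bisect

# Records are held in key order, grouped into blocks (stand-ins for tracks).
blocks = [
    [1, 2, 3],
    [4, 5, 6],
    [7, 8, 9],
]

# The index holds only the highest key in each block.
index = [block[-1] for block in blocks]


def direct_read(key):
    """Use the index to pick one block, then look only inside that block."""
    block_number = bisect.bisect_left(index, key)
    if block_number == len(blocks):
        return None  # key is higher than every key in the file
    if key in blocks[block_number]:
        return block_number, key
    return None


def sequential_read():
    """Process the whole file in key order, ignoring the index."""
    for block in blocks:
        yield from block
```

The same file supports both styles of access: direct_read(5) goes straight to the second block, while sequential_read() walks every record in key order, as in the end-of-day bank update.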
Advantages of Indexed Sequential Files
1. Allows records to be accessed directly or sequentially.
2. Direct access ability provides vastly superior (average) access times.
Disadvantages of Indexed Sequential Files
1. The fact that several tables must be stored for the index makes for a considerable
storage overhead.
2. As the items are stored in a sequential fashion this adds complexity to the
addition/deletion of records. Because frequent updating can be very inefficient,
especially for large files, batch updates are often performed.
Physical File Organisation
There are various ways in which a file is physically stored on a tape or disk. The information
is initially mapped onto the physical blocks, and eventually onto the tracks and sectors of a
disk.
At Hope, we keep records of students and each student has a unique identification number
that is used as a primary key field, e.g. 10052329.
For further illustration purposes we will assume that Hope only has 999 students, catering for
a range of IDs from 001 to 999. The following file will be used to demonstrate the
different file organisations:
Student_ID_Number    Student_Surname
001                  George
002                  Hugh
003                  Adams
004                  Murray
005                  Sinclair
006                  Patterson
…                    …
…                    …
…                    …
999                  Cookson
Serial File Organisation
In order to access data within a serial file a pointer is used.
Sequential File Organisation
Our student files can be sorted by ID and stored on magnetic tape.
001  002  003  004  005  006
In order to access record 005, ‘Sinclair’, the R/W (read/write) head, which is positioned at the
beginning of the file, would need to read records 001 through to 004 first. If we held 999
student records on file, accessing the last record would take a long time.
Sequential file organisation can also be implemented on a disk, but this does not take
advantage of the disk's direct access capability, so direct access would be the preferred
method of file storage and retrieval.
Random or Direct Access File Organisation
Transferring the above example onto a disk would result in the following:
[Figure: disk layout showing tracks 00 to 02, each divided into sectors 0 to 7.]
Added to the disk is an index, which is loaded into RAM and defines the relationship
between the primary key and the corresponding disk address:
Index
Record    Disk Address
          Track    Sector
001       00       0
002       00       1
003       00       2
004       00       3
005       00       4

The index tells the disk R/W head where to look for the data (sector and track).

The R/W head goes directly to the correct disk track position, waits for the correct
sector to rotate under the head and then retrieves the student’s record.
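The two steps above can be sketched as follows, using the index entries from the table. This is an illustration only: a dictionary keyed by (track, sector) stands in for the disk surface, and the function name is hypothetical.

```python
# The index, held in RAM: primary key -> (track, sector).
index = {
    "001": ("00", 0),
    "002": ("00", 1),
    "003": ("00", 2),
    "004": ("00", 3),
    "005": ("00", 4),
}

# Stand-in for the disk: (track, sector) -> stored record.
disk = {
    ("00", 0): "George",
    ("00", 1): "Hugh",
    ("00", 2): "Adams",
    ("00", 3): "Murray",
    ("00", 4): "Sinclair",
}


def read_student(student_id):
    """Look up the disk address in the index, then go directly to it."""
    address = index.get(student_id)
    if address is None:
        return None
    return disk[address]


surname = read_student("005")
```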
Due to the size of the index (holding in our case pointers to 999 records and their relevant
disk addresses), a compromise sometimes has to be reached between direct and sequential file
organisation.
Index-Sequential File Organisation
To store such information would require a vast amount of memory. In order to avoid this and
reduce our index file size, we could simply omit the last digit as shown below:
Index
Record    Disk Address
          Track    Sector
00        00       0
00        00       1
00        00       2
00        00       3
00        00       4
…         …
01        01
…
02        02
This time-space compromise would reduce the demand on memory and the time spent
processing the data.
If we were to look for record 010 we would have immediate access to it (provided there are
no other IDs within the same region, i.e. 011 to 019); otherwise the records would be
accessed sequentially through the index, until the required record is reached.
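The effect of dropping the last digit of the key can be sketched as follows. This is one possible reading of the scheme, assuming the shortened key points to the first record of its region and the region is then scanned sequentially; the student IDs and function names are hypothetical.

```python
def index_key(student_id):
    """Omit the last digit, so "010" to "019" all share the entry "01"."""
    return student_id[:-1]


# Hypothetical student IDs, held in key order.
records = ["001", "002", "005", "010", "013", "017", "020"]

# Partial index: shortened key -> position of the first record in its region.
partial_index = {}
for position, student_id in enumerate(records):
    partial_index.setdefault(index_key(student_id), position)


def find(student_id):
    """Return (position, records read); position is None if not found."""
    start = partial_index.get(index_key(student_id))
    if start is None:
        return None, 0
    reads = 0
    for position in range(start, len(records)):
        reads += 1
        if records[position] == student_id:
            return position, reads
        if index_key(records[position]) != index_key(student_id):
            break  # we have run past the end of the region
    return None, reads
```

With these records the index needs only three entries instead of seven. Record 010 is found on the first read, while record 017 needs three reads because 010 and 013 share its region.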
Criteria for selecting file organisation
This depends on a range of factors such as file-use ratio (file activity), file volatility, file size
and user requirements.
There are four main criteria to be considered when choosing a file organisation technique:

File use ratio (hit rate)

File volatility

File size

User requirements
1. File use ratio (file activity)
If we divide the number of records that are accessed (within a specified process or period)
by the total number of records in the file, we can calculate the file-use ratio (hit rate).
If the ratio is high it indicates that the majority of records are used regularly which means
sequential/serial file organisation may be the appropriate method.
On the other hand, if the ratio is low (say 5% to 10%) then the implication is that the
ability to retrieve a desired record quickly is crucial and therefore direct file organisation
should be recommended. Let's illustrate file activity using three examples:
a. Payroll production: an example of a high activity file.
In most, if not all, organisations production of payroll and payslips is a regular event,
which can be either weekly or monthly. Such an application requires processing of all
or nearly all the employee records and therefore the file-use ratio will be close to or
equal to one (100%). Thus sequential file organisation is preferred.
b. Customer accounts in banks: an example of a medium activity file.
This is an application in which both random and sequential access are required. For
example, several customers should be able to withdraw cash from cash dispensers
simultaneously and randomly, and the bank should be able to update all customer
accounts periodically by sequential processing. Indexed sequential file organisation
may therefore be the most suited to this type of application.
c. Airline ticket reservations: an example of a low activity file.
In most cases only one record is accessed at a time. This record is required quickly
and therefore direct accessing is most appropriate.
Calculating the file use ratio
To calculate the file use ratio we need to know the number of records accessed and the
number of records in the file.
Examples:

File has 8,000 records, 250 of which are accessed and updated per week.
File use ratio = 250 / 8000 = 0.03125 per week (very low)

4100 records are accessed per week.
File use ratio = 4100 / 8000 = 0.5125 per week (medium)

If all but 400 were accessed weekly, i.e. 7600 accessed per week.
Then: file use ratio = 7600 / 8000 = 0.95 per week (very high)
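The three calculations above can be reproduced with a short function. The function name is an assumption made for this sketch.

```python
def file_use_ratio(records_accessed, records_in_file):
    """Hit rate: records accessed in the period divided by records in the file."""
    if records_in_file <= 0:
        raise ValueError("the file must contain at least one record")
    return records_accessed / records_in_file


# The three weekly examples for a file of 8,000 records.
low = file_use_ratio(250, 8000)
medium = file_use_ratio(4100, 8000)
high = file_use_ratio(8000 - 400, 8000)
```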
2. File volatility
This indicates how often files require modification and updating, e.g. insertions and
deletions. Highly volatile files are not usually indexed, as this would entail excessive
overheads in too frequently updating the index and file. Indexing is used when the data is
fairly stable.
3. File size
When files are large serial/sequential location techniques give longer access times. Thus
large files are usually indexed or direct files.
4. User requirements
The main factor to concern most users is how they access the files: batch access or
interactive access. If they are happy to use batch access then sequential file organisation
is likely to be appropriate providing the file activity is reasonable.
If the user needs to operate interactively then direct access will be required which will
mean indexed or hash coded files.
Access or response times may also influence the user and direct access (indexed or
hashed) is faster except when data is only accessed sequentially. Hash coding is quickest
but provides no means for sequential searching.
Minor Criteria

The type of storage device available e.g. magnetic tape will only allow
serial/sequential access.

The ease (or complexity) of actually implementing the file organisation technique
with the data concerned.

Availability/features/cost of software to handle the organisation technique preferred
according to other factors.
A Note on Physical and Relative Addresses
Records are of no use to us unless we can retrieve the data they store. To retrieve records we
must obviously know where they are stored. There are 2 ways of indicating the location in
which they are stored:
1. Physical Addresses
Eventually all addressing must map to physical addresses. These tell us the actual
physical location of the record on the storage medium e.g. on a magnetic disk we would
need to know the cylinder, track and sector which held the record.
Physical Address
Illustrated below is a disk cylinder. Records are written onto the disk starting with:

Track 1 on surface 1, then

Track 1 on surface 2, then

Track 1 on surface 3.
To expand, records could be stored as follows:
Cylinder    Surface    Sector    Record
1           1          9         128936A
2           7          1         117237X
3           4          8         456233C
1           4          9         980763A
2. Relative Addresses
Modern file organisation techniques usually use relative addressing. The address is
provided according to its position in the file and not its physical location on the storage
device. Thus the 56th record in a file would have a logical address of 56, quite
independent of its physical location. Relative addresses must be converted to physical
ones at some point for the computer to find the record.
Relative Address    Record
1                   68-768
2                   68-888
3                   97-023
4                   98-222
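The conversion from a relative address to a physical one can be sketched as follows. The disk geometry here is purely an assumption for the example (8 sectors per track, 3 surfaces per cylinder, one record per sector, and each cylinder filled surface by surface before moving to the next); a real system would use the geometry of the actual device.

```python
# Assumed geometry for this sketch only.
SECTORS_PER_TRACK = 8
SURFACES_PER_CYLINDER = 3


def to_physical(relative_address):
    """Map a relative record number (from 1) to (cylinder, surface, sector).

    Cylinders and surfaces are numbered from 1, sectors from 0.
    """
    if relative_address < 1:
        raise ValueError("relative addresses start at 1")
    offset = relative_address - 1
    records_per_cylinder = SURFACES_PER_CYLINDER * SECTORS_PER_TRACK
    cylinder, rest = divmod(offset, records_per_cylinder)
    surface, sector = divmod(rest, SECTORS_PER_TRACK)
    return cylinder + 1, surface + 1, sector


first = to_physical(1)
fifty_sixth = to_physical(56)
```

Under these assumptions the 56th record lands on cylinder 3, surface 1, sector 7. The program using the file only ever needs to quote the number 56.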
File Content
Files will have very different contents according to the work that they are created to assist.
The number and type of users may also have an effect. We will briefly discuss 4 general
possibilities:
1. Private one user files
These are created to be used by one operator or one body. They often hold data for just
one job.
2. Private database files
They store data for a group of related users e.g. managers in an organisation. Several
programs may well operate on the same database file(s) e.g. a student file may be used to
produce student identity cards, update course/exam results and produce mail shots.
3. Public files
These are also called shared files. They are created in order that users of a common
computing service can all access each other’s files either in parts or in their entirety, as
specified by the producers of the files.
4. Public database files
These are also called databanks and are databases that are open to public enquiry. They
usually concentrate on a particular field such as medicine, law, finance etc. Often they
are not a free service but charge a subscription/registration fee and/or charge for usage.
Updating Files
The files in an information system are classified by six functions:
1. Master File
Contains permanent records that are updated by adding, deleting or editing data.
2. Transaction File
Contains records of changes, additions and deletions made to a master file that may be
summarised before storage in the master file. A key field is selected for sorting records in
a transaction file before updating the master file.
3. Table File
Contains a table of static data e.g. tax rates that is referenced by one of the other types of
files.
4. Report File
Contains information that has been prepared by the user for display or spooling to a
printer e.g. output of the maintenance run of a Pascal program.
5. Control File
A small file containing file handling records.
6. History File
Backup files from past runs.
Batch Processing
In batch processing, data is stored during working hours and then copied to a secondary
storage medium such as a magnetic tape or server during the evening or whenever the
computer is idle. Batch processing usually requires the use of the computer or a peripheral
device for an extended period of time. Once the batch job begins, it continues until it is done
or until an error occurs.
Sequential File Updating
Processing data organised sequentially usually uses a master file and a transaction file that
contains modifications (additions, changes and deletions) for the master file. In the
traditional method of merging the transaction file and the old master file, the batch processing
approach is used. This implies that updates to the old master file 'build up' in the transaction
file where they are sorted into the same key field order as the master file in the merge run.
[Figure: the Transaction File and the Old Master File are inputs to the file update process, which produces the New Master File.]
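The merge of a sorted transaction file with the old master file can be sketched as follows. This is a simplified illustration, assuming at most one transaction per key and three hypothetical action names ("add", "change" and "delete"); lists of tuples stand in for the files.

```python
def update_master(old_master, transactions):
    """Merge sorted (key, action, data) transactions into sorted (key, data) records."""
    new_master = []
    m = t = 0
    while m < len(old_master) or t < len(transactions):
        no_more_transactions = t == len(transactions)
        master_is_lower = (
            m < len(old_master)
            and not no_more_transactions
            and old_master[m][0] < transactions[t][0]
        )
        if no_more_transactions or master_is_lower:
            # No transaction applies: copy the old record across unchanged.
            new_master.append(old_master[m])
            m += 1
            continue
        key, action, data = transactions[t]
        has_master = m < len(old_master) and old_master[m][0] == key
        if action == "add" and not has_master:
            new_master.append((key, data))
        elif action == "change" and has_master:
            new_master.append((key, data))
            m += 1
        elif action == "delete" and has_master:
            m += 1  # skip the old record so it is not copied
        else:
            raise ValueError(f"cannot apply {action!r} to key {key!r}")
        t += 1
    return new_master


old_master = [(1, "George"), (2, "Hugh"), (4, "Murray")]
unsorted_transactions = [(4, "delete", None), (2, "change", "Hughes"), (3, "add", "Adams")]
# Transactions are sorted into the same key order as the master file first.
transactions = sorted(unsorted_transactions, key=lambda item: item[0])
new_master = update_master(old_master, transactions)
```

Both files are read once from start to finish, which is why this approach suits sequential devices and batch processing.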
Indexed File Updating
Disc storage allows sequential file organisation but it is more appropriate to use methods that
can take advantage of the direct access organisation of files on discs. Indexing methods vary
and can be a much more complicated process than sequential processing but fortunately,
operating systems usually supply indexing routines. The simplest indexing method is to have
a list (usually stored in main memory) of the key field and the record number of the
associated record. The ability to access an indexed file as either a sequential or a direct
access file is important to the database concept.
Inverted files contain multiple indices, but if more than two key fields (the Primary key and
the Secondary key) are used then updating the file takes more time and effort. As records are
inserted and deleted then all indices affected must be changed. The inverted file organisation
contains an inversion index for each secondary key field that contains all the values of the
key field and a pointer to the records in the main file that contain those key values.
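An inversion index of the kind described above can be sketched as follows. The book records and the secondary key fields (author and subject) are hypothetical, and record positions in a list stand in for pointers into the main file.

```python
# Main file: the position in the list acts as the record pointer.
main_file = [
    {"id": 1, "author": "Smith", "subject": "Databases"},
    {"id": 2, "author": "Jones", "subject": "Networks"},
    {"id": 3, "author": "Smith", "subject": "Networks"},
]


def build_inversion_index(records, field):
    """Map every value of a secondary key field to the records holding it."""
    index = {}
    for position, record in enumerate(records):
        index.setdefault(record[field], []).append(position)
    return index


# One inversion index per secondary key field.
indices = {
    field: build_inversion_index(main_file, field)
    for field in ("author", "subject")
}


def insert(record):
    """Adding a record means updating every index that is affected."""
    main_file.append(record)
    position = len(main_file) - 1
    for field, index in indices.items():
        index.setdefault(record[field], []).append(position)


insert({"id": 4, "author": "Jones", "subject": "Databases"})
```

Each insertion touches every index, which is the extra time and effort referred to above when many key fields are used.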
Large Index Problems
Partial Index Structure
If the size of the index file must remain small then a partial index to a file segment rather than
one index per record may be used.
A file with a partial index
[Figure: an index with columns Key and Segment (entries shown include B → 1 and D → 2) pointing into a file divided into segments, each holding consecutive records (A and B, C and D, E and F, … Y and Z).]
If file growth is anticipated then only a fraction of each track on the disk is initially filled so
that room is left to add new records. The file design includes consideration of the load factor
of each track or surface of the disk.
LOAD FACTOR = NO. OF KEY VALUES TO BE STORED / NO. OF FILE POSITIONS
Even a small load factor may eventually lead to a segment outgrowing its track requiring that
the segments be split into smaller units or the storage of the overflow records elsewhere on
the disk.
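The load factor formula can be applied as in the sketch below. The figures used (a track with 20 positions, 15 of them filled, and a target load factor of 0.75 for 999 records) are assumptions chosen only to illustrate the arithmetic.

```python
import math


def load_factor(key_values_to_store, file_positions):
    """Number of key values to be stored divided by number of file positions."""
    return key_values_to_store / file_positions


def positions_needed(key_values_to_store, target_load_factor):
    """File positions to reserve so the file starts at the target load factor."""
    return math.ceil(key_values_to_store / target_load_factor)


# A track with room for 20 records, initially loaded with 15.
track_load = load_factor(15, 20)
room_to_grow = 20 - 15

# Positions to reserve for 999 records at a load factor of 0.75.
reserved = positions_needed(999, 0.75)
```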
Index Sequential Access Method
A popular indexed file process used on personal computers is the Indexed Sequential Access
Method (ISAM). This "cylinder and surface" indexing method is based upon the physical
characteristics of magnetic disk storage. Its implementation in Pascal uses relative files for
rapid access to individual records, however only one index for the address of the surface is
used instead of two indices for both the cylinder and surface addresses. The surface
addresses are stored in the index file in terms of the relative record numbers.
Role of the Operating System in Input and Output of Files
Third-generation procedural programming languages such as Pascal and C have file
organisation statements for the reading and writing of records by the operating system.
Before you can access a file, you must open it via a system routine that allocates some
memory for the file control block and record buffers. Programming languages have
statements that are effectively calls to the system routine. C is slightly different since it has
no specific I/O statements. It has a standard set of library routines for handling I/O. After a
program has finished with a file, it must tell the operating system that it has finished with a
file by using a close statement. This is important so the operating system will flush any data
that has been buffered in memory and hasn't actually been physically written to the file and
so the operating system can release the memory set aside for the File Control Block (FCB)
and the buffers.
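The open, write and close sequence can be sketched as follows. Python is used here purely for illustration (the handout refers to Pascal and C), and the file name students.dat is hypothetical; the file is created in a temporary directory.

```python
import os
import tempfile

path = os.path.join(tempfile.mkdtemp(), "students.dat")

# Opening the file asks the operating system to set it up for use.
out_file = open(path, "w")
out_file.write("001,George\n")   # may only reach a buffer in memory
out_file.close()                 # flushes the buffer and releases the file

# A with-statement closes the file automatically when the block ends.
with open(path) as in_file:
    contents = in_file.read()
```

Until close is called, the record written above may exist only in a memory buffer, which is why forgetting to close a file can lose data.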