FILE PROCESSING CONCEPT
1

Introduction
 Primary Key
 Classification of Data Files
 By Content
 By Mode of Processing
 By Organization of files
• Serial
• Sequential
• Index Sequential
• Random
 Transformation Method
 Q&A
2

File processing is a computer programming term.

It refers to the use of computer files to store data in persistent (permanent) storage.

Variables and arrays store data only temporarily.

File processing is a useful alternative to a database only where the information will be accessed by a single user, where speed of data input is vital and where the amount of data being stored is relatively small.
3
Elements of a Computer File
 A collection of information stored on magnetic media, optical disks or a pen drive
 Data files are similar in concept
 Files can be created, updated and processed
 A file contains logical records, a record contains fields, and a field contains characters
There are 2 categories of record: logical record and physical record.
A logical record is one line of data in a file.
A physical record is one or more logical records read into or written from main memory as a unit of information.
4
[Figure: structure of a file. A FILE is made up of logical records (REC-1, REC-2, …, RECn); each record is made up of fields (Field1, Field2, …, Fieldn); each field is made up of characters (Char1, Char2, …, Charn). Example records: Mimi, HND IS, Single, 25; Anna, HND IS, Single, 24; Ena, HND CP, Married, 28.]
5

The number of characters grouped into a field can vary from field to field in a record.

There are 2 types of record:
 fixed length
• Each record has a fixed length, e.g. 90 characters. Fields that are not completely filled are padded with space characters, which wastes space.
 variable length
• The size of each field, and so of the record, varies according to the size of the data it contains.
• Special characters called separators are used to mark the end of each field and the end of each record.
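The two record types can be compared with a minimal Python sketch. The field widths, the `|` separator and the helper names `to_fixed` and `to_variable` are assumptions made for illustration; they are not taken from the slides.

```python
# Hypothetical field widths: name, course, status, age (30 characters in total).
FIELD_WIDTHS = (10, 10, 8, 2)
RECORD_LENGTH = sum(FIELD_WIDTHS)

def to_fixed(fields):
    """Build a fixed-length record: truncate or pad every field to its width."""
    return "".join(value[:width].ljust(width)
                   for value, width in zip(fields, FIELD_WIDTHS))

def to_variable(fields, sep="|"):
    """Build a variable-length record: fields joined by a separator,
    with a newline marking the end of the record."""
    return sep.join(fields) + "\n"

record = ("Mimi", "HND IS", "Single", "25")
fixed = to_fixed(record)
variable = to_variable(record)

print(len(fixed))     # always RECORD_LENGTH, whatever the data
print(len(variable))  # depends on the data stored
```

The fixed-length form stores padding spaces for every short field, while the variable-length form pays only for one separator per field.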
6

The information contained in a file relates to one specific type of detail.

Different files are used to store different types of details; different types of details are not mixed into a single file.

Records are not usually transferred to and from main memory as single logical records but are grouped together (as a block of logical records).

When read, records are stored temporarily in a buffer.

A file normally ends with an “end of file” marker.
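Grouping logical records into blocks can be sketched as follows. The function name `block_records` and the parameter name `blocking_factor` (records per block) are illustrative choices, not terms from the slides.

```python
def block_records(records, blocking_factor):
    """Group logical records into physical records (blocks) of at most
    `blocking_factor` logical records each."""
    return [records[i:i + blocking_factor]
            for i in range(0, len(records), blocking_factor)]

logical_records = ["REC-1", "REC-2", "REC-3", "REC-4", "REC-5"]
blocks = block_records(logical_records, 2)

for number, block in enumerate(blocks, start=1):
    # Each block would be moved to or from the buffer in a single transfer.
    print(number, block)
```

With a blocking factor of 2, five logical records need three transfers instead of five; the last block is only partly filled.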
7

A file normally contains a primary key (a field of the record that has a unique value) to uniquely identify a particular record.

The primary key is made up of one field or a combination of two or more fields of the record.

The primary key allows easier and quicker search and retrieval of a particular record by matching the search key against the primary key.
8

The way data files are used depends upon:
 the contents,
 the mode of processing and
 the organisation of the file
9

6 basic categories:
1. Master File
2. Transaction File
3. Index File
4. Table File
5. Archival/History File
6. Backup File
10

Master File

Contains permanent information of the current-status type.

Used for basic identification and accumulation of certain statistical data, e.g. Product file, Staff file, Customer file etc.

Transaction File

Contains data about the activities (transactions) that affect the master file.

Accumulated records are used to update the master file, e.g. invoices, purchase orders etc.

The updating method is batch.
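A batch update of a master file from accumulated transactions might look like the sketch below. The product keys, the quantities and the function name `batch_update` are hypothetical; the master file is modelled as a dictionary keyed by primary key purely for brevity.

```python
def batch_update(master, transactions):
    """Apply all accumulated transactions to a copy of the master file.

    `master` maps primary key -> quantity on hand;
    `transactions` is a list of (primary key, quantity change) pairs.
    """
    updated = dict(master)
    # Sorting by key mirrors the usual practice of processing the
    # transaction file in the same order as the master file.
    for key, change in sorted(transactions):
        updated[key] = updated.get(key, 0) + change
    return updated

master = {"P001": 10, "P002": 5}
transactions = [("P002", -2), ("P001", 4), ("P002", 1)]
new_master = batch_update(master, transactions)
print(new_master)
```

The old master is left untouched, so it can be kept as the previous generation of the file.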
11

Index File

Index files actually consist of a pair of files: one holding the data and one storing an index to that data.

Used to indicate the location of specific records in other files (usually the master file) using an index key or address.

Table File

Static reference data used during processing, e.g. a pay rate table for the preparation of payroll.
12

Archival/History File

Often termed master files.

Contains non-current statistical data, used to create comparative reports, pay commission etc.

Normally updated periodically and involves a large volume of data.

Backup File

Non-current files stored in the file library.

Used when the current master file is destroyed.
13

Input
 Data is loaded into the CPU and processed; the output is placed in another file.

Output
 Data is processed and written onto another file.

Overlay
 A record is accessed, loaded into the CPU, updated and written back to its original location (overwriting the original value).
14


File organization is how the records are stored, processed and accessed.
It has 3 functions:
1. Storage of records.
2. Maintenance of files (updating, editing, deleting).
3. Retrieval of required items (searching).
15

There are several types of file
organization:
1. Serial
2. Sequential
3. Indexed Sequential
4. Random
16

Serial

The simplest form of file organization.

Records are not kept in any pre-determined order.

Records are positioned one after another.

New records are added to the bottom of the file regardless of what they contain.

This technique is normally used for storing records for further processing (e.g. sorting).

Normally applied to storage on magnetic tape.

Accessing records is very slow.
17

Sequential

More organised than a serial file.

Records are kept in some pre-defined order: the order of the primary key.

E.g. book data is stored alphabetically by author.

It is not necessary to search the whole file to find out that a record is not present: the search can stop as soon as a key greater than the search key is reached.

This is less flexible because if we are looking for books by authors whose names begin with N, we need to scan along from A until we come to N.
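The scan from A to N, and the early stop when a record is absent, can be sketched like this. The author names and the function name `sequential_search` are made up for the example.

```python
def sequential_search(sorted_keys, target):
    """Scan a sequential file in key order.

    Returns (found, comparisons). The scan stops early as soon as a key
    greater than the target is met, because the target cannot appear later.
    """
    comparisons = 0
    for key in sorted_keys:
        comparisons += 1
        if key == target:
            return True, comparisons
        if key > target:
            break
    return False, comparisons

authors = ["Adams", "Brown", "Clark", "Nolan", "Young"]
print(sequential_search(authors, "Nolan"))  # found after scanning from the start
print(sequential_search(authors, "Davis"))  # absent: scan stops at "Nolan"
```

In a serial (unordered) file the second search would have to read every record before concluding that "Davis" is missing.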
18

Data cannot be modified without the risk of destroying other data in the file.

E.g. if the name “Sam” needed to be changed to “Shaun”, the old name cannot simply be overwritten. The new record contains more characters than the original one. The characters beyond the ‘a’ in “Shaun” would overwrite the beginning of the next sequential record in the file.

Suitable for storage on magnetic tape.

Sequential access is not usually used to update records in place. Instead, the entire file is usually rewritten. This requires processing every record in the file to update one record.
NOTE: In both files (serial and sequential), individual records can only be found by reading through the file from the start until the required key value is located.
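The “Sam” to “Shaun” problem can be reproduced on an in-memory file image. The second record (“Lee”), the newline record terminator and the function name `rewrite_with_change` are assumptions made for this sketch.

```python
# Two newline-terminated sequential records held as one file image.
original = b"Sam\nLee\n"

# Overwriting in place: the 5 bytes of "Shaun" replace "Sam", the record
# terminator and the first character of the next record.
damaged = bytearray(original)
damaged[0:5] = b"Shaun"
print(bytes(damaged))  # the record "Lee" has been corrupted

def rewrite_with_change(data, old, new):
    """Copy every record to a new file image, replacing only the changed one."""
    records = data.decode("ascii").splitlines()
    return "".join((new if r == old else r) + "\n" for r in records).encode("ascii")

rewritten = rewrite_with_change(original, "Sam", "Shaun")
print(rewritten)       # both records intact
```

Rewriting touches every record in the file, which is the cost the slide describes for updating a single record.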
19

Indexed Sequential

Basically a hybrid of the sequential and random file organisation techniques (uses both sequential and random access methods).

Often referred to as ISAM (Indexed Sequential Access Method).

Records are maintained in key sequence but have an index structure built on top of the actual data.

The index to a (large) file may be split into different index levels: an INDEX OF INDEXES.

Master index: the highest-level index, which contains pointers to the low-level indexes.
20

A particular record is located by following the index tree from the master index to the data block containing the target record.

The block is then read to locate the target record with the matching key.

This organisation may be useful for auto-bank machines: customers randomly access their accounts throughout the day, and at the end of the day the bank can update the whole file sequentially.

One of the drawbacks of this organization is that several tables must be stored for the index, which makes for a considerable storage overhead.
21
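[Figure: single-level index. Each index entry holds a block number and the last record key in that block (block 2: 048E, block 3: 050K, block 4: 051D). Locating record 7, whose address is 050E: the key 050E is greater than 048E and not greater than 050K, so the record is looked up in block 3.]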
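The index lookup can be sketched in Python. The index entries (block number and last record key) follow the figure; the block contents are illustrative values only, and the function name `locate` is hypothetical.

```python
from bisect import bisect_left

# (last record key in block, block number), kept in key order.
INDEX = [("048E", 2), ("050K", 3), ("051D", 4)]

# Illustrative block contents; each block is itself in key sequence.
BLOCKS = {
    2: ["044A", "046E", "047J", "048E"],
    3: ["049A", "049T", "050E", "050K"],
    4: ["050Z", "051C", "051D"],
}

def locate(key):
    """Return the block number holding `key`, or None if it is not on file."""
    last_keys = [last for last, _ in INDEX]
    # First index entry whose last key is >= the search key.
    position = bisect_left(last_keys, key)
    if position == len(INDEX):
        return None  # key is beyond the last block
    block_number = INDEX[position][1]
    # Only this one block is read and searched for the matching key.
    return block_number if key in BLOCKS[block_number] else None

print(locate("050E"))
```

Only the index and one data block are examined, instead of every record that precedes the target.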
22
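[Figure: multi-level structure. A master index holds, for each low-level index, its number and last record key; each low-level index holds block numbers and the last record key of each block. Locating record 100, whose address is 053X: the search goes through low-level index 2 (block 158: 052A, block 159: 058E, block 160: 063X) to block 159, which contains the keys 052C, 053X, 056J and 058E.]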
23

Random

Records are normally fixed in length.

Records are accessed directly, without searching through the preceding records.

Data can be inserted in a randomly accessed file without destroying other data in the file.

Data previously stored can also be updated or deleted without rewriting the entire file.

E.g. airline reservation systems, banking systems etc.

Since every record is the same length, the computer can quickly calculate (as a function of the record key) the exact location of a record relative to the beginning of the file.
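A minimal sketch of this calculation, assuming the record key is simply the record number and that records are 16 bytes long (both are assumptions for illustration). An in-memory `io.BytesIO` object stands in for a file on disk.

```python
import io

RECORD_LENGTH = 16  # assumed fixed record size in bytes

def write_record(f, record_number, text):
    """Write one fixed-length record at offset record_number * RECORD_LENGTH."""
    f.seek(record_number * RECORD_LENGTH)
    f.write(text.encode("ascii")[:RECORD_LENGTH].ljust(RECORD_LENGTH))

def read_record(f, record_number):
    """Jump straight to the record without reading the preceding ones."""
    f.seek(record_number * RECORD_LENGTH)
    return f.read(RECORD_LENGTH).decode("ascii").rstrip()

f = io.BytesIO()
for number, name in enumerate(["Mimi", "Anna", "Ena"]):
    write_record(f, number, name)

# Update record 1 in place; its neighbours are not disturbed.
write_record(f, 1, "Annabel")
print(read_record(f, 1), read_record(f, 2))
```

Because every record occupies the same number of bytes, a longer value such as "Annabel" fits in the existing slot and the rest of the file does not need to be rewritten.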
24

A random file uses a block address calculation algorithm.

The algorithm takes the record key as input and returns the block number.

The problem is how to store data efficiently, so that the storage location can be found from the record key.

Keys are unlikely to run sequentially, so the file has clusters and gaps. For example, suppose storage is determined by the first letter of the customer name in alphabetical order. Some letters are common, e.g. A, B, D, but some are not, e.g. Q, X.

A good algorithm is needed to generate uniformly distributed addresses: a hashing algorithm.
25

5 major techniques for hash coding
 Division
 Truncation
 Extraction
 Folding
 Randomizing

All techniques aim to generate a uniformly distributed set of
addresses which will map the keys to the storage area as uniformly
as possible.

The best known and most used technique is division.

Division is done by dividing the primary key by a positive integer, usually a prime number, which is approximately equal to the number of available addresses, and using the remainder as the address.
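The division technique can be sketched as follows, assuming about 100 available addresses and therefore the prime divisor 97. The keys are made-up values and `division_hash` is a hypothetical name.

```python
DIVISOR = 97  # a prime close to the assumed 100 available addresses

def division_hash(key, divisor=DIVISOR):
    """Divide the primary key by the divisor and use the remainder as the address."""
    return key % divisor

for key in (1234, 5678, 9101, 1331):
    print(key, division_hash(key))
```

Keys 1234 and 1331 differ by exactly 97, so they produce the same address. Two keys mapping to one address is a collision, and it has to be handled separately by whatever stores the records.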
26
Here are some relatively simple hash functions that have been
used:

The division-remainder method: The number of items in the table is estimated. That number is then used as a divisor into each original value or key to extract a quotient and a remainder. The remainder is the hashed value. (Since this method is liable to produce a number of collisions, any search mechanism would have to be able to recognize a collision and offer an alternate search mechanism.)

Folding: This method divides the original value (digits in this
case) into several parts, adds the parts together, and then
uses the last four digits (or some other arbitrary number of
digits that will work ) as the hashed value or key.
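Folding can be sketched as below. Splitting the key into 4-digit parts is an assumption for the example (the slide leaves the part size open), and `folding_hash` is a hypothetical name.

```python
def folding_hash(key, part_size=4, digits=4):
    """Split the key's digits into parts, add the parts together,
    and keep the last `digits` digits of the sum."""
    text = str(key)
    parts = [int(text[i:i + part_size]) for i in range(0, len(text), part_size)]
    return sum(parts) % (10 ** digits)

# 1234 + 5678 + 9012 = 15924, and the last four digits are kept.
print(folding_hash(123456789012))
```

Taking the sum modulo 10^4 is what keeps only the last four digits, so the result always fits the same address range regardless of key length.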
27

Radix transformation: Where the value or key is
digital, the number base (or radix) can be changed
resulting in a different sequence of digits. (For
example, a decimal numbered key could be
transformed into a hexadecimal numbered key.)
High-order digits could be discarded to fit a hash
value of uniform length.

Digit rearrangement: This is simply taking part of
the original value or key such as digits in positions
3 through 6, reversing their order, and then using
that sequence of digits as the hash value or key.
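Both techniques can be sketched in a few lines. The hexadecimal base, the 3-digit hash length and the function names are assumptions for illustration; positions are counted from 1, as in the text.

```python
def radix_transform(key, length=3):
    """Rewrite a decimal key in hexadecimal and discard high-order digits
    so that the hash value has at most `length` digits."""
    return format(key, "x")[-length:]

def digit_rearrangement(key, start=3, end=6):
    """Take the digits in positions start..end (1-based, inclusive)
    and reverse their order."""
    return str(key)[start - 1:end][::-1]

print(radix_transform(123456))        # 123456 is 1e240 in hexadecimal
print(digit_rearrangement(12345678))  # digits 3 to 6 are "3456"
```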
28