File Processing

advertisement
1
File Processing
Chapter 14
2
What You Will Learn
Create files
Write files
Read files
Update files
3
Introduction
• Storage of data in variables is temporary
– when the program is done running,
when computer is turned off
The data is gone!
• Data stored permanently (more or less) on
secondary storage devices
– magnetic disks
– optical disks
– tapes
• We consider creation & use of files of data
4
The Data Hierarchy
• Lowest level of data storage is in binary
digits
– 0s and 1s -- bits
• Bits grouped together into bytes
• Bytes grouped into characters
• Characters and/or bytes grouped together to
form fields
• Fields grouped together to form records
• Records grouped to form files
5
Storage of Sequential Files
• Any type of serial access storage device
– originally cards
– magnetic tape
– even punched paper tape!
• Now would be placed on direct access
devices (disks)
6
Sequential Devices -- Some
History
• Tape Drives
7
Data Stored on Tape
Interblock Gap for
purpose of
acceleration or
deceleration of tape
past the read/write
head.
Result is large
percentage of tape
length wasted on the
gaps.
8
Data Stored on Tape
Records blocked in
groups to save
space wasted by
the required gaps
9
Characteristics of Sequential
Systems
• Two types of files
– Master files -- files with relatively permanent data
– Transaction files -- file of transitory records
• Updating master files
–
–
–
–
use transaction files
combine (merge) old master with transaction file
create new master file
Old master and Transaction file are the backup for
the New Master
10
Creating a Sequential File
• #include <fstream.h>
– this is the class which enables file I/O
• Specify an object of type ofstream
• Use the open command
#include <fstream.h>
…
ofstream outfile;
outfile.open ("sales.txt");
11
Creating a Sequential File
• The open command creates a file on the
disk with the name specified in the
parameter
• If the file did not exist on the disk, it is
created
• If the file existed before, it is overwritten
– If you wish to prevent this, other measures must
be taken
12
Creating a Sequential File
• The open command establishes a "line of
communication" from the running program
to the operating system and the disk drive
• Also possible to specify
ofstream outfile ("sales.txt", ios::app);
– for appepending to the end of existing file
13
File Open Modes
ios: app
Write all output to end of file
ios::ate
Open for output at end, can also wirte
anywhere in file
ios::in
Open file for input
ios::out
Open file for output
ios::trunc
Discard file's contents if it exists
ios::nocreate
If file does not exist, operation fails
ios::noreplace
If file DOES exist, operation fails
14
Creating a Sequential File
• After creating an ofstream object and
attempting to open it, program can test for
successful open
– if (!outfile) …
• When output to the file is complete make
sure to close the file
outfile.close( );
– This flushes the file buffer, makes sure all data
gets physically to the disk
15
Reading Data from a sequential
Access File
• Create and open the file
ifstream infile ("sales.txt", ios::in)
– Note : ifstream for input
ios::in is default
• Possible to place the input command in the
while loop condition
while (infile >> item_num >> qty) …
See CD ROM
– when no more data exists for extraction >>,
Fig 14.7
false is returned, loop ends
16
Reading Data from a sequential
Access File
• Retrieving sequential data => program starts
reading from beginning of file
– program keeps track of where it is in the flie with a
file position "pointer"
– file position pointer is the byte number of the next
byte in the file (stream) to be read or written
• Possible to manipulate the file position pointer
– infile.seekg(n) or outfile.seekp(n)
17
Reading Data from a sequential
Access File
• Reposition file position pointer to beginning
of file
outfile.seekp(0); // seekp for write only
infile.seekg(0); // seekg for read only
• Argument for seekp, seekg is long integer
– specifies number of bytes to move file position
pointer
– default is bytes from beginning of file
18
Reading Data from a sequential
Access File
• Second possible argument specifies
direction of move of file position pointer
– ios::cur => bytes forward from current position
– ios::end => bytes back from end of file
• What would this do?
outfile.seekp(0,ios::end);
Places file position pointer at end of
file (0 bytes back from end)
19
Reading Data from a sequential
Access File
• Possible to request where you are in the file,
location of the file position pointer
fpp = infile.tellg( );
– similar .tellp( ) for an output file
• View example in figure 14.8, pg 715 on text
CD audio
20
Updating Sequential Files
• Cannot be done with text files
– names, string fields are variable length
– would overwrite surrounding data if we
changed contents of text file
• Consider sequence
read a line of data (numbers, names)
edit the data, change numbers, names
reposition file position pointer
rewrite new data …. Problem???
• Des NOT use insert mode like a text editor!
21
Files of Records
• Must use files containing structures
– each record is exactly the same size struct
• Now we use a different .read and
.write command
– these do unformatted I/O
– we are reading bytes straight into the record
structure
• Enables us to reposition file position pointer
and write over old record precisely
22
Files of Records
Cast the address of sales_rec
to a pointer to char
infile.read ((char*)&sales_rec,sizeof(sale_rec));
or
outfile.write((char*)&sales_rec,sizeof(sales_rec));
Specify number of bytes to be
transferred from the file, to the
address specified by the pointer
23
Files of Records
• Text uses a different version
… .write(reinterpret_cast<const char*>
(&sales_rec), sizeof (sales_rec));
infile.read ((char*)&sales_rec,sizeof(sale_rec));
or
outfile.write((char*)&sales_rec,sizeof(sales_rec));
• Note Figure 14.11, pg 719 … listen to text CD
audio description
24
Updating Master Files
Old Master
File
Transaction
Update
Program
Exception
Report
New Master
25
Updating Master Files
• Transaction file contains
– records to be added
– modifications to existing records on old master
– orders for deletions of records on the old master
• Possible errors
– Adding => record for specified ID already
exists
– Modify => record does not exist
– Delete => record does not exist
26
Program Flow Chart
Open files, read 1st records
Trans Key < OM key
Compare
keys
Trans key > OM key
Trans key = = OM key
Type of
Trans
Add OK
Other,
Type of
Trans
Del,
go on
Other,
Error
Write OM
record to
NM file,
Trans key
yet to be
matched
Modify, Make changes
27
Sequential Update
28
Sequential Update
Transaction File
Old Master
10
20
25
30
35
40
50
55
80
10
20
30
40
50
60
70
80
90
100
M
M
I
D
D
A
M
M
M
New Master
29
Sequential Update
Transaction File
Old Master
10
20
25
30
35
40
50
55
80
10
20
30
40
50
60
70
80
90
100
M
M
I
D
D
A
M
M
M
New Master
10*
*modified
30
Sequential Update
Transaction File
Old Master
10
20
25
30
35
40
50
55
80
10
20
30
40
50
60
70
80
90
100
M
M
I
D
D
A
M
M
M
New Master
10*
20*
*modified
31
Sequential Update
Transaction File
Old Master
10
20
25
30
35
40
50
55
80
10
20
30
40
50
60
70
80
90
100
M
M
I
D
D
A
M
M
M
New Master
10*
20*
25
*modified
32
Sequential Update
Transaction File
Old Master
10
20
25
30
35
40
50
55
80
10
20
30
40
50
60
70
80
90
100
M
M
I
D
D
A
M
M
M
New Master
10*
20*
25
*modified
33
Sequential Update
Transaction File
Old Master
10
20
25
30
35
40
50
55
80
10
20
30
40
50
60
70
80
90
100
M
M
I
D
D
A
M
M
M
New Master
10*
20*
25
*modified
?35 D
34
Sequential Update
Transaction File
Old Master
10
20
25
30
35
40
50
55
80
10
20
30
40
50
60
70
80
90
100
M
M
I
D
D
A
M
M
M
New Master
10*
20*
25
*modified
?35 D
?40 A
35
Sequential Update
Transaction File
Old Master
10
20
25
30
35
40
50
55
80
10
20
30
40
50
60
70
80
90
100
M
M
I
D
D
A
M
M
M
New Master
10*
20*
25
40
*modified
?35 D
?40 A
36
Sequential Update
Transaction File
Old Master
10
20
25
30
35
40
50
55
80
10
20
30
40
50
60
70
80
90
100
M
M
I
D
D
A
M
M
M
New Master
10*
20*
25
40
50*
*modified
?35 D
?40 A
37
Sequential Update
Transaction File
Old Master
10
20
25
30
35
40
50
55
80
10
20
30
40
50
60
70
80
90
100
M
M
I
D
D
A
M
M
M
New Master
10*
20*
25
40
50*
*modified
?35 D
?40 A
?55 M
38
Sequential Update
Transaction File
Old Master
10
20
25
30
35
40
50
55
80
10
20
30
40
50
60
70
80
90
100
M
M
I
D
D
A
M
M
M
New Master
10*
20*
25
40
50*
60
*modified
?35 D
?40 A
?55 M
39
Sequential Update
Transaction File
Old Master
10
20
25
30
35
40
50
55
80
10
20
30
40
50
60
70
80
90
100
M
M
I
D
D
A
M
M
M
New Master
10*
20*
25
40
50*
60
70
*modified
Error Rpt
?35 D
?40 A
?55 M
40
Sequential Update
Transaction File
Old Master
10
20
25
30
35
40
50
55
80
10
20
30
40
50
60
70
80
90
100
M
M
I
D
D
A
M
M
M
New Master
10*
20*
25
40
50*
70
80*
*modified
Error Rpt
?35 D
?40 A
?55 M
41
Sequential Update
Transaction File
Old Master
10
20
25
30
35
40
50
55
80
10
20
30
40
50
60
70
80
90
100
M
M
I
D
D
A
M
M
M
New Master
10*
20*
25
40
50*
70
80*
90
*modified
Error Rpt
?35 D
?40 A
?55 M
42
Sequential Update
Transaction File
Old Master
10
20
25
30
35
40
50
55
80
10
20
30
40
50
60
70
80
90
100
M
M
I
D
D
A
M
M
M
New Master
10*
20*
25
40
50*
70
80*
90
100
*modified
Error Rpt
?35 D
?40 A
?55 M
43
Editing Transaction Data
• Plan on existence of bad data
• Common Edit checks
–
–
–
–
–
reasonableness check (type, # digits)
range check
value check (small set of possible values)
check digits, self checking numbers
proper case
44
Writing Data Randomly to a
Random Access File
• Done in the context of
–
–
–
–
finding a record on the file,
reading it,
manipulating it (make changes, update),
then write it back to (over) the original position
in the file
• Program must make sure to cover all these
steps
Writing Data Randomly to a
Random Access File
• Ask the user what record is desired
– remember they are numbered starting with 0
• Do a seekg to that record
sales_file.seekg( sizeof (sales_rec) * rec_num);
• Read the record then alter, update
sales_file.read( … )
• Do a
seekp
Why ???
back to original loacation?
sales_file.seekp( … )
• Do a .write back to the file
sales_file.write ( … )
Note fig 14.12,
pg 721
45
46
Reading Data Sequentially from
a Random Access File
• Open the file
• Prime the pump with initial read
• Use while (file_var)
{ process
try next read }
• If reading only, no need to close the file
• If doing any writing to the file, should close
See Fig. 14.14 on CD Rom
47
Input/Output of Objects
• Suppose we create classes and wish to save
objects of classes in the file
• Actually only the data items of a class get
written to or read from a file
• The function members of the class are
available only to the running program
• Typically you would let a structure be the
private data
See Fig 14.15 on CD Rom
48
Input/Output of Objects
• Consider having a file containing different
type objects
– must know what kind of object is coming so as
to read it into the correct structure
• Possible to write a flag immediately prior to
each record, specifying what type of record
(structure) it is
• Then read the flag first and use a switch
statement to determine which read
49
Procedures for Data Processing
• Instructions for operations personnel
• Prevent errors in update of sequential files
• Examples
– multiphased updates
– hardcopy output reviewed and authorized by
supervisor
– Transaction counts
– Batch totals
– Backup and recovery
50
Direct Access Devices
• Records can be read/written in any order
– Without referencing preceding records
• More like a CD or LP record
– Contrast sequential access like video or audio
cassette
51
Need for Direct Access Files
•
•
•
•
Banking transactions
Credit card authorization
Inventory check in a parts store
Basically ==> whenever batching of records
is impractical or infeasible
52
On Line Systems
• When user is in direct communication with
cpu
• Many direct access systems are on line, but
not all!
• “Direct Access” and “on line” are NOT
synonyms
53
Disk Components
• Cylinders
• Tracks
• Sectors
54
Disk Components
• Read/Write heads
move in and out
• Possible to have
multiple fixed heads
• Need head for each
track surface
• Same amount of data
on inside as outside
tracks
55
Disk Components
• Track
• Sector
• Inner sectors have
same amount of data
as outer outer (why?)
• Outer sectors less
dense (bits per inch)
56
Timing Components
• Access motion time (seek time)
– motion of read/write heads between tracks
– this is the most lengthy time of the three
• Rotational delay
• Data transfer time
57
Hard Drive Head Tolerance
Head rides on a cushion of air.
58
Accessing Direct Access files
• Mapping relationship between key field an
address on disk
• Function
R(key) ==> address on disk
• Part of the mapping process is done by
application
• Part done by Operating System
59
Absolute Addressing
• Key value is the absolute physical address on
the disk
– (cylinder, track, sector, offset)
• Advantages -- simple, fast
• Disadvantages
–
–
–
–
requires low level knowledge of physical device
typical key values not appropriate
device dependent
File reorganization requires key change
60
Absolute Addressing
• Very low level
• Used on mainframes back in the 70’s
– IBM 1130
• Very hard to do on anything we use today
• Operating System will rarely let us get at
specific disk locations
61
Relative Addressing
• Key value is relative position of the record in the file
• Operating system takes care of locating this offset
into the file
• Advantages
– simple, number of record gets program there
– device independent
• Disadvantages
– address/space dependent (when you reorganize the file,
the key values will change)
– Typical keys are not entirely appropriate (a record
number) -- but better than physical address
62
Relative Addressing
• R(rec#) = rec# (an identity function)
• Use seekg command in C++
• Example
infile.seekg (count * sizeof (tennants));
// count is the record number desired
• File “pointer” is positioned ready to read the record
desired
63
Directory Lookup
• Keep a directory or table of ordered pairs
– (key value, rec#)
• Options
– Keep in sorted order for binary search
– Random order -- sequential search
– Tree structure for quick search
64
Directory Lookup
• Problems : when table gets quite large, too
big for an in memory data structure
– must be a file
– slower to search through the file
• When you reorganize the file, the directory
is rebuilt
65
Directory Lookup
• Advantages
– appropriate key values
– device independent
– address space independent (key values not
changed when file is reorganized, directory
takes care of it)
– relatively fast if in memory (can have a cache
of part of directory in memory)
66
Address Calculation Techniques
• Most often used in
relative addressing file
system
• Perform a calculation
on the key value
• Result is a relative
address
Key Value
Hash Function
Relative Addr
67
Address Calculation
• R(key) = “predictively random” address
• R(abc) = 3
R(xyz) = 7
R(pdq) = 1
0
1
2
3
4
5
6
7
8
9
10
68
Address Calculation
• Other names
–
–
–
–
scatter storage techniques
randomizing techniques
key-to-address transformation
hashing
69
Hashing Problems
• Collisions
• Consider K1 != K2
but ... R(K1) = = R(K2)
• K1 and K2 called synonyms
• We must eventually solve this problem
70
Hashing
• Advantages
– Natural key values
– Device independent
(move file from floppy to hard drive, no
reorganization needed)
– Address space independent
(key values not changed when we reorganize
the file))
– Directory search not needed
71
Hashing
• Disadvantages
– file reorganization -- must change the hashing
function
– cost of processing time (although relatively minimal)
– processing of collision resolution can be significant
(will involve I/O)
– requires wasted file space(you must set aside more
space for more records in the file than are actually
used)
72
Performance Considerations of
Hashing Function
• Distribution of key values actually in use
– all hash to one area?
– ideal is uniform distribution
• Number of live records in the file compared
to size of file
• Measurement of live fullness of the file
==> the Load Factor
73
Hashing Performance
• Load factor > 0.7 or 0.8 causes probability of collision
to get high and slow down the algorithm
• Must compute size of file needed for max number of
records anticipated
# of live records in file
Load factor 
max # records in file
74
Types of Hashing Functions
• Division remainder
Address = key % divisor
• Divisor places limitation on size of the file
• Choose divisor large enough
• Nice to choose divisor to be a prime (or at
least have large prime divisors)
75
Types of Hashing Functions
• Mid Square Hashing
– Square the key
– Specified digits extracted
– Example 34282 = 11751184
key = 511
– Number of digits determines file size required
N digits => 10N
76
Types of Hashing Functions
• Hashing by folding
– key value partitioned into number of parts (each has
same # digits)
– Each partition folded and summed with others
– Truncated result is relative address
• Example key = = 0001 2345 6789
Add 1000
2345
9876
======
13221 ==> rel addr = 3221
77
Collision Resolution -- Linear
Probing
• Adding new Records
• Hash to the record, check if occupied
– if occupied go “next door”
• Keep going next door until find empty
record
• If find a record with same key, then this is
an illegal add
78
Linear Probing -- Fetching
Records
• Going back to fetch a record
• Seek to record specified by hash function
• If keys do not match, this must be a
collision
• Keep looking “next door” until keys match
• Alternatively, if you find a blank record,
then you are searching for a record that does
not exist
79
Example
• Assume the divisor is 13
• Hash the records with the following keys
124, 315, 87, 302, 137, 94, 188, 328
0
1
2
3
4
5
6
7
8
9 10 11 12
• Now go back and find some of the same
keys. How is linear probing used now?
80
Linear Probing -- Deleting a
Record
• What happens when one of a sequence of
“next door” records is deleted?
• If we blank the record (set fields to zeros
and null strings)
– we try to fetch a record, linear probing and hit
this record
– a blank record implies the record for which we
search does not exist
– it might be later on in a sequence of linear
probes
81
Linear Probing -- Deleting a
Record
• Solution
Do not blank the record
• Set one of the fields to a sentinel value
strcpy (name, “DELETED”);
• Records so marked could be used again
(next time we are adding a record)
82
Collision Resolution -- Double
Hashing
• When a collision occurs, hash again
• Use same function, altered key or secondary
hash function
• Could hash to same file or to an overflow
file
• What to do when you get “double collision”
– may have to ultimately resort to linear probing
83
Collision Resolution -- Synonym
Chaining
• Use linear probe but have a link field
• Link field points to next synonym record
• Saves multiple sequential file access
through a clump of records
84
Synonym Chaining
• R(abc) = 3
R(xyz) = 4
R(pdq) = 1
R(wow) = 3 (but gets probed to 5)
(make a link from 3 to 5)
pdq
0
1
2
abc
xyz
wow
3
4
5
6
7
8
9
10
85
Collision Resolution -- Bucket
Addressing
• Hash to groups of records called “buckets”
• Collision hashes to another record in the
same bucket
• If the bucket fills up ==> “spillover”
– into next bucket
– or into an overflow area
• Buckets can be kept in order as they arrive
or sorted
Download