4. Field and Record Org

advertisement
Fundamental File Structure
Concepts
Chapter 4
File Processing - Fundamental concepts
MVNC
1
Record and Field Structure



A record is a collection of fields.
A field is used to store information about some
attribute.
The question: when we write records, how do
we organize the fields in the records:
»
»
»
»
so that the information can be recovered
so that we save space
so that we can process efficiently
to maximize record structure flexibility
File Processing - Fundamental concepts
MVNC
2
Field Structure issues

What if
»
»
Field values vary greatly
Fields are optional
File Processing - Fundamental concepts
MVNC
3
Field Delineation methods




Fixed length fields
Include length with field
Separate fields with a delimiter
Include keyword expression to identify each
field
File Processing - Fundamental concepts
MVNC
4
Fixed length fields


Easy to implement - use language record
structures (no parsing)
Fields must be declared at maximum length
needed
10
10
last
first
“Yeakus
Bill
15
15
address
city
123 Pine
Utica
File Processing - Fundamental concepts
MVNC
2
9
state
zip
OH43050
5
Include length with field


Begin field with length indicator
If maximum field length <256, a byte can be
used for length
last
first
address
city
state
zip
Length bytes
Yeakus
Bill
123 Pine
06 59 65 61 6B 75 73 04 42 69 6C 6C 08 31 32 33 20 50 69 6E 64
File Processing - Fundamental concepts
MVNC
6
Separate fields with a
delimiter

Use a special character not used in data
»
»
»

space, comma, tab
Also special ASCII char’s: Field Separator (fs) 1C
Here we use “|”
Also need a end of record delimiter: “#”
“Yeakus|Bill|123 Pine|Utica|OH|43050#“
File Processing - Fundamental concepts
MVNC
7
Include keyword expression




Keywords label each fields
A self-describing structure
Allows LOTS of flexibility
Uses lots of space
“LAST=Yeakus|FIRST=Bill|ADDRESS=123 Pine|
CITY=Utica|STATE=OH|ZIP=43050#“
File Processing - Fundamental concepts
MVNC
8
Optional Fields

Fixed length
»

Field length
»

zero length field
Delimiter
»

Leave blank
Adjacent delimiters
Keywords
»
Just leave out
File Processing - Fundamental concepts
MVNC
9
Reading a stream of fields



Need to break record into fields
Fixed length can simply be read into record
structure
Others must be “parsed” with a parse
algorithm
File Processing - Fundamental concepts
MVNC
10
Record Structures


How do we organize records in a file?
Records can be fixed length or variable length
»
»
»
Fixed length allows simple direct access lookup
Fixed may waste space
Variable - how do we find a records position?
File Processing - Fundamental concepts
MVNC
11
Record Structures



Fixed Length Records
Fixed number of fields in records
Variable length
»
»
»
prefix each record with a length
Use a second file to keep track of record start
positions
Place delimiter between records
File Processing - Fundamental concepts
MVNC
12
Fixed Length Records




All records same length
Record positions can be calculated for direct
access reads.
Does not imply the that the sizes or number of
fields are fixed.
Variable length records would lead to unused
space.
File Processing - Fundamental concepts
MVNC
13
Fixed number of fields in
records


Field size could be fixed or variable
Fixed
»
»

results in fixed size records
simply read directly into “struct”
Variable sized fields
»
»
delimited or field lengths
Simply count fields while parsing
File Processing - Fundamental concepts
MVNC
14
Variable length Records



prefix each record with a length
Use a second file to keep track of record start
positions
Place delimiter between records
File Processing - Fundamental concepts
MVNC
15
Prefix records with a length


Allows true variable length records
Form of prefix:
»
»
»

Character number (fixed length)
Binary number (write integer without conversion)
Must consider Maximum length
No direct access (great for sequencial access)
File Processing - Fundamental concepts
MVNC
16
Index of record start addresses



A second file is simply a list of offsets to
successive records
Since the offsets are fixed length, this file
allows direct access, thereby allow direct
access to main file.
Problem
»
»
Maintaining file (adding and deleting records)
Cost of index
File Processing - Fundamental concepts
MVNC
17
Place delimiter between
records




Special character not used in record
Allows efficient variable size
No direct access
Bible files - use ‘\n’ as delimiter
File Processing - Fundamental concepts
MVNC
18
Binary data in files

Binary reals and integers can be written, and
read, from a file:
»
»
Need to know byte size of variables used.
“tsize” function returns data size
File Processing - Fundamental concepts
MVNC
19
Binary data in files
int rsize;
char rec_buf[MAX];
...
cpystr(rec_buf,”this is a test record”);
rsize = strlen(rec_buf);
write(my_fd,&rsize,tsize(int)); // write the size
write(my_fd,&rec_buf,rsize); // write the record
...
read(my_fd, &rsize,tsize(int)); // read the size
read(my_fd,&rec_buf,rsize); // read the record
File Processing - Fundamental concepts
MVNC
20
Viewing Binary file data

Use the file dump utility (od - octal dump)
»
»
»

od -xc <filename>
x - hex output
c - character output
Useful for viewing what is actually in file
File Processing - Fundamental concepts
MVNC
21
Using Classes to Manipulate
Buffer

Three Classes
» delimited fields
» Length-based fields
» Fixed length fields
File Processing - Fundamental concepts
MVNC
22
Class for Delimited fields

Consider a class to manage delimited text
buffers
» Allows reading and writing of delimited records
» Allows packing and unpacking
File Processing - Fundamental concepts
MVNC
23
Class for Delimited fields
class Person
{
public:
// fields
char LastName [11];
char FirstName [11];
char Address [16];
char City [16];
char State [3];
char ZipCode [10];
// Methods next ...
}
File Processing - Fundamental concepts
MVNC
24
Class for Delimited fields
class DelimTextBuffer
{ public:
DelimTextBuffer (char Delim = '|', int maxBytes = 1000);
int Read (istream &);
int Write (ostream &) const;
int Pack (const char *, int size = -1);
int Unpack (char *);
private:
char Delim;
char DelimStr[2]; // zero terminated string for Delim
char * Buffer; // character array to hold field values
int BufferSize; // size of packed fields
int MaxBytes; // maximum number of characters in the buffer
int NextByte; // packing/unpacking position in buffer
};
File Processing - Fundamental concepts
MVNC
25
Class for Delimited fields

Packing a buffer
Person Bill_Yeakus
DelimitedTextBuffer buffer;
buffer.pack(Bill_Yeakus.LastName);
buffer.pack(Bill_Yeakus.FastName);
…
buffer.pack(Bill_Yeakus.ZipCode);
buffer.Write(stream);
File Processing - Fundamental concepts
MVNC
26
Class for Delimited fields
int DelimTextBuffer :: Pack (const char * str, int size)
// set the value of the next field of the buffer;
// if size = -1 (default) use strlen(str) as Delim of field
{
short len; // length of string to be packed
if (size >= 0) len = size;
else len = strlen (str);
if (len > strlen(str)) // str is too short!
return FALSE;
int start = NextByte; // first character to be packed
NextByte += len + 1;
if (NextByte > MaxBytes) return FALSE;
memcpy (&Buffer[start], str, len);
Buffer [start+len] = Delim; // add delimeter
BufferSize = NextByte;
return TRUE;
}
File Processing - Fundamental concepts MVNC
27
Class for Delimited fields
int DelimTextBuffer :: Write (ostream & stream) const
{
stream . write ((char*)&BufferSize, sizeof(BufferSize));
stream . write (Buffer, BufferSize);
return stream . good ();
}
File Processing - Fundamental concepts
MVNC
28
Class for Delimited fields
int DelimTextBuffer :: Read (istream & stream)
{
Clear ();
stream . read ((char*)&BufferSize, sizeof(BufferSize));
if (stream.fail()) return FALSE;
if (BufferSize > MaxBytes) return FALSE; // buffer overflow
stream . read (Buffer, BufferSize);
return stream . good ();
}
File Processing - Fundamental concepts
MVNC
29
Class for Delimited fields
int DelimTextBuffer :: Unpack (char * str)
// extract the value of the next field of the buffer
{
int len = -1; // length of packed string
int start = NextByte; // first character to be unpacked
for (int i = start; i < BufferSize; i++)
if (Buffer[i] == Delim)
{len = i - start; break;}
if (len == -1) return FALSE; // delimeter not found
NextByte += len + 1;
if (NextByte > BufferSize) return FALSE;
strncpy (str, &Buffer[start], len);
str [len] = 0; // zero termination for string
return TRUE;
}
File Processing - Fundamental concepts
MVNC
30
Class for Delimited fields

Class Person can be extended to provide
specialized packing functions
File Processing - Fundamental concepts
MVNC
31
Class for Delimited fields
int Person::Pack (DelimTextBuffer & Buffer) const
{// pack the fields into a FixedTextBuffer, return TRUE if all
succeed, FALSE o/w
int result;
Buffer . Clear ();
result = Buffer . Pack (LastName);
result = result && Buffer . Pack (FirstName);
result = result && Buffer . Pack (Address);
result = result && Buffer . Pack (City);
result = result && Buffer . Pack (State);
result = result && Buffer . Pack (ZipCode);
return result;
}
File Processing - Fundamental concepts
MVNC
32
Class for Delimited fields
int Person::Unpack
{
int result;
result = Buffer
result = result
result = result
result = result
result = result
result = result
return result;
}
(DelimTextBuffer & Buffer)
. Unpack (LastName);
&& Buffer . Unpack (FirstName);
&& Buffer . Unpack (Address);
&& Buffer . Unpack (City);
&& Buffer . Unpack (State);
&& Buffer . Unpack (ZipCode);
File Processing - Fundamental concepts
MVNC
33
Class for Fixed Length fields
int FixedTextBuffer :: AddField (int fieldSize)
{
if (NumFields == MaxFields) return FALSE;
if (BufferSize + fieldSize > MaxChars) return FALSE;
FieldSize[NumFields] = fieldSize;
NumFields ++;
BufferSize += fieldSize;
return TRUE;
}
File Processing - Fundamental concepts
MVNC
34
Class for Fixed Length fields
int FixedTextBuffer :: Read (istream & stream)
{
stream . read (Buffer, BufferSize);
return stream . good ();
}
File Processing - Fundamental concepts
MVNC
35
Class for Fixed Length fields
int FixedTextBuffer :: Write (ostream & stream)
{
stream . write (Buffer, BufferSize);
return stream . good ();
}
File Processing - Fundamental concepts
MVNC
36
Class for Fixed Length fields
int FixedTextBuffer :: Pack (const char * str)
// set the value of the next field of the buffer;
{
if (NextField == NumFields || !Packing) // buffer is full or not
packing mode
return FALSE;
int len = strlen (str);
int start = NextCharacter; // first byte to be packed
int packSize = FieldSize[NextField]; // number bytes to be packed
strncpy (&Buffer[start], str, packSize);
NextCharacter += packSize;
NextField ++;
// if len < packSize, pad with blanks
for (int i = start + packSize; i < NextCharacter; i ++)
Buffer[start] = ' ';
Buffer [NextCharacter] = 0; // make buffer look like a string
if (NextField == NumFields) // buffer is full
{
Packing = FALSE;
NextField = NextCharacter = 0;
}
return TRUE;
}
File Processing - Fundamental concepts
MVNC
37
Class for Fixed Length fields
int FixedTextBuffer :: Unpack (char * str)
// extract the value of the next field of the buffer
{
if (NextField == NumFields || Packing) // buffer is full or
not unpacking mode
return FALSE;
int start = NextCharacter; // first byte to be unpacked
int packSize = FieldSize[NextField]; // number bytes to be
unpacked
strncpy (str, &Buffer[start], packSize);
str [packSize] = 0; // terminate string with zero
NextCharacter += packSize;
NextField ++;
if (NextField == NumFields) Clear (); // all fields
unpacked
return TRUE;
}
File Processing - Fundamental concepts
MVNC
38
Class for Fixed Length fields
void FixedTextBuffer :: Print (ostream & stream)
{
stream << "Buffer has max fields "<<MaxFields<<" and actual
"<<NumFields<<endl
<<"max bytes "<<MaxChars<<" and Buffer Size
"<<BufferSize<<endl;
for (int i = 0; i < NumFields; i++)
stream <<"\tfield "<<i<<" size "<<FieldSize[i]<<endl;
if (Packing) stream <<"\tPacking\n";
else stream <<"\tnot Packing\n";
stream <<"Contents: '"<<Buffer<<"'"<<endl;
}
File Processing - Fundamental concepts
MVNC
39
Class for Fixed Length fields
class FixedTextBuffer
{ public:
FixedTextBuffer (int maxFields, int maxChars = 1000);
int AddField (int fieldSize);
int Read (istream &);
int Write (ostream &);
int Pack (const char *);
int Unpack (char *);
private:
char * Buffer; // character array to hold field values
int BufferSize; // sum of the sizes of declared fields
int * FieldSize; // array to hold field sizes
int MaxChars; // maximum number of characters in the buffer
int NextCharacter; // packing/unpacking position in buffer
};
File Processing - Fundamental concepts
MVNC
40
Class for Fixed Length fields
int Person::Pack (FixedTextBuffer & Buffer) const
{// pack the fields into a FixedTextBuffer, return TRUE if all
succeed, FALSE o/w
int result;
Buffer . Clear ();
result = Buffer . Pack (LastName);
result = result && Buffer . Pack (FirstName);
result = result && Buffer . Pack (Address);
result = result && Buffer . Pack (City);
result = result && Buffer . Pack (State);
result = result && Buffer . Pack (ZipCode);
return result;
}
File Processing - Fundamental concepts
MVNC
41
Class for Fixed Length fields
int Person::Unpack
{
Clear ();
int result;
result = Buffer
result = result
result = result
result = result
result = result
result = result
return result;
}
(FixedTextBuffer & Buffer)
. Unpack (LastName);
&& Buffer . Unpack (FirstName);
&& Buffer . Unpack (Address);
&& Buffer . Unpack (City);
&& Buffer . Unpack (State);
&& Buffer . Unpack (ZipCode);
File Processing - Fundamental concepts
MVNC
42
Record Access - Keys



Attribute used to identify records
Often used to find records
Standard or canonical form
» rules which keys must conform to
» prevents missing record because key in different
form
» Example:
– all capitals
– Phone in form (nnn) nnn-nnnn
File Processing - Fundamental concepts
MVNC
43
Record Access - Keys

Keys can distinct - uniquely identify records
» Primary keys
» one-to-one relationship between key value and
possible entities represented
» SSN, Student ID

Keys can identify a collection of records
» Secondary keys
» one-to-many relationship
» City, position, department
File Processing - Fundamental concepts
MVNC
44
Record Access - Keys

Primary key desired characteristics
» unique among collection of entities
» dataless - what if some entities have not value of
this type (e.g. SSN)
» unchanging
File Processing - Fundamental concepts
MVNC
45
Record access

Performance of access method
» how do we compare techniques?
» Must be careful what events we count.
» “big-oh” notation gives us a way to factor out all but
the most significant factors
File Processing - Fundamental concepts
MVNC
46
Record Access - timing

Sequential searching
» Consider file of 4000 records
» What if no blocking done, and one record per
block? (500 bytes records, 512 byte blocks)
» What if cluster size set to 8?
» always requires O(n), but search is faster by a
constant factor
File Processing - Fundamental concepts
MVNC
47
Sequential searching


Usually NOT the best method
Sometimes it is best:
»
»
»
»
Searching for some ASCII pattern (grep)
Small files
Files rarely searched
Searching on secondary key, and a large
percentage of records match (say 25%)
File Processing - Fundamental concepts
MVNC
48
Unix Tools for sequential file
processing



cat - display a file
wc - count lines, words, and characters
grep - find lines in file(s) which match regular
expression.
File Processing - Fundamental concepts
MVNC
49
Direct Access


Move “directly” to record without scanning preceding
data
Different languages/OS’s support different models:
» Byte offset model
– Programmer must specify offset to record, and record size to
read.
– Supports variable size records, skip sequential processing
» Relative Record Number (RRN) model
–
–
–
–
File has a fixed record size (declared at creation time)
Records are specified by a record number
File modeled as a collection of components
Higher level of abstraction
File Processing - Fundamental concepts
MVNC
50
Direct Access

Different language support
» RRN support
– PL/I
– COBOL
– Pascal (files are modeled as a collection of components
(records)
– FORTRAN
» Byte offset
–C
File Processing - Fundamental concepts
MVNC
51
Choosing Record Sizes for
Direct Access

Fixed Length Fields
» Very easy to parse records - just read into record
structure!
» Each field must be maximum length needed!
– Thus record must be as long all the maximum fields
10
10
last
first
“Yeakus
Bill
123 Pine
15
15
address
Utica
File Processing - Fundamental concepts
city
OH43050
MVNC
2
9
state
zip
“
52
Choosing Record Sizes for
Direct Access

Variable length fields
» Each field can be any length
» since some can be long, others short, overall
record size may be shorter.
» This gives more flexibility to fields length
» Records must be parsed, space wasted for
delimiter or length bytes.
Yeakus|Bill|123|Pine|Utica|OH43050
Snivenloppinsky|Helmut|12232 Galmentary Avenue|Spotsdale|NY|11232
File Processing - Fundamental concepts
MVNC
53
Header Records

The first record in a direct file may be used to
store special information
»
»
»
»


Number of records used.
Location of first record in key order sequence.
Location of first empty record
File record structure (meta-data)
In languages with the RRN model Pascal,
variant record facility must be used
In C, the header record can be of different
size from the rest of the file records.
File Processing - Fundamental concepts
MVNC
54
Header Records



Consider “update.c” is text.
Header record contains 2 byte number of
record count.
Header size is 32, record size is 64
static struct {
short rec_count;
char fill[30];
} head;
File Processing - Fundamental concepts
MVNC
55
Header Records



Must be written when file created
Must be rewritten when file changed
Must be read when file is opened
File Processing - Fundamental concepts
MVNC
56
File Access and Organization

File Organization
» Variable Length Records
» Fixed Length Records
» Field Structures (size bytes, delimiters, fixed)

File Access
» Sequential access
» Direct access
» Indexed access
File Processing - Fundamental concepts
MVNC
57
File Access and Organization

Interaction between organization and access
» Can the file be divided into fields?
» Is there a higher level of organization to the file
(mete data)?
» Do all records have to have the same number of
fields, bytes?
» How do we distinguish one record from the next?
» How do we recognize if a fixed length record holds
real data or not?
File Processing - Fundamental concepts
MVNC
58
File Access and Organization

There is a often a trade-off between space
and time
» Fixed length records - allow direct access, waste
space
» Variable require sequential search


We also must consider the typical use of the
file - what are the desired access patterns
Selection of a particular organization has
implications on the allowable types of access
File Processing - Fundamental concepts
MVNC
59
Portability and
Standardization

Differences among Languages
» Fixed sized records versus byte addressable
access

Differences among Machine Architectures
» Byte order of binary data
» May be high order or low order byte first
File Processing - Fundamental concepts
MVNC
60
Byte order of binary data

High order first: (Big Endian)
» A long int: say 45 is stored in memory.
» It is stored as: 00 00 00 2D
» Sun’s, Network protocols

Low order first (Little Endian)
» A long int: say 45 is stored in memory.
» It is stored as: 2D 00 00 00
» PC’s, VAX’s
File Processing - Fundamental concepts
MVNC
61
Byte order of binary data



If binary data is written to a file, it is written in
the order stored in memory
If the data is later read by a system with a
different ordering, the number will be
incorrect!
For the sake of portability, files should be
written in an agreed upon format (probably
Big Endian)
File Processing - Fundamental concepts
MVNC
62
Download