Chapter -6: Data Types
Introduction
• A data type defines a collection of data objects and a set of
predefined operations on those objects
• A descriptor is the collection of the attributes of a variable. In an
implementation, a descriptor is a collection of memory cells that
store variable attributes. If the attributes are all static, descriptors
are required only at compile time.
• These descriptors are built by the compiler, as a part of the
symbol table, and are used during compilation.
• For dynamic attributes, part or all of the descriptor must be
maintained during execution. In this case, the descriptor is used
by the run-time system.
• In all cases, descriptors are used for type checking and to build
the code for the allocation and deallocation operations.
• An object represents an instance of a user-defined (abstract data)
type
• One design issue for all data types: What operations are defined
and how are they specified?
Primitive Data Types
• Almost all programming languages provide a set of primitive data
types
• Primitive data types: Those not defined in terms of other data
types
• Some primitive data types are merely reflections of the hardware.
(E.g. integer types)
• Others require only a little non-hardware support for their
implementation.
• The primitive data types are used with one or more type
constructors, to provide the structured types.
Primitive Data Types: Integer
• Almost always an exact reflection of the hardware so the mapping
is trivial
• There may be as many as eight different integer types in a
language
• Java’s signed integer sizes: byte, short, int, long
• C++ and C# include unsigned integer types (without a sign)
• A signed integer value is represented in a computer by a string of
bits, the leftmost of which represents the sign.
• A negative integer could be stored in sign-magnitude notation, in
which the sign bit is set to indicate negative and the remaining bits
represent the absolute value of the number.
• Most computers now use two's complement notation to store
negative integers (take the logical complement of the positive
version of the number and add one).
• E.g. -2 in 8 bits is 11111110:
2 = 00000010
Complement = 11111101
Add 1 = 11111110
To check: the leftmost bit has weight -128
Add the rest of the set bits = 2+4+8+16+32+64 = 126
Result = -128 + 126 = -2
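The two's complement example can be checked with a short sketch. Python is used here only for illustration; the function names are hypothetical and are not part of the notes.

```python
def twos_complement(value: int, bits: int = 8) -> str:
    """Return the two's complement bit string of value in the given width."""
    if not -(1 << (bits - 1)) <= value < (1 << (bits - 1)):
        raise ValueError("value does not fit in the requested width")
    # Masking with 2**bits - 1 maps a negative value to its bit pattern.
    return format(value & ((1 << bits) - 1), f"0{bits}b")


def from_twos_complement(pattern: str) -> int:
    """Decode a bit string; the leftmost bit carries weight -2**(bits-1)."""
    unsigned = int(pattern, 2)
    if pattern[0] == "1":
        return unsigned - (1 << len(pattern))
    return unsigned


minus_two = twos_complement(-2)            # "11111110"
decoded = from_twos_complement(minus_two)  # -2
```

Decoding subtracts 2^bits when the sign bit is set, which is equivalent to giving the leftmost bit the weight -128 in the 8-bit case.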
Primitive Data Types: Floating Point
• Model real numbers, but only as approximations. Floating-point
values are represented as fractions and exponents.
• Languages for scientific use support at least two floating-point
types (e.g., float (4 bytes) and double (8 bytes)).
• The collection of values that can be represented as a
floating-point type is defined in terms of precision and range.
• Precision is the accuracy of the fractional part of a value
measured as number of bits.
• Range is a combination of the range of fractions and range of
exponents.
• Usually exactly like the hardware (i.e. language implementers use
whatever representation is supported by the hardware), but not
always
• Most newer machines use the IEEE Floating-Point
Standard 754 format
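A small sketch of the "approximation" point, assuming a Python implementation whose float is an IEEE 754 double (true on common platforms). The helper name double_bits is hypothetical.

```python
import struct

# 0.1 and 0.2 have no exact binary representation, so their sum is
# only approximately 0.3.
total = 0.1 + 0.2
is_exact = (total == 0.3)


def double_bits(x: float) -> str:
    """Return the 64 bits of x as 'sign exponent fraction' (1, 11, 52 bits)."""
    (raw,) = struct.unpack(">Q", struct.pack(">d", x))
    bits = format(raw, "064b")
    return f"{bits[0]} {bits[1:12]} {bits[12:]}"


one = double_bits(1.0)
```

The 52 fraction bits correspond to the precision described above, and the 11 exponent bits determine the range.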
Primitive Data Types: Decimal
• Most large computers that are designed for business applications
(money) have hardware support for decimal data types
– Essential to COBOL
– C# offers a decimal data type
• Store a fixed number of decimal digits with the decimal point at a
fixed position in the value
• Decimal types are stored using binary codes for the decimal digits
(BCD).
• In some cases, they are stored one digit per byte, but in others
they are packed two digits per byte. Either way, they take more
storage than binary representations. It takes at least 4 bits to code
a decimal digit. E.g. 7=0111, 9=1001, 3=0011, etc. Storing a
six-digit coded decimal number therefore requires 24 bits of memory.
• Operations on decimal values are done in hardware on machines
that have such capabilities; otherwise they are simulated in
software.
• Advantage: accuracy (being able to precisely store decimal
values)
• Disadvantages: limited range (exponents are not allowed), wastes
memory
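As a rough illustration of packed BCD storage, the sketch below packs two decimal digits per byte. The function to_packed_bcd is a hypothetical helper. Python's decimal module is used only to show the accuracy advantage; it is not claimed to use BCD internally.

```python
from decimal import Decimal


def to_packed_bcd(digits: str) -> bytes:
    """Pack a string of decimal digits two per byte, 4 bits per digit."""
    if not (digits.isascii() and digits.isdigit()):
        raise ValueError("only decimal digits can be encoded")
    if len(digits) % 2:
        digits = "0" + digits  # pad to a whole number of bytes
    return bytes(
        (int(digits[i]) << 4) | int(digits[i + 1])
        for i in range(0, len(digits), 2)
    )


packed = to_packed_bcd("793512")  # six digits -> 3 bytes = 24 bits
bits_used = len(packed) * 8

# Decimal arithmetic stores 0.1 exactly, unlike binary floating point.
decimal_sum_is_exact = Decimal("0.1") + Decimal("0.2") == Decimal("0.3")
```

Because each digit occupies its own 4-bit group, the hexadecimal dump of the packed bytes reads the same as the decimal digits.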
Primitive Data Types: Boolean
• Simplest of all types
• Range of values: two elements, one for “true” and one for “false”
• C89 is an exception: it has no Boolean type, and numeric
expressions are used as conditionals. All operands with nonzero
values are considered true, and zero is considered false.
• C99 and C++ have a Boolean type, but they also allow numeric
expressions to be used as if they were Boolean.
• Java and C# do not allow this.
• Boolean types are used to represent switches or flags in programs.
• Could be implemented as bits, but often as bytes
– Advantage: readability (more readable than using integer)
Primitive Data Types: Character
• Stored as numeric codings
• Most commonly used coding: ASCII (stored in 8 bits), which codes
128 characters; Extended ASCII codes 256 characters. Ada uses
Extended ASCII
• An alternative, 16-bit coding: Unicode
– Includes characters from most natural languages
– Originally used in Java
– C# and JavaScript also support Unicode
Character String Types
• Values are sequences of characters
• Design issues:
– Is it a primitive type or just a special kind of array?
– Should the length of strings be static or dynamic?
Character String Types Operations
• If strings are not defined as a primitive type, string data is usually
stored in arrays of single characters, and referenced as such in the
language (C, C++).
• C and C++ use char arrays to store character strings and provide a
collection of string operations through a standard library whose
header file is string.h
• Character strings are terminated with a special character, null,
which is represented by zero.
• char *str = "apples"; : str is a char pointer set to point at the
string of characters apples0, where 0 is the null character. This
initialization of str is legal because character string literals are
represented by char pointers, rather than the string itself.
• Typical operations:
– Assignment and copying. In C, C++
– Comparison (=, >, etc.)
– Catenation
– Substring reference
– Pattern matching
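The null-terminated representation used by C can be imitated with a byte sequence. This is only a sketch: c_strlen is a hypothetical stand-in that mimics what C's strlen does, namely scanning for the terminating zero instead of reading a stored length.

```python
def c_strlen(memory: bytes, start: int = 0) -> int:
    """Count bytes from start up to (not including) the first null byte."""
    length = 0
    while memory[start + length] != 0:
        length += 1
    return length


# "apples" followed by the null character, then unrelated bytes.
memory = b"apples\x00junk"
length = c_strlen(memory)
```

The cost of finding the length grows with the string, which is the price of not storing the length in a descriptor.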
Character String Type in Certain Languages
• C and C++ (C++ also supports strings through the string class of
its standard class library, in addition to arrays of characters)
– Not primitive
– Use char arrays and a library of functions that provide
operations (header file string.h)
– The most commonly used library functions for character strings
in C and C++ are strcpy, strcmp, strlen, and strcat
• SNOBOL4 (a string manipulation language)
– Primitive
– Many operations, including elaborate pattern matching
• Java
– Primitive via the String class
• Fortran 95: treats strings as a primitive type and provides
assignment, relational operators, catenation, and substring
reference operations on them (slices).
Character String Length Options
• There are several design choices regarding the length of string
values.
• Static length string: the length is static and set when the
string is created. COBOL, Java's String class. Java also provides
the StringBuffer class, whose objects hold changeable values.
• Limited Dynamic Length: allowing strings to have varying length
up to a declared and fixed maximum set by the variable’s
definition. Such string variables can store any number of
characters between zero and the maximum. C and C++
– In C-based language, a special character is used to indicate
the end of a string’s characters, rather than maintaining the
length
• Dynamic length strings: allow strings to have varying length with
no maximum. This option requires the overhead of dynamic
storage allocation and deallocation but provides maximum
flexibility. SNOBOL4, Perl, JavaScript
• Ada supports all three string length options
Character String Type Evaluation
• Aid to writability. Dealing with strings as arrays is more difficult
than dealing with primitive string type.
• As a primitive type with static length, they are inexpensive to
provide. Providing strings through a standard library is nearly as
convenient as having them as a primitive type.
• Dynamic length is nice and flexible, but is it worth the expense?
The overhead of the implementation must be weighed against
the additional flexibility.
Character String Implementation
• Static length: a compile-time descriptor is needed only during
compilation. It has three fields. See Figure-1
• Limited dynamic length: may need a run-time descriptor to store
both the fixed maximum length and the current length. See
Figure-2. (This is not needed in C and C++, because the end of a
string is marked with the null character. They do not need the
maximum length, because index values in array references are not
range-checked in these languages.) Static length and limited
dynamic length strings require no special dynamic storage
allocation.
• Dynamic length: need run-time descriptor; allocation/deallocation
is the biggest implementation problem which requires a complex
storage management. Length and storage to which it is bound
grow and shrink dynamically.
• Two approaches to support dynamic allocation/deallocation
- Strings can be stored in a linked list: drawbacks are extra
storage occupied by the links, and the complexity of string
operations.
- Or store complete strings in adjacent storage cells. Problem:
when a string grows and the adjacent space is not available.
Solution is to find another hole that fits the new string, and
deallocate the previous hole.
- Although linked-list method requires more storage, the
dynamic allocation process is simple, but some string
operations are slow due to pointer chasing (sequential
access)
- Using adjacent memory for complete strings results in faster
string operations and requires significantly less storage, but
the allocation/deallocation process is slower.
Compile- and Run-Time Descriptors
[Figure-1: Compile-time descriptor for static strings (surviving
field labels: type name, address of first character)]
[Figure-2: Run-time descriptor for limited dynamic strings]
User-Defined Ordinal Types
• An ordinal type is one in which the range of possible values can
be easily associated with the set of positive integers
• Examples of primitive ordinal types in Java
– Integer, char, Boolean
• In some languages, users can define two kinds of ordinal types:
(enumerated and subrange).
Enumeration Types
• All possible values, which are named constants, are provided in
the definition
• C# example
enum days {mon, tue, wed, thu, fri, sat, sun};
the enumeration constants are typically implicitly assigned the integer
values, 0, 1,… etc.
• Design issues
– Is an enumeration constant allowed to appear in more than
one type definition, and if so, how is the type of an
occurrence of that constant checked?
– Are enumeration values coerced to integer?
– Is any other type coerced to an enumeration type?
– All these design issues are related to type checking.
• If an enumeration variable is coerced to a numeric type, there is
little control over the range of legal operations or its range of
values.
• If an int type value is coerced to an enumeration type, an
enumeration type variable could be assigned any integer value,
whether it represented an enumeration constant or not.
• Design
- If a language does not have enumeration types, we could
simulate them
o e.g. Fortran 77: INTEGER RED, BLUE
DATA RED, BLUE /0, 1/
The problem here: since we did not define a type for our
colors, there is no type checking when they are used. E.g. it
would be legal to add the two together.
They could also be combined with any other numeric type
operand using any arithmetic operator. Also, because they are
just variables, they could be assigned any integer value,
destroying the relationship with the colors. This latter problem
could be solved by making them named constants.
o C and Pascal were the first widely used languages to include an
enumeration data type. C++ includes C's enumeration type.
C++ could have
enum colors {red, blue, green, yellow, black};
colors mycolor = blue, yourcolor = red;
Enumeration values are coerced to int when they are put in
integer context. E.g. if current value of mycolor is blue, the
statement mycolor++ would assign green to mycolor.
o C++ allows enumeration constants to be assigned to variables
of any numeric type, though that would most often be an error
- No other type value is coerced to an enumeration type in C++.
mycolor = 4; is illegal; it would be legal only if the right-hand
side were cast to the colors type.
- C++ enumeration constants can appear in only one enumeration
type in the same referencing environment.
o Ada, enumeration literals are allowed to appear in more than
one declaration in the same referencing environment. These are
called overloaded literals.
- The rules for solving overloading must be determined from the
context. E.g. if an overloaded literal and an enumeration variable
are compared, the literal’s type is resolved to be that of the
variable.
- Because neither the enumeration literals nor the enumeration
variables in Ada are coerced to integer, both the range of
operations and the range of values of enumeration types are
restricted, allowing many errors to be compiler-detected.
Evaluation of Enumerated Type
• Aid to readability, e.g., no need to code a color as a number.
Named values are easily recognized.
• Aid to reliability, e.g., the compiler can check (in Ada, C#, and
Java 5.0):
– Operations (don't allow colors to be added). No arithmetic
operations are legal on enumeration types.
– No enumeration variable can be assigned a value outside its
defined range
– Ada, C#, and Java 5.0 provide better support for
enumeration than C++ because enumeration type variables
in these languages are not coerced into integer types
• C treats enumeration variables like integer variables.
• In C++, numeric values can be assigned to enumeration type
variables only if they are cast to the type of the assigned variable.
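The difference between enumerations that are coerced to integers and those that are not can be sketched with Python's enum module, used here only as an analogy. Color, CColor, try_add and is_valid_color are hypothetical names.

```python
from enum import Enum, IntEnum


class Color(Enum):        # not coerced to int (in the spirit of Ada, C#, Java 5.0)
    RED = 0
    BLUE = 1


class CColor(IntEnum):    # behaves like an int (in the spirit of C)
    RED = 0
    BLUE = 1


def try_add(a, b):
    """Return a + b, or None if the operation is rejected as a type error."""
    try:
        return a + b
    except TypeError:
        return None


def is_valid_color(number: int) -> bool:
    """Check whether an integer corresponds to an enumeration constant."""
    try:
        Color(number)
        return True
    except ValueError:
        return False


strict_sum = try_add(Color.RED, Color.BLUE)    # rejected
loose_sum = try_add(CColor.RED, CColor.BLUE)   # plain integer arithmetic
```

With the strict type, adding two colors is rejected and an out-of-range integer cannot become a color; with the integer-like type, both checks are lost. Note that Python detects these errors at run time, whereas the languages above can detect them at compile time.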
Subrange Types
• An ordered contiguous subsequence of an ordinal type
– Example: 12..18 is a subrange of integer type. Introduced in
Pascal, included in Ada.
• Pascal example
type
strIndex = 0..mastrLength;
var
I: strIndex;
• Ada's design
- In Ada, subranges are included in the category of types
called subtypes.
type Days is (mon, tue, wed, thu, fri, sat, sun);
subtype Weekdays is Days range mon..fri;
subtype Index is Integer range 1..100;
- All operations defined for the parent type are also defined for the
subtype, except assignment of values outside the specified range. E.g.
in the following:
Day1: Days;
Day2: Weekdays;
Day2 := Day1;
- The assignment is legal unless the value of Day1 is sat or sun.
- The compiler must generate range-checking code for every
assignment to a subrange variable, so subranges require run-time
range checking.
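The range-checking code that a compiler inserts for subrange assignments can be imitated as follows. This is a minimal sketch: the Subrange class and the DAYS list are hypothetical, and values are represented by their positions in the parent type.

```python
class Subrange:
    """A variable restricted to low..high; every assignment is checked."""

    def __init__(self, low: int, high: int, value: int):
        self.low, self.high = low, high
        self.assign(value)

    def assign(self, value: int) -> None:
        # This is the check a compiler would insert before the assignment.
        if not self.low <= value <= self.high:
            raise ValueError(f"{value} is outside {self.low}..{self.high}")
        self.value = value


DAYS = ["mon", "tue", "wed", "thu", "fri", "sat", "sun"]

# Analogue of: subtype Weekdays is Days range mon..fri;
day2 = Subrange(DAYS.index("mon"), DAYS.index("fri"), DAYS.index("mon"))
day2.assign(DAYS.index("wed"))       # legal: wed is within mon..fri
```

Assigning the position of sat or sun raises an error and leaves the variable unchanged, which mirrors the run-time check described above.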
Subrange Evaluation
• Aid to readability
– Make it clear to the readers that variables of subrange can
store only certain range of values
• Reliability
– Assigning a value to a subrange variable that is outside the
specified range is detected as an error
Implementation of User-Defined Ordinal Types
• Enumeration types are implemented as integers
• Subrange types are implemented like the parent types with code
inserted (by the compiler) to restrict assignments to subrange
variables
Array Types
• An array is an aggregate of homogeneous data elements in which
an individual element is identified by its position in the aggregate,
relative to the first element.
• A reference to an array element in a program often includes one
or more non-constant subscripts. Such references require a run-time
calculation to determine the memory location being referenced.
Array Design Issues
• What types are legal for subscripts?
• Are subscripting expressions in element references range
checked?
• When are subscript ranges bound?
• When does allocation take place?
• What is the maximum number of subscripts?
• Can array objects be initialized?
• Are any kind of slices allowed?
Array Indexing
• A specific element of an array is referenced by the aggregate name
and subscripts (or indexes).
• Indexing (or subscripting) is a mapping from indices to elements
array_name (index_value_list) → an element
• Index Syntax
– FORTRAN, PL/I, Ada use parentheses
• Ada explicitly uses parentheses to show uniformity
between array references and function calls because
both are mappings. E.g. sum:=sum+B(I);
• The reader then needs additional information to determine
whether B(I) is a function call or an array reference,
which reduces readability.
– Most other languages use brackets
– Two distinct types are involved in an array type: the element
type and the type of the subscripts.
Arrays Index (Subscript) Types
• FORTRAN, C: integer only
• Pascal: any ordinal type (integer, Boolean, char, enumeration)
• Ada: integer or enumeration (includes Boolean and char)
• Java: integer types only
• C, C++, Perl, and Fortran do not specify range checking of
subscripts
• Java, ML, C# specify range checking
• Ada checks the range of all subscripts, but this feature can be
disabled by the programmer.
Subscript Binding and Array Categories
• The binding of the subscript type to an array is usually static, but
the subscript value ranges are sometimes dynamically bound.
• In some languages, the lower bound of the subscript range is
implicit. E.g. in C-based languages it is fixed at zero, in Fortran it
defaults to 1, and in Pascal subscript ranges must be specified by
the programmer.
• There are five categories of arrays, based on the binding to
subscript value ranges and the binding to storage.
• Static: subscript ranges are statically bound and storage allocation
is static (before run-time)
– Advantage: efficiency (no dynamic allocation)
• Fixed stack-dynamic: subscript ranges are statically bound, but
the allocation is done at declaration elaboration time during
execution
– Advantage: space efficiency. A large array in one
subprogram can use the same space as a large array in a
different subprogram, as long as both subprograms are not
active at the same time.
• Stack-dynamic: subscript ranges are dynamically bound and the
storage allocation is dynamic (done at run-time). Once the
subscript range is bound and the storage is allocated, they remain
fixed during the lifetime of the variable.
– Advantage: flexibility (the size of an array need not be
known until the array is to be used)
• Fixed heap-dynamic: similar to fixed stack-dynamic: the subscript
range and the storage binding are dynamic but fixed after
allocation. The differences are that the bindings are done when
the user program requests them, rather than at elaboration time,
and the storage is allocated from the heap, rather than the stack.
• Heap-dynamic: binding of subscript ranges and storage allocation
is dynamic and can change any number of times during the
array's lifetime.
– Advantage: flexibility (arrays can grow or shrink during
program execution)
Examples of the categories:
o C and C++ arrays that include the static modifier are static
o C and C++ arrays without the static modifier are fixed
stack-dynamic
o Ada arrays can be stack-dynamic. E.g.:
Get(List_Len);
declare
List : array (1..List_Len) of Integer;
begin
...
end;
The user inputs the number of desired elements for the array List.
The elements are then dynamically allocated when execution reaches
the declare block. When execution reaches the end of the block, the
List array is deallocated.
o C and C++ provide fixed heap-dynamic arrays.
- malloc and free (general heap allocation and deallocation
operations) can be used for C arrays.
- C++ uses the operators new and delete to manage heap storage.
- Fortran 95 supports fixed heap-dynamic arrays, as does C#.
- In Java all arrays are fixed heap-dynamic arrays. Once created,
they keep the same subscript ranges and storage.
o C# includes a second array class, ArrayList, that provides
heap-dynamic arrays. Objects of this class are created without any
elements.
ArrayList intList = new ArrayList();
Elements are added to this object with the Add method.
intList.Add(nextOne);
o Perl and JavaScript support heap-dynamic arrays. Arrays
implicitly grow whenever assignments are made to elements
beyond the last current element, and can be shrunk to no elements
by assigning them the empty aggregate, ().
e.g.: In Perl we could create an array of five numbers with
@list = (1, 2, 4, 7, 19);
The array could be lengthened with the push function
push(@list, 13, 17);
to become (1, 2, 4, 7, 19, 13, 17), and emptied with
@list = ();
Array Initialization
• Some languages allow initialization at the time of storage
allocation
– C, C++, Java, C# example
int list [] = {4, 5, 7, 83}; the compiler sets the length of the array
– Character strings in C and C++ are implemented as arrays of char
char name [] = "freddie"; The array name will have 8
elements, because all strings are terminated with a null character
(zero), which is implicitly supplied by the system.
– Arrays of strings in C and C++ can be initialized with string
literals
char *names [] = {"Bob", "Jake", "Joe"};
– Java initialization of String objects
String[] names = {"Bob", "Jake", "Joe"};
Arrays Operations
• APL provides the most powerful array processing operations for
vectors and matrixes as well as unary operators (for example, to
reverse column elements). E.g. A+B is valid expression, where A
and B are scalar variables, vectors, or matrixes.
• Ada allows array assignment but also catenation (&). Catenation
is defined between two single-dimensioned arrays and between a
single-dimensioned array and a scalar.
• Fortran provides operations that are called elemental because
they are applied between pairs of array elements
– For example, + operator between two arrays results in an
array of the sums of the element pairs of the two arrays
Rectangular and Jagged Arrays
• A rectangular array is a multi-dimensioned array in which all of
the rows have the same number of elements and all columns have
the same number of elements
• A jagged matrix has rows with varying number of elements. E.g.
a jagged matrix may consist of 3 rows, one with 5 elements, one
with 7 elements, and one with 12 elements. This also applies to
the columns or higher dimensions.
– Jagged arrays are made possible when multi-dimensioned
arrays actually appear as arrays of arrays
• C, C++, and Java support jagged arrays but not rectangular
arrays. A reference to an element of a multidimensional array uses
a separate pair of brackets for each dimension. E.g.
myArray[3][7];
• Fortran and Ada support rectangular arrays. All subscript
expressions in references to elements are placed in a single pair
of delimiters. E.g. myArray[3, 7];
• C# supports both.
Slices
• A slice of an array is some substructure of that array; e.g. if A is a
matrix, the 1st row of A is one possible slice. The last row and the
1st column are also possible slices.
• It is not a new data type, it is nothing more than a referencing
mechanism
• Slices are only useful in languages that have array operations,
(i.e. if arrays cannot be manipulated as units, that language has no
use for slices).
Slice Examples
• Fortran 95
Integer, Dimension (10) :: Vector
Integer, Dimension (3, 3) :: Mat
Integer, Dimension (3, 3, 4) :: Cube
Remember that the default lower bound for Fortran arrays is 1.
Vector (3:6) is a four-element array
Mat(:, 2) refers to the 2nd column of Mat.
Mat(3, :) refers to the 3rd row of Mat.
All of these references can be used as single-dimensioned arrays.
References to all array slices are treated as if they were arrays of the
remaining dimensionality.
[Figure: Slice examples in Fortran 95]
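The Fortran 95 slice references above can be mirrored with Python list slicing. This is only an analogy: Python indices are 0-based with an exclusive upper bound, and these expressions copy the elements rather than acting as a pure referencing mechanism. The sample values are made up.

```python
vector = list(range(1, 11))          # stands in for Vector(1:10), holding 1..10
mat = [[11, 12, 13],
       [21, 22, 23],
       [31, 32, 33]]                 # stands in for Mat(3, 3), as mat[row][column]

# Fortran Vector(3:6): 1-based and inclusive, so elements 3, 4, 5 and 6.
vector_3_to_6 = vector[2:6]

# Fortran Mat(:, 2): the 2nd column.
column_2 = [row[1] for row in mat]

# Fortran Mat(3, :): the 3rd row.
row_3 = mat[2]
```

Each result is a one-dimensional list, matching the rule that a slice is treated as an array of the remaining dimensionality.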
Implementation of Arrays
• Implementing arrays requires more compile-time effort than
implementing simple types (e.g. integer). The code that allows
access to array elements must be generated at compile time.
At run time, this code must be executed to produce element
addresses.
• An access function maps subscript expressions to an address in
the array
• A single-dimensioned array is a list of adjacent memory cells.
Suppose the lower bound of the array list is lower_bound (e.g. 1).
• Access function for single-dimensioned arrays:
address(list[k]) = address (list[lower_bound])
+ ((k-lower_bound) * element_size)
• This can be rearranged as
address(list[k]) = (address (list[lower_bound])
- (lower_bound * element_size)) + (k * element_size)
where the first operand is the constant part of the access function.
• If the element type is statically bound and the array is statically
bound to storage, then the value of the constant part can be
computed before run time.
• If the base, or beginning address, of the array is not known until
run time, the subtraction must be done when the array is
allocated.
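The single-dimensioned access function can be written out directly. The sketch below uses hypothetical function names and made-up addresses. It also shows the rearranged form in which the part that does not depend on k (base address minus lower_bound * element_size) is computed once.

```python
def element_address(base: int, lower_bound: int, element_size: int, k: int) -> int:
    """address(list[k]) = address(list[lower_bound]) + (k - lower_bound) * element_size"""
    return base + (k - lower_bound) * element_size


def constant_part(base: int, lower_bound: int, element_size: int) -> int:
    """The part of the access function that does not depend on k."""
    return base - lower_bound * element_size


# Assumed values: array starts at address 1000, lower bound 1, 4-byte elements.
direct = element_address(1000, 1, 4, 3)
via_constant = constant_part(1000, 1, 4) + 3 * 4
```

Both computations give the same address; the second needs only one multiplication and one addition per reference once the constant part is known.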
Accessing Multi-dimensioned Arrays
• Values of data types that have two or more dimensions must be
mapped onto the single-dimensional memory.
• Two common ways:
– Row major order (by rows) – used in most languages
– column major order (by columns) – used in Fortran
Locating an Element in a Multi-dimensioned Array
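• General format (two-dimensional array in row major order, where
n is the number of elements per row)
Location (a[i,j]) = address of a [row_lb,col_lb] + (((i - row_lb) * n) +
(j - col_lb)) * element_size
- Row major order
$$\mathrm{loc}(X(i_1, i_2, \ldots, i_n)) = \mathrm{loc}(X(l_1, l_2, \ldots, l_n)) + (i_1 - l_1)\,u_2 u_3 \cdots u_n + (i_2 - l_2)\,u_3 u_4 \cdots u_n + \cdots + (i_{n-1} - l_{n-1})\,u_n + (i_n - l_n)$$
- Column major order
$$\mathrm{loc}(X(i_1, i_2, \ldots, i_n)) = \mathrm{loc}(X(l_1, l_2, \ldots, l_n)) + (i_n - l_n)\,u_1 u_2 \cdots u_{n-1} + (i_{n-1} - l_{n-1})\,u_1 u_2 \cdots u_{n-2} + \cdots + (i_1 - l_1)$$
Here l_k is the lower bound of dimension k and u_k is taken to be the
number of elements in dimension k. These two formulas count offsets
in elements; for larger elements the offset is multiplied by
element_size, as in the general format above.
- For a matrix in row major order, the number of elements that precede
an element is the number of rows above the element times the size
of the row, plus the number of elements to the left of the element.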
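The row major and column major formulas can be evaluated with a short loop. In this sketch the function names are hypothetical, sizes[k] is assumed to be the number of elements in dimension k, and the result is an offset counted in elements.

```python
def row_major_offset(indices, lower_bounds, sizes):
    """Elements that precede X(indices) when the last subscript varies fastest."""
    offset = 0
    for i, low, size in zip(indices, lower_bounds, sizes):
        offset = offset * size + (i - low)
    return offset


def column_major_offset(indices, lower_bounds, sizes):
    """Elements that precede X(indices) when the first subscript varies fastest."""
    offset = 0
    for i, low, size in zip(reversed(indices), reversed(lower_bounds), reversed(sizes)):
        offset = offset * size + (i - low)
    return offset


# A 3 x 4 matrix with lower bounds of 1, element (2, 3).
row_major = row_major_offset((2, 3), (1, 1), (3, 4))        # 1 row above * 4 + 2 to the left
column_major = column_major_offset((2, 3), (1, 1), (3, 4))  # 2 columns before * 3 + 1 above
```

An address is obtained by multiplying the offset by the element size and adding the address of the first element.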
Compile-Time Descriptors
[Figures: compile-time descriptor for a single-dimensioned array;
compile-time descriptor for a multi-dimensional array]
Associative Arrays
• An associative array is an unordered collection of data elements
that are indexed by an equal number of values called keys. In
non-associative arrays, the indices never need to be stored, because
of their regularity.
– In an associative array, user defined keys must be stored in
the structure. So, each element of an associative array is a
pair of entities (key, value).
• Design issues: What is the form of references to elements
Associative Arrays in Perl
• Are called hashes, because elements are stored and retrieved with
hash functions.
• Names begin with %; literals are delimited by parentheses
%hi_temps = ("Mon" => 77, "Tue" => 79, "Wed" => 65, …);
• Subscripting is done using braces and keys.
• The key value is placed in braces and the hash name is replaced
by a scalar variable name that is the same except for the first
character.
$hi_temps{"Wed"} = 83;
– Elements can be removed with delete
delete $hi_temps{"Tue"};
– The entire hash can be emptied by assigning the empty literal to it
%hi_temps = ();
– The size of a Perl hash is dynamic (grows and shrinks).
– The exists operator returns true or false, depending on whether its
operand key is an element in the hash
if (exists $hi_temps{"Tue"}) …
– PHP’s arrays are both normal arrays and associative.
– A hash is much better than an array if searches of the elements are
required, because the implicit hashing operation used to access
hash elements is very efficient. On the other hand, if every element
of a list must be processed, it would be more efficient to use an
array.
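For comparison, the Perl hash operations above have close analogues in Python's built-in dict type, which is also implemented with hashing. This is a sketch only; the snapshot variable is added just to keep a copy of the contents before the hash is emptied.

```python
hi_temps = {"Mon": 77, "Tue": 79, "Wed": 65}

hi_temps["Wed"] = 83               # $hi_temps{"Wed"} = 83;
del hi_temps["Tue"]                # delete $hi_temps{"Tue"};
tue_exists = "Tue" in hi_temps     # exists $hi_temps{"Tue"}

snapshot = dict(hi_temps)          # copy of the (key, value) pairs
hi_temps.clear()                   # %hi_temps = ();
```

Each stored element is a (key, value) pair, and the structure grows and shrinks as pairs are added and removed.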
Record Types
• A record is a possibly heterogeneous aggregate of data elements
in which the individual elements are identified by names
• Design issues:
– What is the syntactic form of references to the fields?
– Are elliptical references allowed?
Definition of Records
• COBOL uses level numbers to show nested records; others use
recursive definition
01 EMP-REC.
02 EMP-NAME.
05 FIRST PIC X(20).
05 MID PIC X(10).
05 LAST PIC X(20).
02 HOURLY-RATE PIC 99V99.
• Ada: Record structures are indicated in an orthogonal way
type Emp_Rec_Type is record
First: String (1..20);
Mid: String (1..10);
Last: String (1..20);
Hourly_Rate: Float;
end record;
Emp_Rec: Emp_Rec_Type;
• Record Field References
1. COBOL
field_name OF record_name_1 OF ... OF record_name_n; where
record_name_1 is the smallest or innermost record that contains the
field. Ex:
MID OF EMP-NAME OF EMP-REC
2. Others (dot notation)
record_name_1.record_name_2. ... record_name_n.field_name
Ex: Employee_Record.Employee_name.Mid
References to Records
• Most languages use dot notation
Ex: Emp_Rec.Mid references the field Mid in the Ada record example
• Fully qualified references must include all record names
• Elliptical references allow leaving out record names as long as the
reference is unambiguous, for example in COBOL
FIRST, FIRST OF EMP-NAME, and FIRST of EMP-REC are
elliptical references to the employee’s first name
Operations on Records
• Assignment is very common if the types are identical
• Ada allows record comparison for equality or inequality.
• Ada records can be initialized with aggregate literals
• COBOL provides the MOVE CORRESPONDING statement, which copies
each field of the source record to the corresponding field in the
target record
Evaluation and Comparison to Arrays
• The design of records is straightforward and safe.
• The only aspect of records that is not clearly readable is the
elliptical references allowed by COBOL.
• Records are used when collection of data values is heterogeneous
and the different fields are not processed in the same way.
• Access to array elements is much slower than access to record
fields, because subscripts are dynamic (field names are static)
• Dynamic subscripts could be used with record field access, but it
would disallow type checking and it would be much slower
Implementation of Record Type
• The fields of a record are stored in adjacent memory locations, but
because the sizes of the fields are not necessarily the same, the
access method used for arrays is not used for records.
• Instead, an offset address relative to the beginning of the record
is associated with each field, and field access is accomplished by
using these offsets.
[Figure: Compile-time descriptor for a record]
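Field offsets can be inspected with Python's ctypes module, which lays out a Structure the way a C compiler on the same platform would. This is a sketch: the field names loosely follow the Ada Emp_Rec_Type example, and the exact offset of the last field depends on platform alignment rules.

```python
import ctypes


class EmpRec(ctypes.Structure):
    """A record with fields of different sizes, laid out as in C."""
    _fields_ = [
        ("first", ctypes.c_char * 20),
        ("mid", ctypes.c_char * 10),
        ("last", ctypes.c_char * 20),
        ("hourly_rate", ctypes.c_float),
    ]


# Offset of each field relative to the beginning of the record.
offsets = {name: getattr(EmpRec, name).offset for name, _ in EmpRec._fields_}
record_size = ctypes.sizeof(EmpRec)
```

The offsets are constants known before run time, which is why access to a record field does not need the run-time subscript calculation used for arrays. Padding may be inserted before hourly_rate so that it starts on a properly aligned address.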
Unions Types
• A union is a type whose variables are allowed to store different
type values at different times during execution. Example: a table of
constants for a compiler.
• Design issues
– Should type checking be required? Note that any such type
checking must be dynamic.
– Should unions be embedded in records?
Discriminated vs. Free Unions
• Fortran, C, and C++ provide union constructs in which there is no
language support for type checking; the union in these languages
is called free union because programmers are allowed complete
freedom from type checking in their use.
• Type checking of unions requires that each union include a type
indicator called a discriminant or tag (discriminated union).
ALGOL68 was the 1st language to provide it.
– Supported by Ada
Ada Union Types
• Ada's design for discriminated unions allows the user to specify
variables of a variant record type that will store only one of the
possible type values in the variant. In this way the user can tell the
system when the type checking can be static. Such a restricted
variable is called a constrained variant variable.
• Unconstrained variant records in Ada allow the values of their
variants to change type during execution.
• The type of the variant can be changed only by assigning the entire
record, including the discriminant.
Ex: Consider Ada Variant record
type Shape is (Circle, Triangle, Rectangle);
type Colors is (Red, Green, Blue);
type Figure (Form: Shape) is record
Filled: Boolean;
Color: Colors;
case Form is
when Circle => Diameter: Float;
when Triangle =>
Leftside, Rightside: Integer;
Angle: Float;
when Rectangle => Side1, Side2: Integer;
end case;
end record;
The following statements declare variables of type Figure:
Figure_1: Figure; -- Unconstrained variant record that has no initial
-- value. Its type can be changed by assignment of a whole record.
Figure_1 := (Filled => True,
Color => Blue,
Form => Rectangle,
Side1 => 12,
Side2 => 3);
Figure_2: Figure(Form => Triangle); -- Is constrained to be a triangle
-- and cannot be changed to another variant.
Ada Union Type Illustrated
[Figure: A discriminated union of three shape variables (assume all
the variables are the same size)]
Evaluation of Unions
• Potentially unsafe construct
– Free unions do not allow type checking. This is one reason why
Fortran, C, and C++ are not strongly typed
• Java and C# do not support unions
– Reflective of growing concerns for safety in programming
languages
• Discriminated unions are implemented by simply using the same
address for every possible variant. Sufficient storage for the largest
variant is allocated.
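The role of the discriminant can be sketched as a run-time check on every field access. This is not how Ada implements variant records. It is a hypothetical Python class whose variant names follow the Figure example, showing only the idea that the tag decides which fields are legal.

```python
class Figure:
    """A tagged (discriminated) union: 'form' is the discriminant."""

    FIELDS = {
        "circle": {"diameter"},
        "triangle": {"leftside", "rightside", "angle"},
        "rectangle": {"side1", "side2"},
    }

    def __init__(self, form: str, **variant):
        if set(variant) != self.FIELDS[form]:
            raise TypeError(f"wrong fields for variant {form!r}")
        self.form = form
        self._variant = dict(variant)

    def get(self, field: str):
        # Dynamic type check: the field must belong to the current variant.
        if field not in self.FIELDS[self.form]:
            raise TypeError(f"{field!r} is not a field of a {self.form}")
        return self._variant[field]


figure_1 = Figure("rectangle", side1=12, side2=3)
side = figure_1.get("side1")
```

Asking a rectangle for its diameter is rejected by the tag check. A free union has no tag, so the same access would silently reinterpret whatever is stored.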
Pointer and Reference Types
• A pointer type variable has a range of values that consists of
memory addresses and a special value, nil. The value nil is not a
valid address and is used to indicate that a pointer cannot
currently be used to reference any memory cell.
• Provide the power of indirect addressing
• Provide a way to manage dynamic memory
• A pointer can be used to access a location in the area where
storage is dynamically created or allocated (usually called a heap)
• Variables that are dynamically allocated from the heap are called
heap-dynamic variables. They do not have identifiers associated
with them and can be referenced only by pointer or reference type
variables.
• Variables without names are called anonymous variables.
• Pointers are not structured types, although they are defined using
a type operator (* in C and C++, access in Ada).
• Pointers are different from scalar variables because they are most
often used to reference some other variable, rather than being
used to store data of some sort.
• Pointers add writability to a language (dynamic structures such as
trees and linked lists).
Design Issues of Pointers
• What are the scope and lifetime of a pointer variable?
• What is the lifetime of a heap-dynamic variable?
• Are pointers restricted as to the type of value to which they can
point?
• Are pointers used for dynamic storage management, indirect
addressing, or both?
• Should the language support pointer types, reference types, or
both?
Pointer Operations
• Two fundamental operations: assignment and dereferencing
• Assignment is used to set a pointer variable’s value to some
useful address
• Dereferencing yields the value stored at the location represented
by the pointer’s value
– Dereferencing can be explicit or implicit
– C++ uses an explicit operation via *
j = *ptr
sets j to the value located at ptr
Pointer Assignment Illustrated
The assignment operation j = *ptr
When pointers point to records, the syntax of the references to the
fields of these records varies among languages.
In C and C++ there are two ways. If a pointer variable p points to a
record with a field named age, the field can be referenced as
(*p).age or, equivalently, as p->age.
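The two field-reference forms and the explicit dereference j = *ptr can be checked with a short C++ sketch. The struct Person and the function names are assumptions made for illustration only.

```cpp
// Hypothetical record type with a field named age.
struct Person { int age; };

// Both functions read the same field through the pointer p.
int age_via_star(const Person* p)  { return (*p).age; }  // dereference, then select
int age_via_arrow(const Person* p) { return p->age; }    // shorthand for the same thing

// Explicit dereferencing as in j = *ptr: yields the value stored at
// the address held in ptr.
int deref(const int* ptr) { return *ptr; }
```

For any valid Person x, age_via_star(&x) and age_via_arrow(&x) return the same value.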
Problems with Pointers
• Dangling pointers (dangerous)
– A dangling pointer points to a heap-dynamic variable that has
been de-allocated. Dangling pointers are dangerous for several
reasons:
1. The location being pointed to may have been reallocated to some
new heap-dynamic variable. If the new variable is not the same
type as the old one, type checks of uses of the dangling pointer
are invalid.
2. Even if the new one is the same type, its new value will bear no
relationship to the old pointer’s dereferenced value.
3. If the dangling pointer is used to change the heap-dynamic
variable, the value of the new heap-dynamic variable will be
destroyed.
4. It is possible that the location is now being temporarily used by
the storage management system, possibly as a pointer in a chain
of available blocks of storage, thereby allowing a change to the
location to cause the storage manager to fail.
Ex: C++
int *arrayptr1;
int *arrayptr2 = new int[100]; // create heap-dynamic structure
arrayptr1 = arrayptr2;
delete [] arrayptr2;
// Now arrayptr1 is dangling, because the heap storage to which it
// was pointing has been deallocated.
• Lost heap-dynamic variable
– An allocated heap-dynamic variable that is no longer
accessible to the user program (often called garbage). Lost
heap-dynamic variables are created by the following
sequence of operations:
• Pointer p1 is set to point to a newly created
heap-dynamic variable
• Pointer p1 is later set to point to another newly created
heap-dynamic variable
• The first heap-dynamic variable is now inaccessible, or
lost (a memory leak)
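The sequence above can be sketched in C++. The Node type, its live counter, and the demo_only pointer are assumptions added for illustration; demo_only exists only so that the sketch can release its own storage, whereas a real program that lost the variable would have no such copy.

```cpp
// Hypothetical node type that counts how many instances are allocated.
struct Node {
    static int live;
    Node()  { ++live; }
    ~Node() { --live; }
};
int Node::live = 0;

// Returns the number of allocated nodes right after p1 is redirected.
int leak_demo() {
    Node* p1 = new Node;         // first heap-dynamic variable
    Node* demo_only = p1;        // kept ONLY so this sketch can clean up
    p1 = new Node;               // p1 redirected: through p1 alone, the
                                 // first variable is now unreachable (lost)
    int live_now = Node::live;   // two variables exist, one reachable via p1
    delete p1;
    delete demo_only;            // impossible in a program without this copy
    return live_now;
}
```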
Pointers in Ada
• Some dangling pointers are disallowed because dynamic objects
can be automatically de-allocated at the end of the pointer type's
scope
• The lost heap-dynamic variable problem is not eliminated by Ada
Pointers in C and C++
• Extremely flexible but must be used with care
• Pointers can point at any variable regardless of when it was
allocated
• Used for dynamic storage management and addressing
• Pointer arithmetic is possible
• Explicit dereferencing and address-of operators: the asterisk (*)
denotes the dereferencing operation, and the ampersand (&) denotes
the operator for producing the address of a variable.
Ex:
int *ptr;
int count, init;
…
ptr = &init;
count = *ptr;
The two assignment statements are equivalent to count = init;
• The domain type need not be fixed: void * is a generic pointer
• A void * pointer can point to a value of any type. Because it
cannot be de-referenced, it does not cause type-checking problems
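A minimal C++ sketch of the two points above, assuming the hypothetical function names read_int and via_pointer: a generic pointer must be converted back to a typed pointer before use, and ptr = &init followed by count = *ptr has the same effect as count = init.

```cpp
// A void* can hold the address of any object type, but the compiler
// rejects *generic, so the domain type must be restored with a cast.
int read_int(void* generic) {
    return *static_cast<int*>(generic);
}

int via_pointer() {
    int init = 7;
    int count = 0;
    int* ptr = &init;   // address-of operator
    count = *ptr;       // explicit dereference; same effect as count = init
    return count;
}
```

The cast in read_int is only correct if the address really refers to an int; the language does not check this.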
Pointer Arithmetic in C and C++
float stuff[100];
float *p;
p = stuff; //assign the address of stuff[0] to p
*(p+5) is equivalent to stuff[5] and p[5]
*(p+i) is equivalent to stuff[i] and p[i]
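The equivalences can be checked directly. In this sketch the function name same_element and the fill values are assumptions; the array and pointer follow the declarations above.

```cpp
// Returns true if *(p+i), p[i], and stuff[i] all name the same
// element, for an index i in the range 0..99.
bool same_element(int i) {
    float stuff[100];
    for (int k = 0; k < 100; ++k) stuff[k] = k * 0.5f;

    float* p = stuff;   // the address of stuff[0]

    // p + i is scaled by the element size, so it advances by
    // i * sizeof(float) bytes, not by i bytes.
    return *(p + i) == stuff[i]
        && p[i] == stuff[i]
        && (p + i) == &stuff[i];
}
```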
Pointers in Fortran 95
• Pointers point to heap and non-heap variables (static)
• Implicit dereferencing
• Pointers can only point to variables that have the TARGET
attribute
• The TARGET attribute is assigned in the declaration:
INTEGER, TARGET :: NODE
Reference Types
• C++ includes a special kind of pointer type called a reference
type that is used primarily for formal parameters. Reference type
variables are specified by preceding their names with an
ampersand (&)
Ex: int result = 0;
int &ref_result = result;
…
ref_result = 100;
result and ref_result are aliases.
– As formal parameters, references offer advantages of both
pass-by-reference and pass-by-value
• Java extends C++’s reference variables and allows them to
replace pointers entirely
– References refer to class instances. Java reference variables
can be assigned to refer to different class instances. In the
following, String is a standard Java class
String str1;
…
str1 = "This is a Java literal string";
str1 is defined to be a reference to a String class instance or
object.
– Because Java class instances are implicitly deallocated, there
cannot be a dangling reference.
• C# includes both the references of Java and the pointers of C++
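The C++ reference type described above can be sketched as follows. The function names add_bonus and alias_demo are assumptions; result and ref_result follow the example in the notes.

```cpp
// Reference parameter: the callee updates the caller's variable, as
// with pass-by-reference, but is written with ordinary value syntax
// (no explicit * or & at the point of use).
void add_bonus(int& total, int bonus) { total += bonus; }

int alias_demo() {
    int result = 0;
    int& ref_result = result;   // ref_result is an alias for result
    ref_result = 100;           // changes result
    add_bonus(result, 5);       // callee modifies result in place
    return result;
}
```

Unlike a pointer, ref_result cannot later be made to refer to a different variable.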
Evaluation of Pointers
• Dangling pointers and dangling objects are problems, as is heap
management
• Pointers are like goto's: they widen the range of cells that can be
accessed by a variable
• Pointers or references are necessary for dynamic data structures,
so we can't design a language without them
Representations of Pointers
• Large computers use single values
• Early microcomputers based on Intel microprocessors use addresses
with two parts: a segment and an offset. On these systems pointers
and references are implemented as pairs of 16-bit cells, one for
each of the two parts of an address.
Dangling Pointer Problem
• There have been several proposed solutions to the dangling
pointer problem.
• Tombstone: an extra heap cell that is a pointer to the heap-dynamic
variable
– The actual pointer variable points only at tombstones and
never to heap-dynamic variables.
– When a heap-dynamic variable is de-allocated, the tombstone
remains but is set to nil, indicating that the heap-dynamic
variable no longer exists.
– This approach prevents a pointer from ever pointing to a
deallocated variable. Any reference through a pointer that
points to a nil tombstone can be detected as an error.
– Tombstones are costly in time and space. Because
tombstones are never deallocated, their storage is never
reclaimed. Every access to a heap-dynamic variable through a
tombstone requires one more level of indirection.
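The tombstone scheme can be sketched in C++ as below. This is a hand-written illustration under stated assumptions, not how any particular language implements it: the names Tombstone, TombPtr, tomb_new, tomb_delete, and tomb_deref are hypothetical, and the heap-dynamic variable is restricted to int for brevity.

```cpp
#include <stdexcept>

// The tombstone is an extra heap cell holding the only direct pointer
// to the heap-dynamic variable.
struct Tombstone { int* target; };

// The user-visible pointer refers to the tombstone, never the variable.
struct TombPtr { Tombstone* ts; };

TombPtr tomb_new(int value) {
    return TombPtr{ new Tombstone{ new int(value) } };
}

void tomb_delete(TombPtr p) {
    delete p.ts->target;
    p.ts->target = nullptr;   // the tombstone remains, set to nil
}                             // (and, as noted, is never reclaimed)

int& tomb_deref(TombPtr p) {  // one extra level of indirection
    if (p.ts->target == nullptr)
        throw std::runtime_error("dangling pointer detected");
    return *p.ts->target;
}
```

Every copy of a TombPtr shares the same tombstone, so deallocating through one copy is visible through all the others.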
• An alternative is the locks-and-keys approach: pointer values are
represented as ordered (key, address) pairs
– Heap-dynamic variables are represented as the variable plus a
cell for an integer lock value
– When a heap-dynamic variable is allocated, a lock value is
created and placed both in the lock cell of the variable and in
the key cell of the pointer that is specified in the call to new.
– Every access to the dereferenced pointer compares the key
value of the pointer to the lock value in the heap-dynamic
variable. If they match, the access is legal; otherwise, the
access is treated as a run-time error.
– Any copy of the pointer value to another pointer must copy
the key value. Therefore, any number of pointers can
reference a given heap-dynamic variable. When a
heap-dynamic variable is deallocated with dispose, its lock
value is cleared to an illegal lock value. Then, if a pointer other
than the one specified in the dispose is dereferenced, its
address value will still be intact, but its key value will no
longer match the lock, so the access will not be allowed.
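A minimal C++ sketch of locks and keys follows. It assumes a small fixed pool of cells so that a lock cell can still be read after its variable has been disposed, and it reserves 0 as the illegal lock value. All names (Cell, KeyedPtr, keyed_new, keyed_dispose, keyed_deref) are hypothetical.

```cpp
#include <stdexcept>

struct Cell     { unsigned lock; int value; bool in_use; };
struct KeyedPtr { unsigned key;  Cell* address; };   // (key, address) pair

static Cell pool[8] = {};        // simulated heap; all locks start at 0
static unsigned next_lock = 1;   // 0 is never issued as a key

KeyedPtr keyed_new(int value) {
    for (Cell& c : pool)
        if (!c.in_use) {
            c = Cell{ next_lock++, value, true };   // fresh lock value
            return KeyedPtr{ c.lock, &c };          // key copied from lock
        }
    throw std::runtime_error("pool exhausted");
}

void keyed_dispose(KeyedPtr p) {
    if (p.key != p.address->lock)
        throw std::runtime_error("invalid dispose");
    p.address->lock = 0;         // clear to the illegal lock value
    p.address->in_use = false;
}

int& keyed_deref(KeyedPtr p) {
    if (p.key != p.address->lock)   // key must match the lock
        throw std::runtime_error("dangling pointer detected");
    return p.address->value;
}
```

Because every allocation receives a new lock value, a stale pointer is still rejected after its cell has been reused for another variable.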
Heap Management
• A very complex run-time process
• Single-size cells vs. variable-size cells: either all heap storage
is allocated and deallocated in units of a single size, or
variable-size segments are allocated and deallocated.
• Single-size cell: every cell contains a pointer (as in Lisp)
• In a single-size allocation heap, all available cells are linked
together using the pointers in the cells, forming a list of available
space. Allocation is simply a matter of taking the required number
of cells from this list.
• Deallocation is complex. A heap-dynamic variable can be pointed
to by more than one pointer, making it difficult to determine when
the variable is no longer useful to the program. Disconnecting one
pointer from a cell does not necessarily make it garbage.
• Two approaches to reclaim garbage
– Reference counters (eager approach): reclamation is
gradual
– Garbage collection (lazy approach): reclamation occurs
when the list of available space becomes empty
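The list of available space for single-size cells can be sketched as a linked free list. The heap size of four cells and the names heap_init, allocate, and release are assumptions chosen for illustration.

```cpp
#include <cstddef>

// Every cell carries a pointer; while the cell is free, that pointer
// links it into the list of available space.
struct Cell { int data; Cell* next; };

const std::size_t HEAP_CELLS = 4;
static Cell heap[HEAP_CELLS];
static Cell* free_list = nullptr;

void heap_init() {                    // link all cells into the free list
    free_list = nullptr;
    for (std::size_t i = 0; i < HEAP_CELLS; ++i) {
        heap[i].next = free_list;
        free_list = &heap[i];
    }
}

Cell* allocate() {                    // take the first available cell
    Cell* c = free_list;
    if (c != nullptr) {
        free_list = c->next;
        c->next = nullptr;
    }
    return c;                         // nullptr: no available space left
}

void release(Cell* c) {               // return a cell to the free list
    c->next = free_list;
    free_list = c;
}
```

Allocation and release are constant-time here; the hard part, deciding when release may safely be called, is what the two reclamation approaches address.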
Reference Counter
• Reference counters: maintain a counter in every cell that stores
the number of pointers currently pointing at the cell. If the
reference counter reaches zero, the cell is considered garbage and
is returned to the free list.
– Disadvantages: space required for the counters, execution
time required to maintain the counter values if pointers change
values heavily (as in Lisp), and complications for cells
connected circularly. The problem is that each cell in a circular
list has a reference counter value of at least 1, which prevents
it from being returned to the list of available space.
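The circular-list problem can be observed with std::shared_ptr, the reference-counted pointer of the C++ standard library. This is a sketch using that library type rather than a hand-built counter; Node and the two function names are assumptions.

```cpp
#include <memory>

struct Node { std::shared_ptr<Node> next; };

// Builds a two-node cycle, drops the external pointers, and reports
// whether the first node was reclaimed.
bool cycle_is_reclaimed() {
    std::weak_ptr<Node> watch;        // observes without counting
    {
        auto a = std::make_shared<Node>();
        auto b = std::make_shared<Node>();
        a->next = b;
        b->next = a;                  // each node keeps the other's count >= 1
        watch = a;
    }                                 // external pointers a and b end here
    bool reclaimed = watch.expired(); // false: counts never reached zero
    if (auto a = watch.lock())
        a->next.reset();              // break the cycle so the sketch itself
                                      // does not leak
    return reclaimed;
}

// Same experiment without a cycle.
bool chain_is_reclaimed() {
    std::weak_ptr<Node> watch;
    {
        auto a = std::make_shared<Node>();
        a->next = std::make_shared<Node>();
        watch = a;
    }
    return watch.expired();           // true: the count reached zero
}
```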
Garbage Collection
• The run-time system allocates storage cells as requested and
disconnects pointers from cells as necessary until it has allocated
all available cells. Garbage collection then begins to gather all the
garbage left floating around in the heap.
– To facilitate the process, every heap cell has an extra bit
used by the collection algorithm. The process consists of 3
phases:
1. All cells in the heap are initially set to garbage
2. All pointers in the program are traced into the heap, and
reachable cells are marked as not garbage
3. All cells still marked as garbage are returned to the list of
available cells
– To see how the marking algorithm works, assume all
heap-dynamic variables (heap cells) consist of an information
part, a mark bit (tag), and two pointers (llink, rlink). These
cells form directed graphs with at most two edges leading from
any node. The marking algorithm traverses all spanning trees of
the graphs, marking all cells that are found
for every pointer r do
    mark(r)

// Cell is assumed to be a struct with fields tag, llink, and rlink
void mark(Cell *ptr)
{
    if (ptr != 0)
        if (!ptr->tag)          // cell is not yet marked
        {
            ptr->tag = true;    // set the mark
            mark(ptr->llink);
            mark(ptr->rlink);
        }
}
– Disadvantages: when you need it most, it works worst. It takes
the most time when the program needs most of the cells in the
heap, because it takes a good deal of time to trace and mark
the many useful cells.
– In this case, the process yields only a small number of cells
that can be placed in the free list. The marking algorithm also
requires a great deal of storage (for the stack) because of
recursion.
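The three phases can be put together in a small C++ sketch. It assumes the heap is a std::vector of cells and that the program's pointers are supplied as an explicit list of roots; the function collect and these representations are illustrative assumptions, not part of the algorithm as stated.

```cpp
#include <vector>

struct Cell {
    bool  tag   = false;     // mark bit
    Cell* llink = nullptr;
    Cell* rlink = nullptr;
};

// Phase 2 helper: mark every cell reachable from ptr (recursive, so
// deep structures consume stack space, as noted above).
void mark(Cell* ptr) {
    if (ptr != nullptr && !ptr->tag) {
        ptr->tag = true;
        mark(ptr->llink);
        mark(ptr->rlink);
    }
}

// Runs all three phases and returns the cells found to be garbage.
std::vector<Cell*> collect(std::vector<Cell>& heap,
                           const std::vector<Cell*>& roots) {
    for (Cell& c : heap) c.tag = false;    // phase 1: assume all garbage
    for (Cell* r : roots) mark(r);         // phase 2: mark reachable cells
    std::vector<Cell*> available;
    for (Cell& c : heap)                   // phase 3: gather unmarked cells
        if (!c.tag) available.push_back(&c);
    return available;
}
```

Unlike reference counting, this approach reclaims cells that are connected circularly, since an unreachable cycle is never marked.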
Marking Algorithm
Variable-Size Cells
• All the difficulties of single-size cells plus more
• Variable-size cells are required by most programming languages
• If garbage collection is used, additional problems occur
– The initial setting of the indicators of all cells in the heap to
indicate that they are garbage is difficult. Because cells are
different sizes, scanning them is a problem. One solution is
for each cell to have the cell size as its first field.
– The marking process is nontrivial. How can a chain be
followed from a pointer if there is no predefined location for
the pointer in the pointed-to cell?
– Cells that do not contain pointers at all are also a problem.
Adding a system pointer to each cell will work, but it must
be maintained in parallel with the user-defined pointers. This
task adds space and execution time overhead to the cost of
running the program.
– Maintaining the list of available space is another source of
overhead. The list can begin with a single cell consisting of
all available space. Requests for segments reduce the size of
this block. Reclaimed cells are added to the list. The list
becomes a long list of various-size segments (blocks). This
slows allocation because requests cause the list to be
searched for sufficiently large blocks. Fragmentation also
becomes severe, so the free space needs to be compacted. An
alternative is a best-fit strategy, which requires keeping the
list ordered by block size and so adds its own time overhead.
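The cost of searching the list of available space can be seen in a short C++ sketch that compares taking the first sufficiently large block with taking the best-fitting one. The Block type and both function names are assumptions. For simplicity the best-fit version scans an unordered list instead of keeping the list sorted by block size as described above.

```cpp
#include <cstddef>
#include <vector>

struct Block { std::size_t start; std::size_t size; };   // a free segment

// First fit: index of the first block large enough, or -1 if none.
int first_fit(const std::vector<Block>& free_list, std::size_t request) {
    for (std::size_t i = 0; i < free_list.size(); ++i)
        if (free_list[i].size >= request)
            return static_cast<int>(i);
    return -1;
}

// Best fit: index of the smallest block large enough, or -1 if none.
// Scanning the whole list is the price of not keeping it ordered.
int best_fit(const std::vector<Block>& free_list, std::size_t request) {
    int best = -1;
    for (std::size_t i = 0; i < free_list.size(); ++i)
        if (free_list[i].size >= request &&
            (best < 0 || free_list[i].size < free_list[best].size))
            best = static_cast<int>(i);
    return best;
}
```

With free blocks of sizes 50, 12, and 20 and a request for 10 units, first fit carves into the 50-unit block while best fit picks the 12-unit block and leaves the large block intact.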