Chapter 6 - Programming Languages

3/9/05
Data Types*
Historically, languages at first offered just a few primitives and the programmer had to
make do, e.g. using FORTRAN arrays to represent linked lists or records. The trend then
moved toward including lots of specialized types in a language, as COBOL did with strings
and more numeric types. In the late 60's, Algol moved toward providing basic types plus
mechanisms that let the user build types from there. Currently popular languages tend to
provide a modest set of primitive types, allow user creation of types, and also offer
extensive libraries of additional types.
What's important about types?
Ability to model the application domain.
Support for operations to be performed on each type.
Giving the compiler a means of catching logical errors.
Abstract data types: the association of data and operations, independent of
implementation.
Primitives.
Integer is probably the most basic type. Size and representation are hardware-dependent;
most languages provide syntax for 2-3 sizes of integer, and the compiler then uses whatever
the target machine has to offer.
Floating-point types provide an approximation of real values. This can be problematic:
for instance, 0.1 decimal is 0.00011(0011)... in binary, where the group 0011 repeats
forever, so it has no exact binary representation.
All representations include a sign, a fractional part (mantissa) and an exponent. Most
modern machines use standard IEEE formats, and include at least two sizes of
floating-point numbers. If the two sizes are called float and double, the double format
typically allows at least twice the number of bits to express the fractional part
("significant digits").
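The repeating expansion of 0.1 can be checked with a short sketch. The helper name binary_fraction_digits is made up for illustration; it uses exact rational arithmetic so that the digits are not themselves distorted by floating point.

```python
from fractions import Fraction


def binary_fraction_digits(value: Fraction, count: int) -> str:
    """Return the first `count` binary digits after the point, for 0 <= value < 1."""
    digits = []
    for _ in range(count):
        value *= 2                 # shift the next binary digit in front of the point
        bit = int(value >= 1)
        digits.append(str(bit))
        value -= bit               # keep only the fractional part
    return "".join(digits)


# Exact one-tenth: the digits never terminate, the group 0011 keeps repeating.
tenth_bits = binary_fraction_digits(Fraction(1, 10), 13)

# The stored double is only the nearest approximation, so the error can surface.
sum_is_exact = (0.1 + 0.2 == 0.3)
```

Here tenth_bits holds the first 13 digits after the binary point, and sum_is_exact shows the practical consequence: adding two approximations need not give the approximation of the sum.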
Decimal data types were introduced to avoid the problems of binary floating-point
representation, and have important business applications. However, they often have
limited ranges of value and somewhat wasteful representation. Binary coded decimal
allows decimal digits to be stored one or two per byte.
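As a rough sketch of the two-digits-per-byte (packed) form, the functions below store each decimal digit in four bits. The names pack_bcd and unpack_bcd are hypothetical, and the sketch ignores signs and decimal points, which real decimal formats must also record.

```python
def pack_bcd(number: int) -> bytes:
    """Pack a non-negative integer as BCD, two decimal digits per byte."""
    if number < 0:
        raise ValueError("this sketch handles non-negative values only")
    digits = str(number)
    if len(digits) % 2:            # pad to an even number of digits
        digits = "0" + digits
    return bytes(
        (int(digits[i]) << 4) | int(digits[i + 1])   # high nibble, low nibble
        for i in range(0, len(digits), 2)
    )


def unpack_bcd(packed: bytes) -> int:
    """Recover the integer from its packed BCD bytes."""
    value = 0
    for byte in packed:
        value = value * 100 + (byte >> 4) * 10 + (byte & 0x0F)
    return value
```

The waste mentioned above is visible here: a byte can hold 256 different values, but a packed BCD byte only ever uses 100 of them.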
Booleans. Most modern languages include a Boolean primitive, typically stored as a byte.
Characters. Stored as integer codes, typically as ASCII, increasingly as Unicode. Java was
the first major language to use Unicode.
Character Strings.
Character strings are needed in most applications, and are integral to many. Strings are
implemented as primitive types in some languages, as character arrays in others, or often
in libraries. One important issue is whether their length is static or dynamic.
C, C++, Pascal & Ada do not have a primitive string type, but have some built-in support
for the use of character arrays as strings. This includes, at a minimum, single-character
reference, concatenation, string comparisons, and substring references.
Newer versions of FORTRAN treat strings as a primitive type and support assignment,
comparison, and substring references.
In Java, strings are implemented as classes: String for constant (immutable) strings, and
StringBuffer for strings that can change.
Snobol, Perl and JavaScript provide extensive pattern-matching capabilities.
Static length strings such as those in FORTRAN, Pascal and Ada require that the
programmer declare a string length; the unused suffix is treated as blank. These strings
allow for the easiest checking at compile-time.
Limited dynamic length strings can store any number of characters up to some
maximum. C and C++ use a special null character to terminate strings, so they don't
explicitly maintain the string's length. In other languages, the run-time system must
maintain a descriptor that includes the maximum length and the current length of the
string.
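A minimal sketch of such a descriptor, assuming a hypothetical class LimitedString with fields max_length, current_length and storage (these names and the ASCII encoding are illustration choices, not any particular language's layout):

```python
class LimitedString:
    """Run-time descriptor for a limited dynamic length string (illustrative layout)."""

    def __init__(self, max_length: int) -> None:
        self.max_length = max_length            # fixed when the string is declared
        self.current_length = 0                 # changes as the value changes
        self.storage = bytearray(max_length)    # space is reserved up front

    def assign(self, text: str) -> None:
        data = text.encode("ascii")
        if len(data) > self.max_length:
            raise ValueError("value exceeds the declared maximum length")
        self.storage[:len(data)] = data         # overwrite in place, size unchanged
        self.current_length = len(data)

    def value(self) -> str:
        # Only the first current_length bytes are meaningful.
        return self.storage[:self.current_length].decode("ascii")
```

Every assignment is checked against max_length at run time, which is the overhead that a null-terminated C string avoids by not keeping a length at all.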
Snobol, Perl and JavaScript allow dynamic length strings, which have no fixed
maximum. They are allocated additional space as needed. This imposes more run-time
overhead in either allocating blocks of space and treating them as a linked list, or
periodically re-issuing space from the heap. We see techniques for doing this in the more
general context of referenced types that are dynamically created from the heap using e.g.
new.
User-defined ordinal types.
An ordinal type is one that can be mapped to the set of non-negative integers. Integers,
characters and booleans are examples of primitive ordinal types.
Enumeration types are supported by many languages to allow for more readable code.
For instance, in C++ the statement
enum days {mon, tues, wed, etc.}
creates a new data type, from which variables of that
type can then be declared. Although these values cannot be input or output directly, they
can be used as arguments/parameters, loop indices, or case selectors and enhance code
readability. They are generally implemented by coding to an underlying integer value.
One design issue is whether an identifier can be used for more than one type -- the issue
is how the compiler will be able to check type compatibility in this case. C, C++, Pascal
do not allow multiple uses of a value in different enumeration types. What does Java do?
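For comparison, here is a sketch of the same idea in Python, whose enum module also codes each name to an underlying integer. Only the three day names shown above are included, and the function is_midweek is a made-up example of using an enumeration value as a parameter and selector.

```python
from enum import IntEnum


class Day(IntEnum):
    # Each name is coded to an underlying integer, starting from 0 as in C++.
    MON = 0
    TUES = 1
    WED = 2


def is_midweek(day: Day) -> bool:
    """Use an enumeration value as a parameter and as a selector."""
    return day is Day.WED


# Enumeration values can drive a loop, in declaration order.
names_in_order = [day.name for day in Day]
```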
Subrange types are contiguous subsequences of ordinal types. They can be used to
make it explicit that in a given application, some kinds of values must fall within a given
range. This lets the compiler help catch logic errors (and typos). They are
implemented as their underlying type, but the compiler must add code to check at runtime
that the values fall within the allowed range.
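The run-time check can be sketched as follows. The class SmallInt and its bounds 1..100 are assumptions chosen for illustration; a compiler would generate the equivalent test automatically rather than have the programmer write it.

```python
class SmallInt:
    """An integer restricted to the subrange LOW..HIGH, checked at run time."""

    LOW, HIGH = 1, 100   # assumed bounds for this sketch

    def __init__(self, value: int) -> None:
        # This is the check the compiler would insert on every assignment.
        if not (self.LOW <= value <= self.HIGH):
            raise ValueError(f"{value} is outside {self.LOW}..{self.HIGH}")
        self.value = value

    def __add__(self, other: "SmallInt") -> "SmallInt":
        # The arithmetic is ordinary integer addition; only the result is re-checked.
        return SmallInt(self.value + other.value)
```

The operations themselves are inherited from the underlying integer type; the subrange only adds the bounds test on each new value.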
Ada provides an interesting pair of alternatives illustrating key issues about type
compatibility:
type DERIVED_SMALL_INT is new INTEGER range 1..100;
subtype SUBRANGE_SMALL_INT is INTEGER range 1..100;
Variables of each of these types inherit integer operations. However, derived types are
not compatible with integers, while subtypes are. Likewise, type distance_metric is new
float and type distance_imperial is new float would not be compatible with each other.
Array Types.
An array is a collection of homogeneous elements; each can be referenced individually
by the name of the collection and its position in the collection. Typical syntax is
arrayname[index/subscript] (a few languages use parentheses).
Generally, the base type of the array can be any type (primitive or otherwise) defined in
the program, and the index type must be some ordinal type; sometimes it must be an
integer or a subrange of integers. Checking the range of array references is supported
in Java, Pascal & Ada but not in C/C++ or FORTRAN.
Arrays can be grouped into four categories, depending on the binding of subscript values
and the binding of the array to storage.
Before looking at these categories, let's look at how array element addressing is done so
that we can see why the categories are important. First, note that accessing an array
element is a more complicated problem than accessing a scalar; instead of just a single
address, based on where the variable is loaded, there is the address of the beginning of the
array, and then some offset into the array.
1-d case:
address(A[k]) = address(A[1]) + (k-1) * elementSize
The compiler rewrites this as
address(A[k]) = (address(A[1]) - elementSize) + k * elementSize
so that the parenthesized terms form a constant and only the final multiplication and
addition need to be done when k is known.
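The 1-d calculation can be sketched as a small function. The name element_address and the lower_bound parameter are hypothetical; the default of 1 matches the A[1] convention used above.

```python
def element_address(base_address: int, element_size: int, k: int,
                    lower_bound: int = 1) -> int:
    """Address of A[k] for a 1-d array whose first element is A[lower_bound]."""
    # This part depends only on the array, not on k, so it can be computed once.
    constant_part = base_address - lower_bound * element_size
    # Only this multiplication and addition remain once k is known.
    return constant_part + k * element_size
```

For example, with the array starting at address 1000 and 4-byte elements, A[1] is at 1000 and A[5] is at 1016.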
Now consider how multi-dimensional arrays are handled, for example 2d. Languages
must specify in what order a multi-dimensional array will be stored; FORTRAN is
column-major (column-by-column), most other languages row-major. This allows for the
calculation of an offset into the array.
Work through the calculation for the 2-d case, then rearrange to get the constant terms
first. This generalizes to k-dimensional arrays.
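One way to work through the 2-d case, assuming subscripts start at 1 in both dimensions as in the 1-d formula. The function names are made up for this sketch.

```python
def row_major_address(base: int, element_size: int, n_cols: int,
                      i: int, j: int) -> int:
    """Address of A[i][j] when rows are stored one after another."""
    # Skip (i-1) full rows, then (j-1) elements within row i.
    return base + ((i - 1) * n_cols + (j - 1)) * element_size


def column_major_address(base: int, element_size: int, n_rows: int,
                         i: int, j: int) -> int:
    """Address of A[i][j] when columns are stored one after another."""
    # Skip (j-1) full columns, then (i-1) elements within column j.
    return base + ((j - 1) * n_rows + (i - 1)) * element_size


def row_major_address_rearranged(base: int, element_size: int, n_cols: int,
                                 i: int, j: int) -> int:
    """Same as row_major_address, with the constant terms gathered first."""
    constant_part = base - (n_cols + 1) * element_size   # independent of i and j
    return constant_part + (i * n_cols + j) * element_size
```

For a 3 by 4 array of 4-byte elements starting at address 1000, A[2][3] is at 1024 in row-major order and at 1028 in column-major order. The constant part needs the number of columns, which is why the binding time of the subscript ranges matters to the implementation.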
Now we can see why the binding of subscript ranges and the binding to storage are
critical to array implementation.
Static array. Subscript ranges are statically bound, and the array is allocated before
runtime. FORTRAN 77 does this.
Fixed stack-dynamic array. Subscript ranges are statically bound, but the storage
allocation is dynamic. An array declared in a C function is like this; if you have a large
array in two different functions that don't run at the same time, they can use the same
space (from the stack).
Stack-dynamic array. Both subscript ranges and storage allocation are dynamic, but
once done, they are fixed for the lifetime of the array. Ada allows the size of an
allocated array to be based on input, if the declaration follows the point of input. Once
the declaration is reached, the size of the array is fixed.
Heap-dynamic arrays. Both bindings are dynamic, and can change during the array's
lifetime. Arrays can grow and shrink during execution. FORTRAN 90 allows for this by
providing explicit control to allocate/de-allocate; C and C++ do so with malloc/free or
new/delete.
Array dimensions. Earlier languages tended to limit dimensions, most don't now.
Array initialization. Most languages provide some syntax for initializing (usually
small!) arrays, e.g. in Java, String[] names = {"tom", "dick", "harry"};
Most languages allow some aggregate array operations. FORTRAN 77 provided none;
all operations had to be done element-wise. Ada allows assignment and concatenation of
1-d arrays. FORTRAN 90 allows elemental operations. For instance C = A + B (all
arrays) does pairwise addition.
Some languages allow references to array slices, e.g. a column of a matrix might be
assigned to a one-dimensional array.
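Both ideas can be sketched with plain Python lists standing in for arrays. The function names elementwise_add and column are hypothetical; languages with built-in aggregate operations provide these directly in their syntax.

```python
def elementwise_add(a: list, b: list) -> list:
    """Pairwise addition of two equal-length 1-d arrays, like C = A + B."""
    if len(a) != len(b):
        raise ValueError("arrays must have the same length")
    return [x + y for x, y in zip(a, b)]


def column(matrix: list, j: int) -> list:
    """Slice out column j (0-based) of a matrix stored as a list of rows."""
    return [row[j] for row in matrix]
```

For instance, the second column of a 3 by 2 matrix comes back as an ordinary one-dimensional list of three elements.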
Associative Arrays
An associative array is a collection of data elements, each indexed by an associated key. A
Perl hash is an example of such a data type. It would allow us to say e.g.
%fruits = ('Apples', 1.39, 'Oranges', 1.99, 'Pears', 1.79), or equivalently
%fruits = ('Apples' => 1.39, 'Oranges'=> 1.99, 'Pears'=> 1.79), where the => operator
describes an association between key and value.
Elements can be added, deleted, referenced, or updated. The keys must be distinct, but
the values need not be.
Perl associative arrays are implemented using a hash function.
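The same operations look like this with a Python dictionary, which is also hash-based. The fruit prices are the ones from the Perl example; the 'Kiwis' entry and the updated apple price are made up to show an insertion and an update.

```python
# Keys are fruit names, values are prices.
fruits = {"Apples": 1.39, "Oranges": 1.99, "Pears": 1.79}

fruits["Kiwis"] = 1.99      # add: a value may repeat (same price as Oranges)
fruits["Apples"] = 1.49     # update: assigning to an existing key replaces its value
del fruits["Pears"]         # delete
price = fruits["Oranges"]   # reference by key
```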
Record Types
Unlike arrays, records are a heterogeneous data type. As early as COBOL, this type was
included to address the need to associate related data of different base types, e.g. an
employee record that includes strings, integers, etc.
COBOL records were described with a unique hierarchical level structure. Most
subsequent languages (Pascal, C/C++) have used a simple nested container structure, in
which each element of the structure is listed just as a variable would be declared
externally to the structure. Structures can generally be nested. In C, we might have
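struct point {
    float x;    /* float, so that fractional coordinates such as 4.5 fit */
    float y;
};
enum colors {blue, green, yellow};
and
struct rectangle {
    struct point lleft;     /* lower-left corner */
    struct point uright;    /* upper-right corner */
    enum colors color;      /* C requires the enum keyword here */
    int fillpercent;
};
struct rectangle r1;
r1.lleft.x = 4.5;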
This is an example of a fully qualified reference. COBOL allows elliptical references
as well, along the lines of x of r1 (note that we're mixing languages here). Pascal and
other languages allow a relaxation of full qualification by use of a with clause; again
mixing languages, you could say
with r1.lleft do {
x = 1.0;
y = 3.0;
}
Most languages that have records allow assignment of records. COBOL has a
mechanism that allows assignment between different record structures, in which
matching type/name pairs will have values assigned.
Unions.
FORTRAN, C and C++ have free unions. A variable declared as a union data type may
take on a value of any of its constituent types. It is up to the programmer to keep track of
what's stored there.
Pascal implements a variant record, which is a form of union. Suppose we have an
employee record type, and some employees are hourly and some are salaried. There are
many fields in common, so we don't want to have two different record types. We would
include a field called say pay_type, which would act as a tag or discriminant.
Depending on the value of pay_type, the rest of the record might then consist of either
weekly_salary, or regular_rate and overtime_rate. Storage is allocated for the largest
of the variant possibilities.
Pascal variant records limit the compiler's ability to check type compatibility for two
reasons. First, nothing prevents the programmer from changing the tag without changing
the variant data; the compiler would consider the variant to be of one type, based on the
tag, and it would be incorrect. Secondly, the programmer can simply omit the tag,
turning the record into a free union.
Ada requires a tag on all variant records, and enforces that the variant part of the record
be changed if the tag is changed.
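A sketch of the safer, tagged style using the employee example. All class and field names other than weekly_salary, regular_rate and overtime_rate are hypothetical, and the 40-hour overtime threshold is an assumption made only so the example computes something.

```python
from dataclasses import dataclass
from typing import Union


@dataclass(frozen=True)
class Salaried:
    weekly_salary: float


@dataclass(frozen=True)
class Hourly:
    regular_rate: float
    overtime_rate: float


@dataclass
class Employee:
    name: str                       # stands in for the fields common to all employees
    pay: Union[Salaried, Hourly]    # the object's class plays the role of the tag


def weekly_pay(employee: Employee, hours: float = 40.0) -> float:
    pay = employee.pay
    if isinstance(pay, Salaried):   # inspect the tag before touching variant fields
        return pay.weekly_salary
    overtime = max(0.0, hours - 40.0)   # assumed threshold, for illustration only
    return (hours - overtime) * pay.regular_rate + overtime * pay.overtime_rate
```

Because the tag and the variant data live in one object, changing an employee's pay type means replacing the whole pay value, so the two cannot drift apart the way they can in a Pascal variant record.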
Java has no union construct.
Sets.
Most languages do not support a set type, although some (e.g. C++) include sets in
standard libraries. Pascal does have one, but it is restricted to small set sizes because it is
typically implemented as a bit string the size of a machine word. Generally, programmers may use
arrays to implement set operations.
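The bit-string representation can be sketched as follows, with one bit per possible member. WORD_BITS = 32 is an assumed word size and the function names are hypothetical.

```python
WORD_BITS = 32   # assumed machine word size; limits members to 0..31


def make_set(members) -> int:
    """Build a set as a bit string: bit m is 1 exactly when m is a member."""
    bits = 0
    for m in members:
        if not 0 <= m < WORD_BITS:
            raise ValueError(f"{m} does not fit in a {WORD_BITS}-bit set")
        bits |= 1 << m
    return bits


def set_union(a: int, b: int) -> int:
    return a | b          # a single bitwise OR


def set_intersection(a: int, b: int) -> int:
    return a & b          # a single bitwise AND


def contains(bits: int, m: int) -> bool:
    return bool((bits >> m) & 1)
```

Union and intersection each take one machine instruction in this form, which is why the representation is attractive, and the fixed word size is exactly what restricts the set to a small range of members.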
Pointers.
A pointer variable contains a memory address as its value. Pointers have two main uses:
(1) indirect addressing of any variable, and (2) a means of implementing dynamic
memory allocation (from the heap). Basic operations on pointers: assignment (to what
address does the pointer point?), and dereferencing (what is the value at the address
being pointed to?). See the picture in the text for j = *ptr.
Often pointers point to records. C++ syntax is either (*p).color or p->color (arrow
combines * and .).
In order to use pointers for managing heap-dynamic variables, there must be an
allocation operation (and optionally a deallocation operation).
There are two main potential problems associated with pointers:
(1) A dangling pointer occurs with heap-dynamic variables, when two pointers point to
the same variable, and the variable is implicitly or explicitly de-allocated through one
of them. An attempt to dereference the remaining pointer can then produce run-time errors.
(2) Memory leakage. A heap-dynamic variable can be "lost" if the only pointer
referencing it is assigned to point somewhere else.
Pascal: pointers only used for dynamically allocated variables. De-allocation operator is
specified but implemented unevenly.
Ada has both implicit and explicit deallocation. The implicit works by leaving
everything allocated until the type goes out of scope, then de-allocating everything.
C/C++ pointers are ((in)famously) flexible. Pointers can refer to any location in
memory. Pointer arithmetic is allowed. In most expressions, an array name is treated as a
pointer to the 0th array element.
Reference types are used in C++ for formal "in/out" parameters in function headers. For
example float & sum will pass the address of the corresponding actual parameter
(argument) rather than a copy of its value. Java references refer to objects.
Heap Management.