Chapter 6 - Programming Languages 3/9/05

Data Types*

Historically: at first there were just a few primitives and the programmer had to make do, e.g. using FORTRAN arrays to represent linked lists or records. The trend was then toward including lots of specialized types in a language, as in COBOL with strings and more numeric types. In the late 60's, Algol moved toward providing basic types plus mechanisms that allow the user to build further types from them. Currently popular languages tend to provide a modest set of primitive types, allow user creation of types, and also offer extensive libraries of additional types.

What's important about types? The ability to model the application domain; support for the operations to be performed on each type; and giving the compiler a means of catching logical errors. Abstract data types: the association of data and operations, independent of implementation.

Primitives. Integer is probably the most basic type. Its size and representation are hardware-dependent; most languages provide syntax for 2-3 sizes of integer, and the compiler then uses whatever the target machine has to offer.

Floating-point types provide an approximation of real values. (This can be problematic: for instance, 0.1 decimal is 0.00011(0011)... in binary, with the group 0011 repeating forever.) All representations include a sign, a fractional part (mantissa) and an exponent. Most modern machines use the standard IEEE formats and include at least two sizes of floating-point numbers. If the two sizes are called float and double, the double format typically provides at least twice as many bits for the fractional part ("significant digits").

Decimal data types arose from the problems of binary floating-point representation, and have important business applications. However, they often have a limited range of values and a somewhat wasteful representation. Binary coded decimal allows decimal digits to be stored one or two per byte.

Booleans. Most modern languages include a Boolean primitive, typically stored as a byte.

Characters. Stored as integers, typically as ASCII, increasingly as Unicode.
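The points above about how primitives are represented can be checked with a short C++ sketch. It assumes IEEE 754 float and double and an ASCII-compatible character set; the function names sum_of_tenths, nearly_equal and char_code are made up for illustration and are not from any library.

```cpp
#include <cmath>
#include <limits>

// Adds 0.1 to itself n times using binary floating point (double).
// Because 0.1 has no exact binary representation, rounding error accumulates.
double sum_of_tenths(int n) {
    double total = 0.0;
    for (int i = 0; i < n; ++i) {
        total += 0.1;
    }
    return total;
}

// Compares two doubles within a tolerance instead of with ==.
bool nearly_equal(double a, double b, double tolerance = 1e-9) {
    return std::fabs(a - b) <= tolerance;
}

// Characters are stored as small integers; this returns the stored code.
int char_code(char c) {
    return static_cast<int>(c);
}

// Assumption: IEEE 754 formats, where float has a 24-bit and double a
// 53-bit significand, so double has at least twice as many bits.
static_assert(std::numeric_limits<double>::digits >=
                  2 * std::numeric_limits<float>::digits,
              "assumes IEEE 754 float and double");
```

Under these assumptions, sum_of_tenths(10) is very close to 1.0 but does not compare equal to it, which is why floating-point values are usually compared with a tolerance.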
Java was the first major language to use Unicode.

Character Strings. Character strings are needed in most applications, and are integral to many. Strings are implemented as primitive types in some languages, as character arrays in others, and often in libraries. One important issue is whether their length is static or dynamic.

C, C++, Pascal and Ada do not have a primitive string type, but have some built-in support for the use of character arrays as strings. This includes, at a minimum, single-character reference, concatenation, string comparisons, and substring references. FORTRAN (newer versions) treats strings as a primitive type and supports assignment, comparison, and substring references. In Java, strings are implemented as classes: String for constant strings, and StringBuffer for string variables. Snobol, Perl and JavaScript provide extensive pattern-matching capabilities.

Static length strings, such as those in FORTRAN, Pascal and Ada, require that the programmer declare a string length; any unused suffix is treated as blank. These strings allow for the easiest checking at compile time.

Limited dynamic length strings can store any number of characters up to some maximum. C and C++ use a special null character to terminate strings, so they don't explicitly maintain the string's length. In other languages, the run-time system must maintain a descriptor that includes the maximum length and the current length of the string.

Snobol, Perl and JavaScript allow dynamic length strings, which have no fixed maximum. They are allocated additional space as needed. This imposes more run-time overhead, either in allocating blocks of space and treating them as a linked list, or in periodically re-allocating space from the heap. We will see techniques for doing this in the more general context of referenced types that are dynamically created from the heap using e.g. new.

User-defined ordinal types. An ordinal type is one that can be mapped to the set of non-negative integers.
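To make the mapping to integers concrete, here is a minimal C++ sketch. The enumeration Day and the helper names ordinal and next_day are hypothetical, chosen only for illustration, and the character case assumes an ASCII-compatible character set.

```cpp
// Hypothetical user-defined enumeration, used only for illustration.
enum class Day { mon, tues, wed, thurs, fri, sat, sun };

// Each overload returns the integer that a value of an ordinal type maps to.
constexpr int ordinal(bool b) { return b ? 1 : 0; }
constexpr int ordinal(char c) { return static_cast<int>(c); }
constexpr int ordinal(Day d)  { return static_cast<int>(d); }

// Because the values are ordered, a successor operation is well defined.
// This version wraps from sun back to mon.
constexpr Day next_day(Day d) {
    return static_cast<Day>((ordinal(d) + 1) % 7);
}
```

For example, ordinal(Day::wed) is 2 because wed is the third enumerator, counting from 0.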
Integers, characters and booleans are examples of primitive ordinal types.

Enumeration types are supported by many languages to allow for more readable code. For instance, in C++ the statement enum days {mon, tues, wed, ...}; creates a new data type, and variables of that type can then be declared. Although these values cannot be input or output directly, they can be used as arguments/parameters, loop indices, or case selectors, and they enhance code readability. They are generally implemented by mapping each value to an underlying integer. One design issue is whether an identifier can be used in more than one enumeration type; the question is how the compiler will be able to check type compatibility in this case. C, C++ and Pascal do not allow the same value to be used in different enumeration types. What does Java do?

Subrange types are contiguous subsequences of ordinal types. They can be used to make it explicit that, in a given application, some kinds of values must fall within a given range. This allows the compiler to support checking of logic (and typing errors). They are implemented as their underlying type, but the compiler must add code to check at run time that the values fall within the allowed range.

Ada provides an interesting pair of alternatives illustrating key issues about type compatibility:

type DERIVED_SMALL_INT is new INTEGER range 1..100;
subtype SUBRANGE_SMALL_INT is INTEGER range 1..100;

Variables of each of these types inherit the integer operations. However, variables of the derived type are not compatible with integers, while variables of the subtype are. So type distance_metric is new float and type distance_imperial is new float would not be compatible with each other.

Array Types. An array is a collection of homogeneous elements; each can be referenced individually by the name of the collection and its position in the collection. Typical syntax is arrayname[index/subscript] (a few languages use parentheses).
Generally, the base type of the array can be any type (primitive or otherwise) defined in the program. Generally, the index type must be some ordinal type; sometimes the indices must be integers or subranges of integers. Checking the range of array references is supported in Java, Pascal and Ada but not in C/C++ or FORTRAN.

Arrays can be grouped into four categories, depending on the binding of subscript ranges and the binding of the array to storage.

Static array. Subscript ranges are statically bound, and the array is allocated before run time. FORTRAN 77 does this.

Fixed stack-dynamic array. Subscript ranges are statically bound, but the storage allocation is dynamic. An array declared in a C function is like this; if you have a large array in each of two functions that don't run at the same time, the two arrays can use the same space (from the stack).

Stack-dynamic array. Both subscript ranges and storage allocation are dynamic, but once done, they are fixed for the lifetime of the array.

Heap-dynamic arrays. Both are dynamic, and can change during the array's lifetime. Arrays can grow and shrink during execution. FORTRAN 90 allows for this by providing explicit control to allocate and de-allocate. C and C++ offer malloc/free or new/delete.

Array dimensions. Earlier languages tended to limit the number of dimensions; most don't now.

Array initialization. Most languages provide some syntax for initializing (usually small!) arrays, e.g. in Java String[] names = {"tom", "dick", "harry"};

Most languages allow some aggregate array operations. FORTRAN 77 provided none; all operations had to be done element-wise. Ada allows assignment and concatenation of 1-d arrays. FORTRAN 90 allows elemental operations. For instance, C = A + B (all arrays) does pairwise addition. Some languages allow references to array slices, e.g. a column of a matrix might be assigned to a one-dimensional array.

Languages must specify in what order a multi-dimensional array will be stored; FORTRAN is column-major (column-by-column), most other languages are row-major.
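The difference between the two storage orders can be sketched in C++. This is an illustration only: the function names are hypothetical, subscripts are 0-based as in C++ (FORTRAN subscripts normally start at 1), and offsets are counted in elements rather than bytes.

```cpp
#include <cstddef>
#include <cstdint>

// Row-major: rows are stored one after another, so skipping i full rows
// means skipping i * cols elements.
std::size_t row_major_offset(std::size_t i, std::size_t j, std::size_t cols) {
    return i * cols + j;
}

// Column-major: columns are stored one after another, so skipping j full
// columns means skipping j * rows elements.
std::size_t column_major_offset(std::size_t i, std::size_t j, std::size_t rows) {
    return j * rows + i;
}

// Checks the row-major formula against the layout the C++ compiler
// actually uses for a built-in 3 x 4 array of int.
bool matches_builtin_layout() {
    int a[3][4] = {};
    const std::uintptr_t base = reinterpret_cast<std::uintptr_t>(&a[0][0]);
    const std::uintptr_t elem = reinterpret_cast<std::uintptr_t>(&a[2][1]);
    return elem - base == row_major_offset(2, 1, 4) * sizeof(int);
}
```

In a 3 x 4 array, element (2, 1) sits 9 elements from the start in row-major order but 5 elements from the start in column-major order.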
This allows for the calculation of an offset into the array (we look at this in the 2-d case).

Before looking again at the four array categories, let's look at how array element addressing is done, so that we can see why the categories are important. First, note that accessing an array element is a more complicated problem than accessing a scalar; instead of just a single address, based on where the variable is loaded, there is the address of the beginning of the array, and then some offset into the array.

1-d case:

address(A[k]) = address(A[1]) + (k-1) * elementSize

The compiler rewrites this as

address(A[k]) = (address(A[1]) - elementSize) + k * elementSize

so that the first two terms form a constant, and only the final multiplication and addition need to be done when k is known.

Now consider how multi-dimensional arrays are handled, for example 2-d. The storage order (column-major in FORTRAN, row-major in most other languages) determines the offset calculation. Work through the calculation for the 2-d case, then rearrange to get the constants first. For a row-major array with n elements per row and subscripts starting at 1:

address(A[i,j]) = address(A[1,1]) + ((i-1) * n + (j-1)) * elementSize
address(A[i,j]) = (address(A[1,1]) - (n+1) * elementSize) + (i * n + j) * elementSize

This generalizes to k-dimensional arrays.

Now we can see why the binding of subscript ranges and the binding to storage, which define the four categories above, are critical to array implementation: the constant part of the address calculation can be computed before run time only if the subscript ranges and the starting address of the array are known by then; otherwise it must be computed when the array is allocated. Ada illustrates the stack-dynamic case: it allows the size of an allocated array to be based on input, if the declaration follows the point of input. Once the declaration is reached, the size of the array is fixed.

Associative Arrays. An associative array is a collection of data elements, each indexed by an associated key. A Perl hash is an example of such a data type. It would allow us to say e.g. %fruits = ('Apples', 1.39, 'Oranges', 1.99, 'Pears', 1.79), or equivalently %fruits = ('Apples' => 1.39, 'Oranges' => 1.99, 'Pears' => 1.79), where the => operator describes an association between key and value. Elements can be added, deleted, referenced, or updated.
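For comparison with the Perl hash, here is a minimal sketch of the same table in C++, using std::unordered_map from the standard library (which is also hash-based). The alias PriceTable, the function names, and the added 'Plums' entry are hypothetical and exist only for illustration.

```cpp
#include <string>
#include <unordered_map>

using PriceTable = std::unordered_map<std::string, double>;

// Builds the same key/value pairs as the Perl %fruits example.
PriceTable make_fruits() {
    return PriceTable{{"Apples", 1.39}, {"Oranges", 1.99}, {"Pears", 1.79}};
}

// Reference: look up the value stored under a key (throws if absent).
double price_of(const PriceTable& table, const std::string& key) {
    return table.at(key);
}

// Applies add, update and delete to a fresh table and returns the result.
PriceTable demo_operations() {
    PriceTable fruits = make_fruits();
    fruits["Plums"] = 2.49;   // add a new pair (hypothetical data)
    fruits["Apples"] = 1.49;  // update the value for an existing key
    fruits.erase("Pears");    // delete a pair by key
    return fruits;
}
```

After demo_operations the table still has three entries: Apples (now 1.49), Oranges, and Plums.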
The keys must be distinct, but the values need not be. Perl associative arrays are implemented using a hash function.

Record Types. Unlike arrays, records are a heterogeneous data type. As early as COBOL, this type was included to address the need to associate related data of different base types, e.g. an employee record that includes strings, integers, etc. COBOL records were described with a unique hierarchical level structure. Most subsequent languages (Pascal, C/C++) have used a simple nested container structure, in which each element of the structure is listed just as a variable would be declared outside the structure. Structures can generally be nested. In C, we might have

struct point { int x; int y; };
enum colors {blue, green, yellow};

and

struct rectangle { struct point lleft; struct point uright; enum colors color; int fillpercent; };
struct rectangle r1;
r1.lleft.x = 4;

The last line is an example of a fully qualified reference. COBOL allows elliptical references as well, along the lines of x of r1 (note that we're mixing languages here). Pascal and other languages allow a relaxation of full qualification by use of a with clause; again mixing languages, you could say with r1.lleft do { x = 1; y = 3; }

Most languages that have records allow assignment of records. COBOL has a mechanism that allows assignment between different record structures, in which fields with matching type/name pairs have their values assigned.

Unions. FORTRAN, C and C++ have free unions. A variable declared as a union data type may take on a value of any of its constituent types. It is up to the programmer to keep track of what's stored there.

Pascal implements a variant record, which is a form of union. Suppose we have an employee record type, and some employees are hourly and some are salaried. There are many fields in common, so we don't want to have two different record types. We would include a field called, say, pay_type, which would act as a tag or discriminant.
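A tagged record of this kind can be sketched in C++ with an enumeration as the tag and a union for the part that varies. This is an illustration under assumptions: the type names, the id field and the helper functions are hypothetical, and C++ itself does not check that the tag matches the variant in use, so the code checks the tag by hand.

```cpp
// Tag (discriminant) saying which variant of the record is in use.
enum class PayType { salaried, hourly };

struct SalariedPay { double weekly_salary; };
struct HourlyPay   { double regular_rate; double overtime_rate; };

struct Employee {
    int id;            // a field common to all employees (hypothetical)
    PayType pay_type;  // the tag
    union {            // shared storage, sized for the largest variant
        SalariedPay salaried;
        HourlyPay hourly;
    };
};

// Each constructor function sets the tag and the matching variant together.
Employee make_salaried(int id, double weekly_salary) {
    Employee e;
    e.id = id;
    e.pay_type = PayType::salaried;
    e.salaried.weekly_salary = weekly_salary;
    return e;
}

Employee make_hourly(int id, double regular_rate, double overtime_rate) {
    Employee e;
    e.id = id;
    e.pay_type = PayType::hourly;
    e.hourly.regular_rate = regular_rate;
    e.hourly.overtime_rate = overtime_rate;
    return e;
}

// Reads only the variant that the tag says is present.
double weekly_pay(const Employee& e, double regular_hours, double overtime_hours) {
    switch (e.pay_type) {
        case PayType::salaried:
            return e.salaried.weekly_salary;
        case PayType::hourly:
            return e.hourly.regular_rate * regular_hours +
                   e.hourly.overtime_rate * overtime_hours;
    }
    return 0.0;  // not reached for valid tags
}
```

Setting the tag and the variant in one place, as the two make_ functions do, is one way to keep them from drifting apart.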
Depending on the value of pay_type, the rest of the record might then consist of either weekly_salary, or regular_rate and overtime_rate. Storage is allocated for the largest of the variant possibilities.

Pascal variant records limit the compiler's ability to check type compatibility, for two reasons. First, nothing prevents the programmer from changing the tag without changing the variant data; the compiler would consider the variant to be of one type, based on the tag, and it would be incorrect. Second, the programmer can simply omit the tag, turning the record into a free union. Ada requires a tag on all variant records, and enforces that the variant part of the record be changed if the tag is changed. Java has no union construct.

Sets. Most languages do not support a set type, although some (e.g. C++) include sets in standard libraries. Pascal does have one, but it is restricted to small set sizes because it is implemented as a bit string the size of a machine word. Otherwise, programmers generally use arrays to implement set operations.

Pointers. A pointer variable contains memory addresses as values. Pointers have two main uses: (1) indirect addressing of any variable, and (2) a means of managing dynamic memory allocation (from the heap). Basic operations on pointers: assignment (to what address does the pointer point?) and dereferencing (what is the value at the address being pointed to?). (Draw a picture like the one in the text for j = *ptr.) Often pointers point to records. The C++ syntax is either (*p).color or p->color (the arrow combines * and .). In order to use pointers for managing heap-dynamic variables, there must be an allocation operation (and optionally a deallocation operation).

There are two main potential problems associated with pointers: (1) A dangling pointer occurs with heap-dynamic variables, when two pointers point to the same variable and the variable is implicitly or explicitly de-allocated through one of them.
An attempt to dereference the remaining pointer can then produce run-time errors. (2) Memory leakage. A heap-dynamic variable can be "lost" if the only pointer referencing it is assigned to point somewhere else.

In Pascal, pointers are used only for dynamically allocated variables. A de-allocation operator is specified but implemented unevenly. Ada has both implicit and explicit deallocation. The implicit form works by leaving everything allocated until the pointer type goes out of scope, then de-allocating everything.

C/C++ pointers are ((in)famously) flexible. Pointers can refer to any location in memory. Pointer arithmetic is allowed. Array names act as pointers to the 0th array element.

Reference types are used in C++ for formal "in/out" parameters in function headers. For example, float & sum will pass the address of the corresponding actual parameter (argument) rather than a copy of its value. Java references refer to objects.

Heap Management.
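The pointer and reference operations above, together with the explicit allocation and deallocation that heap management relies on, can be sketched in C++ as follows. The function names are hypothetical and chosen only for illustration; the comments mark where a dangling pointer or a leak would arise.

```cpp
// Reference parameter: the callee works on the caller's variable
// (an "in/out" parameter) rather than on a copy of its value.
void add_to(float& sum, float x) {
    sum += x;
}

// An array argument arrives as a pointer to its 0th element, and
// pointer arithmetic reaches the others: *(a + 2) is the same as a[2].
int third_element(const int* a) {
    return *(a + 2);
}

// Allocates a heap-dynamic variable, reads it through a second pointer,
// and de-allocates it explicitly.
int heap_roundtrip(int value) {
    int* p = new int(value);  // allocation from the heap
    int* alias = p;           // a second pointer to the same variable
    int j = *alias;           // dereference: fetch the value pointed to
    delete p;                 // explicit deallocation; alias now dangles
    p = nullptr;              // clear both so neither can be dereferenced
    alias = nullptr;
    // Had p been reassigned before the delete, with no alias left,
    // the variable would have been lost: a memory leak.
    return j;
}
```

The value is copied into j before the delete, so nothing is read through a dangling pointer.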