Chapter -6: Data Types Introduction • A data type defines a collection of data objects and a set of predefined operations on those objects • A descriptor is the collection of the attributes of a variable. In an implementation, a descriptor is a collection of memory cells that store variable attributes. If the attributes are all static, descriptors are required only at compile time. • These descriptors are built by the compiler, as a part of the symbol table, and are used during compilation. • For dynamic attributes, part or all of the descriptor must be maintained during execution. In this case, descriptor is used by the run-time system. • In all cases, descriptors are used for type checking and to build the code for the allocation and deallocation operations. • An object represents an instance of a user-defined (abstract data) type • One design issue for all data types: What operations are defined and how are they specified? Primitive Data Types • Almost all programming languages provide a set of primitive data types • Primitive data types: Those not defined in terms of other data types • Some primitive data types are merely reflections of the hardware. (E.g. integer types) • Others require little non-hardware support for their implementation. • The primitive data types are used with one or more type constructors, to provide the structured types. Primitive Data Types: Integer • Almost always an exact reflection of the hardware so the mapping is trivial • There may be as many as eight different integer types in a language Ch6-1 • Java’s signed integer sizes: byte, short, int, long • C++, C#, include unsigned integer types ( without sign) • A signed integer, value is represented in a computer by a string of bits, the left most one represents sign. • A negative integer could be stored in sign-magnitude notation, in which the sign bit is set to indicate negative and the reminder bits represent the absolute value of the number. • Most computers now use a twos complement notation to store negative integers (take logical complement of the positive version of the number and adding one. • E.g. -2 11111110 2= 00000010 Com=11111101 1 11111110 L most bit = -128 Add the rest of bits= 2+4+8+16+32+64=126 Subtract= 128-126=-2 Primitive Data Types: Floating Point • Model real numbers, but only as approximations. Floating-point values are represented as fractions and exponents. • Languages for scientific use supports at least two floating-point types (e.g., float (4 bytes) and double (8 bytes). • The collection of values that can be represented as a floatingpoint type is defined in terms of precision and range. • Precision is the accuracy of the fractional part of a value measured as number of bits. • Range is a combination of the range of fractions and range of exponents. • Usually exactly like the hardware (i.e. language implementers use whatever representation is supported by the hardware), but not always • Most newer machines use the IEEE Floating-Point Standard 754 format Ch6-2 Primitive Data Types: Decimal • Most large computers that are designed for business applications (money) have hardware support for decimal data types – Essential to COBOL – C# offers a decimal data type • Store a fixed number of decimal digits with the decimal point at a fixed position in the value • Decimal types are stored using binary codes for the decimal digits (B CD). • In some cases, they are stored one digit per byte, but in others they are packed two digits per byte. Either way, they take more storage than binary representations. It takes at least 4 bits to code a decimal digit. E.g. 7=0111, 9=1001, 3=0011 etc. to store 6digit coded decimal number requires 24 bits of memory. Operations on decimal values are done in hardware on machines that have such capabilities; otherwise they are simulated in software. • Advantage: accuracy (being able to precisely store decimal values) Ch6-3 • Disadvantages: limited range (exponents are not allowed), wastes memory Primitive Data Types: Boolean • Simplest of all types • Range of values: two elements, one for “true” and one for “false” • C98, exceptions in which numeric expressions are used as conditionals. All operands with non-zero values are considered true, and zero is false. • C99, C++ have Boolean type. They also allow numeric expression to be used as if they were Boolean. • Java, C# not allowed. • Boolean types are used to represent switches or flags in programs. • Could be implemented as bits, but often as bytes – Advantage: readability (more readable than using integer) Primitive Data Types: Character • Stored as numeric codings • Most commonly used coding: ASCII (8-bit code): 128 characters, Extended ASCII: 256 characters. Ada uses Extended • An alternative, 16-bit coding: Unicode – Includes characters from most natural languages – Originally used in Java – C# and JavaScript also support Unicode Character String Types • Values are sequences of characters • Design issues: – Is it a primitive type or just a special kind of array? – Should the length of strings be static or dynamic? Character String Types Operations • If strings are not defined as a primitive type, string data is usually stored in arrays of single characters, and referenced as such in the language (C, C++). • C, C++ use char arrays to store character strings and provide a collection of string operations through standard library whose leader file is string.h Ch6-4 • Character strings are determined with a special character, null, represent zero. • Char *str=”apples”; : str is a char pointer set to point at the string of characters, apples0 where 0 is the null character. This initialization of str is legal because character string literals are represented by char pointers, rather than the string itself. • Typical operations: – Assignment and copying. In C, C++ – Comparison (=, >, etc.) – Catenation – Substring reference – Pattern matching Character String Type in Certain Languages • C and C++ (C++ supports strings through its standard class library String, also support array of characters) – Not primitive – Use char arrays and a library of functions that provide operations header file string.h – Most commonly used library functions for character strings C, C++ are (stecpy, strcmp, strlen, strcat) • SNOBOL4 (a string manipulation language) – Primitive – Many operations, including elaborate pattern matching • Java – Primitive via the String class • Fortran95: treats strings as a primitive type and provides assignment, relational operators, catenation, and substring reference operations of them (slices). Character String Length Options • There are several design choices regarding the length of string values. Ch6-5 • Static length string: the length can be static and set when the string is created. COBOL, Java’s String class. Another Java class called stringBufferclass of changeable values. • Limited Dynamic Length: allowing strings to have varying length up to a declared and fixed maximum set by the variable’s definition. Such string variables can store any number of characters between zero and the maximum. C and C++ – In C-based language, a special character is used to indicate the end of a string’s characters, rather than maintaining the length • Dynamic length strings: allow strings to have varying length with no maximum. This option requires the overhead of dynamic storage allocation and deallocation but provides maximum flexibility. SNOBOL4, Perl, JavaScript • Ada supports all three string length options Character String Type Evaluation • Aid to writability. Dealing with strings as arrays is more difficult than dealing with primitive string type. • As a primitive type with static length, they are inexpensive to provide. Providing strings through a standard library is nearly as convenient as having them as a primitive type. • Dynamic length is nice and flexible, but is it worth the expense? The overhead of their implementation must be weighted against that additional flexibility. Character String Implementation • Static length: compile-time descriptor need only during compilation. Has 3 fields. See figure-1 • Limited dynamic length: may need a run-time descriptor for length to store both the fixed maximum length and the current length. See figure-2 (but not in C and C++ because the end of string is marked with null character. Do not need max length, because index values n array references are not range-checked in these languages). Static and dynamic length strings require no special dynamic storage allocation. Ch6-6 • Dynamic length: need run-time descriptor; allocation/deallocation is the biggest implementation problem which requires a complex storage management. Length and storage to which it is bound grow and shrink dynamically. • Type approaches to support dynamic allocation/ deallocation - Strings can be stored in a linked list: drawbacks are extra storage occupied by the links, and the complexity of string operations. - Or store complete strings in adjacent storage cells. Problem: when a string grows and the adjacent space is not available. Solution is to find another hole that fits the new string, and deallocate the previous hole. - Although linked-list method requires more storage, the dynamic allocation process is simple, but some string operations are slow due to pointer chasing (sequential access) - Using adjacent memory for complete strings results in faster string operations and required significantly less storage. But the allocation/ deallocation process is slower. Compile- and Run-Time Descriptors Type name Address of first characte r Compile-time descriptor for static strings Figure-1 Run-time descriptor for limited dynamic strings Ch6-7 Figure-2 User-Defined Ordinal Types • An ordinal type is one in which the range of possible values can be easily associated with the set of positive integers • Examples of primitive ordinal types in Java – Integer, char, Boolean • In some languages, users can define two kinds of ordinal types: (enumerated and subrange). Enumeration Types • All possible values, which are named constants, are provided in the definition • C# example enum days {mon, tue, wed, thu, fri, sat, sun}; the enumeration constants are typically implicitly assigned the integer values, 0, 1,… etc. • Design issues – Is an enumeration constant allowed to appear in more than one type definition, and if so, how is the type of an occurrence of that constant checked? – Are enumeration values coerced to integer? – Are any other type coerced to an enumeration type? – All these design issues are related to type checking. • If an enumeration variable is coerced to a numeric type, there is little control over the range of legal operations or its range of values. • If an int type value is coerced to an enumeration type, an enumeration type variable could be assigned any integer value, whether it represented an enumeration constant or not. • Design - If a language does not have enumeration types, we could simulated it o e.g. Fortran77: INTEGER RED, BLUE DATA RED, BLUE /0, 1/ The problem here: since we did not define a type for our colors, there is no type checking when they are used. E.g. it would be legal to add the two together. Ch6-8 Also they could be combined with any other numeric type operand with any arithmetic operator. Also, because they are just variables, they could be assigned any integer value, destroying the relationship with the colors, although to solve this latter issue we could make them named constants. o C, Pascal: 1st include enumeration data type. C++ includes C’s enumeration type. C++ could have enum colors {red, blue, green, yellow, black}; Colors mycolor= blue; youcolor=red; Enumeration values are coerced to int when they are put in integer context. E.g. if current value of mycolor is blue, the statement mycolor++ would assign green to mycolor. o C++ allows enumeration constants to be assigned to variables of any numeric type, though that would most often be an error - No other type value is coerced to an enumeration type in C++, mycolor=4; is legal, R.H.S sould be cast to C++ enumeration constants can appear in only one enumeration type in the same referencing environment. o Ada, enumeration literals are allowed to appear in more than one declaration in the same referencing environment. These are called overloaded literals. - The rules for solving overloading must be determined from the context. E.g. if an overloaded literal and an enumeration variable are compared, the literal’s type is resolved to be that of the variable. - Because neither the enumeration literals nor the enumeration variables in Ada are coerced to integer, both the range of operations and the range of values of enumeration types are restricted, allowing many errors to be compiler-detected. Evaluation of Enumerated Type • Aid to readability, e.g., no need to code a color as a number. Named values are easily recognized. • Aid to reliability, e.g., compiler can check: in C#, Ada, Java0.5 – Operations (don’t allow colors to be added). No arithmetic operations are legal on enumeration types. Ch6-9 – No enumeration variable can be assigned a value outside its defined range – Ada, C#, and Java 5.0 provide better support for enumeration than C++ because enumeration type variables in these languages are not coerced into integer types • C treats enumeration variables like integer variables. • C++ numeric values can be assigned to enumeration type variables only if they are cast to type of the assigned variable. Subrange Types • An ordered contiguous subsequence of an ordinal type – Example: 12..18 is a subrange of integer type. Introduced in Pascal, included in Ada. • Ada’s design - In Ada, sub ranges are included in the category of types called subtypes. - In Pascal Type strIndex=0..mastrLength; var I: strIndex; type Days is (mon, tue, wed, thu, fri, sat, sun); subtype Weekdays is Days range mon..fri; subtype Index is Integer range 1..100; - all operations defined for the parent type are also defined for the subtype, except assignment of values outside the specified range. E.g. in the following: Day1: Days; Day2: Weekday; Day2 := Day1; - The assignment is legal unless the value of Day1 is sat to sun. - The compiler must generate range-checking code for every assignment to subrange variable sub ranges require run-time range checking. Ch6-10 Subrange Evaluation • Aid to readability – Make it clear to the readers that variables of subrange can store only certain range of values • Reliability – Assigning a value to a subrange variable that is outside the specified range is detected as an error Implementation of User-Defined Ordinal Types • Enumeration types are implemented as integers • Subrange types are implemented like the parent types with code inserted (by the compiler) to restrict assignments to subrange variables Array Types • An array is an aggregate of homogeneous data elements in which an individual element is identified by its position in the aggregate, relative to the first element. • A reference of an array element in a program includes one or more non-constant subscripts. Such references require a run-time calculation to determine the memory location being referenced. Array Design Issues • What types are legal for subscripts? • Are subscripting expressions in element references range checked? • When are subscript ranges bound? • When does allocation take place? • What is the maximum number of subscripts? • Can array objects be initialized? • Are any kind of slices allowed? Array Indexing • Specific element of an array is referenced by aggregate name, and subscripts or indexes. • Indexing (or subscripting) is a mapping from indices to elements array_name (index_value_list) an element • Index Syntax Ch6-11 – FORTRAN, PL/I, Ada use parentheses • Ada explicitly uses parentheses to show uniformity between array references and function calls because both are mapping. E.g. sum:=sum+B(I); • When need another information to determine whether B(I) is a function call or an array reference. Reduce readability. – Most other languages use brackets – Two district types are involved in an array type: element type, and type of subscripts. Arrays Index (Subscript) Types • FORTRAN, C: integer only • Pascal: any ordinal type (integer, Boolean, char, enumeration) • Ada: integer or enumeration (includes Boolean and char) • Java: integer types only • C, C++, Perl, and Fortran do not specify range checking of subscripts • Java, ML, C# specify range checking • Ada checks the range of all subscripts, but this feature can be disabled by the programmer. Subscript Binding and Array Categories • The binding of the subscript type to an array is usually static, but the subscript value ranges are sometimes dynamically bound. • Lower bound of the subscription range, in some languages, is implicit. E.g. C-based fixed to zero, Fortran it default to 1, Pascal subscript ranges must be specified by the programmer. • There are five categories of arrays, based on the binding to subscript value ranges and the binding to storage. • Static: subscript ranges are statically bound and storage allocation is static (before run-time) – Advantage: efficiency (no dynamic allocation) • Fixed stack-dynamic: subscript ranges are statically bound, but the allocation is done at declaration elaboration time during execution Ch6-12 • • • • – Advantage: space efficiency. A large array in one subprogram can use the same space as a large array in a different subprogram, as long as both subprograms are not at the same time. Stack-dynamic: subscript ranges are dynamically bound and the storage allocation is dynamic (done at run-time). Once the subscript range is bound and the storage is allocated, they remain fixed during the lifetime of the variable. – Advantage: flexibility (the size of an array need not be known until the array is to be used) Fixed heap-dynamic: similar to fixed stack-dynamic: subscript range and the storage binding are dynamic but fixed after allocation. The differences are that the bindings are done when the user program requests them, rather than at elaboration time, and the storage s allocated from the heap, rather than the stack. Heap-dynamic: binding of subscript ranges and storage allocation is dynamic and can change any number of times during the array’s lifetime. – Advantage: flexibility (arrays can grow or shrink during program execution) Examples of the categories: o C and C++ arrays that include static modifier are static o C and C++ arrays without static modifier are fixed stackdynamic o Ada arrays can be stack-dynamic. E.g.: Get List _ Len); declare The user inputs the number of desired elements for the array list. List : array (1...List _ Len) of int eger ; The elements are then dynamically allocated when execution reaches begin the declare block. When execution reaches the end of the block, the end ; list array is deallocated. o C and C++ provide fixed heap-dynamic arrays. - malloc, free (general heap allocation and deallocation operations), can be used for C arrays. Ch6-13 - C++ uses operations (new, delete) to manage heap storage. - Fortran95 supports fixed heap-dynamic arrays, also C#. - In Java all arrays are fixed heap-dynamic array. Once created, they keep the same subscript ranges and storage. o C# includes a second array class ArrayList that provides heapdynamic array. Objects of this class are created without any elements. ArrayList int List= new ArrayList(); Elements are added to this object with (Add) method. ArrayList.Add(nextone); o Perl and JavaScript support heap-dynamic arrays. Arrays implicitly grow whenever assignments are made to elements beyond the last current element, and shrink by assign them an empty aggregate(). e.g.: In Pearl we could creat an array of 5 numbers with @list=(1,2,4,7,19); The array could be lengthend with (push) function Push(@list,13 ,17) To become, (1, 2, 4, 7, 19, 13, 17). And emptied with @list=(); Array Initialization • Some language allow initialization at the time of storage allocation – C, C++, Java, C# example int list [] = {4, 5, 7, 83}; compiler sets the length of the array – Character strings in C and C++ implemented as array of char char name [] = “freddie”; The array name will have 8 elements, because all strings are terminated with null character (zero), which implicitly supplied by the system. – Arrays of strings in C and C++ can initialized with string literals char *names [] = {“Bob”, “Jake”, “Joe”]; – Java initialization of String objects String[] names = {“Bob”, “Jake”, “Joe”}; Ch6-14 Arrays Operations • APL provides the most powerful array processing operations for vectors and matrixes as well as unary operators (for example, to reverse column elements). E.g. A+B is valid expression, where A and B are scalar variables, vectors, or matrixes. • Ada allows array assignment but also catenation (&). Catenation is defined between two single-dimensioned arrays and between a single-dimensioned array and a scalar. • Fortran provides elemental operations because they are between pairs of array elements – For example, + operator between two arrays results in an array of the sums of the element pairs of the two arrays Rectangular and Jagged Arrays • A rectangular array is a multi-dimensioned array in which all of the rows have the same number of elements and all columns have the same number of elements • A jagged matrix has rows with varying number of elements. E.g. a jagged matrix may consist of 3 rows, one with 5 elements, one with 7 elements, and one with 12 elements. This also applies to the columns or higher dimensions. – Jagged arrays are made possible when multi-dimensioned arrays actually appear as arrays of arrays • C, C++, and Java support jugged arrays but nor rectangular arrays. Reference of an element of a multidimensional array uses a separate pair of brackets for each dimension. E.g. myArray[3][7]; • Fortran and Ada support rectangular arrays. All subscript expression is references to elements are placed in a single pair of brackets. E.g. myArray[3, 7]; • C# supports both. Slices • A slice of an array is some substructure of that array; e.g. if A is a matrix, the 1st row of A is one possible slice. Last row, 1st column are also. Ch6-15 • It is not a new data type, it is nothing more than a referencing mechanism • Slices are only useful in languages that have array operations, (i.e. if arrays cannot be manipulated as units, that language has no use for slices). Slice Examples • Fortran 95 Integer, Dimension (10) :: Vector Integer, Dimension (3, 3) :: Mat Integer, Dimension (3, 3, 4) :: Cube Remember that the default lower bound for Fortran array is 1. Vector (3:6) is a four element array Mat(:, 2) referes to the 2nd column of Mat. Mat(3, :) referes to the 3rd row of Mat. All of these references can be used as singl-dimensioned arrays. References to all array slices are treated as if they were arrays of the remaining dimensionality. Slices Examples in Fortran 95 Ch6-16 Implementation of Arrays • Implementation arrays require more compile-time effort than does implementing simple types (integer). The codes to allow accessing of array elements must be generated at compile time. At run time, this code must be executed to produce element addresses. • Access function maps subscript expressions to an address in the array • A single-dimensioned array is a list of adjacent memory cells. Suppose the lower bound of array list is 1. • Access function for single-dimensioned arrays: address(list[k]) = address (list[lower_bound]) + ((k-lower_bound) * element_size) • If the element type is statically bound and the array is statically bound to storage, then the value of address (list[lower-bound]) can be computed before run time. • If the base, or beginning address, of the array is not known until run time, the subtraction, must be done when the array is allocated. Accessing Multi-dimensioned Arrays • Values of data types that have two or more dimensions must be mapped onto the single-dimensional memory. • Two common ways: – Row major order (by rows) – used in most languages – column major order (by columns) – used in Fortran Locating an Element in a Multi-dimensioned Array • General format Location (a[i,j]) = address of a [row_lb,col_lb] + (((i - row_lb) * n) + (j - col_lb)) * element_size Ch6-17 - Row major order locX (i1 , i2 ,in ) loc( X (l1l2 ln )) (i1 l1 )u2u3 un (i2 l2 )u3u4 un (in1 ln )un (in ln ) - Col. Major order locX (i1 , i2 , , in ) loc ( X (l1l2 ln )) (in ln )u1u2 un1 (in1 ln1 )u1u2 un2 (i1 l1) - For matrix in row major order, the number of elements that precedes an element is the number of rows above the element times the size of the row, plus the numbers of elements to the left of the element. Compile-Time Descriptors Single-dimensioned array Multi-dimensional array Ch6-18 Associative Arrays • An associative array is an unordered collection of data elements that are indexed by an equal number of values called keys. In nonassociative arrays, the indices never need to be stored, because of their regularities. – In an associative array, user defined keys must be stored in the structure. So, each element of an associative array is a pair of entities (key, value). • Design issues: What is the form of references to elements Associative Arrays in Perl • Are called hashes, because elements are stored and retrieved with hash functions. • Names begin with %; literals are delimited by parentheses %hi_temps = ("Mon" => 77, "Tue" => 79, “Wed” => 65, …); • Subscripting is done using braces and keys. • The key value is placed in braces and the hash name is replaced by a scalar variable name that is the same except for the first character. $hi_temps{"Wed"} = 83; – Elements can be removed with delete delete $hi_temps{"Tue"}; – The entire hash can be emptied by assignng empty literal to it @hi_temps=(); – The size of a pearl hash is dynamic (grows and shrinks). – The exists operators returns true or false; depending on whether its operand key is an element in the hash if (exists $hi_temps {“Tue”} … – PHP’s arrays are both normal arrays and associative. – A hash is much better than an array if searches of the elements are required, because the implicit hashing operation used to access hash elements is very efficient. On the other hand, if every element of a list must be processed, it would be more efficient to use an array. Ch6-19 Record Types • A record is a possibly heterogeneous aggregate of data elements in which the individual elements are identified by names • Design issues: – What is the syntactic form of references to the fields? – Are elliptical references allowed? Definition of Records • COBOL uses level numbers to show nested records; others use recursive definition 01 EMP-REC. 02 EMP-NAME. 05 FIRST PIC X(20). 05 MID PIC X(10). 05 LAST PIC X(20). 02 HOURLY-RATE PIC 99V99. • Ada: Record structures are indicated in an orthogonal way type Emp_Rec_Type is record First: String (1..20); Mid: String (1..10); Last: String (1..20); Hourly_Rate: Float; end record; Emp_Rec: Emp_Rec_Type; • Record Field References 1. COBOL field_name OF record_name_1 OF ... OF record_name_n; where recore_name_1 is the smallest or innermost record that contains the field. Ex: MID OF EMP_NAME OF EMP_REC 2. Others (dot notation) record_name_1.record_name_2. ... record_name_n.field_name Ex: Employee_Record.Employee_name.Mid References to Records • Most language use dot notation Ch6-20 Ex: reference to the field mid in Ada record example • Fully qualified references must include all record names • Elliptical references allow leaving out record names as long as the reference is unambiguous, for example in COBOL FIRST, FIRST OF EMP-NAME, and FIRST of EMP-REC are elliptical references to the employee’s first name Operations on Records • Assignment is very common if the types are identical • Ada allows record comparison for equality or inequality. • Ada records can be initialized with aggregate literals • COBOL provides MOVE CORRESPONDING statement for: – Copies a field of the source record to the corresponding field in the target record Evaluation and Comparison to Arrays • Design of record is straight forward and safe design. • The only aspect of records that is not clearly readable is the elliptical references allowed by COBOL. • Records are used when collection of data values is heterogeneous and the different fields are not processed in the same way. • Access to array elements is much slower than access to record fields, because subscripts are dynamic (field names are static) • Dynamic subscripts could be used with record field access, but it would disallow type checking and it would be much slower Implementation of Record Type The field of records are stored in adjacent memory location , but because the sizes of fields are not necessarily the same, the access method used for arrays is not used for records. Ch6-21 Offset address relative to the beginning of the records is associated with each field. And the field access is accomplished by using these offsets. Compile-time descriptor for record Unions Types • A union is a type whose variables are allowed to store different type values at different times during execution. Example, table of contents for a compiler. • Design issues – Should type checking be required? And this must be dynamic. – Should unions be embedded in records? Discriminated vs. Free Unions • Fortran, C, and C++ provide union constructs in which there is no language support for type checking; the union in these languages is called free union because programmers are allowed complete freedom from type checking in their use. • Type checking of unions require that each union include a type indicator called a discriminant or tag (discriminated union). ALGOL68 was the 1st language to provide it. – Supported by Ada Ch6-22 Ada Union Types Ada design for discriminated unions, allowes the user to specify variables of a variant record type that will store only one of the possible type values in the variant. In this way the user can tell the system when the type checking can be static. Such a restricted variable is called a constrained variant variable. Unconstrained variable records in Ada allow the values of their variants to change type during execution. The type of variant can be changed only by assigning the entire record, including the discriminant. Ex: Consider Ada Variant record type Shape is (Circle, Triangle, Rectangle); type Colors is (Red, Green, Blue); type Figure (Form: Shape) is record Filled: Boolean; Color: Colors; case Form is when Circle => Diameter: Float; when Triangle => Leftside, Rightside: Integer; Angle: Float; when Rectangle => Side1, Side2: Integer; end case; end record; the following two statements declare variables of type figure Figure_1: Figure;// Unconstrained variable record that has no initial value. Its type can be changed by assignment of whole record. Figure_1:=(FilledTrue, ColorBlue Form Rectangle Side_112 Side_23); Figure_2:Figure(FormTrangle);// Is constrained to be triangle and cannot be changed to another variant. Ch6-23 Ada Union Type Illustrated A discriminated union of three shape variables (Assume all the variables are the same size) Evaluation of Unions • Potentially unsafe construct – Do not allow type checking. This way Fortran, C, C++ are not strongly typed • Java and C# do not support unions – Reflective of growing concerns for safety in programming language • Discriminated unions are implemented by simply using the same address for every possible variant. Sufficient storage of the largest variant is allocated. Pointer and Reference Types • A pointer type variable has a range of values that consists of memory addresses and a special value, nil. The value nil is not a valid address and is used to indicate that a pointer cannot currently be used to reference any memory cel. • Provide the power of indirect addressing • Provide a way to manage dynamic memory • A pointer can be used to access a location in the area where storage is dynamically created or allocated (usually called a heap) Ch6-24 • Variables that are dynamically allocated from the heap are called heap-dynamic variables, which do not have identifiers associated with them, and can be referenced only by pointers or variables. • Variables without names are called anonymous variables. • Pointers are not structured types, although are defined using the type operator (* in C and C++, access in Ada). • Pointers are different from scalar variables because they are most often used to reference some other variables, rather than being used to store data of same sort. • Pointers add writability to a language (dynamic structures trees, linked lists). Design Issues of Pointers • What are the scope of and lifetime of a pointer variable? • What is the lifetime of a heap-dynamic variable? • Are pointers restricted as to the type of value to which they can point? • Are pointers used for dynamic storage management, indirect addressing, or both? • Should the language support pointer types, reference types, or both? Pointer Operations • Two fundamental operations: assignment and dereferencing • Assignment is used to set a pointer variable’s value to some useful address • Dereferencing yields the value stored at the location represented by the pointer’s value – Dereferencing can be explicit or implicit – C++ uses an explicit operation via * j = *ptr sets j to the value located at ptr Ch6-25 Pointer Assignment Illustrated The assignment operation j = *ptr When pointers point to records, the syntax of the references to the fields of these records varies among languages. C, C++, there are two ways. If a pointer variable P points to a record with a field named age, we use, (*p).age, another way page Problems with Pointers • Dangling pointers (dangerous) – A pointer points to a heap-dynamic variable that has been de-allocated. Dangling pointers are dangerous for several reasons: 1. The location being pointed to may have been reallocated to some new heap-dynamic variable. If the new variable is not the same type as the old one, type checks of uses of the dangling pointers are invalid. 2. Even if the new one is the same type, its new value will bear no relationship to the old pointer’s dereferenced value. 3. If the dangling pointer is used to change the heap-dynamic variable will be destroyed. 4. It is possible that the location now is being temporarily used by the storage management system, possibly as a pointer in a chain Ch6-26 of variable blocks of storage, thereby allowing a change to the location to cause the storage manager to fail. Ex: C++ int *arrayptr1; int *arrayptr2=new int [100]; // create heap-dynamic structure arrayptr1=arrayptr2; delete []arrayptr2; //new, arrayptr1 is dangling, because the heap storage to which it was pointing has been deallocated. • Lost heap-dynamic variable – An allocated heap-dynamic variable that is no longer accessible to the user program (often called garbage). Lost heap-dynamic variables are created by the following sequence of operations. • Pointer p1 is set to point to a newly created heapdynamic variable • Pointer p1 is later set to point to another newly created heap-dynamic variable • The 1st heap-dynamic variable is now inaccessible, or lost (memory leakage) Pointers in Ada • Some dangling pointers are disallowed because dynamic objects can be automatically de-allocated at the end of pointer's type scope • The lost heap-dynamic variable problem is not eliminated by Ada Pointers in C and C++ • Extremely flexible but must be used with care • Pointers can point at any variable regardless of when it was allocated • Used for dynamic storage management and addressing • Pointer arithmetic is possible • Explicit dereferencing and address-of operators the asterisk (*) denotes the dereferencing operation, and ampersand (&) denotes the operator for producing the address of a variable. Ch6-27 Ex: int *ptr; int count, init; … ptr=&init; are equivalent to count=init; count=*ptr; • Domain type need not be fixed (void *) (generic pointers) • void * can point to any type and can be type checked (cannot be de-referenced) Pointer Arithmetic in C and C++ float stuff[100]; float *p; p = stuff; //assign the address of stuff[0] to p *(p+5) is equivalent to stuff[5] and p[5] *(p+i) is equivalent to stuff[i] and p[i] Pointers in Fortran 95 • Pointers point to heap and non-heap variables (static) • Implicit dereferencing • Pointers can only point to variables that have the TARGET attribute • The TARGET attribute is assigned in the declaration: INTEGER, TARGET :: NODE Reference Types • C++ includes a special kind of pointer type called a reference type that is used primarily for formal parameters. Reference type variables are specified y (&) Ex: int result=0; int &ref_result=result; … ref_result=100; result and ref_result are aliases. – Advantages of both pass-by-reference and pass-by-value • Java extends C++’s reference variables and allows them to replace pointers entirely Ch6-28 – References refer to class instances. Java reference variables can be assigned to refer to different class instances. In the following, String is a standard Java class String str1; … str1=”This is a Java literal string”; str1 is defined to be a reference to a string class instance or object. Because Java class instances are implicitly deallocated, there cannot be a dangling reference. • C# includes both the references of Java and the pointers of C++ Evaluation of Pointers • Dangling pointers and dangling objects are problems as is heap management • Pointers are like goto's--they widen the range of cells that can be accessed by a variable • Pointers or references are necessary for dynamic data structures-so we can't design a language without them Representations of Pointers • Large computers use single values • Intel microprocessors use segment and offset. So pointers are references and implemented as pairs of 16-bits cells, one for each of the two parts of an address. Dangling Pointer Problem • There have been several proposed solutions to the dangling pointer problem. • Tombstone: extra heap cell that is a pointer to the heap-dynamic variable – The actual pointer variable points only at tombstones and never to heap-dynamic variables. – When heap-dynamic variable de-allocated, tombstone remains but set to nil, indicating that the heap-dynamic variable no longer exists. Ch6-29 – This approach prevents a pointer from ever pointing to deallocated variable. Any reference to any pointer that point to a nil tombstone can be detected as an error. – Tombstones are costly in time and space. Because tombstones are never deallocated, their storage is never reclaimed. Every access to heap-dynamic variable through a tombstone requires one more level of indirection access. . an alternatve locks-and-keys: Pointer values are represented as ordered (key, address) pairs – Heap-dynamic variables are represented as variable plus cell for integer lock value – When heap-dynamic variable allocated, lock value is created and placed in lock cell and key cell of the pointer that is specified in the call to new. – Every access to the dereferenced pointer compares the key value of the pointer to the lock value in the heap-dynamic variable. If they match, the access is legal; otherwise, the access is treated as run-time error. – Any copies of the pointer value to other pointer must copy the key value. Therefore, any number of pointers can reference a given heap-dynamic variable. When a heapdynamic variable is deallocated with dispose, its lock value is cleared to an illegal lock value. Then, if the pointer other than the one specified in the dispose is dereferenced, its address value will still be intact, but its key value will no longer match the lock, so the address will not be allowed. Heap Management • A very complex run-time process • Single-size cells vs. variable-size cells. i.e. all heap storage is allocated and deallocated in units of a single size, or, in which variable size segments are allocated and deallocated. • Single-size cell: every cell contains a pointer (like Lisp) • In a single-size allocation heap, all available cells are linked together using the pointer in the cells, forming a list of available Ch6-30 space. Allocation is simple taking required number of cells from this list. • Deallocation is complex. A heap-dynamic variable can be pointed to by more than one pointer making it difficult to determine when the variable is no longer useful to the program. One pointer is disconnected from a cell does not make it garbage. • Two approaches to reclaim garbage – Reference counters (eager approach): reclamation is gradual – Garbage collection (lazy approach): reclamation occurs when the list of variable space becomes empty Reference Counter • Reference counters: maintain a counter in every cell that stores the number of pointers currently pointing at the cell. If reference counter reaches zero, it considered garbage and returned to free list. – Disadvantages: space required for the counters, execution time required to maintain counter value if pointers changing values heavily (Lisp), complications for cells connected circularly. The problem is that each cell in the circular list has a reference counter value of at least 1, which prevents it from return to available list. Garbage Collection • The run-time system allocates storage cells as requested and disconnects pointers from cells as necessary until it has allocated all available cells. Garbage collection then begins to gather al the garbage left floating around in the heap. – To facilitate the process, every heap cell has an extra bit used by collection algorithm. The process consists of 3 phases: – All cells in the heap initially set to garbage – All pointers in program are traced into heap, and reachable cells marked as not garbage – All garbage cells returned to list of available cells. To see how marking algorithm works: assume all heap-dynamic Ch6-31 variables (heap cells), consist of information part (tag), and two pointers (Llink, Rlink). We build directed graphs with at most two edges leading from any node. Marking algorithm traverse all spanning trees of the graphs marking all cells that are found For every pointer r do Mark(r) Void mark (void *ptr) { If (ptr !=0) If (*ptr.tag is not marked) { Set *ptr.tag; Mark (*ptr.llink); Mark (*ptr.rlink); } } – Disadvantages: when you need it most, it works worst (takes most time when program needs most of cells in heap) because it makes a good deal of time to trace and mark useful cells. – In this case, the process yields only a small number of cells that can be placed in the free list. Marking algorithm requires a great deal of storage (for stack) because of recursion. Ch6-32 Marking Algorithm Variable-Size Cells • All the difficulties of single-size cells plus more • Variable size are required by most programming languages • If garbage collection is used, additional problems occur – The initial setting of the indicators of all cells in the heap to indicate that they are garbage is difficult. Because cells are different sizes, scanning them is a problem. One solution is for each cell to have the cell size as its first field. – The marking process in nontrivial. How can a chain be followed from a pointer if there is no predefined location for the pointer in the pointed-to cell? – Cells that do not contain pointers at all are also a problem. Adding a system pointer to each cell will work, but it must be maintained in parallel with the user-defined pointers. This task adds space and execution time overhead to the cost of running the program. – Maintaining the list of available space is another source of overhead. The list can begin with a single cell consisting of Ch6-33 all available space. Requests for segments reduce the size of this block. Reclaimed cells are added to the list. The list becomes a long list of various size segments (blocks). This slows allocation because requests cause the list to be searched for sufficiently large blocks. Then fragmentation is very high. Need to compact. Or use best-fit strategy which needs to keep the list ordered by block size, which is an overhead time. Ch6-34