LANGUAGE PRIMITIVE DATA STRUCTURES in C++

How well do the C++ int related data types represent the mathematical concept of integer?

The mathematical concept of INTEGER is an aleph null, or countably infinite, set [1]. Integers are often represented by the integer number line. The set of integers is closed over addition, subtraction, and multiplication. Integer division can also be defined so that it is closed except, of course, in the case of division by zero. The set of integers also has other well-defined properties such as the existence of additive and multiplicative identities, and an additive inverse for every element. In fact, the set of integers is a commutative ring over the operations of + and *, so you can see your Abstract Algebra class for more fun along these lines.

In C++, int, long int (or long), short int (or short), and so forth store integers in two's complement form [2] as a specific number of bits [3]. (Two's complement has been near-universal for decades and is required by the standard since C++20.) For example, an int is almost always stored as either 16 bits or 32 bits (2 or 4 bytes). Regardless of language, the range of an integer stored in N bits in two's complement is always $-2^{N-1}$ to $+2^{N-1}-1$:

    for 8 bits, this range is -128 to +127
    for 16 bits, this range is ≈ ±32K
    for 32 bits, this range is ≈ ±2G

Many languages also support unsigned integers. In C++, they are called unsigned int, unsigned long, and unsigned short. The range of an N bit unsigned integer is 0 to $2^{N}-1$.

As long as no value computed or stored is outside the supported range, the computer representation of integers is perfect. Integer overflow or underflow is the result when such an operation fails. In C++, unsigned arithmetic wraps around modulo $2^{N}$, while signed overflow is undefined behavior.

limits.h (<climits> in C++) contains predefined constants representing the highest and lowest values for some of these data types. The names of some of these constants are given below. Be advised, however, that tests against these values must be written with care, because any arithmetic involving one of these numbers can easily produce overflow or underflow. A test such as a + b > INT_MAX can never succeed: the sum has already overflowed before the comparison is made. Rearrange such tests so that no intermediate result leaves the range, for example a > INT_MAX - b when b is not negative.

[1] Georg Cantor first characterized infinite sets using aleph numbers.
He conceived of an infinite hierarchy of different infinities. Any set that can be placed in 1 to 1 correspondence with the set of integers is aleph null (also called aleph naught). The set of real numbers is uncountable; its cardinality is aleph one if the continuum hypothesis is assumed. I read somewhere that the set of functions R to R, where R is the set of real numbers, is aleph two. That holds under the generalized continuum hypothesis; in any case, the set is strictly larger than the set of real numbers.

[2] You should have studied two's complement in Discrete Math and Computer Organization. If you are not familiar with this scheme, I can give you a reference.

[3] I could have said bytes actually, as the number of bits allocated to an integer is always a multiple of 8 these days. You can use the sizeof operator to determine the actual number of bytes allocated to any data type.

    Type      Lowest value   Highest value   Highest unsigned value
    int       INT_MIN        INT_MAX         UINT_MAX
    long      LONG_MIN       LONG_MAX        ULONG_MAX
    short     SHRT_MIN       SHRT_MAX        USHRT_MAX

How well do the C++ double related data types represent the mathematical concept of a real number?

The mathematical concept of a REAL number is an uncountably infinite set (aleph one, if the continuum hypothesis is assumed). The real number line is often used to represent real numbers. Notice the fundamental difference between the set of integers and the set of real numbers. The number of integers between any two numbers on the integer number line is finite. The number of real numbers between any two distinct numbers on the real number line is infinite, and not only infinite but uncountably infinite.

In C++, float, double, and long double attempt to represent real numbers. Computer representations of real numbers are stored in a manner similar to scientific notation, except in binary. These representations are called floating-point representations. There are many variations, but the method described here is the most common today, as it follows the IEEE 754 floating-point representation standard. For 64 bits, there is a single sign bit, an 11 bit exponent [4], and a 52 bit mantissa.
Despite the best efforts of a number of people over the years, 32 bits is simply unable to store floating-point numbers with reasonable accuracy and range for much numerical work: a 32 bit float has only about 7 significant decimal digits and a range of about $10^{\pm 38}$. This is why C and C++ traditionally favor double. Floating-point constants are double unless marked otherwise, and none of the original functions in math.h [5] expect a float parameter, nor do they return a float value. (C99 later added float versions such as sinf, and <cmath> in C++ provides float overloads.)

[4] Exponents are almost always stored in excess notation. Excess notation means that a bias is added to the exponent so that the stored value is never negative. For an 11 bit exponent, this bias is 1023. The reason is subtle. It allows two 64 bit floating-point numbers of the same sign to be compared as if they were 64 bit integers, greatly speeding up real number comparison and simplifying the design of an ALU.

[5] Not even the C functions that include an "f" to indicate "floating", such as atof, modf, and fabs, deal with float. They all return a double.

The exponent provides the range of numbers that can be represented. The 11 bit exponent in double gives a range of roughly $2^{\pm 1023}$. In base 10, this is about $10^{\pm 308}$.

The mantissa provides the level of accuracy of the representation (the number of significant digits). The mantissa is normally stored as a binary value ≥ 1 and < 2. The exponent is adjusted to get the mantissa into the desired range. Such numbers are called normalized. Some values such as ZERO cannot be normalized, and some other values are de-normalized for other purposes, such as to permit a closer approach to 0.0. As a leading 1 is inevitable on all normalized floating-point representations, most computers do not store it, thus getting 53 bit mantissas out of 52 stored bits. These 53 bits in binary provide 15 to 16 significant digits in decimal.

Thus, all whole numbers up to $2^{53}$ (about $9.0 \times 10^{15}$, which covers every whole number of 15 digits or fewer) can be represented exactly in floating-point. Much larger numbers whose binary form has many zeros in the lower bits are also stored exactly.
For example, the binary number consisting of 10101 followed by 60 zeros is

    $10101\underbrace{00\ldots0}_{60}{}_2 = 1.0101_2 \times 2^{64}$    (mantissa = 0101… and exponent = 64 + 1023 in binary)

Furthermore, any fraction whose denominator is a power of 2 can be represented exactly, provided it fits in the mantissa. For example:

    53/64 = 32/64 + 16/64 + 4/64 + 1/64 = 1/2 + 1/4 + 1/16 + 1/64 = $0.110101_2$

Many common fractions such as 1/3, 1/5, 2/3, etc. cannot be represented exactly, as they become repeating binary fractions when converted to binary. For example, 2/5 = $0.4_{10}$ = $0.011001100110\ldots_2$. Furthermore, no irrational or transcendental numbers such as √2, π, or e can be represented exactly. Even such mundane values as those associated with money are usually approximated when stored or computed: $34.73 cannot be stored exactly, as 73/100 cannot be expressed as a finite sum of fractions whose denominators are all powers of 2.

You should almost never attempt to compare two floating-point values for either equality or inequality. In fact, all compilers should flag such a statement with a warning message. Unfortunately, some do and some do not. The only situation where you can reasonably expect the comparison to work correctly is if both values to be compared were input, or are constants. Use >= or <= when reasonable, but if you want to test if A equals B, you should instead test something like:

    if( fabs( A-B) < ERROR_FACTOR )

Here, ERROR_FACTOR is a previously defined constant along the lines of 1.0E-10. A fixed tolerance like this suits values whose magnitude is near 1; for much larger or smaller values, scale the tolerance to the size of A and B.

You should try to avoid subtraction of values that are almost equal. This leads to the loss of many significant digits. Consider the following decimal example, in which the number of significant digits drops from 15 down to only 2 in 1 operation. Unfortunately, such operations are almost inevitable if one is searching for the root of an equation. In addition, the higher the degree of the polynomial, the more acute the roundoff problem is likely to be.
In decimal, the subtraction looks like this:

      1.56784932563566
    - 1.56784932563549
    ------------------
      0.00000000000017

Watch out for the accumulation of round-off error. Many times a calculation or a sequence of calculations appears in a loop. Each iteration bases the current calculation on the previous one, and the round-offs can sometimes accumulate. In such cases, if the loop is modified so that the steps the loop takes become smaller, generating more calculations, the result can be either better or worse.

The order of operations can affect round-off error. If a sequence of operations must be performed, the greatest level of accuracy will generally follow the form with the least variation in the size of the partial results. For example,

    $h = \dfrac{a\,b^{3}}{d\,e\,g}$

should probably be written something like:

    h = a/d*b/e*b/g*b;    // Use of the pow() function, while tempting,
                          // will often give the least accurate result.

The number of operations can affect round-off error. If a sequence of operations must be performed, consider writing the expression so that the number of operations is minimized. As each operation may have round-off error, reducing the number of operations to be performed should produce a positive benefit in most cases. For example, three versions of the evaluation of a cubic equation are shown below. The best one is the last. Making a function call (that may require an unknown number of operations) should be avoided if it can be done with reasonable ease. For example, pow below probably works by computing antilog( exponent * log( base ) ). This is fine for a real exponent, but a preposterous way to raise a number to an integer exponent.

    Y = a * pow(x,3) + b * pow(x,2) + c * x + d;    // 6 ops + 2 calls. Worst
    Y = a * x * x * x + b * x * x + c * x + d;      // 9 ops. Better
    Y = x * ( x * ( x * a + b ) + c ) + d;          // 6 ops. Best

DATA AGGREGATES and other DERIVED DATA TYPES

Array

An array is a data aggregate in which every element must be the same type.
The fundamental operation for accessing elements within the array is the index (or subscript), e.g. a[j]. C++ treats multidimensional arrays as arrays of arrays, e.g. a[j][k][y]. Some other languages do not: Pascal writes a[j, k, y] and Fortran writes a(j, k, y). Some languages permit references to a portion of an array, but most, including C++, do not.

Arrays cannot be returned as the return value of a function, but a pointer to an array can be returned. Arrays cannot be copied or input or output as if they were atomic objects (except strings). You cannot say a = b; if they are both arrays, and cout << a; prints only the address of the array rather than its contents (unless it is a char array holding a string).

Arrays can be initialized when declared, but no repeat factors are allowed.

    int a[5] = {1, 2, -1, 5};     // Unlisted values, in this case a[4], will be given the value 0

    int c[4][3] = { {1, 2},       // c[0][2] will be 0
                    {4, 5, 6},
                    {7}           // c[2][1] and c[2][2] will be 0
                  };              // all of row [3] will be 0

Arrays are always stored in a linear block of memory. The compiler converts the index or subscript as stated in the high level language into an offset into memory in assembly language.

An index into a single dimension integer array in C++: If an int is stored in 4 bytes, then an array of 100 integers will be stored in a contiguous block of 400 bytes in memory. The address of the first byte in the block is what is passed whenever an array is passed as a parameter. A reference to any element of the array merely requires that the index be shifted two bits to the left [6], then added to the address of the beginning of the array to generate a memory address.

In C++ and most other modern languages, a two dimensional array is stored by rows, also called row-major form. (Fortran is a notable exception; it stores arrays by columns.) In memory, the zeroth row is stored first, then the first row, and so on. In the example below, the values stored in the 3 by 4 array indicate the row and column, i.e. 21 means row 2, column 1. A 2D reference is converted to linear, then shifted to generate a byte offset.
Thus, the 2D array declared as int A[3][4]; and pictured below:

             Col 0   Col 1   Col 2   Col 3
    Row 0     00      01      02      03
    Row 1     10      11      12      13
    Row 2     20      21      22      23

would actually be stored as a 1D array as shown below (assuming the starting address 2000).

    Address of A[r][c] = AddressOfA + BytesPerInt * (r * NumCols + c)
    Address of A[2][1] = 2000 + 4 * (2 * 4 + 1) = 2000 + 4 * 9 = 2036

    2D index   1D index   Value   Byte address
    [0][0]     [0]        00      2000
    [0][1]     [1]        01      2004
    [0][2]     [2]        02      2008
    [0][3]     [3]        03      2012
    [1][0]     [4]        10      2016
    [1][1]     [5]        11      2020
    [1][2]     [6]        12      2024
    [1][3]     [7]        13      2028
    [2][0]     [8]        20      2032
    [2][1]     [9]        21      2036
    [2][2]     [10]       22      2040
    [2][3]     [11]       23      2044

[6] Shifting a binary number to the left two bits is, of course, the same as multiplying by 4 and is much faster on most hardware.

Struct or Record

The C++ struct is called a record in most languages that are not C based. A struct or record is an aggregate of related data. It often consists of multiple data types and sizes (e.g. EmployeeRecord, SalesRecord, InventoryRecord, or EnrollmentRecord). A struct can just as easily be used to store data that is all of the same type, but the data would still need to be related in some fundamental way. One example would be a record storing a Complex number in Cartesian coordinates. It would consist of two doubles, one for the Real part, and the other for the Imaginary part. Obviously, a small array could have been used, but then the component references would have been less clear.

References to fields within a struct use the dot notation:

    structName.fieldName

Unlike arrays, records can generally be treated as atomic objects: they can be assigned, passed as parameters, and returned from functions as a whole.

There are many other interesting and useful things that could be said, but why bother? Most modern books strongly discourage the use of structs and, with considerable justification, encourage the use of classes instead. (In C++ itself the two differ only in default access: struct members are public unless stated otherwise, class members are private.)
MORE SEMANTICALLY COMPLICATED STRUCTURES

Abstract Data Type (ADT): a data structure combined with a set of operations defined on the structure.

Class: classes are a modern derivative of the Abstract Data Type in which additional capabilities such as encapsulation, operator overloading, polymorphism, and information hiding have been added.

References to components of a class generally use the same dot syntax as for a struct:

    objectName.dataName    or    objectName.functionName