LANGUAGE PRIMITIVE DATA STRUCTURES in C++
How well do the C++ int related data types represent the mathematical
concept of integer?
The mathematical concept of INTEGER is an aleph null, or countably infinite, set.[1]
Integers are often represented by the integer number line. The set of integers is closed
under addition, subtraction, and multiplication. Integer division can also be defined so that
it is closed except, of course, in the case of division by zero. The set of integers also has
other well-defined properties such as the existence of additive and multiplicative
identities, and an additive inverse. In fact, the set of integers is a commutative ring under the
operations of + and *, so you can see your Abstract Algebra class for more fun along these
lines.
In C++, int, long int (or long), short int (or short), and so forth store integers in two’s
complement form[2] as a specific number of bits.[3] For example, an int is almost always
stored as either 16 bits or 32 bits (2 or 4 bytes). Regardless of language, the range of an
integer stored in N bits in two’s complement is always
-2^(N-1) to +2^(N-1) - 1
for 8 bits, this range is -128 to +127
for 16 bits, this range is ≈ ±32K
for 32 bits, this range is ≈ ±2G
Many languages also support unsigned integers. In C++, they are called unsigned int,
unsigned long, and unsigned short. The range of integers for an N bit unsigned integer is
0 to 2^N - 1.
As long as no value computed or stored is outside the supported range,
the computer representation of integers is perfect. Integer overflow or
underflow is the result when an operation produces a value outside that range.
limits.h (climits in C++) contains predefined constants representing the highest and lowest values
for some of these data types. The names of some of these constants are given below. Be
advised, however, that it is usually unwise to test against these values as comparisons
involve subtraction, and any operation involving one of these numbers can easily produce
overflow or underflow.
[1] Georg Cantor first characterized infinite sets using aleph numbers. He showed that there is an infinite
number of different infinities. Any set that can be placed in 1 to 1 correspondence with the set of integers is aleph null
(also called aleph naught). The set of real numbers is uncountable, but calling its size aleph one assumes the
continuum hypothesis, which can be neither proved nor disproved from the standard axioms of set theory. I read
somewhere that the set of functions R to R, where R is the set of real numbers, is aleph two. That set is strictly
larger than R, but it is aleph two only under the generalized continuum hypothesis.
[2] You should have studied two’s complement in Discrete Math and Computer Organization. If you are not
familiar with this scheme, I can give you a reference.
[3] I could have said bytes actually, as the number of bits allocated to an integer is always a multiple of 8
these days. You can use the sizeof operator to determine the actual number of bytes allocated to any data
type.
INT_MIN     LONG_MIN     SHRT_MIN
INT_MAX     LONG_MAX     SHRT_MAX
UINT_MAX    ULONG_MAX    USHRT_MAX
How well do the C++ double related data types represent the
mathematical concept of a real number?
The mathematical concept of a REAL number is an uncountably infinite set (aleph one, if the continuum hypothesis is assumed).
The real number line is often used to represent real numbers. Notice the fundamental
difference between the set of integers and the set of real numbers. The number of integers
between any two numbers on the integer number line is finite. The number of real
numbers between any two numbers on the real number line is infinite, and not only
infinite but uncountably infinite.
In C++, float, double, and long double attempt to represent real numbers. Computer
representations of real numbers are stored in a manner similar to scientific notation,
except in binary. These representations are called floating-point representations. There
are many variations, but the method described here is most common today as it follows
the IEEE 754 floating-point representation standard. For 64 bits, there is a single sign bit,
an 11 bit exponent[4] and a 52 bit mantissa.
Despite the best efforts of a number of people over the years, 32 bits is often unable to
store floating-point numbers with sufficient accuracy and range. This is why C and C++
have traditionally favored the double representation over float. For example, none of
the original functions in math.h[5] expect a float parameter nor do they return a float value.
Float versions such as sqrtf were only added later, in C99, and as overloads in the C++ cmath header.
The exponent provides the range of numbers that can be represented. The 11 bit
exponent in double gives a range of about ±2^1023. In base 10, this is about ±10^308.
The mantissa provides the level of accuracy of the representation (the number of
significant digits). The mantissa is normally stored as a binary value >= 1 and < 2.
The exponent is adjusted to get the mantissa in the desired range. Such numbers are called
normalized. Some values such as ZERO cannot be normalized and some other values are
de-normalized for other purposes, such as to permit a closer approach to 0.0. As a leading
1 is inevitable on all normalized floating-point representations, most computers do not
store it, thus getting 53 bit mantissas out of 52 stored bits. These 53 bits in binary provide
15 to 16 significant digits in decimal.
[4] Exponents are almost always stored in excess notation. Excess notation means that a bias is added to the
exponent to make it positive. For an 11 bit exponent, this bias is 1023. The reason is subtle. It allows two
64 bit floating-point numbers to be compared as if they were 64 bit integers, greatly speeding up real
number comparison and simplifying the design of an ALU.
[5] Not even the C functions that include an “f” to indicate “floating” such as atof, modf, and fabs deal with
float. They all return a double.
Thus, all whole numbers up to 2^53 (about 9 x 10^15, which covers every whole number of 15
or fewer significant digits) can be represented exactly in floating-point. Much larger numbers
whose binary form has many zeros in the lower bits are also stored exactly. For example,
10101 followed by 60 zeros (spaces added for readability):
10101 0000000000 0000000000 0000000000 0000000000 0000000000 0000000000 (base 2)
= 1.0101 (base 2) * 2^64
(mantissa = 0101… and exponent = 64 + 1023 in binary)
Furthermore, any fraction that can be stated with the denominators as powers of 2 can be
represented exactly. For example:
53/64 = 32/64 + 16/64 + 4/64 + 1/64 = 1/2 + 1/4 + 1/16 + 1/64
= 0.110101 (base 2)
Many common fractions such as 1/3, 1/5, 2/3, etc. cannot be represented exactly as they
become repeating binary fractions when converted to binary. For example, 2/5 = 0.4 (base 10) =
0.011001100110… (base 2). Furthermore, no irrational numbers such as sqrt(2), including
transcendental numbers such as pi or e, can be represented exactly.
Even such mundane values as those associated with money are usually approximated
when stored or computed:
$34.73 cannot be stored exactly as 73/100 cannot be expressed as a finite sum of
fractions with denominators all powers of 2.
You should almost never attempt to compare two floating-point values
for either equality or inequality. In fact, all compilers should flag such a
statement with a warning message. Unfortunately, some do and some do not. The only
situation where you can reasonably expect the comparison to work correctly is if both
values to be compared were input, or are constants. Use >= or <= when reasonable, but if
you want to test if A equals B, you should instead test something like:
if( fabs( A-B) < ERROR_FACTOR )
Here, ERROR_FACTOR is a previously defined constant along the lines of 1.0E-10.
You should try to avoid subtraction of values that are almost equal. This
leads to the loss of many significant digits. Consider the following decimal example in
which the number of significant digits drops from 15 down to only 2 in 1 operation.
Unfortunately, such operations are almost inevitable if one is searching for the root of an
equation. In addition, the higher the degree of the polynomial, the more acute the round-off problem is likely to be.
  1.56784932563566
- 1.56784932563549
------------------
  0.00000000000017
Watch out for the accumulation of round-off error. Many times a
calculation or a sequence of calculations appears in a loop. Each iteration bases the
current calculation on the previous one, and the round-offs can sometimes accumulate. In
such cases, if the loop is modified so that the steps the loop takes become smaller
generating more calculations, the result can be either better or worse.
The order of operations can affect round-off error. If a sequence of
operations must be performed, the greatest level of accuracy will generally follow the
form with the least variation in the size of the partial results. For example:
h = (a * b^3) / (d * e * g)
should probably be written something like:
h = a/d*b/e*b/g*b;
// Use of the pow() function, while tempting,
// will often give the least accurate result.
The number of operations can affect round-off error. If a sequence of
operations must be performed, consider writing the expression so that the number of
operations is minimized. As each operation may have round-off error, reducing the
number of operations to be performed should produce a positive benefit in most cases.
For example, three versions of the evaluation of a cubic polynomial are shown below. The
best one is the last. Making a function call (that may require an unknown number of
operations) should be avoided if it can be done with reasonable ease. For example pow
below probably works by computing antilog( exponent * log( base ) ). This is fine for a
real exponent, but a preposterous way to raise a number to an integer exponent.
Y = a * pow(x,3) + b*pow(x,2) + c * x + d; // 6 ops + 2 calls. Worst
Y = a * x * x * x + b * x * x + c * x + d; // 9 ops. Better
Y = x * ( x * ( x * a + b ) + c) + d;      // 6 ops. Best
DATA AGGREGATES and other DERIVED DATA TYPES
Array
An array is a data aggregate in which every element must be the same type. The
fundamental operation for accessing elements within the array is the index (or subscript)
(e.g. a[j]). C++ treats multidimensional arrays as arrays of arrays (e.g. a[j][k][y]). Some
other languages such as Pascal and Fortran do not (e.g. a[j, k, y]). Some languages permit
references to a portion of an array, but most, including C++, do not.
Arrays cannot be returned as the return value of a function, but a pointer to an array can be
returned. Arrays cannot be copied or input or output as if they were atomic objects
(except strings). You cannot say a = b; if they are both arrays, and cout << a; prints an
address rather than the elements. Arrays can be
initialized when declared, but no repeat factors are allowed.
int a[5] = {1, 2, -1, 5};
// Unlisted values, in this case a[4], will be given the value 0
int c[4][3] = {
    {1, 2},      // c[0][2] will be 0
    {4, 5, 6},
    {7}          // c[2][1] and c[2][2] will be 0
};               // all of row [3] will be 0
Arrays are always stored in a linear block of memory. The compiler converts the index or
subscript as stated in the high level language into an offset into memory in assembly
language.
An index into a single dimension integer array in C++: If an int is stored in 4 bytes, then
an array of 100 integers will be stored in a contiguous block of 400 bytes in memory. The
address of the first byte in the block is what is passed whenever an array is passed as a
parameter. A reference to any element of the array merely requires that the index be
shifted two bits to the left[6] then added to the address of the beginning of the array to
generate a memory address.
In C++ and most other modern languages, a two dimensional array is stored by rows, also called
row-major form (Fortran is a notable exception that stores arrays by columns). In memory, the
zeroth row is stored first, then the first row, and so on. In the
example below, the values stored in the 3 by 4 array indicate the row and column, e.g. 21
means row 2, column 1. A 2D reference is converted to linear, then shifted to generate a
byte offset. Thus, the 2D array below:
int A[3][4];
         Col 0   Col 1   Col 2   Col 3
Row 0    00      01      02      03
Row 1    10      11      12      13
Row 2    20      21      22      23
Would actually be stored as a 1D array as shown below (assuming the starting address
2000)
Address of A[r][c] = AddressOfA + BytesPerInt * (r*NumCols + c)
A[2][1]            = 2000 + 4 * (2 * 4 + 1)
                   = 2000 + 4 * 9
                   = 2036
2D index   1D index   Value   Byte address
[0][0]     [0]        00      2000
[0][1]     [1]        01      2004
[0][2]     [2]        02      2008
[0][3]     [3]        03      2012
[1][0]     [4]        10      2016
[1][1]     [5]        11      2020
[1][2]     [6]        12      2024
[1][3]     [7]        13      2028
[2][0]     [8]        20      2032
[2][1]     [9]        21      2036
[2][2]     [10]       22      2040
[2][3]     [11]       23      2044
[6] Shifting a binary number to the left two bits is, of course, the same as multiplying by 4 and is much
faster on most hardware.
Struct or Record
The C++ struct is called a record in most languages that are not C based. A struct or
record is an aggregate of related data. It often consists of multiple data types and sizes
(e.g. EmployeeRecord, SalesRecord, InventoryRecord, or EnrollmentRecord). A struct
can just as easily be used to store data of the same type, but the data would still need to
be related in some fundamental way. One example would be a record storing a Complex
number in Cartesian coordinates. It would consist of two doubles, one for the Real part,
and the other for the Imaginary part. Obviously, a small array could have been used, but
then the component references would have been less clear.
References to fields within a struct use the dot notation: structName.fieldName
Unlike arrays, records can generally be treated as atomic objects. There are many other
interesting and useful things that could be said, but why bother. Most modern books
strongly discourage the use of structs and, with considerable justification, encourage the
use of classes instead.
MORE SEMANTICALLY COMPLICATED STRUCTURES
Abstract Data Type (ADT) – a data structure combined with a set of operations defined
on the structure
Class – Classes are a modern derivative of the Abstract Data Type in which additional
capabilities such as encapsulation, operator overloading, polymorphism, and information
hiding have been added. References to components of a class generally require a similar
syntax as for a struct: objectName.dataName or objectName.functionName