LANGUAGE PRIMITE DATA STRUCTURES

advertisement
LANGUAGE PRIMITIVE DATA STRUCTURES
How well do the integer related data types in modern computer
languages represent the mathematical concept of integer?
The mathematical concept of an INTEGER is an aleph null, or countable
infinite set1. Integers are often represented by the integer number line. The
set of integers is closed over addition, subtraction, and multiplication.
Integer division can also be defined so that it is closed except, of course, in
the case of division by zero. The set of integers also has other well-defined
properties such as the existence of additive and multiplicative identities, and
an additive inverse. In fact, the set of integers is a Linear Algebra over the
operations of + and *, so you can see your Linear Algebra class for more fun
along these lines.
In C++, int, long int (or long), short int (or short), and so forth store
integers in two’s complement form2 as a specific number of bits.3 For
example, an int is almost always stored as either 16 bits or 32 bits (2 or 4
bytes). Regardless of language, the range of an integer stored in N bits in
two’s complement is always:
-2N-1+2N-1-1
for 8 bits, this range is -128 to +127
for 16 bits, this range is ≈ ±32K
for 32 bits, this range is ≈ ±2G
Java supports a very similar set of integer data types, and calls them byte,
short, int, and long. Many languages also support unsigned integers. In C++,
they are called uint, ulong, ushort. The range of integers for an N bit
unsigned integer is 02N-1. In C++, when you ask for short or long
integers, the compiler will decide whether or not you get them. In Java, you
will get the size requested, but support for short is spotty, and automatic
conversions to int will often cause compilation errors. For example, short a
1
Kurt Godel first characterized infinite sets using aleph. He tried to conceive of an infinite number of
different infinites. Any set that can be placed in 1 to 1 correspondence with the set of integers is aleph null
(also called aleph naught). The uncountable infinite set of real numbers is aleph one. I read somewhere that
the set of functions R to R where R is the set of real numbers is aleph two, and so on.
2
You should have studied two’s complement in Discrete Math or Computer Organization. If you are not
familiar with this scheme, I can give you a reference.
3
I could have said bytes actually, as the number of bits allocated to an integer is always a multiple of 8 in
modern computers. In C++, you can use the sizeof ( ) function to determine the actual number of bytes
allocated to any data type.
= 3, b = 4; a++; will compile, but a= a+b; will not without converting to a =
(short)(a+b); This applies to a number of operations.
As long as no value computed or stored is outside the supported range,
the computer representation of integers is a (more or less) perfect
mirror to the mathematician’s “integer”. Note: we consider zero to be
positive and even, but mathematicians do not. Integer overflow or underflow
is the result when such an operation fails.
In C++, limits.h contains predefined constants representing the highest
and lowest values for some of these data types. The names of some of these
constants are given below. Be advised, however, that it is usually unwise to
test against these values as comparisons involve subtraction, and any
operation involving one of these numbers can easily produce overflow or
underflow.
INT_MIN
INT_MAX
UINT_MAX
LONG_MIN
LONG_MAX
ULONG_MAX
SHRT_MIN
SHRT_MAX
USHRT_MAX
In Java, wrapper classes such as Integer, Long, and Short are
predefined for primitive types, and these wrapper classes have predefined
static constants such as: Integer.MAX_VALUE, Integer.MIN_VALUE,
Long.MAX_VALUE, etc.
How well do the floating point related data types represent the
mathematical concept of a
real number?
The mathematical concept of a REAL number is an aleph one, uncountable
infinite set. The real number line is often used to represent real numbers.
Notice the fundamental difference between the set of integers and the set of
real numbers. The number of integers between any two numbers on the
integer number line is finite. The number of real numbers between any two
numbers on the real number line is infinite, and not only infinite but
uncountably infinite.
In C++, float, double, and long double are used to represent real numbers.
Java supports only float and double. Computer representations of real
numbers are stored in a manner similar to scientific notation, except in
binary. These representations are called floating-point representations. There
are many variations, but the method described here is most common today as
it follows the IEEE 754 floating-point representation standard. For 64 bits, a
normalized number is stored as a single sign bit, an 11 bit exponent4 (the
base is understood to be 2), 1 understood but not stored bit containing a 1,
and a 52 bit fraction/mantissa. There is an understood binary point between
the understood 1 and the fraction.
Despite the best efforts of a number of people over the years, 32 bits is
simply unable to store floating-point numbers with reasonable accuracy and
range. This is why C and C++ ignore float, and use the double
representation almost exclusively. For example, none of the functions in
math.h5 expect a float parameter nor do they return a float value. Java is
somewhat more orthogonal, as it overloads the mathematical method names
for most numeric types for some functions. For example, abs, min, max, and
round are defined for double, float, long and int. Even so, sqrt, floor, and
ceil are only defined for double.
The exponent provides the range of numbers that can be represented. The 11
bit exponent in double gives a range of  21023. In base 10, this is  10308.
The fraction/mantissa provides the level of accuracy of the representation
(the number of significant digits). Normalization adjusts the binary point
until there is a single 1 to the left of the binary point Thus, the mantissa is
greater than or equal to 1 and less than 2. This leftmost 1 is normally not
stored. The remaining 52 bits of the mantissa is normally stored and is
usually called the fraction for obvious reasons. The exponent is adjusted to
get mantissa in the desired range. Such numbers are called normalized.
Some values such as ZERO cannot be normalized and some other values are
de-normalized for other purposes, such as to permit a closer approach to 0.0.
These 53 bits (1 not stored) in the mantissa provide 15 to 16 significant
digits in decimal.
Thus, all whole numbers requiring no more than 15 or 16 significant digits
can be represented exactly in floating-point. Much larger numbers whose
4
Exponents are almost always stored in excess notation. Excess notation means that a bias is added to the
exponent to make it positive. For an 11 bit exponent, this bias is 1023. The reason is subtle. It allows two
64 bit floating-point numbers to be compared as if they were 64 bit integers, greatly speeding up real
number comparison and simplifying the design of an ALU.
5
Not even the functions that include an “f” to indicate “floating” such as atof, modf, and fabs deal with
float. They all return a double.
binary form has many zeros in the lower bits are also stored exactly. For
example:
10101000000000000000000000000000000000000000000000000000000000002
= 1.01012 * 264 (normalized to between 1 and 2)
mantissa = 1.01012 The stored fraction will be = 01010…02
exponent = 64
The stored exponent will be 6410 + 102310 = 108710
As 11 bits in binary, this is will be 100 0011 11112
Furthermore, any fraction that can be stated with the denominators as
powers of 2 can be represented exactly. For example:
53/64 = 32/64 + 16/64 + 4/64 + 1/64 = ½ + ¼ + 1/16 + 1/64
= 0.1101012
Many common fractions such as 1/3, 1/5, 2/3, etc cannot be represented
exactly as they become repeating binary fractions when converted to binary.
For example, 2/5 = 0.410 = 0.011001100110…2. Furthermore, no irrational
numbers or transcendental numbers such as 2 or  or e can be represented
exactly.
Even such pedestrian values as those associated with money are usually
approximated when stored or computed:
$1.73 cannot be stored exactly as 73/100 cannot be expressed as a
sum of fractions with denominators all powers of 2.
Therefore, you should almost never attempt to compare two floatingpoint values for either equality or inequality. In fact, all compilers should
flag such a statement with a warning message. Unfortunately, some do and
some do not. The only situation where you can reasonably expect the
comparison to work correctly is when the values to be compared were input
or constants. (That is, not computed.) Use >= or <= when reasonable, but if
you want to test if A equals B, you should instead test something like:
if( fabs( A-B) < ERROR_FACTOR ) in C++
if( abs( A-B) < ERROR_FACTOR ) in Java
Here, ERROR_FACTOR is a previously defined constant along the lines of
1.0E-10.
You should try to avoid subtraction of values that are almost equal. This
often leads to the loss of many significant digits. Consider the following
decimal example in which the number of significant digits drops from 15
down to only 2 in 1 operation. Unfortunately, such operations are almost
inevitable if one is searching for the root of an equation. In addition, the
higher the degree of the polynomial, the more acute the round-off problem is
likely to be.
1.56784932563566
- 1.56784932563549
----------------------------------------0.00000000000017
Watch out for the accumulation of round-off error. Many times a
calculation or a sequence of calculations appears in a loop. Each iteration
bases the current calculation on the previous one, and the round-offs errors
can sometimes accumulate. In such cases, if the loop is modified so that the
steps the loop takes become smaller generating even more calculations, the
result can be either better or worse.
The order of operations can affect round-off error. If a sequence of
operations must be performed, the greatest level of accuracy will generally
follow the form with the least variation in the size of the partial results. For
example:
h=
a*b3
-----d*e*g
should probably be written something like:
h = a/d*b/e*b/g*b; // Use of the pow() function, while tempting,
// will often give the least accurate result.
The number of operations can affect round-off error. If a sequence of
operations must be performed, consider writing the expression so that the
number of operations is minimized. As each operation may have round-off
error, reducing the number of operations to be performed should produce a
positive benefit in most cases. For example, three versions of the evaluation
of a cubic polynomial in one variable are shown below. The best version is
the last one. Making a function call (that may require an unknown number of
operations) should be avoided if it can be done with reasonable ease. For
example pow( base, exponent ) below probably works by computing
antilog( exponent * log( base ) ). This is fine for a real exponent, but a
preposterous way to raise a number to an integer exponent.
Y = a * pow(x,3) + b*pow(x,2) + c * x + d; // 5 ops + 2 calls. Worst
Y = a * x * x * x + b * x * x + c * x + d; // 9 ops.
Better
Y = x * ( x * ( x * a + b ) + c) + d;
// 6 ops.
Best
DATA AGGREGATES and other DERIVED DATA TYPES
Array
An array is a data aggregate in which every element must be the same type.
The fundamental operation for accessing elements within the array is the
index (or subscript) (i.e. a[j]). C++ and Java both treat multidimensional
arrays as arrays of arrays (i.e. a[j][k][y]). Some other languages such as
Pascal and Fortran do not (i.e. a[j, k, y]). Some languages permit references
to a portion of an array, but most, including C++ and Java, do not. Java has
some additional features such as jagged arrays and an ArrayList class that
are worthy of more space than I have here.
Arrays cannot be returned as the return value of function, but a pointer to an
array can be returned. Arrays cannot be copied or input or output as if they
were atomic objects (except strings). You cannot say a = b; if they are both
arrays or cout <<a; (or in Java, System.out.println(a);) . Arrays can be
initialized when declared, but no repeat factors are allowed. The array
dimension can be deduced by the compiler from the initialization list.
int a[5] = {1, 2, -1, 5}; // Unlisted values, in this case a[4], will be given the
value 0.
int c[4][3] = {
}
{1, 2},
{4, 5, 6},
{7}
// c[0][3] will be 0
// c[2][1] and c[2][2] will be 0
// all of row [3] will be 0
int b[] = {5,7,4}; // b will be created b[3] and initialized as requested.
Arrays are always stored in a linear block of memory. The compiler converts
the index or subscript as stated in the high level language into an offset into
memory in assembly language.
An index into a single dimension integer array in C++ and Java: If an
int is stored in 4 bytes, then an array of 100 integers will be stored in a
contiguous block of 400 bytes in memory. The address of the first byte in the
block is what is passed whenever an array is passed as a parameter. A
reference to any element of the array merely requires that the index be
shifted twice to the left6 then added to the address of the beginning of the
array to generate a memory address.
In all modern languages, a two dimensional array is stored by rows, also
called row-major form. In memory, the zeroth row is stored first, then the
first row, and so on. In the example below, the values stored in the 3 by 4
array indicate the row and column. i.e. 21 means row 2, column 1. A 2D
reference is converted to linear, then shifted to generate a byte offset. Thus
the 2D array below:
int A[3][4];
Col 0
Row 0 00
Row 1 10
Row 2 20
Col 1
01
11
21
Col 2
02
12
22
Col 3
03
13
23
would actually be stored as a 1D array as shown below (assuming the
starting address 2000)
Address of A[r][c]
= AddressOfA + BytesPerInt * (r*NumCols + c)
A[2][1]
= 2000
+4
* (2 * 4
+ 1)
= 2000
+4
*9
= 2036
2D
index
[0][0]
[0][1]
[0][2]
[0][3]
6
1D
index
[0]
[1]
[2]
[3]
Value
00
01
02
03
Byte
address
2000
2004
2008
2012
Shifting a binary number to the left two bits is, of course, the same as multiplying times 4 and is much
faster on most hardware.
[1][0]
[1][1]
[1][2]
[1][3]
[2][0]
[2][1]
[2][2]
[2][3]
[4]
[5]
[6]
[7]
[8]
[9]
[10]
[11]
10
11
12
13
20
21
22
23
2016
2020
2024
2028
2032
2036
2040
2044
Struct or Record
The C++ struct is called a record in most languages that are not C based. It
will not be discussed here at any length as modern languages correctly
expect you to use classes instead. A struct or record is an aggregate of
related data. It often consists of multiple data types and sizes (i.e.
EmployeeRecord, SalesRecord, InventoryRecord, or EnrollmentRecord). A
struct can easily be used to store data of the same type however, but the data
would still need to be related in some fundamental way. References to fields
within a struct often use the dot notation. In fact, it would be fair to say class
syntax borrowed this from record syntax. Unlike arrays, records can
generally be treated as atomic objects. There are many other interesting and
useful things that could be said, but why bother. You will likely be using a
class instead. In fact, Java does not even support them.
MORE SEMANTICALLY COMPLICATED STRUCTURES
Abstract Data Type (ADT) – a data structure combined with a set of
operations defined on the structure
Class – Classes are a modern derivative of the Abstract Data Type in which
additional capabilities such as encapsulation, operator overloading,
polymorphism, and information hiding have been added. References to
components of a class generally require a similar syntax as for a struct:
objectName.dataName or
objectName.functionName
Download