Chapter 6
Data Types
ISBN 0-321-33025-0
Chapter 6 Topics
• Introduction
• Primitive Data Types
• Character String Types
• User-Defined Ordinal Types
• Array Types
• Associative Arrays
• Record Types
• Union Types
• Pointer and Reference Types
Data Type
• A data type defines
– a collection of data values
and
– a set of predefined operations on those values.
Data Types and Implementation
Easiness of a Programming Language
• Computer programs produce results by
manipulating data.
• An important factor in determining the ease with
which programs can perform this task is
– how well the data types available in the language being
used match the objects in the real-world problem space.
• e.g.: If a language provides an employee type, it is
easier for a programmer to write an employee
management program.
• It is therefore crucial that a language support an
appropriate collection of data types and structures.
Data Type Evolution
• The contemporary concepts of data types
have evolved over the last 50 years.
• In the earliest languages, all problem space
data structures had to be modeled with only
a few basic language-supported data structures.
– For example, in versions of Fortran before Fortran 90, linked lists
and binary trees were commonly modeled with
arrays.
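• A minimal sketch, in Python (not a language used in these slides), of how a linked list can be modeled with nothing but arrays. The names values, next_index, head, and NIL are hypothetical and chosen only for illustration.

```python
# Hypothetical illustration: a singly linked list stored in two parallel arrays,
# as it might be modeled in a language that offers only arrays.
NIL = -1  # assumed convention for "no next node"

values = [10, 30, 20, 0]        # values[i] is the data stored in node i
next_index = [2, NIL, 1, NIL]   # next_index[i] is the index of the node after node i
head = 0                        # index of the first node (node 3 is unused)


def traverse(head, values, next_index):
    """Collect the stored values by following the index links from head."""
    result = []
    node = head
    while node != NIL:
        result.append(values[node])
        node = next_index[node]
    return result


linked_values = traverse(head, values, next_index)
print(linked_values)  # [10, 20, 30]
```

The "pointers" here are just array subscripts, which is why the programmer, not the language, has to keep the structure consistent.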
Evolution of Data Type Design
• One of the most important advances in the
evolution of data type design
– was introduced in ALGOL 68
– is to provide
• a few basic types
and
• a few flexible structure-defining operators that allow a
programmer to design a data structure (that is, a structured type) for each need.
Advantages of User-defined Types – (1)
• User-defined types provide improved
readability through the use of meaningful
names for types.
• User-defined types also aid modifiability:
– A programmer can change the type of a
category of variables in a program by changing
only a type declaration statement.
Advantages of User-defined Types – (2)
• They allow type checking of the variables of
a special category of use, which would
otherwise not be possible.
– For example, zoo is a user-defined type:
• We can check whether two variables are of the
same zoo data type.
Abstract Data Type
• Taking the concept of a user-defined type a step
farther, we arrive at abstract data types.
• The fundamental idea of an abstract data type is that
the interface of a type, which is visible to the user,
is separated from
the representation of values of that type and the implementation of the set of
operations on values of that type, which are hidden from the user.
• All of the types provided by a high-level
programming language are abstract data types.
Scalar Types
• In computing, a scalar variable is one that
can hold only one value at a time, as
opposed to structured variables such as
arrays, lists, hashes, and records.
• A scalar data type is the type of a scalar
variable.
– For example, char, int, float, and double are
the most common scalar data types in the C
programming language.
Structured Data Types
• The two most common structured (nonscalar)
data types are
– arrays
and
– records.
Type Operators
• A few data types are specified by type operators,
or constructors, which are used to form type
expressions.
– For example, C uses brackets and asterisks
as type operators to specify arrays and pointers.
Variables and Descriptors
• It is convenient, both logically and
concretely, to think of variables in terms of
descriptors.
• A descriptor is the collection of the attributes of a
variable.
Implementation of Descriptors
• In an implementation, a descriptor is an
area of memory that stores the attributes of a
variable.
Implementation of Descriptors with
Only Static Attributes
• If the attributes are all static, descriptors are
required only at compile time.
• These descriptors
– are built by the compiler, usually as a part of the
symbol table
and
– are used during compilation.
Implementation of Descriptors with
Dynamic Attributes
• For dynamic attributes, however, part or all of
the descriptor must be maintained during
execution.
– In this case, the descriptor is used by the runtime system.
• In all cases, descriptors are used for
– type checking
and
– to build the code for the allocation and deallocation
operations.
Object
• The word object is often associated with
– the value of a variable
and
– the space it occupies.
• In this book, the author reserves object
exclusively for instances of user-defined abstract
data types.
Primitive Data Types
• Data types that are not defined in terms of
other types are called primitive data types.
– Nearly all programming languages provide a set
of primitive data types.
– Some of the primitive types are merely
reflections of the hardware.
• For example, most integer types.
– Others require only a little non-hardware
support for their implementation.
Primitive Data Types and Structured
Types
• The primitive data types of a language are
used, along with one or more type
constructors, to provide the structured types.
Numeric Type
• Many early programming languages had only
numeric primitive types.
• Numeric types still play a central role among the
collections of types supported by contemporary
languages.
– Integer
– Floating-Point
– Decimal
– Boolean Types
– Character Types
– Complex
Integer
• The most common primitive numeric data type
is integer.
• Many computers now support several sizes
of integers.
• These sizes of integers, and often a few
others, are supported by some
programming languages.
– For example, Java includes four signed integer
sizes: byte, short, int, and long.
Integer Types and Hardware
• A signed integer value is represented in a
computer by a string of bits, with one of the
bits (typically the leftmost) representing the
sign.
• Most integer types are supported directly by
the hardware.
Representation of negative numbers in C
[stackoverflow]
• ISO C (C99), section 6.2.6.2/2, states that an
implementation must choose one of three different
representations for integral data types,
– two's complement,
– one's complement or
– sign/magnitude
• It is very likely that two's complement
implementations far outnumber the others.
Integer Representation
[stackoverflow]
• In all those representations, positive numbers are
identical.
• To get the negative representation for a positive
number, you:
– invert all bits for one's complement.
– invert all bits then add one for two's complement.
– invert just the sign bit for sign/magnitude.
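• The three rules above can be sketched for 8-bit values as follows. This is an illustration in Python, assuming an 8-bit word; the function names are hypothetical and the code only manipulates bit patterns, it does not describe how any particular C implementation works.

```python
BITS = 8                 # assumed word size for this illustration
MASK = (1 << BITS) - 1   # 0b11111111, keeps results within 8 bits


def ones_complement(n):
    """Negative pattern in one's complement: invert all bits."""
    return ~n & MASK


def twos_complement(n):
    """Negative pattern in two's complement: invert all bits, then add one."""
    return (ones_complement(n) + 1) & MASK


def sign_magnitude(n):
    """Negative pattern in sign/magnitude: invert just the sign bit."""
    return n ^ (1 << (BITS - 1))


def show(pattern):
    """Format a bit pattern as an 8-character binary string."""
    return format(pattern, "08b")


n = 5
print(show(n))                   # 00000101  (+5 in all three representations)
print(show(ones_complement(n)))  # 11111010  (-5 in one's complement)
print(show(twos_complement(n)))  # 11111011  (-5 in two's complement)
print(show(sign_magnitude(n)))   # 10000101  (-5 in sign/magnitude)
```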
8 bit one's complement
[wikipedia]
8 bit two's complement
[wikipedia]
Floating-point
• Floating-point data types model real numbers, but
the representations are only approximations
for most real values.
– For example, neither of the fundamental
numbers π or ℯ (the base for the natural
logarithms) can be correctly represented in
floating-point notation.
– Of course, neither of these numbers can be
accurately represented in any finite space.
Problems of Floating-Point Numbers –(1)
• On most computers, floating-point
numbers are stored in binary; hence, they
cannot accurately represent most real
numbers.
– For example:
• in decimal: 0.1
• in binary: 0.0001100110011…
(the digits after the binary point have weights 2^-1, 2^-2, 2^-3, 2^-4, …)
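• A small sketch, in Python, that derives the binary digits of 0.1 by repeated doubling. It uses exact rational arithmetic (fractions.Fraction) so that the digits themselves are not affected by floating-point rounding; the function name is hypothetical.

```python
from fractions import Fraction


def binary_fraction_digits(value, count):
    """Return the first `count` binary digits after the point of 0 <= value < 1."""
    digits = []
    for _ in range(count):
        value *= 2
        bit = int(value)        # 1 if the doubled value reached 1, otherwise 0
        digits.append(str(bit))
        value -= bit            # keep only the fractional part
    return "".join(digits)


digits = binary_fraction_digits(Fraction(1, 10), 13)
print("0." + digits)  # 0.0001100110011

# Because the stored values are only approximations, this comparison is False.
print(0.1 + 0.2 == 0.3)
```

The pattern 0011 repeats forever, so no finite number of binary digits represents 0.1 exactly.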
Problems of Floating-Point Numbers –(2)
• The loss of accuracy through arithmetic
operations.
– For more information on the problems of
floating-point notation, see any book on
numerical analysis.
• e.g.: when a is a small number and b and c are large numbers,
(a/b)×c ≡ (a×c)/b mathematically,
but computing (a/b)×c may lose accuracy.
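• An extreme illustration in Python, assuming IEEE double-precision floats (the usual case for Python's float). The values of a, b, and c are chosen for this sketch only: a/b is too small to be represented and underflows to zero, so all information is lost before the multiplication.

```python
import math

a = 1e-200   # a small number (illustrative value)
b = 1e200    # a large number (illustrative value)
c = 1e200    # a large number (illustrative value)

divide_first = (a / b) * c    # a/b underflows to 0.0, so the result is 0.0
multiply_first = (a * c) / b  # a*c is about 1, so the result is about 1e-200

print(divide_first)
print(multiply_first)
print(math.isclose(multiply_first, 1e-200, rel_tol=1e-12))
```

The two expressions are algebraically equal, yet only the second keeps a meaningful result.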
Internal Representation of Floating-Point Values
• Most newer machines use the IEEE Floating-Point Standard 754 format to represent
floating-point numbers.
– Under this format, floating-point values are
represented as fractions (mantissa or significand) and
exponents.
• Language implementers use whatever
representation is supported by the
hardware.
IEEE Floating-point Formats
single precision
double precision
Real Value of Single-precision
Floating-Point Format [wikipedia]
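The real value = (-1)^sign × (1.b22 b21 … b0)_2 × 2^(e-127)
where sign is the sign bit, b22 … b0 are the 23 fraction bits, and e is the 8-bit biased exponent.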
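• The formula can be checked with a short Python sketch that packs a value as an IEEE single-precision number and decodes the three fields. It assumes a normalized value (0 < e < 255); zeros, subnormals, infinities, and NaNs are deliberately not handled. The function name decode_single is hypothetical.

```python
import struct


def decode_single(x):
    """Decode a normalized single-precision value into (sign, e, fraction, value)."""
    # Pack as a big-endian 32-bit float, then reinterpret the bytes as an integer.
    (bits,) = struct.unpack(">I", struct.pack(">f", x))
    sign = bits >> 31               # 1 sign bit
    e = (bits >> 23) & 0xFF         # 8 exponent bits, biased by 127
    fraction = bits & 0x7FFFFF      # 23 fraction bits b22 ... b0
    significand = 1 + fraction / 2**23   # implicit leading 1 (normalized only)
    value = (-1) ** sign * significand * 2.0 ** (e - 127)
    return sign, e, fraction, value


print(decode_single(0.15625))  # (0, 124, 2097152, 0.15625)
print(decode_single(-6.5))     # (1, 129, 5242880, -6.5)
```

Both test values are exactly representable in binary, so the decoded value matches the input exactly.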
Floating-point Type
• Most languages include two floating-point
types, often called float and double.
– The float type is the standard size, usually being
stored in four bytes of memory.
– The double type is provided for situations where
larger fractional parts are needed.
• Double-precision variables usually
– occupy twice as much storage as float variables
and
– provide at least twice the number of bits of fraction.
Precision and Range
• The collection of values that can be
represented by a floating-point type is
defined in terms of precision and range.
– Precision is the accuracy of the fractional part of a
value, measured as the number of bits.
– Range is a combination of the range of fractions,
and, more importantly, the range of exponents.
Decimal
• Most larger computers that are designed to
support business systems applications have
hardware support for decimal data types.
• Decimal data types store a fixed number of
decimal digits, with the decimal point at a
fixed position in the value.
Internal Representation of a Decimal
Value
• Decimal types are stored very much like character
strings, using binary codes for the decimal digits.
– These representations are called binary coded decimal (BCD).
• In some cases, they are stored one digit per byte,
but in others they are packed two digits per byte.
Either way, they take more storage than binary
representations.
• The operations on decimal values are done in
hardware on machines that have such capabilities;
otherwise, they are simulated in software.
BCD Example [wikipedia]
• To BCD-encode a decimal number using
the common encoding, each decimal digit
is stored in a four-bit nibble.
• Thus, the BCD encoding for the number
127 would be: 0001 0010 0111
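• The encoding rule can be sketched in a few lines of Python. The function name to_bcd is hypothetical, and the sketch covers only non-negative integers in the common encoding (one four-bit nibble per decimal digit).

```python
def to_bcd(number):
    """Encode a non-negative integer as 4-bit groups, one per decimal digit."""
    return " ".join(format(int(digit), "04b") for digit in str(number))


print(to_bcd(127))  # 0001 0010 0111
```

For comparison, 127 in pure binary needs only 7 bits (1111111), which illustrates why BCD takes more storage.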
Boolean Types
• Their range of values has only two elements,
one for true and one for false.
• They were introduced in ALGOL 60 and
have been included in most general-purpose languages designed since 1960.
• One popular exception is C, in which numeric
expressions can be used as conditionals.
– In such expressions, all operands with nonzero
values are considered true, and zero is
considered false.
Internal Representation of a Boolean
Type Value
• A Boolean value could be represented by a
single bit.
• But because a single bit of memory is
difficult to access efficiently on many
machines, they are often stored in the
smallest efficiently addressable cell of
memory, typically a byte.
Character Types
• Character data are stored in computers as
numeric codes.
• The most commonly used coding is ASCII
(American Standard Code for Information
Interchange), which uses the values 0 to
127 to code 128 different characters.
• Many programming languages include a
primitive type for character data.
More about ASCII Code [wikipedia]
• ASCII is, strictly, a 7-bit code, meaning it uses the
bit patterns representable with seven binary digits
(a range of 0 to 127 decimal) to represent
character information.
• At the time ASCII was introduced, many computers
dealt with 8-bit groups (bytes or, more specifically,
octets) as the smallest unit of information; the
eighth bit was commonly used as a parity bit for
error checking on communication lines or other
device-specific functions.
• Machines which did not use parity typically set the
eighth bit to zero, though some systems such as
Prime machines running PRIMOS set the eighth bit
of ASCII characters to one.
Unicode
• Because of
the globalization of business
and
the need for computers to communicate with
other computers around the world,
the ASCII character set is rapidly becoming
inadequate.
• Unicode, originally designed as a 16-bit character set, has been
developed as an alternative.
• Unicode includes the characters from most of the
world's natural languages.
– For example, Unicode includes the Cyrillic alphabet, as
used in Serbia, and the Thai digits.
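• A brief Python sketch showing that such characters have numeric codes outside the ASCII range of 0 to 127. The two sample characters (U+0416, a Cyrillic capital letter, and U+0E50, the Thai digit zero) are chosen for illustration only.

```python
cyrillic_zhe = "\u0416"  # CYRILLIC CAPITAL LETTER ZHE
thai_zero = "\u0E50"     # THAI DIGIT ZERO

for ch in ("A", cyrillic_zhe, thai_zero):
    code = ord(ch)  # the character's numeric code (code point)
    print(ch, code, "ASCII" if code <= 127 else "outside ASCII")
```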
Character String Type
• A character string type is one in which the
values consist of sequences of characters.
• Character strings also are an essential type for
all programs that do character manipulation.
Design Issues
• The two most important design issues that
are specific to character string types are the
following:
– Should strings be simply
• a special kind of character array
or
• a primitive type (with no array-style subscripting
operations)?
– Should strings have static or dynamic length?
String Operations
• The common string operations are:
– Assignment
– Catenation
– Substring reference
– Comparison
– Pattern matching
Internal Representation of a String Type
Value in C and C++
• If strings are not defined as a primitive type,
string data is usually stored in arrays of single
characters and referenced as such in the
language.
• This is the approach taken by C and C++.
String Operations in C and C++
• C and C++ use char arrays to store character strings.
• C and C++ provide a collection of string operations
through a standard library whose header file is
string.h.
– Most uses of strings and most of the library functions use
the convention
• that character strings are terminated with a special
character, null, which is represented with zero.
– This is an alternative to maintaining the length of string
variables.
– The library operations simply carry out their operations
until the null character appears in the string being
operated on.
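• The null-terminator convention can be imitated in Python on a bytes object. This is only a sketch of the idea behind a length function that scans for the zero byte; it is not the C library's code, and the name c_style_length is hypothetical.

```python
def c_style_length(buffer):
    """Count the bytes that precede the first zero byte in buffer."""
    length = 0
    while buffer[length] != 0:   # keep going until the null character appears
        length += 1
    return length


buffer = b"apples\x00junk"       # bytes after the terminator are ignored
print(c_style_length(buffer))    # 6
```

If the buffer contained no zero byte, the loop would run past the end. Python raises an IndexError in that case, whereas C would simply keep reading memory.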
Character String Literals in C
• The character string literals that are built by
the compiler have the null character.
– For example, consider the following declaration:
char *str = "apples";
• In this example, str is a char pointer set to point at
the string of characters, apples0, where 0 is the
null character.
• This initialization of str is legal because character
string literals are represented by char pointers, rather
than the string itself.
String Types in Java (1)
• In Java, strings are supported as a primitive
type by the String class, whose values are
constant strings.
– The String class represents character strings.
– All string literals in Java programs, such as "abc",
are implemented as instances of this class.
– Strings are constant; their values cannot be
changed after they are created[oracle].
String Types in Java (2)
• The StringBuffer class supports strings whose values are
changeable and are more like arrays of single
characters.
– Subscripting is allowed on StringBuffer
variables.
String Types in Fortran 95
• Fortran 95
treats strings as a primitive type and provides
• assignment,
• relational operators,
• catenation,
and
• substring reference operations
for them.
• A substring reference is a reference to a substring of a
given string.
Pattern Matching
• Pattern matching is another fundamental
character string operation.
– In some languages, pattern matching is
supported directly in the language.
• Perl, JavaScript, and PHP include built-in pattern
matching operations.
– In others, it is provided by a function or class
library.
• Pattern-matching capabilities are included in the
class libraries of C++, Java, and C#.
String Length Options
• There are several design choices regarding
the length of string values.
– static length strings
– limited dynamic length strings
– dynamic length strings
Static Length Strings
• A string is called a static length string if its
length is static, that is, set when the string is
created and never changed afterward.
• This is the choice for
– the immutable objects of Java’s String class
– similar classes in the C++ standard class library
– similar classes in the .NET class library available
to C#
Example – the Java String Class
[wikipedia]
• String s = "ABC";
s.toLowerCase();
• The method toLowerCase() will not change the data "ABC"
that s contains.
• Instead, a new String object is instantiated and given the
data "abc" during its construction.
• A reference to this String object is returned by the
toLowerCase() method.
• To make the String s contain the data "abc", a different
approach is needed.
• s = s.toLowerCase();
• Now the String s references a new String object that
contains "abc".
• The String class's methods never affect the data that a
String object contains.
Limited Dynamic Length Strings
• The second option is to allow strings to have
varying length up to a declared and fixed maximum set by
the variable's definition.
• These are called limited dynamic length strings.
• Such string variables can store any number of
characters between zero and the maximum.
– Recall that strings in C and C++ use a special character to
indicate the end of the string's characters, rather than
maintaining the string length.
• This string-handling approach may result in buffer overflow
vulnerabilities.
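• A hypothetical sketch, in Python, of a limited dynamic length string that stores its current text and enforces a declared maximum. The class LimitedString is invented for illustration. Unlike C's convention, it checks every assignment against the maximum, which is the kind of check whose absence leads to buffer overflows.

```python
class LimitedString:
    """A string whose length may vary between zero and a fixed maximum."""

    def __init__(self, max_length, text=""):
        self.max_length = max_length   # fixed when the variable is defined
        self.text = ""
        self.assign(text)

    def assign(self, text):
        """Store text, refusing anything longer than the declared maximum."""
        if len(text) > self.max_length:
            raise ValueError("string exceeds declared maximum length")
        self.text = text


name = LimitedString(5, "abc")
name.assign("hello")          # allowed: exactly the maximum length
print(name.text, len(name.text))
```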
Dynamic Length Strings
• The third option is to allow strings to have
varying length with NO maximum, as in
JavaScript and Perl.
• These are called dynamic length strings.
• This option requires the overhead of
dynamic storage allocation and deallocation but
provides maximum flexibility.
Evaluation – (1)
• String types are important to the writability of a
language.
• Dealing with strings as arrays can be more
cumbersome than dealing with a primitive string type.
– The addition of strings as a primitive type to a language is
NOT costly in terms of either language or compiler
complexity.
– Therefore, it is difficult to justify the omission of primitive
string types in some contemporary languages.
– Of course, providing strings through a standard library is
nearly as convenient as having them as a primitive type.
Evaluation – (2)
• String operations such as simple pattern
matching and catenation are essential and
should be included for string type values.
• Although dynamic length strings are obviously
the most flexible, the overhead of their
implementation must be weighed against
that additional flexibility.
• Dynamic length strings are often included
only in languages that are interpreted.
Array Types
• An array is a homogeneous aggregate of data
elements in which an individual element is
identified by its position in the aggregate,
relative to the first element.
• The individual data elements of an array are
of some previously defined type
– either primitive
or
– otherwise.
Why Are Array Types Needed?
• A majority of computer programs need to
model collections of values in which the values
– are of the same type
and
– must be processed in the same way.
• Thus, the universal need for arrays is
obvious.
Arrays and Indexes
• Specific elements of an array are referenced
by means of a two-level syntactic mechanism:
– the first part is the aggregate name
– the second part is a possibly dynamic selector
consisting of one or more items known as
subscripts or indexes.
• If all of the indexes in a reference are constants, the
selector is static; otherwise, it is dynamic.
The Selection Operation
• The selection operation can be thought of as a mapping
from
the array name and the set of subscript values
to
an element in the aggregate.
• Indeed, arrays are sometimes called finite mappings.
• Symbolically, this mapping can be shown as
array_name(subscript_value_list) → element
The Syntax of Array References
• The syntax of array references is fairly
universal:
– The array name is followed by the list of subscripts,
which is surrounded by either parentheses or
brackets.
Ordinal Type
• An ordinal type is one in which the range of
possible values can be easily associated with
the set of positive integers.
• For example,
– In Java, the primitive ordinal types are
• integer
• char
and
• Boolean
User-Defined Ordinal Types
• There are two user-defined ordinal types that
have been supported by programming
languages:
– enumeration
and
– subrange
The Element Type and the Type of the
Subscripts of an Array Type
• Two distinct types are involved in an array
type:
– the element type
– the type of the subscripts.
• The type of the subscripts is often a subrange of
integers.
• But Ada allows any ordinal type to be used as
subscripts, such as
– Boolean
– character
and
– enumeration.
Example
type Week_Day_Type is (Monday, Tuesday, Wednesday,
Thursday, Friday);
type Sales is array (Week_Day_Type) of Float;
Type of Ada for Loop Counter
• An Ada for loop can use any ordinal type
variable for its counter.
• This allows arrays with ordinal type
subscripts to be conveniently processed.
Subscript Range Check
• Early programming languages did NOT
specify that subscript ranges must be
implicitly checked.
• Range errors in subscripts are common in
programs, so requiring range checking is an
important factor in the reliability of
languages.
– Among contemporary languages
• C, C++, and Fortran do not specify range checking
of subscripts
• but Java, ML, and C# do
1-70
Subscript Bindings
• The binding of the subscript type to an array
variable is usually static.
• But the subscript value ranges of
an array variable are sometimes dynamically
bound.
The Implicit Lower Bound of a Subscript
Range
• In some languages, the lower bound of the
subscript range is implicit.
– For example,
• In the C-based languages, the lower bound of all
index ranges is fixed at 0.
• In Fortran 95, it defaults to 1.
• In most other languages, subscript ranges must be
completely specified by the programmer.
Array Categories
• There are five categories of arrays.
– static arrays
– fixed stack-dynamic arrays
– stack-dynamic arrays
– fixed heap-dynamic arrays
– heap-dynamic arrays
Criteria to Classify Array Categories
• The category definitions are based on
– the binding to subscript ranges
– the binding to storage
and
– from where the storage is allocated.
• The category names indicate the design
choices of these three.
Duration of the Subscript Ranges and
Array Storage
• In the first four of the 5 array categories,
once
– the subscript ranges are bound
and
– the storage is allocated,
they remain fixed for the lifetime of the variable.
– When the subscript ranges are fixed, the array
cannot change size.
Static Arrays
• A static array is one in which
– the subscript ranges are statically bound
– storage allocation is static (done before run
time).
Advantages of Static Arrays
• The advantage of static arrays is efficiency:
– No dynamic allocation or deallocation is
required.
Example
• Arrays declared in C and C++ functions that
include the static modifier are examples
of static arrays.
Fixed Stack-dynamic Arrays
• A fixed stack-dynamic array is one in which
– the subscript ranges are statically bound
– but the allocation is done at declaration elaboration
time during execution.
Advantages of Fixed Stack-dynamic
Arrays
• The advantage of fixed stack-dynamic
arrays over static arrays is space efficiency.
– A large array in one procedure can use the same
space as a large array in a different procedure,
as long as both procedures are not active at the
same time.
Example
• Arrays that are declared in C and C++
functions (without the static specifier) are
examples of fixed stack-dynamic arrays.
Stack-dynamic Arrays
• A stack-dynamic array is one in which
– the subscript ranges are dynamically bound at
elaboration time
– the storage allocation is dynamically bound at
elaboration time.
• Once
– the subscript ranges are bound
and
– the storage is allocated,
however, they remain fixed during the lifetime of the
variable.
Advantages of Stack-dynamic Arrays
• The advantage of stack-dynamic arrays
over static and fixed stack-dynamic arrays
is flexibility.
– The size of an array need not be known until the
array is about to be used.
Example
• Ada arrays can be stack-dynamic, as in the following:
Get(List_Len);
declare
List : array (1.. List_Len) of Integer;
begin
. . .
end;
• In this example, the user inputs the number of desired
elements in the array List, which are then dynamically
allocated when execution reaches the declare block.
• When execution reaches the end of the block, the List array
is deallocated.
Fixed Heap-Dynamic Arrays
• A fixed heap-dynamic array is similar to a fixed
stack-dynamic array, in that
– the subscript ranges
and
– the storage binding
are both fixed after storage is allocated.
The Differences between a Fixed Stack-Dynamic
Array and a Fixed Heap-Dynamic Array
• Both
– the subscript ranges
and
– storage bindings
are done when the user program requests
them during execution, rather than at
elaboration time.
• The storage is allocated from the heap,
rather than the stack
Example
• C and C++ also provide fixed heap-dynamic arrays.
– The standard library functions malloc and free, which are
general heap allocation and deallocation operations,
respectively, can be used for C arrays.
– C++ uses the operators new and delete to manage heap
storage.
• An array is treated as a pointer to a collection of
storage cells, where the pointer can be indexed, e.g.
*(p+i) where p is a pointer and i is an integer.
Heap-dynamic Arrays
• A heap-dynamic array is one in which
– the binding of subscript ranges and storage
allocation is dynamic
and
– can change any number of times during the
array's lifetime.
• The advantage of heap-dynamic arrays over
the others is flexibility:
– Arrays can grow and shrink during program
execution as the need for space changes.
Example – (1)
• Perl and JavaScript support heap-dynamic arrays.
– A Perl array can be made to grow
• by using the
– push (puts one or more new elements on the end of the array)
and
– unshift (puts one or more new elements on the beginning of
the array)
or
• by assigning a value to the array specifying a subscript
beyond the highest current subscript of the array.
– A Perl array can be made to shrink to no elements by
assigning it the empty list, ().
Example – (2)
• In Perl we could create an array of five
numbers with
@list = (1, 2, 4, 7, 10);
• Later, the array could be lengthened with
the push function, as in
push(@list, 13, 17)
• Now the array’s value is
(1, 2, 4, 7, 10, 13, 17);
• Still later @list could be emptied with
@list = ()
Heterogeneous Arrays
• A heterogeneous array is one in which the
elements need not be of the same type.
• Such arrays are supported by Perl, Python,
JavaScript, and Ruby.
• In all of these languages, arrays are
heap-dynamic.
Array Operations
• An array operation is one that operates on an
array as a unit.
The Most Common Array Operations
• assignment
• catenation
• comparison for equality and inequality
and
• slices.
Ada Array Operations
• Ada allows array assignments, including those
where the right side is an aggregate value
rather than an array name.
• Ada also provides catenation, specified by the
ampersand (&).
– Catenation is defined
• between two single-dimensioned arrays
and
• between a single-dimensioned array and a scalar.
Rectangular Arrays
• A rectangular array is a multidimensioned array in
which all of the rows have the same number
of elements, all of the columns have the same
number of elements, and so forth.
• Rectangular arrays accurately model tables.
Jagged Arrays
• A jagged array is one in which the lengths of
the rows need not be the same.
– For example, a jagged matrix may consist of 3
rows, one with 5 elements, one with 7 elements,
and one with 12 elements.
• This also applies to the columns or higher
dimensions.
– So, if there is a third dimension (layers), each
layer can have a different number of elements.
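• The jagged matrix from the example above can be sketched in Python as a list of rows, where each row is its own list and may have its own length. Python is used only for illustration; the variable names are hypothetical.

```python
# Three rows with 5, 7, and 12 elements, as in the example above.
jagged = [
    [0] * 5,
    [0] * 7,
    [0] * 12,
]

row_lengths = [len(row) for row in jagged]
print(row_lengths)  # [5, 7, 12]

# A rectangular array, by contrast, has rows of equal length.
rectangular = [[0] * 4 for _ in range(3)]
print({len(row) for row in rectangular})  # {4}
```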
Implementation of Array Types
• Implementing arrays requires considerably
more compile-time effort than does
implementing simple types, such as integer.
– The code to allow accessing of array elements
must be generated at compile time.
– At run time, this code must be executed to
produce element addresses.
– There is no way to precompute the address to be
accessed by a reference such as list[k].
Compute the Address of an Array
Element – (1)
• A single-dimensioned array is a list of adjacent memory cells.
• Suppose the array list is defined to have a subscript range
lower bound of 1.
• The access function for list is often of the form
address(list[k]) = address(list[1]) + (k - 1) * element_size
• This simplifies to
address(list[k]) = (address(list[1]) - element_size) +
(k * element_size)
where the first operand of the addition is the constant part of the
access function, and the second is the variable part.
Compute the Address of an Array
Element – (2)
• If the element type is statically bound and
the array is statically bound to storage,
then the value of the constant part can be
computed before run time.
• Only the addition and multiplication operations
remain to be done at run time.
• If the base, or beginning address, of the
array is not known until run time, the
subtraction must be done when the array is
allocated.
The General Form of the Access
Function
• The generalization of the access function
for an arbitrary lower bound is
address(list[k]) =
address(list[lower_bound]) +
((k - lower_bound) * element_size)
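• The access function can be written out as a small Python sketch. Addresses are plain integers here, and the base address 1000 and element size 4 are made-up values. The second function splits the same computation into the constant part and the variable part.

```python
def element_address(base_address, k, lower_bound, element_size):
    """address(list[k]) for an array whose first element is at base_address."""
    return base_address + (k - lower_bound) * element_size


def element_address_split(base_address, k, lower_bound, element_size):
    """Same result, split into a constant part and a variable part."""
    constant_part = base_address - lower_bound * element_size  # computable once
    variable_part = k * element_size                           # depends on k
    return constant_part + variable_part


# Illustrative values: list[1] stored at address 1000, 4-byte elements.
print(element_address(1000, 3, 1, 4))        # 1008
print(element_address_split(1000, 3, 1, 4))  # 1008
```

Only the multiplication and the addition in the variable part depend on k, which is why the constant part can be computed ahead of time when the base address is known.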
Layout of Multidimensional Arrays
• Hardware memory is linear—it is usually a
simple sequence of bytes.
• So values of data types that have two or more
dimensions must be mapped onto the single-dimensioned memory.
Methods to Map a Multidimensional
Array to a One-Dimensional Array
• There are two common ways in which
multidimensional arrays can be mapped to
one dimension:
– row major order
– column major order
Row Major Order
• In row major order, the elements of the array
whose first subscript is equal to the lower
bound value of that subscript are stored
first, followed by the elements of the second
value of the first subscript, and so forth.
Column Major Order
• In column major order, the elements of an
array whose last subscript is equal to the
lower bound value of that subscript are
stored first, followed by the elements of the
second value of the last subscript, and so forth.
• Column major order is used in Fortran,
but most other languages use row major
order.
Example
• For example,
– If the matrix had the values
3 4 7
6 2 5
1 3 8
• it would be stored in row major order as
3, 4, 7, 6, 2, 5, 1, 3, 8
• it would be stored in column major order as
3, 6, 1, 4, 2, 3, 7, 5, 8
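• Both storage orders for the example matrix can be reproduced with a short Python sketch. The matrix is held as a list of rows, and the variable names are hypothetical.

```python
matrix = [
    [3, 4, 7],
    [6, 2, 5],
    [1, 3, 8],
]
rows = len(matrix)
columns = len(matrix[0])

# Row major order: all of row 0, then all of row 1, and so on.
row_major = [matrix[i][j] for i in range(rows) for j in range(columns)]

# Column major order: all of column 0, then all of column 1, and so on.
column_major = [matrix[i][j] for j in range(columns) for i in range(rows)]

print(row_major)     # [3, 4, 7, 6, 2, 5, 1, 3, 8]
print(column_major)  # [3, 6, 1, 4, 2, 3, 7, 5, 8]
```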
The Access Function for Two-Dimensional Arrays
• The access function for two-dimensional
arrays can be developed as follows.
– In general, the address of an element is the base
address of the array plus the element size times the
number of elements that precede it in the array.
[Figure: the base address marks the start of the array; (n-1) elements precede element n]
The Access Function for Two-dimensional
Arrays Stored in Row Major Order
• For a matrix in row major order,
– the number of elements that precede an
element is
• the number of rows above the element times the size of a row,
plus the number of elements to the left of the element.
– This is illustrated in Figure 6.6, in which we
make the simplifying assumption that subscript
lower bounds are all 1.
• To get an actual address value, the number
of elements that precede the desired
element must be multiplied by the element
size.
Figure 6.6
Access Function for a[i,j] – (1)
• Now, the access function can be written as
location(a[i,j]) = address of a[1,1] +
((((number of rows above the ith row) * (size of a row))
+ (number of elements left of the jth column)) *
element size)
Access Function for a[i,j] – (2)
• Because the number of rows above the ith row is (i - 1)
and the number of elements to the left of the jth column is
(j - 1), we have
location(a[i,j]) = address of a[1,1] +
((((i - 1) * n) + (j - 1)) * element_size)
• where n is the number of elements per row.
• This can be rearranged to the form
location(a[i,j]) = address of a[1,1] - ((n + 1) * element_size)
+ ((i * n + j) * element_size)
• where the first two terms are the constant part and the last is
the variable part.
Access Function for a[i,j] – (3)
• The generalization to arbitrary lower bounds
results in the following access function:
location(a[i,j]) = address of a[row_lb,col_lb] +
(((i - row_lb) * n) + (j - col_lb)) * element_size
– where row_lb is the lower bound of the rows and
col_lb is the lower bound of the columns.
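The general access function can be written as a small C helper. This is a sketch under the slide's assumptions (row major order, n elements per row); the function name row_major_offset is hypothetical, and it returns only the byte offset that is added to the address of a[row_lb,col_lb].

```c
#include <stddef.h>

/* Byte offset of a[i,j] from the address of a[row_lb,col_lb] for an
   array stored in row major order with n elements per row. */
size_t row_major_offset(long i, long j, long row_lb, long col_lb,
                        size_t n, size_t element_size) {
    size_t rows_above    = (size_t)(i - row_lb);
    size_t elements_left = (size_t)(j - col_lb);
    return (rows_above * n + elements_left) * element_size;
}
```

With both lower bounds set to 0, the result should agree with the layout a C compiler uses for a declaration such as int a[3][3].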
1-111
Associative Arrays
• An associative array is an unordered collection of data
elements that are indexed by an equal number of
values called keys.
• In the case of non-associative arrays, the indices never
need to be stored (because of their regularity).
• In an associative array, however, the user-defined keys
must be stored in the structure.
• So each element of an associative array is in fact a
pair of entities
– a key
and
– a value.
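The key/value pairing can be illustrated with a small C sketch. It is an assumption-laden simplification: the names struct entry and lookup are hypothetical, and a linear search stands in for the hash function that a real implementation would use.

```c
#include <stddef.h>
#include <string.h>

/* One element of the associative array: the key is stored with its value. */
struct entry {
    const char *key;
    int value;
};

/* Search the table for key. Returns a pointer to the matching value,
   or NULL if the key is not present. */
int *lookup(struct entry *table, size_t count, const char *key) {
    for (size_t k = 0; k < count; k++)
        if (strcmp(table[k].key, key) == 0)
            return &table[k].value;
    return NULL;
}
```

Unlike an ordinary array, nothing about an element's position identifies it, which is why each key has to be kept in the structure.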
1-112
Structure and Operations
• In Perl, associative arrays are often called hashes,
because in the implementation their elements are
stored and retrieved with hash functions.
• The name space for Perl hashes is distinct; every
hash variable must begin with a percent sign (%).
• Hashes can be set to literal values with the
assignment statement, as in
%salaries = ("Gary" => 75000, "Perry" => 57000,
"Mary" => 55750, "Cedric" => 47850);
1-113
Reference to Individual Element
• Individual element values are referenced
using notation that is unique to Perl.
• The key value is placed in braces and the
hash name is replaced by a scalar variable
name that is the same except for the first
character.
– Scalar variable names begin with dollar signs ($).
– For example,
$salaries{"Perry"} = 58850;
1-114
Add and Delete Elements from a Hash
• A new element is added using the same
statement form.
• An element can be removed from the hash
with the delete operator, as in
delete $salaries{"Gary"};
• The entire hash can be emptied by
assigning the empty literal to it, as in
%salaries = ();
1-115
The Size of a Perl Hash
• The size of a Perl hash is dynamic:
– It grows when a new element is added.
– It shrinks when an element is deleted
and
– also when it is emptied by assignment of the
empty literal.
1-116
record Type
• A record is a possibly heterogeneous aggregate of
data elements in which the individual
elements are identified by names.
• In C and C++ records are supported with the
struct data type.
1-117
Difference between a record and an
array Type
• The fundamental difference between a record and
an array is
– the homogeneity of elements in arrays
versus
– the possible heterogeneity of elements in records.
• One result of this difference is that record elements,
or fields, are not usually referenced by indices.
– The fields are named with identifiers.
– References to the fields are made using these identifiers.
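A short C sketch of a record with heterogeneous fields follows. The struct employee type, its fields, and the make_employee helper are hypothetical names chosen for illustration.

```c
#include <string.h>

/* A record: fields of different types, each identified by a name. */
struct employee {
    char   name[32];
    int    age;
    double salary;
};

/* Build a record by assigning to its fields by name, not by index. */
struct employee make_employee(const char *name, int age, double salary) {
    struct employee e;
    strncpy(e.name, name, sizeof e.name - 1);
    e.name[sizeof e.name - 1] = '\0';   /* strncpy may not terminate */
    e.age = age;
    e.salary = salary;
    return e;
}
```

An array could not hold these three fields together, because its elements must all have the same type.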
1-118
Pointer Type
• A pointer type is one in which the variables
have a range of values that consists of
memory addresses and a special value, nil.
• The value nil is not a valid address and is
used to indicate that a pointer cannot
currently be used to reference any memory
cell.
1-119
Usages of Pointers
• Pointers have been designed for two
distinct kinds of uses.
– First, pointers provide some of the power of
indirect addressing, which is heavily used in
assembly language programming.
– Second, pointers provide a method of dynamic
storage management.
• A pointer can be used to access a location in the
area where storage is dynamically allocated, which
is usually called a heap.
1-120
Heap-dynamic Variables
• Variables that are dynamically allocated
from the heap are called heap-dynamic
variables.
• They
– often do not have identifiers associated with them
and
– thus can be referenced only by pointer or reference
type variables.
• Variables without names are called
anonymous variables.
1-121
Reference Types and Value Types
• Pointer type variables are different from scalar
variables because
– they are most often used to reference some
other variable,
– rather than being used to store data of some
sort.
• These two categories of variables are called
reference types and value types, respectively.
1-122
Pointer Types Add Writability
• Uses of pointers add writability to a language.
– For example, suppose it is necessary to
implement a dynamic structure like a binary tree in a
language like Fortran 77, which does not have
pointers.
• This would require the programmer to provide and
maintain a pool of available tree nodes.
• Also, because of the lack of dynamic storage in Fortran
77, it would be necessary for the programmer to
guess the maximum number of required nodes.
– This is clearly an awkward and error-prone way to deal
with binary trees.
1-123
Pointer Operations
• Languages that provide a pointer type
usually include two fundamental pointer
operations
– assignment
– dereferencing
1-124
The Assignment Operation
• The assignment operation sets a pointer
variable's value to some useful address.
– If pointer variables are used only to manage
dynamic storage, the allocation mechanism, whether
by operator or built-in subprogram, serves to
initialize the pointer variable.
– If pointers are used for indirect addressing to
variables that are not heap-dynamic, then there
must be an explicit operator or built-in subprogram
for fetching the address of a variable, which can
then be assigned to the pointer variable.
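Both cases can be sketched in C, where malloc is the allocation subprogram and & is the address-of operator. The names make_heap_int, counter, and counter_address are assumptions made for this example.

```c
#include <stdlib.h>

/* Case 1: the allocator initializes the pointer with a heap address.
   The caller is responsible for calling free() on the result. */
int *make_heap_int(int value) {
    int *p = malloc(sizeof *p);
    if (p != NULL)
        *p = value;
    return p;
}

/* Case 2: & fetches the address of a variable that is not heap-dynamic,
   so that the address can be assigned to a pointer. */
static int counter = 0;

int *counter_address(void) {
    return &counter;
}
```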
1-125
Two Distinct Ways to Interpret an Occurrence of a
Pointer Variable in an Expression
• First, it could be interpreted as a
reference to the contents of the memory cell
to which it is bound, which in the
case of a pointer is an address.
• Second, the pointer could also be
interpreted as a reference to the value in
the memory cell pointed to by the
memory cell to which the pointer
variable is bound.
– In this case, the pointer is interpreted as
an indirect reference.
[Figure: the pointer's cell at address 1 holds address 2; the cell at address 2 holds value 2]
1-126
The Dereferencing Operation
• Dereferencing, which takes a reference
through one level of indirection, is the
second fundamental pointer operation.
– To clarify dereferencing, consider a pointer
variable, ptr, that is bound to a memory cell
with the value 7080.
– Suppose that the memory cell whose address is
7080 has the value 206.
– A normal reference to ptr yields 7080, but a
dereferenced reference to ptr yields 206.
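In C, dereferencing is written with a prefix asterisk. The sketch below uses hypothetical helper names (read_through, write_through). A real program cannot choose an address such as 7080, so the address of an ordinary variable stands in for it.

```c
/* Read the value in the cell that ptr points to (one level of indirection). */
int read_through(int *ptr) {
    int j = *ptr;   /* dereferenced reference: yields the pointed-to value */
    return j;
}

/* Store a value into the cell that ptr points to. */
void write_through(int *ptr, int value) {
    *ptr = value;
}
```

If a variable cell holds 206 and ptr holds the address of cell, then ptr itself yields that address while read_through(ptr) yields 206.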
1-127
The Assignment Operation j = * ptr
1-128
Referencing a Field of a Record Pointed to by
a Pointer
• When pointers point to records, the syntax of the
references to the fields of these records varies among
languages.
– In C and C++, there are two ways a pointer to a record can
be used to reference a field in that record.
• If a pointer variable p points to a record with a field
named age, (*p).age can be used to refer to that field.
• The operator ->, when used between a pointer to a
record and a field of that record, combines dereferencing
and field reference.
– For example, the expression p->age is equivalent to (*p).age.
[Figure: p points to a record with the fields int gender, int age, and char address[4]; the age field is reached with (*p).age or p->age]
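The two notations can be compared directly in C. The record layout below follows the field names shown on the slide; the type name struct person and the two helper functions are assumptions.

```c
/* A record with the fields named on the slide. */
struct person {
    int  gender;
    int  age;
    char address[4];
};

/* Explicit dereference followed by field selection. */
int age_via_star(struct person *p) {
    return (*p).age;
}

/* The -> operator combines the dereference and the field selection. */
int age_via_arrow(struct person *p) {
    return p->age;
}
```

The parentheses in (*p).age are required because the field selector binds more tightly than the dereference operator.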
1-129
Heap Memory Allocation Operation
• Languages that provide pointers for the
management of a heap must include an explicit
allocation operation.
• Allocation is sometimes specified with a
subprogram, such as malloc in C.
• In languages that support OOP, allocation of heap
objects is often specified with the new operator.
• C++, which does not provide implicit deallocation,
uses delete as its deallocation operation.
1-130
Pointer Problems
• Dangling Pointers
• Lost Heap-Dynamic Variables
1-131
Dangling Pointers
• A dangling pointer, or dangling reference, is a
pointer that contains the address of a heap-dynamic
variable that has been deallocated.
1-132
Pseudo Code of Creating Dangling
Pointer
• The following sequence of operations creates a
dangling pointer in many languages:
1. Pointer p1 is set to point at a new heap-dynamic variable.
2. Pointer p2 is assigned p1's value.
3. The heap-dynamic variable pointed to by p1 is explicitly
deallocated (setting p1 to nil), but p2 is not changed by
the operation. p2 is now a dangling pointer.
[Figure: after deallocation p1 is nil, while p2 still points into the heap]
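The three steps can be traced in C, with NULL playing the role of nil. This is a sketch only: release and dangling_demo are hypothetical names, and the dangling pointer is deliberately never dereferenced, because doing so is undefined behavior.

```c
#include <stdlib.h>

/* Deallocate the variable *pp points to and set *pp to NULL.
   Any other copies of the address are not changed. */
void release(int **pp) {
    free(*pp);
    *pp = NULL;
}

/* Walks through the three steps. Returns 1 once both pointers are safe. */
int dangling_demo(void) {
    int *p1 = malloc(sizeof *p1);   /* step 1: new heap-dynamic variable */
    if (p1 == NULL)
        return 0;
    *p1 = 42;
    int *p2 = p1;                   /* step 2: p2 is assigned p1's value */
    release(&p1);                   /* step 3: deallocate, p1 becomes NULL */
    /* p2 still holds the old address here: it is a dangling pointer. */
    p2 = NULL;                      /* discard the stale copy before any use */
    return p1 == NULL && p2 == NULL;
}
```

Setting a pointer to NULL after freeing protects only that one pointer; every alias such as p2 has to be handled separately.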
1-133
Problems Caused by Dangling Pointers (1)
• The location being pointed to may have
been reallocated to some new heap-dynamic
variable.
• If the new variable is not the same type as the old
one, type checks of uses of the dangling pointer are
invalid.
• Even if the new one is the same type, its new value
will bear no relationship to the old pointer's
dereferenced value.
1-134
Problems Caused by Dangling Pointers (2)
• If the dangling pointer is used to change
the heap-dynamic variable, the value of the
new heap-dynamic variable will be
destroyed.
1-135
Problems Caused by Dangling Pointers (3)
• It is possible that the location now is being
temporarily used by the storage management
system, possibly as a pointer in a chain of
available blocks of storage, thereby
allowing a change to the location to cause
the storage manager to fail.
[Figure: q is a dangling pointer, and the assignment q->value=0xabcd; overwrites the location it still points to]
1-136
Use-after-Free Vulnerabilities
• Dangling pointers critically impact program
correctness and security because they open
the door to use-after-free. [Caballero et al.]
• Use-after-free errors occur when a
program continues to use a pointer after the
memory it points to has been freed [OWASP].
1-137
Example 1
[mitre]
#include <stdio.h>
#include <stdlib.h>   /* malloc, free */
#include <string.h>   /* strncpy */
#include <unistd.h>
#define BUFSIZER1 512
#define BUFSIZER2 ((BUFSIZER1/2) - 8)
int main(int argc, char **argv) {
    char *buf1R1, *buf2R1, *buf2R2, *buf3R2;
    buf1R1 = (char *) malloc(BUFSIZER1);
    buf2R1 = (char *) malloc(BUFSIZER1);
    free(buf2R1);                           /* buf2R1 is now a dangling pointer */
    buf2R2 = (char *) malloc(BUFSIZER2);    /* may reuse the freed block */
    buf3R2 = (char *) malloc(BUFSIZER2);
    strncpy(buf2R1, argv[1], BUFSIZER1-1);  /* use after free */
    free(buf1R1);
    free(buf2R2);
    free(buf3R2);
}
1-138
Example 2
[mitre]
char* ptr = (char*)malloc(SIZE);
if (err) {
    abrt = 1;
    free(ptr);
}
...
if (abrt) {
    logError("operation aborted before commit", ptr);
}
// When an error occurs, the pointer is immediately freed.
// However, this pointer is later incorrectly used in the logError function.
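One way to avoid the error in Example 2 is to postpone the free until after the last use of the buffer. The sketch below is not the MITRE code: log_error is a hypothetical stand-in for logError, and process, the 16-byte size, and the "payload" contents are assumptions made so that the example is self-contained.

```c
#include <stdlib.h>
#include <string.h>

/* Hypothetical stand-in for logError: reports the length of the data. */
static size_t log_error(const char *msg, const char *data) {
    (void)msg;
    return data != NULL ? strlen(data) : 0;
}

/* Returns what was logged. The buffer is freed only after its last use. */
size_t process(int err) {
    size_t logged = 0;
    char *ptr = malloc(16);
    if (ptr == NULL)
        return 0;
    strcpy(ptr, "payload");
    if (err) {
        /* The buffer is still valid here because it has not been freed. */
        logged = log_error("operation aborted before commit", ptr);
    }
    free(ptr);    /* single deallocation, after every use */
    ptr = NULL;   /* the pointer cannot dangle afterwards */
    return logged;
}
```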
1-140
Lost Heap-Dynamic Variables
• A lost heap-dynamic variable is an allocated
heap-dynamic variable that is no longer
accessible to the user program.
• Such variables are often called garbage
because
– they are not useful for their original purpose
and
– they also cannot be reallocated for some new
use by the program.
1-142
Pseudo Code of Creating Lost Heap-Dynamic Variables
• Lost heap-dynamic variables are most often
created by the following sequence of
operations:
1. Pointer p is set to point to a newly created
heap-dynamic variable.
2. p is later set to point to another newly created
heap-dynamic variable. The first heap-dynamic
variable is now inaccessible, or lost. Sometimes
this is called memory leakage.
p = malloc(4);
p = malloc(20);   /* the 4-byte block is now unreachable (lost) */
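The leak disappears if the first block is deallocated before the pointer is redirected. The following C sketch shows one way to do that; the helper name replace_block is an assumption.

```c
#include <stdlib.h>

/* Point *pp at a new block of new_size bytes, freeing the old block
   first so that it does not become a lost heap-dynamic variable.
   Returns 1 on success, or 0 if allocation failed (*pp is then NULL). */
int replace_block(void **pp, size_t new_size) {
    free(*pp);                  /* free(NULL) does nothing */
    *pp = malloc(new_size);
    return *pp != NULL;
}
```

Called with a pointer to a 4-byte block and a new size of 20, it mirrors the two malloc calls above but releases the first block instead of losing it.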
1-143
Formal Parameters and Actual
Parameters
• The parameters in the subprogram header are
called formal parameters.
• Subprogram call statements must include
– the name of the subprogram
and
– a list of parameters to be bound to the formal
parameters of the subprogram.
• These parameters are called actual parameters.
1-144
Reference Types in C++
• C++ includes a special kind of reference type that is used
primarily for the formal parameters in function
definitions.
– A C++ reference type variable is a constant pointer that is always
implicitly dereferenced.
– Because a C++ reference type variable is a constant,
• it must be initialized with the address of some variable in its
definition
and
• after initialization a reference type variable can NEVER be
set to reference any other variable.
1-145
Example
• Reference type variables are specified in definitions by
preceding their names with ampersands (&).
– For example,
int result = 0;
int &ref_result = result;
. . .
ref_result = 100;
– In this code segment, result and ref_result are aliases.
1-146
Formal Parameters with Reference
Types
• When used as formal parameters in function
definitions, reference types provide for two-way
communication between the caller function and the
called function.
– This is not possible with nonpointer primitive parameter types,
because C++ parameters are passed by value.
– Reference parameters are referenced in the called function
exactly as are other parameters.
– The calling function need not specify that a parameter
whose corresponding formal parameter is a reference type
is anything unusual. The compiler passes addresses, rather
than values, to reference parameters.
1-147
Example
[wikipedia]
void square(int x, int& result)
{ result = x*x; }
• Then, the following call would place 9 in y:
square(3, y);
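For contrast, here is a sketch of the same function in C, which has no reference types. The caller must pass an address explicitly and the function must dereference it explicitly, which is the work a C++ reference parameter does implicitly.

```c
/* Pointer version of square: result receives x squared through an
   explicitly passed address. */
void square(int x, int *result) {
    *result = x * x;
}
```

A call then has to be written as square(3, &y); so the caller can see that y may be modified.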
1-148