Advances in Programming Languages: Addresses, pointers and reference types

Stephen Gilmore
The University of Edinburgh
January 24, 2007

Background

Computer programs manipulate data values stored in memory and in registers. Data values held in memory have an associated address whereas data values held in registers do not. In some programming languages addresses can be manipulated as values and in others they cannot. Allowing programs to modify addresses in an unrestricted way leads to unpredictable behaviour. Modern programming languages attempt to control modification and update of addresses.

Addresses, pointers and references

C provides an address-of operator, &, which allows us to take the address of a variable. A variable which holds such an address is a pointer to the data stored at that address. Given a pointer to an address in memory, we can dereference the pointer (using *) to access the data stored there. Java provides no means to obtain the address of an object. Caml provides references, which are created using ref and dereferenced using !.

Using addresses: implementing call by reference

One use of the address-of operator is passing the address of a variable to a function. The callee can use the address to refer to the contents of the variable which is the actual parameter of the call. In this way a variable can be shared between the caller, who owns it, and the callee, who uses it. Passing the address of a variable provides a way to implement call by reference in a call-by-value language (such as C).
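As a minimal sketch of the sharing described above, consider a function which exchanges the contents of two of the caller's variables. The name swap_ints is a hypothetical helper introduced here for illustration; it does not appear in the course programs.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical helper (not from the course programs): exchange two
   integers owned by the caller, given only their addresses. */
void swap_ints(int *a, int *b) {
    assert(a != NULL && b != NULL); /* the callee needs two valid addresses */
    int tmp = *a;  /* read the caller's first variable through its address */
    *a = *b;       /* overwrite the caller's first variable */
    *b = tmp;      /* overwrite the caller's second variable */
}
```

A caller would write swap_ints(&a, &b) for two int variables a and b. Had the parameters been declared as plain int, only the callee's private copies would have been exchanged and the caller would see no change.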
    /* File: programs/c/Addresses.c */
    #include <stdio.h>
    #include <stdlib.h>

    void updateByRef(int* x) { *x = *x + 13; }
    void updateByValue(int x) { x += 12; }

    int main() {
      int x = 3;                /* x is an integer variable */
      updateByValue(x);         /* this call has no effect on x */
      updateByRef(&x);          /* this call updates x */
      printf("x is %d\n", x);   /* prints "x is 16" */
      exit(0);
    }

UG4 Advances in Programming Languages — 2005/2006

Call by reference in C#

Other programming languages treat call by reference as a built-in part of the language, not something which must be simulated by taking addresses and using pointers. C# allows parameters to be passed by reference by declaring them to be reference parameters in the method header. A value parameter is declared as int x, as in Java. A reference parameter is declared as ref int x, and the ref keyword must also be written at the call site.

    /* File: programs/cs/Addresses.cs */
    using System;

    class Addresses {
      static void updateByRef(ref int x) { x = x + 13; }
      static void updateByValue(int x) { x += 12; }

      static void Main() {
        int x = 3;                          /* x is an integer variable */
        updateByValue(x);                   /* this has no effect on x */
        updateByRef(ref x);                 /* this call updates x */
        Console.WriteLine("x is {0}", x);   /* "x is 16" */
      }
    }

Handling addresses

Taking the address of a variable may seem entirely unproblematic, but there are some complications. One complication is that programming languages such as C allow arithmetic operations to be performed on addresses, for example to obtain an adjacent address. Another is that in block-structured programming languages variables are declared in a block and are destroyed on exit from that block. This can lead to stale addresses being retained incorrectly.

The following C program demonstrates one of the problems of address arithmetic. We can obtain a good address, such as the address of the variable x, but once we move away from this address we point to a storage location with arbitrary content, or to a location which we do not own.
    /* File: programs/c/Pointers.c */
    #include <stdio.h>
    #include <stdlib.h>

    int main() {
      int x = 3;                 /* x is an integer variable */
      int *y = &x;               /* y is a pointer to an integer */
      y--;                       /* pointer arithmetic: y leaves x */
      printf("*y is %d\n", *y);  /* dereference y: arbitrary content */
      y++;                       /* pointer arithmetic: back to x */
      printf("*y is %d\n", *y);  /* dereference y */
      y=0;                       /* pointer assignment: null pointer */
      printf("*y is %d\n", *y);  /* segmentation fault */
      exit(0);
    }

Cyclone: a modern dialect of C

The Cyclone programming language is a modern dialect of C which attempts to provide C-like constructs, and to match the efficiency of C, while providing secure type checking comparable with that of Java and Caml. Cyclone devotes considerable attention to the use of pointers, in order to avoid errors such as those seen above, and forbids arithmetic on *-pointers.

    /* File: programs/cyclone/Pointers.cyc */
    #include <stdio.h>
    #include <stdlib.h>

    int main() {
      /* x is an integer variable */
      int x = 3;
      /* y is a pointer to an integer */
      int *y = &x;
      /* Arithmetic is not allowed on these pointers
         so the compilation is faulted at this point */
      y--;
      exit(0);
    }

Of course, pointer arithmetic is not always problematic, otherwise it would probably not be in C at all. There are some relatively simple uses of pointer arithmetic, such as walking through an array like the argument vector (argv) passed to the main function. In the following program the argument count (argc) is decremented on each iteration of the loop, and the pointer into the argument vector is advanced. The loop terminates when argc reaches 0.

    /* File: programs/c/Arguments.c */
    #include <stdio.h>

    /* char* denotes a string. char** is an array of strings.
    */
    int main(int argc, char** argv) {
      argc--; argv++;           /* skip the command name */
      while (argc > 0) {
        printf("%s ",*argv);    /* print a string */
        argc--; argv++;
      }
      printf("\n");
      return 0;
    }

Fat pointers

One of the difficulties of using C-style pointers is that they are thin, with no additional bounds information. We need a separate integer counter to know the length of the array. An alternative is to package the length and content information together in an object, as Java does with its arrays and strings. Cyclone introduces fat pointers which retain bounds information (hence they are ‘fat’: more than just a single address). The Cyclone notation for a fat pointer is ?. Fat pointers allow address arithmetic, supporting the C programming model, but out-of-bounds accesses and null pointer dereferences are trapped at run-time.

    /* File: programs/cyclone/Arguments.cyc */
    #include <stdio.h>

    /* char? is a fat pointer to a string.
       char?? is a fat pointer to an array of strings.
       Both have null and bounds checks. */
    int main(int argc, char?? argv) {
      argc--; argv++;           /* skip command name */
      while (argc > 0) {
        printf("%s ",*argv);    /* print a string */
        argc--; argv++;
      }
      printf("\n");
      return 0;
    }

Efficient code and pointer safety

Cyclone supports both thin pointers and fat pointers so that the programmer can avoid unnecessary bounds checks when pointer arithmetic is not being used. Bounds checking is an overhead at run-time, so it is good to be able to avoid it when it is safe to do so. A language with only safe pointer and array use, such as Java, inevitably inserts some run-time checks which turn out to be unnecessary, but it does guarantee memory safety, which is a huge gain.
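The idea of a fat pointer can be approximated in plain C by carrying the bounds alongside the address. The sketch below is only an illustration of the concept, not a description of how the Cyclone compiler represents fat pointers. The type fat_int_ptr and the functions fat_from_array, fat_add and fat_read are hypothetical names introduced here.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical fat pointer: an address together with the bounds of the
   array it was created from. This is an assumption made for illustration. */
typedef struct {
    int *base;      /* first element of the underlying array (may be NULL) */
    size_t len;     /* number of elements in the underlying array */
    ptrdiff_t pos;  /* current offset from base; allowed to go out of range */
} fat_int_ptr;

/* Create a fat pointer to the start of an array of len elements. */
fat_int_ptr fat_from_array(int *base, size_t len) {
    fat_int_ptr p = { base, len, 0 };
    return p;
}

/* Arithmetic is always permitted: only the stored offset changes, so no
   out-of-range raw pointer is ever formed. */
fat_int_ptr fat_add(fat_int_ptr p, ptrdiff_t delta) {
    p.pos += delta;
    return p;
}

/* Checked dereference: on success store the element in *out and return
   true; on a null base or an out-of-bounds offset return false without
   touching memory. */
bool fat_read(fat_int_ptr p, int *out) {
    assert(out != NULL);
    if (p.base == NULL || p.pos < 0 || (size_t)p.pos >= p.len) {
        return false;
    }
    *out = p.base[p.pos];
    return true;
}
```

The check is deferred until the dereference, so moving the pointer out of range and back again is harmless. The cost is the extra storage in every pointer and a comparison on every access, which is the overhead that thin pointers avoid.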
Run-time errors with pointers

In C there are no bounds checks, so out-of-bounds errors simply corrupt unexpected areas of memory. This leads to unexpected behaviour later, possibly a segmentation fault, at a point in the execution far removed from the place where the error in the program logic first caused things to go wrong. This may make diagnosing the cause of the error more difficult and more expensive. In contrast, bounds checking on strings and arrays raises an exception at the point where the violation occurs. Bounds checks do not prevent run-time errors; they simply trap them when they occur.

    /* File: programs/cyclone/Pointers2.cyc */
    #include <stdio.h>
    #include <stdlib.h>

    int main() {
      /* x is an integer variable */
      int x = 3;
      /* y is a fat pointer to an integer */
      int ?y = &x;
      /* fat pointer arithmetic is allowed */
      y--;
      printf("*y is %d\n", *y);
      /* This attempt to dereference y leads to
         a run-time exception. */
      exit(0);
    }

Cyclone error report

    Uncaught exception Cyc_Array_bounds thrown from around Pointers2.cyc:13

Stale addresses

It is tempting to think of the & operator as “returning the address of a variable”, but it is more accurate to think of it as “returning an address (which is currently associated with this variable)”. As a program calls functions, a frame for each call is pushed onto the run-time stack and the local variables of the function occupy addresses in that frame. When a function exits, its storage area is freed so that the memory may be re-used by the next function which is called. If we attempt to return the address of a local variable in C we will be warned about this by the compiler but warnings, unlike errors, can be ignored.
    /* File: programs/c/StaleAddresses.c */
    #include <stdio.h>
    #include <stdlib.h>

    int* makeCounter() {
      int counter = 0;
      return &counter;   /* return address of local variable */
    }

    int main() {
      int* p;
      p = makeCounter();
      printf("*p is %d\n", *p);   /* may print "*p is 0" */
      exit(0);
    }

This program compiles (with warnings) and appears to run successfully, although its behaviour is undefined and a different compiler may produce a different result. An inexperienced C programmer might conclude that it is possible to take the address of a local variable in this way, and thereby lengthen its lifetime because it needs to be retained while we still hold a pointer to it. To see that this is not the case, we only need to make the program a little more complex by adding an additional function.

    /* File: programs/c/StaleAddresses2.c */
    #include <stdio.h>
    #include <stdlib.h>

    int* makeCounter() {
      int counter = 0;
      return &counter;   /* return address of local variable */
    }

    /* no use of pointers or addresses in this function */
    void dummy() {
      int x = 13;        /* simple integer variable */
    }

    int main() {
      int* p;
      p = makeCounter();
      printf("*p is %d\n", *p);   /* may print "*p is 0" */
      dummy();
      printf("*p is %d\n", *p);   /* may print "*p is 13" */
      exit(0);
    }

The call to dummy re-uses the stack storage which counter occupied, so the stale pointer p now refers to the storage of x. In Cyclone both of these programs are faulted by the compiler with a type error message. Cyclone considers the type of a pointer to include information about the place where its target is allocated (and thus a pointer to an integer allocated in the function makeCounter cannot be assigned to an integer pointer declared in main).

Variable lifetimes in Caml

Caml does not tie the lifetime of a value to the block which created it: a reference cell is allocated on the heap and is kept alive by the garbage collector for as long as it is reachable. Locally-created references can therefore outlive the function invocation which created them. In Caml the idiom of creating an initialised counter and returning it from a function can be programmed in the expected way and no memory-related errors occur.
    (* File: programs/caml/StaleAddresses2.ml *)
    let makeCounter() =
      let counter = ref 0 in
      counter;;

    let dummy() =
      let x = 13 in
      ();;

    let main() =
      let p = makeCounter() in
      print_string("!p is " ^ string_of_int(!p) ^ "\n");
      dummy();   (* does not change p: still zero *)
      print_string("!p is " ^ string_of_int(!p) ^ "\n");
      exit(0);;

    main();;

Summary

We discussed addresses, pointers and references.

• In C we can manipulate addresses and pointers as data. There are no checks and there are no guarantees of safety or security.
• In C# passing by reference can eliminate some uses of addresses and pointers.
• Cyclone separates thin and fat pointers. It allows arithmetic only on fat pointers and performs bounds checking.
• In Caml and Java addresses and pointers are not exposed to the programmer.
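Returning to the counter idiom from the discussion of stale addresses: C can express a counter which outlives the function that created it, provided that the storage is allocated on the heap and not in a stack frame. The sketch below is one possible way to do this and is not taken from the course programs. The names make_counter and counter_next are hypothetical, and unlike Caml the caller is responsible for releasing the storage with free.

```c
#include <assert.h>
#include <stdlib.h>

/* Hypothetical heap-allocated variant of makeCounter. The storage is not
   part of any stack frame, so it survives the return from this function.
   Returns NULL if the allocation fails. */
int *make_counter(void) {
    int *counter = malloc(sizeof *counter);
    if (counter != NULL) {
        *counter = 0;   /* initialise the counter */
    }
    return counter;     /* the caller must eventually call free() */
}

/* Increment the counter through its address and return the new value. */
int counter_next(int *counter) {
    assert(counter != NULL);
    *counter += 1;
    return *counter;
}

/* Same shape as dummy() in StaleAddresses2.c: its local variable lives on
   the stack and so cannot overwrite the heap-allocated counter. */
void dummy(void) {
    volatile int x = 13;
    (void)x;
}
```

A caller obtains p from make_counter(), checks it against NULL, uses *p or counter_next(p) for as long as needed, and finally calls free(p). Using p after that call would again be a stale address, so this approach moves the lifetime problem to the programmer instead of removing it.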