SYLLABUS

ADVANCED CONCEPTS OF C AND INTRODUCTION TO DATA STRUCTURES: DATA TYPES, ARRAYS, POINTERS, RELATION BETWEEN POINTERS AND ARRAYS, SCOPE RULES AND STORAGE CLASSES, DYNAMIC ALLOCATION AND DE-ALLOCATION OF MEMORY, DANGLING POINTER PROBLEM, STRUCTURES, ENUMERATED CONSTANTS, UNIONS
COMPLEXITY OF ALGORITHMS: PROGRAM ANALYSIS, PERFORMANCE ISSUES, GROWTH OF FUNCTIONS, ASYMPTOTIC NOTATIONS, TIME-SPACE TRADE OFFS, SPACE USAGE, SIMPLICITY, OPTIMALITY
INTRODUCTION TO DATA AND FILE STRUCTURE: INTRODUCTION, PRIMITIVE AND SIMPLE STRUCTURES, LINEAR AND NONLINEAR STRUCTURES, FILE ORGANIZATIONS
ARRAYS: SEQUENTIAL ALLOCATION, MULTIDIMENSIONAL ARRAYS, ADDRESS CALCULATIONS, GENERAL MULTIDIMENSIONAL ARRAYS, SPARSE ARRAYS
STRINGS: INTRODUCTION, STRING FUNCTIONS, STRING LENGTH, STRING COPY, STRING COMPARE, STRING CONCATENATION
ELEMENTARY DATA STRUCTURES: STACK, OPERATIONS ON STACK, IMPLEMENTATION OF STACKS, RECURSION AND STACKS, EVALUATION OF EXPRESSIONS USING STACKS, QUEUE, ARRAY IMPLEMENTATION OF QUEUES, CIRCULAR QUEUE, DEQUES, PRIORITY QUEUES
LINKED LISTS: SINGLY LINKED LISTS, IMPLEMENTATION OF LINKED LIST, CONCATENATION OF LINKED LISTS, MERGING OF LINKED LISTS, REVERSING OF LINKED LIST, DOUBLY LINKED LIST, IMPLEMENTATION OF DOUBLY LINKED LIST, CIRCULAR LINKED LIST, APPLICATIONS OF THE LINKED LISTS
GRAPHS: ADJACENCY MATRIX AND ADJACENCY LISTS, GRAPH TRAVERSAL, IMPLEMENTATION, SHORTEST PATH PROBLEM, MINIMAL SPANNING TREE, OTHER TASKS
TREES: INTRODUCTION, PROPERTIES OF A TREE, BINARY TREES, IMPLEMENTATION, TRAVERSALS OF A BINARY TREE, BINARY SEARCH TREES (BST), INSERTION IN BST, DELETION OF A NODE, SEARCH FOR A KEY IN BST, HEIGHT BALANCED TREE, B-TREE, INSERTION, DELETION
FILE ORGANIZATION: INTRODUCTION, TERMINOLOGY, FILE ORGANISATION, SEQUENTIAL FILES, DIRECT FILE ORGANIZATION, DIVISION-REMAINDER HASHING, INDEXED SEQUENTIAL FILE ORGANIZATION
SEARCHING: INTRODUCTION, SEARCHING TECHNIQUES, SEQUENTIAL SEARCH, BINARY SEARCH, HASHING, HASH FUNCTIONS, COLLISION RESOLUTION
SORTING: INTRODUCTION, INSERTION SORT, BUBBLE SORT, SELECTION SORT, RADIX SORT, QUICK SORT, 2-WAY MERGE SORT, HEAP SORT, HEAPSORT VS. QUICKSORT

S S PUBLICATIONS
D.NO: 10-13-36, Sistla Vari Street, Repalle-522265, Guntur (Dt), A.P, INDIA
Email: mdsspublications@gmail.com, Web-site: www.sspublications.co.in

UNIT 1 ADVANCED CONCEPTS OF C AND INTRODUCTION TO DATA STRUCTURES
  1.1 INTRODUCTION
  1.2 DATA TYPES
  1.3 ARRAYS
    1.3.1 HANDLING ARRAYS
    1.3.2 INITIALIZING THE ARRAYS
  1.4 MULTIDIMENSIONAL ARRAYS
    1.4.1 INITIALIZATION OF TWO DIMENSIONAL ARRAY
  1.5 POINTERS
    1.5.1 ADVANTAGES AND DISADVANTAGES OF POINTERS
    1.5.2 DECLARING AND INITIALIZING POINTERS
    1.5.3 POINTER ARITHMETIC
  1.6 ARRAY OF POINTERS
  1.7 PASSING PARAMETERS TO THE FUNCTIONS
  1.8 RELATION BETWEEN POINTERS AND ARRAYS
  1.9 SCOPE RULES AND STORAGE CLASSES
    1.9.1 AUTOMATIC VARIABLES
    1.9.2 STATIC VARIABLES
    1.9.3 EXTERNAL VARIABLES
    1.9.4 REGISTER VARIABLE
  1.10 DYNAMIC ALLOCATION AND DE-ALLOCATION OF MEMORY
    1.10.1 FUNCTION MALLOC(SIZE)
    1.10.2 FUNCTION CALLOC(N,SIZE)
    1.10.3 FUNCTION FREE(BLOCK)
  1.11 DANGLING POINTER PROBLEM
  1.12 STRUCTURES
  1.13 ENUMERATED CONSTANTS
  1.14 UNIONS
UNIT 2 COMPLEXITY OF ALGORITHMS
  2.1. PROGRAM ANALYSIS
  2.2. PERFORMANCE ISSUES
  2.3. GROWTH OF FUNCTIONS
  2.4. ASYMPTOTIC NOTATIONS
    2.4.1. BIG-O NOTATION (O)
    2.4.2. BIG-OMEGA NOTATION (Ω)
    2.4.3. BIG-THETA NOTATION (Θ)
  2.5. TIME-SPACE TRADE OFFS
  2.6. SPACE USAGE
  2.7. SIMPLICITY
  2.8. OPTIMALITY
UNIT 3 INTRODUCTION TO DATA AND FILE STRUCTURE
  3.1 INTRODUCTION
  3.2 PRIMITIVE AND SIMPLE STRUCTURES
  3.3 LINEAR AND NONLINEAR STRUCTURES
  3.4 FILE ORGANIZATIONS
UNIT 4 ARRAYS
  4.1 INTRODUCTION
    4.1.1. SEQUENTIAL ALLOCATION
    4.1.2. MULTIDIMENSIONAL ARRAYS
  4.2. ADDRESS CALCULATIONS
  4.3. GENERAL MULTIDIMENSIONAL ARRAYS
  4.4. SPARSE ARRAYS
UNIT 5 STRINGS
  5.1 INTRODUCTION
  5.2 STRING FUNCTIONS
  5.3 STRING LENGTH
    5.3.1 USING ARRAY
    5.3.2 USING POINTERS
  5.4 STRING COPY
    5.4.1 USING ARRAY
    5.4.2 USING POINTERS
  5.5 STRING COMPARE
    5.5.1 USING ARRAY
  5.6 STRING CONCATENATION
UNIT 6 ELEMENTARY DATA STRUCTURES
  6.1 INTRODUCTION
  6.2 STACK
    6.2.1 DEFINITION
    6.2.2 OPERATIONS ON STACK
    6.2.3 IMPLEMENTATION OF STACKS USING ARRAYS
      6.2.3.1 FUNCTION TO INSERT AN ELEMENT INTO THE STACK
      6.2.3.2 FUNCTION TO DELETE AN ELEMENT FROM THE STACK
      6.2.3.3 FUNCTION TO DISPLAY THE ITEMS
  6.3 RECURSION AND STACKS
  6.4 EVALUATION OF EXPRESSIONS USING STACKS
    6.4.1 POSTFIX EXPRESSIONS
    6.4.2 PREFIX EXPRESSION
  6.5 QUEUE
    6.5.1 INTRODUCTION
    6.5.2 ARRAY IMPLEMENTATION OF QUEUES
      6.5.2.1 FUNCTION TO INSERT AN ELEMENT INTO THE QUEUE
      6.5.2.2 FUNCTION TO DELETE AN ELEMENT FROM THE QUEUE
  6.6 CIRCULAR QUEUE
    6.6.1 FUNCTION TO INSERT AN ELEMENT INTO THE QUEUE
    6.6.2 FUNCTION FOR DELETION FROM CIRCULAR QUEUE
    6.6.3 CIRCULAR QUEUE WITH ARRAY IMPLEMENTATION
  6.7 DEQUES
  6.8 PRIORITY QUEUES
UNIT 7 LINKED LISTS
  7.1. INTRODUCTION
  7.2. SINGLY LINKED LISTS
    7.2.1. IMPLEMENTATION OF LINKED LIST
      7.2.1.1. INSERTION OF A NODE AT THE BEGINNING
      7.2.1.2. INSERTION OF A NODE AT THE END
      7.2.1.3. INSERTION OF A NODE AFTER A SPECIFIED NODE
      7.2.1.4. TRAVERSING THE ENTIRE LINKED LIST
      7.2.1.5. DELETION OF A NODE FROM LINKED LIST
  7.3. CONCATENATION OF LINKED LISTS
  7.4. MERGING OF LINKED LISTS
  7.5. REVERSING OF LINKED LIST
  7.6. DOUBLY LINKED LIST
    7.6.1. IMPLEMENTATION OF DOUBLY LINKED LIST
  7.7. CIRCULAR LINKED LIST
  7.8. APPLICATIONS OF THE LINKED LISTS
UNIT 8 GRAPHS
  8.1 INTRODUCTION
  8.2 ADJACENCY MATRIX AND ADJACENCY LISTS
  8.3 GRAPH TRAVERSAL
    8.3.1 DEPTH FIRST SEARCH (DFS)
      8.3.1.1 IMPLEMENTATION
    8.3.2 BREADTH FIRST SEARCH (BFS)
      8.3.2.1 IMPLEMENTATION
  8.4 SHORTEST PATH PROBLEM
  8.5 MINIMAL SPANNING TREE
  8.6 OTHER TASKS
UNIT 9 TREES
  9.1. INTRODUCTION
    9.1.1. OBJECTIVES
    9.1.2. BASIC TERMINOLOGY
    9.1.3. PROPERTIES OF A TREE
  9.2. BINARY TREES
    9.2.1. PROPERTIES OF BINARY TREES
    9.2.2. IMPLEMENTATION
    9.2.3. TRAVERSALS OF A BINARY TREE
      9.2.3.1. IN ORDER TRAVERSAL
      9.2.3.2. POST ORDER TRAVERSAL
      9.2.3.3. PREORDER TRAVERSAL
  9.3. BINARY SEARCH TREES (BST)
    9.3.1. INSERTION IN BST
    9.3.2. DELETION OF A NODE
    9.3.3. SEARCH FOR A KEY IN BST
  9.4. HEIGHT BALANCED TREE
  9.5. B-TREE
    9.5.1. INSERTION
    9.5.2. DELETION
UNIT 10 FILE ORGANIZATION
  10.1. INTRODUCTION
  10.2. TERMINOLOGY
  10.3. FILE ORGANISATION
    10.3.1. SEQUENTIAL FILES
      10.3.1.1. BASIC OPERATIONS
      10.3.1.2. DISADVANTAGES
    10.3.2. DIRECT FILE ORGANIZATION
      10.3.2.1. DIVISION-REMAINDER HASHING
    10.3.3. INDEXED SEQUENTIAL FILE ORGANIZATION
UNIT 11 SEARCHING
  11.1. INTRODUCTION
  11.2. SEARCHING TECHNIQUES
    11.2.1. SEQUENTIAL SEARCH
      11.2.1.1. ANALYSIS
    11.2.2. BINARY SEARCH
      11.2.2.1. ANALYSIS
  11.3. HASHING
    11.3.1. HASH FUNCTIONS
  11.4. COLLISION RESOLUTION
UNIT 12 SORTING
  12.1. INTRODUCTION
  12.2. INSERTION SORT
    12.2.1. ANALYSIS
  12.3. BUBBLE SORT
    12.3.1. ANALYSIS
  12.4. SELECTION SORT
    12.4.1. ANALYSIS
  12.5. RADIX SORT
    12.5.1. ANALYSIS
  12.6. QUICK SORT
    12.6.1. ANALYSIS
  12.7. 2-WAY MERGE SORT
  12.8. HEAP SORT
  12.9. HEAPSORT VS. QUICKSORT

UNIT 1 ADVANCED CONCEPTS OF C AND INTRODUCTION TO DATA STRUCTURES

1.1.
INTRODUCTION
This chapter familiarizes you with the concepts of arrays, pointers, and dynamic memory allocation and de-allocation techniques. We also briefly discuss the types of data structures and algorithms. Let us start the discussion with data types.

1.2. DATA TYPES
Data given to a program must be stored so that it can be referred to again. This is done with the help of variables. The memory a variable requires depends on the type it belongs to. The built-in types in C include int, float (real numbers), char, double, long, short, etc.

We often come across many related data items of the same type. Giving each of them a different variable name and later remembering those names is tedious. It is easier to give a single parent name to all the items of the same type and to refer to a particular value using an index, which indicates whether the value is the first, second or tenth under that name. An index can be used for reference because the values occupy successive memory locations. We simply remember one name (the starting address) and can then refer to any value using its index. Such a facility is known as an ARRAY.

1.3. ARRAYS
An array can be defined as a collection of sequential memory locations that are referred to by a single name along with a number, known as the index, to access a particular element. When we declare an array, we have to specify its type as well as its size.
e.g. To store 10 integer values, we can use the following declaration:
    int A[10];
This declares A to be an array that holds 10 integer values. When the array A is allocated, 10 locations are reserved, one for each integer. On a machine where an int occupies 2 bytes, 20 bytes are allocated in all, and the name A refers to the starting address of this block.
When we write A[0], we are referring to the first integer value in A.

    |<---------------- Array ---------------->|
    | A[0] | A[1] |         ...        | A[9] |
    Fig. 1. Array representation

Hence, to refer to the i-th value in the array we write A[i-1]. When we declare an array of SIZE elements, where SIZE is chosen by the programmer, the index ranges from 0 to SIZE-1. Remember that the size of such an array is always a constant and not a variable, because a fixed amount of memory is allocated to the array before the program executes. This method of allocating memory is known as static allocation.

1.3.1 HANDLING ARRAYS
Normally the following practice is followed so that, by changing only one #define statement, we can run the program for arrays of different sizes.
    #define SIZE 10
    int a[SIZE], b[SIZE];
Now if we want the program to run for arrays of 200 elements, we need to change just the #define statement.

1.3.2 INITIALIZING THE ARRAYS
One method of initializing the array members is by using a for loop. The following for loop initializes 10 elements with the value of their index.
    #define SIZE 10

    int main(void)
    {
        int arr[SIZE], i;

        for (i = 0; i < SIZE; i++)
        {
            arr[i] = i;
        }
        return 0;
    }
An array can also be initialized directly as follows.
    int arr[3] = {0, 1, 2};
An explicitly initialized array need not specify a size, but if a size is specified, the number of elements provided must not exceed it. If the size is given and some elements are not explicitly initialized, they are set to zero.
e.g.
    int arr[] = {0, 1, 2};
    int arr1[5] = {0, 1, 2};          /* initialized as {0,1,2,0,0} */
    const char a_arr3[6] = "Daniel";  /* "Daniel" needs 7 elements: 6 letters and a \0.
                                         C++ rejects this; C accepts it but stores
                                         no \0, so a_arr3 is not a valid string */
To copy one array to another, each element has to be copied, typically in a for loop. Any expression that evaluates to an integral value can be used as an index into an array.
e.g.
    arr[get_value()] = somevalue;

1.4 MULTIDIMENSIONAL ARRAYS
An array whose elements are referred to by two indices is called a two-dimensional array or a "matrix"; if its elements are referred to by more than two indices, it is a multidimensional array.
e.g.
    int arr[4][3];
This is a two-dimensional array with 4 rows and 3 columns.

1.4.1 INITIALIZATION OF TWO DIMENSIONAL ARRAY
Just like one-dimensional arrays, the members of matrices can be initialized in two ways: using for loops, or directly. Initialization using nested loops is shown below.
e.g.
    int arr[10][10];
    for (int i = 0; i < 10; i++)
    {
        for (int j = 0; j < 10; j++)
        {
            arr[i][j] = i + j;
        }
    }
Now let us see how the members of matrices are initialized directly.
e.g.
    int arr[4][3] = {{0,1,2},{3,4,5},{6,7,8},{9,10,11}};
The nested braces are optional.

1.5 POINTERS
Computer memory is a collection of storage cells. These locations are numbered sequentially, and the numbers are called addresses. A pointer is the address of a memory location. Any variable that contains an address is a pointer variable. Pointer variables store the address of an object, allowing for the indirect manipulation of that object. They are used in the creation and management of objects that are dynamically created during program execution.

1.5.1 ADVANTAGES AND DISADVANTAGES OF POINTERS
Pointers are very effective when
- The data in one function has to be modified by another function, which is done by passing the address.
- Memory has to be allocated while running the program and released when it is no longer required.
- Data has to be accessed faster, which direct addressing makes possible.
The main disadvantage of pointers is that, if not understood and used properly, they can introduce bugs into the program.

1.5.2 DECLARING AND INITIALIZING POINTERS
Pointers are declared using the (*) operator. The general format is:
    data_type *ptrname;
where data_type can be any of the basic data types such as int, float etc., or any user-defined data type.
ptrname then becomes a pointer to that data type.
e.g.
    int *iptr;
    char *cptr;
    float *fptr;
The pointer iptr stores the address of an integer. In other words, it points to an integer, cptr to a character and fptr to a float value. Once a pointer variable is declared, it can be made to point to a variable with the help of the address (reference) operator (&).
e.g.
    int num = 1024;
    int *iptr;
    iptr = &num;    // iptr points to the variable num
A pointer can hold the value 0 (NULL), indicating that it points to no object at present. A pointer should never be given a non-address value.
e.g.
    iptr1 = ival;   // invalid, ival is not an address
A pointer of one type cannot be assigned the address of an object of another type.
e.g.
    double dval, *dptr = &dval;   // allowed
    iptr = &dval;                 // not allowed

1.5.3 POINTER ARITHMETIC
The value stored in a pointer variable can be altered using arithmetic operators. You can increment or decrement a pointer, subtract one pointer from another, and add an integer to or subtract an integer from a pointer. Two pointers cannot be added, because the sum is not a meaningful address. No other arithmetic operations are allowed on pointers.
Consider a program to demonstrate pointer arithmetic.
e.g.
    #include <stdio.h>

    int main(void)
    {
        int a[] = {10, 20, 30, 40, 50};   /* ptr --> F000 : 10 */
        int *ptr;                         /*         F002 : 20 */
        int i;                            /*         F004 : 30 */
        ptr = a;                          /*         F006 : 40 */
        for (i = 0; i < 5; i++)           /*         F008 : 50 */
        {
            printf("%d ", *ptr++);
        }
        return 0;
    }
Output: 10 20 30 40 50
The addresses of the memory locations of the array a are shown running from F000 to F008. The initial address F000 is assigned to ptr. The next values are then obtained by incrementing the pointer. Each increment advances the pointer by 2 bytes, because the size of an integer is 2 bytes here. The sizes of the various data types on a 16-bit machine are shown below. They may vary from system to system.
    Data type     Size
    char          1 byte
    int           2 bytes
    float         4 bytes
    long int      4 bytes
    double        8 bytes
    short int     2 bytes

1.6 ARRAY OF POINTERS
Consider the declaration shown below:
    char *A[3] = {"a", "b", "Text Book"};
The example declares A as an array of character pointers. Each location in the array points to a string of characters of varying length. Here A[0] points to the first character of the first string and A[1] points to the first character of the second string, both of which contain only one character. However, A[2] points to the first character of the third string, which contains 9 characters.

1.7 PASSING PARAMETERS TO THE FUNCTIONS
The different ways of passing parameters to a function are:
- Pass by value (call by value)
- Pass by address/pointer (call by reference)
In pass by value, the actual argument is copied into the formal argument declared in the function definition. Therefore any changes made to the formal arguments are not reflected back in the calling program. In pass by address, we use pointer variables as arguments, and the changes made in the called function are reflected back in the calling function. The following program uses a classic programming problem, swapping the values of two variables.
    #include <stdio.h>

    void val_swap(int x, int y)       // Call by Value
    {
        int t;
        t = x; x = y; y = t;
    }

    void add_swap(int *x, int *y)     // Call by Address
    {
        int t;
        t = *x; *x = *y; *y = t;
    }

    int main(void)
    {
        int n1 = 25, n2 = 50;
        printf("\n Before call by value : ");
        printf("\n n1 = %d n2 = %d", n1, n2);
        val_swap(n1, n2);
        printf("\n After call by value : ");
        printf("\n n1 = %d n2 = %d", n1, n2);
        printf("\n Before call by address : ");
        printf("\n n1 = %d n2 = %d", n1, n2);
        add_swap(&n1, &n2);
        printf("\n After call by address : ");
        printf("\n n1 = %d n2 = %d", n1, n2);
        return 0;
    }
Output:
    Before call by value :
    n1 = 25 n2 = 50
    After call by value :
    n1 = 25 n2 = 50        // inside val_swap: x = 50, y = 25
    Before call by address :
    n1 = 25 n2 = 50
    After call by address :
    n1 = 50 n2 = 25        // inside add_swap: *x = 50, *y = 25

1.8 RELATION BETWEEN POINTERS AND ARRAYS
Pointers and arrays are closely related. Any program written with array subscripts can also be written with pointers. Consider the following:
    int arr[] = {0,1,2,3,4,5,6,7,8,9};
To access the values we can write
    arr[0] or *arr;
    arr[1] or *(arr+1);
Since '*' is used both to declare pointers and to dereference pointer variables, you have to know the difference between the following two expressions thoroughly. *(arr+1) means that the address arr is advanced by one element and then the contents are fetched. *arr+1 means that the contents are fetched from address arr and one is added to them.
Now we have understood the relation between an array and a pointer. The traversal of an array can be made either through subscripting or by direct pointer manipulation.
e.g.
    void print(int *arr_beg, int *arr_end)
    {
        while (arr_beg != arr_end)
        {
            printf("%i ", *arr_beg);
            ++arr_beg;
        }
    }

    int main(void)
    {
        int arr[] = {0,1,2,3,4,5,6,7,8,9};
        print(arr, arr + 10);
        return 0;
    }
arr_end points to the element one past the end of the array, so that we can iterate through all the elements of the array. This function, however, works only with pointers into arrays of integers.
1.9 SCOPE RULES AND STORAGE CLASSES
Since changes to formal variables are not reflected back in the calling program, it becomes important to understand the scope and lifetime of variables. The storage class of a variable determines its lifetime and its scope. There are four storage classes:
- automatic
- static
- external
- register

1.9.1 AUTOMATIC VARIABLES
Automatic variables are defined within functions. They lose their value when the function terminates, and they can be accessed only in that function. All variables declared within a function are, by default, automatic. However, we can declare them explicitly by using the keyword auto.
e.g.
    void print()
    {
        auto int i = 0;
        printf("\n Value of i before incrementing is %d", i);
        i = i + 10;
        printf("\n Value of i after incrementing is %d", i);
    }

    int main(void)
    {
        print();
        print();
        print();
        return 0;
    }
Output:
    Value of i before incrementing is 0
    Value of i after incrementing is 10
    Value of i before incrementing is 0
    Value of i after incrementing is 10
    Value of i before incrementing is 0
    Value of i after incrementing is 10

1.9.2. STATIC VARIABLES
Static variables have the same scope as automatic variables but, unlike automatic variables, they retain their values across function calls. A static variable is initialized only once and remains in existence until the program terminates. Static variables are declared with the keyword static.
e.g.
    void print()
    {
        static int i = 0;
        printf("\n Value of i before incrementing is %d", i);
        i = i + 10;
        printf("\n Value of i after incrementing is %d", i);
    }

    int main(void)
    {
        print();
        print();
        print();
        return 0;
    }
Output:
    Value of i before incrementing is 0
    Value of i after incrementing is 10
    Value of i before incrementing is 10
    Value of i after incrementing is 20
    Value of i before incrementing is 20
    Value of i after incrementing is 30
It can be seen from the above example that the value of the variable is retained when the function is called again. The variable is allocated memory and initialized only once.

1.9.3. EXTERNAL VARIABLES
Different functions of the same program can be written in different source files and compiled together. The scope of a global variable is not limited to any one function, but extends to all the functions that are defined after it is declared. However, the scope of a global variable is limited to the functions in the same file. If we want to use a variable defined in another file, we can declare it with extern.
e.g.
    // FILE 1 - g is global and can be used in main() and fn1()
    int g = 0;
    int main(void)
    {
        :
        :
    }
    void fn1()
    {
        :
        :
    }

    // FILE 2 - g is defined in file 1, so it is only declared here.
    // An extern declaration must not carry an initializer.
    extern int g;
    void fn2()
    {
        :
        :
    }
    void fn3()
    {
        :
    }

1.9.4. REGISTER VARIABLE
Computers have internal registers, which are used to store data temporarily before any operation is performed. Intermediate results of calculations are also stored in registers. Operations can be performed on data stored in registers more quickly than on data stored in memory, because the registers are a part of the processor itself. If a particular variable is used often, for instance the control variable of a loop, it can be assigned a register rather than a memory location. This is done using the keyword register.
However, a register is assigned by the compiler only if one is free; otherwise the variable is treated as automatic. Also, global variables cannot be register variables.
e.g.
    void loopfn()
    {
        register int i;
        for (i = 0; i < 100; i++)
        {
            printf("%d", i);
        }
    }

1.10 DYNAMIC ALLOCATION AND DE-ALLOCATION OF MEMORY
Memory for declared variables and arrays is allocated at compilation time. The size of these variables cannot be varied at run time. They are called static data structures. Their disadvantage is that they require a fixed amount of storage. Once the storage is fixed, if the program uses only a small part of it, the remaining locations are wasted. If we try to use more memory than was declared, overflow occurs. If the storage requirement is unpredictable, static sequential allocation is not recommended.
The process of allocating memory at run time is called dynamic allocation. Here, the required amount of memory is obtained from the free memory available to the user, called the heap. This free memory is maintained as a list called the availability list. Getting a block of memory and returning it to the availability list can be done by using functions like:
- malloc()
- calloc()
- free()

1.10.1 FUNCTION MALLOC(SIZE)
This function is declared in the standard header file <stdlib.h>; some older DOS compilers also declare it in <alloc.h>. It allocates a block of size bytes from the heap or availability list. On success it returns a pointer of type void * to the allocated memory. The result is usually cast to the pointer type we require, such as int * or float *; the cast is optional in C and required in C++. If the required space does not exist, it returns NULL.
Syntax:
    ptr = (data_type*) malloc(size);
where
- ptr is a pointer variable of type data_type.
- data_type can be any basic, user-defined or derived data type.
- size is the number of bytes required.
e.g.
    ptr = (int*)malloc(sizeof(int) * n);
allocates memory for n integers, so the amount depends on the value of the variable n.
    #include <stdio.h>
    #include <string.h>
    #include <stdlib.h>

    int main(void)
    {
        char *str;
        if ((str = (char*)malloc(10)) == NULL)   /* allocate memory for string */
        {
            printf("\n OUT OF MEMORY");
            exit(1);                             /* terminate the program */
        }
        strcpy(str, "Hello");                    /* copy Hello into str */
        printf("\n str = %s ", str);             /* display str */
        free(str);                               /* free memory */
        return 0;
    }
In the above program, if memory is allocated to str, the string Hello is copied into it. Then str is displayed. When it is no longer needed, the memory occupied by it is released back to the heap.

1.10.2 FUNCTION CALLOC(N,SIZE)
This function is declared in the standard header file <stdlib.h>; some older DOS compilers also declare it in <alloc.h>. It allocates memory for n blocks of size bytes each from the heap or availability list and sets every byte of it to zero. If the required space does not exist for the new block, it returns NULL; if n or size is zero, it may also return NULL.
Syntax:
    ptr = (data_type*) calloc(n, size);
where
- ptr is a pointer variable of type data_type.
- data_type can be any basic, user-defined or derived data type.
- size is the number of bytes in each block.
- n is the number of blocks of size bytes to be allocated.
A pointer to the first byte of the allocated region is returned.
e.g.
    #include <stdio.h>
    #include <string.h>
    #include <stdlib.h>

    int main(void)
    {
        char *str = NULL;
        str = (char*)calloc(10, sizeof(char));   /* allocate memory for string */
        if (str == NULL)
        {
            printf("\n OUT OF MEMORY");
            exit(1);                             /* terminate the program */
        }
        strcpy(str, "Hello");                    /* copy Hello into str */
        printf("\n str = %s ", str);             /* display str */
        free(str);                               /* free memory */
        return 0;
    }

1.10.3 FUNCTION FREE(BLOCK)
This function frees a block of memory that was allocated using malloc() or calloc(). The programmer can use it to de-allocate memory that is no longer required. It does not return any value.

1.11 DANGLING POINTER PROBLEM
We can allocate memory to the same pointer variable more than once. The compiler will not raise any error. But it could lead to bugs in the program.
We can understand this problem with the following example.
    #include <stdio.h>
    #include <stdlib.h>

    int main(void)
    {
        int *a;
        a = (int*)malloc(sizeof(int));   /* a ----> first block   */
        *a = 10;                         /* first block holds 10  */
        a = (int*)malloc(sizeof(int));   /* a ----> second block  */
        *a = 20;                         /* second block holds 20 */
        return 0;
    }
In this program segment, memory is allocated for the pointer a twice. The pointer then contains the address of the most recently allocated block, which makes the earlier block inaccessible. So the memory location where the value 10 is stored can no longer be reached by the program, and it cannot be freed for reuse.
To see another problem, consider the next program segment:
    int main(void)
    {
        int *a;
        a = (int*)malloc(sizeof(int));   /* a ----> block        */
        *a = 10;                         /* block holds 10       */
        free(a);                         /* a ----> ?            */
        return 0;
    }
Here, when we de-allocate the memory using free(a), the memory location pointed to by a is returned to the memory pool. The pointer a still holds the old address, which is no longer valid, so we call it a dangling pointer. If we want to reuse this pointer, we can allocate memory for it again.

1.12 STRUCTURES
A structure is a derived data type. It is a combination of logically related data items. Unlike arrays, which are collections of items of the same data type, structures can contain members of different data types. The data items in a structure generally belong to the same entity, such as the information about an employee or a player.
The general format of a structure declaration is:
    struct tag
    {
        type member1;
        type member2;
        type member3;
        :
        :
    } variables;
We can omit the variable declaration in the structure declaration and define it separately as follows:
    struct tag variable;
e.g.
    struct Account
    {
        int accnum;
        char acctype;
        char name[25];
        float balance;
    };
We can declare structure variables as:
    struct Account oldcust, newcust;
We can refer to the member variables of a structure by using the dot operator (.).
e.g.
    newcust.balance = 100.0;
    printf("%s", oldcust.name);
We can initialize the members as follows:
e.g.
    struct Account customer = {100, 'w', "David", 6500.00};
One structure variable can be copied into another variable of the same type by simple assignment, which copies every member. Only very old, pre-ANSI compilers required member-wise assignment.
We can also have nested structures, as shown in the following example:
    struct Date
    {
        int dd, mm, yy;
    };
    struct Account
    {
        int accnum;
        char acctype;
        char name[25];
        float balance;
        struct Date d1;
    };
To access the members of the date, we use the following method.
    struct Account c1;
    c1.d1.dd = 21;
We can pass structures to functions and return them from functions. The whole structure is copied into the formal variable. We can also have an array of structures. An array of Account structures is declared as
    struct Account a[10];
Everything is the same as for a single element, except that a subscript is required to indicate which structure we are referring to. We can also declare pointers to structures; to access member variables through a pointer we use the pointer operator -> instead of the dot operator.
    struct Account *aptr = &c1;
    printf("%s", aptr->name);
A structure can contain a pointer to its own type as one of its members. Such structures are called self-referential structures.
e.g.
    struct info
    {
        int i, j, k;
        struct info *next;
    };
In short, the uses of structures are:
- Related data items of dissimilar data types can be logically grouped under a common name.
- They can be used to pass parameters so as to minimize the number of function arguments.
- They are useful when more than one value has to be returned from a function.
- They make the program more readable.

1.13 ENUMERATED CONSTANTS
Enumerated constants enable the creation of new types and the definition of variables of these types whose values are restricted to a set of possible values. The syntax is:
    enum identifier {c1, c2, ...} [var_list];
where
- enum is the keyword.
- identifier is the user-defined enumerated data type, which can be used to declare variables in the program.
- {c1, c2, ...} are the names of the constants, called enumeration constants.
- var_list is an optional list of variables.
e.g.
    enum Colour {RED, BLUE, GREEN, WHITE, BLACK};
Colour is the name of an enumerated data type. It makes RED a symbolic constant with the value 0, BLUE a symbolic constant with the value 1, and so on. Every enumerated constant has an integer value. If the program doesn't specify otherwise, the first constant has the value 0 and each remaining constant is 1 greater than its predecessor. Any of the enumerated constants can be initialised to a particular value; those that are not initialised count upwards from the value of the previous constant.
e.g.
    enum Colour {RED = 100, BLUE, GREEN = 500, WHITE, BLACK = 1000};
The values assigned will be RED = 100, BLUE = 101, GREEN = 500, WHITE = 501, BLACK = 1000.
You can define variables of type Colour, which are meant to hold only one of the enumerated values, in our case RED, BLUE, GREEN, WHITE or BLACK.
e.g.
    enum Days {SUN, MON, TUE, WED, THU, FRI, SAT};
    enum Days day;
    day = SUN;
    day = 3;       // error in C++: int and Days are different types (C accepts this)
    day = hello;   // error: hello is not a member of Days
In C, enumeration constants have type int, and a variable of an enumerated type can be used as a loop counter. C++ is stricter: it does not allow the increment in the following loop.
e.g.
    enum Days {SUN, MON, TUE, WED, THU, FRI, SAT};
    for (enum Days i = SUN; i < SAT; i++)   // not allowed in C++
In C++ there is no built-in support for moving backward or forward from one enumerator to another. However, whenever necessary, an enumeration is automatically promoted to an arithmetic type.
e.g.
    if (MON > 0)
    {
        printf(" Monday is greater");
    }
    int num = 2 * MON;

1.14 UNIONS
A union is like a structure, except that only one member of the union is stored in the allocated memory at a time. It is a collection of mutually exclusive variables, which means that all of its members share the same physical storage and only one member holds a meaningful value at a time. The size of a union is at least the size of its largest member.
A union is defined as follows:
    union tag
    {
        type memvar1;
        type memvar2;
        type memvar3;
        :
        :
    };
A union variable of this data type can be declared as follows:
    union tag variable_name;
e.g.
    union utag
    {
        int num;
        char ch;
    };
    union utag ff;
On a machine with 2-byte integers, the above union will have two bytes of storage allocated to it. The variable num can be accessed as ff.num and ch as ff.ch. At any time, only one of these two members can be referred to. Any change made to one member affects the other. Thus unions use memory efficiently by using the same memory to store all the members, which may be of different types and which are needed at mutually exclusive times.

In this chapter we have studied some advanced features of C. We have seen how the flexibility of the language allows us to define a user-defined type that combines variables of different types belonging to some entity. We also studied arrays and pointers in detail. These are very useful in the study of various data structures.

UNIT 2 COMPLEXITY OF ALGORITHMS

2.1 PROGRAM ANALYSIS
Program analysis is the study of what happens when a program is executed: the sequence of actions executed and the changes in the program state that occur during a run. There are many ways of analysing a program, for instance:
(i) verifying that it satisfies the requirements.
(ii) proving that it runs correctly without any logic errors.
(iii) determining if it is readable.
(iv) checking that modifications can be made easily, without introducing new errors.
(v) analyzing the program execution time and the storage complexity associated with it, i.e. how fast the program runs and how much storage it requires.
Another related question is: how big must its data structure be, and how many steps will be required to execute its algorithm? Since this course concerns data representation and writing programs, we shall analyse programs in terms of storage and time complexity.

2.2 PERFORMANCE ISSUES

In considering the performance of a program, we are primarily interested in
(i) how fast does it run?
(ii) how much storage does it use?

Generally we need to analyse efficiency when we compare alternative algorithms and data representations for the same problem, or when we deal with very large programs. We often find that we can trade time efficiency for space efficiency, or vice versa. To estimate either of these, i.e. time or space efficiency, we need some estimate of the problem size.

2.3 GROWTH OF FUNCTIONS

Informally, an algorithm can be defined as a finite sequence of steps designed to solve a computationally solvable problem. By this definition, it is immaterial how much time an algorithm takes to solve a problem of a given size. In practice, however, it makes no sense to choose an algorithm that takes a very long time to find the solution of a particular problem. Fortunately, it is usually possible to estimate the computation time of an algorithm. The computation time can be formulated as a function f(n), where n is the size of the input instance.

Let me try to explain this with an example. Below is a C function isprime(int) that returns 1 (true) if the integer n (n >= 2) is prime and 0 (false) otherwise.

    int isprime(int n)
    {
        int divisor = 2;
        while (divisor <= n / 2) {
            if (n % divisor == 0)
                return 0;
            divisor++;
        }
        return 1;
    }

Figure 1

The while loop in Figure 1 makes about n/2 comparisons in the worst case, which occurs when n is prime: no divisor is found, so the loop runs until divisor exceeds n/2. If n is not prime, the loop stops at the smallest divisor of n and usually makes far fewer comparisons. In the worst case, therefore, the number of comparisons is directly proportional to the value of n (which in turn grows exponentially with the number of bits required to represent n).
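To see how the number of comparisons depends on the input, the sketch below adds a counter to the function of Figure 1. The name isprime_counted and the comparisons parameter are assumptions introduced for this illustration; the loop itself is unchanged, and like Figure 1 it is meant for n >= 2.

```c
/* Same test as Figure 1, with a counter for the divisibility tests.
   The counter parameter is an addition made for this illustration. */
int isprime_counted(int n, long *comparisons)
{
    int divisor = 2;

    *comparisons = 0;
    while (divisor <= n / 2) {
        (*comparisons)++;            /* one divisibility test per pass */
        if (n % divisor == 0)
            return 0;                /* composite: stop at the first divisor */
        divisor++;
    }
    return 1;                        /* prime: the loop ran to completion */
}
```

For the prime input 97 the loop runs to completion and performs 47 tests (divisors 2 through 48), whereas for 100 it stops after the first test and for 91 after the sixth (divisor 7).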
That is, f(n) is about n/2 in the worst case, which arises when n is prime, and it is smaller when n is not prime. If the loop were instead written as while (divisor < n) {...}, a prime input would cost about n comparisons; the constant factor changes, but the worst case still grows in proportion to n. Similar estimates can be made for algorithms devised to solve entirely different problems. Many times we are interested in questions like:
1) What is the longest time that a particular algorithm can take to complete for any input?
2) What is the smallest time a given algorithm can take to solve any input instance?
Since there may be many possible input instances, we only estimate the running time of the algorithm. This estimation is called computing the order of the run time of an algorithm.

2.4 ASYMPTOTIC NOTATIONS

2.4.1. BIG-O NOTATION (O)

Let f and g be functions from the set of integers or the set of real numbers to the set of real numbers. We say that f(x) is O(g(x)) if there are positive constants C and k such that $f(x) \le C g(x)$ whenever x > k. For example, when we say the running time T(n) of some program is $O(n^2)$, read “big oh of n squared” or just “oh of n squared,” we mean that there are positive constants c and $n_0$ such that for n equal to or greater than $n_0$, we have $T(n) \le c n^2$. In support of the above discussion, I present a few examples to show the effect of this definition.

Example 1: Analyse the running time of factorial(x). The input size is the value of x. Given n as input, factorial(n) makes n+1 calls to factorial in total, one for each of the arguments n, n-1, ..., 0. Each call performs a few constant-time operations such as a multiplication, an if test, and a return. Let the time taken for each call be the constant k. Therefore, the total time taken for factorial(n) is k*(n+1). This is O(n) time. In other words, factorial(x) has O(x), i.e. linear, running time.
    int factorial(int n)
    {
        if (n == 0)
            return 1;
        else
            return n * factorial(n - 1);
    }

Figure 2

Example 2: Let $f(x) = x^2 + 2x + 1$ be a function of x. Now $x^2 > x$ and $x^2 > 1$ for every $x > 1$. Therefore, for every $x > 1$ we have $x^2 + 2x + 1 \le x^2 + 2x^2 + x^2$, that is, $x^2 + 2x + 1 \le 4x^2$. Comparing this with $f(x) \le C g(x)$ we get C = 4 and k = 1, so the function $f(x) = x^2 + 2x + 1$ is $O(x^2)$.

Figure 3

Since Big-O notation gives an upper limit on the growth of a function, it is also called the upper bound of a function or an algorithm.

2.4.2. BIG-OMEGA NOTATION (Ω)

Let f and g be functions from the set of integers or the set of real numbers to the set of real numbers. We say that f(x) is Ω(g(x)) if there are positive constants C and k such that $f(x) \ge C g(x)$ whenever x > k. In terms of limits, if $\lim_{x \to \infty} g(x)/f(x)$ is a constant (possibly 0), then f(x) = Ω(g(x)).

Example 3: Let $f(x) = 8x^3 + 5x^2 + 7$. Then $f(x) \ge 8x^3$ for $x \ge 0$. Therefore, $f(x) = \Omega(x^3)$.

2.4.3. BIG-THETA NOTATION (Θ)

For functions f and g as in the two definitions above, we say that f(x) is Θ(g(x)) if there are positive constants $C_1$, $C_2$ and k such that $0 \le C_1 g(x) \le f(x) \le C_2 g(x)$ whenever x > k. Since Θ(g(x)) bounds a function from both the upper and the lower side, it is also called a tight bound for the function f(x). Mathematically, if $\lim_{x \to \infty} f(x)/g(x)$ is a nonzero constant, then f(x) = Θ(g(x)). In other words, f(x) = Θ(g(x)) when f(x) and g(x) have the same leading terms except for possibly different constant factors.

Example 4: $f(x) = 3x^2 + 8x \log x$ is $\Theta(x^2)$.

2.5. TIME-SPACE TRADE OFFS

Over more than 50 years of research on algorithms for problem areas such as decision-making and optimization, scientists have concentrated on the computational hardness of different solution methods. The study of time-space trade-offs, i.e., formulae that relate the most fundamental complexity measures, time and space, was initiated by Cobham, who studied problems like recognizing the set of palindromes on Turing machines.
There are two main lines of motivation for such studies: one is the lower bound perspective, where restricting space allows one to prove general lower bounds for decision problems; the other is the upper bound perspective, where one attempts to find time-efficient algorithms that are also space efficient (or vice versa). Upper bounds are also interesting for determining, in conjunction with lower bounds, the computational complexity of fundamental problems such as sorting.

So algorithms are mainly constrained by two resources, time and space. However, there are certain problems for which it is difficult or even impossible to find a solution that is both time and space efficient. In such cases, the algorithms are written for a specific resource configuration, also taking into consideration the nature of the input instance.

Fundamentally, we measure both the space and the time complexity as a function of the size of the input instance, although each is also likely to depend on the nature of the input. So let us define time and space complexity as below:

Time Complexity T(n): An algorithm A for a problem P is said to have time complexity T(n) if the number of steps required to complete its run for an input of size n is always less than or equal to T(n).

Space Complexity S(n): An algorithm A for a problem P is said to have space complexity S(n) if the number of bits required to complete its run for an input of size n is always less than or equal to S(n).

If we consider both time and space requirements, the usual approach is to use the quantity (time * space) for a given algorithm. T(n)*S(n) is therefore quite handy for estimating the overall efficiency of an algorithm in general*. The amount of space can be estimated directly for most algorithms, and the run time can be measured mathematically or by testing, which together tell us the efficiency of the algorithm. From here we can establish the range of this T*S term that we can afford.
All algorithms that lie in this range are then acceptable to us.

Many times algorithms need to perform certain implicit tasks that are not actually part of the run time of the algorithm. These implicit tasks are, most of the time, one-time investments; for example, sorting a list held in randomly accessible storage so that binary search can be performed on it. So, in order to evaluate the importance of each operation of an algorithm on a data structure, and the complexity of the overall algorithm, it is desirable to find the time required to perform a sequence of operations, averaged over all the operations performed. This computation is called amortized analysis. It is useful when one has to establish that an operation under investigation, though costly from a local perspective, reduces the overall cost of the algorithm and is therefore worthwhile.

A simple algorithm may consist of some initialization instructions and a loop. The number of passes made through the body of the loop is a fairly good indication of the work done by such an algorithm. Of course, the amount of work done in one pass through a loop may be much more than the amount done in another pass, and one algorithm may have longer loop bodies than another algorithm, but we are narrowing in on a good measure of work. Though some loops may have, say, five steps and some nine, for large inputs the number of passes through the loops will generally be large compared to the loop sizes. Thus counting the passes through all the loops in the algorithm is a good idea.

In many cases, to analyze an algorithm we can isolate a particular operation fundamental to the problem under study (or to the types of algorithms being considered), ignore initialization, loop control, and other bookkeeping, and just count the chosen, or basic, operations performed by the algorithm.
For many algorithms, exactly one of these operations is performed on each pass through the main loops of the algorithm, so this measure is similar to the one described in the previous paragraph.

The function T(n) can be compared with another function f(n) to find the order of T(n), and hence the bounds of the algorithm can be evaluated.

* Generalization of efficiency is required because efficiency also depends on the nature of the input. For example, an algorithm may be efficient for one sequence of elements to be sorted and not for another, the size being the same.

Here are some examples of reasonable choices of basic operations for several problems:

    Problem                                              Operation
    Find x in an array of names                          Comparison of x with an entry in the array
    Multiply two matrices with real entries              Multiplication of two real numbers (or multiplication
                                                         and addition of real numbers)
    Sort an array of numbers                             Comparison of two array entries
    Traverse a binary tree                               Traversing an edge
    Any non-iterative procedure, including recursive     Procedure invocation

So long as the basic operations are chosen well and the total number of operations performed is roughly proportional to the number of basic operations, we have a good measure of the work done by an algorithm and a good criterion for comparing several algorithms. This is the measure we use in this chapter and in several other chapters in this book. You may not yet be entirely convinced that this is a good choice; we will add more justification for it in the next section. For now, we simply make a few points.

First, in some situations, we may be intrinsically interested in the basic operation: it might be a very expensive operation compared to the others, or it might be of some theoretical interest.

Second, we are often interested in the rate of growth of the time required for the algorithm, as the inputs get larger.
So long as the total number of operations is roughly proportional to the basic operations, just counting the latter can give us a pretty clear idea of how feasible it is to use the algorithm on large inputs. Finally, this choice of the measure of work allows a great deal of flexibility. Though we will often try to choose one, or at most two, specific operations to count, we could include some overhead operations, and, in the extreme, we could choose as the basic operations the set of machine instructions for a particular computer. At the other extreme, we could consider “one pass through a loop” as the basic operation. Thus by varying the choice of basic operations, we can vary the degree of precision and abstraction in our analysis to fit our needs. What if we choose a basic operation for a problem and then find that the total number of operations performed by an algorithm is not proportional to the number of basic operations? What if it is substantially higher? In the extreme case, we might choose a basic operation for a certain problem and then discover that some algorithms for the problem use such different methods that they do not do any of the operations we are counting. In such a situation, we have two choices. We could abandon our focus on the particular operation and revert to counting passes through loops. Or, if we are especially interested in the particular operation chosen, we could restrict our study to a particular class of algorithms, one for which the chosen operation is appropriate. Algorithms that use other techniques for which a different choice of basic operation is appropriate could be studied separately. A class of algorithms for a problem is usually defined by specifying the operations that may be performed on the data. (The degree of formality of the specifications will vary; usually informal descriptions will suffice in this book). 
Throughout this section, we have often used the phrase “the amount of work done by an algorithm.” It could be replaced by the term “the complexity of an algorithm.” Complexity means the amount of work done, measured by some specified complexity measure, which in many of our examples is the number of specified basic operations performed. Note that, in this sense, complexity has nothing to do with how complicated or tricky an algorithm is; a very complicated algorithm may have low complexity. We will use the terms “complexity,” “amount of work done,” and “number of basic operations done” almost interchangeably in this book.

Average and Worst-Case Analysis

Now that we have a general approach to analyzing the amount of work done by an algorithm, we need a way to present the results of the analysis concisely. A single number cannot describe the amount of work done because the number of steps performed is not the same for all inputs. We observe first that the amount of work done usually depends on the size of the input. For example, alphabetizing an array of 1000 names usually requires more operations than alphabetizing an array of 100 names, using the same algorithm. Solving a system of 12 linear equations in 12 unknowns generally takes more work than solving a system of 2 linear equations in 2 unknowns. We observe, secondly, that even if we consider inputs of only one size, the number of operations performed by an algorithm may depend on the particular input. An algorithm for alphabetizing an array of names may do very little work if only a few of the names are out of order, but it may have to do much more work on an array that is very scrambled. Solving a system of 12 linear equations may not require much work if most of the coefficients are zero.

The first observation indicates that we need a measure of the size of the input for a problem. It is usually easy to choose a reasonable measure of size.
Here are some examples:

    Problem                                  Size of input
    Find x in an array of names              The number of names in the array
    Multiply two matrices                    The dimensions of the matrices
    Sort an array of numbers                 The number of entries in the array
    Traverse a binary tree                   The number of nodes in the tree
    Solve a system of linear equations       The number of equations, or the number of unknowns, or both
    Solve a problem concerning a graph       The number of nodes in the graph, or the number of edges, or both

Even if the input size is fixed at, say, n, the number of operations performed may depend on the particular input. How, then, are the results of the analysis of an algorithm to be expressed? Most often we describe the behavior of an algorithm by stating its worst-case complexity.

Worst-case complexity

Let $D_n$ be the set of inputs of size n for the problem under consideration, and let I be an element of $D_n$. Let t(I) be the number of basic operations performed by the algorithm on input I. We define the function W by

$$W(n) = \max\{\, t(I) \mid I \in D_n \,\}$$

The function W(n) is called the worst-case complexity of the algorithm. W(n) is the maximum number of basic operations performed by the algorithm on any input of size n.

It is often not very difficult to compute W(n). The worst-case complexity is valuable because it gives an upper bound on the work done by the algorithm. The worst-case analysis could be used to help form an algorithm. We will do worst-case analysis for most of the algorithms presented in this book. Unless otherwise stated, whenever we refer to the amount of work done by an algorithm, we mean the amount of work done in the worst case.

It may seem that a more useful and natural way to describe the behavior of an algorithm is to tell how much work it does on the average; that is, to compute the number of operations performed for each input of size n and then take the average. In practice some inputs might occur much more frequently than others, so a weighted average is more meaningful.
Average complexity

Let Pr(I) be the probability that input I occurs. Then the average behavior of the algorithm is defined as

$$A(n) = \sum_{I \in D_n} \Pr(I)\, t(I)$$

We determine t(I) by analyzing the algorithm, but Pr(I) cannot be computed analytically. The function Pr(I) is determined from experience and/or special information about the application for which the algorithm is to be used, or by making some simplifying assumption (e.g., that all inputs of size n are equally likely to occur). If Pr(I) is complicated, the computation of average behavior is difficult. Also, of course, if Pr(I) depends on a particular application of the algorithm, the function A describes the average behavior of the algorithm for only that application. The following examples illustrate worst-case and average analysis.

Example

Problem: Let E be an array containing n entries (called keys), E[0], ..., E[n-1], in no particular order. Find an index of a specified key K, if K is in the array; return -1 as the answer if K is not in the array.

Strategy: Compare K to each entry in turn until a match is found or the array is exhausted. If K is not in the array, the algorithm returns -1 as its answer.

There is a large class of procedures similar to this one, and we call these procedures generalized searching routines. Often they occur as subroutines of more complex procedures.

Generalized searching routine

A generalized searching routine is a procedure that processes an indefinite amount of data until it either exhausts the data or achieves its goal. It follows this high-level outline:

    If there is no more data to examine:
        Fail.
    Else
        Examine one datum.
        If this datum is what we want:
            Succeed.
        Else
            Keep searching in remaining data.

The scheme is called generalized searching because the routine often performs some other simple operations as it searches, such as moving data elements, adding to or deleting from a data structure, and so on.
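The high-level outline above can be sketched directly as a recursive C function over an integer array. The function name general_search and its parameters are assumptions made for this illustration; a real generalized searching routine would typically do extra work at the "examine" step.

```c
/* Recursive rendering of the outline. Returns the index of key in
   data[from..n-1], or -1 on failure. All names are illustrative. */
int general_search(const int data[], int n, int from, int key)
{
    if (from >= n)               /* no more data to examine: fail */
        return -1;
    if (data[from] == key)       /* this datum is what we want: succeed */
        return from;
    /* otherwise keep searching in the remaining data */
    return general_search(data, n, from + 1, key);
}
```

A call such as general_search(a, n, 0, key) starts the search at the first entry. Each branch of the outline corresponds to one return statement, which is why the routine either exhausts the data or stops at its goal.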
Sequential Search, Unordered

Input: E, n, K, where E is an array with n entries (indexed 0, ..., n-1), and K is the item sought. For simplicity, we assume that K and the entries of E are integers, as is n.

Output: Returns ans, the location of K in E (-1 if K is not found).

    int seqSearch(int E[], int n, int K)
    {
    1.  int ans, index;
    2.  ans = -1;                              // Assume failure.
    3.  for (index = 0; index < n; index++) {
    4.      if (K == E[index]) {
    5.          ans = index;                   // Success!
    6.          break;                         // Take the rest of the afternoon off.
            }
            // Otherwise continue the loop.
        }
    7.  return ans;
    }

Basic Operation: Comparison of K with an array entry.

Worst-Case Analysis: Clearly W(n) = n. The worst cases occur when K appears only in the last position in the array and when K is not in the array at all. In both of these cases K is compared to all n entries.

Average Behavior Analysis: We will make several simplifying assumptions first to do an easy example, then do a slightly more complicated analysis with different assumptions. We assume that the elements in the array are distinct and that if K is in the array, then it is equally likely to be in any particular position.

For our first case, we assume that K is in the array and we denote this event by “succ,” in accordance with the terminology of probabilities. The inputs can be categorized according to where in the array K appears, so there are n inputs to consider. For $0 \le i < n$, let $I_i$ represent the event that K appears in the ith position in the array. Then, let t(I) be the number of comparisons done (the number of times the condition in line 4 is tested) by the algorithm on input I. Clearly, for $0 \le i < n$, $t(I_i) = i + 1$. Thus

$$A_{succ}(n) = \sum_{i=0}^{n-1} \Pr(I_i \mid succ)\, t(I_i) = \sum_{i=0}^{n-1} \frac{1}{n}\,(i+1) = \frac{1}{n} \cdot \frac{n(n+1)}{2} = \frac{n+1}{2}$$

The subscript “succ” denotes that we are assuming a successful search in this computation. The result should satisfy our intuition that on the average, about half the array will be searched.

Now, let us consider the event that K is not in the array at all, which we call “fail”.
There is only one input for this case, which we call $I_{fail}$. The number of comparisons in this case is $t(I_{fail}) = n$, so $A_{fail}(n) = n$.

Finally, we combine the cases in which K is in the array and is not in the array. Let q be the probability that K is in the array. By the law of conditional expectations,

$$A(n) = \Pr(succ)\,A_{succ}(n) + \Pr(fail)\,A_{fail}(n) = q \cdot \tfrac{1}{2}(n+1) + (1-q)\,n = n\left(1 - \tfrac{1}{2}q\right) + \tfrac{1}{2}q$$

If q = 1, that is, if K is always in the array, then A(n) = (n+1)/2, as before. If q = 1/2, that is, if there is a 50-50 chance that K is not in the array, then A(n) = 3n/4 + 1/4; roughly three-fourths of the entries are examined.

The example illustrates how we should interpret $D_n$, the set of inputs of size n. Rather than consider all possible arrays of names, numbers, or whatever, that could occur as inputs, we identify the properties of the inputs that affect the behavior of the algorithm; in this case, whether K is in the array at all and, if so, where it appears. An element I in $D_n$ may be thought of as a set (or equivalence class) of all arrays and values for K such that K occurs in the specified place in the array (or not at all). Then t(I) is the number of operations done for any one of the inputs in I.

Observe also that the input for which an algorithm behaves worst depends on the particular algorithm, not on the problem. For the sequential search above, a worst case occurs when the only position in the array containing K is the last. For an algorithm that searched the array backwards (i.e., beginning with index = n-1), a worst case would occur if K appeared only in position 0. (Another worst case would again be when K is not in the array at all.)

The example also illustrates an assumption we often make when doing average analysis of sorting and searching algorithms: that the elements are distinct.
The average analysis for the case of distinct elements gives a fair approximation for the average behavior in cases with few duplicates. If there might be many duplicates, it is harder to make reasonable assumptions about the probability that K's first appearance in the array occurs at any particular position.

2.6 SPACE USAGE

The number of memory cells used by a program, like the number of seconds required to execute a program, depends on the particular implementation. However, some conclusions about space usage can be made just by examining an algorithm. A program will require storage space for the instructions, the constants and variables used by the program, and the input data. It may also use some workspace for manipulating the data and storing information needed to carry out its computations. The input data itself may be representable in several forms, some of which require more space than others.

If the input data have one natural form (say, an array of numbers or a matrix), then we analyze the amount of extra space used, aside from the program and the input. If the amount of extra space is constant with respect to the input size, the algorithm is said to work in place. This term is used especially in reference to sorting algorithms. (A relaxed definition of in place is often used when the extra space is not constant, but is only a logarithmic function of the input size, because the log function grows so slowly; we will clarify any cases in which we use the relaxed definition.)

If the input can be represented in various forms, then we will consider the space required for the input itself as well as any extra space used. In general, we will refer to the number of “cells” used without precisely defining cells. You may think of a cell as being large enough to hold one number or one object.
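The idea of working in place, described in this section, can be illustrated with array reversal rather than sorting. This is a minimal sketch under stated assumptions: both function names are hypothetical, and a cell is taken to be one int. The first version uses a constant amount of extra space; the second uses n extra cells.

```c
#include <stdlib.h>

/* Reverses a[0..n-1] in place: the only extra space is one temporary
   cell and two indices, whatever the value of n. */
void reverse_in_place(int a[], int n)
{
    int i, j;

    for (i = 0, j = n - 1; i < j; i++, j--) {
        int tmp = a[i];
        a[i] = a[j];
        a[j] = tmp;
    }
}

/* Returns a newly allocated reversed copy: extra space is n cells.
   The caller must free() the result. Returns NULL if n <= 0 or if
   the allocation fails. */
int *reversed_copy(const int a[], int n)
{
    int i;
    int *out;

    if (n <= 0)
        return NULL;
    out = malloc((size_t)n * sizeof *out);
    if (out == NULL)
        return NULL;
    for (i = 0; i < n; i++)
        out[i] = a[n - 1 - i];
    return out;
}
```

Both functions do the same amount of work in terms of element moves, up to a constant factor, so the choice between them is a matter of space usage and of whether the original array must be preserved.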
If the amount of space used depends on the particular input, worst-case and average-case analysis can be done.

2.7 SIMPLICITY

It is often, though not always, the case that the simplest and most straightforward way to solve a problem is not the most efficient. Yet simplicity in an algorithm is a desirable feature. It may make verifying the correctness of the algorithm easier, and it makes writing, debugging, and modifying a program easier. The time needed to produce a debugged program should be considered when choosing an algorithm, but if the program is to be used very often, its efficiency will probably be the determining factor in the choice.

2.8 OPTIMALITY

No matter how clever we are, we can't improve an algorithm for a problem beyond a certain point. Each problem has inherent complexity; that is, there is some minimum amount of work required to solve it. To analyze the complexity of a problem, as opposed to that of a specific algorithm, we choose a class of algorithms (often by specifying the types of operations the algorithms will be permitted to perform) and a measure of complexity, for example, the basic operation(s) to be counted. Then we may ask how many operations are actually needed to solve the problem. We say that an algorithm is optimal (in the worst case) if there is no algorithm in the class under study that performs fewer basic operations (in the worst case). Note that when we speak of algorithms in the class under study, we don't mean only those algorithms that people have thought of. We mean all possible algorithms, including those not yet discovered. “Optimal” doesn't mean “the best known”; it means “the best possible.”

SUMMARY:
- If the problem size doubles and the algorithm takes one more step, we relate the number of steps to the problem size by $O(\log_2 N)$. It is read as order of $\log_2 N$.
- If the problem size doubles and the algorithm takes twice as many steps, the number of steps is related to the problem size by O(N), i.e. order of N; the number of steps is directly proportional to N.
- If the problem size doubles and the algorithm takes more than twice as many steps, i.e. the number of steps required grows faster than the problem size, we use the expression $O(N \log_2 N)$. You may notice that the growth rate of the complexity is more than double the growth rate of the problem size, but it is not a lot faster.
- If the number of steps used is proportional to the square of the problem size, we say the complexity is of the order of $N^2$, or $O(N^2)$.
- If the algorithm is independent of the problem size, the complexity is constant in time and space, i.e. O(1).
The notation being used, i.e. a capital O(), is called Big-Oh notation.

*************************************************************************************************************

UNIT 3
INTRODUCTION TO DATA AND FILE STRUCTURE

3.1 INTRODUCTION
3.2 PRIMITIVE AND SIMPLE STRUCTURES
3.3 LINEAR AND NONLINEAR STRUCTURES
3.4 FILE ORGANIZATIONS

3.1 INTRODUCTION

Data structures are very important in computer systems. In a program, every variable is of some explicitly or implicitly defined data structure, which determines the set of operations that are legal upon that variable. Knowledge of data structures is required for people who design and develop computer programs of any kind: systems software or applications software.

The data structures that are discussed here are termed logical data structures. There may be several different physical organizations on storage possible for each logical data structure. The logical or mathematical model of a particular organization of data is called a data structure.

Often the different data values are related to each other. To enable programs to make use of these relationships, these data values must be in an organised form. The organised collection of data is called a data structure.
The programs have to follow certain rules to access and process the structured data. We may, therefore, say that:

    Data Structure = Organised Data + Allowed Operations

If you recall, this is an extension of the concept of data type. We had defined a data type as:

    Data Type = Permitted Data Values + Operations

The choice of a particular data model depends on two considerations:
- To identify and develop useful mathematical entities and operations.
- To determine representations for those entities and to implement operations on these concrete representations. The representation should be simple enough that one can effectively process the data when necessary.

Data structures can be categorized according to Figure 1.1.

    Data Structure
        Primitive Data Structure
            - integer
            - float
            - character
        Simple Data Structure
            - String
            - Array
            - Record
            - Sets
        Compound Data Structure
            Linear
                - Linked List
                - Stack
                - Queue
            Non Linear
                Binary
                    - Binary Tree
                    - Binary Search Tree
                N-ary
                    - Graph
                    - General Tree
                    - B-Tree
                    - B+Tree
        File Structure (File organization)
            - Sequential
            - Indexed Sequential

Figure 1.1: Characterisation of Data Structure

3.2 PRIMITIVE AND SIMPLE STRUCTURES

Primitive data structures are the data structures that are not composed of other data structures. We will consider briefly examples of three primitives: integers, Booleans, and characters. Other data structures can be constructed from one or more primitives. The simple data structures built from primitives are strings, arrays, sets, and records (or structures in some programming languages). Many programming languages support these data structures.

In other words, primitive data structures are the data structures that can be manipulated directly by machine instructions. The integer, real, character etc. are some of the primitive data structures. In C, the different primitive data structures are int, float, char and double.

Example of primitive Data Structure

The primitive data structures are also known as data types in computer languages.
Examples of the different data items are:

integers: 10, 20, 5, -15, etc., i.e., a subset of the integers. In C an integer is declared as:

    int x;

Each integer occupies 2 bytes of memory on the 16-bit compilers assumed here (4 bytes on most current systems).

float: 6.2, 7.215162, 62.5 etc., i.e., a subset of the real numbers. In C a float variable is declared as:

    float y;

Each float occupies 4 bytes of memory space.

character: Any character enclosed within single quotes is treated as character data, e.g., 'a', '1', '?', '*' etc. are character data. Characters are declared as:

    char c;

Each character occupies 1 byte of memory space.

Example of Simple Data Structure

The simple data types are composed of primitive data structures.

Array: An array is a collection of data items of the same type under the same name. e.g. int x[20] declares a collection of 20 integers under the same name x.

Records: Records are also known as structures in the C/C++ languages. They are collections of different data items under the same name. e.g., in C the declaration

    struct student {
        char name[30];
        char fname[30];
        int roll;
        char class[5];
    } y;

defines a structure describing the record of a student. Obviously this structure contains different types of data items. The size of a structure depends on the constituents of the structure. In this example, with a 2-byte int and ignoring any padding the compiler may insert, the size of the structure is 30 + 30 + 2 + 5 = 67 bytes.

3.3 LINEAR AND NONLINEAR STRUCTURES

Simple data structures can be combined in various ways to form more complex structures. The two fundamental kinds of more complex data structures are linear and nonlinear, depending on the complexity of the logical relationships they represent. The linear data structures that we will discuss include stacks, queues, and linear lists. The nonlinear data structures include trees and graphs. We will find that there are many types of tree structures that are useful in information systems.

In other words, these data structures cannot be manipulated directly by machine instructions.
Arrays, linked lists, trees etc. are some non-primitive data structures. These data structures can be further classified into 'linear' and 'non-linear' data structures. The data structures that show the relationship of logical adjacency between the elements are called linear data structures. Otherwise they are called non-linear data structures. Different linear data structures are the stack, the queue, and linear linked lists such as the singly linked list, doubly linked list etc. Trees, graphs etc. are non-linear data structures.

3.4 FILE ORGANIZATIONS

The data structuring techniques applied to collections of data that are managed as "black boxes" by operating systems are commonly called file organizations. A file carries a name, contents, a location where it is kept, and some administrative information, for example, who owns it and how big it is. The four basic kinds of file organization that we will discuss are sequential, relative, indexed sequential, and multi-key file organizations. These organizations determine how the contents of files are structured. They are built on the data structuring techniques.

************************************************************************************************************

UNIT 4 ARRAYS

4.1 INTRODUCTION
4.1.1 SEQUENTIAL ALLOCATION
4.1.2 MULTIDIMENSIONAL ARRAYS
4.2 ADDRESS CALCULATIONS
4.3 GENERAL MULTIDIMENSIONAL ARRAYS
4.4 SPARSE ARRAYS

4.1 INTRODUCTION

The simplest type of data structure is a linear array. A linear array is a list of a finite number n of homogeneous data elements (i.e., data elements of the same type) such that:
(a) The elements of the array are referenced respectively by an index set consisting of n consecutive numbers.
(b) The elements of the array are stored respectively in successive memory locations.

The number n of elements is called the length or size of the array. If not explicitly stated, we will assume the index set consists of the integers 0, 1, ..., n-1.
In general, the length or the number of data elements of the array can be obtained from the index set by the formula

    Length = UB - LB + 1        (2.1)

where UB is the largest index, called the upper bound, and LB is the smallest index, called the lower bound, of the array. Note that Length = UB when LB = 1.

The elements of an array A may be denoted by the subscript notation
    A0, A1, A2, A3, ..., An-1
or by the parentheses notation (used in FORTRAN, PL/1, and BASIC)
    A(0), A(1), A(2), ..., A(n-1)
or by the bracket notation (used in Pascal)
    A[0], A[1], A[2], A[3], ..., A[n-1]

We will usually use the subscript notation or the bracket notation. Regardless of the notation, the number i in A[i] is called a subscript or an index, and A[i] is called a subscripted variable. Note that subscripts allow any element of A to be referenced by its relative position in A.

Example 1

(a) Let DATA be a 6-element linear array of integers such that

    DATA[0] = 247, DATA[1] = 56, DATA[2] = 429, DATA[3] = 135, DATA[4] = 87, DATA[5] = 156

Sometimes we denote such an array by simply writing

    DATA : 247, 56, 429, 135, 87, 156

The array DATA is frequently pictured as in Fig. 2.1(a) or Fig. 2.1(b).

    Fig 2.1(a)          Fig 2.1(b)
    DATA                DATA
    0   247             247   56   429   135   87   156
    1    56              0     1    2     3     4    5
    2   429
    3   135
    4    87
    5   156

(b) An automobile company uses an array AUTO to record the number of automobiles sold each year from 1932 through 1984. Rather than beginning the index set with 1, it is more useful to begin the index set with 1932 so that

    AUTO[i] = number of automobiles sold in the year i

Then LB = 1932 is the lower bound and UB = 1984 is the upper bound of AUTO. By Eq. (2.1),

    Length = UB - LB + 1 = 1984 - 1932 + 1 = 53

That is, AUTO contains 53 elements and its index set consists of all integers from 1932 through 1984.

Each programming language has its own rules for declaring arrays.
Each such declaration must give, implicitly or explicitly, three items of information: (1) the name of the array, (2) the data type of the array and (3) the index set of the array.

Example 2

Suppose DATA is a 6-element linear array containing real values. Various programming languages declare such an array as follows:

    FORTRAN:     REAL DATA(6)
    PL/1:        DECLARE DATA(6) FLOAT;
    Pascal:      VAR DATA : ARRAY[1..6] OF REAL
    C language:  float DATA[6];

We will declare such an array, when necessary, by writing DATA[6]. (The context will usually indicate the data type, as it will not be explicitly declared.)

Some programming languages (e.g. FORTRAN and Pascal) allocate memory space for arrays statically, i.e., during program compilation; hence the size of the array is fixed during the program execution. On the other hand, some programming languages allow one to read an integer n and then declare an array with n elements; such programming languages are said to allocate memory dynamically.

4.1.1 SEQUENTIAL ALLOCATION

Let LA be a linear array in the memory of the computer. Recall that the memory of the computer is simply a sequence of addressed locations as pictured in Fig. 2.2. Let us use the notation

    ADD(LA[i]) = address of the element LA[i] of the array LA

As previously noted, the elements of LA are stored in successive memory cells. Accordingly, the computer does not need to keep track of the address of every element of LA, but needs to keep track only of the address of the first element of LA, denoted by Base(LA) and called the base address of LA. Using this address Base(LA), the computer calculates the address of any element of LA by the following formula:

    ADD(LA[i]) = Base(LA) + w(i - lower bound)        (2.2)

where w is the number of words per memory cell for the array LA. Observe that the time to calculate ADD(LA[i]) is essentially the same for any value of i. Furthermore, given any subscript i, one can locate and access the content of LA[i] without scanning any other element of LA.
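Equations (2.1) and (2.2) can be tried out with a minimal sketch such as the one below. The function names element_address and array_length are assumptions made for illustration, and the "addresses" are plain numbers, not real machine addresses.

```c
/* Eq. (2.2): ADD(LA[i]) = Base(LA) + w * (i - LB).
   base, w and lb are illustrative numbers, not real pointers. */
long element_address(long base, long w, long lb, long i)
{
    return base + w * (i - lb);
}

/* Eq. (2.1): Length = UB - LB + 1. */
long array_length(long lb, long ub)
{
    return ub - lb + 1;
}
```

For instance, with an assumed base address of 200, w = 4 and LB = 1932, element_address(200, 4, 1932, 1965) evaluates to 332. The cost is one subtraction, one multiplication and one addition whatever the value of i.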
In particular, C language uses the following formula to calculate the address of the element, since the lower bound is zero by default:

    ADD(LA[i]) = Base(LA) + w*i        (2.3)

    [Figure 2.2: computer memory pictured as a sequence of addressed locations 1000, 1001, 1002, 1003, 1004, ...]

Example 3

Consider the array AUTO in Example 1(b), which records the number of automobiles sold each year from 1932 through 1984. Suppose AUTO appears in memory as pictured in Fig. 2.3. That is, Base(AUTO) = 200, and w = 4 words per memory cell for AUTO. Then

    ADD(AUTO[1932]) = 200, ADD(AUTO[1933]) = 204, ADD(AUTO[1934]) = 208, ...

The address of the array element for the year K = 1965 can be obtained by using Eq. (2.2):

    ADD(AUTO[1965]) = Base(AUTO) + w(1965 - lower bound) = 200 + 4(1965 - 1932) = 332

Again we emphasize that the contents of this element can be obtained without scanning any other element in array AUTO.

    Addresses     Element
    200 - 203     AUTO[1932]
    204 - 207     AUTO[1933]
    208 - 211     AUTO[1934]
    ...           ...

    Figure 2.3

4.1.2 MULTIDIMENSIONAL ARRAYS

Most programming languages allow two-dimensional and three-dimensional arrays, i.e., arrays where elements are referenced, respectively, by two or three subscripts. In fact, some programming languages allow the number of dimensions for an array to be as high as 7. This section discusses these multidimensional arrays.

Two-Dimensional Arrays

A two-dimensional m × n array A is a collection of m × n data elements such that each element is specified by a pair of integers (such as i, j), called subscripts, with the property that

    0 ≤ i ≤ m-1 and 0 ≤ j ≤ n-1

The element of A with first subscript i and second subscript j will be denoted by

    A_ij or A[i, j]

Two-dimensional arrays are called matrices in mathematics and tables in business applications; hence two-dimensional arrays are sometimes called matrix arrays.
There is a standard way of drawing a two-dimensional m × n array A where the elements of A form a rectangular array with m rows and n columns and where the element A[i, j] appears in row i and column j. (A row is a horizontal list of elements, and a column is a vertical list of elements.) Figure 2.4 shows the case where A has 3 rows and 4 columns. We emphasize that each row contains those elements with the same first subscript, and each column contains those elements with the same second subscript.

                        Columns
                 0        1        2        3
    Rows   0   A[0,0]   A[0,1]   A[0,2]   A[0,3]
           1   A[1,0]   A[1,1]   A[1,2]   A[1,3]
           2   A[2,0]   A[2,1]   A[2,2]   A[2,3]

    Figure 2.4

Example 4

Suppose each student of a class of 25 students is given 4 tests. Assuming the students are numbered from 0 to 24, the test scores can be assigned to a 25 × 4 matrix array SCORE as pictured in Fig. 2.5. Thus SCORE[i, j] contains the ith student's score on the jth test. In particular, the row

    SCORE[2,0], SCORE[2,1], SCORE[2,2], SCORE[2,3]

contains the four test scores of the student numbered 2.

    Student   Test 1   Test 2   Test 3   Test 4
    0         84       73       88       81
    1         95       100      88       96
    2         72       66       77       72
    ...       ...      ...      ...      ...
    24        78       70       70       85

    Figure 2.5: Array SCORE

Suppose A is a two-dimensional m × n array. The first dimension of A contains the index set 0, ..., m-1, with lower bound 0 and upper bound m-1; and the second dimension of A contains the index set 0, 1, 2, ..., n-1, with lower bound 0 and upper bound n-1. The length of a dimension is the number of integers in its index set. The pair of lengths m × n (read "m by n") is called the size of the array.

4.2 ADDRESS CALCULATIONS

Let A be a two-dimensional m × n array. Although A is pictured as a rectangular array of elements with m rows and n columns, the array will be represented in memory by a block of m·n sequential memory locations. Specifically, the programming language will store the array A either (1) column by column, in what is called column-major order, or (2) row by row, in row-major order.
Figure 2.6 shows these two ways when A is a two-dimensional 2 × 3 array. We emphasize that the particular representation used depends upon the programming language, not the user. Pascal, C and C++ use row-major ordering while FORTRAN uses column-major ordering.

    (a) Column-major ordering        (b) Row-major ordering
    Subscripts                       Subscripts
    (0,0)   Column 1                 (0,0)   Row 1
    (1,0)                            (0,1)
    (0,1)   Column 2                 (0,2)
    (1,1)                            (1,0)   Row 2
    (0,2)   Column 3                 (1,1)
    (1,2)                            (1,2)

    Figure 2.6

Recall that, for a linear array LA, the computer does not keep track of the address ADD(LA[i]) of every element LA[i] of LA, but does keep track of Base(LA), the address of the first element of LA. The computer uses the formula

    ADD(LA[i]) = Base(LA) + w(i - l)

to find the address of LA[i] in time independent of i. (Here w is the number of words per memory cell for the array LA, and l is the lower bound of the index set of LA.)

A similar situation also holds for any two-dimensional m × n array A. That is, the computer keeps track of Base(A), the address of the first element A[0,0] of A, and computes the address ADD(A[i, j]) of A[i, j] using the formula

    (Column-major order)  ADD(A[i, j]) = Base(A) + w[m(j - l2) + (i - l1)]        (2.4)
    (Row-major order)     ADD(A[i, j]) = Base(A) + w[n(i - l1) + (j - l2)]        (2.5)

Here l1 and l2 are the lower bounds of the first and second index sets, and again w denotes the number of words per memory location for the array A. Note that the formulas are linear in i and j.

C language makes use of row-major ordering, with both lower bounds equal to zero, and uses the following formula:

    ADD(A[i, j]) = Base(A) + w(n*i + j)        (2.6)

Example 5

Consider the 25 × 4 matrix array SCORE in Example 4. Suppose Base(SCORE) = 200 and there are w = 4 words per memory cell. Furthermore, suppose the programming language stores two-dimensional arrays using row-major order. Then the address of SCORE[12,3], the third test of the twelfth student, is obtained as follows [with l1 = l2 = 1]:

    ADD(SCORE[12,3]) = 200 + 4[4(12 - 1) + (3 - 1)] = 200 + 4[46] = 384

Observe that we have simply used Eq. (2.5).
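The two address formulas can be written down directly in C. The following is only a sketch under stated assumptions: the function names addr_row_major, addr_col_major and c_element_offset are hypothetical, addresses are treated as plain numbers, and the last function uses a local 3 × 4 int array purely to check Eq. (2.6) against the layout the C compiler actually uses.

```c
#include <stddef.h>

/* Eq. (2.5): row-major address of A[i, j] in an array with n columns. */
long addr_row_major(long base, long w, long n,
                    long i, long j, long l1, long l2)
{
    return base + w * (n * (i - l1) + (j - l2));
}

/* Eq. (2.4): column-major address of A[i, j] in an array with m rows. */
long addr_col_major(long base, long w, long m,
                    long i, long j, long l1, long l2)
{
    return base + w * (m * (j - l2) + (i - l1));
}

/* Offset, counted in elements, of a[i][j] from the start of a real
   3 x 4 C array; by Eq. (2.6) this should equal 4*i + j. */
size_t c_element_offset(size_t i, size_t j)
{
    static int a[3][4];
    return (size_t)((char *)&a[i][j] - (char *)a) / sizeof a[0][0];
}
```

With the values of Example 5, addr_row_major(200, 4, 4, 12, 3, 1, 1) gives 384. Under column-major storage the same element would be at addr_col_major(200, 4, 25, 12, 3, 1, 1) = 444, which shows that the address depends on the storage order chosen by the language.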
Multidimensional arrays clearly illustrate the difference between the logical and the physical views of data. Figure 2.4 shows how one logically views a 3 × 4 matrix array A, that is, as a rectangular array of data where A[i, j] appears in row i and column j. On the other hand, the data will be physically stored in memory as a linear collection of memory cells. This situation will occur throughout the text; i.e., certain data may be viewed logically as trees or graphs although physically the data will be stored linearly in memory cells.

4.3 GENERAL MULTIDIMENSIONAL ARRAYS

General multidimensional arrays are defined analogously. More specifically, an n-dimensional m0 × m1 × m2 × ... × mn-1 array B is a collection of m0·m1·m2·...·mn-1 data elements in which each element is specified by a list of n integers, such as k0, k1, ..., kn-1, called subscripts, with the property that

    0 ≤ k0 < m0, 0 ≤ k1 < m1, ..., 0 ≤ kn-1 < mn-1

The element of B with subscripts k0, k1, ..., kn-1 will be denoted by

    B_(k0, k1, ..., kn-1) or B[k0, k1, ..., kn-1]

The array will be stored in memory in a sequence of memory locations. Specifically, the programming language will store the array B either in row-major order or in column-major order. By row-major order, we mean that the elements are listed so that the subscripts vary like an automobile odometer, i.e., so that the last subscript varies first (most rapidly), the next to last subscript varies second (less rapidly), and so on. By column-major order, we mean that the first subscript varies first (most rapidly), the second subscript second (less rapidly), and so on.

4.4 SPARSE ARRAYS

Matrices with a relatively high proportion of zero entries are called sparse matrices. Sparse arrays are special arrays which arise commonly in applications. It is difficult to draw the dividing line between sparse and non-sparse arrays. Loosely, an array is called sparse if it has a relatively large number of zero elements. For example, in Figure 8, out of 49 elements, only 6 are nonzero. This is a sparse array.
         0  1  2  3  4  5  6
    0    0  0  0  0  0  1  0
    1    0  0  0  0  1  0  0
    2    0  0  0  0  0  0  0
    3    0  0  0  0  3  0  0
    4    0  2  0  0  0  0  0
    5    0  0  0  0  4  0  0
    6    0  0  0  0  2  0  0

    Figure 8: A Sparse Array

If we store such arrays using the techniques presented in the previous sections, there would be much wasted space. Let us consider 2 alternative representations that store explicitly only the non-zero elements:
1. Vector representation
2. Linked List representation

We shall discuss only the first representation here in this Unit. The Linked List representation shall be discussed in a later Unit.

Each element of a 2-dimensional array is uniquely characterized by its row and column position. We may, therefore, store a sparse array in another array of the form

    A(0..n, 1..3)

where n is the number of non-zero elements. The sparse array given in Figure 8 may be stored in the array A(0..6, 1..3) as shown in Figure 9.

    A    1  2  3
    0    7  7  6
    1    0  5  1
    2    1  4  1
    3    3  4  3
    4    4  1  2
    5    5  4  4
    6    6  4  2

    Figure 9: Sparse Array Representation

The elements A(0,1) and A(0,2) contain the number of rows and columns of the sparse array. A(0,3) contains the number of non-zero elements of the sparse array. In each of the remaining rows, the first and second elements store the row and column number of a non-zero term and the third element stores the value of that non-zero term. In other words, each non-zero element in a 2-dimensional sparse array is represented as a triplet with the format (row subscript, column subscript, value).

If the sparse array were one-dimensional, each non-zero element would be represented by a pair. In general, for an N-dimensional sparse array, non-zero elements are represented by an entry with N+1 values.
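To see how an element is read back from the triplet form, the sketch below stores the table of Figure 9 in a C array and searches it. The names fig9 and sparse_get are assumptions for illustration, and the table columns are indexed 0..2 in C rather than 1..3 as in the text.

```c
/* Triplet table read off Figure 9. Row 0 is the header:
   number of rows, number of columns, number of non-zero terms. */
static const int fig9[7][3] = {
    {7, 7, 6},
    {0, 5, 1},
    {1, 4, 1},
    {3, 4, 3},
    {4, 1, 2},
    {5, 4, 4},
    {6, 4, 2}
};

/* Return the value stored at (row, col) of the sparse array described
   by the triplet table t; positions not listed in t hold zero. */
int sparse_get(const int t[][3], int row, int col)
{
    int count = t[0][2];   /* number of non-zero terms */
    int k;

    for (k = 1; k <= count; k++) {
        if (t[k][0] == row && t[k][1] == col)
            return t[k][2];
    }
    return 0;
}
```

Here sparse_get(fig9, 3, 4) returns 3 and sparse_get(fig9, 2, 2) returns 0. The price of the saved space is that a lookup now scans up to n triplets instead of computing a single address.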
Following is a program in C that illustrates how to store a sparse matrix in triplet form.

    /* SPARSE MATRIX */
    #include <stdio.h>

    int main(void)
    {
        int a[3][3];
        int tp[5][3];          /* header row plus at most 4 triplets */
        int nz = 0, r, i, j;

        /* read the matrix and count its non-zero entries */
        for (i = 0; i < 3; i++) {
            for (j = 0; j < 3; j++) {
                printf("enter any value ");
                scanf("%d", &a[i][j]);
                if (a[i][j] != 0)
                    nz++;
            }
        }

        if (nz > 4) {
            printf("\n not sparse matrix");
            return 0;
        }

        /* header row: number of rows, columns and non-zero terms */
        tp[0][0] = 3;
        tp[0][1] = 3;
        tp[0][2] = nz;

        /* one triplet (row, column, value) per non-zero entry */
        r = 1;
        for (i = 0; i < 3; i++) {
            for (j = 0; j < 3; j++) {
                if (a[i][j] != 0) {
                    tp[r][0] = i;
                    tp[r][1] = j;
                    tp[r][2] = a[i][j];
                    r++;
                }
            }
        }

        /* print the header row followed by the nz triplets */
        for (i = 0; i <= nz; i++) {
            for (j = 0; j < 3; j++)
                printf("\t %d", tp[i][j]);
            printf("\n");
        }
        return 0;
    }

Summary

A data structure which displays the relationship of adjacency between elements is said to be "linear". Finding the length, traversing from left to right, retrieving any element, storing any element, and deleting any element are the main operations which can be performed on any linear data structure.

Arrays are one of the linear data structures. Single-dimensional as well as multidimensional arrays are represented in memory as one-dimensional arrays. Elements of any multidimensional array can be stored in two forms: row major and column major.

A matrix which has many zero entries is called a sparse matrix. There are several alternative representations of a sparse matrix using linear data structures like the array and the linked list. We can represent a sparse matrix through a single-dimensional array if it conforms to a nice pattern. We can represent any sparse matrix through the 3-tuple method, in which a 2-dimensional array is used that has 3 fields (columns): row number, column number, and value of the non-zero term. A sparse matrix can also be represented through a singly linked list in which each node contains the row number, column number, value and a link to the next non-zero term.

The word "instance" has been used because there may be several distinct inputs to an algorithm and so there solution periods.
UNIT 5 STRINGS

5.1 INTRODUCTION
5.2 STRING FUNCTIONS
5.3 STRING LENGTH
5.3.1 USING ARRAY
5.3.2 USING POINTERS
5.4 STRING COPY
5.4.1 USING ARRAY
5.4.2 USING POINTERS
5.5 STRING COMPARE
5.5.1 USING ARRAY
5.6 STRING CONCATENATION

5.1 INTRODUCTION

A string is an array of characters. Strings can contain any ASCII character and are useful in many operations. A character occupies a single byte. Therefore a string of N characters requires N bytes of memory for its characters, plus one more byte for the end marker described below.

Since strings do not use bounding indexes it is important to mark their end. Whenever the enter key is pressed by the user, the input function treats it as the end of the string. It puts a special character '\0' (the null character) at the end, and this is used as the end-of-string marker from then onwards. When the function scanf() is used with %s for reading a string, it stops and puts a '\0' character when it receives a space. Hence if a string must contain a space in it we should use the function gets(). Note that gets() cannot limit the number of characters read and was removed from the C standard in C11; fgets() is the safer choice with current compilers.

5.2 STRING FUNCTIONS

Let us first consider the functions which are required for general string operations. The string functions are available in the header file "string.h". We can also write these ourselves to understand their working. We can write these functions using (i) an array of characters and (ii) pointers.

5.3 STRING LENGTH

The length of a string is the number of characters in the string, which includes spaces and all other ASCII characters. As the array index starts at zero, we can say that the position occupied by '\0' indicates the length of the string. Let us write this function in the two different ways mentioned earlier.

5.3.1 USING ARRAY

    int strlen1(char s[])
    {
        int i = 0;

        while (s[i] != '\0')
            i++;
        return i;
    }

Here we increment the position till we reach the end of the string. The counter then contains the size of the string.

5.3.2 USING POINTERS

    int strlen1(char *s)
    {
        char *p;

        p = s;                  /* remember the start address */
        while (*s != '\0')
            s++;
        return (int)(s - p);
    }

The function is called in the same manner as earlier, but in the function we accept the start address in s. This address is copied to p.
The variable s is incremented till we get to the end of the string. The difference between the last and the first address will be the length of the string.

5.4 STRING COPY

In this function we have to copy the contents of one string, s2, into another string, s1.

5.4.1. USING ARRAYS

    void strcopy(char s1[], char s2[])
    {
        int i = 0;

        while (s2[i] != '\0') {
            s1[i] = s2[i];
            i++;
        }
        s1[i] = '\0';
    }

As long as the ith character of s2 is not '\0', copy that character into s1; finally put a '\0' at the end of the new string. The index is incremented in a separate statement because writing s1[i] = s2[i++] would read and modify i in the same expression, which has no defined result in C.

5.4.2. USING POINTERS

    void strcopy(char *s1, char *s2)
    {
        while (*s2) {
            *s1 = *s2;
            s1++;
            s2++;
        }
        *s1 = *s2;              /* copy the terminating '\0' */
    }

5.5 STRING COMPARE

5.5.1. USING ARRAYS

    int strcomp(char s1[], char s2[])
    {
        int i = 0;

        while (s1[i] != '\0' && s2[i] != '\0') {
            if (s1[i] != s2[i])
                break;
            else
                i++;
        }
        return s1[i] - s2[i];
    }

The function returns zero if the two strings are equal. When the first string is less than the second, it returns a negative value, otherwise a positive value. The reader can write the same function using pointers.

5.6 STRING CONCATENATION

Here the string s2 is added to the end of the string s1. Go till the end of the first string. From the next position copy the characters from the second string as long as there are characters in the second string, and at the end close it with a '\0' character. This is left as an exercise for the student.
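Readers who want to check their own answer to the concatenation exercise can compare it with the array-based sketch below. It is one possible solution, not the only one; the name str_concat is an assumption for illustration, and the caller is assumed to provide an s1 that is large enough for both strings and the final '\0'.

```c
/* Append the characters of s2 to the end of s1.
   Assumption: s1 has room for both strings plus the '\0'. */
void str_concat(char s1[], const char s2[])
{
    int i = 0;   /* position in s1 */
    int j = 0;   /* position in s2 */

    while (s1[i] != '\0')       /* go till the end of the first string */
        i++;
    while (s2[j] != '\0') {     /* copy the second string after it */
        s1[i] = s2[j];
        i++;
        j++;
    }
    s1[i] = '\0';               /* close the result */
}
```

If s1 is declared as char s1[16] = "data"; then after str_concat(s1, " file") it holds "data file". The function does no bounds checking, so a destination that is too small would be overrun.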
*************************************************************************************************************

UNIT 6 ELEMENTARY DATA STRUCTURES

6.1 INTRODUCTION
6.2 STACK
6.2.1 DEFINITION
6.2.2 OPERATIONS ON STACK
6.2.3 IMPLEMENTATION OF STACKS USING ARRAYS
6.2.3.1 FUNCTION TO INSERT AN ELEMENT INTO THE STACK
6.2.3.2 FUNCTION TO DELETE AN ELEMENT FROM THE STACK
6.2.3.3 FUNCTION TO DISPLAY THE ITEMS
6.3 RECURSION AND STACKS
6.4 EVALUATION OF EXPRESSIONS USING STACKS
6.4.1 POSTFIX EXPRESSIONS
6.4.2 PREFIX EXPRESSION
6.5 QUEUE
6.5.1 INTRODUCTION
6.5.2 ARRAY IMPLEMENTATION OF QUEUES
6.5.2.1 FUNCTION TO INSERT AN ELEMENT INTO THE QUEUE
6.5.2.2 FUNCTION TO DELETE AN ELEMENT FROM THE QUEUE
6.6 CIRCULAR QUEUE
6.6.1 FUNCTION TO INSERT AN ELEMENT INTO THE QUEUE
6.6.2 FUNCTION FOR DELETION FROM CIRCULAR QUEUE
6.6.3 CIRCULAR QUEUE WITH ARRAY IMPLEMENTATION
6.7 DEQUES
6.8 PRIORITY QUEUES

6.1 INTRODUCTION

Linear arrays, discussed in the previous unit, allow one to insert and delete elements at any place in the list: at the beginning, at the end or in the middle. There are certain frequent situations in computer science when one wants to restrict insertions and deletions so that they can take place only at the beginning or at the end of the list, not in the middle.

In a stack, each data item is placed on top of the others as it arrives. A stack is a data structure in which the latest data will be processed first. The data comes in a sequence and we want to decide the sequence in which it must be processed. Many times, while we are handling one data item, it turns out to depend on some other data item, and so we accept that data item as well.

6.2 STACK

6.2.1 DEFINITION

A stack is a homogeneous collection of items of any one type, arranged linearly with access at one end only, called the top. This means that data can be added or removed only at the top. Formally this type of stack is called a Last-In-First-Out (LIFO) stack.
Data is added to the stack using the Push operation, and removed using the Pop operation.

Description

In order to clarify the idea of a stack let's look at a "real life" example of a stack. Think of a stack of plates in a high school cafeteria. When the plates are being stacked, they are added one on top of another. It doesn't make much sense to put each plate on the bottom of the pile, as that would be far more work, and would accomplish nothing over stacking them on top of each other. Similarly, when a plate is taken, it is usually taken from the top of the stack.

The Stack Implemented as an Array

One of two ways to implement a stack is by using a one-dimensional array (also known as a vector). When implemented this way, the data is simply stored in the array. Top is an integer value which contains the array index for the top of the stack. Each time data is added or removed, top is incremented or decremented accordingly, to keep track of the current top of the stack. By convention, an empty stack is indicated by setting top equal to -1.

Stacks implemented as arrays are useful if a fixed amount of data is to be used. However, if the amount of data is not of a fixed size or the amount of data fluctuates widely during the stack's lifetime, then an array is a poor choice for implementing a stack. For example, consider a call stack for a recursive procedure. First, it can be difficult to know how many times a recursive procedure will be called, making it difficult to decide upon array bounds. Second, it is possible for the recursive procedure to be called a small number of times on some occasions and a large number of times on others. An array would be a poor choice, as you would have to declare it large enough that there is no danger of it running out of storage space when the procedure recurses many times. This can waste a significant amount of memory if the procedure normally recurses only a few times.
6.2.2 OPERATIONS ON STACK

The two main operations applicable to a stack are:

push: an item is put on top of the stack, increasing the stack size by one. As stack size is usually limited, this may provoke a stack overflow if the maximum size is exceeded.

pop: the top item is taken from the stack, decreasing the stack size by one. In the case where there was no top item (i.e. the stack was empty), a stack underflow occurs.

Given a stack S and an item I, performing the operation
    push(S, I)
adds the item I to the top of the stack S. Similarly, the operation
    pop(S)
removes the top element and returns it as the function's value. Thus the assignment operation
    I = pop(S);
removes the element at the top of S and assigns its value to I.

For example, if S is the stack of Figure 4.3, we performed the operation push(S, G) in going from frame (a) to frame (b).

    Frame   Operation   Stack contents (bottom to top)
    (a)     (initial)   A B C D E F        (Top = 5)
    (b)     push G      A B C D E F G
    (c)     push H      A B C D E F G H
    (d)     pop H       A B C D E F G
    (e)     pop G       A B C D E F
    (f)     pop F       A B C D E
    (g)     push F      A B C D E F

    Figure 4.3: A Motion Picture of a Stack

We then performed, in turn, the following operations:

    push(S, H);    (frame (c))
    pop(S);        (frame (d))
    pop(S);        (frame (e))
    pop(S);        (frame (f))
    push(S, F);    (frame (g))

Because of the push operation, which adds elements to a stack, a stack is sometimes called a pushdown list. There is no upper limit on the number of items that may be kept in a stack, since the definition does not specify how many items are allowed in the collection. However, if a stack contains a single item and the stack is popped, the resulting stack contains no items and is called an empty stack.

The operation isempty(S) determines whether or not a stack S is empty. If the stack is empty, isempty(S) returns the value TRUE, otherwise it returns the value FALSE.

Another operation that can be performed on a stack is to determine what the top item on the stack is without removing it. This operation is written stacktop(S) and returns the top element of the stack S.
The statement
    I = stacktop(S);
is equivalent to
    I = pop(S);
    push(S, I);

Like the pop operation, stacktop is not defined for an empty stack. The result of an illegal attempt to pop or access an item from an empty stack is called underflow.

6.2.3 IMPLEMENTATION OF STACKS USING ARRAYS

Stacks are one of the most important linear data structures of variable size. They are an important subclass of lists (arrays) that permit insertion and deletion at only one end, the TOP. To allow flexibility, we can sometimes let the stack give access (read) to elements in the middle and also change them, but this is very rare.

Some terminologies

Insertion : the operation is also called push
Deletion  : the operation is also called pop
Top       : a pointer which keeps track of the top element in the stack. If an array of size N is declared to be a stack, then TOP will be -1 when the stack is empty and N-1 when the stack is full.

Pictorial representation of stack

    [Fig 1: Memory representation of stacks. Insertion and deletion both take place at the current TOP; the bottom of the stack is fixed, and the maximum size allocated limits how far TOP can move.]

6.2.3.1. FUNCTION TO INSERT AN ELEMENT INTO THE STACK

Before inserting any element into the stack we must check whether the stack is full. In such a case we cannot enter the element into the stack. If the stack is not full, we must increment the position of the TOP and insert the element. So, first we will write a function that checks whether the stack is full.

    /* function to check whether stack is full */
    int stack_full(int top)
    {
        if (top == SIZE - 1)
            return (1);
        else
            return (0);
    }

The function returns 1 if the stack is full. Since we place TOP at -1 to denote the stack-empty condition, the top varies from 0 to SIZE-1 when it stores elements. Therefore TOP at SIZE-1 denotes that the stack is full.

    /* function to insert an element into the stack */
    void push(float s[], int *t, float val)
    {
        if (!stack_full(*t)) {
            *t = *t + 1;
            s[*t] = val;
        } else {
            printf("\n STACK FULL");
        }
    }

Note that we are accepting the TOP as a pointer (*t). This is because the changes must be reflected in the calling function. If we simply pass it by value the changes will not be reflected; otherwise we would have to explicitly return it to the calling function.

6.2.3.2. FUNCTION TO DELETE AN ELEMENT FROM THE STACK

Before deleting any element from the stack we must check whether the stack is empty. In such a case we cannot delete an element from the stack. If the stack is not empty, we delete the element by decrementing the position of the TOP. So, first we will write a function that checks whether the stack is empty.

    /* function to check whether stack is empty */
    int stack_empty(int top)
    {
        if (top == -1)
            return (1);
        else
            return (0);
    }

This function returns 1 if the stack is empty. Since the elements are stored in positions 0 to SIZE-1, the empty condition is recognised when the TOP has -1 in it.

    /* function to delete an element from the stack */
    float pop(float s[], int *t)
    {
        float val;

        if (!stack_empty(*t)) {
            val = s[*t];
            *t = *t - 1;
            return val;
        } else {
            printf("\n STACK EMPTY");
            return 0;
        }
    }

Since the TOP points to the current top item, we first store this value in a temporary variable and then decrement the TOP. Then we return the temporary variable to the calling function. Note that a return value of 0 on an empty stack cannot be distinguished from a stored 0, so the caller should check stack_empty() first.

We can also write functions to display the stack, either in the order in which the items arrived or in the reverse order (the order in which they are waiting to be processed).

6.2.3.3 FUNCTION TO DISPLAY THE ITEMS

    /* displays from top to bottom */
    void display_TB(float s[], int top)
    {
        while (top >= 0) {
            printf("%f\n", s[top]);
            top--;
        }
    }

    /* displays from bottom to top */
    void display_BT(float s[], int top)
    {
        int i;

        for (i = 0; i <= top; i++)
            printf("%f\n", s[i]);
    }

We can use these functions in a program and see how they look in the stack.
    #include <stdio.h>
    #define SIZE 10

    /* stack_full, stack_empty, push and pop as defined above */

    int main(void)
    {
        float stk[SIZE], ele;
        int top = -1;

        push(stk, &top, 10);
        push(stk, &top, 20);
        push(stk, &top, 30);
        ele = pop(stk, &top);       /* removes 30 */
        printf("%f\n", ele);
        ele = pop(stk, &top);       /* removes 20 */
        printf("%f\n", ele);
        push(stk, &top, 40);
        ele = pop(stk, &top);       /* removes 40 */
        printf("%f\n", ele);
        return 0;
    }

Now we will see the working of the stack with a trace.

    Operation      Stack area (bottom to top)
    Empty stack    (empty)
    PUSH 10        10
    PUSH 20        10 20
    PUSH 30        10 20 30
    POP            10 20
    POP            10
    PUSH 40        10 40
    POP            10

    Fig 2. Trace of stack values with push and pop functions

A C Program to Simulate the Operation of the Stack: Array Implementation

    /* stack: array implementation */
    #include <stdio.h>
    #include <string.h>

    #define MAX 50

    int top = -1;               /* index of the top item; -1 means empty */

    int isempty(void);
    int isfull(void);
    int pop(int stack[]);
    void push(int stack[], int element);
    void display(int stack[]);

    int main(void)
    {
        int stack[MAX], element, quit = 0;
        int c;

        printf("Program of stack with array\n");
        do {
            printf("\n\tOptions\t\tChoice");
            printf("\n\tPush\t\tP");
            printf("\n\tPop\t\tO");
            printf("\n\tExit\t\tE");
            printf("\n\tEnter choice : ");
            do                  /* skip anything that is not a valid choice */
                c = getchar();
            while (c != EOF && strchr("PpOoEe", c) == NULL);
            if (c == EOF)
                break;
            switch (c) {
            case 'P':
            case 'p':
                printf("\nEnter an element to be pushed : ");
                scanf("%d", &element);
                if (!isfull()) {
                    push(stack, element);
                    printf("\n\t**** Stack *****\n");
                    display(stack);
                }
                break;
            case 'O':
            case 'o':
                if (!isempty()) {
                    element = pop(stack);
                    printf("Popped element = %d\n", element);
                    printf("\n\t**** Stack *****\n");
                    display(stack);
                }
                break;
            case 'E':
            case 'e':
                quit = 1;
                break;
            }
        } while (!quit);
        printf("\n");
        return 0;
    }

    int isempty(void)
    {
        if (top == -1) {
            printf("\nStack underflow...can't pop\n");
            return (1);
        } else
            return (0);
    }

    int isfull(void)
    {
        if (top == MAX - 1) {   /* last valid index is MAX-1 */
            printf("\nStack overflow...can't push\n");
            return (1);
        } else
            return (0);
    }

    void push(int stack[], int element)
    {
        if (!isfull()) {
            ++top;
            stack[top] = element;
        }
    }

    void display(int stack[])
    {
        int i;

        if (top == -1)
            printf("\n***** Empty *****\n");
        else {
            for (i = top; i >= 0; --i)
                printf("%7d", stack[i]);
        }
        printf("\n");
    }

    int pop(int stack[])
    {
        int srs;

        if (!isempty()) {
            srs = stack[top];
            --top;
            return srs;
        }
        return (-1);
    }

6.3 RECURSION AND STACKS

Recursion is a process of expressing a function in terms of itself. A function which contains a call to itself (direct recursion), or a call to another function which eventually calls the first function (indirect recursion), is termed a recursive function. We can also define recursion as a process in which a function calls itself with reduced input and has a base condition to stop the process, i.e., any recursive function must satisfy two conditions:
(1) it must have a terminal condition;
(2) after each recursive call it should reach a value nearer to the terminal condition.

An expression, a language construct or a solution to a problem can be expressed using recursion. We will understand the process with the help of a classic example: finding the factorial of a number. The recursive definition of the factorial of n is shown below:

    fact(n) = 1                  if n = 0
            = n * fact(n-1)      otherwise

We can compute 5! as shown below:

    5! = 5 * 4!
    4! = 4 * 3!
    3! = 3 * 2!
    2! = 2 * 1!
    1! = 1 * 0!
    0! = 1

By definition 0! is 1. So, 0! will not be expressed in terms of itself. Now, the computations will be carried out in reverse order as shown:

    0! = 1
    1! = 1 * 0! = 1 * 1 = 1
    2! = 2 * 1! = 2 * 1 = 2
    3! = 3 * 2! = 3 * 2 = 6
    4! = 4 * 3! = 4 * 6 = 24
    5! = 5 * 4! = 5 * 24 = 120

The C program for finding the factorial of a number is as shown:

    #include <stdio.h>

    int fact(int n)
    {
        if (n == 0)
            return 1;
        return n * fact(n - 1);
    }

    int main(void)
    {
        int n;

        printf("Enter the number \n");
        scanf("%d", &n);
        printf("The factorial of %d = %d\n", n, fact(n));
        return 0;
    }

In fact() the terminal condition is fact(0), which is 1. If we don't write the terminal condition, the function ends up calling itself forever, i.e. in infinite recursion.

Every time the function is entered in a recursion, separate memory is allocated for the local variables and formal variables. Once control comes out of the function, the memory is deallocated.
When a function is called, the return address and the values of the local and formal variables are pushed onto the stack, a block of contiguous memory locations set aside for this purpose. After this, control enters the function. Once the return statement is encountered, control comes back to the previous call by using the return address present in the stack, and the return value is substituted for the call. If the function does not return any value, control goes to the statement that follows the function call.

To explain how the factorial program actually works, we rewrite it with explicit local variables (it is still a directly recursive function):

    #include <stdio.h>

    int fact(int n)
    {
        int x, y, res;
        if (n == 0)
            return 1;
        x = n - 1;
        y = fact(x);
        res = n * y;
        return res;
    }

    int main(void)
    {
        int n;
        printf("Enter the number \n");
        scanf("%d", &n);
        printf("The factorial of %d = %d\n", n, fact(n));
        return 0;
    }

Suppose we have the statement

    A = fact(n);

in the function main(), where the value of n is 4. When the function is called the first time, the value of n and the return address, say XX00, are pushed onto the stack. Now the value of the formal parameter is 4. Since we have not reached the base condition n = 0, the value of n is reduced to 3 and the function is called again with 3 as the parameter. Again the new return address, say XX20, and the parameter 3 are stored on the stack. This process continues until n takes the value 0, every time pushing the return address and the parameter. Finally control returns, folding back from one call to the previous one, with the result value 24. The number of times the function is called recursively is called the depth of recursion.

Iteration versus Recursion

In recursion, every time a function is called, all the local variables, formal variables and the return address are pushed onto the stack. So recursion occupies more stack space, and much of the time is spent in pushing and popping. On the other hand, non-recursive functions usually execute faster and are often easy to design.
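The push-and-fold-back sequence described above for fact(4) can be imitated with an explicit stack. The following is only a minimal sketch under stated assumptions: the names fact_explicit and MAX_DEPTH are hypothetical, only the parameter (not the return address) is pushed, and a real compiler's stack frames look different.

```c
#include <assert.h>

#define MAX_DEPTH 20   /* assumed limit: 20! still fits in unsigned long long */

/* Factorial computed with an explicit stack of pending parameters,
   imitating what the recursive calls push on the run-time stack. */
unsigned long long fact_explicit(int n)
{
    int stack[MAX_DEPTH];
    int top = -1;
    unsigned long long res;

    assert(n >= 0 && n <= MAX_DEPTH);

    /* "Calling" phase: push n, n-1, ..., 1 until the base condition n == 0. */
    while (n > 0) {
        stack[++top] = n;
        n = n - 1;
    }

    res = 1;   /* base condition: 0! = 1 */

    /* "Returning" phase: pop each pending parameter and multiply,
       just as control folds back from one call to the previous one. */
    while (top >= 0)
        res = res * stack[top--];

    return res;
}
```

Calling fact_explicit(4) pushes 4, 3, 2, 1 and then pops them in the order 1, 2, 3, 4 while building the result; the largest value of top reached corresponds to the depth of recursion.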
There are many situations where recursion is best suited for solving problems. In such cases the recursive method is more natural and can be understood easily. If we try to write such functions using iteration, we will have to use stacks explicitly.

6.4 EVALUATION OF EXPRESSIONS USING STACKS

Arithmetic expressions contain variables or constants, operators and parentheses. These expressions are normally in infix form, where the operators separate the operands. There are rules for the evaluation of expressions and for assigning priorities to the operators. An expression, after evaluation, results in a single value. We can evaluate an expression using stacks.

A complex assignment statement such as:

    z = x/y ** a + b*c - x*a

might have several meanings unless the order of evaluation is uniquely defined, for example by the full use of parentheses. An expression is made up of operands, operators and delimiters. The above expression has five operands: x, y, a, b and c. The first problem in understanding the meaning of an expression is to decide in what order the operations are carried out. This means that every language must uniquely define such an order. To fix the order of evaluation we assign a priority to each operator. A set of sample priorities is as follows:

    Operator                                Priority   Associativity
    ( )                                     8          Left to Right
    ^ or **, unary -, unary +, ! (not)      7          Right to Left
    *, /, %                                 6          Left to Right
    +, -                                    5          Left to Right
    <, <=, >, >=                            4          Left to Right
    ==, !=                                  3          Left to Right
    &&                                      2          Left to Right
    ||                                      1          Left to Right

By using parentheses we can override these rules; such expressions are always evaluated with the innermost parenthesized expression first. The above notation for an expression is called Infix Notation (operators come in between the operands). It is the traditional notation, and it needs operator priorities and associativities.
But how can a compiler accept such an expression and produce the correct Polish Notation or Prefix Form (in which operators come before the operands)? Polish Notation has several advantages over Infix Notation: there is no need to consider priorities while evaluating it, and there is no need to introduce parentheses to maintain the order of execution of operators. Similarly, Reverse Polish Notation or Postfix Form has the same advantages over Infix Notation; in this notation operators come after the operands.

Example 1

    Infix      Prefix     Postfix
    x+y        +xy        xy+
    x+y*z      +x*yz      xyz*+
    x+y-z      -+xyz      xy+z-

Stacks are frequently used for converting the INFIX form into the equivalent PREFIX and POSTFIX forms. Consider an infix expression:

    2 + 3 * (4 - 6 / 2 + 7) / (2 + 3) - ((4 - 1) * (2 - 10 / 2))

When it comes to the evaluation of the expression, the following rules are used:

    Brackets are evaluated first.
    * and / have equal priority, which is higher than that of + and -.
    All operators are left associative, i.e. operators of equal priority are evaluated from left to right.

One might say that the innermost bracket (4-1) is evaluated first, followed by (2-10/2), in which 10/2 is evaluated first because / has the higher priority. However, as we move from left to right, the first bracket to be evaluated is (4-6/2+7). The evaluation is as follows:

    Step 1: Division has higher priority. Therefore 6/2 will result in 3. The expression now will be (4-3+7).
    Step 2: As - and + have the same priority, (4-3) will be evaluated first.
    Step 3: 1+7 will result in 8.

The total evaluation is as follows.
      2 + 3 * (4 - 6 / 2 + 7) / (2 + 3) - ((4 - 1) * (2 - 10 / 2))
    = 2 + 3 * 8 / (2 + 3) - ((4 - 1) * (2 - 10 / 2))
    = 2 + 3 * 8 / 5 - ((4 - 1) * (2 - 10 / 2))
    = 2 + 3 * 8 / 5 - (3 * (2 - 10 / 2))
    = 2 + 3 * 8 / 5 - (3 * (2 - 5))
    = 2 + 3 * 8 / 5 - (3 * (-3))
    = 2 + 3 * 8 / 5 + 9
    = 2 + 24 / 5 + 9
    = 2 + 4.8 + 9
    = 6.8 + 9
    = 15.8

6.4.1 Postfix Expressions

In a postfix expression, every binary operator is preceded by the two operands on which it operates. The postfix expression is the postorder traversal of the expression tree. The postorder traversal is Left-Right-Root. In the expression tree, the root is always an operator. The left and right sub-trees are expressions by themselves, hence they can be treated as operands. If we want to obtain the postfix expression from a given infix expression, we consider the evaluation sequence and apply it from bottom to top, every time converting infix to postfix:

        e7 = e6 + a         = e6 a +
    But e6 = e5 - e4        = e5 e4 - a +
    But e5 = a - b          = a b - e4 - a +
    But e4 = e3 * b         = a b - e3 b * - a +
    But e3 = e2 + a         = a b - e2 a + b * - a +
    But e2 = c * e1         = a b - c e1 * a + b * - a +
    But e1 = d / a          = a b - c d a / * a + b * - a +

The postfix expression does not require brackets. The above method is not useful for programming. For programming, we use a stack, which will contain operators and opening brackets. The priorities are assigned using numerical values. The priority of + and - is 1. The priority of * and / is 2. The incoming priority of the opening bracket is the highest, and its priority once it is on the stack is the lowest. An operator is pushed onto the stack provided the priority of the operator at the top of the stack is less than that of the current operator; otherwise operators are popped from the stack and printed until this is the case. An opening bracket is always pushed onto the stack. Any operator can be pushed on top of an opening bracket. Whenever an operand is received, it is printed directly. Whenever a closing bracket is received, elements are popped and printed until the opening bracket is reached. The execution is as shown, e.g. for

    b - (a + b) * ((c - d) / a + a)

1. Symbol is b, an operand, hence print b.
2. On -, stack being empty, push.
3. On (, push.
4. On a, operand, hence print a.
5. On +, stack top being (, push.
6. On b, operand, hence print b.
7. On ), pop till (, and print. Therefore pop +, print +. Pop (.
8. On *, stack top being -, which has lower priority, push.
9. On (, push.
10. On (, push.
11. On c, operand, hence print c.
12. On -, stack top being (, push.
13. On d, operand, hence print d.
14. On ), pop till (, and print. Therefore pop -, print -. Pop (.
15. On /, stack top being (, push.
16. On a, operand, hence print a.
17. On +, stack top is /, which has higher priority. Therefore pop /, print /. Stack top is now (, so push +.
18. On a, operand, hence print a.
19. On ), pop till (, and print. Therefore pop +, print +. Pop (.
20. End of the infix expression. Pop all and print, hence print *, print -.

Note: For convenience, the stack is drawn horizontally, with the top at the right.

    Step   Symbol   Stack (bottom ... top)   Printed so far
    1      b                                 b
    2      -        -                        b
    3      (        - (                      b
    4      a        - (                      ba
    5      +        - ( +                    ba
    6      b        - ( +                    bab
    7      )        -                        bab+
    8      *        - *                      bab+
    9      (        - * (                    bab+
    10     (        - * ( (                  bab+
    11     c        - * ( (                  bab+c
    12     -        - * ( ( -                bab+c
    13     d        - * ( ( -                bab+cd
    14     )        - * (                    bab+cd-
    15     /        - * ( /                  bab+cd-
    16     a        - * ( /                  bab+cd-a
    17     +        - * ( +                  bab+cd-a/
    18     a        - * ( +                  bab+cd-a/a
    19     )        - *                      bab+cd-a/a+
    20     end                               bab+cd-a/a+*-

Therefore the generated postfix expression is bab+cd-a/a+*-

Algorithm to convert an infix expression to a postfix expression

Step 1: Accept the infix expression in a string S.
Step 2: Let i, the position, be equal to 0.
Step 3: Initially top = -1, indicating that the stack is empty.
Step 4: If S[i] is an operand, print it, go to step 8.
Step 5: If S[i] is an opening bracket, push it, go to step 8.
Step 6: If S[i] is an operator, then
    Step 6a: If the stack is empty, push S[i] and go to step 8. Otherwise pop the top element from the stack, say p. If p is an opening bracket, or the priority of p is less than the priority of S[i], then push p back, push S[i], go to step 8. Else print p, go to step 6a.
Step 7: If S[i] is a closing bracket, then
    Step 7a: Pop the top element from the stack, say p. If p is not an opening bracket, then print p, go to step 7a. Else go to step 8.
Step 8: Increment i.
Step 9: If S[i] is not equal to '\0', then go to step 4.
Step 10: Pop the operator from the stack, say p.
Step 11: Print p.
Step 12: If the stack is not empty, then go to step 10.
Step 13: Stop.
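The steps above can be sketched in C as follows. This is a minimal illustration under stated assumptions, not a definitive implementation: operands are single letters or digits, only the operators + - * / and round brackets are handled, the expression is assumed to be well formed, and the names priority, infix_to_postfix and MAX are hypothetical. Instead of printing, the result is written into a caller-supplied string.

```c
#include <assert.h>
#include <ctype.h>
#include <string.h>

#define MAX 100   /* assumed maximum length of the expression */

/* Priority of a symbol found on the stack or in the input. */
static int priority(char op)
{
    if (op == '*' || op == '/') return 2;
    if (op == '+' || op == '-') return 1;
    return 0;   /* opening bracket while it sits on the stack */
}

/* Convert the infix string in to postfix in out.
   out must have room for strlen(in) + 1 characters. */
void infix_to_postfix(const char *in, char *out)
{
    char stack[MAX];
    int top = -1, k = 0;
    size_t i;

    assert(strlen(in) < MAX);

    for (i = 0; in[i] != '\0'; i++) {
        char c = in[i];
        if (isalnum((unsigned char)c)) {
            out[k++] = c;                      /* operand: print directly */
        } else if (c == '(') {
            stack[++top] = c;                  /* always pushed */
        } else if (c == ')') {
            while (top >= 0 && stack[top] != '(')
                out[k++] = stack[top--];       /* pop and print till '(' */
            if (top >= 0)
                top--;                         /* discard the '(' itself */
        } else if (strchr("+-*/", c) != NULL) {
            /* pop operators that do not have lower priority than c */
            while (top >= 0 && priority(stack[top]) >= priority(c))
                out[k++] = stack[top--];
            stack[++top] = c;
        }
        /* any other character (e.g. a blank) is skipped */
    }
    while (top >= 0)                           /* end of input: pop all */
        out[k++] = stack[top--];
    out[k] = '\0';
}
```

For the expression traced above, calling infix_to_postfix("b-(a+b)*((c-d)/a+a)", buf) leaves bab+cd-a/a+*- in buf. Popping on equal priority is what makes operators of the same priority left associative.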
6.4.2 PREFIX EXPRESSION

To convert an infix expression to a prefix expression, we can again use steps similar to the above algorithm, but a single stack is not sufficient. We require two stacks:

    stack1, containing operators, for assigning priorities;
    stack2, containing operands or operand expressions.

Every time we get an operand, it is pushed in stack2, and every operator is pushed in stack1. Pushing an operand in stack2 is unconditional, whereas when operators are pushed, all the rules of the previous method are applicable as they are. When an operator is popped from stack1, the corresponding two operands are popped from stack2: first the right operand, say O2, and then the left operand, say O1. We form the prefix expression as operator, O1, O2; this prefix expression is treated as a single entity and pushed on stack2. Whenever a closing bracket is received, we pop the operators from stack1 till the opening bracket is reached. At the end stack1 will be empty and stack2 will contain a single operand in the form of the prefix expression.

e.g. (a+b)*(c-d)+a

    Symbol   Stack1 (bottom ... top)   Stack2 (bottom ... top)   Action
    (        (
    a        (                         a
    +        ( +                       a
    b        ( +                       a  b
    )                                  +ab                       O1=a, O2=b, push +ab
    *        *                         +ab
    (        * (                       +ab
    c        * (                       +ab  c
    -        * ( -                     +ab  c
    d        * ( -                     +ab  c  d
    )        *                         +ab  -cd                  O1=c, O2=d, push -cd
    +        +                         *+ab-cd                   O1=+ab, O2=-cd, push *+ab-cd
    a        +                         *+ab-cd  a
    end                                +*+ab-cda                 O1=*+ab-cd, O2=a

The expression is +*+ab-cda

The trace of the stack contents and the actions on receiving each symbol is shown above. Stack2 can be a stack of header nodes for different lists containing prefix expressions, or it could be a multidimensional array, or an array of strings. Recursive functions can be written for both conversions. Following are the steps of the non-recursive algorithm for converting infix to prefix.

Step 1: Get the infix expression, say S.
Step 2: Set the position counter i to 0.
Step 3: Initially top1 and top2 are -1, indicating that the stacks are empty.
Step 4: If S[i] is an opening bracket, push it in stack1, go to step 8.
Step 5: If S[i] is an operand, push it in stack2, go to step 8.
Step 6: If S[i] is an operator, then: if stack1 is empty, or the top element of stack1 is an opening bracket, or the top element of stack1 has less priority than S[i], push S[i] in stack1 and go to step 8. Else
    p = pop the operator from stack1.
    O2 = pop the operand from stack2.
    O1 = pop the operand from stack2.
    Form the prefix expression p, O1, O2, push it in stack2 and go to step 6.
Step 7: If S[i] is a closing bracket, then
    Step 7a: p = pop the operator from stack1. If p is not equal to the opening bracket, then
        O2 = pop the operand from stack2.
        O1 = pop the operand from stack2.
        Form the prefix expression p, O1, O2, push it in stack2 and go to step 7a.
    Else go to step 8.
Step 8: Increment i.
Step 9: If S[i] is not equal to '\0', then go to step 4.
Step 10: Every time pop one operator p from stack1, pop two operands from stack2 (O2 first, then O1), form the prefix expression p, O1, O2, push it in stack2, and repeat till stack1 becomes empty.
Step 11: Pop the operand from stack2 and print it as the prefix expression.
Step 12: Stop.

The reverse conversions are left as an exercise to the students.

Exercises:

1. If we are required to reverse a stack, one way is to pop each element from the existing stack and push it onto another stack. Thus it is possible to reverse a stack using a second stack. This is obvious, but when we are required to use a queue for the same purpose, we use the following steps (algorithm):

    1. Pop a value from the stack.
    2. Add that value to the queue.
    3. Repeat the above steps till the stack is empty.
    4. Now the stack is empty and the queue contains all the elements.
    5. Delete a value from the queue and push it onto the stack.
    6. Repeat from step 5 till the queue is empty.

The value that was popped from the stack first will also be the first value deleted from the queue, and is therefore the first value pushed back onto the stack. Thus the top value of the original stack will become the bottom value, and hence the stack will be reversed.
Use the functions written for stacks and queues to write the program.

2. A double-ended queue is a linear list in which additions and deletions may be made at either end. Write functions to add and delete elements from either end of the queue.

3. Write a program, using stacks, to check whether a given string is a palindrome. A string is a palindrome when it reads the same in both directions. Remember that we are not supposed to store the string in an array of characters. The general logic is to remember the first character and push all the others onto the stack. When the string ends, pop a character from the stack; it will be the last character, and it should be equal to the first character remembered. Now pop the next character, and repeat the procedure.

6.5 QUEUES

6.5.1 INTRODUCTION

Queues arise quite naturally in the computer solution of many problems. Perhaps the most common occurrence of a queue in computer applications is the scheduling of jobs. A queue is a linear list which has two ends, one for insertion of elements and the other for deletion of elements. The first end is called 'Rear' and the latter is called 'Front'. Elements are inserted at the Rear end and deleted from the Front end. Queues are called First-In-First-Out (FIFO) lists, since the first element in a queue will be the first element out of the queue. In other words, the order in which the elements enter a queue is the order in which they leave.

[Figure 5.1: A possible queue, with its Rear and Front ends marked]

6.5.2 ARRAY IMPLEMENTATION OF QUEUES

Queues may be represented in the computer in various ways, usually by means of one-way lists or linear arrays. Unless otherwise stated or implied, each queue will be maintained by a linear array queue[ ] and two integer variables Front and Rear, containing the location of the front element of the queue and the location of the rear element of the queue. Additions to the queue take place at the rear. Deletions are made from the front.
So, if a job is submitted for execution, it joins at the rear of the job queue.

[Figure 5.2: Queue represented by an array, with a1 at Front, followed by a2, a3, ..., up to Rear]

When we represent a queue through an array, we have to predefine the size of the queue, and we cannot enter more elements than that predefined size, say MAX. Rear holds the position of the most recently inserted element, so initially Rear = -1. Front points to the first inserted element which is not yet deleted, so initially Front = 0. Therefore the condition for an empty queue is

    FRONT = 0
    REAR = -1

(more generally, FRONT > REAR). When FRONT = REAR there is exactly one element in the queue.

[Figure 5.3: (a) Empty queue, REAR = -1, FRONT = 0. (b) Queue with exactly one element a1, FRONT = REAR]

Whenever an element is added to the queue, the value of REAR is increased by 1; this can be implemented as:

    REAR = REAR + 1;   or   REAR++;

provided that REAR is less than MAX-1. REAR = MAX-1 is the condition for a full queue.

6.5.2.1 FUNCTION TO INSERT AN ELEMENT INTO THE QUEUE

Before inserting any element into the queue we must check whether the queue is full.

    /* function to check whether queue is full */
    int q_full(int rear)
    {
        if (rear == SIZE - 1)
            return 1;
        else
            return 0;
    }

This function returns 1 if the queue is full. Since we place rear at -1 to denote the empty queue, rear varies from 0 to SIZE-1 when the queue stores elements. Therefore rear at SIZE-1 denotes that the queue is full.

    /* function to insert an element into the queue */
    void add_q(int a[], int *r, int val)
    {
        if (!q_full(*r)) {
            *r = *r + 1;
            a[*r] = val;
        } else {
            printf("\n QUEUE FULL");
        }
    }

The call for the function will be

    add_q(a, &rear, value);

6.5.2.2 FUNCTION TO DELETE AN ELEMENT FROM THE QUEUE

Before deleting any element from the queue we must check whether the queue is empty. In that case we cannot delete an element from the queue.
    /* function to check whether queue is empty */
    int q_empty(int front, int rear)
    {
        if (front > rear)
            return 1;
        else
            return 0;
    }

This function returns 1 if the queue is empty.

    /* function to delete an element from the queue */
    int delete_q(int a[], int *f, int *r)
    {
        int val;
        if (!q_empty(*f, *r)) {
            val = a[*f];
            *f = *f + 1;        /* the new front position */
            if (*f > *r) {      /* queue has become empty: start afresh */
                *f = 0;
                *r = -1;
            }
            return val;
        } else {
            printf("\n QUEUE EMPTY");
            return 0;
        }
    }

The major problem in the above implementation is that whenever we remove an element from the queue, front is incremented and the location used by that element cannot be used again. This problem can be solved if we shift all the elements to the left by one location on every delete operation, but this is very time consuming and is not an effective way of solving the problem.

Example:

[Figure 5.4: Array queue with empty space at positions 0 to FRONT-1, elements a3 ... an-1, an, and REAR at position MAX-1]

Suppose the queue becomes full, i.e. REAR = MAX-1, but we have deleted some elements from the queue, so that FRONT now points to A[i] with i > 0, as shown in Fig. 5.4. We cannot insert any element in the queue although there is space in the array, and if we utilize that space by simply inserting elements at positions A[0], A[1], ..., A[i-1], we violate the FIFO property of the queue. One way out is to move the entire queue to the beginning of the array, changing FRONT and REAR accordingly, and then insert the new item as above. This procedure may be very expensive. We can overcome this drawback through the circular representation of the queue.

6.6. CIRCULAR QUEUE

The above problem can be solved only if the first position in the array is logically the next position after the last position of the array. In this way we can say that the array is circular in nature, because every position in the array has a logical next position in the array.
The queue that we handle using this approach is called the circular queue. Remember that it is not an infinite queue; we simply reuse the empty locations effectively. Now all the functions which we have written previously will change. The fundamental operation in this case is finding the logical next position for any given position in the array.

We assume that Queue[ ] is circular, that is, Queue[0] comes after Queue[MAX-1] in the array. Because we have to reset the value of REAR and FRONT from MAX-1 to 0 while inserting or deleting elements, we cannot use REAR = REAR+1 and FRONT = FRONT+1. Instead we use the following assignments:

    REAR = (REAR+1) % MAX   and   FRONT = (FRONT+1) % MAX

which increment REAR and FRONT from 0 to MAX-1 and, when needed, reset them from MAX-1 to 0. Similarly, the conditions for empty and full queues cannot be as before. Instead we assume that the queue is empty when

    REAR = FRONT

and full when

    (REAR+1) % MAX = FRONT

[Figure 5.5: Circular arrangement of array positions 0, 1, 2, 3, ..., MAX-1]

Remember that in the linear queue FRONT = REAR meant that there was exactly one element in the queue. Here we cannot use that convention: if every location were allowed to hold an element, a full queue and an empty queue would both give FRONT = REAR and could not be told apart. Hence we have to sacrifice one space in this implementation, and this is the only drawback of this scheme.

[Figure 5.6: (a) Empty queue, FRONT = REAR = i. (b) Full queue, REAR one position behind FRONT]

6.6.1 FUNCTION TO INSERT AN ELEMENT INTO THE QUEUE

    /* CQUEUE[ ] is an array that represents the circular queue.
       FRONT is the position of the first element and REAR is the
       position of the next free location. */
    void insert_CQ(int x)
    {
        if ((REAR + 1) % MAX == FRONT) {
            printf("Q_FULL");
            return;
        }
        CQUEUE[REAR] = x;
        REAR = (REAR + 1) % MAX;
    }
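The empty and full conditions above can be put together in one small, self-contained sketch. It is only an illustration under stated assumptions: the structure cqueue and the functions cq_init, cq_empty, cq_full, cq_insert and cq_delete are hypothetical names, front is taken to be the index of the first element and rear the index of the next free location, and the functions return status codes instead of printing messages.

```c
#include <assert.h>
#include <stddef.h>

#define MAX 5   /* assumed array size; at most MAX-1 elements are stored */

struct cqueue {
    int items[MAX];
    int front;   /* index of the first element            */
    int rear;    /* index of the next free location       */
};

void cq_init(struct cqueue *q)        { q->front = q->rear = 0; }
int  cq_empty(const struct cqueue *q) { return q->front == q->rear; }
int  cq_full(const struct cqueue *q)  { return (q->rear + 1) % MAX == q->front; }

/* Returns 1 on success, 0 if the queue is full. */
int cq_insert(struct cqueue *q, int x)
{
    if (cq_full(q))
        return 0;
    q->items[q->rear] = x;
    q->rear = (q->rear + 1) % MAX;     /* wraps from MAX-1 back to 0 */
    return 1;
}

/* Returns 1 and stores the deleted element in *x, or 0 if the queue is empty. */
int cq_delete(struct cqueue *q, int *x)
{
    assert(x != NULL);
    if (cq_empty(q))
        return 0;
    *x = q->items[q->front];
    q->front = (q->front + 1) % MAX;   /* wraps from MAX-1 back to 0 */
    return 1;
}
```

With MAX equal to 5, four insertions succeed and the fifth is refused; after one deletion a new element can be inserted again, and it lands in the location freed at the start of the array, which is exactly the reuse the linear queue could not provide.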
6.6.2 FUNCTION FOR DELETION FROM CIRCULAR QUEUE

    int delete_cq()
    {
        int x;
        if (REAR == FRONT) {
            printf("Q_Empty");
            return -1;          /* -1 is used here to signal an empty queue */
        }
        x = CQUEUE[FRONT];
        FRONT = (FRONT + 1) % MAX;
        return x;
    }

Program 6.6.3 Circular Queue with Array Implementation

This program keeps a count of the stored elements instead of sacrificing one location.

    #include <stdio.h>
    #include <string.h>
    #define MAX 10

    int front = 0, rear = -1, count = 0;

    int isfull(void);
    int isempty(void);
    void insert(int queue[], int element);
    int deletq(int queue[]);
    void display(int queue[]);

    int main(void)
    {
        int queue[MAX], element, quit;
        int c;

        printf("Program of queue with array\n");
        quit = 0;
        do {
            printf("\n\tOptions\t\tChoice");
            printf("\n\tInsert\t\tI");
            printf("\n\tDelete\t\tD");
            printf("\n\tExit\t\tE");
            printf("\n\tEnter choice : ");
            do
                c = getchar();
            while (strchr("IiDdEe", c) == NULL);
            switch (c) {
            case 'I':
            case 'i':
                printf("\nEnter an element to be inserted : ");
                scanf("%d", &element);
                insert(queue, element);
                display(queue);
                break;
            case 'D':
            case 'd':
                deletq(queue);
                display(queue);
                break;
            case 'E':
            case 'e':
                quit = 1;
            }
        } while (!quit);
        printf("\n");
        return 0;
    }

    int isfull(void)
    {
        if (count == MAX) {
            printf("\nQueue overflow....can't insert\n");
            return 1;
        }
        return 0;
    }

    int isempty(void)
    {
        if (count == 0) {
            printf("\nQueue underflow....can't delete\n");
            return 1;
        }
        return 0;
    }

    void insert(int queue[], int element)
    {
        if (!isfull()) {
            rear = (rear + 1) % MAX;
            queue[rear] = element;
            count++;
        }
    }

    void display(int queue[])
    {
        int c, i;
        if (count == 0) {
            printf("\ncircular queue is empty");
        } else {
            i = front;
            for (c = 1; c <= count; c++) {
                printf("%6d\n", queue[i]);
                i = (i + 1) % MAX;
            }
        }
    }

    /* returns the deleted element, or -1 if the queue was empty */
    int deletq(int queue[])
    {
        int element = -1;
        if (!isempty()) {
            element = queue[front];
            front = (front + 1) % MAX;
            count--;
        }
        return element;
    }

6.7 DEQUES

A deque (pronounced either "deck" or "dequeue") is a linear list in which elements can be added or removed at either end but not in the middle. The term deque is a contraction of the name double-ended queue. There are two variations of a deque, namely:

1. An Input Restricted deque
2.
An Output Restricted deque

Specifically, an input-restricted deque is a deque which allows insertions at only one end of the list but allows deletions at both ends of the list; and an output-restricted deque is a deque which allows deletions at only one end of the list but allows insertions at both ends of the list.

[Figure 5.7: (a) Input Restricted Deque. (b) Output Restricted Deque]

We will assume our deque is maintained by a circular array DEQUE with pointers LEFT and RIGHT, which point to the two ends of the deque. We assume that the elements extend from the left end to the right end in the array. The term "circular" comes from the fact that we assume that DEQUE[1] comes after DEQUE[N] in the array. Figure 6.18 pictures two deques, each with 4 elements, maintained in an array with N = 8 memory locations. The condition LEFT = NULL will be used to indicate that a deque is empty.

    (a) LEFT: 4   RIGHT: 7

        Position:  1     2     3     4     5     6     7     8
        DEQUE:                       AAA   BBB   CCC   DDD

    (b) LEFT: 7   RIGHT: 2

        Position:  1     2     3     4     5     6     7     8
        DEQUE:     YYY   ZZZ                           WWW   XXX

    Fig 6.18

The procedures which insert and delete elements in deques, and the variations on those procedures, are given as supplementary problems. As with queues, a complication may arise (a) when there is overflow, that is, when an element is to be inserted into a deque which is already full, or (b) when there is underflow, that is, when an element is to be deleted from a deque which is empty. The procedures must consider these possibilities.

6.8. PRIORITY QUEUES

A priority queue is a collection of elements such that each element has been assigned a priority, and the order in which elements are deleted and processed comes from the following rules.

1. An element of higher priority is processed before any element of lower priority.
2. Two elements with the same priority are processed according to the order in which they were added to the queue.
A prototype of a priority queue is a timesharing system: programs of high priority are processed first, and programs with the same priority form a standard queue.

There can be two types of implementations of a priority queue:

    i) Ascending Priority Queue
    ii) Descending Priority Queue

A collection of items into which items can be inserted arbitrarily and from which only the smallest item can be removed is called an Ascending Priority Queue. In a Descending Priority Queue only the largest item is deleted. The elements of a priority queue need not be numbers or characters that can be compared directly. They may be complex structures that are ordered by one or several fields. Sometimes the field on which the elements of a priority queue are ordered is not even part of the elements themselves.

Array Representation of a Priority Queue

Another way to maintain a priority queue in memory is to use a separate queue for each level of priority (or for each priority number). Each such queue will appear in its own circular array and must have its own pair of pointers, Front and Rear. In fact, if each queue is allocated the same amount of space, a two-dimensional array Queue can be used instead of the linear arrays. Figure 5.8 indicates this representation. Observe that Front(K) and Rear(K) contain, respectively, the positions of the front and rear elements of row K of Queue, the row that maintains the queue of elements with priority number K.

[Figure 5.8: The arrays FRONT and REAR shown beside the two-dimensional array QUEUE, one row per priority number]

The following are outlines of algorithms for deleting and inserting elements in a priority queue that is maintained in memory by a two-dimensional array QUEUE, as above. The details of the algorithms are left as exercises.
Algorithm: This algorithm deletes and processes the first element in a priority queue maintained by a two-dimensional array QUEUE.

1. [Find the first nonempty queue.] Find the smallest K such that FRONT[K] ≠ NULL.
2. Delete and process the front element in row K of QUEUE.
3. Exit.

When a priority queue is instead maintained in memory as a one-way list, adding an element is much more complicated than deleting an element, because we need to find the correct place to insert the element. An outline of the algorithm follows.

Algorithm 6.14: This algorithm adds an ITEM with priority number N to a priority queue which is maintained in memory as a one-way list.

(a) Traverse the one-way list until finding a node X whose priority number exceeds N. Insert ITEM in front of node X.
(b) If no such node is found, insert ITEM as the last element of the list.

The above insertion algorithm may be pictured as a weighted object "sinking" through layers of elements until it meets an element with a heavier weight. The details of the above algorithm are left as an exercise. The main difficulty in the algorithm comes from the fact that ITEM is inserted before node X. This means that, while traversing the list, one must also keep track of the address of the node preceding the node being accessed.

Summary

Once again we see the time-space tradeoff when choosing between different data structures for a given problem. The array representation of a priority queue is more time-efficient than the one-way list. This is because when adding an element to a one-way list, one must perform a linear search on the list. On the other hand, the one-way list representation of the priority queue may be more space-efficient than the array representation. This is because in using the array representation, overflow occurs when the number of elements in any single priority level exceeds the capacity for that level, but in using the one-way list, overflow occurs only when the total number of elements exceeds the total capacity.
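The outline of Algorithm 6.14 can be sketched in C as follows. This is a minimal illustration, not the worked exercise: the node type pnode, its fields item, prn and link, and the functions pq_insert and pq_delete are hypothetical names, and a smaller priority number is assumed to mean a higher priority, as in the deletion algorithm above.

```c
#include <assert.h>
#include <stdlib.h>

struct pnode {
    int item;
    int prn;              /* priority number: smaller means higher priority */
    struct pnode *link;
};

/* Insert item with priority number n. Returns 1 on success, 0 if no memory. */
int pq_insert(struct pnode **head, int item, int n)
{
    struct pnode *node, *prev = NULL, *cur;

    assert(head != NULL);
    node = malloc(sizeof *node);
    if (node == NULL)
        return 0;
    node->item = item;
    node->prn = n;

    /* Traverse until a node X whose priority number exceeds n is found,
       remembering the node that precedes it. */
    cur = *head;
    while (cur != NULL && cur->prn <= n) {
        prev = cur;
        cur = cur->link;
    }

    node->link = cur;           /* cur is X, or NULL if no such node exists */
    if (prev == NULL)
        *head = node;           /* new first element */
    else
        prev->link = node;      /* insert in front of X (or at the end) */
    return 1;
}

/* Delete the first element. Returns 1 and stores it in *item, or 0 if empty. */
int pq_delete(struct pnode **head, int *item)
{
    struct pnode *first;

    assert(head != NULL && item != NULL);
    if (*head == NULL)
        return 0;
    first = *head;
    *item = first->item;
    *head = first->link;
    free(first);
    return 1;
}
```

Because the traversal skips nodes whose priority number is equal to N, elements of the same priority leave in the order in which they were added. Deletion only touches the head of the list, whereas insertion needs the linear search mentioned in the summary.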
Another alternative is to use a linked list for each priority level.

UNIT 7 LINKED LISTS

7.1. INTRODUCTION
7.2. SINGLY LINKED LISTS
    7.2.1. IMPLEMENTATION OF LINKED LIST
        7.2.1.1. INSERTION OF A NODE AT THE BEGINNING
        7.2.1.2. INSERTION OF A NODE AT THE END
        7.2.1.3. INSERTION OF A NODE AFTER A SPECIFIED NODE
        7.2.1.4. TRAVERSING THE ENTIRE LINKED LIST
        7.2.1.5. DELETION OF A NODE FROM LINKED LIST
7.3. CONCATENATION OF LINKED LISTS
7.4. MERGING OF LINKED LISTS
7.5. REVERSING OF LINKED LIST
7.6. DOUBLY LINKED LIST
    7.6.1. IMPLEMENTATION OF DOUBLY LINKED LIST
7.7. CIRCULAR LINKED LIST
7.8. APPLICATIONS OF THE LINKED LISTS

7.1. INTRODUCTION

We have seen the representation of linear data structures using the sequential allocation method of storage, as in arrays. But this is unacceptable in cases like:

a) UNPREDICTABLE STORAGE REQUIREMENTS: The exact amount of data storage required by the program varies with the amount of data being processed. This amount may not be known at the time we write the program and is determined only later. For example, linked allocation is very beneficial in the case of polynomials. When we add two polynomials and none of their degrees match, the resulting polynomial has a size equal to the sum of the sizes of the two polynomials being added. In such cases we can generate nodes (allocate memory to the data members) whenever required, if we use the linked representation (dynamic memory allocation).

b) EXTENSIVE DATA MANIPULATION TAKES PLACE: Operations like insertion and deletion are to be performed frequently on the list.

Pointers are used for dynamic memory allocation. These pointers are always of the same length regardless of which data element they point to (int, float, struct, etc.). This enables the manipulation of pointers to be performed in a uniform manner using simple techniques. Pointers make us capable of representing a much more complex relationship between the elements of a data structure than a linear order.
The use of pointers or links to refer to the elements of a data structure implies that elements which are logically adjacent need not be physically adjacent in memory, just like family members who are dispersed but still bound together.

7.2. Singly Linked List [or] One way chain

This is a list which consists of an ordered set of elements that may vary in number. Each element in the linked list is called a node. A node in a singly linked list consists of two parts: an information part, where the actual data is stored, and a link part, which stores the address of the successor (next) node in the list. The order of the elements is maintained by this explicit link between them. The typical node is as shown:

    | INFO | LINK |
       NODE

    Fig 1. Structure of a Node

In figure 2, the arrows represent the links. The data part of each node consists of the marks obtained by a student and the next part is a pointer to the next node. The NULL in the last node indicates that this node is the last node in the list and has no successor at present. In that example the data part has a single element, marks, but you can have as many elements as you require, such as the student's name, class, etc.

We have to consider a logically ordered list, i.e. the elements are stored in different memory locations but they are linked to each other and form a logical list as in Fig. 3.1. The link represents that each element has the address of its logical successor element in the list.

    A1 -> A2 -> A3 -> A4 -> ...... -> An

    Figure 3.1: Logical List

We can understand this concept through a real-life example. Suppose there is a list of 8 friends, x1, x2, ......, x8. Each friend resides at a different location in the city. x1 knows the address of x2, x2 knows the address of x3, and so on; x7 has the address of x8. If one wants to go to the house of x8 and does not know its address, he will go to x1 to get the address of x2, then to x2, and so on (Fig 3.2).
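The friends example can be written down as a small sketch. Everything in it is an assumption made for illustration: the type friend_node, its fields id and next, and the functions link_chain and hops_to are hypothetical names. The nodes are deliberately stored in the reverse of their logical order, to show that only the links decide who follows whom.

```c
#include <assert.h>
#include <stddef.h>

struct friend_node {
    int id;                       /* friend number: 1 for x1, 2 for x2, ... */
    struct friend_node *next;     /* address of the next friend, or NULL    */
};

/* Link n nodes into the chain x1 -> x2 -> ... -> xn and return x1.
   Friend k is stored at nodes[n-k], so logical neighbours are linked
   by address, not by their position in the array. */
struct friend_node *link_chain(struct friend_node nodes[], int n)
{
    int j;

    assert(n > 0);
    for (j = 0; j < n; j++) {
        nodes[j].id = n - j;
        nodes[j].next = (j > 0) ? &nodes[j - 1] : NULL;
    }
    return &nodes[n - 1];
}

/* Start at first and follow the stored addresses until friend target is
   reached. Returns the number of links followed, or -1 if not found. */
int hops_to(const struct friend_node *first, int target)
{
    const struct friend_node *p;
    int hops = 0;

    for (p = first; p != NULL; p = p->next) {
        if (p->id == target)
            return hops;
        hops++;
    }
    return -1;
}
```

With 8 friends, reaching x8 from x1 means following 7 links, one for each address handed over along the way; there is no way to jump directly to x8 without them.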
Consider an example in which the marks obtained by students are stored in a linked list, as shown in Fig 2.

[Fig 2. Singly linked list; each node is | data | next |: 62 -> 72 -> 82 -> 34 -> NULL]

A linked list is like a dispersed family whose members are still bound together. From the above discussion it is clear that a linked list is a collection of elements called nodes, each of which stores two items of information:

· An element of the list
· A link, i.e. the address of another node

[Figure 3.2: x1 | Add. of x2 -> x2 | Add. of x3 -> x3 -> ... -> x8 | NULL]

The link is provided through a pointer that indicates the location of the node containing the successor of this list element. The NULL in the last node indicates that this is the last node in the list.

REPRESENTATION OF LINKED LIST

Because each node contains two parts, we represent each node by a structure. The definition of a linked list node is recursive (self-referential):

    struct node
    {
        int data;
        struct node *link;
    };

link is a pointer of type struct node *, i.e. it can hold the address of a variable of type struct node. Pointers permit the referencing of structures in a uniform way, regardless of the organization of the structure being referenced. Pointers are capable of representing a much more complex relationship between the elements of a structure than a linear order.

Initialization:

    int main(void)
    {
        struct node *p, *list, *temp;
        list = p = temp = NULL;
        . . .
    }

7.2.1 IMPLEMENTATION OF LINKED LIST

A linked list is a linear data structure with the following operations defined on it:

· Insertion of a node at the beginning
· Insertion of a node at the end
· Insertion of a node after a specified node
· Deletion of a particular node from the list
· Traversing of the entire linked list

7.2.1.1. INSERTION OF A NODE AT THE BEGINNING

We have a linked list whose first element is pointed to by the pointer list. We take the data for the new node as input from the user and point to the new node through temp.
Now we attach temp to the list by putting the address held in list into the link field of the node pointed to by temp (Figure 3.3). Then we update the pointer list by storing in it the address contained in temp.

[Figure 3.3]

This can be accomplished as shown below. In the functions of this section, list, p and temp are assumed to be pointers of type struct node * declared at file scope, and <stdio.h> and <stdlib.h> are assumed to be included.

    void addbeg(void)
    {
        int x;
        temp = malloc(sizeof(struct node));
        scanf("%d", &x);
        temp->data = x;
        temp->link = list;      /* new node points to the old first node */
        list = temp;            /* new node becomes the first node */
    }

7.2.1.2 INSERTION OF A NODE AT THE END

We traverse the list until NULL (i.e. the end of the list) is found. The traversal uses an additional pointer p, while the start pointer list stays fixed at the beginning of the linked list. When p reaches the last node, we attach temp to it by putting the address of the node pointed to by temp in the link field of p (Figure 3.4).

[Figure 3.4]

    void addend(void)
    {
        int x;
        temp = malloc(sizeof(struct node));
        scanf("%d", &x);
        temp->data = x;
        temp->link = NULL;
        if (list == NULL)               /* initially empty list */
            list = temp;
        else
        {
            p = list;
            while (p->link != NULL)     /* stop at the last node */
                p = p->link;
            p->link = temp;
        }
    }

7.2.1.3. INSERTION OF A NODE AFTER A SPECIFIED NODE

Traverse the list until the node with the specified value is found or the end of the list is reached. If the end of the list is reached, print the message "Number Not Found"; otherwise insert the node pointed to by temp between the nodes pointed to by p and p->link (p is used to traverse the list), as in Figure 3.5.

[Figure 3.5]

    void insert(void)
    {
        int num, x;
        scanf("%d", &num);              /* num is the data to be found */
        scanf("%d", &x);                /* x is the data to be inserted */
        if (list == NULL)
        {
            printf("List is Empty");
            return;
        }
        p = list;
        while (p != NULL && p->data != num)
            p = p->link;
        if (p == NULL)
        {
            printf("Number Not Found");
            return;
        }
        /* number found at the node pointed to by p */
        temp = malloc(sizeof(struct node));
        temp->data = x;
        temp->link = p->link;
        p->link = temp;
    }

7.2.1.4. TRAVERSING THE ENTIRE LINKED LIST

    void display(void)
    {
        if (list == NULL)
            printf("List is Empty");
        else
        {
            p = list;
            while (p != NULL)
            {
                printf("%d ", p->data);
                p = p->link;
            }
        }
    }

7.2.1.5. DELETION OF A NODE FROM LINKED LIST

Search for the node to be deleted by traversing the list through the pointer p. If the end of the list is reached, print the message "No. Not Found"; otherwise store the address of the successor of the node to be deleted in the link field of p, and free the node to be deleted (Figure 3.6).

[Figure 3.6: Node to be deleted]

    int delete(void)
    {
        int x, num;
        struct node *del;
        scanf("%d", &num);              /* num is the data to be deleted */
        if (list == NULL)
        {
            printf("List is Empty");
            return -1;                  /* -1 is used here to signal failure */
        }
        if (list->data == num)          /* the first node is to be deleted */
        {
            del = list;
            list = list->link;
        }
        else
        {
            p = list;
            while (p->link != NULL && p->link->data != num)
                p = p->link;
            if (p->link == NULL)
            {
                printf("No. Not Found");
                return -1;
            }
            del = p->link;              /* p->link is the node to be deleted */
            p->link = p->link->link;
        }
        x = del->data;
        free(del);
        return x;
    }

7.3. CONCATENATION OF LINKED LISTS

Consider a case where we have two linked lists pointed to by two different pointers, say p and q, and we want to concatenate the second list at the end of the first list. We can do this by traversing the first list to its end and then storing the address of the first node of the second list in the link field of the last node of the first list. If we traverse the first list with a pointer temp, the lists are concatenated by the statement temp->link = q; (Figure 3.7).

[Figure 3.7: (a) Lists before concatenation, (b) List after concatenation]

The function to achieve this is given below. It returns a pointer to the first node of the concatenated list.

    struct node *concatenate(struct node *p, struct node *q)
    {
        struct node *temp;
        if (p == NULL)                  /* if the first list is empty, the */
            return q;                   /* result is just the second list  */
        temp = p;
        while (temp->link != NULL)
            temp = temp->link;
        temp->link = q;
        return p;
    }

7.4. MERGING OF LINKED LISTS

Suppose we have two linked lists pointed to by two different pointers p and q, and we wish to merge the two lists into a third list pointed to by z.
While carrying out this merging we wish to ensure that elements common to both lists occur only once in the third list. The function to achieve this is given below. It is assumed that both lists are sorted in ascending order; the resulting third list will also be sorted. The function returns a pointer to the first node of the merged list.

    struct node *merge(struct node *p, struct node *q)
    {
        struct node head, *z = &head;   /* dummy first node simplifies appending */
        head.link = NULL;
        while (p != NULL && q != NULL)
        {
            z->link = malloc(sizeof(struct node));
            z = z->link;
            if (p->data < q->data)
            {
                z->data = p->data;
                p = p->link;
            }
            else if (p->data > q->data)
            {
                z->data = q->data;
                q = q->link;
            }
            else                        /* common element: copy it once */
            {
                z->data = p->data;
                p = p->link;
                q = q->link;
            }
        }
        while (p != NULL)               /* copy what remains of the first list */
        {
            z->link = malloc(sizeof(struct node));
            z = z->link;
            z->data = p->data;
            p = p->link;
        }
        while (q != NULL)               /* copy what remains of the second list */
        {
            z->link = malloc(sizeof(struct node));
            z = z->link;
            z->data = q->data;
            q = q->link;
        }
        z->link = NULL;
        return head.link;
    }

7.5. REVERSING OF LINKED LIST

Suppose we have a linked list pointed to by p. In order to reverse it we need two more pointers, q and r, of type struct node *. We traverse the list through p, letting q trail p and r trail q, and at each step we set q's link to r. The function to achieve this is given below; it returns a pointer to the first node of the reversed list.

    struct node *reverse(struct node *p)
    {
        struct node *q, *r;
        q = NULL;
        while (p != NULL)       /* p traverses the list till the end */
        {
            r = q;              /* r trails q */
            q = p;              /* q trails p */
            p = p->link;        /* p moves to the next node */
            q->link = r;        /* link q to the preceding node */
        }
        return q;               /* q points to the old last node */
    }

7.6. DOUBLY LINKED LISTS

In a singly linked list each node provides information only about where the next node is in the list. This causes difficulty: if we are pointing to a specific node, we can move only in the direction of the links. The node has no information about where the previous node lies in memory. The only way to find the node which precedes a specific node is to start again at the beginning of the list. The same problem arises when one wishes to delete an arbitrary node from a singly linked list.
To delete an arbitrary node easily, one must know the node that precedes it. This problem is avoided by using a doubly linked list, in which each node stores not only the address of the next node but also the address of the previous node. A node in a doubly linked list has three fields: Left Link, Data and Right Link (Figure 3.10).

[Figure 3.10: Node of doubly linked list: L LINK | DATA | R LINK]

The left link keeps the address of the previous node and the right link keeps the address of the next node. For a node p that has both a predecessor and a successor, a doubly linked list has the following property (Figure 3.11):

    p = p->llink->rlink = p->rlink->llink

[Figure 3.11]

This formula reflects the essential virtue of this structure, namely, that one can go back and forth with equal ease.

7.6.1 IMPLEMENTATION OF DOUBLY LINKED LIST

The structure of a node of a doubly linked list can be defined as:

    struct node
    {
        int data;
        struct node *llink;
        struct node *rlink;
    };

One operation that can be performed on a doubly linked list is to delete a given node pointed to by p. The function for this operation is given below. If p is the first node of the list, the caller must also move its pointer to the start of the list on to p->rlink before calling the function.

    void delete(struct node *p)
    {
        if (p == NULL)
            printf("Node Not Found");
        else
        {
            if (p->llink != NULL)               /* p is not the first node */
                p->llink->rlink = p->rlink;
            if (p->rlink != NULL)               /* p is not the last node */
                p->rlink->llink = p->llink;
            free(p);
        }
    }

7.7. CIRCULAR LINKED LIST

The circular linked list is, besides the doubly linked list, another remedy for the drawbacks of the singly linked list. A slight change to the structure of a linear list converts it into a circular linked list: the link field of the last node contains a pointer back to the first node rather than NULL (Figure 3.12).

[Figure 3.12: Circular linked list]

From any point in such a list it is possible to reach any other point in the list. If we begin at a given node and traverse the entire list, we ultimately end up at the starting point.

7.8 APPLICATIONS OF THE LINKED LISTS

In computer science, linked lists are used extensively in database management systems, process management, operating systems, editors, etc. Earlier we saw how singly linked lists and doubly linked lists can be implemented using pointers.
We also saw that, when arrays are used, the list of items to be stored is very often either much shorter or much longer than the declared size of the array. Moreover, during program execution the list cannot grow beyond the declared size of the array. Also, operations like insertion and deletion at a specified location in a list require a lot of movement of data, leading to an inefficient and time-consuming algorithm.

The primary advantage of a linked list over an array is that the linked list can grow or shrink in size during its lifetime. In particular, the maximum size of a linked list need not be known in advance. In practical applications this often makes it possible to have several data structures share the same space, without paying particular attention to their relative sizes at any time.

The second advantage, the flexibility of allowing items to be rearranged efficiently, is gained at the expense of quick access to an arbitrary item in the list. In an array we can access any item in the same amount of time, since no traversal is required.

We are not suggesting that you should not use arrays at all. There are several applications where using arrays is more beneficial than using linked lists. We must select a particular data structure depending on the requirements.

SUMMARY

· A linked list is a linear data structure based on dynamic memory allocation (i.e. memory allocation at run time rather than at compile time).
· A linked list has several advantages over an array because of dynamic memory allocation.
· A linked list does not face the problems of overflow and memory wastage that arise from static memory allocation.
· Each node of a linked list contains the address of its successor node.
· Various operations such as insertion, deletion and traversal can be performed on any linked list.
· Two lists can be merged and concatenated by link manipulation.
· A linear (singly) linked list cannot be traversed in the reverse direction. This drawback is removed in the doubly linked list and eased in the circular linked list, where any node can be reached from any other.
· In a circular linked list the link pointer of the last node points to the first node of the list instead of being NULL.
· In a doubly linked list each node contains the address of its previous node in addition to the address of the next node. This provides two-way traversal.

UNIT 8 GRAPHS

8.1 INTRODUCTION
8.2 ADJACENCY MATRIX AND ADJACENCY LISTS
8.3 GRAPH TRAVERSAL
8.3.1 DEPTH FIRST SEARCH (DFS)
8.3.1.1 IMPLEMENTATION
8.3.2 BREADTH FIRST SEARCH (BFS)
8.3.2.1 IMPLEMENTATION
8.4 SHORTEST PATH PROBLEM
8.5 MINIMAL SPANNING TREE
8.6 OTHER TASKS

8.1. INTRODUCTION

A graph is a collection of vertices and edges, G = (V, E), where V is the set of vertices and E is the set of edges. An edge is a pair of vertices, which are then said to be adjacent to each other. When these pairs are ordered, the graph is known as a directed graph. Graphs have many properties and are very important because they represent many practical situations, such as networks.

In the current discussion we are interested in the algorithms used for most problems related to graphs: checking connectivity, depth first search and breadth first search, finding a path from one vertex to another, finding multiple paths from one vertex to another, finding the number of components of a graph, and finding the critical vertices and edges. The basic problem with a graph is how to represent it for programming.

8.2. ADJACENCY MATRIX AND ADJACENCY LISTS

To represent a graph we can use the adjacency matrix, i.e. a matrix whose rows and columns both correspond to the vertices. When the element in the ith row and jth column of such a matrix is 1, there is an edge between the ith and jth vertices. When there is no edge, the value is zero. The other representation is to prepare an adjacency list for each vertex.
Now we will see an example of a graph and how an adjacency matrix can be written for it. We will also see the adjacency relations expressed in the form of linked lists. For example, consider the graph in Fig 4.

[Fig 4. Graph with vertices 0 to 6]

The adjacency matrix will be

         0  1  2  3  4  5  6
    0    0  1  0  1  0  0  0
    1    1  0  1  1  0  0  0
    2    0  1  0  0  1  1  0
    3    1  1  0  0  0  0  1
    4    0  1  1  0  0  1  1
    5    0  0  1  0  1  0  1
    6    0  0  0  1  1  1  0

Fig 5. Adjacency matrix representation of the graph in Fig 4

ADJACENCY LIST REPRESENTATION

In this representation, we store a graph as a linked structure. We store all the vertices in a list, and for each vertex we keep a linked list of its adjacent vertices. Let us see it through an example. Consider the graph given in Figure 13.

[Figure 13]

The adjacency list representation needs a list of all the nodes of the graph, each with a linked list of its adjacent nodes, as in Figure 14.

[Figure 14: Adjacency list structure for the graph in Figure 13]

Note that adjacent vertices may appear in the adjacency list in arbitrary order. Also, an arrow from v2 to v3 in the list linked to v1 does not mean that v2 and v3 are adjacent. The adjacency list representation is better for sparse graphs because the space required is O(V + E), as contrasted with the O(V^2) required by the adjacency matrix representation.

8.3 GRAPH TRAVERSAL

A graph traversal means visiting all the nodes of the graph. Graph traversal is needed in many application areas, and there are many methods for visiting the vertices of a graph. The two traversal methods discussed in this section are commonly used and are also efficient. These are

A) Depth First Search or DFS; and
B) Breadth First Search or BFS

8.3.1. DEPTH FIRST SEARCH (DFS)

In a graph, no start vertex or special vertex is singled out to begin the traversal from. Therefore the traversal may start from any arbitrary vertex. We start with, say, vertex v. An adjacent vertex is selected and a depth first search is initiated from it. That is, let v1, v2, ..., vk be the vertices adjacent to v. We may select any vertex from this list; say we select v1. Now all the vertices adjacent to v1 are identified and all of them are visited; next v2 is selected and all its adjacent vertices are visited, and so on. This process continues till all the vertices are visited. It is quite possible that we reach an already traversed vertex a second time. Therefore we have to set a flag somewhere to check whether a vertex has already been visited. Let us see it through an example. Consider the following graph.

[Fig 15: Example graph for DFS]

Let us start with v1. Its adjacent vertices are v2, v8 and v3. Let us pick v2. Its adjacent vertices are v1, v4 and v5. v1 is already visited. Let us pick v4. Its adjacent vertices are v2 and v8. v2 is already visited. Let us visit v8. Its adjacent vertices are v4, v5, v1, v6 and v7. v4 and v1 are already visited. Let us traverse v5. Its adjacent vertices are v2 and v8. Both are already visited, therefore we backtrack. We had v6 and v7 unvisited in the list of v8. We may visit either; we visit v6. Its adjacent vertices are v8 and v3. Obviously the choice is v3. Its adjacent vertices are v1 and v7. We visit v7. All the adjacent vertices of v7 are already visited, so we backtrack and find that we have visited all the vertices.

Therefore the sequence of traversal is v1, v2, v4, v8, v5, v6, v3, v7. This is not the only sequence possible using this traversal method.

Let us consider another graph, as given in Figure 16. Is v1, v2, v3, v5, v4, v6 a traversal sequence produced by the DFS method?

[Fig. 16: Example graph for DFS]

We may implement the depth first search method by using a stack, pushing all unvisited vertices adjacent to the one just visited and popping the stack to find the next vertex to visit.

8.3.1.1. IMPLEMENTATION

We use an array val[V] to record the order in which the vertices are visited. Each entry in the array is initialized to the value unseen to indicate that no vertex has yet been visited.
The goal is to systematically visit all the vertices of the graph, setting the val entry for the ith vertex visited to i, for i = 1, 2, ..., V. The following program uses a procedure visit that visits all the vertices in the same connected component as the vertex given in the argument.

    void search()
    {
        int k;
        for (k = 1; k <= V; k++)
            val[k] = unseen;
        for (k = 1; k <= V; k++)
            if (val[k] == unseen)
                visit(k);
    }

The first for loop initializes the val array. Then visit is called for the first vertex, which results in the val values for all the vertices connected to that vertex being set to values different from unseen. Then search scans through the val array to find a vertex that has not been seen yet and calls visit for that vertex, continuing in this way until all vertices have been visited. Note that this method does not depend on how the graph is represented or how visit is implemented.

First we consider a recursive implementation of visit for the adjacency list representation: to visit a vertex, we check all its edges to see if they lead to vertices that have not yet been seen; if so, we visit them. In the code below, id is a counter that starts at 0 and z is taken to be the node that marks the end of every adjacency list.

    void visit(int k)       // DFS, adjacency lists
    {
        struct node *t;
        val[k] = ++id;
        for (t = adj[k]; t != z; t = t->next)
            if (val[t->v] == unseen)
                visit(t->v);
    }

We now move to a stack-based implementation, in which Stack is assumed to be a stack type providing push, pop and empty:

    Stack stack(maxV);
    void visit(int k)       // non-recursive DFS, adjacency lists
    {
        struct node *t;
        stack.push(k);
        while (!stack.empty())
        {
            k = stack.pop();
            val[k] = ++id;
            for (t = adj[k]; t != z; t = t->next)
                if (val[t->v] == unseen)
                {
                    stack.push(t->v);
                    val[t->v] = 1;      // touched: already on the stack
                }
        }
    }

Vertices that have been touched but not yet visited are kept on a stack. To visit a vertex, we traverse its edges and push onto the stack any vertex that has not yet been visited and that is not already on the stack. In the recursive implementation, the bookkeeping for the "partially visited" vertices is hidden in the local variable t in the recursive procedure.
The same bookkeeping could be implemented directly by maintaining pointers (corresponding to t) into the adjacency lists, and so on.

Depth-first search immediately solves some basic graph-processing problems. For example, the search procedure is based on finding the connected components in turn; the number of connected components is the number of times visit is called in the last line of search. Testing whether a graph has a cycle is also a trivial modification of the above program. A graph has a cycle if and only if a node that is not unseen, other than the node from which we have just arrived, is discovered in visit. That is, if we encounter an edge pointing to a vertex that we have already visited, then we have a cycle.

8.3.2. BREADTH FIRST SEARCH (BFS)

In DFS we pick one of the adjacent vertices, visit all of its adjacent vertices, and backtrack to visit the unvisited adjacent vertices. In BFS, we first visit all the adjacent vertices of the start vertex, then visit all the unvisited vertices adjacent to these, and so on. Let us consider the same example, given in Figure 15. We start, say, with v1. Its adjacent vertices are v2, v8 and v3. We visit them all, one by one. We pick one of these, say v2. The unvisited vertices adjacent to v2 are v4 and v5. We visit both. We go back to the remaining visited vertices adjacent to v1 and pick one of those, say v3. The unvisited vertices adjacent to v3 are v6 and v7. There are no more unvisited vertices adjacent to v8, v4, v5, v6 and v7.

[Figure 17]

Thus, the sequence so generated is v1, v2, v8, v3, v4, v5, v6, v7. Here we need a queue instead of a stack to implement the search. We add the unvisited vertices adjacent to the one just visited at the rear, and remove a vertex from the front to find the next vertex to visit.

8.3.2.1. IMPLEMENTATION

To implement breadth-first search, we change the stack operations to queue operations in the stack-based search program above (Queue is assumed to be a queue type providing put, get and empty):

    Queue queue(maxV);
    void visit(int k)       // BFS, adjacency lists
    {
        struct node *t;
        queue.put(k);
        while (!queue.empty())
        {
            k = queue.get();
            val[k] = ++id;
            for (t = adj[k]; t != z; t = t->next)
                if (val[t->v] == unseen)
                {
                    queue.put(t->v);
                    val[t->v] = 1;      // touched: already on the queue
                }
        }
    }

The contrast between depth-first and breadth-first search is quite evident when we consider a larger graph. In both cases, the search starts at the node at the bottom left. Depth-first search wends its way through the graph, storing on the stack the points where other paths branch off; breadth-first search "sweeps through" the graph, using a queue to remember the frontier of visited places. Depth-first search "explores" the graph by looking for new vertices far away from the start point, taking closer vertices only when dead ends are encountered; breadth-first search completely covers the area close to the starting point, moving farther away only when everything close has been looked at. Again, the order in which the nodes are visited depends largely upon the order in which the edges appear in the input and upon the effect of this ordering on the order in which vertices appear on the adjacency lists.

Depth-first search was first stated formally hundreds of years ago as a method for traversing mazes. Depth-first search is appropriate for one person looking for something in a maze because the "next place to look" is always close by; breadth-first search is more like a group of people looking for something by fanning out in all directions.

8.4. SHORTEST PATH PROBLEM

We have seen in the graph traversals that we can travel along the edges of a graph. In applications it is very likely that these edges have weights attached to them. A weight may reflect distance, time or some other quantity that corresponds to the cost we incur when we travel along that edge. For example, in the graph in Figure 18, we can go from Delhi to Andaman Nicobar through Madras at a cost of 7 or through Calcutta at a cost of 5. (These numbers may reflect the airfare in thousands.)
In these and many other applications, we are often required to find a shortest path, i.e. a path having the minimum weight between two vertices. In this section, we discuss the problem of finding shortest paths in a directed graph in which every edge has a non-negative weight.

[Figure 18: A graph connecting four cities]

Let us at this stage recall how a path is defined. A path in a graph is a sequence of vertices such that there is an edge that we can follow between each consecutive pair of vertices. The length of the path is the sum of the weights of the edges on that path. The starting vertex of the path is called the source vertex and the last vertex of the path is called the destination vertex. A shortest path from vertex v to vertex w is a path for which the sum of the weights of the arcs or edges on the path is minimum.

Note that a path that looks longer in terms of the number of edges and vertices visited may at times actually be shorter in cost.

There are also different kinds of shortest path problem. In one, we have a single source vertex v and seek a shortest path from v to every other vertex of the graph. This is called the single source shortest path problem.

Consider the weighted graph in Figure 19 with 8 nodes A, B, C, D, E, F, G and H.

[Fig. 19: A weighted graph]

There are many paths from A to H. The length of path AFDEH is 1 + 3 + 4 + 6 = 14. Another path from A to H is ABCEH; its length is 2 + 2 + 3 + 6 = 13. We may look further for a path with length shorter than 13, if one exists. For graphs with a small number of vertices and edges, one may examine all the possible paths to find the shortest. We would then have to repeat this exercise to find the shortest path from A to each of the remaining vertices. This method is obviously not cost effective even for a small graph. There exists an algorithm for solving this problem.
The algorithm works as explained below. Let us consider the graph in Figure 19 once again.

1. We start with the source vertex A.

2. We locate the vertex closest to it, i.e. among the vertices adjacent to A we find the one for which the length of the edge is minimum. Here B and F are the adjacent vertices of A, and Length(AB) > Length(AF). Therefore we choose F.

[Fig. 20 (a)]

3. Now we consider all the vertices adjacent to the newly added vertex (excluding the vertex from which it was reached) together with the remaining adjacent vertices of the earlier vertices, i.e. we have D, E and G (as adjacent vertices of F) and B (as the remaining adjacent vertex of A). We again compare the lengths of the paths from the source vertex to these unattached vertices, i.e. we compare Length(AB), Length(AFD), Length(AFG) and Length(AFE). We find Length(AB) to be the minimum. Therefore we choose vertex B.

[Fig. 20 (b)]

4. We go back to step 3 and continue till we exhaust all the vertices.

Let us see how it works for the above example.

    Vertices that may be attached    Path from A    Length
    D                                ABD            4
                                     AFD            4
    G                                AFG            6
    C                                ABC            4
    E                                AFE            4
                                     ABE            6

We may choose D, C or E. We choose, say, D through B.

[Fig. 20 (c)]

    Vertices that may be attached    Path from A    Length
    G                                AFG            6
    C                                ABC            4
    E                                AFE            4
                                     ABE            6
                                     ABDE           8

We may choose C or E. We choose, say, C.

    Vertices that may be attached    Path from A    Length
    G                                AFG            6
    E                                AFE            4
                                     ABE            6
                                     ABDE           8
                                     ABCE           7
    H

We choose path AFG. Therefore the shortest paths from source vertex A to all the other vertices are AB, ABC, ABD, AFE, AF, AFG and ABCH.

[Fig. 20 (d)]

8.5. MINIMAL SPANNING TREE

You have already learnt in Section 4.3 that a tree is a connected acyclic graph. If we are given a graph G = (V, E), we may have more than one tree structure for it. Let us see what we mean by this statement. Consider the graph given in Figure 21. Some of the tree structures for this graph are given in Figures 22(a), 22(b) and 22(c).

[Fig. 22]

You may notice that they differ from each other significantly; however, for each structure (i) the vertex set is the same as that of graph G, (ii) the edge set is a subset of the edge set of G, and (iii) there is no cycle. Such a structure is called a spanning tree of the graph.
Let us formally define a spanning tree. A tree T is a spanning tree of a connected graph G(V, E) if 1) every vertex of G belongs to an edge in T, and 2) the edges in T form a tree.

Let us see how we construct a spanning tree for a given graph. Take any vertex v as an initial partial tree and add edges one by one so that each edge joins a new vertex to the partial tree. In general, if there are n vertices in the graph we construct a spanning tree in (n - 1) steps, i.e. (n - 1) edges need to be added.

Frequently we encounter weighted graphs, and we need to build a subgraph that is connected and includes every vertex of the graph. To construct such a subgraph with the least weight or least cost, we must not have cycles in it. Therefore we actually need to construct a spanning tree with minimum cost, or a Minimal Spanning Tree (MST).

You must notice the difference between the shortest path problem and the minimal spanning tree problem. Let us see this difference through an example. Consider the graph given in Figure 23.

[Figure 23]

This could, for instance, represent the feasible communication lines between 7 cities, and the cost of an edge could be interpreted as the actual cost of building that link (in lakhs of rupees). The minimal spanning tree problem for this situation would be to build a least-cost communication network. The shortest path problem (one source, all destinations) would be to pick one city and find the least-cost communication lines from this city to all the other cities. Therefore, shortest path trees are rooted, while MSTs are free trees. Also, MSTs are defined only on undirected graphs.

BUILDING A MINIMUM SPANNING TREE

As stated in an earlier paragraph, we need to select n - 1 edges in G (of n vertices) such that these form an MST of G. We begin by selecting an edge with the least cost. It can be between any two vertices of G. Subsequently, from the set of remaining edges, we select another least-cost edge, and so on.
Each time an edge is picked, we determine whether or not including this edge in the spanning tree being constructed creates a cycle. If it does, the edge is discarded. If no cycle is created, the edge is included in the spanning tree being constructed.

Let us work it out through an example. Consider the graph given in Figure 23. The minimum cost edge is the one that connects vertices A and B. Let us choose that.

[Figure 24 (a)]

Now we have

    Vertices that may be added    Edge    Cost
    F                             BF      8
                                  AF      7
    G                             BG      9
    C                             AC      4
    D                             AD      5

The least cost edge is AC; therefore we choose AC. Now we have

[Figure 24 (b)]

    Vertices that may be added    Edge    Cost
    F                             BF      8
                                  AF      7
    G                             BG      9
                                  CG      10
    D                             AD      5
    E                             CE      8

The least cost edge is AD; therefore we choose AD. Now we have

[Figure 24 (c)]

    Vertices that may be added    Edge    Cost
    F                             BF      8
                                  AF      7
                                  DF      10
    G                             BG      9
                                  CG      10
    E                             CE      8
                                  DE      9

AF is the minimum cost edge; therefore we add it to the partial tree. Now we have

[Figure 24 (d)]

    Vertices that may be added    Edge    Cost
    G                             BG      9
                                  CG      10
    E                             CE      8
                                  DE      9
                                  FE      11

The obvious choice is CE.

[Figure 24 (e)]

The only vertex left is G, and the minimum cost edge that connects it to the tree constructed so far is BG. Therefore we add it, and the minimal spanning tree constructed has the cost 9 + 3 + 7 + 5 + 4 + 8 = 36; it is given in Figure 24(f).

[Figure 24 (f)]

This method is called Kruskal's method of creating a minimal spanning tree. Strictly, Kruskal's method considers the cheapest remaining edge anywhere in the graph, as described at the start of this section; the tables above list only the edges that touch the partial tree, which is the way Prim's method proceeds.

8.6. OTHER TASKS FOR THE GRAPHS

Some other functions associated with graphs for solving problems are:

a. To find the degree of a vertex. The degree of a vertex is defined as the number of vertices adjacent to it. In other words, it is the number of 1's in the row of that vertex in the adjacency matrix, or the number of nodes present in the adjacency list of that vertex.

b. To find the number of edges. By the handshaking lemma, we know that the number of edges in a graph is half of the sum of the degrees of all the vertices.

c. To print a path from one vertex to another.
Here we follow the BFS algorithm given above, with one of the vertices as the starting vertex, and continue the process till we reach the second vertex.

d. To print the multiple paths from one vertex to another. The previous algorithm should be used in a modified form so that we can obtain multiple paths.

e. To find the number of components in a graph. In this case we again use BFS and check the visited array. If it does not show all the vertices marked as visited, we increment the component counter by 1 and restart BFS from any vertex that has not been visited. We repeat this till all the vertices are visited.

f. To find the critical vertices and edges. A vertex whose removal leaves the graph disconnected is termed a critical vertex. To find the critical vertices we remove each vertex in turn and check the number of components of the remaining graph. If the remaining graph is not connected, the vertex that was removed is a critical vertex. Similarly, if the removal of an edge increases the number of components of the graph, the edge is known as a critical edge. To check whether a particular vertex or edge is critical, we remove it and rerun the program for finding the number of components.

SUMMARY

Graphs provide an excellent way to describe the essential features of many applications. Graphs are mathematical structures and are found to be useful in problem solving. They may be implemented in many ways by the use of different kinds of data structures. Graph traversals, depth first as well as breadth first, are also required in many applications. The existence of cycles makes graph traversal challenging, and this leads to finding some kind of acyclic subgraph of a graph. That is, we actually arrive at the problems of finding a shortest path and of finding a minimum cost spanning tree.
In this Unit, we have built up the basic concepts of graph theory and of the problems whose solutions may be programmed. Some graph problems that arise naturally and are easy to state seem to be quite difficult, and no good algorithms are known to solve them. For example, no efficient algorithm is known for finding the minimum-cost tour that visits each vertex in a weighted graph. This problem, called the travelling salesman problem, belongs to a large class of difficult problems. Other graph problems may well have efficient algorithms, though none have been found. An example of this is the graph isomorphism problem: determine whether two graphs could be made identical by renaming vertices. Efficient algorithms are known for this problem for many special types of graphs, but the general problem remains open. In short, there is a wide spectrum of problems and algorithms for dealing with graphs. But many relatively easy problems do arise quite often, and the graph algorithms we studied in this unit serve well in a great variety of applications.

UNIT 9 TREES

9.1. INTRODUCTION
9.1.1. OBJECTIVES
9.1.2. BASIC TERMINOLOGY
9.1.3. PROPERTIES OF A TREE
9.2. BINARY TREES
9.2.1. PROPERTIES OF BINARY TREES
9.2.2. IMPLEMENTATION
9.2.3. TRAVERSALS OF A BINARY TREE
9.2.3.1. IN ORDER TRAVERSAL
9.2.3.2. POST ORDER TRAVERSAL
9.2.3.3. PREORDER TRAVERSAL
9.3. BINARY SEARCH TREES (BST)
9.3.1. INSERTION IN BST
9.3.2. DELETION OF A NODE
9.3.3. SEARCH FOR A KEY IN BST
9.4. HEIGHT BALANCED TREE
9.5. B-TREE
9.5.1. INSERTION
9.5.2. DELETION

9.1 INTRODUCTION

In the previous block we discussed Arrays, Lists, Stacks, Queues and Graphs. All these data structures except Graphs are linear data structures. Graphs are classified in the non-linear category of data structures. At this stage you may recall from the previous block on Graphs that an important class of Graphs is called Trees. A tree is an acyclic, connected graph. A tree contains no loops or cycles.
The concept of trees is one of the most fundamental and useful concepts in computer science. Trees have many variations, implementations and applications. Trees find their use in applications such as compiler construction, database design, windows, operating system programs, etc. A tree structure is one in which items of data are related by edges. A very common example is the ancestor tree given in Figure 1. This tree shows the ancestors of LAKSHMI. Her parents are VIDYA and RAMKRISHNA; RAMKRISHNA's parents are SUMATHI and VIJAYANANDAN, who are also the grandparents of LAKSHMI (on her father's side); VIDYA's parents are JAYASHRI and RAMAN, and so on.

Figure 1: A Family Tree I

We can also have another form of ancestor tree, as given in Figure 2.

Figure 2: A Family Tree II

We could also have drawn the tree of Figure 1 as

Figure 3: A Family Tree III

All the above structures are called rooted trees. A tree is said to be rooted if it has one node, called the root, that is distinguished from the other nodes. In Figure 1 the root is LAKSHMI, in Figure 2 the root is KALYANI, and in Figure 3 the root is LAKSHMI. In this Unit our attention will be restricted to rooted trees.

In this Unit, we shall first consider the basic definitions and terminology associated with trees, examine some important properties, and look at ways of representing trees within the computer. In later sections, we shall see many algorithms that operate on these fundamental data structures.

9.1.1 OBJECTIVES

At the end of this unit you shall be able to:
· define a tree, a rooted tree, a binary tree, and a binary search tree
· differentiate between a general tree and a binary tree
· describe the properties of a binary search tree
· code the insertion, deletion and searching of an element in a binary search tree
· show how an arithmetic expression may be stored in a binary tree
· build and evaluate an expression tree
· code the preorder, in order, and post order traversal of a tree

9.1.2.
BASIC TERMINOLOGY

Trees are encountered frequently in everyday life. An example is found in the organizational chart of a large corporation. Computer Science in particular makes extensive use of trees. For example, in databases a tree is useful in organizing and relating data. Trees are also used for scanning, parsing, generation of code and evaluation of arithmetic expressions in compiler design.

We usually draw trees with the root at the top. Each node (except the root) has exactly one node above it, which is called its parent; the nodes directly below a node are called its children. We sometimes carry the analogy to family trees further and refer to the grandparent or the sibling of a node.

Let us formally define some of the tree-related terms. A tree is a non-empty collection of vertices and edges that satisfies certain requirements. A vertex is a simple object (also referred to as a node) that can have a name and can carry other associated information. An edge is a connection between two vertices. A tree may, therefore, be defined as a finite set of one or more vertices such that there is one specially designated vertex called the ROOT, and the remaining vertices are partitioned into a collection of sub-trees, each of which is also a tree. In Figure 2 the root is KALYANI. The three sub-trees are rooted at BABU, RAJAN and JAYASHRI. The sub-tree with JAYASHRI as root has two sub-trees rooted at SUKANYA and VIDYA, and so on.

The nodes of a tree have a parent-child relationship. The root does not have a parent, but each of the other nodes has a parent node associated with it. A node may or may not have children. A node that has no children is called a leaf node. A line from a parent to a child node is called a branch or an edge. If a tree has n nodes, one of which is the root, then it has n-1 branches. This follows from the fact that each branch connects some node to its parent, and every node except the root has exactly one parent. Nodes with the same parent are called siblings.
Consider the tree given in Figure 4.

Figure 4: A Tree

K, L, and M are all siblings. B, C, D are also siblings. A path in a tree is a list of distinct vertices in which successive vertices are connected by edges in the tree. There is exactly one path between the root and each of the other nodes in the tree. If there is more than one path between the root and some node, or if there is no path between the root and some node, then what we have is a graph, not a tree. Nodes with no children are called leaves, or terminal nodes. Nodes with at least one child are sometimes called nonterminal nodes. We sometimes refer to nonterminal nodes as internal nodes and terminal nodes as external nodes.

The length of a path is the number of branches on the path. Further, if there is a path from $n_i$ to $n_j$, then $n_i$ is an ancestor of $n_j$ and $n_j$ is a descendant of $n_i$. Also, there is a path of length zero from every node to itself, and there is exactly one path from the root to each node.

Let us now see how these terms apply to the tree given in Figure 4. A path from A to K is A-D-G-J-K, and the length of this path is 4. A is an ancestor of K and K is a descendant of A. All the other nodes on the path are also descendants of A and ancestors of K.

The depth of any node $n_i$ is the length of the path from the root to $n_i$. Thus the root is at depth 0 (zero). The height of a node $n_i$ is the length of the longest path from $n_i$ to a leaf. Thus all leaves are at height zero. Further, the height of a tree is the same as the height of the root. For the tree in Figure 4, F is at height 1 and depth 2. D is at height 3 and depth 1. The height of the tree is 4. The depth of a node is sometimes also referred to as the level of a node.

A set of trees is called a forest; for example, if we remove the root and the edges connecting it from the tree in Figure 4, we are left with a forest consisting of three trees rooted at B, C and D, as shown in Figure 5.

Figure 5: A Forest (sub-tree)

Let us now list some of the properties of a tree:

9.1.3.
PROPERTIES OF A TREE

1. Any node can be the root of the tree, because each node in a tree has the property that there is exactly one path connecting that node with every other node in the tree. Technically, our definition, in which the root is identified, pertains to a rooted tree; a tree in which the root is not identified is called a free tree.

2. Each node, except the root, has a unique parent, and every edge connects a node to its parent. Therefore, a tree with N nodes has N-1 edges.

9.2. BINARY TREES

By definition, a binary tree is a tree which is either empty or consists of a root node and two disjoint binary trees called the left subtree and the right subtree. In Figure 6, a binary tree T is depicted with a left subtree L(T) and a right subtree R(T).

Figure 6: A Binary Tree

In a binary tree, no node can have more than two children. Binary trees are special cases of general trees. The terminology we have discussed in the previous section applies to binary trees also.

9.2.1. PROPERTIES OF BINARY TREES

1. Recall from the previous section the definition of internal and external nodes. A binary tree with N internal nodes has a maximum of (N + 1) external nodes. The root is considered an internal node.

2. The external path length of any binary tree with N internal nodes is 2N greater than the internal path length.

3. The height of a full binary tree with N internal nodes is about $\log_2 N$.

As we shall see, binary trees appear extensively in computer applications, and performance is best when the binary trees are full (or nearly full). You should note carefully that, while every binary tree is a tree, not every tree is a binary tree.

A full binary tree or a complete binary tree is a binary tree in which all internal nodes have degree 2 and all leaves are at the same level. Figure 6(a) illustrates a full binary tree. The degree of a node is the number of non-empty sub trees it has. A leaf node has degree zero.

Figure 6(a): A full binary tree

9.2.2.
IMPLEMENTATION

A binary tree can be implemented as an array of nodes or a linked list. The most common and easiest way to implement a tree is to represent a node as a record consisting of the data and a pointer to each child of the node. Because a node of a binary tree has at most two children, we can keep direct pointers to them. A binary tree node declaration in Pascal may look like:

    type
      tree_ptr  = ^tree_node;
      tree_node = record
                    data  : data_type;
                    left  : tree_ptr;
                    right : tree_ptr
                  end;

Let us now consider a special case of binary tree. It is called a 2-tree or a strictly binary tree. It is a non-empty binary tree in which either both sub trees are empty or both sub trees are 2-trees. For example, the binary trees in Figure 7(a) and 7(b) are 2-trees, but the trees in Figure 7(c) and 7(d) are not 2-trees.

Figure 7: (a) and (b) are 2-trees; (c) and (d) are not 2-trees

Binary trees are most commonly represented by linked lists. Each node can be considered as having 3 elementary fields: a data field, a left pointer pointing to the left sub tree, and a right pointer pointing to the right sub tree. Figure 8 contains an example of the linked storage representation of a binary tree (shown in Figure 6).

Figure 8: Linked Representation of a Binary Tree

A binary tree is said to be complete (see Figure 6(a)) if it contains the maximum number of nodes possible for its height. In a complete binary tree:
- The number of nodes at level 0 is 1.
- The number of nodes at level 1 is 2.
- The number of nodes at level 2 is 4, and so on.
- The number of nodes at level $i$ is $2^i$.

Therefore a complete binary tree whose levels run from 0 to $k$ contains $\sum_{i=0}^{k} 2^i = 2^{k+1} - 1$ nodes.

9.2.3. TRAVERSALS OF A BINARY TREE

A traversal of a graph is a visit to each node exactly once. In this section we shall discuss traversal of a binary tree. It is useful in many applications. For example, in searching for particular nodes.
Compilers commonly build binary trees in the process of scanning, parsing, generating code and evaluating arithmetic expressions. Let T be a binary tree. There are a number of different ways to proceed. The methods differ primarily in the order in which they visit the nodes. The four different traversals of T are In order, Post order, Preorder and Level-by-level traversal.

9.2.3.1. IN ORDER TRAVERSAL

It follows the general strategy of Left-Root-Right. In this traversal, if T is not empty, we first traverse (in order) the left sub tree, then visit the root node of T, and then traverse (in order) the right sub tree. Consider the binary tree given in Figure 9.

Figure 9: Expression Tree

This is an example of an expression tree for (A + B*C)-(D*E). A binary tree can be used to represent an arithmetic expression if the node values can be either operators or operand values and are such that:
· each operator node has exactly two branches
· each operand node has no branches.
Such trees are called expression trees.

The tree T is at first rooted at '-'. Since left(T) is not empty, the current T becomes the subtree rooted at '+'. Since left(T) is again not empty, the current T becomes the subtree rooted at 'A'. Since left(T) is now empty, we visit the root, i.e. A. Right(T) is also empty, so we move back to the parent tree and visit its root, i.e. '+'. We now perform an in order traversal of right(T). The current T becomes rooted at '*'. Since left(T) is not empty, the current T becomes rooted at 'B'; since left(T) is empty, we visit its root, i.e. B; we check right(T), which is empty, and therefore move back to the parent tree. We visit its root, i.e. '*'. Now the in order traversal of right(T) is performed, which gives us 'C'. The left sub tree of the whole tree is now finished, so we visit the root of the whole tree, i.e. '-', and perform an in order traversal of its right sub tree, which gives us 'D', '*' and 'E'. Therefore, the complete listing is

A+B*C-D*E

You may note that the expression is in infix notation. The in order traversal produces a (parenthesized) left expression, then prints out the operator at the root, and then a (parenthesized) right expression.
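The walk-through above can be sketched in C. This is only an illustrative sketch: the type `struct node` and the functions `inorder` and `figure9_inorder` are hypothetical names that do not appear in the text, and the tree is built by hand to match the expression (A + B*C)-(D*E) of Figure 9.

```c
#include <stddef.h>

/* Hypothetical node type for an expression tree (not from the text). */
struct node {
    char data;              /* operator or operand symbol */
    struct node *left;
    struct node *right;
};

/* In order (Left-Root-Right) traversal.
 * Appends each visited symbol to out, starting at index n,
 * and returns the index just past the last symbol written. */
size_t inorder(const struct node *t, char *out, size_t n)
{
    if (t == NULL)
        return n;                       /* empty sub tree: nothing to visit */
    n = inorder(t->left, out, n);       /* traverse left sub tree */
    out[n++] = t->data;                 /* visit the root */
    return inorder(t->right, out, n);   /* traverse right sub tree */
}

/* Builds the tree for (A + B*C) - (D*E) and writes its in order
 * listing to out as a NUL-terminated string (out needs 10 bytes). */
void figure9_inorder(char *out)
{
    struct node a = {'A', NULL, NULL};
    struct node b = {'B', NULL, NULL};
    struct node c = {'C', NULL, NULL};
    struct node d = {'D', NULL, NULL};
    struct node e = {'E', NULL, NULL};
    struct node times1 = {'*', &b, &c};
    struct node plus   = {'+', &a, &times1};
    struct node times2 = {'*', &d, &e};
    struct node minus  = {'-', &plus, &times2};

    size_t len = inorder(&minus, out, 0);
    out[len] = '\0';
}
```

Calling `figure9_inorder` with a buffer of at least 10 characters fills it with the listing A+B*C-D*E derived above. Writing into a buffer rather than printing is a design choice made here so that the result can be checked by a program.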
This method of traversal is probably the most widely used. The following is a Pascal procedure for in order traversal of a binary tree:

    procedure INORDER (TREE: BINTREE);
    begin
      if TREE <> nil then
      begin
        INORDER (TREE^.LEFT);
        writeln (TREE^.DATA);
        INORDER (TREE^.RIGHT);
      end
    end;

Figure 10 gives a trace of the in order traversal of the tree given in Figure 9.

Figure 10: Trace of in order traversal of the tree given in Figure 9 (output: A + B * C - D * E)

Please notice that this procedure, like the definition of the traversal, is recursive.

9.2.3.2. POST ORDER TRAVERSAL

In this traversal we first traverse left(T) (in post order), then traverse right(T) (in post order), and finally visit the root. It is a Left-Right-Root strategy, i.e.
· Traverse the left sub tree in post order.
· Traverse the right sub tree in post order.
· Visit the root.

For example, a post order traversal of the tree given in Figure 9 would be

ABC*+DE*-

You may notice that it is the postfix notation of the expression (A + (B*C)) - (D*E). We leave the details of the post order traversal method as an exercise. You may also implement it using Pascal or C language.

9.2.3.3. PREORDER TRAVERSAL

In this traversal, we visit the root first, then recursively perform a preorder traversal of left(T), followed by a preorder traversal of right(T). It is a Root-Left-Right traversal, i.e.
· Visit the root.
· Traverse the left sub tree in preorder.
· Traverse the right sub tree in preorder.

A preorder traversal of the tree given in Figure 9 would yield

-+A*BC*DE

It is the prefix notation of the expression (A + (B*C)) - (D*E). Preorder traversal is employed in Depth First Search (see Unit 4, Block 4). For example, suppose we make a depth first search of the binary tree given in Figure 11.
Figure 11: Binary tree example for depth first search

We visit a node and go left as deeply as possible before searching to its right. The order in which the nodes would be visited is ABDECFHIJKG, which is the same as the preorder traversal.

LEVEL BY LEVEL TRAVERSAL

In this method we traverse level-wise, i.e. we first visit the node at level 0, i.e. the root. There is just one. Then we visit the nodes at level 1 from left to right. There are at most two. Then we visit all the nodes at level 2 from left to right, and so on. For example, the level by level traversal of the binary tree given in Figure 11 will yield

ABCDEFGHIJK

This is the same as breadth first search (see Unit 4, Block 4). This traversal differs from the other three in that it need not be recursive; we may use a queue kind of data structure to implement it, while we need a stack kind of data structure for the earlier three traversals.

9.3. BINARY SEARCH TREES (BST)

A Binary Search Tree, BST, is an ordered binary tree T such that either it is an empty tree or
· each data value in its left sub tree is less than the root value,
· each data value in its right sub tree is greater than the root value, and
· the left and right sub trees are again binary search trees.

Figure 12(a) depicts a binary search tree, while the one in Figure 12(b) is not a binary search tree.

Figure 12(a): Binary Search Tree
Figure 12(b): Binary tree but not binary search tree

Clearly, duplicate items are not allowed in a binary search tree. You may also notice that an in order traversal of a BST yields a sorted list in ascending order.

Operations of BST

We now give a list of the operations that are usually performed on a BST.
1. Initialization of a BST: This operation makes an empty tree.
2. Check whether BST is empty: This operation checks whether the tree is empty.
3. Create a node for the BST: This operation allocates memory space for the new node; it returns with an error if no space is available.
4. Retrieve a node's data.
5. Update a node's data.
6. Insert a node in BST.
7. Delete a node (or sub tree) of a BST.
8. Search for a node in BST.
9. Traverse (in inorder, preorder, or post order) a BST.

We shall describe some of the operations in detail.

9.3.1. INSERTION IN BST

Inserting a node into the tree: To insert a node in a BST, we must check whether the tree already contains any nodes. If the tree is empty, the node is placed in the root node. If the tree is not empty, then the proper location is found and the added node becomes either a left or a right child of an existing node. The logic works this way:

    add-node (node, value)
    {
        if (two values are same)
        {
            duplicate;
            return (FAILURE)
        }
        else if (value < value stored in current node)
        {
            if (left child exists)
            {
                add-node (left child, value);
            }
            else
            {
                allocate new node and make left child point to it;
                return (SUCCESS);
            }
        }
        else if (value > value stored in current node)
        {
            if (right child exists)
            {
                add-node (right child, value);
            }
            else
            {
                allocate new node and make right child point to it;
                return (SUCCESS);
            }
        }
    }

The function continues recursively until either it finds a duplicate (no duplicate values are allowed) or it hits a dead end. If it determines that the value to be added belongs to the left-child sub tree and there is no left-child node, it creates one. If a left-child node exists, then it begins its search with the sub tree beginning at this node. If the function determines that the value to be added belongs to the right of the current node, a similar process occurs.

Let us consider the BST given in Figure 13(a).

Figure 13: Insertion in a Binary Search Tree

If we want to insert 5 in the above BST, we first search the tree. If the key to be inserted is found in the tree, we do nothing (since duplicates are not allowed); otherwise a nil is returned. In case a nil is returned, we insert the data at the last point traversed. In the example above, a search operation will return nil on not finding a right sub tree of the tree rooted at 4.
Therefore, 5 must be inserted as a right child of 4.

9.3.2. DELETION OF A NODE

Once again the node to be deleted is searched for in the BST. If found, we need to consider the following possibilities:

(i) If the node is a leaf, it can be deleted by making its parent's pointer to it nil. The deleted node is now unreferenced and may be disposed of.

(ii) If the node has one child, its parent's pointer needs to be adjusted. For example, for node 1 to be deleted from the BST given in Figure 13(a), the left pointer of node 6 is made to point to the child of node 1, i.e. node 4, and the new structure would be

Figure 14: Deletion of a Terminal Node

(iii) If the node to be deleted has two children, then its value is replaced by the smallest value in the right sub tree or the largest value in the left sub tree; subsequently the node that supplied the value, now empty, is recursively deleted. Consider the BST in Figure 15.

Figure 15: Binary search tree

If node 6 is to be deleted, then first its value is replaced by the smallest value in its right subtree, i.e. by 7. So we will have Figure 16.

Figure 16: Deletion of a node having left and right child

Now we need to delete this empty node by applying the same cases again. Therefore, the final structure would be Figure 17.

Figure 17: Tree after deletion of a node having left and right child

9.3.3. SEARCH FOR A KEY IN BST

To search the binary tree for a particular node, we use procedures similar to those we used when adding to it. Beginning at the root node, the current node and the entered key are compared. If the values are equal, success is output. If the entered value is less than the value in the node, then it must be in the left-child sub tree. If there is no left-child sub tree, the value is not in the tree, i.e. a failure is reported. If there is a left-child subtree, then it is examined the same way. Similarly, if the entered value is greater than the value in the current node, the right child is searched.
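The insertion logic of 9.3.1 and the search just described can be sketched in C as follows. This is a minimal sketch under stated assumptions, not a definitive implementation: keys are plain integers, and the names `struct bst_node`, `bst_insert`, `bst_search` and `bst_free` are hypothetical and do not come from the text. Deletion is not shown.

```c
#include <stdlib.h>

/* Hypothetical node type; the text does not give a C declaration. */
struct bst_node {
    int key;
    struct bst_node *left;
    struct bst_node *right;
};

/* Recursive insertion following the add-node logic.
 * Returns 1 on success, 0 for a duplicate key, -1 if allocation fails. */
int bst_insert(struct bst_node **t, int key)
{
    if (*t == NULL) {                     /* dead end: attach a new node here */
        struct bst_node *n = malloc(sizeof *n);
        if (n == NULL)
            return -1;
        n->key = key;
        n->left = NULL;
        n->right = NULL;
        *t = n;
        return 1;
    }
    if (key == (*t)->key)
        return 0;                         /* duplicates are not allowed */
    if (key < (*t)->key)
        return bst_insert(&(*t)->left, key);
    return bst_insert(&(*t)->right, key);
}

/* Recursive search; returns the node holding key, or NULL on failure. */
struct bst_node *bst_search(struct bst_node *t, int key)
{
    if (t == NULL || key == t->key)
        return t;
    if (key < t->key)
        return bst_search(t->left, key);
    return bst_search(t->right, key);
}

/* Releases every node, children before their parent. */
void bst_free(struct bst_node *t)
{
    if (t == NULL)
        return;
    bst_free(t->left);
    bst_free(t->right);
    free(t);
}
```

Passing the address of the root pointer lets the same code handle the empty tree and the general case. For instance, inserting 6, 1, 4 and then 5 into an empty tree places 5 as the right child of 4, in line with the example of 9.3.1; a second attempt to insert 5 returns 0.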
Figure 18 shows the path through the tree followed in the search for the key H.

Figure 18: Search for a Key in BST

    find-key (key-value, node)
    {
        if (two values are same)
        {
            print value stored in node;
            return (SUCCESS);
        }
        else if (key-value < value stored in current node)
        {
            if (left child exists)
            {
                find-key (key-value, left child);
            }
            else
            {
                there is no left subtree;
                return (string not found)
            }
        }
        else if (key-value > value stored in current node)
        {
            if (right child exists)
            {
                find-key (key-value, right child);
            }
            else
            {
                there is no right subtree;
                return (string not found)
            }
        }
    }

SUMMARY

This unit introduced the tree data structure, which is an acyclic, connected, simple graph. Terminology pertaining to trees was introduced. A special case of the general tree, the binary tree, was focussed on. In a binary tree, each node has a maximum of two subtrees, the left and the right subtree. Sometimes it is necessary to traverse a tree, that is, to visit all the tree's nodes. Four methods of tree traversal were presented: in order, post order, preorder and level by level traversal. These methods differ in the order in which the root, left subtree and right subtree are traversed. Each ordering is appropriate for a different type of application. An important class of binary trees is the complete or full binary tree. A full binary tree is one in which internal nodes completely fill every level, except possibly the last. A complete binary tree is one where the internal nodes on the bottom level all appear to the left of the external nodes on that level. Figure 6(a) shows an example of a complete binary tree.

9.4. HEIGHT BALANCED TREE

A binary tree of height h is completely balanced, or balanced, if all leaves occur at nodes of level h or h-1 and if all nodes at levels lower than h-1 have two children. According to this definition, the tree in Figure 1(a) is balanced, because all leaves occur at level 3 (counting the root as level 1) and all nodes at levels 1 and 2 have two children.
Intuitively we might consider a tree to be well balanced if, for each node, the longest paths from the left of the node are about the same length as the longest paths on the right. More precisely, a tree is height balanced if, for each node in the tree, the height of the left subtree differs from the height of the right subtree by no more than 1. The tree in Figure 2(a) is height balanced, but it is not completely balanced. On the other hand, the tree in Figure 2(b) is a completely balanced tree.

Figure 2(a)
Figure 2(b)

A height balanced tree of this kind is called an AVL tree after the Russian mathematicians G. M. Adelson-Velskii and E. M. Landis, who first defined and studied this form of tree. An AVL tree may or may not be perfectly balanced.

Let us determine how many nodes there might be in a balanced tree of height h. The root will be the only node at level 1. Each subsequent level will be as full as possible, i.e. 2 nodes at level 2, 4 nodes at level 3, and so on; in general there will be $2^{l-1}$ nodes at level $l$. Therefore the number of nodes from level 1 through level h-1 will be

$1 + 2 + 2^2 + 2^3 + \dots + 2^{h-2} = 2^{h-1} - 1$

The number of nodes at level h may range from a single node to a maximum of $2^{h-1}$ nodes. Therefore, the total number of nodes n of the tree may range from $(2^{h-1} - 1 + 1)$ to $(2^{h-1} - 1 + 2^{h-1})$, or from $2^{h-1}$ to $2^h - 1$.

BUILDING HEIGHT BALANCED TREE

Each node of an AVL tree has the property that the height of the left subtree is either one more than, equal to, or one less than the height of the right subtree. We may define a balance factor (BF) as

BF = (Height of right subtree - Height of left subtree)

Further,
· if the two subtrees are of the same height, BF = 0
· if the right subtree is higher, BF = +1
· if the left subtree is higher, BF = -1

For example, balance factors are stated near the nodes in Figure 3. The BF of the root node is zero because the height of the right subtree and of the left subtree is three.
The BF at the node DIN is -1 because the height of its left subtree is 2 and that of its right subtree is 1, etc.

Figure 3: Tree having a balance factor at each node

Let the values be given in the order BIN, FEM, IND, NEE, LAL, PRI, JIM, AMI, HEM, DIN, and suppose we need to make the height balanced tree. It would work out as follows. We begin at the root of the tree; since the tree is initially empty, we have

Figure 4 (a)

We have FEM to be added. It goes on the right of the already existing tree. Therefore we have

Figure 4 (b)

The resulting tree is still height balanced. Now we need to add IND, i.e. further to the right of FEM.

Figure 4 (c)

Since the BF of one of the nodes is other than 0, +1, or -1, we need to rebalance the tree. In such a case, when the new node goes to the longer side, we need to rotate the structure counter clockwise, i.e. a rotation is carried out in the counter clockwise direction around the closest parent of the inserted node with BF = +2. In this case we get

Figure 4 (d)

We now have a balanced tree. On adding NEE, we get

Figure 4 (e)

Since all the nodes have BF < +2, we continue with the next node. Now we need to add LAL.

Figure 4 (f)

To regain balance we need to rotate the tree counter clockwise at IND, and we get

Figure 4 (g)

On adding PRI we get

Figure 4 (h)

On adding JIM, we get

Figure 4 (i)

The tree is still balanced. Now we add AMI.

Figure 4 (j)

Now we need to rotate the tree at FEM (i.e. the closest parent to AMI with BF = +2) in the clockwise direction. On rotating it once we get

Figure 4 (k)

Now HEM is to be added. On doing so, the structure we get is

Figure 4 (l)

The tree is still balanced. Now we need to add DIN, and we get

Figure 4 (m)

Figure 4: Construction of AVL Tree

You may notice that Figure 3 and Figure 4 (m) are different although the elements are the same. This is because the AVL tree structure depends on the order in which elements are added. Can you determine the order in which the elements were added for a resultant structure as given in Figure 3?
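The balance factor and the two rotations used in the construction above can be sketched in C. This is a hedged sketch, not the text's own code: the names `struct avl_node`, `avl_height`, `avl_update_height`, `balance_factor`, `rotate_left` and `rotate_right` are hypothetical, each node is assumed to cache the height of its subtree, and an empty subtree is assumed to have height 0 (so a single node has height 1). Here `rotate_left` corresponds to the counter clockwise rotation and `rotate_right` to the clockwise rotation. Insertion with automatic rebalancing is not shown.

```c
#include <stddef.h>

/* Hypothetical AVL node that caches the height of its own subtree. */
struct avl_node {
    int key;
    int height;                 /* assumed: empty subtree = 0, leaf = 1 */
    struct avl_node *left;
    struct avl_node *right;
};

int avl_height(const struct avl_node *t)
{
    return t ? t->height : 0;
}

/* Recomputes the cached height of t from its children. */
void avl_update_height(struct avl_node *t)
{
    int hl = avl_height(t->left);
    int hr = avl_height(t->right);
    t->height = 1 + (hl > hr ? hl : hr);
}

/* BF = height of right subtree - height of left subtree, as in the text. */
int balance_factor(const struct avl_node *t)
{
    return avl_height(t->right) - avl_height(t->left);
}

/* Counter clockwise rotation at x: the right child of x becomes the
 * root of this subtree. x must have a right child. */
struct avl_node *rotate_left(struct avl_node *x)
{
    struct avl_node *y = x->right;
    x->right = y->left;
    y->left = x;
    avl_update_height(x);       /* x is now below y, so update it first */
    avl_update_height(y);
    return y;
}

/* Clockwise rotation at y: the left child of y becomes the root of
 * this subtree. y must have a left child. */
struct avl_node *rotate_right(struct avl_node *y)
{
    struct avl_node *x = y->left;
    y->left = x->right;
    x->right = y;
    avl_update_height(y);
    avl_update_height(x);
    return x;
}
```

For a chain of three nodes that leans entirely to the right, as after adding BIN, FEM and IND, `balance_factor` of the top node is +2, and `rotate_left` at that node makes the middle node the new subtree root with balance factor 0. The caller is expected to store the returned pointer back into the parent (or into the root variable).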
Let us take another example. Consider the following list of elements:

3, 5, 11, 8, 4, 1, 12, 7, 2, 6, 10

The tree structures generated till we add 11 are all balanced structures, as given in Figure 5 (a) to (c).

Figure 5 (a) to (c)

Here we need rebalancing. Rotation around 3 would give

Figure 5 (d)

Further we add 8, 4, 1, 12, 7, 2 as depicted in Figure 5 (e) to (h).

Figure 5 (e) to (f)
Figure 5 (g) to (h)

After adding 6, the tree becomes unbalanced. We need to rebalance by rotating the structure at the node 8. It is shown in Figure 5 (i).

Figure 5 (i) to (j)

Again, on adding 10 we need to balance.

Figure 5 (k) to (l)

Since the closest parent with BF = 2 is at node 11, we first rotate at it in the clockwise direction. However, we would still not get a balanced structure, as the right of node 7 would have more nodes than the left of it. Therefore, we need another rotation, in the anti-clockwise direction at node 7, and finally we get a structure as shown in Figure 5 (n).

Figure 5 (l) to (m)

This is the height balanced tree.

9.5. B-TREE

We have already defined an m-way tree in the Introduction. A B-tree is a balanced M-way tree. A node of the tree may contain many records or keys and pointers to children. A B-tree is also known as the balanced sort tree. It finds its use in external sorting. It is not a binary tree.

To reduce disk accesses, several conditions of the tree must be true:
· the height of the tree must be kept to a minimum,
· there must be no empty subtrees above the leaves of the tree,
· the leaves of the tree must all be on the same level, and
· all nodes except the leaves must have at least some minimum number of children.

A B-Tree of order M has the following properties:

1. Each node has a maximum of M children and a minimum of M/2 children or, for the root, any number from 2 to the maximum.

2. Each node has one fewer keys than children, with a maximum of M-1 keys.

3. Keys are arranged in a defined order within the node.
All keys in the subtree to the left of a key are predecessors of the key, and those on the right are successors of the key.

4. When a new key is to be inserted into a full node, the node is split into two nodes, and the key with the median value is inserted in the parent node. In case the parent node is the root, a new root is created.

5. All leaves are on the same level, i.e. there is no empty subtree above the level of the leaves.

The order imposes a bound on the bushiness of the tree. While root and terminal nodes are special cases, normal nodes have between M/2 and M children. For example, a normal node of a tree of order 11 has at least 6 and not more than 11 children.

9.5.1. B-TREE INSERTION

First a search is made for the place where the new record must be put. If the node can accommodate the new record, insertion is simple. The record is added to the node with an appropriate pointer, so that the number of pointers remains one more than the number of records. If the node overflows, because there is an upper bound on the size of a node, splitting is required. The node is split into three parts. The middle record is passed upward and inserted into the parent, leaving two children behind where there was one before. Splitting may propagate up the tree, because the parent into which the middle record of its split child is inserted may itself overflow. Therefore it may also split. If the root is required to be split, a new root is created with just two children, and the tree grows taller by one level.

The method is well explained by the following example.

Example: Consider building a B-tree of order 4, that is, a balanced four-way tree where each node can hold three data values and have four branches. Suppose it needs to contain the following values:

1 5 6 2 8 11 13 18 20 7 9

The first value, 1, is placed in a new node, which can accommodate the next two values also. When the fourth value, 2, is to be added, the node is split at the median value 5 into two leaf nodes with a parent holding 5.
The following item, 8, is to be added in a leaf node. A search for its appropriate place puts it in the node containing 6. Next, 11 is also put in the same node. Now 13 is to be added. But the right leaf node, where 13 finds its appropriate place, is full. Therefore it is split at the median value 8, which moves up to the parent, and the leaf splits up to make two nodes. The remaining items may also be added following the above procedure.

Note that the tree built up in this manner is balanced, having all of its leaf nodes at one level. Also, the tree appears to grow at its root, rather than at its leaves as was the case in the binary tree. A B-tree of order 3 is popularly known as a 2-3 tree, and that of order 4 as a 2-3-4 tree.

9.5.2. B-TREE DELETION

As in the insertion method, the record to be deleted is first searched for. If the record is in a terminal node, deletion is simple. The record along with an appropriate pointer is deleted. If the record is not in a terminal node, it is replaced by a copy of its successor, that is, the record with the next higher value. The successor of any record not at the lowest level will always be in a terminal node. Thus in all cases deletion involves removing a record from a terminal node. If, on deleting the record, the new node size is not below the minimum, the deletion is over. If the new node size is lower than the minimum, an underflow occurs.

Redistribution is carried out if either of the adjacent siblings contains more than the minimum number of records. For redistribution, the contents of the node which has fewer than the minimum number of records, the contents of its adjacent sibling which has more than the minimum number of records, and the separating record from the parent are collected. The central record from this collection is written back to the parent. The left and right halves are written back to the two siblings.

In case the node with fewer than the minimum number of records has no adjacent sibling that is more than minimally full, concatenation is used.
In this case the node is merged with its adjacent sibling and the separating record from its parent. Underflow may thus be resolved by either redistribution or concatenation. We will illustrate by deleting keys from the tree given below:
1. Delete 'h'. This is a simple deletion.
2. Delete 'r'. 'r' is not at a leaf node. Therefore its successor 's' is moved up, and 'r' is moved down and deleted.
3. Delete 'p'. The node now contains fewer than the minimum number of keys required. The sibling can spare a key, so 't' moves up and 's' moves down.
4. Delete 'd'. Again the node has fewer keys than the minimum required, and resolving this leaves the parent with only one key. The parent's sibling cannot contribute a key, therefore f, j, m and t are combined to form the new root. The tree thus shrinks by one level.

B-TREE OF ORDER 5
Let us build a B-tree of order 5 for the following data:
1.-2. D, H, K and Z are simply inserted in the same node: D H K Z
3. Add B: the node is full, so it must split. H is the median of B D H K Z, so H is made the parent. Since the split is at the root node, we must create one more node.
4. P, Q, E and A are simply inserted.
S is to be added to K P Q Z; Q becomes the median.
W and T are simply inserted after Q.
After adding C we get:
L and N are simply inserted.
Y is to be added to S T W Z; W is the median.
M is to be put in K L N P; M is the median. It is then to be promoted to C H Q W, but that node is also full; therefore the root is split and a new root is created.

SUMMARY
In this Unit, we discussed AVL trees and B-trees. AVL trees are restricted-growth binary trees. Ordinary binary trees suffer from the problem of balancing, and AVL trees offer one kind of solution. AVL trees are not completely balanced trees. In completely balanced trees, the number of nodes in the two subtrees of a node differs by at most 1. In AVL trees, the heights of the two subtrees of a node may differ by at most 1.
Multiway trees are a generalization of binary trees. A node in an m-way tree can contain R records and R + 1 pointers (or children).
Consequently the branching factor increases and the tree becomes shorter than a binary tree holding the same number of records. B-trees are balanced multiway trees. A B-tree restricts the number of records in a node to between about m/2 and m-1, i.e. it requires a node of a tree of order m to be at least half full (more precisely, every node other than the root holds at least ⌈m/2⌉ - 1 records). Further, the B-tree structure balances itself as insertions and deletions are performed. Two variations of the B-tree also exist: the B* tree and the B+ tree. The student is advised to read some text on these structures as well.

UNIT 10 FILE ORGANIZATION
10.1. INTRODUCTION
10.2. TERMINOLOGY
10.3. FILE ORGANISATION
10.3.1. SEQUENTIAL FILES
10.3.1.1. BASIC OPERATIONS
10.3.1.2. DISADVANTAGES
10.3.2. DIRECT FILE ORGANIZATION
10.3.2.1. DIVISION-REMAINDER HASHING
10.3.3. INDEXED SEQUENTIAL FILE ORGANIZATION

10.1 INTRODUCTION
Many tasks in an information-oriented society require people to deal with voluminous data and to use computers to handle this data efficiently and speedily. For example, in an airline reservation office, data regarding flights, routes, seats available, etc. is required to facilitate the booking of seats. A university might like to store data related to all students, the courses they sign up for, etc. All this implies the following:
- Data will be stored on external storage devices like magnetic tapes, floppy disks, etc.
- Data will be accessed by many people and software programs.
- Users of the data will expect that
  - it is always reliably available for processing
  - it is secure
  - it is stored in a manner flexible enough to allow users to add new types of data as their needs change.
Files deal with the storage and retrieval of data in a computer. Programming languages incorporate statements and structures which allow users to write programs to access and use the data in a file.

10.2 TERMINOLOGY
We will now define the terms of the hierarchical structure of computer-stored data collections.
1.
Field: It is an elementary data item characterized by its size, length and type. For example:
   Name : a character type of size 10
   Age : a numeric type
2. Record: It is a collection of related fields that can be treated as a unit from an application's point of view. For example, a university could use a student record with the fields university enrolment no., name and major subjects.
3. File: Data is organized for storage in files. A file is a collection of similar, related records. It has an identifying name. For example, "STUDENT" could be a file consisting of student records for all the pupils in a university.
4. Index: An index file corresponds to a data file. Its records contain a key field and a pointer to that record of the data file which has the same value of the key field. Indexing will be discussed in detail later in the unit.

The data stored in files is accessed by software, which can be divided into the following two categories:
1. User Programs: These are usually written by a programmer to manipulate retrieved data in the manner required by the application.
2. File Operations: These deal with the physical movement of data in and out of files. User programs effectively use file operations through appropriate programming language syntax.
The File Management System manages the independent files and acts as the software interface between the user programs and the file operations.

File operations can be categorized as
1. CREATION of the file
2. INSERTION of records in the file
3. UPDATION of previously inserted records
4. RETRIEVAL of previously inserted records
5. DELETION of records
6. DELETION of the file.

10.3 FILE ORGANISATION
File organization can most simply be defined as the method of storing data records in a file and the subsequent implications for the way these records can be accessed.
The factors involved in selecting a particular file organization for use are:
- Ease of retrieval
- Convenience of updates
- Economy of storage
- Reliability
- Security
- Integrity
Different file organizations give these factors different weights. The choice must be made depending upon the individual needs of the particular application in question. We now introduce in brief the various commonly encountered file organizations.

SEQUENTIAL FILES
Data records are stored in some specific sequence, e.g. order of arrival, value of key field, etc. Records of a sequential file cannot be accessed at random, i.e. to access the nth record, one must traverse the preceding (n-1) records. Sequential files will be dealt with at length in the next section.

RELATIVE FILES
Each data record has a fixed place in a relative file. Each record must have associated with it an integer key value that identifies this slot. This key, therefore, will be used for insertion and retrieval of the record. Random as well as sequential access is possible. Relative files can exist only on random access devices like disks.

DIRECT FILES
These are similar to relative files, except that the key value need not be an integer. The user can specify keys which make sense to the application.

INDEXED SEQUENTIAL FILES
An index is added to the sequential file to provide random access. An overflow area needs to be maintained to permit insertion in sequence.

INDEXED FILES
In this file organization, no sequence is imposed on the storage of records in the data file; therefore, no overflow area is needed. The index, however, is maintained in strict sequence. Multiple indexes are allowed on a file to improve access.

10.3.1. SEQUENTIAL FILES
We will now discuss in detail the sequential file organization defined above. Sequential files have data records stored in a specific sequence. A sequentially organized file may be stored on either a serial-access or a direct-access storage medium.
STRUCTURE
To provide the 'sequence' required, a 'key' must be defined for the data records. Usually a field whose values can uniquely identify data records is selected as the key. If a single field cannot fulfil this criterion, then a combination of fields can serve as the key. For example, in a file which keeps student records, the key could be the student no.

10.3.1.1. OPERATIONS
1. Insertion: Records must be inserted at the place dictated by the sequence of the keys. As is obvious, direct insertions into the main data file would lead to frequent rebuilding of the file. This problem could be mitigated by reserving 'overflow areas' in the file for insertions. But this leads to wastage of space, and the overflow areas may themselves fill up. The common method is to use transaction logging. This works as follows:
- collect records for insertion in a transaction file in their order of arrival
- when population of the transaction file has ceased, sort the transaction file in the order of the key of the primary data file
- merge the two files on the basis of the key to get a new copy of the primary sequential file.
Such insertions are usually done in batch mode, when the activity or program which populates the transaction file has ceased. The structure of the transaction file's records will be identical to that of the primary file.
2. Deletion: Deletion is the reverse process of insertion. The space occupied by the record should be freed for use. Usually deletion (like insertion) is not done immediately. The concerned record (along with a marker or 'tombstone' to indicate deletion) is written to a transaction file. At the time of merging, the corresponding data record will be dropped from the primary data file.
3. Updation: Updation is a combination of insertion and deletion. The record with the new values is inserted and the earlier version deleted. This is also done using transaction files.
4.
Retrieval: User programs will often retrieve data for viewing prior to making decisions; therefore, it is vital that this data reflects the latest state of the data even if the merging activity has not yet taken place. Retrieval is usually done for a particular value of the key field. Before being returned to the user, the data record should be merged with the transaction record (if any) for that key value.
The other two operations, 'creation' and 'deletion' of files, are achieved by simple programming language statements.

10.3.1.2. DISADVANTAGES
Following are some of the disadvantages of sequential file organization:
* Updates are not easily accommodated.
* By definition, random access is not possible.
* All records must be structurally identical. If a new field has to be added, then every record must be rewritten to provide space for the new field.
* Continuous access may not be possible, because both the primary data file and the transaction file must be locked during merging.

AREAS OF USE
Sequential files are most frequently used in commercial batch-oriented data processing where there is the concept of a master file to which details are added periodically, for example payroll applications.

10.3.2. DIRECT FILE ORGANIZATION
It offers an effective way to organize data when there is a need to access individual records directly. To access a record directly (or at random), a relationship is used to translate the key value into a physical address. This is called the mapping function R:
    R(key value) → Address
Direct files are stored on a DASD (Direct Access Storage Device). A calculation is performed on the key value to get an address. This address calculation technique is often termed hashing. The calculation applied is called a hash function. Here we discuss a very commonly used hash function called Division-Remainder.

10.3.2.1.
DIVISION-REMAINDER HASHING
According to this method, the key value is divided by an appropriate number, generally a prime number, and the remainder of the division is used as the address for the record. The choice of an appropriate divisor may not be so simple. If it is known that the file is to contain n records, then, assuming that only one record can be stored at a given address, we must have divisor ≥ n. Also, we may have a very large key space as compared to the address space. Key space refers to all the possible key values, although only a part of this space will actually be used as key values in the file. The address space possibly may not match the size of the key space, therefore a one-to-one mapping may not exist. That is, the calculated address may not be unique. This is called a collision, i.e.
    R(K1) = R(K2) but K1 ≠ K2
Two unequal keys have been calculated to have the same address. Such keys are called synonyms.
There are various approaches to handling the problem of collisions. One of these is to hash to buckets. A bucket is a space that can accommodate multiple records. A discussion of buckets and other such methods of handling collisions is outside the scope of this Unit. However, the student is advised to read some text on bucket addressing and related topics.

10.3.3. INDEXED SEQUENTIAL FILE ORGANIZATION
When there is a need to access records sequentially by some key value and also to access records directly by the same key value, the collection of records may be organized in an effective manner called Indexed Sequential Organization.
You must be familiar with the search process for a word in a language dictionary. The data in the dictionary is stored in a sequential manner. However, an index is provided in the form of thumb tabs. To search for a word we do not search sequentially. We access the index, that is, the appropriate thumb tab, locate an approximate location for the word and then proceed to find the word sequentially.
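The dictionary lookup just described can be sketched in C as a two-step search: consult a small index first, then scan sequentially within the section it points to. This is only an illustrative sketch under stated assumptions, not a prescribed implementation: the names isam_find, first_keys and BLOCK are hypothetical, the records are plain integer keys held in a sorted in-memory array rather than in a disk file, and each index entry is assumed to cover a section of BLOCK consecutive records.

```c
#include <stddef.h>

#define BLOCK 3   /* records covered by one index entry (an assumed value) */

/* Hypothetical sketch of an indexed sequential lookup.
   data[0..n-1] is sorted by key; first_keys[b] holds the first key of
   section b, i.e. data[b * BLOCK]. Returns the position of key in data,
   or -1 if the key is absent. */
int isam_find(const int data[], size_t n,
              const int first_keys[], size_t nblocks, int key)
{
    size_t b = 0;

    if (n == 0 || nblocks == 0)
        return -1;

    /* Step 1 (the "thumb tab"): pick the last section whose first key
       does not exceed the search key. */
    while (b + 1 < nblocks && first_keys[b + 1] <= key)
        b++;

    /* Step 2: search sequentially, but only inside that section. */
    for (size_t i = b * BLOCK; i < n && i < (b + 1) * BLOCK; i++) {
        if (data[i] == key)
            return (int)i;
        if (data[i] > key)      /* passed the place where key would be */
            break;
    }
    return -1;
}
```

With eight records and an index of three entries, a lookup inspects at most three index entries and three records instead of up to eight records.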
To implement the concept of indexed sequential file organization, we consider an approach in which the index part and the data part reside in separate files. The index file has a tree structure and the data file has a sequential structure. Since the data file is sequenced, it is not necessary for the index to have an entry for each record. The following figure shows a sequential file with a two-level index. Level 1 of the index holds an entry for each three-record section of the main file. Level 2 indexes level 1 in the same way.
When new records are inserted in the data file, the sequence of records needs to be preserved and the index updated accordingly. Two approaches used to implement indexes are static indexes and dynamic indexes. As the main data file changes due to insertions and deletions, the contents of a static index may change but its structure does not. In the dynamic indexing approach, insertions and deletions in the main data file may lead to changes in the index structure. Recall the change in height of a B-tree as records are inserted and deleted. Both dynamic and static indexing techniques are useful, depending on the type of application.

SUMMARY
This Unit dealt with the methods of physically storing data in files. The terms field, record and file were defined. The organization types were introduced, and the various file organizations were discussed. Sequential file organization finds use in application areas where batch processing is more common. Sequential files are simple to use and can be stored on inexpensive media. They are not suitable for applications that require direct access to only particular records of the collection, and they do not provide adequate support for interactive applications.
In direct file organization there exists a predictable relationship between the key used by a program or programmer to identify a particular record and that record's location on secondary storage.
A direct file must be stored on a direct access device. Direct files are used extensively in application areas where interactive processing is used.
An indexed sequential file supports both sequential access by key value and direct access to a particular record given its key value. It is implemented by building an index on top of a sequential data file that resides on a direct access storage device.

UNIT 11 SEARCHING
11.1. INTRODUCTION
11.2. SEARCHING TECHNIQUES
11.2.1. SEQUENTIAL SEARCH
11.2.1.1. ANALYSIS
11.2.2. BINARY SEARCH
11.2.2.1. ANALYSIS
11.3. HASHING
11.3.1. HASH FUNCTIONS
11.4. COLLISION RESOLUTION

11.1. INTRODUCTION
In many cases we require the data to be presented in a form where the records follow a certain sequence. If we have the data of the students in a class, we will prefer to have them arranged alphabetically. For preparing the result sheet, we would like to arrange the data by examination number. When we prepare the merit list, we would like to have the same data arranged in decreasing order of the total marks obtained by the students. Arranging the data in ascending or descending order based on a certain key in the record is known as SORTING.
As we do not receive the data in sorted form, we are required to arrange the data in the particular form. For ascending order we will require the smallest key value first. Thus, until we have all the data items, we cannot start arranging them. Arranging the data as we receive it is done using linked lists. But in all other cases we need to have all the data which is to be sorted, and it will be present in the form of an ARRAY.
Sometimes it is very important to search for a particular record, perhaps depending on some value. The process of finding a particular record is known as SEARCHING.
Suppose S is a collection of data maintained in memory by a table using some type of data structure.
Searching is the operation which finds the location LOC in memory of some given ITEM of information, or sends some message that ITEM does not belong to S. The search is said to be successful or unsuccessful according to whether ITEM does or does not belong to S. The searching algorithm that is used depends mainly on the type of data structure that is used to maintain S in memory.
Data modification, another term related to searching, refers to the operations of inserting, deleting and updating. Here data modification will mainly refer to inserting and deleting. These operations are closely related to searching, since usually one must search for the location of the ITEM to be deleted, or one must search for the proper place to insert ITEM in the table. The insertion or deletion also requires a certain amount of execution time, which also depends mainly on the type of data structure that is used.
Generally speaking, there is a tradeoff between data structures with fast searching algorithms and data structures with fast modification algorithms. This situation is illustrated below, where we summarize the searching and data modification of three of the data structures previously studied in the text.
(1) Sorted array. Here one can use a binary search to find the location LOC of a given ITEM in time O(log n). On the other hand, inserting and deleting are very slow, since, on the average, n/2 = O(n) elements must be moved for a given insertion or deletion. Thus a sorted array would likely be used when there is a great deal of searching but only very little data modification.
(2) Linked list. Here one can only perform a linear search to find the location LOC of a given ITEM, and the search may be very slow, possibly requiring time O(n). On the other hand, inserting and deleting require only a few pointers to be changed. Thus a linked list would be used when there is a great deal of data modification, as in word (string) processing.
(3) Binary search tree.
This data structure combines the advantages of the sorted array and the linked list. That is, searching is reduced to searching only a certain path P in the tree T, which, on the average, requires only O(log n) comparisons. Furthermore, the tree T is maintained in memory by a linked representation, so only certain pointers need be changed after the location of the insertion or deletion is found. The main drawback of the binary search tree is that the tree may be very unbalanced, so that the length of a path P may be O(n) rather than O(log n). This will reduce the searching to approximately a linear search.

11.2. SEARCHING TECHNIQUES
We will discuss two searching methods: the sequential search and the binary search.

11.2.1. SEQUENTIAL SEARCH
This is a natural searching method. Here we search for a record by traversing the list from the beginning until we find the record. For this searching technique the list need not be ordered. The algorithm is presented below:
1. Set flag = 0.
2. Set index = 0.
3. Begin from index at the first record and proceed to the end of the list; if the required record is found, make flag = 1.
4. At the end, if the flag is 1, the record is found; otherwise the search is a failure.

    /* SIZE is assumed to be a constant defined elsewhere, e.g. #define SIZE 10 */
    int Lsearch(int L[SIZE], int ele)
    {
        int it;
        for (it = 0; it < SIZE; it++) {   /* valid indices are 0 .. SIZE-1 */
            if (L[it] == ele)
                return 1;                 /* found */
        }
        return 0;                         /* not found */
    }

11.2.1.1. ANALYSIS
Whether the search takes place in an array or a linked list, the critical part in performance is the comparison in the loop. The fewer the comparisons, the sooner the loop terminates. The least number of iterations that could be required is 1, if the element searched for is the first one in the list. The maximum number of comparisons is N (N is the total size of the list), when the element is the last in the list. Thus if the required item is in position 'i' in the list, 'i' comparisons are required. Hence the average number of comparisons is
    (1 + 2 + 3 + ... + i + ... + N) / N = N(N + 1) / (2N) = (N + 1) / 2.
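The average of (N + 1) / 2 can be checked experimentally with a small sketch like the one below. It is a variant of the sequential search written for illustration only: the names lsearch_count and total_comparisons are hypothetical, the list length is passed as a parameter instead of using SIZE, and the function returns the position of the element (or -1) rather than a flag.

```c
#include <stddef.h>

/* Sequential search over list[0..n-1]. Returns the 0-based position of
   ele, or -1 if it is absent. *comparisons receives the number of key
   comparisons that were made. */
int lsearch_count(const int list[], size_t n, int ele, size_t *comparisons)
{
    *comparisons = 0;
    for (size_t i = 0; i < n; i++) {
        (*comparisons)++;
        if (list[i] == ele)
            return (int)i;
    }
    return -1;
}

/* Searches once for every element stored in the list and returns the
   total number of comparisons; dividing by n gives the average. */
size_t total_comparisons(const int list[], size_t n)
{
    size_t total = 0, c;
    for (size_t i = 0; i < n; i++) {
        lsearch_count(list, n, list[i], &c);
        total += c;
    }
    return total;
}
```

For a list of N = 9 distinct items the total is 1 + 2 + ... + 9 = 45 comparisons, an average of 5 = (9 + 1) / 2 per successful search; an unsuccessful search always costs N comparisons.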
Sequential search is easy to write and efficient for short lists. It does not require the list to be sorted. However, if the list is long, this searching method becomes inefficient, as it may have to travel through the whole list. We can overcome this shortcoming by using the binary search method.

11.2.2. BINARY SEARCH
The binary search method searches for a record in only half of the list, depending on the comparison between the element to be searched for and the central element of the list. It requires the list to be sorted to perform such a comparison. It reduces the size of the portion to be searched by half after each iteration.
Let us consider an example to see how this works. The numbers in the list are 10 20 30 40 50 60 70 80 90. The element to be searched for is 30. First it is compared with 50 (the central element). Since it is smaller than the central element, we consider only the left part of the list (i.e. from 10 to 40) for further searching. Next the comparison is made with 20, the central element of this part; since 30 is larger, only the portion from 30 to 40 remains. Its central element is 30, so the search is successful.
Let us see the function which performs this on the list.

    /* SIZE is assumed to be a constant defined elsewhere; list must be sorted in increasing order */
    int Bsearch(int list[SIZE], int ele)
    {
        int top, bottom, middle;
        top = SIZE - 1;
        bottom = 0;
        while (top >= bottom) {
            middle = (top + bottom) / 2;
            if (list[middle] == ele)
                return middle;            /* found: return its position */
            else if (list[middle] < ele)
                bottom = middle + 1;      /* continue in the right half */
            else
                top = middle - 1;         /* continue in the left half */
        }
        return -1;                        /* not found */
    }

11.2.2.1. ANALYSIS
In this case, after each comparison either the search terminates successfully or the list remaining to be searched is reduced by half. So after k comparisons the list remaining to be searched has N / 2^k elements, where N is the number of elements. Hence even in the worst case this method needs no more than log2 N + 1 comparisons.
Binary search is a fast algorithm for searching sorted sequences. It runs in about log2 N time, as opposed to an average run time of N/2 for linear search.
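The bound of log2 N + 1 can be observed on the nine-element example list with a sketch such as the following. It is illustrative only: bsearch_count is a hypothetical name, the length is passed as a parameter rather than taken from SIZE, and each pass through the loop is counted as one probe (one comparison against the central element).

```c
/* Binary search over the sorted array list[0..n-1]. Returns the position
   of ele, or -1 if it is absent. *probes receives the number of central
   elements that were examined. */
int bsearch_count(const int list[], int n, int ele, int *probes)
{
    int bottom = 0;
    int top = n - 1;

    *probes = 0;
    while (bottom <= top) {
        /* same midpoint as (top + bottom) / 2, written to avoid overflow */
        int middle = bottom + (top - bottom) / 2;
        (*probes)++;
        if (list[middle] == ele)
            return middle;
        if (list[middle] < ele)
            bottom = middle + 1;   /* discard the left half */
        else
            top = middle - 1;      /* discard the right half */
    }
    return -1;
}
```

On the list 10 20 ... 90, searching for 30 examines 50, 20 and 30 (three probes), and no search, successful or not, needs more than four probes, which agrees with log2 9 + 1 rounded down.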
For large lists that you need to search multiple times, it might well be more efficient to sort and use binary search instead of sticking to a linear search.

11.3. HASHING
The search time of each algorithm discussed so far depends on the number n of elements in the collection S of data. This section discusses a searching technique, called hashing or hash addressing, which is essentially independent of the number n.
The terminology which we use in our presentation of hashing will be oriented toward file management. First of all, we assume that there is a file F of n records with a set K of keys which uniquely determine the records in F. Secondly, we assume that F is maintained in memory by a table T of m memory locations and that L is the set of memory addresses of the locations in T. For notational convenience, we assume that the keys in K and the addresses in L are (decimal) integers. (Analogous methods will work with binary integers or with keys which are character strings, such as names, since there are standard ways of representing strings by integers.)
The subject of hashing will be introduced by the following example. Suppose a company with 68 employees assigns a 4-digit employee number to each employee, which is used as the primary key in the company's employee file. We can, in fact, use the employee number as the address of the record in memory. The search will require no comparisons at all. Unfortunately, this technique will require space for 10 000 memory locations, whereas only 68 of these locations would actually be used. Clearly, this tradeoff of space for time is not worth the expense.
The general idea of using the key to determine the address of a record is an excellent idea, but it must be modified so that a great deal of space is not wasted. This modification takes the form of a function H from the set K of keys into the set L of memory addresses. Such a function,
    H: K → L
is called a hash function or hashing function.
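As a concrete illustration of such a function for the 68-employee example, the sketch below reuses the division-remainder idea from Unit 10. The table size M = 97 and the name hash_address are assumptions chosen for illustration (97 is a prime somewhat larger than 68); the sample employee numbers in the note that follows are likewise invented.

```c
#define M 97   /* number of table locations: an assumed prime larger than 68 */

/* A possible hash function H: K -> L for 4-digit employee numbers.
   It maps any key to an address in the range 0 .. M-1, so the table
   needs only M locations instead of 10 000. */
unsigned hash_address(unsigned key)
{
    return key % M;
}
```

For instance, employee numbers 3205, 7148 and 2345 would be assigned addresses 4, 67 and 17 respectively.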
Unfortunately, such a function H may not yield distinct values: it is possible that two different keys K1 and K2 will yield the same hash address. This situation is called a collision, and some method must be used to resolve it. Accordingly, the topic of hashing is divided into two parts: (1) hash functions and (2) collision resolution. We discuss these two parts separately.

11.3.1 HASH FUNCTIONS
The two principal criteria used in selecting a hash function H: K → L are as follows. First of all, the function H should be very easy and quick to compute. Second, the function H should, as far as possible, uniformly distribute the hash addresses throughout the set L so that there are a minimum number of collisions. Naturally, there is no guarantee that the second condition can be completely fulfilled without actually knowing the keys and addresses beforehand. However, certain general techniques do help. One technique is to "chop" a key k into pieces and combine the pieces in some way to form the hash address H(k). (The term "hashing" comes from this technique of "chopping" a key into pieces.)
We next illustrate some popular hash functions. We emphasize that each of these hash functions can be easily and quickly evaluated by the computer.
(a) Division method. Choose a number m larger than the number n of keys in K. (The number m is usually chosen to be a prime number or a number without small divisors, since this frequently minimizes the number of collisions.) The hash function H is defined by
    H(k) = k (mod m)   or   H(k) = k (mod m) + 1
(b) Midsquare method. The key k is squared. Then the hash function H is defined by
    H(k) = l
where l is obtained by deleting digits from both ends of k^2. We emphasize that the same positions of k^2 must be used for all of the keys.
(c) Folding method. The key k is partitioned into a number of parts, k1, ..., kr, where each part, except possibly the last, has the same number of digits as the required address.
Then the parts are added together, ignoring the last carry. That is,
    H(k) = k1 + k2 + ... + kr
where the leading-digit carries, if any, are ignored. Sometimes, for extra "milling," the even-numbered parts, k2, k4, ..., are each reversed before the addition.

11.4. COLLISION RESOLUTION
Suppose we want to add a new record R with key k to our file F, but suppose the memory location address H(k) is already occupied. This situation is called a collision. This subsection discusses two general ways of resolving collisions. The particular procedure that one chooses depends on many factors. One important factor is the ratio of the number n of keys in K (which is the number of records in F) to the number m of hash addresses in L. This ratio, λ = n/m, is called the load factor.
Collisions are hard to avoid even when the load factor is small. For example, if a class of 24 students is hashed by birthday into a table with 365 locations, then although the load factor λ = 24/365 ≈ 7% is very small, it can be shown that there is a better than fifty-fifty chance that two of the students have the same birthday.
The efficiency of a hash function with a collision resolution procedure is measured by the average number of probes (key comparisons) needed to find the location of the record with a given key k. The efficiency depends mainly on the load factor.

SUMMARY
This unit concentrates on searching techniques used for information retrieval. The sequential search method was seen to be easy to implement and relatively efficient for small lists, but very time-consuming for long unsorted lists. The binary search method is an improvement, in that it eliminates half the list from consideration at each iteration. It has checks incorporated to ensure speedy termination under all possible conditions. It requires only about twenty comparisons for a million records and is hence very efficient. The prerequisite for it is that the list should be sorted in increasing order.

UNIT 12 SORTING
12.1. INTRODUCTION
12.2. INSERTION SORT
12.2.1. ANALYSIS
12.3. BUBBLE SORT
12.3.1. ANALYSIS
12.4. SELECTION SORT
12.4.1. ANALYSIS
12.5. RADIX SORT
12.5.1. ANALYSIS
12.6. QUICK SORT
12.6.1. ANALYSIS
12.7. 2-WAY MERGE SORT
12.8. HEAP SORT
12.9.
HEAPSORT VS. QUICKSORT

12.1 INTRODUCTION
Sorting is one of the most important operations performed by computers. In the days of magnetic tape storage, before modern databases, it was almost certainly the most common operation performed by computers, as most "database" updating was done by sorting transactions and merging them with a master file. It is still important for the presentation of data extracted from databases: most people prefer to get reports sorted into some relevant order before wading through pages of data.
Sorting, the problem of rearranging an initially unordered collection of keys (data) to produce an ordered collection, has been the subject of extensive research in computer science because of its importance. (A sort key is a single-valued function of the corresponding element of the list.) Everyday instances include arranging names in alphabetical order, lining up customers in a bank, ordering customers by zip code, ranking cities by population increase, and so on. Thus, a number of different techniques have been developed for solving sorting problems. Because the sorting problem has fascinated theoretical computer scientists, much is known about the efficiency of various solutions, and about limitations on the best possible solutions.
Efficient sorting is important for optimizing the use of other algorithms (such as search algorithms and merge algorithms) that require sorted lists to work correctly; it is also often useful for canonicalizing data and for producing human-readable output.
Sort algorithms used in computer science are often classified by:
- computational complexity (worst, average and best behaviour) in terms of the size of the list (n). Typically, good behaviour is O(n log n) and bad behaviour is O(n^2).
  Sort algorithms which only use an abstract key comparison operation always need at least on the order of n log n comparisons on average; sort algorithms which exploit the structure of the key space cannot sort faster than O(n log k), where k is the size of the key space.
- memory usage (and use of other computer resources)
- stability: stable sorts keep the relative order of elements that have an equal key. That is, a sort algorithm is stable if, whenever there are two records R and S with the same key and with R appearing before S in the original list, R will appear before S in the sorted list.
Sorting algorithms that are not stable can be specially implemented to be stable. One way of doing this is to artificially extend the key comparison, so that comparisons between two objects with otherwise equal keys are decided using the order of the entries in the original data as a tiebreaker.
Some sorting algorithms follow, with their runtime order:
    Bubble sort - O(n^2)
    Selection sort - O(n^2)
    Insertion sort - O(n^2)
    Quicksort * - O(n log n) on average
    2-way merge sort - O(n log n)
    Heapsort * - O(n log n)
    (*) unstable

12.2. INSERTION SORT
This is a naturally occurring sorting method, exemplified by a card player arranging the cards dealt to him. He picks up the cards as they are dealt and inserts them into the required position. Thus at every step, we insert an item into its proper place in an already ordered list.
We will illustrate insertion sort with an example (figure 1) before presenting the formal algorithm.
Example 1: Sort the following list using the insertion sort method:
Figure 1: Insertion sort
Thus, to find the correct position, search the list till an item just greater than the target is found. Shift all the items from this point one position down the list. Insert the target in the vacated slot.
We now present the algorithm for insertion sort.
ALGORITHM: INSERT SORT
INPUT: LIST[ ] of N items in random order.
OUTPUT: LIST[ ] of N items in sorted order.
1. BEGIN
2. FOR I = 2 TO N DO
3. BEGIN
4.
F LIST[I] LIST[I-1] 5. THEN BEGIN 6. J = I 7. T = LIST[I] /*STORE LIST[I]*/ 8. REPEAT /* MOVE OTHER ITEMS DOWN THE LIST*/ 9. J = J-1 10. LIST [J + 1] =LIST [J]; 11. IFJ = 1THEN 12. FOUND =TRUE 13. UNTIL (FOUND = TRUE) 14. LIST [I] = T 15. END 16. END 17. END C Code void InsertionSort(int array[],int size) { int k; for(k=1;k<size;k++) { int elem; int pos; } } elem=array[k]; for(pos=k-1;array[pos]>elem && pos>=0;pos--) { array[pos+1]=array[pos]; } array[pos+1]=elem; 12.2.1 ANALYSIS To determine the average efficiency of insertion sort consider the number of times that the inner loop iterates. As with other loops featuring nested loops, the number of iterations follows a familiar pattern: 1 + 2 + … + (n-2) + (n-1) = n(n-1)/2 = O(n2) . Conceptually, the above pattern is caused by the sorted sub-list that is built throughout the insertion sort algorithm. It takes one iteration to build a sorted sub-list of length 1, two iterations to build a sorted sub-list of length 2 and finally n-1 iterations to build the final list. To determine whether there are any best or worst cases for the sort, we can examine the algorithm to find data sets that would behave differently from the average case with random data. Because the average case identified above locally sorts each sub-list there is no arrangement of the aggregate data set that is significantly worse for insertion sort. The nature of the sorting algorithm does however lend itself to perform more efficiently on certain data. In the case where the data is already sorted, insertion sort won't have to do any shifting because the local sub-list will already be sorted. That is, the first element will already be sorted, the first two will already be sorted, the first three, and so on. In this case, insertion sort will iterate once through the list, and, finding no elements out of order, will not shift any of the data around. 
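The best-case and worst-case behaviour described above can be checked by counting shifts. The following is a minimal sketch, not part of the original notes: the function name insertion_sort_count_shifts is made up for illustration, and it simply instruments the same insertion sort to report how many times the inner loop moves an element.

```c
#include <assert.h>

/* Hypothetical helper (illustration only): sorts a[0..n-1] in ascending
   order and returns how many element shifts the inner loop performed. */
long insertion_sort_count_shifts(int a[], int n)
{
    long shifts = 0;
    int k;

    assert(n >= 0);
    for (k = 1; k < n; k++) {
        int elem = a[k];
        int pos = k - 1;
        /* pos is tested first so that a[-1] is never read */
        while (pos >= 0 && a[pos] > elem) {
            a[pos + 1] = a[pos];   /* shift one place to the right */
            pos--;
            shifts++;
        }
        a[pos + 1] = elem;
    }
    return shifts;
}
```

Called on an already sorted array, the function should report no shifts at all, while a reversed array of n items should report n(n-1)/2 shifts, matching the sum given in the analysis.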
The best case for insertion sort is a sorted list, where it runs in O(n) time. It takes O(n²) time in the average and worst cases, which makes it impractical for sorting large numbers of elements. However, insertion sort's inner loop is very fast, which often makes it one of the fastest algorithms for sorting small numbers of elements, typically fewer than 10 or so.

12.3 BUBBLE SORT

In this sorting algorithm, multiple swaps take place in one pass. Smaller elements move or 'bubble' up to the top of the list, hence the name given to the algorithm.

In this method, adjacent members of the list to be sorted are compared. If the item on top is greater than the item immediately below it, they are swapped. This process is carried on till the list is sorted.

The detailed algorithm follows:

ALGORITHM BUBBLE SORT
INPUT: LIST[ ] of N items in random order.
OUTPUT: LIST[ ] of N items sorted in ascending order.

1. SWAP = TRUE
   PASS = 0
2. WHILE SWAP = TRUE DO
   BEGIN
   2.1 SWAP = FALSE
   2.2 FOR I = 0 TO (N-PASS-2) DO
       BEGIN
       2.2.1 IF LIST[I] > LIST[I+1]
             BEGIN
               TMP = LIST[I]
               LIST[I] = LIST[I+1]
               LIST[I+1] = TMP
               SWAP = TRUE
             END
       END
   2.3 PASS = PASS + 1
   END

C Code

void bubbleSort(int *array, int length)
{
    int i, j;
    for (i = 0; i < length - 1; i++)
        for (j = 0; j < length - i - 1; j++)
            if (array[j] > array[j+1])  /* compare neighboring elements */
            {
                int temp;
                temp = array[j];        /* swap array[j] and array[j+1] */
                array[j] = array[j+1];
                array[j+1] = temp;
            }
}

The algorithm for bubble sort requires a pair of nested loops. In the C code above, the outer loop iterates n-1 times for a data set of size n, while the inner loop performs n-1 comparisons the first time it is entered, n-2 the second, and so on. Consider the purpose of each loop. As explained above, bubble sort is structured so that on each pass through the list the next largest element of the data is moved to its proper place. Therefore, to get all n elements in their correct places, the outer loop must be executed n-1 times (once n-1 elements are in place, the remaining one is too).
The inner loop is executed on each iteration of the outer loop. Its purpose is to put the next largest element into place. The inner loop therefore does the comparing and swapping of adjacent elements.

12.3.1. ANALYSIS

To determine the complexity of this loop, we calculate the number of comparisons that have to be made. On the first iteration of the outer loop, while trying to place the largest element, there have to be n-1 comparisons: the first comparison is made between the first and second elements, the second is made between the second and third elements, and so on until the (n-1)th comparison is made between the (n-1)th and the nth element. On the second iteration of the outer loop, there is no need to compare the last element of the list again, because it was put in the correct place on the previous pass. Therefore, the second iteration requires only n-2 comparisons. This pattern continues until the last iteration of the outer loop, when only the first two elements of the list are unsorted; clearly in this case, only one comparison is necessary. The total number of comparisons, therefore, is (n-1) + (n-2) + … + 2 + 1 = n(n-1)/2, or O(n²).

The best case for bubble sort occurs when the list is already sorted or nearly sorted. In the case where the list is already sorted, bubble sort will terminate after the first iteration, since no swaps were made. Any time that a pass is made through the list and no swaps were made, it is certain that the list is sorted. (This early exit relies on the SWAP flag of the algorithm; the C code as given has no such flag and always performs all of its passes.) Bubble sort is also efficient when one random element needs to be sorted into a sorted list, provided that the new element is placed at the beginning and not at the end. When placed at the beginning, it will simply bubble up to the correct place, and the second iteration through the list will generate 0 swaps, ending the sort. If the random element is placed at the end, bubble sort loses its efficiency, because the element moves only one position towards the front on each pass.
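The early termination discussed above can be sketched in C as follows. This is an illustrative variant and not the code given in these notes: the function name bubble_sort_early_exit and its return value (the number of passes made) are assumptions introduced here so that the effect of the swap flag can be observed.

```c
#include <assert.h>

/* Hypothetical variant (illustration only): bubble sort that stops as soon
   as a full pass makes no swap. Returns the number of passes performed. */
int bubble_sort_early_exit(int a[], int n)
{
    int passes = 0;
    int swapped = 1;

    assert(n >= 0);
    while (swapped && passes < n - 1) {
        int j;
        swapped = 0;
        /* the last `passes` elements are already in their final places */
        for (j = 0; j < n - 1 - passes; j++) {
            if (a[j] > a[j + 1]) {
                int tmp = a[j];
                a[j] = a[j + 1];
                a[j + 1] = tmp;
                swapped = 1;
            }
        }
        passes++;
    }
    return passes;
}
```

On a sorted array the function returns after a single pass. With one out-of-place element at the front of an otherwise sorted array it needs two passes, whereas a small element placed at the end forces close to the full n-1 passes.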
The absolute worst case for bubble sort is when the smallest element of the list is at the large end. Because in each iteration only the largest unsorted element gets put in its proper location, when the smallest element is at the end, it will have to be swapped each time through the list, and it won't get to the front of the list until n-1 passes have occurred. In this worst case, it takes about n iterations of about n/2 swaps each, so the order is, again, n².

Case       Comparisons   Swaps
Best       n             0
Average    n²            ~n²/2
Worst      n²            n²/2

12.4 SELECTION SORT

Bubble sort could be made better if, rather than swapping an element with its neighbour after almost every comparison, we found the correct element for a position in one whole pass of the array, with no swapping at all until the end of the pass. This is the idea behind selection sort.

Consider the following unsorted data: 8 9 3 5 6 4 2 1 7 0. On the first iteration of the sort, the minimum data point is found by searching through all the data; in this case, the minimum value is 0. That value is then put into its correct place, i.e. at the beginning of the list, by exchanging the places of the two values. The 0 is swapped into the 8's position and the 8 is placed where the 0 was, without considering whether that is the correct place for it, which it is not. Now that the first element is sorted, it never has to be considered again.
So, although the current state of the data set is 0 9 3 5 6 4 2 1 7 8, the 0 is no longer considered, and the selection sort repeats itself on the remainder of the unsorted data: 9 3 5 6 4 2 1 7 8.

Consider a trace of the selection sort algorithm on a data set of ten elements:

(i)     8 9 3 5 6 4 2 1 7 0
(ii)    0 9 3 5 6 4 2 1 7 8
(iii)   0 1 3 5 6 4 2 9 7 8
(iv)    0 1 2 5 6 4 3 9 7 8
(v)     0 1 2 3 6 4 5 9 7 8
(vi)    0 1 2 3 4 6 5 9 7 8
(vii)   0 1 2 3 4 5 6 9 7 8
(viii)  0 1 2 3 4 5 6 9 7 8
(ix)    0 1 2 3 4 5 6 7 9 8
(x)     0 1 2 3 4 5 6 7 8 9

Selection Sort Algorithm

The selection sort algorithm is:

1. Select the index of the first element as min.
2. Compare element[min] with the next element; if the next element is less than element[min], replace min with that element's index.
3. Repeat Step 2 till the end of the list.
4. Now swap element[min] with the first element of the unsorted part of the list. This element is then at its appropriate position.
5. For the next iteration, the first element is excluded from the remaining list, and the above steps are repeated until all elements have been placed.

C Code

void selectionSort(int numbers[], int array_size)
{
    int i, j;
    int min, temp;
    for (i = 0; i < array_size-1; i++)
    {
        min = i;                        /* index of smallest item seen so far */
        for (j = i+1; j < array_size; j++)
        {
            if (numbers[j] < numbers[min])
                min = j;
        }
        temp = numbers[i];              /* swap smallest item into position i */
        numbers[i] = numbers[min];
        numbers[min] = temp;
    }
}

Like bubble sort, selection sort is implemented with one loop nested inside another. This suggests that the efficiency of selection sort, like bubble sort, is n². To understand why this is indeed correct, consider how many comparisons must take place. The first iteration through the data requires n-1 comparisons to find the minimum value to swap into the first position. Because the first position can then be ignored when finding the next smallest value, the second iteration requires n-2 comparisons and the third requires n-3. This progression continues as follows:

(n-1) + (n-2) + … + 2 + 1 = n(n-1)/2 = O(n²)

12.4.1.
ANALYSIS

Unlike other quadratic sorts, the efficiency of selection sort is independent of the data. Bubble sort, for example, can sort sorted and some nearly-sorted lists in linear time because it is able to identify when it has a sorted list. Selection sort does not do anything like that, because it is only seeking the minimum value on each iteration. Therefore, it cannot recognize (on the first iteration) the difference between the following two sets of data: 1 2 3 4 5 6 7 8 9 and 1 9 8 7 6 5 4 3 2. In each case, it will identify the 1 as the smallest element and then go on to sorting the rest of the list. Because it treats all data sets the same and has no ability to short-circuit the rest of the sort if it comes across a sorted list before the algorithm is complete, selection sort has no best or worst cases. Selection sort always takes O(n²) operations, regardless of the characteristics of the data being sorted. Even though selection sort is one of the slower algorithms, it is used because it sorts with a minimum of data movement: it performs at most n-1 swaps, far fewer than the other quadratic algorithms typically need.

12.5 RADIX SORT

Radix sort is a fast, stable sort algorithm which can be used to sort items that are identified by unique keys. Every key is a string or number, and radix sort sorts these keys in a particular lexicographic-like order.

Radix sort is one of the linear sorting algorithms for integers. It functions by sorting the input numbers on each digit, for each of the digits in the numbers. However, the process adopted by this sort method is somewhat counterintuitive, in the sense that the numbers are sorted on the least significant digit first, followed by the second-least significant digit, and so on till the most significant digit.

Let us sort the following list: 150, 45, 75, 90, 2, 24, 802, 66

1. sorting by least significant digit (1s place) gives: 150, 90, 2, 802, 24, 45, 75, 66
2. sorting by next digit (10s place) gives: 2, 802, 24, 45, 150, 66, 75, 90
3. sorting by most significant digit (100s place) gives: 2, 24, 45, 66, 75, 90, 150, 802

Let us consider a series of decimal numbers to be sorted. Set up 10 bins and pass through the list of numbers, placing each in the appropriate bin according to its least significant digit. Then combine the bins in order, without changing the order of the numbers within each bin. Repeating the process with the next most significant digit, until all digits have been used, leaves the numbers sorted.

For computer implementation, the bins are linked lists, and only pointers to the data are manipulated. If bit bins are used, only two lists need be manipulated, while for byte bins, up to 256 may be required, depending on the nature of the data. After each pass through the sort key, the lists are joined, and the new list is used for the next pass. For a straightforward implementation, the sort time depends linearly on the number of records to be sorted and on the number of passes, which in turn is related only to the number of "bins" and the sort key length.

The bin sorting approach can be generalized in a technique that is known as radix sorting.

To appreciate radix sort, consider the following analogy. Suppose that we wish to sort a deck of 52 playing cards (the different suits can be given suitable values, for example 1 for Diamonds, 2 for Clubs, 3 for Hearts and 4 for Spades). The 'natural' thing to do would be to first sort the cards according to suit, then sort each of the four separate piles, and finally combine the four in order. This approach, however, has an inherent disadvantage. When each of the piles is being sorted, the other piles have to be kept aside and kept track of. If, instead, we follow the 'counterintuitive' approach of first sorting the cards by value, this problem is eliminated. After the first step, the separate piles are combined in order and then sorted by suit.

10.6.1.
Radix Sort Algorithm

A radix sort algorithm works as follows:

1. take the least significant digit (or group of bits) of each key.
2. sort the list of elements based on that digit, but keep the order of elements with the same digit (this is the definition of a stable sort).
3. repeat the sort with each more significant digit.

C Code

/* Assumed here, since neither is defined in this unit: record is a structure
   with an unsigned 32-bit integer member key, and MAXLENGTH is a constant
   no smaller than n. */
void rsort(record a[], int n)
{
    int i, j;
    int shift;
    record temp[MAXLENGTH];
    int bucket_size[256], first_in_bucket[256];

    for (shift = 0; shift < 32; shift += 8)
    {
        /* compute the size of each bucket and copy each record
           from array a to array temp */
        for (i = 0; i < 256; i++)
            bucket_size[i] = 0;
        for (j = 0; j < n; j++)
        {
            i = (a[j].key >> shift) & 255;
            bucket_size[i]++;
            temp[j] = a[j];
        }

        /* mark off the beginning of each bucket */
        first_in_bucket[0] = 0;
        for (i = 1; i < 256; i++)
            first_in_bucket[i] = first_in_bucket[i-1] + bucket_size[i-1];

        /* copy each record from array temp to its bucket in array a */
        for (j = 0; j < n; j++)
        {
            i = (temp[j].key >> shift) & 255;
            a[first_in_bucket[i]] = temp[j];
            first_in_bucket[i]++;
        }
    }
}

12.5.1. ANALYSIS

The algorithm operates in O(nk) time, where n is the number of items and k is the average key length. If the size of the possible key space is proportional to the number of items, then each key will be log n symbols long, and radix sort uses O(n log n) time in this case.

In practice, if the keys used are short integers, it is practical to complete the sorting with only two passes, and each digit can be extracted with only a few bit operations that run in constant time. In this case, radix sort can effectively be regarded as taking O(n) time, and in practice it can be significantly faster than comparison-based sorting algorithms.

The greatest disadvantages of radix sort are that it usually cannot be made to run in place, so O(n) additional memory space is needed, and that it requires one pass for each symbol of the key, so it is very slow for potentially long keys.

The time complexity of the algorithm is as follows: Suppose that the n input numbers have maximum k digits.
Then the per-digit bucketing pass (a counting sort) is called a total of k times. Counting sort is a linear, or O(n), algorithm. So the entire radix sort procedure takes O(kn) time. If the numbers are of finite size, the algorithm runs in O(n) asymptotic time.

12.6. QUICK SORT

This is the most widely used internal sorting algorithm. In its basic form, it was invented by C.A.R. Hoare in 1960. Its popularity lies in its ease of implementation, moderate use of resources and acceptable behaviour for a variety of sorting cases. The basis of quick sort is the 'divide and conquer' strategy, i.e. divide the problem [list to be sorted] into sub-problems [sublists], until solved sub-problems [sorted sublists] are found. This is implemented as:

1. Choose one item A[I] from the list A[ ].
2. Rearrange the list so that this item is in the proper position, i.e. all preceding items have a lesser value and all succeeding items have a greater value than this item:
   A[0], A[1] .. A[I-1] in sublist 1
   A[I]
   A[I+1], A[I+2] ... A[N] in sublist 2
3. Repeat steps 1 & 2 for sublist 1 & sublist 2 till A[ ] is a sorted list.

As can be seen, this algorithm has a recursive structure. Step 2, the 'divide' procedure, is of utmost importance in this algorithm. It is usually implemented as follows:

1. Choose A[I], the dividing element.
2. From the left end of the list (A[0] onwards) scan till an item A[R] is found whose value is greater than A[I].
3. From the right end of the list (A[N] backwards) scan till an item A[L] is found whose value is less than A[I].
4. Swap A[R] & A[L].
5. Continue steps 2, 3 & 4 till the scan pointers cross. Stop at this stage.
6. At this point sublist 1 & sublist 2 are ready.
7. Now do the same for each of sublist 1 & sublist 2.

We will now give the implementation of Quicksort and illustrate it by an example. Here X and I are the first and last positions of the sublist being sorted, and the last element A[I] is the dividing element; the numbered steps are referred to in the example below.

Quicksort (int A[], int X, int I)
{
    int L, R, V;
1.  if (I > X)
    {
2.      V = A[I], L = X-1, R = I;
3.      for (;;)
        {
4.          while (A[++L] < V);
5.          while (A[--R] > V);
6.          if (L >= R)     /* left & right ptrs. have crossed */
7.              break;
8.          Swap (A, L, R); /* Swap A[L] & A[R] */
        }
9.      Swap (A, L, I);
10.     Quicksort (A, X, L-1);
11.     Quicksort (A, L+1, I);
    }
}

Quicksort is called with A, 1, N to sort the whole file.

C Code

void q_sort(int numbers[], int left, int right);   /* declared before use */

void quickSort(int numbers[], int array_size)
{
    q_sort(numbers, 0, array_size - 1);
}

void q_sort(int numbers[], int left, int right)
{
    int pivot, l_hold, r_hold;

    l_hold = left;
    r_hold = right;
    pivot = numbers[left];              /* first element is the pivot */
    while (left < right)
    {
        while ((numbers[right] >= pivot) && (left < right))
            right--;
        if (left != right)
        {
            numbers[left] = numbers[right];
            left++;
        }
        while ((numbers[left] <= pivot) && (left < right))
            left++;
        if (left != right)
        {
            numbers[right] = numbers[left];
            right--;
        }
    }
    numbers[left] = pivot;              /* pivot lands in its final place */
    pivot = left;
    left = l_hold;
    right = r_hold;
    if (left < pivot)
        q_sort(numbers, left, pivot-1);
    if (right > pivot)
        q_sort(numbers, pivot+1, right);
}

Example: Consider the following list to be sorted in ascending order: 'ADD YOUR MAN' (ignore blanks).

At this point 'N' is in its correct place, A[6]. A[1] to A[5] constitutes sublist 1. A[7] to A[10] constitutes sublist 2. Now

10. Quicksort (A, 1, 5)
11. Quicksort (A, 7, 10)

The quick sort algorithm uses O(N log2 N) comparisons on average. The performance can be improved by keeping in mind the following points.

1. Switch to a faster sorting scheme like insertion sort when the sublist size becomes comparatively small.
2. Use a better dividing element in the implementation. The pseudocode above always uses the last element of the sublist (A[N] for the whole list) as the dividing element. A useful method for the selection of a dividing element is the median-of-three method: select any 3 elements from the list and use the median of these as the dividing element.

12.6.1. ANALYSIS

The efficiency of quick sort is determined by calculating the running time of the two recursive calls plus the time spent in the partition. The partition step of quick sort takes n-1 comparisons.
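The n-1 comparison count for a single partition can be illustrated with a small sketch. Note that this uses the Lomuto partition scheme, which is a different and simpler scheme than the two-pointer partition used in the code of this section; it is chosen here only because its comparisons are easy to count. The function name partition_count and its comparisons parameter are hypothetical.

```c
#include <assert.h>

/* Hypothetical helper (illustration only): Lomuto-style partition of
   a[lo..hi] around the pivot a[hi]. Stores the number of key comparisons
   in *comparisons and returns the final index of the pivot. */
int partition_count(int a[], int lo, int hi, long *comparisons)
{
    int pivot, i, j, tmp;

    assert(lo <= hi && comparisons != 0);
    pivot = a[hi];
    i = lo;                      /* a[lo..i-1] holds items smaller than pivot */
    *comparisons = 0;
    for (j = lo; j < hi; j++) {
        (*comparisons)++;        /* one comparison per non-pivot element */
        if (a[j] < pivot) {
            tmp = a[i]; a[i] = a[j]; a[j] = tmp;
            i++;
        }
    }
    tmp = a[i]; a[i] = a[hi]; a[hi] = tmp;   /* place pivot between the parts */
    return i;
}
```

Each of the n-1 non-pivot elements is compared with the pivot exactly once, so partitioning a sublist of 6 items should report 5 comparisons whatever the order of the data.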
The efficiency of the recursive calls depends largely on how equally the pivot value splits the array. In the average case, assume that the pivot splits the array into two roughly equal halves. As is common with divide-and-conquer sorts, the array can then be halved only about log n times, so there are about log n levels of recursion, each doing O(n) partitioning work in total. Thus the overall quicksort algorithm has running time O(n log n).

The worst case occurs when the pivot value always ends up being one of the extreme values in the array. For example, this might happen in a sorted array if the first value is selected as the pivot. In this case, the partitioning phase still requires n-1 comparisons, as before, but quicksort does not achieve the O(log n) depth in the dividing process. Instead of breaking an 8 element array into arrays of size 4, 2, and 1 in three levels of recursive calls, the array size only reduces by one each time: 7, 6, and 5. Thus the dividing process becomes linear and the worst case efficiency is O(n²).

Note that quicksort performs badly once the amounts of data become small, due to the overhead of recursion. This is often addressed by switching to a different sort for data smaller than some magic number such as 25 or 30 elements.

Quite a lot of people believe that quick sort is the quickest sorting algorithm, which is not always true. There is no such thing as the quickest sorting algorithm: it depends on the data, the data types, the implementation language and much more. Heap sort is often suggested as a safe general choice because its worst case is also O(n log n). Quicksort works very well on random data and when the items can be swapped very fast. For almost sorted data, quicksort with a naive pivot choice (such as the first or last element) degenerates to O(n²), its worst case; on random data it runs in O(n log n), like any good comparison sort.

12.7 2-WAY MERGE SORT

Merge sort is also one of the 'divide and conquer' class of algorithms. The basic idea is to divide the list into a number of sublists, sort each of these sublists and merge them to get a single sorted list.
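The merging of two sorted sublists into a single sorted list, which this idea depends on, can be sketched as follows. This is a minimal illustration under assumed names: merge_sorted and its parameters are not taken from these notes, and the caller is assumed to supply an output array with room for nl + nr items.

```c
#include <assert.h>

/* Hypothetical helper (illustration only): merges the sorted arrays
   left[0..nl-1] and right[0..nr-1] into out[0..nl+nr-1]. */
void merge_sorted(const int left[], int nl, const int right[], int nr, int out[])
{
    int a = 0;   /* next unread position in left  */
    int c = 0;   /* next unread position in right */
    int b = 0;   /* next free position in out     */

    assert(nl >= 0 && nr >= 0);
    while (a < nl && c < nr) {
        /* <= takes from the left list on ties, which keeps the merge stable */
        if (left[a] <= right[c])
            out[b++] = left[a++];
        else
            out[b++] = right[c++];
    }
    while (a < nl)               /* copy whatever remains of left  */
        out[b++] = left[a++];
    while (c < nr)               /* copy whatever remains of right */
        out[b++] = right[c++];
}
```

Each element is copied exactly once, so merging takes time proportional to the total number of items. The need for the separate out array is the extra space cost of merge sort.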
The recursive implementation of 2-way merge sort divides the list into 2, sorts the sublists and then merges them to get the sorted list. The illustrative implementation of 2-way merge sort sees the input initially as n lists of size 1. These are merged to get n/2 lists of size 2. These n/2 lists are merged pairwise, and so on till a single list is obtained. This can be better understood by the following example. This is also called CONCATENATE SORT.

Figure 2 : 2-way merge sort

We give here the recursive implementation of 2-way merge sort. FINAL[ ] is an auxiliary array of the same size as LIST[ ].

Mergesort (int LIST[], int low, int high)
{
    int mid;
    if (low < high)
    {
        mid = (low + high)/2;
        Mergesort (LIST, low, mid);
        Mergesort (LIST, mid + 1, high);
        Merge (low, mid, high, LIST, FINAL);
    }
}

Merge (int low, int mid, int high, int LIST[], int FINAL[])
{
    int a, b, c, d;
    a = low; b = low; c = mid + 1;
    while (a <= mid and c <= high) do
    {
        if LIST[a] <= LIST[c] then
        {
            FINAL[b] = LIST[a];
            a = a + 1;
        }
        else
        {
            FINAL[b] = LIST[c];
            c = c + 1;
        }
        b = b + 1;
    }
    if (a > mid) then
        for d = c to high do
        {
            FINAL[b] = LIST[d];
            b = b + 1;
        }
    else
        for d = a to mid do
        {
            FINAL[b] = LIST[d];
            b = b + 1;
        }
    for d = low to high do      /* copy the merged run back into LIST */
        LIST[d] = FINAL[d];
}

To sort the entire list, Mergesort should be called with LIST, 1, N.

Mergesort is the best method for sorting linked lists in random order. The total computing time is O(n log2 n).

The disadvantage of using mergesort is that it requires two arrays of the same size and type for the merge phase. That is, to sort a list of size n, it needs space for 2n elements.

12.8 HEAP SORT

We will begin by defining a new structure, the heap. We have studied binary trees in BLOCK 5, UNIT 1. A binary tree is illustrated below.

Figure 3(a) : Heap 1

A complete binary tree is said to satisfy the 'heap condition' if the key of each node is greater than or equal to the keys of its children. Thus the root node will have the largest key value.

Trees can be represented as arrays, by first numbering the nodes (starting from the root) from left to right.
The key values of the nodes are then assigned to array positions whose index is given by the number of the node. For the example tree, the corresponding array would be

The relationships of a node can also be determined from this array representation. If a node is at position j, its children will be at positions 2j and 2j + 1. Its parent will be at position ⌊j/2⌋.

Consider the node M. It is at position 5. Its parent node is, therefore, at position ⌊5/2⌋ = 2, i.e. the parent is R. Its children are at positions 2×5 and (2×5) + 1, i.e. 10 and 11 respectively, i.e. E and I are its children. We see from the pictorial representation that these relationships are correct.

A heap is a complete binary tree, in which each node satisfies the heap condition, represented as an array.

We will now study the operations possible on a heap and see how these can be combined to generate a sorting algorithm.

The operations on a heap work in 2 steps.

1. The required node is inserted, deleted or replaced.
2. Step 1 may cause violation of the heap condition, so the heap is traversed and modified to rectify any such violations.

Examples

Insertion

Consider the insertion of a node R in heap 1.

1. Initially R is added as the right child of J and given the number 13.
2. But R > J, so the heap condition is violated.
3. Move R up to position 6 and move J down to position 13.
4. R > P, therefore the heap condition is still violated.
5. Swap R and P.
6. The heap condition is now satisfied by all nodes, giving heap 2.

Figure 3(a) : Heap 2

Deletion

Consider the deletion of M from heap 2.

1. The larger of M's children is promoted to position 5.

Figure 3(a) : Heap 3

An efficient sorting method is based on heap construction followed by removal of the nodes from the heap in order. This algorithm is guaranteed to sort n elements in O(n log n) steps.

We will first see 2 methods of heap construction and then removal in order from the heap to sort the list.

1. Top down heap construction
   - Insert items into an initially empty heap, keeping the heap condition inviolate at all steps.
2. Bottom up heap construction
   - Build a heap with the items in the order presented.
   - From the rightmost node, modify to satisfy the heap condition.

We will illustrate this with an example.

Example: Build a heap of the following using both methods of construction.

PROFESSIONAL

Top down construction

Figure 4: Heap Sort (Top down Construction)

Figure 5: Heap Sort by bottom-up approach

We will now see how sorting takes place using the heap built by the top down approach. The sorted elements will be placed in X[ ], an array of size 12.

1. Remove S and store in X[12] (b)
2. Remove S and store in X[11] (c)
9. Similarly the remaining 5 nodes are removed and the heap modified, to get the sorted list.

A E F I L N O O P R S S

Figure 6 : Sorting process through Heap

12.9. HEAPSORT VS. QUICKSORT

Heapsort primarily competes with Quicksort, another efficient nearly-in-place comparison-based sort algorithm.

Quicksort is typically somewhat faster, but the worst-case running time for Quicksort is O(n²), which becomes unacceptable for large data sets. See Section 12.6 for a discussion of this problem and possible solutions.

The Quicksort algorithm also requires Ω(log n) extra storage space, making it not a strictly in-place algorithm. This typically does not pose a problem except on the smallest embedded systems. Obscure constant-space variations on Quicksort exist but are rarely used in practice due to their extra complexity. In situations where very limited extra storage space is available, Heapsort is the sorting algorithm of choice.

Thus, because of the O(n log n) upper bound on Heapsort's running time and the constant upper bound on its auxiliary storage, embedded systems with real-time constraints often use Heapsort.

SUMMARY

Sorting is an important application activity.
Many sorting algorithms are available, each the most efficient for a particular situation or a particular kind of data. The choice of a sorting algorithm is crucial to the performance of the application. In this unit we have studied many sorting algorithms used in internal sorting. This is not an exhaustive list, and the student is advised to read the suggested volumes for exposure to additional sorting methods and for detailed discussions of the methods introduced here.

FILE STRUCTURE

Unit 1 PHYSICAL STORAGE DEVICES AND THEIR CHARACTERISTICS
History of file structures, Physical Files, Logical Files, Introduction, Magnetic Tape, Data Storage, Data Retrieval, Magnetic Disk, Floppy Diskette, Characteristics of Magnetic Disk Drives, Characteristics of Magnetic Disk Processing, Optical Technology.

Unit 2 CONSTITUENTS OF FILE AND FILE OPERATION
Constituents of a File, Field, Record, Header Records, Primary and Secondary Key, File Operations

Unit 3 FILE ORGANIZATIONS
File Concepts, Serial File, Sequential File, Processing Sequential Files, Indexed Sequential File, Inverted File, Direct File, Multi-list File

Unit 4 HASHING FUNCTIONS AND COLLISION HANDLING METHOD
Hash Tables, Hashing Function, Terms Associated with Hash Tables, Bucket Overflow, Handling of Bucket Overflows

Unit 1 PHYSICAL STORAGE DEVICES AND THEIR CHARACTERISTICS

1. History of file structures
2. Types of file
2.1. Physical Files
2.2. Logical Files
3. Functions of file
4. Storage Devices
4.1. Magnetic Tape
4.2. Magnetic Disk
4.3. Floppy Diskette
4.4. Characteristics of Magnetic Disk Drives
5. Data Retrieval
5.1. Data Retrieval (Hard Disk)
5.2. Data Retrieval (Floppy Disk)
6. Optical Technology.

What is File Structure?

A file structure is a combination of representations for data in files and of operations for accessing the data. A file structure allows applications to read, write and modify data.
It might also support finding the data that matches some search criteria, or reading through the data in some particular order.

1. History of File structure

In a file processing class, you will study the design of files and various operations on files. How is a file processing course different from a data structures course? In a data structures course, you study how information is stored in main memory. In a file processing course, you study how information is stored in secondary memory.

Main memory is a volatile storage device. When you power off your computer system, all the information in random access memory (RAM) is lost. Secondary memory is a nonvolatile storage device. When you power off your computer system, the information stored in secondary memory is retained. Therefore, it is not good practice to keep gigabytes of information only in RAM. Usually information is stored permanently on secondary memory such as tapes, disks and optical disks.

The storage capacity of RAM is smaller than that of secondary storage devices. For a typical PC, RAM size is about 128 megabytes at a cost of about $15, while secondary storage can be 20-30 gigabytes at a cost of about $100. On the other hand, the access speed of RAM is much faster than the access speed of secondary storage. The time taken to retrieve information from RAM is about 120 nanoseconds (billionths of a second). Getting the same information from a disk might take 30 milliseconds (thousandths of a second).

As needed by a software application, information can be retrieved from secondary memory, and the retrieved data stays temporarily in main memory to be processed. The way files are stored on secondary memory can have a great impact on the speed at which information is retrieved from it. Therefore in this course, you will study various file structures and their effects on file access speed.
A short history of file structure design

Since it takes time to get information from secondary memory, the goal of file structure design is to minimize the access time to disk. Information must be organized and grouped so that users can get everything they need in one trip to the disk. If we need a book's title, author, publisher and patron, we want to get all the related information in one place instead of looking in different places on the disk.

File structure issues become complicated as files change, growing and shrinking when different add and delete operations are applied to them. Index and B-tree structures have been developed in the past 30 years to reduce file processing time and protect file integrity.

2. Types of file

2.1 Logical File

A channel that hides the details of the file's location and physical format from the program. The logical file has a logical name used for referring to the file inside the program. This logical name is a variable inside the program, for instance:

FILE *fp

In C++ the logical name is the name of an object of the class fstream, like:

fstream outfile;

2.2 Physical File

A collection of bytes stored on a disk or tape. A physical file has a name, for instance ourfile.exe.

There may be thousands of physical files on a disk, but a program can typically have only about 20 logical files open at the same time.

When a program wants to use a particular file, "data", the OS must find the physical file called "data" and make the hookup by assigning a logical file to it. This logical file has a logical name, which is what is used inside the program.

The Operating System is responsible for associating a logical file in a program with a physical file on disk or tape.

3. Functions of file

Reading

read(source_file, destination_address, size);

source_file: location the program reads from, i.e. its logical file name.
destination_address: first address of the memory block where we want to store the data.
size: how much information is being brought in from the file (byte count).

eg. (Based on C)

char a;
FILE *fp;
…
fp = fopen("ourfile.txt", "r");
fread(&a, 1, 1, fp);

eg. (Based on C++)

char a;
fstream infile;
infile.open("ourfile.txt", ios::in);
infile >> a;

Writing

write(destination_file, source_addr, size);

destination_file: the logical file name where the data will be written.
source_addr: first address of the memory block where the data to be written is stored.
size: the number of bytes to be written.

eg. (Based on C)

char a;
FILE *fp;
…
fp = fopen("ourfile.txt", "w");   /* open for writing, not "r" */
…
fwrite(&a, 1, 1, fp);

eg. (Based on C++)

char a;
fstream outfile;
outfile.open("ourfile.txt", ios::out);
outfile << a;

Opening Files

The first operation generally done on an object of one of the stream classes (ofstream, ifstream or fstream) is to associate it with a real file, that is to say, to open a file. The open file is represented within the program by a stream object (an instantiation of one of these classes), and any input or output performed on this stream object will be applied to the physical file.

In order to open a file with a stream object we use its member function open():

void open (const char * filename, openmode mode);

where filename is a string of characters representing the name of the file to be opened and mode is a combination of the following flags:

ios::in       Open file for reading
ios::out      Open file for writing
ios::ate      Initial position: end of file
ios::app      Every output is appended at the end of file
ios::trunc    If the file already existed it is erased
ios::binary   Binary mode

These flags can be combined using bitwise operator OR: |.
For example, if we want to open the file "example.bin" in binary mode to add data, we could do it with the following call to the member function open:

    ofstream file;
    file.open ("example.bin", ios::out | ios::app | ios::binary);

The open member functions of the classes ofstream, ifstream and fstream each include a default mode, which varies from one class to the other:

    class      default mode parameter
    ofstream   ios::out | ios::trunc
    ifstream   ios::in
    fstream    ios::in | ios::out

The default value is applied only if the function is called without specifying a mode parameter. If the function is called with any value in that parameter, the default mode is overridden, not combined.

Since the first task performed on an object of the classes ofstream, ifstream and fstream is frequently to open a file, these three classes include a constructor that directly calls the open member function and takes the same parameters. This way, we could also have declared the previous object and conducted the same opening operation just by writing:

    ofstream file ("example.bin", ios::out | ios::app | ios::binary);

Both forms of opening a file are valid. You can check whether a file has been correctly opened by calling the member function is_open():

    bool is_open();

It returns a bool value: true if the object has been correctly associated with an open file, false otherwise.

Closing files

When reading, writing or consulting operations on a file are complete, we must close it so that it becomes available again. To do that we call the member function close(), which is in charge of flushing the buffers and closing the file. Its form is quite simple:

    void close ();

Once this member function is called, the stream object can be used to open another file, and the file is available again to be opened by other processes.
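The same open, check, use, close cycle can be sketched in C with the FILE * interface used in the earlier examples. This is only a minimal illustration: the function names append_byte and read_last_byte are hypothetical, and the mode string "ab" is taken here as the rough C counterpart of ios::out | ios::app | ios::binary.

```c
#include <stdio.h>

/* Append one byte to a file in binary mode.
   The NULL check after fopen plays the role of is_open().
   Returns 0 on success, -1 on failure. */
int append_byte(const char *path, char value)
{
    FILE *fp = fopen(path, "ab");   /* attach a logical file to the physical file */
    if (fp == NULL)
        return -1;
    size_t written = fwrite(&value, 1, 1, fp);
    /* fclose flushes the buffer and releases the file */
    if (fclose(fp) != 0 || written != 1)
        return -1;
    return 0;
}

/* Read the last byte of a file.
   Returns the byte as a non-negative int, or -1 on failure
   (file missing, file empty, or read error). */
int read_last_byte(const char *path)
{
    FILE *fp = fopen(path, "rb");
    if (fp == NULL)
        return -1;
    char value = 0;
    int ok = fseek(fp, -1L, SEEK_END) == 0 && fread(&value, 1, 1, fp) == 1;
    fclose(fp);
    return ok ? (unsigned char)value : -1;
}
```

Calling append_byte twice on the same file name and then read_last_byte should return the second byte, because append mode places every write at the end of the file.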
If a stream object is destroyed while still associated with an open file, the destructor automatically calls the member function close.

4. Storage Devices

It is important to know the difference between secondary storage and a computer's main memory. Secondary storage is also called auxiliary storage and is used to store data and programs when they are not being processed. Secondary storage is more permanent than main memory, as data and programs are retained when the power is turned off. The needs for secondary storage can vary greatly between users. A personal computer might require only 20,000 bytes of secondary storage, but large companies, such as banks, may require secondary storage devices that can store billions of characters. Because of such a variety of needs, a variety of storage devices are available. The two most common types of secondary storage are magnetic tapes and magnetic disks.

Computer storage is the holding of data in an electromagnetic form for access by a computer processor. Primary storage is data in random access memory (RAM) and other "built-in" devices. Secondary storage is data on hard disks, tapes, and other external devices. Primary storage is much faster to access than secondary storage because of the proximity of the storage to the processor or because of the nature of the storage devices. On the other hand, secondary storage can hold much more data than primary storage.

4.1 Magnetic Tape Storage

Magnetic tape is a magnetically coated strip of plastic on which data can be encoded. Tapes for computers are similar to tapes used to store music. Storing data on tapes is considerably cheaper than storing data on disks. Tapes also have large storage capacities, ranging from a few hundred kilobytes to several gigabytes. Accessing data on tapes, however, is much slower than accessing data on disks. Tapes are sequential-access media, which means that to get to a particular point on the tape, the tape must go through all the preceding points.
In contrast, disks are random-access media because a disk drive can access any point at random without passing through intervening points. Because tapes are so slow, they are generally used only for long-term storage and backup. Data to be used regularly is almost always kept on a disk. Tapes are also used for transporting large amounts of data. Tapes come in a variety of sizes and formats. Tapes are sometimes called streamers or streaming tapes.

Magnetic tape is a one-half inch or one-quarter inch ribbon of plastic material on which data is recorded. The tape drive is an input/output device that reads, writes and erases data on tapes. Magnetic tapes are erasable, reusable and durable. They are made to store large quantities of data inexpensively and therefore are often used for backup. Magnetic tape is not suitable for data files that are revised or updated often because it stores data sequentially.

4.2 Magnetic Disk (Organization of disks)

Magnetic disks are the most widely used storage medium for computers. A magnetic disk offers high storage capacity, reliability, and the capacity to access stored data directly. Magnetic disks hold more data in a small space and attain faster data access speeds. Types of magnetic disks include diskettes, hard disks, and removable disk cartridges.

The information on a disk is stored on the surface of one or more platters. The arrangement is such that the information is stored in successive tracks on the surface of the disk. Each track is often divided into a number of sectors. A sector is the smallest addressable portion of a disk. When a read statement calls for a particular byte from a disk file, the computer's OS finds the correct surface, track and sector, reads the entire sector into a special area in memory called a buffer, and then finds the requested byte within that buffer.

Disk drives typically have a number of platters. The tracks that are directly above and below one another form a cylinder.
The significance of a cylinder is that all of the information on a single cylinder can be accessed without moving the arm that holds the read/write heads. Moving this arm is called seeking. This arm movement is usually the slowest part of reading information from a disk.

Storing the data: Data is stored on the surface of a platter in sectors and tracks. Tracks are concentric circles, and sectors are pie-shaped wedges on a track. The process of low-level formatting a drive establishes the tracks and sectors on the platter. The starting and ending points of each sector are written onto the platter. This process prepares the drive to hold blocks of bytes. High-level formatting then writes the file-storage structures, like the file-allocation table, into the sectors. This process prepares the drive to hold files.

Capacity and space needed: Platters are organized into specific structures to enable the organized storage and retrieval of data. Each platter is broken into tracks (tens of thousands of them), which are tightly packed concentric circles. These are similar in structure to the annual rings of a tree (but not similar to the grooves in a vinyl record album, which form a connected spiral and not concentric rings). A track holds too much information to be suitable as the smallest unit of storage on a disk, so each one is further broken down into sectors. A sector is normally the smallest individually addressable unit of information stored on a hard disk, and normally holds 512 bytes of information. The first PC hard disks typically held 17 sectors per track. Today's hard disks can have thousands of sectors in a single track, and make use of zoned recording to allow more sectors on the larger outer tracks of the disk.

Figure: A platter from a 5.25" hard disk, with 20 concentric tracks drawn over the surface. This is far lower than the density of even the oldest hard disks; even if visible, the tracks on a modern hard disk would require high magnification to resolve.
Each track is divided into 16 imaginary sectors. Older hard disks had the same number of sectors per track, but new ones use zoned recording with a different number of sectors per track in different zones of tracks.

All information stored on a hard disk is recorded in tracks, which are concentric circles placed on the surface of each platter, much like the annual rings of a tree. The tracks are numbered from zero, starting at the outside of the platter and increasing as you go in. A modern hard disk has tens of thousands of tracks on each platter.

Hard Disks

Hard disks provide larger and faster secondary storage capabilities than diskettes. Usually hard disks are permanently mounted inside the computer and are not removable like diskettes. On minicomputers and mainframes, hard disks are often called fixed disks. They are also called direct-access storage devices (DASD). Most personal computers have two to four disk drives. The input/output device that transfers data to and from the hard disk is the hard disk drive.

HARD DISK STORAGE CAPACITY
o Like diskettes, hard disks must be formatted before they can store information. The storage capacity of hard drives is measured in megabytes. Common sizes for personal computers range from 100MB to 500MB of storage. Each 10MB of storage is equivalent to approximately 5,000 printed pages (with approximately 2,000 characters per page).

Disk Cartridges

Removable disk cartridges are another form of disk storage for personal computers. They offer the storage and fast access of hard disks and the portability of diskettes. They are often used when security is an issue since, when you are done using the computer, the disk cartridge can be removed and locked up, leaving no data on the computer.

4.3 Floppy Diskettes

A floppy diskette is a small flexible mylar disk coated with iron oxide on which data are stored. These disks are available in three sizes:

1. 8-inch portable floppy (flexible) disks.
2. 5 1/4-inch portable floppy disks. These two kinds of diskette are individually packaged in protective envelopes. Both are the most popular online secondary storage media used in PCs and intelligent terminal systems.
3. Compact floppy disks measuring less than 4 inches in diameter. The 3 1/2-inch diameter miniature diskettes are individually packed in a hard plastic case. This case has a dust-sealing and finger-proof shutter which opens automatically once the case is inserted in its disk drive. These disks are popular in desktop and portable personal computers.

A floppy disk may be single sided (data can be recorded on one side only) or double sided (data can be recorded on both sides of the disk). All disks have two sides, but the difference depends on whether the disk has been certified free of errors on one or both sides. Data recorded on a double-sided disk by a double-sided drive cannot be read on a single-sided drive because of the way the data are stored. A double-sided drive has two read/write heads, one for each side, while a single-sided drive has only one read/write head. The most commonly used diskettes are the 5 1/4-inch and 3 1/2-inch sizes.

DISKETTE STORAGE CAPACITY
o Before you can store data on your diskette, it must be formatted. The amount of data you can store on a diskette depends on the recording density and the number of tracks on the diskette. The recording density is the number of bits that can be recorded on one inch of track on the diskette, or bits per inch (bpi). The second factor that influences the amount of data stored on a diskette is the number of tracks on which the data can be stored, or tracks per inch (tpi). Commonly used diskettes are referred to as either double-density or high-density (single-density diskettes are no longer used). Double-density diskettes (DD) can store 360K for a 5 1/4 inch diskette and 720K for a 3 1/2 inch diskette.
High-density diskettes (HD) can store 1.2 megabytes on a 5 1/4 inch diskette and 1.44 megabytes on a 3 1/2 inch diskette.

CARE OF DISKETTES
o You should keep diskettes away from heat, cold, magnetic fields (including telephones) and contaminated environments such as dust, smoke, or salt air. Also keep them away from food and do not touch the disk surface.

4.4. Characteristics of Magnetic Disk Drives

1. The storage capacity of a single disk ranges from 10MB to 10GB. A typical commercial database may require hundreds of disks.
2. Figure 10.2 shows a moving-head disk mechanism.
   o Each disk platter has a flat circular shape. Its two surfaces are covered with a magnetic material and information is recorded on the surfaces. The platters of hard disks are made from rigid metal or glass, while floppy disks are made from flexible material.
   o The disk surface is logically divided into tracks, which are subdivided into sectors. A sector (varying from 32 bytes to 4096 bytes, usually 512 bytes) is the smallest unit of information that can be read from or written to disk. There are 4-32 sectors per track and 20-1500 tracks per disk surface.
   o The arm can be positioned over any one of the tracks.
   o The platter is spun at high speed.
   o To read information, the arm is positioned over the correct track.
   o When the data to be accessed passes under the head, the read or write operation is performed.
3. A disk typically contains multiple platters (see Figure 10.2). The read-write heads of all the tracks are mounted on a single assembly called a disk arm, and move together.
   o Multiple disk arms are moved as a unit by the actuator.
   o Each arm has two heads, to read the disks above and below it.
   o The set of tracks over which the heads are located forms a cylinder.
   o This cylinder holds the data that is accessible within the disk latency time.
   o It is clearly sensible to store related data in the same or adjacent cylinders.
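The relationship between sectors, tracks, surfaces and cylinders can be turned into a small capacity calculation. The sketch below is only illustrative: the struct disk_geometry and both function names are hypothetical, and it assumes that both surfaces of every platter are used and that every track has the same number of sectors (so it ignores zoned recording).

```c
/* Hypothetical description of a disk's geometry. */
struct disk_geometry {
    unsigned long platters;            /* each platter has two surfaces */
    unsigned long tracks_per_surface;
    unsigned long sectors_per_track;
    unsigned long bytes_per_sector;
};

/* Bytes in one cylinder: the same-numbered track on every surface.
   This is the amount of data reachable without moving the arm. */
unsigned long long cylinder_bytes(const struct disk_geometry *g)
{
    unsigned long long surfaces = 2ULL * g->platters;
    return surfaces * g->sectors_per_track * g->bytes_per_sector;
}

/* Total capacity: one cylinder per track position. */
unsigned long long disk_bytes(const struct disk_geometry *g)
{
    return cylinder_bytes(g) * g->tracks_per_surface;
}
```

Taking the upper ends of the ranges quoted above (32 sectors per track, 1500 tracks per surface, 512-byte sectors) and assuming 4 platters, one cylinder holds 8 x 32 x 512 = 131,072 bytes and the whole disk holds 131,072 x 1500 = 196,608,000 bytes.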
Disk platters range from 1.8" to 14" in diameter. 5 1/4" and 3 1/2" disks dominate because they cost less and have faster seek times than larger disks, yet they still provide high storage capacity.

A disk controller interfaces between the computer system and the actual hardware of the disk drive. It accepts commands to read or write a sector and initiates the corresponding actions. Disk controllers also attach a checksum to each sector so that read errors can be detected.

Remapping of bad sectors: If a controller detects that a sector is damaged when the disk is initially formatted, or when an attempt is made to write the sector, it can logically map the sector to a different physical location.

SCSI (Small Computer System Interface) is commonly used to connect disks to PCs and workstations. Mainframe and server systems usually have a faster and more expensive bus to connect to the disks.

Head crash: the head touches the platter surface and damages the recording surface, which can cause the entire disk to fail.

A fixed-head disk has a separate head for each track, which means very many heads and a very expensive device. Multiple disk arms allow more than one track to be accessed at a time. Both were used in high-performance mainframe systems but are relatively rare today.

Magnetic disk drives are direct-access devices designed to minimize the access time required to locate specific records. Each drive has not one but a series of access arms that can locate records on specific surfaces. The time it takes to locate specific records is much less than that required by a tape drive with only one read/write mechanism. There are several types of disk mechanisms.

1. Moving-Head Magnetic Disk: In a moving-head disk system, all the read/write heads are attached to a single movable access mechanism. Thus the access mechanism moves directly to a specific disk address, as indicated by the computer. Because all the read/write heads move together to locate a record, this type of mechanism has a relatively slow access rate compared to other disks.
The access time, however, is still considerably faster than that for tape.

2. Fixed-Head Magnetic Disk: Since disks are generally used for high-speed access to records in a file (for example, an airline reservation file), any method that can reduce access time results in a substantial benefit. For this reason, fixed-head magnetic disks were developed. These devices do not have a movable access arm. Instead, each track has its own read/write mechanism that accesses a record as it rotates past the arm. The disks in this device are not removable and the capacity of each disk is somewhat less, but the access time is significantly reduced.

Still other disk devices combine the technologies of both moving-head and fixed-head access to produce a high-capacity, rapid-access device.

5. Data Retrieval

5.1 Data Retrieval-Hard Disk

In a stacked-disk system, each access arm carries two read/write heads, one for each recording surface it serves. All these access arms move in unison in and out of the disks. Each hard disk recording surface has a specified number of tracks and sectors. A particular track on all the surfaces of the multiple disks makes up that particular cylinder of the disks. For example, the tenth tracks of all surfaces together form the tenth cylinder of the disks. Thus, a cylinder may be defined as all tracks on magnetic disks that are accessible by a single movement of the access mechanism. The track position of the read/write head on the top recording surface is the same as the track position of the other heads on the access arms serving the other surfaces.

To access a record, its disk address (i.e. the cylinder number, surface number and record number) must be provided by a computer program. The motor rotates the disk at high speed, and in one revolution the data on surface 1 of a particular track is read. In the next revolution, the data on surface 2 of the same track is read. A full track of data can be read in a single revolution.
This procedure continues down the cylinder with very fast movement of the access arms. After the data are accessed, they are copied from the disk to the processor at a rate (the transfer rate) which depends on the density of the stored data and the disk's rotation speed.

5.2 Data Retrieval-Floppy Disk

As shown in figure 8.7, a typical floppy disk jacket has a small hole to the right of the larger center hole, called the index hole window. The disk also has a small hole, called the index hole, which cannot be seen unless it is aligned with the jacket hole. This index hole enables the disk controller to locate disk sectors. Access to data is carried out in the following steps:

1. The read/write head moves to the track specified in the disk address (i.e. track number and sector number).
2. The disk-drive controller locates the index reference, i.e. the index hole, for determining sector locations. This takes place when light passes through the disk once the index hole is aligned with the index-hole window. This light triggers a light-sensing device which marks the hole's location, and hence the index hole is detected.
3. The disk controller begins reading data.
4. When the specified sector passes under the head, the controller transmits data to the processor unit.

6. Optical Technology

The first optical disks that became commercially available, around 1982, were compact audio disks. Since then, and during a particularly active marketing period from 1984 to 1986, at least a dozen other optical formats have emerged or are under development. The rapid proliferation of formats has led, understandably, to some confusion. This digest will briefly describe the most prominent formats (and their acronyms), and the contexts in which they are used.

Optical disks go by many names (OD, laser disk, and disc among them), all of which are acceptable. At first glance, they bear some resemblance to floppy disks: they may be about the same size, store information in digital form, and be used with microcomputers.
But where information is encoded on floppy disks in the form of magnetic charges, it is encoded on optical disks in the form of microscopic pits. These pits are created and read with laser (light) technology. The optical disks that are sold are actually "pressed," or molded from a glass master disk, in somewhat the same way as phonograph records. The copies are composed of clear hard plastic with a reflective metal backing, and protected by a tough lacquer. As the disk spins in the optical disk player (reader, drive), a reader head projects a beam from a low-power laser through the plastic onto the pitted data surface and interprets the reflected light.

Optical disk players can stand alone, like portable tape players, or be connected to stereos, televisions, or microcomputers. Optical disks lack the erasability and access speed of floppies, but these disadvantages are offset by their huge storage capacity, higher level of accuracy, and greater durability.

An optical disk is a storage medium from which data is read and to which it is written by lasers. Optical disks can store much more data, up to 6 gigabytes (6 billion bytes), than most portable magnetic media, such as floppies. There are three basic types of optical disks:

CD-ROM: Like audio CDs, CD-ROMs come with data already encoded onto them. The data is permanent and can be read any number of times, but CD-ROMs cannot be modified.
WORM: Stands for write-once, read-many. With a WORM disk drive, you can write data onto a WORM disk, but only once. After that, the WORM disk behaves just like a CD-ROM.
Erasable: Optical disks that can be erased and loaded with new data, just like magnetic disks. These are often referred to as EO (erasable optical) disks.

These three technologies are not compatible with one another; each requires a different type of disk drive and disk. Even within one category, there are many competing formats, although CD-ROMs are relatively standardized.

Unit 2
CONSTITUENTS OF FILE AND FILE OPERATION

1. Field and Record
2. Header Records
3. Primary and Secondary Key
4. File Operations

Record

A record can be defined as a set of fields that belong together when the file is viewed in terms of a higher level of organization. Like the notion of a field, a record is another conceptual tool. It is another level of organization that we impose on the data to preserve meaning. Records do not necessarily exist in the file in any physical sense, yet they are an important logical notion included in the file's structure.

Most often, as in the example above, a record in a file represents a structured data object. Writing a record into a file can be thought of as saving the state (or value) of an object that is stored in memory. Reading a record from a file into a memory-resident object restores the state of the object. It is our goal in designing file structures to facilitate this transfer of information between memory and files. We will use the term object to refer to data residing in memory and the term record to refer to data residing in a file.

Following are some of the most often used methods for organizing the records of a file:

o Require that the records be a predictable number of bytes in length.
o Require that the records be a predictable number of fields in length.
o Begin each record with a length indicator consisting of a count of the number of bytes that the record contains.
o Use a second file to keep track of the beginning byte address for each record.
o Place a delimiter at the end of each record to separate it from the next record.

Field and Record Organization

The basic logical unit of data is the field, which contains a single data value. Fields are organized into aggregates, either as many copies of a single field (an array) or as a list of different fields (a record). When a record is stored in memory, we refer to it as an object and refer to its fields as members.
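The idea that writing a record saves the state of an in-memory object, and reading a record restores it, can be sketched in C with a struct whose members are the fields. The struct person, its field sizes, and the two function names are assumptions made for illustration. Because every member is a fixed-size character array, every record has the same predictable length in bytes.

```c
#include <stdio.h>

/* Hypothetical object: three fixed-size fields (members). */
struct person {
    char last_name[16];
    char first_name[16];
    char city[16];
};

/* Save the state of an in-memory object as one record.
   Returns 0 on success, -1 on failure. */
int write_person(FILE *fp, const struct person *p)
{
    return fwrite(p, sizeof *p, 1, fp) == 1 ? 0 : -1;
}

/* Restore an object from the next record in the file.
   Returns 0 on success, -1 at end of file or on error. */
int read_person(FILE *fp, struct person *p)
{
    return fread(p, sizeof *p, 1, fp) == 1 ? 0 : -1;
}
```

Writing a struct byte for byte is the simplest approach but ties the file to one machine and compiler once members such as int or double are added, since padding and byte order may then differ. The field and record structures discussed next avoid that dependence.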
In this lecture, we will investigate the many ways that objects can be represented as records in files.

Stream Files

o Simplest view of a file: a stream of bytes (characters)
o May end with a special control character
o Does not recognize or impose any structure or meaning
o Unix views a file as a stream of bytes
o Some utilities (sort, grep) recognize lines and, within lines, whitespace characters as data delimiters

Record Files

o A view that sees a file as a collection of records
o A record is a collection of related fields
o A field is the smallest unit of meaningful data
o View directly supported by COBOL

To define a field in a file:

o Force it into a predictable length
o Precede it with a byte count: 15Harinder Grover 14Susheel Shukla 11Mukul Gupta
o Separate it with a delimiter: Harinder Grover:Susheel Shukla:Mukul Gupta

When choosing how to define a field in a file, consider the following:

o Space needed to represent a field
o Ability to represent all data
o Missing values
o Processing time to recognize a field

Field Structures

There are many ways of adding structure to files to maintain the identity of fields:

o Force the field into a predictable length
o Begin each field with a length indicator
o Use a "keyword = value" expression to identify each field and its content

Reading a Stream of Fields

A program can easily read a stream of fields and output them. This time we do preserve the notion of fields, but something is missing: rather than a stream of fields, these should be two records.

    Last Name: Sharma
    First Name: Gopi Ram
    Address: 30, JP Road.
    City: Bombay
    State: Maharashtra
    Zip Code: 2145631
    Last Name: Sinha
    First Name: Ram Prasad
    Address: 13, Krishna Nagar.
    City: Delhi
    State: Delhi
    Zip Code: 110011

Record Structures that use a Length Indicator

The notion of records that we implemented is lacking something: none of the variability in the length of records that was inherent in the initial stream file was conserved.
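A length indicator restores that variability. The following is a minimal sketch, not the course's own implementation: the names write_record and read_record, the 2-byte size of the length indicator, and its low-byte-first layout are all assumptions chosen for illustration. The fields inside each record could be separated with a delimiter such as '|'.

```c
#include <stdio.h>

/* Write one variable-length record: a 2-byte length indicator, then the
   record's bytes. The length is stored low byte first so the file layout
   does not depend on the machine's byte order.
   Returns 0 on success, -1 on failure. */
int write_record(FILE *fp, const char *data, size_t len)
{
    if (len > 0xFFFF)
        return -1;                      /* does not fit in two bytes */
    unsigned char prefix[2] = { (unsigned char)(len & 0xFF),
                                (unsigned char)(len >> 8) };
    if (fwrite(prefix, 1, 2, fp) != 2)
        return -1;
    return fwrite(data, 1, len, fp) == len ? 0 : -1;
}

/* Read the next record into buf, which has room for cap bytes including
   the terminating '\0'.
   Returns the record length, or -1 at end of file or on error. */
long read_record(FILE *fp, char *buf, size_t cap)
{
    unsigned char prefix[2];
    if (fread(prefix, 1, 2, fp) != 2)
        return -1;
    size_t len = (size_t)prefix[0] | ((size_t)prefix[1] << 8);
    if (len + 1 > cap || fread(buf, 1, len, fp) != len)
        return -1;
    buf[len] = '\0';
    return (long)len;
}
```

Each record now occupies exactly as many bytes as it needs plus two, and a reader can skip from record to record without examining the fields inside.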
Implementation:

o Writing the variable-length records to the file
o Representing the record length
o Reading the variable-length record from the file

2. Header Records

Sometimes data about the file is kept within the file. A special record at the beginning of the file could contain information such as the number of records that follow, the length of the records, etc. Header records may also precede related groups of records.

o Example: Employee file
   o Employee records are grouped by department
   o Each group is preceded by a record that indicates the number of records in the following department group

Record Structure

Choosing a record structure and record length: within a fixed-length record, there are two approaches.

1. Fixed-length fields in the record (simple but problematic).
2. Varying field boundaries within the fixed-length record.

Header records are often used at the beginning of the file to hold some general information about the file to assist in future use of the file.

To organize a file into records:

o Make records a predictable length
   o Fields within the record may or may not be fixed
   o Allows a seek to a particular record
   o Example: rec size = 100; to retrieve record number 50, seek to file position 5000 and read 100 bytes
o Make records a predictable number of fields
o Precede each record with a byte count
o Separate records with a delimiter
o Keep a table that holds the byte position of each record
   o Requires two files
   o Allows a seek to the desired record
   o Example: if the length of record 1 is 40 bytes, record 2 is 15 bytes, record 3 is 30 bytes and record 4 is 20 bytes, the index would have 5 entries: 00 40 55 85 105. To access record 3, find the byte offset in the table (55) and the length of the record (85 - 55 = 30), then seek to file position 55 and read 30 bytes.

Record Structure I

A record can be defined as a set of fields that belong together when the file is viewed in terms of a higher level of organization.
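The two seek examples given earlier in this section (fixed-length records, and a table of byte positions) reduce to simple arithmetic followed by a seek. The sketch below is illustrative only: the three function names are hypothetical, the fixed-length case counts records from 0 so that record 50 of size 100 starts at byte 5000, and the table case counts records from 1 and assumes the table carries one extra entry marking the end of the last record, as in the 00 40 55 85 105 example.

```c
#include <stdio.h>

/* Byte offset of record rec_no (counted from 0) in a file of
   fixed-length records. */
long fixed_record_offset(long rec_no, long rec_size)
{
    return rec_no * rec_size;
}

/* Look up record rec_no (counted from 1) in a table of starting byte
   offsets with one extra entry for the end of the last record.
   Stores the start and length; returns 0 on success, -1 if out of range. */
int indexed_record_span(const long *index, size_t entries, size_t rec_no,
                        long *start, long *length)
{
    if (rec_no < 1 || rec_no >= entries)
        return -1;
    *start = index[rec_no - 1];
    *length = index[rec_no] - index[rec_no - 1];
    return 0;
}

/* Seek to the record and read it; buf must hold at least length bytes.
   Returns 0 on success, -1 on failure. */
int read_span(FILE *fp, long start, long length, char *buf)
{
    if (fseek(fp, start, SEEK_SET) != 0)
        return -1;
    return fread(buf, 1, (size_t)length, fp) == (size_t)length ? 0 : -1;
}
```

With the table 0, 40, 55, 85, 105, a lookup of record 3 gives start 55 and length 30, matching the example in the text.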
Like the notion of a field, a record is another conceptual tool, which need not exist in the file in any physical sense. Yet records are an important logical notion included in the file's structure.

Record Structure II

Methods for organizing the records of a file include:

o Requiring that the records be a predictable number of bytes in length.
o Requiring that the records be a predictable number of fields in length.
o Beginning each record with a length indicator consisting of a count of the number of bytes that the record contains.
o Using a second file to keep track of the beginning byte address for each record.
o Placing a delimiter at the end of each record to separate it from the next record.

3. Primary and Secondary Key

To explore this idea, we begin with an analogy. Suppose that you have a filing system containing information dealing with different companies. A separate folder exists for each company, and the folders are arranged alphabetically by company name. Obviously, if you wish to retrieve data for a particular company, you must know the name of the company. This may seem trivial, but it is in fact the crux of an important concept: if you wish to locate a record in a random-access file, its identifier must be specified. This identifier is technically called a record key.

The way in which keys are used in conjunction with random-access files is illustrated in figure 9.9. If the value of a record's key is known, it may be supplied to the DBMS, which converts the value into the disk address at which the record is located. The question mark in the figure indicates that several methods exist for doing the conversion, depending on the type of random-access file.

If a particular key value locates only a single record, the key is said to be unique. On the other hand, a non-unique key may correspond to several records. Usually, a file must have at least one unique key, whose value specifies the principal "name" of each record.
This is called the primary key of the file. If a file has more than one unique field, the one chosen to be the primary key is the one whose values seem best suited as the record "names." Any keys other than the primary key are called secondary keys. A secondary key may be either unique or non-unique.

The simplest type of primary key consists of a single field within each record, often referred to as a key field. In some situations, the database designer specifies to the DBMS which field is to be the primary key; in other cases, the DBMS itself creates a special primary-key field.

Concatenated Keys

A key may be formed by combining two or more fields. For example, consider a simple file for holding data on customer payments to Jellybeans, Inc. If we assume that there may be more than one customer with the same name, then there is no unique field in the file. However, if a combination of two fields is unique, it may be used as a primary key. For example, it is unlikely that a customer will submit two payments on the same day. Therefore, the combination of CUSTOMER-NAME and DATE-RECEIVED is unique, and it may be used as a primary key. This type of field combination is called a concatenated key, and it is expressed as:

    CUSTOMER-NAME + DATE-RECEIVED

Each key value is generated by literally connecting, or concatenating, the values from the two fields. For example, suppose that a particular record in the file contains the following values:

    CUSTOMER-NAME: "Jones, Ltd"
    AMOUNT-PAID: 125
    DATE-RECEIVED: "11/25/85"
    PAYMENT-TYPE: "MO"

The concatenated key for the record would be "Jones, Ltd 11/25/85".

Any type of field, or combination of fields, may be a key. Regardless of the composition of the key, the process represented in figure 9.4 is valid because any type of key may be used to generate an address.

4. File Operations

The fundamental operations that are performed on files are the following:

1. Creation
2. Update, including:
   o record insertion
   o record modification
   o record deletion
3. Retrieval, including:
   o inquiry
   o report generation
4. Maintenance, including:
   o restructuring
   o reorganization

Creating a File

The initial creation of a file is also referred to as the loading of the file. The bulk of the work in creating transaction and master files involves data collection and validation. In some implementations, space is first allocated to the file, then the data are loaded into that skeleton. In other implementations, the file is constructed a record at a time. We will see examples of both approaches. In many cases, data are loaded into a transaction or master file in batch mode, even if the file actually is built one record at a time. Loading a master file interactively can be excessively time-consuming and labor-intensive if large volumes of data are involved.

The contents of a master file represent a snapshot in time of the part of the world that the master file represents. For example, the payroll master file represents the present state of the company's payroll situation: month-to-date and year-to-date fields indicate appropriately accumulated figures for amounts paid, vacation taken, vacation due, etc., for each employee.

Updating a File

Changing the contents of a master file to make it reflect a more current snapshot of the real world is known as updating the file. These changes may include (1) the insertion of new record occurrences, e.g., adding a record for a newly hired employee, (2) the modification of existing record occurrences, e.g., changing the pay rate for an employee who has received a raise, and (3) the deletion of existing record occurrences, e.g., removing the record of an employee who has left the company. The updated file then represents a more current picture of reality.

In some implementations, the records of a file can be modified in place, new records can be inserted in available free space, and records can be deleted to make space available for reuse.
If a file is updated in place by a program, then the file usually is an input-output file for that program. Some implementations are more restrictive and a file cannot be updated in place. In these cases the old file is input to an update program and a new version of the file is output. The file is essentially recreated with current information. However, not all of the records need to have been modified; some (maybe even most) of the records may have been copied directly from the old version to the updated version of the file. We will consider this situation in further detail under sequential file organization.

Retrieving from a File

The access of a file for purposes of extracting meaningful information is called retrieval. There are two basic classes of file retrieval: inquiry and report generation. These two classes can be distinguished by the volume of data that they produce. An inquiry results in a relatively low-volume response, whereas a report may create many pages of output. However, some installations prefer to distinguish between inquiry and report generation by their modes of processing. If a retrieval is processed interactively, these installations would call the retrieval an inquiry or query. If a retrieval is processed in batch mode, the retrieval would be called report generation. This terminology tends to make report generation more of a planned, scheduled process and inquiry more of an ad hoc, spontaneous process. Both kinds of retrieval are required by most information systems.

An inquiry generally is formulated in a query language, which ideally is a natural-language-like structure that is easy for a "non-computer-expert" to learn and to use. A query processor is a program that translates the user's inquiries into instructions that are used directly for file access. Most installations that have query processors have acquired them from vendors rather than designing and implementing them in-house.
A file retrieval request can be comprehensive or selective. Comprehensive retrieval reports information from all the records on a file, whereas selective retrieval applies some qualification criteria to choose which records will supply information for output. Examples of selective retrieval requests formulated in a typical but fictitious query language are the following:

FIND EMP-NAME OF EMP-PAY-RECORD WHERE EMP-NO = 12751
FIND ALL EMP-NAME, EMP-NO OF EMP-PAY-RECORD WHERE EMP-DEPT-NAME = "MIS"
FIND ALL EMP-NAME, EMP-NO OF EMP-PAY-RECORD WHERE 20,000 < EMP-SAL < 40,000
FIND ALL EMP-NAME, EMP-AGE, EMP-PHONE OF EMP-PAY-RECORD WHERE EMP-AGE < 40 AND EMP-SEX = "M" AND EMP-SAL > 50,000
COUNT EMP-PAY-RECORDS WHERE EMP-AGE < 40
FIND AVERAGE EMP-SAL OF EMP-PAY-RECORD WHERE DEPT-NAME = "MIS"

In each case the WHERE clause gives the qualification criteria. Note that the last two queries apply the aggregate functions COUNT and AVERAGE to the qualifying set of records. Some file organizations are better suited to selective retrievals and others are more suited to comprehensive retrievals. We will study examples of both types.

Maintaining a File

Changes that are made to files to improve the performance of the programs that access them are known as maintenance activities. There are two basic classes of maintenance operations: restructuring and reorganization. Restructuring a file implies that structural changes are made to the file within the context of the same file organization technique. For example, field widths could be changed, new fields could be added to records, more space might be allocated to the file, the index tree of the file might be balanced, or the records of the file might be resequenced, but the file organization method would remain the same. File reorganization implies a change from one file organization to another. The various file organizations differ in their maintenance requirements.
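The selective retrieval requests listed earlier in this section can be read as a scan over the file that tests each record against the WHERE criteria. The following C fragment is a minimal sketch of that idea, assuming the records are already in memory; the structure `emp_pay_record`, its field sizes, and the three function names are hypothetical and are not part of the query language shown in the text.

```c
#include <stddef.h>
#include <string.h>

/* Hypothetical in-memory form of EMP-PAY-RECORD; field sizes are assumptions. */
struct emp_pay_record {
    int    emp_no;
    char   emp_name[32];
    char   emp_dept_name[16];
    int    emp_age;
    double emp_sal;
};

/* FIND EMP-NAME ... WHERE EMP-NO = emp_no. Returns NULL when no record qualifies. */
const char *find_name_by_number(const struct emp_pay_record *recs, size_t n, int emp_no)
{
    for (size_t i = 0; i < n; i++)
        if (recs[i].emp_no == emp_no)
            return recs[i].emp_name;
    return NULL;
}

/* COUNT EMP-PAY-RECORDS WHERE EMP-AGE < age_limit. */
size_t count_younger_than(const struct emp_pay_record *recs, size_t n, int age_limit)
{
    size_t count = 0;
    for (size_t i = 0; i < n; i++)
        if (recs[i].emp_age < age_limit)
            count++;
    return count;
}

/* FIND AVERAGE EMP-SAL ... WHERE EMP-DEPT-NAME = dept.
   Returns 0.0 when no record qualifies. */
double average_salary_in_dept(const struct emp_pay_record *recs, size_t n, const char *dept)
{
    double total = 0.0;
    size_t matched = 0;
    for (size_t i = 0; i < n; i++) {
        if (strcmp(recs[i].emp_dept_name, dept) == 0) {
            total += recs[i].emp_sal;
            matched++;
        }
    }
    return matched ? total / (double)matched : 0.0;
}
```

Each function reads every record once, which is the behaviour of a comprehensive scan; a file organization with an index on EMP-NO or EMP-DEPT-NAME could answer the same requests without touching every record.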
These maintenance requirements are also very dependent upon the nature of activity on the file contents and how quickly that activity changes. Some implementations have file restructuring utilities that are automatically invoked by the operating system; others require that data processing personnel notice when file activity has changed sufficiently or program performance has degraded enough to warrant restructuring or reorganization of a file. Some installations perform file maintenance on a routine basis. For example, a utility might be run weekly to collect free space from deleted records, to balance index trees, and to expand or contract space allocations.

In general, master files and program files are created, updated, retrieved from, and maintained. Work files are created, updated, and retrieved from, but are not maintained. Report files generally are not updated, retrieved from, or maintained. Transaction files are generally created and used for one-time processing.

Unit 3 FILE ORGANIZATION

1. File Concepts
2. Serial File
3. Sequential File
4. Processing Sequential Files
5. Indexing
6. Indexed Sequential File
7. Inverted File
8. Direct File
9. Multi-list File

1. File Concepts

A file consists of a number of records. Each record is made up of a number of fields and each field consists of a number of characters.
(i) A character is the smallest element in a file and can be alphabetic, numeric or special.
(ii) A field is an item of data within a record. It is made up of a number of characters. Ex: a name (student name), a number (student register number), a date (birth date), or an amount (goods purchased, money paid).
(iii) A record is made up of related fields. Ex: a student record or an employee record.

Figure 10.1: Student Record

When a file is created it has a key field, which helps in accessing a particular record within the file. The records can be recognized by that particular key.
Ex: Student number is the key field by which the records can be arranged serially in ascending order. Keys help in sorting the records, and when the records are sorted in sequence the file is called a sequential file. The records are read into memory one at a time in sequence. Files are normally stored on backing storage media, as they are too big to fit into main storage all at once. Some files are processed at regular intervals to provide information. Ex: A payroll file may be processed each week or month in order to produce employees' wages. Others are processed at irregular intervals. Ex: A file containing medical details of a doctor's patients.

There are various types of files and a variety of file processing methods. The basic files are program files, which contain programs and are read into main memory from backing storage when the program is to be used, and data files, on which data is held. The hierarchy of the file structure is represented in figure 10.2.

Figure 10.2: Hierarchy of the file structure

File Structure

In a traditional file environment, data records are organized using either a sequential file organisation or a direct (random) file organisation. Records in a file can be accessed sequentially, or they can be accessed directly if the sequential file is on disk and uses an indexed sequential access method. Records in a file with direct file organisation can be accessed directly without an index. By allowing different functional areas and groups in the organization to maintain their own files independently, the traditional file environment creates problems such as data redundancy and inconsistency, program-data dependence, inflexibility, poor security, and lack of data sharing and availability.
Models of File Organization

Records of data should be organized on the file in such a way that the organization provides:
(i) Quick access to the desired record for retrieval
(ii) Ease of updating the file
(iii) Full utilization of the storage area on the recording media

Other factors are reliability, privacy and integrity of data.

2. Serial File

A serial file contains a set of records in no particular order; the records are stored as they arrive. They do not follow any particular sequence of attribute values. This method of storing records is adopted when it is not possible to arrange the records in any logical order, when the fields of the record are not well defined, and when the exact use of the file cannot be anticipated. Files are created in this mode by punching the documents in the order of arrival. The file is then organized into another mode.

A record in a serial (pile) file is located by sequentially searching the records till the desired value of the key attribute is reached. New records are always added at the end of the file. Changes and deletions of records in a serial file stored on random access media can be done by locating the record and either changing its contents or flagging the record to indicate that it has been invalidated. The file may be reorganised periodically to remove the holes created by the deletion of records. However, a serial file on sequential access media can be updated only by creating a new file.

3. Sequential File

Files on sequential access media are generally organised in the sequential mode. Sequential files may also be maintained on random access media. The records are arranged in ascending or descending order of the values of a key attribute in the record. The sequence of records in the file can be changed from one key attribute to another by sorting the file. The key for sequencing the records may also consist of a group of attributes.
The updating and processing of records of a sequential file stored on sequential access media is carried out in batch mode. The transactions leading to changes in the file data are collected in a batch periodically. For example, transfers, promotions, retirements, etc., which lead to changes in the personnel file data, can be collected on a monthly basis and recorded in the form of a transaction file. The transaction file is arranged in the same sequence as the master file to be updated. The updating involves reading records from both the transaction file and the master file and matching the two on the key attribute. The additions, deletions, and changes are then carried out on the records of the master file and the updated records are written to the new, updated master file. The sequential file update schematic is shown in figure 10.4.

A record in a sequential file stored on random access media can be located by one of the following methods:
(i) Sequential Search
(ii) Binary Search
(iii) Probing
(iv) Skip Search

In a 'sequential search', each record is read one after another, starting from the first record in the file, till the desired key attribute value is reached.

'Binary search' can reduce the search time considerably as compared to the sequential search. Here, the first record to be read is the one in the middle of the file. Ex: In a file of 200 records, the 100th record will be read first. The value of the key attribute of this record is examined; it will be either less or more than the attribute value of the desired record. In this way, we can decide whether the desired record lies in the first or second half of the file. The next record read is the one that lies in the middle of the area localized by the previous read operation (the 50th record, if the desired record lies in the first half).
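The halving procedure described above can be sketched in C as follows. This is a minimal illustration that assumes the key values of the file are held in an ascending array in memory; the function name `binary_search` and the optional `reads` counter (which reports how many records were examined) are assumptions made for the example.

```c
#include <stddef.h>

/* Returns the position (0-based) of the record whose key equals target,
   or -1 if no such record exists. keys[] must be in ascending order.
   If reads is not NULL, it receives the number of records examined. */
long binary_search(const int *keys, size_t n, int target, size_t *reads)
{
    size_t lo = 0, hi = n;              /* current search area is keys[lo .. hi-1] */
    if (reads)
        *reads = 0;
    while (lo < hi) {
        size_t mid = lo + (hi - lo) / 2; /* record in the middle of the area */
        if (reads)
            (*reads)++;
        if (keys[mid] == target)
            return (long)mid;
        if (keys[mid] < target)
            lo = mid + 1;               /* desired record lies in the upper half */
        else
            hi = mid;                   /* desired record lies in the lower half */
    }
    return -1;
}
```

For a file of 200 records this loop examines at most 8 records, because each read halves the remaining area, whereas a sequential search may have to read all 200.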
Figure 10.3: Binary search

Example 1

If the desired record lies within the first hundred records, the next record read is the 50th record, to decide whether the desired record lies amongst the first fifty or the next fifty. The process is repeated till the desired record has been localised into a small area consisting of, say, 5 or 10 records. Then all records in that area are searched sequentially to locate the desired record.

'Probing' is done where the approximate range in which the desired record lies can be ascertained from the value of the key attribute. If the key attribute is the name of the doctor and it is known that names starting with P, such as 'Padmanab', lie between 30% and 45% of the way through the records, only this area need be searched sequentially to locate the desired record.

In 'skip search', records are accessed in a particular order, say every 20th record is read, till the value of the key attribute exceeds the desired value. By this method, an area of 20 records is localised each time, and the desired record is then found within it by sequential search.

4. Processing Sequential Files

Because sequential files are in a sequence and must be kept in that sequence, much of sequential file processing involves sorting data on some key. For example, all subscribers must be sorted on their last name and first name before a telephone book can be printed. Numerous books and articles have been written on various approaches to sorting. Most computer manufacturers supply sorting packages as a part of their operating systems. These packages are very efficient and simple to use. All that is necessary is to indicate the fields, record sizes and sort key, and to assign intermediate work areas for the sort to use.

A scheme for updating a sequential file is shown in figure 10.4 above. Since the master file is in order, input transactions must be sorted into the same order as the file before being processed.
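The matching of a sorted transaction file against a sorted master file can be sketched in C as below. This is only an illustration under stated assumptions: both files are represented as in-memory arrays rather than tapes, there is at most one transaction per key, and the record layouts, the action codes 'A', 'C' and 'D', and the function name `update_master` are all hypothetical.

```c
#include <stddef.h>

struct master_rec { int key; int value; };

/* action: 'A' = add, 'C' = change, 'D' = delete (codes are assumptions) */
struct txn_rec { int key; char action; int value; };

/* Writes the updated master into new_master and returns its record count.
   Both inputs must be sorted by ascending key, and new_master must have
   room for n_master + n_txn records. Transactions that make no sense
   (change or delete of a missing key, add of an existing key) are ignored. */
size_t update_master(const struct master_rec *old_master, size_t n_master,
                     const struct txn_rec *txns, size_t n_txn,
                     struct master_rec *new_master)
{
    size_t m = 0, t = 0, out = 0;
    while (m < n_master || t < n_txn) {
        if (t == n_txn || (m < n_master && old_master[m].key < txns[t].key)) {
            /* No transaction for this master record: copy it unchanged. */
            new_master[out++] = old_master[m++];
        } else if (m == n_master || txns[t].key < old_master[m].key) {
            /* Key is not on the old master: only an addition is valid. */
            if (txns[t].action == 'A') {
                new_master[out].key = txns[t].key;
                new_master[out].value = txns[t].value;
                out++;
            }
            t++;
        } else {
            /* Keys match: change the record, delete it, or keep it as is. */
            if (txns[t].action == 'C') {
                new_master[out] = old_master[m];
                new_master[out].value = txns[t].value;
                out++;
            } else if (txns[t].action != 'D') {
                new_master[out++] = old_master[m];
            }
            m++;
            t++;
        }
    }
    return out;
}
```

Because the old master is only read and never overwritten, it remains available as a backup, and the update can be rerun from the old master and the saved transactions.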
A new file is created in the update process, since it is not good practice to try to read and write on the same tape file. The old file in the sequential update provides backup. If we keep the old file and its input transactions, any errors or the accidental destruction of the new tape can easily be remedied by running the update program again and updating the old file with the transactions.

On an update, there are three possible actions. First, we can modify a record: we can change some part of the record ...

... platters (except for the very top and bottom ones) are coated with a magnetic material like that on a tape. Read and write heads are fitted between the platters. By moving the heads in and out, we can access any track on the rotating disk. The maximum block size or physical record size for a disk file is limited by the physical capacity of each track. If the access arms do not move, each head reads or writes on the same track of each platter. Conceptually, these tracks form a cylinder, and when using a disk file sequentially, we write on a given track of the first platter, then on the same track of the second platter, and so on. This minimizes the access time since the heads do not have to move.

Example 2

Consider an expensive purchase made with a credit card. The merchant is required to check the validity of your credit card and to check the amount of available credit. To do this, the merchant accesses the computer system that maintains a file of credit card numbers, status (valid, lost, stolen, etc.) and available credit. It is impractical to store this data sequentially. You and the merchant would become quite impatient if you had to wait 30 minutes for the computer system to sequentially process the file to the point of your record. Direct access is required.

There are many ways of accessing a file for direct access. The file must be stored on disk or a similar direct access device so that records can be read and written in the middle of the file.
Second, some means must be developed for determining the location of a particular record. Indexes are a common means. Suppose records are stored on a disk in such a way that 1000 of them reside on each track. Suppose that the credit card file is stored on 300 tracks, so that a total of 300,000 records can be accommodated. If we know the relative position of the record in the file, the physical location can be determined. Record number 2050, for example, resides on the third track of the file in relative position number 50. What is needed is some means of establishing a relationship between a credit card number, or some similar identifying value, and the record's relative position in the file.

Indexes created and maintained for direct access processing take file space and thus increase the amount of storage required. Such processing can only be done on disks or similar devices; tape may not be used. Direct access processing is often used in personal computer applications, although most users are unaware of the fact.

Computer data is processed in two fundamental ways: file processing and data processing. With file processing, data is stored and processed in separate files. Consider figure 10.6.

Figure 10.6: File Processing Example

Figure 10.6 shows two different programs processing two separate files. One program processes the employee file to produce reports about employees; the second program processes a file about personal computer hardware (the PC Hardware file) to produce a report about the inventory of hardware. The formats of the records in these two files are shown in figure 10.7.

Figure 10.7: Sequential file processing of time slips

Sequential access storage devices use media such as magnetic tapes, whose storage locations do not have unique addresses and hence cannot be accessed directly. Instead, data must be stored and retrieved using a sequential or serial process. Data are recorded one after another in a predetermined sequence.
Locating a particular item of data requires searching much of the recorded data on the tape until the desired item is located.

5. Indexing

6. Indexed Sequential File

A search in a long sequential file can be very time consuming, with delays in locating the record and computer time spent in the search routine. The search can be made faster if an index to the file is provided (see figure 10.5). Such a file organisation is called an indexed sequential file. The index of the file contains a pointer record for one record or a group of records in the main file. The index file can be searched by the sequential search or binary search method. For very long files, the index file itself can be long. In such cases, an index of the index file may be necessary; this is called the higher-level index and is used to locate the record in the lower-level index file.

Files can be indexed on the key attribute by which they are sequenced. They may also be indexed on other attributes on which they are not sequenced. In such a case, one pointer record will be required in the index for each record of the main file. Such a file becomes an indexed non-sequential file. With this, it is possible to locate the record by more than one key attribute of the record. The file may be organized in the indexed sequential mode for the attribute most commonly used to locate the record and in the indexed non-sequential mode for other attributes.

Additions of records to an indexed sequential file are made in the overflow areas provided in each group. For this purpose, some sectors in the area forming the group can be kept blank as overflow areas while organizing the file. Additional overflow areas are kept at the end of the file. An indexed sequential file has to be reorganized periodically, when the overflow areas become full, when too many holes have been created by the deletion of obsolete records, or when the sequential search through the chained records has become too time consuming.
The reorganization can be done by reading the old file, regrouping the updated records, and writing a new file with indices.

Example 3

One way to think of the structure of an indexed sequential file is as an index with pointers to a sequential data file (Figure 10.9). In the pictured example, the index has been structured as a binary search tree. The index is used to service a request for access to a particular record; the sequential data file is used to support sequential access to the entire collection of records.

Figure 10.9: Use of a binary search tree and a sequential file to provide indexed sequential access

You are already familiar with this approach to structuring a collection of information to facilitate both sequential and direct access. For example, consider a dictionary: the thumb tabs provide an index into the sequentially organized collection of words. In order to find a particular word, say "syzygy," you usually do not scan the dictionary sequentially. Rather, you select the appropriate thumb tab, S, the first letter of the word, and use that tab to direct your search into the approximate location of the word in the data collection. Again, you probably would not proceed with a sequential search from the beginning of the S's to find your target word, "syzygy." Instead, you would use the column headings on each page, which indicate the first and last elements on that page. Once you have located the proper page, your search might proceed sequentially to find the sought word.

It is important to note from the dictionary example that the key that is used to sequence entries is the same as the key that is used to search directly for an individual entry. The thumb tab and column-heading index structure does not work unless this is true.

Applications

Because of its capability to support both sequential and direct access, indexed sequential organization is used frequently to support files that are used for both batch and interactive processing.
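The dictionary analogy above (find the right page from its heading, then scan that page) can be sketched in C as a one-level index over groups of records. This is a simplified illustration, not the binary search tree index of the pictured example: the key values are assumed to be in an ascending in-memory array, and `GROUP_SIZE`, `struct index_entry`, `build_index` and `indexed_lookup` are names invented for the sketch.

```c
#include <stddef.h>

#define GROUP_SIZE 4   /* records per group; an illustrative assumption */

/* One pointer record per group: the highest key in the group and the
   position of the group's first record in the main file. */
struct index_entry { int highest_key; size_t first_record; };

/* Builds one index entry per group of GROUP_SIZE records.
   keys[] must be in ascending order. Returns the number of entries. */
size_t build_index(const int *keys, size_t n, struct index_entry *idx)
{
    size_t entries = 0;
    for (size_t start = 0; start < n; start += GROUP_SIZE) {
        size_t last = start + GROUP_SIZE - 1;
        if (last >= n)
            last = n - 1;               /* final group may be short */
        idx[entries].highest_key = keys[last];
        idx[entries].first_record = start;
        entries++;
    }
    return entries;
}

/* Searches the index sequentially for the group that can hold target,
   then searches only that group. Returns the record position or -1. */
long indexed_lookup(const int *keys, size_t n,
                    const struct index_entry *idx, size_t entries, int target)
{
    for (size_t e = 0; e < entries; e++) {
        if (target <= idx[e].highest_key) {
            size_t end = idx[e].first_record + GROUP_SIZE;
            if (end > n)
                end = n;
            for (size_t i = idx[e].first_record; i < end; i++)
                if (keys[i] == target)
                    return (long)i;
            return -1;                  /* group found, record absent */
        }
    }
    return -1;                          /* target is beyond the last group */
}
```

Sequential access to the whole collection needs no index at all: it simply reads `keys[0]` through `keys[n-1]` in order, which is why the same file can serve both batch and interactive use.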
For example, consider a credit card billing system with an indexed sequential master file of customer account information. An appropriate key for the file would be the account number of each record. The file could be accessed in batch mode, monthly, to generate customer invoices and to build summary reports of account activity. Each account record would be accessed once in this processing. The bills and the detail lines on the summary report would appear in account number sequence.

The credit card management agency also wants to use the master file of account information to support interactive inquiry into the current credit status of any account. When a customer makes a purchase, the master file will be consulted to determine the credit limit and then to update the remaining credit amount. This kind of processing could not be supported well by sequential access to the customer account file. Rather, the need to access an individual account, given its account number, dictates use of the index of the indexed sequential file organization.

Another example of an application that calls for support by an indexed sequential file is a class records system. Processing requirements for this system include the following:
1. List the names and addresses of all students.
2. Compute the average age of the students.
3. Compute the mean and standard deviation for students' grade point averages.
4. Compute the total number of credit hours for classes in which the students are presently enrolled.
5. Change the classification of a particular student from probational to regular.
6. Display the grade record for a particular student.
7. Insert a record for a new student.
8. Delete the record for a particular student who has withdrawn from the school.

Some of these requirements call for sequential accessibility of the student file; others call for direct accessibility to particular records of the file.
The combination of requirements can be satisfied by using indexed sequential file organization. The sort key and the index key for the file would be the student identifier.

7. Inverted File

In the inverted file organisation, one index is maintained for each key attribute of the record. The index file contains the value of the key attribute followed by the addresses of all the records in the main file with the same value of the key attribute. It requires three kinds of files to be maintained:
(i) The Main File
(ii) The Directory File
(iii) The Index File

There is a separate directory file for each key attribute, and the directory file contains the values of the key attribute. An inverted file is a file that references entities by their attributes. It is very useful where a list of records with a specified value of a key attribute is required. The maintenance of index files can be very time consuming.

Example 4

Consider a banking system in which there are several types of users: tellers, loan officers, branch managers, bank officers, account holders, and so forth. All need to access the same data, say records of the format shown in Figure 10.10. Various types of users need to access these records in different ways. A teller might identify an account record by its ID value. A loan officer might need to access all account records with a given value for OVERDRAW-LIMIT, or all account records for a given value of SOCNO. A branch manager might access records by the BRANCH and TYPE group code. A bank officer might want periodic reports of all accounts data, sorted by ID. An account holder (customer) might be able to access his or her own record by giving the appropriate ID value or a combination of NAME, SOCNO, and TYPE code.

Figure 10.10: Example record format

Figure 10.11: Example data file

A simple inversion index is structured as a table. For example, inverting the example ACCOUNT-FILE on SOCNO results in the inversion index shown in Figure 10.12.
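A table-structured inversion index on SOCNO can be sketched in C with the standard library routines `qsort` and `bsearch`. The sketch assumes the data file is an in-memory array and that a record's address is simply its position in that array; `struct account`, `struct inversion_entry` and the function names are hypothetical, and only two of the fields of the example record are modelled.

```c
#include <stdlib.h>
#include <string.h>

/* Hypothetical subset of an account record: primary key ID plus SOCNO. */
struct account { int id; char socno[12]; };

/* One row of the inversion index: a key value and the record's address
   (here, its position in the data file array). */
struct inversion_entry { char socno[12]; size_t address; };

static int compare_entries(const void *a, const void *b)
{
    const struct inversion_entry *x = a;
    const struct inversion_entry *y = b;
    return strcmp(x->socno, y->socno);
}

/* Fills inv[0..n-1] from the data file and sorts it by SOCNO so that
   binary search can be used. The data file itself is not modified. */
void build_inversion_index(const struct account *file, size_t n,
                           struct inversion_entry *inv)
{
    for (size_t i = 0; i < n; i++) {
        strcpy(inv[i].socno, file[i].socno);
        inv[i].address = i;
    }
    qsort(inv, n, sizeof inv[0], compare_entries);
}

/* Returns the address of the record with the given SOCNO, or -1. */
long lookup_by_socno(const struct inversion_entry *inv, size_t n, const char *socno)
{
    struct inversion_entry probe;
    strncpy(probe.socno, socno, sizeof probe.socno - 1);
    probe.socno[sizeof probe.socno - 1] = '\0';
    probe.address = 0;                  /* unused by the comparison */
    const struct inversion_entry *hit =
        bsearch(&probe, inv, n, sizeof inv[0], compare_entries);
    return hit ? (long)hit->address : -1;
}
```

Inserting a record into the data file would also require inserting a row into `inv` at its sorted position, which is the maintenance cost the text refers to.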
Figure 10.12 refers to the data records shown in Figure 10.11. It may bring to mind the index structures that we discussed in the context of relative files; at that time we referred to an index as a directory. Either term is correct.

Figure 10.12: Example index inverting the records of Figure 10.11 by SOCNO key value

Variations

This particular inversion index has sorted entries, which facilitates searching for a particular key value, because binary search techniques can be used. When the number of entries N is relatively large, sequential searching significantly slows response time and throughput. Of course, when a record is added to the data file, the inversion index must also be updated and maintained in sorted order. Not all inversion indexes are sorted.

An inversion index can be built on top of a relative file or on top of an indexed sequential file. An inversion index on key SOCNO for a relative file with user-key ID would provide a file that would support direct access by either ID or SOCNO. An inversion index on key SOCNO for an indexed sequential file with key ID would provide a file that would support direct access by either ID or SOCNO and would support sequential access by ID. A sorted SOCNO inversion index could be used to access records in order by SOCNO but would probably be expensive to use.

8. Direct File

Direct files are not maintained in any particular sequence. Instead, the value of the key attribute is converted into a sector address by a predetermined relationship, and this address is used for the storage and retrieval of the record. This method is generally used where the range of the key attribute values is large compared to the number of records.

The Employee and PC Hardware files duplicate some data. Two fundamental ways of organizing files are sequential and direct access. The ways of processing and retrieving from these two kinds of file have already been explained. The sequential file processing of time slips is given in figure 10.8.
Data and information are to be stored immediately after input or during processing, but before output. Primary storage is in the CPU, whereas the secondary storage devices are magnetic disk and tape devices. The speed, cost and capacity of several alternative primary and secondary storage media are shown in figure 10.13. The cost/speed/capacity tradeoff shifts as one moves from semiconductor memories to magnetic moving-surface media (magnetic disks and tapes) to optical disks.

Figure 10.13: Speed, cost, capacity of 'storage' media alternatives

Semiconductor chips, which come under the primary storage category, are called direct access or random access memories (RAM). Magnetic disk devices are frequently called direct access storage devices (DASD). Media such as magnetic tapes are known as sequential access devices; magnetic bubble and other devices have both direct and sequential access properties.

Direct access and random access are the same concept. An element of data (byte or record) or an instruction can be directly stored and retrieved by selecting and using any of the locations on the storage media. It also means that each storage position (a) has a unique address and (b) can be accessed in approximately the same length of time without searching other storage positions. Each memory cell on a microelectronic semiconductor RAM chip can be individually sensed or changed in the same length of time. Any data record stored on a magnetic disk can be accessed directly in approximately the same time period.

Figure 10.14: Direct Access Storage Device

The disk unit is the device in which the disk pack is placed (and which connects it to the drive mechanism). Once the pack is loaded into the unit, the read/write mechanism located inside the unit positions itself over the first track of each surface. The mechanism consists of a number of arms, at the ends of which there is a read/write head for each surface. All arms are fixed together and all move together as one when accessing a surface on the disk pack.
The disk, when loaded, is driven at a high number of revolutions per minute (several thousand), and access can only be made to its surfaces while the disk revolves. In a 6-disk pack, there are 10 recording surfaces. Similarly, there are 20 recording surfaces in an 11-disk pack. In general, a pack of n disks has 2n - 2 recording surfaces.

The method of converting the value of the key attribute to the address is known as 'randomizing' or 'hashing.' The most common method of hashing (out of several methods) is the division method. In this method, the value of the key attribute is converted into an integer if it is not already an integer. It is then divided by another integer, often a prime number just smaller than the file size. The remainder of the division is used as the address. Other methods used for hashing are the 'midsquare' method and the 'folding' method.

It is quite possible that two different values of the key attribute may get converted to the same address on hashing, in which case a 'collision' is said to have occurred. This is handled by storing the record immediately following the previous record stored with the same hashed address. Collisions can also be handled by providing blocks or 'buckets' of records to store all the records with the same address. When the bucket is full, additional records with the same hash address can be stored in the overflow areas provided at the end of the file. The overflow records are chained to the last record in the bucket.

The records in a direct file are randomly scattered through the file. It is not possible to utilise the full storage area. The ratio of the number of records to the total capacity of the file is called the 'loading factor'. A high loading factor leads to too many collisions and increases the search time. A low loading factor leads to under-utilisation of the file area.

Example 5

Some of the earliest investigations in hashing yielded a hash function known as the division-remainder method, or simply the division method.
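The division method, together with the first collision-handling rule described above (store the record in the next free position after its hashed address), can be sketched in C as follows. This is a minimal illustration rather than a file implementation: the "file" is a small in-memory array of integer keys, keys are assumed to be non-negative, and `TABLE_SIZE`, `EMPTY_SLOT` and the function names are assumptions. Searching forward one position at a time in this way is commonly called linear probing.

```c
#include <stddef.h>

#define TABLE_SIZE 11      /* small prime divisor, chosen only for illustration */
#define EMPTY_SLOT (-1)    /* marks an unused address; keys must be >= 0 */

void table_init(int table[TABLE_SIZE])
{
    for (size_t i = 0; i < TABLE_SIZE; i++)
        table[i] = EMPTY_SLOT;
}

/* Division method: the remainder of key / TABLE_SIZE is the home address. */
size_t hash_address(int key)
{
    return (size_t)(key % TABLE_SIZE);
}

/* Stores key at its home address, or at the next free address after it
   when a collision occurs. Returns the address used, or -1 if the table
   is full. */
long table_insert(int table[TABLE_SIZE], int key)
{
    size_t home = hash_address(key);
    for (size_t step = 0; step < TABLE_SIZE; step++) {
        size_t addr = (home + step) % TABLE_SIZE;
        if (table[addr] == EMPTY_SLOT || table[addr] == key) {
            table[addr] = key;
            return (long)addr;
        }
    }
    return -1;
}

/* Follows the same sequence of addresses as table_insert.
   Returns the address holding key, or -1 if it is not stored. */
long table_find(const int table[TABLE_SIZE], int key)
{
    size_t home = hash_address(key);
    for (size_t step = 0; step < TABLE_SIZE; step++) {
        size_t addr = (home + step) % TABLE_SIZE;
        if (table[addr] == key)
            return (long)addr;
        if (table[addr] == EMPTY_SLOT)
            return -1;
    }
    return -1;
}

/* Ratio of stored records to total capacity. */
double loading_factor(const int table[TABLE_SIZE])
{
    size_t used = 0;
    for (size_t i = 0; i < TABLE_SIZE; i++)
        if (table[i] != EMPTY_SLOT)
            used++;
    return (double)used / TABLE_SIZE;
}
```

With this divisor, the keys 22 and 33 both hash to address 0, so the second one inserted is placed at address 1. The sketch does not support deletion; removing a record would need a separate "deleted" marker so that later searches do not stop early at the vacated address.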
The basic idea of this approach is to divide a key value by an appropriate number and then use the remainder of the division as the relative address of the record. For example, let div be the divisor, key be the key, and addr be the resultant address. In Pascal, the function R, which maps a key to an address, would be implemented as

    addr := key mod div;

The remainder (the result of the mod function) is defined as the result of subtracting the product of the quotient and the divisor from the dividend. That is, with TEMP denoting the integer quotient of KEY divided by DIV,

    ADDR = KEY - DIV * TEMP

In all cases here, ADDR should be declared to be an integer.

While the actual calculation of a relative address, given a key value and a divisor, is straightforward, the choice of an appropriate divisor may not be quite so simple. Several factors should be considered in selecting the divisor. First, the range of values that result from the operation key mod div is 0 through div - 1, so the value of div determines the size of the relative address space. If it is known that our relative file is going to contain at least n records, then, assuming that only one record can be stored at a given relative address, we must have div > n.

For example, consider the design of a relative file that will contain about 4000 records. For 80% loading, the address space should accommodate at least 5000 records (4000 / 0.80 = 5000). A number that is approximately 5000 and that has no prime factors less than 20 is 5003. Figure 10.15 illustrates the use of this divisor with a small set of key values.

Figure 10.15: Example using divisor 5003

9. Multi-list File
These files are very useful where lists of records with specified key attribute values are required frequently, for example the list of employees posted to a particular place, or the list of employees due for retirement in a particular unit. In this file organization, all the records with a specified key attribute value are chained together.
The directory file, like the one in the inverted file organization, contains a pointer to the first record with the specified key attribute value. The first record contains the address of the second record in the chain, the second contains the address of the third, and so on. When the last record in the chain contains a pointer to the first record, the records are said to form a ring. A number of such rings can be formed for different key attribute values and for different attributes. The directory provides the entry points to the rings.

Example 6
Figures 10.16 and 10.17 illustrate the multi-list indexes for the secondary keys GROUP-CODE and OVERDRAW-LIMIT respectively in our example file. Figure 10.18 shows the corresponding data file. Note that while inversion did not affect the data file, use of the multi-list organization does: each record must have space for the pointers that implement the secondary-key accessibility.

Figure 10.16: Multi-list index for GROUP-CODE secondary key and the data file in figure 16.9

Figure 10.17: Multi-list index for OVERDRAW-LIMIT secondary key and the data file in figure 16.9

The same kinds of design decisions must be addressed as were required with inversion: Should key value entries be ordered? How should the index itself be structured? Should direct or indirect addressing be used? Should data record entries for a given key value be ordered? Here we have ordered the key values, used tabular index structures with indirect addressing, and linked the data records by ascending ID value.

Note the result of building a multi-list structure to implement a secondary key that has unique values. If there are N data records, there will be N value entries in the index, each of which points to a linked list of length one. (See Figure 10.12.) The index is then the same as it would have been had the secondary key been implemented using inversion.

One attractive characteristic of the multi-list approach is that index entries can be fixed-length.
Each value has associated with it just one pointer.

Example data file with multi-list structure

Unit 4 HASHING
1. Hash Tables
2. Hash Function
3. Terms Associated with Hash Tables
4. Bucket Overflow
5. Handling of Bucket Overflows

1. Hash Tables
In tree tables, the search for an identifier key is carried out via a sequence of comparisons. Hashing differs from this in that the address or location of an identifier X is obtained by computing some arithmetic function f of X; f(X) gives the address of X in the table. This address is referred to as the hash address or home address of X.

The memory available to maintain the symbol table is assumed to be sequential. This memory is referred to as the hash table, HT. The term bucket denotes a unit of storage that can store one or more records. A bucket is typically one disk block in size, but it could be chosen to be smaller or larger than a disk block. If the number of buckets in a hash table HT is b, the buckets are designated HT(0), ..., HT(b-1).

The number of records a bucket can store is known as its slot-size. Thus a bucket is said to consist of s slots if it can hold s records. Usually s = 1, in which case each bucket can hold exactly one record.

A function that is used to compute the address of a record in the hash table is known as a hash function. A hashing function f(X) performs an identifier transformation on X: it maps the set of possible identifiers onto the integers 0 through b-1, giving the number of the bucket where the identifier will eventually be stored.

2. Hashing Function
A hashing function f transforms an identifier X into a bucket address in the hash table. The address so computed is known as the hash address of the identifier X. If two or more records have the same hash address, they are said to collide. This phenomenon is called address collision.
The desired properties of a hashing function are that it should be easy to compute and that it should minimize the number of collisions.

A uniform hash function is a hashing function for which the probability that f(X) = i is 1/b for a randomly chosen identifier X, b being the number of buckets in the hash table. In other words, each bucket has an equal probability of being assigned a record that is being inserted.

The worst possible hash function maps all search-key values to the same bucket. Such a function is undesirable because all the records have to be kept in the same bucket. An ideal hash function distributes the stored keys uniformly across all the buckets, so that every bucket has the same number of records. It is therefore desirable to choose a hash function that assigns search-key values to buckets such that the following hold:

1) The distribution is uniform: each bucket is assigned the same number of search-key values from the set of all possible search-key values.
2) The distribution is random: in the average case, each bucket has nearly the same number of values assigned to it, regardless of the actual distribution of search-key values.

Several kinds of uniform hash functions are in use. We shall describe a few of them.

Mid Square hash function
The middle-of-square (or mid-square for short) function, fm, is computed by squaring the identifier and then using an appropriate number of bits from the middle of the square to obtain the bucket address. Since the middle bits of the square usually depend on all of the characters in the identifier, different identifiers are expected to yield different hash addresses with high probability, even when some of their characters are the same. The number of bits used to obtain the bucket address depends on the table size. If r bits are used to compute the hash address, the range of values is 2^r, so the size of the hash table is chosen to be a power of 2 when this kind of scheme is used.
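As a rough illustration of the bit selection just described, the following C sketch squares a numeric key and keeps r bits from the middle of the square. The function name mid_square and the parameter square_bits (the assumed bit width of the square) do not come from the text; they are introduced only for this example.

```c
#include <assert.h>

/* Mid-square hash (illustrative sketch, names are assumptions).
 * y           numeric value of the identifier
 * r           number of middle bits to keep (table size is 2^r)
 * square_bits assumed width of y*y in bits, e.g. 7 for squares up to 127
 */
unsigned mid_square(unsigned y, unsigned r, unsigned square_bits)
{
    assert(r > 0 && r <= square_bits && square_bits < 32);
    unsigned square = y * y;
    unsigned low_drop = (square_bits - r) / 2;      /* low-order bits discarded */
    return (square >> low_drop) & ((1u << r) - 1u); /* keep the r middle bits */
}
```

With r = 3 and square_bits = 7, key 7 gives 49, which is 0110001 in binary; its middle three bits are 100, so the key falls in bucket 4.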
If the size of the hash table is 2^r, then the number of bits selected from the middle of the square is r:

    Mid-square hash address(X) = the r middle bits of X^2

Example 1
Let the hash table have 8 buckets with s = 1, let X be an identifier from a set of identifiers, and let Y be the unique numerical value identifying X. The mid-square hash function is computed as follows. Hash table size = 8 = 2^3, so r = 3.

    X     Y    Y^2    Binary(Y^2)    Mid-Sq(Y^2)
    A1    1     1     00 000 01      000 (0)
    A7    7    49     01 100 01      100 (4)
    A8    8    64     10 000 00      000 (0)
    A2    2     4     00 001 00      001 (1)
    A6    6    36     01 001 00      001 (1)
    A5    5    25     00 110 01      110 (6)
    A4    4    16     00 100 00      100 (4)
    A3    3     9     00 010 01      010 (2)

There are hash collisions (hash clashes) between the keys A1 and A8, A7 and A4, and A2 and A6.

Division hash function
Another simple choice for a hash function is obtained by using the modulo (mod) operator. The identifier X is divided by some number M and the remainder is used as the hash address of X:

    f(X) = X mod M

This gives bucket addresses in the range 0 to M-1, so the hash table must have at least b = M buckets. M should be a prime number such that M does not divide r^k ± a, where r is the radix of the character set and k and a are very small numbers.

Example 2
Given a hash table with 10 buckets, what is the hash address for 'Cat'? Suppose 'Cat' is converted to the numeric value x = 131130. The table size is m = 10, with addresses running from 0 to 9.

    f(x) = x mod m
    f(131130) = 131130 mod 10 = 0

'Cat' is inserted into the table at address 0. The division method is distribution-independent.

The Multiplication Method
This method multiplies all the individual digits of the key together and takes the remainder after dividing the product by the table size:

    f(x) = (a * b * c * d * ...) mod m

where m is the table size and a, b, c, d, etc. are the individual digits of the key.

Example 3
Given a hash table of ten buckets (0 through 9), what is the hash address for 'Cat'?
As in Example 2, 'Cat' is taken to convert to x = 131130, and the table size is m = 10.

    f(x) = (a * b * c * d * ...) mod m
    f(131130) = (1 * 3 * 1 * 1 * 3 * 0) mod 10 = 0 mod 10 = 0

'Cat' is inserted into the table at address 0.

Folding hash function
In this method the identifier X is partitioned into several parts, all but the last being of the same length. These parts are then added together to obtain the hash address for X:

    f(X) = (P1 + P2 + ... + Pn) mod (hash-size)

There are two ways to carry out this addition. In the first, all but the last part are shifted so that the least significant bit (or digit) of each part lines up with the corresponding bit of the last part. The parts are then added together to get f(X). This method is known as shift folding.

    P1 = 123, P2 = 203, P3 = 241, P4 = 112, P5 = 20

    P1     123
    P2     203
    P3     241
    P4     112
    P5      20
          ----
           699

Figure 11.1: Shift Folding

The other method of adding the parts is folding at the boundaries. In this method the identifier is folded at the part boundaries, so that every second part is reversed, and the digits falling into the same position are added together.

    P1     123
    P2r    302
    P3     241
    P4r    211
    P5      20
          ----
           897

Figure 11.2: Folding at Boundaries (Pir = reverse of Pi)

Example 4
Fold the key 123456789 into a hash table of ten spaces (0 through 9). We are given x = 123456789 and the table size m = 10. Since x can be broken into three parts in any way we like, we break it up evenly: P1 = 123, P2 = 456, and P3 = 789.

    f(x) = (P1 + P2 + P3) mod m
    f(123456789) = (123 + 456 + 789) mod 10 = 1368 mod 10 = 8

123456789 is inserted into the table at address 8.

Digit Analysis hash function
This method is particularly useful for static files, where all the identifiers in the table are known in advance. Each identifier X is interpreted as a number using some radix r, and the same radix is used for all the identifiers in the table. Using this radix, the digits of each identifier are examined. The digits having the most skewed distributions are deleted.
Enough digits are deleted so that the number of digits left is small enough to give an address in the range of the hash table.

Perfect Hash Functions
Given a set of keys K = {k1, k2, ..., kn}, a perfect hash function is a hash function h such that h(ki) != h(kj) for all distinct i and j. That is, no hash clashes occur under a perfect hash function. In general, it is difficult to find a perfect hash function for a particular set of keys. Further, once a few more keys are added to the set for which a perfect hash function has been found, the hash function generally ceases to be perfect for the expanded set. Thus, although a perfect hash function is desirable because it ensures immediate retrieval, finding one is practical only when the set of keys is static and is searched frequently. The most obvious example of such a situation is a compiler, in which the set of reserved words of the programming language being compiled does not change and must be accessed repeatedly. In such a situation the effort required to find a perfect hashing function is worthwhile because, once the function is determined, it can save a great deal of time in repeated applications.

Of course, the larger the hash table, the easier it is to find a perfect hash function for a given set of keys. If 10 keys must be placed in a table of 100 elements, 63 percent of the possible hash functions are perfect (although as soon as the number of keys reaches 13 in a 100-item table, the majority are no longer perfect). In general it is desirable to have a perfect hash function for a set of n keys in a table of only n positions. Such a perfect hash function is called minimal. In practice this is difficult to achieve.

One technique finds perfect hash functions of the form h(key) = (key + s)/d for some integers s and d. These are called quotient reduction perfect hash functions and, once found, are quite easy to compute.
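The 63 percent figure quoted above for 10 keys in a 100-element table can be checked with a short calculation. Assuming every possible assignment of n keys to m positions is counted equally (an assumption made for this sketch), the fraction of assignments with no clash is (m/m) * ((m-1)/m) * ... * ((m-n+1)/m). The function name perfect_fraction is illustrative only.

```c
#include <assert.h>

/* Fraction of all assignments of n keys to m table positions in which
 * no two keys share a position (illustrative sketch; assumes each key
 * may land on any of the m positions with equal weight). */
double perfect_fraction(int n, int m)
{
    assert(n >= 0 && m > 0);
    double p = 1.0;
    for (int i = 0; i < n; i++)
        p *= (double)(m - i) / m;   /* the (i+1)-th key must avoid i used positions */
    return p;
}
```

For m = 100 the product is about 0.628 at n = 10, still just above one half at n = 12, and about 0.443 at n = 13, which agrees with the statement that the majority of functions are no longer perfect once 13 keys are involved.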
Cichelli presents a very simple method that often produces a minimal or near-minimal perfect hash function for a set of character strings. The hash function produced is of the form

    h(key) = val(key[0]) + val(key[length(key) - 1]) + length(key)

where val(c) is an integer value associated with the character c and key[i] is the ith character of key. That is, the integer values associated with the first and last characters of the key are added to the length of the key. The integer values associated with particular characters are determined in two steps, as follows.

The first step is to order the keys so that the sums of the occurrence frequencies of their first and last characters are in decreasing order. Thus if e occurs ten times as a first or last character, g occurs six times, t occurs nine times, and o occurs four times, the keys gate, goat, and ego have occurrence frequencies 16 (6 + 10), 15 (6 + 9), and 14 (10 + 4), respectively, and are therefore ordered properly.

Once the keys have been ordered, the second step is to attempt to assign integer values. Each key is examined in turn. If the key's first or last character has not yet been assigned a value, attempt to assign one or two values between 0 and some predetermined limit. If appropriate values can be assigned to produce a hash value that does not clash with the hash value of a previous key, tentatively assign those values. If not, or if both characters have already been assigned values that result in a conflicting hash value, backtrack to modify the tentative assignments made for a previous key. To find a minimal perfect hash function, the predetermined limit for each character is set to the number of distinct first and last character occurrences.

3. Terms Associated with Hash Tables

Identifier Density
The ratio n/T is called the identifier density, where
    n = number of identifiers in use
    T = total number of possible identifiers
The number of identifiers in use, n, is usually several orders of magnitude less than the total number of possible identifiers, T.
The number of buckets b in the hash table is also much less than T.

Loading Factor
The loading factor is equal to n/(s*b), where
    s = number of slots in a bucket
    b = total number of buckets

Synonyms
The hash function f almost always maps several different identifiers into the same bucket. Two identifiers I1 and I2 are said to be synonyms with respect to f if f(I1) = f(I2). Distinct synonyms are entered into the same bucket as long as not all of the s slots in that bucket have been used.

Collision
A collision is said to occur when two non-identical identifiers are hashed into the same bucket. When the bucket size is 1, collision and overflow occur simultaneously.

4. Bucket Overflow
So far we have assumed that, when a record is inserted, the bucket to which it is mapped has space available to store the record. If the bucket does not have enough space, an error condition called bucket overflow arises. Bucket overflow can occur for several reasons:

Insufficient Buckets: The number of buckets, which we denote by nb, must be chosen such that nb > nr/fr, where nr denotes the total number of records that will be stored and fr denotes the number of records that will fit in a bucket. If this condition is not met, there will be fewer buckets than required, which causes bucket overflow.

Skewness: If one bucket is selected during insertion more frequently than the others, the distribution is said to be skewed rather than symmetrical. When records are entered into such a set of buckets, a few buckets are likely to be assigned most of the incoming records and therefore fill up very early. A new insertion into one of these buckets causes a bucket overflow even while space is still available in other buckets.

5. Handling of Bucket Overflows
When an overflow occurs, it should be resolved and the records must be placed somewhere else in the table, i.e.
an alternative hash address must be generated for these entries. The resolution should aim at reducing the chances of further bucket overflows. Some of the approaches used for overflow resolution are described here.

Overflow Chaining or Closed Hashing
In this approach, whenever a bucket overflow occurs, a new bucket (called an overflow bucket) is attached to the original bucket through a pointer. If the attached bucket is also full, another bucket is attached to it, and the process continues. All the overflow buckets of a given bucket are chained together in a linked list. Overflow handling using such a linked list is called overflow chaining. As an example, let us take an array of pointers as the hash table (Figure 11.3).

Figure 11.3: A Chained Hash Table

Advantages of Chaining
1) Space Saving
If the hash table is a contiguous array of records, enough space must be set aside at compilation time to avoid overflow. If, on the other hand, the hash table contains only pointers to the records, the size of the hash table may be reduced.
2) Collision Resolution
Chaining allows simple and efficient collision handling. To handle collisions, only a link field has to be added to each record.
3) Overflow
It is no longer necessary for the size of the hash table to exceed the number of records. If there are more records than entries in the table, it simply means that some of the linked lists contain more than one record.
4) Deletion
Deletion proceeds in exactly the same way as deletion from a simple linked list, so in a chained hash table deletion is a quick and easy task.

Rehashing or Open Hashing
The form of hash structure that we have just described is sometimes referred to as closed hashing. (Some texts use the terms open and closed hashing the other way round.) Under an alternative approach, called open hashing, the set of buckets is fixed and there are no overflow chains; instead, if a bucket is full, records are inserted in some other bucket in the initial set of buckets B.
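Returning to overflow chaining, a chained hash table built on an array of list heads might look like the following minimal sketch. It assumes non-negative integer keys, a table of 10 chains, one key per node, and the division hash; the names NBUCKETS, chain_insert, chain_search, and chain_delete are illustrative and do not come from the text.

```c
#include <assert.h>
#include <stdlib.h>

#define NBUCKETS 10                    /* assumed table size for this sketch */

struct node {
    int key;
    struct node *next;                 /* link field that chains synonyms */
};

static struct node *table[NBUCKETS];   /* all chains start out empty (NULL) */

static unsigned bucket_of(int key)
{
    assert(key >= 0);                  /* sketch handles non-negative keys only */
    return (unsigned)key % NBUCKETS;   /* division hash */
}

/* Insert key at the head of its chain; returns 0 on success, -1 if out of memory. */
int chain_insert(int key)
{
    struct node *n = malloc(sizeof *n);
    if (n == NULL)
        return -1;
    unsigned b = bucket_of(key);
    n->key = key;
    n->next = table[b];
    table[b] = n;
    return 0;
}

/* Return 1 if key is stored in the table, 0 otherwise. */
int chain_search(int key)
{
    for (struct node *n = table[bucket_of(key)]; n != NULL; n = n->next)
        if (n->key == key)
            return 1;
    return 0;
}

/* Unlink and free the first node holding key; returns 1 if a node was removed. */
int chain_delete(int key)
{
    struct node **link = &table[bucket_of(key)];
    while (*link != NULL) {
        if ((*link)->key == key) {
            struct node *victim = *link;
            *link = victim->next;      /* ordinary linked-list deletion */
            free(victim);
            return 1;
        }
        link = &(*link)->next;
    }
    return 0;
}
```

Keys 4, 14, and 24 all hash to chain 4 here, so the table can hold more keys than it has entries, and deleting 14 only relinks that one chain.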
Rehashing techniques apply, and if necessary re-apply, some hash function again and again until an empty bucket is found. Rehashing involves using a secondary hash function on the hash key of the item. The rehash function is applied successively until an empty position is found where the item can be inserted. During a search, if the hash position of the item is found to be occupied by another key, the rehash function is used in the same way to locate the item.

In other words, the new address is computed not from the identifier but from the hash address already obtained. If that position is also filled, the function is applied again to the resulting address, and the process is repeated until an empty location is reached.

Let X be the identifier to be stored. A hash function is applied to it to compute the address in the hash table, i.e. address = f(X). If there is a collision, the same function or some other function ff is applied to the computed address to get a new index. If this also fails to yield an empty space, ff is reapplied successively until a space is found:

    Initially:             index = f(X)
    If collision:          index = ff(f(X))
    If collision again:    index = ff(ff(f(X)))
    and so on.

Example 5
Suppose the value 15424 is to be hashed by the division method into a hash table of size 10 with slot size 1. Some positions of the hash table are already filled, as shown in the following figure:

    f(X) = X mod 10
    f(15424) = 15424 mod 10 = 4

There is a collision. Let ff(X) = (2 * X) mod 10 be the rehashing function. Applying it:

    ff(f(15424)) = ff(4) = (2 * 4) mod 10 = 8

Here also there is a collision. Applying ff once again:

    ff(ff(f(X))) = ff(ff(4)) = ff(8) = (2 * 8) mod 10 = 6
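The rehash sequence of this example can be traced with a few lines of C. This is only a sketch: the occupied[] array standing in for the hash table, the function name rehash_place, and the cap of TABLESIZE attempts are assumptions added here. The cap matters because ff(i) = (2 * i) mod 10 cycles through 4, 8, 6, 2 and never reaches the other positions.

```c
#include <assert.h>

#define TABLESIZE 10

static int f(long x) { return (int)(x % TABLESIZE); }   /* primary hash          */
static int ff(int i) { return (2 * i) % TABLESIZE; }    /* rehash of the address */

/* Follow f(x), ff(f(x)), ff(ff(f(x))), ... and return the first free index.
 * occupied[i] != 0 marks a filled bucket. Returns -1 if no free bucket is
 * met within TABLESIZE attempts (the attempt limit is an assumption). */
int rehash_place(long x, const int occupied[TABLESIZE])
{
    assert(x >= 0);
    int i = f(x);
    for (int tries = 0; tries < TABLESIZE; tries++) {
        if (!occupied[i])
            return i;
        i = ff(i);
    }
    return -1;
}
```

With buckets 4 and 8 marked as occupied, rehash_place(15424, occupied) visits 4, then 8, then 6, and returns 6.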
Position 6 is free, so 6 is the required index.

(Figure: hash table with positions numbered 1 to 10.)

Linear Probing
One policy is to use the next bucket (in cyclic order) that has space. This policy is called "linear probing". It starts at the hash address (the location where the collision occurred) and searches the table sequentially for the desired key or an empty location. Because the search proceeds in a straight line, the method is called linear probing. The table is treated as circular, so that when the last location is reached the search proceeds to the first location of the table. After first being sent to an already occupied table position, the item is moved sequentially down the table until an empty space is found. If m is the number of possible positions in the table, the linear probe continues until either an empty spot is found or m-1 further locations have been searched.

Example 6
Suppose the value 15424 is to be hashed by the division method into a hash table of size 10 with slot size 1. Some positions of the hash table are already filled, as shown in the following figure:

    f(X) = X mod 10
    f(15424) = 15424 mod 10 = 4

The hash index (i.e. 4) is already filled. Therefore the table is searched linearly, and the first free place is found at the 7th bucket, where 15424 is stored.

(Figure: hash table with positions numbered 1 to 10, with 15424 placed in bucket 7.)

Example 7
Here is part of a program for linear hashing. It uses f(key) as the hashing function and ff(i) as the rehashing function. The special value nullkey indicates an empty record in the table. The program searches the buckets linearly.

    #define TABLESIZE …
    typedef … KEYTYPE;               /* key type, left unspecified */
    typedef … RECTYPE;               /* record type, left unspecified */

    struct rec {
        KEYTYPE k;
        RECTYPE r;
    } table[TABLESIZE];

    /* Find key, or insert it with record rec at the first empty position.
       For linear probing, ff(i) would be (i + 1) % TABLESIZE.
       The loop does not terminate if the table is full and key is absent. */
    int searchplace(KEYTYPE key, RECTYPE rec)
    {
        int i;

        i = f(key);
        while (table[i].k != key && table[i].k != nullkey)
            i = ff(i);
        if (table[i].k == nullkey) {
            table[i].k = key;
            table[i].r = rec;
        }
        return i;
    }

Clustering
The situation in which two keys that hash into different values compete with each other in successive rehashes is called primary clustering. The amount of clustering is also a measure of the suitability of a rehash function.

One thing that happens with linear probing is clustering, or bunching together. When items are repeatedly hashed to the same spot in the table, each lookup has to search through the occupied buckets. The next empty bucket is always adjacent to the occupied ones, which gives rise to clusters, and a search may have to pass over 1, 2, 3 or more buckets before finding the empty spot. The opposite of clustering is uniform distribution, in which the hash function distributes the items evenly over the table.

(Figure: hash table with positions numbered 1 to 10, showing a cluster and its resolution.)

Quadratic Probing
If there is a collision at hash address h, this method probes the table at locations h+1, h+4, h+9, ..., that is, at location h + i^2 for i = 1, 2, .... The increment function is i^2. Quadratic probing reduces clustering, but it will not probe all locations in the table.

    ffj(X) = (f(X) + j^2) mod tablesize

Example 8
Let us hash the key value 675564 using the division method into the hash table shown above.

    f(675564) = 675564 mod 10 = 4

There is a collision. Linear probing would place the key in the 8th position, lengthening the cluster. Let us apply quadratic rehashing instead:

    ff1(675564) = (f(675564) + 1^2) mod 10 = 5

This is again a collision. The worked example then adds the next increment to the previous probe address:

    ff2(675564) = (ff1(675564) + 2^2) mod 10 = (5 + 4) mod 10 = 9

Location 9 is empty, so the key is inserted in the 9th bucket, and the cluster has not grown. (If the increment is instead measured from f(X), exactly as the formula above is written, the second probe is (4 + 4) mod 10 = 8.)
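The following sketch follows the formula ffj(X) = (f(X) + j^2) mod tablesize literally, measuring every increment from the home address f(X). Under that reading the probes for 675564 are 4, 5, 8, 3, 0, 9, ..., which differs from a variant that adds j^2 to the previous probe address. The occupied[] array and the names quad_probe and quad_place are assumptions made for the illustration.

```c
#include <assert.h>

#define TABLESIZE 10

/* j-th quadratic probe address for key x; j = 0 gives the home address. */
int quad_probe(long x, int j)
{
    assert(x >= 0 && j >= 0);
    return (int)((x % TABLESIZE + (long)j * j) % TABLESIZE);
}

/* Return the first free index on the quadratic probe sequence of x, or -1
 * if none of the first TABLESIZE probes is free. occupied[i] != 0 marks a
 * filled bucket. The sequence may repeat addresses and skip others, so a
 * return of -1 does not prove that the table is full. */
int quad_place(long x, const int occupied[TABLESIZE])
{
    for (int j = 0; j < TABLESIZE; j++) {
        int i = quad_probe(x, j);
        if (!occupied[i])
            return i;
    }
    return -1;
}
```

For a table of size 10 and home address 4, only the positions 4, 5, 8, 3, 0, and 9 are ever visited, which illustrates the remark that quadratic probing does not probe all locations.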
Double Hashing
This is another method of collision resolution. Unlike linear collision resolution, double hashing uses a second hashing function, which normally limits multiple collisions. The idea is that if two values hash to the same spot in the table, a constant can be calculated from the initial value using the second hashing function; this constant is then used to change the sequence of locations probed in the table while still giving access to the entire table.

The scheme consists of two rehashing functions, f1 and f2. First f1 is applied to get the location for insertion. If that location is occupied, f2 is used to rehash. If there is again a collision, f1 is used for rehashing. In this way the two functions are employed alternately until an empty location is obtained.

Example 9
Suppose we have to insert the key value 23763 into a table of size 10. The two functions are f1(X) = (X + 1) mod tablesize and f2(X) = 2 + (X mod tablesize). We apply the first function to compute the first hash index:

    f1(23763) = (1 + 23763) mod 10 = 4

Suppose the 4th location is not free. Apply the rehashing function:

    f2(f1(23763)) = f2(4) = 2 + (4 mod 10) = 6

If the 6th place is also not empty, continue:

    f1(f2(f1(23763))) = f1(6) = (1 + 6) mod 10 = 7

and so on.

Key Dependent Increments
With key dependent increments, the probe increment is computed by a function of the key, which may use only part of the key. For example, we could truncate the key to a single character and use its code as the increment.

    Bucket 0    (no entries)
    Bucket 1    e-215
    Bucket 2    e-101, e-110
    Bucket 3    e-217, e-102
    Bucket 4    e-218
    Bucket 5    e-203
    Bucket 6    (no entries)

    CHENNI      GAURAV     e-217
    BANGLORE    CHANDRA    e-101
    DELHI       SHARAD     e-110
    MADURAI     ARUN       e-215
    MUMBAI      SAURABH    e-102
    JAIPUR      AJIT       e-203
    LUCKNOW     RISHABH    e-218
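The alternating scheme of Example 9 can be traced with the sketch below, which applies f1 and f2 in turn to the previous address. The function name alternating_probes and the output array are assumptions made for this illustration. The extra "mod TABLESIZE" in f2 is also an assumption: as written in the example, 2 + (X mod tablesize) can reach 10 or 11, which lies outside a table of size 10.

```c
#include <assert.h>

#define TABLESIZE 10

static int f1(long x) { return (int)((x + 1) % TABLESIZE); }
/* Final "% TABLESIZE" added as an assumption to keep the address in range. */
static int f2(long x) { return (int)((2 + x % TABLESIZE) % TABLESIZE); }

/* Store the first n probe addresses for key in out[], applying f1 to the
 * key and then f2, f1, f2, ... alternately to the previous address. */
void alternating_probes(long key, int out[], int n)
{
    assert(key >= 0 && n >= 0);
    long cur = key;
    for (int j = 0; j < n; j++) {
        cur = (j % 2 == 0) ? f1(cur) : f2(cur);
        out[j] = (int)cur;
    }
}
```

For key 23763 the first three addresses are 4, 6, and 7, as in Example 9. Many textbooks define double hashing differently, taking the j-th probe as (h1(key) + j * h2(key)) mod tablesize, so that the step size depends on the key itself.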