Data Structure - SS Publications

SYLLABUS
ADVANCED CONCEPTS OF C AND INTRODUCTION TO DATA STRUCTURES
DATA TYPES, ARRAYS, POINTERS, RELATION BETWEEN POINTERS AND ARRAYS, SCOPE
RULES AND STORAGE CLASSES, DYNAMIC ALLOCATION AND DE-ALLOCATION OF
MEMORY, DANGLING POINTER PROBLEM, STRUCTURES, ENUMERATED CONSTANTS,
UNIONS
COMPLEXITY OF ALGORITHMS
PROGRAM ANALYSIS, PERFORMANCE ISSUES, GROWTH OF FUNCTIONS, ASYMPTOTIC
NOTATIONS, TIME-SPACE TRADE OFFS, SPACE USAGE, SIMPLICITY, OPTIMALITY
INTRODUCTION TO DATA AND FILE STRUCTURE
INTRODUCTION, PRIMITIVE AND SIMPLE STRUCTURES, LINEAR AND NONLINEAR
STRUCTURES, FILE ORGANIZATIONS
ARRAYS
SEQUENTIAL ALLOCATION, MULTIDIMENSIONAL ARRAYS, ADDRESS CALCULATIONS,
GENERAL MULTIDIMENSIONAL ARRAYS, SPARSE ARRAYS
STRINGS
INTRODUCTION , STRING FUNCTIONS , STRING LENGTH , STRING COPY, STRING COMPARE ,
STRING CONCATENATION
ELEMENTARY DATA STRUCTURES
STACK , OPERATIONS ON STACK, IMPLEMENTATION OF STACKS, RECURSION AND STACKS
,EVALUATION OF EXPRESSIONS USING STACKS, QUEUE, ARRAY IMPLEMENTATION OF
QUEUES, CIRCULAR QUEUE , DEQUES , PRIORITY QUEUES
LINKED LISTS
SINGLY LINKED LISTS, IMPLEMENTATION OF LINKED LIST, CONCATENATION OF LINKED
LISTS , MERGING OF LINKED LISTS, REVERSING OF LINKED LIST, DOUBLY LINKED LIST,
IMPLEMENTATION OF DOUBLY LINKED LIST, CIRCULAR LINKED LIST, APPLICATIONS OF THE
LINKED LISTS
GRAPHS
ADJACENCY MATRIX AND ADJACENCY LISTS , GRAPH TRAVERSAL, IMPLEMENTATION,
SHORTEST PATH PROBLEM , MINIMAL SPANNING TREE,
OTHER TASKS
TREES
INTRODUCTION, PROPERTIES OF A TREE , BINARY TREES, IMPLEMENTATION, TRAVERSALS
OF A BINARY TREE, BINARY SEARCH TREES (BST), INSERTION IN BST , DELETION OF A
NODE, SEARCH FOR A KEY IN BST, HEIGHT BALANCED TREE, B-TREE, INSERTION,
DELETION
FILE ORGANIZATION INTRODUCTION, TERMINOLOGY , FILE ORGANISATION, SEQUENTIAL
FILES, DIRECT FILE ORGANIZATION , DIVISION-REMAINDER HASHING, INDEXED
SEQUENTIAL FILE ORGANIZATION
SEARCHING
INTRODUCTION, SEARCHING TECHNIQUES, SEQUENTIAL SEARCH, BINARY SEARCH,
HASHING, HASH FUNCTIONS, COLLISION RESOLUTION
SORTING
INTRODUCTION, INSERTION SORT, BUBBLE SORT, SELECTION SORT, RADIX SORT, QUICK
SORT, 2-WAY MERGE SORT, HEAP SORT, HEAPSORT VS. QUICKSORT
S S PUBLICATIONS
D.NO: 10-13-36, Sistla Vari Street, Repalle-522265, Guntur (Dt), A.P, INDIA
Email: mdsspublications@gmail.com , Web-site: www.sspublications.co.in
UNIT 1 ADVANCED CONCEPTS OF C AND INTRODUCTION TO DATA STRUCTURES
1.1 INTRODUCTION
1.2 DATA TYPES
1.3 ARRAYS
1.3.1 HANDLING ARRAYS
1.3.2 INITIALIZING THE ARRAYS
1.4 MULTIDIMENSIONAL ARRAYS
1.4.1 INITIALIZATION OF TWO DIMENSIONAL ARRAY
1.5 POINTERS
1.5.1 ADVANTAGES AND DISADVANTAGES OF POINTERS
1.5.2 DECLARING AND INITIALIZING POINTERS
1.5.3 POINTER ARITHMETIC
1.6 ARRAY OF POINTERS
1.7 PASSING PARAMETERS TO THE FUNCTIONS
1.8 RELATION BETWEEN POINTERS AND ARRAYS
1.9 SCOPE RULES AND STORAGE CLASSES
1.9.1 AUTOMATIC VARIABLES
1.9.2 STATIC VARIABLES
1.9.3 EXTERNAL VARIABLES
1.9.4 REGISTER VARIABLE
1.10 DYNAMIC ALLOCATION AND DE-ALLOCATION OF MEMORY
1.10.1 FUNCTION MALLOC(SIZE)
1.10.2 FUNCTION CALLOC(N,SIZE)
1.10.3 FUNCTION FREE(BLOCK)
1.11 DANGLING POINTER PROBLEM.
1.12 STRUCTURES.
1.13 ENUMERATED CONSTANTS
1.14 UNIONS
UNIT 2 COMPLEXITY OF ALGORITHMS
2.1. PROGRAM ANALYSIS
2.2. PERFORMANCE ISSUES
2.3. GROWTH OF FUNCTIONS
2.4. ASYMPTOTIC NOTATIONS
2.4.1. BIG-O NOTATION (O)
2.4.2. BIG-OMEGA NOTATION (Ω)
2.4.3. BIG-THETA NOTATION (Θ)
2.5. TIME-SPACE TRADE OFFS
2.6. SPACE USAGE
2.7. SIMPLICITY
2.8. OPTIMALITY
UNIT 3 INTRODUCTION TO DATA AND FILE STRUCTURE
3.1 INTRODUCTION
3.2 PRIMITIVE AND SIMPLE STRUCTURES
3.3 LINEAR AND NONLINEAR STRUCTURES
3.4 FILE ORGANIZATIONS
UNIT 4 ARRAYS
4.1 INTRODUCTION
4.1.1. SEQUENTIAL ALLOCATION
4.1.2. MULTIDIMENSIONAL ARRAYS
4.2. ADDRESS CALCULATIONS
4.3. GENERAL MULTIDIMENSIONAL ARRAYS
4.4. SPARSE ARRAYS
UNIT 5 STRINGS
5.1 INTRODUCTION
5.2 STRING FUNCTIONS
5.3 STRING LENGTH
5.3.1 USING ARRAY
5.3.2 USING POINTERS
5.4 STRING COPY
5.4.1 USING ARRAY
5.4.2 USING POINTERS
5.5 STRING COMPARE
5.5.1 USING ARRAY
5.6 STRING CONCATENATION
UNIT 6 ELEMENTARY DATA STRUCTURES
6.1 INTRODUCTION
6.2 STACK
6.2.1 DEFINITION
6.2.2 OPERATIONS ON STACK
6.2.3 IMPLEMENTATION OF STACKS USING ARRAYS
6.2.3.1 FUNCTION TO INSERT AN ELEMENT INTO THE STACK
6.2.3.2 FUNCTION TO DELETE AN ELEMENT FROM THE STACK
6.2.3.3 FUNCTION TO DISPLAY THE ITEMS
6.3 RECURSION AND STACKS
6.4 EVALUATION OF EXPRESSIONS USING STACKS
6.4.1 POSTFIX EXPRESSIONS
6.4.2 PREFIX EXPRESSION
6.5 QUEUE
6.5.1 INTRODUCTION
6.5.2 ARRAY IMPLEMENTATION OF QUEUES
6.5.2.1 FUNCTION TO INSERT AN ELEMENT INTO THE QUEUE
6.5.2.2 FUNCTION TO DELETE AN ELEMENT FROM THE QUEUE
6.6 CIRCULAR QUEUE
6.6.1 FUNCTION TO INSERT AN ELEMENT INTO THE QUEUE
6.6.2 FUNCTION FOR DELETION FROM CIRCULAR QUEUE
6.6.3 CIRCULAR QUEUE WITH ARRAY IMPLEMENTATION
6.7 DEQUES
6.8 PRIORITY QUEUES
UNIT 7 LINKED LISTS
7.1. INTRODUCTION
7.2. SINGLY LINKED LISTS.
7.2.1. IMPLEMENTATION OF LINKED LIST
7.2.1.1. INSERTION OF A NODE AT THE BEGINNING
7.2.1.2. INSERTION OF A NODE AT THE END
7.2.1.3. INSERTION OF A NODE AFTER A SPECIFIED NODE
7.2.1.4. TRAVERSING THE ENTIRE LINKED LIST
7.2.1.5. DELETION OF A NODE FROM LINKED LIST
7.3. CONCATENATION OF LINKED LISTS
7.4. MERGING OF LINKED LISTS
7.5. REVERSING OF LINKED LIST
7.6. DOUBLY LINKED LIST.
7.6.1. IMPLEMENTATION OF DOUBLY LINKED LIST
7.7. CIRCULAR LINKED LIST
7.8. APPLICATIONS OF THE LINKED LISTS
UNIT 8 GRAPHS
8.1 INTRODUCTION
8.2 ADJACENCY MATRIX AND ADJACENCY LISTS
8.3 GRAPH TRAVERSAL
8.3.1 DEPTH FIRST SEARCH (DFS)
8.3.1.1 IMPLEMENTATION
8.3.2 BREADTH FIRST SEARCH (BFS)
8.3.2.1 IMPLEMENTATION
8.4 SHORTEST PATH PROBLEM
8.5 MINIMAL SPANNING TREE
8.6 OTHER TASKS
UNIT 9 TREES
9.1. INTRODUCTION
9.1.1. OBJECTIVES
9.1.2. BASIC TERMINOLOGY
9.1.3. PROPERTIES OF A TREE
9.2. BINARY TREES
9.2.1. PROPERTIES OF BINARY TREES
9.2.2. IMPLEMENTATION
9.2.3. TRAVERSALS OF A BINARY TREE
9.2.3.1. IN ORDER TRAVERSAL
9.2.3.2. POST ORDER TRAVERSAL
9.2.3.3. PREORDER TRAVERSAL
9.3. BINARY SEARCH TREES (BST)
9.3.1. INSERTION IN BST
9.3.2. DELETION OF A NODE
9.3.3. SEARCH FOR A KEY IN BST
9.4. HEIGHT BALANCED TREE
9.5. B-TREE
9.5.1. INSERTION
9.5.2. DELETION
UNIT 10 FILE ORGANIZATION
10.1. INTRODUCTION
10.2. TERMINOLOGY
10.3. FILE ORGANISATION
10.3.1. SEQUENTIAL FILES
10.3.1.1. BASIC OPERATIONS
10.3.1.2. DISADVANTAGES
10.3.2. DIRECT FILE ORGANIZATION
10.3.2.1. DIVISION-REMAINDER HASHING
10.3.3. INDEXED SEQUENTIAL FILE ORGANIZATION
UNIT 11 SEARCHING
11.1. INTRODUCTION
11.2. SEARCHING TECHNIQUES
11.2.1. SEQUENTIAL SEARCH
11.2.1.1. ANALYSIS
11.2.2. BINARY SEARCH
11.2.2.1. ANALYSIS
11.3. HASHING
11.3.1. HASH FUNCTIONS
11.4. COLLISION RESOLUTION
UNIT 12 SORTING
12.1. INTRODUCTION
12.2. INSERTION SORT
12.2.1. ANALYSIS
12.3. BUBBLE SORT
12.3.1. ANALYSIS
12.4. SELECTION SORT
12.4.1. ANALYSIS
12.5. RADIX SORT
12.5.1. ANALYSIS
12.6. QUICK SORT
12.6.1. ANALYSIS
12.7. 2-WAY MERGE SORT
12.8. HEAP SORT
12.9. HEAPSORT VS. QUICKSORT
UNIT 1 ADVANCED CONCEPTS OF C AND INTRODUCTION TO DATA STRUCTURES
1.1. INTRODUCTION
This chapter familiarizes you with the concepts of arrays, pointers, and dynamic memory allocation and de-allocation techniques. We also briefly discuss types of data structures and algorithms. Let us start the discussion with data types.
1.2. DATA TYPES
The data given to a program must be stored so that it can be referred back later. This is done with the help of variables. A variable's memory requirement depends on the type it belongs to. The built-in types in C are integers, float (real numbers), characters, double, long, short, etc.
Many times we come across several data items of the same type that are related. Giving them different variable names and remembering them all later is a tedious process. It would be easier if we could use a single parent name to refer to all the identifiers of the same type, and refer to a particular value with an index that indicates whether the value is the first, second, or tenth under that parent name.
An index can be used for reference because the values occupy successive memory locations. We simply remember one name (the starting address) and then refer to any value using the index. Such a facility is known as an ARRAY.
1.3. ARRAYS
An array can be defined as a collection of sequential memory locations, which can be referred to by a single name along with a number, known as the index, to access a particular field or data.
When we declare an array, we have to assign a type as well as a size.
e.g.
When we want to store 10 integer values, we can use the following declaration.
int A[10];
By this declaration we declare A to be an array that is supposed to contain 10 integer values in all. When the allocation is done for array A, 10 locations of 2 bytes each (20 bytes in all, assuming 2-byte integers) are allocated and the starting address is stored in A. When we say A[0] we are referring to the first integer value in A.
| A[0] | A[1] | ... | A[9] |
fig(1). Array representation
Hence if we refer to the ith value in the array we should write A[i-1]. When we declare an array of SIZE elements, where SIZE is given by the user, the index value ranges from 0 to SIZE-1.
It should be remembered that the size of the array is always a constant and not a variable. This is because a fixed amount of memory is allocated to the array before execution of the program. This method of allocating memory is known as 'static allocation'.
1.3.1 HANDLING ARRAYS
Normally the following procedure is used, so that by changing only one #define statement we can run the program for arrays of different sizes.
#define SIZE 10
int a[SIZE], b[SIZE];
Now if we want the program to run for an array of 200 elements, we need to change just the #define statement.
1.3.2 INITIALIZING THE ARRAYS.
One method of initializing array members is by using a 'for' loop. The following loop initializes 10 elements with the value of their index.
#define SIZE 10
main()
{
    int arr[SIZE], i;
    for(i = 0; i < SIZE; i++)
    {
        arr[i] = i;
    }
}
An array can also be initialized directly as follows.
int arr[3] = {0,1,2};
An explicitly initialized array need not specify its size, but if the size is specified, the number of elements provided must not exceed it. If the size is given and some elements are not explicitly initialized, they are set to zero.
e.g.
int arr[] = {0,1,2};
int arr1[5] = {0,1,2}; /* Initialized as {0,1,2,0,0} */
const char a_arr3[6] = "Daniel"; /* ERROR: "Daniel" has 7 elements, 6 characters and a '\0' */
To copy one array to another, each element has to be copied using a for loop.
Any expression that evaluates to an integral value can be used as an index into an array.
e.g.
arr[get_value()] = somevalue;
1.4 MULTIDIMENSIONAL ARRAYS
An array in which the elements are referred to by two indices is called a two-dimensional array or a "matrix"; if the elements are referred to using more than two indices, it is a multidimensional array.
e.g.
int arr[4][3];
This is a two-dimensional array with 4 as row dimension and 3 as a column dimension.
1.4.1 INITIALIZATION OF TWO DIMENSIONAL ARRAY
Just like one-dimensional arrays, members of matrices can be initialized in two ways: using a 'for' loop, or directly. Initialization using nested loops is shown below.
e.g.
int arr[10][10];
int i, j;
for(i = 0; i < 10; i++)
{
    for(j = 0; j < 10; j++)
    {
        arr[i][j] = i + j;
    }
}
Now let us see how members of matrices are initialized directly.
e.g.
int arr[4][3] = {{0,1,2},{3,4,5},{6,7,8},{9,10,11}};
The nested brackets are optional.
1.5 POINTERS
The computer memory is a collection of storage cells. These locations are numbered sequentially and are called addresses. A pointer is the address of a memory location. Any variable which contains an address is a pointer variable. Pointer variables store the address of an object, allowing indirect manipulation of that object. They are used in the creation and management of objects that are dynamically created during program execution.
1.5.1 ADVANTAGES AND DISADVANTAGES OF POINTERS
Pointers are very effective when:
- the data in one function is to be modified by another function, by passing its address;
- memory has to be allocated while running the program and released when it is no longer required;
- data must be accessed faster, because of direct addressing.
The only disadvantage of pointers is that, if not understood and used properly, they can introduce bugs into the program.
1.5.2 DECLARING AND INITIALIZING POINTERS
Pointers are declared using the (*) operator. The general format is:
data_type *ptrname;
where data_type can be any of the basic data types, such as integer, float, etc., or any user-defined data type. The pointer name becomes a pointer to that data type.
e.g.
int *iptr;
char *cptr;
float *fptr;
The pointer iptr stores the address of an integer. In other words it points to an integer,
cptr to a character and fptr to a float value.
Once the pointer variable is declared it can be made to point to a variable with the help of
an address (reference) operator(&).
e.g.
int num = 1024;
int *iptr;
iptr = # // iptr points to the variable num.
The pointer can hold the value 0 (NULL), indicating that it points to no object at present.
Pointers can never store a non-address value.
e.g.
iptr1 = ival; // invalid, ival is not an address
A pointer of one type cannot be assigned the address value of the object of another type.
e.g.
double dval, *dptr = &dval; // allowed
iptr = &dval; // not allowed
1.5.3 POINTER ARITHMETIC
The value stored in a pointer variable can be altered using arithmetic operators. You can increment or decrement pointers, subtract one pointer from another, and add or subtract integers to pointers; but two pointers cannot be added, as that may lead to an address that is not present in memory. No arithmetic operations other than these are allowed on pointers. Consider a program to demonstrate pointer arithmetic.
e.g.
# include<stdio.h>
main()
{
    int a[]={10,20,30,40,50};    /* ptr--> F000 : 10 */
    int *ptr;                    /*        F002 : 20 */
    int i;                       /*        F004 : 30 */
    ptr=a;                       /*        F006 : 40 */
    for(i=0; i<5; i++)           /*        F008 : 50 */
    {
        printf("%d ",*ptr++);
    }
}
Output:
10 20 30 40 50
The addresses of the memory locations for the array 'a' run from F000 to F008. The initial address F000 is assigned to 'ptr'. Then, by incrementing the pointer, the next values are obtained. Each increment advances the pointer by 2 bytes, because the size of an integer is 2 bytes here. The sizes of the various data types on a 16-bit machine are shown below; they may vary from system to system.
char       1 byte
int        2 bytes
float      4 bytes
long int   4 bytes
double     8 bytes
short int  2 bytes
1.6 ARRAY OF POINTERS
Consider the declaration shown below:
char *A[3]={“a”, “b”, “Text Book”};
This example declares 'A' as an array of character pointers. Each location in the array points to a string of characters of varying length. Here A[0] points to the first character of the first string and A[1] points to the first character of the second string, both of which contain only one character. However, A[2] points to the first character of the third string, which contains 9 characters.
1.7 PASSING PARAMETERS TO THE FUNCTIONS
The different ways of passing parameters to a function are:
- pass by value (call by value)
- pass by address/pointer (call by reference)
In pass by value we copy the actual argument into the formal argument declared in the function definition. Therefore any changes made to the formal arguments are not reflected back to the calling program.
In pass by address we use pointer variables as arguments. Pointer variables are particularly useful when passed to functions: the changes made in the called function are reflected back to the calling function. The program below uses the classic problem of swapping the values of two variables.
void val_swap(int x, int y)      /* Call by Value */
{
    int t;
    t = x;
    x = y;
    y = t;
}
void add_swap(int *x, int *y)    /* Call by Address */
{
    int t;
    t = *x;
    *x = *y;
    *y = t;
}
void main()
{
    int n1 = 25, n2 = 50;
    printf("\n Before call by value : ");
    printf("\n n1 = %d n2 = %d", n1, n2);
    val_swap(n1, n2);
    printf("\n After call by value : ");
    printf("\n n1 = %d n2 = %d", n1, n2);
    printf("\n Before call by address : ");
    printf("\n n1 = %d n2 = %d", n1, n2);
    add_swap(&n1, &n2);
    printf("\n After call by address : ");
    printf("\n n1 = %d n2 = %d", n1, n2);
}
Output:
Before call by value   : n1 = 25 n2 = 50
After call by value    : n1 = 25 n2 = 50 // x = 50, y = 25
Before call by address : n1 = 25 n2 = 50
After call by address  : n1 = 50 n2 = 25
1.8 RELATION BETWEEN POINTERS AND ARRAYS
Pointers and Arrays are related to each other. All programs written with arrays can also be written
with the pointers. Consider the following:
int arr[] = {0,1,2,3,4,5,6,7,8,9};
To access the value we can write,
arr[0] or *arr;
arr[1] or *(arr+1);
Since '*' is used both to declare pointer variables and to dereference them, you have to know the difference between the two uses thoroughly.
*(arr+1) means the address arr is increased by 1 (one element) and then the contents at that address are fetched.
*arr+1 means the contents are fetched from address arr and then one is added to that value.
Now we have understood the relation between an array and pointer. The traversal of an
array can be made either through subscripting or by direct pointer manipulation.
e.g.
void print(int *arr_beg, int *arr_end)
{
    while(arr_beg != arr_end)
    {
        printf("%i ", *arr_beg);
        ++arr_beg;
    }
}
void main()
{
    int arr[] = {0,1,2,3,4,5,6,7,8,9};
    print(arr, arr+10);
}
arr_end points one element past the end of the array, so that we can iterate through all the elements. As written, this works only with pointers to arrays of integers.
1.9 SCOPE RULES AND STORAGE CLASSES
Since, as explained earlier, changes to formal variables are not reflected back to the calling program, it becomes important to understand the scope and lifetime of variables.
The storage class determines the life of a variable in terms of its duration or its scope. There are
four storage classes:
- automatic
- static
- external
- register
1.9.1 AUTOMATIC VARIABLES
Automatic variables are defined within functions. They lose their values when the function terminates, and can be accessed only in that function. All variables declared within a function are, by default, automatic. However, we can declare them explicitly using the keyword auto.
e.g.
void print()
{
auto int i =0;
printf(“\n Value of i before incrementing is %d”, i);
i = i + 10;
printf(“\n Value of i after incrementing is %d”, i);
}
main()
{
    print();
    print();
    print();
}
Output:
Value of i before incrementing is 0
Value of i after incrementing is 10
Value of i before incrementing is 0
Value of i after incrementing is 10
Value of i before incrementing is 0
Value of i after incrementing is 10
1.9.2. STATIC VARIABLES
Static variables have the same scope as automatic variables, but, unlike automatic variables,
static variables retain their values over number of function calls. The life of a static variable
starts, when the first time the function in which it is declared, is executed and it remains in
existence, till the program terminates. They are declared with the keyword static.
e.g.
void print()
{
static int i =0;
printf(“\n Value of i before incrementing is %d”, i);
i = i + 10;
printf(“\n Value of i after incrementing is %d”, i);
}
main()
{
    print();
    print();
    print();
}
Output:
Value of i before incrementing is 0
Value of i after incrementing is 10
Value of i before incrementing is 10
Value of i after incrementing is 20
Value of i before incrementing is 20
Value of i after incrementing is 30
It can be seen from the above example that the value of the variable is retained when the function
is called again. It is allocated memory and is initialized only for the first time.
1.9.3. EXTERNAL VARIABLES
Different functions of the same program can be written in different source files and can be
compiled together. The scope of a global variable is not limited to any one function, but is
extended to all the functions that are defined after it is declared. However, the scope of a global
variable is limited to only those functions, which are in the same file scope. If we want to use a
variable defined in another file, we can use extern to declare them.
e.g.
/* FILE 1 - g is global and can be used in main() and fn1() */
int g = 0;
void main()
{
:
}
:
void fn1()
{
:
:
}
/* FILE 2 - to use the variable defined in FILE 1, declare it extern */
extern int g;
void fn2()
{
:
:
}
void fn3()
{
:
}
1.9.4. REGISTER VARIABLE
Computers have internal registers, which are used to store data temporarily, before any operation
can be performed. Intermediate results of the calculations are also stored in registers. Operations
can be performed on the data stored in registers more quickly than on the data stored in memory.
This is because the registers are a part of the processor itself. If a particular variable is used often (for instance, the control variable in a loop), it can be kept in a register rather than in an ordinary memory location. This is done using the keyword register. However, a register is assigned by the compiler only if one is free; otherwise the variable is treated as automatic. Also, global variables cannot be register variables.
e.g.
void loopfn()
{
    register int i;
    for(i = 0; i < 100; i++)
    {
        printf("%d ", i);
    }
}
1.10 DYNAMIC ALLOCATION AND DE-ALLOCATION OF MEMORY
Memory for variables and arrays declared in a program is allocated at compilation time. The size of these variables cannot be varied during run time; such variables are called 'static data structures'. The disadvantage of these data structures is that they require a fixed amount of storage. Once the storage is fixed, if the program uses only a small part of it, the remaining locations are wasted; if we try to use more memory than declared, overflow occurs.
If the storage requirement is unpredictable, static allocation is not recommended. The process of allocating memory at run time is called 'dynamic allocation'. Here, the required amount of memory is obtained from the free memory, called the 'heap', available to the user. This free memory is maintained as a list called the 'availability list'. Getting a block of memory and returning it to the availability list is done using functions like:
malloc()
calloc()
free()
1.10.1 FUNCTION MALLOC(SIZE)
This function is declared in the header files <stdlib.h> and <alloc.h>. It allocates a block of 'size' bytes from the heap (availability list). On success it returns a pointer of type void to the allocated memory; we must typecast it to the type we require, such as int or float. If the required space does not exist, it returns NULL.
Syntax:
ptr = (data_type*) malloc(size);
where
ptr is a pointer variable of type data_type.
data_type can be any of the basic data type, user defined or derived data type.
size is the number of bytes required.
e.g.
ptr =(int*)malloc(sizeof(int)*n);
allocates memory depending on the value of variable n.
# include<stdio.h>
# include<string.h>
# include<alloc.h>
# include<process.h>
main()
{
    char *str;
    if((str=(char*)malloc(10))==NULL)   /* allocate memory for string */
    {
        printf("\n OUT OF MEMORY");
        exit(1);                        /* terminate the program */
    }
    strcpy(str,"Hello");                /* copy Hello into str */
    printf("\n str= %s ",str);          /* display str */
    free(str);                          /* free memory */
}
In the above program, if memory is allocated to str, the string Hello is copied into it and str is displayed. When the string is no longer needed, the memory occupied by it is released back to the heap.
1.10.2 FUNCTION CALLOC(N,SIZE)
This function is defined in the header file <stdlib.h> and <alloc.h>. This function allocates
memory from the heap or availability list. If required space does not exist for the new block or n,
or size is zero it returns NULL.
Syntax:
ptr = (data_type*) calloc(n,size);
where
- ptr is a pointer variable of type data_type.
- data_type can be any of the basic, user-defined, or derived data types.
- size is the number of bytes required.
- n is the number of blocks of size bytes to be allocated.
A pointer to the first byte of the allocated region is returned.
e.g.
# include<stdio.h>
# include<string.h>
# include<alloc.h>
# include<process.h>
main()
{
    char *str = NULL;
    str=(char*)calloc(10,sizeof(char));  /* allocate memory for string */
    if(str == NULL)
    {
        printf("\n OUT OF MEMORY");
        exit(1);                         /* terminate the program */
    }
    strcpy(str,"Hello");                 /* copy Hello into str */
    printf("\n str= %s ",str);           /* display str */
    free(str);                           /* free memory */
}
1.10.3 FUNCTION FREE(BLOCK)
This function frees a block of memory allocated with malloc() or calloc(). The programmer can use it to de-allocate memory that is no longer required. It does not return any value.
1.11 DANGLING POINTER PROBLEM.
We can allocate memory to the same variable more than once. The compiler will not raise any
error. But it could lead to bugs in the program. We can understand this problem with the
following example.
# include<stdio.h>
# include<alloc.h>
main()
{
    int *a;
    a = (int*)malloc(sizeof(int));   /* a ----> | 10 | */
    *a = 10;
    a = (int*)malloc(sizeof(int));   /* a ----> | 20 | */
    *a = 20;
}
}
In this program segment, memory allocation for variable 'a' is done twice. The variable contains the address of the most recently allocated memory, making the earlier allocated memory inaccessible. So the memory location where the value 10 is stored is inaccessible to the application, and it is not possible to free it so that it can be reused.
To see another problem, consider the next program segment:
main()
{
    int *a;
    a = (int*)malloc(sizeof(int));   /* a ----> | 10 | */
    *a = 10;
    free(a);                         /* a ----> ?      */
}
Here, when we de-allocate the memory for variable 'a' using free(a), the memory location pointed to by 'a' is returned to the memory pool. Since the pointer 'a' no longer contains a valid address, we call it a 'dangling pointer'. If we want to reuse this pointer, we can allocate memory for it again.
1.12 STRUCTURES
A structure is a derived data type. It is a combination of logically related data items. Unlike arrays, which are collections of similar data types, structures can contain members of different data types. The data items in a structure generally belong to the same entity, such as the information about an employee or a player.
The general format of structure declaration is:
struct tag
{
type member1;
type member2;
type member3;
:
:
}variables;
We can omit the variable declaration in the structure declaration and define it separately as
follows :
struct tag variable;
e.g.
struct Account
{
int accnum;
char acctype;
char name[25];
float balance;
};
We can declare structure variables as :
struct Account oldcust;
We can refer to the member variables of structures using the dot operator (.).
e.g.
oldcust.balance = 100.0;
printf("%s", oldcust.name);
We can initialize the members as follows:
e.g.
struct Account customer = {100, 'w', "David", 6500.00};
In ANSI C one structure variable can be assigned directly to another of the same type; on older compilers this had to be done member-wise.
We can also have nested structures as shown in the following example:
struct Date
{
int dd, mm, yy;
};
struct Account
{
int accnum;
char acctype;
char name[25];
float balance;
struct Date d1;
};
Now if we have to access the members of date then we have to use the following method.
struct Account c1;
c1.d1.dd = 21;
We can pass and return structures into functions. The whole structure will get copied into formal
variable.
We can also have arrays of structures. If we declare an array of Account structures it will look like:
struct Account a[10];
Everything is the same as for a single element, except that a subscript is required to indicate which structure we are referring to.
We can also declare pointers to structures and to access member variables we have to use the
pointer operator -> instead of a dot operator.
struct Account *aptr;
printf(“%s”,aptr->name);
A structure can contain pointer to itself as one of the variables, also called self-referential
structures.
e.g.
struct info
{
    int i, j, k;
    struct info *next;
};
In short, we can list the uses of structures as:
- Related data items of dissimilar data types can be logically grouped under a common name.
- They can be used to pass parameters, so as to minimize the number of function arguments.
- They are useful when more than one value has to be returned from a function.
- They make the program more readable.
1.13 ENUMERATED CONSTANTS
Enumerated constants enable the creation of new types and the definition of variables of these types, so that their values are restricted to a set of possible values. Their syntax is:
enum identifier {c1,c2,...}[var_list];
where
- enum is the keyword.
- identifier is the user-defined enumerated data type, which can be used to declare variables in the program.
- {c1,c2,...} are the names of the constants, called enumeration constants.
- var_list is an optional list of variables.
e.g.
enum Colour{RED, BLUE, GREEN, WHITE, BLACK};
Colour is the name of an enumerated data type. It makes RED a symbolic constant with the value 0, BLUE a symbolic constant with the value 1, and so on.
Every enumerated constant has an integer value. Unless the program specifies otherwise, the first constant has the value 0 and each remaining constant counts up by 1 from its predecessor.
Any of the enumerated constants can be initialized to a particular value; those that are not initialized count upwards from the value of the previous constant.
e.g.
enum Colour{RED = 100, BLUE, GREEN = 500, WHITE, BLACK = 1000};
The values assigned will be RED = 100, BLUE = 101, GREEN = 500, WHITE = 501, BLACK = 1000.
You can define variables of type Colour, but they can hold only one of the enumerated values, in our case RED, BLUE, GREEN, WHITE, or BLACK.
You can declare objects of enum types.
e.g.
enum Days{SUN, MON, TUE, WED, THU, FRI, SAT};
enum Days day;
day = SUN;
day = 3;     // error, int and Days are of different types
day = hello; // error, hello is not a member of Days
Even though enum symbolic constants are internally considered to be of type unsigned int we
cannot use them for iterations.
e.g.
enum Days{SUN, MON, TUE, WED, THU, FRI, SAT};
for(enum Days i = SUN; i < SAT; i++)  // not allowed
There is no support for moving backward or forward from one enumerator to another. However, whenever necessary, an enumeration is automatically promoted to an arithmetic type.
e.g.
if( MON > 0)
{
printf(“ Monday is greater”);
}
int num = 2*MON;
1.14 UNIONS
A union is also like a structure, except that only one member of the union is stored in the allocated memory at a time. It is a collection of mutually exclusive variables: all of its member variables share the same physical storage, and only one variable is defined at a time. The size of the union is equal to the size of its largest member. A union is defined as follows:
union tag
{
type memvar1;
type memvar2;
type memvar3;
:
:
};
A union variable of this data type can be declared as follows,
union tag variable_name;
e.g.
union utag
{
int num;
char ch;
};
union utag ff;
The above union will have two bytes of storage allocated to it (the size of an int here). The variable num is accessed as ff.num and ch as ff.ch. At any time only one of these two variables can be referred to, and any change made to one affects the other.
Thus unions use memory efficiently by using the same locations to store variables which may be of different types, which exist at mutually exclusive times, and which are each used in the program only once.
In this chapter we have studied some advanced features of C. We have seen how the flexibility of
the language allows us to define a user-defined type, which is a combination of different types of
variables belonging to some entity. We also studied arrays and pointers in detail. These are
very useful in the study of various data structures.
UNIT 2
COMPLEXITY OF ALGORITHMS
2.1 PROGRAM ANALYSIS
2.2 PERFORMANCE ISSUES
2.3 GROWTH OF FUNCTIONS
2.4 ASYMPTOTIC NOTATIONS
2.4.1 BIG-O NOTATION (O)
2.4.2 BIG-OMEGA NOTATION (Ω)
2.4.3 BIG-THETA NOTATION (Θ)
2.5 TIME-SPACE TRADE OFFS
2.6 SPACE USAGE
2.7 SIMPLICITY
2.8 OPTIMALITY
2.1 PROGRAM ANALYSIS
Program analysis is concerned with what happens when a program is executed - the sequence of
actions executed and the changes in the program state that occur during a run.
There are many ways of analysing a program, for instance:
(i) verifying that it satisfies the requirements.
(ii) proving that it runs correctly without any logic errors.
(iii) determining if it is readable.
(iv) checking that modifications can be made easily, without introducing new errors.
(v) analyzing its execution time and the storage complexity associated with it, i.e.
how fast the program runs and how much storage it requires.
Another related question can be : how big must its data structure be and how many steps will be
required to execute its algorithm?
Since this course concerns data representation and writing programs, we shall analyse programs
in terms of storage and time complexity.
2.2 PERFORMANCE ISSUES
In considering the performance of a program, we are primarily interested in
(i) how fast does it run?
(ii) how much storage does it use?
Generally we need to analyze efficiency when we compare alternative algorithms and
data representations for the same problem, or when we deal with very large programs.
We often find that we can trade time efficiency for space efficiency, or vice-versa. For finding any
of these i.e. time or space efficiency, we need to have some estimate of the problem size.
2.3 GROWTH OF FUNCTIONS
Informally, an algorithm can be defined as a finite sequence of steps designed to solve a
computationally solvable problem. By this definition it is immaterial how long an algorithm
takes to solve a problem of a given size, but in practice it is impractical to choose an algorithm
that takes a very long time to find the solution. Fortunately, it is usually possible to estimate
the computation time of an algorithm. The computation time of an algorithm can be formulated in
terms of a function f(n), where n is the size of the input instance. Let me try to explain this
situation. Below is a C function isprime(int) that returns true if the integer n is prime and
false otherwise.
int isprime(int n)
{
    int divisor = 2;
    while (divisor <= n / 2)
    {
        if (n % divisor == 0)
            return 0;
        divisor++;
    }
    return 1;
}
Figure 1
The while loop in Fig. 1 makes about n/2 comparisons in the worst case, which occurs when the
input n is prime and the loop runs to completion. When n is composite the loop may exit much
earlier - for an even n, after a single comparison. The number of comparisons is thus
proportional to the size of n (alternatively, it depends upon the number of bits required to
represent the number n): f(n) = n/2 in the worst case. The bound can in fact be improved, because
a composite n must have a divisor no larger than its square root, so the loop could read
while(divisor * divisor <= n) {………}
which reduces the worst case to about √n comparisons.
Similar things can be done for different algorithms devised to solve an entirely different problem.
Many a times we are interested in finding out the facts like:
1) What is the longest time interval that it takes to complete a particular algorithm for any
random input.
2) What is the smallest time interval a given algorithm can take to solve any input instance.
Since there may be many possible input instances, we need to just estimate the running time
of the algorithm. This fact finding is called computation of order of run time of an algorithm.
2.4 ASYMPTOTIC NOTATIONS
2.4.1. BIG-O NOTATION(O)
Let f and g be functions from the set of integers or the set of real numbers to the set of real
numbers. We say that f(x) is O(g(x)) if there are constants C and k such that f(x) <= C·g(x) whenever
x > k. For example, when we say the running time T(n) of some program is O(n^2), read "big oh of n
squared" or just "oh of n squared," we mean that there are positive constants c and n0 such that
for n greater than or equal to n0, we have T(n) <= c·n^2. To illustrate the effect of this
definition, here are a few examples.
Example 1: Analyze the running time of factorial(x). Input size is the value of x. Given n as input,
factorial(n) recursively makes n+1 calls to factorial. Each call to factorial makes a few constant
time operations such as multiply, if-control, and return. Let the time taken for each call be the
constant k. Therefore, the total time taken for factorial(n) is k*(n+1). This is O(n) time. In other
words, factorial(x) has O(x) running time. In other words, factorial(x) has linear running time.
int factorial(int n)
{
    if (n == 0)
        return 1;
    else
        return (n * factorial(n - 1));
}
Figure 2
Example 2: Let f(x) = x^2 + 2x + 1 be a function in x. Now 2x <= 2x^2 and 1 <= x^2 for every x > 1.
Therefore, for every x > 1 we have x^2 + 2x + 1 <= x^2 + 2x^2 + x^2; alternatively, for every x > 1,
x^2 + 2x + 1 <= 4x^2. Comparing the situation with f(x) <= C·g(x) we get C = 4 and k = 1, so the
function f(x) = x^2 + 2x + 1 is O(x^2).
Since the Big-O Notation finds the upper limit of the completion of a function it is also called the
upper bound of a function or an algorithm.
2.4.2 BIG-OMEGA NOTATION (Ω)
Let f and g be functions from the set of integers or the set of real numbers to the set of real
numbers. We say that f(x) is Ω(g(x)) if there are constants C and k such that f(x) >= C·g(x) whenever
x > k.
In terms of limits,
lim(x→∞) g(x)/f(x) = a constant (possibly 0) iff f(x) = Ω(g(x))
Example 3: Let f(x) = 8x^3 + 5x^2 + 7 >= 8x^3 (x >= 0)
Therefore, f(x) = Ω(x^3)
2.4.3 BIG-THETA NOTATION (Θ)
For functions f and g as in the two definitions above, we say that f(x) is Θ(g(x)) if
there are constants C1, C2 and k such that
0 <= C1·g(x) <= f(x) <= C2·g(x) whenever x > k.
Since Θ(g(x)) bounds a function from both the upper and the lower side, it is also called a tight
bound for the function f(x).
Mathematically,
lim(x→∞) f(x)/g(x) = a nonzero constant. In other words, f(x) = Θ(g(x)) iff f(x) and g(x) have the
same leading term, except possibly for a constant factor.
Example 4: f(x) = 3x^2 + 8x log x is Θ(x^2)
2.5. TIME-SPACE TRADE OFFS
Over 50 years of research on algorithms for different problem areas like decision-making and
optimization, scientists have concentrated on the computational hardness of different
solution methods. The study of time-space trade-offs, i.e., formulae that relate the most
fundamental complexity measures, time and space, was initiated by Cobham, who studied
problems like recognizing the set of palindromes on Turing machines. There are two main lines of
motivation for such studies: one is the lower bound perspective, where restricting space allows
you to prove general lower bounds for decision problems; the other line is the upper bound
perspective where one attempts to find time efficient algorithms that are also space efficient (or
vice versa). Also, upper bounds are interesting for finding, in conjunction with lower bounds, the
computational complexity of fundamental problems such as sorting. So, mainly algorithms are
constrained under two resources i.e. time and space. However, there are certain problems for
which it is difficult or even impossible to find such a solution that is both time and space efficient.
In such cases, the algorithms are written for specific resource configuration and also taking in
consideration nature of input instance. Fundamentally, we measure either of the space and time
complexities as a function of the size of the input instance. However, it is also likely to depend
upon nature of the input. So, let us define the time and space complexity as below:
Time Complexity T(n): An algorithm A for a problem P is said to have time complexity T(n)† if
the number of steps required to complete its run for an input of size n is always less than or
equal to T(n).
Space Complexity S(n): An algorithm A for a problem P is said to have space complexity S(n) if
the number of bits required to complete its run for an input of size n is always less than or
equal to S(n).
If we consider both time and space requirements, the general practice is to use the quantity (time *
space) for a given algorithm. T(n)*S(n) is therefore quite handy for estimating the overall efficiency of
an algorithm in general*. The amount of space can easily be estimated for most algorithms
directly and the measurement of run-time of algorithm mathematically or manually by testing can
tell us the efficiency of the algorithm. From here we can establish, the range of this T*S term, that
we can afford. All algorithms that lie in this range are then acceptable to us.
Many times algorithms need to do certain implicit tasks that are not actually part of the run
time of the algorithm. These implicit tasks are, most of the time, one-time investments - for
example, sorting a list held in randomly accessible storage so that binary search can be used on
it. So, in order to evaluate the importance of each operation the algorithm performs on a
data structure, and the complexity of the overall algorithm, it is desirable to find
the time required to perform a sequence of operations averaged over all the operations performed.
This computation is called amortized analysis. Its usefulness shows when one has to establish
that an operation, though costly when viewed locally, minimizes the overall cost of the
algorithm and is therefore very useful.
A simple algorithm may consist of some initialization instructions and a loop. The number of
passes made through the body of the loop is a fairly good indication of the work done by such an
algorithm. Of course, the amount of work done in one pass through a loop may be much more
than the amount done in another pass, and one algorithm may have longer loop bodies than
another algorithm, but we are narrowing in on a good measure of work. Though some loops may
have, say, five steps and some nine, for large inputs the number of passes through the loops will
generally be large compared to the loop sizes. Thus counting the passes through all the loops in
the algorithm is a good idea.
In many cases, to analyze an algorithm we can isolate a particular operation fundamental to the
problem under study (or to the types of algorithms being considered), ignore initialization, loop
control, and other bookkeeping, and just count the chosen, or basic, operations performed by the
algorithm. For many algorithms, exactly one of these operations is performed on each pass
through the main loops of the algorithm, so this measure is similar to the one described in the
previous paragraph.
† The function T(n) can be compared with another function f(n) to find the order of T(n), and hence the
bounds of the algorithm can be evaluated.
* Generalization of efficiency is required because efficiency also depends on the nature of the input. For
example, an algorithm may be efficient for one sequence of elements to be sorted and not for another, the
size being the same.
Here are some examples of reasonable choices of basic operations for several problems:

Problem                                        Operation
Find x in an array of names                    Comparison of x with an entry in the array
Multiply two matrices with real entries        Multiplication of two real numbers
                                               (or multiplication and addition of real numbers)
Sort an array of numbers                       Comparison of two array entries
Traverse a binary tree                         Traversing an edge
Any non-iterative procedure,                   Procedure invocation
including recursive
So long as the basic operations are chosen well and the total number of operations performed is
roughly proportional to the number of basic operations, we have a good measure of the work done
by an algorithm and a good criterion for comparing several algorithms. This is the measure we
use in this chapter and in several other chapters in this book. You may not yet be entirely
convinced that this is a good choice; we will add more justification for it in the next section. For
now, we simply make a few points.
First, in some situations, we may be intrinsically interested in the basic operation: It might be a
very expensive operation compared to the others, or it might be of some theoretical interest.
Second, we are often interested in the rate of growth of the time required for the algorithm, as the
inputs get larger. So long as the total number of operations is roughly proportional to the basic
operations, just counting the latter can give us a pretty clear idea of how feasible it is to use the
algorithm on large inputs.
Finally, this choice of the measure of work allows a great deal of flexibility. Though we will often
try to choose one, or at most two, specific operations to count, we could include some overhead
operations, and, in the extreme, we could choose as the basic operations the set of machine
instructions for a particular computer. At the other extreme, we could consider “one pass
through a loop” as the basic operation. Thus by varying the choice of basic operations, we can
vary the degree of precision and abstraction in our analysis to fit our needs.
What if we choose a basic operation for a problem and then find that the total number of
operations performed by an algorithm is not proportional to the number of basic operations?
What if it is substantially higher? In the extreme case, we might choose a basic operation for a
certain problem and then discover that some algorithms for the problem use such different
methods that they do not do any of the operations we are counting. In such a situation, we have
two choices. We could abandon our focus on the particular operation and revert to counting
passes through loops. Or, if we are especially interested in the particular operation chosen, we
could restrict our study to a particular class of algorithms, one for which the chosen operation is
appropriate. Algorithms that use other techniques for which a different choice of basic operation
is appropriate could be studied separately. A class of algorithms for a problem is usually defined
by specifying the operations that may be performed on the data. (The degree of formality of the
specifications will vary; usually informal descriptions will suffice in this book).
Throughout this section, we have often used the phrase “the amount of work done by an
algorithm.” It could be replaced by the term “the complexity of an algorithm.” Complexity means
the amount of work done, measured by some specified complexity measure, which in many of our
examples is the number of specified basic operations performed. Note that, in this sense,
complexity has nothing to do with how complicated or tricky an algorithm is; a very complicated
algorithm may have low complexity. We will use the terms "complexity," "amount of work done,"
and "number of basic operations done" almost interchangeably in this book.
Average and Worst-Case Analysis
Now that we have a general approach to analyzing the amount of work done by an algorithm, we
need a way to present the results of the analysis concisely. A single number cannot describe the
amount of work done because the number of steps performed is not the same for all inputs. We
observe first that the amount of work done usually depends on the size of the input. For example,
alphabetizing an array of 1000 names usually requires more operations than alphabetizing an
array of 100 names, using the same algorithm. Solving a system of 12 linear equations in 12
unknowns generally takes more work than solving a system of 2 linear equations in 2 unknowns.
We observe, secondly, that even if we consider inputs of only one size, the number of operations
performed by an algorithm may depend on the particular input. An algorithm for alphabetizing
an array of names may do very little work if only a few of the names are out of order, but it may
have to do much more work on an array that is very scrambled. Solving a system of 12 linear
equations may not require much work if most of the coefficients are zero.
The first observation indicates that we need a measure of the size of the input for a problem. It is
usually easy to choose a reasonable measure of size. Here are some examples:
Problem                                Size of input
Find x in an array of names            The number of names in the array
Multiply two matrices                  The dimensions of the matrices
Sort an array of numbers               The number of entries in the array
Traverse a binary tree                 The number of nodes in the tree
Solve a system of linear equations     The number of equations, or the number of unknowns, or both
Solve a problem concerning a graph     The number of nodes in the graph, or the number of edges, or both

Even if the input size is fixed at, say, n, the number of operations performed may depend on the
particular input.
How, then, are the results of the analysis of an algorithm to be expressed? Most often we describe
a behavior of an algorithm by stating its worst-case complexity.
Worst-case complexity
Let Dn be the set of inputs of size n for the problem under consideration, and let I be an element
of Dn. Let t(I) be the number of basic operations performed by the algorithm on input I. We define
the function W by
W(n) = max{ t(I) | I ∈ Dn }
The function W(n) is called the worst-case complexity of the algorithm. W(n) is the maximum
number of basic operations performed by the algorithm on any input of size n.
It is often not very difficult to compute W(n). The worst-case complexity is valuable because it
gives an Upper bound on the work done by the algorithm. The worst-case analysis could be used
to help form an algorithm. We will do worst-case analysis for most of the algorithms presented in
this book. Unless otherwise stated, whenever we refer to the amount of work done by an
algorithm, we mean the amount of work done in the worst case.
It may seem that a more useful and natural way to describe the behavior of an algorithm is to tell
how much work it does on the average; that is, to compute the number of operations performed
for each input of size n and then take the average. In practice some inputs might occur much
more frequently than others so a weighted average is more meaningful.
Average complexity
Let Pr(I) be the probability that input I occurs. Then the average behavior of the algorithm is
defined as
A(n) = Σ(I ∈ Dn) Pr(I)·t(I)
We determine t(I) by analyzing the algorithm, but Pr(I) cannot be computed analytically. The
function Pr(I) is determined from experience and/or special information about the application for
which the algorithm is to be used, or making some simplifying assumption (e.g., that all inputs of
size n are equally likely to occur). If Pr(I) is complicated, the computation of average behavior is
difficult. Also, of course, if Pr(I) depends on a particular application of the algorithm, the function
A describes the average behavior of the algorithm for only that application. The following examples
illustrate worst-case and average analysis.
Example
Problem: Let E be an array containing n entries (called keys), E[0],…..,E[n-1], in no particular
order. Find an index of a specified key K, if K is in the array; return -1 as the answer if K is not
in the array.
Strategy: Compare K to each entry in turn until a match is found or the array is exhausted. If K
is not in the array, the algorithm returns – 1 as its answer.
There is a large class of procedures similar to this one, and we call these procedures generalized
searching routines. Often they occur as subroutines of more complex procedures.
Generalized searching routine
A generalized searching routine is a procedure that processes an indefinite amount of data until it
either exhausts the data or achieves its goal. It follows this high-level outline:
If there is no more data to examine:
    Fail.
Else:
    Examine one datum.
    If this datum is what we want:
        Succeed.
    Else:
        Keep searching in the remaining data.
The scheme is called generalized searching because the routine often performs some other simple
operations as it searches, such as moving data elements, adding to or deleting from a data
structure, and so on.
Sequential Search, Unordered
Input: E, n, K, where E is an array with n entries (indexed 0,….n-1), and K is the item sought.
For simplicity, we assume that K and the entries of E are integers, as is n.
Output: Returns 'ans', the location of K in E (-1 if K is not found).

int seqSearch(int E[], int n, int K)
1.    int ans, index;
2.    ans = -1;             // Assume failure.
3.    for (index = 0; index < n; index++)
4.        if (K == E[index])
5.            ans = index;  // Success!
6.            break;        // Take the rest of the afternoon off.
      // Continue loop.
7.    return ans;
Basic Operation: Comparison of K with an array entry.
Worst-Case Analysis: Clearly W(n)=n. The worst cases occur when K appears only in the last
position in the array and when K is not in the array at all. In both of these cases K is compared
to all n entries.
Average Behavior Analysis: We will make several simplifying assumptions - first, to do an easy
example; then we do a slightly more complicated analysis with different assumptions. We assume
that the elements in the array are distinct and that if K is in the array, then it is equally likely to
be in any particular position.
For our first case, we assume that K is in the array and we denote this event by “succ,” in
accordance with the terminology of probabilities. The inputs can be categorized according to
where in the array K appears, so there are n inputs to consider. For 0 <= i < n, let Ii represent the
event that K appears in the ith position in the array, and let t(I) be the number of comparisons
done (the number of times the condition in line 4 is tested) by the algorithm on input I. Clearly,
for 0 <= i < n, t(Ii) = i + 1. Thus

            n-1                           n-1
Asucc(n) =  Σ  Pr(Ii | succ) t(Ii)   =    Σ  (1/n)(i + 1)   =   (1/n) · n(n+1)/2   =   (n+1)/2
            i=0                           i=0
The subscript “succ” denotes that we are assuming a successful search in this computation. The
result should satisfy our intuition that on the average, about half the array will be searched. Now,
let us consider the event that K is not in the array at all, which we call “fail”. There is only one
input for this case, which we call I fail. The number of comparisons in this case is t(Ifail)=n, so
Afail=n. Finally, we combine the cases in which K is in the array and is not in the array. Let q be
the probability that K is in the array. By the law of conditional expectations
A(n)=Pr(succ)Asucc(n)+Pr(fail)Afail(n)
= q((n+1)/2) + (1-q)n = n(1 - q/2) + q/2.
If q = 1, that is, if K is always in the array, then A(n) = (n+1)/2, as before. If q = 1/2, that is, if there is
a 50-50 chance that K is not in the array, then A(n) = 3n/4 + 1/4; roughly three-fourths of the
entries are examined.
Example
This example illustrates how we should interpret Dn, the set of inputs of size n. Rather than
consider all possible arrays of names, numbers, or whatever, that could occur as inputs, we
identify the properties of the inputs that affect the behavior of the algorithm; in this case,
whether K is in the array at all and, if so, where it appears. An element I in Dn may be thought
of as a set (or equivalence class) of all arrays and values for K such that K occurs in the
specified place in the array (or not at all). Then t(I) is the number of operations done for any
one of the inputs in I.
Observe also that the input for which an algorithm behaves worst depends on the particular
algorithm, not on the problem. For our algorithm, a worst case occurs when the only position in
the array containing K is the last. For an algorithm that searched the array backwards (i.e.,
beginning with index = n-1), a worst case would occur if K appeared only in position 0. (Another
worst case would again be when K is not in the array at all.)
Example
This example also illustrates an assumption we often make when doing average analysis of
sorting and searching algorithms: that the elements are distinct. The average analysis for the
case of distinct elements gives a fair approximation of the average behavior in cases with few
duplicates. If there might be many duplicates, it is harder to make reasonable assumptions
about the probability that K's first appearance in the array occurs at any particular position.
2.6 SPACE USAGE
The number of memory cells used by a program, like the number of seconds required to execute a
program, depends on the particular implementation. However, some conclusions about space
usage can be drawn just by examining an algorithm. A program will require storage space for the
instructions, the constants and variables used by the program, and the input data. It may also
use some workspace for manipulating the data and storing information needed to carry out its
computations. The input data itself may be representable in several forms, some of which
require more space than others.
If the input data have one natural form (say, an array of numbers or a matrix), then we analyze
the amount of extra space used, aside from the program and the input. If the amount of extra
space is constant with respect to the input size, the algorithm is said to work in place. This term
is used especially in reference to sorting algorithms. (A relaxed definition of in place is often used
when the extra space is not constant, but is only a logarithmic function of the input size, because
the log function grows so slowly; we will clarify any cases in which we use the relaxed definition).
If the input can be represented in various forms, then we will consider the space required for the
input itself as well as any extra space used. In general, we will refer to the number of “cells” used
without precisely defining cells. You may think of a cell as being large enough to hold one
number or one object. If the amount of space used depends on the particular input, worst-case
and average-case analysis can be done.
2.7 SIMPLICITY
It is often, though not always, the case that the simplest and most straightforward way to solve a
problem is not the most efficient. Yet simplicity in an algorithm is a desirable feature. It may
make verifying the correctness of the algorithm easier, and it makes writing, debugging, and
modifying a program easier. The time needed to produce a debugged program should be
considered when choosing an algorithm, but if the program is to be used very often, its
efficiency will probably be the determining factor in the choice.
2.8 OPTIMALITY
No matter how clever we are, we can't improve an algorithm for a problem beyond a certain point.
Each problem has inherent complexity; that is, there is some minimum amount of work required
to solve it. To analyze the complexity of a problem, as opposed to that of a specific algorithm, we
choose a class of algorithms (often by specifying the types of operations the algorithms will be
permitted to perform) and a measure of complexity, for example, the basic operation(s) to be
counted. Then we may ask how many operations are actually needed to solve the problem. We
say that an algorithm is optimal (in the worst case) if there is no algorithm in the class under
study that performs fewer basic operations (in the worst case). Note that when we speak of
algorithms in the class under study, we don't mean only those algorithms that people have
thought of. We mean all possible algorithms, including those not yet discovered. "Optimal"
doesn't mean "the best known"; it means "the best possible."
SUMMARY:
- If the problem size doubles and the algorithm takes one more step, we relate the number of
steps to the problem size by O(log2 N), read as "order of log2 N".
- If the problem size doubles and the algorithm takes twice as many steps, the number of steps
is related to the problem size by O(N), i.e. order of N: the number of steps is directly
proportional to N.
- If the problem size doubles and the algorithm takes more than twice as many steps, i.e. the
number of steps grows faster than the problem size, we use the expression O(N log2 N). You may
notice that this growth rate is more than double the growth rate of the problem size, but it is
not a lot faster.
- If the number of steps used is proportional to the square of the problem size, we say the
complexity is of the order of N^2, or O(N^2).
- If the algorithm is independent of problem size, the complexity is constant in time and space,
i.e. O(1).
The notation being used, i.e. a capital O(), is called Big-Oh notation.
*************************************************************************************************************
UNIT 3
INTRODUCTION TO DATA AND FILE STRUCTURE
3.1 INTRODUCTION
3.2 PRIMITIVE AND SIMPLE STRUCTURES
3.3 LINEAR AND NONLINEAR STRUCTURES
3.4 FILE ORGANIZATIONS
3.1 INTRODUCTION
Data Structures are very important in computer systems. In a program, every variable is of some
explicitly or implicitly defined data structure, which determines the set of operations that are legal
upon that variable. Knowledge of data structures is required for people who design and develop
computer programs of any kind : systems software or applications software. The data structures
that are discussed here are termed logical data structure. There may be several different physical
organizations on storage possible for each logical data structure. The logical or mathematical
model of a particular organization of data is called a data structure.Often the different data values
are related to each other. To enable programs to make use of these relationships, these data
values must be in an organised form. The organised collection of data is called a data structure.
The programs have to follow certain rules to access and process the structured data. We may
therefore say that:
Data Structure = Organised Data + Allowed Operations.
If you recall, this is an extension of the concept of data type. We had defined a data type as:
Data Type = Permitted Data Values + Operations
The choice of a particular data model depends on two considerations:
- To identify and develop useful mathematical entities and operations.
- To determine representations for those entities and to implement operations on these
concrete representations. The representation should be simple enough that one can
effectively process the data when necessary.
Data Structures can be categorized according to figure 1.1.

Data Structure
|-- Primitive Data Structure: integer, float, character
|-- Simple Data Structure: String, Array, Record, Sets
|-- Compound Data Structure
|     |-- Linear: Linked List, Stack, Queue
|     |-- Non Linear
|           |-- Binary: Binary Tree, Binary Search Tree
|           |-- N-ary: Graph, General Tree, B-Tree, B+Tree
|-- File Structure / File Organization: Sequential, Indexed Sequential

Figure 1.1: Characterisation of Data Structure
3.2 PRIMITIVE AND SIMPLE STRUCTURES
Primitive data structures are those that are not composed of other data structures. We will
briefly consider examples of three primitives: integers, Booleans, and characters. Other data
structures can be constructed from one or more primitives. The simple data structures built
from primitives are strings, arrays, sets, and records (or structures in some programming
languages). Many programming languages support these data structures. In other words,
primitive structures are the data structures that can be manipulated directly by machine
instructions. The integer, real, character etc. are some of the primitive data structures. In C,
the different primitive data structures are int, float, char and double.
Example of Primitive Data Structures
The primitive data structures are also known as data types in a computer language. Examples of
the different data items are:
integer:
10, 20, 5, -15, etc., i.e., a subset of the integers. In C an integer is declared as:
int x;
Each integer occupies 2 bytes of memory (on the 16-bit compilers assumed in this text).
float:
6.2, 7.215162, 62.5, etc., i.e., a subset of the real numbers. In C a float variable
is declared as:
float y;
Each float occupies 4 bytes of memory.
character:
Any character enclosed within single quotes is treated as character data, e.g., 'a',
'1', '?', '*'. Characters are declared as:
char c;
Each character occupies 1 byte of memory.
Example of Simple Data Structures
The simple data types are composed of primitive data structures.
Array:
An array is a collection of data items of the same type under one name, e.g.,
int x[20];
declares a collection of 20 integers under the name x.
Records:
Records are known as structures in C/C++. They are collections of different data items
under the same name. For example, in C the declaration
struct student {
char name[30];
char fname[30];
int roll;
char class[5];
} y;
defines a structure describing the record of a student. Obviously this structure contains different
types of data items. The size of a structure depends on its constituents. In this
example the size of the structure is 30 + 30 + 2 + 5 = 67 bytes (ignoring any padding the compiler may add).
3.3 LINEAR AND NONLINEAR STRUCTURES
Simple data structures can be combined in various ways to form more complex structures. The
two fundamental kinds of complex data structures are linear and nonlinear, depending on
the complexity of the logical relationships they represent. The linear data structures that we will
discuss include stacks, queues, and linear lists. The nonlinear data structures include trees and
graphs. We will find that there are many types of tree structures that are useful in information
systems. In other words, these data structures cannot be manipulated directly by machine
instructions. Arrays, linked lists, trees, etc., are some non-primitive data structures. These data
structures can be further classified into 'linear' and 'non-linear' data structures.
The data structures that show a relationship of logical adjacency between the elements
are called linear data structures; otherwise they are called non-linear data structures.
The linear data structures include the stack, the queue, and linear linked lists such as the
singly linked list and the doubly linked list.
Trees, graphs, etc., are non-linear data structures.
3.4 FILE ORGANIZATIONS
The data structuring techniques applied to collections of data that are managed as “black boxes”
by operating systems are commonly called file organizations. A file carries a name, contents, a
location where it is kept, and some administrative information, for example, who owns it and how
big it is. The four basic kinds of file organization that we will discuss are sequential, relative,
indexed sequential, and multi-key file organizations. These organizations determine how the
contents of files are structured. They are built on the data structuring techniques.
************************************************************************************************************
UNIT 4 ARRAYS
4.1 INTRODUCTION
4.1.1 SEQUENTIAL ALLOCATION
4.1.2 MULTIDIMENSIONAL ARRAYS
4.2 ADDRESS CALCULATIONS
4.3 GENERAL MULTIDIMENSIONAL ARRAYS
4.4 SPARSE ARRAYS
4.1 INTRODUCTION
The simplest type of data structure is a linear Array.
A linear array is a list of a finite number n of homogeneous data elements (i.e., data
elements of the same type) such that:
(a) the elements of the array are referenced by an index set consisting of n consecutive numbers;
(b) the elements of the array are stored in successive memory locations.
The number n of elements is called the length or size of the array. If not explicitly stated, we will
assume the index set consists of the integers 0, 1, ..., n-1. In general, the length or the number of
data elements of the array can be obtained from the index set by the formula
Length = UB - LB + 1    (2.1)
where UB is the largest index, called the upper bound, and LB is the smallest index, called the
lower bound, of the array. Note that
length = UB when LB = 1
The elements of an array A may be denoted by the subscript notation
A0, A1, A2, A3, …., An - 1
or by the parentheses notation(used in FORTRAN,PL/1, and BASIC)
A(0), A(1), A(2), ….., A(n-1)
or by the bracket notation(used in Pascal)
A[0], A[1], A[2], A[3], ….., A[n-1]
We will usually use the subscript notation or the bracket notation. Regardless of the notation,
the number i in A[i] is called a subscript or an index, and A[i] is called a subscripted variable.
Note that subscripts allow any element of A to be referenced by its relative position in A.
Example 1
(a)
Let DATA be a 6-element linear array of integers such that
DATA[0]=247, DATA[1]=56, DATA[2]=429, DATA[3]=135, DATA[4]=87, DATA[5]=156
Sometimes we denote such an array by simply writing
DATA: 247, 56, 429, 135, 87, 156
The array DATA is frequently pictured as a vertical column of cells (Fig. 2.1(a)) or a horizontal
row of cells (Fig. 2.1(b)), each cell labelled with its index 0 to 5 and holding the values
247, 56, 429, 135, 87, 156 in order.
(b)
An automobile company uses an array AUTO to record the number of automobiles sold each
year from 1932 through 1984. Rather than beginning the index set with 1, it is more useful to
begin the index set with 1932, so that
AUTO[i] = number of automobiles sold in the year i
Then LB = 1932 is the lower bound and UB = 1984 is the upper bound of AUTO.
By Eq. (2.1),
Length = UB - LB + 1 = 1984 - 1932 + 1 = 53
That is, AUTO contains 53 elements and its index set consists of all integers from 1932
through 1984.
Each programming language has its own rules for declaring arrays. Each such declaration
must give, implicitly or explicitly, three items of information:(1) the name of the array,
(2) the data type of the array and (3) the index set of the array.
Example 2
Suppose DATA is a 6-element linear array containing real values. Various programming languages
declare such an array as follows:
FORTRAN:
REAL DATA(6)
PL/1:
DECLARE DATA(6) FLOAT;
Pascal:
VAR DATA : ARRAY[1..6] OF REAL
C language:
float DATA[6];
We will declare such an array, when necessary, by writing DATA[6]. (The context will usually
indicate the data type, as it will not be explicitly declared.)
Some programming languages (e.g. FORTRAN and Pascal) allocate memory space for arrays
statically, i.e., during program compilation; hence the size of the array is fixed during the program
execution. On the other hand, some programming languages allow one to read an integer n and
then declare an array with n elements; such programming languages are said to allocate memory
dynamically.
4.1.1 SEQUENTIAL ALLOCATION
Let LA be a linear array in the memory of the computer. Recall that the memory of the computer is
simply a sequence of addressed locations, as pictured in Fig. 2.2. Let us use the notation
ADD(LA[i]) = address of the element LA[i] of the array LA
As previously noted, the elements of LA are stored in successive memory cells.
Accordingly, the computer does not need to keep track of the address of every element of LA, but
needs to keep track only of the address of the first element of LA, denoted by
Base(LA)
and called the base address of LA. Using this address Base(LA), the computer calculates the
address of any element of LA by the following formula:
ADD(LA[i]) = Base(LA) + w(i - lower bound)    (2.2)
where w is the number of words per memory cell for the array LA. Observe that the time to
calculate ADD(LA[i]) is essentially the same for any value of i. Furthermore, given any subscript i,
one can locate and access the content of LA[i] without scanning any other element of LA.
In particular, C uses the following formula to calculate the address of an element,
ADD(LA[i]) = Base(LA) + w*i    (2.3)
since the lower bound is zero by default.
Figure 2.2: Memory as a sequence of addressed locations (1000, 1001, 1002, 1003, 1004, ...)
Example 3
Consider the array AUTO in example 1(b), which records the number of automobiles sold each
year from 1932 through 1984. Suppose AUTO appears in memory as pictured in Fig 2.3. That is,
Base(AUTO) =200, and w=4 words per memory cell for AUTO. Then
ADD (AUTO[1932])=200, ADD (AUTO[1933])=204, ADD (AUTO[1934])=208,….
The address of any array element for the year K=1965 can be obtained by using Eq(2.2):
ADD (AUTO[1965])=Base(AUTO)+w(1965-lower bound)=200+4(1965-1932)=332
Again we emphasize that the contents of this element can be obtained without scanning any other
element in array AUTO.
Figure 2.3: The array AUTO in memory. AUTO[1932] occupies words 200-203, AUTO[1933]
occupies words 204-207, AUTO[1934] occupies words 208-211, and so on.
4.1.2 MULTIDIMENSIONAL ARRAYS
Most programming languages allow two-dimensional and three-dimensional arrays, i.e., arrays
whose elements are referenced, respectively, by two or three subscripts. In fact, some
programming languages allow the number of dimensions for an array to be as high as 7. This
section discusses these multidimensional arrays.
Two-Dimensional Arrays
A two-dimensional m × n array A is a collection of m * n data elements such that each element is
specified by a pair of integers (such as i, j), called subscripts, with the property that
0 ≤ i ≤ m-1 and 0 ≤ j ≤ n-1
The element of A with first subscript i and second subscript j will be denoted by
Aij or A[i, j]
Two-dimensional arrays are called matrices in mathematics and tables in business
applications; hence two-dimensional arrays are sometimes called matrix arrays.
There is a standard way of drawing a two-dimensional m × n array A, in which the elements of A form
a rectangular array with m rows and n columns, and the element A[i, j] appears in row i and
column j. (A row is a horizontal list of elements, and a column is a vertical list of elements.) Figure
2.4 shows the case where A has 3 rows and 4 columns. We emphasize that each row contains
those elements with the same first subscript, and each column contains those elements with the
same second subscript.
             Columns
           0        1        2        3
Rows  0   A[0,0]   A[0,1]   A[0,2]   A[0,3]
      1   A[1,0]   A[1,1]   A[1,2]   A[1,3]
      2   A[2,0]   A[2,1]   A[2,2]   A[2,3]
Figure 2.4
Example 4
Suppose each student in a class of 25 students is given 4 tests. Assuming the students are
numbered from 0 to 24, the test scores can be assigned to a 25 * 4 matrix array SCORE, as pictured
in Fig. 2.5. Thus SCORE[i,j] contains the i-th student's score on the j-th test. In particular, row 2
of the array,
SCORE[2,0], SCORE[2,1], SCORE[2,2], SCORE[2,3]
contains the four test scores of student 2.
Student   Test 1   Test 2   Test 3   Test 4
  0         84       73       88       81
  1         95      100       88       96
  2         72       66       77       72
  .          .        .        .        .
  .          .        .        .        .
 24         78       70       70       85
Figure 2.5: Array SCORE
Suppose A is a two-dimensional m × n array. The first dimension of A contains the index set
0, ..., m-1, with lower bound 0 and upper bound m-1; and the second dimension of A contains
the index set 0, 1, 2, ..., n-1, with lower bound 0 and upper bound n-1. The length of a dimension
is the number of integers in its index set. The pair of lengths m × n (read "m by n") is called the size
of the array.
4.2 ADDRESS CALCULATIONS
Let A be a two-dimensional m × n array. Although A is pictured as a rectangular array of elements
with m rows and n columns, the array will be represented in memory by a block of m*n sequential
memory locations. Specifically, the programming language will store the array A either (1) column
by column, in what is called column-major order, or (2) row by row, in row-major order. Figure 2.6
shows these two ways when A is a two-dimensional 2*3 array. We emphasize that the particular
representation used depends upon the programming language, not the user. Pascal, C and C++
use row-major ordering, while FORTRAN uses column-major ordering.
Column-major order stores the elements of the 2*3 array in the subscript sequence
(0,0), (1,0), (0,1), (1,1), (0,2), (1,2)
(column 1, then column 2, then column 3), while row-major order stores them in the sequence
(0,0), (0,1), (0,2), (1,0), (1,1), (1,2)
(row 1, then row 2).
Figure 2.6: (a) Column-major ordering; (b) Row-major ordering
Recall that, for a linear array LA, the computer does not keep track of the address ADD(LA[i]) of
every element LA[i] of LA, but does keep track of Base(LA), the address of the first element of
LA. The computer uses the formula
ADD(LA[i]) = Base(LA) + w(i - l)
to find the address of LA[i] in time independent of i. (Here w is the number of words per
memory cell for the array LA, and l is the lower bound of the index set of LA.)
A similar situation holds for any two-dimensional m*n array A. That is, the computer
keeps track of Base(A), the address of the first element A[0,0] of A, and computes the
address ADD(A[i, j]) of A[i, j] using the formulas
(Column-major order)  ADD(A[i,j]) = Base(A) + w[m(j - l2) + (i - l1)]    (2.4)
(Row-major order)     ADD(A[i,j]) = Base(A) + w[n(i - l1) + (j - l2)]    (2.5)
Again, w denotes the number of words per memory location for the array A. Note that the
formulas are linear in i and j.
C uses row-major ordering with the formula
ADD(A[i, j]) = Base(A) + w(n * i + j)    (2.6)
Example 5
Consider the 25 * 4 matrix array SCORE in Example 4. Suppose Base(SCORE) = 200 and there
are w = 4 words per memory cell. Furthermore, suppose the programming language stores
two-dimensional arrays using row-major order. Then the address of SCORE[12,3], the score of
student 12 on test 3, follows (with l1 = l2 = 1):
ADD(SCORE[12,3]) = 200 + 4[4(12-1) + (3-1)] = 200 + 4[46] = 384
Observe that we have simply used Eq. (2.5).
Multidimensional arrays clearly illustrate the difference between the logical and the physical views of
data. Figure 2.4 shows how one logically views a 3*4 matrix array A, that is, as a rectangular
array of data where A[i, j] appears in row i and column j. On the other hand, the data will be
physically stored in memory as a linear collection of memory cells. This situation will recur
throughout the text; i.e., certain data may be viewed logically as trees or graphs although
physically the data will be stored linearly in memory cells.
4.3 GENERAL MULTIDIMENSIONAL ARRAYS
General multidimensional arrays are defined analogously. More specifically, an n-dimensional
m0 × m1 × m2 × ... × mn-1 array B is a collection of m0*m1*m2*...*mn-1 data elements in which
each element is specified by a list of integers, such as k0, k1, ..., kn-1, called subscripts, with the
property that
0 ≤ k0 ≤ m0-1, 0 ≤ k1 ≤ m1-1, ..., 0 ≤ kn-1 ≤ mn-1-1
The element of B with subscripts k0, k1, ..., kn-1 will be denoted by
B[k0, k1, ..., kn-1]
The array will be stored in memory in a sequence of memory locations.
Specifically, the programming language will store the array B either in row-major order or in
column-major order. By row-major order, we mean that the elements are listed so that the
subscripts vary like an automobile odometer, i.e., so that the last subscript varies first (most
rapidly), the next-to-last subscript varies second (less rapidly), and so on. By column-major order,
we mean that the first subscript varies first (most rapidly), and so on.
4.4 SPARSE ARRAYS
Matrices with a relatively high proportion of zero entries are called sparse matrices. Sparse
arrays are special arrays that arise commonly in applications. It is difficult to draw the dividing
line between sparse and non-sparse arrays; loosely, an array is called sparse if it has a
relatively large number of zero elements. For example, in Figure 8, out of 49 elements only 6 are
non-zero. This is a sparse array.
      0  1  2  3  4  5  6
  0   0  0  0  0  0  1  0
  1   0  0  0  0  1  0  0
  2   0  0  0  0  0  0  0
  3   0  0  0  0  3  0  0
  4   0  2  0  0  0  0  0
  5   0  0  0  0  4  0  0
  6   0  0  0  0  2  0  0
Figure 8: A Sparse Array
If we stored this array using the techniques presented in the previous section, there would be
much wasted space.
Let us consider two alternative representations that store explicitly only the non-zero elements:
1. Vector representation
2. Linked list representation
We shall discuss only the first representation here in this unit. The linked list representation
shall be discussed in a later unit.
Each element of a 2-dimensional array is uniquely characterized by its row and column position;
we may, therefore, store a sparse array in another array of the form
A(0..n, 1..3)
where n is the number of non-zero elements.
The sparse array given in Figure 8 may be stored in the array A(0..6, 1..3) as shown in Figure 9.
A     1  2  3
  0   7  7  6
  1   0  5  1
  2   1  4  1
  3   3  4  3
  4   4  1  2
  5   5  4  4
  6   6  4  2
Figure 9: Sparse Array Representation
The elements A(0,1) and A(0,2) contain the numbers of rows and columns of the sparse array, and
A(0,3) contains the number of non-zero elements of the sparse array. The first and second elements
of each of the remaining rows store the row and column numbers of a non-zero term, and the third
element stores the value of that non-zero term. In other words, each non-zero element in a
2-dimensional sparse array is represented as a triplet with the format (row subscript, column
subscript, value).
If the sparse array were one-dimensional, each non-zero element would be represented by a pair.
In general, for an N-dimensional sparse array, non-zero elements are represented by an entry with
N+1 values.
The following C program illustrates how to store a sparse matrix in triplet form:
/* SPARSE MATRIX */
#include <stdio.h>
#include <stdlib.h>
int main(void)
{
    int a[3][3], nz = 0, i, j, tp[5][3], r;
    for (i = 0; i < 3; i++)
    {
        for (j = 0; j < 3; j++)
        {
            printf("enter any value: ");
            scanf("%d", &a[i][j]);
            if (a[i][j] != 0)       /* count the non-zero elements */
                nz++;
        }
    }
    if (nz > 4)                     /* more than half the entries are non-zero */
    {
        printf("\nnot a sparse matrix");
        exit(0);
    }
    tp[0][0] = 3;                   /* header row: rows, columns, non-zero count */
    tp[0][1] = 3;
    tp[0][2] = nz;
    r = 1;
    for (i = 0; i < 3; i++)
    {
        for (j = 0; j < 3; j++)
        {
            if (a[i][j] != 0)       /* store each non-zero term as (row, col, value) */
            {
                tp[r][0] = i;
                tp[r][1] = j;
                tp[r][2] = a[i][j];
                r++;
            }
        }
    }
    for (i = 0; i <= nz; i++)       /* print the header row and the nz triplets */
    {
        for (j = 0; j < 3; j++)
            printf("\t%d", tp[i][j]);
        printf("\n");
    }
    return 0;
}
Summary
- A data structure which displays the relationship of adjacency between elements is said to be
  "linear".
- Length finding, traversing from left to right, retrieval of any element, storing any element and
  deleting any element are the main operations which can be performed on any linear data
  structure.
- Arrays are one of the linear data structures.
- Single-dimension as well as multidimensional arrays are represented in memory as a
  one-dimensional array.
- Elements of any multidimensional array can be stored in two forms: row major and
  column major.
- A matrix which has many zero entries is called a sparse matrix.
- There are several alternative representations of a sparse matrix using linear data
  structures such as arrays and linked lists.
- We can represent a sparse matrix through a single-dimension array if it conforms to a
  regular pattern.
- We can represent any sparse matrix through the 3-tuple method, in which a 2-dimensional
  array is used with 3 fields (columns): row number, column number, and value of the
  non-zero term.
- A sparse matrix can also be represented through a singly linked list in which each node
  contains the row number, column number, value and a link to the next non-zero term.
UNIT 5 STRINGS
5.1 INTRODUCTION
5.2 STRING FUNCTIONS
5.3 STRING LENGTH
5.3.1 USING ARRAY
5.3.2 USING POINTERS
5.4 STRING COPY
5.4.1 USING ARRAY
5.4.2 USING POINTERS
5.5 STRING COMPARE
5.5.1 USING ARRAY
5.6 STRING CONCATENATION
5.1 INTRODUCTION
A string is an array of characters. Strings can contain any ASCII character and are useful in many
operations. A character occupies a single byte, so a string of N characters requires N bytes of
memory (plus one for the terminator). Since strings do not use bounding indexes, it is important
to mark their end. When the user presses the Enter key, the input routine treats it as the end of
the string: it puts a special character '\0' (NULL) at the end, which serves as the end-of-string
marker from then on. When the function scanf() is used for reading a string, it stops at the first
whitespace character. Hence, if a string must contain a space, we should use the function gets().
5.2 STRING FUNCTIONS
Let us first consider the functions, which are required for general string operations. The string
functions are available in the header file “string.h”. We can also write these ourselves to
understand their working. We can write these functions using
(i) arrays of characters, and
(ii) pointers.
5.3 STRING LENGTH
The length of a string is the number of characters in the string, including spaces and
all ASCII characters. As the array index starts at zero, the position occupied by '\0'
equals the length of that string. Let us write these functions in the two different ways
mentioned earlier.
5.3.1 USING ARRAY
int strlen1(char s[])
{
    int i = 0;
    while (s[i] != '\0')
        i++;
    return i;
}
Here we increment the position till we reach the end of the string. The counter contains the size
of the string.
5.3.2 USING POINTERS
int strlen1(char *s)
{
    char *p;
    p = s;
    while (*s != '\0')
        s++;
    return s - p;
}
The function is called in the same manner as earlier but in the function we accept the start
address in s. This address is copied to p. The variable s is incremented till we get end of string.
The difference in the last and first address will be the length of the string.
5.4 STRING COPY: Copy s2 to s1
In this function we have to copy the contents of one string into another string.
5.4.1 USING ARRAYS
void strcopy(char s1[], char s2[])
{
    int i = 0;
    while (s2[i] != '\0')
    {
        s1[i] = s2[i];
        i++;
    }
    s1[i] = '\0';
}
While the i-th character is not '\0', copy the character; then put a '\0' at the end of the new string.
5.4.2 USING POINTERS
void strcopy(char *s1, char *s2)
{
    while (*s2)
    {
        *s1 = *s2;
        s1++;
        s2++;
    }
    *s1 = *s2;   /* copy the terminating '\0' */
}
5.5 STRING COMPARE
5.5.1 USING ARRAYS
int strcomp(char s1[], char s2[])
{
    int i = 0;
    while (s1[i] != '\0' && s2[i] != '\0')
    {
        if (s1[i] != s2[i])
            break;
        else
            i++;
    }
    return s1[i] - s2[i];
}
The function returns zero if the two strings are equal. When the first string is less than
the second, it returns a negative value; otherwise, a positive value.
The reader can write the same function using pointers.
5.6 STRING CONCATENATION OF S2 TO THE END OF S1
At the end of string one, add string two: go to the end of the first string; from the next
position, copy the characters of the second string as long as there are characters in the second
string, and at the end close it with a '\0' character. This is left as an exercise for the student.
*************************************************************************************************************
UNIT 6 ELEMENTARY DATA STRUCTURES
6.1 INTRODUCTION
6.2 STACK
6.2.1 DEFINITION
6.2.2 OPERATIONS ON STACK
6.2.3 IMPLEMENTATION OF STACKS USING ARRAYS
6.2.3.1 FUNCTION TO INSERT AN ELEMENT INTO THE STACK
6.2.3.2 FUNCTION TO DELETE AN ELEMENT FROM THE STACK
6.2.3.3 FUNCTION TO DISPLAY THE ITEMS
6.3 RECURSION AND STACKS
6.4 EVALUATION OF EXPRESSIONS USING STACKS
6.4.1 POSTFIX EXPRESSIONS
6.4.2 PREFIX EXPRESSION
6.5 QUEUE
6.5.1 INTRODUCTION
6.5.2 ARRAY IMPLEMENTATION OF QUEUES
6.5.2.1 FUNCTION TO INSERT AN ELEMENT INTO THE QUEUE
6.5.2.2 FUNCTION TO DELETE AN ELEMENT FROM THE QUEUE
6.6 CIRCULAR QUEUE
6.6.1 FUNCTION TO INSERT AN ELEMENT INTO THE QUEUE
6.6.2 FUNCTION FOR DELETION FROM CIRCULAR QUEUE
6.6.3 CIRCULAR QUEUE WITH ARRAY IMPLEMENTATION
6.7 DEQUES
6.8 PRIORITY QUEUES
6.1 INTRODUCTION
Linear arrays, discussed in the previous unit, allow one to insert and delete elements at any
place in the list: at the beginning, at the end or in the middle. There are certain frequent
situations in computer science when one wants to restrict insertions and deletions so that they
can take place only at the beginning or at the end of the list, not in the middle. In a stack, each
data item is placed on top of the previous one as it arrives. A stack is a data structure in which
the latest data is processed first: the data arrives in a sequence, and the stack decides the order
in which it will be processed.
6.2 STACK
6.2.1 DEFINITION
A stack is a homogeneous collection of items of any one type, arranged linearly with access at one
end only, called the top. This means that data can be added or removed from only the top.
Formally this type of stack is called a Last-In-First-Out (LIFO) stack. Data is added to the stack
using the Push operation, and removed using the Pop operation.
Description
In order to clarify the idea of a stack let's look at a "real life" example of a stack. Think of a stack
of plates in a high school cafeteria. When the plates are being stacked, they are added one on top
of each other. It doesn't make much sense to put each plate on the bottom of the pile, as that
would be far more work, and would accomplish nothing over stacking them on top of each other.
Similarly when a plate is taken, it is usually taken from the top of the stack.
The Stack Implemented as an Array
One of two ways to implement a stack is by using a one dimensional array (also known as a
vector). When implemented this way, the data is simply stored in the array. Top is an integer
value, which contains the array index for the top of the stack. Each time data is added or
removed, top is incremented or decremented accordingly, to keep track of the current top of the
stack. By convention, an empty stack is indicated by setting top to be equal to -1.
Stacks implemented as arrays are useful if a fixed amount of data is to be used. However, if the
amount of data is not a fixed size or the amount of the data fluctuates widely during the stack's
life time, then an array is a poor choice for implementing a stack. For example, consider a call
stack for a recursive procedure. First, it can be difficult to know how many times a recursive
procedure will be called, making it difficult to decide upon array bounds.
Second, it is possible for the recursive procedure to be called a small number of times on some
runs and a large number of times on others. An array would be a poor choice, as you would have
to declare it large enough that there is no danger of running out of storage space when the
procedure recurses many times. This can waste a significant amount of memory if the
procedure normally recurses only a few times.
6.2.2 OPERATIONS ON STACK
The two main operations applicable to a stack are:
push: an item is put on top of the stack, increasing the stack size by one. As stack size is usually
limited, this may provoke a stack overflow if the maximum size is exceeded.
pop: the top item is taken from the stack, decreasing stack size by one. In the case where there
was no top item (i.e. the stack was empty), a stack underflow occurs.
Given a stack S and an item I, the operation
push(S, I)
adds the item I to the top of the stack S. Similarly, the operation
pop(S)
removes the top element and returns it as the function's value. Thus the assignment
I = pop(S);
removes the element at the top of S and assigns its value to I.
For example, if S is the stack of Figure 4.3, we performed the operation push(S, G) in going from
frame (a) to frame (b).
Figure 4.3: A Motion Picture of a Stack. Each frame shows the stack contents after one
operation: starting with A, B, C, D, E, F (frame (a), Top = 5), the operations push G, push H,
pop H, pop G, pop F and push F are performed in turn.
We then performed, in turn, the following operations:
push(S, H);   (frame (c))
pop(S);       (frame (d))
pop(S);       (frame (e))
pop(S);       (frame (f))
push(S, F);   (frame (g))
Because of the push operation, which adds elements to a stack, a stack is sometimes called a
pushdown list.
There is no upper limit on the number of items that may be kept in a stack, since the definition
does not specify how many items are allowed in the collection. However, if a stack contains a
single item and the stack is popped, the resulting stack contains no items and is called an empty
stack.
The operation
isempty(S)
determines whether or not a stack S is empty. If the stack is empty, isempty(S) returns the value
TRUE; otherwise it returns the value FALSE.
Another operation that can be performed on a stack is to determine what the top item is
without removing it. This operation is written stacktop(S) and returns the top element of
the stack:
I = stacktop(S);
is equivalent to
I = pop(S);
push(S, I);
Like the pop operation, stacktop is not defined for an empty stack. The result of an illegal
attempt to pop or access an item from an empty stack is called underflow.
6.2.3 IMPLEMENTATION OF STACKS USING ARRAYS
Stacks are one of the most important linear data structures of variable size. This is an important
subclass of the lists(arrays) that permit the insertion and deletion from only one end TOP. But to
allow flexibility sometimes we can say that the stack allows us to access(read) elements in the
middle and also change them as we will see in two functions to be explained later. But this is a
very rare phenomenon.
Some terminology:
Insertion : the operation is also called push
Deletion  : the operation is also called pop
Top       : a pointer which keeps track of the top element in the stack. If an array of size N is
            declared to be a stack, then TOP will be -1 when the stack is empty and N-1 when
            the stack is full.
Fig 1: Memory representation of a stack. Insertions and deletions both take place at the current
TOP; the bottom is fixed, and the maximum size is the allocated array size.
6.2.3.1. FUNCTION TO INSERT AN ELEMENT INTO THE STACK
Before inserting any element into the stack we must check whether the stack is full. In such case
we cannot enter the element into the stack. If the stack is not full, we must increment the position
of the TOP and insert the element. So, first we will write a function that checks whether the stack
is full.
/* function to check whether stack is full */
int stack_full(int top)
{
if(top == SIZE -1)
return (1);
else
return (0);
}
The function returns 1 if the stack is full. Since we place TOP at -1 to denote the stack-empty
condition, TOP varies from 0 to SIZE-1 when the stack stores elements. Therefore TOP at SIZE-1
denotes that the stack is full.
/* function to insert an element into the stack */
void push(float s[], int *t, float val)
{
    if (!stack_full(*t))
    {
        *t = *t + 1;
        s[*t] = val;
    }
    else
    {
        printf("\n STACK FULL");
    }
}
Note that we are accepting the TOP as a pointer(*t). This is because the changes must be
reflected to the calling function. If we simply pass it by value the changes will not be reflected.
Otherwise we have to explicitly return it back to the calling function.
6.2.3.2. FUNCTION TO DELETE AN ELEMENT FROM THE STACK
Before deleting any element from the stack we must check whether the stack is empty. In such
case we cannot delete the element from the stack. If the stack is not empty, we must delete the
element by decrementing the position of the TOP. So, first we will write a function that checks
whether the stack is empty.
/* function to check whether stack is empty */
int stack_empty(int top)
{
if(top == -1)
return (1);
else
return (0);
}
This function returns 1 if the stack is empty. Since the elements are stored from positions
0 to SIZE-1, the empty condition is considered when the TOP has –1 in it.
/* function to delete an element from the stack */
float pop(float s[], int *t)
{
    float val;
    if (!stack_empty(*t))
    {
        val = s[*t];
        *t = *t - 1;
        return val;
    }
    else
    {
        printf("\n STACK EMPTY");
        return 0;
    }
}
Since the TOP points to the current top item, first we store this value in a temporary
variable and then decrement the TOP, returning the temporary variable to the calling function.
We can also write functions to display the stack, either in the order in which the items arrived
or in reverse (the order in which they are waiting to be processed).
6.2.3.3 FUNCTION TO DISPLAY THE ITEMS
/* displays from top to bottom */
void display_TB(float s[], int top)
{
while( top >= 0)
{
printf("%f\n", s[top]);
top--;
}
}
/* displays from bottom to top */
void display_BT(float s[], int top)
{
int i;
for( i=0; i<=top; i++)
{
printf("%f\n", s[i]);
}
}
We can use these functions in a program and see how they look in the stack.
# define SIZE 10
main()
{
float stk[SIZE], ele;
int top = -1;
push(stk, &top, 10.0);
push(stk, &top, 20.0);
push(stk, &top, 30.0);
ele = pop(stk,&top);
printf("%f\n", ele);
ele = pop(stk,&top);
printf("%f\n", ele);
push(stk, &top, 40.0);
ele = pop(stk,&top);
printf("%f\n", ele);
}
Now we will see the working of the stack with diagrams. Starting from an empty stack, the
pushes of 10, 20 and 30 grow the stack with TOP at the latest value; each POP removes the
current top (30, then 20); PUSH 40 places 40 above 10, and the final POP leaves only 10 on the
stack.
Fig 2. Trace of stack values with push and pop functions
A C-Program to Simulate the Operation of the Stack
Array Implementation
/* stack - array implementation */
#include <stdio.h>
#include <string.h>
#define MAX 50
int top;
int isempty()
{
if(top==-1)
{
printf("\nStack underflow...can't pop\n");
return(1);
}
else
return(0);
}
int isfull()
{
if(top==MAX-1)
{
printf("\nStack overflow...can't push\n");
return(1);
}
else
return(0);
}
void push(int stack[],int element)
{
if (!isfull())
{
++top;
stack[top] = element;
}
return;
}
int pop(int stack[])
{
int srs;
if (!isempty())
{
srs = stack[top];
--top;
return srs;
}
return (-1);
}
void display(int stack[])
{
int i;
if (top == -1)
printf("\n***** Empty *****\n");
else
{
for (i=top; i>=0; --i)
printf("%7d",stack[i]);
}
printf("\n");
}
void main()
{
int stack[MAX], element, quit;
char c;
printf("Program of stack with array\n");
top = -1;
quit = 0;
do
{
printf("\n\tOptions\t\tChoice");
printf("\n\tPush\t\tP");
printf("\n\tPop\t\tO");
printf("\n\tExit\t\tE");
printf("\n\tEnter choice : ");
do
c = getchar();
while (strchr("PpOoEe",c) == NULL);
switch(c)
{
case 'P' :
case 'p' :
printf("\nEnter an element to be pushed : ");
scanf("%d",&element);
if (!isfull())
{
push(stack,element);
printf("\n\t**** Stack *****\n");
display(stack);
}
break;
case 'O' :
case 'o' :
if (!isempty())
{
element = pop(stack);
printf("Popped element = %d\n",element);
printf("\n\t**** Stack *****\n");
display(stack);
}
break;
case 'E' :
case 'e' :
quit = 1;
}
}
while (!quit);
printf("\n");
}
6.3 Recursion and Stacks
Recursion is a process of expressing a function in terms of itself. A function which contains a
call to the same function (direct recursion), or a call to another function which eventually calls the
first function (indirect recursion), is termed recursive. We can also define recursion as a
process in which a function calls itself with reduced input and has a base condition to stop the
process. i.e., any recursive function must satisfy two conditions:
1. it must have a terminal condition;
2. after each recursive call it must move nearer to the terminal condition.
An expression, a language construct or a solution to a problem can be expressed using recursion.
We will understand the process with the help of a classic example to find the factorial of a
number.
The recursive definition to obtain the factorial of n is shown below:
fact(n) = 1                if n = 0
        = n * fact(n-1)    otherwise
We can compute 5! as shown below
5!=5*4!
4!=4*3!
3!=3*2!
2!=2*1!
1!=1*0!
0!=1
By definition 0! is 1, so 0! will not be expressed in terms of itself. Now the computations will be
carried out in reverse order, as shown:
0!=1
1!=1*0!=1*1=1
2!=2*1!=2*1=2
3!=3*2!=3*2=6
4!=4*3!=4*6=24
5!=5*4!=5*24=120
The C program for finding the factorial of a number is as shown:
#include<stdio.h>
int fact(int n)
{
if(n==0)
return 1;
return n*fact(n-1);
}
main()
{
int n;
printf("Enter the number \n");
scanf("%d",&n);
printf("The factorial of %d = %d\n",n,fact(n));
}
In fact() the terminal condition is fact(0), which is 1. If we don't write the terminal condition,
the function ends up calling itself forever, i.e. in an infinite loop.
Every time the function is entered during recursion, separate memory is allocated for the
local variables and formal parameters. Once control comes out of the function, the memory is de-allocated.
When a function is called, the return address, the values of local and formal variables are
pushed onto the stack, a block of memory of contiguous locations, set aside for this purpose. After
this the control enters into the function. Once the return statement is encountered, control comes
back to the previous call, by using the return value present in the stack, and it substitutes the
return value to the call. If the function does not return any value, control goes to the statement
that follows the function call.
To explain how the factorial program actually works, we will write it using indirect
recursion:
# include<stdio.h>
int fact(int n)
{
int x, y, res;
if(n==0)
return 1;
x=n-1;
y=fact(x);
res= n*y;
return res;
}
main()
{
int n;
printf("Enter the number \n");
scanf("%d",&n);
printf("The factorial of %d = %d\n",n,fact(n));
}
Suppose we have the statement
A= fact(n);
in the function main(), where the value of n is 4. When the function is called the first time, the
value of n and the return address, say XX00, are pushed onto the stack. Now the value of the
formal parameter is 4. Since we have not reached the base condition of n=0, the value of n is
reduced to 3 and the function is called again with 3 as the parameter. Now again the new return
address, say XX20, and the parameter 3 are stored on the stack. This process continues till n
takes the value 0, every time pushing the return address and the parameter. Finally, control
returns after folding back from one call to another, with the result value 24. The number of
times the function is called recursively is called the depth of recursion.
Iteration versus Recursion
In recursion, every time a function is called, all the local variables , formal variables and
return address will be pushed on the stack. So, it occupies more stack and most of the time is
spent in pushing and popping. On the other hand, the non-recursive functions execute much
faster and are easy to design.
There are many situations where recursion is best suited for solving problems. In such
cases this method is more efficient and can be understood easily. If we try to write such
functions using iterations we will have to use stacks explicitly.
6.4 EVALUATION OF EXPRESSIONS USING STACKS
All the arithmetic expressions contain variables or constants, operators and parenthesis. These
expressions are normally in the infix form, where the operators separate the operands. Also, there
will be rules for the evaluation of the expressions and for assigning the priorities to the operators.
The expression after evaluation will result in a single value. We can evaluate an expression using
the stacks. A complex assignment statement such as:
z + x/y ** a+b*c-x*a
might have several meanings, even if it were uniquely defined by the full use of parentheses.
An expression is made up of operands, operators and delimiters. The above expression has six
operands: z, x, y, a, b and c. The first problem in understanding the meaning of an expression is to
decide in what order the operations are carried out. This means that every language must
uniquely define such an order. To fix the order of evaluation we assign to each operator a priority.
A set of sample priorities are as follows:
Operator                              Priority    Associativity
()                                    8           Left to Right
^ or **, unary -, unary +, ! [not]    7           Right to Left
*, /, %                               6           Left to Right
+, -                                  5           Left to Right
<, <=, >, >=                          4           Left to Right
==, !=                                3           Left to Right
&&                                    2           Left to Right
||                                    1           Left to Right
But by using parenthesis we can override these rules and such expressions are always evaluated
with the inner most parenthesized expression first.
The above notation of any expression is called Infix Notation (in which operators come in between
the operands). The notation is a traditional notation, which needs operator's priorities and
associativities. But how can a compiler accept such an expression and produce the correct Polish
Notation, or Prefix form (in which operators come before the operands)? Polish Notation has several
advantages over Infix Notation: there is no need to consider priorities while evaluating it, and there
is no need to introduce parentheses to maintain the order of execution of operators. Similarly,
Reverse Polish Notation, or Postfix form, in which operators come after the operands, has the same
advantages over Infix Notation.
Example 1
Infix      Prefix     Postfix
x+y        +xy        xy+
x+y*z      +x*yz      xyz*+
x+y-z      -+xyz      xy+z-
Stacks are frequently used for converting INFIX form into equivalent PREFIX and POSTFIX forms.
Consider an infix expression:
2 + 3 * (4 - 6 / 2 + 7) / (2 + 3) - ((4 - 1) * (2 - 10 / 2))
When it comes to the evaluation of the expression, following rules are used.
1. Brackets should be evaluated first.
2. * and / have equal priority, which is higher than that of + and -.
3. All operators are left associative, i.e. among operators of equal priority, evaluation is from left
to right.
In the above case, as we move from left to right, the first bracket to be evaluated is (4 - 6/2 + 7).
The evaluation is as follows:
Step 1: Division has higher priority, therefore 6/2 results in 3. The expression now becomes (4 - 3 + 7).
Step 2: As - and + have the same priority, (4 - 3) is evaluated first, giving 1.
Step 3: 1 + 7 results in 8.
The total evaluation is as follows.
2 + 3 * (4 - 6 / 2 + 7) / (2 + 3) - ((4 - 1) * (2 - 10 / 2))
= 2 + 3 * 8 / (2 + 3) - ((4 - 1) * (2 - 10 / 2))
= 2 + 3 * 8 / 5 - ((4 - 1) * (2 - 10 / 2))
= 2 + 3 * 8 / 5 - (3 * (2 - 10 / 2))
= 2 + 3 * 8 / 5 - (3 * (2 - 5))
= 2 + 3 * 8 / 5 - (3 * (-3))
= 2 + 3 * 8 / 5 + 9
= 2 + 24 / 5 + 9
= 2 + 4.8 + 9
= 6.8 + 9
= 15.8
6.4.1 Postfix Expressions
In the postfix expression, every operator is preceded by two operands on which it operates. The
postfix expression is the postorder traversal of the tree. The postorder traversal is Left-Right-Root.
In the expression tree, Root is always an operator. Left and Right sub-trees are expressions by
themselves, hence they can be treated as operands.
If we want to obtain the postfix expression from the given infix, we consider the evaluation
sequence and apply it from bottom to top, every time converting infix to postfix.
e7 = e6 + a          =  e6 a +
But e6 = e5 - e4     =  e5 e4 - a +
But e5 = a - b       =  a b - e4 - a +
But e4 = e3 * b      =  a b - e3 b * - a +
But e3 = e2 + a      =  a b - e2 a + b * - a +
But e2 = c * e1      =  a b - c e1 * a + b * - a +
But e1 = d / a       =  a b - c d a / * a + b * - a +
The postfix expression does not require brackets. The above method is not useful for
programming. For programming, we use a stack, which will contain operators and opening
brackets. The priorities are assigned using numerical values: the priority of + and - is 1, and the
priority of * and / is 2. The incoming priority of the opening bracket is the highest, and the
outgoing priority of the closing bracket is the lowest. An operator is pushed onto the stack
provided the priority of the stack-top operator is less than that of the current operator. An
opening bracket is always pushed onto the stack, and any operator can be pushed on top of an
opening bracket. Whenever an operand is received, it is printed directly. Whenever a closing
bracket is received, elements are popped and printed till the opening bracket. The execution is
as shown.
e.g.
b - (a + b) * ((c - d) / a + a)
Note: for convenience, we draw the stack horizontally, bottom at the left.
1. On b, operand; hence print b.
2. On -, stack being empty, push. Stack: -
3. On (, push. Stack: - (
4. On a, operand; hence print a.
5. On +, stack top being (, push. Stack: - ( +
6. On b, operand; hence print b.
7. On ), pop till ( and print: pop +, print +; pop (. Stack: -
8. On *, stack top being -, push. Stack: - *
9. On (, push. Stack: - * (
10. On (, push. Stack: - * ( (
11. On c, operand; hence print c.
12. On -, push. Stack: - * ( ( -
13. On d, operand; hence print d.
14. On ), pop till ( and print: pop -, print -; pop (. Stack: - * (
15. On /, push. Stack: - * ( /
16. On a, operand; hence print a.
17. On +, the stack top / has higher priority: pop /, print /; then push +. Stack: - * ( +
18. On a, operand; hence print a.
19. On ), pop till ( and print: pop +, print +; pop (. Stack: - *
20. End of the infix expression: pop all and print; hence print *, print -.
Therefore the generated postfix expression is
b a b + c d - a / a + * -
Algorithm to convert an infix expression to postfix expression
Step 1: Accept the infix expression in a string S.
Step 2: Let i be the position counter; set it to 0.
Step 3: Initially top = -1, indicating the stack is empty.
Step 4: If S[i] is an operand, print it, go to step 8.
Step 5: If S[i] is an opening bracket, push it, go to step 8.
Step 6: If S[i] is an operator, then
Step 6a: If the stack is empty, push S[i] and go to step 8.
Otherwise pop an operator from the stack, say p.
If the priority of p is less than the priority of S[i], then push p back, push S[i], and go to step 8.
Else print p and go to step 6a.
Step 7: If S[i] is a closing bracket, then
Step 7a: Pop an operator from the stack, say p.
If p is not the opening bracket, print p and go to step 7a.
Else go to step 8.
Step 8: Increment i.
Step 9: If S[i] is not equal to '\0', go to step 4.
Step 10: Pop an operator from the stack, say p.
Step 11: Print p.
Step 12: If the stack is not empty, go to step 10.
Step 13: Stop.
6.4.2 PREFIX EXPRESSION
To convert an infix expression to a prefix expression, we can again have steps similar to the above
algorithm, but a single stack will not be sufficient. We will require two stacks:
1. one containing operators, for assigning priorities, and
2. one containing operands or operand expressions.
Every time we get an operand it is pushed onto stack2, and every operator is pushed onto
stack1. Pushing an operand onto stack2 is unconditional, whereas when operators are pushed
onto stack1, all the rules of the previous method apply as they are. When an operator is popped
from stack1, the corresponding two operands are popped from stack2, the topmost being O2 and
the next O1. We form the prefix expression as operator, O1, O2; this prefix expression is treated
as a single entity and pushed onto stack2. Whenever a closing bracket is received, we pop
operators from stack1 till the opening bracket is received. At the end stack1 will be empty and
stack2 will contain a single operand in the form of the prefix expression.
e.g.
(a + b) * (c - d) + a
Symbol   Stack1        Stack2               Action
(        (                                  push (
a        (             a                    push a
+        ( +           a                    push +
b        ( +           a b                  push b
)                      +ab                  pop +, O2 = b, O1 = a; push +ab; pop (
*        *             +ab                  push *
(        * (           +ab                  push (
c        * (           +ab c                push c
-        * ( -         +ab c                push -
d        * ( -         +ab c d              push d
)        *             +ab -cd              pop -, O2 = d, O1 = c; push -cd; pop (
+        +             *+ab-cd              pop *, O2 = -cd, O1 = +ab; push *+ab-cd; push +
a        +             *+ab-cd a            push a
end                    +*+ab-cda            pop +, O2 = a, O1 = *+ab-cd; push +*+ab-cda
The expression is + * + a b - c d a
The trace of the stack contents and the actions on receiving each symbol is shown. Stack2
can be a stack of header nodes for different lists containing prefix expressions, or a
multidimensional array, or an array of strings.
The recursive functions can be written for both the conversions. Following are the steps for
the non-recursive algorithm for converting Infix to Prefix.
Step 1: Get the infix expression, say S.
Step 2: Set the position counter i to 0.
Step 3: Initially top1 and top2 are -1, indicating that the stacks are empty.
Step 4: If S[i] is an opening bracket, push it in stack1, go to step 8.
Step 5: If S[i] is an operand, push it in stack2, go to step 8.
Step 6: If S[i] is an operator and stack1 is empty, or the stack-top element has less priority
than S[i], push S[i] in stack1 and go to step 8.
Else p = pop the operator from stack1,
O2 = pop the operand from stack2,
O1 = pop the operand from stack2,
form the prefix expression p, O1, O2,
push it in stack2 and go to step 6.
Step 7: If S[i] is a closing bracket, then
Step 7a: p = pop the operator from stack1.
If p is not the opening bracket, then
O2 = pop the operand from stack2,
O1 = pop the operand from stack2,
form the prefix expression p, O1, O2,
push it in stack2 and go to step 7a.
Else go to step 8.
Step 8: Increment i.
Step 9: If S[i] is not equal to '\0', go to step 4.
Step 10: Every time pop one operator from stack1 and two operands from stack2, form the
prefix expression p, O1, O2, push it in stack2, and repeat till stack1 becomes empty.
Step 11: Pop the operand from stack2 and print it as the prefix expression.
Step 12: Stop.
The reverse conversions are left as an exercise to the students.
Exercises:
1. In case we are required to reverse a stack, one way is to pop each element from the existing
stack and push it onto another stack. Thus it is possible to reverse a stack using a stack. This is
very obvious, but when we are required to use a queue for the same purpose, we use the
following steps (algorithm):
1. Pop a value from the stack.
2. Add that value to the queue.
3. Repeat steps 1 and 2 till the stack is empty. Now the stack is empty and the queue
contains all the elements.
4. Delete a value from the queue and push it onto the stack.
5. Repeat step 4 till the queue is empty.
The value which was popped from the stack first will also be the first value deleted from the
queue, and hence the first value pushed back onto the stack. Thus the top value of the original
stack becomes the bottom value, and the stack is reversed. Use the functions written for stacks
and queues to write the program.
2. A double-ended queue is a linear list in which additions and deletions may be made at either
end. Write functions to add and delete elements from either end of the queue.
3. Write a program, using stacks, to check whether a given string is a palindrome. A string is a
palindrome when it reads the same in both directions. Remember that we are not supposed to
store the string in an array of characters. The general logic is to remember the first character
and push all the others onto the stack. As the string ends, pop a character from the stack; it
will be the last character, and it should be equal to the first character remembered. Now pop the
next character, and repeat the procedure.
6.5 QUEUES
6.5.1 INTRODUCTION
Queues arise quite naturally in the computer for solution of many problems. Perhaps the most
common occurrence of a queue in Computer Applications is for the scheduling of jobs.
Queue is a Linear list which has two ends, one for insertion of elements and other for deletion of
elements. The first end is called 'Rear' and the later is called 'Front'. Elements are inserted from
Rear End and Deleted from Front End. Queues are called First-In-First-Out (FIFO) List, since the
first element in a queue will be the first element out of the queue. In other words, the order in
which the elements enter a queue is the order in which they leave.
Figure 5.1: A possible queue, with REAR at the insertion end and FRONT at the deletion end
6.5.2 ARRAY IMPLEMENTATION OF QUEUES
Queues may be represented in the computer in various ways, usually by means of one way lists
or linear Arrays, unless otherwise stated or implied. Each Queue will be maintained by a linear
array queue[ ] and two integer variables Front and Rear containing the location of the front
element of the queue and the location of the Rear element of the queue. Additions to the queue
take place at the rear. Deletions are made from the front. So, if the job is submitted for execution,
it joins at the rear of the job queue.
Figure 5.2: Queue represented by an array, with FRONT at the first element a1 and REAR at
the last inserted element
When we represent a queue through an array, we have to predefine the size of the queue, and we
cannot enter more elements than that predefined size, say MAX.
REAR points to the latest inserted element; initially REAR = -1.
FRONT points to the first inserted element which is not yet deleted; initially FRONT = 0.
Hence the condition for an empty queue is
FRONT = 0
REAR = -1
When FRONT = REAR there is exactly one element in the queue.
Figure 5.3: (a) an empty queue, with FRONT = 0 and REAR = -1; (b) a queue with exactly one
element, where FRONT = REAR
Whenever an element is added to the Queue, the value of REAR is increased by 1; this can be
implemented as:
REAR = REAR+1;
or
REAR + + ;
provided that REAR is less than MAX-1; REAR equal to MAX-1 is the condition for a full queue.
6.5.2.1FUNCTION TO INSERT AN ELEMENT INTO THE QUEUE
Before inserting any element into the queue we must check whether the queue is full.
/* function to check whether queue is full */
int q_full(int rear)
{
if(rear == SIZE -1)
return (1);
else
return (0);
}
This function returns 1 if the queue is full. Since we place rear at -1 to denote the queue-empty
condition, rear varies from 0 to SIZE-1 while the queue stores elements. Therefore rear at
SIZE-1 denotes that the queue is full.
/* function to insert an element into the queue */
void add_q ( int a[], int *r, int val)
{
if( ! q_full(*r))
{
*r = *r + 1;
a[*r] = val;
}
else
{
printf("\n QUEUE FULL");
}
}
The call for the function will be
add_q(a, &rear, value);
6.5.2.2. FUNCTION TO DELETE AN ELEMENT FROM THE QUEUE
Before deleting any element from the queue we must check whether the queue is empty. In such
case we cannot delete the element from the queue.
/* function to check whether queue is empty */
int q_empty(int front, int rear)
{
if( front > rear)
return (1);
else
return (0);
}
This function returns 1 if the queue is empty.
/* function to delete an element from the queue */
int delete_q ( int a[], int *f, int *r)
{
int val;
if( ! q_empty(*f,*r))
{
val = a[*f];
*f = *f + 1; /* the new front position */
if( *f > *r)
{
/* queue became empty; reset */
*f = 0;
*r = -1;
}
return (val);
}
else
{
printf("\n QUEUE EMPTY");
return 0;
}
}
The major problem in the above implementation is that whenever we remove an element
from the queue, front will increment and the location used by the element cannot be used again.
This problem can be solved if we shift all the elements to the left by one location on every delete
operation. This will be very time consuming and is not an effective way of solving the problem.
Example:
Figure 5.4: After a number of deletions FRONT has advanced into the array; positions 0 to
FRONT-1 are empty but cannot be reused, while REAR has reached MAX-1.
This representation of a queue has a major drawback. Suppose the queue becomes full, i.e.
REAR = MAX-1, and we then delete some elements from the queue, so that FRONT now points to
A[i] for some i > 0, as shown in Fig. 5.4. We cannot insert any element into the queue even
though there is space, and if we utilize that space by inserting elements at the positions A[0],
A[1], ..., A[i-1], we violate the FIFO property of the queue. One way out is to simply move the
entire queue to the beginning of the array, changing FRONT and REAR accordingly, and then
insert the ITEM as above. This procedure may be very expensive.
We can overcome this drawback through circular representation of queue
6.6. CIRCULAR QUEUE
The above problem can be solved only when the first position in the array will be logically the
next position of the last position of the array. By this way we can say that the array is circular
in nature because every position in the array will have logical next position in the array. The
queue, which we are going to handle, using this approach is called the circular queue.
Remember that it is not the infinite queue but we reuse the empty locations effectively. Now all
the functions, which we have written previously will change. We will have a very fundamental
function for such case, which will find the logical next position for any given position in the
array.
We assume that Queue[ ] is circular, that is, Queue[0] comes after Queue[MAX-1] in the array.
Because we have to reset the values of REAR and FRONT from MAX-1 to 0 while inserting or
deleting elements, we cannot do it by
REAR = REAR+1 and
FRONT = FRONT+1
Instead we do it by the following assignments:
REAR = (REAR+1) % MAX and
FRONT = (FRONT+1) % MAX
which increment REAR and FRONT from 0 towards MAX-1 and, when needed, reset them from
MAX-1 back to 0.
Similarly, conditions for Empty and Full Queues also can not be as before.
Instead we have to assume that, Queue will be empty when,
REAR = FRONT
and full when,
(REAR+1)%MAX=FRONT
Figure 5.5: A circular queue of MAX positions; position 0 logically follows position MAX-1.
Remember, in the linear queue we said that there is exactly one element in the queue when
REAR = FRONT; here we cannot assume that, because REAR = FRONT now denotes the empty
queue. To distinguish a full queue from an empty one, we declare the queue full when
(REAR+1)%MAX = FRONT. Hence we have to sacrifice one array slot in this implementation, and
this is the only drawback of this scheme.
Figure 5.6: (a) an empty circular queue, with FRONT = REAR; (b) a full circular queue, where
the elements wrap around the array and one slot is left unused.
6.6.1 FUNCTION TO INSERT AN ELEMENT INTO THE QUEUE
/* CQUEUE[ ] is an array representing the circular queue */
void insert_CQ(int x)
{
if ((REAR+1)%MAX == FRONT)
{
printf("Q_FULL");
return;
}
REAR = (REAR+1)%MAX;
CQUEUE[REAR]=x;
}
6.6.2 FUNCTION FOR DELETION FROM CIRCULAR QUEUE:
int delete_CQ()
{
int x;
if (REAR == FRONT)
{
printf("Q_Empty");
return -1;
}
FRONT = (FRONT+1)%MAX;
x = CQUEUE[FRONT];
return (x);
}
6.6.3 Program: Circular Queue with Array Implementation
#define MAX 10
#include <stdio.h>
#include <string.h>
int front=0, rear=-1, count=0;
void main()
{
int queue[MAX], element, quit;
char c;
void insert(int queue[10],int element);
int deletq(int queue[10]);
void display(int queue[10]);
printf("Program of queue with array\n");
quit = 0;
do
{
printf("\n\tOptions\t\tChoice");
printf("\n\tInsert\t\tI");
printf("\n\tDelete\t\tD");
printf("\n\tExit\t\tE");
printf("\n\tEnter choice : ");
do
c = getchar();
while (strchr("IiDdEe",c) == NULL);
switch(c)
{
case 'I' :
case 'i' :
printf("\nEnter an element to be inserted : ");
scanf("%d",&element);
insert(queue,element);
display(queue);
break;
case 'D' :
case 'd' :
deletq(queue);
display(queue);
break;
case 'E' :
case 'e' :
quit = 1;
}
}
while (!quit);
printf("\n");
}
int isfull()
{
if(count==MAX)
{
printf("\nQueue overflow....can't insert\n");
return(1);
}
else
return(0);
}
int isempty()
{
if(count==0)
{
printf("\nQueue underflow....can't delete\n");
return(1);
}
else
return(0);
}
void insert(int queue[],int element)
{
if (!isfull())
{
rear=(rear+1)%MAX;
queue[rear] = element;
count++;
}
return;
}
void display(int queue[])
{
int c,i;
if(count==0)
{
printf("\ncircular queue is empty");
}
else
{
i=front;
for (c = 1; c <=count ; c++)
{
printf("%6d\n",queue[i]);
i=(i+1)%MAX;
}
}
return;
}
int deletq(int queue[])
{
int element = -1;
if (!isempty())
{
element = queue[front];
front=(front+1)%MAX;
count--;
}
return element;
}
6.7 DEQUES
A deque (pronounced either “deck” or “dequeue” ) is a linear list in which elements can be added
or removed at either end but not in the middle. The term deque is a contraction of the name
double-ended queue.
There are two variations of a deque, namely:
1. the input-restricted deque, and
2. the output-restricted deque.
Specifically, an input-restricted deque allows insertions at only one end of the list but deletions
at both ends; an output-restricted deque allows deletions at only one end of the list but
insertions at both ends.
Figure 5.7: (a) an input-restricted deque (one input end, two output ends); (b) an
output-restricted deque (two input ends, one output end)
We will assume our deque is maintained by a circular array DEQUE with pointers LEFT and
RIGHT, which point to the two ends of the deque. We assume that the elements extend from the
left end to the right end in the array. The term "circular" comes from the fact that we assume
DEQUE[1] comes after DEQUE[N] in the array. Figure 6.18 pictures two deques, each with 4
elements, maintained in an array with N = 8 memory locations. The condition LEFT = NULL will
be used to indicate that a deque is empty.
Fig 6.18: (a) LEFT = 4, RIGHT = 7; the elements AAA, BBB, CCC, DDD occupy locations 4 to 7.
(b) LEFT = 7, RIGHT = 2; the elements WWW, XXX, YYY, ZZZ wrap around from locations 7 and
8 to locations 1 and 2.
The procedures which insert and delete elements in deques and the variations on those
procedures are given as supplementary problems. As with queues, a complication may arise (a)
when there is overflow, that is , when an element is to be inserted into a deque which is already
full , or (b) when there is underflow, that is , when an element is to be deleted from a deque which
is empty. The procedures must consider these possibilities.
6.8. PRIORITY QUEUES
A priority queue is a collection of elements such that each element has been assigned a priority
and the order in which elements are deleted and processed follows these rules:
1. An element of higher priority is processed before any element of lower priority.
2. Two elements with the same priority are processed according to the order in which they
were added to the queue.
A prototype of a priority queue is a timesharing system: programs of high priority are
processed first, and programs with the same priority form a standard queue.
There can be two types of implementations of a priority queue:
i) ascending priority queue
ii) descending priority queue
A collection of items into which items can be inserted arbitrarily and from which only the smallest
item can be removed is called an ascending priority queue. In a descending priority queue only
the largest item is deleted. The elements of a priority queue need not be numbers or characters
that can be compared directly; they may be complex structures that are ordered by one or several
fields. Sometimes the field on which the elements of a priority queue are ordered is not even part
of the elements themselves.
Array Representation of a Priority Queue
Figure: each priority number K has its own circular array with its own FRONT and REAR
pointers; the sample elements AAA through GGG are stored in the rows for their respective
priority numbers.
Another way to maintain a priority queue in memory is to use a separate queue for each level of
priority (or for each priority number). Each such queue will appear in its own circular array and
must have its own pair of pointers, Front and Rear. In fact, if each queue is allocated the same
amount of space, a two-dimensional array Queue can be used instead of the linear arrays.
Figure 5.8 indicates this representation. Observe that Front(K) and Rear (K) contain, respectively,
the front and rear elements of row K of Queue, the row that maintains the queue of elements with
priority number K.
Figure 5.8
The following are outlines of algorithms for deleting and inserting elements in a priority queue
that is maintained in memory by a two-dimensional array QUEUE, as above. The details of the
algorithms are left as exercises.
Algorithm: This algorithm deletes and processes the first element in a priority queue maintained
by a two-dimensional array QUEUE.
1. [Find the first nonempty queue.] Find the smallest K such that FRONT[K] != NULL.
2. Delete and process the front element in row K of QUEUE.
3. Exit.
Adding an element to our priority queue is much more complicated than deleting an element
from the queue, because we need to find the correct place to insert the element. An outline of the
algorithm follows.
Algorithm 6.14: This algorithm adds an ITEM with priority number N to a priority queue which
is maintained in memory as a one-way list.
1. Traverse the one-way list until finding a node X whose priority number exceeds N. Insert
ITEM in front of node X.
2. If no such node is found, insert ITEM as the last element of the list.
The above insertion algorithm may be pictured as a weighted object “sinking” through layers of
elements until it meets an element with a heavier weight.
The details of the above algorithm are left as an exercise. The main difficulty in the algorithm
comes from the fact that ITEM is inserted before node X. This means that, while traversing the
list, one must also keep track of the address of the node preceding the node being accessed.
Summary
Once again we see the time-space tradeoff when choosing between different data structures for a
given problem. The array representation of a priority queue is more time-efficient than the
one-way list. This is because when adding an element to a one-way list, one must perform a linear
search on the list.
search on the list. On the other hand, the one-way list representation of the priority queue may
be more space-efficient than the array representation. This is because in using the array
representation, overflow occurs when the number of elements in any single priority level exceeds
the capacity for that level, but in using the one-way list, overflow occurs only when the total
number of elements exceeds the total capacity. Another alternative is to use a linked list for each
priority level.
UNIT 7 LINKED LISTS
7.1. INTRODUCTION
7.2. SINGLY LINKED LISTS.
7.2.1. IMPLEMENTATION OF LINKED LIST
7.2.1.1.
INSERTION OF A NODE AT THE BEGINNING
7.2.1.2.
INSERTION OF A NODE AT THE END
7.2.1.3.
INSERTION OF A NODE AFTER A SPECIFIED NODE
7.2.1.4.
TRAVERSING THE ENTIRE LINKED LIST
7.2.1.5.
DELETION OF A NODE FROM LINKED LIST
7.3. CONCATENATION OF LINKED LISTS
7.4. MERGING OF LINKED LISTS
7.5. REVERSING OF LINKED LIST
7.6. DOUBLY LINKED LIST.
7.6.1. IMPLEMENTATION OF DOUBLY LINKED LIST
7.7. CIRCULAR LINKED LIST
7.8. APPLICATIONS OF THE LINKED LISTS
7.1. INTRODUCTION
We have seen the representation of linear data structures using the sequential allocation method
of storage, as in arrays. But this is unacceptable in cases like:
a) UNPREDICTABLE STORAGE REQUIREMENTS:
The exact amount of storage required by the program varies with the amount of data being
processed. This may not be known when we write the program and is determined only at run
time.
For example, linked allocations are very beneficial in case of polynomials. When we add two
polynomials, and none of their degrees match, the resulting polynomial has the size equal to the
sum of the two polynomials to be added. In such cases we can generate nodes (allocate memory to
the data member) whenever required, if we use linked representation (dynamic memory
allocation).
b) EXTENSIVE DATA MANIPULATION TAKES PLACE:
Frequently many operations, like insertion and deletion, are to be performed on the list.
Pointers are used for dynamic memory allocation. A pointer is always of the same length
regardless of which data type it points to (int, float, struct, etc.). This enables the
manipulation of pointers to be performed in a uniform manner using simple techniques, and makes
it possible to represent a much more complex relationship between the elements of a data
structure than a linear order can.
The use of pointers or links to refer to elements of a data structure implies that elements, which
are logically adjacent, need not be physically adjacent in the memory. Just like family members
dispersed, but still bound together.
7.2. Singly Linked List [or] One-way Chain
This is a list consisting of an ordered set of elements that may vary in number. Each element
in the list is called a node. A node in a singly linked list consists of two parts: an
information part, where the actual data is stored, and a link part, which stores the address of
the successor (next) node in the list. The order of the elements is maintained by this explicit
link between them. A typical node is shown below:
| INFO | LINK |
Fig 1. Structure of a Node
In figure 2, the arrows represent the links. The data part of each node consists of the
marks obtained by a student and the next part is a pointer to the next node. The NULL in the last
node indicates that this node is the last node in the list and has no successors at present. In
the above example the data part has a single element, marks, but you can have as many elements
as you require, like name, class, etc.
We have to consider a logical ordered list, i.e. elements are stored in different memory locations
but they are linked to each other and form a logical list as in Fig. 3.1. This link represents that
each element
A1 -> A2 -> A3 -> ... -> An
Figure 3.1: Logical List
has the address of its logical successor in the list. We can understand this concept through a
real-life example. Suppose there is a list of 8 friends, x1, x2, ..., x8, each residing at a
different location in the city. x1 knows the address of x2, x2 knows the address of x3, and so
on; x7 has the address of x8. If one wants to reach the house of x8 without knowing its address,
one starts at x1 and follows the chain of addresses, as in Fig 3.2.
Consider an example where the marks obtained by the students are stored in a linked list
as shown in the figure :
| 62 | -> | 72 | -> | 82 | -> | 34 |NULL|
|<- NODE (|data|next|) ->|
Fig 2. Singly Linked List
The concept of a linked list is like a family dispersed, but still bound together. From the above
discussion it is clear that a linked list is a collection of elements called nodes, each of
x1 | Add. of x2 -> x2 | Add. of x3 -> ... -> x8 | NULL
Figure 3.2
which stores two items of information:
- An element of the list
- A link or address of another node
Link can be provided through a pointer that indicates the location of the node containing the
successor of this list element. The NULL in the last node indicates that this is the last node in the
list.
REPRESENTATION OF LINKED LIST
Because each node of an element contains two parts, we have to represent each node through a
structure.
While defining a linked list we use a recursive (self-referential) definition:
struct node
{
    int data;
    struct node *link;
};
link is a pointer of struct node type, i.e. it can hold the address of a variable of struct node
type.
Pointers permit the referencing of structures in a uniform way, regardless of the organization of
the structure being referenced. Pointers are capable of representing a much more complex
relationship between elements of a structure than a linear order.
Initialization :
main()
{
struct node *p, *list, *temp;
list = p = temp = NULL;
.
.
.
}
7.2.1 IMPLEMENTATION OF LINKED LIST
A linked list is a linear data structure with the following operations defined on it:
- Insertion of a node at the beginning
- Insertion of a node at the end
- Insertion of a node after a specified node
- Deletion of a particular node from the list
- Traversing the entire linked list
7.2.1.1. INSERTION OF A NODE AT THE BEGINNING
We have a linked list in which the first node is pointed to by the pointer list.
We take the node data as input from the user and point to the new node through temp. Now we
attach temp to the list by putting the address of list in the link field of the node pointed to
by temp (Fig. 3.3). Then we update the pointer list with the address contained in temp.
Figure 3.3
This can be accomplished as:
addbeg()
{
    int x;
    temp = malloc(sizeof(struct node));
    scanf("%d", &x);
    temp->data = x;
    temp->link = list;
    list = temp;
}
7.2.1.2 INSERTION OF A NODE AT THE END
We traverse the list until NULL (i.e. the end of the list) is found. We traverse the list through
an additional pointer p, keeping the start pointer list fixed at the beginning of the linked
list. When p reaches the end, we attach temp to p by putting the address of the node pointed to
by temp in the link field of p (Fig. 3.4).
Figure 3.4
addend()
{
    int x;
    temp = malloc(sizeof(struct node));
    scanf("%d", &x);
    temp->data = x;
    temp->link = NULL;
    p = list;
    if (list == NULL)   /* initially empty list */
    {
        list = temp;
    }
    else
    {
        while (p->link != NULL)
            p = p->link;
        p->link = temp;
    }
}
7.2.1.3. INSERTION OF A NODE AFTER A SPECIFIED NODE
Traverse the list until the node with the specified value is found or the end of the list is
reached. If the end of the list is reached, print the message "Given No. is not present";
otherwise insert the node pointed to by temp between the nodes pointed to by p and p->link
(p is used to traverse the list), Fig. 3.5.
Figure 3.5
insert()
{
    int num, x;
    scanf("%d", &num);   /* num is the data to be found   */
    scanf("%d", &x);     /* x is the data to be inserted  */
    temp = malloc(sizeof(struct node));
    temp->data = x;
    temp->link = NULL;
    if (list == NULL)
        printf("List is Empty");
    else
    {
        p = list;
        while (p != NULL && p->data != num)
        {
            p = p->link;
        }
        if (p == NULL)
        {
            printf("Number Not Found");
            return;
        }
        else   /* number found at the location pointed to by p */
        {
            temp->link = p->link;
            p->link = temp;
        }
    }
}
7.2.1.4. TRAVERSING THE ENTIRE LINKED LIST
display()
{
    if (list == NULL)
        printf("List is Empty");
    else
    {
        p = list;
        while (p != NULL)
        {
            printf("%d ", p->data);
            p = p->link;
        }
    }
}
7.2.1.5. DELETION OF A NODE FROM LINKED LIST
Search for the node which is to be deleted from the list by traversing the list through pointer
p. If the end of the list is reached, print the message "Given No. is not found"; otherwise store
the address of the successor of the node to be deleted in the link field of p, and free the node
to be deleted (Fig. 3.6).
Figure 3.6
int delete()
{
    int x, num;
    struct node *del;
    scanf("%d", &num);   /* num is the data to be deleted */
    if (list == NULL)
        printf("List is Empty");
    else
    {
        p = list;
        while (p->link != NULL && p->link->data != num)
        {
            p = p->link;
        }
        if (p->link == NULL)
            printf("No. Not Found");
        else
        {
            del = p->link;             /* p->link holds the address of */
            p->link = p->link->link;   /* the node to be deleted       */
            x = del->data;
            free(del);
            return (x);
        }
    }
}
7.3. CONCATENATION OF LINKED LISTS
Consider a case where we have two linked lists pointed to by two different pointers, say p and q
respectively, and we want to concatenate the second list at the end of the first. We can do this
by traversing the first list till the end and then storing the address of the first node of the
second list in the link field of the last node of the first list. If we traverse the first list
with a pointer temp, we can concatenate the lists by the statement:
temp->link = q; (Fig. 3.7)
Figure 3.7: (a) Lists before Concatenation, (b) List after Concatenation
The function to achieve this is given below:
concatenate(struct node *p, struct node *q)
{
    struct node *temp;
    if (p == NULL)   /* if the first list is empty, the concatenated */
        p = q;       /* list is just the second list, pointed by p   */
    else
    {
        temp = p;
        while (temp->link != NULL)
            temp = temp->link;
        temp->link = q;
    }
}
7.4. MERGING OF LINKED LISTS
Suppose we have two linked lists pointed to by two different pointers P and q, we wish to merge
the two lists into a third list pointed by z. While carrying out this merging we wish to ensure that
those elements that are common to both the lists occur only once in the third list. The function to
achieve this is given below : it is assumed that both lists are sorted in ascending order and the
resultant third list will also be sorted.
merge(struct node *p, struct node *q)
{
    struct node *z;
    z = malloc(sizeof(struct node));
    if (p == NULL && q == NULL) return;
    while (p != NULL && q != NULL)
    {
        if (p->data < q->data)
        {
            z->data = p->data;
            p = p->link;
        }
        else if (p->data > q->data)
        {
            z->data = q->data;
            q = q->link;
        }
        else   /* p->data == q->data: copy the element once, advance both */
        {
            z->data = p->data;
            p = p->link;
            q = q->link;
        }
        z->link = malloc(sizeof(struct node));
        z = z->link;
    }
    while (p != NULL)   /* copy the remainder of the first list */
    {
        z->data = p->data;
        z->link = malloc(sizeof(struct node));
        z = z->link;
        p = p->link;
    }
    while (q != NULL)   /* copy the remainder of the second list */
    {
        z->data = q->data;
        z->link = malloc(sizeof(struct node));
        z = z->link;
        q = q->link;
    }
}
7.5. REVERSING OF LINKED LIST
Suppose we have a linked list pointed to by p. To reverse it we take two more pointers, q and r,
of struct node type. We traverse the list through p, making q trail p and r trail q, and assign
q's link to r. The function to achieve this is given below:
reverse(struct node *p)
{
    struct node *q, *r;
    q = NULL;
    while (p != NULL)   /* p traverses the list till the end */
    {
        r = q;          /* r trails q                 */
        q = p;          /* q trails p                 */
        p = p->link;    /* p moves to the next node   */
        q->link = r;    /* link q to its predecessor  */
    }
}
7.6. DOUBLY LINKED LISTS
In a singly linked list each node provides information about where the next node is. If we are
pointing at a specific node, we can move only in the direction of the links; there is no way of
knowing where the previous node lies in memory. The only way to find the node which precedes a
specific node is to start back at the beginning of the list. The same problem arises when one
wishes to delete an arbitrary node from a singly linked list, since to delete a node easily one
must know its predecessor. This problem can be avoided by using a doubly linked list, in which
each node stores not only the address of the next node but also the address of the previous node.
A node in a doubly linked list has three fields (Fig 3.10):
- Data
- Left Link
- Right Link

| L LINK | DATA | R LINK |
Figure 3.10: Node of Doubly Linked List
Figure 3.10: Node of Doubly Linked List
Left link keeps the address of previous node and Right Link keeps the address of next node.
A doubly linked list has the following property:
p = p->llink->rlink = p->rlink->llink (Figure 3.11)
Figure 3.11
This formula reflects the essential virtue of this structure, namely, that one can go back and
forth with equal ease.
7.6.1 IMPLEMENTATION OF DOUBLY LINKED LIST
Structure of a node of Doubly Linked List can be defined as:
struct node
{
    int data;
    struct node *llink;
    struct node *rlink;
};
One operation that can be performed on doubly linked list is to delete a given node pointed by 'p'.
Function for this operation is given below:
delete(struct node *p)
{
    if (p == NULL)
        printf("Node Not Found");
    else
    {
        p->llink->rlink = p->rlink;
        p->rlink->llink = p->llink;
        free(p);
    }
}
7.7. CIRCULAR LINKED LIST
A circular linked list is another remedy, besides the doubly linked list, for the drawbacks of
the singly linked list. A slight change to the structure of a linear list converts it into a
circular linked list: the link field of the last node contains a pointer back to the first node
rather than NULL (Figure 3.12).
Figure 3.12: Circular Linked List
From any point in such a list it is possible to reach any other point in the list. If we begin at a
given node and traverse the entire list, we ultimately end up at the starting point.
7.8 APPLICATIONS OF THE LINKED LISTS
In computer science linked lists are extensively used in database management systems, process
management, operating systems, editors, etc. Earlier we saw how singly and doubly linked lists
can be implemented using pointers. We also saw that when using arrays, very often the list of
items to be stored is either too short or too big compared to the declared size of the array.
Moreover, during program execution the list cannot grow beyond the size of the declared array.
Also, operations like insertion and deletion at a specified location in a list require a lot of
movement of data, leading to an inefficient and time-consuming algorithm.
The primary advantage of a linked list over an array is that the linked list can grow or shrink
in size during its lifetime. In particular, the linked list's maximum size need not be known in
advance. In practical applications this often makes it possible to have several data structures
share the same space, without paying particular attention to their relative sizes at any time.
The second advantage, flexibility in allowing items to be rearranged efficiently, is gained at
the expense of quick access to an arbitrary item in the list. In an array we can access any item
in the same time, as no traversal is required.
We are not suggesting that you should not use arrays at all. There are several applications
where using arrays is more beneficial than using linked lists. We must select a particular data
structure depending on the requirements.
Summary
- Linked List is a linear data structure based on dynamic memory allocation (i.e. memory
  allocation at run time rather than at compile time).
- A linked list has several advantages over an array because of dynamic memory allocation: it
  does not face problems like overflow and memory wastage due to static memory allocation.
- Each node of a linked list contains the address of its successor node.
- Various operations like insertion, deletion and traversal can be performed on any linked list.
- Two lists can be merged and concatenated by link manipulation.
- A linear or singly linked list cannot be traversed in the reverse direction; this drawback is
  removed in doubly and circular linked lists.
- In a circular linked list the link pointer of the last node points to the first node of the
  list instead of NULL.
- In a doubly linked list each node contains the address of its previous node in addition to the
  address of its next node. This provides two-way traversal.
UNIT 8 GRAPHS
8.1 INTRODUCTION
8.2 ADJACENCY MATRIX AND ADJACENCY LISTS
8.3 GRAPH TRAVERSAL
8.3.1 DEPTH FIRST SEARCH (DFS)
8.3.1.1 IMPLEMENTATION
8.3.2 BREADTH FIRST SEARCH (BFS)
8.3.2.1 IMPLEMENTATION
8.4 SHORTEST PATH PROBLEM
8.5 MINIMAL SPANNING TREE
8.6 OTHER TASKS
8.1. INTRODUCTION
A graph is a collection of vertices and edges, G = (V, E), where V is a set of vertices and E is
a set of edges. An edge is defined as a pair of vertices which are adjacent to each other. When
these pairs are ordered, the graph is known as a directed graph.
Graphs have many properties and they are very important because they represent many practical
situations, like networks. In our current discussion we are interested in the algorithms used
for the most common problems related to graphs: checking connectivity, depth first search and
breadth first search, finding a path from one vertex to another, finding multiple paths between
vertices, finding the number of components of the graph, and finding the critical vertices and
edges.
The basic problem about the graph is its representation for programming.
8.2. ADJACENCY MATRIX AND ADJACENCY LISTS
We can represent a graph using an adjacency matrix, i.e. a matrix whose rows and columns both
represent the vertices. In such a matrix, when the (i, j)th element is 1 we say that there is an
edge between the ith and jth vertices; when there is no edge the value is zero. The other
representation is to prepare an adjacency list for each vertex.
Now we will see an example of a graph and see how an adjacency matrix can be written for
it. We will also see the adjacency relations expressed in form of a linked list.
For Example:
Consider the following graph with seven vertices, numbered 0 to 6.
Fig 4. Graph
The adjacency matrix will be

      0  1  2  3  4  5  6
   0  0  1  0  1  0  0  0
   1  1  0  1  1  0  0  0
   2  0  1  0  0  1  1  0
   3  1  1  0  0  0  0  1
   4  0  1  1  0  0  1  1
   5  0  0  1  0  1  0  1
   6  0  0  0  1  1  1  0

Fig 5. Adjacency matrix representation of the graph in Fig 4
ADJACENCY LIST REPRESENTATION
In this representation, we store a graph as a linked structure. We store all the vertices in a list
and then for each vertex, we have a linked list of its adjacent vertices. Let us see it through an
example. Consider the graph given in Figure 13.
Figure 13
The adjacency list representation needs a list of all of its nodes, i.e.
Figure 14: Adjacency List Structure for Graph in Figure 13.
Note that adjacent vertices may appear in an adjacency list in arbitrary order. Also, an arrow
from v2 to v3 in the list linked to v1 does not mean that v2 and v3 are adjacent.
The adjacency list representation is better for sparse graphs because the space required is
O(V + E), as contrasted with the O(V^2) required by the adjacency matrix representation.
8.3 GRAPH TRAVERSAL
A graph traversal means visiting all the nodes of the graph. Graph traversal may be needed in
many application areas, and there may be many methods for visiting the vertices of a graph. The
two traversal methods discussed in this section are the most commonly used, and are also found
to be efficient:
A) Depth First Search or DFS; and
B) Breadth First Search or BFS
8.3.1. DEPTH FIRST SEARCH (DFS)
In graphs, we do not have any start vertex or any special vertex singled out to start traversal from.
Therefore the traversal may start from any arbitrary vertex.
We start with, say, vertex v. An adjacent vertex is selected and a depth first search is
initiated from it: let v1, v2, ..., vk be the vertices adjacent to v. We may select any vertex
from this list, say v1. Now all the vertices adjacent to v1 are identified and visited; next v2
is selected and all its adjacent vertices are visited, and so on. This process continues until
all the vertices have been visited. It is quite possible that we reach an already-traversed
vertex a second time; therefore we have to set a flag to check whether a vertex has already been
visited. Let us see this through an example. Consider the following graph.
Fig 15: Example Graph for DFS
Let us start with v1.
Its adjacent vertices are v2, v8 and v3. Let us pick v2.
Its adjacent vertices are v1, v4, v5. v1 is already visited. Let us pick v4.
Its adjacent vertices are v2, v8.
v2 is already visited. Let us visit v8. Its adjacent vertices are v4, v5, v1, v6, v7.
v4 and v1 are already visited. Let us traverse v5.
Its adjacent vertices are v2, v8. Both are already visited, therefore we backtrack.
We had v6 and v7 unvisited in the list of v8. We may visit either; we visit v6.
Its adjacent vertices are v8 and v3. Obviously the choice is v3.
Its adjacent vertices are v1, v7. We visit v7.
All the adjacent vertices of v7 are already visited, so we backtrack and find that we have
visited all the vertices.
Fig. 16 : Example Graphs for DFS
Therefore the sequence of traversal is
v1, v2, v4, v8, v5, v6, v3, v7.
This is not a unique sequence; it is not the only one possible using this traversal method.
Let us consider another graph, as given in Figure 16.
Is v1, v2, v3, v5, v4, v6 a traversal sequence using the DFS method?
We may implement depth first search using a stack, pushing all unvisited vertices adjacent to
the one just visited and popping the stack to find the next vertex to visit.
8.3.1.1. IMPLEMENTATION
We use an array val[V] to record the order in which the vertices are visited. Each entry in the
array is initialized to the value unseen to indicate that no vertex has yet been visited. The
goal is to systematically visit all the vertices of the graph, setting the val entry for the ith
vertex visited to i, for i = 1, 2, ..., V. The following program uses a procedure visit that
visits all the vertices in the same connected component as the vertex given as the argument.
void search()
{
    int k;
    for (k = 1; k <= V; k++) val[k] = unseen;
    for (k = 1; k <= V; k++)
        if (val[k] == unseen) visit(k);
}
The first for loop initializes the val array. Then, visit is called for the first vertex, which results in
the val values for all the vertices connected to that vertex being set to values different from
unseen. Then search scans through the val array to find a vertex that hasn't been seen yet and
calls visit for that vertex, continuing in this way until all vertices have been visited. Note that this
method does not depend on how the graph is represented or how visit is implemented.
First we consider a recursive implementation of visit for the adjacency list representation: to visit
a vertex, we check all its edges to see if they lead to vertices that have not yet been seen; if so, we
visit them.
void visit(int k)   /* DFS, adjacency lists */
{
    struct node *t;
    val[k] = ++id;
    for (t = adj[k]; t != z; t = t->next)
        if (val[t->v] == unseen) visit(t->v);
}
We move to a stack-based implementation:

Stack stack(maxV);
void visit(int k)   /* non-recursive DFS, adjacency lists */
{
    struct node *t;
    stack.push(k);
    while (!stack.empty())
    {
        k = stack.pop();
        val[k] = ++id;
        for (t = adj[k]; t != z; t = t->next)
            if (val[t->v] == unseen)
            {
                stack.push(t->v);
                val[t->v] = 1;   /* touched, but not yet visited */
            }
    }
}
Vertices that have been touched but not yet visited are kept on a stack. To visit a vertex, we
traverse its edges and push onto the stack any vertex that has not yet been visited and that is
not already on the stack. In the recursive implementation, the bookkeeping for the "partially
visited" vertices is hidden in the local variable t in the recursive procedure. We could
implement this directly by maintaining pointers (corresponding to t) into the adjacency lists,
and so on.
Depth-first search immediately solves some basic graph-processing problems. For example, the
procedure visits the connected components in turn; the number of connected components is the
number of times visit is called in the last line of the program. Testing whether a graph has a
cycle is also a trivial modification of the above program: a graph has a cycle if and only if a
node that is not unseen is discovered in visit. That is, if we encounter an edge pointing to a
vertex that we have already visited, then we have a cycle.
8.3.2. BREADTH FIRST SEARCH (BFS)
In DFS we pick one of the adjacent vertices, visit all of its adjacent vertices and backtrack to
visit the remaining unvisited adjacent vertices. In BFS, we first visit all the adjacent vertices
of the start vertex and then visit all the unvisited vertices adjacent to these, and so on. Let
us consider the same example, given in Figure 15. We start, say, with v1. Its adjacent vertices
are v2, v8, v3. We visit them one by one. We then pick one of these, say v2. The unvisited
vertices adjacent to v2 are v4, v5. We visit both. We go back to the remaining visited vertices
of v1 and pick one of those, say v3. The unvisited vertices adjacent to v3 are v6, v7. There are
no more unvisited vertices adjacent to v8, v4, v5, v6 and v7.
Figure 17
Thus, the sequence so generated is v1, v2, v8, v3, v4, v5, v6, v7. Here we need a queue instead
of a stack to implement it: we add unvisited vertices adjacent to the one just visited at the
rear, and read from the front to find the next vertex to visit.
8.3.2.1. IMPLEMENTATION
To implement breadth-first search, we change the stack operations to queue operations in the
stack-based search program above:

Queue queue(maxV);
void visit(int k)   /* BFS, adjacency lists */
{
    struct node *t;
    queue.put(k);
    while (!queue.empty())
    {
        k = queue.get();
        val[k] = ++id;
        for (t = adj[k]; t != z; t = t->next)
            if (val[t->v] == unseen)
            {
                queue.put(t->v);
                val[t->v] = 1;
            }
    }
}
The contrast between depth-first and breadth-first search is quite evident when we consider a
larger graph.
In both cases, the search starts at the node at the bottom left. Depth-first search wends its
way through the graph, storing on the stack the points where other paths branch off;
breadth-first search "sweeps through" the graph, using a queue to remember the frontier of
visited places. Depth-first search "explores" the graph by looking for new vertices far away
from the start point, taking closer vertices only when dead ends are encountered; breadth-first
search completely covers the area close to the starting point, moving farther away only when
everything close has been looked at. Again, the order in which the nodes are visited depends
largely upon the order in which the edges appear in the input, and upon the effects of this
ordering on the order in which vertices appear on the adjacency lists.
Depth-first search was first stated formally hundreds of years ago as a method for traversing
mazes. Depth-first search is appropriate for one person looking for something in a maze because
the "next place to look" is always close by; breadth-first search is more like a group of people
looking for something by fanning out in all directions.
8.4. SHORTEST PATH PROBLEM
We have seen in the graph traversals that we can travel through the edges of the graph. It is
very likely in applications that these edges have weights attached to them. Such a weight may
reflect distance, time, or some other quantity corresponding to the cost we incur when we travel
through that edge. For example, in the graph in Figure 18, we can go from Delhi to Andaman
Nicobar through Madras at a cost of 7 or through Calcutta at a cost of 5. (These numbers may
reflect the airfare in thousands.) In these and many other applications, we are often required
to find a shortest path, i.e. a path having the minimum weight between two vertices. In this
section, we shall discuss this problem of finding the shortest path for directed graphs in which
every edge has a non-negative weight attached.
Figure 18: A graph connecting four cities
Let us at this stage recall how we define a path. A path in a graph is a sequence of vertices
such that there is an edge we can follow between each consecutive pair of vertices. The length
of the path is the sum of the weights of the edges on that path. The starting vertex of the path
is called the source vertex and the last vertex is called the destination vertex. A shortest
path from vertex v to vertex w is a path for which the sum of the weights of the arcs or edges
on the path is minimum.
Here you must note that a path that looks longer in terms of the number of edges and vertices
visited may at times actually be shorter cost-wise.
Also, we may have two kinds of shortest path problems. One is where we have a single source
vertex and seek a shortest path from this source vertex v to every other vertex of the graph.
This is called the single source shortest path problem.
Consider the weighted graph in Figure 19 with 8 nodes A,B,C,D,E,F,G and H.
Fig. 19 : A Weighted Graph
There are many paths from A to H.
Length of path AFDEH = 1 + 3 + 4 + 6 = 14
Another path from A to H is ABCEH. Its length is 2+2+3+6 = 13.
We may look for a path with length shorter than 13, if one exists. For graphs with a small
number of vertices and edges, one may examine all the possible paths to find the shortest one.
Further, we would have to repeat this exercise to find the shortest paths from A to all the
remaining vertices. Thus, this method is obviously not cost-effective for even a small-sized
graph.
There exists an algorithm for solving this problem. It works as explained below.
Let us consider the graph in Figure 19 once again.
1. We start with source vertex A.
2. We locate the vertex closest to it, i.e. we find a vertex among the adjacent vertices of A
for which the length of the edge is minimum. Here B and F are the adjacent vertices of A, and
Length(AB) > Length(AF), therefore we choose F.
Fig. 20 (a)
3. Now we look at all the adjacent vertices of the newly added vertex (excluding the vertex just
attached) together with the remaining adjacent vertices of the earlier vertices, i.e. we have D,
E and G (as adjacent vertices of F) and B (as the remaining adjacent vertex of A). We again
compare the lengths of the paths from the source vertex to these unattached vertices, i.e.
compare Length(AB), Length(AFD), Length(AFG) and Length(AFE). We find Length(AB) the minimum, so
we choose vertex B.
Fig. 20 (b)
4. We go back to step 3 and continue till we exhaust all the vertices.
Let us see how it works for the above example.

Vertex that may be attached | Path from A | Length
D                           | ABD         | 4
D                           | AFD         | 4
G                           | AFG         | 6
C                           | ABC         | 4
E                           | AFE         | 4
E                           | ABE         | 6
We may choose D, C or E. We choose, say, D through B.
Fig. 20 (c)

G | AFG  | 6
C | ABC  | 4
E | AFE  | 4
E | ABE  | 6
E | ABDE | 8
We may choose C or E. We choose, say, C.

G | AFG  | 6
E | AFE  | 4
E | ABE  | 6
E | ABDE | 8
E | ABCE | 7
We choose path AFG
Therefore the shortest paths from source vertex A to all the other vertices are
B: AB
C: ABC
D: ABD
E: AFE
F: AF
G: AFG
H: ABCEH
Fig. 20 (d)
8.5. MINIMAL SPANNING TREE
You have already learnt in Section 4.3 that a tree is a connected acyclic graph. If we are given
a graph G = (V, E), we may have more than one tree structure for it. Let us see what we mean by
this statement. Consider the graph given in Figure 21; some of the tree structures for this
graph are given in Figures 22(a), 22(b) and 22(c).
Fig.22
You may notice that they differ from each other significantly; however, for each structure
(i) the vertex set is the same as that of graph G;
(ii) the edge set is a subset of G(E); and
(iii) there is no cycle.
Such a structure is called a spanning tree of the graph. Let us formally define a spanning tree.
A tree T is a spanning tree of a connected graph G(V, E) if
1) every vertex of G belongs to an edge in T, and
2) the edges in T form a tree.
Let us see how we construct a spanning tree for a given graph. Take any vertex v as an initial
partial tree and add edges one by one so that each edge joins a new vertex to the partial tree. In
general, if there are n vertices in the graph, we shall construct the spanning tree in (n-1) steps, i.e.
(n-1) edges need to be added.
Frequently we encounter weighted graphs, and we need to build a subgraph that is
connected and includes every vertex in the graph. To construct such a subgraph with least
weight or least cost, we must not have cycles in it. Therefore, what we actually need to construct is a
spanning tree with minimum cost, or a Minimal Spanning Tree.
You must notice the difference between the shortest path problem and the Minimal Spanning tree
problem.
Let us see this difference through an example. Consider the graph given in Figure 23.
Figure 23
This could, for instance, represent the feasible communication lines between 7 cities, where the cost
of an edge is interpreted as the actual cost of building that link (in lakhs of rupees). The
Minimal Spanning Tree problem for this situation would be building a least-cost
communication network. The shortest path problem (one source - all destinations) would be
identifying one city and finding the least-cost communication lines from this city to all the
other cities. Therefore, shortest path trees are rooted, while MSTs are free trees. Also, MSTs are
defined only on undirected graphs.
BUILDING A MINIMUM SPANNING TREE
As stated in an earlier paragraph, we need to select n-1 edges in G (of n vertices) such that these
form an MST of G. We begin by first selecting an edge with least cost. It can be between any two
vertices of G. Subsequently, from the set of remaining edges, we select another least cost edge,
and so on. Each time an edge is picked, we determine whether or not the inclusion of this edge
into the spanning tree being constructed creates a cycle. If it does, this edge is discarded. If no
cycle is created, this edge is included in the spanning tree being constructed.
Let us work it out through an example. Consider the graph given in Figure 23. The minimum cost
edge is AB, which connects vertices A and B. Let us choose it.
Figure 24(a)
Now we have
Vertices that may be added    Edge    Cost
F                             BF      8
                              AF      7
G                             BG      9
C                             AC      4
D                             AD      5
The least cost edge is AC; therefore we choose AC.
Now we have
Figure 24(b)
Vertices that may be added    Edge    Cost
F                             BF      8
                              AF      7
G                             BG      9
                              CG      10
D                             AD      5
E                             CE      8
The least cost edge is AD; therefore we choose AD.
Now we have
Figure 24 (c)
Vertices that may be added    Edge    Cost
F                             BF      8
                              AF      7
                              DF      10
G                             BG      9
                              CG      10
E                             CE      8
                              DE      9
AF is the minimum cost edge; therefore, we add it to the partial tree.
Now we have
Figure 24 (d)
Vertices that may be added    Edge    Cost
G                             BG      9
                              CG      10
E                             CE      8
                              DE      9
                              FE      11
The obvious choice is CE.
Figure 24 (e)
The only vertex left is G, and the minimum cost edge connecting it to the tree constructed so far
is BG. Therefore we add it, and the minimal spanning tree so constructed has cost
3+4+5+7+8+9 = 36; it is given in Figure 24(f).
Figure 24 (f)
This method is called Kruskal's method of creating a minimal spanning tree.
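The edge-by-edge construction described above can be sketched in C, using a simple union-find structure to detect cycles. The Edge type, the vertex bound MAXV and the selection sort are illustrative assumptions; the costs in the usage check below are the ones read off Figure 23 in the worked example (AB 3, AC 4, AD 5, AF 7, BF 8, CE 8, BG 9, DE 9, CG 10, DF 10, FE 11).

```c
#include <assert.h>

#define MAXV 26                      /* assumed upper bound on vertices */

typedef struct { int u, v, cost; } Edge;

static int parent[MAXV];             /* union-find forest for cycle detection */

static int find_root(int x)
{
    while (parent[x] != x)
        x = parent[x];
    return x;
}

/* Builds a spanning tree over n vertices from the ne edges in e[] and
   returns its total cost.  Edges are taken in order of increasing cost;
   an edge whose endpoints already share a root would create a cycle and
   is discarded. */
int kruskal(Edge e[], int ne, int n)
{
    for (int i = 0; i < n; i++)
        parent[i] = i;
    for (int i = 0; i < ne; i++)     /* selection sort by cost (fine for small ne) */
        for (int j = i + 1; j < ne; j++)
            if (e[j].cost < e[i].cost) { Edge t = e[i]; e[i] = e[j]; e[j] = t; }
    int total = 0, picked = 0;
    for (int i = 0; i < ne && picked < n - 1; i++) {
        int ru = find_root(e[i].u), rv = find_root(e[i].v);
        if (ru != rv) {              /* no cycle: include the edge */
            parent[ru] = rv;
            total += e[i].cost;
            picked++;
        }                            /* otherwise discard it */
    }
    return total;
}
```

With vertices A..G numbered 0..6, this reproduces the tree of cost 36 found above.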
8.6. OTHER TASKS FOR THE GRAPHS:
Some other functions which are associated with graphs for solving problems are:
a. To find the degree of a vertex.
The degree of a vertex is defined as the number of vertices which are adjacent to the
given vertex; in other words, it is the number of 1s in the row of that vertex in the adjacency
matrix, or the number of nodes present in the adjacency list of that vertex.
b. To find the number of edges.
By the handshaking lemma, we know that the number of edges in a graph is half
the sum of the degrees of all the vertices.
c. To print a path from one vertex to another.
Here we are required to follow the above BFS algorithm such that one of the
vertices is the starting vertex for the algorithm, and the process continues till we reach
the second vertex.
d. To print multiple paths from one vertex to another.
The previous algorithm should be used in a modified form so that we can get
multiple paths.
e. To find the number of components in a graph.
In this case we again use BFS and check the visited array; if it does not
contain all the vertices marked as visited, then we increment the component counter by 1
and restart the BFS from any vertex which is not yet visited. Repeat till all the vertices
are visited.
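The component-counting procedure in (e) can be sketched in C with an adjacency matrix; the vertex bound MAXV and the fixed-size queue are assumptions adequate for small examples.

```c
#include <assert.h>

#define MAXV 20                          /* assumed upper bound on vertices */

/* Counts the connected components of an undirected graph on n vertices,
   given as an adjacency matrix, by restarting BFS from each unvisited
   vertex and incrementing the component counter. */
int count_components(int n, int adj[][MAXV])
{
    int visited[MAXV] = {0}, queue[MAXV];
    int components = 0;
    for (int s = 0; s < n; s++) {
        if (visited[s])
            continue;                    /* unvisited vertex: a new component */
        components++;
        int head = 0, tail = 0;
        visited[s] = 1;
        queue[tail++] = s;
        while (head < tail) {            /* standard BFS from s */
            int u = queue[head++];
            for (int v = 0; v < n; v++)
                if (adj[u][v] && !visited[v]) {
                    visited[v] = 1;
                    queue[tail++] = v;
                }
        }
    }
    return components;
}
```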
f. To find the critical vertices and edges.
A vertex which, when removed from the graph, leaves the graph disconnected
is termed a critical vertex. To find the critical vertices we remove each
vertex in turn and check the number of components of the remaining graph. If the
remaining graph is not connected, the vertex which was removed is a critical vertex.
Similarly, if the removal of an edge from the graph increases the number of
components, that edge is known as a critical edge. To check whether a particular
vertex or edge is critical, remove it and rerun the program for finding the
number of components.
SUMMARY
Graphs provide an excellent way to describe the essential features of many applications. Graphs
are mathematical structures and are found to be useful in problem solving. They may be
implemented in many ways by the use of different kinds of data structures. Graph traversals,
Depth First as well as Breadth First, are also required in many applications. The existence of cycles
makes graph traversal challenging, and this leads to finding some kind of acyclic subgraph of a
graph. That is, we actually arrive at the problems of finding a shortest path and of finding a
minimum cost spanning tree.
In this Unit, we have built up the basic concepts of the graph theory and of the problems, the
solutions of which may be programmed.
Some graph problems that arise naturally and are easy to state seem to be quite difficult, and no
good algorithms are known to solve them. For example, no efficient algorithm is known for finding
the minimum-cost tour that visits each vertex in a weighted graph. This problem, called the
travelling salesman problem, belongs to a large class of difficult problems.
Other graph problems may well have efficient algorithms, though none have been found. An
example of this is the graph isomorphism problem: determining whether two graphs could be made
identical by renaming vertices. Efficient algorithms are known for this problem for many special
types of graphs, but the general problem remains open. In short, there is a wide spectrum of
problems and algorithms for dealing with graphs. But many relatively easy problems do arise
quite often, and the graph algorithms we studied in this unit serve well in a great variety of
applications.
UNIT 9 TREES
9.1. INTRODUCTION
9.1.1. OBJECTIVES
9.1.2. BASIC TERMINOLOGY
9.1.3. PROPERTIES OF A TREE
9.2. BINARY TREES
9.2.1. PROPERTIES OF BINARY TREES
9.2.2. IMPLEMENTATION
9.2.3. TRAVERSALS OF A BINARY TREE
9.2.3.1. IN ORDER TRAVERSAL
9.2.3.2. POST ORDER TRAVERSAL
9.2.3.3. PREORDER TRAVERSAL
9.3. BINARY SEARCH TREES (BST)
9.3.1. INSERTION IN BST
9.3.2. DELETION OF A NODE
9.3.3. SEARCH FOR A KEY IN BST
9.4. HEIGHT BALANCED TREE
9.5. B-TREE
9.5.1. INSERTION
9.5.2. DELETION
9.1 INTRODUCTION
In the previous block we discussed Arrays, Lists, Stacks, Queues and Graphs. All these data
structures except Graphs are linear data structures. Graphs are classified in the non-linear
category of data structures. At this stage you may recall from the previous block on Graphs that
an important class of Graphs is called Trees. A Tree is an acyclic, connected graph. A Tree
contains no loops or cycles. The concept of trees is one of the most fundamental and useful
concepts in computer science. Trees have many variations, implementations and applications.
Trees find their use in applications such as compiler construction, database design, windowing
systems, operating system programs, etc. A tree structure is one in which items of data are related by
edges. A very common example is the ancestor tree given in Figure 1. This tree shows the
ancestors of LAKSHMI. Her parents are VIDYA and RAMKRISHNA; RAMKRISHNA's parents are
SUMATHI and VIJAYANANDAN, who are also the grandparents of LAKSHMI (on the father's side);
VIDYA's parents are JAYASHRI and RAMAN, and so on.
Figure 1: A Family Tree I
We can also have another form of ancestor tree as given in Figure 2.
Figure 2: A Family Tree II
We could have also generated the image of tree in Figure 1 as
Figure 3: A Family Tree III
All the above structures are called rooted trees. A tree is said to be rooted if it has one node, called
the root, that is distinguished from the other nodes.
In Figure 1 the root is LAKSHMI,
In Figure 2 the root is KALYANI and
In Figure 3 the root is LAKSHMI.
In this Unit our attention will be restricted to rooted trees.
In this Unit, first we shall consider the basic definitions and terminology associated with trees,
examine some important properties, and look at ways of representing trees within the computer.
In later sections, we shall see many algorithms that operate on these fundamental data
structures.
9.1.1 OBJECTIVES
At the end of this unit you shall be able to:
· define a tree, a rooted tree, a binary tree, and a binary search tree;
· differentiate between a general tree and a binary tree;
· describe the properties of a binary search tree;
· code the insertion, deletion and searching of an element in a binary search tree;
· show how an arithmetic expression may be stored in a binary tree;
· build and evaluate an expression tree;
· code the preorder, in order, and post order traversal of a tree.
9.1.2. BASIC TERMINOLOGY
Trees are encountered frequently in everyday life. An example is found in the organizational chart
of a large corporation. Computer Science in particular makes extensive use of trees. For example,
in databases trees are useful in organizing and relating data. They are also used for scanning, parsing,
generation of code and evaluation of arithmetic expressions in compiler design.
We usually draw trees with the root at the top. Each node (except the root) has exactly one node
above it, which is called its parent; the nodes directly below a node are called its children. We
sometimes carry the analogy to family trees further and refer to the grandparent or the sibling of a
node.
Let us formally define some of the tree-related terms.
A tree is a non-empty collection of vertices and edges that satisfies certain requirements.
A vertex is a simple object (also referred to as a node) that can have a name and can carry other
associated information.
An edge is a connection between two vertices.
A tree may, therefore, be defined as a finite set of one or more vertices such that
· there is one specially designated vertex called the ROOT, and
· the remaining vertices are partitioned into a collection of sub-trees, each of which is also
a tree.
In Figure 2 the root is KALYANI. The three sub-trees are rooted at BABU, RAJAN and JAYASHRI.
The sub-tree with JAYASHRI as root has two sub-trees rooted at SUKANYA and VIDYA, and so on.
The nodes of a tree have a parent-child relationship.
The root does not have a parent, but each of the other nodes has a parent node associated with
it. A node may or may not have children. A node that has no children is called a leaf node.
A line from a parent to a child node is called a branch or an edge. If a tree has n nodes, one of
which is the root, then there are n-1 branches. This follows from the fact that each branch
connects some node to its parent, and every node except the root has one parent.
Nodes with the same parent are called siblings. Consider the tree given in Figure 4.
Figure 4: A Tree
K, L, and M are all siblings. B, C, D are also siblings.
A path in a tree is a list of distinct vertices in which successive vertices are connected by edges in
the tree. There is exactly one path between the root and each of the other nodes in the tree. If
there is more than one path between the root and some node, or if there is no path between the
root and some node, then what we have is a graph, not a tree.
Nodes with no children are called leaves, or terminal nodes. Nodes with at least one child are
sometimes called nonterminal nodes. We sometime refer to nonterminal nodes as internal nodes
and terminal nodes as external nodes.
The length of a path is the number of branches on the path. Further, if there is a path from n1 to
n2, then n1 is an ancestor of n2 and n2 is a descendant of n1. Also, there is a path of length zero from every node
to itself, and there is exactly one path from the root to each node.
Let us now see how these terms apply to the tree given in Figure 4.
A path from A to K is A-D-G-J-K and the length of this path is 4.
A is an ancestor of K and K is a descendant of A. All the other nodes on the path are also descendants
of A and ancestors of K.
The depth of any node ni is the length of the path from the root to ni. Thus the root is at depth
0 (zero). The height of a node ni is the length of the longest path from ni to a leaf. Thus all leaves are at height
zero. Further, the height of a tree is the same as the height of its root. For the tree in Figure 4, F is at
height 1 and depth 2. D is at height 3 and depth 1. The height of the tree is 4. The depth of a node is
sometimes also referred to as the level of a node.
A set of trees is called a forest; for example, if we remove the root and the edges connecting it from
the tree in Figure 4, we are left with a forest consisting of three trees rooted at A, D and G, as
shown in Figure 5. Let us now list some of the properties of a tree:
Figure 5: A Forest (sub-tree)
9.1.3. PROPERTIES OF A TREE
1. Any node can be the root of the tree: each node in a tree has the property that there is exactly
one path connecting that node with every other node in the tree. Technically, our definition,
in which the root is identified, pertains to a rooted tree; a tree in which the root is not identified is
called a free tree.
2. Each node, except the root, has a unique parent, and every edge connects a node to its
parent. Therefore, a tree with N nodes has N-1 edges.
9.2. BINARY TREES
By definition, a Binary tree is a tree which is either empty or consists of a root node and two
disjoint binary trees called the left subtree and right subtree. In Figure 6, a binary tree T is
depicted with a left subtree, L(T) and a right subtree R(T).
Figure 6: A Binary Tree
In a binary tree, no node can have more than two children. Binary trees are special cases of
general trees. The terminology we have discussed in the previous section applies to binary trees
also.
9.2.1. PROPERTIES OF BINARY TREES:
1. Recall from the previous section the definition of internal and external nodes. A binary
tree with N internal nodes has a maximum of (N + 1) external nodes. The root is considered
an internal node.
2. The external path length of any binary tree with N internal nodes is 2N greater than the
internal path length.
3. The height of a full binary tree with N internal nodes is about log2 N.
As we shall see, binary trees appear extensively in computer applications, and performance is best
when the binary trees are full (or nearly full). You should note carefully that, while every binary
tree is a tree, not every tree is a binary tree.
A full binary tree or a complete binary tree is a binary tree in which all internal nodes have degree
two and all leaves are at the same level. Figure 6a illustrates a full binary tree.
The degree of a node is the number of non-empty sub trees it has. A leaf node has degree zero.
Figure 6(a): A full binary tree
9.2.2. IMPLEMENTATION
A binary tree can be implemented as an array of nodes or as a linked list. The most common and
easiest way to implement a tree is to represent a node as a record consisting of the data and a
pointer to each child of the node. Because a binary tree node has at most two children, we can keep
direct pointers to them. A binary tree node declaration in Pascal may look like:
type
  tree_ptr = ^tree_node;
  tree_node = record
    data : data_type;
    left : tree_ptr;
    right : tree_ptr
  end;
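For the C-oriented portions of this course, an equivalent declaration might look as follows; int data is an assumption standing in for data_type, and make_node is a hypothetical helper added for illustration.

```c
#include <assert.h>
#include <stdlib.h>

/* C counterpart of the Pascal tree_node record; the data type is
   assumed to be int for illustration */
typedef struct tree_node {
    int data;
    struct tree_node *left;
    struct tree_node *right;
} tree_node;

/* Allocates and initialises a leaf node; returns NULL if no space is
   available (compare operation 3 in Section 9.3). */
tree_node *make_node(int value)
{
    tree_node *n = malloc(sizeof *n);
    if (n != NULL) {
        n->data = value;
        n->left = n->right = NULL;
    }
    return n;
}
```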
Let us now consider a special case of the binary tree. It is called a 2-tree or a strictly binary tree. It is
a non-empty binary tree in which either both sub trees are empty or both sub trees are 2-trees.
For example, the binary trees in Figures 7(a) and 7(b) are 2-trees, but the trees in Figures 7(c) and
7(d) are not 2-trees.
Figure 7: (a) and (b) 2-trees; (c) and (d) not 2-trees
Binary trees are most commonly represented by linked lists. Each node can be considered as
having 3 elementary fields: a data field, a left pointer pointing to the left sub tree, and a right
pointer pointing to the right sub tree.
Figure 8 contains an example of linked storage representation of a binary tree (shown in figure 6).
Figure 8: Linked Representation of a Binary Tree
A binary tree is said to be complete (Figure 6(a)) if it contains the maximum number of nodes
possible for its height. In a complete binary tree:
- The number of nodes at level 0 is 1.
- The number of nodes at level 1 is 2.
- The number of nodes at level 2 is 4, and so on.
- The number of nodes at level i is 2^i. Therefore a complete binary tree with k levels
contains 2^0 + 2^1 + ... + 2^(k-1) = 2^k - 1 nodes.
9.2.3. TRAVERSALS OF A BINARY TREE
To traverse a tree is to visit each node exactly once. In this section we shall discuss traversal
of a binary tree. It is useful in many applications, for example, in searching for particular nodes.
Compilers commonly build binary trees in the process of scanning, parsing, generating code and
evaluating arithmetic expressions. Let T be a binary tree. There are a number of different ways
to proceed; the methods differ primarily in the order in which they visit the nodes. The four
different traversals of T are In order, Post order, Preorder and Level-by-level traversal.
9.2.3.1. IN ORDER TRAVERSAL
It follows the general strategy of Left-Root-Right. In this traversal, if T is not empty, we first
traverse (in order) the left sub tree;
then visit the root node of T, and
then traverse (in order) the right sub tree.
Consider the binary tree given in Figure 9.
Figure 9. Expression Tree
This is an example of an expression tree for (A + B*C)-(D*E)
A binary tree can be used to represent arithmetic expressions if the node value can be either
operators or operand values and are such that:
· each operator node has exactly two branches
· each operand node has no branches, such trees are called expression trees.
The tree T at the start is rooted at '-'.
Since left(T) is not empty, current T becomes rooted at '+'.
Since left(T) is not empty, current T becomes rooted at 'A'.
Since left(T) is empty, we visit the root, i.e. A.
We move back and access T's root, i.e. '+'.
We now perform the in order traversal of right(T).
Current T becomes rooted at '*'.
Since left(T) is not empty, current T becomes rooted at 'B'; since left(T) is empty, we visit its root,
i.e. B; we check right(T), which is empty, therefore we move back to the parent tree. We visit its root,
i.e. '*'.
Now the in order traversal of right(T) is performed, which gives us 'C'. We move back up and visit
the original T's root, i.e. '-', and perform the in order traversal of right(T), which gives us 'D', '*'
and 'E'. Therefore, the complete listing is
A+B*C-D*E
You may note that the expression is in infix notation. The in order traversal produces
a (parenthesized) left expression, then prints out the operator at the root and then a (parenthesized)
right expression. This method of traversal is probably the most widely used. The following is a
Pascal procedure for the in order traversal of a binary tree:
procedure INORDER (TREE : BINTREE);
begin
  if TREE <> nil
  then begin
    INORDER (TREE^.LEFT);
    writeln (TREE^.DATA);
    INORDER (TREE^.RIGHT)
  end
end;
Figure 10 gives a trace of the in order traversal of the tree given in Figure 9: each sub tree is
entered in turn (empty sub trees contribute nothing), and the outputs appear in the order
A, +, B, *, C, -, D, *, E.
Figure 10: Trace of in order traversal of tree given in figure 9
Please notice that this procedure, like the definition for traversal is recursive.
9.2.3.2. POST ORDER TRAVERSAL
In this traversal we first traverse Left(T) (in post order); then traverse Right(T) (in post order); and
finally visit the root. It is a Left-Right-Root strategy, i.e.
Traverse the left sub tree in post order.
Traverse the right sub tree in post order.
Visit the root.
For example, a post order traversal of the tree given in Figure 9 would be
ABC*+DE*-
You may notice that it is the postfix notation of the expression
(A + (B*C)) - (D*E)
We leave the details of the post order traversal method as an exercise. You may also implement it
using Pascal or C language.
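As a starting point for that exercise, here is one possible C sketch. The node layout mirrors the Pascal record given earlier, with char data assumed so that the expression tree of Figure 9 can be reproduced; mk is a hypothetical constructor helper.

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

typedef struct node {
    char data;
    struct node *left, *right;
} node;

/* small constructor used to build example trees */
static node *mk(char d, node *l, node *r)
{
    node *n = malloc(sizeof *n);
    n->data = d;
    n->left = l;
    n->right = r;
    return n;
}

/* Left-Right-Root: appends the post order listing of t at *out */
void postorder(const node *t, char **out)
{
    if (t == NULL)
        return;
    postorder(t->left, out);     /* traverse the left sub tree */
    postorder(t->right, out);    /* traverse the right sub tree */
    *(*out)++ = t->data;         /* visit the root */
}
```

Building the expression tree of Figure 9 and running postorder on it yields the postfix string ABC*+DE*-.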
9.2.3.3. PREORDER TRAVERSAL
In this traversal, we visit the root first; then recursively perform a preorder traversal of Left(T),
followed by a preorder traversal of Right(T); i.e. a Root-Left-Right strategy:
Visit the root.
Traverse the left sub tree in preorder.
Traverse the right sub tree in preorder.
A preorder traversal of the tree given in Figure 9 would yield
-+A*BC*DE
It is the prefix notation of the expression
(A + (B*C)) - (D*E)
Preorder traversal is employed in Depth First Search. (See Unit 4, Block 4). For example, suppose
we make a depth first search of the binary tree given in Figure 11.
Figure 11: Binary tree example for depth first search
We shall visit a node; go left as deeply as possible before searching to its right. The order in which
the nodes would be visited is
ABDECFHIJKG
which is same as preorder traversal.
LEVEL BY LEVEL TRAVERSAL
In this method we traverse level-wise, i.e. we first visit the node at level 0, i.e. the root; there is
just one. Then we visit the nodes at level 1 from left to right; there are at most two. Then
we visit all the nodes at level 2 from left to right, and so on. For example, the level by level
traversal of the binary tree given in Figure 11 will yield
ABCDEFGHIJK
This is same as breadth first search (see Unit 4, Block 4). This traversal is different from other
three traversals in the sense that it need not be recursive, therefore, we may use queue kind of a
data structure to implement it, while we need stack kind of data structure for the earlier three
traversals.
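The queue-based approach just described can be sketched in C; the fixed queue capacity and the mk constructor helper are assumptions adequate for small examples.

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

typedef struct node {
    char data;
    struct node *left, *right;
} node;

/* small constructor used to build example trees */
static node *mk(char d, node *l, node *r)
{
    node *n = malloc(sizeof *n);
    n->data = d;
    n->left = l;
    n->right = r;
    return n;
}

/* Visits nodes level by level using an array-based queue and writes the
   labels, in visiting order, into out as a string. */
void level_order(node *root, char *out)
{
    node *queue[64];                 /* assumed capacity for small trees */
    int head = 0, tail = 0, k = 0;
    if (root != NULL)
        queue[tail++] = root;
    while (head < tail) {
        node *u = queue[head++];     /* dequeue the next node */
        out[k++] = u->data;          /* visit it */
        if (u->left  != NULL) queue[tail++] = u->left;   /* enqueue children */
        if (u->right != NULL) queue[tail++] = u->right;  /* left to right */
    }
    out[k] = '\0';
}
```

On the expression tree of Figure 9 this visits the nodes level by level, giving -+*A*DEBC.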
9.3. BINARY SEARCH TREES (BST)
A Binary Search Tree, BST, is an ordered binary tree T such that either it is an empty tree or
· each data value in its left sub tree is less than the root value,
· each data value in its right sub tree is greater than the root value, and
· the left and right sub trees are again binary search trees.
Figure 12(a) depicts a binary search tree, while the one in
Figure 12(b) is not a binary search tree.
Figure 12(a): Binary Search Tree
Figure 12(b): Binary tree but not binary search tree
Clearly, duplicate items are not allowed in a binary search tree. You may also notice that an in
order traversal of a BST yields a sorted list in ascending order.
Operations of BST
We now give a list of the operations that are usually performed on a BST.
1. Initialization of a BST: This operation makes an empty tree.
2. Check whether BST is empty: This operation checks whether the tree is empty.
3. Create a node for the BST: This operation allocates memory space for the new node;
it returns with an error if no space is available.
4. Retrieve a node's data.
5. Update a node's data.
6. Insert a node in BST.
7. Delete a node (or sub tree) of a BST.
8. Search for a node in BST.
9. Traverse (in inorder, preorder, or post order) a BST.
We shall describe some of the operations in detail.
9.3.1. INSERTION IN BST
Inserting a node in the tree: To insert a node in a BST, we must check whether the tree already
contains any nodes. If the tree is empty, the node is placed in the root node. If the tree is not empty,
then the proper location is found and the added node becomes either a left or a right child of an
existing node. The logic works this way:
add-node (node, value)
{
  if (value is the same as the value stored in node)
  {
    duplicate;
    return (FAILURE);
  }
  else if (value < value stored in current node)
  {
    if (left child exists)
    {
      add-node (left child, value);
    }
    else
    {
      allocate new node and make left child point to it;
      return (SUCCESS);
    }
  }
  else if (value > value stored in current node)
  {
    if (right child exists)
    {
      add-node (right child, value);
    }
    else
    {
      allocate new node and make right child point to it;
      return (SUCCESS);
    }
  }
}
The function continues recursively until either it finds a duplicate (no duplicates are
allowed) or it hits a dead end. If it determines that the value to be added belongs to the left-child
sub tree and there is no left-child node, it creates one. If a left-child node exists, then it continues its
search with the sub tree beginning at this node. If the function determines that the value to be
added belongs to the right of the current node, a similar process occurs.
Let us consider a BST given in Figure 13(a).
Figure 13: Insertion In a Binary Search Tree
If we want to insert 5 in the above BST, we first search the tree. If the key to be inserted is found
in the tree, we do nothing (since duplicates are not allowed); otherwise nil is returned. In case nil
is returned, we insert the data at the last point traversed. In the example above, a search
operation will return nil on not finding a right sub tree of the tree rooted at 4. Therefore, 5 must be
inserted as the right child of 4.
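The add-node logic above, written out as a C sketch (integer keys assumed):

```c
#include <assert.h>
#include <stdlib.h>

typedef struct bst_node {
    int key;
    struct bst_node *left, *right;
} bst_node;

/* Inserts key into the BST rooted at *root; duplicates are rejected.
   Returns 1 on success, 0 on a duplicate or allocation failure. */
int bst_insert(bst_node **root, int key)
{
    if (*root == NULL) {                         /* dead end: place node here */
        bst_node *n = malloc(sizeof *n);
        if (n == NULL)
            return 0;
        n->key = key;
        n->left = n->right = NULL;
        *root = n;
        return 1;
    }
    if (key == (*root)->key)
        return 0;                                /* duplicate */
    if (key < (*root)->key)
        return bst_insert(&(*root)->left, key);  /* belongs in left sub tree */
    return bst_insert(&(*root)->right, key);     /* belongs in right sub tree */
}
```

Passing a pointer to the root pointer lets the same function handle both the empty-tree case and the recursive descent.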
9.3.2. DELETION OF A NODE
Once again the node to be deleted is searched in BST. If found, we need to consider the following
possibilities:
(i) If the node is a leaf, it can be deleted by making its parent point to nil. The deleted node is now
unreferenced and may be disposed of.
(ii) If the node has one child, its parent's pointer needs to be adjusted. For example, for node 1 to
be deleted from the BST given in Figure 13(a), the left pointer of node 6 is made to point to the child of
node 1, i.e. node 4, and the new structure would be
Figure 14: Deletion of a Terminal Node
(iii) If the node to be deleted has two children, then its value is replaced by the smallest value in
the right sub tree (or the largest key value in the left sub tree); subsequently the emptied node is
recursively deleted. Consider the BST in Figure 15.
Figure 15: Binary search tree
If node 6 is to be deleted, then first its value is replaced by the smallest value in its right subtree,
i.e. by 7. So we will have Figure 16.
Figure 16: Deletion of a node having left and right child
Now we need to delete this emptied node as explained in cases (i) and (ii).
Therefore, the final structure would be as in Figure 17.
Figure 17: Tree after deletion of a node having left and right child
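The three deletion cases can be sketched in C as follows; integer keys are assumed, and insert is a minimal helper included only so the sketch is self-contained.

```c
#include <assert.h>
#include <stdlib.h>

typedef struct bst_node {
    int key;
    struct bst_node *left, *right;
} bst_node;

/* minimal insert helper (ignores duplicates) for building examples */
static void insert(bst_node **r, int key)
{
    if (*r == NULL) {
        bst_node *n = malloc(sizeof *n);
        n->key = key;
        n->left = n->right = NULL;
        *r = n;
    } else if (key < (*r)->key) insert(&(*r)->left, key);
    else if (key > (*r)->key)   insert(&(*r)->right, key);
}

/* Deletes key from the BST rooted at *root, following the three cases:
   leaf, one child, and two children. */
void bst_delete(bst_node **root, int key)
{
    bst_node *t = *root;
    if (t == NULL)
        return;                               /* key not present */
    if (key < t->key)
        bst_delete(&t->left, key);
    else if (key > t->key)
        bst_delete(&t->right, key);
    else if (t->left != NULL && t->right != NULL) {
        bst_node **min = &t->right;           /* smallest key in right sub tree */
        while ((*min)->left != NULL)
            min = &(*min)->left;
        t->key = (*min)->key;                 /* copy it up ...              */
        bst_delete(min, t->key);              /* ... then delete it in turn  */
    } else {
        /* leaf or single child: splice the node out of the tree */
        *root = (t->left != NULL) ? t->left : t->right;
        free(t);
    }
}
```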
9.3.3. SEARCH FOR A KEY IN BST
To search the binary tree for a particular node, we use procedures similar to those we used when
adding to it. Beginning at the root node, the current node and the entered key are compared. If
the values are equal, success is reported. If the entered value is less than the value in the node,
then it must be in the left-child sub tree. If there is no left-child sub tree, the value is not in the
tree, i.e. a failure is reported. If there is a left-child subtree, then it is examined in the same way.
Similarly, if the entered value is greater than the value in the current node, the right child is
searched. Figure 18 shows the path through the tree followed in the search for the key H.
Figure 18: Search for a Key In BST
find-key (key value, node)
{
  if (the two values are the same)
  {
    print value stored in node;
    return (SUCCESS);
  }
  else if (key value < value stored in current node)
  {
    if (left child exists)
    {
      find-key (key value, left child);
    }
    else
    {
      there is no left subtree;
      return (string not found);
    }
  }
  else if (key value > value stored in current node)
  {
    if (right child exists)
    {
      find-key (key value, right child);
    }
    else
    {
      there is no right subtree;
      return (string not found);
    }
  }
}
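An equivalent C sketch of find-key (integer keys assumed), returning a pointer to the found node rather than printing it:

```c
#include <assert.h>
#include <stddef.h>

typedef struct bst_node {
    int key;
    struct bst_node *left, *right;
} bst_node;

/* Returns the node holding key, or NULL when the key is absent. */
const bst_node *find_key(const bst_node *node, int key)
{
    if (node == NULL)
        return NULL;                         /* dead end: not found */
    if (key == node->key)
        return node;                         /* success */
    if (key < node->key)
        return find_key(node->left, key);    /* search the left sub tree */
    return find_key(node->right, key);       /* search the right sub tree */
}
```

Because each comparison discards one sub tree, the search follows a single root-to-leaf path, just as in Figure 18.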
SUMMARY
This unit introduced the tree data structure, which is an acyclic, connected, simple graph.
Terminology pertaining to trees was introduced. A special case of the general tree, the
binary tree, was focused on. In a binary tree, each node has a maximum of two subtrees, the left and
the right subtree. Sometimes it is necessary to traverse a tree, that is, to visit all the tree's nodes.
Four methods of tree traversal were presented: in order, post order, preorder and level by level
traversal. These methods differ in the order in which the root, left subtree and right subtree are
traversed. Each ordering is appropriate for a different type of application.
An important class of binary trees is the complete or full binary tree. A full binary tree is one in
which internal nodes completely fill every level, except possibly the last. A complete binary tree is
a full binary tree where the internal nodes on the bottom level all appear to the left of the external nodes on that
level. Figure 6a shows an example of a complete binary tree.
9.4. HEIGHT BALANCED TREE
A binary tree of height h is completely balanced, or balanced, if all leaves occur at nodes of level h
or h-1 and if all nodes at levels lower than h-1 have two children. According to this definition, the
tree in Figure 1(a) is balanced, because all leaves occur at level 3 (counting the root as level 1) and all
nodes at levels 1 and 2 have two children. Intuitively, we might consider a tree to be well balanced
if, for each node, the longest paths from the left of the node are about the same length as the
longest paths on the right.
More precisely, a tree is height balanced if, for each node in the tree, the height of the left subtree
differs from the height of the right subtree by no more than 1. The tree in Figure 2(a) is height
balanced, but it is not completely balanced. On the other hand, the tree in Figure 2(b) is a
completely balanced tree.
Figure 2(a)
Figure 2(b)
An almost height balanced tree is called an AVL tree after the Russian mathematicians G. M.
Adelson-Velskii and E. M. Landis, who first defined and studied this form of a tree. An AVL tree
may or may not be perfectly balanced.
Let us determine how many nodes there might be in a balanced tree of height h.
The root will be the only node at level 1;
each subsequent level will be as full as possible, i.e. 2 nodes at level 2, 4 nodes at level 3 and so
on; in general there will be 2^(l-1) nodes at level l. Therefore the number of nodes from level 1
through level h-1 will be
1 + 2 + 2^2 + 2^3 + ... + 2^(h-2) = 2^(h-1) - 1
The number of nodes at level h may range from a single node to a maximum
of 2^(h-1) nodes. Therefore, the total number of nodes n of the tree may range from
(2^(h-1) - 1 + 1) to (2^(h-1) - 1 + 2^(h-1)),
i.e. from 2^(h-1) to 2^h - 1.
BUILDING HEIGHT BALANCED TREE
Each node of an AVL tree has the property that the height of its left subtree is either one more than,
equal to, or one less than the height of its right subtree. We may define a balance factor (BF) as
BF = (Height of Right subtree - Height of Left subtree)
Further,
if the two subtrees are of the same height, BF = 0;
if the right subtree is higher, BF = +1;
if the left subtree is higher, BF = -1.
For example, balance factors are stated near the nodes in Figure 3. The BF of the root node is zero
because the heights of the right subtree and the left subtree are both three. The BF at the node DIN is
-1 because the height of its left subtree is 2 and of its right subtree is 1, etc.
Figure 3: Tree having a balance factor at each node
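The balance factor defined above can be computed directly from subtree heights; a minimal sketch in C follows (data fields omitted, heights counted in edges so that an empty tree has height -1 and a leaf has height 0).

```c
#include <assert.h>
#include <stddef.h>

typedef struct node {
    struct node *left, *right;
} node;

/* height in edges; an empty tree has height -1, a leaf has height 0 */
int height(const node *t)
{
    if (t == NULL)
        return -1;
    int hl = height(t->left), hr = height(t->right);
    return 1 + (hl > hr ? hl : hr);
}

/* BF = height of right subtree - height of left subtree */
int balance_factor(const node *t)
{
    return height(t->right) - height(t->left);
}
```

A node whose balance factor falls outside the range -1..+1 signals that a rotation is needed, as in the construction worked through below.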
Suppose the values were given in the order
BIN, FEM, IND, NEE, LAL, PRI, JIM, AMI, HEM, DIN and we needed to make the height balanced
tree. It would work out as follows:
We begin at the root of the tree. Since the tree is initially empty, we have
Figure 4 (a)
We have FEM to be added. It would be on the right of the already existing tree. Therefore we have
Figure 4 (b)
The resulting tree is still height balanced. Now we need to add IND, i.e. on the further right of
FEM.
Figure 4 (c)
Since the BF of one of the nodes is other than 0, +1 or -1, we need to rebalance the tree. In such a
case, when the new node goes to the longer side, we need to rotate the structure counter-
clockwise, i.e. a rotation is carried out in the counter-clockwise direction around the closest parent of
the inserted node with BF = +2.
In this case we get
Figure 4 (d)
We now have a balanced tree.
On adding NEE, we get
Figure 4 (e)
Since all the nodes have BF of magnitude less than 2, we continue with the next node.
Now we need to add LAL
Figure 4 (f)
To regain balance we need to rotate the tree counter clockwise at IND and we get
Figure 4 (g)
On adding PRI we get
Figure 4 (h)
On adding JIM, we get
Figure 4 (i)
The tree is still balanced. Now we add AMI
Figure 4 (j)
Now we need to rotate the tree at FEM (i.e. the closest parent to AMI with BF = -2) in the clockwise
direction. On rotating it once we get
Figure 4 (k)
Now HEM is to be added. On doing so, the structure we get is
Figure 4 (l)
The tree is still balanced. Now we need to add DIN, and we get
Figure 4: Construction of AVL Tree
You may notice that Figure 3 and Figure 4 (m) are different although the
elements are same. This is because the AVL tree structure depends on the
order in which elements are added. Can you determine the order in which the
elements were added for a resultant structure as given in Figure 3?
Let us take another example. Consider the following list of elements
3,5,11,8,4,1,12,7,2,6,10
The tree structures generated till we add 11 are all balanced structures, as
given in Figure 5 (a) to (c)
Figure 5 (a) to (c)
Here we need rebalancing. Rotation around 3 would give
Figure 5 (d)
Further we add 8, 4, 1, 12, 7, 2 as depicted in Figure 5 (e) to (h).
Figure 5 (e) to (f)
Figure 5 (g) to (h)
After adding 6, the tree becomes unbalanced. We need to rebalance by
rotating the structure at the node 8. It is shown in Figure 5 (i)
Figure 5 (i) to (j)
Again on adding 10 we need to balance
Figure 5 (k) to (l)
Since the closest parent with BF of magnitude 2 is node 11, we first rotate at it in the
clockwise direction. However, we would still not get a balanced structure, as
the right of node 7 would have more nodes than its left. Therefore, we
shall need another rotation, in the anti-clockwise direction at node 7, and finally
we get the structure shown in Figure 5 (m).
Figure 5 (l) to (m)
This is the height balanced tree.
9.5. B-TREE
We have already defined an m-way tree in the Introduction. A B-tree is a balanced m-way tree. A
node of the tree may contain many records, or keys and pointers to children.
A B-tree is also known as the balanced sort tree. It finds its use in external sorting. It is not a
binary tree.
To reduce disk accesses, several conditions of the tree must be true:
· the height of the tree must be kept to a minimum,
· there must be no empty subtrees above the leaves of the tree;
· the leaves of the tree must all be on the same level; and
· all nodes except the leaves must have at least some minimum number of children.
B-Tree of order M has following properties:
1. Each node has a maximum of M children and a minimum of M/2 children; the root may have
any number of children from 2 to the maximum.
2. Each node has one fewer keys than children with a maximum of M-1 keys.
3. Keys are arranged in a defined order within the node. All keys in the subtree to the left of a key
are predecessors of the key, and those in the subtree to the right are successors of the key.
4. When a new key is to be inserted into a full node, the node is split into two nodes, and the key
with the median value is inserted in the parent node. In case the parent node is the root, a new
root is created.
5. All leaves are on the same level i.e. there is no empty subtree above the level of the leaves.
The order imposes a bound on the bushiness of the tree.
While root and terminal nodes are special cases, normal nodes have between M/2 and M children.
For example, a normal node of a tree of order 11 has at least 6 and at most 11 children.
9.5.1. B-TREE INSERTION
First, the search for the place where the new record must be put is done. If the node can
accommodate the new record, insertion is simple: the record is added to the node with an
appropriate pointer so that the number of pointers remains one more than the number of records. If the
node overflows, because there is an upper bound on the size of a node, splitting is required. The
node is split into three parts: the middle record is passed upward and inserted into the parent,
leaving two children behind where there was one before. Splitting may propagate up the tree,
because the parent into which the middle record of a split child node is inserted may itself overflow,
and therefore may also split. If the root is required to be split, a new root is created with just two children, and the
tree grows taller by one level.
The method is well explained by the following examples:
Example: Consider building a B-tree of degree 4, that is, a balanced four-way tree where each node
can hold three data values and have four branches. Suppose it needs to contain the following
values.
1 5 6 2 8 11 13 18 20 7 9
The first value 1 is placed in a new node, which can also accommodate the next two values;
when the fourth value 2 is to be added, the node is split at the median value 5
into two leaf nodes with a parent holding 5.
The following item, 8, is to be added in a leaf node. A search for its appropriate
place puts it in the node containing 6. Next, 11 is also put in the same node. So we
have
Now 13 is to be added. But the right leaf node, where 13 finds its appropriate
place, is full. Therefore it is split at the median value 8, which moves up to
the parent, and the leaf splits into two nodes.
The remaining items may also be added following the above procedure. The
final result is
Note that the tree built up in this manner is balanced, having all of its leaf
nodes at one level. Also the tree appears to grow at its root, rather than at its
leaves, as was the case in a binary tree.
A B-tree of order 3 is popularly known as a 2-3 tree, and one of order 4 as a 2-3-4 tree.
9.5.2. B-TREE DELETION
As in the insertion method, the record to be deleted is first searched for.
If the record is in a terminal node, deletion is simple: the record, along with an appropriate pointer,
is deleted.
If the record is not in a terminal node, it is replaced by a copy of its successor, that is, the record with
the next higher value. The successor of any record not at the lowest level will always be in a
terminal node. Thus in all cases deletion involves removing a record from a terminal node.
If on deleting the record, the new node size is not below the minimum, the deletion is over. If the
new node size is lower than the minimum, an underflow occurs.
Redistribution is carried out if either of the adjacent siblings contains more than the minimum
number of records. For redistribution, the contents of the node which has fewer than the minimum
records, the contents of its adjacent sibling which has more than the minimum records, and the
separating record from the parent are collected. The central record from this collection is written back
to the parent, and the left and right halves are written back to the two siblings.
In case the node with fewer than the minimum number of records has no adjacent sibling that is more
than minimally full, concatenation is used. In this case the node is merged with its adjacent
sibling and the separating record from its parent. The resulting underflow in the parent, if any,
may in turn be solved by redistribution or concatenation.
We will illustrate by deleting keys from the tree given below:
1. Delete 'h'. This is a simple deletion.
2. Delete 'r'. 'r' is not at a leaf node, therefore its successor 's' is moved up,
and 'r' is moved down and deleted.
3. Delete 'p'. The node now contains fewer than the minimum number of keys
required. The sibling can spare a key, so 't' moves up and 's' moves down.
4. Delete 'd'. Again the node has fewer than the minimum required. This leaves the
parent with only one key. Its sibling cannot contribute, therefore f, j, m and t
are combined to form the new root, and the size of the tree shrinks by one
level.
B-Tree of order 5
Let us build a B-tree of order 5 for following data:
1. D is inserted in a new node.
2. H, K, Z are simply inserted in the same node: D H K Z.
3. Add B: the node is full, so it must split.
H is the median for
B D H K Z
· H is made the parent. Since the splitting is at the root node, we must make
one more node.
4. P, Q, E, A are simply inserted.
S is to be added at K P Q Z;
Q becomes the median.
W and T are simply inserted after Q.
After adding C we get
L, N are simply inserted.
Y is to be added in
S T W Z;
since W is the median, the node splits at W.
M is to be put in K L N P and is the median. It will then be promoted to C H Q W, but
that node is also full, therefore this root will be split and a new root will be created.
SUMMARY
In this Unit, we discussed AVL trees and B-Trees. AVL trees are restricted-growth binary
trees. Normal binary trees suffer from the problem of balancing; AVL trees offer one kind of
solution. AVL trees are not completely balanced trees. In completely balanced trees the numbers of
nodes in the two subtrees of a node differ by at most 1; in AVL trees the heights of the two subtrees of
a node may differ by at most 1.
Multiway trees are a generalization of binary trees. A node in an m-way tree can contain R records
and R + 1 pointers (or children). Consequently the branching factor increases, or the tree becomes
shorter, as compared to a binary tree for the same number of records. B-trees are balanced
multiway trees. A B-tree restricts the number of records in a node to between m/2 and m-1, i.e. it requires
a node of a tree of order m to be at least half full.
Further, the structure of a B-Tree remains balanced as insertions and deletions are
performed.
Two variations of the B-Tree also exist: the B* tree and the B+ tree. The student is advised to read
some text on these structures as well.
UNIT 10 FILE ORGANIZATION
10.1. INTRODUCTION
10.2. TERMINOLOGY
10.3. FILE ORGANISATION
10.3.1. Sequential Files
10.3.1.1. BASIC OPERATIONS
10.3.1.2. DISADVANTAGES
10.3.2. DIRECT FILE ORGANIZATION
10.3.2.1. DIVISION-REMAINDER HASHING
10.3.3. INDEXED SEQUENTIAL FILE ORGANIZATION
10.1 INTRODUCTION
Many tasks in an information oriented society require people to deal with voluminous data and to
use computers to deal with this data efficiently and speedily. For example, in an airline
reservation office, data regarding flights, routes, seats available, etc. is required to facilitate
booking of seats. A university might like to store data related to all students, the courses they sign
up for, etc. All this implies the following:
- Data will be stored on external storage devices like magnetic tapes, floppy disks etc.
- Data will be accessed by many people and software programs
- Users of the data will expect that
- it is always reliably available for processing
- it is secure
- it is stored in a manner flexible enough to allow users to add new types of data as per their
changing needs
Files deal with storage and retrieval of data in a computer. Programming languages incorporate
statements and structures which allow users to write programs to access and use the data in the
file.
10.2 TERMINOLOGY
We will now define the terms of the hierarchical structure of computer-stored data collections.
1. Field: It is an elementary data item characterized by its size, length and type.
For example:
Name : a character type of size 10
Age : a numeric type
2. Record: It is a collection of related fields that can be treated as a unit from an applications
point of view.
For example:
A university could use a student record with the fields: university enrolment no., name, and major
subjects.
3. File: Data is organized for storage in files. A file is a collection of similar, related records. It has
an identifying name.
For example, "STUDENT" could be a file consisting of student records for all the pupils in a
university.
4. Index: An index file corresponds to a data file. Its records contain a key field and a pointer to
that record of the data file which has the same value of the key field.
Indexing will be discussed in detail later in the unit.
The data stored in files is accessed by software which can be divided into the following two
categories:
1. User Programs: These are usually written by a programmer to manipulate retrieved data in the
manner required by the application.
2. File Operations: These deal with the physical movement of data in and out of files. User
programs effectively use file operations through appropriate programming language syntax. The
File Management System manages the independent files and acts as the software interface
between the user programs and the file operations.
File operations can be categorized as
1. CREATION of the file
2. INSERTION of records in the file
3. UPDATION of previously inserted records
4. RETRIEVAL of previously inserted records
5. DELETION of records
6. DELETION of the file.
10.3 FILE ORGANISATION
File organization can most simply be defined as the method of storing data records in
a file and the subsequent implications on the way these records can be accessed. The factors
involved in selecting a particular file organization for use are:
- Ease of retrieval
- Convenience of updates
- Economy of storage
- Reliability
- Security
- Integrity
Different file organizations accord the above factors differing weightages. The choice must be
made depending upon the individual needs of the particular application in question.
We now introduce in brief the various commonly encountered file organizations.
SEQUENTIAL FILES
Data records are stored in some specific sequence, e.g. order of arrival, value of key field, etc.
Records of a sequential file cannot be accessed at random, i.e. to access the nth record, one must
traverse the preceding (n-1) records. Sequential files will be dealt with at length in the next
section.
RELATIVE FILES
Each data record has a fixed place in a relative file. Each record must have associated with it an
integer key value that will help identify its slot. This key, therefore, will be used for insertion and
retrieval of the record. Random as well as sequential access is possible. Relative files can exist
only on random access devices like disks.
DIRECT FILES
These are similar to relative files, except that the key value need not be an integer. The user can
specify keys which make sense to his application.
INDEXED SEQUENTIAL FILES
An index is added to the sequential file to provide random access. An overflow area needs to be
maintained to permit insertion in sequence.
INDEXED FILES
In this file organization, no sequence is imposed on the storage of records in the data file,
therefore, no overflow area is needed. The index, however, is maintained in strict sequence.
Multiple indexes are allowed on a file to improve access.
10.3.1. SEQUENTIAL FILES
We will now discuss in detail the sequential file organization defined in the previous
section. Sequential files have data records stored in a specific sequence.
A sequentially organized file may be stored on either a serial-access or a direct-access storage
medium.
STRUCTURE
To provide the 'sequence' required a 'key' must be defined for the data records. Usually a field
whose values can uniquely identify data records is selected as the key. If a single field cannot fulfil
this criterion, then a combination of fields can serve as the key. For example in a file which keeps
student records, a key could be student no.
10.3.1.1. OPERATIONS
1. Insertion: Records must be inserted at the place dictated by the sequence of the keys. As is
obvious, direct insertions into the main data file would lead to frequent rebuilding of the file. This
problem could be mitigated by reserving 'overflow areas' in the file for insertions. But this leads to
wastage of space, and the overflow areas may themselves eventually be filled.
The common method is to use transaction logging. This works as follows:
- collect records for insertion in a transaction file in their order of arrival
- when population of the transactions file has ceased, sort the transaction file in the order of the
key of the primary data file
- merge the two files on the basis of the key to get a new copy of the primary sequential file.
Such insertions are usually done in batch mode when the activity/program which populates the
transaction file has ceased. The structure of the transaction file's records will be identical to that
of the primary file.
2. Deletion: Deletion is the reverse process of insertion. The space occupied by the record should
be freed for use. Usually deletion (like insertion) is not done immediately. The concerned record
(along with a marker or 'tombstone' to indicate deletion) is written to a transaction file. At the time
of merging the corresponding data record will be dropped from the primary data file.
3. Updation: Updation is a combination of insertion and deletion. The record with the new
values is inserted and the earlier version deleted. This is also done using transaction files.
4. Retrieval: User programs will often retrieve data for viewing prior to making decisions.
Therefore, it is vital that this data reflects the latest state of the data, even if the merging activity has not
yet taken place.
Retrieval is usually done for a particular value of the key field. Before returning it to the user, the
data record should be merged with the transaction record (if any) for that key value.
The other two operations 'creation' and 'deletion' of files are achieved by simple programming
language statements.
10.3.1.2. DISADVANTAGES
Following are some of the disadvantages of sequential file organization:
* Updates are not easily accommodated
* By definition, random access is not possible
* All records must be structurally identical. If a new field has to be added, then every record
must be rewritten to provide space for the new field.
* Concurrent access may not be possible because both the primary data file and the transaction
file must be locked during merging.
AREAS OF USE
Sequential files are most frequently used in commercial batch oriented data processing where
there is the concept of a master file to which details are added periodically. Ex. payroll
applications.
10.3.2. DIRECT FILE ORGANIZATION
It offers an effective way to organize data when there is a need to access individual records
directly.
To access a record directly (random access), a relationship is used to translate the key value
into a physical address. This is called the mapping function R:
R(key value) -> Address
Direct files are stored on DASD (Direct Access Storage Device)
A calculation is performed on the key value to get an address. This address calculation technique
is often termed as hashing. The calculation applied is called a hash function.
Here we discus a very commonly used hash function called Division - Remainder.
10.3.2.1. DIVISION-REMAINDER HASHING
According to this method, the key value is divided by an appropriate number, generally a prime
number, and the remainder of the division is used as the address for the record.
The choice of an appropriate divisor may not be so simple. If it is known that the file is to contain n
records then, assuming that only one record can be stored at a given address, the divisor must be at least
n.
Also, we may have a very large key space as compared to the address space. Key space refers to all
the possible key values, although only a part of this space will actually be used as key values in the file.
The address space possibly may not match the size of the key space, so a one-to-one
mapping may not exist; that is, the calculated address may not be unique. This is called a collision, i.e.
R(K1) = R(K2) but K1 ≠ K2
Two unequal keys have been calculated to have the same address. Such keys are called synonyms.
There are various approaches to handle the problem of collisions. One of these is to hash to
buckets. A bucket is a space that can accommodate multiple records. A discussion on buckets
and other such methods to handle collisions is out of the scope of this Unit. However the student
is advised to read some text on Bucket Addressing and related topics.
10.3.3. INDEXED SEQUENTIAL FILE ORGANIZATION
When there is a need to access records sequentially by some key value and also to access records
directly by the same key value, the collection of records may be organized in an effective manner
called Indexed Sequential Organization.
You must be familiar with the search process for a word in a language dictionary. The data in the
dictionary is stored in a sequential manner. However, an index is provided in terms of thumb tabs.
To search for a word we do not search sequentially; we access the index, that is, the appropriate
thumb tab, locate an approximate location for the word and then proceed to find the word
sequentially.
To implement the concept of indexed sequential file organization, we consider an approach in
which the index part and the data part reside in separate files. The index file has a tree structure
and the data file has a sequential structure. Since the data file is sequenced, it is not necessary for
the index to have an entry for each record. The following figure shows a sequential file with a two-level
index.
Level 1 of the index holds an entry for each three-record section of the main file. The level 2
indexes level 1 in the same way.
When new records are inserted in the data file, the sequence of records needs to be preserved,
and the index is accordingly updated.
Two approaches used to implement indexes are static indexes and dynamic indexes.
As the main data file changes due to insertions and deletions, the static index contents may
change but the structure does not change . In case of dynamic indexing approach, insertions and
deletions in the main data file may lead to changes in the index structure. Recall the change in
height of B-Tree as records are inserted and deleted.
Both dynamic and static indexing techniques are useful depending on the type of application.
SUMMARY
This Unit dealt with the methods of physically storing data in the files. The terms fields, records
and files were defined. The organization types were introduced.
The various file organizations were discussed. Sequential file organization finds use in
application areas where batch processing is more common. Sequential files are simple to use and
can be stored on inexpensive media. However, they are not suitable for applications that require direct access
to particular records of the collection, and they do not provide adequate support for interactive
applications.
In direct file organization there exists a predictable relationship between the key used by a
program to identify a particular record and that record's location on secondary
storage. A direct file must be stored on a direct access device. Direct files are used extensively in
application areas where interactive processing is used.
An Indexed Sequential file supports both sequential access by key value and direct access to a
particular record given its key value. It is implemented by building an index on top of a sequential
data file that resides on a direct access storage device.
UNIT 11 SEARCHING
11.1. INTRODUCTION
11.2. SEARCHING TECHNIQUES
11.2.1. SEQUENTIAL SEARCH
11.2.1.1. ANALYSIS
11.2.2. BINARY SEARCH
11.2.2.1. ANALYSIS
11.3. HASHING
11.3.1. HASH FUNCTIONS
11.4. COLLISION RESOLUTION
11.1. INTRODUCTION
In many cases we require the data to be presented in a form where it follows a certain sequence of
records. If we have the data of the students in a class, then we will prefer to have it
arranged in alphabetical order. For preparing the result sheet, we would like to arrange the
data as per the examination numbers. When we prepare the merit list we would like the
same data arranged in decreasing order of the total marks obtained by the
students. Thus arranging the data in either ascending or descending manner based on a certain key
in the record is known as SORTING. As we do not receive the data in sorted form, we are
required to arrange the data in the particular form. For ascending order we will require the
smallest key value first; thus, until we have all the data items, we cannot start arranging
them. Arranging the data as we receive it can be done using linked lists. But in all other cases, we
need to have all the data which is to be sorted, and it will be present in the form of an ARRAY.
Sometimes it is very important to search for a particular record, may be, depending on some
value. The process of finding a particular record is known as SEARCHING. Suppose S is a
collection of data maintained in memory by a table using some type of data structure. Searching
is the operation which finds the location LOC in memory of some given ITEM of information or
sends some message that ITEM does not belong to S. The search is said to be successful or
unsuccessful according to whether ITEM does or does not belong to S. The searching algorithm
that is used depends mainly on the type of data structure, that is used to maintain S in memory.
Data modification, another term related to searching refers to the operations of inserting, deleting
and updating. Here data modification will mainly refer to inserting and deleting. These operations
are closely related to searching, since usually one must search for the location of the ITEM to be
deleted or one must search for the proper place to insert ITEM in the table. The insertion or
deletion also requires a certain amount of execution time, which also depends mainly on the type
of data structure that is used. Generally speaking, there is a tradeoff between data structures
with fast searching algorithms and data structures with fast modification algorithms. This
situation is illustrated below, where we summarize the searching and data modification of three of
the data structures previously studied in the text.
(1) Sorted array. Here one can use a binary search to find the location LOC of a given ITEM in
time O(log n). On the other hand, inserting and deleting are very slow, since, on the average, n/2
= O(n) elements must be moved for a given insertion or deletion. Thus a sorted array would likely
be used when there is a great deal of searching but only very little data modification.
(2) Linked list. Here one can only perform a linear search to find the location LOC of a given
ITEM, and the search may be very, very slow, possibly requiring time O(n). On the other hand,
inserting and deleting requires only a few pointers to be changed. Thus a linked list would be
used when there is a great deal of data modification, as in word (string) processing.
(3) Binary search tree. This data structure combines the advantages of the sorted array and the
linked list. That is, searching is reduced to searching only a certain path P in the tree T, which,
on the average, requires only O(log n) comparisons. Furthermore, the tree T is maintained in
memory by a linked representation, so only certain pointers need be changed after the location
of the insertion or deletion is found. The main drawback of the binary search tree is that the tree
may be very unbalanced, so that the length of a path P may be O(n) rather than O(log n). This will
reduce the searching to approximately a linear search.
11.2. SEARCHING TECHNIQUES
We will discuss two searching methods – the sequential search and the binary search.
11.2.1. SEQUENTIAL SEARCH
This is a natural searching method. Here we search for a record by traversing through the entire
list from beginning until we get the record. For this searching technique the list need not be
ordered. The algorithm is presented below:
1. Set flag = 0.
2. Set index = 0.
3. Traverse from the first record to the end of the list; if the required record is found,
make flag = 1.
4. At the end, if the flag is 1 the record is found; otherwise the search is a failure.
int Lsearch(int L[SIZE], int ele)
{
    int it;
    /* scan positions 0 .. SIZE-1 */
    for (it = 0; it < SIZE; it++)
    {
        if (L[it] == ele)
            return 1;    /* found */
    }
    return 0;            /* not found */
}
11.2.1.1. ANALYSIS
Whether the search takes place in an array or a linked list, the critical part in performance
is the comparison in the loop. If the comparisons are fewer, the loop terminates faster. The least
number of iterations that could be required is 1, if the element searched for is the first one in the
list. The maximum number of comparisons is N (where N is the total size of the list), when the element is the last in
the list. Thus if the required item is in position i in the list, i comparisons are required. Hence
the average number of comparisons is
(1 + 2 + 3 + ... + i + ... + N) / N
= N(N+1) / (2N)
= (N + 1) / 2
Sequential search is easy to write and efficient for short lists. It does not require the list to
be sorted. However if the list is long, this searching method becomes inefficient, as it has to travel
through the whole list.
We can overcome this shortcoming by using Binary Search Method.
11.2.2. BINARY SEARCH
Binary search method employs the process of searching for a record only in half of the list,
depending on the comparison between the element to be searched and the central element in the
list. It requires the list to be sorted to perform such a comparison. It reduces the size of the
portion to be searched by half after each iteration.
Let us consider an example to see how this works. The numbers in the list are 10 20 30 40
50 60 70 80 90.The element to be searched is 30.
First it is compared with 50(central element).
Since it is smaller than the central element we consider only the left part of the list(i.e. from 10 to
40) for further searching.
Next the comparison is made with 30 and returns as search is successful.
Let us see the function, which performs this on the list.
int Bsearch(int list[SIZE], int ele)
{
    int top, bottom, middle;
    top = SIZE - 1;
    bottom = 0;
    while (top >= bottom)
    {
        middle = (top + bottom) / 2;   /* index of the central element */
        if (list[middle] == ele)
            return middle;             /* found: return its position */
        else if (list[middle] < ele)
            bottom = middle + 1;       /* search the right half */
        else
            top = middle - 1;          /* search the left half */
    }
    return -1;                         /* not found */
}
11.2.2.1. ANALYSIS
In this case, after each comparison either the search terminates successfully or the list remaining
to be searched is reduced by half. So after k comparisons the list remaining to be searched is of size
N/2^k, where N is the number of elements. Hence even in the worst case this method needs no more
than log2 N + 1 comparisons. Binary search is a fast algorithm for searching sorted sequences. It
runs in about log2 N time, as opposed to an average run time of N/2 for linear search. For large
lists that you need to search multiple times, it might well be more efficient to sort and use binary
search instead of sticking to a linear search.
11.3. HASHING
The search time of each algorithm discussed so far depends on the number n of elements in the
collection S of data. This section discusses a searching technique, called hashing or hash
addressing, which is essentially independent of the number n.
The terminology, which we use in our presentation of hashing will be oriented toward file
management. First of all, we assume that there is a file F of n records with a set K of keys, which
uniquely determine the records in F. Secondly, we assume that F is maintained in memory by a
Table T of m memory locations and that L is the set of memory addresses of the locations in T. For
notational convenience, we assume that the keys in K and the addresses in L are (decimal)
integers. (Analogous methods will work with binary integers or with keys which are character
strings, such as names, since there are standard ways of representing strings by integers). The
subject of hashing will be introduced by the following example.
Suppose a company with 68 employees assigns a 4-digit employee number to each
employee, which is used as the primary key in the company's employee file. We can, in fact, use
the employee number as the address of the record in memory. The search will require no
comparisons at all. Unfortunately, this technique will require space for 10 000 memory locations,
whereas space for fewer than 30 such locations would actually be used. Clearly, this tradeoff of
space for time is not worth the expense.
The general idea of using the key to determine the address of a record is an excellent idea,
but it must be modified so that a great deal of space is not wasted. This modification takes the
form of a function H from the set K of keys into the set L of memory addresses. Such a function,

H: K → L

is called a hash function or hashing function. Unfortunately, such a function H may not yield
distinct values: it is possible that two different keys K1 and K2 will yield the same hash address.
This situation is called collision, and some method must be used to resolve it. Accordingly, the
topic of hashing is divided into two parts: (1) hash functions and (2) collision resolutions. We
discuss these two parts separately.
11.3.1 HASH FUNCTIONS
The two principal criteria used in selecting a hash function H: K → L are as follows. First,
the function H should be very easy and quick to compute. Second, the function H should, as far as
possible, uniformly distribute the hash addresses throughout the set L so that there are a
minimum number of collisions. Naturally, there is no guarantee that the second condition can be
completely fulfilled without actually knowing beforehand the keys and addresses. However,
certain general techniques do help. One technique is to “chop” a key k into pieces and combine
the pieces in some way to form the hash address H(k). (The term “hashing” comes from this
technique of “chopping” a key into pieces.)
We next illustrate some popular hash functions. We emphasize that each of these hash functions
can be easily and quickly evaluated by the computer.
(a) Division method. Choose a number m larger than the number n of keys in K. (The number
m is usually chosen to be a prime number or a number without small divisors, since this
frequently minimizes the number of collisions.) The hash function H is defined by

H(k) = k (mod m)     or     H(k) = k (mod m) + 1

Here k (mod m) denotes the remainder when k is divided by m. The second formula is used
when we want the hash addresses to run from 1 to m rather than from 0 to m - 1.
(b) Midsquare method. The key k is squared. Then the hash function H is defined by

H(k) = l

where l is obtained by deleting digits from both ends of k^2. We emphasize that the same positions
of k^2 must be used for all of the keys.
(c) Folding method. The key k is partitioned into a number of parts, k1, ..., kr, where each part,
except possibly the last, has the same number of digits as the required address. Then the parts
are added together, ignoring the last carry. That is,

H(k) = k1 + k2 + ... + kr

where the leading-digit carries, if any, are ignored. Sometimes, for extra “milling,” the
even-numbered parts, k2, k4, ..., are each reversed before the addition.
11.4. COLLISION RESOLUTION
Suppose we want to add a new record R with key k to our file F, but suppose the memory location
with address H(k) is already occupied. This situation is called collision. This subsection discusses two
general ways of resolving collisions: open addressing, where the collision is resolved by probing for
the next available location in the table, and chaining, where all records that hash to the same
address are kept on a linked list. The particular procedure that one chooses depends on many
factors. One important factor is the ratio of the number n of keys in K (which is the number of
records in F) to the number m of hash addresses in L. This ratio, λ = n/m, is called the load
factor. Collisions are surprisingly hard to avoid: in a class of 24 students, for example, even though
the corresponding ratio 24/365 ≈ 7% is very small, it can be shown that there is a better than
fifty-fifty chance that two of the students have the same birthday.
The efficiency of a hash function with a collision resolution procedure is measured by the average
number of probes (key comparisons) needed to find the location of the record with a given key k.
The efficiency depends mainly on the load factor.
SUMMARY
This unit concentrates on searching techniques used for information retrieval. The sequential
search method was seen to be easy to implement and relatively efficient for small lists, but
very time consuming for long unsorted lists. The binary search method is an improvement, in that
it eliminates half the list from consideration at each iteration; it has checks incorporated to
ensure speedy termination under all possible conditions. It requires only about twenty comparisons
for a million records and is hence very efficient. The prerequisite for it is that the list should be
sorted in increasing order.
UNIT 12 SORTING
12.1. INTRODUCTION
12.2. INSERTION SORT
12.2.1. ANALYSIS
12.3. BUBBLE SORT
12.3.1. ANALYSIS
12.4. SELECTION SORT
12.4.1. ANALYSIS
12.5. RADIX SORT
12.5.1. ANALYSIS
12.6. QUICK SORT
12.6.1. ANALYSIS
12.7. 2-WAY MERGE SORT
12.8. HEAP SORT
12.9. HEAPSORT VS. QUICKSORT
12.1 INTRODUCTION
Sorting is one of the most important operations performed by computers. In the days of magnetic
tape storage before modern data-bases, it was almost certainly the most common operation
performed by computers as most "database" updating was done by sorting transactions and
merging them with a master file. It's still important for presentation of data extracted from
databases: most people prefer to get reports sorted into some relevant order before wading
through pages of data. Sorting, the problem of rearranging an initially unordered collection of
keys (data) to produce an ordered collection, has been the subject of extensive research in
computer science. (A sort key is a single-valued function of the corresponding element of the
list.) For instance, we may want to arrange names in alphabetical order, customers by zip code,
or cities by population. A number of different techniques have therefore been developed for
sorting. Because the sorting problem has fascinated theoretical computer scientists, much is
known about the efficiency of various solutions, and about limitations on the best possible
solutions. Efficient sorting is important to optimizing the use of other algorithms (such as search
algorithms and merge algorithms) that require sorted lists to work correctly; it is also often useful
for canonicalizing data and for producing human-readable output. Sort algorithms used in
computer science are often classified by:
- computational complexity (worst, average and best behaviour) in terms of the size of the
  list (n). Typically, good behaviour is O(n log n) and bad behaviour is O(n²). Sort algorithms
  which only use an abstract key comparison operation always need at least O(n log n)
  comparisons on average; sort algorithms which exploit the structure of the key space
  cannot sort faster than O(n log k), where k is the size of the key space.
- memory usage (and use of other computer resources)
- stability: stable sorts keep the relative order of elements that have an equal key. That is, a
  sort algorithm is stable if whenever there are two records R and S with the same key and
  with R appearing before S in the original list, R will appear before S in the sorted list.
Sorting algorithms that are not stable can be specially implemented to be stable. One way of doing
this is to artificially extend the key comparison, such that comparisons between two objects with
otherwise equal keys are decided using the order of the entries in the original data as a
tie-breaker. Some common sorting algorithms, with their runtime orders, are:

- Bubble sort - O(n²)
- Selection sort - O(n²)
- Insertion sort - O(n²)
- Quicksort * - O(n log n)
- 2-way merge sort - O(n log n)
- Heapsort * - O(n log n)

(*) unstable
12.2. INSERTION SORT
This is a naturally occurring sorting method, exemplified by a card player arranging the cards dealt
to him. He picks up the cards as they are dealt and inserts them into the required position. Thus
at every step, we insert an item into its proper place in an already ordered list.
We will illustrate insertion sort with an example before (figure 1) presenting the formal algorithm.
Example 1: Sort the following list using the insertion sort method:
Figure 1: Insertion sort
Thus, to find the correct position, search the list till an item just greater than the target is found.
Shift all the items from this point one position down the list. Insert the target in the vacated slot.
We now present the algorithm for insertion sort.
ALGORITHM: INSERT SORT
INPUT: LIST[ ] of N items in random order.
OUTPUT: LIST[ ] of N items in sorted order.
1. BEGIN
2. FOR I = 2 TO N DO
3. BEGIN
4. IF LIST[I] < LIST[I-1]
5. THEN BEGIN
6. J = I
7. T = LIST[I] /* STORE LIST[I] */
8. FOUND = FALSE
9. REPEAT /* MOVE OTHER ITEMS DOWN THE LIST */
10. J = J - 1
11. LIST[J+1] = LIST[J]
12. IF J = 1 OR LIST[J-1] <= T THEN
13. FOUND = TRUE
14. UNTIL (FOUND = TRUE)
15. LIST[J] = T
16. END
17. END
18. END
C Code
void InsertionSort(int array[], int size)
{
    int k;
    for (k = 1; k < size; k++)
    {
        int elem = array[k];
        int pos;
        /* shift larger elements one place to the right;
           test pos >= 0 first so array[-1] is never read */
        for (pos = k - 1; pos >= 0 && array[pos] > elem; pos--)
            array[pos + 1] = array[pos];
        array[pos + 1] = elem;
    }
}
12.2.1 ANALYSIS
To determine the average efficiency of insertion sort consider the number of times that the inner
loop iterates. As with other algorithms featuring nested loops, the number of iterations follows a
familiar pattern: 1 + 2 + ... + (n-2) + (n-1) = n(n-1)/2 = O(n²). Conceptually, the above pattern is
caused by the sorted sub-list that is built throughout the insertion sort algorithm. It takes one
iteration to build a sorted sub-list of length 1, two iterations to build a sorted sub-list of length 2
and finally n-1 iterations to build the final list. To determine whether there are any best or worst
cases for the sort, we can examine the algorithm to find data sets that would behave differently
from the average case with random data. Because the average case identified above locally sorts
each sub-list there is no arrangement of the aggregate data set that is significantly worse for
insertion sort. The nature of the sorting algorithm does however lend itself to perform more
efficiently on certain data. In the case where the data is already sorted, insertion sort won't have
to do any shifting because the local sub-list will already be sorted. That is, the first element will
already be sorted, the first two will already be sorted, the first three, and so on. In this case,
insertion sort will iterate once through the list and, finding no elements out of order, will not shift
any of the data around. The best case for insertion sort is thus a sorted list, where it runs in O(n).
It takes O(n²) time in the average and worst cases, which makes it impractical for sorting large
numbers of elements. However, insertion sort's inner loop is very fast, which often makes it one of
the fastest algorithms for sorting small numbers of elements, typically fewer than 10 or so.
12.3 BUBBLE SORT
In this sorting algorithm, multiple swappings take place in one pass. Smaller elements move or
'bubble' up to the top of the list, hence the name given to the algorithm.
In this method, adjacent members of the list to be sorted are compared. If the item on top is
greater than the item immediately below it, they are swapped. This process is carried on till the
list is sorted.
The detailed algorithm follows:
ALGORITHM BUBBLE SORT
INPUT: LIST [ ] of N items in random order.
OUTPUT: LIST [ ] of N items sorted in ascending order.
1. SWAP = TRUE
2. PASS = 0
3. WHILE SWAP = TRUE DO
BEGIN
3.1 SWAP = FALSE
3.2 FOR I = 0 TO (N - PASS - 2) DO
BEGIN
3.2.1 IF A[I] > A[I+1] THEN
BEGIN
TMP = A[I]
A[I] = A[I+1]
A[I+1] = TMP
SWAP = TRUE
END
END
3.3 PASS = PASS + 1
END
C Code
void bubbleSort(int *array, int length)
{
int i, j;
for(i = 0; i < length - 1; i++)
for(j = 0; j < length - i - 1; j++)
if(array[j] > array[j+1]) /* compare neighboring elements */
{
int temp;
temp = array[j]; /* swap array[j] and array[j+1] */
array[j] = array[j+1];
array[j+1] = temp;
}
}
The algorithm for bubble sort requires a pair of nested loops. The outer loop must iterate
once for each element in the data set (of size n), while the inner loop iterates n-1 times the first
time it is entered, n-2 times the second, and so on. Consider the purpose of each loop. As explained
above, bubble sort is structured so that on each pass through the list the next largest element of
the data is moved to its proper place. Therefore, to get all n elements in their correct places, the
outer loop must be executed n times. The inner loop is executed on each iteration of the outer
loop. Its purpose is to put the next largest element into place. The inner loop
therefore does the comparing and swapping of adjacent elements.
12.3.1. ANALYSIS
To determine the complexity of this loop, we calculate the number of comparisons that have to be
made. On the first iteration of the outer loop, while trying to place the largest element, there have
to be n - 1 comparisons: the first comparison is made between the first and second elements, the
second is made between the second and third elements, and so on until the (n-1)th comparison is
made between the (n-1)th and the nth element. On the second iteration of the outer loop, there is
no need to compare again the last element of the list, because it was put in the correct place on
the previous pass. Therefore, the second iteration requires only n-2 comparisons. This pattern
continues until the second-to-last iteration of the outer loop when only the first two elements of
the list are unsorted; clearly in this case, only one comparison is necessary. The total number of
comparisons, therefore, is (n-1) + (n-2) + ... + 2 + 1 = n(n-1)/2, or O(n²).
The best case for bubble sort occurs when the list is already sorted or nearly sorted. In the
case where the list is already sorted, bubble sort will terminate after the first iteration, since no
swaps were made. Any time that a pass is made through the list and no swaps were made, it is
certain that the list is sorted. Bubble sort is also efficient when one random element needs to be
sorted into a sorted list, provided that new element is placed at the beginning and not at the end.
When placed at the beginning, it will simply bubble up to the correct place, and the second
iteration through the list will generate 0 swaps, ending the sort. Recall that if the random element
is placed at the end, bubble sort loses its efficiency because each element greater than it must
bubble all the way up to the top.
The absolute worst case for bubble sort is when the smallest element of the list is at the
large end. Because in each iteration only the largest unsorted element gets put in its proper
location, when the smallest element is at the end, it will have to be swapped each time through
the list, and it won't get to the front of the list until all n iterations have occurred. In this worst
case, it takes n iterations of n/2 swaps, so the order is, again, O(n²).

Best case:     Comparisons: n      Swaps: 0
Average case:  Comparisons: n²     Swaps: ~n²/2
Worst case:    Comparisons: n²     Swaps: n²/2
12.4 SELECTION SORT
The bubble sort algorithm can be improved upon: rather than swapping adjacent elements on the
basis of every comparison, we can locate the exact element required (the minimum of the unsorted
portion) in one whole pass of the array, with no swapping at all until the end of the pass.
Consider the following unsorted data: 8 9 3 5 6 4 2 1 7 0. On the first iteration of the sort, the
minimum data point is found by searching through all the data; in this case, the minimum value
is 0. That value is then put into its correct place i.e. at the beginning of the list by exchanging the
places of the two values. The 0 is swapped into the 8's position and the 8 is placed where the 0
was, without distinguishing whether that is the correct place for it, which it is not. Now that the
first element is sorted, it never has to be considered again. So, although the current state of the
data set is 0 9 3 5 6 4 2 1 7 8, the 0 is no longer considered, and the selection sort repeats itself
on the remainder of the unsorted data: 9 3 5 6 4 2 1 7 8. Consider a trace of the selection sort
algorithm on a data set of ten elements:
(i)    8 9 3 5 6 4 2 1 7 0
(ii)   0 9 3 5 6 4 2 1 7 8
(iii)  0 1 3 5 6 4 2 9 7 8
(iv)   0 1 2 5 6 4 3 9 7 8
(v)    0 1 2 3 6 4 5 9 7 8
(vi)   0 1 2 3 4 6 5 9 7 8
(vii)  0 1 2 3 4 5 6 9 7 8
(viii) 0 1 2 3 4 5 6 9 7 8
(ix)   0 1 2 3 4 5 6 7 9 8
(x)    0 1 2 3 4 5 6 7 8 9
Selection Sort Algorithm
The selection sort algorithm is:
1. Select the index of the first element as min.
2. Compare element[min] with the next element; if the next element is less than element[min],
replace min with that element's index.
3. Repeat step 2 till the end of the list.
4. Now swap element[min] with the first element of the unsorted portion. This element is now
at its appropriate position.
5. For the next iteration the first element is excluded from consideration, and the above
steps are repeated for the remaining list.
C Code
void selectionSort(int numbers[], int array_size)
{
int i, j;
int min, temp;
for (i = 0; i < array_size-1; i++)
{
min = i;
for (j = i+1; j < array_size; j++)
{
if (numbers[j] < numbers[min])
min = j;
}
temp = numbers[i];
numbers[i] = numbers[min];
numbers[min] = temp;
}
}
Like bubble sort, selection sort is implemented with one loop nested inside another. This suggests
that the efficiency of selection sort, like bubble sort, is O(n²). To understand why this is indeed
correct, consider how many comparisons must take place. The first iteration through the data
requires n-1 comparisons to find the minimum value to swap into the first position. Because the
first position can then be ignored when finding the next smallest value, the second iteration
requires n-2 comparisons and third requires n-3. This progression continues as follows:
(n-1) + (n-2) + ... + 2 + 1 = n(n-1)/2 = O(n²)
12.4.1. ANALYSIS
Unlike the other quadratic sorts, the efficiency of selection sort is independent of the data. Bubble
sort, for example, can sort sorted and some nearly-sorted lists in linear time because it is able to
identify when it has a sorted list. Selection sort does not do anything like that because it is only
seeking the minimum value on each iteration. Therefore, it cannot recognize (on the first iteration)
the difference between the following two sets of data: 1 2 3 4 5 6 7 8 9 and 1 9 8 7 6 5 4 3 2. In
each case, it will identify the 1 as the smallest element and then go on to sorting the rest of the
list. Because it treats all data sets the same and has no ability to short-circuit the rest of the sort
if it ever comes across a sorted list before the algorithm is complete, selection sort has no best or
worst cases. Selection sort always takes O(n²) operations, regardless of the characteristics of the
data being sorted. Even though Selection Sort is one of the slower algorithms, it is used because it
sorts with a minimum of data shifting. That is, the algorithm doesn't swap the elements of the
array as many times as other algorithms do in order to sort the array.
12.5 RADIX SORT
Radix sort is a fast, stable sorting algorithm which can be used to sort items that are identified by
unique keys. Every key is a string or number, and radix sort sorts these keys in a particular
lexicographic-like order. Radix sort is one of the linear sorting algorithms for integers. It functions
by sorting the input numbers on each digit, for each of the digits in the numbers. However, the
process adopted by this sort method is somewhat counterintuitive, in the sense that the numbers
are sorted on the least-significant digit first, followed by the second-least significant digit and so
on till the most significant digit.
Let us sort the following list:
150, 45, 75, 90, 2, 24, 802, 66
1. sorting by least significant digit (1s place) gives:
150, 90, 2, 802, 24, 45, 75, 66
2. sorting by next digit (10s place) gives:
2, 802, 24, 45,150,66,75, 90
3. sorting by most significant digit (100s place) gives:
2, 24, 45, 66, 75, 90, 150, 802
Let us consider a series of decimal numbers to be sorted. If one sets up 10 bins, passes through
the list of numbers placing each in the appropriate bin according to the least significant digit, and
then combines the bins in order without changing the order of numbers in the bin, by repeating
the process with the next most significant digit until all digits have been used, one will have
sorted the numbers. For computer implementation, the bins are linked lists, and only pointers to
the data are manipulated. If bit bins are used, only two lists need be manipulated, while for byte
bins, up to 256 may be required, depending on the nature of the data. After each pass through
the sort key, the lists are joined, and the new list is used for the next pass. For a straightforward
implementation, the sort time depends linearly on the number of records to be sorted, and the
number of passes, which in turn is related only to the number of “bins” and sort key length. The
bin sorting approach can be generalized in a technique that is known as radix sorting.
To appreciate Radix Sort, consider the following analogy: Suppose that we wish to sort a
deck of 52 playing cards (the different suits can be given suitable values, for example 1 for
Diamonds, 2 for Clubs, 3 for Hearts and 4 for Spades). The 'natural' thing to do would be to first
sort the cards according to suits, then sort each of the four separate piles, and finally combine the
four in order. This approach, however, has an inherent disadvantage. When each of the piles is
being sorted, the other piles have to be kept aside and kept track of. If, instead, we follow the
'counterintuitive' approach of first sorting the cards by value, this problem is eliminated. After the
first step, the four separate piles are combined in order and then sorted by suit.
Radix Sort Algorithm
A radix sort algorithm works as follows:
1. take the least significant digit (or group bits) of each key.
2. sort the list of elements based on that digit, but keep the order of elements with the same
digit (this is the definition of a stable sort).
3. repeat the sort with each more significant digit.
C Code
void rsort(record a[], int n)
{
    int i, j;
    int shift;
    record temp[MAXLENGTH];
    int bucket_size[256], first_in_bucket[256];
    /* one pass per byte of a 32-bit key, least significant byte first */
    for (shift = 0; shift < 32; shift += 8)
    {
        /* compute the size of each bucket and
           copy each record from array a to array temp */
        for (i = 0; i < 256; i++)
            bucket_size[i] = 0;
        for (j = 0; j < n; j++)
        {
            i = (a[j].key >> shift) & 255;
            bucket_size[i]++;
            temp[j] = a[j];
        }
        /* mark off the beginning of each bucket */
        first_in_bucket[0] = 0;
        for (i = 1; i < 256; i++)
            first_in_bucket[i] = first_in_bucket[i-1] + bucket_size[i-1];
        /* copy each record from array temp to its bucket in array a */
        for (j = 0; j < n; j++)
        {
            i = (temp[j].key >> shift) & 255;
            a[first_in_bucket[i]] = temp[j];
            first_in_bucket[i]++;
        }
    }
}
12.5.1. ANALYSIS
The algorithm operates in O(nk) time, where n is the number of items, and k is the average key
length. If the size of the possible key space is proportional to the number of items, then each key
will be log n symbols long, and radix sort uses O(n log n) time in this case. In practice, if the keys
used are short integers, it is practical to complete the sorting with only two passes, and
comparisons can be done with only a few bit operations that operate in constant time. In this
case, radix sort can effectively be regarded as taking O(n) time and in practice can be significantly
faster than any other sorting algorithm. The greatest disadvantages of radix sort are that it
usually cannot be made to run in place, so O(n) additional memory space is needed, and that it
requires one pass for each symbol of the key, so it is very slow for potentially-long keys. The time
complexity of the algorithm is as follows: Suppose that the n input numbers have maximum k
digits. Then the Counting Sort procedure is called a total of k times. Counting Sort is a linear, or
O(n) algorithm. So the entire Radix Sort procedure takes O(kn) time. If the numbers are of finite
size, the algorithm runs in O(n) asymptotic time.
12.6. QUICK SORT
This is the most widely used internal sorting algorithm. In its basic form, it was invented by
C.A.R. Hoare in 1960. Its popularity lies in the ease of implementation, moderate use of
resources and acceptable behaviour for a variety of sorting cases. The basis of quick sort is the
'divide and conquer' strategy, i.e. divide the problem [the list to be sorted] into sub-problems
[sub-lists], until solved sub-problems [sorted sub-lists] are found. This is implemented as:
1. Choose one item A[I] from the list A[ ].
2. Rearrange the list so that this item is in the proper position, i.e. all preceding items have a
lesser value and all succeeding items have a greater value than this item. Now
   A[0], A[1] ... A[I-1] form sub-list 1,
   A[I] is in its final place, and
   A[I+1], A[I+2] ... A[N] form sub-list 2.
3. Repeat steps 1 & 2 for sub-list 1 & sub-list 2 till A[ ] is a sorted list.
As can be seen, this algorithm has a recursive structure.
Step 2, the 'divide' procedure, is of utmost importance in this algorithm. It is usually
implemented as follows:
1. Choose A[I], the dividing element.
2. From the left end of the list (A[0] onwards), scan till an item A[R] is found whose value is
greater than A[I].
3. From the right end of the list (A[N] backwards), scan till an item A[L] is found whose value is
less than A[I].
4. Swap A[R] & A[L].
5. Continue steps 2, 3 & 4 till the scan pointers cross. Stop at this stage.
6. At this point sub-list 1 & sub-list 2 are ready.
7. Now do the same for each of sub-list 1 & sub-list 2.
We will now give the implementation of Quicksort and illustrate it by an example.

Quicksort (int A[], int X, int I)
{
int L, R, V;
1. If (I > X)
{
2. V = A[I]; L = X - 1; R = I;
3. For (;;)
{
4. While (A[++L] < V);
5. While (A[--R] > V);
6. If (L >= R) /* left & right ptrs. have crossed */
7. break;
8. Swap (A, L, R); /* Swap A[L] & A[R] */
}
9. Swap (A, L, I);
10. Quicksort (A, X, L-1);
11. Quicksort (A, L+1, I);
}
}

Quicksort is called with A, 1, N to sort the whole file.
C Code
void quickSort(int numbers[], int array_size)
{
q_sort(numbers, 0, array_size - 1);
}
void q_sort(int numbers[], int left, int right)
{
int pivot, l_hold, r_hold;
l_hold = left;
r_hold = right;
pivot = numbers[left];
while (left < right)
{
while ((numbers[right] >= pivot) && (left < right))
right--;
if (left != right)
{
numbers[left] = numbers[right];
left++;
}
while ((numbers[left] <= pivot) && (left < right))
left++;
if (left != right)
{
numbers[right] = numbers[left];
right--;
}
}
numbers[left] = pivot;
pivot = left;
left = l_hold;
right = r_hold;
if (left < pivot)
q_sort(numbers, left, pivot-1);
if (right > pivot)
q_sort(numbers, pivot+1, right);
}
Example: Consider the following list to be sorted in ascending order: 'ADD YOUR MAN' (ignore
blanks). After the first partitioning pass, 'N' (at A[6]) is in its correct place;
A[1] to A[5] constitute sub-list 1, and
A[7] to A[10] constitute sub-list 2. Now
10. Quicksort (A, 1, 5)
11. Quicksort (A, 6, 10)
The Quicksort algorithm uses O(N log2 N) comparisons on average. The performance can be
improved by keeping in mind the following points:
1. Switch to a faster sorting scheme like insertion sort when the sub-list size becomes
comparatively small.
2. Use a better dividing element in the implementation. We have so far used the last element
as the dividing element. A useful method for the selection of a dividing element is the
median-of-three method: select any 3 elements from the list and use the median of these
as the dividing element.
12.6.1. ANALYSIS
The efficiency of quick sort is determined by calculating the running time of the two recursive
calls plus the time spent in the partition. The partition step of quick sort takes n-1 comparisons.
The efficiency of the recursive calls depends largely on how equally the pivot value splits the
array. In the average case, assume that the pivot does split the array into two roughly equal
halves. As is common with divide-and-conquer sorts, the dividing algorithm has a running time of
log(n) . Thus the overall quicksort algorithm has running time O(n log(n)) . The worst case occurs
when the pivot value always ends up being one of the extreme values in the array. For example,
this might happen in a sorted array if the first value is selected as the pivot. In this case, the
partitioning phase still requires n-1 comparisons, as before, but quicksort does not achieve the
O(log(n)) efficiency in the dividing process. Instead of breaking an 8 element array into arrays of
size 4, 2, and 1 in three recursive calls, the array size only reduces by one: 7, 6, and 5. Thus the
dividing process becomes linear and the worst case efficiency is O(n²). Note that quicksort
performs badly once the amount of data becomes small, due to the overhead of recursion. This is
often addressed by switching to a different sort for data smaller than some threshold, such as
25 or 30 elements.
Quite a lot of people believe that quicksort is always the quickest sorting algorithm, which is not
true. There is no such thing as the quickest sorting algorithm in all situations: it depends on the
data, the data types, the implementation language and much more. Heapsort, for instance, offers
a better worst-case guarantee. Quicksort is excellent for random data and when items can be
swapped very fast. For almost-sorted data, a naive quicksort (one that always picks an end
element as the pivot) degenerates to its O(n²) worst case; on random data it achieves
O(n log n), like any good comparison sort.
12.7 2-WAY MERGE SORT
Merge sort is also one of the 'divide and conquer' class of algorithms. The basic idea behind it is to
divide the list into a number of sub-lists, sort each of these sub-lists and merge them to get a
single sorted list. The recursive implementation of 2-way merge sort divides the list into 2, sorts
the sub-lists and then merges them to get the sorted list. The iterative implementation of 2-way
merge sort sees the input initially as n lists of size 1. These are merged pairwise to get n/2 lists of
size 2. These n/2 lists are merged pairwise, and so on, till a single list is obtained. This can be
better understood by the following example. This is also called CONCATENATE SORT.
Figure 2 : 2-way merge sort
We give here the recursive implementation of 2 Way Merge Sort.
Mergesort (int LIST[], int low, int high)
{
int mid;
1. If (low < high) then
{
2. mid = (low + high)/2;
3. Mergesort (LIST, low, mid);
4. Mergesort (LIST, mid + 1, high);
5. Merge (low, mid, high, LIST, FINAL)
}
}
Merge (int low, int mid, int high, int LIST[], int FINAL[])
{
int a, b, c, d;
a = low; b = low; c = mid + 1;
While (a <= mid and c <= high) do
{
If LIST[a] <= LIST[c] then
{
FINAL[b] = LIST[a]
a = a + 1
}
else
{
FINAL[b] = LIST[c]
c = c + 1
}
b = b + 1
}
If (a > mid) then
For d = c to high do
{
FINAL[b] = LIST[d]
b = b + 1
}
Else
For d = a to mid do
{
FINAL[b] = LIST[d]
b = b + 1
}
}
To sort the entire list, Mergesort should be called with LIST, 1, N.
Mergesort is the best method for sorting linked lists in random order. The total computing time is
of the order O(n log2 n).
The disadvantage of using mergesort is that it requires two arrays of the same size and type for
the merge phase. That is, to sort a list of size n, it needs space for 2n elements.
12.8 HEAP SORT
We will begin by defining a new structure, the heap. We have studied binary trees in BLOCK 5,
UNIT 1. A binary tree is illustrated below.
Figure 3(a) : Heap 1
A complete binary tree is said to satisfy the 'heap condition' if the key of each node is greater than
or equal to the key in its children. Thus the root node will have the largest key value.
Trees can be represented as arrays, by first numbering the nodes (starting from the root) from left
to right. The key values of the nodes are then assigned to array positions whose index is given by
the number of the node. For the example tree, the corresponding array would be
The relationships of a node can also be determined from this array representation. If a node is at
position j, its children will be at positions 2j and 2j + 1. Its parent will be at position ⌊j/2⌋.
Consider the node M. It is at position 5. Its parent node is, therefore, at position ⌊5/2⌋ = 2, i.e.
the parent is R. Its children are at positions 2×5 and (2×5) + 1, i.e. 10 and 11 respectively, i.e.
E and I are its children. We see from the pictorial representation that these relationships are correct.
A heap is a complete binary tree, in which each node satisfies the heap condition, represented as
an array.
We will now study the operations possible on a heap and see how these can be combined to
generate a sorting algorithm.
The operations on a heap work in 2 steps.
1. The required node is inserted/deleted/replaced.
2. Step 1 may cause violation of the heap condition, so the heap is traversed and modified to
rectify any such violations.
Examples
Insertion
Consider the insertion of a node R in the heap 1.
1. Initially R is added as the right child of J and given the number 13.
2. But R > J, so the heap condition is violated.
3. Move R up to position 6 and move J down to position 13.
4. R > P, therefore the heap condition is still violated.
5. Swap R and P.
6. The heap condition is now satisfied by all nodes.
Figure 3(b) : Heap 2
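The insertion steps above amount to a "sift up" loop. A minimal C++ sketch, assuming a 1-based array with h[0] unused (the function name is ours):

```cpp
#include <vector>
#include <utility>   // std::swap

// Insert a key into a max-heap stored 1-based in h (h[0] is unused), then
// sift it up until the heap condition holds again -- the two-step process
// described in the text.
void heap_insert(std::vector<char>& h, char key) {
    h.push_back(key);                       // step 1: add at the next free position
    int j = (int)h.size() - 1;
    while (j > 1 && h[j] > h[j / 2]) {      // step 2: heap condition violated?
        std::swap(h[j], h[j / 2]);          // move the new key up one level
        j /= 2;
    }
}
```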
Deletion
Consider the deletion of M from heap 2.
1. The larger of M's children is promoted to position 5.
Figure 3(c) : Heap 3
Heapsort is an efficient sorting method based on heap construction and node removal from the heap in
order. This algorithm is guaranteed to sort n elements in O(n log n) steps.
We will first see 2 methods of heap construction and then removal in order from the heap to sort
the list.
1. Top down heap construction
- Insert items into an initially empty heap, keeping the heap condition inviolate at all steps.
2. Bottom up heap construction
- Build a heap with the items in the order presented.
- From the right most node modify to satisfy the heap condition.
Example: Build a heap of the following using both methods of construction.
PROFESSIONAL
Top down construction
Figure 4: Heap Sort (Top down Construction)
Figure 5: Heap Sort by bottom-up approach
We will now see how sorting takes place using the heap built by the top-down approach. The
sorted elements will be placed in X[ ], an array of size 12.
1. Remove S and store in X[12]
2. Remove S and store in X [11]
9. Similarly the remaining nodes are removed and the heap modified, to get the sorted list.
AEFILNOOPRSS
Figure 6 : Sorting process through Heap
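The whole process, bottom-up heap construction followed by removal in order, can be sketched in C++ as follows. This is our own minimal implementation, not the book's code; it sorts the letters of a string using a 1-based auxiliary array:

```cpp
#include <string>
#include <vector>
#include <utility>

// Sift h[j] down within h[1..n] until the heap condition holds.
void sift_down(std::vector<char>& h, int j, int n) {
    while (2 * j <= n) {
        int c = 2 * j;                       // left child
        if (c < n && h[c + 1] > h[c]) ++c;   // pick the larger child
        if (h[j] >= h[c]) break;             // heap condition satisfied
        std::swap(h[j], h[c]);
        j = c;
    }
}

// Heapsort: bottom-up construction, then repeatedly move the root (the
// largest remaining key) to the end of the array and shrink the heap.
std::string heap_sort(const std::string& s) {
    int n = (int)s.size();
    std::vector<char> h(n + 1);              // 1-based storage, h[0] unused
    for (int j = 1; j <= n; ++j) h[j] = s[j - 1];
    for (int j = n / 2; j >= 1; --j) sift_down(h, j, n);   // build the heap
    for (int k = n; k >= 2; --k) {
        std::swap(h[1], h[k]);               // remove the root into X[k]
        sift_down(h, 1, k - 1);              // restore the heap condition
    }
    return std::string(h.begin() + 1, h.end());
}
```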
12.9. HEAPSORT VS. QUICKSORT
Heapsort primarily competes with Quicksort, another efficient, nearly in-place comparison-based
sorting algorithm. Quicksort is typically somewhat faster, but its worst-case running time is O(n²),
which becomes unacceptable for large data sets. See Quicksort for a detailed discussion of this
problem and possible solutions. The Quicksort algorithm also requires Ω(log n) extra storage space,
making it not a strictly in-place algorithm. This typically does not pose a problem except on the
smallest embedded systems. Obscure constant-space variations on Quicksort exist but are never
used in practice due to their extra complexity. In situations where very limited extra storage space
is available, Heapsort is the sorting algorithm of choice. Thus, because of the O(n log n) upper
bound on Heapsort's running time and the constant upper bound on its auxiliary storage, embedded
systems with real-time constraints often use Heapsort.
SUMMARY
Sorting is an important application activity. Many sorting algorithms are available, each the most
efficient for a particular situation or a particular kind of data. The choice of a sorting algorithm is
crucial to the performance of the application.
In this unit we have studied many sorting algorithms used in internal sorting. This is not an
exhaustive list, and the student is advised to read the suggested volumes for exposure to
additional sorting methods and for detailed discussions of the methods introduced here.
FILE STRUCTURE
Unit 1 PHYSICAL STORAGE DEVICES AND THEIR CHARACTERISTICS
History of file structures, Physical Files, Logical Files, Introduction, Magnetic Tape, Data Storage,
Data Retrieval, Magnetic Disk, Floppy Diskette, Characteristics of Magnetic Disk Drives,
Characteristics of Magnetic Disk Processing, Optical Technology.
Unit 2 CONSTITUENTS OF FILE AND FILE OPERATION
Constituents of a File, Field, Record, Header Records, Primary and Secondary Key, File Operations
Unit 3 FILE ORGANIZATIONS
File Concepts, Serial File, Sequential File, Processing Sequential Files, Indexed Sequential File,
Inverted File, Direct File, Multi-list File
Unit 4 HASHING FUNCTIONS AND COLLISION HANDLING METHOD
Hash Tables, Hashing Function, Terms Associated with Hash Tables, Bucket Overflow, Handling of
Bucket Overflows
Unit 1
PHYSICAL STORAGE DEVICES AND THEIR
CHARACTERISTICS
1. History of file structures
2. Types of file
2.1. Physical Files
2.2. Logical Files
3. Functions of file
4. Storage Devices
4.1. Magnetic Tape
4.2. Magnetic Disk
4.3. Floppy Diskette
4.4. Characteristics of Magnetic Disk Drives
5. Data Retrieval
5.1. Data Retrieval (Hard Disk)
5.2. Data Retrieval (Floppy Disk)
6. Optical Technology.
What is File Structure?
A file structure is a combination of representations for data in files and of operations for accessing
the data.
A file structure allows applications to read, write and modify data. It might also support finding the
data that matches some search criteria or reading through the data in some particular order.
1. History of File structure
In a file processing class, you will study the design of files and various operations on files. How is a
file processing course different from a data structures course? In a data structures course, you study
how information is stored in main memory; in a file processing course, you study how information is
stored on secondary memory. Main memory is a volatile storage device: when you power off your
computer system, all the information in random access memory (RAM) is lost. Secondary memory is
a nonvolatile storage device: when you power off your computer system, all the information stored on
secondary memory is retained. Therefore, it is not a good practice to store gigabytes of information in
RAM. Usually information is stored permanently on secondary memory such as tapes, disks, and
optical disks. The storage capacity of RAM is smaller than that of secondary storage devices. RAM
size is about 128 megabytes at a cost of about $15 for most PCs, while secondary storage can be
20-30 gigabytes at a cost of about $100. Also, access to RAM is much faster than access to
secondary storage. The time it takes to retrieve information from RAM is about 120 nanoseconds
(billionths of a second); getting the same information from a disk might take 30 milliseconds
(thousandths of a second). As needed by a software application, information is retrieved from
secondary memory, and the retrieved data stays temporarily in main memory to be processed. The
way files are stored on secondary memory can have a great impact on the speed with which
information is retrieved from it. Therefore, in this course you will study various file structures and their
effects on file accessing speed.
A short history of file structure design: since it takes time to get information from secondary
memory, the goal of file structure design is to minimize the number of accesses to disk. Information
must be organized and grouped so users can get everything they need in one trip to the disk. If we
need a book's title, author, publisher and patron, we should get all the related information in one
place instead of looking in different places on the disk. File structure issues become complicated as
files change, grow and shrink when add and delete operations are applied to them. Index and B-tree
structures were developed over the past 30 years to reduce file processing time and preserve file
integrity.
2. Types of file
2.1 Logical File
A channel that hides the details of the file's location and physical format from the program.
The logical file has a logical name used for referring to the file inside the program. This logical
name is a variable inside the program. In C, for instance:
FILE *fp;
In C++ the logical name is the name of an object of the class fstream, like:
fstream outfile;
2.2 Physical File
A collection of bytes stored on a disk or tape.
A physical file has a name, for instance ourfile.exe. There may be thousands of physical files on a
disk, but a program only has about 20 logical files open at the same time.
When a program wants to use a particular file, "data", the OS must find the physical file called
"data" and make the hookup by assigning a logical file to it. This logical file has a logical name,
which is what is used inside the program.
The Operating System is responsible for associating a logical file in a program to a physical file in
disk or tape
3. Functions of file
Reading
read(source_file, destination_address, size);
source_file: location the program reads from, i.e. its logical file name
destination_address: first address of the memory block where we want to store the data.
size: how much information is being brought in from the file (byte count).
eg. (Based on C)
char a;
FILE *fp;
…
fp = fopen("ourfile.txt", "r");
fread(&a, 1, 1, fp);
eg. (Based on C++)
char a;
fstream infile;
infile.open("ourfile.txt", ios::in);
infile >> a;
Writing
write(destination_file, source_addr, size);
destination_file: The logical file name where the data will be written.
source_addr: First address of the memory block where the data to be written is stored.
size: The number of bytes to be written.
eg. (Based on C)
char a;
FILE *fp;
…
fp = fopen("ourfile.txt", "w");
…
fwrite(&a, 1, 1, fp);
eg. (Based on C++)
char a;
fstream outfile;
outfile.open("ourfile.txt", ios::out);
outfile << a;
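The read and write operations above can be combined into a small round trip: write a character to a file, then read it back. A minimal C++ sketch (the function names and file name are illustrative):

```cpp
#include <fstream>
#include <string>

// Write one character to a file, then read it back: the write and read
// operations described above, using C++ streams.
void write_char(const std::string& name, char c) {
    std::fstream outfile(name, std::ios::out);
    outfile << c;                // write to the logical file
}                                // the stream destructor closes the file

char read_char(const std::string& name) {
    char c = 0;
    std::fstream infile(name, std::ios::in);
    infile >> c;                 // read from the logical file
    return c;
}
```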
Opening Files
The first operation generally done on an object of one of these classes is to associate it to a real file, that is
to say, to open a file. The open file is represented within the program by a stream object (an instantiation of
one of these classes) and any input or output performed on this stream object will be applied to the physical
file.
In order to open a file with a stream object we use its member function open():
void open (const char * filename, openmode mode);
where filename is a string of characters representing the name of the file to be opened and mode is a
combination of the following flags:
ios::in      Open the file for reading
ios::out     Open the file for writing
ios::ate     Initial position: end of file
ios::app     Every output is appended at the end of the file
ios::trunc   If the file already existed it is erased
ios::binary  Binary mode
These flags can be combined using the bitwise OR operator: |. For example, if we want to open the file
"example.bin" in binary mode to add data we could do it by the following call to the member function open:
ofstream file;
file.open ("example.bin", ios::out | ios::app | ios::binary);
All of the member functions open of classes ofstream, ifstream and fstream include a default mode when
opening files that varies from one to the other:
class      default mode parameter
ofstream   ios::out | ios::trunc
ifstream   ios::in
fstream    ios::in | ios::out
The default value is only applied if the function is called without specifying a mode parameter. If the
function is called with any value in that parameter, the default mode is overridden, not combined.
Since the first task that is performed on an object of classes ofstream, ifstream and fstream
is frequently to open a file, these three classes include a constructor that directly calls the
open member function and has the same parameters as open. This way, we could also have
declared the previous object and performed the same opening operation just by writing:
ofstream file ("example.bin", ios::out | ios::app | ios::binary);
Both forms to open a file are valid.
You can check if a file has been correctly opened by calling the member function is_open():
bool is_open();
which returns true if the object has indeed been correctly associated with an open file, and false
otherwise.
Closing files
When reading, writing or consulting operations on a file are complete we must close it so that it becomes
available again. In order to do that we shall call the member function close(), that is in charge of flushing the
buffers and closing the file. Its form is quite simple:
void close ();
Once this member function is called, the stream object can be used to open another file, and the file is
available again to be opened by other processes.
If an object is destroyed while still associated with an open file, the destructor
automatically calls the member function close.
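Putting open(), is_open() and close() together, a typical usage pattern looks like this. A minimal sketch; the function name and file name are illustrative:

```cpp
#include <fstream>

// Open a file, verify the open with is_open(), write to it, and close it
// explicitly, as described above.
bool append_byte(const char* name, char c) {
    std::ofstream file;
    file.open(name, std::ios::out | std::ios::app | std::ios::binary);
    if (!file.is_open()) return false;   // the open failed
    file << c;
    file.close();                        // flush buffers and release the file
    return true;                         // the stream object may now open another file
}
```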
4. Storage Devices
It is important to know the difference between secondary storage and a computer's main
memory. Secondary storage is also called auxiliary storage and is used to store data and
programs when they are not being processed. Secondary storage is more permanent than
main memory, as data and programs are retained when the power is turned off. The needs
of secondary storage can vary greatly between users. A personal computer might only
require 20,000 bytes of secondary storage but large companies, such as banks, may
require secondary storage devices that can store billions of characters. Because of such a
variety of needs, a variety of storage devices are available. The two most common types
of secondary storage are magnetic tapes and magnetic disks.
Computer storage is the holding of data in an electromagnetic form for access by a computer processor.
Primary storage is data in random access memory (RAM) and other "built-in" devices. Secondary storage is
data on hard disks, tapes, and other external devices.
Primary storage is much faster to access than secondary storage because of the proximity of the storage to
the processor or because of the nature of the storage devices. On the other hand, secondary storage can
hold much more data than primary storage.
4.1 Magnetic Tape Storage
A magnetically coated strip of plastic on which data can be encoded. Tapes for computers are
similar to tapes used to store music.
Storing data on tapes is considerably cheaper than storing data on disks. Tapes also have large
storage capacities, ranging from a few hundred kilobytes to several gigabytes. Accessing data on
tapes, however, is much slower than accessing data on disks. Tapes are sequential-access media,
which means that to get to a particular point on the tape, the tape must go through all the
preceding points. In contrast, disks are random-access media because a disk drive can access
any point at random without passing through intervening points.
Because tapes are so slow, they are generally used only for long-term storage and backup. Data
to be used regularly is almost always kept on a disk. Tapes are also used for transporting large
amounts of data.
Tapes come in a variety of sizes and formats.
Tapes are sometimes called streamers or streaming tapes.
Magnetic tape is a one-half inch or one-quarter inch ribbon of plastic material on which
data is recorded. The tape drive is an input/output device that reads, writes and erases
data on tapes. Magnetic tapes are erasable, reusable and durable. They are made to store
large quantities of data inexpensively and therefore are often used for backup. Magnetic
tape is not suitable for data files that are revised or updated often because it stores data
sequentially.
4.2 Magnetic Disk (Organization of disks)
Magnetic disks are the most widely used storage medium for computers. A magnetic disk
offers high storage capacity, reliability, and the capacity to directly access stored data.
Magnetic disks hold more data in a small place and attain faster data access speeds.
Types of magnetic disks include diskettes, hard disks, and removable disk cartridges.
The information stored on a disk is stored on the surface of one or more platters. The
arrangement is such that the information is stored in successive tracks on the surface of
the disk. Each track is often divided into a number of sectors. A sector is the smallest
addressable portion of a disk. When a read statement calls for a particular byte from a disk
file, the computer OS finds the correct surface, track and sector; reads the entire sector
into a special area in memory called a buffer and then finds the requested byte within that
buffer.
Disk drives typically have a number of platters. The tracks that are directly above and
below one another form a cylinder. The significance of a cylinder is that all of the information on a
single cylinder can be accessed without moving the arm that holds the read/write heads.
Moving this arm is called seeking. This arm movement is usually the slowest part of
reading information from disk.
Storing the data:
Data is stored on the surface of a platter in sectors and tracks. Tracks are concentric
circles, and sectors are pie-shaped wedges on a track. The process of low-level formatting
a drive establishes the tracks and sectors on the platter. The starting and ending points of
each sector are written onto the platter. This process prepares the drive to hold blocks of
bytes. High-level formatting then writes the file-storage structures, like the file-allocation
table, into the sectors. This process prepares the drive to hold files.
Capacity and space needed:
Platters are organized into specific structures to enable the organized storage and retrieval
of data. Each platter is broken into tracks--tens of thousands of them, which are tightly
packed concentric circles. These are similar in structure to the annual rings of a tree (but
not similar to the grooves in a vinyl record album, which form a connected spiral and not
concentric rings).
A track holds too much information to be suitable as the smallest unit of storage on a disk,
so each one is further broken down into sectors. A sector is normally the smallest
individually addressable unit of information stored on a hard disk, and normally holds 512
bytes of information. The first PC hard disks typically held 17 sectors per track. Today's
hard disks can have thousands of sectors in a single track, and make use of zoned
recording to allow more sectors on the larger outer tracks of the disk.
A platter from a 5.25" hard disk, with 20 concentric tracks drawn
over the surface. This is far lower than the density of even the oldest
hard disks; even if visible, the tracks on a modern hard disk would
require high magnification to resolve. Each track is divided into
16 imaginary sectors. Older hard disks had the same number of
sectors per track, but new ones use zoned recording with a different
number of sectors per track in different zones of tracks.
All information stored on a hard disk is recorded in tracks, which are concentric circles
placed on the surface of each platter, much like the annual rings of a tree. The tracks are
numbered, starting from zero, starting at the outside of the platter and increasing as you
go in. A modern hard disk has tens of thousands of tracks on each platter.
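Raw capacity follows directly from this geometry: multiply the number of recording surfaces, tracks per surface, sectors per track, and bytes per sector. A sketch with hypothetical drive parameters (real drives vary widely, and with zoned recording the sectors-per-track value is only an average):

```cpp
#include <cstdint>

// Raw capacity of a disk from its geometry: surfaces x tracks per surface
// x sectors per track x bytes per sector.
std::uint64_t disk_capacity_bytes(std::uint64_t surfaces,
                                  std::uint64_t tracks_per_surface,
                                  std::uint64_t sectors_per_track,
                                  std::uint64_t bytes_per_sector) {
    return surfaces * tracks_per_surface * sectors_per_track * bytes_per_sector;
}
```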
Hard Disks
Hard disks provide larger and faster secondary storage capabilities than diskettes. Usually
hard disks are permanently mounted inside the computer and are not removable like
diskettes. On minicomputers (G) and mainframes (G), hard disks are often called fixed
disks. They are also called direct-access storage devices (DASD). Most personal
computers have two to four disk drives. The input/output device that transfers data to and
from the hard disk is the hard disk drive.
HARD DISK STORAGE CAPACITY
Like diskettes, hard disks must be formatted before they can store information. The storage
capacity for hard drives is measured in megabytes. Common sizes for personal computers
range from 100MB to 500MB of storage. Each 10MB of storage is equivalent to
approximately 5,000 printed pages (with approximately 2,000 characters per page).
Disk Cartridges
Removable disk cartridges are another form of disk storage for personal computers. They
offer the storage and fast access of hard disks and the portability of diskettes. They are
often used when security is an issue since, when you are done using the computer, the
disk cartridge can be removed and locked up leaving no data on the computer.
4.3 Floppy Diskettes
It is a small flexible mylar disk coated with iron-oxide on which data are stored. These disks
are available in three sizes :
1. 8-inch portable floppy (flexible) disks.
2. 5 1/4-inch portable floppy disks.
These two diskettes are individually packaged in protective envelopes. Both floppies
are the most popular online secondary storage medium used in PC and intelligent terminal
systems.
3. Compact floppy disks measuring less than 4 inches in diameter - The 3 1/2-inch diameter
miniature diskettes are individually packed in a hard plastic case. This case has a dust-sealing
and finger-proof shutter which opens automatically once the case is inserted in its
disk drive. These disks are popular in desktop and portable personal computers.
A floppy disk may be single-sided (data can be recorded on one side only) or double-sided
(data can be recorded on both sides of the disk). All disks have two sides, but the difference
depends on whether the disk has been certified free of errors on one or both sides. Data
recorded on a double-sided disk by a double-sided drive can't be read on a single-sided
drive because of the way the data are stored. A double-sided drive has 2 read/write heads,
one for each side, while a single-sided drive has only one read/write head. The most
commonly used diskettes are described below.
DISKETTE STORAGE CAPACITY
Before you can store data on your diskette, it must be formatted (G). The amount of data
you can store on a diskette depends on the recording density and the number of tracks on
the diskette. The recording density is the number of bits (G) that can be recorded on one
inch of track on the diskette, or bits per inch (bpi). The second factor that influences the
amount of data stored on a diskette is the number of tracks on which the data can be stored
or tracks per inch (tpi). Commonly used diskettes are referred to as either double-density or
high-density (single-density diskettes are no longer used). Double-density diskettes (DD)
can store 360K for a 5 1/4 inch diskette and 720K for a 3 1/2 inch diskette. High-density
diskettes (HD) can store 1.2 megabytes (G) on a 5 1/4 inch diskette and 1.44 megabytes on
a 3 1/2 inch diskette.
CARE OF DISKETTES
You should keep diskettes away from heat, cold, magnetic fields (including telephones) and
contaminated environments such as dust, smoke, or salt air. Also keep them away from food
and do not touch the disk surface.
4.4. Characteristics of Magnetic Disk Drives
1. The storage capacity of a single disk ranges from 10MB to 10GB. A typical commercial database
may require hundreds of disks.
2. Figure 10.2 shows a moving-head disk mechanism.
o Each disk platter has a flat circular shape. Its two surfaces are covered with a magnetic
material and information is recorded on the surfaces. The platter of hard disks are made
from rigid metal or glass, while floppy disks are made from flexible material.
o The disk surface is logically divided into tracks, which are subdivided into sectors. A sector
(varying from 32 bytes to 4096 bytes, usually 512 bytes) is the smallest unit of information
that can be read from or written to disk. There are 4-32 sectors per track and 20-1500 tracks
per disk surface.
o The arm can be positioned over any one of the tracks.
o The platter is spun at high speed.
o To read information, the arm is positioned over the correct track. When the data to be
accessed passes under the head, the read or write operation is performed.
3. A disk typically contains multiple platters (see Figure 10.2). The read-write heads of all the tracks are
mounted on a single assembly called a disk arm, and move together.
o Multiple disk arms are moved as a unit by the actuator.
o Each arm has two heads, to read disks above and below it.
o The set of tracks over which the heads are located forms a cylinder.
o This cylinder holds the data that is accessible within the disk latency time.
o It is clearly sensible to store related data in the same or adjacent cylinders.
4. Disk platters range from 1.8" to 14" in diameter; 5 1/4" and 3 1/2" disks dominate due to their lower
cost and faster seek times compared with larger disks, yet they provide high storage capacity.
5. A disk controller interfaces between the computer system and the actual hardware of the disk drive.
It accepts commands to read/write a sector, and initiates actions. Disk controllers also attach
checksums to each sector to check for read errors.
6. Remapping of bad sectors: if a controller detects that a sector is damaged when the disk is initially
formatted, or when an attempt is made to write the sector, it can logically map the sector to a
different physical location.
7. SCSI (Small Computer System Interconnect) is commonly used to connect disks to PCs and
workstations. Mainframe and server systems usually have a faster and more expensive bus to
connect to the disks.
8. Head crash: the read/write head touches the platter surface, which destroys the magnetic coating
and causes the entire disk to fail.
9. A fixed-head disk has a separate head for each track -- very many heads, very expensive. Multiple
disk arms allow more than one track to be accessed at a time. Both were used in high-performance
mainframe systems but are relatively rare today.
Magnetic disk drives are direct-access devices designed to minimize the access time required to
locate specific records. Each drive has not one but a series of access arms that can locate records
on specific surfaces. The time it takes to locate specific records will be much less than that
required by a tape drive with only one read/write mechanism.
There are several types of disk mechanisms.
1. Moving-Head Magnetic Disk: In a moving-head disk system, all the read/write heads are
attached to a single movable access mechanism. Thus the access mechanism moves
directly to a specific disk address, as indicated by the computer.
Because all the read/write heads move together to locate a record, this type of mechanism
has a relatively slow access rate as compared to other disks. The access time, however, is
still considerably faster than that for tape.
2. Fixed-Head Magnetic Disk: Since disks are generally used for high-speed access of
records from a file (for example, an airline reservation file), any method that can reduce access
time would result in a substantial benefit. For this reason, fixed-head magnetic disks were
developed. These devices do not have a movable access arm.
Instead, each track has its own read/write mechanism that accesses a record as it rotates past the
arm. The disks in this device are not removable and the capacity of each disk is somewhat less, but
the access time is significantly reduced.
Still other disk devices combine the technologies of both moving and fixed-head access to produce
a high-capacity, rapid access device.
5. Data Retrieval
5.1 Data Retrieval-Hard Disk
In the case of a stacked disk system, there is an access arm with two read/write heads for each
recording surface. All these access arms move in unison in and out of the disks. Each hard disk
recording surface has a specified number of tracks and sectors. A particular track on all the surfaces of
the multiple disks comprises a particular cylinder of the disks. For example, all the tenth tracks of all
the surfaces together form the tenth cylinder of the disks. Thus, a cylinder may be defined as all tracks on
the magnetic disks that are accessible by a single movement of the access mechanism. The track position
of the read/write head on the top recording surface is the same as the track positions of the other heads on the
access arms serving the other surfaces.
For accessing a record, its disk address (i.e. the cylinder number, surface number and record
number) must be provided by a computer program. The motor rotates the disk at a high speed, and
in one revolution the data on surface 1 of the particular track is read. In the next revolution, the data on
surface 2 of the same track is read. A full track of data can be read in a single revolution. This
procedure continues down the cylinder with very fast movement of the access arms.
After access of the data, they are copied from the disk to the processor at a rate (transfer rate)
which depends on the density of the stored data and the disk's rotation speed.
5.2 Data Retrieval-Floppy Disk
As shown in figure 8.7, a typical floppy disk jacket has a small hole to the right of the larger
center hole, called the index-hole window. The disk itself also has a small hole, called the index hole,
which cannot be seen unless it is aligned with the jacket hole. This index hole enables the disk
controller to locate disk sectors.
Access of data is carried out in the following steps:
1. The read/write head moves to the track specified in the disk address (i.e. track number and
sector number).
2. The disk-drive controller locates the index reference, i.e. the index hole, for determining sector
locations. This takes place with a light passing through the disk once the index hole is aligned
with the index-hole window. This light triggers a light-sensing device which marks the
hole's location, and hence the index hole is detected.
3. The disk controller begins reading data.
4. When the specified sector passes under the head, the controller transmits the data to the processor
unit.
6. Optical Technology
The first optical disks that became commercially available, around 1982, were compact
audio disks. Since then, and during a particularly active marketing period from 1984 to
1986, at least a dozen other optical formats have emerged or are under development. The
rapid proliferation of formats has led, understandably, to some confusion. This digest will
briefly describe the most prominent formats (and their acronyms), and the contexts in
which they are used.
Optical disks go by many names--OD, laser disk, and disc among them--all of which are
acceptable. At first glance, they bear some resemblance to floppy disks: they may be
about the same size, store information in digital form, and be used with microcomputers.
But where information is encoded on floppy disks in the form of magnetic charges, it is
encoded on optical disks in the form of microscopic pits. These pits are created and read
with laser (light) technology.
The optical disks that are sold are actually "pressed," or molded from a glass master disk,
in somewhat the same way as phonograph records. The copies are composed of clear
hard plastic with a reflective metal backing, and protected by a tough lacquer. As the disk
spins in the optical disk player (reader, drive), a reader head projects a beam from a low-power laser through the plastic onto the pitted data surface and interprets the reflected
light. Optical disk players can stand alone, like portable tape players, or be connected to
stereos, televisions, or microcomputers.
Optical disks lack the erasability and access speed of floppies, but these disadvantages
are offset by their huge storage capacity, higher level of accuracy, and greater durability.
A storage medium from which data is read and to which it is written by lasers. Optical disks can store much
more data -- up to 6 gigabytes (6 billion bytes) -- than most portable magnetic media, such as floppies. There
are three basic types of optical disks:
 CD-ROM : Like audio CDs, CD-ROMs come with data already encoded onto them. The data is
permanent and can be read any number of times, but CD-ROMs cannot be modified.
 WORM : Stands for write-once, read-many. With a WORM disk drive, you can write data onto a
WORM disk, but only once. After that, the WORM disk behaves just like a CD-ROM.
 Erasable : Optical disks that can be erased and loaded with new data, just like magnetic disks.
These are often referred to as EO (erasable optical) disks.
These three technologies are not compatible with one another; each requires a different
type of disk drive and disk. Even within one category, there are many competing formats,
although CD-ROMs are relatively standardized.
Unit 2
CONSTITUENTS OF FILE AND FILE OPERATION
1. Field and Record
2. Header Records
3. Primary and Secondary Key
4. File Operations
Record
A record can be defined as a set of fields that belong together when the file is viewed in terms of a
higher level of organization. Like the notion of a field, a record is another conceptual tool. It is
another level of organization that we impose on the data to preserve meaning. Records do not
necessarily exist in the file in any physical sense, yet they are an important logical notion included
in the file's structure.
Most often, as in the example above, a record in a file represents a structured data object. Writing
a record into a file can be thought of as saving the state (or value) of an object that is stored in
memory. Reading a record from a file into a memory resident object restores the state of the
object. It is our goal in designing file structures to facilitate this transfer of information between
memory and files. We will use the term object to refer to data residing in memory and the term record to refer to data residing in a file.
Following are some of the most often used methods for organizing the records of a file:
• Require that the records be a predictable number of bytes in length.
• Require that the records be a predictable number of fields in length.
• Begin each record with a length indicator consisting of a count of the number of bytes that the record contains.
• Use a second file to keep track of the beginning byte address for each record.
• Place a delimiter at the end of each record to separate it from the next record.
Field and Record Organization
• The basic logical unit of data is the field, which contains a single data value.
• Fields are organized into aggregates, either as many copies of a single field (an array) or as a list of different fields (a record).
• When a record is stored in memory, we refer to it as an object and refer to its fields as members.
• In this lecture, we will investigate the many ways that objects can be represented as records in files.
Stream Files
• Simplest view of a file: a stream of bytes (characters)
• May end with a special control character
• Does not recognize or impose any structure or meaning
• Unix views a file as a stream of bytes
• Some utilities (sort, grep) recognize lines and, within lines, whitespace characters as data delimiters.
Record Files
A view that sees a file as a collection of records:
• A record is a collection of related fields
• A field is the smallest unit of meaningful data
• This view is directly supported by COBOL
To define a field in a file:
• Force it into a predictable length
• Precede it with a byte count:
  15Harinder Grover
  10Susheel Shukla
  28Mukul Gupta
• Separate fields with a delimiter:
  Harinder Grover:Susheel Shukla:Mukul Gupta
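The delimited representation above can be pulled apart with the C standard library. A minimal sketch, assuming ':' as the delimiter as in the example; the helper name split_fields is illustrative, not from the text:

```c
#include <string.h>

/* Split a delimited line into fields, as in
   "Harinder Grover:Susheel Shukla:Mukul Gupta".
   Returns the number of fields found (at most max_fields).
   Note: strtok modifies the line in place. */
int split_fields(char *line, char fields[][32], int max_fields)
{
    int n = 0;
    for (char *tok = strtok(line, ":"); tok != NULL && n < max_fields;
         tok = strtok(NULL, ":")) {
        strncpy(fields[n], tok, 31);
        fields[n][31] = '\0';
        n++;
    }
    return n;
}
```

The delimiter character must be one that can never appear inside a field value, which is the main design constraint of this method.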
To define a field in a file, consider the following:
• Space needed to represent a field
• Ability to represent all data
• Missing values
• Processing time to recognize a field
Field Structures
There are many ways of adding structure to files to maintain the identity of fields:
• Force the field into a predictable length
• Begin each field with a length indicator
• Use a “keyword = value” expression to identify each field and its content.
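The length-indicator approach above can be sketched in C. The helper name write_counted_field is hypothetical, and a two-digit count is assumed, matching the "15Harinder Grover" example:

```c
#include <stdio.h>
#include <string.h>

/* Prefix a field with a two-digit byte count, producing
   strings such as "15Harinder Grover". Returns the number
   of characters written, or -1 if the field is too long. */
int write_counted_field(char *out, size_t outsize, const char *field)
{
    size_t len = strlen(field);
    if (len > 99)            /* two digits can count at most 99 bytes */
        return -1;
    return snprintf(out, outsize, "%02zu%s", len, field);
}
```

A reader then consumes the two count digits first and knows exactly how many bytes of field data follow, so no delimiter is needed.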
Reading a Stream of Fields
• A program can easily read a stream of fields and output them
• This time we do preserve the notion of fields, but something is still missing: rather than a stream of fields, these should be two records.
Last Name: Sharma
First Name: Gopi Ram
Address: 30, JP Road.
City: Bombay
State: Maharashtra
Zip Code: 2145631
Last Name: Sinha
First Name: Ram Prasad
Address: 13, Krishna Nagar.
City: Delhi
State: Delhi
Zip Code: 110011
Record Structures that use a Length Indicator
• The notion of records that we implemented is lacking something: none of the variability in the length of records that was inherent in the initial stream file was conserved.
• Implementation involves:
  o Writing the variable-length records to the file
  o Representing the record length
  o Reading the variable-length records from the file
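One way to implement the three steps above is to store a binary length in front of each record's bytes. A sketch, assuming an unsigned short length prefix; the names write_record and read_record are illustrative, not from the text:

```c
#include <stdio.h>
#include <string.h>

/* Write a variable-length record preceded by a length indicator. */
int write_record(FILE *fp, const char *rec)
{
    unsigned short len = (unsigned short)strlen(rec);
    if (fwrite(&len, sizeof len, 1, fp) != 1) return -1;
    if (fwrite(rec, 1, len, fp) != len) return -1;
    return 0;
}

/* Read the next record: first the length, then that many bytes. */
int read_record(FILE *fp, char *buf, size_t bufsize)
{
    unsigned short len;
    if (fread(&len, sizeof len, 1, fp) != 1) return -1;
    if ((size_t)len >= bufsize) return -1;
    if (fread(buf, 1, len, fp) != len) return -1;
    buf[len] = '\0';
    return (int)len;
}
```

Because each record carries its own length, the variability of record sizes in the original stream file is conserved.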
2. Header Records
Sometimes data about the file is kept within the file.
• A special record at the beginning of the file could contain information such as the number of following records, the length of records, etc.
• Header records may also precede related groups of records.
  o Example: Employee file
    Employee records are grouped by department.
    Each group is preceded by a record that indicates the number of records in the following department group.
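A header record of the kind described can be sketched as a fixed struct written at the front of the file. The field names here are assumptions for illustration, not from the text:

```c
#include <stdio.h>

/* A header record holding general information about the file. */
struct file_header {
    long record_count;   /* number of data records that follow */
    long record_length;  /* length of each fixed-size record   */
};

int write_header(FILE *fp, const struct file_header *h)
{
    rewind(fp);                 /* the header lives at byte 0 */
    return fwrite(h, sizeof *h, 1, fp) == 1 ? 0 : -1;
}

int read_header(FILE *fp, struct file_header *h)
{
    rewind(fp);
    return fread(h, sizeof *h, 1, fp) == 1 ? 0 : -1;
}
```

A program opening the file reads the header first and then knows how many records follow and how long each one is, before touching any data record.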
Record Structure
Choosing a record structure and record length within a fixed-length record. There are two approaches:
1. Fixed-length fields in the record (simple but problematic).
2. Varying field boundaries within the fixed-length record.
Header records are often used at the beginning of the file to hold some general information about a file to assist in future use of the file.
To organize a file into records:
• Make records a predictable length
  o Fields within the record may or may not be fixed
  o Allows a seek to a particular record. Example:
    rec size = 100, want to retrieve record number 50:
    seek to file position 5000 and read 100 bytes
• Make records a predictable number of fields
• Precede each record with a byte count
• Separate records with a delimiter
• Keep a table that holds the byte position of each record
  o Requires two files
  o Allows a seek to the desired record
  o Example: if the length of record 1 is 40 bytes, record 2 is 15 bytes, record 3 is 30 bytes, and record 4 is 20 bytes, the index would have 5 entries:
    00 40 55 85 105
    To access record 3, find the byte offset in the table (55) and the length of the record (85 - 55 = 30): seek to file position 55 and read 30 bytes.
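Both seek calculations above reduce to simple arithmetic. A sketch, using the five-entry index from the example; the function names are illustrative:

```c
/* Fixed-length records, numbered from 0: record 50 with
   rec size 100 begins at byte 50 * 100 = 5000. */
long fixed_offset(long recno, long recsize)
{
    return recno * recsize;
}

/* Byte-offset index for the four variable-length records in the
   example: lengths 40, 15, 30, 20 give entries 0 40 55 85 105.
   Records are numbered from 1 here, as in the text. */
static const long rec_index[5] = {0, 40, 55, 85, 105};

long record_offset(int recno)
{
    return rec_index[recno - 1];
}

long record_length(int recno)
{
    return rec_index[recno] - rec_index[recno - 1];
}
```

With either function, the offset it returns is passed straight to a seek call (e.g. fseek) before reading the record.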
Record Structure I
• A record can be defined as a set of fields that belong together when the file is viewed in terms of a higher level of organization.
• Like the notion of a field, a record is another conceptual tool, which need not exist in the file in any physical sense.
• Yet records are an important logical notion included in the file's structure.
Record Structure II
Methods for organizing the records of a file include:
o Requiring that the records be a predictable number of bytes in length.
o Requiring that the records be a predictable number of fields in length.
o Beginning each record with a length indicator consisting of a count of the number of bytes that the record contains.
o Using a second file to keep track of the beginning byte address for each record.
o Placing a delimiter at the end of each record to separate it from the next record.
3. Primary and Secondary Key
To explore this idea, we begin with an analogy. Suppose that you have a filing system
containing information dealing with different companies. A separate folder exists for each
company, and the folders are arranged alphabetically by company name. Obviously, if you
wish to retrieve data for a particular company, you must know the name of the company.
This may seem trivial, but it is in fact the crux of an important concept: if you wish to
locate a record in a random-access file, its identifier must be specified. This identifier is
technically called a record key.
The way in which keys are used in conjunction with random-access files is illustrated in
figure 9.9. If the value of a record's key is known, it may be supplied to the DBMS, which
converts the value into the disk address at which the record is located. The question mark
in the figure indicates that several methods exist for doing the conversion, depending on
the type of random-access file.
If a particular key value locates only a single record, the key is said to be unique. On the
other hand, a non-unique key may correspond to several records.
Usually, a file must have at least one unique key, whose value specifies the principal
"name" of each record. This is called the primary key of the file. If a file has more than one
unique field, the one chosen to be the primary key would be that whose values seemed
best suited as the record "names." Any keys other than the primary key are called
secondary keys. A secondary key may be either unique or non-unique.
The simplest type of primary key consists of a single field within each record, often
referred to as a key field. In some situations, the database designer specifies to the DBMS
which field is to be the primary key, in other cases, the DBMS itself creates a special
primary-key field.
Concatenated Keys
A key may be formed by combining two or more fields. For example, consider a simple file
for holding data on customer payments to Jellybeans, Inc.
If we assume that there may be more than one customer with the same name, then there
is no unique field in the file. However, if a combination of two fields is unique, it may be
used as a primary key. For example, it is unlikely that a customer will submit two payments
on the same day. Therefore, the combination of CUSTOMER-NAME and DATE-RECEIVED is unique, and it may be used as a primary key. This type of field combination is called a concatenated key, and it is expressed as:
CUSTOMER-NAME + DATE-RECEIVED
Each key value is generated by literally connecting, or concatenating, values from the two
fields together. For example, suppose that a particular record in the file contains the
following values:
CUSTOMER_NAME: "Jones, Ltd"
AMOUNT_PAID: 125
DATE_RECEIVED: "11/25/85"
PAYMENT_TYPE: "MO"
The concatenated key for the record would be "Jones, Ltd. 11/25/85".
Any type of field, or combination of fields, may be a key. Regardless of the composition of
the key, the process represented in figure 9.4 is valid because any type of key may be
used to generate an address.
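Building the concatenated key is just a matter of joining the two field values. A sketch using snprintf; the name make_key and the single-space join character are assumptions for illustration:

```c
#include <stdio.h>

/* Form a concatenated key CUSTOMER-NAME + DATE-RECEIVED,
   following the Jellybeans, Inc. example above. */
int make_key(char *out, size_t outsize,
             const char *customer_name, const char *date_received)
{
    return snprintf(out, outsize, "%s %s", customer_name, date_received);
}
```

The resulting string can then be handed to whatever address-generation method the file organization uses, exactly as a single-field key would be.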
4. File Operations
The fundamental operations that are performed on files are the following:
1. Creation
2. Update, including:
   record insertion
   record modification
   record deletion
3. Retrieval, including:
   inquiry
   report generation
4. Maintenance, including:
   restructuring
   reorganization
Creating a File
The initial creation of a file is also referred to as the loading of the file. The bulk of the work
in creating transaction and master files involves data collection and validation. In some
implementations, space is first allocated to the file, then the data are loaded into that
skeleton. In other implementations, the file is constructed a record at a time. We will see
examples of both approaches. In many cases, data are loaded into a transaction or master
file in batch mode, even if the file actually is built one record at a time. Loading a master
file interactively can be excessively time-consuming and labor-intensive if large volumes of
data are involved.
The contents of a master file represent a snapshot in time of the part of the world that the master file represents. For example, the payroll master file represents the present state of the company's payroll situation: month-to-date and year-to-date fields indicate appropriately accumulated figures for amounts paid, vacation taken, vacation due, etc., for each employee.
Updating a File
Changing the contents of a master file to make it reflect a more current snapshot of the
real world is known as updating the file. These changes may include (1) the insertion of
new record occurrences, e.g., adding a record for a newly hired employee, (2) the
modification of existing record occurrences, e.g., changing the pay rate for an employee
who has received a raise, and (3) the deletion of existing record occurrences, e.g.,
removing the record of an employee who has left the company. The updated file then
represents a more current picture of reality.
In some implementations, the records of a file can be modified in place, new records can be inserted in available free space, and records can be deleted to make space available for reuse. If a file is updated in place by a program, then the file usually is an input/output file for that program.
Some implementations are more restrictive, and a file cannot be updated in place. In these cases the old file is input to an update program and a new version of the file is output. The file is essentially recreated with current information. However, not all of the records need to have been modified; some (maybe even most) of the records may have been copied directly from the old version to the updated version of the file. We will consider this situation further in detail in sequential file organization.
Retrieving from a File
The access of a file for purposes of extracting meaningful information is called retrieval.
There are two basic classes of file retrieval: inquiry and report generation. These two
classes can be distinguished by the volume of data that they produce. An inquiry results in
a relatively low-volume response, whereas a report may create many pages of output.
However, some installations prefer to distinguish between inquiry and report generation by
their modes of processing. If a retrieval is processed interactively, these installations would
call the retrieval an inquiry or query. If a retrieval is processed in batch mode, the retrieval
would be called report generation. This terminology tends to make report generation more of a planned, scheduled process and inquiry more of an ad hoc, spontaneous process. Both kinds of retrieval are required by most information systems.
An inquiry generally is formulated in a query language, which ideally is a natural-language-like structure that is easy for a "non-computer-expert" to learn and to use. A query processor is a program that translates the user's inquiries into instructions that are used directly for file access. Most installations that have query processors have acquired them from vendors rather than designing and implementing them in-house.
A file retrieval request can be comprehensive or selective. Comprehensive retrieval reports
information from all the records on a file, whereas selective retrieval applies some
qualification criteria to choose which records will supply information for output. Examples
of selective retrieval requests formulated in a typical but fictitious query language are the
following:
FIND EMP-NAME OF EMP-PAY-RECORD WHERE
EMP-NO = 12751
FIND ALL EMP-NAME, EMP-NO OF EMP-PAY-RECORD
WHERE EMP-DEPT-NAME = "MIS"
FIND ALL EMP-NAME, EMP-NO OF EMP-PAY-RECORD
WHERE 20,000 < EMP-SAL < 40,000
FIND ALL EMP-NAME, EMP-AGE, EMP-PHONE OF
EMP-PAY-RECORD WHERE EMP-AGE < 40 AND
EMP-SEX = "M" AND EMP-SAL > 50,000
COUNT EMP-PAY-RECORDS WHERE EMP-AGE < 40
FIND AVERAGE EMP-SAL OF EMP-PAY-RECORD
WHERE DEPT-NAME = "MIS"
In each case the WHERE clause gives the qualification criteria. Note that the last two
queries apply aggregate functions COUNT and AVERAGE to the qualifying set of records.
Some file organizations are better suited to selective retrievals and others are more suited
to comprehensive retrievals. We will study examples of both types.
Maintaining a File
Changes that are made to files to improve the performance of the programs that access
them are known as maintenance activities. There are two basic classes of maintenance
operations: restructuring and reorganization. Restructuring a file implies that structural
changes are made to the file within the context of the same file organization technique. For
example, field widths could be changed, new fields could be added to records, more space might be allocated to the file, the index tree of the file might be balanced, or the records of the file might be resequenced, but the file organization method would remain the same. File reorganization implies a change from one file organization to another.
The various file organizations differ in their maintenance requirements. These
maintenance requirements are also very dependent upon the nature of activity on the file
contents and how quickly that activity changes. Some implementations have file
restructuring utilities that are automatically invoked by the operating system; others require
that data processing personnel notice when file activity has changed sufficiently or
program performance has degraded enough to warrant restructuring or reorganization of a
file. Some installations perform file maintenance on a routine basis. For example, a utility
might be run weekly to collect free space from deleted records, to balance index trees, and
to expand or contract space allocations.
In general, master files and program files are created, updated, retrieved from, and
maintained. Work files are created, updated, and retrieved from, but are not maintained.
Report files generally are not updated, retrieved from, or maintained. Transaction files are generally created and used for one-time processing.
Unit 3
FILE ORGANIZATION
1. File Concepts
2. Serial File
3. Sequential File
4. Processing Sequential Files
5. Indexing
6. Indexed Sequential File
7. Inverted File
8. Direct File
9. Multi-list File
1. File concepts
A file consists of a number of records. Each record is made up of a number of fields, and each field consists of a number of characters.
(i) Character: the smallest element in a file; it can be alphabetic, numeric, or special.
(ii) Field: an item of data within a record, made up of a number of characters. Ex: a name (student name), a number (student register number), a date (birth date), or an amount (goods purchased, money paid).
(iii) Record: made up of related fields. Ex: a student record or an employee record.
Figure 10.1: Student Record
When files are created they will have a key field, which helps in accessing a particular record within the file. The records can be recognized by that particular key. Ex: student number is the key field, by which the records can be arranged serially in ascending order. Keys help in sorting the records, and when the records are sorted in sequence the file is called a sequential file. The records are read into memory one at a time, in sequence. Files are normally stored on backing storage media, as they are too big to fit into main storage all at once. Some files are processed at regular intervals to provide information. Ex: a payroll file may be processed each week or month in order to produce employees' wages. Others are processed at irregular intervals. Ex: a file containing medical details of a doctor's patients.
There are various types of files and a variety of file processing methods. The basic types are program files, which contain programs and are read into main memory from backing storage when the program is to be used, and data files, on which data is held.
The hierarchy of the file structure can be represented as follows in figure 10.2.
Figure 10.2: Hierarchy of the file structure
File structure
In a traditional file environment, data records are organized using either a sequential file
organisation or a direct or random file organization. Records in a file can be accessed
sequentially or they can be accessed directly if the sequential file is on disk and uses an
indexed sequential access method. Records on a file with direct file organisation can be
accessed directly without an index. By allowing different functional areas and groups in the organization to maintain their own files independently, the traditional file environment creates problems such as data redundancy and inconsistency, program-data dependence, inflexibility, poor security, and lack of data sharing and availability.
Models of File Organization
Records of data should be organized in the file in such a way that:
(i) Quick access to the desired record is provided for retrieval
(ii) Updating the file is convenient
(iii) The storage area on the recording media is fully utilized
Other factors are: reliability, privacy, and integrity of data.
2. Serial File
It contains a set of records in no particular order but the records are stored as they arrive.
They do not follow any particular sequence of attribute values. This method of storing
records is adopted when it is not possible to arrange the records in any logical order, when the fields of the record are not well defined, and when the exact use of the file cannot be anticipated. Files are created in this mode by punching the documents in the order of arrival. Then, the file is organized into another mode.
Location of a record in a serial file can be done by sequentially searching the records till the desired value of the key attribute is reached. New records are always added at the end of the file.
Changes and deletions of records in a serial file stored on random access media can be
done by locating the record and changing its contents and flagging the record to indicate
that the record has been invalidated. The file may be reorganised periodically to remove
the holes created by deleting of records. However updating of the serial file on sequential
access media can be done only by creating a new file.
3. Sequential File
Files on sequential access media are generally organised in the sequential mode.
Sequential files may also be maintained on the random access media. The records are
arranged in the ascending or descending order of the values of a key attribute in the
record. The sequence of records in the file can be changed from one key attribute to
another key attribute by sorting the file. The key for sequencing the records may also
consist of a group of attributes.
The updating and processing of records of a sequential file stored on a sequential access
media is carried out in a batch mode. The transactions leading to changes in the file data
are collected in a batch periodically. For example, transfers, promotions, retirements, etc., which lead to changes in the personnel file data, can be collected on a monthly basis and recorded in the form of a transaction file. The transaction file is arranged in the same
sequence of the master file to be updated. The updating involves the reading of records
from both transaction and the master file and matching the two for the key attribute. The
additions, deletions, and changes are then carried out in the records of the master file and
the updated records are written on the new updated master file. The sequential file update
schematic is shown in figure 10.4.
The location of a record in a sequential file stored on random access media can be done by one of the following methods:
(i) Sequential search
(ii) Binary search
(iii) Probing
(iv) Skip search
In a 'sequential search', each record is read one after another starting from the first record
in the file till the desired key attribute value is reached.
'Binary search' can reduce the search time considerably as compared to the sequential
search. Here, the first record to be read is the one, which is in the middle of the file. Ex: In
a file of 200 records, the 100th record will be read first. The value of the key attribute of
this record is found out and this will either be less or more than the attribute value of the
desired record. From this, we can decide whether the desired record lies in the first or second half of the file. The next record read is the one which lies in the middle of the area localized by the previous operation (the 50th record).
Figure 10.3: Binary search
Example 1
If the desired record lies in the first half of the two hundred records, the next record read is the 50th, to decide whether the desired record lies among the first fifty or the next fifty. The process is repeated till the desired record has been localized into a small area consisting of, say, 5 or 10 records. Then all records in that area are searched sequentially to locate the desired record.
'Probing' is done where the approximate range in which the desired record lies can be ascertained from the value of the key attribute. If the key attribute is the name of a doctor, and it is known that names starting with P, like 'Padmanab', lie between 30% and 45% of the records, only this area may be searched sequentially to locate the desired record.
In 'skip search', records are accessed in a particular order, say every 20th record is read, till the value of the key attribute exceeds the desired value. By this method, each time an area of 20 records is localized, within which the desired record is then found by sequential search.
4. Processing Sequential Files
Because sequential files are in a sequence and must be kept in that sequence only, much
of the sequential file processing involves sorting data on some key. For example, all
subscribers must be sorted on their last name and first name before a telephone book can
be printed. There have been numerous books and articles written on various approaches
to sorting. Most computer manufacturers supply sorting packages as a part of their
operating systems. These packages are very efficient and simple to use. All that is
necessary is to indicate the fields, record sizes and sort key, and to assign intermediate
work areas for the sort to use.
A scheme for updating a sequence file is as shown in figure 10.4 above. Since the master
file is in order, input transactions must be sorted into the same order as the file before
being processed. A new file is created in the update process, since it is not a good practice
to try to read and write from the tape file. The old file in the sequential update provides
backup. If we keep the input transactions of the old file, any errors or the accidental
destruction of the new tape can easily be remedied by running the update program again
and updating the old file with the transactions.
On an update, there are three possible actions: we can modify a record (changing some part of its contents), insert a new record, or delete a record.
Disk platters (except for the very top and bottom ones) are coated with a magnetic material like that on a tape. Read and write heads are fitted between the platters. By moving the heads in and out, we can access any track on the rotating disk. The maximum block size or physical record size for a disk file is limited by the physical capacity of each track. If the access arms do not move, each head reads or writes on the same track of each platter. Conceptually, these tracks form a cylinder, and when using a disk file sequentially, we write on a given track of the first platter, then on the same track of the second platter, and so on. This minimizes the access time since the heads do not have to move.
Example 2
Consider an expensive purchase with a credit card. The merchant is required to check the validity of your credit card and the amount of available credit. To do this, the merchant accesses the computer system that maintains a file of credit card numbers, status (valid, lost, stolen, etc.), and available credit. It is impractical to store this data sequentially: you and the merchant would become quite impatient if you had to wait 30 minutes for the computer system to sequentially process the file to the point of your record. Direct access is required. There are many requirements for accessing a file directly. First, the file must be stored on disk or a similar direct access device so that records can be read and written in the middle of the file. Second, some means must be developed for determining the location of a particular record. Indexes are a common means. Suppose records are stored on a disk in such a way that 1000 of them reside on each track.
Suppose that the credit card file is stored on 300 tracks so that a total of 300,000 records can be accommodated. If we know the relative position of the record in the file, the physical location can be determined. Record number 2050, for example, resides on the third track of the file in relative position number 50. What is needed is some means of establishing a relationship between a credit card number (or some similar identifying value) and the record's relative position in the file. Indexes created and maintained for direct access processing take file space and thus increase the amount of storage required. Such processing can only be done on disks or similar devices; tape may not be used. Direct access is often used in personal computer applications, although most users are unaware of the fact.
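The track-and-position calculation in the example is plain division and remainder. A sketch, with locate as an illustrative name, assuming records, tracks, and positions are all numbered from 1:

```c
/* Map a relative record number to its track and position on disk,
   given a fixed number of records per track (1000 in the example). */
void locate(long recno, long per_track, long *track, long *pos)
{
    *track = (recno - 1) / per_track + 1;
    *pos   = (recno - 1) % per_track + 1;
}
```

For record 2050 with 1000 records per track this yields track 3, position 50, matching the text.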
Computer data is processed in two fundamental ways: file processing and database processing. With file processing, data is stored and processed in separate files. Consider figure 10.6.
Figure 10.6: File Processing Example
Figure 10.6 shows two different programs processing two separate files. One program
processes the employee file to produce reports about employees; the second program
processes a file about personal computer hardware (the PC Hardware file) to produce a
report about the inventory of hardware. The formats of the records in these two files are
shown in figure 10.7.
Figure 10.7: Sequential file processing of time slips
Sequential access storage devices use media such as magnetic tapes whose storage locations do not have unique addresses and hence cannot be accessed directly. Instead, data must be stored and retrieved using a sequential or serial process. Data are recorded one after another in a predetermined sequence. Locating a particular item of data requires searching much of the recorded data on the tape until the desired item is located.
5. Indexing
6. Indexed Sequential File
A search in a long sequential file can be very time consuming, with unnecessary delays in the location of the record, and unavoidable computer time is spent in the search routine. The search can be made faster if an index to the file is provided (see figure 10.5). Such a file organisation is called an indexed sequential file.
The index of the file contains a pointer record for one record or a group of records in the main file. The index file can be searched by the sequential search or binary search method. For very long files, the index file itself can be long. In such cases, an index of the index file may be necessary, called a higher level index, used to search for the record in the lower level index file.
Files can be indexed on the key attribute on which they are sequenced. They may also be indexed on other attributes on which they are not sequenced. In such a case, one pointer record will be required in the index for each record of the main file. Such a file becomes an indexed non-sequential file. With this, it is possible to locate a record by more than one key attribute of the record. The file may be organized in the indexed sequential mode for the attribute most commonly used to locate the record and in the indexed non-sequential mode for other attributes.
Additions of records to the indexed sequential file are made in the overflow areas provided in each group. For this purpose, some sectors in the area forming the group can be kept blank while organizing the file, serving as overflow areas. Additional overflow areas are kept at the end of the file.
The indexed sequential file will have to be reorganized periodically when the overflow areas become full, when too many holes have been created due to the deletion of obsolete records, or when the sequential search through the chained records has become too time consuming. The reorganization can be done by reading the old file, regrouping the updated records, and writing a new file with indices.
Example 3
One way to think of the structure of an indexed sequential file is as an index with pointers
to a sequential data file (Figure 10.9). In the pictured example, the index has been
structured as a binary search tree. The index is used to service a request for access to a
particular record; the sequential data file is used to support sequential access to the entire
collection of records.
(Use of a binary search tree and a sequential file to provide indexed sequential access)
You are already familiar with this approach to structuring a collection of information to
facilitate both sequential and direct access. For example, consider a dictionary: the thumb
tabs provide an index into the sequentially organized collection of words. In order to find a
particular word, say "syzygy," you usually do not scan the dictionary sequentially. Rather,
you select the appropriate thumb tab, S, the first letter of the word, and use that tab to
direct your search into the approximate location of the word in the data collection. Again,
you probably would not proceed with a sequential search from the beginning of the S's to
find your target word, "syzygy." Instead, you would use the column headings on each page
which indicate the first and last elements on that page. Once you have located the proper
page, your search might proceed sequentially to find the sought word.
It is important to note from the dictionary example that the key used to sequentialize
entries is the same as the key used to search directly for an individual entry. The
thumb-tab and column-heading index structure does not work unless this condition holds.
Applications
Because of its capability to support both sequential and direct access, indexed sequential
organization is used frequently to support files that are used for both batch and interactive
processing. For example, consider a credit card billing system with an indexed sequential
master file of customer account information. An appropriate key for the file would be the
account number of each record. The file could be accessed in batch mode, monthly, to
generate customer invoices and to build summary reports of account activity. Each
account record would be accessed once in this processing. The bills and the detail lines on
the summary report would appear in account number sequence.
The credit card management agency wants to use the master file of account information
also to support interactive inquiry of the current credit status of any account. When a
customer makes a purchase, the master file will be consulted to determine the credit limit
and then to update the remaining credit amount. This kind of processing could not be
supported well by sequential access to the customer account file. Rather, the need to
access an individual account, given its account number, dictates use of the index of the
indexed sequential file organization.
Another example of an application that calls for support by an indexed sequential file is a
class records system. Processing requirements for this system include the following:
1. List the names and addresses of all students.
2. Compute the average age of the students.
3. Compute the mean and standard deviation for students' grade point averages.
4. Compute the total number of credit hours for classes in which the students are presently enrolled.
5. Change the classification of a particular student from probational to regular.
6. Display the grade record for a particular student.
7. Insert a record for a new student.
8. Delete the record for a particular student who has withdrawn from the school.
Some of these requirements call for sequential accessibility of the student file; others call
for direct accessibility to particular records of the file. The combination of requirements can
be satisfied by using indexed sequential file organization. The sort key and the index key
for the file would be the student identifier.
7. Inverted File
In the inverted file organisation, one index is maintained for each key attribute of the
record. The index file contains the value of the key attribute followed by the address of all
the records in the main file with the same value of the key attribute.
It requires three kinds of files to be maintained:
(i) The Main File
(ii) The Directory File
(iii) The Index File
There is a separate directory file for each key attribute and the directory file contains the
value of the key attributes.
An inverted file is a file that references entities by their attributes. It is very useful where
the list of records with a specified value of a key attribute is required. The maintenance
of the index files, however, can be very time consuming.
Example 4
Consider a banking system in which there are several types of users: tellers, loan officers,
branch managers, bank officers, account holders, and so forth. All have the need to
access the same data, say records of the format shown in Figure 10.10. Various types of
users need to access these records in different ways. A teller might identify an account
record by its ID value. A loan officer might need to access all account records with a given
value for OVERDRAW-LIMIT, or all account records for a given value of SOCNO. A
branch manager might access records by the BRANCH and TYPE group code. A bank
officer might want periodic reports of all accounts data, sorted by ID. An account holder
(customer) might be able to access his or her own record by giving the appropriate ID
value or a combination of NAME, SOCNO, and TYPE code.
Example Record Format
Example Data File
A simple inversion index is structured as a table. For example, inverting the example
ACCOUNT-FILE on SOCNO results in the inversion index shown in Figure 10.12. This
figure refers to the data records shown in Figure 10.11.
Figure 10.12 may bring to mind the index structures that we discussed in the context of
relative files; at that time we referred to an index as a directory. Either term is correct.
Figure 10.12: Example index inverting the records of Figure 10.11 by SOCNO key value
Variations
This particular inversion index has sorted entries, which facilitates searching for a
particular key value, because binary search techniques can be used. When the number of
entries N is relatively large, sequential searching significantly slows response time and
throughput. Of course, when a record is added to the data file, the inversion index must
also be updated and maintained in sorted order. Not all inversion indexes are sorted.
An inversion index can be built on top of a relative file or on top of an indexed sequential file. An inversion
index on key SOCNO for a relative file with user-key ID would provide a file that would support direct access
by either ID or SOCNO. An inversion index on key SOCNO for an indexed sequential file with key ID would
provide a file that would support direct access by either ID or SOCNO and would support sequential access
by ID. A sorted SOCNO inversion index could be used to access records in order by SOCNO but would
probably be expensive to use.
8. Direct File
Direct files are not maintained in any particular sequence. Instead, the value of the key
attribute is converted into the sector address by a predetermined relationship. This
converts the value of the key attribute to sector address for the storage and retrieval of the
record. This method is generally used where the range of the key attribute values is large
as compared to the number of records.
The Employee and PC Hardware files duplicate some data. Two fundamental ways of
organizing files are sequential and direct access. The processing and retrieval of these
two files have already been explained; the sequential file processing of time slips is given
in Figure 10.8.
Data and information must be stored immediately after input or during processing, but
before output. Primary storage is in the CPU, whereas the secondary storage devices are
magnetic disk and tape devices. The speed, cost, and capacity of several alternative
primary and secondary storage media are shown in Figure 10.13. The cost/speed/capacity
trade-off moves from semiconductor memories to magnetic moving-surface media
(magnetic disks and tapes) to optical disks.
Figure 10.13: Speed, cost, capacity of 'storage' media alternatives
Semiconductor chips, which come under the primary storage category, are called direct
access or random access memories (RAM). Magnetic disk devices are frequently called
direct access storage devices (DASD). Media such as magnetic tapes are known as
sequential access devices; magnetic bubble and certain other devices have both direct
and sequential access properties.
Direct access and random access refer to the same concept. An element of data (a byte
or record) or an instruction can be directly stored and retrieved by selecting and using any
of the locations on the storage medium. It also means that each storage position (a) has a
unique address and (b) can be accessed in approximately the same length of time without
searching other storage positions. Each memory cell on a microelectronic semiconductor
RAM chip can be individually sensed or changed in the same length of time. Any data
stored on a magnetic disk can likewise be accessed in approximately the same time period.
Figure 10.14: Direct Access Storage Device
The disk unit is the device in which the disk pack is placed and which connects it to the
drive mechanism. Once the pack is loaded into the unit, the read/write mechanism located
inside the unit positions itself over the first track of each surface. The mechanism consists
of a number of arms, at the ends of which there is a read/write head for each surface. All
arms are fixed together and move together as one when accessing a surface on the disk
pack. The disk, when loaded, is driven at a high speed (several thousand revolutions per
minute), and access can only be made to its surfaces while the disk revolves. In a 6-disk
pack there are 10 recording surfaces; similarly, there are 20 recording surfaces in an
11-disk pack. In general, for n disks, (2n - 2) is the number of recording surfaces, since
the two outermost surfaces are not used for recording.
The method of converting the value of the key attribute to the address is known as
'randomizing' or 'hashing'. The most common method of hashing (out of several methods)
is the division method. In this method, the value of the key attribute is converted into an
integer if it is not already one. It is then divided by another integer, often a prime number
just smaller than the file size, and the remainder of the division is used as the address.
The other methods used for hashing are the 'mid-square' method and the 'folding'
method. It is quite possible that two different values of the key attribute get converted to
the same address on hashing; a 'collision' is then said to have occurred. This is handled
by storing the record immediately following the previous record stored with the same
hashed address. Collisions can also be handled by providing blocks, or 'buckets', of
records to store all the records with the same address. When a bucket is full, additional
records with the same hash address are stored in overflow areas provided at the end of
the file, and the overflow records are chained to the last record in the bucket.
The records of a direct file are randomly scattered in the file, so it is not possible to utilise
the full storage area. The ratio of the number of records to the total capacity of the file is
called the 'loading factor'. A high loading factor leads to too many collisions and increases
the search time; a low loading factor leads to under-utilisation of the file area.
Example 5
Some of the earliest investigations in hashing yielded a hash function known as the
division-remainder method, or simply the division method. The basic idea of this approach
is to divide a key value by an appropriate number, then to use the remainder of the division
as the relative address for the record.
For example, let div be the divisor, key be the key, and addr be the resultant address. In
Pascal, the mapping from key to address would be implemented as
addr := key mod div;
The remainder (the result of the mod operation) is defined as the dividend minus the
product of the quotient and the divisor. That is, if temp is the integer quotient of key
divided by div, then
addr = key - div * temp
In all cases here, addr should be declared to be an integer.
While the actual calculation of a relative address, given a key value and divisor, is
straightforward, the choice of the appropriate divisor may not be quite so simple. There are
several factors that should be considered in selecting the divisor. First, the range of values
that result from the operation key mod div is 0 through div - 1. Thus the value of div
determines the size of the relative address space. If it is known that our relative file is
going to contain at least n records, then we must, assuming that only one record can be
stored at a given relative address, have div > n.
For example, consider design of a relative file that will contain about 4000 records. The
address space should accommodate at least 5000 records for 80% loading. A number that
is approximately 5000 and that does not contain any prime factors less than 20 is 5003.
Figure 10.15 illustrates use of this divisor with a small set of key values.
Figure 10.15: Example using divisor 5003
9. Multi-list File
These files are very useful where lists of records with specified key attribute values are
desired frequently, e.g. the list of employees posted to a particular place, or the list of
employees due to retire in a particular unit.
In this file organization, all the records with a specified key attribute value are chained
together. The directory file, like the one in the inverted file organization, contains a pointer
to the first record with the specified key attribute value. The first record contains the
address of the second record in the chain, the second contains the address of the third
record, and so on. When the last record in the chain contains a pointer back to the first
record, the records are said to form a ring. A number of such rings, for different key
attribute values and for different attributes, can be formed. The directory provides the
entry points to the rings.
Example 6
Figures 10.16 and 10.17 illustrate the multi-list indexes for secondary keys GROUP-CODE
and OVERDRAW-LIMIT respectively in our example file. Figure 10.18 shows the
corresponding data file. Note that while inversion did not affect the data file, use of the
multi-list organization does. Each record must have space for the pointers that implement
the secondary-key accessibility.
Figure 10.16: Multi-list index for GROUP-CODE secondary key and the data file in figure 16.9
Figure 10.17: Multi-list index for OVERDRAW-LIMIT secondary key and the data file in figure 16.9
The same kinds of design decisions must be addressed as were required with inversion:
- Should key value entries be ordered?
- How should the index itself be structured?
- Should direct or indirect addressing be used?
- Should data record entries for a given key value be ordered?
Here we have ordered key values, used tabular index structures with indirect addressing,
and have linked data records by ascending ID value.
Note the result of building a multi-list structure to implement a secondary key that has
unique values: if there are N data records, there will be N value entries in the index, each
of which points to a linked list of length one (see Figure 10.12). The index is the same as it
would have been had the secondary key been implemented using inversion.
One attractive characteristic of the multi-list approach is that index entries can be
fixed-length: each value has just one pointer associated with it.
Example data file with multi-list structure
Unit 4
HASHING
1. Hash Tables,
2. Hash Function,
3. Terms Associated with Hash Tables
4. Bucket Overflow
5. Handling of Bucket Overflows
1. Hash Tables
In tree tables, the search for an identifier key is carried out via a sequence of comparisons.
Hashing differs from this in that the address or location of an identifier X is obtained by
computing some arithmetic function f of X; f(X) gives the address of X in the table. This
address is referred to as the hash or home address of X.
The memory available to maintain the symbol table is assumed to be sequential. This
memory is referred to as the hash table, HT. The term bucket denotes a unit of storage
that can store one or more records. A bucket is typically one disk block size but could be
chosen to be smaller or larger than a disk block.
If the number of buckets in a Hash table HT is b, then the buckets are designated HT(0), ...
HT(b-1). Each bucket is capable of holding one or more records. The number of records a
bucket can store is known as its slot-size. Thus, a bucket is said to consist of s slots, if it
can hold s number of records in it.
A function that is used to compute the address of a record in the hash table is known as a
hash function. Usually s = 1, in which case each bucket can hold exactly one record. A
hashing function f(X) is used to perform an identifier transformation on X: f(X) maps the
set of possible identifiers onto the integers 0 through b-1, giving the bucket number
where the identifier will eventually be stored.
2. Hashing Function
A hashing function f transforms an identifier X into a bucket address in the hash table. The
address so computed is known as the hash address of the identifier X. If more than one
record has the same hash address, they are said to collide; this phenomenon is called
address collision.
The desired properties of a hashing function are that it should be easily computable and
that it should minimize the number of collisions.
A Uniform Hash Function is a hashing function for which the probability that f(X) = i is 1/b,
b being the number of buckets in the hash table. In other words, every bucket has an
equal probability of being assigned the record being inserted.
The worst possible hash function maps all search key values to the same bucket. This
function is undesirable because all the records have to be kept in the same bucket.
An ideal hash function distributes the stored keys uniformly across all the buckets, so that
every bucket has the same number of records. Therefore, it is desirable to choose a hash
function that assigns search key values to buckets such that the following holds:
- The distribution of key values is uniform, that is, each bucket is assigned the same number of search key values from the set of all possible search key values.
- The distribution is random, that is, in the average case, each bucket will have nearly the same number of values assigned to it, regardless of the actual distribution of search key values.
Several kinds of uniform hash functions are in use. We shall describe a few of them:
Mid Square hash function
The middle of square (or Mid-square for short) function, fm, is computed by squaring the
identifier and then using an appropriate number of bits from the middle of the square to
obtain the bucket address.
Since the middle bits of the square will usually depend upon all of the characters in the
identifier, it is expected that different identifiers would result in different hash addresses
with high probability even when some of the characters in the identifiers are the same.
The number of bits used to obtain the bucket address depends on the table size. If r bits
are used to compute the hash address, the range of values is 2^r, so the size of the hash
table is chosen to be a power of 2 when this kind of scheme is used. Conversely, if the
size of the hash table is 2^r, then the number of bits to be selected from the middle of the
square will be r.
Mid-square hash address(X) = the r middle digits of X^2
Example 1
Let the hash table size be 8 slots, with s = 1, and let X be an identifier from a set of
identifiers, Y being the unique numerical value identifying X. The computation of the
mid-square hash function is carried out as follows:
Hash table size = 8 = 2^3, so r = 3
X    Y    Y^2    Binary(Y^2)    Mid-Sq(Y^2)
A1   1     1     00 000 01      000 (0)
A7   7    49     01 100 01      100 (4)
A8   8    64     10 000 00      000 (0)
A2   2     4     00 001 00      001 (1)
A6   6    36     01 001 00      001 (1)
A5   5    25     00 110 01      110 (6)
A4   4    16     00 100 00      100 (4)
A3   3     9     00 010 01      010 (2)
We see that there are hash collisions (hash clashes) for the keys A1 and A8, A7 and A4,
and A2 and A6.
Division hash function
Another simple choice for a hash function is obtained by using the modulo (mod) operator.
The identifier X is divided by some number M and the remainder is used as the hash
address of X.
f(X) = X mod M
This gives a bucket address in the range 0 to (M-1), so the hash table must be at least of
size b = M. M should be a prime number such that M does not divide r^k ± a, where r is
the radix of the character set and k and a are very small numbers.
Example 2
Given a hash table with 10 buckets, what is the hash key for 'Cat'?
Since 'Cat' = 131130 when converted to ASCII, x = 131130. We are given the table size
(i.e., m = 10, with buckets numbered 0 through 9).
f(x) = x mod m
f(131130) = 131130 mod 10 = 0
'Cat' is inserted into the table at address 0.
The Division method is distribution-independent.
The Multiplication Method
This method multiplies all the individual digits in the key together and takes the remainder
after dividing the resulting product by the table size:
f(x) = (a * b * c * d * ...) mod m
where m is the table size and a, b, c, d, etc. are the individual digits of the item.
Example 3
Given a hash table of ten buckets (0 through 9), what is the hash key for 'Cat'?
Since 'Cat' = 131130 when converted to ASCII, then x = 131130
We are given the table size (i.e., m = 10).
f(x) = (a * b * c * d * ...) mod m
f(131130) = (1 * 3 * 1 * 1 * 3 * 0) mod 10 = 0 mod 10 = 0
'Cat' is inserted into the table at address 0.
Folding hash function
In this method the identifier X is partitioned into several parts, all but the last being of the
same length. These parts are then added together to obtain the hash address for X:
f(X) = (P1 + P2 + ... + Pn) mod (hash-size)
There are two ways to carry out this addition. In the first, all parts are shifted so that the
least significant digit of each part lines up with the corresponding digit of the last part. The
different parts are then added together to get f(X).
The identifier is split into the parts P1 = 123, P2 = 203, P3 = 241, P4 = 112, P5 = 20.
Adding the parts:
123 + 203 + 241 + 112 + 20 = 699
Figure 11.1: Shift Folding
This method is known as shift folding. The other method of adding the parts is folding at
the boundaries. In this method the identifier is folded at the part boundaries and digits
falling into the same position are added together.
With the parts at alternate boundaries reversed, P1 = 123, P2r = 302, P3 = 241,
P4r = 211, P5 = 20. Adding:
123 + 302 + 241 + 211 + 20 = 897
Figure 11.2: Folding at Boundaries (Pir = reverse of Pi)
Example 4
Fold the key 123456789 into a hash table of ten spaces (0 through 9).
We are given x = 123456789 and the table size (i.e., m = 10).
Since we can break x into three parts any way we want, we break it up evenly. Thus
P1 = 123, P2 = 456, and P3 = 789.
f(x) = (P1 + P2 + P3) mod m
f(123456789) = (123 + 456 + 789) mod 10 = 1368 mod 10 = 8
123456789 is inserted into the table at address 8.
Digit Analysis hash function
This method is particularly useful in the case of static files where all the identifiers in the
table are known in advance. Each identifier X is interpreted as a number using some radix
r. The same radix is used for all the identifiers in the table.
Using this radix, the digits of each identifier are examined. Digits having most skewed
distribution are deleted. Enough digits are deleted so that the number of digits left is small
enough to give an address in the range of the hash table.
Perfect Hash Functions
Given a set of keys K = {k1, k2, ..., kn}, a perfect hash function is a hash function h such
that h(ki) ≠ h(kj) for all distinct i and j. That is, no hash clashes occur under a perfect hash
function. In general, it is difficult to find a perfect hash function for a particular set of keys.
Further, once a few more keys are added to the set for which a perfect hash function has
been found, the hash function generally ceases to be perfect for the expanded set.
Thus, although it is desirable to find a perfect hash function to ensure immediate retrieval,
it is not practical to do so unless the set of keys is static and is frequently searched. The
most obvious example of such a situation is a compiler in which the set of reserved words
of the programming language being compiled does not change and must be accessed
repeatedly. In such a situation, the effort required to find a perfect hashing function is
worthwhile because, once the function is determined, it can save a great deal of time in
repeated applications.
Of course, the larger the hash table, the easier it is to find a perfect hash function for a
given set of keys. If 10 keys must be placed in a table of 100 elements, 63 percent of the
possible hash functions are perfect (although as soon as the number of keys reaches 13 in
a 100-item table, the majority are no longer perfect).
In general it is desirable to have a perfect hash function for a set of n keys in a table of
only n positions. Such a perfect hash function is called minimal. In practice this is difficult
to achieve.
One technique finds perfect hash functions of the form h(key) = (key + s)/d for some
integers s and d. These are called quotient reduction perfect hash functions, and, once
found, are quite easy to compute.
Cichelli presents a very simple method that often produces a minimal or near-minimal
perfect hash function for a set of character strings. The hash function produced is of the
form
h(key) = val(key[0]) + val(key[length(key) - 1]) + length(key)
where val(c) is an integer value associated with the character c and key[i] is the ith
character of key. That is, add the integer values associated with the first and last
characters of the key to the length of the key. The integer values associated with
particular characters are determined in two steps, as follows.
The first step is to order the keys so that the sums of the occurrence frequencies of the
first and last characters of the keys are in decreasing order. Thus if e occurs ten times as
a first or last character, g occurs six times, t occurs nine times, and o occurs four times,
the keys gate, goat, and ego have occurrence frequencies 16 (6 + 10), 15 (6 + 9), and
14 (10 + 4), respectively, and are therefore ordered properly.
Once the keys have been ordered, attempt to assign integer values. Each key is examined
in turn. If the key's first or last character has not been assigned values, attempt to
assign one or two values between 0 and some predetermined limit. If appropriate
values can be assigned to produce a hash value that does not clash with the hash
value of a previous key, tentatively assign those values. If not, or if both characters
have been assigned values that result in a conflicting hash value, backtrack to modify
tentative assignments made for a previous key. To find a minimal perfect hash
function, the predetermined limit for each character is set to the number of distinct first
and last character occurrences.
3. Terms Associated with Hash Tables
Identifier Density
The ratio n/T is called the identifier density, where
n = the number of identifiers in use
T = the total number of possible identifiers
The number of identifiers in use, n, is usually several orders of magnitude less than the
total number of possible identifiers, T.
Loading Factor
The loading factor is equal to n/(sb), where
s = the number of slots in a bucket
b = the total number of buckets
The number of buckets b is also much smaller than the total number of possible
identifiers, T.
Synonyms
The hash function f almost always maps several different identifiers into the same bucket.
Two identifiers I1 and I2 are said to be synonyms with respect to f if f(I1) = f(I2). Distinct
synonyms are entered into the same bucket as long as not all of the s slots in that bucket
have been used.
Collision
A collision is said to occur when two non-identical identifiers are hashed into the same
bucket. When the bucket size is 1, collision and overflow occur simultaneously.
4. Bucket Overflow
So far we have assumed that, when a record is inserted, the bucket to which it is mapped
has space available to store the record. If the bucket does not have enough space, this
condition is called Bucket Overflow.
Bucket overflow can occur for several reasons:
Insufficient Buckets: The number of buckets, which we denote by nb, must be chosen
such that nb > nr/fr, where nr denotes the total number of records that will be stored and
fr denotes the number of records that will fit in a bucket. If this condition is not met, there
will be fewer buckets than required, which will cause bucket overflow.
Skewness: If some buckets are selected during insertion more frequently than others, the
distribution is said to be skewed rather than symmetrical. When records are entered into
such a bucket set, a few buckets are likely to be assigned most of the incoming records
and so fill very early. A new insertion into these buckets will then cause a bucket overflow
even while space is still available in other buckets.
5. Handling of Bucket Overflows
When an overflow occurs, it must be resolved and the records placed somewhere else in
the table, i.e. an alternative hash address must be generated for these entries. The
resolution should aim at reducing the chances of further bucket overflows.
Some of the approaches used for overflow resolution are described here:
Over Flow Chaining or Closed Hashing
In this approach, whenever a bucket overflow occurs, a new bucket (called over-flow
bucket) is attached to the original bucket through a pointer. If the attached bucket is also
full, another bucket is attached to this bucket. The process continues. All the overflow
buckets of a given bucket are chained together in a linked list. Overflow handling using
such a linked list is called Overflow Chaining.
As an example, let us take an array of pointers as Hash table (Figure 11.3).
Figure 11.3: A Chained Hash Table
Advantages of Chaining
1) Space Saving
Since the hash table is a contiguous array, enough space must be set aside at compilation
time to avoid overflow. On the other hand, if the hash table contains only pointers to the
records, then the size of the hash table may be reduced.
2) Collision Resolution
Chaining allows simple and efficient collision handling; to handle a collision, only a link
field need be added.
3) Overflow
It is no longer necessary that the size of the hash table exceed the number of records. If
there are more records than entries in the table, it simply means that some of the linked
lists contain more than one record.
4) Deletion
Deletion proceeds in exactly the same way as deletion from a simple linked list, so in a
chained hash table deletion is a quick and easy task.
Rehashing or Open Hashing
The form of hash structure that we have just described is sometimes referred to as closed
hashing. Under an alternate approach, called open hashing, the set of buckets is fixed and
there are no overflow chains; instead, if a bucket is full, records are inserted in some other
bucket in the initial set of buckets B.
Rehashing techniques essentially apply, and if necessary re-apply, some hash function
again and again until an empty bucket is found.
Rehashing, involves using a secondary hash function on the hash key of the item. The
rehash function is applied successively until an empty position is found where the item can
be inserted. If the hash position of the item is found to be occupied during a search, the
rehash function is again used to locate the item.
Another policy is to compute a second hash function for the new record, based this time
not on the identifier but on the hash address itself. If the resulting position is also filled, the
function is applied again to the resulting address, and the process is repeated until an
empty location is reached.
Let X be the identifier to be stored. A hash function is applied to it to compute the address
in the hash table, i.e. address = f(X). If there is a collision, the same function or some
other function may be applied to the computed address to get the next address. If this
also results in a collision, the method is continued.
In general, if f(X) is a hashing function giving rise to a collision, another function ff is
applied to the computed address to get a new index. If this also does not yield an empty
space, the function is reapplied successively until a space is found.
Initially:                  hash index = f(X)
if collision then           next index = ff( f(X) )
if again collision then     next index = ff( ff( f(X) ) ), and so on.
Example 5
Suppose the value 15424 is to be hashed by the division method into a hash table of size
10 with slot size 1. The hash table is already filled at some indices, as shown in the
following figure:
f(X) = X mod 10
f(15424) = 15424 mod 10 = 4. There is a collision. Let us take ff(X) = (2 * X) mod 10 as
the rehashing function and apply it: ff( f(15424) ) = ff( 4 ) = (2 * 4) mod 10 = 8. Here also
there is a collision. Applying ff once again: ff( ff( f(X) ) ) = ff( 8 ) = (2 * 8) mod 10 = 6.
This is the required index.
(Figure: hash table of 10 slots; positions 4 and 8 are already occupied, so 15424 is finally
placed at position 6.)
Linear Probing
One policy is to use the next bucket (in cyclic order) that has space. This policy is called
"Linear probing".
It starts with the hash address (the location where the collision occurred) and does a sequential search through the table for the desired key or an empty location. Since this method searches in a straight line, it is called linear probing.
The table should be considered circular, so that when the last location is reached, the
search proceeds to the first location of the table.
If the item's home position is already occupied, the item is moved sequentially down the table until an empty space is found. If m is the number of possible positions in the table, the linear probe continues until either an empty spot is found, or until m-1 locations have been searched.
Example 6
Let us suppose the value 15424 is to be hashed by the division method into a hash table of size 10 with slot size 1. The hash table is already filled at some indices, as shown in the following figure:
F(X) = X mod 10
F(15424) = 15424 mod 10 = 4
The hash index (i.e. 4) is already filled. Therefore, search linearly to find a place, ending at the 7th bucket.
(Figure: a hash table with buckets numbered 1 to 10; the buckets around index 4 are occupied, and linear probing places 15424 in the 7th bucket.)
Example 7
Here is part of a program for linear hashing. It uses f(key) as the hashing function and ff(i) as the rehashing function. The special value nullkey indicates an empty record in the table. It searches the buckets linearly.
#define TABLESIZE .....
typedef ..... KEYTYPE;
typedef ..... RECTYPE;

struct rec {
    KEYTYPE k;
    RECTYPE r;
} table[TABLESIZE];

int searchplace(KEYTYPE key, RECTYPE rec)
{
    int i;

    i = f(key);
    while (table[i].k != key && table[i].k != nullkey)
        i = ff(i);
    if (table[i].k == nullkey) {    /* empty slot: insert the record */
        table[i].k = key;
        table[i].r = rec;
    }
    return (i);
}
Clustering
The situation, where two keys that hash into different values compete with each other in
successive rehashes, is called primary clustering.
It also forms a measure of the suitability of a rehash function.
One thing that happens with linear probing however is clustering, or bunching together.
Primary clustering occurs when items are always hashed to the same spot in the table,
and the lookup requires searching through the occupied buckets. The next empty bucket
will always be together with the occupied ones, giving rise to clusters. 1, 2 or 3 buckets
before finding the empty spot.
The opposite of clustering is uniform distribution. Simply, this is when the hash function
uniformly distributes the items over the table.
(Figure: a hash table with buckets numbered 1 to 10 showing a run of occupied buckets, i.e. a cluster, and the probes needed to resolve it.)
Quadratic Probing
If there is a collision at hash address h, this method probes the table at locations h+1, h+4, h+9, ..., that is, at location h+i² for i = 1, 2, ...; in other words, the increment function is i². Quadratic probing reduces clustering, but it will not probe all locations in the table.
(Figure: a hash table with buckets numbered 1 to 10; the key 675564 is to be inserted, and its home bucket 4 is occupied.)
ffj(X) = ( f(X) + j² ) mod tablesize
Example 8
Let us hash key value 675564 using the division method in the above hash table.
F(675564) = 675564 mod 10 = 4
Obviously, there is a collision. Linear probing would suggest the 8th position, increasing clustering. Let us apply quadratic rehashing instead.
ff1(675564) = ( f(675564) + 1² ) mod 10 = 5, again a collision. Reapply ff.
ff2(675564) = ( ff1(675564) + 2² ) mod 10 = (5 + 4) mod 10 = 9. The 9th location, being empty, receives the key, so it is inserted in the 9th bucket. Notice that this has not increased clustering.
Double Hashing
This is another method of collision resolution; unlike linear collision resolution, double hashing uses a second hashing function, which normally limits multiple collisions. The idea is that if two values hash to the same spot in the table, a constant can be calculated from the initial value using the second hashing function, which can then be used to change the sequence of probed locations while still giving access to the entire table.
It consists of two rehashing functions f1 and f2. First, f1 is applied to get the location for insertion. If it is occupied, then f2 is used to rehash. If again there is a collision, f1 is used for rehashing. In this way the two functions are employed alternately until an empty location is obtained.
Example 9
Let us suppose we have to insert key value 23763 by the division hashing function in a table of size 10. The two functions are f1(X) = (X + 1) mod tablesize and f2(X) = 2 + X % tablesize.
We apply the first function to compute the first hash index:
f1(23763) = (1 + 23763) mod 10 = 4. Let us suppose the 4th location is not free. Apply the rehashing function:
f2(f1(23763)) = f2(4) = 2 + 4 mod 10 = 6. If the 6th place is also not empty, continue:
f1(f2(f1(23763))) = f1(6) = (1 + 6) mod 10 = 7, and so on.
Key Dependent Increments
In key-dependent increments, we derive the probe increment from the key itself. For example, we could truncate the key to a single character and use its code as the increment.
(Figure: employee records hashed into buckets using key-dependent increments.)

Records (name, city, employee code):
GAURAV     CHENNAI    e-217
CHANDRA    BANGALORE  e-101
SHARAD     DELHI      e-110
ARUN       MADURAI    e-215
SAURABH    MUMBAI     e-102
AJIT       JAIPUR     e-203
RISHABH    LUCKNOW    e-218

Resulting hash table:
Bucket 0:
Bucket 1:  e-215
Bucket 2:  e-101, e-110
Bucket 3:  e-217, e-102
Bucket 4:  e-218
Bucket 5:  e-203
Bucket 6: