Informatik I - HS 17 (Summary) for Chemists

Contents

1 UNIX
2 Data Representation
3 Data Processing
4 C++ Mastery
    4.1 Basic Syntax
    4.2 Types of Statements
    4.3 Functions
    4.4 Input and Output
    4.5 Arrays
    4.6 Types of Errors
5 Algorithmics
    5.1 Algorithms for Sorting
    5.2 Algorithms for Searching
    5.3 Numerical Integration
    5.4 Algorithmic Strategies
6 Computer Architecture
7 Computer Simulation
8 Representation of Chemical Structures
9 Molecular Simulation

Compiled by Eduard Meier (meiered@student.ethz.ch)

1 UNIX

UNIX commands are of the form:

    command [-options] [object1] [object2] ...

A file or directory is referenced by
→ an absolute path name (e.g. /aa/dd/ff),
→ a relative path name (e.g. dd/ff [1]).

The top directory of a system is called the root directory (/); the highest directory for the user is the home directory (~). The parent directory (..) is one level up from the current directory (.), while a child directory is one level down (dir_name).

Wildcards can be used to apply a command to multiple files:
→ Any single character: ?
→ Any string of characters: *
→ Any character from list: [ABC]
→ Any character in range: [A-Z]

Command                  Function
echo                     echo arguments
pwd                      return working directory name
cd [dir]                 change working directory [to directory dir]
cd                       change to home directory
cd ..                    change to parent directory
cd ~username             change to home directory of user username
ls                       list names of all files in current directory
ls [file ...]            list only named files
ls -a                    list all files including hidden files
ls -t                    list in time order, most recent first
ls -l                    list files in long format
ls -r                    list in reverse order (may be combined with -t)
mkdir dir                make directory dir
rmdir dir                remove directory dir
cp source dest           copy source to dest
mv source dest           move source to dest
rm [file ...]            remove named files
cat [file ...]           print contents of named files to standard output
more [file ...]          print contents of files one page at a time
wc [file ...]            count lines, words and characters for each file
grep pattern [file ...]  search for a pattern in a file
sort [file ...]          sort the files alphabetically by line
head [-n] file           print the first n lines of a file
tail [-n] file           print the last n lines of a file
cmp file1 file2          print location of first difference between file1 and file2
diff file1 file2         print all differences between file1 and file2
awk pattern [file ...]   scan file for pattern, with the ability to remove, add and modify the text
man command              display help (man page) on command
chmod ugoa±rwx file      change the permissions of a file [2] [3]
chmod ugo file           set the permissions of a file [4]
ssh host                 connect with the machine named host
ssh user@host            connect as user with the machine named host
finger (@host)           check who is logged into your machine (named host)
Mail                     check if you got mail and read it
Mail user                send an email to user
cmd > file               redirect the output to overwrite file
cmd >> file              redirect the output to append to file
cmd >& file              redirect output + errors, overwriting file
cmd >>& file             redirect output + errors, appending to file
cmd < file               redirect input from file
cmd1 | cmd2              pipe: output of cmd1 is connected to input of cmd2
CTRL-C                   abort the program running in the shell
CTRL-Z                   suspend the program running in the shell
bg                       put a running program to the background
fg                       put a running program to the foreground
program-name &           start the program in the background → shell free to type other commands

[1] If the current directory is /aa.
[2] user (u), group (g), other (o), all (a)
[3] grant permission (+), retract permission (-), read (r), write (w), execute (x)
[4] Three-digit octal string setting the permissions (user, group, others). Start from 0. Add 4 for read, 2 for write, 1 for execute.

2 Data Representation

Computers are finite and discrete: they can only handle material that can be mapped to a finite sequence of zeroes and ones. The basic information unit in the digital world is the bit (binary digit), which represents an information element with two states. Bits are grouped into words of various lengths. The storage capacity $S_n$ (number of distinct states) of a word of length $n$ ($n \geq 1$) is given by

$$S_n = 2^n$$

The most basic word length is the byte, which contains 8 bits. Current operating systems rely on either 32- or 64-bit word lengths. Byte multiples are often defined in powers of 2 grouped by approximate powers of 10, since $2^{10} \approx 1000$ (e.g. 1 kilobyte = $2^{10}$ bytes, 1 megabyte = $2^{20}$ bytes, etc.).

A number system is defined by a chosen base $b$ and a set of $b$ digit symbols (characters), one of them being zero. An arbitrary positive integer number $n$ is then written as a collection of $p$ digits

$$n = [d_{p-1} d_{p-2} \cdots d_1 d_0]_b = \sum_{m=0}^{p-1} d_m b^m$$

Our usual decimal system relies on a base $b = 10$ and the digit symbols {0,1,2,3,4,5,6,7,8,9}

$$1970 = [1970]_{10} = 0 \cdot 10^0 + 7 \cdot 10^1 + 9 \cdot 10^2 + 1 \cdot 10^3$$

The binary system relies on a base $b = 2$ and the associated digits {0,1}

$$[1970]_{10} = [11110110010]_2 = 0 \cdot 2^0 + 1 \cdot 2^1 + 0 \cdot 2^2 + 0 \cdot 2^3 + 1 \cdot 2^4 + 1 \cdot 2^5 + 0 \cdot 2^6 + 1 \cdot 2^7 + 1 \cdot 2^8 + 1 \cdot 2^9 + 1 \cdot 2^{10}$$

→ Binary to decimal conversion can be done by summing up each digit multiplied with the associated power of 2.
→ For decimal to binary conversion, it is convenient to start at the largest power of 2 that can be subtracted, to subtract it, and to continue until 0 is reached.

A string of $p$ ones represents the largest $p$-digit binary number:

$$[111111 \ldots 111111]_2 = 2^p - 1 \approx 2^p$$

The octal system relies on a base $b = 8$ and the digit symbols {0,1,2,3,4,5,6,7}; this corresponds to a 3-bit code. The hexadecimal system relies on a base $b = 16$ and the digit symbols {0,1,2,3,4,5,6,7,8,9,A,B,C,D,E,F}; this corresponds to a 4-bit code.

$$1970 = [11110110010]_2 = [011\ 110\ 110\ 010]_2 = [3\ 6\ 6\ 2]_8 = [3662]_8$$
$$1970 = [11110110010]_2 = [0111\ 1011\ 0010]_2 = [7\ B\ 2]_{16} = [7B2]_{16}$$

Real numbers can be represented with fixed-point [5] representation,

$$r = [d_{p-1} d_{p-2} \cdots d_1 d_0 \,.\, d_{-1} d_{-2} \cdots d_{-(q-1)} d_{-q}]_b = \sum_{m=-q}^{p-1} d_m b^m$$

$$[\,.\,111111 \ldots 111111]_2 = 1 - 2^{-q} \approx 1 \quad \text{for large } q$$

or with floating-point representation, where an unsigned digit string (mantissa) and the position of the fractional point (exponent) are stored separately.
→ The significand is the mantissa interpreted as a number starting right after the fractional point ([0.M]);
→ The sign of the significand must be stored separately;
→ The exponent may be positive or negative.
$$\text{number} = \text{sign} \times \text{significand} \times \text{base}^{\text{exponent}}$$

In the binary system (with base 2), one generally adopts the so-called IEEE 754 standard for the floating-point representation

    1 bit       N_E bits              N_M bits
    Sign: S     Biased exponent: E    Mantissa: M

Maximal value of E: $E_{max} = 2^{N_E} - 1$
Median value of E: $E_0 = 2^{N_E - 1} - 1$

[5] Numbers that can be represented exactly as decimal fixed-point may or may not be representable exactly as binary fixed-point. In the latter case, they have a periodic fractional part in the binary representation (e.g. $[0.25]_{10} = [0.01]_2$ whereas $[0.2]_{10} = [0.\overline{0011}]_2$).

For most values of E, this representation is normalized, and obeys the equation

$$\text{number} = (-1)^S \times [1.M]_2 \times 2^{E - E_0} \quad \text{for } 0 < E < E_{max}$$

Because it is normalized, the significand starts with a 1 (instead of a 0) [6]. The special values $E = 0$ and $E = E_{max}$ are reserved for
→ denormalized numbers if $E = 0$:

$$\text{number} = (-1)^S \times [0.M]_2 \times 2^{1 - E_0} \quad \text{for } E = 0$$

→ special numbers if $E = E_{max}$:

$$\text{number} = \begin{cases} -\infty & \text{if } S = 1,\ M = 0 \\ +\infty & \text{if } S = 0,\ M = 0 \\ \text{NaN} & \text{if } M \neq 0 \end{cases} \quad \text{for } E = E_{max}$$

In the normalized representation, the smallest and largest representable numbers are of comparable logarithmic magnitudes. However, zero is not exactly representable. That is why $E = 0$ was reserved for denormalized numbers, since zero is then explicitly representable with $M = 0$. Furthermore, denormalized numbers make it possible to represent numbers that are smaller than the smallest normalized number [7] $2^{1 - E_0}$, because $[0.M]_2$ ranges between zero and one.

Loss of significance corresponds to errors caused by operations whose operands or results are too small or too large for the available number of digits. E.g. in an addition in four-digit decimal arithmetic, most digits of the smaller operand are lost:

$$0.3721 \cdot 10^{2} + 0.3046 \cdot 10^{-1} = 0.3721 \cdot 10^{2} + 0.0003046 \cdot 10^{2} = 0.3724 \cdot 10^{2} \quad \text{(rounded)}$$

An overflow corresponds to a result (e.g. x + y) being too large, and an underflow to a result (e.g. x × y) being too small in magnitude, to fit in a word.
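The field layout and the three cases above (normalized, denormalized, special) can be made concrete with a small sketch. It assumes that `float` is a 32-bit IEEE 754 number (N_E = 8, N_M = 23, hence E_0 = 127 and E_max = 255), which holds on common platforms but is not guaranteed by the language. The names `FloatFields`, `decompose` and `reconstruct` are made up for this illustration.

```cpp
#include <cassert>
#include <cmath>
#include <cstdint>
#include <cstring>
#include <limits>

// Fields of a 32-bit IEEE 754 number: 1 sign bit, N_E = 8, N_M = 23.
struct FloatFields {
    std::uint32_t sign;      // S
    std::uint32_t exponent;  // biased exponent E
    std::uint32_t mantissa;  // M (stored bits only, without the implicit leading 1)
};

// Split a float into its three bit fields.
FloatFields decompose(float x) {
    static_assert(sizeof(float) == 4, "sketch assumes a 32-bit float");
    std::uint32_t bits;
    std::memcpy(&bits, &x, sizeof bits);  // copy the raw bit pattern
    FloatFields f = { bits >> 31, (bits >> 23) & 0xFFu, bits & 0x7FFFFFu };
    return f;
}

// Rebuild the value from the fields with the formulas of this section.
double reconstruct(const FloatFields& f) {
    const int E0 = 127, Emax = 255, NM = 23;
    assert(f.exponent <= 255u && f.mantissa < (1u << 23));
    const double sign = f.sign ? -1.0 : 1.0;
    const double frac = std::ldexp(static_cast<double>(f.mantissa), -NM);  // [0.M]_2
    if (f.exponent == static_cast<std::uint32_t>(Emax)) {                  // special numbers
        if (f.mantissa != 0) return std::numeric_limits<double>::quiet_NaN();
        return sign * std::numeric_limits<double>::infinity();
    }
    if (f.exponent == 0)                                                   // denormalized
        return sign * std::ldexp(frac, 1 - E0);
    return sign * std::ldexp(1.0 + frac, static_cast<int>(f.exponent) - E0);  // normalized
}
```

For example, `decompose(1.0f)` yields S = 0, E = 127, M = 0, and feeding the fields of any finite `float` back into `reconstruct` returns the original value, because a `double` can hold every `float` exactly.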
3 Data Processing

The elementary building block of a processor is a device that can
→ enable (E-type) or disable (D-type) the flow of current when an input on-off signal is applied [8];
→ convert flow information into an output on-off signal of the same nature as the input signal [9].

In modern chips, the elementary building block is called a MOSFET (metal-oxide-semiconductor field-effect transistor), which
→ relies on the properties of semiconductors;
→ either goes from insulator to conductor under an applied electric field (E-type);
→ or goes from conductor to insulator under an applied electric field (D-type);
→ can be used to build three basic types of logic gates: NOT, AND and OR gates;
→ has a size of ∼ 1 µm [10];
→ relies on the voltage of the signals (CPU core voltage), which is typically ∼ 2 V;
→ involves a current intensity in the order of $10^{-5}$ A;
→ leads, for a chip (which involves many MOSFETs), to a power requirement of about 10 W;
→ has a recovery time [11] of typically ∼ 1 ns, which determines the clock rate [12].

Increasing the clock rate by increasing the CPU core voltage is called overclocking. It leads to stronger currents and faster recovery, but also to an increased power requirement and heating.

[6] This implicit leading 1 is not stored, since it is always 1!
[7] Actually, the smallest positive denormalized number is equal to the smallest possible difference between two normalized numbers. The largest positive denormalized number lies just below the smallest positive normalized number.
[8] No signal means that the flow is disabled (E-type) or enabled (D-type). E is short for enhancement, D for depletion mode.
[9] In order to use it as input for subsequent building blocks.
[10] One can typically pack $0.5 \cdot 10^6$ MOSFETs on a chip.
[11] The minimal time needed between two successive operations.
[12] The frequency at which successive input batches can be processed (number of clock cycles per second), namely ∼ 1 GHz.
4 C++ Mastery

A program must first be translated from C++ (a high-level language) into machine language using a compiler. Text typed as /*...*/ or as //... (until the end of the line) is ignored by the compiler; these are the comments of the programmer. A program usually includes preprocessor instructions [13] such as #include <iostream> [14] prior to the functions. The subsequent line using namespace std; makes the names of the standard namespace std directly accessible (namespaces exist in order to avoid name clashes). Furthermore, every C++ program contains one or more functions, one of which is called main; this is the one that is entered when the program is called.

4.1 Basic Syntax

The basic text unit in a C++ program is termed an identifier. It can be any sequence of the characters
→ Lower-case letters   a...z
→ Upper-case letters   A...Z
→ Digits               0...9
→ Underscore           _
that does not start with a digit. However, some identifiers are reserved to indicate specific instructions in C++ and are termed keywords. It is impossible to give them a meaning other than their C++ meaning (i.e. they cannot be used as variable names).

[13] Preprocessing is an additional step applied prior to compilation.
[14] This specific preprocessor instruction includes, at this location of the program, the text content of the standard header file iostream, which contains the declarations of the standard functions for stream input-output (IO).
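The identifier rule of section 4.1 can be checked mechanically. The following is only a sketch: the function names are invented for this illustration, and the keyword list is a deliberately small sample rather than the full set of reserved words.

```cpp
#include <cctype>
#include <string>

// True if `name` has the shape of an identifier: only letters, digits and
// underscores, and no digit in first position.
bool is_identifier_shaped(const std::string& name) {
    if (name.empty()) return false;
    if (std::isdigit(static_cast<unsigned char>(name[0]))) return false;
    for (char c : name) {
        const unsigned char u = static_cast<unsigned char>(c);
        if (!std::isalnum(u) && c != '_') return false;
    }
    return true;
}

// Illustrative subset of the reserved keywords, NOT the complete list.
bool is_sample_keyword(const std::string& name) {
    static const char* const sample[] = {
        "int", "double", "if", "else", "while", "for", "return"
    };
    for (const char* kw : sample)
        if (name == kw) return true;
    return false;
}

// A name is usable for a variable if it is identifier-shaped and not reserved.
bool is_usable_variable_name(const std::string& name) {
    return is_identifier_shaped(name) && !is_sample_keyword(name);
}
```

With these helpers, `x_1` and `_tmp` are accepted, whereas `2fast` (leading digit), `a-b` (illegal character) and `while` (keyword) are rejected.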
The following basic types (to characterize variables) are available

Type                Size in bytes   Explicit constants
bool                (1/8)           boolean constant with values "false" or "true"
char                1               character constant: single character enclosed by single quotes
short (short int)   2               integer constant
int                 4               integer constant
long                8               integer constant
long long           ≥8              integer constant
float               4               floating-point constant
double              8               floating-point constant
long double         16              floating-point constant

→ Integer numbers: an integer constant is an optional "+" or "-" followed by an arbitrary sequence of digits.
→ Real numbers: a floating-point constant is an optional "+" or "-" followed by a sequence of digits containing a period "." or the letter "e" or "E".

The sizes listed are typical values; the C++ standard only fixes the size of char to 1 and leaves the other sizes to the implementation.

Variables must be explicitly declared along with the specification of their type. Once a variable has been declared, it can be assigned a value using the assignment operator "=". The variable is now initialized; the simplest form of initialization involves an explicit constant.

              boolean       character    string              integer    double
declaration   bool bla;     char bla;    char bla[5];        int bla;   double bla;
assignment    bla = true;   bla = 'A';   bla = "four"; (*)   bla = 1;   bla = -1.2E-3;

(*) A character array cannot be assigned with "=" after its declaration; the string case only compiles as an initialization, char bla[5] = "four";.

Or in compact notation:
→ Declare and assign at once:    int bla = 1;
→ Declare multiple variables:    int bla, bla2;
→ Combined:                      int bla1 = 1, bla2 = 2;

An operator is represented by a special symbol, and performs a specific action:
→ A unary operator has one operand,
    → a unary prefix operator has its single operand at its right;
    → a unary suffix operator has its single operand at its left.
→ A binary operator has two operands, one at its left and one at its right.

An operation always returns a result, which may be used as operand by further operators; otherwise, the result is discarded. An operand may be an explicit constant, a variable, the return value of a function, or the result of another operation.
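As a small illustration of the points above, the sketch below queries the sizes that the current compiler actually uses (they may differ from the typical values in the table) and shows declarations, explicit constants and unary/binary operators whose results feed further operations. The names `BasicSizes`, `basic_sizes` and `operand_demo` are hypothetical.

```cpp
#include <cstddef>

// Sizes (in bytes) used by the current compiler for some basic types.
struct BasicSizes {
    std::size_t of_char, of_short, of_int, of_long, of_float, of_double;
};

BasicSizes basic_sizes() {
    BasicSizes s = { sizeof(char), sizeof(short), sizeof(int),
                     sizeof(long), sizeof(float), sizeof(double) };
    return s;
}

double operand_demo() {
    int bla1 = 1, bla2 = 2;   // declare and initialize two variables at once
    double x = -1.2E-3;       // floating-point constant with exponent
    int sum = bla1 + bla2;    // binary operator: one operand on each side
    int neg = -sum;           // unary prefix operator: operand at its right
    bla1++;                   // unary suffix operator: operand at its left,
                              // the returned result is discarded
    return neg * x + bla1;    // results of operations used as operands
}
```

Here `operand_demo()` evaluates (-3) × (-0.0012) + 2, i.e. approximately 2.0036; the exact digits are subject to floating-point rounding.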
A valid combination of operators and operands is called an expression. An expression followed by a semicolon is a valid instance of a statement. The resolution (evaluation) of an expression follows strict rules that depend on the relative precedences (priorities) of the operators and on their associativities. Operators act on their immediate operands on the left and/or right in order of decreasing precedence (from level 1 [highest] to level 18 [lowest]). At equal precedence, the order of evaluation is determined by the operator associativity, either left-to-right or right-to-left. This can be modulated by parentheses: any expression between parentheses is evaluated as a separate entity before being inserted in the calculation.

Precedence  Operator            Description                            Associativity
1           ::                  Scope resolution                       None
2           ++                  Suffix increment                       Left-to-right
            --                  Suffix decrement
            ()                  Function call
            []                  Array subscripting
            .                   Element selection by reference
            ->                  Element selection through pointer
            typeid()            Run-time type information
            const_cast          Type cast
            dynamic_cast        Type cast
            reinterpret_cast    Type cast
            static_cast         Type cast
3           ++                  Prefix increment                       Right-to-left
            --                  Prefix decrement
            +                   Unary plus
            -                   Unary minus
            !                   Logical NOT
            ~                   Bitwise NOT (one's complement)
            (type)              Type cast
            *                   Indirection (dereference)
            &                   Address-of
            sizeof              Size-of
            new, new[]          Dynamic memory allocation
            delete, delete[]    Dynamic memory deallocation
4           .*                  Pointer to member                      Left-to-right
            ->*                 Pointer to member
5           *                   Multiplication                         Left-to-right
            /                   Division
            %                   Modulo (remainder)
6           +                   Addition                               Left-to-right
            -                   Subtraction
7           <<                  Bitwise left shift                     Left-to-right
            >>                  Bitwise right shift
8           <                   Less than                              Left-to-right
            <=                  Less than or equal to
            >                   Greater than
            >=                  Greater than or equal to
9           ==                  Equal to                               Left-to-right
            !=                  Not equal to
10          &                   Bitwise AND                            Left-to-right
11          ^                   Bitwise XOR (exclusive or)             Left-to-right
12          |                   Bitwise OR (inclusive or)              Left-to-right
13          &&                  Logical AND                            Left-to-right
14          ||                  Logical OR                             Left-to-right
15          ?:                  Ternary conditional                    Right-to-left
16          =                   Direct assignment                      Right-to-left
            +=                  Assignment by sum
            -=                  Assignment by difference
            *=                  Assignment by product
            /=                  Assignment by quotient
            %=                  Assignment by remainder
            <<=                 Assignment by bitwise left shift
            >>=                 Assignment by bitwise right shift
            &=                  Assignment by bitwise AND
            ^=                  Assignment by bitwise XOR
            |=                  Assignment by bitwise OR
17          throw               Throw operator                         Right-to-left
18          ,                   Comma                                  Left-to-right

An arithmetic operator is a binary operator
→ +   addition (sum)
→ -   subtraction (difference)
→ *   multiplication (product)
→ /   division (quotient)
→ %   remainder of the integer division (modulo)
and combines the two expressions appropriately into its result.

An assignment operator is a binary operator
→ =    set variable to value (assignment)
→ +=   increase variable by value
→ -=   decrease variable by value
→ *=   multiply variable by value
→ /=   divide variable by value
→ %=   replace variable by its remainder modulo value
and alters the variable on its left according to the expression on its right, and returns the new value as result.

An increment/decrement operator is a unary operator
→ ++   increase variable by 1
→ --   decrease variable by 1
and can be used either as a prefix or as a suffix. In the former case, the variable on its right is modified, and its updated value is returned as result, while in the latter, the variable on its left is modified and its initial value is returned as result.

More than one assignment/increment/decrement of the same variable within one expression, or an assignment/increment/decrement combined with another reading of the variable value, must be avoided, since the outcome is not specified by the standard and may depend on the compiler and the computer! A short selection of no-go's:

    i = i++;
    i += 1 + i--;
    j = i + i++;
    j = ++i + i++;
    a[i] = i++;
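The difference between the prefix and suffix forms, and the fact that assignments return a value, can be seen in a short sketch. The struct and function names are invented for this illustration; each variable is modified at most once per expression, so none of the no-go patterns listed above occurs.

```cpp
// Result of an increment expression together with the variable afterwards.
struct IncrementResult {
    int result;    // value returned by the ++ operation
    int variable;  // value of the variable after the operation
};

IncrementResult prefix_demo(int i) {
    int r = ++i;   // variable is modified first, updated value is returned
    return {r, i};
}

IncrementResult suffix_demo(int i) {
    int r = i++;   // initial value is returned, variable is modified as well
    return {r, i};
}

int compound_demo() {
    int v = 10;
    v += 5;   // 15
    v -= 3;   // 12
    v *= 2;   // 24
    v /= 5;   // 4 (integer division)
    v %= 3;   // 1
    return v;
}

int chained_demo() {
    int a, b;
    a = b = 7;     // right-to-left: b = 7 returns 7, which is assigned to a
    return a + b;
}
```

Starting from 5, `prefix_demo` returns result 6 and variable 6, while `suffix_demo` returns result 5 and variable 6. An ambiguous statement such as `j = i + i++;` is better split into two statements, e.g. `j = i + i; i++;`, assuming that is the intended meaning.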
A comparison operator is a binary operator
→ <    less than
→ <=   less than or equal
→ >    greater than
→ >=   greater than or equal
→ ==   equal to
→ !=   not equal to
and compares the expressions on its left and right, and results in a boolean.

A logical operator is a unary (not) or binary (and, or) operator
→ !    logical not
→ &&   logical and
→ ||   logical or
and associates the expressions on its left and right (for not: at the right), and results in a boolean; the associated expressions are also boolean (or converted to this type). Boolean values are false or true, but it is also common to use the integer values 0 or 1, which are automatically converted (implicit cast) from/to false and true. The short-cut property of the logical and and logical or operators means that when the left operand of && is false, the right operand is not evaluated and the result is automatically false; likewise, when the left operand of || is true, the right operand is not evaluated and the result is automatically true.

A comma operator is a binary operator
→ ,   separate two expressions
and evaluates the two expressions in turn from left to right; the result is the value of the expression on the right.

A cast operator is a unary operator
→ (bool)     convert to boolean
→ (int)      convert to integer
→ (double)   convert to double
→ (char)     convert to character
→ etc...
and converts the expression on its right to the indicated type. When the cast represents an upgrade, the precision of the quantity is unaffected. If it represents a downgrade, the quantity is truncated. If an operator requires a certain data type and no prior conversion was conducted, the compiler decides automatically on the casts that are needed. Downgrading of a real to an integer means truncation and therefore a loss of precision. Conversion to a boolean gives false if zero and true if non-zero.

4.2 Types of Statements

An algorithm is a succession of steps, translated in C++ into a succession of statements.
In C++, there are six basic types of statements
→ null statement
→ declaration statement
→ expression statement
→ compound statements
→ conditional statements
→ iterative statements

The null statement is a simple semicolon and does nothing.

A declaration statement declares the existence and type of a variable
→ Syntax:    type variable_name;
→ Example:   int x;

An expression statement evaluates an expression
→ Syntax:    expression;
→ Example:   j = i++;

A compound statement is a sequence of statements to be treated by the compiler as a single statement
→ Syntax [15]:   { statement1 statement2 ... statementN }
→ Example:       { int x,y; x = 5; y = x + 5; }

A conditional statement is a statement that is only executed if some specified condition is satisfied
→ The condition is a boolean expression that can evaluate to true or false.
→ Syntax:    if (expression) statement   or   if (expression) statement1 else statement2
→ Example:   if ( x > 0 ) y = x; else y = -x;

For testing multiple cases, the switch statement represents an alternative to multiple if-else statements

    ok == true
    if (c == '+') {
        cout << "set result to a+b";
        r = a+b;
    }
    else if (c == '-') {
        cout << "set result to a-b";
        r = a-b;
    }
    ...

    ok == true
    switch (c) {
        case '+':
            cout << "set result to a+b";
            r = a+b;
            break;
        case '-':
            cout << "set result to a-b";
            r = a-b;
            break;
        ...
    }

Furthermore, there is an alternative to if-else statements to include tests within expressions

    if ( a >= 0 ) a_abs = a; else a_abs = -a;

    a_abs = (a >= 0) ? a : -a;

An iterative statement is a statement that is executed multiple times as long as some continuation condition holds or until some termination condition is met
→ The condition is a boolean expression that can evaluate to true or false.
→ Syntax:    while (expression) statement   or   do statement while (expression);
→ Variant:   for (statement1 expression1; expression2) statement
             equivalent: statement1 while (expression1) { statement expression2; }
→ Example:   while ( i != 0 ) { f = f + i; i--; }

[15] The statements already include the semicolons.

4.3 Functions

Although it would be possible to write every C++ program in a linear fashion, it often turns out to be more elegant and lucid to group logical units into modules, which take the form of functions. The C++ program itself is written in a function called main. Additional functions with different names can be defined next to main and called from within other functions.

Some functions do not "communicate" at all: they are called, do something and return. Usually, however, the calling piece of code needs to provide information to the function and/or receive information from the function. There are four ways a function can communicate with the program
→ Argument by value
→ Result (return value)
→ Argument by reference
→ Global variable

The simplest way a function receives information from the calling function is by its arguments, as passed by value. The called function creates its own variable of the given type, which is only valid inside the function. The variable is initialized to the corresponding value listed in the call. Variable names can therefore differ between the calling and the receiving functions without any consequence; consequently, the same variable name can stand for distinct variables.

    void average_function(int six, double x);   // declaration (prototype)

    int main() {
        ...;
        int five = 5;
        average_function(five, 10.0);
        ...;
        return 0;
    }

    void average_function(int six, double x) {
        ...; ...; ...;
        return;
    }

The simplest way a function returns information to the calling function is by its result. A function can have at most one result [16]. Functions of type void have no return value at all.
A function may contain multiple return statements at various locations; however, the first return statement encountered ends the function execution.

    #include <iostream>
    using namespace std;

    double abs_val(double x);   // declaration (prototype)

    int main() {
        double a, y = -1.0;
        a = abs_val(y);
        cout << y << " " << a << endl;
        return 0;
    }

    double abs_val(double x) {
        if ( x < 0.0 ) x = -x;   // only the local copy is changed
        return x;
    }

If a function both receives information from and returns information to the calling program by its arguments, they must be passed by reference. In this case, when the function is called [17], the argument variable becomes directly accessible within the function under the indicated name; both are stored at the same memory location. The information flow now goes in both directions: a change of the variable inside the function affects the corresponding variable of the calling program. For this reason, the argument in the calling function must be a variable; a constant could not be altered.

    #include <iostream>
    using namespace std;

    int max_obsolete(int &n);   // declaration (prototype)

    int main() {
        int m, p;
        m = 4;
        p = max_obsolete(m);
        cout << m << " " << p << endl;
        return 0;
    }

    int max_obsolete(int &n) {
        n = 5;      // changes m in main
        return 4;
    }

In this program, the value of variable "m" turns to 5 because the argument is passed by reference.

[16] main also has a return value. The integer return value of main goes back to UNIX and can be probed using "echo $?" immediately after running the command (0 = normal termination). The return value of main can indicate an error in the program, provided the program is written to return a non-zero value in this case (which can then be tested with a UNIX C-shell script).
[17] The definition of the function now includes a "&" prior to the variable name, e.g. void my_function(int &n).

Variables must be declared before being used. The scope of a declaration represents the domain in which the declaration holds. It is determined by the location of the declaration and may be either
→ local: the variable is declared inside a function and its scope is hence limited to the function itself.
→ global: the variable is declared outside any function and its scope is the entire file containing the declaration. The scope of a global variable may even be extended to multiple files by repeating the declaration in each of the files, preceded by the keyword extern.

Neither local nor global variables may be declared twice in their respective scope. However, the name of a global variable can be reused [18] for a local variable [19] name inside a function, both being separate variables.

Functions, just as variables, must be declared before they are used (called), and these declarations have a scope. The scope of a declaration may again be either
→ local: the function is declared inside another function and its scope is hence limited to this function.
→ global: the function is declared outside any other function and therefore has a scope reaching over the entire file containing the declaration.

A function may be declared multiple times at different places (even within the scope of a previous declaration). The declaration must specify the types of its return value and arguments; it is also called a prototype. In contrast to variables, functions not only have to be declared but also defined (i.e. one needs to specify the statements they contain). Functions can only be defined once and must be defined outside all other functions, at top level. The function definition also counts as a declaration, but the function call must not precede it. The variables listed between parentheses (with their types) in the definition of the function are called parameters [20].

The function call causes immediate execution of the called function, after which the normal flow of the calling function is resumed. A function can be called multiple times at different places in a program, but each call must be preceded by a declaration with a scope reaching the call statement.
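The rules above (prototype before call, passing by value versus by reference, a local variable reusing the name of a global one) can be combined in one small sketch. All names are made up for this illustration, and reusing the global name `counter` locally is done here only to show the effect; it is not a recommended style.

```cpp
// Prototypes (declarations): return type and parameter types.
void set_to_five_by_value(int n);
void set_to_five_by_reference(int &n);
int  read_counters();
int  call_both();

int counter = 100;   // global variable: scope is the entire file

void set_to_five_by_value(int n)      { n = 5; }   // changes the local copy only
void set_to_five_by_reference(int &n) { n = 5; }   // changes the caller's variable

int read_counters() {
    int counter = 1;              // separate local variable with the same name
    return counter + ::counter;   // local (1) + global (100) via "::"
}

int call_both() {
    int m = 4;
    set_to_five_by_value(m);       // m is still 4 afterwards
    int after_value = m;
    set_to_five_by_reference(m);   // m is 5 afterwards
    return 10 * after_value + m;   // encodes both observations: 45
}
```

Calling `call_both()` gives 45 (4 after the call by value, 5 after the call by reference), and `read_counters()` gives 101. Because the prototypes come first, the definitions could be placed in any order, or after the functions that call them.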
At some points, functions and variables are interconnected:
→ The data type void is used to denote an empty parameter set or return value (e.g. void fnc(void)).
→ A name clash can arise when the programmer is using the same name for different objects (e.g. variable, function) within the program.
→ A function that changes the value of a global variable is said to have a side effect.
→ When the same variable name is used for a global and a local variable, the local variable has by default precedence within the function; the global variable can still be accessed with the scope-resolution operator "::".

4.4 Input and Output

The input-output (IO) of a program concerns the reading of data from files (including the keyboard) and the writing of data to files (including the screen). Since a file is a sequential [21] object, each data item is accessed (read or written) separately. The C++ language contains two alternative sets of functions for performing IO
→ the standard IO library, inherited from the C language
→ the stream IO library, specific to the C++ language

There are two main types of IO and corresponding files
→ Unformatted IO/files, corresponding to reading/writing binary files (byte sequences)
→ Formatted [22] IO/files, corresponding to reading/writing plain text (ASCII) files (e.g. stream IO)

[18] Nevertheless, this is not very wise; it leaves a lot of room for mistakes.
[19] By default, the local variable will be accessed under the variable name in the function, unless one precedes the variable name with the scope-resolution operator "::".
[20] The variables listed between parentheses (with their types) in the declaration are called dummy parameters and will only be checked for their types, not their names. Furthermore, the variables listed between parentheses (without their types) in the call of the function are called arguments; they can be arbitrary expressions or variables.
[21] The concept opposite to sequential access is random access.
[22] Unlike unformatted IO/files, formatted ones are plain text (human readable), editable by hand, portable across data types and machines, and rather space-consuming.

Features of the C++ stream input-output library include

• From keyboard / to screen
→ Header file             #include <iostream>
→ Input stream            cin for standard input
→ Output stream           cout for standard output, cerr for standard error, clog for standard error (buffered)

• From file / to file
→ Header file             #include <fstream>
→ Input stream type       ifstream
→ Output stream type      ofstream
→ Open function           ifstream f_inp("inp.dat");  &  ofstream f_out("out.dat");
→ Input function          f_inp >> ... ;
→ Output function         f_out << ... ;
→ Check end of input      f_inp.eof() [23]
→ Send end of line        f_out << endl;
→ Close function          f_inp.close();  &  f_out.close();

4.5 Arrays

An array is a set of variables of the same type accessible through an integer index. The array size (number of variables in the set) and the type of the elements must be specified when the array variable is declared. In C++, array indexes run from zero to the array size minus one. A program setting all elements of an array of size 42 to 42 may look like

    int main() {
        int wisdom[42], i;
        for (i=0; i<42; i++) {
            wisdom[i] = 42;
        }
        return 0;
    }

Arrays with a content that is known at compile time can be initialized using explicit constants. Just like any other variable, they can be passed to functions as arguments

    void my_function(double a[]) {...}

Arrays passed to a function can, despite being passed by value, be modified by the called function, because it is not the array but rather the pointer to the start of the array that is passed. Altering this pointer inside the called function therefore has no consequence outside of it, whereas altering the array elements does.

Arrays can have one or multiple dimensions. A one-dimensional array takes the form → a[j].
A two-dimensional array can be regarded as a one-dimensional array of which each element is itself a one-dimensional array → b[i][j] [24]. This corresponds to the mathematical convention where the element $A_{ij}$ of a matrix is the one at line i and column j [25]. Of course, one can extend this approach to multi-dimensional arrays → c[i][j][k] etc...

An array can be initialized in a simple way only at the same time as it is defined. The compiler may also guess one missing dimension (only that of the slowest index)

    int a[4] = {4,2,4,2};          or    int a[] = {4,2,4,2};
    int a[2][2] = {4,2,4,2};       or    int a[][2] = {4,2,4,2};

which is also applicable to characters

    char i[4] = {'I','N','F','O'};    or    char s[] = {'S','U','C','K','S'};

A string is a character array including the termination character '\0' at the end and can be initialized as

    char s[] = {'I','N','F','O','\0'};    or    char i[] = {"SUCKS"};

[23] A boolean that is true as soon as one tries to read past the last item.
[24] An array of i times j arrays...
[25] j is the fast index (columns).

A pointer is a variable that can indicate the memory location of an object of a specific type. The object type must be specified when the pointer variable is declared

    int* ptr;

A pointer to a given variable can be generated using the address-of (reference) operator "&"

    int my_int;   →   ptr = &my_int;

The value of a variable pointed at by a pointer can be determined using the dereference [26] (or indirection) operator "*". The name of an array acts (in most expressions) as a pointer to the first element of the array. The addition / subtraction of an integer to / from a pointer shifts the pointer by as many elements along the array

    int my_array[10];
    my_array[5] = 42;
    int* ptr = my_array;
    ptr += 5;
    cout << *ptr;   →   42

4.6 Types of Errors

In order to function properly, a C++ program has to be correct in terms of syntax, semantics and algorithmics. The process of refining a program until it works is called debugging.
• Syntax encompasses the basic rules for assembling the symbols of the language (i.e. its vocabulary and grammar).
  → Syntactic errors generally cause a compile-time abort - the compiler will stop with an error message. They are often easy to detect and fix.
  → Examples of syntactic errors: for (i=0,i<42,i++) {...} or if i!=1 j=2;
• Semantics encompasses the rules for doing something sensible computerwise (i.e. the meaning of the operations).
  → Semantic errors usually cause a run-time abort [27]. They are more difficult to fix.
  → Examples of semantic errors: x /= 0 or for(i=0; i<10; i--)
• Algorithmics encompasses the method for solving the problem at hand (i.e. the mathematical logic or physical model you employ).
  → Algorithmic errors normally let the program run but return nonsense results because of logical or modelling errors.
  → Examples of algorithmic errors: area_circle = 2*pi*r or volume_gas = P / (n*R*T);

Besides programming errors, a result may still be affected by numerical errors
• Roundoff errors arise because, at finite precision, a floating-point real cannot encompass information concerning numbers of widely different magnitudes.
  → To prevent rounding errors, it is advisable to avoid adding numbers of widely different magnitudes or subtracting numbers of nearly equal magnitudes, to add numbers in a sequence in increasing order of magnitude and to carry out subtraction before multiplication or division.
  → Examples of roundoff errors (four significant digits): 1.000 + 0.0003 → 1.000 or 1.000E0 + 1.000E4 → 1.000E4
• Truncation errors arise mostly by truncating an infinite sum and approximating it by a finite sum.
  → Taylor series expansions lead to truncation errors. They decrease with an increasing number of calculated terms of the series.
• Instability errors (due to numerical algorithms) include the growth of round-off errors and/or of initially small fluctuations in the initial data that might cause a large deviation in the final result.
  → Using recurrence relations to calculate lattice energies is efficient but prone to instability errors.

[26] The notation "x[]" is equivalent to "*x" - the notation "x[i]" is equivalent to "*(x+i)".
[27] The program will stop during execution, often with a not so clear error message (such as a floating-point exception or a segmentation fault), or never stop (endless loop).

5 Algorithmics

An algorithm is what precedes programming and does not depend on the programming language in which it is going to be implemented. It consists of a precise specification of how to solve a (class of) problem(s). Algorithmics is hence the branch of computer science that deals with algorithms. Different algorithms to solve the same problem may vary widely in terms of solution quality, as well as computational efficiency and numerical accuracy. The comparison of the efficiency of different algorithms for the same problem is called algorithm benchmarking. There are different types of benchmarking
→ Benchmark efficiency at constant accuracy for different algorithms or programs on one CPU type
→ Benchmark accuracy at constant computational time for different algorithms and/or implementations on one CPU type
→ Benchmark CPU types for the same program at the same accuracy

One usually differentiates between recursive and iterative algorithms. In contrast to an iterative algorithm, a recursive one calls itself. Although recursion is clearly more elegant [28], iteration is often more efficient [29]. The following shows an iterative and a recursive variant for the calculation of n!

    int fact_ite(int n) {
        int i, f = 1;
        for (i=2; i<=n; i++) f *= i;
        return f;
    }

    int fact_rec(int n) {
        if (n == 0) return 1;
        return n*fact_rec(n-1);
    }

5.1 Algorithms for Sorting

Sorting algorithms come into play when a list (file) of records (items) is provided as input, with each record encompassing a key for which a well-defined ordering rule is provided.
The choice of the sorting algorithm then depends on
→ the size of the list
→ the required sorting speed
→ whether the list has a pre-sorting order
→ the size of individual records
→ the average-case performance
→ the worst-case performance
→ whether stability [30] is required
→ whether in-place [31] sorting is required
→ whether the key set is small or large

To characterize the efficiency of algorithms, it is often useful to know how to calculate the number P of unique (ordered) pairs of elements in a set of N numbers. It is defined as the number of combinations (i,j) one can make with i = 1...N, j = 1...N and i < j (1 ≤ i < j ≤ N), which corresponds to the number of elements in the upper triangle (excluding the diagonal elements) of an N × N matrix

\[
\begin{pmatrix}
a_{11} & a_{12} & a_{13} & \cdots & a_{1N} \\
a_{21} & a_{22} &        &        & \vdots \\
a_{31} &        & a_{33} &        &        \\
\vdots &        &        & \ddots &        \\
a_{N1} & \cdots &        &        & a_{NN}
\end{pmatrix}
\]

[28] Shorter code, fewer variables, closer to the math, ...
[29] Fewer operations, function calls, ...
[30] Whether the initial ordering of records with identical keys has to be preserved.
[31] Whether the sorted list is in the same location as the original list and no extra storage is needed during the sorting (in-place).

In total there are N^2 elements including N diagonal elements; hence N^2 − N = N(N − 1) off-diagonal elements and P = N(N − 1)/2 upper-triangle elements. This number is equal to

\[
P = \sum_{i=1}^{N-1} \sum_{j=i+1}^{N} 1 \qquad (5.1)
\]

Selection sort is one of the simplest possible in-place sorting techniques (but it is unstable). The algorithm loops over the array elements except the last (primary index i = 1..N-1). For each i, the smallest element among those from i onward is found by initially taking element i as the smallest and looping over a secondary index j = i+1..N, updating the choice each time a smaller value is found. After the inner loop, the smallest element found is swapped with element i. This corresponds to the C++ code

    void selection_sort(int N, int a[]) {   // note: a[1..N]; a[0] unused
        int i, j, jm, at;
        for (i=1; i<=N-1; i++) {
            jm = i;
            for (j=i+1; j<=N; j++)          // find smallest element from i onward (i.e. i..N)
                if ( a[j] < a[jm] ) jm = j;
            at = a[i]; a[i] = a[jm]; a[jm] = at;   // do the swap
        }
        return;
    }

→ The inner loop contains one comparison (and sometimes an assignment involving jm)
→ Minimal data movement, so the algorithm is good for large records
→ The algorithm is not stable
→ N(N-1)/2 comparisons, N-1 exchanges, 3(N-1) assignments involving a[]

Insertion sort is another simple in-place sorting technique (and is stable). The algorithm loops over the array elements from the second onward (primary index i = 2..N). For each i, one loops over the already ordered elements before i (secondary index j = i−1..1) and inserts element i at the appropriate place. This corresponds to the C++ code

    void insertion_sort(int N, int a[]) {   // note: a[1..N]; a[0] sentinel
        int i, j, at;
        for (i=2; i<=N; i++) {
            at = a[0] = a[i];               // save element i; set element 0 as sentinel
            j = i;
            while ( at < a[j-1] ) {
                a[j] = a[j-1];              // push element j-1 one forward
                j--;
            }
            a[j] = at;                      // insert saved element
        }
        return;
    }

→ The element a[0] (sentinel) contains at, so the while loop always terminates (at the latest with j=1)
→ The inner loop contains one comparison and one assignment involving a[]
→ There is a lot of data movement, so the algorithm is bad for large records
→ The algorithm is stable
→ On average ∼ N^2/4 comparisons and ∼ N^2/4 assignments involving a[] (moves)

The sentinel is a programming trick involving the extension of an array with one dummy element (the sentinel) set to a value x (which is to be spotted in the array) in order to avoid N comparisons in the algorithm.
The following shows the same function body twice - one variant with two comparisons per iteration and the other with only one [32]

    // without sentinel                      // with sentinel
    int i, a[N];                             int i, a[N+1];
                                             a[N] = x;
    i = 0;                                   i = 0;
    while (i<N && a[i]!=x) i++;              while (a[i]!=x) i++;
    imin = i;                                imin = i;

There is also an obvious linear-scaling sorting method when the key set is finite (small alphabet of keys). It is stable and, unlike the preceding ones, not in-place - it needs auxiliary (working) arrays. The algorithm consists in first counting the occurrences of the elements of each type (in an auxiliary array), then finding the starting point of each type in the same array and finally filling another auxiliary array using these starting points [33]. It corresponds to the C++ code

    void baby_sort(int N, int a[], int K) {
        int i, k;
        int n[K+1], b[N+1];                        // variable-length arrays (a compiler extension in C++)
        for (k=1; k<=K; k++) n[k]=0;               // set n[1..K] to 0
        for (i=1; i<=N; i++) n[a[i]]++;            // n[k] will give the number of elements of each type
        n[0]=1;
        for (k=1; k<=K-1; k++) n[k] += n[k-1];     // n[k-1] will now give the start point of the set
                                                   // of elements of type k in the reordered array
        for (i=1; i<=N; i++)                       // move elements a[1..N] to new locations in b[1..N]
            b[n[a[i]-1]++] = a[i];
        for (i=1; i<=N; i++) a[i]=b[i];            // copy b[1..N] to a[1..N]
        return;
    }

→ Linear scaling (only single loops, no double loops)
→ The key set is always finite because the list is finite, but it can be long (i.e. K ≤ N)
→ If the key values are not nicely sequential (1,2,3,4...K) as assumed, an additional array to map the key values to sequential integers is needed

[32] Which corresponds to the solution with sentinel.
[33] Detailed description: First, every element n[1..K] of the array n[] is set to zero. Afterwards, the number of occurrences of each element type in the array a[] is counted and saved at the position of array n[] corresponding to its number (type); the count is raised by one per detection of one type. Then these counts are summed up cumulatively in ascending order - to get correct positions, the value of n[0] is set to 1. To move an element of array a[] to the corresponding position of array b[], the number (type) of the element is reduced by 1 (because after the cumulative summation n[k-1] holds the start position of type k); the value read from array n[] is then the sorted location in array b[], and it is raised by one (by the postfix increment operator) for the next element of the same type (if existent).

5.2 Algorithms for Searching

A searching algorithm is applied when a list (file) of records (items) is provided as input, with each record encompassing a key, and the record having a specified value of the key (search key) is to be found.

Sequential search simply scans through all the records in the list in turn. The number of comparisons depends on whether the list is sorted and on whether the search is successful; it is on average N/2 in most cases [34] (with N being the number of records). An example of a C++ function searching an array a[] of N integers (with dimension N + 1) for the first occurrence of value s may look like

    int seq_search(int a[], int N, int s) {
        int i = 0;
        a[N] = s;                    // use a[N] as sentinel
        while ( s != a[i] ) i++;
        return i;
    }

Binary search starts with the full list as current sublist, iteratively divides the current sublist into two parts (of equal sizes - up to one unit), determines in which of the two parts the search key may reside, and sets this part as the new sublist. It makes use of a divide and conquer algorithm and requires a sorted list. The number of comparisons is about log2(N) [35].
An example of a C++ function searching a sorted array a[] of N integers (with dimension N) for an occurrence of value s may look like

    int bin_search(int a[], int N, int s) {
        int left = 0, right = N-1, middle;
        while (left <= right) {
            middle = (left+right)/2;                 // with truncation
            if (s < a[middle]) right = middle - 1;
            else if (s > a[middle]) left = middle + 1;
            else return middle;
        }
        return N;                                    // return N if not found
    }

String search finds an occurrence of a pattern [36] within a text - given a text string [37] of length N and a pattern (string) of length M. In contrast to the search for a key, the pattern can be very long and must be lined up character by character with the character sequence in the text. The simplest approach is a brute force algorithm [38] that checks for each possible position along the text whether the characters of pattern and text match. The number of comparisons in the worst case is (N − M + 1)M ≈ NM. The algorithm uses two indices i and j, where i indicates the current position in the text and j the position in the pattern. When the characters match, both are increased. When j = M, a match is found and its start position is returned; when i = N, no match was found. Upon mismatch, i is reset to i − j + 1 and j to 0 (backtracking). An example of a C++ function may look like

    int brute_force_search(char a[], int N, char p[], int M) {
        int i=0, j;
        while (i<N) {
            j = 0;
            while ( i<N && j<M && a[i] == p[j] ) { i++; j++; }   // i<N: never read past the text
            if ( j == M ) return i-j;                            // match found at position i-j
            i -= j-1;                                            // backtrack to i-j+1
        }
        return -1;                                               // return -1 if not found
    }

[34] Except in the case of an unsorted list without success in finding the search key → N comparisons.
[35] An order O[lg(N)] (= O[log2(N)]) algorithm is much faster than an O[N] algorithm.
[36] The C++ standard library also contains functions for pattern matching → s.find("STRING");
[37] A sequence of letters, numbers, white spaces, special characters → large alphabet (≥ 26 types).
[38] Not efficient for highly degenerate strings (small alphabets → sequences of 0 and 1 values).
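As a minimal sketch of the brute-force string search described above, the variant below works on std::string instead of character arrays, so the lengths N and M do not have to be passed separately. The function name brute_force_find and the std::string interface are assumptions made for illustration and are not part of the original notes. The result can be cross-checked against std::string::find, which reports a failed search with std::string::npos rather than -1.

```cpp
#include <cassert>
#include <string>

// Hypothetical helper (name chosen for this sketch only).
// Returns the start index of the first occurrence of pattern in text,
// or -1 if the pattern does not occur.
int brute_force_find(const std::string& text, const std::string& pattern) {
    const int N = static_cast<int>(text.size());
    const int M = static_cast<int>(pattern.size());
    int i = 0;  // current position in the text
    int j = 0;  // current position in the pattern
    while (i < N && j < M) {
        if (text[i] == pattern[j]) {
            i++;            // characters match: advance both indices
            j++;
        } else {
            i = i - j + 1;  // mismatch: backtrack to one past the last start
            j = 0;
        }
    }
    return (j == M) ? i - M : -1;
}
```

For instance, brute_force_find("ABRACADABRA", "CAD") yields 4, in agreement with std::string("ABRACADABRA").find("CAD"). Keeping both bounds in a single loop condition means the text is never read past its end, at the price of one extra comparison per step.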
5.3 Numerical Integration

A prerequisite to numerical algorithms is the investigation of algorithmic stability and numerical errors as discussed in Section 4.6. The results of numerical calculations depend on round-off errors. The rounding-off of a number to n digits consists in omitting all digits to the right of the nth digit and incrementing the nth digit if either the omitted part is larger than half the unit in the nth position, or the omitted part is equal to half the unit in the nth position and the nth digit is odd [39]. In order to avoid loss of significance, it is often convenient to rearrange the equations in such a way that the subtraction of almost equal numbers is avoided, e.g. by expanding the equations in Taylor series of the general form [40]

\[
f(x) = \lim_{N \to \infty} \sum_{n=0}^{N} \frac{f^{(n)}(a)}{n!} (x - a)^n
\]

To calculate the value of $e^x - 1$ for $x \approx 0$, a Taylor series around $x = 0$ is formulated

\[
e^x - 1 = \left( 1 + x + \frac{x^2}{2} + \frac{x^3}{6} + \frac{x^4}{24} + \dots \right) - 1 = x \left( 1 + \frac{x}{2} + \frac{x^2}{6} + \frac{x^3}{24} + \dots \right)
\]

In order to calculate the integral of a function

\[
I_{ab} = \int_a^b f(x) \, dx
\]

• the primitive function can be found (analytically)

\[
\frac{dF(x)}{dx} = f(x) \quad \rightarrow \quad I_{ab} = [F(x)]_a^b = F(b) - F(a)
\]

• special mathematical tricks can be applied such as
  → change of variable
  → integration by parts
  → residue theorem of complex function theory
  → basis set expansion (e.g. Taylor, Fourier, Laplace, ...)
• a table of standard integrals can be consulted
• a symbolic mathematics program (Mathematica, Matlab, ...) can be applied
• a numerical integration method can be used such as
  → Monte Carlo integration
  → Rectangular quadrature
  → Trapezoidal quadrature
  → Simpson's quadrature
  → Romberg's quadrature

Monte Carlo integration is a stochastic method and hence relies on (pseudo)random numbers. For a non-negative integrand, it consists in choosing an $H \geq \max f(x)$ over $[a, b]$ and sampling N pairs of real random numbers $(x_n, y_n)$ from a uniform rectangular distribution ($a \leq x_n \leq b$, $0 \leq y_n \leq H$ with n = 1, 2, ..., N).
The fraction of points $(x_n, y_n)$ in the rectangle for which $y_n \leq f(x_n)$ approximates the ratio of the integral $I_{ab}$ to the area of the rectangle

\[
I_{ab}^{MC} = H (b - a) N^{-1} \sum_{n=1}^{N} \Theta(f(x_n) - y_n)
\qquad
\Theta(z) = \begin{cases} 1 & \text{if } z > 0 \\ 0 & \text{otherwise} \end{cases}
\]

→ Error $\Delta I_{ab}^{MC} = I_{ab}^{MC} - I_{ab} = O[N^{-1/2}]$

[39] Other conventions do exist. Rounding is hence unequal to truncation, where the extra digits are simply cut off.
[40] Taylor series of f(x) at x=a.

Another approach to the numerical evaluation of integrals is quadrature, where N subintervals 0...N-1 of equal widths [41] are used, which are defined by N + 1 points 0...N ($x_n = a + nh$ with $x_0 = a$, $x_N = b$). The total integral can be written as [42]

\[
I_{ab} = \sum_{n=0}^{N-1} I_n \qquad \text{or} \qquad I_{ab} = h \sum_{n=0}^{N} w_n f(x_n) \qquad (5.2)
\]

• Rectangular (Q) quadrature
  → Total integral
\[
I_{ab}^{Q} = \sum_{n=0}^{N-1} I_n^{Q} \quad \text{with} \quad I_n^{Q} = h f(x_n)
\]
  → Alternative form
\[
I_{ab}^{Q} = h \sum_{n=0}^{N} w_n^{Q} f(x_n) \quad \text{with} \quad w_n^{Q} = \begin{cases} 1 & 0 \leq n < N \\ 0 & n = N \end{cases}
\]
  → Evaluation is asymmetric, i.e. from a to b is not the same as from b to a
  → Estimated error $\Delta I_{ab}^{Q} = I_{ab}^{Q} - I_{ab} \approx -\frac{h}{2} [f(b) - f(a)] + O[h^2]$, which implies an error linear in h (i.e. a slow convergence - or a small h necessary for a good estimate)
  → If $f(b) > f(a)$, $I_{ab}$ is underestimated, and consequently if $f(b) < f(a)$, $I_{ab}$ is overestimated [43]
• Trapezoidal (T) quadrature
  → Total integral
\[
I_{ab}^{T} = \sum_{n=0}^{N-1} I_n^{T} \quad \text{with} \quad I_n^{T} = \frac{h}{2} [f(x_n) + f(x_{n+1})]
\]
  → Alternative form
\[
I_{ab}^{T} = h \sum_{n=0}^{N} w_n^{T} f(x_n) \quad \text{with} \quad w_n^{T} = \begin{cases} 1/2 & n = 0, N \\ 1 & 0 < n < N \end{cases}
\]
  → Evaluation does not depend on the direction because of the averaging of two rectangular-rule estimates, i.e. from a to b is the same as from b to a
  → Estimated error $\Delta I_{ab}^{T} \approx \frac{h^2}{12} [f'(b) - f'(a)] + O[h^4]$, which implies an error quadratic in h (i.e. a better convergence - or a larger h sufficient for a good estimate)
  → If $f'(b) > f'(a)$, $I_{ab}$ is overestimated, and consequently if $f'(b) < f'(a)$, $I_{ab}$ is underestimated
• Simpson's (S) quadrature
  → For this method, N must be even
  → Parabola through $x_n$, $x_{n+1}$, $x_{n+2}$
\[
p_n(x) = \frac{(x - x_{n+1})(x - x_{n+2})}{2h^2} f(x_n) - \frac{(x - x_n)(x - x_{n+2})}{h^2} f(x_{n+1}) + \frac{(x - x_n)(x - x_{n+1})}{2h^2} f(x_{n+2})
\]
  → Total integral
\[
I_{ab}^{S} = \sum_{n=0}^{N-1} I_n^{S} \quad \text{with} \quad I_n^{S} = \frac{h}{3} [f(x_n) + 4 f(x_{n+1}) + f(x_{n+2})] \ \text{if n is even (0 otherwise)}
\]
  → Alternative form
\[
I_{ab}^{S} = h \sum_{n=0}^{N} w_n^{S} f(x_n) \quad \text{with} \quad w_n^{S} = \begin{cases} 1/3 & n = 0, N \\ 4/3 & n \text{ odd}, \ 0 < n < N \\ 2/3 & n \text{ even}, \ 1 < n < N - 1 \end{cases}
\]
  → Estimated error $\Delta I_{ab}^{S} \approx \frac{h^4}{180} [f'''(b) - f'''(a)] + O[h^6]$

[41] $h = (b - a)/N$
[42] With $I_n$ being the integral over interval n and $w_n$ the integral contribution of point n divided by h (quadrature height).
[43] Under the premise that the slope is positive.

• Romberg's (R) quadrature
  → The trapezoid rule is used to calculate successive estimates of the integral, halving the spacing every time. The trapezoid estimate of order L, $I_L^T$, requires $N = 2^L$ intervals and $2^L + 1$ points with $h_L = (b - a)/2^L$
  → The Romberg integral estimate of order L is obtained by formulating a linear combination of all these trapezoidal integrals up to order L such that the error is minimal
\[
I_L^R = \sum_{K=0}^{L} C_{L,K} I_K^T
\]
  → This is done by introducing intermediate sums $T_{L,M}$ such that
\[
T_{L,0} = I_L^T, \qquad T_{L,M} = \alpha T_{L,M-1} + \beta T_{L-1,M-1} \quad (0 < M \leq L), \qquad I_L^R = T_{L,L}
\]
  → E.g. Romberg approximation of order one, with $h_1 = h_0/2$ and $T_{0,0} = I_0^T = I + C_2 h_0^2 + C_4 h_0^4 + \dots$
\[
\begin{aligned}
T_{1,0} = I_1^T &= \frac{h_1}{2} [f(a) + 2 f(a + h_1) + f(b)] \\
&= I + C_2 h_1^2 + C_4 h_1^4 + \dots = I + \frac{C_2}{4} h_0^2 + \frac{C_4}{16} h_0^4 + \dots \\
I_1^R = T_{1,1} &= \frac{4 T_{1,0} - T_{0,0}}{3} = I - \frac{1}{4} C_4 h_0^4 + \dots = I - 4 C_4 h_1^4 + \dots
\end{aligned}
\]
  → Error $O[h_1^4]$
  → Romberg approximation of order L, with $h_L = h_0/2^L$
\[
\begin{aligned}
T_{L,0} = I_L^T &= \frac{h_L}{2} [f(a) + 2 f(a + h_L) + \dots + 2 f(a + (2^L - 1) h_L) + f(b)] = I + O[h_L^2] \\
T_{L,1} &= \frac{4 T_{L,0} - T_{L-1,0}}{3} = I + O[h_L^4] \\
&\ \ \vdots \\
T_{L,M} &= \frac{4^M T_{L,M-1} - T_{L-1,M-1}}{4^M - 1} = I + O[h_L^{2M+2}] \\
I_L^R &= T_{L,L}
\end{aligned}
\]
  → Error $O[h_L^{2L+2}]$

In order to process singularities [44], the integrand has to be rearranged - there exist four alternatives
→ Expand the integrand in a Taylor series around the singular point and integrate the series term by term
\[
\begin{aligned}
I = \int_0^1 dx \, \frac{\cos(x)}{x^{1/2}} &= \int_0^1 dx \, x^{-1/2} \left[ 1 - \frac{x^2}{2!} + \frac{x^4}{4!} - \dots \right] = \left[ \frac{x^{1/2}}{1/2} - \frac{x^{5/2}}{(5/2) \, 2!} + \frac{x^{9/2}}{(9/2) \, 4!} - \dots \right]_0^1 \\
&= 2 - \frac{1}{5} + \frac{1}{108} - \dots = 1.809
\end{aligned}
\]
→ Remove the singularity by a change of variable: $x = y^2$, $dx = 2y \, dy$, $x = 0 \rightarrow y = 0$ and $x = 1 \rightarrow y = 1$
\[
I = \int_0^1 dx \, \frac{\cos(x)}{x^{1/2}} = \int_0^1 dy \, 2y \, \frac{\cos(y^2)}{y} = 2 \int_0^1 dy \, \cos(y^2)
\]
→ Remove the singularity by partial integration: $\int dx \, u'(x) v(x) = [u(x) v(x)] - \int dx \, u(x) v'(x)$ with $u'(x) = x^{-1/2}$ and $v(x) = \cos(x)$
\[
I = \int_0^1 dx \, \frac{\cos(x)}{x^{1/2}} = \left[ 2 x^{1/2} \cos(x) \right]_0^1 - \int_0^1 dx \, 2 x^{1/2} (-\sin(x)) = 2 \cos(1) + 2 \int_0^1 dx \, x^{1/2} \sin(x)
\]
→ Make the singularity tractable by splitting the integrand into different terms
\[
I = \int_0^1 dx \, \frac{\cos(x)}{x^{1/2}} = \int_0^1 dx \, \frac{1}{x^{1/2}} + \int_0^1 dx \, \frac{\cos(x) - 1}{x^{1/2}} = 2 + \int_0^1 dx \, \frac{\cos(x) - 1}{x^{1/2}}
\]

[44] E.g. if the integrand is singular at x = 0 but the integral is finite.

In order to process an infinite integration interval, there are also alternatives
→ Expand the integrand in an asymptotic series around the singularity and integrate the series termwise
→ Make the integration interval finite by a change of variable: $x = -\ln(y)$, $dx = -y^{-1} dy$, $x = 0 \rightarrow y = 1$ and $x = \infty \rightarrow y = 0$
\[
I = \int_0^\infty dx \, \frac{e^{-x}}{x e^{-2x} + 1} = \int_1^0 dy \, (-y^{-1}) \frac{y}{-y^2 \ln(y) + 1} = \int_0^1 dy \, \frac{1}{1 - y^2 \ln(y)}
\]

5.4 Algorithmic Strategies

A huge variety of algorithms with different aims exists. The six major algorithmic strategies are
1. Brute-force algorithms
   → Try all combinations one by one, the hard way
2. Greedy algorithms
   → Always select the option which yields the largest immediate progress towards the goal [45]
   → E.g. making change with coin sizes 25, 11, 5 and 1 cents - aim: return as few coins as possible. For 33 cents: Greedy: 25+5+1+1+1 cents; Optimal: 11+11+11 cents
   → Another example: steepest-descent minimization (line-minimize along the negative gradient [46])
3. Divide and conquer algorithms
   → Split the problem into smaller sub-problems, solve these, and combine the solutions
   → Often easiest to implement using recursion
   → E.g. binary search or Quicksort
   → On computers, multiplications involve binary strings, arranged in words of N = 2^n bits (suitable for divide-and-conquer); the elementary multiplication is one-bit and corresponds to the operation AND
4. Dynamic programming algorithms
   → Solve all the subproblems, store the solutions in a table, use the solutions of the subproblems to solve the full problem
5. Local search algorithms
   → Guess an arbitrary solution, define/perform a local transformation to an alternative solution; if it is better, store it, otherwise discard it; and iterate this procedure
6. Backtracking and pruning algorithms
   → Rank all possible solutions in a tree, traverse the tree, keep track of the best solution so far, skip the branches that cannot contain a better solution [47]

Furthermore, there are two common benchmark problems
1. Traveling salesman problem
   → A salesman wants to find a tour along N cities (i.e. a self-avoiding cycle including all cities) which is of minimum length L
2. Knapsack problem
   → A knapsack has a capacity of M units and there are N types of items with different values and costs; the optimal choice of objects in terms of total value is to be found

[45] Which is locally optimal.
[46] A much deeper minimum may be missed!
[47] Going down the tree, lower bounds will increase - the branch with the lowest lower bound is explored in priority. When a leaf is reached, the value is updated (i.e. lowest value so far). Whenever the lower bound at a node is higher than the best tour value so far, the node can be pruned (i.e. subtree skipped).
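The coin-change example given for the greedy strategy can be sketched in a few lines and contrasted with a dynamic programming solution of the same problem. This is an illustrative sketch only: the function names greedy_coin_count and dp_coin_count are hypothetical, and the greedy variant assumes that the coin sizes are passed in decreasing order.

```cpp
#include <algorithm>
#include <cassert>
#include <climits>
#include <vector>

// Greedy (hypothetical helper): always take the largest coin that still fits.
// Assumes coins are sorted in decreasing order. Returns -1 if no exact change.
int greedy_coin_count(const std::vector<int>& coins, int amount) {
    int count = 0;
    for (int c : coins) {
        count += amount / c;  // take as many coins of this size as possible
        amount %= c;          // remaining amount to be returned
    }
    return (amount == 0) ? count : -1;
}

// Dynamic programming (hypothetical helper): best[v] stores the fewest coins
// needed for value v, built up from the solutions for all smaller values.
int dp_coin_count(const std::vector<int>& coins, int amount) {
    const int INF = INT_MAX;
    std::vector<int> best(amount + 1, INF);
    best[0] = 0;
    for (int v = 1; v <= amount; v++) {
        for (int c : coins) {
            if (c <= v && best[v - c] != INF) {
                best[v] = std::min(best[v], best[v - c] + 1);
            }
        }
    }
    return (best[amount] == INF) ? -1 : best[amount];
}
```

With the coin sizes {25, 11, 5, 1} and an amount of 33 cents, greedy_coin_count returns 5 (25+5+1+1+1) whereas dp_coin_count returns 3 (11+11+11). The table-based variant costs more work per call but cannot be trapped by a locally optimal first choice.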
6 Computer Architecture

Various criteria are used in the classification of computers such as
• The computer category
  → supercomputer (>10 MSFr)
  → mainframe (1-10 MSFr)
  → server (50-200 kSFr)
  → workstation (5-50 kSFr)
  → personal computer (1-10 kSFr)
• Logic gate technology
  → relays (1935-1940)
  → vacuum tubes (1940-1955)
  → transistors (1955-1970)
  → integrated circuits (1970-...)
• Computer architecture, including
  → The instruction set architecture (ISA) (machine code instructions)
  → The microarchitecture (how the ISA is implemented electronically, computer organization)
  → The processor organization (single or multiple CPU, relative operation modes and memory accesses of CPUs)

The sequence of operations is the following
1. fetch next instruction from memory into IR (and increment PC)
2. decode instruction in IR
3. fetch operands from memory into ALU registers
4. execute instruction in ALU
5. store results from ALU registers into memory (then go to the next instruction)

The clock rate of a processor is the frequency at which successive input batches can be processed (number of clock cycles per second), namely ∼ 1 GHz (nowadays up to ∼ 3 GHz)
→ the preceding sequence of operations usually requires multiple clock cycles

However, the work of the processor can be accelerated by using instruction pipelining (if the result of an operation is not immediately needed in the next few operations), which makes it possible to process about 5 instructions simultaneously in 5 clock cycles. This places moderate constraints on the processor but important constraints on the programmer/compiler (e.g. if a previous result is necessary for a fetch but is not yet in the memory - here, pipeline processing leads to a mess).

Another way to accelerate the work of the processor is vectorization. It is applicable if the same operation is applied successively to many operands - the vector operation including all operands often only costs one clock cycle (e.g. if an operation i = n ∗ f + o is applied to a large number N of n, f and o values). It places large constraints on the processor (needs special chip design) and important constraints on the programmer/compiler (only certain pieces of a program can be vectorized).

On the other hand, hardware acceleration includes coprocessors or CPU units
→ Internal Control Unit (ICU)
→ Arithmetic and Logic Unit (ALU)
And a range of Accelerated Processing Units (APU):
→ Floating Point Unit (FPU)
→ Graphics Processing Unit (GPU)
→ Physics Processing Unit (PPU)
→ Uncommitted Logic Array (ULA)

The Instruction Set Architecture (ISA) corresponds to the set of machine code instructions implemented in a processor. There are two main categories
• Complex Instruction Set Computer (CISC)
  → instructions (different lengths) match the high-level language and are relatively slow - complex instructions and complex compiler
• Reduced Instruction Set Computer (RISC [48])
  → instructions (equal lengths - enabling pipelining) are relatively short (fast clock possible) and simple (simple and cheap CPU) - many instructions per operation in the high-level language

Since the main memory of a computer is of limited size, it is often expanded by storing data on a disk that serves as virtual memory - exchanging data in batches (e.g. by pages of ∼4 kbyte). However, with an access time of ∼ms, the disk is very slow compared to main memory (µs), cache [49] (ns) and of course CPU registers.
The processor organization is classified into four categories
→ Single Instruction Single Data (SISD): single processor (pipelining/vectorization), no parallelism
→ Single Instruction Multiple Data (SIMD): multiple CPUs, same operation, different chunks of memory
→ Multiple Instruction Single Data (MISD): multiple CPUs, different operations, same chunks of memory
→ Multiple Instruction Multiple Data (MIMD): multiple CPUs, different operations, different chunks of memory

For MIMD parallel processing, there are two main options:
• Shared memory MIMD(S)
  → Multiple CPUs operate independently but share the same memory resources
  → User-friendly programming perspective to memory
  → Lack of scalability between memory and CPUs (adding more CPUs rapidly increases traffic)
  → Program is responsible for synchronization of CPU actions on memory
• Distributed memory MIMD(D) [50]
  → Multiple CPUs have their own local memory
  → Programmer must define how and when data is communicated and tasks synchronized
  → Each CPU can rapidly access its own memory without interference
  → Program must handle data structure for distributed memory

[48] The trend nowadays.
[49] Fast, small and expensive local memory attached to the CPU.
[50] Might be further separated into MIM(D_L) in a local network and MIM(D_N) for non-local networks.
7 Computer Simulation

The two main components of the scientific approach (induction and deduction) are
→ Use systematic observations to establish laws about the reproducible behaviour of physical systems
→ Devise models [51] from which complex laws can be derived from simpler (fundamental) laws

While some models are simple and amenable to an analytical mathematical description, most are too complex to be solved analytically - one may then use simulation
→ the model is formulated in the form of an algorithm that should emulate the real-world process
→ the emulation is carried out in practice on a computer
→ the result of the simulation is compared with the real-world observations; agreement provides a hint of validity of the model
→ experimental observations can be interpreted in terms of the simulation, providing insight into the complex real-world process in terms of the simpler model assumptions
→ the result of the simulation can also be compared to those of simpler analytical models, permitting the characterization of the deviations from the latter
→ the simulations may predict properties without experiments

The simulation of a process has three stages
1. Development of a model that describes the real process as accurately as possible/required
   (a) specify the problem and the goals to be reached by simulation using the model
   (b) formulate a qualitative model
   (c) specify the set of parameters and variables of the model
   (d) specify the relations between the parameters and the variables
   (e) use observations to quantify the parameters
2. Experiment with the model to validate it
   (a) is the model consistent?
   (b) are the results plausible?
3. Use the model to analyze or predict a real process

Simulation models are representations of real objects aimed at obtaining insight into the behavior of the objects. The structure of the model must correspond to the structure of the object. However, not all details of the object must be represented in the model.
Models are classified as
→ Static ↔ dynamic (dynamic: e.g. neural networks)
→ Deterministic ↔ stochastic (deterministic: e.g. molecular dynamics simulation)
→ Discrete ↔ continuous (continuous: e.g. idem)

Simulation is a method of investigation which involves designing a model of a real system and conducting experiments with this model for the purpose of either understanding the behavior of the system or evaluating various strategies for the operation of the system. Simulation is useful if
→ the real experiments are impossible (e.g. considered objects too small, ...)
→ the real experiments are economically impractical (e.g. construction of alternative production facilities)
→ the real experiments are unethical (e.g. epidemics or radiation damage)
→ the details of the experiments are unobservable (e.g. atomic motions in liquids)

Simulation experiments include a variety of objectives, such as
→ Comparison of different models of a system
→ Prediction of properties of a system
→ Sensitivity analysis of the factors determining the output of a system
→ Optimization of specific output variables of a system
→ Insights into the details of operations of a system

[51] A good model should account for all available observations in its claimed domain of validity and be as simple as possible for this purpose; a model can be invalidated by counter-examples, but never validated.

8 Representation of Chemical Structures

Standard structural diagrams are not suited for handling by a computer.
Representations have to meet several requirements such as
→ Uniqueness: one representation per compound
→ Unambiguity: one compound per representation
  (→ A representation that is both unique and unambiguous is called canonical)
→ Completeness: representation should describe the entire structure diagram
→ Conciseness: representation should require minimal storage space

Types of molecular representations suited for computer manipulation include
• Unambiguous representations
  → Topological representations (WLN, IUPAC, CAS)
  → Geometrical representation (CSD, PDB)
• Ambiguous representations
  → Fragment codes

Common line notations are
→ WLN was invented by the chemist William J. Wiswesser in the 1940s and is not so much used anymore (despite its compact and intuitive form)
→ SMILES was developed in the 1980s and is still commonly used
→ InChI was initially developed by IUPAC and NIST in ∼2000; current version from 2011

Furthermore, there are different connection table variants in order to represent chemical structures
• Simple connection table representation
  → Procedure
    1. Define atom types: as letters C, O, ... (omit hydrogen) or as numbers 1, 2, ...
    2. Define bond types: single (1), double (2), triple (3)
    3. Choose atom sequence numbers
    4. Make connection table containing connections
  → Drawbacks: redundant information, not unique and not very concise
• Compact connection table representation
  → Same procedure, but
    1. choose sequence numbers such that the neighbours j of an atom i have consecutive numbers when j > i
    2. list only connections to atoms with lower sequence numbers in the connection table
    3. in cyclic structures, add an extra line to indicate ring closure
  → Drawbacks: not unique (arbitrary numbering) and still not concise
• Matrix-based connection table representation
  → Use three matrices
    1. Atom type matrix
    2. Atom connectivity matrix
    3. Bond type matrix
  → Drawbacks: not unique (arbitrary numbering) and even less concise
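To make the connection table variants above more concrete, the following sketch stores a simple connection table for the heavy atoms of acetic acid (CH3-COOH) and derives the bond type and connectivity matrices from it. Everything in it is an assumption made for illustration: the type names Bond and Molecule, the helper functions, the choice of molecule, and the 0-based atom numbering (the text above uses sequence numbers without fixing a data layout).

```cpp
#include <cassert>
#include <string>
#include <vector>

// Hypothetical types for this sketch only.
struct Bond { int i, j, order; };    // 0-based atom indices; order 1, 2 or 3

struct Molecule {
    std::vector<std::string> atoms;  // atom types, hydrogens omitted
    std::vector<Bond> bonds;         // simple connection table
};

// Bond type matrix: entry (i,j) holds the bond order, 0 if i and j are not bonded.
std::vector<std::vector<int>> bond_type_matrix(const Molecule& m) {
    const std::size_t n = m.atoms.size();
    std::vector<std::vector<int>> B(n, std::vector<int>(n, 0));
    for (const Bond& b : m.bonds) {
        B[b.i][b.j] = b.order;       // the matrix is symmetric, which is why
        B[b.j][b.i] = b.order;       // this form is redundant and not concise
    }
    return B;
}

// Atom connectivity matrix: 1 if two atoms are bonded, 0 otherwise.
std::vector<std::vector<int>> connectivity_matrix(const Molecule& m) {
    std::vector<std::vector<int>> C = bond_type_matrix(m);
    for (auto& row : C)
        for (int& x : row)
            x = (x > 0) ? 1 : 0;
    return C;
}

// Acetic acid: atom 0 = methyl C, 1 = carboxyl C, 2 = carbonyl O, 3 = hydroxyl O.
Molecule acetic_acid() {
    return Molecule{{"C", "C", "O", "O"}, {{0, 1, 1}, {1, 2, 2}, {1, 3, 1}}};
}
```

Renumbering the four atoms changes every matrix while still describing the same molecule, which illustrates why none of these tables is unique.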
None of the preceding connection tables was unique:
→ In the simple connection table, N atoms permit N! numbering schemes.
→ In the compact connection table, the number of possibilities is reduced by sequential numbering; however, many possibilities remain.

A scheme that has been made unique is called a canonical representation. Canonicalization is an essential component of line representations (WLN, SMILES, InChI). A possible canonicalization procedure for connection tables is Morgan's algorithm (1965).
• Morgan's algorithm
  → Procedure
    1. Calculate stage 1 connectivity values for each atom (number of attached non-hydrogen atoms)
    2. Calculate stage 2 connectivity values for each atom (sum stage 1 values over attached atoms)
    3. Continue as long as the number of different connectivity classes [52] increases
    4. Generate compact connection tables based on all possible numberings
    5. Select one representation based on atom and bond types + neighbours
       → Atomic numbers: low → high
       → Bond order: low → high
       → Lowest-neighbour rank: low → high

The algorithm therefore seeks the most deeply embedded atom within the structure and gives it number 1, afterwards following a sequential numbering rule that resolves multiple choices based on both the extended connectivities and a predefined ordering of the atom types, bond types and lowest-neighbour ranks.

[52] Connectivity class: connectivity value occurring at least once in the structure

9 Molecular Simulation