Informatik I - HS 17
(Summary) for Chemists
Contents
1 UNIX
2 Data Representation
3 Data Processing
4 C++ Mastery
   4.1 Basic Syntax
   4.2 Types of Statements
   4.3 Functions
   4.4 Input and Output
   4.5 Arrays
   4.6 Types of Errors
5 Algorithmics
   5.1 Algorithms for Sorting
   5.2 Algorithms for Searching
   5.3 Numerical Integration
   5.4 Algorithmic Strategies
6 Computer Architecture
7 Computer Simulation
8 Representation of Chemical Structures
9 Molecular Simulation
Compiled by Eduard Meier (meiered@student.ethz.ch)
1 UNIX
UNIX commands are of the form:
command [-options] [object1] [object2] ...
A file or directory is referenced by
→ an absolute path name (e.g. /aa/dd/ff),
→ a relative path name (e.g. dd/ff¹).
The top directory of a system is called the root directory (/); the highest directory for the user is the
home directory (~). The parent directory (..) is one up from the current directory (.), while a child
directory is one down (dir_name).
Wildcards can be used to execute a command pertaining to multiple files:
→ Any single character: ?
→ Any string of characters: *
→ Any character from list: [ABC]
→ Any character in range: [A-Z]
Command                   Function
echo                      echo arguments
pwd                       return working directory name
cd [dir]                  change working directory [to directory dir]
cd                        change to home directory
cd ..                     change to parent directory
cd ~username              change to home directory of user username
ls                        list names of all files in current directory
ls [file ...]             list only named files
ls -a                     list all files including hidden files
ls -t                     list in time order, most recent first
ls -l                     list files in long format
ls -r                     list in reverse order (may be combined with -t)
mkdir dir                 make directory dir
rmdir dir                 remove directory dir
cp source dest            copy source to dest
mv source dest            move source to dest
rm [file ...]             remove named files
cat [file ...]            print contents of named files to standard output
more [file ...]           print contents of files one page at a time
wc [file ...]             count lines, words and characters for each file
grep pattern [file ...]   search for a pattern in a file
sort [file ...]           sort the files alphabetically by line
1 If the current directory is /aa.
Command                   Function
head [-n] file            print the first n lines of a file
tail [-n] file            print the last n lines of a file
cmp file1 file2           print location of first difference between file1 and file2
diff file1 file2          print all differences between file1 and file2
awk pattern [file ...]    scan files for pattern, with the ability to remove, add and modify the text
man command               display help (man page) on command
chmod ugoa² ± rwx³ file   change the permissions of a file
chmod ugo⁴ file           set the permissions of a file
ssh host                  connect with the machine named host
ssh user@host             connect as user with the machine named host
finger (@host)            check who is logged into your machine (or into the machine named host)
Mail                      check if you got mail and read it
Mail user                 send an email to user
cmd > file                redirect the output to overwrite file
cmd >> file               redirect the output to append to file
cmd >& file               redirect output + errors, overwriting file
cmd >>& file              redirect output + errors, appending to file
cmd < file                redirect input from file
cmd1 | cmd2               pipe: output of cmd1 is connected to input of cmd2
CTRL-C                    aborts the program running in the shell
CTRL-Z                    suspends the program running in the shell
bg                        resumes a suspended program in the background
fg                        brings a suspended or background program to the foreground
program-name &            starts the program in the background
                          → shell free to type other commands

2 Data Representation
Computers are finite and discrete: they can only handle material that can be mapped to a finite sequence
of zeroes and ones. The basic information unit in the digital world is the bit (binary digit),
which represents an information element with two states. Bits are grouped into words of various
lengths. The storage capacity $S_n$ (number of distinct values) of a word of length $n$ ($n \geq 1$) is given by
$$S_n = 2^n$$
The most basic word length is the byte, which contains 8 bits. Current operating systems rely on either
32- or 64-bit word lengths. Often, byte multiples are defined in powers of 2 grouped by approximate
powers of 10, since $2^{10} \approx 1000$ (e.g. 1 kilobyte = $2^{10}$ bytes, 1 megabyte = $2^{20}$ bytes, etc.).
2 user (u), group (g), other (o), all (a)
3 grant permission (+), retract permission (-), read (r), write (w), execute (x)
4 Three-digit octal string setting the permissions (user, group, others). Start from 0. Add 4 for read, 2 for write, 1 for execute.
A number system is defined by a chosen base b and a set of b digit symbols (characters), one of them
being zero. An arbitrary positive integer number n is then written as a collection of p digits
$$n = [d_{p-1} d_{p-2} \cdots d_1 d_0]_b = \sum_{m=0}^{p-1} d_m b^m$$
Our usual decimal system relies on a base b = 10 and the digit symbols {0,1,2,3,4,5,6,7,8,9}
$$1970 = [1970]_{10} = 0 \cdot 10^0 + 7 \cdot 10^1 + 9 \cdot 10^2 + 1 \cdot 10^3$$
The binary system relies on a base b = 2 and the associated digits {0,1}
$$[1970]_{10} = [11110110010]_2 = 0 \cdot 2^0 + 1 \cdot 2^1 + 0 \cdot 2^2 + 0 \cdot 2^3 + 1 \cdot 2^4 + 1 \cdot 2^5 + 0 \cdot 2^6 + 1 \cdot 2^7 + 1 \cdot 2^8 + 1 \cdot 2^9 + 1 \cdot 2^{10}$$
→ Binary to decimal conversion can be done by summing up each digit multiplied by the associated
power of 2.
→ For decimal to binary conversion, it is convenient to start at the largest power of 2 that can be
subtracted, to subtract it, and to continue until 0 is reached.
The largest number that can be written with p binary digits is
$$[111111\ldots111111]_2 = 2^p - 1 \approx 2^p$$
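The two conversion rules above can be sketched in C++ as follows. This is only an illustrative sketch: the function names to_binary and from_binary are hypothetical (they are not part of the lecture material), and only non-negative integers are handled.

```cpp
#include <string>

// Decimal -> binary: subtract the largest power of 2 that fits,
// then continue with the next smaller powers until 0 is reached.
std::string to_binary(unsigned int n) {
    if (n == 0) return "0";
    // Find the largest power of 2 that does not exceed n.
    unsigned int power = 1;
    while (power <= n / 2) power *= 2;
    std::string digits;
    while (power > 0) {
        if (n >= power) { digits += '1'; n -= power; }  // power can be subtracted
        else            { digits += '0'; }              // power does not fit
        power /= 2;
    }
    return digits;
}

// Binary -> decimal: sum each digit multiplied by its power of 2,
// starting with the rightmost digit (power 2^0).
unsigned int from_binary(const std::string& digits) {
    unsigned int value = 0;
    unsigned int power = 1;
    for (std::string::size_type k = digits.size(); k > 0; --k) {
        if (digits[k - 1] == '1') value += power;
        power *= 2;
    }
    return value;
}
```

With these definitions, to_binary(1970) gives the digit string used in the example above, and from_binary reverses the conversion.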
The octal system relies on a base b = 8 and the digit symbols {0,1,2,3,4,5,6,7} - this corresponds to a 3-bit
code. The hexadecimal system relies on a base b = 16 and the digit symbols {0,1,2,3,4,5,6,7,8,9,A,B,C,D,E,F}
- this corresponds to a 4-bit code
$$1970 = [11110110010]_2 = [011\;110\;110\;010]_2 = [3\;6\;6\;2]_8 = [3662]_8$$
$$1970 = [11110110010]_2 = [0111\;1011\;0010]_2 = [7\;\mathrm{B}\;2]_{16} = [7\mathrm{B}2]_{16}$$
Real numbers can be represented with fixed-point representation⁵
$$r = [d_{p-1} d_{p-2} \cdots d_1 d_0 \,.\, d_{-1} d_{-2} \cdots d_{-q+1} d_{-q}]_b = \sum_{m=-q}^{p-1} d_m b^m$$
$$[.111111\ldots111111]_2 = 1 - 2^{-q} \approx 1 \quad \text{for large } q$$
or with floating-point representation, where an unsigned digit string (mantissa) and the position of
the fractional point (exponent) are stored separately.
→ The significand is the mantissa interpreted as a number starting right after the decimal point ([0.M]);
→ The sign of the significand must be stored separately;
→ The exponent may be positive or negative.
$$\text{number} = \text{sign} \times \text{significand} \times \text{base}^{\text{exponent}}$$
In the binary system (with base 2), one generally adopts the so-called IEEE 754 standard for the
floating-point representation
1 bit      ← $N_E$ bits →        ← $N_M$ bits →
Sign: S    Biased exponent: E    Mantissa: M
Maximal value of E: $E_{max} = 2^{N_E} - 1$
Median value of E: $E_0 = 2^{N_E - 1} - 1$
5 Numbers that can be represented exactly as decimal fixed-point may or may not be representable exactly as binary
fixed-point. In the latter case, they have a periodic fractional part in the binary representation (e.g. $[0.25]_{10} = [0.01]_2$ whereas
$[0.2]_{10} = [0.\overline{0011}]_2$).
For most values of E, this representation is normalized, and obeys the equation
$$\text{number} = (-1)^S \times [1.M]_2 \times 2^{E - E_0} \quad \text{for } 0 < E < E_{max}$$
Because it is normalized, the significand starts with a 1 (instead of a 0)⁶.
The special values E = 0 and E = $E_{max}$ are reserved for
→ denormalized numbers if E = 0:
$$\text{number} = (-1)^S \times [0.M]_2 \times 2^{1 - E_0} \quad \text{for } E = 0$$
→ special numbers if E = $E_{max}$:
$$\text{number} = \begin{cases} -\infty & \text{if } S = 1,\ M = 0 \\ +\infty & \text{if } S = 0,\ M = 0 \\ \text{NaN} & \text{if } M \neq 0 \end{cases} \quad \text{for } E = E_{max}$$
In the normalized representation, the smallest and largest representable numbers are of comparable
logarithmic magnitudes. However, zero is not exactly representable. That is why E = 0 is reserved for
denormalized numbers, with which zero is explicitly representable (M = 0). Furthermore, they make it
possible to represent numbers that are smaller than the smallest normalized number⁷ $2^{1 - E_0}$, because $[0.M]_2$
ranges between zero and one.
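As an illustration of the layout described above, the following sketch splits a C++ float into its fields S, E and M. It assumes that float is a 32-bit IEEE 754 single-precision number (so $N_E = 8$, $N_M = 23$, $E_{max} = 255$, $E_0 = 127$), which is checked at compile time; the names FloatFields and decode are hypothetical.

```cpp
#include <cstdint>
#include <cstring>
#include <limits>

// Assumption: float follows IEEE 754 single precision (1 + 8 + 23 bits).
static_assert(std::numeric_limits<float>::is_iec559 && sizeof(float) == 4,
              "this sketch assumes a 32-bit IEEE 754 float");

struct FloatFields {
    std::uint32_t sign;      // S: 1 bit
    std::uint32_t exponent;  // E: biased exponent, 8 bits
    std::uint32_t mantissa;  // M: 23 bits (the leading 1 is not stored)
};

FloatFields decode(float x) {
    std::uint32_t bits;
    std::memcpy(&bits, &x, sizeof bits);  // copy the raw bit pattern
    FloatFields f;
    f.sign     = bits >> 31;              // highest bit
    f.exponent = (bits >> 23) & 0xFFu;    // next 8 bits
    f.mantissa = bits & 0x7FFFFFu;        // lowest 23 bits
    return f;
}
```

For example, 1.0 is stored as S = 0, E = 127 = $E_0$, M = 0, since $1.0 = (-1)^0 \times [1.0]_2 \times 2^{127 - 127}$; zero uses the reserved value E = 0 and infinity the reserved value E = 255.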
Loss of significance corresponds to errors caused by operations on numbers that are too small or too
large relative to each other, so that digits of the smaller number are lost. E.g. addition in four-digit decimal arithmetic:
$$0.3721 \cdot 10^{2} + 0.3046 \cdot 10^{-1} = 0.3721 \cdot 10^{2} + 0.0003046 \cdot 10^{2} = 0.3724 \cdot 10^{2} \text{ (rounded)}$$
An overflow corresponds to a result (e.g. x + y) being too large and an underflow to a result (e.g. x × y)
being too small in magnitude for a word.
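The same effects can be observed with the C++ type float. The sketch below is only an illustration under the assumption that float is an IEEE 754 single-precision number (about 7 significant decimal digits); the function names are hypothetical.

```cpp
#include <cmath>

// Loss of significance: true if adding `small` leaves `big` unchanged.
bool is_absorbed(float big, float small) {
    volatile float sum = big + small;  // volatile forces rounding to float
    return sum == big;
}

// Overflow: true if x + y is too large for a float (result is infinity).
bool overflows_on_add(float x, float y) {
    volatile float sum = x + y;
    float result = sum;
    return std::isinf(result);
}

// Underflow: true if x * y is too small for a float (result is zero
// although neither factor is zero).
bool underflows_on_multiply(float x, float y) {
    volatile float product = x * y;
    return x != 0.0f && y != 0.0f && product == 0.0f;
}
```

For instance, is_absorbed(1.0e8f, 1.0f) is true because neighbouring floats near $10^8$ are 8 apart, so adding 1 cannot change the stored value.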
3 Data Processing
The elementary building block of a processor is a device that can
→ enable (E-type) or disable (D-type) the flow of current when an input on-off signal is applied⁸;
→ convert flow information into an output on-off signal of the same nature as the input signal⁹.
In modern chips, the elementary building block is called a MOSFET (metal-oxide-semiconductor
field-effect transistor) which
→ relies on the properties of semiconductors;
→ either goes from insulator to conductor under an applied electric field (E-type);
→ or goes from conductor to insulator under an applied electric field (D-type);
→ can be used to build three basic types of logic gates: NOT, AND and OR gates;
→ has a size of ∼ 1 µm¹⁰;
→ relies on the voltage of the signals (CPU core voltage), which is typically ∼ 2 V;
→ involves a current intensity in the order of $10^{-5}$ A;
→ has, as a chip (which involves many MOSFETs), a power requirement of about 10 W;
→ has a recovery time¹¹ of typically ∼ 1 ns, which determines the clock rate¹².
Increasing the clock rate by increasing CPU core voltage is called overclocking. It leads to stronger
currents and faster recovery, but also to increased power requirement and heating.
6 This implicit leading 1 is not stored, since it is always 1!
7 Actually, the smallest positive denormalized number is equal to the smallest possible difference between two normalized
numbers. The largest positive denormalized number lies just below the smallest positive normalized number.
8 No signal means that the flow is disabled (E-type) or enabled (D-type). E is short for enhancement, D for depletion mode.
9 In order to use it as input for subsequent building blocks.
10 One can typically pack $0.5 \cdot 10^6$ MOSFETs on a chip.
11 The minimal time needed between two successive operations.
12 The frequency at which successive input batches can be processed (number of clock cycles per second), namely ∼ 1 GHz.
4 C++ Mastery
A program must first be translated from C++ (higher language) into assembler (machine language) using
a compiler. Text typed as /*...*/ or as //... (until end of line) is ignored by the compiler: these
are the comments of the programmer. A program usually includes preprocessor instructions¹³ such
as #include <iostream>¹⁴ prior to the functions. The subsequent line using namespace std; makes the
names of the standard namespace std (which exists to avoid name clashes) available without the prefix
std::. Furthermore, every C++ program contains one or more functions, one of which is called main:
this is the one that is entered when the program is called.
4.1 Basic Syntax
The basic text unit in a C++ program is termed an identifier. It can be any sequence of the characters
→ Lower-case letters a...z
→ Upper-case letters A...Z
→ Digits 0...9
→ Underscore _
that does not start with a digit.
However, some identifiers are reserved to indicate specific instructions in C++ and are termed keywords.
Hence, it is impossible to give them a meaning other than the C++ one (i.e. they cannot be used as
variable names).
13 Preprocessing is an additional step applied prior to compilation.
14 This specific preprocessor instruction includes the text content of the standard header file iostream at this location of the
program, which contains the declarations of the standard functions for stream input-output (IO).
The following basic types (to characterize variables) are available
Type               Size in bytes  Explicit constants
bool               (1/8)          boolean constant with values "false" or "true"
char               1              character constant - single character enclosed by single quotes
short (short int)  2              integer constant - optional "+" or "-" followed
                                  by arbitrary sequence of digits → Integer numbers
int                4              integer constant
long               8              integer constant
long long          ≥8             integer constant
float              4              floating-point constant - optional "+" or "-" followed
                                  by a sequence of digits containing a period "."
                                  or the letter "e" or "E" → Real numbers
double             8              floating-point constant
long double        16             floating-point constant
Variables must be explicitly declared along with the specification of their type. Once a variable has
been declared, it can be assigned a value using the assignment operator "=". The variable is now
initialized - the simplest form of initialization involves an explicit constant.
            declaration    assignment
boolean     bool bla;      bla = true;
character   char bla;      bla = 'A';
string      char bla[5];   strcpy(bla, "four");
integer     int bla;       bla = 1;
double      double bla;    bla = -1.2E-3;
A character array cannot be assigned with "=" after its declaration (bla = "four"; is not valid C++);
the characters are copied with strcpy (from <cstring>), or the array is initialized when it is
declared: char bla[5] = "four";
Or in compact notation:
→ Declare and assign at once: int bla = 1;
→ Declare multiple variables: int bla,bla2;
→ Combined: int bla1 = 1,bla2 = 2;
An operator is represented by a special symbol, and performs a specific action:
→ A unary operator has one operand,
   → a unary prefix operator has its single operand at its right;
   → a unary suffix operator has its single operand at its left.
→ A binary operator has two operands, one at its left and one at its right.
An operation always returns a result, which may be used as operand by further operators - otherwise,
the result is discarded. An operand may be an explicit constant, a variable, the return value of a
function, or the result of another operation. The resolution (evaluation) of an expression follows strict
rules that depend on the relative precedences (priorities) of the operators and on the associativities
(left-to-right or right-to-left), and can be modulated by the use of parentheses.
A valid combination of operators and operands is called an expression. An expression followed by a
semicolon is a valid instance of a statement.
Operators act on their immediate operands on the left or/and right in order of precedence, from
level 1 (highest, applied first) to level 18 (lowest, applied last). At equal precedence, the order of
evaluation is determined by the operator associativity, either left-to-right or right-to-left. This can be
modulated by parentheses; any expression between parentheses is evaluated as a separate entity
before being inserted in the calculation.
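A few small cases may make the rules of precedence and associativity more concrete. The function names below are hypothetical and only serve to label the individual expressions; the precedence levels in the comments refer to the operator table of this section.

```cpp
// Multiplication (level 5) binds more tightly than addition (level 6).
int without_parentheses() { return 2 + 3 * 4; }     // evaluated as 2 + (3 * 4)
int with_parentheses()    { return (2 + 3) * 4; }   // parentheses are evaluated first

// Subtraction is left-to-right associative: (20 - 5) - 3.
int left_to_right() { return 20 - 5 - 3; }

// Assignment is right-to-left associative: a = (b = 7).
int right_to_left() {
    int a, b;
    a = b = 7;
    return a + b;
}

// Arithmetic (levels 5-6) comes before comparison (levels 8-9),
// which comes before logical AND (level 13): ((1 + 2) < 4) && ((5 % 3) == 2).
bool mixed() { return 1 + 2 < 4 && 5 % 3 == 2; }
```

When in doubt, adding parentheses costs nothing and documents the intended order of evaluation.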
Precedence  Operator          Description                        Associativity
1           ::                Scope resolution                   None
2           ++                Suffix increment                   Left-to-right
            --                Suffix decrement
            ()                Function call
            []                Array subscripting
            .                 Element selection by reference
            ->                Element selection through pointer
            typeid()          Run-time type information
            const_cast        Type cast
            dynamic_cast      Type cast
            reinterpret_cast  Type cast
            static_cast       Type cast
3           ++                Prefix increment                   Right-to-left
            --                Prefix decrement
            +                 Unary plus
            -                 Unary minus
            !                 Logical NOT
            ~                 Bitwise NOT (One's Complement)
            (type)            Type cast
            *                 Indirection (dereference)
            &                 Address-of
            sizeof            Size-of
            new, new[]        Dynamic memory allocation
            delete, delete[]  Dynamic memory deallocation
4           .*                Pointer to member                  Left-to-right
            ->*               Pointer to member
5           *                 Multiplication                     Left-to-right
            /                 Division
            %                 Modulo (remainder)
6           +                 Addition                           Left-to-right
            -                 Subtraction
7           <<                Bitwise left shift                 Left-to-right
            >>                Bitwise right shift
8           <                 Less than                          Left-to-right
            <=                Less than or equal to
            >                 Greater than
            >=                Greater than or equal to
9           ==                Equal to                           Left-to-right
            !=                Not equal to
Precedence  Operator          Description                        Associativity
10          &                 Bitwise AND                        Left-to-right
11          ^                 Bitwise XOR (exclusive or)         Left-to-right
12          |                 Bitwise OR (inclusive or)          Left-to-right
13          &&                Logical AND                        Left-to-right
14          ||                Logical OR                         Left-to-right
15          ?:                Ternary conditional                Right-to-left
16          =                 Direct assignment                  Right-to-left
            +=                Assignment by sum
            -=                Assignment by difference
            *=                Assignment by product
            /=                Assignment by quotient
            %=                Assignment by remainder
            <<=               Assignment by bitwise left shift
            >>=               Assignment by bitwise right shift
            &=                Assignment by bitwise AND
            ^=                Assignment by bitwise XOR
            |=                Assignment by bitwise OR
17          throw             Throw operator                     Right-to-left
18          ,                 Comma                              Left-to-right
An arithmetic operator is a binary operator
+   → addition (sum)
-   → subtraction (difference)
*   → multiplication (product)
/   → division (quotient)
%   → remainder of the integer division (modulo)
and combines the two expressions appropriately into its result.
An assignment operator is a binary operator
=    → set variable to value (assignment)
+=   → increase variable by value
-=   → decrease variable by value
*=   → multiply variable by value
/=   → divide variable by value
%=   → replace variable by its remainder modulo value
and alters the variable on its left according to the expression on its right, and returns the value as result.
An increment/decrement operator is a unary operator
++   → increase variable by 1
--   → decrease variable by 1
and can either be used as a prefix or a suffix. In the former case, the variable on its right is modified,
and its updated value returned as result, while in the latter, the variable on its left is modified and its
initial value returned as result.
More than one assignment/increment/decrement (a/i/d) of the same variable within one expression, or an
a/i/d combined with another reading of the variable value, must be avoided, since the outcome is not
reliably specified by the standard and may be computer-dependent! A short selection of no-go's:
i = i++;
i += 1 + i--;
j = i + i++;
j = ++i + i++;
a[i] = i++;
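The difference between the prefix and the suffix form, used safely (one modification per statement), can be sketched as follows. The struct IncrementResult and the two function names are hypothetical helpers introduced only for this illustration.

```cpp
// Holds the result returned by the operator and the variable value afterwards.
struct IncrementResult {
    int result;    // value returned by the increment operation
    int variable;  // value of the variable after the operation
};

// Prefix: the variable is modified first, the updated value is the result.
IncrementResult prefix_increment(int i) {
    int r = ++i;
    return {r, i};
}

// Suffix: the initial value is the result, the variable is modified as well.
IncrementResult suffix_increment(int i) {
    int r = i++;
    return {r, i};
}
```

Starting from i = 5, both forms leave i at 6; the prefix form returns 6 as result, the suffix form returns 5.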
A comparison operator is a binary operator
<    → less than
<=   → less than or equal
>    → greater than
>=   → greater than or equal
==   → equal to
!=   → not equal to
and compares the expressions on its left and right, and results in a boolean.
A logical operator is a unary (not) or binary (and, or) operator
!    → logical not
&&   → logical and
||   → logical or
and associates the expressions on its left and right (for not: at the right), and results in a boolean; the
associated expressions are also boolean (or converted to this type). Boolean values are false or true;
but it is also common to use integer values 0 or 1, which are automatically converted (implicit cast)
from/to false and true.
The short-cut property of the logical and and logical or operators means that when the left operand of &&
is false, the right operand is not evaluated and the result is automatically false. Likewise, in the case
of ||, if the left operand is true, the right operand is not evaluated and the result is automatically true.
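The short-cut property is often used to protect an operation that would otherwise fail. The following sketch uses two hypothetical helper functions; in both, the right operand contains a division that must not be executed when the divisor d is zero.

```cpp
// && : the division n / d is only evaluated if d != 0 is true.
bool ratio_exceeds(int n, int d, int limit) {
    return d != 0 && n / d > limit;
}

// || : the remainder n % d is only evaluated if d == 0 is false.
bool is_zero_or_divides(int d, int n) {
    return d == 0 || n % d == 0;
}
```

Swapping the operands (e.g. n / d > limit && d != 0) would evaluate the division first and defeat the protection.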
A comma operator is a binary operator
,    → separate two expressions
and evaluates the two expressions in turn from left to right and the result is the value of the expression
on the right.
A cast operator is a unary operator
(bool)     → convert to boolean
(int)      → convert to integer
(double)   → convert to double
(char)     → convert to character
etc...
and converts the expression on its right to the indicated type. When the cast represents an upgrade, the
precision of the quantity is unaffected. If it represents a downgrade, the quantity is truncated.
If an operator requires a certain data type and no prior conversion was conducted, the compiler decides
automatically on the casts that are needed. Downgrading of a real to an integer means truncation and
therefore a loss of precision. Conversion to a boolean gives false if zero and true if non-zero.
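The effect of explicit and automatic casts can be checked with a few small functions. The names below are hypothetical and were chosen only for this sketch.

```cpp
// Downgrade: the fractional part is cut off (truncation towards zero).
int truncate_to_int(double x) { return (int)x; }

// Conversion to boolean: false if zero, true if non-zero.
bool to_bool(double x) { return (bool)x; }

// Both operands are integers: integer division is performed first,
// and only its result is automatically converted to double.
double integer_division(int a, int b) { return a / b; }

// Casting one operand to double first makes the compiler convert the
// other operand as well, so a real division is performed.
double real_division(int a, int b) { return (double)a / b; }
```

Note that truncate_to_int does not round: 3.99 becomes 3 and -3.99 becomes -3.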
4.2 Types of Statements
An algorithm is a succession of steps, translated in C++ into a succession of statements.
In C++, there are six basic types of statements
→ null statement
→ declaration statement
→ expression statement
→ compound statements
→ conditional statements
→ iterative statements
The null statement is a simple semicolon and does nothing.
A declaration statement declares the existence and type of a variable
→ Syntax: type variable_name;
→ Example: int x;
An expression statement evaluates an expression
→ Syntax: expression;
→ Example: j = i++;
A compound statement is a sequence of statements to be treated by the compiler as a single statement
→ Syntax¹⁵: { statement1 statement2 ... statementN }
→ Example: { int x,y; x = 5; y = x + 5; }
A conditional statement is a statement that is only executed if some specified condition is satisfied
→ The condition is a boolean expression that can evaluate to true or false.
→ Syntax: if (expression) statement or if (expression) statement1 else statement2
→ Example: if ( x > 0 ) y = x; else y = -x;
For testing multiple cases, the switch statement represents an alternative to multiple if-else statements
ok == true
if (c == '+') {
    cout << "set result to a+b";
    r = a+b;
} else if (c == '-') {
    cout << "set result to a-b";
    r = a-b;
} ...
ok == true
switch (c) {
    case '+':
        cout << "set result to a+b";
        r = a+b;
        break;
    case '-':
        cout << "set result to a-b";
        r = a-b;
        break;
    ...
}
Furthermore, there is an alternative to if-else statements to include tests within expressions (the
ternary conditional operator). The statement
if ( a >= 0 )
    a_abs = a;
else
    a_abs = -a;
is equivalent to
a_abs = (a >= 0) ? a : -a;
An iterative statement is a statement that is executed multiple times as long as some continuation
condition holds or until some termination condition is met
→ The condition is a boolean expression that can evaluate to true or false.
→ Syntax: while (expression) statement or do statement while (expression);
→ Variant: for (statement1 expression1; expression2) statement
   equivalent: statement1 while (expression1) { statement expression2; }
→ Example: while ( i != 0 ) { f = f + i; i--; }
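The equivalence between the for variant and the while form can be verified on the example above, which sums the integers from i down to 1. The function names are hypothetical, and the sketch assumes n ≥ 0 (for a negative n the condition i != 0 would not stop the loop as intended).

```cpp
// for (statement1 expression1; expression2) statement
int sum_for(int n) {
    int f = 0;
    for (int i = n; i != 0; i--) {
        f = f + i;
    }
    return f;
}

// statement1 while (expression1) { statement expression2; }
int sum_while(int n) {
    int f = 0;
    int i = n;
    while (i != 0) {
        f = f + i;
        i--;
    }
    return f;
}

// do statement while (expression); the body is executed at least once,
// because the condition is only tested after each pass.
int sum_do_while(int n) {
    int f = 0;
    int i = n;
    do {
        f = f + i;
        i--;
    } while (i > 0);
    return f;
}
```

All three functions return the same sum; they differ only in where the condition is tested.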
15 The statements already include the semicolons.
4.3 Functions
Although it would be possible to write every C++ program in a linear fashion, it often turns out to be more
elegant and lucid to group logical units into modules, which take the form of functions. The C++
program itself is written in a function called main. Additional functions with different names can be
defined next to main and called from within other functions.
Although some functions do not "communicate" (they are called, do something and return),
the calling piece of code usually needs to provide information to the function or/and receive information
from the function. There are four ways a function can communicate with the program
→ Argument by value
→ Result (return value)
→ Argument by reference
→ Global variable
The simplest way a function receives information from the calling function is by its arguments, as passed
by value. The called function creates its own variable of the given type, which is only valid inside the function.
The variable is initialized to the corresponding value listed in the call. Variable names can therefore
differ between the calling and the receiving functions without any consequence. Consequently, the same
variable name can stand for distinct variables in different functions
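void average_function(int six, double x);  // prototype: declared before the call

int main() {
    // ...
    int five = 5;
    average_function(five, 10.0);  // six is initialized to 5, x to 10.0
    // ...
    return 0;
}

void average_function(int six, double x) {
    // six and x are local copies, only valid inside this function
    // ...
    return;
}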
The simplest way a function returns information to the calling function is by its result. A function can
at most have one result¹⁶. Functions of type void have no return value at all. A function may contain
multiple return statements at various locations - however, the first return statement encountered ends
the function execution.
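#include <iostream>
using namespace std;

double abs_val(double x);  // prototype: declared before the call

int main() {
    double a, y = -1.0;
    a = abs_val(y);                  // y itself stays -1.0 (passed by value)
    cout << y << " " << a << endl;
    return 0;
}

double abs_val(double x) {
    if ( x < 0.0 ) x = -x;           // modifies only the local copy
    return x;
}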
If a function both receives information from and returns information to the calling program by its arguments,
they must be passed by reference. In this case, when the function is called¹⁷, the argument variable
becomes directly accessible within the function under the indicated name - both are stored at the same
memory location. The information flow now goes in both directions: a change on the variable inside
the function affects the corresponding variable of the calling program. For this reason, the argument in
the calling function must be a variable - a constant could not be altered.
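#include <iostream>
using namespace std;

int max_obsolete(int &n);  // prototype: n is passed by reference

int main() {
    int m, p;
    m = 4;
    p = max_obsolete(m);             // m is changed inside the function
    cout << m << " " << p << endl;
    return 0;
}

int max_obsolete(int &n) {
    n = 5;
    return 4;
}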
In this program, the value of variable ”m” turns to 5 because the argument is passed by reference.
16 Also main has a return value. The integer return value of main goes back to UNIX and can be probed using "echo $?"
immediately after running the command (0 = normal termination). The return value of main can indicate an error in the program,
provided the program is written to return a non-zero value in this case (which can then be tested with a UNIX C-shell script).
17 The definition of the function now includes a "&" prior to the variable name, e.g. void my_function(int &n).
Variables must be declared before being used. The scope of a declaration represents the domain in
which the declaration holds, it is determined by the location of the declaration and may be either
→ local - the variable is declared inside a function and the scope hence limited to the function itself.
→ global - the variable is declared outside any function and its scope the entire file containing the
declaration.
The scope of a global variable may even be extended to multiple files by repeating the declaration in
each of the files preceded by the keyword extern. Both local and global variables may not be declared
twice in their respective scope. However, the name of a global variable can be reused¹⁸ for a local
variable¹⁹ name inside a function - both being separate variables.
Functions - just as variables - must be declared before they are used (called), and these declarations
have a scope. The scope of a declaration may again be either
→ local - the function is declared inside another function and the scope hence limited to this function.
→ global - the function is declared outside any other function and therefore has a scope reaching over
the entire file containing the declaration.
A function may be declared multiple times at different places (even within the scope of a previous
declaration). The declaration must specify the types of its return value and arguments - it is also called
a prototype.
In contrast to variables, functions not only have to be declared but also defined (i.e. one needs to
specify the statements they contain). Functions can only be defined once and must be defined outside all
other functions - at top level. The function definition also counts as a declaration - but a function call
must not precede it. The variables listed between parentheses (with their types) in the definition of the
function are called parameters²⁰.
The function call causes immediate execution of the called function, after which the normal flow of the
calling function is resumed. A function can be called multiple times at different places in a program
but must be preceded by a declaration with a scope reaching the call statement.
Functions and variables interact at several points:
→ The data type void is used to denote an empty parameter set or return value (e.g. void fnc(void)).
→ A name clash can arise when the programmer is using the same name for different objects (e.g.
variable, function) within the program.
→ A function that changes the value of a global variable is said to have a side effect.
→ When the same variable name is used for a global and a local variable, the local variable takes
precedence within the function by default; the global one can still be reached with the
scope-resolution operator "::".
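The interplay of global variables, local variables and the scope-resolution operator can be sketched as follows. The variable name counter and the function names are hypothetical and only serve as illustration.

```cpp
int counter = 100;  // global variable: its scope is the entire file

// The local variable hides the global one of the same name.
int read_local() {
    int counter = 1;
    return counter;
}

// "::counter" refers to the global variable despite the local one.
int local_plus_global() {
    int counter = 1;
    return counter + ::counter;
}

// Side effect: the function changes the value of a global variable.
void increment_global() {
    ::counter++;
}
```

After one call of increment_global(), every function that reads the global counter sees the new value, which is why side effects should be used sparingly.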
4.4 Input and Output
The input-output (IO) of a program concerns the reading of data from files (including the keyboard)
and writing of data to files (including the screen). Since a file is a sequential²¹ object, each data item is
accessed (read or written) separately.
The C++ language contains two alternative sets of functions for performing IO
→ the standard IO library, inherited from the C language
→ the stream IO library specific to the C++ language
There are two main types of IO and corresponding files
→ Unformatted IO/files corresponding to reading/writing binary files (byte sequences)
→ Formatted²² IO/files corresponding to reading/writing plain text (ascii) files (e.g. streamio)
18 Nevertheless, this is not very wise - it leaves a lot of space for mistakes.
19 By default, the local variable will be accessed under the variable name in the function, unless one precedes the variable
name with the scope-resolution operator "::".
20 The variables listed between parentheses (with their types) in the declaration are called dummy parameters and will only
be checked for their types, not their names. Furthermore, the variables listed between parentheses (without their types) in the
call of the function are called arguments - they can be arbitrary expressions or variables.
21 The concept opposite to sequential access is random access.
22 Unlike unformatted IO/files, formatted ones are plain text (human readable), editable by hand, portable across data types
and machines and rather space-consuming.
Features of the C++ stream input-output library include
• From keyboard / to screen
→ Header file: #include <iostream>
→ Input stream: cin for standard input
→ Output stream: cout for standard output & cerr for standard error,
  clog for standard error (buffered)
• From file / to file
→ Header file: #include <fstream>
→ Input stream type: ifstream
→ Output stream type: ofstream
→ Open function: ifstream f_inp("inp.dat"); & ofstream f_out("out.dat");
→ Input function: f_inp >> ... ;
→ Output function: f_out << ... ;
→ Check end of input: f_inp.eof()²³
→ Send end of line: f_out << endl;
→ Close function: f_inp.close(); & f_out.close();
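The file-related features listed above can be combined into a short sketch that writes a few numbers to a text file and reads them back. The function names and the file name passed by the caller are hypothetical. Instead of testing eof() explicitly, the loop uses the input operation itself as condition, which becomes false as soon as no further item can be read.

```cpp
#include <fstream>
#include <string>
using namespace std;

// Writes the squares 1, 4, 9, ..., n*n to a text file, one per line.
void write_squares(const string& filename, int n) {
    ofstream f_out(filename.c_str());      // open for writing
    for (int i = 1; i <= n; i++) {
        f_out << i * i << endl;            // formatted output + end of line
    }
    f_out.close();
}

// Reads all integers from a text file and returns their sum.
// A file that cannot be opened yields 0, since no item can be read.
int sum_file(const string& filename) {
    ifstream f_inp(filename.c_str());      // open for reading
    int sum = 0;
    int value;
    while (f_inp >> value) {               // false once reading fails
        sum += value;
    }
    f_inp.close();
    return sum;
}
```

A call such as write_squares("squares_demo.dat", 4) followed by sum_file("squares_demo.dat") would write four lines and then return their sum.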
4.5 Arrays
An array is a set of variables of the same type accessible through an integer index. The array size
(number of elements in the set) and type of the elements must be specified when the array variable is
declared. In C++, array indexes run from zero to the array size minus one. A program setting all elements
of an array of size 42 to 42 may look like
int main() {
    int wisdom[42], i;
    for (i=0; i<42; i++) {
        wisdom[i] = 42;
    }
    return 0;
}
Arrays with a content that is known at compile time can be initialized using explicit constants. Just
like any other variable, they can be passed to functions as arguments
void my_function(double a[]) {...}
Arrays passed to a function can, despite being passed by value, be modified by the called function,
because it is not the array but rather the pointer to the array start that is passed. It is
therefore possible to alter the pointer inside the called function without consequences outside,
whereas changes to the array elements remain visible to the calling function.
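This behaviour can be illustrated with a small sketch. The function names are hypothetical; since the function only receives a pointer to the array start, the number of elements n is passed as a separate argument.

```cpp
// Modifies the elements of the caller's array through the passed pointer.
void double_elements(double a[], int n) {
    for (int i = 0; i < n; i++) {
        a[i] *= 2.0;
    }
}

// Shifting the pointer only changes the local copy of the pointer,
// but writing through it still changes the caller's array (element 1).
void overwrite_second(double a[]) {
    a = a + 1;     // local change, invisible to the caller
    a[0] = -1.0;   // visible to the caller as element 1
}

// Demonstration: returns the sum of the elements after both calls.
double sum_after_calls() {
    double v[3] = {1.0, 2.0, 3.0};
    double_elements(v, 3);   // v is now {2.0, 4.0, 6.0}
    overwrite_second(v);     // v is now {2.0, -1.0, 6.0}
    return v[0] + v[1] + v[2];
}
```

In sum_after_calls, the array v still starts at the same location after the calls, but its content has been changed by both functions.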
Arrays can have one or multiple dimensions. A one-dimensional array takes the form → a[j].
A two-dimensional array can be regarded as a one-dimensional array of which each element is itself
a one-dimensional array → b[i][j]²⁴. This corresponds to the mathematical convention where the
element $A_{ij}$ of a matrix is the one at row i and column j²⁵. Of course, one can extend this approach to
multi-dimensional arrays → c[i][j][k] etc...
An array can be initialized in a simple way only at the same time as it is defined. The compiler may
also guess one missing dimension, but only that of the slowest (first) index
int a[4] = {4,2,4,2};      or   int a[] = {4,2,4,2};
int a[2][2] = {4,2,4,2};   or   int a[][2] = {4,2,4,2};
which is also applicable to characters
char i[4] = {'I','N','F','O'};   or   char s[] = {'S','U','C','K','S'};
A string is a character array including the termination character '\0' at the end and can be initialized as
char s[] = {'I','N','F','O','\0'};   or   char i[] = {"SUCKS"};
23 A boolean that is true as soon as one tries to read past the last item.
24 An array of i times j arrays...
25 j is the fast index (columns).
A pointer is a variable that can indicate the memory location of an object of a specific type. The object
type must be specified when the pointer variable is declared
int* ptr;
A pointer to a given variable can be generated using the address-of (reference) operator "&"
int my_int;
→ ptr = &my_int;
The value of a variable pointed at by a pointer can be determined using the dereference [26] (or indirection) operator "*". The variable defining an array is actually (just) a pointer to the first element of
the array. The addition / subtraction of an integer to / from a pointer shifts the pointer by as many
elements along the array
int my_array[10]; my_array[5] = 42;
int* ptr = my_array; ptr += 5; cout << *ptr;
→ 42
4.6 Types of Errors
In order to function properly, a C++ program has to be correct in terms of syntax, semantics and
algorithmics. The process of refining a program until it works is called debugging.
• Syntax encompasses the basic rules for assembling the symbols of the language (i.e. its
vocabulary and grammar).
→ Syntactic errors generally cause a compile-time abort - the compiler will stop with an error
message. They are often easy to detect and fix.
→ Examples of syntactic errors
for (i=0,i<42,i++) {...}
or
if i!=1 j=2;
• Semantics encompasses the rules for doing something sensible computerwise (i.e. the meaning
of the operations).
→ Semantic errors usually cause a run-time abort [27]. They are more difficult to fix.
→ Examples of semantic errors
x /= 0
or
for(i=0; i<10; i--)
• Algorithmics encompasses the method for solving the problem at hand (i.e. the mathematical
logic or physical model you employ).
→ Algorithmic errors normally let the program run but return nonsense results because of logical
or modelling errors.
→ Examples of algorithmic errors
area_circle = 2*pi*r
or
volume_gas = P / (n*R*T);
Besides programming errors, a result may still be affected by numerical errors
• Roundoff errors arise because at finite precision, a floating-point real cannot encompass information concerning numbers of widely different magnitudes.
→ To prevent rounding errors, it is advisable to avoid adding or subtracting numbers of widely
different or nearly equal magnitudes, to add numbers in a sequence in increasing order of magnitude
and to carry out subtraction before multiplication or division.
→ Examples of numerical errors
1.000 + 0.0003 → 1.000
or
1.000E0 + 1.000E4 → 1.000E4
• Truncation errors arise mostly when an infinite sum is truncated and approximated by a finite sum.
→ Taylor series expansions lead to truncation errors. They decrease as more terms of the
series are included.
• Instability errors (due to numerical algorithms) include the growth of round-off errors and/or
initially small fluctuations in initial data that might cause a large deviation on the final result.
→ Using recurrence relations to calculate lattice energies is efficient but prone to instability errors.
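The advice to add numbers in increasing order of magnitude can be checked with a small sketch in single precision. The function names are hypothetical; the example assumes IEEE 754 single-precision floats, where neighbouring representable values near 1.0E8 are 8 apart.

```cpp
// Adds 'count' copies of 'small' to 'large', starting with the large value.
// Each individual addition is lost if 'small' is below the float resolution.
float sum_large_first(float large, float small, int count) {
    float s = large;
    for (int i = 0; i < count; ++i) s += small;
    return s;
}

// Accumulates the small values first and adds the large value last.
float sum_small_first(float large, float small, int count) {
    float s = 0.0f;
    for (int i = 0; i < count; ++i) s += small;
    return s + large;
}
```

With large = 1.0e8f, small = 1.0f and count = 100, the first variant returns exactly 1.0e8f because every single addition is rounded away, whereas the second variant returns a value larger than 1.0e8f.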
26 The notation "x[]" is equivalent to "*x" - the notation "x[i]" is equivalent to "*(x+i)".
27 The program will stop during execution, often with a not so clear error message (such as a floating-point exception, a
segmentation fault), or never stop (endless loop).
5 Algorithmics
An algorithm is what precedes programming and does not depend on the programming language in which it
is going to be implemented. It consists of a precise specification of how to solve a (class of) problem(s).
Algorithmics is hence the branch of computer science that deals with algorithms. Different algorithms
to solve the same problem may vary widely in terms of solution quality, as well as computational
efficiency and numerical accuracy. The comparison of the efficiency of different algorithms for the
same problem is called algorithm benchmarking. There are different types of benchmarking
→ Benchmark efficiency at constant accuracy for different algorithms or programs on one CPU type
→ Benchmark accuracy at constant computational time for different algorithms and/or implementations
on one CPU type
→ Benchmark different CPU types for the same program at the same accuracy
One usually differentiates between recursive and iterative algorithms. In contrast to an iterative
algorithm, a recursive one calls itself. Although recursion is often more elegant [28], iteration is often
more efficient [29]. The following shows an iterative and a recursive variant for the calculation
of n!
int fact_ite(int n) {
    int i, f = 1;
    for (i=2; i<=n; i++) f *= i;
    return f;
}
int fact_rec(int n) {
    if (n == 0) return 1;
    return n*fact_rec(n-1);
}
5.1 Algorithms for Sorting
Sorting algorithms come into play when a list (file) of records (items) is provided as input with each
record encompassing a key, for which a well-defined ordering rule is provided. The choice of the
sorting algorithm then depends on
→ the size of the list
→ the required sorting speed
→ whether the list has a pre-sorting order
→ the size of individual records
→ the average-case performance
→ the worst-case performance
→ whether stability [30] is required
→ whether in-place [31] sorting is required
→ whether the key set is small or large
To characterize the efficiency of algorithms, it is often useful to know how to calculate the number P of
ordered/unique pairs of elements in a set of N numbers. It is defined as the number of combinations
(ij) one can make with
i = 1...N j = 1...N and i < j
(1 ≤ i < j ≤ N)
which corresponds to the number of elements in the upper triangle (excluding the diagonal elements)
of an N × N matrix
$\begin{pmatrix}
a_{11} & a_{12} & a_{13} & \cdots & a_{1N} \\
a_{21} & a_{22} & & & \\
a_{31} & & a_{33} & & \\
\vdots & & & \ddots & \\
a_{N1} & & & & a_{NN}
\end{pmatrix}$
28 Shorter code, fewer variables, closer to the math, ...
29 Fewer operations, function calls, ...
30 Whether the initial ordering of records with identical keys has to be preserved.
31 Whether the sorted list is in the same location as the original list and no extra storage is needed during the sorting (in-place).
In total there are N^2 elements including N diagonal elements; hence N^2 − N = N(N − 1) off-diagonal
and P = N(N − 1)/2 upper-triangle elements. This number is equal to
$P = \sum_{i=1}^{N-1} \sum_{j=i+1}^{N} 1$   (5.1)
Selection sort is one of the simplest possible in-place sorting techniques (but it is unstable). The algorithm loops over the array elements except the last (primary index i = 1..N-1).
For each i, the smallest element among those from i onward is found by taking
element i as the initial smallest value, looping over a secondary index j = i + 1..N and updating
the position of the smallest value each time a smaller one is found. After the inner loop, the smallest element found is swapped with element i. This
corresponds to the C++ code
void selection_sort (int N, int a[]) {
int i, j, jm, at;
// note: a[1..N]; a[0] unused
for (i=1; i<=N-1; i++) {
jm = i;
for (j=i+1; j<=N; j++) // find smallest element from i onward (i.e. i..N)
if ( a[j] < a[jm] )
jm = j;
at = a[i]; a[i] = a[jm]; a[jm] = at; //do the swap
}
return;
}
→ Inner-loop contains one comparison (and sometimes an assignment involving jm)
→ Minimal data movement, so the algorithm is good for large records
→ Algorithm is not stable
→ N(N-1)/2 comparisons, N-1 exchanges, 3(N-1) assignments involving a[]
Insertion sort is another simple in-place sorting technique (and is stable). The algorithm consists of
looping over the array elements from the second onward (primary index i = 2..N). For each i,
one loops backwards over the already sorted elements before i (secondary index j = i − 1..1) and inserts
element i at the appropriate place. This corresponds to the C++ code
void insertion_sort(int N, int a[]) { // note: a[1..N]; a[0] sentinel
    int i, j, at;
    for (i=2; i<=N; i++) {
        at = a[0] = a[i];      // save element i; set element 0 as sentinel
        j = i;
        while ( at < a[j-1] ) {
            a[j] = a[j-1];     // push element j-1 one forward
            j--;
        }
        a[j] = at;             // insert saved element
    }
    return;
}
→ The element a[0] (sentinel) contains at, so the while loop always terminates (latest with j=1)
→ The inner-loop contains one comparison and one assignment involving a[]
→ There is a lot of data movement, so the algorithm is bad for large records
→ The algorithm is stable
→ On average ∼N^2/4 comparisons and ∼N^2/4 assignments involving a[] (moves)
The sentinel is a programming trick involving the extension of an array with one dummy element
(the sentinel) set to a value x (which is to be spotted in the array) in order to avoid N comparisons
in the algorithm. The following shows the same function body twice: the first variant (without
sentinel) needs two comparisons per iteration, the second (with sentinel) only one [32]
int i, a[N];
i = 0;
while (i<N && a[i]!=x)
    i++;
imin = i;
and
int i, a[N+1];
a[N] = x;
i = 0;
while (a[i]!=x)
    i++;
imin = i;
There is also an obvious linear-scaling sorting method when the keyword set is finite (small alphabet
of keywords). It is stable and, unlike the preceding ones, not in-place - it needs auxiliary (working) arrays.
The algorithm consists in first counting the occurrences of elements of each type (in an auxiliary array),
then finding the starting points of each type in the same array and finally filling another auxiliary
array with the elements, placed according to these starting points [33]. It corresponds to the C++ code
void baby_sort(int N, int a[], int K) {
int i, k;
int n[K+1], b[N+1];
for (k=1; k<=K; k++)
n[k]=0;
// set n[1..K] to 0
for (i=1; i<=N; i++)
n[a[i]]++;
// n[k] will give the number of elements of each type
n[0]=1;
for (k=1; k<= K-1; k++)
n[k] += n[k-1];
// n[k-1] will now give the start point of the set
// of elements of type k in the reordered array
for (i=1; i<=N; i++)
// move elements a[1..N] to new locations in b[1..N]
b[n[a[i]-1]++] = a[i];
for (i=1; i<=N; i++)
a[i]=b[i];
// copy b[1..N] to a[1..N]
return;
}
→ Linear scaling (only single-loops, no double-loops)
→ The key set is always finite because the list is finite, but it can be long (i.e K ≤ N)
→ If the key values are not nicely sequential (1,2,3,4...K) as assumed, an additional array to map the
key values to sequential integers is needed
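The same counting idea can be sketched with std::vector and 0-based indexing, which avoids the variable-length arrays used above (these are a compiler extension rather than standard C++). The function name counting_sort and the assumption that all keys lie in 1..K are choices made for this sketch only.

```cpp
#include <vector>

// Counting sort for integer keys assumed to lie in 1..K; stable, O(N + K).
std::vector<int> counting_sort(const std::vector<int>& a, int K) {
    std::vector<int> start(K + 2, 0);
    for (int key : a)
        start[key + 1]++;                 // start[k+1] = number of elements with key k
    for (int k = 1; k <= K + 1; ++k)
        start[k] += start[k - 1];         // start[k] = number of elements with key < k
    std::vector<int> b(a.size());
    for (int key : a)
        b[start[key]++] = key;            // place element, advance slot for this key
    return b;
}
```

For the input {3, 1, 2, 3, 1} with K = 3 the function returns {1, 1, 2, 3, 3}.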
32 Which corresponds to the solution with sentinel.
33 Detailed description: First, every element n[1..K] of the array n[] is set to zero. Afterwards, the number of occurrences of the
element types in the array a[] is counted and saved in the position of array n[] corresponding to its number (type). The number
is raised by one per detection of one type. Afterwards, these numbers of occurrences are added up in an ascending manner
- to get correct positions, the value of n[0] is set to 1. To move the elements of array a[] into the corresponding position of
array b[], the number (type) of an element of a[] has to be reduced by 1 (because of the preceding ascending summation, which indicates the sorted position of the subsequent element type in the array); the value then read from array n[] is the sorted
location in array b[] and is (by the postfix increment operator) raised by one for the next element of
the same type (if existent).
5.2 Algorithms for Searching
A searching algorithm is applied when a list (file) of records (items) is provided as input with each
record encompassing a key, and the record having a specified value of the key (search key) is to be
found.
Sequential search simply scans through all the records in the list in turn. The number of comparisons
depends on whether the list is sorted and on whether the search is successful, and is mostly [34] N/2 (with N
being the number of records). An example of a C++ function searching an array a[] of N integers (with
dimension N + 1) for the first occurrence of value s may look like
int seq_search(int a[], int N, int s) {
int i = 0;
a[N] = s;
// use a[N] as sentinel
while ( s != a[i] )
i++;
return i;
}
Binary search starts with the full list as current sublist, iteratively divides the current sublist into two
parts (of equal sizes - up to one unit), determines in which of the two parts the search key may reside,
and sets this part as the new sublist. It makes use of a divide and conquer algorithm and requires a
sorted list. The number of comparisons is about log2(N) [35]. An example of a C++ function searching an array
a[] of N integers (with dimension N) for an occurrence of value s may look like
int bin_search(int a[], int N, int s) {
int left = 0, right = N-1, middle;
while (left <= right) {
middle = (left+right)/2; // with truncation
if (s < a[middle]) right = middle -1;
else if (s > a[middle]) left = middle +1;
else return middle;
}
return N;
}
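When a value occurs several times, a plain binary search returns some matching index, not necessarily the first one. As a sketch of one way to obtain the first occurrence, the standard library function std::lower_bound can be used on a sorted container; the wrapper name first_occurrence is hypothetical.

```cpp
#include <algorithm>
#include <cstddef>
#include <vector>

// Returns the index of the first occurrence of s in the sorted vector a,
// or a.size() if s is not present.
std::size_t first_occurrence(const std::vector<int>& a, int s) {
    auto it = std::lower_bound(a.begin(), a.end(), s);  // first element >= s
    if (it != a.end() && *it == s)
        return static_cast<std::size_t>(it - a.begin());
    return a.size();
}
```

For a = {1, 2, 2, 2, 5} the call first_occurrence(a, 2) returns 1, and first_occurrence(a, 3) returns 5 (not found).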
String search finds an occurrence of a pattern [36] within a text, given a text string [37] of length N and
a pattern (string) of length M. In contrast to the search for a key, the pattern can be very long and
must be lined up character by character with the character sequence in the text. The simplest approach is a brute
force algorithm [38] that checks for each possible position along the text whether the characters of pattern
and text match. The number of comparisons in the worst case is (N − M + 1)M ≈ NM. The algorithm
uses two indices i and j, where index i indicates the current position in the text and j
the position in the pattern. When the characters match, both are increased. When j = M, a match is
found and its position returned; when i = N, no match was found. Upon mismatch, i is reset to i − j + 1 and j to 0
(backtracking). An example of a C++ function may look like
int brute_force_search(char a[], int N, char p[], int M) {
int i=0, j;
while (i<N) {
j = 0;
while ( j<M && a[i] == p[j]) {
i++;
j++;
}
if ( j == M ) return i-j;
i -= j-1;
}
return -1; //return -1 if not found
}
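For comparison, the member function std::string::find of the standard library performs the same task. The following sketch wraps it so that it follows the same return convention as the brute-force function (-1 if not found); the wrapper name find_pattern is hypothetical.

```cpp
#include <string>

// Returns the index of the first occurrence of pattern in text, or -1 if absent.
int find_pattern(const std::string& text, const std::string& pattern) {
    std::string::size_type pos = text.find(pattern);
    return pos == std::string::npos ? -1 : static_cast<int>(pos);
}
```

For example, find_pattern("INFORMATIK", "MAT") returns 5.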
34 Except the case of an unsorted list without success in finding the search key → N comparisons.
35 An order O[lg(N)] (= O[log2(N)]) algorithm is much faster than an O[N] algorithm.
36 The C++ standard library also contains functions for pattern matching → s.find("STRING");
37 A sequence of letters, numbers, white spaces, special characters → large alphabet (≥ 26 types).
38 Not efficient for highly degenerate strings (small alphabets → sequences of 0 and 1 values).
5.3 Numerical Integration
Prerequisite to numerical algorithms is the investigation of algorithmic stability and numerical errors
as discussed in Section 4.6. The results of numerical calculations depend on round-off errors. The
rounding-off of a number to n digits consists in the omission of all digits to the right of the nth digit
and incrementing the nth digit if either the omitted part is larger than half the unit in the nth position or if the
omitted part is equal to half the unit in the nth position and the nth digit is odd [39].
In order to avoid loss of significance, it is often convenient to rearrange the equations in such a way that
the subtraction of almost equal numbers is avoided, e.g. to expand the equations in Taylor series of the
general form [40]
$f(x) = \lim_{N \to \infty} \sum_{n=0}^{N} \frac{f^{(n)}(a)}{n!} (x - a)^n$
To calculate the value of e^x − 1 for x ≈ 0, a Taylor series around x = 0 is formulated
$e^x - 1 = \left(1 + x + \frac{x^2}{2} + \frac{x^3}{6} + \frac{x^4}{24} + \dots\right) - 1 = x\left(1 + \frac{x}{2} + \frac{x^2}{6} + \frac{x^3}{24} + \dots\right)$
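The benefit of the rearranged series can be sketched numerically. The function names below are hypothetical, the series is truncated after four terms (an assumption that is only reasonable for |x| much smaller than 1), and std::expm1 from the standard header cmath can serve as a reference value.

```cpp
#include <cmath>

// Naive evaluation: subtracts two nearly equal numbers for small x.
double expm1_naive(double x) {
    return std::exp(x) - 1.0;
}

// Series evaluation x*(1 + x/2 + x^2/6 + x^3/24), truncated after four terms;
// only intended for |x| << 1.
double expm1_series(double x) {
    return x * (1.0 + x / 2.0 + x * x / 6.0 + x * x * x / 24.0);
}
```

For x = 1e-10 the series value agrees with std::expm1(x) to nearly full double precision, while the naive value loses roughly half of its significant digits to cancellation.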
In order to calculate the integral of a function
$I_{ab} = \int_a^b f(x)\,dx$
• the primitive function can be found (analytically)
$\frac{dF(x)}{dx} = f(x) \;\to\; I_{ab} = [F(x)]_a^b = F(b) - F(a)$
• special mathematical tricks can be applied such as
→ change of variable
→ integration by parts
→ residue theorem of complex function theory
→ basis set expansion (e.g. Taylor, Fourier, Laplace, ...)
• a table of standard integrals can be considered
• a symbolic mathematics program (Mathematica, Matlab, ...) can be applied
• a numerical integration method can be used such as
→ Monte Carlo integration
→ Rectangular quadrature
→ Trapezoidal quadrature
→ Simpson’s quadrature
→ Romberg’s quadrature
Monte Carlo integration is a stochastic method, hence relies on (pseudo)random numbers. It consists
in choosing an H ≥ max f (x) over [a, b] and sampling N pairs of real random numbers (xn , yn ) from a
uniform rectangular distribution (a ≤ xn ≤ b , 0 ≤ yn ≤ H with n = 1, 2, ..., N). The fraction of points
(xn , yn ) in the rectangle for which yn ≤ f (xn ) approximates the ratio of the integral Iab to the area of
the rectangle
$I^{MC}_{ab} = H (b - a) N^{-1} \sum_{n=1}^{N} \Theta(f(x_n) - y_n)$   with   $\Theta(z) = 1$ if $z > 0$, $0$ otherwise
Error → $\Delta I^{MC}_{ab} = I^{MC}_{ab} - I_{ab} = O[N^{-1/2}]$
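A minimal sketch of this hit-or-miss scheme is given below. The function name mc_integrate, the choice of the std::mt19937 generator and the fixed default seed are assumptions made for this example; the integrand is assumed to satisfy 0 ≤ f(x) ≤ H on [a, b].

```cpp
#include <random>

// Hit-or-miss Monte Carlo estimate of the integral of f over [a, b],
// assuming 0 <= f(x) <= H on the whole interval.
template <class F>
double mc_integrate(F f, double a, double b, double H, int N, unsigned seed = 42) {
    std::mt19937 gen(seed);
    std::uniform_real_distribution<double> ux(a, b), uy(0.0, H);
    int hits = 0;
    for (int n = 0; n < N; ++n) {
        double x = ux(gen);
        double y = uy(gen);
        if (f(x) - y > 0.0) ++hits;      // Theta(f(x_n) - y_n)
    }
    return H * (b - a) * hits / N;       // rectangle area times fraction of hits
}
```

For f(x) = x^2 on [0, 1] with H = 1 the estimate approaches 1/3; since the error only decreases as N^(-1/2), about a hundred times more points are needed for each additional decimal digit.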
39 Other conventions do exist. Rounding is hence not the same as truncation, where the extra digits are simply cut off.
40 Taylor series of f(x) at x=a.
Another approach to the numerical evaluation of integrals is quadrature, where N subintervals
0...N-1 of equal widths [41] are used, which are defined by N + 1 points 0...N (x_n = a + nh with x_0 = a,
x_N = b). The total integral can be written as [42]
$I_{ab} = \sum_{n=0}^{N-1} I_n$   or   $I_{ab} = h \sum_{n=0}^{N} w_n f(x_n)$   (5.2)
• Rectangular (Q) quadrature
→ Total integral
$I^Q_{ab} = \sum_{n=0}^{N-1} I^Q_n$   with   $I^Q_n = h f(x_n)$
→ Alternative form
$I^Q_{ab} = h \sum_{n=0}^{N} w^Q_n f(x_n)$   with   $w^Q_n = 1$ for $0 \le n < N$ and $w^Q_n = 0$ for $n = N$
→ Evaluation is asymmetric, i.e. from a to b is not the same as from b to a
→ Estimated error $\Delta I^Q_{ab} = I^Q_{ab} - I_{ab} \approx -\frac{h}{2}\,[f(b) - f(a)] + O[h^2]$ which implies an error linear in h
(i.e. a slow convergence - or small h necessary for a good estimate)
→ If f(b) > f(a), I_ab is underestimated and consequently if f(b) < f(a), I_ab is overestimated [43]
• Trapezoidal (T) quadrature
→ Total integral
$I^T_{ab} = \sum_{n=0}^{N-1} I^T_n$   with   $I^T_n = \frac{h}{2}\,[f(x_n) + f(x_{n+1})]$
→ Alternative form
$I^T_{ab} = h \sum_{n=0}^{N} w^T_n f(x_n)$   with   $w^T_n = 1/2$ for $n = 0, N$ and $w^T_n = 1$ for $0 < n < N$
→ Evaluation is not dependent on the direction because of the averaging of two rectangular-rule
estimates, i.e. from a to b is the same as from b to a
→ Estimated error $\Delta I^T_{ab} \approx \frac{h^2}{12}\,[f'(b) - f'(a)] + O[h^4]$ which implies an error quadratic in h
(i.e. a better convergence - or a larger h suffices for a good estimate)
→ If f'(b) > f'(a), I_ab is overestimated and consequently if f'(b) < f'(a), I_ab is underestimated
• Simpson's (S) quadrature
→ For this method, N must be even
→ Parabola through x_n, x_{n+1}, x_{n+2}
$p_n(x) = \frac{(x - x_{n+1})(x - x_{n+2})}{2h^2} f(x_n) - \frac{(x - x_n)(x - x_{n+2})}{h^2} f(x_{n+1}) + \frac{(x - x_n)(x - x_{n+1})}{2h^2} f(x_{n+2})$
→ Total integral
$I^S_{ab} = \sum_{n=0}^{N-1} I^S_n$   with   $I^S_n = \frac{h}{3}\,[f(x_n) + 4 f(x_{n+1}) + f(x_{n+2})]$ if n is even (0 otherwise)
→ Alternative form
$I^S_{ab} = h \sum_{n=0}^{N} w^S_n f(x_n)$   with   $w^S_n = 1/3$ for $n = 0, N$; $w^S_n = 4/3$ for n odd, $0 < n < N$; $w^S_n = 2/3$ for n even, $1 < n < N - 1$
→ Estimated error $\Delta I^S_{ab} \approx \frac{h^4}{180}\,[f'''(b) - f'''(a)] + O[h^6]$
41 h = (b − a)/N
42 With I_n being the integral over interval n and w_n the integral contribution of point n divided by h (quadrature height).
43 Under the premise that the slope is positive.
• Romberg's (R) quadrature
→ Trapezoid rule to calculate successive estimates of the integral, halving the spacing every time
The trapezoid estimate $I^T_L$ of order L requires $N = 2^L$ intervals and $2^L + 1$ points with $h_L = \frac{b - a}{2^L}$
→ The Romberg integral estimate of order L is obtained by formulating a linear combination of
all these trapezoidal integrals up to order L such that the error is minimal
Romberg estimate of order L:   $I^R_L = \sum_{K=0}^{L} C_{L,K} I^T_K$
→ This is done by introducing intermediate sums $T_{L,M}$ such that
$T_{L,0} = I^T_L$
$T_{L,M} = \alpha T_{L,M-1} + \beta T_{L-1,M-1}$   for $0 < M \le L$
with $I^R_L = T_{L,L}$
→ E.g. Romberg approximation of order one ($h_1 = h_0/2$)
$T_{1,0} = I^T_1 = \frac{h_1}{2}\,[f(a) + 2 f(a + h_1) + f(b)] = I + C_2 h_1^2 + C_4 h_1^4 + \dots = I + \frac{C_2}{4} h_0^2 + \frac{C_4}{16} h_0^4 + \dots$
$T_{1,1} = \frac{4 T_{1,0} - T_{0,0}}{3} = I - \frac{1}{4} C_4 h_0^4 + \dots = I - 4 C_4 h_1^4 + \dots$
$I^R_1 = T_{1,1}$ → Error $O[h_1^4]$
→ Romberg approximation of order L ($h_L = h_0/2^L$)
$T_{L,0} = I^T_L = \frac{h_L}{2}\,[f(a) + 2 f(a + h_L) + \dots + 2 f(a + (2^L - 1) h_L) + f(b)] = I + O[h_L^2]$
$T_{L,1} = \frac{4 T_{L,0} - T_{L-1,0}}{3} = I + O[h_L^4]$
···
$T_{L,M} = \frac{4^M T_{L,M-1} - T_{L-1,M-1}}{4^M - 1} = I + O[h_L^{2M+2}]$
$I^R_L = T_{L,L}$ → Error $O[h_L^{2L+2}]$
In order to process singularities [44], the integrand has to be rearranged - there exist four alternatives
→ Expand integrand in a Taylor series around the singular point and integrate the series term by term
$I = \int_0^1 dx\, \frac{\cos(x)}{x^{1/2}} = \int_0^1 dx\, x^{-1/2}\left[1 - \frac{x^2}{2!} + \frac{x^4}{4!} - \dots\right] = \left[\frac{x^{1/2}}{1/2} - \frac{x^{5/2}}{(5/2)\,2!} + \frac{x^{9/2}}{(9/2)\,4!} - \cdots\right]_0^1$
$= 2 - \frac{1}{5} + \frac{1}{108} - \cdots = 1.809$
44 E.g. if the integrand is singular at x = 0 but the integral is finite.
→ Remove the singularity by a change of variable
$x = y^2$, $dx = 2y\,dy$, $x = 0 \to y = 0$ and $x = 1 \to y = 1$
$I = \int_0^1 dx\, \frac{\cos(x)}{x^{1/2}} = \int_0^1 dy\, 2y\, \frac{\cos(y^2)}{y} = 2 \int_0^1 dy\, \cos(y^2)$
→ Remove the singularity by partial integration
$\int dx\, u'(x) v(x) = [u(x) v(x)] - \int dx\, u(x) v'(x)$   with   $u'(x) = x^{-1/2}$ and $v(x) = \cos(x)$
$I = \int_0^1 dx\, \frac{\cos(x)}{x^{1/2}} = \left[2 x^{1/2} \cos(x)\right]_0^1 - \int_0^1 dx\, 2 x^{1/2} (-\sin(x)) = 2\cos(1) + 2 \int_0^1 dx\, x^{1/2} \sin(x)$
→ Make the singularity tractable by splitting the integrand into different terms
$I = \int_0^1 dx\, \frac{\cos(x)}{x^{1/2}} = \int_0^1 dx\, \frac{1}{x^{1/2}} + \int_0^1 dx\, \frac{\cos(x) - 1}{x^{1/2}} = 2 + \int_0^1 dx\, \frac{\cos(x) - 1}{x^{1/2}}$
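As a numerical check of the change-of-variable approach, the transformed integrand 2cos(y^2) is regular on [0, 1] and can be handed to an ordinary quadrature. The sketch below uses the composite trapezoidal rule; the function names are hypothetical.

```cpp
#include <cmath>

// Integrand after the substitution x = y^2: 2*cos(y^2), regular on [0, 1].
double transformed_integrand(double y) {
    return 2.0 * std::cos(y * y);
}

// Composite trapezoidal rule on [0, 1] with N equal subintervals.
double integrate_transformed(int N) {
    const double h = 1.0 / N;
    double sum = 0.5 * (transformed_integrand(0.0) + transformed_integrand(1.0));
    for (int n = 1; n < N; ++n) sum += transformed_integrand(n * h);
    return h * sum;
}
```

With N = 1000 the result agrees with the series value 1.809 quoted above to well beyond the digits given there, whereas applying the same rule directly to cos(x)/x^(1/2) would fail at x = 0.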
In order to process an infinite integration interval, there are also alternatives
→ Expand integrand in an asymptotic series around the singularity and integrate the series termwise
→ Make the integration interval finite by a change of variable
$x = -\ln(y)$, $dx = -y^{-1}\,dy$, $x = 0 \to y = 1$ and $x = \infty \to y = 0$
$I = \int_0^\infty dx\, \frac{e^{-x}}{x e^{-2x} + 1} = \int_1^0 dy\, (-y^{-1})\, \frac{y}{-y^2 \ln(y) + 1} = \int_0^1 dy\, \frac{1}{1 - y^2 \ln(y)}$
5.4 Algorithmic Strategies
A huge variety of algorithms with different aims exists. The six major algorithmic strategies are
1. Brute-force algorithms
→ Try all combinations one by one, the hard way
2. Greedy algorithms
→ Always select the option which yields the largest immediate progress towards the goal [45]
→ E.g. making change for 33 cents with coin sizes 25, 11, 5 and 1 cents - aim: return as few coins as possible
Greedy: 25+5+1+1+1 cents
Optimal: 11+11+11 cents
→ Another example: steepest-descent minimization (line-minimize along negative gradient [46])
3. Divide and conquer algorithms
→ Split the problem into smaller sub-problems, solve these, and combine the solutions
→ Often easiest to implement using recursion
→ E.g. Binary search or Quicksort
→ On computers, multiplications involve binary strings, arranged in words of N = 2^n bits
(suitable for divide-and-conquer); the elementary multiplication is one-bit and corresponds to
the operation AND
4. Dynamic programming algorithms
→ Solve all the subproblems, store the solutions in a table, use the solutions of the subproblems
to solve the problems
5. Local search algorithms
→ Guess an arbitrary solution, define/perform local transformation to an alternative solution; if
better store it, otherwise discard it; and iterate this procedure
6. Backtracking and pruning algorithms
→ Rank all possible solutions in a tree, traverse the tree, keep track of the best solution so far,
skip the branches that cannot contain a better solution [47]
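The coin-change example from the greedy strategy can be used to contrast two of the strategies above. The sketch below gives a greedy and a dynamic programming solution; the function names are hypothetical, and the greedy version assumes that the coin list is sorted in decreasing order and contains a 1-cent coin.

```cpp
#include <algorithm>
#include <vector>

// Greedy: always take as many of the largest remaining coin as possible.
// Assumes coins_desc is sorted in decreasing order and ends with a 1-cent coin.
int greedy_coins(const std::vector<int>& coins_desc, int amount) {
    int count = 0;
    for (int c : coins_desc) {
        count += amount / c;
        amount %= c;
    }
    return count;
}

// Dynamic programming: best[v] is the minimum number of coins for value v,
// built up from the stored solutions of all smaller values.
int dp_coins(const std::vector<int>& coins, int amount) {
    const int INF = amount + 1;                    // larger than any valid answer
    std::vector<int> best(amount + 1, INF);
    best[0] = 0;
    for (int v = 1; v <= amount; ++v)
        for (int c : coins)
            if (c <= v) best[v] = std::min(best[v], best[v - c] + 1);
    return best[amount];
}
```

For the coin sizes {25, 11, 5, 1} and an amount of 33 cents, greedy_coins returns 5 (25+5+1+1+1) while dp_coins returns 3 (11+11+11).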
Furthermore, there are two common benchmark problems
1. Traveling salesman problem
→ A salesman wants to find a tour along N cities (i.e. a self-avoiding cycle including all cities)
which is of minimum length L
2. Knapsack problem
→ A knapsack has a capacity of M units and there are N types of items with different values
and costs; the optimal choice of objects in terms of total value is to be found
45 Which is locally optimal.
46 A much deeper minimum may be missed!
47 Going down the tree, lower bounds will increase - the branch with the lowest lower bound is explored in priority. When a leaf
is reached, the value is updated (i.e. lowest value so far). Whenever the lower bound at a node is higher than the tour value, the
node can be pruned (i.e. subtree skipped).
6 Computer Architecture
Various criteria are used in the classification of computers such as
• The computer category
→ supercomputer (>10 MSFr)
→ mainframe (1-10 MSFr)
→ server (50-200 kSFr)
→ workstation (5-50 kSFr)
→ personal computer (1-10 kSFr)
• Logic gate technology
→ relays (1935-1940)
→ vacuum tubes (1940-1955)
→ transistors (1955-1970)
→ integrated circuits (1970-...)
• Computer architecture, including
→ The instruction set architecture (ISA) (machine code instructions)
→ The microarchitecture (how the ISA is implemented electronically, computer organization)
→ The processor organization (single or multiple CPU, relative operation modes and memory
accesses of CPUs)
The sequence of operations is the following
increment PC
1. fetch next instruction from memory into IR
2. decode instruction in IR
3. fetch operands from memory into ALU registers
4. execute instruction in ALU
5. store results from ALU registers into memory
goto next
The clock rate of a processor is the frequency at which successive input batches can be processed
(number of clock cycles per second), namely ∼ 1 GHz (nowadays up to ∼ 3 GHz)
→ the preceding sequence of operations usually requires multiple clock cycles
However, the work of the processor can be accelerated by using instruction pipelining (if the result of
an operation is not immediately needed in the next few operations), which makes it possible to process
about 5 instructions simultaneously in 5 clock cycles. This places moderate constraints on the processor but important
constraints on the programmer/compiler (e.g. if a previous result is necessary for a fetch but is not yet
in the memory - here, pipeline processing leads to a mess).
Another way to accelerate the work of the processor is using vectorization. It is applicable if the same
operation is applied successively to many operands - the vector operation including all operands often
only costs one clock cycle (e.g. if an operation i = n ∗ f + o is applied to a large number N of n, f and o
values). It places large constraints on the processor (needs special chip design) and important constraints
on the programmer/compiler (only certain pieces of a program can be vectorized).
On the other hand, hardware acceleration includes coprocessors or CPU units
→ Internal Control Unit (ICU)
→ Arithmetic and Logic Unit (ALU)
And a range of Accelerated Processing Units (APU):
→ Floating Point Unit (FPU)
→ Graphics Processing Unit (GPU)
→ Physics Processing Unit (PPU)
→ Uncommitted Logic Array (ULA)
The Instruction Set Architecture (ISA) corresponds to the set of machine code instructions implemented
in a processor. There are two main categories
• Complex Instruction Set Computer (CISC)
→ instructions (different lengths) match high-level language, are relatively slow - complex
instructions and complex compiler
• Reduced Instruction Set Computer (RISC) [48]
→ instructions (equal lengths - enabling pipelining) are relatively short (fast clock possible),
simple (simple and cheap CPU) - many instructions per operation in high-level language
Since computers are of limited size, memory is often expanded by storing data on a disk that serves as
virtual memory - exchanging data in batches (e.g. by pages of ∼4 kbyte). However, with an access time
of ∼ms, it is very slow compared to main memory (µs), cache [49] (ns) and of course CPU registers.
Single Instruction Single Data (SISD) → single-processor (pipelining/vectorization), no parallelism
Single Instruction Multiple Data (SIMD) → multiple CPUs, same operation, different chunks of memory
Multiple Instruction Single Data (MISD) → multiple CPUs, different operations, same chunks of memory
Multiple Instruction Multiple Data (MIMD) → Multiple CPUs, different operations, different chunks
of memory
For MIMD parallel processing, there are two main options:
• Shared memory MIMD(S)
→ Multiple CPUs operate independently but share the same memory resources
→ User-friendly programming perspective to memory
→ Lack of scalability between memory and CPUs (adding more CPUs rapidly increases traffic)
→ Program is responsible for synchronization of CPU actions on memory
• Distributed memory MIMD(D) [50]
→ Multiple CPUs have their own local memory
→ Programmer must define how and when data is communicated and tasks synchronized
→ Each CPU can rapidly access its own memory without interference
→ Program must handle data structure for distributed memory
48 The trend nowadays.
49 Fast, small and expensive local memory attached to the CPU.
50 Might be further separated into MIMD(D_L) in a local network and MIMD(D_N) for non-local networks.
7 Computer Simulation
The two main components of the scientific approach (induction and deduction) are
→ Use systematic observations to establish laws about reproducible behaviour of physical systems
→ Devise models [51] from which complex laws can be derived from simpler (fundamental) laws
While some models are simple and amenable to an analytical mathematical description, most are too
complex to be solved analytically - one may use simulation
→ the model is formulated in the form of an algorithm that should emulate the real-world process
→ the emulation is carried out in practice on a computer
→ the result of the simulation is compared with the real-world observations; agreement provides a hint
of validity of the model
→ experimental observation can be interpreted in terms of the simulation, providing insight into the
complex real-world process in terms of the simpler model assumptions
→ the result of the simulation can also be compared to those of simpler analytical models, permitting
to characterize the deviations from the latter
→ the simulations may predict properties without experiments
The simulation of a process has three stages
1. Development of a model that describes the real process as accurately as possible/required
(a) specify the problem and the goals to be reached by simulation using the model
(b) formulate a qualitative model
(c) specify the set of parameters and variables of the model
(d) specify the relations between the parameters and the variables
(e) use observations to quantify the parameters
2. Experiment with the model to validate it
(a) is the model consistent?
(b) are the results plausible?
3. Use the model to analyze or predict a real process
Simulation models are representations of real objects aimed at obtaining insight in the behavior of the
objects. The structure of the model must correspond to the structure of the object. However, not all
details of the object must be represented in the model. Models are classified as
→ Static ↔ dynamic
dynamic: e.g. neural networks
→ Deterministic ↔ stochastic
deterministic: e.g. molecular dynamics simulation
→ Discrete ↔ continuous
continuous: e.g. idem
Simulation is a method of investigation which involves designing a model of a real system, and
conducting experiments with this model for the purpose of either understanding the behavior of the
system or evaluating various strategies for the operation of the system. Simulation is useful if
→ the real experiments are impossible (e.g. considered objects too small, ...)
→ the real experiments are economically impractical (e.g. construction of alternative production facilities)
→ the real experiments are unethical (e.g. epidemics or radiation damage)
→ the details of the experiments are unobservable (e.g. atomic motions in liquids)
Simulation experiments include a variety of objectives, such as
→ Comparison of different models of a system
→ Prediction of properties of a system
→ Sensitivity analysis of the factors determining the output of a system
→ Optimization of specific output variables of a system
→ Insights in the details of operations of a system
51 A good model should account for all available observations in its claimed domain of validity and be as simple as possible
for this purpose; a model can be invalidated by counter-examples, but never validated.
8 Representation of Chemical Structures
Standard structural diagrams are not suited for handling by a computer. Representations have to meet
several requirements such as
→ Uniqueness: one representation per compound
→ Unambiguity: one compound per representation
(→ A representation that is both unique and unambiguous is called canonical)
→ Completeness: representation should describe entire structure diagram
→ Conciseness: representation should require minimal storage space
Types of molecular representations suited for computer manipulation include
• Unambiguous representations
→ Topological representations (WLN, IUPAC, CAS)
→ Geometrical representation (CSD, PDB)
• Ambiguous representations
→ Fragment codes
WLN (Wiswesser line notation) was invented by the chemist William J. Wiswesser in the 1940s and is rarely used today
(despite its compact and intuitive form).
SMILES was developed in the 1980s and is still commonly used.
InChI was initially developed by IUPAC and NIST in ∼2000; current version from 2011.
Furthermore, several connection table variants exist for representing chemical structures:
• Simple connection table representation
→ Procedure
1. Define atom types: as letters C, O, ... (omit hydrogen) or as numbers 1, 2, ...
2. Define bond types: single (1), double (2), triple (3)
3. Choose atom sequence numbers
4. Make connection table containing connections
→ Drawbacks: redundant information, not unique and not very concise
• Compact connection table representation
→ Same procedure, but
1. choose sequence numbers such that the neighbours j of an atom i have consecutive numbers
when j > i
2. list only connections to atoms with lower sequence numbers in the connection table
3. in cyclic structures, add an extra line to indicate ring closure
→ Drawbacks: not unique (arbitrary numbering) and still not concise
• Matrix-based connection table representation
→ Use three matrices
1. Atom type matrix
2. Atom connectivity matrix
3. Bond type matrix
→ Drawbacks: not unique (arbitrary numbering) and even less concise
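The connection table variants above store the same information in different layouts. The following is a minimal C++ sketch, not the lecture's implementation: the names Bond, Molecule, Matrix, bondTypeMatrix, connectivityMatrix and acetaldehyde are hypothetical, and sequence numbers are 0-based here for convenience. It holds a simple connection table (atom types plus a bond list, hydrogens omitted) and derives the connectivity and bond type matrices from it.

```cpp
#include <cstddef>
#include <string>
#include <vector>

// Hypothetical types for illustration; hydrogens are omitted.
struct Bond {
    std::size_t a;  // sequence number of the first atom (0-based in this sketch)
    std::size_t b;  // sequence number of the second atom
    int order;      // bond type: 1 = single, 2 = double, 3 = triple
};

struct Molecule {
    std::vector<std::string> atomTypes;  // atom types in sequence order, e.g. "C", "O"
    std::vector<Bond> bonds;             // each bond listed once
};

using Matrix = std::vector<std::vector<int>>;

// Bond type matrix: entry (i, j) holds the bond order, 0 if atoms i and j are not bonded.
Matrix bondTypeMatrix(const Molecule& m) {
    const std::size_t n = m.atomTypes.size();
    Matrix mat(n, std::vector<int>(n, 0));
    for (const Bond& bd : m.bonds) {
        mat[bd.a][bd.b] = bd.order;  // the matrix is symmetric,
        mat[bd.b][bd.a] = bd.order;  // so every bond is stored twice
    }
    return mat;
}

// Connectivity matrix: entry (i, j) is 1 if atoms i and j are bonded, else 0.
Matrix connectivityMatrix(const Molecule& m) {
    Matrix mat = bondTypeMatrix(m);
    for (auto& row : mat)
        for (int& v : row)
            v = (v > 0) ? 1 : 0;
    return mat;
}

// Example molecule: heavy-atom skeleton of acetaldehyde, C-C=O.
Molecule acetaldehyde() {
    return Molecule{{"C", "C", "O"}, {{0, 1, 1}, {1, 2, 2}}};
}
```

For the acetaldehyde skeleton, bondTypeMatrix gives the entry 2 at positions (1, 2) and (2, 1). The N × N matrices store every bond twice and mostly contain zeros, which illustrates why the matrix-based variant is the least concise.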
None of the preceding connection tables were unique:
→ In the simple connection table, N atoms permit N! numbering schemes.
→ In the compact connection table, the number of possibilities is reduced by sequential numbering;
however, many possibilities remain.
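The lack of uniqueness can be checked by brute force for small molecules. The sketch below is an illustration under stated assumptions: the names BondEntry, Table, renumber and countDistinctTables are hypothetical, and a table is modelled as the atom types in sequence order plus a sorted bond list. It applies all N! numberings to a connection table and counts how many different tables result.

```cpp
#include <algorithm>
#include <cstddef>
#include <numeric>
#include <set>
#include <string>
#include <tuple>
#include <utility>
#include <vector>

// Hypothetical layout: (lower atom number, higher atom number, bond order).
using BondEntry = std::tuple<std::size_t, std::size_t, int>;
// Atom types in sequence order plus a sorted bond list.
using Table = std::pair<std::vector<std::string>, std::vector<BondEntry>>;

// Renumber the atoms: atom i receives the new sequence number perm[i].
Table renumber(const Table& t, const std::vector<std::size_t>& perm) {
    Table out;
    out.first.resize(t.first.size());
    for (std::size_t i = 0; i < t.first.size(); ++i)
        out.first[perm[i]] = t.first[i];
    for (const BondEntry& b : t.second) {
        const std::size_t a = perm[std::get<0>(b)];
        const std::size_t c = perm[std::get<1>(b)];
        out.second.emplace_back(std::min(a, c), std::max(a, c), std::get<2>(b));
    }
    // Sort the bond list so that the order of listing does not matter.
    std::sort(out.second.begin(), out.second.end());
    return out;
}

// Try all N! numberings and count how many different tables they produce.
std::size_t countDistinctTables(const Table& t) {
    std::vector<std::size_t> perm(t.first.size());
    std::iota(perm.begin(), perm.end(), std::size_t{0});
    std::set<Table> seen;
    do {
        seen.insert(renumber(t, perm));
    } while (std::next_permutation(perm.begin(), perm.end()));
    return seen.size();
}
```

For the heavy-atom skeleton of ethanol (C-C-O), all 3! = 6 numberings give different tables. For propane (C-C-C), the mirror symmetry makes pairs of numberings coincide, so only 3 different tables remain. Since N! grows very quickly, this enumeration is only practical for very small N.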
A scheme that has been made unique is called a canonical representation. Canonicalization is an
essential component of line representations (WLN, SMILES, InChI). A possible canonicalization
procedure for connection tables is Morgan's algorithm (1965).
• Morgan’s algorithm
→ Procedure
1. Calculate stage 1 connectivity values for each atom (number of attached non-hydrogen atoms)
2. Calculate stage 2 connectivity values for each atom (sum stage 1 values over attached atoms)
3. Continue as long as the number of different connectivity classes52 increases
4. Generate compact connection tables based on all possible numberings
5. Select one representation based on atom and bond types + neighbours
→ Atomic numbers: low → high
→ Bond order: low → high
→ Lowest-neighbour rank: low → high
The algorithm therefore seeks the most deeply embedded atom within the structure and assigns it
number 1. It then follows a sequential numbering rule, resolving multiple choices based on
both the extended connectivities and a predefined ordering of the atom types, bond types and
lowest-neighbour ranks.
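The connectivity stages of Morgan's algorithm (steps 1 to 3) can be sketched in C++ as follows. This is a minimal illustration, not a full canonicalization: the names countClasses, extendedConnectivity and startAtom are hypothetical, the molecule is given as a 0-based adjacency list of non-hydrogen atoms, and tie-breaking by atom type, bond order and neighbour rank is not implemented. The sketch also assumes that, once the number of classes stops increasing, the values of the last stage that still increased it are kept.

```cpp
#include <algorithm>
#include <cstddef>
#include <iterator>
#include <set>
#include <vector>

// Adjacency list of the non-hydrogen atoms: adj[i] lists the neighbours of atom i.
using Adjacency = std::vector<std::vector<std::size_t>>;

// Number of connectivity classes, i.e. of different connectivity values.
std::size_t countClasses(const std::vector<long>& values) {
    return std::set<long>(values.begin(), values.end()).size();
}

// Iterate the connectivity values while the number of classes increases.
std::vector<long> extendedConnectivity(const Adjacency& adj) {
    const std::size_t n = adj.size();
    std::vector<long> ec(n);
    for (std::size_t i = 0; i < n; ++i)
        ec[i] = static_cast<long>(adj[i].size());  // stage 1: number of neighbours
    std::size_t classes = countClasses(ec);
    while (true) {
        std::vector<long> next(n, 0);
        for (std::size_t i = 0; i < n; ++i)
            for (std::size_t j : adj[i])
                next[i] += ec[j];  // sum the previous stage over the attached atoms
        const std::size_t nextClasses = countClasses(next);
        if (nextClasses <= classes)
            break;  // no further increase: keep the previous stage (assumption)
        ec = next;
        classes = nextClasses;
    }
    return ec;
}

// Index of the atom with the highest value, i.e. the candidate for number 1.
// Ties are resolved by taking the first such atom, which is a simplification;
// for an empty molecule the function returns 0.
std::size_t startAtom(const std::vector<long>& ec) {
    return static_cast<std::size_t>(
        std::distance(ec.begin(), std::max_element(ec.begin(), ec.end())));
}
```

For the carbon chain of pentane, the stage 1 values are 1, 2, 2, 2, 1 (two classes) and the stage 2 values are 2, 3, 4, 3, 2 (three classes). Stage 3 would give 3, 6, 6, 6, 3 (two classes again), so the iteration stops and the stage 2 values are kept. The central atom has the highest value and would receive number 1.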
52 Connectivity class: connectivity value occurring at least once in the structure
9. MOLECULAR SIMULATION
9
Molecular Simulation