Dictaat Programming and Genomics

Lecture notes
Programming and genomics (8CA10)
2019/2020
A.J. Markvoort
P.A.J. Hilbers
October, 2019
Contents
1 Introduction
1.1 Programming and biomedical applications
1.2 The human genome
1.3 Introduction to computer programming
1.4 Additional Python resources
1.5 Exercises
2 Python
2.1 Data model
2.2 The print function
2.3 Calculating with variables
2.4 Exercises 1–6
3 Lists and repetition
3.1 Lists
3.2 The for statement
3.3 Modules and the import statement
3.4 Exercises 7–12
4 Methods, slicing, random and plotting
4.1 Methods and invocations
4.2 The dir and help methods
4.3 The range method
4.4 Slicing
4.5 Random numbers
4.6 Plotting data using matplotlib
4.7 Exercises 13–18
5 Selection methods and file/user input
5.1 Selection methods (if, elif and else)
5.2 Conditionals and selection
5.2.1 False and True
5.2.2 Comparisons
5.2.3 Boolean operations
5.3 User input (input)
5.4 Reading from files
5.5 Exercises 19–24
6 Bioinformatics and strings
6.1 From DNA to RNA to protein
6.2 DNA, RNA and Python strings
6.3 Operations on strings
6.4 Converting lists and strings
6.5 Writing data to file
6.6 Exercises 25–29
7 Functions and parameters
7.1 Function definition (def)
7.2 Function call
7.3 Documenting functions
7.4 Positional parameters as function arguments
7.5 Keyword parameters and defaults
7.6 Exercises 30–37
8 Tuples and string formatting
8.1 Tuples
8.2 Returning multiple values from a function
8.3 String formatting
8.3.1 Old style: %
8.3.2 New style: the format method
8.4 Exercises 38–40
9 Dictionaries and database queries
9.1 Dictionaries
9.2 Database queries
9.2.1 Open arbitrary resources by URL
9.2.2 Accessing databases: NCBI, Entrez and BioPython
9.3 Exercises 41–48
10 Program design and examples
10.1 A more general repetition construct (while)
10.2 Programming: problem formulation, analysis and design
10.3 Programming examples
10.3.1 Counting 'CGs' in DNA strings
10.3.2 All pattern positions in a string
10.4 Exercises 49–57
11 Classes, Excel files and boxplots
11.1 Classes and objects
11.2 Excel files in Python
11.3 Boxplot
11.4 Exercises 58–61
12 Graphical user interfaces
12.1 A first window
12.2 The four basic GUI-programming tasks
12.3 The label widget
12.4 The button widget
12.5 The frame widget
12.6 Bringing the buttons to life
12.7 The entry widget
12.8 Exercises 62–69
13 Two examples
13.1 Simulations
13.2 A bar plot example
13.3 Exercises 70–73
A Summary of useful commands
A.1 Common Python constructs
A.2 Operations on string s
A.3 Operations on file f
A.4 Operations on lists l
A.5 Operations on dictionaries d
A.6 List generation and plotting
A.7 turtle
A.8 openpyxl
A.9 Database queries
A.10 tkinter
B Solutions to selected exercises
Chapter 1
Introduction
1.1 Programming and biomedical applications
An introduction to programming is part of almost every degree programme at universities worldwide. This also holds for the programmes of the Biomedical Engineering department at
Eindhoven University of Technology, where from the beginning (1997) the course 8C010
“An Introduction to Object Oriented Programming and Java” has been taught in the
first half year of the programme. Currently, there is little consensus about which programming
language is most appropriate for introductory computer science classes. Most schools
use a traditional system programming language such as C, C++ or Java. As may be
inferred from the table in Figure 1.1, these languages are indeed popular.
Figure 1.1: TIOBE Programming Community Index for September 2019, see programming languages rankings
Languages such as Tcl, Perl and Python, however, are becoming increasingly popular
for developing application-specific software, and are considered to be simpler, safer and
more flexible than C or Java. In particular, Python emerges as a good candidate for a
first programming language, and it is the programming language we will use in
this course. Moreover, one particularly interesting feature of Python is the large number
of interfaces it has to other programming languages and tools, such as Matlab and R. In most
cases one can use constructs from other programming languages directly in
Python. Python is therefore often described as a glue language, in which program
parts written in different languages are joined together in a single Python program.
In these lecture notes, however, the focus is not on a particular programming language.
The emphasis is on programming concepts, with special attention to biomedical applications,
and since data analysis is a major component of most biomedical applications, that will
be the main topic.
1.2 The human genome
In this section we give a short introduction to the scientific field of bioinformatics. Driving
forces are the Human Genome Project and its successor, the 1000 Genomes Project, whose
primary goal is to create a complete and detailed catalogue of human genetic variation.
Historical introduction
Genetics as a set of principles and analytical procedures did not begin until 1866, when
an Augustinian monk named Gregor Mendel (see Figure 1.2a) performed a set of experiments that pointed to the existence of biological elements called genes, the basic units
responsible for possession and passing on of a single characteristic.
Figure 1.2: Historical figures: a) Gregor Johann Mendel (from www.wikipedia.org),
b) Watson and Crick with the DNA double helix model (from
www.thehistoryblog.com).
Until 1944, it was generally assumed that chromosomal proteins carry genetic information, and that DNA plays a secondary role. This view was shattered by Avery, MacLeod and
McCarty, who demonstrated that the molecule deoxyribonucleic acid (DNA) is the major
carrier of genetic material in living organisms, i.e., it is responsible for inheritance.
Figure 1.3: a) The purine bases are adenine and guanine (in blue), while the pyrimidines are thymine and cytosine (in pink). RNA contains uracil instead of
thymine. b) A nucleotide is composed of a base, a five-carbon sugar and one to
three phosphate groups.
DNA composition
The basic elements of DNA have been isolated and determined by partly breaking up
purified DNA. These studies demonstrated that DNA is composed of four basic molecules
called nucleotides (see Figure 1.3a), which are identical except that each contains a
different nitrogen base. Each nucleotide (see Figure 1.3b) contains phosphate, a sugar
(of the deoxyribose type) and one of the four bases: Adenine, Guanine, Cytosine, and
Thymine (usually denoted A, G, C and T, respectively).
Structure
In 1953 James Watson and Francis Crick deduced the three dimensional structure of
DNA (see Figure 1.2b) and immediately inferred its method of replication. The structure
of DNA is described as a double helix, which looks rather like two interlocked bedsprings.
Each helix is a chain of nucleotides held together by phosphodiester bonds.
Figure 1.4: Base pairing (www.biology-pages.info)
The two helices are held together by hydrogen bonds. Each base pair consists of one
purine base (A or G) and one pyrimidine base (C or T), paired according to the following
rule: G-C and A-T (see Figure 1.4). The DNA molecule is directional, due to the
asymmetrical structure of the sugars which constitute the backbone of the molecule.
Each sugar is connected to the strand upstream (i.e., preceding it in the chain) at its
fifth carbon and to the strand downstream (i.e., following it in the chain) at its third
carbon. In biological jargon, the DNA strand goes from 5′ (read five prime) to 3′ (read
three prime). The two complementary DNA strands run in opposite directions,
see Figure 1.5.
Figure 1.5: The double helix and the directional conventions.
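The pairing rule and the opposite direction of the two strands can be made concrete with a small Python sketch. The names PAIRING and reverse_complement and the example sequence are illustrative assumptions, not part of these notes, and the constructs used (dictionaries, the for statement, functions) are only introduced in later chapters.

```python
# Pairing rule from the text: A pairs with T, and G pairs with C.
PAIRING = {"A": "T", "T": "A", "G": "C", "C": "G"}


def reverse_complement(strand):
    """Return the complementary strand, read in its own 5' to 3' direction."""
    complement = ""
    for base in strand:
        complement = complement + PAIRING[base]
    # The complementary strand runs in the opposite direction, so reverse it.
    return complement[::-1]


example = "AAGC"  # hypothetical sequence, read 5' to 3'
print(reverse_complement(example))
```

Reversing the result reflects the convention that both strands are written from 5′ to 3′.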
Genes and chromosomes
Each DNA molecule is packaged in a separate chromosome, and the total genetic information stored in the chromosomes of an organism is said to constitute its genome. With
few exceptions, every cell of a eukaryotic multi-cellular organism contains a complete
copy of the genome, while the difference in functionality of cells from different tissues is
due to the variable expression of the corresponding genes. The human genome contains
about 3 × 10^9 base pairs (abbreviated bp), organized as 46 chromosomes: 22 different
autosomal chromosome pairs and two sex chromosomes, either XX or XY. The 24 different chromosomes range from 50 × 10^6 to 250 × 10^6 bp. The total number of base pairs
varies between different organisms. The organism Amoeba dubia (a single-celled organism),
for example, has more than 200 times as many base pairs as a human.
Living organisms are divided into two major groups: Prokaryotes, which are single-celled
organisms with no cell nucleus, and Eukaryotes, which are higher level organisms
whose cells have nuclei. With contemporary knowledge of the biochemical basis of heredity, Mendel's abstract concept of a gene can be redefined as a physical entity. A gene is
a region of DNA that controls a discrete hereditary characteristic.
The Human Genome Project
The ultimate goal of the human genome project is to produce a single continuous sequence for each of the 24 human chromosomes and to delineate the positions of all genes.
The working draft sequence described by the international human genome sequencing
consortium was constructed by melding together sequence segments derived from over
20,000 large clones.
• 1985 - The project was first initiated by Charles DeLisi, associate director for health
and environmental research at the Department of Energy (DoE) in the United States.
• 1988 - The National Institutes of Health (NIH) establishes the Office of Human Genome
Research.
• 1990 - Human Genome Project (HGP) launched with the intention to be completed
within 15 years and with a 3 billion dollar budget.
• 1996 - In a meeting in Bermuda international partners in the genome project agreed
to formalize the conditions of data access including release of sequence data into
public databases. This came to be known as the Bermuda Principles.
• 1998 - Craig Venter forms a company with intent to sequence the human genome
within three years. The company, later named Celera, introduced a new ambitious
whole genome shotgun approach.
• 1999 - The public project responds to Venter's challenge and changes its target
date for completing the first draft.
• December 1999 - The first complete human chromosome sequence (number 22) is
published.
• June 2000 - Leaders of the public project and Celera meet in the White House to
announce completion of a working draft of the human genome.
• February 2001 - The first draft of the human genome was published in the journals
Nature and Science.
• May 2006 - Human Genome Project researchers announced the completion of the
DNA sequence for the last of the 24 human chromosomes.
• January 2008 - The 1000 Genomes Project was launched as an international research effort to establish by far the most detailed catalogue of human genetic
variation.
• May 2008 - Mapping and sequencing of structural variation from eight human
genomes.
• May 2011 - Report about the Economic Impact of the Human Genome Project:
How a $3.8 billion investment drove $796 billion in economic impact, created
310,000 jobs and launched the genomic revolution.
• October 2012 - The sequencing of 1092 genomes was announced in a
Nature publication.
The human genome, the first vertebrate genome sequence determined, seems likely to
be quite representative of what we will find in other vertebrate genomes. It is around
30 times larger than the recently sequenced genomes of the worm Caenorhabditis elegans and the fruit fly
Drosophila melanogaster (available in the public domain), both around 10^8 bp, and
250 times larger than that of the yeast Saccharomyces cerevisiae. Despite its size, it seems
likely to have only two or three times as many genes as the fly or worm genomes, with
the coding regions of genes accounting for only 1.5% of the DNA. Repeat sequences form
a large proportion of the remaining DNA, around 46%. These repeats may or may not
have a function but they are certainly characteristic of large vertebrate genomes. The
rest of the sequence contains promoters, transcriptional regulatory sequences and other
features.
The 1000 Genomes Project is but the latest increment in a remarkable scientific program
whose origins date back a hundred years to the rediscovery of Mendel's laws and whose
end is nowhere in sight. In a sense it provides a capstone for efforts in the past century to discover genetic information and a foundation for efforts in the coming century
to understand it. The scientific work will have profound long-term consequences for
medicine, leading to the elucidation of the underlying molecular mechanisms of disease
and thereby facilitating the design, in many cases, of rational diagnostics and therapeutics targeted at those mechanisms. With the Human Genome Project, bioinformatics,
i.e., the use of computational tools in biomedical engineering, has become an essential
ingredient in research.
Part of biomedical research is the study of human cellular processes. Human DNA
is compared to that of other organisms such as mouse, rat and horse. As of February 2,
2014, 12857 complete genomes had been published, see the Genomes OnLine Database (GOLD).
In the preceding year alone, more than 8000 new genomes were completed. Many of these
8000 are from bacteria, and their role in humans becomes more and more prominent (the
human body contains over 10 times more microbial cells than human cells). We expect
therefore that in the coming years much attention will be paid to comparing different microbial
genomes to understand their differences. For these comparisons smart algorithms are
needed and, hence, we should consider how to design such algorithms.
Also, with the arrival of next generation sequencing (NGS) platforms, that can perform
sequencing of millions of small fragments of DNA in parallel, an entire human genome
can nowadays be sequenced within a single day. Bioinformatics analyses are used to
piece together these fragments by mapping the individual reads to the human reference
genome.
Moreover, not only in the context of the human genome but also in many other contexts
you will probably encounter (experimental) data that you might want to process and/or
visualize. The ability to program in Python will be very useful in this respect.
1.3 Introduction to computer programming
Computer systems consist of hardware and software. The hardware is the physical
machine, which has input devices such as a keyboard and a mouse, output devices
such as a display screen and a printer, and two major components called the processor and the
memory.
The processor, also called Central Processing Unit (CPU), is the part capable of executing very simple instructions such as moving numbers around from one place in
memory to another and performing some simple arithmetic operations such as addition
and subtraction.
The memory holds the data for the CPU to process, and it holds intermediate results
of calculations. In order to identify different locations in which data has been stored,
memory locations have a unique address.
As stated, computer hardware can only directly execute some very simple instructions.
The very first programmers actually had to enter these simple instructions in the form of
binary codes themselves. The next stage was to create a translator that simply converted
English equivalents of the codes into binary, so that instead of having to remember that
the code 001273 05 04 meant add 5 to 4, programmers could now write ADD 5 4.
This very simple improvement made life much simpler and these systems of codes were
really the first programming languages, one for each type of computer. They are known
as assembly languages and assembly programming is still used for a few specialized
programming tasks today.
Even this was very primitive and still told the computer what to do at the hardware
level — move bytes from this memory location to that memory location, add this byte to
that byte etc. It was still very difficult and took a lot of programming effort to achieve
even simple tasks.
Gradually computer scientists developed higher level computer languages to make the
job easier. This was just as well because at the same time users were inventing ever
more complex jobs for computers to solve! This competition between the computer
scientists and the users is still going on and new languages keep on appearing. This
makes programming interesting, but also makes it important that as a programmer you
understand the concepts of programming as well as the pragmatics of doing it in one
particular language.
Programming is a creative process in which a method, called an algorithm, is designed
for solving a problem. An algorithm is a set of instructions that must be expressed so
completely and so precisely that the instructions can be followed without having to fill
in further details. It has the following characteristics:
• it is described in terms of simpler actions,
• it is a sequence of actions,
• it usually has to store intermediate results,
• it uses different names for different intermediate results,
• it usually contains a sequence of instructions that have to be repeated until some
test condition is reached, and
• it has an end criterion.
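As an illustration of these characteristics, here is a minimal sketch of a classic algorithm, Euclid's method for the greatest common divisor. The choice of example and the function name greatest_common_divisor are assumptions made for illustration; the while and def constructs it uses are treated in later chapters.

```python
def greatest_common_divisor(a, b):
    """Euclid's algorithm for two positive integers."""
    # Repeat the same simple actions until the end criterion b == 0 holds.
    while b != 0:
        remainder = a % b  # a named intermediate result
        a = b
        b = remainder
    return a


print(greatest_common_divisor(48, 18))
```

Each step is a simpler action (a remainder and two assignments), the steps form a sequence that is repeated, and the loop stops as soon as the test condition b == 0 is reached.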
1.4 Additional Python resources
Apart from the remainder of these lecture notes, there are numerous books and web
resources available to assist you in your tour through the Python programming language.
Below you find a short list of some important resources on the web:
1. The default site to look for information on Python is http://docs.python.org/3/.
It contains the documentation, a tutorial, a language reference, and the standard
library reference for the latest version of Python. Such sites are still available for older versions of Python
as well, e.g. for Python 3.6 it is http://docs.python.org/3.6/.
2. A list with books, websites and video tutorials is available at
https://wiki.python.org/moin/BeginnersGuide/Programmers.
1.5 Exercises
At the end of each of the following chapters a number of exercises are given. Some of these
exercises are marked with one or two stars. This means that for that exercise
(**), or part of that exercise (*), the solution can be found in appendix B of these lecture
notes. For the other exercises the solutions will follow approximately one week after the
guided self-study session for which the exercises were scheduled. For all exercises, including those
for which solutions are already provided, you should (try to) solve
them yourself first before looking at the solutions. Finding the solution to an exercise
by writing your own program is very different from understanding a provided solution!
Chapter 2
Python: standard types
Python is a high-level, general-purpose programming language. It consists of a few simple
constructs that will be introduced step by step in these lecture notes. The Python
version that we will use is 3.6.1. It was released on March 21, 2017, and is preinstalled
on the TU/e laptops with Anaconda 4.4.0. Any newer version of Anaconda (that can
be downloaded from https://www.anaconda.com/download/#windows) should be fine
too.
As in every other language, we first have to introduce the principles and rules for constructing sentences in the language, the so-called grammar rules or syntax. In a
grammar some basic elements are predefined. This also holds for Python, which has
so-called predefined standard types that are introduced in this chapter.
2.1 Data model
In an object oriented programming language the main concept is the object. Programs
are considered as collections of objects that interact with each other by means of actions.
An object has two parts: the data attributes and the actions, usually called methods,
that act on them.
Objects
Roughly speaking Python has two kinds of objects:
• Predefined objects (standard types), of which most common are:
type     description               examples
int      integer numbers           1, 2, 3
float    floating point numbers    1.2, 1e+2
str      strings                   "Hello", 'hi'
Notations for constant values of built-in types are called literals.
• User defined objects
In this chapter we restrict ourselves to the three above mentioned standard types, and
discuss only some of the methods available for these types. In following chapters other
object types will be introduced.
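As a quick check of the table above, the built-in function type reports which standard type a literal belongs to. This is a minimal sketch; the literals are taken from the table.

```python
# type() reports the standard type of the object that a literal denotes.
print(type(3))        # an integer literal
print(type(1.2))      # a float literal
print(type(1e+2))     # scientific notation also gives a float
print(type("Hello"))  # a string literal in double quotes
print(type('hi'))     # a string literal in single quotes
```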
Variables
Data objects are stored in the memory of your computer. To access and to distinguish
data objects, they can be given names. A name, also called identifier, is a word that
consists of letters, underscores, and digits; it must start with a letter or an underscore.
Identifiers are used to name parts of the program for future reference.
Variables are used to refer to data values. Variables have a name. Every language has
its own rules about which characters are allowed or not allowed in a name. Python
distinguishes between upper and lower case letters and is therefore called a case-sensitive language. One common style
for naming variables is to start the name with a lower case letter and use a
capital letter for the first letter of each subsequent word in the name, like this:
thisVariableName
There is much freedom with respect to naming, but in general it is considered a good
programming strategy to choose short but meaningful names. If an integer variable is
only used for auxiliary purposes we give it a one letter name like
n, i, q
but of course longer names could also be used.
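The naming rules can be tried out directly. The sketch below uses hypothetical variable names, and the string method isidentifier to test the character rules; note that this method only checks the characters, so it also accepts reserved words such as for, which cannot be used as variable names.

```python
# Case sensitivity: these are two different variables.
geneCount = 12
genecount = 7
print(geneCount, genecount)

# str.isidentifier() checks the character rules for names.
print("thisVariableName".isidentifier())  # letters only: allowed
print("_count2".isidentifier())           # underscore and digit: allowed
print("2genes".isidentifier())            # starts with a digit: not allowed
```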
Assignment statement
Assignment statements can be used to (re)bind names to values. For instance if we want
to store the value 10 in a variable n we have
n = 10
The general construct to assign a value to a variable is
identifier = expression
where expression is a computation that produces a value.
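A short sketch of binding and rebinding, with arbitrarily chosen values: the expression on the right-hand side is evaluated first, and only then is the name on the left bound to the result.

```python
n = 10         # bind the name n to the value 10
m = 2 * n + 1  # the expression on the right is evaluated first
n = n + 5      # rebind n; the old value 10 is used to compute the new one
print(n, m)
```

Rebinding n does not change m, because m was bound to the value that the expression had at the time of the assignment.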
Integers
As usual an integer number is a sequence of digits, and the standard operators are
• subtraction: -
• addition: +
• multiplication: *
• true division: /
• floor division: //
all having the standard meaning, although floor division is perhaps special: it is
division with the result rounded down to the nearest integer, i.e., without the remainder. Moreover, all operators return an integer, except
for true division, which always returns a float (see next subsection).
Expressions can be constructed by combining these operators and using the parentheses
( and ) where appropriate.
Examples:
• 3 + 4, which has the value 7,
• 3 + 4 * 7, which has the value 31,
• (5 - 4) * 7, which has the value 7,
• 7 // 4, which has the value 1,
• 7 / 4, which has the float value 1.75.
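These examples can be run as they are. The sketch below repeats them and adds two cases that are not in the list above: floor division of a negative number, and true division with a whole-number result.

```python
print(3 + 4 * 7)     # multiplication binds more strongly than addition
print((5 - 4) * 7)   # parentheses change the order of evaluation
print(7 // 4)        # floor division of two positive integers
print(-7 // 4)       # floor division rounds down, also for negative numbers
print(7 / 4, 8 / 4)  # true division always returns a float
```

Note that -7 // 4 gives -2 and not -1, because the result is rounded down, and that 8 / 4 gives the float 2.0 rather than the integer 2.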
Floats
Next to integers we use only one other type of numbers in this course: floats, an
abbreviation for floating point numbers. The precise definition is rather complicated, so
we first give an informal description. In informal terms, a float is two integers joined
by a dot and possibly followed by an exponential part consisting of the letter 'e' (small
or capital) followed by an integer. More formally, a float is either a pointfloat or an
exponentfloat. A pointfloat consists of an optional sequence of digits followed by a
fraction, or of a sequence of one or more digits followed by a dot. A fraction consists
of a dot followed by a sequence of one or more digits. An exponentfloat is
either a sequence of one or more digits or a pointfloat, followed by an exponent, where
an exponent is an e or E followed by an optionally signed sequence of one or more digits. When the
float contains the letter 'e' or 'E' we speak of a number in scientific notation.
Examples:
3.1415
3e+2
1.
0.5e-67
The numeric types (both floats and integers) support the following operations, sorted
by ascending priority:
Operation    Result
x + y        sum of x and y
x - y        difference of x and y
x * y        product of x and y
x / y        division of x by y
x // y       floor division of x by y
x % y        remainder of x / y
-x           x negated
abs(x)       absolute value or magnitude of x
int(x)       x converted to integer
float(x)     x converted to floating point
pow(x, y)    x to the power y
x ** y       x to the power y
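A few of these operations in action, as a minimal sketch with arbitrarily chosen values. It also shows what the priority ordering means in practice: ** is applied before the unary minus.

```python
x = 7
y = 2.0  # combining an int with a float gives a float result

print(x + y)        # sum
print(x // y)       # floor division
print(x % y)        # remainder
print(-x ** 2)      # ** has higher priority than the unary minus
print(abs(-x), int(3.99), float(x))
print(pow(x, 2) == x ** 2)
```

Here -x ** 2 is evaluated as -(x ** 2), and int(3.99) drops the fractional part instead of rounding to the nearest integer.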
Strings
To create string constants, also called string literals, enclose them in single, double,
or triple quotes as follows:
courseid   = 'The name of this course'
groupname  = "Computational Biology"
coursename = """Programming and genomics"""
The same type of quote used to start a string must be used to terminate it. Triple-quoted strings capture all the text that appears prior to the terminating triple quote,
as opposed to single- and double-quoted strings, which must be specified on one logical
line. Triple-quoted strings can be delimited by three double quotes (as in the preceding example) or three single
quotes (as in the following example). Triple quoting is useful when the
contents of a string literal span multiple lines of text such as the following:
'''Content-type: text/html
<h1> Computational Biology </h1>
Click <a href="http://cbio.bmt.tue.nl/">here</a>.
'''
Concatenation of strings
Python also has ways to combine strings. One of these is concatenation.
Concatenation is the process of tying or gluing strings together to make a new string.
In Python, you can concatenate strings with the + operator. Here are a few examples:
>>> 'AA' + 'TTT'
'AATTT'
>>> 'AA' + ' ' + 'TTT' + '!'
'AA TTT!'
Here we have given a short fragment of a Python session in so-called interactive mode.
In this mode the Python interpreter prompts for the next command with its primary
prompt, usually three greater-than signs (>>>) or a numbered prompt like In [3]. The
user enters the input commands directly, and if the command results in output, this is
shown on the next line.
Variables can also be concatenated together if they hold strings as values:
>>> word1 = 'Gene '
>>> word2 = 'insulin'
>>> word = word1 + word2
>>> word
'Gene insulin'
In the last assignment we could also have written word1 = word1 + word2. In that
case, the old contents of word1 (i.e. 'Gene ') would be overwritten by the new contents.
The same works for an (integer) variable n:
>>> n = 3
>>> n = n + 1
>>> n
4
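One thing to keep in mind is that + only concatenates strings with strings. The following sketch, with hypothetical variable names and a made-up number, uses str to convert an integer before gluing it into a string.

```python
gene = 'insulin'
length = 333  # hypothetical number, for illustration only

# 'Gene ' + length would raise a TypeError: + cannot join a str and an int.
label = 'Gene ' + gene + ' has ' + str(length) + ' bases'
print(label)
```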
Indexing
Strings are sequences of symbols/characters, where each symbol in the string has a
position number. For instance for the string "gctgca":
Index    0  1  2  3  4  5
String   g  c  t  g  c  a
Note that the index of the first element is 0!
One can use the index number to get a character from the string.
>>> dnaIns = 'gctgca'
>>> dnaIns[0] # Get the first character of dnaIns
'g'
>>> dnaIns[2] # Get the third character of dnaIns
't'
Note the use of # in the above statements. This symbol plus the remainder of the line is
ignored by the Python interpreter and can thus be used to add comments for the (human)
reader of the code. Adding such comments can greatly increase the readability of your
code and is thus good practice.
Using an index number equal to or larger than the length of the string results in an error:
>>> dnaIns[500] # Get character number five hundred and one of dnaIns
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
IndexError: string index out of range
The length of a string s can be obtained by: len(s). For example,
>>> len("abc")
3
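Because counting starts at 0, the last valid index of a string is one less than its length. A small sketch, using the example string from the indexing table (the variable names are chosen for illustration):

```python
dna = 'gctgca'

n = len(dna)         # number of characters in the string
first = dna[0]       # indices start at 0
last = dna[n - 1]    # so the last valid index is n - 1

print(n, first, last)
```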
Substrings: slicing
Apart from single characters, multiple consecutive characters can also be selected from a string.
This is called slicing. The resulting slices are called substrings.
Example:
>>> motif = ’GAATTC’
>>> motif[0:3] # the first three characters
’GAA’
>>> motif[1:3] # characters two and three
’AA’
Both the start and the end position are optional: omitting the start means starting at the
beginning of the string, and omitting the end means extracting the substring up to the end
of the string. The character at the end position itself is not included. When accessing a
single character, it is an error to access a position that does not exist, whereas during
substring extraction the longest possible string is extracted (which may be the empty string ’’).
>>> motif = ’GAATTC’
>>> motif[0:3] # the first three characters
’GAA’
>>> motif[:3] # the first three characters
’GAA’
>>> motif[3:] # everything but the first three characters
’TTC’
>>> motif[3:6]
’TTC’
>>> motif[:]
’GAATTC’
>>> motif[90]
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
IndexError: string index out of range
>>> motif[3:90]
’TTC’
>>> motif[10:90]
’’
>>> motif[3:2]
’’
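As an illustration of the slicing rules above, the sketch below cuts a string into two parts at a given position. The cut position and the variable names are assumptions made for this example only.

```python
motif = 'GAATTC'

# Cut the string after its first character (cut position chosen for illustration).
cut = 1
left = motif[:cut]     # characters before the cut position
right = motif[cut:]    # characters from the cut position onwards
print(left, right)

# The two slices always add up to the original string, for any cut position.
print(left + right == motif)

# Out-of-range slice bounds are clipped instead of raising an IndexError.
print(motif[4:90])
```

Since the end position of a slice is excluded and the start position is included, motif[:cut] and motif[cut:] never overlap and never skip a character.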
2.2 The print function
The print function writes the value of the expression(s) it is given to the output. Multiple
expressions and strings can be given to a single print function, by separating the items
with commas. Strings are printed without quotes, and a space is inserted between items,
so you can format things nicely, like this:
>>> i = 5
>>> print(i)
5
>>> i = 3+4
>>> print('The value of i is', i)
The value of i is 7
2.3 Calculating with variables
As a small example, we will now consider a Python fragment to calculate someone's
body-mass index (BMI). This is defined as a person's body weight (in kilograms) divided
by the square of his/her height (in meters).
>>> mass = 80.
>>> height = 1.90
>>> BMI = mass/height**2
>>> print(’With mass’, mass, ’and length’, height, ’the BMI is’, BMI)
With mass 80.0 and length 1.9 the BMI is 22.1606648199446
By defining two identifiers with readable names, the whole programming fragment becomes easy to read. Note that no parentheses () are needed around height**2, as
the power operator ** has a higher priority than the division operator /.
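The effect of operator priority can be checked directly. The sketch below reuses the example values from the text; the extra variable names and the use of the built-in round function are additions for illustration.

```python
mass = 80.0      # body mass in kilograms (example value from the text)
height = 1.90    # height in meters (example value from the text)

# ** binds more tightly than /, so both expressions give the same result.
bmi = mass / height**2
bmi_explicit = mass / (height**2)
print(bmi == bmi_explicit)

# Dividing first and squaring afterwards is a different calculation.
wrong = (mass / height)**2
print(wrong)

# round() limits the number of decimals that is shown.
print('The BMI is', round(bmi, 1))
```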
2.4 Exercises 1–6
Exercise 1:**
To be able to do the exercises, we assume you have Anaconda 4.4 installed (with Python
version 3.6.1) and opened the Spyder python development environment via Anaconda
Navigator.
If so, you can skip the next paragraph. If you do not have it installed yet, the next
paragraph leads you through the installation process.
In order to install Anaconda, use a web browser to go to https://www.anaconda.com/download/#windows.
Click on Download to download the Python 3.6 version. An
installer will then be downloaded. Once the download has finished, which may take a while, run the
installer and follow its instructions until Anaconda is installed.
Once Anaconda is installed, start the Anaconda Navigator (e.g. by searching for Anaconda
Navigator at your Windows start screen, via browsing the Apps screen, or in older versions
of Windows via Start>All Programs>Anaconda3>Anaconda Navigator). Subsequently, in the
Anaconda Navigator, launch Spyder by clicking the appropriate launch
button. A ’Spyder’ window will pop up, which shows a Python prompt in the bottom right
panel. We will start by using Python interactively (as a calculator). To use
Python interactively, we type commands directly at the Python prompt. This prompt
looks like In [1]:. Type the following lines at the prompt, one after the other, each followed
by an Enter:
5*40
1.25*7
100/25
106/25
106//25
106.0/25
106.0//25
100/5*5
100/(5*5)
2**10
3*2**3
(3*2)**3
After entering a line the result should appear on the screen. Are the results as you
expected? Why (not)?
Exercise 2:**
Calculations often become much more readable if, rather than using the values directly,
we store those in variables and calculate with those. To calculate for instance the
number of possible DNA sequences of length 10, type at the Python prompt the following
commands:
nrbases = 4
seqlength = 10
nrbases**seqlength
Is the number of possibilities indeed reported? Apart from writing the result to screen,
it is also possible to store the result in a new variable. In order to do so, we type:
nrpos = nrbases**seqlength
The result can then be shown by inspecting the contents of the variable at the command
line:
nrpos
or using the print function:
print(nrpos)
The result can also be used in further calculations. For instance, if we would like to
know what the probability is for a randomly generated sequence of length 10 to be
’AAAAAAAAAA’, we have to divide 1 by the number of possible sequences. Calculate
and display this probability.
Exercise 3:**
Which of the following names are valid Python identifiers? If a name is not valid, then
show, for instance by trying to execute an assignment statement, the error message that
is generated by the Python interpreter.
a) whatsinaname
b) whats in a name
c) Whats_in_a_name
d) 5600MB
e) yo!u
f) I
g) HelloYou
h) Hello;
i) varName
j) what?name
Exercise 4:**
The left-hand side of the Spyder window contains a large editor. In this editor a file has
already been opened (untitled0.py). Enter in this window, below the information string
that is already present, the Python commands making up your program:
nrbases = 4
seqlength = 10
nrbases**seqlength
nrpos = nrbases**seqlength
nrpos
print(nrpos)
You can run your program by selecting the item ’Run’ in the Run menu, by pressing
the function key F5, or by pressing the green triangle in the tool bar.
Upon running your program, Spyder will first ask you to save your program. A File
Dialog will pop up that allows you to save your program. Save your file as exercise4.py,
i.e. with the extension .py. It is advised to store your programs (the solutions of your
exercises) on a D: (data) drive or in your documents folder in a well-organised way, e.g.
in a new folder named D:\courses\8CA10.
Once you saved your program, a second window will pop up, i.e., the ’Run settings
window’. Mark under the header ’General settings’ the option ’Clear all variables before
execution’, and subsequently press the button ’Run’.
Now your program is actually executed, and the output of your program is reported at
the Python prompt (bottom right panel of the Spyder window). How many times is the
value of nrpos reported?
A first advantage of running programs this way is that you can easily rerun them, without
having to type all commands once more. If you do so, the program runs immediately,
without popping up any windows. If you made any changes to your program, these are
saved automatically under the same name, overwriting your old version. To save your
program under a different name, use the ’Save as...’ item in the File menu or
press the key combination Ctrl+Shift+S.
A second advantage of storing your programs as files on your hard drive is that you can
reuse them at another moment. E.g. close the program exercise4.py you just saved
by selecting ’Close’ in the File menu or clicking on the cross next to the file name, and
subsequently open the file again in the editor (either using the item Open... in the File
menu, or the key combination Ctrl+O).
Exercise 5:**
(a) Create a new file (using the New file ... option in the File menu or using the key
combination Ctrl+N), and write, by substituting the proper calculation at the dots
in the program fragment below, a program to calculate the total DNA mass in an
average human:
genomelength = 3.2e9     # number of base pairs per cell
nrcells = 4e13           # number of cells
massperbasepair = 660    # grams per mole per base pair
Na = 6.022e23            # number of molecules per mole (Avogadro's number)
totalDNAmass = ...
print(’approximate DNA mass one human:’, totalDNAmass, ’grams’)
(b) How much is that in kilograms?
Exercise 6:**
Given is a string s. As an example you may take
s = ’AAACGAACGTAGGATCAAGTAGGCAAAAAG’
(a) print the first character of s
(b) print the last character of s
(c) print the string using 10 characters per line and a space after the 5th character, i.e.:
AAACG AACGT
AGGAT CAAGT
AGGCA AAAAG
Chapter 3
Lists and repetition
3.1 Lists
Apart from single data elements, we often deal with multiple elements in a collection. We have multiple files of different types (txt, py, dat, etc.) in a folder, multiple
songs on an mp3 player, and several courses to follow in a semester. Hence we
need to handle collections of all sorts of objects, and sometimes these collections are not
even homogeneous, meaning that they may contain objects of different types. Python
provides several predefined data types that can manage such collections. One of the
most used is the list.
List creation
Lists are ordered collections of objects of different sorts. To create and access a list in
Python we use square brackets. You can create an empty list by using a pair of square
brackets with nothing inside, or create a list with contents by separating the values with
commas inside the brackets:
>>> emptyl = [] # creation of an empty list
>>> emptyl
[]
>>> gene = [’insulin’, 3630, 333, ’Homo sapiens’]
>>> gene
[’insulin’, 3630, 333, ’Homo sapiens’]
An empty list can also be generated using the function list():
>>> m = list() # creation of an empty list
>>> m
[]
The length of a list m, i.e., the number of items the list contains, can be obtained by:
len(m).
>>> len(gene)
4
The important thing to remember is that lists are just sequences of objects and that
each object in the list has a position.
Position  0          1     2    3
Object    'insulin'  3630  333  'Homo sapiens'
In Python, the position counting always begins with the number 0!
The objects are accessible using their position (i.e. using the index number) in the
ordered collection, starting at position 0. Once you create a list, you can use the
position to get any object you want from the list. All you need to do is put the position
inside the brackets next to the variable name.
>>> gene[2] # select the third object of the list
333
>>> gene[1] # the second object
3630
Using a position equal to or larger than the length of the list results in an error:
>>> gene[100] # Get object number one hundred and one of gene
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
IndexError: list index out of range
You can also replace, remove or insert individual elements of a list using the index
numbers. Replacing an element is done by assigning to its position:
>>> gene = ['insulin', 3630, 333, 'Homo sapiens']
>>> gene
['insulin', 3630, 333, 'Homo sapiens']
>>> gene[0] = 1
>>> gene
[1, 3630, 333, 'Homo sapiens']
>>> len(gene) # How many elements are there in list gene?
4
The new value may also be computed from the old value of the element:
>>> gene = [’insulin’, 3630, 333, ’Homo sapiens’]
>>> gene[2]
333
>>> gene[2] = gene[2] + 23
>>> gene
[’insulin’, 3630, 356, ’Homo sapiens’]
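A short sketch of both kinds of replacement on the example list from the text. The new values are arbitrary and serve only as illustration.

```python
gene = ['insulin', 3630, 333, 'Homo sapiens']

# Replace one element by assigning to its position.
gene[3] = 'H. sapiens'

# Update an element using its old value.
gene[1] = gene[1] * 2

print(gene)
print(len(gene))   # replacing elements does not change the length
```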
Methods of lists
Given a list l the following methods can be applied to l. All of these methods operate
on the list l itself and do not return a new modified list, but modify l directly.
• l.append(x)
Add an item to the end of the list.
• l.extend(L)
Extend the list l by appending all the items in the given list L.
• l.insert(i, x)
Insert item x at given position i. The first argument is the index of the element
before which to insert, so l.insert(0, x) inserts at the front of the list, and
l.insert(len(l), x) is equivalent to l.append(x).
• l.remove(x)
Remove the first item from the list whose value is x. An error will be raised, if the
item is not in the list.
• l.pop([i])
Remove the item at the given position in the list, and return it. If no index is
specified, l.pop() removes and returns the last item in the list.
• l.sort()
Sort the items of the list, in place.
• l.reverse()
Reverse the elements of the list, in place.
The following methods do not change the list l, but return an integer.
• l.index(x)
Return the index in the list of the first item whose value is x. An error will be
raised, if the item is not in the list.
• l.count(x)
Return the number of times x appears in the list.
An example that uses some of the list methods:
>>> a = [66, 333, 333, 1, 1234]
>>> a.insert(2, -1)
>>> a.append(333)
>>> a
[66, 333, -1, 333, 1, 1234, 333]
>>> a.count(333)
3
>>> a.index(333)
1
>>> a.remove(333)
>>> a
[66, -1, 333, 1, 1234, 333]
>>> a.sort()
>>> a
[-1, 1, 66, 333, 333, 1234]
The latter clearly shows that the sort method changes the list a in place and does not
return a new list. Thus do NOT use
>>> a=a.sort()
>>> a
because sort returns the Python object None; a then refers to None and the
original list is lost. The same is true for the list method reverse.
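The sketch below makes the return value of sort visible and also exercises extend, pop and reverse, which the example above does not use. The list contents and variable names are chosen for illustration.

```python
a = [66, 333, 1, 1234]

# sort() reorders the list in place and returns None.
result = a.sort()
print(result)
print(a)

# extend() appends all items of another list; pop() removes and returns an item.
a.extend([5, 7])
last = a.pop()      # removes and returns the last item
first = a.pop(0)    # removes and returns the item at position 0
print(last, first, a)

# reverse() also works in place.
a.reverse()
print(a)
```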
List concatenation and repetition
Lists also have operators such as + and * for concatenation and repetition, respectively.
>>> li = ['gene', 3630]
>>> li = li + ['insulin', 333]
>>> li
['gene', 3630, 'insulin', 333]
>>> li = ['gene', 3630] * 3
>>> li
['gene', 3630, 'gene', 3630, 'gene', 3630]
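Unlike the list methods described earlier, the operators + and * build a new list and leave their operands unchanged. A small sketch with made-up list contents:

```python
exons = ['exon1', 'exon2']
more = ['exon3']

# + builds a new list; the operands themselves are unchanged.
combined = exons + more
print(combined)
print(exons)

# * repeats the contents, which is handy for creating a list of a given length.
counts = [0] * 5
print(counts)
print(len(counts))
```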
Negative indices
Above we have seen that when accessing objects, it is forbidden to access a position that
does not exist, i.e., an index larger than or equal to the length of the list results in an error.
A nice feature, however, is that you can also use negative numbers for indexing. The last
object of a list has the index −1, the last but one −2, etc.:
Position  0          1     2    3
Object    'insulin'  3630  333  'Homo sapiens'
Position  -4         -3    -2   -1
So if m is the list [1, ’nr two’, 5], then
>>> m = [1, 'nr two', 5]
>>> n = len(m)
>>> n
3
>>> m[n-1]
5
>>> m[-1]
5
>>> m[0]
1
>>> m[-len(m)]
1
>>> m[n-len(m)]
1
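In other words, for a list of length n the negative index -k selects the same object as the index n-k. A brief sketch using the gene list from the table above:

```python
gene = ['insulin', 3630, 333, 'Homo sapiens']
n = len(gene)

print(gene[-1], gene[n - 1])   # both select the last object
print(gene[-2], gene[n - 2])   # both select the last but one object
print(gene[-n], gene[0])       # both select the first object
```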
The range method
Python has several built-in functions for generating lists. One example, of which we
show here just the simplest form, is range(stop). In this form the range function has one integer
argument, called stop. It returns a built-in range object that can be converted to a list
of plain integers, i.e., list(range(stop)) yields the list of plain integers
[0, 1, ..., stop-1].
Later, in section 4.3, we will give a more extensive description of the range function,
which also allows for start values different from zero and step sizes different from 1.
3.2 The for statement
The for loop makes it possible to iterate over an ordered collection of objects and to execute the
same sequence of statements for each element.
Example:
>>> str1 = 'Biomedical Engineering'
>>> str2 = "Programming and genomics course 8CA10"
>>> str3 = "Python is fun"
>>> strlist = [str1, str2, str3]
>>> for s in strlist:
...     print(s)
...     print(len(s))
...
Biomedical Engineering
22
Programming and genomics course 8CA10
37
Python is fun
13
The two print statements
...     print(s)
...     print(len(s))
form a so-called block and the two statements both have four spaces of indentation.
A block is a structural element of a program that is used to group statements. All
statements in the block have the same indentation, i.e., the same number of spaces in
front of them.
There is no absolute rule for the size of the indentation but the standard and the
preferable style is to use four spaces for each level of indentation.
The meaning of the for is that for each element in the list the two print statements
are to be executed. After finishing the last element of the list, the interpreter continues
with the first statement after the block.
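The difference between statements inside and after the block can be seen in the following sketch, which adds up the lengths of the strings in a list. The variable total and the shortened list are chosen for illustration.

```python
strlist = ['Biomedical Engineering', 'Python is fun']

total = 0
for s in strlist:
    # These two indented statements form the block; they run once per element.
    print(s)
    total = total + len(s)

# Not indented, so this runs once, after the last element has been processed.
print('Total number of characters:', total)
```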
Another frequently used for construction applies the range function. The range function
generates a built-in range object that delivers a sequence of integers to the
for loop, which then iterates over those integers. Running
print('A table of the first 11 integers and their squares')
for i in range(11):
    print(i, i*i)
print('End of table')
thus gives
A table of the first 11 integers and their squares
0 0
1 1
2 4
3 9
4 16
5 25
6 36
7 49
8 64
9 81
10 100
End of table
The range method can be used to generate the same result as in the first example in
this section by generating a list with the indices of the list strlist, looping over the
elements in that list and using those indices to access the items in the list:
>>> str1 = 'Biomedical Engineering'
>>> str2 = "Programming and genomics course 8CA10"
>>> str3 = "Python is fun"
>>> strlist = [str1, str2, str3]
>>> for i in range(len(strlist)):
...     s = strlist[i]
...     print(s)
...     print(len(s))
...
Biomedical Engineering
22
Programming and genomics course 8CA10
37
Python is fun
13
Though the code is now slightly longer, the advantage is that (contrary to the prior case)
within the loop the index that is currently being processed (i) is known.
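One use of the known index is numbering the items. The sketch below does this with range(len(...)) and, for comparison, with the built-in function enumerate, which is not covered in these notes and is shown here only as a possible alternative. The list contents and variable names are illustrative.

```python
strlist = ['Biomedical Engineering', 'Python is fun']

# Index-based loop: i is available inside the block, e.g. to number the output.
numbered = []
for i in range(len(strlist)):
    numbered.append(str(i) + ': ' + strlist[i])
print(numbered)

# Alternative using the built-in enumerate, which delivers index and item together.
numbered2 = []
for i, s in enumerate(strlist):
    numbered2.append(str(i) + ': ' + s)
print(numbered == numbered2)
```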
3.3 Modules and the import statement
All over the world many Python programs have been and are being designed. An
installation of Python usually includes a large number of additional components that
are collected in modules. To make use of them, Python has the import statement. An
import statement imports a module, i.e., a piece of code contained in a file.
For instance, by importing the module math we have access to mathematical constants
and functions such as pi, e, exp, log and sqrt.
>>> import math
Access to its components such as variables and functions is obtained by using the construct modulename.varname:
>>> print(math.pi)
3.141592653589793
A similar notation is employed when referring to a function from a module, first the
name of the module, then a dot, ’.’, and ending with the function name. In short the
notation is: modulename.functionname
>>> math.exp(1)
2.718281828459045
>>> math.sqrt(3)
1.7320508075688772
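A few more uses of the modulename.name notation are sketched below. The radius is a hypothetical example value.

```python
import math

radius = 2.0   # hypothetical example value

# Constants and functions are accessed as modulename.name.
area = math.pi * radius**2
print(area)

# log is the natural logarithm, so it undoes exp (up to rounding).
print(math.log(math.exp(1)))

# sqrt(x) squared gives back x, up to floating-point rounding.
print(math.sqrt(3)**2)
```

Because floats have limited precision, the last result may differ from 3 in its final digits.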
3.4 Exercises 7–12
Exercise 7:**
Show a programming fragment and its result for each of the following actions:
(a) Construct an empty list and assign it to the variable m.
(b) Add the element 7 to the list m.
(c) Print the length of m.
(d) Extend in two different ways the list m with [1, 2, 3, 1].
(e) Print the length of m.
(f ) Change the third element of m to 4.
(g) Remove the first occurrence of 1 from m.
(h) Print the length of m.
(i) Remove the last element from m.
(j) Print the length of m.
(k) Show the contents of m[-1].
(l) Show the contents of m[len(m)-1].
(m) Show the contents of m[-len(m)].
Exercise 8:**
The aim of the exercises below is to get familiar with the range method and the for statement.
(a) Assign to variable l a list consisting of the first 20 integers starting counting at
zero.
(b) Print list(range(len(l))).
(c) Execute the following programming fragment and explain the output:
for x in l:
    print(l)
(d) Execute the following programming fragment and explain the output:
for x in l:
    print(x, 2*x)
(e) Execute the following programming fragment and explain the output:
for x in l:
    print(x)
(f) Execute the following programming fragment and explain the output:
for i in range(len(l)):
    print(l[i], 2*l[i])
(g) Execute the following programming fragment and explain the output:
print("Start")
for i in range(len(l)):
    print(l[-i])
    print("Finished")
(h) Adapt the programming fragment of (g) such that "Finished" is only printed once
at the end.
(i) Print the elements of l in reverse order.
Exercise 9:**
(a) In order to predict the growth of a cell population that initially (at t=0) consists
of ten cells and in which each cell replicates every hour while no cells die, we could
write the following programming fragment:
n=10
for t in range[5]:
    print(t, n)
    n = 2*n
print('After 6 hours the number of cells is', n)
This programming fragment, however, is not yet completely correct. Correct the
errors in this fragment in such a way that the correct number of cells after 6 hours
is printed (i.e., at t=6).
(b) Adapt the programming fragment of (a) taking into account that each cell still
replicates every hour but that after each replication cycle 5 cells die. How many
cells are there after 1 day (24 hours)?
Exercise 10:**
One of the libraries that comes with your Python installation is turtle. Given is the
following python fragment
# import the turtle library
import turtle
d=100
# Lift pen up
turtle.up()
# Move to the point with x and y coordinates -d/2 and d/2, respectively
turtle.goto(-d/2,d/2)
# Pull the pen down
turtle.down()
# Draw something
for i in range(4):
    turtle.forward(d)
    turtle.right(90)
# activate the window that pops up
turtle.mainloop()
(a) Run the program. What happens?
(b) Write a python fragment to generate a figure like in Fig. 3.1a
(c) Write a python fragment to generate a figure like in Fig. 3.1b
(d) Write a python fragment to generate a figure like in Fig. 3.1c
Full info on turtle can be found at https://docs.python.org/3.6/library/turtle.html
Exercise 11:**
Design a python fragment that prints a triangle of stars (’*’) with k stars as basis and k
stars as height. k should be an integer parameter of the method and between two stars
a space should be printed. For k=4 the output should look as follows:
*
* *
* * *
* * * *
Figure 3.1: Figures for turtle exercises.
Exercise 12:
In Exercise 10 the turtle library has been introduced. Write a python fragment, using
this library and nested for-loops, to draw four rows with five hexagons each, i.e., a
figure like in Fig. 3.1d.
Chapter 4
Methods, slicing, random and plotting
4.1 Methods and invocations
In the previous chapters we have introduced the standard types integer, float, and
string, as well as some operations, such as addition and multiplication, that can be
applied to them. Moreover, lists have been introduced as an example of a structured data type.
When discussing operations on lists, we have, without putting emphasis on it,
in fact shown the standard object-oriented notation for performing an action
on an object, also called invoking a method. In a Python program we write such a
method invocation by first writing the name of the object followed by a period (called
dot in computer jargon), followed by the method name, and parentheses that may have
arguments inside them. These arguments (possibly none!) provide the information
needed by the method in order to carry out its action.
Examples of this notation applied to lists l, L, an element x, and integer i are:
l.append(x)
l.extend(L)
l.insert(i, x)
l.remove(x)
We have also introduced the pop-method but because it uses a special notation we give
it additional attention in the next section.
Optional arguments
If l is a list, we can remove the last element from the list by
l.pop()
but in general we can also remove an element at another position in the list. If index i
satisfies 0 ≤ i < len(l), then
l.pop(i)
will remove the item at position i from the list.
Instead of having to give two separate definitions, they can be combined as follows:
l.pop([i])
The square brackets indicate that the argument is optional; they are part of the notation
used in documentation and are not typed in the actual call. So the pop-method
may have zero or one argument. In general the term for this construction is that a
method may have optional arguments. This is one of the nice features of Python: only
one definition is needed, and the number of arguments supplied determines which behaviour is
selected. This construct of a varying number of arguments will be used at
more places in these notes, for instance in the next section.
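Both forms of the pop method are shown in the sketch below; the list contents are chosen for illustration.

```python
l = ['a', 'c', 'g', 't']

last = l.pop()      # no argument: removes and returns the last item
print(last, l)

second = l.pop(1)   # one argument: removes and returns the item at position 1
print(second, l)
```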
4.2 The dir and help methods
In the previous section and chapter a number of built-in methods on lists have been
described. One can obtain a complete list of all methods of any object (thus also a list)
by giving the dir([object]) command, where the square brackets indicate that the
argument is optional:
• dir([object])
Return an alphabetized list of names comprising (some of) the attributes of the
given object, and of attributes reachable from it.
So dir([]) gives a complete enumeration of all methods that can be applied to lists.
Additional information about these methods can be obtained by entering help([]),
or for instance help([].pop) for specific information on the pop method (such as its
optional argument).
• help([object])
Enter the name of any object to get help on its usage.
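Since dir returns an ordinary list of names (as strings), its result can be inspected like any other list. A minimal sketch; the name 'push' is merely an example of a name that lists do not have.

```python
# dir returns a list of attribute names as strings.
names = dir([])
print(len(names))

print('pop' in names)       # a list method described above
print('reverse' in names)   # another one
print('push' in names)      # lists have no method called push
```

The exact number of names may differ between Python versions.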
4.3 The range method
One of the built-in functions that already has been mentioned is the range function. It
returns a built-in range object, which is a representation for a regular series of integer
numbers. It has one mandatory integer argument, called stop and two optional ones,
called start and step respectively:
range([start,] stop[, step])
If the step argument is omitted, it defaults to 1. If the start argument is omitted, it
defaults to 0. The full form returns a range object that resembles the list of plain integers
[start, start + step, start + 2 * step, ...]. If step is positive, the last element is the largest
start + i * step less than stop; if step is negative, the last element is the smallest start +
i * step greater than stop. A value of zero is not allowed for step.
The built-in range object can be converted to a true list of integers using the list()
function. Examples:
>>> l=range(11)
>>> print(l)
range(0, 11)
>>> print(list(l))
[0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10]
>>> l=range(1, 11)
>>> list(l)
[1, 2, 3, 4, 5, 6, 7, 8, 9, 10]
>>> l=range(1, 11, 3)
>>> m=list(l)
>>> print(m)
[1, 4, 7, 10]
>>> print(l)
range(1, 11, 3)
>>> l=range(11, 2, -2)
>>> list(l)
[11, 9, 7, 5, 3]
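Some boundary cases of these rules are sketched below, with argument values chosen for illustration.

```python
# stop is never included, whatever the sign of step.
print(list(range(0, 10, 5)))     # 10 itself is left out
print(list(range(10, 0, -5)))    # 0 itself is left out

# A range that cannot reach stop in the direction of step is simply empty.
print(list(range(5, 1)))

# A range object knows its length without being converted to a list.
print(len(range(1, 11, 3)))
```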
4.4 Slicing
Apart from constructing new lists, it is quite common to select parts of a list. In
particular slices, i.e., all elements from a list in between a start and a stop index, also
called a consecutive part of the list, occur frequently.
Examples:
>>> gene = [’insulin’, 3630, 333, ’Homo sapiens’]
>>> gene[1:3]
[3630, 333]
>>> gene[0:4]
[’insulin’, 3630, 333, ’Homo sapiens’]
>>> gene[1:1]
[]
>>> gene[1:len(gene)]
[3630, 333, ’Homo sapiens’]
>>> gene[1:4]
[3630, 333, ’Homo sapiens’]
Slice indices have useful defaults; an omitted first index defaults to zero, an omitted
second index defaults to the size of the list being sliced:
>>> gene = [’insulin’, 3630, 333, ’Homo sapiens’]
>>> gene[:4] # the first 4 items
[’insulin’, 3630, 333, ’Homo sapiens’]
>>> gene[1:] # everything except the first item
[3630, 333, ’Homo sapiens’]
In contrast to indices, which lead to errors when they are not in the range from -len(l)
up to and including len(l)-1, degenerate slice indices are handled gracefully: an index that is too large
is replaced by the size of the list, and an upper bound smaller than the lower bound returns
an empty list.
>>> gene = [’insulin’, 3630, 333, ’Homo sapiens’]
>>> gene[1:100]
[3630, 333, ’Homo sapiens’]
>>> gene[3:1]
[]
Indices may be negative numbers, to start counting from the right. For example:
>>> gene = [’insulin’, 3630, 333, ’Homo sapiens’]
>>> gene[-2:] # The last two items
[333, 'Homo sapiens']
>>> gene[:-2] # Everything except the last two items
['insulin', 3630]
As with range, a step value can be used in slicing:
l[start:stop:step]
and the default value for step is 1.
If we would like to select the list elements with an even index, from index 10 onwards,
we can do so with l[10::2].
Another useful application is also to leave the start and end position open and use as
step value minus one:
l[::-1]
This generates a new list with all elements of l, but in reverse order.
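The sketch below applies slices with a step value to a list of the first 20 integers, so that each element equals its own index. The list and the variable names are chosen for illustration.

```python
l = list(range(20))

print(l[10::2])    # even indices from index 10 onwards
print(l[1:10:3])   # start at index 1, stop before index 10, in steps of 3

# A step of -1 yields a reversed copy; the original list is left unchanged.
r = l[::-1]
print(r[:3])
print(l[:3])
```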
4.5 Random numbers
In the programs we have seen so far, fixed values were assigned to variables. As a
result, those programs produce exactly the same result each time you run them, unless
you change those values. Python also has some modules to generate pseudo-random
numbers, which make it possible for each run to behave differently. In following chapters we will
encounter a number of applications.
To generate random numbers, one first needs to import the library:
>>> import random
This library can then generate a pseudo-random float between 0 (inclusive) and 1 (exclusive) using
>>> random.random()
0.11619275312381916
The random number reported above is different each time. A next call may thus for
instance yield:
>>> random.random()
0.7106072247075303
The same library also provides the possibility to generate pseudo-random integers using
randint(minval,maxval). This yields a random integer between the
lower bound minval and the upper bound maxval, both included. For example, with lower bound 1 and
upper bound 6 this behaves like a virtual die:
>>> random.randint(1,6)
2
>>> random.randint(1,6)
6
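As a possible application, the sketch below builds a random DNA sequence by using randint to pick an index into the string 'ACGT'. It also uses random.seed, a function of the random module that is not discussed in these notes, to make a run reproducible; the seed value and the variable names are arbitrary choices.

```python
import random

# Fixing the seed makes the pseudo-random numbers reproducible between runs
# (the seed value 8 is arbitrary).
random.seed(8)

bases = 'ACGT'
sequence = ''
for i in range(10):
    # randint includes both bounds, so the index runs from 0 up to and including 3.
    sequence = sequence + bases[random.randint(0, 3)]
print(sequence)

# Re-seeding with the same value reproduces exactly the same sequence.
random.seed(8)
again = ''
for i in range(10):
    again = again + bases[random.randint(0, 3)]
print(again == sequence)
```

Without the calls to random.seed, each run would normally produce a different sequence.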
4.6 Plotting data using matplotlib
In many cases it is helpful to make a plot of a data set. Python offers
many possibilities for this, through plotting facilities and interfaces to several packages
meant for mathematical data analysis. Very useful in this respect is matplotlib, a Python
2D plotting library which produces publication-quality figures. In particular, matplotlib
has a module pyplot, which is a collection of command-style functions that make matplotlib work like MATLAB.
In order to use the library, one needs to start by importing it using
import matplotlib.pyplot
If the data that is to be plotted is in a list l
l = [1,4,9,16,25,36,49,64,81,100]
the plot is created by the command
matplotlib.pyplot.plot(l, ’ro’)
In most Python distributions, the plot will not be shown immediately, but will only be
shown in a separate window at the screen after the command:
matplotlib.pyplot.show()
In the Spyder environment, however, the default option is that the plot is immediately
shown inline, i.e., in the Python console. To have figures shown in a
separate window instead, one can type at the Python prompt
%matplotlib auto
to change the behaviour for your current session, or change the default setting via the
Tools menu, i.e., go to Tools/Preferences/IPython console/Graphics/ and change the
Backend option from Inline to Automatic, for a more permanent solution.
By the above plot command the values of l are plotted in the color red, the ’r’, and
with the line or marker style ’o’, standing for the circle marker. The result is a plot as
in Figure 4.1a. Other common colors are black (’k’), blue (’b’), green (’g’), yellow (’y’),
magenta (’m’) and cyan (’c’). Other common marker styles are stars (’*’) and plusses
(’+’), while instead of markers also lines could be used, e.g., solid lines (’-’), dashed lines
(’--’), or dotted lines (’:’). Also, markers and lines could be combined. For instance,
’g*-’ yields green stars combined with solid lines.
The plot command has many more possibilities. If one also has a list m (e.g. m =
[0,25,50,75,100]) one can create a plot in which both lists are plotted by:
matplotlib.pyplot.plot(l, ’ro’, m, ’b*-’)
where we plot the contents of m in the color blue, the ’b’, with stars, the ’*’, and as
a solid line, the ’-’. The result is a plot as in Figure 4.1b. In that plot, labels have also been
added next to the axes. This is obtained using
matplotlib.pyplot.xlabel('my x-label')
matplotlib.pyplot.ylabel('my y-label')
Figure 4.1: Example of plots with matplotlib.
In the plots so far, the values of the elements in the lists determined the vertical position
of the data points. The horizontal positions were determined by their position (index)
in the list. A more general version of the plot-command uses two lists, as in:
matplotlib.pyplot.plot(x, y, ’ro’)
in which x and y have to be two lists of the same length. The values in x then determine
the horizontal positions and those in y the vertical positions.
Here we only touched on some basic functionalities of pyplot. More details can be found
on https://matplotlib.org/ and https://matplotlib.org/tutorials/introductory/
pyplot.html.
4.7 Exercises 13–18
Exercise 13:
Let l = [5,3,1,8,5,9,3,8,5,8,5,0,4,6,5,9,7,6,8,10]
(a) Print the contents of the list l
(b) Print the contents of the list l such that both the index of the element in the list
and the element itself are shown, one element per line, i.e.,
0 5
1 3
2 1
3 8
. .
. .
(c) As you could see, all values in the list are between 0 and 10 (both inclusive). Design
a Python program that prints for all integers between 0 and 10 (both inclusive) the
number of times the integer occurs in the list l
TIP: Think how you could do this using pen and paper. That is, start with an
empty sheet with the numbers 0, 1, through 10 below each other and mark each
occurrence when going through the list (so-called tallying).
Exercise 14:*
(a) For a list of scores l its average is given by the sum of its elements divided by the
number of elements the list contains. For example, for l=[0,2,8,10] the average
is (l[0]+l[1]+l[2]+l[3])/4, i.e. (0+2+8+10)/4=5.
More generally, for a list l of arbitrary length n = len(l), the calculation of the
average score can be written as
$$\bar{l} = \frac{1}{n} \sum_{i=0}^{n-1} l[i]$$
The formula $\sum_{i=0}^{n-1} l[i]$ means: calculate the sum of l[0], l[1], ..., l[n-1]. So when
n = len(l), all elements of l are summed.
Design a Python program that calculates and prints the average score. Apply this
to the list l of the previous exercise.
(b) The variance $\sigma^2$ is defined by
$$\sigma^2 = \frac{1}{n-1} \sum_{i=0}^{n-1} (l[i] - \bar{l})^2 .$$
Extend the program of (a) such that it also calculates and prints the variance of
the scores.
Exercise 15:
In this exercise we will make a plot of some data using matplotlib.
(a) Plot the data in the list l used in the previous exercises using the plot command
from the matplotlib.pyplot library. When plotting the data, a plot style can be
specified. What is the difference between the styles ’ro’ and ’k*’?
(b) Construct a Python program by which the contents of the list l are plotted and by
which in the figure in addition a horizontal line is shown with as value the mean of
l.
(c) Extend the Python program of (b) such that it plots two additional horizontal lines
indicating the standard deviation, i.e., one line at mean+variance**0.5 and one at
mean-variance**0.5. Use cyan colored dashed lines using style ’c--’, and add
proper labels at the axes.
Exercise 16:**
Construct the following lists, exclusively using the range and list functions:
(a) a list of the integers 0 . . . 10, hence inclusive both 0 and 10.
(b) a list of the integers 1 . . . 10.
(c) a list of the integers 1 . . . 20 that are multiples of 4
(d) an increasing list of the integers −25 . . . 20 that are multiples of 3.
(e) a decreasing list of the integers −25 . . . 20 that are multiples of 3.
Exercise 17:*
Let n be a multiple of 10 larger than 1000. Construct the following lists only making
use of the range and list functions and the concatenation operator:
(a) [n+10, n+9, .., 2, 1, 0, 0, 5, .., n+5, n+10]
(b) [n, n, (n-5), (n-10), .., 10, 5, 0, -5, -10, ..,-(n-5), -n, -n]
(c) [-n, -(n-1), .., -2, -1, 0, 0, 0, 5, 10, .., n-10]
(d) [n, -n, 0, 5, 10, .., n, -n, -(n+5)]
Exercise 18:
In Exercise 10c a single star had to be drawn making use of the turtle library.
(a) Extend the solution of that exercise to draw 50 stars at random positions (e.g. x
as well as y coordinates drawn randomly between -200 and 200).
(b) Like in (a), but now give each star a random color. The pen color can be changed
using
turtle.pencolor(r,g,b)
where r, g, b are floats between 0 and 1, specifying the amount of red, green and
blue, respectively. (0,0,0) corresponds to black and (1,1,1) to white.
(c) Run your solution of (b) multiple times.
Chapter 5
Selection methods and file/user input
5.1 Selection methods (if, elif and else)
When dealing with lists, one often encounters the problem that from the lists only
certain elements that satisfy a condition are to be considered. Programming languages
usually have a construct for that called selection. Python has the selection method if
that we introduce by an example.
The if construct
Assume that we have a list l of integers and we want to select only the positive items
from the list. Since there may be more than one such item, we choose to deliver all
the positive items in a new list called posl. In deciding what the initial value for
posl should be, we have to realize that it might even be the case that no item is positive,
so the only sensible initial value for posl is the empty list []. Since the property of being
positive is a single item property, we have to inspect each individual item from the list
on this condition. These considerations lead to the following programming fragment.
posl=[]
for x in l:
    if x>0:
        posl.append(x)
The construction we have used is more generally defined as
if condition:
    block
and means that when the condition is satisfied all actions belonging to the block are to
be performed.
The else construct
If we also want to collect the other items in another list, say negl, then a program is:
posl=[]
negl=[]
for x in l:
    if x>0:
        posl.append(x)
    else:
        negl.append(x)
This should be read as: for each item, when x is positive, it will be appended to the list
posl, and otherwise, it will be appended to the list negl.
The general construct has the form
if condition:
    block1
else:
    block2
and is called an if-else construct and its interpretation is: when the condition holds all
actions belonging to block1 are executed, and otherwise all actions of block2 are applied.
The elif construct
There is even a more general form in which conditions are subsequently inspected. Again
we demonstrate the construct first by an example. Assume that we not only want to
select the positive and non-positive items but also want to collect the elements being
zero. Along the same lines and introducing a list zerol in which all zero items are to
be put, a programming fragment is:
posl=[]
negl=[]
zerol=[]
for x in l:
    if x>0:
        posl.append(x)
    else:
        if x<0:
            negl.append(x)
        else:
            zerol.append(x)
Since such constructs occur on many occasions, Python even has a shorthand notation
for it:
posl=[]
negl=[]
zerol=[]
for x in l:
    if x>0:
        posl.append(x)
    elif x<0:
        negl.append(x)
    else:
        zerol.append(x)
and the name for it is the elif-construct. In general, more than one elif is allowed,
and the else-part is optional.
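As an illustration of the last remark, the sketch below uses two elif parts in one construct, and then a construct without an else-part. The list scores and the labels are hypothetical example data, not taken from the course material.

```python
# Hypothetical example data: scores between 0 and 10.
scores = [3, 9, 6, 10, 5.5, 7]

# Several elif parts: the conditions are inspected from top to bottom
# and only the block of the first condition that holds is executed.
labels = []
for score in scores:
    if score >= 9:
        labels.append("excellent")
    elif score >= 7:
        labels.append("good")
    elif score >= 5.5:
        labels.append("sufficient")
    else:
        labels.append("insufficient")
print(labels)

# No else-part: when none of the conditions holds, nothing is done.
high = []
for score in scores:
    if score == 10:
        high.append("perfect")
    elif score >= 9:
        high.append("nearly perfect")
print(high)
```

Note that the order of the conditions matters: a score of 10 also satisfies score >= 7, but it is labelled in the first block whose condition holds.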
5.2 Conditionals and selection
The tests introduced in the previous section (such as x>0) are comparison operations
on integers. Such comparisons can hold, or not. In other words: they are either true
or false. In Python such comparisons return a Boolean value. This type is named after
the 19th-century mathematician George Boole, who studied logic. It has only two
values: True and False.
>>> 5>3
True
>>> 5<3
False
Boolean is a subtype of integer, and Boolean values behave like the values 0 for False
and 1 for True, respectively. Their role becomes clear when we have to design programs
in which a selection has to occur whether or not a block of statements is to be executed.
Consider the following program fragment:
>>> if nrexons > 1:
...     print("Alternative splicing might occur")
In this fragment we use the if statement by which only if the condition, i.e., the boolean
expression after the if but before the colon, evaluates to True, the block following is
executed. If the condition does not hold, then the block is not executed.
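The remark that Boolean values behave like the integers 0 and 1 can be made concrete with a small sketch. The list nrexons_per_gene is hypothetical example data; the point is only that a comparison result can be added to a counter.

```python
# Hypothetical exon counts for five genes.
nrexons_per_gene = [1, 3, 2, 1, 5]

# A Boolean is also an integer: True counts as 1, False as 0.
print(isinstance(True, int))
print(int(5 > 3), int(5 < 3))

# Count the genes with more than one exon by adding comparison results.
count = 0
for nrexons in nrexons_per_gene:
    count = count + (nrexons > 1)
print(count)
```

In practice an explicit if statement is usually easier to read; the sketch only shows why the arithmetic works.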
Boolean expressions are statements, the technical term is proposition, that hold or not.
In daily life we are quite used to propositions:
1. It is raining.
2. 2+2 equals 5
3. This is a course that I like.
4. Today these notes were put on oncourse.
These are simple examples. It becomes more interesting when we make new propositions
composed from old ones:
1. It is raining and it is cold.
2. It does not rain.
3. Today or tomorrow these notes are put on oncourse.
Boolean expressions are also frequently used in programming. In this section we therefore treat them in some more detail.
Operation   Meaning
<           strictly less than
<=          less than or equal
>           strictly greater than
>=          greater than or equal
==          equal
!=          not equal
Table 5.1: General comparison operations
5.2.1 False and True
The following values are considered by the interpreter to mean False:
None 0 "" () [] {} False
Hence everything else is interpreted as True.
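A minimal sketch of this rule: each value in a list of examples is used directly as the condition of an if statement, and the values that count as True are collected. The list values is made up for this illustration.

```python
# The first seven values are the ones listed above as meaning False.
values = [None, 0, "", (), [], {}, False, 1, -1, "a", [0], " "]

truthy = []
for v in values:
    if v:                 # the value itself is used as the condition
        truthy.append(v)
print(truthy)

# A common use: test whether a list is empty.
posl = []
if not posl:
    print("no positive items found")
```

Note that [0] and " " count as True: a list with one element and a string with one space are not empty.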
5.2.2 Comparisons
The boolean expressions used so far are comparison operations on integers. Comparison
operations are not only defined for integers, they are supported by all objects. In
Table 5.1 the comparison operations are summarized.
For floats the meaning is straightforward, though for strings one has to be careful that
Python is case sensitive and capitals are considered to be smaller than small letters.
String comparisons that are True are for instance:
"aaa"=="aaa"
"a"!="A"
"aaa"<"baa"
"aaa">"Baa"
Also for other objects (like lists), comparisons like < and > are defined, though their
results may be at first sight unexpected.
Comparisons can be chained arbitrarily; for example, x < y <= z is equivalent to
x < y and y <= z.
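The sketch below illustrates three of the points made above: strings are compared character by character using character codes (obtained here with the built-in ord), lists are compared element by element from the left, and a chained comparison gives the same result as the spelled-out version with and. The example values are chosen for this illustration only.

```python
# Capitals have smaller character codes than small letters.
print(ord("B"), ord("a"))      # 66 97
print("Zebra" < "apple")       # True, because "Z" < "a"

# Lists are compared element by element, from left to right.
print([1, 2, 3] < [1, 3])      # True: decided at the second element
print([1, 2] < [1, 2, 0])      # True: a proper prefix is smaller

# A chained comparison and its spelled-out equivalent.
x = 5
chained = 0 <= x <= 10
spelled_out = 0 <= x and x <= 10
print(chained, spelled_out)    # True True
```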
Meaningless comparisons (e.g. testing whether some string is larger than some integer) are not allowed, though such values can be tested for equality:
>>> "acg"<3
TypeError: '<' not supported between instances of 'str' and 'int'
>>> "acg"==3
False
>>> "acg"!=3
True
On lists and strings an additional comparison operator in is present. On lists, this
yields True (or False) when the requested element is present in the list (or not).
>>> 4 in [0,5,3,7]
False
On strings, this yields True (or False) when the requested substring is present in the
string (or not).
>>> 'ACT' in "AAAACTT"
True
5.2.3 Boolean operations
As in everyday language, most programming languages, including Python, have further
concepts for constructing boolean expressions. There are 5 operators on boolean
operands by which larger expressions can be composed.
• Equality ==
• Inequality !=
• Conjunction: and
• Disjunction: or
• Negation: not
In order to avoid misinterpretations it is advised to use parentheses around boolean
expressions.
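A small sketch of why parentheses help: in Python, and binds more tightly than or, so an expression without parentheses may be grouped differently from how it is read at first sight. The names a, b and c are arbitrary.

```python
a = True
b = False
c = False

# Without parentheses, 'and' is evaluated before 'or'.
no_parentheses = a or b and c        # read by Python as a or (b and c)
and_first = a or (b and c)
or_first = (a or b) and c

print(no_parentheses, and_first, or_first)   # True True False
```

Writing the parentheses explicitly, as in the last two expressions, removes any doubt about the intended meaning.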
Conjunction
and    False  True
False  False  False
True   False  True
When a and b are two boolean expressions, then a and b is only True when both a and
b are True.
a      b      a and b
False  False  False
True   False  False
False  True   False
True   True   True
Equality
==     False  True
False  True   False
True   False  True
When a and b are two boolean expressions, then a==b is only True when a and b have
the same value.
a      b      a == b
False  False  True
True   False  False
False  True   False
True   True   True
Inequality
!=     False  True
False  False  True
True   True   False
When a and b are two boolean expressions, then a!=b is only True when a and b have
different values.
a      b      a != b
False  False  False
True   False  True
False  True   True
True   True   False
Disjunction (or)
When you say that a message is meant for students of mechanical or biomedical engineering, then it also applies to a student doing both mechanical and biomedical engineering.
The word or does mean either one or both.
or     False  True
False  False  True
True   True   True

a      b      a or b
False  False  False
True   False  True
False  True   True
True   True   True
Negation (not)
The negation has the meaning of not and is a unary operator.
a      not a
False  True
True   False
not a is thus True when a is False, and the other way around.
This concludes the description of the boolean expressions. In the programs that we will
design in the next chapters examples of their use will be shown.
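The truth tables of this section can also be generated by Python itself. The following is a minimal sketch that loops over all combinations of a and b and stores the result of each operator; the variable names and the column layout are choices made for this illustration.

```python
values = [False, True]

# One row per combination: a, b, a and b, a or b, a == b, a != b
rows = []
for a in values:
    for b in values:
        rows.append((a, b, a and b, a or b, a == b, a != b))

print("a      b      and    or     ==     !=")
for row in rows:
    line = ""
    for item in row:
        line = line + str(item).ljust(7)   # pad each entry to 7 characters
    print(line.rstrip())

# The negation has only one operand, so its table has two rows.
for a in values:
    print(a, not a)
```

The rows appear in the order produced by the nested loops, which differs from the order used in the tables above, but the contents are the same.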
5.3 User input (input)
When a program is run in an interactive session and the program needs some input,
the user should somehow be informed to enter the input. To that end Python has the
function input():
input([prompt])
If the prompt argument is present, it is written to standard output without a trailing
newline. The function then reads a line from input (at the Python Shell), converts it to
a string, and returns that.
Example:
myname = input("Enter your name: ")
If you respond with Peter Hilbers in the window with the command prompt,
i.e., in your window appears
Enter your name: Peter Hilbers
the result is that the variable myname gets the value "Peter Hilbers".
If numeric values are required as input instead of strings, the string (s) read could of
course always be converted into an integer or a float using the int(s) and float(s)
commands, respectively.
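A minimal sketch of such a conversion is given below. To keep the example runnable without typing anything, the string s is assigned directly; in a real program it would be the value returned by input(). The values are made up.

```python
s = "42"             # stands in for: s = input("Enter a number: ")

n = int(s)           # the integer 42
x = float("3.5")     # the float 3.5

print(n + 1)         # arithmetic on the integer: 43
print(x * 2)         # arithmetic on the float: 7.0
print(s + "1")       # on strings, + concatenates: 421

# Surrounding white space is ignored by int and float, but a string
# that does not represent a number, e.g. int("3.5") or int("abc"),
# raises a ValueError.
print(int(" 42\n"))
```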
5.4 Reading from files
Often data needs to be read from file. To that end Python has several facilities to handle
file input.
The open method
infile = open(filename)
opens the file with name filename for reading and returns a new object of type file.
Example:
>>> infile = open("sequences.seq")
>>> print(infile)
<_io.TextIOWrapper name='sequences.seq' mode='r' encoding='UTF-8'>
A file cannot be displayed like a number or a string, it however has methods for working
with the data in the file.
The readline method
infile.readline()
readline returns the next line from the file object infile.
Example:
>>> infile = open("sequences.seq")
>>> infile.readline()
'CCTACAACAATTCAATAAAATAGCTTCGCGCTAA\n'
Note the line read includes the end of line character (\n).
A Python fragment to read the first two lines from a file is:
>>> infile = open("sequences.seq")
>>> infile.readline()
'CCTACAACAATTCAATAAAATAGCTTCGCGCTAA\n'
>>> infile.readline()
'CCACGTGAACCGTTGTAACTATGTTCTGTGC\n'
When there are no more lines, readline returns the empty string. So if the file only has
two lines, then the following output is produced:
>>> infile = open("sequences.seq")
>>> infile.readline()
'CCTACAACAATTCAATAAAATAGCTTCGCGCTAA\n'
>>> infile.readline()
'CCACGTGAACCGTTGTAACTATGTTCTGTGC\n'
>>> infile.readline()
''
If the newline should not be included in the line, the rstrip() method can be used. This
method removes all white space (including new line characters) from the end of the
string it is applied to.
Example:
>>> infile = open("sequences.seq")
>>> s=infile.readline()
>>> s
'CCTACAACAATTCAATAAAATAGCTTCGCGCTAA\n'
>>> r=s.rstrip()
>>> r
'CCTACAACAATTCAATAAAATAGCTTCGCGCTAA'
The readlines method
f.readlines()
Given a file object f (for instance returned by the file open method), f.readlines()
reads using readline() until the end of the file and returns a list containing the lines thus
read.
>>> infile = open("sequences.seq")
>>> infile.readlines()
['CCTACAACAATTCAATAAAATAGCTTCGCGCTAA\n',
'CCACGTGAACCGTTGTAACTATGTTCTGTGC\n']
The read method
Instead of the readlines method we could also have used
s=infile.read()
in which the method read is applied to the object infile with as result that one string
is returned containing all characters infile consists of.
>>> infile = open("sequences.seq")
>>> infile.read()
'CCTACAACAATTCAATAAAATAGCTTCGCGCTAA\nCCACGTGAACCGTTGTAACTATGTTCTGTGC\n'
The close method
Although we have not dealt with it before since a file is automatically closed when a
program has finished, there is also a method by which a file is closed explicitly. In fact it
is strongly encouraged and considered good programming practice when objects, which
are no longer of concern, are ’dismissed’. To close a file, Python has the close-method:
infile.close()
Reading and processing of a file could thus look like
infile = open("sequences.seq")
lines = infile.readlines()
infile.close()
for line in lines:
print(len(line))
After the close(), the file itself is closed and no longer accessible, though its contents
are still present in the variable lines.
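The methods of this section can be tried out without any data file with the self-contained sketch below. As set-up it first writes a tiny two-line file; the file name example.seq, its contents, and the use of the tempfile and os modules and of writing to a file are assumptions made only so that the example runs anywhere. The second half shows the with statement, a standard Python alternative (not introduced in these notes) that closes the file automatically.

```python
import os
import tempfile

# Set-up only: create a small two-line file in a temporary folder.
folder = tempfile.mkdtemp()
filename = os.path.join(folder, "example.seq")   # hypothetical file name
outfile = open(filename, "w")
outfile.write("ACGT\nGATTACA\n")
outfile.close()

# Read all lines, close the file, then process the lines.
infile = open(filename)
lines = infile.readlines()
infile.close()

lengths = []
for line in lines:
    seq = line.rstrip()          # remove the end-of-line character
    lengths.append(len(seq))
    print(seq, len(seq))

# Alternative: the with statement closes the file when its block ends.
with open(filename) as infile:
    first = infile.readline().rstrip()
    second = infile.readline().rstrip()
    rest = infile.readline()     # no more lines: the empty string
print(first, second, repr(rest))
```

Stripping the newline before taking the length matters: len(line) would count the end-of-line character as well.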
Some elementary string methods
In the next chapter we will consider strings in much more detail. However, when reading
and processing text files a number of methods are already very useful. We will introduce
these here via an example, where s is a string containing one line of text (that might
have been read from a file).
>>> s = 'Jan is 23 years old\n'
>>> s.rstrip()
'Jan is 23 years old'
>>> s.find('is')
4
>>> m = s.split()
>>> m
['Jan', 'is', '23', 'years', 'old']
>>> m[2]
'23'
>>> int(m[2])
23
>>> float(m[2])
23.0
That is, rstrip removes white space (in this case thus the new line character) from the
end of the string, find returns the index of the first occurrence of the substring given
as argument, split returns a list with substrings that result when the string is split
on white space, and int and float convert a string into an integer and floating point
number, respectively.
5.5 Exercises 19–24
Exercise 19:**
Below an imperfect Python fragment is given. It should ask the user for an integer
number between 0 and 10, including those boundaries. Correct the code such that it
thanks the user if he/she does so, and gives its opinion on the user otherwise.
s = input("Give a value between 0 and 10: ")
value = int(s)
if (value > 0) and (value < 10)
print(’Thank you’)
print(’You fool!’)
Exercise 20:**
For this, and many of the exercises to come, data files are necessary. These files are
available on the Canvas website and should be downloaded to the same folder where
you store the Python programs (.py files) you write. Files can then be opened in your
Python programs by just specifying the file name (e.g. "sequences.seq"). This works
when the working directory is the folder of the Python file you are running (the default
behaviour when running a file in Spyder), since Python then searches inside that folder.
If you want to open files from another
folder on your hard disk, that is also possible by specifying the full path to the file (for
instance "D:\\courses\\8CA10\\sequences.seq"). (Note that the use of the double
backslashes is required.)
(a) Write a Python program that reads the file sequences.seq, and then for all lines,
prints out the line number (starting with 1) then the line itself. Make sure that no
empty lines are present between the lines printed from the file. The output should
be like
1 CCTGTATTAGCAGCAGATTCGATTAGCTTTACAACAATTCAATAAAATAGCTTCGCGCTAA
2 ATTTTTAACTTTTCTCTGTCGTCGCACAATCGACTTTCTCTGTTTTCTTGGGTTTACCGGAA
3 TTGTTTCTGCTGCGATGAGGTATTGCTCGTCAGCCTGAGGCTGAAAATAAAATCCGTGGT
4 CACACCCAATAAGTTAGAGAGAGTACTTTGACTTGGAGCTGGAGGAATTTGACATAGTCGAT
5 TCTTCTCCAAGACGCATCCACGTGAACCGTTGTAACTATGTTCTGTGC
(b) Adapt your solution for (a) such that for all lines, it prints out the line number
(starting with 1) and then that part of the line that starts with TT until and including
the first occurrence of AA. (Each line of the file sequences.seq contains a TT and a
subsequent AA, and the position of the first occurrence of a substring substr in a
string s can be obtained using pos=s.find(substr).)
Exercise 21:
Open the two files seq1.seq and seq2.seq. The files have the same number of lines.
Write a Python program that repeatedly reads a line from file seq1.seq, prints the line
to output, then reads a line of file seq2.seq, prints it, etc.
Exercise 22:*
Given is a text file BMIs.txt with on each line three fields separated by tabs. The first
field is the name of a person, the second field his/her weight, and the third field his/her
length.
(a) Design a python fragment that reads the file BMIs.txt and stores its contents in
three lists, named names, weights and lengths, respectively. The list names may
contain strings, while the other two lists should contain floats instead of strings.
The strings read from file may be converted using the command float(s), which
converts a string s into a float.
(b) Print for each of the persons his/her name, weight, length and BMI, where BMI is
defined as weight divided by the square of the length.
(c) Print for each of the persons his/her name and whether this person is, according to
the World Health Organization, ’Underweight’ (BMI below 18.5), ’Healthy weight’
(BMI between 18.5 and 25), ’Overweight’ (BMI between 25 and 30) or ’Obese’ (BMI
above 30). Output should look like
Peter is obese.
Esther is healthy weight
...
Exercise 23:
The purpose of this exercise is to design a Python program by which the positive and
the non-positive elements of a list of integers are plotted (using Matplotlib) in 2 different
colors.
(a) Write a python fragment that reads the file intl.txt and stores the integer numbers
present in that file in a list.
(b) Write a python fragment that plots the positive elements in green and the nonpositive elements in red.
When the program is run a plot like depicted in Figure 5.1 should be drawn.
Exercise 24:
Let l be a sorted list with integer numbers:
(a) Design a Python program that prints only the numbers occurring multiple times in
the list l, where each of those numbers is only printed once.
(b) Design a Python program that prints the number that occurs most frequently in the
list l. In case multiple such numbers exist, only the smallest one of those should
be printed.
Figure 5.1: The requested output of exercise 23 where positive and nonpositive elements are shown in different colors.
Chapter 6
Bioinformatics and strings
Our interest is in computer algorithms by which we can analyse genomic information.
Since this genomic information is usually represented in the form of strings, our programming language should have facilities for string manipulation. We have already seen
that Python has a standard type str for strings.
In this chapter we start with some biological background about the translation of a DNA
sequence into a protein. From it we derive what other string manipulations are needed
to investigate this translation process and we show the kind of Python statements that
are available to implement these operations.
6.1 From DNA to RNA to protein
Transcription of DNA
The main purpose of DNA, in short, is to function as a template from which the single
stranded nucleic acid called RNA (ribonucleic acid) can be transcribed. RNA is in
turn translated into the amino acid sequences for all proteins the organism needs. In
eukaryotes, the double-stranded DNA is ‘read’ in the nucleus by an enzyme called
RNA polymerase. The function of this enzyme is to open the double helix to expose
a small part of single stranded DNA sequence, and to transcribe this sequence into an
RNA strand. RNA is very similar to DNA in that it is also a long chain of nucleotides.
However there are a few differences: the sugar molecule in RNA is ribose instead of
deoxyribose and it contains the base uracil (U) instead of thymine (T). Like thymine,
uracil forms base pairs with adenine. Like replication, transcription of DNA into RNA
is done by complementary base pairing. One of the DNA strands acts as a template
for polymerase to link nucleotides to a growing RNA chain. The RNA strand is then
complementary to the template DNA strand. Thus it is a copy of the other DNA strand,
the coding strand, except for the thymines being replaced by uracil, see Figure 6.1.
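The relation between the three strands can be sketched in a few lines of Python, using the sequence shown in Figure 6.1. The dictionary pairs and the variable names are choices made for this illustration. The template strand is written here in the same left-to-right order as in the figure, i.e. from its 3’ end to its 5’ end.

```python
coding = "ACGTCGCGCAGTACATG"     # coding strand, 5' to 3'

# Complementary base pairing in DNA: A with T, C with G.
pairs = {"A": "T", "T": "A", "C": "G", "G": "C"}
template = ""
for base in coding:
    template = template + pairs[base]
print(template)                  # template strand, written 3' to 5'

# The RNA is a copy of the coding strand with T replaced by U ...
rna = coding.replace("T", "U")
print(rna)

# ... which is the same as pairing each template base with an RNA base.
rna_pairs = {"A": "U", "T": "A", "C": "G", "G": "C"}
rna_from_template = ""
for base in template:
    rna_from_template = rna_from_template + rna_pairs[base]
print(rna_from_template == rna)  # True
```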
The start of transcription is initiated by the binding of certain transcription factors to
regions upstream of a gene. Transcription factors are proteins that recognize and bind
to specific DNA sequences that are called promoters. The difference in presence of the
transcription factors is one of the means by which specific cell types regulate transcription rates of certain genes. A gene usually has a few promoter regions. One of those
regions is typically found around 30 bp upstream of the start site of transcription. It is
termed TATA box because of its high content in T and A nucleotides. The transcription
factors that bind to the TATA box help the polymerase to position at the start site, and
assist in unwinding the DNA locally to facilitate the start of transcription.
5’... A C G T C G C G C A G T A C A T G ... 3’ coding strand
      | | | | | | | | | | | | | | | | |
3’... T G C A G C G C G T C A T G T A C ... 5’ template strand
5’... A C G U C G C G C A G U A C A U G ... 3’ RNA
Figure 6.1: DNA codes for RNA.
After DNA has been transcribed the single stranded RNA, then called the transcript,
undergoes some post-processing. The transcript is longer than needed for protein synthesis. It consists of large regions that do not code for amino acids. Such regions are
called introns, while the regions coding for protein are called exons. Before the RNA
strand leaves the cell’s nucleus, the introns are separated from the exons (the regions
that are expressed) and removed by a splicing process. In eukaryotes the exons are only a
small fraction of the transcribed DNA. RNA molecules that are transcribed as a code
for an amino acid sequence are called messenger RNA, or mRNA.
Figure 6.2: Schematic view of genic regions. A gene always starts and ends with an
exon; this can be an untranslated region (UTR) or a coding sequence (CDS). A UTR
itself can have introns, as shown here in the 5’ UTR. Only the exons are found
in the mRNA.
Translation of RNA
The proteins of most living organisms are built from only 20 different amino acids.
Somehow the mRNA sequence made up of only four different nucleotides (A, U, C, and
G) codes for the arrangement of these 20 amino acids. It can be easily deduced that with
a four-letter alphabet, three-letter words suffice to code for all amino acids. Two-letter
words give only 4^2 = 16 different codes, which is too few. In three-letter words there are
4^3 = 64 different codes, which is more than enough. Most amino acids can therefore be
specified by more than one of these 'words', which are named codons. Some codons are
reserved to code for a stop sign which indicates that translation should be terminated;
they do not code for any amino acid. The codon for methionine (AUG) is also a sign for
the start of a protein coding region. The complete code for all 20 amino acids is given
in Table 6.1.
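The counting argument for two-letter and three-letter words can be checked by simply generating all words with nested for loops. This is a minimal sketch; the variable names are chosen for this illustration.

```python
bases = "UCAG"

# All two-letter words over the four-letter alphabet.
two_letter = []
for b1 in bases:
    for b2 in bases:
        two_letter.append(b1 + b2)

# All three-letter words (codons).
codons = []
for b1 in bases:
    for b2 in bases:
        for b3 in bases:
            codons.append(b1 + b2 + b3)

print(len(two_letter), len(codons))   # 16 64
print(codons[0:4])                    # ['UUU', 'UUC', 'UUA', 'UUG']
print("AUG" in codons)                # True
```

With 16 two-letter words there are too few codes for 20 amino acids, while the 64 codons leave room for several codons per amino acid and for stop signs.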
Table 6.1: Genetic Code.
1st              2nd position            3rd
position    U      C      A      G      position
U           Phe    Ser    Tyr    Cys    U
U           Phe    Ser    Tyr    Cys    C
U           Leu    Ser    STOP   STOP   A
U           Leu    Ser    STOP   Trp    G
C           Leu    Pro    His    Arg    U
C           Leu    Pro    His    Arg    C
C           Leu    Pro    Gln    Arg    A
C           Leu    Pro    Gln    Arg    G
A           Ile    Thr    Asn    Ser    U
A           Ile    Thr    Asn    Ser    C
A           Ile    Thr    Lys    Arg    A
A           Met    Thr    Lys    Arg    G
G           Val    Ala    Asp    Gly    U
G           Val    Ala    Asp    Gly    C
G           Val    Ala    Glu    Gly    A
G           Val    Ala    Glu    Gly    G
Because codons consist of 3 letters, each RNA sequence has three possible reading
frames. Not all possible reading frames lead to a product. This is usually because
reading frames either lack the sequences that initiate transcription or have an abundance
of stop codons that terminate translation before a functional protein is formed.
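The three reading frames of an RNA sequence can be made visible with slicing. The sketch below uses the RNA sequence of Figure 6.1 as example input; the names frame, frames and codons are chosen for this illustration. Incomplete codons at the end of the sequence are left out.

```python
rna = "ACGUCGCGCAGUACAUG"

frames = []
for frame in range(3):                 # start at position 0, 1 or 2
    codons = []
    # Step through the sequence three letters at a time; stopping at
    # len(rna) - 2 ensures that only complete codons are taken.
    for i in range(frame, len(rna) - 2, 3):
        codons.append(rna[i:i+3])
    frames.append(codons)
    print(frame, codons)
```

In this example only the third reading frame contains the start codon AUG.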
The portion of mRNA that is on the 5’ end, upstream of the start codon is called the
5’ untranslated region (UTR). The portion of mRNA that is 3’ downstream of the stop
codon is the 3’ UTR. See Figure 6.2 for an illustration of the general layout of a genic
region.
Genes
A part of DNA that codes for the production of a single protein is called a gene. Definitions of what part of the coding sequence is actually the gene, can differ however.
Some carry the opinion that also the promoter and other upstream transcription-factor
binding sequences should be considered as part of the gene. Generally, introns are seen
as part of a gene, but they are not part of the coding DNA, since they are not translated into amino acid sequence. The amount of coding DNA is thus smaller than the
amount of genic DNA. To give an idea: about 98.5% of the human DNA is non-coding
DNA [3][p. 202]. The bulk of DNA that serves no obvious purpose, such as most DNA
within introns and most intergenic DNA, has long been labeled ‘junk DNA’. This term is
misleading. Recent genomic research has led to the belief that some biological function
is associated with some of these regions. Therefore the more neutral term non-coding
DNA is preferred these days.
On genomes, sequences can be found that resemble known genes but cannot be translated into functional proteins. These are called pseudo genes. A pseudo gene is sometimes described as a non-functional member of a gene family. It is believed that pseudo
genes are derived from ancestral active genes, and have lost their function through mutations, often by the gain of internal stop codons.
A single region in DNA can code for more than one protein by a process called alternative
splicing. By excluding some exons a different amino acid pattern can emerge, and
therefore a different protein can be constructed. This is illustrated in Figure 6.3.
Figure 6.3: Alternative splicing. Same example gene as in figure 6.2, but now four
different mRNA products are synthesized from the same gene. Only the darker
part is translated into protein.
Functionality of genes can be regulated by methylation. Methylation is the cell’s method
of turning off certain genes. Every type of cell has its own methylation pattern so
that a unique set of proteins is expressed to perform specific functions for that cell
type. In vertebrates methylation usually occurs on cytosine at CpG sites, sites where
cytosine is followed directly by guanine. Deamination of methylated cytosine changes
it to thymine, which is a mutation that cannot be efficiently repaired. Thus over
evolutionary time scales the methylated CG sequence will be converted to TG, which
explains the deficiency of CG sequences in inactive genes. CpG islands are short stretches
of DNA in which the frequency of CG sequences is higher than other regions and they
are usually found around promoters of so called housekeeping genes, that are essential
for general cell functions, or other genes that are frequently expressed in a cell.
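Counting CG sequences in a DNA string is straightforward with the string method count. The sketch below uses the short coding strand of Figure 6.1 as example input and expresses the count as a fraction of all positions at which a two-letter sequence can start; this simple fraction is an illustrative measure chosen here, not a definition of a CpG island.

```python
dna = "ACGTCGCGCAGTACATG"

n_cg = dna.count("CG")            # number of times C is directly followed by G
n_positions = len(dna) - 1        # number of two-letter windows in the string
frequency = n_cg / n_positions

print(n_cg, n_positions, frequency)   # 3 16 0.1875
```

For real data the same count would be made on much longer stretches of DNA, so that the frequencies of different regions can be compared.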
6.2 DNA, RNA and Python strings
From the short survey about DNA, RNA and proteins we deduce that in order to
interpret genomic data we at least need the following operations on a string.
• a replacement operation: If we have a DNA string and we want to turn it into
RNA then all occurrences of T have to be replaced by U. In the translation process
from RNA to protein a three-letter word is replaced by a single amino acid letter.
• substring: If we are interested in the amino acid sequence a protein is composed
of, we need to extract the coding parts of mRNA from the noncoding ones. Exons
are consecutive subparts of the original DNA string, so an operation is needed by
which a subsequence of letters can be obtained.
• concatenation: The DNA sequence that codes for a protein usually consists of
several exons. These exons are glued together to form the complete amino acid
sequence the protein consists of.
Python indeed has such operations.
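As a first impression, the three operations listed above can be sketched as follows. The sequence is a made-up fragment and the chosen positions are arbitrary; the methods and the slicing notation are explained in the remainder of this chapter.

```python
dna = "ATGGCCCTGTGGATGCGC"     # made-up fragment, for illustration only

# replacement: turn the DNA string into an RNA string
rna = dna.replace("T", "U")

# substring: the bases at positions 4 through 9 (counting from 1)
part = dna[3:9]

# concatenation: glue two separate parts together
joined = dna[0:3] + dna[9:12]

print(rna)
print(part)
print(joined)
```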
As an example how to use these operations we consider the DNA sequence of a specific
gene, namely “INS”, that has as product insulin. In the Computational Biology group
of the BioMedical Engineering department we are doing research on the metabolic syndrome. In particular we are interested in the disease Diabetes Mellitus. Patients with
type 2 diabetes mellitus have relatively low insulin production or insulin resistance or
both. A non-trivial fraction of type 2 diabetics eventually require insulin administration
when other medications become inadequate in controlling blood glucose levels. Understanding which processes are responsible for glucose control, what is malfunctioning,
how to prevent it and how to medicate are areas of our research.
A metabolic network describing several processes involved in the insulin pathway is
shown in Figure 6.4. In this figure the genes are given in grey/green boxes with the name
of the gene on the box. In the top part the box with “INS” can be found.
Information about a gene can be found in general bioinformatics sources. One of the best
sites on the web for bioinformatics is the National Center for Biotechnology Information
(NCBI) (http://www.ncbi.nlm.nih.gov/). NCBI provides an integrated approach to
the use of gene and protein sequence information, the scientific literature (MEDLINE),
molecular structures, and related resources, in biomedicine. It provides a special search
engine called Entrez. If you know that the gene identification number for “INS” is 3630 and
you enter the following query:
http://www.ncbi.nlm.nih.gov/entrez/query.fcgi?db=gene&
cmd=Retrieve&dopt=full_report&list_uids=3630
information about the insulin gene is displayed. Both the mRNA and the genomic
information are part of the information returned. Here we show some parts of it.
mRNA            join(1..42,222..425,1213..1431)
                /gene="INS"
CDS             join(239..425,1213..1358)
ORIGIN
        1 agccctccag gacaggctgc atcagaagag gccatcaagc aggtctgttc caagggcctt
       61 tgcgtcaggt gggctcagga ttccagggtg gctggacccc aggccccagc tctgcagcag
      121 ggaggacgtg gctgggctcg tgaagcatgt gggggtgagc ccaggggccc caaggcaggg
      181 cacctggcct tcagcctgcc tcagccctgc ctgtctccca gatcactgtc cttctgccat
      241 ggccctgtgg atgcgcctcc tgcccctgct ggcgctgctg gccctctggg gacctgaccc
      301 agccgcagcc tttgtgaacc aacacctgtg cggctcacac ctggtggaag ctctctacct
      361 agtgtgcggg gaacgaggct tcttctacac acccaagacc cgccgggagg cagaggacct
      421 gcagggtgag ccaactgccc attgctgccc ctggccgccc ccagccaccc cctgctcctg
      481 gcgctcccac ccagcatggg cagaaggggg caggaggctg ccacccagca gggggtcagg
      541 tgcacttttt taaaaagaag ttctcttggt cacgtcctaa aagtgaccag ctccctgtgg
      601 cccagtcaga atctcagcct gaggacggtg ttggcttcgg cagccccgag atacatcaga
      661 gggtgggcac gctcctccct ccactcgccc ctcaaacaaa tgccccgcag cccatttctc
      721 caccctcatt tgatgaccgc agattcaagt gttttgttaa gtaaagtcct gggtgacctg
      781 gggtcacagg gtgccccacg ctgcctgcct ctgggcgaac accccatcac gcccggagga
      841 gggcgtggct gcctgcctga gtgggccaga cccctgtcgc caggcctcac ggcagctcca
      901 tagtcaggag atggggaaga tgctggggac aggccctggg gagaagtact gggatcacct
      961 gttcaggctc ccactgtgac gctgccccgg ggcgggggaa ggaggtggga catgtgggcg
     1021 ttggggcctg taggtccaca cccagtgtgg gtgaccctcc ctctaacctg ggtccagccc
     1081 ggctggagat gggtgggagt gcgacctagg gctggcgggc aggcgggcac tgtgtctccc
     1141 tgactgtgtc ctcctgtgtc cctctgcctc gccgctgttc cggaacctgc tctgcgcggc
     1201 acgtcctggc agtggggcag gtggagctgg gcgggggccc tggtgcaggc agcctgcagc
     1261 ccttggccct ggaggggtcc ctgcagaagc gtggcattgt ggaacaatgc tgtaccagca
     1321 tctgctccct ctaccagctg gagaactact gcaactagac gcagcccgca ggcagcccca
     1381
Figure 6.4: Schematic view of the insulin signalling pathway. The gene INS is used in
the examples below (http://www.genome.jp).
In the first line it is stated that the mRNA of the gene “INS” is built out of 3 parts of the DNA
sequence: the first 42 bases (1..42), followed by the bases from position 222 through 425
(222..425) and finally those at positions 1213 through 1431 (1213..1431) in the DNA
sequence. These three parts are the exons. Next, the CDS line shows which parts of
these exons actually code for the protein. This gene information is going to be used in
the examples and in the exercises.
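Positions in such a join(...) description count from 1 and include both end points, whereas Python indices count from 0 and a slice excludes its end index. A range start..end therefore corresponds to the slice [start-1:end]. The sketch below illustrates this on a made-up sequence of 30 letters with made-up coordinates; it is not the INS record itself.

```python
# Toy sequence: six blocks of five letters each, so positions are easy to check.
seq = "aaaaaCCCCCtttttGGGGGaaaaaTTTTT"

# The hypothetical description join(6..10,16..20) selects two parts.
exon1 = seq[6-1:10]      # positions 6 through 10
exon2 = seq[16-1:20]     # positions 16 through 20

# Gluing the parts together gives the spliced sequence.
spliced = exon1 + exon2
print(spliced)
```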
6.3 Operations on strings
Handling text is a recurring theme in many areas, including biomedical applications.
One of the strengths of Python is that it already has a large number of predefined
operations for strings. Since Python is object-oriented and also offers functional programming tools, both styles are used in these operations. An example of the functional style
applied to a string object is the built-in function that returns the length of the string.
• len(s) When s is a string then the length of the string can be obtained by: len(s).
Examples
>>> len("abc")
3
>>> len("This is a long string")
21
There also exists a string of length 0, called the empty string. It is denoted by '' (or
"").
Below we introduce other operations on strings that use the object-oriented style
objectname.methodname(arguments). We use 'Biomedical engineering \n ' as
the string object the actions are performed on, but any other string could have
been used. Note the "\n" in this string: it denotes a newline character. In general, a backslash \ in front of a character in
a string indicates that the character (in this case the n) does not have its normal
meaning but a special one (in this case the newline character). Another example
of a special character is \t, which is the tab character.
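That an escape sequence such as \n is typed as two characters but stored as a single character can be checked with len. A small sketch using the example string from the text:

```python
s = 'Biomedical engineering \n '

# \n is written with two characters in the program text,
# but it counts as one (newline) character in the string.
print(len("\n"))
print(len("\t"))
print(len(s))
print(s.count("\n"))
```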
Examples of string methods:
• s.count(sub)
  Return the number of non-overlapping occurrences of substring sub in string s.
  Examples:
  – s = 'Biomedical engineering \n '
    n = s.count("e")
    has the effect that variable n obtains the value 4, since there are 4 e's in s,
    while
  – 'Biomedical engineering \n '.count("me")
    returns the value 1.
• s.upper()
  Return a copy of the string s converted to uppercase.
  Example:
  'Biomedical engineering \n '.upper()
  returns 'BIOMEDICAL ENGINEERING \n '.
  Analogously, s.lower() returns a copy of the string s converted to lowercase.
• s.rstrip()
  Return a copy of the string s without the trailing whitespace characters (the
  characters space, tab, linefeed, return, formfeed, and vertical tab). Analogously,
  s.lstrip() returns a copy of s without the whitespace at the front, and s.strip()
  returns a copy without the whitespace at both the front and the end.
  Example:
  'Biomedical engineering \n '.rstrip()
  returns 'Biomedical engineering'.
• s.find(sub)
  Return the lowest index in the string s where substring sub is found. Return -1 if
  sub is not found.
  Examples:
  – 'Biomedical engineering \n '.find("B")
    returns 0 (counting starts at zero!!)
  – 'Biomedical engineering \n '.find("me")
    returns 3.
  – 'Biomedical engineering \n '.find("p")
    returns -1.
  Analogously, s.rfind(sub) returns the highest index (i.e., the starting index of
  the first occurrence when searching the string backwards).
  A second argument can be provided to the find method. In that case the search
  for the substring starts at that index. Thus, for instance
  'Biomedical engineering \n '.find("e")
  returns 4, while
  'Biomedical engineering \n '.find("e",8)
  returns 11.
• s.split([sep])
  Return a list of the words in the string s, using the optional sep argument as the
  delimiter string. The sep argument may consist of multiple characters (for example, '1, 2, 3'.split(', ') returns ['1', '2', '3']). Splitting an empty
  string with a specified separator returns [''], a list containing one empty string.
  If sep is not specified (as we have already seen briefly in the previous chapter)
  or is None, a different splitting algorithm is applied. First, whitespace characters
  (spaces, tabs, newlines, returns, and formfeeds) are stripped from both ends. Then,
  words separated by arbitrary length strings of whitespace characters are split.
  Splitting an empty string or a string consisting of just whitespace will return an
  empty list.
  Examples:
  – print("0123124".split("12"))
    prints ['0', '3', '4'] on the screen.
  – 'Biomedical engineering \n '.split()
    returns ['Biomedical', 'engineering'].
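The second argument of find makes it possible to locate every occurrence of a substring, by restarting the search just behind the previous hit. The sketch below assumes the while statement is available to the reader; it also contrasts split with and without an explicit separator.

```python
s = 'Biomedical engineering \n '

# Collect every index at which "e" occurs: search again, starting one
# position behind the previous hit, until find reports -1.
positions = []
i = s.find("e")
while i != -1:
    positions.append(i)
    i = s.find("e", i + 1)
print(positions)

# With an explicit separator, empty strings between separators are kept.
print("a,,b".split(","))
print("".split(","))

# Without a separator, runs of whitespace are treated as one delimiter.
print("   ".split())
```

The number of positions found in this way equals s.count("e").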
The string method replace
In chapter 2 we have already introduced strings. When strings are long and occupy
several lines, we should use the triple quote notation. So if we take for example the first
two lines of the DNA sequence of the gene insulin, we have
dnaIns="""gctgcatcagaagaggccatcaagcaggtctgttccaagggcctttgcgtcaggtgggct
caggattccagggtggctggaccccaggccccagctctgcagcagggaggacgtggctgg"""
where we have left out the spaces that were added to guide the reading of the sequence. Since the string occupies more than one line, it contains a newline character ("\n").
The task is to remove this newline from the sequence. As usual there are many solutions
to this problem and there is no general recipe for finding the best one.
In this case we know that dnaIns is the name of a variable having a string value. We
want to remove the "\n", that is, replace the newline by the empty string, and
fortunately Python has such a method.
If s is a string object,
s.replace(old, new) -> string
returns a copy of string s with all occurrences of substring old replaced by new.
So far our methods have had either one or no argument. Many methods however need
more than one argument and replace is such an example. Notice that the order in
which the arguments are given is of importance.
Examples:
• s = 'acgtaa\ngg'.replace("\n", "")
  After execution of this statement s has the value 'acgtaagg'.
• mRNAs = 'acgtaagg'.replace("t", "u")
  Now mRNAs has the value 'acguaagg', and after
• mRNAs = 'acgtaagg'.replace("gt", "tgg")
  mRNAs gets the value 'actggaagg'.
So converting a DNA string to its RNA sequence is straightforward.
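Applied to the two-line string dnaIns from above, the two replacements can be combined as in the following sketch. The variable names dna and rna are chosen here for illustration.

```python
dnaIns = """gctgcatcagaagaggccatcaagcaggtctgttccaagggcctttgcgtcaggtgggct
caggattccagggtggctggaccccaggccccagctctgcagcagggaggacgtggctgg"""

# Step 1: remove the newline that the triple-quote notation introduced.
dna = dnaIns.replace("\n", "")

# Step 2: convert the (lowercase) DNA string to RNA.
rna = dna.replace("t", "u")

# The cleaned string is exactly one character shorter than the original.
print(len(dnaIns), len(dna))
print(rna)
```

Since replace returns a new string each time, dnaIns itself is left unchanged.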
Strings are immutable
Integers and strings are predefined and fixed. This means that they cannot be changed.
The technical term for such a property is immutability. So strings are immutable in
Python. This means that you can change neither characters nor substrings.
You always have to create a new string. Assigning to an indexed position in the string
results in an error:
>>> motif = 'GAATTC'
>>> motif[0] = 'x'
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
TypeError: 'str' object does not support item assignment
>>> motif[:1] = 'at'
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
TypeError: 'str' object does not support item assignment
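Instead of assigning to a position, a new string can be built from slices of the old one. A minimal sketch (the names changed and mutated are chosen for illustration):

```python
motif = 'GAATTC'

# "Replace" the first character by building a new string.
changed = 'x' + motif[1:]

# "Replace" the character at index 2 in the same way.
mutated = motif[:2] + 'G' + motif[3:]

print(changed)
print(mutated)
print(motif)       # the original string is unchanged
```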
Multiplication of strings
Next to string concatenation, Python also has another construct for building new strings: the
repetition operator. It has two operands, a string and an integer, and the result is a
string consisting of that number of copies of the string:
>>> polyA = 'A'*25
>>> polyA
'AAAAAAAAAAAAAAAAAAAAAAAAA'
For convenience we summarize the string methods, operators and built-in functions that
are often used.
Method, operator, function   Description
s + t                        Concatenation
s * n                        Repetition
len(s)                       Return the length of s
s[i]                         i-th character of s, counting starts at 0
s[i:j]                       Slice of s from i to j
s.count(sub)                 Return the number of non-overlapping occurrences of sub in s
s.lower()                    Return a copy of s converted to lowercase
s.upper()                    Return a copy of s converted to uppercase
s.rstrip()                   Return a copy of s with trailing whitespace characters removed
s.lstrip()                   Return a copy of s with whitespace characters at the front removed
s.strip()                    Return a copy of s with whitespace characters at both front and end removed
s.find(sub)                  Return the lowest index in s where sub is found
s.rfind(sub)                 Return the highest index in s where sub is found
s.replace(old,new)           Return a copy of s with all occurrences of substring old replaced by new
s.split([sep])               Return a list of the words in s, using the optional sep argument as the delimiter string
In chapter 4 the dir and help functions were already introduced. These can be applied
to any object, thus also to strings. So dir("abc") gives a complete enumeration of all
(more than 50!!) methods that can be applied to strings. Additional information about
these methods can be obtained by entering help(str).
6.4 Converting lists and strings
Lists are mutable, that is the contents can be changed, which makes them more general
than strings. It can be useful to create a list from a string by using the built-in function
list:
• list(s)
  Return a list whose items are the characters of the string s, in the same order.
  Example:
  >>> list('Gene')
  ['G', 'e', 'n', 'e']
It is also possible to do the reverse operation: from a list of strings we can make one
string by using the join operation. It has two operands, the list of string elements to
be joined and the separator between the elements.
• sep.join(seq)
  Return a string which is the concatenation of the strings in the sequence seq. The
  separator between elements is the string sep on which the method is called.
  Example: a string of the words in the sequence separated by asterisks, and a
  printout with each of the words on its own line, can be produced by:
  >>> "*".join(['A', 'CC', 'GGG', 'T'])
  'A*CC*GGG*T'
  >>> print("\n".join(['A', 'CC', 'GGG', 'T']))
  A
  CC
  GGG
  T
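Combining list and join gives a way around the immutability of strings: convert the string to a list, change the list, and join the result into a new string. A minimal sketch, with an arbitrary change chosen for illustration:

```python
seq = 'GAATTC'

bases = list(seq)        # ['G', 'A', 'A', 'T', 'T', 'C']
bases[2] = 'G'           # allowed, because lists are mutable
bases.append('A')        # lists can also grow

# The empty string as separator glues the characters together again.
new_seq = ''.join(bases)
print(new_seq)
```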
6.5 Writing data to file
All results obtained so far are presented at the output screen directly. For reporting,
and also when large amounts of data are produced, it is more convenient to write the
information to a file. As for reading, we have to open a file but now in writing mode.
This is done by giving 'w' as second argument to the open function. Next we write
our results into the file and finally close it. Especially this last action is of importance,
otherwise the system may still have some information inside its internal buffers that is
not yet written into the file.
As you might have guessed, writing a string s to a file outf that has been opened for
writing can be performed by the method
outf.write(s)
An example of copying a file is given below.
inf = open('somefile.txt', 'r')
outf = open('anotherone.txt', 'w')
s = inf.read()
outf.write(s)
inf.close()
outf.close()
Remark: If an existing file is opened for writing, the original contents are lost! If you
would like to keep the original contents and write additional contents thereafter, the
second argument to the open function should be 'a' (an abbreviation of append)
instead of 'w'.
If a newline has to be output, it should explicitly be added. There are several ways to
do this.
Assume we have two strings 'first line' and 'second line' to be written to a new file
example.txt and a newline should be in between. A small Python program with this
effect is:
outfn = open('example.txt', 'w')
outfn.write('first line')
outfn.write('\n')
outfn.write('second line')
outfn.close()
A similar program using a list is:
l = ['first line', 'second line']
outfn = open('example.txt', 'w')
outfn.write('\n'.join(l))
outfn.close()
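Python also offers the with statement, which is not used elsewhere in these notes: it closes the file automatically at the end of the indented block, so a forgotten close() cannot leave data in the buffers. The sketch below combines it with the 'w' and 'a' modes. The temporary directory is an assumption made here only to avoid overwriting an existing example.txt.

```python
import os
import tempfile

# Work in a scratch directory so that no existing file is overwritten.
folder = tempfile.mkdtemp()
fname = os.path.join(folder, 'example.txt')

# 'w' creates the file (or empties an existing one).
with open(fname, 'w') as outf:
    outf.write('first line\n')

# 'a' keeps the existing contents and appends to them.
with open(fname, 'a') as outf:
    outf.write('second line\n')

# Read the file back to check the result.
with open(fname, 'r') as inf:
    contents = inf.read()
print(contents)
```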
6.6 Exercises 25–29
Exercise 25:**
(a) Write a Python program that asks the user for a sequence and prints its length. The
output should be like
Enter a sequence: GTTGG
It is 5 bases long
(b) Modify the program so that it also prints the number of A, T, C, and G characters
in the sequence. The output should be like
Enter a sequence: GTTGG
It is 5 bases long
adenine: 0
thymine: 2
cytosine: 0
guanine: 3
(c) Modify the program to allow both lower-case and upper-case characters in the sequence. The output should be like
Enter a sequence: ATTgtc
It is 6 bases long
adenine: 1
thymine: 3
cytosine: 1
guanine: 1
(d) Modify the program to print the number of unknown characters in the sequence.
The output should be like
Enter a sequence: ATTU*gtc
It is 8 bases long
adenine: 1
thymine: 3
cytosine: 1
guanine: 1
unknown: 2
Exercise 26:
In this exercise operations on DNA strings are the key elements. The DNA string to
be used is given in the file DNAINS.txt. This file consists of one single line of characters.
Read this line into the string dna_ins.
(a) Write a Python program that prints the number of A, T, C, and G characters
occurring in dna_ins.
(b) Replace in dna_ins all occurrences of "A" by its complement "T".
(c) Replace in dna_ins all occurrences of "A" by its complement "T" and "T" by its
complement "A".
(d) Write a program that determines the complement of dna_ins. (That is: not only
all occurrences of "A" by its complement "T" and "T" by its complement "A",
but also all occurrences of "C" by its complement "G" and "G" by its complement
"C".)
(e) Generate the reverse complement of dna_ins by converting the string resulting from
(d) to a list, applying an appropriate operation on this list and making a string out
of it again.
Exercise 27:**
(a) Create the following list
[1, 2, 3, 4, ..., n-2, n-1, n, n-1, n-2, ..., 4, 3, 2, 1]
where n is an arbitrary number larger than 100.
Why is:
a = list(range(1,101))
b = a
b.reverse()
print(a+b[1:])
not a proper solution?
(b) Create in two different ways the same list but now excluding the number 73, i.e.,
[1, 2, 3, 4, ..., 71, 72, 74, 75, ..., n-2, n-1, n,
n-1, n-2, ..., 75, 74, 72, 71, ..., 4, 3, 2, 1]
(c) Let:
a=[1,2,3,4]
b=[9,16,25,36]
c=[a,b]
d=[a+b]
What are the lengths of c and d? Are these lengths equal to the sum of the lengths
of a and b?
(d) Create the following list
[0, 0, 0, 1, 1, 1, 2, 2, 2, 3, 3, 3, 4, 4, 4, ...,
n-1, n-1, n-1, n, n, n]
where n is again an arbitrary number larger than 100.
(e) Create the following list
[1, 2, 3, 5, 6, 7, 9, 10, 11, 13, 14, 15, 17, 18, 19, ..., n]
where n is again an arbitrary number larger than 100 but not a multiple of 4.
Exercise 28:
In genomics research one frequently accesses databases in which DNA sequences are
stored. The two most used formats are FASTA and FASTQ. In this exercise we concentrate on the FASTA format.
FASTA format is a text-based format for storing biological sequences (usually nucleotide
sequences). An example of such a file type is the file sequences.fasta, which gives the
sequences of all genes related to the TCA cycle in the fasta-format. In this format,
every gene starts with a line starting with the character >. The remainder of that line is
reserved for comments, which may contain all kinds of characters thus in principle also
additional > characters and/or ’ATG’s. The actual sequence starts at the next line and
may thus continue over multiple lines till the next > as the first character of a line.
(a) How many genes are there in the file?
(b) How many occurrences of the triplet 'ATG' are there in total?
(c) At which positions in the sequence of the 150th gene do you find this triplet 'ATG'?
Attention: the file consists of 75803 lines with a total of 4610050 characters. For reading
the file into your program and processing it, this is no issue at all, but printing the full
file contents to screen may take a while. So if you want to check the contents, rather
use slicing operations to print only small parts.
Exercise 29:**
Perform the following actions on the sequence seq="AaaTGGGATAAAAaaat":
seq=seq.upper()
seq=seq.replace('C',' ')
seq=seq.replace('T',' ')
seq=seq.replace('G',' ')
a=seq.split()
a.sort()
print(len(a[-1]))
What is the meaning of the result of these actions?
Chapter 7
Functions and parameters
When a program is to be designed to solve a computational problem, we usually split
this task into smaller pieces and subsequently try to solve the smaller tasks. It might
very well be that the smaller tasks are still too large to be solved and then we repeat
the splitting again, etc., until all the small tasks can be solved. The next step is then
to use the solutions of the smaller steps to solve the larger task. This is usually done
by composing them into a larger unit and similarly as for the iterated splitting process,
the combination of smaller units into larger ones might need to be repeated.
It turns out that when this ’method’ is followed, a well-structured solution is obtained.
In this chapter we discuss some of the programming concepts in Python by which this
structuring can be achieved.
The programs we have written so far have been small, but valuable. It would definitely
be inefficient if we had written some code and were not able to use it in another
place. We need a way to reuse well-defined pieces of code. In fact, we already did
this when using the built-in functions len, str, range, and help. Moreover, we are
not always interested in how a certain piece of code has been implemented; we are
primarily interested in its use. The technical term for hiding this kind of implementation
information is abstraction.
Abstraction is useful as a labor-saver, but it is more than that. It is the key to designing
computer programs. Instead of defining each detail we usually describe our programs in
higher-level actions such as 'open the file, count the number of lines', etc., and we are not
bothered by how the operating system of the computer actually gets access to the file.
7.1 Function definition (def)
Assume that we have two DNA strings, 'aaaa' and 'acca', and that we have to print
the number of times the nucleotide 'A', either as 'a' or as 'A', occurs in these DNA
strings. A simple program to achieve this is:
uppers = 'aaaa'.upper()
n = uppers.count('A')
print("The number of times A occurs in", uppers, "is:", n)
uppers = 'acca'.upper()
n = uppers.count('A')
print("The number of times A occurs in", uppers, "is:", n)
It is clear that we have duplicate actions. Copying and pasting is in general risky due
to the chance of making errors, such as forgetting parts that should have been included
or including parts that should have been left out. It would therefore be valuable if we
would have a construct by which we can group a sequence of actions and by which after
finishing the sequence of actions the result is returned. Such a construct is a function.
Functions are named sequences of statements that perform some task and return a value.
In order to perform this task, a function may need arguments, i.e., values provided to a
function when it is called.
We can define a function by using the def (function definition) statement. For the
example given above a function definition could be:
def numberOfAsInExon(mrnaseq):
    uppers = mrnaseq.upper()
    n = uppers.count('A')
    return n
The first line is called the header of the function. After the keyword def the name of the
function is given (in this case numberOfAsInExon) and then between the parentheses
the sequence of arguments, i.e., a number of arguments separated by commas (in this
case only one argument called mrnaseq). The indented lines below the header form the
body of the function.
7.2 Function call
A function can be called by using the name of the function with the sequence of arguments between parentheses. In our case two function calls are obtained by
>>> print(numberOfAsInExon(’aaaa’))
4
>>> print(numberOfAsInExon(’acca’))
2
So when this function is called the number of times the base ’A’ occurs in the argument
string converted to uppercase is returned.
The general framework for a function definition is
def fname(args):
    statements
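Because a function call returns a value, the call can be used wherever a value is allowed, for instance inside a loop or on the right-hand side of an assignment. A small sketch that reuses numberOfAsInExon; the list of sequences is made up for illustration.

```python
def numberOfAsInExon(mrnaseq):
    uppers = mrnaseq.upper()
    n = uppers.count('A')
    return n

# Add up the number of A's over several sequences.
total = 0
for seq in ['aaaa', 'acca', 'GGTA']:
    total = total + numberOfAsInExon(seq)
print(total)
```

The function body is written once and executed three times, once for each call.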
7.3 Documenting functions
Simple reuse of a function definition is only feasible when it is clear what the function is
doing, i.e., when a global description of its functionality and the role of its arguments is available. One
way of providing this is by using the comment construction (beginning with the hash sign
'#'). In the case of function definitions another style is used. If a string is put at the
beginning of a function body, it is stored as part of the function in the so-called
docstring. The following code demonstrates how to add a docstring to a function:
def numberOfAsInExon(mrnaseq):
    """
    Calculates the number of occurrences of
    the base A in the string mrnaseq
    """
    uppers = mrnaseq.upper()
    n = uppers.count('A')
    return n
When run in the interpreter, the information about a function, including its docstring, is
obtained by
>>> help(numberOfAsInExon)
Help on function numberOfAsInExon in module __main__:

numberOfAsInExon(mrnaseq)
    Calculates the number of occurrences of
    the base A in the string mrnaseq
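The docstring is also directly accessible from a program: Python stores it in the attribute __doc__ of the function object. A minimal sketch:

```python
def numberOfAsInExon(mrnaseq):
    """
    Calculates the number of occurrences of
    the base A in the string mrnaseq
    """
    uppers = mrnaseq.upper()
    n = uppers.count('A')
    return n

# The docstring is an ordinary string stored with the function.
doc = numberOfAsInExon.__doc__
print(doc)

# Built-in functions carry a docstring in the same way.
print(len.__doc__)
```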
7.4 Positional parameters as function arguments
The arguments of a function we have been using until now are called positional parameters because their positions are important. For instance, consider the following two
functions:
def DNA2mRNA(s, old, new):
    """
    Return a copy of the string s with all occurrences
    of substring old replaced by new.
    """
    return s.replace(old, new)

def aDemoReplace(s, new, old):
    """
    Return a copy of the string s with all occurrences
    of substring old replaced by new.
    """
    return s.replace(old, new)
They both do exactly the same, only with the second and third argument exchanged:
>>> print(DNA2mRNA("ATG", 'T', 'U'))
AUG
>>> print(aDemoReplace("ATG", 'U', 'T'))
AUG
>>> print(aDemoReplace("ATG", 'T', 'U'))
ATG
Python, however, has very attractive properties with respect to parameters.
7.5 Keyword parameters and defaults
In general when a function has many arguments, the order may be hard to remember.
To overcome this problem, Python has a very elegant construct by which the
name of a parameter can be supplied:
>>> print(DNA2mRNA(s="ATG", old='T', new='U'))
AUG
Naming parameters has several advantages:
• The order in which the parameters are given is no longer important.
  >>> print(DNA2mRNA(old='T', s="ATG", new='U'))
  AUG
  The name of the parameter uniquely defines to which parameter in the function
  definition the actual argument is bound.
  It is also possible to combine positional and keyword parameters, e.g.
  >>> print(DNA2mRNA("ATG", new='U', old='T'))
  AUG
  Combining positional and keyword parameters is however only possible when first
  the positional parameter(s) are given and then the keyword parameter(s). Otherwise the meaning of the parameters becomes ambiguous, resulting in an error:
  >>> DNA2mRNA(s="ATG", 'T', 'U')
    File "<ipython-input-8-220254502cdf>", line 1
      DNA2mRNA(s="ATG", 'T', 'U')
  SyntaxError: positional argument follows keyword argument
  In some cases (e.g. the replace method of strings) the use of keyword arguments is
  not allowed: "abc".replace(old="b", new="x") results in a TypeError, because
  replace accepts old and new only as positional arguments.
  A solution is to define a function of your own, such as DNA2mRNA, whose
  parameters can be passed by keyword.
• When in the function definition a parameter is given a default value, the function can be called with fewer arguments than its definition lists.
  def DNA2mRNA(s, old="T", new="U"):
      """
      Return a copy of the string s with all occurrences
      of substring old replaced by new.
      When the argument called old is not given, "T" is used,
      when the argument called new is not given, "U" is used.
      """
      return s.replace(old, new)
  >>> print(DNA2mRNA(s="ATG"))
  AUG
The parameters that are supplied with a name are called keyword parameters and even
though it is some additional typing, they clearly help in clarifying the role of each
parameter.
It is even allowed to use a mixture of positional and keyword parameters. The restriction
however is that when a mixture is used, the positional parameters are to be given first
and, of course, in the right order.
All of the following statements have the same result:
>>> print(DNA2mRNA('ATG'))
AUG
>>> print(DNA2mRNA('ATG', 'T'))
AUG
>>> print(DNA2mRNA('ATG', 'T', 'U'))
AUG
>>> print(DNA2mRNA(s='ATG'))
AUG
>>> print(DNA2mRNA(s='ATG', new='U'))
AUG
>>> print(DNA2mRNA('ATG', new='U'))
AUG
Of course, when the argument for new is supplied by position, i.e. without the keyword
name, the argument for old also has to be supplied. For instance,
>>> print(DNA2mRNA('ATG', 'U'))
ATG
might not give what is wanted, because here 'U' is taken as the value of old.
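Default values and keyword parameters combine well when a function has optional settings. The sketch below uses a hypothetical function countBase, which is not part of these notes, to show how a keyword makes it possible to skip one default while overriding another.

```python
def countBase(seq, base='A', ignoreCase=True):
    """
    Return the number of times base occurs in seq.
    When ignoreCase is True, upper and lower case are treated alike.
    """
    if ignoreCase:
        seq = seq.upper()
        base = base.upper()
    return seq.count(base)

print(countBase('acca'))                     # both defaults are used
print(countBase('acca', 'c'))                # base given by position
print(countBase('acca', ignoreCase=False))   # base keeps its default 'A'
print(countBase(base='C', seq='ACCA'))       # keywords in arbitrary order
```

Without the keyword, the third call would have to repeat the default value of base in order to reach ignoreCase.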
7.6 Exercises 30–37
Exercise 30:**
Explain what the output is of the following Python program:
def whatshouldbemyname(l):
    '''This method expects a single argument, namely a list of integers'''
    if not l:
        return []
    else:
        l.sort()
        m=[l[0]]
        for i in range(1, len(l)):
            if l[i] != l[i-1]:
                m.append(l[i])
        return m

inputlist1 = [4,10,4,4,4,10,4]
result1 = whatshouldbemyname(inputlist1)
inputlist2 = [5,3,1,8,5,9,3,8,5,8,5,0,4,6,5,9,7,6,8,10]
result2 = whatshouldbemyname(inputlist2)
print(result1, result2)
Exercise 31:
Below an imperfect Python program is given. The program consists of a single function
definition and two calls of that function. The function has 2 parameters: a base (base)
and a DNA string dnastring. The function should remove all occurrences of base from
the DNA string and return the result. The function should then be called twice: once
to remove all occurrences of 'A' from a DNA string and once to remove all occurrences of
'T' from another DNA string. The two resulting strings should then be concatenated
and printed on a single line.
The program contains a number of syntactically wrong constructs as well as a number
of semantic errors. The first will result in a SyntaxError when trying to execute the
program, while the latter means that for a given input not the correct output is obtained.
Correct the program such that both the syntactic and the semantic errors are resolved.
def removebase(base,,dnastring)
res = base.replace(dnastring,’’)
print(res)
n1 = removebase("A","AACATAAA")
n2 = removebase("T","TCGACATA’)
print(n1+n1)
Exercise 32:**
(a) Design a function wording that has a word as parameter and prints the word, its
length and the reversed word. Apply your function to the word ’verzuring’.
(b) Design a function processFile that has a filename as parameter and prints, by repeatedly calling the wording function, each word in the file, its length and the
reversed word. Apply your function to the file mytext.txt (or any other text file
of your choice).
Exercise 33:
In this exercise we will again make use of the turtle library, which was introduced in
Exercise 10, to draw some figures. The pen color can be changed using
turtle.pencolor(r,g,b)
where r, g, b are floats between 0 and 1, specifying the amount of red, green and blue,
respectively.
(a) Define a function that draws a regular polygon. The function should have a single
required parameter n, which specifies the number of sides the polygon consists of
(e.g. 3 for a triangle, 4 for a square, 6 for a hexagon, etc.). Moreover, the function
should have 6 optional parameters: x and y specifying the starting position, d for
the length per side, and r, g, b for the color. The default value for d should be 100,
while for the other optional parameters the default value should be 0, such that a
black polygon is drawn starting in the origin.
(b) Use the function designed in part (a) to draw: (1) a red square with sides of length
150, (2) a blue triangle with sides of length 120, and (3) a green hexagon with sides
of length 50.
Exercise 34:
Design a function that has a DNA sequence and two integers, say w and linel as parameters. The function prints the string in a tabulated form such that after each w
characters of the sequence a space is inserted and on each line, possibly except the last
one, the next linel characters of the sequence are shown.
For example, for the DNA sequence in the file DNAINS.txt with w = 10 and linel = 60
the output should look like:
gctgcatcag aagaggccat caagcaggtc tgttccaagg gcctttgcgt caggtgggct
caggattcca gggtggctgg accccaggcc ccagctctgc agcagggagg acgtggctgg
gctcgtgaag catgtggggg
Exercise 35:*
In the following exercises each of the functions to be designed should have a string as
only parameter representing the name of an input text file that is to be opened. For
this file you could use mytext2.txt.
(a) Define a sentence to be a string ending on a period ("."). Design a function that
returns a list of all the sentences the file consists of.
(b) Design a function that prints all words of a file on a separate line.
(c) Design a function that sorts the words of a file and subsequently prints each of the
words on a separate line.
(d) Design a function that prints each odd-numbered line of a file.
Exercise 36:
Design a function that copies the lines of an input file in reverse order into an output
file. The function should have two parameters, the first parameter is the name of the
input file, the second parameter the name of the output file. As input file you could use
mytext2.txt.
Exercise 37:
Below an imperfect Python program is given. The program consists of 2 function definitions. The first function has 2 parameters: a base and a list of DNA strings. The
function should determine the position of the last occurrence of the specified base for each
DNA string in the list. Those numbers should be returned in a list. The second function,
with as input parameter a list of DNA strings, should first determine the last occurrences
of the bases A and C in the DNA strings by two calls of the first function. Subsequently,
this second function should determine how many DNA sequences are present in the list
for which the last C comes after the last A and there are at most two bases between
these two last occurrences.
The program contains a number of syntactically wrong constructs as well as a number
of semantic errors. The former result in a SyntaxError when trying to execute the
program, while the latter mean that for a given input the correct output is not obtained.
Correct the program such that both the syntactic and the semantic errors are resolved.
def indexlast(base="A’, l)
m=[]
for i in range(l)
m.extend(m[-j).index(basis)
returm n
def finalCcloseafterfinalA():
lastAl=indexlattst(l)
lastCl=indexlats(l, base="CCC")
found=-1
for i in l:
if lasteAl[i]-lastCl(i) < 2:
found = found - 1
return found
gevonden==finalCcloseafterfinalA(["TCTTTT", "ACAATC", "ACTTACC", "CATA"], [])
print gewonde
Chapter 8
Tuples and string formatting
In previous chapters we have seen how one can write output to either the screen or a
text file. In both cases, it is often useful to write data in a more structured way than
done so far. Therefore in this chapter we will consider formatted strings. Additionally,
we will consider a second container datatype (next to lists that we have seen in Chapter
3), i.e., tuples.
8.1 Tuples
In Chapter 3 we have considered lists. Python has a second datatype that is often
used to store collections of items: the tuple. Tuples are sequences, just like lists. The
differences between tuples and lists are that:
(i) tuples cannot be changed unlike lists and
(ii) tuples use parentheses whereas lists use square brackets.
Examples of tuples are:
>>> tup1 = (12, 5, 3, 4, 2 )
>>> tup2 = (1, ’hi’, 100, 3.14)
Tuples can thus (like lists) contain items of different types.
To make a tuple containing a single item, a comma has to be added after the item:
>>> tup3 = (100,)
Without the comma, the result of the above assignment would be that tup3 contains
the integer 100.
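The effect of the trailing comma can be checked with the type function. The following is a small sketch; the variable names with_comma and without_comma are chosen here for illustration only.

```python
# Parentheses alone do not make a tuple; the trailing comma does.
with_comma = (100,)
without_comma = (100)

print(type(with_comma))      # <class 'tuple'>
print(type(without_comma))   # <class 'int'>
print(len(with_comma))       # 1
```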
A fourth example of a tuple is:
>>> tup4 = "a", "b", "c", "d"
>>> tup4
(’a’, ’b’, ’c’, ’d’)
This shows that when multiple objects are given, separated by commas, without identifying symbols (like brackets for lists or parentheses for tuples), Python turns it into a
tuple. Tuples thus occur quite often: in the following sections we will for instance use
them to create formatted strings and to define functions that return multiple values.
Indexing and slicing tuples
Elements of tuples are accessed exactly in the same way as lists:
>>> tup1[0]       # indexing
12
>>> tup2[1:5]     # slicing
('hi', 100, 3.14)
The indices start again at zero! And slicing of a tuple results in a new tuple.
The main difference with lists is that tuples are immutable. It is thus impossible to
change a tuple. Thus, for instance, assigning one element of the tuple a different value
results in an error:
>>> tup1[0] = 100
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
TypeError: ’tuple’ object does not support item assignment
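Although a tuple cannot be changed in place, a new tuple can be built from parts of an existing one. The sketch below assumes the tuple tup1 defined earlier (the name changed is introduced here only for illustration) and combines slicing and concatenation to obtain a tuple in which the first element is replaced.

```python
tup1 = (12, 5, 3, 4, 2)

# Build a new tuple: a one-element tuple followed by the rest of tup1.
changed = (100,) + tup1[1:]

print(changed)   # (100, 5, 3, 4, 2)
print(tup1)      # (12, 5, 3, 4, 2), the original is untouched
```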
Basic operations on tuples
Basic operations on tuples (that we have already seen for lists as well) are:
Expression                              Result                  Description
len((1, 2, 3))                          3                       Length
(1, 2, 3) + (4, 5, 6)                   (1, 2, 3, 4, 5, 6)      Concatenation
('Hi!',) * 3                            ('Hi!', 'Hi!', 'Hi!')   Repetition
3 in (1, 2, 3)                          True                    Membership
for x in (1, 2, 3): print(x, end=' ')   1 2 3                   Iteration
Lists can be converted to tuples and tuples to lists:
>>> m = list(tup1)
>>> m
[12, 5, 3, 4, 2]
>>> tup5 = tuple(m)
>>> tup5
(12, 5, 3, 4, 2)
8.2 Returning multiple values from a function
In the previous chapter, we have seen that functions can be used to structure a computer
program. The idea is to split the computational problem at hand into smaller parts,
where each of these smaller parts is considered in a separate function. The function
may be provided with some input (arguments), performs its operations, and returns
the result. An additional advantage of functions is that they can be called multiple
times (on the same or different input), such that the same code does not need to be
repeated. Moreover, functions (developed by others) may be reused: as a user
of the function you do not need to bother about details inside the function, but can
assume that for a given input the resulting output is as specified in the function's
docstring.
The functions considered so far returned only a single object. However, functions may
also return multiple objects separated by commas, which can then be caught by the
same number of variables. For instance:
>>> def myfun(x):
...     return x+1, x+2, x+3
...
>>> a, b, c = myfun(5)
>>> a
6
>>> b
7
>>> c
8
What is actually returned is a tuple, so the output of the function can also be caught
in a single variable which then contains the tuple:
>>> d = myfun(5)
>>> d
(6, 7, 8)
Assignment of a tuple to multiple variables can also be done directly. Some examples
are:
>>> a,b = (3,4)
>>> a,b = 3,4
>>> (a,b) = (3,4)
all three assigning the value 3 to a and the value 4 to b.
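A typical use of tuple assignment is a function that computes two related quantities at once. The following sketch is an illustration only: the function name gc_at_counts and the example sequence are assumptions made here. It also shows that two variables can be swapped with a single tuple assignment.

```python
def gc_at_counts(seq):
    """Return the number of G/C bases and the number of A/T bases in seq."""
    gc = seq.count('G') + seq.count('C')
    at = seq.count('A') + seq.count('T')
    return gc, at            # the two values are packed into one tuple

n_gc, n_at = gc_at_counts('ACTTACC')   # ... and unpacked again here
print(n_gc, n_at)            # 3 4

# Tuple assignment also allows two variables to be swapped in one line.
n_gc, n_at = n_at, n_gc
print(n_gc, n_at)            # 4 3
```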
8.3 String formatting
We have seen before that the print function can have multiple arguments separated
by commas and that the types of these arguments may differ. An example where two
strings and a float are combined is:
>>> from math import pi
>>> print(’The value of pi is’, pi, ’!’)
The value of pi is 3.141592653589793 !
Python has two alternatives to provide more control over the way a string is formatted:
(i) the ’old style’ that uses the % symbol (that is very similar to the approach in other
computer languages like C, TCL and Matlab) and (ii) the ’new style’ using the format
method on strings. Both can be used and you are free to use either one you prefer.
8.3.1 Old style: %
In the first alternative, a % character is used inside the string to indicate the position
in the string where a value needs to be inserted and the value that is to be substituted
follows after a second % character placed behind the string. To insert a float, the %
character inside the string should be followed by a character f. The above result is thus
(almost) reproduced using
>>> from math import pi
>>> print(’The value of pi is %f !’ % pi)
The value of pi is 3.141593 !
A first advantage is that this makes it easy to remove the space between
the numerical value and the exclamation mark. Another major advantage of
this method is that it readily allows for controlling, for instance, the number of digits.
This is achieved by adding a dot followed by an integer indicating the desired number of
decimals just in front of the 'f'. For instance, to show just 2 decimals of pi, the print
call should be adapted to:
>>> from math import pi
>>> print(’The value of pi is %.2f!’ % pi)
The value of pi is 3.14!
Also other types can be formatted:
Format   Description                     Example              Result
%s       string                          'A%sD' % 'bc'        'AbcD'
%d       integer                         '%d*%d' % (3,4)      '3*4'
%f       float                           '%.3f' % 0.1234      '0.123'
%e       float in scientific notation    '%.2e' % 0.1234      '1.23e-01'
%X       integer in hexadecimal format   '%X' % 255           'FF'
Multiple substitution values
It is also possible to substitute multiple values into a string. The values to be substituted
should then be provided as a tuple, where the number of elements in the tuple should
match the number of % characters in the string indicating the positions where the values
should be substituted. The first element in the tuple is then substituted at the position
of the first % character in the string, the second element in the tuple at the position of
the second % character, etc.
>>> from math import pi
>>> print(’The value of %s is %.5f!’ % (’pi’, pi))
The value of pi is 3.14159!
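The sketch below applies the same idea to three values of different types. It also illustrates a common pitfall: when the single value to be substituted is itself a tuple, it has to be wrapped in a one-element tuple, because otherwise its items are treated as separate substitution values. The variable names and example data are assumptions made for illustration.

```python
name = 'INS'
seq = 'gctgcatcag'

# Three placeholders, so the values are supplied as a tuple of three items.
line = '%s: %d bases, starts with %s' % (name, len(seq), seq[:3])
print(line)    # INS: 10 bases, starts with gct

# A tuple that is itself the single value must be wrapped in a
# one-element tuple.
point = (3, 4)
text = 'The point is %s' % (point,)
print(text)    # The point is (3, 4)
```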
Alignment
It is also possible to specify the number of characters that is minimally used in the
substituted string. This is obtained by specifying the required number of characters
directly following the % character. This can for instance be useful when one wants to
align columns in a table. While
>>> for i in range(11):
...     print(i,i**2,i**3)
0 0 0
1 1 1
2 4 8
3 9 27
4 16 64
5 25 125
6 36 216
7 49 343
8 64 512
9 81 729
10 100 1000
results in a mess, a nicely aligned table may be obtained using:
>>> for i in range(11):
...     print('%2d %3d %4d' % (i, i**2, i**3))
 0   0    0
 1   1    1
 2   4    8
 3   9   27
 4  16   64
 5  25  125
 6  36  216
 7  49  343
 8  64  512
 9  81  729
10 100 1000
8.3.2 New style: the format method
The second alternative to format strings is using the str.format method operating on
a string str. Advantages of this approach are that it provides slightly more control and
that the style is more 'Pythonic'. The latter is at the same time its drawback, as the
approach is not applicable in, for instance, Matlab.
The basic usage of the str.format() method in order to reproduce the result at the
beginning of Section 8.3 is:
>>> from math import pi
>>> print(’The value of pi is {} !’.format(pi))
The value of pi is 3.141592653589793 !
The format method thus substitutes the value of its argument at the position in the
string specified by {}. Also this approach easily allows for removing the presence of the
space between the numerical value and the remainder of the string and to control the
number of digits for the value to be substituted. For instance, to show just 2 decimals
of pi, the print statement becomes
>>> print(’The value of pi is {:.2f}!’.format(pi))
The value of pi is 3.14!
where the colon specifies that special formatting follows, f specifies that a float should
be printed and the .m that it should be printed with m decimals. Also other types can
be formatted:
Format   Description                     Example                     Result
s        string                          'A{:s}D'.format('bc')       'AbcD'
d        integer                         '{:d}*{:d}'.format(3,4)     '3*4'
f        float                           '{:.3f}'.format(0.1234)     '0.123'
e        float in scientific notation    '{:.2e}'.format(0.1234)     '1.23e-01'
%        float as percentage             '{:.1%}'.format(0.1234)     '12.3%'
b        integer in binary format        '{:b}'.format(8)            '1000'
X        integer in hexadecimal format   '{:X}'.format(255)          'FF'
Multiple substitution values
Also using the format method it is possible to substitute multiple values into a string.
The default is to use multiple {} and multiple parameters for the format method. The
values are then substituted in consecutive order.
>>> print(’The value of {} is {}!’.format(’pi’,pi))
The value of pi is 3.141592653589793!
The format of both substitutions can be controlled just as for the case of a single
substitution. In addition, it is now also possible to specify which parameter
should be substituted at which position:
>>> print('The value of {:s} is {:.5f}!'.format('pi',pi))
The value of pi is 3.14159!
>>> print('The value of {0:s} is {1:.5f}!'.format('pi',pi))
The value of pi is 3.14159!
>>> print('The value of {1:s} is {0:.5f}!'.format(pi,'pi'))
The value of pi is 3.14159!
The index of the argument to be substituted is thus specified by the integer in front of
the colon.
A single parameter can also be substituted at multiple places, e.g:
>>> print(’{0:} rounded to 2 digits is {0:.2f}’.format(pi))
3.141592653589793 rounded to 2 digits is 3.14
Instead of working with positional parameters, here also keyword parameters can be
used:
>>> print(’The value of {name:s} is {value:.5f}!’.format(value=pi,name=’pi’))
The value of pi is 3.14159!
The format method of course also works on a string in a variable:
>>> from math import pi, e
>>> s = ’The value of {name:s} is {value:.5f}!’
>>> s
’The value of {name:s} is {value:.5f}!’
>>> t = s.format(value=pi,name=’pi’)
>>> t
’The value of pi is 3.14159!’
>>> print(t)
The value of pi is 3.14159!
>>> print(s.format(value=e,name=’e’))
The value of e is 2.71828!
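Storing the format string in a variable is convenient when the same layout has to be applied to many values. The following sketch, in which the names template, constants and lines are assumptions made for illustration, loops over a list of tuples and fills in the same template for each of them.

```python
from math import pi, e, sqrt

template = 'The value of {name:s} is {value:.3f}'

# A list of (name, value) tuples to be reported.
constants = [('pi', pi), ('e', e), ('sqrt(2)', sqrt(2))]

lines = []
for name, value in constants:          # each tuple is unpacked into two variables
    lines.append(template.format(name=name, value=value))

for line in lines:
    print(line)
```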
Alignment
It is also possible to specify the number of characters that is minimally used in the
substituted string. This can for instance be useful when one wants to align columns in
a table:
>>> for i in range(11):
...     print('{0:2d} {1:3d} {2:4d}'.format(i, i**2, i**3))
 0   0    0
 1   1    1
 2   4    8
 3   9   27
 4  16   64
 5  25  125
 6  36  216
 7  49  343
 8  64  512
 9  81  729
10 100 1000
The number in front of the type (here ’d’) thus specifies the width of the substring where
the value is formatted.
Alignment within the columns can be achieved using > for right hand side alignment
(default for numbers), < for left hand side alignment, and ^ for centering:
>>> for i in range(11):
...     print('{0:>2d} {1:<3d} {2:^4d}'.format(i, i**2, i**3))
 0 0    0
 1 1    1
 2 4    8
 3 9    27
 4 16   64
 5 25  125
 6 36  216
 7 49  343
 8 64  512
 9 81  729
10 100 1000
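Width and alignment can be combined with the other format types, which is useful for tables that mix text and numbers. The sketch below uses made-up example data (base counts and fractions) and hypothetical variable names; it left-aligns a text column and right-aligns an integer and a percentage column.

```python
# Each row: (base, count, fraction of the sequence)
rows = [('A', 2, 0.2857), ('C', 3, 0.4286), ('T', 2, 0.2857)]

# Header and rows use the same widths (6, 6 and 10), so the columns line up.
table = ['{:<6s}{:>6s}{:>10s}'.format('base', 'count', 'fraction')]
for base, count, fraction in rows:
    table.append('{:<6s}{:>6d}{:>10.1%}'.format(base, count, fraction))

for line in table:
    print(line)
```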
Format   Description                                    Example                    Result
#d       use # characters for an integer                '{:5d}'.format(123)        '  123'
#.#f     use # characters for a float with # decimals   '{:7.2f}'.format(0.1234)   '   0.12'
>        align right                                    '{:>5d}'.format(123)       '  123'
<        align left                                     '{:<5d}'.format(123)       '123  '
^        align center                                   '{:^5d}'.format(123)       ' 123 '
8.4 Exercises 38–40
Exercise 38:**
Design a function minmaxmean that has a list of integers as parameter and returns
the minimum value in the list, the maximum value, as well as the mean. Apply
your method to the list m1=[9,4,5,6,2,5,4,3,1,2,12,7,4,3,2,8,4,2] and to the
list m2=[3,4,5,2,2,12,2,1,8,2,9,4,3,6,4,4,7,5].
Exercise 39:*
(a) Explain the output of the following python fragment:
for i in range(1,13):
    for j in range(1,13):
        print('%4d' % i*j, end=' ')
    print()
(b) Correct the above python fragment such that it prints the multiplication table (of
the tables 1 through 12) in the following format to screen:
   1    2    3    4    5    6    7    8    9   10   11   12
   2    4    6    8   10   12   14   16   18   20   22   24
   3    6    9   12   15   18   21   24   27   30   33   36
   4    8   12   16   20   24   28   32   36   40   44   48
   5   10   15   20   25   30   35   40   45   50   55   60
   6   12   18   24   30   36   42   48   54   60   66   72
   7   14   21   28   35   42   49   56   63   70   77   84
   8   16   24   32   40   48   56   64   72   80   88   96
   9   18   27   36   45   54   63   72   81   90   99  108
  10   20   30   40   50   60   70   80   90  100  110  120
  11   22   33   44   55   66   77   88   99  110  121  132
  12   24   36   48   60   72   84   96  108  120  132  144
(c) Design a function mytable with as parameter an integer maxval that creates a
table like above and returns it as a string. The parameter maxval should specify
the number of columns to be printed, while the number of rows should remain fixed
to 12.
Exercise 40:
Given is a text file BMIs.txt (considered before in Exercise 22) with on each line three
fields separated by whitespace. The first field is the name of a person, the second field
his/her weight, and the third field his/her height.
(a) Design a python function with a filename as its single parameter that reads the
specified file (of the type as described above) and returns the contents of the file as
a list of tuples. Each tuple should contain three items, i.e., in consecutive order a
name as a string, a weight as a float, and a height as a float. Apply your python
function to the file BMIs.txt.
(b) Design a python function with a list of tuples, as defined in part (a), as a single
parameter, which uses string formatting to print a table like:
Name       | weight | height |  BMI | Category
----------------------------------------------------
Peter      |  120.0 |   1.75 | 39.2 | obese
Esther     |   60.0 |   1.61 | 23.1 | healthy weight
Tom        |   90.0 |   1.70 | 31.1 | obese
where BMI is defined as weight divided by the square of the height and the Category
is defined according to the World Health Organization: ’Underweight’ (BMI below
18.5), ’Healthy weight’ (BMI between 18.5 and 25), ’Overweight’ (BMI between 25
and 30) or ’Obese’ (BMI above 30).
Chapter 9
Dictionaries and database queries
Besides lists and tuples, Python has other built-in datatypes. One of those is the
dictionary, which maps each key to an associated value. Dictionaries
can be a very useful datatype to clearly store and access diverse data. When accessing
databases from Python, the returned data may also be provided as a dictionary.
9.1 Dictionaries
Defining a dictionary
A dictionary is a collection of elements where each element is a key:value pair.
To define a dictionary, introduce a variable name and explicitly give the elements of the
dictionary enclosed in curly braces.
>>> gene_dict = { 'IGF2': 'insulin-like growth factor 2 (somatomedin A)',
...               'INS': 'insulin' }
>>> gene_dict
{ 'IGF2': 'insulin-like growth factor 2 (somatomedin A)', 'INS': 'insulin' }
>>> gene_dict['IGF2']
'insulin-like growth factor 2 (somatomedin A)'
>>> gene_dict['INS']
'insulin'
Here ’IGF2’ is a key and its associated value, referenced by gene_dict[’IGF2’], is
’insulin-like growth factor 2 (somatomedin A)’, and similarly for the key ’INS’ and
the value ’insulin’. Hence dictionary elements are obtained by their keys. An empty
dictionary is created by
empty_d = { }
Dictionaries are not just for strings.
• Dictionary values can be any datatype, including strings, integers, objects, lists,
or even other dictionaries. And within a single dictionary, the values may have
different types. They can be mixed as needed.
• Dictionary keys are more restricted, viz. limited to immutable data types like
strings, integers, tuples, and a few other types. They can also be mixed, i.e., not
all keys need to be of the same type.
That different types can be used for keys as well as values is illustrated by:
>>> mydict={}
>>> mydict[’name’] = ’Klaas’
>>> mydict[666] = [6,6,6]
>>> mydict[(1,’hi’)] = 3.1415
>>> mydict
{666: [6, 6, 6], (1, ’hi’): 3.1415, ’name’: ’Klaas’}
Since lists are mutable, a list cannot be used as a key:
>>> mydict[[1,2]] = 4
Traceback (most recent call last):
TypeError: unhashable type: ’list’
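If the data that should serve as a key happens to be stored in a list, it can be converted to a tuple first. The following is a minimal sketch; the dictionary positions and the example values are assumptions made for illustration.

```python
positions = {}
exon = [120, 450]            # a list: mutable, so not usable as a key

try:
    positions[exon] = 'exon 1'
except TypeError as err:
    print('Not allowed:', err)

# Converting the list to a tuple gives an immutable key.
positions[tuple(exon)] = 'exon 1'
print(positions)             # {(120, 450): 'exon 1'}
```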
The number of items in a dictionary can be obtained using the function len
>>> len(mydict)
3
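Because values can themselves be dictionaries, several properties can be stored per key. The sketch below is an illustration only: the dictionary name genes and the property names 'name' and 'chromosome' are chosen here and do not come from a particular database.

```python
genes = {
    'INS':  {'name': 'insulin', 'chromosome': 11},
    'INSR': {'name': 'insulin receptor', 'chromosome': 19},
}

# Two successive lookups: first the gene symbol, then the property.
print(genes['INSR']['name'])         # insulin receptor
print(genes['INS']['chromosome'])    # 11
print(len(genes))                    # 2
```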
Adding and changing dictionary elements
Duplicate keys cannot occur in a dictionary. Assigning a value to an existing key will
overwrite the old value.
>>> gene_dict
{ ’IGF2’: ’insulin-like growth factor 2 (somatomedin A)’, ’INS’: ’insulin’ }
>>> gene_dict['IGF2']='insulin-like growth factor 2'
>>> gene_dict
{ ’IGF2’: ’insulin-like growth factor 2’, ’INS’: ’insulin’ }
Adding new elements to a dictionary goes in a similar way:
>>> gene_dict['INSR']='insulin receptor'
>>> gene_dict
{ 'INS': 'insulin', 'INSR': 'insulin receptor',
'IGF2': 'insulin-like growth factor 2' }
Elements of a dictionary are accessed by key, not by position, so a program should not
depend on the order in which the elements are shown. In Python versions before 3.7 this
order is arbitrary: in the output above the new element (key 'INSR', value 'insulin
receptor') appears in the middle, and it was just a coincidence that the elements
appeared to be in order in the first example. Since Python 3.7 a dictionary keeps its
elements in the order in which they were inserted, so there the new element is shown last.
An item can be removed from a dictionary using the del statement
>>> gene_dict
{ ’INS’: ’insulin’, ’INSR’: ’insulin receptor’,
’IGF2’: ’insulin-like growth factor 2’ }
>>> del gene_dict[’INS’]
>>> gene_dict
{ ’INSR’: ’insulin receptor’,
’IGF2’: ’insulin-like growth factor 2’ }
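After a key has been deleted, looking it up with square brackets raises a KeyError. The sketch below uses get, a standard dictionary method that is not discussed in the text above, which returns a default value instead of raising an error. The default text 'unknown gene' is an assumption made for illustration.

```python
gene_dict = {'INSR': 'insulin receptor',
             'IGF2': 'insulin-like growth factor 2'}

# gene_dict['INS'] would raise a KeyError here, because 'INS' is not a key.
# The get method returns the given default value instead.
print(gene_dict.get('INS', 'unknown gene'))    # unknown gene
print(gene_dict.get('INSR', 'unknown gene'))   # insulin receptor
```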
Methods of dictionaries
Given a dictionary d the following methods can be applied to d:
• d.keys() returns a view on the dictionary's keys
• d.values() returns a view on the dictionary's values
• d.items() returns a view on the dictionary's (key, value) pairs
• x in d returns True if x is one of the dictionary's keys, False otherwise
If desired, the views returned by the methods d.keys(), d.values() and d.items()
can be converted to true lists using the list function.
Some examples for the dictionary filled earlier this section:
>>> gene_dict.keys()
dict_keys(['IGF2', 'INS', 'INSR'])
>>> list(gene_dict.keys())
['IGF2', 'INS', 'INSR']
>>> gene_dict.values()
dict_values(['insulin-like growth factor 2', 'insulin', 'insulin receptor'])
>>> gene_dict.items()
dict_items([('IGF2', 'insulin-like growth factor 2'), ('INS', 'insulin'),
('INSR', 'insulin receptor')])
>>> ’INS’ in gene_dict
True
>>> ’insulin’ in gene_dict
False
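The membership test is what makes a dictionary convenient for counting. The following sketch, in which the function name count_bases and the example sequence are assumptions made for illustration, counts how often each base occurs in a DNA string: an existing key gets its value overwritten, a new key is added.

```python
def count_bases(seq):
    """Return a dictionary mapping each character in seq to its count."""
    counts = {}
    for base in seq:
        if base in counts:
            counts[base] = counts[base] + 1   # existing key: overwrite value
        else:
            counts[base] = 1                  # new key: add element
    return counts

counts = count_bases('ACTTACC')
print(sorted(counts.items()))   # [('A', 2), ('C', 3), ('T', 2)]
```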
Looping over dictionaries
To perform an action on all items in a dictionary, we again use the for statement.
>>> for key in gene_dict:
...     print(key, 'stands for', gene_dict[key])
INS stands for insulin
INSR stands for insulin receptor
IGF2 stands for insulin-like growth factor 2
In this way we thus get the keys one by one and can use those to inspect the corresponding values.
A second way to loop over all items, directly having access to the key-value pairs is:
>>> for key,val in gene_dict.items():
...     print(key, 'stands for', val)
INS stands for insulin
INSR stands for insulin receptor
IGF2 stands for insulin-like growth factor 2
In both cases above, the keys are not processed in alphabetical order. If
you want them in alphabetical order, the sorted function can be used:
>>> for key in sorted(gene_dict):
...     print(key, 'stands for', gene_dict[key])
IGF2 stands for insulin-like growth factor 2
INS stands for insulin
INSR stands for insulin receptor
Also if you prefer reversed order:
>>> for key in sorted(gene_dict,reverse=True):
...     print(key, 'stands for', gene_dict[key])
INSR stands for insulin receptor
INS stands for insulin
IGF2 stands for insulin-like growth factor 2
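Besides looping over a dictionary itself, a dictionary is often used as a lookup table while looping over another sequence. The sketch below is an illustration under the assumption that the input consists of the upper-case letters A, C, G and T only; the names complement and reverse_complement are chosen here.

```python
# Lookup table: each base is mapped to its complementary base.
complement = {'A': 'T', 'T': 'A', 'C': 'G', 'G': 'C'}

def reverse_complement(seq):
    """Return the reverse complement of an upper-case DNA string."""
    result = ''
    for base in reversed(seq):            # walk through seq from back to front
        result = result + complement[base]
    return result

print(reverse_complement('ACTTACC'))   # GGTAAGT
```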
9.2 Database queries
Until now we have merely used data that was locally accessible, but huge amounts of data
are also available outside the TU/e. Access to these data is usually provided by
large database servers. For the domain of Biomedical Engineering the most frequently
used source worldwide is the National Center for Biotechnology Information (NCBI),
which hosts, among others, PubMed Central (PMC), a free full-text archive of biomedical
and life sciences journal literature. NCBI advances science and health by providing access
to biomedical and genomic
information. In this section we discuss how to retrieve genomic data from this site, but
we start with a more general treatment of how to automatically download data from
an arbitrary site.
9.2.1 Open arbitrary resources by URL
In section 5.4, we have seen that a file of the local file system can be opened by issuing
the open-command. An example is:
inf=open("example.txt")
It would be nice if we could use a similar command for remote files. Python indeed has
a standard module to access remote resources: urllib.
To access a remote file we of course have to know the address of the location the file is
residing on. The international standard is the so-called Uniform Resource Locator (URL),
commonly called a web address. A URL is in fact more than an address: it is a reference to a web
resource that specifies both its location on a computer network and a mechanism for
retrieving it. URLs are most commonly used to reference web pages (http), but are also used for
file transfer (ftp), email (mailto), database access (JDBC), and many other applications.
Most web browsers display the URL of a web page above the page in an address bar.
A typical URL could have the form http://cbio.bmt.tue.nl/~philbers/index.htm,
which indicates a protocol (http), a hostname (cbio.bmt.tue.nl), as well as a file name
(~philbers/index.htm). To access such a file in Python the following small program
could be used:
import urllib.request
protocol = "http"
hostname = "cbio.bmt.tue.nl"
path = "~philbers/index.htm"
url = protocol+"://"+hostname+"/"+path
rf = urllib.request.urlopen(url)
data = rf.read()
print(data)
Running this program (the url is only accessible from within the TU/e-environment)
results in (linebreaks have been added)
b’<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Transitional//EN"
"http://www.w3.org/TR/xhtml1/DTD/xhtml1-transitional.dtd">
<html lang="nl" xml:lang="nl" xmlns="http://www.w3.org/1999/xhtml">
<HEAD>
<TITLE>Peter Hilbers</TITLE>
</HEAD>
<BODY TEXT="#000000" LINK="#0000FF" VLINK="#00007F" BGCOLOR="#EAD9C7">
<H2>
Welcome, on this page where you can find information concerning
my recent publications, courses and presentations.
</H2>
.....
NWO Computional Life Science<br><br>
NWO Computional Science<br><br>
Lorentz Center<br><br>
</body></html>’
The letter b followed by the quote at the beginning of what is returned shows that it is
not a str object, but of the type bytes. It could be converted to a string (str) using
the decode method of bytes, i.e., adding one line:
data = data.decode()
So compared to local file access the main difference is in using urlopen instead of open
and the extra decode step to obtain the contents as a string (str).
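The difference between bytes and str can be explored without a network connection. The following sketch uses a short bytes literal as a stand-in for downloaded data; the variable names are chosen here for illustration.

```python
data = b'<TITLE>Peter Hilbers</TITLE>'
print(type(data))         # <class 'bytes'>

text = data.decode()      # decode() assumes UTF-8 unless told otherwise
print(type(text))         # <class 'str'>
print(text)

# The encode method of str goes in the opposite direction.
print(text.encode() == data)   # True
```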
Web pages are often dynamic. This roughly means that the (html) output being generated
depends on optional additional information given in the url. This optional
information is usually called the query and is separated from the preceding part of the
url by a question mark (?). Its syntax is not strictly defined, but by convention it is most
often a sequence of attribute–value pairs separated by a delimiter. The two most commonly
used delimiters are the ampersand (&) and the semicolon (;). Since the ampersand is
used most often, we will use it as well. An example of a url including a query part is:
url="https://www.ncbi.nlm.nih.gov/pubmed/?term=genomics"
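The parts of such a url can be inspected with the functions urlparse and parse_qs from the standard module urllib.parse. The following sketch takes the example url apart again; the variable name parts is chosen here for illustration.

```python
from urllib.parse import urlparse, parse_qs

url = "https://www.ncbi.nlm.nih.gov/pubmed/?term=genomics"
parts = urlparse(url)

print(parts.scheme)    # https
print(parts.netloc)    # www.ncbi.nlm.nih.gov
print(parts.path)      # /pubmed/
print(parts.query)     # term=genomics

# parse_qs turns the query into a dictionary; each value is a list,
# because an attribute may occur more than once in a query.
print(parse_qs(parts.query))   # {'term': ['genomics']}
```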
One difficulty with queries is that their syntax makes them hard to read. For instance,
to build a query string into a URL, spaces are to be replaced by plus signs and, as a
consequence, plus signs need to be escaped by using their "%xx" variant, viz. "%2B".
As an example, suppose we are interested in the Ebola virus. If we would like to use
the term 'Ebola virus' in the query part of a url, we should use 'Ebola+virus', but
in general many more similar substitutions are needed, especially when multiple
attribute–value pairs are used. Doing those substitutions manually is a tedious task, so
we would like an alternative. Python's urllib.parse module provides the function
urlencode(query) that automatically performs such substitutions. The query can for
instance be given as a dictionary. If we would have the url (one single line)
https://eutils.ncbi.nlm.nih.gov/entrez/eutils/efetch.fcgi?db=nuccore&
id=110645916&rettype=fasta
we could build this string in Python by using a dictionary:
import urllib.parse
import urllib.request
protocol="https"
hostname="eutils.ncbi.nlm.nih.gov"
path="entrez/eutils/efetch.fcgi"
url=protocol+"://"+hostname+"/"+path
queryd={}
queryd[’db’]=’nuccore’
queryd[’id’]=’110645916’
queryd[’rettype’]=’fasta’
params = urllib.parse.urlencode(queryd)
params = params.encode(’ascii’)
rf = urllib.request.urlopen(url, params)
data=rf.read()
print(data)
Although it seems a bit superfluous to use so many small steps, such an approach is
less error prone and hence it is strongly recommended to do it in this way. A problem
is, however, that not all web servers accept requests from Python programs, as they do not like 'robots'.
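Note that passing the encoded parameters as second argument to urlopen sends them in the body of the request (a POST request) rather than as part of the url. The sketch below shows, without contacting any server, how the single-line url given earlier can be assembled from the same dictionary, and what urlencode does with spaces and plus signs. The variable name full_url is an assumption made for illustration.

```python
import urllib.parse

url = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/efetch.fcgi"
queryd = {'db': 'nuccore', 'id': '110645916', 'rettype': 'fasta'}

# Join the attribute-value pairs with '&' and glue them to the url with '?'.
full_url = url + "?" + urllib.parse.urlencode(queryd)
print(full_url)

# Spaces become plus signs, and a literal plus sign is escaped as %2B.
print(urllib.parse.urlencode({'term': 'Ebola virus'}))   # term=Ebola+virus
print(urllib.parse.quote_plus('1+1'))                    # 1%2B1
```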
Another example that does work is http://www.chemcalc.org/ which provides some
web services related to mass spectrometry. A small example, which we will also use in
one of the exercises, is:
import urllib.parse
import urllib.request
# Define a molecular formula string
mf = ’C4H5N3O’
# Define the parameters and send them to Chemcalc
pardict = {’mf’: mf,’isotopomers’:’jcamp,xy’}
params = urllib.parse.urlencode(pardict).encode()
# url of Chemcalc
url = ’http://www.chemcalc.org/chemcalc/mf’
# Open the url and read the page
response = urllib.request.urlopen(url, params)
data = response.read()
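Since a remote server may be unreachable or may refuse a request, it can be useful to wrap the download in a small function that reports the problem instead of stopping the program. The following is a minimal sketch and not part of the examples above: the function name fetch_text, the timeout of 10 seconds and the choice to return None on failure are assumptions made here.

```python
import urllib.request
import urllib.error

def fetch_text(url, timeout=10):
    """Return the contents at url as a str, or None if it cannot be opened."""
    try:
        response = urllib.request.urlopen(url, timeout=timeout)
    except urllib.error.URLError as err:
        # Raised for unreachable hosts, refused requests, missing files, ...
        print('Could not open', url, ':', err.reason)
        return None
    data = response.read()
    response.close()
    return data.decode()     # bytes -> str, assuming UTF-8 encoded content

# Possible use (requires network access, hence commented out):
# page = fetch_text('http://www.chemcalc.org/')
```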
9.2.2 Accessing databases: NCBI, Entrez and BioPython
Examples of human diseases caused by viruses include the common cold, influenza,
chickenpox, and cold sores, as well as serious diseases such as avian influenza, SARS,
and Ebola virus disease. If we would like to study the genome of this last virus then
a literature search shows that the genus Ebolavirus is a member of the Filoviridae
family. There are currently 5 known species: Zaire ebolavirus, Sudan ebolavirus, Tai
Forest ebolavirus, Bundibugyo ebolavirus, and Reston ebolavirus. The Zaire ebolavirus
is responsible for the outbreak that started in West Africa in 2014, the largest outbreak
since the virus was first discovered in 1976. Genetic sequencing has shown that the virus
isolated from infected patients in the 2014 outbreak is 97% similar to the virus that first
emerged in 1976. If we want to check this similarity we have to search for the nucleotide
sequences.
The first step towards a complete genome of the Zaire ebolavirus is a general search
(using a web browser) at the NCBI site:
http://www.ncbi.nlm.nih.gov/
Since we are interested in nucleotides, we select the nucleotide database
http://www.ncbi.nlm.nih.gov/nuccore
and add the query term ’ebola+virus+isolate’:
http://www.ncbi.nlm.nih.gov/nuccore/?term=ebola+virus+isolate
This search returned 74491 hits in the nucleotide database for "ebola virus", of which the first
20 are shown.
In the publication Viruses 2014, 6, 3663-3682; doi:10.3390/v6093663 we find: ’Ebola
virus (EBOV) is the most thoroughly characterized ebola virus. Dozens of EBOV isolates are available, but the vast majority of published experiments have been performed
with isolates Mayinga and Kikwit. The Mayinga isolate, the first EBOV isolate obtained
in 1976, has been used extensively for molecular-biological characterizations. The Kikwit
variant, obtained during an Ebola virus disease outbreak in 1995, has been used almost
exclusively for pathogenesis studies in nonhuman primates in the US (the Mayinga isolate is used almost everywhere else).’ So inspecting the first 20 items we select the items
14 and 15: ’Ebola virus isolate Ebola virus/H.sapiens-tc/COD/1995/Kikwit-807223,
complete genome’ and ’Ebola virus isolate Ebolavirus/H.sapiens-tc/COD/1976/Yambuku-Mayinga, complete genome’.
Next we go to the ’Send to’ button where we select file as destination and start downloading these two fasta records. They are also available on the Canvas site in the file
’ebolasequences.fasta’. For two hits this process is doable but for more hits we need an
alternative: BioPython.
Entrez
The module in the BioPython package to be used for data retrieval is 'Entrez': "Entrez
(https://biopython.org/DIST/docs/api/Bio.Entrez-module.html,
https://www.ncbi.nlm.nih.gov/books/NBK25501/ ) is a data retrieval system that provides users access to NCBI's databases such as PubMed, GenBank, GEO, and many
others. You can access Entrez from a web browser to manually enter queries, or you can use
Biopython's Bio.Entrez module for programmatic access to Entrez. The latter allows
you for example to search PubMed or download GenBank records from within a Python
script."
In this subsection we demonstrate how to use Entrez. Instead of the Ebola virus we use
a bacterium as example, namely the organism Escherichia coli. This bacterium is for
instance used in the laboratory of the ’Chemical Biology’ group for the production of new
proteins but also by student teams from the TU/e that participated in the international
iGEM competition.
The first step in using Entrez in a Python program is always to import it:
from Bio import Entrez
Next, NCBI asks for an email address to identify who is accessing its site. Use your TU/e
email address:
from Bio import Entrez
Entrez.email="your.name@student.tue.nl"
To search for the reference sequence of the complete genome of Escherichia coli we have
to specify two items:
• The database
Since we are searching for the nucleotide sequence we select the nucleotide database:
db="nucleotide"
• The search term
This is more complicated. The NCBI database entries can be interpreted as a
kind of dictionary. NCBI calls its keys fields. Examples of fields are "Organism"
and "Properties", but there is also a construction that matches all fields: "All
Fields". As organism we search for "Escherichia coli"; we are interested in the
complete genome sequence, and only in those sequences that have a stable reference.
For this last part we search in the field "Properties"
for the value srcdb_refseq. Combining all of this leads to our search term:
term=('"Escherichia coli"[Organism] AND "complete genome"[All Fields] '
      'AND "srcdb_refseq"[Properties]')
Such search terms can become long, so we need a more elegant construct. As
in programming, we build the large construct from smaller pieces. As
you may have inferred, search terms are combined with the word "AND". So we
can build our search term as follows:
l=['"Escherichia coli"[Organism]']
l.append('"complete genome"[All Fields]')
l.append('"srcdb_refseq"[Properties]')
searchterm=' AND '.join(l)
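The same construction can be wrapped in a small helper. The sketch below is only an illustration: the function name build_search_term and its list-of-pairs argument are assumptions made here, not part of Biopython or of the NCBI interface.

```python
def build_search_term(parts):
    """Join (value, field) pairs into one Entrez query string.

    Hypothetical helper: each pair becomes '"value"[field]' and the
    pieces are combined with ' AND ', as in the construction above.
    """
    pieces = []
    for value, field in parts:
        pieces.append('"' + value + '"[' + field + ']')
    return ' AND '.join(pieces)

searchterm = build_search_term([
    ("Escherichia coli", "Organism"),
    ("complete genome", "All Fields"),
    ("srcdb_refseq", "Properties"),
])
print(searchterm)
```

Keeping the value and the field apart makes it harder to forget a quote or a bracket when a query grows.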
Combining these two items leads to our standard Python search script for Entrez:
from Bio import Entrez
Entrez.email="p.a.j.hilbers@tue.nl"
l=['"Escherichia coli"[Organism]']
l.append('"complete genome"[All Fields]')
l.append('"srcdb_refseq"[Properties]')
searchterm=' AND '.join(l)
handle=Entrez.esearch(db="nucleotide", term=searchterm)
Such a handle is similar to a file, and we apply the Entrez 'read' function to it:
record=Entrez.read(handle)
The record we obtain is a Python dictionary having many different key–value pairs:
{'Count': '4186', 'RetMax': '20', 'RetStart': '0', 'IdList': ['1767732542',
'1767732539', '1767732528', '1767732513', '1767732496', '1767732466',
'1767732428', '1767045039', '1724048583', '1724048578', '1724048573',
'1724048568', '1724048560', '1724048547', '1724048519', '1724048161',
'1520474167', '1520474049', '1511333192', '1393717272'], 'TranslationSet':
[{'From': '"Escherichia coli"[Organism]', 'To': '"Escherichia coli"[Organism]'}],
'TranslationStack': [{'Term': '"Escherichia coli"[Organism]', 'Field':
'Organism', 'Count': '7055571', 'Explode': 'Y'}, {'Term':
'"complete genome"[All Fields]', 'Field': 'All Fields', 'Count': '552639',
'Explode': 'N'}, 'AND', {'Term': '"srcdb_refseq"[Properties]', 'Field':
'Properties', 'Count': '61550647', 'Explode': 'N'}, 'AND'],
'QueryTranslation': '"Escherichia coli"[Organism] AND
"complete genome"[All Fields] AND "srcdb_refseq"[Properties]'}
Note that all keys are strings, and that numbers such as the value of 'Count' are given
as strings too. There are 4186 matches ('Count') in the database.
The ids of only 20 ('RetMax') matches are returned in the list 'IdList'. To access these
identifiers we use:
idl=record['IdList']
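Because the values in the record are strings, 'Count' and 'RetMax' have to be converted with int before they can be used in calculations. The sketch below uses an ordinary dictionary, holding a few of the values printed above, as a stand-in for the real record, so that it runs without a network connection. The variable names total, returned and missing are chosen here for illustration.

```python
# Stand-in for the record returned by Entrez.read, with the IdList
# shortened to its first three identifiers from the output above.
record = {'Count': '4186', 'RetMax': '20', 'RetStart': '0',
          'IdList': ['1767732542', '1767732539', '1767732528']}

total = int(record['Count'])      # number of matches in the database
returned = len(record['IdList'])  # number of ids actually in this record
missing = total - returned        # matches whose ids were not returned

print(total, returned, missing)
```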
If we are interested in all identifiers we should not be limited to the first 20:
handle=Entrez.esearch(db="nucleotide", term=searchterm, retmax=record['Count'])
rec2=Entrez.read(handle)
print(len(rec2['IdList']))
This should produce 4186 on output (on October 31, 2019). So far we have only searched
for matching records in the database. The next step is to get (some of) these records. To
that end we use Entrez's efetch function. Like the esearch function,
it needs a database (db="nucleotide"), and in addition a string of identifiers separated by
commas, the return type and the return mode. Since we are interested in the sequences
we use 'fasta' as return type and 'text' as return mode. So to obtain the first 2 matching
fasta sequences from the 'IdList':
Entrez.email="p.a.j.hilbers@tue.nl"
idn=",".join(rec2['IdList'][:2])
handle=Entrez.efetch(db="nucleotide", id=idn, rettype='fasta',
                     retmode="text")
outf=open('E_coli2ids.fasta', 'w')
outf.write(handle.read())
outf.close()
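To check what was written, the text can be inspected without any extra library. The sketch below assumes the usual FASTA layout, in which every record starts with a header line beginning with '>' followed by one or more lines of sequence. The function name splitFasta and the two short records are made up for this illustration; they are not real NCBI data.

```python
def splitFasta(text):
    """Return a list of (header, sequence) tuples from FASTA text."""
    records = []
    header = None
    seqlines = []
    for line in text.splitlines():
        line = line.strip()
        if line.startswith('>'):
            # a new record starts: store the previous one, if any
            if header is not None:
                records.append((header, ''.join(seqlines)))
            header = line[1:]
            seqlines = []
        elif line != '' and header is not None:
            seqlines.append(line)
    # store the last record, if there was one
    if header is not None:
        records.append((header, ''.join(seqlines)))
    return records

# toy input, not real sequence data
example = ">rec1 first toy record\nACGT\nACGT\n>rec2 second toy record\nGGCC\n"
for header, seq in splitFasta(example):
    print(header, len(seq))
```

For the downloaded file one could pass the string obtained by reading 'E_coli2ids.fasta' as argument.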
9.3 Exercises 41–48
Exercise 41:**
(a) Given is an arbitrary string s. Write a python fragment that determines for each
character occurring in the string, how often it occurs.
(b) Print the number of occurrences in a nicely aligned table, with in the first column the
character, in the second column its number of occurrences and in the third column its
percentage: e.g.: for "AAAAAAAAAAA#A#AB" the output should look like
#    2   12.50%
A   13   81.25%
B    1    6.25%
Exercise 42:
Let s be a DNA string consisting of the characters A, C, G, and T. The string s may
consist of both capital and small letters. Design a Python function with s as parameter
that determines for each letter in s the number of times it occurs and returns the four
letters in order of decreasing number of occurrences, i.e., first the letter that occurs
most often, subsequently the second most occurring, etc.
Exercise 43:**
(a) Design a function, with a string molformula as single parameter that extracts, using
urllib, information about the molecule described by molformula from the website
http://www.chemcalc.org. (A description of how information can be retrieved
from this website is given at the end of section 9.2.1.) The function should return
the string read from this webpage. Apply the function on your favorite molecule
(e.g. ’C2H6O’ or ’C11H15NO2’).
(b) The result of part (a) should be a string that begins and ends (except for possible
white space characters) with { and }, respectively. This string is in so called JSON
format. That is an open standard format that uses human-readable text to transmit
data objects. Using the Python library json this string s can be converted into a
Python dictionary using:
import json
chemcalcdict = json.loads(s)
Create the dictionary for your favorite molecule and show which keys the dictionary
has. If one of these keys reads ’mw’, check the corresponding value to see what the
molecular weight of your favorite molecule is.
(c) From the dictionary of part (b), we can also extract information on the mass percentage of its constituent elements. Namely,
chemcalcdict['parts'][0]['ea']
yields a list of dictionaries. Each dictionary contains an element name, its number
of occurrences, and its mass percentage in the molecule. Use this information to
print a table like:
C   52.14%
H   13.13%
O   34.73%
Exercise 44:**
Open, using urllib, the webpage http://cbio.bmt.tue.nl/~philbers/index.htm and
count the number of times the word ’computational’ (case insensitive) occurs on that
page.
Exercise 45:**
Design a Python function with a single list parameter that returns the number of hits that a
search at NCBI yields when the elements of the list are concatenated with ' AND '
as search term in the nucleotide database. Apply the function for
l=[ "Escherichia coli[Organism]", "complete genome[All Fields]",
"srcdb_refseq[Properties]"]
Exercise 46:
Design a Python function with a searchterm list and an integer as parameters that
returns the list of Entrez id’s that match the search query when the elements of the
searchterm list are concatenated with ’ AND ’ as search term in the nucleotide database
and with the integer as ’retmax’ parameter. Apply the function for
l=[ "Escherichia coli[Organism]", "complete genome[All Fields]",
"srcdb_refseq[Properties]"]
and the total number of hits (as obtained in the previous exercise) as integer parameter.
Exercise 47:
Design a Python function with a list of GenBank ids and a string as parameters that
generates a file with that string as name and as contents the FASTA records of the
GenBank ids. (Multiple ids can be fetched by joining them with a comma.) Apply the
function with a list of 10 different GenBank ids matching "Escherichia coli" as organism,
a complete genome and refseq as property and as string parameter "E_Coli10ids.fasta".
Tips: 1) use your solution of the previous exercise to obtain the 10 ids, and 2) running
your program (i.e. downloading the sequences) may take a minute or so.
Exercise 48:
On the webpage http://cbio.bmt.tue.nl/~philbers/8CA10/E_coliallids.fasta a
file 'E_coliallids.fasta' is given containing the FASTA records of all (at the moment the
file was generated) 166 hits of the search on reference sequences of complete genome
nucleotides of E. coli. N.B.: This file is more than 317 Mbytes in size.
Apart from using your solution to an earlier exercise, this FASTA file can be read and
stored in a dictionary with as keys the ids of the records and as values the corresponding
sequences:
from Bio import SeqIO
def getFastaRecordsDict(filename):
    handle = open(filename, "r")
    seqdict = {}
    for record in SeqIO.parse(handle, "fasta"):
        seqdict[record.id] = str(record.seq)
    handle.close()
    return seqdict
allEcoliDict = getFastaRecordsDict('E_coliallids.fasta')
(a) Design a Python function with a sequence as single parameter, that determines the
relative frequencies of occurrence of the four bases A, C, G and T for the sequence
(i.e. the fractions of those bases in the sequence) and returns those frequencies
as a dictionary. In the solution one should not use the standard count-method of
Python. Test your function on for instance the sequence ”AATAATGCCC”.
(b) Design a Python function with two parameters (a record id and a sequence) that,
using a call to the function created in part (a), returns as a string a single line of
text with as first entry the record id, then the 4 frequencies of the bases and as last
entry the length of the sequence of the record. The entries should be separated by
a semicolon. Test your function on for instance the Fasta record with the id
’gi|452742789|ref|NZ_CADZ01000110.1|’.
(c) Design a function with the allEcoliDict dictionary as single argument, that generates
for each of the records in the dictionary a line as described in part (b) and returns
all these lines as a single string.
(d) Write the output of part (c) to a file ’Ecolifreq.txt’.
(e) Ever since the early days of molecular biology, base composition has been used
as a descriptive statistic for genomes of various organisms. Especially the guanine-cytosine
content of bulk DNA, i.e. the fraction of C and G among all base pairs, has
been frequently used because, even before cloning and DNA sequencing were available,
it could already be determined by measuring the melting temperature of the DNA,
since GC base pairs are more stable than AT base pairs.
Design a function that generates for a record of the FASTA file a line similar to the one
described in part (b), having the id and the length of the sequence, but now with the C+G
content of the sequence instead of the four base frequencies.
Remark: It is not excluded that a sequence contains 0 bases!
Chapter 10
Program design and examples
We consider in this chapter another often used, more general repetition construct, the
while. Moreover, we discuss our approach of “How to design programs”. We, however,
can only touch on this last subject. There are numerous books and Bachelor and Master
programmes about this topic. A (bio)medical Engineering student does not need to
know all details but should have some basic knowledge of this material. The approach
we therefore take is to discuss general issues by using practical cases as a guide.
10.1 A more general repetition construct (while)
In Python a for loop is not the only type of looping construct available. With a for
construct one has to know in advance, or be able to calculate, the number of iterations
that has to be performed. So what happens when we want to keep doing a specific task
until something happens but we don't know when that something will be? To solve this
problem we have another type of loop: the while-loop.
An example of its usage is shown below.
>>> value=1.0
>>> while value <= 10:
...     print("Current value is: ", value)
...     value=value*2.7
...
>>> print("After the loop the value is: ", value)
Execution of this program fragment gives:
Current value is:  1.0
Current value is:  2.7
Current value is:  7.29
After the loop the value is:  19.683
With a while-construct a sequence of actions is repeated until a condition no longer
holds.
Stepping through the above example we have
1. First we initialize value to 1.0. Initializing the control variable of a while loop is
a very important first step, and a frequent cause of errors when left out.
2. Next we execute the while statement itself, which evaluates a boolean expression.
3. If the result is True, it proceeds to execute the indented block that follows. In
our example value is less than 10 so we enter the block.
4. We execute the print statement to output the first line.
5. The next line of the block multiplies the control variable value by 2.7. In this
case it is the last indented line, signifying the end of the while block.
6. We go back up to the while statement and repeat steps 2-5 with our new value of
value.
7. We keep on repeating this sequence of actions until value exceeds 10, which happens
when it reaches 19.683. At that point the while test returns False and we skip past the
indented block to the next line with the same indentation as the while statement.
8. In this case that line is a statement by which the value is printed. Since there are
no other lines, the program stops.
The general form of a while is:
initialization
while boolean_expression:
statement_block
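As a further illustration of this general form, the sketch below counts how often a number of cells has to double before a target is reached, a case where the number of iterations is not known beforehand. The function name doublingsNeeded and its parameters are chosen here for illustration only.

```python
def doublingsNeeded(start, target):
    """Count how often start must be doubled to reach at least target.

    Assumes start is a positive integer, otherwise the loop never ends.
    """
    count = 0          # initialization of the counter
    cells = start      # initialization of the control variable
    while cells < target:
        cells = cells * 2
        count = count + 1
    return count

print(doublingsNeeded(1, 1000))
```

Both variables are initialized before the while statement, and the control variable cells changes inside the block, so the boolean expression eventually becomes False.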
Having introduced these programming constructs we shift our attention to the most
important topic of this course: “How to design programs”.
10.2 Programming: problem formulation, analysis and design
The design of a program usually follows four phases that are repeated until a satisfactory
solution is obtained.
1. The first step in the design of a program is always a careful analysis and, when
needed, a reformulation of the problem. This analysis also yields a list of requirements
that a solution should satisfy. This list of requirements is called the problem
specification. In the analysis we use mathematics to rephrase the problem to arrive
at a formulation that is much more precise than in daily language. In this phase
we also try to abstract from certain irrelevant details.
2. Having formulated the problem in a precise form, the next step is to create a
sketch of a program. Remarkably, at this stage no real programming
language is used; rather, the program sketch is written in pseudo code.
3. The third step is to translate this pseudo code into a real programming language.
This translation is achieved by the usage of tools, libraries and templates.
4. The last step in the cycle is to determine whether the solution satisfies the requirements. In many programming texts this last phase is called testing. Since testing
cannot exhaustively check all possibilities we take a different approach by a style
of programming called program derivation. In the last phase the requirements'
analysis might result in looking for alternatives.
Usually, in the first approach not all elements a problem consists of are dealt with. In
general the problem is first split into subproblems and these subproblems are solved
using the above scheme. In a later phase the solutions of the subproblems are assembled
into the final solution.
Please note that the above scheme is not a general theory. In fact there is no general
recipe for designing programs. Programming experience of the last twenty years suggests
that the only way to learn to design programs is by investigating earlier solutions and
by doing. By experience and trying to imitate the solutions from others, sometimes
denoted by the term reverse engineering, a personal style can be developed.
10.3 Programming examples
10.3.1 Counting 'CGs' in DNA strings
A typical bioinformatics problem is to design a method by which from a given list of
strings, a sorted list of counts of the pattern ’CG’ in the items of the list is determined.
Although the problem is rather simple we use it to describe the general approach step
by step.
The first step is the problem analysis and specification. Since we know how to sort a list
we simply forget the sorting requirement for the moment. Such a problem simplification
is quite common and has earned a special term: separation of concerns. So our first
task is to generate a list of counts of occurrences of “CG” in a list of string items.
Such a state to be reached is called a postcondition and is denoted by the letter R. Since
a result is to be returned we have to introduce a variable, say outl. Next we have
to specify the relation between the input conditions, called the precondition, and the
variable(s) in the postcondition. The input condition here is just that l is a list of string
items, and the relation is that outl is the list of counts of the pattern “CG” in the items
of l.
It is always wise to write what has been achieved so far:
P: l is a list of string items
name of the program/method: countList
R: outl, the list of counts of the pattern "CG" in the items of l
Since the list may consist of several items and for each item the same procedure has to
be followed we recognize that our solution needs repetition. But be warned, in some
cases the problem might suggest a repetition, while a direct solution is possible. As
an example consider a program to sum the first n integers. For the solution no loop is
needed, since s=n(n+1)/2 is a simple solution to this problem.
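The difference between the two approaches can be made concrete with a short sketch. The function names sumLoop and sumDirect are chosen here for illustration; the integer division // is used so that the direct result stays an integer.

```python
def sumLoop(n):
    """Sum of the integers 1..n computed with a repetition."""
    s = 0
    for i in range(1, n + 1):
        s = s + i
    return s

def sumDirect(n):
    """The same sum computed directly, without a loop."""
    return n * (n + 1) // 2

print(sumLoop(100), sumDirect(100))
```

Both functions return the same value, but the direct version needs a fixed number of operations regardless of n.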
The general framework of a repetition has 3 parts:
I: initialisation
B: body of the loop
F: finalisation
but both the initialisation and the finalisation part might be empty.
In designing a loop we are looking for a statement that is valid independent of the
number of times the loop has been repeated. Such a statement is called an invariant
and can usually be derived from the postcondition. Hence in our case it should say
something about the variable outl. At the end of the loop outl should contain all
counts, so when still inside the loop outl holds the counts of the items considered so
far. This is the invariant here. When the loop has not yet been entered, the invariant
should also hold and that is what has to be realized by the initialisation. If outl holds
the counts of the items considered so far, and we have not considered an item yet, then
outl should be the empty list:
outl=[]
It is also clear that we have to loop over all items of the list. The next step is then
simply to write down what has been obtained so far:
outl=[]
# outl holds the counts of the items of l considered so far
for dnas in l:
    # outl holds the counts of the items of l considered so far
    ....
    # outl holds the counts of the items of l considered so far
# outl holds the counts of the items of l considered so far
# and all items have been considered, hence
# outl holds the counts of all items of l
If we consider the next item of the list we are obliged, because of the invariant, to add
its count to the list:
countCG=dnas.count("CG")
outl.append(countCG)
and combining the several parts our solution becomes:
outl=[]
# outl holds the counts of the items of l considered so far
for dnas in l:
    # outl holds the counts of the items of l considered so far
    countCG=dnas.count("CG")
    outl.append(countCG)
    # outl holds the counts of the items of l considered so far
# outl holds the counts of all items of l
Having the list of all counts it simply suffices to sort the list to obtain the result desired:
outl.sort()
The last step is to assemble all parts into a method that properly returns the calculated list:
def countList(l):
    """ In this method the number of occurrences of the
    pattern 'CG' in each of the string items of the list l is returned
    in a sorted list.
    """
    outl=[]
    for dnas in l:
        countCG=dnas.count("CG")
        outl.append(countCG)
    outl.sort()
    return outl
If we put these lines of code in a file, say countList.py, we should finally test our
method. We could simply call the method by, for instance:
print(countList(['AAACGCGAA', 'CCGA', 'ACCC']))
but then this output would also be produced whenever the file is imported
in other modules. Python has a special construct for testing a module, the
test __name__=="__main__":
if __name__=="__main__":
    print(countList(['AAACGCGAA', 'CCGA', 'ACCC']))
Only when the file countList.py is run as the main module, i.e., started by the
run command, does the boolean expression hold and is the output produced.
The procedure described is lengthy but should be a valuable guide for designing loops
in general. This is also shown in the next example.
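The invariant used in the derivation of countList can also be checked while the program runs. The sketch below is one possible way to do so, not part of the derivation itself: the name countListChecked and the helper variable done are introduced here, and the assert statements recompute the counts, so they are meant for checking only and make the function slower.

```python
def countListChecked(l):
    """Like countList, but the invariant is tested on every iteration."""
    outl = []
    done = 0   # number of items of l considered so far
    for dnas in l:
        # invariant: outl holds the counts of l[:done]
        assert outl == [s.count("CG") for s in l[:done]]
        outl.append(dnas.count("CG"))
        done = done + 1
    # invariant and all items considered: outl holds the counts of all items
    assert outl == [s.count("CG") for s in l]
    outl.sort()
    return outl

print(countListChecked(['AAACGCGAA', 'CCGA', 'ACCC']))
```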
10.3.2 All pattern positions in a string
As a second example we will design a method with two strings seq and pat as parameters
that returns a list of positions containing all start positions of non-overlapping pat’s in
seq. We have encountered this problem before in Exercise 27, but here we discuss this
problem along the framework introduced in this chapter.
• Specification
pre(P): seq a string, pat a string
name: patternPosList
post(R): list posl holds all positions in seq where pat starts
• Analysis
Since the pattern pat may occur several times in the sequence seq we need a
repetition. Eventually posl should hold all pat occurrence positions, so we again
look for an invariant of the type “DBC”, meaning that we have a part we already
have dealt(D) with, a part to be(B) done and the whole sequence seq as object
that is constant(C). In order to keep track of which part of seq we have already
seen we need an additional variable, say pos. When we introduce a new variable
it is advisable to state explicitly which values it is allowed to take.
It turns out that adding this information helps to avoid simple mistakes, and
moreover it is commonly useful information in the program design phase.
So the invariant I we are striving for is:
posl holds all occurrences of pat in seq[:pos],
seq[pos:] is still to be considered, and
0<=pos<=len(seq)
There are always 3 issues in considering invariants
1. finalisation
When pos=len(seq) then posl indeed holds all occurrence positions.
2. init(ialisation)
An invariant should always be simple to establish when the loop is not yet
entered. In this case it is indeed simple to establish the invariant when the
search is still to be started:
pos=0
posl=[]
3. loop and body
Here one should always start with writing down what has been achieved so
far and what is to be established.
# P
pos=0
posl=[]
# I
while B:
    # I and B
    body
    # I
# I and not B
# R: posl holds all positions in seq where pat starts
The not B part results of course from the fact that at that point the repetition
has ended. Its importance becomes clear from the following: I and not B
should guarantee that in posl all positions in seq where pat starts
are stored. So given invariant I:
posl holds all occurrences of pat in seq[:pos],
seq[pos:] is still to be considered, and
0<=pos<=len(seq)
what should not B be such that R holds? It is clear that nothing should be
left in
seq[pos:] is still to be considered
As many roads lead to Rome, here too we have some freedom. An obvious
choice is to have
pos=len(seq)
but we do not directly have a clue how to have pos increased every time
such that the postcondition is established. When do we know that the part
seq[pos:] that still has to be considered does not include the pattern pat
anymore? The answer is not too difficult: when pat is not part of seq[pos:],
calling seq[pos:].find(pat) returns the value -1. So we should continue
looping as long as seq[pos:].find(pat) differs from -1:
# P
pos=0
posl=[]
# I
while seq[pos:].find(pat) != -1:
    # I and B
    # in seq[pos:] pattern pat is still present
    loop
    # I
# in seq[pos:] no pattern pat is present anymore and I
# R: posl holds all positions in seq where pat starts
From this code it is also clear that, for the loop to stop, pos has to change
inside the loop. The question is of course how pos should be changed.
For efficiency reasons we want to inspect each pattern's beginning position
only once when we have found it. So the first step is to introduce a variable
holding this beginning position:
l=seq[pos:].find(pat)
Notice that find always starts counting at 0, so we should always add pos
to the result of find to have the right position inside the original sequence
seq, hence l=l+pos is needed to transfer the result of find into the position
in seq. And if we have this position then a possible next (non-overlapping)
occurrence of pattern pat can hence only start at:
l+len(pat)
So we have
# P
pos=0
posl=[]
# I
l=seq[pos:].find(pat)
while l != -1:
    # I and B
    # in seq[pos:] pattern pat is still present
    # transform l into the position in the original sequence seq
    l=pos+l
    pos=l+len(pat)
    # I
    l=seq[pos:].find(pat)
# in seq[pos:] no pattern pat is present anymore and I
# R: posl holds all positions in seq where pat starts
The final part is to add the position l to the list posl:
posl.append(l)
So if we include our method and its test in a file ’testpatpos.py’
def patternPosList(seq, pat):
    """return a list of all positions in string seq
    where substring pat is found
    """
    posl=[] # no positions found yet
    pos=0 # everything in seq from 0 to pos is done
    l=seq[pos:].find(pat)
    while l != -1:
        l=l+pos
        # new position l in seq found, hence l>=0
        posl.append(l)
        pos=l+len(pat) # only search in the remaining part of seq
        l=seq[pos:].find(pat)
    return posl
if __name__=="__main__":
    print(patternPosList("AAAAAUGBBBBAUGAAAAUG", "AUG"))
and use the Python interpreter by running this file we should find as output
[4, 11, 17]
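The string method find also accepts an optional start index, so the slice seq[pos:] and the correction l=l+pos can be avoided. The sketch below is a variant along those lines, not the solution derived above: the name patternPosListFind, the guard for an empty pattern and the assert statement are additions made here. The guard is included because with an empty pat the search always succeeds without pos ever increasing, so the loop would not end.

```python
def patternPosListFind(seq, pat):
    """Start positions of non-overlapping occurrences of pat in seq."""
    posl = []
    if pat == "":
        return posl        # guard: an empty pattern would never advance pos
    pos = 0
    l = seq.find(pat, pos)     # result is already a position in seq
    while l != -1:
        posl.append(l)
        pos = l + len(pat)
        # part of the invariant: pos stays within the bounds of seq
        assert 0 <= pos <= len(seq)
        l = seq.find(pat, pos)
    return posl

print(patternPosListFind("AAAAAUGBBBBAUGAAAAUG", "AUG"))
```

For the test string used above this variant gives the same list [4, 11, 17].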
10.4 Exercises 49–57
Exercise 49:
The purpose of this exercise is to learn more about the design and usage of invariants
and variant functions.
Given is a bag with white and black marbles and additionally there is a sufficiently large
collection of black marbles. Repeatedly the following actions are taken: 2 marbles are
randomly drawn from the bag, if the two marbles have a different color, the white one
is put back into the bag, if the two marbles have the same color, a black marble is put
into the bag. Two questions have to be answered:
1. Does this repetition end, i.e., is there a moment that there is only one marble in
the bag.
2. If so, can you predict the color of this last marble?
Exercise 50:**
Design a python function that prints a triangle of stars (’*’) with k stars as basis and
k stars as height by using a while-construct. Hence in your solution a for-statement is
not allowed. k should be an integer parameter of the function and between two stars a
space should be printed. For k=4 the output should look as follows:
*
* *
* * *
* * * *
Exercise 51:
Similar to the previous exercise but now for an open triangle, i.e., the first output line
has one star, lines 2 through k-1 two stars and the last line k stars. Hence, for k=4 the
output should look as follows:
*
* *
*   *
* * * *
Exercise 52:
Design a function that prints a cross consisting of 2 diagonals of k stars. k is an odd
positive integer parameter of the function. For k=5 the output should be
*   *
 * *
  *
 * *
*   *
Exercise 53:**
Given is the following function
def countSomething(word='insulin resistance'):
    i = 0
    counter = 0
    jump = word.index('n')
    while i<len(word):
        if word[i]=='n':
            counter = counter+i
            i = i+jump
        i = i+1
    return counter
(a) Without actually running the program, which values does i obtain when the function is called by
countSomething(’diabetes patient’)
(b) The same question but now for
countSomething(’diabetes patient or not’)
(c) Similar for
countSomething(’not a diabetes patient’)
(d) And finally for
countSomething()
Exercise 54:**
Given is the following function:
def examplerep():
    print("Give 5 numbers smaller than 1000:")
    a=1000
    b=1000
    i=0
    while (i<5):
        c=int(input("Give one value: "))
        if (c<b):
            if (c<=a):
                b=a
                a=c
        i=i+1
    print("My answers are: "+str(a)+" and "+str(b))
Add adequate output formatting statements to this function such that, when it is called
with each of the following series of 5 integer values (supplied on input on separate lines),
a table with the values of the variables in each turn of the iteration is printed. The
table should hence have a column per variable and a row per iteration step.
(a): 0 1 2 3 4
(b): 900 800 700 600 500
(c): 3 33 333 444 33
Exercise 55:
Design in a number of steps a program that prints the word IK in the form of stars.
The letter I consists of one vertical line of k stars, while the K has a height of k lines
and a width of (k + 1)/2 positions where k is an odd input parameter of at least 3. The
two letters are separated by a space.
(a) Design a function makeI(k) that returns the letter I as list of lines. Clearly show
which invariant(s) are used.
(b) Design a function makeK(k) that returns the letter K as list of lines.
(c) Design a function makewoord(k) that prints the word IK that uses the methods of
(a) and (b).
When for example k = 5, the output would be:
* * *
* **
* *
* **
* * *
Exercise 56:
Given is a list l of integers.
(a) Design a Python function with l as parameter that returns the number of pairs of
consecutive elements of l in which the first element is smaller than the second one.
(b) Adapt your solution of part (a) such that it returns the index of the 6th such
pair, if it exists, and -1 otherwise. Design your function in such
a way that it does not perform redundant calculations. For instance, in general it
should not first determine all the pairs.
The solutions should include a specification and, when a repetition is used, an invariant
matching that repetition should be given.
Exercise 57:
Let a CA+C row be defined as a DNA sequence starting with a C, followed by one or
more As and ending with a C.
(a) Design a function that determines the starting positions of all non-overlapping
CA+C rows in a DNA sequence and that returns these indices as a list.
(b) Design a function that determines the end positions of all non-overlapping CA+C
rows in a DNA sequence and that returns these indices as a list.
(c) Design a function that determines the starting and end positions of all non-overlapping
CA+C rows in a DNA sequence and that returns these as a list of 2-tuples (where
each 2-tuple comprises the starting and end position of one CA+C row). In
your solution you have to make use of your solutions of part (a) and (b).
(d) Which adaptation(s) is (are) necessary when overlap of CA+C rows is allowed for?
Provide specifications, analyses and invariants for all your solutions.
Chapter 11
Classes, Excel files and boxplots
In previous chapters we have written output to screen as well as to text files. Also we
have seen how strings can be formatted such that the data is presented in a structured
way in such a plain text format. Here we will see how one can also use different file
formats instead of plain text files. In particular we focus on reading/writing Excel files.
Finally we will also see how boxplots can be made to visualize data sets. But first we
will consider classes.
11.1 Classes and objects
We have now dealt with all basic components needed to introduce the main notion in
object-oriented programming: classes. So far we have only mentioned the notion of a class
and, behind the scenes and perhaps without noticing, have used classes in the form of the
built-in methods and types. Moreover, our attention has been on the actions to be performed
on the data types. In general, however, there is a close relation between the actions
and the data elements. For instance, if we would have a list of blood pressure values of
a patient and another list with DNA strings from genes belonging to a certain family,
then calculating the average makes sense for the former list and not for the latter,
whereas in case of counting the number of occurrences of the nucleotide base 'A' it is
the other way around. So what is needed is a programming structure in which the close
relation between the data elements and the methods to be invoked on these elements can be
expressed. Object-oriented programming languages offer such a structure, called a class.
A small example
In this section we show by an example the general framework of designing a class.
For instance suppose we design a database in which we want to store information about
students. One element of a student is the name of the student. As an example we could
create a file ’student.py’ having the following content:
class Student:
    def setName(self, stname):
        self.name=stname
        print("Name of student '"+self.name+"' set")
If we have such a class definition we can create an object of this class by using its name
followed by parentheses. So by
stu1=Student()
stu2=Student()
we have created two new Student objects.
If we have an object of a class, then to perform actions on an object we have to use the
methods defined for the objects of the class. In this example we have defined one such
method, setName.
To use a method on an object, in object-oriented jargon calling a method, one has to
give the name of the object, followed by a ., and then the name of the method and the
arguments:
objectname.methodname(arguments).
The argument list is, however, to be given in a special way. The special thing about
calling a method of a class is that the object itself is passed as the first argument. In our
example, in the call stu1.setName("Tobe OrNotToBe") the object stu1 is passed as
first argument to the setName method while the string "Tobe OrNotToBe" becomes the
second argument. So the parameter self gets as value stu1 and the parameter stname the
value "Tobe OrNotToBe". Hence stu1.setName("Tobe OrNotToBe") is exactly equivalent to
Student.setName(stu1, "Tobe OrNotToBe").
In general, calling a method with a list of n arguments is equivalent to calling the
corresponding function with an argument list that is created by inserting the method’s
object before the first argument.
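This equivalence can be checked directly. The following minimal sketch reuses the Student class of this section (the print statement is left out to keep the output short) and sets the same name on two objects, once with a method call and once by calling the function stored in the class with the object as explicit first argument.

```python
class Student:
    def setName(self, stname):
        self.name = stname

stu1 = Student()
stu2 = Student()

# Method call: the object stu1 is passed implicitly as 'self'.
stu1.setName("Tobe OrNotToBe")

# Equivalent call of the function stored in the class: the object
# stu2 is passed explicitly as the first argument.
Student.setName(stu2, "Tobe OrNotToBe")

print(stu1.name == stu2.name)
```

Both objects end up with the same value for the data attribute name, so the last line prints True.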
So the result of
stu1.setName("Tobe OrNotToBe")
is that the data attribute name of the object stu1 gets the value "Tobe OrNotToBe",
while the text Name of student 'Tobe OrNotToBe' set is shown on output.
So if we subsequently enter the statement
print("The name of this student is", stu1.name)
then the text
The name of this student is Tobe OrNotToBe
should appear.
Extending this class with other data attributes and methods is straightforward. If we
create a file “student.py” with the following contents:
class Student:
    def setName(self, stname):
        self.name=stname
        print("Name '"+self.name+"' set")
    def setStudentId(self, stid=0):
        self.studid=stid
        print("Student id:", self.studid, "set")
    def printvals(self):
        print(self.studid)
        print(self.name)
then we could use this file and the class defined in it by:
>>> import student
>>> stu=student.Student()
>>> stu.setName("Tobe OrNotToBe")
Name 'Tobe OrNotToBe' set
>>> print("The name of this student is", stu.name)
The name of this student is Tobe OrNotToBe
>>> stu.setStudentId(12345)
Student id: 12345 set
>>> print("The id of the student is", stu.studid)
The id of the student is 12345
>>> stu.printvals()
12345
Tobe OrNotToBe
So generalizing this example, in designing programs we have to split up our problem
in classes and to define in each of the classes the desired methods, such as setName,
setStudentId, and printvals in the class Student, and the data attributes, such as
name and studid in the class Student.
Creation of an object of the class, in computer jargon object instantiation, has been
realized by a function-like call, by writing the name of the class followed by parentheses.
The object created in this way has no data attributes yet and is an “empty” object.
There is, however, a standard method to create objects having data attributes already
at instantiation. This method is shown in the next subsection.
The built-in __init__ method
The instantiation operation Student(), “calling” a class object, creates an empty object.
Many classes like to create objects in a pre-defined state. To that end a class may define
a special method named __init__(), like this:
def __init__(self):
    self.name = ''
    self.studid=0
This special method is also often called the constructor.
When a class defines an __init__() method, class instantiation automatically invokes
__init__() for the newly-created class instance. So in this example, a new, initialized
instance can be obtained by:
stu=student.Student()
and then automatically the data attributes are constructed.
Of course, the __init__() method may have arguments for greater flexibility. In that
case, arguments given to the class instantiation operator are passed on to __init__().
For example,
>>> class Student:
...     def __init__(self, stname='', stdid=0):
...         self.name = stname
...         self.studid = stdid
...
>>> stu = Student("Tobe OrNotToBe", 12345)
>>> print(stu.name, stu.studid)
Tobe OrNotToBe 12345
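Because both parameters of this __init__() method have default values, the class can be instantiated with zero, one or two arguments. The following sketch shows a few possibilities; the variable names a, b and c are only illustrative.

```python
class Student:
    def __init__(self, stname='', stdid=0):
        self.name = stname
        self.studid = stdid

a = Student()                     # both default values are used
b = Student("Tobe OrNotToBe")     # only the name is given
c = Student(stdid=12345)          # only the id is given, by keyword

for stu in (a, b, c):
    print(repr(stu.name), stu.studid)
```

Passing stdid by keyword makes it possible to skip the first parameter while still relying on its default value.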
11.2 Excel files in Python
In previous chapters we have seen that using Python it is possible to read and write text
files, and that this can be used for instance to process data. Another common file type
to store (medical) data is given by Excel files (like .xlsx files).
Importing openpyxl
Excel files can also be read and written from Python using for instance the openpyxl
library, which can be loaded using
import openpyxl
Reading Excel files
If the library is installed, opening a file Munster2mets.xlsx works as follows:
import openpyxl
wb = openpyxl.load_workbook(filename = 'Munster2mets.xlsx')
where the file name is thus provided as a keyword parameter. The variable wb now
contains an openpyxl.workbook.Workbook object that may contain multiple work sheets
(data attributes) and some methods. Each work sheet may in turn contain cells, ordered
in rows and columns, in which the actual data is stored. To get the current work sheet
and determine the number of rows containing data one can use the commands
ws = wb.active
nrrows = ws.max_row
Accessing the cell in the i-th row and j-th column and the data stored in that cell can
be done using
cell = ws.cell(row=i,column=j)    # get the cell
data = cell.value                 # get the actual data
Unfortunately, the row and column numbers start at 1, instead of the 0 we are used to
in Python. The above two lines can also be combined to
data = ws.cell(row=i,column=j).value    # directly access the data
Figure 11.1: a) Screen shot of the file Munster2mets.xlsx in Microsoft Excel. b) Scatter
plot of the data in the Excel file.
In a small example we will show how the data in the Excel file shown in Figure 11.1a
can be plotted. First we read the Excel file and store the data of the first column in
a list x. Subsequently we use Matplotlib to plot the data. When plotting the data we
ignore the first element as this contains, as apparent from Figure 11.1a, a comment:
import openpyxl
wb = openpyxl.load_workbook(filename = 'Munster2mets.xlsx')
ws = wb.active
nrrows = ws.max_row
x = []
for i in range(1,nrrows+1):
    x.append(ws.cell(row=i,column=1).value)
import matplotlib.pyplot as plt
plt.plot(x[1:],'r*-')
If we now also want to use the data in the second column, we could repeat the for loop
for column=2 and thus add e.g. the lines
y = []
for i in range(1,nrrows+1):
    y.append(ws.cell(row=i,column=2).value)
or merge the two loops to
x = []
y = []
for i in range(1,nrrows+1):
    x.append(ws.cell(row=i,column=1).value)
    y.append(ws.cell(row=i,column=2).value)
Both solutions, however, have the drawback that code is repeated. A nicer solution would
be the use of functions, as we considered earlier. We could then
define a function readColumn(ws, colnr), with the work sheet and requested column
number as arguments, which returns the values in the requested column as a list:
def readColumn(ws, colnr):
    nrrows = ws.max_row
    l = []
    for i in range(1,nrrows+1):
        l.append(ws.cell(row=i,column=colnr).value)
    return l
x = readColumn(ws, 1)
y = readColumn(ws, 2)
This of course becomes more and more advantageous as the number of columns to be
processed increases.
The complete code to make a scatter plot of the data in the first column versus the data
in the second column (as in Figure 11.1b) then may look like:
import openpyxl
import matplotlib.pyplot as plt
def getWorksheet(xlsfilename):
    wb = openpyxl.load_workbook(filename = xlsfilename)
    ws = wb.active
    return ws
def readColumn(ws, colnr):
    nrrows = ws.max_row
    l = []
    for i in range(1,nrrows+1):
        l.append(ws.cell(row=i,column=colnr).value)
    return l
ws = getWorksheet('Munster2mets.xlsx')
x = readColumn(ws, 1)
y = readColumn(ws, 2)
plt.plot(x[1:],y[1:],'r*')
plt.xlabel(x[0])
plt.ylabel(y[0])
Writing Excel files
Apart from reading Excel files, openpyxl also allows for generating and editing such
files. A new empty workbook can be generated using
import openpyxl
wb = openpyxl.Workbook()    # generate a new empty workbook
ws1 = wb.active             # select the first (and only) work sheet
ws1.title = "Test"          # change the name of the work sheet
Data can be added to the work sheets in different manners. The first way is analogous
to the way we read data above, i.e., by accessing cells by row and column number and
accessing their value field, e.g.:
for row in range(10, 20):
    for col in range(7, 34):
        cell = ws1.cell(column=col, row=row)
        cell.value = row-col
Alternatively, the cells could also be accessed by their name in an Excel-like fashion, i.e.,
using one (or multiple) letter(s) for the column number followed by the row number.
Adding the value pi to a new work sheet (named 'Pi') in the 6th column of the 5th row is
done by
ws2 = wb.create_sheet(title="Pi")
cell = ws2['F5']
cell.value = 3.14
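To see how such a cell name relates to the row and column numbers used with the cell method, the conversion can be written out by hand. The helper below is a hypothetical plain Python sketch (the name cell_name_to_row_col is our own and is not part of openpyxl, which has its own facilities for this); it assumes a well-formed name consisting of letters followed by digits.

```python
def cell_name_to_row_col(name):
    """Convert an Excel-style cell name such as 'F5' to (row, column)."""
    letters = ""
    digits = ""
    for ch in name:
        if ch.isalpha():
            letters += ch.upper()
        else:
            digits += ch
    col = 0
    for ch in letters:
        # The column letters form a base-26 number with A=1, ..., Z=26.
        col = col * 26 + (ord(ch) - ord('A') + 1)
    return int(digits), col

print(cell_name_to_row_col('F5'))
print(cell_name_to_row_col('AA10'))
```

The name 'F5' thus corresponds to row 5 and column 6, and 'AA10' to row 10 and column 27. Note that the name lists the column first, whereas the returned tuple lists the row first.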
As a third method it is also possible to add data to multiple cells at once using the
append method of a work sheet. If a list l is provided as an argument, a new row is added
below the existing rows of that work sheet and the items of the list are written to the
first len(l) columns in that row.
ws3 = wb.create_sheet(title="Data")
for row in range(1, 40):
    ws3.append(['He','Ho',row])
The Excel file can be written to disc using the save method of the workbook:
wb.save(filename = 'testbook.xlsx')
Additional information on openpyxl
Additional information on openpyxl is available from the package documentation at
https://pypi.python.org/pypi/openpyxl.
11.3 Boxplot
A standardized way of displaying the distribution of data is by means of a boxplot (also
known as a box-and-whisker diagram or plot). Such a plot is based on the five number
summary of the data: minimum, first quartile, median, third quartile, and maximum.
In a simple boxplot, see Figure 11.2a, a central rectangle spans the first quartile to the
third quartile (the interquartile range or IQR), a line inside the rectangle shows the
median and ’whiskers’ above and below the box show the locations of the minimum and
maximum.
Real (medical) datasets will often display surprisingly high or surprisingly low values
called outliers. To prevent such outliers from stretching the whiskers, the length of
the whiskers is often limited to a multiple of the IQR. Data points outside this range
are then explicitly drawn, as illustrated in Figure 11.2b.
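The numbers behind such a plot can also be computed by hand. The sketch below is a minimal illustration using the statistics module of the standard library (Python 3.8 or later); the function name boxplot_summary and the test data are our own, and a plotting library may compute the quartiles in a slightly different way. Points further than whis times the IQR away from the box are reported as outliers.

```python
import statistics

def boxplot_summary(data, whis=1.5):
    """Return the five number summary and the outliers of a list of numbers."""
    values = sorted(data)
    # Quartiles obtained by linear interpolation between the data points.
    q1, median, q3 = statistics.quantiles(values, n=4, method='inclusive')
    iqr = q3 - q1
    low_limit = q1 - whis * iqr
    high_limit = q3 + whis * iqr
    outliers = [v for v in values if v < low_limit or v > high_limit]
    return values[0], q1, median, q3, values[-1], outliers

print(boxplot_summary([1, 2, 3, 4, 5, 6, 7, 8, 100]))
```

For this data set the quartiles are 3, 5 and 7, so the IQR is 4 and the whiskers may extend at most to -3 and 13. The value 100 lies outside that range and is therefore the only outlier.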
Figure 11.2: Boxplots: a) Simple boxplot indicating the five number summary (minimum,
first quartile, median, third quartile and maximum) and the IQR, b) boxplot with
visualization of outliers and whiskers limited to 1.5 IQR.
In Python, boxplots can be made using the same library that we used before to plot
data, i.e. using matplotlib. Given a list l, a boxplot showing outliers as in Figure 11.2b is
created using
import matplotlib.pyplot as plt
plt.boxplot(l)
Apart from a single positional parameter, i.e., the list of data to be plotted, the boxplot
function has over 20 keyword parameters with default values that can be used to control
the appearance of the result. For instance, adding vert=False results in a horizontal
boxplot instead of the default vertical one, and whis=10 increases the maximal whisker
length (from its default 1.5) to 10 times IQR. Using the latter, the number of points
considered as outliers decreases, possibly to zero, thus resulting in a simple boxplot
as in Figure 11.2a.
import matplotlib.pyplot as plt
plt.boxplot(l, whis=10)
For an extensive overview of all options we refer to the pyplot website
http://matplotlib.org/api/pyplot_api.html.
Multiple data sets
In many cases one may want to compare different data sets. For instance, data on a
metabolite in a group of patients before a treatment and after a treatment, or between a
group of male patients and group of female patients, etc. This can be done by providing
the boxplot function with as first parameter a list of lists:
import matplotlib.pyplot as plt
plt.boxplot([list1, list2])
The data in list1 is then used for the first box with whiskers and the data in list2
for the second, which is drawn in the same figure at the right hand side of the first, as
in Figure 11.3a.
Figure 11.3: Boxplots with more than 1 box with whiskers: a) boxplot with integers
at the ticks on the x-axis indicating order of the data sets in the list provided as
parameter to the boxplot function, b) same boxplot with labels at the x-ticks.
Pimp your plot
In the boxplot created in the previous paragraph, the two boxes were labelled by two
integers at the ticks on the x-axis, indicating the order of the corresponding datasets in
the list provided as parameter to the boxplot function. By first creating a figure object,
connecting an axes object to it, and creating the boxplot on these axes:
import matplotlib.pyplot as plt
fig = plt.figure()
ax = fig.add_subplot(1, 1, 1)
ax.boxplot([list1, list2], notch=True)
ax.set_xticklabels( ['label1', 'label2'] )
any strings can be placed at the x-ticks using the set_xticklabels method on the axes
object, resulting in Figure 11.3b. Here, some extra emphasis is placed on the median by
adding notch=True to the parameters given to the boxplot function.
11.4 Exercises 58–61
Exercise 58:**
Design a class Gene with two data attributes: gene_symbol, and gene_name.
(a) Design a method of this class where gene_symbol should have as default value "INS"
and gene_name "insulin", respectively. Give examples of object instantiations with
a varying number of default values and with other values as arguments.
(b) Design a method print_geneinfo by which the contents of the data attributes are
printed.
Exercise 59:
(a) Create a class Atom and a constructor having two parameters that are set (use
__init__). The two parameters are atomName and atomWeight.
(b) Create another class Molecule, which has one attribute, a list of atoms. This
attribute should be empty at initialisation.
(c) Design a method addAtom for the Molecule class, which creates a new atom and
adds this to the list. The method addAtom, should have two parameters: atomName
and atomWeight.
(d) Design a method calculateWeight for the Molecule class, which calculates the
weight of the molecule.
Exercise 60:**
In this exercise we consider an Excel file that contains some clinical measurements on a
number of patients. The first two rows contain some info on the file contents and the
headers of the different columns, respectively. Subsequently, it contains a single line of
data per patient.
(a) Write a function with a file name as single parameter. The method should first
check whether the specified file name ends with ’.xlsx’. If not, the method should
return None. Otherwise, the specified Excel file should be read and the workbook
should be returned. Apply your method to the provided Excel file health.xlsx.
(b) Write a function that extracts the data from a specified column of the active work
sheet in a workbook and returns the values in that column in a list. The method
should have two parameters: as first parameter a workbook object and as second parameter a keyword parameter specifying the desired column number, which
should have default value 1.
(c) Write a function with two parameters, i.e., as first parameter a list and as second
parameter a float, which returns a new list with all elements of the input list
multiplied by the second parameter.
(d) Use your functions defined in parts (a), (b) and (c) to extract the information on
patient weight and height from the file health.xlsx. Make a scatterplot of that
data with the weight in kilograms on the x-axis and the height in meters on the
y-axis.
(e) Calculate for all patients their BMI (weight/height**2), with weight in kilograms
and height in meters.
(f) Create using Python a new workbook with only information for the male patients.
For each male patient only the name, height, weight and BMI should be stored in
the first 4 columns, respectively. Write this file as bmi_male.xlsx.
Exercise 61:
(a) Use the functions written in parts (a) and (b) of the previous exercise to read the
Excel file health.xlsx and obtain 3 equally long lists: one with the genders, one
with the cholesterol levels, and one with the weights.
(b) Write a function divideByGender with two lists as parameters. A first list with
the gender of a series of patients and a second equally long list with arbitrary
other data on the corresponding patients. One may assume that the gender list
only contains items with the values ’F’ for female and ’M’ for male patients. The
function should return two lists: the first with the data of the female patients, the
second with the data of the male patients.
(c) Make a scatter plot of the cholesterol levels versus the weights (as in Figure 11.4a),
where the male data are shown in red and the female data in blue.
(d) Make a boxplot of the cholesterol level where the data is shown for all patients, for
the female patients as well as for the male patients (as in Figure 11.4b).
Figure 11.4: Examples of a scatter and a box plot of data from the Excel file.
Chapter 12
Graphical user interfaces
In this chapter we will introduce the tkinter system by which graphical user interfaces
can be built. We will show that adding widgets to a GUI is in general straightforward.
12.1 A first window
The top level window
Programming a GUI is exactly like any other kind of programming: sequences, loops,
branches and modules can be used just as before. The first step is defining the main
object by which the windows of the program are managed.
>>> import tkinter
This is the first requirement of any tkinter program - import the names of the widgets.
>>> top = tkinter.Tk()
This creates the top level widget in our widget hierarchy. All other widgets will be
created as children of this widget. In order to make the widget visible we have to enter
the tkinter event loop:
>>> top.mainloop()
The mainloop, the so-called event loop, handles the events from the user (such as mouse
clicks and key presses) or the windowing system (such as redraw events and window
configuration messages), and it also handles operations queued by tkinter itself. This
also means that the application window will not appear before you enter the main loop.
So a minimal tkinter program has at least these three lines:
>>> import tkinter
>>> top = tkinter.Tk()
>>> top.mainloop()
When this program is run, see Figure 12.1a, the top level window automatically comes
furnished with widgets to minimize, maximize, and close the window.
Clicking on the "close" widget (the "x" in a box, at the right of the title bar) generates a
"destroy" event. The destroy event terminates the main event loop, and since there are
no statements after top.mainloop(), the program has nothing more to do, and ends.
Also note that the program will stay in the event loop until we close the window.
Figure 12.1: Two first windows: a) with standard title, b) with a user defined title.
The title of the window
In the previous example the title bar had the default value ’Tk’. If we build our own
user interface it is possible to give the window another title.
import tkinter
top = tkinter.Tk()
top.title("This is the top window")
top.mainloop()
If we run this program a window will appear, see Figure 12.1b, but possibly not the
whole text on the title bar is visible. The reason is that the tkinter system has some
predefined values for the initial size of the window and this is our first example in which
the flexible role of object initialisation is clearly shown.
This size can be changed by enlarging the window by clicking the left mouse-button in
one of the corners and sliding the corner to the required size, but there is an alternative.
The size of the top level window
The geometry method is applicable to the top level window widget and sets the size of
the window. If top is the Tk object then
• top.geometry(gstr)
is the method by which the size of a top-level window is set. The geometry string
gstr must have the form:
"wxh"
where the w and h parts give the window width and height in pixels. They are
separated by the character "x".
Figure 12.2: Two windows a) one with a user defined title and size, and b) one with
additionally a different background color.
Example:
import tkinter
top = tkinter.Tk()
top.title("This is the top window")
top.geometry("300x200")
top.mainloop()
producing a window as shown in Figure 12.2a.
The background color of the top level window
So far our windows have the same grey background color. A GUI toolkit should obviously
have methods to add color to widgets. To control the appearance of a widget, options
rather than method calls are used. Typical options include color, height and width. To
deal with options, all core widgets implement the same configuration interface, using
again keyword arguments.
• configure(option=value, ...)
One of the options is "background", to define the background color of the window.
In tkinter there are two general ways to specify colors:
• The colors "white", "black", "red", "green", "blue", "cyan", "yellow", and
"magenta" are available. Other names may work, but depend on the installation.
Example:
import tkinter
top = tkinter.Tk()
top.title("This is the top window")
top.geometry("300x200")
top.configure(background="red")
top.mainloop()
The window being produced by running this example is shown in Figure 12.2b.
• The other possibility is to use a string specifying the proportion of red, green, and
blue in hexadecimal digits:
#rrggbb
For example, "#000000" is black, "#ff0000" is red, and "#00ffff" is pure cyan
(green plus blue).
So to obtain the same result as in the previous example we could have written
import tkinter
top = tkinter.Tk()
top.title("This is the top window")
top.geometry("300x200")
top.configure(background="#ff0000")
top.mainloop()
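If the red, green and blue components of a color are available as integers, the string in #rrggbb notation can be built with string formatting. The helper below is a small sketch; the function name rgb_to_hex is our own and not part of tkinter.

```python
def rgb_to_hex(red, green, blue):
    """Build a '#rrggbb' color string from three integers in the range 0..255."""
    for component in (red, green, blue):
        if not 0 <= component <= 255:
            raise ValueError("color components must be between 0 and 255")
    # Each component is written as two lower-case hexadecimal digits.
    return "#{:02x}{:02x}{:02x}".format(red, green, blue)

print(rgb_to_hex(255, 0, 0))
print(rgb_to_hex(0, 255, 255))
```

The two calls give "#ff0000" (red) and "#00ffff" (cyan), i.e., strings that can be used as value for the background option.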
These methods all apply to the top level widget. In general our top level window should
have additional components like buttons, images and text. After an intermezzo about
GUI programming in general we introduce some of these components in building a
realistic GUI using tkinter.
12.2 The four basic GUI-programming tasks
Before introducing the components a window can contain we give some attention to
designing graphical user interfaces in general. When a user interface (UI) is designed,
there is a standard set of tasks that must be accomplished.
• It must be specified how the UI should “look”. That is, we must write code that
determines what the user will see on the computer screen.
• It must be specified what the actions are to be done when the UI is used. That
is, we must write routines that accomplish the tasks of the program.
• We must associate the "looking" with the "doing". That is, we must write code
that associates the things that the user sees on the screen with the routines that
have been written to perform the program’s tasks.
• Finally, we must write code that sits and waits for input from the user.
GUI programming has some special jargon associated with these basic tasks.
• We specify how we want a GUI to look by describing the "widgets" that we want
it to display, and their spatial relationships (i.e., whether one widget is above or
below, or to the right or left, of other widgets). The word "widget" is a nonsense
word that has become the common term for "graphical user interface component".
Widgets include things such as windows, buttons, menus and menu items, icons,
drop-down lists, scroll bars, and so on.
• The routines that actually do the work of the GUI are called "callback handlers"
or "event handlers". "Events" are input events such as mouse clicks or presses of a
key on the keyboard. These routines are called "handlers" because they "handle"
(that is, respond to) such events.
• Associating an event handler with a widget is called "binding". Roughly, the
process of binding involves associating three different things:
– a type of event (e.g. a click of the left mouse button, or a press of the ENTER
key on the keyboard),
– a widget (e.g. a button), and
– an event-handler routine.
For example, we might bind (a) a single-click of the left mouse button on (b) the
"CLOSE" button/widget on the screen to (c) the "closeProgram" routine, which
closes the window and shuts down the program.
• The code that sits and waits for input is called the "event loop".
Above we already have seen the name of the event loop method in tkinter, i.e.,
the "mainloop" method of the top object. As the mainloop runs, it waits for
events to happen in top. If an event occurs, then it is handled and the loop
continues running, waiting for the next event. The loop continues to execute
until a "destroy" event happens to the top level (root) window. A "destroy" event is
one that closes a window. When the top is destroyed, the window is closed and the event
loop is exited.
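The interplay between binding and the event loop can be illustrated with a toy model in plain Python. This is only a sketch of the idea and not how tkinter is implemented: all names (bindings, bind, mainloop and the handlers) are our own, and instead of waiting for real user input the loop processes a prepared list of events.

```python
bindings = {}   # maps (widget name, event type) to an event handler
log = []        # records what the handlers have done

def bind(widget, event, handler):
    """Associate an event type on a widget with an event handler."""
    bindings[(widget, event)] = handler

def on_close_click():
    log.append("closing the program")
    return "destroy"

def on_base_click():
    log.append("base A selected")

bind("CLOSE", "left-click", on_close_click)
bind("base A", "left-click", on_base_click)

def mainloop(events):
    """Handle the events one by one until a handler asks for a destroy."""
    for widget, event in events:
        handler = bindings.get((widget, event))
        if handler is None:
            continue    # nothing is bound to this event, so it is ignored
        if handler() == "destroy":
            break       # the window is closed and the event loop is exited

mainloop([("base A", "left-click"),
          ("base A", "key-press"),
          ("CLOSE", "left-click"),
          ("base A", "left-click")])
print(log)
```

The key press is ignored because no handler is bound to it, and the last click is never handled because the loop has already been left after the click on "CLOSE".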
12.3 The label widget
For building our graphical user interface we introduce a class. The main reason to
use classes is to simplify the design of the program. A program that is structured into
classes is, especially if it is a very large program, much easier to understand than one
that is unstructured.
Another important consideration is that structuring your application as a class helps
to avoid the use of global variables, i.e., variables that are not defined inside a class,
but that are accessible in all program parts. Because it leads to messy ("spaghetti")
programs, frequent use of global variables is considered poor programming practice. A much
better way is to use instance (that is, "self.") variables, and for that our application
must have a class structure.
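As a small illustration of this point, consider keeping track of the number of clicks on a button. The sketch below is independent of tkinter and uses a hypothetical class ClickCounter: the count is stored in an instance variable, so that every object has its own count and no global variable is needed.

```python
class ClickCounter:
    def __init__(self):
        self.count = 0      # instance variable instead of a global variable

    def click(self):
        self.count += 1

first = ClickCounter()
second = ClickCounter()
first.click()
first.click()
second.click()
print(first.count, second.count)
```

The two counters do not interfere with each other: the program prints 2 1. With a single global variable both counters would have shared the same count.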
Usually a GUI contains some text. If the text is simple an object from the Label
(widget) class can be used:
import tkinter
class MyApp:
    def __init__(self, parent):
        self.myParent = parent  # always keep a reference to the parent
                                # widget so parameters of the parent
                                # can be used
        self.label=tkinter.Label(parent,
            text="A GUI for inspecting gene descriptions")
root = tkinter.Tk()
root.title("This is the top window")
root.geometry("300x200")
myapp = MyApp(root)
root.mainloop()
Figure 12.3: Three windows with labels: a) a window with an invisible label, b) a
window with the text close to the top, and c) a window with the text close to
the bottom.
Our aim is to create a label widget that is a child widget of the top level window root
and that displays the text "A GUI for inspecting gene descriptions". Notice that because
tkinter object constructors tend to have many parameters (each with default values) it is
usual to use the named parameter technique of passing arguments to tkinter objects.
If the above program is run, perhaps surprisingly, this text is not shown, as demonstrated
in Figure 12.3a.
This is because we have to explicitly instruct the system where to place the widget.
Three so-called layout managers (or geometry managers) are predefined. Here we are
going to use the simplest one, called pack. With pack the widget packs itself
into its parent. The options for packing a widget against either the parent's wall or a
previous widget with the same packing are TOP, LEFT, RIGHT, and BOTTOM, where the
default value is TOP.
Hence
import tkinter
class MyApp:
    def __init__(self, parent):
        self.myParent = parent  # always keep a reference to the parent
                                # widget so parameters of the parent
                                # can be used
        self.label=tkinter.Label(parent,
            text="A GUI for inspecting gene descriptions")
        self.label.pack()
root = tkinter.Tk()
root.title("This is the top window")
root.geometry("300x200")
myapp = MyApp(root)
root.mainloop()
should give a window as in Figure 12.3b with the text close to the title bar, while
import tkinter
class MyApp:
    def __init__(self, parent):
        self.myParent = parent  # always keep a reference to the parent
                                # widget so parameters of the parent
                                # can be used
        self.label=tkinter.Label(parent,
            text="A GUI for inspecting gene descriptions")
        self.label.pack(side=tkinter.BOTTOM)
root = tkinter.Tk()
root.title("This is the top window")
root.geometry("300x200")
myapp = MyApp(root)
root.mainloop()
should show a window as in Figure 12.3c with the text close to the bottom of the window.
In the above programs the term tkinter. occurs at all places where we have to refer
to objects and their methods from the tkinter module. Although this is the strongly
recommended way of referring to these objects and methods, in Python GUI programming
it is more common to use a somewhat shorter notation in which the term tkinter. is
omitted. In order to obtain a syntactically correct program we should then use another way
of importing instead of import tkinter:
from tkinter import *
meaning that everything from tkinter is to be imported. With this notation the last
program becomes:
from tkinter import *
class MyApp:
    def __init__(self, parent):
        self.myParent = parent  # always keep a reference to the parent
                                # widget so parameters of the parent
                                # can be used
        self.label = Label(parent,
                         text="A GUI for inspecting gene descriptions")
        self.label.pack(side=BOTTOM)
root = Tk()
root.title("This is the top window")
root.geometry("300x200")
myapp = MyApp(root)
root.mainloop()
12.4
The button widget
With the class Button we create a new widget called button. As with Label,
we use
bA = Button(top,text="base A")
bT = Button(top,text="base T")
The class Button takes the parent window as the first argument. As we will see later,
other objects may also act as parents. The rest of the arguments are passed by keyword
and are all optional. Again, the buttons first have to be packed to make them visible.
bA.pack()
bT.pack()
Notice that after the first command the button is placed in the window. When the second
button is packed the window is expanded to accommodate it. The default TOP stacks
them vertically in the order they were packed. The result is shown in Figure 12.4a.
Figure 12.4: Two windows with buttons: a) one with 2 buttons close to the top, and
b) with two buttons packed horizontally.
If we would use
bA.pack(side=LEFT)
bT.pack(side=LEFT)
then the window looks as in Figure 12.4b.
In practice the pack geometry manager is generally used in one of these two modes to
place a set of widgets in either a vertical column or horizontal row.
Our buttons look a little squished. We can fix that by packing them with a little padding.
”padx” adds pixels to the left and right and ”pady” adds them to the top and bottom
(Figure 12.5a).
bA.pack(side=LEFT, padx=10)
bT.pack(side=LEFT, padx=20)
Figure 12.5: Two windows with buttons: a) one window with two buttons packed
horizontally with some extra space, and b) a window with an additional frame
to pack one label on top of three buttons.
12.5
The frame widget
A frame is a widget whose sole purpose is to contain other widgets. Groups of widgets,
whether packed or placed in a grid, may be combined into a single Frame. Frames may
then be packed with other widgets and frames. As an example we place a label over 3
buttons in a row. We first pack the buttons into a frame horizontally and then pack the
label and frame vertically in the window.
from tkinter import *
class MyApp:
    def __init__(self, parent):
        self.myParent = parent  # always keep a reference to the parent
                                # widget so parameters of the parent
                                # can be used
        # Create and pack label
        self.l = Label(parent, text="A label above the buttons")
        self.l.pack()
        # Create and pack frame with 3 buttons
        self.frame = Frame(parent)
        f = self.frame
        self.bA = Button(f, text="base A")
        self.bT = Button(f, text="base T")
        self.bG = Button(f, text="base G")
        self.bA.pack(side=LEFT)
        self.bT.pack(side=LEFT)
        self.bG.pack(side=LEFT)
        f.pack()
win = Tk()
myapp = MyApp(win)
win.mainloop()
The result is shown in Figure 12.5b.
Our button widgets have only been used to display text so far. The next step is of
course to have some action coupled to clicking on a button.
12.6
Bringing the buttons to life.
We start with the button definition as introduced in the previous sections.
from tkinter import *
class MyApp:
    def __init__(self, parent):
        self.myParent = parent  # always keep a reference to the parent
                                # widget so parameters of the parent
                                # can be used
        self.myContainer1 = Frame(parent)
        self.myContainer1.pack()
        self.buttonA = Button(self.myContainer1, text="base A")
        self.buttonA.pack(side=LEFT)
root = Tk()
root.title("Button A example")
myapp = MyApp(root)
root.mainloop()
The result of running this program is shown in Figure 12.6a.
When the button is clicked, it is highlighted and depressed as expected, but it does not do
anything yet. As we have seen, widgets are objects and have methods. We have been using
their pack method. Now we use the earlier introduced configure method to associate
an action with the button.
from tkinter import *
class MyApp:
    def __init__(self, parent):
        self.myParent = parent
        self.myContainer1 = Frame(parent)
        self.myContainer1.pack()
        self.buttonA = Button(self.myContainer1)
        self.buttonA.configure(text="base A", bg="blue", fg="yellow")
        self.buttonA.configure(command=self.butA)
        self.buttonA.pack(side=LEFT)
    def butA(self):
        print("Button base A has been pushed")
root = Tk()
root.title("Button A example")
myapp = MyApp(root)
root.mainloop()
Figure 12.6: Two windows with a single button: a) an inactive button in default colors,
and b) a custom colored button that is (though not visible in this figure) also
connected to an event handler.
Buttons are tied to callback functions using the parameter command, either when the
button is created or with configure.
The window that is shown when running this program is depicted in Figure 12.6b.
Now every time we click button "base A" the message "Button base A has been
pushed" is printed. Remember that the argument supplied to the command keyword
is a function name, not a function call. This should always hold: the argument must be
a function object that can be called later, since the action has to be taken at the
moment the button is clicked.
The callback and lambda forms
Suppose we have to design a DNA calculator with 4 buttons "A", "C", "G", and "T".
When we push a button the corresponding base name should be printed. The action
for pushing button "A" has been given above, and a straightforward but not very elegant
solution is to define for instance
    def butT(self):
        print("Button base T has been pushed")
as callback for button T, and similarly for C and G. What we want instead is one
single method that is called with the base of the button as parameter:
    def but(self, b):
        print("Button base "+b+" has been pushed")
One might try the following
self.buttonA.configure(command=self.but('A'))
If we do so, then the string 'Button base A has been pushed' is printed once, as the call
is evaluated immediately when this statement is executed, i.e., while the MyApp object is
being constructed. As a result command receives the return value
of this call, i.e., None, and this is clearly not the function object we have been
expecting. So when the button is pushed nothing happens. We therefore need some way by
which an expression is changed into a function. Python indeed has such a facility: the
lambda form. Lambda forms can be used wherever function objects are required. They
are syntactically restricted to a single expression. Semantically, they are just syntactic
sugar for a normal function definition.
>>> def exampleLambda(c):
...     print('Some text', c)
...
>>> fa = (lambda : exampleLambda('aaaa'))
>>> fa
<function <lambda> at 0xb7d2b5dc>
>>> fa()
Some text aaaa
Lambda forms can also use values from the containing namespace.
For example:
>>> def make_inc(n):
...     return lambda x: x + n
...
>>> f = make_inc(4)   # f is now (lambda x: x + 4), in math terms f(x)=x+4
>>> f
<function <lambda> at 0xb7f63b8c>
>>> f(0)
4
>>> f(1)
5
So in our example we write:
from tkinter import *
class MyApp:
    def __init__(self, parent):
        self.myParent = parent
        self.myContainer1 = Frame(parent)
        self.myContainer1.pack()
        self.buttonA = Button(self.myContainer1, text="base A")
        self.buttonA.grid(row=0, column=0)
        self.buttonA.configure(command=(lambda : self.but("A")))
        self.buttonT = Button(self.myContainer1, text="base T")
        self.buttonT.grid(row=1, column=1)
        self.buttonT.configure(command=(lambda : self.but("T")))
    def but(self, b):
        print("Button base "+b+" has been pushed")
root = Tk()
root.title("Buttons A and T, lambda form")
myapp = MyApp(root)
root.mainloop()
If one of the buttons is pushed indeed the correct text is printed.
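For the four-button DNA calculator it is tempting to create the callbacks in a loop. A known Python pitfall is that a lambda looks up a loop variable only when it is called, so all callbacks would then see the last base. The sketch below illustrates this without opening a window. It uses a stand-in function but that returns the text instead of printing it (an assumption made here to keep the example self-contained); binding the current value through a default argument is one common remedy.

```python
def but(b):
    """Stand-in for the callback method; returns the text instead of printing."""
    return "Button base " + b + " has been pushed"

bases = ["A", "C", "G", "T"]

# Pitfall: every lambda refers to the same variable 'base', which is looked
# up only when the lambda is called, i.e. after the loop has finished.
late = [lambda: but(base) for base in bases]

# Remedy: a default argument is evaluated when the lambda is created,
# so each callback keeps its own base.
bound = [lambda b=base: but(b) for base in bases]

print(late[0]())    # Button base T has been pushed
print(bound[0]())   # Button base A has been pushed
```

In the tkinter program the same idea would read command=(lambda b=base: self.but(b)) inside a loop over the bases.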
12.7
The entry widget
The purpose of an Entry widget in tkinter is to let the user see and modify a single line
of text. Just as in the case of buttons we need some way to communicate with the entry
widget, in this case to set and retrieve text. This is done with a special tkinter object
called a StringVar that simply holds a string of text and allows us to set its contents
and read it (with get).
from tkinter import *
class MyApp:
    def __init__(self, parent):
        self.myParent = parent
        self.myContainer1 = Frame(parent)
        self.myContainer1.pack()
        self.v = StringVar()
        self.e = Entry(self.myContainer1, textvariable=self.v)
        self.e.pack()
win = Tk()
win.title("An example of an entry widget")
win.geometry('400x50')
win.configure(bg='#ff00ff')
myapp = MyApp(win)
win.mainloop()
The result of this program is shown in Figure 12.7.
Figure 12.7: A window with an entry widget.
When we type "ACG" into the entry, we can retrieve it from our linked StringVar object
by calling the get()-method. This is shown in the following example, where we introduce
a button whose purpose is to show the contents of the entry.
from tkinter import *
class MyApp:
    def __init__(self, parent):
        self.myParent = parent
        self.myContainer1 = Frame(parent)
        self.myContainer1.pack()
        self.v = StringVar()
        self.e = Entry(self.myContainer1, textvariable=self.v)
        self.e.pack()
        self.buttonS = Button(self.myContainer1, text="Show entered text")
        self.buttonS.configure(bg='blue', fg='yellow')
        self.buttonS.configure(command=self.showEntryVal)
        self.buttonS.pack()
    def showEntryVal(self):
        print(self.v.get())
win = Tk()
win.title("An example of an entry widget")
win.geometry('400x50')
win.configure(bg='#0000ff')
myapp = MyApp(win)
win.mainloop()
giving the result as shown in Figure 12.8.
Figure 12.8: A window with an entry widget and a button.
Just as the contents of an entry widget can be retrieved, there is also a method by
which the StringVar object can be given a value: the set()-method. Its usage is
shown in the following program.
from tkinter import *
class MyApp:
    def __init__(self, parent):
        self.myParent = parent
        self.myContainer1 = Frame(parent)
        self.myContainer1.pack()
        self.v = StringVar()
        self.e = Entry(self.myContainer1, textvariable=self.v)
        self.e.pack()
        self.assignEntryVal('An initial value')
        self.buttonS = Button(self.myContainer1, text="Show entered text")
        self.buttonS.configure(bg='blue', fg='yellow')
        self.buttonS.configure(command=self.showEntryVal)
        self.buttonS.pack()
    def showEntryVal(self):
        print(self.v.get())
    def assignEntryVal(self, s):
        self.v.set(s)
win = Tk()
win.title("An example of an entry widget with a value")
win.geometry('400x50')
win.configure(bg='#ff00ff')
myapp = MyApp(win)
win.mainloop()
Figure 12.9: A window with an entry widget and a button and a value.
12.8
Exercises 62–69
Exercise 62:**
(a) Using tkinter, design a window having ”A DNA calculator” as title.
(b) Change the window of part (a) to a size of height 400 and a width of 500.
(c) Change the background color of the window to blue.
(d) Place a label with as text ”This DNA calculator should become colorful”. Change
the size of the window by mouse operations.
(e) Change the background color of the label to red.
(f ) Change the foreground color of the label to yellow.
(g) Add a label to the window with text
"Beautiful colors isn’t it?"
background color green and foreground color white. Note that the two labels are
placed on top of each other.
(h) Change the order in which the two labels are packed.
Exercise 63:
(a) Again create a tkinter window having ”A DNA calculator” as title and with blue
as background color.
(b) Place a label with as text ”This DNA calculator should become colorful” with black
as background color of the label and yellow as foreground color of the label, but
now add the option side=LEFT to the pack method.
(c) Add a label to the window with text
"Beautiful colors isn’t it?"
background color green, foreground color white, and with as value BOTTOM for
the side option of the pack method. Enlarge the size of the window to experience
the effects.
(d) Add a third label with text ”base A” and with TOP as value for the side option,
and enlarge the window to experience the effects.
(e) Finally add a fourth label
"base T"
with RIGHT as value for the side option and again enlarge the window to experience
the effects.
Exercise 64:**
In general the size of the widget does not change when the size of the master window
is adapted. If we require that the widget size is also adapted we have to add values
for the options expand and fill. Moreover there is also another option, namely anchor,
to control the position of the widget inside its master widget. In this exercise some
examples of their use are considered.
(a) Create a tkinter window having ”A DNA calculator” as title, with blue as background color, place a label with as text ”This DNA calculator should become colorful” with black as background color of the label and yellow as foreground color of
the label, but now add only the option anchor=N to the pack method and enlarge
the window to experience the effects.
(b) Add side=RIGHT to the argument list of pack and explain the changes when the
size of the master widget is adapted.
(c) Next add fill=X, expand=True to the argument list of pack and discuss the
changes.
(d) Replace the fill value X by Y and investigate the difference.
(e) Replace the fill value Y by BOTH and experience the effect.
Exercise 65:
Design a window with four buttons with as text "A", "C", "G", and "T", each having
a different background color and together filling up the complete master window. When the master
window is resized all space should remain evenly occupied by the buttons.
Exercise 66:
Design a graphical user interface with a label with as text ’Please enter a DNA string’
and an entry widget in which a DNA string can be entered. The window should also
have a button by which the length of the DNA string that has been entered is printed.
Exercise 67:
Each individual widget can be controlled by the pack options. In many cases we however
need the same behaviour simultaneously for several widgets. To that end tkinter has a
container entity by which widgets can be grouped. This framing operation is the subject
of this exercise.
(a) Again create a tkinter window having ”A DNA calculator” as title, with as background color blue, place 4 buttons with as texts ”A”, ”C”, ”G”, and ”T”, respectively. Choose different background and/or foreground colors and use the pack side
option LEFT for ”A”, ”C”, ”T”, and RIGHT for ”G”. Enlarge the window again
to experience the effects.
(b) Next we introduce a frame for the buttons with names ”A” and ”T”. Instead of
placing the two buttons into the top window say win we use the following grouping
f = Frame(win)
bA = Button(f,text="base A")
bT = Button(f,text="base T")
f.pack()
What changes are obtained when the window size is adapted?
(c) Next add fill=X, expand=True to the argument list of the packing of the frame
and discuss the changes.
(d) Similarly as in part (c) but now for button ”C”.
(e) Replace the fill value X by BOTH in all pack argument lists and experience the effect.
Exercise 68:
Usually the characters that the user types are shown in an entry widget. In some cases,
such as for password entries, an asterisk should be echoed instead. By adding show="*" as
option to the entry constructor this effect is obtained. Design a graphical user interface
with two labels and two entry widgets on it. One label should be ’Username’, and the
label positioned below it ’Password’. The entry in the widget coupled to the ’Username’
label should echo the characters typed in by a user, while in the other entry widget
asterisks should then be shown.
Exercise 69:
As in the previous exercise, but after the interface has been closed a new window
should be popped up with as size ’500x500’, background color ’blue’, foreground color
’yellow’ and a label with as text ’Welcome ...’ where instead of ’...’ the username typed
by the user should appear.
Chapter 13
Two examples
In this chapter we consider two example programs. In the first example a program is
created to perform some simple simulations. This is a nice example of how a problem can be split into multiple subproblems that are each solved in a separate function.
The corresponding exercises are a good test to see whether you comprehend the use of
functions.
The second example is to illustrate that pyplot has many more options than used so far
and can be used to create nice customized plots.
13.1
Simulations
A key advantage of computers is that they can perform tasks repeatedly and very accurately.
Simulation is the imitation of the operation of a real-world process or system over
time. The act of simulating something first requires that a model be developed; this
model represents the key characteristics or behaviors/functions of the selected physical
or abstract system or process. A computer can then be used to study the time evolution
of that model.
Examples of computer simulations of biological systems include molecular dynamics
simulations of proteins, DNA and/or membranes.
Here, we will not go into the details of such large scale simulations. Instead we will consider
a small model consisting of a linear array of cells that each can be in one out of two
states. The states of the cells are followed in time, where the state of a cell in the next
generation depends only on the current state of the cell and its two immediate neighbors.
Such a model is called a one dimensional cellular automaton.
The reason we are interested in such models is that they allow us, based on some simple
rules, to generate all sorts of (sometimes complex) patterns that may also be observed
in nature, see e.g. Figure 13.1.
Figure 13.1: Example of intriguing patterns found in nature.
The code
We will divide the problem at hand into 4 parts:
1. Generating the initial state for n cells.
2. Determining the new state of a cell given its own state and that of its neighbors.
3. Updating the system, i.e., the state of all cells, by repeatedly calling the solution
to step 2.
4. Running the simulation, i.e., repeatedly updating the states of all cells by repeatedly calling the solution to step 3.
Generating the initial states
We will consider a one-dimensional cellular automaton consisting of n cells. Each cell
can be in either of two states. We will denote these states by either ' ' or 'X'. The
whole one-dimensional cellular automaton is then described by a string of length n made
out of ' 's and/or 'X's.
We can start the simulation from an arbitrary string, but will consider here a string
consisting of n-1 spaces with a single ’X’ in the center. A function that generates and
returns such a string is:
def createInitialLine(n):
    """This function generates and returns a string of 'n' characters
    (' ' and 'X') that can be used as initial configuration for a
    cellular automaton simulation.
    """
    #Initially empty line (only spaces) with one X at center
    l = [' ']*n
    l[n//2] = 'X'
    initline = ''.join(l)
    return initline
Rules for new cell state
The idea of the cellular automaton is that the new cell state is fixed by the old state of
that cell plus the old states of its two neighbouring cells. As each cell can be in one out
of 2 states (' ' or 'X'), for a triplet of cells there are thus $2^3 = 8$ different possibilities.
For each of these 8 possibilities, we need to define what the new state of the middle cell
should be. A function that has a string of length 3 as input and returns that new state
as a string of length 1 is given here:
def calcNewCellState(pattern):
    """ Given a pattern of three characters, only consisting of
    ' ' and 'X' and indicating the state of a cell and the states of
    its left and right hand side neighbours, this function returns
    the new state (either ' ' or 'X') of the cell according to some rule.
    """
    #Rule 90:
    #current pattern         111 110 101 100 011 010 001 000
    #new state center cell    0   1   0   1   1   0   1   0
    newstate = ' '
    if pattern == '   ':
        newstate = ' '
    elif pattern == '  X':
        newstate = 'X'
    elif pattern == ' X ':
        newstate = ' '
    elif pattern == ' XX':
        newstate = 'X'
    elif pattern == 'X  ':
        newstate = 'X'
    elif pattern == 'X X':
        newstate = ' '
    elif pattern == 'XX ':
        newstate = 'X'
    elif pattern == 'XXX':
        newstate = ' '
    return newstate
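Rule 90 happens to have a compact description: the new state of a cell is 'X' exactly when one, but not both, of its two neighbours is 'X' (an exclusive or), and the cell's own state plays no role. The sketch below is not part of the original program; it uses a hypothetical helper rule90_xor and checks it against the eight cases of the function above.

```python
def rule90_xor(pattern):
    """Hypothetical alternative to the if/elif chain, valid for Rule 90 only."""
    left_on = pattern[0] == 'X'
    right_on = pattern[2] == 'X'
    # 'X' when exactly one of the two neighbours is 'X'
    return 'X' if left_on != right_on else ' '

# The eight patterns with the new states of the if/elif chain
rule90_table = {
    '   ': ' ', '  X': 'X', ' X ': ' ', ' XX': 'X',
    'X  ': 'X', 'X X': ' ', 'XX ': 'X', 'XXX': ' ',
}

for pattern, expected in rule90_table.items():
    assert rule90_xor(pattern) == expected
print("rule90_xor agrees with all 8 cases")
```

Such a shortcut exists only for some rules; the explicit case distinction works for any of them.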
Of course, the rule could be changed by exchanging ' 's for 'X's (and/or the other way
around) for the newstate in the different cases. In fact, there are $2^8 = 256$ different
combinations possible. These possibilities are called the 256 Rules. The rule in the
above example is called Rule 90. This is because we order the 8 patterns and the
resulting new cell states according to the rule as follows:
#Rule 90:
#current pattern         111 110 101 100 011 010 001 000
#new state center cell    0   1   0   1   1   0   1   0
The 8 digits for the new state of the center cell form a binary number. This can be
converted to a decimal number as follows:
$0 \cdot 2^7 + 1 \cdot 2^6 + 0 \cdot 2^5 + 1 \cdot 2^4 + 1 \cdot 2^3 + 0 \cdot 2^2 + 1 \cdot 2^1 + 0 \cdot 2^0 = 90$
Conversely, a decimal number can also be converted to a binary number, e.g. as seen in
Section 8.3 using Python string formatting:
>>> "{:b}".format(90)
'1011010'
>>> "{:b}".format(222)
'11011110'
where in the first example only the leading 0 remains to be added.
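The leading zero can also be produced by the format specification itself: a width of 8 with zero padding always yields eight digits. The sketch below shows this, together with the reverse conversion using the built-in int with base 2.

```python
# Decimal rule number -> 8-digit binary string (zero padded to width 8)
bits90 = "{:08b}".format(90)
bits222 = "{:08b}".format(222)
print(bits90)    # 01011010
print(bits222)   # 11011110

# Binary string -> decimal number
print(int(bits90, 2))    # 90
```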
Updating the system
A new generation of the cellular automaton can be made by looping over all cells (except
for the two outermost)
def calcNewLine(s):
    """ Generate a new generation by updating each cell
    based on its old state and that of its two nearest neighbours.
    Because first and last cell only have a single neighbour, these
    cells remain untouched.
    """
    l = [s[0]]                       # State of first cell kept
    for i in range(1,len(s)-1):      # Update all intermediate cells
        p = s[i-1]+s[i]+s[(i+1)]
        l.append(calcNewCellState(p))
    l.append(s[-1])                  # State of last cell kept
    newline = ''.join(l)
    return newline
Running the simulation
The evolution of a one-dimensional cellular automaton can be followed by printing the
initial state (generation zero) as a first line, the first new generation on a second line,
and so on. A function that does so for a cellular automaton with width cells over height
generations is shown here:
def runSimulation(width,height):
    """ Runs a simulation on a cellular automaton consisting of
    'width' cells, for 'height' iterations.
    """
    curline = createInitialLine(width)
    print(curline)
    for it in range(height):
        curline = calcNewLine(curline)
        print(curline)
The simulation can then be run, for example for 120 cells over 60 generations, using a
single call to the latter function:
runSimulation(width=120,height=60)
resulting in a pattern like the one shown in Fig. 13.2a.
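Because runSimulation prints its result, its output is hard to check automatically. The following sketch, which is not part of the original program, uses a hypothetical variant collectSimulation that returns the generations as a list of strings. To stay self-contained it repeats the helper functions in compact form, with Rule 90 written as "exactly one neighbour is 'X'".

```python
def createInitialLine(n):
    l = [' '] * n
    l[n // 2] = 'X'
    return ''.join(l)

def calcNewCellState(pattern):
    # Rule 90 in compact form: 'X' when exactly one neighbour is 'X'
    return 'X' if (pattern[0] == 'X') != (pattern[2] == 'X') else ' '

def calcNewLine(s):
    # First and last cell are kept, all intermediate cells are updated
    middle = [calcNewCellState(s[i-1:i+2]) for i in range(1, len(s) - 1)]
    return s[0] + ''.join(middle) + s[-1]

def collectSimulation(width, height):
    """Hypothetical variant of runSimulation that returns the lines."""
    lines = [createInitialLine(width)]
    for it in range(height):
        lines.append(calcNewLine(lines[-1]))
    return lines

for line in collectSimulation(width=7, height=3):
    print('|' + line + '|')
# |   X   |
# |  X X  |
# | X   X |
# |  X X  |
```

Returning the lines instead of printing them keeps the computation separate from the output, which also makes it easier to pass the result to a plotting routine later.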
Figure 13.2: Examples of cellular automata with different rules: a) The rule as applied
in the code in this section, b) a rule that generates a very regular pattern, and
c) an example of a rule that generates a very irregular pattern.
13.2
A bar plot example
In this section we will combine many of the topics discussed in previous chapters to
create a customized bar plot.
Assume that amongst 3 groups of men and women a small poll has been conducted and
assume that in each group separate scores of its men and women have been collected.
For group 1 the men scored 30, the women 25, for group 2 the men 35, the women 32,
and in group 3 28 and 21, respectively. Our programming task is to create in a single
figure a bar plot of the scores of each group with the scores of the men in red and those
of the women in yellow.
Since we are already familiar with matlab we decide to draw the figure by using matplotlib.
As usual in programming we split our task into several small subtasks. In this case one
may come up with the following 4 design tasks:
• First, create an empty figure with a title.
• Second, add the scores of the men to the figure.
• Next, construct a figure of the scores of the women.
• Finally, combine the two figures into one figure.
Hence by following these steps we obtain our solution.
• As first step we create an empty figure with a title.
Since we are going to use the module pyplot of matplotlib, as first action we should
import it
import matplotlib.pyplot
But when numerous references to a module are to be made long names should be
avoided. To that end Python has an elegant way to introduce a name alias:
import matplotlib.pyplot as plt
Instead of writing matplotlib.pyplot one can now use plt.
In matplotlib a figure can be created by calling the figure-method:
fig = plt.figure()
Since in matplotlib the Axes is the plotting area into which most of the objects go,
we create an instance ax of it by calling the add_subplot-method, which returns
an Axes object. In our case we need a figure with only one subplot, so we call:
ax = fig.add_subplot(1, 1, 1) # one row, one column, first plot
A title can be added to ax by calling its set_title-method:
ax.set_title('A small bar plot example')
and finally we may need to use the pyplot show-method to display our results:
plt.show()
In Anaconda the latter statement is however not required, as the figure is shown
immediately upon creation. So the total program becomes
import matplotlib.pyplot as plt
# create a figure and an axes
fig = plt.figure()
ax = fig.add_subplot(1, 1, 1)
ax.set_title('A small bar plot example')
plt.show()
Running this program results in a figure as shown in Figure 13.3.
Figure 13.3: An empty figure with a title.
• The next subtask is to add the scores of the men to the figure.
In chapter 4 (and exercise 15) the plot-method has been introduced:
matplotlib.pyplot.plot(x, y)
in which x and y are two lists of the same length.
Similarly, an Axes instance has a bar-method that plots a bar (rectangle) of height
y[0] at position x[0], a bar of height y[1] at position x[1], . . . a bar of height
y[-1] at position x[-1] using the default width of 0.8
ax.bar(x, y)
in which x and y are two lists of the same length.
So to plot the men’s scores we could use:
menScores = [30, 35, 28]
n=len(menScores)
indmen = range(n) # the x locations for the groups
rectsmen = ax.bar(indmen, menScores, color='r')
ax.set_title('The men scores')
Notice the use of the keyword argument color.
Information about possible keywords of the bar-method of an axes instance can
be found at: http://matplotlib.org/api/axes_api.html.
Since on inspecting the figure we find the bars too wide, we add the keyword argument width=barwidth to our bar-method call. Of course barwidth should first
be given a value, and in our case 0.3 seems appropriate:
menScores = [30, 35, 28]
n=len(menScores)
barwidth=0.3
indmen = range(n) # the x locations for the groups
rectsmen = ax.bar(indmen, menScores, width=barwidth, color='r')
ax.set_title('The men scores')
Another aspect we dislike about the figure is its height. In matplotlib it is simple
to change the upper limit of the y-axis by applying the ax.set_ylim(top=value)-command. The value to be used should depend on the maximum score of the
men. To that end we use Python's max-function, which returns the maximum
value of the argument list:
maxvalue=max(menScores)
Combining all and adding a small value to the calculated maximum of the list,
results in
import matplotlib.pyplot as plt
barwidth = 0.3 # the width of the bars
# create a figure and an axes
fig = plt.figure()
ax = fig.add_subplot(1, 1, 1)
# the data
menScores = [30, 35, 28]
n=len(menScores)
indmen = range(n) # the x locations for the groups
rectsmen = ax.bar(indmen, menScores, width=barwidth, color='r')
ax.set_ylim(top=max(menScores)+5)
ax.set_title('The men scores')
plt.show()
In Figure 13.4a the result is shown.
(a) Figure with the scores of the men.
(b) Final bar plot.
Figure 13.4: Bar plots at different stage in the program development.
• The next subtask is to construct a figure of the scores of the women. Of course a
figure similar to that of the men can be constructed for the scores of the women,
but if we would use the same x-positions for the bars, in our next step the bars
would overlap. So we decide to create a list of x-positions that are
shifted by the width of the bars.
indwomen = []
for i in range(n):
    indwomen.append(i+barwidth)
Since constructing lists of this kind occurs very frequently, Python has a
special construct for it, called list comprehension. The same list can be obtained
by
indwomen = [i+barwidth for i in range(n)]
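As a quick check, the loop and the list comprehension can be compared directly. The sketch below assumes the values n = 3 and barwidth = 0.3 used in this section; the names ind_loop and ind_comp are introduced here for illustration only.

```python
n = 3
barwidth = 0.3

# Explicit loop
ind_loop = []
for i in range(n):
    ind_loop.append(i + barwidth)

# Equivalent list comprehension
ind_comp = [i + barwidth for i in range(n)]

print(ind_loop == ind_comp)   # True
```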
Using this construct with a shifted x-position our program becomes:
import matplotlib.pyplot as plt
barwidth = 0.3 # the width of the bars
# create a figure and an axes
fig = plt.figure()
ax = fig.add_subplot(1, 1, 1)
# the data
womenScores = [25, 32, 21]
n=len(womenScores)
indwomen = [i+barwidth for i in range(n)]
rectswomen = ax.bar(indwomen, womenScores, width=barwidth, color='y')
ax.set_ylim(top=max(womenScores)+5)
ax.set_title('The women scores')
plt.show()
When we inspect this figure we see that the y-axis is lacking a label. Adding
the y-label is simple:
ax.set_ylabel('Scores')
• Combine the two figures into one figure.
Apart from combining the two figures we also add labels to the bars and a legend
to explain the colors of the bars. We also adapt the title of the figure.
By using the set_xticks and the set_xticklabels-methods of the ax-object,
ticks and labels are added to the bars.
ax.set_xticks([i+barwidth for i in range(n)])
ax.set_xticklabels(['G1', 'G2', 'G3'])
For the legend we can use the results of the bar-method. Each result is a container of
Rectangle instances, one per bar. The first Rectangle of each container carries the face
color of its bars, so it can represent that series in the legend:
# add a legend
ax.legend([rectsmen[0], rectswomen[0]], ['Men', 'Women'])
So our solution becomes
import matplotlib.pyplot as plt
barwidth = 0.3 # the width of the bars
# create a figure and an axes
fig = plt.figure()
ax = fig.add_subplot(1, 1, 1)
# the data
menScores = [30, 35, 28]
womenScores = [25, 32, 21]
n=len(menScores)
indmen = range(n) # the x locations for the groups
rectsmen = ax.bar(indmen, menScores, width=barwidth, color='r')
indwomen = [i+barwidth for i in range(n)]
rectswomen = ax.bar(indwomen, womenScores, width=barwidth, color='y')
ax.set_ylim(top=max(menScores+womenScores)+5)
# add a title and labels and ticks to the x- and y-axis
ax.set_ylabel('Scores')
ax.set_title('Scores by group and gender')
ax.set_xticks(indwomen)
ax.set_xticklabels(['G1', 'G2', 'G3'])
# add a legend
ax.legend([rectsmen[0], rectswomen[0]], ['Men', 'Women'])
plt.show()
The result is displayed in Figure 13.4b.
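One detail may depend on the matplotlib version. Assuming matplotlib 2.0 or later, where bars are by default centred on their x-position, the ticks placed at indwomen sit under the centre of the women's bars. If the ticks should instead sit midway between the two bars of each group, the positions can be computed as in the following sketch (groupcenters is a name introduced here for illustration).

```python
barwidth = 0.3
n = 3
indmen = range(n)                              # centres of the men's bars
indwomen = [i + barwidth for i in range(n)]    # centres of the women's bars

# Midpoint between the two bar centres of each group
groupcenters = [(m + w) / 2 for m, w in zip(indmen, indwomen)]
print(groupcenters)
```

Passing groupcenters to ax.set_xticks instead of indwomen would then place each label under the middle of its pair of bars.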
13.3
Exercises 70–73
Exercise 70:
In Section 13.1 a program has been designed to simulate a cellular automaton. The initial
state of the system was such that only one cell in the center differed from all others.
Extend the code such that the function runSimulation has an additional parameter
init that can have the value 'center', 'left' or 'right'. The behavior for 'center'
should be as before, while for ’left’ only the cell at the left should be ’X’ and all others
’ ’ and for ’right’ only the cell at the right should be ’X’ and all others ’ ’.
Exercise 71:
In Section 13.1 a program is designed to simulate a cellular automaton. There only one rule
(Rule 90) was used.
(a) Extend the code such that by calling
runSimulation(width=120,height=60,rule=90)
the same simulation is still performed.
(b) Extend the code such that (apart from Rule 90) it can also perform the simulation
for the following rule (Rule 222):
#Rule 222:
#current pattern             111 110 101 100 011 010 001 000
#new state for center cell    1   1   0   1   1   1   1   0
(c) Extend the code further such that it can also do Rule 122.
(d) Extend the code further such that it can also do Rule 94.
(e) Extend the code further such that it can also do Rule 30.
(f) Challenge: Change the code such that it can do all rules 0 up to 255.
Exercise 72:
(a) In the lecture notes the NCBI site has been introduced. Open a web browser and
go to the site http://www.ncbi.nlm.nih.gov/. Change the field "All databases"
into "Nucleotide" and enter "KR063672.1 OR KR063671.1" in the search field
(KR063672.1 and KR063671.1 are the accession codes for the two most used isolates
of the virus, "Kikwit" and "Mayinga" respectively). Download both sequences by
checking the boxes of both entries, clicking on the button 'Send to', selecting
'File' as destination and 'FASTA' as format, and clicking on the button 'Create
file'. If everything went fine, you will be asked to save the file under the name
'sequence.fasta'. The Download folder of your computer will then contain the
file 'sequence.fasta'. Move that file to your working directory.
(b) In exercise 27 the FASTA format has been described and a Python program has
been asked for to determine the number of sequences in a FASTA file. Design a
method with a filename as parameter that returns a list of 2-tuples.
Each 2-tuple corresponds to a sequence in the FASTA file and has as first element
the description line of the sequence and as second element the
nucleotide sequence itself. Apply your method to the file 'sequence.fasta' of part (a).
(c) What are the sequence lengths of the Kikwit and Mayinga isolates?
(d) From the answer of part (c) we infer that the Kikwit sequence is one nucleotide
shorter. So if we want to compare the sequences we have to leave one nucleotide
out of the Mayinga sequence. Of course the question is which one. Write a
method that leaves the nucleotide at index i out of the Mayinga sequence and subsequently
counts the number of matches, where a match means that both sequences have the
same nucleotide at the same index.
(e) Apply the method of part (d) to all indices of the Mayinga sequence. The result
should be a list having at index i the number of matches when the i-th nucleotide
is left out of the Mayinga sequence.
(f ) Make a barplot that shows the result of part (e) in a graph. Use the x-axis for the
index i and the number of matches on the y-axis.
Exercise 73:
(a) Design a class Frequency that has two data attributes. The first data attribute is
nme and is used to store a name as string, the second one pctl is a list of percentages.
The __init__-method should have next to self only one parameter, a string. This
string consists of one or more entries that are separated by semicolons. The first
entry is the name, the other entries are all strings that can be converted to floats.
(b) Design a method having a filename as parameter that reads from the file a sequence
of lines. Each line starts with a name and is followed by a number of percentages,
where all entries including the name are separated by semicolons. The method
should return the corresponding list of Frequency-objects.
(c) Apply the method of the previous item to the file 'sequencespct.txt', which consists of a
number of gene names and, for each gene, the percentages of occurrences of the nucleotides A, C,
G, and T in its gene sequence.
(d) Use the list of item (c) to generate a bar plot with on the x-axis the names of the
genes and as bars the percentages of occurrences of the nucleotides A, C, G, and T
in the gene sequence in different colors. So the plot shows per gene the distribution
of the nucleotides. The solution should be independent of the number of genes.
Also add a legend to the plot.
(e) Similar to the previous item, but now the bars are grouped per nucleotide, showing the
distribution of a single nucleotide over the genes with a different color per gene. So the
plot shows first how often the nucleotide A occurs in each of the genes, then how
often the T etc. Also add decent xtick-labels to the groups of bars.
Appendix A
Summary of useful commands
A.1 Common Python constructs
Repetition
for x in l:
    block
for i in range(len(l)):
    block
while bool:
    block
Selection
if bool:
    block1
elif bool:
    block2
else:
    block3
Functions
def myfunc(inargs):
    block
    return outargs
A.2 Operations on string s
s.count(sub)
return the number of non-overlapping occurrences of substring sub in string s.
s.upper() and s.lower()
return a copy of the string s converted to uppercase and lowercase resp.
s.rstrip(), s.lstrip(), s.strip()
return a copy of the string s with trailing whitespace characters (the characters
space, tab, linefeed, return, formfeed, and vertical tab) removed. Analogous for
leading whitespace for lstrip and leading and trailing whitespace for strip.
s.find(sub)
return the lowest index in the string s where substring sub is found. Return -1 if
sub is not found.
s.rfind(sub)
return the highest index in the string s where substring sub is found. Return -1 if
sub is not found.
s.replace(old, new)
return a copy of string s with all occurrences of substring old replaced by new.
s.split([sep])
return a list of the words in the string, using sep as the delimiter string. If sep is
not specified or is None, first the whitespace characters are stripped from both ends
and then words are separated by arbitrary length strings of whitespace characters.
float(s)
return the string s converted to a floating point number, if possible.
int(s)
return the string s converted to an integer number, if possible.
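The string operations listed above can be tried out on a short sequence. The following is a minimal sketch; the example strings and variable names are made up for illustration and do not come from the exercises.

```python
# Illustrative only: the strings below are arbitrary examples
s = '  acgtTTacg\n'

t = s.strip().upper()        # remove surrounding whitespace, then convert to uppercase
n_t = t.count('T')           # number of non-overlapping occurrences of 'T'
first = t.find('ACG')        # lowest index where 'ACG' occurs
last = t.rfind('ACG')        # highest index where 'ACG' occurs
missing = t.find('GGG')      # -1, since 'GGG' does not occur
rna = t.replace('T', 'U')    # copy with every 'T' replaced by 'U'

fields = 'geneA;25.0;30.5'.split(';')   # split on an explicit separator
value = float(fields[1])                # convert the string '25.0' to a float

print(t, n_t, first, last, missing, rna, fields, value)
```

Note that none of these methods changes s itself: strings are immutable, so each call returns a new object.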
A.3 Operations on file f
f=open(fname, option)
return a new file object f for the file with filename fname, opened for reading when option is
omitted or option='r', for writing when option='w'.
f.close()
close the file.
f.read()
return the contents of the file in a string.
f.readline()
return the next line from the file. The line includes the end of line character (\n)
f.readlines()
return a list of all lines from the file, where each line includes the newline character.
f.write(s)
write the string s to the file f.
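As a small illustration of the file operations above, the sketch below writes two lines to a file and reads them back. The file name is hypothetical, and a temporary directory is used so that no existing file is overwritten.

```python
import os
import tempfile

# Illustrative only: a temporary location and a made-up file name
fname = os.path.join(tempfile.mkdtemp(), 'example.seq')

f = open(fname, 'w')         # open for writing
f.write('ACGT\n')
f.write('TTGA\n')
f.close()

f = open(fname)              # option omitted: open for reading
first_line = f.readline()    # the first line, including the newline character
rest = f.readlines()         # the remaining lines as a list
f.close()

print(first_line.rstrip(), rest)
```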
A.4 Operations on lists l
Given a list l the following methods can be applied to l. Note that in most cases the
list l is changed.
l.append(x)
add an item to the end of the list.
l.extend(L)
extend the list l by appending all the items in the given list L.
l.insert(i, x)
insert item x at given position i.
l.remove(x)
remove the first item from the list whose value is x. Raises an error if the item is
not in the list.
l.pop(i)
remove the item at the given position in the list, and return it. If no index is
specified, l.pop() removes and returns the last item in the list.
l.index(x)
return the index in the list of the first item whose value is x. Raises an error if the
item is not in the list.
l.count(x)
return the number of times x appears in the list.
l.sort()
sort the items of the list, in place.
l.reverse()
reverse the elements of the list, in place.
list(s)
return a list whose items are the same and in the same order as in the string s
sep.join(l)
return a string which is the concatenation of the strings in the list l. The separator
between elements is the string sep on which this method is invoked.
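The effect of the list operations above is easiest to see by applying them one after another to a small list. The sketch below uses arbitrary example values; the comments show the list after each step.

```python
l = [3, 1, 2]
l.append(1)               # [3, 1, 2, 1]
l.extend([5, 4])          # [3, 1, 2, 1, 5, 4]
l.insert(0, 9)            # [9, 3, 1, 2, 1, 5, 4]
l.remove(1)               # removes the first 1: [9, 3, 2, 1, 5, 4]
last = l.pop()            # returns 4, l is now [9, 3, 2, 1, 5]
pos = l.index(2)          # index of the first 2
l.sort()                  # [1, 2, 3, 5, 9]
l.reverse()               # [9, 5, 3, 2, 1]

chars = list('ACG')       # ['A', 'C', 'G']
joined = '-'.join(chars)  # 'A-C-G'

print(l, last, pos, chars, joined)
```

Most of these methods change l in place and return None, so writing l = l.sort() would lose the list.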
A.5 Operations on dictionaries d
Given a dictionary d the following methods can be applied to d.
d.keys()
Returns a view on the dictionary’s keys
d.values()
Returns a view on the dictionary’s values
d.items()
Returns a view on the dictionary’s (key, value) pairs
x in d
Returns True if x is one of the dictionary's keys, False otherwise
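A short sketch of these dictionary operations, using a made-up dictionary of nucleotide counts:

```python
# Illustrative only: arbitrary counts per nucleotide
counts = {'A': 3, 'C': 1}
counts['G'] = 2                   # add a new key with its value

keys = list(counts.keys())        # convert the view on the keys to a list
total = sum(counts.values())      # sum over the view on the values
has_t = 'T' in counts             # membership test on the keys

for base, n in counts.items():    # loop over the (key, value) pairs
    print(base, n)
```

The views returned by keys(), values() and items() follow later changes to the dictionary; wrapping them in list() gives an independent list.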
A.6 List generation and plotting
range
The general form of the range-method is
range([start,] stop[, step])
If the step argument is omitted, it defaults to 1. If the start argument is omitted,
it defaults to 0. Combined with the list function, the full form
list(range([start,] stop[, step]))
returns a list of plain integers
[start, start + step, start + 2 * step, ...]
If step is positive, the last element is the largest start + i * step less than stop;
if step is negative, the last element is the smallest start + i * step greater than
stop.
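A few concrete calls may help to check the rules for the last element stated above. The argument values are arbitrary examples.

```python
up = list(range(2, 11, 3))      # step positive: stops before reaching 11
down = list(range(10, 1, -3))   # step negative: stops before reaching 1
default = list(range(4))        # start defaults to 0, step to 1
empty = list(range(5, 5))       # no integer satisfies start <= i < stop

print(up, down, default, empty)
```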
plot
Two lists x and y of the same length with in x the x-coordinates and in y the
y-coordinates of a collection of points can be plotted in red with circle markers and
subsequently shown by executing
import matplotlib.pyplot as plt
plt.plot(x, y, 'ro')
plt.xlabel('x-text')
plt.ylabel('y-text')
plt.show()
A.7 turtle
In some exercises the turtle library is used. Some basic commands in that library are
import turtle             # load the library
turtle.up()               # pen up (not drawing)
turtle.down()             # pen down (start drawing)
turtle.pencolor(r,g,b)    # set pen color
turtle.goto(x,y)          # move to position x,y
turtle.forward(dist)      # move specified distance forward
turtle.backward(dist)     # move specified distance backward
turtle.right(degrees)     # turn specified degrees right
turtle.left(degrees)      # turn specified degrees left
turtle.mainloop()         # place at end of program to activate drawing window
A.8 openpyxl
import openpyxl
wb = openpyxl.load_workbook(filename = 'myfile.xlsx')
ws = wb.active
cell = ws.cell(row=i,column=j)
print(cell.value)
ws2 = wb.create_sheet(title="mytitle")
ws2['F5'].value = 3.14
ws2.append([...])
wb.save(filename = 'myfile2.xlsx')
A.9 Database queries
urllib:
import urllib.parse, urllib.request
params = urllib.parse.urlencode(mydict) # if needed
url="http(s)://hostname/path"
rf = urllib.request.urlopen(url, params.encode('ascii'))
Entrez:
from Bio import Entrez
Entrez.email="myname@student.tue.nl"
handle=Entrez.esearch(db="nucleotide", term='...')
record=Entrez.read(handle)
A.10 tkinter
Creating a master window in tkinter with text mytitle
import tkinter
top=tkinter.Tk()
top.title(mytitle)
...
top.mainloop()
Creating a widget on master and packing it
widget=tkinter.widgetclass(top)
widget.pack(side=tkinter.TOP)
creates an instance of the widget class as a child of top. widgetclass is one of
Label
Button
Entry
The default value for the side option is tkinter.TOP, other values are tkinter.LEFT,
tkinter.BOTTOM and tkinter.RIGHT.
Configure
To set specific options for a widget use the configure method:
widget.configure(option=value)
General option=value
The following options apply to the above-mentioned widgets:
background=color
foreground=color
where color is one of "white", "black", "red", "green", "blue", "cyan", "yellow",
and "magenta".
Option for the Label widget
text=mytxt
where mytxt is a string.
Appendix B
Solutions to selected exercises
Solution to exercise 1:
>>> 5*40
200
>>> 1.25*7
8.75
>>> 100/25
4.0
>>> 106/25
4.24
>>> 106//25
4
>>> 106.0/25
4.24
>>> 106.0//25
4.0
>>> 100/5*5
100.0
>>> 100/(5*5)
4.0
>>> 2**10
1024
>>> 3*2**3
24
>>> (3*2)**3
216
All operators applied to integers also yield an integer, except for the true division operator /,
which yields a float. If one of the operands is a float, all operators also return a float.
The operator // is the floor division, which basically returns the number of times one
number fits into another, without any decimals or remainder. Moreover, due to
the differences in priorities of operators, the use of brackets () matters.
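The relation between floor division and the remainder can be checked with a few lines. This is a small sketch that goes slightly beyond the exercise: the remainder operator % and the example with a negative number are additions for illustration.

```python
q = 106 // 25          # number of times 25 fits into 106
r = 106 % 25           # what is left over
check = q * 25 + r     # reconstructs the original number

neg = -7 // 2          # floor division rounds towards minus infinity
kind = type(100 / 25)  # true division yields a float, even if the result is whole

print(q, r, check, neg, kind)
```

For negative numbers the result of // is thus not simply the quotient with the decimals cut off.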
Solution to exercise 2:
>>> nrbases = 4
>>> seqlength = 10
>>> nrbases**seqlength
1048576
>>> nrpos = nrbases**seqlength
>>> nrpos
1048576
>>> print(nrpos)
1048576
>>> prob = 1.0/nrpos
>>> print(’The probability is’, prob)
The probability is 9.5367431640625e-07
Solution to exercise 3:
A name, also called an identifier, is a word that consists of letters, underscores, and digits;
it must start with a letter or an underscore. So a space ' ', '?', '!' and ';' are not allowed
in an identifier and hence whats in a name, Whats in a name?, yo!u, and Hello; are
all not valid. Since an identifier must start with a letter or an underscore, 5600MB is also not valid.
Syntactically correct identifiers are thus whatsinaname, whats_in_a_name, I, HelloYou,
and varName, i.e., a), c), f), g) and i).
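The same conclusion can be checked from within Python. The sketch below is not part of the original solution; it uses the string method isidentifier together with the keyword module, since reserved words such as for pass the isidentifier test but still cannot be used as names.

```python
import keyword

# Candidate names taken from the discussion above
candidates = ['whats in a name', 'whatsinaname', 'whats_in_a_name',
              '5600MB', 'yo!u', 'Hello;', 'HelloYou', 'varName', 'I']

valid = []
for name in candidates:
    # a usable name is a syntactically correct identifier that is not a keyword
    if name.isidentifier() and not keyword.iskeyword(name):
        valid.append(name)

print(valid)
```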
When the invalid identifiers are used, the Python interpreter generates the following
error messages:
>>> whats in a name
File "<stdin>", line 1
whats in a name
^
SyntaxError: invalid syntax
>>> 5600MB
File "<stdin>", line 1
5600MB
^
SyntaxError: invalid syntax
>>> yo!u
File "<stdin>", line 1
yo!u
^
SyntaxError: invalid syntax
>>> Hello;=3
File "<stdin>", line 1
Hello;=3
^
SyntaxError: invalid syntax
>>> what?name
File "<stdin>", line 1
what?name
^
SyntaxError: invalid syntax
In each case the parser repeats the offending line and displays a little arrow pointing at
the earliest point where the error was detected.
Solution to exercise 4:
Having followed the instructions in the exercise, you should have a text file named
exercise4.py in the folder D:\8CA10 that may not only be opened in the Spyder editor,
but for example also using the standard Windows editor Wordpad.
The value 1048576 in the variable nrpos is written to the Python console only once,
i.e., as a result of the single print statement in the program. Whereas typing a variable
name at the command prompt will show you its contents, from your program only print
statements will result in output at the Python console.
Solution to exercise 5:
(a) Create a new file (using the New file ... option in the File menu or using the key
combination Ctrl+N), paste the text from the exercise and adapt it such that it
reads:
genomelength = 3.2e9      # number of base pairs per cell
nrcells = 4e13            # number of cells
massperbasepair = 660     # grams per mole per base pair
Na = 6.022e23             # number of molecules per mole (Avogadro's number)
totalDNAmass = genomelength*nrcells*massperbasepair/Na
print('approximate DNA mass one human:', totalDNAmass, 'grams')
Finally, save the file as exercise5.py. Running the above Python fragment then
yields as an answer 140.28561939554965 grams.
(b) The print statement should be replaced by:
print('approximate DNA mass one human:', totalDNAmass/1.0e3, 'kg')
Solution to exercise 6:
A program fragment covering all three subparts may read:
s = 'AAACGAACGTAGGATCAAGTAGGCAAAAAG'
print('a) the first character of s:')
print(s[0])
print('b) the last character of s:')
print(s[len(s)-1])
print('c) the string using 10 characters per line and space after each 5th:')
print(s[0:5] + ' ' + s[5:10])
print(s[10:15] + ' ' + s[15:20])
print(s[20:25] + ' ' + s[25:30])
Solution to exercise 7:
Below the programming fragments are given with some additional comments, together with the
results shown when the fragments are executed by the Python interpreter.
(a) [] is the empty list, so assigning it to the variable l, can be done by
l=[]
(b) A single element can be added by invoking the append-method:
l.append(7)
print(l)
The print statement then yields: [7]
(c) The length of list variable l is produced by invoking the len-method with l as
parameter. Printing this length using
print(len(l))
yields 1
(d) The first method is to apply the extend-method to l with the list as single parameter:
l.extend([1, 2, 3, 1])
print(l)
yielding as output [7, 1, 2, 3, 1]. A second method is to invoke the append-method
on l with each of the elements of the extension list successively as parameter:
for x in [1, 2, 3, 1]:
    l.append(x)
    print(l)
This yields consecutively:
[7, 1]
[7, 1, 2]
[7, 1, 2, 3]
[7, 1, 2, 3, 1]
(e) The result of the above actions is that l has now length 5.
print(len(l))
now thus yields 5
(f ) The third element of l is l[2]. By an assignment its value can be changed.
l[2]=4
print(l)
The print will now thus yield [7, 1, 4, 3, 1].
(g) The first occurrence of element x in l can be removed by invoking the remove-method:
l.remove(1)
print(l)
displaying [7, 4, 3, 1].
(h) The effect of a remove on the length of a list is of course that it has become one
smaller such that
print(len(l))
now yields 4.
(i) The pop-method invoked without an argument removes the last element from a list
and returns this last element. This can be stored in a variable and then printed
t = l.pop()
print(t)
or printed directly
print(l.pop())
In both cases 1 will be displayed.
(j) The effect of a pop on the length of a list is also of course that it has become one
smaller:
print(len(l))
now thus yields 3.
(k) When a negative index is used, then len(l) is added to the index, so l[-1] is the
same as l[len(l)-1]. With a print statement its content can be shown
print(l[-1])
to be 3.
(l) To validate that l[len(l)-1] is indeed the same as l[-1]
print(l[len(l)-1])
also displays 3.
(m) Index -len(l) is the same as 0, so using
print( l[-len(l)])
the value of the first element is produced, i.e., 7.
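The rule used in parts (k) to (m), that a negative index -i refers to the same element as len(l)-i, can be verified for all valid negative indices at once. A minimal sketch, using the list as it is at the end of this exercise:

```python
l = [7, 4, 3]

# compare l[-i] with l[len(l)-i] for i = 1, .., len(l)
pairs = []
for i in range(1, len(l) + 1):
    pairs.append((l[-i], l[len(l) - i]))

print(pairs)
```

An index smaller than -len(l), such as l[-4] here, raises an IndexError, just like an index that is too large.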
Solution to exercise 8:
(a) A list of the first 20 integers can be produced by invoking list(range(20)):
l=list(range(20))
print(l)
[0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19]
(b)
print(list(range(len(l))))
[0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19]
(c) Execution of
for x in l:
    print(l)
produces as output
[0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19]
[0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19]
[0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19]
.
.
[0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19]
[0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19]
since for each element of l it prints the whole list l.
(d) Execution of
for x in l:
    print(x, 2*x)
produces as output
0 0
1 2
2 4
.
.
18 36
19 38
since for each element of l first its value and then twice its value is output.
(e) When the following programming fragment is executed
for x in l:
print(x)
the system returns with the error message
IndentationError: expected an indented block
In this case there is no indentation in front of print(x), so in fact no actions are
associated with the loop.
(f ) Execution of the programming fragment:
for i in range(len(l)):
    print(l[i], 2*l[i])
gives output identical to that of (d), since for i in range(len(l)) means that
for each of the first 20 integers the corresponding l-value and twice this value are
output. Since l=[0, 1, 2, .., 19] the output is as shown.
(g) Execution of
print("Start")
for i in range(len(l)):
    print(l[-i])
    print("Finished")
gives as output
Start
0
Finished
19
Finished
.
.
2
Finished
1
Finished
since the first time the loop body is executed the value of i is 0, so the value of the
first element of l, 0, followed by the string "Finished" is printed. Then i
becomes 1 and l[-1], the value of the last element of l, and "Finished" are
output, then the one but last etc., until i is 19 and, since l[-19] is the same as
l[1], a 1 followed by "Finished" is produced.
(h) By removing the indentation before print("Finished"), it is taken out of the
block of the for-statement that is executed for every index of l. Then it will only
be executed once, when the for-loop is done. The programming fragment should thus
look like:
print("Start")
for i in range(len(l)):
    print(l[-i])
print("Finished")
giving as output
Start
0
19
18
.
.
1
Finished
(i) In the previous two parts the first element of l was produced first and then
the other elements of l in reversed order. So we need only a small adjustment
to produce all elements of l in reversed order. The solution is to find the right
"pattern": the last element is l[-1-0], the one but last l[-1-1], the two but
last l[-1-2] etc., until l[-1-(len(l)-1)]. Hence a programming fragment that
produces the elements of l in reverse order is:
for i in range(len(l)):
    print(l[-1-i])
Solution to exercise 9:
(a) The corrected programming fragment could look like:
n=10
print(0, n)
for t in range(6):
    n = 2*n
    print(t+1, n)
print('After 6 hours the number of cells is', n)
This shows that the number of cells after 6 hours is 640.
(b)
n=10
print(0, n)
for t in range(24):
    n = 2*n - 5
    print(t+1, n)
print('After 24 hours the number of cells is', n)
After 24 hours the number of cells is 83 886 085.
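As a cross-check that is not part of the original solution, the result of part (b) can be compared with a closed formula. Subtracting 5 on both sides of n = 2*n - 5 shows that the quantity n - 5 simply doubles every hour; starting from 10 - 5 = 5 this gives 5 + 5*2**t cells after t hours. A minimal sketch:

```python
# repeat the update rule of part (b) for 24 hours
n = 10
for t in range(24):
    n = 2*n - 5

# closed formula derived above: n - 5 doubles every hour, starting at 5
closed_form = 5 + 5 * 2**24

print(n, closed_form)
```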
Solution to exercise 10:
(a) A new window pops up in which a square is drawn.
(b) The requested regular octagon consists of 8 equal line pieces, each rotated with
respect to each other by 45 degrees.
import turtle
d=100
turtle.up()
# starting point shifted slightly upward
turtle.goto(-d/2,d)
turtle.down()
for i in range(8):
    turtle.forward(d)
    turtle.right(45)
turtle.mainloop()
(c) The requested star consists of 6 equal parts, where each part now consists of 2 line
pieces rotated 120 degrees with respect to each other and where the 6 parts are
each rotated with respect to each other by 60 degrees. Because we already rotated
120 degrees, we have to rotate (120-60=) 60 degrees back.
import turtle
d=100
turtle.up()
turtle.goto(d/2,d/2)
turtle.down()
for i in range(6):
    turtle.forward(d)
    turtle.right(120)
    turtle.forward(d)
    turtle.right(-60)
turtle.mainloop()
Instead of turtle.right(-60) one could also use turtle.left(60).
(d) The requested ’spiral’ plot consists of multiple (200 to be exact) line segments,
where starting from the center consecutive line segments increase in size and make
an angle of 45 degrees.
import turtle
d=10
turtle.up()
turtle.goto(0,0)
turtle.down()
for i in range(200):
    turtle.forward(d)
    turtle.right(45)
    d = d+1
turtle.mainloop()
Solution to exercise 11:
Let's first consider the case where k is equal to 3. Then, what should be printed is:
print('* ')      # 1 star, 1 * '* '
print('* * ')    # 2 stars, 2 * '* '
print('* * * ')  # 3 stars, 3 * '* '
So, the first line contains '* ' once, the second line contains the same string twice, and the
third line contains it three times. More generally, on the n-th line we need n times '* '.
This can be realized in different ways. One way would be
k=5
for n in range(1, k+1):
    print(n * '* ')
A second way uses a second (nested) for loop for the repetition on a single line:
k=7
for n in range(1, k+1):
    for i in range(n):
        print('*',end='')
    print()
Partial solution to exercise 14:
(a) Given a list of values, to calculate the mean we first have to determine the sum of
all the values. Similarly as in the previous exercise we have to consider all elements
one by one:
s = 0
for x in l:
    s = s + x
The next step is to calculate the mean and print it
mean = s / len(l)
print('The mean is ', mean)
If we apply the calculation to the list l of the previous exercise we find:
The mean is 5.75
Solution to exercise 16:
(a) range(n) generates the sequence of all integers 0, .., n-1, so here we have to
choose the value 11 for n and use the list function to convert the range object into
a true list:
>>> list(range(11))
[0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10]
(b) range(start, n) generates all integers start, .., n-1, so here we have to choose
for start 1 and for n 11:
>>> list(range(1, 11))
[1, 2, 3, 4, 5, 6, 7, 8, 9, 10]
(c) When step is greater than 0, range(start, n, step) generates all integers
start, start+step, .., start+k*step such that k is maximal, i.e., start + k*step < n while
start + (k+1)*step ≥ n, so here we have to choose for start 4, for n 21, and for step 4:
>>> list(range(4, 21, 4))
[4, 8, 12, 16, 20]
(d) As exercise (c) but now we should choose for start -24, for n 21, and for step 3:
>>> list(range(-24, 21, 3))
[-24, -21, -18, -15, -12, -9, -6, -3, 0, 3, 6, 9, 12, 15, 18]
(e) Several solutions exist to construct the requested list. We can use the result of
exercise (d) and then reverse it either using
l=list(range(-24, 21, 3))
m=[]
for i in range(len(l)):
    m.append(l[-1-i])
print(m)
or
l=list(range(-24, 21, 3))
m=l[::-1]
print(m)
However, both of these solutions do not exclusively use the range and list functions
and, therefore, do not satisfy the exercise, even though the correct list is
provided. The solution is to use range with a negative step value: if step is negative,
range(start, n, step) generates all integers from start to start+k*step,
where the last element is the smallest start+k*step greater than n, so here we may
choose for start 18, for n -25, and for step -3:
>>> list(range(18, -25, -3))
[18, 15, 12, 9, 6, 3, 0, -3, -6, -9, -12, -15, -18, -21, -24]
Partial solution to exercise 17:
(a) First, observe what sublists the list [n+10, n+9, .., 2, 1, 0, 0, 5, .., n+5, n+10]
is built of. The first sublist is a list starting at n+10, descending by 1, and ending
with a 0. The second list starts with the second 0, increases by steps of 5, and ends
with n+10. Both these sublists can be created using a combination of the range
and list functions and then concatenated using the concatenation operator (+),
resulting in
list(range(n+10, -1, -1)) + list(range(0, n+11, 5))
(b)
[n, n, (n-5), (n-10), .., 10, 5, 0, -5, -10, ..,-(n-5), -n, -n] =
list(range(n, n+1)) + list(range(n, -n-1, -5)) + list(range(-n, -n+1))
Solution to exercise 19:
The Python fragment provided contains 3 errors: (1) at the end of the first line of an if statement a
colon is required, (2) the boundaries 0 and 10 were not included, and (3) an 'else:' is
required before the final print statement such that it is only executed when the value
provided is outside the requested range. The corrected program then reads:
s = input("Give a value between 0 and 10: ")
value = int(s)
if (value >= 0) and (value <= 10):
    print('Thank you')
else:
    print('You fool!')
Solution to exercise 20:
(a) To open and read all lines from file 'sequences.seq' we use the following programming
fragment
infile = open('sequences.seq')
alllines=infile.readlines()
infile.close()
We then have a list of all lines of the file. By applying a for-loop over the total
number of lines in the file, len(alllines), we easily have access to both the current
line number and the line itself. Note that numbering of list indices starts with 0,
so we have to add 1 to the line number when printed:
for linenr in range(len(alllines)):
    print(linenr+1, alllines[linenr].rstrip())
When printing a line, rstrip() is used to remove possible newline characters at the
end of the line. These should not be printed, as print itself already finishes with a
newline and additional blank lines would otherwise be printed. So the total
program becomes:
""" Read the file sequences.seq and
for all lines print the line number
and the line itself
"""
infile = open('sequences.seq')
alllines=infile.readlines()
infile.close()
for linenr in range(len(alllines)):
    print(linenr+1, alllines[linenr].rstrip())
(b) Reading the file and looping over all lines is analogous to the solution of (a). Given
a line, we have to consider the first occurrence of TT. Since it is given that such a
string occurs we simply may use the find-method:
line=alllines[linenr]
m=line.find("TT")
Having the position of the TT we have to search starting from that position for the
occurrence of AA. Once again we use the find-method, but applied to the part of
the line after TT. If we have the position of the AA in the part of the line after the
TT, we translate it to the position in the original line:
remainingpartofline=line[m:]
n=remainingpartofline.find("AA")
# n is the first occurrence of AA in remainingpartofline
n=m+n
# n is the first occurrence of AA in the line after the first TT
So the total program becomes:
""" Read the file sequences.seq and
for all lines print the line number
and the part from TT to the first AA
"""
infile = open('sequences.seq')
alllines=infile.readlines()
infile.close()
for linenr in range(len(alllines)):
    line=alllines[linenr]
    m=line.find("TT")
    remainingpartofline=line[m:]
    n=remainingpartofline.find("AA")
    # n is the first occurrence of AA in remainingpartofline
    n=m+n
    # n is the first occurrence of AA in the line after the first TT
    print(linenr+1, alllines[linenr][m:n+len("AA")])
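The translation from a position in the remaining part of the line back to a position in the original line can be avoided, because find accepts an optional second argument giving the index at which the search starts. The sketch below shows this alternative on a made-up line; it is a variation on the solution above, not a replacement prescribed by the exercise.

```python
# Illustrative only: an arbitrary line containing TT followed by AA
line = 'GGTTCACAAGG'

m = line.find('TT')       # position of the first TT
n = line.find('AA', m)    # first AA at or after index m, as index in the whole line
fragment = line[m:n + len('AA')]

print(m, n, fragment)
```

As in the solution above, this assumes that both TT and a subsequent AA occur; otherwise find returns -1 and the slice is meaningless.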
Partial solution to exercise 22:
(a) To read the file and store its contents in 3 lists:
# Read the file BMIs.txt and store its lines in a list
inf = open('BMIs.txt')
lines = inf.readlines()
inf.close()
# Create 3 empty lists
names = []
weights = []
lengths = []
# Parse line by line and store data in appropriate lists
for line in lines:
    s = line.split()
    name = s[0]
    weight = float(s[1])
    length = float(s[2])
    names.append(name)
    weights.append(weight)
    lengths.append(length)
Solution to exercise 25:
(a) The command s=input(text) displays the string text on the screen and stores the
text entered by the user in the variable s. Here text should be 'Enter a sequence: ':
# Ask the user for a sequence and print its length
seq = input('Enter a sequence: ')
print('It is', len(seq), 'bases long')
(b) To determine the number of occurrences of a substring subs in a string s, the method
s.count(subs) can be applied:
# also print the number of A, T, C, and G characters in the sequence
seq = input('Enter a sequence: ')
print('It is', len(seq), 'bases long')
print('adenine:', seq.count('A'))
print('thymine:', seq.count('T'))
print('cytosine:', seq.count('C'))
print('guanine:', seq.count('G'))
(c) To allow for both lower-case and upper-case characters we first transform the input
string to an all uppercase character string by applying the upper-method:
# ... allow both lower-case and upper-case characters
seq = input('Enter a sequence: ')
seq = seq.upper()
print('It is', len(seq), 'bases long')
print('adenine:', seq.count('A'))
print('thymine:', seq.count('T'))
print('cytosine:', seq.count('C'))
print('guanine:', seq.count('G'))
(d) To determine the number of unknown characters, sum up the occurrences of 'A', 'C', 'T', and
'G' and subtract this sum from the total length of the string:
# ... also print the number of unknown characters
seq = input('Enter a sequence: ')
seq = seq.upper()
n = len(seq)
a = seq.count('A')
t = seq.count('T')
c = seq.count('C')
g = seq.count('G')
print('It is', n, 'bases long')
print('adenine:', a)
print('thymine:', t)
print('cytosine:', c)
print('guanine:', g)
print('unknown:', n - a - t - c - g)
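As an aside, the four separate count calls can be replaced by a single pass over the sequence using Counter from the standard module collections. This goes beyond what the exercise asks for; the sketch below uses a fixed, made-up sequence instead of input so that it runs without user interaction.

```python
from collections import Counter

# Illustrative only: a fixed example sequence instead of user input
seq = 'acgtNNacg'.upper()

counts = Counter(seq)                        # maps each character to its number of occurrences
known = sum(counts[b] for b in 'ACGT')       # a missing character simply counts as 0
unknown = len(seq) - known

print(counts['A'], counts['T'], counts['C'], counts['G'], unknown)
```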
Solution to exercise 27:
(a) The list of integers [start, start + step, start + 2 * step, ...] can be
generated by evaluating list(range(start, stop, step)). If step
is positive, the last element is the largest start+i*step less than stop; if step is
negative, the last element is the smallest start+i*step greater than stop. So
list(range(1, n, 1))
produces
[1, 2, 3, 4, 5, 6, 7, 8, ..., n-2, n-1]
and
list(range(n, 0, -1))
produces
[n, n-1, ..., 3, 2, 1]
Hence the required list can be generated by:
list(range(1, n, 1))+list(range(n, 0, -1))
The solution suggested in the exercise
a = list(range(1,101))
b = a
is not a proper one. The last assignment makes a and b two different names for one
and the same object, so any change to b is also made on a. So after
b.reverse()
a is also reversed:
>>> a
[100, 99, 98, ..., 3, 2, 1]
and
print(a+b[1:])
gives
[100, 99, 98, ..., 3, 2, 1, 99, 98, ..., 3, 2, 1]
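The aliasing problem disappears when b is made a copy of a rather than a second name for it. A minimal sketch, assuming that a slice copy a[:] is an acceptable way to copy the list (the name result is introduced here only for illustration):

```python
a = list(range(1, 101))
b = a[:]          # slicing creates a new list object (a shallow copy)
b.reverse()       # only b is reversed; a is unchanged
result = a + b[1:]

# show the beginning, the turning point and the end of the list
print(result[:3], '...', result[98:102], '...', result[-3:])
```

Because a and b are now two distinct objects, reversing b no longer affects a.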
(b) When l is a list and x is an element in l, then l.remove(x) removes the first
occurrence of x from l. So one way to obtain the result is:
a=list(range(1, n, 1))
a.remove(73)
b=list(range(n, 0, -1))
b.remove(73)
a+b
Another approach is to combine several range commands:
(list(range(1, 73, 1))+list(range(74, n, 1))+list(range(n, 73, -1))+
 list(range(72, 0, -1)))
Yet another approach would be
a=list(range(1, n+1, 1))
a.remove(73)
m = a[:-1]+a[::-1]
(c) When these commands are executed in the Python interpreter, the following output
is produced:
>>> a=[1,2,3,4]
>>> b=[9,16,25,36]
>>> c=[a,b]
>>> d=[a+b]
>>> len(c)
2
>>> len(d)
1
>>> len(a)
4
>>> len(b)
4
List c consists of two elements. The first element is list a, the second list b. List d
has only one element, list [1, 2, 3, 4, 9, 16, 25, 36].
(d) Inspection of the list shows that it is in fact the sorted version of 3 copies of the
list [0, 1, 2, ..., n-1, n], so one way of obtaining the result is:
a=list(range(0, n+1, 1))*3
a.sort()
print(a)
(e) Applying 3 times the range-method with step=4 and sorting the concatenated
result, does the job:
a=list(range(1, n+1, 4))
b=list(range(2, n+1, 4))
c=list(range(3, n+1, 4))
d=a+b+c
d.sort()
print(d)
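The result of (e) can equivalently be described as all integers from 1 up to and including n that are not a multiple of 4. As a sketch, assuming an example value n = 12 chosen here only for illustration, this description can be written as a list comprehension (a compact loop notation) and compared with the sorted concatenation:

```python
n = 12  # example value, chosen only for this illustration

# approach from the solution: three ranges with step 4, concatenated and sorted
d = list(range(1, n+1, 4)) + list(range(2, n+1, 4)) + list(range(3, n+1, 4))
d.sort()

# alternative: skip every multiple of 4 directly
e = [i for i in range(1, n+1) if i % 4 != 0]

print(d)
print(e)
```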
Solution to exercise 29:
The sequence seq is a DNA sequence consisting of only A's, C's, G's and T's. So after
seq=seq.upper()
seq=seq.replace('C',' ')
seq=seq.replace('T',' ')
seq=seq.replace('G',' ')
seq consists only of A's and spaces. Splitting it on white space hence results in a list
whose elements are runs of only A's. Sorting this list orders the strings alphabetically,
which for strings consisting of only A's is the same as ordering them by length, with the
shortest first. Hence by
a=seq.split()
a.sort()
print(len(a[-1]))
the length of the longest subsequence consisting of only A’s in the DNA sequence seq
is printed.
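The fragment above fails with an IndexError when the sequence contains no A at all, because the list a is then empty. A possible variant that handles this case is sketched below; the function name longest_a_run is an assumption introduced here, and max with the default argument is used instead of sorting.

```python
def longest_a_run(seq):
    """Return the length of the longest run of A's in seq (0 if there is none)."""
    seq = seq.upper()
    for base in 'CGT':
        seq = seq.replace(base, ' ')
    runs = seq.split()
    # default=0 avoids an error when the sequence contains no A at all
    return max((len(run) for run in runs), default=0)

print(longest_a_run('ccAAAtgAAAAAgA'))
```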
Solution to exercise 30:
A proper name for the function whatshouldbemyname would be uniqueSorted, because the
function returns a new list with a single copy of each element in the input list,
i.e., with all duplicates omitted, where the elements in the returned list are sorted in increasing
order.
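For comparison, the same behaviour can be obtained in one line with the built-in set and sorted. This is not the function from the exercise but a hypothetical equivalent, assuming that the elements of the list can be stored in a set and compared with each other (as is the case for numbers or strings):

```python
def uniqueSorted(m):
    """Return a new sorted list containing each element of m exactly once."""
    # set(m) drops the duplicates, sorted(...) returns a new sorted list
    return sorted(set(m))

values = [3, 1, 3, 2, 1]
print(uniqueSorted(values))
print(values)   # the input list itself is left unchanged
```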
Solution to exercise 32:
(a) Lists are mutable, whereas strings are not, so lists offer some methods that strings
lack. An example of such a method is reverse. To reverse a
string, make a list out of it, reverse that list, and then turn the list back into a string using the
join-method:
def wording(word):
    """ A function that has a word as parameter
    and prints the word, its length and
    the reversed word. """
    letters = list(word)
    letters.reverse()
    revword = ''.join(letters)
    print('word:\t', word)
    print('length:\t', len(word))
    print('reverse:\t', revword, '\n')

wording('verzuring')
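A shorter route to the reversed word uses slicing with a negative step, which works directly on strings. A minimal sketch (the function name reverse_word is an assumption introduced here):

```python
def reverse_word(word):
    """Return word with its characters in reverse order."""
    # a slice with step -1 walks through the string from back to front
    return word[::-1]

print(reverse_word('verzuring'))
```

The list-based version above remains useful when the characters have to be modified before they are joined again.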
(b)
def processFile(filename):
    """A function that has a filename as parameter
    and prints for each word in the file the word,
    its length and the reversed word."""
    # Open and read the file 'filename'
    inf = open(filename)
    filecontents = inf.read()
    inf.close()
    # Optionally one could remove some punctuation
    for c in ['.',',',':',';']:
        filecontents = filecontents.replace(c,'')
    # Divide the filecontents in a list of words
    words = filecontents.split()
    # and process all words one by one
    for word in words:
        wording(word)

processFile('mytext.txt')
Partial solution to exercise 35:
(a) If we have a string in which the contents of the file is stored, the sentences can be
separated by applying the split-method with as splitting string the period. Next
we have to add the period again to each of the elements of this list. We have to
exclude the last element of the list since that element is empty when the file is ended
with a period or it is a string without a period and hence it is not a sentence. A
Python method that implements this description is:
def text2sentences(filename):
    """ A function having an input
    file name (string) as parameter,
    it returns a list of all the sentences.
    """
    infile = open(filename)
    alllines = infile.read()
    infile.close()
    listofsentenceswithoutperiod = alllines.split('.')
    allsentences=[]
    for s in listofsentenceswithoutperiod[:-1]:
        allsentences.append(s+'.')
    return allsentences

print(text2sentences('mytext2.txt'))
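The splitting step can be tried without a file by applying it to a string directly. The sketch below uses a hypothetical helper split_sentences (a name introduced here) and shows why the last element of the split result is dropped in both situations described above:

```python
def split_sentences(text):
    """Split text on periods and put the period back after each sentence."""
    parts = text.split('.')
    # parts[-1] is either empty (text ends with a period) or an
    # unfinished sentence without a period; in both cases it is skipped
    return [part + '.' for part in parts[:-1]]

print(split_sentences('One. Two.'))
print(split_sentences('One. Two. unfinished'))
```

Note that, as in text2sentences, white space at the start of a sentence is kept; it could be removed with the strip-method if desired.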
Solution to exercise 38:
The requested function could look like:
def minmaxmean(m):
    """Returns the minimum value, the maximum value, as well as the
    mean of the items in the list m
    """
    if m==[]:
        # if the list m is empty: minimum, maximum and mean are undefined
        return None, None, None
    else:
        # otherwise they can be calculated by looping over all items
        minval = m[0]
        maxval = m[0]
        sumval = m[0]
        for i in range(1,len(m)):
            if m[i] < minval:
                minval = m[i]
            elif m[i] > maxval:
                maxval = m[i]
            sumval = sumval + m[i]
        return minval, maxval, float(sumval)/len(m)
The function can then be executed with m1 as parameter, storing its resulting 3 values
in 3 distinct variables, which can subsequently be printed (e.g. using the 'new style'):
m1=[9,4,5,6,2,5,4,3,1,2,12,7,4,3,2,8,4,2]
mymin,mymax,mymean = minmaxmean(m1)
print('min: {}, max: {}, mean: {:.3f}'.format(mymin,mymax,mymean))
This yields as output:
min: 1, max: 12, mean: 4.611
The same function can then also be executed on the other list, i.e. m2. Instead of
storing the three resulting values in three variables, the three values can also be stored
in a single tuple. Using indexing operations, the three values can then be extracted and
displayed similarly as above. Using the ’old style’ the tuple may also be used directly:
m2=[3,4,5,2,2,12,2,1,8,2,9,4,3,6,4,4,7,5]
myminmaxmeantuple = minmaxmean(m2)
print('min: %d, max: %d, mean: %.3f' % myminmaxmeantuple)
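Python's built-in functions min, max and sum give the same three values without an explicit loop, which is a convenient way to check the hand-written function. A sketch, with the hypothetical name minmaxmean_builtin introduced here:

```python
def minmaxmean_builtin(m):
    """Return minimum, maximum and mean of the list m using built-ins."""
    if not m:
        # for an empty list minimum, maximum and mean are undefined
        return None, None, None
    return min(m), max(m), sum(m) / len(m)

m1 = [9, 4, 5, 6, 2, 5, 4, 3, 1, 2, 12, 7, 4, 3, 2, 8, 4, 2]
print('min: {}, max: {}, mean: {:.3f}'.format(*minmaxmean_builtin(m1)))
```

The * in the call to format unpacks the returned tuple into three separate arguments.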
Partial solution to exercise 39:
(a) The output of
for i in range(1,5):
    for j in range(1,5):
        print('%4d' % i*j, end=' ')
    print()
is:
   1    1   1    1   1   1    1   1   1   1
   2    2   2    2   2   2    2   2   2   2
   3    3   3    3   3   3    3   3   3   3
   4    4   4    4   4   4    4   4   4   4
The reason for this (maybe unexpected) result is that the % operator has the same
priority as the * operator, and operators of equal priority are evaluated from left to
right. In each print call, the value of i is thus
first substituted in the string, after which that string is repeated j times. Thus
first a single string '   1' is printed, followed by a space due to the end=' ' argument
of the print call. Then on the same line two such strings are printed, again
followed by a space. Then three such strings followed by a space, and finally four
such strings followed by a space. Then the inner loop is finished, after which the
second print call is executed. This simply results in a new line. Subsequently
the inner loop is executed again with i equal to 2, etc., finally resulting in the
above output.
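The effect of the missing parentheses can be checked in isolation. In the sketch below the values i = 2 and j = 3 and the names repeated and product are chosen here only for illustration; repr is used so that the leading spaces are visible.

```python
i, j = 2, 3

repeated = '%4d' % i * j      # evaluated as ('%4d' % i) * j
product = '%4d' % (i * j)     # the product is formatted once

print(repr(repeated))
print(repr(product))
```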
(b) In order to print the requested table, the value that should be printed each time is
the product of i and j. To make sure that this product is calculated first and that
the resulting value is subsequently substituted in the string, parentheses should be
used, i.e.:
for i in range(1,13):
    for j in range(1,13):
        print('%4d' % (i*j), end=' ')
    print()
Solution to exercise 41:
(a) Because we do not know which characters the arbitrary string consists of, the best
solution seems to loop over all characters in the string and tally the numbers of
characters encountered. A good way to store the numbers of characters encountered
(so far) is by using a dictionary. When we have not yet considered any characters,
the dictionary should be empty. For each character encountered we check whether
it was already observed before or not, i.e., whether it is already present in the
dictionary or not. If it is not present in the dictionary yet, we add that character
as a new key to the dictionary and attach the value 1 to it, i.e., one occurrence (so
far). If we encounter a character that is already present in the dictionary, we just
increase the attached value by one. As a Python fragment this may read:
# define the arbitrary string s
s = 'AAAAAAAAAAA#A#AB'
# loop over all characters and tally
countd={}
for char in s:
    if char in countd:
        countd[char] += 1
    else:
        countd[char] = 1
Note: in the above solution,
countd[char] += 1
is a shorter notation for:
countd[char] = countd[char] + 1
i.e., a way to increase the value of countd[char] by 1.
(b) The aligned table could subsequently be produced using:
for key in sorted(countd):
    n = countd[key]
    print('{} {:3d} {:7.2%}'.format(key, n, n/len(s)))
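The tallying loop of part (a) is common enough that the standard library offers a ready-made dictionary type for it, collections.Counter. The following sketch shows how the same table could be produced with it; the exercise itself presumably intends the explicit loop, so this is only meant as a cross-check.

```python
from collections import Counter

s = 'AAAAAAAAAAA#A#AB'

# Counter is a dictionary that maps each character to its number of occurrences
countd = Counter(s)

for key in sorted(countd):
    n = countd[key]
    print('{} {:3d} {:7.2%}'.format(key, n, n/len(s)))
```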
Solution to exercise 43:
(a)
import urllib.request
import urllib.parse

def getFromChemCalc(molformula):
    """ Return molecular information for 'molformula' as obtained
    from http://www.chemcalc.org
    """
    url = 'http://www.chemcalc.org/chemcalc/mf'
    # Define the parameters and send them to Chemcalc
    mfdict = {'mf': molformula,'isotopomers':'jcamp,xy'}
    params = urllib.parse.urlencode(mfdict)
    response = urllib.request.urlopen(url, params.encode())
    # Read the output
    return response.read().decode()

chemcalcstr = getFromChemCalc('C2H6O')
(b)
import json
chemcalcdict = json.loads(chemcalcstr)
print(chemcalcdict.keys())
print('molecular weight of',chemcalcdict['mf'],'is:',chemcalcdict['mw'])
(c)
elemlist = chemcalcdict['parts'][0]['ea']
for elem in elemlist:
    print('%4s %7.2f%%' % (elem['element'], elem['percentage']))
An alternative using the new style formatting is to unpack the dictionary (using **)
and use the dictionary keys as indicators at which position in the string the values
should be inserted:
for elem in elemlist:
    print('{element:>4s} {percentage:7.2f}%'.format(**elem))
Solution to exercise 44:
import urllib.request
import urllib.parse
# Open the webpage
protocol="http"
hostname="cbio.bmt.tue.nl"
path="~philbers/index.htm"
url=protocol+"://"+hostname+"/"+path
rf=urllib.request.urlopen(url)
# Read the data from the webpage
data=rf.read().decode()
# Convert to lower case and count
data = data.lower()
nr = data.count('computational')
print('The webpage contains', nr, 'times the word "computational"')
The output of the program:
The webpage contains 9 times the word "computational"
Solution to exercise 45:
from Bio import Entrez

def returnnrhits(l=[]):
    Entrez.email="your.name@student.tue.nl"
    searchterm=" AND ".join(l)
    handle=Entrez.esearch(db="nucleotide",
                          term=searchterm)
    record = Entrez.read(handle)
    return record["Count"]

nrhits=returnnrhits(l=[ "Escherichia coli[Organism]",
                        "complete genome[All Fields]",
                        "srcdb_refseq[Properties]"])
print("Nr of hits:", nrhits)
The output of the program is:
Nr of hits: 1158
Solution to exercise 50:
def triangleOfStars(k):
    """
    A method that prints, for k as an integer parameter,
    a filled triangle of stars ('*') with k stars as
    basis and k stars as height. Between two stars a
    space is printed.
    For example, when k is 5:
        print('*')          # line 0 has 1 star
        print('* *')        # line 1 has 2 stars
        print('* * *')      # line 2 has 3 stars
        print('* * * *')    # line 3 has 4 stars
        print('* * * * *')  # line 4 has 5 stars
    Thus, closed triangle:
    with lines numbered from 0 through k-1, each
    line has one more star than its line number.
    """
    i = 0
    # invariant: 0<=i<=k and i lines printed
    while i < k:
        # invariant: 0<=i<=k and i lines printed
        print('* '*(i+1))  # line i with i+1 stars
        i = i+1
        # invariant: 0<=i<=k and i lines printed
    # invariant: 0<=i<=k and i lines printed
    # because the loop stopped, also holds: i>=k
    # Thus: i==k and i lines printed
    # Thus: whole triangle printed

# test the function
triangleOfStars(5)
Solution to exercise 53:
In all cases the exercise asks to show the values that i takes. Adding a line to the
program that prints this value solves the problem. So running
def countSomething(word='insulin resistance'):
    i=0
    print(word)
    print("The value of i is:", str(i))
    counter=0
    jump=word.index('n')
    print("The value of jump is:", str(jump))
    while i<len(word):
        if word[i]=='n':
            counter=counter+i
            i=i+jump
        i=i+1
        print("The value of i is:", str(i))
    return counter

print(countSomething('diabetes patient'))
print("****")
print(countSomething('diabetes patient or not'))
print("****")
print(countSomething('not a diabetes patient'))
print("****")
print(countSomething())
yields
diabetes patient
The value of i is: 0
The value of jump is: 14
The value of i is: 1
The value of i is: 2
The value of i is: 3
The value of i is: 4
The value of i is: 5
The value of i is: 6
The value of i is: 7
The value of i is: 8
The value of i is: 9
The value of i is: 10
The value of i is: 11
The value of i is: 12
The value of i is: 13
The value of i is: 14
The value of i is: 29
14
****
diabetes patient or not
The value of i is: 0
The value of jump is: 14
The value of i is: 1
The value of i is: 2
The value of i is: 3
The value of i is: 4
The value of i is: 5
The value of i is: 6
The value of i is: 7
The value of i is: 8
The value of i is: 9
The value of i is: 10
The value of i is: 11
The value of i is: 12
The value of i is: 13
The value of i is: 14
The value of i is: 29
14
****
not a diabetes patient
The value of i is: 0
The value of jump is: 0
The value of i is: 1
The value of i is: 2
The value of i is: 3
The value of i is: 4
The value of i is: 5
The value of i is: 6
The value of i is: 7
The value of i is: 8
The value of i is: 9
The value of i is: 10
The value of i is: 11
The value of i is: 12
The value of i is: 13
The value of i is: 14
The value of i is: 15
The value of i is: 16
The value of i is: 17
The value of i is: 18
The value of i is: 19
The value of i is: 20
The value of i is: 21
The value of i is: 22
20
****
insulin resistance
The value of i is: 0
The value of jump is: 1
The value of i is: 1
The value of i is: 3
The value of i is: 4
The value of i is: 5
The value of i is: 6
The value of i is: 8
The value of i is: 9
The value of i is: 10
The value of i is: 11
The value of i is: 12
The value of i is: 13
The value of i is: 14
The value of i is: 15
The value of i is: 17
The value of i is: 18
22
The function first prints the word and 'The value of i is: 0', after which it determines the
index of the first occurrence of the letter 'n' in the string word and stores that value in
the variable jump. Subsequently, the function iterates over the indices of the string word
and prints those indices. Only when a character 'n' is encountered does the index
additionally jump jump characters forward.
Solution to exercise 54:
We have made a small adaptation to the function by adding as argument a list containing
the values that would otherwise be input. Two print statements have also been added
to produce a ’nice’ table.
def examplerep(inlist):
    """ Changed the method to print
    a nice table at the end of each iteration. """
    a = 1000
    b = 1000
    i = 0
    print("%8s %8s %8s %8s" % ("i", "a", "b","c"))
    while (i<5):
        c = inlist[i]  # automated input
        if (c<b):
            if (c<=a):
                b = a
            a = c
        print("%8d %8d %8d %8d" % (i, a, b, c))
        i = i + 1
    print("My answers are: "+str(a)+" and "+str(b))

examplerep([0, 1, 2, 3, 4])
print()
examplerep([900, 800, 700, 600, 500])
print()
examplerep([3, 33, 333, 444, 33])
When this program is run, the following output is produced:
       i        a        b        c
       0        0     1000        0
       1        1     1000        1
       2        2     1000        2
       3        3     1000        3
       4        4     1000        4
My answers are: 4 and 1000

       i        a        b        c
       0      900     1000      900
       1      800      900      800
       2      700      800      700
       3      600      700      600
       4      500      600      500
My answers are: 500 and 600

       i        a        b        c
       0        3     1000        3
       1       33     1000       33
       2      333     1000      333
       3      444     1000      444
       4       33      444       33
My answers are: 33 and 444
Solution to exercise 58:
(a)
class Gene:
    def __init__(self, genesymbol="INS", genename="insulin"):
        self.gene_symbol=genesymbol
        self.gene_name=genename
Examples of object instantiation are
mygene=Gene()
mygene1=Gene(genesymbol="Casp4")
mygene2=Gene(genesymbol="Casp4",
             genename="apoptosis-related cysteine peptidase")
(b) A method of a class always has self as its first parameter. Since the method can only
be applied to an object of the class, it is safe to assume that the object
has already been created, and hence that all data attributes have already received a value.
The contents of the data attributes can be printed using the print
function:
    def print_geneinfo(self):
        print("The symbol of this gene is:", self.gene_symbol)
        print("The name of this gene is:", self.gene_name)
So applying this method (mygene2.print_geneinfo()) to the object mygene2 using
the instantiation given above, yields
The symbol of this gene is: Casp4
The name of this gene is: apoptosis-related cysteine peptidase
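For reference, parts (a) and (b) can be put together into one runnable fragment. The sketch below only combines the class definition and the method given above; the method is indented so that it is part of the class.

```python
class Gene:
    def __init__(self, genesymbol="INS", genename="insulin"):
        self.gene_symbol = genesymbol
        self.gene_name = genename

    def print_geneinfo(self):
        # self refers to the object the method is applied to
        print("The symbol of this gene is:", self.gene_symbol)
        print("The name of this gene is:", self.gene_name)

mygene2 = Gene(genesymbol="Casp4",
               genename="apoptosis-related cysteine peptidase")
mygene2.print_geneinfo()
```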
Solution to exercise 60:
(a)
from openpyxl import load_workbook
import matplotlib.pyplot as plt

def getWorkBook(xlsxfilename):
    """ A function with a file name as single parameter
    that, if the file name corresponds to an excel file,
    reads that file and returns it as a workbook object.
    """
    wb = None
    if xlsxfilename[-5:]=='.xlsx':
        wb = load_workbook(filename = xlsxfilename)
    return wb

wb = getWorkBook('health.xlsx')
(b)
def readColumn(wb, colnr=1):
    """ Extracts the data from column 'colnr' of
    the workbook 'wb' and returns it as a list
    """
    ws = wb.active
    nrrows = ws.max_row
    l = []
    for i in range(nrrows):
        # add 1 because excel columns and rows start at 1
        l.append(ws.cell(row=i+1,column=colnr+1).value)
    return l
(c)
def scaleList(l, fac=1.0):
    """ Multiplies all elements of the list 'l'
    with a factor 'fac' and returns the scaled
    values in a new list.
    """
    newl = []
    for i in range(len(l)):
        newl.append(l[i]*fac)
    return newl
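The same scaling can be written more compactly with a list comprehension. A sketch, using the hypothetical name scale_list to distinguish it from the function above:

```python
def scale_list(l, fac=1.0):
    """Return a new list with every element of l multiplied by fac."""
    return [x * fac for x in l]

inches = [60, 70]
print(scale_list(inches, 0.0254))
print(inches)   # the original list is not modified
```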
(d) Visual inspection of the file in Excel shows that the first row contains a
description of the file contents and the second row contains headers for the values
on the subsequent rows. That second row shows that weights can be found in the
fifth column and heights in the fourth column. Moreover, from the first row it
is clear that the specified weights are in pounds, while the heights are in inches.
These thus need to be converted to kilograms and meters, respectively, before being
plotted.
# open the excel file
wb = getWorkBook('health.xlsx')
weight = readColumn(wb, 4)
height = readColumn(wb, 3)
print('Extracted columns:', weight[1], height[1])
weight = weight[2:]
height = height[2:]
# convert pounds to kilograms
weightKg = scaleList(weight, 0.45359237)
# convert inches to meters
heightMeter = scaleList(height, 0.0254)
# make the scatter plot
plt.figure()
plt.plot(weightKg,heightMeter,'b+')
plt.xlabel('weight (kg)')
plt.ylabel('height (m)')
(e) We can use the already converted weights and heights to calculate the BMIs:
# calculate BMI
bmicalc = []
for i in range(len(heightMeter)):
    bmicalc.append(weightKg[i]/heightMeter[i]**2)
## make plot of BMI values to check them
#plt.figure()
#plt.plot(bmicalc,'ro')
#plt.xlabel('patient number')
#plt.ylabel('BMI')
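The BMI calculation can be tried without the Excel file. In the sketch below the two weights and heights are made-up example values, not data from health.xlsx, and zip is used to walk through both lists at the same time instead of using an index:

```python
# made-up example values, not taken from health.xlsx
weightKg = [70.0, 82.5]
heightMeter = [1.75, 1.80]

bmicalc = []
for w, h in zip(weightKg, heightMeter):
    # BMI = weight (kg) divided by the square of the height (m)
    bmicalc.append(w / h**2)

for bmi in bmicalc:
    print('{:.1f}'.format(bmi))
```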
(f) To select the data for male patients, we also need to extract the genders from the
excel workbook. Visual inspection of the Excel file shows that these are in the
second column. Moreover, we need to extract the names from the first column.
Info on male patients can then be written to a new workbook by looping over all
patients and selecting on the gender.
from openpyxl import Workbook

names = readColumn(wb, 0)
names = names[2:]
gender = readColumn(wb, 1)
gender = gender[2:]
wbout = Workbook()
ws1 = wbout.active
ws1.title = "Male data"
ws1.append(["name","height (m)", "weight (kg)","bmi"])
for i in range(len(gender)):
    if gender[i] == 'M':
        ws1.append([names[i],heightMeter[i],weightKg[i],bmicalc[i]])
wbout.save(filename = 'bmi_male.xlsx')
Solution to exercise 62:
The solutions can immediately be obtained by selecting the right configuring options
and hence are straightforward.
(a) As described in the lecture notes:
import tkinter
top = tkinter.Tk()
top.title("A DNA calculator")
top.mainloop()
(b)
import tkinter
top = tkinter.Tk()
top.title("A DNA calculator")
top.geometry("500x400")
top.mainloop()
(c)
import tkinter
top = tkinter.Tk()
top.title("A DNA calculator")
top.geometry("500x400")
top.configure(background="blue")
top.mainloop()
(d)
import tkinter

class MyApp:
    def __init__(self, parent):
        self.parent=parent
        self.l=tkinter.Label(parent,
            text="This DNA calculator should become colorful")
        self.l.pack()

top = tkinter.Tk()
top.title("A DNA calculator")
top.geometry("500x400")
top.configure(background="blue")
myapp=MyApp(top)
top.mainloop()
(e)
import tkinter

class MyApp:
    def __init__(self, parent):
        self.parent=parent
        self.l=tkinter.Label(parent,
            text="This DNA calculator should become colorful")
        self.l.configure(background="red")
        self.l.pack()

top = tkinter.Tk()
top.title("A DNA calculator")
top.geometry("500x400")
top.configure(background="blue")
myapp=MyApp(top)
top.mainloop()
(f)
import tkinter

class MyApp:
    def __init__(self, parent):
        self.parent=parent
        self.l=tkinter.Label(parent,
            text="This DNA calculator should become colorful")
        self.l.configure(background="red")
        self.l.configure(foreground="yellow")
        self.l.pack()

top = tkinter.Tk()
top.title("A DNA calculator")
top.geometry("500x400")
top.configure(background="blue")
myapp=MyApp(top)
top.mainloop()
(g)
import tkinter

class MyApp:
    def __init__(self, parent):
        self.parent=parent
        self.l=tkinter.Label(parent,
            text="This DNA calculator should become colorful")
        self.l.configure(background="red")
        self.l.configure(foreground="yellow")
        self.l.pack()
        self.l2=tkinter.Label(parent, text="Beautiful colors\n isn't it?",
            background="green", foreground="white")
        self.l2.pack()

top = tkinter.Tk()
top.title("A DNA calculator")
top.geometry("500x400")
top.configure(background="blue")
myapp=MyApp(top)
top.mainloop()
(h) Change the order in which the two labels are packed.
import tkinter

class MyApp:
    def __init__(self, parent):
        self.parent=parent
        self.l=tkinter.Label(parent,
            text="This DNA calculator should become colorful")
        self.l.configure(background="red")
        self.l.configure(foreground="yellow")
        self.l2=tkinter.Label(parent, text="Beautiful colors\n isn't it?",
            background="green", foreground="white")
        self.l2.pack()
        self.l.pack()

top = tkinter.Tk()
top.title("A DNA calculator")
top.geometry("500x400")
top.configure(background="blue")
myapp=MyApp(top)
top.mainloop()
Solution to exercise 64:
(a)
from tkinter import *

class MyApp:
    def __init__(self, parent):
        self.parent=parent
        self.l=Label(parent,
            text="This DNA calculator should become colorful",
            bg="black", fg="yellow")
        self.l.pack(anchor=N)

r=Tk()
r.configure(bg="blue")
myapp=MyApp(r)
r.mainloop()
(b)
from tkinter import *

class MyApp:
    def __init__(self, parent):
        self.parent=parent
        self.l=Label(parent,
            text="This DNA calculator should become colorful",
            bg="black", fg="yellow")
        self.l.pack(side=RIGHT, anchor=N)

r=Tk()
r.configure(bg="blue")
myapp=MyApp(r)
r.mainloop()
By adding side=RIGHT the label widget remains attached to the right side of the
parent window.
(c)
from tkinter import *

class MyApp:
    def __init__(self, parent):
        self.parent=parent
        self.l=Label(parent,
            text="This DNA calculator should become colorful",
            bg="black", fg="yellow")
        self.l.pack(fill=X, side=RIGHT, expand=True)

r=Tk()
r.configure(bg="blue")
myapp=MyApp(r)
r.mainloop()
Adding fill=X makes the widget stretch in the X-direction so that it fills the space
allotted to it; in the Y-direction it keeps its own height and remains 'centered'.
Setting expand=True implies that the widget claims as much space as is available. In this
case this space equals that of the parent window.
(d) and (e) Similar to (c), but in (d) the roles of the X and Y-direction are interchanged. In (e) both directions are involved.
Index
”, 14, 58
., 28
[ ], 22
init() , 109
\n, 58
\t, 58
#, 16
{}, 84
1000 Genomes Project, 5
abstraction, 67
algorithm, 10
alternative splicing, 55
amino acid, 53
assignment statement, 13
base, 6
adenine, 6
cytosine, 6
guanine, 6
thymine, 6
uracil, 6
bioinformatics, 5
definition, 9
block, 26
BMI, 17
Boolean, 42
Boolean expressions, 44
boxplot, 113
chromosome, 7
autosomal, 7, 8
close, 48
codon, 53
comments, 16
comparisons, 43
complementary, 7
complementary base pairing, 52
dictionary, 84
DNA, 5
5’, 7
double helix, 6
docstring, 69
dot notation, 28, 32
elif, 41
else, 40
empty list, 22
Excel files, 110
extron, 53
False, 43
file
open method, 46
float, 14
for statement, 26
function
call, 68
print, 17
gene, 54
genes, 5
identifier, 13, 155
if, 40
if–else statement, 41
immutability, 60
import, 27
index number, 16, 23
negative, 25
input, 45
integer, 13
intron, 53
join, 62
keyword parameter, 70
len, 16, 22
list, 22
append, 23
concatenation, 25
count, 24
empty list, 22
extend, 23
index method, 24
indexing, 23
insert, 24
methods, 23
negative index number, 25
pop, 24
remove, 24
repetition, 25
reverse, 24
slicing, 34
sort, 24
literals, 12
matplotlib, 36
boxplot, 113
Mendel, 5
method call, 108
methods, 12
negative index number, 25
nucleotides, 6
numeric types
float, 14
integer, 13
operations, 14
readlines, 47
RNA, 6, 52
selection method, 40
simulation, 136
slicing, 34
specification, 97
statement
assignment, 13
for, 26
if–else, 41
import, 27
string, 14
concatenation, 15
format, 77, 79
indexing, 16
slicing, 16
string literals, 14
string methods, 58
syntax, 12
transcript, 53
True, 43
tuple, 75
variable, 13
Venter, 8
while-statement, 96
object, 12
object instantiation, 109
optional arguments, 32
plot, 36
positional parameters, 69
print, 77
print function, 17
problem analysis, 97
program design, 97
promoters, 52
pyplot, 36
Python, 4, 12
>>>, 15
interactive mode, 15
prompt, 15
range, 25, 33
read, 47
readline, 46