Lecture notes
Programming and genomics (8CA10)
2019/2020

A.J. Markvoort
P.A.J. Hilbers

October, 2019

Contents

1 Introduction
    1.1 Programming and biomedical applications
    1.2 The human genome
    1.3 Introduction to computer programming
    1.4 Additional Python resources
    1.5 Exercises
2 Python
    2.1 Data model
    2.2 The print function
    2.3 Calculating with variables
    2.4 Exercises 1–6
3 Lists and repetition
    3.1 Lists
    3.2 The for statement
    3.3 Modules and the import statement
    3.4 Exercises 7–12
4 Methods, slicing, random and plotting
    4.1 Methods and invocations
    4.2 The dir and help methods
    4.3 The range method
    4.4 Slicing
    4.5 Random numbers
    4.6 Plotting data using matplotlib
    4.7 Exercises 13–18
5 Selection methods and file/user input
    5.1 Selection methods (if, elif and else)
    5.2 Conditionals and selection
        5.2.1 False and True
        5.2.2 Comparisons
        5.2.3 Boolean operations
    5.3 User input (input)
    5.4 Reading from files
    5.5 Exercises 19–24
6 Bioinformatics and strings
    6.1 From DNA to RNA to protein
    6.2 DNA, RNA and Python strings
    6.3 Operations on strings
    6.4 Converting lists and strings
    6.5 Writing data to file
    6.6 Exercises 25–29
7 Functions and parameters
    7.1 Function definition (def)
    7.2 Function call
    7.3 Documenting functions
    7.4 Positional parameters as function arguments
    7.5 Keyword parameters and defaults
    7.6 Exercises 30–37
8 Tuples and string formatting
    8.1 Tuples
    8.2 Returning multiple values from a function
    8.3 String formatting
        8.3.1 Old style: %
        8.3.2 New style: the format method
    8.4 Exercises 38–40
9 Dictionaries and database queries
    9.1 Dictionaries
    9.2 Database queries
        9.2.1 Open arbitrary resources by URL
        9.2.2 Accessing databases: NCBI, Entrez and BioPython
    9.3 Exercises 41–48
10 Program design and examples
    10.1 A more general repetition construct (while)
    10.2 Programming: problem formulation, analysis and design
    10.3 Programming examples
        10.3.1 Counting 'CGs' in DNA strings
        10.3.2 All pattern positions in a string
    10.4 Exercises 49–57
11 Classes, Excel files and boxplots
    11.1 Classes and objects
    11.2 Excel files in Python
    11.3 Boxplot
    11.4 Exercises 58–61
12 Graphical user interfaces
    12.1 A first window
    12.2 The four basic GUI-programming tasks
    12.3 The label widget
    12.4 The button widget
    12.5 The frame widget
    12.6 Bringing the buttons to life.
    12.7 The entry widget
    12.8 Exercises 62–69
13 Two examples
    13.1 Simulations
    13.2 A bar plot example
    13.3 Exercises 70–73
A Summary of useful commands
    A.1 Common Python constructs
    A.2 Operations on string s
    A.3 Operations on file f
    A.4 Operations on lists l
    A.5 Operations on dictionaries d
    A.6 List generation and plotting
    A.7 turtle
    A.8 openpyxl
    A.9 Database queries
    A.10 tkinter
B Solutions to selected exercises

Chapter 1 Introduction

1.1 Programming and biomedical applications

An introduction to programming is part of almost every study programme at universities worldwide.
This also holds for the studies in the BioMedical Engineering department at the Eindhoven University of Technology, where from the beginning (1997) the course 8C010 “An Introduction to Object Oriented Programming and Java” has been organised in the first half year of the study.

Currently, there is little consensus about which programming language is most appropriate for introductory computer science classes. Most schools use a traditional system programming language such as C, C++ or Java. As may be inferred from the ranking in Figure 1.1, these languages are indeed popular.

Figure 1.1: TIOBE Programming Community Index for September 2019, see programming languages rankings.

Languages such as Tcl, Perl and Python, however, are becoming increasingly popular for developing application-specific software, and are considered to be simpler, safer and more flexible than C or Java. In particular, Python emerges as a good candidate for a first programming language, and it is the programming language we will use in this course. Moreover, one particularly interesting feature of Python is the large number of interfaces it has to other programming languages and tools, such as Matlab and R. In most cases one can use constructs from these other languages directly in Python. Python is therefore often described as a glue language, in which parts written in other languages are joined together in one single Python program.

In these lecture notes, however, the focus will not be on one particular programming language. The emphasis is on programming concepts, with special focus on biomedical applications, and since data analysis is a major component of most biomedical applications, that will be the main topic.

1.2 The human genome

In this section we give a short introduction to the scientific field of bioinformatics.
Driving forces are the Human Genome Project and its successor, the 1000 Genomes Project, whose primary goal is to create a complete and detailed catalogue of human genetic variation.

Historical introduction

Genetics as a set of principles and analytical procedures did not begin until 1866, when an Augustinian monk named Gregor Mendel (see Figure 1.2a) performed a set of experiments that pointed to the existence of biological elements called genes, the basic units responsible for the possession and passing on of a single characteristic.

Figure 1.2: Historical figures: a) Gregor Johann Mendel (from www.wikipedia.org), b) Watson and Crick with the DNA double helix model (from www.thehistoryblog.com).

Until 1944, it was generally assumed that chromosomal proteins carry genetic information, and that DNA plays a secondary role. This view was shattered by Avery, MacLeod and McCarty, who demonstrated that the molecule deoxyribonucleic acid (DNA) is the major carrier of genetic material in living organisms, i.e., it is responsible for inheritance.

Figure 1.3: a) The purine bases are adenine and guanine (in blue), while the pyrimidines are thymine and cytosine (in pink). RNA contains uracil instead of thymine. b) A nucleotide is composed of a base, a five-carbon sugar and one to three phosphate groups.

DNA composition

The basic elements of DNA have been isolated and determined by partly breaking up purified DNA. These studies demonstrated that DNA is composed of four basic molecules called nucleotides (see Figure 1.3a), which are identical except that each contains a different nitrogen base. Each nucleotide (see Figure 1.3b) contains phosphate, a sugar (of the deoxyribose type) and one of the four bases: Adenine, Guanine, Cytosine, and Thymine (usually denoted A, G, C and T, respectively).
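Because DNA is built from only four bases, a piece of DNA can be written down as a sequence of the letters A, G, C and T. The following is a minimal sketch of how such a sequence could be stored and inspected in Python, the language used in the rest of these notes. The sequence itself is a made-up example, and the constructs used here (strings, for, count) are only introduced in later chapters.

```python
# Hypothetical example sequence; any string over the letters A, G, C and T would do.
dna = "GATTACAGGC"

# Count how often each of the four bases occurs in the sequence.
for base in "AGCT":
    print(base, dna.count(base))

# Every position holds one of the four bases, so the counts add up to the length.
total = sum(dna.count(base) for base in "AGCT")
print(total == len(dna))
```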
Structure

In 1953 James Watson and Francis Crick deduced the three-dimensional structure of DNA (see Figure 1.2b) and immediately inferred its method of replication. The structure of DNA is described as a double helix, which looks rather like two interlocked bedsprings. Each helix is a chain of nucleotides held together by phosphodiester bonds.

Figure 1.4: Base pairing (www.biology-pages.info)

The two helices are held together by hydrogen bonds. Each base pair consists of one purine base (A or G) and one pyrimidine base (C or T), paired according to the following rule: G-C and A-T (see Figure 1.4).

The DNA molecule is directional, due to the asymmetrical structure of the sugars which constitute the skeleton of the molecule. Each sugar is connected to the strand upstream (i.e., preceding it in the chain) at its fifth carbon and to the strand downstream (i.e., following it in the chain) at its third carbon. In biological jargon, the DNA strand goes from 5' (read five prime) to 3' (read three prime). The directions of the two complementary DNA strands are opposite to one another, see Figure 1.5.

Figure 1.5: The double helix and the directional conventions.

Genes and chromosomes

Each DNA molecule is packaged in a separate chromosome, and the total genetic information stored in the chromosomes of an organism is said to constitute its genome. With few exceptions, every cell of a eukaryotic multi-cellular organism contains a complete copy of the genome, while the difference in functionality of cells from different tissues is due to the variable expression of the corresponding genes.

The human genome contains about 3 × 10^9 base pairs (abbreviated bp), organized as 46 chromosomes: 22 different autosomal chromosome pairs and two sex chromosomes, either XX or XY. The 24 different chromosomes range from 50 × 10^6 to 250 × 10^6 bp. The total number of base pairs varies between different organisms.
The organism Amoeba dubia (a single-cell organism), for example, has more than 200 times as many base pairs as the human genome.

Living organisms are divided into two major groups: Prokaryotes, which are single-celled organisms with no cell nucleus, and Eukaryotes, which are higher-level organisms whose cells have nuclei. With contemporary knowledge of the biochemical basis of heredity, Mendel's abstract concept of a gene can be redefined as a physical entity. A gene is a region of DNA that controls a discrete hereditary characteristic.

The Human Genome Project

The ultimate goal of the Human Genome Project is to produce a single continuous sequence for each of the 24 human chromosomes and to delineate the positions of all genes. The working draft sequence described by the International Human Genome Sequencing Consortium was constructed by melding together sequence segments derived from over 20,000 large clones.

• 1985 - The project was first initiated by Charles DeLisi, associate director for health and environment research at the Department of Energy (DoE) in the United States.
• 1988 - The National Institutes of Health (NIH) establishes the office of human genome research.
• 1990 - The Human Genome Project (HGP) is launched with the intention of completing it within 15 years and with a 3 billion dollar budget.
• 1996 - In a meeting in Bermuda, international partners in the genome project agreed to formalize the conditions of data access, including release of sequence data into public databases. This came to be known as the Bermuda Principles.
• 1998 - Craig Venter forms a company with the intent to sequence the human genome within three years. The company, later named Celera, introduced a new, ambitious whole genome shotgun approach.
• 1999 - The public project responds to Venter's challenge and changes its target date for completing the first draft.
• December 1999 - The first complete human chromosome sequence (number 22) is published.
• June 2000 - Leaders of the public project and Celera meet in the White House to announce completion of a working draft of the human genome.
• February 2001 - The first draft of the human genome was published in the journals Nature and Science.
• May 2006 - Human Genome Project researchers announced the completion of the DNA sequence for the last of the 24 human chromosomes.
• January 2008 - The 1000 Genomes Project was launched as an international research effort to establish by far the most detailed catalogue of human genetic variation.
• May 2008 - Mapping and sequencing of structural variation from eight human genomes.
• May 2011 - Report about the Economic Impact of the Human Genome Project: How a $3.8 billion investment drove $796 billion in economic impact, created 310,000 jobs and launched the genomic revolution.
• October 2012 - The sequencing of 1092 genomes was announced in a Nature publication.

The human genome, the first vertebrate genome sequence determined, seems likely to be quite representative of what we will find in other vertebrate genomes. It is around 30 times larger than the recently sequenced genomes of the worm Caenorhabditis elegans and the fruit fly Drosophila melanogaster (available in the public domain), both around 10^8 bp, and 250 times larger than that of the yeast Saccharomyces cerevisiae. Despite its size, it seems likely to have only two or three times as many genes as the fly or worm genomes, with the coding regions of genes accounting for only 1.5% of the DNA. Repeat sequences form a large proportion of the remaining DNA, around 46%. These repeats may or may not have a function, but they are certainly characteristic of large vertebrate genomes. The rest of the sequence contains promoters, transcriptional regulatory sequences and other features.
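To get a feeling for the sizes quoted above, the percentages and ratios can be turned into approximate numbers of base pairs. The following is a small sketch using only the rounded figures from the text (3 × 10^9 bp, 1.5%, 46%, a factor of 250 for yeast). The variable names are chosen here for illustration, and the results are rough estimates, not measured values.

```python
# Rounded figures taken from the text above.
human_bp = 3 * 10**9        # approximate size of the human genome in base pairs
worm_bp = 10**8             # approximate size of the worm and fruit fly genomes

# Integer arithmetic keeps the results exact: 1.5% = 15/1000 and 46% = 46/100.
coding_bp = human_bp * 15 // 1000   # coding regions of genes
repeat_bp = human_bp * 46 // 100    # repeat sequences
yeast_bp = human_bp // 250          # yeast genome, 250 times smaller

print('Human / worm size ratio:', human_bp // worm_bp)
print('Coding regions (bp):', coding_bp)
print('Repeat sequences (bp):', repeat_bp)
print('Estimated yeast genome (bp):', yeast_bp)
```

With these rounded inputs, the coding regions come to about 45 million bp and the repeat sequences to about 1.4 billion bp.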
The 1000 Genomes Project is but the latest increment in a remarkable scientific program whose origins date back a hundred years to the rediscovery of Mendel's laws and whose end is nowhere in sight. In a sense it provides a capstone for efforts in the past century to discover genetic information and a foundation for efforts in the coming century to understand it. The scientific work will have profound long-term consequences for medicine, leading to the elucidation of the underlying molecular mechanisms of disease and thereby facilitating the design, in many cases, of rational diagnostics and therapeutics targeted at those mechanisms.

With the Human Genome Project, bioinformatics, i.e., the use of computational tools in biomedical engineering, has become an essential ingredient in research. Part of biomedical research is the study of human cellular processes. Human DNA is compared to that of other organisms such as mouse, rat and horse. As of February 2, 2014, 12857 complete genomes had been published, see the Genomes OnLine Database (GOLD). In the preceding year alone, more than 8000 new genomes were completed. Many of these 8000 are from bacteria, and their role in humans becomes more and more prominent (the human body contains over 10 times more microbial cells than human cells). We therefore expect that in the coming years much attention will be on comparing different microbial genomes to understand their differences. For these comparisons smart algorithms are needed and, hence, we should consider how to design such algorithms.

Also, with the arrival of next generation sequencing (NGS) platforms, which can sequence millions of small fragments of DNA in parallel, an entire human genome can nowadays be sequenced within a single day. Bioinformatics analyses are used to piece together these fragments by mapping the individual reads to the human reference genome.
Moreover, not only in the context of the human genome but also in many other contexts you will probably encounter (experimental) data that you might want to process and/or visualize. The ability to program in Python will be very useful in this respect.

1.3 Introduction to computer programming

Computer systems consist of hardware and software. The hardware is the physical machine, which has input devices, such as a keyboard and a mouse, output devices, such as a display screen and a printer, and two major components called processor and memory. The processor, also called Central Processing Unit (CPU), is the part capable of executing very simple instructions such as moving numbers around from one place in memory to another and performing some simple arithmetic operations such as addition and subtraction. The memory holds the data for the CPU to process, and it holds intermediate results of calculations. In order to identify the different locations in which data has been stored, each memory location has a unique address.

As stated, computer hardware can only directly execute some very simple instructions. The very first programmers actually had to enter these simple instructions in the form of binary codes themselves. The next stage was to create a translator that simply converted English equivalents of the codes into binary, so that instead of having to remember that the code 001273 05 04 meant add 5 to 4, programmers could now write ADD 5 4. This very simple improvement made life much simpler, and these systems of codes were really the first programming languages, one for each type of computer. They are known as assembly languages, and assembly programming is still used for a few specialized programming tasks today. Even this was very primitive and still told the computer what to do at the hardware level: move bytes from this memory location to that memory location, add this byte to that byte, etc.
It was still very difficult and took a lot of programming effort to achieve even simple tasks. Gradually computer scientists developed higher-level computer languages to make the job easier. This was just as well, because at the same time users were inventing ever more complex jobs for computers to solve! This competition between the computer scientists and the users is still going on, and new languages keep on appearing. This makes programming interesting, but also makes it important that as a programmer you understand the concepts of programming as well as the pragmatics of doing it in one particular language.

Programming is a creative process in which a method, called an algorithm, is designed for solving a problem. An algorithm is a set of instructions that must be expressed so completely and so precisely that the instructions can be followed without having to fill in further details. It has the following characteristics:

• it is described in terms of simpler actions,
• it is a sequence of actions,
• it usually has to store intermediate results,
• it uses different names for different intermediate results,
• it usually contains a sequence of instructions that have to be repeated until some test condition is reached, and
• it has an end criterion.

1.4 Additional Python resources

Apart from the remainder of these lecture notes, there are numerous books and web resources available to assist you in your tour through the Python programming language. Below you find a short list of some important resources on the web:

1. The default site to look for information on Python is http://docs.python.org/3/. It contains the documentation, a tutorial, a language reference, and the standard library reference for the latest version of Python. Such sites are also still available for older versions of Python, e.g. for Python 3.6 it is http://docs.python.org/3.6/
2. A list with books, websites and video tutorials is available at https://wiki.python.org/moin/BeginnersGuide/Programmers.
1.5 Exercises

At the end of each of the following chapters a number of exercises is given. Some of these exercises are marked with one star (*) or two stars (**). Two stars mean that the solution to the whole exercise can be found in appendix B of these lecture notes, one star means that the solution to part of the exercise can be found there. For the other exercises the solutions will follow approximately 1 week after the guided self-study for which the exercises were scheduled. For all exercises, including those for which solutions are already provided, you should (try to) solve them yourself first before looking at the solutions. Finding the solution to an exercise by writing your own program is very different from understanding a provided solution!

Chapter 2 Python: standard types

Python is a high-level general purpose programming language. It consists of a few simple constructs that will be introduced step by step in these lecture notes. The Python version that we will use is 3.6.1. It was released on March 21, 2017, and is preinstalled on the TU/e laptops with Anaconda 4.4.0. Any newer version of Anaconda (which can be downloaded from https://www.anaconda.com/download/#windows) should be fine too.

As in every other language, we first have to introduce the principles and rules for constructing sentences in the language, the so-called grammar rules or syntax. In a grammar some basic elements are predefined. This also holds for Python, in which we have the so-called predefined standard types, which are introduced in this chapter.

2.1 Data model

In an object oriented programming language the main concept is the object. Programs are considered as collections of objects that interact with each other by means of actions. An object has two parts: the data attributes and the actions, usually called methods, that act on them.
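As a first impression of what "data plus methods" means in practice, the sketch below uses a Python string, one of the predefined types introduced in this chapter. The example value is made up, and the two methods shown (upper and count) are just two of the many methods that strings offer. Methods are treated in more detail in a later chapter.

```python
# A string literal creates an object of the predefined type str.
sequence = "gattaca"   # hypothetical example value

# The data part is the text itself; methods are actions that use this data.
upper_sequence = sequence.upper()   # returns a new string in upper case
count_a = sequence.count("a")       # counts the occurrences of "a"

print(type(sequence).__name__, upper_sequence, count_a)
```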
Objects

Roughly speaking, Python has two kinds of objects:

• Predefined objects (standard types), of which the most common are:

    type     description               examples
    int      integer numbers           1, 2, 3
    float    floating point numbers    1.2, 1e+2
    str      strings                   "Hello", 'hi'

  Notations for constant values of built-in types are called literals.
• User defined objects

In this chapter we restrict ourselves to the three standard types mentioned above, and discuss only some of the methods available for these types. In the following chapters other object types will be introduced.

Variables

Data objects are stored in the memory of your computer. To access and to distinguish data objects, they can be given names. A name, also called identifier, is a word that consists of letters, underscores, and digits; it must start with a letter or an underscore. Identifiers are used to name parts of the program for future reference. Variables are used to refer to data values, and every variable has a name. Every language has its own rules about which characters are allowed or not allowed in a name. Python distinguishes between upper and lower case letters and is therefore called a case sensitive language. One common style of naming variables is to start the name with a lower case letter and to use a capital letter for the first letter of each subsequent word in the name, like this:

    thisVariableName

There is much freedom with respect to naming, but in general it is considered good programming practice to choose short but meaningful names. If an integer variable is only used for auxiliary purposes we give it a one letter name like n, i or q, but of course longer names could also be used.

Assignment statement

Assignment statements can be used to (re)bind names to values. For instance, if we want to store the value 10 in a variable n we write

    n = 10

The general construct to assign a value to a variable is

    identifier = expression

where expression is a computation that produces a value.
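The following short sketch illustrates binding, rebinding and case sensitivity. The names and values are chosen here purely for illustration.

```python
n = 10       # bind the name n to the value 10
n = n + 5    # the expression n + 5 is evaluated first, then n is rebound to the result

# Names are case sensitive, so these are two different variables.
thisVariableName = "insulin"
thisvariablename = "glucagon"

print(n, thisVariableName, thisvariablename)
```

After the second statement the old value 10 is no longer reachable through n; the name now refers to 15.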
Integers

As usual, an integer number is a sequence of digits, and the standard operators are

• subtraction: -
• addition: +
• multiplication: *
• true division: /
• floor division: //

all having the standard meaning, although the floor division is perhaps special: it is the division without the remainder. Moreover, all operators return an integer, except for the 'true division', which returns a float (see next subsection). Expressions can be constructed by combining these operators and using the parentheses ( and ) where appropriate.

Examples:

• 3 + 4, with as value 7,
• 3 + 4 * 7, with as value 31,
• (5 - 4) * 7, with as value 7,
• 7 // 4, with as value 1,
• 7 / 4, with as value the float 1.75.

Floats

Besides integers we use only one other type of numbers in this course: floats, an abbreviation for floating point numbers. Its precise definition is rather complicated, so we first give an informal description. In informal terms, a float is two integers joined by a dot and possibly followed by an exponential part consisting of the letter 'e' (small or capital) followed by an integer.

More formally, a float is either a pointfloat or an exponentfloat. A pointfloat consists of a sequence of one or more digits followed by a fraction, or of a sequence of one or more digits followed by a dot. A fraction consists of a dot followed by a sequence of one or more digits. An exponentfloat is either a sequence of one or more digits or a pointfloat, followed by an exponent, where an exponent is an e or E followed by an optionally signed sequence of one or more digits. When the float contains the letter 'e' or 'E' we speak about a number in scientific notation.

Examples:

    3.1415    3e+2    1.    0.5e-67

The numeric types (both floats and integers) support the following operations, sorted by ascending priority:

    Operation    Result
    x + y        sum of x and y
    x - y        difference of x and y
    x * y        product of x and y
    x / y        division of x by y
    x // y       floor division of x by y
    x % y        remainder of x / y
    -x           x negated
    abs(x)       absolute value or magnitude of x
    int(x)       x converted to integer
    float(x)     x converted to floating point
    pow(x, y)    x to the power y
    x ** y       x to the power y

Strings

To create string constants, also called string literals, enclose them in single, double, or triple quotes as follows:

    courseid = 'The name of this course'
    groupname = "Computational Biology"
    coursename = """Programming and genomics"""

The same type of quote used to start a string must be used to terminate it. Triple-quoted strings capture all the text that appears prior to the terminating triple quote, as opposed to single- and double-quoted strings, which must be specified on one logical line. Triple-quoted strings can be delimited by three double quotes (as in the preceding example) or by three single quotes (as in the following example). Triple quoting is useful when the contents of a string literal span multiple lines of text, such as the following:

    '''Content-type: text/html
    <h1> Computational Biology </h1>
    Click <a href="http://cbio.bmt.tue.nl/">here</a>.
    '''

Concatenation of strings

Python also has ways to combine strings. One such way is concatenation. Concatenation is the process of tying or gluing strings together to make a new string. In Python, you can concatenate strings with the + operator. Here are a few examples:

    >>> 'AA' + 'TTT'
    'AATTT'
    >>> 'AA' + ' ' + 'TTT' + '!'
    'AA TTT!'

Here we have given a short fragment of a Python session in so-called interactive mode. In this mode the Python interpreter prompts for the next command with its primary prompt, usually three greater-than signs (>>>) or a numbered prompt like In [3].
The user enters the input commands directly, and if the command results in output, this is shown on the next line.

Variables can also be concatenated if they hold strings as values:

    >>> word1 = 'Gene '
    >>> word2 = 'insulin'
    >>> word = word1 + word2
    >>> word
    'Gene insulin'

In the last assignment we could also have written word1 = word1 + word2. In that case, the old contents of word1 (i.e. 'Gene ') would be overwritten by the new contents. This works in the same way as for an (integer) variable n:

    >>> n = 3
    >>> n = n + 1
    >>> n
    4

Indexing

Strings are sequences of symbols/characters, where each symbol in the string has a position number. For instance, for the string "gctgca":

    Index     0  1  2  3  4  5
    String    g  c  t  g  c  a

Note that the index of the first element is 0! One can use the index number to get a character from the string.

    >>> dnaIns = "gctgca"
    >>> dnaIns[0]    # Get the first character of dnaIns
    'g'
    >>> dnaIns[2]    # Get the third character of dnaIns
    't'

Note the use of # in the above statements. This symbol plus the remainder of the line is ignored by the Python interpreter and can thus be used to add comments for the (human) reader of the code. Adding such comments can greatly increase the readability of your code and is thus good practice.

Using an index number equal to or larger than the length of the string is an error:

    >>> dnaIns[500]    # Get the character five hundred and one of dnaIns
    Traceback (most recent call last):
      File "<stdin>", line 1, in <module>
    IndexError: string index out of range

The length of a string s can be obtained by len(s). For example,

    >>> len("abc")
    3

Substrings: slicing

Apart from single characters, multiple characters can also be selected from a string. This is called slicing. The slices are called substrings.
Example:

    >>> motif = 'GAATTC'
    >>> motif[0:3]    # the first three characters
    'GAA'
    >>> motif[1:3]    # characters two and three
    'AA'

Both the start and the end position are optional: omitting the start means starting at the beginning of the string, and omitting the end means extracting the substring up to the end of the string. When accessing single characters, it is forbidden to access a position that does not exist, whereas during substring extraction the longest possible string is extracted (which may be the empty string '').

    >>> motif = 'GAATTC'
    >>> motif[0:3]    # the first three characters
    'GAA'
    >>> motif[:3]     # the first three characters
    'GAA'
    >>> motif[3:]     # everything but the first three characters
    'TTC'
    >>> motif[3:6]
    'TTC'
    >>> motif[:]
    'GAATTC'
    >>> motif[90]
    Traceback (most recent call last):
      File "<stdin>", line 1, in <module>
    IndexError: string index out of range
    >>> motif[3:90]
    'TTC'
    >>> motif[10:90]
    ''
    >>> motif[3:2]
    ''

2.2 The print function

The print function writes the value of the expression(s) it is given to the output. Multiple expressions and strings can be given to a single print function by separating the items with commas. Strings are printed without quotes, and a space is inserted between items, so you can format things nicely, like this:

    >>> i = 5
    >>> print(i)
    5
    >>> i = 3+4
    >>> print('The value of i is', i)
    The value of i is 7

2.3 Calculating with variables

As a small example, we will now consider a Python fragment to calculate someone's body-mass index (BMI). This is defined as a person's body weight (in kilograms) divided by the square of his/her length (in meters).

    >>> mass = 80.
    >>> height = 1.90
    >>> BMI = mass/height**2
    >>> print('With mass', mass, 'and length', height, 'the BMI is', BMI)
    With mass 80.0 and length 1.9 the BMI is 22.1606648199446

By defining two identifiers with readable names, the whole program fragment becomes easy to read.
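The way the BMI expression is evaluated follows from the priority order in the table of numeric operations. The sketch below reuses the values from the example; the names bmi_plain, bmi_grouped and bmi_wrong are chosen here for illustration only and do not appear elsewhere in these notes.

```python
# Values taken from the BMI example in the text.
mass = 80.0      # body weight in kilograms
height = 1.90    # body length in meters

bmi_plain = mass / height**2        # no parentheses
bmi_grouped = mass / (height**2)    # power first, then division
bmi_wrong = (mass / height)**2      # division first: this is not the BMI

# The first two expressions are evaluated in exactly the same way.
print(bmi_plain == bmi_grouped)
# The third one gives a very different number.
print(bmi_plain, bmi_wrong)
```

When in doubt about the priority of two operators, adding parentheses never hurts and often makes the intention clearer to the reader.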
Note that no parentheses () are needed around (height**2) as the power operator ** has a higher priority than the division operator /.

2.4 Exercises 1–6

Exercise 1:**
To be able to do the exercises, we assume you have Anaconda 4.4 installed (with Python version 3.6.1) and have opened the Spyder Python development environment via Anaconda Navigator. If so, you can skip the next paragraph. If you do not have it installed yet, the next paragraph leads you through the installation process.

In order to install Anaconda, use a web browser to go to https://www.anaconda.com/download/#windows. Press Download to download the Python 3.6 version. An installer will then be downloaded. Once it has been downloaded, which may take a while, run the installer and follow its steps until Anaconda is installed.

Once Anaconda is installed, start the Anaconda Navigator (e.g. by searching for Anaconda Navigator at your Windows start screen, via browsing the Apps screen, or in older versions of Windows via Start>All Programs>Anaconda3>Anaconda Navigator). Subsequently, in the Anaconda Navigator, launch Spyder by clicking the appropriate launch button. A 'Spyder' window will pop up, which shows a Python prompt in the bottom right panel.

We will start by using Python interactively (as a calculator). To use Python interactively, we type commands directly at the Python prompt. This prompt looks like In [1]:. Type the following lines at the prompt, one after the other, each followed by an Enter:

    5*40
    1.25*7
    100/25
    106/25
    106//25
    106.0/25
    106.0//25
    100/5*5
    100/(5*5)
    2**10
    3*2**3
    (3*2)**3

After entering a line the result should appear on the screen. Are the results as you expected? Why (not)?

Exercise 2:**
Calculations often become much more readable if, rather than using the values directly, we store those in variables and calculate with those.
To calculate for instance the number of possible DNA sequences of length 10, type at the Python prompt the following commands:

    nrbases = 4
    seqlength = 10
    nrbases**seqlength

Is the number of possibilities indeed reported? Apart from writing the result to screen, it is also possible to store the result in a new variable. In order to do so, we type:

    nrpos = nrbases**seqlength

The result can then be shown by inspecting the contents of the variable at the command line:

    nrpos

or using a print statement:

    print(nrpos)

The result can also be used in further calculations. For instance, if we would like to know what the probability is for a randomly generated sequence of length 10 to be 'AAAAAAAAAA', we have to divide 1 by the number of possible sequences. Calculate and display this probability.

Exercise 3:**
Which of the following names are valid Python identifiers? If a name is not valid, then show, for instance by trying to execute an assignment statement, the error message that is generated by the Python interpreter.

    a) whatsinaname
    b) whats in a name
    c) Whats_in_a_name
    d) 5600MB
    e) yo!u
    f) I
    g) HelloYou
    h) Hello;
    i) varName
    j) what?name

Exercise 4:**
The left-hand side of the Spyder window contains a large editor. In this editor a file has already been opened (untitled0.py). Enter in this window, below the information string that is already present, the Python commands making up your program:

    nrbases = 4
    seqlength = 10
    nrbases**seqlength
    nrpos = nrbases**seqlength
    nrpos
    print(nrpos)

You can run your program by selecting the item 'Run' in the Run menu, by pressing the function key F5, or by pressing the green triangle in the tool bar. Upon running your program, Spyder will first ask you to save it. A file dialog will pop up that allows you to do so. Save your file as exercise4.py, i.e. with the extension .py.
It is advised to store your programs (the solutions of your exercises) on a D: (data) drive or in your documents folder in a well-organised way, e.g. in a new folder named D:\courses\8CA10.

Once you have saved your program, a second window will pop up, i.e., the 'Run settings window'. Under the header 'General settings', mark the option 'Clear all variables before execution', and subsequently press the button 'Run'. Now your program is actually executed, and the output of your program is reported at the Python prompt (bottom right panel of the Spyder window). How many times is the value of nrpos reported?

A first advantage of running programs this way is that you can easily rerun them, without having to type all commands once more. If you do so, the program runs immediately, without popping up any windows. If you made any changes to your program, these are saved automatically under the same name, overwriting your old version. To save your program under a different name, use the 'Save as...' item in the File menu or press the key combination Ctrl+Shift+S.

A second advantage of storing your programs as files on your hard drive is that you can reuse them at another moment. E.g. close the program exercise4.py you just saved by selecting 'Close' in the File menu or clicking on the cross next to the file name, and subsequently open the file again in the editor (either using the item Open... in the File menu, or the key combination Ctrl+O).

Exercise 5:**
(a) Create a new file (using the New file ... option in the File menu or using the key combination Ctrl+N), and write, by substituting the proper calculation at the dots in the program fragment below, a program to calculate the total DNA mass in an average human:

    genomelength = 3.2e9     # number of base pairs per cell
    nrcells = 4e13           # number of cells
    massperbasepair = 660    # grams per mole per base pair
    Na = 6.022e23            # number of molecules per mole (Avogadro's number)

    totalDNAmass = ...
    print('approximate DNA mass one human:', totalDNAmass, 'grams')

(b) How much is that in kilograms?

Exercise 6:**
Given is a string s. As an example you may take

    s = 'AAACGAACGTAGGATCAAGTAGGCAAAAAG'

(a) print the first character of s
(b) print the last character of s
(c) print the string using 10 characters per line and a space after the 5th character, i.e.:

    AAACG AACGT
    AGGAT CAAGT
    AGGCA AAAAG

Chapter 3 Lists and repetition

3.1 Lists

Apart from single data elements, we are quite used to having multiple elements in a collection. We have multiple files of different types (txt, py, dat, etc.) in a folder, multiple songs on an mp3 player, and several courses to be followed in a semester. Hence we need to handle collections of all sorts of objects, and sometimes these collections are not even homogeneous, meaning that they may contain objects of different types. Python provides several predefined data types that can manage such collections. One of the most used structures is called list.

List creation

Lists are ordered collections of objects, possibly of different sorts. To create and access a list in Python we use square brackets. You can create an empty list by using a pair of square brackets with nothing inside, or create a list with contents by separating the values with commas inside the brackets:

    >>> emptyl = []    # creation of an empty list
    >>> emptyl
    []
    >>> gene = ['insulin', 3630, 333, 'Homo sapiens']
    >>> gene
    ['insulin', 3630, 333, 'Homo sapiens']

An empty list can also be generated using the function list():

    >>> m = list()    # creation of an empty list
    >>> m
    []

The length of a list m, i.e., the number of items the list contains, can be obtained by len(m).

    >>> len(gene)
    4

The important thing to remember is that lists are just sequences of objects and that each object in the list has a position.
    Position    Object
    0           'insulin'
    1           3630
    2           333
    3           'Homo sapiens'

In Python, the position counting always begins with the number 0! The objects are accessible using their position (i.e. using the index number) in the ordered collection, starting at position 0. Once you create a list, you can use the position to get any object you want from the list. All you need to do is put the position inside square brackets next to the variable name.

    >>> gene[2]    # select the third object of the list
    333
    >>> gene[1]    # the second object
    3630

Using a position equal to or larger than the length of the list is not correct:

    >>> gene[100]    # Get object one hundred and one of gene
    Traceback (most recent call last):
      File "<stdin>", line 1, in <module>
    IndexError: list index out of range

You can also replace an individual element of a list using its index number:

    >>> gene = ['insulin', 3630, 333, 'Homo sapiens']
    >>> gene
    ['insulin', 3630, 333, 'Homo sapiens']
    >>> gene[0] = 1
    >>> gene
    [1, 3630, 333, 'Homo sapiens']
    >>> len(gene)    # How many elements are there in list gene?
    4

The new value of an element may also be computed from its old value:

    >>> gene = ['insulin', 3630, 333, 'Homo sapiens']
    >>> gene[2]
    333
    >>> gene[2] = gene[2] + 23
    >>> gene
    ['insulin', 3630, 356, 'Homo sapiens']

Methods of lists

Given a list l the following methods can be applied to l. All of these methods operate on the list l itself: they do not return a new modified list, but modify l directly.

• l.append(x)
  Add an item to the end of the list.
• l.extend(L)
  Extend the list l by appending all the items in the given list L.
• l.insert(i, x)
  Insert item x at given position i. The first argument is the index of the element before which to insert, so l.insert(0, x) inserts at the front of the list, and l.insert(len(l), x) is equivalent to l.append(x).
• l.remove(x)
  Remove the first item from the list whose value is x. An error will be raised if the item is not in the list.
• l.pop([i])
  Remove the item at the given position in the list, and return it. If no index is specified, l.pop() returns the last item in the list. The item is also removed from the list.
• l.sort()
  Sort the items of the list, in place.
• l.reverse()
  Reverse the elements of the list, in place.

The following methods do not change the list l, but return an integer.

• l.index(x)
  Return the index in the list of the first item whose value is x. An error will be raised if the item is not in the list.
• l.count(x)
  Return the number of times x appears in the list.

An example that uses some of the list methods:

    >>> a = [66, 333, 333, 1, 1234]
    >>> a.insert(2, -1)
    >>> a.append(333)
    >>> a
    [66, 333, -1, 333, 1, 1234, 333]
    >>> a.count(333)
    3
    >>> a.index(333)
    1
    >>> a.remove(333)
    >>> a
    [66, -1, 333, 1, 1234, 333]
    >>> a.sort()
    >>> a
    [-1, 1, 66, 333, 333, 1234]

The latter clearly shows that the sort method changes the list a in place and does not return anything. Thus do NOT use

    >>> a=a.sort()
    >>> a

because then a no longer refers to a list (more formally, it refers to the Python object None) and the original list is lost. The same is true for the list method reverse.

List concatenation and repetition

Lists also have operators such as + and * for concatenation and repetition, respectively.

    >>> li = ['gene', 3630]
    >>> li = li + ['insulin', 333]
    >>> li
    ['gene', 3630, 'insulin', 333]
    >>> li = ['gene', 3630] * 3
    >>> li
    ['gene', 3630, 'gene', 3630, 'gene', 3630]

Negative indices

Above we have seen that when accessing objects, it is forbidden to access a position that does not exist, i.e., an index larger than or equal to the length of the list results in an error. A nice thing, however, is that you can also use negative numbers for indexing.
The last object of a list has the index -1, the last but one -2, etc.:

    Position    Object            Position
    0           'insulin'         -4
    1           3630              -3
    2           333               -2
    3           'Homo sapiens'    -1

So if m is the list [1, 'nr two', 5], then

    >>> m = [1, 'nr two', 5]
    >>> n = len(m)
    >>> n
    3
    >>> m[n-1]
    5
    >>> m[-1]
    5
    >>> m[0]
    1
    >>> m[-len(m)]
    1
    >>> m[n-len(m)]
    1

The range method

Python has several built-in functions for generating lists. One example, of which we show only a simple form here, is range(stop). The range function has one integer argument, called stop. It returns a built-in range object that can be converted to a list of plain integers, i.e., list(range(stop)) yields the list of plain integers [0, 1, ..., stop-1]. Later, in section 4.3, we will give a more extensive description of the range function, which also allows for start values different from zero and step sizes different from 1.

3.2 The for statement

The for-loop makes it possible to iterate over an ordered collection of objects and to execute the same sequence of statements for each element. Example:

    >>> str1 = 'Biomedical Engineering'
    >>> str2 = "Programming and genomics course 8CA10"
    >>> str3 = "Python is fun"
    >>> strlist = [str1, str2, str3]
    >>> for s in strlist:
    ...     print(s)
    ...     print(len(s))
    ...
    Biomedical Engineering
    22
    Programming and genomics course 8CA10
    37
    Python is fun
    13

The two print statements

    ...     print(s)
    ...     print(len(s))

form a so-called block, and both statements have four spaces of indentation. A block is a structural element of a program that is used to group instructions. All elements of the group have the same indentation, i.e., the same number of spaces in front of them. There is no absolute rule for the size of the indentation, but the standard and preferable style is to use four spaces for each level of indentation. The meaning of the for is that for each element in the list the two print statements are to be executed.
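The role of indentation can be made visible with a small sketch. The list sequences and the variable total are hypothetical example names chosen here; the indented statements form the block and are executed once per element, whereas the statement that is not indented is executed only once.

```python
sequences = ['GAATTC', 'ACGT', 'TTAGGC']    # hypothetical example data
total = 0
for s in sequences:
    # Both indented statements belong to the block:
    # they are executed once for every element of the list.
    print(s, len(s))
    total = total + len(s)
# Not indented, so not part of the block: executed once, after the loop.
print('Total length:', total)
```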
After finishing the last element of the list, the interpreter continues with the first statement after the block.

Another often used for construction applies the range function. The range method generates a special built-in range object that can deliver a sequence of integers to the for loop, which then iterates over those integers. Running

    print('A table of the first 11 integers and their squares')
    for i in range(11):
        print(i, i*i)
    print('End of table')

thus gives

    A table of the first 11 integers and their squares
    0 0
    1 1
    2 4
    3 9
    4 16
    5 25
    6 36
    7 49
    8 64
    9 81
    10 100
    End of table

The range method can be used to generate the same result as in the first example in this section by generating the indices of the list strlist, looping over those indices and using them to access the items in the list:

    >>> str1 = 'Biomedical Engineering'
    >>> str2 = "Programming and genomics course 8CA10"
    >>> str3 = "Python is fun"
    >>> strlist = [str1, str2, str3]
    >>> for i in range(len(strlist)):
    ...     s = strlist[i]
    ...     print(s)
    ...     print(len(s))
    ...
    Biomedical Engineering
    22
    Programming and genomics course 8CA10
    37
    Python is fun
    13

Though the code is now slightly longer, the advantage is that (contrary to the prior case) the index that is currently being processed (i) is known within the loop.

3.3 Modules and the import statement

All over the world many Python programs have been and are being designed. An installation of Python usually includes a large number of additional components that are collected in modules. To make use of them, Python has the import statement. An import statement imports a module, i.e., a piece of code contained in a file.

For instance, by importing the module math we have access to mathematical constants and functions such as pi, e, exp, log and sqrt.

    >>> import math

Access to its components such as variables and functions is obtained by using the construct modulename.varname

    >>> print(math.pi)
    3.141592653589793

A similar notation is employed when referring to a function from a module: first the name of the module, then a dot, '.', and finally the function name. In short the notation is modulename.functionname

    >>> math.exp(1)
    2.718281828459045
    >>> math.sqrt(3)
    1.7320508075688772

3.4 Exercises 7–12

Exercise 7:**
Show a programming fragment and its result for each of the following actions:

(a) Construct an empty list and assign it to the variable m.
(b) Add the element 7 to the list m.
(c) Print the length of m.
(d) Extend in two different ways the list m with [1, 2, 3, 1].
(e) Print the length of m.
(f) Change the third element of m to 4.
(g) Remove the first occurrence of 1 from m.
(h) Print the length of m.
(i) Remove the last element from m.
(j) Print the length of m.
(k) Show the contents of m[-1].
(l) Show the contents of m[len(m)-1].
(m) Show the contents of m[-len(m)].

Exercise 8:**
The aim of the exercises below is to get familiar with the range method and the for statement.

(a) Assign to variable l a list consisting of the first 20 integers, starting counting at zero.
(b) Print list(range(len(l))).
(c) Execute the following programming fragment and explain the output:

    for x in l:
        print(l)

(d) Execute the following programming fragment and explain the output:

    for x in l:
        print(x, 2*x)

(e) Execute the following programming fragment and explain the output:

    for x in l:
        print(x)

(f) Execute the following programming fragment and explain the output:

    for i in range(len(l)):
        print(l[i], 2*l[i])

(g) Execute the following programming fragment and explain the output:

    print("Start")
    for i in range(len(l)):
        print(l[-i])
        print("Finished")

(h) Adapt the programming fragment of (g) such that "Finished" is only printed once at the end.
(i) Print the elements of l in reverse order.

Exercise 9:**
(a) In order to predict the growth of a cell population that initially (at t=0) consists of ten cells and in which each cell replicates every hour while no cells die, we could write the following programming fragment:

    n=10
    for t in range[5]:
        print(t, n)
        n = 2*n
    print('After 6 hours the number of cells is', n)

This programming fragment, however, is not yet completely correct. Correct the errors in this fragment in such a way that the correct number of cells after 6 hours is printed (i.e., at t=6).

(b) Adapt the programming fragment of (a) taking into account that each cell still replicates every hour but that after each replication cycle 5 cells die. How many cells are there after 1 day (24 hours)?

Exercise 10:**
One of the libraries that comes with your Python installation is turtle. Given is the following Python fragment

    # import the turtle library
    import turtle

    d=100
    # Lift pen up
    turtle.up()
    # Move to the point with x and y coordinates -d/2 and d/2, respectively
    turtle.goto(-d/2,d/2)
    # Pull the pen down
    turtle.down()
    # Draw something
    for i in range(4):
        turtle.forward(d)
        turtle.right(90)
    # activate the window that pops up
    turtle.mainloop()

(a) Run the program. What happens?
(b) Write a Python fragment to generate a figure like in Fig. 3.1a
(c) Write a Python fragment to generate a figure like in Fig. 3.1b
(d) Write a Python fragment to generate a figure like in Fig. 3.1c

Full info on turtle can be found at https://docs.python.org/3.6/library/turtle.html

Exercise 11:**
Design a Python fragment that prints a triangle of stars ('*') with k stars as basis and k stars as height. k should be an integer parameter of the method and between two stars a space should be printed. For k=4 the output should look as follows:

    *
    * *
    * * *
    * * * *

Figure 3.1: Figures for turtle exercises.
Exercise 12:
In Exercise 10 the turtle library has been introduced. Write a Python fragment, using this library and nested for-loops, to draw four rows with five hexagons each, i.e., a figure like in Fig. 3.1d.

Chapter 4 Methods, slicing, random and plotting

4.1 Methods and invocations

In the previous chapters we have introduced the standard types integer, float, and string, as well as some operations, such as addition and multiplication, that can be applied to them. Moreover, lists have been defined as an example of a structured data type. When discussing operations on lists, we have, without putting emphasis on it, in fact shown the standard object-oriented notation for performing an action on an object, also called invoking a method. In a Python program we write such a method invocation by first writing the name of the object, followed by a period (called dot in computer jargon), followed by the method name and parentheses that may have arguments inside them. These arguments (possibly zero!) provide the information needed by the method in order to carry out its action. Examples of this notation applied to lists l, L, an element x, and integer i are:

    l.append(x)
    l.extend(L)
    l.insert(i, x)
    l.remove(x)

We have also introduced the pop method, but because it uses a special notation we give it additional attention below.

Optional arguments

If l is a list, we can remove the last element from the list by

    l.pop()

but in general we can also remove an element at another position in the list. If index i satisfies 0 ≤ i < len(l), then

    l.pop(i)

will remove the item at position i from the list.

Instead of having to give two separate definitions, they can be combined as follows

    l.pop([i])

The use of the square brackets means that the argument is optional. So the pop method may have zero or one argument.
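Both forms of the pop method can be tried on a small list. The sketch below uses a hypothetical list of five letters; the names last and second are chosen here for illustration only.

```python
l = ['a', 'b', 'c', 'd', 'e']    # hypothetical example list

last = l.pop()       # no argument: removes and returns the last item
print(last, l)

second = l.pop(1)    # one argument: removes and returns the item at index 1
print(second, l)
```

In both cases the method returns the removed item and changes the list l itself.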
In general the term for this construction is that a method may have optional arguments. This is one of the nice features of Python: only one definition is needed, and by analyzing the number of arguments, the system will select the correct behaviour. This construct of a varying number of arguments will be used at more places in these notes, for instance in the next section.

4.2 The dir and help methods

In the previous section and chapter a number of built-in methods on lists have been described. One can obtain a complete list of all methods of any object (thus also a list) by giving the dir([object]) command, where the square brackets indicate that the argument is optional:

• dir([object])
  Return an alphabetized list of names comprising (some of) the attributes of the given object, and of attributes reachable from it.

So dir([]) gives a complete enumeration of all methods that can be applied to lists. Additional information about these methods can be obtained by entering help([]), or for instance help([].pop) for specific information on the pop method (such as its optional argument).

• help([object])
  Enter the name of any object to get help on its usage.

4.3 The range method

One of the built-in functions that has already been mentioned is the range function. It returns a built-in range object, which is a representation of a regular series of integer numbers. It has one mandatory integer argument, called stop, and two optional ones, called start and step respectively:

    range([start,] stop[, step])

If the step argument is omitted, it defaults to 1. If the start argument is omitted, it defaults to 0. The full form returns a range object that resembles the list of plain integers [start, start + step, start + 2 * step, ...]. If step is positive, the last element is the largest start + i * step less than stop; if step is negative, the last element is the smallest start + i * step greater than stop. A value of zero is not allowed for step.
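The rule for the last element can be checked with a small sketch that builds the series start, start + step, start + 2 * step, ... by hand and compares it with what range delivers. The function range_elements is a hypothetical helper written for this illustration only; it relies on def and while, which these notes have not used so far.

```python
def range_elements(start, stop, step):
    """Build the list start, start + step, ... following the rule in the text."""
    if step == 0:
        raise ValueError('step must not be zero')
    result = []
    value = start
    # Positive step: keep values below stop.
    # Negative step: keep values above stop.
    while (step > 0 and value < stop) or (step < 0 and value > stop):
        result.append(value)
        value = value + step
    return result

# Compare the hand-built series with the built-in range for a few cases.
for start, stop, step in [(0, 11, 1), (1, 11, 3), (11, 2, -2), (5, 5, 1)]:
    by_hand = range_elements(start, stop, step)
    built_in = list(range(start, stop, step))
    print((start, stop, step), by_hand, by_hand == built_in)
```

Note that stop itself is never part of the series, whatever the sign of step.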
The built-in range object can be converted to a true list of integers using the list() function. Examples:

    >>> l=range(11)
    >>> print(l)
    range(0, 11)
    >>> print(list(l))
    [0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10]
    >>> l=range(1, 11)
    >>> list(l)
    [1, 2, 3, 4, 5, 6, 7, 8, 9, 10]
    >>> l=range(1, 11, 3)
    >>> m=list(l)
    >>> print(m)
    [1, 4, 7, 10]
    >>> print(l)
    range(1, 11, 3)
    >>> l=range(11, 2, -2)
    >>> list(l)
    [11, 9, 7, 5, 3]

4.4 Slicing

Apart from constructing new lists, it is quite common to select parts of a list. In particular slices, i.e., all elements of a list between a start and a stop index (a consecutive part of the list), occur frequently. Examples:

    >>> gene = ['insulin', 3630, 333, 'Homo sapiens']
    >>> gene[1:3]
    [3630, 333]
    >>> gene[0:4]
    ['insulin', 3630, 333, 'Homo sapiens']
    >>> gene[1:1]
    []
    >>> gene[1:len(gene)]
    [3630, 333, 'Homo sapiens']
    >>> gene[1:4]
    [3630, 333, 'Homo sapiens']

Slice indices have useful defaults; an omitted first index defaults to zero, an omitted second index defaults to the size of the list being sliced:

    >>> gene = ['insulin', 3630, 333, 'Homo sapiens']
    >>> gene[:4]    # the first 4 items
    ['insulin', 3630, 333, 'Homo sapiens']
    >>> gene[1:]    # everything except the first item
    [3630, 333, 'Homo sapiens']

In contrast to a single index i, which leads to an error unless -len(l) <= i < len(l), degenerate slice indices are handled gracefully: an index that is too large is replaced by the size of the list, and an upper bound smaller than the lower bound returns an empty list.

    >>> gene = ['insulin', 3630, 333, 'Homo sapiens']
    >>> gene[1:100]
    [3630, 333, 'Homo sapiens']
    >>> gene[3:1]
    []

Indices may be negative numbers, to start counting from the right.
For example:

    >>> gene = ['insulin', 3630, 333, 'Homo sapiens']
    >>> gene[-2:]    # The last two items
    [333, 'Homo sapiens']
    >>> gene[:-2]    # Everything except the last two items
    ['insulin', 3630]

As with range, a step value can be used in slicing:

    l[start:stop:step]

and the default value for step is 1. If we would like to select the list elements with an even index from index 10 onwards, we could do so with l[10::2]. Another useful application is to leave the start and end position open and use minus one as step value:

    l[::-1]

This generates a new list with all elements of l, but in reverse order.

4.5 Random numbers

In the programs we have seen so far, fixed values were assigned to variables. As a result, those programs produce exactly the same result each time you run them, unless you change those values. Python also has some modules to generate pseudo-random numbers, which make it possible to let each run behave differently. In following chapters we will encounter a number of applications.

To generate random numbers, one first needs to load the library:

    >>> import random

This library then makes it possible to generate a pseudo-random float between 0 and 1 (0 included, 1 excluded) using

    >>> random.random()
    0.11619275312381916

The random number reported above is different each time. A next call may thus for instance yield:

    >>> random.random()
    0.7106072247075303

The same library also provides the possibility to generate pseudo-random integers using randint(minval,maxval). This yields a random integer between and including the lower bound minval and the upper bound maxval. For example, with lower bound 1 and upper bound 6 this behaves like a virtual die:

    >>> random.randint(1,6)
    2
    >>> random.randint(1,6)
    6

4.6 Plotting data using matplotlib

In many cases it can be quite helpful to make a plot of a data set.
Python offers many such possibilities, by having plotting facilities and interfaces to several packages meant for mathematical data analysis. Very useful in this respect is matplotlib, a Python 2D plotting library which produces publication quality figures. In particular matplotlib has a module pyplot, which is a collection of command style functions that make matplotlib work like MATLAB. In order to use the library, one needs to start by importing it using

    import matplotlib.pyplot

If the data that is to be plotted is in a list l

    l = [1,4,9,16,25,36,49,64,81,100]

the plot is created by the command

    matplotlib.pyplot.plot(l, 'ro')

In most Python distributions, the plot will not be shown immediately, but will only be shown in a separate window on the screen after the command:

    matplotlib.pyplot.show()

In the Spyder environment, however, the default option is that the plot is immediately shown inline, i.e., in the Python console. To have figures shown in a separate window instead, one can type at the Python prompt

    %matplotlib auto

to change the behaviour for the current session, or, for a more permanent solution, change the default setting via the Tools menu, i.e., go to Tools/Preferences/IPython console/Graphics/ and change the Backend option from Inline to Automatic.

By the above plot command the values of l are plotted in the color red, the 'r', and with the line or marker style 'o', standing for the circle marker. The result is a plot as in Figure 4.1a. Other common colors are black ('k'), blue ('b'), green ('g'), yellow ('y'), magenta ('m') and cyan ('c'). Other common marker styles are stars ('*') and plusses ('+'), while instead of markers also lines could be used, e.g., solid lines ('-'), dashed lines ('--'), or dotted lines (':'). Also, markers and lines could be combined. For instance, 'g*-' yields green stars combined with solid lines.

The plot command has many more possibilities. If one also has a list m (e.g.
m = [0,25,50,75,100]) one can realise a plot in which both lists are plotted by:

    matplotlib.pyplot.plot(l, 'ro', m, 'b*-')

where we plot the contents of m in the color blue, the 'b', with stars, the '*', and as a solid line, the '-'. The result is a plot as in Figure 4.1b. In that plot labels have also been added next to the axes. This is obtained using

    matplotlib.pyplot.xlabel('my x-label')
    matplotlib.pyplot.ylabel('my y-label')

Figure 4.1: Example of plots with matplotlib.

In the plots so far, the values of the elements in the lists determined the vertical position of the data points. The horizontal positions were determined by their position (index) in the list. A more general version of the plot command uses two lists, as in:

    matplotlib.pyplot.plot(x, y, 'ro')

in which x and y have to be two lists of the same length. The values in x then determine the horizontal positions and those in y the vertical positions.

Here we only touched on some basic functionalities of pyplot. More details can be found on https://matplotlib.org/ and https://matplotlib.org/tutorials/introductory/pyplot.html.

4.7 Exercises 13–18

Exercise 13:
Let

    l = [5,3,1,8,5,9,3,8,5,8,5,0,4,6,5,9,7,6,8,10]

(a) Print the contents of the list l
(b) Print the contents of the list l such that both the index of the element in the list and the element itself are shown, one element per line, i.e.,

    0 5
    1 3
    2 1
    3 8
    . .

(c) As you could see, all values in the list are between 0 and 10 (both inclusive). Design a Python program that prints for all integers between 0 and 10 (both inclusive) the number of times the integer occurs in the list l

TIP: Think how you could do this using pen and paper. That is, start with an empty sheet with the numbers 0, 1, through 10 below each other and mark each occurrence when going through the list (so-called tallying).
Exercise 14:*
(a) For a list of scores l its average is given by the sum of its elements divided by the number of elements the list contains. For example, for l=[0,2,8,10] the average is (l[0]+l[1]+l[2]+l[3])/4, i.e. (0+2+8+10)/4=5. More generally, for a list l of arbitrary length n = len(l), the calculation of the average score can be written as

\[ \bar{l} = \frac{1}{n} \sum_{i=0}^{n-1} l[i] \]

The formula $\sum_{i=0}^{n-1} l[i]$ means: calculate the sum of l[0], l[1], ..., l[n-1]. So when n = len(l), all elements of l are summed.
Design a Python program that calculates and prints the average score. Apply this to the list l of the previous exercise.
(b) The variance $\sigma^2$ is defined by

\[ \sigma^2 = \frac{1}{n-1} \sum_{i=0}^{n-1} (l[i] - \bar{l})^2 . \]

Extend the program of (a) such that it also calculates and prints the variance of the scores.

Exercise 15: In this exercise we will make a plot of some data using matplotlib.
(a) Plot the data in the list l used in the previous exercises using the plot command from the matplotlib.pyplot library. When plotting the data, a plot style can be specified. What is the difference between the styles 'ro' and 'k*'?
(b) Construct a Python program that plots the contents of the list l and, in the same figure, shows a horizontal line at the mean of l.
(c) Extend the Python program of (b) such that it plots two additional horizontal lines indicating the standard deviation, i.e., one line at mean+variance**0.5 and one at mean-variance**0.5. Use cyan colored dashed lines using style 'c--', and add proper labels to the axes.

Exercise 16:** Construct the following lists, exclusively using the range and list functions:
(a) a list of the integers 0 ... 10, hence including both 0 and 10.
(b) a list of the integers 1 ... 10.
(c) a list of the integers 1 ... 20 that are multiples of 4.
(d) an increasing list of the integers −25 ... 20 that are multiples of 3.
(e) a decreasing list of the integers −25 ... 20 that are multiples of 3.

Exercise 17:* Let n be a multiple of 10 larger than 1000. Construct the following lists, only making use of the range and list functions and the concatenation operator:
(a) [n+10, n+9, .., 2, 1, 0, 0, 5, .., n+5, n+10]
(b) [n, n, (n-5), (n-10), .., 10, 5, 0, -5, -10, ..,-(n-5), -n, -n]
(c) [-n, -(n-1), .., -2, -1, 0, 0, 0, 5, 10, .., n-10]
(d) [n, -n, 0, 5, 10, .., n, -n, -(n+5)]

Exercise 18: In Exercise 10c a single star had to be drawn making use of the turtle library.
(a) Extend the solution of that exercise to draw 50 stars at random positions (e.g. x as well as y coordinates drawn randomly between -200 and 200).
(b) Like in (a), but now give each star a random color. The pen color can be changed using turtle.pencolor(r,g,b) where r, g, b are floats between 0 and 1, specifying the amount of red, green and blue, respectively. (0,0,0) corresponds to black and (1,1,1) to white.
(c) Run your solution of (b) multiple times.

Chapter 5 Selection methods and file/user input

5.1 Selection methods (if, elif and else)

When dealing with lists, one often encounters the problem that only those elements of a list that satisfy a certain condition are to be considered. Programming languages usually have a construct for that, called selection. Python has the selection method if, which we introduce by an example.

The if construct

Assume that we have a list l of integers and we want to select only the positive items from the list. Since there may be more than one such item, we choose to deliver all the positive items in a new list called posl. When deciding on the initial value for posl, we have to realize that it might even be the case that no item is positive, so the only decent initial value for posl is the empty list []. Since being positive is a property of a single item, we have to inspect each individual item from the list on this condition.
These considerations lead to the following programming fragment.

    posl=[]
    for x in l:
        if x>0:
            posl.append(x)

The construction we have used is more generally defined as

    if condition:
        block

and means that when the condition is satisfied all actions belonging to the block are to be performed.

The else construct

If we also want to collect the other items in another list, say negl, then a program is:

    posl=[]
    negl=[]
    for x in l:
        if x>0:
            posl.append(x)
        else:
            negl.append(x)

This should be read as: for each item, when x is positive, it will be appended to the list posl, and otherwise, it will be appended to the list negl. The general construct has the form

    if condition:
        block1
    else:
        block2

and is called an if-else construct. Its interpretation is: when the condition holds all actions belonging to block1 are executed, and otherwise all actions of block2 are executed.

The elif construct

There is an even more general form in which conditions are inspected one after the other. Again we demonstrate the construct first by an example. Assume that we not only want to separate the positive from the non-positive items, but also want to collect the elements that are zero separately. Along the same lines, and introducing a list zerol in which all zero items are to be put, a programming fragment is:

    posl=[]
    negl=[]
    zerol=[]
    for x in l:
        if x>0:
            posl.append(x)
        else:
            if x<0:
                negl.append(x)
            else:
                zerol.append(x)

Since such constructs occur on many occasions, Python even has a shorthand notation for it:

    posl=[]
    negl=[]
    zerol=[]
    for x in l:
        if x>0:
            posl.append(x)
        elif x<0:
            negl.append(x)
        else:
            zerol.append(x)

and the name for it is the elif construct. In general, more than one elif is allowed, and the else part is optional.

5.2 Conditionals and selection

The tests introduced in the previous section (such as x>0) are comparison operations on integers.
Such comparisons can hold, or not. In other words: they are either true or false. In Python such comparisons return a Boolean value. This type is named after the 19th century mathematician George Boole, who studied logic. The type has only 2 values: True and False.

    >>> 5>3
    True
    >>> 5<3
    False

Boolean is a subtype of integer, and Boolean values behave like the values 0 for False and 1 for True, respectively. Their role becomes clear when we have to design programs in which it has to be decided whether or not a block of statements is to be executed. Consider the following program fragment:

    >>> if nrexons > 1:
    ...     print("Alternative splicing might occur")

In this fragment we use the if statement: only if the condition, i.e., the boolean expression after the if but before the colon, evaluates to True, the block that follows is executed. If the condition does not hold, then the block is not executed.

Boolean expressions are statements (the technical term is propositions) that either hold or do not hold. In daily life we are quite used to propositions:

1. It is raining.
2. 2+2 equals 5.
3. This is a course that I like.
4. Today these notes were put on oncourse.

These are simple examples. It becomes more interesting when we make new propositions composed from old ones:

1. It is raining and it is cold.
2. It does not rain.
3. Today or tomorrow these notes are put on oncourse.

Boolean expressions are also frequently used in programming. In this section we therefore treat them in some more detail.

    Operation   Meaning
    <           strictly less than
    <=          less than or equal
    >           strictly greater than
    >=          greater than or equal
    ==          equal
    !=          not equal

Table 5.1: General comparison operations

5.2.1 False and True

The following values are considered by the interpreter to mean False:

    None
    0
    ""
    ()
    []
    {}
    False

Hence everything else is interpreted as True.
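As a small illustration of the list above, the built-in function bool can be used to check how the interpreter interprets a value in a condition. This is only a sketch: the variable names falsy_values, sequence and numbers are chosen here for the example and do not occur elsewhere in these notes.

```python
# The values listed above that the interpreter treats as False.
falsy_values = [None, 0, "", (), [], {}, False]

for value in falsy_values:
    # bool() shows how a value is interpreted when used as a condition
    print(repr(value), "->", bool(value))

# A non-empty string counts as True, so it can be used directly as a condition.
sequence = "ACGT"
if sequence:
    print("The sequence is not empty")

# Because Boolean is a subtype of integer (True behaves like 1, False like 0),
# comparisons can be summed. Here this counts the positive items of an example list.
numbers = [3, -1, 0, 7]
n_positive = sum(x > 0 for x in numbers)
print(n_positive)
```

The last lines rely on the remark made earlier in this section that Boolean values behave like the integers 0 and 1.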
5.2.2 Comparisons

The boolean expressions used so far are comparison operations on integers. Comparison operations are not only defined for integers; they are supported by all objects. In Table 5.1 the comparison operations are summarized. For floats the meaning is straightforward. For strings one has to be careful: Python is case sensitive, and uppercase letters are considered to be smaller than lowercase letters. String comparisons that are True are for instance:

    "aaa"=="aaa"
    "a"!="A"
    "aaa"<"baa"
    "aaa">"Baa"

Also for other objects (like lists), comparisons like < and > are defined, though their results may at first sight be unexpected. Comparisons can be chained arbitrarily; for example, x < y <= z is equivalent to x < y and y <= z. Meaningless comparisons (e.g. testing whether some string is larger than some integer) are not allowed, though such objects can be tested for equality:

    >>> "acg"<3
    TypeError: '<' not supported between instances of 'str' and 'int'
    >>> "acg"==3
    False
    >>> "acg"!=3
    True

On lists and strings an additional comparison operator in is present. On lists, this results in True (or False) when the requested element is present in the list (or not).

    >>> 4 in [0,5,3,7]
    False

On strings, this results in True (or False) when the requested substring is present in the string (or not).

    >>> 'ACT' in "AAAACTT"
    True

5.2.3 Boolean operations

As in daily-life expressions, most programming languages, including Python, have more concepts for constructing boolean expressions. There are 5 operators on boolean operands by which larger expressions can be composed.

• Equality: ==
• Inequality: !=
• Conjunction: and
• Disjunction: or
• Negation: not

In order to avoid misinterpretations it is advised to use parentheses around boolean expressions.

Conjunction (and)

    and     False   True
    False   False   False
    True    False   True

When a and b are two boolean expressions, then a and b is only True when both a and b are True.
    a       b       a and b
    False   False   False
    True    False   False
    False   True    False
    True    True    True

Equality (==)

    ==      False   True
    False   True    False
    True    False   True

When a and b are two boolean expressions, then a==b is only True when a and b have the same value.

    a       b       a == b
    False   False   True
    True    False   False
    False   True    False
    True    True    True

Inequality (!=)

    !=      False   True
    False   False   True
    True    True    False

When a and b are two boolean expressions, then a!=b is only True when a and b have different values.

    a       b       a != b
    False   False   False
    True    False   True
    False   True    True
    True    True    False

Disjunction (or)

When you say that a message is meant for students of mechanical or biomedical engineering, then it also applies to a student doing both mechanical and biomedical engineering. The word or thus means either one or both.

    or      False   True
    False   False   True
    True    True    True

    a       b       a or b
    False   False   False
    True    False   True
    False   True    True
    True    True    True

Negation (not)

The negation has the meaning of not and is a unary operator.

    a       not a
    False   True
    True    False

not a is thus True when a is False, and the other way around.

This concludes the description of the boolean expressions. In the programs that we will design in the next chapters examples of their use will be shown.

5.3 User input (input)

When a program is run in an interactive session and the program needs some input, the user should somehow be asked to enter the input. To that end Python has the function input():

    input([prompt])

If the prompt argument is present, it is written to standard output without a trailing newline. The function then reads a line from input (at the Python Shell), converts it to a string, and returns that.
Example:

    myname = input("Enter your name: ")

If you respond with Peter Hilbers in the window with the command prompt, i.e., in your window appears

    Enter your name: Peter Hilbers

the result is that the variable myname gets the value "Peter Hilbers". If numeric values are required as input instead of strings, the string (s) read can always be converted into an integer or a float using the int(s) and float(s) commands, respectively.

5.4 Reading from files

Often data needs to be read from a file. To that end Python has several facilities to handle file input.

The open method

    infile = open(filename)

opens the file with name filename for reading and returns a new file object. Example:

    >>> infile = open("sequences.seq")
    >>> print(infile)
    <_io.TextIOWrapper name='sequences.seq' mode='r' encoding='UTF-8'>

A file cannot be displayed like a number or a string. It does, however, have methods for working with the data in the file.

The readline method

    infile.readline()

readline returns the next line from the file object infile. Example:

    >>> infile = open("sequences.seq")
    >>> infile.readline()
    'CCTACAACAATTCAATAAAATAGCTTCGCGCTAA\n'

Note that the line read includes the end-of-line character (\n). A Python fragment to read the first two lines from a file is:

    >>> infile = open("sequences.seq")
    >>> infile.readline()
    'CCTACAACAATTCAATAAAATAGCTTCGCGCTAA\n'
    >>> infile.readline()
    'CCACGTGAACCGTTGTAACTATGTTCTGTGC\n'

When there are no more lines, readline returns the empty string. So if the file only has two lines, then the following output is produced:

    >>> infile = open("sequences.seq")
    >>> infile.readline()
    'CCTACAACAATTCAATAAAATAGCTTCGCGCTAA\n'
    >>> infile.readline()
    'CCACGTGAACCGTTGTAACTATGTTCTGTGC\n'
    >>> infile.readline()
    ''

If the newline should not be included in the line, the rstrip() method can be used.
This method removes all white space (including newline characters) from the end of the string it is applied to. Example:

    >>> infile = open("sequences.seq")
    >>> s=infile.readline()
    >>> s
    'CCTACAACAATTCAATAAAATAGCTTCGCGCTAA\n'
    >>> r=s.rstrip()
    >>> r
    'CCTACAACAATTCAATAAAATAGCTTCGCGCTAA'

The readlines method

    f.readlines()

Given a file object f (for instance returned by open), f.readlines() reads lines, as readline() does, until the end of the file and returns a list containing the lines thus read.

    >>> infile = open("sequences.seq")
    >>> infile.readlines()
    ['CCTACAACAATTCAATAAAATAGCTTCGCGCTAA\n', 'CCACGTGAACCGTTGTAACTATGTTCTGTGC\n']

The read method

Instead of the readlines method we could also have used

    s=infile.read()

in which the method read is applied to the object infile, with as result that one string is returned containing all characters the file consists of.

    >>> infile = open("sequences.seq")
    >>> infile.read()
    'CCTACAACAATTCAATAAAATAGCTTCGCGCTAA\nCCACGTGAACCGTTGTAACTATGTTCTGTGC\n'

The close method

We have not dealt with it before, since a file is automatically closed when a program has finished, but there is also a method by which a file is closed explicitly. In fact it is strongly encouraged, and considered good programming practice, to 'dismiss' objects that are no longer of concern. To close a file, Python has the close method:

    infile.close()

Reading and processing of a file could thus look like

    infile = open("sequences.seq")
    lines = infile.readlines()
    infile.close()
    for line in lines:
        print(len(line))

After the close(), the file itself is closed and no longer accessible, though its contents are still present in the variable lines.

Some elementary string methods

In the next chapter we will consider strings in much more detail. However, when reading and processing text files a number of methods are already very useful.
We will introduce these here via an example, where s is a string containing one line of text (that might have been read from a file).

    >>> s = 'Jan is 23 years old\n'
    >>> s.rstrip()
    'Jan is 23 years old'
    >>> s.find('is')
    4
    >>> m = s.split()
    >>> m
    ['Jan', 'is', '23', 'years', 'old']
    >>> m[2]
    '23'
    >>> int(m[2])
    23
    >>> float(m[2])
    23.0

That is, rstrip removes white space (in this case the newline character) from the end of the string, find returns the index of the first occurrence of the substring given as argument, split returns a list with the substrings that result when the string is split on white space, and int and float convert a string into an integer and a floating point number, respectively.

5.5 Exercises 19–24

Exercise 19:** Below an imperfect Python fragment is given. It should ask the user for an integer number between 0 and 10, including those boundaries. Correct the code such that it thanks the user if he/she does so, and gives its opinion on the user otherwise.

    s = input("Give a value between 0 and 10: ")
    value = int(s)
    if (value > 0) and (value < 10)
        print('Thank you')
    print('You fool!')

Exercise 20:** For this, and many of the exercises to come, data files are necessary. These files are available on the Canvas website and should be downloaded to the same folder where you store the Python programs (.py files) you write. Files can then be opened in your Python programs by just specifying the file name (e.g. "sequences.seq"). This works when Python's working directory is the folder of the Python file you are running (which is typically the case when you run the file from your editor), since Python searches for the file inside the working directory. If you want to open files from another folder on your hard disk, that is also possible by specifying the full path to the file (for instance "D:\\courses\\8CA10\\sequences.seq"). (Note that the use of the double backslashes is required.)
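The remark on double backslashes can be illustrated with a short sketch. It only builds and compares the path strings and does not open any file; the path is the example path from the text, and the names path_escaped and path_raw are chosen here for illustration.

```python
# In an ordinary string literal, a backslash starts an escape sequence,
# so every backslash in a Windows path has to be written twice.
path_escaped = "D:\\courses\\8CA10\\sequences.seq"

# In a raw string (prefix r), backslashes are taken literally,
# so the same path can be written with single backslashes.
path_raw = r"D:\courses\8CA10\sequences.seq"

# Both literals denote exactly the same string.
print(path_escaped)
print(path_escaped == path_raw)
```

Either form can be passed to open in the same way as the plain file name "sequences.seq".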
(a) Write a Python program that reads the file sequences.seq, and then for all lines prints out the line number (starting with 1) followed by the line itself. Make sure that no empty lines are present between the lines printed from the file. The output should be like

    1 CCTGTATTAGCAGCAGATTCGATTAGCTTTACAACAATTCAATAAAATAGCTTCGCGCTAA
    2 ATTTTTAACTTTTCTCTGTCGTCGCACAATCGACTTTCTCTGTTTTCTTGGGTTTACCGGAA
    3 TTGTTTCTGCTGCGATGAGGTATTGCTCGTCAGCCTGAGGCTGAAAATAAAATCCGTGGT
    4 CACACCCAATAAGTTAGAGAGAGTACTTTGACTTGGAGCTGGAGGAATTTGACATAGTCGAT
    5 TCTTCTCCAAGACGCATCCACGTGAACCGTTGTAACTATGTTCTGTGC

(b) Adapt your solution for (a) such that for all lines, it prints out the line number (starting with 1) and then that part of the line that starts with TT and runs up to and including the first subsequent occurrence of AA. (Each line of the file sequences.seq contains a TT and a subsequent AA, and the position of the first occurrence of a substring substr in a string s can be obtained using pos=s.find(substr).)

Exercise 21: Open the two files seq1.seq and seq2.seq. The files have the same number of lines. Write a Python program that repeatedly reads a line from file seq1.seq, prints the line to output, then reads a line of file seq2.seq, prints it, etc.

Exercise 22:* Given is a text file BMIs.txt with on each line three fields separated by tabs. The first field is the name of a person, the second field his/her weight, and the third field his/her length (height).
(a) Design a Python fragment that reads the file BMIs.txt and stores its contents in three lists, named names, weights and lengths, respectively. The list names may contain strings, while the other two lists should contain floats instead of strings. The strings read from file may be converted using the command float(s), which converts a string s into a float.
(b) Print for each of the persons his/her name, weight, length and BMI, where BMI is defined as weight divided by the square of the length.
(c) Print for each of the persons his/her name and whether this person is, according to the World Health Organization, 'Underweight' (BMI below 18.5), 'Healthy weight' (BMI between 18.5 and 25), 'Overweight' (BMI between 25 and 30) or 'Obese' (BMI above 30). Output should look like

    Peter is obese.
    Esther is healthy weight
    ...

Exercise 23: The purpose of this exercise is to design a Python program by which the positive and the non-positive elements of a list of integers are plotted (using Matplotlib) in 2 different colors.
(a) Write a Python fragment that reads the file intl.txt and stores the integer numbers present in that file in a list.
(b) Write a Python fragment that plots the positive elements in green and the non-positive elements in red. When the program is run, a plot like the one depicted in Figure 5.1 should be drawn.

Figure 5.1: The requested output of exercise 23 where positive and nonpositive elements are shown in different colors.

Exercise 24: Let l be a sorted list with integer numbers:
(a) Design a Python program that prints only the numbers occurring multiple times in the list l, where each of those numbers is only printed once.
(b) Design a Python program that prints the number that occurs most frequently in the list l. In case multiple such numbers exist, only the smallest one of those should be printed.

Chapter 6 Bioinformatics and strings

Our interest is in computer algorithms by which we can analyse genomic information. Since this genomic information is usually represented in the form of strings, our programming language should have facilities for string manipulation. We have already seen that Python has a standard type str for strings.
In this chapter we start with some biological background about the translation of a DNA sequence into a protein. From it we derive what other string manipulations are needed to investigate this translation process, and we show the kind of Python statements that are available to implement these operations.

6.1 From DNA to RNA to protein

Transcription of DNA

The main purpose of DNA, in short, is to function as a template from which the single-stranded nucleic acid called RNA (ribonucleic acid) can be transcribed. RNA is in turn translated into the amino acid sequences for all proteins the organism needs. In eukaryotes, the double-stranded DNA is 'read' in the nucleus by an enzyme called RNA polymerase. The function of this enzyme is to open the double helix to expose a small part of single-stranded DNA sequence, and to transcribe this sequence into an RNA strand. RNA is very similar to DNA in that it is also a long chain of nucleotides. However, there are a few differences: the sugar molecule in RNA is ribose instead of deoxyribose, and it contains the base uracil (U) instead of thymine (T). Like thymine, uracil forms base pairs with adenine.

Like replication, transcription of DNA into RNA is done by complementary base pairing. One of the DNA strands acts as a template for the polymerase to link nucleotides to a growing RNA chain. The RNA strand is then complementary to the template DNA strand. Thus it is a copy of the other DNA strand, the coding strand, except for the thymines being replaced by uracil, see Figure 6.1.

The start of transcription is initiated by the binding of certain transcription factors to regions upstream of a gene. Transcription factors are proteins that recognize and bind to specific DNA sequences that are called promoters. The difference in presence of the transcription factors is one of the means by which specific cell types regulate transcription rates of certain genes. A gene usually has a few promoter regions.
One of those regions is typically found around 30 bp upstream of the start site of transcription. It is termed the TATA box because of its high content in T and A nucleotides. The transcription factors that bind to the TATA box help the polymerase to position itself at the start site, and assist in unwinding the DNA locally to facilitate the start of transcription.

    5'... A C G T C G C G C A G T A C A T G ... 3'   coding strand
          | | | | | | | | | | | | | | | | |
    3'... T G C A G C G C G T C A T G T A C ... 5'   template strand

    5'... A C G U C G C G C A G U A C A U G ... 3'   RNA

Figure 6.1: DNA codes for RNA.

After DNA has been transcribed, the single-stranded RNA, then called the transcript, undergoes some post-processing. The transcript is longer than needed for protein synthesis. It contains large regions that do not code for amino acids. Such regions are called introns, while the regions coding for protein are called exons. Before the RNA strand leaves the cell's nucleus, the introns are separated from the exons (the regions that are expressed) and removed by a splicing process. In eukaryotes the exons are only a small fraction of the transcribed DNA. RNA molecules that are transcribed as a code for an amino acid sequence are called messenger RNA, or mRNA.

Figure 6.2: Schematic view of genic regions. A gene always starts and ends with an exon; this can be an untranslated region (UTR) or a coding sequence (CDS). A UTR itself can have introns, as shown here in the 5' UTR. Only the exons are found in the mRNA.

Translation of RNA

The proteins of most living organisms are built from only 20 different amino acids. Somehow the mRNA sequence, made up of only four different nucleotides (A, U, C, and G), codes for the arrangement of these 20 amino acids. It can be easily deduced that with a four-letter alphabet, three-letter words suffice to code for all amino acids.
Two-letter words give only $4^2 = 16$ different codes, which is too few. In three-letter words there are $4^3 = 64$ different codes, which is more than enough. Each amino acid can therefore be specified by more than one of these 'words', which are named codons. Some codons are reserved to code for a stop sign, which indicates that translation should be terminated; they do not code for any amino acid. The codon for methionine (AUG) is also a sign for the start of a protein coding region. The complete code for all 20 amino acids is given in Table 6.1.

Table 6.1: Genetic Code.

                        2nd position
    1st position    U      C      A      G      3rd position
    U               Phe    Ser    Tyr    Cys    U
                    Phe    Ser    Tyr    Cys    C
                    Leu    Ser    STOP   STOP   A
                    Leu    Ser    STOP   Trp    G
    C               Leu    Pro    His    Arg    U
                    Leu    Pro    His    Arg    C
                    Leu    Pro    Gln    Arg    A
                    Leu    Pro    Gln    Arg    G
    A               Ile    Thr    Asn    Ser    U
                    Ile    Thr    Asn    Ser    C
                    Ile    Thr    Lys    Arg    A
                    Met    Thr    Lys    Arg    G
    G               Val    Ala    Asp    Gly    U
                    Val    Ala    Asp    Gly    C
                    Val    Ala    Glu    Gly    A
                    Val    Ala    Glu    Gly    G

Because codons consist of 3 letters, each RNA sequence has three possible reading frames. Not all possible reading frames lead to a product. This is usually because reading frames either lack the sequences that initiate translation or have an abundance of stop codons that terminate translation before a functional protein is formed.

The portion of mRNA that is on the 5' end, upstream of the start codon, is called the 5' untranslated region (UTR). The portion of mRNA that is 3' downstream of the stop codon is the 3' UTR. See Figure 6.2 for an illustration of the general layout of a genic region.

Genes

A part of DNA that codes for the production of a single protein is called a gene. Definitions of what part of the sequence is actually the gene can differ, however. Some hold the opinion that the promoter and other upstream transcription-factor binding sequences should also be considered as part of the gene.
Generally, introns are seen as part of a gene, but they are not part of the coding DNA, since they are not translated into amino acid sequence. The amount of coding DNA is thus smaller than the amount of genic DNA. To give an idea: about 98.5% of the human DNA is non-coding DNA [3, p. 202]. The bulk of DNA that serves no obvious purpose, such as most DNA within introns and most intergenic DNA, has long been labeled 'junk DNA'. This term is misleading. Recent genomic research has led to the belief that some biological function is associated with some of these regions. Therefore the more neutral term non-coding DNA is preferred these days.

On genomes, sequences can be found that resemble known genes but cannot be translated into functional proteins. These are called pseudo genes. A pseudo gene is sometimes described as a non-functional member of a gene family. It is believed that pseudo genes are derived from ancestral active genes, and have lost their function through mutations, often by the gain of internal stop codons.

A single region in DNA can code for more than one protein by a process called alternative splicing. By excluding some exons a different amino acid pattern can emerge, and therefore a different protein can be constructed. This is illustrated in Figure 6.3.

Figure 6.3: Alternative splicing. Same example gene as in Figure 6.2, but now four different mRNA products are synthesized from the same gene. Only the darker part is translated into protein.

Functionality of genes can be regulated by methylation. Methylation is the cell's method of turning off certain genes. Every type of cell has its own methylation pattern so that a unique set of proteins is expressed to perform specific functions for that cell type. In vertebrates methylation usually occurs on cytosine at CpG sites, sites where cytosine is followed directly by guanine.
Deamination of methylated cytosine changes it to thymine, which is a mutation that cannot be efficiently repaired. Thus over evolutionary time scales the methylated CG sequence will be converted to TG, which explains the deficiency of CG sequences in inactive genes. CpG islands are short stretches of DNA in which the frequency of CG sequences is higher than in other regions. They are usually found around promoters of so-called housekeeping genes, which are essential for general cell functions, or of other genes that are frequently expressed in a cell.

6.2 DNA, RNA and Python strings

From the short survey about DNA, RNA and proteins we deduce that in order to interpret genomic data we at least need the following operations on a string.

• a replacement operation: If we have a DNA string and we want to turn it into RNA, then all occurrences of T have to be replaced by U. In the translation process from RNA to protein a three-letter word is replaced by a single amino acid letter.
• substring: If we are interested in the amino acid sequence a protein is composed of, we need to extract the coding parts of mRNA from the noncoding ones. Exons are consecutive subparts of the original DNA string, so an operation is needed by which a subsequence of letters can be obtained.
• concatenation: The DNA sequence that codes for a protein usually consists of several exons. These exons are glued together to form the sequence that codes for the complete amino acid sequence the protein consists of.

Python indeed has such operations. As an example of how to use these operations we consider the DNA sequence of a specific gene, namely "INS", that has insulin as its product. In the Computational Biology group of the BioMedical Engineering department we are doing research on the metabolic syndrome. In particular we are interested in the disease Diabetes Mellitus.
Patients with type 2 diabetes mellitus have relatively low insulin production or insulin resistance or both. A non-trivial fraction of type 2 diabetics eventually require insulin administration when other medications become inadequate in controlling blood glucose levels. Understanding which processes are responsible for glucose control, what is malfunctioning, how to prevent it and how to medicate are areas of our research. A metabolic network describing several processes involved in the insulin pathway is shown in Figure 6.4. In this figure the genes are given in grey/green boxes with the name of the gene on the box. In the top part the box with "INS" can be found.

Information about a gene can be found in general bioinformatics sources. One of the best sites on the web for bioinformatics is the National Center for Biotechnology Information (NCBI) (http://www.ncbi.nlm.nih.gov/). NCBI provides an integrated approach to the use of gene and protein sequence information, the scientific literature (MEDLINE), molecular structures, and related resources, in biomedicine. It offers a dedicated search engine called Entrez. If you know that the gene identification for "INS" is 3630, then entering the following query:

    http://www.ncbi.nlm.nih.gov/entrez/query.fcgi?db=gene&cmd=Retrieve&dopt=full_report&list_uids=3630

displays information about the insulin gene. Both the mRNA and the genomic information are part of the information returned. Here we show some parts of it.
    mRNA            join(1..42,222..425,1213..1431)
                    /gene="INS"
    CDS             join(239..425,1213..1358)
    ORIGIN
            1 agccctccag gacaggctgc atcagaagag gccatcaagc aggtctgttc caagggcctt
           61 tgcgtcaggt gggctcagga ttccagggtg gctggacccc aggccccagc tctgcagcag
          121 ggaggacgtg gctgggctcg tgaagcatgt gggggtgagc ccaggggccc caaggcaggg
          181 cacctggcct tcagcctgcc tcagccctgc ctgtctccca gatcactgtc cttctgccat
          241 ggccctgtgg atgcgcctcc tgcccctgct ggcgctgctg gccctctggg gacctgaccc
          301 agccgcagcc tttgtgaacc aacacctgtg cggctcacac ctggtggaag ctctctacct
          361 agtgtgcggg gaacgaggct tcttctacac acccaagacc cgccgggagg cagaggacct
          421 gcagggtgag ccaactgccc attgctgccc ctggccgccc ccagccaccc cctgctcctg
          481 gcgctcccac ccagcatggg cagaaggggg caggaggctg ccacccagca gggggtcagg
          541 tgcacttttt taaaaagaag ttctcttggt cacgtcctaa aagtgaccag ctccctgtgg
          601 cccagtcaga atctcagcct gaggacggtg ttggcttcgg cagccccgag atacatcaga

[Figure 6.4: Schematic view of the insulin signalling pathway. The gene INS is used in the examples below (http://www.genome.jp).]
          661 gggtgggcac gctcctccct ccactcgccc ctcaaacaaa tgccccgcag cccatttctc
          721 caccctcatt tgatgaccgc agattcaagt gttttgttaa gtaaagtcct gggtgacctg
          781 gggtcacagg gtgccccacg ctgcctgcct ctgggcgaac accccatcac gcccggagga
          841 gggcgtggct gcctgcctga gtgggccaga cccctgtcgc caggcctcac ggcagctcca
          901 tagtcaggag atggggaaga tgctggggac aggccctggg gagaagtact gggatcacct
          961 gttcaggctc ccactgtgac gctgccccgg ggcgggggaa ggaggtggga catgtgggcg
         1021 ttggggcctg taggtccaca cccagtgtgg gtgaccctcc ctctaacctg ggtccagccc
         1081 ggctggagat gggtgggagt gcgacctagg gctggcgggc aggcgggcac tgtgtctccc
         1141 tgactgtgtc ctcctgtgtc cctctgcctc gccgctgttc cggaacctgc tctgcgcggc
         1201 acgtcctggc agtggggcag gtggagctgg gcgggggccc tggtgcaggc agcctgcagc
         1261 ccttggccct ggaggggtcc ctgcagaagc gtggcattgt ggaacaatgc tgtaccagca
         1321 tctgctccct ctaccagctg gagaactact gcaactagac gcagcccgca ggcagcccca
         1381

In the first line it is stated that the gene “INS” is built out of 3 parts of the DNA sequence: the first 42 bases (1..42), followed by the bases at positions 222 through 425 (222..425), and finally those at positions 1213 through 1431 (1213..1431). Next, the CDS line shows which parts code for the protein. This gene information is going to be used in the examples and in the exercises.

6.3 Operations on strings

Handling text is a recurring theme in many areas, including biomedical applications. One of the strengths of Python is that it already offers a large number of predefined operations for strings. Since Python is object-oriented and also offers functional programming tools, both styles are used for these operations. An example of the functional style applied to a string object is the built-in function that returns the length of the string.

• len(s) When s is a string, the length of the string can be obtained by: len(s).
  Examples:

    >>> len("abc")
    3
    >>> len("This is a long string")
    21

There also exists a string of length 0, called the empty string. It is denoted by '' (or "").

Below we introduce other operations on strings that use the object-oriented style of objectname.methodname(arguments). We use 'Biomedical engineering \n ' as the string object the actions have to be performed on, but any other string could have been used. Note that "\n" occurs in this string. It denotes a newline character. In general, a backslash \ in front of a character in a string indicates that the normal meaning of that character (in this case the n) should not be used, but a special meaning instead (in this case the newline character). Another example of a special character is \t, which is the tab character.

Examples of string methods:

• s.count(sub) Return the number of non-overlapping occurrences of substring sub in string s.
  Examples:
  – After
        s = 'Biomedical engineering \n '
        n = s.count("e")
    variable n has the value 4, since there are 4 e's in s, while
  – 'Biomedical engineering \n '.count("me") returns the value 1.

• s.upper() Return a copy of the string s converted to uppercase.
  Example: 'Biomedical engineering \n '.upper() returns 'BIOMEDICAL ENGINEERING \n '.
  Analogously, s.lower() returns a copy of the string s converted to lowercase.

• s.rstrip() Return a copy of the string s without the trailing whitespace characters (the characters space, tab, linefeed, return, formfeed, and vertical tab). Analogously, s.lstrip() returns a copy of s without the whitespace at the front, and s.strip() returns a copy of s without the whitespace at both the front and the end.
  Example: 'Biomedical engineering \n '.rstrip() returns 'Biomedical engineering'.

• s.find(sub) Return the lowest index in the string s where substring sub is found.
  Return -1 if sub is not found.
  Examples:
  – 'Biomedical engineering \n '.find("B") returns 0 (counting starts at zero!)
  – 'Biomedical engineering \n '.find("me") returns 3.
  – 'Biomedical engineering \n '.find("p") returns -1.
  Analogously, s.rfind(sub) returns the highest index (i.e., the starting index of the first occurrence when searching the string backwards).
  A second argument can be provided to the find method. In that case the search for the substring starts at that index. Thus, for instance, 'Biomedical engineering \n '.find("e") returns 4, while 'Biomedical engineering \n '.find("e",8) returns 11.

• s.split([sep]) Return a list of the words in the string s, using the optional sep argument as the delimiter string. The sep argument may consist of multiple characters (for example, '1, 2, 3'.split(', ') returns ['1', '2', '3']). Splitting an empty string with a specified separator returns a list containing a single empty string, [''].
  If sep is not specified (as we have already seen briefly in the previous chapter) or is None, a different splitting algorithm is applied. First, whitespace characters (spaces, tabs, newlines, returns, and formfeeds) are stripped from both ends. Then the string is split at runs of whitespace characters of arbitrary length. Splitting an empty string or a string consisting of just whitespace in this way returns an empty list.
  Examples:
  – print("0123124".split("12")) prints ['0', '3', '4'] on the screen.
  – 'Biomedical engineering \n '.split() returns ['Biomedical', 'engineering'].

The string method replace

In Chapter 2 we have already introduced strings. When strings are long and occupy several lines, we should use the triple quote notation.
So if we take for example the first two lines of the DNA sequence of the insulin gene, we have

    dnaIns="""gctgcatcagaagaggccatcaagcaggtctgttccaagggcctttgcgtcaggtgggct
    caggattccagggtggctggaccccaggccccagctctgcagcagggaggacgtggctgg"""

where we have left out the spaces that were added to guide the reading of the sequence. Since the string occupies more than one line, it contains a newline character ("\n"). The task is to remove this newline from the sequence. As usual there are many solutions to this problem and there is no general recipe for finding the best one. In this case we know that dnaIns is the name of a variable having a string value. We want to remove the "\n", that is, replace the newline by the empty string, and fortunately Python has such a method. If s is a string object,

    s.replace(old, new) -> string

returns a copy of string s with all occurrences of substring old replaced by new. So far our methods have had either one or no argument. Many methods, however, need more than one argument, and replace is such an example. Notice that the order in which the arguments are given matters.

Examples:

• s = 'acgtaa\ngg'.replace("\n", "")
  After execution of this statement s has the value 'acgtaagg'.
• mRNAs = 'acgtaagg'.replace("t", "u")
  Now mRNAs has the value 'acguaagg', and after
• mRNAs = 'acgtaagg'.replace("gt", "tgg")
  mRNAs gets the value 'actggaagg'.

So converting a DNA string to its RNA sequence is straightforward.

Strings are immutable

Integer and string values are fixed once they have been created. This means that they cannot be changed. The technical term for such a property is immutability. So strings are immutable in Python. This means that you can change neither single characters nor substrings of an existing string. You always have to create a new string.
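Creating a new string instead of changing an existing one can be sketched as follows. This is only an illustration: the function name replace_at and the variable names are assumptions that do not occur elsewhere in these notes. The new string is built from two slices of the old string and the new character, glued together by concatenation.

```python
def replace_at(s, i, new):
    """Return a copy of s in which the character at index i is replaced by new.

    Hypothetical helper for illustration only; s itself is left unchanged.
    """
    # part before index i + new character + part after index i
    return s[:i] + new + s[i+1:]

dna = 'acgtaagg'
rna = replace_at(dna, 3, 'u')   # index 3 holds the 't'
print(dna)   # the original string still has its old value
print(rna)   # a new string with one character replaced
```

Since the function returns a new string, the result has to be assigned to a variable (possibly the same one) to keep it.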
Assigning to an indexed position in the string results in an error:

    >>> motif = 'GAATTC'
    >>> motif[0] = 'x'
    Traceback (most recent call last):
      File "<stdin>", line 1, in <module>
    TypeError: 'str' object does not support item assignment
    >>> motif[:1] = 'at'
    Traceback (most recent call last):
      File "<stdin>", line 1, in <module>
    TypeError: 'str' object does not support item assignment

Multiplication of strings

Next to string concatenation, Python has another construct for creating new strings: the repetition operator. It has two operands, a string and an integer, and the result is a string consisting of that number of copies of the string:

    >>> polyA='A'*25
    >>> polyA
    'AAAAAAAAAAAAAAAAAAAAAAAAA'

For convenience we summarize the string methods, operators and built-in functions that are often used.

    Method, Operator, Function   Description
    s + t                        Concatenation
    s * n                        Repetition
    len(s)                       Return the length of s
    s[i]                         i-th character of s, counting starts at 0
    s[i:j]                       Slice of s from index i up to, but not including, index j
    s.count(sub)                 Return the number of non-overlapping occurrences of sub in s
    s.lower()                    Return a copy of s converted to lowercase
    s.upper()                    Return a copy of s converted to uppercase
    s.rstrip()                   Return a copy of s with trailing whitespace characters removed
    s.lstrip()                   Return a copy of s with whitespace characters at the front removed
    s.strip()                    Return a copy of s with whitespace characters at both front and end removed
    s.find(sub)                  Return the lowest index in s where sub is found (-1 if not found)
    s.rfind(sub)                 Return the highest index in s where sub is found (-1 if not found)
    s.replace(old,new)           Return a copy of s with all occurrences of substring old replaced by new
    s.split([sep])               Return a list of the words in s, using the optional sep argument as the delimiter string

In Chapter 4 the dir and help commands were already introduced. These can be applied to any object, thus also to strings. So dir("abc") gives a complete enumeration of all (more than 50!!)
methods that can be applied to strings. Additional information about these methods can be obtained by entering help(str).

6.4 Converting lists and strings

Lists are mutable, that is, their contents can be changed, which makes them more general than strings. It can be useful to create a list from a string by using the built-in function list:

• list(s) Return a list whose items are the same and in the same order as the characters in the string s.
  Example:

    >>> list('Gene')
    ['G', 'e', 'n', 'e']

It is also possible to do the reverse operation: from a list of strings we can make one string by using the join operation. It has two operands, the list of string elements to be joined and the separator between the elements.

• sep.join(seq) Return a string which is the concatenation of the strings in the sequence seq. The separator between the elements is the string sep on which this method is called.
  Example: A string with the elements separated by asterisks, and a printout with each element on its own line, can be produced by:

    >>> "*".join(['A', 'CC', 'GGG', 'T'])
    'A*CC*GGG*T'
    >>> print("\n".join(['A', 'CC', 'GGG', 'T']))
    A
    CC
    GGG
    T

6.5 Writing data to file

All results obtained so far have been presented directly on the screen. For reporting, and also when large amounts of data are produced, it is more convenient to write the information to a file. As for reading, we have to open a file, but now in writing mode. This is done by giving 'w' as second argument to the open function. Next we write our results into the file and finally close it. This last action is especially important, because otherwise the system may still hold some information in its internal buffers that has not yet been written to the file. As you might have guessed, writing a string s to a file outf that has been opened for writing is done with the method outf.write(s). An example of copying a file is given below.
    inf = open('somefile.txt', 'r')
    outf = open('anotherone.txt', 'w')
    s = inf.read()
    outf.write(s)
    inf.close()
    outf.close()

Remark: If an existing file is opened for writing, its original contents are lost! If you would like to keep the original contents and write additional contents after them, the second argument to the open function should be 'a' (an abbreviation of append) instead of 'w'.

If a newline has to be output, it has to be added explicitly. There are several ways to do this. Assume we have two strings 'first line' and 'second line' to be written to a new file example.txt, with a newline in between. A small Python program with this effect is:

    outfn=open('example.txt', 'w')
    outfn.write('first line')
    outfn.write('\n')
    outfn.write('second line')
    outfn.close()

A similar program using a list is:

    l = ['first line', 'second line']
    outfn = open('example.txt', 'w')
    outfn.write('\n'.join(l))
    outfn.close()

6.6 Exercises 25–29

Exercise 25:**
(a) Write a Python program that asks the user for a sequence and prints its length. The output should be like

    Enter a sequence: GTTGG
    It is 5 bases long

(b) Modify the program so that it also prints the number of A, T, C, and G characters in the sequence. The output should be like

    Enter a sequence: GTTGG
    It is 5 bases long
    adenine: 0
    thymine: 2
    cytosine: 0
    guanine: 3

(c) Modify the program to allow both lower-case and upper-case characters in the sequence. The output should be like

    Enter a sequence: ATTgtc
    It is 6 bases long
    adenine: 1
    thymine: 3
    cytosine: 1
    guanine: 1

(d) Modify the program to print the number of unknown characters in the sequence. The output should be like

    Enter a sequence: ATTU*gtc
    It is 8 bases long
    adenine: 1
    thymine: 3
    cytosine: 1
    guanine: 1
    unknown: 2

Exercise 26: In this exercise operations on DNA strings are the key elements.
The DNA string to be used is given in the file DNAINS.txt. This file consists of one single line of characters. Read this line into the string dna_ins.
(a) Write a Python program that prints the number of A, T, C, and G characters occurring in dna_ins.
(b) Replace in dna_ins all occurrences of "A" by its complement "T".
(c) Replace in dna_ins all occurrences of "A" by its complement "T" and of "T" by its complement "A".
(d) Write a program that determines the complement of dna_ins. (That is: replace not only all occurrences of "A" by its complement "T" and of "T" by its complement "A", but also all occurrences of "C" by its complement "G" and of "G" by its complement "C".)
(e) Generate the reverse complement of dna_ins by converting the string resulting from (d) to a list, applying an appropriate operation to this list and making a string out of it again.

Exercise 27:**
(a) Create the following list

    [1, 2, 3, 4, ..., n-2, n-1, n, n-1, n-2, ..., 4, 3, 2, 1]

where n is an arbitrary number larger than 100. Why is

    a = list(range(1,101))
    b = a
    b.reverse()
    print(a+b[1:])

not a proper solution?
(b) Create in two different ways the same list but now excluding the number 73, i.e.,

    [1, 2, 3, 4, ..., 71, 72, 74, 75, ..., n-2, n-1, n, n-1, n-2, ..., 75, 74, 72, 71, ..., 4, 3, 2, 1]

(c) Let:

    a=[1,2,3,4]
    b=[9,16,25,36]
    c=[a,b]
    d=[a+b]

What are the lengths of c and d? Are these lengths equal to the sum of the lengths of a and b?
(d) Create the following list

    [0, 0, 0, 1, 1, 1, 2, 2, 2, 3, 3, 3, 4, 4, 4, ..., n-1, n-1, n-1, n, n, n]

where n is again an arbitrary number larger than 100.
(e) Create the following list

    [1, 2, 3, 5, 6, 7, 9, 10, 11, 13, 14, 15, 17, 18, 19, ..., n]

where n is again an arbitrary number larger than 100 but not a multiple of 4.

Exercise 28: In genomics research one frequently accesses databases in which DNA sequences are stored. The two most used formats are FASTA and FASTQ.
In this exercise we concentrate on the FASTA format. FASTA is a text-based format for storing biological sequences (usually nucleotide sequences). An example of such a file is sequences.fasta, which gives the sequences of all genes related to the TCA cycle in the FASTA format. In this format, every gene starts with a line starting with the character >. The remainder of that line is reserved for comments, which may contain all kinds of characters, thus in principle also additional > characters and/or 'ATG's. The actual sequence starts at the next line and may continue over multiple lines until the next line that has > as its first character.
(a) How many genes are there in the file?
(b) How many occurrences of the triplet 'ATG' are there in total?
(c) At which positions in the sequence of the 150th gene do you find this triplet 'ATG'?
Attention: the file consists of 75803 lines with a total of 4610050 characters. For reading the file into your program and processing it, this is no issue at all, but printing the full file contents to the screen may take a while. So if you want to check the contents, rather use slicing operations to print only small parts.

Exercise 29:** Perform the following actions on the sequence seq="AaaTGGGATAAAAaaat":

    seq=seq.upper()
    seq=seq.replace('C',' ')
    seq=seq.replace('T',' ')
    seq=seq.replace('G',' ')
    a=seq.split()
    a.sort()
    print(len(a[-1]))

What is the meaning of the result of these actions?

Chapter 7 Functions and parameters

When a program is to be designed to solve a computational problem, we usually split this task into smaller pieces and subsequently try to solve the smaller tasks. It might very well be that the smaller tasks are still too large to be solved, and then we repeat the splitting again, etc., until all the small tasks can be solved. The next step is then to use the solutions of the smaller steps to solve the larger task.
This is usually done by composing them into a larger unit and, just as for the iterated splitting process, the combination of smaller units into larger ones might need to be repeated. It turns out that when this 'method' is followed, a well-structured solution is obtained. In this chapter we discuss some of the programming concepts in Python by which this structuring can be achieved.

The programs we have written so far have been small, but valuable. It would definitely be inefficient if we had written some code and were not able to use it at another place. We need a way to reuse well-defined pieces of code. In fact, we already did this when using the built-in functions len, str, range, and help. Moreover, we are not always interested in how a certain piece of code has been implemented; we are primarily interested in its use. The technical term for hiding this kind of implementation information is abstraction. Abstraction is useful as a labor-saver, but it is more than that. It is the key to designing computer programs. Instead of defining each detail, we usually describe our programs in higher-level actions such as 'open the file, count the number of lines', etc., and we are not bothered by how the operating system of the computer actually gets access to the file.

7.1 Function definition (def)

Assume that we have two DNA strings, 'aaaa' and 'acca', and that we have to print the number of times the nucleotide 'A' occurs in these DNA strings, either as 'a' or as 'A'. A simple program to achieve this is:

    uppers='aaaa'.upper()
    n=uppers.count('A')
    print("The number of times A occurs in", uppers, "is:", n)

    uppers='acca'.upper()
    n=uppers.count('A')
    print("The number of times A occurs in", uppers, "is:", n)

It is clear that we have duplicate actions.
Copying and pasting is in general risky due to the chance of making errors, such as forgetting parts that should have been included or including parts that should have been left out. It would therefore be valuable to have a construct by which we can group a sequence of actions and which returns the result after finishing that sequence of actions. Such a construct is a function. Functions are named sequences of statements that perform some task and return a value. In order to perform this task, a function may need arguments, i.e., values provided to a function when it is called. We can define a function by using the def (function definition) statement. For the example given above a function definition could be:

    def numberOfAsInExon(mrnaseq):
        uppers=mrnaseq.upper()
        n=uppers.count('A')
        return n

The first line is called the header of the function. After the keyword def the name of the function is given (in this case numberOfAsInExon) and then, between the parentheses, the sequence of arguments, i.e., a number of arguments separated by commas (in this case only one argument, called mrnaseq).

7.2 Function call

A function can be called by using the name of the function with the sequence of arguments between parentheses. In our case two function calls are obtained by

    >>> print(numberOfAsInExon('aaaa'))
    4
    >>> print(numberOfAsInExon('acca'))
    2

So when this function is called, the number of times the base 'A' occurs in the argument string converted to uppercase is returned. The general framework for a function definition is

    def fname(args):
        statements

7.3 Documenting functions

Simple reuse of a function definition is only feasible when it is clear what the function is doing, i.e., when there is a global description of its functionality and of the role of the arguments. One way of providing this is by using the comment construction (beginning with the hash sign '#'). In case of function definitions another style is used.
If a string is put at the beginning of a function definition, it is stored as part of the function in the so-called docstring. The following code demonstrates how to add a docstring to a function:

    def numberOfAsInExon(mrnaseq):
        """ Calculates the number of occurrences of the base A in the
            string mrnaseq """
        uppers=mrnaseq.upper()
        n=uppers.count('A')
        return n

When run in the interpreter, the information about a function, including its docstring, is obtained by

    >>> help(numberOfAsInExon)
    Help on function numberOfAsInExon in module __main__:

    numberOfAsInExon(mrnaseq)
        Calculates the number of occurrences of the base A in the
        string mrnaseq

7.4 Positional parameters as function arguments

The arguments of a function we have been using until now are called positional parameters because their positions are important. For instance, consider the following two functions:

    def DNA2mRNA(s, old, new):
        """ Return a copy of the string s with all occurrences of
            substring old replaced by new. """
        return s.replace(old, new)

    def aDemoReplace(s, new, old):
        """ Return a copy of the string s with all occurrences of
            substring old replaced by new. """
        return s.replace(old, new)

Both do exactly the same; only the order of the second and third parameter is exchanged:

    >>> print(DNA2mRNA("ATG", 'T', 'U'))
    AUG
    >>> print(aDemoReplace("ATG", 'U', 'T'))
    AUG
    >>> print(aDemoReplace("ATG", 'T', 'U'))
    ATG

Python, however, has very attractive properties with respect to parameters.

7.5 Keyword parameters and defaults

In general when a function has many arguments, the order may be hard to remember.
In order to overcome this problem, Python has a very elegant construct by which the name of a parameter can be supplied:

    >>> print(DNA2mRNA(s="ATG", old='T', new='U'))
    AUG

Naming parameters has several advantages:

• The order in which the parameters are given is no longer important.

    >>> print(DNA2mRNA(old='T', s="ATG", new='U'))
    AUG

  The name of the parameter uniquely defines to which parameter in the function definition the actual argument belongs. It is also possible to combine positional and keyword parameters, e.g.

    >>> print(DNA2mRNA("ATG", new='U', old='T'))
    AUG

  Combining positional and keyword parameters is, however, only possible when the positional parameter(s) are given first and the keyword parameter(s) after them. Otherwise the meaning of the parameters becomes ambiguous, resulting in an error:

    >>> DNA2mRNA(s="ATG", 'T', 'U')
      File "<ipython-input-8-220254502cdf>", line 1
        DNA2mRNA(s="ATG", 'T', 'U')
    SyntaxError: non-keyword arg after keyword arg

  (Recent Python 3 versions word this message as "positional argument follows keyword argument".)

  In some cases the use of keyword arguments is not allowed at all, because several built-in functions and methods accept positional arguments only. For example, len(obj="abc") results in

    TypeError: len() takes no keyword arguments

  (In older Python versions the split method behaved in the same way; in current Python 3 versions "abc".split(sep="b") is accepted.) In functions that you define yourself, the parameter names from the definition can always be used as keywords.

• When in the function definition a parameter name is supplied with a default value, the function can be called with fewer arguments than are given in its definition.

    def DNA2mRNA(s, old="T", new="U"):
        """ Return a copy of the string s with all occurrences of
            substring old replaced by new.
            When the argument called old is not given, "T" is used,
            when the argument called new is not given, "U" is used, """
        return s.replace(old, new)

    >>> print(DNA2mRNA(s="ATG"))
    AUG

The parameters that are supplied with a name are called keyword parameters and, even though they require some additional typing, they clearly help in clarifying the role of each parameter.
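The same ideas can be applied to the counting function of Section 7.1. The following is a minimal sketch and not part of the original example: the function name countBase and its parameters base and ignorecase are hypothetical names chosen for illustration.

```python
def countBase(seq, base='A', ignorecase=True):
    """ Return the number of times base occurs in the string seq.

        Hypothetical example: base defaults to 'A'; when ignorecase
        is True, seq and base are converted to uppercase first. """
    if ignorecase:
        seq = seq.upper()
        base = base.upper()
    return seq.count(base)

n_default = countBase('acca')                        # both defaults are used
n_c = countBase('acca', base='c')                    # keyword overrides a default
n_exact = countBase(ignorecase=False, seq='acca')    # keywords in arbitrary order
print(n_default, n_c, n_exact)
```

In the last call the comparison is case-sensitive, so the lowercase string contains no 'A' at all.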
It is even allowed to use a mixture of positional and keyword parameters. The restriction, however, is that when a mixture is used, the positional parameters have to be given first and, of course, in the right order. All of the following statements have the same result:

    >>> print(DNA2mRNA('ATG'))
    AUG
    >>> print(DNA2mRNA('ATG', 'T'))
    AUG
    >>> print(DNA2mRNA('ATG', 'T', 'U'))
    AUG
    >>> print(DNA2mRNA(s='ATG'))
    AUG
    >>> print(DNA2mRNA(s='ATG', new='U'))
    AUG
    >>> print(DNA2mRNA('ATG', new='U'))
    AUG

Of course, when the argument for new is supplied positionally (without its keyword name), the argument for old has to be supplied as well. For instance,

    >>> print(DNA2mRNA('ATG', 'U'))
    ATG

might not give what is wanted.

7.6 Exercises 30–37

Exercise 30:** Explain what the output is of the following Python program:

    def whatshouldbemyname(l):
        '''This method expects a single argument, namely a list of integers'''
        if not l:
            return []
        else:
            l.sort()
            m=[l[0]]
            for i in range(1, len(l)):
                if l[i] != l[i-1]:
                    m.append(l[i])
            return m

    inputlist1 = [4,10,4,4,4,10,4]
    result1 = whatshouldbemyname(inputlist1)
    inputlist2 = [5,3,1,8,5,9,3,8,5,8,5,0,4,6,5,9,7,6,8,10]
    result2 = whatshouldbemyname(inputlist2)
    print(result1, result2)

Exercise 31: Below an imperfect Python program is given. The program consists of a single function definition and two calls of that function. The function has 2 parameters: a base (base) and a DNA string (dnastring). The function should remove all occurrences of base from the DNA string and return the result. The function should then be called twice: once to remove all occurrences of 'A' from a DNA string and once to remove all occurrences of 'T' from another DNA string. The two resulting strings should then be concatenated and printed on a single line. The program contains a number of syntactically wrong constructs as well as a number of semantic errors.
The former result in a SyntaxError when trying to execute the program, while the latter mean that for a given input the correct output is not obtained. Correct the program such that both the syntactic and the semantic errors are resolved.

    def removebase(base,,dnastring)
        res = base.replace(dnastring,'')
        print(res)
    n1 = removebase("A","AACATAAA")
    n2 = removebase("T","TCGACATA')
    print(n1+n1)

Exercise 32:**
(a) Design a function wording that has a word as parameter and prints the word, its length and the reversed word. Apply your function to the word 'verzuring'.
(b) Design a function processFile that has a filename as parameter and prints, by repeatedly calling the wording function, each word in the file, its length and the reversed word. Apply your function to the file mytext.txt (or any other text file of your choice).

Exercise 33: In this exercise we will again make use of the turtle library, which was introduced in Exercise 10, to draw some figures. The pen color can be changed using turtle.pencolor(r,g,b), where r, g, b are floats between 0 and 1, specifying the amount of red, green and blue, respectively.
(a) Define a function that draws a regular polygon. The function should have a single required parameter n, which specifies the number of sides the polygon consists of (e.g. 3 for a triangle, 4 for a square, 6 for a hexagon, etc.). Moreover, the function should have 6 optional parameters: x and y specifying the starting position, d for the length per side, and r, g, b for the color. The default value for d should be 100, while for the other optional parameters the default value should be 0, such that a black polygon is drawn starting in the origin.
(b) Use the function designed in part (a) to draw: (1) a red square with sides of length 150, (2) a blue triangle with sides of length 120, and (3) a green hexagon with sides of length 50.
Exercise 34: Design a function that has a DNA sequence and two integers, say w and linel, as parameters. The function prints the string in a tabulated form such that a space is inserted after each w characters of the sequence and each line, possibly except the last one, shows the next linel characters of the sequence. For example, for the DNA sequence in the file DNAINS.txt with w = 10 and linel = 60 the output should look like:

    gctgcatcag aagaggccat caagcaggtc tgttccaagg gcctttgcgt caggtgggct
    caggattcca gggtggctgg accccaggcc ccagctctgc agcagggagg acgtggctgg
    gctcgtgaag catgtggggg

Exercise 35:* In the following exercises each of the functions to be designed should have a string as only parameter, representing the name of an input text file that is to be opened. For this file you could use mytext2.txt.
(a) Define a sentence to be a string ending with a period ("."). Design a function that returns a list of all the sentences the file consists of.
(b) Design a function that prints each word of a file on a separate line.
(c) Design a function that sorts the words of a file and subsequently prints each of the words on a separate line.
(d) Design a function that prints each odd-numbered line of a file.

Exercise 36: Design a function that copies the lines of an input file in reverse order into an output file. The function should have two parameters: the first parameter is the name of the input file, the second parameter the name of the output file. As input file you could use mytext2.txt.

Exercise 37: Below an imperfect Python program is given. The program consists of 2 function definitions. The first function has 2 parameters: a base and a list of DNA strings. The function should determine the position of the last occurrence of the specified base for each DNA string in the list. Those numbers should be returned in a list.
The second function, which has a list of DNA strings as input parameter, should first determine the last occurrences of the bases A and C in the DNA strings by two calls of the first function. Subsequently, this second function should determine how many DNA sequences are present in the list for which the last C comes after the last A and there are at most two bases between these two last occurrences. The program contains a number of syntactically wrong constructs as well as a number of semantic errors. The former result in a SyntaxError when trying to execute the program, while the latter mean that for a given input the correct output is not obtained. Correct the program such that both the syntactic and the semantic errors are resolved.

    def indexlast(base="A', l)
        m=[]
        for i in range(l)
            m.extend(m[-j).index(basis)
        returm n

    def finalCcloseafterfinalA():
        lastAl=indexlattst(l)
        lastCl=indexlats(l, base="CCC")
        found=-1
        for i in l:
            if lasteAl[i]-lastCl(i) < 2:
                found = found - 1
        return found

    gevonden==finalCcloseafterfinalA(["TCTTTT", "ACAATC", "ACTTACC", "CATA"], [])
    print gewonde `

Chapter 8 Tuples and string formatting

In previous chapters we have seen how one can write output to either the screen or a text file. In both cases, it is often useful to write data in a more structured way than we have done so far. Therefore in this chapter we will consider formatted strings. Additionally, we will consider a second container datatype (next to lists, which we have seen in Chapter 3), i.e., tuples.

8.1 Tuples

In Chapter 3 we have considered lists. Python has a second datatype that is often used to store collections of items, i.e., tuples. Tuples are sequences, just like lists. The differences between tuples and lists are that (i) tuples cannot be changed, unlike lists, and (ii) tuples use parentheses whereas lists use square brackets. Examples of tuples are:

    >>> tup1 = (12, 5, 3, 4, 2 )
    >>> tup2 = (1, 'hi', 100, 3.14)

Tuples can thus (like lists) contain items of different types.
To make a tuple containing a single item, a comma has to be added after the item:

>>> tup3 = (100,)

Without the comma, the result of the above assignment would be that tup3 contains the integer 100. A fourth example of a tuple is:

>>> tup4 = "a", "b", "c", "d"
>>> tup4
('a', 'b', 'c', 'd')

This shows that when multiple objects are given, separated by commas, without identifying symbols (like brackets for lists or parentheses for tuples), Python turns them into a tuple. Tuples thus occur quite often: in the following sections we will for instance use them to create formatted strings and to define functions that return multiple values.

Indexing and slicing tuples

Elements of tuples are accessed in exactly the same way as elements of lists:

>>> tup1[0]     # indexing
12
>>> tup2[1:5]   # slicing
('hi', 100, 3.14)

The indices again start at zero, and slicing a tuple results in a new tuple. The main difference with lists is that tuples are immutable: it is impossible to change a tuple. For instance, assigning a different value to one element of a tuple results in an error:

>>> tup1[0] = 100
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
TypeError: 'tuple' object does not support item assignment

Basic operations on tuples

Basic operations on tuples (that we have already seen for lists as well) are:

Expression                      Result                    Description
len((1, 2, 3))                  3                         Length
(1, 2, 3) + (4, 5, 6)           (1, 2, 3, 4, 5, 6)        Concatenation
('Hi!',) * 3                    ('Hi!', 'Hi!', 'Hi!')     Repetition
3 in (1, 2, 3)                  True                      Membership
for x in (1, 2, 3): print(x)    1 2 3                     Iteration

Lists can be converted to tuples and tuples to lists:

>>> m = list(tup1)
>>> m
[12, 5, 3, 4, 2]
>>> tup5 = tuple(m)
>>> tup5
(12, 5, 3, 4, 2)

8.2 Returning multiple values from a function

In the previous chapter, we have seen that functions can be used to structure a computer program.
The idea is to split the computational problem at hand into smaller parts, where each of these smaller parts is handled in a separate function. The function may be provided with some input (arguments), performs its operations, and returns the result. An advantage of functions is also that they can be called multiple times (on the same or different input), so that the same code does not need to be repeated. Moreover, functions (developed by others) may be reused: as a user of the function you do not need to bother about the details inside the function, but can assume that for a given input the resulting output is as specified in the doc-string of that function. The functions considered so far returned only a single object. However, functions may also return multiple objects separated by commas, which can then be caught by the same number of variables. For instance:

>>> def myfun(x):
...     return x+1, x+2, x+3
...
>>> a, b, c = myfun(5)
>>> a
6
>>> b
7
>>> c
8

What is actually returned is a tuple, so the output of the function can also be caught in a single variable, which then contains the tuple:

>>> d = myfun(5)
>>> d
(6, 7, 8)

Assignment of a tuple to multiple variables can also be done directly. Some examples are:

>>> a,b = (3,4)
>>> a,b = 3,4
>>> (a,b) = (3,4)

all three assigning the value 3 to a and the value 4 to b.

8.3 String formatting

We have seen before that the print function can have multiple arguments separated by commas and that the types of these arguments may differ. An example where two strings and a float are combined is:

>>> from math import pi
>>> print('The value of pi is', pi, '!')
The value of pi is 3.141592653589793 !
Python has two alternatives that provide more control over the way a string is formatted: (i) the 'old style' that uses the % symbol (very similar to the approach in other computer languages like C, Tcl and Matlab) and (ii) the 'new style' that uses the format method on strings. Both can be used and you are free to use whichever you prefer.

8.3.1 Old style: %

In the first alternative, a % character is used inside the string to indicate the position where a value needs to be inserted, and the value to be substituted follows after a second % character placed behind the string. To insert a float, the % character inside the string should be followed by the character f. The above result is thus (almost) reproduced using

>>> from math import pi
>>> print('The value of pi is %f !' % pi)
The value of pi is 3.141593 !

A first advantage is that this easily allows for removing the space between the numerical value and the exclamation mark. Another major advantage of this method is that it readily allows for controlling, for instance, the number of digits. This is achieved by adding a dot followed by an integer indicating the desired number of decimals just in front of the 'f'. For instance, to show just 2 decimals of pi, the print call should be adapted to:

>>> from math import pi
>>> print('The value of pi is %.2f!' % pi)
The value of pi is 3.14!

Also other types can be formatted:

Format  Description                      Example             Result
%s      string                           'A%sD' % 'bc'       'AbcD'
%d      integer                          '%d*%d' % (3,4)     '3*4'
%f      float                            '%.3f' % 0.1234     '0.123'
%e      float in scientific notation     '%.2e' % 0.1234     '1.23e-01'
%X      integer in hexadecimal format    '%X' % 255          'FF'

Multiple substitution values

It is also possible to substitute multiple values into a string.
The values to be substituted should then be provided as a tuple, where the number of elements in the tuple should match the number of % characters in the string that indicate the positions where the values should be substituted. The first element in the tuple is then substituted at the position of the first % character in the string, the second element in the tuple at the position of the second % character, etc.

>>> from math import pi
>>> print('The value of %s is %.5f!' % ('pi', pi))
The value of pi is 3.14159!

Alignment

It is also possible to specify the minimum number of characters used for the substituted value. This is done by specifying the required number of characters directly after the % character. This can for instance be useful when one wants to align columns in a table. While

>>> for i in range(11):
...     print(i,i**2,i**3)
0 0 0
1 1 1
2 4 8
3 9 27
4 16 64
5 25 125
6 36 216
7 49 343
8 64 512
9 81 729
10 100 1000

results in a mess, a nicely aligned table may be obtained using:

>>> for i in range(11):
...     print('%2d %3d %4d' % (i, i**2, i**3))
 0   0    0
 1   1    1
 2   4    8
 3   9   27
 4  16   64
 5  25  125
 6  36  216
 7  49  343
 8  64  512
 9  81  729
10 100 1000

8.3.2 New style: the format method

The second alternative to format strings is the str.format method operating on a string str. Advantages of this approach are that it provides slightly more control and that the style is more 'Pythonic'. The latter is also its drawback, as the approach is not applicable in, for instance, Matlab. The basic usage of the str.format() method in order to reproduce the result at the beginning of Section 8.3 is:

>>> from math import pi
>>> print('The value of pi is {} !'.format(pi))
The value of pi is 3.141592653589793 !

The format method thus substitutes the value of its argument at the position in the string specified by {}.
This approach also easily allows for removing the space between the numerical value and the remainder of the string and for controlling the number of digits of the value to be substituted. For instance, to show just 2 decimals of pi, the print call becomes

>>> print('The value of pi is {:.2f}!'.format(pi))
The value of pi is 3.14!

where the colon specifies that special formatting follows, f specifies that a float should be printed, and .2 that it should be printed with 2 decimals (in general, .m gives m decimals). Also other types can be formatted:

Format  Description                      Example                      Result
s       string                           'A{:s}D'.format('bc')        'AbcD'
d       integer                          '{:d}*{:d}'.format(3,4)      '3*4'
f       float                            '{:.3f}'.format(0.1234)      '0.123'
e       float in scientific notation     '{:.2e}'.format(0.1234)      '1.23e-01'
%       float as percentage              '{:.1%}'.format(0.1234)      '12.3%'
b       integer in binary format         '{:b}'.format(8)             '1000'
X       integer in hexadecimal format    '{:X}'.format(255)           'FF'

Multiple substitution values

Using the format method it is also possible to substitute multiple values into a string. The default is to use multiple {} and multiple parameters for the format method. The values are then substituted in consecutive order.

>>> print('The value of {} is {}!'.format('pi',pi))
The value of pi is 3.141592653589793!

The format of both substitutions can be controlled just as for the case of a single substitution. The only difference is that it is now also possible to specify which parameter should be substituted at which position:

>>> print('The value of {:s} is {:.5f}!'.format('pi',pi))
The value of pi is 3.14159!
>>> print('The value of {0:s} is {1:.5f}!'.format('pi',pi))
The value of pi is 3.14159!
>>> print('The value of {1:s} is {0:.5f}!'.format(pi,'pi'))
The value of pi is 3.14159!

The index of the argument to be substituted is thus specified by the integer in front of the colon.
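To compare the two styles directly, the following sketch formats the same pair of values once with the % operator and once with the format method. The sequence name and the GC fraction are made-up example values, not taken from any data file.

```python
# Made-up example values (assumptions for illustration only)
name = 'seq1'
gc_fraction = 0.456789

# Old style: the values are supplied as a tuple behind the % operator
old_style = 'GC fraction of %s is %.3f' % (name, gc_fraction)

# New style: the values are supplied as arguments of the format method;
# the integers in front of the colons select the argument by index
new_style = 'GC fraction of {0:s} is {1:.3f}'.format(name, gc_fraction)

print(old_style)
print(new_style)
```

Both print calls should show the same line, which illustrates that the choice between the two styles is a matter of preference.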
A single parameter can also be substituted at multiple places, e.g.:

>>> print('{0:} rounded to 2 digits is {0:.2f}'.format(pi))
3.141592653589793 rounded to 2 digits is 3.14

Instead of positional parameters, keyword parameters can also be used:

>>> print('The value of {name:s} is {value:.5f}!'.format(value=pi,name='pi'))
The value of pi is 3.14159!

The format method of course also works on a string in a variable:

>>> from math import pi, e
>>> s = 'The value of {name:s} is {value:.5f}!'
>>> s
'The value of {name:s} is {value:.5f}!'
>>> t = s.format(value=pi,name='pi')
>>> t
'The value of pi is 3.14159!'
>>> print(t)
The value of pi is 3.14159!
>>> print(s.format(value=e,name='e'))
The value of e is 2.71828!

Alignment

It is also possible to specify the minimum number of characters used for the substituted value. This can for instance be useful when one wants to align columns in a table:

>>> for i in range(11):
...     print('{0:2d} {1:3d} {2:4d}'.format(i, i**2, i**3))
 0   0    0
 1   1    1
 2   4    8
 3   9   27
 4  16   64
 5  25  125
 6  36  216
 7  49  343
 8  64  512
 9  81  729
10 100 1000

The number in front of the type (here 'd') thus specifies the width of the field in which the value is formatted. Alignment within the columns can be achieved using > for right alignment (default for numbers), < for left alignment, and ^ for centering:

>>> for i in range(11):
...     print('{0:>2d} {1:<3d} {2:^4d}'.format(i, i**2, i**3))
 0 0    0
 1 1    1
 2 4    8
 3 9    27
 4 16   64
 5 25  125
 6 36  216
 7 49  343
 8 64  512
 9 81  729
10 100 1000

Format  Description                                     Example                      Result
#d      use # characters for an integer                 '{:5d}'.format(123)          '  123'
#.#f    use # characters for a float with # decimals    '{:7.2f}'.format(0.1234)     '   0.12'
>       align right                                     '{:>5d}'.format(123)         '  123'
<       align left                                      '{:<5d}'.format(123)         '123  '
^       align center                                    '{:^5d}'.format(123)         ' 123 '

8.4 Exercises 38–40

Exercise 38:** Design a function minmaxmean that has a list of integers as parameter and returns the minimum value in the list, the maximum value, as well as the mean. Apply your function to the list m1=[9,4,5,6,2,5,4,3,1,2,12,7,4,3,2,8,4,2] and to the list m2=[3,4,5,2,2,12,2,1,8,2,9,4,3,6,4,4,7,5].

Exercise 39:* (a) Explain the output of the following Python fragment:

for i in range(1,13):
    for j in range(1,13):
        print('%4d' % i*j, end=' ')
    print()

(b) Correct the above Python fragment such that it prints the multiplication table (of the tables 1 through 12) to screen in the following format:

   1    2    3    4    5    6    7    8    9   10   11   12
   2    4    6    8   10   12   14   16   18   20   22   24
   3    6    9   12   15   18   21   24   27   30   33   36
   4    8   12   16   20   24   28   32   36   40   44   48
   5   10   15   20   25   30   35   40   45   50   55   60
   6   12   18   24   30   36   42   48   54   60   66   72
   7   14   21   28   35   42   49   56   63   70   77   84
   8   16   24   32   40   48   56   64   72   80   88   96
   9   18   27   36   45   54   63   72   81   90   99  108
  10   20   30   40   50   60   70   80   90  100  110  120
  11   22   33   44   55   66   77   88   99  110  121  132
  12   24   36   48   60   72   84   96  108  120  132  144

(c) Design a function mytable with as parameter an integer maxval that creates a table like the one above and returns it as a string. The parameter maxval should specify the number of columns to be printed, while the number of rows should remain fixed at 12.

Exercise 40: Given is a text file BMIs.txt (considered before in Exercise 22) with on each line three fields separated by whitespace. The first field is the name of a person, the second field his/her weight, and the third field his/her height.
(a) Design a Python function with a filename as its single parameter that reads the specified file (of the type described above) and returns the contents of the file as a list of tuples.
Each tuple should contain three items, i.e., in consecutive order a name as a string, a weight as a float, and a height as a float. Apply your Python function to the file BMIs.txt.
(b) Design a Python function with a list of tuples, as defined in part (a), as its single parameter, which uses string formatting to print a table like:

Name    | weight | height | BMI  | Category
---------------------------------------------------
Peter   |  120.0 |   1.75 | 39.2 | obese
Esther  |   60.0 |   1.61 | 23.1 | healthy weight
Tom     |   90.0 |   1.70 | 31.1 | obese

where BMI is defined as weight divided by the square of the height and the Category is defined according to the World Health Organization: 'Underweight' (BMI below 18.5), 'Healthy weight' (BMI between 18.5 and 25), 'Overweight' (BMI between 25 and 30) or 'Obese' (BMI above 30).

Chapter 9
Dictionaries and database queries

Next to lists and tuples, Python has other built-in datatypes. One of those is the dictionary, which associates each key with exactly one value. Dictionaries can be a very useful datatype to clearly store and access diverse data. When accessing databases from Python, the returned data may also be provided as a dictionary.

9.1 Dictionaries

Defining a dictionary

A dictionary is an unordered set of elements where each element is a key:value pair. To define a dictionary, introduce a variable name and explicitly give the elements of the dictionary enclosed in curly braces.

>>> gene_dict = {'IGF2': 'insulin-like growth factor 2 (somatomedin A)', 'INS': 'insulin'}
>>> gene_dict
{'IGF2': 'insulin-like growth factor 2 (somatomedin A)', 'INS': 'insulin'}
>>> gene_dict['IGF2']
'insulin-like growth factor 2 (somatomedin A)'
>>> gene_dict['INS']
'insulin'

Here 'IGF2' is a key and its associated value, referenced by gene_dict['IGF2'], is 'insulin-like growth factor 2 (somatomedin A)', and similarly for the key 'INS' and the value 'insulin'. Hence dictionary elements are obtained by their keys.
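As a further sketch of key-based lookup, a dictionary can map each DNA base to its complementary base. The names complement, dna and complementary are chosen here for illustration only.

```python
# Map each base (key) to its complementary base (value)
complement = {'A': 'T', 'T': 'A', 'C': 'G', 'G': 'C'}

dna = 'ACCGT'

# Look up the value belonging to every base in the sequence
complementary = ''
for base in dna:
    complementary = complementary + complement[base]

print(complementary)   # TGGCA
```

Each lookup complement[base] uses the base as key, exactly as gene_dict['INS'] uses the gene symbol as key.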
An empty dictionary is created by

empty_d = { }

Dictionaries are not just for strings.
• Dictionary values can be any datatype, including strings, integers, objects, lists, or even other dictionaries. Within a single dictionary, the values may have different types. They can be mixed as needed.
• Dictionary keys are more restricted, viz. limited to immutable data types like strings, integers, tuples, and a few other types. They can also be mixed, i.e., not all keys need to be of the same type.

That different types can be used for keys as well as values is illustrated by:

>>> mydict={}
>>> mydict['name'] = 'Klaas'
>>> mydict[666] = [6,6,6]
>>> mydict[(1,'hi')] = 3.1415
>>> mydict
{666: [6, 6, 6], (1, 'hi'): 3.1415, 'name': 'Klaas'}

Since lists are mutable, using a list as key is not allowed:

>>> mydict[[1,2]] = 4
Traceback (most recent call last):
TypeError: unhashable type: 'list'

The number of items in a dictionary can be obtained using the function len:

>>> len(mydict)
3

Adding and changing dictionary elements

Duplicate keys cannot occur in a dictionary. Assigning a value to an existing key will overwrite the old value.

>>> gene_dict
{'IGF2': 'insulin-like growth factor 2 (somatomedin A)', 'INS': 'insulin'}
>>> gene_dict['IGF2']='insulin-like growth factor 2'
>>> gene_dict
{'IGF2': 'insulin-like growth factor 2', 'INS': 'insulin'}

Adding new elements to a dictionary goes in a similar way:

>>> gene_dict['INSR']='insulin receptor'
>>> gene_dict
{'INS': 'insulin', 'INSR': 'insulin receptor', 'IGF2': 'insulin-like growth factor 2'}

In the Python version used for this output, dictionaries have no concept of order among elements; they are simply unordered. Note that the new element (key 'INSR', value 'insulin receptor') appears to be in the middle. In fact, it was just a coincidence that the elements appeared to be in order in the first example; it is just as much a coincidence that they appear to be out of order now. (Since Python 3.7, dictionaries keep their elements in insertion order, so there the new element is displayed last.)
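The remark on ordering can be checked with a short sketch. It assumes Python 3.7 or newer, where a dictionary keeps its keys in the order in which they were inserted; in older versions the first printed list may come out in a different order. The dictionary name d is hypothetical.

```python
d = {}
d['INS'] = 'insulin'
d['INSR'] = 'insulin receptor'
d['IGF2'] = 'insulin-like growth factor 2'

# Python 3.7 and newer: the keys come back in insertion order
insertion_order = list(d)
print(insertion_order)

# sorted() gives alphabetical order regardless of the Python version
alphabetical = sorted(d)
print(alphabetical)
```

When a program needs the keys in a particular order, sorting them explicitly makes that order independent of how the dictionary was filled.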
An item can be removed from a dictionary using the del statement:

>>> gene_dict
{'INS': 'insulin', 'INSR': 'insulin receptor', 'IGF2': 'insulin-like growth factor 2'}
>>> del gene_dict['INS']
>>> gene_dict
{'INSR': 'insulin receptor', 'IGF2': 'insulin-like growth factor 2'}

Methods of dictionaries

Given a dictionary d, the following methods and operations can be applied to d:
• d.keys()     Returns a view on the dictionary's keys
• d.values()   Returns a view on the dictionary's values
• d.items()    Returns a view on the dictionary's (key, value) pairs
• x in d       Returns True if x is one of the dictionary's keys, False otherwise

If desired, the views returned by the methods d.keys(), d.values() and d.items() can be converted to true lists using the list function. Some examples for the dictionary filled earlier in this section:

>>> gene_dict.keys()
dict_keys(['IGF2', 'INS', 'INSR'])
>>> list(gene_dict.keys())
['IGF2', 'INS', 'INSR']
>>> gene_dict.values()
dict_values(['insulin-like growth factor 2', 'insulin', 'insulin receptor'])
>>> gene_dict.items()
dict_items([('IGF2', 'insulin-like growth factor 2'), ('INS', 'insulin'), ('INSR', 'insulin receptor')])
>>> 'INS' in gene_dict
True
>>> 'insulin' in gene_dict
False

Looping over dictionaries

To perform an action on all items in a dictionary, we again use the for statement.

>>> for key in gene_dict:
...     print(key, 'stands for', gene_dict[key])
INS stands for insulin
INSR stands for insulin receptor
IGF2 stands for insulin-like growth factor 2

In this way we get the keys one by one and can use those to look up the corresponding values. A second way to loop over all items, with direct access to the key-value pairs, is:

>>> for key,val in gene_dict.items():
...     print(key, 'stands for', val)
INS stands for insulin
INSR stands for insulin receptor
IGF2 stands for insulin-like growth factor 2

In both cases above the order in which the keys are processed may appear arbitrary. If you would rather have them in (alphabetical) order, the sorted function can be used:

>>> for key in sorted(gene_dict):
...     print(key, 'stands for', gene_dict[key])
IGF2 stands for insulin-like growth factor 2
INS stands for insulin
INSR stands for insulin receptor

Reversed order is also possible:

>>> for key in sorted(gene_dict,reverse=True):
...     print(key, 'stands for', gene_dict[key])
INSR stands for insulin receptor
INS stands for insulin
IGF2 stands for insulin-like growth factor 2

9.2 Database queries

Until now we have merely used data that was locally accessible, but huge amounts of data are also available outside the TU/e. Access to these data is usually provided by large database servers. For the domain of Biomedical Engineering the most frequently used source worldwide is the National Center for Biotechnology Information (NCBI), including PubMed, a free database of citations and abstracts of biomedical and life sciences literature. NCBI advances science and health by providing access to biomedical and genomic information. In this section we discuss how to retrieve genomic data from this site, but we start with a more general treatment of how to automatically download data from an arbitrary site.

9.2.1 Open arbitrary resources by URL

In Section 5.4, we have seen that a file on the local file system can be opened with the open function. An example is:

inf=open("example.txt")

It would be nice if we could use a similar command for remote files. Python indeed has a standard module to access remote resources: urllib. To access a remote file we of course have to know the address of the location where the file resides. The international standard is the so-called Uniform Resource Locator (URL), commonly termed a web address. In fact it is more.
A URL is a reference to a web resource that specifies both its location on a computer network and a mechanism for retrieving it. URLs occur most commonly to reference web pages (http), but are also used for file transfer (ftp), email (mailto), database access (JDBC), and many other applications. Most web browsers display the URL of a web page above the page in an address bar. A typical URL could have the form http://cbio.bmt.tue.nl/~philbers/index.htm, which indicates a protocol (http), a hostname (cbio.bmt.tue.nl), as well as a file name (~philbers/index.htm). To access such a file in Python the following small program could be used:

import urllib.request
protocol = "http"
hostname = "cbio.bmt.tue.nl"
path = "~philbers/index.htm"
url = protocol+"://"+hostname+"/"+path
rf = urllib.request.urlopen(url)
data = rf.read()
print(data)

Running this program (the url is only accessible from within the TU/e environment) results in (line breaks have been added)

b'<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Transitional//EN"
"http://www.w3.org/TR/xhtml1/DTD/xhtml1-transitional.dtd">
<html lang="nl" xml:lang="nl" xmlns="http://www.w3.org/1999/xhtml">
<HEAD>
<TITLE>Peter Hilbers</TITLE>
</HEAD>
<BODY TEXT="#000000" LINK="#0000FF" VLINK="#00007F" BGCOLOR="#EAD9C7">
<H2>
Welcome, on this page where you can find information concerning
my recent publications, courses and presentations.
</H2>
.....
NWO Computional Life Science<br><br>
NWO Computional Science<br><br>
Lorentz Center<br><br>
</body></html>'

The letter b followed by the quote at the beginning of what is returned shows that it is not a str object, but an object of the type bytes. It can be converted to a string (str) using the decode method of bytes, i.e., by adding one line:

data = data.decode()

So compared to local file access the main differences are the use of urlopen instead of open and the extra decode step to obtain the contents as a string (str).
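The decode step can be tried without network access. The sketch below applies decode to a bytes literal that stands in for the result of rf.read(); the literal and the explicit 'utf-8' argument are assumptions made for this example (calling decode() without an argument also uses UTF-8).

```python
# A bytes object, standing in for what rf.read() would return
data = b'<TITLE>Peter Hilbers</TITLE>'
print(type(data))       # <class 'bytes'>

# Convert the bytes to a string (str); UTF-8 is assumed as encoding
text = data.decode('utf-8')
print(type(text))       # <class 'str'>
print(text)             # <TITLE>Peter Hilbers</TITLE>
```

If a web page uses a different encoding, the name of that encoding would have to be passed to decode instead.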
Web pages are often dynamic. This roughly means that the (html) output being generated depends on optional additional information given in the url. This optional information is usually called the query and is separated from the preceding part of the url by a question mark (?). Its syntax is not strictly defined, but by convention it is most often a sequence of attribute–value pairs separated by a delimiter. The two most used delimiters are the ampersand (&) and the semicolon (;). Since the ampersand is used most, we will also use it. So an example of a url including a query part is:

url="https://www.ncbi.nlm.nih.gov/pubmed/?term=genomics"

One difficulty with queries is that their syntax makes them hard to read. For instance, to build a query string into a URL, spaces are to be replaced by plus signs, and as a consequence plus signs need to be escaped by using their "%xx" variant, viz. "%2B".

As an example, suppose we are interested in the Ebola virus. If we would like to use the term 'Ebola virus' in the query part of a url, we should use 'Ebola+virus', but in general many more similar substitutions are needed, especially when multiple attribute–value pairs are used. Doing those substitutions manually is a tedious task, so we need an alternative. Python has a special facility, the function urlencode(query) in the module urllib.parse, that automatically performs such substitutions. The query can for instance be given as a dictionary.
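A minimal sketch of these substitutions, using urlencode from the standard module urllib.parse. The attribute names term and q and the variable names query1 to query3 are picked for illustration only.

```python
import urllib.parse

# A space in the value is replaced by a plus sign
query1 = urllib.parse.urlencode({'term': 'Ebola virus'})
print(query1)   # term=Ebola+virus

# A plus sign in the value is escaped as %2B
query2 = urllib.parse.urlencode({'q': 'a+b'})
print(query2)   # q=a%2Bb

# Multiple attribute-value pairs are joined by ampersands
query3 = urllib.parse.urlencode({'db': 'nuccore', 'rettype': 'fasta'})
print(query3)   # db=nuccore&rettype=fasta
```

The resulting strings can be placed behind the question mark of a url without any manual escaping.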
If we would have the url (one single line)

https://eutils.ncbi.nlm.nih.gov/entrez/eutils/efetch.fcgi?db=nuccore&id=110645916&rettype=fasta

we could build this request in Python by using a dictionary:

import urllib.parse
import urllib.request
protocol="https"
hostname="eutils.ncbi.nlm.nih.gov"
path="entrez/eutils/efetch.fcgi"
url=protocol+"://"+hostname+"/"+path
queryd={}
queryd['db']='nuccore'
queryd['id']='110645916'
queryd['rettype']='fasta'
params = urllib.parse.urlencode(queryd)
params = params.encode('ascii')
rf = urllib.request.urlopen(url, params)
data=rf.read()
print(data)

Note that passing params as the second argument of urlopen sends the query to the server as a POST request instead of appending it to the url after a question mark. Although it seems a bit superfluous to use so many small steps, such an approach is less error prone and hence it is strongly recommended to do it in this way. A problem is, however, that not all web servers accept calls from Python, as they do not like 'robots'. Another example that does work is http://www.chemcalc.org/, which provides some web services related to mass spectrometry. A small example, that we will also use in one of the exercises, is:

import urllib.parse
import urllib.request
# Define a molecular formula string
mf = 'C4H5N3O'
# Define the parameters and send them to Chemcalc
pardict = {'mf': mf,'isotopomers':'jcamp,xy'}
params = urllib.parse.urlencode(pardict).encode()
# url of Chemcalc
url = 'http://www.chemcalc.org/chemcalc/mf'
# Open the url and read the page
response = urllib.request.urlopen(url, params)
data = response.read()

9.2.2 Accessing databases: NCBI, Entrez and BioPython

Examples of human diseases caused by viruses include the common cold, influenza, chickenpox, and cold sores, as well as many serious diseases such as avian influenza, SARS, and Ebola virus disease. If we would like to study the genome of this last virus, then a literature search shows that the genus Ebolavirus is a member of the Filoviridae family.
There are currently 5 known species: Zaire ebolavirus, Sudan ebolavirus, Tai Forest ebolavirus, Bundibugyo ebolavirus, and Reston ebolavirus. The Zaire ebolavirus is responsible for the outbreak that started in West Africa in 2014, the largest outbreak since the virus was first discovered in 1976. Genetic sequencing has shown that the virus isolated from infected patients in the 2014 outbreak is 97% similar to the virus that first emerged in 1976. If we want to check this similarity we have to search for the nucleotide sequences. The first step towards a complete genome of the Zaire ebolavirus is a general search (using a web browser) at the NCBI site:

http://www.ncbi.nlm.nih.gov/

Since we are interested in nucleotides, we select the nucleotide database

http://www.ncbi.nlm.nih.gov/nuccore

and add the query term 'ebola+virus+isolate':

http://www.ncbi.nlm.nih.gov/nuccore/?term=ebola+virus+isolate

This returns 74491 hits in the nucleotide database for "ebola virus", of which the first 20 are shown. In the publication Viruses 2014, 6, 3663-3682; doi:10.3390/v6093663 we find:

'Ebola virus (EBOV) is the most thoroughly characterized ebola virus. Dozens of EBOV isolates are available, but the vast majority of published experiments have been performed with isolates Mayinga and Kikwit. The Mayinga isolate, the first EBOV isolate obtained in 1976, has been used extensively for molecular-biological characterizations. The Kikwit variant, obtained during an Ebola virus disease outbreak in 1995, has been used almost exclusively for pathogenesis studies in nonhuman primates in the US (the Mayinga isolate is used almost everywhere else).'

So, inspecting the first 20 items, we select items 14 and 15: 'Ebola virus isolate Ebola virus/H.sapiens-tc/COD/1995/Kikwit-807223, complete genome' and 'Ebola virus isolate Ebolavirus/H.sapiens-tc/COD/1976/YambukuMayinga, complete genome'.
Next we go to the 'Send to' button, where we select File as destination, and start downloading these two fasta records. They are also available on the Canvas site in the file 'ebolasequences.fasta'. For two hits this process is doable, but for more hits we need an alternative: BioPython.

Entrez

The module in the BioPython package to be used for data retrieval is 'Entrez':

"Entrez (https://biopython.org/DIST/docs/api/Bio.Entrez-module.html, https://www.ncbi.nlm.nih.gov/books/NBK25501/) is a data retrieval system that provides users access to NCBI's databases such as PubMed, GenBank, GEO, and many others. Access Entrez from a web browser to manually enter queries, or you can use Biopython's Bio.Entrez module for programmatic access to Entrez. The latter allows you for example to search PubMed or download GenBank records from within a Python script."

In this subsection we demonstrate how to use Entrez. Instead of the Ebola virus we use a bacterium as example, namely the organism Escherichia coli. This bacterium is for instance used in the laboratory of the 'Chemical Biology' group for the production of new proteins, but also by student teams from the TU/e that participated in the international iGEM competition. The first step in using Entrez in a Python program is always to import it:

from Bio import Entrez

Next, to access the NCBI site, an email identification is needed. Use your TU/e email account for access:

from Bio import Entrez
Entrez.email="your.name@student.tue.nl"

To search for the reference sequence of the complete genome of Escherichia coli we have to specify two items:
• The database
  Since we are searching for the nucleotide sequence we select the nucleotide database: db="nucleotide"
• The search term
  This is more complicated. The NCBI database entries could be interpreted as a kind of dictionary. NCBI calls its keys Fields.
  Examples of fields are "Organism" and "Properties", but there is also a construction that matches all fields: "All Fields". As organism we search for "Escherichia coli", we are interested in the complete genome sequence, and we want only those sequences that have a stable reference. For this last part we search in the field "Properties" for the value srcdb_refseq. Combining all of these leads to our search term:

term='"Escherichia coli"[Organism] AND "complete genome"[All Fields] AND "srcdb_refseq"[Properties]'

Such search terms can become long, so we need a more elegant construct. As in programming, we build the large construct from smaller pieces. As you may have inferred, the word "AND" is used to combine search terms. So we could have built our term by:

l=['"Escherichia coli"[Organism]']
l.append('"complete genome"[All Fields]')
l.append('"srcdb_refseq"[Properties]')
searchterm=' AND '.join(l)

Combining these two items leads to our standard Python search script for Entrez:

from Bio import Entrez
Entrez.email="p.a.j.hilbers@tue.nl"
l=['"Escherichia coli"[Organism]']
l.append('"complete genome"[All Fields]')
l.append('"srcdb_refseq"[Properties]')
searchterm=' AND '.join(l)
handle=Entrez.esearch(db="nucleotide", term=searchterm)

Such a handle is similar to a file, and we apply the Entrez 'read' function to it:

record=Entrez.read(handle)

The record we obtain is a Python dictionary having many different key–value pairs:

{'Count': '4186', 'RetMax': '20', 'RetStart': '0',
 'IdList': ['1767732542', '1767732539', '1767732528', '1767732513', '1767732496',
            '1767732466', '1767732428', '1767045039', '1724048583', '1724048578',
            '1724048573', '1724048568', '1724048560', '1724048547', '1724048519',
            '1724048161', '1520474167', '1520474049', '1511333192', '1393717272'],
 'TranslationSet': [{'From': '"Escherichia coli"[Organism]',
                     'To': '"Escherichia coli"[Organism]'}],
 'TranslationStack': [{'Term': '"Escherichia coli"[Organism]', 'Field': 'Organism',
                       'Count': '7055571', 'Explode': 'Y'},
                      {'Term': '"complete genome"[All Fields]', 'Field': 'All Fields',
                       'Count': '552639', 'Explode': 'N'},
                      'AND',
                      {'Term': '"srcdb_refseq"[Properties]', 'Field': 'Properties',
                       'Count': '61550647', 'Explode': 'N'},
                      'AND'],
 'QueryTranslation': '"Escherichia coli"[Organism] AND "complete genome"[All Fields] AND "srcdb_refseq"[Properties]'}

Note that all keys are strings. There are 4186 matches ('Count') in the database. The Ids of only 20 ('RetMax') matches are returned in the list 'IdList'. To access these identifiers we use:

idl=record['IdList']

If we are interested in all identifiers we should not be limited to the first 20:

handle=Entrez.esearch(db="nucleotide", term=searchterm, retmax=record['Count'])
rec2=Entrez.read(handle)
print(len(rec2['IdList']))

This should produce 4186 as output (on October 31, 2019). So far we have only searched for matching records in the database. The next step is to get (some of) these records. To that end we use Entrez's efetch function. Like esearch, it needs a database (db="nucleotide"), a sequence of identifiers separated by commas, the return type and the return mode. Since we are interested in the sequences we use 'fasta' as return type and 'text' as return mode. So to obtain the first 2 matching fasta sequences from the 'IdList':

Entrez.email="p.a.j.hilbers@tue.nl"
idn=",".join(rec2['IdList'][:2])
handle=Entrez.efetch(db="nucleotide", id=idn, rettype='fasta', retmode="text")
outf=open('E_coli2ids.fasta', 'w')
outf.write(handle.read())
outf.close()

9.3 Exercises 41–48

Exercise 41:** (a) Given is an arbitrary string s. Write a Python fragment that determines, for each character occurring in the string, how often it occurs.
(b) Print the number of occurrences in a nicely aligned table, with the character in the first column, its number of occurrences in the second column and its percentage in the third column. E.g., for "AAAAAAAAAAA#A#AB" the output should look like

    #     2   12.50%
    A    13   81.25%
    B     1    6.25%

Exercise 42: Let s be a DNA string consisting of the characters A, C, G, and T. The string s may consist of both capital and small letters. Design a Python function with s as parameter that determines for each letter in s the number of times it occurs and returns the four letters in order of decreasing number of occurrences, i.e., first the letter that occurs most often, subsequently the second most occurring, etc.

Exercise 43:**
(a) Design a function with a string molformula as single parameter that extracts, using urllib, information about the molecule described by molformula from the website http://www.chemcalc.org. (A description of how information can be retrieved from this website is given at the end of section 9.2.1.) The function should return the string read from this webpage. Apply the function to your favorite molecule (e.g. ’C2H6O’ or ’C11H15NO2’).
(b) The result of part (a) should be a string that begins and ends (except for possible white space characters) with { and }, respectively. This string is in the so-called JSON format, an open standard format that uses human-readable text to transmit data objects. Using the Python library json this string s can be converted into a Python dictionary using:

    import json
    chemcalcdict = json.loads(s)

Create the dictionary for your favorite molecule and show which keys the dictionary has. If one of these keys reads ’mw’, check the corresponding value to see what the molecular weight of your favorite molecule is.
(c) From the dictionary of part (b), we can also extract information on the mass percentage of its constituent elements. Namely, chemcalcdict['parts'][0]['ea'] yields a list of dictionaries.
Each dictionary contains an element name, its number of occurrences, and its mass percentage in the molecule. Use this information to print a table like:

    C    52.14%
    H    13.13%
    O    34.73%

Exercise 44:** Open, using urllib, the webpage http://cbio.bmt.tue.nl/~philbers/index.htm and count the number of times the word ’computational’ (case insensitive) occurs on that page.

Exercise 45:** Design a Python function with a single list parameter that returns the number of hits a search on the NCBI has when the elements of the list are concatenated with ’ AND ’ as search term in the nucleotide database. Apply the function to

    l=["Escherichia coli[Organism]", "complete genome[All Fields]", "srcdb_refseq[Properties]"]

Exercise 46: Design a Python function with a searchterm list and an integer as parameters that returns the list of Entrez ids that match the search query when the elements of the searchterm list are concatenated with ’ AND ’ as search term in the nucleotide database and with the integer as ’retmax’ parameter. Apply the function to

    l=["Escherichia coli[Organism]", "complete genome[All Fields]", "srcdb_refseq[Properties]"]

and the total number of hits (as obtained in the previous exercise) as integer parameter.

Exercise 47: Design a Python function with a list of GenBank ids and a string as parameters that generates a file with that string as name and the FASTA records of the GenBank ids as contents. (Multiple ids can be fetched by joining them with a comma.) Apply the function to a list of 10 different GenBank ids matching “Escherichia coli” as organism, a complete genome and refseq as property, and with “E_Coli10ids.fasta” as string parameter.
Tips: 1) use your solution of the previous exercise to obtain the 10 ids, and 2) running your program (i.e. downloading the sequences) may take a minute or so.
Exercise 48: On the webpage http://cbio.bmt.tue.nl/~philbers/8CA10/E_coliallids.fasta a file ’E_coliallids.fasta’ is given containing the FASTA records of all (at the moment the file was generated) 166 hits of the search on reference sequences of complete genome nucleotides of E. coli. N.B.: This file is more than 317 Mbytes in size. Apart from using your solution to an earlier exercise, this FASTA file can be read and stored in a dictionary with the ids of the records as keys and the corresponding sequences as values:

    from Bio import SeqIO

    def getFastaRecordsDict(filename):
        handle = open(filename, "r")
        seqdict = {}
        for record in SeqIO.parse(handle, "fasta"):
            seqdict[record.id] = str(record.seq)
        handle.close()
        return seqdict

    allEcoliDict = getFastaRecordsDict('E_coliallids.fasta')

(a) Design a Python function with a sequence as single parameter that determines the relative frequencies of occurrence of the four bases A, C, G and T in the sequence (i.e. the fractions of those bases in the sequence) and returns those frequencies as a dictionary. In the solution one should not use the standard count method of Python. Test your function on, for instance, the sequence “AATAATGCCC”.
(b) Design a Python function with two parameters (a record id and a sequence) that, using a call to the function created in part (a), returns as a string a single line of text with the record id as first entry, then the 4 frequencies of the bases, and the length of the sequence of the record as last entry. The entries should be separated by a semicolon. Test your function on, for instance, the FASTA record with the id ’gi|452742789|ref|NZ_CADZ01000110.1|’.
(c) Design a function with the allEcoliDict dictionary as single argument that generates for each of the records in the dictionary a line as described in part (b) and returns all these lines as a single string.
(d) Write the output of part (c) to a file ’Ecolifreq.txt’.
(e) Ever since the early days of molecular biology, base composition has been used as a descriptive statistic for genomes of various organisms. Especially the guanine-cytosine content of bulk DNA, i.e. the fraction of C and G among all bases, has been used frequently because, even before the availability of cloning and DNA sequencing, it could already be determined by measuring the melting temperature of the DNA, since GC base pairs are more stable than AT base pairs. Design a function that generates for a record of the FASTA file a line similar to the one described in part (b), having the id and the length of the sequence, but now with the C+G content of the sequence instead of the four frequencies. Remark: It is not excluded that a sequence contains 0 bases!

Chapter 10
Program design and examples

In this chapter we consider another often used, more general repetition construct, the while. Moreover, we discuss our approach to “How to design programs”. We can, however, only touch on this last subject. There are numerous books and Bachelor and Master programmes about this topic. A (bio)medical engineering student does not need to know all details but should have some basic knowledge of this material. The approach we therefore take is to discuss general issues by using practical cases as a guide.

10.1 A more general repetition construct (while)

In Python a for loop is not the only type of looping construct available. In a for construct one has to know in advance, or be able to calculate, the number of iterations that has to be performed. So what happens when we want to keep doing a specific task until something happens, but we don’t know when that something will be? To solve this problem we have another type of loop: the while loop. An example of its usage is shown below.

    >>> value=1.0
    >>> while value <= 10:
    ...     print("Current value is: ", value)
    ...     value=value*2.7
    ...
    >>> print("After the loop the value is: ", value)

Execution of this program fragment gives:

    Current value is: 1.0
    Current value is: 2.7
    Current value is: 7.29
    After the loop the value is: 19.683

With a while construct a sequence of actions is repeated until a condition no longer holds. Stepping through the above example we have:

1. First we initialize value to 1.0. Initializing the control variable of a while loop is a very important first step, and a frequent cause of errors when left out.
2. Next we execute the while statement itself, which evaluates a boolean expression.
3. If the result is True, it proceeds to execute the indented block which follows. In our example value is less than 10, so we enter the block.
4. We execute the print statement to output the first line.
5. The next line of the block multiplies the control variable value by 2.7. In this case it is the last indented line, signifying the end of the while block.
6. We go back up to the while statement and repeat steps 2-5 with our new value of value.
7. We keep on repeating this sequence of actions until value has become 19.683. At that point the while test returns False and we skip past the indented block to the next line with the same indentation as the while statement.
8. In this case a statement is executed by which the value is printed. Since there are no other lines, the program stops.

The general form of a while is:

    initialization
    while boolean_expression:
        statement_block

Having introduced these programming constructs we shift our attention to the most important topic of this course: “How to design programs”.

10.2 Programming: problem formulation, analysis and design

The design of a program usually follows four phases that are repeated until a satisfactory solution is obtained.

1. The first step in the design of a program is always a careful analysis and, when needed, a reformulation of the problem.
This analysis also yields a list of requirements that a solution should satisfy. This list of requirements is called the problem specification. In the analysis we use mathematics to rephrase the problem and arrive at a formulation that is much more precise than in daily language. In this phase we also try to abstract from irrelevant details.
2. Having formulated the problem in a precise form, the next step is to create a sketch of a program. Remarkably, in this stage no real programming language is used; rather, the program sketch is written in pseudo code.
3. The third step is to translate this pseudo code into a real programming language. This translation is achieved by the usage of tools, libraries and templates.
4. The last step in the cycle is to determine whether the solution satisfies the requirements. In many programming texts this last phase is called testing. Since testing cannot exhaustively check all possibilities, we take a different approach by a style of programming called program derivation.

In the last phase the requirements analysis might result in looking for alternatives. Usually, in the first approach not all elements a problem consists of are dealt with. In general the problem is first split into subproblems, and these subproblems are solved using the above scheme. In a later phase the solutions of the subproblems are assembled into the definitive solution.

Please note that the above scheme is not a general theory. In fact there is no general recipe for designing programs. Programming experience of the last twenty years suggests that the only way to learn to design programs is by investigating earlier solutions and by doing. By experience and by trying to imitate the solutions of others, sometimes denoted by the term reverse engineering, a personal style can be developed.
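As a small illustration of the four phases, consider the question of how often a value has to be multiplied by a factor before it exceeds a bound, a variation on the while example of section 10.1. The sketch below is only one possible way to work this out; the function name steps_until_exceeds and its parameters are chosen here for illustration and do not occur elsewhere in these notes.

```python
def steps_until_exceeds(start, factor, bound):
    """Count the multiplications by factor needed until start exceeds bound.

    Phase 1, specification:
      precondition:  start > 0 and factor > 1 (otherwise the loop may never end)
      postcondition: the result is the smallest k >= 0
                     with start * factor**k > bound
    """
    # Phase 2, pseudo code:
    #   value := start; steps := 0
    #   as long as value has not exceeded bound:
    #       multiply value by factor and count one step
    if start <= 0 or factor <= 1:
        raise ValueError("need start > 0 and factor > 1")

    # Phase 3, translation into Python
    value = start
    steps = 0
    while value <= bound:
        value = value * factor
        steps = steps + 1
    return steps


if __name__ == "__main__":
    # Phase 4, compare the result with the requirements for a known case:
    # the loop of section 10.1 printed three "Current value" lines
    print(steps_until_exceeds(1.0, 2.7, 10))
```

Note that the precondition is checked explicitly: without it, a call with factor 1 would never terminate.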
10.3 Programming examples

10.3.1 Counting ’CGs’ in DNA strings

A typical bioinformatics problem is to design a method that determines, from a given list of strings, a sorted list of counts of the pattern ’CG’ in the items of the list. Although the problem is rather simple we use it to describe the general approach step by step.

The first step is the problem analysis and specification. Since we know how to sort a list, we simply forget the sorting requirement for the moment. Such a problem simplification is quite common and has earned a special term: separation of concerns. So our first task is to generate a list of counts of occurrences of “CG” in a list of string items. Such a state to be reached is called a postcondition and is denoted by the letter R. Since a result is to be returned we have to introduce a variable, say outl. Next we have to specify the relation between the input conditions, called the precondition, and the variable(s) in the postcondition. The input condition here is just the list l of string items, and the relation is that outl is the list of counts of the pattern “CG” in the items of l. It is always wise to write down what has been achieved so far:

    P: l is a list of string items
    name of the program/method: countList
    R: outl, the list of counts of the pattern "CG" in the items of l

Since the list may consist of several items and for each item the same procedure has to be followed, we recognize that our solution needs repetition. But be warned: in some cases the problem might suggest a repetition while a direct solution is possible. As an example consider a program to sum the first n integers. For the solution no loop is needed, since s=n(n+1)/2 is a simple solution to this problem.

The general framework of a repetition has 3 parts:

    I: initialisation
    B: body of the loop
    F: finalisation

but both the initialisation and the finalisation part might be empty.
In designing a loop we are looking for a statement that is valid independent of the number of times the loop has been repeated. Such a statement is called an invariant and can usually be derived from the postcondition. Hence in our case it should say something about the variable outl. At the end of the loop outl should contain all counts, so while still inside the loop outl holds the counts of the items considered so far. This is the invariant here.

When the loop has not yet been entered, the invariant should also hold, and that is what has to be realized by the initialisation. If outl holds the counts of the items considered so far, and we have not considered an item yet, then outl should be the empty list:

    outl=[]

It is also clear that we have to loop over all items of the list. The next step is then simply to write down what has been obtained so far:

    outl=[]
    # outl holds the counts of the items of l considered so far
    for dnas in l:
        # outl holds the counts of the items of l considered so far
        ....
        # outl holds the counts of the items of l considered so far
    # outl holds the counts of the items of l considered so far
    # and all items have been considered, hence
    # outl holds the counts of all items of l

If we consider the next item of the list we are obliged, because of the invariant, to add its count to the list:

    countCG=dnas.count("CG")
    outl.append(countCG)

and combining the several parts our solution becomes:

    outl=[]
    # outl holds the counts of the items of l considered so far
    for dnas in l:
        # outl holds the counts of the items of l considered so far
        countCG=dnas.count("CG")
        outl.append(countCG)
        # outl holds the counts of the items of l considered so far
    # outl holds the counts of all items of l

Having the list of all counts, it suffices to sort the list to obtain the desired result:

    outl.sort()

The last step is to assemble all parts into a method that properly returns the calculated list:

    def countList(l):
        """ In this method the number of occurrences of the pattern 'CG'
        in each of the string items of the list l is returned in a
        sorted list.
        """
        outl=[]
        for dnas in l:
            countCG=dnas.count("CG")
            outl.append(countCG)
        outl.sort()
        return outl

If we put these lines of code in a file, say countList.py, we finally should test our method. We could simply call the method by, for instance,

    print(countList(['AAACGCGAA', 'CCGA', 'ACCC']))

but if we did so, this output would also be produced whenever we import this file in other modules. Python has a special construct for testing a module, the test __name__=="__main__":

    if __name__=="__main__":
        print(countList(['AAACGCGAA', 'CCGA', 'ACCC']))

Only when the file countList.py is executed as the main module, i.e., called by the run command, does the boolean expression hold and is the output produced. When the file is imported by another module, nothing is printed.

The procedure described is lengthy but should be a valuable guide for designing loops in general. This is also shown in the next example.
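The invariant of this loop can also be checked while the program runs. The sketch below is a variant of countList in which the comments stating the invariant have been turned into assert statements; the name countListChecked and the use of enumerate are assumptions made for this illustration only. Such checks cost time and are meant for the design and testing phase, not for production use.

```python
def countListChecked(l):
    """Variant of countList that verifies its invariant with assert."""
    outl = []
    for i, dnas in enumerate(l):
        # invariant: outl holds the counts of the first i items of l
        assert outl == [s.count("CG") for s in l[:i]]
        outl.append(dnas.count("CG"))
    # invariant and all items considered: outl holds the counts of all items
    assert len(outl) == len(l)
    outl.sort()
    return outl


if __name__ == "__main__":
    print(countListChecked(['AAACGCGAA', 'CCGA', 'ACCC']))
```

For the test list the counts are 2, 1 and 0, so the sorted result is [0, 1, 2].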
10.3.2 All pattern positions in a string

As a second example we design a method with two strings seq and pat as parameters that returns a list containing all start positions of non-overlapping occurrences of pat in seq. We have encountered this problem before in Exercise 27, but here we discuss it along the framework introduced in this chapter.

• Specification

    pre(P): seq a string, pat a string
    name: patternPosList
    post(R): list posl holds all positions in seq where pat starts

• Analysis
Since the pattern pat may occur several times in the sequence seq we need a repetition. Eventually posl should hold all positions where pat occurs, so we again look for an invariant of the type “DBC”, meaning that we have a part we have already dealt (D) with, a part still to be (B) done, and the whole sequence seq as an object that is constant (C). In order to keep track of which part of seq we have already seen we need an additional variable, say pos. When we introduce a new variable it is advisable to state explicitly which values the new variable is allowed to take. It turns out that by adding this information simple mistakes can be avoided, and moreover it is commonly useful information in the program design phase. So the invariant I we are striving for is:

    posl holds all occurrences of pat in seq[:pos],
    seq[pos:] is still to be considered, and
    0<=pos<=len(seq)

There are always 3 issues in considering invariants:

1. finalisation
When pos=len(seq), posl indeed holds all occurrence positions.

2. init(ialisation)
An invariant should always be simple to establish when the loop has not yet been entered. In this case it is indeed simple to establish the invariant when the search is still to be started:

    pos=0
    posl=[]

3. loop and body
Here one should always start with writing down what has been achieved so far and what is to be established.
    # P
    pos=0
    posl=[]
    # I
    while B:
        # I and B
        body
        # I
    # I and not B
    # R: posl holds all positions in seq where pat starts

The not B part results of course from the fact that at that point the repetition has ended. Its importance becomes clear from the following: I and not B should guarantee that all positions in seq where pat starts are stored in posl. So given the invariant I:

    posl holds all occurrences of pat in seq[:pos],
    seq[pos:] is still to be considered, and
    0<=pos<=len(seq)

what should not B be such that R holds? It is clear that no occurrence should be left in the part seq[pos:] that is still to be considered. As many roads lead to Rome, here too we have some freedom. An obvious choice is to have pos=len(seq), but we do not directly have a clue how to increase pos every time such that the postcondition is established. When do we know that the part seq[pos:] that still has to be considered does not include the pattern pat anymore? The answer is not too difficult: when pat is not part of seq[pos:], calling seq[pos:].find(pat) returns the value -1. So we should continue looping as long as seq[pos:].find(pat) differs from -1:

    # P
    pos=0
    posl=[]
    # I
    while seq[pos:].find(pat) != -1:
        # I and B
        # in seq[pos:] pattern pat is still present
        loop
        # I
    # in seq[pos:] no pattern pat is present anymore and I
    # R: posl holds all positions in seq where pat starts

From this code it is also clear that pos has to change inside the loop; otherwise the loop would never stop. The question is of course how pos should be changed. For efficiency reasons we want to inspect each pattern’s beginning position only once when we have found it.
So the first step is to introduce a variable holding this beginning position:

    l=seq[pos:].find(pat)

Notice that find always starts counting at 0, so we should always add pos to the result of find to obtain the right position inside the original sequence seq. Hence

    l=l+pos

is needed to convert the result of find into the position in seq. And if we have this position, then a possible next (non-overlapping) occurrence of pattern pat can only start at:

    l+len(pat)

So we have

    # P
    pos=0
    posl=[]
    # I
    l=seq[pos:].find(pat)
    while l != -1:
        # I and B
        # in seq[pos:] pattern pat is still present
        # transform l into the position in the original sequence seq
        l=pos+l
        pos=l+len(pat)
        # I
        l=seq[pos:].find(pat)
    # in seq[pos:] no pattern pat is present anymore and I
    # R: posl holds all positions in seq where pat starts

The final part is to add the position l to the list posl:

    posl.append(l)

So if we include our method and its test in a file ’testpatpos.py’

    def patternPosList(seq, pat):
        """return a list of all positions in string seq
        where substring pat is found
        """
        posl=[]   # no positions found yet
        pos=0     # everything in seq from 0 to pos is done
        l=seq[pos:].find(pat)
        while l != -1:
            l=l+pos
            # new position l in seq found, hence l>=0
            posl.append(l)
            pos=l+len(pat)
            # only search in the remaining part of seq
            l=seq[pos:].find(pat)
        return posl

    if __name__=="__main__":
        print(patternPosList("AAAAAUGBBBBAUGAAAAUG", "AUG"))

and run this file with the Python interpreter, we should find as output

    [4, 11, 17]

10.4 Exercises 49–57

Exercise 49: The purpose of this exercise is to learn more about the design and usage of invariants and variant functions. Given is a bag with white and black marbles, and additionally there is a sufficiently large collection of black marbles.
Repeatedly the following actions are taken: 2 marbles are randomly drawn from the bag, if the two marbles have a different color, the white one is put back into the bag, if the two marbles have the same color, a black marble is put into the bag. Two questions have to be answered: 1. Does this repetition end, i.e., is there a moment that there is only one marble in the bag. 2. If so, can you predict the color of this last marble? Exercise 50:** Design a python function that prints a triangle of stars (’*’) with k stars as basis and k stars as height by using a while-construct. Hence in your solution a for-statement is not allowed. k should be an integer parameter of the function and between two stars a space should be printed. For k=4 the output should look as follows: * * * * * * * * * * c ph 103 Programming and genomics 2019/2020 10. Program design and examples Exercise 51: Similar as the previous exercise but now for an open triangle, i.e., the first output line has one star, lines 2 through k-1 two stars and the last line k stars. Hence, for k=4 the output should look as follows: * * * * * * * * * Exercise 52: Design a function that prints a cross consisting of 2 diagonals of k stars. k is an odd positive integer parameter of the function. For k=5 de output should be * * * * * * * * * Exercise 53:** Given is the following function def countSomething(word=’insulin resistance’): i = 0 counter = 0 jump = word.index(’n’) while i<len(word): if word[i]==’n’: counter = counter+i i = i+jump i = i+1 return counter (a) Without actually running the program, which values does i obtain when the function is called by countSomething(’diabetes patient’) (b) The same question but now for countSomething(’diabetes patient or not’) (c) Similar for countSomething(’not a diabetes patient’) (d) And finally for countSomething() c ph 104 Programming and genomics 2019/2020 10. 
Program design and examples Exercise 54:** Given is the following function: def examplerep(): print("Give 5 numbers smaller than 1000:") a=1000 b=1000 i=0 while (i<5): c=int(input("Give one value: ")) if (c<b): if (c<=a): b=a a=c i=i+1 print("My answers are: "+str(a)+" and "+str(b)) If this function is called with the following 5 integer values that are supplied on input on different lines, add then adequate output formatting statements to this function such that a table with the values of the variables in each turn of the iteration is printed. The table should hence have a column per variable and a row per iteration step. (a): 0 1 2 3 4 (b): 900 800 700 600 500 (c): 3 33 333 444 33 Exercise 55: Design in a number of steps a program that prints the word IK in the form of stars. The letter I consists of a one vertical line of k stars, while the K has a height of k lines and a width of (k + 1)/2 positions where k is an odd input parameter of at least 3. The two letters are separated by a space. (a) Design a function makeI(k) that returns the letter I as list of lines. Clearly show which invariant(s) are used. (b) Design a function makeK(k) that returns the letter K as list of lines. (c) Design a function makewoord(k) that prints the word IK that uses the methods of (a) and (b). When for example k = 5, the output would be: * * * * * c ph * * ** * ** * * 105 Programming and genomics 2019/2020 10. Program design and examples Exercise 56: Given is a list l of integers. (a) Design a Python function with l as parameter that returns the number of times of two consecutive pair of elements of l the first one is smaller than the second one. (b) Adapt your solution of part (a) such that it returns the index of the 6th pair of such elements, when they exists, otherwise return -1. Design your function in such a way that it does not perform redundant calculations. For instance in general it should not first determine all the pairs. 
The solutions should include a specification and, when a repetition is used, an invariant in accordance to the repetition should be given. Exercise 57: Let a CA+C row be defined as a DNA sequence starting with a C, followed by one or more As and ending with a C. (a) Design a function that determines the starting positions of all non-overlapping CA+C rows in the DNA sequences and that returns these indices as a list. (b) Design a function that determines the end positions of all non-overlapping CA+C rows in the DNA sequences and that returns these indices as a list. (c) Design a function that determines the starting and end positions of all non-overlapping CA+C rows in the DNA sequences and that returns these as a list of 2-tuples (, where each 2-tuple comprises the starting and end position of one CA+C row). In your solution you have to make use of your solutions of part (a) and (b). (d) Which adaptation(s) is (are) necessary when overlap of CA+C rows is allowed for? Provide specifications, analyses and invariants for all your solutions. c ph 106 Chapter 11 Classes, Excel files and boxplots In previous chapters we have written output to screen as well as to text files. Also we have seen how strings can be formatted such that the data is presented in a structured way in such a plain text format. Here we will see how one can also use different file formats instead of plain text files. In particular we focus on reading/writing Excel files. Finally we will also see how boxplots can be made to visualize data sets. But first we will consider classes. 11.1 Classes and objects We have now dealt with all basic components to introduce the main notion in object oriented programming: classes. So far we have only mentioned the notion of a class and behind the scenes so perhaps without noticing, have used them in case of the built-in methods and types. Moreover, our attention has been on the actions to be performed on the data types. 
In general, however, there is a close relation between the actions and the data elements. For instance, if we have a list of blood pressure values of a patient and another list with DNA strings from genes belonging to a certain family, then calculating the average makes sense for the former list and not for the latter, whereas for counting the number of occurrences of the nucleotide base ’A’ it is the other way around. So what is needed is a programming structure in which the close relation between the data elements and the methods to be invoked on these elements can be expressed. Object-oriented programming languages offer such a structure, called a class.

A small example

In this section we show by an example the general framework of designing a class. Suppose, for instance, that we design a database in which we want to store information about students. One element of a student is the name of the student. As an example we could create a file ’student.py’ having the following content:

    class Student:
        def setName(self, stname):
            self.name=stname
            print("Name of student '"+self.name+"' set")

If we have such a class definition we can create an object of this class by using its name followed by parentheses. So by

    stu1=Student()
    stu2=Student()

we have created two new Student objects. To perform actions on an object of a class we have to use the methods defined for the objects of the class. In this example we have defined one such method, setName. To use a method on an object, in object-oriented jargon calling a method, one has to give the name of the object, followed by a ., and then the name of the method and the arguments: objectname.methodname(arguments). The argument list is, however, to be given in a special way. The special thing about calling a method is that the object itself is passed as the first argument.
In our example, in the call

    stu1.setName("Tobe OrNotToBe")

the object stu1 is passed as first argument to the setName method, while the string "Tobe OrNotToBe" becomes the second argument. So the parameter self gets as value stu1 and the parameter stname the value "Tobe OrNotToBe". Hence stu1.setName("Tobe OrNotToBe") is exactly equivalent to Student.setName(stu1, "Tobe OrNotToBe"). In general, calling a method with a list of n arguments is equivalent to calling the corresponding function with an argument list that is created by inserting the method’s object before the first argument.

So the result of

    stu1.setName("Tobe OrNotToBe")

is that the data attribute name of the object stu1 gets the value "Tobe OrNotToBe", while the text

    Name of student 'Tobe OrNotToBe' set

is shown on output. So if we subsequently enter the statement

    print("The name of this student is", stu1.name)

then the text

    The name of this student is Tobe OrNotToBe

should appear.

Extending this class with other data attributes and methods is straightforward. If we create a file “student.py” with the following contents:

    class Student:
        def setName(self, stname):
            self.name=stname
            print("Name '"+self.name+"' set")

        def setStudentId(self, stid=0):
            self.studid=stid
Classes, Excel files and boxplots print("Student id:", self.studid, "set") def printvals(self): print(self.studid) print(self.name) then we could use this file and the class defined in it by: >>> import student >>> stu=student.Student() >>> stu.setName("Tobe OrNotToBe") Name ’Tobe OrNotToBe’ set >>> print("The name of this student is", stu.name) The name of this student is Tobe OrNotToBe >>> stu.setStudentId(12345) Student id: 12345 set >>> print("The id of the student is", stu.studid) The id of the student is 12345 >>> stu.printvals() 12345 Tobe OrNotToBe So generalizing this example, in designing programs we have to split up our problem in classes and to define in each of the classes the desired methods, such as setName, setStudentId, and printvals in the class Student, and the data attributes, such as name and studid in the class Student. Creation of an object of the class, in computer jargon object instantiation, has been realized by a function-like call, by writing the name of the class followed by parentheses. The object created in this way has no data attributes yet and is an “empty” object. There is, however, a standard method to create objects having data attributes already at instantiation. This method is shown in the next subsection. The built-in init method The instantiation operation Student(), “calling” a class object, creates an empty object. Many classes like to create objects in a pre-defined state. To that end a class may define a special method named __init__(), like this: def __init__(self): self.name = ’’ self.studid=0 This special method is also often called the constructor. When a class defines an __init__() method, class instantiation automatically invokes __init__() for the newly-created class instance. So in this example, a new, initialized instance can be obtained by: stu=student.Student() and then automatically the data attributes are constructed. c ph 109 Programming and genomics 2019/2020 11. 
Classes, Excel files and boxplots Of course, the __init__() method may have arguments for greater flexibility. In that case, arguments given to the class instantiation operator are passed on to __init__(). For example, >>> class Student: ... def __init__(self, stname=’’, stdid=0): ... self.name = stname ... self.studid = stdid ... >>> stu = Student("Tobe OrNotToBe", 12345) >>> print(stu.name, stu.studid) Tobe OrNotToBe 12345 11.2 Excel files in Python In previous chapters we have seen that using python it is possible to read and write text files, and that this can be used for instance to process data. Another common file type to store (medical) data is given by Excel files (like .xlsx files). Importing openpyxl Excel files can also be read and written from Python using for instance the openpyxl library, which can be loaded using import openpyxl Reading Excel files If the library is installed, opening a file Munster2mets.xlsx works as follows import openpyxl wb = openpyxl.load_workbook(filename = ’Munster2mets.xlsx’) where the file name is thus provided as a key word parameter. The variable wb now contains a openpyxl.workbook.Workbook object that may contain multiple work sheets (data attributes) and some methods. Each work sheet may contain again cells, ordered in rows and columns, in which the actual data is stored. To get the current work sheet and determine the number of rows containing data one can use the commands ws = wb.active nrrows = ws.max_row Accessing the cell in the i-th row and j-th column and the data stored in that cell can be done using cell = ws.cell(row=i,column=j) data = cell.value # get the cell # get the actual data Unfortunately, the row and column numbers start at 1, instead of the 0 we are used to in Python. The above two line can also be combined to data = ws.cell(row=i,column=j).value c ph 110 # directly access the data Programming and genomics 2019/2020 11. 
Classes, Excel files and boxplots In a small example we will show how the data in the Excel file shown in Figure 11.1a can be plotted. First we read the Excel file and store the data of the first column in Figure 11.1: a) Screen shot of the file Munster2mets.xlsx in Microsoft Excel. b) Scatter plot of the data in the Excel file. a list x. Subsequently we use Matplotlib to plot the data. When plotting the data we ignore the first element as this contains, as apparent from Figure 11.1a, a comment import openpyxl wb = openpyxl.load_workbook(filename = ’Munster2mets.xlsx’) ws = wb.active nrrows = ws.max_row x = [] for i in range(1,nrrows+1): x.append(ws.cell(row=i,column=1).value) import matplotlib.pyplot as plt plt.plot(x[1:],’r*-’) If we now also want to use the data in the second column, we could repeat the for loop for column=1 and thus add e.g. the lines y = [] for i in range(1,nrrows+1): y.append(ws.cell(row=i,column=2).value) or merge the two loops to x = [] y = [] for i in range(1,nrrows+1): x.append(ws.cell(row=i,column=1).value) y.append(ws.cell(row=i,column=2).value) Both solutions however have the drawback that code is repeated. A nicer solution would be the use of functions as we considered in the beginning of this chapter. We could then define a function readColumn(ws, colnr), with the work sheet and requested column c ph 111 Programming and genomics 2019/2020 11. Classes, Excel files and boxplots number as arguments, which returns the values in the requested column as a list def readColumn(ws, colnr): nrrows = ws.max_row l = [] for i in range(1,nrrows+1): l.append(ws.cell(row=i,column=colnr).value) return l x = readColumn(ws, 1) y = readColumn(ws, 2) This of course becomes more and more advantageous as the number of columns to be processed would increase. 
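In the example above the first row of each column holds a header rather than a measurement, and openpyxl reports the value of an empty cell as None. A small helper can separate the header from the data values. The sketch below is only an illustration: the function name splitHeader and the sample values are hypothetical, and the list col stands in for what readColumn would return.

```python
def splitHeader(column):
    """Split a column (given as a list) into its header and its data values.

    The first item is taken as the header; empty cells (None) are skipped.
    """
    header = column[0]
    values = [v for v in column[1:] if v is not None]
    return header, values

# Hypothetical column contents, standing in for readColumn(ws, 1)
col = ["weight", 72.5, 80.1, None, 65.0]
label, data = splitHeader(col)
print(label)
print(data)
```

Note that dropping empty cells changes the length of the list, so for a scatter plot of two columns the empty rows would have to be removed from both columns together.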
The complete code to make a scatter plot of the data in the first column versus the data in the second column (as in Fig. 11.1b) then may look like:

    import openpyxl
    import matplotlib.pyplot as plt

    def getWorksheet(xlsfilename):
        wb = openpyxl.load_workbook(filename = xlsfilename)
        ws = wb.active
        return ws

    def readColumn(ws, colnr):
        nrrows = ws.max_row
        l = []
        for i in range(1,nrrows+1):
            l.append(ws.cell(row=i,column=colnr).value)
        return l

    ws = getWorksheet('Munster2mets.xlsx')
    x = readColumn(ws, 1)
    y = readColumn(ws, 2)
    plt.plot(x[1:],y[1:],'r*')
    plt.xlabel(x[0])
    plt.ylabel(y[0])

Writing Excel files

Apart from reading Excel files, openpyxl also allows for generating and editing such files. A new empty workbook can be generated using

    import openpyxl
    wb = openpyxl.Workbook()    # generate a new empty workbook
    ws1 = wb.active             # select the first (and only) work sheet
    ws1.title = "Test"          # change the name of the work sheet

Data can be added to the work sheets in different manners. The first way is analogous to the way we read data above, i.e., by accessing cells by row and column number and accessing their value field, e.g.:

    for row in range(10, 20):
        for col in range(7, 34):
            cell = ws1.cell(column=col, row=row)
            cell.value = row-col

Alternatively, the cells can also be accessed by their name in an Excel-like fashion, i.e., using one (or multiple) letter(s) for the column followed by the row number. Adding the value pi to a new work sheet (named 'Pi') in the 6-th column on the 5-th row is done by

    ws2 = wb.create_sheet(title="Pi")
    cell = ws2['F5']
    cell.value = 3.14

As a third method it is also possible to add data to multiple cells at once using the append command. If a list l is provided as an argument, a new row is added to the work sheet and the items of the list are written to the first len(l) columns in that row.

    ws3 = wb.create_sheet(title="Data")
    for row in range(1, 40):
        ws3.append(['He','Ho',row])

The Excel file can be written to disk using the save method of the workbook

    wb.save(filename = 'testbook.xlsx')

Additional information on openpyxl

Additional information on openpyxl is available from the package documentation at https://pypi.python.org/pypi/openpyxl.

11.3 Boxplot

A standardized way of displaying the distribution of data is by means of a boxplot (also known as a box-and-whisker diagram or plot). Such a plot is based on the five number summary of the data: minimum, first quartile, median, third quartile, and maximum. In a simple boxplot, see Figure 11.2a, a central rectangle spans the first quartile to the third quartile (the interquartile range or IQR), a line inside the rectangle shows the median and 'whiskers' above and below the box show the locations of the minimum and maximum.

Real (medical) datasets will often display surprisingly high or surprisingly low values called outliers. In order to prevent such outliers from stretching the whiskers, the length of these whiskers is often limited to a multiple of the IQR. Data points outside this range are then explicitly drawn, as illustrated in Figure 11.2b.

Figure 11.2: Boxplots: a) Simple boxplot indicating the five number summary (minimum, first quartile, median, third quartile, maximum), b) boxplot with visualization of outliers (whiskers limited to 1.5 IQR).

In Python, boxplots can be made using the same library that we used before to plot data, i.e. using matplotlib. Given a list l, a boxplot showing outliers as in 11.2b is created using

    import matplotlib.pyplot as plt
    plt.boxplot(l)

Apart from a single positional parameter, i.e., the list of data to be plotted, the boxplot function has over 20 keyword parameters with default values that can be used to control the appearance of the result. For instance, adding vert=False results in a horizontal boxplot instead of the default vertical one, and whis=10 increases the maximal whisker length (from its default 1.5) to 10 times the IQR. Using the latter, the number of points considered as outliers decreases, possibly to zero, thus resulting in a simple boxplot as in 11.2a.

    import matplotlib.pyplot as plt
    plt.boxplot(l, whis=10)

For an extensive overview of all options we refer to the pyplot website http://matplotlib.org/api/pyplot_api.html.

Multiple data sets

In many cases one may want to compare different data sets. For instance, data on a metabolite in a group of patients before a treatment and after a treatment, or between a group of male patients and a group of female patients, etc. This can be done by providing the boxplot function with a list of lists as first parameter:

    import matplotlib.pyplot as plt
    plt.boxplot([list1, list2])

The data in list1 is then used for the first box with whiskers and the data in list2 for the second, which is drawn in the same figure at the right hand side of the first as in 11.3a.

Figure 11.3: Boxplots with more than 1 box with whiskers: a) boxplot with integers at the ticks on the x-axis indicating order of the data sets in the list provided as parameter to the boxplot function, b) same boxplot with labels at the x-ticks.
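To make the role of the whis parameter concrete, the quartiles, the whisker limits and the outliers can also be computed by hand. The following is a minimal sketch using only the standard statistics module (Python 3.8 or later); the function name boxplotStats and the data are made up for the illustration, and matplotlib's own computation may differ in detail.

```python
import statistics

def boxplotStats(data, whis=1.5):
    """Return quartiles, whisker limits and outliers for a list of numbers."""
    # method='inclusive' interpolates linearly between the data points
    q1, median, q3 = statistics.quantiles(data, n=4, method='inclusive')
    iqr = q3 - q1                # interquartile range
    lower = q1 - whis * iqr      # lowest value a whisker may reach
    upper = q3 + whis * iqr      # highest value a whisker may reach
    outliers = [v for v in data if v < lower or v > upper]
    return q1, median, q3, lower, upper, outliers

values = [1, 2, 3, 4, 5, 6, 7, 8, 9, 100]   # made-up data with one extreme value
q1, median, q3, lower, upper, outliers = boxplotStats(values)
print(q1, median, q3)
print(lower, upper)
print(outliers)
print(boxplotStats(values, whis=30)[5])
```

With the default factor 1.5 the value 100 lies above the upper limit and is reported as an outlier, while with a sufficiently large factor the list of outliers becomes empty, corresponding to the simple boxplot.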
Pimp your plot

In the boxplot created in the previous paragraph, the two boxes were labelled by two integers at the ticks on the x-axis, indicating the order of the corresponding datasets in the list provided as parameter to the boxplot function. By first creating a figure object, connecting an axes object to it, and creating the boxplot on these axes:

    import matplotlib.pyplot as plt
    fig = plt.figure()
    ax = fig.add_subplot(1, 1, 1)
    ax.boxplot([list1, list2], notch=True)
    ax.set_xticklabels( ['label1', 'label2'] )

any strings can be placed at the x-ticks using the set_xticklabels method on the axes object, resulting in Figure 11.3b. Here, some extra emphasis is given to the median by adding notch=True to the parameters given to the boxplot function.

11.4 Exercises 58–61

Exercise 58:**
Design a class Gene with two data attributes: gene_symbol, and gene_name.
(a) Design a method of this class where gene_symbol should have as default value "INS" and gene_name "insulin", respectively. Give examples of object instantiations with a varying number of default values and with other values as arguments.
(b) Design a method print_geneinfo by which the contents of the data attributes are printed.

Exercise 59:
(a) Create a class Atom and a constructor having two parameters that are set (use __init__). The two parameters are atomName and atomWeight.
(b) Create another class Molecule, which has one attribute, a list of atoms. This attribute should be empty at initialisation.
(c) Design a method addAtom for the Molecule class, which creates a new atom and adds this to the list. The method addAtom should have two parameters: atomName and atomWeight.
(d) Design a method calculateWeight for the Molecule class, which calculates the weight of the molecule.

Exercise 60:**
In this exercise we consider an Excel file that contains some clinical measurements on a number of patients. The first two rows contain some info on the file contents and the headers of the different columns, respectively. Subsequently, it contains a single line of data per patient.
(a) Write a function with a file name as single parameter. The function should first check whether the specified file name ends with '.xlsx'. If not, the function should return None. Otherwise, the specified Excel file should be read and the workbook should be returned. Apply your function to the provided Excel file health.xlsx.
(b) Write a function that extracts the data from a specified column of the active work sheet in a workbook and returns the values in that column in a list. The function should have two parameters: as first parameter a workbook object and as second parameter a keyword parameter specifying the desired column number, which should have default value 1.
(c) Write a function with two parameters, i.e., as first parameter a list and as second parameter a float, which returns a new list with all elements of the input list multiplied by the second parameter.
(d) Use your functions defined in parts (a), (b) and (c) to extract the information on patient weight and height from the file health.xlsx. Make a scatterplot of that data with the weight in kilograms on the x-axis and the height in meters on the y-axis.
(e) Calculate for all patients their BMI (weight/height**2), with weight in kilograms and height in meters.
(f) Create using Python a new workbook with only information for the male patients. For each male patient only the name, height, weight and BMI should be stored in the first 4 columns, respectively. Write this file as bmi male.xlsx.

Exercise 61:
(a) Use the functions written in parts (a) and (b) of the previous exercise to read the Excel file health.xlsx and obtain 3 equally long lists: one with the genders, one with the cholesterol levels, and one with the weights.
(b) Write a function divideByGender with two lists as parameters. A first list with the gender of a series of patients and a second equally long list with arbitrary other data on the corresponding patients. One may assume that the gender list only contains items with the values 'F' for female and 'M' for male patients. The function should return two lists: the first with the data of the female patients, the second with the data of the male patients.
(c) Make a scatter plot of the cholesterol levels versus the weights (as in Figure 11.4a), where the male data are shown in red and the female data in blue.
(d) Make a boxplot of the cholesterol level where the data is shown for all patients, for the female patients as well as for the male patients (as in Figure 11.4b).

Figure 11.4: Examples of a scatter and a box plot of data from the Excel file.

Chapter 12 Graphical user interfaces

In this chapter we will introduce the tkinter system by which graphical user interfaces can be built. We will show that adding widgets to a GUI is in general straightforward.

12.1 A first window

The top level window

Programming a GUI is exactly like any other kind of programming: sequences, loops, branches and modules can be used just as before. The first step is defining the main object by which the windows of the program are managed.

    >>> import tkinter

This is the first requirement of any tkinter program: import the names of the widgets.

    >>> top = tkinter.Tk()

This creates the top level widget in our widget hierarchy. All other widgets will be created as children of this widget. In order to make the widget visible we have to enter the tkinter event loop:

    >>> top.mainloop()

The mainloop, the so-called event loop, handles the events from the user (such as mouse clicks and key presses) or the windowing system (such as redraw events and window configuration messages), and it also handles operations queued by tkinter itself.
This also means that the application window will not appear before you enter the main loop. So a minimal tkinter program has at least these three lines

    >>> import tkinter
    >>> top = tkinter.Tk()
    >>> top.mainloop()

When this program is run, see Figure 12.1a, the top level window automatically comes furnished with widgets to minimize, maximize, and close the window. Clicking on the "close" widget (the "x" in a box, at the right of the title bar) generates a "destroy" event. The destroy event terminates the main event loop, and since there are no statements after top.mainloop(), the program has nothing more to do, and ends. Also note that the program will stay in the event loop until we close the window.

Figure 12.1: Two first windows: a) with standard title, b) with a user defined title.

The title of the window

In the previous example the title bar had the default value 'Tk'. If we build our own user interface it is possible to give the window another title.

    import tkinter
    top = tkinter.Tk()
    top.title("This is the top window")
    top.mainloop()

If we run this program a window will appear, see Figure 12.1b, but possibly not the whole text on the title bar is visible. The reason is that the tkinter system has some predefined values for the initial size of the window, and this is our first example in which the flexible role of object initialisation is clearly shown. This size can be changed by enlarging the window by clicking the left mouse-button in one of the corners and sliding the corner to the required size, but there is an alternative.

The size of the top level window

The geometry method is applicable to the top level window widget and sets the size of the window. If top is the Tk object then

• top.geometry(gstr)

is the method by which the size of a top-level window is set. The geometry string gstr must have the form "wxh", where the w and h parts give the window width and height in pixels. They are separated by the character "x". Example:

    import tkinter
    top = tkinter.Tk()
    top.title("This is the top window")
    top.geometry("300x200")
    top.mainloop()

producing a window as shown in Figure 12.2a.

Figure 12.2: Two windows a) one with a user defined title and size, and b) one with additionally a different background color.

The background color of the top level window

So far our windows have the same grey background color. A GUI toolkit should obviously have methods to add color to widgets. To control the appearance of a widget, options rather than method calls are used. Typical options include color, height and width. To deal with options, all core widgets implement the same configuration interface, using again keyword arguments.

• configure(option=value, ...)

One of the options is "background", to define the background color of the window. In tkinter there are two general ways to specify colors:

• The colors "white", "black", "red", "green", "blue", "cyan", "yellow", and "magenta" are available. Other names may work, but depend on the installation. Example:

    import tkinter
    top = tkinter.Tk()
    top.title("This is the top window")
    top.geometry("300x200")
    top.configure(background="red")
    top.mainloop()

The window being produced by running this example is shown in Figure 12.2b.

• The other possibility is to use a string specifying the proportion of red, green, and blue in hexadecimal digits: #rrggbb. For example, "#000000" is black, "#ff0000" is red, and "#00ffff" is pure cyan (green plus blue). So to obtain the same result as in the previous example we could have written

    import tkinter
    top = tkinter.Tk()
    top.title("This is the top window")
    top.geometry("300x200")
    top.configure(background="#ff0000")
    top.mainloop()

These methods all apply to the top level widget.
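Both the geometry string and the #rrggbb color string are ordinary Python strings, so they can be built from numbers when the size or color is computed by the program. The following is a small sketch; the helper names geometryString and rgbToHex are hypothetical and not part of tkinter.

```python
def geometryString(width, height):
    """Build a geometry string of the form "wxh" from a width and height in pixels."""
    return str(width) + "x" + str(height)

def rgbToHex(red, green, blue):
    """Build a "#rrggbb" color string from three integers in the range 0..255."""
    for part in (red, green, blue):
        if not 0 <= part <= 255:
            raise ValueError("color parts must be between 0 and 255")
    # {:02x} writes a number as two lowercase hexadecimal digits
    return "#{:02x}{:02x}{:02x}".format(red, green, blue)

print(geometryString(300, 200))
print(rgbToHex(255, 0, 0))
print(rgbToHex(0, 255, 255))
```

In a tkinter program the results can be passed on directly, for instance as top.geometry(geometryString(300, 200)) and top.configure(background=rgbToHex(255, 0, 0)).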
In general our top level window should have additional components like buttons, images and text. After an intermezzo about GUI programming in general we introduce some of these components in building a realistic GUI using tkinter.

12.2 The four basic GUI-programming tasks

Before introducing the components a window can contain we give some attention to designing graphical user interfaces in general. When a user interface (UI) is designed, there is a standard set of tasks that must be accomplished.

• It must be specified how the UI should "look". That is, we must write code that determines what the user will see on the computer screen.
• It must be specified what actions are to be done when the UI is used. That is, we must write routines that accomplish the tasks of the program.
• We must associate the "looking" with the "doing". That is, we must write code that associates the things that the user sees on the screen with the routines that have been written to perform the program's tasks.
• Finally, we must write code that sits and waits for input from the user.

GUI programming has some special jargon associated with these basic tasks.

• We specify how we want a GUI to look by describing the "widgets" that we want it to display, and their spatial relationships (i.e., whether one widget is above or below, or to the right or left, of other widgets). The word "widget" is a nonsense word that has become the common term for "graphical user interface component". Widgets include things such as windows, buttons, menus and menu items, icons, drop-down lists, scroll bars, and so on.
• The routines that actually do the work of the GUI are called "callback handlers" or "event handlers". "Events" are input events such as mouse clicks or presses of a key on the keyboard. These routines are called "handlers" because they "handle" (that is, respond to) such events.
• Associating an event handler with a widget is called "binding". Roughly, the process of binding involves associating three different things:
  – a type of event (e.g. a click of the left mouse button, or a press of the ENTER key on the keyboard),
  – a widget (e.g. a button), and
  – an event-handler routine.
  For example, we might bind (a) a single-click of the left mouse button on (b) the "CLOSE" button/widget on the screen to (c) the "closeProgram" routine, which closes the window and shuts down the program.
• The code that sits and waits for input is called the "event loop". Above we already have seen the name of the event loop method in tkinter, i.e., the "mainloop" method of the top object.

As the mainloop runs, it waits for events to happen in top. If an event occurs, then it is handled and the loop continues running, waiting for the next event. The loop continues to execute until a "destroy" event happens to the top level window. A "destroy" event is one that closes a window. When top is destroyed, the window is closed and the event loop is exited.

12.3 The label widget

For building our graphical user interface we introduce a class. The main reason to use classes is to simplify the design of the program. A program that is structured into classes is, especially if it is a very large program, much easier to understand than one that is unstructured. Another important consideration is that structuring your application as a class helps to avoid the use of global variables, i.e., variables that are not defined inside a class, but that are accessible in all program parts. Because it leads to messy ("spaghetti") programs, frequent use of global variables is considered poor programming. A much better way is to use instance (that is, "self.") variables, and for that our application must have a class structure.

Usually a GUI contains some text. If the text is simple an object from the Label (widget) class can be used:
    import tkinter

    class MyApp:
        def __init__(self, parent):
            self.myParent = parent  # always keep a reference to the parent
                                    # widget so parameters of the parent
                                    # can be used
            self.label=tkinter.Label(parent,
                    text="A GUI for inspecting gene descriptions")

    root = tkinter.Tk()
    root.title("This is the top window")
    root.geometry("300x200")
    myapp = MyApp(root)
    root.mainloop()

Figure 12.3: Three windows with labels: a) a window with an invisible label, b) a window with the text close to the top, and c) a window with the text close to the bottom.

Our aim is to create a label widget that is a child widget of root and that displays the text "A GUI for inspecting gene descriptions". Notice that because tkinter object constructors tend to have many parameters (each with default values) it is usual to use the named parameter technique of passing arguments to tkinter objects.

If the above program is run, perhaps surprisingly, this text is not shown, as demonstrated in Figure 12.3a. This is because we have to explicitly instruct the system where to place the widget. Three so-called layout managers (or geometry managers) are predefined. Here we are going to use the simplest one, called pack. With pack the widget packs itself into its parent. The options for packing a widget against either the parent's wall or a previous widget with the same packing are TOP, LEFT, RIGHT, and BOTTOM, where the default value is TOP. Hence

    import tkinter

    class MyApp:
        def __init__(self, parent):
            self.myParent = parent  # always keep a reference to the parent
                                    # widget so parameters of the parent
                                    # can be used
            self.label=tkinter.Label(parent,
                    text="A GUI for inspecting gene descriptions")
            self.label.pack()

    root = tkinter.Tk()
    root.title("This is the top window")
    root.geometry("300x200")
    myapp = MyApp(root)
    root.mainloop()

should give a window as in Figure 12.3b with the text close to the title bar, while

    import tkinter

    class MyApp:
        def __init__(self, parent):
            self.myParent = parent  # always keep a reference to the parent
                                    # widget so parameters of the parent
                                    # can be used
            self.label=tkinter.Label(parent,
                    text="A GUI for inspecting gene descriptions")
            self.label.pack(side=tkinter.BOTTOM)

    root = tkinter.Tk()
    root.title("This is the top window")
    root.geometry("300x200")
    myapp = MyApp(root)
    root.mainloop()

should show a window as in Figure 12.3c with the text close to the bottom of the window.

In the above programs the term tkinter. occurs at all places where we have to refer to objects and their methods from the tkinter module. Although this is the strongly recommended way of referring to these objects and methods, in Python GUI programming it is more common to use a somewhat shorter notation, i.e., the term tkinter. is omitted. In order to obtain a syntactically correct program we should use another way of importing instead of import tkinter:

    from tkinter import *

meaning that everything from tkinter is to be imported. With this notation the last program becomes:

    from tkinter import *

    class MyApp:
        def __init__(self, parent):
            self.myParent = parent  # always keep a reference to the parent
                                    # widget so parameters of the parent
                                    # can be used
            self.label=Label(parent,
                    text="A GUI for inspecting gene descriptions")
            self.label.pack(side=BOTTOM)

    root = Tk()
    root.title("This is the top window")
    root.geometry("300x200")
    myapp = MyApp(root)
    root.mainloop()

12.4 The button widget

With the class Button we create a new widget called button. Similarly as with Label we use

    bA = Button(top,text="base A")
    bT = Button(top,text="base T")

The class Button takes the parent window as the first argument. As we will see later other objects may also act as parents. The rest of the arguments are passed by keyword and are all optional. Again the buttons first have to be packed to make them visible.

    bA.pack()
    bT.pack()

Notice that after the first command the button is placed in the window. When the second button is packed the window is expanded to accommodate it. The default TOP stacks them vertically in the order in which they were packed. The result is shown in Figure 12.4a.

Figure 12.4: Two windows with buttons: a) one with 2 buttons close to the top, and b) with two buttons packed horizontally.

If we would use

    bA.pack(side=LEFT)
    bT.pack(side=LEFT)

then the window looks like in Figure 12.4b. In practice the pack geometry manager is generally used in one of these two modes to place a set of widgets in either a vertical column or horizontal row.

Our buttons look a little squished. We can fix that by packing them with a little padding. "padx" adds pixels to the left and right and "pady" adds them to the top and bottom (Figure 12.5a).

    bA.pack(side=LEFT, padx=10)
    bT.pack(side=LEFT, padx=20)

Figure 12.5: Two windows with buttons: a) one window with two buttons packed horizontally with some extra space, and b) a window with an additional frame to pack one label on top of three buttons.

12.5 The frame widget

A frame is a widget whose sole purpose is to contain other widgets.
Groups of widgets, whether packed or placed in a grid, may be combined into a single Frame. Frames may then be packed with other widgets and frames. As an example we place a label over 3 buttons in a row. We first pack the buttons into a frame horizontally and then pack the label and frame vertically in the window.

    from tkinter import *

    class MyApp:
        def __init__(self, parent):
            self.myParent = parent  # always keep a reference to the parent
                                    # widget so parameters of the parent
                                    # can be used
            # Create and pack label
            self.l = Label(parent, text="A label above the buttons")
            self.l.pack()
            # Create and pack frame with 3 buttons
            self.frame=Frame(parent)
            f=self.frame
            self.bA = Button(f,text="base A")
            self.bT = Button(f,text="base T")
            self.bG = Button(f,text="base G")
            self.bA.pack(side=LEFT)
            self.bT.pack(side=LEFT)
            self.bG.pack(side=LEFT)
            f.pack()

    win = Tk()
    myapp=MyApp(win)
    win.mainloop()

The result is shown in Figure 12.5b. Our button widgets have only been used to display text so far. The next step is of course to have some action coupled to clicking on a button.

12.6 Bringing the buttons to life

We start with the button definition as introduced in the previous sections.

    from tkinter import *

    class MyApp:
        def __init__(self, parent):
            self.myParent = parent  # always keep a reference to the parent
                                    # widget so parameters of the parent
                                    # can be used
            self.myContainer1 = Frame(parent)
            self.myContainer1.pack()
            self.buttonA = Button(self.myContainer1, text="base A")
            self.buttonA.pack(side=LEFT)

    root = Tk()
    root.title("Button A example")
    myapp = MyApp(root)
    root.mainloop()

The result of running this program is shown in Figure 12.6a. When the button is clicked, it is highlighted and appears pressed, but it does not do anything. As we have seen, widgets are objects and have methods. We have been using their pack method. Now we use the earlier introduced configure method to associate some action.
    from tkinter import *

    class MyApp:
        def __init__(self, parent):
            self.myParent = parent
            self.myContainer1 = Frame(parent)
            self.myContainer1.pack()
            self.buttonA = Button(self.myContainer1)
            self.buttonA.configure(text="base A", bg="blue", fg="yellow")
            self.buttonA.configure(command=self.butA)
            self.buttonA.pack(side=LEFT)

        def butA(self):
            print("Button base A has been pushed")

    root = Tk()
    root.title("Button A example")
    myapp = MyApp(root)
    root.mainloop()

Figure 12.6: Two windows with a single button: a) an inactive button in default colors, and b) a custom colored button that is (though not visible in this figure) also connected to an event handler.

Buttons are tied to callback functions using the parameter command, either when the button is created or with configure. The window shown when running this program is depicted in Figure 12.6b. Every time we click button "base A", the message "Button base A has been pushed" is printed.

Remember that the argument supplied to the command keyword is a function object, here simply the name of a method written without parentheses. This should always hold: command must receive the function itself and not the result of calling it, since the action has to be taken later, when the button is clicked.

The callback and lambda forms

Suppose we have to design a DNA calculator with 4 buttons "A", "C", "G", and "T". When we push a button, the corresponding base name should be printed. The action for pushing button "A" has been given above, and a straightforward but not very elegant solution is to define for instance

    def butT(self):
        print("Button base T has been pushed")

as callback for button T, and similarly for C and G. What we want is one single method that is called with the base of the button as parameter:
    def but(self, b):
        print("Button base "+b+" has been pushed")

One might try the following:

    self.buttonA.configure(command=self.but('A'))

If we do so, the string 'Button base A has been pushed' is printed immediately, because self.but('A') is called when this configure statement is executed, i.e. while the MyApp object is being created. As a result, command receives the return value of this call, i.e. None, and this is clearly not the function object we were expecting. So when the button is pushed, nothing happens.

So we need a way to turn an expression into a function. Python indeed has such a facility: the lambda form. Lambda forms can be used wherever function objects are required. They are syntactically restricted to a single expression. Semantically, they are just syntactic sugar for a normal function definition.

    >>> def exampleLambda(c):
    ...     print('Some text', c)
    ...
    >>> fa = (lambda : exampleLambda('aaaa'))
    >>> fa
    <function <lambda> at 0xb7d2b5dc>
    >>> fa()
    Some text aaaa

Lambda forms can also use values from the containing namespace. For example:

    >>> def make_inc(n):
    ...     return lambda x: x + n
    ...
    >>> f = make_inc(4)   # f is now (lambda x: x + 4)
    ...                   # in math terms f(x)=x+4
    >>> f
    <function make_inc.<locals>.<lambda> at 0xb7f63b8c>
    >>> f(0)
    4
    >>> f(1)
    5

So in our example we write:

    from tkinter import *

    class MyApp:
        def __init__(self, parent):
            self.myParent = parent
            self.myContainer1 = Frame(parent)
            self.myContainer1.pack()
            self.buttonA = Button(self.myContainer1, text="base A")
            self.buttonA.grid(row=0, column=0)
            self.buttonA.configure(command=(lambda : self.but("A")))
            self.buttonT = Button(self.myContainer1, text="base T")
            self.buttonT.grid(row=1, column=1)
            self.buttonT.configure(command=(lambda : self.but("T")))

        def but(self, b):
            print("Button base "+b+" has been pushed")

    root = Tk()
    root.title("Buttons A and T, lambda form")
    myapp = MyApp(root)
    root.mainloop()

If one of the buttons is pushed, the correct text is indeed printed.
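For the DNA calculator with four buttons it is tempting to create the lambda forms in a loop over the bases. The following is a minimal sketch of what to watch out for, under stated assumptions: the function but is a stand-alone stand-in for the method but of MyApp that returns its message instead of printing it, and no window is opened, so the behaviour can be checked without tkinter. The names late and bound are only illustrative.

```python
def but(b):
    # Stand-in for MyApp.but (assumption): return the message instead of printing it
    return "Button base " + b + " has been pushed"

# Pitfall: each lambda looks up 'base' only when it is called,
# i.e. after the loop has finished, when 'base' is already 'T'.
late = [lambda: but(base) for base in "ACGT"]

# Remedy: store the current value of 'base' as a default argument,
# which is evaluated at the moment the lambda is created.
bound = [lambda b=base: but(b) for base in "ACGT"]

late_msgs = [f() for f in late]
bound_msgs = [f() for f in bound]

print(late_msgs)    # four times the message for base T
print(bound_msgs)   # one message per base A, C, G, T
```

In a GUI, each element of bound could be passed as the command of the corresponding button, for example command=(lambda b=base: self.but(b)) inside the loop that creates the buttons.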
12.7 The entry widget

The purpose of an Entry widget in tkinter is to let the user see and modify a single line of text. Just as in the case of buttons, we need some way to communicate with the entry widget, in this case to set and retrieve text. This is done with a special tkinter object called a StringVar, which simply holds a string of text and allows us to set its contents (with set) and read them (with get).

    from tkinter import *

    class MyApp:
        def __init__(self, parent):
            self.myParent = parent
            self.myContainer1 = Frame(parent)
            self.myContainer1.pack()
            self.v = StringVar()
            self.e = Entry(self.myContainer1, textvariable=self.v)
            self.e.pack()

    win = Tk()
    win.title("An example of an entry widget")
    win.geometry('400x50')
    win.configure(bg='#ff00ff')
    myapp = MyApp(win)
    win.mainloop()

The result of this program is shown in Figure 12.7.

Figure 12.7: A window with an entry widget.

When we type "ACG" into the entry, we can retrieve it from the linked StringVar object by calling its get() method. This is shown in the following example, where we introduce a button whose purpose is to print the contents of the entry.

    from tkinter import *

    class MyApp:
        def __init__(self, parent):
            self.myParent = parent
            self.myContainer1 = Frame(parent)
            self.myContainer1.pack()
            self.v = StringVar()
            self.e = Entry(self.myContainer1, textvariable=self.v)
            self.e.pack()
            self.buttonS = Button(self.myContainer1, text="Show entered text")
            self.buttonS.configure(bg='blue', fg='yellow')
            self.buttonS.configure(command=self.showEntryVal)
            self.buttonS.pack()

        def showEntryVal(self):
            print(self.v.get())

    win = Tk()
    win.title("An example of an entry widget")
    win.geometry('400x50')
    win.configure(bg='#0000ff')
    myapp = MyApp(win)
    win.mainloop()

giving the result shown in Figure 12.8.

Figure 12.8: A window with an entry widget and a button.
Just as the contents of an entry widget can be retrieved, the StringVar object can also be given a value, using its set() method. Its use is shown in the following program.

    from tkinter import *

    class MyApp:
        def __init__(self, parent):
            self.myParent = parent
            self.myContainer1 = Frame(parent)
            self.myContainer1.pack()
            self.v = StringVar()
            self.e = Entry(self.myContainer1, textvariable=self.v)
            self.e.pack()
            self.assignEntryVal('An initial value')
            self.buttonS = Button(self.myContainer1, text="Show entered text")
            self.buttonS.configure(bg='blue', fg='yellow')
            self.buttonS.configure(command=self.showEntryVal)
            self.buttonS.pack()

        def showEntryVal(self):
            print(self.v.get())

        def assignEntryVal(self, s):
            self.v.set(s)

    win = Tk()
    win.title("An example of an entry widget with a value")
    win.geometry('400x50')
    win.configure(bg='#ff00ff')
    myapp = MyApp(win)
    win.mainloop()

Figure 12.9: A window with an entry widget and a button and a value.

12.8 Exercises 62–69

Exercise 62:**
(a) Using tkinter, design a window having "A DNA calculator" as title.
(b) Change the window of part (a) to a height of 400 and a width of 500.
(c) Change the background color of the window to blue.
(d) Place a label with the text "This DNA calculator should become colorful". Change the size of the window by mouse operations.
(e) Change the background color of the label to red.
(f) Change the foreground color of the label to yellow.
(g) Add a label to the window with the text "Beautiful colors isn't it?", background color green and foreground color white. Note that the two labels are placed on top of each other.
(h) Change the order in which the two labels are packed.

Exercise 63:
(a) Again create a tkinter window having "A DNA calculator" as title and with blue as background color.
(b) Place a label with the text "This DNA calculator should become colorful", with black as background color and yellow as foreground color of the label, but now add the option side=LEFT to the pack method.
(c) Add a label to the window with the text "Beautiful colors isn't it?", background color green, foreground color white, and with BOTTOM as value for the side option of the pack method. Enlarge the window to experience the effects.
(d) Add a third label with the text "base A" and with TOP as value for the side option, and enlarge the window to experience the effects.
(e) Finally add a fourth label "base T" with RIGHT as value for the side option, and again enlarge the window to experience the effects.

Exercise 64:**
In general the size of a widget does not change when the size of the master window is adapted. If we require that the widget size is also adapted, we have to add values for the options expand and fill. Moreover, there is another option, anchor, to control the position of the widget inside its master widget. In this exercise some examples of their use are considered.
(a) Create a tkinter window having "A DNA calculator" as title and blue as background color. Place a label with the text "This DNA calculator should become colorful", with black as background color and yellow as foreground color of the label, but now add only the option anchor=N to the pack method. Enlarge the window to experience the effects.
(b) Add side=RIGHT to the argument list of pack and explain the changes when the size of the master widget is adapted.
(c) Next add fill=X, expand=True to the argument list of pack and discuss the changes.
(d) Replace the fill value X by Y and investigate the difference.
(e) Replace the fill value Y by BOTH and experience the effect.

Exercise 65:
Design a window with four buttons with the texts "A", "C", "G", and "T", each having a different background color and together filling up the complete master window. When the master window is resized, all space should remain evenly occupied by the buttons.

Exercise 66:
Design a graphical user interface with a label with the text 'Please enter a DNA string' and an entry widget in which a DNA string can be entered. The window should also have a button that prints the length of the DNA string that has been entered.

Exercise 67:
Each individual widget can be controlled by the pack options. In many cases, however, we need the same behaviour for several widgets simultaneously. To that end tkinter has a container entity by which widgets can be grouped. This framing operation is the subject of this exercise.
(a) Again create a tkinter window having "A DNA calculator" as title and blue as background color, and place 4 buttons with the texts "A", "C", "G", and "T", respectively. Choose different background and/or foreground colors and use the pack side option LEFT for "A", "C", "T", and RIGHT for "G". Enlarge the window again to experience the effects.
(b) Next we introduce a frame for the buttons "A" and "T". Instead of placing the two buttons into the top window, say win, we use the following grouping

    f = Frame(win)
    bA = Button(f, text="base A")
    bT = Button(f, text="base T")
    f.pack()

What changes are obtained when the window size is adapted?
(c) Next add fill=X, expand=True to the argument list of the packing of the frame and discuss the changes.
(d) Do the same as in part (c), but now for button "C".
(e) Replace the fill value X by BOTH in all pack argument lists and experience the effect.

Exercise 68:
Usually the characters that the user types are shown in an entry widget. In some cases, such as for password entries, an asterisk should be echoed instead.
This effect is obtained by adding show="*" as option to the entry constructor. Design a graphical user interface with two labels and two entry widgets on it. One label should be 'Username', and the label positioned below it 'Password'. The entry widget coupled to the 'Username' label should echo the characters typed in by the user, while the other entry widget should show asterisks instead.

Exercise 69:
As the previous exercise, but after the interface has been closed a new window should pop up with size '500x500', background color 'blue', foreground color 'yellow' and a label with the text 'Welcome ...', where instead of '...' the username typed by the user should appear.

Chapter 13 Two examples

In this chapter we consider two example programs. In the first example a program is created to perform some simple simulations. This is a nice example of how a problem can be split into multiple subproblems that are each solved in a separate function. The corresponding exercises are a good test to see whether you comprehend the use of functions. The second example illustrates that pyplot has many more options than used so far and can be used to create nice customized plots.

13.1 Simulations

A key advantage of computers is that they can perform tasks repeatedly and very accurately. Simulation is the imitation of the operation of a real-world process or system over time. The act of simulating something first requires that a model be developed; this model represents the key characteristics or behaviors/functions of the selected physical or abstract system or process. A computer can then be used to study the time evolution of that model. Examples of computer simulations of biological systems include molecular dynamics simulations of proteins, DNA and/or membranes. Here, we will not go into the details of such large-scale simulations.
Instead we will consider a small model consisting of a linear array of cells that can each be in one of two states. The states of the cells are followed in time, where the state of a cell in the next generation depends only on the current state of the cell and of its two immediate neighbors. Such a model is called a one-dimensional cellular automaton. The reason we are interested in such models is that they allow us, based on some simple rules, to generate all sorts of (sometimes complex) patterns that may also be observed in nature, see e.g. Figure 13.1.

Figure 13.1: Example of intriguing patterns found in nature.

The code

We will divide the problem at hand into 4 parts:
1. Generating the initial state for n cells.
2. Determining the new state of a cell given its own state and that of its neighbors.
3. Updating the system, i.e., the state of all cells, by repeatedly calling the solution to step 2.
4. Running the simulation, i.e., repeatedly updating the states of all cells by repeatedly calling the solution to step 3.

Generating the initial states

We will consider a one-dimensional cellular automaton consisting of n cells. Each cell can be in either of two states. We will denote these states by either ' ' or 'X'. The whole one-dimensional cellular automaton is then described by a string of length n made out of ' 's and/or 'X's. We can start the simulation from an arbitrary string, but will consider here a string consisting of n-1 spaces with a single 'X' in the center. A function that generates and returns such a string is:

    def createInitialLine(n):
        """This function generates and returns a string of 'n' characters
        (' ' and 'X') that can be used as initial configuration for a
        cellular automaton simulation.
        """
        # Initially empty line (only spaces) with one X at center
        l = [' ']*n
        l[n//2] = 'X'
        initline = ''.join(l)
        return initline
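The same initial line can also be built directly by string repetition and concatenation, without the intermediate list. The following is only a sketch of that alternative; the name createInitialLineAlt is hypothetical, and it is assumed that n is at least 1.

```python
def createInitialLineAlt(n):
    """Hypothetical variant of createInitialLine (assumes n >= 1):
    build the line by string repetition instead of via a list.
    """
    left = n // 2           # number of spaces before the 'X'
    right = n - left - 1    # number of spaces after the 'X'
    return ' ' * left + 'X' + ' ' * right

line = createInitialLineAlt(7)
print(repr(line))           # repr makes the spaces visible: '   X   '
```

The list-based version in the notes has the advantage that individual cells can be assigned by index, which is not possible for strings because strings are immutable.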

Rules for new cell state

The idea of the cellular automaton is that the new state of a cell is fixed by the old state of that cell plus the old states of its two neighbouring cells. As each cell can be in one of 2 states (' ' or 'X'), for a triplet of cells there are thus $2^3 = 8$ different possibilities. For each of these 8 possibilities, we need to define what the new state of the middle cell should be. A function that has a string of length 3 as input and returns that new state as a string of length 1 is given here:

    def calcNewCellState(pattern):
        """ Given a pattern of three characters, only consisting of ' ' and 'X'
        and indicating the state of a cell and the states of its left and
        right hand side neighbours, this function returns the new state
        (either ' ' or 'X') of the cell according to some rule.
        """
        #Rule 90:
        #current pattern           111 110 101 100 011 010 001 000
        #new state center cell      0   1   0   1   1   0   1   0
        newstate = ' '
        if pattern == '   ':
            newstate = ' '
        elif pattern == '  X':
            newstate = 'X'
        elif pattern == ' X ':
            newstate = ' '
        elif pattern == ' XX':
            newstate = 'X'
        elif pattern == 'X  ':
            newstate = 'X'
        elif pattern == 'X X':
            newstate = ' '
        elif pattern == 'XX ':
            newstate = 'X'
        elif pattern == 'XXX':
            newstate = ' '
        return newstate

Of course, the rule could be changed by exchanging ' 's for 'X's (and/or the other way around) for the newstate in the different cases. In fact, there are $2^8 = 256$ different combinations possible. These possibilities are called the 256 Rules. The rule in the above example is called Rule 90. The name comes from ordering the 8 patterns and the resulting new cell states according to the rule as follows:

    #Rule 90:
    #current pattern           111 110 101 100 011 010 001 000
    #new state center cell      0   1   0   1   1   0   1   0

The 8 digits for the new state of the center cell form a binary number.
This can be converted to a decimal number as follows:

$$0 \cdot 2^7 + 1 \cdot 2^6 + 0 \cdot 2^5 + 1 \cdot 2^4 + 1 \cdot 2^3 + 0 \cdot 2^2 + 1 \cdot 2^1 + 0 \cdot 2^0 = 90$$

Conversely, a decimal number can also be converted to a binary number, e.g. as seen in Section 8.3 using Python string formatting:

    >>> "{:b}".format(90)
    '1011010'
    >>> "{:b}".format(222)
    '11011110'

where in the first example only the leading 0 remains to be added.

Updating the system

A new generation of the cellular automaton can be made by looping over all cells (except for the two outermost):

    def calcNewLine(s):
        """ Generate a new generation by updating each cell based on its
        old state and that of its two nearest neighbours. Because first
        and last cell only have a single neighbour, these cells remain
        untouched.
        """
        l = [s[0]]                    # State of first cell kept
        for i in range(1, len(s)-1):  # Update all intermediate cells
            p = s[i-1]+s[i]+s[i+1]
            l.append(calcNewCellState(p))
        l.append(s[-1])               # State of last cell kept
        newline = ''.join(l)
        return newline

Running the simulation

The evolution of a one-dimensional cellular automaton can be followed by printing the initial state (generation zero) as a first line, the first new generation on a second line, and so on. A function that does so for a cellular automaton of width cells over height generations is shown here:

    def runSimulation(width, height):
        """ Runs a simulation on a cellular automaton consisting of 'width'
        cells, for 'height' iterations.
        """
        curline = createInitialLine(width)
        print(curline)
        for it in range(height):
            curline = calcNewLine(curline)
            print(curline)

The simulation can then be run, for example for 120 cells over 60 generations, using a single call to the latter function:

    runSimulation(width=120, height=60)

resulting in output like that shown in Figure 13.2a.
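The conversion between a rule number and its 8 binary digits used in this section can be checked in a few lines. This is a small sketch only: the function names rule_to_bits and bits_to_rule are hypothetical, and the format specification 08b (pad with zeros to a width of 8) is used to supply the leading zeros.

```python
def rule_to_bits(rule):
    """Return the 8-character binary string of a rule number (0 up to 255)."""
    return "{:08b}".format(rule)   # 08b: binary, zero-padded to 8 digits

def bits_to_rule(bits):
    """Return the decimal number that a string of '0's and '1's represents."""
    return int(bits, 2)            # second argument of int is the base

print(rule_to_bits(90))            # 01011010
print(rule_to_bits(222))           # 11011110
print(bits_to_rule("01011010"))    # 90
```

Read from left to right, the digits of the binary string give the new state of the center cell for the patterns 111, 110, 101, 100, 011, 010, 001 and 000.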

Figure 13.2: Examples of cellular automata with different rules: a) the rule as applied in the code in this section, b) a rule that generates a very regular pattern, and c) an example of a rule that generates a very irregular pattern.

13.2 A bar plot example

In this section we will combine many of the topics discussed in previous chapters to create a customized bar plot. Assume that a small poll has been conducted amongst 3 groups of men and women, and that in each group the scores of its men and of its women have been collected separately. For group 1 the men scored 30 and the women 25, for group 2 the men 35 and the women 32, and for group 3 28 and 21, respectively. Our programming task is to create, in a single figure, a bar plot of the scores of each group with the scores of the men in red and those of the women in yellow. Since we are already familiar with matlab, we decide to draw the figure using matplotlib.

As usual in programming, we split our task into several small subtasks. In this case one may come up with the following 4 design tasks:
• First, create an empty figure with a title.
• Second, add the scores of the men to the figure.
• Next, construct a figure of the scores of the women.
• Finally, combine the two figures into one figure.
By following these steps we obtain our solution.

• As first step we create an empty figure with a title. Since we are going to use the module pyplot of matplotlib, as first action we should import it:

    import matplotlib.pyplot

But when numerous references to a module are to be made, long names should be avoided. To that end Python has an elegant way to introduce a name alias:

    import matplotlib.pyplot as plt

Instead of writing matplotlib.pyplot one can now use plt.
In matplotlib a figure can be created by calling the figure function of pyplot:

    fig = plt.figure()

Since in matplotlib the Axes is the plotting area into which most of the objects go, we create an instance ax of it by calling the add_subplot method, which returns an Axes object. In our case we need a figure with only one subplot, so we call:

    ax = fig.add_subplot(1, 1, 1)   # one row, one column, first plot

A title can be added to ax by calling its set_title method:

    ax.set_title('A small bar plot example')

and finally we may need to use the pyplot show function to display our results:

    plt.show()

In Anaconda the latter statement is however not required, as the figure is shown immediately upon creation. So the total program becomes

    import matplotlib.pyplot as plt

    # create a figure and an axes
    fig = plt.figure()
    ax = fig.add_subplot(1, 1, 1)
    ax.set_title('A small bar plot example')
    plt.show()

Running this program results in a figure as shown in Figure 13.3.

Figure 13.3: An empty figure with a title.

• The next subtask is to add the scores of the men to the figure. In chapter 4 (and exercise 15) the plot function has been introduced:

    matplotlib.pyplot.plot(x, y)

in which x and y are two lists of the same length. Similarly, an Axes instance has a bar method that plots a bar (rectangle) of height y[0] at position x[0], a bar of height y[1] at position x[1], ..., and a bar of height y[-1] at position x[-1], using the default width of 0.8:

    ax.bar(x, y)

in which x and y are again two lists of the same length. So to plot the men's scores we could use:

    menScores = [30, 35, 28]
    n = len(menScores)
    indmen = range(n)   # the x locations for the groups
    rectsmen = ax.bar(indmen, menScores, color='r')
    ax.set_title('The men scores')

Notice the use of the keyword argument color. Information about possible keywords of the bar method of an axes instance can be found at: http://matplotlib.org/api/axes_api.html.
Since on inspecting the figure we find the bars too wide, we add the keyword argument width=barwidth to our call of the bar method. Of course barwidth should first be given a value, and in our case 0.3 seems appropriate:

    menScores = [30, 35, 28]
    n = len(menScores)
    barwidth = 0.3
    indmen = range(n)   # the x locations for the groups
    rectsmen = ax.bar(indmen, menScores, width=barwidth, color='r')
    ax.set_title('The men scores')

Another aspect we dislike about the figure is its height. In matplotlib it is simple to change the upper limit of the y-axis with the command ax.set_ylim(top=value). The value to be used should depend on the maximum score of the men. To that end we use Python's max function, which calculates the maximum value of the argument list:

    maxvalue = max(menScores)

Combining everything and adding a small value to the calculated maximum of the list results in

    import matplotlib.pyplot as plt

    barwidth = 0.3   # the width of the bars

    # create a figure and an axes
    fig = plt.figure()
    ax = fig.add_subplot(1, 1, 1)

    # the data
    menScores = [30, 35, 28]
    n = len(menScores)
    indmen = range(n)   # the x locations for the groups
    rectsmen = ax.bar(indmen, menScores, width=barwidth, color='r')
    ax.set_ylim(top=max(menScores)+5)
    ax.set_title('The men scores')
    plt.show()

In Figure 13.4a the result is shown.

Figure 13.4: Bar plots at different stages in the program development: (a) figure with the scores of the men, (b) final bar plot.

• The next subtask is to construct a figure of the scores of the women. Of course a figure similar to that of the men can be constructed for the scores of the women, but if we used the same x-positions for the bars, the bars would overlap in our next step. So we decide to create a list of x-positions that are shifted by the width of the bars:
    indwomen = []
    for i in range(n):
        indwomen.append(i+barwidth)

Since constructing lists of this kind occurs very frequently, Python has a special construct for it, called list comprehension. The same list can be obtained by

    indwomen = [i+barwidth for i in range(n)]

Using this construct with a shifted x-position, our program becomes:

    import matplotlib.pyplot as plt

    barwidth = 0.3   # the width of the bars

    # create a figure and an axes
    fig = plt.figure()
    ax = fig.add_subplot(1, 1, 1)

    # the data
    womenScores = [25, 32, 21]
    n = len(womenScores)
    indwomen = [i+barwidth for i in range(n)]
    rectswomen = ax.bar(indwomen, womenScores, width=barwidth, color='y')
    ax.set_ylim(top=max(womenScores)+5)
    ax.set_title('The women scores')
    plt.show()

When we inspect this figure, we see that the y-axis is lacking a label. Adding the y-label is simple:

    ax.set_ylabel('Scores')

• Combine the two figures into one figure. Apart from combining the two figures, we also add labels to the bars and a legend to explain the colors of the bars. We also adapt the title of the figure. Ticks and labels for the bars are added by using the set_xticks and set_xticklabels methods of the ax object:

    ax.set_xticks([i+barwidth for i in range(n)])
    ax.set_xticklabels(['G1', 'G2', 'G3'])

For the legend we can use the results of the bar method. The results are containers of Rectangle instances.
The first Rectangle of each container is passed to legend, so that its face color is shown next to the corresponding legend text:

    # add a legend
    ax.legend([rectsmen[0], rectswomen[0]], ['Men', 'Women'])

So our solution becomes

    import matplotlib.pyplot as plt

    barwidth = 0.3   # the width of the bars

    # create a figure and an axes
    fig = plt.figure()
    ax = fig.add_subplot(1, 1, 1)

    # the data
    menScores = [30, 35, 28]
    womenScores = [25, 32, 21]
    n = len(menScores)
    indmen = range(n)   # the x locations for the groups
    rectsmen = ax.bar(indmen, menScores, width=barwidth, color='r')
    indwomen = [i+barwidth for i in range(n)]
    rectswomen = ax.bar(indwomen, womenScores, width=barwidth, color='y')
    ax.set_ylim(top=max(menScores+womenScores)+5)

    # add a title and labels and ticks to the x- and y-axis
    ax.set_ylabel('Scores')
    ax.set_title('Scores by group and gender')
    ax.set_xticks(indwomen)
    ax.set_xticklabels(['G1', 'G2', 'G3'])

    # add a legend
    ax.legend([rectsmen[0], rectswomen[0]], ['Men', 'Women'])
    plt.show()

The result is displayed in Figure 13.4b.

13.3 Exercises 70–73

Exercise 70:
In Section 13.1 a program has been designed to simulate a cellular automaton. The initial state of the system was such that only one cell in the center differed from all others. Extend the code such that the function runSimulation has an additional parameter init that can have the value 'center', 'left' or 'right'. The behavior for 'center' should be as before, while for 'left' only the leftmost cell should be 'X' and all others ' ', and for 'right' only the rightmost cell should be 'X' and all others ' '.

Exercise 71:
In Section 13.1 a program is designed to simulate a cellular automaton. There a single rule was used.
(a) Extend the code such that by calling runSimulation(width=120,height=60,rule=90) the same simulation is still performed.
(b) Extend the code such that (apart from Rule 90) it can also perform the simulation for the following rule (Rule 222):

    #Rule 222:
    #current pattern              111 110 101 100 011 010 001 000
    #new state for center cell     1   1   0   1   1   1   1   0

(c) Extend the code further such that it can also do Rule 122.
(d) Extend the code further such that it can also do Rule 94.
(e) Extend the code further such that it can also do Rule 30.
(f) Challenge: Change the code such that it can do all rules 0 up to 255.

Exercise 72:
(a) In the lecture notes the NCBI site has been introduced. Open a web browser and go to the site http://www.ncbi.nlm.nih.gov/. Change the field "All databases" into "Nucleotide" and enter "KR063672.1 OR KR063671.1" in the search field (KR063672.1 and KR063671.1 are the accession codes for the two most used isolates of the virus, "Kikwit" and "Mayinga" respectively). Download both sequences by checking the boxes of both entries and clicking on the button 'Send to', selecting 'File' as destination and 'FASTA' as format, and clicking on the button 'Create file'. If everything went fine, you will be asked to save the file under the name "sequence.fasta". In the Download folder of your computer you will then have the file 'sequence.fasta'. Move that file to your working directory.
(b) In exercise 27 the FASTA format has been described and a Python program has been asked for to determine the number of sequences in a FASTA file. Design a method with a filename as parameter that returns a list of 2-tuples. Each 2-tuple corresponds to a sequence in the FASTA file and has the description line of the sequence as its first element and the nucleotide sequence as its second element. Apply your method to the file 'sequence.fasta' of part (a).
(c) What are the sequence lengths of the Kikwit and Mayinga isolates?
(d) From the answer of part (c) we infer that the Kikwit sequence is one nucleotide shorter.
So if we want to compare the sequences, we have to leave one nucleotide out of the Mayinga sequence. Of course the question is which one. Write a method that leaves the nucleotide at index i out of the sequence and subsequently counts the number of matches, where a match means that both sequences have the same nucleotide at the same index.
(e) Apply the method of part (d) to all indices of the Mayinga sequence. The result should be a list having at index i the number of matches when the i-th nucleotide is left out of the Mayinga sequence.
(f) Make a bar plot that shows the result of part (e) in a graph. Use the x-axis for the index i and the y-axis for the number of matches.

Exercise 73:
(a) Design a class Frequency that has two data attributes. The first data attribute is nme and is used to store a name as a string; the second one, pctl, is a list of percentages. The __init__ method should have, next to self, only one parameter, a string. This string consists of one or more entries that are separated by semicolons. The first entry is a substring, the other entries are all strings that can be converted to floats.
(b) Design a method having a filename as parameter that reads a sequence of lines from the file. Each line starts with a name and is followed by a number of percentages, where all entries including the name are separated by semicolons. The method should return the corresponding list of Frequency objects.
(c) Apply the method of the previous item to the file 'sequencespct.txt', consisting of a number of gene names and the percentages of occurrences of the nucleotides A, C, G, and T in each gene sequence.
(d) Use the list of item (c) to generate a bar plot with the names of the genes on the x-axis and as bars, in different colors, the percentages of occurrences of the nucleotides A, C, G, and T in the gene sequence. So the plot shows per gene the distribution of the nucleotides.
The solution should be independent of the number of genes. Also add a legend to the plot.
(e) As the previous item, but now the bars are grouped based upon the distribution of a single nucleotide over the genes, with a different color per gene. So the plot shows first how often the nucleotide A occurs in each of the genes, then how often the T etc. Also add decent xtick-labels to the groups of bars.

Appendix A Summary of useful commands

A.1 Common Python constructs

Repetition

    for x in l:
        block

    for i in range(len(l)):
        block

    while bool:
        block

Selection

    if bool:
        block1
    elif bool:
        block2
    else:
        block3

Functions

    def myfunc(inargs):
        block
        return outargs

A.2 Operations on string s

s.count(sub)
    return the number of non-overlapping occurrences of substring sub in string s.
s.upper() and s.lower()
    return a copy of the string s converted to uppercase and lowercase, respectively.
s.rstrip(), s.lstrip(), s.strip()
    s.rstrip() returns a copy of the string s with trailing whitespace characters (the characters space, tab, linefeed, return, formfeed, and vertical tab) removed. Analogously, lstrip removes leading whitespace and strip removes both leading and trailing whitespace.
s.find(sub)
    return the lowest index in the string s where substring sub is found. Return -1 if sub is not found.
s.rfind(sub)
    return the highest index in the string s where substring sub is found. Return -1 if sub is not found.
s.replace(old, new)
    return a copy of string s with all occurrences of substring old replaced by new.
s.split([sep])
    return a list of the words in the string, using sep as the delimiter string. If sep is not specified or is None, first the whitespace characters are stripped from both ends and then words are separated by arbitrary length strings of whitespace characters.
float(s)
    return the string s converted to a floating point number, if possible.
int(s)
    return the string s converted to an integer number, if possible.
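As an illustration of the string operations listed above, the following sketch applies a few of them to a short DNA string. The input strings, including the gene name geneX and its percentages, are made up for this example only.

```python
raw = "  acgtTTga\n"               # made-up input with whitespace and mixed case
s = raw.strip().upper()            # remove surrounding whitespace, then uppercase

n_t = s.count("T")                 # number of single T's
n_tt = s.count("TT")               # occurrences are counted without overlap
first = s.find("TT")               # lowest index where "TT" starts
last = s.rfind("TT")               # highest index where "TT" starts
missing = s.find("AAA")            # -1, because "AAA" does not occur
rna = s.replace("T", "U")          # copy of s with every T replaced by U

fields = "geneX;25.0;30.5".split(";")   # made-up line, split on semicolons
pct = float(fields[1])                  # string '25.0' converted to a float

print(s, n_t, n_tt, first, last, missing, rna)
print(fields, pct)
```

Note that none of these operations changes s itself: strings are immutable, so each method returns a new value.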
A.3 Operations on file f

f=open(fname, option)
    return a new object f of type file with filename fname, for reading when option is omitted or option='r', for writing when option='w'.
f.close()
    close the file.
f.read()
    return the contents of the file in a string.
f.readline()
    return the next line from the file. The line includes the end-of-line character (\n).
f.readlines()
    return a list of all lines from the file, where each line includes the newline character.
f.write(s)
    write the string s to the file f.

A.4 Operations on lists l

Given a list l the following methods can be applied to l. Note that in most cases the list l is changed.

l.append(x)
    add an item to the end of the list.
l.extend(L)
    extend the list l by appending all the items in the given list L.
l.insert(i, x)
    insert item x at given position i.
l.remove(x)
    remove the first item from the list whose value is x. Raises an error if the item is not in the list.
l.pop(i)
    remove the item at the given position in the list, and return it. If no index is specified, l.pop() removes and returns the last item in the list.
l.index(x)
    return the index in the list of the first item whose value is x. Raises an error if the item is not in the list.
l.count(x)
    return the number of times x appears in the list.
l.sort()
    sort the items of the list, in place.
l.reverse()
    reverse the elements of the list, in place.
list(s)
    return a list whose items are the characters of the string s, in the same order.
sep.join(l)
    return a string which is the concatenation of the strings in the list l. The separator between elements is the string sep on which the method is invoked.

A.5 Operations on dictionaries d

Given a dictionary d the following methods can be applied to d.
d.keys()
    Returns a view on the dictionary’s keys
d.values()
    Returns a view on the dictionary’s values
d.items()
    Returns a view on the dictionary’s (key, value) pairs
x in d
    Returns True if x is one of the dictionary’s keys, False otherwise

A.6 List generation and plotting

range
The general form of the range-method is
    range([start,] stop[, step])
If the step argument is omitted, it defaults to 1. If the start argument is omitted, it defaults to 0. Combined with the list function, the full form
    list(range([start,] stop[, step]))
returns a list of plain integers
    [start, start + step, start + 2 * step, ...]
If step is positive, the last element is the largest start + i * step less than stop; if step is negative, the last element is the smallest start + i * step greater than stop.

plot
Two lists x and y of the same length, with in x the x-coordinates and in y the y-coordinates of a collection of points, can be plotted in red with circle markers and subsequently shown by executing
    import matplotlib.pyplot as plt
    plt.plot(x, y, 'ro')
    plt.xlabel('x-text')
    plt.ylabel('y-text')
    plt.show()

A.7 turtle

In some exercises the turtle library is used. Some basic commands in that library are
    import turtle              # load the library
    turtle.up()                # pen up (not drawing)
    turtle.down()              # pen down (start drawing)
    turtle.pencolor(r,g,b)     # set pen color
    turtle.goto(x,y)           # move to position x,y
    turtle.forward(dist)       # move specified distance forward
    turtle.backward(dist)      # move specified distance backward
    turtle.right(degrees)      # turn specified degrees right
    turtle.left(degrees)       # turn specified degrees left
    turtle.mainloop()          # place at end of program to activate drawing window

A.8 openpyxl

    import openpyxl
    wb = openpyxl.load_workbook(filename = 'myfile.xlsx')
    ws = wb.active
    cell = ws.cell(row=i,column=j)
    print(cell.value)
    ws2 = wb.create_sheet(title="mytitle")
    ws2['F5'].value = 3.14
    ws2.append([...])
    wb.save(filename = 'myfile2.xlsx')

A.9 Database queries

urllib:
    import urllib.parse, urllib.request
    params = urllib.parse.urlencode(mydict)   # if needed
    url="http(s)://hostname/path"
    rf = urllib.request.urlopen(url, params.encode('ascii'))

Entrez:
    from Bio import Entrez
    Entrez.email="myname@student.tue.nl"
    handle=Entrez.esearch(db="nucleotide", term='...')
    record=Entrez.read(handle)

A.10 tkinter

Creating a master window in tkinter with text mytitle
    import tkinter
    top=tkinter.Tk()
    top.title(mytitle)
    ...
    top.mainloop()

Creating a widget on master and packing it
    widget=tkinter.widgetclass(top)
    widget.pack(side=tkinter.TOP)
creates an instance of the widget class, as a child of top. widgetclass is one of
    Label
    Button
    Entry
The default value for the side option is tkinter.TOP; other values are tkinter.LEFT, tkinter.BOTTOM and tkinter.RIGHT.

Configure
To set specific options for a widget use the configure method:
    widget.configure(option=value)

General option=value
The following options apply to the above-mentioned widgets:
    background=color
    foreground=color
where color is one of "white", "black", "red", "green", "blue", "cyan", "yellow", and "magenta".

Option for the Label widget
    text=mytxt
where mytxt is a string.

Appendix B
Solutions to selected exercises

Solution to exercise 1:
    >>> 5*40
    200
    >>> 1.25*7
    8.75
    >>> 100/25
    4.0
    >>> 106/25
    4.24
    >>> 106//25
    4
    >>> 106.0/25
    4.24
    >>> 106.0//25
    4.0
    >>> 100/5*5
    100.0
    >>> 100/(5*5)
    4.0
    >>> 2**10
    1024
    >>> 3*2**3
    24
    >>> (3*2)**3
    216
All operators applied to integers also result in an integer, except for the true division operator /, which results in a float. If one of the operands is a float, all operators also return a float.
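The type rule stated above can be checked directly with the built-in function type. This is a minimal sketch; the helper name result_type is chosen here for illustration only and is not part of the exercise.

```python
def result_type(value):
    """Return the name of the type of value, e.g. 'int' or 'float'."""
    return type(value).__name__

# Integer operands: these operators give an int ...
print(result_type(5 * 40))       # int
print(result_type(106 // 25))    # int
print(result_type(2 ** 10))      # int
# ... except true division, which gives a float even if the division is exact
print(result_type(100 / 25))     # float

# As soon as one operand is a float, the result is a float as well
print(result_type(1.25 * 7))     # float
print(result_type(106.0 // 25))  # float
```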
The operator // is the floor division, which basically returns the number of times one number fits into another, without any decimal points or remainders. Moreover, due to the differences in priorities of operators, the use of brackets () matters.

Solution to exercise 2:
    >>> nrbases = 4
    >>> seqlength = 10
    >>> nrbases**seqlength
    1048576
    >>> nrpos = nrbases**seqlength
    >>> nrpos
    1048576
    >>> print(nrpos)
    1048576
    >>> prob = 1.0/nrpos
    >>> print('The probability is', prob)
    The probability is 9.5367431640625e-07

Solution to exercise 3:
A name, also called identifier, is a word that consists of letters, underscores, and digits; it must start with a letter or an underscore. So a space ' ', '?', '!' and ';' are not allowed in an identifier and hence whats in a name, Whats in a name?, yo!u, and Hello; are all not valid. Since an identifier must start with a letter or an underscore, 5600MB is also not valid. Syntactically correct identifiers are thus whatsinaname, whats_in_a_name, I, HelloYou, and varName, i.e., a), c), f), g) and i). When the invalid identifiers are used, the Python interpreter generates the following error messages:
    >>> whats in a name
      File "<stdin>", line 1
        whats in a name
                      ^
    SyntaxError: invalid syntax
    >>> 5600MB
      File "<stdin>", line 1
        5600MB
             ^
    SyntaxError: invalid syntax
    >>> yo!u
      File "<stdin>", line 1
        yo!u
          ^
    SyntaxError: invalid syntax
    >>> Hello;=3
      File "<stdin>", line 1
        Hello;=3
              ^
    SyntaxError: invalid syntax
    >>> what?name
      File "<stdin>", line 1
        what?name
            ^
    SyntaxError: invalid syntax
In each case the parser repeats the offending line and displays a little arrow pointing at the earliest point where the error was detected.
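The rule for valid names can also be checked without provoking a SyntaxError, using the string method isidentifier. The sketch below applies it to the candidate names discussed in this solution; the variable names candidates, valid and invalid are chosen here for illustration.

```python
# Candidate names taken from the discussion of exercise 3
candidates = ['whats in a name', 'whatsinaname', 'whats_in_a_name',
              '5600MB', 'yo!u', 'Hello;', 'I', 'HelloYou', 'varName']

valid = []
invalid = []
for name in candidates:
    # isidentifier() is True if the string follows the rules for a name
    if name.isidentifier():
        valid.append(name)
    else:
        invalid.append(name)

print('valid:  ', valid)
print('invalid:', invalid)
```

One caveat: isidentifier also returns True for reserved words such as for or if, which follow the spelling rules but still cannot be used as variable names.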
Solution to exercise 4:
Having followed the instructions in the exercise, you should have a text file named exercise4.py in the folder D:\8CA10 that can be opened not only in the Spyder editor but, for example, also in the standard Windows editor WordPad. The value 1048576 in the variable nrpos is written to the Python console only once, i.e., as a result of the single print statement in the program. Whereas typing a variable name at the command prompt shows its contents, in a program only print statements result in output at the Python console.

Solution to exercise 5:
(a) Create a new file (using the New file ... option in the File menu or using the key combination Ctrl+N), paste the text from the exercise and adapt it such that it reads:
    genomelength = 3.2e9     # number of base pairs per cell
    nrcells = 4e13           # number of cells
    massperbasepair = 660    # grams per mole per base pair
    Na = 6.022e23            # number of molecules per mole (Avogadro's number)

    totalDNAmass = genomelength*nrcells*massperbasepair/Na
    print('approximate DNA mass one human:', totalDNAmass, 'grams')
Finally, save the file as exercise5.py. Running the above Python fragment then yields as an answer 140.28561939554965 grams.
(b) The print statement should be replaced by:
    print('approximate DNA mass one human:', totalDNAmass/1.0e3, 'kg')

Solution to exercise 6:
A program fragment for all three subparts may read:
    s = 'AAACGAACGTAGGATCAAGTAGGCAAAAAG'
    print('a) the first character of s:')
    print(s[0])
    print('b) the last character of s:')
    print(s[len(s)-1])
    print('c) the string using 10 characters per line and space after each 5th:')
    print(s[0:5] + ' ' + s[5:10])
    print(s[10:15] + ' ' + s[15:20])
    print(s[20:25] + ' ' + s[25:30])

Solution to exercise 7:
Below, the programming fragments are given with some additional comments, and the results are shown when the fragments are executed by the Python interpreter.
(a) [] is the empty list, so assigning it to the variable l can be done by
    l=[]
(b) A single element can be added by invoking the append-method:
    l.append(7)
    print(l)
The print statement then yields:
    [7]
(c) The length of list variable l is produced by invoking the len-method with l as parameter. Printing this length using
    print(len(l))
yields
    1
(d) The first method is to apply the extend-method to l with the list as single parameter:
    l.extend([1, 2, 3, 1])
    print(l)
yielding as output [7, 1, 2, 3, 1]. A second method is to invoke the append-method on l with each of the elements of the extension list successively as parameter:
    for x in [1, 2, 3, 1]:
        l.append(x)
        print(l)
This yields consecutively:
    [7, 1]
    [7, 1, 2]
    [7, 1, 2, 3]
    [7, 1, 2, 3, 1]
(e) The result of the above actions is that l now has length 5.
    print(len(l))
now thus yields
    5
(f) The third element of l is l[2]. By an assignment its value can be changed.
    l[2]=4
    print(l)
The print will now thus yield [7, 1, 4, 3, 1].
(g) The first occurrence of element x in l can be removed by invoking the remove-method:
    l.remove(1)
    print(l)
displaying [7, 4, 3, 1].
(h) The effect of a remove on the length of a list is of course that it has become one smaller, such that
    print(len(l))
now yields 4.
(i) The pop-method invoked without an argument removes the last element from a list and returns this last element. This can be stored in a variable and then printed
    t = l.pop()
    print(t)
or printed directly
    print(l.pop())
In both cases 1 will be displayed.
(j) The effect of a pop on the length of a list is of course also that it has become one smaller:
    print(len(l))
now thus yields 3.
(k) When a negative index is used, then len(l) is added to the index, so l[-1] is the same as l[len(l)-1]. With a print statement this content can be shown
    print(l[-1])
to be 3.
(l) To validate that l[len(l)-1] is indeed the same as l[-1]
    print(l[len(l)-1])
also displays 3.
(m) Index -len(l) is the same as 0, so using
    print(l[-len(l)])
the value of the first element is produced, i.e., 7.

Solution to exercise 8:
(a) A list of the first 20 integers can be produced by invoking list(range(20)):
    l=list(range(20))
    print(l)
    [0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19]
(b)
    print(list(range(len(l))))
    [0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19]
(c) Execution of
    for x in l:
        print(l)
produces as output
    [0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19]
    [0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19]
    [0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19]
    .
    .
    [0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19]
    [0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19]
since for each element of l it prints the whole list l.
(d) Execution of
    for x in l:
        print(x, 2*x)
produces as output
    0 0
    1 2
    2 4
    .
    .
    18 36
    19 38
since for each element of l first its value and then twice its value are output.
(e) When the following programming fragment is executed
    for x in l:
    print(x)
the system returns with the error message
    IndentationError: expected an indented block
In this case there is no indentation in front of print(x), so in fact no actions are associated with the loop.
(f) Execution of the programming fragment:
    for i in range(len(l)):
        print(l[i], 2*l[i])
gives output identical to that of (d), since for i in range(len(l)) means that for each of the first 20 integers the corresponding l-value and twice this value are output. Since l=[0, 1, 2, .., 19] the output is as shown.
(g) Execution of
    print("Start")
    for i in range(len(l)):
        print(l[-i])
        print("Finished")
gives as output
    Start
    0
    Finished
    19
    Finished
    .
    .
    2
    Finished
    1
    Finished
since the first time the loop body is executed the value of i is 0, so l[-0], which is l[0], the first element of l (0), is printed, followed by the string "Finished". Then i becomes 1 and l[-1], the last element of l, and "Finished" are output, then the last but one, etc., until i is 19; since l[-19] is the same as l[1], a 1 followed by "Finished" is produced.
(h) By removing the indentation before print("Finished"), it is taken out of the block of the for-statement that is executed for every index of l. It will then be executed only once, when the for-loop is done. The programming fragment should thus look like:
    print("Start")
    for i in range(len(l)):
        print(l[-i])
    print("Finished")
giving as output
    Start
    0
    19
    18
    .
    .
    1
    Finished
(i) In the previous two parts the first element of l was produced first and then the other elements of l in reversed order. So we need only a small adjustment to produce all elements of l in reversed order. The solution is to find the right "pattern": the last element is l[-1-0], the last but one l[-1-1], the last but two l[-1-2], etc., until l[-1-(len(l)-1)]. Hence a programming fragment that produces the elements of l in reverse order is:
    for i in range(len(l)):
        print(l[-1-i])

Solution to exercise 9:
(a) The corrected programming fragment could look like:
    n=10
    print(0, n)
    for t in range(6):
        n = 2*n
        print(t+1, n)
    print('After 6 hours the number of cells is', n)
This shows that the number of cells after 6 hours is 640.
(b)
    n=10
    print(0, n)
    for t in range(24):
        n = 2*n - 5
        print(t+1, n)
    print('After 24 hours the number of cells is', n)
After 24 hours the number of cells is 83 886 085.

Solution to exercise 10:
(a) A new window pops up in which a square is drawn.
(b) The requested regular octagon consists of 8 equal line pieces, each rotated by 45 degrees with respect to the previous one.
    import turtle
    d=100
    turtle.up()              # starting point shifted slightly upward
    turtle.goto(-d/2,d)
    turtle.down()
    for i in range(8):
        turtle.forward(d)
        turtle.right(45)
    turtle.mainloop()
(c) The requested star consists of 6 equal parts, where each part now consists of 2 line pieces rotated 120 degrees with respect to each other and where the 6 parts are each rotated with respect to each other by 60 degrees. Because we already rotated 120 degrees, we have to rotate (120-60=) 60 degrees back.
    import turtle
    d=100
    turtle.up()
    turtle.goto(d/2,d/2)
    turtle.down()
    for i in range(6):
        turtle.forward(d)
        turtle.right(120)
        turtle.forward(d)
        turtle.right(-60)
    turtle.mainloop()
Instead of turtle.right(-60) one could also use turtle.left(60).
(d) The requested 'spiral' plot consists of multiple (200 to be exact) line segments, where starting from the center consecutive line segments increase in size and make an angle of 45 degrees.
    import turtle
    d=10
    turtle.up()
    turtle.goto(0,0)
    turtle.down()
    for i in range(200):
        turtle.forward(d)
        turtle.right(45)
        d = d+1
    turtle.mainloop()

Solution to exercise 11:
Let's first consider the case where k is equal to 3. Then, what should be printed is:
    print('* ')       # 1 star,  1 * '* '
    print('* * ')     # 2 stars, 2 * '* '
    print('* * * ')   # 3 stars, 3 * '* '
So, on the first line the string '* ' once, on the second line the same string twice, and on the third line that same string three times. More generally, on the k-th line we need k times '* '. This can be realized in different ways. One way would be
    k=5
    for n in range(1, k+1):
        print(n * '* ')
A second way uses a second (nested) for loop for the repetition on a single line:
    k=7
    for n in range(1, k+1):
        for i in range(n):
            print('*',end='')
        print()

Partial solution to exercise 14:
(a) Given a list of values, to calculate the mean we first have to determine the sum of all the values. As in the previous exercise, we have to consider all elements one by one:
    s = 0
    for x in l:
        s = s + x
The next step is to calculate the mean and print it
    mean = s / len(l)
    print('The mean is ', mean)
If we apply the calculation to the list l of the previous exercise we find:
    The mean is 5.75

Solution to exercise 16:
(a) range(n) generates the sequence of all integers 0, .., n-1, so here we have to choose the value 11 for n and use the list function to convert the range object into a true list:
    >>> list(range(11))
    [0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10]
(b) range(start, n) generates all integers start, .., n-1, so here we have to choose 1 for start and 11 for n:
    >>> list(range(1, 11))
    [1, 2, 3, 4, 5, 6, 7, 8, 9, 10]
(c) When step is greater than 0, range(start, n, step) generates all integers start, start+step, .., start+k*step, where k is maximal, i.e., start + k*step < n while start + (k+1)*step >= n. So here we have to choose 4 for start, 21 for n, and 4 for step:
    >>> list(range(4, 21, 4))
    [4, 8, 12, 16, 20]
(d) As in part (c), but now we should choose -24 for start, 21 for n, and 3 for step:
    >>> list(range(-24, 21, 3))
    [-24, -21, -18, -15, -12, -9, -6, -3, 0, 3, 6, 9, 12, 15, 18]
(e) Several solutions exist to construct the requested list.
We can use the result of part (d) and then reverse it, either using
    l=list(range(-24, 21, 3))
    m=[]
    for i in range(len(l)):
        m.append(l[-1-i])
    print(m)
or
    l=list(range(-24, 21, 3))
    m=l[::-1]
    print(m)
However, both of these solutions do not exclusively use the range and list functions and, therefore, do not satisfy the exercise, even though the correct list is provided. The solution is to use range with a negative step value: if step is negative, range(start, n, step) generates all integers from start to start+k*step, where the last element is the smallest start+k*step greater than n. So here we may choose 18 for start, -25 for n, and -3 for step:
    >>> list(range(18, -25, -3))
    [18, 15, 12, 9, 6, 3, 0, -3, -6, -9, -12, -15, -18, -21, -24]

Partial solution to exercise 17:
(a) First, observe what sublists the list
    [n+10, n+9, .., 2, 1, 0, 0, 5, .., n+5, n+10]
is built of. The first sublist is a list starting at n+10, descending by 1, and ending with a 0. The second sublist starts with the second 0, increases by steps of 5, and ends with n+10. Both these sublists can be created using a combination of the range and list functions and then concatenated using the concatenation operator (+), resulting in
    list(range(n+10, -1, -1)) + list(range(0, n+11, 5))
(b)
    [n, n, (n-5), (n-10), .., 10, 5, 0, -5, -10, .., -(n-5), -n, -n]
    = list(range(n, n+1)) + list(range(n, -n-1, -5)) + list(range(-n, -n+1))

Solution to exercise 19:
The Python fragment provided contains 3 errors: (1) at the end of an if statement a colon is required, (2) the boundaries 0 and 10 were not included, and (3) an 'else:' is required before the final print statement such that it is only executed when the value provided is outside the requested range. The corrected program then reads:
    s = input("Give a value between 0 and 10: ")
    value = int(s)
    if (value >= 0) and (value <= 10):
        print('Thank you')
    else:
        print('You fool!')

Solution to exercise 20:
(a) To open and read all lines from file 'sequences.seq' we use the following programming fragment
    infile = open('sequences.seq')
    alllines=infile.readlines()
    infile.close()
We then have a list of all lines of the file. By applying a for-loop over the total number of lines in the file, len(alllines), we easily have access to both the current line number and the line itself. Note that numbering of list indices starts with 0, so we have to add 1 to the line number when printed:
    for linenr in range(len(alllines)):
        print(linenr+1, alllines[linenr].rstrip())
When printing a line, rstrip() is used to remove possible newline characters at the end of the line. These should not be printed, as print itself already finishes with a newline; otherwise additional blank lines would be printed. So the total program becomes:
    """ Read the file sequences.seq and for all lines print the line number
        and the line itself """
    infile = open('sequences.seq')
    alllines=infile.readlines()
    infile.close()
    for linenr in range(len(alllines)):
        print(linenr+1, alllines[linenr].rstrip())
(b) Reading the file and looping over all lines is analogous to the solution of (a). Given a line, we have to consider the first occurrence of TT. Since it is given that such a string occurs, we may simply use the find-method:
    line=alllines[linenr]
    m=line.find("TT")
Having the position of the TT, we have to search starting from that position for the occurrence of AA. Once again we use the find-method, but applied to the part of the line starting at TT. If we have the position of the AA in that part of the line, we translate it to the position in the original line:
    remainingpartofline=line[m:]
    n=remainingpartofline.find("AA")
    # n is the first occurrence of AA in remainingpartofline
    n=m+n
    # n is the first occurrence of AA in the line after the first TT
So the total program becomes:
    """ Read the file sequences.seq and for all lines print the line number
        and the part from TT to the first AA """
    infile = open('sequences.seq')
    alllines=infile.readlines()
    infile.close()
    for linenr in range(len(alllines)):
        line=alllines[linenr]
        m=line.find("TT")
        remainingpartofline=line[m:]
        n=remainingpartofline.find("AA")
        # n is the first occurrence of AA in remainingpartofline
        n=m+n
        # n is the first occurrence of AA in the line after the first TT
        print(linenr+1, alllines[linenr][m:n+len("AA")])

Partial solution to exercise 22:
(a) To read the file and store its contents in 3 lists:
    # Read the file BMIs.txt and store its lines in a list
    inf = open('BMIs.txt')
    lines = inf.readlines()
    inf.close()
    # Create 3 empty lists
    names = []
    weights = []
    lengths = []
    # Parse line by line and store data in appropriate lists
    for line in lines:
        s = line.split()
        name = s[0]
        weight = float(s[1])
        length = float(s[2])
        names.append(name)
        weights.append(weight)
        lengths.append(length)

Solution to exercise 25:
(a) The command s=input(text) displays the string text on the screen and stores the text entered by the user in the variable s. Here text should be 'Enter a sequence: ':
    # Ask the user for a sequence and print its length
    seq = input('Enter a sequence: ')
    print('It is', len(seq), 'bases long')
(b) To determine the number of occurrences of a substring subs in a string s, the method s.count(subs) can be applied:
    # also print the number of A, T, C, and G characters in the sequence
    seq = input('Enter a sequence: ')
    print('It is', len(seq), 'bases long')
    print('adenine:', seq.count('A'))
    print('thymine:', seq.count('T'))
    print('cytosine:', seq.count('C'))
    print('guanine:', seq.count('G'))
(c) To allow for both lower-case and upper-case characters we first transform the input string to an all-uppercase string by applying the upper-method:
    # ...
    # allow both lower-case and upper-case characters
    seq = input('Enter a sequence: ')
    seq = seq.upper()
    print('It is', len(seq), 'bases long')
    print('adenine:', seq.count('A'))
    print('thymine:', seq.count('T'))
    print('cytosine:', seq.count('C'))
    print('guanine:', seq.count('G'))
(d) To determine the number of unknown characters, sum up the occurrences of 'A', 'C', 'T', and 'G' and subtract this sum from the total length of the string:
    # ... also print the number of unknown characters
    seq = input('Enter a sequence: ')
    seq = seq.upper()
    n = len(seq)
    a = seq.count('A')
    t = seq.count('T')
    c = seq.count('C')
    g = seq.count('G')
    print('It is', n, 'bases long')
    print('adenine:', a)
    print('thymine:', t)
    print('cytosine:', c)
    print('guanine:', g)
    print('unknown:', n - a - t - c - g)

Solution to exercise 27:
(a) The list of integers [start, start + step, start + 2 * step, ...] can be generated by evaluating list(range(start, stop, step)). If step is positive, the last element is the largest start+i*step less than stop; if step is negative, the last element is the smallest start+i*step greater than stop. So
    list(range(1, n, 1))
produces
    [1, 2, 3, 4, 5, 6, 7, 8, ..., n-2, n-1]
and
    list(range(n, 0, -1))
produces
    [n, n-1, ..., 3, 2, 1]
Hence the required list can be generated by:
    list(range(1, n, 1))+list(range(n, 0, -1))
The solution suggested in the exercise
    a = list(range(1,101))
    b = a
is not a proper one. The last assignment makes a and b two different names for one and the same object, so any change to b is also made to a. So after
    b.reverse()
a is also reversed:
    >>> a
    [100, 99, 98, ..., 3, 2, 1]
and
    print(a+b[1:])
gives
    [100, 99, 98, ..., 3, 2, 1, 99, 98, ... 3, 2, 1]
(b) When l is a list and x is an element in l, then l.remove(x) removes the first occurrence of x from l.
So one way to obtain the result is:
    a=list(range(1, n, 1))
    a.remove(73)
    b=list(range(n, 0, -1))
    b.remove(73)
    a+b
Another approach is to combine several range commands:
    (list(range(1, 73, 1))+list(range(74, n, 1))+list(range(n, 73, -1))+
     list(range(72, 0, -1)))
Yet another approach would be
    a=list(range(1, n+1, 1))
    a.remove(73)
    m = a[:-1]+a[::-1]
(c) When these commands are executed in the Python interpreter, the following output is produced
    >>> a=[1,2,3,4]
    >>> b=[9,16,25,36]
    >>> c=[a,b]
    >>> d=[a+b]
    >>> len(c)
    2
    >>> len(d)
    1
    >>> len(a)
    4
    >>> len(b)
    4
List c consists of two elements. The first element is list a, the second list b. List d has only one element, the list [1, 2, 3, 4, 9, 16, 25, 36].
(d) Inspection of the list shows that it is in fact the sorted version of 3 copies of the list [0, 1, 2, ..., n-1, n], so one way of obtaining the result is:
    a=list(range(0, n+1, 1))*3
    a.sort()
    print(a)
(e) Applying the range-method 3 times with step=4 and sorting the concatenated result does the job:
    a=list(range(1, n+1, 4))
    b=list(range(2, n+1, 4))
    c=list(range(3, n+1, 4))
    d=a+b+c
    d.sort()
    print(d)

Solution to exercise 29:
The sequence seq is a DNA sequence consisting of only A's, C's, G's and T's. So after
    seq=seq.upper()
    seq=seq.replace('C',' ')
    seq=seq.replace('T',' ')
    seq=seq.replace('G',' ')
seq consists only of A's and spaces. Splitting it on white space hence results in a list whose elements are sequences of only A's. Because all elements consist of the same character, sorting this list orders the elements by length, with the shortest sequences first. Hence by
    a=seq.split()
    a.sort()
    print(len(a[-1]))
the length of the longest subsequence consisting of only A's in the DNA sequence seq is printed.

Solution to exercise 30:
A proper name for the function whatshouldbemyname would be uniqueSorted.
The function returns a new list with a single copy of each element in the input list, i.e., all duplicates omitted, where the elements in the returned list are sorted in increasing order.

Solution to exercise 32:
(a) Lists are mutable, whereas strings are not, and lists therefore offer some methods that strings lack. An example of such a method is reverse. To reverse a string, make a list out of it, reverse that list, and then make a string out of the list again using the join-method:
    def wording(word):
        """ A function that has a word as parameter and prints the word,
            its length and the reversed word. """
        letters = list(word)
        letters.reverse()
        revword = ''.join(letters)
        print('word:\t', word)
        print('length:\t', len(word))
        print('reverse:\t', revword, '\n')

    wording('verzuring')
(b)
    def processFile(filename):
        """A function that has a filename as parameter and prints for each
           word in the file the word, its length and the reversed word."""
        # Open and read the file 'filename'
        inf = open(filename)
        filecontents = inf.read()
        inf.close()
        # Optionally one could remove some punctuation
        for c in ['.', ',', ':', ';']:
            filecontents = filecontents.replace(c, '')
        # Divide the filecontents in a list of words
        words = filecontents.split()
        # and process all words one by one
        for word in words:
            wording(word)

    processFile('mytext.txt')

Partial solution to exercise 35:
(a) If we have a string in which the contents of the file are stored, the sentences can be separated by applying the split-method with the period as splitting string. Next we have to add the period again to each of the elements of this list. We have to exclude the last element of the list, since that element is either empty (when the file ends with a period) or a string without a period, and hence not a sentence.
A Python method that implements this description is:
    def text2sentences(filename):
        """ A function having an input file name (string) as parameter,
            it returns a list of all the sentences. """
        infile = open(filename)
        alllines = infile.read()
        infile.close()
        listofsentenceswithoutperiod = alllines.split('.')
        allsentences=[]
        for s in listofsentenceswithoutperiod[:-1]:
            allsentences.append(s+'.')
        return allsentences

    print(text2sentences('mytext2.txt'))

Solution to exercise 38:
The requested function could look like:
    def minmaxmean(m):
        """Returns the minimum value, the maximum value, as well as the mean
           of the items in the list m """
        if m==[]:
            # if the list m is empty: minimum, maximum and mean are undefined
            return None, None, None
        else:
            # otherwise they can be calculated by looping over all items
            minval = m[0]
            maxval = m[0]
            sumval = m[0]
            for i in range(1,len(m)):
                if m[i] < minval:
                    minval = m[i]
                elif m[i] > maxval:
                    maxval = m[i]
                sumval = sumval + m[i]
            return minval, maxval, float(sumval)/len(m)
The function can then be executed with m1 as parameter, storing its 3 resulting values in 3 distinct variables, which can subsequently be printed (e.g. using the 'new style'):
    m1=[9,4,5,6,2,5,4,3,1,2,12,7,4,3,2,8,4,2]
    mymin,mymax,mymean = minmaxmean(m1)
    print('min: {}, max: {}, mean: {:.3f}'.format(mymin,mymax,mymean))
This yields as output:
    min: 1, max: 12, mean: 4.611
The same function can then also be executed on the other list, i.e. m2. Instead of storing the three resulting values in three variables, the three values can also be stored in a single tuple. Using indexing operations, the three values can then be extracted and displayed similarly as above.
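The indexing approach just described could look as follows. This is a sketch only: to keep the block self-contained, a compact stand-in named minmaxmean_sketch (a hypothetical name) is defined with the built-in functions min, max and sum; it is assumed to return the same three values as the minmaxmean function above. The list values are those of m2 from the exercise.

```python
def minmaxmean_sketch(m):
    """Stand-in for minmaxmean, built on min, max and sum for brevity
    (an assumption; the solution above loops over the items itself)."""
    if m == []:
        return None, None, None
    return min(m), max(m), sum(m) / len(m)

m2 = [3,4,5,2,2,12,2,1,8,2,9,4,3,6,4,4,7,5]
result = minmaxmean_sketch(m2)   # a single tuple holding all three values

# Extract the three values from the tuple by indexing
mymin = result[0]
mymax = result[1]
mymean = result[2]

line = 'min: {}, max: {}, mean: {:.3f}'.format(mymin, mymax, mymean)
print(line)
```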
Using the 'old style' the tuple may also be used directly:
    m2=[3,4,5,2,2,12,2,1,8,2,9,4,3,6,4,4,7,5]
    myminmaxmeantuple = minmaxmean(m2)
    print('min: %d, max: %d, mean: %.3f' % myminmaxmeantuple)

Partial solution to exercise 39:
(a) The output of
    for i in range(1,5):
        for j in range(1,5):
            print('%4d' % i*j, end=' ')
        print()
is:
       1    1   1    1   1   1    1   1   1   1
       2    2   2    2   2   2    2   2   2   2
       3    3   3    3   3   3    3   3   3   3
       4    4   4    4   4   4    4   4   4   4
The reason for this (maybe unexpected) result is that the % operator and the * operator have the same priority and are therefore evaluated from left to right. In each print statement, the value of i is thus first substituted in the string, after which that string is repeated j times. Thus first a single string '   1' is printed, followed by a space due to the end=' ' argument of the print statement. Then on the same line two such strings are printed, again followed by a space. Then three such strings followed by a space, and finally four such strings followed by a space. Then the inner loop is finished, after which the second print statement is executed. This simply results in a new line. Subsequently the inner loop is executed again with i equal to 2, etc., finally resulting in the above output.
(b) In order to print the requested table, the value that should be printed each time is the product of i and j. To make sure that this product is calculated first and that the resulting value is subsequently substituted in the string, parentheses should be used, i.e.:
    for i in range(1,13):
        for j in range(1,13):
            print('%4d' % (i*j), end=' ')
        print()

Solution to exercise 41:
(a) Because we do not know which characters the arbitrary string consists of, the best solution seems to be to loop over all characters in the string and tally the numbers of characters encountered. A good way to store the numbers of characters encountered (so far) is by using a dictionary. When we have not yet considered any characters, the dictionary should be empty.
For each character encountered we check whether it was already observed before or not, i.e., whether it is already present in the dictionary or not. If it is not present in the dictionary yet, we add that character as a new key to the dictionary and attach the value 1 to it, i.e., one occurrence (so far). If we encounter a character that is already present in the dictionary, we just increase the attached value by one. As a Python fragment this may read:

    # define the arbitrary string s
    s = 'AAAAAAAAAAA#A#AB'
    # loop over all characters and tally
    countd={}
    for char in s:
        if char in countd:
            countd[char] += 1
        else:
            countd[char] = 1

Note: in the above solution,

    countd[char] += 1

is a shorter notation for:

    countd[char] = countd[char] + 1

i.e., a way to increase the value of countd[char] by 1.
(b) The aligned table could subsequently be produced using:

    for key in sorted(countd):
        n = countd[key]
        print('{} {:3d} {:7.2%}'.format(key, n, n/len(s)))

Solution to exercise 43:
(a)

    import urllib.request
    import urllib.parse

    def getFromChemCalc(molformula):
        """ Return molecular information for 'molformula' as obtained
            from http://www.chemcalc.org
        """
        url = 'http://www.chemcalc.org/chemcalc/mf'
        # Define the parameters and send them to Chemcalc
        mfdict = {'mf': molformula,'isotopomers':'jcamp,xy'}
        params = urllib.parse.urlencode(mfdict)
        response = urllib.request.urlopen(url, params.encode())
        # Read the output
        return response.read().decode()

    chemcalcstr = getFromChemCalc('C2H6O')

(b)

    import json
    chemcalcdict = json.loads(chemcalcstr)
    print(chemcalcdict.keys())
    print('molecular weight of',chemcalcdict['mf'],'is:',chemcalcdict['mw'])

(c)

    elemlist = chemcalcdict['parts'][0]['ea']
    for elem in elemlist:
        print('%4s %7.2f%%' % (elem['element'], elem['percentage']))

An alternative using the new style formatting is to unpack the dictionary (using **) and use the dictionary keys to indicate at which position in the string the values should be inserted:

    for elem in elemlist:
        print('{element:>4s} {percentage:7.2f}%'.format(**elem))

Solution to exercise 44:

    import urllib.request
    import urllib.parse

    # Open the webpage
    protocol="http"
    hostname="cbio.bmt.tue.nl"
    path="~philbers/index.htm"
    url=protocol+"://"+hostname+"/"+path
    rf=urllib.request.urlopen(url)
    # Read the data from the webpage
    data=rf.read().decode()
    # Convert to lower case and count
    data = data.lower()
    nr = data.count('computational')
    print('The webpage contains', nr, 'times the word "computational"')

The output of the program:

    The webpage contains 9 times the word "computational"

Solution to exercise 45:

    from Bio import Entrez

    def returnnrhits(l=[]):
        Entrez.email="your.name@student.tue.nl"
        searchterm=" AND ".join(l)
        handle=Entrez.esearch(db="nucleotide", term=searchterm)
        record = Entrez.read(handle)
        return record["Count"]

    nrhits=returnnrhits(l=[
        "Escherichia coli[Organism]",
        "complete genome[All Fields]",
        "srcdb_refseq[Properties]"])
    print("Nr of hits:", nrhits)

The output of the program is:

    Nr of hits: 1158

Solution to exercise 50:

    def triangleOfStars(k):
        """ A function that prints, for k as an integer parameter,
            a filled triangle of stars ('*') with k stars as basis and
            k stars as height. Between two stars a space is printed.
            For example, when k is 5:
                print('*')          # line 0 has 1 star
                print('* *')        # line 1 has 2 stars
                print('* * *')      # line 2 has 3 stars
                print('* * * *')    # line 3 has 4 stars
                print('* * * * *')  # line 4 has 5 stars
            Thus, closed triangle: with lines numbered from 0 through k-1,
            each line has one more star than its line number.
        """
        i = 0
        # invariant: 0<=i<=k and i lines printed
        while i < k:
            # invariant: 0<=i<=k and i lines printed
            print('* '*(i+1))   # line i with i+1 stars
            i = i+1
            # invariant: 0<=i<=k and i lines printed
        # invariant: 0<=i<=k and i lines printed
        # because the loop stopped, also holds: i>=k
        # Thus: i==k and i lines printed
        # Thus: whole triangle printed

    # test the function
    triangleOfStars(5)

Solution to exercise 53: In all cases, the exercise asks for the values that i takes. Adding a line to the program that prints this value solves the problem.
So running

    def countSomething(word='insulin resistance'):
        i=0
        print(word)
        print("The value of i is:", str(i))
        counter=0
        jump=word.index('n')
        print("The value of jump is:", str(jump))
        while i<len(word):
            if word[i]=='n':
                counter=counter+i
                i=i+jump
            i=i+1
            print("The value of i is:", str(i))
        return counter

    print(countSomething('diabetes patient'))
    print("****")
    print(countSomething('diabetes patient or not'))
    print("****")
    print(countSomething('not a diabetes patient'))
    print("****")
    print(countSomething())

yields

    diabetes patient
    The value of i is: 0
    The value of jump is: 14
    The value of i is: 1
    The value of i is: 2
    The value of i is: 3
    The value of i is: 4
    The value of i is: 5
    The value of i is: 6
    The value of i is: 7
    The value of i is: 8
    The value of i is: 9
    The value of i is: 10
    The value of i is: 11
    The value of i is: 12
    The value of i is: 13
    The value of i is: 14
    The value of i is: 29
    14
    ****
    diabetes patient or not
    The value of i is: 0
    The value of jump is: 14
    The value of i is: 1
    The value of i is: 2
    The value of i is: 3
    The value of i is: 4
    The value of i is: 5
    The value of i is: 6
    The value of i is: 7
    The value of i is: 8
    The value of i is: 9
    The value of i is: 10
    The value of i is: 11
    The value of i is: 12
    The value of i is: 13
    The value of i is: 14
    The value of i is: 29
    14
    ****
    not a diabetes patient
    The value of i is: 0
    The value of jump is: 0
    The value of i is: 1
    The value of i is: 2
    The value of i is: 3
    The value of i is: 4
    The value of i is: 5
    The value of i is: 6
    The value of i is: 7
    The value of i is: 8
    The value of i is: 9
    The value of i is: 10
    The value of i is: 11
    The value of i is: 12
    The value of i is: 13
    The value of i is: 14
    The value of i is: 15
    The value of i is: 16
    The value of i is: 17
    The value of i is: 18
    The value of i is: 19
    The value of i is: 20
    The value of i is: 21
    The value of i is: 22
    20
    ****
    insulin resistance
    The value of i is: 0
    The value of jump is: 1
    The value of i is: 1
    The value of i is: 3
    The value of i is: 4
    The value of i is: 5
    The value of i is: 6
    The value of i is: 8
    The value of i is: 9
    The value of i is: 10
    The value of i is: 11
    The value of i is: 12
    The value of i is: 13
    The value of i is: 14
    The value of i is: 15
    The value of i is: 17
    The value of i is: 18
    22

The function first prints the word and the line 'The value of i is: 0', after which it determines the index of the first occurrence of the letter 'n' in the string word and stores that value in the variable jump. Subsequently, the function iterates over the indices of the string word and prints those indices. Only when a character 'n' is encountered does the index considered jump an additional jump characters forward.

Solution to exercise 54: We have made a small adaptation to the function by adding as argument a list containing the values that would otherwise be input. Two print statements have also been added to produce a 'nice' table.

    def examplerep(inlist):
        """ Changed the method to print a nice table at the end
            of each iteration.
        """
        a = 1000
        b = 1000
        i = 0
        print("%8s %8s %8s %8s" % ("i", "a", "b","c"))
        while (i<5):
            c = inlist[i]   # automated input
            if (c<b):
                if (c<=a):
                    b = a
                a = c
            print("%8d %8d %8d %8d" % (i, a, b, c))
            i = i + 1
        print("My answers are: "+str(a)+" and "+str(b))

    examplerep([0, 1, 2, 3, 4])
    print()
    examplerep([900, 800, 700, 600, 500])
    print()
    examplerep([3, 33, 333, 444, 33])

When this program is run, the following output is produced:

           i        a        b        c
           0        0     1000        0
           1        1     1000        1
           2        2     1000        2
           3        3     1000        3
           4        4     1000        4
    My answers are: 4 and 1000

           i        a        b        c
           0      900     1000      900
           1      800      900      800
           2      700      800      700
           3      600      700      600
           4      500      600      500
    My answers are: 500 and 600

           i        a        b        c
           0        3     1000        3
           1       33     1000       33
           2      333     1000      333
           3      444     1000      444
           4       33      444       33
    My answers are: 33 and 444

Solution to exercise 58:
(a)

    class Gene:
        def __init__(self, genesymbol="INS", genename="insulin"):
            self.gene_symbol=genesymbol
            self.gene_name=genename

Examples of object instantiation are

    mygene=Gene()
    mygene1=Gene(genesymbol="Casp4")
    mygene2=Gene(genesymbol="Casp4",
                 genename="apoptosis-related cysteine peptidase")

(b) A method of a class always has self as its first parameter. Since the method can only be applied to an object of the class, it is safe to assume that the object has already been created, and hence that all data attributes have already received a value. The contents of the data attributes can be printed using the print function:

        def print_geneinfo(self):
            print("The symbol of this gene is:", self.gene_symbol)
            print("The name of this gene is:", self.gene_name)

So applying this method (mygene2.print_geneinfo()) to the object mygene2 using the instantiation given above, yields

    The symbol of this gene is: Casp4
    The name of this gene is: apoptosis-related cysteine peptidase

Solution to exercise 60:
(a)

    from openpyxl import load_workbook
    import matplotlib.pyplot as plt

    def getWorkBook(xlsxfilename):
        """ A function with a file name as single parameter that,
            if the file name corresponds to an excel file,
            reads that file and returns it as a workbook object.
        """
        wb = None
        if xlsxfilename[-5:]=='.xlsx':
            wb = load_workbook(filename = xlsxfilename)
        return wb

    wb = getWorkBook('health.xlsx')

(b)

    def readColumn(wb, colnr=1):
        """ Extracts the data from column 'colnr' of the workbook 'wb'
            and returns it as a list
        """
        ws = wb.active
        nrrows = ws.max_row
        l = []
        for i in range(nrrows):
            # add 1 because excel columns and rows start at 1
            l.append(ws.cell(row=i+1,column=colnr+1).value)
        return l

(c)

    def scaleList(l, fac=1.0):
        """ Multiplies all elements of the list 'l' with a factor 'fac'
            and returns the scaled values in a new list.
        """
        newl = []
        for i in range(len(l)):
            newl.append(l[i]*fac)
        return newl

(d) Visual inspection of the Excel file using Excel shows that the first row contains a description of the file contents and the second row contains headers for the values on the subsequent rows. That second row shows that weights can be found in the fifth column and heights in the fourth column. Moreover, from the first row it is clear that the specified weights are in pounds, while the heights are in inches. These thus need to be converted to kilograms and meters, respectively, before being plotted.

    # open the excel file
    wb = getWorkBook('health.xlsx')
    weight = readColumn(wb, 4)
    height = readColumn(wb, 3)
    print('Extracted columns:', weight[1], height[1])
    weight = weight[2:]
    height = height[2:]
    # convert pounds to kilograms
    weightKg = scaleList(weight, 0.45359237)
    # convert inches to meters
    heightMeter = scaleList(height, 0.0254)
    # make the scatter plot
    plt.figure()
    plt.plot(weightKg,heightMeter,'b+')
    plt.xlabel('weight (kg)')
    plt.ylabel('height (m)')

(e) We can use the already converted weights and heights to calculate the BMIs:

    # calculate BMI
    bmicalc = []
    for i in range(len(heightMeter)):
        bmicalc.append(weightKg[i]/heightMeter[i]**2)

    ## make plot of BMI values to check them
    #plt.figure()
    #plt.plot(bmicalc,'ro')
    #plt.xlabel('patient number')
    #plt.ylabel('BMI')

(f) To select the data for male patients, we also need to extract the genders from the Excel workbook. Visual inspection of the Excel file shows that these are in the second column. Moreover, we need to extract the names from the first column. Info on male patients can then be written to a new workbook by looping over all patients and selecting on the gender.

    # Workbook is needed to create a new, empty workbook
    from openpyxl import Workbook

    names = readColumn(wb, 0)
    names = names[2:]
    gender = readColumn(wb, 1)
    gender = gender[2:]
    wbout = Workbook()
    ws1 = wbout.active
    ws1.title = "Male data"
    ws1.append(["name","height (m)", "weight (kg)","bmi"])
    for i in range(len(gender)):
        if gender[i] == 'M':
            ws1.append([names[i],heightMeter[i],weightKg[i],bmicalc[i]])
    wbout.save(filename = 'bmi_male.xlsx')

Solution to exercise 62: The solutions follow directly from selecting the right configuration options and are hence straightforward.
(a) Similar to what is described in the lecture notes:

    import tkinter

    top = tkinter.Tk()
    top.title("A DNA calculator")
    top.mainloop()

(b)

    import tkinter

    top = tkinter.Tk()
    top.title("A DNA calculator")
    top.geometry("500x400")
    top.mainloop()

(c)

    import tkinter

    top = tkinter.Tk()
    top.title("A DNA calculator")
    top.geometry("500x400")
    top.configure(background="blue")
    top.mainloop()

(d)

    import tkinter

    class MyApp:
        def __init__(self, parent):
            self.parent=parent
            self.l=tkinter.Label(parent,
                text="This DNA calculator should become colorful")
            self.l.pack()

    top = tkinter.Tk()
    top.title("A DNA calculator")
    top.geometry("500x400")
    top.configure(background="blue")
    myapp=MyApp(top)
    top.mainloop()

(e)

    import tkinter

    class MyApp:
        def __init__(self, parent):
            self.parent=parent
            self.l=tkinter.Label(parent,
                text="This DNA calculator should become colorful")
            self.l.configure(background="red")
            self.l.pack()

    top = tkinter.Tk()
    top.title("A DNA calculator")
    top.geometry("500x400")
    top.configure(background="blue")
    myapp=MyApp(top)
    top.mainloop()

(f)

    import tkinter

    class MyApp:
        def __init__(self, parent):
            self.parent=parent
            self.l=tkinter.Label(parent,
                text="This DNA calculator should become colorful")
            self.l.configure(background="red")
            self.l.configure(foreground="yellow")
            self.l.pack()

    top = tkinter.Tk()
    top.title("A DNA calculator")
    top.geometry("500x400")
    top.configure(background="blue")
    myapp=MyApp(top)
    top.mainloop()

(g)

    import tkinter

    class MyApp:
        def __init__(self, parent):
            self.parent=parent
            self.l=tkinter.Label(parent,
                text="This DNA calculator should become colorful")
            self.l.configure(background="red")
            self.l.configure(foreground="yellow")
            self.l.pack()
            self.l2=tkinter.Label(parent,
                text="Beautiful colors\n isn't it?",
                background="green", foreground="white")
            self.l2.pack()

    top = tkinter.Tk()
    top.title("A DNA calculator")
    top.geometry("500x400")
    top.configure(background="blue")
    myapp=MyApp(top)
    top.mainloop()

(h) Change the order in which the two labels are packed.

    import tkinter

    class MyApp:
        def __init__(self, parent):
            self.parent=parent
            self.l=tkinter.Label(parent,
                text="This DNA calculator should become colorful")
            self.l.configure(background="red")
            self.l.configure(foreground="yellow")
            self.l2=tkinter.Label(parent,
                text="Beautiful colors\n isn't it?",
                background="green", foreground="white")
            self.l2.pack()
            self.l.pack()

    top = tkinter.Tk()
    top.title("A DNA calculator")
    top.geometry("500x400")
    top.configure(background="blue")
    myapp=MyApp(top)
    top.mainloop()

Solution to exercise 64:
(a)

    from tkinter import *

    class MyApp:
        def __init__(self, parent):
            self.parent=parent
            self.l=Label(parent,
                text="This DNA calculator should become colorful",
                bg="black", fg="yellow")
            self.l.pack(anchor=N)

    r=Tk()
    r.configure(bg="blue")
    myapp=MyApp(r)
    r.mainloop()

(b)

    from tkinter import *

    class MyApp:
        def __init__(self, parent):
            self.parent=parent
            self.l=Label(parent,
                text="This DNA calculator should become colorful",
                bg="black", fg="yellow")
            self.l.pack(side=RIGHT, anchor=N)

    r=Tk()
    r.configure(bg="blue")
    myapp=MyApp(r)
    r.mainloop()

By adding side=RIGHT the label widget remains attached to the right side of the parent window.
(c)

    from tkinter import *

    class MyApp:
        def __init__(self, parent):
            self.parent=parent
            self.l=Label(parent,
                text="This DNA calculator should become colorful",
                bg="black", fg="yellow")
            self.l.pack(fill=X, side=RIGHT, expand=True)

    r=Tk()
    r.configure(bg="blue")
    myapp=MyApp(r)
    r.mainloop()

Adding fill=X makes the widget fill its allotted space in the X-direction while it remains attached to the right side; in the Y-direction it remains 'centered'. Setting expand=True means that the widget claims as much space as is available. In this case this space equals that of the parent window.
(d) and (e) Similar to (c), but in (d) the roles of the X- and Y-directions are interchanged. In (e) both directions are involved.