Programming for Bioinformatics Michael Schroeder BioTechnological Center TU Dresden ms@mpi-cbg.de Biotec The module… will teach students basic programming skills relevant to bioinformatics, which will enable them to actively develop bioinformatics tools. will take a problem-driven approach. will present bioinformatics problems and show how to solve them using existing online tools and how to implement such tools. will revisit some of the problems and databases discussed in applied bioinformatics. will be very practical and hands-on approach to basic computer science tools such as using command line operating systems, programming in Python, and using relational databases. Objectives Students will have an understanding of different operating systems Students will be able to automate simple repetitive information retrieval tasks Students will be able to write simple programs in Python Students will be able to work with relational databases Students will appreciate the principles, limits, and possibilities of programming Students will be able to formulate biological questions as information processing problems Students will understand when and how programming can help to automate bioinformatics problems Module Structure Introduction Databases Introduction to SQL A Little Exercise A Little Science Introduction to Python Data types and loops Sequences and lists Patterns and functions Dictionaries Advanced topics More Python Dynamic programming Clustering Revision Class Books You will need two books for the module: a reference book on MySQL and a book on Python Books: Python We will follow a number of online resources. (see course web page) Further, we look in Python in a Nutshell, Alex Martelli, O’Reilly Wesley Chun's Core Python Programming Python Cookbook (O’Reilly) The publisher O’Reilly has many general programming books on linux, python, etc. They allow you to read all books for 2 weeks online for free. This is very nice to decide what to buy and what not. You can also buy electronic copies of the book. Books: MySQL There are many, many books on MySQL The following two are just sugestions, as there are many other books covering the same material MySQL Cookbook by Paul DuBois, O'Reilly or MySQL by Paul DuBois, Michael Widenius, O'Reilly Structure of Labs Databases Lab 1,2: Simple SQL Lab 3,4: SQL to answer interesting scientific questions Python Lab 5: Data types and loops, accessing a DB from Python Lab 6: Sequences and lists Lab 7: Patterns and Functions Lab 8: Dictionaries Lab 9: BioPython Lab 10: Python & PyMOL More Python: Lab 11: Dynamic programming revisited Lab 12: Clustering revisited Lab 13: Revision Assessment Lab Exercises: Each week during the lab you get exercises which you have to do during the lab and finish on your own during the week These exercises need to be handed in on paper at the next lecture Results are discussed during the labs and as part of the assessment you will have to present a solution once Doing the exercises is compulsory, but there are no marks Project You will demonstrate your programming skills by implementing and presenting a software project Exam Pen and paper exam on material covered in lecture Programming Project Goal: Demonstrate ability to use SQL and programming Goal 2: Produce science movie for Long Night of Science You will work in a team and get a biological problem. Part 1: Programming: You have to implement some workflows, which integrate data from various sites and use various tools programmatically. This includes an animation of your target protein in PyMol. Part 2: Make a movie. Tell the story about your protein based on the data collected and analysis carried out. Create a story board and turn all material and Pymol animations into a movie. Motivation: Databases In the last term, we accessed most information online via the web we interacted directly and manually with databases and tools we had to manually submit queries, interpret results. select interesting results, cut&paste them, and submit queries again,… Pro: Reasonably easy to get hold of information Con: Not possible to ask many queries Queries limited by interface provided by web page Difficult/impossible to integrate information from different sites In this term, we will look at the databases underlying the online front ends How is the data internally stored? How can we - and more important computer programs - directly interact with the underlying data, so that we can ask more powerful queries, large queries, and integrate different systems What actually happens You are limited by what web server allows you to ask: Example CATH: •PDB ID, •CATH code, or •General text But you cannot ask: •In how many different PDB structures is there a P-loop domain? •Is there a PDB entry with a P-loop and a DNA-binding domain •How many different superfamilies does the largest structure in PDB have? •With direct access to the underlying database you could answer all these questions (and many more) Motivation: SCOP as Relational Database We worked with SCOP, the Structural Classification of Proteins Family: >30% sequence identity Superfamily: Similar structure and function (possibly lower 30% sequence identity) Family Same Superfamily, But not family 30% Picture from www.jenner.ac.uk/YBF/DanielleTalbot.ppt Motivation: Databases We wish to answer the following questions: How many families and superfamilies are there? Do all superfamilies roughly have the same number of families? How many families does the immunoglobulin superfamily have? Which superfamily has the most families and how many? How many percent of superfamilies have only one family? Which PDB structure has the largest number of distinct superfamilies? How many percent of PDB structures have only one type of superfamily, how many percent have at least two? Which is the most popular superfamily? Are all superfamilies equally likely to co-occur or do they have preferences? Which superfamily has the most co-occurrence partners? Is the number of co-occurrence partners and the frequency of the superfamily correlated? What is a Database SCOP contains relevant information, but we cannot answer the above questions through the web-interface of SCOP The problem is that we do not have access to the underlying database What is a database anyway? A database provides… Logical organization of data data models, schema design, dictionaries Physical organization of data Fast retrieval, indexing, compact storage of data Relational Database Central Idea: Data as relations in a table E.g. Employee +-------+------+---------+---------+ | id | name | salary | role | +-------+------+---------+---------+ | 46457 | pete | 50.000 | director| | 46458 | jane | 60.000 | nurse | | 46459 | asif | 70.000 | driver | +-------+------+---------+---------+ Relational Database Central Idea: Data as relations in a table E.g. SCOP, Structural Classification of Proteins +-------+------+---------+---------+--------------------------------------+ | id | type | sccs | sid | description | +-------+------+---------+---------+--------------------------------------+ | 46457 | cf | a.1 | | Globin-like | | 46458 | sf | a.1.1 | | Globin-like | | 46459 | fa | a.1.1.1 | | Truncated hemoglobin | | 46460 | dm | a.1.1.1 | | Truncated hemoglobin | | 46461 | sp | a.1.1.1 | | Ciliate (Paramecium caudatum) | | 14982 | px | a.1.1.1 | d1dlwa_ | 1dlw A: | | 46462 | sp | a.1.1.1 | | Green alga (Chlamydomonas eugametos) | | 14983 | px | a.1.1.1 | d1dlya_ | 1dly A: | | 63437 | sp | a.1.1.1 | | Mycobacterium tuberculosis | | 62301 | px | a.1.1.1 | d1idra_ | 1idr A: | +-------+------+---------+---------+--------------------------------------+ SCOP Tables mysql> select * from cla limit 1; +---------+--------+---------+-------+-------+-------+-------+-------+-------+-------+ | sid | pdb_id | sccs | cl | cf | sf | fa | dm | sp | px | +---------+--------+---------+-------+-------+-------+-------+-------+-------+-------+ | d1dlwa_ | 1dlw | a.1.1.1 | 46456 | 46457 | 46458 | 46459 | 46460 | 46461 | 14982 | +---------+--------+---------+-------+-------+-------+-------+-------+-------+-------+ mysql> select * from des limit 1; +-------+------+------+------+--------------------+ | id | type | sccs | sid | description | +-------+------+------+------+--------------------+ | 46456 | cl | a | | All alpha proteins | +-------+------+------+------+--------------------+ mysql> select * from astral limit 1; +---------+---------+-----------------------------------------------------------+ | sid | sccs | seq | +---------+---------+-----------------------------------------------------------+ | d1dlwa_ | a.1.1.1 | slfeqlggqaavqavtaqfyaniqadatvatffngidmpnqtnktaaflcaalgg...| +---------+---------+-----------------------------------------------------------+ mysql> select * from subchain limit 1; +----+-------+----------+-------+------+ | id | px | chain_id | begin | end | +----+-------+----------+-------+------+ | 1 | 14982 | A | | | +----+-------+----------+-------+------+ SCOP Tables mysql> select * from cla limit 1; +---------+--------+---------+-------+-------+-------+-------+-------+-------+-------+ | sid | pdb_id | sccs | cl | cf | sf | fa | dm | sp | px | +---------+--------+---------+-------+-------+-------+-------+-------+-------+-------+ | d1dlwa_ | 1dlw | a.1.1.1 | 46456 | 46457 | 46458 | 46459 | 46460 | 46461 | 14982 | +---------+--------+---------+-------+-------+-------+-------+-------+-------+-------+ mysql> select * from des limit 1; +-------+------+------+------+--------------------+ | id | type | sccs | sid | description | +-------+------+------+------+--------------------+ | 46456 | cl | a | | All alpha proteins | +-------+------+------+------+--------------------+ mysql> select * from astral limit 1; +---------+---------+-----------------------------------------------------------+ | sid | sccs | seq | +---------+---------+-----------------------------------------------------------+ | d1dlwa_ | a.1.1.1 | slfeqlggqaavqavtaqfyaniqadatvatffngidmpnqtnktaaflcaalgg...| +---------+---------+-----------------------------------------------------------+ mysql> select * from subchain limit 1; +----+-------+----------+-------+------+ | id | px | chain_id | begin | end | +----+-------+----------+-------+------+ | 1 | 14982 | A | | | +----+-------+----------+-------+------+ Querying Relational Databases SQL = Structured Query Language Select Which attributes? from Which tables? where Which conditions? Select … from … where … Distinct Like Union/intersect Join Count/average/sum/min/m ax Group by Having Show tables Show databases Use Create database Create table … as Drop table Load data Insert into Databases Given SCOP as relational database, we can answer all the questions raised above using the SQL constructs of the previous slide! Programming We will use Python (Guido van Rossum, named after Monty Python) as a convenient extension to the operating system Easy to write quick programs More than just a scripting language Interpreted, interactive, indented Supports string processing well Widely used in bioinformatics Object oriented, general purpose Many nice libraries for database access, Graphics, Web, GUI, R… Scientific orientation: Numerical Python (math), Scientific Python, Biopython Beware: Python is inefficient, but computationally expensive parts can be included as C-libraries Motivation: Families and Identity We said that SCOP families share >30% identity What does that mean? Any two structures in a family >30%? At least one other member in family with >30%? What is the average sequence similarity within a family? Within a superfamily? Given a sequence and that we know already which superfamily it belongs to. Can we find the superfamily’s family best suited for the sequence Two approaches: Blast vs. DIY We can answer the above easily: We use SCOP database and run database queries from a Python script For a given superfamily select all corresponding sequences from the astral table For all pairs of selected sequences Call Blast and record the sequence identity Or run your own dynamic programming algorithm and record the sequence identity For second problem: Compare sequence to all family sequences and assign it to the family which shares the highest (must be >30%) similarity with the sequence Motivation: Sequence vs. Structure Can we verify the plot below? Can we create a similar plot for specific superfamilies? E.g. DNA-binding domains? Family Same Superfamily, But not family 30% Picture from www.jenner.ac.uk/YBF/DanielleTalbot.ppt Motivation: Sequence vs. Structure Again: select the relevant sequences from the astral table and besides computing the sequence identity, we compute structural similarity to the relevant structure using an algorithm like Dali or CE Then plot the two similarities against each other in a scatter plot Motivation: Amino Acid Composition of Families Can we characterise the amino acid composition of different families/superfamilies? Again: select the relevant sequences from astral and count the frequencies of amino acids Is the amino acid composition at the interface of a domain different from the rest of the domain? Motivation: Let’s rebuild SCOP families Given a SCOP superfamily and its sequences, how can we divide it into families? First, we need dynamic programming to determine the sequence similarity Then we do the following: For all pairs of sequences, call the sequence similarity algorithm and record the similarity into a distance matrix Next, run hierarchical clustering to cluster the sequences. What’s needed… …programming in Python Python Programming Constructs Variables, strings, For/while Loops If statements File I/O Regular expressions Data structures: Lists, Hashes Code Structure: Objects, classes, modules Hello World in Python Given a file helloworld.py Open a shell and type at the command prompt helloworld.py The shell then executes your programme In the first line, it realises that the python interpreter needs to be loaded and that what follows is a python program The line below prints a message File: helloworld.py print "Hello World" Read a text file in python The command open opens a text file and creates “r” as second argument after the filename indicates that file is read (this is default, ie. can be left out) “w” as second argument indicates that file is written to “a” as second argument indicates that file is appended to The for-loop reads all lines of the file one by one (requires python >2.2) The body of the loop prints them on the screen (note that print adds a new line automatically, avoid that with adding a ”,” ) File: fileIO.py data = open("seq.txt“, “r”) File: seq.txt for line in data: print "Line:”, line, acgt gggt Output Line: acgt Line: gggt Variables in Python The = symbol is used to assign values to variables The + symbol is also used to concatenate strings File: fileIO.pl lineNo = 1 File: seq.txt for line in open(“seq.txt”): Output 1: acgt acgt print lineNo+”: ”+line, 2: gggt gggt lineNo = lineNo+1 File: seqcomp.py If-thenelse and strings in Python data = open("seq.txt") line1 = data.readline().rstrip() line2 = data.readline().rstrip() len1=len(line1) len2=len(line2) if len1 < len2: minLen = len1 else: minLen = len2 line3 = "" for i in range(minLen): if line1[i] == line2[i]: line3=line3+"*" else: line3=line3+" " print print print print "Sequence comparison" line1 line2 line3 File: seq.txt acgt gggt Output Sequence comparison acgt gggt ** Programming Example