The following material is the result of a curriculum development effort to provide a set of courses supporting bioinformatics efforts that involve students from the biological sciences, computer science, and mathematics departments. The courses were developed as part of the NIH-funded project “Assisting Bioinformatics Efforts at Minority Schools” (2T36 GM008789). The people involved in the curriculum development effort include:

- Dr. Hugh B. Nicholas, Dr. Troy Wymore, Mr. Alexander Ropelewski and Dr. David Deerfield II, National Resource for Biomedical Supercomputing, Pittsburgh Supercomputing Center, Carnegie Mellon University
- Dr. Ricardo Gonzalez-Mendez, University of Puerto Rico Medical Sciences Campus
- Dr. Alade Tokuta, North Carolina Central University
- Dr. Jaime Seguel and Dr. Bienvenido Velez, University of Puerto Rico at Mayaguez
- Dr. Satish Bhalla, Johnson C. Smith University

Unless otherwise specified, all the information contained within is copyrighted © by Carnegie Mellon University. Permission is granted to use, modify, and reproduce these materials for teaching purposes.

These materials were developed with funding from the US National Institutes of Health grant #2T36 GM008789 to the Pittsburgh Supercomputing Center.

This material is targeted towards students with a general background in biology. It was developed to introduce biology students to the computational, mathematical, and biological issues surrounding bioinformatics.

This specific lesson deals with the following fundamental topics:
- Essential computing for bioinformatics
- Computer Science track

This material has been developed by:
Dr. Hugh B. Nicholas, Jr.
National Resource for Biomedical Supercomputing
Pittsburgh Supercomputing Center
Carnegie Mellon University

Essential Computing for Bioinformatics
Course Development Overview
Bienvenido Vélez
UPR Mayaguez
July, 2008

Outline
- Essential Computing for Bioinformatics
  - Course Description
  - Educational Objectives
  - Major Course Modules
  - Module Descriptions
  - Status
- Bioinformatics Data Management
  - Course Description
  - Educational Objectives
  - Major Course Modules
  - Module Descriptions
  - Status

Essential Computing for Bioinformatics: Course Description
This course provides a broad introductory discussion of essential computer science concepts that have wide applicability in the natural sciences. Particular emphasis will be placed on applications to bioinformatics. The concepts will be motivated by practical problems arising from the use of bioinformatics research tools such as genetic sequence databases. Concepts will be discussed in a weekly lecture and practiced via simple programming exercises using Python, an easy-to-learn and widely available scripting language.

Educational Objectives
- Awareness of the mathematical models of computation and their fundamental limits
- Basic understanding of the inner workings of a computer system
- Ability to extract useful information from various bioinformatics data sources
- Ability to design computer programs in a modern high-level language to analyze bioinformatics data
- Experience with commonly used software development environments and operating systems
- Experience applying computer programming to solve bioinformatics problems

Major Course Modules

Module                                          Lecture   MARC Lecture
First Steps in Computing: Course Overview       1
Using Bioinformatics Data Sources               2
Mathematical Computing Models                   3         5
High-level Programming (Python): Basics         5         1
High-level Programming (Python): Flow Control   6         2
High-level Programming (Python): Container Objects  7     3
High-level Programming (Python): Files          8         4
High-level Programming (Python): BioPython      9

Main Advantages of Python
- Familiar to C/C++/C#/Java programmers
- Very high level
- Interpreted and multi-platform
- Dynamic
- Object-oriented
- Modular
- Strong string manipulation
- Lots of libraries available
- Runs everywhere
- Free and open source
- Track record in bioinformatics (BioPython)

Example: A Codon -> AminoAcid Dictionary
>>> code = {
...     'ttt': 'F', 'tct': 'S', 'tat': 'Y', 'tgt': 'C',
...     'ttc': 'F', 'tcc': 'S', 'tac': 'Y', 'tgc': 'C',
...     'tta': 'L', 'tca': 'S', 'taa': '*', 'tga': '*',
...     'ttg': 'L', 'tcg': 'S', 'tag': '*', 'tgg': 'W',
...     'ctt': 'L', 'cct': 'P', 'cat': 'H', 'cgt': 'R',
...     'ctc': 'L', 'ccc': 'P', 'cac': 'H', 'cgc': 'R',
...     'cta': 'L', 'cca': 'P', 'caa': 'Q', 'cga': 'R',
...     'ctg': 'L', 'ccg': 'P', 'cag': 'Q', 'cgg': 'R',
...     'att': 'I', 'act': 'T', 'aat': 'N', 'agt': 'S',
...     'atc': 'I', 'acc': 'T', 'aac': 'N', 'agc': 'S',
...     'ata': 'I', 'aca': 'T', 'aaa': 'K', 'aga': 'R',
...     'atg': 'M', 'acg': 'T', 'aag': 'K', 'agg': 'R',
...     'gtt': 'V', 'gct': 'A', 'gat': 'D', 'ggt': 'G',
...     'gtc': 'V', 'gcc': 'A', 'gac': 'D', 'ggc': 'G',
...     'gta': 'V', 'gca': 'A', 'gaa': 'E', 'gga': 'G',
...     'gtg': 'V', 'gcg': 'A', 'gag': 'E', 'ggg': 'G'
... }
>>>

Example: A DNA Sequence

>>> cds = ("atgagtgaacgtctgagcattaccccgctggggccgtatatcggcgcacaaa"
...        "tttcgggtgccgacctgacgcgcccgttaagcgataatcagtttgaacagctttaccatgcggtg"
...        "ctgcgccatcaggtggtgtttctacgcgatcaagctattacgccgcagcagcaacgcgcgctggc"
...        "ccagcgttttggcgaattgcatattcaccctgtttacccgcatgccgaaggggttgacgagatca"
...        "tcgtgctggatacccataacgataatccgccagataacgacaactggcataccgatgtgacattt"
...        "attgaaacgccacccgcaggggcgattctggcagctaaagagttaccttcgaccggcggtgatac"
...        "gctctggaccagcggtattgcggcctatgaggcgctctctgttcccttccgccagctgctgagtg"
...        "ggctgcgtgcggagcatgatttccgtaaatcgttcccggaatacaaataccgcaaaaccgaggag"
...        "gaacatcaacgctggcgcgaggcggtcgcgaaaaacccgccgttgctacatccggtggtgcgaac"
...        "gcatccggtgagcggtaaacaggcgctgtttgtgaatgaaggctttactacgcgaattgttgatg"
...        "tgagcgagaaagagagcgaagccttgttaagttttttgtttgcccatatcaccaaaccggagttt"
...        "caggtgcgctggcgctggcaaccaaatgatattgcgatttgggataaccgcgtgacccagcacta"
...        "tgccaatgccgattacctgccacagcgacggataatgcatcgggcgacgatccttggggataaac"
...        "cgttttatcgggcggggtaa")
>>>

Example: CDS Sequence -> Protein Sequence

>>> def translate(cds, code):
...     prot = ""
...     # Step through the sequence one codon (three bases) at a time.
...     for i in range(0, len(cds), 3):
...         codon = cds[i:i+3]
...         # Look up the amino acid for this codon and append it.
...         prot = prot + code[codon]
...     return prot
...
>>> translate(cds, code)   # output abbreviated: first 17 residues and the final stop
'MSERLSITPLGPYIGAQ*'

Using Bioinformatics Data Sources
Goal: Basic Experience
- Searching Nucleotide Sequence Databases
- Searching Amino Acid Sequence Databases
- Performing BLAST Searches
- Using Specialized Data Sources
Reference: Bioinformatics for Dummies (Ch 1-4)
IDEA: How can we expedite data collection and analysis? ... writing programs to automate parts of the process.

Mathematical Computing
Goal: General Awareness
- What is Computing?
- Mathematical Models of Computing
  - Finite Automata
  - Turing Machines
- The Limits of Computation
- Church/Turing Thesis
- What is an Algorithm?
- Big O Notation

High-Level Programming (Python)
Goal: Knowledge and Experience
- Downloading and Installing the Interpreter
- Command-line versus Batch Mode
- Values, Expressions and Naming
- Designing your own Functional Building Blocks
- Controlling the Flow of your Program
- String Manipulation (Sequence Processing)
- Container Data Structures
- File Manipulation
Reference: How to Think Like A Computer Scientist: Learning with Python

CS Fundamentals will be Interleaved Throughout the Course
- Information Representation and Encoding
- Computer Architecture
- Programming Language Translation Methods
- The Software Development Cycle
- Fundamental Principles of Software Engineering
- Basic Data Structures for Bioinformatics
- Design and Analysis of Bioinformatics Algorithms

Outline
- Essential Computing for Bioinformatics
  - Course Description
  - Educational Objectives
  - Major Course Modules
  - Module Descriptions
  - Status
- Bioinformatics Data Management
  - Course Description
  - Educational Objectives
  - Major Course Modules
  - Module Descriptions
  - Status

Bioinformatics Data Management: Course Description
An introduction to the concepts of database management and information retrieval most commonly applicable to bioinformatics database systems. The course is intended for Computer Science graduate students wishing to focus their academic careers on bioinformatics.
The first half of the course discusses relational databases and the second half covers information retrieval systems.

Educational Objectives
- Understanding of the pros and cons of the different data models available to manage bioinformatics data
- Experience with the most widely used query models of bioinformatics data retrieval
- Understanding of the relational database model and its query language
- Understanding of the vector and Boolean query models for unstructured text retrieval
- Experience programming tools to move information across data models in order to process it efficiently
- Experience with widely used database management systems and tools

Major Course Modules

Module                                          Lecture
Overview of Bioinformatics Data Models          1
Using Bioinformatics Data Sources               2
Fundamentals of Unstructured Text Retrieval     3
Fundamentals of the Relational Data Model       4
Structured Query Language (SQL)                 5
Querying DNA/Protein Sequence Databases         6
Using Highly Specialized Biological Databases   7

Relational Databases and SQL
- The Relational Model
- The SQL Relational Query Language
- Downloading and Installing MySQL
- Creating Databases in MySQL + Access
- Importing Data into MySQL + Access
- Creating Analysis Reports with iReports
- Exporting Data into Spreadsheets

Extracting Information from Bioinformatics Databases
- Manipulating sequences
- Bioinformatics database file formats
- Parsing bioinformatics database files to focus on information of interest
- Exporting data into relational databases and other tools

Questions?

Accomplishments for Year 1
- Studied alternative high-level languages (HLLs)
- Gained experience with Python
- Worked on slides for Mathematical Computing
- Reviewed existing beginning courses in computing for bioinformatics using Python
- Revised course syllabus and curricular structure
- Offered the course for the first time in Spring 2007

Plan for Year 2
- Offer the course a second time in Fall 2007
- Develop slides and lecture notes for remaining modules
- Develop problem sets and solutions
- Study BioPython and integrate it throughout the course
- Round 1 of assessment
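
Supplementary sketch for the CDS -> protein example earlier in this deck. The translate function shown there looks each codon up directly in the dictionary, so it assumes a lowercase sequence with no whitespace and a length that is a multiple of three. The version below is a minimal sketch, not part of the original course material, that relaxes those assumptions. The names BASES, AMINO_ACIDS, CODE, clean_sequence and translate_cds are illustrative, and reporting an unrecognized codon as 'X' is an assumed convention rather than something the slides specify.

```python
from itertools import product

# Standard genetic code, listed with bases in the order t, c, a, g
# (the third codon position varies fastest). This reproduces the same
# 64 entries as the codon dictionary shown in the slides.
BASES = "tcag"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODE = {
    "".join(codon): amino_acid
    for codon, amino_acid in zip(product(BASES, repeat=3), AMINO_ACIDS)
}


def clean_sequence(raw):
    """Remove all whitespace (spaces, newlines) and convert to lowercase."""
    return "".join(raw.split()).lower()


def translate_cds(raw, code=CODE):
    """Translate a coding sequence codon by codon.

    Assumptions of this sketch: trailing bases that do not form a
    complete codon are ignored, and any codon missing from the table
    (for example one containing 'n') is reported as 'X'.
    """
    seq = clean_sequence(raw)
    usable = len(seq) - len(seq) % 3  # length rounded down to whole codons
    return "".join(code.get(seq[i:i + 3], "X") for i in range(0, usable, 3))


if __name__ == "__main__":
    # Mixed case and embedded whitespace are tolerated.
    print(translate_cds("ATG AGT\ngaa cgt"))  # prints MSER
```

Using code.get with a default instead of code[codon] trades a KeyError for a visible placeholder in the output; whether that is preferable depends on how strictly malformed input should be rejected.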