Essential Computing for Bioinformatics

advertisement

The following material is the result of a curriculum development effort to provide a set of courses to
support bioinformatics efforts involving students from the biological sciences, computer science, and
mathematics departments. They have been developed as a part of the NIH funded project “Assisting
Bioinformatics Efforts at Minority Schools” (2T36 GM008789). The people involved with the curriculum
development effort include:

Dr. Hugh B. Nicholas, Dr. Troy Wymore, Mr. Alexander Ropelewski and Dr. David Deerfield II, National
Resource for Biomedical Supercomputing, Pittsburgh Supercomputing Center, Carnegie Mellon
University.

Dr. Ricardo Gonzalez-Mendez, University of Puerto Rico Medical Sciences Campus.

Dr. Alade Tokuta, North Carolina Central University.

Dr. Jaime Seguel and Dr. Bienvenido Velez, University of Puerto Rico at Mayaguez.

Dr. Satish Bhalla, Johnson C. Smith University.

Unless otherwise specified, all the information contained within is Copyrighted © by Carnegie Mellon
University. Permission is granted for use, modify, and reproduce these materials for teaching purposes.
1
These materials were developed with funding from the US National Institutes of Health grant #2T36 GM008789 to the Pittsburgh Supercomputing Center
This material is targeted towards students with a general background in Biology. It
was developed to introduce biology students to the computational mathematical
and biological issues surrounding bioinformatics. This specific lesson deals with the
following fundamental topics:
 Essential computing for bioinformatics
 Computer Science track


2
This material has been developed by:
Dr. Hugh B. Nicholas, Jr.
National Center for Biomedical Supercomputing
Pittsburgh Supercomputing Center
Carnegie Mellon University
These materials were developed with funding from the US National Institutes of Health grant #2T36 GM008789 to the Pittsburgh Supercomputing Center
Essential Computing for Bioinformatics
Course Development Overview
Bienvenido Vélez
UPR Mayaguez
July, 2008
Outline

Essential Computing for Bioinformatics






Bioinformatics Data Management





4
Course Description
Educational Objectives
Major Course Modules
Module Descriptions
Status
Course Description
Educational Objectives
Major Course Modules
Module Descriptions
Status
These materials were developed with funding from the US National Institutes of Health grant #2T36 GM008789 to the Pittsburgh Supercomputing Center
Essential Computing for Bioinformatics
Course Description
This course provides a broad introductory discussion
of essential computer science concepts that have
wide applicability in the natural sciences. Particular
emphasis will be placed on applications to
Bioinformatics. The concepts will be motivated by
practical problems arising from the use of
bioinformatics research tools such as genetic
sequence databases. Concepts will be discussed in
a weekly lecture and will be practiced via simple
programming exercises using Python, an easy to
learn and widely available scripting language.
5
These materials were developed with funding from the US National Institutes of Health grant #2T36 GM008789 to the Pittsburgh Supercomputing Center
Educational Objectives






Awareness of the mathematical models of computation and their
fundamental limits
Basic understanding of the inner workings of a computer system
Ability to extract useful information from various bioinformatics data
sources
Ability to design computer programs in a modern high level language
to analyze bioinformatics data.
Experience with commonly used software development environments
and operating systems
Experience applying computer programming to solve bioinformatics
problems
6
These materials were developed with funding from the US National Institutes of Health grant #2T36 GM008789 to the Pittsburgh Supercomputing Center
Major Course Modules
Module
Lecture
MARC
Lecture
First Steps in Computing: Course Overview
1
Using Bioinformatics Data Sources
2
Mathematical Computing Models
3
5
High-level Programming (Python): Basics
5
1
High-level Programming (Python): Flow Control
6
2
High-level Programming (Python): Container Objects
7
3
High-level Programming (Python): Files
8
4
High-level Programming (Python): BioPython
9
7
These materials were developed with funding from the US National Institutes of Health grant #2T36 GM008789 to the Pittsburgh Supercomputing Center
Main Advantages of Python











8
Familiar to C/C++/C#/Java Programmers
Very High Level
Interpreted and Multi-platform
Dynamic
Object-Oriented
Modular
Strong string manipulation
Lots of libraries available
Runs everywhere
Free and Open Source
Track record in bioInformatics (BioPython)
These materials were developed with funding from the US National Institutes of Health grant #2T36 GM008789 to the Pittsburgh Supercomputing Center
Example
A Codon -> AminoAcid Dictionary
>>
...
...
...
...
...
...
...
...
...
...
...
...
...
...
...
...
...
>>
9
code =
{ ’ttt’:
’ttc’:
’tta’:
’ttg’:
’ctt’:
’ctc’:
’cta’:
’ctg’:
’att’:
’atc’:
’ata’:
’atg’:
’gtt’:
’gtc’:
’gta’:
’gtg’:
’F’,
’F’,
’L’,
’L’,
’L’,
’L’,
’L’,
’L’,
’I’,
’I’,
’I’,
’M’,
’V’,
’V’,
’V’,
’V’,
’tct’:
’tcc’:
’tca’:
’tcg’:
’cct’:
’ccc’:
’cca’:
’ccg’:
’act’:
’acc’:
’aca’:
’acg’:
’gct’:
’gcc’:
’gca’:
’gcg’:
’S’,
’S’,
’S’,
’S’,
’P’,
’P’,
’P’,
’P’,
’T’,
’T’,
’T’,
’T’,
’A’,
’A’,
’A’,
’A’,
’tat’:
’tac’:
’taa’:
’tag’:
’cat’:
’cac’:
’caa’:
’cag’:
’aat’:
’aac’:
’aaa’:
’aag’:
’gat’:
’gac’:
’gaa’:
’gag’:
’Y’,
’Y’,
’*’,
’*’,
’H’,
’H’,
’Q’,
’Q’,
’N’,
’N’,
’K’,
’K’,
’D’,
’D’,
’E’,
’E’,
’tgt’:
’tgc’:
’tga’:
’tgg’:
’cgt’:
’cgc’:
’cga’:
’cgg’:
’agt’:
’agc’:
’aga’:
’agg’:
’ggt’:
’ggc’:
’gga’:
’ggg’:
’C’,
’C’,
’*’,
’W’,
’R’,
’R’,
’R’,
’R’,
’S’,
’S’,
’R’,
’R’,
’G’,
’G’,
’G’,
’G’
}
These materials were developed with funding from the US National Institutes of Health grant #2T36 GM008789 to the Pittsburgh Supercomputing Center
Example
A DNA Sequence
>>> cds = "atgagtgaacgtctgagcattaccccgctggggccgtatatcggcgcacaaa
tttcgggtgccgacctgacgcgcccgttaagcgataatcagtttgaacagctttaccatgcggtg
ctgcgccatcaggtggtgtttctacgcgatcaagctattacgccgcagcagcaacgcgcgctggc
ccagcgttttggcgaattgcatattcaccctgtttacccgcatgccgaaggggttgacgagatca
tcgtgctggatacccataacgataatccgccagataacgacaactggcataccgatgtgacattt
attgaaacgccacccgcaggggcgattctggcagctaaagagttaccttcgaccggcggtgatac
gctctggaccagcggtattgcggcctatgaggcgctctctgttcccttccgccagctgctgagtg
ggctgcgtgcggagcatgatttccgtaaatcgttcccggaatacaaataccgcaaaaccgaggag
gaacatcaacgctggcgcgaggcggtcgcgaaaaacccgccgttgctacatccggtggtgcgaac
gcatccggtgagcggtaaacaggcgctgtttgtgaatgaaggctttactacgcgaattgttgatg
tgagcgagaaagagagcgaagccttgttaagttttttgtttgcccatatcaccaaaccggagttt
caggtgcgctggcgctggcaaccaaatgatattgcgatttgggataaccgcgtgacccagcacta
tgccaatgccgattacctgccacagcgacggataatgcatcgggcgacgatccttggggataaac
cgttttatcgggcggggtaa"
>>>
10
These materials were developed with funding from the US National Institutes of Health grant #2T36 GM008789 to the Pittsburgh Supercomputing Center
Example
CDS Sequence -> Protein Sequence
>>> def translate(cds, code):
...
prot = ""
...
for i in range(0,len(cds),3):
...
codon = cds[i:i+3]
...
prot = prot + code[codon]
...
return prot
>>> translate(cds, code)
’MSERLSITPLGPYIGAQ*’
11
These materials were developed with funding from the US National Institutes of Health grant #2T36 GM008789 to the Pittsburgh Supercomputing Center
Using BioInformatics Data Sources
Goal:Basic Experience
Searching Nuceotide Sequence Databases
Searching Amino Acid Sequence Database
Performing BLAST Searches
Using Specialized Data Sources




Reference: Bioinformatics for Dummies (Ch 1-4)
IDEA: How can we expedite data collection and analysis?
... writing programs to automate parts of the process.
12
These materials were developed with funding from the US National Institutes of Health grant #2T36 GM008789 to the Pittsburgh Supercomputing Center
Mathematical Computing
Goal:General Awareness


What is Computing?
Mathematical Models of Computing






13
Finite Automata
Turing Machines
The Limits of Computation
Church/Turing Thesis
What is an Algorithm?
Big O Notation
These materials were developed with funding from the US National Institutes of Health grant #2T36 GM008789 to the Pittsburgh Supercomputing Center
High-Level Programming (Python)
Goal:Knowledge and Experience








Downloading and Installing the Interpreter
Command-line versus Batch Mode
Values, Expressions and Naming
Designing your own Functional Building Blocks
Controlling the Flow of your Program
String Manipulation (Sequence Processing)
Container Data Structures
File Manipulation
Reference: How to Think Like A Computer Scientist: Learning with Python
14
These materials were developed with funding from the US National Institutes of Health grant #2T36 GM008789 to the Pittsburgh Supercomputing Center
CS Fundamentals will be Interleaved Throughout the
Course







15
Information Representation and Encoding
Computer Architecture
Programming Language Translation Methods
The Software Development Cycle
Fundamental Principles of Software Engineering
Basic Data Structures for Bioinformatics
Design and Analysis of Bioinformatics Algorithms
These materials were developed with funding from the US National Institutes of Health grant #2T36 GM008789 to the Pittsburgh Supercomputing Center
Outline
 Essential






Computing for Bioinformatics
Course Description
Educational Objectives
Major Course Modules
Module Descriptions
Status
Bioinformatics Data Management





16
Course Description
Educational Objectives
Major Course Modules
Module Descriptions
Status
These materials were developed with funding from the US National Institutes of Health grant #2T36 GM008789 to the Pittsburgh Supercomputing Center
Essential Computing for Bioinformatics
Course Description
An introduction to the concepts of Data Base
Management and Information Retrieval most
commonly applicable to Bio-informatics database
systems. The course is intended for Computer
Science graduate students wishing to focus their
academic careers in Bio-Informatics. The first half of
the course discusses relational databases and the
second half covers Information Retrieval systems.
17
These materials were developed with funding from the US National Institutes of Health grant #2T36 GM008789 to the Pittsburgh Supercomputing Center
Educational Objectives






Understanding of the pros and cons of the different data models
available to manage bioinformatics data
Experience with the most widely used query models of bioinformatics
data retrieval
Understanding of the relational data base model and its query
language
Understanding of the vector and boolean query models for
unstructured text retrieval
Experience programming tools to move information across daa
models in order to efficiently process it
Experience with widely used database management systems and tools
18
These materials were developed with funding from the US National Institutes of Health grant #2T36 GM008789 to the Pittsburgh Supercomputing Center
Major Course Modules
Module
19
Lecture
Overview of Bioinformatics Data Models
1
Using Bioinformatics Data Sources
2
Fundamentals of Unstructured Text Retrieval
3
Fundamentals of The Relational Data Model
4
Structure Query Language (SQL)
5
Querying DNA/Protein sequence databases
6
Using highly-Specialized biological databases
7
These materials were developed with funding from the US National Institutes of Health grant #2T36 GM008789 to the Pittsburgh Supercomputing Center
Relational Databases and SQL







20
The Relational Model
The SQL Relational Query Language
Downloading and Installing MySQL
Creating databases in MySQL + Access
Importing data into MySQL + Access
Creating Analysis Reports with iReports
Exporting Data into Spreadsheets
These materials were developed with funding from the US National Institutes of Health grant #2T36 GM008789 to the Pittsburgh Supercomputing Center
Extracting Information from
Bioinformatics Databases




21
Manipulating sequences
Bioinformatics database file formats
Parsing Bioinformatics database files to focus on
information of interest
Exporting data into relational databases and other
tools
These materials were developed with funding from the US National Institutes of Health grant #2T36 GM008789 to the Pittsburgh Supercomputing Center
Questions?
22
These materials were developed with funding from the US National Institutes of Health grant #2T36 GM008789 to the Pittsburgh Supercomputing Center
Accomplishments for Year 1






Studied alternative HLL languages
Gained experience with Python
Worked on slides for Mathematical Computing
Reviewed existing beginning courses in computing for
Bioinformatics using Python
Revised course syllabus and curricular structure
Offered the course for the first time in Spring 2007
23
These materials were developed with funding from the US National Institutes of Health grant #2T36 GM008789 to the Pittsburgh Supercomputing Center
Plan for Year 2





24
Offer the course a second time Fall 2007
Develop slides and lecture notes for remaining
modules
Develop Problem Sets and Solutions
Study BioPython and integrate throughout the course
Round 1 of Assessment
These materials were developed with funding from the US National Institutes of Health grant #2T36 GM008789 to the Pittsburgh Supercomputing Center
Download