BMB 6216 – Algorithms for Biology

BMB 6216 – Algorithms for Biology - Class 1
Andy Kudlicki
Office: BSB 547
Phone: 772-2253, 771-1011 cell
a.kudlicki@utmb.edu
Welcome!
Imagine doing science without computers. Almost all of it can be done:
– Paper file folders
– Xeroxing
– Photographs on film
– Actually going to the library to browse journals
– Abstract collections
– Telephone, Snail-mail, Telegrams
– Typewriters
The one exception:
Science is quantitative, and has always been.
This course:
– Using computers for computing.
– Aspects useful in biology / bioinformatics
• Simple tasks ( 2 * 71.12 = ? ): spreadsheets
• Simple repetitive tasks (few or many repetitions)
• Somewhat complicated tasks
• Typical problems of high complexity (solved, software available)
– BLAST, genome assembly, motif discovery, ...
Course Overview
Class 1
Introduction to the course and to the Perl programming language
Class 2
Computational complexity and numerical stability of algorithms
Class 3
Data Structures and Containers in PERL and other languages
1. Tables, lists, queues, hashes and when to use them
2. When PERL is not enough: A quick look at R and C++
Class 4
Matrix operations; Principal Component Analysis; ICA
Class 5
Network / graph algorithms
1. Interaction Networks
2. Regulation networks
3. Graphs for enumerating hypotheses
Class 6
Strings and Regular Expressions
1. In silico enzyme digestion
2. Gene translation
Class 7
Randomization and Monte Carlo simulations
1. Randomization by permutation
2. Modeling the null-hypothesis probability distribution
Class 8
Custom vector graphics: generating SVG from your data
1. Create and re-create the killer graph for your paper
Class 9
Visualization of multidimensional data
Class 10
Web tools
1. The components of a web page, elements of HTML.
2. Extracting data from webpages and other documents.
3. Connect to GenBank using BioPerl
Class 11
Cgi-bin: Creating dynamic web-based tools for data analysis.
Class 12
Relational databases and SQL
1. Relational Model, normalization
2. Basic SQL
3. Examples: Experimental results,
Class 13
Databases and WWW
Class 14
Clustering
1. Hierarchical
2. K-means
3. Friends-of-friends
Class 15
Timecourses and spectral analysis; Convolution.
Format:
Mixed – lecture with hands-on assignments.
Computer environment:
Linux
Perl, also C/C++, R, shell, awk, sed, ..., when needed
Supplementary reading:
Larry Wall et al: Programming Perl
Wing-Kin Sung: Algorithms in Bioinformatics
James Tisdall: Beginning Perl for Bioinformatics
James Tisdall: Mastering Perl for Bioinformatics
Stroustrup: The C++ Programming Language
Special requests: Welcome !
Computer environment: Linux
– Rich in standard tools, mostly open-source
– Industry standard
– Very similar to macOS, Android, iOS, BSD, ChromeOS, etc.
– Has many flavors created for specific purposes
Using your laptop in class:
To get a *nix environment:
* linux laptop (or unix console on Mac)
– Live CD distribution
* cygwin
* virtual machine
* remote session (preferred, guaranteed to work)
Remote session:
Use
– “Remote Desktop Connection” from win*
– Server: 129.109.88.185
From mac – install “Remote Desktop Connection Client for Mac”
From Linux “rdesktop 129.109.88.185”
Also works from off campus
• (mycitrix.utmb.edu -> remote desktop session)
Other options:
– ssh (PuTTY on Windows), no graphics though, only on-campus
– NX NoMachine
Login to: 129.109.54.80
Username:
Password:
Unix / linux shell / command line:
– List files: ls   ls -a   ls -1   ls -l   ls -lrt
– Directory: cd   pwd
– Copy, move, delete, link: cp mv rm ln
– Machine status: ps w   /sbin/ifconfig   date   uptime   top   df   du
– Text editors: joe   nano   emacs (c-x c-f)   vi
– Pager: more   less;   also: cat, head, tail, tac
– Misc: echo tr sed man whoami wc chmod
Simple data flow / spreadsheet-like
• Find in file: grep [grep -v; grep -f; egrep]
• Select top/bottom lines from file: head, tail
• Select columns: awk
awk '{print $2, $3, $5+$6}'
• Merge lines: cat, tac
• Merge columns: paste
• Sort: sort
• Data flow: > >> < | tee
Exercise:
The file /data/students/classes/remastercycle.csv contains gene expression
data arranged as time-series in columns. (affy-id, name, gene-id, data*36)
• How many named genes are there?
• What is the average expression at timepoint 1? In how many genes
is it above average?
• What is the average expression at t1 of named genes, unnamed
genes, non-genes? (genes have systematic names like YLR405W)
• List 200 named genes that have the highest (t7+t19+t31)(t1+t13+t25)
Log in to your account (on 129.109.88.185)
– Make a fresh directory, e.g.
mkdir bmb6216
cd bmb6216
mkdir class_1; cd class_1
cp /data/students/classes/hello.pl .
* Cat it. * Less it. * Run it.
• Backup: cp hello.pl hello-0.pl
• Edit it:
vi hello.pl
Editing with vi
– I / i (insert)
– A / a (append)
– X / x / dd (delete)
– R (eplace) / r (eplace 1 character)
– {n} W / w / B / b / hjkl -move around
– [ESC] – back from insert to command
– ZZ / :w / :q / :wq / :x / :q! - exit / save / quit
– xp – swap chars. ddp – swap lines
Exercise:
The file /home/students/classes/Class_1/remastercycle.csv contains gene
expression data arranged as time-series in columns. (affy-id, name, gene-id,
data*36)
• How many named genes are there?
• What is the average expression at timepoint 1? In how many genes is
it above average?
• What is the average expression at t1 of named genes, unnamed
genes, non-genes? (genes have systematic names like YLR405W),
named genes also have a common name in column 2.
• List 200 named genes that have the highest (t7+t19+t31)(t1+t13+t25)
PERL
Why PERL?
Practical Extraction and Report Language
Pathologically Eclectic Rubbish Lister
• Versatile, portable
• Widely used in bioinformatics and web applications
• There's more than one way to do it
• Not the most elegant language, great for dirty hacks
• Easily integrated with anything
Warning: PERL6 ain't PERL
PERL
HELLO WORLD:
print "Hello \n";
Typed into the interpreter (end the input with ^D):
> perl
print "Hello \n";
^D
As a one-liner:
> perl -e 'print "Hello \n";'
As a script:
hello.pl
==================
#!/usr/bin/perl
print "Hello \n";
==================
> perl hello.pl
Or
> ./hello.pl
(after chmod +x hello.pl)
VARIABLES:
Scalar:
$dna = 'ATTTGCCCTGCCCATT';
$mouse_tail_inches = 2.13;
$RNA = "GGGUUCAAUAUAUGGC";
$seven = -6;
No need to declare variables.
Default variable: $_ (assumed when no variable is specified).
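VARIABLES:
No need to declare variables.
Risky though:
$my_variable = 51;
$something = $my_variable + 3;
$something_else = $myvariable + 4;   # typo: a new, empty variable; the result is 4, not 55
Remedy:
use strict;   # undeclared variables become compile-time errors; declare them with my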
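A minimal sketch of what `use strict` does with the misspelled `$myvariable` from the slide above. The faulty lines are compiled inside a string `eval` so the script can report the error instead of dying; the names `$code`, `$ok` and `$error` are illustrative only.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Same typo as on the slide: $myvariable instead of $my_variable.
my $code = q{
    use strict;
    my $my_variable = 51;
    my $something_else = $myvariable + 4;
    1;
};

# A string eval compiles $code at run time. Under strict the typo is a
# compile-time error, so eval returns undef and the message is left in $@.
my $ok    = eval $code;
my $error = $@;

print "strict caught the typo:\n$error" unless $ok;
```

In an ordinary script the same error would stop the program before the first line runs, which is usually what you want.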
OPERATIONS:
String:
$dna = "ATAGAGGTA" . "CATATC";   # concatenation
$at_repeat = "AT" x 50;          # repetition
substr()   sub-string
length()
chop    (removes last char)
chomp   (removes end of line)
Binding:
print $dna if $dna =~ /ATA/;
Special characters: \t \n
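The string operations listed above can be tried together in one short script. This is only a sketch; the sequence `GATTACA` and the variable names beyond `$dna` and `$at_repeat` are made up for illustration.

```perl
#!/usr/bin/perl
use strict;
use warnings;

my $dna       = "ATAGAGGTA" . "CATATC";   # concatenation
my $at_repeat = "AT" x 3;                 # repetition: ATATAT

my $len   = length($dna);                 # number of characters
my $first = substr($dna, 0, 3);           # 3 characters starting at offset 0

my $line = "GATTACA\n";
chomp $line;                              # removes the trailing newline only
my $chopped = $line;
chop $chopped;                            # removes the last character, whatever it is

print "$dna has $len bases and starts with $first\n";
print "contains ATA\n" if $dna =~ /ATA/;  # binding operator with a pattern
print "$at_repeat\t$line\t$chopped\n";
```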
The different quotations
$x=6;
print "x= $x \n";   # double quotes interpolate: prints x= 6 and a newline
print 'x= $x \n';   # single quotes are literal: prints x= $x \n
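To see the difference between the two kinds of quotes without relying on the terminal output, the strings can be stored and measured. A small sketch; `$double` and `$single` are illustrative names.

```perl
#!/usr/bin/perl
use strict;
use warnings;

my $x = 6;
my $double = "x= $x \n";   # $x and \n are interpreted
my $single = 'x= $x \n';   # every character is kept literally

print $double;
print $single, "\n";

# The interpolated string is shorter: "6" replaces "$x", and \n is one character.
print "double: ", length($double), " chars, single: ", length($single), " chars\n";
```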
OPERATIONS:
Arithmetic:
$a + $b
$a - $b
$a * $b
$a % $b
$a ** $b
OPERATIONS:
Incrementation (C-like)
$a ++
$a *= 4
$repeat = 'AT'; $repeat x=36;
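A short sketch of the arithmetic and assignment operators in use. The variables are named `$m` and `$n` here (an assumption, not from the slides) because `$a` and `$b` are reserved for `sort` and are best left undeclared.

```perl
#!/usr/bin/perl
use strict;
use warnings;

my ($m, $n) = (17, 5);
my $sum  = $m + $n;    # addition
my $diff = $m - $n;    # subtraction
my $prod = $m * $n;    # multiplication
my $mod  = $m % $n;    # remainder of integer division
my $pow  = $n ** 2;    # exponentiation

my $count = 0;
$count++;              # increment by one
$count *= 4;           # multiply in place

my $repeat = 'AT';
$repeat x= 3;          # repeat in place: ATATAT

print "$sum $diff $prod $mod $pow $count $repeat\n";
```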
LISTS/TABLES:
@a = (4, 6, 3.21, 7, 'cat', "dog");
$a[0] = 6;
$#a       # index of last element
@a + 0    # size of array
OPERATIONS:
* join / split
* push / pop / shift / unshift
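The list operations above, sketched on the same example array. The values `'fish'` and `'first'` and the variable names other than `@a` are illustrative.

```perl
#!/usr/bin/perl
use strict;
use warnings;

my @a = (4, 6, 3.21, 7, 'cat', "dog");

my $last_index = $#a;        # index of the last element
my $size       = @a + 0;     # array in numeric context gives its size

push @a, 'fish';             # add at the end
my $popped  = pop @a;        # remove from the end
my $shifted = shift @a;      # remove from the front
unshift @a, 'first';         # add at the front

my $joined = join ",", @a;           # array -> string
my @fields = split /,/, $joined;     # string -> array

print "last index $last_index, size $size\n";
print "popped $popped, shifted $shifted\n";
print "$joined\n";
```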
HASHES:
The most important data type in biology!
$expression{"RPS16"} = 4.65;
%expression = (
RPL12 => 1.23,
CDC28 => 5.31,
STAT1 => "experiment gone south"
);
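A sketch of typical hash use with the gene names from the slide: adding an entry, looping over the keys in sorted order, and testing whether a key is present. The variable names `@report`, `$n_genes` and `$has_stat1` are illustrative.

```perl
#!/usr/bin/perl
use strict;
use warnings;

my %expression = (
    RPL12 => 1.23,
    CDC28 => 5.31,
);
$expression{RPS16} = 4.65;                 # add one more gene

# keys come back in no particular order, so sort them for stable output
my @report;
for my $gene (sort keys %expression) {
    push @report, sprintf("%s\t%.2f", $gene, $expression{$gene});
}
print "$_\n" for @report;

my $n_genes   = keys %expression;          # number of entries
my $has_stat1 = exists $expression{STAT1} ? 1 : 0;
print "$n_genes genes; STAT1 measured: $has_stat1\n";
```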
FLOW CONTROL:
if ( $a > 4 ) { print sqrt ($a), "\n"; };
while ( $x > 0 ) { print --$x , "\n"};
$x>0 or $x = 6;
for $z (1..333) {print $z, ' ';};
for ($i=0; $i<=1000; ++$i)
{
next unless $a[$i] > 0
};
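The same constructs in a small runnable form. This is a sketch with made-up data in `@values`; it counts and sums the positive entries with `next unless`, then shows the `while` countdown and the `or` shortcut.

```perl
#!/usr/bin/perl
use strict;
use warnings;

my @values = (3, -1, 0, 8, -5, 2);

my ($count, $total) = (0, 0);
for my $i (0 .. $#values) {
    next unless $values[$i] > 0;   # skip zero and negative entries
    $count++;
    $total += $values[$i];
}
print "$count positive values, total $total\n";

my $x = 3;
my @countdown;
while ($x > 0) {
    push @countdown, --$x;         # decrement first, then store
}
print "countdown: @countdown\n";

$x > 0 or $x = 6;                  # right-hand side runs only if the left is false
print "x is now $x\n";
```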
TRUE or FALSE
false strings:
– "0"
– ""
Every other string is true!
– "0.00" is true
– "0.00" + 0 is false (it is the number 0)
– if ( 'Elvis is alive' ) { print 4+5, "\n"; };
– undef() is false
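The truth rules above can be checked directly. In this sketch `truth` is a hypothetical helper that reports how Perl evaluates a value in a condition.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Hypothetical helper: returns the word 'true' or 'false' for any value.
sub truth {
    my ($value) = @_;
    return $value ? 'true' : 'false';
}

# Each case is a label for printing plus the value to test.
my @cases = (
    [ '"0"',              "0" ],
    [ '""',               "" ],
    [ '"0.00"',           "0.00" ],
    [ '"0.00" + 0',       "0.00" + 0 ],
    [ "'Elvis is alive'", 'Elvis is alive' ],
    [ 'undef',            undef ],
);

my %verdict;
for my $case (@cases) {
    my ($label, $value) = @$case;
    $verdict{$label} = truth($value);
    printf "%-18s %s\n", $label, $verdict{$label};
}
```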
SUBROUTINES
sub addit {
my ($x1, $x2) = @_;
return $x1 + $x2;
};
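Calling the subroutine looks like this. The second subroutine, `mean`, is a hypothetical addition that reuses `addit` to show arguments arriving as a list in `@_`.

```perl
#!/usr/bin/perl
use strict;
use warnings;

sub addit {
    my ($x1, $x2) = @_;          # copy the arguments into named variables
    return $x1 + $x2;
}

# Hypothetical helper, not from the slides: average of any number of values.
sub mean {
    my @values = @_;
    return unless @values;       # nothing to average: return undef
    my $sum = 0;
    $sum = addit($sum, $_) for @values;
    return $sum / @values;       # array in numeric context is its size
}

my $total = addit(2, 3.5);
my $avg   = mean(1, 2, 3, 6);
print "total $total, mean $avg\n";
```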
Input / Output:
while (<>)
{
chomp;
$sum += $_;
};
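The loop above reads standard input or the files named on the command line. To try the same pattern without any input file, this sketch reads from an in-memory string instead (the three numbers in `$input` are invented) and also counts the lines to get a mean.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Assumption for the demo: the "file" is a string, opened via a reference.
my $input = "1.5\n2.5\n4\n";
open my $fh, '<', \$input or die "Cannot open in-memory file: $!";

my ($sum, $n) = (0, 0);
while (<$fh>) {        # each line is placed in $_
    chomp;             # chomp works on $_ by default
    $sum += $_;
    $n++;
}
close $fh;

printf "sum=%g mean=%g\n", $sum, $sum / $n;
```

With the original `while (<>)` form, the script would be run as `perl sum.pl numbers.txt`, where both file names are placeholders.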
Input:
open BLABLA, "<", "data.csv" or die "Cannot open data.csv: $!";
$firstline = <BLABLA>;
chomp $firstline;
@headers = split "\t", $firstline;
while (<BLABLA>) { ... };   # process each remaining line, available in $_
close BLABLA;
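A sketch of the same read pattern on a tiny tab-separated table, storing one column in a hash keyed by gene name. The table contents and the names `$data`, `%t1` and `@values` are invented for the example; a lexical filehandle on an in-memory string stands in for the real file.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Assumption for the demo: a two-gene table held in a string.
my $data = "gene\tt1\tt2\nRPL12\t1.23\t2.50\nCDC28\t5.31\t4.10\n";
open my $in, '<', \$data or die "Cannot open in-memory file: $!";

my $firstline = <$in>;
chomp $firstline;                         # otherwise the last header keeps its newline
my @headers = split /\t/, $firstline;

my %t1;                                   # gene name -> expression at the first timepoint
while (my $line = <$in>) {
    chomp $line;
    my ($gene, @values) = split /\t/, $line;
    $t1{$gene} = $values[0];
}
close $in;

print "columns: @headers\n";
print "$_\t$t1{$_}\n" for sort keys %t1;
```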
Output:
– print $x, "\n";
– printf "format", $x;
– print + join " ", @list;
open BLABLA, ">outdata.csv";
print BLABLA $x, $y, "\n";   # no comma after the filehandle!!!
close BLABLA;
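A sketch of `print` and `printf` writing to a filehandle. To keep it self-contained the output goes to an in-memory string (`$buffer`, an assumption of this example) rather than to outdata.csv; the values of `$x` and `$y` are arbitrary.

```perl
#!/usr/bin/perl
use strict;
use warnings;

my $buffer = '';
open my $out, '>', \$buffer or die "Cannot open in-memory file: $!";

my ($x, $y) = (3.14159, 42);

# The filehandle is followed by a space, not a comma.
print  $out join("\t", $x, $y), "\n";
# printf formats: %.2f = two decimals, %5d = integer padded to width 5.
printf $out "%.2f\t%5d\n", $x, $y;

close $out;
print $buffer;
```

Replacing `\$buffer` with a file name such as `"outdata.csv"` writes the same two lines to disk.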
Exercises:
1. repeat in PERL the awk/sort exercise from last hour
2. a-S_cer_TANAY_1000upstream.fasta contains the sequences of UTRs of
genes. What is the correlation between the position of the GATGAGA sequence
and the avg expression of the gene?
C / C++ -> for total control
=========================== Hello.C ======
#include <iostream>
using namespace std;
int main ()
{
cout << "Hello :) " << 5+4 << endl;
return 0;
}