BMB 6216 – Algorithms for Biology - Class 1
Andy Kudlicki
Office: BSB 547
Phone: 772-2253, 771-1011 cell
a.kudlicki@utmb.edu

Welcome!
Imagine doing science without computers. Almost all of it can be done:
– Paper file folders
– Xeroxing
– Photographs on film
– Actually going to the library to browse journals
– Abstract collections
– Telephone, snail mail, telegrams
– Typewriters

The one exception: science is quantitative, and always has been.

This course:
– Using computers for computing.
– Aspects useful in biology / bioinformatics
  • Simple tasks ( 2 * 71.12 = ? ): spreadsheets
  • Simple repetitive tasks (few or many repetitions)
  • Somewhat complicated tasks
  • Typical problems of high complexity (solved, software available)
    – BLAST, genome assembly, motif discovery, ...

Course Overview
Class 1  Introduction to the course and to the Perl programming language
Class 2  Computational complexity and numerical stability of algorithms
Class 3  Data Structures and Containers in PERL and other languages
  1. Tables, lists, queues, hashes and when to use them
  2. When PERL is not enough: A quick look at R and C++
Class 4  Matrix operations; Principal Component Analysis; ICA
Class 5  Network / graph algorithms
  1. Interaction Networks
  2. Regulation networks
  3. Graphs for enumerating hypotheses
Class 6  Strings and Regular Expressions
  1. In silico enzyme digestion
  2. Gene translation
Class 7  Randomization and Monte Carlo simulations
  1. Randomization by permutation
  2.
Modeling the null-hypothesis probability distribution
Class 8  Custom vector graphics: generating SVG from your data
  1. Create and re-create the killer graph for your paper
Class 9  Visualization of multidimensional data
Class 10  Web tools
  1. The components of a web page, elements of HTML.
  2. Extracting data from webpages and other documents.
  3. Connect to GenBank using BioPerl
Class 11  cgi-bin: Creating dynamic web-based tools for data analysis.
Class 12  Relational databases and SQL
  1. Relational Model, normalization
  2. Basic SQL
  3. Examples: Experimental results,
Class 13  Databases and WWW
Class 14  Clustering
  1. Hierarchical
  2. K-means
  3. Friends-of-friends
Class 15  Timecourses and spectral analysis; Convolution.

Format: Mixed – lecture with hands-on assignments.
Computer environment: Linux
Perl, also C/C++, R, shell, awk, sed, ..., when needed
Supplementary reading:
  Larry Wall et al.: Programming Perl
  Wing-Kin Sung: Algorithms in Bioinformatics
  James Tisdall: Beginning Perl for Bioinformatics
  James Tisdall: Mastering Perl for Bioinformatics
  Stroustrup: The C++ Programming Language
Special requests: Welcome!

Computer environment: Linux
* Rich in standard tools, mostly open-source
* Industry standard
* Very similar to MacOS, Android, iOS, BSD, ChromeOS, etc.
* Has many flavors created for specific purposes

Using your laptop in class:
To get a *nix environment:
* Linux laptop (or Unix console on Mac)
  – Live CD distribution
* Cygwin
* virtual machine
* remote session (preferred, guaranteed to work)

Remote session:
– From win*: use "Remote Desktop Connection"; server: 129.109.88.185
– From Mac: install "Remote Desktop Connection Client for Mac"
– From Linux: rdesktop 129.109.88.185
Also works from off campus
  • (mycitrix.utmb.edu -> remote desktop session)
Other options:
– ssh (PuTTY on Windows); no graphics though, on-campus only
– NX NoMachine

Login to: 129.109.54.80
Username:
Password:

Unix / Linux shell / command line:
– List files: ls, ls -a, ls -1, ls -l, ls -lrt
– Directory: cd, pwd
– Copy, move, delete, link: cp, mv, rm, ln
– Machine status: ps w, /sbin/ifconfig, date, uptime, top, df, du
– Text editors: joe, nano, emacs (C-x C-f), vi
– Pager: more, less; also: cat, head, tail, tac
– Misc: echo, tr, sed, man, whoami, wc, chmod

Simple data flow / spreadsheet-like
• Find in file: grep [grep -v; grep -f; egrep]
• Select top/bottom lines from file: head, tail
• Select columns: awk
    awk '{print $2, $3, $5+$6}'
• Merge lines: cat
• Merge columns: paste
• Sort: sort
• Data flow: > >> < | tee tac

Exercise:
The file /data/students/classes/remastercycle.csv contains gene expression data arranged as time series in columns (affy-id, name, gene-id, data*36).
• How many named genes are there?
• What is the average expression at timepoint 1? In how many genes is it above average?
• What is the average expression at t1 of named genes, unnamed genes, non-genes? (Genes have systematic names like YLR405W.)
• List 200 named genes that have the highest (t7+t19+t31)(t1+t13+t25)

Log in to your account (on 129.109.88.185)
– Make a fresh directory, e.g.
    mkdir bmb6216
    cd bmb6216
    mkdir class_1; cd class_1
    cp /data/students/classes/hello.pl .
* Cat it.
* Less it.
* Run it.
• Backup: cp hello.pl hello-0.pl
• Edit it: vi hello.pl

Editing with vi
– I / i (insert)
– A / a (append)
– X / x / dd (delete)
– R (replace) / r (replace 1 character)
– {n} W / w / B / b / hjkl – move around
– [ESC] – back from insert to command mode
– ZZ / :w / :q / :wq / :x / :q! – exit / save / quit
– xp – swap chars; ddp – swap lines

Exercise:
The file /home/students/classes/Class_1/remastercycle.csv contains gene expression data arranged as time series in columns (affy-id, name, gene-id, data*36).
• How many named genes are there?
• What is the average expression at timepoint 1? In how many genes is it above average?
• What is the average expression at t1 of named genes, unnamed genes, non-genes? (Genes have systematic names like YLR405W; named genes also have a common name in column 2.)
• List 200 named genes that have the highest (t7+t19+t31)(t1+t13+t25)

PERL
Why PERL?
Practical Extraction and Report Language
Pathologically Eclectic Rubbish Lister
• Versatile, portable
• Widely used in bioinformatics and web applications
• There's more than one way to do it
• Not the most elegant language, great for dirty hacks
• Easily integrated with anything

Warning: PERL6 ain't PERL

PERL HELLO WORLD:
    print "Hello \n";

PERL HELLO WORLD (typed into the interpreter):
    > perl
    print "Hello \n";
    ^D

PERL HELLO WORLD (one-liner):
    > perl -e 'print "Hello \n";'

PERL HELLO WORLD (script file):
hello.pl
==================
#!/usr/bin/perl
print "Hello \n";
==================
    > perl hello.pl
Or
    > ./hello.pl      (after chmod +x hello.pl)

VARIABLES:
Scalar:
    $dna = 'ATTTGCCCTGCCCATT';
    $mouse_tail_inches = 2.13;
    $RNA = "GGGUUCAAUAUAUGGC";
    $seven = -6;
Default variable: $_
No need to declare variables.
If not specified, $_ is assumed.

VARIABLES:
No need to declare variables. Risky though:
    $my_variable = 51;
    $something = $my_variable + 3;
    $something_else = $myvariable + 4;   # typo: silently creates a new, undefined variable
    use strict;                          # makes Perl reject undeclared variables such as $myvariable

OPERATIONS:
String:
    $dna = "ATAGAGGTA" .
           "CATATC";
    $at_repeat = "AT" x 50;
    substr()    sub-string
    length()
Binding:
    print $dna if $dna =~ /ATA/;
    chop     (last char)
    chomp    (end of line)
Special characters: \t \n

The different quotations
    $x = 6;
    print "x= $x \n";    # double quotes: $x and \n are interpolated
    print 'x= $x \n';    # single quotes: printed literally

OPERATIONS:
Arithmetic:
    $a + $b
    $a - $b
    $a * $b
    $a % $b
    $a ** $b

OPERATIONS:
Incrementation (C-like)
    $a++
    $a *= 4
    $repeat = 'AT';
    $repeat x= 36;

LISTS/TABLES:
    @a = (4, 6, 3.21, 7, 'cat', "dog");
    $a[0] = 6;
    $#a        index of last element
    @a + 0     size of array
OPERATIONS:
* join / split
* push / pop / shift / unshift

HASHES:
The most important data type in biology!
    $expression{"RPS16"} = 4.65;
    %expression = ( RPL12 => 1.23,
                    CDC28 => 5.31,
                    STAT1 => "experiment gone south" );

FLOW CONTROL:
    if ( $a > 4 ) { print sqrt($a), "\n"; };
    while ( $x > 0 ) { print --$x, "\n" };
    $x > 0 or $x = 6;
    for $z (1..333) { print $z, ' '; };
    for ($i=0; $i<=1000; ++$i) { next unless $a[$i] > 0 };

TRUE or FALSE
False strings:
– "0"
– ""
Every other string is true!
– "0.00" is true
– "0.00" + 0 is false
– if ( 'Elvis is alive' ) { print 4+5, "\n"; };
– undef() is false

SUBROUTINES
    sub addit {
        my ($x1, $x2) = @_;
        return $x1 + $x2;
    };

Input / Output:
    while (<>) {
        chomp;
        $sum += $_;
    };

Input:
    open BLABLA, "data.csv";
    $firstline = <BLABLA>;
    @headers = split "\t", $firstline;
    while (<BLABLA>) {something};
    close BLABLA;

Output:
– print $x, "\n";
– printf "format", $x;
– print + join " ", @list;
    open BLABLA, ">outdata.csv";
    print BLABLA $x, $y, "\n";    # no comma after BLABLA!!!
    close BLABLA;

Exercises:
1. Repeat in PERL the awk/sort exercise from last hour.
2. a-S_cer_TANAY_1000upstream.fasta contains the sequences of UTRs of genes. What is the correlation between the position of the GATGAGA sequence and the avg expression of the gene?

C / C++ -> for total control
Hello.C
==================
#include <iostream>
using namespace std;
int main () {
    cout << "Hello :) " << 5+4 << endl;
}
==================
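
Returning to the shell tools from the "Simple data flow / spreadsheet-like" slide: the kind of column arithmetic asked for in the expression-data exercise could be sketched roughly as below. This is only an illustration under stated assumptions, not the course's solution. The four sample rows are made up, the comma delimiter is assumed from the .csv extension (the real file may be tab-separated), the column layout (affy-id, name, gene-id, then data) is taken from the exercise text, and the function names col_mean, count_above_mean, count_named and top_by_col are hypothetical.

```shell
#!/bin/sh
# Made-up miniature of the expression table: affy-id,name,gene-id,t1,t2
# (hypothetical rows; a comma delimiter is an assumption).
sample='a1,RPL12,ID1,1.0,2.0
a2,,YLR405W,3.0,4.0
a3,CDC28,ID3,5.0,1.0
a4,,,7.0,0.5'

# Mean of one column read from stdin (default: column 4, i.e. t1).
col_mean() {
    awk -F, -v c="${1:-4}" '{ sum += $c; n++ } END { if (n) print sum / n }'
}

# Number of rows whose value in the chosen column exceeds that column's mean.
count_above_mean() {
    awk -F, -v c="${1:-4}" '{ v[NR] = $c; sum += $c }
        END { if (NR == 0) { print 0; exit }
              m = sum / NR
              for (i = 1; i <= NR; i++) if (v[i] > m) k++
              print k + 0 }'
}

# Number of rows with a non-empty common name in column 2.
count_named() {
    awk -F, '$2 != "" { n++ } END { print n + 0 }'
}

# IDs (column 1) of the N rows with the largest value in column 4.
top_by_col() {
    sort -t, -k4,4nr | head -n "${1:-2}" | awk -F, '{ print $1 }'
}

printf '%s\n' "$sample" | col_mean 4          # prints 4
printf '%s\n' "$sample" | count_above_mean 4  # prints 2
printf '%s\n' "$sample" | count_named         # prints 2
printf '%s\n' "$sample" | top_by_col 2        # prints a4 then a3
```

To try the same functions on the course file, one would replace the printf with something like "cat remastercycle.csv |" and, if the file turns out to be tab-separated, change -F, and -t, accordingly.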