Lives of the Scientist

Genetic Basis of Differentiation
Events in time and space . . .
. . . driven by patterned gene expression
[Figure labels: Nostoc; NH3; N2, NH3]

Genetic Basis of Differentiation
How?
Environmental Signal: NH3
Histidine Kinase: NpR3010
Response Regulator: ???
Developmental Response

AATAAAGCTTTACAAACCAA
ACTCTGGCTTCAATTGTGTAA
CCCAAGCTTTGATTCTTTCCT
CTGTTAAATCGGATTGATTAT
CTTCATCAAGGGCAAGACCT
ACAAATTTACCATCACGAAC
AGCTTTAGACTCACTGAATT
CATAACCTTCTGTAGGCCAA
TAGCCAACTGTTTCACCACC

Genes Functionally Related to His Kinase
Histidine Kinase: Nostoc punctiforme NpR3010
Anabaena PCC 7120
Trichodesmium
Synechocystis PCC 6803
Find similar genes . . . (13 total)
Conserved

Blast
>npun_22dec03_Contig1_revised_geneNpR3010
MWHIQDSIITLSNHNQYLTFYKNQVKNPERFCRNVNQFDSQIDFVSCDIL
ELKDGRFFEQYSKPLRLAEEIIGTVWSFRDITESQQAKEENRRIIQQEKQ
LAEDRAYFTSMIFHEFRNPLNIISYSTSLLKRHSHHWSEEKKLQCLQNLQ
TAVEQINQFTDEVLIIESVEAGKLQYELKPIDLNLFCREVLAEMSLYTKG
ASQFLLFQNK*

>npun_22dec03_Contig1_revised_geneNpR3008
LSPYLEACCLRISASVSYQRAAEDIEYLTGVEVSKSVQQRLVHRQNFELP
QVESTVEELSVDGGNIRIRTIKGQVCDWKGYKATCLHEKQAIAASFQENS
LVIDWVKSQSIAPILTCLGDGHDGIWNIVRDFAPEHQRREVLDWFHLMEN
LHKIGGSNQRLNQAKILLWQGKVDDAIAVFADCQLKQAFNFCTYLEKHRH
RIVNYQYYQAEQICSIGSGAIESTVKQIDRRTKISGAQWKSDNVPQVLAQ
RQSLSQWINLCSLNKNWDAPMKSSVERLSDYPVAR*

A new family of proteins?!
A type of transposase?
TRANSPOSON
transposase

...ATTTCTCTAGAAAGGCTGAAGGGGGGACAAGCACCCGAAAGCCTTTGTGCT...
...TAAAGAGATCTTTCCGACTTCCCCCCTGTTCGTGGGCTTTCGGAAACACGA...

...ATACAGTCAGCTTTATAGGCTTCATGTCGCCCCTTCAGCTAGAAAGGTACATA...
...TATGTCAGTCGAAATATCCGAAGTACAGCGGGGAAGTCGATCTTTCCATGTAT...

Is NpR3008 a transposase?
Observation
* Photos courtesy of www.webshots.com and Peter Smallwood

Filters: Information reducers
Squirrel filter
Molecular filter
Sequence filter

TCTACTTATA AAGAGTCTGT TTCTGTCTGC TGGATTTCGG GAACCTTAGT CTCCGTAAAC TGAATAAACT
AAGAGTTTAA AAACCTGTAT TTATATATTT CCCCAGCTGT GACAGCACTG GCTGAAATTC CCCTGCACCA
ATGAATGACT TATGAGGCAA CTCGGGAGCG CCTTTAGATG AGGCCGGAGG CCCCGGCCTA TTCCCTGGGC
TTCAATCCAC TGAATGAACA TCTGACCTCT AACTCTAGCC GACTTCTGCT CTCTAACATG TTGTTAAAGG
AGTTAAAAAC GGTTACATGA TAAGAAATTA CATTAAAAAG ACCCTCAAGA CGCTGAGAGC GGTCTTTCCT
GAACGAACGA TCACAGCATC CACGGCTCTA CAAGAAGGAG GTCAAGAACT AGGCTGCCTG TCGGCGGGAC
AGGGCTACAC CATACATGGT GGCAGCTTTC TGCCCCACTC ATACCAAAGT ATGTCAGCAA TACAAATGAA
GAATTGCAGT ACTGCCTAAA ATTGCAATTA AGGCAAATAC AGGCACCGGC AGAGTGGTAC GTGGGCACTG
TTGAATGAAA AGGTGACCTT AAGAGGCCCA GAAACAGCTC CTCCACCGGC TGCTATAAAT AGATAACATG
CTAGTTCTTG TTATCTGTTT CACTAGTTTC TTAGATAAAC CTCCACGCCC ATATTAAAAA AATTAGCAAA
CATTCTAGGG AAACAAGCTA ATTTCCTGGG AGCCAAGGAC TGACAGACAG ATTGAACCCT AGTGCAGACA
AGAAATGAGA AGTATCTATT TATCCAGGCA GAAATCCCTG GGCAGCGGCC ACGCGGCCCA AATGTGCCCT
CTCCGTAAAC CTCTAAC...

How do Biologists use Bioinformation?
[Raw DNA sequence, as on the Sequence filter slide]
What genes are in my organism?
Gene finder
Interpolated Markov model
Challenge accepted beliefs
Candidate genes
Conform to standard model
Predicted genes

Filters are powerful
Highly filtered output
• Easy to grasp
• High-level insights

Filters Constrain New Discovery
Highly filtered output
• Easy to grasp
• High-level insights
Unfiltered output
• Confusing
• Basic insights

Filters are tempting

The Death of Science

Current State of Affairs
1. Need high-level filters
2. Need access to raw phenomena

AATAAAGCTTTACAAACCAAACTCTGGCTTCA
TTGTGTAACCCAAGCTTTGATTCTTTCCTCTGTT
AAATCGGATTGATTATCTTCATCAAGGGCAAG
CCTACAAATTTACCATCACGAACAGCTTTAGA
TCACTGAATTCATAACCTTCTGTAGGCCAATAG
CCAACTGTTTCACCACCATTTTCTGAAATTTTTT
CCTCTAGAATACCGCAACACTATCACCACCAA
ACTCCTTCTGAATTATTTCTGATTCAGTTTGGGT
ATTGCCTGTTTGAGTACCAAAAAATAAACCAA

3. Need ability to build new tools

ASSIGN K12-set FROM Gene-finder (K12-DNA)
ASSIGN O157-set FROM Gene-finder (O157-DNA)
CONSIDER EACH protein IN O157-set
    WHEN Constituent-of (K12-set, protein) = FALSE
    COLLECT protein

We need…
Biologists . . .
. . . and Programmers

Current State of Affairs
1. Need high-level filters
2. Need access to raw phenomena
3. Need ability to build new tools
Need biologist programmers

[Slide graphic: an expanding block of raw DNA sequence that fills the slide]

Why hasn’t this happened?
Part of bioinformatic program written in C

    /* Read from stdin when no input file name was given. */
    if (pcInFile == NULL)
        pfInFile = stdin;
    else
        pfInFile = fopen(pcInFile, "r");
    pfOutFile = fopen(pcOutFile, "w");

    if (pfInFile == NULL) {
        fprintf(stderr, "ERROR opening %s\n", pcInFile);
        exit(1);
    }
    if (pfOutFile == NULL) {
        fprintf(stderr, "ERROR opening %s\n", pcOutFile);
        exit(1);
    }

    fputc(fgetc(pfInFile), pfOutFile);   /* deal with first '>' in file */

    /* Alternate identifier and sequence records until either one fails. */
    for ( ; ; ) {
        if (!processIdentifier(pfInFile, pfOutFile))
            break;
        if (!processSequence(pfInFile, pfOutFile))
            break;
    }

    fclose(pfInFile);
    fclose(pfOutFile);
    }   /* closes the enclosing function, whose opening is not shown */

Why hasn’t this happened?
Part of bioinformatic program written in Perl

    sub match_positions {
        my $pattern;
        local $_;
        ($pattern, $_) = @_;
        my @results;
        local $matchStart;
        # Record where each match attempt starts, then try the pattern.
        my $instrumentedPattern = qr/(?{ $matchStart = pos() })$pattern/;
        while (/$instrumentedPattern/g) {
            my $nextStart = pos();
            push @results, "[$matchStart..$nextStart)";
            # Resume one position later so overlapping matches are found.
            pos() = $matchStart+1;
        }
        return @results;
    }

Why hasn’t this happened?
Biologists will not come to programming
Programming must come to biologists

BioLingua

Genetic Basis of Differentiation
Environmental Signal: NH3
Histidine Kinase: NpR3010 (P)
Response Regulator: ???
Developmental Response

Genetic Basis of Differentiation
[Figure labels: NpR3010; RR; HK-upstream, HK, HK-downstream]

BioLingua
<1>> (GENES-DESCRIBED-BY "response regulator" IN Npun)
:: (#$Npun.NpF0304 #$Npun.NpR0355 #$Npun.NpR0450 #$Npun.NpF0484 #$Npun.NpR0589 #$Npun.NpF0832
    #$Npun.NpF0906 #$Npun.NpR0956 #$Npun.NpF1084 #$Npun.NpF1085 #$Npun.NpR1109 #$Npun.NpF1184
    #$Npun.NpF1278 #$Npun.NpR1450 #$Npun.NpF1453 #$Npun.NpF1516 #$Npun.NpR1633 #$Npun.NpR1678
    #$Npun.NpR1683 #$Npun.NpR1688 #$Npun.NpF1776 #$Npun.NpR1779 #$Npun.NpF1800 #$Npun.NpR1903
    #$Npun.NpR2091 #$Npun.NpF2162 #$Npun.NpR2263 #$Npun.NpF2346 #$Npun.NpF2364 #$Npun.NpR2420
    #$Npun.NpR2902 #$Npun.NpF2972 #$Npun.NpR3053 #$Npun.NpF3084 #$Npun.NpR3197 #$Npun.NpR3241
    #$Npun.NpF3659 #$Npun.NpF3676 #$Npun.NpR3733 #$Npun.NpF3829 #$Npun.NpR3907 #$Npun.NpR3959
    #$Npun.NpF3972 #$Npun.NpR4101 #$Npun.NpR4160 #$Npun.NpR4165 #$Npun.NpF4214 #$Npun.NpR4435
    #$Npun.NpF4460 #$Npun.NpR4503 #$Npun.NpR4743 #$Npun.NpR4768 #$Npun.NpF4909 #$Npun.NpR5015
    #$Npun.NpF5034 #$Npun.NpF5044 #$Npun.NpR5135 #$Npun.NpR5136 #$Npun.NpR5316 #$Npun.NpF5361
    #$Npun.NpF5636 #$Npun.NpF5682 #$Npun.NpF5759 #$Npun.NpF5763 #$Npun.NpF5788 #$Npun.NpR6014
    #$Npun.NpR6015 #$Npun.NpR6228 #$Npun.NpF6321 #$Npun.NpR6360 #$Npun.NpF6363 #$Npun.pNpAF075
    #$Npun.pNpBR039 #$Npun.pNpBF139 #$Npun.pNpBF146 #$Npun.pNpBR169 #$Npun.pNpBR170 #$Npun.pNpBF205
    #$Npun.pNpEF003)

<2>> (GENE-UPSTREAM-OF NpF0304)
:: #$Npun.NpF0303

<3>> (GENES-UPSTREAM-OF (RESULT 1))
:: (#$Npun.NpF0303 #$Npun.NpF0356 #$Npun.NpF0451 #$Npun.NpF0483 #$Npun.NpR0590 #$Npun.NpF0831
    #$Npun.NpF0905 #$Npun.NpF0957 #$Npun.NpR1083 #$Npun.NpF1084 #$Npun.NpR1110 #$Npun.NpF1183
    #$Npun.NpF1277 #$Npun.NpR1451 #$Npun.NpR1452 #$Npun.NpR1515 #$Npun.NpF1634 #$Npun.NpR1679
    #$Npun.NpF1684 #$Npun.NpR1689 #$Npun.NpF1775 #$Npun.NpF1780 #$Npun.NpF1799 #$Npun.NpR1904
    #$Npun.NpR2092 #$Npun.NpF2161 #$Npun.NpR2264 #$Npun.NpR2345 #$Npun.NpF2363 #$Npun.NpR2421
    #$Npun.NpR2903 #$Npun.NpR2971 #$Npun.NpR3054 #$Npun.NpR3083 #$Npun.NpR3198 #$Npun.NpF3242
    #$Npun.NpR3658 #$Npun.NpF3675 #$Npun.NpR3734 #$Npun.NpR3828 #$Npun.NpF3908 #$Npun.NpR3960
    #$Npun.NpF3971 #$Npun.NpF4102 #$Npun.NpR4161 #$Npun.NpF4166 #$Npun.NpR4213 #$Npun.NpR4436
    #$Npun.NpF4459 #$Npun.NpR4504 #$Npun.NpR4744 #$Npun.NpR4769 #$Npun.NpR4908 #$Npun.NpF5016
    #$Npun.NpF5033 #$Npun.NpF5043 #$Npun.NpR5136 #$Npun.NpF5137 #$Npun.NpF5317 #$Npun.NpF5360
    #$Npun.NpR5635 #$Npun.NpF5681 #$Npun.NpF5758 #$Npun.NpR5762 #$Npun.NpR5787 #$Npun.NpR6015
    #$Npun.NpR6016 #$Npun.NpR6229 #$Npun.NpR6320 #$Npun.NpF6361 #$Npun.NpF6362 #$Npun.pNpAF074
    #$Npun.pNpBR040 #$Npun.pNpBF138 #$Npun.pNpBF145 #$Npun.pNpBR170 #$Npun.pNpBR171 #$Npun.pNpBR204
    #$Npun.pNpER002)

<4>> (DESCRIPTIONS-OF *)
:: ("two-component sensor histidine kinase [Nostoc sp. PCC 7120] gi|25531611|p
    "unknown protein [Nostoc sp. PCC 7120] gi|25534386|pir||AH1981 hypothetical p
    "tmRNA-binding protein [Nostoc sp. PCC 7120] gi|22096164|sp|Q8YM70|SSRP_ANASP
    "GTP-binding protein era homolog"
    "unknown protein [Nostoc sp. PCC 7120] gi|25533156|pir||AF2229 hypothetical p
    "ORF_ID:tlr0160~similar to ferredoxin [Thermosynechococcus elongatus BP-1]
    "hypothetical protein [Nostoc sp. PCC 7120] gi|25367067|pir||AH2295 hypotheti
    "two-component hybrid sensor and regulator [Nostoc sp. PCC 7120] gi|25532444|
    "hypothetical protein [Nostoc sp. PCC 7120] gi|25358966|pir||AG2158 hypotheti
    "two-component response regulator [Nostoc sp. PCC 7120] gi|25533086|pir||AF21
    "probable two-component sensor histidine kinase [Gloeobacter violaceus] gi|35
    "phytochrome-like protein [Tolypothrix sp. PCC 7601]"
    "two-component sensor histidine kinase [Nostoc sp. PCC 7120] gi|25530471|pir|
    NIL
    NIL
    NIL
    "hypothetical protein [Nostoc sp. PCC 7120] gi|25535333|pir||AI2179 hypotheti
    NIL
    "unknown protein [Nostoc sp. PCC 7120] gi|25535440|pir||AI2275 hypothetical p
    "transcriptional regulator [Nostoc sp. PCC 7120] gi|25302898|pir||AB2544 tran
    "similar to two-component sensor histidine kinase [Nostoc sp. PCC 7120] gi|25
    "putative gluconolactonase precursor [Sinorhizobium meliloti] gi|25369832|pir
    "similar to two-component sensor histidine kinase [Nostoc sp. PCC 7120] gi|25
    "hypothetical protein [Nostoc sp. PCC 7120] gi|25530521|pir||AC1903 hypotheti
    . . .

BioLingua
<5>> (DEFINE RR-class AS (GENES-DESCRIBED-BY "response regulator" IN Npun) DISPLAY off)
:: "List of length 79 suppressed"
<6>> (DEFINE HK-class AS (GENES-DESCRIBED-BY "histidine kinase" IN Npun) DISPLAY off)
:: "List of length 89 suppressed"
<7>> (DEFINE HK-upstream AS (GENES-UPSTREAM-OF HK-class) DISPLAY off)
:: "List of length 89 suppressed"
<8>> (DEFINE HK-downstream AS (GENES-DOWNSTREAM-OF HK-class) DISPLAY off)
:: "List of length 89 suppressed"
<9>> (DEFINE HK-adjacent AS (UNION-OF (HK-upstream HK-downstream)) DISPLAY off)
:: "List of length 178 suppressed"
<10>> (INTERSECTION-OF (HK-adjacent RR-class))
:: 22 elements in INTERSECTION
> (#$Npun.pNpBF205 #$Npun.pNpBF139 #$Npun.NpR6228 #$Npun.NpR5316 #$Npun.NpF4214 #$Npun.NpF3676
   #$Npun.NpF3084 #$Npun.NpR3053 #$Npun.NpR1779 #$Npun.NpR0589 #$Npun.NpF0304 #$Npun.NpR1109
   #$Npun.NpF1278 #$Npun.NpF1776 #$Npun.NpF1800 #$Npun.NpR2420 #$Npun.NpR2902 #$Npun.NpR3197
   #$Npun.NpR4503 #$Npun.NpF5763 #$Npun.NpF6363 #$Npun.pNpBF146)
<11>> (DEFINE RR-candidates AS (SET-DIFFERENCE RR-class (RESULT 10)) DISPLAY off)
:: "List of length 57 suppressed"

Genes Functionally Related to His Kinase
Histidine Kinase: Nostoc punctiforme NpR3010
Anabaena PCC 7120
Trichodesmium
Synechocystis PCC 6803
Find similar genes . . . (13 total)
Conserved

BioLingua
<12>> (CONTEXT-OF NpF0304)
:: (<- #$Npun.NpR0302 potassium-dependent ATPase sub)
       523
   (-> #$Npun.NpF0303 two-component sensor histidine)
       85
   (-> #$Npun.NpF0304 two-component response regulat)
       473
   (-> #$Npun.NpF0305 hypothetical protein glr0895 [)
       85
   (<- #$Npun.NpR0306 primosomal protein N' [Nostoc )
> (#$Npun.NpR0302 #$Npun.NpF0303 #$Npun.NpF0304 #$Npun.NpF0305 #$Npun.NpR0306)

<13>> (ALL-ORTHOLOGS-OF *)
:: ((#$S7942.sef0159 #$Npun.NpR0302 #$Gvi.glr0573 #$A29413.Av?3368 #$A7120.all3154)
    (#$S6803.sll1590 #$Npun.NpF0303 #$Gvi.gll0572 #$A29413.Av?1247 #$A7120.alr3155)
    (#$S6803.sll1592 #$P9313.PMT1405 #$Npun.NpF0304 #$Gvi.gll0571 #$A29413.Av?1248 #$A7120.alr3156)
    (#$Tery.Te?7017 #$Npun.NpF0305 #$Cwat.Cw?3050)
    (#$Tery.Te?2243 #$TeBP1.tll0415 #$S6803.sll0270 #$S8102.SynW1782 #$S7942.sef1895
     #$PRO1375.Pro0497 #$P9313.PMT1271 #$PMED4.PMM0497 #$Npun.NpR0306 #$Gvi.gll0025
     #$Cwat.Cw?3016 #$A29413.Av?5206 #$A7120.all4248))

A new family of proteins?!
A type of transposase?
TRANSPOSON
transposase
Is NpR3008 a transposase?

BioLingua
<14>> (DEFINE extended-NpR3008 AS (SEQUENCE-OF NpR3008 FROM -700 TO-END +700) DISPLAY off)
:: "Results suppressed"
<15>> (BLAST extended-NpR3008 Npun)
::
     Query    Q-Start  Q-End  Subject            S-Start  S-End    E-value
  1. "Seq 1"  1        2258   #$Npun.chromosome  3706846  3704589  0.0
  2. "Seq 1"  293      1511   #$Npun.chromosome  4008429  4009647  0.0
  3. "Seq 1"  293      1512   #$Npun.chromosome  7932036  7930817  0.0
  4. "Seq 1"  293      1510   #$Npun.chromosome  4228111  4229328  0.0
  5. "Seq 1"  293      1510   #$Npun.chromosome  3971285  3972502  0.0
  6. "Seq 1"  293      1510   #$Npun.chromosome  4027833  4029050  0.0
  7. "Seq 1"  293      1511   #$Npun.chromosome  2121987  2123204  0.0
  8. "Seq 1"  293      1510   #$Npun.chromosome  2136737  2135521  0.0
  9. "Seq 1"  397      1510   #$Npun.chromosome  2030748  2031861  0.0
 10. "Seq 1"  1537     2258   #$Npun.pNpB        42015    42737    4.6d-83
 11. "Seq 1"  1331     1420   #$Npun.chromosome  8036134  8036045  1.8d-8
 12. "Seq 1"  1319     1385   #$Npun.chromosome  5915424  5915358  2.7d-4
 13. "Seq 1"  1319     1385   #$Npun.chromosome  2577387  2577453  2.7d-4
> (#$Temp27 #$Temp28 #$Temp29 #$Temp30 #$Temp31 #$Temp32 #$Temp33 #$Temp34 #$Temp35 #$Temp36
   #$Temp37 #$Temp38 #$Temp39)

<16>> (FOR-EACH hit IN *
        AS (subj S-start) = (GET-ELEMENTS (subject Subject-start) FROM hit)
        AS start = (- S-start 15)
        AS end = (+ S-start 40)
        AS left-end = (SEQUENCE-OF subj FROM start TO end)
        COLLECT left-end)
::
> ("TACGCTCTATCTTCAGCAAGTTGTTTTTCTTGCTGTATAATTCGGCGATTCTCTTC"
   "AAAGAAACGCTAGAGGGGTGCATCCCAGTTTTTATTATTCCAAAACAAATAAATAA"
   "AAACTGGGATGCACCCCTTATTAATGCTCTTTGGAGTCAATACTAATTTTGCCAAA"
   "TACCTTTGTGATAGGGGGTGCATCCCAGTTTTTATTATTCCAAAACAAATAAATAA"
   "AAATTAGTTTATTATGGGTGCATCCCAGTTTTTATTATTCCAAAACAAATAAATAA"
   "CACCGATTCACTAATGGGTGCATCCCAGTTTTTATTATTCCAAAACAAATAAATAA"
   "ACTATTGTAGAGACTGGGTGCATCCCAGTTTTTATTATTCCAAAACAAATAAATAA"
   . . .
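The window arithmetic in step <16> (15 positions before each hit's subject start through 40 positions after it) can be tried outside BioLingua. What follows is only a minimal Python sketch, assuming the subject sequence is held as a plain string and that coordinates are 1-based and inclusive. The function name `left_end`, the toy sequence, the hit records, and the clamping at the sequence ends are all illustrative assumptions, not BioLingua behavior.

```python
def left_end(subject_seq: str, s_start: int, before: int = 15, after: int = 40) -> str:
    """Return the window s_start - before .. s_start + after (1-based, inclusive).

    The window is clamped to the sequence ends; that clamping is an assumption
    of this sketch, not something shown on the slide.
    """
    start = max(1, s_start - before)
    end = min(len(subject_seq), s_start + after)
    # Convert the 1-based inclusive range to a Python slice.
    return subject_seq[start - 1:end]


# Hypothetical toy data standing in for a replicon and its BLAST hits.
toy_chromosome = "ACGT" * 50          # 200 nucleotides
subjects = {"toy": toy_chromosome}
hits = [
    {"subject": "toy", "s_start": 100},
    {"subject": "toy", "s_start": 5},   # close to the left edge, so the window is clamped
]

# Same shape as FOR-EACH ... COLLECT: one window per hit.
left_ends = [left_end(subjects[h["subject"]], h["s_start"]) for h in hits]
```

A full window is 15 + 40 + 1 = 56 nucleotides, which is the length of each sequence collected in step <16>.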
BioLingua
<17>> (ALIGNMENT-OF * LINE-LENGTH 60 SEGMENT-LENGTH 60)
::
Seq 4      1 TACCTTTGT-GATAGGGGGTGCATCCCAGTTTTTATTAT--TCCAAAACAAATAAATAA--
Seq 7      1 -ACTATTGTAGAGACTGGGTGCATCCCAGTTTTTATTAT--TCCAAAACAAATAAATAA--
Seq 2      1 -AAAGAAACGCTAGAGGGGTGCATCCCAGTTTTTATTAT--TCCAAAACAAATAAATAA--
Seq 5      1 AAATTAGTTTATTA-TGGGTGCATCCCAGTTTTTATTAT--TCCAAAACAAATAAATAA--
Seq 6      1 -CACCGATTCACTAATGGGTGCATCCCAGTTTTTATTAT--TCCAAAACAAATAAATAA--
Seq 8      1 ----------AAACTGGGATGCA-CCCAGTCTCTACAATAGTTCTAGA-GAACACATAACGT
Seq 3      1 ----------AAACTGGGATGCACCCC--TTATTAATGCTCTTTGGAGTCAATAC-TAATTT
Seq 9      1 -----------CATTGTCGCCCCTTGAAGTCATCAAGAC-----TAGGTGTATCAATGACTC
Seq 12     1 ------------------GTTCAGCTTGGTAATAGCTGTAGTTAATAATGCGAGAGCGATGT
Seq 1      1 ---------TACGCTCTATCTTCAGCAAGTTGTTTTTCT--TGCTGTATAATTCGGCGATTC
Seq 10     1 --------------GGTCGGGAAATTGCGAGATTATTCAGTGGCGAAGTAGTGGGAGAACTA
Seq 11     1 ------------TTGAACAAATTTGTTCGTGGAAATGGTAATTGGAAATTTGCTGCGGAATG
Seq 13     1 ------------ATTATTAACTACAGCTATTACCAAGCTGAACAACTGTGTTCTATTGGTTC
consensus  1

Genetic Basis of Differentiation
[Figure labels: Nostoc + Anabaena; NH3; N2, NH3]
Not Synechocystis, Trichodesmium,…

BioLingua
<18>> (DEFINE diff-cb AS (Npun Avar A7120) DISPLAY off)
:: "List of length 3 suppressed"
<19>> (DEFINE non-diff-cb AS (REMOVE-FROM-SET *loaded-organisms* diff-cb) DISPLAY off)
:: "List of length 10 suppressed"
<20>> (DEFINE diff-cb-specific AS (COMMON-ORTHOLOGS-OF diff-cb NOT-IN non-diff-cb) DISPLAY off)
:: "List of length 661 suppressed"

BioLingua
• Provides knowledge in accessible form
• Provides tools accessed in common way
• Provides results that can be manipulated
• Provides a programming language that speaks to biologists

The Death of Science

Credits
West Coast
- Jeff Shrager
- JP Massar
- Mike Travers
VCU
- Austin Hess
- James Mastros
- Sarah Cousins
- Yue Zhao

BioLingua: http://ramsites.net/~biolingua/help

Jeff Elhai
Center for the Study of Biological Complexity
Virginia Commonwealth University
Phone: 828-0794
E-mail: ElhaiJ@VCU.Edu
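Supplementary sketch: the comparison in steps <18> to <20> reads as set logic over ortholog groups. One plausible reading is to keep the groups that have a member in every differentiating cyanobacterium and no member in any other loaded organism. The Python below is a sketch under that assumed reading of COMMON-ORTHOLOGS-OF ... NOT-IN. The function name `lineage_specific`, the organism list, and the gene names are hypothetical toy data, not BioLingua output.

```python
def lineage_specific(ortholog_groups, in_set, out_set):
    """Keep groups present in every organism of in_set and absent from out_set.

    Each group is a list of "Organism.gene" strings; the organism is taken
    to be the part before the first dot (an assumption of this sketch).
    """
    in_set, out_set = set(in_set), set(out_set)
    selected = []
    for group in ortholog_groups:
        organisms = {gene.split(".", 1)[0] for gene in group}
        if in_set <= organisms and not (organisms & out_set):
            selected.append(group)
    return selected


# Hypothetical toy data; the real session had 13 loaded organisms.
loaded_organisms = {"Npun", "Avar", "A7120", "S6803", "Tery"}
diff_cb = {"Npun", "Avar", "A7120"}
non_diff_cb = loaded_organisms - diff_cb      # analogous to REMOVE-FROM-SET

toy_groups = [
    ["Npun.geneA", "Avar.geneA", "A7120.geneA"],                 # kept
    ["Npun.geneB", "Avar.geneB", "A7120.geneB", "S6803.geneB"],  # also in S6803
    ["Npun.geneC", "A7120.geneC"],                               # missing from Avar
]

diff_cb_specific = lineage_specific(toy_groups, diff_cb, non_diff_cb)
```

With the toy data, only the first group survives: the second is excluded because it has a member in a non-differentiating organism, and the third because it is not common to all three differentiating ones.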