EMBL

ID   X03006; SV 1; linear; mRNA; STD; MAM; 620 BP.
XX
AC   X03006;
XX
SV   X03006.1
XX
DT   28-JAN-1986 (Rel. 08, Created)
DT   12-SEP-1993 (Rel. 36, Last updated, Version 2)
XX
DE   Bovine mRNA for lens beta-s-crystallin
XX
KW   beta-crystallin; beta-gamma-crystallin; crystallin.
XX
OS   Bos taurus (cow)
OC   Eukaryota; Metazoa; Chordata; Craniata; Vertebrata; Euteleostomi; Mammalia;
OC   Eutheria; Laurasiatheria; Cetartiodactyla; Ruminantia; Pecora; Bovidae;
OC   Bovinae; Bos.
XX
RN   [1]
RP   1-620
RX   PUBMED; 4054100.
RA   Quax-Jeuken Y.E.F.M., Driessen H., Leunissen J., Quax W.J., de Jong W.,
RA   Bloemendal H.;
RT   "Beta-s-crystallin: structure and evolution of a distinct member of the
RT   beta-gamma-superfamily";
RL   EMBO J. 4(10):2597-2602(1985).
XX
CC   Data kindly reviewed (06-MAR-1986) by Y. Quax-Jeuken
XX
...

[Diagram. Index: flatfile, parser, index. Retrieve: index, entries, parser, display.]

SRS: Sequence Retrieval System

An indexing and retrieval system for flat file databases
    http://srs.bioinformatics.nl
    http://srs.ebi.ac.uk

Q: Which sequences in EMBL [do not] encode for a protein for which the 3D structure is known?
Command line SRS: using getz

Retrieve the UniProt entry for the protein with accession number P19558:
    getz "[uniprot-acc:P19558]" -e

Count the human proteins in the UniProt database:
    getz "[uniprot-org:human]" -c

Print sequence of the rice proteins in the UniProt database that have a length between 10 and 50 aa:
    getz "[uniprot-org:rice]&[uniprot-sl#10:50]" -f sl

Give the id and description for all A. thaliana proteins that have at least 8 transmembrane domains:
    getz '[swissprot-org:arabidopsis thaliana]< ([swissprot-CountedItem:transmem] &[swissprot-CountedN#8:])' -f "id des"

Count the human protein sequences in the NCBI RefSeq database:
    getz "[refseqp-org:human]" -c

Count the human mRNA sequences in the NCBI RefSeq database:
    getz "[refseq-org:human]&[refseq-mol:mrna]" -c

Retrieve the mRNA sequences for all human proteins in the NCBI RefSeq database in fasta format:
    getz "[refseqp-org:human]>[refseq-mol:mrna]" -d -sf fasta

MRS: A fast and compact retrieval system for biological data.
Hekkelman M.L., Vriend G.
http://mrs.cmbi.ru.nl/

EMBOSS: "European Molecular Biology Open Software Suite"
http://emboss.sourceforge.net/

Toolbox with bioinformatics applications
    http://emboss.bioinformatics.nl/
    http://main.g2.bx.psu.edu/
    command line / shell

Useful EMBOSS commands

command       description
showdb        Displays information on the currently available databases
wossname      Finds programs by keywords in their one-line documentation
tfm           Displays the manual entry of an EMBOSS program
seealso       Finds programs related to a given program
seqret        Reads and writes (returns) sequences
entret        Reads and writes (returns) flatfile entries
extractfeat   Extracts features from a sequence
extractseq    Extracts regions from a sequence
transeq       Translates nucleic acid sequences

Get help from EMBOSS itself

# showdb
    Shows the currently available databases
# tfm wossname
    How to use an EMBOSS command?
    Just (r)tfm it
# wossname alignment
    Which commands can handle alignments?
# seealso seqret
    Are there any other commands that do something similar?

Command line options

• All EMBOSS programs react to a number of command line options. The most important ones are:

    -help             Get help
    -help -verbose    Get elaborate help
    -auto             "No questions asked"
    -stdout           Write to standard output
    -filter           Read stdin, write stdout

SEQRET parameters

zonnebloem> seqret -help
   Standard (Mandatory) qualifiers:
  [-sequence]          seqall     (Gapped) sequence(s) filename and optional
                                  format, or reference (input USA)
  [-outseq]            seqoutall  [<sequence>.<format>] Sequence set(s)
                                  filename and optional format (output USA)
   Additional (Optional) qualifiers: (none)
   Advanced (Unprompted) qualifiers:
   -feature            boolean    Use feature information
   -firstonly          boolean    Read one sequence and stop
   General qualifiers:
   -help               boolean    Report command line options. More
                                  information on associated and general
                                  qualifiers can be found with -help -verbose

zonnebloem> seqret -help -verbose
   Standard (Mandatory) qualifiers:
  [-sequence]          seqall     (Gapped) sequence(s) filename and optional
                                  format, or reference (input USA)
  [-outseq]            seqoutall  [<sequence>.<format>] Sequence set(s)
                                  filename and optional format (output USA)
   Additional (Optional) qualifiers: (none)
   Advanced (Unprompted) qualifiers:
   -feature            boolean    Use feature information
   -firstonly          boolean    Read one sequence and stop
   Associated qualifiers:
   "-sequence" associated qualifiers
   -sbegin1            integer    Start of each sequence to be used
   -send1              integer    End of each sequence to be used
   -sreverse1          boolean    Reverse (if DNA)
   -sask1              boolean    Ask for begin/end/reverse
   -snucleotide1       boolean    Sequence is nucleotide
   -sprotein1          boolean    Sequence is protein
   -slower1            boolean    Make lower case
   -supper1            boolean    Make upper case
   -sformat1           string     Input sequence format
   -sdbname1           string     Database name
   -sid1               string     Entryname
   -ufo1               string     UFO features
   -fformat1           string     Features format
   -fopenfile1         string     Features file name
   "-outseq" associated qualifiers
   -osformat2          string     Output seq format
   -osextension2       string     File name extension
   -osname2            string     Base file name
   -osdirectory2       string     Output directory
   -osdbname2          string     Database name to add
   -ossingle2          boolean    Separate file for each entry
   -oufo2              string     UFO features
   -offormat2          string     Features format
   -ofname2            string     Features file name
   -ofdirectory2       string     Output directory
   General qualifiers:
   -auto               boolean    Turn off prompts
   -stdout             boolean    Write standard output
   -filter             boolean    Read standard input, write standard output
   -options            boolean    Prompt for standard and additional values
   -debug              boolean    Write debug output to program.dbg
   -verbose            boolean    Report some/full command line options
   -help               boolean    Report command line options. More
                                  information on associated and general
                                  qualifiers can be found with -help -verbose
   -warning            boolean    Report warnings
   -error              boolean    Report errors
   -fatal              boolean    Report fatal errors
   -die                boolean    Report dying program messages

Uniform Sequence Address (USA)

Type                         Example                       Description
filename                     xxx.seq                       A sequence file "xxx.seq" in any format
format::filename             fasta::xxx.seq                A sequence file "xxx.seq" in fasta format
db:IDname                    embl:paamir                   EMBL entry PAAMIR, using whatever access method
                                                           is defined locally for the EMBL database
db:AccessionNumber           embl:X13776                   EMBL entry X13776, using whatever access method
                                                           is defined locally for the EMBL database and
                                                           searching by accession number and entry name
                                                           (X13776 is the accession number in this case)
db-acc:AccessionNumber       embl-acc:X13776               EMBL entry X13776, using whatever access method
                                                           is defined locally for the EMBL database and
                                                           searching by accession number only
db-id:IDname                 embl-id:paamir                EMBL entry PAAMIR, using whatever access method
                                                           is defined locally for the EMBL database, and
                                                           searching by ID only
db-searchfield:word          embl-des:lectin               EMBL entries containing the word 'lectin' in the
                                                           Description line
db-searchfield:wildcardword  embl-org:*human*              EMBL entries containing the wildcarded word
                                                           'human' in the Organism fields
db:wildcard-ID               embl:paami*                   EMBL entries PAAMIB, PAAMIE and so on, usually in
                                                           alphabetical order, using whatever access method
                                                           is defined locally for the EMBL database
db or db:*                   embl or EMBL:*                All sequences in the EMBL database
@listfile                    @mylist                       Reads file mylist and uses each line as a
                                                           separate USA. List files can contain references
                                                           to other list files or any other standard USA.
list:listfile                list:mylist                   Same as "@mylist" above
'program parameters |'       'getz -e [embl-id:paamir] |'  The pipe character "|" causes EMBOSS to fire up
                                                           getz (the SRS sequence retrieval program) to
                                                           extract entry PAAMIR from EMBL in EMBL format.
                                                           Any application or script which writes one or
                                                           more sequences to stdout can be used in this way.
asis::sequence               asis::atacgcagttatctgaccat    So far the shortest USA we could invent. In
                                                           'asis' format the name is the sequence so no file
                                                           needs to be opened. This is a special case. It
                                                           was intended as a joke, but could be quite useful
                                                           for generating command lines.

Each of the above can have '[start : end]' or '[start : end : r]' appended to them.
The 'file' and 'dbname' forms of USA can have 'format::' in front of them (although a database knows which format it is and so this is redundant and error-prone).

Walk through exercise

For a protein with UniProt accession number Q5ZKN6, find the nucleotide sequence that encodes this (repeated) amino acid fragment: VAEEVAEE

Getting the sequence

    seqret -auto uniprot:Q5ZKN6 -stdout

    >Q5ZKN6_CHICK Q5ZKN6 SubName: Full=Putative uncharacterized protein;
    MADNLPSEFDVVVIGTGLPESIIAAACARSGQRVLHVDSRNYYGGNWASFSFSGLLSWIK
    ENQQNTDIKDECEDWRKLILENEEVISLNKKDKTIQHVEAFCFDDQDAAEDVEEAGALAR
    LPAYGASVAEEVAEEPEKECSPLESAVPGAENLESEKATSVDPASAAEGNVTEINAESES
    SHDSASGESTLESGKTEAALSEISAQEPKKITYSQIVREGRRFNIDLVSKLLYSRGLLIE
    LLIKSNVSRYAEFKNATRILAFREGKVEQVPCSRADVFNSRQLAMVEKRMLMKFLTFCLE
    YEQHPDEYQDYKNSTFAQFLKTRKLTPSLQHFILHSIAMVSEKDCNTLEGLQATRKFLQC
    LGRYGNTPFLFPLYGQGEIPQCFCRMCAVFGGIYCLRHSVQCLVVDKESGRCKAVVDHFG
    QRISANYFIVEDSYLSESVCENVCYRQLSRAVLITDQSVLKTDSEQQVSILMVPPVDLGQ
    PAVCVIELCSSTMTCMKDTYLVHLTCPSTKTAREDLEPVVQKLFSLNAETEKETEDEVLE
    KPRVLWALYFNMRDSSGIDRNSYSGLPSNVYVCSGPDSALGNDCAVKQAETIFQEMFPTE
    EFCPAPPNPEDIIYDEDEIASEETGFNNSPETKPESSLQESSSRGSSTAVKEHIEE

Run a program within Perl: 3 ways

    $seq = `seqret -auto uniprot:Q5ZKN6 stdout`;
    # system() runs the program but does not capture its output
    system("seqret -auto uniprot:Q5ZKN6 stdout");

    # open with a trailing pipe: read the program's output line by line
    open SEQRET, "seqret -auto uniprot:Q5ZKN6 stdout|" or die "Cannot run seqret: $!";
    while (my $line = <SEQRET>) {
        if ($line !~ /^>/) {    # skip the FASTA header line
            chomp($line);
            $seq .= $line;
        }
    }
    close SEQRET;

    my $lsOutput = `ls -l`;
Put shell commands or programs in backticks to run them from Perl. The output can be stored in a variable.

    open LS, "ls -l|";
The open function can run a program and read its output. The pipe symbol "|" links the output to a filehandle.

Find the fragment's position

    my $seq = "";
    open SEQRET, "seqret -auto uniprot:Q5ZKN6 stdout|" or die "Cannot run seqret: $!";
    while (my $line = <SEQRET>) {
        if ($line !~ /^>/) {    # skip the FASTA header line
            chomp($line);
            $seq .= $line;
        }
    }
    close SEQRET;
    # look for the location of the repeat; index() counts from 0, hence the + 1
    # (index() returns -1 when there is no hit, so position 0 means "not found")
    my $position = index($seq, "VAEEVAEE") + 1;
    # print the offset
    print "Position = ", $position, "\n";

!~ is the opposite of =~: it gives true if the search found no hits.

Get a cross-reference to EMBL

    entret uniprot:Q5ZKN6 -auto stdout | grep "DR "

entret retrieves the flat-file entry of this protein; grep keeps only the DR (database cross-reference) lines.

Understand the cross-reference

    DR   EMBL; AJ720048; CAG31707.1; -; mRNA.

    DR            Database cross reference
    EMBL          Link to EMBL
    AJ720048      EMBL accession number
    CAG31707.1    Protein ID
    -             Status identifier
    mRNA          Molecule type

[Figure: the corresponding cross reference in EMBL]

Read the detailed documentation of the UniProt cross reference:
http://www.expasy.org/sprot/userman.html#DR_line

Get a cross-reference to EMBL

    entret uniprot:Q5ZKN6 -auto stdout | grep "DR " | grep "EMBL;"

In Perl, use a regular expression to locate the EMBL reference line, and extract the EMBL accession number and the protein ID.

Link protein to coding DNA

    extractfeat embl:AJ720048 -value CAG31707.1 stdout

Returns the DNA coding for protein CAG31707.1 (= Q5ZKN6).

Figure out the offset in DNA

Offset in amino acid sequence: 128
Offset in corresponding nucleotide sequence: ((128 - 1) x 3) + 1 OR (128 x 3) - 2 = 382
Position is from 382 to (382 + 8 x 3) = 406

Figure out the position of its corresponding coding DNA sequence (is there anything wrong here?)
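The DR line shown above has a fixed, semicolon-separated layout, so the accession number and protein ID that extractfeat needs can be cut out with a single pattern. The following is only a minimal shell sketch of that step, run on the sample DR line from the slides so that it works without EMBOSS; parse_embl_dr is a hypothetical helper name, not an EMBOSS program. In Perl the same pattern would go into a regular-expression match with two capture groups.

```shell
# Hypothetical helper: print "<accession> <protein-id>" for each EMBL DR line on stdin.
parse_embl_dr() {
    # DR line layout: DR   EMBL; accession; protein ID; status; molecule type.
    sed -n 's/^DR  *EMBL; *\([^;]*\); *\([^;]*\);.*/\1 \2/p'
}

# Sample line copied from the slides, so the sketch runs without EMBOSS installed.
dr_line='DR   EMBL; AJ720048; CAG31707.1; -; mRNA.'
result=$(printf '%s\n' "$dr_line" | parse_embl_dr)

acc=${result% *}   # text before the space: EMBL accession number
pid=${result#* }   # text after the space: protein ID
echo "accession=$acc protein_id=$pid"
```

With a local EMBOSS installation the function could presumably be fed directly from the command in the slides, as in entret uniprot:Q5ZKN6 -auto stdout | parse_embl_dr. An entry with several EMBL cross-references would print one line per DR line, and the two-variable split above would then need a loop.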
Extract the DNA sequence

    extractfeat embl:AJ720048 -value CAG31707.1 stdout | extractseq -filter -reg "382-406"

Now we have the DNA sequence corresponding to the protein fragment.
It should be: “gttgctgaggaggttgctgaagaac”
But is that correct? Let's translate it for verification…

Verify the result

    extractfeat embl:AJ720048 -value CAG31707.1 stdout | extractseq -filter -reg "382-406" | transeq -filter

The result is “VAEEVAEEX”, not “VAEEVAEE”. What's wrong here?

Always try to verify your results: computers make very few errors, but that is not true for people...

Exercise

• Build a pipeline in Perl to perform the previous steps of the walkthrough (from slide 34)
• Test it with the UniProt protein A0L7N9
• Find the fragment at offset 305 that is 8 aa long
• Find out the coding DNA of this amino acid fragment and verify it
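One way to see where the extra X comes from: the range 382-406 contains 25 nucleotides, one more than the 24 that eight codons need, so the inclusive end position has to be reduced by 1. The following is a small shell sketch of the conversion, assuming 1-based inclusive positions as in the -reg ranges used in the walkthrough; aa_to_nt_range is a hypothetical helper name, not an EMBOSS program.

```shell
# Hypothetical helper: aa_to_nt_range AA_START AA_LENGTH
# Prints the 1-based, inclusive nucleotide range that encodes the fragment.
aa_to_nt_range() {
    aa_start=$1
    aa_len=$2
    # First base of codon number aa_start.
    nt_start=$(( (aa_start - 1) * 3 + 1 ))
    # Last base of the final codon: 3 bases per residue, minus 1 because the range is inclusive.
    nt_end=$(( nt_start + aa_len * 3 - 1 ))
    echo "${nt_start}-${nt_end}"
}

aa_to_nt_range 128 8   # prints 382-405
```

The printed range could then be passed on in the same style as the slides, for example extractseq -filter -reg "$(aa_to_nt_range 128 8)", and checked again with transeq -filter.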