sequence

EMBL
ID   X03006; SV 1; linear; mRNA; STD; MAM; 620 BP.
XX
AC   X03006;
XX
SV   X03006.1
XX
DT   28-JAN-1986 (Rel. 08, Created)
DT   12-SEP-1993 (Rel. 36, Last updated, Version 2)
XX
DE   Bovine mRNA for lens beta-s-crystallin
XX
KW   beta-crystallin; beta-gamma-crystallin; crystallin.
XX
OS   Bos taurus (cow)
OC   Eukaryota; Metazoa; Chordata; Craniata; Vertebrata; Euteleostomi; Mammalia;
OC   Eutheria; Laurasiatheria; Cetartiodactyla; Ruminantia; Pecora; Bovidae;
OC   Bovinae; Bos.
XX
RN   [1]
RP   1-620
RX   PUBMED; 4054100.
RA   Quax-Jeuken Y.E.F.M., Driessen H., Leunissen J., Quax W.J., de Jong W.,
RA   Bloemendal H.;
RT   "Beta-s-crystallin: structure and evolution of a distinct member of the
RT   beta-gamma-superfamily";
RL   EMBO J. 4(10):2597-2602(1985).
XX
CC   Data kindly reviewed (06-MAR-1986) by Y. Quax-Jeuken
XX
...
Index: flatfile → parser → index
Retrieve: index → entries → parser → display
SRS
Sequence Retrieval System
an indexing and retrieval system for
flat file databases
http://srs.bioinformatics.nl
http://srs.ebi.ac.uk
Q: Which sequences in EMBL [do not]
encode a protein for which the 3D
structure is known?
Command line SRS
Using getz
Retrieve the UniProt entry for the protein with
accession number P19558:
getz "[uniprot-acc:P19558]" -e
Count the human proteins in the UniProt database:
getz "[uniprot-org:human]" -c
Print sequence of the rice proteins in the UniProt
database that have a length between 10 and 50 aa:
getz "[uniprot-org:rice]&[uniprot-sl#10:50]" -f sl
Give the id and description for all A.thal proteins
that have at least 8 transmembrane domains:
getz '[swissprot-org:arabidopsis thaliana]<
([swissprot-CountedItem:transmem]
&[swissprot-CountedN#8:])' -f "id des"
Count the human protein sequences in the NCBI RefSeq
database:
getz "[refseqp-org:human]" -c
Count the human mRNA sequences in the NCBI RefSeq
database:
getz "[refseq-org:human]&[refseq-mol:mrna]" -c
Retrieve the mRNA sequences for all human proteins in
the NCBI RefSeq database in fasta format :
getz "[refseqp-org:human]>[refseq-mol:mrna]" -d -sf fasta
MRS: A fast and compact retrieval system for
biological data. Hekkelman M.L., Vriend G.
http://mrs.cmbi.ru.nl/
European Molecular Biology
Open Software Suite
EMBOSS
"European Molecular Biology Open Software Suite"
http://emboss.sourceforge.net/
Toolbox with bioinformatics applications
http://emboss.bioinformatics.nl/
http://main.g2.bx.psu.edu/
command line / shell
Useful EMBOSS commands
command       description
showdb        Displays information on the currently available databases
wossname      Finds programs by keywords in their one-line documentation
tfm           Reads the manual entries for each program in EMBOSS
seealso       Finds programs related to a given program
seqret        Reads and writes (returns) sequences
entret        Reads and writes (returns) flatfile entries
extractfeat   Extracts features from a sequence
extractseq    Extracts regions from a sequence
transeq       Translates nucleic acid sequences
Get help from EMBOSS itself
# showdb
Shows the currently available databases
# tfm wossname
How to use an EMBOSS command? Just (r)tfm it
# wossname alignment
Which commands can handle alignments?
# seealso seqret
Are there other commands that can do
something similar?
Command line options
• All EMBOSS programs react to a number of
command line options. The most important
ones are:
-help             Get help
-help -verbose    Get elaborate help
-auto             "No questions asked"
-stdout           Write to standard output
-filter           Read stdin, write stdout
SEQRET parameters
zonnebloem> seqret -help
   Standard (Mandatory) qualifiers:
  [-sequence]          seqall     (Gapped) sequence(s) filename and optional
                                  format, or reference (input USA)
  [-outseq]            seqoutall  [<sequence>.<format>] Sequence set(s)
                                  filename and optional format (output USA)
   Additional (Optional) qualifiers: (none)
   Advanced (Unprompted) qualifiers:
   -feature            boolean    Use feature information
   -firstonly          boolean    Read one sequence and stop
   General qualifiers:
   -help               boolean    Report command line options. More
                                  information on associated and general
                                  qualifiers can be found with -help -verbose
SEQRET parameters
zonnebloem> seqret -help -verbose
   Standard (Mandatory) qualifiers:
  [-sequence]          seqall     (Gapped) sequence(s) filename and optional
                                  format, or reference (input USA)
  [-outseq]            seqoutall  [<sequence>.<format>] Sequence set(s)
                                  filename and optional format (output USA)
   Additional (Optional) qualifiers: (none)
   Advanced (Unprompted) qualifiers:
   -feature            boolean    Use feature information
   -firstonly          boolean    Read one sequence and stop
   Associated qualifiers:
   "-sequence" associated qualifiers
   -sbegin1            integer    Start of each sequence to be used
   -send1              integer    End of each sequence to be used
   -sreverse1          boolean    Reverse (if DNA)
   -sask1              boolean    Ask for begin/end/reverse
   -snucleotide1       boolean    Sequence is nucleotide
   -sprotein1          boolean    Sequence is protein
   -slower1            boolean    Make lower case
   -supper1            boolean    Make upper case
   -sformat1           string     Input sequence format
   -sdbname1           string     Database name
   -sid1               string     Entryname
   -ufo1               string     UFO features
   -fformat1           string     Features format
   -fopenfile1         string     Features file name
   "-outseq" associated qualifiers
   -osformat2          string     Output seq format
   -osextension2       string     File name extension
   -osname2            string     Base file name
   -osdirectory2       string     Output directory
   -osdbname2          string     Database name to add
   -ossingle2          boolean    Separate file for each entry
   -oufo2              string     UFO features
   -offormat2          string     Features format
   -ofname2            string     Features file name
   -ofdirectory2       string     Output directory
   General qualifiers:
   -auto               boolean    Turn off prompts
   -stdout             boolean    Write standard output
   -filter             boolean    Read standard input, write standard output
   -options            boolean    Prompt for standard and additional values
   -debug              boolean    Write debug output to program.dbg
   -verbose            boolean    Report some/full command line options
   -help               boolean    Report command line options. More
                                  information on associated and general
                                  qualifiers can be found with -help -verbose
   -warning            boolean    Report warnings
   -error              boolean    Report errors
   -fatal              boolean    Report fatal errors
   -die                boolean    Report dying program messages
Universal Sequence Address
Type                         Example           Description
filename                     xxx.seq           A sequence file "xxx.seq" in any format
format::filename             fasta::xxx.seq    A sequence file "xxx.seq" in fasta format
db:IDname                    embl:paamir       EMBL entry PAAMIR, using whatever access method is
                                               defined locally for the EMBL database
db:AccessionNumber           embl:X13776       EMBL entry X13776, using whatever access method is
                                               defined locally for the EMBL database and searching
                                               by accession number and entry name (X13776 is the
                                               accession number in this case)
db-acc:AccessionNumber       embl-acc:X13776   EMBL entry X13776, using whatever access method is
                                               defined locally for the EMBL database and searching
                                               by accession number only
db-id:IDname                 embl-id:paamir    EMBL entry PAAMIR, using whatever access method is
                                               defined locally for the EMBL database, and searching
                                               by ID only
db-searchfield:word          embl-des:lectin   EMBL entries containing the word 'lectin' in the
                                               Description line
db-searchfield:wildcardword  embl-org:*human*  EMBL entries containing the wildcarded word 'human'
                                               in the Organism fields
db:wildcard-ID               embl:paami*       EMBL entries PAAMIB, PAAMIE and so on, usually in
                                               alphabetical order, using whatever access method is
                                               defined locally for the EMBL database
Universal Sequence Address
Type                    Example                       Description
db or db:*              embl or EMBL:*                All sequences in the EMBL database
@listfile               @mylist                       Reads file mylist and uses each line as a separate
                                                      USA. List files can contain references to other
                                                      list files or any other standard USA.
list:listfile           list:mylist                   Same as "@mylist" above
'program parameters |'  'getz -e [embl-id:paamir] |'  The pipe character "|" causes EMBOSS to fire up
                                                      getz (the SRS sequence retrieval program) to
                                                      extract entry PAAMIR from EMBL in EMBL format.
                                                      Any application or script which writes one or
                                                      more sequences to stdout can be used in this way.
asis::sequence          asis::atacgcagttatctgaccat    So far the shortest USA we could invent. In 'asis'
                                                      format the name is the sequence so no file needs
                                                      to be opened. This is a special case. It was
                                                      intended as a joke, but could be quite useful for
                                                      generating command lines.
Each of the above can have '[start : end]' or '[start : end : r]' appended to them.
The 'file' and 'dbname' forms of USA can have 'format::' in front of them
(although a database knows which format it is and so this is redundant and
error-prone)
Walk through exercise
For a protein with UniProt Accession number:
Q5ZKN6
find the nucleotide sequence that encodes this (repeated)
amino acid fragment:
VAEEVAEE
Getting the sequence
seqret -auto uniprot:Q5ZKN6 -stdout
>Q5ZKN6_CHICK Q5ZKN6 SubName: Full=Putative uncharacterized protein;
MADNLPSEFDVVVIGTGLPESIIAAACARSGQRVLHVDSRNYYGGNWASFSFSGLLSWIK
ENQQNTDIKDECEDWRKLILENEEVISLNKKDKTIQHVEAFCFDDQDAAEDVEEAGALAR
LPAYGASVAEEVAEEPEKECSPLESAVPGAENLESEKATSVDPASAAEGNVTEINAESES
SHDSASGESTLESGKTEAALSEISAQEPKKITYSQIVREGRRFNIDLVSKLLYSRGLLIE
LLIKSNVSRYAEFKNATRILAFREGKVEQVPCSRADVFNSRQLAMVEKRMLMKFLTFCLE
YEQHPDEYQDYKNSTFAQFLKTRKLTPSLQHFILHSIAMVSEKDCNTLEGLQATRKFLQC
LGRYGNTPFLFPLYGQGEIPQCFCRMCAVFGGIYCLRHSVQCLVVDKESGRCKAVVDHFG
QRISANYFIVEDSYLSESVCENVCYRQLSRAVLITDQSVLKTDSEQQVSILMVPPVDLGQ
PAVCVIELCSSTMTCMKDTYLVHLTCPSTKTAREDLEPVVQKLFSLNAETEKETEDEVLE
KPRVLWALYFNMRDSSGIDRNSYSGLPSNVYVCSGPDSALGNDCAVKQAETIFQEMFPTE
EFCPAPPNPEDIIYDEDEIASEETGFNNSPETKPESSLQESSSRGSSTAVKEHIEE
Run a program within Perl: 3 ways
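# 1. Backticks: run the command and capture its output in a variable
$seq = `seqret -auto uniprot:Q5ZKN6 stdout`;
# 2. system(): run the command; its output is not captured by Perl
system("seqret -auto uniprot:Q5ZKN6 stdout");
# 3. open() with a pipe: read the command's output line by line
open(SEQRET, "seqret -auto uniprot:Q5ZKN6 stdout|") or die "Cannot run seqret: $!";
while (my $line = <SEQRET>) {
    if ($line !~ /^>/) {    # skip the FASTA header line
        chomp($line);
        $seq .= $line;
    }
}
close SEQRET;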
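The line-by-line loop above can be tried without EMBOSS installed. The following is a minimal sketch, not part of the original slides: the function name `read_fasta_sequence` and the demo entry are made up for illustration, and an in-memory filehandle stands in for the seqret pipe.

```perl
use strict;
use warnings;

# Concatenate all sequence lines read from a filehandle, skipping FASTA
# header lines (those starting with ">").
# read_fasta_sequence is a hypothetical helper, not an EMBOSS or Perl built-in.
sub read_fasta_sequence {
    my ($fh) = @_;
    my $seq = '';
    while (my $line = <$fh>) {
        next if $line =~ /^>/;   # header line, not sequence
        chomp $line;
        $seq .= $line;
    }
    return $seq;
}

# Stand-in for the seqret output: an in-memory filehandle over a string.
# With EMBOSS installed, the same function could read from a pipe instead:
#   open(my $fh, '-|', 'seqret -auto uniprot:Q5ZKN6 stdout') or die $!;
my $fasta = ">demo_entry hypothetical example\nMADNLPSEFD\nVVVIGTGLPE\n";
open(my $fh, '<', \$fasta) or die "Cannot open in-memory file: $!";
my $seq = read_fasta_sequence($fh);
close $fh;

print "$seq\n";   # prints MADNLPSEFDVVVIGTGLPE
```

Passing the filehandle into a function keeps the parsing logic independent of where the data comes from (a pipe, a file, or a test string).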
my $lsOutput = `ls -l`;
Put shell commands or programs in
backticks to run them from Perl. The
output can be stored in a variable.
open LS,"ls -l|";
The open function can run a program
and read its output. The pipe symbol
"|" links the output to a filehandle.
Find the fragment’s position
my $seq = "";
open(SEQRET, "seqret -auto uniprot:Q5ZKN6 stdout|") or die "Cannot run seqret: $!";
while (my $line = <SEQRET>) {
    if ($line !~ /^>/) {    # skip the FASTA header line
        chomp($line);
        $seq .= $line;
    }
}
close SEQRET;
# look for the location of the repeat (index is 0-based, so add 1)
my $position = index($seq, "VAEEVAEE") + 1;
# print the 1-based position (0 means the fragment was not found)
print "Position = ", $position, "\n";
!~
The opposite of "=~": it gives true if the
search found no hits.
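As a self-contained check of the position lookup, the sketch below runs the same index() call on the first three lines of the Q5ZKN6 sequence shown earlier, so it needs no EMBOSS installation. The helper name `find_fragment` is an assumption made for this example; unlike the snippet above, it reports a missing fragment explicitly instead of returning 0.

```perl
use strict;
use warnings;

# Return the 1-based position of $fragment in $seq, or undef if absent.
# find_fragment is a hypothetical helper introduced for illustration.
sub find_fragment {
    my ($seq, $fragment) = @_;
    my $offset = index($seq, $fragment);   # 0-based, -1 when not found
    return $offset < 0 ? undef : $offset + 1;
}

# First three lines (180 residues) of the Q5ZKN6 sequence shown earlier.
my $seq = join '',
    'MADNLPSEFDVVVIGTGLPESIIAAACARSGQRVLHVDSRNYYGGNWASFSFSGLLSWIK',
    'ENQQNTDIKDECEDWRKLILENEEVISLNKKDKTIQHVEAFCFDDQDAAEDVEEAGALAR',
    'LPAYGASVAEEVAEEPEKECSPLESAVPGAENLESEKATSVDPASAAEGNVTEINAESES';

my $position = find_fragment($seq, 'VAEEVAEE');
print defined $position ? "Position = $position\n" : "Fragment not found\n";
# prints Position = 128
```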
Get a cross-reference to EMBL
entret uniprot:Q5ZKN6 -auto stdout | grep "^DR"
Get the full entry for this protein and keep only its DR (database cross reference) lines
Understand the cross-reference
DR   EMBL; AJ720048; CAG31707.1; -; mRNA.
DR            Database cross reference
EMBL          Link to EMBL
AJ720048      EMBL accession number
CAG31707.1    Protein ID
-             Status identifier
mRNA          Molecule type
The corresponding cross reference in EMBL
Read the detailed documentation of UniProt cross reference
http://www.expasy.org/sprot/userman.html#DR_line
Get a cross-reference to EMBL
entret uniprot:Q5ZKN6 -auto stdout | grep "^DR" | grep "EMBL;"
In Perl, use a regular expression to locate the EMBL
reference line, and extract the EMBL accession number
and the protein-ID
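One possible regular expression for this step is sketched below. It assumes the DR line layout shown earlier (database name, accession number, protein ID, status, molecule type, separated by semicolons); the function name `parse_embl_xref` is made up for this example.

```perl
use strict;
use warnings;

# Extract (accession, protein ID) from a UniProt "DR   EMBL; ..." line.
# Returns an empty list when the line is not an EMBL cross-reference.
# parse_embl_xref is a hypothetical helper, not part of EMBOSS.
sub parse_embl_xref {
    my ($line) = @_;
    if ($line =~ /^DR\s+EMBL;\s*([^;\s]+);\s*([^;\s]+);/) {
        return ($1, $2);
    }
    return;
}

my $dr_line = 'DR   EMBL; AJ720048; CAG31707.1; -; mRNA.';
my ($accession, $protein_id) = parse_embl_xref($dr_line);
print "accession=$accession protein_id=$protein_id\n";
# prints accession=AJ720048 protein_id=CAG31707.1
```

Anchoring the pattern on "DR" followed by "EMBL;" means the separate grep steps are not needed when the whole entry is read in Perl.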
Link protein to coding DNA
extractfeat embl:AJ720048 -value CAG31707.1 stdout
Returns the DNA coding for protein CAG31707.1 (=Q5ZKN6)
Figure out the offset in DNA
Offset in amino acid sequence: 128
Offset in corresponding nucleotide sequence:
((128-1) x 3) + 1
OR
(128 x 3)-2
= 382
Position is from 382 to (382 + 8x3)=406
Figure out the position of its corresponding coding DNA
sequence (is there anything wrong here?)
Extract the DNA sequence
extractfeat embl:AJ720048 -value CAG31707.1
stdout | extractseq -filter -reg "382-406"
Now we have the DNA sequence corresponding to
the protein fragment
It should be: “gttgctgaggaggttgctgaagaac”
But is that correct? Let's translate it for verification…
Verify the result
extractfeat embl:AJ720048 -value CAG31707.1
stdout | extractseq -filter -reg "382-406"
| transeq -filter
The result is “VAEEVAEEX”, not “VAEEVAEE”
What’s wrong here?
Always try to verify your results: computers make
very few errors, but that is not true for people...
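One way to check the region arithmetic is sketched below. It assumes the extracted feature starts at the first codon of the protein, so that residue n occupies nucleotides 3n-2 to 3n; the function name `aa_to_nt_range` is made up for this example. Under that assumption, an 8-residue fragment spans 24 nucleotides, and the inclusive end is the start plus 24 minus 1.

```perl
use strict;
use warnings;

# Map a 1-based amino acid position and fragment length (in residues) to the
# inclusive 1-based nucleotide range within the coding sequence.
# aa_to_nt_range is a hypothetical helper introduced for illustration.
sub aa_to_nt_range {
    my ($aa_pos, $aa_len) = @_;
    my $start = ($aa_pos - 1) * 3 + 1;
    my $end   = $start + $aa_len * 3 - 1;   # inclusive end, hence the -1
    return ($start, $end);
}

my ($start, $end) = aa_to_nt_range(128, 8);
print "Region = $start-$end\n";   # prints Region = 382-405

# The region 382-406 used above is one nucleotide longer than 8 codons:
my $extracted = 'gttgctgaggaggttgctgaagaac';
printf "%d nt, %d left over after whole codons\n",
    length($extracted), length($extracted) % 3;
# prints 25 nt, 1 left over after whole codons
```

The single leftover nucleotide is an incomplete codon, which would explain the trailing X in the translated result.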
Exercise
Build a pipeline in Perl to perform the previous steps
of the walkthrough (from slide 34)
Test it with the UniProt protein A0L7N9
Find the fragment at offset 305 that is 8 aa long
Find out the coding DNA of this amino acid fragment
and verify it