Progress report on the analysis of the Physarum sequences

Progress report on the analysis of the Physarum sequences obtained through the pilot
assay. By Gerard Pierron
I asked Jonatha Gott whether we could have access to the sequences obtained through the
pilot assay. My goal was to determine whether our favourite sequences, such as actin or profilin,
had been re-sequenced, in the hope of extending the known sequences. Through
Jonatha, I got access to FTP files containing the sequences, to files of assembled sequences,
and so on. But in fact the Physarum sequences are accessible on the Web in the “trace archive”
of the NCBI site, a site that collects the raw data of the various sequencing projects. With
access to the NCBI site
you can easily ask:
- whether your favourite sequence is found within the Physarum traces;
- which of the traces contain an ORF similar to a Dictyostelium gene;
- how many of the traces correspond to unique sequences, to repeated elements...
The information about the “archive site” was mentioned in Jonatha’s e-mail of June 17, 2005, but
in a form that wasn’t clear to me.
22,616 Physarum sequences are listed on the NCBI archive site. This number differs
significantly from what we knew: 13,440 plasmids were created and both ends were
sequenced generating 26,880 reads. Of these I understood that 20,780 sequences were
validated after trimming. In fact, some of the 22,616 sequences have not passed the final
screening: some of them contain vector sequences or bacterial genes…
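The figures above can be cross-checked with a few lines of arithmetic. This is only a sketch using the numbers quoted in this report; reading the difference between the listed and validated counts as "entries that did not pass validation" is an assumption, not something the archive states.

```python
# Figures quoted in the report.
plasmids = 13_440
reads_per_plasmid = 2      # both ends of each insert were sequenced
validated = 20_780         # reads understood to be validated after trimming
listed = 22_616            # entries listed on the NCBI trace archive

total_reads = plasmids * reads_per_plasmid

# Assumption: the validated reads are a subset of the listed entries,
# so the difference is the number of listed entries outside that set.
excess = listed - validated

print(f"reads generated: {total_reads}")
print(f"validated fraction of all reads: {validated / total_reads:.1%}")
print(f"listed entries beyond the validated count: {excess}")
```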
How to access the Physarum traces?
Go to PubMed. Select taxonomy in the upper left window and type in Physarum.
Follow the path: Physarum --> P. polycephalum --> 22,616 sequences --> trace archives
(bookmark). Preferably, ask for 500 sequences per page (46 pages altogether).
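The page count follows directly from the number of entries and the page size. A minimal sketch (the helper name `page_count` is hypothetical, chosen for illustration):

```python
import math

def page_count(n_items: int, per_page: int) -> int:
    """Number of result pages needed to show n_items at per_page items each."""
    return math.ceil(n_items / per_page)

# 22,616 traces displayed 500 per page.
pages = page_count(22_616, 500)
print(pages)
```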
The sequences should look something like this:
Enter a query string or TI number: SPECIES_CODE='PHYSARUM POLYCEPHALUM'
Display: FASTA, 500 item(s) per page
Search result: found 22616 item(s)
Your request is: SPECIES_CODE='PHYSARUM POLYCEPHALUM'
>gnl|ti|818408726 name:POAA-aaa02a03.b1 mate:820216407
ATGATATCTCGCCCTGCTGGTGGATTCGATATCCTATCGCGGGAATATCGGGCCGTGGTAAGACTACGTG
CAATACTAAGGGGAGCATACTGGGGTCCTAAACCGAGTAGGCTACCCTATTAACGAACTCCCGATATTTT
GGTTTTCCCTATCGTAGTAATTACGATAGGGCGCGACCGTTGCTTTTAGATGCGGCAATGAAAGCACATC
GCCGTATGGAAAGGGTTAAATTTATTTTACTAAGGCTTGCTAAATTATAAACCGCAAAAAAAAAAAAAAA
AAAAAAAGCTCTACTGTCTTTAGTGCTAGTACTTCTGGCACTGGTATTGTACAGTAGCCTCTGAGGATTA
TAGTAATATATCCTACATGCACGTCCTCACTTTAGCGTCTGGTTAATGTAGGATCACACTTGCTCTAGCC
TTGTTTCTCAAAATTCCGAATTTTTTAAAGTGAAAATCCTTGTACTTGAGCTTATGGTTAAGATCAAAAG
AGTTTCTTCTCTCCCCTTCGCCTGTGCCGCGAAGTAATAATCTAGGTTTTTTTTTTTTTTTTTTTTTTTT
TTTTTTTTTTTTAAAAACCTTAATTTGTTCTCCCTCTTTTTGCTCTTGTTGCGCCATTCTTTCCCTTACC
GTAAAAACCTGAAGAGGGCTCCCTAAATTTGGGCTTATTTTCTAAAAATATCCCAACCCCTCTCTCCTTA
CCTATTTTCCTGGAAACTCGCTTTTTTTCTCCCCCTATAAGATATTTCGATCGGGGTAGTAAATAATCTC
TTTTTGTGGATGTAAAGAAAATGATGGGAAGTAAAAAGGAAGGTTAGAAAAAAAAAGGTTAAACCCCTCC
CCTCCAACCCGCGTTTCCCTTTGATTTTGGGCGAGGCGACCTCGGCAAATAATCTATAGATAGGGAACT
Each sequence is identified by a number and a code, e.g. 818408726 and aaa02a03.b1. The
number is the trace identifier (ti) of the sequence. The code probably represents the position
of the clone in its well plate. Each trace has a mate, which corresponds to the other end of the
insert. The mate of aaa02a03.b1 is aaa02a03.g1 (b1 is forward; g1 is reverse), and its ti number is
820216407.
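If you save a page of traces as FASTA, the header fields described above can be pulled out programmatically. The sketch below assumes every header follows the layout of the single example shown here (`>gnl|ti|<number> name:<code> mate:<number>`); the function names are hypothetical, and the b1/g1 suffix swap is inferred from that one example only.

```python
import re

# Header layout taken from the example above:
# >gnl|ti|818408726 name:POAA-aaa02a03.b1 mate:820216407
HEADER_RE = re.compile(
    r"^>gnl\|ti\|(?P<ti>\d+)\s+name:(?P<name>\S+)(?:\s+mate:(?P<mate>\d+))?"
)

def parse_trace_header(line):
    """Return the ti number, read name and mate ti (None if absent) of a header."""
    m = HEADER_RE.match(line.strip())
    if m is None:
        raise ValueError(f"unrecognised trace header: {line!r}")
    return {
        "ti": int(m["ti"]),
        "name": m["name"],
        "mate_ti": int(m["mate"]) if m["mate"] else None,
    }

def mate_name(name):
    """Guess the mate's read name by swapping the .b1/.g1 suffix (assumption)."""
    for suffix, other in ((".b1", ".g1"), (".g1", ".b1")):
        if name.endswith(suffix):
            return name[: -len(suffix)] + other
    return None  # unknown suffix: make no guess

info = parse_trace_header(">gnl|ti|818408726 name:POAA-aaa02a03.b1 mate:820216407")
print(info, mate_name(info["name"]))
```

Grouping reads by the name without its suffix would then give the two ends of each clone.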
I do not know the exact range of insert sizes; it is probably 3 to 5 kb, according to Jonatha’s
message of June 17, 2005. You might expect the 2 reads from a single clone to “hybridize” to
a single gene if the gene is long enough. I found some occurrences of that situation.
Want to know whether your favourite sequence is in the bank?
Call up your sequence in PubMed nucleotide (e.g. X15142) and copy it in FASTA format (see the
Reports option for easy FASTA access):
AF438185: Physarum polycephalum histone H1 gene, partial cds
gi|16904229|gb|AF438185.1|AF438185[16904229]
Call the trace archive page (your bookmark).
Click on Blast (upper-right) --> MegaBlast --> Paste your FASTA sequence in the “trace archive”
window.
Select Physarum polycephalum WGS in the database choice.
Restrict to 10 descriptions in the format window and blast. Wait a few seconds and click on
format. You will get the result in a very convenient colour-coded format.
In this case, the 466 bp partial sequence of Histone H1 is contained within the sequence
820216565 of the pilot assay.
Not surprisingly, since the reads represent less than 0.1X of a genome equivalent, most of the
known sequences will not be found, but some sequences are obviously there, as illustrated by
the partial Histone H1 sequence.
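For traces saved locally, a crude offline check is an exact substring search on both strands. This is not MegaBlast and will miss any match containing a mismatch or a gap; it is only a sketch of the "is my sequence contained in this read?" question, and the function names are hypothetical.

```python
COMPLEMENT = str.maketrans("ACGT", "TGCA")

def reverse_complement(seq):
    """Reverse complement of a DNA string (A, C, G, T only)."""
    return seq.upper().translate(COMPLEMENT)[::-1]

def find_in_read(query, read):
    """Return (strand, offset) if query occurs exactly in read, else None.

    The offset is the 0-based position in the read as given; for the "-"
    strand it is where the reverse complement of the query starts.
    """
    query, read = query.upper(), read.upper()
    pos = read.find(query)
    if pos != -1:
        return ("+", pos)
    pos = read.find(reverse_complement(query))
    if pos != -1:
        return ("-", pos)
    return None

print(find_in_read("ATGC", "GGATGCTT"))
print(find_in_read("GCAT", "GGATGCTT"))
```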
I was particularly interested in comparing the 22,616 sequences to the Dictyostelium genome
to measure how many new Physarum ORFs could be detected. The answer is MANY. The
procedure consists of copying one page (500 sequences) of the Physarum traces and then
clicking on blast in the upper right corner of the same window. Then select megablast and
then translation, so that the 500 sequences are translated in the 6 possible frames. In “limited
entrez” select Dictyostelium. Reduce the output by selecting 10 descriptions and 0 alignments
in the format window and blast. You will get a queuing number. Press format and wait
for the results, or copy the number and, when the results are ready, use the option “retrieve an ID”.
Here is an example of what you get.
The score is relatively high. The homology is broken into several pieces, suggesting introns
within the Physarum sequence that are of course absent from the protein sequence of
Dictyostelium.
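To make "translated in the 6 possible frames" concrete, here is a minimal sketch of a six-frame translation. It assumes the standard genetic code (NCBI translation table 1) and is only an illustration of the idea, not what the NCBI server runs; codons containing anything other than A, C, G or T are written as X.

```python
from itertools import product

# Standard genetic code (NCBI table 1), codons enumerated in TCAG order.
BASES = "TCAG"
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {"".join(c): aa for c, aa in zip(product(BASES, repeat=3), AMINO)}
COMPLEMENT = str.maketrans("ACGTN", "TGCAN")

def reverse_complement(seq):
    """Reverse complement of a DNA string (N is kept as N)."""
    return seq.upper().translate(COMPLEMENT)[::-1]

def translate(seq, frame=0):
    """Translate seq starting at offset frame (0, 1 or 2); '*' marks a stop."""
    seq = seq.upper()
    codons = (seq[i:i + 3] for i in range(frame, len(seq) - 2, 3))
    return "".join(CODON_TABLE.get(codon, "X") for codon in codons)

def six_frames(seq):
    """Translations of the three forward (+1..+3) and three reverse (-1..-3) frames."""
    rc = reverse_complement(seq)
    frames = {}
    for f in range(3):
        frames[f"+{f + 1}"] = translate(seq, f)
        frames[f"-{f + 1}"] = translate(rc, f)
    return frames

for name, protein in six_frames("ATGGCCTAA").items():
    print(name, protein)
```

A read whose similarity to a Dictyostelium protein is split across several stretches, as in the example above, would show up as separate matching segments, possibly in different forward or different reverse frames.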