Progress report on the analysis of the Physarum sequences obtained through the pilot assay.
By Gerard Pierron

I asked Jonatha Gott whether we could have access to the sequences obtained through the pilot assay. My goal was to determine whether our favourite sequences, like actin or profilin, had been re-sequenced, in the hope of extending the known sequences. Through Jonatha, I got access to FTP files containing the sequences, to files of assembled sequences, and so on…

But in fact the Physarum sequences are accessible on the Web in the “trace archive” of the NCBI site, which collects the raw data of the various sequencing projects. With access to the NCBI site you can easily ask:
- whether your favourite sequence is found within the Physarum traces;
- which of the traces contain an ORF similar to a Dictyostelium gene;
- how many of the traces correspond to unique sequences, to repeated elements...

The information about the “archive site” was mentioned in Jonatha’s e-mail of June 17, 2005, but in a form that wasn’t clear to me.

22,616 Physarum sequences are listed on the NCBI archive site. This number differs significantly from what we knew: 13,440 plasmids were created and both ends were sequenced, generating 26,880 reads. Of these, I understood that 20,780 sequences were validated after trimming. In fact, some of the 22,616 sequences apparently did not go through the final screening: some of them contain vector sequences or bacterial genes…

How to access the Physarum traces?
- Go to PubMed.
- Select taxonomy in the upper left window and type in Physarum.
- Follow the path: Physarum --> P. polycephalum --> 22,616 sequences --> trace archives (bookmark this page).
- Preferably ask for 500 sequences per page (46 pages altogether).

The sequences should look something like this:
Search result: found 22616 item(s). Your request is: SPECIES_CODE='PHYSARUM POLYCEPHALUM'

>gnl|ti|818408726 name:POAA-aaa02a03.b1 mate:820216407
ATGATATCTCGCCCTGCTGGTGGATTCGATATCCTATCGCGGGAATATCGGGCCGTGGTAAGACTACGTG
CAATACTAAGGGGAGCATACTGGGGTCCTAAACCGAGTAGGCTACCCTATTAACGAACTCCCGATATTTT
GGTTTTCCCTATCGTAGTAATTACGATAGGGCGCGACCGTTGCTTTTAGATGCGGCAATGAAAGCACATC
GCCGTATGGAAAGGGTTAAATTTATTTTACTAAGGCTTGCTAAATTATAAACCGCAAAAAAAAAAAAAAA
AAAAAAAGCTCTACTGTCTTTAGTGCTAGTACTTCTGGCACTGGTATTGTACAGTAGCCTCTGAGGATTA
TAGTAATATATCCTACATGCACGTCCTCACTTTAGCGTCTGGTTAATGTAGGATCACACTTGCTCTAGCC
TTGTTTCTCAAAATTCCGAATTTTTTAAAGTGAAAATCCTTGTACTTGAGCTTATGGTTAAGATCAAAAG
AGTTTCTTCTCTCCCCTTCGCCTGTGCCGCGAAGTAATAATCTAGGTTTTTTTTTTTTTTTTTTTTTTTT
TTTTTTTTTTTTAAAAACCTTAATTTGTTCTCCCTCTTTTTGCTCTTGTTGCGCCATTCTTTCCCTTACC
GTAAAAACCTGAAGAGGGCTCCCTAAATTTGGGCTTATTTTCTAAAAATATCCCAACCCCTCTCTCCTTA
CCTATTTTCCTGGAAACTCGCTTTTTTTCTCCCCCTATAAGATATTTCGATCGGGGTAGTAAATAATCTC
TTTTTGTGGATGTAAAGAAAATGATGGGAAGTAAAAAGGAAGGTTAGAAAAAAAAAGGTTAAACCCCTCC
CCTCCAACCCGCGTTTCCCTTTGATTTTGGGCGAGGCGACCTCGGCAAATAATCTATAGATAGGGAACT

Each sequence is identified by a number and a code, e.g. 818408726 and aaa02a03.b1. The number is the ID (ti number) of the sequence. The code probably represents the position of the clone in its well plate.

Each trace has a mate, which corresponds to the other end of the insert. The mate of aaa02a03.b1 is aaa02a03.g1 (b1 is forward; g1 is backward); its ti number is 820216407. I do not know the exact range of insert sizes; it is probably 3 to 5 kb, according to Jonatha’s message of June 17, 2005. You might expect the 2 reads from a single clone to “hybridize” to a single gene if the gene is long enough. I found some occurrences of that situation.

Wish to know whether your favorite sequence is in the bank?
- Retrieve your sequence in PubMed nucleotide (e.g. X15142) and copy it in FASTA format (see the Reports option for easy FASTA access):

AF438185 Reports
Physarum polycephalum histone H1 gene, partial cds
gi|16904229|gb|AF438185.1|AF438185[16904229]

- Open the trace archive page (your bookmark).
- Click on Blast (upper right) and choose MegaBlast.
- Paste your FASTA sequence in the “trace archive” window.
- Select Physarum polycephalum WGS in the database choice.
- Restrict the output to 10 descriptions in the format window and blast.
- Wait a few seconds and click on format.

You will get the result in a very convenient color-coded format. In this case, the 466 bp partial sequence of Histone H1 is contained within sequence 820216565 of the pilot assay. Not surprisingly, since the reads represent less than 0.1X of a genome equivalent, most of the known sequences will not be found, but some sequences are obviously there, as illustrated by the partial Histone H1 sequence.

I was particularly interested in comparing the 22,616 sequences to the Dictyostelium genome to estimate how many new Physarum ORFs could be detected. The answer is MANY. The procedure was as follows:
- Copy one page (500 sequences) of the Physarum traces, then click on blast in the upper right corner of the same window.
- Select megablast and then translation, so that the 500 sequences are translated in the 6 possible frames.
- In “limited entrez” select Dictyostelium.
- Reduce the output by selecting 10 descriptions and 0 alignments in the format window and blast.
- You will get a queuing number. Press format and wait for the results, or copy the number and, when the results are ready, use the option “retrieve an ID”.

Here is an example of what you get. The score is relatively high. The homology is broken into several pieces, which suggests introns within the Physarum sequence that are of course absent from the Dictyostelium protein sequence.
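
To make the translation step more concrete, here is a minimal Python sketch of what “translated in the 6 possible frames” means for a single read. It is only an illustration, not the code run by the NCBI Blast server. The function names (reverse_complement, translate, six_frame_translation) are illustrative, and the sketch assumes the standard genetic code, with any codon containing an ambiguous base reported as X.

```python
from itertools import product

# Standard genetic code (assumption), built from the conventional TCAG ordering.
BASES = "TCAG"
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {"".join(c): aa for c, aa in zip(product(BASES, repeat=3), AMINO)}

COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(seq):
    """Return the reverse complement of a DNA sequence (A, C, G, T only)."""
    return seq.upper().translate(COMPLEMENT)[::-1]


def translate(seq):
    """Translate a DNA sequence from its first base; unknown codons become X."""
    seq = seq.upper()
    return "".join(
        CODON_TABLE.get(seq[i:i + 3], "X") for i in range(0, len(seq) - 2, 3)
    )


def six_frame_translation(seq):
    """Return the 6 translations: frames +1..+3 (as read) and -1..-3 (reverse strand)."""
    frames = {}
    for strand, s in (("+", seq), ("-", reverse_complement(seq))):
        for offset in range(3):
            frames[f"{strand}{offset + 1}"] = translate(s[offset:])
    return frames


if __name__ == "__main__":
    # First bases of trace 818408726, used here only as sample input.
    read = "ATGATATCTCGCCCTGCTGGTGGATTCGATATCCTATCGCGG"
    for frame, protein in six_frame_translation(read).items():
        print(frame, protein)
```

A stop codon appears as * in the output. A frame that runs for a long stretch without * is a candidate ORF, which is the kind of frame that would then be compared to the Dictyostelium proteins.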