today - Gogarten Lab

advertisement
MCB 3421 class 25
student evaluations
Please go to husky CT and complete student evaluations !
PSI (position-specific iterated) BLAST
The NCBI page described PSI blast as follows:
“Position-Specific Iterated BLAST (PSI-BLAST) provides an
automated, easy-to-use version of a "profile" search, which is a
sensitive way to look for sequence homologues.
The program first performs a gapped BLAST database search. The
PSI-BLAST program uses the information from any significant
alignments returned to construct a position-specific score matrix,
which replaces the query sequence for the next round of database
searching.
PSI-BLAST may be iterated until no new significant alignments are
found. At this time PSI-BLAST may be used only for comparing protein
queries with protein databases.”
The Psi-Blast Approach
1. Use results of BlastP query to construct a multiple sequence alignment
2. Construct a position-specific scoring matrix from the alignment
3. Search database with alignment instead of query sequence
4. Add matches to alignment and repeat
Psi-Blast can use existing multiple alignment, or
use RPS-Blast to search a database of PSSMs
PSI BLAST scheme
by Bob Friedman
Position-specific Matrix
M Gribskov, A D McLachlan, and D Eisenberg (1987) Profile analysis:
detection of distantly related proteins. PNAS 84:4355-8.
PSI BLAST and E-values!
Psi-Blast is for finding matches among divergent sequences (positionspecific information)
WARNING: For the nth iteration of a PSI BLAST search, the E-value
gives the number of matches to the profile NOT to the initial query
sequence! The danger is that the profile was corrupted in an earlier
iteration.
PSI Blast from the command line
Often you want to run a PSIBLAST search with two different databanks one to create the PSSM, the other to get sequences:
To create the PSSM:
blastpgp -d nr -i subI -j 5 -C subI.ckp -a 2 -o subI.out -h 0.00001 -F f
blastpgp -d swissprot -i gamma -j 5 -C gamma.ckp -a 2 -o gamma.out -h 0.00001 -F f
Runs 4 iterations of a PSIblast
the -h option tells the program to use matches with E <10^-5 for the next iteration,
(the default is 10-3 )
-C creates a checkpoint (called subI.ckp),
-o writes the output to subI.out,
-i option specifies input as using subI as input (a fasta formated aa sequence).
The nr databank used is stored in /common/data/
-a 2 use two processors
-h e-value threshold for inclusion in multipass model [Real]
default = 0.002 THIS IS A RATHER HIGH NUMBER!!!
(It might help to use the node with more memory (017)
(command is ssh node017)
Use of a PSSM:
blastpgp -d /Users/jpgogarten/genomes/msb8.faa -i subI -a 2 -R
subI.ckp -o subI.out3 -F f
blastpgp -d /Users/jpgogarten/genomes/msb8.faa -i gamma -a 2 -R
gamma.ckp -o gamma.out3 -F f
Runs another iteration of the same blast search, but uses the databank
/Users/jpgogarten/genomes/msb8.faa
-R tells the program where to resume
-d specifies a different databank
-i input file - same sequence as before
-o output_filename
-a 2 use two processors
-h e-value threshold for inclusion in multipass model [Real]
default = 0.002. This is a rather high number, but might be ok for
the last iteration.
PSI Blast and finding gene families within genomes
Build PSSM from query sequence and a large database
(nr is a good choice – if you know the annotation of the query sequences, you don’t need to worry
about the annotations in the database)
use PSSM to search a genome:
A) Use protein sequences encoded in genome as target:
blastpgp -d target_genome.faa -i query.name -a 2 -R query.ckp -o
query.out3 -F f
B) Use nucleotide sequence and tblastn. This is an advantage if you are also interested
in pseudogenes, and/or if you don’t trust the genome annotation:
blastall -i query.name -d target_genome_nucl.ffn -p psitblastn -R
query.ckp
man wc
Comparison of blastp, PSIblastP, and psitblastn
>wc -l blastp.out PSIblastP.out psitblastn.out
34 blastp.out
44 PSIblastP.out
56 psitblastn.out
2
7
4
5
6
3
8
1
1
2
3
4
5
6
7
8
5
4
6
3
7
2
8
1
ori
Finding transferred genes
Screening in the wet-lab and in the computer
Finding transferred genes
Taxplot at NCBI
Taxplot at NCBI
Other approaches to find transferred genes
• Gene presence absence data for closely
related genomes (for additional genes)
• Phylogenetic conflict (for homologous
replacement (e.g. quartet decompositon spectra
see Figs. 1 and 2)
• Composition based analyses (for very recent
transfers).
Discussion of
HGT from Bacteria to Tardigrades
We estimate that approximately one-sixth
of tardigrade genes entered by HGT,
nearly double the fraction found in the
most extreme cases of HGT into animals
known to date. Foreign genes have
supplemented, expanded, and even
replaced some metazoan gene families
within the tardigrade genome. Our results
demonstrate that an unexpectedly large
fraction of an animal genome can be
derived from foreign sources.
Source of genes in the H. dujardini genome as
determined by HGT index calculations
Discussion of
HGT from Bacteria to Tardigrades
BIOARCHIVES
doi: http://dx.doi.org/10.1101/033464
http://biorxiv.org/content/early/2015/12/01/033464
• “While the raw data indicated extensive
contamination with bacteria, presumably from the
gut or surface of the animals, careful cleaning
generated a clean tardigrade dataset for assembly.”
Our assembly, and inferences from it, conflict with a recently
published draft genome (UNC) 6 for what is essentially the same
strain of H. dujardini. Our assembly, despite having superior
assembly statistics, is ~120 Mb shorter than the UNC assembly.
Our genome size estimate from sequence assembly is congruent
with the values we obtained by direct measurement. We find
15,000 fewer protein-coding genes, and a hugely reduced
impact of predicted HGT on gene content in H. dujardini. These
HGT candidates await detailed validation. While resolution of the
conflict between these assemblies awaits detailed examination
based on close scrutiny of the raw UNC data, our analyses
suggest that the UNC assembly is compromised by sequences
that derive from bacterial contaminants, and that the
expanded genome span, additional genes, and HGT candidates
are likely to be artefactual.
Figure 4: Mapping of read data to UNC assembly identifies
non-shared
contaminants and no expression from bacterial scaffolds
A Blobplot showing the UNC assembly contigs distributed by
GC proportion and coverage derived from the UNC raw
genomic sequence data (data file TG-300). Scaffold points are
scaled by length, and coloured based on taxonomic assignment
of the sum of the best BLAST and Diamond matches for all the
genes on the scaffold. Taxonomic assignments are summed by
phylum.
B Blobplot showing the UNC assembly contigs distributed by
GC proportion and coverage derived from the Edinburgh raw
genomic sequence data. Scaffold points are scaled by length,
and coloured based on taxonomic assignment of the sum of the
best BLAST and Diamond matches for all the genes on the
scaffold. Taxonomic assignments are summed by phylum.
UNC reads
Edinburgh reads
both mapped on the UNC assembly
Download