
ESSENTIALS OF NEXT GENERATION SEQUENCING WORKSHOP 2015
UNIVERSITY OF KENTUCKY AGTC
Class 5: Variant Calling
Goal: Learn how to use various tools to extract variant information from a .bam file
Input(s):
  GG11x70-15.bam (a .bam file produced by aligning reads from Magnaporthe oryzae strain GG11 against a reference genome, strain 70-15)
  magnaporthe_oryzae_70-15_8_supercontigs.fasta
Output(s):
  GG11x70-15_sorted.bam file
  GG11x70-15_sorted.bcf file
  GG11x70-15_sorted.vcf file
5.1 Extracting variant information from .bam files
We will use the samtools mpileup utility to extract information on nucleotide variations (substitutions,
insertions, deletions) between our sequence sample and a reference genome. mpileup does this by using
the sequence reads and genomic coordinate information in the .bam file to identify the corresponding
sequence region in the reference genome. It then compares the two nucleotide sequences.
Usage: samtools mpileup [options] -f <reference_genome> <.bam_file>
For this exercise, we will use a .bam file produced by aligning reads from a new Magnaporthe strain
(GG11) to the original reference genome (from strain 70-15). The reference genome will be the
same one we used for the RNAseq analysis.
• Copy the magnaporthe_oryzae_70-15_8_supercontigs.fasta file from the rnaseq directory to the
variants directory
• Change to the variants directory
Note that the .bam file stores only the reads and their alignment coordinates, not the reference
sequence, so we must provide mpileup with the reference genome that was used for alignment. This way it can
compare each individual read with the reference. To speed up the comparison, mpileup uses an
indexed version of the genome.
• Use samtools faidx to generate an indexed version of the reference genome:

samtools faidx magnaporthe_oryzae_70-15_8_supercontigs.fasta
This will create an index file named magnaporthe_oryzae_70-15_8_supercontigs.fasta.fai.
• Check to make sure the index was created (e.g., by listing the files in the directory).
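One way to do this check is sketched here. The .fai index is a plain-text, tab-delimited file whose first two columns hold each sequence name and its length, so confirming that the file is non-empty and totalling the lengths is a quick sanity test. The helper name fai_summary and the toy index contents are assumptions made for illustration; they are not part of samtools or of the workshop data.

```shell
# Hypothetical helper: print "<number of sequences> <total bases>" for a .fai index.
fai_summary() {
    awk -F'\t' '{ n++; total += $2 } END { print n + 0, total + 0 }' "$1"
}

# Toy index standing in for magnaporthe_oryzae_70-15_8_supercontigs.fasta.fai.
# Columns: name, length, byte offset, bases per line, bytes per line.
# The scaffold names and lengths are invented.
toy_fai=$(mktemp)
printf 'scaffold_1\t1000\t12\t60\t61\n' > "$toy_fai"
printf 'scaffold_2\t500\t1041\t60\t61\n' >> "$toy_fai"

# -s is true only if the file exists and is not empty.
if [ -s "$toy_fai" ]; then
    summary=$(fai_summary "$toy_fai")
    echo "$summary"
fi
```

On the real data, the same idea would be `ls -l magnaporthe_oryzae_70-15_8_supercontigs.fasta.fai` followed by `fai_summary magnaporthe_oryzae_70-15_8_supercontigs.fasta.fai`.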
mpileup requires the input .bam file to be first sorted based on genome coordinates. This can be
accomplished using the samtools sort utility:
Usage: samtools sort <infile.bam> <outfile-prefix>
• Sort the GG11x70-15.bam file:

samtools sort GG11x70-15.bam GG11x70-15_sorted
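The prefix-style usage above belongs to the older samtools releases; from samtools 1.3 onward the output file is named with -o instead (samtools sort -o GG11x70-15_sorted.bam GG11x70-15.bam). Whichever version is installed, it can be reassuring to confirm that the result really is in coordinate order. The sketch below is one possible check, not part of samtools: the helper name positions_sorted is hypothetical, the toy records are invented, and the check only looks at position order within each reference, not at the order of the references themselves.

```shell
# Hypothetical helper: reads SAM records (tab-delimited) on stdin and
# succeeds only if positions never decrease within a run of records
# that share the same reference name (column 3 = RNAME, column 4 = POS).
positions_sorted() {
    awk -F'\t' '
        $3 == prev_ref && ($4 + 0) < prev_pos { bad = 1; exit }
        { prev_ref = $3; prev_pos = $4 + 0 }
        END { exit bad + 0 }
    '
}

# Toy records, invented for illustration; only the first four SAM columns are filled in.
toy_sam=$(
    printf 'read1\t0\tscaffold_1\t100\n'
    printf 'read2\t0\tscaffold_1\t250\n'
    printf 'read3\t0\tscaffold_2\t40\n'
)

if printf '%s\n' "$toy_sam" | positions_sorted; then
    sorted_status="sorted"
else
    sorted_status="unsorted"
fi
echo "$sorted_status"
```

With samtools installed, the same helper could be fed real records, for example `samtools view GG11x70-15_sorted.bam | head -n 10000 | positions_sorted && echo sorted`.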
• Run mpileup on the sorted file:

samtools mpileup -uf magnaporthe_oryzae_70-15_8_supercontigs.fasta \
GG11x70-15_sorted.bam
-u          generate uncompressed bcf output
-f [FILE]   reference genome in FASTA format (indexed with samtools faidx)
Remember that, by default, samtools writes its results to the screen, which makes the output hard to read.
However, at least you can see that the program is doing something interesting.
• Stop the process (ctrl-c) so you don’t have to wait a LONG time for the output to finish printing to the
screen
• Let’s start the process again and this time pipe the output through less (or more) to check that
the program is producing sensible data
• Once we are confident that this is the case, we can quit the process, start a new one and redirect
the results to a file:

samtools mpileup -uf magnaporthe_oryzae_70-15_8_supercontigs.fasta \
GG11x70-15_sorted.bam > GG11x70-15_sorted.bcf
This will generate a file that summarizes variant statistics for every position in the reference genome for
which there are aligned reads. However, it contains only the per-position statistics (genotype likelihoods); it does not call the variants.
• Inspect the GG11x70-15_sorted.bcf file. It’s kind of hard to interpret, isn’t it? To extract
sensible variant information, we need to use the bcftools utility.
• So let’s use bcftools to call the variants:

bcftools view -vg GG11x70-15_sorted.bcf
-v   output potential variant sites only (i.e. skip monomorphic ones)
-g   call genotypes at variant sites
• Note that bcftools is like samtools in that it sends results to the screen. Quit the process (ctrl-c)
and redirect the output to a file named GG11x70-15_sorted.vcf.
• Inspect the GG11x70-15_sorted.vcf file. At the head of the file is some information on how to
interpret the various fields. Below the header, each line provides information on a variant at a specific
nucleotide position within a scaffold. Here you should be able to recognize data that make
sense.
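When inspecting the file, it can help to separate the header from the variant records. In a VCF file, meta-information lines start with "##", the column-title line starts with a single "#", and every other line is a variant record. The sketch below counts both kinds of line with grep; it runs on a small toy VCF whose contents are invented for illustration, so the numbers say nothing about the GG11 data.

```shell
# Toy VCF for illustration only; the values are invented, not taken from GG11.
toy_vcf=$(mktemp)
{
    printf '##fileformat=VCFv4.1\n'
    printf '##INFO=<ID=DP,Number=1,Type=Integer,Description="Raw read depth">\n'
    printf '#CHROM\tPOS\tID\tREF\tALT\tQUAL\tFILTER\tINFO\tFORMAT\tGG11\n'
    printf 'scaffold_1\t101\t.\tA\tG\t222\t.\tDP=35\tGT:PL\t1/1:255,105,0\n'
    printf 'scaffold_1\t250\t.\tCT\tC\t45\t.\tINDEL;DP=12\tGT:PL\t1/1:80,36,0\n'
} > "$toy_vcf"

# Meta-information lines start with "##"; they define the INFO and FORMAT tags.
n_meta=$(grep -c '^##' "$toy_vcf")
# Every line that does not start with "#" is a variant record.
n_records=$(grep -vc '^#' "$toy_vcf")
echo "meta lines: $n_meta, variant records: $n_records"
```

Replacing "$toy_vcf" with GG11x70-15_sorted.vcf would give the corresponding counts for the real file.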
Unfortunately, the header does not provide information on the overall structure of the .vcf file. The main
fields in the tab-delimited section are as follows:
Field 1: scaffold (chromosome) name
Field 2: nucleotide position
Field 3: variant ID (if previously characterized and named; otherwise ".")
Field 4: nucleotide(s) in reference genome
Field 5: alternate allele(s) identified in sequence reads
Field 6: quality score of the variant call
Field 7: filtering information ("." = no filter applied; a filter name such as "LowQual"; or "PASS")
Field 8: variant information (see list of INFO fields in file header)
Field 9: format of the genotype data in field 10 (see list of FORMAT fields in file header)
Field 10: genotype data for the sample, laid out as described in field 9
A complete description of the VCF format is in the VCFv4.1.pdf file on your USB thumb drive.
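Because the record section is tab-delimited, the fields listed above can be pulled out with standard tools. The following is a minimal sketch using awk: it prints the scaffold, position, reference allele, alternate allele and quality of each record, and labels the record as a SNP or an INDEL by comparing allele lengths. The helper name vcf_summarize and the toy records are assumptions made for illustration, and the classification assumes a single alternate allele per record.

```shell
# Hypothetical helper: read a VCF on stdin and print one summary line per record.
vcf_summarize() {
    awk -F'\t' '
        /^#/ { next }   # skip header lines
        {
            # Single-base REF and ALT means a substitution; anything else is
            # treated as an indel (assumes one ALT allele per record).
            type = (length($4) == 1 && length($5) == 1) ? "SNP" : "INDEL"
            print $1, $2, $4, $5, $6, type
        }
    '
}

# Toy input, invented for illustration; not taken from GG11x70-15_sorted.vcf.
toy_records=$(
    printf '#CHROM\tPOS\tID\tREF\tALT\tQUAL\tFILTER\tINFO\tFORMAT\tGG11\n'
    printf 'scaffold_1\t101\t.\tA\tG\t222\t.\tDP=35\tGT:PL\t1/1:255,105,0\n'
    printf 'scaffold_1\t250\t.\tCT\tC\t45\t.\tINDEL;DP=12\tGT:PL\t1/1:80,36,0\n'
)

summary=$(printf '%s\n' "$toy_records" | vcf_summarize)
printf '%s\n' "$summary"
```

On the real file, the helper would be used as `vcf_summarize < GG11x70-15_sorted.vcf | less`.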