ESSENTIALS OF NEXT GENERATION SEQUENCING WORKSHOP 2015
UNIVERSITY OF KENTUCKY AGTC

Class 5: Variant Calling

Goal: Learn how to use various tools to extract variant information from a .bam file

Input(s):
    GG11x70-15.bam (a .bam file produced by aligning reads from Magnaporthe oryzae strain GG11 against a reference genome, strain 70-15)
    magnaporthe_oryzae_70-15_8_supercontigs.fasta

Output(s):
    GG11x70-15_sorted.bam file
    GG11x70-15_sorted.bcf file
    GG11x70-15_sorted.vcf file

5.1 Extracting variant information from .bam files

We will use the samtools mpileup utility to extract information on nucleotide variations (substitutions, insertions, deletions) between our sequence sample and a reference genome. mpileup does this by using the sequence reads and genomic coordinate information in the .bam file to identify the corresponding sequence region in the reference genome. It then compares the two nucleotide sequences.

    Usage: samtools mpileup [options] -f <reference_genome> <.bam_file>

For this exercise, we will use a .bam file produced by aligning reads from a new Magnaporthe strain (GG11) to the original reference genome (from strain 70-15). The reference genome is the same one we used for the RNAseq analysis.

Copy the magnaporthe_oryzae_70-15_8_supercontigs.fasta file from the rnaseq directory to the variants directory.

Change to the variants directory.

Note that the .bam file stores the reads and their alignment coordinates but not the reference sequence itself, so we must provide mpileup with the reference genome that was used for alignment. This way it can compare each aligned read with the reference. To speed up the comparison, mpileup uses an indexed version of the genome.
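To make the comparison that mpileup performs more concrete, here is a toy sketch in plain shell. The reference string, the read, and its start coordinate are all made up for illustration, and this is not how samtools is implemented internally; it only shows the idea of using a read's coordinate to pull out the matching reference region and then comparing the two sequences base by base.

```shell
# Toy illustration only: made-up sequences, not real Magnaporthe data.
ref="ACGTACGTACGT"     # hypothetical reference sequence
read_seq="ACGAAC"      # hypothetical read
read_pos=5             # 1-based leftmost reference coordinate of the read

read_len=${#read_seq}
read_end=$((read_pos + read_len - 1))

# Use the coordinates to extract the reference region the read aligns to.
region=$(printf '%s' "$ref" | cut -c "${read_pos}-${read_end}")

# Compare the read with that region one base at a time.
mismatches=""
i=1
while [ "$i" -le "$read_len" ]; do
  r=$(printf '%s' "$region" | cut -c "$i")
  a=$(printf '%s' "$read_seq" | cut -c "$i")
  if [ "$r" != "$a" ]; then
    # Record the reference coordinate, reference base, and read base.
    mismatches="${mismatches}$((read_pos + i - 1)):${r}>${a} "
  fi
  i=$((i + 1))
done

echo "reference region: $region"
echo "mismatches: ${mismatches:-none}"
```

In this toy case the read differs from the reference at a single position. A real variant caller also has to weigh base qualities, mapping qualities, and the evidence from all reads covering a site, which is why we rely on mpileup rather than a script like this.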
Use samtools faidx to generate an indexed version of the reference genome:

    samtools faidx magnaporthe_oryzae_70-15_8_supercontigs.fasta

This will create an index file named magnaporthe_oryzae_70-15_8_supercontigs.fasta.fai.

Check to make sure the index was created.

mpileup requires the input .bam file to be sorted by genome coordinate first. This can be accomplished using the samtools sort utility:

    Usage: samtools sort <infile.bam> <outfile-prefix>

Sort the GG11x70-15.bam file:

    samtools sort GG11x70-15.bam GG11x70-15_sorted

Run mpileup on the sorted file:

    samtools mpileup -uf magnaporthe_oryzae_70-15_8_supercontigs.fasta \
        GG11x70-15_sorted.bam

    -u          generate uncompressed bcf output
    -f [FILE]   reference genome in FASTA format, indexed with samtools faidx

Remember, by default samtools writes its results to the screen, which makes the output hard to read. However, at least you can see that the program is doing something interesting.

Stop the process so you don't have to wait a LONG time for the output to finish printing to the screen.

Let's start the process again, and this time pipe the output through less (or more) to check that the program is producing sensible data.

Once we are confident that this is the case, we can quit the process, start a new one, and redirect the results to a file:

    samtools mpileup -uf magnaporthe_oryzae_70-15_8_supercontigs.fasta \
        GG11x70-15_sorted.bam > GG11x70-15_sorted.bcf

This will generate a file that summarizes variant statistics for every position in the reference genome for which there are aligned reads. However, it contains only the statistics needed for variant calling; it does not call the variants itself.

Inspect the GG11x70-15_sorted.bcf file. It's kind of hard to interpret, isn't it?
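The preparation steps above can be double-checked before launching mpileup. The sketch below is one possible way to do that, assuming you are in the variants directory; missing_inputs is a hypothetical helper written for this handout, not part of samtools.

```shell
# Assumption: run from the variants directory, file names as in this handout.
REF=magnaporthe_oryzae_70-15_8_supercontigs.fasta
SORTED_BAM=GG11x70-15_sorted.bam

# Hypothetical helper: print every required file that is absent or empty.
missing_inputs() {
  for f in "$REF" "$REF.fai" "$SORTED_BAM"; do
    [ -s "$f" ] || echo "$f"
  done
}

missing=$(missing_inputs)
if [ -n "$missing" ]; then
  echo "Not ready for mpileup; missing or empty:"
  echo "$missing"
else
  # Page through the output instead of letting it scroll past.
  samtools mpileup -uf "$REF" "$SORTED_BAM" | less
fi
```

The commands in this handout use the samtools 0.1.x syntax. If your installation is samtools 1.3 or later, the sort step is instead written as samtools sort -o GG11x70-15_sorted.bam GG11x70-15.bam.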
If we wish to extract sensible variant information, we need to use the bcftools utility. So let's use bcftools to call the variants:

    bcftools view -vg GG11x70-15_sorted.bcf

    -v   output potential variant sites only (i.e. skip monomorphic ones)
    -g   call genotypes at variant sites

Note that bcftools is like samtools in that it sends results to the screen.

Quit the process (ctrl-c) and redirect the output to a file named GG11x70-15_sorted.vcf.

Inspect the GG11x70-15_sorted.vcf file.

At the head of the file is some information on how to interpret the various fields. Below that, each line describes a variant at a specific nucleotide position within a scaffold. Here you should be able to recognize data that make sense. Unfortunately, the header does not provide information on the overall structure of the .vcf file. The main fields in the tab-delimited section are as follows:

    Field 1: scaffold name
    Field 2: nucleotide position
    Field 3: SNP ID (if previously characterized and named)
    Field 4: nucleotide in reference genome
    Field 5: alternate allele(s) identified in sequence reads
    Field 6: quality of SNP call
    Field 7: filtering information ("." = no filter applied; "Low Qual"; or "PASS")
    Field 8: SNP information (see list of INFO fields in file header)
    Fields 9 & 10: genotype format and the sample's genotype data (see list of FORMAT fields in file header)

A complete description of the VCF format is in the VCFv4.1.pdf file on your USB thumb drive.
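The field layout described above lends itself to quick summaries with awk. The following is a minimal sketch that uses a made-up two-record file in VCF layout (the scaffold names, positions, and scores are invented and are not real GG11 results), skips the header lines, and prints fields 1, 2, 4, 5 and 6. The quality threshold of 20 is an arbitrary example, not a recommendation.

```shell
# Build a tiny made-up VCF so the example is self-contained.
vcf=$(mktemp)
printf '##fileformat=VCFv4.1\n' > "$vcf"
printf '#CHROM\tPOS\tID\tREF\tALT\tQUAL\tFILTER\tINFO\tFORMAT\tGG11\n' >> "$vcf"
printf 'scaffold_1\t1042\t.\tA\tG\t222\t.\tDP=35\tGT:PL:GQ\t1/1:255,105,0:99\n' >> "$vcf"
printf 'scaffold_2\t77\t.\tC\tT\t18.1\t.\tDP=4\tGT:PL:GQ\t0/1:48,0,60:50\n' >> "$vcf"

# Header lines start with '#'; skip them and print scaffold, position,
# reference base, alternate allele, and call quality.
summary=$(awk -F'\t' '!/^#/ { print $1 ":" $2, $4 ">" $5, "QUAL=" $6 }' "$vcf")
echo "$summary"

# Count records whose quality (field 6) is at least 20.
n_pass=$(awk -F'\t' '!/^#/ && $6 >= 20 { n++ } END { print n + 0 }' "$vcf")
echo "records with QUAL >= 20: $n_pass"

rm -f "$vcf"
```

Pointing the same two awk commands at GG11x70-15_sorted.vcf instead of the temporary file should give a compact listing of the variants called in this exercise.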