Structural Variation Detection Using NGS Technology
Ke Lin
23rd Feb, 2012

Content
• Introduction
• Methods and software used for SV detection
• Exercises

Introduction: What is Structural Variation?
• Variation in the structure of chromosomes within one species
• FISH can be used to detect and localize the presence or absence of specific DNA sequences
• A region of DNA affected by inversions, balanced translocations or genomic imbalances (CNV)
• Approximately 1 kb or greater in size
• Many SVs are associated with genetic diseases

Introduction: What can NGS do to detect SV?
• Assumption: a reference genome of the species is available
• Re-sequencing of other individuals of the species at shallow genome coverage (< 30X)
• Paired-end sequencing

Methods used for SV detection
1. Local (de novo) assembly, then alignment of the assembled sequences to the reference genome
• Accurate but costly
• The genomes of individuals within one species should be quite similar at the sequence level
2. Mapping of reads to the reference genome, then deduction of SVs from the expected insert size of the read pairs
• Less accurate but much cheaper
• Many methods have been developed
• Downstream analysis can help to increase the accuracy

Signatures used for SV discovery
• PEM (Paired End Mapping)
1. Both reads of a pair have to be mapped to the reference
2. Reads need to align without gaps
• DOC (Depth Of Coverage)
1. Does not tell where the copies occur
2. Not able to detect insertions of novel sequence
• Split reads
1. The size of the gaps introduced is limited (only a few base pairs are allowed)
2. Novel sequence insertions cannot be completely recovered by local assembly of the hanging reads if they are substantially larger than the insert size

Software for each method used for SV detection

• PEM: BreakDancer
Input: BWA mapping output in BAM format
Command: bam2cfg.pl -g -h bamfile1 bamfile2 .. > configure_file
Output: configuration file for the next step

Input: configuration file
Command: breakdancer_max -h -g int.bed -o chromosome cfg_file > output
Output: tab-delimited file with the following columns
1. Chromosome 1
2. Position 1
3. Orientation 1
4. Chromosome 2
5. Position 2
6. Orientation 2
7. Type of the SV
8. Size of the SV
9. Confidence score
10. Total number of supporting read pairs
11. Total number of supporting read pairs from each bam/library
12. Estimated allele frequency (if -h)
13 - end. Copy number for each bam/library

• DOC: cnD
Input: BWA mapping output in BAM format
Command: samtools pileup -c bamfile | pileup2win.pl > output_file
Output: windows file for the next step

Input: windows file
Commands:
cnD.x86-64 --prefix=lib_name --nohet windows_file1
cat lib*_viterbi.txt > viterbi.txt
metaCaller.pl --threshold=value viterbi.txt > metacalls.txt
extractCNChanges.pl metacalls.txt > output
Output: tab-delimited file with the columns chr, start pos, end pos, Gain/Loss

• Split reads: Pindel
Input: configuration file
Command: pindel_x86_64 -f ref.fasta -i cfg_file -c ALL -o name
Output: files with indicative names
D = deletion, SI = short insertion, INV = inversion, TD = tandem duplication, LI = large insertion, BP = unassigned

Downstream analysis after SV detection
• Local assembly of SV regions
• Annotation of novel insertions
• Fine-tuning of potentially changed gene models

Exercises
Find all deletions in chromosome 1 using BreakDancer. Then try to do the same using cnD (gene loss) and Pindel, respectively.
The input files can be found in: /mnt/geninf15/work/bif_course_2012/SV/exercises/
The documentation of each program can be found in: /mnt/geninf15/work/bif_course_2012/SV/DOC/
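
As a starting point for the BreakDancer exercise, the tab-delimited output can be filtered with a few lines of Python. The following is only a minimal sketch, not part of BreakDancer itself. It relies on the column order listed on the BreakDancer output slide and makes some assumptions that should be checked against the documentation: that deletions carry the type label "DEL", that header lines start with "#", and that the chromosome is named "chromosome1". The function name find_deletions, the min_score parameter and the example records are hypothetical and were made up for illustration.

```python
import csv
import io

# 0-based indices of the BreakDancer output columns used here
# (Chromosome 1, Position 1, Chromosome 2, Position 2, Type, Size,
# Confidence score, Total number of supporting read pairs).
CHR1, POS1, CHR2, POS2, SV_TYPE, SIZE, SCORE, NUM_PAIRS = 0, 1, 3, 4, 6, 7, 8, 9


def find_deletions(handle, chromosome, min_score=0, deletion_label="DEL"):
    """Collect deletion calls on one chromosome from tab-delimited SV calls.

    `deletion_label` is an assumption about how deletions are labelled
    in the "Type of the SV" column; adjust it if the output differs.
    """
    deletions = []
    for row in csv.reader(handle, delimiter="\t"):
        # Skip empty lines, header/comment lines and truncated records.
        if not row or row[0].startswith("#") or len(row) <= NUM_PAIRS:
            continue
        if row[SV_TYPE] != deletion_label:
            continue
        # Keep only calls with both breakpoints on the requested chromosome.
        if row[CHR1] != chromosome or row[CHR2] != chromosome:
            continue
        score = float(row[SCORE])
        if score < min_score:
            continue
        deletions.append(
            (row[CHR1], int(row[POS1]), int(row[POS2]),
             int(row[SIZE]), score, int(row[NUM_PAIRS]))
        )
    return deletions


# Made-up records for illustration only; they are not real BreakDancer results.
EXAMPLE = (
    "#Chr1\tPos1\tOrient1\tChr2\tPos2\tOrient2\tType\tSize\tScore\tnum_Reads\n"
    "chromosome1\t1000\t10+0-\tchromosome1\t1500\t0+10-\tDEL\t480\t99\t10\n"
    "chromosome1\t5000\t5+5-\tchromosome1\t5200\t5+5-\tINV\t190\t60\t5\n"
    "chromosome2\t300\t8+0-\tchromosome2\t900\t0+8-\tDEL\t590\t80\t8\n"
    "chromosome1\t7000\t3+0-\tchromosome1\t7400\t0+3-\tDEL\t390\t20\t3\n"
)

for call in find_deletions(io.StringIO(EXAMPLE), "chromosome1", min_score=30):
    print(*call, sep="\t")
```

For a real run, the in-memory example would be replaced by an open file handle on the breakdancer_max output, for instance open("output") for the file written by the command above. The score threshold is a free choice; since read-pair mapping is the less accurate of the two approaches, comparing the surviving calls with the cnD and Pindel results is one way to gain confidence in them.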