NGS Bioinformatics Workshop 2.2
Tutorial - Whole Genome Assembly Part I
May 9th, 2012, IRMACS 10900
Facilitator: Richard Bruskiewich, Adjunct Professor, MBB

Workflow for Today
    Generate a synthetic NGS read data set
    Genome assembly
        ABySS
        Velvet
        ALLPATHS-LG

Generate synthetic NGS read data for assembly
    Try out a new program called "ART", from NIEHS and Boston College:
        Huang W, Li L, Myers JR, Marth GT. 2012. ART: a next-generation sequencing read simulator. Bioinformatics 28(4):593-4.
    Available as open source and as binary programs for 32- or 64-bit Windows, Mac and Linux:
        http://www.niehs.nih.gov/research/resources/software/art
    Notes:
        The binary archive names are a bit strange: each is really a .tar.gz in disguise (run gunzip, then tar -xvf).
        The FASTQ sequence line is *lower case*, which some software (e.g. ABySS) does not expect.

Simulated Illumina Paired End Reads
    Using the rice chloroplast genome (~134 kb):
        art_illumina -i Chloroplast.fasta -p -l 50 -f 20 -m 200 -s 10 -o Chloroplast -sam
    Generates files:
        Chloroplast1.aln
        Chloroplast1.fq
        Chloroplast2.aln
        Chloroplast2.fq
        Chloroplast.sam

==============================================================================
ART (Q Version 1.3.6) Copyright(c) 2008-2012, Weichun Huang, Jason Myers. All Rights Reserved.
==============================================================================
Paired-end Simulation

Total CPU time used: 2.48

Parameters used during run
    Read Length:            50
    Fold Coverage:          20X
    Mean Fragment Length:   200
    Standard Deviation:     10
    Profile Type:           Combined
    ID Tag:

Quality Profile(s)
    First Read:             EMP50R1 (built-in profile)
    Second Read:            EMP50R2 (built-in profile)

Output files
    FASTQ Sequence Files:
        the 1st reads: Chloroplast1.fq
        the 2nd reads: Chloroplast2.fq
    ALN Alignment Files:
        the 1st reads: Chloroplast1.aln
        the 2nd reads: Chloroplast2.aln
    SAM Alignment File:
        Chloroplast.sam

Unfortunately…
    The ART program generates peculiar IDs (it doesn't mark the paired-end reads) and lower-case sequence letters, which causes some headaches.
    So, I wrote a small Python script to fix this:

    #!/usr/bin/python
    # Fixes the output of the ART program, e.g. from:
    #   art_illumina -i reference.fa -p -l 50 -f 20 -m 200 -s 10 -o outFile_prefix -sam
    # Reads FASTQ on stdin and writes the fixed FASTQ to stdout.
    from sys import stdin

    seq = False    # True when the next line is a sequence line
    qual = False   # True when the next line is a quality line

    if __name__ == '__main__':
        for line in stdin:
            line = line.strip()
            if qual:
                # checked first to avoid treating rare quality score lines
                # that start with '@' as IDs
                qual = False
            elif line.startswith('+'):
                qual = True
            elif not seq and line.startswith('@'):
                # massage the ID
                part1 = line.split('|')
                part2 = part1[1].split('-')
                line = part1[0] + '_' + part2[0] + '-' + part2[1] + '/' + part2[2]
                seq = True
            elif seq:
                # convert sequence all to upper case to avoid downstream confusion...
                line = line.upper()
                seq = False
            # print(line) with a single argument works in Python 2 and 3
            print(line)

Getting ABySS
    Installation:
        For Ubuntu: sudo apt-get install abyss
        Or visit BCGSC and download the tar.gz source, then ./configure and make (more up to date?)
    Perhaps put the abyss bin directory on your PATH…
    To test run ABySS:
        abyss-pe k=25 name=test se=https://raw.github.com/dzerbino/velvet/master/data/test_reads.fa
    Try our test PE read data set:
        abyss-pe name=Chloroplast31 k=31 ABYSS_OPTIONS=--no-trim-masked in='Chloroplast1.fastq Chloroplast2.fastq'
    The --no-trim-masked option is needed because the default behaviour of ABySS is to trim lower-case letters in sequence (which designate identified vector sequences in 454 outputs…)
    Try with other k-mer sizes…

For more info about ABySS
    http://www.bcgsc.ca/platform/bioinfo/software/abyss
    Active list service to troubleshoot issues: abyss-users@googlegroups.com

Velvet
    http://www.ebi.ac.uk/~zerbino/velvet/
    Download and tar -zxvf
    make
    sudo make install
    Put the velvet directory on your $PATH
    Run velveth:
        velveth outputdir k_mer -fastq readfile
    Run velvetg:
        velvetg outputdir -ins_length 200 -exp_cov 20

ALLPATHS-LG
    http://www.broadinstitute.org/software/allpaths-lg/blog/
    Download and tar -zxvf
    ./configure
    make
    sudo make install
    Execute the program:
        PrepareAllPathsInputs.pl   # needs some config files…
        RunAllPathsLG
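
Trying several k-mer sizes
    Retyping the ABySS and Velvet commands shown earlier for each k gets tedious. Below is a minimal sketch of a helper that only builds the command lines (it does not run the assemblers). The function names, the output directory naming (velvet21, velvet25, …) and the list of k values are assumptions made for illustration; the flags themselves are the ones used in the commands above.

```python
# Sketch only: builds command strings, it does not run the assemblers.
# Helper names, output naming and the k values are illustrative assumptions.

def abyss_command(k, reads, name="Chloroplast", options="--no-trim-masked"):
    """Return the abyss-pe command line for one k-mer size."""
    parts = ["abyss-pe", "name={0}{1}".format(name, k), "k={0}".format(k)]
    if options:
        # passed through to ABySS, as in the Chloroplast31 example
        parts.append("ABYSS_OPTIONS={0}".format(options))
    parts.append("in='{0}'".format(" ".join(reads)))
    return " ".join(parts)


def velvet_commands(k, readfile, ins_length=200, exp_cov=20, outdir="velvet"):
    """Return the velveth and velvetg command lines for one k-mer size."""
    out = "{0}{1}".format(outdir, k)  # one output directory per k
    return [
        "velveth {0} {1} -fastq {2}".format(out, k, readfile),
        "velvetg {0} -ins_length {1} -exp_cov {2}".format(out, ins_length, exp_cov),
    ]


if __name__ == "__main__":
    reads = ["Chloroplast1.fastq", "Chloroplast2.fastq"]
    for k in (21, 25, 31):
        print(abyss_command(k, reads))
        for command in velvet_commands(k, "readfile"):
            print(command)
```

    Each printed line can be pasted into the shell. Pass options="" to abyss_command to drop ABYSS_OPTIONS, for example when the reads have already been converted to upper case.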