Samtools、BEDTools、 FASTX-Toolkit 北京大学 生命科学学院 叶永鑫 2011年5月23日 星期一 Outline Samtools 操作sam/bam文件(mapping)的工具 附加BCFtools 操作vcf/bcf文件(SNP/indel calling)的工具 BEDTools 操作bed文件(block feature annotation)的工具 FASTX-Toolkit 操作fasta/fastq文件的工具 Samtools SAM文件格式 Sequence Alignment/Map format 存储mapping的结果的Tab分隔的文本文件 header section (optional) @开头,例如: @SQ SN:ref LN:45 alignment section 11个必需的字段,例如: r001 163 ref 7 30 8M2I4M1D3M = 37 39 TTAGATAAAGGATACTG * header section @HD header line VN* format version SO sorting order of alignments @SQ reference seq. dictionary SN* reference seq. name LN* ref. seq. length AS genome assembly identifier M5 MD5 checksum SP Species UR URI header section @RG read group ID* read group identifier CN seq. center DS description DT date of the run FO flow order KS key sequence LB library PG programs PI predicted median insert size PL platform/technology PU platform unit SM sample header section @PG program ID* program record identifier PN program name CL command line PP previous @PG-ID VN program version @CO comment (one-line text) alignment section --- mandatory fields 1 2 3 4 5 6 QNAME Query template NAME FLAG bitwise flag RNAME Ref. seq. NAME POS 1-based leftmose POSition MAPQ MAPping Quality CIGAR CIGAR string M match/mismatch I insertion D deletion (相对reference) S soft clipping H hard clipping P padding N skip = match X mismatch alignment section --- mandatory fields 7 MRNM Mate Ref. sequence NaMe ('=' if same as RNAME) 8 MPOS 1-based Mate POSition 9 ISIZE Inferred insert SIZE 10 SEQ query SEQuence 11 QUAL query QUALity (ASCII of Phred-scaled base quality) alignment section --- mandatory fields - FLAG 0x001 p the read is paired in sequencing 0x002 P the read is mapped in a proper pair 0x004 u the query seq. itself is unmapped 0x008 U the mate is unmapped 0x010 r strand of the query (1 for reverse) 0x020 R strand of the mate 0x040 1 the read is the first read in a pair 0x080 2 the read is the second read in a pair 0x100 s the alignment is not primary 0x200 f the read fails platform quality checks 0x400 d the read is either a PCR or optical dup. alignment section --- optional fields TAG:TYPE:VALUE CS color read seq. CQ color read quality MD string for mismatching position aim to achieve SNP/indel calling without looking at the reference NM edit distance to the refernce RG read group match @RG - ID if @RG exists …… BAM文件格式 Binary SAM compressed in the BGZF format 查看bam文件内容: samtools view [-h] aln.bam | less –S bam可被排序: samtools sort aln.bam aln.sorted 排序的bam可建索引: samtools index aln.sorted.bam Samtools sam、bam转换 view、import△ 查看sam、bam文件 view、tview 排序bam文件 sort 建索引 faidx、index、tabix* Samtools 初步统计 flagstats、idxstats、depth* 改动bam文件 reheader、merge、cat* 校正bam文件 fixmate、calmd、rmdup call SNP/small indel pileup、mpileup Samtools Samtools设计为可以在流(管道)上 执行,在需要FILE处填 '-' 就可表示 STDIN或STDOUT。 命令行未跟足够参数时就会显示 Samtools各程序的简要帮助。eg. samtools samtools view samtools sort sam、bam转换 sam -> bam △samtools import ref.fa.fai in.sam out.bam if in.sam w/ @SQ: samtools view -bS in.sam > out.bam if in.sam w/o @SQ: samtools faidx ref.fa samtools view -bt ref.fa.fai in.sam > out.bam or samtools view -bT ref.fa in.sam > out.bam bam -> sam samtools view [-h] in.bam > out.sam samtools view samtools view [options] in.bam/sam [region1 [...]] -? longer help -b output in BAM format -c instead, only count -f INT only output alignments with all bits in FLAG [0] -F INT skip alignments with FLAG bits [0] -h include the header in output -H output the header only -l STR only output reads in library STR [null] samtools view -o FILE output file [stdout] -q INT skip alignments with map quality smaller than INT [0] -r STR only output reads in read group STR [null] -R FILE output reads in read groups listed in FILE [null] -S input is in SAM, if @SQ is absent, -t or -T is required -T FILE 指定reference sequence file. 若尚未 建立索引,会自动先建立索引,类似ref.fa.fai samtools view -t FILE tab-delimited file, each line contains ref. name and length; can be the resultant index file ref.fa.fai by 'samtools faidx ref.fa' -u output uncompressed bam, for pipe -x output FLAG in HEX (eg. 0x00) -X output FLAG in string region (1-based) chr2 chr2:1000 (start from 1000) chr2:1,000-2,000 (include 2000) samtools view eg. samtools view -bS in.sam > out.bam samtools view -bt ref.fa.fai in.sam > out.bam samtools view -bT ref.fa in.sam > out.bam samtools view [-h] in.bam > out.sam samtools view ex1.bam seq1:1-10 | less -S NOTE: random retrieval need indexed samtools view -x ex1.bam | less -S samtools view -X ex1.bam | less -S 对bam文件排序和建索引 samtools sort 可以对bam文件进行排序。 samtools index 可以对已排序的bam文件 建立索引,从而使得对该bam文件的 random retrieval成为可能。 eg. samtools sort aln.bam aln.sorted (产生已排序的aln.sorted.bam文件) samtools index aln.sorted.bam (产生aln.sorted.bam.bai索引文件) samtools sort samtools sort [options] in.bam out.prefix -o output the final alignment to the standard output 即把排序的bam文件输出到STDOUT;仍然需 要在命令行指定out.prefix,否则不能执行,但 不产生out.prefix.bam -n sort by read names rather than by chromosomal coordinates -m INT approximately the maximum required memory 建立索引 建立索引的目的是使 random retrieval 成为 可能 samtools index aln.bam 为已排序的aln.bam建索引,将创建 aln.bam.bai samtools faidx ref.fasta [region1 [...]] 为reference fasta建索引,将创建 ref.fasta.fai samtools tview samtools tview in.sorted.bam [ref.fasta] in.sorted.bam需要先用samtools index建索引 ref.fasta可不用先建索引;若无索引, samtools tview会自动去建立ref.fasta.fai索引 In samtools tview: ? help g goto a region chrM:1000; seq1:10 =1000 (if same reference sequence) q exit samtools tview h,j,k,l & ←,↑,↓,→ small scroll (1 base) H,J,K,L large scroll (20 bases) [space], [backspace] scroll (1 screen) m color for Mapping quality n color for Nucleotide b color for Base quality . dot view r read name 初步统计 samtools idxstats in.bam in.bam需要已建索引 输出tab分隔的文本,每个ref. seq一行: ref. seq. name seq. length # mapped reads # unmapped reads 初步统计 samtools flagstat in.bam in.bam不需要先建索引 输出多行文本,每行一种被统计项目, 统计reads数目 *samtools depth [options] in.bam in.bam需要已排序 输出tab分隔的文本,每个位点一行,统 计每个位点的覆盖的reads数 output eg. chrM 112 3 初步统计 *samtools depth [options] in.bam -r STR region -q INT base quality threshold (only count base quality >= INT) -Q INT mapping quality threshold (only count mapping quality >= INT) -b FILE bed file eg. samtools depth -r chrM:100-200 -q 30 aln.sorted.bam | awk '$3>2' | less -S 改动bam文件 samtools reheader in.header.sam in.bam 改变bam文件的header section 比BAM->SAM->BAM快 samtools merge [options] [-h header.sam] out.bam in1.bam in2.bam 合并已排序的bam文件 (必需含有数量相同的reference sequences) *samtools cat [-h header.sam] [-o out.bam] in1.bam in2.bam [...] 直接上下连接bam文件内容 改动bam文件 samtools merge [options] out.bam in1.bam in2.bam -h FILE use this sam file's headers, copied to out.bam -R STR merge files in the secified region -r attach an RG tag to each alignment. The tag value is inferred from file names -n the input alignments are sorted by read names rather than by chr. coordinates -u output uncompressed BAM 改动bam文件 samtools merge eg. perl -e 'print "@RG\tID:ga\tSM:hs\tLB:ga\tPL:Illumina\n @RG\tID:454\tSM:hs\tLB:454\tPL:454\n"' >rg.txt samtools merge -rh rg.txt merged.bam ga.bam 454.bam NOTE: in the merged.bam, reads from ga.bam will be attached RG:Z:ga, while reads from 454.bam will be attached RG:Z:454 校正bam文件 samtools fixmate in.bam out.bam in.bam需要已按read name排序, out.bam也将会按read name排序 fixmate用来校正mate的坐标、 ISIZE(Inferred insert SIZE)、相关FLAG samtools calmd [options] aln.bam ref.fa calmd用来生成或检查每个alignment的 MD(string for mismatching position)标签, 默认输出SAM到STDOUT 校正bam文件 samtools calmd [options] aln.bam ref.fa -A modify the quality string -b compressed BAM output -e change identical bases to '=' -S input is SAM with header -r compute the BQ tag (w/o -A) or cap baseQ by BAQ (w/ -A) -u uncompressed BAM output 校正bam文件 samtools rmdup [-sS] in.bam out.bam in.bam需要已排序 -s remove dup. for SE reads -S treat PE reads as SE (force -s) (default) remove dup. for PE reads rmdup用来去除可能的PCR duplicates 一致的外部坐标,仅mapping quality不同 处理PE时,依赖正确的ISIZE 对于PE,不能正确处理unpaired reads的情况 (Picard的MarkDuplicates能正确处理该情况) rmdup example Merge sorted alignments and remove potential PCR/optical duplicates: perl -e 'print "@RG\tID:ga\tSM:hs\tLB:ga\tPL:Illumina\n @RG\tID:454\tSM:hs\tLB:454\tPL:454\n"' > rg.txt samtools merge -rh rg.txt - ga.bam 454.bam | samtools rmdup - - | samtools rmdup -s - aln.bam call SNP / small indel pileup -> pileup format mpileup -> bcf format -> bcftools NOTE: since 0.1.10, pileup is deprecated by mpileup samtools pileup samtools pileup [options] in.bam/sam 输入的in.bam需要先排序 -B disable the BAQ (base alignment quality) computation -c call the consensus sequence -C INT coefficient for adjusting mapping quality of poor mappings [0] -d INT limit maximum depth for indels for speed up. 0 for unlimited [1024] samtools pileup -f FILE the reference sequence, in FASTA; index file FILE.fai will be created if absent -G FLOAT prior of an indel between two haplotypes (for -c) [0.00015] -i only show lines/consensus with indels -I INT phred prob. of an indel in sequencing/prep. [40] -l FILE list of sites at which pileup is output samtools pileup -m INT filter reads with flag containing bits in INT [1796(0x704, usfd)] -M INT cap mapping quality at INT [60] -N INT # haplotypes in the sample (for -c) (>=2) [2] -Q INT min base quality [13] -r FLOAT prior of a difference between two haplotypes (for -c) [0.001] -S the input is in SAM samtools pileup -t FILE list of reference names and sequence lengths (force -S) -T FLOAT the theta parameter (error dependency coefficient) in MAQ consensus calling model (for -c) [0.83] -v print variants only (for -c) NOTE: pileup is deprecated, please use mpileup samtools pileup default output: pileup format 每一行都是一个genomic position, 包含以下6列: 1 chromosome name 2 coordinate 3 reference base 4 read coverage/depth 5 read bases 6 alignment mapping qualities samtools pileup read bases . for a match to the ref. on the + strand , for a match to the ref. on the - strand > < for a reference skip ACGTN for a mismatch on + strand acgtn for a mismatch on - strand +NumSeq insertion -NumSeq deletion ^ start of a read $ end of a read * deleted base samtools pileup when -c is apllied, 在原先第3、第4列之间增添4列,总共10列: 4 consensus base 5 consensus quality 6 SNP quality 7 root mean square (RMS) mapping quality 8 read coverage/depth 9 read bases 10 alignment mapping qualities samtools pileup an indel occupies an additional line (-i) 1 chromosome same 2 coordinate 3 just '*' 4 genotype 5 consensus quality 6 SNP quality 7 RMS mapping quality samtools pileup 8 # covering reads 9 the first allele 10 the second allele 11 # reads supporting the first allele 12 # reads supporting the second allele 13 # reads containing indels different from the top two alleles …… samtools pileup eg. samtools pileup -f human_chrM.fasta bwa_result.sorted.bam | less -S samtools pileup -vcf human_chrM.fasta bwa_result.sorted.bam | less -S samtools pileup -ivcf human_chrM.fasta bwa_result.sorted.bam | less -S samtools.pl varFilter samtools.pl varFilter [options] in.cns_pileup 输入 samtools pileup -vcf 的结果,进行过滤 -D INT maximum read depth [100] -d INT minimum read depth [3] -G INT min indel score for nearby SNP filtering [25] -i INT minimum indel quality [] -l INT window size for filtering adjacent gaps [30] samtools.pl varFilter -N INT max number of SNPs in a window [2] -p print filtered variants to STDERR -Q INT minimum RMS mapping quality for SNPs [25] -q INT minimum RMS mapping quality for gaps [10] -S INT minimum SNP quality [] -W INT window size for filtering dense SNPs [10] -w INT SNP within INT bp around a gap to be filtered [10] samtools.pl pileup2fq samtools.pl pileup2fq [options] in.cns_pileup 输入 samtools pileup -cf 的结果,转换为 FASTQ格式 -d INT min depth [3] -D INT max depth [255] -G INT min indel score [25] -l INT indel filter window size [10] -Q INT min RMS mapping quality [25] pileup example Call Variants samtools pileup -vcf ref.fa aln.bam | tee raw.txt | samtools.pl varFilter -D 100 > flt.txt Please remember to set the -D to set a maximum read depth according to the average read depth (×2). In whole genome shotgun (WGS) resequencing, SNPs with excessively high read depth are usually caused by structural variations or alignment artifacts and should not be trusted. awk '($3=="*"&&$6>=50) ||($3!="*"&&$6>=20)' flt.txt > final.txt pileup example Get the consensus sequence samtools view -u aln.bam X | samtools pileup -cf hsRef.fa - | samtools.pl pileup2fq -D100 > var-X.fq if you want to call for one chromosome only, use 'samtools view' to specify the region. (just like above) http://sourceforge.net/apps/mediawiki/samtoo ls/index.php?title=SAM_protocol pileup example Before pileup, you can use 'samtools view' or other command to filter the mapping reads/alignments. 'samtools view' can filter on mapping quality (-q), flags (-f, -F), read group (-r, -R) or library (-l). other filters can be specified using perl or awk on SAM format pileup example use reads with <=2 differenced: samtools view -h aln.bam | perl -ne 'print if (/^@/||(/NM:i:(\d+)/&&$1<=2))' | samtools pileup -S - > out.txt (NM: edit distance to the refernce) exclude all gapped alignments: samtools view -h aln.bam | awk '$6!~/[ID]/' | samtools pileup -S - pileup example set a threashold on mapping quality samtools view -bq 1 aln.bam > aln-reliable.bam exclude read group ERR00001 samtools view -h in.bam | grep -v "\<RG:Z:ERR00001\>" | samtools view -bS - >out.bam samtools mpileup samtools mpileup [options] in.bam [in2.bam [...]] generate BCF or pileup for one or multiple BAM files. Alignment records are grouped by sample identifiers in @RG header lines. If sample identifiers are absent, each input file is regarded as one sample. samtools mpileup -B disable BAQ (base alignment quality) computation -C INT coefficient for adjusting mapping quality of poor mappings [0 (disable)] the recommended value for BWA is 50 -e INT phred-scaled gap extension sequencing error probability [20] reducing leads to longer indels -f FILE the reference file [null] -g compute genotype likelihoods, and output them in BCF(binary call format) samtools mpileup -h INT coefficient for modeling homopolymer errors [100] -I do not perform indel calling -l FILE file containing a list of sites where pileup or BCF is outputed [null] -o INT phred-scaled gap open sequencing error probability [40] reducing leads to more indel calls -P STR comma dilimited list of platforms for indels [all] samtools mpileup -q INT min mapping quality for an alignment to be used [0] -Q INT min base quality for a base to be considered [13] -r STR only generate pileup in the region [all sites] -u similar to -g, but output uncompressed BCF VCF and BCF format VCF(variant call format) BCF(binary VCF) 查看BCF内容 bcftools view in.bcf VCF format meta-information lines 以##开头 the header line 以#开头,tab分隔,data lines每列的解释 #CHROM POS ID REF ALT QUAL FILTER INFO FORMAT (Sample...) data lines tab分隔 VCF format 1 CHROM chr. name 2 POS 1-base position 3 ID variant ID 4 REF ref. seq. at POS 5 ALT comma delimited list of alternative seq. 6 QUAL phred-scaled prob. of all samples being homozygous ref. VCF format 7 FILTER semicolon delimited list of filters that the variat fails to pass 8 INFO semicolon delimited list of variant information DP combined depth across samples AF allele freq. for each ALT allele MQ RMS mapping quality VCF format 9 FORMAT colon delimited list of the format of individual genotypes in the following fields GT genotype; 0 for REF, 1 for ALT1, ... PL phred-scaled genotype likelihood order: 00, 01, 11, 02, 12, 22, ... GQ conditional genotype phred quality 10+ Sample(s) individual genotype information defined by FORMAT BCFtools bcftools view 查看,转换文件格式 index 为BCF建索引 cat 连接BCF ld compute all-pair r^2 ldpair compute r^2 between requested pairs bcftools view bcftools view [options] in.bcf [region] -b output BCF instead of VCF -c SNP calling (force -e) -e likelihood based analyses -G suppress all individual genotype info. -g call genotypes at variant sites (force -c) -I skip indels -l FILE list of sites (chr pos) or regions (BED) to output [all] -P STR type of prior: full, cond2, flat [full] -v output potential variant sites only (force -c) vcfutils.pl vcfutils.pl varFilter vcf2fq subsam listsam fillac qstats hapmap2vcf ucscsnp2vcf vcfutils.pl varFilter vcfutils.pl varFilter [options] in.vcf -a INT min number of alternate bases [2] -D INT max read depth [10000000] -d INT min read depth [2] -p print filtered variants to STDERR -Q INT min RMS mapping quality for SNPs [10] -W INT window size for filtering adjacent gaps [10] -w INT SNP with INT bp around a gap to be filtered [10] vcfutils.pl vcf2fq vcfutils.pl vcf2fq [options] all-site.vcf 输入 samtools mpileup -uf | bcftools view -cg 的结果,转换为FASTQ格式 -d INT min depth [3] -D INT max depth [100000] -l INT indel filter window size [5] -Q INT min RMS mapping quality [10] mpileup example Call SNPs and short indels for one diploid individual: samtools mpileup -uf ref.fa aln.bam | bcftools view -bvcg - > var.raw.bcf usually add -C50 to mpileup if mapping quality is overestimated, eg. using BWA-short bcftools view var.raw.bcf | vcfutils.pl varFilter -D 100 > var.flt.vcf -D controls the max read depth, which should be adjusted to about twice the average read depth mpileup example Call SNPs and short indels for multiple diploid individuals: samtools mpileup -P ILLUMINA -ugf ref.fa *.bam | bcftools view -bvcg - > var.raw.bcf Individuals are identified from the SM tags in the @RG header lines. -P specifies that indel candidates should be collected only from read groups with @RG-PL tag set to ILLUMINA bcftools view var.raw.bcf | vcfutils.pl varFilter -D 2000 > var.flt.vcf mpileup example Derive the AFS (allele frequency spectrum) on a list of sites from multiple individuals: samtools mpileup -Igf ref.fa *.bam >all.bcf bcftools view -bl sites.list all.bcf >sites.bcf bcftools view -cGP cond2 sites.bcf >/dev/null 2>sites.1.afs bcftools view -cGP sites.1.afs sites.bcf >/dev/null 2>sites.2.afs bcftools view -cGP sites.2.afs sites.bcf >/dev/null 2>sites.3.afs …… mpileup example Generate the consensus sequence: samtools mpileup -uf ref.fa aln.bam | bcftools view -cg - | vcfutils.pl vcf2fq > cns.fq Dump BAQ applied alignment for other SNP callers: samtools calmd -br aln.bam >aln.baq.bam Java版本——Picard 类似Samtools功能的Java版本 Samtools可以在流上(管道)操作, Picard好像一般对文件操作。 Picard的MarkDuplicates能处理Samtools 的rmdup不能处理的一些问题。(注: MarkDuplicates依赖mate的坐标) Reference SAM format http://samtools.sourceforge.net/SAM1.pdf Samtools homepage http://samtools.sourceforge.net/ Samtools manual http://samtools.sourceforge.net/samtools.shtml Samtools pileup http://samtools.sourceforge.net/cns0.shtml http://samtools.sourceforge.net/pileup.shtml Reference Samtools protocal http://sourceforge.net/apps/mediawiki/samtools/index .php?title=SAM_protocol Samtools mpileup http://samtools.sourceforge.net/mpileup.shtml VCF format http://vcftools.sourceforge.net/specs.html http://www.1000genomes.org/wiki/Analysis/Variant% 20Call%20Format/vcf-variant-call-format-version-40 http://www.1000genomes.org/wiki/Analysis/Variant% 20Call%20Format/vcf-variant-call-format-version-41 BEDTools BED文件格式 BED format是UCSC Genome Browser和 Ensembl Genome Browser都支持的,描 述annotation track信息的文件格式。 它是tab分隔的文本文件,可以有3-12列。 前3列是必需的: 1 chrom name of chr. or scaffold 2 chromStart start pos. of the feature (include) (0-based) 3 chromEnd end pos. of the feature (not include) BED文件格式 第4-9列是可选的: 4 name the name of the BED line 5 score 0~1000, higher, darker 6 strand 7 thickStart starting pos. drawn thickly 8 thickEnd ending pos. drawn thickly 9 itemRgb display color BED文件格式 第10-12列也是可选的,UCSC使用它们, 而Ensembl忽略它们: 10 blockCount # blocks (exons) in the line 11 blockSizes comma-separated, block sizes, should correspond to blockCount 12 blockStarts comma-separated, block starts, should correspond to blockCount eg. snps.bed chr1 100 101 A/G 100 chr1 200 201 C/G 1000 BedGraph文件格式 4列tab分隔的文本文件 1 chromName 2 chromStart 3 chromEnd 4 dataValue BEDTools 文件格式转换 (bam->bed) bamToBed 计算交(overlap)、差、补 intersectBed pairToBed subtractBed pairToPair complementBed windowBed BEDTools 改动BED文件 mergeBed slopBed shuffleBed sortBed 计算coverage coverageBed genomeCoverageBed BEDTools 其他 closestBed fastaFromBed maskFastaFromBed linksBed 分组简单统计 groupBy -h display help page bamToBed bamToBed -i in.bam > out.bed convert BAM to BED6 format -bed12 convert to BED12 format -bedpe convert to BEDPE format -ed use edit distance (NM tag) as BED score intersectBed intersectBed -a a.bed -b b.bed report overlaps between two BED files -c for each A, report # overlaps with B (count) -u report original A once if any overlaps found in B -v only report those A that have no overlaps with B (similar to 'grep -v') intersectBed intersectBed -a a.bed -b b.bed -wa report entire, original entry in A for each overlap -wb report entire, original entry in B for each overlap intersectBed intersectBed -a a.bed -b b.bed -wa -wb report entire, original entry in A and B for each overlap -wo report original A and B and # bp of overlap between the two features intersectBed intersectBed -a a.bed -b b.bed -f FLOAT minimum overlap required as a fraction of A -f FLOAT -r require that the fraction overlap be reciprocal for A and B -s force strandedness, only report when A and B on the same strand intersectBed intersectBed -abam a.bam -b b.bed >out.bam input file a is in BAM, and ouput will be BAM -ubam write uncompressed BAM output -bed when using BAM input, write output as BED -a 可跟 stdin,表示从STDIN读入 intersectBed eg. find genes that overlap Lines but not SINEs: intersectBed -a genes.bed -b LINEs.bed | intersectBed -a stdin -b SINEs.bed -v retain SE BAM that overlap exons: intersectBed -abam reads.bam -b exons.bed >reads.touchingExons.bam intersectBed screen novel SNP intersectBed -a snp.calls.bed -b dbSnp.bed -v | intersectBed -a stdin -b 1KG.bed -v >snp.calls.novel.bed pairToBed pairToBed -a a.bedpe -b b.bed report overlaps between a BEDPE and a BED -type either (default) report overlaps if either end of A overlaps B -type neither report A if neither end of A overlaps B -s force strandedness pairToBed pairToBed -a a.bedpe -b b.bed -type both report overlaps if both ends of A overlap B -type notboth report overlaps if neither or only one end of A overlap B -type xor report overlaps if only one end of A overlap B pairToBed pairToBed -a a.bedpe -b b.bed -type ispan report overlaps between [end1, start2] of A,B -type ospan report overlaps between [start1, end2] of A,B -type notispan report A if ispan of A does not overlap B -type notospan report A if ospan of A does not overlap B pairToBed pairToBed -abam a.bam -b b.bed >out.bam input file a is PE BAM, and output will be in BAM -ubam write uncompressed BAM output -bedpe when using BAM input, write output as BEDPE pairToBed eg. return all structural variants that overlap with genes on either end: pairToBed -a sv.bedpe -b genes.bed >sv.genes retain only PE alignments where neither or only one end overlaps segmental duplications: pairToBed -abam reads.bam -b segdups.bed type notboth >reads.notbothSSRs.bam pairToPair pairToPair -a a.bedpe -b b.bedpe report overlaps between two BEDPE -type both (default) report overlaps if both ends of A overlap B -type either report overlaps if either ends of A overlap B -type neither report overlaps if neither end of A overlaps B pairToPair eg. find all SVs in sample 1 that are also in sample 2: pairToPair -a 1.sv.bedpe -b 2.sv.bedpe | cut -f 1-10 > 1.sv.in2.bedpe find all SVs in sample 1 that are not in sample 2: pairToPair -a 1.sv.bedpe -b 2.sv.bedpe -type neither | cut -f 1-10 > 1.sv.notin2.bedpe windowBed windowBed -a a.bed -b b.bed examines a window around each A, and reports all features in B that overlap the window for each overlap, A and B are reported -w INT bp added upstream and downstream of each A (symterical windows around A) windowBed windowBed -a a.bed -b b.bed -l INT bp added upstream (left) of each A (assymterical windows around A) -r INT bp added downstream (right) of each A (assymterical windows around A) -sw define -l and -r based on strand windowBed windowBed -a a.bed -b b.bed -sm only report hits in B that overlap A on the same strand -abam FILE -ubam -bed windowBed eg. report all SNPs that are within 5000bp upstream or 1000bp downstream of genes: windowBed -a genes.bed -b snps.bed -l 5000 -r 1000 -sw subtractBed subtractBed -a a.bed -b b.bed removes A that overlap by any feature in B -f FLOAT minimum overlap required as a fraction of A -s force strandedness eg. remove introns from gene features: subtractBed -a genes.bed -b intron.bed complementBed complementBed -i in.bed -g genome return bp complement of the feature file genome file should tab-delimited and contain: chromName chromSize how to get chromSize? see -h eg. report all intervals in human genome that are not covered by repetitive elements: complementBed -i repeatMasker.bed -g hg18.genome mergeBed mergeBed -i in.bed merge overlapping entries into a single entry -n report # BED entries merged '1' is reported if no merging occurred -d INT maximum distance (bp) between features allowed to be merged [0] mergeBed mergeBed -i in.bed -s force strandedness -scores STR how to determine the scores of the merged sum, min, max, mean, median, mode, antimode, collapse slopBed slopBed -i in.bed -g genome 将feature向两侧扩展 -b INT/FLOAT in each direction -l INT/FLOAT in left (upstream) direction -r INT/FLOAT in right (downstream) direction slopBed slopBed -i in.bed -g genome -s define -l and -r based on strand -pct specify a fraction of the feature's length, instead of absolute bases only with -pct, -b/-l/-r can be given FLOAT eg. slopBed -i probes.bed -g hg19.genome -b 500 shuffleBed shuffleBed -i in.bed -g genome randomly permute the locations of features among a genome -excl BED_FILE prevent placing in these intervals -f FLOAT maximum overlap fraction of the feature with an -excl feature -chrom keep features on the same chromosome sortBed sortBed -i in.bed sort BED file, by chrom, then start position 输出排序好的BED到STDOUT see -h for more options coverageBed coverageBed -a a.bed -b b.bed compute the depth and breadth of coverage of features from A on intervals in B default output: (for each entry in B) # A overlapped the B interval # bases in B had >0 coverage the length of B the fraction of bases in B had >0 coverage coverageBed coverageBed -a a.bed -b b.bed -s force strandedness -d report the depth at each position (base) in each B eg. compute coverage and create a BEDGRAPH coverageBed -a reads.bed -b win10kb.bed | cut -f 1-4 > win10kb.cov.bedg coverageBed by default, coverageBed counts any feature in A that overlaps B by >= 1 bp. If you want to specify a minimum fraction, you can first use intersectBed: intersectBed -a a.bed -b b.bed -f 1.0 | coverageBed -a stdin -b b.bed >a_totally_in_b.coverage coverageBed work with samtools or bam file bamToBed -i reads.bam | coverageBed -a stdin -b exons.bed >exons.bed.coverage samtools view -bf 0x2 reads.bam | bamToBed -i stdin | coverageBed -a stdin -b exons.bed >exons.bed.pair_proper_mapped.coverage coverageBed compute separatedly for each strand bamToBed -i reads.bam | grep '\+$' | coverageBed -a stdin -b genes.bed >genes.bed.forward.coverage bamToBed -i reads.bam | grep '\-$' | coverageBed -a stdin -b genes.bed >genes.bed.reverse.coverage genomeCoverageBed genomeCoverageBed -i in.bed -g genome compute the coverage among a genome input BED file must be grouped by chr. default output: (for each chromosome in B) depth (0~max) # bases in the chr. had this depth the length of the chr. the fraction of bases in the chr. that had this depth -strand +/- genomeCoverageBed -ibam FILE input in BAM must be sorted by position (samtools sort) -d report the depth at each position (base) -bg report depth in BedGraph format -bga report depth in BedGraph format, also report zero coverage regions genomeCoverageBed eg. genomeCoverageBed -ibam bwa_result.sorted.bam -g human_chrM.genome genomeCoverageBed -ibam bwa_result.sorted.bam -g human_chrM.genome | grep '^chrM' | awk 'BEGIN{SUM = 0} {SUM += ($2*$5)} END {print SUM}' closestBed closestBed -a a.bed -b b.bed for each A, finds the closest feature in B -t all (default) report all ties -t first report the first tie that occurred in B -t last report the last tie that occurred in B closestBed closestBed -a a.bed -b b.bed -s force strandedness, find the closest feature on the same strand eg. find the closest ALU to each gene: closestBed -a genes.bed -b ALUs.bed fastaFromBed maskFastaFromBed -fi in.fasta -bed extract.bed -fo out.fasta 根据extract.bed文件,从in.fasta中抽取一些区 域,输出到out.fasta -s force strandedness -name use name field for fasta header maskFastaFromBed maskFastaFromBed -fi in.fasta -bed mask.bed -fo out.fasta 根据mask.bed文件,mask掉in.fasta中的一些 区域,输出到out.fasta -soft soft masking: mask with lower-case instead of N -mc CHAR replace masking char, instead of N linksBed linksBed -i in.bed > out.html 创建一个链到UCSC browser的HTML文件 -base browser basename [http://genome.ucsc.edu] -org organism [human] -db reference build [hg18] groupBy groupBy 可以指定几列(-g)作为组别, 然后分组简单统计(-o)指定列(-c) Deprecated. Now in the filo package. -i FILE input file, assume STDIN if omitted -g STR / -grp STR specify the columns (1-based) for grouping, comma separated [1,2,3] -c INT / -opCols INT specify the column (1-based) to be summarized, required groupBy -o STR / -ops STR specify the operation applied to -c column sum, count, min, max, mean, median, mode(most common), antimode(least common), stdev, sstdev(sample stdev), collapse(comma separated), concat(nondelimited), freqdesc, freqasc [sum] can be comma separated, eg. sum,mean,max groupBy -inheader input has a header line, would be ignored -outheader output header line, showing column names BEDTools Notes All BEDTools load B file into memory and process A file one-by-one. Therefore when possible, set smaller file to be B file. Most of BEDTools allow A file to be STDIN from pipes, by using '-a stdin' Reference BED format http://genome-test.cse.ucsc.edu/FAQ/FAQformat http://www.ensembl.org/info/website/upload/bed.html BedGraph format http://genome.ucsc.edu/goldenPath/help/bedgraph.ht ml BEDTools http://code.google.com/p/bedtools/ filo package (for groupBy) https://github.com/arq5x/filo FASTX-Toolkit FASTX-Toolkit 预处理FASTA/FASTQ文件的工具 (mapping之前) 格式转换,切短,切去adapter,分 barcode,quality筛选…… -h show usage information you can also use them with Galaxy FASTX-Toolkit fastq_to_fasta fastx_quality_stats fastq_quality_boxplot_graph.sh fastx_nucleotide_distribution_graph.sh fastx_clipper fastx_renamer fastx_trimmer FASTX-Toolkit fastx_collapser fastx_artifacts_filter fastx_quality_filter fastx_reverse_complement fasta_formatter fasta_nucleotide_changer fasta_clipping_histogram.pl fasta_barcode_splitter.pl Reference FASTX-Toolkit homepage http://hannonlab.cshl.edu/fastx_toolkit/ FASTX-Toolkit manual http://hannonlab.cshl.edu/fastx_toolkit/comma ndline.html Thanks for your attention!