INT - NGS Workshop @ CBI

Samtools, BEDTools, FASTX-Toolkit
Yongxin Ye (叶永鑫), School of Life Sciences, Peking University
Monday, May 23, 2011

Outline
- Samtools
  - tools for manipulating SAM/BAM files (mapping)
  - bundled BCFtools
    - tools for manipulating VCF/BCF files (SNP/indel calling)
- BEDTools
  - tools for manipulating BED files (block feature annotation)
- FASTX-Toolkit
  - tools for manipulating FASTA/FASTQ files

Samtools

SAM file format
- Sequence Alignment/Map format
- a tab-delimited text file that stores mapping results
- header section (optional)
  - lines start with @, for example:
      @SQ SN:ref LN:45
- alignment section
  - 11 mandatory fields, for example:
      r001 163 ref 7 30 8M2I4M1D3M = 37 39 TTAGATAAAGGATACTG *

header section (* marks a required tag)
- @HD header line
  - VN* format version
  - SO sorting order of alignments
- @SQ reference seq. dictionary
  - SN* reference seq. name
  - LN* ref. seq. length
  - AS genome assembly identifier
  - M5 MD5 checksum
  - SP species
  - UR URI

header section
- @RG read group
  - ID* read group identifier
  - CN seq. center
  - DS description
  - DT date of the run
  - FO flow order
  - KS key sequence
  - LB library
  - PG programs
  - PI predicted median insert size
  - PL platform/technology
  - PU platform unit
  - SM sample

header section
- @PG program
  - ID* program record identifier
  - PN program name
  - CL command line
  - PP previous @PG-ID
  - VN program version
- @CO comment (one-line text)

alignment section --- mandatory fields
  1   QNAME  Query template NAME
  2   FLAG   bitwise FLAG
  3   RNAME  Ref. seq. NAME
  4   POS    1-based leftmost POSition
  5   MAPQ   MAPping Quality
  6   CIGAR  CIGAR string
        M match/mismatch    I insertion        D deletion (relative to the reference)
        S soft clipping     H hard clipping    P padding
        N skip              = match            X mismatch

alignment section --- mandatory fields
  7   MRNM   Mate Ref. sequence NaMe ('=' if same as RNAME)
  8   MPOS   1-based Mate POSition
  9   ISIZE  Inferred insert SIZE
  10  SEQ    query SEQuence
  11  QUAL   query QUALity (ASCII of Phred-scaled base quality)

alignment section --- mandatory fields - FLAG
  0x001  p  the read is paired in sequencing
  0x002  P  the read is mapped in a proper pair
  0x004  u  the query seq. itself is unmapped
  0x008  U  the mate is unmapped
  0x010  r  strand of the query (1 for reverse)
  0x020  R  strand of the mate
  0x040  1  the read is the first read in a pair
  0x080  2  the read is the second read in a pair
  0x100  s  the alignment is not primary
  0x200  f  the read fails platform quality checks
  0x400  d  the read is either a PCR or optical dup.

alignment section --- optional fields
- TAG:TYPE:VALUE
- CS color read seq.
- CQ color read quality
- MD string for mismatching positions
  - aims to allow SNP/indel calling without looking at the reference
- NM edit distance to the reference
- RG read group
  - matches @RG-ID if @RG exists
- ...

BAM file format
- Binary SAM
- compressed in the BGZF format
- view the contents of a BAM file:
    samtools view [-h] aln.bam | less -S
- a BAM file can be sorted:
    samtools sort aln.bam aln.sorted
- a sorted BAM file can be indexed:
    samtools index aln.sorted.bam

Samtools
- SAM/BAM conversion
  - view, import△
- viewing SAM/BAM files
  - view, tview
- sorting BAM files
  - sort
- indexing
  - faidx, index, tabix*

Samtools
- basic statistics
  - flagstat, idxstats, depth*
- modifying BAM files
  - reheader, merge, cat*
- correcting BAM files
  - fixmate, calmd, rmdup
- call SNP/small indel
  - pileup, mpileup

Samtools
- Samtools is designed to work on streams (pipes): put '-' wherever a FILE is expected to mean STDIN or STDOUT.
- When a command is run without enough arguments, Samtools prints brief help for it, e.g.
    samtools
    samtools view
    samtools sort

SAM/BAM conversion
- sam -> bam
  - △samtools import ref.fa.fai in.sam out.bam
  - if in.sam has @SQ lines:
      samtools view -bS in.sam > out.bam
  - if in.sam has no @SQ lines:
      samtools faidx ref.fa
      samtools view -bt ref.fa.fai in.sam > out.bam
    or
      samtools view -bT ref.fa in.sam > out.bam
- bam -> sam
    samtools view [-h] in.bam > out.sam

samtools view
- samtools view [options] in.bam/sam [region1 [...]]
- -?      longer help
- -b      output in BAM format
- -c      only count the matching records instead of printing them
- -f INT  only output alignments with all bits of INT set in FLAG [0]
- -F INT  skip alignments with any bits of INT set in FLAG [0]
- -h      include the header in output
- -H      output the header only
- -l STR  only output reads in library STR [null]

samtools view
- -o FILE  output file [stdout]
- -q INT   skip alignments with mapping quality smaller than INT [0]
- -r STR   only output reads in read group STR [null]
- -R FILE  output reads in read groups listed in FILE [null]
- -S       input is in SAM; if @SQ is absent, -t or -T is required
- -T FILE  the reference sequence file; if it has not been indexed yet, the index (e.g. ref.fa.fai) is built automatically first

samtools view
- -t FILE  tab-delimited file, each line containing a ref. name and length; can be the index file ref.fa.fai produced by 'samtools faidx ref.fa'
- -u       output uncompressed BAM, for pipes
- -x       output FLAG in HEX (e.g. 0x00)
- -X       output FLAG as a string
- region (1-based)
  - chr2
  - chr2:1000 (from position 1000 onward)
  - chr2:1,000-2,000 (includes 2000)

samtools view
- e.g.
    samtools view -bS in.sam > out.bam
    samtools view -bt ref.fa.fai in.sam > out.bam
    samtools view -bT ref.fa in.sam > out.bam
    samtools view [-h] in.bam > out.sam
    samtools view ex1.bam seq1:1-10 | less -S
      NOTE: random retrieval requires an indexed BAM file
    samtools view -x ex1.bam | less -S
    samtools view -X ex1.bam | less -S

Sorting and indexing BAM files
- samtools sort sorts a BAM file.
- samtools index builds an index for a sorted BAM file, which makes random retrieval from that BAM file possible.
- e.g.
    samtools sort aln.bam aln.sorted        (produces the sorted file aln.sorted.bam)
    samtools index aln.sorted.bam           (produces the index file aln.sorted.bam.bai)

samtools sort
- samtools sort [options] in.bam out.prefix
- -o      output the final alignment to the standard output
          (the sorted BAM goes to STDOUT; out.prefix must still be given on the command line or the command will not run, but no out.prefix.bam is produced)
- -n      sort by read names rather than by chromosomal coordinates
- -m INT  approximately the maximum required memory

Indexing
- the purpose of an index is to make random retrieval possible
- samtools index aln.bam
  - indexes the sorted aln.bam, creating aln.bam.bai
- samtools faidx ref.fasta [region1 [...]]
  - indexes the reference FASTA, creating ref.fasta.fai

samtools tview
- samtools tview in.sorted.bam [ref.fasta]
  - in.sorted.bam must first be indexed with samtools index
  - ref.fasta need not be indexed beforehand; if the index is missing, samtools tview builds ref.fasta.fai automatically
- In samtools tview:
  - ?  help
  - g  goto a region
    - chrM:1000; seq1:10
    - =1000 (if same reference sequence)
  - q  exit

samtools tview
- h,j,k,l & ←,↑,↓,→      small scroll (1 base)
- H,J,K,L                large scroll (20 bases)
- [space], [backspace]   scroll (1 screen)
- m  color for Mapping quality
- n  color for Nucleotide
- b  color for Base quality
- .  dot view
- r  read name

Basic statistics
- samtools idxstats in.bam
  - in.bam must be indexed
  - outputs tab-delimited text, one line per ref. seq.:
    - ref. seq. name
    - seq. length
    - # mapped reads
    - # unmapped reads

Basic statistics
- samtools flagstat in.bam
  - in.bam does not need to be indexed
  - outputs several lines of text, one per statistic, each giving a read count
- *samtools depth [options] in.bam
  - in.bam must be sorted
  - outputs tab-delimited text, one line per position, giving the number of reads covering that position
  - output e.g. chrM 112 3

Basic statistics
- *samtools depth [options] in.bam
  - -r STR   region
  - -q INT   base quality threshold (only count base quality >= INT)
  - -Q INT   mapping quality threshold (only count mapping quality >= INT)
  - -b FILE  bed file
  - e.g.
      samtools depth -r chrM:100-200 -q 30 aln.sorted.bam | awk '$3>2' | less -S

Modifying BAM files
- samtools reheader in.header.sam in.bam
  - replaces the header section of a BAM file; faster than BAM->SAM->BAM
- samtools merge [options] [-h header.sam] out.bam in1.bam in2.bam
  - merges sorted BAM files (which must contain the same number of reference sequences)
- *samtools cat [-h header.sam] [-o out.bam] in1.bam in2.bam [...]
  - concatenates the contents of the BAM files directly, one after another

Modifying BAM files
- samtools merge [options] out.bam in1.bam in2.bam
  - -h FILE  use this SAM file's header lines, copied to out.bam
  - -R STR   merge files in the specified region
  - -r       attach an RG tag to each alignment; the tag value is inferred from file names
  - -n       the input alignments are sorted by read names rather than by chr. coordinates
  - -u       output uncompressed BAM

Modifying BAM files
- samtools merge e.g.
    perl -e 'print "\@RG\tID:ga\tSM:hs\tLB:ga\tPL:Illumina\n\@RG\tID:454\tSM:hs\tLB:454\tPL:454\n"' > rg.txt
    samtools merge -rh rg.txt merged.bam ga.bam 454.bam
  - NOTE: in merged.bam, reads from ga.bam are tagged RG:Z:ga, while reads from 454.bam are tagged RG:Z:454

Correcting BAM files
- samtools fixmate in.bam out.bam
  - in.bam must be sorted by read name; out.bam will also be sorted by read name
  - fixmate corrects mate coordinates, ISIZE (Inferred insert SIZE), and the related FLAG bits
- samtools calmd [options] aln.bam ref.fa
  - calmd generates or checks the MD tag (string for mismatching positions) of each alignment; by default it writes SAM to STDOUT

Correcting BAM files
- samtools calmd [options] aln.bam ref.fa
  - -A  modify the quality string
  - -b  compressed BAM output
  - -e  change identical bases to '='
  - -S  input is SAM with header
  - -r  compute the BQ tag (w/o -A) or cap baseQ by BAQ (w/ -A)
  - -u  uncompressed BAM output

Correcting BAM files
- samtools rmdup [-sS] in.bam out.bam
  - in.bam must be sorted
  - -s  remove dup. for SE reads
  - -S  treat PE reads as SE (force -s)
  - (default) remove dup. for PE reads
- rmdup removes potential PCR duplicates
  - reads with identical external coordinates that differ only in mapping quality (the one with the highest mapping quality is kept)
  - for PE reads, it relies on a correct ISIZE
  - for PE reads, it cannot handle unpaired reads correctly (Picard's MarkDuplicates handles this case correctly)

rmdup example
- Merge sorted alignments and remove potential PCR/optical duplicates:
    perl -e 'print "\@RG\tID:ga\tSM:hs\tLB:ga\tPL:Illumina\n\@RG\tID:454\tSM:hs\tLB:454\tPL:454\n"' > rg.txt
    samtools merge -rh rg.txt - ga.bam 454.bam | samtools rmdup - - | samtools rmdup -s - aln.bam

call SNP / small indel
- pileup -> pileup format
- mpileup -> bcf format -> bcftools
- NOTE: since 0.1.10, pileup is deprecated in favor of mpileup

samtools pileup
- samtools pileup [options] in.bam/sam
  - the input in.bam must be sorted first
- -B      disable the BAQ (base alignment quality) computation
- -c      call the consensus sequence
- -C INT  coefficient for adjusting mapping quality of poor mappings [0]
- -d INT  limit maximum depth for indels, for speed; 0 for unlimited [1024]

samtools pileup
- -f FILE   the reference sequence, in FASTA; index file FILE.fai will be created if absent
- -G FLOAT  prior of an indel between two haplotypes (for -c) [0.00015]
- -i        only show lines/consensus with indels
- -I INT    phred prob. of an indel in sequencing/prep. [40]
- -l FILE   list of sites at which pileup is output

samtools pileup
- -m INT    filter reads with flag containing bits in INT [1796 (0x704, usfd)]
- -M INT    cap mapping quality at INT [60]
- -N INT    # haplotypes in the sample (for -c) (>=2) [2]
- -Q INT    min base quality [13]
- -r FLOAT  prior of a difference between two haplotypes (for -c) [0.001]
- -S        the input is in SAM

samtools pileup
- -t FILE   list of reference names and sequence lengths (force -S)
- -T FLOAT  the theta parameter (error dependency coefficient) in the MAQ consensus calling model (for -c) [0.83]
- -v        print variants only (for -c)
- NOTE: pileup is deprecated, please use mpileup

samtools pileup
- default output: pileup format
  - each line is one genomic position and contains these 6 columns:
      1  chromosome name
      2  coordinate
      3  reference base
      4  read coverage/depth
      5  read bases
      6  base qualities

samtools pileup
- read bases
    .        a match to the ref. on the + strand
    ,        a match to the ref. on the - strand
    > <      a reference skip
    ACGTN    a mismatch on the + strand
    acgtn    a mismatch on the - strand
    +NumSeq  insertion
    -NumSeq  deletion
    ^        start of a read
    $        end of a read
    *        deleted base

samtools pileup
- when -c is applied,
  - 4 columns are inserted between the original columns 3 and 4, giving 10 columns in total:
      4   consensus base
      5   consensus quality
      6   SNP quality
      7   root mean square (RMS) mapping quality
      8   read coverage/depth
      9   read bases
      10  base qualities

samtools pileup
- an indel occupies an additional line (-i)
    1   chromosome name
    2   coordinate
    3   just '*'
    4   genotype
    5   consensus quality
    6   SNP quality
    7   RMS mapping quality

samtools pileup
    8   # covering reads
    9   the first allele
    10  the second allele
    11  # reads supporting the first allele
    12  # reads supporting the second allele
    13  # reads containing indels different from the top two alleles
    ...

samtools pileup
- e.g.
    samtools pileup -f human_chrM.fasta bwa_result.sorted.bam | less -S
    samtools pileup -vcf human_chrM.fasta bwa_result.sorted.bam | less -S
    samtools pileup -ivcf human_chrM.fasta bwa_result.sorted.bam | less -S

samtools.pl varFilter
- samtools.pl varFilter [options] in.cns_pileup
  - takes the output of samtools pileup -vcf and filters it
- -D INT  maximum read depth [100]
- -d INT  minimum read depth [3]
- -G INT  min indel score for nearby SNP filtering [25]
- -i INT  minimum indel quality []
- -l INT  window size for filtering adjacent gaps [30]

samtools.pl varFilter
- -N INT  max number of SNPs in a window [2]
- -p      print filtered variants to STDERR
- -Q INT  minimum RMS mapping quality for SNPs [25]
- -q INT  minimum RMS mapping quality for gaps [10]
- -S INT  minimum SNP quality []
- -W INT  window size for filtering dense SNPs [10]
- -w INT  SNP within INT bp around a gap to be filtered [10]

samtools.pl pileup2fq
- samtools.pl pileup2fq [options] in.cns_pileup
  - takes the output of samtools pileup -cf and converts it to FASTQ format
- -d INT  min depth [3]
- -D INT  max depth [255]
- -G INT  min indel score [25]
- -l INT  indel filter window size [10]
- -Q INT  min RMS mapping quality [25]

pileup example
- Call variants
    samtools pileup -vcf ref.fa aln.bam | tee raw.txt | samtools.pl varFilter -D 100 > flt.txt
  - Remember to set -D (the maximum read depth) to about twice the average read depth. In whole genome shotgun (WGS) resequencing, SNPs with excessively high read depth are usually caused by structural variations or alignment artifacts and should not be trusted.
    awk '($3=="*"&&$6>=50) ||($3!="*"&&$6>=20)' flt.txt > final.txt

pileup example
- Get the consensus sequence
    samtools view -u aln.bam X | samtools pileup -cf hsRef.fa - | samtools.pl pileup2fq -D100 > var-X.fq
  - to call for one chromosome only, use 'samtools view' to specify the region (as in the command above)
  - http://sourceforge.net/apps/mediawiki/samtools/index.php?title=SAM_protocol

pileup example
- Before pileup, you can use 'samtools view' or other commands to filter the mapped reads/alignments.
  - 'samtools view' can filter on mapping quality (-q), flags (-f, -F), read group (-r, -R) or library (-l).
  - other filters can be applied to the SAM text with perl or awk

pileup example
- use only reads with <=2 differences:
    samtools view -h aln.bam | perl -ne 'print if (/^@/||(/NM:i:(\d+)/&&$1<=2))' | samtools pileup -S - > out.txt
  - (NM: edit distance to the reference)
- exclude all gapped alignments:
    samtools view -h aln.bam | awk '$6!~/[ID]/' | samtools pileup -S -

pileup example
- set a threshold on mapping quality
    samtools view -bq 1 aln.bam > aln-reliable.bam
- exclude read group ERR00001
    samtools view -h in.bam | grep -v "\<RG:Z:ERR00001\>" | samtools view -bS - >out.bam

samtools mpileup
- samtools mpileup [options] in.bam [in2.bam [...]]
  - generates BCF or pileup for one or multiple BAM files.
  - Alignment records are grouped by the sample identifiers in @RG header lines.
  - If sample identifiers are absent, each input file is regarded as one sample.
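
The record-level filters from the pileup examples above (FLAG bits via -f/-F, mapping quality via -q, and the NM edit-distance test done with perl) can be sketched in Python on plain SAM text. This is a minimal illustration and not how samtools implements them: the function names are hypothetical, and the one-letter flag codes are simply the ones listed in the FLAG table of these slides.

```python
# Minimal sketch of SAM record filtering; names are illustrative only
# and are not part of samtools.

# (bit, letter) pairs taken from the FLAG table in these slides.
FLAG_BITS = [
    (0x001, "p"), (0x002, "P"), (0x004, "u"), (0x008, "U"),
    (0x010, "r"), (0x020, "R"), (0x040, "1"), (0x080, "2"),
    (0x100, "s"), (0x200, "f"), (0x400, "d"),
]


def flag_to_string(flag):
    """Render a numeric FLAG as letters, lowest bit first."""
    return "".join(letter for bit, letter in FLAG_BITS if flag & bit)


def edit_distance(fields):
    """Return the NM:i value from the optional fields, or None if absent."""
    for tag in fields[11:]:
        if tag.startswith("NM:i:"):
            return int(tag[5:])
    return None


def keep_record(line, require=0, exclude=0, min_mapq=0, max_nm=None):
    """Decide whether a SAM line passes the filters.

    require  : all of these FLAG bits must be set   (like -f INT)
    exclude  : none of these FLAG bits may be set   (like -F INT)
    min_mapq : minimum mapping quality              (like -q INT)
    max_nm   : maximum NM edit distance; records without an NM tag are
               dropped, mirroring the perl one-liner above
    Header lines (starting with '@') are always kept.
    """
    if line.startswith("@"):
        return True
    fields = line.rstrip("\n").split("\t")
    flag, mapq = int(fields[1]), int(fields[4])
    if (flag & require) != require:
        return False
    if flag & exclude:
        return False
    if mapq < min_mapq:
        return False
    if max_nm is not None:
        nm = edit_distance(fields)
        if nm is None or nm > max_nm:
            return False
    return True


# The example alignment from the SAM format slide, plus an NM tag
# added here purely for illustration.
record = "\t".join([
    "r001", "163", "ref", "7", "30", "8M2I4M1D3M", "=", "37", "39",
    "TTAGATAAAGGATACTG", "*", "NM:i:1",
])

print(flag_to_string(163))                      # pPR2
print(flag_to_string(1796))                     # usfd
print(keep_record(record, exclude=1796, min_mapq=1, max_nm=2))  # True
```

FLAG 163 decomposes into 0x001 + 0x002 + 0x020 + 0x080, and 1796 (0x704) into the u, s, f and d bits, which matches the default mask shown for pileup -m. In practice the built-in samtools view options are the better choice; a script like this is only useful for conditions that samtools view cannot express.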

samtools mpileup
- -B       disable BAQ (base alignment quality) computation
- -C INT   coefficient for adjusting mapping quality of poor mappings [0 (disable)]
  - the recommended value for BWA is 50
- -e INT   phred-scaled gap extension sequencing error probability [20]
  - reducing it leads to longer indels
- -f FILE  the reference file [null]
- -g       compute genotype likelihoods and output them in BCF (binary call format)

samtools mpileup
- -h INT   coefficient for modeling homopolymer errors [100]
- -I       do not perform indel calling
- -l FILE  file containing a list of sites where pileup or BCF is output [null]
- -o INT   phred-scaled gap open sequencing error probability [40]
  - reducing it leads to more indel calls
- -P STR   comma-delimited list of platforms for indels [all]

samtools mpileup
- -q INT   min mapping quality for an alignment to be used [0]
- -Q INT   min base quality for a base to be considered [13]
- -r STR   only generate pileup in the region [all sites]
- -u       similar to -g, but output uncompressed BCF

VCF and BCF format
- VCF (variant call format)
- BCF (binary VCF)
- view the contents of a BCF file
    bcftools view in.bcf

VCF format
- meta-information lines
  - start with ##
- the header line
  - starts with #, tab-delimited, names each column of the data lines
  - #CHROM POS ID REF ALT QUAL FILTER INFO FORMAT (Sample...)
- data lines
  - tab-delimited

VCF format
  1  CHROM   chr. name
  2  POS     1-based position
  3  ID      variant ID
  4  REF     ref. seq. at POS
  5  ALT     comma-delimited list of alternative seq.
  6  QUAL    phred-scaled prob. of all samples being homozygous ref.

VCF format
  7  FILTER  semicolon-delimited list of filters that the variant fails to pass
  8  INFO    semicolon-delimited list of variant information
     - DP combined depth across samples
     - AF allele freq. for each ALT allele
     - MQ RMS mapping quality

VCF format
  9  FORMAT  colon-delimited list of the format of individual genotypes in the following fields
     - GT genotype; 0 for REF, 1 for ALT1, ...
     - PL phred-scaled genotype likelihood
       - order: 00, 01, 11, 02, 12, 22, ...
     - GQ conditional genotype phred quality
  10+ Sample(s)  individual genotype information defined by FORMAT

BCFtools
- bcftools
  - view    view and convert file formats
  - index   index a BCF file
  - cat     concatenate BCF files
  - ld      compute all-pair r^2
  - ldpair  compute r^2 between requested pairs

bcftools view
- bcftools view [options] in.bcf [region]
- -b       output BCF instead of VCF
- -c       SNP calling (force -e)
- -e       likelihood-based analyses
- -G       suppress all individual genotype info.
- -g       call genotypes at variant sites (force -c)
- -I       skip indels
- -l FILE  list of sites (chr pos) or regions (BED) to output [all]
- -P STR   type of prior: full, cond2, flat [full]
- -v       output potential variant sites only (force -c)

vcfutils.pl
- vcfutils.pl
  - varFilter
  - vcf2fq
  - subsam
  - listsam
  - fillac
  - qstats
  - hapmap2vcf
  - ucscsnp2vcf

vcfutils.pl varFilter
- vcfutils.pl varFilter [options] in.vcf
- -a INT  min number of alternate bases [2]
- -D INT  max read depth [10000000]
- -d INT  min read depth [2]
- -p      print filtered variants to STDERR
- -Q INT  min RMS mapping quality for SNPs [10]
- -W INT  window size for filtering adjacent gaps [10]
- -w INT  SNP within INT bp around a gap to be filtered [10]

vcfutils.pl vcf2fq
- vcfutils.pl vcf2fq [options] all-site.vcf
  - takes the output of samtools mpileup -uf | bcftools view -cg and converts it to FASTQ format
- -d INT  min depth [3]
- -D INT  max depth [100000]
- -l INT  indel filter window size [5]
- -Q INT  min RMS mapping quality [10]

mpileup example
- Call SNPs and short indels for one diploid individual:
    samtools mpileup -uf ref.fa aln.bam | bcftools view -bvcg - > var.raw.bcf
  - usually add -C50 to mpileup if mapping quality is overestimated, e.g. when using BWA-short
    bcftools view var.raw.bcf | vcfutils.pl varFilter -D 100 > var.flt.vcf
  - -D controls the max read depth, which should be adjusted to about twice the average read depth

mpileup example
- Call SNPs and short indels for multiple diploid individuals:
    samtools mpileup -P ILLUMINA -ugf ref.fa *.bam | bcftools view -bvcg - > var.raw.bcf
  - Individuals are identified from the SM tags in the @RG header lines.
  - -P specifies that indel candidates should be collected only from read groups with the @RG-PL tag set to ILLUMINA
    bcftools view var.raw.bcf | vcfutils.pl varFilter -D 2000 > var.flt.vcf

mpileup example
- Derive the AFS (allele frequency spectrum) on a list of sites from multiple individuals:
    samtools mpileup -Igf ref.fa *.bam >all.bcf
    bcftools view -bl sites.list all.bcf >sites.bcf
    bcftools view -cGP cond2 sites.bcf >/dev/null 2>sites.1.afs
    bcftools view -cGP sites.1.afs sites.bcf >/dev/null 2>sites.2.afs
    bcftools view -cGP sites.2.afs sites.bcf >/dev/null 2>sites.3.afs
    ...

mpileup example
- Generate the consensus sequence:
    samtools mpileup -uf ref.fa aln.bam | bcftools view -cg - | vcfutils.pl vcf2fq > cns.fq
- Dump BAQ-applied alignments for other SNP callers:
    samtools calmd -br aln.bam >aln.baq.bam

Java version: Picard
- a Java toolkit with functionality similar to Samtools
- Samtools can operate on streams (pipes); Picard seems to generally operate on files.
- Picard's MarkDuplicates can handle some cases that Samtools rmdup cannot. (Note: MarkDuplicates relies on the mate coordinates.)

Reference
- SAM format
  - http://samtools.sourceforge.net/SAM1.pdf
- Samtools homepage
  - http://samtools.sourceforge.net/
- Samtools manual
  - http://samtools.sourceforge.net/samtools.shtml
- Samtools pileup
  - http://samtools.sourceforge.net/cns0.shtml
  - http://samtools.sourceforge.net/pileup.shtml

Reference
- Samtools protocol
  - http://sourceforge.net/apps/mediawiki/samtools/index.php?title=SAM_protocol
- Samtools mpileup
  - http://samtools.sourceforge.net/mpileup.shtml
- VCF format
  - http://vcftools.sourceforge.net/specs.html
  - http://www.1000genomes.org/wiki/Analysis/Variant%20Call%20Format/vcf-variant-call-format-version-40
  - http://www.1000genomes.org/wiki/Analysis/Variant%20Call%20Format/vcf-variant-call-format-version-41

BEDTools

BED file format
- BED is a file format for describing annotation track information, supported by both the UCSC Genome Browser and the Ensembl Genome Browser.
- It is a tab-delimited text file with 3 to 12 columns.
- The first 3 columns are required:
    1  chrom       name of chr. or scaffold
    2  chromStart  start pos. of the feature (inclusive, 0-based)
    3  chromEnd    end pos. of the feature (exclusive)

BED file format
- Columns 4-9 are optional:
    4  name        the name of the BED line
    5  score       0-1000; higher scores are drawn darker
    6  strand
    7  thickStart  starting pos. drawn thickly
    8  thickEnd    ending pos. drawn thickly
    9  itemRgb     display color

BED file format
- Columns 10-12 are also optional; UCSC uses them, while Ensembl ignores them:
    10  blockCount   # blocks (exons) in the line
    11  blockSizes   comma-separated block sizes; should correspond to blockCount
    12  blockStarts  comma-separated block starts; should correspond to blockCount
- e.g. snps.bed
    chr1 100 101 A/G 100
    chr1 200 201 C/G 1000

BedGraph file format
- a tab-delimited text file with 4 columns
    1  chromName
    2  chromStart
    3  chromEnd
    4  dataValue

BEDTools
- file format conversion (bam->bed)
  - bamToBed
- computing intersection (overlap), difference and complement
  - intersectBed
  - pairToBed
  - subtractBed
  - pairToPair
  - complementBed
  - windowBed

BEDTools
- modifying BED files
  - mergeBed
  - slopBed
  - shuffleBed
  - sortBed
- computing coverage
  - coverageBed
  - genomeCoverageBed

BEDTools
- others
  - closestBed
  - fastaFromBed
  - maskFastaFromBed
  - linksBed
- simple grouped statistics
  - groupBy
- -h  display help page

bamToBed
- bamToBed -i in.bam > out.bed
  - convert BAM to BED6 format
- -bed12
  - convert to BED12 format
- -bedpe
  - convert to BEDPE format
- -ed
  - use edit distance (NM tag) as BED score

intersectBed
- intersectBed -a a.bed -b b.bed
  - report overlaps between two BED files
- -c
  - for each A, report # overlaps with B (count)
- -u
  - report the original A once if any overlaps are found in B
- -v
  - only report those A that have no overlaps with B (similar to 'grep -v')

intersectBed
- intersectBed -a a.bed -b b.bed
- -wa
  - report the entire, original entry in A for each overlap
- -wb
  - report the entire, original entry in B for each overlap

intersectBed
- intersectBed -a a.bed -b b.bed
- -wa -wb
  - report the entire, original entries in A and B for each overlap
- -wo
  - report the original A and B and the # bp of overlap between the two features

intersectBed
- intersectBed -a a.bed -b b.bed
- -f FLOAT
  - minimum overlap required as a fraction of A
- -f FLOAT -r
  - require that the fraction overlap be reciprocal for A and B
- -s
  - force strandedness; only report when A and B are on the same strand

intersectBed
- intersectBed -abam a.bam -b b.bed >out.bam
  - input file a is in BAM, and output will be BAM
- -ubam
  - write uncompressed BAM output
- -bed
  - when using BAM input, write output as BED
- -a accepts 'stdin', meaning the input is read from STDIN

intersectBed
- e.g.
  - find genes that overlap LINEs but not SINEs:
      intersectBed -a genes.bed -b LINEs.bed | intersectBed -a stdin -b SINEs.bed -v
  - retain SE BAM alignments that overlap exons:
      intersectBed -abam reads.bam -b exons.bed >reads.touchingExons.bam

intersectBed
- screen for novel SNPs
    intersectBed -a snp.calls.bed -b dbSnp.bed -v | intersectBed -a stdin -b 1KG.bed -v >snp.calls.novel.bed

pairToBed
- pairToBed -a a.bedpe -b b.bed
  - report overlaps between a BEDPE and a BED
- -type either (default)
  - report overlaps if either end of A overlaps B
- -type neither
  - report A if neither end of A overlaps B
- -s
  - force strandedness

pairToBed
- pairToBed -a a.bedpe -b b.bed
- -type both
  - report overlaps if both ends of A overlap B
- -type notboth
  - report overlaps if neither or only one end of A overlaps B
- -type xor
  - report overlaps if only one end of A overlaps B

pairToBed
- pairToBed -a a.bedpe -b b.bed
- -type ispan
  - report overlaps between [end1, start2] of A and B
- -type ospan
  - report overlaps between [start1, end2] of A and B
- -type notispan
  - report A if ispan of A does not overlap B
- -type notospan
  - report A if ospan of A does not overlap B

pairToBed
- pairToBed -abam a.bam -b b.bed >out.bam
  - input file a is PE BAM, and output will be in BAM
- -ubam
  - write uncompressed BAM output
- -bedpe
  - when using BAM input, write output as BEDPE

pairToBed
- e.g.
  - return all structural variants that overlap with genes on either end:
      pairToBed -a sv.bedpe -b genes.bed >sv.genes
  - retain only PE alignments where neither or only one end overlaps segmental duplications:
      pairToBed -abam reads.bam -b segdups.bed -type notboth >reads.notbothSSRs.bam

pairToPair
- pairToPair -a a.bedpe -b b.bedpe
  - report overlaps between two BEDPE files
- -type both (default)
  - report overlaps if both ends of A overlap B
- -type either
  - report overlaps if either end of A overlaps B
- -type neither
  - report overlaps if neither end of A overlaps B

pairToPair
- e.g.
  - find all SVs in sample 1 that are also in sample 2:
      pairToPair -a 1.sv.bedpe -b 2.sv.bedpe | cut -f 1-10 > 1.sv.in2.bedpe
  - find all SVs in sample 1 that are not in sample 2:
      pairToPair -a 1.sv.bedpe -b 2.sv.bedpe -type neither | cut -f 1-10 > 1.sv.notin2.bedpe

windowBed
- windowBed -a a.bed -b b.bed
  - examines a window around each A and reports all features in B that overlap the window
  - for each overlap, A and B are reported
- -w INT
  - bp added upstream and downstream of each A (symmetrical windows around A)

windowBed
- windowBed -a a.bed -b b.bed
- -l INT
  - bp added upstream (left) of each A (asymmetrical windows around A)
- -r INT
  - bp added downstream (right) of each A (asymmetrical windows around A)
- -sw
  - define -l and -r based on strand

windowBed
- windowBed -a a.bed -b b.bed
- -sm
  - only report hits in B that overlap A on the same strand
- -abam FILE
- -ubam
- -bed

windowBed
- e.g.
  - report all SNPs that are within 5000 bp upstream or 1000 bp downstream of genes:
      windowBed -a genes.bed -b snps.bed -l 5000 -r 1000 -sw

subtractBed
- subtractBed -a a.bed -b b.bed
  - removes the portion(s) of each feature in A that are overlapped by features in B
- -f FLOAT
  - minimum overlap required as a fraction of A
- -s
  - force strandedness
- e.g.
  - remove introns from gene features:
      subtractBed -a genes.bed -b intron.bed

complementBed
- complementBed -i in.bed -g genome
  - returns the base-pair complement of the feature file
  - the genome file should be tab-delimited and contain: chromName chromSize
    - how to get chromSize? see -h
- e.g.
  - report all intervals in the human genome that are not covered by repetitive elements:
      complementBed -i repeatMasker.bed -g hg18.genome

mergeBed
- mergeBed -i in.bed
  - merge overlapping entries into a single entry
- -n
  - report # BED entries merged
  - '1' is reported if no merging occurred
- -d INT
  - maximum distance (bp) between features allowed to be merged [0]

mergeBed
- mergeBed -i in.bed
- -s
  - force strandedness
- -scores STR
  - how to determine the score of the merged entry
  - sum, min, max, mean, median, mode, antimode, collapse

slopBed
- slopBed -i in.bed -g genome
  - extends each feature on both sides
- -b INT/FLOAT
  - in each direction
- -l INT/FLOAT
  - in the left (upstream) direction
- -r INT/FLOAT
  - in the right (downstream) direction

slopBed
- slopBed -i in.bed -g genome
- -s
  - define -l and -r based on strand
- -pct
  - specify a fraction of the feature's length instead of absolute bases
  - -b/-l/-r accept a FLOAT only together with -pct
- e.g.
    slopBed -i probes.bed -g hg19.genome -b 500

shuffleBed
- shuffleBed -i in.bed -g genome
  - randomly permute the locations of features within a genome
- -excl BED_FILE
  - prevent features from being placed in these intervals
- -f FLOAT
  - maximum overlap fraction of the feature with an -excl feature
- -chrom
  - keep features on the same chromosome

sortBed
- sortBed -i in.bed
  - sorts a BED file by chrom, then by start position
  - writes the sorted BED to STDOUT
  - see -h for more options

coverageBed
- coverageBed -a a.bed -b b.bed
  - compute the depth and breadth of coverage of features from A on intervals in B
- default output: (for each entry in B)
  - # features in A that overlap the B interval
  - # bases in B that had >0 coverage
  - the length of B
  - the fraction of bases in B that had >0 coverage

coverageBed
- coverageBed -a a.bed -b b.bed
- -s
  - force strandedness
- -d
  - report the depth at each position (base) in each B
- e.g.
  - compute coverage and create a BEDGRAPH
      coverageBed -a reads.bed -b win10kb.bed | cut -f 1-4 > win10kb.cov.bedg

coverageBed
- by default, coverageBed counts any feature in A that overlaps B by >= 1 bp.
- To require a minimum overlap fraction, first use intersectBed:
    intersectBed -a a.bed -b b.bed -f 1.0 | coverageBed -a stdin -b b.bed >a_totally_in_b.coverage

coverageBed
- working with samtools or BAM files
    bamToBed -i reads.bam | coverageBed -a stdin -b exons.bed >exons.bed.coverage
    samtools view -bf 0x2 reads.bam | bamToBed -i stdin | coverageBed -a stdin -b exons.bed >exons.bed.pair_proper_mapped.coverage

coverageBed
- compute separately for each strand
    bamToBed -i reads.bam | grep '\+$' | coverageBed -a stdin -b genes.bed >genes.bed.forward.coverage
    bamToBed -i reads.bam | grep '\-$' | coverageBed -a stdin -b genes.bed >genes.bed.reverse.coverage

genomeCoverageBed
- genomeCoverageBed -i in.bed -g genome
  - compute the coverage across a genome
  - the input BED file must be grouped by chr.
- default output (one line per chromosome and depth), after the chromosome name:
  - depth (0 to max)
  - # bases in the chr. that had this depth
  - the length of the chr.
  - the fraction of bases in the chr. that had this depth
- -strand +/-
  - compute coverage for the given strand only

genomeCoverageBed
- -ibam FILE
  - input in BAM
  - must be sorted by position (samtools sort)
- -d
  - report the depth at each position (base)
- -bg
  - report depth in BedGraph format
- -bga
  - report depth in BedGraph format, also reporting zero-coverage regions

genomeCoverageBed
- e.g.
    genomeCoverageBed -ibam bwa_result.sorted.bam -g human_chrM.genome
    genomeCoverageBed -ibam bwa_result.sorted.bam -g human_chrM.genome | grep '^chrM' | awk 'BEGIN{SUM = 0} {SUM += ($2*$5)} END {print SUM}'

closestBed
- closestBed -a a.bed -b b.bed
  - for each A, finds the closest feature in B
- -t all (default)
  - report all ties
- -t first
  - report the first tie that occurred in B
- -t last
  - report the last tie that occurred in B

closestBed
- closestBed -a a.bed -b b.bed
- -s
  - force strandedness; find the closest feature on the same strand
- e.g.
  - find the closest ALU to each gene:
      closestBed -a genes.bed -b ALUs.bed

fastaFromBed
- fastaFromBed -fi in.fasta -bed extract.bed -fo out.fasta
  - extracts the regions listed in extract.bed from in.fasta and writes them to out.fasta
- -s
  - force strandedness
- -name
  - use the name field for the FASTA header

maskFastaFromBed
- maskFastaFromBed -fi in.fasta -bed mask.bed -fo out.fasta
  - masks the regions listed in mask.bed in in.fasta and writes the result to out.fasta
- -soft
  - soft masking: mask with lower-case instead of N
- -mc CHAR
  - replace the masking character, instead of N

linksBed
- linksBed -i in.bed > out.html
  - creates an HTML file with links to the UCSC browser
- -base
  - browser basename [http://genome.ucsc.edu]
- -org
  - organism [human]
- -db
  - reference build [hg18]

groupBy
- groupBy groups rows by the specified columns (-g) and then computes simple per-group statistics (-o) on the specified column (-c)
- Deprecated. Now in the filo package.
- -i FILE
  - input file; STDIN is assumed if omitted
- -g STR / -grp STR
  - the columns (1-based) used for grouping, comma-separated [1,2,3]
- -c INT / -opCols INT
  - the column (1-based) to be summarized; required

groupBy
- -o STR / -ops STR
  - the operation applied to the -c column
  - sum, count, min, max, mean, median, mode (most common), antimode (least common), stdev, sstdev (sample stdev), collapse (comma-separated), concat (non-delimited), freqdesc, freqasc [sum]
  - can be comma-separated, e.g. sum,mean,max

groupBy
- -inheader
  - the input has a header line, which is ignored
- -outheader
  - output a header line showing the column names

BEDTools Notes
- All BEDTools load the B file into memory and process the A file one entry at a time. Therefore, when possible, make the smaller file the B file.
- Most BEDTools allow the A file to be STDIN from a pipe, by using '-a stdin'

Reference
- BED format
  - http://genome-test.cse.ucsc.edu/FAQ/FAQformat
  - http://www.ensembl.org/info/website/upload/bed.html
- BedGraph format
  - http://genome.ucsc.edu/goldenPath/help/bedgraph.html
- BEDTools
  - http://code.google.com/p/bedtools/
- filo package (for groupBy)
  - https://github.com/arq5x/filo

FASTX-Toolkit

FASTX-Toolkit
- tools for preprocessing FASTA/FASTQ files (before mapping)
- format conversion, trimming, adapter clipping, barcode splitting, quality filtering, ...
- -h  show usage information
- you can also use them with Galaxy

FASTX-Toolkit
- fastq_to_fasta
- fastx_quality_stats
- fastq_quality_boxplot_graph.sh
- fastx_nucleotide_distribution_graph.sh
- fastx_clipper
- fastx_renamer
- fastx_trimmer

FASTX-Toolkit
- fastx_collapser
- fastx_artifacts_filter
- fastx_quality_filter
- fastx_reverse_complement
- fasta_formatter
- fasta_nucleotide_changer
- fasta_clipping_histogram.pl
- fastx_barcode_splitter.pl

Reference
- FASTX-Toolkit homepage
  - http://hannonlab.cshl.edu/fastx_toolkit/
- FASTX-Toolkit manual
  - http://hannonlab.cshl.edu/fastx_toolkit/commandline.html

Thanks for your attention!
