Bioinformatics Applications

advertisement
Bioinformatics Applications
Vivek Krishnakumar & Haibao Tang
J. Craig Venter Institute
Plant Informatics Workshop
July 15, 2013
Module 1:
FASTA
FASTA format - Example
A single FASTA sequence record:
• Definition Line or Header begins with ‘>’
•
Width of sequence rows usually 60 letters.
The Multi-FASTA format is composed of FASTA records
concatenated together.
FASTA - Definition Line
•
•
•
Minimum requirement for definition line is
'>' symbol followed by an identifier
>Medtr5g073340
Extra comments may be added after the
identifier separated by a whitespace
Several pre-defined conventions already
exist, that are followed by the sequence
databases
FASTA - Naming Conventions
faSize
htang@htang-lx $ faSize contigs.fasta
344315953 bases (621962 N's 343693991 real 343693991 upper 0 lower) in 177125 sequences in 1 files
Total size: mean 1943.9 sd 3181.2 min 200 (contig_177102) max 139442 (contig_26116) median 624
N count: mean 3.5 sd 10.5
U count: mean 1940.4 sd 3180.1
L count: mean 0.0 sd 0.0
%0.00 masked total, %0.00 masked real
htang@htang-lx $ faSize contigs.fasta -detailed | sort -k2,2nr | head
contig_26116
139442
contig_26133
126994
contig_26222
58825
contig_28230
54642
contig_25859
50265
contig_38555
47349
contig_26213
47226
contig_41638
45958
contig_6779
43393
contig_26783
42041
faOneRecord, faSomeRecords
faOneRecord - Extract a single record from a .FA file
usage:
faOneRecord in.fa recordName
faSomeRecords - Extract multiple fa records
usage:
faSomeRecords in.fa listFile out.fa
options:
-exclude - output sequences not in the list file.
• Does not create index like
cdbindex/cdbyank (does not create extra
files)
• Sufficiently fast
faSplit
faSplit - Split an fa file into several files.
usage:
faSplit how input.fa count outRoot
where how is either 'base' 'sequence' or 'size'. Files
split by sequence will be broken at the nearest
fa record boundary, while those split by base will
be broken at any base. Files broken by size will
be broken every count bases.
Examples:
faSplit sequence estAll.fa 100 est
This will break up estAll.fa into 100 files
(numbered est001.fa est002.fa, ... est100.fa
Files will only be broken at fa record boundaries
faSplit base chr1.fa 10 1_
This will break up chr1.fa into 10 files
faSplit size input.fa 2000 outRoot
This breaks up input.fa into 2000 base chunks
faSplit about est.fa 20000 outRoot
This will break up est.fa into files of about 20000 bytes each by record.
faSplit byname scaffolds.fa outRoot
This breaks up scaffolds.fa using sequence names as file names.
faSplit gap chrN.fa 20000 outRoot
This breaks up chrN.fa into files of at most 20000 bases each,
at gap boundaries if possible.
faGapSizes, faGapLocs
htang@htang-lx (~) $ faGapSizes medicago.fa
gapCount=7870, totalN=51926655, minGap=1, maxGap=250000, avgGap=6598.00
0 < size
size
10 < size
size
50 < size
size
100 < size
size
500 < size
size
1000 < size
size
5000 < size
size
10000 < size
size
size
<
=
<
=
<
=
<
=
<
=
<
=
<
=
<
=
>
10
10
50
50
100
100
500
500
1000
1000
5000
5000
10000
10000
50000
50000
50000
:
:
:
:
:
:
:
:
:
:
:
:
:
:
:
:
:
2402:
2:
50:
26:
38:
4654:
7:
0:
0:
0:
1:
305:
1:
1:
0:
101:
282:
***********************
*********************************************
**
**
htang@htang-lx (~) $ faGapLocs CU914135 stdout
0
CU914135.1_seq_0
70702
CU914135.1
70702
CU914135.1_gap_0
50
CU914135.1
70752
CU914135.1_seq_1
51172
CU914135.1
121924 CU914135.1_gap_1
50
CU914135.1
121974 CU914135.1_seq_2
154227 CU914135.1
276201
276201
276201
276201
276201
Module 2:
FASTQ
Format for HTS Data
•
•
•
High Throughput Sequencing (HTS)
instruments produce large quantities of
sequence data
Requirement to store sequence quality
along with raw sequence arose
Thus, logical extension to FASTA was
developed, called FASTQ
o
o
Minimal representation of sequencing read
Ability to store numeric quality score
FASTQ format
•
Line 1 begins with the @ character followed
by sequence identifier
• Line 2 consists of the raw sequence
• Line 3 begins with a + symbol followed by an
optional description or repeat of Line 1
• Line 4 corresponds to the encoded quality
values (one character each for every
nucleotide in the sequenced read)
•
Common file extensions:
.fastq, .fq, or .txt are used
FASTQ format
Illumina Quality Encoding
•
•
Phred Quality Scores Q: logarithmically related to base
calling error probabilities P
Phred score:
10 = 10% error
20 = 1% error
30 = 0.1% error
40 = 0.01% error
FASTQ format
Illumina Quality Encoding
•
•
•
•
Illumina format encodes a quality score between 0 and
62 using ASCII 64 to 126 (as compared to Sanger which
encodes 0 to 93 using ASCII 33 to 126)
The phred score + some offset = ASCII code, example:
Two types of offsets (phred +33, and phred +64). Most of
the FASTQ files are phred +33.
How do I know which offset it is? There is a quick tip
FASTQ format
Illumina Phred + 33
FASTQ format
Illumina Phred + 64
Module 3:
GFF, BED
Gene structure
Generic Feature Format
GFF = Generic Feature Format
Tab delimited, easy for data parsing and processing
Many annotation viewers accept this format, isn't very strict
Fields:
1. Reference Sequence: base seq to which the coordinates are anchored
2. Source: source of the annotation
3. Type: Type of feature
4. Start coordinate (1-based)
5. End coordinate
6. Score: Used for holding numerical scores (similarity, etc)
7. Strand: "+","-", or "." if unstranded
8. Frame: Signifies codon phase for coding sequence (CDS) features
9. Other attributes or/and comments
GFF3 – Generic Feature Format v3
http://www.sequenceontology.org/gff3.shtml
Extension of GFF by the Sequence Ontology (SO) and Generic Model
Organism Database (GMOD) Projects
- Allows hierarchies more than one level deep
- Constrains the feature type field to be taken from controlled vocabulary
- Feature can belong to more than one group
- Attributes take the form of “Key=Value” pairs separated by a ";"
- Uppercase 'Keys' are reserved. Lower case 'keys' are user-defined
BED Format
http://genome.ucsc.edu/FAQ/FAQformat.html#format1
•
•
•
•
Developed primarily for the UCSC genome
browser
BED lines have 3 required fields and 9
optional fields
With just the required fields and a few
additional optional fields (bed6), individual
features (such as gene/mRNA boundaries)
can be represented
In order to depict hierarchical features, all 12
columns are necessary (also called bed12
format)
BED Format - Description
Three required BED fields are:
1. chrom - The name of the chromosome/reference (e.g. chr3, scaffold_1)
2. chromStart - The starting position of the feature (0-based)
3. chromEnd - The ending position of the feature
The 9 additional optional BED fields are:
1. name - Defines the name of the BED line
2. score - A score between 0 and 1000
3. strand - Defines the strand - either '+' or '-'.
4. thickStart - The starting position at which the feature is drawn thickly
(for example, the start codon in gene displays).
5. thickEnd - The ending position at which the feature is drawn thickly (for
example, the stop codon in gene displays)
6. itemRgb - An RGB value of the form R,G,B (e.g. 255,0,0)
7. blockCount - The number of blocks (exons) in the BED line
8. blockSizes - A comma-separated list of the block sizes, corresponding
to blockCount
9. blockStarts - A comma-separated list of block starts, relative to
chromStart, corresponding to blockCount
BED Format - Examples
BED
chr2
chr5
0
0
673342
619924
BED6
chr2
chr2
136141
140131
136779
140463
Medtr2g005340
Medtr2g005360
0
0
+
+
BED12
chr2
140131
140131
140463
140463
Medtr2g005360
255,0,0
0
2
+
\
100,58
0,100
BEDTOOLS
• Author: Aaron Quinlan, UVA
• BED format (first three columns `chr`,
`start`, `end` are required), 0-based
#chromosome
Mt_chr01
Mt_chr01
Mt_chr01
Mt_chr01
Mt_chr01
Mt_chr01
Mt_chr01
Mt_chr01
start
150050
154858
155200
162535
164428
167402
169440
171722
end
154771
155160
162479
164388
166156
169400
171646
172427
name
Medtr1g005000
Medtr1g005010
Medtr1g005020
Medtr1g005030
Medtr1g005040
Medtr1g005050
Medtr1g005060
Medtr1g005070
score
1000
1000
1000
1000
1000
1000
1000
1000
strand ...
+
+
-
• BEDTOOLS operates on genomic
coordinates and uses “Bin indexing” to
speed up range queries
BEDTOOLS functionalities
• Find out what’s overlapping between sets of features.
(intersectBed)
• Find closest genomic features. (closestBed, windowBed)
• Merge overlapping features. (mergeBed)
• Computing coverage for alignments based on genome
features. (coverageBam, bamToBed)
• Calculating the depth and breadth of sequence coverage
across defined "windows" in a genome. (coverageBed)
• Sequence output. (fastaFromBed, maskFastaFromBed)
intersectBed: What’s in common?
• Report the base-pair overlap between sequence alignments and genes.
$ intersectBed -a reads.bed -b genes.bed
• Report those alignments that overlap NO genes. Like "grep -v"
$ intersectBed -a reads.bed -b genes.bed -v
• Report the number of genes that each alignment overlaps.
$ intersectBed -a reads.bed -b genes.bed -c
• Report the entire, original alignment and gene entries for each overlap, and number of
overlapping
bases.
$ intersectBed
-a reads.bed -b genes.bed –wo
• Only report an overlap with a repeat if it spans at least 50% of the exon.
$ intersectBed -a exons.bed -b repeatMasker.bed –f 0.50
• Read BED A from stdin. For example, find genes that overlap LINEs but not SINEs.
$ intersectBed -a genes.bed -b LINES.bed | intersectBed -a stdin -b SINEs.bed –v
• Retain only single-end BAM alignments that do not overlap simple sequence repeats.
$ intersectBed -abam reads.bam -b SSRs.bed -v > reads.noSSRs.bam
windowBed, closestBed: What’s in there?
• windowBed
• Report all genes that are within 10000 bp upstream or downstream of CNVs.
$ windowBed -a CNVs.bed -b genes.bed -w 10000
• Report all genes that are within 10000 bp upstream or 5000 bp downstream of CNVs.
$ windowBed -a CNVs.bed -b genes.bed –l 10000 –r 5000
• Report all SNPs that are within 5000 bp upstream or 1000 bp downstream of genes.
Define upstream and downstream based on strand.
$ windowBed -a genes.bed –b snps.bed –l 5000 –r 1000 -sw
• closestBed
• Note: By default, if there is a tie for closest, all ties will be reported. closestBed allows
overlapping features to be the closest.
• Find the closest ALU to each gene.
$ closestBed -a genes.bed -b ALUs.bed
• Find the closest ALU to each gene, choosing the first ALU in the file if there is a tie.
$ closestBed -a genes.bed -b ALUs.bed –t first
mergeBed, coverageBed
• mergeBed
• Merge overlapping repetitive elements into a single entry.
$ mergeBed -i repeatMasker.bed
• Merge overlapping repetitive elements into a single entry, returning the number of entries
merged.
$ mergeBed -i repeatMasker.bed -n
• Merge nearby (within 1000 bp) repetitive elements into a single entry.
$ mergeBed -i repeatMasker.bed –d 1000
• coverageBed
• Compute the coverage of aligned sequences on 10 kilobase “windows” spanning the
genome.
$ coverageBed -a reads.bed -b windows10kb.bed
Default Output:
After each entry in B, reports:
1) The number of features in A that overlapped the B interval.
2) The number of bases in B that had non-zero coverage.
3) The length of the entry in B.
4) The fraction of bases in B that had non-zero coverage.
fastaBed, maskFastaFromBed: messing with
sequences
• fastaBed
Program: fastaFromBed (v2.6.1)
Summary: Extract DNA sequences into a fasta file based on BED coordinates.
Usage:
fastaFromBed [OPTIONS] -fi -bed -fo
Options:
-fi
-bed
-fo
-name
-tab
Input FASTA file
BED file of ranges to extract from -fi
Output file (can be FASTA or TAB-delimited)
Use the BED name field (#4) for the FASTA header
Write output in TAB delimited format.
- Default is FASTA format.
• maskFastaFromBed
Program: maskFastaFromBed (v2.6.1)
Summary: Mask a fasta file based on BED coordinates.
Usage:
maskFastaFromBed [OPTIONS] -fi -out -bed
Options:
-fi
-bed
-fo
-soft
Input FASTA file
BED file of ranges to mask in -fi
Output FASTA file
Enforce "soft" masking. That is, instead of masking with Ns,
mask with lower-case bases.
Module 4:
SAM, BAM
Sequence Alignment Format
•
•
With the advent of HTS technologies, several
requirements arose:
o need for a generic alignment format to store
read alignments
o support for short and long reads generated
by different sequencing platforms
o compact file size
o random access ability
o ability to store various types of alignments
(clipped, spliced, multi-mapped, padded)
As a result, SAM (Sequence Alignment/Map)
format evolved
SAM Format - Example
http://samtools.sourceforge.net/SAM1.pdf
@HD VN:1.3 SO:coordinate
@SQ SN:ref LN:45
r001 163 ref 7 30 8M2I4M1D3M
r002
0 ref 9 30 3S6M1P1I4M
r003
0 ref 9 30 5H6M
r004
0 ref 16 30 6M14N5M
r003 16 ref 29 30 6H5M
r001 83 ref 37 30 9M
=
*
*
*
*
=
37 39
0
0
0
0
0
0
0
0
7 -39
TTAGATAAAGGATACTG
AAAAGATAAGGATA
AGCTAA
ATAGCTTCAGC
TAGGC
CAGCGCCAT
*
*
* NM:i:1
*
* NM:i:0
*
SAM Format - Header Lines
• Header lines start with @
• @ is followed by TAG
• Known TAGS:
@HD - Header
@SQ - Reference Sequence dictionary
@RG - Read Group
• Header fields are TYPE:VALUE pairs
@SQ SN:ref LN:45
TAG
TYPE:VALUE
• Example:
TYPE:VALUE
@RG ID:L2 PU:SC_2_12 LB:SC_2
SM:NA12891
SAM Format - Alignment Section
•
•
•
11 mandatory fields
Tab-delimited
Optional fields (variable)
1. QNAME: Query name of the read or the read pair
2. FLAG: Bitwise flag (multiple segments, unmapped, etc.)
3. RNAME: Reference sequence name
4. POS: 1-Based leftmost position of clipped alignment
5. MAPQ: Mapping quality (Phred-scaled)
6. CIGAR: Extended CIGAR string (operations: MIDNSHP)
7. MRNM: Mate reference name (‘=’ if same as RNAME)
8. MPOS: 1-based leftmost mate position
9. ISIZE: Inferred insert size
10.SEQ: Sequence on the same strand as the reference
11.QUAL: Query quality (ASCII-33 = Phred base quality)
SAM Format - CIGAR string
M:
I:
D:
P:
N:
S:
H:
match/mismatch
insertion
deletion
padding
skip
soft-clip
hard-clip
Ref: GCATTCAGATGCAGTACGC
Read: ccTCAG--GCAGTAgtg
CIGAR
POS
2S4M2D6M3S
5
SAM Format
Compressed Binary Version
•
•
•
•
•
Known as BAM
Binary, indexed representation of SAM
Uses BGZF (Blocked GZip Format)
compression
Storage space requirements ~27% of
original SAM
SAM/BAM can be manipulated using
SAMTOOLS package
SAMTOOLS
• SAM, and BAM format is popular for
reporting read mappings onto the
reference genome, SAM is human
readable
@HD
VN:1.0 SO:unsorted
@SQ
SN:contig_27167 LN:26169
GFNQG3Z01EMD0D 65
contig_27167
CATTCTCTCTTTCTTCTTTGTGCTTCA
*
GFNQG3Z01EMD0D 129
contig_27167
GTCAAACAACCCTCGTAGAAATATGAT
*
25164
60
3S24M
23568
60
18M1D8M
=
23568
=
1620
25164
1620
• SAMTOOLS, author: Heng Li, Broad
good for manipulating SAM files
SAMTOOLS
• Samtools manipulate SAM/BAM
• Once you have the bam indexed, you can
quickly access any range in the reference
(remember bin indexing?)
bowtie ref -1 SRR067323_1.fastq -2 SRR067323_2.fastq --best --maxins 1000 -S | \
samtools view -h -S -F 4 - > SRR067323.aligned.sam
samtools view -bS SRR067323.aligned.sam –o SRR067323.aligned.bam
samtools sort SRR067323.aligned.bam SRR067323.sorted
samtools index SRR067323.sorted.bam
SAM
samtools view
BAM
Sorted BAM
samtools sort
Indexed BAM
samtools index
Download