Computing with Sequences and Ranges
Martin Morgan
In Bioconductor:
Biostrings – basic containers representing data
BSgenome – allows whole genomes to be represented – collection of large
sequences.
ShortRead – allows work with FASTQ without reading the whole file – short
sequences and their qualities.
rtracklayer – can work with UCSC 2bit files
In R, classes are like nouns and methods are like verbs.
R can iterate, process and randomly access genomic data.
It is possible to restrict the input of data into R, to iterate e.g. through a whole
FASTQ and to chop a file into chunks, e.g. with FastqStreamer.
R will store a representation of the data stored on disk.
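The iteration idea above (reading a large FASTQ in chunks rather than all at once) can be sketched in a few lines. This is a minimal, language-neutral illustration in Python, not the FastqStreamer API: the function name fastq_chunks and the toy records are hypothetical, and it assumes the simple four-lines-per-record FASTQ layout.

```python
import io
from itertools import islice

def fastq_chunks(handle, n_records):
    """Yield lists of up to n_records (name, sequence, quality) tuples.

    Assumes each record is exactly 4 lines: @name, sequence, '+', qualities.
    Only one chunk is held in memory at a time.
    """
    while True:
        lines = [line.rstrip("\n") for line in islice(handle, 4 * n_records)]
        if not lines:
            return
        yield [(lines[i][1:], lines[i + 1], lines[i + 3])
               for i in range(0, len(lines) - 3, 4)]

# Toy in-memory "file" with three made-up records.
demo = io.StringIO(
    "@r1\nACGT\n+\nIIII\n"
    "@r2\nGGCC\n+\nIIII\n"
    "@r3\nTTAA\n+\nIIII\n"
)

# Process the file two records at a time.
chunk_sizes = [len(chunk) for chunk in fastq_chunks(demo, n_records=2)]
```

Each pass through the loop sees only one chunk, so memory use depends on the chunk size rather than on the size of the file.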
Biostrings
DNAStringSet – vector of sequences
Behaves like an R vector
e.g. can be subset with [ ] to produce the same data type (endomorphic) or [[ ]] to get
back a single element – here this is a DNAString.
Biostrings has many methods, e.g. reverse, reverse complement and nucleotide
frequency.
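To make the vector-like behaviour and the named operations concrete, here is a small sketch using plain Python strings and lists. It is an analogy only, not Biostrings code: the sequences are made up, slicing stands in for [ ], indexing stands in for [[ ]], and only the four unambiguous DNA letters are handled.

```python
from collections import Counter

# Translation table for complementing unambiguous DNA bases.
COMPLEMENT = str.maketrans("ACGT", "TGCA")

def reverse_complement(seq):
    """Complement every base, then reverse the sequence."""
    return seq.translate(COMPLEMENT)[::-1]

seqs = ["ACGT", "GGCCA", "TTAAC"]   # stands in for a set of sequences

subset = seqs[0:2]   # like [ ]: the result is still a collection of sequences
single = seqs[1]     # like [[ ]]: the result is one sequence

rc = [reverse_complement(s) for s in seqs]

# Nucleotide frequency across all sequences.
freq = Counter("".join(seqs))
```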
BSgenome and AnnotationHub
UCSC has gene models etc. which can be retrieved using AnnotationHub.
AnnotationHub can also download data from other sources.
Ranges
Specifies the parts of the sequences we are interested in.
Can use:
- GenomicRanges
  Stores sequence name, interval (start and end), strand and metadata.
  For a hierarchical structure use GRangesList – can extract e.g. all exons grouped
  by transcript.
  Each item in a GRangesList can be accessed using [[ ]].
  It is better to unlist, work on long GRanges then relist. This is a flesh (data) and
  skeleton (structure) design.
- GenomicAlignments
  Specialised for aligned reads. Can read in a specific region of a BAM or a whole
  BAM. Uses restriction and iteration to manage large data.
- GenomicFeatures
- rtracklayer
  Useful for inputting and outputting common file types e.g. bed, gtf.
- IRanges
- S4Vectors
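The unlist, work, relist pattern described for GRangesList can be illustrated with a small sketch. This is a conceptual Python analogy, not the GenomicRanges implementation: the Range class, the unlist and relist helpers, and the exon coordinates are all hypothetical. The flat list is the "flesh" and the group sizes are the "skeleton".

```python
from dataclasses import dataclass, replace

@dataclass(frozen=True)
class Range:
    seqname: str
    start: int    # 1-based, inclusive
    end: int      # inclusive
    strand: str

def unlist(groups):
    """Flatten {name: [Range, ...]} into one long list plus the group sizes."""
    flat = [r for ranges in groups.values() for r in ranges]
    skeleton = [(name, len(ranges)) for name, ranges in groups.items()]
    return flat, skeleton

def relist(flat, skeleton):
    """Rebuild the grouped structure from the flat list and the group sizes."""
    out, pos = {}, 0
    for name, n in skeleton:
        out[name] = flat[pos:pos + n]
        pos += n
    return out

# Made-up exons grouped by transcript.
exons_by_tx = {
    "tx1": [Range("chr1", 100, 200, "+"), Range("chr1", 300, 400, "+")],
    "tx2": [Range("chr1", 150, 250, "+")],
}

flat, skeleton = unlist(exons_by_tx)

# One pass over the long flat list: widen every exon by 10 bases on each side.
widened = [replace(r, start=r.start - 10, end=r.end + 10) for r in flat]

widened_by_tx = relist(widened, skeleton)
```

The work is done once on the long flat list instead of once per group, and the grouping is restored afterwards from the skeleton alone.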
GRanges Methods
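range() – the span from the smallest start to the largest end of the ranges; for a
GRangesList this is computed for each list element
reduce(unlist(gene_model)) – merges overlapping (and adjacent) ranges in this gene
model into a set of non-overlapping ranges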
findOverlaps – finds overlaps e.g. between ranges representing data and ranges
representing annotations.
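What these three methods compute can be shown on plain (start, end) pairs. The sketch below is an illustration in Python under simplifying assumptions, not the Bioconductor implementation: all ranges are taken to lie on one sequence, strand is ignored, coordinates are inclusive, overlap indices are 0-based, and the function names and data are hypothetical. The overlap search is a brute-force comparison suitable for illustration only.

```python
def span(ranges):
    """Smallest start and largest end over all ranges (the idea behind range())."""
    return (min(s for s, _ in ranges), max(e for _, e in ranges))

def reduce_ranges(ranges):
    """Merge overlapping or directly adjacent ranges (the idea behind reduce())."""
    merged = []
    for start, end in sorted(ranges):
        if merged and start <= merged[-1][1] + 1:
            # Overlaps or touches the previous merged range: extend it.
            merged[-1] = (merged[-1][0], max(merged[-1][1], end))
        else:
            merged.append((start, end))
    return merged

def find_overlaps(query, subject):
    """Return (query_index, subject_index) pairs for every overlapping pair."""
    return [(qi, si)
            for qi, (qs, qe) in enumerate(query)
            for si, (ss, se) in enumerate(subject)
            if qs <= se and ss <= qe]

exons = [(100, 200), (150, 250), (251, 300), (400, 500)]   # made-up annotation
reads = [(90, 110), (310, 320), (450, 460)]                # made-up data

exon_span = span(exons)
exon_reduced = reduce_ranges(exons)
hits = find_overlaps(reads, exons)
```

Here the first three exons collapse into one range because they overlap or touch, and only the first and third reads hit an exon.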