Computing with Sequences and Ranges
Martin Morgan

In Bioconductor:
- Biostrings: basic containers representing data.
- BSgenome: allows whole genomes to be represented as collections of large sequences.
- ShortRead: allows work with FASTQ files (short sequences and their qualities) without reading the whole file.
- rtracklayer: can work with UCSC 2bit files.

In R, classes are like nouns and methods are like verbs.

R can iterate over, process and randomly access genomic data. It is possible to restrict the data read into R, to iterate through e.g. a whole FASTQ file, and to chop a file into chunks, e.g. with FastqStreamer. R stores a representation of the data held on disk.

Biostrings

DNAStringSet: a vector of sequences. It behaves like an R vector: subsetting with [ ] returns the same data type (endomorphic), while [[ ]] returns a single element, here a DNAString. Biostrings has many methods, e.g. reverse, reverse complement and nucleotide frequency.

BSgenome and AnnotationHub

UCSC has gene models etc., which can be retrieved using AnnotationHub. AnnotationHub can also download data from other sources.

Ranges

Ranges specify the parts of the sequences we are interested in. Can use:
- GenomicRanges: stores sequence name, interval (start and end), strand and metadata. For a hierarchical structure use GRangesList, which can hold e.g. all exons grouped by transcript. Each item in a GRangesList can be accessed using [[ ]]. It is better to unlist, work on the long GRanges, then relist. This is a flesh (data) and skeleton (structure) design.
- GenomicAlignments: specialised for aligned reads. Can read in a specific region of a BAM file or a whole BAM file. Uses restriction and iteration to manage large data.
- GenomicFeatures
- rtracklayer: useful for inputting and outputting common file types, e.g. bed, gtf.
- IRanges
- S4Vectors

GRanges Methods

- range(): co-ordinates of the minimum and maximum of the GRangesList.
- reduce(unlist(gene_model)): collapses overlapping ranges in this gene model.
- findOverlaps(): finds overlaps, e.g. between ranges representing data and ranges representing annotations.
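
To make the three methods above concrete, here is a minimal sketch of what they compute, written in plain Python and not using Bioconductor. It is an illustration under stated assumptions, not the GenomicRanges implementation: ranges are simple (start, end) pairs with inclusive ends, sequence names and strand are ignored, and the function names span, reduce_ranges and find_overlaps are hypothetical names chosen for this example. Indices in the overlap result are 0-based, following Python conventions.

```python
from typing import List, Tuple

# Assumption of this sketch: a range is (start, end) with both ends inclusive.
# Sequence name, strand and metadata are left out to keep the idea visible.
Interval = Tuple[int, int]


def span(ranges: List[Interval]) -> Interval:
    """Smallest start and largest end, in the spirit of range()."""
    return (min(s for s, _ in ranges), max(e for _, e in ranges))


def reduce_ranges(ranges: List[Interval]) -> List[Interval]:
    """Collapse overlapping ranges into their union, in the spirit of reduce()."""
    merged: List[Interval] = []
    for start, end in sorted(ranges):
        if merged and start <= merged[-1][1]:
            # Overlaps the previous merged range: extend it.
            merged[-1] = (merged[-1][0], max(merged[-1][1], end))
        else:
            merged.append((start, end))
    return merged


def find_overlaps(query: List[Interval], subject: List[Interval]) -> List[Tuple[int, int]]:
    """All (query index, subject index) pairs sharing at least one position."""
    hits = []
    for qi, (qs, qe) in enumerate(query):
        for si, (ss, se) in enumerate(subject):
            if qs <= se and ss <= qe:
                hits.append((qi, si))
    return hits


# Hypothetical data: exons of a gene model (annotation) and a few reads (data).
exons = [(100, 200), (150, 300), (400, 500)]
reads = [(180, 220), (310, 350), (490, 510)]

print(span(exons))                  # (100, 500)
print(reduce_ranges(exons))         # [(100, 300), (400, 500)]
print(find_overlaps(reads, exons))  # [(0, 0), (0, 1), (2, 2)]
```

The nested loop in find_overlaps is quadratic and is only meant to show the overlap test itself; it says nothing about how findOverlaps achieves its performance on large data.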