Small RNA Seq Analysis

Small RNAs
• Small RNAs are usually 18-34 nt in length, non-coding, and perform regulatory roles in the cell
• Different types of sRNAs
  – microRNAs (miRNAs)
  – Small interfering RNAs (siRNAs)
  – Piwi-interacting RNAs (piRNAs)
  – Trans-acting siRNAs (ta-siRNAs)
  – Natural antisense siRNAs (nat-siRNAs)

Functions of sRNAs
• Maintain integrity of the genome (repress transposons)
• Regulate gene expression (and thus function in development, stress response, defense against pathogens, etc.)
• Regulate meiosis and postmeiotic processes, and repress retrotransposon transposition in male germline cells (piRNAs)
• Direct DNA methylation and double-strand DNA break repair

Biogenesis and function (in plants)
[Figure: Katiyar-Agarwal & Jin 2010]

Small RNA sequencing
[Figure: Protocol (with barcoding), Hafner et al. 2012]
[Figure: Raw sequences, Farazi et al. 2012]

Workflow for preprocessing
[Figure: raw reads → clean reads → unique reads; annotations: "Usually >= 18 nt" and "May not be necessary, depends on the application"]

Removing adapters
[Figure panels: with barcode; without barcode]
• For sRNA-seq, the adapter is usually at the 3' end
• Use existing tools: fastx_clipper, cutadapt, ...
• Or write your own code (find the LAST perfect match to the first few bases of the adapter)

Removing adapter
    # $seq holds one raw read
    my $adapter = "TCGTATGCCGTCTTCTGCTTG";
    my $adapter_first_6nt = substr($adapter, 0, 6);
    my $pos = 0;
    my $len = -1;    # -1 means no adapter match was found
    # keep the position of the LAST perfect match to the first 6 nt
    while (($pos = index($seq, $adapter_first_6nt, $pos)) > -1) {
        $len = $pos;
        $pos++;
    }
    my $trimmed_seq = $seq;
    if ($len > -1) {    # there is adapter seq
        $trimmed_seq = substr($seq, 0, $len);
    }
Warning: This piece of code has not been rigorously tested. Use at your own risk.

Clean/unique reads contain:
• Degraded mRNA fragments
• rRNAs (ribosomal RNAs)
• tRNAs
• snRNAs (small nuclear RNAs)
• snoRNAs (small nucleolar RNAs)
• miRNAs
• siRNAs
• Low-complexity sequences (e.g.
polyA seqs)
• Random sequences
• Contaminated sequences

Down-stream analysis
[Figure: analysis pipeline; labels: also SOAP, Bowtie; perfect match or allowing mismatches; sometimes uniquely mapped seqs or low-copy seqs; normalization]

Normalization: expression level of sRNAs
• Make expression levels comparable across libraries
• Normalize by the total number of clean reads, or of mapped reads, in the library with structural RNAs removed
• Transcripts per ten million (TPTM) = sRNA_read_num * 10,000,000 / library_size
• May also normalize by copy number (i.e. divide by the number of mapped locations)

microRNA Analysis
• Check expression level of known miRNAs
  – Use mature miRNA seqs in the miRBase (www.mirbase.org)
  – Need to convert U to T before the search
• Predicting novel miRNAs
  – Homology search for conserved miRNAs
  – Predict using sRNA sequencing data
• Clustering into miRNA families
• Differential expression
• Target prediction
• Evolutionary analysis

MiRNA Discovery through sRNA Sequencing
[Flowchart; box labels:]
• Obtain clean sRNA reads (removing adaptor)
• Unique sequence
• Remove sRNAs with < 10 copies
• Delete sRNAs that match known RNAs and repeats
• Fold segments using UNAFold
• Secondary structure filters:
  – Free energy <= -35
  – Mismatch <= 4
  – Bulges <= 2
  – Asymmetrical bulge size < 2
  – Asymmetrical bulge # < 1
• Expression pattern filters:
  – miRNA/miRNA* >= 75%
  – Strand bias >= 80%
• Remove sRNAs with > 20 copies in genome
• Pick unique segments
• Map the sRNAs to genome, extract segments around the map position (length 80-300 bp)
• Remove segments that overlap with CDS regions
Annotation criteria based on Meyers et al.
(2008)

miRNA family clustering
• Compare mature sequences to each other, or to the mature miRNAs in the miRBase, using blastn or the FASTA program (ssearch36)
• Put into one family if fewer than 2 mismatches

miRNA target prediction
• Compare mature miRNAs to annotated mRNA transcripts
• Allow a certain number of mismatches or use a scoring system
• Quite a few existing tools are available

Differential expression of miRNAs
• Without biological replicates
  – Use the Audic & Claverie (1997) method
• With replicates
  – Use edgeR
• Additional criteria
  – Minimum expression level
  – Fold change
[Figure: Barrera-Figueroa et al. 2011]

Homework
Using the fasta file /home/rliu/GEN220/Ath_sRNA1.fa.gz (this file contains raw sRNA reads) as input, do the following:
(1) Write your own script to trim the adapter sequence at the 3' end. Adapter seq = CTGTAGGCACCATCAAT. Trim a sequence if it contains a perfect match to the first 6 bp of the adapter. (Optional: trim the sequences using an existing adapter trimmer and see whether the performance of your own script is close to that of the public tool.) Your script should remove sequences that contain no adapter sequence, are shorter than 18 nt, or contain at least one 'N' in the sRNA sequence. Report the number of clean reads that you obtain.
(2) Write a script to cluster your clean reads into unique reads, and report the number of unique reads that you obtain.
(3) Draw a figure to show the size distribution of your clean sRNAs (x-axis is sRNA length, y-axis is the percentage of sRNAs of that length out of the total). It is optional to submit the script that you write to obtain the size distribution stats.
(4) Compare your unique reads to the mature miRNAs in the miRBase (release 19) (perfect match only) and report normalized expression values of known miRNAs (use the total number of clean reads as library size; normalize to RPTM, i.e. reads per ten million, the same calculation as TPTM; ignore miRNAs with 0 expression). You should create a table that has two columns: column one has the miRNA ID, column two has the normalized expression value. Submit the script that you write for this step.
Please turn in your scripts and a single Word document that contains your results (two numbers, a figure, and a table).
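
Example: collapsing reads and TPTM normalization
The normalization slide gives TPTM = sRNA_read_num * 10,000,000 / library_size, and the preprocessing workflow collapses clean reads into unique reads. The Perl sketch below illustrates both steps on a handful of toy reads. It is a minimal illustration, not a tested pipeline: the names collapse_reads and tptm and the toy read list are hypothetical and do not come from the course material, and library size is assumed here to be the total number of clean reads.

```perl
#!/usr/bin/env perl
use strict;
use warnings;

# Toy clean reads (illustration only; not taken from the homework file).
my @clean_reads = qw(
    TGACAGAAGAGAGTGAGCAC
    TGACAGAAGAGAGTGAGCAC
    TGACAGAAGAGAGTGAGCAC
    TCGGACCAGGCTTCATTCCCC
    TTTGGATTGAAGGGAGCTCTA
);

# Collapse identical sequences into unique reads.
# Returns a hash reference: sequence => copy count.
sub collapse_reads {
    my @reads = @_;
    my %count;
    $count{$_}++ for @reads;
    return \%count;
}

# TPTM = read_count * 10,000,000 / library_size
sub tptm {
    my ($read_count, $library_size) = @_;
    die "library size must be positive\n" unless $library_size > 0;
    return $read_count * 10_000_000 / $library_size;
}

my $unique       = collapse_reads(@clean_reads);
my $library_size = scalar @clean_reads;    # assumption: total clean reads

# Print sequence, raw count and TPTM, most abundant first.
for my $seq (sort { $unique->{$b} <=> $unique->{$a} || $a cmp $b } keys %$unique) {
    printf "%s\t%d\t%.1f\n", $seq, $unique->{$seq},
        tptm($unique->{$seq}, $library_size);
}
```

With these five toy reads, the sequence seen three times gets 3 * 10,000,000 / 5 = 6,000,000 TPTM. In a real library, the same two subroutines would be fed reads parsed from the trimmed FASTA file rather than a hard-coded list.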