Supplementary Information for

detectMITE: A novel approach to detect miniature inverted repeat transposable elements in genomes

Congting Ye, Guoli Ji and Chun Liang

Supplementary Figures
Figure S1
Figure S2
Figure S3
Figure S4
Figure S5

A
>SQ261080185 Oryza sativa, chromosome:Chr04, from MITE family DTT_Ors37, SuperFamily Tc1/Mariner, complete sequence
TACTACCTCCGTCCTATATTACTTGCTTTTTTGAGTTTTTTAAGTTTTTGTTTGTCAATGTTTGATCATT
CGTCTTATTCAAATTTTTTTGGAATTATTATTTATTTTGTTTGTCATTTGCTTTATTATCAAAAGTACTT
TACATATGACTTATCTTTTTTTATATTTACACTAATTTTTCAAATAAAATGAGTTTTTGTTTGTCAATGT
TTAATCATTCGTCTTATTCAATTTTTTTTGGAATTATTATTTATTTTGTTTGTCATTTGCTTTATTATCA
AAATTACTTTACATATGACTTATTTTTTTTTATATTTGCACTAATTTTTCAAATAAAACGAGTTTATGTT
TGTCAATATTTGATCATTCGTCTTATTCAAAATTTTTTAGAATTATTATTTATTTTGTTTGTCATTTGCT
TTATTATCAAAAATACTTTACATATGACTTATCTTTTTTTATATTTGCACTAATTTTTCAAATAAAACGA
ATGGTTAAACGTTGCAAATAAAAAATCAAAAACGTCACCTATTATGGAACGGAGGGAGTA

B
>SQ265159547 Oryza sativa, chromosome:Chr02, from MITE family DTH_Ors8, SuperFamily PIF/Harbinger, complete sequence
GGCCCCCATCGGTTGGCTTTTTTTTTCTAATAAGGCAAAACGGTTTATCAGGGAATAAAAAAAATTATAG
GTAAAACTTATATATATATATATATATATATATATATATATATATATATACATACATACATATATATATA
TATACATACATACATATATATATATATACATACATACATATATATATATATATACATACATACATATATA
TATATATACATACATACATATATATATATATATACATACATACATATATATATATATATACATACATACA
TATATATACATATATATATGTGTGTGTGTGTTTTAACTTAAAAGCCAATGCTGAAAAAAATACGTTGAAA
ATATATCAAAATTAATCTCAAAATTAAGTTTGAAAATTCAAAATTTGGCTTATTCTTTAGCTTATTGGGC
CATCTGATGGGAGCC

C
>SQ262021977 Oryza sativa, chromosome:Chr10, from MITE family DTM_Ors4, SuperFamily Mutator, complete sequence
CTGGATTTTTCACATTTGGGTCCTTTTGAAAAACTTATTTTGCAAATAGACCCTGGAAAAACTTATCCCA
GAAATAGTCCTTTTTGGGGCGNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNN
NNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNN
NNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNN
NNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNN
NNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNN
NNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNN
NNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNN
NNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNGACGACGGCGCCAACCACTCTGGTGCCCCAAAAAGGAAC
CATTTCTGGAATAAGTTTTTCCAGGGGTCTACTTGCGAAATAAGTTTTTTAAAAGGGCCAAAATGTGAAA
AATCCAG

D
>Rice_3_55765 Unknow 3
AATGGGCGATCGATCGCCTGGGGATGGGGGTAGCGATCAATCCCCTCCCCCTCCCTCCCTCCACCTGGTT
TCCTTTTTTGGCACCGCATTACTTTCCTATTTTAGTAAATTTATGCACCTAAAGTTTATACACCTCAAGT
TTACACATCTAAAGTTTAGAGACCAAAAGTTTATAAGTCAAAAGTTTATATATCCGATTCAAATTTGAAT
TTGAATTCAAATATTTTTTATATATAGTATTTCTATACATCTAAAGTTTATACACCTAAAGTTTATAGAC
CCAAAGTTTATAAGTCAAAAGTTTACATACCCGTTTCAAATTTGAATTTGAATTATATCTGATTCAAATT
TGAATTTGAATTCAAATATTTTCTATATATAGTATTTTTATGCATCTAAAGTTTATACACCTAAAGTTTA
TAGACCTAAAGTTTATAAGTCAAAAGTTTACATACCCGATTCAAATTTGAATTTGAATTATATCCGATTC
AAATTTGAATTTGAATTCAAATATTTTCTATATATAGTATTTCTATACATCTAAAGTTTATACACCTAAA
GTTTATAGACCCAAAGTTTATAAGTCAAAAGTTTACATACCCGATTCAAATTTGAATTTGAATTCAAATT
TTTTATATATAGTATTTCTATACATAAATTTTTCTAACTTTTGTTTTTTTTAAAAAAATTTGTGTGGTGT
ACTGTAGTAGGAAGAGAAGAAGGGGAGGAGGAAGGGGGGAGAGGAGGGAGGAGTGTATCGAGTATAGGGG
AGGGGGGGCGGATCTGATCGCTGGGCGGATGGCGTGGCGATCA

Figure S1. Examples of low complexity MITEs identified by the Lempel-Ziv complexity algorithm in the rice genome from the detection outputs of MITE-Hunter and RSPB. (A) A sequence containing tandem repeats in the output of RSPB. (B) A sequence mainly consisting of ‘AT’ dinucleotide repeats in the output of RSPB. (C) A sequence containing too many unknown bases in the output of RSPB. (D) A sequence containing tandem repeats in the output of MITE-Hunter.

[Figure S2 image: panels A to D. Panel A is labelled with the 3' and 5' ends; panels B, C and D are each labelled with two TIRs.]

Figure S2. Examples of the 795 groups of MITE sequences in the rice genome uniquely identified by RSPB, but not by detectMITE.
(A) Sequences that do not bear terminal inverted repeats (TIRs). (B) Sequences whose TIRs have too many mismatches or non-complementary pairs. (C) Sequences whose TIRs have too high an A/T content. (D) Sequences with fewer than 3 full-length copies possessing good TIRs across the genome (i.e., the second and the last sequences do not have good TIRs: mismatched pairs in the stem ≥ 3). Relevant data are available at http://sourceforge.net/projects/detectmite/files/Supplementary_Data.7z.

Figure S3. Examples of MITE super-families (each represented by a family member) uniquely detected by detectMITE in pairwise comparisons with MITE Digger, MITE-Hunter, and RSPB individually. (A) A MITE super-family (family_2578 in super-family_1295) missed by MITE Digger. (B) A MITE super-family (family_3151 in super-family_1445) missed by MITE-Hunter. (C) A MITE super-family (family_4493 in super-family_1757) missed by RSPB. (D) Three MITE super-families (family_3080 in super-family_1419; family_1254 in super-family_660; family_1151 in super-family_587) missed by MITE Digger, MITE-Hunter and RSPB together. Relevant data are available at http://sourceforge.net/projects/detectmite/files/Supplementary_Data.7z.
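The TIR criteria named in the Figure S2 caption (mismatched pairs in the stem, A/T content of the TIRs, and at least 3 full-length copies with good TIRs) can be illustrated with a minimal Python sketch. This is not the detectMITE implementation. The function names, the TIR length of 10 and the A/T cutoff of 0.8 are assumptions made only for illustration; the only value taken from the caption is that 3 or more mismatched pairs in the stem disqualifies a TIR.

```python
# Illustrative sketch only; not the detectMITE code.
# tir_len and max_at_fraction are assumed values, not taken from the paper.
COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(seq):
    """Return the reverse complement of a DNA string (other symbols are kept)."""
    return seq.upper().translate(COMPLEMENT)[::-1]


def tir_stats(seq, tir_len=10):
    """Compare the 5' end with the reverse complement of the 3' end.

    Returns (number of mismatched pairs, A/T fraction of the 5' TIR).
    """
    if len(seq) < 2 * tir_len:
        raise ValueError("sequence is shorter than two TIRs")
    seq = seq.upper()
    left = seq[:tir_len]
    right = reverse_complement(seq[-tir_len:])
    mismatches = sum(1 for a, b in zip(left, right) if a != b)
    at_fraction = (left.count("A") + left.count("T")) / tir_len
    return mismatches, at_fraction


def has_good_tir(seq, tir_len=10, max_at_fraction=0.8):
    """A TIR is 'good' here if it has fewer than 3 mismatched pairs
    (from the caption) and an A/T fraction below an assumed cutoff."""
    mismatches, at_fraction = tir_stats(seq, tir_len)
    return mismatches < 3 and at_fraction <= max_at_fraction


def passes_copy_filter(copies, min_good_copies=3):
    """Require at least 3 full-length copies with good TIRs (caption, panel D)."""
    return sum(has_good_tir(c) for c in copies) >= min_good_copies
```

Under these assumptions, a sequence such as panel (C) of Figure S2 would fail `has_good_tir` on the A/T cutoff, and a group such as panel (D) would fail `passes_copy_filter`.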
A
>6|31213569|31213694|2|3;family_378;super-family_161
>ORSgTEMT01701756 gi|19223840|nt141213-141384 putative MITE, MITE-adh, type D-like
AGACTTTCTAGCATTGCCCACATTCATATAGATGTTAATGAATCTGGACATAACATCTGTATGAATGTGGGAT
|||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
AGACTTTCTAGCATTGCCCACATTCATATAGATGTTAATGAATCTGGACATAACATCTGTATGAATGTGGGAT

ATATGTCTAGATTCATTAATATCTATATGAATGTGGGCAATGCTAGAAAGTCT
|||||||||||||||||||||||||||||||||||||||||||||||||||||
ATATGTCTAGATTCATTAATATCTATATGAATGTGGGCAATGCTAGAAAGTCT

B
>5|829372|829569|2|3;family_2688;super-family_1324
>ORSgTEMT01602375 gi|18656390|nt97785-98015 putative MITE, MITE-adh, type B-like
TATGACACCATCGACTTTTTAACAAACATTTGTCCATTCATCTTATTCAAATTCTTTTATGCAAATATAAAAA
||||||:||:|:||| |||||||:|||:||||:||||||:||||||||||||| |||||||||||||||||||
TATGACGCCGTTGAC-TTTTAACCAACGTTTGACCATTCGTCTTATTCAAATT-TTTTATGCAAATATAAAAA

AAAATAAGTCATGCTTAAAGAATATTTGAAGATAAATCAAGTCACAATAAAATAAATAATAATTATATGTATT
:|::||:|||||:|||||||||:||||||:|||||||||||||||||||||||||||||||||||:||::|||
TACTTATGTCATACTTAAAGAACATTTGATGATAAATCAAGTCACAATAAAATAAATAATAATTACATAAATT

TTTTGAATAATACAAATGGTCAAATGTATCCCAAAAAGTCAACGGTGTCATA
||||||||||:||:||:|||||||:||:|:|||||||||||||||:||||||
TTTTGAATAAGACGAAAGGTCAAACGTTTATTAAAAAGTCAACGGCGTCATA

C
>11|15010637|15010746|6|3;family_695;super-family_279
>ORSgTEMT01701949 gi|23307556|nt16547-16700 putative MITE, MITE-adh, type D-like
GACTTTCTAGCATTGTCCACATTCATATAGATGTTAATGAATCCAGGCGCATATATATATATTTCTAGATTCA
|||||||||||||||:|||:|||:|||||:|||||||||||||:|   : ||||||||||:|:||||||||||
GACTTTCTAGCATTGCCCATATTTATATATATGTTAATGAATCTA---A-ATATATATATGTGTCTAGATTCA

TTAATATATATATGAATATAGACAATGCTAGAAAGTC
||||:||:|||||||||:|:|||||||||||||||||
TTAACATCTATATGAATGTGGACAATGCTAGAAAGTC

Figure S4. Examples of MITE super-families (each represented by a family member) that were detected by both detectMITE and MITE-Hunter but missed by MITE Digger, and that have valid BLAST matches (e-value ≤ 10^-10) against the TIGR Plant Repeat Database.
In each alignment, the upper sequence is the MITE sequence detected by detectMITE, and the lower sequence is the sequence annotated in the TIGR Plant Repeat Database. '|' stands for two identical nucleotides, ':' for two different nucleotides, and '-' for a gap. Relevant data are available at http://sourceforge.net/projects/detectmite/files/Supplementary_Data.7z.

A
>4|20606447|20606593|3|5;family_26;super-family_25
>ORSgTEMT00500006 gi|21912505|nt113703-113850 putative MITE, Gaijin/Gaigin-like
GGCCGTGTTTAGTTTCAAAGTTTTTCTTCAAACTTCTAACTTTTCTATCACATCGAAACTTTCTTACACACAT
|||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
GGCCGTGTTTAGTTTCAAAGTTTTTCTTCAAACTTCTAACTTTTCTATCACATCGAAACTTTCTTACACACAT

AAACTTATAACTTTTCCATCACATCGTTCCAATTTTAACCAAACTTTTAATTTTGACGTGAACTAAACACAGC
|||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
AAACTTATAACTTTTCCATCACATCGTTCCAATTTTAACCAAACTTTTAATTTTGACGTGAACTAAACACAGC

C
|
C

B
>11|3151373|3151519|3|4;family_23;super-family_23
>ORSgTEMT00501140 gi|24137620|nt71791-71930 putative MITE, Gaijin/Gaigin-like
---GGCTGTGTTTAGATTCAAAATTTGGATCTAAACTTTAAACTTCAGTCCTTTTCCGTCACATCAACCCATC
   ||||||||||||||:|||||||||||||    |   ||||||||||||||||||:|||:|||||||::||
TAAGGCTGTGTTTAGATCCAAAATTTGGATC----C---AAACTTCAGTCCTTTTCCATCATATCAACCTGTC

ATACACATACAACTTTTCAGTCACATCATCTTCAATTTTAACCAAAATCCAAACTTCCCCCTCAACTAAACAC
|||||||:||||||||||||||||||||||||||||||:||||||||||||||||||||||||||||||||||
ATACACACACAACTTTTCAGTCACATCATCTTCAATTTCAACCAAAATCCAAACTTCCCCCTCAACTAAACAC

AGCC
|
A--

C
>1|12802401|12802508|3|3;family_738;super-family_380
>ORSgTEMT01701078 gi|13603465|nt65492-65639 putative MITE, MITE-adh, type D-like
AAGTCATTCTAGCATTTTCCACATCCATATGGATGTTAGTGAATCTAGACACATATA--TATCTAGATTCACT
||||||||||||||||||||||||||||||:|:|||||:||||||||||||:|||||  ||||||||||||:|
AAGTCATTCTAGCATTTTCCACATCCATATTGTTGTTAATGAATCTAGACATATATATTTATCTAGATTCATT

AACATCCATATGTATGTAAAAAAATCTAGAATGACTT
||:|||:|||||:||:|:|||||::||||||||||||
AATATCAATATGAATATGAAAAATGCTAGAATGACTT

Figure S5. Examples of MITE super-families (each represented by a family member) that were detected by detectMITE but missed by both MITE Digger and MITE-Hunter, and that have valid BLAST matches (e-value ≤ 10^-10) against the TIGR Plant Repeat Database. In each alignment, the upper sequence is the MITE sequence detected by detectMITE, and the lower sequence is the sequence annotated in the TIGR Plant Repeat Database. '|' stands for two identical nucleotides, ':' for two different nucleotides, and '-' for a gap. Relevant data are available at http://sourceforge.net/projects/detectmite/files/Supplementary_Data.7z.
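The alignment notation described in the Figure S4 and S5 captions can be reproduced with a short Python sketch. It is an illustration, not the script used to draw the figures. The function name `match_line` is hypothetical, and leaving gap columns blank in the middle row is an assumption based on how the alignments above appear.

```python
def match_line(upper, lower):
    """Build the middle row of a pairwise alignment.

    '|' marks two identical nucleotides, ':' marks two different
    nucleotides, and a column with a gap ('-') in either sequence
    is left blank (assumed from the figures).
    """
    if len(upper) != len(lower):
        raise ValueError("aligned sequences must have the same length")
    row = []
    for a, b in zip(upper.upper(), lower.upper()):
        if a == "-" or b == "-":
            row.append(" ")
        elif a == b:
            row.append("|")
        else:
            row.append(":")
    return "".join(row)


# Small usage example on a made-up pair of aligned sequences.
upper = "ACG-T"
lower = "ACCTT"
print(upper)
print(match_line(upper, lower))
print(lower)
```

Counting the '|' characters in the returned row and dividing by the alignment length gives a quick per-alignment identity estimate.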