Highly Conserved Non-Coding Sequences are Associated with Vertebrate Development
PLoS Biol. 2005 Jan;3(1):e7. Epub 2004 Nov 11.
Yvonne Li
Paper presentation for MEDG505
Jan 27, 2005

Outline
  - Motivation
  - Method and Results
  - Discussion

Motivation
  - Gene regulatory networks for development have been described in invertebrates but not characterized in vertebrates
  - Studies have shown:
      - a number of developmental genes are regulated by highly conserved enhancer regions at distances of hundreds of kb
      - ultra-conserved elements are more frequent than expected
      - there is a significant association between these highly conserved elements and DNA-binding proteins
  - Goal: look for all such elements in the entire human genome and see how they relate to development

Method
  - Computationally identify
  - Computationally analyze
  - Experimentally validate

Sequence Data
  - CNE: Highly Conserved Noncoding Elements
  - Which 2 species to use for whole-genome alignment?
  - Human and Fugu
      - Fugu has 1/8 the genome size of human but a similar gene repertoire
      - Fugu's developmental blueprint is very similar to human's
  - Two ways to detect CNEs:
      1. Whole-genome alignment
      2. Regional alignments

Obtaining CNEs
  - Start with the Fugu genome assembly
  - MegaBLAST against Ensembl human genome v18.34.1
  - Remove alignments < 100 bp in length
  - Mask coding and non-coding RNA content
  - Remove telomere-like sequences and transposons
  - Stats: core set of 1373 elements
      - Length: ave 199 bp, max 736 bp
      - Identity: ave 84%, max 98%
      - Conserved in: mouse 1365, rat 1316, chicken 1310, zebrafish 1093

CNE Distribution
  - CNEs in the human genome are found on all chromosomes except 21 and Y
  - Distribution of CNEs is highly clustered
  - Clustered CNEs by genomic location
      - 165 clusters
      - The 20 largest clusters have ≥ 20 CNEs

CNE associated genes
  - For each CNE, extract the closest gene from Ensembl
  - Find the most statistically over-represented GO terms
      - 12 of the 13 terms relate to transcriptional regulation and development ("transdev" genes)
  - How many clusters are situated near such transdev genes?
      - Over 93% of clusters have a transdev gene within 500 kb of their CNEs
      - 15% have 2 or more

CNEs generally located large distances from nearest gene
  - Average distance between a CNE and the 5' end of the closest human gene is 182 kb; 93 CNEs are > 500 kb away and 12 CNEs are > 1 Mb away
  - Transdev genes are located in regions of low gene density
      - Average number of genes within 500 kb upstream or downstream is 16 for all human genes and 6 for transdev genes

Obtaining rCNEs
  - Use MLAGAN (localized multiple alignment) to identify additional conserved sequences around specific genes
  - MLAGAN is more sensitive than whole-genome alignment
      - Species: Human, Fugu, Mouse, Rat
      - Algorithm itself is more sensitive
      - Requires only a 40 bp window with 60% identity
  - Chose 4 cluster regions containing different types of developmental genes: SOX21, PAX6, HLXB9, SHH
  - Sometimes the CNEs are more conserved than the gene's coding exons!

SOX21 MLAGAN (figure)

Vertebrates vs Invertebrates
  - Are the CNEs also found in invertebrates?
  - Use all CNEs and rCNEs
  - Search the whole-genome sequences of:
      - Ciona intestinalis
      - Drosophila melanogaster
      - Caenorhabditis elegans
      - Anopheles gambiae
  - No significant matches (however, the genes have clear homologs)
  - 43 CNEs show significant similarity to at least one other CNE (their genes have clear paralogous relationships)

Method
  - Computationally identify CNEs
  - Computationally analyze CNEs
  - Experimentally validate a few CNEs

Experimental Validation
  - Co-inject CNEs with a green fluorescent protein (GFP) reporter in zebrafish embryos
  - Idea:
      - CNEs contain something that affects the transcription of a transdev gene
      - The transdev gene affects development
  - Examine the ability of CNEs to up-regulate GFP reporter expression

Experimental Validation
  - Chose 25 regions for the GFP assay: 10 CNEs, 15 rCNEs
  - Look for GFP expression in live embryos
  - Controls: average of 200 embryos screened per control
      - No upregulation
  - Elements: average of 188 embryos screened per element
      - GFP expression in all but 2 elements; varied from 4% to 44%

SOX21 associated elements
  - Known:
      - SRY-related box gene
      - Acts as a transcriptional repressor during early development
      - Expressed in a complex manner in the CNS, and in the nasal epithelium, lens and retina of the eye, and inner ear

PAX6 associated elements
  - Known:
      - Paired-box containing transcription factor, known to be influenced by cis-acting elements in upstream, intronic and downstream positions
      - Expressed in developing eye, forebrain, hindbrain, spinal cord

HLXB9 associated elements
  - Known:
      - Homeobox gene associated with autosomal dominant effects
      - Zebrafish ortholog is expressed in notochord, hypochord, tail mesoderm, and tailbud

SHH associated elements
  - Known:
      - A signaling molecule
      - Zebrafish ortholog is expressed mainly in midline structures like floorplate and notochord, but also in branchial arches, pectoral fin buds, retina

Limitations
  - CNE-gene pairs may be misassociated, especially in gene-rich regions
      - The assay results give some indication of this
  - CNEs missed due to stringent whole-genome analysis
  - Down-regulation of expression will not be detected
  - Elements were assayed out of context and individually
  - Each element had cases of unexpected expression
  - Tissues made up of few cells are underrepresented
  - Late-developing tissues or cell types (after 24 h) will be missed completely

Summary
  - Identified a set of 1373 vertebrate CNEs
  - Experimentally showed CNE-transdev gene association
  - CNEs are found in clusters near transdev genes
  - CNEs act at large distances from coding sequence
  - The relative order and positions of CNEs are conserved
  - No vertebrate CNEs were found in invertebrates, even though the genes had clear homologs
  - Many of these results are paralleled by a similar paper (Sandelin et al. 2004)
      - >50 bp, >95% Human/Mouse identity
      - 3583 Human/Mouse/Pufferfish UCRs; ave length 125 bp

Discussion
  - Almost all CNEs are associated with developmental regulators
      - Do most transdev genes have CNEs associated?
  - CNEs act at large distances from the gene
      - They could be enhancers or silencers
  - The relative positioning and order of CNEs are completely conserved
      - Do they play a role in structuring the genomic architecture around transdev genes?

Discussion
  - No vertebrate CNEs are found in invertebrates
      - Are there CNEs in invertebrates?
  - But PAX6 in Drosophila has been shown to have a highly effective LE9 enhancer that is also well conserved in vertebrates (The Interactive Fly)
      - Why is it not found in this analysis?
      - Only 52 bp in length! (but MLAGAN should have found it ...)
  - So maybe invertebrate enhancers/CNEs are shorter
      - Should maybe look for shorter CNEs in vertebrates

Discussion
  - Missing whole-genome CNEs due to stringency of parameters
      - Try discontiguous MegaBLAST, which does not require an exact word match of 20
      - Only 109 of the 256 non-coding ultraconserved regions from Bejerano et al. are identified

Discussion
  - What is in the CNE?
      - Modules of transcription factor binding sites?
      - Regulatory RNAs? (e.g. microRNAs)
      - Hard to account for the high level of conservation.
      - Perform assays on portions of the CNEs.
      - Use computational methods.
      - Lack of EST evidence.
      - Use regulatory RNA gene finders?
      - Something else entirely?
  - One thing is in agreement: more functional studies are needed.

Discussion
  - Do CNEs work together?
      - How to robustly test combinations of elements?
  - Mutations in CNEs can cause human disease
      - Studies are showing that mutations in CNEs cause disorders.
      - CNEs at very distal locations can still affect transcription
      - May be candidates for genetic screens seeking sequence variation associated with disease
      - Check it out with dbSNP!

References & Acknowledgements
  - Thanks to Misha Bilenky for lots of fun discussion
  - Woolfe A, Goodson M, Goode DK, Snell P, McEwen GK, Vavouri T, Smith SF, North P, Callaway H, Kelly K, Walter K, Abnizova I, Gilks W, Edwards YJ, Cooke JE, Elgar G. Highly conserved non-coding sequences are associated with vertebrate development. PLoS Biol. 2005 Jan;3(1):e7. Epub 2004 Nov 11.
  - Elgar G. Identification and analysis of cis-regulatory elements in development using comparative genomics with the pufferfish, Fugu rubripes. Semin Cell Dev Biol. 2004 Dec;15(6):715-9.
  - Venkatesh B, Yap WH. Comparative genomics using fugu: a tool for the identification of conserved vertebrate cis-regulatory elements. Bioessays. 2005 Jan;27(1):100-7.
  - Sandelin A, Bailey P, Bruce S, Engstrom PG, Klos JM, Wasserman WW, Ericson J, Lenhard B. Arrays of ultraconserved non-coding regions span the loci of key developmental genes in vertebrate genomes. BMC Genomics. 2004 Dec 21;5(1):99.
  - The Interactive Fly. http://www.sdbonline.org/fly/aimain/1aahome.htm
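
Backup sketch for the "Obtaining CNEs" and "CNE Distribution" slides: a minimal Python illustration of the length filter (drop alignments < 100 bp) followed by grouping the surviving hits into clusters by genomic location. The `Hit` record, the function names, and especially the `MAX_GAP` value are assumptions made for illustration; the slides do not state the clustering criterion the authors used, so this is not their method.

```python
from dataclasses import dataclass
from typing import Dict, List


@dataclass(frozen=True)
class Hit:
    """One human-side alignment hit (hypothetical record layout)."""
    chrom: str
    start: int       # 0-based, inclusive
    end: int         # exclusive
    identity: float  # percent identity of the alignment


MIN_LENGTH = 100     # from the slides: alignments < 100 bp are removed
MAX_GAP = 500_000    # ASSUMPTION: illustrative gap threshold, not from the slides


def filter_hits(hits: List[Hit], min_length: int = MIN_LENGTH) -> List[Hit]:
    """Keep only hits that are at least min_length bp long."""
    return [h for h in hits if h.end - h.start >= min_length]


def cluster_hits(hits: List[Hit], max_gap: int = MAX_GAP) -> List[List[Hit]]:
    """Group hits on the same chromosome whose gap to the current cluster is <= max_gap."""
    by_chrom: Dict[str, List[List[Hit]]] = {}
    for h in sorted(hits, key=lambda x: (x.chrom, x.start)):
        clusters = by_chrom.setdefault(h.chrom, [])
        if clusters:
            # Furthest end seen so far in the open cluster (hits may overlap).
            cluster_end = max(x.end for x in clusters[-1])
            if h.start - cluster_end <= max_gap:
                clusters[-1].append(h)
                continue
        clusters.append([h])
    return [c for clusters in by_chrom.values() for c in clusters]


if __name__ == "__main__":
    example = [
        Hit("chr1", 0, 150, 90.0),
        Hit("chr1", 1_000, 1_050, 95.0),        # 50 bp: removed by the filter
        Hit("chr1", 200_000, 200_300, 85.0),
        Hit("chr1", 5_000_000, 5_000_120, 80.0),
        Hit("chr2", 10, 210, 88.0),
    ]
    for cluster in cluster_hits(filter_hits(example)):
        print(cluster[0].chrom, len(cluster))
```

A single-pass, sort-then-merge grouping is used here only because it is easy to read; any real analysis would need the authors' actual cluster definition and coordinates from the Ensembl release named on the slides.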