Estimating a useful flanking region size to screen for regulatory elements: mining NCBI databases for distribution of known gene regulatory regions

Recent reports (1,2) of gene-centric computational approaches for identification of putative regulatory regions have limited the upstream search space to 10 kb. Most potential regulatory sequences would be missed using a more traditional approach limited to studying only a few kilobases upstream of the transcription initiation site. An additional consideration is that many genes have poorly identified first exons and that first introns can exceed 10 kb.

To develop a more systematic estimate of the flanking region generally required for computational screening, we mapped known regulatory regions using NCBI Entrez resources. Using the Entrez Nucleotide queries “[FKEY] AND human[ORGN]” and “[FKEY] AND mouse[ORGN]” for human and mouse respectively, we downloaded all the NCBI GenBank sequences (209 human and 136 mouse) that had an explicitly defined regulatory region. Because some of the sequences had more than one enhancer defined, the 209 human sequences contained 239 enhancers, while the 136 mouse sequences contained 141. The “.gbk” (GenBank format) file of each of these sequences was then parsed to extract the enhancer regions as defined in the “feature” section of the file. Each enhancer was mapped to its corresponding gene, and its distance relative to the transcriptional start site (TSS) of the associated gene was calculated. Wherever an enhancer region described in a gbk file was shorter than 17 bp, flanking regions of 15 bp (upstream, downstream or both, depending upon the original sequence) were added to facilitate the BLAT search, for which the query sequence should be at least 17 bp long.

Thirty-two of the 239 human enhancers were localized to an intron, while 192 mapped to the upstream region of the respective gene.
Four were mapped to exonic regions: three (enhancers of the genes MNDA, OTC and ISG20) occur in the 5’ UTR, while one maps to a coding exon of the gene GRIN1. Eleven were found downstream of the 3’ end of the gene. A large majority (76%) were found within the upstream 10 kb region. This is of interest in light of recent findings in Drosophila, in which all the true-positive predictions were within 10 kb of a known or predicted TSS of a gene whose expression was regulated by five TFs involved in anterior-posterior embryonic patterning (3). The mouse enhancers showed a similar distribution (Figure 1).

Based on these results, we considered a flanking region of 40 kb to be, on average, a reasonable space to search for regulatory signals, even though regulatory regions are known to occur as far as 100 kb upstream or downstream. In some cases, promoters for genes not affected by the cis-regulatory module (CRM) are located between the CRM and its target (4). For example, the cis-regulatory region that directs mouse Pax6 expression in the developing pretectum, neural retina and olfactory region, located approximately 77 kb downstream of the 3’ polyA-addition site of the PAX6 gene, actually resides within the transcription unit of an unrelated neighboring gene, PAXNEB (5). Therefore, the formidable challenges of how much flanking sequence to search and how to associate predicted regulatory regions with their gene targets remain, and can be met only by experimental validation. Our mapping of known human and mouse enhancers relative to the TSS, together with the recent report that all true-positive Drosophila developmental enhancers occur within 10 kb of a known or predicted TSS (3), can add insight into predicted putative regulatory regions and their gene associations.
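The distance calculation and the comparison of 10 kb and 40 kb search windows discussed above can be sketched as follows. This is a minimal illustration under stated assumptions, not the pipeline used in this study: the function names, the use of the enhancer midpoint, the sign convention (negative values are upstream of the TSS in the orientation of the gene) and the toy coordinates are all hypothetical.

```python
def distance_to_tss(enh_start, enh_end, tss, strand):
    """Signed distance (bp) from the TSS to the enhancer midpoint.

    Negative = upstream of the TSS, positive = downstream, measured in the
    orientation of the gene. Using the midpoint is an assumption.
    """
    midpoint = (enh_start + enh_end) // 2
    offset = midpoint - tss
    # On the minus strand, higher genomic coordinates are upstream.
    return offset if strand == "+" else -offset


def fraction_within_upstream(distances, window):
    """Fraction of enhancers lying within `window` bp upstream of the TSS."""
    if not distances:
        return 0.0
    hits = sum(1 for d in distances if -window <= d <= 0)
    return hits / len(distances)


# Toy (hypothetical) records: (enhancer start, enhancer end, TSS, strand)
toy = [
    (9_000, 9_200, 10_000, "+"),    # about 0.9 kb upstream
    (50_000, 50_400, 30_000, "-"),  # minus strand, about 20 kb upstream
    (12_000, 12_100, 10_000, "+"),  # about 2 kb downstream
    (5_000, 5_100, 60_000, "+"),    # about 55 kb upstream
]
distances = [distance_to_tss(*rec) for rec in toy]
for window in (10_000, 40_000):
    print(window, fraction_within_upstream(distances, window))
```

Widening the window from 10 kb to 40 kb captures the second toy enhancer but still misses the most distant one, which mirrors the trade-off described in the text.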
Proximity is not always a reliable indicator of the target of a predicted CRM, and a considerable amount of DNA around any gene needs to be examined for a potential role in regulation (4). Thus, our approach of extending the flanking regions and making precomputed potential regulatory regions available for them gives researchers “extra space” to work with, especially in cases where the immediate flanking regions lack any critical regulatory regions. The problem remains, though, for cases like PAX6, cited above, in which regulatory regions are known to occur further upstream or downstream of the gene and thus lie outside the 40 kb flanking regions. In these cases, the manually curated TraFaC server and similar servers can still be used to identify highly conserved regions after including larger flanking regions in the analysis.

REFERENCES

1. Jegga, A.G., Sherwood, S.P., Carman, J.W., Pinski, A.T., Phillips, J.L., Pestian, J.P. and Aronow, B.J. (2002) Detection and visualization of compositionally similar cis-regulatory element clusters in orthologous and coordinately controlled genes. Genome Res, 12, 1408-1417.
2. Dieterich, C., Cusack, B., Wang, H., Rateitschak, K., Krause, A. and Vingron, M. (2002) Annotating regulatory DNA based on man-mouse genomic comparison. Bioinformatics, 18 Suppl 2, S84-90.
3. Berman, B.P., Pfeiffer, B.D., Laverty, T.R., Salzberg, S.L., Rubin, G.M., Eisen, M.B. and Celniker, S.E. (2004) Computational identification of developmental enhancers: conservation and function of transcription factor binding-site clusters in Drosophila melanogaster and Drosophila pseudoobscura. Genome Biol, 5, R61.
4. Miller, W., Makova, K.D., Nekrutenko, A. and Hardison, R.C. (2004) Comparative genomics. Annu Rev Genomics Hum Genet, 5, 15-56.
5. Griffin, C., Kleinjan, D.A., Doe, B. and van Heyningen, V. (2002) New 3' elements control Pax6 expression in the developing pretectum, neural retina and olfactory region. Mech Dev, 112, 89-100.

Figure 1: Distribution of human (red) and mouse (blue) enhancers with respect to the transcription start of the nearest neighboring gene. Axes: Distance Relative to TSS (-100000 to 100000, ticks every 10000); No. of Enhancers (human/mouse).
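A distribution like the one in Figure 1 can be tabulated by counting enhancers in fixed-width distance bins. The sketch below assumes 10 kb bins spanning -100 kb to +100 kb, chosen to match the axis ticks; the function name, the bin labeling by lower edge and the toy distances are hypothetical and are not data from this study.

```python
from collections import Counter

BIN_WIDTH = 10_000   # assumed bin size, matching the axis tick spacing
LIMIT = 100_000      # assumed plotting range on either side of the TSS


def bin_distances(distances, bin_width=BIN_WIDTH, limit=LIMIT):
    """Count signed distances per bin; each bin is keyed by its lower edge.

    A distance d falls in the bin [edge, edge + bin_width). Values outside
    [-limit, limit) are ignored.
    """
    counts = Counter()
    for d in distances:
        if -limit <= d < limit:
            # Floor division rounds toward minus infinity, so negative
            # (upstream) distances land in the correct bin.
            edge = (d // bin_width) * bin_width
            counts[edge] += 1
    return dict(sorted(counts.items()))


# Toy (hypothetical) signed distances in bp; negative = upstream of the TSS
toy_distances = [-900, -20_200, 2_050, -54_950, -9_999, 150_000]
print(bin_distances(toy_distances))
```

Running the same function separately on the human and mouse distances would give the two series that such a figure compares.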