Philipp Bucher

Comprehensive analysis and probabilistic modeling of human
polyadenylation sites defined by EST sequences
Philipp Bucher
Swiss Institute for Experimental Cancer Research, CH-1066 Epalinges s/Lausanne, Switzerland
We have generated a comprehensive database of human polyadenylation sites by mapping
EST sequences to the human genome. The database contains over 150,000 entries which
very likely reflect true 3' ends of transcripts generated in vivo. We attribute this large excess
of distinct mRNA 3' ends over current gene number estimates to a previously
underestimated extent of alternative polyadenylation site usage. Because we observe a
strong correlation between the number of observed 3' end variants and the coverage of the
corresponding gene by ESTs, we assume that our current database still reflects only a part
of the real 3' end diversity of the human transcriptome, and that the number of entries will
increase further with continued EST sequencing. Comparative analysis of the genomic
sequences around polyadenylation sites revealed previously unnoticed sequence
features between the polyA signal and the RNA processing site. Using probabilistic
modeling, we constructed a hidden Markov model that recognizes polyadenylation
signals used in vivo with high sensitivity and selectivity. In many cases, the model also
accurately predicts the location of the most frequently used processing site downstream of a
predicted signal. This predictive accuracy indicates that the newly discovered sequence features are indeed
recognized by the eukaryotic mRNA 3' end processing machinery.
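
As a rough illustration of how a hidden Markov model can mark a signal-like stretch in a genomic sequence, the following minimal sketch runs Viterbi decoding over a toy two-state model. It is not the model described above: the two states ("B" for background, "S" for an A/T-rich signal-like region) and every probability in it are hypothetical values chosen only for demonstration.

```python
import math

# Hypothetical toy parameters (assumptions, not taken from the abstract).
STATES = ("B", "S")  # B = background, S = A/T-rich signal-like region
START_P = {"B": 0.9, "S": 0.1}
TRANS_P = {
    "B": {"B": 0.9, "S": 0.1},
    "S": {"B": 0.2, "S": 0.8},
}
EMIT_P = {
    "B": {"A": 0.25, "C": 0.25, "G": 0.25, "T": 0.25},
    "S": {"A": 0.60, "C": 0.05, "G": 0.05, "T": 0.30},
}


def viterbi(seq, states, start_p, trans_p, emit_p):
    """Return the most probable state path for seq, computed in log space."""
    if not seq:
        return ""
    # score[s] = best log-probability of any path that ends in state s
    score = {
        s: math.log(start_p[s]) + math.log(emit_p[s][seq[0]]) for s in states
    }
    backpointers = []
    for symbol in seq[1:]:
        new_score, pointers = {}, {}
        for s in states:
            # Best predecessor state for a transition into s
            prev = max(states, key=lambda p: score[p] + math.log(trans_p[p][s]))
            new_score[s] = (
                score[prev]
                + math.log(trans_p[prev][s])
                + math.log(emit_p[s][symbol])
            )
            pointers[s] = prev
        score = new_score
        backpointers.append(pointers)
    # Trace back from the best final state
    state = max(states, key=lambda s: score[s])
    path = [state]
    for pointers in reversed(backpointers):
        state = pointers[state]
        path.append(state)
    return "".join(reversed(path))


if __name__ == "__main__":
    print(viterbi("GCGCGCAATAAAGCGCGC", STATES, START_P, TRANS_P, EMIT_P))
```

A realistic model of the kind described here would need many more states, for example separate states for the signal, the spacer between signal and processing site, and the downstream region, with parameters estimated from the mapped 3' ends rather than set by hand.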