Comprehensive analysis and probabilistic modeling of human polyadenylation sites defined by EST sequences

Philipp Bucher
Swiss Institute for Experimental Cancer Research, CH-1066 Epalinges s/Lausanne, Switzerland

We have generated a comprehensive database of human polyadenylation sites by mapping EST sequences to the human genome. The database contains over 150,000 entries, which very likely reflect true 3' ends of transcripts generated in vivo. We attribute the large excess of distinct mRNA 3' ends over current gene number estimates to alternative polyadenylation site usage, whose extent has previously been underestimated. Because we observe a strong correlation between the number of observed 3' end variants and the EST coverage of the corresponding gene, we assume that the current database still captures only part of the real 3' end diversity of the human transcriptome, and that the number of entries will increase further with continued EST sequencing. Comparative analysis of the genomic sequences around polyadenylation sites revealed previously unnoticed sequence features between the polyA signal and the RNA processing site. Using probabilistic modeling, we constructed a hidden Markov model that recognizes polyadenylation signals used in vivo with high sensitivity and selectivity. In many cases, the model also accurately predicts the location of the most frequently used processing site downstream of a predicted signal. This indicates that the newly discovered sequence features are indeed recognized by the eukaryotic mRNA 3' end processing machinery.
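
To illustrate the general idea of recognizing a polyadenylation signal with a hidden Markov model, the following minimal sketch decodes a DNA sequence with the Viterbi algorithm. It is not the model described above. The state layout (one background state plus a six-state chain for the canonical AATAAA hexamer), all probabilities, and the function names are assumptions chosen for illustration only, and the sketch does not model the sequence features between the signal and the processing site.

```python
import math

# Toy HMM: a background state plus a six-state chain for the canonical
# AATAAA hexamer. All probabilities below are illustrative assumptions,
# not parameters estimated from the EST-derived database.
HEXAMER = "AATAAA"
STATES = ["bg"] + [f"sig{i}" for i in range(6)]
NEG_INF = float("-inf")


def log(p):
    """Natural log that maps probability 0 to -inf instead of raising."""
    return math.log(p) if p > 0 else NEG_INF


def emission(state, base):
    """Background emits uniformly; signal states favor their hexamer base."""
    if state == "bg":
        return 0.25
    expected = HEXAMER[int(state[3:])]
    return 0.97 if base == expected else 0.01


def transition(prev, cur):
    """Background may enter the signal; the signal chain runs left to right."""
    if prev == "bg":
        return {"bg": 0.99, "sig0": 0.01}.get(cur, 0.0)
    i = int(prev[3:])
    if i < 5:
        return 1.0 if cur == f"sig{i + 1}" else 0.0
    return 1.0 if cur == "bg" else 0.0


def viterbi(seq):
    """Return the most probable state path for a non-empty DNA string."""
    # The path is assumed to start in the background state.
    score = {
        s: (0.0 if s == "bg" else NEG_INF) + log(emission(s, seq[0]))
        for s in STATES
    }
    back = []
    for base in seq[1:]:
        new_score, pointers = {}, {}
        for cur in STATES:
            best_prev = max(
                STATES, key=lambda p: score[p] + log(transition(p, cur))
            )
            new_score[cur] = (
                score[best_prev]
                + log(transition(best_prev, cur))
                + log(emission(cur, base))
            )
            pointers[cur] = best_prev
        score = new_score
        back.append(pointers)
    # Trace back from the best final state.
    state = max(STATES, key=lambda s: score[s])
    path = [state]
    for pointers in reversed(back):
        state = pointers[state]
        path.append(state)
    path.reverse()
    return path


def predicted_signals(seq):
    """0-based start positions of hexamers assigned to the signal chain."""
    return [i for i, s in enumerate(viterbi(seq)) if s == "sig0"]
```

With these toy parameters, `predicted_signals("GCGCGCAATAAAGCGTCG")` returns `[6]`, the start of the embedded hexamer, and a sequence without the hexamer such as `"GCGCGCGCGC"` yields an empty list. A realistic model would need parameters trained on the mapped 3' ends and additional states for the region between the signal and the processing site.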