Laboratory of Regulatory Genomics

Project participants: Kirill Babeev, Sophia Buyanova, Maria Sysoeva

Project leader: Ivan Kulakovsky

Scientific advisor: Irina Eliseeva

Within the framework of the Biological School, we attempted to carry out a small but complete bioinformatic study involving computer analysis of regulatory biopolymer sequences (DNA and RNA). The participants were expected to learn how to work with major databases (gene identifiers, gene mapping, etc.) and how to write basic programs in a scripting language. Then, using the acquired knowledge, they were to reproduce a recently published result on the analysis of regulatory sequences.

The object of study was mRNA targets of the mTOR cascade, which have been recently reported to contain a new pyrimidine-rich regulatory motif:

Hsieh AC, et al. The translational landscape of mTOR signalling steers cancer initiation and metastasis. Nature. 2012 Feb 22;485(7396):55-61. doi: 10.1038/nature10912.

Course of the research project

To share and discuss the results, we actively used the document-sharing facilities of Google Documents. For example, all the participants maintained a shared laboratory journal (this collaborative report is based upon it).

The task we chose turned out to be quite complex, for reasons both typical (the need to hand-edit the list of target genes, to map gene identifiers between different versions of genomic annotations, etc.) and unexpected (a poor data format and errors in the data analysis in the original paper). Another difficulty was the need to solve problems at two technical levels (the original scripts written within the framework of the project and the existing analytic tools) while simultaneously analyzing the data at two system levels (transcription and translation).

By the middle of the School, it became clear that we could not reproduce the published result.

To understand the causes of our failure, we had to make prompt use of an additional open source of experimental data. A simple training exercise (to reproduce a published result) thereby grew into a real scientific study (to find out what is happening and why the published result cannot be reproduced).

Given below is a scientific report written by the participants after the project completion (with minimal editing by the project leader and scientific advisor).

Further analysis and discussion of the project results, as well as thorough comparison with the literature data were conducted via e-mail communication in the fall, after the School was finished.

Computer analysis of the 5’UTR regulatory sequences in the mRNA targets of the mTOR signal cascade

Definitions and abbreviations

TOP: terminal oligopyrimidine tract

PRTE: pyrimidine-rich translational element

TORTE: terminal oligopyrimidine regulatory translational element

upstream: localized in the 5’ region of the sequence

downstream: localized in the 3’ region of the sequence

UTR: untranslated region

TSS: transcription start site

CAP: the modified nucleotide at the 5’ end of mRNA

Inhibition: deceleration, suppression

1. Introduction

In the life of the eukaryotic cell, an important role is played by the regulatory cascade controlled by the protein mTOR (mammalian target of rapamycin). The mTOR cascade performs regulatory functions at the level of both transcription and translation. At the level of translation, mTOR activates the translation of many ribosomal proteins and of translation initiation and elongation factors. Of special interest are the specific features of the UTR sequences of those mRNAs whose translation is inhibited when the mTOR signal cascade is switched off.

Putative structure of 5’UTR mRNA targets of mTOR

From the paper [Hsieh et al., 2012], we know what the structure of the 5’UTR is supposed to be:

TOP, a short CT-rich sequence of 5 or more nucleotides at the beginning of the 5’UTR

PRTE, a CT-rich sequence of 10 or more nucleotides in the middle or at the end of the 5’UTR

In addition, the authors proposed that the PRTE sequence contains a 100% conserved nucleotide U at position 6 and that the sequence has no positional preference within the 5’UTR.

2. Methods

2.1. Preliminary work with the sample set of target genes and the name-ID matching

The sample set of target genes was taken from the paper [Hsieh et al., 2012]. In this paper, ribosome profiling was applied to cells of the PC-3 line (human prostate cancer) to identify the mRNAs whose translation was substantially inhibited by chemical agents blocking the mTOR signal cascade. On the basis of the mRNA list given in Table 5 (see supplementary materials to [Hsieh et al., 2012]), we selected 144 targets whose translation was significantly suppressed upon switching mTOR off. Using the “Custom Downloads” procedure of the site http://www.genenames.org, we retrieved a complete table of “gene name – database ID” matches. We were interested in the UCSC known gene ID.

Using Ruby, we wrote a script that automatically matched the gene names with the UCSC gene IDs according to the latest assembly of the human genome, hg19. Not all the genes were matched with IDs automatically; for some of them, we had to do it manually, using the UCSC Genome Browser web resource (http://genome.ucsc.edu/, then the “Genomes” link). Later, we also needed to map names and IDs for the previous assembly, hg18, so we modified the scripts accordingly.
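The name-to-ID matching step can be illustrated with a minimal Ruby sketch. It is not the script written during the project: the two-column tab-separated layout, the function names and the IDs in the toy table are all assumptions made for illustration, and a real export from genenames.org may be organized differently.

```ruby
# Hypothetical input: a tab-separated table with a header line and two
# columns, gene symbol and UCSC ID (the layout is an assumption).
def load_symbol_to_ucsc(table_text)
  mapping = {}
  table_text.each_line.drop(1).each do |line|
    symbol, ucsc_id = line.chomp.split("\t")
    next if symbol.nil? || symbol.empty?
    # Keep only the rows that actually carry an ID.
    mapping[symbol.upcase] = ucsc_id unless ucsc_id.nil? || ucsc_id.empty?
  end
  mapping
end

# Split the gene list into automatically matched names and names that
# still have to be resolved by hand in the genome browser.
def match_genes(gene_names, mapping)
  matched = {}
  unmatched = []
  gene_names.each do |name|
    id = mapping[name.strip.upcase]
    if id
      matched[name] = id
    else
      unmatched << name
    end
  end
  [matched, unmatched]
end

# Toy table; the IDs are placeholders, not real UCSC identifiers.
table = "Symbol\tUCSC ID\nRPL28\tuc_placeholder_1\nEEF2\tuc_placeholder_2\nNCLN\t\n"
mapping = load_symbol_to_ucsc(table)
matched, unmatched = match_genes(%w[RPL28 EEF2 NCLN], mapping)
puts "matched: #{matched.size}; to resolve manually: #{unmatched.join(', ')}"
```

Returning the unmatched names separately mirrors the workflow described above, in which the remaining genes were looked up manually.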

2.2. Constructing sets of 5’UTR sequences

5’UTR sequences were extracted from the UCSC database (http://genome.ucsc.edu/cgi-bin/hgTables?command=start). We created 3 sets: (1) UTR “as is” in the database; (2) with a 5-nucleotide upstream sequence; and (3) with a 10-nucleotide upstream sequence. We worked with a database of DNA sequences, which gave us the possibility to analyze the regions upstream of the 5’UTR (available in the genome). Therefore, when we speak about thymine (T) in DNA, it corresponds to uracil (U) in mRNA.

Using Ruby scripts, we searched the full-genome set for the mRNA sequences corresponding to the test sample (i.e., the names of the genes whose mRNA translation is inhibited when the mTOR signal cascade is blocked).
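How an upstream extension is added to a 5’UTR can be shown with a short sketch. In the project the extended sequences were obtained from the UCSC database; the code below only illustrates the idea on a toy string, and the function names and the 0-based, end-exclusive coordinate convention are assumptions.

```ruby
# Reverse complement of a DNA string (A, C, G, T only).
def reverse_complement(dna)
  dna.upcase.reverse.tr('ACGT', 'TGCA')
end

# chrom_seq: forward-strand sequence; utr_start...utr_end: 0-based,
# end-exclusive coordinates of the annotated 5'UTR; upstream: number of
# extra nucleotides to take before the annotated transcription start.
def utr_with_upstream(chrom_seq, utr_start, utr_end, strand, upstream)
  if strand == '+'
    from = [utr_start - upstream, 0].max
    chrom_seq[from...utr_end].upcase
  else
    # On the reverse strand, "upstream" lies at higher coordinates.
    to = [utr_end + upstream, chrom_seq.length].min
    reverse_complement(chrom_seq[utr_start...to])
  end
end

# DNA to mRNA notation: thymine (T) corresponds to uracil (U).
def to_mrna(dna)
  dna.upcase.tr('T', 'U')
end

chrom = 'GGCTCTTTCCAGATGAAA' # toy sequence, not genomic data
puts utr_with_upstream(chrom, 5, 12, '+', 0)
puts utr_with_upstream(chrom, 5, 12, '+', 3)
puts utr_with_upstream(chrom, 5, 12, '-', 3)
```

The three calls correspond to the UTR "as is" and to the extended sets, for a gene on the forward and on the reverse strand.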

2.3. Preliminary search for a CT-rich motif

Using the packages XXMotif (www.xxmotif.genzentrum.lmu.de), SeSiMCMC (www.favorov.bioinfolab.net/SeSiMCMC/) and ChIPMunk (www.autosome.ru/ChIPMunk), we performed a preliminary search for motifs in the 5’UTR sequences and found that many of them do contain a CT-rich motif, which can be either a TOP or a PRTE sequence.

2.4. Search for TOP and PRTE motifs with regular expressions

To estimate the extent to which the test sample of mTOR-dependent mRNA is enriched with leader TOP or internal PRTE sequences, we chose simple models, namely, regular expressions. The presence of TOP was checked in leader sequences of 10 and 20 nucleotides.
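A minimal sketch of such a check in Ruby is given here. The three regular expressions are those listed in the tables of this section. The way the minimum length is enforced (by measuring each match) and all names in the code are assumptions, because the original scripts are not reproduced in this report.

```ruby
# TOP models: a regular expression plus a minimum match length.
TOP_MODELS = {
  ct_only:        { regex: /[CT]+/,                     min_length: 5 },
  one_insertion:  { regex: /[CT]+[AG]?[CT]+/,           min_length: 6 },
  two_insertions: { regex: /[CT]+[AG]?[CT]+[AG]?[CT]+/, min_length: 7 }
}.freeze

# True if the first leader_length letters of the UTR contain a match of
# the model that is at least min_length letters long.
def top_in_leader?(utr, leader_length, model)
  leader = utr.upcase[0, leader_length]
  leader.scan(model[:regex]).any? { |hit| hit.length >= model[:min_length] }
end

# Number of sequences in a set whose leader satisfies the model.
def count_top(utrs, leader_length, model)
  utrs.count { |utr| top_in_leader?(utr, leader_length, model) }
end

# Toy sequences, not taken from the real data set.
utrs = %w[CTTTCCAGGATG AGGCTAGGAGGA CTTACTAGGGAA]
TOP_MODELS.each do |name, model|
  puts "#{name}: #{count_top(utrs, 10, model)} of #{utrs.size}"
end
```

Applying the same function to the test sample and to the full-genome set, with leader lengths of 10 and 20, would fill one row of each table.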

Check for TOP

In the tables below, the columns "10 nt" and "20 nt" refer to the leader length (10 or 20 nucleotides); "test" is the test sample and "genome" is the full-genome mRNA set.

5’UTR without upstream

Total sequences: test sample, 144; genome set, 50366 unique (69050 non-unique)

Model                                      Regular expression         10 nt, test    10 nt, genome  20 nt, test    20 nt, genome
Only CT, minimum 5 letters                 [CT]+                      55             8322           74             16285
1 insertion (A or G) permitted, 6 letters  [CT]+[AG]?[CT]+            65             11578          89             22187
2 insertions, 7 letters                    [CT]+[AG]?[CT]+[AG]?[CT]+  65             11851          91             23225

Extended 5’UTR (5 nucleotides upstream)

Total sequences: test sample, 144; genome set, 51613 unique (non-unique = 80922)

Model                                      Regular expression         10 nt, test    10 nt, genome  20 nt, test    20 nt, genome
Only CT, minimum 5 letters                 [CT]+                      52             8061           75             16634
1 insertion (A or G) permitted, 6 letters  [CT]+[AG]?[CT]+            55             10004          94             21595
2 insertions, minimum 7 letters            [CT]+[AG]?[CT]+[AG]?[CT]+  55             11284          93             23306

Extended 5’UTR (10 nucleotides upstream)

Total sequences: test sample, 144; genome set, 59911 unique (non-unique = 80922)

Model                                      Regular expression         10 nt, test    10 nt, genome  20 nt, test    20 nt, genome
Only CT, minimum 5 letters                 [CT]+                      34             13217          78             20471
1 insertion (A or G) permitted, 6 letters  [CT]+[AG]?[CT]+            47             17647          94             27632
2 insertions, minimum 7 letters            [CT]+[AG]?[CT]+[AG]?[CT]+  47             18149          96             28834

The preliminary results agree with the estimates from [Hsieh et al., 2012]: approximately 90 mRNAs contain a TOP sequence.

Moreover, the test sample is indeed enriched with leader TOPs. Taking the 5’UTRs with 10 nucleotides upstream as an example, let us calculate their occurrence (leader of 20 nucleotides, model with 1 insertion):

Occurrence of TOP-containing sequences in the test sample:

94/144=0.65

Occurrence of TOP-containing sequences in the genome:

27632/59911=0.46
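The same arithmetic as a small Ruby sketch, using only the counts quoted above. The function name and the enrichment ratio in the last line are illustrative additions; no statistical test is implied.

```ruby
# Fraction of sequences in a set that contain the motif.
def occurrence(hits, total)
  hits.to_f / total
end

test_freq   = occurrence(94, 144)       # test sample, leader 20 nt
genome_freq = occurrence(27_632, 59_911) # full-genome mRNA, leader 20 nt

puts format('test sample: %.2f', test_freq)
puts format('genome:      %.2f', genome_freq)
puts format('ratio:       %.2f', test_freq / genome_freq)
```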

Test for the presence of PRTE

When checking for the presence of PRTE, we cut off the 20-letter leader segment of mRNA that was searched for TOP at the previous stage.

                    PRTE-containing sequences   Too short UTR   Total number of sequences
Test sample         90                          1               144
Full-genome mRNA    33244                       10800           59911

The regular expression /[CT]+[AG]?[CT]+[AG]?[CT]+/ for PRTE defines a CT sequence in which up to two positions may be occupied by A or G instead of C or T; in addition, we controlled the length (not less than 12 letters).

Occurrence of PRTE-containing sequences:

Test sample: 90/(144-1) = 0.63

Full-genome mRNA: 33244/(59911-10800) = 0.68

Thus, we do not see any differences between the full-genome mRNA and the test sample.
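The PRTE test described in this subsection can be sketched as follows. The regular expression, the 20-letter leader cut-off and the 12-letter minimum are taken from the text. The rule for calling a UTR "too short" (fewer than 12 letters left after the leader is removed) and all function names are assumptions.

```ruby
PRTE_REGEX      = /[CT]+[AG]?[CT]+[AG]?[CT]+/
LEADER_LENGTH   = 20 # leader already examined for TOP
MIN_PRTE_LENGTH = 12

# Returns :too_short, true (PRTE found) or false (no PRTE).
def prte_status(utr)
  body = utr.upcase[LEADER_LENGTH..-1]
  # Assumed criterion: the remainder cannot hold a 12-letter motif.
  return :too_short if body.nil? || body.length < MIN_PRTE_LENGTH

  body.scan(PRTE_REGEX).any? { |hit| hit.length >= MIN_PRTE_LENGTH }
end

# Occurrence = hits / (total - too short), as in the calculation above.
def prte_occurrence(utrs)
  statuses  = utrs.map { |utr| prte_status(utr) }
  too_short = statuses.count(:too_short)
  hits      = statuses.count(true)
  usable    = utrs.size - too_short
  { hits: hits, too_short: too_short,
    occurrence: usable.zero? ? 0.0 : hits.to_f / usable }
end

# Toy sequences: a 20-letter leader followed by the part that is searched.
leader = 'A' * 20
utrs = [leader + 'CTCTTTCCTCTCTAGG', leader + 'AGGAGGAGGAGGAGG', 'CTCTCT']
p prte_occurrence(utrs)
```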

2.5. Motif search using MEME: distance lookup

As a control test, we decided to construct PRTE motifs from the UTR sequences with upstream regions of 10 nucleotides. For this purpose, we used the MEME package (http://meme.sdsc.edu/meme/intro.html), which was originally applied by the authors [Hsieh et al., 2012] to reveal the PRTE sequence.

5’UTR sequence on the basis of the genome assembly hg18:

The motif is found in 97 sequences; in 20 cases, it is at the beginning of the sequence (suspected to be TOP).

5’UTR sequence on the basis of the genome assembly hg18:

The motif is found in 110 sequences; in ~20 cases, it is at the beginning of the sequence (suspected to be TOP).

2.6. Annotation of TSS and revealing TORTE

For accurate determination of the PRTE position relative to the transcription start, databases that specify a single start coordinate are not sufficient. As shown by the data obtained using the hCAGE technology within the framework of the FANTOM project (http://fantom.gsc.riken.jp/zenbu/gLyphs/#config=yiEiVQVLIlvlWVcT0KsiWB;loc=hg18::chr1:42919983..42921680, see “all hCAGE data”), the transcription starts of many genes are rather diffuse.

Thus, many mRNAs transcribed from the same gene will have quite different leader sequences in their 5’UTR. Moreover, the transcription starts annotated in the existing databases (e.g., in the UCSC known gene set that we used) often miss the actual starts.

A careful consideration of some genes of mTOR-dependent mRNAs (see the section “Results”) shows that the typical width of a transcription start region is at least 3-4 nucleotides. This means that most mRNAs of a gene will have different leader sequences. For all mRNAs of a gene to be mTOR-dependent, each of them must carry a CT leader.

This, in turn, is possible if a CT-rich sequence extends upstream of the transcription start. It is exactly such a wide CT-rich region that we observe upstream of many transcription starts, and it is this region that was erroneously marked in [Hsieh et al., 2012] as a new regulatory PRTE element.

Unfortunately, the hCAGE data (accurate mapping of TSS) were published only for the cell lines THP-1 and HeLa. We cannot, therefore, say for sure which TSS were used for the mRNAs inhibited in the experiment with the PC-3 line conducted by Hsieh et al. However, we carefully registered the cases of alternative minor TSS and TORTE sequences in the vicinity of transcription starts.

Results and Discussion

TORTE: a key regulatory element giving rise to TOP in the 5’UTR mRNA targets of the mTOR signal cascade

Our results show that PRTE is, in fact, a TORTE sequence, which at the stage of transcription serves as a generator of TOP sequences in most mRNA. The start mapped in the database is an artifact, which entails discovering “inner” PRTE sequences whose regulatory role is difficult to explain. The TORTE sequence exists in DNA. The TOP sequence is a fragment of TORTE.

The size of the TORTE fragment that ends up in the 5’UTR of the mRNA and becomes the TOP sequence depends on the location of the transcription start.
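This dependence can be illustrated with a toy Ruby sketch. The sequence is invented, the positions are 0-based indices into it, and the function names are assumptions. The sketch only shows how the same genomic CT-rich stretch yields TOP leaders of different length, or an apparently "inner" CT-rich element, depending on where transcription is taken to start.

```ruby
# Length of the uninterrupted C/T run at the very start of a transcript.
def leading_ct_run(seq)
  seq.upcase[/\A[CT]+/].to_s.length
end

# For each candidate TSS (0-based index into the sense strand), report
# the TOP length of the transcript that starts there.
def top_lengths_by_tss(sense_dna, tss_positions)
  tss_positions.map { |tss| [tss, leading_ct_run(sense_dna[tss..-1] || '')] }.to_h
end

# Toy locus: 5 nt of flanking DNA, a 13-nt CT-rich stretch (TORTE), then
# the rest of the 5'UTR. Not a real genomic sequence.
dna = 'GAGGA' + 'CTCTTTCCTCTTC' + 'AGGATGGCC'

# Starts inside the stretch (5, 8, 12) give TOP leaders of decreasing
# length; a start placed upstream of it (2) gives a transcript with no
# TOP and a CT-rich element inside the 5'UTR instead.
p top_lengths_by_tss(dna, [2, 5, 8, 12, 18])
```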

Statistics of the revealed TORTE

1. A TORTE > 10 nucleotides was found for 75 genes immediately upstream of the major start (according to the data of the hCAGE browser Zenbu: http://fantom.gsc.riken.jp/zenbu/gLyphs/#config=yiEiVQVLIlvlWVcT0KsiWB;loc=hg18::chr1:42919983..42921680).

2. A small TORTE (6-10 nucleotides) was found for 19 genes upstream of the major transcription start.

3. For 18 genes, a TORTE was found upstream of an alternative start.

4. For 32 genes, no evident TORTE was found; in some cases, the regions near the transcription starts are slightly enriched with CT.

Groups 1-3 contain ribosomal proteins and factors of translation initiation and elongation. Group 4 consists of various proteins.
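The grouping above was made by inspecting the genes in the genome browser. Purely as an illustration, the same rules could be written down as the following Ruby sketch. The length thresholds for groups 1 and 2 come from the list above. The way the CT run is measured, the 6-nucleotide threshold for an alternative start and all names are assumptions.

```ruby
def ct?(base)
  !base.nil? && 'CT'.include?(base.upcase)
end

# Length of the uninterrupted C/T run that contains the TSS or ends
# directly upstream of it (tss is a 0-based index into the sense strand).
def ct_run_around(sense_dna, tss)
  left = tss
  left -= 1 while left > 0 && ct?(sense_dna[left - 1])
  right = tss
  right += 1 while right < sense_dna.length && ct?(sense_dna[right])
  right - left
end

# Group 1: TORTE > 10 nt at the major start; group 2: 6-10 nt at the
# major start; group 3: TORTE only at an alternative start (threshold
# assumed to be 6 nt); group 4: no evident TORTE.
def torte_group(major_len, alternative_len)
  return 1 if major_len > 10
  return 2 if major_len >= 6
  return 3 if alternative_len >= 6

  4
end

dna = 'GAGGA' + 'CTCTTTCCTCTTC' + 'AGGATGGCC' # toy sequence
puts torte_group(ct_run_around(dna, 8), 0)

# The four groups reported above cover the whole test sample.
puts [75, 19, 18, 32].sum
```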

An example of TSS of group 1: RPL28 (a gene on the forward strand)

An example of TSS of group 2: EEF2 (a gene on the reverse strand)

An example of TSS of group 3: RPS25 (a gene on the reverse strand)

An example of TSS of group 4: NCLN (a gene on the forward strand)

In conclusion, we would like to draw attention to the group of genes that do not contain TORTE motifs in the region of TSS. We suppose that the corresponding mRNA are regulated by a fundamentally different mechanism, perhaps, via 3’UTR. When analyzed separately, the 5’UTR sequences of these genes do not contain an evident common motif, and preliminary analysis of the 3’UTR sequences showed the presence of both C- and G-rich motifs (it is possible that they correspond to some secondary structures).

Additional observations

We checked how well the major transcription starts of the mTOR-cascade target genes are annotated in the database and found that 63 genes are annotated more or less properly (the real and mapped starts are not more than 10 nucleotides apart); for 70 genes, the distance is from 10 to 100 nucleotides; and 11 genes are annotated very poorly (the real start is more than 100 nucleotides away from the mapped one).
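The three annotation classes can be expressed as a short Ruby sketch. The distance thresholds are those given above, while the function name, the class labels and the toy coordinates are assumptions.

```ruby
# Classify a gene by the distance between the annotated TSS and the
# major TSS observed in the hCAGE data (both as genomic coordinates).
def annotation_quality(annotated_tss, observed_tss)
  distance = (annotated_tss - observed_tss).abs
  if distance <= 10
    :good      # not more than 10 nucleotides apart
  elsif distance <= 100
    :moderate  # from 10 to 100 nucleotides
  else
    :poor      # more than 100 nucleotides
  end
end

# Toy coordinates, not real gene positions.
pairs = [[1000, 1004], [2000, 2050], [3000, 3500]]
tally = Hash.new(0)
pairs.each { |annotated, observed| tally[annotation_quality(annotated, observed)] += 1 }
p tally

# The counts reported above cover the whole test sample of 144 genes.
puts 63 + 70 + 11
```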
