A researcher studies the binding sites of a transcription factor X. He conducts a ChIP-Seq experiment, in which X is bound to the genome and the sequences to which the protein binds are sequenced. To find the binding-site motif, the researcher then runs MEME.
1. The following multiple sequence alignment shows the motif occurrences found in a subset of the ChIP-Seq sequences. Represent the motif in 3 different ways:
a. A count matrix
b. A probability matrix
c. A consensus sequence
AGGGCAGCTT
ACGACTGCTG
CTGGCTGCTA
ATGACTGCTG
AGGACTGCTC
CCGGCAGCTG
ATGGCTGCTC
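The three representations can be derived mechanically from the alignment. The following is a minimal Python sketch, assuming no pseudocounts and a simple majority-vote consensus; the function names (`count_matrix`, `probability_matrix`, `consensus`) are illustrative and are not part of MEME.

```python
ALPHABET = "ACGT"

# Motif occurrences copied from the alignment above.
SITES = [
    "AGGGCAGCTT",
    "ACGACTGCTG",
    "CTGGCTGCTA",
    "ATGACTGCTG",
    "AGGACTGCTC",
    "CCGGCAGCTG",
    "ATGGCTGCTC",
]


def count_matrix(sites):
    """Count how often each base occurs in each alignment column."""
    length = len(sites[0])
    if any(len(site) != length for site in sites):
        raise ValueError("all sites must have the same length")
    return {
        base: [sum(site[i] == base for site in sites) for i in range(length)]
        for base in ALPHABET
    }


def probability_matrix(counts, n_sites):
    """Turn counts into per-column frequencies (assumption: no pseudocounts)."""
    return {base: [c / n_sites for c in row] for base, row in counts.items()}


def consensus(counts):
    """Pick the most frequent base per column; ties go to the first base in ALPHABET."""
    length = len(counts[ALPHABET[0]])
    return "".join(
        max(ALPHABET, key=lambda base: counts[base][i]) for i in range(length)
    )


counts = count_matrix(SITES)
probs = probability_matrix(counts, len(SITES))
for base in ALPHABET:
    print(base, counts[base], [round(p, 2) for p in probs[base]])
print("consensus:", consensus(counts))
```

A majority-vote consensus hides columns where two bases are almost equally frequent; a degenerate (IUPAC) consensus is an alternative if that information should be kept.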
2. The researcher then ran MAST on 50 sequences. 25 of them are known to be bound by protein X and 25 of them are known to be unbound by protein X. The following tables contain the MAST results from running the motif found in step 1 against the 50 sequences.
a. What are the false positive, false negative, true positive and true negative rates of MAST?
b. Describe the quality of this MAST run in terms of sensitivity and specificity.
Sequences that bind protein X
Sequence ID    MAST result
1              Motif found
2              Motif found
3              Motif found
4              Motif found
5              Motif found
6              Motif found
7              Motif found
8              Motif found
9              Motif not found
10             Motif found
11             Motif found
12             Motif found
13             Motif found
14             Motif found
15             Motif found
16             Motif found
17             Motif found
18             Motif not found
19             Motif found
20             Motif found
21             Motif found
22             Motif found
23             Motif found
24             Motif found
25             Motif found
Sequences that don't bind protein X
Sequence ID    MAST result
26             Motif not found
27             Motif found
28             Motif not found
29             Motif not found
30             Motif not found
31             Motif found
32             Motif not found
33             Motif found
34             Motif not found
35             Motif not found
36             Motif found
37             Motif found
38             Motif found
39             Motif not found
40             Motif not found
41             Motif found
42             Motif not found
43             Motif found
44             Motif not found
45             Motif not found
46             Motif found
47             Motif found
48             Motif found
49             Motif not found
50             Motif not found
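The rates asked for in question 2 follow from four tallies. The sketch below assumes the usual definitions (a bound sequence with "Motif found" is a true positive, an unbound sequence with "Motif found" is a false positive) and uses counts tallied from the MAST results above: 23 of the 25 bound sequences and 11 of the 25 unbound sequences are reported as "Motif found". The function name `classification_rates` is illustrative.

```python
def classification_rates(tp, fn, fp, tn):
    """Return the four rates of a binary classifier from its confusion counts."""
    positives = tp + fn  # sequences known to be bound
    negatives = fp + tn  # sequences known to be unbound
    return {
        "tpr": tp / positives,  # sensitivity
        "fnr": fn / positives,
        "fpr": fp / negatives,
        "tnr": tn / negatives,  # specificity
    }


# Counts tallied from the MAST results listed above.
rates = classification_rates(tp=23, fn=2, fp=11, tn=14)
for name, value in rates.items():
    print(f"{name}: {value:.2f}")
```

Sensitivity and the false negative rate sum to 1, as do specificity and the false positive rate, so two of the four numbers fully describe the run.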
The PUM2 protein is a human RNA binding protein. In order to identify the preferred binding motif of this protein on RNA, a researcher conducted a high-throughput RNA binding experiment (CLIP). The results of the experiment are given in the attached file pum2.fasta.txt. Sequences are in FASTA format and are ranked according to the binding score, from high to low.
1. Identify the preferred binding motif of PUM2 by running the file on DRIMUST (http://drimust.technion.ac.il/). Make sure that you are using the “Single strand” mode for RNA sequences. Provide the motif logo.
2. What is the significance of the identified motif? Look at the distribution of occurrences by clicking on “view occurrences distribution” or by downloading the motif occurrences file, and explain why the motif is so highly significant.
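To see why a motif concentrated at the top of a ranked list scores as highly significant, it helps to compute a hypergeometric tail over list prefixes. The sketch below is an illustration under stated assumptions and not DRIMUST's implementation: it assumes a minimum-hypergeometric (mHG) style score, takes a hypothetical 0/1 occurrence vector ordered from the highest to the lowest binding score, and returns the uncorrected minimum tail probability (no multiple-testing correction).

```python
from math import comb


def hypergeom_tail(n_total, n_hits, n_drawn, k):
    """P(X >= k) when n_drawn items are drawn from n_total, of which n_hits are hits."""
    total = comb(n_total, n_drawn)
    favourable = sum(
        comb(n_hits, i) * comb(n_total - n_hits, n_drawn - i)
        for i in range(k, min(n_hits, n_drawn) + 1)
    )
    return favourable / total


def mhg_score(occurrence):
    """Smallest hypergeometric tail over all prefixes of a ranked 0/1 vector."""
    n_total = len(occurrence)
    n_hits = sum(occurrence)
    best, hits_so_far = 1.0, 0
    for cutoff, has_motif in enumerate(occurrence, start=1):
        hits_so_far += has_motif
        best = min(best, hypergeom_tail(n_total, n_hits, cutoff, hits_so_far))
    return best


# Hypothetical toy vectors: same number of motif occurrences, different ranks.
top_heavy = [1, 1, 1, 1, 0, 0, 0, 0, 0, 0]
spread = [1, 0, 0, 1, 0, 0, 1, 0, 0, 1]
print(mhg_score(top_heavy), mhg_score(spread))
```

Both toy vectors contain four occurrences, but only the top-heavy one yields a small score, which is the pattern to look for in the occurrences distribution.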
3. In a different study it has been shown that PUM2 binds RNA only in loop (single-stranded) regions of the folded RNA that contain the consensus motif. The researcher is studying 5 RNA sequences in the 3’ untranslated regions of different genes (the sequences are listed below) and wants to learn which of the sequences bind the PUM2 protein.
a. Use the program RNAfold http://rna.tbi.univie.ac.at/cgi-bin/RNAfold.cgi to predict which one of the 5 sequences below is the most likely candidate to bind PUM2. Explain your results.
b. What is the free energy of the RNA predicted to bind PUM2? Are the results you got expected for a binding site of PUM2? Explain. NOTE! Stable RNA structures of similar length usually have free energies lower than -10 kcal/mol.
c. What part of the identified PUM2 motif has the highest probability to be in a loop? Explain.
d. It has been shown that in the presence of an RNA helicase (an enzyme that can unwind stem-loop RNA structures) PUM2 can also bind its preferred binding motif in partial stem regions. Based on this information, which other sequence (from the list below) would bind PUM2 in the presence of the RNA helicase?
List of 5 sequences:
Seq 1: 5’ CCGGCCAAAUAAAUGUCCCCAAAAGGCC 3’
Seq 2: 5’ CCGGCGCACAUAAAUGUACAGUGCGGCC 3’
Seq 3: 5’ CCGGUUAACGUUUUAUUAUACCCAGGCC 3’
Seq 4: 5’ CCGGCCAAUGUAAAUACCCCAAAAGGCC 3’
Seq 5: 5’ CCGGCGCACUGUAAAUAACAGUGCGGCC 3’
Provide snapshots or PDF files of the predicted RNA fold for each sequence!
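Once a folding program has returned a structure in dot-bracket notation, checking whether a motif lies in a loop reduces to counting unpaired positions. The sketch below rests on two loudly stated assumptions: `MOTIF` is a placeholder to be replaced by the motif reported by DRIMUST, and `TOY_STRUCTURE` is a made-up dot-bracket string used only to exercise the function, not a real prediction for Seq 4.

```python
def motif_loop_fractions(sequence, structure, motif):
    """For each motif occurrence, return (start index, fraction of unpaired positions).

    `structure` is a dot-bracket string of the same length as `sequence`,
    where '.' marks an unpaired (single-stranded) nucleotide.
    """
    if len(sequence) != len(structure):
        raise ValueError("sequence and structure must have the same length")
    hits = []
    start = sequence.find(motif)
    while start != -1:
        window = structure[start:start + len(motif)]
        hits.append((start, window.count(".") / len(motif)))
        start = sequence.find(motif, start + 1)
    return hits


MOTIF = "UGUAAAUA"  # placeholder assumption; substitute the DRIMUST result
SEQ4 = "CCGGCCAAUGUAAAUACCCCAAAAGGCC"  # Seq 4 from the list above
TOY_STRUCTURE = "((((" + "." * 20 + "))))"  # hypothetical, NOT an RNAfold prediction

print(motif_loop_fractions(SEQ4, TOY_STRUCTURE, MOTIF))
```

A fraction of 1.0 means the whole occurrence is single stranded; values below 1.0 indicate that part of the motif is predicted to sit in a stem.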
This is an open research question; you are requested only to write your research plan (as described below), not to conduct the research!
As we discussed in class, it has been proposed that long non-coding RNAs (lncRNAs) can bind Transcription Factors (which usually bind double-stranded DNA) and compete with the natural promoter of a gene that is regulated by that Transcription Factor. This is another elegant way by which the cell can regulate gene expression. To date, there are only very few known examples of lncRNAs that have been confirmed in the laboratory to bind a Transcription Factor. In these cases it was found that the binding motif on the lncRNA is exactly the same as the binding motif on the promoter (DNA).
You are given ChIP-Seq data for 10 Transcription Factors and the sequences of 400 lncRNAs, each 1000 nucleotides long. Your goal is to design a bioinformatics experiment to find the lncRNAs (among the 400) that can bind one or more of the Transcription Factors for which you have ChIP-Seq data.
In your answer you will have to define clearly the research steps you plan. Each step should include the question that is answered in that step and the bioinformatics tools that you will use to answer it. Please elaborate and explain each step: what its aim is and how it advances you to the next step. Also detail the expected output of your analysis.
Note that this is an open question, and there may be many ways you can approach it.
Several important clues to remember when answering the question:
1. In terms of the motif, “T” in DNA is completely equivalent to “U” in RNA
2. Motifs in a single strand are completely different from motifs in a double strand
3. RNA folding algorithms are most accurate for sequences of length 150-200 nucleotides.
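Clues 1 and 3 translate into two small preprocessing helpers, sketched below in Python. This is one possible approach under stated assumptions, not a prescribed solution: the window size of 200 and step of 100 are illustrative choices within the 150-200 nucleotide range mentioned above, and the helper names are hypothetical.

```python
def dna_motif_to_rna(motif):
    """Rewrite a DNA motif in the RNA alphabet (clue 1: T is equivalent to U)."""
    return motif.upper().replace("T", "U")


def sliding_windows(sequence, size=200, step=100):
    """Yield (start, subsequence) windows short enough for RNA folding (clue 3).

    Windows overlap so that a motif near a window boundary is still fully
    contained in a neighbouring window, provided the motif is shorter than
    the overlap (size - step).
    """
    if len(sequence) <= size:
        yield 0, sequence
        return
    last_start = len(sequence) - size
    for start in range(0, last_start + 1, step):
        yield start, sequence[start:start + size]
    if last_start % step:  # cover the tail when the step does not divide evenly
        yield last_start, sequence[last_start:]


# Hypothetical 1000-nt lncRNA used only to show the window layout.
toy_lncrna = "ACGU" * 250
windows = list(sliding_windows(toy_lncrna))
print(len(windows), [start for start, _ in windows])
print(dna_motif_to_rna("ATGGCTGCTG"))
```

Each window can then be folded separately and scanned for motif occurrences in single-stranded regions, keeping the window start so that hits can be mapped back to lncRNA coordinates.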