HW4_final


Introduction to Bioinformatics (236523)

HW 4 – Winter 2016

General Instructions:

Deadline: 7/1/15 23:55.

Submit only in the published pairs.

Submission is electronic only, via the course website.

Question 1: Motif representation, sensitivity and specificity

A researcher studies the binding sites of a transcription factor X. The researcher conducts a ChIP-Seq experiment, in which X is bound to the genome and the DNA fragments bound by the protein are sequenced. To find the binding-site motif, the researcher then runs MEME.

1. The following multiple sequence alignment shows the motif occurrences extracted from a subset of the ChIP-Seq sequences. Represent the motif in 3 different ways:

a. A count matrix

b. A probability matrix

c. A consensus sequence

AGGGCAGCTT

ACGACTGCTG

CTGGCTGCTA

ATGACTGCTG

AGGACTGCTC

CCGGCAGCTG

ATGGCTGCTC
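The three representations asked for in part 1 can be sketched in a few lines of Python. This is a minimal illustration, not a prescribed method: the function names are hypothetical, no pseudocounts are added, and ties in the consensus are broken by alphabet order (A, C, G, T), which is an assumption rather than a course convention.

```python
from collections import Counter

ALPHABET = "ACGT"

# The seven aligned motif occurrences from the question.
SEQS = [
    "AGGGCAGCTT",
    "ACGACTGCTG",
    "CTGGCTGCTA",
    "ATGACTGCTG",
    "AGGACTGCTC",
    "CCGGCAGCTG",
    "ATGGCTGCTC",
]

def count_matrix(seqs):
    """Map each base to a list of per-position counts."""
    length = len(seqs[0])
    if any(len(s) != length for s in seqs):
        raise ValueError("all aligned sequences must have the same length")
    columns = [Counter(col) for col in zip(*seqs)]
    return {base: [col[base] for col in columns] for base in ALPHABET}

def probability_matrix(counts, n_seqs):
    """Divide every count by the number of sequences (no pseudocounts)."""
    return {base: [c / n_seqs for c in row] for base, row in counts.items()}

def consensus(counts):
    """Most frequent base per position; ties go to the first base in ALPHABET."""
    length = len(counts[ALPHABET[0]])
    return "".join(
        max(ALPHABET, key=lambda b: counts[b][i]) for i in range(length)
    )

counts = count_matrix(SEQS)
probs = probability_matrix(counts, len(SEQS))
cons = consensus(counts)

for base in ALPHABET:
    print(base, counts[base])
print("consensus:", cons)
```

A consensus built this way discards the variability at positions such as 2, 4, 6, and 10, which is why the count and probability matrices are the more informative representations.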

2. The researcher then ran MAST with the motif found in step 1 on 50 sequences: 25 known to be bound by protein X and 25 known not to be bound by it. The MAST results are listed below.

a. What are the false positive, false negative, true positive, and true negative rates of MAST?

b. Describe the quality of this MAST run in terms of sensitivity and specificity.

Sequences that bind protein X

Sequence ID    MAST result
1              Motif found
2              Motif found
3              Motif found
4              Motif found
5              Motif found
6              Motif found
7              Motif found
8              Motif found
9              Motif not found
10             Motif found
11             Motif found
12             Motif found
13             Motif found
14             Motif found
15             Motif found
16             Motif found
17             Motif found
18             Motif not found
19             Motif found
20             Motif found
21             Motif found
22             Motif found
23             Motif found
24             Motif found
25             Motif found

Sequences that don't bind protein X

Sequence ID    MAST result
26             Motif not found
27             Motif found
28             Motif not found
29             Motif not found
30             Motif not found
31             Motif found
32             Motif not found
33             Motif found
34             Motif not found
35             Motif not found
36             Motif found
37             Motif found
38             Motif found
39             Motif not found
40             Motif not found
41             Motif found
42             Motif not found
43             Motif found
44             Motif not found
45             Motif not found
46             Motif found
47             Motif found
48             Motif found
49             Motif not found
50             Motif not found
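The rates in part 2 follow from the four cells of a confusion matrix, where "positive" means MAST reported the motif and the ground truth is whether the sequence binds protein X. The sketch below shows the standard definitions; the function name is hypothetical and the counts in the usage line are made-up placeholders, not the values from the tables above.

```python
def confusion_rates(tp, fn, fp, tn):
    """Return the four rates derived from confusion-matrix counts.

    tp: bound sequences where the motif was found
    fn: bound sequences where the motif was not found
    fp: unbound sequences where the motif was found
    tn: unbound sequences where the motif was not found
    """
    positives = tp + fn  # all truly bound sequences
    negatives = fp + tn  # all truly unbound sequences
    if positives == 0 or negatives == 0:
        raise ValueError("need at least one bound and one unbound sequence")
    return {
        "sensitivity": tp / positives,          # true positive rate
        "false_negative_rate": fn / positives,  # 1 - sensitivity
        "specificity": tn / negatives,          # true negative rate
        "false_positive_rate": fp / negatives,  # 1 - specificity
    }

# Placeholder counts for illustration only (NOT the homework data).
rates = confusion_rates(tp=40, fn=10, fp=5, tn=45)
for name, value in rates.items():
    print(f"{name}: {value:.2f}")
```

A run with high sensitivity but low specificity finds most true binding sequences while also reporting the motif in many sequences that do not bind.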

Question 2: RNA structure and function

The PUM2 protein is a human RNA-binding protein. To identify the preferred binding motif of this protein on RNA, a researcher conducted a high-throughput RNA binding experiment (CLIP). The results of the experiment are given in the attached file pum2.fasta.txt. Sequences are in FASTA format and are ranked by binding score, from high to low.

1. Identify the preferred binding motif of PUM2 by running the file on DRIMUST http://drimust.technion.ac.il/ . Make sure that you are using the “Single strand” mode for RNA sequences. Provide the motif logo.

2. What is the significance of the identified motif? Look at the distribution of occurrences, either by clicking on “view occurrences distribution” or by downloading the motif occurrences file, and explain why the motif is so highly significant.

3. In a different study it has been shown that PUM2 binds RNA only in loop (single-stranded) regions of the folded RNA that contain the consensus motif. The researcher is studying 5 RNA sequences from the 3’ untranslated regions of different genes (listed below) and wants to learn which of them bind the PUM2 protein.

a. Use the RNAfold web server http://rna.tbi.univie.ac.at/cgi-bin/RNAfold.cgi to predict which one of the 5 sequences below is the most likely candidate to bind PUM2. Explain your results.

b. What is the free energy of the RNA predicted to bind PUM2? Are the results you got expected for a binding site of PUM2? Explain. NOTE! Stable RNA structures of similar length usually have free energies lower than -10 kcal/mol.

c. What part of the identified PUM2 motif has the highest probability of being in a loop? Explain.

d. It has been shown that in the presence of an RNA helicase (an enzyme that can unwind stem-loop RNA structures) PUM2 can also bind its preferred binding motif in partial stem regions. Based on this information, which other sequence (from the list below) would bind PUM2 in the presence of the RNA helicase?

List of 5 sequences:

Seq 1: 5’ CCGGCCAAAUAAAUGUCCCCAAAAGGCC 3’

Seq 2: 5’ CCGGCGCACAUAAAUGUACAGUGCGGCC 3’

Seq 3: 5’ CCGGUUAACGUUUUAUUAUACCCAGGCC 3’

Seq 4: 5’ CCGGCCAAUGUAAAUACCCCAAAAGGCC 3’

Seq 5: 5’ CCGGCGCACUGUAAAUAACAGUGCGGCC 3’

Provide snapshots or PDF files of the predicted RNA fold for each sequence!
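Part 3 combines two checks: does a sequence contain the motif, and do the motif positions fall in an unpaired region of the predicted fold? The sketch below illustrates both under loud assumptions. The motif string is a placeholder to be replaced with the DRIMUST result, the helper names are hypothetical, and the dot-bracket string is a toy hairpin, not an actual RNAfold prediction for any of the five sequences.

```python
import re

SEQS = {
    "Seq 1": "CCGGCCAAAUAAAUGUCCCCAAAAGGCC",
    "Seq 2": "CCGGCGCACAUAAAUGUACAGUGCGGCC",
    "Seq 3": "CCGGUUAACGUUUUAUUAUACCCAGGCC",
    "Seq 4": "CCGGCCAAUGUAAAUACCCCAAAAGGCC",
    "Seq 5": "CCGGCGCACUGUAAAUAACAGUGCGGCC",
}

# ASSUMPTION: placeholder motif; substitute the motif reported by DRIMUST.
MOTIF = "UGUAAAUA"

def find_motif(seq, motif_regex):
    """Return 0-based start positions of all (possibly overlapping) matches."""
    pattern = re.compile(f"(?=({motif_regex}))")
    return [m.start() for m in pattern.finditer(seq)]

def unpaired_fraction(structure, start, length):
    """Fraction of positions in [start, start + length) that are unpaired.

    `structure` is in dot-bracket notation, where '.' marks an unpaired
    base and '(' / ')' mark the two sides of a base pair.
    """
    window = structure[start:start + length]
    if not window:
        raise ValueError("window lies outside the structure")
    return window.count(".") / len(window)

hits = {name: find_motif(seq, MOTIF) for name, seq in SEQS.items()}
for name, positions in hits.items():
    print(name, positions)

# Toy hairpin for illustration only (NOT an RNAfold output).
toy_structure = "((((........))))"
print(unpaired_fraction(toy_structure, 4, 8))  # window covers only the loop
print(unpaired_fraction(toy_structure, 2, 8))  # window overlaps the stem
```

In practice the dot-bracket string pasted from the RNAfold output for each sequence would replace the toy structure, and the start position would come from `find_motif`.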

Question 3: Research Question

This is an open research question; you are asked only to write your research plan (as described below), not to conduct the research!

As we discussed in class, it has been proposed that long non-coding RNAs (lncRNAs) can bind Transcription Factors (which usually bind double-stranded DNA) and compete with the natural promoter of a gene that is regulated by that Transcription Factor. This is another elegant way by which the cell can regulate gene expression. To date there are only very few known examples of lncRNAs that have been confirmed in the laboratory to bind a Transcription Factor. In these cases it was found that the binding motif on the lncRNA is exactly the same as the binding motif on the promoter (DNA).

You are given ChIP-Seq data for 10 Transcription Factors and the sequences of 400 lncRNAs, each of length 1000 nucleotides. Your goal is to design a bioinformatics experiment to find the lncRNAs (among the 400) that can bind one or more of the Transcription Factors for which you have ChIP-Seq data.

In your answer you will have to define clearly the research steps you plan. Each step should include the question that is answered in that step and the bioinformatics tools that you will use to answer it. Please elaborate and explain each step: what is its aim, and how does it advance you to the next step? Also detail the expected output of your analysis.

Note that this is an open question, and there may be many ways you can approach it.

Several important clues to remember when answering the question:

1. In terms of the motif, “T” in DNA is completely equivalent to “U” in RNA

2. Motifs in single-stranded sequences are completely different from motifs in double-stranded sequences

3. RNA folding algorithms are most accurate for sequences of length 150-200 nucleotides.
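Clues 1 and 3 suggest two small preprocessing steps that a research plan might include: rewriting a DNA motif in the RNA alphabet, and cutting each 1000-nucleotide lncRNA into overlapping windows short enough for reliable folding. The sketch below is one possible way to do this; the function names, the 200-nucleotide window, and the 100-nucleotide step are illustrative assumptions, not requirements of the assignment.

```python
def dna_to_rna(seq):
    """Rewrite a DNA sequence or motif in the RNA alphabet (T -> U)."""
    return seq.upper().replace("T", "U")

def sliding_windows(seq, size=200, step=100):
    """Split seq into overlapping windows as (start, subsequence) pairs.

    size and step are assumed values; size is chosen within the
    150-200 nt range where folding predictions are most accurate.
    A final window is added if the regular grid does not reach the
    end of the sequence.
    """
    if size <= 0 or step <= 0:
        raise ValueError("size and step must be positive")
    if len(seq) <= size:
        return [(0, seq)]
    starts = list(range(0, len(seq) - size + 1, step))
    if starts[-1] + size < len(seq):
        starts.append(len(seq) - size)
    return [(s, seq[s:s + size]) for s in starts]

# Example with a synthetic 1000-nt sequence (placeholder, not real data).
lncrna = "ACGU" * 250
windows = sliding_windows(lncrna)
print(len(windows), "windows; first start:", windows[0][0],
      "last start:", windows[-1][0])
print(dna_to_rna("ATGGCTGCTG"))
```

Overlapping windows are used so that a motif lying near a window boundary still appears with its full sequence context in at least one window.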
