HW4 (2014)

Introduction to Bioinformatics (236523)
HW 4 – Winter 2014
General Instructions:
 Deadline: 2/1/2014 23:55.
 Submit in the published pairs only.
 Submission is electronic only, via the course website.
 Submit a single file in .ZIP format, named according to the following format:
<HW#>_<ID1>_<ID2>.zip
For example: HW1_333222111_012345678.zip
Protein DBs and Secondary structure prediction
1. Use UniProt to find the protein with the identifier B5SY89. From the data shown in
UniProt, report:
a. The protein name.
b. Which organism is it taken from?
c. How many amino acids does it contain?
d. Is it involved in any disease?
e. Which protein family does it belong to?
f. Extract the FASTA sequence of the protein: press the “fasta” link to get it in FASTA
format, and include it in your answer.
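For orientation, a FASTA record is a header line starting with “>” followed by one or more sequence lines. The minimal Python sketch below splits such a record into header and sequence and counts the residues. The record in it is a made-up placeholder (not the real B5SY89 entry), and `parse_fasta` is a hypothetical helper name chosen for this illustration.

```python
def parse_fasta(text):
    """Parse FASTA text into a list of (header, sequence) tuples."""
    records = []
    header, chunks = None, []
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue  # skip blank lines
        if line.startswith(">"):
            # A new header closes the previous record, if any.
            if header is not None:
                records.append((header, "".join(chunks)))
            header, chunks = line[1:], []
        elif header is not None:
            chunks.append(line)  # sequence lines may be wrapped
    if header is not None:
        records.append((header, "".join(chunks)))
    return records


# Placeholder record for illustration only (NOT the real B5SY89 sequence).
example = """>sp|X00000|TOY_EXAMPLE Hypothetical toy protein
MKTAYIAKQR
QISFVKSHFS
"""

records = parse_fasta(example)
header, sequence = records[0]
print(header)
print(len(sequence), "amino acids")
```

The length computed this way should agree with the amino acid count that UniProt lists for the entry.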
2. Use Jpred3 to get a secondary structure prediction from the sequence you extracted.
Based on Jpred results and scores – would you trust this prediction?
3. Search for the protein’s sequence in Pfam.
a. What domains are contained in it?
b. Click on the domain link. Go to “Structure” and view the structure in Jmol. Note
that you can rotate it. Provide a screenshot of the structure of that
domain. In general, is it similar to the SS prediction you got in step 2?
Sensitivity vs. specificity
1. Explain what a false positive and a false negative mean in a pregnancy test.
2. Complete the following table. For the result described in each scenario, fill in one of
the following:
True Positive (TP) / True Negative (TN) / False Positive (FP) / False Negative (FN)

Scenario | TP / TN / FP / FN
You are using a microarray to find highly expressed genes. You look at a specific gene and see that it is highly expressed. In reality it is lowly expressed. | ____
You are conducting a ChIP-Seq experiment to find binding sites of a protein. You examine a certain region and see that your results do not count it as a binding site. In reality it is not a binding site of the protein. | ____
You are conducting a sequencing analysis to find variants in your sample compared to a reference genome. At a certain location the results do not show a variant. In reality there is a variant at that location. | ____
3. Suppose I used the BLOSUM45 substitution matrix to BLAST a protein against a DB, and then
ran BLAST again with BLOSUM90. How would the sensitivity and specificity change?
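As a reminder of the standard definitions behind this section, sensitivity is TP / (TP + FN) and specificity is TN / (TN + FP). The short Python sketch below computes both; the counts are hypothetical and chosen only for illustration.

```python
def sensitivity(tp, fn):
    """Fraction of real positives that are detected: TP / (TP + FN)."""
    return tp / (tp + fn)


def specificity(tn, fp):
    """Fraction of real negatives that are rejected: TN / (TN + FP)."""
    return tn / (tn + fp)


# Hypothetical counts, for illustration only.
tp, fn, tn, fp = 40, 10, 90, 10
print(sensitivity(tp, fn))  # 0.8
print(specificity(tn, fp))  # 0.9
```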
Gene expression analysis – clustering and GO annotation
A. Clustering
You are asked to cluster transcripts from an experiment on five differentiated larval tissues
in the fruit fly, with three repeats for each experiment (the original experiment can be found
under GDS444 in GEO).
a. Go to the experiment in GEO, accession GDS444. Look at what each sample represents.
Provide a screenshot.
b. Use the given file named “GDS444.txt” and run hierarchical clustering on the tissues
(use the “choose action” option before clustering; note that you need to choose the specific
action required for this experiment).
Include the dendrogram in your results.
c. Where would you “cut” the tree? How many clusters do you get? Explain.
d. You are now asked to run K-means clustering on the genes. What would be your initial K?
What do you think about the quality of the clustering you got using that K? Include the
line graphs in your answer.
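To make the idea of “cutting” a tree concrete, here is a minimal pure-Python sketch of agglomerative clustering that stops merging once the closest pair of clusters is farther apart than a chosen height. It assumes Euclidean distance and average linkage, which may differ from the settings of the tool you use in this exercise; the profiles are hypothetical toy data, and `cluster_with_cut` is a name invented for this illustration.

```python
import math


def euclidean(a, b):
    """Euclidean distance between two expression profiles."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def average_linkage(c1, c2, data):
    """Mean pairwise distance between the members of two clusters."""
    total = sum(euclidean(data[i], data[j]) for i in c1 for j in c2)
    return total / (len(c1) * len(c2))


def cluster_with_cut(data, cut_height):
    """Merge the closest clusters until the next merge would exceed cut_height."""
    clusters = [[i] for i in range(len(data))]
    while len(clusters) > 1:
        # Find the closest pair of clusters.
        best = None
        for a in range(len(clusters)):
            for b in range(a + 1, len(clusters)):
                d = average_linkage(clusters[a], clusters[b], data)
                if best is None or d < best[0]:
                    best = (d, a, b)
        d, a, b = best
        if d > cut_height:
            break  # this merge lies above the cut, so stop here
        merged = clusters[a] + clusters[b]
        clusters = [c for k, c in enumerate(clusters) if k not in (a, b)]
        clusters.append(merged)
    return [sorted(c) for c in clusters]


# Hypothetical toy profiles (rows = items to cluster, columns = measurements).
profiles = [
    [1.0, 1.1, 0.9],
    [1.2, 1.0, 1.0],
    [5.0, 5.2, 4.9],
    [5.1, 5.0, 5.0],
    [9.0, 0.1, 9.2],
]
clusters = cluster_with_cut(profiles, cut_height=1.0)
print(len(clusters), "clusters:", clusters)
```

With average linkage the merge heights never decrease, so stopping at the first merge above the threshold gives the same groups as cutting the full dendrogram at that height. Raising or lowering `cut_height` changes the number of clusters, which is the trade-off question c asks you to reason about.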
B. GO annotation analysis:
In this question you will investigate the gene expression matrix given to you in the supplied
file named “example”.
1. Use EPclust and apply hierarchical clustering on the genes. Provide a screenshot.
2. In the results page under “Show options” you can cut the tree in different heights. Cut
the tree at 0.9. How many clusters did you get?
3. Apply k-means clustering with k = the number of clusters found in 2. Provide a screenshot of
the summary of the results (at the bottom of the results page).
4. Use the k-means clustering results and run DAVID on each cluster. In DAVID, choose
“UNIPROT_ACCESSION” as the identifier and “Gene List” as the List Type. In your answer,
refer to the enriched terms found under molecular function (MF) and biological process
(BP) in Gene Ontology (GO) (GOTERM_MF_FAT and GOTERM_BP_FAT).
What are the three most enriched biological processes and molecular functions in each
cluster?
Note that two clusters show more enriched terms than the other clusters. Which are
they? Given these enriched terms, what can you suggest about the health state of the
individuals? Explain.
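To give some intuition for what “enriched” means, the Python sketch below computes a plain hypergeometric tail probability: the chance of seeing at least k annotated genes in a cluster of n genes, when K of the N background genes carry the annotation. This is a simplified illustration, not DAVID's exact calculation (DAVID reports a modified Fisher exact p-value, the EASE score). The function name and all counts are hypothetical.

```python
from math import comb  # requires Python 3.8+


def enrichment_p_value(k, n, K, N):
    """P(X >= k) for a hypergeometric draw.

    N: genes in the background, K: background genes annotated with the term,
    n: genes in the cluster,    k: cluster genes annotated with the term.
    """
    upper = min(n, K)
    favourable = sum(comb(K, i) * comb(N - K, n - i) for i in range(k, upper + 1))
    return favourable / comb(N, n)


# Hypothetical counts, for illustration only.
p = enrichment_p_value(k=3, n=4, K=5, N=20)
print(round(p, 4))
```

A small p-value means the term appears in the cluster more often than expected by chance; in practice the values are also corrected for multiple testing, since many terms are tested at once.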
Research question exercise
Below you will find a research question and your task is to plan a bioinformatics experiment to
answer it using the various tools learnt in class. Note that this is an open question and as in
science – there are many possible ways to address it.
The biological question
Title: Predicting possible functions of hypothetical proteins in metagenomic data
Goal: Metagenomics is a relatively new field studying the genomics of species sampled
directly from the environment. The majority of the genes detected in
metagenomic data are of unknown function and do not resemble other known genes. To
understand the environment, it is crucial to develop methods for predicting the function of these
genes.
Input: You have sequence data from a metagenomic survey of the ocean.
Step 1: Before designing the experiment, be sure you understand the biological question
Answer the following questions. You can use any resources on the web or your own knowledge.
Please note: we mentioned these terms in class but didn’t focus on them, so you may not
find the answers to these questions directly in the course slides. Please be open-minded and try
to think like a scientist who has come across an interesting question and will learn its details by
reading the appropriate literature and searching the net.
1. What is metagenomic sequence data?
2. Give an example of a source for such data.
3. What are hypothetical proteins?
Step 2: What do you expect to achieve in this project
In real (scientific) life, this is not always a question you can answer ahead of time, and you may
change the answer as you proceed with your research. However, before starting the
bioinformatics analysis you need to have an idea of what the outcome of the work will be. Here
you need to define the output you want to reach in order to answer the
question. Your output should include the bioinformatics measures that will enable you to
evaluate your results and prioritize them.
An example of the results of a bioinformatics analysis was shown to you in the high-throughput
sequencing lecture (slides 28 and 35). You can also think of the different outputs that the
bioinformatics tools you use provide: which parameters allow you to rank the results?
Please define 3 possible outcomes and how you will evaluate them.
Step 3: Break the question into smaller steps and choose the most appropriate tool to apply
for each step
Below is a list of possible steps to answer the research question. Your task is to suggest
appropriate bioinformatics tools for answering the question in each step.
If you think of a different way to address this question, you can write your own set of steps with
appropriate tools.
1. Initial filtering of the sequences into coding vs. non-coding sequences.
A. Find known proteins or genes that are homologous to the given sequences
B. Find elements known to occur in genes
2. Find distant homologs of the sequences
3. Look for important domains in the sequences that may give a clue about their function
4. Find common motifs in the suspected coding sequences
(Even if a motif isn’t known, it may point to an important function that is worth
investigating)