Introduction to Bioinformatics (236523)
HW 4 – Winter 2014

General Instructions:
Deadline: 2/1/2014 23:55.
Submission according to published pairs only.
The submission is electronic only, through the course website.
You should submit a single file in .ZIP format, named according to the following format: <HW#>_<ID1>_<ID2>.zip
For example: HW1_333222111_012345678.zip

Protein DBs and Secondary structure prediction

1. Use UniProt to find the protein with the following identifier: B5SY89. From the data you see in UniProt, detail:
   a. The protein name.
   b. Which organism it is taken from.
   c. How many amino acids it contains.
   d. Whether it is involved in any disease.
   e. Which protein family it belongs to.
   f. Extract the FASTA sequence of that protein: press the “fasta” link to get it in FASTA format and display it in your answer.
2. Use Jpred3 to get a secondary structure prediction from the sequence you extracted. Based on the Jpred results and scores, would you trust this prediction?
3. Search for the protein’s sequence in Pfam.
   a. What domains does it contain?
   b. Click on the domain link. Go to “Structure” and view the structure in Jmol. Note that you can rotate it. Provide a screenshot of the structure of that domain. In general, is it similar to the SS prediction you got in step 2?

Sensitivity vs. specificity

1. Explain the meaning of false positive and false negative in a pregnancy test.
2. Complete the following table with one of the following phrases regarding the result detailed on the left: True Positive (TP) / True Negative (TN) / False Positive (FP) / False Negative (FN)

   Scenario | TP / TN / FP / FN
   ---------+------------------
   You are using a microarray to find highly expressed genes. You look at a specific gene and see that it is highly expressed. In reality it is lowly expressed. |
   You are conducting a ChIP-Seq experiment to find binding sites of a protein. You are examining a certain region and see that your results do not count it as a binding site. In reality it is not a binding site of the protein. |
   You are conducting a sequencing analysis to find variants in your sample compared to a reference genome. In a certain location the results do not show a variant. In reality there is a variant in that location. |

3. If I used the BLOSUM45 substitution matrix to BLAST a protein against a DB, and then ran BLAST again with BLOSUM90, how did the sensitivity and specificity change?

Gene expression analysis – clustering and GO annotation

A. Clustering
You are asked to cluster transcripts from an experiment on five differentiated larval tissues in the fruit fly, with three repeats per experiment (the original experiment can be found under GDS444 in GEO).
   a. Go to the experiment in GEO, accession GDS444. Look at what each sample represents. Provide a screenshot.
   b. Use the given file named “GDS444.txt” and run hierarchical clustering on the tissues (use the “choose action” option before clustering; note that you need to choose the specific action required for this experiment). Include the dendrogram in your results.
   c. Where would you “cut” the tree? How many clusters do you get? Explain.
   d. You are now asked to run K-means clustering on the genes. What would be your initial K? What do you think about the quality of the clustering you got using that K? Include the graph lines in your answer.

B. GO annotation analysis
In this question you will investigate the gene expression matrix given to you in the supplied file named “example”.
1. Use EPclust and apply hierarchical clustering on the genes. Provide a screenshot.
2. In the results page, under “Show options”, you can cut the tree at different heights. Cut the tree at 0.9. How many clusters did you get?
3. Apply K-means clustering with k = the number of clusters found in 2. Provide a screenshot of the summary of the results (at the bottom of the results page).
4. Use the K-means clustering results and run DAVID on each cluster. In DAVID choose “UNIPROT_ACCESSION” as the identifier and “Gene List” as the List Type.
   In your answer refer to the enriched terms found under molecular function (MF) and biological process (BP) in Gene Ontology (GO) (GOTERM_MF_FAT and GOTERM_BP_FAT). What are the three most enriched biological processes and molecular functions in each cluster? Note that two clusters show a higher level of enriched terms than the other clusters. Which are they? Can you suggest what the health state of the individuals is, given these enriched terms? Explain.

Research question exercise

Below you will find a research question. Your task is to plan a bioinformatics experiment to answer it using the various tools learnt in class. Note that this is an open question and, as in science, there are many possible ways to address it.

The biological question

Title: Predicting possible functions of hypothetical proteins in metagenomic data

Goal: Metagenomics is a relatively new field studying the genomics of species extracted directly from the environment. The majority of the genes detected in metagenomic data are of unknown function and do not resemble other known genes. To understand the environment it is crucial to develop methods for predicting the function of these genes.

Input: You have sequence data from a metagenomic survey in the ocean.

Step 1: Before designing the experiment be sure you understand the biological question

Answer the following questions. You can use any resources on the web or your own knowledge.
Please note: we mentioned these phrases in class but did not focus on them, so you may not find the answers to these questions directly in the course slides. Please be open-minded and try to think as a scientist who came across an interesting question and will learn its details by reading the appropriate literature and searching the net.
1. What is metagenomic sequence data?
2. Give an example of a source for such data.
3. What are hypothetical proteins?

Step 2: What do you expect to achieve in this project

In real (scientific) life, this is not always a question you can answer ahead of time, and you may change your answer as you proceed with your research. However, before starting the bioinformatics analysis you need to have an idea of what the outcome of the work will be. Here you need to define the output you want to reach in order to answer the question. Your output should include the bioinformatics measures that will enable you to evaluate your results and prioritize them. An example of the results of a bioinformatics analysis was shown to you in the high throughput sequencing lecture (slides 28 and 35). You can also think of the different outputs that the bioinformatics tools you use provide: which of their parameters allow you to rank the results?
Please define 3 possible outcomes and how you will evaluate them.

Step 3: Break the question into smaller steps and choose the most appropriate tool to apply for each step

Below is a list of possible steps to answer the research question. Your task is to suggest appropriate bioinformatics tools for each step. If you think of a different way to address this question, you can write your own set of steps with appropriate tools.
1. Initial filtration of the sequences into coding vs. non-coding sequences.
   A. Find homologous sequences of known proteins or genes to the given sequences.
   B. Find elements known to be in genes.
2. Find distant homologies for the sequences.
3. Look for important domains in the sequences that may give a clue about their function.
4. Find common motifs in the suspected coding sequences (even if a motif is not known, it may point to an important function that is worth investigating).
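
To make step 1B above (finding elements known to be in genes) more concrete, here is a minimal Python sketch that scans a nucleotide sequence in all six reading frames for open reading frames, i.e. a start codon followed in-frame by a stop codon. It is only an illustration under simplifying assumptions: the function names (find_orfs, orfs_in_frame, reverse_complement) and the default 30-codon minimum length are hypothetical choices made for this example, not part of the course material, and real gene-prediction tools rely on statistical models rather than this naive rule.

```python
# Naive six-frame ORF scan (illustrative sketch, not a real gene finder).
# All names and the default min_codons threshold are assumptions for this example.

STOP_CODONS = {"TAA", "TAG", "TGA"}
START_CODON = "ATG"
_COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(seq):
    """Return the reverse complement of an upper-case DNA string."""
    return seq.translate(_COMPLEMENT)[::-1]


def orfs_in_frame(seq, frame, min_codons):
    """Yield (start, end) for ATG...stop ORFs in one forward frame of seq.

    The ORF length is counted in codons, including the start and stop codons.
    """
    start = None
    for i in range(frame, len(seq) - 2, 3):
        codon = seq[i:i + 3]
        if start is None and codon == START_CODON:
            start = i  # first start codon since the last stop
        elif start is not None and codon in STOP_CODONS:
            if (i + 3 - start) // 3 >= min_codons:
                yield start, i + 3
            start = None  # look for the next ORF in this frame


def find_orfs(seq, min_codons=30):
    """Return (strand, start, end, orf_sequence) tuples for all six frames.

    Coordinates on the "-" strand refer to the reverse-complemented sequence.
    """
    seq = seq.upper()
    results = []
    for strand, s in (("+", seq), ("-", reverse_complement(seq))):
        for frame in range(3):
            for start, end in orfs_in_frame(s, frame, min_codons):
                results.append((strand, start, end, s[start:end]))
    return results
```

For example, find_orfs("CCATGGCTGCTGCTTAAGG", min_codons=5) reports a single forward-strand ORF, ATGGCTGCTGCTTAA, at positions 2 to 17. Sequences with at least one sufficiently long ORF could then be passed on to the homology, domain, and motif searches in the later steps.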