BI420 - Introduction to Bioinformatics
Homework 2 (Prof. Marth)
Due Thursday, October 26, 2006 at 1:30PM in Prof. Marth's mailbox in the Biology office (Higgins 355)

Regular questions

1. (10 pts) Gene expression microarrays. (a) List at least 3 questions that one can study with gene expression microarrays. (b) Explain what the terms “probe” and “target” mean. (c) What is the readout of the expression microarray that allows us to quantify the gene expression level? (d) When one compares expression between two different populations of cells in a single experiment, how does one distinguish between the expression levels of the different populations?

(a) What genes are active (i) at different developmental stages; (ii) in cells from different tissues; (iii) at different time points during the cell cycle; (iv) in cells under different environmental conditions; (v) in normal versus cancerous cells?
(b) Probes are DNA oligonucleotides spotted in specific locations of the expression chip. Targets are the cDNA molecules that hybridize to (and are therefore captured by) the complementary probes.
(c) The dye color intensities at each spot, measured by laser scanning.
(d) The two populations are labeled with dyes of different colors. Relative expression levels are quantified by the relative color intensities.

2. (10 pts) Gene expression microarray data analysis. (a) What is “relative expression ratio”? (b) What is log2 transformation and what are its advantages? (c) What is the purpose of signal normalization? (d) What are timecourse experiments?

(a) The ratio of expression levels of the same gene between two populations of cells that are compared in a single experiment.
(b) Log2 transformation computes the base-2 logarithm of the relative expression ratio. The resulting number is zero if the gene is expressed equally in the two populations; it is negative if the gene is under-expressed in the first population, and positive if it is over-expressed. The advantage is that over- and under-expression of the same magnitude become symmetric about zero (for example, ratios of 2 and 1/2 map to +1 and -1).
(c) To balance fluorescent intensities of the two dyes within a single microarray and to adjust for differences in experimental conditions across multiple arrays.
(d) The collection of expression level data at different time points during the course of an experiment.

3. (10 pts) Proteomics. (a) What are the three main attributes that the Gene Ontology (GO) uses to classify genes? (b) What does “3D protein structure” mean, and why is its knowledge important? (c) Give one method for the identification of proteins in a mixture.

(a) (i) Biological process; (ii) Molecular function; (iii) Cellular component.
(b) The spatial conformation of the protein. It is important because the 3D structure dictates how proteins can interact with other proteins, chemicals, or DNA.
(c) 2D gel electrophoresis; mass spectrometry.

4. (10 pts) Gene safari. Using the NCBI web site, gather the following information about the gene BRCA2: (a) simple description of the gene; (b) the organism in which it is found; (c) the chromosome which it is on; (d) its orthologs in chimp (P. troglodytes) and in fruitfly (D. melanogaster); (e) a SNP in this gene as identified by a dbSNP rs number (i.e. reference SNP id).

(a) [Gene] Breast cancer 2 gene, early onset.
(b) [Gene] Homo sapiens.
(c) [Gene] Chromosome 13.
(d) [Homologene] Chimp: P. troglodytes; LOC452526; similar to breast cancer 2, early onset. Fruitfly: D. melanogaster; CG9286; Drosophila melanogaster CG9286 gene.
(e) [SNP] rs17077519.

5. (10 pts) Gene safari. Print and submit the visual genotype data for the CRP gene from the SeattleSNPs resource (URL: http://pga.gs.washington.edu/). How many individuals were analyzed? How many SNPs were found?

47 individuals were analyzed and 31 SNPs were found (see Visual Genotypes below).

6. Sequence alignments. (a) What is a pair-wise sequence alignment? (b) What is a multiple sequence alignment? (c) What is a scoring scheme? (d) What is the alignment score? (e) What is the difference between global and local pair-wise sequence alignments?
(f) Can the optimal score be negative for a global alignment, and for a local alignment?

(a) A pair-wise alignment establishes the base-to-base correspondence between the residues of two sequences.
(b) A multiple alignment establishes the base-to-base correspondence between the residues of multiple sequences.
(c) The specification of the rewards or penalties for matches, mismatches, and gaps in a given alignment. More sophisticated scoring schemes assign a reward/penalty to every possible pair of residues (nucleotides or amino acids).
(d) The sum of the rewards and penalties of the aligned residues and gaps for the entire alignment.
(e) A global alignment seeks to align two sequences across their entire length. A local alignment attempts to find the best-matching parts of the two sequences.
(f) Yes for global, no for local.

7. (30 pts) Sequence alignments. Consider two DNA sequences: CAGTCCAT and GAGCCA. Determine the best score and the optimal alignment between these two sequences given the following scoring scheme: Match = 2, mismatch = -1, gap = -3. (a) Use the Needleman-Wunsch global sequence alignment algorithm. (b) Use the Smith-Waterman local alignment algorithm.

(a) Scoring matrix (figure): Numbers in cells indicate scores. D: diagonal trace-back; H: horizontal trace-back; V: vertical trace-back.
Best score: 3. Optimal global alignment:
GAG-CCA-
CAGTCCAT
(b) Scoring matrix (figure).
Best score: 7. Optimal local alignment:
AG-CCA
AGTCCA

8. (10 pts) Sequence database searching. (a) What is the focus of sequence database searching, in comparison to sequence alignment? (b) What makes the similarity search tool BLAST fast and efficient? (c) Which version of BLAST would you use if you wanted to utilize mouse protein sequences to annotate genes in the rat genome?
(d) When you search the protein BRCA2 for other similar sequences in the non-redundant protein sequence division of GenBank, which sequence(s) would be the “query sequence(s)”, and which would be the “database sequence(s)”?

(a) The focus of sequence database searching is to find sequences that show biologically significant similarity to the query sequence. During this process, some of the candidate “matches” from the database may be aligned to the query sequence so that an optimal alignment score can be calculated.
(b) BLAST quickly filters the database and discards most database sequences. It does this by searching for word hits (short similar sub-sequences) between the query and the database sequences, and it continues to pursue only those database sequences that share words with the query sequence.
(c) BLASTX, because it translates a genomic DNA query in all six reading frames and compares the translations to a protein database (here, the mouse proteins), which makes it the appropriate tool for finding protein-coding genes in genomic DNA.
(d) The BRCA2 protein sequence would be the query sequence, and the sequences in the nr protein set would be the database sequences.

Bonus questions

B1. (10 pts) Gene expression array analysis: advanced methods. (a) What is the purpose of clustering? (b) How does one measure the similarity of expression time-course between a pair of genes? (c) What is the principle of hierarchical clustering?

(a) To delineate groups of genes whose expression time-courses are similar, because such genes may have related functions.
(b) By calculating the correlation coefficient between the relative expression values of the two genes across the time points of the time-course. More correlated genes are “closer” in this metric.
(c) To merge groups of similar genes in a step-wise fashion, starting with the individual genes and then the groups of genes whose expression patterns are most similar.

B2. (10 pts) Genome copy number detection in cancer cells. (a) What is “genomic representation” in the context of a genome microarray?
(b) How does one use a genomic representational microarray to determine if, at a given location of the genome, cells from a cancer tissue have a higher or a lower copy number than “normal” cells?

(a) Genomic representation is a (usually quasi-random) collection of probes representing a fraction of the genome.
(b) One measures and compares the levels of hybridization from cancer vs. normal cells at probes in a genome location of interest. If the level is higher in the cancer cells, the copy number is higher than normal; if the level is lower, the copy number is lower than normal.

Question 5. Visual genotypes from SeattleSNPs for the CRP gene.
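
Question 7. Worked sketch of the alignment computation. The scoring-matrix fill and trace-back used in Question 7 can be checked with a short program. The following is a minimal sketch, not the course's reference implementation: the function name `align`, its parameters, and the tie-breaking order (diagonal, then horizontal, then vertical) are assumptions made for illustration. Here a horizontal step (H) places a gap in the second sequence and a vertical step (V) places a gap in the first. With `local=False` it follows the Needleman-Wunsch recurrence; with `local=True` it follows Smith-Waterman, where cell scores are floored at zero and the trace-back starts from the highest-scoring cell.

```python
def align(s, t, match=2, mismatch=-1, gap=-3, local=False):
    """Return (best score, aligned s, aligned t).

    Illustrative helper (name and signature are assumptions).
    Columns of the matrix follow s, rows follow t.
    """
    rows, cols = len(t) + 1, len(s) + 1
    score = [[0] * cols for _ in range(rows)]
    # Trace-back codes: 'D' diagonal, 'H' horizontal, 'V' vertical, None = stop.
    trace = [[None] * cols for _ in range(rows)]

    if not local:
        # Global alignment: leading gaps are penalized.
        for j in range(1, cols):
            score[0][j], trace[0][j] = j * gap, "H"
        for i in range(1, rows):
            score[i][0], trace[i][0] = i * gap, "V"

    best, best_cell = 0, (0, 0)
    for i in range(1, rows):
        for j in range(1, cols):
            pair = match if s[j - 1] == t[i - 1] else mismatch
            # For local alignment, 0 comes first so that ties end the alignment.
            options = [(0, None)] if local else []
            options += [
                (score[i - 1][j - 1] + pair, "D"),
                (score[i][j - 1] + gap, "H"),
                (score[i - 1][j] + gap, "V"),
            ]
            score[i][j], trace[i][j] = max(options, key=lambda o: o[0])
            if score[i][j] > best:
                best, best_cell = score[i][j], (i, j)

    # Global: start at the bottom-right corner. Local: start at the best cell.
    i, j = best_cell if local else (rows - 1, cols - 1)
    final = score[i][j]
    top, bottom = [], []
    while trace[i][j] is not None:
        move = trace[i][j]
        if move == "D":
            i, j = i - 1, j - 1
            top.append(s[j])
            bottom.append(t[i])
        elif move == "H":
            j -= 1
            top.append(s[j])
            bottom.append("-")
        else:
            i -= 1
            top.append("-")
            bottom.append(t[i])
    return final, "".join(reversed(top)), "".join(reversed(bottom))


print(align("CAGTCCAT", "GAGCCA"))              # (3, 'CAGTCCAT', 'GAG-CCA-')
print(align("CAGTCCAT", "GAGCCA", local=True))  # (7, 'AGTCCA', 'AG-CCA')
```

Both results agree with the best scores and alignments given in the answer to Question 7. A different tie-breaking order could return a different but equally scoring alignment for other inputs.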