Scoring matrix - BC Bioinformatics

BI420 - Introduction to Bioinformatics
Homework 2 (Prof. Marth) - due Thursday, October 26, 2006 at 1:30PM in Prof. Marth's
mailbox in the Biology office (Higgins 355)
Regular questions
1. (10 pts) Gene expression microarrays. (a) List at least 3 questions that one can study with gene expression
microarrays. (b) Explain what the terms “probe” and “target” mean. (c) What is the readout of the expression
microarray that allows us to quantify the gene expression level? (d) When one compares expression between
two different populations of cells in a single experiment, how does one distinguish between the expression
levels of the different populations?
(a) What genes are active (i) at different developmental stages; (ii) in cells from different tissues; (iii) at different
time points during the cell cycle; (iv) in cells under different environmental conditions; (v) in normal versus
cancerous cells? (b) Probes are DNA oligonucleotides spotted in specific locations of the expression chip.
Targets are the cDNA molecules that hybridize to (and are therefore captured by) the complementary probes. (c)
The fluorescence intensity of the dye at each spot, measured with a laser scanner. (d) The two populations are
labeled with dyes of different colors. Relative expression levels are quantified by the relative color intensities.
2. (10 pts) Gene expression microarray data analysis. (a) What is “relative expression ratio”? (b) What is log2
transformation and what are its advantages? (c) What is the purpose of signal normalization? (d) What are time-course experiments?
(a) The ratio of expression levels of the same gene between two populations of cells that are compared in a
single experiment. (b) Log2 transformation computes the base-2 logarithm of the relative expression ratio. The
resulting number is zero if the gene is expressed equally in the two populations; it is negative if the gene is
under-expressed in the first population, and positive if it is over-expressed there. The advantage is that over- and
under-expression by the same factor give values of equal magnitude and opposite sign. (c) To balance the
fluorescent intensities of the two dyes within a single microarray and to adjust for differences in experimental
conditions across multiple arrays. (d) The collection of expression level data at different time points during the
course of an experiment.
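The log2 transformation in (b) can be illustrated with a minimal Python sketch. The function names and intensity values are made up for illustration, and median centering is only one simple normalization approach, assumed here rather than taken from the course material.

```python
import math
from statistics import median

def log2_ratio(first, second):
    """Log2 of the relative expression ratio first/second (both intensities > 0)."""
    return math.log2(first / second)

def median_center(log_ratios):
    """Assumed normalization: shift the log ratios so that their median is zero."""
    m = median(log_ratios)
    return [r - m for r in log_ratios]

print(log2_ratio(400, 100))            # 4-fold over-expression  -> 2.0
print(log2_ratio(100, 100))            # equal expression        -> 0.0
print(log2_ratio(100, 400))            # 4-fold under-expression -> -2.0
print(median_center([1.0, 2.0, 3.0]))  # -> [-1.0, 0.0, 1.0]
```

A 4-fold increase and a 4-fold decrease map to +2 and -2, which is the symmetry that raw ratios (4 versus 0.25) lack.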
3. (10 pts) Proteomics.
(a) What are the three main attributes that the Gene Ontology (GO) uses to classify genes? (b) What does “3D
protein structure” mean, and why is its knowledge important? (c) Give one method for the identification of
proteins in a mixture.
(a) (i) Biological process; (ii) Molecular function; (iii) Cellular component. (b) The spatial conformation of the
protein. It is important because the 3D structure dictates how proteins can interact with other proteins,
chemicals, or DNA. (c) 2D gel electrophoresis; Mass-spectrometry.
4. (10 pts) Gene safari. Using the NCBI web site, gather the following information about the gene BRCA2: (a)
simple description of the gene; (b) the organism in which it is found; (c) the chromosome which it is on; (d) its
orthologs in chimp (P. troglodytes) and in fruitfly (D. melanogaster); (e) a SNP in this gene as identified by a
dbSNP rs number (i.e. reference SNP id).
(a) [Gene] Breast cancer 2 gene, early onset. (b) [Gene] Homo sapiens. (c) [Gene] Chromosome 13. (d) [Homologene]
Chimp: P. troglodytes; LOC452526; similar to breast cancer 2, early onset. Fruitfly: D. melanogaster; CG9286;
Drosophila melanogaster CG9286 gene. (e) [SNP]: rs17077519.
5. (10 pts) Gene safari. Print and submit the visual genotype data for the CRP gene from the SeattleSNPs
resource (URL: http://pga.gs.washington.edu/). How many individuals were analyzed? How many SNPs were
found?
47 individuals were analyzed and 31 SNPs found (see Visual Genotypes below).
6. Sequence alignments. (a) What is a pair-wise sequence alignment? (b) What is a multiple sequence
alignment? (c) What is a scoring scheme? (d) What is the alignment score? (e) What is the difference between
global and local pair-wise sequence alignments? (f) Can the optimal score be negative for a global
alignment, and for a local alignment?
(a) A pair-wise alignment establishes the residue-to-residue correspondence between two
sequences. (b) A multiple alignment establishes the residue-to-residue correspondence among
multiple sequences. (c) The specification of the rewards or penalties for matches, mismatches, and gaps in a
given alignment. More sophisticated scoring schemes assign a reward/penalty to every possible pair of residues
(nucleotides or amino acids). (d) The sum of the rewards and penalties of the aligned residues and gaps for the
entire alignment. (e) A global alignment seeks to align two sequences across their entire length. A
local alignment attempts to find the best-matching parts of the two sequences. (f) Yes for global, no for local
(the empty local alignment already scores 0).
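To make the alignment score concrete, here is a minimal Python sketch that sums the rewards and penalties column by column. The function name is hypothetical, and the default values (match = 2, mismatch = -1, gap = -3) are borrowed from the scoring scheme of question 7 as an assumption.

```python
def alignment_score(top, bottom, match=2, mismatch=-1, gap=-3):
    """Score a given gapped alignment; '-' marks a gap in either row."""
    if len(top) != len(bottom):
        raise ValueError("both rows of an alignment must have the same length")
    total = 0
    for x, y in zip(top, bottom):
        if x == "-" or y == "-":
            total += gap        # residue aligned to a gap
        elif x == y:
            total += match      # identical residues
        else:
            total += mismatch   # different residues
    return total

print(alignment_score("AGTCCA", "AG-CCA"))  # 5 matches and 1 gap -> 7
print(alignment_score("AC", "GT"))          # 2 mismatches        -> -2
```

The second call shows why a global score can be negative: the sequences AC and GT must be aligned end to end, and every way of doing so is penalized. A local alignment of the same pair would simply report the empty alignment with score 0.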
7. (30 pts) Sequence alignments. Consider two DNA sequences: CAGTCCAT and GAGCCA. Determine the
best score and the optimal alignment between these two sequences given the following scoring scheme: Match
= 2, mismatch = -1, gap = -3. (a) Use the Needleman-Wunsch global sequence alignment algorithm. (b) Use the
Smith-Waterman local alignment algorithm.
(a) Scoring matrix:
Numbers in cells indicate scores. D: diagonal trace-back; H: horizontal trace-back; V: vertical trace-back.
Best score: 3. Optimal global alignment:
GAG-CCA-
CAGTCCAT
(b) Scoring matrix:
Best score: 7. Optimal local alignment:
AG-CCA
AGTCCA
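The two answers can be checked with a short Python sketch of both dynamic-programming algorithms. This is an illustrative implementation under the stated scoring scheme with a linear gap penalty; the function name `align` and its `local` flag are assumptions, not part of the assignment.

```python
MATCH, MISMATCH, GAP = 2, -1, -3

def align(a, b, local=False):
    """Needleman-Wunsch (local=False) or Smith-Waterman (local=True).

    Returns (best score, aligned a, aligned b)."""
    rows, cols = len(a) + 1, len(b) + 1
    score = [[0] * cols for _ in range(rows)]
    if not local:
        # Global alignment: leading gaps are penalized.
        for i in range(1, rows):
            score[i][0] = i * GAP
        for j in range(1, cols):
            score[0][j] = j * GAP

    def pair(i, j):
        return MATCH if a[i - 1] == b[j - 1] else MISMATCH

    best, best_pos = 0, (0, 0)
    for i in range(1, rows):
        for j in range(1, cols):
            cell = max(score[i - 1][j - 1] + pair(i, j),  # diagonal
                       score[i - 1][j] + GAP,             # vertical
                       score[i][j - 1] + GAP)             # horizontal
            if local:
                cell = max(cell, 0)  # a local alignment may restart anywhere
            score[i][j] = cell
            if local and cell > best:
                best, best_pos = cell, (i, j)

    # Trace back from the last cell (global) or the best cell (local).
    i, j = best_pos if local else (len(a), len(b))
    final = score[i][j]
    out_a, out_b = [], []
    while (i > 0 or j > 0) and not (local and score[i][j] == 0):
        if i > 0 and j > 0 and score[i][j] == score[i - 1][j - 1] + pair(i, j):
            out_a.append(a[i - 1]); out_b.append(b[j - 1]); i -= 1; j -= 1
        elif i > 0 and score[i][j] == score[i - 1][j] + GAP:
            out_a.append(a[i - 1]); out_b.append("-"); i -= 1
        else:
            out_a.append("-"); out_b.append(b[j - 1]); j -= 1
    return final, "".join(reversed(out_a)), "".join(reversed(out_b))

print(align("CAGTCCAT", "GAGCCA"))              # (3, 'CAGTCCAT', 'GAG-CCA-')
print(align("CAGTCCAT", "GAGCCA", local=True))  # (7, 'AGTCCA', 'AG-CCA')
```

The only differences between the two modes are the initialization of the first row and column, the floor of zero on every cell, and the starting and stopping points of the trace-back.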
8. (10 pts) Sequence database searching. (a) What is the focus of sequence database searching, in comparison
to sequence alignment? (b) What makes the similarity search tool BLAST fast and efficient? (c) Which version of
BLAST would you use if you wanted to utilize mouse protein sequences to annotate genes in the rat genome?
(d) When you search the protein BRCA2 for other similar sequences in the non-redundant protein sequence
division of GenBank, which sequence(s) would be the “query sequence(s)”, and which would be the “database
sequence(s)”?
(a) The focus of sequence database searching is to find sequences that show biologically significant similarity to
the query sequence. During this process, some of the candidate “matches” from the database may be aligned to
the query sequence so an optimal alignment score can be calculated. (b) The method of fast filtering through the
sequences in the database and discarding most database sequences. This is done by searching for word hits
(similar sub-sequences) between the query and the database sequences and only continuing to pursue those
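database sequences that share words in common with the query sequence. (c) BLASTX, with the rat genomic DNA
as the translated query searched against the mouse protein sequences, because that is the appropriate tool for
finding protein-coding genes in genome DNA. (TBLASTN is the counterpart when the mouse proteins are the
queries and the rat genome is the database.) (d) BRCA2 would be the query and the nr protein set would be the
database.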
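The word-hit filtering idea in (b) can be sketched in a few lines of Python. This is a deliberately simplified illustration, not BLAST itself: it uses exact words only, whereas BLAST also scores similar words and then extends each hit. The function names, the toy database, and the word lengths are all assumptions.

```python
from collections import defaultdict

def build_word_index(database, k):
    """Map every length-k word to the names of the sequences that contain it."""
    index = defaultdict(set)
    for name, seq in database.items():
        for i in range(len(seq) - k + 1):
            index[seq[i:i + k]].add(name)
    return index

def candidate_sequences(query, index, k):
    """Keep only database sequences sharing at least one exact word with the query."""
    hits = set()
    for i in range(len(query) - k + 1):
        hits |= index.get(query[i:i + k], set())
    return hits

# Toy database; the sequence names are hypothetical.
database = {"s1": "GAGCCA", "s2": "TTTTTT", "s3": "ACAGTC"}

index4 = build_word_index(database, 4)
print(candidate_sequences("CAGTCCAT", index4, 4))          # {'s3'}

index3 = build_word_index(database, 3)
print(sorted(candidate_sequences("CAGTCCAT", index3, 3)))  # ['s1', 's3']
```

Only the surviving candidates would be passed on to the slower alignment step. Shorter words let more sequences through, which trades speed for sensitivity.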
Bonus questions
B1. (10pts) Gene expression array analysis: advanced methods. (a) What is the purpose of clustering? (b) How
does one measure the similarity of expression time-course between a pair of genes? (c) What is the principle of
hierarchical clustering?
(a) To delineate groups of genes whose expression time-courses are similar, because such genes may have
related functions. (b) By calculating the correlation coefficient between the relative expression values of the two
genes across the time points of the time-course. More correlated genes are “closer” in this metric. (c) To merge
groups of similar genes in a step-wise fashion, starting with the genes, and then the groups of genes, whose
expression patterns are most similar.
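Answers (b) and (c) can be illustrated with a small Python sketch: a Pearson correlation between two time-courses, followed by step-wise merging of the most correlated groups. The gene names and expression values are invented, and average linkage (the mean pairwise correlation between two groups) is an assumed choice among several possible linkage rules.

```python
import math
from itertools import combinations

def pearson(x, y):
    """Pearson correlation coefficient between two equally long time-courses."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = math.sqrt(sum((a - mx) ** 2 for a in x))
    sy = math.sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy)

def hierarchical_merges(profiles):
    """Repeatedly merge the two most correlated clusters; return the merge order."""
    clusters = [(name,) for name in profiles]
    merges = []

    def linkage(c1, c2):
        # Average linkage: mean correlation over all cross-cluster gene pairs.
        pairs = [(g, h) for g in c1 for h in c2]
        return sum(pearson(profiles[g], profiles[h]) for g, h in pairs) / len(pairs)

    while len(clusters) > 1:
        c1, c2 = max(combinations(clusters, 2), key=lambda p: linkage(*p))
        clusters.remove(c1)
        clusters.remove(c2)
        clusters.append(c1 + c2)
        merges.append((c1, c2))
    return merges

# Hypothetical log2 expression ratios at four time points.
profiles = {
    "geneA": [0.0, 1.0, 2.0, 3.0],
    "geneB": [0.0, 2.0, 4.0, 6.0],
    "geneC": [3.0, 2.0, 1.0, 0.0],
    "geneD": [3.0, 2.1, 0.9, 0.0],
}
for step in hierarchical_merges(profiles):
    print(step)
```

In this toy data geneA and geneB rise together and are merged first, geneC and geneD fall together and are merged second, and the two anti-correlated groups are joined only in the last step.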
B2. (10pts) Genome copy number detection in cancer cells. (a) What is “genomic representation” in the context
of a genome microarray? (b) How does one use a genomic representational microarray to determine if, at a
given location of the genome, cells from a cancer tissue have a higher or a lower copy number than “normal”
cells?
(a) Genomic representation is a (usually quasi-random) collection of probes representing a fraction of the
genome. (b) One measures and compares the levels of hybridization from cancer vs. normal cells at probes in a
genome location of interest. If the level is higher in the cancer cells, the copy number is higher than normal; if
the level is lower, the copy number is lower than normal.
Question 5. Visual genotypes from SeattleSNPs for the CRP gene.