Lecture 1 (cont.): DNA, Genes, Gene Expression, Transcription. 31/3/2005
Note: [figure numbers] refer to slides in the ppt presentation.

3 DIFFERENTIATION and GENE EXPRESSION

All cells of a multicellular organism were "born" by division from a single cell. [Central dogma] All cells contain the same DNA and contain all genes - the full genome. However, different cell types synthesize different proteins.

For higher multicellular organisms one nevertheless finds that the 2000 most abundant proteins (> 50,000 copies/cell) appear at about the same concentrations (within a factor of 5) in different cell types. Only a few % of the proteins are expressed in a tissue-specific way. A typical higher eucaryotic cell expresses 10,000 - 20,000 genes. The presence or absence of a few hundred to 1000 of them induces tremendous differences.

The same cell can have different expression levels at different times, especially in response to external signals (conditions) or internal signals.

Knowledge of the concentrations of all protein species would yield information on the cell and its "biological state". Direct measurement of the concentrations of all proteins with reasonable accuracy is very difficult, and one also needs to know their phosphorylation state, bound complexes, etc.

We will focus on gene expression, as reflected in the concentration of mRNA in the cell.

Working assumption: the concentrations of the mRNA molecules in a cell define its "biological state".

To understand the incompleteness of this assumption, note that the concentration of proteins can be controlled in a variety of ways [control]:
    Transcriptional control
    RNA transport control
    RNA degradational control
    RNA processing control (splicing)
    Translational control
    Protein activity control

Nevertheless, under our working assumption, knowledge of 25,000 - 30,000 numbers - the concentrations of 25,000 - 30,000 kinds of mRNA - defines the state of the cells of a human tissue. The device that measures these mRNA concentrations is called a DNA chip or microarray.
The latest Affymetrix U133plus2.0 chip measures the expression levels of 55,000 "genes" (probe sets); in a good experiment this is done for 10 - 100 samples taken from different tissues, people, tumors. These expression levels form a large matrix or array that contains 500,000 - 5,000,000 numbers. [excel]

Our aim is to represent this data, visualize it and extract biologically significant meaning from it.

Visualization: color code [color code],[leukemia1],[leukemia2]

Lecture 2: DNA chips.

1 INTRODUCTION

The basic assumption: the "state" of a cell <===> mRNA concentration of its genes.

Two main types of experiments:
(a) Ns ≈ 10 - 100 samples (e.g. tissue removed from tumors); for each sample, L ≈ 1 - 10 known clinical labels and measured expression levels for Ng ≈ a few thousand to a few tens of thousands of genes.
(b) M ≈ 1 - 10 experimental conditions. Initialize some biological process at time t = 0 and measure expression at ≈ 5 - 20 time points ti for Ng genes.

Aims:
(a) Identify different sub-classes of a disease (cancer) on the basis of gene expression. Use for diagnosis, prognosis, design of therapy.
(b) Identify groups of genes with correlated expression levels over conditions, to learn about possible functions, networks and relationships.

DNA chips measure the mRNA concentration in a solution. The method is based on a hybridization reaction: two matching strands of DNA are held together by hydrogen bonds, A-T and C-G [hybridization1]. Similarly, DNA and its complementary RNA are strongly bound by A-U and C-G bonds. When the double-stranded compound is heated (or subjected to changes of the chemical environment), the two strands dissociate (denaturation). Upon reversal of conditions, matching single strands meet and reunite (hybridization) [hybridization1].
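The pairing rules above (A-T and C-G for DNA/DNA, A-U and C-G for DNA/RNA) can be illustrated with a minimal Python sketch. The function names are hypothetical and chosen only for this example. The sketch assumes sequences are written 5' to 3', so that a matching strand is the reverse complement. It treats hybridization as an exact all-or-nothing match, which ignores temperature and partial binding.

```python
# Pairing rules taken from the notes: A-T, C-G (DNA/DNA) and A-U, C-G (DNA/RNA).
DNA_PAIR = {"A": "T", "T": "A", "C": "G", "G": "C"}
RNA_PAIR_FOR_DNA = {"A": "U", "T": "A", "C": "G", "G": "C"}


def reverse_complement(dna: str) -> str:
    """Return the DNA strand that pairs with `dna` (both written 5' to 3')."""
    return "".join(DNA_PAIR[base] for base in reversed(dna.upper()))


def complementary_rna(dna: str) -> str:
    """Return the RNA strand that pairs with `dna` (both written 5' to 3')."""
    return "".join(RNA_PAIR_FOR_DNA[base] for base in reversed(dna.upper()))


def is_perfect_match(probe: str, target: str) -> bool:
    """Idealized test: True only if `target` is the exact partner of `probe`."""
    return target.upper() == reverse_complement(probe)
```

For example, `reverse_complement("ATGC")` gives `"GCAT"`, and `is_perfect_match("ATGC", "GCAT")` is true under these assumptions.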
An oligonucleotide of, say, 20 bases will bind with the highest affinity to a perfectly matching sequence (PM) of a long single-stranded polynucleotide, and with significantly lower affinity to a segment with an imperfect match (mismatch, MM). The stringency of the hybridization depends on the temperature: higher T, more stringent. This is a non-equilibrium effect: at lower T the strand stays bound to one of the much more abundant MM segments and does not dissociate, whereas dissociation is essential for it to seek and find its PM. [hybridization2]

The basic idea of DNA chips:
Prepare a solid substrate (chip) divided into pixels.
Attach to each pixel identical probes: segments of DNA taken from one gene.
Prepare a solution that contains different targets - species of mRNA at different concentrations.
Pour the solution onto the chip. The mRNA molecules diffuse, and if they find a matching probe, i.e. one taken from the gene from which the mRNA was transcribed, they stick.
Detect the amount of target that stuck to each pixel.

There are two basic families of implementations of this idea:
(a) cDNA microarrays (spotted microarrays)
(b) Oligonucleotide microarrays (Affymetrix chips).

2 QUICK OVERVIEW

2.1 cDNA spotted arrays - simplified picture [cDNA expt].

(a) TARGET (sample) preparation: Compare gene expression under two conditions: experiment and control. Extract mRNA from both samples. Use Reverse Transcription (RT) to produce cDNA from each species of mRNA. The cDNA is fluorescently marked: experiment by RED, control by GREEN.

    cells → mRNA -(RT)→ cDNA (label Red/Green)

(b) PROBE (chip) preparation: Prepare a library of (double-stranded) DNA. Print a spotted array - each spot contains 10^7 - 10^8 clones of DNA from one gene; 5,000 - 10,000 spots on a chip.
(c) Hybridization: Incubate the solution of the two kinds of marked cDNA and hybridize to the probes.
(d) Detection: Wash away unbound targets and measure fluorescence from each spot; RATIO = RED/GREEN.

To control non-specific binding: two kinds of targets (expt.
and control) compete for probe.

Campbell's animation: http://www.bio.davidson.edu/courses/genomics/chip/chip.html
[cDNA spots]

2.2 Oligonucleotide microarrays (Affymetrix) - simplified picture. [Aff.design: Target]

(a) TARGET (sample) preparation: Only "experiment"; to compare to a control, run a separate chip. Extract mRNA from the sample. Reverse-transcribe to (double-strand) cDNA, and transcribe back to cRNA. The cRNA is amplified, fluorescently marked and fragmented (cut into short oligonucleotide chains).

    cells → mRNA -(RT)→ cDNA → cRNA (amplify, label, fragment)

(b) PROBE (chip) preparation: [Affy probe] Synthesize in situ single-strand 25-mer oligos [Probe litho]; basic feature - a few 10^6 copies of a single oligo [Wafer], [PM/MM]:
    PM: "perfect match" - the DNA of the gene to be studied
    MM: "mismatch" - the same 25-mer with a mistake at its central position
    55,000 probe-sets
(c) Hybridization: Incubate the labeled cRNA solution, hybridize to the chip, wash away unbound targets, stain. [Affy expt design]
(d) Detection: Measure fluorescence from each spot; record Difference = Perfect Match - MisMatch.

To control non-specific binding: two kinds of probes (PM/MM) compete for target.

Summary of the two techniques:

                   cDNA                                  Affymetrix
Target             Expt. and Control; cDNA labeled       Expt. only; cRNA, labeled and
                   Red/Green; need 1 µg mRNA             fragmented; 0.2 - 2 µg mRNA
Probe              cDNA (2-stranded, 1000 bp) on         25-mer oligos, single-strand DNA;
                   spots of size 100 µm;                 11 PM/MM pairs;
                   No. spots < 10,000                    No. probe sets = 55K
Hybridize          Expt. and Control to cDNA             Expt. to PM and MM
Detection          Red/Green laser fluorescence;         fluorescent dye;
                   record RATIO                          record PM-MM DIFFERENCE
Nonspec. binding   2 targets compete for probe           2 probes compete for target

3 DETAILED DESCRIPTION OF THE TECHNIQUES

3.1 Oligonucleotide Microarrays [AffED]

(a) Target preparation
Need 1 - 15 µg of total RNA (10^7 cells yield 200 µg).
Extraction: extract total RNA and from that separate the mRNA (about 2 %), using the poly-A tail to pull out the mRNA.
Reverse transcription (RT): transcription of RNA to DNA by reverse transcriptase, an unusual kind of DNA polymerase: an enzyme (complex) that uses an RNA strand as template to synthesize the matching complementary DNA. [4f5-74]

It is used by retroviruses:
Virus = genetic elements enclosed by a protective coat; size about 0.1 µm. It inserts its genetic material (normally DNA) and enzymes into a cell and "takes a ride" to replicate itself (several hundred copies). The genome also encodes the special enzymes needed for its own replication.
A retrovirus's genetic material is RNA. The virus also contains reverse transcriptase. [3F6-82]

Reverse transcriptase (like any DNA polymerase) adds the next nucleotide to a growing strand only if
1. the incoming base matches that of the template, and
2. the preceding base pair is also matched.
In replicative DNA polymerases the second requirement is part of a proofreading device that corrects replication errors; reverse transcriptase itself has no proofreading exonuclease.

BUT this poses a question: how does the process start?
In vivo: a short (about 10 nucleotides) RNA primer is assembled at the initiation site by DNA primase.
In vitro: we introduce a particular short sequence that complements the desired start site and serves as primer to grow a desired sequence of DNA.

In the RT process a TTTTT segment is used as primer to complement the poly-A tail. An RNA-polymerase binding sequence XXXX is added. Then:
1. RT produces single-strand cDNA, which is
2. complemented to double-stranded cDNA by DNA polymerase.
3. Transcription to cRNA (50-fold linear amplification!): only the correct strand is transcribed (the XXXX determines which; transcription proceeds only toward the 5' end of the template).
4. The nucleotides used to build the RNA are biotinylated.
5. Fragment the cRNA (chemically, at random positions) to get pieces of 50 - 200 bases.

(b) Probe (chip) preparation [Wafer]:
Cover a glass slide with covalent linker molecules, terminated with a protective group that can be removed by light.
Use photolithography to selectively expose patches, add the appropriate chemical group, cover with a protective layer, and repeat. The basic feature (on the U133plus2 chip) is an 11 µm square covered with several 10^6 identical single-strand oligonucleotides, 25 bases long, copied from a gene. [Photo,Synth]
For each gene to be represented, select 11 such oligos; pair each with a mismatched oligo (MM). [PerfectMatch]
Selection of the probe-sets is a major component of the process. They are biased towards the 3' end: the last exon is usually not spliced out, and if the RNA polymerase falls off too early when the target is prepared, the complements of the 3'-biased probes are still present.

(c) Hybridization
At 45 °C for 16 hours; wash and stain (fluorescently marked avidin attaches to biotin). [Hyb,Shirley's]

(d) Detection
Illuminate and scan [Shirley's]: measure the intensity of emitted light at a resolution of 2.5 µm [pixels]; calculate the average intensity of each basic feature (square).

(e) Output
For each gene on the chip, 11+11 numbers: the averages of the pixels in each basic feature, PM_i and MM_i, i = 1, 2, ..., 11.
Calculate the differences D_i and their weighted average for each gene:

    D_i = PM_i - MM_i

    AvgDiff = \frac{1}{N} \sum_{i=1}^{N} W_i D_i

The average is taken over N differences; outliers and suspect values are discarded. [excel]
Absent/Present call.
Reproducibility, noise: [scatter] multiplicative noise, 90 % within 20-50 %
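The AvgDiff calculation above can be sketched in Python as follows. This is only an illustration under stated assumptions: the notes do not specify the weights W_i or the rule for discarding outliers, so the hypothetical function `avg_diff` takes both as optional inputs (`weights`, defaulting to 1, and a boolean `keep` mask, defaulting to keeping every pair). N is taken to be the number of pairs kept.

```python
from typing import Optional, Sequence


def avg_diff(
    pm: Sequence[float],
    mm: Sequence[float],
    weights: Optional[Sequence[float]] = None,
    keep: Optional[Sequence[bool]] = None,
) -> float:
    """AvgDiff = (1/N) * sum_i W_i * (PM_i - MM_i) over the N pairs that are kept.

    `weights` and `keep` are assumptions of this sketch; the notes do not say
    how W_i is chosen or how outliers are identified.
    """
    if len(pm) != len(mm):
        raise ValueError("pm and mm must have the same length")
    n_pairs = len(pm)
    w = [1.0] * n_pairs if weights is None else list(weights)
    k = [True] * n_pairs if keep is None else list(keep)
    if len(w) != n_pairs or len(k) != n_pairs:
        raise ValueError("weights and keep must match the number of probe pairs")

    # D_i = PM_i - MM_i, weighted, for the pairs that survive filtering.
    terms = [wi * (p - m) for p, m, wi, ki in zip(pm, mm, w, k) if ki]
    if not terms:
        raise ValueError("no probe pairs left after discarding outliers")
    return sum(terms) / len(terms)
```

With three made-up probe pairs, `avg_diff([100, 200, 300], [50, 100, 150])` averages the differences 50, 100 and 150. Passing `keep=[True, True, False]` drops the third pair before averaging.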