Tiling arrays

Please take a ticket
• …and sit in your group (7 different groups)
• This is for a larger exercise in the afternoon

Today
• Homework 1!
• Tiling arrays and ChIP
• Cross-species conservation (possibly)
• The ENCODE project
• Tying it all together (larger group exercise in interpretation)

From last time: let's look at these things
• 5 minutes with your neighbor:
• Look at the RPS9 gene, and turn on RefSeqs, UCSC genes, human mRNAs, ESTs, CpG islands and repeats
• How well do RefSeqs, ESTs and Known Genes correlate?
• Are there any CpG islands or repeats? Where are they located? What types of repeats are there?

Annotating genomes by hybridization approaches
• There is a set of methods for annotating genomes that rely on the ability of DNA and/or RNA to hybridize into double strands

Fundamental idea: make DNA probes that are complementary to RNAs that exist in the cell
Put these on a glass slide
Measure hybridization by attaching a color molecule at the RNA end
(Figure labels: RNA, DNA probes)

If the RNA that we designed the probe for is present, the probe will light up = the gene is expressed
(Figure labels: RNA, DNA probes)

What do we make probes for?
• Genes?
(Figure: probe specific for the HCFC1 gene)
• Say that we make one probe per gene and put them on a glass slide
• That way we could potentially detect all genes if they are expressed
• This is how microarrays work (next part of the course)
• What are the limitations of this?
• (2 minutes)

Albin’s take
• Even if it works perfectly, we will only get values for genes we know (those we make probes for)
• We get no information on what the transcript looks like (which isoform): we only know whether it is there or not
• We get only expression information, no location information
• How can we resolve these issues?

Answer: make “all possible probes”
• How do we know which probes are possible?
• We have the genome!
• Make probes tiling the genome
• Which probes will light up if only the HCFC1 gene is expressed?
• Only the exonic parts should light up, because we hybridize mature mRNAs

We call these probe sets on glass slides “tiling arrays”

Tiling arrays
Tiling arrays are multi-purpose tools that can be used to see many things:
• expressed RNA
• DNA regions bound by transcription factors
• accessible DNA

The tiling array is composed of probes that “tile” the genome, meant to cover most of it. This is only partially true:
1) Probes are often separated by spacers
2) There are regions that are hard to make meaningful probes for. Which regions would those be?

Finding active RNA locations
Let's look at this in the browser, still at the RPS9 gene
Slightly more tricky, due to how the data looks. This is non-standard data, so we need to use a specific assembly: Human May 2004 (hg17)
Use the Affy Txn… track in the Expression group, and click on it to set options
Set the options like this for now, and submit
Try to interpret the many tracks compared to the cDNA tracks: what are we looking at?

Transfrags and signals
• Tiling array probes give a signal: the stronger it is, the more “expression”.
• Tiling array probes lie one after another along the genome, so it makes sense to group adjacent probes that have “significant” signal into a larger block
• This is done by specific statistics packages (not part of the course, but they can be downloaded); the methodology is slightly arbitrary
• Affymetrix calls these blocks “transfrags”
• They should be viewed as a simplification of the data that is often helpful

Let's change the options to also look at transfrags
Again, let's look at the region
Live demo

To consider: what is the advantage of tiling arrays compared to cDNA sequencing, and vice versa?
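
To make the probe and transfrag ideas above concrete, here is a toy Python sketch. It is not the Affymetrix method: the probe length, spacing, signal threshold, `max_gap` and `min_run` values are invented for illustration, and the function names `tile_region`, `simulate_signal` and `call_transfrags` are hypothetical. It tiles a small region with probes, gives high signal to probes that lie fully inside an exon, and merges nearby positive probes into blocks.

```python
from typing import List, Tuple

Probe = Tuple[int, int, float]  # (start, end, signal), half-open coordinates


def tile_region(start: int, end: int, probe_len: int = 25,
                spacing: int = 35) -> List[Tuple[int, int]]:
    """Place probes of probe_len bases every `spacing` bases across [start, end)."""
    return [(s, s + probe_len) for s in range(start, end - probe_len + 1, spacing)]


def simulate_signal(probes: List[Tuple[int, int]],
                    exons: List[Tuple[int, int]]) -> List[Probe]:
    """Toy hybridization: a probe lights up only if it lies fully inside an exon."""
    result = []
    for s, e in probes:
        in_exon = any(s >= ex_s and e <= ex_e for ex_s, ex_e in exons)
        result.append((s, e, 100.0 if in_exon else 1.0))
    return result


def call_transfrags(probes: List[Probe], threshold: float = 10.0,
                    max_gap: int = 40, min_run: int = 50) -> List[Tuple[int, int]]:
    """Merge above-threshold probes that are at most max_gap apart into blocks,
    then keep only blocks that span at least min_run bases."""
    positive = sorted(p for p in probes if p[2] >= threshold)
    blocks: List[List[int]] = []
    for start, end, _signal in positive:
        if blocks and start - blocks[-1][1] <= max_gap:
            blocks[-1][1] = max(blocks[-1][1], end)  # extend the current block
        else:
            blocks.append([start, end])              # start a new block
    return [(s, e) for s, e in blocks if e - s >= min_run]


# A made-up gene with two exons in a 600-base region.
exons = [(100, 200), (400, 500)]
probes = simulate_signal(tile_region(0, 600), exons)
transfrags = call_transfrags(probes)
```

With these made-up numbers the two blocks come out as (105, 200) and (420, 480), while the exons are (100, 200) and (400, 500). The blocks sit inside the exons but do not match their edges, which is one way to see why transcript edges are hard to get right with spaced probes.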
cDNA sequencing
• Connectivity between exons
• Full-length transcripts
• Can be a blind method, but is often used in a targeted way: it is common to make cDNAs using targeted PCR

Tiling arrays
• No connectivity, just signals where probes are
• Issues with cross-hybridization
• Hard to get transcript edges right
• An almost blind method

Blind methods
• …are not targeting any specific gene or region
• Independent of annotation!
• Some call this “unbiased”, but… no method is unbiased
• Valuable, because molecular biology is strongly affected by “ascertainment bias”
• Blind methods are not affected by this, and can therefore give totally new insights

Tiling arrays used to find transcription factor binding sites: ChIP-on-chip
Chromatin immunoprecipitation (ChIP) is a classical molecular biology technique, used to capture DNA-bound proteins and the corresponding sequence

Isolation of bound DNA: sequencing
Chromatin immunoprecipitation (ChIP):
1. Fix everything that is bound to DNA with formaldehyde
2. Shear the DNA
3. Fish out the protein of interest with an antibody
4. Then get the bound DNA and sequence it

ChIP-chip
With a tiling array, we can see all sites in the genome by instead putting the bound DNA on the tiling array!

Assessing ChIP-chip data in the browser
• Data-wise, the same thing as tiling arrays with RNA: probes will get signal, and we lump probes together like transfrags
• Usually looks very different: far fewer regions, and smaller regions
• Live demo: using the “Affy sites” track in the ENCODE Chromatin Immunoprecipitation track: set this to “full”.
• To consider: what are the pros and cons of this technique compared to predicting transcription factor binding sites computationally?

Recent developments
• Tiling arrays were very hot 2004-2007
• They are now becoming outdated by new sequencing methods(!).
• More in the last part of the course

Genomic conservation between species
As we have the genomic sequence of multiple species, it makes sense to compare these at the genome level
In the UCSC browser, this is done in many steps, and we can choose which step we want to look at
Central to the analysis is that we compare two genomes to each other at a time, and later combine the results

Big-chunk changes
• Whole chromosomes do not correspond to each other in terms of content between species: a lot of moving around has occurred

Which mouse genome parts match where in human?
So, it makes sense to try to find smaller pieces that are very similar, and then connect these to each other

The human ACTN3 gene aligned to mouse
Mouse genome regions with high similarity are shown as blocks, colored by which chromosome they come from
Matches to different chromosomes at the same place might mean duplicated genes, or even duplicated parts of genes

We try to make chains that link these blocks: they must have a logical order (follow each other) in BOTH species, but can skip segments
Now, most chromosomes just have one longer chain. It seems clear that the pink chain is the best, and that the yellow and green represent gene variants on different chromosomes

What if we want the “best” chain?
The NET algorithm tries to find the best chain, and throws out other chains that overlap it. Potentially you can get many NETs, though
In this case, there is only one solution

Walking between genomes
By clicking on a chain or a “netted” chain from another species, you get to a page where you can open a new browser at that position in the other species (live demo)

The “Conservation” track
• The conservation track is a multiple alignment (“all” species), built from pairwise net chains
• It is both an alignment and a “score”, which is -log(P(nucleotide is neutrally evolving))
• Alignments are shown as letters if we zoom in deep enough (live demo).
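
The chaining rule from the earlier slide (blocks must follow each other in BOTH species, but segments can be skipped) can be sketched in a few lines of Python. This is a simplified illustration, not the program behind the UCSC chain tracks: it assumes all blocks come from one mouse chromosome and one strand, it ignores gap penalties, and the `Block` type, the `best_chain` function and the scores are hypothetical.

```python
from typing import List, NamedTuple


class Block(NamedTuple):
    h_start: int   # start in human
    h_end: int     # end in human (exclusive)
    m_start: int   # start in mouse
    m_end: int     # end in mouse (exclusive)
    score: float   # similarity score of the aligned block


def best_chain(blocks: List[Block]) -> List[Block]:
    """Return the highest-scoring set of blocks that is ordered in both genomes."""
    ordered = sorted(blocks, key=lambda b: b.h_start)
    if not ordered:
        return []
    n = len(ordered)
    best = [b.score for b in ordered]  # best chain score ending at block i
    prev = [-1] * n                    # previous block in that chain
    for i, cur in enumerate(ordered):
        for j in range(i):
            cand = ordered[j]
            # cand may precede cur only if it ends before cur starts in BOTH species
            follows = cand.h_end <= cur.h_start and cand.m_end <= cur.m_start
            if follows and best[j] + cur.score > best[i]:
                best[i] = best[j] + cur.score
                prev[i] = j
    # Trace back from the best-scoring end block.
    i = max(range(n), key=best.__getitem__)
    chain = []
    while i != -1:
        chain.append(ordered[i])
        i = prev[i]
    return chain[::-1]


# Made-up blocks: b is in order in human but far out of order in mouse.
a = Block(100, 200, 1000, 1100, 50.0)
b = Block(250, 300, 5000, 5050, 20.0)
c = Block(350, 450, 1200, 1300, 60.0)
d = Block(500, 600, 1400, 1500, 40.0)
chain = best_chain([a, b, c, d])
```

Here the chain keeps a, c and d and skips b, because b cannot be placed before c in mouse coordinates. A net, in this toy picture, would then keep this chain and discard lower-scoring chains that overlap it.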
In the newest assemblies, there are two conservation tracks, made by slightly different methods. They are basically two sides of the same coin: same basic idea.

Challenge
• Turn on the mouse chain and net tracks for the RPS9 gene
• Is there one chain or many?
• If you use these chains to go over to the mouse genome, what do you hit (use all annotation that is relevant)?

Variation within species on single nucleotides
• Single nucleotide polymorphisms (SNPs, “snips”): a whole research field
• A single-nucleotide change that occurs in >1% of the population.
• The UCSC browser shows SNPs from dbSNP (a database at NIH). It colors SNPs by potential functional annotation

Challenge
Technical:
• In the RPS9 gene, turn on the SNP track to “full”
• At one part of the gene, the SNPs are colored. What annotations do the colors stand for?
Biological/philosophical:
• Find arguments and counter-arguments for the following statement
• “SNPs are what makes us humans different from each other”

Larger exercise in larger groups: interpreting genes
• I will divide you into 6 groups - look at your lottery ticket
• Each group will get two gene IDs
  • You will investigate both of these in the browser (mm8 assembly), for instance:
  • what is special about it
  • what is the gene product (annotation)
  • what kind of track features are particularly interesting/strange in this case, etc. Use all tracks we have talked about if available and relevant, or even other tracks
  • Do not forget to look at the neighborhood of the gene - zoom out!
• You will then show the main findings for one of the genes to the other groups (say 3-5 minutes), and we can discuss and interpret.
• The other gene will be presented by another group. Your group will be “opponents” for this gene: try to find things that the other group has not thought about.
• Presentation and opponent genes will be distributed randomly: you won't know until you present.
Some cutting-edge science: The ENCODE project
• Encyclopedia Of DNA Elements
• Aims for a targeted and coordinated elucidation of the (whole) human genome, using multiple systems
• Has currently completed the pilot phase: analyze 1% (30 Mbp) of the genome deeply
• 44 regions
• 14 regions chosen by function: for instance the HOXD gene cluster - 0.5 to 2 Mbp
• 30 regions chosen “randomly” - 500 kb
• Should be viewed as a pilot project for the rest of the genome, so both technology and biology are driving it
• A large number of labs involved in both data production and analysis

Some impressive numbers
• 400 million data points, excluding sequencing of other genomes (adds another 250 million)!
• Tiling arrays from 11 different cell sources
• 96 ChIP-chip experiments
• Tag sequencing data to identify promoters (covered later in the course)
• In-depth cDNA annotation (GENCODE)
• Sequencing of orthologous regions in a wide array of species
• …and a lot more

…and these are not all data tracks!

Why introduce these things now in the course?
• Get you used to looking at tons of data at once
• This is a fantastic data resource, which is under-used
• Make you realize that to analyze such data, you will have to understand the underlying method/biology

Highlights of the ENCODE paper
Ewan Birney*,1, John A. Stamatoyannopoulos*,2, Anindya Dutta*,3, Roderic Guigó*,4, 5, Thomas R. Gingeras*,6, Elliott H. Margulies*,7, Zhiping Weng*,8, 9, Michael Snyder*,10, 11, Emmanouil T. Dermitzakis*,12; John A. Stamatoyannopoulos*,2, Robert E. Thurman2, 13, Michael S. Kuehn2, 13, Christopher M. Taylor3, Shane Neph2, Christoph M. Koch12, Saurabh Asthana14, Ankit Malhotra3, Ivan Adzhubei14, Jason A. Greenbaum15, Robert M. Andrews12, Paul Flicek1, Patrick J. Boyle3, Hua Cao13, Nigel P. Carter12, Gayle K. Clelland12, Sean Davis16, Nathan Day2, Pawandeep Dhami12, Shane C. Dillon12, Michael O. Dorschner2, Heike Fiegler12, Paul G.
Giresi17, Jeff Goldy2, Michael Hawrylycz18, Andrew Haydock2, Richard Humbert2, Keith D. James12, Brett E. Johnson13, Sarah M. Johnson13, Neerja Karnani3, Kristin Lee2, Gregory C. Lefebvre12, Patrick A. Navas13, Fidencio Neri2, Stephen C. J. Parker15, Peter J. Sabo2, Richard Sandstrom2, Anthony Shafer2, David Vetrie12, Molly Weaver2, Sarah Wilcox12, Man Yu13, Francis S. Collins7, Job Dekker19, Jason D. Lieb17, Thomas D. Tullius15, Gregory E. Crawford20, Shamil Sunayev14, William S. Noble2, Ian Dunham12, Anindya Dutta*,3;Roderic Guigó*,4, 5, France Denoeud5, Alexandre Reymond21, 22, Philipp Kapranov6, Joel Rozowsky11, Deyou Zheng11, Robert Castelo5, Adam Frankish12, Jennifer Harrow12, Srinka Ghosh6, Albin Sandelin23, Ivo L. Hofacker24, Robert Baertsch25, 26, Damian Keefe1, Paul Flicek1, Sujit Dike6, Jill Cheng6, Heather A. Hirsch27, Edward A. Sekinger27, Julien Lagarde5, Josep F. Abril5, 28, Atif Shahab29, Christoph Flamm24, 30, Claudia Fried30, Jörg Hackermüller31, Jana Hertel30, Manja Lindemeyer30, Kristin Missal30, 32, Andrea Tanzer24, 30, Stefan Washietl24, Jan Korbel11, Olof Emanuelsson11, Jakob S. Pedersen26, Nancy Holroyd12, Ruth Taylor12, David Swarbreck12, Nicholas Matthews12, Mark C. Dickson33, Daryl J. Thomas25, 26, Matthew T. Weirauch25, James Gilbert12, Jorg Drenkow6, Ian Bell6, XiaoDong Zhao34, K.G. Srinivasan34, Wing-Kin Sung34, Hong Sain Ooi34, Kuo Ping Chiu34, Sylvain Foissac4, Tyler Alioto4, Michael Brent35, Lior Pachter36, Michael L. Tress37, Alfonso Valencia37, Siew Woh Choo34, Chiou Yu Choo34, Catherine Ucla22, Caroline Manzano22, Carine Wyss22, Evelyn Cheung6, Taane G. Clark38, James B. Brown39, Madhavan Ganesh6, Sandeep Patel6, Hari Tammana6, Jacqueline Chrast21, Charlotte N. Henrichsen21, Chikatoshi Kai23, Jun Kawai23, 40, Ugrappa Nagalakshmi10, Jiaqian Wu10, Zheng Lian41, Jin Lian41, Peter Newburger42, Xueqing Zhang42, Peter Bickel43, John S. Mattick44, Piero Carninci40,Yoshihide Hayashizaki23, 40, Sherman Weissman41, Emmanouil T. 
Dermitzakis*,12, Elliott H. Margulies*,7, Tim Hubbard12, Richard M. Myers33, Jane Rogers12, Peter F. Stadler24, 30, 45, Todd M. Lowe25, Chia-Lin Wei34, Yijun Ruan34, Michael Snyder*,10, 11, Ewan Birney*,1, Kevin Struhl27, Mark Gerstein11, 46, 47, Stylianos E. Antonarakis22, Thomas R. Gingeras*,6;James B. Brown39, Paul Flicek1, Yutao Fu8, Damian Keefe1, Ewan Birney*,1, France Denoeud5, Mark Gerstein11, 46, 47, Eric D. Green7, 48, Philipp Kapranov6, Ulaş Karaöz8, Richard M. Myers33, William S. Noble2, Alexandre Reymond21, 22, Joel Rozowsky11, Kevin Struhl27, Adam Siepel25, 26, $, John A. Stamatoyannopoulos*,2, Christopher M. Taylor3, James Taylor49, 50, Robert E. Thurman2, 13, Thomas D. Tullius15, Stefan Washietl24, Deyou Zheng11;Laura Liefer51, Kris A. Wetterstrand51, Peter J. Good51, Elise A. Feingold51, Mark S. Guyer51, Francis S. Collins52;Elliott H. Margulies*,7, Gregory M. Cooper33,%, George Asimenos53, Daryl J. Thomas25, 26, Colin N. Dewey54, Adam 62;Gerard G. Bouffard7, 48, Xiaobin Guan48, Nancy F. Hansen48, Jacquelyn R. Idol7, Valerie V.B. Maduro7, Baishali Maskeri48, Jennifer C. McDowell48, Morgan Park48, Pamela J. Thomas48, Alice C. Young48, and Robert W. Blakesley7, 48;Donna M. Muzny63, Erica Sodergren63, David A. Wheeler63, Kim C. Worley63, Huaiyang Jiang63, George M. Weinstock63, and Richard A. Gibbs63;Tina Graves64, Robert Fulton64, Elaine R. Mardis64, and Richard K. Wilson64;Michele Clamp65, James Cuff65, Sante Gnerre65, David B. Jaffe65, Jean L. Chang65, Kerstin Lindblad-Toh65, and Eric S. Lander65, 66;Maxim Koriabine67, Mikhail Nefedov67, Kazutoyo Osoegawa67, Yuko Yoshinaga67, Baoli Zhu67, and Pieter J. de Jong67;Zhiping Weng*,8, 9, Nathan D. Trinklein33,#, Yutao Fu8, Zhengdong D. Zhang11, Ulaş Karaöz8, Leah Barrera68, Rhona Stuart68, Deyou Zheng11, Srinka Ghosh6, Paul Flicek1, David C. King50, 59, James Taylor49, 50, Adam Ameur69, Stefan Enroth69, Mark C. Bieda70, Christoph M. Koch12, Heather A. 
Hirsch27, Chia-Lin Wei34, Jill Cheng6, Jonghwan Kim71, Akshay A. Bhinge71, Paul G. Giresi17, Nan Jiang72, Jun Liu34, Fei Yao34, Wing-Kin Sung34, Kuo Ping Chiu34, Vinsensius B. Vega34, Charlie W.H Lee34, Patrick Ng34, Atif Shahab29, Edward A. Sekinger27, Annie Yang27, Zarmik Moqtaderi27, Zhou Zhu27, Xiaoqin Xu70, Sharon Squazzo70, Matthew J. Oberley73, David Inman73, Michael A. Singer72, Todd A. Richmond72, Kyle J. Munn72, 74, Alvaro Rada-Iglesias74, Ola Wallerman74, Jan Komorowski69, Gayle K. Clelland12, Sarah Wilcox12, Shane C. Dillon12, Robert M. Andrews12, Joanna C. Fowler12, Phillippe Couttet12, Keith D. James12, Gregory C. Lefebvre12, Alexander W. Bruce12, Oliver M. Dovey12, Peter D. Ellis12, Pawandeep Dhami12, Cordelia F. Langford12, Nigel P. Carter12, David Vetrie12, Philipp Kapranov6, David A. Nix6, Ian Bell6, Sandeep Patel6, Joel Rozowsky11, Ghia Euskirchen10, Stephen Hartman10, Jin Lian41, Jiaqian Wu10, Alexander E. Urban10, Peter Kraus10, Sara Van Calcar68, Nate Heintzman68, Tae Hoon Kim68, Kun Wang68, Chunxu Qu68, Gary Hon68, Rosa Luna75, Christopher K. Glass75, M. Geoff Rosenfeld75, Shelley Force Aldred33,#, Sara J. Cooper33, Anason Halees8, Jane M. Lin9, Hennady P. Shulha9, Xiaoling Zhang8, Mousheng Xu8, Jaafar N. S. Haidar9, Yong Yu9, Ewan Birney*,1, Sherman Weissman41, Yijun Ruan34, Jason D. Lieb17, Vishwanath R. Iyer71, Roland D. Green72, Thomas R. Gingeras*,6, Claes Wadelius74, Ian Dunham12, Kevin Struhl27, Ross C. Hardison50, 59, Mark Gerstein11, 46, 47, Peggy J. Farnham70, Richard M. Myers33, Bing Ren68, Michael Snyder*,10, 11;Daryl J. Thomas25, 26, Kate Rosenbloom26, Rachel A. Harte26, Angie S. Hinrichs26, Heather Trumbower26, Hiram Clawson26, Jennifer Hillman-Jackson26, Ann S. Zweig26, Kayla Smith26, Archana Thakkapallayil26, Galt Barber26, Robert M. Kuhn26, Donna Karolchik26, David Haussler25, 26, 60, W. James Kent25, 26;Emmanouil T. Dermitzakis*,12, Lluis Armengol76, Christine P. Bird12, Taane G. Clark38, Gregory M. Cooper33,%, Paul I. W. 
de Bakker77, Andrew D. Kern26, Nuria Lopez-Bigas5, Joel D. Martin50, 59, Barbara E. Stranger12, Daryl J. Thomas25, 26, Abigail Woodroffe78,

The genome is pervasively transcribed
• The majority of nucleotides in the ENCODE regions are part of at least one primary transcript
GENCODE annotation, RACE-array experiments (RxFrags), and PET tags

Regulatory elements are distributed around TSSs (not upstream in particular)

Distal TSSs
• RACE extensions, validated by PCR of exons detected by tiling array, show that many genes can have distal TSSs within 330 kb, in other genes(!)

Around 5% of the bases are under selective pressure
• However, not all functional regions are conserved: biologically active elements with neutral benefits?
• Or, is our measure of conservation capturing what we want?

Back to the UCSC browser
Two ways to use ENCODE data
1. The ENCODE tracks (we have already used some). Danger: they only cover 1% of the genome!
2. The ENCODE version of the UCSC browser: http://genome.ucsc.edu/ENCODE/ only shows the 1% regions. Same data as above, though

Issues
• ENCODE data is complex and hard to interpret due to
  • New technology: what is noise and what is signal?
  • What does the signal mean even if it is real?
  • Very many technologies - no-one knows them all
  • Messy biology
  • Use of different cell lines in different experiments (sigh)

Larger challenge
Use the ENCODE browser and again look at the RPS9 gene
What additional tracks are available?
Do we see anything more than we have already seen?
(I am NOT expecting you to understand and look at every track :) )
