Tiling arrays

Please take a ticket
• …and sit in your group (7 different
groups)
• This is for a larger exercise in the
afternoon
Today
• Homework 1!
• Tiling arrays and ChIP
• Cross-species conservation
• (possibly) The ENCODE project
• Tying it all together (larger group
exercise in interpretation)
From last time: Let's look at
these things
• 5 minutes with your sideman:
• Look at the RPS9 gene, and turn on Refseqs,
UCSC genes, human mRNAs, ESTs, CpG
islands and repeats
• How well do RefSeqs, ESTs and known
genes correlate?
• Are there any CpGs or repeats - where are
they located? What type of repeats are there?
Annotating genome by
hybridization approaches
• There is a set of methods to annotate
genomes that are based on the ability of
DNA and/or RNA to hybridize in double
strands
Fundamental idea:
make DNA probes that are complementary to
RNAs that exist in the cell
Put these on a glass slide
Measure hybridization by attaching a
color molecule at the RNA end
RNA
DNA PROBES
If the RNA that we designed the probe for is present,
The probe will light up = the gene is expressed
RNA
DNA PROBES
What do we make probes for?
• Genes?
Probe specific for the HCFC1 gene
• Say that we make one probe per gene
and put on a glass slide
• That way we could potentially detect all
genes if they are expressed
• This is how microarrays work (next part
of the course)
• What are the limitations with this?
• (2 minutes)
Albin’s take
• Even if it works perfectly, we will only
get values for genes we know (that we
make probes for)
• We get no information on how the gene
looks (what isoform) – we just know
whether it is there or not
• We get only expression information, no
location information
• How can we resolve these issues?
Answer: make “all possible
probes”
• How do we know which probes are
possible?
• We have the genome!
• Make probes tiling the genome
• Which probes will light up if only the
HCFC1 gene is expressed?
• Only the exonic parts should light up
Because we hybridize mature mRNAs
We call these probe sets on glass slides
“tiling arrays”
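A minimal sketch of the idea, assuming a toy region with made-up coordinates: tile the region with fixed-length probes, then keep the probes that lie entirely inside an exon of the one expressed gene. The probe length (25), the step (35) and the exon positions are illustrative assumptions, not values from any real array design.

```python
def tile_probes(region_start, region_end, probe_len=25, step=35):
    """Return (start, end) half-open intervals for probes tiling a region."""
    probes = []
    pos = region_start
    while pos + probe_len <= region_end:
        probes.append((pos, pos + probe_len))
        pos += step
    return probes


def probes_that_light_up(probes, exons):
    """Keep probes lying entirely inside an exon of the expressed gene."""
    lit = []
    for p_start, p_end in probes:
        if any(e_start <= p_start and p_end <= e_end for e_start, e_end in exons):
            lit.append((p_start, p_end))
    return lit


# Hypothetical toy gene: two exons separated by one intron.
exons = [(100, 200), (400, 480)]
probes = tile_probes(0, 600)
lit = probes_that_light_up(probes, exons)
print(len(probes), "probes in total,", len(lit), "expected to light up")
for start, end in lit:
    print(start, end)
```

Probes in the intron and outside the gene stay dark, because only mature mRNA is hybridized.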
Tiling arrays
Tiling arrays are multi-purpose tools - they can be used for
many things: to see
• expressed RNA
• DNA regions bound by transcription factors
• Accessible DNA
The tiling array is composed of probes that “tile” the
genome – meant to cover most of it.
This is only partially true:
1) Probes are often separated by spacers
2) There are regions that are hard to make meaningful
probes for – what regions would that be?
Finding active RNA locations
Let's look at this in the browser
- still the RPS9 gene
Slightly more tricky, due to how the data
looks. This is non-standard data, so we
need to use a specific assembly:
Human May 2004 (hg17)
Use the
Affy Txn… track in the Expression group,
and click on it to set options
Set the option like this for now, and submit
Try to interpret the many tracks compared to
the cDNA tracks: what are we looking at?
Transfrags and signals
• Tiling array probes give a signal - the stronger it is,
the more “expression”.
• Tiling array probes lie next to each other along the genome, so it makes
sense to group neighbouring probes that have “significant”
signal together into a larger block
• This is done by specific statistics packages (not part
of the course, but can be downloaded) - methodology
is slightly arbitrary
• Affymetrix calls these blocks “transfrags”
• They should be viewed as a simplification of the data,
that often is helpful
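The grouping step above can be sketched as follows. This is a deliberately simple threshold-and-merge rule, not the statistics Affymetrix actually uses; the function name `call_blocks` and the parameters `threshold`, `max_gap` and `min_probes` are assumptions made for illustration.

```python
def call_blocks(probes, threshold, max_gap=40, min_probes=2):
    """Merge neighbouring probes with signal >= threshold into blocks.

    probes: list of (start, end, signal), sorted by start.
    Returns a list of (block_start, block_end, n_probes).
    """
    blocks = []
    current = None  # [start, end, n_probes] of the block being built
    for start, end, signal in probes:
        if signal < threshold:
            continue  # probes without "significant" signal are ignored
        if current is not None and start - current[1] <= max_gap:
            # Close enough to the previous significant probe: extend the block.
            current[1] = end
            current[2] += 1
        else:
            # Too far away: store the finished block (if big enough), start anew.
            if current is not None and current[2] >= min_probes:
                blocks.append(tuple(current))
            current = [start, end, 1]
    if current is not None and current[2] >= min_probes:
        blocks.append(tuple(current))
    return blocks


# Made-up signals for nine probes of length 25, spaced 35 bp apart.
probes = [
    (0, 25, 1.0), (35, 60, 8.0), (70, 95, 9.5), (105, 130, 7.2),
    (140, 165, 0.5), (175, 200, 6.8), (210, 235, 1.1),
    (245, 270, 9.0), (280, 305, 8.5),
]
blocks = call_blocks(probes, threshold=5.0)
print(blocks)
```

Changing `threshold`, `max_gap` or `min_probes` changes the blocks, which is one way to see why the methodology is called slightly arbitrary.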
Let's change the options to
also look at transfrags
Again, let's look at the region
Live Demo
To consider: What is the advantage of
tiling arrays compared to cDNA
sequencing, and vice versa?
cDNA sequencing
• Connectivity between
exons
• Full-length transcripts
• Can be a blind method –
but is often used in a
targeted way – it is
common to make cDNAs
using targeted PCR
Tiling arrays
• No connectivity, just
signals where probes are
• Issues with cross-hybridization
• Hard to get transcript
edges right
• An almost blind method
Blind methods
• …are not targeting any specific gene or
region
• Independent of annotation!
• Some call this “unbiased”, but…no
methods are unbiased
• Valuable, because molecular biology is
very affected by “ascertainment bias”
• Blind methods are not affected by this,
and therefore can give totally new
insights
Tiling arrays used to find
transcription factor binding
sites - ChIP-on-chip
Chromatin immunoprecipitation (ChIP) is a
classical molecular biology technique - used
to capture DNA-bound proteins and
the corresponding sequence
Isolation of bound DNA: Sequencing
Chromatin immunoprecipitation (ChIP):
1. Fix everything
that is bound to DNA
with formaldehyde
2. Shear the DNA
3. Fish out the protein of
interest with an
antibody
4. Then get the bound
DNA and sequence it
ChIP-chip
With a tiling array, we
can see all sites in the
genome by instead
putting the bound DNA
on the tiling array!
Assessing ChIP-chip data in the browser
• Data-wise, the same thing as tiling arrays with RNA:
probes will get signal, and we lump probes together
like transfrags
• Usually looks very different: much fewer regions, and
smaller regions
• Live Demo: Using the “Affy sites” track in the
ENCODE Chromatin Immunoprecipitation track: set
this to “full”.
• To consider: what are the pros and cons of this
technique compared to predicting transcription factor
binding sites computationally?
Recent developments
• Tiling arrays were very hot in 2004-2007
• They are now being superseded by new
sequencing methods(!).
• More in the last part of the course
Genomic conservation
between species
As we have the genomic sequence of
multiple species, it makes sense to
compare these on genome level
In the UCSC browser, this is done in
many steps, and we can choose which
step we want to look at
Central to the analysis is that we compare
two genomes to each other at a time,
and later combine the results
Big-chunk changes
• Whole chromosomes do not correspond
to each other in terms of content
between species - a lot of moving around has occurred
Which mouse genome parts match where in human?
So, it makes sense to try to find smaller
pieces that are very similar, and then
connect these to each other
The human ACTN3 gene aligned to mouse
Mouse genome regions with high similarity are shown
as blocks, colored by the chromosome they come
from
Matches to different chromosomes at the same place
might mean duplicated genes, or even duplicated parts
of genes
We try to make chains that link these blocks:
Must have a logical order (follow each other) in
BOTH species, but can skip segments
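A toy sketch of the chaining rule above, assuming all blocks come from one mouse chromosome on the same strand: pick the highest-scoring set of blocks whose order is consistent in both genomes, skipping blocks that do not fit. This is a plain dynamic-programming illustration, not the actual UCSC chaining program; the block coordinates and scores are made up.

```python
def best_chain(blocks):
    """Highest-scoring chain of blocks that are in order in both genomes.

    blocks: list of (h_start, h_end, m_start, m_end, score).
    Returns (total_score, chain), where chain is a list of blocks.
    """
    blocks = sorted(blocks)  # sort by human start coordinate
    n = len(blocks)
    if n == 0:
        return 0, []
    best = [b[4] for b in blocks]  # best chain score ending at each block
    prev = [None] * n              # previous block in that best chain
    for j in range(n):
        for i in range(j):
            # Block i may precede block j only if it ends before j starts
            # in BOTH human and mouse coordinates.
            follows = blocks[i][1] <= blocks[j][0] and blocks[i][3] <= blocks[j][2]
            if follows and best[i] + blocks[j][4] > best[j]:
                best[j] = best[i] + blocks[j][4]
                prev[j] = i
    end = max(range(n), key=lambda k: best[k])
    total = best[end]
    chain = []
    while end is not None:
        chain.append(blocks[end])
        end = prev[end]
    chain.reverse()
    return total, chain


# Hypothetical blocks; the second one is out of order in mouse.
blocks = [
    (100, 200, 1000, 1100, 50),
    (250, 300, 5000, 5050, 20),
    (320, 400, 1150, 1230, 40),
    (450, 520, 1300, 1370, 30),
]
total, chain = best_chain(blocks)
print(total, chain)
```

The out-of-order block is skipped, which is the "can skip segments" part of the rule.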
Now, most chromosomes just have one longer chain.
Seems clear that the pink chain is the best, and that the
yellow and green represent gene variants on different
chromosomes
What if we want the “best” chain?
The NET algorithm tries to find the best chain, and
throws out other chains that overlap it. Potentially you
can get many NETs, though
In this case, there is only one solution
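A rough sketch of the "keep the best chain, throw out overlapping ones" idea, written as a greedy selection by score. The real net construction is more involved (for example, lower-scoring chains can be placed in gaps of better ones), so treat this only as an illustration. The chain names follow the colors on the slide, plus a hypothetical "blue" chain; all coordinates and scores are made up.

```python
def overlaps(a, b):
    """True if two (start, end) half-open intervals share at least one base."""
    return a[0] < b[1] and b[0] < a[1]


def greedy_net(chains):
    """Keep chains in order of decreasing score, dropping any chain that
    overlaps an already kept chain on the reference (human) genome.

    chains: list of (name, start, end, score).
    """
    kept = []
    for chain in sorted(chains, key=lambda c: c[3], reverse=True):
        if not any(overlaps((chain[1], chain[2]), (k[1], k[2])) for k in kept):
            kept.append(chain)
    return sorted(kept, key=lambda c: c[1])


chains = [
    ("pink", 1000, 9000, 5000),
    ("yellow", 2000, 3000, 400),
    ("green", 6000, 7000, 350),
    ("blue", 9500, 9900, 100),  # hypothetical chain outside the pink one
]
net = greedy_net(chains)
print(net)
```

Yellow and green are dropped because they overlap pink; the non-overlapping chain survives, which is how more than one net can appear.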
Walking between genomes
By clicking on a chain or a “netted” chain
from another species, you get to a page
where you can open a new browser at
that position in the other species
(live demo)
The “Conservation” track
• The conservation track is a multiple alignment
(“all” species), built from pairwise net chains
• It is both an alignment and a “score”, which is
−log(P(nucleotide is neutrally evolving))
• Alignments are shown as letters if we zoom in
deep enough
(live demo).
In the newest assemblies, there are two
conservation tracks, made by slightly different
methods. They are basically two sides of the
same coin - same basic idea.
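To get a feel for a score of the form −log(P), here is a small sketch. The base of the logarithm is not stated on the slide; base 10 is an assumption here, and `conservation_score` is a hypothetical helper, not a browser function.

```python
import math


def conservation_score(p_neutral):
    """-log10 of the probability that a position is neutrally evolving."""
    if not 0.0 < p_neutral <= 1.0:
        raise ValueError("p_neutral must be in (0, 1]")
    return -math.log10(p_neutral)


# The less likely neutral evolution is, the higher the score.
for p in (0.5, 0.05, 0.001):
    print(p, round(conservation_score(p), 2))
```

A tenfold drop in the probability adds one unit to the score, so strongly conserved positions stand out as tall peaks.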
Challenge
• Turn on the mouse chain and net tracks
for the RPS9 gene
• Are there one or many chains?
• If you use these chains to go over to the
mouse genome, what do you hit (use all
annotation that is relevant)?
Variation within species on
single nucleotides
• Single nucleotide polymorphisms
(SNPs “snips”) - a whole research
field
• Single nucleotide change that
happens in >1% of the population.
• The UCSC browser shows SNPs
from dbSNP (database at
NIH). It colors SNPs by
potential functional annotation
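A small sketch of the >1% rule, assuming "happens in >1% of the population" is read as a minor allele frequency above 0.01 in a sample of observed alleles. The function names and the sample data are made up for illustration.

```python
from collections import Counter


def minor_allele_frequency(alleles):
    """Frequency of the second most common allele at one position."""
    counts = Counter(alleles).most_common()
    if len(counts) < 2:
        return 0.0  # only one allele observed: no variation
    return counts[1][1] / len(alleles)


def is_snp(alleles, cutoff=0.01):
    """True if the minor allele is more frequent than the cutoff."""
    return minor_allele_frequency(alleles) > cutoff


# Hypothetical samples of 1000 observed alleles at two positions.
common_variant = ["A"] * 970 + ["G"] * 30
rare_variant = ["A"] * 998 + ["G"] * 2
print(minor_allele_frequency(common_variant), is_snp(common_variant))
print(minor_allele_frequency(rare_variant), is_snp(rare_variant))
```

The second position varies too, but too rarely to count as a SNP under this definition.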
Challenge
Technical:
• In the RPS9 gene, turn on the SNP track to
“full”
• At one part of the gene, the SNPs are
colored. What annotations do the colors
stand for?
Biological/philosophical:
• Find arguments and counter-arguments on
the following statement
• “SNPs are what makes us humans different
from each other”
Larger exercise in larger
groups: interpreting genes
• I will divide you into 6 groups - look at your lottery ticket
• Each group will get two gene IDs
– You will investigate both of these in the browser (mm8 assembly),
for instance:
• what is special about it
• what is the gene product (annotation)
• what kind of track features are particularly interesting/strange
in this case, etc. Use all tracks we have talked about if
available and relevant, or even other tracks
• Do not forget to look at the neighborhood of the gene - zoom
out!
• You will then show the main findings for one of the genes to the other
groups (say 3-5 minutes) - and we can discuss and interpret.
• The other gene will be presented by another group. Your group will be
“opponents” for this gene: try to find things that the other group has
not thought about.
• Presentation and opponent genes will be distributed randomly – you
won't know until you present.
Some cutting edge science:
The ENCODE project
• Encyclopedia Of DNA Elements
• Aims for a targeted and coordinated elucidation of the
(whole) human genome, using multiple systems
• The pilot phase has just ended: deep analysis of 1% (30 Mbp) of the
genome
• 44 regions
• 14 regions chosen by function: for instance the HOXD gene
cluster – 0.5 to 2 Mbp
• 30 regions chosen “randomly” - 500 kb
• Should be viewed as a pilot project for the rest of the
genome – so both technology and biology are driving it
• A large number of labs involved in both data production and
analysis
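A quick arithmetic check of the region numbers above, taking the slide's figures at face value (30 Mbp in total, 30 random regions of 500 kb, 14 manually chosen regions):

```python
total_mbp = 30.0                      # 1% of the genome, from the slide
n_random, random_size_mbp = 30, 0.5   # 30 "random" regions of 500 kb each
n_manual = 14                         # regions chosen by function

random_total = n_random * random_size_mbp   # Mbp covered by random regions
manual_total = total_mbp - random_total     # Mbp left for the chosen regions
mean_manual = manual_total / n_manual       # average size of a chosen region

print(n_random + n_manual, "regions")
print(random_total, manual_total, round(mean_manual, 2))
```

So roughly half of the 30 Mbp is random and half is hand-picked, and the average hand-picked region of about 1 Mbp sits inside the stated 0.5 to 2 Mbp range.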
Some impressive numbers
• 400 million data points, excluding sequencing of other
genomes (adds another 250 million)!
• Tiling arrays from 11 different cell sources
• 96 ChIP-chip experiments
• Tag sequencing data to identify promoters (covered later in
the course)
• In-depth cDNA annotation (GENCODE)
• Sequencing of orthologous regions in a wide array of species
• …and a lot more
…and these are not all data tracks!
Why introduce these things
now in the course?
• Get you used to looking at tons of data at
once
• This is a fantastic data resource, which
is under-used
• Make you realize that to analyze such
data, you will have to understand the
underlying method/biology
Highlights of the ENCODE paper
[Slide shows the author list of the ENCODE pilot paper: a very long list of names and affiliations, beginning with Ewan Birney, John A. Stamatoyannopoulos, Anindya Dutta, Roderic Guigó, Thomas R. Gingeras, Elliott H. Margulies, Zhiping Weng, Michael Snyder and Emmanouil T. Dermitzakis.]
The genome is pervasively
transcribed

The majority of nucleotides in the ENCODE
regions are part of at least one primary
transcript
GENCODE annotation,
RACE-array experiments
(RxFrags), and PET tags
Regulatory elements are distributed
around TSSs (not upstream in
particular)
Distal TSSs
• RACE extension, validated by PCR, of
exons detected by tiling arrays shows that
many genes can have distal TSSs, within 330 kb,
inside other genes(!)
Around 5% of the bases are
under selective pressure
• However, not all
functional regions
are conserved:
biologically active
elements with neutral
benefits?
• Or, is our measure of
conservation
capturing what we
want?
Back to the UCSC browser
Two ways to use ENCODE data
1. The ENCODE tracks (we have already used
some). Danger: they only cover 1% of the genome!
2. The ENCODE version of the UCSC
browser: http://genome.ucsc.edu/ENCODE/
only shows the 1% regions. Same data as
above, though
Issues
• ENCODE data is complex and hard to
interpret due to
– New technology: what is noise and what is signal?
– What does the signal mean even if it is real?
– Very many technologies - no-one knows them all
– Messy biology
– Use of different cell lines in different experiments
(sigh)
Larger challenge
Use the ENCODE browser and again look
at the RPS9 gene
What additional tracks are available?
Do we see anything more than we have
already seen?
(I am NOT expecting you to understand
and look at every track :) )