Joint Quality and Error Control Coding for DNA Microarrays

Quality and Error Control
Coding for DNA Microarrays
Olgica Milenkovic
ECE Department
University of Colorado, Boulder
IEEE Denver ComSoc
Outline
• DNA Microarrays
• VLSIPS (Very Large Scale Immobilized Polymer Synthesis)
• Production of DNA Microarrays (http://www.affymetrix.com/)
– Base Scheduling
– Mask Design
– Quality-Control Coding
• Error-Correcting DNA Microarrays (Multiplexed Arrays)
• Production of Multiplexed DNA Microarrays
– Base/Color Scheduling
– Mask Design
– Quality-Control Coding
DNA microarrays I
Slide #1
Goal: Determining which genes are expressed (active) and which are
unexpressed (inactive)
Comparative gene expression study of multiple cells
[Figure: Gene expression and co-regulation. A protein coding sequence is transcribed and translated into a protein; the diagram also marks the control of transcription and translation.]
DNA microarrays II
Slide #2
Creating the ‘cell cultures’ to be compared…
‘Green’ cell culture and ‘red’ cell culture (same steps for each):
DNA subsequence: 3’-AATTTCGC…-5’
mRNA: 5’-UUAAAGCG…-3’
cDNA: 3’-AATTTCGC…-5’
“Color coding”: 3’-AATTTCGC…-5’
Creation of tagged cDNA sequences from the first cell type
Creation of tagged cDNA sequences from the second cell type
DNA microarrays III
Spots
Slide #3
Gene Probes
Complementary sequences
hybridize with each other, forming
stable double-helices
Hybridization:
3’-AAGCT-5’
5’-TTCGA-3’
The DNA microarray is scanned by laser light of different wavelengths
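The hybridization rule above can be checked mechanically. The following is a minimal sketch, assuming perfect Watson-Crick pairing only (no mismatches or partial hybridization); the function names are illustrative and not from the slides.

```python
# Watson-Crick pairing for DNA bases.
COMPLEMENT = {"A": "T", "T": "A", "G": "C", "C": "G"}

def hybridizes(strand_3to5, strand_5to3):
    """True if the two antiparallel strands pair base by base.

    strand_3to5 is written 3'->5' and strand_5to3 is written 5'->3',
    so position i of one strand faces position i of the other.
    """
    return len(strand_3to5) == len(strand_5to3) and all(
        COMPLEMENT[a] == b for a, b in zip(strand_3to5, strand_5to3)
    )

def reverse_complement(strand_5to3):
    """Reverse complement of a strand, both read 5'->3'."""
    return "".join(COMPLEMENT[b] for b in reversed(strand_5to3))

# The pair shown on the slide: 3'-AAGCT-5' against 5'-TTCGA-3'.
print(hybridizes("AAGCT", "TTCGA"))
```

Reading the probe 5'->3' instead gives `reverse_complement("TTCGA")`, which is the same probe written in the opposite direction.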
Probe synthesis in microarrays I
VLSIPS (Gene Chip, AFFYMETRIX, Array
Manufacturing Manual)
Linkers
Mask
Quartz Wafer
Linker Activation
Slide #4
Probe synthesis in microarrays II
VLSIPS (Gene Chip, AFFYMETRIX, Array
Manufacturing Manual)
Solution of one DNA base
(A or T or G or C)
Solution of one DNA base
(A)
Slide #5
Slide #6
Base scheduling I
Fixed probe length: N
Production steps: ATGC ATGC ATGC ATGC
Synchronous schedule (length 4N)
[Figure: spots 1 to 5 plotted against the production steps; example probes CTGA and ACAA]
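A synchronous schedule can be simulated in a few lines. This is a sketch under the assumption that period r of the schedule builds position r of every probe, so a spot is exposed exactly once per period; `masks_for_probes` is a hypothetical helper name.

```python
def synchronous_schedule(n, period="ATGC"):
    """Periodic base schedule of length 4N for probes of length n."""
    return period * n

def masks_for_probes(probes, period="ATGC"):
    """For each production step, the set of spot indices that is exposed.

    Assumes all probes have the same length and that the r-th period
    synthesizes the r-th base of every probe.
    """
    n = len(probes[0])
    schedule = synchronous_schedule(n, period)
    masks = []
    for step, base in enumerate(schedule):
        position = step // len(period)  # probe position built in this period
        masks.append({s for s, p in enumerate(probes) if p[position] == base})
    return schedule, masks

schedule, masks = masks_for_probes(["CTGA", "ACAA"])
print(len(schedule))  # 4N steps for N = 4
for step, (base, mask) in enumerate(zip(schedule, masks)):
    print(step, base, sorted(mask))
```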
Slide #7
Base scheduling II
Production steps: AGGC TTGC TTGC CCGC
Asynchronous schedule
[Figure: spots 1 to 5 plotted against the production steps]
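With an asynchronous schedule, a probe can be synthesized if and only if it is a subsequence of the schedule. The sketch below embeds a probe greedily (leftmost possible step for each base); the greedy choice and the example probes are assumptions made for illustration, only the 16-step schedule is taken from the slide.

```python
def embed(probe, schedule):
    """Leftmost embedding of a probe into a base schedule.

    Returns the list of step indices at which the spot is exposed,
    or None if the probe is not a subsequence of the schedule.
    """
    steps = []
    position = 0
    for step, base in enumerate(schedule):
        if position < len(probe) and probe[position] == base:
            steps.append(step)
            position += 1
    return steps if position == len(probe) else None

schedule = "AGGCTTGCTTGCCCGC"  # production steps shown on the slide
for probe in ("AGTC", "GCTG", "CTGA"):  # hypothetical probes
    print(probe, embed(probe, schedule))
```

The last probe cannot be built because the schedule contains only one A, at the very first step.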
Base Scheduling III
Slide #8
• Shortest asynchronous base schedule
– Shortest common supersequence of a set of M sequences (NP-hard)
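Because the exact problem is NP-hard, short schedules are usually found heuristically. The sketch below uses the majority-merge heuristic, which is a technique swapped in here for illustration and not a method named on the slides; the probe set is hypothetical.

```python
from collections import Counter

def is_subsequence(seq, schedule):
    """True if seq can be read off schedule from left to right."""
    steps = iter(schedule)
    return all(base in steps for base in seq)

def majority_merge(probes):
    """Heuristic common supersequence (not guaranteed to be shortest).

    Repeatedly emits the base that is the next unsynthesized base of
    the largest number of probes, breaking ties alphabetically.
    """
    remaining = list(probes)
    schedule = []
    while any(remaining):
        counts = Counter(p[0] for p in remaining if p)
        base = max(sorted(counts), key=counts.get)
        schedule.append(base)
        remaining = [p[1:] if p and p[0] == base else p for p in remaining]
    return "".join(schedule)

probes = ["CTGA", "ACAA", "AGGC", "TTGC"]  # hypothetical probes, N = 4
schedule = majority_merge(probes)
print(schedule, len(schedule), "vs periodic length", 4 * len(probes[0]))
```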
$ES_N(M,k)$ – expected length of a longest common subsequence of M randomly chosen sequences of length N over an alphabet of size k
$\gamma_k^{(M)} = \lim_{N \to \infty} \frac{ES_N(M,k)}{N}$
$\gamma_k^{(M)} \;[\ldots]\; \frac{M \log(z_0 k)}{\log\big(k\,((1 \,[\ldots]\, z_0)^M \,[\ldots]\, 1)\big)}, \qquad z_0 = 2^{1/M} \,[\ldots]\, 1$
No significant gain for N≈20-30
Periodic schedule used instead (length 4N)
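The quantity ES_N(M,k)/N can be estimated numerically for small cases. The following is a rough Monte Carlo sketch for M = 2 only, using the standard dynamic program for the longest common subsequence; the trial count and seed are arbitrary assumptions, and the estimate at N ≈ 20-30 is a finite-length value, not the limiting constant.

```python
import random

def lcs_length(a, b):
    """Length of a longest common subsequence of a and b (O(|a||b|) DP)."""
    prev = [0] * (len(b) + 1)
    for x in a:
        cur = [0]
        for j, y in enumerate(b, 1):
            cur.append(prev[j - 1] + 1 if x == y else max(prev[j], cur[j - 1]))
        prev = cur
    return prev[-1]

def estimate_ratio(n, k=4, trials=200, seed=0):
    """Monte Carlo estimate of ES_N(2, k) / N for k <= 4."""
    rng = random.Random(seed)
    alphabet = "ATGC"[:k]
    total = 0
    for _ in range(trials):
        a = [rng.choice(alphabet) for _ in range(n)]
        b = [rng.choice(alphabet) for _ in range(n)]
        total += lcs_length(a, b)
    return total / (trials * n)

print(estimate_ratio(25))
```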
Mask Design
Slide #9
Border-length minimization
Feldman and Pevzner, 1994
Hannenhalli et al., 2002
Kahng et al., 2003, 2004
Key idea:
Arrange the probes on the array so that the total border length of all masks is minimal
Border-length graph: complete graph on M vertices, with edge weights equal to the Hamming distance between probes
Greedy traveling salesman algorithm + threading (discrete space-filling curve)
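The greedy tour over the border-length graph can be sketched as follows. This is a nearest-neighbor heuristic only, assuming equal-length probes and an arbitrary start vertex; the threading of the tour onto the two-dimensional array is not shown, and the probe list is hypothetical.

```python
def hamming(p, q):
    """Hamming distance between two equal-length probes."""
    return sum(a != b for a, b in zip(p, q))

def greedy_tour(probes):
    """Nearest-neighbor tour through the border-length graph, from probe 0."""
    unvisited = list(range(1, len(probes)))
    tour = [0]
    while unvisited:
        last = probes[tour[-1]]
        nxt = min(unvisited, key=lambda i: hamming(last, probes[i]))
        unvisited.remove(nxt)
        tour.append(nxt)
    return tour

def path_cost(probes, order):
    """Sum of Hamming distances between consecutive probes in the order."""
    return sum(hamming(probes[a], probes[b]) for a, b in zip(order, order[1:]))

probes = ["AGGC", "TTGC", "CCGC", "AGGA", "TTGA"]  # hypothetical probes
tour = greedy_tour(probes)
print(tour, path_cost(probes, tour), path_cost(probes, range(len(probes))))
```

Placing consecutive tour vertices in neighboring spots keeps similar probes adjacent, which is what lowers the mask borders.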
Quality Control
Slide #10
Hubbell and Pevzner, 1999
Sengupta and Tompa, 2002
Colbourn et al., 2002
Quality control (fidelity) spots:
Manufacture identical probes at several quality-control spots in order to test the precision of the production steps
Relevant coding-theoretic ideas
Slide #11
Balanced code (Sengupta and Tompa, 2002):
A b×v binary matrix in which
• each row has weight k;
• each column has weight bounded between l and b-l, for some constant l;
• any pair of columns is at Hamming distance at least d.
Superimposed designs in Rényi’s search model (Kautz and Singleton, 1964; Dyachkov and Rykov, 1983):
A b×v binary matrix in which
• all Boolean sums composed of no more than s columns are distinct;
• each row has weight exactly t.
Additional constraint: the Boolean sums form an error-correcting code with prescribed minimum distance d.
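Both definitions can be checked by brute force on small matrices. The sketch below is only a verifier, not a construction; the 6×4 test matrix is an illustrative assumption, and the exhaustive search over column subsets is practical only for small v and s.

```python
from itertools import combinations

def columns(matrix):
    """Columns of a 0-1 matrix given as a list of rows."""
    return [tuple(col) for col in zip(*matrix)]

def is_balanced(matrix, k, l, d):
    """Check the three balanced-code conditions."""
    b = len(matrix)
    cols = columns(matrix)
    rows_ok = all(sum(row) == k for row in matrix)
    cols_ok = all(l <= sum(c) <= b - l for c in cols)
    dist_ok = all(
        sum(x != y for x, y in zip(c1, c2)) >= d
        for c1, c2 in combinations(cols, 2)
    )
    return rows_ok and cols_ok and dist_ok

def is_superimposed(matrix, s):
    """True if all Boolean sums (ORs) of at most s columns are distinct."""
    cols = columns(matrix)
    seen = set()
    for size in range(1, s + 1):
        for subset in combinations(cols, size):
            boolean_sum = tuple(max(bits) for bits in zip(*subset))
            if boolean_sum in seen:
                return False
            seen.add(boolean_sum)
    return True

# Illustrative matrix: every row has weight 2, every column has weight 3.
A = [[0, 1, 1, 0], [1, 0, 1, 0], [0, 0, 1, 1],
     [0, 1, 0, 1], [1, 0, 0, 1], [1, 1, 0, 0]]
print(is_balanced(A, k=2, l=3, d=4), is_superimposed(A, 2), is_superimposed(A, 3))
```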
Slide #12
Error-correcting microarray design
• Probe multiplexing (Khan et al., 2003)
Excluding hybridization effects and spot formation quality, and under i.i.d. measurement noise:
$\min \operatorname{tr}(G^{*T} G^{*}), \qquad G^{*} = (G^T G)^{-1} G^T$
G is an n × k 0-1 matrix, n ≥ k, rank(G) = k
Example (rows: spots, columns: probes):
$$G = \begin{bmatrix} 0 & 1 & 1 & 0 \\ 1 & 0 & 1 & 0 \\ 0 & 0 & 1 & 1 \\ 0 & 1 & 0 & 1 \\ 1 & 0 & 0 & 1 \\ 1 & 1 & 0 & 0 \end{bmatrix}, \qquad \sum_j G(i,j) = c = \text{const.}$$
X – vector of RNA levels corresponding to N genes
Y – total concentration of RNA at all spots
$Y = TGSX$
S – hybridization affinity matrix, T – spot quality matrix
Decoding algorithm: numerical optimization
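The design criterion and a simplified decoder can be worked through exactly for the 6×4 example matrix. This sketch assumes T and S are identity matrices and noise-free measurements, so decoding reduces to applying the pseudo-inverse G*; the slide's actual decoder is a numerical optimization that is not reproduced here. Exact rational arithmetic is used to avoid floating-point noise.

```python
from fractions import Fraction

G = [[0, 1, 1, 0], [1, 0, 1, 0], [0, 0, 1, 1],
     [0, 1, 0, 1], [1, 0, 0, 1], [1, 1, 0, 0]]

def transpose(a):
    return [list(row) for row in zip(*a)]

def matmul(a, b):
    bt = transpose(b)
    return [[sum(x * y for x, y in zip(row, col)) for col in bt] for row in a]

def inverse(a):
    """Gauss-Jordan inverse of a nonsingular square matrix, exact arithmetic."""
    n = len(a)
    aug = [[Fraction(v) for v in row] + [Fraction(int(i == j)) for j in range(n)]
           for i, row in enumerate(a)]
    for c in range(n):
        pivot = next(r for r in range(c, n) if aug[r][c] != 0)
        aug[c], aug[pivot] = aug[pivot], aug[c]
        scale = aug[c][c]
        aug[c] = [v / scale for v in aug[c]]
        for r in range(n):
            if r != c and aug[r][c] != 0:
                factor = aug[r][c]
                aug[r] = [vr - factor * vc for vr, vc in zip(aug[r], aug[c])]
    return [row[n:] for row in aug]

def pseudo_inverse(g):
    """G* = (G^T G)^{-1} G^T, defined when rank(G) = k."""
    gt = transpose(g)
    return matmul(inverse(matmul(gt, g)), gt)

def design_cost(g):
    """tr(G*^T G*), i.e. the sum of squared entries of G*."""
    return sum(v * v for row in pseudo_inverse(g) for v in row)

def decode(g, y):
    """Least-squares estimate of X from Y = G X (T = S = I assumed)."""
    return [row[0] for row in matmul(pseudo_inverse(g), [[v] for v in y])]

x = [1, 2, 3, 4]  # hypothetical RNA levels
y = [row[0] for row in matmul(G, [[v] for v in x])]
print(design_cost(G), y, decode(G, y))
```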
VLSIPS/analysis for multiplexed arrays
Slide #13
Features:
• Multiple polymer synthesis at one given spot (for simplicity, we consider only two probes per spot)
• Two different classes of linkers, sensitive to different wavelengths, can be used to select which probes are extended (say, ‘blue’ and ‘green’, with ‘cyan’ addressing both)
[Figure: production steps A T G C, repeated four times and each labeled with a color (g, b or c), plotted against spots 1 to 6]
Slide #14
VLSIPS/analysis for multiplexed arrays
Scheduling: shortest schedule of bases/colors
(Using results from V. Dančík, Expected Length of Longest Common Subsequences, 1994)
Set-up: two identical sets of M ‘blue’ and M ‘green’ sequences, chosen randomly and uniformly, of length N over an alphabet of size four
Length of shortest schedule:
$\lim_{N \to \infty} [\ldots] \; 2\gamma_4^{(M)} \,[\ldots]\, \gamma_4^{(M)} \big(2\gamma_4^{(M)} \,[\ldots]\, 1\big)$, where the $\gamma_4^{(M)}$ are Chvátal-Sankoff constants
Synchronous schedule with no ‘cyan’ steps: length 8N
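The 8N figure follows from running one period of four bases per color for each probe position. A minimal sketch, assuming each round consists of four blue steps followed by four green steps (the ordering within a round is my assumption) and that round r builds position r of both probes:

```python
def two_color_synchronous_schedule(n, bases="ATGC", colors="bg"):
    """List of (base, color) production steps, 8 per probe position."""
    return [(base, color) for _ in range(n) for color in colors for base in bases]

def synthesize(schedule, blue, green, period=8):
    """Probes grown at one spot that holds a blue and a green probe.

    In round r the spot is exposed at the single step whose base and
    color match position r of the corresponding probe.
    """
    target = {"b": blue, "g": green}
    grown = {"b": "", "g": ""}
    for step, (base, color) in enumerate(schedule):
        position = step // period
        if target[color][position] == base:
            grown[color] += base
    return grown["b"], grown["g"]

n = 2
schedule = two_color_synchronous_schedule(n)
print(len(schedule))                   # 8N steps
print(synthesize(schedule, "AT", "CA"))
```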
Mask design:
Slide #15
Production steps: A C G T A C G T
Step colors: b g c c g c c b
Spots and their two probes: S1: AT, CA; S2: AC, CC; S3: GT, GA; S4: TT, TA
Placement s1 s4 / s3 s2: L(M) = 4, 4, 2, 2, 2, 2, 2
Placement s1 s3 / s2 s4: L(M) = 2, 2, 2, 2, 2, 2, 2
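The border lengths in this example can be recomputed from the schedule and the two placements. The sketch rests on several assumptions of mine: the first probe of each spot is the blue one and the second the green one, a cyan step extends both probes of an exposed spot, spots are exposed greedily at the earliest usable step, and only borders between neighboring spots inside the 2×2 array are counted.

```python
STEPS = list(zip("ACGTACGT", "bgccgccb"))  # (base, color) per production step
SPOTS = {"s1": ("AT", "CA"), "s2": ("AC", "CC"),
         "s3": ("GT", "GA"), "s4": ("TT", "TA")}  # (blue probe, green probe)
ACTIVE = {"b": (0,), "g": (1,), "c": (0, 1)}  # probes addressed by each color

def masks(steps, spots):
    """Set of exposed spots for every production step."""
    progress = {name: [0, 0] for name in spots}  # bases built so far
    result = []
    for base, color in steps:
        exposed = set()
        for name, probes in spots.items():
            needed = [
                probes[i][progress[name][i]]
                if progress[name][i] < len(probes[i]) else None
                for i in ACTIVE[color]
            ]
            # Expose only if every probe addressed by this color needs this base.
            if all(b == base for b in needed):
                exposed.add(name)
                for i in ACTIVE[color]:
                    progress[name][i] += 1
        result.append(exposed)
    return result

def border_length(mask, grid):
    """Number of neighboring spot pairs separated by the mask boundary."""
    total = 0
    for r, row in enumerate(grid):
        for c, spot in enumerate(row):
            if c + 1 < len(row):
                total += (spot in mask) != (row[c + 1] in mask)
            if r + 1 < len(grid):
                total += (spot in mask) != (grid[r + 1][c] in mask)
    return total

for grid in ([["s1", "s4"], ["s3", "s2"]], [["s1", "s3"], ["s2", "s4"]]):
    lengths = [border_length(m, grid) for m in masks(STEPS, SPOTS)]
    print(grid, lengths, sum(lengths))
```

Under these assumptions the seventh step exposes no spot, which is why seven border lengths are listed per placement; the second placement lowers the total from 18 to 14.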
Mask Design / Scheduling
Slide #16
Neighborhood graph: complete graph with M vertices, each vertex v labeled by two distinct sequences $(p_1(v), p_2(v))$
No ‘cyan’ steps: the weight of the edge between two vertices $(p_1(v_1), p_2(v_1))$ and $(p_1(v_2), p_2(v_2))$ is the sum of the Hamming distances between the corresponding probes
Issues: For controlled hybridization, the different probes (blue and green) at the same spot should have a fairly large Hamming distance (Milenkovic and Kashyap, 2005)
Border-length minimization becomes less effective
With cyan steps involved, the distance measure also depends on the longest common subsequence of the probes at the same spot
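For the case without cyan steps, the edge weights can be tabulated directly. A short sketch, assuming the weight is the blue-to-blue Hamming distance plus the green-to-green Hamming distance, and reusing the four probe pairs from the mask design example as vertex labels:

```python
from itertools import combinations

def hamming(p, q):
    """Hamming distance between two equal-length probes."""
    return sum(a != b for a, b in zip(p, q))

def edge_weight(u, v):
    """Weight between two vertices, each labeled (blue probe, green probe)."""
    return hamming(u[0], v[0]) + hamming(u[1], v[1])

vertices = {"s1": ("AT", "CA"), "s2": ("AC", "CC"),
            "s3": ("GT", "GA"), "s4": ("TT", "TA")}
weights = {
    (a, b): edge_weight(vertices[a], vertices[b])
    for a, b in combinations(sorted(vertices), 2)
}
print(weights)

# Distance between the two probes sharing a spot (the hybridization concern).
print({name: hamming(blue, green) for name, (blue, green) in vertices.items()})
```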
Quality Control Coding
Slide #17
Theorem: Assume that there exists a linear error-control code with parameters
[n,k,d] containing the all-ones codeword. Then one can construct a quality control
array for a multiplexed DNA chip with $2(2^k-2)$ disjoint blue and green production
steps and M probes such that the length of each quality control probe is $2^{k-1}-1$,
and the weights w of the columns in the quality control array satisfy
$d \le w \le n-d$.
Furthermore, with such an array any collection of fewer than n/(n-d) failed blue or
green steps, respectively, can be uniquely identified.
Open question: how does one extend this result to schedules involving ‘cyan’
production steps, and to ‘spot’ failures?
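The parameters in the theorem can be checked on a small code. The sketch below uses the [7,4,3] Hamming code as an example of a linear code containing the all-ones codeword; the choice of code and generator matrix is mine, and how codewords are mapped to masks is not spelled out on the slide, so only the counts and the weight bounds are verified.

```python
from itertools import product

# Systematic generator matrix of a [7,4,3] Hamming code (example choice).
GENERATOR = ["1000110", "0100101", "0010011", "0001111"]

def codewords(generator):
    """All codewords of the binary linear code spanned by the generator rows."""
    rows = [[int(bit) for bit in row] for row in generator]
    n = len(rows[0])
    return {
        tuple(sum(c * r[i] for c, r in zip(coeffs, rows)) % 2 for i in range(n))
        for coeffs in product([0, 1], repeat=len(rows))
    }

def theorem_parameters(generator):
    """Quantities appearing in the theorem for a code with the all-ones word."""
    words = codewords(generator)
    n, k = len(generator[0]), len(generator)
    if tuple([1] * n) not in words:
        raise ValueError("code does not contain the all-ones codeword")
    d = min(sum(w) for w in words if any(w))
    nontrivial = [w for w in words if 0 < sum(w) < n]
    return {
        "n": n, "k": k, "d": d,
        "steps": 2 * (2 ** k - 2),        # blue plus green production steps
        "probe_length": 2 ** (k - 1) - 1,
        "weights": sorted({sum(w) for w in nontrivial}),
        "nontrivial_codewords": len(nontrivial),
        "failure_bound": n / (n - d),     # fewer failed steps are identifiable
    }

print(theorem_parameters(GENERATOR))
```

For this code the nontrivial codewords have weights 3 and 4, matching d ≤ w ≤ n-d, and the bound n/(n-d) = 1.75 covers a single failed step per color.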