EECS 730
Introduction to Bioinformatics
Microarray
Luke Huan
Electrical Engineering and Computer Science
http://people.eecs.ku.edu/~jhuan/
Administrative

Final exam: Dec 15 7:30-10:00
2016/3/16
EECS 730
2
Model Based Subspace Clustering



Microarray
Bi-clustering
δ-clustering
MicroArray Dataset
Gene Expression Matrix

$$
X = \begin{pmatrix}
x_{11} & \cdots & x_{1j} & \cdots & x_{1m} \\
\vdots &        & \vdots &        & \vdots \\
x_{i1} & \cdots & x_{ij} & \cdots & x_{im} \\
\vdots &        & \vdots &        & \vdots \\
x_{n1} & \cdots & x_{nj} & \cdots & x_{nm}
\end{pmatrix}
$$

Rows: genes
Columns: conditions (time points, cancer tissues)
Data Mining: Clustering

K-means clustering minimizes

$$\sum_{t=1}^{k} \sum_{i \in c_t} \mathrm{dist}(x_i, c_t)^2$$

where

$$\mathrm{dist}(x_i, c_t) = \sqrt{\sum_{j=1}^{m} (x_{ij} - c_{tj})^2}$$
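A minimal sketch of the k-means objective above, assuming NumPy and a toy matrix whose values are made up for illustration. The function name `kmeans_objective` is hypothetical; it only evaluates the objective for a given assignment and does not run the full k-means iteration.

```python
import numpy as np

def kmeans_objective(X, labels, centers):
    """Sum over clusters t and members i of the squared Euclidean distance to c_t."""
    diffs = X - centers[labels]      # each row minus its assigned center
    return float(np.sum(diffs ** 2))

# Toy expression matrix (made-up values): 4 genes (rows) x 2 conditions (columns).
X = np.array([[0.0, 0.0], [0.0, 2.0], [10.0, 0.0], [10.0, 2.0]])
labels = np.array([0, 0, 1, 1])      # cluster index t of each gene
centers = np.array([X[labels == t].mean(axis=0) for t in range(2)])

objective = kmeans_objective(X, labels, centers)
print(objective)
```

Each gene here lies at squared distance 1 from its cluster center, so the objective is the sum of four such terms.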
Clustering by Pattern Similarity (pClustering)

The micro-array “raw” data shows 3 genes and their values in a multidimensional space
  Parallel coordinates plots
  Difficult to find their patterns
“Non-traditional” clustering
Clusters Are Clear After Projection
Motivation
DNA microarray analysis

         CH1I   CH1B   CH1D   CH2I   CH2B
CTFC3    4392    284   4108    280    228
VPS8      401    281    120    275    298
EFB1      318    280     37    277    215
SSA1      401    292    109    580    238
FUN14    2857    285   2576    271    226
SP07      228    290     48    285    224
MDM10     538    272    266    277    236
CYS3      322    288     41    278    219
DEP1      312    272     40    273    232
NTG1      329    296     33    274    228
Motivation
[Figure: strength (y-axis, 0 to 450) versus condition (x-axis: CH1I, CH1D, CH2B)]
Motivation

The selected objects exhibit strong coherence on the selected attributes.
  They are not necessarily close to each other but rather differ by a constant shift.
  Object/attribute bias
bi-cluster
Challenges

The set of objects and the set of attributes are usually unknown.
Different objects/attributes may possess different biases, and such biases
  may be local to the set of selected objects/attributes
  are usually unknown in advance
The data may have many unspecified entries.
Previous Work

Subspace clustering
  Identifies a set of objects and a set of attributes such that the objects are physically close to each other in the subspace formed by the attributes.
Collaborative filtering: Pearson R
  Only considers the global offset of each object/attribute.

$$R = \frac{\sum (o_1 - \bar{o}_1)(o_2 - \bar{o}_2)}{\sqrt{\sum (o_1 - \bar{o}_1)^2 \times \sum (o_2 - \bar{o}_2)^2}}$$
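A short sketch of why the global offset matters, assuming NumPy. It computes Pearson R for VPS8 and EFB1 (values taken from the DNA microarray table in the Motivation slides), first over all five conditions and then over CH1I, CH1D, CH2B only. The function name `pearson_r` is hypothetical.

```python
import numpy as np

def pearson_r(o1, o2):
    """Pearson R of two objects, using each object's mean over the given attributes."""
    d1 = o1 - o1.mean()
    d2 = o2 - o2.mean()
    return float(np.sum(d1 * d2) / np.sqrt(np.sum(d1 ** 2) * np.sum(d2 ** 2)))

# Conditions: CH1I, CH1B, CH1D, CH2I, CH2B.
vps8 = np.array([401.0, 281.0, 120.0, 275.0, 298.0])
efb1 = np.array([318.0, 280.0, 37.0, 277.0, 215.0])

r_all = pearson_r(vps8, efb1)                    # offsets taken over all attributes
subset = [0, 2, 4]                               # CH1I, CH1D, CH2B
r_sub = pearson_r(vps8[subset], efb1[subset])    # offsets local to the subset
print(r_all, r_sub)
```

On the three selected conditions the two genes differ by a constant shift of 83, so the correlation restricted to that subset is 1, while the correlation over all attributes is lower.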
bi-cluster Terms

Consists of a (sub)set of objects and a (sub)set of attributes
  Corresponds to a submatrix
Occupancy threshold
  Each object/attribute must have at least a certain percentage of its entries specified.
Volume: number of specified entries in the submatrix
Base: average value of each object/attribute (in the bi-cluster)
Biclustering of Expression Data, Cheng & Church, ISMB’00
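The terms above can be sketched as follows, assuming NumPy and NaN as the marker for an unspecified entry. The submatrix reuses values for VPS8, EFB1, CYS3 on CH1I, CH1D, CH2B, with one entry deliberately set to NaN for illustration; the threshold value `alpha = 0.6` is a made-up assumption.

```python
import numpy as np

# Submatrix of a candidate bi-cluster; NaN marks an unspecified entry (assumption).
D = np.array([
    [401.0, 120.0, 298.0],
    [318.0, np.nan, 215.0],   # one entry blanked out for illustration
    [322.0, 41.0, 219.0],
])

specified = ~np.isnan(D)
volume = int(specified.sum())                 # number of specified entries
row_occupancy = specified.mean(axis=1)        # fraction specified per object
col_occupancy = specified.mean(axis=0)        # fraction specified per attribute

object_base = np.nanmean(D, axis=1)           # average over specified entries per object
attribute_base = np.nanmean(D, axis=0)        # average over specified entries per attribute
cluster_base = float(np.nanmean(D))           # average over all specified entries

alpha = 0.6                                   # hypothetical occupancy threshold
meets_occupancy = bool((row_occupancy >= alpha).all() and (col_occupancy >= alpha).all())
print(volume, object_base, attribute_base, meets_occupancy)
```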
bi-cluster

Objects VPS8, EFB1, CYS3 on attributes CH1I, CH1D, CH2B:

             CH1I   CH1D   CH2B   Obj base
VPS8          401    120    298        273
EFB1          318     37    215        190
CYS3          322     41    219        194
Attr base     347     66    244        219
40 genes × 17 conditions (0 to 16)

Gene      0   1   2   3   4   5   6   7   8   9  10  11  12  13  14  15  16
YBL069W 139  69   0  69 139 139 139 139  69   0   0  69 110   0  69   0   0
YBL097W   0  69  69  69 110 110 110 110  69   0  69  69 110 110  69   0  69
YBR064W 139 110   0  69  69 110 139 139 139   0  69  69 139  69  69   0   0
YBR065C 139 110   0  69 110 110 110 139 110   0  69 110 139  69  69   0  69
YBR114W 208 179 110  69 110 110 110 161 161   0  69  69 110   0  69   0  69
YCL013W   0   0   0  69  69 139 161 179 139   0  69   0 110   0  69   0  69
YDR149C   0   0   0   0 110 110 110  69 110   0   0   0  69   0  69   0  69
YDR461W 179 161  69  69  69 110  69 110 110   0   0  69   0   0  69   0  69
YDR526C  69 110  69 110 110 161 110  69 139  69  69 110 110 139 110  69 110
YHR061C  69   0  69  69 110 139 110   0   0   0  69  69 110  69  69   0   0
YIL092W 139 161 110 110 139 179 139 110 139  69  69  69 110 110 110  69  69
YIR043C 179 179 161 139 161 195 161 161 161 110 161 161 139 139 161 110 110
YJL010C 179 240 161 195 195 256 220 208 240 139 195 195 195 161 195 161 110
YJL023C 161 161  69 110 139 161 139 110 161  69 110 139  69  69 110  69  69
YJL033W 208 283 240 248 264 304 283 283 283 195 220 240 240 240 248 195 208
YJL076W 161 195 110 139 195 248 179 161 220 110 179 195 161 179 208 110 110
YJR162C 139 161 139 161 139 179 161 139  69  69 139  69  69 179 179 110  69
YKL068W 304 326 304 322 326 350 340 376 318 248 314 283 314 318 326 264 264
YKL134C  69  69   0  69 110 110  69   0  69   0  69  69 139  69  69   0   0
YLR219W 283 208 220 277 289 326 289 289 248 220 271 240 271 294 277 230 208
YLR380W 337 383 383 413 414 403 381 393 343 350 369 358 347 358 356 314 289
YLR381W 161 161 220 195 161 195 161 110 110 110 195 179 179  69 139 110 110
YLR382C 208 195 220 161 139 161 161 110 139 110 195 195 195  69 161 139 139
YLR383W 248 230 330 300 277 240 240 179 195 220 277 289 240 240 220 161 161
YLR384C 264 300 289 264 277 277 289 277 300 248 283 271 294 256 264 271 283
YLR386W 230 240 289 264 240 256 220 208 220 248 271 256 256 240 220 179 208
YLR388W 439 442 464 456 451 422 417 403 432 510 438 442 450 462 419 476 476
YLR392C 256 230 208 240 230 248 240 283 248 220 230 230 220 240 248 220 240
YLR395C 374 322 322 300 330 356 361 333 369 376 369 374 369 343 361 393 399
YLR400W 139 195 161 139 161 139 161 139 179 110 110 139 139 139 110 161 161
YLR401C 230 277 256 248 264 271 248 240 256 220 230 230 256 208 208 240 230
YLR406C 494 470 498 488 477 460 466 484 449 532 485 473 464 487 477 492 484
YLR408C 326 248 240 289 300 294 289 264 277 248 283 283 277 283 277 271 283
YLR411W 179 139 110  69  69 110  69  69  69  69  69  69  69  69 110  69  69
YLR413W 326 411 397 383 371 347 314 277 330 264 289 283 304 264 264 340 343
YLR450W 161 220 220 220 208 208 161 161 208 179 195 179 179 161 139 161 139
YLR451W 220 271 248 230 240 248 240 179 248 208 208 220 230 220 179 230 230
YLR452C 220 271 230 208 161 195 161 161 195 161 208 195 220 161 179 195 220
YLR453C 179 195 110 161 139 179 161 179 161  69 110 139 139 139 161 139 161
YLR454W 283 318 195 271 195 304 289 283 289 304 330 264 256 271 309 277 256
Motivation
[Figure: expression level (y-axis, 0 to 600) versus condition (x-axis, 0 to 16)]
Motivation
[Figure: expression level (y-axis, 0 to 600) versus condition (x-axis: 3, 5, 9, 14, 15) for genes YBL069W, YBL097W, YBR064W, YBR065C, YBR114W, YCL013W, YDR149C, YDR461W, YDR526C, YHR061C, YIL092W, YIR043C, YJL010C, YJL023C, YJL033W, YJL076W, YJR162C, YKL068W, YKL134C, YLR219W]
Co-regulated genes
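bi-cluster

Perfect δ-cluster

$$d_{ij} - d_{iJ} = d_{Ij} - d_{IJ}$$
$$d_{ij} - d_{Ij} = d_{iJ} - d_{IJ}$$
$$d_{ij} = d_{iJ} + d_{Ij} - d_{IJ}$$

Imperfect δ-cluster

Residue:

$$r_{ij} = \begin{cases} d_{ij} - d_{iJ} - d_{Ij} + d_{IJ}, & d_{ij} \text{ is specified} \\ 0, & d_{ij} \text{ is unspecified} \end{cases}$$

Here $d_{iJ}$ is the base of object $i$, $d_{Ij}$ is the base of attribute $j$, and $d_{IJ}$ is the base of the bi-cluster.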
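As a worked check of the residue definition, the sketch below (assuming NumPy, with NaN standing for an unspecified entry) computes every residue for the bi-cluster of VPS8, EFB1, CYS3 on CH1I, CH1D, CH2B. The function name `residue` is hypothetical.

```python
import numpy as np

def residue(D):
    """r_ij = d_ij - d_iJ - d_Ij + d_IJ for specified entries, 0 for unspecified (NaN) ones."""
    row_base = np.nanmean(D, axis=1, keepdims=True)   # d_iJ
    col_base = np.nanmean(D, axis=0, keepdims=True)   # d_Ij
    base = np.nanmean(D)                              # d_IJ
    r = D - row_base - col_base + base
    return np.where(np.isnan(D), 0.0, r)

# Rows: VPS8, EFB1, CYS3; columns: CH1I, CH1D, CH2B.
D = np.array([
    [401.0, 120.0, 298.0],
    [318.0, 37.0, 215.0],
    [322.0, 41.0, 219.0],
])
R = residue(D)
print(R)
```

For example, for VPS8 under CH1I the residue is 401 - 273 - 347 + 219 = 0, and the same holds for every other entry, so these three genes form a perfect δ-cluster on these three conditions.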
bi-cluster

The smaller the average residue, the stronger the coherence.
Objective: identify δ-clusters with residue smaller than a given threshold
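A minimal sketch of the threshold test, assuming NumPy and assuming that "average residue" means the mean absolute residue over specified entries. The threshold `delta = 10.0` is a made-up value; the data come from the DNA microarray table in the Motivation slides.

```python
import numpy as np

def average_residue(D):
    """Mean absolute residue over the specified (non-NaN) entries of D."""
    row_base = np.nanmean(D, axis=1, keepdims=True)
    col_base = np.nanmean(D, axis=0, keepdims=True)
    r = D - row_base - col_base + np.nanmean(D)
    return float(np.nanmean(np.abs(r)))

# Rows: CTFC3, VPS8, EFB1, CYS3; columns: CH1I, CH1D, CH2B.
D = np.array([
    [4392.0, 4108.0, 228.0],
    [401.0, 120.0, 298.0],
    [318.0, 37.0, 215.0],
    [322.0, 41.0, 219.0],
])
delta = 10.0                          # hypothetical residue threshold

coherent = average_residue(D[1:])     # VPS8, EFB1, CYS3 only
with_ctfc3 = average_residue(D)       # CTFC3 does not follow the shifting pattern
print(coherent <= delta, with_ctfc3 <= delta)   # prints: True False
```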
Cheng-Church Algorithm

Find one bi-cluster.
Replace the data in the first bi-cluster with random data.
Find the second bi-cluster, and so on.
The quality of the bi-clusters degrades (smaller volume, higher residue) due to the insertion of random data.
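The find-mask-repeat loop above can be sketched as follows, assuming NumPy and a fully specified matrix. This is a simplified illustration, not Cheng and Church's full procedure: it uses only greedy single node deletion on the mean squared residue and omits their multiple node deletion and node addition phases. All function names and the `delta` value are hypothetical.

```python
import numpy as np

def residues(sub):
    """Residue of every entry of a fully specified submatrix."""
    return (sub - sub.mean(axis=1, keepdims=True)
                - sub.mean(axis=0, keepdims=True) + sub.mean())

def find_one_bicluster(A, delta):
    """Greedily delete the worst row or column until the mean squared residue <= delta."""
    rows = np.arange(A.shape[0])
    cols = np.arange(A.shape[1])
    while len(rows) > 1 and len(cols) > 1:
        sq = residues(A[np.ix_(rows, cols)]) ** 2
        if sq.mean() <= delta:
            break
        row_score, col_score = sq.mean(axis=1), sq.mean(axis=0)
        if row_score.max() >= col_score.max():
            rows = np.delete(rows, row_score.argmax())
        else:
            cols = np.delete(cols, col_score.argmax())
    return rows, cols

def cheng_church_sketch(A, delta, n_clusters, rng):
    """Find a bi-cluster, overwrite it with random data, and repeat."""
    A = A.astype(float).copy()
    lo, hi = A.min(), A.max()
    found = []
    for _ in range(n_clusters):
        rows, cols = find_one_bicluster(A, delta)
        found.append((rows, cols))
        # Mask the discovered bi-cluster so the next search does not return it again.
        A[np.ix_(rows, cols)] = rng.uniform(lo, hi, size=(len(rows), len(cols)))
    return found

# Rows: CTFC3, VPS8, EFB1, CYS3; columns: CH1I, CH1D, CH2B.
A = np.array([
    [4392.0, 4108.0, 228.0],
    [401.0, 120.0, 298.0],
    [318.0, 37.0, 215.0],
    [322.0, 41.0, 219.0],
])
found = cheng_church_sketch(A, delta=1.0, n_clusters=2, rng=np.random.default_rng(0))
print(found[0])
```

The first bi-cluster drops CTFC3 and keeps VPS8, EFB1, CYS3 on all three conditions. The second search runs on data that now contains random values, which is the source of the quality degradation noted above.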
The FLOC algorithm

1. Generate initial clusters.
2. Determine the best action for each row and each column.
3. Perform the best action of each row and column sequentially.
4. Improved? If yes, return to step 2; if no, stop.

Yang et al., delta-Clusters: Capturing Subspace Correlation in a Large Data Set, ICDE’02
The FLOC algorithm

Action: the change of membership of a row (or column) with respect to a cluster

           column
           1   2   3   4
row   1    3   4   2   2
      2    1   3   2   3
      3    4   2   0   4

N = 3 rows, M = 4 columns
M+N actions are performed at each iteration
The FLOC algorithm

Gain of an action: the residue reduction incurred by performing the action
Order of actions:
  Fixed order
  Random order
  Weighted random order
Complexity: O((M+N)MNkp)
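A minimal sketch of the gain of a row action, assuming NumPy, a fully specified matrix, a single cluster described by boolean row and column masks, and mean absolute residue as the score. The names `avg_residue` and `row_action_gain` are hypothetical, and the gain used by FLOC itself may be defined with additional terms.

```python
import numpy as np

def avg_residue(D, rows, cols):
    """Mean absolute residue of the submatrix selected by boolean masks."""
    sub = D[np.ix_(rows, cols)]
    if sub.size == 0:
        return 0.0
    r = (sub - sub.mean(axis=1, keepdims=True)
             - sub.mean(axis=0, keepdims=True) + sub.mean())
    return float(np.abs(r).mean())

def row_action_gain(D, rows, cols, i):
    """Residue reduction obtained by toggling row i's membership in the cluster."""
    toggled = rows.copy()
    toggled[i] = not toggled[i]
    return avg_residue(D, rows, cols) - avg_residue(D, toggled, cols)

# Rows: CTFC3, VPS8, EFB1, CYS3; columns: CH1I, CH1D, CH2B.
D = np.array([
    [4392.0, 4108.0, 228.0],
    [401.0, 120.0, 298.0],
    [318.0, 37.0, 215.0],
    [322.0, 41.0, 219.0],
])
rows = np.array([True, True, True, True])   # current cluster holds all four genes
cols = np.array([True, True, True])

gains = [row_action_gain(D, rows, cols, i) for i in range(D.shape[0])]
best_row = int(np.argmax(gains))
print(best_row, gains)
```

Removing CTFC3 (row 0) leaves a perfect δ-cluster, so that action has the largest gain. Column actions can be scored the same way by toggling an entry of `cols`.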
The FLOC algorithm

Additional features
  Maximum allowed overlap among clusters
  Minimum coverage of clusters
  Minimum volume of each cluster
These can be enforced by “temporarily blocking” certain actions during the mining process if such actions would violate some constraint.
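One way to picture the blocking step, shown here for the minimum-volume constraint only. This is a sketch assuming NumPy, boolean membership masks, and a boolean matrix `specified` that marks which entries have values; the names `volume` and `is_blocked` and the value `min_volume = 9` are hypothetical.

```python
import numpy as np

def volume(specified, rows, cols):
    """Number of specified entries in the cluster selected by boolean masks."""
    return int(specified[np.ix_(rows, cols)].sum())

def is_blocked(specified, rows, cols, i, min_volume):
    """Block toggling row i if the resulting cluster would fall below min_volume."""
    toggled = rows.copy()
    toggled[i] = not toggled[i]
    return volume(specified, toggled, cols) < min_volume

specified = np.ones((4, 3), dtype=bool)        # 4 x 3 matrix, every entry specified
rows = np.array([False, True, True, True])     # cluster currently holds rows 1 to 3
cols = np.ones(3, dtype=bool)
min_volume = 9                                 # hypothetical constraint

remove_blocked = is_blocked(specified, rows, cols, 1, min_volume)  # 9 -> 6 entries
add_blocked = is_blocked(specified, rows, cols, 0, min_volume)     # 9 -> 12 entries
print(remove_blocked, add_blocked)   # prints: True False
```

A blocked action is simply skipped for the current iteration; it can become available again once the cluster has changed.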
Performance

Microarray data: 2884 genes, 17 conditions
  100 bi-clusters with the smallest residue were returned.
  Average residue = 10.34
    The average residue of clusters found via the state-of-the-art method in the computational biology field is 12.54.
  The average volume is 25% bigger.
  The response time is an order of magnitude faster.
Concluding Remarks

The bi-cluster model is proposed to capture coherent objects in an incomplete data set.
  base
  residue
Many additional features can be accommodated (nearly for free).
References

J. Yang, W. Wang, H. Wang, P. Yu, Delta-cluster: capturing subspace correlation in a large data set, Proceedings of the 18th IEEE International Conference on Data Engineering (ICDE), pp. 517-528, 2002.
H. Wang, W. Wang, J. Yang, P. Yu, Clustering by pattern similarity in large data sets, Proceedings of the ACM SIGMOD International Conference on Management of Data (SIGMOD), 2002.
S. Yoon, C. Nardini, L. Benini, G. De Micheli, Enhanced pClustering and its applications to gene expression data, Bioinformatics and Bioengineering, 2004.
J. Liu and W. Wang, OP-Cluster: clustering by tendency in high dimensional space, ICDM’03.