EECS 730 Introduction to Bioinformatics
Microarray
Luke Huan
Electrical Engineering and Computer Science
http://people.eecs.ku.edu/~jhuan/
2016/3/16

Administrative
- Final exam: Dec 15, 7:30-10:00

Model Based Subspace Clustering
- Microarray
- Bi-clustering
- δ-clustering

MicroArray Dataset
[Figure: microarray dataset]

Gene Expression Matrix
$$
\begin{pmatrix}
x_{11} & \cdots & x_{1j} & \cdots & x_{1m} \\
\vdots &        & \vdots &        & \vdots \\
x_{i1} & \cdots & x_{ij} & \cdots & x_{im} \\
\vdots &        & \vdots &        & \vdots \\
x_{n1} & \cdots & x_{nj} & \cdots & x_{nm}
\end{pmatrix}
$$
- Rows: genes
- Columns: conditions (time points, cancer tissues)

Data Mining: Clustering
K-means clustering minimizes
$$
\sum_{t=1}^{k} \sum_{i \in c_t} \mathrm{dist}(x_i, c_t)^2
$$
where
$$
\mathrm{dist}(x_i, c_t) = \sqrt{\sum_{j=1}^{m} (x_{ij} - c_{tj})^2}
$$

Clustering by Pattern Similarity (pClustering)
- The micro-array "raw" data shows 3 genes and their values in a multidimensional space
- Parallel coordinates plots
- Their patterns are difficult to find
- "Non-traditional" clustering

Clusters Are Clear After Projection
[Figure: the same genes after projection onto a subset of the dimensions]

Motivation
DNA microarray analysis

         CH1I  CH1B  CH1D  CH2I  CH2B
CTFC3    4392   284  4108   280   228
VPS8      401   281   120   275   298
EFB1      318   280    37   277   215
SSA1      401   292   109   580   238
FUN14    2857   285  2576   271   226
SP07      228   290    48   285   224
MDM10     538   272   266   277   236
CYS3      322   288    41   278   219
DEP1      312   272    40   273   232
NTG1      329   296    33   274   228

Motivation
[Figure: strength (0 to 450) versus condition (CH1I, CH1D, CH2B)]

Motivation
- The selected objects exhibit strong coherence on the selected attributes.
- They are not necessarily close to each other but rather differ by a constant shift.
- Object/attribute bias
- bi-cluster

Challenges
- The set of objects and the set of attributes are usually unknown.
- Different objects/attributes may possess different biases; such biases may be local to the set of selected objects/attributes and are usually unknown in advance.
- The data may have many unspecified entries.

Previous Work
- Subspace clustering: identifying a set of objects and a set of attributes such that the objects are physically close to each other on the subspace formed by the attributes.
- Collaborative filtering: Pearson R. It only considers the global offset of each object/attribute.
$$
R = \frac{\sum (o_1 - \bar{o}_1)(o_2 - \bar{o}_2)}{\sqrt{\sum (o_1 - \bar{o}_1)^2 \times \sum (o_2 - \bar{o}_2)^2}}
$$

bi-cluster Terms
- A bi-cluster consists of a (sub)set of objects and a (sub)set of attributes; it corresponds to a submatrix.
- Occupancy threshold: each object/attribute must have at least a given percentage of its entries specified.
- Volume: number of specified entries in the submatrix.
- Base: average value of each object/attribute (in the bi-cluster).
Biclustering of Expression Data, Cheng & Church, ISMB'00

bi-cluster
Rows VPS8, EFB1, CYS3 and columns CH1I, CH1D, CH2B selected from the DNA microarray table:

            CH1I  CH1D  CH2B  Obj base
VPS8         401   120   298       273
EFB1         318    37   215       190
CYS3         322    41   219       194
Attr base    347    66   244       219

Expression levels of 40 genes under 17 conditions
Each line gives one condition (0 to 16); values are listed in the gene order below.

Genes (in order):
YBL069W YBL097W YBR064W YBR065C YBR114W YCL013W YDR149C YDR461W YDR526C YHR061C
YIL092W YIR043C YJL010C YJL023C YJL033W YJL076W YJR162C YKL068W YKL134C YLR219W
YLR380W YLR381W YLR382C YLR383W YLR384C YLR386W YLR388W YLR392C YLR395C YLR400W
YLR401C YLR406C YLR408C YLR411W YLR413W YLR450W YLR451W YLR452C YLR453C YLR454W

Condition 0: 139 0 139 139 208 0 0 179 69 69 139 179 179 161 208 161 139 304 69 283 337 161 208 248 264 230 439 256 374 139 230 494 326 179 326 161 220 220 179 283
Condition 1: 69 69 110 110 179 0 0 161 110 0 161 179 240 161 283 195 161 326 69 208 383 161 195 230 300 240 442 230 322 195 277 470 248 139 411 220 271 271 195 318
Condition 2: 0 69 0 0 110 0 0 69 69 69 110 161 161 69 240 110 139 304 0 220 383 220 220 330 289 289 464 208 322 161 256 498 240 110 397 220 248 230 110 195
Condition 3: 69 69 69 69 69 69 0 69 110 69 110 139 195 110 248 139 161 322 69 277 413 195 161 300 264 264 456 240 300 139 248 488 289 69 383 220 230 208 161 271
Condition 4: 139 110 69 110 110 69 110 69 110 110 139 161 195 139 264 195 139 326 110 289 414 161 139 277 277 240 451 230 330 161 264 477 300 69 371 208 240 161 139 195
Condition 5: 139 110 110 110 110 139 110 110 161 139 179 195 256 161 304 248 179 350 110 326 403 195 161 240 277 256 422 248 356 139 271 460 294 110 347 208 248 195 179 304
Condition 6: 139 110 139 110 110 161 110 69 110 110 139 161 220 139 283 179 161 340 69 289 381 161 161 240 289 220 417 240 361 161 248 466 289 69 314 161 240 161 161 289
Condition 7: 139 110 139 139 161 179 69 110 69 0 110 161 208 110 283 161 139 376 0 289 393 110 110 179 277 208 403 283 333 139 240 484 264 69 277 161 179 161 179 283
Condition 8: 69 69 139 110 161 139 110 110 139 0 139 161 240 161 283 220 69 318 69 248 343 110 139 195 300 220 432 248 369 179 256 449 277 69 330 208 248 195 161 289
Condition 9: 0 0 0 0 0 0 0 0 69 0 69 110 139 69 195 110 69 248 0 220 350 110 110 220 248 248 510 220 376 110 220 532 248 69 264 179 208 161 69 304
Condition 10: 0 69 69 69 69 69 0 0 69 69 69 161 195 110 220 179 139 314 69 271 369 195 195 277 283 271 438 230 369 110 230 485 283 69 289 195 208 208 110 330
Condition 11: 69 69 69 110 69 0 0 69 110 69 69 161 195 139 240 195 69 283 69 240 358 179 195 289 271 256 442 230 374 139 230 473 283 69 283 179 220 195 139 264
Condition 12: 110 110 139 139 110 110 69 0 110 110 110 139 195 69 240 161 69 314 139 271 347 179 195 240 294 256 450 220 369 139 256 464 277 69 304 179 230 220 139 256
Condition 13: 0 110 69 69 0 0 0 0 139 69 110 139 161 69 240 179 179 318 69 294 358 69 69 240 256 240 462 240 343 139 208 487 283 69 264 161 220 161 139 271
Condition 14: 69 69 69 69 69 69 69 69 110 69 110 161 195 110 248 208 179 326 69 277 356 139 161 220 264 220 419 248 361 110 208 477 277 110 264 139 179 179 161 309
Condition 15: 0 0 0 0 0 0 0 0 69 0 69 110 161 69 195 110 110 264 0 230 314 110 139 161 271 179 476 220 393 161 240 492 271 69 340 161 230 195 139 277
Condition 16: 0 69 0 69 69 69 69 69 110 0 69 110 110 69 208 110 69 264 0 208 289 110 139 161 283 208 476 240 399 161 230 484 283 69 343 139 230 220 161 256

Motivation
[Figure: expression level (0 to 600) versus condition (0 to 16) for the 40 genes]

Motivation
[Figure: expression level (0 to 600) versus condition for conditions 3, 5, 9, 14, 15 and genes YBL069W through YLR219W (the first 20 genes listed above); the selected genes are co-regulated]

bi-cluster
Notation: $d_{ij}$ is the entry for object $i$ and attribute $j$; $d_{iJ}$ is the base of object $i$, $d_{Ij}$ is the base of attribute $j$, and $d_{IJ}$ is the base of the whole bi-cluster.

Perfect δ-cluster:
$$
d_{ij} - d_{iJ} = d_{Ij} - d_{IJ}
$$
$$
d_{ij} - d_{Ij} = d_{iJ} - d_{IJ}
$$
$$
d_{ij} = d_{iJ} + d_{Ij} - d_{IJ}
$$

Imperfect δ-cluster: the entries deviate from these equalities, and the deviation of each entry is its residue.

Residue:
$$
r_{ij} =
\begin{cases}
d_{ij} - d_{iJ} - d_{Ij} + d_{IJ}, & d_{ij} \text{ is specified} \\
0, & d_{ij} \text{ is unspecified}
\end{cases}
$$

bi-cluster
- The smaller the average residue, the stronger the coherence.
- Objective: identify δ-clusters with residue smaller than a given threshold.

Cheng-Church Algorithm
- Find one bi-cluster.
- Replace the data in the first bi-cluster with random data.
- Find the second bi-cluster, and so on.
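The base and residue definitions above can be checked on the small VPS8/EFB1/CYS3 bi-cluster from the earlier slide. The following is a minimal sketch, not the authors' implementation: the function names `mean_specified` and `bicluster_residue` are hypothetical, unspecified entries are represented by `None`, and the "average residue" is taken to be the mean absolute residue over specified entries, which is an assumption about how the slides average.

```python
from typing import List, Optional

Matrix = List[List[Optional[float]]]


def mean_specified(values):
    """Average of the specified (non-None) entries.

    Assumes at least one entry is specified, which an occupancy
    threshold would guarantee.
    """
    specified = [v for v in values if v is not None]
    return sum(specified) / len(specified)


def bicluster_residue(d: Matrix):
    """Return (object bases, attribute bases, cluster base, average residue)."""
    n_rows, n_cols = len(d), len(d[0])
    obj_base = [mean_specified(row) for row in d]                  # d_iJ
    attr_base = [mean_specified([d[i][j] for i in range(n_rows)])  # d_Ij
                 for j in range(n_cols)]
    cluster_base = mean_specified([v for row in d for v in row])   # d_IJ

    total, volume = 0.0, 0
    for i in range(n_rows):
        for j in range(n_cols):
            if d[i][j] is None:
                continue  # unspecified entry: residue 0, adds no volume
            r_ij = d[i][j] - obj_base[i] - attr_base[j] + cluster_base
            total += abs(r_ij)  # assumption: mean of absolute residues
            volume += 1
    return obj_base, attr_base, cluster_base, total / volume


# Columns CH1I, CH1D, CH2B of the DNA microarray table.
cluster = [
    [401, 120, 298],  # VPS8
    [318, 37, 215],   # EFB1
    [322, 41, 219],   # CYS3
]
obj_base, attr_base, cluster_base, avg_residue = bicluster_residue(cluster)
print(obj_base, attr_base, cluster_base, avg_residue)
# prints [273.0, 190.0, 194.0] [347.0, 66.0, 244.0] 219.0 0.0
```

The bases match the "Obj base" and "Attr base" values on the slide, and the average residue is 0: the three genes differ only by a constant shift on these three conditions, so they form a perfect δ-cluster. Cheng & Church score a bi-cluster by the mean squared residue instead; swapping `abs(r_ij)` for `r_ij ** 2` would give that variant.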
- The quality of the bi-clusters degrades (smaller volume, higher residue) due to the insertion of random data.

The FLOC algorithm
1. Generate initial clusters.
2. Determine the best action for each row and each column.
3. Perform the best action of each row and column sequentially.
4. Improved? If yes, return to step 2; if no, stop.
Yang et al., delta-Clusters: Capturing Subspace Correlation in a Large Data Set, ICDE'02

The FLOC algorithm
- Action: the change of membership of a row (or column) with respect to a cluster.

         column
          1  2  3  4
row 1     3  4  2  2
row 2     1  3  2  3
row 3     4  2  0  4

- N = 3 rows, M = 4 columns
- M+N actions are performed at each iteration.

The FLOC algorithm
- Gain of an action: the residue reduction incurred by performing the action.
- Order of actions:
  - Fixed order
  - Random order
  - Weighted random order
- Complexity: O((M+N)MNkp), where k is the number of clusters and p is the number of iterations.

The FLOC algorithm
Additional features:
- Maximum allowed overlap among clusters
- Minimum coverage of clusters
- Minimum volume of each cluster
These can be enforced by "temporarily blocking" an action during the mining process if that action would violate a constraint.

Performance
- Microarray data: 2884 genes, 17 conditions.
- The 100 bi-clusters with the smallest residue were returned.
- Average residue = 10.34.
- The average residue of clusters found via the state-of-the-art method in the computational biology field is 12.54.
- The average volume is 25% bigger.
- The response time is an order of magnitude faster.

Concluding Remarks
- The bi-cluster model is proposed to capture coherent objects in an incomplete data set.
  - base
  - residue
- Many additional features can be accommodated (nearly for free).

References
- J. Yang, W. Wang, H. Wang, P. Yu, Delta-cluster: capturing subspace correlation in a large data set, Proceedings of the 18th IEEE International Conference on Data Engineering (ICDE), pp. 517-528, 2002.
- H. Wang, W. Wang, J. Yang, P. Yu, Clustering by pattern similarity in large data sets, to appear in Proceedings of the ACM SIGMOD International Conference on Management of Data (SIGMOD), 2002.
- S. Yoon, C. Nardini, L. Benini, G. De Micheli, Enhanced pClustering and its applications to gene expression data, Bioinformatics and Bioengineering, 2004.
- J. Liu and W. Wang, OP-Cluster: clustering by tendency in high dimensional space, ICDM'03.