Harnessing Idle Computers with Condor at Notre Dame: Impact on Research in 2006
Prof. Douglas Thain
CSE Department
9 Feb 2007

What is Condor?
Condor is software from UW-Madison that harnesses idle cycles and storage from existing machines. (ND workstations are 89% idle!)
With the assistance of OIT/CSE staff, Condor has been installed on 379 CPUs in the Colleges of Engineering and Science since early 2005.
Our Condor pool is expanding the capabilities of researchers in CSE, EE, AME, and Physics to perform CPU- and storage-intensive research.
More users and contributors are welcome to join!
http://www.nd.edu/~condor

Computing Environment
[Figure: the Condor Match Maker pairs queued jobs with CPUs and disks across four groups of machines: Miscellaneous CSE Workstations, the Fitzpatrick Workstation Cluster, the CVRL Research Cluster, and the CCL Research Cluster. Owner policies appear as callouts: "I will only run jobs between midnight and 8 AM", "I will only run jobs when there is no-one working at the keyboard", and "I prefer to run a job submitted by a CSE student."]

Scheduling Policy
First, Owners Exercise Absolute Control
– Set who, what, and when may use the machine.
– Can kick jobs off at any time manually.
– Default policy:
    Start job if console idle > 15 minutes.
    Suspend job if console used or CPU busy.
    Kick off job if suspended > 10 minutes.
After satisfying that principle, the users split available CPU hours equally.
The actual policy is a little more complicated; see details here:
http://www.cse.nd.edu/~ccl/operations/condor/policy.shtml

CPU History
[Figure: CPU History chart]

Storage History
[Figure: Storage History chart]

Current Donors, Feb 2007
Owner           Nodes  CPUs  Storage (TB)
CRC/OIT            92    92  3.7
CSE                73   124  11.7
Prof. Thain        59    91  5.5
Prof. Flynn        18    35  0.65
Prof. Striegel     10    20  0.65
Misc                7    17
Total             259   379  20.2

Flocking Between Universities
Wisconsin: 1200 CPUs
Purdue A: 541 CPUs
Notre Dame: 379 CPUs
Purdue B: 1016 CPUs
http://www.cse.nd.edu/~ccl/operations/condor/

Total Consumption in 2005
1128038 (100%) CPU-Hours Total
 134978 (11%)  CPU-Hours Consumed by Owner at Keyboard
 350148 (31%)  CPU-Hours Totally Unused
 642912 (56%)  CPU-Hours Harnessed by Condor
http://www.cse.nd.edu/~ccl/operations/condor/2006/users.html

Top Condor Users in 2005
User                     CPU Hours  Percent of Total  Max Jobs Running  Max Jobs in Queue  Dept  Advisor
Total                       642912           100.00%               327               2268
tdysart@nd.edu              548187            85.27%               204                740  CSE   Kogge
tfaltemi@nd.edu              51058             7.94%               163               2004  CSE   Flynn
johanes@nd.edu               27184             4.23%                 7                  7  CRC   -
ice-user.tdysart@nd.edu      22425             3.49%               100                100  CSE   Kogge
lxiao@nd.edu                 13236             2.06%                78                 85  EE    Fuja
dsalyers@nd.edu              10016             1.56%                28                688  CSE   Striegel
yjiang3@nd.edu                7371             1.15%                24                800  CSE   Striegel
pbrenne1@nd.edu               6148             0.96%               112                120  CSE   Izaguirre
dcieslak@nd.edu               5619             0.87%               145               1814  CSE   Chawla
bnovak@nd.edu                 4116             0.64%                52                 52  ???   ???
dvonhand@nd.edu               1820             0.28%                32                 32  CSE   Izaguirre
jmcraven@nd.edu               1390             0.22%                40                 92  ???   ???
http://www.cse.nd.edu/~ccl/operations/condor/2005/users.html

Total Consumption in 2006
2376456 (100%) CPU-Hours Total
 281003 (11%)  CPU-Hours Consumed by Owner at Keyboard
 934277 (39%)  CPU-Hours Totally Unused
1161176 (48%)  CPU-Hours Harnessed by Condor
http://www.cse.nd.edu/~ccl/operations/condor/2005/users.html

Top Condor Users in 2006
User                     CPU Hours  Percent of Total  Max Jobs Running  Max Jobs in Queue  Dept  Advisor
Total                      1161176           100.00%              1156              61695
dcieslak@nd.edu             471126            40.57%              1142              60314  CSE   Chawla
tfaltemi@nd.edu             415972            35.82%               447              20030  CSE   Flynn
tdysart@nd.edu              186050            16.02%               213               1275  CSE   Kogge
johanes@nd.edu               31341             2.70%                21                 21  CRC   -
pbrenne1@nd.edu              27693             2.38%                64               1082  CSE   Izaguirre
dthain@nd.edu                23217             2.00%               299                300  CSE   Thain
apusane@nd.edu               20708             1.78%               160                192  EE    Costello
yjiang3@nd.edu                4842             0.42%                24                366  CSE   Striegel
lxiao@nd.edu                  2741             0.24%                53                 66  EE    Fuja
npatel@nd.edu                 1690             0.15%                 6                  6  AME   Renaud
gniederw@nd.edu                980             0.08%                29                 35  CSE   Thain
jwozniak@nd.edu                413             0.04%                30                 72  CSE   Izaguirre
(> 389 jobs running is possible because some jobs migrate to Purdue and UW.)
http://www.cse.nd.edu/~ccl/operations/condor/2006/users.html

Research Projects Using Condor
Data Mining and Applications – CSE: Chawla
Multidimensional Biometric Imaging and Applications (NSF/DOJ) – CSE: Flynn and Bowyer
High End Biometric Computing (NSF) – CSE: Thain and Flynn
Architectures and Devices for Quantum Dot Cellular Automata (NSF) – EE and CSE: Kogge, Lent, Fay, Orlov
GEMS Grid Enabled Molecular Simulations (NSF) – CSE and Chem: Izaguirre, Striegel, Peng
Delay-Constrained Multihop Transmission in Wireless Networks: Interaction of Coding, Channel Access, and Routing (NSF/NASA/Moto) – EE: Laneman, Costello, Fuja, Haenggi
ND Design Automation Laboratory – AME: Renaud
GRAND: Gamma Ray Astrophysics at Notre Dame – Physics: Poirier (Distributed Storage)

Recent Papers Supported by Cycles from Condor at ND (1)

CSE: Data Mining
N. Chawla, D. Cieslak, L. Hall, A. Joshi, "Killing Two Birds with One Stone: Countering Cost and Imbalance," Data Mining and Knowledge Discovery, under review.
D. Cieslak, N. Chawla, "The Calibration and Power of Probability Estimation Trees in Ensembles," 7th International Workshop on Multiclassifier Systems, under review.
D. Cieslak, N. Chawla, "Reducing Loss and Improving ROC AUC Through Sampling," International Conference on Machine Learning, Corvallis, Oregon, 2007.
N. Chawla, D. Cieslak, "Evaluating Calibration of Probability Estimation Trees," Proceedings of the AAAI Workshop on the Evaluation Methods in Machine Learning, Boston, July 2006.
D. Cieslak, D. Thain, N. Chawla, "Troubleshooting Distributed Systems via Data Mining," Hot Topics Sessions: 15th IEEE International Symposium on High Performance Distributed Computing (HPDC-15), Paris, France, June 2006.
D. Cieslak, N. Chawla, A. Striegel, "Combating Imbalance in Network Intrusion Datasets," IEEE International Conference on Granular Computing, Athens, Georgia, May 2006.

Recent Papers Supported by Cycles from Condor at ND (2)

CSE: Biometrics
X. Chen, T. Faltemier, P. Flynn, and K. Bowyer, "Human Face Modeling and Recognition Through Multi-View High Resolution Stereopsis," Biometrics: Theory, Applications, and Systems, 2006.
D. Woodard, T. Faltemier, P. Yan, and P. Flynn, "A Comparison of 3D Biometric Modalities," Biometrics: Theory, Applications, and Systems, 2006.
T. Faltemier, P. Flynn, and K. Bowyer, "3D Face Recognition with Curvature Based Region Selection," 3D Data Processing, Visualization, and Transmission, 2006.
T. Faltemier, K. Bowyer, and P. Flynn, "Region Ensemble for 3D Face Recognition and Indexing," under submission.
T. Faltemier, K. Bowyer, and P. Flynn, "Using Multiple Gallery Images for 3D Face Recognition," under submission.

EE/CSE: Quantum Comp
Timothy J. Dysart, "Defect Properties and Design Tools for Quantum Dot Cellular Automata," Master's Thesis, 2005.
Timothy J. Dysart, Peter M. Kogge, Craig S. Lent, and Mo Liu, "An Analysis of Missing Cell Defects in Quantum-Dot Cellular Automata," IEEE International Workshop on Design and Test of Defect-Tolerant Nanoscale Architectures (NANOARCH '05), in conjunction with the VLSI Test Symposium, Palm Springs, CA, May 1, 2005.

Recent Papers Supported by Cycles from Condor at ND (3)

EE: Signal Coding
On Deriving Good LDPC Convolutional Codes, A. E. Pusane, R. Smarandache, P. O. Vontobel, and D. J. Costello, Jr., submitted to IEEE International Symposium on Information Theory, Nice, France, June 2007.
A Comparison of ARA- and Protograph-Based LDPC Block and Convolutional Codes, D. J. Costello, Jr., A. E. Pusane, C. Jones, and D. Divsalar, to appear in Proc. Information Theory and Applications Workshop, San Diego, CA, USA, January 29-February 2, 2007.
LDPC Convolutional Codes: What Are They? How Do They Work? Are They Any Good?, D. J. Costello, Jr. and A. E. Pusane, in Book of Abstracts, AMS Joint Mathematics Meetings, New Orleans, LA, USA, January 5-8, 2007.
L. Xiao, T. E. Fuja, J. Kliewer and D. J. Costello, Jr., "Algebraic Superposition of LDGM Codes for Cooperative Diversity," submitted to IEEE International Symposium on Information Theory (ISIT) 2007.
L. Xiao, T. E. Fuja, J. Kliewer and D. J. Costello, Jr., "Cooperative diversity based on code superposition," in IEEE International Symposium on Information Theory (ISIT), Seattle, WA, July 2006.
L. Xiao, T. E. Fuja, J. Kliewer and D. J. Costello, Jr., "Nested codes with multiple interpretations," in 40th Conference on Information Sciences and Systems (CISS), Princeton, NJ, March 2006.

Recent Papers Supported by Cycles from Condor at ND (4)

CSE: Network Simulation
Yingxin Jiang, Aaron Striegel, "A Distributed Traffic Control Scheme based on Edge-Centric Resource Management," ACM Computer Communications Review, vol. 36, no. 2, pp. 5-16, April 2006.
Effects of low-quality computation time estimates in policed schedulers, Justin M. Wozniak, Yingxin Jiang and Aaron Striegel, Proc. Annual Simulation Symposium, IEEE Computer Society, 2007.
D. Salyers, A. Striegel, "A Novel Approach for Transparent Bandwidth Conservation," Proceedings of Networking 2005, Waterloo, Ontario, Canada, May 2005.

CSE: Scientific Databases
Access Control for a Replica Management Database, Justin Wozniak, Paul Brenner, Douglas Thain, Aaron Striegel, Jesus Izaguirre, ACM Workshop on Storage Security and Survivability (StorageSS), October 2006.
Generosity and Gluttony in GEMS: Grid Enabled Molecular Simulations, Justin Wozniak, Paul Brenner, Douglas Thain, Aaron Striegel, and Jesus Izaguirre, in Proceedings of the IEEE Symposium on High Performance Distributed Computing, July 2005.

Recent Papers Supported by Cycles from Condor at ND (5)

CSE: Grid Computing
Challenges in Executing Data Intensive Biometric Workloads on a Desktop Grid, Christopher Moretti, Timothy Faltemier, Douglas Thain, and Patrick J. Flynn, Workshop on Large Scale and Volatile Desktop Grids, March 2006.
Operating System Support for Space Allocation in Grid Storage Systems, Douglas Thain, IEEE Conference on Grid Computing, September 2006.
The Consequences of Decentralized Security in a Cooperative Storage System, Douglas Thain, Chris Moretti, Paul Madrid, Phil Snowberger, and Jeff Hemmes, IEEE Workshop on Security in Storage (SISW), San Francisco, December 2005.
Separating Abstractions from Resources in a Tactical Storage System, Douglas Thain, Sander Klous, Justin Wozniak, Paul Brenner, Aaron Striegel, and Jesus Izaguirre, in Proceedings of IEEE/ACM Supercomputing, Nov 2005.
Patisserie: Support for Parameter Sweeps in a Fault-Tolerant, Massively Parallel, Peer-to-Peer Simulation Environment, Timothy Schoenharl, Scott Christley, and Douglas Thain, Workshop on Agent Directed Simulation (ADS), San Diego, California, April 2005.

How does Condor relate to CRC?
Use the CRC clusters for:
– CPU-intensive, fine-grained parallel codes.
– The latest, fastest machines.
– Professional, continuous support.
Use the Condor pool for:
– Coarse-grained, naturally parallel codes.
– Harnessing college/dept level machines.
– Integration with distributed storage.
– Building and deploying novel systems for computer science research.
– Self-service support at this point.
(Some ambitious students use both!)

How does Condor relate to OSG?
The Open Science Grid
– A wide-area consortium of universities.
– A mechanism (Condor+Globus) to access remote batch/storage systems over the WAN.
– Interface (Condor-G) is one piece of Condor.
The ND Condor Pool
– A campus-scale collection of resources.
– Could be made accessible via OSG interface.
– Indirectly part of OSG/TeraGrid via Purdue.

Example of Application/CS Research Using Condor
Scalable I/O for Biometrics

Computer Vision Research Lab in CSE
– Goal: Develop robust algorithms for identifying humans from (non-ideal) images.
– Technique: Collect lots of images. Think up a clever new matching function. Compare them.
How do you test a matching function?
– For a set S of images,
– Compute F(Si,Sj) for all Si and Sj in S.
– Compare the result matrix to known functions.
Credit: Patrick Flynn at Notre Dame CSE

Computing Similarities
[Figure: the matching function F fills in the similarity matrix for the image set; the upper triangle is shown.]
 1  .8  .1   0   0  .1
     1   0  .1  .1   0
         1   0  .1  .7
             1   0   0
                 1  .1
                     1

A Big Data Problem
Data Size: 10k images of 1 MB = 10 GB
Total I/O: 10k * 10k * 2 MB * 1/2 = 100 TB
Would like to repeat many times!
In order to execute such a workload, we must be careful to partition both the I/O and the CPU needs, taking advantage of distributed capacity.

Conventional Solution
[Figure: jobs running on CPUs read their input from a separate set of disks.]
Move 200 TB at Runtime!

Using Tactical Storage
[Figure: jobs running on CPUs, each reading from a nearby disk.]
1. Break array into MB-size chunks.
2. Replicate data to many disks.
3. Jobs find nearby data copy, and make full use before discarding.
Result: achieve greater than 2 Gb/s of disk-to-application bandwidth on a large workload.

Technical Issues (1)
Deployment
– All codes and config live in AFS; just deploy a startup script in /etc/init.d.
– A manual copy onto each node gets lost at the end of the semester, so copy the script into the machine image.
Firewalls
– TCP/UDP on ports 9000-1000 both directions.
– One firewalled machine can hang everyone!
– Workaround: Periodic check of TCP ports, manually disable Condor on FW nodes.
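
The periodic TCP port check mentioned in the Firewalls workaround could look roughly like the sketch below. It is a minimal illustration under stated assumptions, not the script used at ND: the function names, the host list, and the choice of port are hypothetical, and it tests TCP reachability only (not UDP).

```python
import socket


def tcp_port_open(host: str, port: int, timeout: float = 2.0) -> bool:
    """Return True if a TCP connection to (host, port) succeeds within timeout.

    A refused connection, a timeout (typical of a filtering firewall), and a
    name-lookup failure are all reported as "not open"; this sketch does not
    try to tell them apart.
    """
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False


def firewalled_hosts(hosts, port, timeout: float = 2.0):
    """Return the hosts whose given TCP port cannot be reached.

    `hosts` is any iterable of host names; which hosts and which port to
    probe are assumptions the operator has to supply.
    """
    return [h for h in hosts if not tcp_port_open(h, port, timeout)]
```

For example, `firewalled_hosts(["node01.example.edu", "node02.example.edu"], 9000)` would return the subset of those (hypothetical) hosts on which port 9000 is unreachable; these are the candidates for manually disabling Condor.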

Technical Issues (2)
Disappearing Servers
– Problem: condor_master on each host disappears mysteriously; pool decays.
– Diagnosis: AFS outage? Condor bug?
– Solution: /etc/cron.hourly/restart_condor
CPU Detection
– Problem: Hyperthreaded machines appear to be multi-CPU machines on Linux.
– Result: Condor overcommits the CPU.
– Solution: Manual override NUM_CPUS=1

Summary
With your help, our Condor pool has provided significant benefits for both research and education. Thank you!
Liaison between faculty and staff at the dept, college, and univ level is needed to keep the system working.
Lots more info here:
– http://www.nd.edu/~condor
– condor-discuss@listserv.nd.edu
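
As a footnote to the Disappearing Servers workaround in Technical Issues (2), an hourly restart check might be sketched as follows. The slides name the cron entry but do not show its contents, so everything here is an assumption: the function names are hypothetical, the check relies on the Linux `pgrep` utility, and the start command is whatever the site's startup script happens to be.

```python
import subprocess


def is_running(process_name: str) -> bool:
    """Return True if pgrep finds a process whose name matches exactly."""
    result = subprocess.run(
        ["pgrep", "-x", process_name],
        stdout=subprocess.DEVNULL,
        stderr=subprocess.DEVNULL,
    )
    return result.returncode == 0


def restart_if_missing(process_name, start_command,
                       is_running_fn=is_running, run_fn=subprocess.run):
    """Run start_command only when process_name is not running.

    Returns True if a restart was attempted and False otherwise. The two
    function parameters exist so the logic can be exercised without
    touching real processes.
    """
    if is_running_fn(process_name):
        return False
    run_fn(start_command, check=False)
    return True
```

A cron job could call something like `restart_if_missing("condor_master", ["/etc/init.d/condor", "start"])`, where the script path is a placeholder for the site's own startup script. Restarting blindly hides the root cause, so it may be worth logging each restart for later diagnosis.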