Impact on Research 2006 - University of Notre Dame

Harnessing Idle Computers
with Condor at Notre Dame:
Impact on Research in 2006
Prof. Douglas Thain
CSE Department
9 Feb 2007
What is Condor?
Condor is software from UW-Madison that
harnesses idle cycles and storage from existing
machines. (ND workstations are 89% idle!)
With the assistance of OIT/CSE staff, Condor has
been installed on 379 CPUs in the Colleges of
Engineering and Science since early 2005.
Our Condor pool is expanding the capabilities of
researchers in CSE, EE, AME, and Physics
to perform CPU and storage intensive research.
More users and contributors are welcome to join!
http://www.nd.edu/~condor
Computing Environment
[Figure: jobs are sent through the Condor Match Maker to CPUs and disks in four groups of machines: Miscellaneous CSE Workstations, Fitzpatrick Workstation Cluster, CVRL Research Cluster, and CCL Research Cluster. Machine owners state policies such as:]
"I will only run jobs between midnight and 8 AM"
"I will only run jobs when there is no-one working at the keyboard"
"I prefer to run a job submitted by a CSE student."
Scheduling Policy
First, Owners Exercise Absolute Control
– Set who, what, and when can use machine.
– Can kick jobs off at any time manually.
– Default policy:
Start job if console idle > 15 minutes
Suspend job if console used or CPU busy.
Kick off job if suspended > 10 minutes.
After satisfying that principle, the users
split available CPU hours equally.
A little more complicated, see details here:
http://www.cse.nd.edu/~ccl/operations/condor/policy.shtml
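The default policy above can be read as a small state machine. The following is a minimal sketch in Python, not the pool's actual Condor configuration. The state names, the function `next_state`, and the resume rule (a suspended job resumes once the owner is no longer active) are assumptions made for illustration; "console used" is modeled here as zero minutes of console idle time.

```python
START_AFTER_IDLE_MIN = 15       # start job if console idle > 15 minutes
KICK_AFTER_SUSPENDED_MIN = 10   # kick off job if suspended > 10 minutes


def next_state(state, console_idle_min, cpu_busy, suspended_min=0):
    """Return the machine's next state under the default policy.

    state is one of "unclaimed", "running", "suspended" (hypothetical names).
    """
    # Assumption: the console counts as "used" when its idle time is zero.
    owner_active = console_idle_min == 0 or cpu_busy

    if state == "unclaimed":
        if console_idle_min > START_AFTER_IDLE_MIN:
            return "running"
        return "unclaimed"

    if state == "running":
        return "suspended" if owner_active else "running"

    if state == "suspended":
        if suspended_min > KICK_AFTER_SUSPENDED_MIN:
            return "unclaimed"  # the job is kicked off the machine
        # Assumption: resume once the owner is no longer active.
        return "suspended" if owner_active else "running"

    raise ValueError(f"unknown state: {state}")
```

For example, `next_state("running", 0, False)` yields `"suspended"`, since the owner has just touched the keyboard.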
CPU History
Storage History
Current Donors Feb 2007
Owner            Nodes   CPUs   Storage (TB)
CRC/OIT             92     92    3.7
CSE                 73    124   11.7
Prof. Thain         59     91    5.5
Prof. Flynn         18     35    0.65
Prof. Striegel      10     20    0.65
Misc                 7     17
Total              259    379   20.2 TB
Flocking Between Universities
[Figure: Condor pools linked by flocking]
Wisconsin: 1200 CPUs
Purdue A: 541 CPUs
Notre Dame: 379 CPUs
Purdue B: 1016 CPUs
http://www.cse.nd.edu/~ccl/operations/condor/
Total Consumption in 2005
CPU-Hours Total: 1128038 (100%)
CPU-Hours Consumed by Owner at Keyboard: 134978 (11%)
CPU-Hours Totally Unused: 350148 (31%)
CPU-Hours Harnessed by Condor: 642912 (56%)
http://www.cse.nd.edu/~ccl/operations/condor/2006/users.html
Top Condor Users in 2005
User                      CPU Hours  Percent of Total  Max Jobs Running  Max Jobs in Queue  Dept  Advisor
Total                        642912           100.00%               327               2268
tdysart@nd.edu               548187            85.27%               204                740  CSE   Kogge
tfaltemi@nd.edu               51058             7.94%               163               2004  CSE   Flynn
johanes@nd.edu                27184             4.23%                 7                  7  CRC   -
ice-user.tdysart@nd.edu       22425             3.49%               100                100  CSE   Kogge
lxiao@nd.edu                  13236             2.06%                78                 85  EE    Fuja
dsalyers@nd.edu               10016             1.56%                28                688  CSE   Striegel
yjiang3@nd.edu                 7371             1.15%                24                800  CSE   Striegel
pbrenne1@nd.edu                6148             0.96%               112                120  CSE   Izaguirre
dcieslak@nd.edu                5619             0.87%               145               1814  CSE   Chawla
bnovak@nd.edu                  4116             0.64%                52                 52  ???   ???
dvonhand@nd.edu                1820             0.28%                32                 32  CSE   Izaguirre
jmcraven@nd.edu                1390             0.22%                40                 92  ???   ???
http://www.cse.nd.edu/~ccl/operations/condor/2005/users.html
Total Consumption in 2006
CPU-Hours Total: 2376456 (100%)
CPU-Hours Consumed by Owner at Keyboard: 281003 (11%)
CPU-Hours Totally Unused: 934277 (39%)
CPU-Hours Harnessed by Condor: 1161176 (48%)
http://www.cse.nd.edu/~ccl/operations/condor/2005/users.html
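As a quick check on the consumption figures for 2005 and 2006, the short Python sketch below recomputes the totals and shares from the three categories reported on the slides. The function name `breakdown` and the dictionary keys are illustrative only. The slide percentages appear to be the computed shares with the fractional part dropped.

```python
def breakdown(owner, unused, condor):
    """Return (total CPU-hours, percentage share of each category)."""
    total = owner + unused + condor
    shares = {
        "owner": 100 * owner / total,
        "unused": 100 * unused / total,
        "condor": 100 * condor / total,
    }
    return total, shares


# CPU-hours taken from the 2005 and 2006 consumption slides.
total_2005, shares_2005 = breakdown(134978, 350148, 642912)
total_2006, shares_2006 = breakdown(281003, 934277, 1161176)

# Ratio of CPU-hours harnessed by Condor in 2006 versus 2005.
growth = 1161176 / 642912

for year, total, shares in (
    (2005, total_2005, shares_2005),
    (2006, total_2006, shares_2006),
):
    print(year, total, {name: round(pct, 2) for name, pct in shares.items()})
print("harnessed growth factor:", round(growth, 2))
```

By this calculation the CPU-hours harnessed by Condor grew by a factor of roughly 1.8 from 2005 to 2006.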
Top Condor Users in 2006
User                      CPU Hours  Percent of Total  Max Jobs Running  Max Jobs in Queue  Dept  Advisor
Total                       1161176           100.00%              1156              61695
dcieslak@nd.edu              471126            40.57%              1142              60314  CSE   Chawla
tfaltemi@nd.edu              415972            35.82%               447              20030  CSE   Flynn
tdysart@nd.edu               186050            16.02%               213               1275  CSE   Kogge
johanes@nd.edu                31341             2.70%                21                 21  CRC   -
pbrenne1@nd.edu               27693             2.38%                64               1082  CSE   Izaguirre
dthain@nd.edu                 23217             2.00%               299                300  CSE   Thain
apusane@nd.edu                20708             1.78%               160                192  EE    Costello
yjiang3@nd.edu                 4842             0.42%                24                366  CSE   Striegel
lxiao@nd.edu                   2741             0.24%                53                 66  EE    Fuja
npatel@nd.edu                  1690             0.15%                 6                  6  AME   Renaud
gniederw@nd.edu                 980             0.08%                29                 35  CSE   Thain
jwozniak@nd.edu                 413             0.04%                30                 72  CSE   Izaguirre
(More than 389 jobs can be running at once because some migrate to Purdue and UW.)
http://www.cse.nd.edu/~ccl/operations/condor/2006/users.html
Research Projects Using Condor
Data Mining and Applications
– CSE: Chawla
Multidimensional Biometric Imaging and Applications (NSF/DOJ)
– CSE: Flynn and Bowyer
High End Biometric Computing (NSF)
– CSE: Thain and Flynn
Architectures and Devices for Quantum Dot Cellular Automata (NSF)
– EE and CSE: Kogge, Lent, Fay, Orlov
GEMS Grid Enabled Molecular Simulations (NSF)
– CSE and Chem: Izaguirre, Striegel, Peng
Delay-Constrained Multihop Transmission in Wireless Networks:
Interaction of Coding, Channel Access, and Routing (NSF/NASA/Moto)
– EE: Laneman, Costello, Fuja, Haenggi
ND Design Automation Laboratory
– AME: Renaud
GRAND: Gamma Ray Astrophysics at Notre Dame
– Physics: Poirier (Distributed Storage)
Recent Papers Supported
by Cycles from Condor at ND (1)
N. Chawla, D. Cieslak, L. Hall, A. Joshi, "Killing Two Birds with One Stone:
Countering Cost and Imbalance," Data Mining and Knowledge Discovery,
under review.
D. Cieslak, N. Chawla, "The Calibration and Power of Probability Estimation
Trees in Ensembles," 7th International Workshop on Multiclassifier Systems,
under review.
D. Cieslak, N. Chawla, "Reducing Loss and Improving ROC AUC Through
Sampling," International Conference on Machine Learning, Corvallis,
Oregon, 2007.
N. Chawla, D. Cieslak, "Evaluating Calibration of Probability Estimation
Trees," Proceedings of the AAAI Workshop on the Evaluation Methods in
Machine Learning, Boston, July 2006
D. Cieslak, D. Thain, N. Chawla, "Troubleshooting Distributed Systems via
Data Mining," Hot Topics Sessions: 15th IEEE International Symposium on
High Performance Distributed Computing (HPDC-15), Paris, France, June
2006
D. Cieslak, N. Chawla, A. Striegel, "Combating Imbalance in Network
Intrusion Datasets," IEEE International Conference on Granular Computing,
Athens, Georgia, May 2006.
CSE: Data Mining
Recent Papers Supported
by Cycles from Condor at ND (2)
X. Chen, T. Faltemier, P. Flynn, and K. Bowyer, “Human Face Modeling and
Recognition Through Multi-View High Resolution Stereopsis”, Biometrics:
Theory, Applications, and Systems, 2006.
D. Woodard, T. Faltemier, P. Yan, and P. Flynn, “A Comparison of 3D
Biometric Modalities”, Biometrics: Theory, Applications, and Systems, 2006.
T. Faltemier, P. Flynn, and K. Bowyer, “3D Face Recognition with Curvature
Based Region Selection”, 3D Data Processing, Visualization, and
Transmission, 2006.
T. Faltemier, K. Bowyer, and P. Flynn, “Region Ensemble for 3D Face
Recognition and Indexing”, under submission.
T. Faltemier, K. Bowyer, and P. Flynn, “Using Multiple Gallery Images for 3D
Face Recognition”, under submission.
CSE: Biometrics
Timothy J. Dysart. "Defect Properties and Design Tools for Quantum Dot
Cellular Automata." Master's Thesis, 2005.
Timothy J. Dysart, Peter M. Kogge, Craig S. Lent, and Mo Liu. "An Analysis
of Missing Cell Defects in Quantum-Dot Cellular Automata." IEEE
International Workshop on Design and Test of Defect-Tolerant Nanoscale
Architectures (NANOARCH '05) in conjunction with the VLSI Test
Symposium. Palm Springs, CA. May 1, 2005
EE/CSE: Quantum Comp
Recent Papers Supported
by Cycles from Condor at ND (3)
On Deriving Good LDPC Convolutional Codes, A. E. Pusane, R. Smarandache,
P. O. Vontobel, and D. J. Costello, Jr, submitted to IEEE International Symposium
on Information Theory, Nice, France, June 2007.
A Comparison of ARA- and Protograph-Based LDPC Block and Convolutional
Codes, D. J. Costello, Jr., A. E. Pusane, C. Jones, and D. Divsalar, to appear in
Proc. Information Theory and Applications Workshop, San Diego, CA, USA,
January 29-February 2, 2007.
LDPC Convolutional Codes: What Are They? How Do They Work? Are They Any
Good?, D. J. Costello, Jr. and A. E. Pusane in Book of Abstracts, AMS Joint
Mathematics Meetings, New Orleans, LA, USA, January 5-8, 2007.
L. Xiao, T. E. Fuja, J. Kliewer and D. J. Costello, Jr., "Algebraic Superposition of
LDGM Codes for Cooperative Diversity" submitted to IEEE International
Symposium on Information Theory (ISIT) 2007.
L. Xiao, T. E. Fuja, J. Kliewer and D. J. Costello, Jr., "Cooperative diversity
based on code superposition" in IEEE International Symposium on Information
Theory (ISIT), Seattle, WA, July 2006.
L. Xiao, T. E. Fuja, J. Kliewer and D. J. Costello, Jr., "Nested codes with multiple
interpretations" in 40th Conference on Information Sciences and Systems
(CISS), Princeton, NJ, March 2006.
EE: Signal Coding
Recent Papers Supported
by Cycles from Condor at ND (4)
Yingxin Jiang, Aaron Striegel, "A Distributed Traffic Control Scheme based
on Edge-Centric Resource Management," ACM Computer Communications
Review, vol. 36, no. 2, pp. 5-16, April 2006.
Effects of low-quality computation time estimates in policed schedulers,
Justin M. Wozniak, Yingxin Jiang and Aaron Striegel, Proc. Annual
Simulation Symposium, IEEE Computer Society, 2007.
D. Salyers, A. Striegel, "A Novel Approach for Transparent Bandwidth
Conservation," Proceedings of Networking 2005, Waterloo Ontario Canada,
May 2005
CSE: Network Simulation
Access Control for a Replica Management Database, Justin Wozniak, Paul
Brenner, Douglas Thain, Aaron Striegel, Jesus Izaguirre, ACM Workshop on
Storage Security and Survivability (StorageSS), October 2006.
Generosity and Gluttony in GEMS: Grid Enabled Molecular Simulations,
Justin Wozniak, Paul Brenner, Douglas Thain, Aaron Striegel, and Jesus
Izaguirre, in Proceedings of the IEEE Symposium on High Performance
Distributed Computing, July 2005
CSE: Scientific Databases
Recent Papers Supported
by Cycles from Condor at ND (5)
Challenges in Executing Data Intensive Biometric Workloads on a Desktop
Grid, Christopher Moretti, Timothy Faltemier, Douglas Thain, and Patrick J.
Flynn, Workshop on Large Scale and Volatile Desktop Grids, March 2006.
Operating System Support for Space Allocation in Grid Storage Systems,
Douglas Thain, IEEE Conference on Grid Computing, September 2006.
The Consequences of Decentralized Security in a Cooperative Storage
System, Douglas Thain, Chris Moretti, Paul Madrid, Phil Snowberger, and
Jeff Hemmes, IEEE Workshop on Security in Storage (SISW), San
Francisco, December 2005.
Separating Abstractions from Resources in a Tactical Storage System,
Douglas Thain, Sander Klous, Justin Wozniak, Paul Brenner, Aaron Striegel,
and Jesus Izaguirre, in Proceedings of IEEE/ACM Supercomputing, Nov
2005.
Patisserie: Support for Parameter Sweeps in a Fault-Tolerant, Massively
Parallel, Peer-to-Peer Simulation Environment, Timothy Schoenharl, Scott
Christley, and Douglas Thain, Workshop on Agent Directed Simulation
(ADS), San Diego, California, April 2005.
CSE: Grid Computing
How does Condor relate to CRC?
Use the CRC clusters for:
– CPU-intensive, fine-grained parallel codes.
– The latest, fastest machines.
– Professional, continuous support.
Use the Condor pool for:
– Coarse grained, naturally parallel codes.
– Harnessing college/dept level machines.
– Integration with distributed storage.
– Building and deploying novel systems for computer
science research.
– Self-service support at this point.
(Some ambitious students use both!)
How does Condor relate to OSG?
The Open Science Grid
– A wide-area consortium of universities.
– A mechanism (Condor+Globus) to access
remote batch/storage systems over the WAN.
– Interface (Condor-G) is one piece of Condor.
The ND Condor Pool
– A campus-scale collection of resources.
– Could be made accessible via OSG interface.
– Indirectly part of OSG/TeraGrid via Purdue.
Example of Application/CS
Research Using Condor
Scalable I/O for Biometrics
Computer Vision Research Lab in CSE
– Goal: Develop robust algorithms for
identifying humans from (non-ideal) images.
– Technique: Collect lots of images. Think up
clever new matching function. Compare them.
How do you test a matching function?
– For a set S of images,
– Compute F(Si,Sj) for all Si and Sj in S.
– Compare the result matrix to known functions.
Credit: Patrick Flynn at Notre Dame CSE
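The all-pairs test described above can be sketched in a few lines of Python. This is an illustration only, not the lab's code: `similarity_matrix` and `toy_match` are hypothetical names, the strings stand in for images, and the sketch assumes the matching function F is symmetric, so each unordered pair is computed once and mirrored.

```python
from itertools import combinations_with_replacement


def similarity_matrix(images, match):
    """Compute match(Si, Sj) for all pairs, assuming match is symmetric."""
    n = len(images)
    result = [[0.0] * n for _ in range(n)]
    for i, j in combinations_with_replacement(range(n), 2):
        score = match(images[i], images[j])
        result[i][j] = score
        result[j][i] = score  # mirror, relying on the symmetry assumption
    return result


def toy_match(a, b):
    """Stand-in for F: fraction of positions where two equal-length strings agree."""
    return sum(x == y for x, y in zip(a, b)) / len(a)


images = ["aaaa", "aaab", "bbbb"]  # placeholders for real image data
matrix = similarity_matrix(images, toy_match)
for row in matrix:
    print(row)
```

With n images this evaluates F on n(n+1)/2 pairs rather than n^2, which is only valid if F(Si,Sj) equals F(Sj,Si).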
Computing Similarities
[Figure: the matching function F fills in a similarity matrix; the upper triangle shown is:]
1   .8  .1  0   0   .1
    1   0   .1  .1  0
        1   0   .1  .7
            1   0   0
                1   .1
                    1
A Big Data Problem
Data Size: 10k images of 1MB = 10 GB
Total I/O: 10k * 10k * 2 MB *1/2 = 100 TB
Would like to repeat many times!
In order to execute such a workload, we
must be careful to partition both the I/O
and the CPU needs, taking advantage of
distributed capacity.
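The size estimate above can be reproduced with a short Python sketch. It uses decimal units (1 GB = 1000 MB, 1 TB = 10^6 MB) and follows the slide's approximation of n * n / 2 pairs, each requiring two images to be read; the function names are illustrative.

```python
MB_PER_GB = 1000
MB_PER_TB = 1000 * 1000


def dataset_size_gb(num_images, image_mb):
    """Total size of the image set in GB."""
    return num_images * image_mb / MB_PER_GB


def all_pairs_io_tb(num_images, image_mb):
    """Approximate I/O in TB when every pair is read once (n * n / 2 pairs)."""
    pairs = num_images * num_images / 2
    return pairs * 2 * image_mb / MB_PER_TB


size_gb = dataset_size_gb(10_000, 1)
io_tb = all_pairs_io_tb(10_000, 1)
print(size_gb, "GB of data,", io_tb, "TB of I/O")
```

The gap between the two numbers, 10 GB stored versus 100 TB read, is what makes data placement matter for this workload.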
Conventional Solution
[Figure: four disks, and eight jobs running on eight CPUs, each with its own disk]
Move 200 TB at Runtime!
Using Tactical Storage
1. Break array into MB-size chunks.
2. Replicate data to many disks.
3. Jobs find nearby data copy,
and make full use before discarding.
[Figure: jobs on CPUs reading from copies of the data on nearby disks]
Result: achieve greater than
2Gb/s of disk->application
bandwidth on large workload
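The three steps on this slide (chunk, replicate, read from a nearby copy) can be sketched as follows. This is a toy illustration in Python and not the tactical storage implementation: the round-robin placement, the function names, and the disk names are all assumptions, and `copies` is assumed to be no larger than the number of disks.

```python
def split_into_chunks(items, chunk_size):
    """Step 1: break the data set into fixed-size chunks."""
    return [items[i:i + chunk_size] for i in range(0, len(items), chunk_size)]


def place_replicas(num_chunks, disks, copies):
    """Step 2: assign each chunk to `copies` disks, round-robin (an assumption)."""
    return {
        chunk: [disks[(chunk + k) % len(disks)] for k in range(copies)]
        for chunk in range(num_chunks)
    }


def pick_replica(replicas, local_disk):
    """Step 3: prefer a copy on the job's own disk, else fall back to the first copy."""
    return local_disk if local_disk in replicas else replicas[0]


chunks = split_into_chunks(list(range(10)), 4)
placement = place_replicas(len(chunks), ["d0", "d1", "d2", "d3"], copies=2)
print(chunks)
print(placement)
print(pick_replica(placement[1], "d2"))
```

A job that finds a copy on its own disk avoids the network entirely, which is the point of replicating before the run rather than moving data at runtime.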
Technical Issues (1)
Deployment
– All codes and config in AFS,
just deploy startup script in /etc/init.d.
– A manual copy onto each node gets lost at the
end of the semester; copy it into the machine image instead.
Firewalls
– TCP/UDP on ports 9000-10000 both directions.
– One firewalled machine can hang everyone!
– Workaround: Periodic check of TCP ports,
manually disable Condor on FW nodes.
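The periodic TCP check mentioned in the firewall workaround could look something like the Python sketch below. It is an assumption about how such a check might be written, not the script used at ND; the function names are hypothetical, and it tests TCP only, since a UDP port cannot be probed this way.

```python
import socket


def is_port_open(host, port, timeout=2.0):
    """Return True if a TCP connection to host:port succeeds within the timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts, and unreachable hosts.
        return False


def blocked_nodes(hosts, port, timeout=2.0):
    """Return the hosts whose TCP port could not be reached."""
    return [host for host in hosts if not is_port_open(host, port, timeout)]
```

A periodic job could call `blocked_nodes` with the pool's host list and one port from the Condor range, then report the result so that Condor can be disabled by hand on those nodes.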
Technical Issues (2)
Disappearing Servers
– Problem: condor_master on each host
disappears mysteriously; pool decays.
– Diagnosis: AFS outage? Condor bug?
– Solution: /etc/cron.hourly/restart_condor
CPU Detection
– Problem: Hyperthreaded machines appear
to be multi-CPU machines on Linux.
– Result: Condor overcommits the CPU.
– Solution: Manual override NUM_CPUS=1
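For the disappearing-server problem, the hourly restart could be implemented along these lines. The slides name the script /etc/cron.hourly/restart_condor but do not show its contents, so this Python sketch is a guess at the logic. The path in `START_COMMAND` is hypothetical (the slides only say a startup script is deployed in /etc/init.d), and the check relies on the standard `pgrep -x` exact-name match.

```python
import subprocess

# Hypothetical path: the slides do not give the startup script's name.
START_COMMAND = ["/etc/init.d/condor", "start"]


def master_is_running(process_name="condor_master"):
    """Return True if pgrep finds a process with exactly this name."""
    result = subprocess.run(
        ["pgrep", "-x", process_name],
        stdout=subprocess.DEVNULL,
        stderr=subprocess.DEVNULL,
    )
    return result.returncode == 0


def start_master():
    """Run the startup script; raises CalledProcessError if it fails."""
    subprocess.run(START_COMMAND, check=True)


def ensure_master(is_running=master_is_running, start=start_master):
    """Start the daemon only if it is missing; return True if a start was attempted."""
    if is_running():
        return False
    start()
    return True
```

Calling `ensure_master()` from an hourly cron script restores a missing condor_master without disturbing one that is still running. The two callables are parameters so the logic can be exercised without touching a real daemon.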
Summary
With your help, our Condor pool has
provided significant benefits for both
research and education. Thank you!
Liaison between faculty and staff at the
dept, college, and univ level is needed to
keep the system working.
Lots more info here:
– http://www.nd.edu/~condor
– condor-discuss@listserv.nd.edu