DR&E LLGrid Portal
Interactive Supercomputing for DoD
Albert Reuther, William Arcand, Chansup Byun, Bill Bergeron,
Matthew Hubbell, Jeremy Kepner, Andrew McCabe, Peter
Michaleas, Julie Mullen & Andrew Prout
MIT Lincoln Laboratory
HPEC Workshop
September 15, 2010
This work is sponsored by the Department of the Air Force under Air Force contract FA8721-05-C-0002. Opinions, interpretations,
conclusions and recommendations are those of the author and are not necessarily endorsed by the United States Government.
DR&E Portal Prototype
[Diagram: a DoD researcher with a CAC card connects over DREN to the TX-DoD cluster. Callouts: best of desktop + best of supercomputing; interactive "what if" scenarios; good for experts, great for novices.]
• HPCMP selected LLGrid for the DoD-wide prototype DR&E Portal
• Prototype goal: interactive pMatlab on a modest cluster (TX-DoD) over DREN for alpha users with CAC authentication
Outline
• Introduction
  – LLGrid
  – Interactive Supercomputing
  – Parallel Matlab
• Design Overview
• Technologies
• Summary
What is LLGrid?
[Diagram: LLGrid combines the best of the desktop with the best of supercomputing: interactive "what if" scenarios; good for experts, great for novices.]
• LLGrid is a ~400 user, ~2000 processor system
• World's only desktop interactive supercomputer
  – Dramatically easier to use than any other supercomputer
  – Highest fraction of staff (20%) using supercomputing of any organization on the planet
LLGrid Interactive Supercomputing
[Chart: computing time (seconds, minutes, hours, days) versus number of processors (1 to 1,000). Batch processing occupies the long-running region; the Lincoln Laboratory "sweet spot" is interactive supercomputing, between desktop computing and classic supercomputing.]
• Classic supercomputing: jobs take hours/days to run, but can tolerate waiting in a queue
• Interactive supercomputing: jobs are large, requiring answers in minutes/hours, and cannot tolerate waiting in a queue
• Desktop computing: jobs take minutes on a desktop (e.g., algorithm proof-of-principles)
Why is LLGrid easier to use?
Universal Parallel Matlab programming
Amap = map([Np 1],{},0:Np-1);  % distribute rows across Np processors
Bmap = map([1 Np],{},0:Np-1);  % distribute columns across Np processors
A = rand(M,N,Amap);            % distributed M-by-N random matrix
B = zeros(M,N,Bmap);           % distributed M-by-N result matrix
B(:,:) = fft(A);               % parallel FFT; data redistributed implicitly
• pMatlab runs in all parallel Matlab
environments
• Only a few functions are needed
– Np
– Pid
– map
– local
– put_local
– global_index
– agg
– SendMsg/RecvMsg
• Distributed arrays have been recognized as the easiest way to program parallel computers since the 1970s
• Only a small number of distributed array functions are necessary to write nearly all parallel programs
• LLGrid is the first system to deploy interactive distributed arrays
(Reference: Jeremy Kepner, Parallel MATLAB for Multicore and Multinode Computers.)
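As a minimal sketch of this programming model (illustrative, not from the slides; assumes pMatlab is initialized and N is defined), each process fills its local piece of a distributed array and the leader aggregates the result:

% Minimal pMatlab sketch (illustrative): fill local pieces, then aggregate
Xmap = map([1 Np],{},0:Np-1);   % distribute columns across Np processes
X = zeros(N,N,Xmap);            % distributed N-by-N array
Xloc = local(X);                % this process's piece of X
Xloc(:,:) = Pid;                % mark the piece with its process ID
X = put_local(X,Xloc);          % write the piece back into X
Xall = agg(X);                  % aggregate the whole array on the leader
if Pid == 0
    disp(Xall);                 % only the leader displays the result
end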
Outline
• Introduction
• Design Overview
  – Requirements
  – Phases
  – Architecture
• Technologies
• Summary
Requirements for DR&E Portal
• Cannot utilize any new networking ports
  – Hypertext Transfer Protocol (HTTP) – port 80
  – Secure Sockets Layer (SSL) – port 443
• Cannot install new software on desktop computers
• Dual-layer authentication
  – CAC card with SSL certificates
  – PIN authentication
• Traverse multiple organizations over DREN
• Isolate user accounts from each other
• Intuitive to go from serial to parallel coding
• Desktop computer is one of the computational workers
Prototype Components:
Pre-alpha cluster (TX-DoD)
• Provided an icon on scientists' and engineers' desktops that gives them tools to do their jobs faster
  – pMatlab is the first tool in the suite (extensible over time)
• Dedicated cluster at LL on DREN
  – 40-node blade system along with 8 TB of parallel storage
• Used for initial development
  – LLGrid software stack deployed and modified to work in the HPCMP environment based on requirements
• Software stack copied to alpha cluster
• Maintained as a mirror system for development purposes
Prototype Components:
alpha cluster testbed
• Experimental testbed on DREN
• Used for trials with alpha users
• Software stack was copied from the pre-alpha system and modified based on trials; changes folded back to the mirror system (TX-DoD)
• Software stack copied to beta system
LLGrid Software Stack
[Diagram: LLGrid software stack. Mac OS X and Windows 7 clients run MATLAB/Octave with pMatlab/gridMatlab and connect via https/WebDAV to the web server. Login and compute nodes run MATLAB/Octave with pMatlab/gridMatlab over the Sun Grid Engine (SGE) scheduler and cluster manager, with Lustre 1.8.1 and local file systems for storage, a grsecurity-patched Linux OS (2.6.27.10 kernel), and the hardware/network layer.]
User Desktops
• Windows 7 and Mac OS X supported
• Portal connection options: WebDAV over https (port 443)
Cluster Management: LL-modified Rocks 5.2
Cluster Parallel File System: Lustre 1.8.1
Scheduler: Sun Grid Engine (SGE)
Login and Compute Nodes (15 GB image size)
• Last 5 versions of MATLAB, Octave, pMatlab, gridMatlab
• lammpi, mpich, mpich2, mvapich, openmpi
Prototype Architecture
Secure Portal Technology
[Diagram: client systems, web server, authentication, Portal Watcher, scheduler, grid, and storage, with numbered arrows 1-13 matching the steps below.]
1. Access Secure Portal
2. CAC authentication requested
3. Provide CAC with PIN
4. Credential approved
5. Map user's home directory
6. Submit a job with a protocol file
7. Portal Watcher gets notified
8. Read and parse the job description in XML
9. Send the job to the scheduler via DRMAA
10. Job scheduled and dispatched
11. Job ID returned in a protocol file
12. Job ID displayed on the client system
13. Output generated and stored
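The slides do not show the protocol file itself; the following is a hypothetical sketch (element and attribute names invented for illustration) of the kind of XML job description parsed in step 8:

<!-- hypothetical job protocol file; the actual format is not shown -->
<job user="jdoe" action="launch">
  <application>pMatlab</application>
  <script>my_sim.m</script>
  <nprocs>16</nprocs>
</job>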
Outline
• Introduction
• Design Overview
• Technologies
  – Key Components
  – Component Descriptions
• Summary
Prototype Architecture
Key Components
[Diagram: the secure portal architecture annotated with four key components: (1) CAC-Enabled Apache WebDAV Server, (2) Linux File System Watcher, (3) gridMatlab for Portal, (4) grsecurity Kernel Patches.]
Prototype Architecture
CAC-Enabled Apache WebDAV Server
[Secure portal architecture diagram repeated, highlighting the web server.]
• WebDAV provides file system services across HTTP (port 80)
• Apache server authenticates via CAC
• Required significant modification to the Apache web server
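For illustration only (the Laboratory's modified server is not shown in the slides), a stock Apache 2.x configuration combining WebDAV with client-certificate (CAC) authentication would use directives along these lines:

# Illustrative stock-Apache sketch; the actual LL-modified server differs
<VirtualHost *:443>
  SSLEngine on
  SSLCertificateFile    /etc/pki/tls/certs/server.crt
  SSLCertificateKeyFile /etc/pki/tls/private/server.key
  SSLCACertificateFile  /etc/pki/tls/certs/dod-ca-chain.pem
  SSLVerifyClient require          # demand the client certificate on the CAC
  SSLVerifyDepth 3
  <Location /webdav>
    Dav On                         # serve this tree over WebDAV
    SSLRequireSSL
  </Location>
</VirtualHost>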
Prototype Architecture
Linux File System Watcher
[Secure portal architecture diagram repeated, highlighting the Portal Watcher.]
• File access monitor in the Linux kernel (2.6.25+)
• Receives event notifications when file events occur
• Configurable actions based on file name, directory, etc.
• Enables activities to launch jobs, abort jobs, etc.
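As a hypothetical sketch of the idea (the watcher's actual configuration format is not shown in the slides), a rule might bind a file pattern and event to an action:

# Hypothetical watcher rule; format invented for illustration.
# When an XML file is written into a user's job directory,
# hand it to the portal's job-intake handler.
watch  dir=/home/*/jobs  match=*.xml  event=close_write  action=portal_intake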
Prototype Architecture
gridMatlab for Portal
[Secure portal architecture diagram repeated, highlighting the client systems.]
• Launch jobs, abort jobs, etc., by writing files to the WebDAV file system
• Defined rich XML file formats for each action
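A hypothetical sketch of the mechanism (mount point, file name, and XML tags invented for illustration): client-side code triggers an action simply by writing an XML file into the WebDAV-mounted directory, where the File System Watcher picks it up:

% Hypothetical sketch: trigger a job launch by writing an action file
% into the WebDAV-mounted home directory (names are illustrative)
davRoot = '/Volumes/llgrid_home';        % WebDAV mount point on a Mac client
xml = ['<action type="launch">' ...
       '<script>my_sim.m</script>' ...
       '<nprocs>16</nprocs></action>'];
fid = fopen(fullfile(davRoot,'launch_job.xml'),'w');
fprintf(fid,'%s\n',xml);                 % the Watcher sees the new file
fclose(fid);                             %   and submits the job to SGE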
Prototype Architecture
grsecurity Kernel Patches
[Secure portal architecture diagram repeated, highlighting the compute nodes.]
• Role-Based Access Control (RBAC) system
• Users can only view own processes, files, etc.
• Extensive auditing and logging
• Randomization of the stack, library, heap and kernel bases
• Prevention of arbitrary code execution
Speed-up for Example Code 1 running on Lincoln
Laboratory Grid (LLGrid) (Matlab/pMatlab)
[Chart: EEG speedup on LLGrid versus number of processors; values tabulated below.]

Nprocs   Max Time (secs)   Average Time (secs)   Speedup
     1       178972.64             17897.64         1
    10        25247.58             17448.12         7.08
    20        14825.30              8767.93        12.1
    36         7589.20              4832.97        23.6
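The Speedup column is consistent with the ratio of max times, Speedup = MaxTime(1) / MaxTime(Nprocs); for example, 178972.64 / 7589.20 ≈ 23.6. The same relation holds for the table on the next slide.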
Speed-up for Example Code 2 running on Lincoln
Laboratory Grid (LLGrid) (Matlab/pMatlab)
[Chart: SIM code with MATLAB speedup on LLGrid, nreps = 1000; measured speedup plotted against linear speedup out to 200 processors; values tabulated below.]

Nprocs   Max Time (secs)   Average Time (secs)   Speedup
     1      176344.0309          176344.0309        1
    25        6088.6758            5471.0087       29
    50        3245.6632            2699.6459       54
   100        1673.5188            1341.6794      105
   200         803.9898             658.7564      219
Outline
• Introduction
• Design Overview
• Technologies
• Summary
Summary
• DR&E Portal technology enables interactive, on-demand parallel Matlab from DoD desktops
  – Required Zero Footprint LLGrid (LLGridZF)
  – Rolled out in several phases
• Four key technologies
  – CAC-enabled Apache WebDAV Server
  – Linux File System Watcher
  – gridMatlab for Portal
  – grsecurity Kernel Patches
• Performance does not impede user experience
Backups
HPCMP DR&E Portal Prototype Demo
Using LLGridZF (zero footprint)
Steps: account use
[Slides 25-42: demo screenshots, steps 1-18.]
Phases
• Prototype
  – On-demand interactive parallel MATLAB delivered to alpha/beta users
• Phase I
  – On-demand interactive parallel MATLAB delivered to DoD researchers and engineers
• Phase II
  – A suite of on-demand interactive applications and an easy-to-use batch environment delivered to DoD researchers and engineers