Blacklight_PSC - Pittsburgh Supercomputing Center Staff

PSC Blacklight: a Large Hardware-Coherent Shared Memory Resource
In TeraGrid Production Since 1/18/2011
SG-WG Update | Sanielevici | March 18, 2011
© 2010 Pittsburgh Supercomputing Center
Why Shared Memory?
[Concept diagram] Application drivers: graph-based informatics, machine learning, data exploration, statistics, algorithm expression, high-productivity languages, rapid prototyping, interactivity, ISV apps, viz, ...
Goals:
• Enable memory-intensive computation
• Increase users’ productivity
• Change the way we look at data
• Boost scientific output
• Broaden participation
PSC’s Blacklight (SGI Altix® UV 1000)
Programmability + Hardware Acceleration → Productivity
• 2×16 TB of cache-coherent shared memory
– hardware coherency unit: 1 cache line (64 B)
– 16 TB exploits the processor’s full 44-bit physical address space
– ideal for fine-grained shared memory applications, e.g. graph algorithms, sparse matrices
• 32 TB addressable with PGAS languages, MPI, and hybrid approaches
– low latency, high injection rate supports one-sided messaging
– also ideal for fine-grained shared memory applications
• NUMAlink® 5 interconnect
– fat tree topology spanning the full UV system; low latency, high bisection bandwidth
– transparent hardware support for cache-coherent shared memory, message pipelining and transmission, collectives, barriers, and optimization of fine-grained, one-sided communications
– hardware acceleration for PGAS, MPI, gather/scatter, remote atomic memory operations, etc.
• Intel Nehalem-EX processors: 4096 cores (2048 cores per SSI)
– 8 cores per socket, 2 hardware threads per core, 4 flops/clock, 24 MB L3, Turbo Boost, QPI
– 4 memory channels per socket → strong memory bandwidth
– x86 instruction set with SSE 4.2 → excellent portability and ease of use
• SUSE Linux operating system
– supports OpenMP, p-threads, MPI, and PGAS models → high programmer productivity
– supports a huge number of ISV applications → high end-user productivity
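A quick sanity check on the address-space point above: 2^44 bytes = 16 TiB, so one SSI's 16 TB is essentially the full physical address reach of the processor. The hedged C/OpenMP sketch below (array size and schedule are illustrative, not Blacklight defaults) shows the programming consequence: one process allocates a single large array and every thread touches it directly, with no data distribution or messaging.

```c
/* Minimal sketch; compile with OpenMP, e.g. cc -fopenmp. Sizes illustrative. */
#include <stdio.h>
#include <stdlib.h>

int main(void)
{
    /* 2^44 bytes = 16 TiB: the full 44-bit physical address reach. */
    unsigned long long reach = 1ULL << 44;
    printf("44-bit reach = %llu bytes = %.0f TiB\n",
           reach, (double)reach / (1ULL << 40));

    /* Illustrative allocation only; a real job sizes this to the memory
       it has been granted.  The point: one flat allocation, visible to
       every thread, instead of data scattered across MPI ranks. */
    size_t n = 1ULL << 30;                 /* 2^30 doubles = 8 GiB */
    double *a = malloc(n * sizeof *a);
    if (!a) { perror("malloc"); return 1; }

    #pragma omp parallel for schedule(static)
    for (size_t i = 0; i < n; i++)
        a[i] = (double)i;                  /* all threads share 'a' */

    printf("a[n-1] = %.0f\n", a[n - 1]);
    free(a);
    return 0;
}
```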
Programming Models & Languages
• UV supports an extremely broad range of programming models and languages for science, engineering, and computer science
– Parallelism
• Coherent shared memory: OpenMP, POSIX threads (“p-threads”), OpenMPI, q-threads
• Distributed shared memory: UPC, Co-Array Fortran*
• Distributed memory: MPI, Charm++
• Linux OS and standard languages enable users’ domain-specific languages, e.g. NESL
– Languages
• C, C++, Java, UPC, Fortran, Co-Array Fortran*
• R, R-MPI
• Python, Perl, …
→ Rapidly express algorithms that defy distributed-memory implementation.
→ Offer existing codes 16-32 TB of memory and high concurrency.
* pending F2008-compliant compilers
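As one concrete illustration of "algorithms that defy distributed-memory implementation," consider fine-grained random updates to a single huge table, the access pattern of many graph and sparse-matrix codes. The sketch below is not from the slides; it is a minimal OpenMP/C example with illustrative sizes, showing that in shared memory the whole pattern is a loop plus an atomic update, with no explicit communication or data partitioning.

```c
/* Minimal sketch (sizes illustrative): fine-grained random increments to
   one big shared table -- easy in shared memory, painful to express with
   explicit message passing.  Compile with OpenMP, e.g. cc -fopenmp. */
#include <stdio.h>
#include <stdlib.h>
#include <omp.h>

int main(void)
{
    size_t bins = 1ULL << 28;                     /* ~268M counters, ~2 GB */
    long long *count = calloc(bins, sizeof *count);
    if (!count) { perror("calloc"); return 1; }

    #pragma omp parallel
    {
        unsigned seed = 12345u + (unsigned)omp_get_thread_num();
        #pragma omp for
        for (long long i = 0; i < 10000000LL; i++) {
            /* every thread scatters updates across the whole shared table */
            size_t b = (((size_t)rand_r(&seed) << 16) ^ (size_t)rand_r(&seed)) % bins;
            #pragma omp atomic
            count[b]++;
        }
    }

    printf("count[0] = %lld, count[bins-1] = %lld\n", count[0], count[bins - 1]);
    free(count);
    return 0;
}
```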
ccNUMA memory (a brief review)
• ccNUMA: cache-coherent non-uniform memory access
• Memory is organized into a non-uniform hierarchy, where each level takes longer to access:
– registers: 1 clock
– L1 cache, ~32 kB per core: ~4 clocks
– L2 cache, ~256-512 kB per core: ~11 clocks
– L3 cache, ~1-3 MB per core, shared between cores: ~40 clocks
– DRAM attached to a processor (“socket”) (reach: 1 socket): O(200) clocks
– DRAM attached to a neighboring processor on the node (reach: ~2-4 sockets): O(200) clocks
– DRAM attached to processors on other nodes (reach: many sockets): O(1500) clocks
Cache coherency protocols ensure that data is maintained consistently across all levels of the memory hierarchy. The unit of consistency should match that of the processor, i.e. one cache line. Hardware support is required to maintain memory consistency at acceptable speeds.
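On a ccNUMA machine the usual portable way to keep data in the faster, nearby DRAM rows of the hierarchy above is first-touch placement: initialize an array with the same thread schedule that will later compute on it, so the OS maps each page next to the thread that first writes it. A minimal OpenMP/C sketch of that idiom follows (array size is illustrative; this is a general ccNUMA technique, not Blacklight-specific code).

```c
/* First-touch placement on a ccNUMA system: initialize with the same
   static schedule later used for computation, so pages land in DRAM
   near the threads that will use them.  Compile with OpenMP. */
#include <stdio.h>
#include <stdlib.h>

int main(void)
{
    size_t n = 1ULL << 27;                  /* ~1 GiB of doubles; illustrative */
    double *x = malloc(n * sizeof *x);
    if (!x) { perror("malloc"); return 1; }

    /* Parallel first touch (NOT a serial memset by the master thread). */
    #pragma omp parallel for schedule(static)
    for (size_t i = 0; i < n; i++)
        x[i] = 0.0;

    /* The compute phase reuses the same schedule, so each thread mostly
       reads and writes locally placed pages (the O(200)-clock DRAM rows
       above rather than the O(1500)-clock remote ones). */
    double sum = 0.0;
    #pragma omp parallel for schedule(static) reduction(+:sum)
    for (size_t i = 0; i < n; i++) {
        x[i] += 1.0;
        sum  += x[i];
    }

    printf("sum = %.0f (expected %zu)\n", sum, n);
    free(x);
    return 0;
}
```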
Blacklight Architecture: Blade
[Blade diagram] Each blade is a “node pair”: two nodes, each consisting of a UV Hub linked by QPI to two Intel Nehalem EX-8 sockets, each socket with 64 GB of RAM; the UV Hubs connect to the NUMAlink-5 (NL5) fabric.
Topology
• fat tree, spanning all 4096 cores
Per SSI:
• 256 sockets
• 2048 cores
• 16 TB
• hardware-enabled coherent shared memory
Full system:
• 512 sockets
• 4096 cores
• 32 TB
• PGAS, MPI, or hybrid parallelism
I/O and Grid
• /bessemer
– PSC’s Center-wide Lustre filesystem
• $SCRATCH: Zest-enabled
– high-efficiency scalability (designed for O(10^6) cores), low-cost commodity components, lightweight software layers, end-to-end parallelism, client-side caching and software parity, and a unique model of load-balancing outgoing I/O onto high-speed intermediate storage followed by asynchronous reconstruction to a third-party parallel file system
P. Nowoczynski, N. T. B. Stone, J. Yanovich, and J. Sommerfield, "Zest: Checkpoint Storage System for Large Supercomputers," Petascale Data Storage Workshop ’08.
http://www.pdsi-scidac.org/events/PDSW08/resources/papers/Nowoczynski_Zest_paper_PDSW08.pdf
• Gateway ready: Gram5, GridFTP, comshell, Lustre-WAN…
Memory-Intensive Analysis Use Cases
• Algorithm Expression
– Implement algorithms and analyses, e.g. graph-theoretical, for which
distributed-memory implementations have been elusive or impractical.
– Enable rapid, innovative analyses of complex networks.
• Interactive Analysis of Large Datasets
– Example: fit the whole ClueWeb09 corpus into RAM to enable
development of rapid machine-learning algorithms for inferring
relationships.
– Foster entirely new ways of exploring large datasets, with interactive queries and deeper analyses limited only by the community’s imagination.
User Productivity Use Cases
• Rapid Prototyping
– Rapid development of algorithms for large-scale data analysis
– Rapid development of “one-off” analyses
– Enable creativity and exploration of ideas
• Familiar Programming Languages
– Java, R, Octave, etc.
– Leverage tools that scientists, engineers, and computer scientists already know and use. Lower the barrier to using HPC.
• ISV Applications
– ADINA, Gaussian, VASP, …
• Vast memory accessible from even a modest number of cores
Data crisis: genomics
• DNA sequencing machine
throughput increasing at a rate
of 5x per year
• Hundreds of petabytes of
data will be produced in the
next few years
• Moving and analyzing these
data will be the major
bottleneck in this field
http://www.illumina.com/systems/hiseq_2000.ilmn
Genomics analysis: two basic flavors
• Loosely-coupled problems
Sequence alignment: Read many short DNA sequences from
disk and map to a reference genome
– Lots of disk I/O
– Fits well with MapReduce framework
• Tightly-coupled problems
De novo assembly: Assemble a complete genome from short
genome fragments generated by sequencers
– Primarily a large graph problem
– Works best with a lot of shared memory
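To make the "large graph problem" concrete: de novo assemblers typically build a de Bruijn-style graph whose nodes are k-mers, and for large genomes that graph must be memory-resident. The following is a deliberately simplified, serial C sketch of counting k-mers into one in-memory hash table; it is not any particular assembler's code, and the reads, k, and table size are illustrative stand-ins.

```c
#include <stdio.h>
#include <stdlib.h>
#include <string.h>

#define K 5                        /* real assemblers use k ~ 21-63 */
#define TABLE (1u << 20)           /* illustrative; real tables need TBs of RAM */

struct entry { char kmer[K + 1]; unsigned count; };
static struct entry table[TABLE];

/* Open-addressing insert/increment keyed on the k-mer string. */
static void add_kmer(const char *kmer)
{
    unsigned h = 5381;
    for (int i = 0; i < K; i++) h = h * 33u + (unsigned char)kmer[i];
    for (unsigned probe = 0; probe < TABLE; probe++) {
        struct entry *e = &table[(h + probe) % TABLE];
        if (e->count == 0) {                 /* empty slot: new graph node */
            memcpy(e->kmer, kmer, K);
            e->kmer[K] = '\0';
            e->count = 1;
            return;
        }
        if (memcmp(e->kmer, kmer, K) == 0) { /* existing node: bump count */
            e->count++;
            return;
        }
    }
}

int main(void)
{
    /* Stand-ins for short reads coming off a sequencer. */
    const char *reads[] = { "ACGTACGTGG", "CGTACGTGGA", "GTACGTGGAT" };
    for (size_t r = 0; r < sizeof reads / sizeof reads[0]; r++) {
        size_t len = strlen(reads[r]);
        for (size_t i = 0; i + K <= len; i++)
            add_kmer(reads[r] + i);          /* one node per k-mer */
    }

    /* k-mers seen in more than one read are the overlaps an assembler
       stitches together; here we just report them. */
    for (unsigned i = 0; i < TABLE; i++)
        if (table[i].count > 1)
            printf("%s x%u\n", table[i].kmer, table[i].count);
    return 0;
}
```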
PSC Blacklight:
EARLY illumination
Sequence Assembly of Sorghum
Sarah Young and Steve Rounsley (University of Arizona)
• Tested various genomes, assembly
codes, and parameters to determine
best options for plant genome assemblies
• Performed assembly of a
600+ Mbase genome of a member
of the Sorghum genus on Blacklight
using ABySS.
• Sequence assemblies of this type will be key to the iPlant Collaborative.
Larger plant assemblies are planned in the future.
What can a machine with 16 TB shared memory
do for genomics?
Exploring efficient solutions to both loosely- and tightly-coupled problems:
• Sequence alignment:
– Experimenting with use of ramdisk to alleviate I/O bottlenecks and increase performance
– Configuring Hadoop to work on large shared memory system
– Increasing productivity by allowing researchers to use simple, familiar MapReduce framework
• De novo assembly of huge genomes:
– Human genome with 3 gigabases (Gb) of DNA typically requires 256-512 GB RAM to assemble
– Cancer research requires hundreds of these assemblies
– Certain important species, e.g. loblolly pine, have genomes ~10x larger than the human genome, requiring terabytes of RAM to assemble
– Metagenomics (sampling unknown microbial populations): no theoretical limit to how many base pairs one might assemble together (100x more than human assembly!)
Pinus taeda (Loblolly Pine)
Thermodynamic Stability of Quasicrystals
PSC Blacklight:
EARLY illumination
Max Hutchinson and Mike Widom (Carnegie Mellon University)
• A leading proposal for the thermodynamic stability of
quasicrystals depends on the configurational entropy
associated with tile flips (“phason flips”).
• Exploring the entropy of symmetry-broken structures
whose perimeter is an irregular octagon will allow an
approximate theory of quasicrystal entropy to be
developed, replacing the actual discrete tilings with a
continuum system modeled as a dilute gas of
interacting tiles.
• Quasicrystals are modeled by rhombic/octagonal tilings, for which enumeration exposes thermodynamic properties.
• Breadth-first search over a graph that grows super-exponentially with system size; very little locality.
• Nodes must carry arbitrary-precision integers, e.g. T(1) = 8 while T(7) = 10042431607269542604521005988830015956735912072.
[Figure: enumeration graph for the 3,3,3,3 quasicrystal]
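The enumeration itself is, at heart, a breadth-first search whose frontier and visited set must live in one memory space. Below is a generic, hedged BFS sketch in C: a toy explicit graph stands in for the tiling "flip" graph, and a plain unsigned long stands in for the counters, which in the real problem need an arbitrary-precision library such as GMP.

```c
#include <stdio.h>
#include <stdlib.h>

/* Toy graph: adjacency lists terminated by -1.  In the tiling application
   the graph is generated on the fly and its node count grows
   super-exponentially, which is why a single large visited/frontier
   structure in shared memory is so convenient. */
#define N 6
static const int adj[N][N] = {
    {1, 2, -1}, {0, 3, -1}, {0, 3, 4, -1},
    {1, 2, 5, -1}, {2, 5, -1}, {3, 4, -1}
};

int main(void)
{
    int *queue = malloc(N * sizeof *queue);
    char *visited = calloc(N, 1);
    unsigned long per_level[N] = {0};   /* real counts need arbitrary precision */

    /* Standard BFS from node 0, counting nodes reached per level. */
    int head = 0, tail = 0, level_of[N];
    queue[tail++] = 0; visited[0] = 1; level_of[0] = 0;
    while (head < tail) {
        int u = queue[head++];
        per_level[level_of[u]]++;
        for (int k = 0; adj[u][k] != -1; k++) {
            int v = adj[u][k];
            if (!visited[v]) {
                visited[v] = 1;
                level_of[v] = level_of[u] + 1;
                queue[tail++] = v;
            }
        }
    }

    for (int l = 0; l < N && per_level[l]; l++)
        printf("level %d: %lu nodes\n", l, per_level[l]);
    free(queue); free(visited);
    return 0;
}
```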
Performance Profiling of Million-core Runs
PSC Blacklight:
EARLY illumination
Sameer Shende (ParaTools and University of Oregon)
• ~500 GB of shared memory successfully applied to the visual analysis of very large
scale performance profiles, using TAU.
• Profile data: synthetic million-core dataset assembled from 32k-core LS3DF runs on
ANL’s BG/P.
[Screenshots: metadata information about the 1-million-core profile datasets in the TAU ParaProf Manager window; execution time breakdown of LS3DF subroutines over all MPI ranks; LS3DF routine profiling data on rank 1,048,575; histogram of MPI_Barrier showing the distribution of the routine’s calls over the execution time.]
Summary
• On PSC’s Blacklight resource, hardware-supported cache-coherent
shared memory is enabling new data-intensive and memory-intensive analytics and simulations. In particular, Blacklight is:
– enabling new kinds of analyses on large data,
– bringing new communities into HPC, and
– increasing the productivity of both “traditional HPC” and new users.
• PSC is actively working with the research community to bring this
new analysis capability to diverse fields of research. This will entail
development of data-intensive workflows, new algorithms, scaling
and performance engineering, and software infrastructure.
Interested? Contact blood@psc.edu, sergiu@psc.edu