How can I use a 1000 cores? P.J. Hasnip

Introduction
dCSE Project
Summary
How can I use a 1000 cores?
Lessons in Parallel Scaling from Castep
P.J. Hasnip (pjh503@york.ac.uk)
Department of Physics
University of York
HECToR User Group Meeting
Outline
1. Introduction
   What is Castep?
   How does Castep work?
   Castep in Parallel
2. dCSE Project
   dCSE project plan
   Castep development
   Results
Castep is...
A general-purpose ‘first principles’ atomistic modelling program
Based on density functional theory
Free to UK academics
Installed on HECToR
Castep can...
Compute the electronic density
Determine the groundstate atomic configuration and cell
Simulate molecular dynamics (path-integrals, variable cell)
Calculate band-structures and density of states
Compute various spectra (optical, IR, Raman, NMR, XANES...)
plus linear response, population analysis, ELF, etc.
Castep is written using...
Fortran 90
BLAS/LAPACK for linear algebra
FFT libraries (where available)
MPI for parallel communication
Portable and well optimised (achieves 37-40% of peak on HECToR).
Castep Basics
Castep solves a set of Schrödinger equations,
\[ H_k[n]\,\psi_{bk}(\mathbf{r}) = \epsilon_{bk}\,\psi_{bk}(\mathbf{r}) \]
where $n$ is the electronic density and $\{\psi_{bk}\}$ are the bands:
\[ n(\mathbf{r}) = \sum_{bk} 2 w_k\,|\psi_{bk}(\mathbf{r})|^2 \]
Self-consistency
\[ H_k[n]\,\psi_{bk}(\mathbf{r}) = \epsilon_{bk}\,\psi_{bk}(\mathbf{r}) \]
$H_k$ depends on $n(\mathbf{r})$
$n(\mathbf{r})$ depends on $\{\psi_{bk}\}$
We need to solve this eigenvalue equation iteratively until we have self-consistency.
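The self-consistency cycle can be illustrated with a toy model. The sketch below is not Castep's algorithm: it assumes a hypothetical two-site system with hopping `t`, an external potential `v_ext`, and a Hamiltonian whose diagonal depends on the density through a made-up coupling `u`. Simple linear density mixing (`mix`) is also an assumption chosen for brevity.

```python
import math


def lowest_state(a, b, t):
    """Lowest eigenpair of the 2x2 matrix [[a, -t], [-t, b]] with t > 0."""
    eps = 0.5 * (a + b) - math.sqrt((0.5 * (a - b)) ** 2 + t * t)
    v1, v2 = t, a - eps  # satisfies (a - eps) * v1 - t * v2 = 0
    norm = math.hypot(v1, v2)
    return eps, (v1 / norm, v2 / norm)


def scf(v_ext=(0.0, 1.0), t=1.0, u=1.0, mix=0.5, tol=1e-10, max_iter=200):
    """Iterate H[n] psi = eps psi until the density n stops changing."""
    n = [1.0, 1.0]  # initial guess: two electrons shared equally
    for iteration in range(1, max_iter + 1):
        # Build H from the current density (hypothetical model: V = v_ext + u*n).
        a = v_ext[0] + u * n[0]
        b = v_ext[1] + u * n[1]
        eps, psi = lowest_state(a, b, t)
        # New density from the doubly occupied lowest band.
        n_out = [2.0 * c * c for c in psi]
        residual = max(abs(o - i) for o, i in zip(n_out, n))
        if residual < tol:
            return eps, n, iteration
        # Linear mixing of input and output densities.
        n = [(1.0 - mix) * i + mix * o for i, o in zip(n, n_out)]
    raise RuntimeError("SCF did not converge")


if __name__ == "__main__":
    eps, n, iterations = scf()
    print(f"converged in {iterations} iterations: eps = {eps:.6f}, n = {n}")
```

Each pass solves the eigenvalue problem for a fixed density, rebuilds the density from the resulting band, and repeats until input and output densities agree.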
How Castep Works
A useful basis set
We expand $\psi_{bk}$ in a plane-wave (Fourier) basis,
\[ \psi_{bk}(\mathbf{r}) = \sum_{\mathbf{G}} c_{Gbk}\, e^{i(\mathbf{G}+\mathbf{k})\cdot\mathbf{r}} \]
If we increase the size of our simulation system:
The size of the smallest G-vector decreases
The number of G-vectors, $N_G$, increases
On HECToR, $N_G$ might be O(100,000)
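A minimal sketch of why the basis grows with system size, assuming a simple cubic cell of side `cell_length` and a fixed cut-off `g_max` on |G| (both names, and the units, are illustrative). The G-vectors are then integer multiples of 2π/L, and we simply count those inside the cut-off sphere.

```python
import math


def count_plane_waves(cell_length, g_max):
    """Return (smallest non-zero |G|, number of G-vectors with |G| <= g_max)."""
    b = 2.0 * math.pi / cell_length  # reciprocal lattice spacing of a cubic cell
    m_max = int(g_max / b)
    count = 0
    for i in range(-m_max, m_max + 1):
        for j in range(-m_max, m_max + 1):
            for k in range(-m_max, m_max + 1):
                if b * math.sqrt(i * i + j * j + k * k) <= g_max:
                    count += 1
    return b, count


if __name__ == "__main__":
    for length in (5.0, 10.0):
        smallest_g, n_g = count_plane_waves(length, g_max=4.0)
        print(f"L = {length}: smallest |G| = {smallest_g:.3f}, N_G = {n_g}")
```

Doubling the cell side halves the smallest G-vector and multiplies the number of G-vectors by roughly eight, i.e. $N_G$ grows in proportion to the cell volume.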
The k-points
The vectors $\{\mathbf{k}\}$ are basically sampling points in the region of space
\[ |\mathbf{k}| < \tfrac{1}{2}\,|\mathbf{G}_{\mathrm{smallest}}| . \]
We increase the sampling density of k-points until our calculation converges.
If we increase the size of our simulation system:
The size of the smallest G-vector decreases
The number of k-points we need decreases
On HECToR, $N_k$ might be O(1)
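The shrinking k-point count can be sketched under a simple assumption: a cubic cell of side `cell_length` sampled on a uniform grid fine enough that neighbouring k-points are no further apart than `target_spacing`. Both names are hypothetical, and symmetry reduction of the grid is ignored.

```python
import math


def kpoint_grid(cell_length, target_spacing):
    """Number of k-points on a uniform grid with the requested spacing."""
    b = 2.0 * math.pi / cell_length  # width of the sampled region along one axis
    per_axis = max(1, math.ceil(b / target_spacing))
    return per_axis ** 3


if __name__ == "__main__":
    for length in (5.0, 10.0, 20.0, 40.0):
        print(f"L = {length}: N_k = {kpoint_grid(length, 0.3)}")
```

As the cell grows, the sampled region of reciprocal space shrinks, so a fixed sampling density needs fewer points, eventually just one.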
Orthogonalisation
We find $\psi_{bk}$ by varying $\{c_{Gbk}\}$ to minimise $\epsilon_{bk}$.
To prevent all the bands heading for the lowest energy one, we explicitly orthogonalise them to each other.
The computational time per k-point scales as
\[ N_G N_b^2 , \]
i.e. cubically for large systems (recall $N_k = O(1)$).
This cost dominates in large calculations.
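To see where the $N_G N_b^2$ scaling comes from, here is a sketch using modified Gram-Schmidt as a stand-in; the slides do not say which orthogonalisation scheme Castep uses, so treat the algorithm choice as an assumption. Each band is a list of $N_G$ complex coefficients, and the counter tallies the multiply-adds spent removing overlaps.

```python
import random


def inner(x, y):
    """Inner product <x|y> over the plane-wave coefficients."""
    return sum(a.conjugate() * b for a, b in zip(x, y))


def gram_schmidt(bands):
    """Orthonormalise the bands; return them with a count of multiply-adds."""
    ortho = []
    operations = 0
    for band in bands:
        v = list(band)
        for u in ortho:
            overlap = inner(u, v)                              # N_G multiply-adds
            v = [vi - overlap * ui for vi, ui in zip(v, u)]    # N_G multiply-adds
            operations += 2 * len(v)
        norm = abs(inner(v, v)) ** 0.5
        ortho.append([vi / norm for vi in v])
    return ortho, operations


def random_bands(n_bands, n_g, seed=0):
    """Hypothetical test data: random complex coefficients."""
    rng = random.Random(seed)
    return [[complex(rng.uniform(-1, 1), rng.uniform(-1, 1)) for _ in range(n_g)]
            for _ in range(n_bands)]


if __name__ == "__main__":
    bands, operations = gram_schmidt(random_bands(n_bands=6, n_g=50))
    print(f"multiply-adds for projections: {operations}")
```

Every pair of bands costs two passes over the $N_G$ coefficients, and there are $N_b(N_b-1)/2$ pairs, which gives the $N_G N_b^2$ behaviour. Since both $N_G$ and $N_b$ grow with system size, the cost is cubic.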
Fourier Transforms
H contains some contributions we apply in G-space, and some in r-space.
We use FFTs to switch efficiently between r- and G-space where needed. The computational time per k-point scales as
\[ N_G \ln(N_G)\, N_b , \]
i.e. approximately quadratically for large systems.
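A one-dimensional sketch of the idea, under stated assumptions: the kinetic term is taken as $\tfrac{1}{2}G^2$ (atomic units) and applied in G-space, a local potential `v_r` is applied in r-space, and a small textbook radix-2 FFT stands in for the optimised FFT libraries a real code would call. The function names are illustrative.

```python
import cmath
import math


def fft(x, inverse=False):
    """Unnormalised radix-2 FFT; len(x) must be a power of two."""
    n = len(x)
    if n == 1:
        return list(x)
    sign = 1.0 if inverse else -1.0
    even = fft(x[0::2], inverse)
    odd = fft(x[1::2], inverse)
    out = [0j] * n
    for k in range(n // 2):
        twiddle = cmath.exp(sign * 2j * math.pi * k / n) * odd[k]
        out[k] = even[k] + twiddle
        out[k + n // 2] = even[k] - twiddle
    return out


def apply_hamiltonian(coeffs, v_r, length):
    """Apply H = -1/2 d^2/dx^2 + V(x) to one band given in G-space."""
    n = len(coeffs)
    assert n & (n - 1) == 0, "grid size must be a power of two"
    # Kinetic energy is diagonal in G-space.
    kinetic = []
    for k, c in enumerate(coeffs):
        m = k if k <= n // 2 else k - n
        g = 2.0 * math.pi * m / length
        kinetic.append(0.5 * g * g * c)
    # The local potential is diagonal in r-space: G -> r, multiply, r -> G.
    psi_r = fft(coeffs, inverse=True)
    v_psi_r = [v * p for v, p in zip(v_r, psi_r)]
    v_psi_g = [value / n for value in fft(v_psi_r)]
    return [t + v for t, v in zip(kinetic, v_psi_g)]
```

Each band needs two transforms of cost $N_G \ln N_G$, and there are $N_b$ bands, which matches the scaling quoted above.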
Parallel Efficiency
We want to know how effective running in parallel is.
Comparing to the serial time can cause problems because:
Efficient parallel calculations may need different algorithms
We cannot run large calculations in serial
Relative speed-up
Instead of the absolute parallel performance, we shall look at the relative speed-up when we double the number of cores:
\[ E(c) = \frac{T(c/2)}{T(c)} . \]
$E(c) = 2$ indicates perfect scaling. The HPCx Capability Incentive target is 1.7.
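As a small illustration, the relative speed-up can be computed directly from a table of run times. The timings below are invented purely for the example; only the definition of $E(c)$ and the 1.7 target come from the slides.

```python
def relative_speedup(timings):
    """Map each core count c to E(c) = T(c/2) / T(c), where both are known."""
    return {c: timings[c // 2] / timings[c]
            for c in sorted(timings)
            if c % 2 == 0 and c // 2 in timings}


if __name__ == "__main__":
    # Hypothetical wall-clock times in seconds, keyed by core count.
    timings = {1: 100.0, 2: 50.0, 4: 30.0}
    for cores, e in relative_speedup(timings).items():
        verdict = "meets" if e >= 1.7 else "misses"
        print(f"E({cores}) = {e:.2f} ({verdict} the 1.7 target)")
```

This measure needs no serial reference run, only timings at consecutive doublings of the core count.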
What hampers scaling?
Serial parts of the code
One-to-all communications (linear)
All-to-all communications (quadratic)
\[ T(c) = S + \frac{P}{c} + Lc + Qc^2 \]
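The timing model can be explored numerically. In the sketch below the coefficients are hypothetical values chosen only to show the shape of the curve: `serial` is $S$, `parallel` is $P$, `linear` is $L$ and `quadratic` is $Q$.

```python
def run_time(cores, serial, parallel, linear, quadratic):
    """T(c) = S + P/c + L*c + Q*c^2."""
    return serial + parallel / cores + linear * cores + quadratic * cores ** 2


def relative_speedup(cores, **model):
    """E(c) = T(c/2) / T(c) for the model above."""
    return run_time(cores / 2, **model) / run_time(cores, **model)


def largest_efficient_core_count(target=1.7, **model):
    """Keep doubling the core count while E(c) stays at or above the target."""
    cores = 1
    while relative_speedup(2 * cores, **model) >= target:
        cores *= 2
    return cores


if __name__ == "__main__":
    # Hypothetical coefficients, in arbitrary time units.
    model = dict(serial=1.0, parallel=1000.0, linear=0.01, quadratic=1e-4)
    for cores in (2, 8, 32, 64, 128):
        print(f"E({cores}) = {relative_speedup(cores, **model):.3f}")
    print("largest efficient core count:", largest_efficient_core_count(**model))
```

With these made-up numbers the speed-up per doubling stays close to 2 at small core counts and falls below the 1.7 target once the communication terms become comparable to $P/c$.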
k-point parallelism
\[ H_k[n]\,\psi_{bk}(\mathbf{r}) = \epsilon_{bk}\,\psi_{bk}(\mathbf{r}) \]
The eigenvalue equations for different k-points are only weakly coupled.
Distribute data and workload by k-point
Gives near-perfect scaling
Large calculations only need O(1) k-points ⇒ run out of them very quickly!
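A minimal sketch of why k-point parallelism saturates, assuming a simple round-robin assignment of k-points to cores (the assignment rule is an illustration, not necessarily what Castep does) and equal work per k-point.

```python
import math


def kpoints_for_rank(n_kpoints, cores, rank):
    """Round-robin: rank r handles k-points r, r + cores, r + 2*cores, ..."""
    return list(range(rank, n_kpoints, cores))


def kpoint_speedup(n_kpoints, cores):
    """Speed-up is limited by the busiest core, which holds ceil(N_k / c) k-points."""
    busiest = math.ceil(n_kpoints / cores)
    return n_kpoints / busiest


if __name__ == "__main__":
    for cores in (2, 4, 8, 16):
        print(f"{cores} cores: speed-up {kpoint_speedup(8, cores):.1f}")
```

With 8 k-points the speed-up stops improving at 8 cores: any further cores have no k-point to work on.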
TiN Benchmark
The TiN simulation is a small standard benchmark
33 atoms
8 k-points
164 bands
10,972 G-vectors
[Figure: TiN k-point Parallel]
G-vector parallelism
\[ \psi_{bk}(\mathbf{r}) = \sum_{\mathbf{G}} c_{Gbk}\, e^{i(\mathbf{G}+\mathbf{k})\cdot\mathbf{r}} \]
Distribute the data and workload over the G-vectors
$N_G$ is large, and increases with system size
Fourier transforms require all-to-all communications
Need extra data reorganisation, but good scaling for moderate numbers of cores
Eventually the all-to-all cost dominates
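The trade-off can be sketched with two simple counts, assuming a plain block distribution of G-vectors (an illustrative choice) and counting one point-to-point message per ordered pair of cores in an all-to-all exchange.

```python
def block_sizes(n_items, cores):
    """Split n_items as evenly as possible; sizes differ by at most one."""
    base, extra = divmod(n_items, cores)
    return [base + (1 if rank < extra else 0) for rank in range(cores)]


def all_to_all_messages(cores):
    """Every core sends one message to every other core."""
    return cores * (cores - 1)


if __name__ == "__main__":
    n_g = 10_972  # G-vectors in the TiN benchmark
    for cores in (8, 32, 128):
        largest = max(block_sizes(n_g, cores))
        print(f"{cores} cores: up to {largest} G-vectors each, "
              f"{all_to_all_messages(cores)} messages per all-to-all")
```

The work per core falls as $1/c$ while the number of messages rises as $c^2$, so at some core count communication overtakes computation.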
[Figure: TiN G-vector Parallel]
Lesson 1 - Parallel Hierarchies
k-point parallelism near-perfect to 8 cores
G-vector parallelism good to 16 or 32 cores
We allow both simultaneously
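One way to picture the hierarchy is to split the cores into k-point groups, each of which shares its k-points' G-vectors among its members. The rule below (use the greatest common divisor of the core count and the k-point count as the number of groups) is a hypothetical heuristic for illustration, not Castep's documented strategy.

```python
import math


def split_cores(cores, n_kpoints):
    """Return (number of k-point groups, cores per group sharing G-vectors)."""
    k_groups = math.gcd(cores, n_kpoints)
    return k_groups, cores // k_groups


if __name__ == "__main__":
    for cores in (8, 64, 128):
        k_groups, g_cores = split_cores(cores, n_kpoints=8)
        print(f"{cores} cores -> {k_groups} k-point groups x {g_cores} cores over G-vectors")
```

For the 8 k-point TiN benchmark, 64 cores would become 8 groups of 8: the near-perfect k-point level is used first, and the G-vector level only has to scale to 8 cores within each group.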
[Figure: TiN in Mixed k- and G-Parallel]
Lesson 2 - Reduce All-to-All Communications
Ultimate scaling dominated by all-to-all
Castep already optimised to minimise FFTs
Split into two phases:
local core-to-core within each node
node-to-node
Devised and implemented by Keith Refson (RAL) and Martin Plummer (STFC)
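A rough message-counting sketch of why the split helps. It assumes, as one plausible reading of the two phases, that data is first combined among the cores of each node and that each pair of nodes then exchanges a single aggregated message. The parameter `cores_per_node` is hypothetical, and the real implementation may differ in detail.

```python
def inter_node_messages(cores, cores_per_node):
    """Return (flat, two_phase) counts of messages that cross between nodes."""
    if cores % cores_per_node != 0:
        raise ValueError("cores must be a multiple of cores_per_node")
    nodes = cores // cores_per_node
    # Flat all-to-all: each core messages every core on every other node.
    flat = cores * (cores - cores_per_node)
    # Two-phase: one aggregated message per ordered pair of nodes.
    two_phase = nodes * (nodes - 1)
    return flat, two_phase


if __name__ == "__main__":
    flat, two_phase = inter_node_messages(cores=64, cores_per_node=4)
    print(f"flat: {flat} inter-node messages, two-phase: {two_phase}")
```

In this model the number of inter-node messages drops by a factor of `cores_per_node` squared. The same total data still moves, but in fewer, larger messages, which reduces the per-message latency cost.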
[Figure: TiN with all-to-all optimisations]
Al2O3-3x3 benchmark
The TiN benchmark is quite small. A larger standard benchmark system is an Al2O3 slab (3×3 surface):
270 atoms
2 k-points
778 bands (1296 electrons)
88,184 G-vectors
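To put the two benchmarks side by side, the sketch below plugs their sizes into the per-k-point scaling expressions quoted earlier, $N_G N_b^2$ for orthogonalisation and $N_G \ln(N_G) N_b$ for the FFTs. Prefactors are ignored, so only the ratios are meaningful.

```python
import math

# Benchmark sizes as quoted on the slides.
BENCHMARKS = {
    "TiN": {"k_points": 8, "bands": 164, "g_vectors": 10_972},
    "Al2O3": {"k_points": 2, "bands": 778, "g_vectors": 88_184},
}


def orthogonalisation_cost(system):
    """N_G * N_b^2, per k-point, up to a constant factor."""
    return system["g_vectors"] * system["bands"] ** 2


def fft_cost(system):
    """N_G * ln(N_G) * N_b, per k-point, up to a constant factor."""
    n_g = system["g_vectors"]
    return n_g * math.log(n_g) * system["bands"]


def cost_ratio(cost, big="Al2O3", small="TiN"):
    """How many times more expensive the big system is, per k-point."""
    return cost(BENCHMARKS[big]) / cost(BENCHMARKS[small])


if __name__ == "__main__":
    print(f"orthogonalisation ratio: {cost_ratio(orthogonalisation_cost):.1f}")
    print(f"FFT ratio:               {cost_ratio(fft_cost):.1f}")
```

Per k-point, the orthogonalisation estimate for Al2O3 is roughly 180 times that for TiN, while the FFT estimate grows far less, which is consistent with orthogonalisation dominating large calculations.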
[Figure: Al2O3-3x3 parallel scaling]
dCSE project plan
\[ \psi_{bk}(\mathbf{r}) = \sum_{\mathbf{G}} c_{Gbk}\, e^{i(\mathbf{G}+\mathbf{k})\cdot\mathbf{r}} \]
We have distributed $c_{Gbk}$ over G and k
What about over b?
8-month dCSE project to:
Investigate Castep performance on HECToR
Implement band-parallelism
Parallelise costly non-distributed operations
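A sketch of what distributing over the band index implies for a band-coupling operation such as the overlap matrix $S_{ab} = \langle\psi_a|\psi_b\rangle$. The bands are split into `n_groups` groups (a hypothetical parameter), and $S$ is assembled block by block. This runs serially and only mimics the data layout, since the slides do not describe the actual band-parallel implementation.

```python
import random


def overlap(bands_a, bands_b):
    """Block of the overlap matrix between two sets of bands."""
    return [[sum(x.conjugate() * y for x, y in zip(a, b)) for b in bands_b]
            for a in bands_a]


def blockwise_overlap(bands, n_groups):
    """Assemble the full overlap matrix from per-group blocks."""
    size = len(bands) // n_groups  # assumes n_groups divides the band count
    groups = [bands[g * size:(g + 1) * size] for g in range(n_groups)]
    full = [[0j] * len(bands) for _ in bands]
    for gi, rows in enumerate(groups):
        for gj, cols in enumerate(groups):
            # In a band-parallel run, this block needs data from groups gi and gj.
            block = overlap(rows, cols)
            for i, row in enumerate(block):
                for j, value in enumerate(row):
                    full[gi * size + i][gj * size + j] = value
    return full


def random_bands(n_bands, n_g, seed=1):
    """Hypothetical test data: random complex coefficients."""
    rng = random.Random(seed)
    return [[complex(rng.uniform(-1, 1), rng.uniform(-1, 1)) for _ in range(n_g)]
            for _ in range(n_bands)]
```

Diagonal blocks use only one group's bands, but every off-diagonal block couples two groups, so band-parallelism introduces communication between band groups wherever bands are mixed.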
Castep development
Castep is quite a large program...
334,395 lines of Fortran 90
54 modules
Already has three levels of parallelism
... but well structured and commented.
dCSE work
Basic band parallelism changed 3 modules
communication module
wavefunction module
electronic groundstate module
Optimisation changed 2 further modules
Extra parallelisation changed 2 modules
communication module
algorithm module
A surprise
We know FFTs are expensive on lots of cores, and we also
know orthogonalisation dominates large calculations.
We profiled a reasonably large parallel calculation:
30% of time in FFTs
30% of time in ZGEMM calls from a single subroutine ... but not the
orthogonalisation!
Lesson 3 - Profile!
It is important to check your assumptions about performance.
HECToR has the Cray Performance Analysis Tools (PAT)
installed for profiling.
Easy to profile BLAS and MPI calls
Good for early investigations
Larger calculations generated 90 GB of profiling data per run!
Better to use built-in timing if available (Castep’s Trace)
Cray PAT available as a library
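The point about built-in timing can be illustrated with a small accumulator that records wall-clock time per named section. This is a generic sketch and not Castep's Trace module; the class name `SectionTimer` and the section labels are hypothetical.

```python
import time
from collections import defaultdict
from contextlib import contextmanager


class SectionTimer:
    """Accumulate wall-clock time and call counts per named code section."""

    def __init__(self):
        self.totals = defaultdict(float)
        self.calls = defaultdict(int)

    @contextmanager
    def section(self, name):
        start = time.perf_counter()
        try:
            yield
        finally:
            # Record the elapsed time even if the section raised an exception
            self.totals[name] += time.perf_counter() - start
            self.calls[name] += 1

    def report(self):
        """Return (name, calls, seconds, percent) rows, most expensive first."""
        grand = sum(self.totals.values()) or 1.0
        rows = sorted(self.totals.items(), key=lambda kv: kv[1], reverse=True)
        return [(name, self.calls[name], t, 100.0 * t / grand) for name, t in rows]


if __name__ == "__main__":
    timer = SectionTimer()
    for _ in range(3):
        with timer.section("fft"):
            sum(range(200_000))
    with timer.section("zgemm"):
        sum(range(200_000))
    for name, calls, seconds, percent in timer.report():
        print(f"{name:8s} calls={calls:3d} time={seconds:.4f}s share={percent:.1f}%")
```

A timer like this produces a few lines of output per run rather than a large trace file, at the cost of only measuring the sections you chose to label. Nested sections would be counted twice in the percentage column, so the sketch is best used with non-overlapping sections.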
dCSE work
Band-parallelism implemented as an additional level of
parallelism
ScaLAPACK used for parallel matrix
diagonalisation/inversion
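ScaLAPACK distributes matrices in a block-cyclic layout. The sketch below shows the one-dimensional index mapping, assuming 0-based indices and that the first block lives on process 0; ScaLAPACK applies the same rule independently to rows and columns of a two-dimensional process grid. The block size and process counts Castep actually uses are not given in the talk, so the values here are purely illustrative.

```python
def block_cyclic_owner(global_index, block_size, n_procs):
    """Return (process, local_index) for a 0-based global index.

    Blocks of `block_size` consecutive indices are dealt out to the
    processes in round-robin order, starting from process 0.
    """
    block, offset = divmod(global_index, block_size)
    proc = block % n_procs          # which process holds this block
    local_block = block // n_procs  # how many earlier blocks that process holds
    return proc, local_block * block_size + offset


if __name__ == "__main__":
    # Illustrative values: block size 2, 3 processes
    for g in range(8):
        print(g, block_cyclic_owner(g, block_size=2, n_procs=3))
```

Dealing blocks out cyclically, rather than giving each process one contiguous slab, keeps the work balanced as a factorisation progresses through the matrix.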
dCSE results for Al2O3 3×3
dCSE results for Al2O3 3×3 on BlueGene/P
Lesson 4 - Scaling isn’t everything
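One common reading of this lesson is that parallel efficiency and time-to-solution are different quantities. The sketch below computes both for two sets of timings. The timings are invented purely for illustration and are not measurements from the talk.

```python
def speedup(t_ref, t_n):
    """Speed-up of a run taking t_n relative to a reference run taking t_ref."""
    return t_ref / t_n


def parallel_efficiency(t_ref, cores_ref, t_n, cores_n):
    """Fraction of ideal scaling achieved when going from cores_ref to cores_n."""
    return (t_ref * cores_ref) / (t_n * cores_n)


if __name__ == "__main__":
    # Hypothetical timings in seconds, NOT results from the talk
    runs = {
        "code A": {64: 1000.0, 512: 125.0},
        "code B": {64: 400.0, 512: 100.0},
    }
    for name, t in runs.items():
        eff = parallel_efficiency(t[64], 64, t[512], 512)
        print(f"{name}: speed-up {speedup(t[64], t[512]):.1f}x, "
              f"efficiency {eff:.0%}, time on 512 cores {t[512]:.0f} s")
```

In this made-up example code A scales perfectly and code B only reaches 50% efficiency, yet code B finishes sooner on 512 cores, which is the quantity a user actually cares about.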
A real system
So far we’ve only looked at benchmarks; now we’ll look at
something more interesting:
Imidazolium chloride, a room-temperature ionic liquid
408 atoms
1 k-point
662 bands
137,728 G-vectors
Want to run a molecular dynamics simulation
Imidazolium chloride
Imidazolium chloride with band-parallelism
Imidazolium chloride performance
Good scaling to 512 cores even with only 1 k-point
Achieves 1 SCF cycle per minute on 1024 cores
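The rate of one SCF cycle per minute on 1024 cores can be turned into a rough budget for a molecular dynamics run. The sketch below does the arithmetic; the number of MD steps and the number of SCF cycles needed per step are assumptions chosen for illustration, since the talk does not quote them.

```python
def md_cost(n_steps, scf_cycles_per_step, minutes_per_scf=1.0, cores=1024):
    """Return (wall-clock hours, core-hours) for an MD run.

    minutes_per_scf and cores default to the figures quoted on the slide;
    n_steps and scf_cycles_per_step must be supplied by the user.
    """
    wall_hours = n_steps * scf_cycles_per_step * minutes_per_scf / 60.0
    return wall_hours, wall_hours * cores


if __name__ == "__main__":
    # Hypothetical run: 1000 MD steps, 10 SCF cycles per step
    wall, core_hours = md_cost(1000, 10)
    print(f"about {wall:.0f} wall-clock hours, {core_hours:,.0f} core-hours")
```

Even at this rate a long trajectory is a substantial allocation, so the SCF time per cycle matters more here than the scaling curve itself.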
Remaining work
Integrate with latest (4.4) source
Extend to NMR, linear response etc.
Develop ‘band-local’ optimisers to improve scaling
Interesting ‘features’
Some things cropped up that you may find helpful...
Bug with Cray MPI collectives – job killed with error 13
(believed fixed)
Cray LibSci 10.2.1 much better than 10.0.
Portland Group compiler reports incorrect CPU time
Out-of-memory errors not reported to users (yet); the job is simply killed
Acknowledgements
Matt Probert (York)
Keith Refson (RAL) and Martin Plummer (STFC)
Mike Ashworth (STFC)
NAG, especially Guy Robinson, Ian Reid, Edward Smyth,
Sarfraz Nadeem and Phil Ridley.
Summary
Use hierarchical parallelism
Minimise all-to-all communications
Profile the code
Remember good scaling isn’t necessarily good
performance
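The first two points are linked: splitting the cores into independent groups shrinks every all-to-all. The sketch below counts point-to-point messages for a naive all-to-all in which every process sends one message to every other process in its group. This is a simplified cost model, assuming the all-to-all is confined to each group; real MPI libraries use more sophisticated algorithms, and the group counts shown are illustrative.

```python
def all_to_all_messages(group_size):
    """Messages in a naive all-to-all: each process sends to every other one."""
    return group_size * (group_size - 1)


def total_messages(n_cores, n_groups):
    """Total messages when n_cores are split into n_groups equal, independent groups."""
    if n_cores % n_groups:
        raise ValueError("n_cores must be divisible by n_groups")
    return n_groups * all_to_all_messages(n_cores // n_groups)


if __name__ == "__main__":
    for groups in (1, 2, 8, 16):
        print(f"{groups:2d} group(s) of {1024 // groups:4d} cores: "
              f"{total_messages(1024, groups):,} messages")
```

In this model, going from one group of 1024 cores to 8 groups of 128 cuts the message count by roughly a factor of eight, and each message stays within a smaller set of cores.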
Castep on HECToR
We investigated the impact on performance of...
Compiler
BLAS/LAPACK libraries
FFT libraries
Compiler
BLAS/LAPACK
FFT