How can I use a 1000 cores?
Lessons in Parallel Scaling from Castep

P.J. Hasnip (pjh503@york.ac.uk)
Department of Physics, University of York
HECToR User Group Meeting

Outline

1 Introduction
  What is Castep?
  How does Castep work?
  Castep in Parallel
2 dCSE Project
  dCSE project plan
  Castep development
  Results

Castep is...

- A general-purpose 'first principles' atomistic modelling program
- Based on density functional theory
- Free to UK academics
- Installed on HECToR

Castep can...
- Compute the electronic density
- Determine the groundstate atomic configuration and cell
- Simulate molecular dynamics (path-integrals, variable cell)
- Calculate band-structures and density of states
- Compute various spectra (optical, IR, Raman, NMR, XANES...)
- plus linear response, population analysis, ELF, etc.
Castep is written using...

- Fortran 90
- BLAS/LAPACK for linear algebra
- FFT libraries (where available)
- MPI for parallel communication

Portable and well optimised (achieves 37-40% peak on HECToR).
Castep Basics

Castep solves a set of Schrödinger equations,

$$H_k[n]\,\psi_{bk}(\mathbf{r}) = \epsilon_{bk}\,\psi_{bk}(\mathbf{r})$$

where $n$ is the electronic density and $\{\psi_{bk}\}$ are the bands:

$$n(\mathbf{r}) = \sum_{bk} 2 w_k\,|\psi_{bk}(\mathbf{r})|^2$$

Self-consistency

$$H_k[n]\,\psi_{bk}(\mathbf{r}) = \epsilon_{bk}\,\psi_{bk}(\mathbf{r})$$

- $H_k$ depends on $n(\mathbf{r})$
- $n(\mathbf{r})$ depends on $\{\psi_{bk}\}$

We need to solve this eigenvalue equation iteratively until we have self-consistency.
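The self-consistency cycle can be sketched on a toy problem. This is a minimal illustration, not Castep's algorithm: it assumes a two-site model in which the Hamiltonian depends on the density `n` on the first site through a hypothetical repulsion `u`, and the density is rebuilt from the lowest band. The names `ground_state`, `scf`, `u`, `t` and `mix` are invented for this sketch, and simple linear mixing stands in for whatever density-mixing scheme a production code would use.

```python
import math


def ground_state(h11, h22, h12):
    """Lowest eigenpair of the real symmetric matrix [[h11, h12], [h12, h22]].

    Assumes h12 != 0 so that the eigenvector below is never the zero vector.
    """
    mean = 0.5 * (h11 + h22)
    half_gap = 0.5 * (h11 - h22)
    eps = mean - math.hypot(half_gap, h12)  # lower root of the 2x2 secular equation
    v1, v2 = h12, eps - h11                 # solves (h11 - eps)*v1 + h12*v2 = 0
    norm = math.hypot(v1, v2)
    return eps, (v1 / norm, v2 / norm)


def scf(u=2.0, t=1.0, mix=0.5, tol=1e-10, max_iter=200):
    """Iterate n -> H[n] -> psi -> n until the density stops changing."""
    n = 0.5  # initial guess for the density on site 1
    for iteration in range(1, max_iter + 1):
        # Toy Hamiltonian: the site-1 energy rises with the density there.
        eps, (c1, _c2) = ground_state(u * n, 0.0, -t)
        n_new = 2.0 * c1 * c1  # two electrons in the lowest band
        if abs(n_new - n) < tol:
            return n_new, eps, iteration
        n = (1.0 - mix) * n + mix * n_new  # linear mixing damps oscillations
    raise RuntimeError("SCF did not converge")
```

Calling `scf()` returns the converged density, the band energy and the number of iterations taken.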
[Figure: How Castep Works]

A useful basis set

We expand $\psi_{bk}$ in a plane-wave (Fourier) basis,

$$\psi_{bk}(\mathbf{r}) = \sum_{\mathbf{G}} c_{Gbk}\, e^{i(\mathbf{G}+\mathbf{k})\cdot\mathbf{r}}$$

If we increase the size of our simulation system:

- The size of the smallest G-vector decreases
- The number of G-vectors, $N_G$, increases

On HECToR, $N_G$ might be O(100,000)
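How the plane-wave basis grows with system size can be checked by direct counting. The sketch below assumes a simple cubic cell of side `cell_length` and a basis made of every G-vector inside a sphere of radius `g_max`. Both the cell shape and the truncation radius are assumptions made for illustration, and `count_g_vectors` is a hypothetical helper.

```python
import math


def count_g_vectors(cell_length, g_max):
    """Count G = (2*pi/L)*(i, j, k) with |G| <= g_max for a cubic cell of side L.

    Returns (size of the smallest non-zero G-vector, number of G-vectors).
    """
    spacing = 2.0 * math.pi / cell_length
    radius = g_max / spacing   # sphere radius measured in units of the spacing
    m = int(radius)            # largest integer index that can lie inside the sphere
    count = 0
    for i in range(-m, m + 1):
        for j in range(-m, m + 1):
            for k in range(-m, m + 1):
                if i * i + j * j + k * k <= radius * radius:
                    count += 1
    return spacing, count


small_spacing, small_count = count_g_vectors(cell_length=5.0, g_max=4.0)
large_spacing, large_count = count_g_vectors(cell_length=10.0, g_max=4.0)
```

Doubling the cell side halves the smallest G-vector and multiplies the number of G-vectors by roughly eight, since $N_G$ tracks the cell volume.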
The k-points

The vectors $\{\mathbf{k}\}$ are basically sampling points in the region of space

$$|\mathbf{k}| < \tfrac{1}{2}\,|\mathbf{G}_{\text{smallest}}|$$

We increase the sampling density of k-points until our calculation converges.

If we increase the size of our simulation system:

- The size of the smallest G-vector decreases
- The number of k-points we need decreases

On HECToR, $N_k$ might be O(1)
Orthogonalisation

We find $\psi_{bk}$ by varying $\{c_{Gbk}\}$ to minimise $\epsilon_{bk}$.

To prevent all the bands heading for the lowest energy one, we explicitly orthogonalise them to each other.

The computational time per k-point scales as $N_G N_b^2$, i.e. cubically for large systems (recall $N_k = O(1)$).

This cost dominates in large calculations.
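The $N_G N_b^2$ cost of orthogonalisation can be seen by counting operations in a small example. The sketch below uses modified Gram-Schmidt purely as an illustration; the slides do not say which orthogonalisation scheme Castep uses, and `gram_schmidt` and `random_bands` are hypothetical helpers.

```python
import random


def random_bands(n_b, n_g, seed=0):
    """n_b random complex coefficient vectors, each of length n_g."""
    rng = random.Random(seed)
    return [[complex(rng.gauss(0, 1), rng.gauss(0, 1)) for _ in range(n_g)]
            for _ in range(n_b)]


def gram_schmidt(bands):
    """Orthonormalise the vectors in place and return the projection cost.

    The cost counts complex multiply-adds: each pair of bands needs one
    inner product and one subtraction sweep, both of length n_g.
    """
    n_g = len(bands[0])
    ops = 0
    for b, vec in enumerate(bands):
        for prev in bands[:b]:
            overlap = sum(p.conjugate() * v for p, v in zip(prev, vec))
            for g in range(n_g):
                vec[g] -= overlap * prev[g]
            ops += 2 * n_g
        norm = sum(abs(v) ** 2 for v in vec) ** 0.5
        for g in range(n_g):
            vec[g] /= norm
    return ops


cost_4_bands = gram_schmidt(random_bands(4, 16))
cost_8_bands = gram_schmidt(random_bands(8, 16))
```

The count comes to $N_G N_b (N_b - 1)$, so doubling the number of bands at fixed $N_G$ roughly quadruples the work.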
Fourier Transforms

$H$ contains some contributions we apply in G-space, and some in r-space.

We use FFTs to switch efficiently between r- and G-space where needed.

The computational time per k-point scales as $N_G \ln(N_G)\, N_b$, i.e. approximately quadratically for large systems.
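The "cubic" and "quadratic" labels for the orthogonalisation and FFT costs follow if both the basis size and the number of bands grow in proportion to the number of atoms $N$. That proportionality is an assumption made here for illustration (fixed composition and basis quality); the slides state only that $N_G$ increases with system size.

$$
\begin{aligned}
N_G \propto N,\quad N_b \propto N
\;&\Rightarrow\; T_{\text{orth}} \propto N_G N_b^2 \propto N^3,\\
\;&\Rightarrow\; T_{\text{FFT}} \propto N_G \ln(N_G)\, N_b \propto N^2 \ln N,\\
\frac{T_{\text{orth}}}{T_{\text{FFT}}} \;&\propto\; \frac{N_b}{\ln N_G}.
\end{aligned}
$$

Ignoring prefactors, the ratio grows almost linearly with system size, which is consistent with orthogonalisation dominating large calculations.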
Parallel Efficiency

We want to know how effective running in parallel is. Comparing to the serial time can cause problems because:

- Efficient parallel calculations may need different algorithms
- We cannot run large calculations in serial

Relative speed-up

Instead of the absolute parallel performance, we shall look at the relative speed-up when we double the number of cores:

$$E(c) = \frac{T(c/2)}{T(c)}$$

where $T(c)$ is the run time on $c$ cores. $E(c) = 2$ indicates perfect scaling. The HPCx Capability Incentive target is 1.7.
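The relative speed-up is simple to compute from a table of timings. The sketch below takes $E(c) = T(c/2)/T(c)$ and compares it with the 1.7 target; the timings are hypothetical numbers chosen for illustration, not measurements, and `relative_speedups` is an invented helper.

```python
def relative_speedups(timings):
    """Map each core count c to E(c) = T(c/2) / T(c) where both timings exist."""
    return {c: timings[c // 2] / t
            for c, t in sorted(timings.items())
            if c % 2 == 0 and c // 2 in timings}


# Hypothetical wall-clock times in seconds, keyed by core count.
timings = {8: 1000.0, 16: 520.0, 32: 290.0, 64: 200.0}

speedups = relative_speedups(timings)
meets_target = {c: e >= 1.7 for c, e in speedups.items()}
```

With these made-up timings the doublings to 16 and 32 cores meet the target, while the doubling to 64 cores does not.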
What hampers scaling?

- Serial parts of the code
- One-to-all communications (linear)
- All-to-all communications (quadratic)

$$T(c) = S + \frac{P}{c} + Lc + Qc^2$$
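The timing model $T(c) = S + P/c + Lc + Qc^2$ can be explored numerically to see where scaling breaks down. In the sketch below the coefficient values are hypothetical, chosen only to make the behaviour visible, and the function names are invented for illustration.

```python
def run_time(c, s, p, l, q):
    """Model run time on c cores: serial + parallel + linear and quadratic comms."""
    return s + p / c + l * c + q * c * c


def relative_speedup(c, **coeffs):
    """E(c) = T(c/2) / T(c) for the model above."""
    return run_time(c / 2, **coeffs) / run_time(c, **coeffs)


def largest_efficient_core_count(target, max_cores, **coeffs):
    """Largest power-of-two c <= max_cores reached with E >= target at every doubling."""
    best, c = 1, 2
    while c <= max_cores and relative_speedup(c, **coeffs) >= target:
        best = c
        c *= 2
    return best


# Hypothetical coefficients (seconds): mostly parallel work, small comms terms.
coeffs = dict(s=1.0, p=10000.0, l=0.01, q=0.0001)
limit = largest_efficient_core_count(1.7, 4096, **coeffs)
```

Even with a quadratic coefficient this small, the model stops meeting the 1.7 target after 128 cores and is slower on 512 cores than on 256, because the $Qc^2$ term eventually outweighs the shrinking $P/c$ term.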
k-point parallelism

$$H_k[n]\,\psi_{bk}(\mathbf{r}) = \epsilon_{bk}\,\psi_{bk}(\mathbf{r})$$

The eigenvalue equations for different k-points are only weakly coupled.

- Distribute data and workload by k-point
- Gives near-perfect scaling
- Large calculations only need O(1) k-points $\Rightarrow$ run out of them very quickly!
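Why k-point parallelism runs out so quickly can be shown with a small load-balance calculation. The sketch below assumes whole k-points are dealt out to cores, that every k-point costs the same, and that communication is free; `kpoint_speedup` is a hypothetical helper.

```python
import math


def kpoint_speedup(n_kpoints, cores):
    """Ideal speed-up over one core when whole k-points are dealt out to cores."""
    busiest = math.ceil(n_kpoints / cores)  # k-points on the most heavily loaded core
    return n_kpoints / busiest


# Eight k-points: the speed-up saturates as soon as each core holds one k-point.
speedups = {cores: kpoint_speedup(8, cores) for cores in (1, 2, 4, 8, 16, 32)}
```

With eight k-points the ideal speed-up reaches 8 on eight cores and stays there; any further cores sit idle unless another level of parallelism is available.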
TiN Benchmark

The TiN simulation is a small standard benchmark

- 33 atoms
- 8 k-points
- 164 bands
- 10,972 G-vectors

[Figure: TiN k-point Parallel]
G-vector parallelism

$$\psi_{bk}(\mathbf{r}) = \sum_{\mathbf{G}} c_{Gbk}\, e^{i(\mathbf{G}+\mathbf{k})\cdot\mathbf{r}}$$

- Distribute the data and workload over the G-vectors
- $N_G$ is large, and increases with system size
- Fourier transforms require all-to-all communications
- Need extra data reorganisation, but good scaling for moderate numbers of cores
- Eventually the all-to-all cost dominates

[Figure: TiN G-vector Parallel]
Lesson 1 - Parallel Hierarchies

- k-point parallelism near-perfect to 8 cores
- G-vector parallelism good to 16 or 32 cores
- We allow both simultaneously

[Figure: TiN in Mixed k- and G-Parallel]
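One way to picture the parallel hierarchy of Lesson 1 is as a two-level grouping of cores: k-point groups on the outside, with the cores inside each group sharing the G-vectors. The sketch below is an assumption about how such a split could be chosen, not a description of Castep's own strategy; `split_cores` and `rank_to_groups` are hypothetical.

```python
import math


def split_cores(cores, n_kpoints):
    """Return (number of k-point groups, cores per group sharing the G-vectors).

    Uses the largest group count that divides both the core count and the
    number of k-points, so every group gets the same share of each.
    """
    k_groups = math.gcd(cores, n_kpoints)
    return k_groups, cores // k_groups


def rank_to_groups(rank, cores, n_kpoints):
    """Map a core's rank to (k-point group index, position within that group)."""
    _k_groups, per_group = split_cores(cores, n_kpoints)
    return rank // per_group, rank % per_group


layout_64 = split_cores(64, 8)
```

For 8 k-points on 64 cores this gives 8 k-point groups of 8 cores each, which keeps the G-vector parallelism within each group at a modest core count.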
Lesson 2 - Reduce All-to-All Communications

- Ultimate scaling dominated by all-to-all
- Castep already optimised to minimise FFTs
- Split into two phases:
  - local core-to-core within each node
  - node-to-node
- Devised and implemented by Keith Refson (RAL) and Martin Plummer (STFC)

[Figure: TiN with all-to-all optimisations]
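The benefit of the two-phase all-to-all in Lesson 2 can be estimated by counting messages that cross the network between nodes. The sketch below is an interpretation, not the implemented scheme: it assumes that in the two-phase version data is first gathered within each node and then exchanged as one message per ordered pair of nodes. The function names and the value of 4 cores per node are assumptions.

```python
def flat_internode_messages(cores, cores_per_node):
    """Every core sends one message to every core on every other node."""
    return cores * (cores - cores_per_node)


def two_phase_internode_messages(cores, cores_per_node):
    """After an on-node gather, each node sends one message to every other node."""
    nodes = cores // cores_per_node
    return nodes * (nodes - 1)


flat = flat_internode_messages(256, 4)
two_phase = two_phase_internode_messages(256, 4)
```

Under these assumptions the number of inter-node messages falls by a factor of the square of the cores per node (16 here). The same data still has to be moved, but in fewer, larger messages.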
Al$_2$O$_3$-3x3 benchmark

The TiN benchmark is quite small. A larger standard benchmark system is the Al$_2$O$_3$ slab (3×3 surface):

- 270 atoms
- 2 k-points
- 778 bands (1296 electrons)
- 88,184 G-vectors

[Figure: Al$_2$O$_3$-3x3 parallel scaling]
dCSE project plan

$$\psi_{bk}(\mathbf{r}) = \sum_{\mathbf{G}} c_{Gbk}\, e^{i(\mathbf{G}+\mathbf{k})\cdot\mathbf{r}}$$

- We have distributed $c_{Gbk}$ over G and k
- What about over b?

8-month dCSE project to:

- Investigate Castep performance on HECToR
- Implement band-parallelism
- Parallelise costly non-distributed operations
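Distributing $c_{Gbk}$ over b as well as G and k shrinks the data held by each core and lets more cores take part. The sketch below estimates the per-core share for the Al$_2$O$_3$ benchmark sizes (2 k-points, 778 bands, 88,184 G-vectors) under a simple block distribution. The group counts and the helper names are hypothetical, chosen only to illustrate the idea.

```python
def ceil_div(a, b):
    """Integer division rounded up."""
    return -(-a // b)


def coefficients_per_core(n_k, n_b, n_g, k_groups, band_groups, g_groups):
    """Largest number of c_{Gbk} coefficients held by any one core."""
    return (ceil_div(n_k, k_groups)
            * ceil_div(n_b, band_groups)
            * ceil_div(n_g, g_groups))


al2o3 = dict(n_k=2, n_b=778, n_g=88184)

# 2 x 1 x 32 = 64 cores without band-parallelism ...
without_bands = coefficients_per_core(**al2o3, k_groups=2, band_groups=1, g_groups=32)
# ... versus 2 x 8 x 32 = 512 cores with 8 band groups.
with_bands = coefficients_per_core(**al2o3, k_groups=2, band_groups=8, g_groups=32)
```

In this example the extra level reaches eight times as many cores without splitting the G-vectors any more finely, so the all-to-all in the FFTs stays within groups of 32 cores.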
Castep development

Castep is quite a large program...
  334,395 lines of Fortran 90
  54 modules
  Already has three levels of parallelism
... but well structured and commented.

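Distributing the coefficients over G and k, and then also over the bands b, amounts to giving every process a coordinate in a three-level hierarchy. The sketch below is only an illustration in Python, not Castep's Fortran implementation: the function names, the rank ordering (G-vector index varying fastest) and the even dealing-out of items are all assumptions made for the example.

```python
from typing import NamedTuple


class Coords(NamedTuple):
    """Position of one process in the (k-point, band, G-vector) hierarchy."""
    kpoint_group: int
    band_group: int
    gvector_rank: int


def decompose(rank: int, n_kpoint_groups: int, n_band_groups: int,
              n_gvector_ranks: int) -> Coords:
    """Map a flat process rank to hierarchical coordinates.

    Assumed ordering (hypothetical): the G-vector index varies fastest,
    so processes sharing one band's G-vectors have consecutive ranks.
    """
    total = n_kpoint_groups * n_band_groups * n_gvector_ranks
    if not 0 <= rank < total:
        raise ValueError(f"rank {rank} outside 0..{total - 1}")
    kpoint_group, rest = divmod(rank, n_band_groups * n_gvector_ranks)
    band_group, gvector_rank = divmod(rest, n_gvector_ranks)
    return Coords(kpoint_group, band_group, gvector_rank)


def local_share(n_items: int, n_parts: int, part: int) -> int:
    """Items held by `part` when n_items are dealt out as evenly as possible."""
    base, extra = divmod(n_items, n_parts)
    return base + (1 if part < extra else 0)


# Example: 64 processes as 2 k-point groups x 4 band groups x 8 G-vector ranks.
example = decompose(37, n_kpoint_groups=2, n_band_groups=4, n_gvector_ranks=8)
```

With this layout, rank 37 lands in k-point group 1, band group 0, G-vector rank 5. Each level can then use its own communicator, so a collective over G-vectors involves only the processes that share a band group.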
dCSE work

Basic band parallelism changed 3 modules
  communication module
  wavefunction module
  electronic groundstate module
Optimisation changed 2 further modules
Extra parallelisation changed 2 modules
  communication module
  algorithm module

A surprise

We know FFTs are expensive on lots of cores, and we also know orthogonalisation dominates large calculations.

We profiled a reasonably large parallel calculation:
  30% of time in FFTs
  30% of time in ZGEMM calls from one subroutine
... but not the orthogonalisation!

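A breakdown like "30% of time in FFTs" can come from an external profiler or from lightweight timers built into the code. The following is a minimal sketch of the built-in approach in Python; the class name, the region labels and the dummy workloads are hypothetical, and it assumes regions are not nested (nested regions would be counted twice).

```python
import time
from collections import defaultdict
from contextlib import contextmanager


class RegionTimer:
    """Accumulates wall-clock time per named region (regions must not nest)."""

    def __init__(self):
        self.totals = defaultdict(float)  # seconds spent in each region
        self.calls = defaultdict(int)     # number of entries into each region

    @contextmanager
    def region(self, name):
        start = time.perf_counter()
        try:
            yield
        finally:
            self.totals[name] += time.perf_counter() - start
            self.calls[name] += 1

    def fractions(self):
        """Fraction of the total timed wall-clock time spent in each region."""
        total = sum(self.totals.values())
        if total == 0.0:
            return {}
        return {name: t / total for name, t in self.totals.items()}


# Dummy workloads standing in for real kernels (labels are illustrative only).
timer = RegionTimer()
for _ in range(3):
    with timer.region("fft"):
        sum(i * i for i in range(20000))
    with timer.region("zgemm"):
        sum(i * i for i in range(20000))

report = timer.fractions()
```

Because only a few floating-point totals are kept per region, the output stays tiny however long the run is, which is the main attraction over a full trace for large calculations.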
Lesson 3 - Profile!

It is important to check your assumptions about performance. HECToR has the Cray Performance Analysis Tools (PAT) installed for profiling.
  Easy to profile BLAS and MPI calls
  Good for early investigations
  Larger calculations generated 90GB per run!
  Better to use built-in timing if available (Castep’s Trace)
  Cray PAT available as a library

dCSE work

Band-parallelism implemented as an additional level of parallelism
ScaLAPACK used for parallel matrix diagonalisation/inversion

[Figure: dCSE results for Al2O3 3×3]
[Figure: dCSE results for Al2O3 3×3 on BlueGene/P]

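Scaling results such as these are usually summarised as speedup and parallel efficiency relative to a reference run. The sketch below shows the arithmetic in Python. The timings in it are invented purely for illustration and are not measurements from this talk; the function names are likewise assumptions.

```python
def speedup(t_ref: float, t: float) -> float:
    """How many times faster the run is than the reference run."""
    return t_ref / t


def parallel_efficiency(cores_ref: int, t_ref: float,
                        cores: int, t: float) -> float:
    """Ratio of core-seconds used by the reference run to core-seconds used now.

    1.0 means perfect scaling from cores_ref to cores.
    """
    return (t_ref * cores_ref) / (t * cores)


# HYPOTHETICAL wall-clock times in seconds, keyed by core count.
# "tuned" and "untuned" are made-up builds of the same calculation.
tuned = {64: 1000.0, 512: 250.0}
untuned = {64: 4000.0, 512: 500.0}

eff_tuned = parallel_efficiency(64, tuned[64], 512, tuned[512])
eff_untuned = parallel_efficiency(64, untuned[64], 512, untuned[512])
gain_tuned = speedup(tuned[64], tuned[512])
```

In this made-up case the untuned build scales perfectly (efficiency 1.0 against 0.5 for the tuned build) yet still takes twice as long on 512 cores, so the efficiency figure alone says nothing about time to solution.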
Lesson 4 - Scaling isn’t everything

A real system

So far we’ve only looked at benchmarks; now we’ll look at something more interesting:

Imidazolium chloride, a room-temperature ionic liquid
  408 atoms
  1 k-point
  662 bands
  137,728 G-vectors
Want to run a molecular dynamics simulation

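With only 1 k-point, all of the parallelism for this system has to come from the G-vectors and, once band-parallelism is available, the bands. The sketch below tabulates the per-process load for each way of factoring a given core count into band groups and G-vector ranks. It is simple arithmetic on the system sizes quoted above; the function name and the restriction to exact factorisations of the core count are assumptions of the example, and it says nothing about which layout was actually run.

```python
import math

# System sizes quoted on the slide.
N_BANDS = 662
N_GVECTORS = 137_728


def layouts(n_cores: int, n_bands: int = N_BANDS,
            n_gvectors: int = N_GVECTORS):
    """List (band_groups, gvector_ranks, max_bands, max_gvectors) per layout.

    Only exact factorisations n_cores = band_groups * gvector_ranks are
    considered, and there are never more band groups than bands.
    """
    rows = []
    for n_band_groups in range(1, min(n_cores, n_bands) + 1):
        if n_cores % n_band_groups:
            continue
        n_gvector_ranks = n_cores // n_band_groups
        rows.append((
            n_band_groups,
            n_gvector_ranks,
            math.ceil(n_bands / n_band_groups),       # bands on the busiest group
            math.ceil(n_gvectors / n_gvector_ranks),  # G-vectors on the busiest rank
        ))
    return rows


table = layouts(1024)
for band_groups, g_ranks, bands, gvecs in table:
    print(f"{band_groups:4d} band groups x {g_ranks:4d} G-vector ranks: "
          f"<= {bands:3d} bands, <= {gvecs:6d} G-vectors per process")
```

Without band-parallelism, 1024 cores leave at most 135 G-vectors per process; splitting into 8 band groups instead gives each process up to 83 bands and 1076 G-vectors, a much coarser share of the G-vector data.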
[Figure: Imidazolium chloride]
[Figure: Imidazolium chloride with band-parallelism]

Imidazolium chloride performance
  Good scaling to 512 cores even with only 1 k-point
  Achieves 1 SCF cycle per minute on 1024 cores

Remaining work
  Integrate with latest (4.4) source
  Extend to NMR, linear response etc.
  Develop ‘band-local’ optimisers to improve scaling

Interesting ‘features’

Some things cropped up that you may find helpful...
  Bug with Cray MPI collectives: job killed with error 13 (believed fixed)
  Cray LibSci 10.2.1 much better than 10.0
  Portland Group compiler reports incorrect CPU time
  Out-of-memory errors not reported to users (yet), job killed

Acknowledgements
  Matt Probert (York)
  Keith Refson (RAL) and Martin Plummer (STFC)
  Mike Ashworth (STFC)
  NAG, especially Guy Robinson, Ian Reid, Edward Smyth, Sarfraz Nadeem and Phil Ridley.

Summary
  Use hierarchical parallelism
  Minimise all-to-all communications
  Profile the code
  Remember good scaling isn’t necessarily good performance

Castep on HECToR

We investigated the impact on performance of...
  Compiler
  BLAS/LAPACK libraries
  FFT libraries

[Figure: Compiler]
[Figure: BLAS/LAPACK]
[Figure: FFT]