More Charm++/TAU examples

Applications:
• NAMD
• Parallel Framework for Unstructured Meshing (ParFUM)

Features:
• Profile snapshots
  • Capture the runtime behavior of the application by dividing it into user-specified intervals
• CUDA profiling
  • Tracks time spent in CUDA kernel routines
  • Shows scaling behavior for an experiment varying the number of devices used

NAMD snapshot profile
[Figure: snapshot profile of over 800 sec on 2048 processors. Labels: Mean Exclusive Time, Standard Deviation, Load Balancing Phases. Legend: enqueueSelfB, enqueueSelfA, Main, enqueueWorkB, enqueueWorkA, Idle.]

NAMD CUDA events
[Figure: CUDA events per device, starting at Device #0. Annotations: ~50% efficiency, ~100% efficiency.]
GPU efficiency gained by doubling the number of GPUs from 16 to 32. These events are broken down by routine and by device number.

NAMD CUDA scaling
[Figure: Scaling Efficiency versus Number of Devices. Series: Non-Bonded Calculations, Sum Forces Calculations.]
Scaling by event and device number. Non-Bonded Calculations scale well. Sum Forces scales less well, but its overall time is only a few microseconds.

ParFUM CUDA speedup
[Figure: bar chart for a 128x8x8 mesh on a single CPU or GPU, vertical axis from 0 to 250. Series: Total time using only a CPU, Total time with CUDA acceleration, Time spent in CUDA kernel.]
Performance on a 128x8x8 mesh. When run with GPU acceleration enabled, ParFUM spent 9 seconds in the CUDA kernel routines.
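
The efficiency and speedup figures quoted for the NAMD and ParFUM CUDA runs can be read with the usual strong-scaling definitions. The sketch below is a minimal illustration of those definitions, not the method used to produce the plots. The function names are hypothetical and the timings in the usage section are placeholders, not measurements from these experiments.

```python
def scaling_efficiency(time_base, devices_base, time_scaled, devices_scaled):
    """Strong-scaling efficiency of a scaled run relative to a base run.

    Returns 1.0 when the time drops in proportion to the added devices,
    and 0.5 when doubling the devices leaves the time unchanged.
    """
    if min(time_base, time_scaled) <= 0:
        raise ValueError("times must be positive")
    if min(devices_base, devices_scaled) <= 0:
        raise ValueError("device counts must be positive")
    achieved_speedup = time_base / time_scaled
    ideal_speedup = devices_scaled / devices_base
    return achieved_speedup / ideal_speedup


def speedup(cpu_time, accelerated_time):
    """Ratio of CPU-only time to accelerated time (greater than 1 is faster)."""
    if cpu_time <= 0 or accelerated_time <= 0:
        raise ValueError("times must be positive")
    return cpu_time / accelerated_time


if __name__ == "__main__":
    # Placeholder timings (assumed, not taken from the NAMD or ParFUM runs).
    # An event whose time halves when going from 16 to 32 devices:
    print(scaling_efficiency(4.0, 16, 2.0, 32))
    # An event whose time does not change when going from 16 to 32 devices:
    print(scaling_efficiency(4.0, 16, 4.0, 32))
    # A hypothetical CPU-only run versus a CUDA-accelerated run:
    print(speedup(200.0, 50.0))
```

Under these definitions, an event near 100% efficiency is one whose time roughly halves when the device count doubles, while an event near 50% efficiency takes about the same time on 32 devices as on 16.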