LS-DYNA Productivity and Power-aware Simulations in Cluster Environments

From concept to engineering, and from design to test and manufacturing, product development relies on powerful virtual development solutions. Finite Element Analysis (FEA) and Computational Fluid Dynamics (CFD) are used to secure quality and accelerate the development process. Cluster solutions maximize the total value of ownership for FEA and CFD environments and extend innovation in virtual product development. Multi-core cluster environments place high demands on the cluster interconnect: high throughput, low latency, low CPU overhead, network flexibility and high efficiency are all needed to maintain a balanced system and to achieve high application performance and scaling. Low-performance interconnect solutions, or missing interconnect hardware capabilities, result in degraded system and application performance.

In this work, the LS-DYNA software from Livermore Software Technology Corporation (LSTC) was investigated. In all InfiniBand-based cases, LS-DYNA demonstrated high parallelism and scalability, which enabled it to take full advantage of multi-core HPC clusters. Moreover, according to the results, lower-speed interconnects such as Gigabit Ethernet (GigE) or 10 Gigabit Ethernet are ineffective at mid to large cluster sizes and can cause a reduction in performance beyond 16 or 20 server nodes (i.e., the application run time actually gets slower).

We profiled the network communications of LS-DYNA to determine its sensitivity points, which is essential for estimating the influence of the various cluster components, both hardware and software. The profile shows a large number of latency-sensitive small messages, exchanged through MPI_Allreduce and MPI_Bcast operations, that dominate the performance of the application at mid to large cluster sizes. The results also indicate that large data messages are used, and that the amount of data sent through these large messages increases with cluster size. From these results we conclude that the combination of a very high-bandwidth, extremely low-latency interconnect with low CPU overhead is required to increase productivity at mid to large node counts.

We also investigated the productivity gain obtained by moving from a single job to multiple jobs running in parallel across the cluster. This gain rests on two factors. First, the good scalability of the AMD architecture allows multiple jobs to run on a given compute node without saturating the memory controller. Second, the low latency and high bandwidth of the InfiniBand interconnect allow the CPU-to-CPU MPI traffic to be offloaded onto the interconnect rather than handled inside the compute node. The net result of this practice is a productivity increase of 200% with respect to the single-job run. Finally, the productivity increase of single-job runs with high-speed interconnects was analyzed from the point of view of power consumption, showing an energy saving of roughly 60% when using InfiniBand instead of Ethernet.
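
The dominance of small-message MPI_Allreduce and MPI_Bcast traffic can be illustrated with a minimal micro-benchmark sketch. This is not code from the study; the iteration count and the 8-byte message size are arbitrary choices made only to expose the per-call collective latency that the profile identifies as the scaling bottleneck.

    /* Minimal sketch (not from the paper): times small-message MPI_Allreduce
     * and MPI_Bcast calls, the collectives that dominate the LS-DYNA profile. */
    #include <mpi.h>
    #include <stdio.h>

    int main(int argc, char **argv)
    {
        int rank, i, iters = 10000;
        double val = 1.0, sum = 0.0, t0, t1;

        MPI_Init(&argc, &argv);
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);

        /* Many 8-byte MPI_Allreduce operations: latency-bound, not bandwidth-bound. */
        MPI_Barrier(MPI_COMM_WORLD);
        t0 = MPI_Wtime();
        for (i = 0; i < iters; i++)
            MPI_Allreduce(&val, &sum, 1, MPI_DOUBLE, MPI_SUM, MPI_COMM_WORLD);
        t1 = MPI_Wtime();
        if (rank == 0)
            printf("MPI_Allreduce (8 B): %.2f us per call\n", (t1 - t0) / iters * 1e6);

        /* Same measurement for a small MPI_Bcast. */
        MPI_Barrier(MPI_COMM_WORLD);
        t0 = MPI_Wtime();
        for (i = 0; i < iters; i++)
            MPI_Bcast(&val, 1, MPI_DOUBLE, 0, MPI_COMM_WORLD);
        t1 = MPI_Wtime();
        if (rank == 0)
            printf("MPI_Bcast    (8 B): %.2f us per call\n", (t1 - t0) / iters * 1e6);

        MPI_Finalize();
        return 0;
    }

Running such a micro-benchmark at increasing node counts (e.g., with mpirun -np N) shows the per-call latency of the collectives growing with scale, which is why a low-latency interconnect matters more than raw bandwidth alone for this message pattern.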
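
The energy comparison follows from the relation energy = average power x run time: a job that finishes sooner at comparable node power consumes proportionally less energy. The sketch below uses purely hypothetical node power and run times, chosen only to show the arithmetic behind a 60% saving; they are not measured values from the study.

    /* Illustrative energy-to-solution arithmetic; all numbers are hypothetical. */
    #include <stdio.h>

    int main(void)
    {
        double node_power_w = 300.0;  /* assumed average power per node (W)      */
        double t_eth_hours  = 10.0;   /* hypothetical run time over Ethernet (h) */
        double t_ib_hours   = 4.0;    /* hypothetical run time over InfiniBand   */

        double e_eth = node_power_w * t_eth_hours;  /* energy per node (Wh) */
        double e_ib  = node_power_w * t_ib_hours;

        printf("Energy saving: %.0f%%\n", 100.0 * (1.0 - e_ib / e_eth));  /* 60% */
        return 0;
    }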
