A Correlation Study between MPP LS-DYNA Performance and Various Interconnection Networks — a Quantitative Approach for Determining the Communication and Computation Costs
Yih-Yih Lin

As MPP LS-DYNA uses the message-passing paradigm to obtain parallelism, the elapsed time of an MPP LS-DYNA simulation comprises two parts: computation cost and communication cost. A quantitative approach for determining the communication cost and, hence, the computation cost and the speedup of an MPP LS-DYNA simulation is presented. Elapsed times, the characteristics of interconnect networks (latency and bandwidth), and message patterns are first measured, and then the method of least squares is applied to estimate the two costs. This approach allows one to predict the performance, or the speedup, of MPP LS-DYNA simulations with any interconnect network whose characteristics are known. Also, while conducting this performance study of MPP LS-DYNA, a loss of accuracy in single-precision (32-bit) MPP LS-DYNA simulations was found. This finding and the advantage of double-precision (64-bit) arithmetic are presented.

INTRODUCTION - Theory for Performance of MPP LS-DYNA

To run an N-processor MPP LS-DYNA simulation, or job, an interconnect network, or simply an interconnect, must first be established to connect the N processors; the collection of the N processors and the interconnect is called an N-processor cluster. In this paper, we consider only the case in which the N processors are of the same kind. For such a job, MPP LS-DYNA starts by decomposing the geometrical configuration of the model into N sub-domains. Each of the N processors is assigned to perform the computation on one of the sub-domains; meanwhile, messages are passed among all the processors so that the necessary physical conditions, such as force conditions, can be enforced.

Let T_comput^1, T_comput^2, …, T_comput^N be each processor's computation cost, and let T_comm^1, T_comm^2, …, T_comm^N be each processor's communication cost. Define the computation cost T_comput as max(T_comput^1, T_comput^2, …, T_comput^N) and the communication cost T_comm as max(T_comm^1, T_comm^2, …, T_comm^N), respectively. Then the job's elapsed time can be described as:

    T_elapsed = T_comput + T_comm                                            (1)

For a given decomposition, the computation cost T_comput is fixed. In contrast, the communication cost T_comm varies with the characteristics of the interconnect used.

The term "speedup" is defined as the ratio T_elapsed,1-processor / T_elapsed,N-processor. In general, speedups are smaller than N. Since for a 1-processor job the communication cost T_comm is zero, the perfect speedup of N can be realized only under two unrealistic conditions: zero communication cost, i.e., T_comm = 0, and a perfectly balanced decomposition, which renders T_comput^1 = T_comput^2 = … = T_comput^N. Assuming that the N processors are of the same kind, the variation of T_comput^1, T_comput^2, …, T_comput^N arises from the unbalanced decomposition of the N sub-domains. It is extremely difficult to find a universal algorithm that decomposes any model in a balanced way. MPP LS-DYNA does provide features, documented under the parallel-specific options of the pfile, that allow users to provide hints for obtaining a more balanced decomposition than the default.
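To make the least-squares estimation concrete, the following Python sketch illustrates one way such a fit can be carried out. The linear cost model (elapsed time = T_comput + message count x latency + message volume / bandwidth), the interconnect characteristics, the measured elapsed times, and the 1-processor time are all illustrative assumptions introduced for this example; they are not measurements from this study.

    # Illustrative least-squares fit of computation and communication costs.
    # Assumed cost model (a sketch, not the paper's exact formulation):
    #   T_elapsed(j) ~ T_comput + n_msg * latency_j + volume / bandwidth_j
    # where n_msg and volume describe the fixed message pattern of one
    # decomposition, and latency_j / bandwidth_j characterize interconnect j.
    import numpy as np

    # Measured characteristics of four interconnects (made-up example values):
    # latency in seconds, bandwidth in bytes per second.
    latency   = np.array([60e-6, 20e-6, 7e-6, 5e-6])
    bandwidth = np.array([12.5e6, 100e6, 250e6, 400e6])

    # Measured elapsed times of the same N-processor job on each interconnect (s).
    elapsed = np.array([4725.0, 3245.0, 3090.0, 3062.0])

    # Design matrix: one row [1, latency_j, 1/bandwidth_j] per interconnect.
    A = np.column_stack([np.ones_like(latency), latency, 1.0 / bandwidth])

    # Least-squares estimate of [T_comput, effective message count, effective volume].
    (t_comput, n_msg, volume), *_ = np.linalg.lstsq(A, elapsed, rcond=None)

    # Predict the communication cost, and hence the elapsed time, for a new
    # interconnect with known characteristics (again, illustrative numbers).
    new_latency, new_bandwidth = 10e-6, 200e6
    t_comm = n_msg * new_latency + volume / new_bandwidth
    print(f"T_comput ~ {t_comput:.0f} s, predicted T_comm ~ {t_comm:.0f} s")

    # Predicted speedup relative to a measured 1-processor run (where T_comm = 0).
    t_elapsed_1p = 24000.0  # hypothetical 1-processor elapsed time
    print(f"predicted speedup ~ {t_elapsed_1p / (t_comput + t_comm):.1f}")

With measurements from more interconnects than unknowns, the fit is an overdetermined least-squares problem, and the estimated T_comput and message-pattern terms can then be reused to predict T_comm, and therefore the speedup, for any interconnect whose latency and bandwidth are known.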