1 Introduction
Deep Neural Networks (DNNs) are actively being used to solve complex problems by examining enormous amounts of data. The success of DNNs can largely be attributed to advances in parallel computing. The introduction of the General Purpose Graphics Processing Unit (GPGPU) has made it possible to exploit the inherent parallelism of the computations that take place in deep learning applications. This development enabled data scientists to deploy deeper neural networks and handle larger datasets in an effort to solve more complex problems. As a result, the training phase of DNNs may take days or weeks to complete. To meet this ever-increasing computational demand, distributed training, in which a computer cluster is used to train a single neural network, is becoming increasingly important. Effectively utilizing state-of-the-art technologies in High-Performance Computing (HPC) for distributed learning will be integral in pushing the boundaries of the field of deep learning.
High Energy Physics (HEP) is a field that requires computationally demanding deep learning applications to solve complex scientific problems. The experiments at the Large Hadron Collider (LHC) at CERN are devoting more than 50% of their resources to Monte Carlo (MC) simulation tasks. The main reason is that in HEP, particles are simulated in extreme detail as they traverse matter. At the time of writing, CERN has temporarily shut down the LHC experiments in preparation for their upgrade (the High Luminosity LHC phase), after which the demand for simulated data is expected to increase by a factor of 100 [3]. Deep learning, and in particular generative modeling, is a promising approach to replace traditional Monte Carlo simulations. Prior work has demonstrated that Generative Adversarial Networks (GANs) can partially replace MC tasks in HEP simulations with acceptable accuracy [10]. The idea is for a GAN to generate a relatively large set of realistic simulation samples after being trained on a relatively small set of experimental or simulated data. The inference step of the GAN is orders of magnitude faster than traditional Monte Carlo simulations and could therefore meet the demands of the High Luminosity LHC phase. The training of the GAN is, however, very time consuming, and its performance must be improved significantly in order to test and deploy this deep learning approach for various types of detectors and particles. This paper is an effort to assess the performance of the IBM POWER architecture for this task.
The organization of the paper is as follows. Section 2 discusses related work. In Sect. 3 we define the problem in more detail. In Sect. 4 we describe the methodology of our approach. Section 5 describes the hardware and software setup. In Sect. 6 we present the results. Finally, in Sect. 7 we discuss future work.
2 Related Work
The idea of adversarial training was originally introduced by Goodfellow in [7]. A GAN generally consists of a generative network (generator) and a discriminative network (discriminator). The generator learns to map data from a random distribution to data that follows the distribution of interest. The discriminator learns to distinguish data produced by the generator from the true data distribution. Training a GAN involves presenting it with samples from the training dataset until the generated candidates are indistinguishable from real samples. The training is complete when the generator has effectively managed to ‘fool’ the discriminator up to a desired accuracy.
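As an illustration of this alternating update scheme, the following is a minimal Keras sketch of one GAN training step; the toy dense generator and discriminator and the sizes latent_dim and data_dim are placeholders for illustration only, not the architecture used in this work.

```python
import numpy as np
from keras.models import Sequential, Model
from keras.layers import Dense, Input
from keras.optimizers import RMSprop

latent_dim, data_dim = 16, 64  # placeholder sizes, for illustration only

# Toy fully-connected generator and discriminator.
generator = Sequential([Dense(32, activation='relu', input_dim=latent_dim),
                        Dense(data_dim, activation='linear')])
discriminator = Sequential([Dense(32, activation='relu', input_dim=data_dim),
                            Dense(1, activation='sigmoid')])
discriminator.compile(optimizer=RMSprop(), loss='binary_crossentropy')

# Combined model used to update the generator; the discriminator weights
# are frozen here so that only the generator is trained through it.
discriminator.trainable = False
z = Input(shape=(latent_dim,))
combined = Model(z, discriminator(generator(z)))
combined.compile(optimizer=RMSprop(), loss='binary_crossentropy')

def train_step(real_batch):
    batch_size = real_batch.shape[0]
    noise = np.random.normal(size=(batch_size, latent_dim))
    fake_batch = generator.predict(noise)
    # 1) Update the discriminator: real samples labeled 1, generated samples labeled 0.
    d_loss_real = discriminator.train_on_batch(real_batch, np.ones((batch_size, 1)))
    d_loss_fake = discriminator.train_on_batch(fake_batch, np.zeros((batch_size, 1)))
    # 2) Update the generator: try to make the discriminator output 1 for generated samples.
    g_loss = combined.train_on_batch(noise, np.ones((batch_size, 1)))
    return d_loss_real, d_loss_fake, g_loss
```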
The applications of GANs, and their variations, are being investigated in many fields [8, 15, 20]. In HEP we can consider particle detectors as cameras that capture the decay products of particle collisions in accelerators such as the Large Hadron Collider. Consequently, researchers have been exploring the applications of GANs in HEP by generating images that mimic the complex distributions of these ‘detector images’ [16, 17].
Reaching convergence between the generator and the discriminator can take a long time relative to other deep neural network architectures [14], which makes GAN training an appealing use case for distributing the training process over many nodes. The distributed training of GANs is a relatively new research area, and only a few efforts have been made in that direction. We describe them in this section.
In [24], the authors investigate the performance of distributed training of GANs for fast detector simulations. The training is performed on an HPC cluster consisting of 256 CPUs, in a dual-socket configuration. The results show that the performance scales linearly, with a scaling efficiency of 94% when utilizing the entire cluster.
The authors in [23] perform data-parallel training of GANs for high-energy physics simulations. For the distribution of the training workload, they rely on an MPI-based Cray Machine Learning Plugin to train the GAN over multiple nodes and GPGPUs. Preliminary results show that the training of 3 epochs on 16 NVIDIA P100 GPUs took 2 h.
The authors in [12] explore the scaling of a GAN to thousands of nodes on Cray XC supercomputing systems. The network used as a benchmark, CosmoGAN, produces cosmology mass maps, which would otherwise be obtained through computationally expensive simulations. Several distribution frameworks were evaluated on two HPC systems: one consisting of 9,688 CPUs and one of 5,320 hybrid CPU+GPU nodes. The work reports the strong and weak scaling behavior of both systems.
3 Problem Definition
Following up on the related work, the main question that this paper aims to answer is: how does distributed training of generative adversarial networks perform on a POWER8 cluster, in terms of overall runtime, scalability, and power efficiency?
We limit the scope of the question by focusing on a particular use case of generative adversarial networks (GANs) in high-energy physics simulations. The type of simulations that we are targeting are those of highly segmented calorimeters, as typically found in particle detectors, such as the LHC experiments at CERN. In this work we consider the use case of the Compact Linear Collider (CLIC), which is a conceptual linear accelerator study [2]. This use case is representative of the demand for simulated data expected for the HL-LHC era. Moreover, the dataset is publicly available [4]. The dataset that we are considering consists of what are referred to in high-energy physics as electron showers. An electron shower is the result of a chain reaction of electromagnetic collisions initiated by one electron penetrating a material (in this case a calorimeter). Electron showers are one of the main reasons why HEP simulations are very time-consuming, as all of the electrons that emerge from the shower need to be simulated through the matter they traverse. The High-Luminosity LHC upgrade will significantly increase the energy level at which experiments are done, and the simulations must accordingly be able to keep up. The trained GAN will be able to replace the simulation of an electron shower with a single inference of the model. The neural architecture of the GAN we used is described in [22]. Based on the available dataset [4], we train a GAN to reproduce the energy distribution of the showers generated by electrons with energies ranging from 10 to 500 GeV.
4 Data-Parallel Distributed Training
Distributed training can generally be implemented according to two different paradigms: model parallelism and data parallelism. As described in [11], data parallelism is more efficient when the number of computations per model weight is high, which is the case for the GAN we are considering, as it mostly consists of convolutional layers. For this reason, we adopted a data-parallel approach.
In data-parallel distributed training, we run multiple instances of the stochastic gradient descent (SGD) algorithm (RMSProp in our use case). The gradients computed by each SGD instance must be propagated through the network to maintain a consistent model. Horovod [21] is a framework that works on top of well-known machine learning libraries (TensorFlow, Keras, PyTorch) and takes care of model consistency. Under the hood, it uses the decentralized ring all-reduce scheme [18] to communicate the gradients across the network. The ring all-reduce scheme requires communication only between neighboring nodes in a ring-like network topology, effectively reducing the amount of cross-server communication.
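A minimal sketch of this setup with Horovod's Keras API is shown below; it only illustrates the optimizer wrapping and the initial weight broadcast, with the model definition and data loading omitted. The learning-rate scaling by hvd.size() is a common convention, not a detail taken from this work.

```python
import keras
import horovod.keras as hvd

hvd.init()  # one Horovod/MPI process per GPU

# Wrap the local optimizer (RMSProp in our use case) so that gradients are
# averaged across all workers with ring all-reduce before each weight update.
opt = hvd.DistributedOptimizer(keras.optimizers.RMSprop(lr=0.001 * hvd.size()))

# model = ...  # generator/discriminator as described in [22] (omitted here)
# model.compile(optimizer=opt, loss='binary_crossentropy')

callbacks = [
    # Broadcast the initial weights from rank 0 so all workers start
    # from an identical model.
    hvd.callbacks.BroadcastGlobalVariablesCallback(0),
]
# model.fit(x_shard, y_shard, batch_size=512, epochs=40, callbacks=callbacks)
```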
5 Computing Configuration

Fig. 1. An overview of the fabric of an IBM S822LC node. IB stands for InfiniBand.
The GAN is implemented in Keras v1.2.2 [5], with TensorFlow 1.12.0 [1] as the backend execution framework, as part of the IBM PowerAI DL stack. The distributed training is done through Horovod 0.15.1 [21], which makes use of IBM Spectrum MPI [25] v10.2.0 with CUDA support, itself based on OpenMPI [6]. One MPI process was spawned for each instance of the GAN as part of the distributed stochastic gradient descent algorithm. In order to reduce cross-talk between NUMA domains and between servers, we pinned each fraction of the training data to the NUMA domain of the GPU that ingests it. This is achieved by enabling the mpirun option --bind-to-socket and setting the -rankfile option to point to a rankfile that assigns one MPI process per GPU. The POWER8 system was configured with CUDA 10.0 and the cuDNN 7.5 library.
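A sketch of the per-rank GPU selection is shown below, assuming the usual Horovod pattern for TensorFlow 1.x; the rankfile contents and the exact mpirun invocation are not reproduced here.

```python
import tensorflow as tf
import keras.backend as K
import horovod.keras as hvd

hvd.init()

# Each MPI process spawned by mpirun sees all GPUs of its node; restrict it
# to the single GPU that matches its local rank, so that the data fraction it
# loads stays close to the NUMA domain attached to that GPU.
config = tf.ConfigProto()
config.gpu_options.visible_device_list = str(hvd.local_rank())
config.gpu_options.allow_growth = True
K.set_session(tf.Session(config=config))
```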
The training data consist of 200,000 electron showers, spread over 80 files, which amount to a total of 12 GB on disk. 90% of the dataset was used for training and the remainder for testing. Typically, within the training phase, one computes both the training loss (through backpropagation) and the test loss (through feed-forward propagation). To evaluate the overall runtime we take both computations into account, but for the scalability analysis we take only training into account. A full training consists of 40 epochs, but for the scalability test we timed only the training time per epoch, since each epoch is of the same duration. In order to warm up the GPUs on the machine, we ran five warm-up epochs for every measurement. The choice of the mini-batch size for GANs is still an open research question, but empirically we observed that larger mini-batch sizes reduced the overall training time while resulting in the same convergence of the loss functions. We therefore fixed the mini-batch size to 512, the largest batch size with which we could train without running into out-of-memory issues.
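The paper does not spell out how the 80 files are divided among the workers; a plausible sketch, with a hypothetical helper shard_files, is the round-robin assignment below, so that each GPU ingests a disjoint subset of the training data.

```python
import horovod.keras as hvd

hvd.init()

def shard_files(file_list):
    """Hypothetical helper: assign every worker a disjoint subset of the files."""
    files = sorted(file_list)
    return [f for i, f in enumerate(files) if i % hvd.size() == hvd.rank()]

# Example: with 80 files and 12 workers, each rank reads 6 or 7 files and
# iterates over its shard in mini-batches of 512 showers per training step.
```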
6 Results
We performed a full training on the POWER8 cluster on a single GPU and measured the wall time to be 20 h and 16 min. On 12 GPUs the total training time is reduced to 2 h and 14 min, a speedup of approximately 9. The profiling results show that the main reason for the discrepancy with the ideal speedup is the single-GPU evaluation of the test losses. Unlike the training of the GAN, this evaluation is not executed in a distributed manner. Following Amdahl's law [9], the maximum theoretical speedup is limited by the serial fraction of the code, which in this case is the code for computing the test losses. Compared to previous work on distributed training of the same neural architecture on 16 P100 GPUs (2 h for 3 epochs) [23], this is a significant improvement.
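To make the Amdahl argument concrete, the measured wall times imply a serial fraction of roughly 3%; the following back-of-the-envelope estimate is ours, not taken from the profiler:

```latex
% Amdahl's law with parallel fraction p on N GPUs:
S(N) = \frac{1}{(1-p) + p/N}, \qquad
S(12) \approx \frac{1216\ \mathrm{min}}{134\ \mathrm{min}} \approx 9.1
% Solving for the serial fraction:
1-p = \frac{1/S(12) - 1/12}{1 - 1/12} \approx \frac{0.110 - 0.083}{0.917} \approx 0.03
```

Under this estimate, the non-distributed test-loss evaluation accounts for roughly 3% of the single-GPU runtime, which would cap the achievable speedup at about 1/(1-p) ≈ 34 regardless of the number of GPUs.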
For the scalability benchmark, we excluded the computation of the test losses from the training phase. We performed the distributed training of the GAN for different numbers of GPUs and measured the time per epoch (averaged over five epochs). The speedup with respect to training on a single GPU is plotted in Fig. 3. We observe strong scaling performance as we increase the number of GPUs, with a scaling efficiency of 98.9% when utilizing the entire cluster. Profiling results from NVIDIA's GPU profiling tool nvprof show that the data transfer between neighboring nodes in the ring all-reduce scheme does not saturate the NVLink bandwidth (see Fig. 2), and therefore the scalability of the training depends on the computational throughput of the GPUs.
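For reference, assuming the usual strong-scaling definition, the scaling efficiency is the ratio of the measured speedup to the ideal speedup:

```latex
\mathrm{efficiency}(N) = \frac{S(N)}{N} = \frac{t_{\mathrm{epoch}}(1)}{N \cdot t_{\mathrm{epoch}}(N)}
```

so an efficiency of 98.9% on 12 GPUs corresponds to a per-epoch speedup of roughly 11.9.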

Fig. 2. Partial nvprof profiling output for one of the GPUs in a 12-GPU training, showing that training is compute-bound.

Fig. 3. Speedup as a function of the number of P100 GPUs, and the corresponding scaling efficiency.
7 Future Work
As part of the IBM PowerAI framework [25], the IBM DDL (Distributed Deep Learning) library offers, similarly to Horovod, an implementation of distributed deep learning. DDL offers multiple methods to perform the distributed stochastic gradient descent algorithm. It would be interesting to explore these options in an effort to improve the overall training time.
Large Model Support (LMS) [13] is a feature of the IBM PowerAI framework that enables the training of deep neural networks (DNNs) that would otherwise exceed the memory capacity of a GPU. LMS achieves this by swapping the feature maps of DNNs in and out of GPU memory, keeping only the data that is required at a given stage of the training pipeline in GPU memory. This approach puts a larger burden on the CPU-GPU interconnect and could possibly saturate its bandwidth. The high-bandwidth NVLink interconnect between the CPU and the GPUs in modern POWER systems, however, makes LMS an appealing approach to scaling out the training of DNNs. We would be interested to find out how LMS performs compared with the data-parallel approach used in this work.