1 Introduction
Deep Neural Networks (DNNs) are actively being used to solve complex problems by examining enormous amounts of data. The success of DNNs can largely be attributed to advances in parallel computing. The introduction of the General Purpose Graphics Processing Unit (GPGPU) has made it possible to exploit the inherent parallelism of the computations that take place in deep learning applications. This development enabled data scientists to deploy deeper neural networks and handle larger datasets in an effort to solve more complex problems. As a result, the training phase of DNNs may take days or weeks to complete. To meet this ever-increasing computational demand, distributed training, in which a computer cluster is used to train a single neural network, is becoming increasingly important. Effectively utilizing state-of-the-art technologies in High-Performance Computing (HPC) for distributed learning will be integral in pushing the boundaries of the field of deep learning.
High Energy Physics (HEP) is a field that requires computationally demanding deep learning applications to solve complex scientific problems. The experiments at the Large Hadron Collider (LHC) at CERN are devoting more than 50% of their resources to Monte Carlo (MC) simulation tasks. The main reason is that in HEP, particles are simulated in extreme detail as they traverse matter. At the time of writing, CERN has temporarily shut down the LHC experiments in preparation for their upgrade (the High Luminosity LHC phase), after which the demand for simulated data is expected to increase by a factor of 100 [3]. Deep learning, and in particular generative modeling, is a promising approach to replace traditional Monte Carlo simulations. Prior work has demonstrated that Generative Adversarial Networks (GANs) can partially replace MC tasks in HEP simulations with acceptable accuracy [10]. The idea is for a GAN to generate a relatively large set of realistic simulation samples after being trained on a relatively small set of experimental or simulated data. The inference step of the GAN is orders of magnitude faster than traditional Monte Carlo simulations and could therefore meet the demands of the High Luminosity LHC phase. The training of the GAN is, however, very time consuming, and its performance must be improved significantly in order to test and deploy this deep learning approach for various types of detectors and particles. This paper is an effort to assess the performance of the IBM POWER architecture for this task.
The organization of the paper is as follows. Section 2 discusses related work. In Sect. 3 we define the problem in more detail. In Sect. 4 we describe the methodology of our approach. Section 5 describes the hardware and software setup. In Sect. 6 we present the results. Finally, in Sect. 7 we discuss future work.
2 Related Work
The idea of adversarial training was originally introduced by Goodfellow in [7]. A GAN generally consists of a generative network (generator) and a discriminative network (discriminator). The generator learns to map data from a random distribution to data that follows the distribution of interest. The discriminator learns to distinguish data produced by the generator from the true data distribution. Training a GAN involves presenting it with samples from the training dataset until the generated candidates are indistinguishable from real samples. The training is complete when the generator has effectively managed to ‘fool’ the discriminator up to a desired accuracy.
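As an illustration of this alternating update scheme, the following is a minimal Keras sketch of one GAN training step; the toy dense generator and discriminator and the sizes latent_dim and data_dim are placeholders for illustration only, not the architecture used in this work.

```python
import numpy as np
from keras.models import Sequential, Model
from keras.layers import Dense, Input
from keras.optimizers import RMSprop

latent_dim, data_dim = 16, 64  # placeholder sizes, for illustration only

# Toy fully-connected generator and discriminator.
generator = Sequential([Dense(32, activation='relu', input_dim=latent_dim),
                        Dense(data_dim, activation='linear')])
discriminator = Sequential([Dense(32, activation='relu', input_dim=data_dim),
                            Dense(1, activation='sigmoid')])
discriminator.compile(optimizer=RMSprop(), loss='binary_crossentropy')

# Combined model used to update the generator; the discriminator weights
# are frozen here so that only the generator is trained through it.
discriminator.trainable = False
z = Input(shape=(latent_dim,))
combined = Model(z, discriminator(generator(z)))
combined.compile(optimizer=RMSprop(), loss='binary_crossentropy')

def train_step(real_batch):
    batch_size = real_batch.shape[0]
    noise = np.random.normal(size=(batch_size, latent_dim))
    fake_batch = generator.predict(noise)
    # 1) Update the discriminator: real samples labeled 1, generated samples labeled 0.
    d_loss_real = discriminator.train_on_batch(real_batch, np.ones((batch_size, 1)))
    d_loss_fake = discriminator.train_on_batch(fake_batch, np.zeros((batch_size, 1)))
    # 2) Update the generator: try to make the discriminator output 1 for generated samples.
    g_loss = combined.train_on_batch(noise, np.ones((batch_size, 1)))
    return d_loss_real, d_loss_fake, g_loss
```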
The applications of GANs, and their variations, are being investigated in many fields [8, 15, 20]. In HEP we can consider particle detectors as cameras that capture the decay products of particle collisions in accelerators such as the Large Hadron Collider. Consequently, researchers have been exploring the applications of GANs in HEP by generating images that mimic the complex distributions of these ‘detector images’ [16, 17].
Reaching convergence between the generator and the discriminator can take a long time relative to other deep neural network architectures [14], which makes GAN training an appealing use case for distributing the training process over many nodes. The distributed training of GANs is a relatively new research area, and only a few efforts have been made in that direction. We describe them in this section.
In [24], the authors investigate the performance of distributed training of GANs for fast detector simulations. The training is performed on an HPC cluster consisting of 256 CPUs, in a dual-socket configuration. The results show that the performance scales linearly, with a scaling efficiency of 94% when utilizing the entire cluster.
The authors in [23] perform data-parallel training of GANs for high-energy physics simulations. For the distribution of the training workload, they rely on an MPI-based Cray Machine Learning Plugin to train the GAN over multiple nodes and GPGPUs. Preliminary results show that the training of 3 epochs on 16 NVIDIA P100 GPUs took 2 h.
The authors in [12] explore the scaling of a GAN to thousands of nodes on Cray XC supercomputing systems. The network used as a benchmark, CosmoGAN, produces cosmology mass maps, which would otherwise be obtained through computationally expensive simulations. Several distribution frameworks were evaluated on two HPC systems: one consisting of 9,688 CPUs and one of 5,320 hybrid CPU+GPU nodes. The work reports the strong and weak scaling behavior of both systems.
3 Problem Definition
Following up on the related work, the main question that this paper aims to answer is: how does distributed training of generative adversarial networks perform on a POWER8 cluster, in terms of overall runtime, scalability, and power efficiency?
We limit the scope of the question by focusing on a particular use case of generative adversarial networks (GANs) in high-energy physics simulations. The type of simulations that we are targeting are those of highly segmented calorimeters, as typically found in particle detectors, such as the LHC experiments at CERN. In this work we consider the use case of the Compact Linear Collider (CLIC), which is a conceptual linear accelerator study [2]. This use case is representative of the demand for simulated data expected for the HL-LHC era. Moreover, the dataset is publicly available [4]. The dataset that we are considering consists of what are referred to in high-energy physics as electron showers. An electron shower is the result of a chain reaction of electromagnetic collisions initiated by one electron penetrating a material (in this case a calorimeter). Electron showers are one of the main reasons why HEP simulations are very time-consuming, as all of the electrons that emerge from the shower need to be simulated through the matter they traverse. The High-Luminosity LHC upgrade will significantly increase the energy level at which experiments are done, and the simulations must accordingly be able to keep up. The trained GAN will be able to replace the simulation of an electron shower with a single inference of the model. The neural architecture of the GAN we used is described in [22]. Based on the available dataset [4], we train a GAN to reproduce the energy distribution of the showers generated by electrons with energies ranging from 10 to 500 GeV.
4 Data-Parallel Distributed Training
Distributed training can generally be implemented according to two different paradigms: model parallelism and data parallelism. As described in [11], data parallelism is more efficient when the number of computations per model weight is high, which is the case for the GAN we are considering, as it mostly consists of convolutional layers. For this reason, we adopted a data-parallel approach.
In data-parallel distributed training, we run multiple instances of the stochastic gradient descent (SGD) algorithm (RMSProp in our use case). The gradients computed by each SGD instance must be propagated through the network to maintain a consistent model. Horovod [21] is a framework that works on top of well-known machine learning libraries (TensorFlow, Keras, PyTorch) and takes care of model consistency. Under the hood, it uses the decentralized ring all-reduce scheme [18] to communicate the gradients across the network. The ring all-reduce scheme requires communication only between neighboring nodes in a ring-like network topology, effectively reducing the amount of cross-server communication.
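A minimal sketch of this setup with Horovod's Keras API is shown below; it only illustrates the optimizer wrapping and the initial weight broadcast, with the model definition and data loading omitted. The learning-rate scaling by hvd.size() is a common convention, not a detail taken from this work.

```python
import keras
import horovod.keras as hvd

hvd.init()  # one Horovod/MPI process per GPU

# Wrap the local optimizer (RMSProp in our use case) so that gradients are
# averaged across all workers with ring all-reduce before each weight update.
opt = hvd.DistributedOptimizer(keras.optimizers.RMSprop(lr=0.001 * hvd.size()))

# model = ...  # generator/discriminator as described in [22] (omitted here)
# model.compile(optimizer=opt, loss='binary_crossentropy')

callbacks = [
    # Broadcast the initial weights from rank 0 so all workers start
    # from an identical model.
    hvd.callbacks.BroadcastGlobalVariablesCallback(0),
]
# model.fit(x_shard, y_shard, batch_size=512, epochs=40, callbacks=callbacks)
```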
5 Computing Configuration

Fig. 1. An overview of the fabric of an IBM S822LC node. IB stands for InfiniBand.
The GAN is implemented in Keras v1.2.2 [5], with TensorFlow 1.12.0 [1] as the backend execution framework, as part of the IBM PowerAI DL stack. The distributed training is done through Horovod 0.15.1 [21], which makes use of IBM Spectrum MPI [25] v10.2.0 with CUDA support, itself based on OpenMPI [6]. One MPI process was spawned for each instance of the GAN as part of the distributed stochastic gradient descent algorithm. In order to reduce cross-talk between NUMA domains and between servers, we pinned each fraction of the training data to the NUMA domain of the GPU that ingests it. This is achieved by enabling the mpirun option --bind-to-socket and setting the -rankfile option to point to a rankfile that assigns one MPI process per GPU. The POWER8 system was configured with CUDA 10.0 and the cuDNN 7.5 library.
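A sketch of the per-rank GPU selection is shown below, assuming the usual Horovod pattern for TensorFlow 1.x; the rankfile contents and the exact mpirun invocation are not reproduced here.

```python
import tensorflow as tf
import keras.backend as K
import horovod.keras as hvd

hvd.init()

# Each MPI process spawned by mpirun sees all GPUs of its node; restrict it
# to the single GPU that matches its local rank, so that the data fraction it
# loads stays close to the NUMA domain attached to that GPU.
config = tf.ConfigProto()
config.gpu_options.visible_device_list = str(hvd.local_rank())
config.gpu_options.allow_growth = True
K.set_session(tf.Session(config=config))
```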
The training data consist of 200,000 electron showers, spread over 80 files, which amount to a total of 12 GB on disk. 90% of the dataset was used for training and the remainder for testing. Typically, within the training phase, one computes both the training loss (through backpropagation) and the test loss (through feed-forward propagation). To evaluate the overall runtime we take both computations into account, but for the scalability analysis we take only training into account. A full training consists of 40 epochs, but for the scalability test we timed only the training time per epoch, since each epoch is of the same duration. In order to warm up the GPUs on the machine, we ran five warm-up epochs for every measurement. The choice of the mini-batch size for GANs is still an open research question, but empirically we observed that larger mini-batch sizes reduced the overall training time while resulting in the same convergence of the loss functions. We therefore fixed the mini-batch size to 512, the largest batch size with which we could train without running into out-of-memory issues.
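The paper does not spell out how the 80 files are divided among the workers; a plausible sketch, with a hypothetical helper shard_files, is the round-robin assignment below, so that each GPU ingests a disjoint subset of the training data.

```python
import horovod.keras as hvd

hvd.init()

def shard_files(file_list):
    """Hypothetical helper: assign every worker a disjoint subset of the files."""
    files = sorted(file_list)
    return [f for i, f in enumerate(files) if i % hvd.size() == hvd.rank()]

# Example: with 80 files and 12 workers, each rank reads 6 or 7 files and
# iterates over its shard in mini-batches of 512 showers per training step.
```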
6 Results
We performed a full training on the POWER8 cluster on a single GPU and measured the wall time to be 20 h and 16 min. On 12 GPUs the total training time is reduced to 2 h and 14 min, a speedup of approximately 9. The profiling results show that the main reason for the discrepancy with the ideal speedup is the single-GPU evaluation of the test losses. Unlike the training of the GAN, this evaluation is not executed in a distributed manner. Following Amdahl's law [9], the maximum theoretical speedup is limited by the serial fraction of the code, which in this case is the code for computing the test losses. Compared to previous work on distributed training of the same neural architecture on 16 P100 GPUs (2 h for 3 epochs) [23], this is a significant improvement.
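To make the Amdahl argument concrete, the measured wall times imply a serial fraction of roughly 3%; the following back-of-the-envelope estimate is ours, not taken from the profiler:

```latex
% Amdahl's law with parallel fraction p on N GPUs:
S(N) = \frac{1}{(1-p) + p/N}, \qquad
S(12) \approx \frac{1216\ \mathrm{min}}{134\ \mathrm{min}} \approx 9.1
% Solving for the serial fraction:
1-p = \frac{1/S(12) - 1/12}{1 - 1/12} \approx \frac{0.110 - 0.083}{0.917} \approx 0.03
```

Under this estimate, the non-distributed test-loss evaluation accounts for roughly 3% of the single-GPU runtime, which would cap the achievable speedup at about 1/(1-p) ≈ 34 regardless of the number of GPUs.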
For the scalability benchmark, we excluded the computation of the test losses from the training phase. We performed the distributed training of the GAN for different numbers of GPUs and measured the time per epoch (averaged over five epochs). The speedup with respect to training on a single GPU is plotted in Fig. 3. We observe strong scaling performance as we increase the number of GPUs, with a scaling efficiency of 98.9% when utilizing the entire cluster. Profiling results from NVIDIA's GPU profiling tool nvprof show that the data transfer between neighboring nodes in the ring all-reduce scheme does not saturate the NVLink bandwidth (see Fig. 2), and therefore the scalability of the training depends on the computational throughput of the GPUs.
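For reference, assuming the usual strong-scaling definition, the scaling efficiency is the ratio of the measured speedup to the ideal speedup:

```latex
\mathrm{efficiency}(N) = \frac{S(N)}{N} = \frac{t_{\mathrm{epoch}}(1)}{N \cdot t_{\mathrm{epoch}}(N)}
```

so an efficiency of 98.9% on 12 GPUs corresponds to a per-epoch speedup of roughly 11.9.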

Fig. 2. Partial nvprof profiling output for one of the GPUs in a 12-GPU training, showing that training is compute-bound.

Fig. 3. Speedup as a function of the number of P100 GPUs, and the corresponding scaling efficiency.
7 Future Work
As part of the IBM PowerAI framework [25], the IBM DDL (Distributed Deep Learning) library offers, similarly to Horovod, an implementation of distributed deep learning. DDL offers multiple methods to perform the distributed stochastic gradient descent algorithm. It would be interesting to explore these options in an effort to improve the overall training time.
Large Model Support (LMS) [13] is a feature of the IBM PowerAI framework that enables the training of deep neural networks (DNNs) that would otherwise exceed the memory capacity of a GPU. LMS achieves this by swapping the feature maps of DNNs in and out of GPU memory, keeping only the data that is required at a given stage of the training pipeline in GPU memory. This approach puts a larger burden on the CPU-GPU interconnect and could possibly saturate its bandwidth. The high-bandwidth NVLink interconnect between the CPU and the GPUs in modern POWER systems, however, makes LMS an appealing approach to scaling out the training of DNNs. We would be interested to find out how LMS performs compared with the data-parallel approach used in this work.