© Springer Nature Switzerland AG 2019
M. Weiland et al. (Eds.): High Performance Computing, Lecture Notes in Computer Science, vol. 11887. https://doi.org/10.1007/978-3-030-34356-9_6

Singularity GPU Containers Execution on HPC Cluster

Giuseppa Muscianisi1 (Corresponding author), Giuseppe Fiameni1 and Abdulrahman Azab2

(1) CINECA - Interuniversity Consortium, Bologna, Italy
(2) University of Oslo, Oslo, Norway

Abstract

This paper describes how to use the Singularity containerization tool on an HPC cluster equipped with GPU accelerators. The application chosen for benchmarking is Tensorflow, the open-source software library for machine learning. The Singularity containers built for this work were run on the GALILEO HPC cluster at CINECA. A performance comparison between bare-metal and containerized executions is also provided, showing a negligible difference in the number of images computed per second.

Keywords

Singularity · GPU · Tensorflow

1 Introduction

The use of container technology has increased in recent years thanks to the diffusion of Docker [1] among researchers, both for sharing their applications and for simplifying installation processes on host systems.

Within a container, the same software can run on different architectures without having to install it and its dependencies on each platform. Since containers provide a lightweight virtualization of the host resources, a software stack can be installed once inside the container, and the container can then be moved to and run on multiple platforms.

Thanks to Docker, the widespread creation and sharing of containers among researchers has motivated their use in the HPC context as well. Unfortunately, Docker is not well suited to HPC environments, because running a Docker container requires privileged access rights, which may enable malicious actions during container execution.

Over the years, several containerization tools have been developed to overcome this security limitation. Among them, since its first official release in 2017, Singularity [2, 3] has established itself as one of the best tools for executing containerized applications in HPC environments.

Singularity makes containers portable and thus improves the reproducibility of scientific workflows. It is "secure" in the sense that container execution takes place in user space, without daemons running on the host, as happens with Docker.

Through Singularity it is possible to run parallel applications on HPC clusters, including MPI, OpenMP and hybrid codes. It can also leverage the accelerators often available in supercomputing systems, such as GPUs, FPGAs or Intel MICs. Singularity integrates with the workload managers commonly used in HPC environments to orchestrate job execution, such as Torque [4], PBS Pro [5] or SLURM [6]: from a user perspective, running a bare-metal or a containerized application is exactly the same, as illustrated by the batch script sketched below.
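As an illustration of this integration, a SLURM job can wrap the containerized application in a plain singularity call. The following is only a minimal sketch of such a batch script; the module name, image name, script name and resource requests are assumptions and have to be adapted to the local cluster configuration:

    #!/bin/bash
    #SBATCH --job-name=tf-container
    #SBATCH --nodes=1
    #SBATCH --ntasks-per-node=1
    #SBATCH --gres=gpu:2
    #SBATCH --time=01:00:00

    # Hypothetical module name: load the Singularity installation provided by the site.
    module load singularity

    # The containerized application is launched exactly like a bare-metal one.
    singularity exec --nv tensorflow_gpu.simg python my_training_script.py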

In this article, the execution of Singularity GPU containers on an HPC cluster is presented. The application chosen for benchmarking is Tensorflow [7], the widely used open-source software library for machine learning. Because of its many dependencies at installation time, it is a good candidate for containerization. The Singularity containers built for this work were run on the GALILEO supercomputing cluster located at CINECA, the Italian Interuniversity Consortium [8].

In Sect. 2, both the container building process and its execution on the GALILEO cluster are described. In Sect. 3 the benchmarking results are presented. Finally, in Sect. 4 some concluding considerations are reported.

2 Singularity GPU Containers Building and Running

By design, a container is platform-agnostic and also software-agnostic. Nowadays, however, HPC systems are equipped with accelerators such as GPUs, FPGAs or Intel MICs, and a considerable amount of software and libraries has been written and designed to use such specialized hardware.

To be really useful, a containerized application therefore has to be able to use this kind of hardware. One solution is to install the accelerator-specific drivers and libraries inside the container. For nVidia GPUs, however, Singularity provides another solution: it supports GPUs "natively", in the sense that the relevant nVidia/CUDA libraries on the host are found via the ld.so.cache and are automatically bound into a library location within the container. To enable the usage of GPUs in a Singularity container, it is sufficient to add the "--nv" flag to the singularity command line [3]. The CUDA Toolkit has to be available in the container, with a version compatible with the GPU driver version installed on the host.
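For example, assuming an image named tensorflow_gpu.simg (a hypothetical name), a GPU-enabled command can be launched inside the container as follows:

    # Check that the host GPUs are visible from inside the container.
    singularity exec --nv tensorflow_gpu.simg nvidia-smi

    # Run a Python script using the GPU-enabled Tensorflow installed in the container.
    singularity exec --nv tensorflow_gpu.simg python train.py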

Different Singularity GPU containers have been built, either by bootstrapping from Docker or Singularity containers already available on the official Docker Hub [9] and Singularity Hub [10], or by bootstrapping from the official Ubuntu or CentOS distributions. In the first case, only a few modifications were needed, such as adding a directory for test purposes or installing additional packages. In all cases, the tool proved simple to use, both when building and running containers from scratch and when using pre-built ones.

In the tests presented here, two Singularity containers with the widely used machine learning framework Tensorflow have been built, both with GPU libraries available, in versions 1.10.1 and 1.12.1.

In accordance with the philosophy of reproducibility and sharing that characterizes containerization techniques, it was decided to use the pre-built, officially released containers. The containers were bootstrapped from the Docker images available in the official Docker Hub repository of the Tensorflow community [12].

To build the containers, the skeleton recipe reported in Table 1 has been used:

Table 1. Singularity recipe skeleton used to build the GPU container.

In the recipe, the directory "test" is local to the container and has been used to bind the input files previously downloaded on the host.
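Since the recipe in Table 1 is reproduced only as an image in the published version, the following is a minimal sketch of such a skeleton, assuming the tensorflow/tensorflow:1.12.0-gpu Docker image as bootstrap source; the exact tag and the additional packages are assumptions:

    Bootstrap: docker
    From: tensorflow/tensorflow:1.12.0-gpu

    %post
        # Directory local to the container, used to bind the input files
        # previously downloaded on the host.
        mkdir -p /test
        # Any additional packages needed by the benchmarks would be installed here.

Depending on the Singularity release, the image can then be built with "singularity build <image> <recipe>" (version 2.4 and later) or with the older "create"/"bootstrap" workflow of the 2.3 series.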

3 Benchmark

3.1 Systems Description

GALILEO. GALILEO [8] is the CINECA Tier-1 "national level" supercomputer. First introduced in 2015, it was reconfigured at the beginning of 2018 with Intel Xeon E5-2697 v4 (Broadwell) nodes. GALILEO also provides a pool of 15 nodes, each equipped with 2 x 8-core Intel Haswell processors at 2.40 GHz, 2 nVidia K80 GPUs and 128 GB of RAM, interconnected by an Infiniband network with 4x QDR switches.

3.2 Test Case 1: Containerized Tensorflow Execution on GALILEO Versus Official Tensorflow Performance Data

A Singularity container has been built directly from the images available on the official Tensorflow Docker Hub. The version of Tensorflow used is 1.10.0 for GPU. The container was built with Singularity version 2.3.1, while version 2.5.2 was used on GALILEO. For the container test, the ResNet-50 [16] neural network has been considered, with a batch size of 32 over the ImageNet [18] data set. The results obtained have been compared with those provided in the official Tensorflow benchmarks for the nVidia Tesla K80 [11] and are reported in Table 2. As the numbers show, the usage of the containerized application does not introduce any significant overhead in the computations, since the number of images computed per second remains of the same order of magnitude.
Table 2. Number of images computed per second for the containerized Tensorflow run on a single GALILEO node with 2 nVidia Tesla K80 cards, compared with the values reported in the official Tensorflow performance documentation.

             GALILEO nVidia Tesla K80   Official Tensorflow nVidia Tesla K80
  1 gpu      54.968                     52
  2 gpus     107.916                    99
  3 gpus     194.15                     195

Fig. 1. Number of images per second computed with the ResNet-50, AlexNet, VGG16, InceptionV3 and GoogleNet models for 1, 2, 3 and 4 K80 nVidia GPUs on a single GALILEO node. The batch size used is 32, the dataset is ImageNet (synthetic).

3.3 Test Case 2: Containerized Versus Bare Metal Execution on GALILEO

A Singularity container has been built directly from the images available on the official Tensorflow Docker Hub. The version of Tensorflow used is 1.12.0 for GPU. The container was built with Singularity version 3.0.2, the same version available as a module on GALILEO.
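With Singularity 3.x, such a container can also be built directly from the Docker Hub image with a single command; the output file name below is a hypothetical choice:

    # Build a Singularity image (SIF) from the official Tensorflow GPU Docker image.
    singularity build tensorflow-1.12.0-gpu.sif docker://tensorflow/tensorflow:1.12.0-gpu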
Fig. 2. Number of images per second computed with the ResNet-50, AlexNet, VGG16, InceptionV3 and GoogleNet models for 1, 2, 3 and 4 K80 nVidia GPUs on a single GALILEO node. The batch size used is 64, the dataset is ImageNet (synthetic).

Fig. 3. Number of images per second computed with the ResNet-50, AlexNet and GoogleNet models for 1, 2, 3 and 4 K80 nVidia GPUs on a single GALILEO node. The batch size used is 128, the dataset is ImageNet (synthetic).

For both the bare-metal and the container tests, five different neural networks have been considered: AlexNet [13], GoogLeNet [14], InceptionV3 [15], ResNet-50 [16] and VGG16 [17]. The dataset was ImageNet [18] (synthetic) and three different batch sizes were analyzed: 32, 64 and 128 per device, where the device is the GPU. Each test was repeated 6 times and the results averaged. All the runs were executed on a single node with 2 x 8-core Intel Xeon E5-2630 v3 processors at 2.40 GHz and 2 nVidia K80 GPUs. A typical benchmark invocation is sketched below.
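The exact benchmark command is not reported in the text; assuming the standard tf_cnn_benchmarks script from the tensorflow/benchmarks repository, which is also the basis of the official Tensorflow performance numbers, a pair of comparable runs could look like the following, where the image name is hypothetical:

    # Bare-metal run, with Tensorflow installed on the host.
    python tf_cnn_benchmarks.py --model=resnet50 --batch_size=32 --num_gpus=2

    # Containerized run: the same script executed inside the Singularity image.
    singularity exec --nv tensorflow-1.12.0-gpu.sif \
        python tf_cnn_benchmarks.py --model=resnet50 --batch_size=32 --num_gpus=2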
Table 3. Number of images per second computed with the ResNet-50, AlexNet, VGG16, InceptionV3 and GoogleNet models for 1, 2, 3 and 4 K80 nVidia GPUs on a single GALILEO node. The batch size used is 32, the dataset is ImageNet (synthetic). The standard deviation of the measurements is also reported.

  Model        GPUs     Bare metal      Container
  ResNet-50    1 gpu    54.9 ± 0.83     54.87 ± 0.98
               2 gpus   102.2 ± 1.3     101.4 ± 1.5
               3 gpus   155.4 ± 2.8     155.4 ± 2.3
               4 gpus   197.8 ± 1.9     198.3 ± 1.4
  AlexNet      1 gpu    408.6 ± 5.2     406.3 ± 3.8
               2 gpus   578.1 ± 5.4     580.0 ± 4.2
               3 gpus   704 ± 24        686 ± 25
               4 gpus   699 ± 20        699 ± 14
  VGG16        1 gpu    38.54 ± 0.38    38.32 ± 0.43
               2 gpus   66.12 ± 0.82    66.06 ± 0.47
               3 gpus   67.36 ± 0.60    66.61 ± 0.51
               4 gpus   115.4 ± 1.7     115.9 ± 1.5
  InceptionV3  1 gpu    33.92 ± 0.34    33.85 ± 0.41
               2 gpus   62.17 ± 0.58    62.43 ± 0.86
               3 gpus   94.4 ± 1.3      94.1 ± 1.4
               4 gpus   123.4 ± 1.4     124.2 ± 1.4
  GoogleNet    1 gpu    129.9 ± 1.8     129.9 ± 2.5
               2 gpus   237.2 ± 4.7     237.0 ± 5.1
               3 gpus   360.3 ± 4.9     359.8 ± 4.3
               4 gpus   462.9 ± 3.7     461.4 ± 7.4

Table 4. Number of images per second computed with the ResNet-50, AlexNet, VGG16, InceptionV3 and GoogleNet models for 1, 2, 3 and 4 K80 nVidia GPUs on a single GALILEO node. The batch size used is 64, the dataset is ImageNet (synthetic). The standard deviation of the measurements is also reported.

  Model        GPUs     Bare metal      Container
  ResNet-50    1 gpu    59.19 ± 0.62    58.43 ± 0.60
               2 gpus   108.97 ± 0.94   107.7 ± 1.8
               3 gpus   164.3 ± 2.6     163.0 ± 1.1
               4 gpus   213.6 ± 3.0     211.9 ± 3.4
  AlexNet      1 gpu    568.1 ± 3.1     565.3 ± 7.1
               2 gpus   917.5 ± 3.8     908.1 ± 11.7
               3 gpus   1167 ± 55       1174 ± 22
               4 gpus   1316 ± 38       1299 ± 28
  VGG16        1 gpu    40.98 ± 0.35    40.95 ± 0.42
               2 gpus   72.64 ± 0.48    72.36 ± 0.99
               3 gpus   71.77 ± 0.72    71.38 ± 0.84
               4 gpus   134.9 ± 2.1     134.8 ± 1.8
  InceptionV3  1 gpu    36.01 ± 0.52    35.99 ± 0.30
               2 gpus   66.10 ± 0.78    65.90 ± 0.51
               3 gpus   99.1 ± 1.2      100.2 ± 1.2
               4 gpus   131.6 ± 1.4     130.8 ± 1.4
  GoogleNet    1 gpu    139.9 ± 1.9     142.0 ± 2.0
               2 gpus   259.4 ± 2.1     260.1 ± 5.2
               3 gpus   400.1 ± 3.8     397.4 ± 4.5
               4 gpus   513.6 ± 3.9     509.3 ± 5.2

Table 5. Number of images per second computed with the ResNet-50, AlexNet and GoogleNet models for 1, 2, 3 and 4 K80 nVidia GPUs on a single GALILEO node. The batch size used is 128, the dataset is ImageNet (synthetic). The standard deviation of the measurements is also reported.

  Model        GPUs     Bare metal      Container
  ResNet-50    1 gpu    61.94 ± 0.72    61.18 ± 0.65
               2 gpus   113.2 ± 1.3     111.8 ± 1.7
               3 gpus   171 ± 2.4       170.3 ± 1.1
               4 gpus   203 ± 3         202.1 ± 2.8
  AlexNet      1 gpu    626.7 ± 9.5     629 ± 15
               2 gpus   1115 ± 12       1108 ± 16
               3 gpus   1544 ± 25       1534.3 ± 5.3
               4 gpus   1957 ± 20       1920 ± 72
  GoogleNet    1 gpu    147.4 ± 2.2     149.2 ± 2.4
               2 gpus   277.9 ± 2.4     276.6 ± 2.4
               3 gpus   414.2 ± 7.2     412.7 ± 3.3
               4 gpus   539.4 ± 8.5     538.9 ± 4.9

The number of images per second is reported in Figs. 1, 2 and 3. For a fixed batch size, each histogram shows the number of images per second computed by each of the neural network models described above, for 1, 2, 3 and 4 GPUs on a single GALILEO node. Note that the VGG16 and InceptionV3 models with a batch size of 128 are not shown because those runs ran out of memory. As shown, no overhead is introduced by running the container instead of the bare-metal application, since the number of images computed per second is essentially the same.

In Tables 3, 4 and 5, the number of images per second is reported for each test, together with the standard deviation of the measurements.

4 Conclusion

A possible way to run Singularity GPU containers has been presented. The containers were built starting from the images available in the Tensorflow Docker Hub repository. Two performance tests have been described: in the first, the data obtained by running a containerized version of Tensorflow on the GALILEO cluster were compared with those available on the official Tensorflow performance webpage; in the second, bare-metal and containerized executions, both on GALILEO, were compared. Both tests show that running a GPU-containerized version of Tensorflow does not introduce any relevant overhead or performance issue.