1 Introduction
The use of container technology has increased in recent years thanks to the diffusion of Docker [1] among researchers, both for sharing their applications and for simplifying installation on host systems.
With containers, the same software can be run on different systems without reinstalling it and its dependencies on each platform. Since containers provide a light-weight virtualization of the resources available on the host, a software stack can be installed once inside a container, and the container can then be moved to and run on multiple platforms.
Thanks to Docker, the widespread creation and sharing of containers among researchers has motivated their adoption also in the HPC context. Unfortunately, Docker is not well suited to HPC environments, because running a Docker container requires privileged access rights, which can open the door to malicious actions during container execution.
Over the years, several containerization tools have been developed to overcome this security limitation. Among them, since its first official release in 2017, Singularity [2, 3] has established itself as one of the best tools available for executing containerized applications in HPC environments.
Singularity makes containers portable and thus improves the reproducibility of scientific workflows. It is “secure” in the sense that the container is executed in user space, without daemons running on the host as happens with Docker.
With Singularity it is possible to run parallel applications on an HPC cluster, including MPI, OpenMP and hybrid codes. It is also possible to leverage the accelerators, such as GPUs, FPGAs or Intel MICs, that are often available in a supercomputing system. Singularity integrates with the common workload managers used to orchestrate job execution in HPC environments, such as Torque [4], PBS Pro [5] or SLURM [6]: from the user's perspective, running a bare metal or a containerized application is exactly the same.
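A containerized application can thus be submitted through the workload manager in the same way as a native one. As an illustration, a minimal SLURM batch script might look like the sketch below; the partition name, module name, image file and script name are placeholders and do not refer to the actual GALILEO configuration.

```bash
#!/bin/bash
#SBATCH --job-name=tf-singularity
#SBATCH --nodes=1
#SBATCH --ntasks-per-node=16
#SBATCH --gres=gpu:2              # request two GPU devices on the node
#SBATCH --time=01:00:00
#SBATCH --partition=gpu_partition # placeholder: site-specific partition name

# load the Singularity module (name and version are site-specific)
module load singularity

# run the containerized application exactly as a bare metal one
singularity exec --nv tensorflow-gpu.simg python my_script.py
```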
In this article, the execution of Singularity GPU containers on an HPC cluster is presented. The application chosen for benchmarking is Tensorflow [7], the widely used open-source software library for machine learning. Because of its many dependencies at installation time, it is a good candidate for containerization. The Singularity containers built for this work have been run on the GALILEO supercomputing cluster, located at CINECA, the Italian Interuniversity Consortium [8].
In Sect. 2 the container building process and the runs on the GALILEO cluster are described. In Sect. 3 the benchmarking results are presented. Finally, in Sect. 4 some considerations are reported.
2 Singularity GPU Containers Building and Running
By design, a container is platform-agnostic and also software-agnostic. Nowadays, however, HPC systems are equipped with accelerators, such as GPUs, FPGAs or Intel MICs, and a large number of software packages and libraries have been written and designed to exploit such specialized hardware.
To be really useful, a containerized application therefore has to be able to use this kind of hardware. One solution, of course, is to install the accelerator-specific drivers and libraries inside the container. For nVidia GPUs, however, Singularity provides another solution: it supports GPUs “natively”, in the sense that all the relevant nVidia/Cuda libraries on the host are found via the ld.so.cache and are automatically bound into a library location inside the container. To enable the usage of GPUs in a Singularity container, it is sufficient to add the flag “--nv” to the singularity command line [3]. The Cuda Toolkit has to be available in the container, with a version compatible with the driver installed for the GPUs on the host.
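For example, a quick way to check that the host GPUs are visible from inside a container is to run nvidia-smi through this flag; the image name below is purely illustrative.

```bash
# --nv binds the host nVidia driver libraries and tools into the container
singularity exec --nv tensorflow-gpu.simg nvidia-smi
```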
Different Singularity GPU containers have been built, either by bootstrapping from Docker or Singularity containers already available on the official Docker Hub [9] and Singularity Hub [10], or by bootstrapping from the official Ubuntu or CentOS distributions. In the first case only a few modifications were needed, such as adding a directory for test purposes or installing additional packages. In all cases the tool proved simple to use, both when building and running containers from scratch and when using pre-built ones.
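For instance, pre-built images can be retrieved from the two hubs with the pull command; the image names below are generic examples, not the exact images used in this work.

```bash
# pull an image from Docker Hub (converted on the fly into a Singularity image)
singularity pull docker://ubuntu:18.04

# pull a pre-built image from Singularity Hub
singularity pull shub://vsoch/hello-world
```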
For the tests presented here, two Singularity containers with the widely used machine learning framework “Tensorflow” have been built, both with the GPU libraries available, for versions 1.10.1 and 1.12.1.
In accordance with the philosophy of reproducibility and sharing that characterizes containerization techniques, it was decided to start from the pre-built, officially released containers. The containers have been bootstrapped from the Docker images available in the official Docker Hub repository of the Tensorflow community [12].
Singularity recipe skeleton used to build the GPU container.
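The original listing is not reproduced here; the following is a minimal sketch of such a recipe, assuming the Tensorflow GPU image from Docker Hub as bootstrap source (the tag and the additional package are illustrative); the image itself would then be built with the singularity build command.

```
Bootstrap: docker
# illustrative tag; the exact Docker Hub tag is not reported here
From: tensorflow/tensorflow:1.10.1-gpu

%post
    # directory used at run time as a mount point for input files on the host
    mkdir -p /test
    # install any additional packages needed for the tests (illustrative)
    apt-get update && apt-get install -y --no-install-recommends wget

%environment
    export LC_ALL=C

%runscript
    exec python "$@"
```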

where the directory “test” is local to the container and has been used as a mount point to bind the input files previously downloaded on the host.
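A run on a GALILEO GPU node could then look like the following; the benchmark invocation is an assumption based on the standard tf_cnn_benchmarks.py script from the official tensorflow/benchmarks repository, here supposed to have been downloaded into the host test directory, since the exact command line used is not reported.

```bash
# bind the host directory with the downloaded input files onto /test in the container
# and expose the host GPUs to the containerized Tensorflow
singularity exec --nv -B $HOME/test:/test tensorflow-gpu.simg \
    python /test/tf_cnn_benchmarks.py --model=resnet50 --batch_size=32 --num_gpus=2
# without --data_dir the benchmark generates synthetic ImageNet data
```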
3 Benchmark
3.1 Systems Description
GALILEO. GALILEO [8] is the CINECA Tier-1 “national” level supercomputer. First introduced in 2015, it was reconfigured at the beginning of 2018 with Intel Xeon E5-2697 v4 (Broadwell) nodes. GALILEO also provides a pool of 15 nodes, each equipped with 2 x 8-core Intel Haswell 2.40 GHz processors, 2 nVidia K80 GPUs and 128 GB of RAM, interconnected by an Infiniband 4x QDR internal network.
3.2 Test Case 1: Containerized Tensorflow Execution on GALILEO Versus Official Tensorflow Performance Data
The number of images computed per second by the containerized Tensorflow running on one GALILEO node with 2 nVidia Tesla K80 GPUs is compared with the values reported in the official Tensorflow performance documentation.
| GPUs number | Containerized Tensorflow on GALILEO (nVidia Tesla K80) | Official Tensorflow data (nVidia Tesla K80) |
|---|---|---|
| 1 gpu | 54,968 | 52 |
| 2 gpus | 107,916 | 99 |
| 3 gpus | 194,15 | 195 |

Fig. 1. Number of images per second computed with the ResNet-50, AlexNet, VGG16, InceptionV3 and GoogleNet models for 1, 2, 3 and 4 nVidia K80 GPUs on a single GALILEO node. The batch size is 32, the dataset is ImageNet (synthetic).
3.3 Test Case 2: Containerized Versus Bare Metal Execution on GALILEO

Fig. 2. Number of images per second computed with the ResNet-50, AlexNet, VGG16, InceptionV3 and GoogleNet models for 1, 2, 3 and 4 nVidia K80 GPUs on a single GALILEO node. The batch size is 64, the dataset is ImageNet (synthetic).

Fig. 3. Number of images per second computed with the ResNet-50, AlexNet and GoogleNet models for 1, 2, 3 and 4 nVidia K80 GPUs on a single GALILEO node. The batch size is 128, the dataset is ImageNet (synthetic).
Table 3. Number of images per second computed with the ResNet-50, AlexNet, VGG16, InceptionV3 and GoogleNet models for 1, 2, 3 and 4 nVidia K80 GPUs on a single GALILEO node. The batch size is 32, the dataset is ImageNet (synthetic). The standard deviation of the measurements is also reported.
| Model | GPUs number | Bare metal | Container |
|---|---|---|---|
| ResNet-50 | 1 gpu | 54,9 ± 0,83 | 54,87 ± 0,98 |
| | 2 gpus | 102,2 ± 1,3 | 101,4 ± 1,5 |
| | 3 gpus | 155,4 ± 2,8 | 155,4 ± 2,3 |
| | 4 gpus | 197,8 ± 1,9 | 198,3 ± 1,4 |
| AlexNet | 1 gpu | 408,6 ± 5,2 | 406,3 ± 3,8 |
| | 2 gpus | 578,1 ± 5,4 | 580,0 ± 4,2 |
| | 3 gpus | 704 ± 24 | 686 ± 25 |
| | 4 gpus | 699 ± 20 | 699 ± 14 |
| VGG16 | 1 gpu | 38,54 ± 0,38 | 38,32 ± 0,43 |
| | 2 gpus | 66,12 ± 0,82 | 66,06 ± 0,47 |
| | 3 gpus | 67,36 ± 0,60 | 66,61 ± 0,51 |
| | 4 gpus | 115,4 ± 1,7 | 115,9 ± 1,5 |
| InceptionV3 | 1 gpu | 33,92 ± 0,34 | 33,85 ± 0,41 |
| | 2 gpus | 62,17 ± 0,58 | 62,43 ± 0,86 |
| | 3 gpus | 94,4 ± 1,3 | 94,1 ± 1,4 |
| | 4 gpus | 123,4 ± 1,4 | 124,2 ± 1,4 |
| GoogleNet | 1 gpu | 129,9 ± 1,8 | 129,9 ± 2,5 |
| | 2 gpus | 237,2 ± 4,7 | 237,0 ± 5,1 |
| | 3 gpus | 360,3 ± 4,9 | 359,8 ± 4,3 |
| | 4 gpus | 462,9 ± 3,7 | 461,4 ± 7,4 |
Table 4. Number of images per second computed with the ResNet-50, AlexNet, VGG16, InceptionV3 and GoogleNet models for 1, 2, 3 and 4 nVidia K80 GPUs on a single GALILEO node. The batch size is 64, the dataset is ImageNet (synthetic). The standard deviation of the measurements is also reported.
| Model | GPUs number | Bare metal | Container |
|---|---|---|---|
| ResNet-50 | 1 gpu | 59,19 ± 0,62 | 58,43 ± 0,60 |
| | 2 gpus | 108,97 ± 0,94 | 107,7 ± 1,8 |
| | 3 gpus | 164,3 ± 2,6 | 163,0 ± 1,1 |
| | 4 gpus | 213,6 ± 3,0 | 211,9 ± 3,4 |
| AlexNet | 1 gpu | 568,1 ± 3,1 | 565,3 ± 7,1 |
| | 2 gpus | 917,5 ± 3,8 | 908,1 ± 11,7 |
| | 3 gpus | 1167 ± 55 | 1174 ± 22 |
| | 4 gpus | | |
| VGG16 | 1 gpu | 40,98 ± 0,35 | 40,95 ± 0,42 |
| | 2 gpus | 72,64 ± 0,48 | 72,36 ± 0,99 |
| | 3 gpus | 71,77 ± 0,72 | 71,38 ± 0,84 |
| | 4 gpus | 134,9 ± 2,1 | 134,8 ± 1,8 |
| InceptionV3 | 1 gpu | 36,01 ± 0,52 | 35,99 ± 0,30 |
| | 2 gpus | 66,10 ± 0,78 | 65,90 ± 0,51 |
| | 3 gpus | 99,1 ± 1,2 | 100,2 ± 1,2 |
| | 4 gpus | 131,6 ± 1,4 | 130,8 ± 1,4 |
| GoogleNet | 1 gpu | 139,9 ± 1,9 | 142,0 ± 2,0 |
| | 2 gpus | 259,4 ± 2,1 | 260,1 ± 5,2 |
| | 3 gpus | 400,1 ± 3,8 | 397,4 ± 4,5 |
| | 4 gpus | 513,6 ± 3,9 | 509,3 ± 5,2 |
Table 5. Number of images per second computed with the ResNet-50, AlexNet and GoogleNet models for 1, 2, 3 and 4 nVidia K80 GPUs on a single GALILEO node. The batch size is 128, the dataset is ImageNet (synthetic). The standard deviation of the measurements is also reported.
| Model | GPUs number | Bare metal | Container |
|---|---|---|---|
| ResNet-50 | 1 gpu | 61,94 ± 0,72 | 61,18 ± 0,65 |
| | 2 gpus | 113,2 ± 1,3 | 111,8 ± 1,7 |
| | 3 gpus | 171 ± 2,4 | 170,3 ± 1,1 |
| | 4 gpus | 203 ± 3 | 202,1 ± 2,8 |
| AlexNet | 1 gpu | 626,7 ± 9,5 | 629 ± 15 |
| | 2 gpus | | |
| | 3 gpus | 1534,3 ± 5,3 | |
| | 4 gpus | | |
| GoogleNet | 1 gpu | 147,4 ± 2,2 | 149,2 ± 2,4 |
| | 2 gpus | 277,9 ± 2,4 | 276,6 ± 2,4 |
| | 3 gpus | 414,2 ± 7,2 | 412,7 ± 3,3 |
| | 4 gpus | 539,4 ± 8,5 | 538,9 ± 4,9 |
The number of images per second is reported in Figs. 1, 2 and 3. For a fixed batch size, each histogram shows the number of images per second computed with each neural network model described above, for 1, 2, 3 and 4 GPUs on a single GALILEO node. Note that the VGG16 and InceptionV3 models with a batch size of 128 are not shown because those runs went out of memory. As the figures show, no overhead is introduced by running the container instead of the bare metal application: the number of images computed per second is essentially the same.
In Tables 3, 4 and 5, the number of images per second is reported for each test, together with the standard deviation of the measurements.
4 Conclusion
A possible way to run Singularity GPU containers has been presented. The containers were built starting from those available in the Tensorflow Docker Hub repository. Two performance tests have been described: in the first, the data obtained by running a containerized version of Tensorflow on the GALILEO cluster were compared with those available on the official Tensorflow performance webpage; in the second, bare metal and containerized executions, both on GALILEO, were compared. Both tests show that running a GPU containerized version of Tensorflow introduces no relevant overhead or performance issues.