1 Introduction
The use of container technology has increased in recent years thanks to the diffusion of Docker [1] among researchers, both for sharing their applications and for simplifying installation on host systems.
With containers, the same software can be run on different systems without reinstalling it and its dependencies on each platform. Since containers provide a light-weight virtualization of the resources available on the host, a software stack can be installed once inside a container, and the container can then be moved to and run on multiple platforms.
Thanks to Docker, the widespread creation and sharing of containers among researchers has motivated their adoption also in the HPC context. Unfortunately, Docker is not well suited to HPC environments, because running a Docker container requires privileged access rights, which can open the door to malicious actions during container execution.
Over the years, several containerization tools have been developed to overcome this security limitation. Among them, since its first official release in 2017, Singularity [2, 3] has established itself as one of the best tools available for executing containerized applications in HPC environments.
Singularity makes containers portable and thus improves the reproducibility of scientific workflows. It is “secure” in the sense that the container is executed in user space, without daemons running on the host as happens with Docker.
With Singularity it is possible to run parallel applications on an HPC cluster, including MPI, OpenMP and hybrid codes. It is also possible to leverage the accelerators, such as GPUs, FPGAs or Intel MICs, that are often available in a supercomputing system. Singularity integrates with the common workload managers used to orchestrate job execution in HPC environments, such as Torque [4], PBS Pro [5] or SLURM [6]: from the user's perspective, running a bare metal or a containerized application is exactly the same.
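A containerized application can thus be submitted through the workload manager in the same way as a native one. As an illustration, a minimal SLURM batch script might look like the sketch below; the partition name, module name, image file and script name are placeholders and do not refer to the actual GALILEO configuration.

```bash
#!/bin/bash
#SBATCH --job-name=tf-singularity
#SBATCH --nodes=1
#SBATCH --ntasks-per-node=16
#SBATCH --gres=gpu:2              # request two GPU devices on the node
#SBATCH --time=01:00:00
#SBATCH --partition=gpu_partition # placeholder: site-specific partition name

# load the Singularity module (name and version are site-specific)
module load singularity

# run the containerized application exactly as a bare metal one
singularity exec --nv tensorflow-gpu.simg python my_script.py
```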
In this article, the execution of Singularity GPU containers on an HPC cluster is presented. The application chosen for benchmarking is Tensorflow [7], the widely used open-source software library for machine learning. Because of its many dependencies at installation time, it is a good candidate for containerization. The Singularity containers built for this work have been run on the GALILEO supercomputing cluster, located at CINECA, the Italian Interuniversity Consortium [8].
In Sect. 2 the container building process and the runs on the GALILEO cluster are described. In Sect. 3 the benchmarking results are presented. Finally, in Sect. 4 some considerations are reported.
2 Singularity GPU Containers Building and Running
By design, a container is platform-agnostic and also software-agnostic. Nowadays, however, HPC systems are equipped with accelerators, such as GPUs, FPGAs or Intel MICs, and a large number of software packages and libraries have been written and designed to exploit such specialized hardware.
To be really useful, a containerized application therefore has to be able to use this kind of hardware. One solution, of course, is to install the accelerator-specific drivers and libraries inside the container. For nVidia GPUs, however, Singularity provides another solution: it supports GPUs “natively”, in the sense that all the relevant nVidia/Cuda libraries on the host are found via the ld.so.cache and are automatically bound into a library location inside the container. To enable the usage of GPUs in a Singularity container, it is sufficient to add the flag “--nv” to the singularity command line [3]. The Cuda Toolkit has to be available in the container, with a version compatible with the driver installed for the GPUs on the host.
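For example, a quick way to check that the host GPUs are visible from inside a container is to run nvidia-smi through this flag; the image name below is purely illustrative.

```bash
# --nv binds the host nVidia driver libraries and tools into the container
singularity exec --nv tensorflow-gpu.simg nvidia-smi
```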
Different Singularity GPU containers have been built, either by bootstrapping from Docker or Singularity containers already available on the official Docker Hub [9] and Singularity Hub [10], or by bootstrapping from the official Ubuntu or CentOS distributions. In the first case only a few modifications were needed, such as adding a directory for test purposes or installing additional packages. In all cases the tool proved simple to use, both when building and running containers from scratch and when using pre-built ones.
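For instance, pre-built images can be retrieved from the two hubs with the pull command; the image names below are generic examples, not the exact images used in this work.

```bash
# pull an image from Docker Hub (converted on the fly into a Singularity image)
singularity pull docker://ubuntu:18.04

# pull a pre-built image from Singularity Hub
singularity pull shub://vsoch/hello-world
```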
For the tests presented here, two Singularity containers with the widely used machine learning framework “Tensorflow” have been built, both with the GPU libraries available, for versions 1.10.1 and 1.12.1.
In accordance with the philosophy of reproducibility and sharing that characterizes containerization techniques, it was decided to start from the pre-built, officially released containers. The containers have been bootstrapped from the Docker images available in the official Docker Hub repository of the Tensorflow community [12].
Singularity recipe skeleton used to build the GPU container.
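The original listing is not reproduced here; the following is a minimal sketch of such a recipe, assuming the Tensorflow GPU image from Docker Hub as bootstrap source (the tag and the additional package are illustrative); the image itself would then be built with the singularity build command.

```
Bootstrap: docker
# illustrative tag; the exact Docker Hub tag is not reported here
From: tensorflow/tensorflow:1.10.1-gpu

%post
    # directory used at run time as a mount point for input files on the host
    mkdir -p /test
    # install any additional packages needed for the tests (illustrative)
    apt-get update && apt-get install -y --no-install-recommends wget

%environment
    export LC_ALL=C

%runscript
    exec python "$@"
```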

where the directory “test” is local to the container and has been used as a mount point to bind the input files previously downloaded on the host.
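A run on a GALILEO GPU node could then look like the following; the benchmark invocation is an assumption based on the standard tf_cnn_benchmarks.py script from the official tensorflow/benchmarks repository, here supposed to have been downloaded into the host test directory, since the exact command line used is not reported.

```bash
# bind the host directory with the downloaded input files onto /test in the container
# and expose the host GPUs to the containerized Tensorflow
singularity exec --nv -B $HOME/test:/test tensorflow-gpu.simg \
    python /test/tf_cnn_benchmarks.py --model=resnet50 --batch_size=32 --num_gpus=2
# without --data_dir the benchmark generates synthetic ImageNet data
```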
3 Benchmark
3.1 Systems Description
GALILEO. GALILEO [8] is the CINECA Tier-1 “national” level supercomputer. First introduced in 2015, it was reconfigured at the beginning of 2018 with Intel Xeon E5-2697 v4 (Broadwell) nodes. GALILEO also provides a pool of 15 nodes, each equipped with 2 x 8-core Intel Haswell 2.40 GHz processors, 2 nVidia K80 GPUs and 128 GB of RAM, interconnected by an Infiniband 4x QDR internal network.
3.2 Test Case 1: Containerized Tensorflow Execution on GALILEO Versus Official Tensorflow Performance Data
The number of images computed per second by the containerized Tensorflow running on one GALILEO node with 2 nVidia Tesla K80 GPUs is compared with the values reported in the official Tensorflow performance documentation.
| GPUs number | Containerized Tensorflow on GALILEO (nVidia Tesla K80) | Official Tensorflow data (nVidia Tesla K80) |
|---|---|---|
| 1 gpu | 54,968 | 52 |
| 2 gpus | 107,916 | 99 |
| 3 gpus | 194,15 | 195 |

Fig. 1. Number of images per second computed with the ResNet-50, AlexNet, VGG16, InceptionV3 and GoogleNet models for 1, 2, 3 and 4 nVidia K80 GPUs on a single GALILEO node. The batch size is 32, the dataset is ImageNet (synthetic).
3.3 Test Case 2: Containerized Versus Bare Metal Execution on GALILEO

Fig. 2. Number of images per second computed with the ResNet-50, AlexNet, VGG16, InceptionV3 and GoogleNet models for 1, 2, 3 and 4 nVidia K80 GPUs on a single GALILEO node. The batch size is 64, the dataset is ImageNet (synthetic).

Fig. 3. Number of images per second computed with the ResNet-50, AlexNet and GoogleNet models for 1, 2, 3 and 4 nVidia K80 GPUs on a single GALILEO node. The batch size is 128, the dataset is ImageNet (synthetic).
Table 3. Number of images per second computed with the ResNet-50, AlexNet, VGG16, InceptionV3 and GoogleNet models for 1, 2, 3 and 4 nVidia K80 GPUs on a single GALILEO node. The batch size is 32, the dataset is ImageNet (synthetic). The standard deviation of the measurements is also reported.
| Model | GPUs number | Bare metal | Container |
|---|---|---|---|
| ResNet-50 | 1 gpu | 54,9 ± 0,83 | 54,87 ± 0,98 |
| | 2 gpus | 102,2 ± 1,3 | 101,4 ± 1,5 |
| | 3 gpus | 155,4 ± 2,8 | 155,4 ± 2,3 |
| | 4 gpus | 197,8 ± 1,9 | 198,3 ± 1,4 |
| AlexNet | 1 gpu | 408,6 ± 5,2 | 406,3 ± 3,8 |
| | 2 gpus | 578,1 ± 5,4 | 580,0 ± 4,2 |
| | 3 gpus | 704 ± 24 | 686 ± 25 |
| | 4 gpus | 699 ± 20 | 699 ± 14 |
| VGG16 | 1 gpu | 38,54 ± 0,38 | 38,32 ± 0,43 |
| | 2 gpus | 66,12 ± 0,82 | 66,06 ± 0,47 |
| | 3 gpus | 67,36 ± 0,60 | 66,61 ± 0,51 |
| | 4 gpus | 115,4 ± 1,7 | 115,9 ± 1,5 |
| InceptionV3 | 1 gpu | 33,92 ± 0,34 | 33,85 ± 0,41 |
| | 2 gpus | 62,17 ± 0,58 | 62,43 ± 0,86 |
| | 3 gpus | 94,4 ± 1,3 | 94,1 ± 1,4 |
| | 4 gpus | 123,4 ± 1,4 | 124,2 ± 1,4 |
| GoogleNet | 1 gpu | 129,9 ± 1,8 | 129,9 ± 2,5 |
| | 2 gpus | 237,2 ± 4,7 | 237,0 ± 5,1 |
| | 3 gpus | 360,3 ± 4,9 | 359,8 ± 4,3 |
| | 4 gpus | 462,9 ± 3,7 | 461,4 ± 7,4 |
Table 4. Number of images per second computed with the ResNet-50, AlexNet, VGG16, InceptionV3 and GoogleNet models for 1, 2, 3 and 4 nVidia K80 GPUs on a single GALILEO node. The batch size is 64, the dataset is ImageNet (synthetic). The standard deviation of the measurements is also reported.
| Model | GPUs number | Bare metal | Container |
|---|---|---|---|
| ResNet-50 | 1 gpu | 59,19 ± 0,62 | 58,43 ± 0,60 |
| | 2 gpus | 108,97 ± 0,94 | 107,7 ± 1,8 |
| | 3 gpus | 164,3 ± 2,6 | 163,0 ± 1,1 |
| | 4 gpus | 213,6 ± 3,0 | 211,9 ± 3,4 |
| AlexNet | 1 gpu | 568,1 ± 3,1 | 565,3 ± 7,1 |
| | 2 gpus | 917,5 ± 3,8 | 908,1 ± 11,7 |
| | 3 gpus | 1167 ± 55 | 1174 ± 22 |
| | 4 gpus | | |
| VGG16 | 1 gpu | 40,98 ± 0,35 | 40,95 ± 0,42 |
| | 2 gpus | 72,64 ± 0,48 | 72,36 ± 0,99 |
| | 3 gpus | 71,77 ± 0,72 | 71,38 ± 0,84 |
| | 4 gpus | 134,9 ± 2,1 | 134,8 ± 1,8 |
| InceptionV3 | 1 gpu | 36,01 ± 0,52 | 35,99 ± 0,30 |
| | 2 gpus | 66,10 ± 0,78 | 65,90 ± 0,51 |
| | 3 gpus | 99,1 ± 1,2 | 100,2 ± 1,2 |
| | 4 gpus | 131,6 ± 1,4 | 130,8 ± 1,4 |
| GoogleNet | 1 gpu | 139,9 ± 1,9 | 142,0 ± 2,0 |
| | 2 gpus | 259,4 ± 2,1 | 260,1 ± 5,2 |
| | 3 gpus | 400,1 ± 3,8 | 397,4 ± 4,5 |
| | 4 gpus | 513,6 ± 3,9 | 509,3 ± 5,2 |
Table 5. Number of images per second computed with the ResNet-50, AlexNet and GoogleNet models for 1, 2, 3 and 4 nVidia K80 GPUs on a single GALILEO node. The batch size is 128, the dataset is ImageNet (synthetic). The standard deviation of the measurements is also reported.
| Model | GPUs number | Bare metal | Container |
|---|---|---|---|
| ResNet-50 | 1 gpu | 61,94 ± 0,72 | 61,18 ± 0,65 |
| | 2 gpus | 113,2 ± 1,3 | 111,8 ± 1,7 |
| | 3 gpus | 171 ± 2,4 | 170,3 ± 1,1 |
| | 4 gpus | 203 ± 3 | 202,1 ± 2,8 |
| AlexNet | 1 gpu | 626,7 ± 9,5 | 629 ± 15 |
| | 2 gpus | | |
| | 3 gpus | 1534,3 ± 5,3 | |
| | 4 gpus | | |
| GoogleNet | 1 gpu | 147,4 ± 2,2 | 149,2 ± 2,4 |
| | 2 gpus | 277,9 ± 2,4 | 276,6 ± 2,4 |
| | 3 gpus | 414,2 ± 7,2 | 412,7 ± 3,3 |
| | 4 gpus | 539,4 ± 8,5 | 538,9 ± 4,9 |
The number of images per second is reported in Figs. 1, 2 and 3. For a fixed batch size, each histogram shows the number of images per second computed with each neural network model described above, for 1, 2, 3 and 4 GPUs on a single GALILEO node. Note that the VGG16 and InceptionV3 models with a batch size of 128 are not shown because those runs went out of memory. As the figures show, no overhead is introduced by running the container instead of the bare metal application: the number of images computed per second is essentially the same.
In Tables 3, 4 and 5, the number of images per second is reported for each test, together with the standard deviation of the measurements.
4 Conclusion
A possible way to run Singularity GPU containers has been presented. The containers were built starting from those available in the Tensorflow Docker Hub repository. Two performance tests have been described: in the first, the data obtained by running a containerized version of Tensorflow on the GALILEO cluster were compared with those available on the official Tensorflow performance webpage; in the second, bare metal and containerized executions, both on GALILEO, were compared. Both tests show that running a GPU containerized version of Tensorflow introduces no relevant overhead or performance issues.