Running Containerised MPI Workloads with Slurm

Running containerised MPI workloads is not recommended for most users due to its complexity. It is possible, however, and it provides some advantages in reproducibility and allows the use of applications that do not support the environment on the Lovelace cluster (e.g. an application that does not support the current version of the Red Hat Enterprise Linux distribution).

The process consists of building a container that is compatible with the host environment (including Slurm, PMIx, and OFED) using Podman. We extend the container to include the MPI application, using HiRep as an example. We then convert the container to a Singularity container; this makes it easier for the container to access MPI, network, and device information from the host, as Singularity allows this by default whereas Podman restricts it. Finally, we write a job submission script to run the workload within the container.
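
For orientation, the whole procedure reduces to the following commands, each explained in the sections below (the job script name hirep_job.sh in the final step is only illustrative):

podman build -t openmpi openmpi    # base image with OFED, PMIx, PRRTE and OpenMPI
podman build -t hirep hirep        # extend the base image with the HiRep application
podman save --format oci-archive hirep | singularity build hirep.sif oci-archive:///dev/stdin
sbatch hirep_job.sh                # submit the containerised workload to Slurm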

Building the OpenMPI Container

We follow the process given in Podman. We start by creating a folder called openmpi containing a file called Dockerfile with the contents below:

FROM registry.access.redhat.com/ubi9:9.5

RUN dnf groupinstall -y "Development Tools"
RUN dnf install -y https://dl.fedoraproject.org/pub/epel/epel-release-latest-9.noarch.rpm && crb enable && dnf config-manager --set-enabled codeready-builder-for-rhel-9-x86_64-rpms
RUN dnf install -y perl-sigtrap lsof pciutils ethtool gcc-gfortran tcl numactl-libs pciutils-libs tk libnl3

ADD MLNX_OFED_LINUX-24.10-2.1.8.0-rhel9.5-x86_64.tgz /
RUN yes | /MLNX_OFED_LINUX-24.10-2.1.8.0-rhel9.5-x86_64/mlnxofedinstall --user-space-only --without-fw-update
RUN rpm -i /MLNX_OFED_LINUX-24.10-2.1.8.0-rhel9.5-x86_64/RPMS/ucx-knem-1.18.0-1.2410068.x86_64.rpm /MLNX_OFED_LINUX-24.10-2.1.8.0-rhel9.5-x86_64/RPMS/knem-1.1.4.90mlnx3-OFED.23.10.0.2.1.1.rhel9u5.x86_64.rpm

RUN dnf install -y environment-modules wget hwloc hwloc-devel libevent libevent-devel python3 python3-devel pam-devel readline-devel mariadb-devel perl bzip2-devel logrotate numactl-devel
WORKDIR /root

COPY pmix-5.0.6-1.src.rpm /root
COPY .rpmmacros /root
RUN rpmbuild --rebuild --noclean pmix-5.0.6-1.src.rpm
RUN rpm -i /root/rpmbuild/RPMS/x86_64/pmix-5.0.6-1.el9.x86_64.rpm
RUN rm .rpmmacros

COPY prrte-3.0.8-1.src.rpm /root
COPY .rpmmacros /root
RUN rpmbuild --rebuild --noclean prrte-3.0.8-1.src.rpm
RUN rpm -i /root/rpmbuild/RPMS/x86_64/prrte-3.0.8-1.el9.x86_64.rpm
RUN rm .rpmmacros

COPY openmpi-5.0.6-1.src.rpm /root
RUN rpmbuild --rebuild --noclean --define 'configure_options --with-slurm --with-verbs --with-knem=/opt/knem-1.1.4.90mlnx3' openmpi-5.0.6-1.src.rpm
RUN rpm -e openmpi mpitests_openmpi && rpm -i /root/rpmbuild/RPMS/x86_64/openmpi-5.0.6-1.el9.x86_64.rpm

Note that the container above is built against a Red Hat Universal Base Image. Users may choose to use rockylinux instead. We also expect the installation files for NVIDIA OFED, the PMIx Reference Library (OpenPMIx), the PMIx Reference RunTime Environment (PRRTE), and OpenMPI to be in the openmpi folder. Each file can be downloaded from the linked sources.

Additionally, a file called .rpmmacros with the following contents is expected in the folder; setting %_lto_cflags to %nil disables the link-time optimisation flags that RHEL 9 injects into RPM builds by default:

%_lto_cflags %nil
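
Putting this together, the openmpi folder should contain the Dockerfile, the .rpmmacros file, and the four installation files before the build; for example (filenames as used in the Dockerfile above):

ls -A openmpi
# .rpmmacros  Dockerfile  MLNX_OFED_LINUX-24.10-2.1.8.0-rhel9.5-x86_64.tgz
# openmpi-5.0.6-1.src.rpm  pmix-5.0.6-1.src.rpm  prrte-3.0.8-1.src.rpm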

We then build the container with:

podman build -t openmpi openmpi

Note that we expect the openmpi folder mentioned previously to be in the current working directory.
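
As an optional sanity check before extending the image, you can confirm that the rebuilt packages are installed:

podman run --rm localhost/openmpi rpm -q pmix prrte openmpi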

Extending the OpenMPI Container with the HiRep Application

We then extend the container to include the HiRep application. Create a folder called hirep containing a file called Dockerfile with the contents below:

FROM localhost/openmpi

RUN dnf install -y bsdtar

COPY 57bac424dec078bbccb0d3eeb7e32a027d023685.zip /root/hirep.zip

WORKDIR /hirep
RUN bsdtar xvf /root/hirep.zip --strip-components=1
RUN ln -s /usr/bin/python3 /usr/bin/python
RUN \
\
[ -z "$CC" ] && CC="$(which cc)" && \
mpicc -show && \
\
sed -i "s|^CFLAGS =.*|CFLAGS = -Wall -Wshadow -std=c11 -O3 -march=native -pipe |g" Make/MkFlags && \
sed -i "s|^MPICC = .*|MPICC = $(which mpicc) |g" Make/MkFlags && \
sed -i "s|^CC = .*|CC = ${CC} |g" Make/MkFlags && \
sed -i "s|^INCLUDE = .*|INCLUDE = |g" Make/MkFlags && \
\
cd HMC && \
make -j
RUN cd TestProgram/DiracOperator && make -j

We also expect the source files for HiRep at a specific commit to be in the hirep folder. The archive can be downloaded from https://github.com/claudiopica/HiRep/archive/57bac424dec078bbccb0d3eeb7e32a027d023685.zip.
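
For example, the archive can be downloaded directly into the hirep folder with:

wget -P hirep https://github.com/claudiopica/HiRep/archive/57bac424dec078bbccb0d3eeb7e32a027d023685.zip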

Next we build the container with:

podman build -t hirep hirep

Note that we expect the hirep folder mentioned previously to be in the current working directory.
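
Optionally, you can check that the test binary used in the job submission script later on was built and is present in the image:

podman run --rm localhost/hirep ls -l /hirep/TestProgram/DiracOperator/speed_test_diracoperator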

Converting the Container to a Singularity Container

We now convert the container to a Singularity container. This can be done with the following command:

podman save --format oci-archive hirep | singularity build hirep.sif oci-archive:///dev/stdin

This will create a file called hirep.sif which contains the container.
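
As a quick check that the conversion worked, you can run a command inside the Singularity container, for example:

singularity exec hirep.sif rpm -q pmix openmpi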

Submitting the Job with Slurm

Finally, we will create a job submission file for the DiracOperator component of HiRep. First we create an input file called hirep_input_file with the following contents:

GLB_T = 96
GLB_X = 48
GLB_Y = 48
GLB_Z = 48
NP_T = 8
NP_X = 4
NP_Y = 2
NP_Z = 2
rlx_level = 1
rlx_seed = 12345 

We can then submit the job using the following job submission script. Note that the processor decomposition in the input file (NP_T × NP_X × NP_Y × NP_Z = 8 × 4 × 2 × 2 = 128) corresponds to the 128 MPI tasks requested with -n 128:

#!/bin/bash
#SBATCH -N  2
#SBATCH -n  128

export PMIX_MCA_psec=native

srun --mpi=pmix singularity run -B /users,/scratch hirep.sif /hirep/TestProgram/DiracOperator/speed_test_diracoperator -i hirep_input_file -o hirep_output_file

Note that we expect hirep_input_file, hirep.sif, and the job submission script to be in the current working directory. This example uses PMIx with OpenMPI, but other MPI implementations (such as Intel MPI) may work better with --mpi=pmi2.
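
To see which PMI plugins your Slurm installation supports, you can run the following (the output varies by site):

srun --mpi=list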

We can submit the job above as normal and, upon completion, we should see the output of the DiracOperator test in a file called hirep_output_file in the current working directory.
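
For example, assuming the job submission script above is saved as hirep_job.sh (an illustrative name):

sbatch hirep_job.sh       # submit the job
squeue -u $USER           # monitor it while it runs
cat hirep_output_file     # inspect the results once the job has finished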