MPI

TauREx-CUDA can be used with MPI to perform retrievals using multiple GPUs with samplers such as MultiNest.

Assuming we have our [Model], [Optimizer] and [Fitting] sections set up like so:

[Model]
model_type = transmission_cuda
    [[AbsorptionCUDA]]

    [[CIACUDA]]
    cia_pairs = H2-H2, H2-He

    [[RayleighCUDA]]

[Optimizer]
optimizer = multinest
multi_nest_path = ./multinest
num_live_points = 1000

[Fitting]
T:fit = True
T:priors = "Uniform(bounds=[1000,2000])"

We can run the retrieval with 2 MPI tasks:

mpirun -n 2 taurex -i science.par --retrieval

This will run 2 tasks on a single GPU. Increasing the value of -n adds more tasks, but all of them still run on this same GPU. On multi-GPU systems we can control which GPU is used through the CUDA_VISIBLE_DEVICES environment variable. For example, to switch to the second GPU (device index 1) we set the variable:

export CUDA_VISIBLE_DEVICES=1
mpirun -n 2 taurex -i science.par --retrieval

Again, however, all tasks will share this one GPU.
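As a small illustration of how the variable is scoped, it can also be given as a prefix to a single command instead of being exported for the whole shell session. In the sketch below, `sh -c` is only a stand-in for the mpirun command line so that the snippet runs on any machine; it is not part of TauREx-CUDA.

```shell
#!/usr/bin/env bash
# Prefix form: CUDA_VISIBLE_DEVICES is set for this one command only.
# `sh -c ...` is a stand-in for `mpirun -n 2 taurex -i science.par --retrieval`.
CUDA_VISIBLE_DEVICES=1 sh -c 'echo "child sees GPU(s): ${CUDA_VISIBLE_DEVICES}"'

# The calling shell keeps whatever value it had before (possibly none).
echo "parent sees GPU(s): ${CUDA_VISIBLE_DEVICES:-<unset>}"
```

The prefix form is convenient for one-off runs, while `export` affects every later command in the session.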

Multi-GPU

To make use of multiple GPUs in an MPI run we need to bind each TauREx instance to a specific GPU. We can accomplish this with a binding script that sets the CUDA_VISIBLE_DEVICES environment variable at run-time. The details vary slightly between MPI implementations, but for OpenMPI we can use a modified version of gpu_bind.sh:

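#!/usr/bin/bash

# Number of GPUs available on each node
export TOTAL_GPUS=2
# Node-local rank of this task, set by OpenMPI
export PROC_ID=$OMPI_COMM_WORLD_LOCAL_RANK

# Bind this task to one GPU, cycling through the GPUs by local rank
export CUDA_VISIBLE_DEVICES=$((PROC_ID % TOTAL_GPUS))

# Run the wrapped command with its arguments preserved
"$@"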

Here OMPI_COMM_WORLD_LOCAL_RANK is an environment variable set by OpenMPI that gives the rank of the task on its node, and TOTAL_GPUS is the number of GPUs available (per node). We can therefore use this script with taurex like so:

mpirun -n 2 ./gpu_bind.sh taurex -i science.par --retrieval

Each task will run on a separate GPU. Depending on the GPU and retrieval, we can run 2 tasks on each GPU by doubling the number of tasks:

mpirun -n 4 ./gpu_bind.sh taurex -i science.par --retrieval

This can potentially double performance on GPUs with large memory and a high core count. On the V100 it is possible to run 3 tasks per GPU without affecting performance too much:

mpirun -n 6 ./gpu_bind.sh taurex -i science.par --retrieval
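To see how the modulo in gpu_bind.sh spreads tasks over the GPUs, the mapping can be printed without launching anything. The following is a minimal sketch with example values (six local ranks, TOTAL_GPUS=2); the helper name `gpu_for_rank` is hypothetical and used only for illustration.

```shell
#!/usr/bin/env bash
# Illustration only: show which GPU index each local rank is bound to.
# TOTAL_GPUS and the rank list are example values, not read from the system.
TOTAL_GPUS=2

gpu_for_rank() {
    # Same arithmetic as gpu_bind.sh: local rank modulo GPUs per node.
    echo $(( $1 % TOTAL_GPUS ))
}

for rank in 0 1 2 3 4 5; do
    echo "local rank ${rank} -> GPU $(gpu_for_rank "${rank}")"
done
```

Even ranks land on GPU 0 and odd ranks on GPU 1, so a run with -n 6 places three tasks on each of the two GPUs.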

For different clusters, PROC_ID and TOTAL_GPUS may need to be modified. For Wilkes, set TOTAL_GPUS=4; under SLURM schedulers you can set PROC_ID=$SLURM_PROCID. Check the example scripts for your HPC centre to determine the best variables to use.
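One way to avoid editing the script for each launcher is to fall back through the rank variables that may be present. The sketch below is an assumption-laden variant, not the script shipped with TauREx-CUDA: the helper `local_rank` is hypothetical, and it uses SLURM_LOCALID (SLURM's node-local task ID) as the closest analogue of OMPI_COMM_WORLD_LOCAL_RANK, rather than the global SLURM_PROCID mentioned above.

```shell
#!/usr/bin/env bash
# Sketch of a binding wrapper usable under OpenMPI or SLURM.
# TOTAL_GPUS defaults to 2 here; override it with the GPUs per node on your system.
TOTAL_GPUS=${TOTAL_GPUS:-2}

local_rank() {
    # Prefer OpenMPI's node-local rank, then SLURM's node-local task ID,
    # and fall back to 0 when neither launcher variable is set.
    echo "${OMPI_COMM_WORLD_LOCAL_RANK:-${SLURM_LOCALID:-0}}"
}

# Bind this task to one GPU, cycling through the GPUs by local rank.
export CUDA_VISIBLE_DEVICES=$(( $(local_rank) % TOTAL_GPUS ))

# Run the wrapped command, if one was given, with the binding applied.
if [ "$#" -gt 0 ]; then
    exec "$@"
fi
```

Saved under a hypothetical name such as gpu_bind_generic.sh, it would be placed between the launcher and taurex in the same way as gpu_bind.sh, for example `TOTAL_GPUS=4 mpirun -n 4 ./gpu_bind_generic.sh taurex -i science.par --retrieval`.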

Tip

For the best performance, it is recommended to enable MPS (Multi-Process Service) on the GPUs. MPS lets multiple processes share a GPU without context switching between them, so that transfers and compute from different processes are interleaved, which can significantly improve performance.
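As a rough sketch, MPS is typically enabled by starting the control daemon before the run and stopping it afterwards. The example assumes a single node, that `nvidia-cuda-mps-control` is on the PATH, and that you are permitted to start the daemon yourself; many HPC centres manage MPS through the scheduler instead, so check local documentation first. The function names `start_mps` and `stop_mps` are hypothetical.

```shell
#!/usr/bin/env bash
# Sketch: wrap a multi-GPU retrieval with MPS start-up and shutdown.

start_mps() {
    # Launch the MPS control daemon in the background.
    nvidia-cuda-mps-control -d
}

stop_mps() {
    # Ask the running control daemon to shut down.
    echo quit | nvidia-cuda-mps-control
}

if command -v nvidia-cuda-mps-control >/dev/null 2>&1; then
    start_mps
    mpirun -n 4 ./gpu_bind.sh taurex -i science.par --retrieval
    stop_mps
else
    echo "nvidia-cuda-mps-control not found; MPS was not started" >&2
fi
```

Stopping the daemon after the run returns the GPUs to their normal behaviour for other users and jobs.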