GPU programming

Installing CUDA.jl

In a context such as Sockeye, where GPU nodes do not have internet access, precompilation becomes more complicated than the non-CUDA setup covered in the earlier page on precompilation.

We provide a workaround, pkg_gpu.nf, which offers the same functionality as pkg.nf but is slower since all precompilation has to occur on the login node.

First, add the package as before:

ENV["JULIA_PKG_PRECOMPILE_AUTO"]=0 # Hold off precompile since we are in login node
using Pkg 
Pkg.activate("experiment_repo/julia_env")
Pkg.add("CUDA")

Next, use the GPU precompilation script:

cd experiment_repo 
./nextflow run nf-nest/pkg_gpu.nf 
N E X T F L O W  ~  version 24.10.0
Launching `nf-nest/pkg_gpu.nf` [clever_neumann] DSL2 - revision: 713b74ac4a
[2a/8719cd] Submitted process > instantiate_process
[18/d35b0f] Submitted process > precompile_gpu

Running Nextflow processes requiring a GPU

An example of a workflow using GPUs:

include { instantiate; precompile_gpu; } from "../pkg_gpu.nf"
include { activate; } from "../pkg.nf"

def julia_env = file('julia_env')

workflow {
    instantiate(julia_env) | precompile_gpu | run_julia
}

process run_julia {
    debug true
    label 'gpu'
    input:
        path julia_env
    """
    ${activate(julia_env)}

    using CUDA 

    println("CPU")
    x = rand(5000, 5000);
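    # time the product twice: the first call includes JIT compilation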
    @time x * x;
    @time x * x;

    println("GPU")
    x = CUDA.rand(5000, 5000);
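    # note: GPU operations are asynchronous, so @time mostly measures the
    # kernel launch; CUDA.@time synchronizes for a more faithful timing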
    @time x * x;
    @time x * x;

    CUDA.versioninfo()
    """
}
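
A few notes on this example: the pipe in the workflow block ensures the environment is instantiated and precompiled before run_julia starts; the label 'gpu' marks run_julia as needing a GPU, which the cluster profile is assumed to translate into an appropriate resource request; and ${activate(julia_env)} presumably expands to the invocation that runs the remainder of the script as Julia code under the given environment (see pkg.nf).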

We run it using the same command as usual:

cd experiment_repo
./nextflow run nf-nest/examples/gpu.nf -profile cluster
N E X T F L O W  ~  version 24.10.0
Launching `nf-nest/examples/gpu.nf` [voluminous_coulomb] DSL2 - revision: 9be41cea49
[e0/640d79] Submitted process > instantiate_process
[62/a5286f] Submitted process > precompile_gpu
[51/94d515] Submitted process > run_julia
CPU
 13.110772 seconds (4.79 M allocations: 517.701 MiB, 3.77% gc time, 18.17% compilation time)
 10.580591 seconds (2 allocations: 190.735 MiB, 0.17% gc time)
GPU
  1.518137 seconds (1.67 M allocations: 107.375 MiB, 98.30% compilation time)
  0.000597 seconds (50 allocations: 1.172 KiB)
CUDA runtime 12.5, artifact installation
CUDA driver 12.6
NVIDIA driver 550.90.12

CUDA libraries: 
- CUBLAS: 12.5.3
- CURAND: 10.3.6
- CUFFT: 11.2.3
- CUSOLVER: 11.6.3
- CUSPARSE: 12.5.1
- CUPTI: 2024.2.1 (API 23.0.0)
- NVML: 12.0.0+550.90.12

Julia packages: 
- CUDA: 5.5.2
- CUDA_Driver_jll: 0.10.3+0
- CUDA_Runtime_jll: 0.15.3+0

Toolchain:
- Julia: 1.10.2
- LLVM: 15.0.7

Preferences:
- CUDA_Runtime_jll.version: 12.5

1 device:
  0: Tesla V100-SXM2-32GB (sm_70, 31.137 GiB / 32.000 GiB available)

GPU kernel development

One way to leverage GPUs is array programming, as demonstrated in the example above. When a problem cannot be expressed in terms of array operations, an alternative is to write a custom GPU kernel.

Designing custom GPU kernels is especially attractive in Julia. This is in large part thanks to
KernelAbstractions.jl, which allows the same code to emit both CPU and GPU versions. Since error messages are easier to interpret when developing on the CPU, it is useful to be able to test both CPU and GPU targets.

Compared to Julia CPU development, the main constraint when writing GPU kernels is that there should be no heap allocations inside the kernel. Seasoned Julia developers often avoid allocating in inner loops anyway, due to garbage collection costs.
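
To give a flavor, here is a minimal sketch (not taken from nf-nest) of a KernelAbstractions.jl kernel, assuming KernelAbstractions has been added to the environment. The same kernel definition serves both targets; only the array type, and hence the backend, changes:

using KernelAbstractions

# double each element of A; note the allocation-free kernel body
@kernel function mul2_kernel!(A)
    i = @index(Global)
    @inbounds A[i] = 2 * A[i]
end

A = ones(Float32, 1024)          # a CPU array ...
backend = get_backend(A)         # ... so this is CPU()
mul2_kernel!(backend)(A, ndrange = length(A))
KernelAbstractions.synchronize(backend)

# on a GPU node, only the array construction changes:
# using CUDA
# A = CUDA.ones(Float32, 1024)   # get_backend(A) is then CUDABackend()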

For a concrete example of KernelAbstractions.jl in action, see these kernels used to implement Sequential Annealed Importance Sampling.