Setting up your HPC account

Overview

You will need to go through these setup instructions before using nf-nest. This needs to be done only once per user and per cluster.

Prerequisites

  • Working knowledge of Unix (files, permissions, processes, environment variables).
  • Working knowledge of git.

There are countless web resources for learning both. It is tedious but worth the investment.

Optional: Password-less SSH

It is imperative for your sanity to avoid entering your password and doing 2FA every time an SSH connection is needed.

For example, on Sockeye, the closest you can get to that is with SSH Connection Sharing. This means that you open a first terminal window and perform 2FA in it. As long as that window is open, other SSH connections can be established without a password or 2FA. Minimize that first window and do not touch it.
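As a rough sketch of what connection sharing can look like with OpenSSH, the snippet below builds a configuration fragment and prints it for review. The alias my-cluster, the hostname login.example.org and the user my-username are placeholders, not values from this guide, and your cluster's documentation may recommend different settings. Once reviewed, the fragment would be appended to ~/.ssh/config.

```shell
# Build an OpenSSH client configuration fragment for connection sharing.
# Host alias, HostName and User are hypothetical placeholders: replace them.
# ControlMaster auto: the first connection becomes the shared "master".
# ControlPath: where the socket for the shared connection is stored
#   (%r = remote user, %h = host, %p = port).
ssh_config_fragment=$(cat <<'EOF'
Host my-cluster
    HostName login.example.org
    User my-username
    ControlMaster auto
    ControlPath ~/.ssh/cm-%r@%h-%p
EOF
)

# Print the fragment so it can be reviewed before being added to ~/.ssh/config.
printf '%s\n' "$ssh_config_fragment"
```

With such an entry in place, the first `ssh my-cluster` would ask for the password and 2FA, and later connections opened while it is still running would reuse it.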

Optional: Install VS Code

VS Code makes it easier to develop on a remote server. For example, it simplifies file editing, lets you resume your session, manages GitHub credentials for you, etc.

Use New Terminal to open a terminal. Once you have a project (git repo) ready to develop in, use Open Folder... to point VS Code to the root of your project. VS Code is also able to manage your GitHub authentication.

Understanding bash

When you log in to an HPC cluster, a process starts that lets you interact with the server through a scripting language, typically bash.

It is a good investment to learn the basics of bash. See the Software Carpentry open source tutorial.

Understanding environment variables

Any UNIX process owns a special set of global variables called its environment variables. Each variable holds a string, possibly empty.

An important convention is the PATH environment variable, which holds a :-separated list of paths where executables such as julia or python are searched for.
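To make the PATH convention concrete, here is a small sketch that lists the search directories in order and asks bash where it finds a given executable. The command bash is looked up only because it is certain to exist; substitute julia or python as needed.

```shell
# Print each directory on PATH on its own line, in search order.
path_entries=$(printf '%s\n' "$PATH" | tr ':' '\n')
echo "$path_entries"

# Ask the shell which file it would run for a given command name.
bash_location=$(command -v bash)
echo "bash resolves to: $bash_location"
```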

For example, in a bash process, to display the value of an environment variable called MY_ENV_VAR, use

echo $MY_ENV_VAR

In a Julia process:

ENV["MY_ENV_VAR"]

When a process starts a child process, the parent process decides what the child’s environment variables will be.

For example, consider what happens when you start Julia from bash. It is bash that decides, and it does so as follows: the command export marks a variable as one to propagate to children. This is why you will see export PATH in a start-up file (next section).

Nextflow takes a different approach from bash: the environment variables are specified explicitly to ensure better reproducibility.
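The effect of export can be checked with a short experiment. In this sketch, the variable names LOCAL_ONLY and SHARED_WITH_CHILDREN are made up for illustration, and a child bash process stands in for Julia.

```shell
# A variable set without export stays local to the current shell.
LOCAL_ONLY="not exported"

# An exported variable is copied into the environment of child processes.
export SHARED_WITH_CHILDREN="exported"

# Start a child bash process and print what it can see of each variable.
child_output=$(bash -c 'echo "LOCAL_ONLY=[${LOCAL_ONLY:-}] SHARED_WITH_CHILDREN=[${SHARED_WITH_CHILDREN:-}]"')
echo "$child_output"
# prints: LOCAL_ONLY=[] SHARED_WITH_CHILDREN=[exported]
```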

Understanding start up files

Bash is used in several different ways, with terminology for each kind.

Along one axis, bash is started either as a login shell or not:

  1. login: when you log in to a machine (via SSH, or when starting up a Linux machine).
  2. non-login: after you are logged in, bash can be started by another process (e.g. bash, VS Code).

Along another axis, bash runs either:

  1. interactively: when a user interacts with the process,
  2. non-interactively: when another program is in control.

Bash sources the following files automatically, depending on how it was started:

  • ~/.bashrc, loaded when the shell is interactive and non-login,
  • ~/.bash_profile, loaded for login shells.
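To find out which of these cases applies to a given shell, bash can be asked directly. The sketch below relies on two bash features: the read-only login_shell option and the special parameter $-, which contains the letter i in interactive shells. The variable names are illustrative.

```shell
# Determine whether this bash process was started as a login shell.
if shopt -q login_shell; then
    login_kind="login"
else
    login_kind="non-login"
fi

# $- lists the active shell option flags; "i" means interactive.
case "$-" in
    *i*) interactive_kind="interactive" ;;
    *)   interactive_kind="non-interactive" ;;
esac

echo "This shell is $login_kind and $interactive_kind."
```

Pasting this into a terminal opened over SSH and into a terminal opened from VS Code is a quick way to see which start-up file each one reads.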

A common pattern is to use the following in ~/.bash_profile:

# put that in .bash_profile so that anything in .bashrc also gets loaded
if [ -f ~/.bashrc ]; then
    . ~/.bashrc
fi

and then to put environment variables in ~/.bash_profile and aliases in ~/.bashrc.

Recall that any environment variable marked with export is passed on to child processes, so there is no need to set it again in non-login shells. On the other hand, anything defined with alias is not propagated to child processes.

One annoyance with this strategy is that after an update of the PATH variable, if you have several windows open, you may have to log out and log back in for the change to take effect. Moreover, even if you disconnect from VS Code, it maintains a process on the server, so you have to kill that server using

# The [.] stops grep from matching its own process; -r skips kill when nothing matches.
ps aux | grep '[.]vscode-server' | awk '{print $2}' | xargs -r kill

then log out, and then log back in.

Git

We assume the command git is available.

On some HPC clusters, such as UBC Sockeye, this command is not available by default; instead, a module has to be loaded:

module load git
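A start-up file can guard this step so that the module is loaded only when needed. The following is a sketch, assuming the cluster provides the usual module command and a module named git as above; the variable git_status is illustrative.

```shell
# Use git directly if it is already on PATH; otherwise try the module system.
if command -v git >/dev/null 2>&1; then
    git_status="available"
elif command -v module >/dev/null 2>&1 \
        && module load git 2>/dev/null \
        && command -v git >/dev/null 2>&1; then
    git_status="available after module load"
else
    git_status="missing"
fi

echo "git: $git_status"
```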

Allocation code

On some HPC clusters, such as UBC Sockeye, an allocation code is needed to submit to the job queue. Scripts in nf-nest use an environment variable called ALLOCATION_CODE to find the allocation code.

To set the variable in your current session, use:

export ALLOCATION_CODE=my-alloc-code
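To make the setting survive across sessions, the same line can be placed in ~/.bash_profile, as discussed in the start-up files section. Below is one possible way to do that without creating duplicates when the command is run more than once. The helper add_line_once is a hypothetical name, and the sketch writes to a scratch file so that it is safe to try.

```shell
# Hypothetical helper: append a line to a file only if that exact line is absent.
add_line_once() {
    local line="$1" file="$2"
    touch "$file"
    # -x matches whole lines, -F treats the pattern as a fixed string.
    grep -qxF -- "$line" "$file" || printf '%s\n' "$line" >> "$file"
}

# Demonstrated on a scratch file; use "$HOME/.bash_profile" for real use.
profile_file=$(mktemp)
add_line_once 'export ALLOCATION_CODE=my-alloc-code' "$profile_file"
add_line_once 'export ALLOCATION_CODE=my-alloc-code' "$profile_file"  # no duplicate added

cat "$profile_file"
```

The same helper would work for the other export lines in this guide.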

Optional: disable nextflow fancy output

It can be useful to turn off Nextflow’s fancy progress output; for example, it does not work well in notebooks or in screen.

To disable in the current session, use:

export NXF_ANSI_LOG=false
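To disable it for a single command rather than the whole session, bash lets you prefix the command with the assignment. In this sketch, a child bash process stands in for a real nextflow invocation, and the variable child_view is illustrative.

```shell
# The assignment prefix applies to this one command only;
# "bash -c ..." is a stand-in for the nextflow command you would run.
child_view=$(NXF_ANSI_LOG=false bash -c 'echo "NXF_ANSI_LOG=$NXF_ANSI_LOG"')
echo "$child_view"
# prints: NXF_ANSI_LOG=false
```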

Java

Nextflow needs Java. Follow these instructions taken from the nextflow website:

First, install SDKMAN:

curl -s https://get.sdkman.io | bash

Second, open a new terminal, and install Java:

sdk install java 17.0.10-tem

Finally, confirm that Java is installed correctly:

java -version

Install Julia

At the time of writing, Julia 1.11 is causing various issues, so install Julia 1.10 instead: follow the “Linux and FreeBSD” instructions, but use the URL for the long-term support version, 1.10.6.

You do not need root privileges, e.g., install in a folder at ~/bin/ and add it to your PATH.
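As a sketch of the last step, assume the downloaded archive was extracted inside ~/bin, producing a folder named julia-1.10.6 (adjust the path if you extracted it elsewhere). The export line would typically go in ~/.bash_profile. The variable names JULIA_BIN_DIR and julia_location are illustrative.

```shell
# Assumption: the archive was extracted to ~/bin/julia-1.10.6.
JULIA_BIN_DIR="$HOME/bin/julia-1.10.6/bin"

# Prepend the directory so this julia is found before any other on PATH.
export PATH="$JULIA_BIN_DIR:$PATH"

# Report which julia, if any, the shell now resolves.
julia_location=$(command -v julia || echo "julia not found on PATH")
echo "$julia_location"
```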

Julia depot

Julia has a package manager called Pkg. The Julia depot folder is a location where Pkg puts files it downloads and compiles. Each user should have their own. The path to the depot is controlled via the JULIA_DEPOT_PATH environment variable.

In an HPC setup, the folder that JULIA_DEPOT_PATH points to should be readable and writable from all nodes, and ideally be on fast storage.

For example, on Sockeye the first choice, if your allocation has it, would be to use burst storage, i.e. a path of the form

export JULIA_DEPOT_PATH=/arc/burst/[allocation_code]/[username_in_alloc]/depot

A second choice would be to use the scratch space:

export JULIA_DEPOT_PATH=/scratch/[allocation_code]/[username_in_alloc]/depot
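Before relying on a depot location, it can help to confirm that the folder exists and is writable. The helper check_depot below is a hypothetical name, and the sketch runs on a scratch directory. It only tests access from the machine where it runs, so it does not prove that every node can reach the folder.

```shell
# Hypothetical helper: create a directory if needed and report whether it is writable.
check_depot() {
    local dir="$1"
    if mkdir -p "$dir" 2>/dev/null && [ -w "$dir" ]; then
        echo "writable"
    else
        echo "not writable"
    fi
}

# Demonstrated on a scratch directory; pass "$JULIA_DEPOT_PATH" for real use.
demo_depot="$(mktemp -d)/depot"
depot_status=$(check_depot "$demo_depot")
echo "$demo_depot is $depot_status"
```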

Optional: useful aliases

Bash supports shortcuts called aliases for commonly used commands. We show two use cases here.

A very common pattern in Julia is to activate the environment in the current directory and load Revise.jl. To do so, create a Julia script in your home directory called julia-start.jl and put the following in it:

println("Active project: $(Base.active_project())")
println("Loading Revise...")
using Revise

Then add the following to .bashrc:

alias j='julia --banner=no --load /home/[your_alloc]/julia-start.jl --project=@. ' 

You will need to install Revise.jl in the global environment:

julia
]
activate
add Revise 
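The same installation can also be done from bash without opening the Julia prompt. This is a sketch: it relies on Pkg.activate() with no arguments selecting the default (global) environment, and the variable revise_status is illustrative.

```shell
# Install Revise into Julia's default (global) environment, if julia is on PATH.
if command -v julia >/dev/null 2>&1; then
    julia -e 'using Pkg; Pkg.activate(); Pkg.add("Revise")' \
        && revise_status="installed" \
        || revise_status="failed"
else
    revise_status="julia not found"
fi

echo "Revise: $revise_status"
```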

Second, most HPC clusters have some way to start an interactive allocation. Create an alias to be able to do it quickly. Here is an example for interactive CPU and GPU nodes respectively on Sockeye:

alias interact='salloc --time=3:00:00 --mem=16G --nodes=1 --ntasks=2 --account=[your_alloc]'
alias ginteract='salloc --account=[your_alloc-gpu] --partition=interactive_gpu --time=3:00:00 -N 1 -n 2 --mem=16G --gpus=1'