Combine outputs

Overview

Now that we know how to run many jobs, the next question is how to combine their outputs for analysis.

Example

We will run Pigeons on the cross product formed by calling crossProduct(variables) with:

def variables = [
    seed: 1..10,
    n_chains: [10, 20], 
]

Suppose we want to create a plot from the output of the resulting 20 Julia processes (10 seeds × 2 values of n_chains).
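To make the size of the grid concrete, the sketch below enumerates the same combinations in plain shell. It only illustrates which inputs the cross product is expected to cover; the helper name list_configs and its printed format are assumptions made for this illustration and are not part of nf-nest.

```shell
# Hypothetical helper (not part of nf-nest): print one line per
# (seed, n_chains) combination of the grid defined above.
list_configs() {
    for seed in $(seq 1 10); do
        for n_chains in 10 20; do
            echo "seed=${seed} n_chains=${n_chains}"
        done
    done
}

# Count the combinations: 10 seeds x 2 chain counts.
list_configs | wc -l
```

Each printed line corresponds to one configuration, and therefore to one Julia process.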

Strategy

Each Julia process will create a folder whose name encodes the inputs used (seed and n_chains). That name is generated automatically by nf-nest’s filed() function. Inside that folder, the process writes its CSV files.

Then, once all Julia processes are done, another utility from nf-nest, combine_csvs, will merge the CSVs while adding columns for the inputs (here, seed and n_chains).

Finally, we will pass the merged CSVs to a plotting process.
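The merge step in this strategy can be pictured with the shell sketch below. It is not nf-nest's combine_csvs implementation, only an illustration of the idea. The function name combine_sketch and the folder naming pattern seed=<s>_n_chains=<n> are assumptions; the actual names produced by filed() may differ.

```shell
# Hypothetical sketch (not nf-nest's combine_csvs): merge the CSV named $2
# found in every folder under $1 whose name looks like seed=<s>_n_chains=<n>,
# prepending seed and n_chains columns to every row.
combine_sketch() {
    root="$1"
    csv_name="$2"
    header_done=0
    for dir in "$root"/seed=*_n_chains=*; do
        # Skip folders (or an unmatched glob) that do not contain the CSV.
        [ -f "$dir/$csv_name" ] || continue
        base="$(basename "$dir")"
        # Recover the input values from the (assumed) folder name.
        seed="$(echo "$base" | sed 's/^seed=\([0-9]*\)_n_chains=.*$/\1/')"
        n_chains="$(echo "$base" | sed 's/^.*_n_chains=\([0-9]*\)$/\1/')"
        # Emit the header once, with the two input columns added on the left.
        if [ "$header_done" -eq 0 ]; then
            echo "seed,n_chains,$(head -n 1 "$dir/$csv_name")"
            header_done=1
        fi
        # Emit the data rows, each prefixed with this folder's input values.
        tail -n +2 "$dir/$csv_name" | sed "s/^/${seed},${n_chains},/"
    done
}
```

For example, combine_sketch some_folder summary.csv would print a single CSV in which every row carries the seed and n_chains of the folder it came from. This is the property the plotting step relies on.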

Nextflow script

// includes are relative to the .nf file, should always start with ./ or ../
include { crossProduct; filed; deliverables } from '../cross.nf'
include { instantiate; precompile; activate } from '../pkg.nf'
include { combine_csvs; } from '../combine.nf'

// in contrast, file(..) is relative to `pwd`; use projectDir/
//   to make it relative to the main .nf file, or moduleDir/ for the current .nf file
def julia_env = file(moduleDir/'julia_env')
def plot_script = file(moduleDir/'plot.jl')

def variables = [
    seed: 1..10,
    n_chains: [10, 20], 
]

workflow {
    compiled_env = instantiate(julia_env) | precompile
    configs = crossProduct(variables)
    combined = run_julia(compiled_env, configs) | combine_csvs
    plot(compiled_env, plot_script, combined)
}

process run_julia {
    input:
        path julia_env 
        val config 
    output:
        path "${filed(config)}"
    """
    ${activate(julia_env)}

    # run your code
    using Pigeons 
    using CSV 
    pt = pigeons(
            target = toy_mvn_target(1000), 
            n_chains = ${config.n_chains}, 
            seed = ${config.seed})

    # organize output as follows:
    #   - create a directory with name controlled by filed(config)
    #     to keep track of input configuration
    #   - put any number of CSV files in there
    mkdir("${filed(config)}")
    CSV.write("${filed(config)}/summary.csv", pt.shared.reports.summary)
    CSV.write("${filed(config)}/swap_prs.csv", pt.shared.reports.swap_prs)
    """
}

process plot {
    input:
        path julia_env 
        path plot_script
        path combined_csvs_folder 
    output:
        path '*.png'
        path combined_csvs_folder
    publishDir "${deliverables(workflow, params)}", mode: 'copy', overwrite: true
    """
    ${activate(julia_env)}

    include("$plot_script")
    create_plots("$combined_csvs_folder")
    """
}

Running the nextflow script

cd experiment_repo
./nextflow run nf-nest/examples/full.nf -profile cluster 
N E X T F L O W  ~  version 24.10.0
Launching `nf-nest/examples/full.nf` [golden_poitras] DSL2 - revision: a68c131baa
[8c/a117a8] Submitted process > instantiate_process
[5f/997727] Submitted process > combine_workflow:instantiate_process
[72/cc3bb2] Submitted process > precompile
[b9/57c5b2] Submitted process > combine_workflow:precompile
[42/f159a5] Submitted process > run_julia (13)
[bc/98503a] Submitted process > run_julia (10)
[4d/bd3f1f] Submitted process > run_julia (1)
[c6/c5473f] Submitted process > run_julia (7)
[9c/4051ca] Submitted process > run_julia (5)
[c1/2cb9ec] Submitted process > run_julia (11)
[6d/af7d1f] Submitted process > run_julia (16)
[a2/78b5dc] Submitted process > run_julia (8)
[47/65c526] Submitted process > run_julia (14)
[9b/e1998e] Submitted process > run_julia (12)
[65/c7c28f] Submitted process > run_julia (6)
[07/9f3d4c] Submitted process > run_julia (2)
[13/27eb7d] Submitted process > run_julia (3)
[95/88b5e1] Submitted process > run_julia (9)
[33/81cb06] Submitted process > run_julia (19)
[e4/7dc064] Submitted process > run_julia (17)
[50/cb53b6] Submitted process > run_julia (4)
[44/c2c487] Submitted process > run_julia (15)
[b9/05c738] Submitted process > run_julia (20)
[c9/0e2573] Submitted process > run_julia (18)
[cf/4b324f] Submitted process > combine_workflow:combine_process
[ff/ab4541] Submitted process > plot

Accessing the output

Each nextflow process is associated with a unique work directory to ensure the processes do not interfere with each other. Here we cover two ways to quickly access these work directories.

Quick inspection

A quick way to find the output of a nextflow process that we just ran is to use:

cd experiment_repo 
nf-nest/nf-open

This lists the work folders for the most recent nextflow run.

Organizing the output with a publishDir

A better approach is to use the publishDir directive, combined with nf-nest’s deliverables() utility, as illustrated in the plot process above. This automatically copies the outputs of the process carrying the directive into a sub-directory of experiment_repo/deliverables.

cd experiment_repo
tree deliverables
deliverables
└── scriptName=full.nf
    ├── output
    │   ├── summary.csv
    │   └── swap_prs.csv
    ├── plot.png
    └── runName.txt

2 directories, 4 files

Here the contents of runName.txt can be used with nextflow’s log command to obtain more information on the run.

cat deliverables/scriptName=full.nf/runName.txt 
golden_poitras
./nextflow log
TIMESTAMP           DURATION    RUN NAME            STATUS  REVISION ID SESSION ID                              COMMAND                                                       
2024-11-12 12:05:21 5.8s        angry_northcutt     OK      9d1a692a7e  2e019c7d-7c4e-42e5-b142-1dd6770fcb61    nextflow run nf-nest/examples/hello.nf                        
2024-11-12 12:05:33 8.6s        wise_goldberg       OK      9d1a692a7e  2e019c7d-7c4e-42e5-b142-1dd6770fcb61    nextflow run nf-nest/examples/hello.nf -profile cluster       
2024-11-12 12:06:36 1m 20s      ridiculous_volta    OK      fc0374e695  2e019c7d-7c4e-42e5-b142-1dd6770fcb61    nextflow run nf-nest/pkg.nf -profile cluster                  
2024-11-12 12:08:22 1m 39s      elegant_fermi       OK      aa082b1978  2e019c7d-7c4e-42e5-b142-1dd6770fcb61    nextflow run nf-nest/examples/many_jobs.nf -profile cluster   
2024-11-12 12:10:07 6.5s        tiny_elion          OK      d9de661ecc  2e019c7d-7c4e-42e5-b142-1dd6770fcb61    nextflow run nf-nest/examples/filter.nf                       
2024-11-12 12:10:24 2m 19s      nice_hopper         OK      8cef9f29d6  2e019c7d-7c4e-42e5-b142-1dd6770fcb61    nextflow run nf-nest/examples/stan_example.nf -profile cluster
2024-11-12 12:13:41 34.7s       clever_neumann      OK      713b74ac4a  2e019c7d-7c4e-42e5-b142-1dd6770fcb61    nextflow run nf-nest/pkg_gpu.nf                               
2024-11-12 12:14:22 1m 50s      voluminous_coulomb  OK      9be41cea49  2e019c7d-7c4e-42e5-b142-1dd6770fcb61    nextflow run nf-nest/examples/gpu.nf -profile cluster         
2024-11-12 12:16:23 6m 26s      golden_poitras      OK      a68c131baa  2e019c7d-7c4e-42e5-b142-1dd6770fcb61    nextflow run nf-nest/examples/full.nf -profile cluster        
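To go from the run name to per-task details, the name can be passed back to the log command together with its -f (fields) option. The following is a minimal sketch: the helper name show_run_tasks and the NEXTFLOW override variable are hypothetical, and the choice of fields (name, status, workdir) is only illustrative.

```shell
# Hypothetical helper: read the run name saved next to the deliverables and
# ask nextflow for the name, status and work directory of each task in that run.
# NEXTFLOW is an assumed override; it defaults to the ./nextflow launcher used above.
show_run_tasks() {
    deliverables_dir="$1"
    run_name="$(cat "$deliverables_dir/runName.txt")"
    "${NEXTFLOW:-./nextflow}" log "$run_name" -f name,status,workdir
}
```

From experiment_repo, this would be called as show_run_tasks "deliverables/scriptName=full.nf". The workdir field gives the work directory of each task, which is another way to reach the outputs discussed earlier.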

The merged CSV confirms that the columns seed and n_chains were added on the left:

head -n 2 deliverables/scriptName=full.nf/output/summary.csv 
seed,n_chains,round,n_scans,n_tempered_restarts,global_barrier,global_barrier_variational,last_round_max_time,last_round_max_allocation,stepping_stone
10,10,1,2,,8.998207738433418,,0.000288568,13536.0,-1173.429270641805