Using the HPC
Author: Yi Liu
This document is intended to be an overview and a "cheatsheet" for IEU researchers, but not a replacement for the detailed documentation available on the IEU and ACRC sites. See the links at the bottom of the page for this information.
Example submission script
Below is a quick template for a submission script, which does the following
- submit a job called "hello-world"
- request compute resources
  - 1 process (ntasks)
  - 1 core for subthreads in the process (cpus-per-task)
  - 10 secs compute time
- cd to the submission directory and print out the working directory
- print out the name of the compute node
- print out "Hello world"
To use the script you should replace the various placeholders, e.g. <PARTITION>, with the appropriate values.
hello-world.sbatch
#!/bin/bash
#SBATCH --job-name=hello-world
#SBATCH --ntasks=1
#SBATCH --cpus-per-task=1
#SBATCH --time=0:0:10
#SBATCH --partition=<PARTITION>
#SBATCH --account=<PROJECT-CODE>
#SBATCH --mail-user=<MAIL-ACCOUNT>
#SBATCH --mail-type=ALL
cd "${SLURM_SUBMIT_DIR}"
echo $(pwd)
echo $(hostname)
echo "Hello world"
where
- <PARTITION>: the partition to submit the job to
- <PROJECT-CODE>: the account code for IEU HPC projects. For further details, see https://uob.sharepoint.com/sites/integrative-epidemiology/SitePages/ieu-servers-and-hpc-clusters.aspx and search for "project code". NOTE: on the current HPC systems your jobs will NOT run unless you specify the correct project code.
- <MAIL-ACCOUNT>: your email address.
To submit, run
sbatch hello-world.sbatch
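By default sbatch writes the job's standard output and error to a file named slurm-<JOB-ID>.out in the directory you submitted from, so you can check the result once the job has finished (the job ID below is a hypothetical example):
# sbatch prints "Submitted batch job <JOB-ID>" on submission
# once the job has finished, inspect its output file; "Hello world" should be at the end
cat slurm-11318765.out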
IEU HPC resources
Systems
The IEU has dedicated compute nodes and partitions on two HPC systems
- BlueCrystal Phase 4 (BC4)
  - BC4 is recommended for any computationally heavy-duty tasks
  - The mrcieu partition is dedicated to IEU users
- BluePebble (BP1)
  - BP1 is recommended ONLY for GPU-oriented tasks, as its non-GPU specs (CPU, RAM, etc.) are inferior to those on BC4
  - The mrcieu-gpu partition is dedicated to IEU users
The IEU has other non-HPC compute resources, e.g. the epi-franklin server. Further details can be found via the Links section at the bottom of this page.
Filesystem
Typically IEU users have 3 locations in the local filesystem on the HPC
- home directory: /user/home/<USERNAME>. You should only use the home directory for system configurations.
- work directory: /user/work/<USERNAME>. Your projects should live here.
- IEU purchased project space: /mnt/storage/private/mrcieu (bc4), /bp1/mrcieu1 (bp1). Path to various shared datasets and common project spaces.
In addition, the IEU RDSF space (the IEU data warehouse and project workspace) is also mounted on the HPC at /projects/MRC-IEU/research.
However, this path is only mounted on login nodes and is not accessible on compute nodes.
You should only use this space for data backup, not as part of the local filesystem,
as RDSF is a much slower filesystem and the mount can be very unstable.
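For example, a backup copy can be made from a login node with a standard tool such as rsync; the project sub-directories below are hypothetical and you need write access to the relevant RDSF area:
# run this on a LOGIN node; the RDSF mount is not visible on compute nodes
rsync -av \
  /user/work/<USERNAME>/my-project/results/ \
  /projects/MRC-IEU/research/<RDSF-PROJECT-DIR>/my-project-backup/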
There are also other locations worth mentioning:
- bc4 /mnt/storage/scratch/<USERNAME>: this is a scratch space kept for legacy purposes. Users should now use their work directories instead.
Slurm commands
Viewing the overall state of the HPC
sinfo
Use sinfo -s to display a summary view of the partitions and compute nodes that you have access to.
sinfo -s
PARTITION AVAIL TIMELIMIT NODES(A/I/O/T) NODELIST
cpu* up 14-00:00:0 xxxxxxxxxxx xxxxxxxxxxxxxxxx
gpu up 7-00:00:00 xxxxxxxxx xxxxxxxxxxxxxxxxxxxxxxxxxxxx
hmem up 14-00:00:0 xxxxxxx xxxxxxxxxxxxxx
test up 1:00:00 xxxxxxxxxxx xxxxxxxxxxxxxxxx
veryshort up 6:00:00 xxxxxxxxxxx xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx
mrcieu up 14-00:00:0 xxxxxxxxx xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx
Submitting jobs and requesting resources
There are two modes of job submission:
- batch mode, where you submit a job submission script (e.g. hello-world.sbatch) that executes the commands detailed in the script. This is the generally expected way of using HPC compute. However, like any other large-scale batch processing, you need to make sure your script and the underlying commands are robust to potential errors, or have mechanisms to accommodate errors, to make full use of the HPC resources. In other words, prepare for the submitted job to fail due to bugs in your code (see the sketch after this list).
- interactive mode, where for the duration of the job you gain access to an interactive shell on a compute node and use the command line to run commands. Typical use cases for interactive jobs include trial experiments or interactive use such as jupyter notebooks.
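As a minimal sketch of "preparing for failure", making the shell exit on the first error and echo its progress makes it much easier to see where a batch job went wrong; the options below are standard bash and only a suggested pattern (#SBATCH headers omitted):
#!/bin/bash
# fail fast: -e exit on the first error, -u treat unset variables as errors,
# -o pipefail catch failures inside pipes
set -euo pipefail
# echo each command before it runs, so the slurm-<JOB-ID>.out log shows how far the job got
set -x
cd "${SLURM_SUBMIT_DIR}"
# ... your actual commands go here ...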
sbatch, batch mode job submission
sbatch is used for submitting a job script.
sbatch <JOB-SCRIPT>.sbatch
srun, interactive mode
srun is used to run a command (e.g. echo), where this command could also request an interactive shell.
Run a command
srun --account <PROJECT-CODE> --partition mrcieu echo "hello world"
# srun: job 11318808 queued and waiting for resources
# srun: job 11318808 has been allocated resources
# hello world
Request an interactive shell
srun --account <PROJECT-CODE> --partition mrcieu --pty /bin/bash -l
# srun: job 11318831 queued and waiting for resources
# srun: job 11318831 has been allocated resources
When using srun you need to keep your current terminal and ssh session open, which can be unfeasible as the waiting time could be very long, and the job is lost once you disconnect from the ssh session.
To circumvent this you can use a terminal multiplexer (e.g. tmux or screen), which is not covered here.
Alternatively you could use salloc.
salloc, interactive mode
salloc is used for requesting an allocation of compute resources, which, when granted, can be accessed via srun.
As an example, below is a salloc job submitted and then interacted with via srun.
## request compute resources in jobid 11319203 via salloc
salloc --account <PROJECT-CODE> --partition mrcieu --time 0:30:00 --mail-user=<USER-EMAIL> --mail-type=ALL
# salloc: Pending job allocation 11319203
# salloc: job 11319203 queued and waiting for resources
# salloc: job 11319203 has been allocated resources
# salloc: Granted job allocation 11319203
# salloc: Nodes compute062 are ready for job
# Info: Loaded slurm/23.11.10 into the modular environment.
#
#
# SLURM: Your account, XXXXXXX, is ready to submit Slurm jobs.
## access the allocated resources in jobid 11319203 via srun
srun --jobid 11319203 --pty /bin/bash -l
## check where the compute node is
hostname
# compute062.bc4.acrc.priv
## check the slurm job id environment variable
echo ${SLURM_JOB_ID}
# 11319203
Managing jobs
scancel, cancelling jobs
scancel <JOB-ID>
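scancel also accepts a few other useful forms, e.g. cancelling all of your own jobs or a single task of a job array:
# cancel all jobs belonging to a user (i.e. yourself)
scancel -u <USERNAME>
# cancel only task 3 of a job array
scancel <JOB-ID>_3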
Viewing the status of jobs
squeue, displaying the status of submitted jobs
❯ squeue
JOBID PARTITION NAME USER ST TIME NODES NODELIST(REASON)
xxxxxxxx cpu MA5025 xxxxxxx PD 0:00 10 (Resources)
xxxxxxxx cpu MA5025 xxxxxxx PD 0:00 10 (Resources)
xxxxxxxx cpu BA5025 xxxxxxx PD 0:00 10 (Resources)
xxxxxxxx cpu CA5025 xxxxxxx PD 0:00 10 (Resources)
xxxxxxxx cpu CA5025 xxxxxxx PD 0:00 10 (Resources)
xxxxxxxx cpu CA5025 xxxxxxx PD 0:00 10 (Resources)
xxxxxxxx cpu SA5025 xxxxxxx PD 0:00 10 (Resources)
xxxxxxxx cpu SA5025 xxxxxxx PD 0:00 10 (Resources)
xxxxxxxx cpu bigdft_g xxxxxxx PD 0:00 4 (Resources)
sacct, displaying the accounting data of recent jobs
❯ sacct
JobID JobName Partition Account AllocCPUS State ExitCode
------------ ---------- ---------- ---------- ---------- ---------- --------
xxxxxxxxxxxx bash mrcieu default 0 CANCELLED+ 0:0
xxxxxxxxxxxx bash mrcieu xxxxxxxxxx 12 FAILED 127:0
xxxxxxxxxxxx extern xxxxxxxxxx 12 COMPLETED 0:0
xxxxxxxxxxxx bash xxxxxxxxxx 12 FAILED 127:0
xxxxxxxxxxxx interacti+ cpu xxxxxxxxxx 1 COMPLETED 0:0
xxxxxxxxxxxx extern xxxxxxxxxx 1 COMPLETED 0:0
Other useful Slurm commands
scontrol
scontrol show nodes
scontrol show jobs
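scontrol show job <JOB-ID> and scontrol show node <NODE-NAME> print the full record for a single job or node, which is handy when working out why a job is still pending:
# detailed record of one job: requested resources, node list, working directory, reason for pending, etc.
scontrol show job <JOB-ID>
# detailed record of one node: CPUs, memory, current state
scontrol show node <NODE-NAME>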
Topic: job arrays
Below is an example job array script.
job-array.sbatch
#!/bin/bash
#SBATCH --job-name=job-array-example
#SBATCH --ntasks=1
#SBATCH --cpus-per-task=1
#SBATCH --time=0:0:10
#SBATCH --partition=<PARTITION>
#SBATCH --account=<PROJECT-CODE>
#SBATCH --mail-user=<MAIL-ACCOUNT>
#SBATCH --mail-type=ALL
#SBATCH --array=1-5
cd "${SLURM_SUBMIT_DIR}"
echo $(hostname)
echo "SLURM_ARRAY_JOB_ID: ${SLURM_ARRAY_JOB_ID}"
echo "SLURM_ARRAY_TASK_ID: ${SLURM_ARRAY_TASK_ID}"
echo "SLURM_ARRAY_TASK_COUNT: ${SLURM_ARRAY_TASK_COUNT}"
echo "SLURM_ARRAY_TASK_MIN: ${SLURM_ARRAY_TASK_MIN}"
echo "SLURM_ARRAY_TASK_MAX: ${SLURM_ARRAY_TASK_MAX}"
echo "Hello world"
Replace the placeholders <PARTITION>, <PROJECT-CODE>, and <MAIL-ACCOUNT> as before.
The typical use case for job arrays is to parallelize a loop, i.e. you have (in this example) 5 sub-tasks that are independent of each other.
The environment variable ${SLURM_ARRAY_TASK_ID} holds the index of the current array task (e.g. 1, 2, 3, etc.) and can be used to implement the per-task logic (e.g. which chromosome ID to process), as in the sketch below.
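For instance, a common pattern is to keep a plain-text list of inputs (one per line) and let each array task process the line matching its index; the file inputs.txt and the script my-analysis.sh below are hypothetical:
# inside job-array.sbatch, after the #SBATCH header lines
# pick the line of inputs.txt whose line number equals this task's index
INPUT=$(sed -n "${SLURM_ARRAY_TASK_ID}p" inputs.txt)
echo "array task ${SLURM_ARRAY_TASK_ID} processing: ${INPUT}"
./my-analysis.sh "${INPUT}"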
module commands
Modules are pre-installed software packages on the HPC systems.
❯ module --help
Usage: module [options] sub-command [args ...]
Options:
-h -? -H --help This help message
-s availStyle --style=availStyle Site controlled avail style: system (default: system)
--regression_testing Lmod regression testing
-b --brief brief listing with only user specified modules
-D Program tracing written to stderr
--debug=dbglvl Program tracing written to stderr (where dbglvl is a number 1,2,3)
--pin_versions=pinVersions When doing a restore use specified version, do not follow defaults
-d --default List default modules only when used with avail
-q --quiet Do not print out warnings
--expert Expert mode
-t --terse Write out in machine readable format for commands: list, avail, spider, savelist
--initial_load loading Lmod for first time in a user shell
--latest Load latest (ignore default)
-I --ignore_cache Treat the cache file(s) as out-of-date
--novice Turn off expert and quiet flag
--raw Print modulefile in raw output when used with show
-w twidth --width=twidth Use this as max term width
-v --version Print version info and quit
-r --regexp use regular expression match
--gitversion Dump git version in a machine readable way and quit
--dumpversion Dump version in a machine readable way and quit
--check_syntax --checkSyntax Checking module command syntax: do not load
--config Report Lmod Configuration
--miniConfig Report Lmod Configuration differences
--config_json Report Lmod Configuration in json format
--mt Report Module Table State
--timer report run times
-f --force force removal of a sticky module or save an empty collection
--redirect Send the output of list, avail, spider to stdout (not stderr)
--no_redirect Force output of list, avail and spider to stderr
--show_hidden Avail and spider will report hidden modules
--spider_timeout=timeout a timeout for spider
-T --trace
--nx --no_extensions
--loc --location Just print the file location when using show
module [options] sub-command [args ...]
Show available modules
❯ module avail
----------------------------------------------------------- /modules/spack/lmod/linux-rocky8-x86_64/Core ------------------------------------------------------------
apptainer/1.1.9 doxygen/1.9.8 gnuplot/6.0.0 nasm/2.15.05 py-gputil/1.4.0
aria2/1.37.0 eigen/3.4.0 gsl/2.7.1 netlib-lapack/3.11.0 qctool/2.2.0
autoconf-archive/2023.02.20 fastp/0.23.4 htslib/1.19.1 nextflow/23.10.1 rclone/1.65.1
bcftools/1.19-openblas fastqc/0.12.1 intel-mkl/2020.4.304 ont-guppy/6.1.7 rust/1.78.0
beagle/5.4 fasttree/2.1.11 interproscan/5.63-95.0 openblas/0.3.26 samtools/1.19.2
beast2/2.7.4 ffmpeg/6.1.1 jags/4.3.2-openblas opencv/4.8.0 slurm/23.11.6 (S)
bismark/0.24.1 gatk/4.5.0.0 jasper/3.0.3 openmpi/4.1.2 slurm/23.11.10 (S,L,D)
blast-plus/2.14.1 gcc/6.5.0 kokkos/4.3.01 openmpi/4.1.6 snptest/2.5.2
boost/1.85.0 gcc/9.5.0 ldsc/2.0.1-openblas openmpi/5.0.3 (D) spades/3.15.5
bowtie/1.3.1 gcc/10.5.0 libffi/3.4.6 openssh/9.7p1 star/2.7.11a
bowtie2/2.5.2 gcc/11.4.0 liblas/1.8.1 p7zip/17.05 tk/8.6.11
bwa/0.7.17 gcc/12.3.0 (D) libpng/1.6.39 paml/4.10.7 trimmomatic/0.39
clustal-omega/1.2.4 gcta/1.94.0beta libspatialindex/1.9.3 plink/1.9-beta6.27-openblas vcftools/0.1.16
cmake/3.27.9 gdal/3.8.5 libtool/2.4.7 pmix/2.2.5 vmd/1.9.3
cuda/11.1.1 gemma/0.98.5 mafft/7.505 pmix/3.2.4-2 zlib/1.3.1
cuda/12.4.0 (D) git/2.42.0 metal/2020-05-05 pmix/4.2.4 (D)
cudnn/8.9.7.29-12 gmake/4.4.1 mysql/8.0.35 pugixml/1.13
-------------------------------------------------------------------------- /modules/local ---------------------------------------------------------------------------
apps/abaqus/2018 apps/ls-dyna/r11.2-356-mpp-intel apps/regenie/3.6 languages/julia/1.10.3
apps/abaqus/2022 apps/ls-dyna/r11.2-356-smp-intel apps/regenie/4.0 (D) languages/perl/5.38.2
apps/abaqus/2023 (D) apps/ls-dyna/r11.2-356 (D) apps/sagemath/10.3 languages/python/bioconda
apps/alerax/1.2.0 apps/lumerical-fdtd/2023-r1.3 apps/salvus/2024.1.1 languages/python/biopython-1.83
Search for a module
❯ module spider plink
-----------------------------------------------------------------------------------------------------------------------------------------------------------------
plink: plink/1.9-beta6.27-openblas
-----------------------------------------------------------------------------------------------------------------------------------------------------------------
Other possible modules matches:
apps/plink2
This module can be loaded directly: module load plink/1.9-beta6.27-openblas
Help:
PLINK is a free, open-source whole genome association analysis toolset,
designed to perform a range of basic, large-scale analyses in a
computationally efficient manner.
-----------------------------------------------------------------------------------------------------------------------------------------------------------------
To find other possible module matches execute:
$ module -r spider '.*plink.*'
Load modules into the working environment
module add plink
Unload them from the working environment
module del plink
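(module add and module del behave like module load and module unload.) It is common to load the required modules inside the submission script itself so the job environment is self-contained; below is a minimal sketch using the plink module listed above, with the usual placeholders:
#!/bin/bash
#SBATCH --job-name=plink-check
#SBATCH --ntasks=1
#SBATCH --cpus-per-task=1
#SBATCH --time=0:5:0
#SBATCH --partition=<PARTITION>
#SBATCH --account=<PROJECT-CODE>
# load the software this job needs
module load plink/1.9-beta6.27-openblas
# confirm the loaded version; replace with your real analysis commands
plink --version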
Other commands
Getting help for a command, man and --help
Many (but not all) commands have a manual installed; use the man command to view it.
man module
The commands themselves are also likely to have a help description showcasing their usage, which by convention is invoked with the --help flag.
man --help
Check disk quota, user-quota
❯ user-quota
Block Limits | File Limits
Filesystem Fileset blocks quota limit in_doubt grace | files quota limit in_doubt grace
HOME USR 7.664G 20G 21G 0 none | 262664 0 0 0 none
WORK USR 184.7G 1T 1T 0 none | 2401 0 0 0 none
FAQs
TODO: To update
Links
- MRC IEU Sharepoint resources (UoB access only)
- HPC docs by ACRC
- Other University of Bristol docs
- Other resources on HPC commands