Using the HPC
Author: Yi Liu
This document is intended to be an overview and a "cheatsheet" for IEU researchers, but not a replacement for the detailed documentation available on the IEU and ACRC sites. See the links at the bottom of the page for this information.
Example submission script
Below is a quick template for a submission script, which does the following
- submit a job called "hello-world"
- request compute resources
  - 1 process (ntasks)
  - 1 core for subthreads in the process (cpus-per-task)
  - 10 secs compute time
- cd to the submission directory and print out the working directory
- print out the name of the compute node
- print out "Hello world"
To use the script you should replace the various placeholders, e.g. <PARTITION>, with the appropriate values.
hello-world.sbatch
#!/bin/bash
#SBATCH --job-name=hello-world
#SBATCH --ntasks=1
#SBATCH --cpus-per-task=1
#SBATCH --time=0:0:10
#SBATCH --partition=<PARTITION>
#SBATCH --account=<PROJECT-CODE>
#SBATCH --mail-user=<MAIL-ACCOUNT>
#SBATCH --mail-type=ALL
cd "${SLURM_SUBMIT_DIR}"
echo $(pwd)
echo $(hostname)
echo "Hello world"
where
- <PARTITION>: the partition to submit the job to
- <PROJECT-CODE>: the account code for IEU HPC projects. For further details, see https://uob.sharepoint.com/sites/integrative-epidemiology/SitePages/ieu-servers-and-hpc-clusters.aspx and search for "project code". NOTE: on the current HPC systems your jobs will NOT run unless you specify the correct project code.
- <MAIL-ACCOUNT>: your email address.
To submit, run
sbatch hello-world.sbatch
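By default sbatch writes the job's standard output and error to a file named slurm-<JOB-ID>.out in the directory you submitted from, so you can check the result once the job has finished (the job ID below is a hypothetical example):
# sbatch prints "Submitted batch job <JOB-ID>" on submission
# once the job has finished, inspect its output file; "Hello world" should be at the end
cat slurm-11318765.out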
IEU HPC resources
Systems
The IEU has dedicated compute nodes and partitions on two HPC systems
- BlueCrystal Phase 4 (BC4)
  - BC4 is recommended for any computationally heavy-duty tasks
  - The mrcieu partition is dedicated to IEU users
- BluePebble (BP1)
  - BP1 is recommended ONLY for GPU-oriented tasks, as its non-GPU specs (CPU, RAM, etc.) are inferior to those on BC4
  - The mrcieu-gpu partition is dedicated to IEU users
The IEU has other non-HPC compute resources, e.g. the epi-franklin server. Further details can be found via the Links section at the bottom of this page.
Filesystem
Typically IEU users have 3 locations in the local filesystem on the HPC
- home directory: /user/home/<USERNAME>. You should only use the home directory for system configurations.
- work directory: /user/work/<USERNAME>. Your projects should live here.
- IEU purchased project space: /mnt/storage/private/mrcieu (bc4), /bp1/mrcieu1 (bp1). Path to various shared datasets and common project spaces.
In addition, the IEU RDSF space (the IEU data warehouse and project workspace) is also mounted on the HPC at /projects/MRC-IEU/research.
However, this path is only mounted on login nodes and is not accessible on compute nodes.
You should only use this space for data backup, not as part of the local filesystem,
as RDSF is a much slower filesystem and the mount can be very unstable.
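For example, a backup copy can be made from a login node with a standard tool such as rsync; the project sub-directories below are hypothetical and you need write access to the relevant RDSF area:
# run this on a LOGIN node; the RDSF mount is not visible on compute nodes
rsync -av \
  /user/work/<USERNAME>/my-project/results/ \
  /projects/MRC-IEU/research/<RDSF-PROJECT-DIR>/my-project-backup/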
There are also other locations worth mentioning:
- bc4 /mnt/storage/scratch/<USERNAME>: this is a scratch space kept for legacy purposes. Users should now use their work directories instead.
Slurm commands
Viewing the overall state of the HPC
sinfo
Use sinfo -s to display a summary view of the partitions and compute nodes that you have access to.
sinfo -s
PARTITION AVAIL TIMELIMIT NODES(A/I/O/T) NODELIST
cpu* up 14-00:00:0 xxxxxxxxxxx xxxxxxxxxxxxxxxx
gpu up 7-00:00:00 xxxxxxxxx xxxxxxxxxxxxxxxxxxxxxxxxxxxx
hmem up 14-00:00:0 xxxxxxx xxxxxxxxxxxxxx
test up 1:00:00 xxxxxxxxxxx xxxxxxxxxxxxxxxx
veryshort up 6:00:00 xxxxxxxxxxx xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx
mrcieu up 14-00:00:0 xxxxxxxxx xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx
Submitting jobs and requesting resources
There are two modes of job submission:
- batch mode, where you submit a job submission script (e.g. hello-world.sbatch) that executes the commands detailed in the script. This is the generally expected way of using HPC compute. However, like any other large-scale batch processing, you need to make sure your script and the underlying commands are robust to potential errors, or have mechanisms to accommodate errors, to make full use of the HPC resources. In other words, prepare for the submitted job to fail due to bugs in your code (see the sketch after this list).
- interactive mode, where for the duration of the job you gain access to an interactive shell on a compute node and use the command line to run commands. Typical use cases for interactive jobs include trial experiments or interactive use such as jupyter notebooks.
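As a minimal sketch of "preparing for failure", making the shell exit on the first error and echo its progress makes it much easier to see where a batch job went wrong; the options below are standard bash and only a suggested pattern (#SBATCH headers omitted):
#!/bin/bash
# fail fast: -e exit on the first error, -u treat unset variables as errors,
# -o pipefail catch failures inside pipes
set -euo pipefail
# echo each command before it runs, so the slurm-<JOB-ID>.out log shows how far the job got
set -x
cd "${SLURM_SUBMIT_DIR}"
# ... your actual commands go here ...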
sbatch, batch mode job submission
sbatch is used for submitting a job script.
sbatch <JOB-SCRIPT>.sbatch
srun, interactive mode
srun is used to run a command (e.g. echo), where this command could also request an interactive shell.
Run a command
srun --account <PROJECT-CODE> --partition mrcieu echo "hello world"
# srun: job 11318808 queued and waiting for resources
# srun: job 11318808 has been allocated resources
# hello world
Request an interactive shell
srun --account <PROJECT-CODE> --partition mrcieu --pty /bin/bash -l
# srun: job 11318831 queued and waiting for resources
# srun: job 11318831 has been allocated resources
When using srun you need to keep your current terminal and ssh session open, which can be unfeasible as the waiting time could be very long, and the job is lost once you disconnect from the ssh session.
To circumvent this you can use a terminal multiplexer (e.g. tmux or screen), which is not covered here.
Alternatively you could use salloc.
salloc, interactive mode
salloc is used for requesting an allocation of compute resources, which, when granted, can be accessed via srun.
As an example, below is a salloc job submitted and then interacted with via srun.
## request compute resources in jobid 11319203 via salloc
salloc --account <PROJECT-CODE> --partition mrcieu --time 0:30:00 --mail-user=<USER-EMAIL> --mail-type=ALL
# salloc: Pending job allocation 11319203
# salloc: job 11319203 queued and waiting for resources
# salloc: job 11319203 has been allocated resources
# salloc: Granted job allocation 11319203
# salloc: Nodes compute062 are ready for job
# Info: Loaded slurm/23.11.10 into the modular environment.
#
#
# SLURM: Your account, XXXXXXX, is ready to submit Slurm jobs.
## access the allocated resources in jobid 11319203 via srun
srun --jobid 11319203 --pty /bin/bash -l
## check where the compute node is
hostname
# compute062.bc4.acrc.priv
## check the slurm job id environment variable
echo ${SLURM_JOB_ID}
# 11319203
Managing jobs
scancel, cancelling jobs
scancel <JOB-ID>
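scancel also accepts a few other useful forms, e.g. cancelling all of your own jobs or a single task of a job array:
# cancel all jobs belonging to a user (i.e. yourself)
scancel -u <USERNAME>
# cancel only task 3 of a job array
scancel <JOB-ID>_3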
Viewing the status of jobs
squeue, displaying the status of submitted jobs
❯ squeue
JOBID PARTITION NAME USER ST TIME NODES NODELIST(REASON)
xxxxxxxx cpu MA5025 xxxxxxx PD 0:00 10 (Resources)
xxxxxxxx cpu MA5025 xxxxxxx PD 0:00 10 (Resources)
xxxxxxxx cpu BA5025 xxxxxxx PD 0:00 10 (Resources)
xxxxxxxx cpu CA5025 xxxxxxx PD 0:00 10 (Resources)
xxxxxxxx cpu CA5025 xxxxxxx PD 0:00 10 (Resources)
xxxxxxxx cpu CA5025 xxxxxxx PD 0:00 10 (Resources)
xxxxxxxx cpu SA5025 xxxxxxx PD 0:00 10 (Resources)
xxxxxxxx cpu SA5025 xxxxxxx PD 0:00 10 (Resources)
xxxxxxxx cpu bigdft_g xxxxxxx PD 0:00 4 (Resources)
sacct, displaying the accounting data of recent jobs
❯ sacct
JobID JobName Partition Account AllocCPUS State ExitCode
------------ ---------- ---------- ---------- ---------- ---------- --------
xxxxxxxxxxxx bash mrcieu default 0 CANCELLED+ 0:0
xxxxxxxxxxxx bash mrcieu xxxxxxxxxx 12 FAILED 127:0
xxxxxxxxxxxx extern xxxxxxxxxx 12 COMPLETED 0:0
xxxxxxxxxxxx bash xxxxxxxxxx 12 FAILED 127:0
xxxxxxxxxxxx interacti+ cpu xxxxxxxxxx 1 COMPLETED 0:0
xxxxxxxxxxxx extern xxxxxxxxxx 1 COMPLETED 0:0
Other useful Slurm commands
scontrol
scontrol show nodes
scontrol show jobs
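scontrol show job <JOB-ID> and scontrol show node <NODE-NAME> print the full record for a single job or node, which is handy when working out why a job is still pending:
# detailed record of one job: requested resources, node list, working directory, reason for pending, etc.
scontrol show job <JOB-ID>
# detailed record of one node: CPUs, memory, current state
scontrol show node <NODE-NAME>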
Topic: job arrays
Below is an example job array script.
job-array.sbatch
#!/bin/bash
#SBATCH --job-name=job-array-example
#SBATCH --ntasks=1
#SBATCH --cpus-per-task=1
#SBATCH --time=0:0:10
#SBATCH --partition=<PARTITION>
#SBATCH --account=<PROJECT-CODE>
#SBATCH --mail-user=<MAIL-ACCOUNT>
#SBATCH --mail-type=ALL
#SBATCH --array=1-5
cd "${SLURM_SUBMIT_DIR}"
echo $(hostname)
echo "SLURM_ARRAY_JOB_ID: ${SLURM_ARRAY_JOB_ID}"
echo "SLURM_ARRAY_TASK_ID: ${SLURM_ARRAY_TASK_ID}"
echo "SLURM_ARRAY_TASK_COUNT: ${SLURM_ARRAY_TASK_COUNT}"
echo "SLURM_ARRAY_TASK_MIN: ${SLURM_ARRAY_TASK_MIN}"
echo "SLURM_ARRAY_TASK_MAX: ${SLURM_ARRAY_TASK_MAX}"
echo "Hello world"
Replace the placeholders <PARTITION>, <PROJECT-CODE>, and <MAIL-ACCOUNT> as before.
The typical use case for job arrays is to parallelize a loop, i.e. you have (in this example) 5 sub-tasks that are independent of each other.
The environment variable ${SLURM_ARRAY_TASK_ID} holds the index of the current array task (e.g. 1, 2, 3, etc.) and can be used to implement the per-task logic (e.g. which chromosome ID to process), as in the sketch below.
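For instance, a common pattern is to keep a plain-text list of inputs (one per line) and let each array task process the line matching its index; the file inputs.txt and the script my-analysis.sh below are hypothetical:
# inside job-array.sbatch, after the #SBATCH header lines
# pick the line of inputs.txt whose line number equals this task's index
INPUT=$(sed -n "${SLURM_ARRAY_TASK_ID}p" inputs.txt)
echo "array task ${SLURM_ARRAY_TASK_ID} processing: ${INPUT}"
./my-analysis.sh "${INPUT}"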
module commands
Modules are pre-installed software packages on the HPC systems.
❯ module --help
Usage: module [options] sub-command [args ...]
Options:
-h -? -H --help This help message
-s availStyle --style=availStyle Site controlled avail style: system (default: system)
--regression_testing Lmod regression testing
-b --brief brief listing with only user specified modules
-D Program tracing written to stderr
--debug=dbglvl Program tracing written to stderr (where dbglvl is a number 1,2,3)
--pin_versions=pinVersions When doing a restore use specified version, do not follow defaults
-d --default List default modules only when used with avail
-q --quiet Do not print out warnings
--expert Expert mode
-t --terse Write out in machine readable format for commands: list, avail, spider, savelist
--initial_load loading Lmod for first time in a user shell
--latest Load latest (ignore default)
-I --ignore_cache Treat the cache file(s) as out-of-date
--novice Turn off expert and quiet flag
--raw Print modulefile in raw output when used with show
-w twidth --width=twidth Use this as max term width
-v --version Print version info and quit
-r --regexp use regular expression match
--gitversion Dump git version in a machine readable way and quit
--dumpversion Dump version in a machine readable way and quit
--check_syntax --checkSyntax Checking module command syntax: do not load
--config Report Lmod Configuration
--miniConfig Report Lmod Configuration differences
--config_json Report Lmod Configuration in json format
--mt Report Module Table State
--timer report run times
-f --force force removal of a sticky module or save an empty collection
--redirect Send the output of list, avail, spider to stdout (not stderr)
--no_redirect Force output of list, avail and spider to stderr
--show_hidden Avail and spider will report hidden modules
--spider_timeout=timeout a timeout for spider
-T --trace
--nx --no_extensions
--loc --location Just print the file location when using show
module [options] sub-command [args ...]
Show available modules
❯ module avail
----------------------------------------------------------- /modules/spack/lmod/linux-rocky8-x86_64/Core ------------------------------------------------------------
apptainer/1.1.9 doxygen/1.9.8 gnuplot/6.0.0 nasm/2.15.05 py-gputil/1.4.0
aria2/1.37.0 eigen/3.4.0 gsl/2.7.1 netlib-lapack/3.11.0 qctool/2.2.0
autoconf-archive/2023.02.20 fastp/0.23.4 htslib/1.19.1 nextflow/23.10.1 rclone/1.65.1
bcftools/1.19-openblas fastqc/0.12.1 intel-mkl/2020.4.304 ont-guppy/6.1.7 rust/1.78.0
beagle/5.4 fasttree/2.1.11 interproscan/5.63-95.0 openblas/0.3.26 samtools/1.19.2
beast2/2.7.4 ffmpeg/6.1.1 jags/4.3.2-openblas opencv/4.8.0 slurm/23.11.6 (S)
bismark/0.24.1 gatk/4.5.0.0 jasper/3.0.3 openmpi/4.1.2 slurm/23.11.10 (S,L,D)
blast-plus/2.14.1 gcc/6.5.0 kokkos/4.3.01 openmpi/4.1.6 snptest/2.5.2
boost/1.85.0 gcc/9.5.0 ldsc/2.0.1-openblas openmpi/5.0.3 (D) spades/3.15.5
bowtie/1.3.1 gcc/10.5.0 libffi/3.4.6 openssh/9.7p1 star/2.7.11a
bowtie2/2.5.2 gcc/11.4.0 liblas/1.8.1 p7zip/17.05 tk/8.6.11
bwa/0.7.17 gcc/12.3.0 (D) libpng/1.6.39 paml/4.10.7 trimmomatic/0.39
clustal-omega/1.2.4 gcta/1.94.0beta libspatialindex/1.9.3 plink/1.9-beta6.27-openblas vcftools/0.1.16
cmake/3.27.9 gdal/3.8.5 libtool/2.4.7 pmix/2.2.5 vmd/1.9.3
cuda/11.1.1 gemma/0.98.5 mafft/7.505 pmix/3.2.4-2 zlib/1.3.1
cuda/12.4.0 (D) git/2.42.0 metal/2020-05-05 pmix/4.2.4 (D)
cudnn/8.9.7.29-12 gmake/4.4.1 mysql/8.0.35 pugixml/1.13
-------------------------------------------------------------------------- /modules/local ---------------------------------------------------------------------------
apps/abaqus/2018 apps/ls-dyna/r11.2-356-mpp-intel apps/regenie/3.6 languages/julia/1.10.3
apps/abaqus/2022 apps/ls-dyna/r11.2-356-smp-intel apps/regenie/4.0 (D) languages/perl/5.38.2
apps/abaqus/2023 (D) apps/ls-dyna/r11.2-356 (D) apps/sagemath/10.3 languages/python/bioconda
apps/alerax/1.2.0 apps/lumerical-fdtd/2023-r1.3 apps/salvus/2024.1.1 languages/python/biopython-1.83
Search for a module
❯ module spider plink
-----------------------------------------------------------------------------------------------------------------------------------------------------------------
plink: plink/1.9-beta6.27-openblas
-----------------------------------------------------------------------------------------------------------------------------------------------------------------
Other possible modules matches:
apps/plink2
This module can be loaded directly: module load plink/1.9-beta6.27-openblas
Help:
PLINK is a free, open-source whole genome association analysis toolset,
designed to perform a range of basic, large-scale analyses in a
computationally efficient manner.
-----------------------------------------------------------------------------------------------------------------------------------------------------------------
To find other possible module matches execute:
$ module -r spider '.*plink.*'
Load modules into the working environment
module add plink
Unload them from the working environment
module del plink
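(module add and module del behave like module load and module unload.) It is common to load the required modules inside the submission script itself so the job environment is self-contained; below is a minimal sketch using the plink module listed above, with the usual placeholders:
#!/bin/bash
#SBATCH --job-name=plink-check
#SBATCH --ntasks=1
#SBATCH --cpus-per-task=1
#SBATCH --time=0:5:0
#SBATCH --partition=<PARTITION>
#SBATCH --account=<PROJECT-CODE>
# load the software this job needs
module load plink/1.9-beta6.27-openblas
# confirm the loaded version; replace with your real analysis commands
plink --version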
Other commands
Getting help for a command, man and --help
Many (but not all) commands have a manual installed; use the man command to view it.
man module
The commands themselves are also likely to have a help description showcasing their usage, which by convention is invoked with the --help flag.
man --help
Check disk quota, user-quota
❯ user-quota
Block Limits | File Limits
Filesystem Fileset blocks quota limit in_doubt grace | files quota limit in_doubt grace
HOME USR 7.664G 20G 21G 0 none | 262664 0 0 0 none
WORK USR 184.7G 1T 1T 0 none | 2401 0 0 0 none
FAQs
TODO: To update
Links
- MRC IEU Sharepoint resources (UoB access only)
- HPC docs by ACRC
- Other University of Bristol docs
- Other resources on HPC commands