# HPC Deployment and Slurm Guides
This guide collects practical Slurm patterns for the first software packages covered on the site: VASP, ORCA, and Quantum ESPRESSO.
The examples assume a Linux cluster with:
- Slurm as the scheduler
- a module system for compilers, MPI stacks, and chemistry codes
- shared submission storage plus node-local scratch
- `srun` as the preferred launcher inside allocations

### Adapt these templates to your cluster
Module names, executable names, partitions, MPI plugins, and scratch paths vary by site. Treat the examples below as working boilerplates, then align them with your center's documented software stack and queue policy.
## Slurm Concepts That Matter for Quantum Chemistry

For electronic-structure jobs, the scheduler parameters that matter most are:

- `--nodes`: how many physical nodes the job spans
- `--ntasks`: total MPI ranks across the job
- `--ntasks-per-node`: MPI ranks placed on each node
- `--cpus-per-task`: OpenMP threads reserved for each MPI rank
- `--mem` or `--mem-per-cpu`: memory reservation strategy
- `--time`: wall-clock limit including staging, restarts, and cleanup
The most common mistake is to scale only by adding MPI ranks. In practice, each code has its own balance between MPI ranks, OpenMP threads, memory footprint, FFT behavior, and filesystem pressure.
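As a concreteness check, the arithmetic for a hybrid layout can be worked through directly. The node, rank, and thread counts below are illustrative, not recommendations:

```shell
# Hypothetical hybrid layout: 2 nodes, 16 MPI ranks per node, 4 threads per rank
NODES=2
NTASKS_PER_NODE=16
CPUS_PER_TASK=4

TOTAL_RANKS=$((NODES * NTASKS_PER_NODE))             # what --ntasks must equal
CORES_PER_NODE=$((NTASKS_PER_NODE * CPUS_PER_TASK))  # must not exceed physical cores

echo "total MPI ranks: ${TOTAL_RANKS}"
echo "cores reserved per node: ${CORES_PER_NODE}"
```

If the cores reserved per node exceed the physical core count of the partition's nodes, Slurm will typically reject the submission, so this arithmetic is worth checking before editing any template.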
## Recommended Starting Points

| Package | Safe First Layout | When To Increase Nodes | Scaling Notes |
|---|---|---|---|
| VASP | 1-2 nodes, mostly MPI, `cpus-per-task=1` or 2 | Larger supercells, hybrid functionals, AIMD, NEB | Benchmark KPAR, NCORE, and threads together |
| ORCA | 1 node first, either shared memory or one MPI task per NUMA region | Large DLPNO, memory-heavy correlated methods, long compound jobs | Match `%pal` to the Slurm allocation |
| Quantum ESPRESSO | 1-4 nodes, MPI ranks plus modest threading | Larger plane-wave jobs, dense k-point meshes, phonons | Tune k-point pools and band groups, not just rank count |
## A Reusable Scratch-Staging Pattern

Node-local scratch usually gives better I/O performance than running entirely on shared storage. The pattern below is safe for many chemistry workloads:

```bash
#!/bin/bash
set -euo pipefail

# Threads per MPI rank; default to 1 outside a Slurm allocation
export OMP_NUM_THREADS="${SLURM_CPUS_PER_TASK:-1}"
export MKL_NUM_THREADS="${OMP_NUM_THREADS}"

SUBMIT_DIR="${SLURM_SUBMIT_DIR}"
SCRATCH_BASE="${TMPDIR:-/scratch/$USER}"
JOB_SCRATCH="${SCRATCH_BASE}/${SLURM_JOB_ID}"
RESULT_DIR="${SUBMIT_DIR}/results/${SLURM_JOB_ID}"
mkdir -p "${JOB_SCRATCH}" "${RESULT_DIR}"

# On any exit, copy results back to shared storage, clean scratch,
# and preserve the original exit status
cleanup() {
  status=$?
  rsync -a "${JOB_SCRATCH}/" "${RESULT_DIR}/" || true
  rm -rf "${JOB_SCRATCH}"
  exit "${status}"
}
trap cleanup EXIT

cd "${JOB_SCRATCH}"
```
Use this pattern when:
- restart files are large enough that repeated shared-filesystem writes hurt throughput
- the code emits many medium-sized scratch files during integral or FFT work
- you want a predictable place to collect outputs after preemption or failure
Site-specific variations to watch for:
- some clusters provide `$TMPDIR` automatically, others require explicit scratch paths
- some centers purge node-local scratch immediately on job exit, so `trap`-based copy-back is essential
- some MPI installations require `srun --mpi=pmix`, `--mpi=pmi2`, or site wrappers instead of bare `srun`
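The `${TMPDIR:-...}` fallback idiom used throughout the templates can be checked in isolation. The paths and the user name below are illustrative:

```shell
# Shell parameter fallback: use $TMPDIR when the site sets it, else a fixed scratch path
USER_NAME="alice"                       # hypothetical user, for illustration only
unset TMPDIR
SCRATCH_BASE="${TMPDIR:-/scratch/${USER_NAME}}"
echo "without TMPDIR: ${SCRATCH_BASE}"  # falls back to /scratch/alice

TMPDIR="/local/tmp"
SCRATCH_BASE="${TMPDIR:-/scratch/${USER_NAME}}"
echo "with TMPDIR:    ${SCRATCH_BASE}"  # the site-provided path wins
```

This is why the same template works both on clusters that export `TMPDIR` and on those that document a fixed scratch root.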
## VASP Slurm Template
VASP usually performs best with a mostly-MPI layout and only light threading.
Start conservative, then benchmark KPAR, NCORE, and node count before
launching expensive production campaigns.
```bash
#!/bin/bash
#SBATCH --job-name=vasp-relax
#SBATCH --partition=compute
#SBATCH --nodes=2
#SBATCH --ntasks-per-node=64
#SBATCH --cpus-per-task=1
#SBATCH --mem=0
#SBATCH --time=12:00:00
#SBATCH --output=logs/%x-%j.out

set -euo pipefail

module purge
module load intel-oneapi-mpi
module load vasp/6.4

export OMP_NUM_THREADS="${SLURM_CPUS_PER_TASK}"
export MKL_NUM_THREADS="${OMP_NUM_THREADS}"

SUBMIT_DIR="${SLURM_SUBMIT_DIR}"
SCRATCH_BASE="${TMPDIR:-/scratch/$USER}"
JOB_SCRATCH="${SCRATCH_BASE}/${SLURM_JOB_ID}"
RESULT_DIR="${SUBMIT_DIR}/results/${SLURM_JOB_ID}"
mkdir -p "${JOB_SCRATCH}" "${RESULT_DIR}"
trap 'status=$?; rsync -a "${JOB_SCRATCH}/" "${RESULT_DIR}/" || true; rm -rf "${JOB_SCRATCH}"; exit "${status}"' EXIT

cd "${JOB_SCRATCH}"
cp "${SUBMIT_DIR}/INCAR" .
cp "${SUBMIT_DIR}/POSCAR" .
cp "${SUBMIT_DIR}/KPOINTS" .
cp "${SUBMIT_DIR}/POTCAR" .

# Optional restart files for chained jobs
if [[ -f "${SUBMIT_DIR}/WAVECAR" ]]; then cp "${SUBMIT_DIR}/WAVECAR" .; fi
if [[ -f "${SUBMIT_DIR}/CHGCAR" ]]; then cp "${SUBMIT_DIR}/CHGCAR" .; fi

srun --mpi=pmix vasp_std > vasp.out
```
### VASP Layout Guidance

- Prefer physical cores over logical hyperthreads unless your site recommends otherwise.
- For large k-point workloads, test whether `KPAR` should divide the total MPI rank count cleanly.
- For memory-heavy hybrids or large supercells, fewer MPI ranks with `cpus-per-task=2` can outperform a pure-MPI layout.
- Preserve `WAVECAR` and `CHGCAR` when chaining relaxations, static runs, or DOS calculations across multiple jobs.
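As an illustration, an INCAR parallelization block sized for the 2-node, 128-rank template above might look like the sketch below. The specific values are starting points to benchmark, not recommendations:

```
KPAR  = 4    # 128 ranks / 4 k-point groups = 32 ranks per group
NCORE = 8    # ranks cooperating on each orbital within a group
```

`KPAR` splits ranks into independent k-point groups, while `NCORE` controls how many ranks share the work on a single orbital, so the two knobs interact and should be benchmarked together.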
## ORCA Slurm Template
ORCA is often most efficient on a single node first, especially for DFT, spectroscopy, and many medium-sized wavefunction jobs. Large correlated methods can benefit from more nodes, but memory and scratch behavior should be measured before scaling out aggressively.
```bash
#!/bin/bash
#SBATCH --job-name=orca-dlpno
#SBATCH --partition=compute
#SBATCH --nodes=1
#SBATCH --ntasks=1
#SBATCH --cpus-per-task=32
#SBATCH --mem=0
#SBATCH --time=24:00:00
#SBATCH --output=logs/%x-%j.out

set -euo pipefail

module purge
module load openmpi/4.1
module load orca/6.0

export OMP_NUM_THREADS="${SLURM_CPUS_PER_TASK}"
export MKL_NUM_THREADS="${OMP_NUM_THREADS}"

SUBMIT_DIR="${SLURM_SUBMIT_DIR}"
SCRATCH_BASE="${TMPDIR:-/scratch/$USER}"
JOB_SCRATCH="${SCRATCH_BASE}/${SLURM_JOB_ID}"
RESULT_DIR="${SUBMIT_DIR}/results/${SLURM_JOB_ID}"
mkdir -p "${JOB_SCRATCH}" "${RESULT_DIR}"
trap 'status=$?; rsync -a "${JOB_SCRATCH}/" "${RESULT_DIR}/" || true; rm -rf "${JOB_SCRATCH}"; exit "${status}"' EXIT

cd "${JOB_SCRATCH}"
cp "${SUBMIT_DIR}/job.inp" .
cp "${SUBMIT_DIR}/job.xyz" . 2>/dev/null || true

# ORCA launches its own MPI processes, so do not wrap it in srun; call it by
# full path so its internal mpirun invocations resolve correctly
"$(command -v orca)" job.inp > job.out
```
### ORCA Layout Guidance

- Keep the `%pal` block in the ORCA input synchronized with the Slurm allocation. For example, `nprocs 32` should match `--cpus-per-task=32` in the single-node template above.
- Large DLPNO and correlated jobs often need generous local scratch because integral and pair-domain intermediates can become substantial.
- If your ORCA build supports multi-node execution, start with one MPI rank per NUMA region or node and benchmark carefully before using many nodes.
- Compound jobs are easier to recover when each stage writes named outputs into a dedicated results directory copied back from scratch.
Example ORCA %pal block:
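A minimal sketch, sized to match the 32-CPU single-node template above (the `%maxcore` value is a hypothetical per-core memory setting to adjust to your nodes):

```
%pal
  nprocs 32
end
%maxcore 3000   # MB per core; hypothetical value, tune to the node's memory
```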
## Quantum ESPRESSO Slurm Template
Quantum ESPRESSO has more parallel dimensions than most users need on day one. Begin with a stable MPI layout, then introduce k-point pools or additional parallel flags only after you confirm baseline scaling.
```bash
#!/bin/bash
#SBATCH --job-name=qe-scf
#SBATCH --partition=compute
#SBATCH --nodes=2
#SBATCH --ntasks-per-node=64
#SBATCH --cpus-per-task=1
#SBATCH --mem=0
#SBATCH --time=08:00:00
#SBATCH --output=logs/%x-%j.out

set -euo pipefail

module purge
module load openmpi/4.1
module load quantum-espresso/7.4

export OMP_NUM_THREADS="${SLURM_CPUS_PER_TASK}"
export MKL_NUM_THREADS="${OMP_NUM_THREADS}"

SUBMIT_DIR="${SLURM_SUBMIT_DIR}"
SCRATCH_BASE="${TMPDIR:-/scratch/$USER}"
JOB_SCRATCH="${SCRATCH_BASE}/${SLURM_JOB_ID}"
RESULT_DIR="${SUBMIT_DIR}/results/${SLURM_JOB_ID}"
mkdir -p "${JOB_SCRATCH}" "${RESULT_DIR}"
trap 'status=$?; rsync -a "${JOB_SCRATCH}/" "${RESULT_DIR}/" || true; rm -rf "${JOB_SCRATCH}"; exit "${status}"' EXIT

cd "${JOB_SCRATCH}"
cp "${SUBMIT_DIR}/scf.in" .
cp "${SUBMIT_DIR}"/*.UPF . 2>/dev/null || true

srun --mpi=pmix pw.x -in scf.in -nk 4 > scf.out
```
### Quantum ESPRESSO Layout Guidance

- Make sure the number of k-point pools divides the total rank count cleanly.
- Treat `-nk` (or `-npools`) as a tuning knob for k-point parallelism rather than a default that should always scale with node count.
- Phonon workflows may want a different rank layout than `pw.x`, so benchmark `ph.x` separately instead of reusing SCF settings blindly.
- Keep pseudopotentials versioned with the input set so restart and reproduction remain straightforward across clusters.
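A quick divisibility check, using the rank and pool counts from the template above, can catch a bad `-nk` value before submission:

```shell
# Sanity check: the k-point pool count should divide the total MPI rank count
NTASKS=128     # 2 nodes x 64 ranks per node, as in the template
NPOOLS=4       # value passed to pw.x as -nk

if (( NTASKS % NPOOLS == 0 )); then
  echo "ok: ${NPOOLS} pools of $((NTASKS / NPOOLS)) ranks each"
else
  echo "error: -nk ${NPOOLS} does not divide ${NTASKS} ranks" >&2
  exit 1
fi
```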
## Checklist Before Production Runs

- Confirm the executable was built with the same MPI family you load in the job script.
- Create `logs/` and `results/` directories in the submission tree before large job campaigns.
- Verify which restart files must be copied into scratch and back out again.
- Match OpenMP settings to `--cpus-per-task`; do not leave thread counts at library defaults.
- Benchmark one small and one medium system before launching a long campaign on many nodes.
- Record scheduler settings next to the scientific inputs so scaling decisions stay reproducible.
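The restart-file item in the checklist can be automated with a small pre-flight sketch. The file names below are the VASP set from the template above and should be adjusted per package; the script runs in a throwaway directory with stand-in files for illustration:

```shell
# Pre-flight check: confirm required inputs exist before sbatch.
# WAVECAR stands in for a restart file that is expected but absent.
workdir="$(mktemp -d)"
cd "${workdir}"
touch INCAR POSCAR KPOINTS POTCAR        # stand-ins for real inputs

required=(INCAR POSCAR KPOINTS POTCAR WAVECAR)
for f in "${required[@]}"; do
  if [[ -f "${f}" ]]; then
    echo "found:   ${f}"
  else
    echo "missing: ${f}"
  fi
done
```

Running a check like this in the real submission directory, and aborting on missing restart files, is cheaper than discovering the gap hours into a queued job.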
## Common Failure Modes

- Jobs remain idle or oversubscribe cores because `%pal`, `OMP_NUM_THREADS`, or Slurm CPU settings disagree.
- Scaling gets worse beyond one node because the filesystem, not the solver, becomes the bottleneck.
- Restarts fail because scratch contents were not copied back on timeout or job cancellation.
- Site wrappers expect `srun`, but the job script uses `mpirun` from a mismatched MPI installation.
This guide is intentionally conservative. Once the basic layouts are stable on your site, the next step is to turn these patterns into package-specific, version-controlled runbooks for your group or cluster.