
HPC Deployment and Slurm Guides

This guide collects practical Slurm patterns for the first software packages covered on the site: VASP, ORCA, and Quantum ESPRESSO.

The examples assume a Linux cluster with:

  • Slurm as the scheduler
  • a module system for compilers, MPI stacks, and chemistry codes
  • shared submission storage plus node-local scratch
  • srun as the preferred launcher inside allocations

Adapt these templates to your cluster

Module names, executable names, partitions, MPI plugins, and scratch paths vary by site. Treat the examples below as working templates, then align them with your center's documented software stack and queue policy.

Slurm Concepts That Matter For Quantum Chemistry

For electronic-structure jobs, the scheduler parameters that matter most are:

  • --nodes: how many physical nodes the job spans
  • --ntasks: total MPI ranks across the job
  • --ntasks-per-node: MPI ranks placed on each node
  • --cpus-per-task: OpenMP threads reserved for each MPI rank
  • --mem or --mem-per-cpu: memory reservation strategy
  • --time: wall-clock limit including staging, restarts, and cleanup

The most common mistake is to scale only by adding MPI ranks. In practice, each code has its own balance between MPI ranks, OpenMP threads, memory footprint, FFT behavior, and filesystem pressure.
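To make the trade-off concrete, the same node can host many single-threaded ranks or fewer multi-threaded ones. The sketch below assumes hypothetical 64-core nodes and a 2-node job (both are illustrative, not site facts) and checks that a hybrid layout still fills whole nodes:

```shell
#!/bin/bash
# Illustrative sketch: check that a proposed hybrid layout fills whole nodes.
# The node size and layout numbers are assumptions; adjust to your cluster.
nodes=2
ntasks_per_node=16
cpus_per_task=4
cores_per_node=64   # physical cores per compute node (site-specific)

used=$(( ntasks_per_node * cpus_per_task ))
total_ranks=$(( nodes * ntasks_per_node ))
if [ "$used" -ne "$cores_per_node" ]; then
  echo "layout leaves $(( cores_per_node - used )) cores idle per node" >&2
fi
echo "total MPI ranks: ${total_ranks}, threads per rank: ${cpus_per_task}"
```

Running the same check with ntasks_per_node=64 and cpus_per_task=1 describes the pure-MPI layout; both fill the node, but memory per rank and FFT behavior differ.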

Package guidance at a glance:

  • VASP: safe first layout is 1-2 nodes, mostly MPI, cpus-per-task=1 or 2. Increase nodes for larger supercells, hybrid functionals, AIMD, or NEB. Benchmark KPAR, NCORE, and threads together.
  • ORCA: safe first layout is 1 node, either shared memory or one MPI task per NUMA region. Increase nodes for large DLPNO, memory-heavy correlated methods, or long compound jobs. Match %pal to the Slurm allocation.
  • Quantum ESPRESSO: safe first layout is 1-4 nodes, MPI ranks plus modest threading. Increase nodes for larger plane-wave jobs, dense k-point meshes, or phonons. Tune k-point pools and band groups, not just rank count.

A Reusable Scratch-Staging Pattern

Node-local scratch usually gives better I/O performance than running entirely on shared storage. The pattern below is safe for many chemistry workloads:

#!/bin/bash
set -euo pipefail

export OMP_NUM_THREADS="${SLURM_CPUS_PER_TASK:-1}"
export MKL_NUM_THREADS="${OMP_NUM_THREADS}"

SUBMIT_DIR="${SLURM_SUBMIT_DIR}"
SCRATCH_BASE="${TMPDIR:-/scratch/$USER}"
JOB_SCRATCH="${SCRATCH_BASE}/${SLURM_JOB_ID}"
RESULT_DIR="${SUBMIT_DIR}/results/${SLURM_JOB_ID}"

mkdir -p "${JOB_SCRATCH}" "${RESULT_DIR}"

cleanup() {
  status=$?
  # Copy results back even on failure; || true keeps cleanup running if the
  # copy partially fails, so scratch is still released afterwards.
  rsync -a "${JOB_SCRATCH}/" "${RESULT_DIR}/" || true
  rm -rf "${JOB_SCRATCH}"
  exit "${status}"
}

trap cleanup EXIT
cd "${JOB_SCRATCH}"

Use this pattern when:

  • restart files are large enough that repeated shared-filesystem writes hurt throughput
  • the code emits many medium-sized scratch files during integral or FFT work
  • you want a predictable place to collect outputs after preemption or failure

Site-specific variations to watch for:

  • some clusters provide $TMPDIR automatically, others require explicit scratch paths
  • some centers purge node-local scratch immediately on job exit, so trap based copy-back is essential
  • some MPI installations require srun --mpi=pmix, --mpi=pmi2, or site wrappers instead of bare srun

VASP Slurm Template

VASP usually performs best with a mostly-MPI layout and only light threading. Start conservative, then benchmark KPAR, NCORE, and node count before launching expensive production campaigns.

#!/bin/bash
#SBATCH --job-name=vasp-relax
#SBATCH --partition=compute
#SBATCH --nodes=2
#SBATCH --ntasks-per-node=64
#SBATCH --cpus-per-task=1
#SBATCH --mem=0
#SBATCH --time=12:00:00
#SBATCH --output=logs/%x-%j.out

set -euo pipefail
module purge
module load intel-oneapi-mpi
module load vasp/6.4

export OMP_NUM_THREADS="${SLURM_CPUS_PER_TASK}"
export MKL_NUM_THREADS="${OMP_NUM_THREADS}"

SUBMIT_DIR="${SLURM_SUBMIT_DIR}"
SCRATCH_BASE="${TMPDIR:-/scratch/$USER}"
JOB_SCRATCH="${SCRATCH_BASE}/${SLURM_JOB_ID}"
RESULT_DIR="${SUBMIT_DIR}/results/${SLURM_JOB_ID}"

mkdir -p "${JOB_SCRATCH}" "${RESULT_DIR}"
trap 'status=$?; rsync -a "${JOB_SCRATCH}/" "${RESULT_DIR}/" || true; rm -rf "${JOB_SCRATCH}"; exit "${status}"' EXIT

cd "${JOB_SCRATCH}"
cp "${SUBMIT_DIR}/INCAR" .
cp "${SUBMIT_DIR}/POSCAR" .
cp "${SUBMIT_DIR}/KPOINTS" .
cp "${SUBMIT_DIR}/POTCAR" .

if [[ -f "${SUBMIT_DIR}/WAVECAR" ]]; then cp "${SUBMIT_DIR}/WAVECAR" .; fi
if [[ -f "${SUBMIT_DIR}/CHGCAR" ]]; then cp "${SUBMIT_DIR}/CHGCAR" .; fi

srun --mpi=pmix vasp_std > vasp.out

VASP Layout Guidance

  • Prefer physical cores over logical hyperthreads unless your site recommends otherwise.
  • For large k-point workloads, test whether KPAR should divide the total MPI rank count cleanly.
  • For memory-heavy hybrids or large supercells, fewer MPI ranks with cpus-per-task=2 can outperform a pure-MPI layout.
  • Preserve WAVECAR and CHGCAR when chaining relaxations, static runs, or DOS calculations across multiple jobs.
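The KPAR divisibility point above can be checked mechanically before submission. A minimal sketch, assuming the 2x64-rank layout of the template and an illustrative KPAR of 4:

```shell
#!/bin/bash
# Sketch: confirm KPAR divides the total MPI rank count before submitting.
# The rank math and KPAR=4 are assumed values; substitute your own.
nodes=2
ntasks_per_node=64
kpar=4

total_ranks=$(( nodes * ntasks_per_node ))
if [ $(( total_ranks % kpar )) -ne 0 ]; then
  echo "KPAR=${kpar} does not divide ${total_ranks} ranks" >&2
  exit 1
fi
echo "ranks per k-point group: $(( total_ranks / kpar ))"
```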

ORCA Slurm Template

ORCA is often most efficient on a single node first, especially for DFT, spectroscopy, and many medium-sized wavefunction jobs. Large correlated methods can benefit from more nodes, but memory and scratch behavior should be measured before scaling out aggressively.

#!/bin/bash
#SBATCH --job-name=orca-dlpno
#SBATCH --partition=compute
#SBATCH --nodes=1
#SBATCH --ntasks=1
#SBATCH --cpus-per-task=32
#SBATCH --mem=0
#SBATCH --time=24:00:00
#SBATCH --output=logs/%x-%j.out

set -euo pipefail
module purge
module load openmpi/4.1
module load orca/6.0

# ORCA parallelizes through the MPI workers requested in %pal, so keep
# per-process thread counts at 1 to avoid oversubscribing the allocated cores.
export OMP_NUM_THREADS=1
export MKL_NUM_THREADS=1

SUBMIT_DIR="${SLURM_SUBMIT_DIR}"
SCRATCH_BASE="${TMPDIR:-/scratch/$USER}"
JOB_SCRATCH="${SCRATCH_BASE}/${SLURM_JOB_ID}"
RESULT_DIR="${SUBMIT_DIR}/results/${SLURM_JOB_ID}"

mkdir -p "${JOB_SCRATCH}" "${RESULT_DIR}"
trap 'status=$?; rsync -a "${JOB_SCRATCH}/" "${RESULT_DIR}/" || true; rm -rf "${JOB_SCRATCH}"; exit "${status}"' EXIT

cd "${JOB_SCRATCH}"
cp "${SUBMIT_DIR}/job.inp" .
cp "${SUBMIT_DIR}/job.xyz" . 2>/dev/null || true

# Parallel ORCA manages its own MPI workers and must be launched directly
# with its full path, not through srun or mpirun.
"$(command -v orca)" job.inp > job.out

ORCA Layout Guidance

  • Keep the %pal block in the ORCA input synchronized with the Slurm allocation. For example, nprocs 32 should match --cpus-per-task=32 in the single-node template above.
  • Large DLPNO and correlated jobs often need generous local scratch because integral and pair-domain intermediates can become substantial.
  • If your ORCA build supports multi-node execution, start with one MPI rank per NUMA region or node and benchmark carefully before using many nodes.
  • Compound jobs are easier to recover when each stage writes named outputs into a dedicated results directory copied back from scratch.

Example ORCA %pal block:

%pal
  nprocs 32
end
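To keep %pal and the allocation from drifting apart across edits, the block can also be generated at run time. A sketch, assuming a hypothetical job.body.inp fragment that holds the rest of the ORCA input:

```shell
#!/bin/bash
# Sketch: generate the %pal block from the live Slurm allocation so nprocs
# can never drift from --cpus-per-task. "job.body.inp" is a hypothetical
# fragment containing everything except %pal.
NPROCS="${SLURM_CPUS_PER_TASK:-1}"
printf '%%pal\n  nprocs %s\nend\n' "${NPROCS}" > job.inp
cat job.body.inp >> job.inp 2>/dev/null || true   # append the rest if present
```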

Quantum ESPRESSO Slurm Template

Quantum ESPRESSO has more parallel dimensions than most users need on day one. Begin with a stable MPI layout, then introduce k-point pools or additional parallel flags only after you confirm baseline scaling.

#!/bin/bash
#SBATCH --job-name=qe-scf
#SBATCH --partition=compute
#SBATCH --nodes=2
#SBATCH --ntasks-per-node=64
#SBATCH --cpus-per-task=1
#SBATCH --mem=0
#SBATCH --time=08:00:00
#SBATCH --output=logs/%x-%j.out

set -euo pipefail
module purge
module load openmpi/4.1
module load quantum-espresso/7.4

export OMP_NUM_THREADS="${SLURM_CPUS_PER_TASK}"
export MKL_NUM_THREADS="${OMP_NUM_THREADS}"

SUBMIT_DIR="${SLURM_SUBMIT_DIR}"
SCRATCH_BASE="${TMPDIR:-/scratch/$USER}"
JOB_SCRATCH="${SCRATCH_BASE}/${SLURM_JOB_ID}"
RESULT_DIR="${SUBMIT_DIR}/results/${SLURM_JOB_ID}"

mkdir -p "${JOB_SCRATCH}" "${RESULT_DIR}"
trap 'status=$?; rsync -a "${JOB_SCRATCH}/" "${RESULT_DIR}/" || true; rm -rf "${JOB_SCRATCH}"; exit "${status}"' EXIT

cd "${JOB_SCRATCH}"
cp "${SUBMIT_DIR}/scf.in" .
cp "${SUBMIT_DIR}"/*.UPF . 2>/dev/null || true

srun --mpi=pmix pw.x -in scf.in -nk 4 > scf.out

Quantum ESPRESSO Layout Guidance

  • Make sure the number of k-point pools divides the total rank count cleanly.
  • Treat -nk or -npools as a tuning knob for k-point parallelism rather than a default that should always scale with node count.
  • Phonon workflows may want a different rank layout than pw.x, so benchmark ph.x separately instead of reusing SCF settings blindly.
  • Keep pseudopotentials versioned with the input set so restart and reproduction remain straightforward across clusters.
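A small guard can enforce the pool-divisibility rule before the launcher runs. This sketch assumes 128 total ranks and a desired pool count of 4 (both illustrative), and prints the launch line rather than executing it:

```shell
#!/bin/bash
# Sketch: pick -nk so it divides the total rank count, falling back to no
# pooling when it does not. Defaults here are assumptions for illustration.
total_ranks="${SLURM_NTASKS:-128}"
want_pools=4

if [ $(( total_ranks % want_pools )) -eq 0 ]; then
  pools="${want_pools}"
else
  pools=1
fi
# Dry-run print of the launch line; drop the echo inside a real job script.
echo srun --mpi=pmix pw.x -in scf.in -nk "${pools}"
```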

Checklist Before Production Runs

  • Confirm the executable was built with the same MPI family you load in the job script.
  • Create logs/ and results/ directories in the submission tree before large job campaigns.
  • Verify which restart files must be copied into scratch and back out again.
  • Match OpenMP settings to --cpus-per-task; do not leave thread counts at library defaults.
  • Benchmark one small and one medium system before launching a long campaign on many nodes.
  • Record scheduler settings next to the scientific inputs so scaling decisions stay reproducible.
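Several of these checklist items can be automated in a short pre-flight script. A sketch using the VASP input set from the template above; the file list and executable name are examples to adapt to your code:

```shell
#!/bin/bash
# Sketch of a pre-flight check before a campaign. Input names and the
# executable are the VASP set used earlier; swap them for your own code.
mkdir -p logs results

for f in INCAR POSCAR KPOINTS POTCAR; do
  [ -f "$f" ] || echo "missing input: $f" >&2
done

if command -v vasp_std >/dev/null 2>&1; then
  echo "vasp_std found on PATH"
else
  echo "vasp_std not on PATH; load modules first" >&2
fi
```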

Common Failure Modes

  • Jobs remain idle or oversubscribe cores because %pal, OMP_NUM_THREADS, or Slurm CPU settings disagree.
  • Scaling gets worse beyond one node because the filesystem, not the solver, becomes the bottleneck.
  • Restarts fail because scratch contents were not copied back on timeout or job cancellation.
  • Site wrappers expect srun, but the job script uses mpirun from a mismatched MPI installation.

This guide is intentionally conservative. Once the basic layouts are stable on your site, the next step is to turn these patterns into package-specific, version-controlled runbooks for your group or cluster.