A-Cluster
Current version as of 27 April 2023, 15:50
Linux cluster with currently 16 compute nodes (CPUs: 512 cores, GPUs: 8x RTX 2080 + 24x RTX 3090) and 2×251TiB disk storage, purchased by Ana Vila Verde and Christopher Stein
Login
The external hostname is `a-cluster.physik.uni-due.de` (134.91.59.16); the internal hostname is `stor2`.
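A login from outside would then look like this (`username` is a placeholder for your own account name):

```shell
# SSH to the cluster's external address; "username" is a placeholder.
ssh username@a-cluster.physik.uni-due.de
```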
Queueing system: Slurm
- There are three queues ("partitions" in Slurm terminology) named:
  - CPUs, the default
  - GPUs, to be selected via `-p GPUs` for jobs which involve a GPU
  - Test, to be selected via `-p Test` for test jobs with a maximal running time of 10 minutes (compute node gpu01 is reserved exclusively for this queue)
- In the CPUs queue, 2 cores stay reserved on each node for GPU jobs, resulting in 30 available cores.
- `sinfo` displays the cluster's total load.
- `squeue` shows running jobs. You can modify its output via the option `-o`. To make that permanent, put something like `alias squeue='squeue -o "%.18i %.9P %.8j %.8u %.2t %.10M %.6D %R %C %o"'` into your `.bashrc`.
- In the simplest cases, jobs are submitted via `sbatch -n n script-name`. The number n of CPUs is available within the script as `$SLURM_NTASKS`. It's not necessary to pass it on to `mpirun`, since the latter evaluates it on its own anyway.
- To allocate GPUs as well, add `-G n` or `--gpus=n` with n ∈ {1,2}. You can specify the GPU type as well by prepending `rtx2080:` or `rtx3090:` to n.
- Don't use background jobs (`&`) unless you `wait` for them before the end of the script.
- `srun` is intended for interactive jobs (stdin, stdout, and stderr stay attached to the terminal); its `-n` doesn't only reserve n cores but starts n jobs. (Those shouldn't contain `mpirun`, otherwise you'd end up with n² busy cores.)
- For an interactive shell with n reserved cores on a compute node: `srun --pty -c n bash`
- The assignment of cores can be non-trivial (cf. also task affinity); some rules:
  - GROMACS: don't use its `-pin` options.
- There are restrictions per user:
  - You cannot use more than 384 CPU cores simultaneously.
  - You cannot have more than 128 submitted jobs. If you have many runs with just varying parameters, consider using job arrays.
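For illustration, a job-array submission for a simple parameter study could be sketched as follows; the `#SBATCH` values, the temperature list, and the echoed message are made-up placeholders, while `SLURM_ARRAY_TASK_ID` is the variable Slurm sets for each array task:

```shell
#!/bin/bash
#SBATCH -n 4          # cores per array task
#SBATCH --array=0-2   # run three tasks with indices 0..2
# Hypothetical parameter study: each array task picks its own temperature.
# The fallback to index 0 only keeps the snippet runnable outside Slurm.
temps=(300 320 340)
T=${temps[${SLURM_ARRAY_TASK_ID:-0}]}
echo "This array task runs at T=$T K"
# ... start the actual program with $T here ...
```

Such a script would be submitted once via `sbatch the_array_job.sh` (hypothetical file name) instead of submitting each parameter combination as a separate job.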
GPUs
There are two GPUs on each node (RTX 2080 on gpu01-04, RTX 3090 on g3pu05-16). After having requested GPUs (cf. above), you'll find the ID(s) ∈ {0,1} of the GPU(s) assigned to your job in the environment variable `SLURM_STEP_GPUS` as well as in `GPU_DEVICE_ORDINAL`.
The command `sgpus` (no manpage) displays the number of unallocated GPUs on each node.
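In a job script, the assigned ID(s) can be picked up like this (a minimal sketch; the fallbacks only keep it runnable outside a Slurm job, where neither variable is set):

```shell
#!/bin/bash
# Inside a GPU job, SLURM_STEP_GPUS holds e.g. "0" or "0,1".
gpu_ids=${SLURM_STEP_GPUS:-${GPU_DEVICE_ORDINAL:-0}}
echo "Assigned GPU(s): $gpu_ids"
```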
Scratch space
If your job makes heavy use of temporary files, you shouldn't have them in your home directory (to avoid too much network traffic). Each node has about 400 GiB of disk space available in `/tmp`, where you should create `/tmp/$USER/$SLURM_JOBID` (to avoid cluttering) and wipe it at the end of your job.
Four nodes (g3pu07-10) have a dedicated scratch directory `/scratch` of 3.4 TiB capacity, where you should create (and later wipe) `/scratch/$USER/$SLURM_JOBID`. To use it, you have to specify `--gres=scratch:X` upon submission, where X is the amount of scratch space you intend to use in GiB (max 3480). (This amount is not checked during the job's runtime.)
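The create-and-wipe pattern described above can be sketched as follows (assuming bash; the fallback to `$$`, the shell's PID, only keeps the snippet runnable outside a Slurm job):

```shell
#!/bin/bash
# Per-job temporary directory as recommended above.
job_dir=/tmp/$USER/${SLURM_JOBID:-$$}
mkdir -p "$job_dir"
# Remove the directory when the script exits, even after an error.
trap 'rm -rf "$job_dir"' EXIT
# ... run the computation using $job_dir here ...
```

On g3pu07-10, the same pattern applies with `/scratch` in place of `/tmp` (plus the `--gres=scratch:X` request at submission).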
Scientific Software
The following software is installed (on the compute nodes).
AMBER
The module system is not involved. Instead, scripts provided by the software set the environment.
- `/usr/local/amber18`
- `/usr/local/amber20` (provides `parmed` as well)

Script to source therein (assuming bash): `amber.sh`
GROMACS
The module system is not involved. Instead, scripts provided by the software set the environment.
Versions (not all tested):
- `/usr/local/gromacs-2018.3`
- `/usr/local/gromacs-2020.4`
- `/usr/local/gromacs-3.3.4`
- `/usr/local/gromacs-4.6.4`
- `/usr/local/gromacs-5.0.1`
- `/usr/local/gromacs-5.1.1`

Script to source therein (assuming bash): `bin/GMXRC.bash`
Ana provided an example script to be submitted via `sbatch`.
OpenMM + open forcefield
- `source /usr/local/miniconda3/bin/activate`
- `conda activate openforcefield`
- Installed openff components: forceBalance, geomeTRIC, openFF toolkit, openFF evaluator, TorsionDrive, pyMBAR
- Also installed: jupyterlab
OpenMolcas
(compiled with Intel compiler and MKL)
Minimal example script to be `sbatch`ed:
```bash
#!/bin/bash
export MOLCAS=/usr/local/openmolcas
export MOLCAS_WORKDIR=/tmp/$USER-$SLURM_JOB_NAME-$SLURM_JOB_ID
mkdir $MOLCAS_WORKDIR
export PATH=$PATH:$MOLCAS
export LD_LIBRARY_PATH=/opt/intel/oneapi/compiler/latest/linux/compiler/lib/intel64_lin:/opt/intel/oneapi/mkl/latest/lib/intel64
export OMP_NUM_THREADS=${SLURM_NTASKS:-1}
pymolcas the_input.inp
rm -rf $MOLCAS_WORKDIR
```
If you want/need to use the module system instead of setting `LD_LIBRARY_PATH` manually:
```bash
shopt -s expand_aliases
source /etc/profile.d/modules.sh
module use /opt/intel/oneapi/modulefiles
module -s load compiler/latest
module -s load mkl/latest
```
Intel Compiler & Co.
- is located in `/opt/intel/oneapi`
- must be made available via `module use /opt/intel/oneapi/modulefiles` (unless you include `/opt/intel/oneapi/modulefiles` in your `MODULEPATH`); then `module avail` lists the available modules.
- Module mkl/latest also contains FFT routines.
Backups
A backup of the users' home directories is taken nightly. To access the backups, first log in to the cluster. Then:
- Users in `/home/stor.vd1`: Last night's backup is in `/export/vd1/$USER`.
- Users in `/home/stor1.lv0`: You actually have seven backups corresponding to the last 7 days in `/exports/lv1/snapshots/days.D/stor1/home/stor1.lv0/$USER` with D ∈ {0,…,6}.
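As an illustration, restoring a file for a user homed in `/home/stor1.lv0` could look like this; `myfile` and `D=1` (yesterday's snapshot) are made-up example values:

```shell
#!/bin/bash
# Build the snapshot path for D days ago and show the copy command.
D=1
snap=/exports/lv1/snapshots/days.$D/stor1/home/stor1.lv0/$USER
echo "Restore source: $snap/myfile"
# cp "$snap/myfile" ~/myfile.restored   # run this on the cluster itself
```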