TTIC Slurm Cluster: Usage & Guidelines

Current Status: [Docs] | [Job Process Info] | [Job Queue] | [Cluster Usage & Load]

The TTIC slurm cluster is a pool of machines, many with modern GPUs, to which users may submit compute jobs through the slurm scheduler.


General Guidelines

Much of the cluster infrastructure relies on users monitoring their own jobs and usage while being careful about adhering to policy, rather than automated scripts that kill jobs or block accounts for infractions. We request that you respect this collegial arrangement, and be careful that your usage of the cluster adheres to the following set of guidelines:

  1. Use the head node (i.e., slurm.ttic.edu) only to submit and monitor jobs. Do not run any computationally intensive processes (including installing software). You should use an interactive job for such things.

  2. You can explicitly request a certain number of cores / GPUs for your jobs to use when you submit them (with the -c option, see the next section). The scheduler will set up environment variables for your job so that most common computing packages (OpenMP, MKL, CUDA) will restrict their usage to only the assigned resources. But it is possible for jobs to cross these limits inadvertently (and of course, deliberately) when using a non-standard package or library. It is your responsibility to make sure this does not happen. We strongly recommend that as your jobs are running, you monitor their processes on the slurm website's PS page, and ensure they aren't exceeding their assigned resource allocations.

  3. For GPU jobs, it is important that your code respects the CUDA_VISIBLE_DEVICES environment variable (most CUDA programs do this automatically) and uses only its assigned GPU(s) on a node. Also, a GPU job should not use more than 3 CPU cores per assigned GPU: if your job asked for n GPUs, it should use its assigned n GPUs and up to 3n CPU cores on the node.

  4. Your job shouldn't use any program or library that detaches processes from the job (for example by daemonizing, as nohup, screen, and tmux do). You may use screen and tmux on the head node, but not on any compute node (these programs aren't available by default on the compute nodes). Any process that escapes the slurmstepd process will be killed.

  5. We generally discourage the use of interactive jobs, but recognize that they are necessary in some workflows (for example, for compilation and initial testing of programs). However, we find that with interactive jobs, users often lose track of which machine they are on, and either (a) mistake the head node for a compute node and start running their jobs on the head node, which slows it down and makes it difficult or impossible for other users to submit their jobs; or (b) mistake a compute node for the head node and use it to submit jobs, while the slot they hold on the compute node sits idle. If you do use interactive jobs, please keep track of which machine you are on!

  6. Scratch space: if your jobs need to repeatedly read and write large files, we ask that you use the fast temporary local storage (4T SSD) on the compute nodes rather than your NFS-shared home directory. Scratch space is available at /scratch on all compute nodes. Please organize your files in a subdirectory named after your user or group, and delete temporary files when you are done with them.

However, if there is a dataset that you expect to use multiple times, you should leave it in the temporary directory rather than transferring it at the beginning of every job. Your job can check for the presence of the dataset and copy it from a central location only if it's absent, as in the sketch below. Optionally, if this is a large dataset that you expect to use over a period of time, you can ask the IT Director to place it on all (or a subset of) compute nodes.
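
For example, a hypothetical job script might stage a dataset into /scratch only when it is not already there (the directory and file names below are placeholders):

#!/bin/bash
SCRATCH_DATA=/scratch/$USER/my_dataset
CENTRAL_DATA=$HOME/data/my_dataset

# copy the dataset to local scratch space only if a previous job hasn't already
if [ ! -d "$SCRATCH_DATA" ]; then
    mkdir -p /scratch/$USER
    cp -r "$CENTRAL_DATA" "$SCRATCH_DATA"
fi

# ... run your computation against $SCRATCH_DATA ...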

Submitting jobs

All jobs are submitted by logging in via ssh to the head node slurm.ttic.edu which, along with all compute nodes, mounts your NFS home directory. Jobs are run by submitting them to the slurm scheduler, which then executes them on one of the compute nodes.

In this section, we provide information to get you started with using the scheduler, and about details of the TTIC cluster setup. For complete documentation, see the man pages on the slurm website.

Understanding Partitions

All jobs in slurm are submitted to a partition---which defines whether the submission is a GPU or CPU job, the set of nodes it can be run on, and the priority it will have in the queue. Different users will have access to different partitions (based on the group's contributions to the cluster) as noted below:

You can run the sinfo command to see which partitions you have access to. Please consult with your faculty adviser (or the IT Director) if you need access to other partitions.

All of the above partitions have a strict time limit of 4 hours per job, and jobs that do not finish in this time will be killed. However, the cluster also has -long partitions, containing a subset of nodes, that allow longer-running jobs with a limit of 4 days. To use this longer time limit, submit your jobs to the cpu-long, contrib-gpu-long, or <group>-gpu-long partition, as appropriate.
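
For example, to check which partitions you can use and to submit a longer-running CPU job (the script name here is a placeholder):

sinfo                                # list the partitions available to you
sbatch -p cpu-long -c4 myscript.sh   # this job may run for up to 4 days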

Memory Usage

Currently, each CPU core you request comes with roughly 4GB of memory. If you need more than that, either request more cores or use the -C highmem constraint, which will only run your job on nodes with at least 192G of memory.
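
For example (the partition and script names are placeholders), you could get roughly 16GB of memory by requesting 4 cores, or target a large-memory node with the highmem constraint:

sbatch -p cpu -c4 myscript.sh              # ~4 x 4GB = ~16GB of memory
sbatch -p cpu -c1 -C highmem myscript.sh   # run on a node with at least 192G of RAM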

Batch Jobs

The primary way to submit jobs is through the command sbatch. In this regime, you write the commands you want executed into a script file (typically, a bash script). It is important that the first line of this file is a shebang line specifying the script interpreter; in most cases, you will want #!/bin/bash.

The sbatch command also takes a number of options, amongst them: -p to choose the partition, -c to set the number of CPU cores, -C to constrain the job to nodes with a given feature label (see the node list below), -J to name the job, and -d to declare dependencies on other jobs. These are the options used in the examples throughout this page.

There are many other options you can pass to sbatch to customize your jobs (for example, to submit array jobs, or to use MPI). See the sbatch man page.
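
As a minimal sketch (the partition, job name, and file names here are placeholders to adapt to your own setup), a job script my_experiment.sbatch might look like:

#!/bin/bash
#SBATCH --job-name=my_experiment
#SBATCH --partition=cpu
#SBATCH --cpus-per-task=2
#SBATCH --output=my_experiment-%j.out

python my_experiment.py

and would be submitted with:

sbatch my_experiment.sbatch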

Array Jobs

You can submit array jobs using the python script below. You must provide an input file containing the commands you wish to run, one command per line, and the name of a partition on which to run the jobs. The script writes out batch-commands-$.txt and sbatch-script-$.txt files, splitting your input into batches of 5000 commands if necessary. You then submit the job by running sbatch sbatch-script-$.txt. You can also optionally supply a job name and a constraint with -J and -C respectively.

#!/usr/bin/env python

import argparse

parser = argparse.ArgumentParser(description='TTIC SLURM sbatch script creator')
parser.add_argument('INPUT_FILE', help='Input file with list of commands to run')
parser.add_argument('PARTITION', help='Name of partition to use')
parser.add_argument('-C', '--constraint', help='Constraint to use')
parser.add_argument('-J', '--job-name', help='Name of the job')

args = parser.parse_args()

def gen_sbatch_end(constraint, job_name):
  # Build the optional --constraint / --job-name portion of the #SBATCH line.
  sbatch_end = ''
  if constraint:
    sbatch_end += ' --constraint=' + constraint
  if job_name:
    sbatch_end += ' --job-name=' + job_name
  return sbatch_end

file_in = open(args.INPUT_FILE, 'r')
lines = file_in.readlines()

count = 0
commands = []
while count < len(lines):
  # Every 5000 commands, write out the current batch to its own command file
  # and matching sbatch script, then start a new batch.
  if count % 5000 == 0 and count > 0:
    index = count // 5000
    file_out = open('batch-commands-' + str(index) + '.txt', 'w')
    for i in commands:
      file_out.write(i.strip() + '\n')
    file_out.close()
    file_out = open('sbatch-script-' + str(index) + '.txt', 'w')
    file_out.write('#!/bin/bash\n')
    sbatch_end = gen_sbatch_end(args.constraint, args.job_name)
    file_out.write('#SBATCH --partition=' + args.PARTITION + ' --cpus-per-task=1 --array=1-' + str(len(commands)) + sbatch_end + '\n')
    file_out.write('bash -c "`sed "${SLURM_ARRAY_TASK_ID}q;d" '+'batch-commands-'+str(index)+'.txt'+'`"')
    file_out.close()
    commands = []
  commands.append(lines[count])
  count += 1

# Write whatever commands are left over as the 'last' batch.
file_out = open('batch-commands-last.txt', 'w')
for i in commands:
  file_out.write(i.strip() + '\n')
file_out.close()
file_out = open('sbatch-script-last.txt', 'w')
file_out.write('#!/bin/bash\n')
sbatch_end = gen_sbatch_end(args.constraint, args.job_name)
file_out.write('#SBATCH --partition=' + args.PARTITION + ' --cpus-per-task=1 --array=1-' + str(len(commands)) + sbatch_end + '\n')
file_out.write('bash -c "`sed "${SLURM_ARRAY_TASK_ID}q;d" '+'batch-commands-last.txt'+'`"')
file_out.close()
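
For example, assuming you saved the script above as make_array_job.py (the file name and partition here are just examples) and your commands are in commands.txt, one per line, you might run:

python make_array_job.py commands.txt cpu -J my_array
sbatch sbatch-script-last.txt   # plus each numbered sbatch-script-*.txt if your input was split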

Interactive jobs

While we recommend that you use batch jobs for the majority of tasks submitted to the cluster, it may occasionally be necessary to run programs interactively, for example to set up your experiments for the first time. You can use the srun command to request an interactive shell on a compute node.

Call srun with the same options as sbatch above to specify the partition, number of cores, etc., followed by the option --pty bash. For example, to request a shell with a single GPU with at least 11GB of memory on the gpu partition, run

srun -p gpu -c1 -C 11g --pty bash

Note that interactive jobs are subject to the same time limits and priority as batch jobs, which means that you might have to wait for your job to be scheduled, and that your shell will be automatically killed after the time limit expires.

Job Sequences for Dealing with Time limits

Let's say you have split up your job into a series of three script files called: optimize_1.sh, optimize_2.sh, optimize_3.sh --- each of which runs under the cluster's time limit, and picks up from where the last left off. You can request that they be executed as separate jobs in sequence on the cluster.

Pick a unique "job name" for the sequence (let's say "series_A"). Then submit the three batch jobs in series using sbatch, with the additional command parameters -J series_A -d singleton. For example:

sbatch -p gpu -c1 -J series_A -d singleton optimize_1.sh
sbatch -p gpu -c1 -J series_A -d singleton optimize_2.sh
sbatch -p gpu -c1 -J series_A -d singleton optimize_3.sh

All three jobs will be immediately added to the queue, and if there are slots free, optimize_1.sh will start running. But optimize_2.sh will NOT start until the first job is done, and similarly, optimize_3.sh will only be started after the other two jobs have ended. Note that there is no guarantee that they will start on the same machine.

The singleton dependency essentially requires that previously submitted jobs with the same name (by the same user) have finished. There is a caveat, however: the subsequent job will be started even if the previous job failed or was killed (for example, because it overshot the time limit). So your scripts should be robust to the possibility that the previous job may have failed, as in the sketch below.
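
For instance (the program, flag, and checkpoint names below are hypothetical), each script in the sequence could check for a checkpoint left by a previous job before deciding how to start:

#!/bin/bash
CKPT=/scratch/$USER/series_A/checkpoint.pt

if [ -f "$CKPT" ]; then
    # a previous job in the sequence saved state: resume from it
    python train.py --resume "$CKPT"
else
    # no checkpoint found (first job, or the previous one failed early): start fresh
    python train.py
fi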

Note that you can have multiple such sequences running in parallel by giving them different names.

Monitoring your usage

Once your jobs have been scheduled, you can keep an eye on them using command line tools on the login host as well as the cluster website http://slurm.ttic.edu/. At the very least, you should monitor your jobs to ensure that their processor usage does not exceed what you requested when submitting them.

The cluster website provides a listing of scheduled and waiting jobs in the cluster queue, statistics on cluster load, and details (from the output of ps and nvidia-smi) of the processes corresponding to jobs running on the cluster.

You can also use the slurm command line tool squeue to get a list of jobs in the queue (remember to call it with the -a option to see all jobs, including those in other groups' partitions that you may not have access to). To get a listing like the output on the website, which organizes job sequences into single entries, you can run xqueue.py.

Finally, use the scancel command to cancel any of your running or submitted jobs. See the scancel man page for details on how to call this command. In particular, if you are using job sequences, you can use the -n series_name option to cancel all jobs in a sequence.
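
For example (the job ID and name below are placeholders):

squeue -a              # all jobs in the queue, across all partitions
squeue -u $USER        # only your own jobs
scancel 123456         # cancel a specific job by its job ID
scancel -n series_A    # cancel all of your jobs in the series_A sequence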

List of Node Names & Features

Node-name Public Cores RAM GPU(s) GPU Type Feature labels
cpu0 Y 8 24G - -
cpu1 Y 8 24G - -
cpu2 Y 8 24G - -
cpu3 Y 8 24G - -
cpu4 Y 8 24G - -
cpu6 Y 8 24G - -
cpu7 Y 8 24G - -
cpu8 Y 8 48G - - avx
cpu9 Y 8 48G - - avx
cpu10 Y 8 48G - - avx
cpu11 Y 8 48G - - avx
cpu12 Y 12 48G - -
cpu13 Y 12 48G - -
cpu14 Y 12 48G - -
cpu15 Y 12 48G - -
cpu16 Y 12 64G - - avx
cpu17 Y 12 128G - - avx
cpu18 Y 20 128G - - avx
cpu19 Y 12 128G - - avx
cpu20 Y 64 256G - - avx,highmem
cpu21 Y 256 1024G - - avx,highmem
gpu0 N 24 192G 4 2080 Ti 11g,2080ti,highmem,avx
gpu1 N 20 256G 4 2080 Ti 11g,2080ti,highmem,avx
gpu2 N 20 192G 4 A4000 11g,12g,16g,a4000,highmem,avx
gpu3 N 24 192G 4 2080 Ti 11g,2080ti,highmem,avx
gpu4 N 20 192G 4 A4000 11g,12g,16g,a4000,highmem,avx
gpu5 N 16 256G 4 A4000 11g,12g,16g,a4000,highmem,avx
gpu6 Y 16 256G 4 Titan V 11g,12g,titanv,highmem,avx
gpu7 N 20 192G 4 Titan RTX 11g,12g,24g,titanrtx,highmem,avx
gpu10 N 48 1024G 8 A6000 Ada 11g,12g,16g,24g,48g,a6000ada,highmem,avx
gpu11 N 20 192G 8 2080 Ti 11g,2080ti,highmem,avx
gpu12 N 48 1024G 8 A6000 11g,12g,16g,24g,48g,a6000,highmem,avx
gpu13 Y 48 1024G 8 A4000 11g,12g,16g,a4000,highmem,avx
gpu14 N 20 384G 8 A6000 11g,12g,16g,24g,48g,a6000,highmem,avx
gpu15 Y 20 384G 8 A6000 11g,12g,16g,24g,48g,a6000,highmem,avx
gpu16 N 48 1024G 8 A6000 Ada 11g,12g,16g,24g,48g,a6000ada,highmem,avx
gpu17 N 20 192G 8 2080 Ti 11g,2080ti,highmem,avx
gpu18 N 20 192G 8 2080 Ti 11g,2080ti,highmem,avx
gpu19 N 20 192G 8 2080 Ti 11g,2080ti,highmem,avx
gpu20 N 20 192G 8 A6000 11g,12g,24g,48g,a6000,highmem,avx
gpu21 N 20 256G 10 A4000 11g,12g,16g,a4000,highmem,avx
gpu30 N 6 64G 2 Titan X Pascal
gpu31 N 4 64G 2 RTX 8000
gpu32 N 4 64G 2 A5000
gpu33 N 8 128G 2 RTX 6000
gpu34 N 8 64G 2 RTX 6000
gpu35 N 4 64G 2 2080 Ti
gpu36 N 8 64G 2 A5500
gpu37 N 4 64G 2 1080 Ti
gpu38 N 4 64G 2 A5000
gpu39 N 8 128G 2 2080 Ti
gpu40 N 20 1536G 8 A6000
gpu41 N 20 384G 8 A6000
gpu42 N 12 32G 1 2080 Ti
gpu43 N 12 256G 2 2080 Ti

Software Tips

Anaconda

Because users are limited to 20G home directories, Miniconda is preferred.

wget https://repo.anaconda.com/miniconda/Miniconda3-latest-Linux-x86_64.sh
bash Miniconda3-latest-Linux-x86_64.sh -b -p $HOME/mc3
rm Miniconda3-latest-Linux-x86_64.sh
eval "$($HOME/mc3/bin/conda 'shell.bash' 'hook')"

You will want to run the last command every time you start working with miniconda. Installing this way (skipping the bashrc auto-initialization) keeps your logins quick by not loading and scanning files unnecessarily.
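
Once the base environment is active, you can, for example, create and activate a project-specific environment (the environment name and Python version here are just examples):

conda create -n myproject python=3.10 -y
conda activate myproject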

Jupyter Notebook

First you will need to install jupyter notebook. Here are a couple of options; the examples below use the virtualenv option.

  1. Install Anaconda (Miniconda is preferred)
  2. Create a python virtualenv
virtualenv ~/myenv # create the virtualenv
. ~/myenv/bin/activate # activate the env
pip install --upgrade pip # it's always a good idea to update pip
pip install jupyter # install jupyter

You can run the jupyter notebook as either an interactive or batch job.

Jupyter File Locations

We recommend setting your jupyter environment variables so that jupyter's files are not located in NFS directories and instead use node-local /scratch space.

mkdir -p /scratch/$USER/jupyter
export JUPYTER_CONFIG_DIR=/scratch/$USER/jupyter
export JUPYTER_PATH=/scratch/$USER/jupyter
export JUPYTER_DATA_DIR=/scratch/$USER/jupyter
export JUPYTER_RUNTIME_DIR=/scratch/$USER/jupyter
export IPYTHONDIR=/scratch/$USER/ipython

Interactive

srun --pty bash          # start an interactive job

. ~/myenv/bin/activate   # activate the virtual env

# jupyter tries to use the value of XDG_RUNTIME_DIR to store some files; by
# default it is set to '', which causes errors when running jupyter notebook
unset XDG_RUNTIME_DIR

export NODEIP=$(hostname -i)            # get the IP address of the node you are on

export NODEPORT=$(( $RANDOM + 1024 ))   # pick a random port above 1024

echo $NODEIP:$NODEPORT                  # note these values for the ssh tunnel below

jupyter-notebook --ip=$NODEIP --port=$NODEPORT --no-browser   # start the notebook

Make a new ssh connection with a tunnel to access your notebook. Run this on your local machine:

ssh -N -L 8888:$NODEIP:$NODEPORT user@slurm.ttic.edu   # use the actual values echoed above, not the variable names

This creates an ssh tunnel on your local machine that forwards traffic sent to localhost:8888 to $NODEIP:$NODEPORT. The command will appear to hang, since the -N option tells ssh not to run any commands (including a shell) on the remote machine.

Open your local browser and visit: http://localhost:8888

Batch

The process for a batch job is very similar.

jupyter-notebook.sbatch

#!/bin/bash
unset XDG_RUNTIME_DIR
NODEIP=$(hostname -i)
NODEPORT=$(( $RANDOM + 1024))
echo "ssh command: ssh -N -L 8888:$NODEIP:$NODEPORT `whoami`@slurm.ttic.edu"
. ~/myenv/bin/activate
jupyter-notebook --ip=$NODEIP --port=$NODEPORT --no-browser
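
Submit it like any other batch job, choosing a partition you have access to, for example:

sbatch -p contrib-gpu -c1 jupyter-notebook.sbatch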

Check the output of your job to find the ssh command to use when accessing your notebook.

Make a new ssh connection to tunnel your traffic. The format will be something like:

ssh -N -L 8888:###.###.###.###:#### user@slurm.ttic.edu

This command will appear to hang, since the -N option tells ssh not to run any commands (including a shell) on the remote machine.

Open your local browser and visit: http://localhost:8888

PyTorch

These are the commands to install the current stable version of pytorch. In this example we are using /scratch, though in practice you may want to install it in a network location. The total install is 11G, which means that installing it in your home directory is not recommended.

# getting an interactive job on a gpu node
srun -p contrib-gpu --pty bash

# creating a place to work
export MYDIR=/scratch/$USER/pytorch
mkdir -p $MYDIR && cd $MYDIR

# installing miniconda
wget https://repo.anaconda.com/miniconda/Miniconda3-latest-Linux-x86_64.sh
bash Miniconda3-latest-Linux-x86_64.sh -b -p $MYDIR/mc3
rm Miniconda3-latest-Linux-x86_64.sh

# activating the miniconda base environment (you will need to run this before using pytorch in future sessions).
eval "$($MYDIR/mc3/bin/conda 'shell.bash' 'hook')"

# install pytorch, with cuda 11.8
conda install pytorch torchvision torchaudio pytorch-cuda=11.8 -c pytorch -c nvidia

# test (should return True)
python -c "import torch; print(torch.cuda.is_available())"

Tensorflow

When using tensorflow, it will not respect the common environment variables that restrict the number of threads in use. If you add the following code to your tensorflow setup, it will respect the number of threads requested with the -c option.

import os
import tensorflow as tf

# Limit tensorflow to the number of threads assigned by the scheduler
# (OMP_NUM_THREADS is set based on the -c option). This uses the
# TF1-style Session/ConfigProto API.
NUM_THREADS = int(os.environ['OMP_NUM_THREADS'])
sess = tf.Session(config=tf.ConfigProto(
    intra_op_parallelism_threads=NUM_THREADS,
    inter_op_parallelism_threads=NUM_THREADS))

[For Faculty] Contributing to the Cluster

If you are a faculty member at TTIC, we would like to invite you to contribute machines or hardware to the cluster. A rather high-level outline of what it would mean to you is below.

The basic principle we intend to follow is that the setup should provide people, on average, with access to more resources than they have on their own, and to manage these pooled resources to maximize efficiency and throughput.

If you contribute hardware, you will gain access to the entire cluster (see description of partitions above), and be given a choice between two options for high-priority access:

  1. You (which means you and any users you designate as being in your "group") will be guaranteed access to your machines within a specified time window from when you request it (the window is 4 hours, i.e., the time limit for any other user's job that may be running on your machine). Once you get this access, you can keep it as long as you need it. Effectively, you decide when to let others use your machines, with a limited waiting period when you want them back.

  2. You give up the ability to guarantee on-demand access to your specific machines, in exchange for a higher priority for your jobs on the entire cluster. You still can reserve your machines for up to four weeks per year (possibly in installments of at least a week each time).

As noted above, the priority of one's jobs is affected by a weighted combination of waiting time, user priority (higher for members of a group that has contributed more equipment), and fair share (lower for users who have recently made heavy use of the cluster).