Slurm on Talapas
Slurm is job scheduling software that apportions resources to jobs distributed across the hundreds of nodes on Talapas. To use Talapas, you must request resources and define job parameters in the form of Slurm jobs. This lesson introduces Slurm job configuration in detail.
Lesson Setup
For this lesson, you will need to connect to a Talapas login node through a shell application of your choice.
For convenience, we recommend the Talapas OnDemand shell.
To start, make sure you are in your home directory.
cd ~
Copy the `/projects/racs_training/intro-hpc-s25/slurm` folder into your home directory. Don’t forget `-r` to recursively copy the contents!
cp -r /projects/racs_training/intro-hpc-s25/slurm/ .
Navigate inside the newly created `slurm` directory.
cd slurm
ls -F
part1/ part2/
Navigate inside the `part1` folder using `cd`, then inspect its contents using `ls`.
cd part1
ls
gpu.sbatch hello.py hello.sbatch long.sbatch
Troubleshooting a Slurm Job
Let’s inspect `hello.sbatch`.
cat hello.sbatch
#!/bin/bash
#SBATCH --partition=compute ### Partition (like a queue in PBS)
#SBATCH --account=racs_training ### Account used for job submission
### NOTE: %u=userID, %x=jobName, %N=nodeID, %j=jobID, %A=arrayMain, %a=arraySub
#SBATCH --job-name=hello_world_python ### Job Name
#SBATCH --output=%x-%j.out ### File in which to store job output
#SBATCH --error=%x-%j.err ### File in which to store job error messages
#SBATCH --time=0-00:05:00 ### Wall clock time limit in Days-HH:MM:SS
#SBATCH --nodes=1 ### Number of nodes needed for the job
#SBATCH --mem=500M ### Total Memory for job in MB -- can do K/M/G/T for KB/MB/GB/TB
#SBATCH --ntasks-per-node=1 ### Number of tasks to be launched per Node
#SBATCH --cpus-per-task=1 ### Number of cpus/cores to be launched per Task
### Load needed modules
module purge
module load python3/3.11.4
module list
### Run your actual program
python3 hello.py
Let’s parse this file using the comments as a guide! To learn more, see the Slurm documentation for `sbatch`.
- This job runs for a maximum of five minutes (`0-00:05:00`) on one node and one CPU core of that node.
- It requests 500MB of RAM.
- The job output and error messages will be written to `hello_world_python-[jobid].out` and `hello_world_python-[jobid].err` respectively.
- It loads the `python3/3.11.4` module.
- It lists the currently loaded modules with `module list`, which prints to standard output.
- It runs a file in the `part1` directory called `hello.py` that prints to standard output.
To submit your job file, run the `sbatch` command.
sbatch hello.sbatch
Look at your recent job history with `sacct`.
sacct
JobID JobName Partition Account AllocCPUS State ExitCode
------------ ---------- ---------- ---------- ---------- ---------- --------
34658037 hello_wor+ compute racs_trai+ 1 FAILED 1:0
34658037.ba+ batch racs_trai+ 1 FAILED 1:0
34658037.ex+ extern racs_trai+ 1 COMPLETED 0:0
There’s an error! Check the error log using `cat`.
cat hello_world_python-[YOURJOBID].err
Currently Loaded Modules:
1) miniconda-t2/20230523 2) python3/3.11.4
Traceback (most recent call last):
File "/gpfs/home/emwin/part1/hello.py", line 2, in <module>
print(dictionary['b']) # Should Raise KeyError !!!
~~~~~~~~~~^^^^^
KeyError: 'b'
This `KeyError` is a common mistake when using Python dictionaries. Check the original Python file in a command-line text editor like `nano`.
nano hello.py
Looking at the Python code, there’s a bug inside `hello.py`.
print("Hello World !!")
dictionary = {'a': 1}
print(dictionary['b']) # This should raise a KeyError
This can be fixed by looking up a key that actually exists in the dictionary: `a`. Change the last line to the following in `nano`, then press Ctrl+O to save and Ctrl+X to exit.
print(dictionary['a']) # This will not raise a KeyError
Now that the file is saved, rerun your fixed job.
sbatch hello.sbatch
Submitted batch job 34704219
Running `sacct` will show your job completed successfully.
sacct
JobID JobName Partition Account AllocCPUS State ExitCode
------------ ---------- ---------- ---------- ---------- ---------- --------
34704219 hello_wor+ compute racs_trai+ 1 COMPLETED 0:0
34704219.ba+ batch racs_trai+ 1 COMPLETED 0:0
34704219.ex+ extern racs_trai+ 1 COMPLETED 0:0
Check the output log to see the results of your job. Be sure to pick the log from the second, successful run. Note that because each log is named with the job ID, and job IDs are unique, your logs will not overwrite each other.
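For example, using the job ID from the successful run above:
cat hello_world_python-34704219.out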
Hello World !!
1
To check the status of your queued jobs, use the `squeue` command. Do not run `squeue` without arguments unless you want to see all the queued jobs on Talapas.
squeue --me
JOBID PARTITION NAME USER ST TIME NODES NODELIST(REASON)
Notice that because each run creates two log files based on the job ID, you will have at least four log files (`%x-%j.err`, `%x-%j.out`) in your current working directory.
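After the two runs above, for instance, your directory would contain log files like these (names built from the job IDs shown earlier):
hello_world_python-34658037.err  hello_world_python-34658037.out
hello_world_python-34704219.err  hello_world_python-34704219.out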
Cancelling Longer Jobs
Inspect `long.sbatch`.
cat long.sbatch
#!/bin/bash
#SBATCH --partition=computelong ### Partition (like a queue in PBS)
#SBATCH --account=racs_training ### Account used for job submission
### NOTE: %u=userID, %x=jobName, %N=nodeID, %j=jobID, %A=arrayMain, %a=arraySub
#SBATCH --job-name=long_hello_world ### Job Name
#SBATCH --output=%x-%j.out ### File in which to store job output
#SBATCH --error=%x-%j.err ### File in which to store job error messages
#SBATCH --time=0-00:20:00 ### Wall clock time limit in Days-HH:MM:SS
#SBATCH --nodes=1 ### Number of nodes needed for the job
#SBATCH --mem=500M ### Total Memory for job in MB -- can do K/M/G/T for KB/MB/GB/TB
#SBATCH --ntasks-per-node=1 ### Number of tasks to be launched per Node
#SBATCH --cpus-per-task=1 ### Number of cpus/cores to be launched per Task
### Run your actual program
for i in {1..100}
do
echo "This is loop iteration $i"
sleep 10
done
This job features a Bash for-loop that prints one line of text and sleeps for ten seconds on each of its 100 iterations. This means it will run for about 1000 seconds, or just under 17 minutes.
Inspect the job in the queue.
squeue --me
JOBID PARTITION NAME USER ST TIME NODES NODELIST(REASON)
34704248 computelo long_hel emwin R 0:03 1 n0136
Let’s practice cancelling a slow job with `scancel`.
scancel 34704248
Looking at the queue again, it’s now empty, even though the job would have otherwise run for longer.
JOBID PARTITION NAME USER ST TIME NODES NODELIST(REASON)
When you cancel a job, Slurm will write an error message to the error file for that job.
cat long_hello_world-34704248.err
slurmstepd: error: *** JOB 34704248 ON n0136 CANCELLED AT 2025-05-21T16:42:49 ***
Optimization with seff
Want to know how much of the resources you requested a finished job actually used? Try the `seff` command.
seff 34700623
Job ID: 34700623
Cluster: talapas
User/Group: emwin/uoregon
State: CANCELLED (exit code 0)
Cores: 1
CPU Utilized: 00:00:00
CPU Efficiency: 0.00% of 00:02:23 core-walltime
Job Wall-clock time: 00:02:23
Memory Utilized: 456.00 KB
Memory Efficiency: 0.09% of 500.00 MB
Adjust your resource requests accordingly after your job runs. Be courteous with your requests and your jobs won’t be deprioritized. Remember that your job’s total turnaround time is its queue time plus its execution time, so asking for too many resources can make your job finish later because it will spend longer in the queue.
If your job runs out of memory, try doubling the amount you request. If your job uses less memory than you expected, use the `seff` numbers to lower the amount of memory requested so your job schedules faster next time.
This job, which spends its CPU time “sleeping” or waiting, uses very little of its requested resources.
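For instance, given the seff report above (456.00 KB used of 500.00 MB requested), a future submission of this job could safely request far less memory. A hypothetical adjustment:
#SBATCH --mem=50M ### lowered from 500M after seff showed <1% memory efficiency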
Helper Scripts From RACS
Need to decide which nodes (and which partitions) are appropriate for your jobs? RACS has a script that shows available CPU and RAM on each node.
/packages/racs/bin/slurm-show-cpu-mem
This script shows the name of the node, the number of CPU cores on the node, and the amount of RAM (in MB) in that order for each node on Talapas.
...
n0187 128 515048
n0188 128 515048
n0189 128 515048
n0190 128 515048
n0191 128 515048
n0192 128 515048
n0193 128 515048
...
Nodes on the compute partition typically have 128 CPU cores and 500GB of RAM.
To check which partition a node belongs to, find its name in the output of the `sinfo` command.
sinfo
As you can see below, `n0191` is one of the nodes in the compute partition.
compute up 1-00:00:00 1 plnd n0195
compute up 1-00:00:00 6 drain n[0112-0117]
compute up 1-00:00:00 35 mix n[0111,0118-0135,0180-0194,0196]
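You can also ask `sinfo` about a specific node directly with its `--nodes` flag, which saves you from scanning the full listing:
sinfo --nodes=n0191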
You may also want to know what hardware is associated with each node.
/packages/racs/bin/slurm-show-features
This script shows the node name, processor type, processor architecture, and processor model, along with tags for any special hardware such as extra RAM or GPUs.
Let’s inspect `n0191` again.
...
n0185 amd,milan,7713
n0186 amd,milan,7713
n0187 amd,milan,7713
n0188 amd,milan,7713
n0189 amd,milan,7713
n0190 amd,milan,7713
n0191 amd,milan,7713
n0192 amd,milan,7713
n0193 amd,milan,7713
...
Nodes with extra RAM are indicated with `mem-[amount]` tags.
n0372 intel,broadwell,e7-4830,mem-1tb
n0373 intel,broadwell,e7-4830,mem-1tb
n0374 intel,broadwell,e7-4830,mem-4tb
n0375 intel,broadwell,e7-4830,mem-4tb
n0376 intel,broadwell,e7-4830,mem-2tb
n0377 intel,broadwell,e7-4830,mem-2tb
n0378 intel,broadwell,e7-4830,mem-2tb
n0379 intel,broadwell,e7-4830,mem-2tb
Nodes with GPUs are also indicated in this list.
n0149 amd,milan,7413,a100,gpu-10gb
n0150 amd,milan,7413,a100,gpu-40gb
n0151 amd,milan,7413,a100,gpu-10gb
n0152 amd,milan,7413,a100,3xgpu-80gb,no-mig
n0153 amd,milan,7413,a100,gpu-80gb,2xgpu-80gb,no-mig
n0154 amd,milan,7413,a100,gpu-80gb,2xgpu-80gb,no-mig
n0155 amd,milan,7413,a100,gpu-40gb
n0156 amd,milan,7413,a100,gpu-40gb
n0157 amd,milan,7413,a100,gpu-40gb
n0158 amd,milan,7413,a100,gpu-40gb
These indicators are called features. We can use constraints to direct Slurm to schedule jobs on nodes with specific features (extra RAM, a GPU, a particular CPU architecture) that our code requires.
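For example, a hypothetical job that needs a high-memory node could add a constraint line like this one (the gpu.sbatch example below uses the same mechanism for GPUs):
#SBATCH --constraint=mem-1tb ### only schedule on nodes tagged with the mem-1tb feature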
GPU Jobs
Let’s examine an example GPU job in detail: `gpu.sbatch`.
Only three partitions are compatible with GPUs:
- `gpu`
- `gpulong`
- `interactivegpu`
Each node in a GPU partition has at most 4 GPUs.
Open `gpu.sbatch` in `nano` and modify it as follows.
nano gpu.sbatch
#!/bin/bash
#SBATCH --partition=gpu ### Partition (like a queue in PBS)
#SBATCH --account=racs_training ### Account used for job submission
### NOTE: %u=userID, %x=jobName, %N=nodeID, %j=jobID, %A=arrayMain, %a=arraySub
#SBATCH --job-name=gpu_hello_world ### Job Name
#SBATCH --output=%x-%j.out ### File in which to store job output
#SBATCH --error=%x-%j.err ### File in which to store job error messages
#SBATCH --time=0-00:05:00 ### Wall clock time limit in Days-HH:MM:SS
#SBATCH --nodes=1 ### Number of nodes needed for the job
#SBATCH --mem=500M ### Total Memory for job in MB -- can do K/M/G/T for KB/MB/GB/TB
#SBATCH --gpus=1 ### Number of GPUs to request
#SBATCH --ntasks-per-node=1 ### Number of tasks to be launched per Node
#SBATCH --cpus-per-task=1 ### Number of cpus/cores to be launched per Task
#SBATCH --constraint=gpu-10gb
#SBATCH --mail-type=BEGIN,END
#SBATCH --mail-user=YOURDUCKID@uoregon.edu
### Load needed modules
module purge
module load cuda/12.4.1
module list
### Run your actual program
nvidia-smi
- You must request at least one GPU with the `--gpus` argument; running on a GPU partition is not enough.
- If you need to select certain nodes within a partition, the `--constraint=` flag lets you restrict the job to nodes with specific features. In this case, `--constraint=gpu-10gb` restricts the job to nodes with the `gpu-10gb` feature; that is, it gives a GPU with 10GB of VRAM.
- Every GPU requires a CPU too!
Emailing Job Status
Running a really long job? Tired of checking `squeue`? Slurm can even email you when jobs reach certain states:
#SBATCH --mail-type=BEGIN,END,FAIL
#SBATCH --mail-user=<duckID>@uoregon.edu
Edit these lines in your sbatch script with your DuckID. You should get an email from Slurm when the job begins and when it finishes (whether it succeeds or fails).
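Slurm accepts several other mail-type values as well; for instance, ALL is a shorthand that covers BEGIN, END, FAIL, and requeue events:
#SBATCH --mail-type=ALL ### email on all major job state changes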
Schedule the GPU job and wait for results in your email.
sbatch gpu.sbatch
Submitted batch job 34702879
The `nvidia-smi` command gives you the status of the GPUs on the system, if any. It will error out if there are no GPUs on the system.
cat gpu_hello_world-34702879.out
+-----------------------------------------------------------------------------------------+
| NVIDIA-SMI 550.54.15 Driver Version: 550.54.15 CUDA Version: 12.4 |
|-----------------------------------------+------------------------+----------------------+
| GPU Name Persistence-M | Bus-Id Disp.A | Volatile Uncorr. ECC |
| Fan Temp Perf Pwr:Usage/Cap | Memory-Usage | GPU-Util Compute M. |
| | | MIG M. |
|=========================================+========================+======================|
| 0 NVIDIA A100 80GB PCIe On | 00000000:E9:00.0 Off | On |
| N/A 39C P0 74W / 300W | 792MiB / 81920MiB | N/A Default |
| | | Enabled |
+-----------------------------------------+------------------------+----------------------+
+-----------------------------------------------------------------------------------------+
| MIG devices: |
+------------------+----------------------------------+-----------+-----------------------+
| GPU GI CI MIG | Memory-Usage | Vol| Shared |
| ID ID Dev | BAR1-Usage | SM Unc| CE ENC DEC OFA JPG |
| | | ECC| |
|==================+==================================+===========+=======================|
| 0 9 0 0 | 12MiB / 9728MiB | 14 0 | 1 0 0 0 0 |
| | 0MiB / 16383MiB | | |
+------------------+----------------------------------+-----------+-----------------------+
+-----------------------------------------------------------------------------------------+
| Processes: |
| GPU GI CI PID Type Process name GPU Memory |
| ID ID Usage |
|=========================================================================================|
| No running processes found |
+-----------------------------------------------------------------------------------------+
The `MIG devices:` section of this output indicates that the job was scheduled on a 9728MiB (10GB) slice of a GPU with 80GB of total VRAM. The 10GB constraint worked.
Serial Processing: One Core and One Task at a Time
Navigate inside the `part2` folder using `cd`.
cd ../part2
Check that you see the following files inside with `ls`.
ls -F
array.sbatch* books/ books.sbatch* hello.sbatch* parallel_steps.sbatch* random.py* serial_steps.sbatch* steps.sh*
Examine the `serial_steps.sbatch` job with `cat`.
cat serial_steps.sbatch
#!/bin/bash
#SBATCH --partition=compute ### Partition (like a queue in PBS)
#SBATCH --account=racs_training ### Account used for job submission
### NOTE: %u=userID, %x=jobName, %N=nodeID, %j=jobID, %A=arrayMain, %a=arraySub
#SBATCH --job-name=serial_steps ### Job Name
#SBATCH --output=%x.out ### File in which to store job output
#SBATCH --error=%x.err ### File in which to store job error messages
#SBATCH --time=0-00:25:00 ### Wall clock time limit in Days-HH:MM:SS
#SBATCH --ntasks=1 ### Number of tasks included in the job
#SBATCH --mem-per-cpu=50M ### Memory per CPU core -- can do K/M/G/T for KB/MB/GB/TB
#SBATCH --ntasks-per-node=1 ### Number of tasks to be launched per Node
#SBATCH --cpus-per-task=1 ### Number of cpus/cores to be launched per Task
for i in {1..10}; do
srun steps.sh $i
done
This job runs a Bash for-loop. To learn more about Bash loops, check out this quick tutorial.
This script uses `srun` to create 10 sequential (serial) steps in one job. By default, each iteration (here `i = 1, 2, 3 ... 10`) must finish before the loop moves to the next.
This is because `srun` denotes a step within a job. A job with multiple `srun` commands will create a separate step (with the same job ID) in `sacct` for each step.
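The lesson doesn’t print `steps.sh` itself, but judging from the job output and the roughly ten-second elapsed time of each step shown below, it presumably looks something like this sketch:
#!/bin/bash
### Hypothetical reconstruction of steps.sh (the real file may differ).
### Each step prints its step number, then sleeps for ten seconds.
echo "I am printing this from job step $1"
sleep 10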
sbatch serial_steps.sbatch
Submitted batch job 34704292
Check the queue for the status of your job.
squeue --me
34704292 compute serial_s emwin R 0:23 1 n0135
As you can observe, the serial job is running and corresponds to exactly one row in the Slurm queue.
However, each step of the serial job (each subjob) is shown as a separate row in the output of the `sacct` command.
sacct
34704292.0 steps.sh racs_trai+ 1 COMPLETED 0:0
34704292.1 steps.sh racs_trai+ 1 COMPLETED 0:0
34704292.2 steps.sh racs_trai+ 1 COMPLETED 0:0
34704292.3 steps.sh racs_trai+ 1 COMPLETED 0:0
34704292.4 steps.sh racs_trai+ 1 COMPLETED 0:0
34704292.5 steps.sh racs_trai+ 1 COMPLETED 0:0
34704292.6 steps.sh racs_trai+ 1 COMPLETED 0:0
34704292.7 steps.sh racs_trai+ 1 COMPLETED 0:0
34704292.8 steps.sh racs_trai+ 1 RUNNING 0:0
This job is on loop 8 of the serial steps. 34704292.9 can’t begin until 34704292.8 finishes. Serial jobs require that steps be completed one at a time AND in a specific order.
Let’s look at the time spent on each step in more detail using a specially configured `sacct` command. You can learn more about `sacct` formatting in the Slurm documentation.
sacct -u $(whoami) --units=G --format=JobID,jobname,MaxRSS,Start,Elapsed,State,ExitCode
34704292.0 steps.sh 0 2025-05-21T16:45:47 00:00:11 COMPLETED 0:0
34704292.1 steps.sh 0 2025-05-21T16:45:58 00:00:10 COMPLETED 0:0
34704292.2 steps.sh 0 2025-05-21T16:46:08 00:00:10 COMPLETED 0:0
34704292.3 steps.sh 0 2025-05-21T16:46:18 00:00:10 COMPLETED 0:0
34704292.4 steps.sh 0 2025-05-21T16:46:28 00:00:10 COMPLETED 0:0
34704292.5 steps.sh 0 2025-05-21T16:46:38 00:00:10 COMPLETED 0:0
34704292.6 steps.sh 0 2025-05-21T16:46:48 00:00:10 COMPLETED 0:0
34704292.7 steps.sh 0 2025-05-21T16:46:58 00:00:10 COMPLETED 0:0
34704292.8 steps.sh 0 2025-05-21T16:47:08 00:00:10 COMPLETED 0:0
34704292.9 steps.sh 0 2025-05-21T16:47:18 00:00:11 COMPLETED 0:0
As you can see, all ten steps (numbered 0-9) are now marked as COMPLETED.
Each subjob has a different start time. The subjobs start roughly ten seconds apart, after the previous subjob has finished.
There are some jobs where certain steps or subcomponents must be completed in a serial fashion. However, serial jobs do not take advantage of the computational power of Talapas or its support for parallelism. Think of serial computation as a last resort when designing code on Talapas.
How can we make this job run in parallel? Will it be faster?
Parallelism with sbatch
Let’s examine `parallel_steps.sbatch` with `cat`.
cat parallel_steps.sbatch
It looks very similar to `serial_steps.sbatch`, but there are a few subtle and important differences.
#!/bin/bash
#SBATCH --partition=compute ### Partition (like a queue in PBS)
#SBATCH --account=racs_training ### Account used for job submission
### NOTE: %u=userID, %x=jobName, %N=nodeID, %j=jobID, %A=arrayMain, %a=arraySub
#SBATCH --job-name=parallel_steps ### Job Name
#SBATCH --output=%x.out ### File in which to store job output
#SBATCH --error=%x.err ### File in which to store job error messages
#SBATCH --time=0-00:25:00 ### Wall clock time limit in Days-HH:MM:SS
#SBATCH --ntasks=10 ### Number of tasks included in the job
#SBATCH --cpus-per-task=1 ### Number of cpus/cores to be launched per Task
#SBATCH --mem-per-cpu=50M ### Memory per CPU core -- can do K/M/G/T for KB/MB/GB/TB
for i in {1..10}; do
srun --ntasks=1 steps.sh $i &
done
wait
This job requests 10 tasks, allocates 1 CPU per task (10 CPUs total), and then launches 10 concurrent `srun` steps.
The `&` at the end of `srun --ntasks=1 steps.sh $i &` is crucial! The ampersand tells Bash to run the command in the background. Without it, the loop would wait for one iteration to finish before moving to the next. The `wait` at the end of the script then pauses until all background steps have finished before the job exits.
Let’s launch this job with `sbatch`.
sbatch parallel_steps.sbatch
Submitted batch job 34704407
If you look at the output log, you’ll notice something important.
The job steps don’t complete in order!
cat parallel_steps.out
I am printing this from job step 9
I am printing this from job step 2
I am printing this from job step 1
I am printing this from job step 8
I am printing this from job step 3
I am printing this from job step 7
I am printing this from job step 5
I am printing this from job step 6
I am printing this from job step 10
I am printing this from job step 4
The steps in the parallel job did not run in order or finish in order. You lose that guarantee in a parallel context.
How much faster did the parallel job run?
Judging by the step start times above, the serial job took roughly 100 seconds from first step to last. Use `sacct` to see how much faster the parallel job ran.
sacct -u $(whoami) --units=G --format=JobID,jobname,MaxRSS,Start,Elapsed,State,ExitCode
JobID JobName MaxRSS Start Elapsed State ExitCode
34704407 parallel_+ 2025-05-21T16:51:53 00:00:10 COMPLETED 0:0
34704407.ba+ batch 0 2025-05-21T16:51:53 00:00:10 COMPLETED 0:0
34704407.ex+ extern 0 2025-05-21T16:51:53 00:00:10 COMPLETED 0:0
34704407.0 steps.sh 0 2025-05-21T16:51:53 00:00:10 COMPLETED 0:0
34704407.1 steps.sh 0.00G 2025-05-21T16:51:53 00:00:10 COMPLETED 0:0
34704407.2 steps.sh 0 2025-05-21T16:51:53 00:00:10 COMPLETED 0:0
34704407.3 steps.sh 0 2025-05-21T16:51:53 00:00:10 COMPLETED 0:0
34704407.4 steps.sh 0 2025-05-21T16:51:53 00:00:10 COMPLETED 0:0
34704407.5 steps.sh 0 2025-05-21T16:51:53 00:00:10 COMPLETED 0:0
34704407.6 steps.sh 0.00G 2025-05-21T16:51:53 00:00:10 COMPLETED 0:0
34704407.7 steps.sh 0.00G 2025-05-21T16:51:53 00:00:10 COMPLETED 0:0
34704407.8 steps.sh 0 2025-05-21T16:51:53 00:00:10 COMPLETED 0:0
34704407.9 steps.sh 0 2025-05-21T16:51:53 00:00:10 COMPLETED 0:0
The parallel job ran in about 10 seconds, significantly faster than the roughly 100-second serial version. This is an example of an embarrassingly parallel job, because each subjob is completely independent of the other subjobs.
Array Jobs
Slurm has a special option for bulk scheduling of nearly identical jobs: array jobs. This approach is less flexible than `srun`, but it can be faster to configure.
Take a close look at `array.sbatch` with `cat`.
cat array.sbatch
#!/bin/bash
#SBATCH --partition=compute ### Partition (like a queue in PBS)
#SBATCH --account=racs_training ### Account used for job submission
### NOTE: %u=userID, %x=jobName, %N=nodeID, %j=jobID, %A=arrayMain, %a=arraySub
#SBATCH --job-name=python_array ### Job Name
#SBATCH --output=%x-%A-%a.out ### File in which to store job output
#SBATCH --error=%x-%A-%a.err ### File in which to store job error messages
#SBATCH --time=0-00:05:00 ### Wall clock time limit in Days-HH:MM:SS
#SBATCH --nodes=1 ### Number of nodes needed for the job
#SBATCH --mem=50M ### Total Memory for job in MB -- can do K/M/G/T for KB/MB/GB/TB
#SBATCH --ntasks-per-node=1 ### Number of tasks to be launched per Node
#SBATCH --cpus-per-task=1 ### Number of cpus/cores to be launched per Task
#SBATCH --array=1-10
### Load needed modules
module purge
module load python3/3.11.4
### Run your actual program
srun python3 random.py $SLURM_ARRAY_TASK_ID $SLURM_ARRAY_JOB_ID $SLURM_JOB_ID
Array jobs require the `#SBATCH --array=` parameter. In this example, the array parameter is the list `{1, 2, ... 10}`. In the log file names, `%a` stands for the ID of the array task and `%A` for the main array job.
This will create one main job and a subjob for each index in the array. Each array index will inherit the requested parameters specified in `cpus-per-task` and `ntasks-per-node`. This means each subjob asks for 1 CPU core.
This job runs `random.py` and passes in three arguments: `$SLURM_ARRAY_TASK_ID`, `$SLURM_ARRAY_JOB_ID`, and `$SLURM_JOB_ID`. These are environment variables.
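Slurm sets these variables automatically inside every array subjob. For instance, you could inspect them with a simple echo (an illustrative line, not part of array.sbatch):
echo "task=$SLURM_ARRAY_TASK_ID array=$SLURM_ARRAY_JOB_ID job=$SLURM_JOB_ID"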
The advantage of array jobs is that Slurm does not wait for all requested cores to be available before scheduling the first subjobs of the array. This is not the case for parallel jobs that use `srun` to distribute subjobs among multiple requested cores.
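Slurm also lets you throttle an array with a `%` suffix, a standard sbatch feature not used in this lesson:
#SBATCH --array=1-10%2 ### run the ten subjobs at most two at a time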
Let’s look in more detail at `random.py`.
cat random.py
#!/usr/bin/env python3
import random
import sys

if __name__ == "__main__":
    args = sys.argv[1:]
    seed = args[0]
    array_job_id = args[1]
    job_id = args[2]
    print(f"Array Job ID: {array_job_id} Array Task ID: {seed} Job ID: {job_id}")
    print("SEED: {}".format(seed))
    random.seed(seed)
    for i in range(0, 3):
        print(random.random())
This Python script takes three arguments (the array task ID, the array job ID, and the job ID), seeds the random number generator with the task ID, and then prints three random numbers. The script runs once for each subjob in the array: 10 times in total.
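Because the first argument seeds the generator, the same task ID always reproduces the same three numbers. You can verify this by hand with the python3 module loaded (the array and job IDs here are placeholders):
module load python3/3.11.4
python3 random.py 8 12345 67890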
Go ahead and submit the array job.
sbatch array.sbatch
Submitted batch job 34704494
Array jobs are parallel jobs. You have no guarantee that all the array subjobs will be scheduled or will run at the same time.
Let’s check the job’s progress with `squeue`.
squeue --me
Unlike the parallel job created with `srun`, each array subjob gets its own row in the output of `squeue`.
JOBID PARTITION NAME USER ST TIME NODES NODELIST(REASON)
34704494_1 compute python_a emwin R 0:01 1 n0135
34704494_2 compute python_a emwin R 0:01 1 n0135
34704494_3 compute python_a emwin R 0:01 1 n0135
34704494_4 compute python_a emwin R 0:01 1 n0135
34704494_5 compute python_a emwin R 0:01 1 n0135
34704494_6 compute python_a emwin R 0:01 1 n0135
34704494_7 compute python_a emwin R 0:01 1 n0135
34704494_8 compute python_a emwin R 0:01 1 n0135
34704494_9 compute python_a emwin R 0:01 1 n0135
34704494_10 compute python_a emwin R 0:01 1 n0135
Each subjob in an array job creates its own logs.
ls python_array*
python_array-34704494-10.err python_array-34704494-2.out python_array-34704494-5.err python_array-34704494-7.out
python_array-34704494-10.out python_array-34704494-3.err python_array-34704494-5.out python_array-34704494-8.err
python_array-34704494-1.err python_array-34704494-3.out python_array-34704494-6.err python_array-34704494-8.out
python_array-34704494-1.out python_array-34704494-4.err python_array-34704494-6.out python_array-34704494-9.err
python_array-34704494-2.err python_array-34704494-4.out python_array-34704494-7.err python_array-34704494-9.out
To look at the last lines of each of the output logs, you can use the `tail` command.
tail py*.out
...
==> python_array-31638482-8.out <==
Array Job ID: 31638482 Array Task ID: 8 Job ID: 31638490
SEED: 8
0.22991307664292304
0.7963928625032808
0.7965374772142675
==> python_array-31638482-9.out <==
Array Job ID: 31638482 Array Task ID: 9 Job ID: 31638491
SEED: 9
0.1666321090310302
0.02959962356249568
0.629304132385857
Notice that each job has its own array task id, its own job id, and a shared array job id.
When to Use Array Jobs
If you can format your job as an array job, do so! Array jobs are the preferred way to parallelize your embarrassingly parallel workloads, and they are easy for the scheduler to manage.
Parallelism vs. Serial Execution
Array jobs do not “talk to each other”. They do not run in a guaranteed order, nor do they finish in a guaranteed order.
Regardless of how your job is configured, no two jobs or subjobs should write to or read from the same files concurrently. Doing so creates race conditions in which your code and filesystem can behave unpredictably.
Race conditions are why we make sure everyone in the class copies example files from shared directories to their home directory before modifying them and executing them.
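One simple way to avoid such collisions in an array job is to build each subjob’s output path from its task ID, so no two subjobs ever touch the same file. An illustrative pattern, not taken from the lesson files:
srun python3 random.py "$SLURM_ARRAY_TASK_ID" "$SLURM_ARRAY_JOB_ID" "$SLURM_JOB_ID" > "result-${SLURM_ARRAY_TASK_ID}.txt"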
Interactive Jobs: srun
Interactive jobs are great for development work if you need a better-resourced machine for a couple of hours or want to test a Slurm configuration on a small sample before deploying it to a batch job.
Let’s try the simplest possible interactive job.
srun --partition=compute --account=racs_training --pty bash
srun: job 34703737 queued and waiting for resources
srun: job 34703737 has been allocated resources
You are now in a Bash shell, but it is no longer running on the login node; it is on a compute node.
hostname
n0135.talapas.uoregon.edu
By default, your interactive jobs have a maximum runtime of 24 hours. Make sure to exit when you finish to free up resources.
exit
hostname
login1.talapas.uoregon.edu
You’ll notice that the interactive job is marked as COMPLETED after you use the `exit` command, as shown when checking the job ID with the `sacct` command.
sacct -j 34703737
JobID JobName Partition Account AllocCPUS State ExitCode
------------ ---------- ---------- ---------- ---------- ---------- --------
34703737 bash compute racs_trai+ 1 COMPLETED 0:0
34703737.ex+ extern racs_trai+ 1 COMPLETED 0:0
34703737.0 bash racs_trai+ 1 COMPLETED 0:0
Interactive jobs are also a great place to try out configuring GPU jobs. Let’s be more polite and request only 30 minutes. By default, interactive jobs run with a maximum of 4GB of RAM, so your job will be killed either after 30 minutes or when it exceeds your allotted RAM.
srun --partition=gpu --account=racs_training --time=30 --gpus=1 --constraint=gpu-10gb --pty bash
With this constraint, you’re requesting a GPU with 10GB of VRAM.
srun: job 34704657 queued and waiting for resources
srun: job 34704657 has been allocated resources
Check that you’re on the compute node using `hostname`.
hostname
n0162.talapas.uoregon.edu
Use the `nvidia-smi` command to get GPU information for the current node.
nvidia-smi
+-----------------------------------------------------------------------------------------+
| NVIDIA-SMI 550.54.15 Driver Version: 550.54.15 CUDA Version: 12.4 |
|-----------------------------------------+------------------------+----------------------+
| GPU Name Persistence-M | Bus-Id Disp.A | Volatile Uncorr. ECC |
| Fan Temp Perf Pwr:Usage/Cap | Memory-Usage | GPU-Util Compute M. |
| | | MIG M. |
|=========================================+========================+======================|
| 0 NVIDIA A100 80GB PCIe On | 00000000:E8:00.0 Off | On |
| N/A 38C P0 70W / 300W | 677MiB / 81920MiB | N/A Default |
| | | Enabled |
+-----------------------------------------+------------------------+----------------------+
+-----------------------------------------------------------------------------------------+
| MIG devices: |
+------------------+----------------------------------+-----------+-----------------------+
| GPU GI CI MIG | Memory-Usage | Vol| Shared |
| ID ID Dev | BAR1-Usage | SM Unc| CE ENC DEC OFA JPG |
| | | ECC| |
|==================+==================================+===========+=======================|
| 0 12 0 0 | 12MiB / 9728MiB | 14 0 | 1 0 0 0 0 |
| | 0MiB / 16383MiB | | |
+------------------+----------------------------------+-----------+-----------------------+
+-----------------------------------------------------------------------------------------+
| Processes: |
| GPU GI CI PID Type Process name GPU Memory |
| ID ID Usage |
|=========================================================================================|
| No running processes found |
+-----------------------------------------------------------------------------------------+
If you requested GPU resources correctly, you should be able to run `nvidia-smi` and see information about the GPU available to you. This output indicates the GPU resources were allocated correctly.
Exit your interactive job.
exit
Check that you’re back on a login node.
hostname
login2.talapas.uoregon.edu
This `nvidia-smi` command will fail in contexts where a GPU is unavailable or has not been allocated. For example, if you run it on a login node, you’ll get a message like this.
nvidia-smi
NVIDIA-SMI has failed because it couldn't communicate with the NVIDIA driver. Make sure that the latest NVIDIA driver is installed and running
It prints this failure message because nodes without GPUs don’t have a driver installed.
RACS has a guide on configuring interactive jobs on Talapas if you’d like to read more.
Quick Slurm Scheduling Tips
- Read the resource descriptions carefully: jobs that exceed their requested time, memory, or processor usage will be automatically killed.
- Need to run for more than 24 hours? You must use `computelong`, `gpulong`, or `memorylong` (see the example after this list).
- Most jobs should have `--ntasks-per-node=1` and `--cpus-per-task=1` unless they run tools or code that is explicitly multiprocessor.
- Remember to exit interactive jobs when you’re finished to free up resources for your colleagues.
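For instance, a hypothetical job that needs three days on the CPU nodes might use these directives:
#SBATCH --partition=computelong ### allows jobs longer than 24 hours
#SBATCH --time=3-00:00:00 ### three days, in Days-HH:MM:SS format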
Today’s Slurm Commands
| command | description | example usage |
|---|---|---|
| `sbatch [jobfile]` | queues a Slurm job | `sbatch my_job.sh` |
| `sacct` | lists (your) recent jobs and subjobs | `sacct` |
| `squeue -u [user]` | gets status of running or queued Slurm jobs for a user | `squeue -u [yourDuckID]` |
| `scancel [jobid]` | cancels the job with ID `jobid` | `scancel [jobid]` |
| `srun` | launches an interactive shell session on a compute node | `srun --partition=compute --account=racs_training --pty bash` |
| `sinfo` | lists information on available partitions | `sinfo` |