Advanced Slurm
This lesson expands on the principles of Slurm job configuration and parallelization introduced in the first Slurm lesson. It also introduces Conda, a popular software package and virtual environment management tool available on Talapas that can be used to configure virtual environments for batch scripts or interactive jobs.
Lesson Setup
For this lesson, you will need to connect to a Talapas login node through a shell application of your choice.
For convenience, we recommend the Talapas OnDemand shell.
Make sure you are in your home directory.
cd ~
Copy the slurm_day2 folder to your home directory.
cp -r /projects/racs_training/intro-hpc-s25/slurm_day2/ .
Navigate inside the slurm_day2 directory you copied over.
cd slurm_day2
Array Jobs: Creating Tasks from an Array of Input Files
Check the folder contents using ls.
ls -F
books_example/ python_pi_example/
deps_example/ snakemake_example/
First, we will examine the books_example Slurm task.
Change to the books_example directory.
cd books_example
ls -F
books/ books.sbatch* logs/
In the books folder, you should have the text of five books.
Use ls to list the filenames of the books stored there.
ls books
alice_in_wonderland.txt moby_dick.txt romeo_and_juliet.txt
complete_works_shakespeare.txt pride_and_prejudice.txt
Let’s look at an example array job, books.sbatch, which has one subtask for each of the five books in the books/ folder.
cat books.sbatch
#!/bin/bash
#SBATCH --partition=compute ### Partition (like a queue in PBS)
#SBATCH --account=racs_training ### Account used for job submission
### NOTE: %u=userID, %x=jobName, %N=nodeID, %j=jobID, %A=arrayMain, %a=arraySub
#SBATCH --job-name=books_wc_array ### Job Name
#SBATCH --output=logs/%x-%A-%a.out ### File in which to store job output
#SBATCH --error=logs/%x-%A-%a.err ### File in which to store job error messages
#SBATCH --time=0-00:05:00 ### Wall clock time limit in Days-HH:MM:SS
#SBATCH --nodes=1 ### Number of nodes needed for the job
#SBATCH --mem=50M ### Total Memory for job in MB -- can do K/M/G/T for KB/MB/GB/TB
#SBATCH --ntasks-per-node=1 ### Number of tasks to be launched per Node
#SBATCH --cpus-per-task=1 ### Number of cpus/cores to be launched per Task
#SBATCH --array=0-4
BOOKS=(books/*)
srun wc -c ${BOOKS[$SLURM_ARRAY_TASK_ID]}
First, the variable BOOKS is created using a wildcard: BOOKS=(books/*).
BOOKS is a list of the filenames of each of the files in the books folder:
books/alice_in_wonderland.txt books/complete_works_shakespeare.txt books/moby_dick.txt books/pride_and_prejudice.txt books/romeo_and_juliet.txt
Then, srun is used to launch subtasks that run wc -c, which counts the characters in each of the five books. Each subtask uses its SLURM_ARRAY_TASK_ID to select one entry of the BOOKS array.
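To see how the array index maps to a file, here is a minimal sketch you could run by hand from the books_example directory. Outside a real array job, SLURM_ARRAY_TASK_ID is not set by Slurm, so it is assigned manually here for illustration.
SLURM_ARRAY_TASK_ID=2                 # normally set by Slurm for each subtask
BOOKS=(books/*)                       # expands to the five book paths in sorted order
wc -c ${BOOKS[$SLURM_ARRAY_TASK_ID]}  # counts characters in books/moby_dick.txt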
sbatch books.sbatch
Submitted batch job 34763686
Check the status of each of the five tasks using sacct.
sacct
34763686_0 books_wc_+ compute racs_trai+ 1 COMPLETED 0:0
34763686_0.+ batch racs_trai+ 1 COMPLETED 0:0
34763686_0.+ extern racs_trai+ 1 COMPLETED 0:0
34763686_0.0 wc racs_trai+ 1 COMPLETED 0:0
34763686_1 books_wc_+ compute racs_trai+ 1 COMPLETED 0:0
34763686_1.+ batch racs_trai+ 1 COMPLETED 0:0
34763686_1.+ extern racs_trai+ 1 COMPLETED 0:0
34763686_1.0 wc racs_trai+ 1 COMPLETED 0:0
34763686_2 books_wc_+ compute racs_trai+ 1 COMPLETED 0:0
34763686_2.+ batch racs_trai+ 1 COMPLETED 0:0
34763686_2.+ extern racs_trai+ 1 COMPLETED 0:0
34763686_2.0 wc racs_trai+ 1 COMPLETED 0:0
34763686_3 books_wc_+ compute racs_trai+ 1 COMPLETED 0:0
34763686_3.+ batch racs_trai+ 1 COMPLETED 0:0
34763686_3.+ extern racs_trai+ 1 COMPLETED 0:0
34763686_3.0 wc racs_trai+ 1 COMPLETED 0:0
34763686_4 books_wc_+ compute racs_trai+ 1 COMPLETED 0:0
34763686_4.+ batch racs_trai+ 1 COMPLETED 0:0
34763686_4.+ extern racs_trai+ 1 COMPLETED 0:0
34763686_4.0 wc racs_trai+ 1 COMPLETED 0:0
The following lines of sbatch configuration direct the output logs to the logs folder. The %x-%A-%a notation creates log files named (job name)-(array parent job ID)-(array index).out.
#SBATCH --output=logs/%x-%A-%a.out ### File in which to store job output
#SBATCH --error=logs/%x-%A-%a.err ### File in which to store job error messages
Inspect the logs folder with ls.
ls logs
books_wc_array-34763686-0.err
books_wc_array-34763686-0.out
books_wc_array-34763686-1.err
books_wc_array-34763686-1.out
books_wc_array-34763686-2.err
books_wc_array-34763686-2.out
books_wc_array-34763686-3.err
books_wc_array-34763686-3.out
books_wc_array-34763686-4.err
books_wc_array-34763686-4.out
Let’s use tail and a * wildcard to check the last few lines of each of the output logs.
tail logs/books*.out
As expected, the output logs contain the results of wc -c followed by the book that was listed as input.
==> logs/books_wc_array-34763686-0.out <==
174357 books/alice_in_wonderland.txt
==> logs/books_wc_array-34763686-1.out <==
5638516 books/complete_works_shakespeare.txt
==> logs/books_wc_array-34763686-2.out <==
1276288 books/moby_dick.txt
==> logs/books_wc_array-34763686-3.out <==
772419 books/pride_and_prejudice.txt
==> logs/books_wc_array-34763686-4.out <==
169541 books/romeo_and_juliet.txt
Evaluating Resource Usage on Running Jobs with htop
Remember the seff command? That’s a great way to evaluate resource usage of a finished job. But what about resource usage for running jobs?
Navigate to the python_pi_example directory and inspect the contents.
cd ../python_pi_example
ls
calculate_pi.py calculate_pi.sbatch
Inspect calculate_pi.sbatch.
nano calculate_pi.sbatch
#!/bin/bash
#SBATCH --partition=compute ### Partition (like a queue in PBS)
#SBATCH --account=racs_training ### Account used for job submission
### NOTE: %u=userID, %x=jobName, %N=nodeID, %j=jobID, %A=arrayMain, %a=arraySub
#SBATCH --job-name=digits_of_pi ### Job Name
#SBATCH --output=%x.log ### File in which to store job output
#SBATCH --time=0-00:60:00 ### Wall clock time limit in Days-HH:MM:SS
#SBATCH --nodes=1 ### Number of nodes needed for the job
#SBATCH --mem=100M ### Total Memory for job in MB -- can do K/M/G/T for KB/MB/GB/TB
#SBATCH --cpus-per-task=1 ### Number of cpus/cores to be launched per Task
### Load needed modules
module purge
module load python3/3.11.4
module list
### Run your actual program
python3 calculate_pi.py 50000
This is a serial job that loads the python3 module and executes a Python script that computes the first 50,000 digits of pi.
cat calculate_pi.py
#!/usr/bin/env python3

import sys

def calcPi(limit):  # Generator function
    """
    Prints out the digits of PI
    until it reaches the given limit
    """
    q, r, t, k, n, l = 1, 0, 1, 1, 3, 3
    decimal = limit
    counter = 0
    while counter != decimal + 1:
        if 4 * q + r - t < n * t:
            # yield digit
            yield n
            # insert period after first digit
            if counter == 0:
                yield '.'
            # end
            if decimal == counter:
                break
            counter += 1
            nr = 10 * (r - n * t)
            n = ((10 * (3 * q + r)) // t) - 10 * n
            q *= 10
            r = nr
        else:
            nr = (2 * q + r) * l
            nn = (q * (7 * k) + 2 + (r * l)) // (t * l)
            q *= k
            t *= l
            l += 2
            k += 1
            n = nn
            r = nr

def main():  # Wrapper function
    num_digits = sys.argv[1]
    for d in calcPi(int(num_digits)):
        print(d, end='')

if __name__ == '__main__':
    main()
Launch the Slurm job with sbatch.
sbatch calculate_pi.sbatch
Submitted batch job 34799867
Check your queue. Note the node number (hostname) of the compute node where the job is currently running.
squeue --me
JOBID PARTITION NAME USER ST TIME NODES NODELIST(REASON)
34799867 compute digits_o emwin R 0:02 1 n0135
As we can see, we are currently running on node n0135.
Because we have an active Slurm task on node n0135, we can remote into that compute node from the login node with ssh.
ssh n0135
Enter your DuckID password when prompted. Confirm that you are connected with the hostname command.
hostname
n0135.talapas.uoregon.edu
While you are on the compute node, the htop command can be used to evaluate the resource usage of your running processes.
htop -u $USER
The RES column shows the resident memory (RAM) each process is using, in KB by default. The TIME column shows the processor time the process has consumed.
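If you would rather not ssh to the compute node, Slurm’s sstat command can report statistics for a running job directly from the login node. A minimal sketch using the job ID from this example (the .batch suffix targets the batch step):
sstat --jobs=34799867.batch --format=JobID,MaxRSS,AveCPU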
Your connection to the compute node will terminate when your job terminates. After the job has finished, you can use seff to get more granular information about your job.
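For example, once the digits_of_pi job has completed, seff summarizes its CPU and memory efficiency:
seff 34799867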
Don’t Overwhelm the Scheduler!
Do not make 50,000 individual jobs with srun and a for-loop. The maximum number of concurrent array jobs per user on Talapas is 12,000. To avoid overwhelming the scheduler, you can also tell Slurm to limit how many array subtasks run at a time: the #SBATCH --array=0-50000%100 parameter allows only 100 subtasks to run at once.
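As a sketch, the books example from earlier could be throttled in the same way; the %2 below is only for illustration and tells Slurm to run at most two of the five subtasks at once.
#SBATCH --array=0-4%2   ### five subtasks total, at most two running at a time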
The Slurm nodes Parameter
Unless you’re using MPI (the Message Passing Interface), you should use the default sbatch parameter of --nodes=1. Your job can’t communicate between computers (nodes) unless your code is written to use MPI. If your code does use MPI, RACS has a guide for configuring MPI jobs on Talapas.
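For reference, the header of a multi-node MPI job might look something like the sketch below; my_mpi_program is a hypothetical executable, and the exact modules and launch options should come from the RACS MPI guide.
#SBATCH --nodes=2              ### more than one node is only useful with MPI
#SBATCH --ntasks-per-node=4    ### MPI ranks to launch on each node
srun ./my_mpi_program          ### srun starts one rank per task across the nodes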
Slurm Defaults on Talapas
Below is a selection of common parameters and their defaults. These defaults are subject to change and do not necessarily apply on condo nodes.
Parameter | Default | Description | Notes |
---|---|---|---|
ntasks | 1 | tasks per job | |
cpus-per-task | 1 | number of CPU threads per task | |
mem-per-cpu | 4G | memory per CPU thread | |
mem | 4G* | memory per node | cannot be used with mem-per-cpu |
nodes | 1 | nodes allocated for the job | using more than one requires MPI |
gpus | 0 | GPUs per job | must be at least 1 to use GPUs |
error | slurm-%j.err | default error log location | |
output | slurm-%j.out | default output log location | |
In summary, parallelism on Talapas is enabled by
- using code that supports multiple tasks, threads, or nodes
- enabling additional tasks, cores, or nodes as sbatch parameters
- OR using sbatch array jobs to launch simultaneous, independent jobs
Without meeting these requirements, your jobs will run serially.
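As a sketch of the second approach, a job for a multi-threaded program can request several cores on a single node and pass the core count along; the program name and its --threads flag below are hypothetical.
#SBATCH --nodes=1
#SBATCH --ntasks=1
#SBATCH --cpus-per-task=8                            ### eight cores for one multi-threaded task
my_threaded_program --threads=$SLURM_CPUS_PER_TASK   ### hypothetical program and flag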
Conda Environments
Conda is an open-source package and environment management system. It helps you easily install, run, and update software packages and manage isolated environments for different projects.
Conda works across platforms (Windows, macOS, Linux) and is especially popular in data science and scientific computing because it handles complex dependency situations like:
- Python packages that require older versions of Python
- non-Python libraries like R and Julia
- packages from different channels (conda, conda-forge, pip)
RACS has a detailed guide for building, creating, and loading conda environments on Talapas. We will walk through some of these steps in this exercise.
Benefits of Using Conda
- Conda environments allow for reproducibility and consistency when running code on different devices and operating systems.
- Conda only loads the packages you need for that specific workflow (helps reduce the amount of “clutter” in your environment)
- Each Conda environment is a self-contained workspace, so you can:
- Use different Python versions side-by-side (e.g. Python 3.10 & 3.12)
- Avoid dependency conflicts between projects
Many researchers maintain separate conda environments for different projects and contexts. We highly recommend this approach for reproducibility, consistency, and ease of replicating environment configurations among colleagues.
Conda Options on Talapas
We have two main conda distributions available to users:
- miniconda3/20240410: the official distribution, maintained by Anaconda; uses defaults as its default channel; a minimal installer for the Anaconda ecosystem
- miniforge3/20240410: a community-maintained, fully open-source distribution; uses conda-forge as its default channel and community packages
Building a Conda Environment from Scratch
Copy the conda folder to your home directory and navigate inside.
cp -r /projects/racs_training/conda/ ~
cd ~/conda
Look inside with ls. There’s a hello_world.py script that requires a Python interpreter of some kind.
ls
basic_r.R conda_notes.md hello_world.py jupyter.yml
Let’s load the miniconda3/20240410 module.
module load miniconda3/20240410
Check that the module is loaded with module list.
module list
Currently Loaded Modules:
1) miniconda3/20240410
List the conda environments available to you with conda env list.
There are a number of public conda environments maintained by RACS in the /packages/miniconda3/20240410/envs/ folder. If you have not created any conda environments of your own, then only the public environments compiled by RACS will be listed.
conda env list
# conda environments:
#
base /packages/miniconda3/20240410
R-test-pack /packages/miniconda3/20240410/envs/R-test-pack
SE3nv /packages/miniconda3/20240410/envs/SE3nv
ancestryhmm-v2 /packages/miniconda3/20240410/envs/ancestryhmm-v2
argweaver-20241202 /packages/miniconda3/20240410/envs/argweaver-20241202
bgchm-20241008 /packages/miniconda3/20240410/envs/bgchm-20241008
brainiak-20240412 /packages/miniconda3/20240410/envs/brainiak-20240412
dcm2bids-20240904 /packages/miniconda3/20240410/envs/dcm2bids-20240904
dcm2niix-20240416 /packages/miniconda3/20240410/envs/dcm2niix-20240416
fmriprep-docker /packages/miniconda3/20240410/envs/fmriprep-docker
gambit_bsm-20240416 /packages/miniconda3/20240410/envs/gambit_bsm-20240416
gnomix /packages/miniconda3/20240410/envs/gnomix
...
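Any of these shared environments can be activated by name, just like an environment of your own. For example, using one of the RACS-maintained environments listed above:
conda activate dcm2niix-20240416
conda deactivate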
Let’s create a new environment named myenv that will be stored inside the .conda folder of your home directory. You can specify which Python version is used through the python= argument.
conda create --name myenv python=3.12 numpy matplotlib
This command creates an environment with the numpy and matplotlib packages. When Conda finishes building the environment, you will see a message like this.
Preparing transaction: done
Verifying transaction: done
Executing transaction: done
#
# To activate this environment, use
#
# $ conda activate myenv
#
# To deactivate an active environment, use
#
# $ conda deactivate
To activate myenv, run the conda activate command.
conda activate myenv
Observe that your environment name will now appear to the left of your terminal prompt.
(myenv) [emwin@login2 conda]$
From inside our conda environment, we can run the which python command to confirm we are using the Python instance stored inside myenv.
which python
~/.conda/envs/myenv/bin/python
To see which packages are in the current environment, use conda list.
conda list
We can scroll through the list to find matplotlib and numpy.
...
matplotlib 3.10.0 py312h06a4308_0
matplotlib-base 3.10.0 py312hbfdbfaf_0
mkl 2023.1.0 h213fc3f_46344
mkl-service 2.4.0 py312h5eee18b_2
mkl_fft 1.3.11 py312h5eee18b_0
mkl_random 1.2.8 py312h526ad5a_0
mysql 8.4.0 h721767e_2
ncurses 6.4 h6a678d5_0
numpy 2.2.5 py312h2470af2_0
...
Alternatively, piping the output of conda list to grep returns only the lines that reference the packages of interest.
conda list | grep -E "matplotlib|numpy"
matplotlib 3.10.0 py312h06a4308_0
matplotlib-base 3.10.0 py312hbfdbfaf_0
numpy 2.2.5 py312h2470af2_0
numpy-base 2.2.5 py312h06ae042_0
Inspect hello_world.py with nano.
nano hello_world.py
import subprocess
import numpy
import time
import pandas as pd

def average_age():
    data = {'Name': ['Alice', 'Bob', 'Charlie'], 'Age': [25, 30, 35]}
    df = pd.DataFrame(data)
    return df['Age'].mean()

def compute_sum():
    res = 4 + 4
    return res

def main():
    print("Hello World!")
    output = compute_sum()
    print(output)
    print(average_age())

if __name__ == "__main__":
    main()
For those unfamiliar with Python, this script executes the following statements:
- prints “Hello World!”
- assigns the result of the compute_sum() function to the variable output
- prints the value stored in output
- prints the result of the average_age() function, which operates on a Pandas DataFrame
A login node is not an appropriate location to run non-trivial scripts, but this toy script is safe to test there.
Use the python command to run the version of Python in the active conda environment.
python hello_world.py
Traceback (most recent call last):
File "/gpfs/home/emwin/conda/hello_world.py", line 4, in <module>
import pandas as pd
ModuleNotFoundError: No module named 'pandas'
This fails because myenv is missing the pandas module. We need to install pandas into the current environment using conda install.
conda install pandas
When the installation is finished, rerun the script.
python hello_world.py
Hello World!
8
30.0
This is the output we expected from inspecting the source code.
R in Conda
You can also install R and R packages to conda environments.
conda install R=4.3
Let’s examine basic_r.R.
cat basic_r.R
add_numbers1 <- 4 * 2
add_numbers6 <- 4 * 8
add_numbers3 <- 4 * 9
add_numbers4 <- 4 * 12
add_numbers5 <- 9 * 7
add_numbers2 <- 5 * 8
results <- list(
  add_numbers1 = add_numbers1,
  add_numbers2 = add_numbers2,
  add_numbers3 = add_numbers3,
  add_numbers4 = add_numbers4,
  add_numbers5 = add_numbers5,
  add_numbers6 = add_numbers6
)
# Write results to a file
writeLines(paste(names(results), results, sep=": "), "results.txt")
Because R has been installed inside the myenv Conda environment, the Rscript command is available on our path when myenv is activated.
which Rscript
~/.conda/envs/myenv/bin/Rscript
Launch the basic_r.R script using Rscript.
Rscript basic_r.R
Inspect the results file from the script, results.txt.
add_numbers1: 8
add_numbers2: 40
add_numbers3: 36
add_numbers4: 48
add_numbers5: 63
add_numbers6: 32
Exporting a Conda Environment
To share our Conda environment, we can create a textual representation of the environment in the form of a special .yml configuration file. Conventionally, environment files bundled with source code are named environment.yml.
The conda env export command exports the environment; here its output is redirected into myenv-environment.yml.
conda env export > myenv-environment.yml
Inspect the contents with cat.
cat myenv-environment.yml
name: myenv
channels:
- defaults
dependencies:
- _libgcc_mutex=0.1=main
- _openmp_mutex=5.1=1_gnu
- _r-mutex=1.0.0=anacondar_1
- _sysroot_linux-64_curr_repodata_hack=3=haa98f57_10
- binutils_impl_linux-64=2.40=h5293946_0
- binutils_linux-64=2.40.0=hc2dff05_2
- blas=1.0=openblas
- bottleneck=1.4.2=py312ha883a20_0
- brotli-python=1.0.9=py312h6a678d5_9
- bwidget=1.9.16=h9eba36c_0
- bzip2=1.0.8=h5eee18b_6
- c-ares=1.19.1=h5eee18b_0
- ca-certificates=2025.2.25=h06a4308_0
- cairo=1.16.0=hb05425b_5
...
This list is so lengthy because it includes the packages manually installed (numpy, matplotlib, pandas, R) and their dependencies.
If you use the --from-history flag, conda exports only the packages you explicitly requested, leaving the conda solver to choose versions and dependencies when the environment is recreated.
conda env export --from-history
name: myenv
channels:
- defaults
dependencies:
- matplotlib
- numpy
- python=3.12
- pandas
- r=4.3
Creating A Conda Environment from a .yml File
Let’s inspect an environment file called jupyter.yml before creating the environment defined there.
cat jupyter.yml
name: jupyter-racs-s25
channels:
- conda-forge
dependencies:
- numpy
- pandas
- matplotlib
- seaborn
- jupyter
- r-base
- r-essentials
- r-irkernel
- nodejs
- ipywidgets
- pip
- pip:
- git+https://github.com/conery/nbscan.git
Conda Channels
This environment file defines an environment named jupyter-racs-s25. It uses the conda-forge channel. Channels are the repositories from which conda downloads packages.
Conda defaults to a channel called defaults. The conda-forge channel has a larger repository of packages from the broader community.
Conda looks for packages in channels in the order that the channels appear.
This environment also uses pip to install the nbscan package, which is not available through conda.
If you do not specify version numbers, conda will select the latest compatible version of a given package.
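Channels can also be chosen per command. For example, to install a single package from conda-forge instead of the defaults channel (seaborn here is just an illustration):
conda install -c conda-forge seaborn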
Before we can create our new environment, we need to deactivate our current one with conda deactivate.
Then, we can create the new environment defined in jupyter.yml using conda env create.
conda env create -f jupyter.yml
When the environment building finishes, you will get a message like this.
#
# To activate this environment, use
#
# $ conda activate jupyter-racs-s25
#
# To deactivate an active environment, use
#
# $ conda deactivate
No two conda environments can have the same name, so this command will fail if you try to run it a second time.
We will be using the jupyter-racs-s25 environment in the JupyterLab session.
Using Conda Environments in Batch Scripts
To use a Conda environment you created in a Slurm batch script, add the following lines to your script. Don’t forget to load miniconda3/20240410 first!
For Python files:
module load miniconda3/20240410
conda activate [env-name]
python [my-special-python-script].py
Or for R:
module load miniconda3/20240410
conda activate [env-name]
Rscript [my-special-r-script].R
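Putting this together with the sbatch headers from earlier in the lesson, a complete batch script might look like the sketch below. It assumes the myenv environment and the hello_world.py script from this lesson and is only a template; adjust the account, partition, and resources for your own work.
#!/bin/bash
#SBATCH --partition=compute          ### Partition (like a queue in PBS)
#SBATCH --account=racs_training      ### Account used for job submission
#SBATCH --job-name=conda_hello       ### Job Name
#SBATCH --output=%x-%j.out           ### File in which to store job output
#SBATCH --time=0-00:05:00            ### Wall clock time limit in Days-HH:MM:SS
#SBATCH --nodes=1                    ### Number of nodes needed for the job
#SBATCH --mem=100M                   ### Total memory for the job
#SBATCH --cpus-per-task=1            ### Number of cpus/cores per task

module purge
module load miniconda3/20240410
conda activate myenv
python hello_world.py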
Debugging Conda
Observing strange behavior with Conda on Talapas? Make sure you’re not activating a conda environment through your .bashrc file. A default .bashrc looks something like this:
# .bashrc
# Source global definitions
if [ -f /etc/bashrc ]; then
    . /etc/bashrc
fi
# User specific environment
if ! [[ "$PATH" =~ "$HOME/.local/bin:$HOME/bin:" ]]
then
    PATH="$HOME/.local/bin:$HOME/bin:$PATH"
fi
export PATH
# Uncomment the following line if you don't like systemctl's auto-paging feature:
# export SYSTEMD_PAGER=
# User specific aliases and functions
Do not run the conda init command on Talapas; it modifies your .bashrc file. Instead, use module load to load either miniconda3 or miniforge3.
What About pip?
Pip is an alternative package manager that can be paired with venv to create Python virtual environments. Using pip within a conda environment can introduce problems, but it can be a necessary evil for packages that are available through pip but not through conda.
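If you do need pip inside a conda environment, a common pattern is to activate the environment first and run pip through that environment's Python, as the jupyter.yml file above does for nbscan. A minimal sketch, reusing the nbscan URL from that file:
conda activate myenv
python -m pip install git+https://github.com/conery/nbscan.git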
Useful Conda Commands
Command | Description |
---|---|
which python | see which Python you are currently using; helpful for sanity checks |
conda env list | list all conda environments available from the loaded conda module |
conda activate <environment name> | activates the specified conda environment |
pip list | see packages and their versions that were installed via the pip package manager |
conda list | see packages and their versions installed in the active environment; includes both pip and conda packages |
conda search <package name> | searches the configured channels for available versions of a package |
conda list --name ENVNAME --show-channel-urls | useful when trying to figure out which channel was used to install which package |
conda env remove --name <environment name> | deletes a conda environment (make sure you have exported a backup first) |