Talapas Essentials: The Structure of Talapas

RACS

The Talapas (from the Chinook word for coyote) cluster is managed by Research Advanced Computing Services (RACS). RACS administers the hardware, software, PIRGs, and other key services for Talapas. Troubleshooting Talapas? Need to request software for your team? The best way to reach RACS is through their customer portal.

A University Supercomputer

Talapas makes it possible to run computations that require more CPU cores, GPU cores, or memory than consumer hardware provides. It also accelerates research at UO by offloading repetitive, highly parallel computational jobs from researchers' devices, allowing scientists to focus on more important tasks.

Talapas is a heterogeneous cluster consisting of hundreds of individual computers called nodes. It is made up of login nodes, compute nodes, and private "condo" nodes. These nodes are much more powerful than a personal computer! Each compute node has up to 128 CPU cores. Nodes typically have at least 500GB of RAM, and specialized nodes for large memory jobs can have up to 4TB of RAM.

[Figure: the Talapas node hierarchy]

Talapas is a living, growing computational ecosystem. New software is added upon request, CPU and GPU hardware is periodically upgraded, and new computing nodes are added as research groups buy special “condo” nodes for their needs.

Getting to Talapas: Login Nodes

The four login nodes are shared by hundreds of Talapas users simultaneously and all share the same filesystem. Having multiple login nodes adds redundancy and reduces single points of failure in the Talapas ecosystem. Login nodes are intended for loading data, transferring large datasets from the internet or the cloud to the Talapas filesystem, preparing software environments, and connecting to IDEs. Unlike other nodes in the cluster, login nodes are open to connections from the broader internet.

The CPU cores and memory on login nodes are not for doing computational work.

There’s a detailed tutorial for connecting to login nodes here.

What is login.talapas.uoregon.edu?

Talapas has a load balancer at login.talapas.uoregon.edu that distributes users as evenly as possible among the four entry or “login” nodes.

If you connect to login.talapas.uoregon.edu through your terminal, you will be routed to one of the four login nodes – login1, login2, login3, or login4 – based on how many people are currently connected to each node.

If connecting directly to a given login node doesn’t work, try another. For example, try login1 if login2 times out. If you can’t reach any of the login nodes, please open a ticket with RACS.
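For example, connecting from your own terminal looks like the sketch below (replace yourDuckID with your DuckID; the per-node hostname pattern shown here is an assumption, so check the connection tutorial if it doesn't resolve):

# Connect through the load balancer; you will land on one of the four login nodes
ssh yourDuckID@login.talapas.uoregon.edu

# Or target a specific login node if the one you tried is unresponsive
ssh yourDuckID@login1.talapas.uoregon.edu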

Compute Nodes

The login and compute nodes share the same filesystem, but all non-trivial work occurs on compute nodes.

The compute nodes on Talapas are grouped into partitions based on what resources they have, how long those resources can be used, and (in the case of condo nodes) which users can access them.

The Talapas Filesystem

Talapas uses a networked filesystem called GPFS to make code, input files, and other crucial pieces of data available across all nodes in the cluster.

All users have 250GB of storage available to them in their home directory at /home/yourDuckID. No other users have permission to access your home directory.

You can check your home directory path by using the following Bash command.

echo $HOME

If you have access to Talapas, you are also part of a PIRG. Your research group’s data should live in /projects/PIRG_NAME. Unless extra storage has been negotiated, PIRG project directories have a maximum of 2TB of storage.

Because the file structure for PIRGs recently changed, some PIRGs may have a slightly different structure in their /projects/PIRG_NAME folder.

You can also explore the filesystem and even upload files of up to 10GB in the Talapas Files app.

Need temporary storage? The /scratch/PIRG_NAME directory associated with your PIRG has 20TB of storage. Data in /scratch/ not accessed within the last 90 days will be deleted.

New PIRGs

Joined Talapas recently? New PIRGs are set up so that all files and folders within the PIRG are shared among all members. That means any files stored in /projects/PIRG are, by default, readable by members of that PIRG.

For example, all members of racs_training can access files and folders in the /projects/racs_training directory. This is why everyone was added to a single, temporary PIRG for the purpose of sharing files among all members of the workshop.

Sharing Files in Legacy PIRGs

Older PIRGs were created with the same storage limit but a different permissions structure. This structure made collaboration among lab members harder, especially once members left to work at other institutions.

Each user had their own folder at /projects/PIRG/DuckID containing files their labmates couldn't see, while all members could access data shared in /projects/PIRG/shared.

Shared data was supposed to live in /projects/PIRG/shared, but lab members often wanted to share files and folders from their personal /projects/PIRG/DuckID folders with fellow lab members as well.

I am a PI and have a legacy PIRG. Can you fix my group's permissions now that members have left?

Yes. To fix the file structure and permissions within your /projects/PIRG directory, open a ticket with RACS. RACS can help retrofit your PIRG into a flatter file structure with simpler permissions. When your PIRG is modified, you can choose to keep the existing file and permissions structure or have the new permissions scheme applied.

If you run ll (the long-listing command) on Talapas, you will see special entries in your home directory that point to other locations, indicated by an arrow (->).

ll
...
lrwxrwxrwx. 1 root  root         26 May 28  2024 library_it -> /projects/library_it/emwin
lrwxrwxrwx. 1 root  root         23 Sep  5 16:22 racs_training -> /projects/racs_training

These folders aren't actually in your home directory. Symlinks, or symbolic links, are references between different locations in the filesystem. If you cd into the racs_training symlink in your home directory, you will be redirected to /projects/racs_training.

These pointers are added for your convenience so that you can move files and code from your home directory to your project directory.
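For example, assuming your home directory contains the racs_training symlink shown above, you can see the redirection in action by comparing the logical and physical paths:

cd ~/racs_training    # follow the symlink from your home directory
pwd                   # logical path: the symlink you typed
pwd -P                # physical path: /projects/racs_training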

Some new PIRGs do not have this symlink in place. Do not worry, you can still access the same folder through /projects/PIRG_NAME. PIs can request that the symlink be added to their lab members’ project directories.

Which Folder Should I Use for What?

/home/yourDuckID - 250GB quota

  • code, testing instances, personal work

/projects/yourPIRG - 2TB shared quota

  • datasets, project data, code you want to share with other members of your PIRG

/scratch/yourPIRG - 20TB shared quota

  • data used within 90 days, intermediate outputs or inputs, temporary files

Transferring Files to Talapas

There are a variety of ways to transfer files to and from Talapas based on your use case.

  • To transfer files to Talapas from your web browser, you can use the Talapas file browser. There's a 10GB limit on what you can upload at a time.
  • To transfer files from the command line, use the scp command (see the sketch after this list).
  • To transfer datasets to Talapas from the internet, use the wget <URL> command on a login node. Remember, compute nodes are firewalled.
  • For complex or large-scale transfers, try Globus or file-transfer tools like FileZilla.
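As a rough sketch of the scp route, run from your own computer (the file names and PIRG_NAME here are placeholders):

# Copy a single file from your computer into your PIRG's project directory
scp results.csv yourDuckID@login.talapas.uoregon.edu:/projects/PIRG_NAME/

# Copy an entire directory recursively
scp -r my_dataset/ yourDuckID@login.talapas.uoregon.edu:/projects/PIRG_NAME/

# Copy a file from Talapas back to the current directory on your computer
scp yourDuckID@login.talapas.uoregon.edu:/projects/PIRG_NAME/results.csv .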

Software: Modules on Talapas

Software on Talapas is controlled through lmod modules.

You don't need to worry too much about lmod's implementation details to use modules, especially if the software you need is already available in the module catalogue.

To run software from within a Slurm job, you'll need to load the appropriate modules.

For example, let’s load Python.

module load python3

Tab autocompletion is available when searching through the list of modules.

To see the modules you currently have loaded, run module list.

module list
Currently Loaded Modules:
  1) miniconda-t2/20230523
  2) python3/3.11.4

Now that the module has been loaded, you can run this version of Python as python3; lmod has added it to your PATH.

which python3
/packages/miniconda-t2/20230523/envs/python-3.11.4/bin/python
python3 --version
Python 3.11.4

To remove a module, use the module unload command followed by the module name.

 module unload python3/3.11.4 

Or get rid of ALL modules with module purge.

module purge

Now, you can see that there are no modules loaded.

module list

Modules and PATH

The lmod system works by modifying your PATH variable.

The PATH variable defines the shell’s search path for executables: the list of directories that the shell looks in for runnable programs when you type in a program name without specifying what directory it is in.

When you type a command, the shell checks each directory in the PATH variable in turn, looking for a program with the requested name in that directory. As soon as it finds a match, it stops searching and runs the program.
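If you want to inspect the search path yourself, you can split PATH on its colon separators to print one directory per line:

echo $PATH | tr ':' '\n'    # list each search directory on its own line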

For example, loading python3/3.10.13 added /packages/miniconda-t2/20230523/envs/python-3.10.13/bin to my PATH, along with several miniconda directories, because miniconda is a dependency of the Python module.

module load python3/3.10.13
echo $PATH
/packages/miniconda-t2/20230523/envs/python-3.10.13/bin:/packages/miniconda-t2/20230523/condabin:/packages/miniconda-t2/20230523/bin:/packages/miniconda-t2/20230523/envs/python-3.11.4/bin:/packages/miniconda-t2/20230523/condabin:/home/emwin/.local/bin:/home/emwin/bin:/gpfs/t2/slurm/apps/current/bin:/gpfs/t2/slurm/apps/current/sbin:/usr/local/bin:/usr/bin:/usr/local/sbin:/usr/sbin:/opt/dell/srvadmin/sbin

Talapas also supports compiled languages like C and C++. Compilers like gcc and aocc are available as modules.

Browsing Modules

Want a more user-friendly list of the modules available?

Try module spider [keyword]. Below, I'll search for the neuroscience-related software FSL.

module spider fsl
-----------------------------------------------------------------------------------------------------------------
  fsl:
-----------------------------------------------------------------------------------------------------------------
     Versions:
        fsl/5.0.9
        fsl/5.0.10
        fsl/6.0.1
        fsl/6.0.7
        fsl/6.0.7.9
     Other possible modules matches:
        FSL  FSLeyes  fsleyes  fslpy

-----------------------------------------------------------------------------------------------------------------
  To find other possible module matches execute:

      $ module -r spider '.*fsl.*'

Alternatively, you can use the module avail command to get a full list without relying on the spider search mechanism.

module avail
---------- /packages/modulefiles/t2/modulefiles/mpi/gcc/13.1.0 --------------
   mpich/4.1.1 (L)    openmpi/4.1.6

--------------------- /packages/modulefiles/t2/modulefiles ---------------------
   AOCL/4.2.0
   Geneious
   MRIConvert/2.1.0
   Mathematica/11.3
   Mathematica/12.0                                (D)
   NonLinLoc/20221102
   OpenDX/4.4.4
   R/3.4.2-lcni
   R/4.3.2
   R/4.3.3
   R/4.4.2                                         (D)
   RECON/1.08
   RFdiffusion1/RFdiffusion1
   RepeatMasker/4.0.7racs1
   RepeatModeler/1.0.10
   RepeatScout/1.0.5
   adapterremoval/2.1.7
   adapterremoval/2.3.3   

You can scroll through the list produced by module avail using the arrow keys. Press Q to quit. Default versions of a module are indicated with (D). Remember that defaults can change over time!

Talapas modules are maintained by RACS. If software you need for your workflow is unavailable, you can request the creation of new modules through the Talapas ticketing system.

Reproducibility with Modules

Always use complete module names in your batch jobs and scripts when possible.

For example, module load fsl/6.0.7.9 is preferred to module load fsl because the default version will change over time. The default version could be out of date.

You should always know which packages and which versions your code relies on; it makes it easier for other scientists to reproduce your results.
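As a sketch, the top of a version-pinned batch script might look like the following (the partition, account, and module version here are examples; substitute the ones your workflow actually uses):

#!/bin/bash
#SBATCH --partition=compute
#SBATCH --account=racs_training
module purge                 # start from a clean module environment
module load fsl/6.0.7.9      # pin the exact version rather than the default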

Talapas Partitions: Where Do I Run My Jobs?

Here is a summary of the primary partitions on Talapas so you can decide where to schedule your jobs at a glance.

| Partition | Max Job Time | GPUs | Description | CPU Type |
|---|---|---|---|---|
| compute | 24 hrs | no | default partition, appropriate for most users | AMD |
| compute_intel | 24 hrs | no | for software that requires Intel processors, older computers | Intel |
| computelong | 2 wks | no | default partition for jobs that take longer than 24 hours | AMD |
| computelong_intel | 2 wks | no | for jobs that take longer than 24 hours and require Intel processors | Intel |
| gpu | 24 hrs | yes | for shorter jobs that require GPUs | AMD |
| gpulong | 2 wks | yes | partition for GPU jobs that take longer than 24 hours | AMD |
| interactive | 12 hrs | no | partition for interactive srun jobs and OnDemand apps (Talapas Desktop, JupyterLab) | AMD |
| interactivegpu | 8 hrs | yes | GPU partition for interactive srun jobs and OnDemand apps (Talapas Desktop, JupyterLab) | AMD |
| memory | 24 hrs | no | for memory-intensive jobs that require up to 4TB of RAM | AMD |
| memorylong | 2 wks | no | for long-running memory-intensive jobs that require up to 4TB of RAM | AMD |
| preempt | 1 wk | yes | special "partition" that appropriates nodes in other partitions | Various |

Partition Status: sinfo

Want to know the current status and time limits of all the partitions on Talapas? The command sinfo displays all available partitions that you can schedule jobs on.

sinfo

Each partition has one line per node state, showing how many of its nodes are currently in that state (for example mix, alloc, idle, or drng).

PARTITION         AVAIL  TIMELIMIT  NODES  STATE NODELIST
compute              up 1-00:00:00      1   drng n0135
compute              up 1-00:00:00     35    mix n[0111-0120,0122-0133,0181-0185,0187,0189-0190,0192-0196]
compute              up 1-00:00:00      4  alloc n[0180,0186,0188,0191]
compute              up 1-00:00:00      2   idle n[0121,0134]
compute_intel        up 1-00:00:00     17    mix n[0055-0056,0063-0064,0073,0078,0081-0082,0084,0087-0089,0092,0094,0096,0106-0107]
compute_intel        up 1-00:00:00     33  alloc n[0049-0050,0052-0054,0057-0062,0065-0072,0074-0077,0079-0080,0083,0085-0086,0090-0091,0093,0095,0105]
computelong          up 14-00:00:0      1   drng n0136
computelong          up 14-00:00:0     23    mix n[0119-0120,0122-0133,0181-0185,0187,0189-0190,0192]
computelong          up 14-00:00:0      4  alloc n[0180,0186,0188,0191]
computelong          up 14-00:00:0      2   idle n[0121,0134]
computelong_intel    up 14-00:00:0     17    mix n[0055-0056,0063-0064,0073,0078,0081-0082,0084,0087-0089,0092,0094,0096,0106-0107]
computelong_intel    up 14-00:00:0     33  alloc n[0049-0050,0052-0054,0057-0062,0065-0072,0074-0077,0079-0080,0083,0085-0086,0090-0091,0093,0095,0105]
gpu                  up 1-00:00:00     19    mix n[0149-0160,0162-0167,0301]
gpu                  up 1-00:00:00      1  alloc n0171
gpu                  up 1-00:00:00      3   idle n[0168-0169,0172]
gpulong              up 14-00:00:0     12    mix n[0150,0152-0153,0155-0157,0162-0167]
gpulong              up 14-00:00:0      1  alloc n0171
gpulong              up 14-00:00:0      3   idle n[0168-0169,0172]
interactive          up   12:00:00      9    mix n[0210,0212,0302,0308-0309,0311-0313,0398]
interactive          up   12:00:00      9  alloc n[0209,0211,0303-0307,0310,0399]
interactivegpu       up    8:00:00      1    mix n0161
memory               up 1-00:00:00      7    mix n[0148,0372,0374,0376-0379]
memory               up 1-00:00:00      9  alloc n[0141-0147,0373,0375]
memorylong           up 14-00:00:0      5    mix n[0148,0372,0374,0376,0378]
memorylong           up 14-00:00:0      3  alloc n[0142,0144,0146]
preempt              up 7-00:00:00      2   drng n[0135-0136]
preempt              up 7-00:00:00    114    mix n[0038-0039,0041,0043,0055-0056,0063-0064,0073,0078,0081-0082,0084,0087-0089,0092,0094,0096,0106-0107,0111-0120,0122-0133,0148-0167,0170,0181-0185,0187,0189,0192-0196,0210,0212-0214,0216,0221-0222,0224-0226,0267,0270,0301-0302,0308-0309,0311-0313,0363-0364,0368,0370,0372,0374,0376-0379,0388-0389,0391-0394,0396-0398]
preempt              up 7-00:00:00    181  alloc n[0037,0040,0042,0044-0046,0049-0050,0052-0054,0057-0062,0065-0072,0074-0077,0079-0080,0083,0085-0086,0090-0091,0093,0095,0105,0109-0110,0141-0147,0171,0173-0180,0186,0188,0191,0197,0201-0209,0211,0215,0218-0220,0223,0227-0242,0244-0266,0268-0269,0303-0307,0310,0314-0348,0359-0362,0369,0371,0373,0375,0380-0387,0390,0395,0399,0996-0999]
preempt              up 7-00:00:00      5   idle n[0121,0134,0168-0169,0172]

You can interpret the results from sinfo as follows. Nodes are grouped by partition and state.

  • The AVAIL column shows the status of the partition.
  • The TIMELIMIT column shows the maximum job time in days-hours:minutes:seconds format; 1-00:00:00 is 24 hours.
  • The NODES column shows how many nodes in the partition are in a given state.
  • The STATE column lists the node state, i.e. whether the nodes are allocated (alloc), idle, or a mixture of allocated and idle cores (mix).
  • The NODELIST column lists the nodes in each of the possible states within a partition. Each node can be a member of one or more partitions.

To learn more, see the Slurm documentation for sinfo.
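If you only care about a particular partition, you can ask sinfo to filter for you rather than scanning the full listing:

sinfo --partition=gpu    # show only the gpu partition
sinfo -p gpu,gpulong     # or several partitions at once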

If your PIRG has purchased condo nodes, you will see additional nodes in the list returned by sinfo.

The preempt partition is a special partition that allows users to take advantage of additional computational resources in a low-priority queue.

Do not run critical jobs on preempt, as there’s always a risk of having your job cancelled.

On any other partition, your job will run until it finishes, reaches the time limit you requested, or exceeds the resources you requested.
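If you do want to experiment with preempt, a minimal sketch looks like the following; the --requeue option asks Slurm to put the job back in the queue if it is preempted, which only makes sense for work that can safely restart (the script name is a placeholder):

#!/bin/bash
#SBATCH --partition=preempt
#SBATCH --account=racs_training
#SBATCH --requeue              # re-queue the job if it gets preempted
./my_restartable_task.sh       # placeholder: work that tolerates being restarted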

Introducing Conda Environments

Conda is a tool for managing virtual environments and is available on Talapas. Conda helps you manage different coding environments for different projects.

We will discuss Conda more in future lessons, but today we will demonstrate how it works and how to use it to create Python environments from the command line.

Loading the Conda Module

To use Conda, you must load the miniconda3/20240410 module.

module load miniconda3/20240410

Check the module is loaded with module list.

module list
Currently Loaded Modules:
  1) miniconda3/20240410

Looking at Available Conda Environments

List the conda environments available to you with conda env list.

There are a number of public conda environments maintained by RACS in the /packages/miniconda3/20240410/envs/ folder. If you have not created any conda environments of your own, then only the public environments compiled by RACS will be listed.

conda env list
# conda environments:
#
base                     /packages/miniconda3/20240410
R-test-pack              /packages/miniconda3/20240410/envs/R-test-pack
SE3nv                    /packages/miniconda3/20240410/envs/SE3nv
ancestryhmm-v2           /packages/miniconda3/20240410/envs/ancestryhmm-v2
argweaver-20241202       /packages/miniconda3/20240410/envs/argweaver-20241202
bgchm-20241008           /packages/miniconda3/20240410/envs/bgchm-20241008
brainiak-20240412        /packages/miniconda3/20240410/envs/brainiak-20240412
...

Creating Conda Environments

Let's create a new environment named fall-workshop that will be stored inside the .conda folder of your home directory. You can specify which Python version is used through the python= argument.

conda create --name fall-workshop python=3.12 numpy matplotlib

This command creates an environment with the numpy and matplotlib packages. When Conda finishes building the environment, you will see a message like this.

Preparing transaction: done                                 
Verifying transaction: done                                 
Executing transaction: done                                 
#                                                           
# To activate this environment, use                         
#                                                           
#     $ conda activate fall-workshop                               
#                                                           
# To deactivate an active environment, use                  
#                                                           
#     $ conda deactivate 

To activate fall-workshop, run the conda activate command.

conda activate fall-workshop

Observe that your environment name will now appear to the left of your terminal prompt.

(fall-workshop) [emwin@login2 conda]$     

From inside our conda environment, we can run the which python command to confirm we are using the Python instance stored inside fall-workshop.

which python
~/.conda/envs/fall-workshop/bin/python  

To see which packages are in the current environment, use conda list.

conda list

We can scroll through the list to find matplotlib and numpy.

...
matplotlib                3.10.0          py312h06a4308_0  
matplotlib-base           3.10.0          py312hbfdbfaf_0  
mkl                       2023.1.0         h213fc3f_46344  
mkl-service               2.4.0           py312h5eee18b_2  
mkl_fft                   1.3.11          py312h5eee18b_0  
mkl_random                1.2.8           py312h526ad5a_0  
mysql                     8.4.0                h721767e_2  
ncurses                   6.4                  h6a678d5_0  
numpy                     2.2.5           py312h2470af2_0
...

Alternatively, piping the output to grep returns only the lines that mention the packages of interest.

 conda list | grep -E "matplotlib|numpy"
matplotlib                3.10.0          py312h06a4308_0  
matplotlib-base           3.10.0          py312hbfdbfaf_0  
numpy                     2.2.5           py312h2470af2_0  
numpy-base                2.2.5           py312h06ae042_0  

Confirm that your conda environment works by opening a Python interpreter and importing one of the installed packages. Remember, you should not be doing work on the login node.

python
Python 3.12.12 | packaged by Anaconda, Inc. | (main, Oct 21 2025, 20:16:04) [GCC 11.2.0] on linux
Type "help", "copyright", "credits" or "license" for more information.

Let's import NumPy with import numpy as np and confirm that it works by creating a 1-dimensional array and summing it.

import numpy as np
x = np.array([1, 2, 3])
x.sum()
np.int64(6)

Creating Conda Environments for Your Code

As you migrate Python code to Talapas, inspect its import statements to identify the packages you will need to install into a conda environment on Talapas.

Not sure what version of a Python package you’re running? You can quickly check by using the __version__ attribute.

# Replace lxml with your package of choice
import lxml
print(lxml.__version__)
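Once an environment works, it's worth recording it so it can be rebuilt later or on another machine. A minimal sketch using the fall-workshop environment from above (the environment.yml filename is just a convention):

conda env export --name fall-workshop > environment.yml    # save the package list
conda env create -f environment.yml                        # rebuild it elsewhere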

Slurm: The Talapas Scheduler

Slurm is the job scheduling software used on Talapas. While Talapas has scheduling policies, partitions, and PIRGs that are specific to UO, Slurm is used for job scheduling on high-performance computing clusters around the world.

To schedule jobs on Talapas, you must give Slurm a partition where the job must run and an account (PIRG) associated with the job.

Slurm manages a queue of jobs that determines which node(s) in a partition your job will run on.

Scheduling Simple Jobs with Slurm

To practice with Slurm tasks, connect to a Talapas login node. For this exercise, feel free to use the Talapas OnDemand shell.

Batch Scheduling with sbatch

Batch scripts in Slurm are configured through special comments prefixed with #SBATCH.

All batch jobs should have #!/bin/bash on the first line, followed by #SBATCH options. The order of the #SBATCH options doesn't matter as long as you specify them one per line.

#!/bin/bash
#SBATCH --partition=compute
#SBATCH --account=racs_training

This set of comments represents the minimum required options for a Slurm job on Talapas:

  • a valid Talapas partition
  • an account (PIRG)
Let's create the script in a file called first.sbatch using nano.

nano first.sbatch

Inside nano, enter the following lines. When you’re finished, use Ctrl+O and Ctrl+X to write out to the first.sbatch file and then exit nano.

#!/bin/bash
#SBATCH --partition=compute
#SBATCH --account=racs_training
echo "Hello!"

Required Slurm Job Elements

  • #!/bin/bash on the first line
  • --partition=[a valid partition]
  • --account=[your PIRG]

All other parameters, like --mem, --ntasks, and --cpus-per-task, have default values that vary by partition. The default memory allocation, set through --mem-per-cpu, is 4GB per CPU core; that means single-core jobs start with 4GB of RAM unless you specify otherwise.
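When the defaults aren't enough, you can request resources explicitly. A hedged example, with numbers that are purely illustrative:

#!/bin/bash
#SBATCH --partition=compute
#SBATCH --account=racs_training
#SBATCH --time=0-02:00:00       # 2 hours, in days-hours:minutes:seconds
#SBATCH --cpus-per-task=4       # request 4 CPU cores for the task
#SBATCH --mem=16G               # request 16GB of memory for the whole job
echo "Hello!"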

Let’s run our minimum viable job by passing it to the sbatch command.

sbatch first.sbatch

You will get a response with a (unique) job number when your job is submitted successfully.

Submitted batch job 34704033

Check your job’s status in the queue using the squeue command. The --me flag is a helpful trick if you don’t want to type -u [yourDuckID] each time.

squeue --me

With a job this simple, it’s probably already finished.

JOBID PARTITION     NAME     USER ST       TIME  NODES NODELIST(REASON)

If you see an empty queue like this, go ahead and check your most recent finished jobs with sacct.

sacct
34704033     first.sba+    compute racs_trai+          1  COMPLETED      0:0 
34704033.ba+      batch            racs_trai+          1  COMPLETED      0:0 
34704033.ex+     extern            racs_trai+          1  COMPLETED      0:0

This job doesn't specify names for its output and error logs, so it uses the Slurm default: slurm-[jobid].out. Doing an ls, we see the file created with that default name. You can see where a #SBATCH --job-name might be helpful in the debugging process.
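For example, a hedged version of the same script with a job name and a named log file might look like this; %j is replaced by the job ID, and the names themselves are arbitrary:

#!/bin/bash
#SBATCH --partition=compute
#SBATCH --account=racs_training
#SBATCH --job-name=hello_test          # appears in squeue and sacct output
#SBATCH --output=hello_test-%j.out     # log file name; %j becomes the job ID
echo "Hello!"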

Let's check the contents of the output log. If it worked as intended, we should see the results of the echo command from first.sbatch.

cat slurm-34704033.out
Hello!

We will look at Slurm and several associated commands in detail in the next session!

Shared Resource Etiquette

  • Be conscientious about your use of shared storage in the /projects/[yourPIRG] folder.
  • Close out your jobs when you’re done!
  • Avoid modifying the same files concurrently; this is a shared filesystem.
  • Book your interactive jobs for as long as you need, but not longer.
  • When running interactive jobs or the Talapas Desktop app, you will not be warned when your time is about to run out. Track your own time conscientiously.