R on Talapas
R can be used on Talapas in three ways:
- through the RStudio IDE on the Talapas Desktop App
- through the command line in sbatch scripts or interactive srun jobs
- through the JupyterLab OnDemand app with an R-enabled kernel
You can and should configure your R environment with a package manager like Conda. Custom environments work with all three options, so use whichever is most convenient for you.
As mentioned in a previous class session, conda is not CRAN. In some cases, you will need to install specialized packages through R's install.packages() because the package simply isn't available through conda.
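For example, a CRAN-only package can be installed from inside the environment's R session. The sketch below is illustrative: the package name is a placeholder, and with the conda-provided R active, the install typically lands in that environment's library rather than a system location.
# Hypothetical example: install a CRAN-only package from inside R.
# With the dualkernel environment active, .libPaths() points at the
# environment's library, so the package stays with the environment.
install.packages("somepackage", repos = "https://cloud.r-project.org")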
Supported Versions of R on Talapas
- R/4.3.2
- R/4.3.3
- R/4.4.2
Creating a Custom R Environment
First, let's create our custom R environment from the following dual-kernel-jupyter.yml file.
If you would prefer not to create the file from scratch, you can copy it from the /projects/racs_training/emwin/r_activity/ folder as follows:
cp -r /projects/racs_training/emwin/r_activity/ .
cd r_activity
ls
dual-kernel-jupyter.yml multi-r.R multi-r.sh serial-r.R serial-r.sh
cat dual-kernel-jupyter.yml
name: dualkernel
channels:
- conda-forge
- defaults
dependencies:
- python=3.12
- r-base=4.3.3
- r-essentials
- jupyter
- r-irkernel
- r-tidyverse
- r-devtools
- r-data.table
- r-tidycensus
- r-tigris
In addition to R 4.3.3 and Python 3.12, this kernel has a special set of packages for running R within the JupyterLab OnDemand app:
- jupyter - allows for running Jupyter
- r-base - enables R within Conda
- r-essentials - adds some popular R packages
- r-irkernel - enables R within Jupyter
You can modify this template to create your own dual kernel environments.
When you're ready, create the dualkernel environment:
conda env create -f dual-kernel-jupyter.yml --solver=libmamba
This command will fail if you already have an environment named dualkernel.
RStudio with a Custom R Environment
Access to the RStudio GUI is described at length in this tutorial for Talapas Desktop.
Start a new Talapas Desktop session and open Mate Terminal.
Activate the dualkernel environment.
module load miniconda3/20240410
conda activate dualkernel
Before we test it in RStudio, let's check the R version inside the dualkernel environment using the interactive interpreter.
R
Type installed.packages() (note the past tense) in the R interpreter to check that our packages have been installed.
installed.packages()
IRdisplay NA NA NA NA "4.3.3"
IRkernel NA NA NA "no" "4.3.3"
I can tell this is my special installation because I can see the packages IRdisplay and IRkernel, which will come in handy later.
Exit the R interpreter.
quit()
Which RStudio?
There are two primary rstudio modules for use with custom environments on Talapas.
- rstudio/base - the base RStudio module, which uses whatever R the user has loaded
- rstudio/2024.04.1+748-condaGLIBX - an RStudio build that allows newer library packages requiring specific GLIBX system libraries to be loaded
Let's load rstudio/base.
module load rstudio/base
We want to load the base RStudio module that uses whatever R is on our path: the R from the environment named dualkernel.
Start RStudio with the rstudio command.
rstudio
We'll use find.package(), which, given the name of a package, returns the file path of the library directory where that package is installed.
find.package("dplyr")
[1] "/gpfs/home/emwin/.conda/envs/dualkernel/lib/R/library/dplyr"
It should reference the dualkernel environment in your home directory on Talapas.
Freddy’s Fun Facts about R
For the sake of reproducibility, write every script assuming it will be run in a fresh R process.
- User-level setup: do not save .RData when you quit R, and don't load .RData when you fire up R.
  - In RStudio, this behavior can be requested in the General tab of Preferences.
  - If you run R from the shell, put something like this in your .bash_profile: alias R='R --no-save --no-restore-data'
- Don't do things in your .Rprofile that affect how R code runs, such as loading a package like dplyr or ggplot2 or setting an option such as stringsAsFactors = FALSE.
The Monte Carlo Method for Approximating Pi
We’re going to be testing R with a classic parallel computing task: approximating pi.
This code simulates random (x, y) points in a square of side 2r centered on (0, 0), with a circle of radius r inscribed in the square. Because the circle covers pi*r^2 of the square's (2r)^2 area, the fraction of points that land inside the circle approaches pi/4, so multiplying that fraction by 4 approximates pi.
The more points n, the closer our approximation is to pi.
Let's write the following R function inside RStudio.
monte_carlo_pi <- function(n) {
x <- runif(n, -1, 1)
y <- runif(n, -1, 1)
sum(x^2 + y^2 <= 1) / n * 4
}
Let’s practice calling this function in the console.
monte_carlo_pi(5)
[1] 2.4
Let’s increase n to 100!
monte_carlo_pi(100)
[1] 3.32
Now 1000.
monte_carlo_pi(1000)
[1] 3.14
Finally, 10000000.
monte_carlo_pi(10^7)
[1] 3.140647
As expected, a larger n produces a better approximation of pi.
Now, let’s produce a better approximation by averaging several concurrently generated samples.
Vectorized Functions
R programmers typically rely on the apply family of functions, such as apply, sapply, lapply, and tapply, instead of writing explicit loops.
The closest Tidyverse equivalents live in dplyr, for example dplyr::across().
Regardless of whether you use base R or the Tidyverse, this style expresses a computation as the same operation applied independently to each element of the input, which is exactly the shape of work that can later be spread across multiple CPUs.
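As a quick illustration of the Tidyverse counterpart mentioned above, here is a small sketch that assumes the dualkernel environment (which includes r-tidyverse):
library(dplyr)

# Apply sqrt to every numeric column of a small example data frame
df <- data.frame(a = 1:3, b = c(4, 9, 16))
df %>% mutate(across(where(is.numeric), sqrt))
For the rest of this lesson, though, we'll stick with base R's lapply.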
lapply(x, FUN) returns a list of the same length as x, each element of which is the result of applying FUN to the corresponding element of x.
For example, let's make a quick vector my_list in the console.
my_list <- 1:10
my_list
You can see it's one-dimensional and holds the values 1 through 10 in order.
[1] 1 2 3 4 5 6 7 8 9 10
If we apply the sqrt function using lapply, we get the square root of each value in my_list.
lapply(my_list, sqrt)
[[1]]
[1] 1
[[2]]
[1] 1.414214
[[3]]
[1] 1.732051
[[4]]
[1] 2
[[5]]
[1] 2.236068
[[6]]
[1] 2.44949
[[7]]
[1] 2.645751
[[8]]
[1] 2.828427
[[9]]
[1] 3
[[10]]
[1] 3.162278
We can also pass in our own functions:
lapply(my_list, monte_carlo_pi)
[[1]]
[1] 4
[[2]]
[1] 2
[[3]]
[1] 2.666667
[[4]]
[1] 4
[[5]]
[1] 3.2
[[6]]
[1] 2
[[7]]
[1] 2.857143
[[8]]
[1] 2.5
[[9]]
[1] 2.666667
[[10]]
[1] 3.6
But what if we want to run lapply on monte_carlo_pi with a different argument?
We can run monte_carlo_pi with n = 1000 ten times as follows:
lapply(my_list, function(i) monte_carlo_pi(1000))
[[1]]
[1] 3.168
[[2]]
[1] 3.12
[[3]]
[1] 3.256
[[4]]
[1] 3.176
[[5]]
[1] 3.152
[[6]]
[1] 3.192
[[7]]
[1] 3.172
[[8]]
[1] 3.112
[[9]]
[1] 3.064
[[10]]
[1] 3.084
In this case, it's the length of the input passed to lapply that matters, not the values stored in it.
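A common idiom for this pattern is to iterate over an index sequence directly; a small sketch:
# Run monte_carlo_pi(1000) ten times, ignoring the index i
lapply(seq_len(10), function(i) monte_carlo_pi(1000))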
Next, let's benchmark. Adjust the script in the RStudio file editor as follows.
monte_carlo_pi <- function(n){
x <- runif(n, -1, 1)
y <- runif(n, -1, 1)
sum(x^2 + y^2 <= 1) / n * 4
}
# Set random seed
set.seed(42)
system.time({
# Compute 100 approximations of pi
results_vectorized <- lapply(1:100, function(i) monte_carlo_pi(10^7))
# Average results
print(mean(unlist(results_vectorized)))
})
Estimating Resource Usage in R
We're going to time these jobs in R using system.time(), which lets us see how CPU time was actually spent.
This script performs 100 runs of the monte_carlo_pi approximation with 10 million points each, then averages those 100 results and prints the mean.
Running it gives a result like this:
[1] 3.141586
user system elapsed
66.004 7.851 74.024
- user: the time R spent on the CPU(s)
- system: time spent on related system processes
- elapsed: wall-clock time
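Because system.time() returns a named vector, individual components can be pulled out for logging. A quick sketch, reusing the monte_carlo_pi() function defined above:
timing <- system.time(monte_carlo_pi(10^6))
timing["elapsed"]    # wall-clock seconds
timing["user.self"]  # CPU seconds spent in R itself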
While lapply is a vectorized operation, this is still serial code: you're using only the one CPU core you requested and were allocated.
Multithreading, Multicores
But what if we want to use multithreading or multiple cores in R?
To do so, you will need to launch batch Slurm jobs.
Open a connection to a Talapas login node and copy today’s files into your working directory.
cp -r /projects/racs_training/emwin/r_activity/ .
cd r_activity
Running Batch R Jobs with Rscript
Take a look at the contents of your r_activity folder.
ls
dual-kernel-jupyter.yml multi-r.R multi-r.sh serial-r.R serial-r.sh
First, examine serial-r.R. It should look familiar.
monte_carlo_pi <- function(n){
x <- runif(n, -1, 1)
y <- runif(n, -1, 1)
sum(x^2 + y^2 <= 1) / n * 4
}
set.seed(42)
system.time({
results_vectorized <- lapply(1:100, function(i) monte_carlo_pi(10^7))
print(mean(unlist(results_vectorized)))
})
This script is called by the Slurm job script serial-r.sh.
#!/bin/bash
#SBATCH --account=racs_training
#SBATCH --job-name=serial-r
#SBATCH --output=serial.out
#SBATCH --error=serial.err
#SBATCH --partition=compute
#SBATCH --time=0-00:10:00
#SBATCH --ntasks=1
#SBATCH --cpus-per-task=1
module purge
module load miniconda3/20240410
conda activate dualkernel
Rscript serial-r.R
This is a single-core, serial job that asks for the default 4 GB of RAM.
Launch it.
sbatch serial-r.sh
Submitted batch job 31969962
After the job completes (within about 60 seconds), check its resource usage with seff.
seff 31969962
Job ID: 31969962
Cluster: talapas
User/Group: emwin/uoregon
State: COMPLETED (exit code 0)
Cores: 1
CPU Utilized: 00:01:13
CPU Efficiency: 96.05% of 00:01:16 core-walltime
Job Wall-clock time: 00:01:16
Memory Utilized: 332.11 MB
Memory Efficiency: 8.11% of 4.00 GB
Now examine multi-r.R, which performs the same computations in parallel across multiple cores.
library(parallel)
# Don't use the total number of CPUs on the node;
# use the number you asked for from Slurm
workers <- as.numeric(Sys.getenv("SLURM_CPUS_PER_TASK"))
cat("using", workers, "workers")
# Create workers - 1 worker processes; with the parent R process, that's workers total
cl <- makeCluster(workers - 1)
system.time({
monte_carlo_pi <- function(i) {
n <- 10^7
x <- runif(n, -1, 1)
y <- runif(n, -1, 1)
sum(x^2 + y^2 <= 1) / n * 4
}
results <- parLapply(cl, 1:100, monte_carlo_pi)
print(mean(unlist(results)))
})
stopCluster(cl)
This script instantiates as many workers as CPUs requested from Slurm. The number of requested CPUs for each Slurm job is stored in the SLURM_CPUS_PER_TASK environment variable and corresponds to the value passed to the #SBATCH --cpus-per-task= parameter.
These workers are created with R's built-in parallel package, and the work is distributed with parLapply, the parallel version of lapply.
This package is not the only option for performing parallel computation in R, nor is parLapply its only tool. I have included a list of additional resources in the Additional R Resources section.
Regardless of what package you use, take advantage of environment variables like SLURM_CPUS_PER_TASK to make sure the number of workers you create matches the number of CPUs allocated.
Creating more workers than can be scheduled on the CPUs allocated to you will not speed up your job.
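One defensive pattern, sketched below, is to fall back to a single worker when SLURM_CPUS_PER_TASK is not set, so the same script also runs outside of a Slurm job:
# Use the Slurm allocation when available, otherwise a single worker
workers <- as.numeric(Sys.getenv("SLURM_CPUS_PER_TASK", unset = "1"))
if (is.na(workers) || workers < 1) workers <- 1
cat("using", workers, "workers\n")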
Inspect multi-r.sh.
#!/bin/bash
#SBATCH --account=racs_training
#SBATCH --output=multi.out
#SBATCH --error=multi.err
#SBATCH --job-name=multi-r
#SBATCH --partition=compute
#SBATCH --time=0-00:10:00
#SBATCH --ntasks=1
#SBATCH --cpus-per-task=4
module purge
module load miniconda3/20240410
conda activate dualkernel
Rscript multi-r.R
This job requests 4 CPUs, so multi-r.R will run with 4 threads.
Launch the job. It should finish in about 20 seconds. Because of scheduling and overhead, it is not 4x faster than the serial version.
sbatch multi-r.sh
Submitted batch job 31969987
When the job finishes, run the seff command to examine its resource usage.
seff 31969987
Job ID: 31969987
Cluster: talapas
User/Group: emwin/uoregon
State: COMPLETED (exit code 0)
Nodes: 1
Cores per node: 4
CPU Utilized: 00:00:53
CPU Efficiency: 55.21% of 00:01:36 core-walltime
Job Wall-clock time: 00:00:24
Memory Utilized: 4.16 MB
Memory Efficiency: 0.03% of 16.00 GB
Note the distinction between the Job Wall-clock time and the CPU Utilized time: because multiple CPU cores were used concurrently, 53 seconds of CPU work was performed in 24 seconds of wall-clock time.
R on the JupyterLab OnDemand App
You can run R in the JupyterLab app, and while it's not quite RStudio, it's a fast, interactive way to create figures with tools like ggplot2 from data that lives on Talapas.
To run R in JupyterLab, you will need to create a special conda environment with (at least) these packages:
- jupyter - allows for running Jupyter
- r-base - enables R within Conda
- r-essentials - adds some popular R packages
- r-irkernel - enables R within Jupyter
Want to learn more about which R packages are available in Jupyter? Anaconda keeps a list of them.
Open the JupyterLab OnDemand app, and make sure to pass your dualkernel environment in the custom environment box.
Today, we're going to use JupyterLab to play around with ggplot2.
Running R in JupyterLab is not the fastest option computationally, but its interface is much better than the command line for generating and iterating on data visualizations.
R Activity
Open a new notebook with the R kernel selected.
Creating variables in R doesn’t look that different from Python.
x <- 3 # Create a variable called x
y <- 5 # Create a variable called y
If I evaluate this cell, I will get 8.
x + y
8
Let’s practice visualizing data with ggplot2.
R has a convenient tidycensus package that can be used to download data from the Census API.
ggplot2 in JupyterLab
Load required libraries.
library(ggplot2)
library(tidycensus)
library(tigris)
Get the median household income by county (ACS variable "B19013_001") for Oregon.
oregon_incomes <- get_acs(
geography = "county",
variables = "B19013_001",
year = 2020,
state = "OR",
geometry = TRUE
)
Plot the data.
ggplot(data = oregon_incomes, aes(fill = estimate)) +
geom_sf()
Add a title and labels using the labs() function.
ggplot(data = oregon_incomes, aes(fill = estimate)) +
geom_sf() +
labs(title = " Median Household Income by County",
caption = "Data source: 2020 ACS, US Census Bureau",
fill = "ACS estimate")
Adjust the color palette with scale_fill_distiller().
ggplot(data = oregon_incomes, aes(fill = estimate)) +
geom_sf() +
labs(title = " Median Household Income by County",
caption = "Data source: 2020 ACS, US Census Bureau",
fill = "ACS estimate") +
scale_fill_distiller(palette = "PuBu",
direction = 1)
Use theme_void() to strip away the axes, grid lines, and background, leaving just the map.
ggplot(data = oregon_incomes, aes(fill = estimate)) +
geom_sf() +
labs(title = " Median Household Income by County",
caption = "Data source: 2020 ACS, US Census Bureau",
fill = "ACS estimate") +
scale_fill_distiller(palette = "PuBu",
direction = 1) +
theme_void()
To write your plot out to a file, first save it as a variable, then use ggsave().
orplot <- ggplot(data = oregon_incomes, aes(fill = estimate)) +
geom_sf() +
scale_fill_distiller(palette = "PuBu",
direction = 1) +
labs(title = " Median Household Income by County in Oregon",
caption = "Data source: 2020 ACS, US Census Bureau",
fill = "ACS estimate") +
theme_void()
Write your plot out to oregon_incomes.png. It will be saved to your current working directory in Jupyter.
ggsave(filename = "oregon_incomes.png", plot = orplot, width = 12, height = 10, dpi = 300, units = "cm")
Additional R Resources
R for Computational Researchers
- Reproducibility in R and RStudio A lesson on reproducibility in R and RStudio from UNC's BeginR curriculum.
- R for Data Science A free book on data analysis in R using Tidyverse.
R Parallel Computing Packages
CRAN has an extensive list of parallel computing packages. We’ve tested a few.
- R Futures An R API for asynchronous programming compatible with Slurm.
- parallel This built-in R library will behave differently on Windows than on Linux machines like Talapas compute nodes.
- foreach This package can be used in combination with the doParallel or doFuture backends to run flexible multi-core R jobs using special for-loop constructs; a short sketch follows below.
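As an illustration of that last entry, here is a minimal sketch of foreach with the doParallel backend, reusing the Monte Carlo estimator from this lesson. Treat it as a starting point rather than a drop-in script.
library(foreach)
library(doParallel)

# Same estimator used throughout this lesson
monte_carlo_pi <- function(n) {
  x <- runif(n, -1, 1)
  y <- runif(n, -1, 1)
  sum(x^2 + y^2 <= 1) / n * 4
}

# Size the cluster from the Slurm allocation, defaulting to 1 outside Slurm
workers <- as.numeric(Sys.getenv("SLURM_CPUS_PER_TASK", unset = "1"))
cl <- makeCluster(workers)
registerDoParallel(cl)

# %dopar% runs each iteration on a cluster worker
results <- foreach(i = 1:100, .combine = c) %dopar% monte_carlo_pi(10^7)
print(mean(results))

stopCluster(cl)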