GPUs¶
Available Devices¶
Cheaha has GPUs available with the following statistics, broken out by Slurm Partition. For more information on all available partitions, see our Hardware Summary.
| | pascalnodes | amperenodes |
|---|---|---|
| Product Name | P100 | A100 80GB |
| Architecture | Pascal | Ampere |
| CUDA Compute Capability Version | 6.0 | 8.0 |
| CUDA Cores | 3584 | 6912 |
| Memory (GB) | 16 | 80 |
| Memory Bandwidth (GB/s) | 720 | 2039 |
| NVLink Bandwidth (GB/s) | 160 | 600 |
| FP32 Performance (TFLOPS) | 10.6 | 19.5 |
For more information on these nodes, see Detailed Hardware Information.
Scheduling GPUs¶
To submit a job with one or more GPUs, you will need to set the partition to one of the pascalnodes family of partitions for P100 GPUs, or one of the amperenodes family for A100 GPUs.
When requesting a job using sbatch, you will need to include the Slurm flag --gres=gpu:#. Replace # with the number of GPUs you need. Quotas and constraints are available on our Hardware Summary.
Note
It is suggested to request at least 2 CPUs for every GPU to begin with, then monitor usage and adjust the number of cores on subsequent job submissions as necessary. See Managing Jobs for more information.
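As a minimal sketch, an sbatch header requesting a single A100 GPU with two CPU cores might look like the following; the job name, memory, and time values are placeholders to adjust for your own workflow.
#!/bin/bash
#SBATCH --job-name=gpu-example
#SBATCH --partition=amperenodes
#SBATCH --gres=gpu:1
#SBATCH --ntasks=1
#SBATCH --cpus-per-task=2
#SBATCH --mem=16G
#SBATCH --time=02:00:00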
Ensuring IO Performance With A100 GPUs¶
If you are using amperenodes
and the A100 GPUs, then it is highly recommended to move your input files to /local/$SLURM_JOB_ID
prior to running your workflow, to ensure adequate GPU performance. Using $USER_SCRATCH
, or other network file locations, will starve the GPU of data, resulting in poor performance.
The following script can be used to wrap your existing workflows. It will automatically create a temporary directory $TMPDIR
and delete it when your workflow is finished. You'll need to supply the original source of your data as $MY_DATA_DIR
. The script is not guaranteed to delete the temporary directory if the job ends before it reaches the final line, so please be mindful and periodically check for any extra temporary directories and delete them as needed.
#!/bin/bash
#SBATCH ...
#SBATCH --partition=amperenodes
#SBATCH --gres=gpu:1
# RESET AND LOAD CUDA AND cuDNN MODULES
module reset
module load CUDA/12.2.0
module load cuDNN/8.9.2.26-CUDA-12.2.0
# CREATE TEMPORARY DIRECTORY
# WARNING! $TMPDIR will be deleted at the end of the script!
# Changing the following line can cause permanent, unintended deletion of important data.
TMPDIR="/local/$SLURM_JOB_ID"
mkdir -p "$TMPDIR"
# COPY RESEARCH DATA TO LOCAL TEMPORARY DIRECTORY
# Replace $MY_DATA_DIR with the path to your data folder
cp -r "$MY_DATA_DIR" "$TMPDIR"
# YOUR ORIGINAL WORKFLOW GOES HERE
# be sure to load files from "$TMPDIR"!
# CLEAN UP TEMPORARY DIRECTORY
# WARNING!
# Changing the following line can cause permanent, unintended deletion of important data.
rm -rf "$TMPDIR"
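Submit the wrapper script as usual with sbatch; the filename below is a placeholder.
sbatch my_gpu_workflow.sh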
Open OnDemand¶
When requesting an interactive job through Open OnDemand, selecting one of the pascalnodes partitions will automatically request access to one GPU as well. There is currently no way to change the number of GPUs for OOD interactive jobs.
MATLAB¶
To use GPUs with our Open OnDemand MATLAB app, you may need to take a slightly different route than usual.
If you are using MATLAB R2022a or newer, then our pascalnodes
P100 GPUs and amperenodes
A100 GPUs should work without any additional steps.
If you are using R2021b and earlier, then follow the instructions below.
- Start an HPC Interactive Desktop Job with appropriate resources. Be sure to use one of the pascalnodes* partitions.
- Open a terminal.
- Load the appropriate CUDA module.
    - Determine which CUDA modules are compatible with your required version of MATLAB using the table at the MathWorks Site.
    - Check the Pascal (cc6.x) column for the pascalnodes P100 GPUs and the Ampere (cc8.x) column for the amperenodes A100 GPUs.
    - As of September 2023, module load CUDA/11.6.0 and newer should work fine with any version of MATLAB R2021b or older, with possible caveats for some functions.
- Load the appropriate MATLAB module.
- Start MATLAB by entering the command matlab.
- When MATLAB loads, enter the command gpuDevice in the MATLAB Command Window to verify it can identify the GPU.
For more information and official MATLAB documentation please see this page: https://www.mathworks.com/help/parallel-computing/gpu-computing-requirements.html.
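As a sketch of the steps above for an older MATLAB release, the terminal commands might look like the following. CUDA/11.6.0 is taken from the note above; the MATLAB module name is a placeholder, so check module avail MATLAB for the versions installed on Cheaha.
module reset
module load CUDA/11.6.0
# placeholder module name; pick your required version from `module avail MATLAB`
module load MATLAB/R2021b
matlab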
CUDA Modules¶
You will need to load a CUDA module to make use of GPUs on Cheaha. Depending on which version of software you are using, a different version of the CUDA module may be required. For instance, tensorflow version 2.13.0 requires the CUDA/11.8.0 module. To see which versions are available on Cheaha, use the following command at the terminal.
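Assuming the standard Lmod commands used on Cheaha, the listing command would be one of the following.
module avail CUDA
# or, for a searchable listing that also shows how to load each version
module spider CUDA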
If a specific version of CUDA is needed but not installed, please send an install request to support@listserv.uab.edu.
cuDNN Modules¶
If working with deep neural networks (DNNs, CNNs, LSTMs, LLMs, etc.), you will need to load a cuDNN module as well. The cuDNN modules are built to be compatible with a sibling CUDA module and are named with the corresponding version. For example, if you are loading CUDA/12.2.0, you will also need to load cuDNN/8.9.2.26-CUDA-12.2.0.
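For example, loading that matched pair looks like the following; run module avail cuDNN to see which CUDA/cuDNN pairs are currently installed.
module reset
module load CUDA/12.2.0
module load cuDNN/8.9.2.26-CUDA-12.2.0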
TensorFlow Compatibility¶
To check which CUDA module version is required for your version of TensorFlow, see the toolkit requirements chart at https://www.tensorflow.org/install/source#gpu.
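As a quick sanity check, assuming tensorflow 2.13.0 is already installed in your environment (it pairs with the CUDA/11.8.0 module, as noted above), the following should print at least one GPU device when run inside a job with a GPU allocated.
module reset
module load CUDA/11.8.0
# also load the cuDNN module matching CUDA/11.8.0; check `module avail cuDNN` for the exact name
python -c "import tensorflow as tf; print(tf.config.list_physical_devices('GPU'))"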
PyTorch Compatibility¶
PyTorch does not maintain a simple compatibility table for CUDA versions. Instead, please manually check their "get started" page for the latest PyTorch version compatibility, and their "previous versions" page for older PyTorch version compatibility. Assume that a CUDA version is not compatible if it is not listed for a specific PyTorch version.
To use GPUs prior to PyTorch version 1.13, you must select a cudatoolkit version from the PyTorch channel when you install PyTorch using Anaconda. This is how PyTorch knows to install a GPU-compatible flavor, as opposed to the CPU-only flavor. See below for templates of CPU and GPU installs for PyTorch versions prior to 1.13. Be sure to check the compatibility links above for your selected version. Note that torchaudio is also available for signal processing.
- CPU Version:
conda install pytorch==... torchvision==... -c pytorch
- GPU Version:
conda install pytorch==... torchvision==... cudatoolkit=... -c pytorch
For versions of PyTorch 1.13 and newer, use the following template instead.
- CPU Version:
conda install pytorch==... torchvision==... cpuonly -c pytorch
- GPU Version:
conda install pytorch==... torchvision==... pytorch-cuda=... -c pytorch -c nvidia
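After installing either flavor, a minimal check that PyTorch can see the allocated GPU (run inside a job with a GPU) is:
python -c "import torch; print(torch.cuda.is_available(), torch.cuda.device_count())"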
Note
When loading modules, such as CUDA modules for jobs requiring one or more GPUs, always utilize module reset
before loading modules, both at the terminal and within sbatch
scripts. See best practice for loading modules for more information.
Reviewing GPU Jobs¶
As with all jobs, use sacct
to review GPU jobs. Quantity of GPUs may be reviewed using the reqtres
and alloctres
fields.
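For example, to review the requested and allocated GPUs for a completed job (12345678 is a placeholder job ID):
sacct -j 12345678 --format=JobID,JobName,Partition,ReqTRES%45,AllocTRES%45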
Frequently Asked Questions (FAQ) About A100 GPUs¶
- I've been using the P100 GPUs on pascalnodes up until now, what is the easiest way to start using the A100 GPUs?
    - If you are using an sbatch script...
        - Change --partition=pascalnodes to --partition=amperenodes, or change --partition=pascalnodes-medium to --partition=amperenodes-medium.
        - Also change --gres=gpu:3 and --gres=gpu:4 to --gres=gpu:2, as there are only two A100 GPUs per node.
    - If you are using an Open OnDemand Interactive App...
        - Change the partition from "pascalnodes" to "amperenodes", or change "pascalnodes-medium" to "amperenodes-medium".
    - In all cases, be sure to read the section on Ensuring IO Performance With A100 GPUs to be sure disk read speed doesn't limit your performance gains.
- How do I access the A100 GPUs? You can access the A100 GPUs by requesting jobs in the appropriate partitions. Use the amperenodes partition for jobs up to 12 hours or the amperenodes-medium partition for jobs up to 48 hours.
- How many GPUs can I request at once? Up to four GPUs may be requested by any one researcher at once. However, there are only two GPUs per node, so requesting four GPUs will allocate two nodes. To make use of multiple nodes, your workflow software must know how to communicate between nodes using software like Horovod or OpenMPI. If you are new to GPUs and aren't sure you need multiple nodes, please limit your request to one or two GPUs.
- What performance improvements can I expect over the P100 GPUs? Performance improvements depend on the software and algorithms being used. Determining the optimal configuration will take some experimenting. Swapping a single P100 for a single A100, you can generally expect a 3x to 20x improvement. For more information about possible performance improvements, please see the Official NVIDIA A100 page.
- How can I make the most efficient use of the A100 GPUs? A100s process data very rapidly compared with previous technology. Ideally, we want the A100 to be the bottleneck during processing, rather than CPUs or I/O operations. Here are two initial possibilities to consider for optimizing efficiency:
    - All researchers should copy their input data onto /local/$SLURM_JOB_ID (node-specific NVMe drives) before processing to avoid I/O bottlenecks reducing performance. See Ensuring IO Performance With A100 GPUs.
    - Some researchers may benefit from using a larger number of CPU cores for data loading and preprocessing, compared with pascalnodes. Please consider experimenting with different numbers of CPU cores using the same dataset to find what is optimal for you. If you feel that performance should be higher, please contact Support so we can guide you toward an optimal CPU-to-GPU ratio for your application and workflow.
- Where are the A100 nodes physically located, and will this impact my workflows? The A100 nodes are located in the DC BLOX Data Center, west of UAB Campus. Because Cheaha storage (GPFS) is located on campus, there may be slightly higher latency when transferring data between the A100 nodes and GPFS. Impacts will only occur if very small amounts of data are transferred very frequently, which is unusual for most GPU workflows. We strongly recommend copying your input data onto /local/$SLURM_JOB_ID prior to processing; see Ensuring IO Performance With A100 GPUs.
- What will happen to the P100 GPUs? We intend to retain all 18 existing P100 GPU nodes, of which 9 are available now. The remaining 9 nodes have been temporarily taken offline as we reconfigure hardware, and will be reallocated based on demand and other factors.
- What else should I be aware of?
    - Please be sure to clean your data off of /local/$SLURM_JOB_ID as soon as you no longer need it, before the job finishes.
    - We have updated the CUDA and cuDNN modules to improve reliability and ease of use. Please see the section on CUDA Modules for more information.