One-U Responsible AI Initiative GPU nodes

As part of the One-U Responsible AI (RAI) Initiative at the University of Utah, the CHPC has installed ten GPU nodes. Each node contains eight NVIDIA H200 GPUs, 96 Intel Emerald Rapids CPU cores, and 2 TB of RAM. Eight of the ten nodes are part of the General Environment (GE) Granite HPC cluster, while the other two are part of the Protected Environment (PE) Redwood cluster. You can read more about the new GPU nodes on the RAI website. After a brief early-testing phase, we have made these nodes available to all users.
The RAI nodes are set up as owner nodes—with RAI as the owner—within the condo cluster model used by the CHPC. Any university researcher with an AI-related project can request priority access by emailing our helpdesk, [email protected]. All RAI nodes are also open, through guest access, to non-RAI researchers with the usual guest restrictions; namely, jobs submitted to the RAI nodes by guests may be preempted by RAI jobs.
Further details and instructions for accessing the RAI nodes
Because the Granite and Redwood clusters are configured differently, the instructions for using the new RAI nodes differ slightly by cluster.
General Environment (Granite)
In the GE, we have divided the GPUs on two nodes—using the Multi-Instance GPU (MIG) approach—into multiple smaller GPU instances. One node has 56 instances of one-seventh of an H200 (1g.18gb in MIG terminology), each with 18 GB of GPU RAM; the other node has 16 instances with 35 GB of GPU RAM (2g.35gb) and 8 instances with 71 GB of GPU RAM (3g.71gb). These small GPU instances should be very useful for testing and interactive jobs, such as jobs submitted through Open OnDemand.
To use these GPUs as an owner (after being granted priority access), log in to the CHPC's Granite cluster and use the rai-gpu-grn partition, the rai-gpu-grn QoS, and the rai Slurm account, along with the --gres=gpu option to request the GPUs. For example, for an interactive job, run
salloc -N 1 -n 12 -A rai -p rai-gpu-grn --qos=rai-gpu-grn --gres=gpu:h200:1 -t 1:00:00
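For batch jobs, the same options can be placed in an sbatch script. A minimal sketch follows; the program name my_gpu_program is a placeholder for your own workload:

```shell
#!/bin/bash
#SBATCH -N 1                # one node
#SBATCH -n 12               # 12 CPU cores
#SBATCH -A rai              # RAI owner Slurm account
#SBATCH -p rai-gpu-grn      # RAI partition on Granite
#SBATCH --qos=rai-gpu-grn   # RAI QoS
#SBATCH --gres=gpu:h200:1   # one full H200 GPU
#SBATCH -t 1:00:00          # one hour of wall time

# placeholder for your actual GPU workload
./my_gpu_program
```

Guests would substitute their group account and the granite-gpu-guest partition and QoS in the #SBATCH lines, exactly as in the interactive examples.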
For guest access, use your group account name and the granite-gpu-guest partition, such as
salloc -N 1 -n 12 -A my_group -p granite-gpu-guest --qos=granite-gpu-guest --gres=gpu:h200:1 -t 1:00:00
To list all the MIG GPU feature types, run
sinfo --partition=rai-gpu-grn -o "%20P %5D %6t %8z %10m %10d %11l %16f %N %60G"
For example, to request the smallest GPU fraction (1g.18gb), run
salloc -N 1 -n 12 -A rai -p rai-gpu-grn --qos=rai-gpu-grn --gres=gpu:h200_1g.18gb:1 -t 1:00:00
or
salloc -N 1 -n 12 -A my_group -p granite-gpu-guest --qos=granite-gpu-guest --gres=gpu:h200_1g.18gb:1 -t 1:00:00
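Once an allocation starts, you can confirm which GPU or MIG instance Slurm assigned to the job. A quick check from within the job, assuming the standard NVIDIA driver tools are on the path:

```shell
# run from within a job allocation on the RAI nodes
echo "CUDA_VISIBLE_DEVICES=$CUDA_VISIBLE_DEVICES"  # device(s) Slurm assigned to the job
srun nvidia-smi -L                                 # lists the GPU or MIG instance(s) visible to the job
```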
Protected Environment (Redwood)
In the PE, one of the two nodes has been configured to use Multi-Instance GPUs, with 28 instances of h200_1g.18gb, 8 instances of h200_2g.35gb, and 4 instances of h200_3g.71gb.
To use the two GPU nodes in the PE, log in to Redwood and use rai-gpu-rw as the account, partition, and QoS, such as
salloc -N 1 -n 12 -A rai-gpu-rw -p rai-gpu-rw --qos=rai-gpu-rw --gres=gpu:h200:1 -t 1:00:00
To list all the MIG GPU feature types in the PE, run
sinfo --partition=rai-gpu-rw -o "%20P %5D %6t %8z %10m %10d %11l %16f %N %60G"
Finally, to request MIG instances in the PE, run
salloc -N 1 -n 12 -A rai-gpu-rw -p rai-gpu-rw --qos=rai-gpu-rw --gres=gpu:h200_1g.18gb:1 -t 1:00:00
or
salloc -N 1 -n 12 -A my_group -p redwood-gpu-guest --qos=redwood-gpu-guest --gres=gpu:h200_1g.18gb:1 -t 1:00:00
Notes and request for feedback
At the moment, there is no fast inter-node network connectivity between the RAI nodes, though we are looking into providing it in the future. For now, we recommend limiting parallel, multi-GPU jobs to at most the eight GPUs within a single node.
Be aware that MIG instances do not support the fast NVLink GPU interconnect, so using more than one MIG instance in a single job will degrade performance. In our tests, seven parallel 1g instances ran about 20% slower than the full H200 GPU. A better alternative is to use a single larger MIG slice or the whole H200 GPU.
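For example, rather than requesting several 1g.18gb instances for one job, request a single larger slice or the full GPU; the owner account and Granite partition from the examples above are shown here:

```shell
# one 3g.71gb slice (71 GB of GPU RAM) instead of multiple small instances
salloc -N 1 -n 12 -A rai -p rai-gpu-grn --qos=rai-gpu-grn --gres=gpu:h200_3g.71gb:1 -t 1:00:00

# or the whole (non-MIG) H200 GPU
salloc -N 1 -n 12 -A rai -p rai-gpu-grn --qos=rai-gpu-grn --gres=gpu:h200:1 -t 1:00:00
```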
We are interested in any feedback you may have; please send it to [email protected]. We are especially interested in feedback on the three smaller MIG instance sizes, as we are open to adjusting the ratio of MIG instances based on the needs of researchers.
If you have any questions or issues, please open a ticket at [email protected].