One-U Responsible AI Initiative nodes: Support for interactive jobs
CHPC users with AI-related projects can utilize hardware obtained through the One-U Responsible AI (RAI) Initiative, which includes eight nodes in the General Environment and two nodes in the Protected Environment. Each node has eight H200 GPUs.
We announced earlier this summer that two of these nodes in the General Environment have the H200s split into smaller pieces as Multi-Instance GPUs (MIG). Each instance acts like a distinct GPU, which effectively increases the number of GPUs available for jobs. This makes it more likely that jobs with modest GPU memory or throughput requirements will start immediately, better supporting interactive work. The available MIG types are listed in the CHPC documentation.
Some users may still need a full H200 GPU for interactive work. Because these GPUs are in high demand, they may not be free at any given moment, and users may need to wait in a queue; they are a limited resource, and it is not practical to set H200 GPUs aside for development work. As an alternative, we created a special Quality of Service (QoS) for the "rai" Slurm account, called "rai-gpu-grn-short." A job submitted with this QoS moves to the top of the rai-gpu-grn partition queue. This gives it higher priority than standard jobs, but it will not preempt them. It does not guarantee that the job will start immediately, as all H200s could be busy, but jobs with this QoS will be the first to start once resources become available.
This QoS has strict limits to encourage its use only for interactive jobs. The limits include one job per user; a maximum of 12 CPU cores, 250 GB RAM, and one H200 GPU per job; and a maximum walltime of eight hours. Also, only eight concurrent jobs with this QoS are allowed per research group.
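As an illustration, an interactive session using this QoS could be requested with salloc as sketched below. The resource requests match the QoS limits described above; the GPU type string in the --gres specification is an assumption and may differ on your cluster, so check the CHPC documentation or `scontrol show partition rai-gpu-grn` for the exact form.

```shell
# Request an interactive session on the rai-gpu-grn partition with the
# high-priority "rai-gpu-grn-short" QoS. The requests below sit at the
# QoS maximums (12 CPU cores, 250 GB RAM, one H200 GPU, 8-hour walltime);
# smaller requests are also fine.
salloc --account=rai \
       --partition=rai-gpu-grn \
       --qos=rai-gpu-grn-short \
       --nodes=1 --ntasks=1 --cpus-per-task=12 \
       --mem=250G \
       --gres=gpu:h200:1 \
       --time=8:00:00
```

The same --qos flag can be added to an sbatch script (as an `#SBATCH --qos=rai-gpu-grn-short` directive) for short batch jobs that fit within these limits.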
We hope that this new QoS helps users acquire a full H200 more quickly for interactive jobs, and we welcome feedback on how it works in practice.