
Changes in Protected Environment One-U RAI scheduling policies

The CHPC operates several compute nodes owned by the One-U Responsible AI Initiative, as described in a post on the One-U RAI website last year.

The One-U RAI nodes are treated as owner nodes with a maximum job wall time of two weeks. Additionally, up to five queued jobs per user accrue priority with the scheduler, Slurm, which determines how quickly jobs start when many are queued. We have noticed that in the Protected Environment, which has two RAI nodes with H200 GPUs, users tend to request wall times close to the maximum, making it very hard for other users to start jobs in a reasonable time.

We will be adjusting the scheduling policies in the near future as follows:

  1. Limit the general RAI owner queue (account=rai-gpu-rw and qos=rai-gpu-rw) to a maximum wall time of 24 hours, with two queued jobs accruing priority per user.
  2. Create a new long QoS (account=rai-gpu-rw and qos=rai-gpu-rw-long) for jobs needing more than a day of wall time, with a wall time limit of 14 days and a maximum of one running job per user. Access to this QoS will be by request only.
  3. Create a new short, priority QoS (account=rai-gpu-rw and qos=rai-gpu-rw-short). This QoS will have a maximum wall time of eight hours, maximum of four GPUs and four jobs per group, maximum of one GPU per job with 12 CPU cores per GPU, and maximum of 250 GB RAM per job. This QoS will have higher priority than others to ensure jobs submitted to this QoS start first when resources are available. Jobs submitted to this QoS will not, however, preempt jobs running with other qualities of service.
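As a sketch, a job targeting the new short, priority QoS could be submitted with directives like the following. The account and QoS names come from the policy above; the script name, workload, and exact resource requests are hypothetical examples chosen to fit within the stated limits:

```shell
#!/bin/bash
# Hypothetical batch script illustrating the short, priority QoS.
#SBATCH --account=rai-gpu-rw
#SBATCH --qos=rai-gpu-rw-short
#SBATCH --time=08:00:00        # short QoS maximum: 8 hours
#SBATCH --gres=gpu:1           # at most 1 GPU per job
#SBATCH --ntasks=1
#SBATCH --cpus-per-task=12     # up to 12 CPU cores per GPU
#SBATCH --mem=250G             # up to 250 GB RAM per job

python train.py                # placeholder workload
```

A job needing more than 24 hours would instead request qos=rai-gpu-rw-long, which, as noted above, is available by request only.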

Last Updated: 12/5/25