Dear HPC Users,

We have been monitoring usage of the HPC cluster closely, and we will be making changes to the current Quality of Service (QoS) system that we feel are necessary. These changes affect how users submit jobs and use the cluster, as they involve the maximum wall time limit per job and the amount of resources that can be requested per job. The changes will go live after the power maintenance work next week is completed.

We have decided to reduce the default maximum wall time to 24 hours. This means that jobs submitted with the NORMAL QoS, or without specifying a QoS, will default to a 24-hour maximum wall time limit per job. We noticed that many users who submit jobs with the NORMAL QoS either do not kill the job after their calculation completes (for example, Jupyter Notebook users) or rarely need to run for more than 24 hours. The previous 10-day wall time limit allowed such inactive jobs to waste resources for far too long, so we have reduced the default limit to 24 hours. This should reduce resource wastage, especially for new users who are not yet familiar with their jobs' requirements. Users can still request a longer wall time limit by specifying another QoS in their job, as shown in the example below.
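For reference, here is a minimal sketch of a batch script that requests a longer QoS, assuming the cluster's Slurm setup (the job name, resource numbers, and executable below are placeholders; adjust them to your own workload):

    #!/bin/bash
    #SBATCH --job-name=my-analysis   # placeholder job name
    #SBATCH --qos=long               # request the LONG QoS instead of the 24-hour default
    #SBATCH --time=3-00:00:00        # 3-day wall time, in days-hours:minutes:seconds
    #SBATCH --cpus-per-task=8        # placeholder resource request
    #SBATCH --mem=32G                # placeholder resource request

    srun ./my_program                # placeholder executable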

We will also be removing the INFINITE QoS, which allowed jobs to run for up to 365 days, as we rarely observed any job with such a long running time. This also helps avoid jobs being lost to hardware or software failures after running for a long period. The longest wall time generally available in the system is now 14 days, under the QoS named LONG. We are also removing the DEBUG QoS and reducing the maximum wall time limit for the SHORT QoS to 1 hour; the SHORT QoS is now intended for short debugging sessions and short jobs that benefit from boosted priority.
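For example, a short interactive debugging session under the new SHORT QoS could be started like this (a sketch assuming Slurm's srun; the resource numbers are placeholders):

    # 30-minute interactive shell with boosted priority for quick debugging
    srun --qos=short --time=00:30:00 --cpus-per-task=4 --mem=8G --pty bash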

We are also removing the per-QoS resource limits. Users can now request resources up to the maximum allowed limit per user, which is currently 256 CPUs and 900GB of memory (resources per node are still limited by what is physically available on that particular node). Users should now be able to utilise more resources and finish their jobs sooner than before.
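As a sketch of what the removed limits make possible (again assuming Slurm, with placeholder names), a single job could now request up to the full per-user allowance; note that a request larger than one node will need to span multiple nodes:

    #!/bin/bash
    #SBATCH --job-name=big-run       # placeholder job name
    #SBATCH --qos=normal             # default 24-hour wall time still applies
    #SBATCH --ntasks=256             # up to the 256-CPU per-user limit
    #SBATCH --mem-per-cpu=3500M      # roughly 875GB in total, within the 900GB per-user limit

    srun ./my_mpi_program            # placeholder MPI executable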

We will also be adding a new QoS, EXTENDED, which allows jobs to run for up to 30 days. This QoS is locked by default for all users and must be requested. We will analyse the account's usage pattern and the nature of its jobs before deciding whether to allow the user to run jobs longer than 14 days on the HPC cluster.

The new settings will be updated in the documentation after the changes are live. The new QoS settings can be summarised as follows:

  1. debug (removed)
  2. short
    • maximum wall time: 1 hour (was 3 days)
    • priority: +10000 (was +8000)
  3. normal
    • maximum wall time: 24 hours / 1 day (was 10 days)
    • priority: no boost (was +2000)
  4. long
    • maximum wall time: 14 days (was 30 days)
    • priority: no boost (was +1000)
  5. extended (new)
    • maximum wall time: 30 days
    • priority: no boost
  6. infinite (removed)
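Once the changes are live, you can verify these settings yourself; assuming the cluster runs Slurm, the following command lists each QoS along with its wall time limit and priority:

    sacctmgr show qos format=Name,MaxWall,Priority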

In short, if you usually do not submit jobs longer than 1 day, these changes should have little impact on you. If you usually run larger jobs, you should now be able to request more resources, but you will also need to specify the correct QoS to run a job longer than 1 day. If you do not know how to run a job in the HPC cluster, please raise a ticket so that we can assist you.

We will be monitoring the cluster closely after the changes are made to see if further tweaking is needed. If you have any questions or concerns, please let us know through the service desk.

Thank you.

UPDATE 1: As the power maintenance work has been postponed to a later date, the changes will be made on 4 Jun 2021 (Friday) at 10.00pm.

Categories: HPC