Greetings HPC Users,

We would like to inform you of recent and upcoming changes to the HPC cluster. The changes are outlined below.

Maximum Walltime per Job

The maximum walltime per job will be reduced from 7 days to 3 days, effective June 2024. We collected job execution data in May and concluded that all jobs should be able to complete within 3 days with proper resource allocation. We also found that all jobs that could not complete within 3 days support checkpointing, so resuming those jobs should not be an issue.
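For jobs that need more than 3 days in total, the usual pattern is to save progress periodically and resume from the last saved state in a follow-up job. The following is only a minimal sketch of that pattern, not a description of how any particular application on the cluster checkpoints. The file name, the state layout, and the stop_after parameter (which merely simulates a job being cut off) are all hypothetical.

```python
import json
import os
import tempfile


def load_checkpoint(path):
    """Return the saved state, or a fresh state if no checkpoint exists yet."""
    if os.path.exists(path):
        with open(path) as f:
            return json.load(f)
    return {"step": 0, "total": 0}


def save_checkpoint(path, state):
    """Write to a temporary file first, then rename it into place, so an
    interrupted job never leaves a half-written checkpoint behind."""
    directory = os.path.dirname(os.path.abspath(path))
    fd, tmp = tempfile.mkstemp(dir=directory)
    with os.fdopen(fd, "w") as f:
        json.dump(state, f)
    os.replace(tmp, path)


def run(path, n_steps, checkpoint_every=100, stop_after=None):
    """Run (or resume) a loop of n_steps, checkpointing every few steps.

    stop_after is a hypothetical knob that simulates the job ending early.
    """
    state = load_checkpoint(path)
    while state["step"] < n_steps:
        if stop_after is not None and state["step"] >= stop_after:
            break
        state["total"] += state["step"]  # placeholder for the real work
        state["step"] += 1
        if state["step"] % checkpoint_every == 0:
            save_checkpoint(path, state)
    save_checkpoint(path, state)
    return state
```

Calling run() a second time with the same checkpoint path picks up at the saved step instead of starting from zero, which is what allows a long computation to be spread over several jobs.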

Job age priority will also be updated: a job will need to wait in the queue for only 7 days, instead of 14 days, to receive the maximum priority boost. Alongside this change, the QoS walltime limits will be as follows:

  • short QoS walltime remains unchanged, allowing a job to run for up to 1 hour.
  • normal QoS walltime will be reduced from 24 hours to 8 hours.
  • long QoS walltime will be reduced from 7 days to 3 days.

This change also aims to reduce overall queue time for users with fewer jobs and to reduce resource wastage when jobs sit idle overnight.
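As a quick way to check which QoS a job fits into under the new limits, here is a small sketch. The limits are the ones listed above; the function names pick_qos and segments_needed are hypothetical helpers for illustration and are not tools installed on the cluster.

```python
from datetime import timedelta

# New limits from this announcement, smallest first.
QOS_LIMITS = [
    ("short", timedelta(hours=1)),
    ("normal", timedelta(hours=8)),
    ("long", timedelta(days=3)),
]


def pick_qos(walltime):
    """Return the name of the smallest QoS whose limit covers `walltime`."""
    if walltime <= timedelta(0):
        raise ValueError("walltime must be positive")
    for name, limit in QOS_LIMITS:
        if walltime <= limit:
            return name
    raise ValueError(
        "requested walltime exceeds the 3-day maximum; "
        "split the run into checkpointed segments"
    )


def segments_needed(total_runtime, limit=timedelta(days=3)):
    """Number of back-to-back jobs needed to cover `total_runtime`
    when each job may run for at most `limit` (ceiling division)."""
    return -(-total_runtime // limit)
```

For example, a 2-hour job maps to the normal QoS, a 24-hour job now needs the long QoS, and a computation that used to take the full 7 days would need three checkpointed segments of up to 3 days each.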

Cluster Stability and Performance

We understand that cluster stability and performance are crucial for researchers to carry out their work. We would therefore like to remind you not to run any CPU-intensive or memory-intensive processes on the login node. We do not tolerate anyone running servers on our login node or compute nodes.

Also, if you are running jobs on the compute nodes, please monitor them after submission to make sure all processes spawn correctly. Make sure your jobs request appropriate resources so that none are wasted.

We are aware that some users have tried to launch their own servers or run code directly on the login node and compute nodes. If we find this happening again, we will take appropriate action. Please be considerate of one another on the cluster.

Cluster Security

To strengthen HPC cluster security, we are in the process of implementing a firewall policy on all cluster nodes. Normal job submission and execution should remain unaffected.

HPC Test

We have increased the passing score from 60% to 80% after some observation and consideration. The delay between retakes has also been reduced from 7 days to 3 days. If you have not yet passed the test, good luck.

Training Slot

The training slots for HPC are now available for June and July. If you need training, please go ahead and book a slot.

We no longer provide training for basic Linux usage on HPC, and we expect users to have at least some basic knowledge of Linux before attempting to use HPC.

HPC Application

We have recently launched RStudio Server in the Open OnDemand portal, which is now more streamlined and stable.

Meanwhile, we are working closely with the MATLAB team to bring MATLAB to the HPC cluster. We apologise for the long delay in the MATLAB deployment, as there are still many things to be tested for stability and performance.

Scratch Cleanup

Scratch cleanup will be re-enabled starting in June. All files that have not been accessed for more than 90 days will be deleted automatically.
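To see in advance which of your files would fall under the 90-day rule, a sketch like the following may help. It assumes that "accessed" refers to the file's last access time (atime) as reported by the file system. The actual cleanup mechanism is not described here and may differ, and some file systems update atime lazily. The function name stale_files is hypothetical.

```python
import os
import time

SECONDS_PER_DAY = 86400


def stale_files(root, max_age_days=90, now=None):
    """Yield (path, days_since_access) for every file under `root`
    whose last access time is more than `max_age_days` days ago."""
    now = time.time() if now is None else now
    cutoff = now - max_age_days * SECONDS_PER_DAY
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            path = os.path.join(dirpath, name)
            try:
                # lstat does not follow symlinks and does not change atime.
                atime = os.lstat(path).st_atime
            except OSError:
                continue  # file disappeared or is not readable
            if atime < cutoff:
                yield path, (now - atime) / SECONDS_PER_DAY
```

Looping over stale_files() with the path of your own scratch directory gives a list of candidates to copy elsewhere before the cleanup runs. The function only reads metadata and never deletes anything.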

If you have any questions or concerns, please let us know.

Thank you.
