Greetings HPC Users,

It has been a while since our last news and updates email for the HPC cluster. Here is a brief summary of what has happened over the past few months, and what to expect in the near future.

Maximum Allowed Wall Time Per Job

We made a change on 25 August 2022 to reduce the maximum wall time to 7 days per job. After 2 months of monitoring, we can confirm that there is currently no need to extend the limit beyond 7 days. All jobs submitted in the past 2 months either completed within 7 days or could be resumed after being terminated at the 7-day limit. We will continue to observe and make adjustments if necessary.
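
For reference, the wall time a job requests must now fit within this 7-day ceiling. Below is a minimal batch script sketch assuming a Slurm scheduler; the program name and its checkpoint flag are placeholders for your own workload:

    #!/bin/bash
    # Wall time is requested as D-HH:MM:SS and must fit within the
    # 7-day cap.
    #SBATCH --time=7-00:00:00
    #SBATCH --ntasks=1
    #SBATCH --cpus-per-task=8
    #SBATCH --mem=16G

    # Long-running jobs should write periodic checkpoints so they can
    # be resumed in a follow-up job if terminated at the 7-day limit.
    # "my_program" and its --checkpoint-dir flag are placeholders.
    srun ./my_program --checkpoint-dir "$HOME/checkpoints"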

Improper Use of Amber 20 Executables

Meanwhile, we have noticed quite a lot of job configuration issues among Amber 20 users. Many users (both new and existing) were not running the correct Amber binary for the resources they requested. We have therefore updated the Amber documentation page at https://confluence.dicc.um.edu.my/display/HPCDOCS/Amber. It is your responsibility to understand the Amber binaries and their usage before submitting jobs to the HPC cluster; running the wrong Amber binary leads to resource wastage.
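
As an illustration, Amber ships several executables (serial, MPI and GPU builds), and the binary must match the resources requested. The following is a minimal sketch assuming a Slurm scheduler; the amber/20 module name is an assumption, so check the documentation page above for the actual setup:

    #!/bin/bash
    # One GPU plus a single CPU core: a sensible shape for pmemd.cuda.
    #SBATCH --gres=gpu:1
    #SBATCH --ntasks=1
    #SBATCH --cpus-per-task=1
    #SBATCH --time=2-00:00:00

    # "amber/20" is an assumed module name; check the documentation
    # page above for the actual environment setup on the cluster.
    module load amber/20

    # pmemd.cuda runs the simulation on the GPU; launching sander or
    # plain pmemd here would leave the allocated GPU idle.
    pmemd.cuda -O -i md.in -p prmtop -c inpcrd -o md.out -r restrt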

Odd Job Configurations and Poor Resource Utilisation

We have also noticed quite a lot of odd job configurations, including but not limited to the following (a corrected example follows the list):

  1. allocating 1 CPU and a small amount of memory for long-running jobs.
  2. allocating an odd, non-standard number of CPU cores (e.g. 5, 31 or 43).
  3. allocating all CPU cores, but only utilising 1 CPU during job execution.
  4. allocating 32 of the 48 CPUs on an EPYC node instead of all 48.
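
In contrast, a well-formed request sizes the allocation to what the program will actually use. Here is a minimal sketch of point 4, assuming a Slurm scheduler and an MPI workload (the executable name is a placeholder):

    #!/bin/bash
    # Request the whole 48-core EPYC node rather than an odd subset.
    #SBATCH --nodes=1
    #SBATCH --ntasks=48
    #SBATCH --time=1-00:00:00

    # One MPI rank per allocated core, so every CPU is actually used.
    # "./my_mpi_program" is a placeholder for your own executable.
    srun ./my_mpi_program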

We have terminated many of these jobs without prior notice, as they were not running with the correct parameters, input or configuration, and had to be stopped to prevent resource wastage. We keep internal records of these terminations and will consider further action if the issue persists. If you are one of the users whose jobs were terminated without any email or notification, please double-check and make sure you understand your job before re-submitting it.

Meanwhile, if you are running only one job with 1 CPU and 2GB of memory on the HPC cluster, we would suggest running that calculation on your own computer instead, since a typical desktop CPU core is faster than a single core in the HPC cluster. There is no point in bringing jobs that require little to no resources to the HPC cluster.

Kindly monitor your job efficiency closely in the Open OnDemand portal at https://umhpc.dicc.um.edu.my. We will not hesitate to terminate jobs that are found to be wasting resources.
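
If the seff utility is available on the login node (it ships with Slurm's contributed tools, so availability depends on the installation), you can also check the efficiency of a completed job from the command line:

    # Summarise CPU and memory efficiency of a completed job;
    # 12345 is a placeholder job ID.
    seff 12345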

Single Sign On (SSO) Service

We have been working on bringing up the DICC SSO service so that users can manage their DICC account and password more easily. With the upcoming release of the public DICC SSO service, users will be able to change their password directly on the SSO website rather than having to SSH into the HPC Login Node and run the passwd command. This also means that expired password resets and 2FA can be supported and self-serviced by users. There will be another announcement once the SSO service is ready.

If you have any questions, please let us know through the service desk. Also, don’t forget to join the DICC Official Telegram Channel to receive the latest news and updates on the HPC cluster.

Thank you.

Categories: HPCNews