Dear HPC Users,

One of the CPU compute nodes in the HPC pool, cpu06, crashed due to high CPU load caused by processes that were stuck on the machine indefinitely. As a result, all jobs running on cpu06 failed, because the worker daemon could not communicate with the scheduler under the high CPU load.

The machine was physically rebooted this morning at around 7:45 am. If you had jobs running on the affected machine, please verify them and resubmit if necessary. We are still investigating the root cause of the incident and will take appropriate action to prevent it from happening again.

If you have any issues or questions, please do not hesitate to contact us through the service desk.

Thank you.

Categories: Incident