Greetings, HPC users,

It has been a while since our last news and updates on the HPC. A lot has been going on at DICC, so in today’s write-up we describe some of the current issues we are tackling and the plans we have made to address them.

HPC Training

Current Issue

  • Due to the limited number of staff in our centre, we are aware that the HPC training sessions available throughout the year cannot cover all of our users.
  • There is currently no dedicated way to request HPC training other than through collaboration projects or major HPC workshops.
  • We also feel that having a large number of participants in a single training session may not produce the results we expect.
  • Some users have started to show that they are unable to utilise HPC resources properly, as their last HPC training session may have been years ago.

Current Plan

  • We are going to increase the number of training sessions available to users, possibly up to 4 per month compared to once per quarter previously. For now, we plan to start with 2 sessions per month, and more slots may open in the future.
  • We are also introducing a new way for users who require HPC training to request a session, via a booking system. The booking system can be accessed at this link.
  • Each training session will be limited to 5 participants.

HPC Tests

Current Issue

  • Some users have started to show that they are unable to utilise HPC resources properly, as their last HPC training session may have been years ago.
  • Our marking scheme for the previous HPC test was lenient, as we felt that users should be given a chance to learn and improve their skills while running their calculations on the HPC. We have noticed that this does not work in practice, since quite a number of HPC users use the resources irresponsibly, which indirectly causes resource wastage and service disruption.
  • Our current HPC test does not require users to attend HPC training, so they can attempt the test blindly and, if they fail, simply retake it one week later.

Current Plan

  • We will be introducing three new sets of HPC tests to replace the previous test. These aim to identify each user’s weak spots in HPC usage and encourage them to attend the corresponding HPC training afterwards.
  • All current users will be given a deadline to complete the new HPC tests; accounts that have not completed them by then will be reverted to the ‘limited’ state.

Application Benchmark

Current Issue

  • Due to service and configuration improvements made over the years, many of the previously recommended resource configurations for jobs and calculations are no longer relevant.
  • We have also noticed that users run and submit their calculations in different ways, leading to differences in performance and utilisation.

Current Plan

  • We are going to run multiple benchmarks on different applications to measure their performance under different resource configurations. This should help reduce the overall time taken to obtain results.
  • If followed properly, the benchmark results should also improve the overall throughput of the HPC cluster, so that more jobs can be completed in the same amount of time.
  • The list of benchmark results can be found here. More benchmark results will be added over time.

System Upgrade

Current Issue

  • Many of the HPC components are outdated and will reach end-of-life soon, which poses security risks to users as well as DICC administrators.
  • Some of the software used on the HPC is a few versions behind, and newer versions may bring performance improvements or bug fixes.

Current Plan

  • We plan to perform system upgrades and modernisation in stages, ensuring that all critical components are kept up to date to avoid system disruption.
  • Configuration updates will be performed from time to time with minimal downtime. A public announcement will be made to all active HPC users if downtime is required.
  • Users might experience occasional service disruptions, but these should have no impact on jobs running in the HPC cluster.

Service Guidelines

Current Issue

  • We feel that the lack of properly defined service guidelines for the different aspects of DICC services may lead to complications as the user base expands.

Current Plan

  • We are going to introduce guidelines for the different components of DICC services so that the scope of each service is clearly defined, avoiding confusion.
  • All users are advised to read through the guidelines to avoid carrying out prohibited actions while using the HPC. The guidelines can be found on the DICC website under the guideline menu.

Thank you for taking the time to read this write-up, and for your support throughout the journey.

Categories: HPCNews