To ensure job scheduling efficiency in the HPC cluster, our cluster scheduler uses the FairShare algorithm. In our scheduler context, resource usage accounting is referred to as billing. Over time, we have found that the current way of billing resources can no longer properly cover all types of job usage, especially memory-intensive jobs and GPU jobs. A new billing system is therefore necessary to ensure all users are billed properly for the resources they request and are allocated.

There are also changes to the priority system to accommodate the new billing system. We are also aiming to reduce the overall queue time of jobs by adjusting the weightage of each priority factor from time to time, based on our observations of the queue. The details of the changes are outlined in the following sections.

FairShare System

Before we continue further, it is important to understand how the FairShare system works in our cluster. There are three important values in the FairShare system:

  • RawShare: Every user in the system is allocated the same RawShare value; all users are equal in terms of their share of the resources.
  • RawUsage: This value is the total amount of resources a user has allocated in the past. It is the sum of all job usage, including failed, requeued, and completed jobs.
  • FairShare: This value is calculated by the scheduler from the RawShare and RawUsage values and is used to determine whether a user is underserved or overserved.

Every five minutes, the scheduler recalculates the FairShare value for every user, and the updated values are used to determine which jobs start next.
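
As a rough illustration, the sketch below shows how a FairShare factor could be derived from RawShare and RawUsage. It assumes the classic exponential fair-share formula used by schedulers such as Slurm; the exact formula in our scheduler may differ, and the user names and numbers are made up.

```python
# Hedged sketch: one possible way a FairShare factor is derived from
# RawShare and RawUsage. Assumes the classic 2^(-U/S) form used by
# Slurm-like schedulers; the real formula in our scheduler may differ.

def fairshare_factor(raw_usage, total_usage, raw_share, total_shares):
    """Return a value in (0, 1]; higher means more underserved."""
    norm_share = raw_share / total_shares                      # fraction of shares owned
    norm_usage = raw_usage / total_usage if total_usage else 0.0
    return 2 ** (-norm_usage / norm_share)

# Example: three users with equal RawShare but very different RawUsage.
users = {"alice": 10_000, "bob": 500_000, "carol": 0}          # RawUsage (made up)
total = sum(users.values())
for name, usage in users.items():
    print(name, round(fairshare_factor(usage, total, 1, len(users)), 3))
```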

New Billing System

Previously, users were only billed for the amount of resources they had been allocated, which seemed reasonable at first glance. Since there were not many memory-intensive jobs under the previous configuration, and memory was usually not a significant resource to worry about, memory usage was not billed. Users were also not penalised for blocking jobs by allocating the majority of the resources on a node. However, a reasonable number of memory-intensive jobs requesting only a few CPU cores have been submitted to our cluster recently. Many of these jobs tend to block the nodes from running any other jobs, as there is usually not enough unallocated memory remaining on the node.

In the new system, a new billing value is defined for every type of resource, including CPU, memory, and GPU. Each core allocated to a non-multithreaded job is treated as 2 CPUs, and multiple multithreaded jobs will not be placed on the same core. Each resource type is assigned a new billing value derived from the cost of the nodes at acquisition, with the billing amount for each resource type calculated as a ratio proportional to that node cost. A job is then billed according to the resource type with the highest billing amount among the resources it has allocated. The following table lists the billing values per configured partition; a short code sketch of this rule follows the worked examples below:

Partition   | Billing per CPU | Billing per GB Memory | Billing per GPU | Max per Node | Estimated Cost per Hour (RM)
cpu-opteron | 468.75          | 125                   | 0               | 30000        | 3.42
cpu-epyc    | 375             | 150                   | 0               | 36000        | 4.11
gpu-k10     | 656.25          | 375                   | 2625            | 21000        | 2.40
gpu-titan   | 750             | 200                   | 12000           | 24000        | 2.74
gpu-k40c    | 700             | 400                   | 11200           | 22400        | 2.56
gpu-v100s   | 1437.5          | 500                   | 46000           | 92000        | 10.50

Let's take a look at a few examples:

  • A non-multithreaded job was submitted to cpu-epyc with 48 Cores and 32GB Memory.
    • The billing value for each resource type can be broken down as follows:
      • CPU = 96 (2 CPUs per Core, 48 Cores) * 375 (Billing Value per CPU) = 36000 RawUsage per minute
      • Memory = 32 (32 GB Memory) * 150 (Billing Value per GB Memory) = 4800 RawUsage per minute
    • The job will be charged 36000 RawUsage per minute, as the CPU allocation produces the highest billing amount for this job.
  • A multithreaded job was submitted to cpu-epyc with 24 CPUs and 240GB Memory.
    • The billing value for each resource type can be broken down as follows:
      • CPU = 24 (24 CPUs) * 375 (Billing Value per CPU) = 9000 RawUsage per minute
      • Memory = 240 (240 GB Memory) * 150 (Billing Value per GB Memory) = 36000 RawUsage per minute
    • The job will be charged 36000 RawUsage per minute, as the Memory allocation produces the highest billing amount for this job.
  • A non-multithreaded job was submitted to gpu-v100s with 2 Cores, 64GB Memory, and 2 V100S GPUs.
    • The billing value for each resource type can be broken down as follows:
      • CPU = 4 (2 CPUs per Core, 2 Cores) * 1437.5 (Billing Value per CPU) = 5750 RawUsage per minute
      • Memory = 64 (64 GB Memory) * 500 (Billing Value per GB Memory) = 32000 RawUsage per minute
      • GPU = 2 (2 GPUs) * 46000 (Billing Value per GPU) = 92000 RawUsage per minute
    • The job will be charged 92000 RawUsage per minute, as the GPU allocation produces the highest billing amount for this job.

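To make the rule concrete, here is a minimal sketch of the billing calculation. The billing values come from the partition table above; the function name, its signature, and the surrounding structure are our own illustration, not the scheduler's actual code.

```python
# Hedged sketch of the per-minute billing rule: a job is charged for the
# resource type with the highest billing amount. Billing values are taken
# from the partition table above; everything else is illustrative only.

BILLING = {  # partition: (per CPU, per GB memory, per GPU)
    "cpu-opteron": (468.75, 125, 0),
    "cpu-epyc":    (375.0,  150, 0),
    "gpu-k10":     (656.25, 375, 2625),
    "gpu-titan":   (750.0,  200, 12000),
    "gpu-k40c":    (700.0,  400, 11200),
    "gpu-v100s":   (1437.5, 500, 46000),
}

def raw_usage_per_minute(partition, cores, mem_gb, gpus=0, multithreaded=True):
    per_cpu, per_gb, per_gpu = BILLING[partition]
    # A core allocated to a non-multithreaded job counts as 2 CPUs.
    cpus = cores if multithreaded else cores * 2
    amounts = {
        "CPU": cpus * per_cpu,
        "Memory": mem_gb * per_gb,
        "GPU": gpus * per_gpu,
    }
    return max(amounts.items(), key=lambda kv: kv[1])

# The three worked examples above:
print(raw_usage_per_minute("cpu-epyc", 48, 32, multithreaded=False))           # ('CPU', 36000.0)
print(raw_usage_per_minute("cpu-epyc", 24, 240))                               # ('Memory', 36000.0)
print(raw_usage_per_minute("gpu-v100s", 2, 64, gpus=2, multithreaded=False))   # ('GPU', 92000.0)
```
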
With the new billing system, we can ensure that all resources are treated as equally important, and users who do not plan their resource requests properly are penalised in terms of FairShare. The new billing system also lets us prepare reports that give users an idea of how much money they would have spent on the cluster if we were to charge them in real money. Converting the billing value per hour of a fully allocated gpu-v100s node into real money (not including power consumption) gives 92000 / 365 days / 24 hours = ~RM10.50 per hour. This is still much cheaper than most cloud services, as we do not include actual operating expenses such as power consumption and staff time. An instance with one V100S GPU on Amazon Elastic Compute Cloud (EC2) costs about $3.06 per hour, which is equal to about RM12.00 per hour.
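
For the kind of cost report mentioned above, the conversion is a single division. The snippet below simply mirrors the gpu-v100s arithmetic from the text; the function name is our own.

```python
# Hedged sketch: converting a per-hour billing value into an estimated cost
# in RM, mirroring the fully allocated gpu-v100s example in the text.
HOURS_PER_YEAR = 365 * 24

def estimated_rm_per_hour(billing_value_per_hour):
    return billing_value_per_hour / HOURS_PER_YEAR

print(round(estimated_rm_per_hour(92000), 2))  # 10.5
```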

Resource Limit

We have also adjusted the maximum amount of resources a user can allocate at one time. Previously, we set a limit of 450 CPUs and 2TB of memory for all users who passed the tests, which is about 50% of all the CPUs across the CPU and GPU nodes. We have now removed the resource limit for all users who passed the tests and kept a limit only for users with a limited account. The decision to remove the resource limit was made to allow any user to fully occupy the cluster when no one else is submitting jobs, while FairShare will ensure that other jobs do not queue for too long.

The new billing limit for users with a limited account is 12500 (about 2% of the cluster). Limitations on the type of nodes that can be accessed are unchanged. Within this limit, a user with a limited account can still have multiple jobs running at once by planning around the billing limit. The 12500 billing limit is approximately equal to running one job with a single K40c GPU and another job with 2 CPUs on an Opteron node at the same time.
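
As a quick check, the arithmetic below (values from the partition table) confirms that this combination of jobs fits under the limit; it is an illustration only, not the scheduler's enforcement logic.

```python
# Hedged sketch: does this pair of jobs fit under the limited-account
# billing limit? Billing values come from the partition table above.
LIMIT = 12500
k40c_gpu_job = 1 * 11200        # one K40c GPU
opteron_cpu_job = 2 * 468.75    # 2 CPUs on cpu-opteron
total = k40c_gpu_job + opteron_cpu_job
print(total, total <= LIMIT)    # 12137.5 True
```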

Priority System

To adapt to the billing system change, the priority system has also been updated to ensure low queue times and fairness of the queue. Most of the priority weightages were updated to new values based on the observations and data we collected over the past few months. The following table lists the previous and new priority weightage values:

Priority Factor | Old Weightage | New Weightage
Age             | 30000         | 4000
Fairshare       | 50000         | 50000
QoS             | 10000         | 2000

Other than the weightage updates, there are also some changes to how certain priority factors work (a sketch of how these factors combine follows the list):

  • The short QoS priority boost was reduced from 10000 to 2000.
  • The maximum time to reach the maximum Age priority value (4000) is reduced from 30 days to 14 days. Most jobs with a low FairShare priority should be able to run after queuing for about 5-7 days, as most average users have a FairShare priority value of about 2000-3000.
  • The amount of past job usage history that contributes to the FairShare priority value is increased from 14 days to 90 days, to account for the removal of resource limits in the cluster. Users will now be penalised more heavily if resources are not properly allocated or planned.
  • The limited QoS now has a UsageFactor of 10, meaning any resource usage is multiplied by 10 for billing. This is intended to ensure that limited user accounts do not get a higher priority than those who have passed the tests, while also encouraging all HPC users to attempt and pass the tests. Kindly refer to the Confluence page for more details on the list of available QoS.
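
The sketch below shows one way the weighted factors above could combine into a single job priority. It assumes a Slurm-style multifactor sum of weight times normalised factor, with made-up factor values; it illustrates the effect of the new weights rather than the scheduler's exact formula.

```python
# Hedged sketch: Slurm-style multifactor priority as a weighted sum of
# normalised factors (each factor in [0, 1]). Weights are the new values
# from the table above; the example factor values are made up.

WEIGHTS = {"age": 4000, "fairshare": 50000, "qos": 2000}

def job_priority(age_days, fairshare_factor, qos_factor=0.0):
    # The Age factor saturates at 1.0 once a job has queued for 14 days.
    age_factor = min(age_days / 14, 1.0)
    return (WEIGHTS["age"] * age_factor
            + WEIGHTS["fairshare"] * fairshare_factor
            + WEIGHTS["qos"] * qos_factor)

# A heavy user's job (low FairShare) queued for a week vs. a light user's new job.
print(job_priority(age_days=7, fairshare_factor=0.02))   # 3000.0
print(job_priority(age_days=0, fairshare_factor=0.05))   # 2500.0
```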

Removal of Titan X GPU

Based on past performance benchmarks of some applications, we have decided to remove the Titan X GPU from gpu02. Some users allocate all the GPUs on gpu02 in a single job, which often led to performance degradation, as the Titan X is significantly less performant than the Titan Xp in most of the scenarios we have tested. Removing the Titan X GPU from the node should reduce the performance impact when running GPU jobs with all the GPUs allocated.

Affected Audience

The following user groups are those we think will be affected by the changes mentioned above:

  • Heavy users who submit a lot of jobs will now be able to fully utilise all the resources if no one else is submitting jobs; this was previously capped by the CPU and memory limits. However, their FairShare priority value will also drop much lower (it was already very low previously) and recover more slowly due to the change in how long past jobs contribute.
  • Non-heavy users who submit some jobs occasionally will not be affected much by the changes.
  • Users who submit a lot of GPU jobs will now have a much lower FairShare priority value, as GPUs are now billed based on node value.
  • Users who submit memory-intensive jobs will now have a much lower FairShare priority value, as memory is now billed, also based on node value.
  • Users who submit a lot of jobs with the short QoS will not always be able to start immediately, given the priority boost change made to the short QoS.
  • Users with limited accounts will now have a harder time starting their jobs if the queue is full.

TL;DR

These changes are slightly more complicated than our previous ones, so please let us know if you need clarification. In short, the changes can be summarised as follows:

  • New resource accounting was introduced to ensure all jobs are billed properly.
  • The resource limit per non-limited user account was removed to increase overall resource utilisation, but remains for users with a limited account.
  • Priority system was updated to improve overall job queue time.
  • Titan X GPU was removed from the cluster to reduce performance degradation.

If you have any concerns, please let us know.
