To keep our cluster resources efficiently used and the experience smooth for all researchers, this is a reminder of our policies on resource requests, usage monitoring, and Slurm Fairshare.

📉 Active Monitoring & Under-Utilization

While users are free to request the specific resources required for their research (for example, requesting 4+ GPUs for high-parallelism workloads), we actively monitor all GPU, CPU, and memory usage to prevent waste.

  • The 10% Rule: To maintain fair access for those in the queue, any job found to be using less than 10% of its allocated resources for 1/3 of its walltime may be canceled by the administration without prior notice.
  • Why? This action frees up idle resources for other users who are waiting to run their code.
  • Track Your Usage: You can monitor your resource utilization in real time by opening DICC OnDemand > Active Jobs and viewing the Utilization graph for each job.
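
As a rough self-check, the 10% Rule can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the administration's actual monitoring logic: it assumes utilization is sampled at evenly spaced intervals as a fraction between 0 and 1, and it reads "1/3 of its walltime" as cumulative time rather than one continuous stretch. The function names are hypothetical.

```python
from fractions import Fraction
from typing import Sequence

LOW_UTIL_THRESHOLD = 0.10                 # "less than 10% of its allocated resources"
LOW_UTIL_TIME_FRACTION = Fraction(1, 3)   # "for 1/3 of its walltime"


def low_utilization_fraction(samples: Sequence[float]) -> Fraction:
    """Share of evenly spaced utilization samples that fall below the threshold.

    Assumption: each sample is utilization / allocation, in the range 0..1.
    """
    if not samples:
        raise ValueError("at least one utilization sample is required")
    low = sum(1 for s in samples if s < LOW_UTIL_THRESHOLD)
    return Fraction(low, len(samples))


def at_risk_of_cancellation(samples: Sequence[float]) -> bool:
    """True if the job was under-utilized for at least one third of the samples."""
    return low_utilization_fraction(samples) >= LOW_UTIL_TIME_FRACTION


if __name__ == "__main__":
    # Hypothetical GPU utilization readings taken at equal intervals.
    gpu_util = [0.05, 0.02, 0.80, 0.90, 0.85, 0.95]
    print(at_risk_of_cancellation(gpu_util))
```

In this hypothetical trace, two of six samples are below 10%, which is exactly one third of the run, so the job would be flagged under the interpretation assumed here.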

⚖️ Slurm Fairshare: How Priority is Calculated

We continue to implement a Fairshare policy via Slurm to ensure equitable access across the cluster. The system is designed so that a single user submitting many jobs cannot block others.

  • Usage vs. Score: The more resources you consume, the lower your Fairshare score becomes.
  • Resource “Cost”: Jobs that use “expensive” resources such as GPUs reduce your score faster than CPU-only jobs.
  • Priority: A lower Fairshare score gives your newly submitted jobs a lower priority. As your usage tapers off, your score will gradually recover.
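
The three points above can be illustrated with a small Python sketch. It is not DICC's configuration. It uses Slurm's classic fairshare formula, F = 2^(-usage/shares), and the cluster may well use a different algorithm such as Fair Tree. The GPU weight and the half-life in the example are made-up values, standing in for what Slurm configures through settings such as TRESBillingWeights and PriorityDecayHalfLife.

```python
def billed_usage(cpu_hours: float, gpu_hours: float,
                 cpu_weight: float = 1.0, gpu_weight: float = 8.0) -> float:
    """Weighted usage: GPU hours "cost" more than CPU hours.

    The default weights are hypothetical, chosen only for illustration.
    """
    return cpu_hours * cpu_weight + gpu_hours * gpu_weight


def decayed_usage(usage: float, elapsed_days: float, half_life_days: float) -> float:
    """Past usage counts for less over time: it halves every half_life_days."""
    return usage * 0.5 ** (elapsed_days / half_life_days)


def fairshare_factor(normalized_usage: float, normalized_shares: float) -> float:
    """Classic Slurm fairshare factor, between 0 and 1 (higher is better).

    normalized_usage: this user's portion of all decayed cluster usage (0..1).
    normalized_shares: the portion of the cluster this user is entitled to (0..1).
    """
    if normalized_shares <= 0:
        raise ValueError("normalized_shares must be positive")
    return 2.0 ** (-normalized_usage / normalized_shares)


if __name__ == "__main__":
    # A user entitled to 10% of the cluster, at three levels of recent usage.
    for used in (0.0, 0.1, 0.2):
        print(used, fairshare_factor(used, 0.1))
    # Ten GPU hours versus ten CPU hours under the hypothetical weights.
    print(billed_usage(0, 10), billed_usage(10, 0))
    # The same usage one hypothetical 7-day half-life later.
    print(decayed_usage(80.0, elapsed_days=7, half_life_days=7))
```

Under this formula, a user who has consumed exactly their share gets a factor of 0.5, an idle user gets 1.0, and heavier usage pushes the factor toward 0. The decay step is what lets the score recover once usage tapers off.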

🚀 Optimization is Key

We encourage all users to optimize their jobs and request only the resources those jobs can effectively use. Efficient requests lead to:

  1. Faster queue times for everyone.
  2. Better overall performance of the cluster.
  3. Protection of your Fairshare score for future jobs.

Thank you for your cooperation in keeping the DICC HPC environment efficient and fair for the entire research community.
