ML-Based Duration Scheduler Cuts HPC Job Wait Times

Duration-Informed Workload Scheduler

Source: arXiv:2604.09599v1 (announce type: cross)

Abstract: High-performance computing systems are complex machines whose behaviour is governed by the correct functioning of their many subsystems. Among these, the workload scheduler has a crucial impact on the timely execution of the jobs continuously submitted to the computing resources. Making high-quality scheduling decisions is contingent on knowing the duration of submitted jobs before their execution, a non-trivial task for users that can be tackled with Machine Learning.

In this work, we devise a workload scheduler enhanced with a duration prediction module built via Machine Learning. We evaluate its effectiveness and show its performance using workload traces from a Tier-0 supercomputer, demonstrating a decrease in mean waiting time across all jobs of around 11%. Lower waiting times are directly connected to better quality of service from the users’ point of view and higher turnaround from the system’s perspective.

Introduction

High-performance computing (HPC) underpins computational work across many domains, including scientific research, data analytics, and machine learning. The efficiency of these systems depends heavily on their workload schedulers. A scheduler has to plan around job durations that are not known before execution and that users find hard to estimate, which leads to inefficient decisions and longer waiting times.

Machine Learning in Workload Scheduling

Integrating Machine Learning (ML) into workload scheduling addresses this gap. A model trained on historical job execution data can predict the duration of a newly submitted job before it runs, sparing users a task they find non-trivial. With these predictions, the scheduler can make better-informed decisions about the order in which jobs run and reduce the time users wait.

  • Enhanced Prediction: The duration prediction module is built with Machine Learning from past workload traces, capturing patterns that inform duration estimates for future jobs.
  • Duration-Informed Decisions: The scheduler takes the predicted durations into account when deciding which jobs to run, which increases turnaround from the system’s perspective.
  • User Satisfaction: Lower waiting times are directly connected to better quality of service from the users’ point of view.
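The two pieces described above, a predictor built from historical traces and a scheduler that consults it, can be illustrated with a minimal Python sketch. This is not the model or the policy from the paper, whose details are not given here. It assumes a deliberately simple baseline (the mean of each user's past runtimes) and a shortest-predicted-first ordering; the names fit_user_mean_predictor and order_queue, the job dictionary layout, and the 3600-second default are all hypothetical.

```python
from collections import defaultdict
from statistics import mean


def fit_user_mean_predictor(history, default=3600.0):
    """Build a predictor from (user, runtime_seconds) pairs.

    Assumption: a per-user mean stands in for the ML model here;
    the paper's actual predictor is not described in this article.
    """
    runtimes_by_user = defaultdict(list)
    for user, runtime in history:
        runtimes_by_user[user].append(runtime)
    user_means = {u: mean(r) for u, r in runtimes_by_user.items()}

    def predict(user):
        # Fall back to a fixed default for users with no history.
        return user_means.get(user, default)

    return predict


def order_queue(queue, predict):
    """Return pending jobs sorted by predicted duration, shortest first.

    sorted() is stable, so jobs with equal predictions keep
    their original submission order.
    """
    return sorted(queue, key=lambda job: predict(job["user"]))


if __name__ == "__main__":
    # Toy history, invented for illustration only.
    history = [("alice", 100.0), ("alice", 300.0), ("bob", 5000.0)]
    predict = fit_user_mean_predictor(history)
    pending = [{"id": 1, "user": "bob"}, {"id": 2, "user": "alice"}]
    for job in order_queue(pending, predict):
        print(job["id"], predict(job["user"]))
```

A production scheduler would also have to account for resource requirements, fairness, and mispredictions, all of which this sketch ignores.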

Evaluation and Results

The proposed scheduler was evaluated using workload traces from a Tier-0 supercomputer. The reported outcome:

  • Mean Waiting Time Reduction: The new scheduler decreased mean waiting time across all jobs by around 11%.
  • Higher Turnaround: From the system’s perspective, lower waiting times are directly connected to higher turnaround.
  • Quality of Service Improvement: From the users’ point of view, lower waiting times are directly connected to better quality of service.
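To make the headline metric concrete, the sketch below computes mean waiting time for a toy queue and the relative reduction between two orderings. It rests on strong simplifying assumptions that do not hold on a real supercomputer: a single resource, all jobs submitted at time zero, and durations known exactly. The durations are invented for illustration and are not taken from the paper's traces; the function names are hypothetical.

```python
def mean_waiting_time(durations):
    """Mean wait when jobs run back-to-back, in the given order,
    on one resource, with every job submitted at time zero."""
    total_wait = 0.0
    clock = 0.0
    for duration in durations:
        total_wait += clock  # this job waited until the clock reached it
        clock += duration    # the resource is busy for the job's duration
    return total_wait / len(durations)


def relative_reduction(baseline, improved):
    """Fractional decrease of `improved` relative to `baseline`."""
    return (baseline - improved) / baseline


if __name__ == "__main__":
    # Toy durations in seconds, in submission order (illustrative only).
    submitted = [300.0, 60.0, 120.0]
    fcfs = mean_waiting_time(submitted)               # waits 0, 300, 360
    shortest_first = mean_waiting_time(sorted(submitted))  # waits 0, 60, 180
    print(f"FCFS mean wait:           {fcfs:.1f} s")
    print(f"Shortest-first mean wait: {shortest_first:.1f} s")
    print(f"Relative reduction:       {relative_reduction(fcfs, shortest_first):.1%}")
```

The large reduction in this toy case comes from the idealized setup. The roughly 11% figure above is measured on real traces, where predictions are imperfect and jobs arrive continuously.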

Conclusion

The duration-informed workload scheduler shows that Machine Learning predictions of job duration can be put to practical use in high-performance computing: on traces from a Tier-0 supercomputer, mean waiting time fell by around 11%. Shorter waits mean better quality of service for users and higher turnaround for the system. As HPC systems continue to evolve, intelligent scheduling of this kind will be important in meeting the growing demands of computational workloads.


Author: Lazarus Omolua (https://richlyai.com/blog)