ML-Based Duration Scheduler Cuts HPC Job Wait Times

Duration-Informed Workload Scheduler

Source: arXiv:2604.09599v1 (cross-listed)

Abstract: High-performance computing systems are complex machines whose behaviour is governed by the correct functioning of their many subsystems. Among these, the workload scheduler has a crucial impact on the timely execution of the jobs continuously submitted to the computing resources. Making high-quality scheduling decisions is contingent on knowing the duration of submitted jobs before their execution, a non-trivial task for users that can be tackled with Machine Learning.

In this work, we devise a workload scheduler enhanced with a duration prediction module built via Machine Learning. We evaluate its effectiveness and show its performance using workload traces from a Tier-0 supercomputer, demonstrating a decrease in mean waiting time across all jobs of around 11%. Lower waiting times are directly connected to better quality of service from the users’ point of view and higher turnaround from the system’s perspective.

Introduction

High-performance computing (HPC) underpins computational work across many domains, including scientific research, data analytics, and machine learning. The efficiency of these systems depends heavily on their workload schedulers. A scheduler can only plan well if it knows how long each job will run, yet that duration is hard for users to estimate before execution. Poor estimates lead to inefficient schedules and longer waiting times.

Machine Learning in Workload Scheduling

Integrating Machine Learning (ML) into workload scheduling targets the problem of unknown job durations. A model trained on historical job execution data can estimate how long a job will run before it starts. The scheduler can then use that estimate when deciding which job to run next, with the aim of reducing user wait times.
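To make the idea of predicting durations from historical data concrete, here is a minimal sketch of a predictor. It is a deliberately simple stand-in, not the model from the paper: it estimates a job's duration as the mean of the same user's most recent runtimes. The class name `RecentRuntimePredictor`, the window size `k`, and the one-hour fallback are all assumptions made for illustration.

```python
from collections import defaultdict, deque
from statistics import mean


class RecentRuntimePredictor:
    """Hypothetical baseline: predict from a user's last k observed runtimes."""

    def __init__(self, k=2, default_seconds=3600.0):
        self.default_seconds = default_seconds  # assumed fallback, not from the paper
        # One bounded history per user; old runtimes fall off automatically.
        self.history = defaultdict(lambda: deque(maxlen=k))

    def observe(self, user, runtime_seconds):
        """Record the actual runtime of a finished job."""
        self.history[user].append(runtime_seconds)

    def predict(self, user, requested_seconds=None):
        """Estimate the duration of a newly submitted job."""
        past = self.history[user]
        if not past:
            # No history yet: fall back to the user's request, if any.
            if requested_seconds is not None:
                return requested_seconds
            return self.default_seconds
        estimate = mean(past)
        # A job cannot run longer than the wall time it requested.
        if requested_seconds is not None:
            estimate = min(estimate, requested_seconds)
        return estimate


predictor = RecentRuntimePredictor(k=2)
for runtime in (100, 300, 500):
    predictor.observe("alice", runtime)
print(predictor.predict("alice"))  # mean of the last two runtimes: 400
```

A learned model would typically replace the `predict` body with features drawn from the job submission (user, requested resources, queue, and so on), but the interface of observing finished jobs and predicting for new ones can stay the same.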

Enhanced Prediction: The duration prediction module is built via Machine Learning on past workload traces, capturing patterns that inform duration estimates for newly submitted jobs.
Dynamic Adaptation: The scheduler takes the predicted durations into account as jobs arrive, which is intended to improve turnaround across the system.
User Satisfaction: Lower waiting times translate directly into better quality of service from the users' point of view.
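One simple way a scheduler can act on predicted durations is to start the waiting job with the shortest predicted runtime. The sketch below shows only that selection step. It is an illustrative policy, not the scheduler described in the paper, and the names `Job` and `pick_next` are hypothetical.

```python
from dataclasses import dataclass


@dataclass
class Job:
    job_id: str
    submit_time: float        # seconds since the start of the trace
    predicted_seconds: float  # output of some duration predictor


def pick_next(queue, now):
    """Return the already-submitted job with the shortest predicted duration.

    Ties are broken by submission time so earlier jobs go first.
    Returns None when no job has been submitted by `now`.
    """
    eligible = [job for job in queue if job.submit_time <= now]
    if not eligible:
        return None
    return min(eligible, key=lambda job: (job.predicted_seconds, job.submit_time))


queue = [
    Job("a", submit_time=0.0, predicted_seconds=3600.0),
    Job("b", submit_time=5.0, predicted_seconds=600.0),
    Job("c", submit_time=50.0, predicted_seconds=60.0),
]
print(pick_next(queue, now=10.0).job_id)  # "b": job "c" has not arrived yet
```

A pure shortest-first rule can starve long jobs, so a real scheduler would likely combine predictions with fairness and resource constraints. The sketch leaves those out to keep the role of the prediction visible.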

Evaluation and Results

To validate the effectiveness of the proposed scheduler, the authors evaluated it on workload traces from a Tier-0 supercomputer. The reported results:

Mean Waiting Time Reduction: The new scheduler achieved a decrease in mean waiting time of around 11% across all jobs.
Higher Turnaround: From the system's perspective, lower waiting times are connected to higher turnaround.
Quality of Service Improvement: From the users' point of view, lower waiting times are directly connected to better quality of service. The evaluation is trace-based, so this follows from the reduced delays rather than from user feedback.
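The headline metric is easy to reproduce on a toy example. The sketch below computes mean waiting time for jobs run one at a time on a single resource, then the relative reduction between two orderings. The job values are made up for illustration and the function names are hypothetical, so the resulting percentage says nothing about the paper's 11% figure.

```python
def mean_waiting_time(jobs):
    """Mean wait for (submit_time, runtime) pairs run in the given order.

    Assumes a single resource that runs one job at a time.
    """
    clock = 0.0
    waits = []
    for submit_time, runtime in jobs:
        start = max(clock, submit_time)   # cannot start before submission
        waits.append(start - submit_time)
        clock = start + runtime
    return sum(waits) / len(waits)


def relative_reduction(baseline, improved):
    """Fractional decrease of `improved` relative to `baseline`."""
    return (baseline - improved) / baseline


# Three hypothetical jobs, all submitted at time 0 (runtimes in seconds).
submission_order = [(0.0, 600.0), (0.0, 60.0), (0.0, 120.0)]
shortest_first = sorted(submission_order, key=lambda job: job[1])

baseline = mean_waiting_time(submission_order)  # 420.0 seconds
improved = mean_waiting_time(shortest_first)    # 80.0 seconds
print(f"{relative_reduction(baseline, improved):.1%}")
```

The gap here is exaggerated because the toy queue uses exact runtimes and a single resource. On real traces the gain depends on how accurate the predictions are and on how jobs of different sizes share the machine.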

Conclusion

A duration-informed workload scheduler shows how Machine Learning can improve scheduling in high-performance computing. By predicting job durations before execution, the scheduler reduced mean waiting time by around 11% on traces from a Tier-0 supercomputer, which benefits both users and system turnaround. As HPC systems continue to evolve, intelligent scheduling of this kind will be important for meeting the growing demands of computational workloads.

RichlyAI Blog AI Guide, Tutorials, Industrial Insights, & more!
