A Stochastic Approach for Scheduling AI Training Jobs in GPU-based Systems - POLARIS - Performance analysis and Optimization of LARge Infrastructure and Systems
Journal Articles, IEEE Transactions on Cloud Computing, Year: 2024

A Stochastic Approach for Scheduling AI Training Jobs in GPU-based Systems

Abstract

In this work, we optimize the scheduling of Deep Learning (DL) training jobs from the perspective of a Cloud Service Provider running a data center, which must select resources for the execution of each job so as to minimize the average energy consumption while satisfying time constraints. To model the problem, we first develop a Mixed-Integer Non-Linear Programming formulation. Because computing an optimal solution is prohibitively expensive, we design a heuristic STochastic Scheduler (STS). Exploiting the probability distribution of early termination, STS determines how to adapt the resource assignment during the execution of the jobs to minimize the expected energy cost while meeting the job due dates. An extensive experimental evaluation shows that STS achieves significantly better results than other methods in the literature, effectively avoiding due date violations and reducing the total cost by between 32% and 80% on average. We also demonstrate the applicability of our method in real-world scenarios, as obtaining optimal schedules for systems of up to 100 nodes and 400 concurrent jobs requires less than 5 seconds. Finally, we evaluate the effectiveness of GPU sharing, i.e., running multiple jobs on a single GPU. The results show that, depending on the workload and GPU memory, this further reduces the energy cost by 17-29% on average.
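
The idea of trading expected energy against a due date can be illustrated with a toy example. The sketch below is not the STS algorithm from the paper: it is a minimal illustration that assumes a single job, a discrete early-termination distribution over epochs, and a fixed set of candidate resource configurations. The names `Config`, `expected_energy`, and `pick_config`, as well as all numeric values, are hypothetical.

```python
from dataclasses import dataclass
from typing import Optional, Sequence


@dataclass(frozen=True)
class Config:
    """Hypothetical resource configuration for one training job."""
    name: str
    power_watts: float     # assumed constant power draw while the job runs
    secs_per_epoch: float  # assumed constant time per training epoch


def expected_energy(cfg: Config, stop_probs: Sequence[float]) -> float:
    """Expected energy in joules.

    stop_probs[k] is the assumed probability that the job terminates
    after exactly k + 1 epochs (the probabilities should sum to 1).
    """
    return sum(
        p * (k + 1) * cfg.secs_per_epoch * cfg.power_watts
        for k, p in enumerate(stop_probs)
    )


def pick_config(
    configs: Sequence[Config],
    stop_probs: Sequence[float],
    due_secs: float,
) -> Optional[Config]:
    """Return the feasible configuration with the lowest expected energy.

    A configuration counts as feasible only if the job meets the due date
    even when it never terminates early (it runs all len(stop_probs) epochs).
    """
    max_epochs = len(stop_probs)
    feasible = [c for c in configs if max_epochs * c.secs_per_epoch <= due_secs]
    if not feasible:
        return None
    return min(feasible, key=lambda c: expected_energy(c, stop_probs))


# Illustrative values only.
slow = Config("slow", power_watts=100.0, secs_per_epoch=10.0)
fast = Config("fast", power_watts=300.0, secs_per_epoch=4.0)
probs = [0.5, 0.3, 0.2]  # job stops after 1, 2, or 3 epochs

loose = pick_config([slow, fast], probs, due_secs=30.0)
tight = pick_config([slow, fast], probs, due_secs=20.0)
```

With a loose due date the slower, lower-power configuration has the smaller expected energy and is selected; once the due date tightens, only the faster configuration remains feasible. A scheduler of the kind described in the abstract would additionally revisit this choice while jobs run and handle many jobs competing for shared nodes, which this sketch does not attempt.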
Main file
STochasticSchedulingPaper.pdf (1.34 MB)
Origin: Files produced by the author(s)

Dates and versions

hal-04337856, version 1 (12-12-2023)

Licence

Attribution

Cite

Federica Filippini, Jonatha Anselmi, Danilo Ardagna, Bruno Gaujal. A Stochastic Approach for Scheduling AI Training Jobs in GPU-based Systems. IEEE Transactions on Cloud Computing, 2024, 12 (1), pp.53-69. ⟨10.1109/TCC.2023.3336540⟩. ⟨hal-04337856⟩