Multi-level Analysis of GPU Utilization in ML Training Workloads

Paul Delestrac; Debjyoti Bhattacharjee; Simei Yang; Diksha Moolchandani; Francky Catthoor; Lionel Torres; David Novo

Conference paper, Year: 2024


Paul Delestrac

Role: Corresponding author
PersonId : 1163872
IdHAL : paul-delestrac
ORCID : 0000-0002-7476-1422
IdRef : 280857977


ADAptive Computing

Debjyoti Bhattacharjee

Role: Author
PersonId : 1391406
ORCID : 0000-0001-6561-8934

IMEC

Simei Yang

Role: Author
PersonId : 1391409
ORCID : 0000-0002-0130-8176

IMEC

Diksha Moolchandani

Role: Author
PersonId : 1368576
ORCID : 0000-0001-8110-049X

IMEC

Francky Catthoor

Role: Author
PersonId : 1086572
ORCID : 0000-0002-3599-8515

IMEC

Lionel Torres

Role: Author
PersonId : 929667
ORCID : 0000-0001-5807-5070

ADAptive Computing

David Novo

Role: Author
PersonId : 170933
IdHAL : david-novo
ORCID : 0000-0002-5510-4152
IdRef : 244276455

ADAptive Computing

Abstract

Training time has become a critical bottleneck due to the recent proliferation of large-parameter ML models. GPUs continue to be the prevailing architecture for training ML models. However, the complex execution flow of ML frameworks makes it difficult to understand GPU computing resource utilization. Our main goal is to provide a better understanding of how efficiently ML training workloads use the computing resources of modern GPUs. To this end, we first describe an ideal reference execution of a GPU-accelerated ML training loop and identify relevant metrics that can be measured using existing profiling tools. Second, we produce a coherent integration of the traces obtained from each profiling tool. Third, we leverage the metrics within our integrated trace to analyze the impact of different software optimizations (e.g., mixed-precision, various ML frameworks, and execution modes) on the throughput and the associated utilization at multiple levels of hardware abstraction (i.e., whole GPU, SM subpartitions, issue slots, and tensor cores). In our results on two modern GPUs, we present seven takeaways and show that although close to 100% utilization is generally achieved at the GPU level, average utilization of the issue slots and tensor cores always remains below 50% and 5.2%, respectively.
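The contrast the abstract draws between GPU-level and issue-slot utilization can be illustrated with a minimal Python sketch. The metric definitions below are simplifying assumptions made for illustration, not the definitions or tooling used in the paper: GPU-level utilization is taken as the fraction of a time window in which at least one kernel is running, and issue-slot utilization as issued instructions divided by available issue slots. The function names and all input numbers are hypothetical.

```python
from typing import List, Tuple

Interval = Tuple[float, float]  # (start, end) of one kernel, in seconds


def merge_intervals(intervals: List[Interval]) -> List[Interval]:
    """Merge overlapping kernel intervals so concurrent kernels are counted once."""
    merged: List[Interval] = []
    for start, end in sorted(intervals):
        if merged and start <= merged[-1][1]:
            merged[-1] = (merged[-1][0], max(merged[-1][1], end))
        else:
            merged.append((start, end))
    return merged


def gpu_level_utilization(kernels: List[Interval], window: Interval) -> float:
    """Assumed definition: share of the window with at least one kernel running."""
    w_start, w_end = window
    busy = 0.0
    for start, end in merge_intervals(kernels):
        # Clip each busy interval to the observation window.
        busy += max(0.0, min(end, w_end) - max(start, w_start))
    return busy / (w_end - w_start)


def issue_slot_utilization(issued: int, cycles: int, slots_per_cycle: int = 1) -> float:
    """Assumed definition: issued instructions over available issue slots."""
    return issued / (cycles * slots_per_cycle)


# Hypothetical trace: the GPU is almost always busy...
kernels = [(0.0, 4.0), (3.0, 6.0), (7.0, 10.0)]
gpu_util = gpu_level_utilization(kernels, window=(0.0, 10.0))

# ...while the (made-up) counters show most issue slots going unused.
slot_util = issue_slot_utilization(issued=400, cycles=1000)

print(f"GPU-level utilization:  {gpu_util:.0%}")
print(f"Issue-slot utilization: {slot_util:.0%}")
```

With these made-up inputs, a GPU that is busy for most of the window still leaves the majority of its issue slots idle, which is the kind of gap a coarse GPU-level metric hides.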

Keywords

GPU utilization; Performance analysis; ML training

Domains

Artificial Intelligence [cs.AI]; Machine Learning [stat.ML]

Main file

Delestrac 2024 Multilevel.pdf (908.04 KB)

Origin: Files produced by the author(s)


https://hal.umontpellier.fr/hal-04523554

Submitted on: Saturday, March 30, 2024, 16:41:27

Last modified on: Thursday, November 7, 2024, 16:14:03

Dates and versions

hal-04523554 , version 1 (30-03-2024)

Identifiers

HAL Id : hal-04523554 , version 1

Cite

Paul Delestrac, Debjyoti Bhattacharjee, Simei Yang, Diksha Moolchandani, Francky Catthoor, et al. Multi-level Analysis of GPU Utilization in ML Training Workloads. DATE 2024 - 27th Design, Automation and Test in Europe Conference, Mar 2024, Valencia, Spain. ⟨hal-04523554⟩


Collections

CNRS LIRMM GENCI ADAC UNIV-MONTPELLIER

