Analyzing GPU Energy Consumption in Data Movement and Storage
Abstract
GPUs are the prevailing solution for executing high-performance tasks (e.g., machine learning training). As the peak performance of modern GPUs increases with each generation, so does their thermal design power (TDP). Hence, identifying energy bottlenecks in the GPU architecture is crucial to designing more efficient architectures in the future. However, due to the complex, proprietary nature of modern GPU architectures, providing a detailed breakdown of GPU energy consumption is not trivial.
The goal of this work is to estimate a lower bound for the energy consumed by data movement and storage in modern GPU architectures, leveraging internal power sensors. We establish a basic energy model for modern GPUs, focused on data movement to and from the hardware-managed caches and software-managed memories. We propose a methodology to calibrate the energy model using microbenchmarks, performance counters, and the internal power sensor. We experimentally calibrate the model on an NVIDIA A100 GPU. Then, we test the consistency of the results by cross-validating against modified microbenchmarks that contain additional instructions. Finally, we use the calibrated energy model to evaluate energy breakdowns for workloads of increasing complexity (e.g., a ResNet-50 training iteration with different software optimizations). Our results show that data movement dominates the dynamic energy consumption of the GPU (up to 84%), with DRAM accesses being the main contributor.
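To make the calibration idea concrete, the following is a minimal sketch of one way such a model could be organized, not the authors' implementation. It assumes a linear model in which dynamic energy is the sum, over memory levels, of an access count (from performance counters) times an energy per access, and that each per-access energy is calibrated from a microbenchmark stressing a single level. The function names, the idle-power subtraction, and all numeric values are hypothetical placeholders, not measurements.

```python
# Hypothetical sketch of a linear dynamic-energy model:
#   E_dyn = sum over levels i of (n_i * e_i)
# n_i: access count for memory level i (from performance counters)
# e_i: energy per access to level i (calibrated with a microbenchmark)
# All names and numbers here are illustrative assumptions.

def calibrate_energy_per_access(avg_power_w, idle_power_w, runtime_s, access_count):
    """Estimate joules per access from one microbenchmark run.

    Assumes the microbenchmark stresses a single memory level, so the
    power above idle is attributed entirely to that level's accesses.
    """
    if access_count <= 0:
        raise ValueError("access_count must be positive")
    dynamic_energy_j = (avg_power_w - idle_power_w) * runtime_s
    return dynamic_energy_j / access_count


def energy_breakdown(access_counts, energy_per_access):
    """Return the modeled dynamic energy (joules) per memory level."""
    return {
        level: access_counts[level] * energy_per_access[level]
        for level in access_counts
    }


def data_movement_share(breakdown, total_dynamic_energy_j):
    """Fraction of the measured dynamic energy explained by the model."""
    if total_dynamic_energy_j <= 0:
        raise ValueError("total_dynamic_energy_j must be positive")
    return sum(breakdown.values()) / total_dynamic_energy_j


# Placeholder calibration run: 200 W average, 50 W idle, 2 s, 1e9 accesses.
e_dram = calibrate_energy_per_access(200.0, 50.0, 2.0, 1e9)

# Placeholder workload counters and per-access energies.
counts = {"dram": 1e6, "l2": 2e6}
per_access = {"dram": e_dram, "l2": 1e-7}
breakdown = energy_breakdown(counts, per_access)
share = data_movement_share(breakdown, total_dynamic_energy_j=1.0)
```

Because only the modeled memory levels contribute to the sum, the resulting share is naturally read as a lower bound on data-movement energy, which matches the goal stated above.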
Origin: Files produced by the author(s)