GPU Performance in Virtualized Environments

The graphics processing unit (GPU) plays a crucial role in boosting application performance and accelerating computational tasks. Thanks to its parallel architecture and energy efficiency, the GPU has become essential in many computing scenarios. GPU virtualization, in turn, has been a significant breakthrough, as it provides scalable and adaptable GPU resources for virtual machines (VMs). However, debugging and analyzing the performance of GPU-accelerated applications remains challenging in virtualized environments. Most current performance tools do not support virtual GPUs (vGPUs), which highlights the need for more advanced tools.

  • GPU virtualization
  • GVT-g
  • software tracing

1. Introduction

Accelerators are specialized processors that have been developed and integrated into computer systems in response to the growing demand for high-performance computing. These accelerators aim to assist the central processing unit (CPU) in executing certain types of computations. Several studies have demonstrated that offloading specific tasks to accelerators can considerably boost the overall system performance [1]. Hence, the use of accelerators has become a typical approach to handling the high computational demands in various industries and research fields. This has motivated hardware manufacturers to develop a variety of specialized accelerators, including accelerated processing units (APUs), floating-point units (FPUs), digital signal processing units (DSPs), network processing units (NPUs), and graphics processing units (GPUs).
The GPU, one of the most pervasive accelerators, was initially devised for graphics rendering and image processing. In the last few years, however, its computational power has been increasingly harnessed for parallel number crunching. Nowadays, GPU-accelerated applications are present in many domains that are often unrelated to graphics, such as deep learning, financial analytics, and general-purpose scientific calculations. Virtualization, for its part, plays a fundamental role, as it enables many modern computing concepts. Its primary function is to facilitate the sharing and multiplexing of physical resources among different applications. In particular, virtualization significantly enhances GPU utilization by enabling a more efficient allocation of its computational power. In general, the main advantage of virtualizing computing resources lies in reducing energy consumption in data centers, which, in turn, reduces operational costs. Hence, a variety of new applications that leverage the capabilities of virtual GPUs (vGPUs) have emerged. Examples include deep learning, virtual desktop infrastructure (VDI), and artificial intelligence applications.
Virtualizing GPUs presents more complex challenges than virtualizing CPUs and most I/O devices, as the latter rely on well-established technologies. These challenges stem from several key obstacles. First, hardware architectures differ significantly among GPU brands, which complicates the development of a universal virtualization solution. Second, the closed-source drivers of popular GPU brands, such as NVIDIA, make it difficult for third-party developers to build virtualization technology for these devices. Third, most GPU designs lack inherent sharing mechanisms, so a GPU process gains exclusive access to the device's resources and cannot be preempted by other processes. In addition, several studies have shown that the overhead involved in process preemption is substantially higher in GPUs than in CPUs [2][3]. This increased overhead is essentially due to the higher number of cores and context states in GPUs. It is important to note that some recent GPUs, such as the NVIDIA Pascal GPU [4], include support for preemption at the kernel, thread, and instruction levels to mitigate these challenges.
Before advanced GPU virtualization technologies were developed, practitioners used the passthrough technique to give VMs direct access to physical GPU (pGPU) resources. This approach, however, has limitations, such as the inability to share a GPU among multiple VMs and the lack of support for live migration. Major GPU manufacturers such as NVIDIA, AMD, and Intel have introduced their own brand-specific virtualization solutions to address these issues. These include NVIDIA GRID [5], AMD MxGPU [6], and Intel's Graphics Virtualization Technology (GVT-g, Grid generation) [7]. The first two solutions are based on hardware virtualization capabilities, whereas the third one, Intel's GVT-g, provides a software-based solution for full GPU virtualization. GVT-g is open source and has been integrated into the Linux mainline kernel, which makes it a desirable option due to its accessibility and potential for broader integration.
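On a Linux host, software-mediated vGPUs such as those of GVT-g are generally exposed through the kernel's mediated-device (mdev) sysfs interface. The following Python sketch assumes that layout, in which each supported vGPU type appears as a directory under mdev_supported_types and carries an available_instances file. The function names, the helper that fabricates a directory tree, and the type name used in the demo are illustrative assumptions and are not taken from the source paper.

```python
import tempfile
from pathlib import Path


def list_vgpu_types(device_dir):
    """Return {type_name: available_instances} for one physical GPU.

    device_dir is the sysfs directory of the device (an assumption about
    the host layout). A type whose counter cannot be read maps to None.
    """
    types_dir = Path(device_dir) / "mdev_supported_types"
    result = {}
    if not types_dir.is_dir():
        return result  # the device exposes no mediated-device types
    for type_dir in sorted(types_dir.iterdir()):
        counter = type_dir / "available_instances"
        try:
            result[type_dir.name] = int(counter.read_text().strip())
        except (OSError, ValueError):
            result[type_dir.name] = None
    return result


def make_fake_device(root, types):
    """Build a throwaway tree that mimics the assumed sysfs layout."""
    for name, count in types.items():
        type_dir = Path(root) / "mdev_supported_types" / name
        type_dir.mkdir(parents=True)
        (type_dir / "available_instances").write_text(f"{count}\n")
    return root


if __name__ == "__main__":
    # Demo on a fabricated tree; the type name below is only an example.
    with tempfile.TemporaryDirectory() as root:
        make_fake_device(root, {"i915-GVTg_V5_4": 1})
        for name, free in list_vgpu_types(root).items():
            print(f"{name}: {free} instance(s) available")
```

On a real host, device_dir would be the sysfs directory of the physical GPU, and an available_instances value of zero would indicate that no further vGPU of that type can be created. With plain passthrough, by contrast, there are no such types to enumerate, because the whole device is assigned to a single VM.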
Performance analysis tools for GPUs, meanwhile, are important for debugging performance issues in GPU-accelerated applications [8]. These tools help understand how GPU resources are allocated and consumed, and they facilitate the diagnosis of potential performance bottlenecks. They are particularly crucial in virtualized environments, where resource sharing among vGPUs and its impact on performance need to be better understood. However, developing practical tools for monitoring and debugging vGPUs remains challenging. Virtualized environments often comprise many layers, encompassing hardware, middleware, and host and guest operating systems. These layers increase the isolation and abstraction of GPU resources, which makes it difficult to pinpoint the causes of performance issues.

2. GPU Performance in Virtualized Environments

As the complexity of GPU-accelerated applications continues to grow, the need for effective performance analysis methods becomes increasingly critical, particularly in vGPU-based systems. A study of existing GPU performance analysis tools shows that they offer different levels of analysis and are mostly dedicated to specific GPU architectures. High-end production-quality tools such as VTune Profiler [9] and Nsight Systems [10] are notable for their comprehensive approach to analyzing GPU-accelerated applications. They provide a holistic understanding of the application runtime behavior, particularly unveiling the interaction between CPU and GPU, which is crucial for effective optimization. In contrast, tools proposed in academic research are often tailored to specific GPU programming models or address particular GPU-related performance issues.
Many vendors offer dedicated software for profiling GPU applications, such as NVIDIA Nsight Systems, Intel VTune Profiler, and AMD Radeon GPU Profiler [11]. These tools leverage various techniques such as binary instrumentation, hardware counters, and API hooking to gather detailed performance events. Despite providing rich insights into kernel execution, CPU–GPU interaction, memory access patterns, and GPU API call paths, these tools are often limited in terms of openness, flexibility, and cross-architecture applicability. In addition to proprietary offerings, the GPU performance analysis ecosystem encompasses feature-rich open-source tools. For example, HPCToolkit [12][13] and TAU [14] are two versatile tools tailored for analyzing the performance of heterogeneous systems. These tools offer valuable diagnostic capabilities for pinpointing GPU bottlenecks and determining their root causes. For instance, through call path profiling, they provide insights into kernel execution and enable the identification of hotspots in the program's code.
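The idea behind call path profiling can be sketched in a few lines of Python. The code below is only an illustration under stated assumptions: it takes call stacks that have already been sampled, counts how often each call path appears, and ranks paths by the samples attributed to their last frame. The function names and the sample data are hypothetical and do not reflect how HPCToolkit or TAU are implemented.

```python
from collections import Counter


def profile_call_paths(samples):
    """Aggregate sampled call stacks into inclusive and exclusive counts.

    Each sample is a tuple of frame names ordered from the outermost caller
    to the frame that was executing when the sample was taken.
    """
    inclusive = Counter()  # samples seen in a path or in any of its callees
    exclusive = Counter()  # samples seen in the path's own last frame
    for stack in samples:
        for depth in range(1, len(stack) + 1):
            inclusive[stack[:depth]] += 1
        if stack:
            exclusive[stack] += 1
    return inclusive, exclusive


def hottest_paths(exclusive, top=3):
    """Return the call paths with the most exclusive samples, hottest first."""
    return exclusive.most_common(top)


if __name__ == "__main__":
    # Hypothetical samples; frame names are made up for the illustration.
    samples = [
        ("main", "train_step", "launch:matmul_kernel"),
        ("main", "train_step", "launch:matmul_kernel"),
        ("main", "train_step", "launch:relu_kernel"),
        ("main", "load_batch"),
    ]
    inclusive, exclusive = profile_call_paths(samples)
    for path, count in hottest_paths(exclusive):
        print(" > ".join(path), count)
```

Keeping the full path as the key, rather than only the innermost frame, is what lets a hotspot be attributed to the calling context that launched a kernel and not just to the kernel itself.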
Aside from the established profilers, academic research also presents many innovative tools for the diagnosis of performance issues in GPU-accelerated applications. Zhou et al. proposed GVProf [15], a value-aware profiler for identifying redundant memory accesses in GPU-accelerated applications. Their follow-up work [16] focused on improving the detection of value-related patterns (e.g., redundant values, duplicate writes, and single-valued data). The main objective of their work was to identify diverse performance bottlenecks and provide suggestions for code optimization. GPA (GPU Performance Advisor) [17] is a diagnostic tool that leverages instruction sampling and data flow analysis to pinpoint inefficiencies in the application code. DrGPU [18] uses a top-down profiling approach to quantify and decompose stall cycles using hardware performance counters. Based on the stall analysis, it identifies inefficient software–hardware interactions and their root causes, thus helping make informed optimization decisions. CUDAAdvisor [19], built on top of LLVM, instruments application code on both the host and device sides. It conducts code- and data-centric profiling to identify performance bottlenecks arising from cache contention and from memory and control flow divergence. The main disadvantages of these tools lie in their considerable overhead and exclusive applicability to NVIDIA GPUs.
Several other profiling tools leverage library interposition and userspace tracing to capture runtime events, enabling the correlation of CPU and GPU activities. For example, CLUST [20] and LTTng-HSA [21] employ these techniques to profile OpenCL- and HSA-based applications, respectively. However, a substantial drawback of these tools is their tight coupling with specific GPU programming frameworks, which limits their capability to provide a system-wide analysis.
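The interposition technique can be illustrated with a minimal Python analogue. This is a sketch under stated assumptions, not the mechanism of CLUST or LTTng-HSA, which operate on native runtime libraries: here a tracing wrapper simply replaces an attribute of a stand-in runtime object and records a timestamped event around each call. The Tracer class, make_fake_runtime, and enqueue_kernel are hypothetical names introduced for the example.

```python
import functools
import time
import types


class Tracer:
    """Collects (name, start, end) events from interposed calls."""

    def __init__(self, clock=time.perf_counter_ns):
        self.clock = clock
        self.events = []

    def interpose(self, namespace, name):
        """Replace namespace.name with a wrapper that records entry and exit."""
        original = getattr(namespace, name)

        @functools.wraps(original)
        def wrapper(*args, **kwargs):
            start = self.clock()
            try:
                return original(*args, **kwargs)
            finally:
                # Recorded even if the call raises, so no event is lost.
                self.events.append((name, start, self.clock()))

        setattr(namespace, name, wrapper)
        return original  # returned so the caller can restore it later


def make_fake_runtime():
    """Hypothetical stand-in for a GPU runtime library; not a real API."""

    def enqueue_kernel(kernel_name):
        return f"queued {kernel_name}"

    return types.SimpleNamespace(enqueue_kernel=enqueue_kernel)


if __name__ == "__main__":
    runtime = make_fake_runtime()
    tracer = Tracer()
    tracer.interpose(runtime, "enqueue_kernel")
    runtime.enqueue_kernel("vec_add")
    for name, start, end in tracer.events:
        print(f"{name}: {end - start} ns on the CPU side")
```

The application code is unchanged and still calls the same entry point, which is the property that makes interposition attractive for tracing. It also shows the limitation noted above: every wrapper is written against one specific runtime API, so events outside that API remain invisible.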

This entry is adapted from the peer-reviewed paper 10.3390/fi16030072

References

  1. Hong, C.H.; Spence, I.; Nikolopoulos, D.S. FairGV: Fair and Fast GPU Virtualization. IEEE Trans. Parallel Distrib. Syst. 2017, 28, 3472–3485.
  2. Ji, Z.; Wang, C.L. Compiler-Directed Incremental Checkpointing for Low Latency GPU Preemption. In Proceedings of the 2022 IEEE International Parallel and Distributed Processing Symposium (IPDPS), Lyon, France, 30 May–3 June 2022; pp. 751–761.
  3. Hong, C.H.; Spence, I.; Nikolopoulos, D.S. GPU Virtualization and Scheduling Methods: A Comprehensive Survey. ACM Comput. Surv. 2017, 50, 1–37.
  4. NVIDIA. GP100 Pascal Whitepaper. 2016. Available online: https://images.nvidia.com/content/pdf/tesla/whitepaper/pascal-architecture-whitepaper.pdf (accessed on 20 February 2024).
  5. Nvidia Grid: Graphics Accelerated VDI with the Visual Performance of a Workstation. 2013. Available online: http://www.nvidia.com/content/grid/vdi-whitepaper.pdf (accessed on 20 February 2024).
  6. AMD MxGPU. 2024. Available online: https://www.amd.com/en/graphics/workstation-virtualization-solutions (accessed on 20 February 2024).
  7. Tian, K.; Dong, Y.; Cowperthwaite, D. A Full GPU Virtualization Solution with Mediated Pass-through. In Proceedings of the 2014 USENIX Conference on USENIX Annual Technical Conference, USENIX ATC’14, Philadelphia, PA, USA, 19–20 June 2014; USENIX Association: Berkeley, CA, USA, 2014; pp. 121–132.
  8. Aceto, G.; Botta, A.; Donato, W.; Pescapè, A. Cloud monitoring: A survey. Comput. Netw. 2013, 57, 2093–2115.
  9. Intel VTune Amplifier. 2024. Available online: https://software.intel.com/en-us/intel-vtune-amplifier-xe (accessed on 20 February 2024).
  10. Nvidia Nsight Graphics. 2024. Available online: https://developer.nvidia.com/nsight-graphics (accessed on 20 February 2024).
  11. Advanced Micro Devices. AMD GPUOpen: Radeon GPU Profiler. 2024. Available online: https://gpuopen.com/rgp/ (accessed on 20 February 2024).
  12. Zhou, K.; Krentel, M.; Mellor-Crummey, J. A tool for top-down performance analysis of GPU-accelerated applications. In Proceedings of the 25th ACM SIGPLAN Symposium on Principles and Practice of Parallel Programming, San Diego, CA, USA, 22–26 February 2020; pp. 415–416.
  13. Cherian, A.T.; Zhou, K.; Grubisic, D.; Meng, X.; Mellor-Crummey, J. Measurement and Analysis of GPU-Accelerated OpenCL Computations on Intel GPUs. In Proceedings of the 2021 IEEE/ACM International Workshop on Programming and Performance Visualization Tools (ProTools), St. Louis, MO, USA, 14 November 2021.
  14. TAU Performance System. 2024. Available online: http://www.paratools.com/tau (accessed on 20 February 2024).
  15. Zhou, K.; Hao, Y.; Mellor-Crummey, J.; Meng, X.; Liu, X. GVPROF: A Value Profiler for GPU-Based Clusters. In Proceedings of the SC20: International Conference for High Performance Computing, Networking, Storage and Analysis, Atlanta, GA, USA, 9–19 November 2020; pp. 1–16.
  16. Zhou, K.; Hao, Y.; Mellor-Crummey, J.; Meng, X.; Liu, X. ValueExpert: Exploring value patterns in GPU-accelerated applications. In Proceedings of the 27th ACM International Conference on Architectural Support for Programming Languages and Operating Systems, Lausanne, Switzerland, 28 February 2022; pp. 171–185.
  17. Zhou, K.; Meng, X.; Sai, R.; Grubisic, D.; Mellor-Crummey, J. An Automated Tool for Analysis and Tuning of GPU-Accelerated Code in HPC Applications. IEEE Trans. Parallel Distrib. Syst. 2021, 33, 854–865.
  18. Hao, Y.; Jain, N.; Van der Wijngaart, R.; Saxena, N.; Fan, Y.; Liu, X. DrGPU: A Top-Down Profiler for GPU Applications. In Proceedings of the 2023 ACM/SPEC International Conference on Performance Engineering, London, UK, 7–11 May 2023; pp. 43–53.
  19. Shen, D.; Song, S.L.; Li, A.; Liu, X. CUDAAdvisor: LLVM-based runtime profiling for modern GPUs. In Proceedings of the 2018 International Symposium on Code Generation and Optimization, Vienna, Austria, 24–28 February 2018; pp. 214–227.
  20. Couturier, D.; Dagenais, M.R. LTTng CLUST: A System-wide Unified CPU and GPU Tracing Tool for OpenCL Applications. Adv. Softw. Eng. 2015, 2015, 940628.
  21. Margheritta, P.; Dagenais, M.R. LTTng-HSA: Bringing LTTng tracing to HSA-based GPU runtimes. Concurr. Comput. Pract. Exp. 2019, 31, e5231.