GPUs have gained tremendous popularity as accelerators for a broad class of applications spanning a number of important computing domains. Many applications have achieved significant performance gains by exploiting the inherent parallelism offered by GPU architectures. Given the growing impact of GPU computing, there is a pressing need to improve the utilization of compute resources and increase application throughput. Modern GPUs support concurrent execution of kernels from a single application context to increase resource utilization. However, this support is limited to statically assigning compute resources to multiple kernels, and lacks the flexibility to adapt resource allocation dynamically. Moreover, the degree of concurrency present in a single application may be insufficient to fully exploit the resources of a GPU.
Applications developed for modern GPUs include multiple compute kernels, where each kernel exhibits distinct computational behavior and differing resource requirements. These applications place high demands on the hardware and may also carry strict deadline constraints. Their multi-kernel nature may allow kernels to execute concurrently on the device. The use of GPUs in cloud engines and data centers will require a new class of GPU sharing mechanisms enforced at the application-context level. What is needed are new mechanisms that streamline concurrent execution for multi-kernel applications. At the same time, concurrent-execution support must be extended to schedule multiple applications on the same GPU. Supporting both application-level and kernel-level concurrency can deliver significantly improved resource utilization and application throughput. A number of architectural and runtime-level design challenges need to be addressed to schedule and manage memory and compute resources efficiently across these multiple levels of concurrency.
In this thesis, we propose a dynamic and adaptive mechanism to manage multi-level concurrency on a GPU. We present a new scheduling mechanism for dynamic spatial partitioning of the GPU. Our mechanism monitors and guides the concurrent execution of compute workloads on the device. To enable this functionality, we extend the OpenCL runtime environment to map multiple command queues to a single GPU and effectively partition the device. As a result, kernels that can benefit from concurrent execution on a partitioned device can utilize more of the available compute resources of the GPU. We also introduce new scheduling mechanisms and partitioning policies to match the computational requirements of different applications.
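To give a flavor of what spatial partitioning means in practice, the following is a minimal Python sketch, not the thesis implementation: it assumes each concurrently queued kernel reports a scalar workload weight (a hypothetical input in this sketch), and divides the GPU's compute units (CUs) among the kernels roughly in proportion to those weights, recomputing whenever the active-kernel set changes.

```python
# Illustrative sketch of dynamic spatial partitioning (toy model only):
# CUs are split among concurrent kernels in proportion to a per-kernel
# workload weight, with every kernel guaranteed at least one CU.

def partition_cus(total_cus, weights):
    """weights: kernel name -> workload weight (> 0).

    Returns kernel name -> CU count. Assumes len(weights) <= total_cus.
    """
    total_w = sum(weights.values())
    # Ideal proportional share for each kernel, floored at 1 CU.
    alloc = {k: max(1, int(total_cus * w / total_w))
             for k, w in weights.items()}
    # Hand leftover CUs to the kernels with the largest remainders.
    leftover = total_cus - sum(alloc.values())
    by_remainder = sorted(
        weights,
        key=lambda k: total_cus * weights[k] / total_w - alloc[k],
        reverse=True,
    )
    for k in by_remainder[:max(0, leftover)]:
        alloc[k] += 1
    return alloc
```

In a real runtime the "weights" would come from profiling or scheduler hints, and the repartitioning step would coordinate with in-flight workgroups rather than reassigning CUs instantaneously.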
We present new techniques that address design challenges in supporting concurrent execution of multiple application contexts on the GPU. We design a hardware/software mechanism to enable multi-context execution while preserving adaptive multi-kernel execution. Our Transparent Memory Management (TMM) combines host-based and GPU-based control to manage data accesses and transfers for multiple executing contexts. We enhance the Command Processor (CP) on the GPU to manage complex firmware tasks for dynamic compute resource allocation, memory handling, and kernel scheduling. We design a hardware-based scheme that modifies the L2 cache and TLB to provide virtual memory isolation across contexts. We provide a detailed evaluation of our adaptive partitioning mechanism, leveraging multi-context execution, using a large set of real-world applications. We also present a hardware-based runtime approach that enables profile-guided partitioning and scheduling; it uses machine learning to analyze the current execution state of the GPU. By tracking the execution-time behavior of real-world applications, we improve the effectiveness of both adaptive partitioning and TMM.
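The residency-management side of a TMM-style scheme can be pictured with a toy model. The sketch below is purely illustrative and is not the thesis design: it assumes buffers from multiple contexts share a fixed GPU memory budget, and evicts the least-recently-used buffer back to host memory on a capacity miss, logging each transfer as a stand-in for host/GPU cooperative control. The class and method names are invented for this example.

```python
from collections import OrderedDict

class ToyResidencyManager:
    """Toy model of TMM-style buffer residency (illustrative only):
    buffers from multiple contexts share GPU memory; on a capacity
    miss, least-recently-used buffers are evicted to host memory and
    the transfer is logged."""

    def __init__(self, capacity_bytes):
        self.capacity = capacity_bytes
        self.resident = OrderedDict()  # (context, buffer) -> size
        self.transfers = []            # ("in"/"out", context, buffer)

    def access(self, context, buffer, size):
        key = (context, buffer)
        if key in self.resident:
            self.resident.move_to_end(key)  # LRU refresh, no transfer
            return
        # Evict LRU buffers until the incoming buffer fits.
        while self.resident and \
                sum(self.resident.values()) + size > self.capacity:
            (c, b), _ = self.resident.popitem(last=False)
            self.transfers.append(("out", c, b))
        self.resident[key] = size
        self.transfers.append(("in", context, buffer))
```

A real mechanism would additionally handle page-granularity migration, consult the CP for scheduling state, and enforce the isolation guarantees described above rather than trusting context labels.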
Advisor: Professor David Kaeli
Professor Gunar Schirner
Dr. Norm Rubin