As throughput-oriented accelerators, GPUs provide tremendous processing power by executing a massive number of threads in parallel. However, exploiting high degrees of thread-level parallelism (TLP) does not always translate to the peak performance that GPUs can offer, and GPU resources are often left underutilized.
Compared to compute resources, memory resources can tolerate considerably lower levels of TLP due to hardware bottlenecks. Unfortunately, this tolerance is not effectively exploited by the Single Instruction Multiple Thread (SIMT) execution model employed by current GPU compute frameworks.
Under the SIMT execution model, GPU applications frequently send bursts of memory requests that compete for GPU memory resources. Traditionally, hardware units such as the wavefront scheduler are used to manage these requests: compute-bound threads can be scheduled to utilize compute resources while memory requests are serviced. However, the scheduler struggles when memory operations dominate execution, because it cannot effectively hide their long latency.
The degree of instruction diversity present in a single application may also be insufficient to fully utilize the resources on a GPU. GPU workloads tend to stress a particular hardware resource while leaving others underutilized. Coarse-grained hardware resource sharing techniques, such as concurrent kernel execution, fail to guarantee that GPU hardware resources are truly shared by different kernels. Introducing additional kernels that utilize similar resources may add contention to the system, especially if the candidate kernels fail to use hardware resources collaboratively.
Most previous studies have treated achieving peak GPU performance as a hardware issue, and extensive efforts have been made to remove hardware bottlenecks to improve efficiency. In this thesis, we argue that software plays an equal, if not more important, role: hardware alone cannot achieve peak performance in a GPU system. We propose novel compiler-centric software techniques that work in concert with the hardware. Our compiler-centric solutions improve GPU performance by redistributing and diversifying instructions at compile time, which simultaneously reduces memory contention and improves utilization of hardware resources. A rebalanced GPU application can achieve much better performance with minimal effort from the programmer and without requiring any hardware changes.
To support our study of these novel compiler-based optimizations, we need a complete simulation framework that integrates with a compiler tool chain. In this thesis, we develop a full compiler tool chain based on LLVM that works seamlessly with the Multi2Sim CPU-GPU simulation framework. In addition to supporting our work, this compiler framework allows future researchers to explore cross-layer optimizations for GPU systems.
- Professor David Kaeli (Advisor)
- Professor Gunar Schirner
- Dr. Norman Rubin