Over the past decade, GPU computing has evolved from the relatively simple task of mapping data-parallel kernels to Single Instruction Multiple Thread (SIMT) hardware to the more complex challenge of mapping multiple complex, and potentially irregular, kernels onto increasingly powerful and sophisticated many-core engines. Recent advances in GPU architectures, including support for nested parallelism and concurrent kernel execution, add further complexity to the task of fully exploiting the power of GPU computing.
Improving application performance is a central concern for software developers. First, the programmer needs to identify where performance opportunities reside. Often, the right optimization is tied to the underlying nature of the application and the specific algorithms used. Tuning kernels to exploit hardware features can become an endless manual process. There is a growing need for sophisticated characterization techniques that can help the programmer identify opportunities to exploit new hardware features, and that can move a broader class of applications to GPUs efficiently.
In this thesis, we present novel approaches to characterize application behavior that can exploit nested parallelism and concurrent kernel execution, two features introduced in recent GPU architectures. To identify bottlenecks that can be addressed by exploiting these new hardware features, we have developed a set of metrics that more precisely characterize an application's performance-limiting behavior.
For nested parallelism, our approach focuses on irregular and recursive applications. For irregular applications, we define and evaluate three main runtime components: i) control flow behavior, ii) child kernel launch behavior, and iii) child kernel synchronization behavior. For recursive kernel applications, we define and evaluate: i) the degree of thread-level parallelism, ii) the degree of work efficiency, and iii) the overhead of kernel launches. For concurrent kernel execution, our characterization captures a kernel's launch configuration, its associated resource consumption, and its degree of overlapped execution. Together, our proposed metrics identify when to exploit nested parallelism and concurrent kernel execution.
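To make the nested parallelism terminology concrete, the following is a minimal CUDA dynamic parallelism sketch (not taken from the thesis; the kernel names `process_row` and `child_work` and the row-based workload are illustrative assumptions) showing the two runtime components measured above: a data-dependent child kernel launch and the parent's synchronization on that child.

```cuda
// Illustrative sketch only: a parent kernel launches a child grid for
// rows of an irregular data structure that carry enough work to justify
// the launch overhead. Requires compute capability 3.5+ and compilation
// with -rdc=true; device-side cudaDeviceSynchronize() is deprecated in
// newer CUDA releases.
__global__ void child_work(int *row, int len) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < len)
        row[i] *= 2;  // per-element work on one irregular row
}

__global__ void process_row(int **rows, const int *lens, int threshold) {
    int r = blockIdx.x * blockDim.x + threadIdx.x;
    // Child kernel launch behavior: only "heavy" rows spawn a child grid.
    if (lens[r] > threshold) {
        child_work<<<(lens[r] + 255) / 256, 256>>>(rows[r], lens[r]);
        // Child kernel synchronization behavior: the parent thread
        // blocks until its child grid completes.
        cudaDeviceSynchronize();
    }
}
```

The launch-behavior and synchronization-behavior metrics described above characterize exactly these two points: how often and how irregularly the guarded launch fires, and how much time parents spend blocked at the synchronization.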
We demonstrate the utility of our metric-based framework on a diverse set of workloads that exhibit both irregular and recursive program behavior. This suite of workloads includes: i) a set of microbenchmarks that specifically target the new GPU features discussed in this thesis, ii) the NUPAR suite, iii) the Lonestar suite, and iv) a set of real-world applications. Using our framework, we are able to speed up applications by 5x to 23x over the original programs.
- Prof. David Kaeli (Advisor)
- Prof. Ningfang Mi
- Prof. Qianqian Fang
- Norman Rubin