Graphics Processing Units (GPUs) have been used in a wide range of high-performance computing domains. Unfortunately, computing with GPU devices presents its own challenges, including inefficiencies in the global memory system. With today’s growing demand for Big Data processing, leveraging larger-scale GPUs or multiple GPUs becomes the natural next step. Big Data applications magnify the current limitations of global memory on GPU-based systems. A major source of this global memory inefficiency is the bottleneck in the on-chip network associated with this memory.
In this dissertation, we describe how to optimize the performance and power efficiency of an on-chip network used on a GPU. We explore the GPU-based Network-on-Chip (NoC) design space, develop execution-driven simulation models, and analyze a range of parallelized applications. We evaluate a number of conventional network topologies and their impact on the performance of a GPU system. We use detailed simulation to characterize the memory access patterns present in GPU applications, and explore electrical on-chip networks that best match the needs of these applications. We incorporate asymmetry into the NoC design to reduce the power consumption of the network while providing performance comparable to that of the best conventional topology. Our solution reduces the Energy-Delay Product (EDP) by as much as 88%.
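As a reminder of how the EDP metric behaves, the short sketch below computes it as energy multiplied by delay (the standard definition) and derives the relative reduction between two designs. The function names and all numeric values are hypothetical and chosen only for illustration; they are not measurements from this dissertation.

```python
def energy_delay_product(energy_joules: float, delay_seconds: float) -> float:
    """Standard EDP definition: energy consumed times execution time."""
    return energy_joules * delay_seconds


def edp_reduction(baseline_edp: float, new_edp: float) -> float:
    """Fractional EDP reduction of a new design relative to a baseline."""
    return 1.0 - new_edp / baseline_edp


# Hypothetical numbers, for illustration only (not results from this work).
baseline = energy_delay_product(energy_joules=10.0, delay_seconds=2.0)
asymmetric = energy_delay_product(energy_joules=2.0, delay_seconds=1.2)
reduction = edp_reduction(baseline, asymmetric)
print(f"EDP reduction: {reduction:.0%}")
```

Because EDP multiplies the two quantities, a design that saves a large amount of energy can tolerate a modest change in delay and still show a large EDP reduction.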
To improve the performance of current and future GPUs, we explore the use of silicon-photonic link technology when constructing the NoC. This emerging, low-latency, high-bandwidth technology has been incorporated in chip multiprocessors (CMPs). By introducing a hybrid silicon-photonic NoC in the GPU memory system, we are able to improve the performance of memory-intensive applications by 3.43×, as compared with the best alternative electrical NoC.
Finally, we conduct a thorough analysis of global memory management schemes for multi-GPU systems. We identify limitations of the global memory present in previously proposed memory management schemes, and propose an alternative, Unified Memory Hierarchy (UMH). Our aim is to reduce the communication between multiple GPU devices. Our solution supports coherency and cooperative execution between a CPU and multiple GPUs through a single, shared memory hierarchy. From the GPU’s perspective, the host/CPU memory now serves the role of main memory for both the CPU and the GPU. Adopting this design, a GPU accesses data that resides in CPU memory only if it does not find the data in its own high-bandwidth memory. The UMH design includes the addition of a memory directory for inter-device memory management and coherency. The proposed coherency protocol allows coherent access between a CPU and multiple GPU devices, while relaxing coherency constraints on the GPU when coherency does not need to be enforced. Adding a proper memory coherency protocol to our UMH design reduces the overhead of synchronization between devices by as much as 13×. Additionally, UMH enhances the performance of a multi-GPU system by 1.92× and 5.38× (on average) over alternative memcpy and zero-copy approaches, respectively, for a system with 4 discrete GPU devices.
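The access path described above can be illustrated with a toy model. The sketch below is not the UMH implementation or its coherency protocol; it assumes a simple write-through, invalidation-based directory, and the class and method names (`UnifiedMemorySketch`, `gpu_read`, `gpu_write`, `cpu_write`) are hypothetical. It shows only the two ideas stated in the text: a GPU goes to host memory only when its own memory misses, and a directory tracks which devices hold a copy so that stale copies can be invalidated.

```python
class UnifiedMemorySketch:
    """Toy model: host memory acts as main memory, GPU memory as a cache."""

    def __init__(self, num_gpus: int):
        self.host_memory = {}                        # address -> value
        self.gpu_memory = [dict() for _ in range(num_gpus)]
        self.directory = {}                          # address -> set of GPU ids
        self.host_accesses = 0                       # GPU reads served by host

    def cpu_write(self, address, value):
        # Assumption: a CPU write invalidates every GPU copy.
        self._invalidate(address, keep=None)
        self.host_memory[address] = value

    def gpu_read(self, gpu_id, address):
        local = self.gpu_memory[gpu_id]
        if address in local:                         # hit in the GPU's own memory
            return local[address]
        self.host_accesses += 1                      # miss: fall back to host
        value = self.host_memory[address]
        local[address] = value
        self.directory.setdefault(address, set()).add(gpu_id)
        return value

    def gpu_write(self, gpu_id, address, value):
        # Assumption: write-through to host, invalidating other sharers.
        self._invalidate(address, keep=gpu_id)
        self.gpu_memory[gpu_id][address] = value
        self.host_memory[address] = value
        self.directory.setdefault(address, set()).add(gpu_id)

    def _invalidate(self, address, keep):
        sharers = self.directory.get(address, set())
        for gpu_id in list(sharers):
            if gpu_id != keep:
                self.gpu_memory[gpu_id].pop(address, None)
                sharers.discard(gpu_id)


umh = UnifiedMemorySketch(num_gpus=2)
umh.cpu_write(0x10, 7)
first = umh.gpu_read(0, 0x10)    # miss, served from host memory
second = umh.gpu_read(0, 0x10)   # hit in GPU 0's own memory
umh.gpu_write(1, 0x10, 9)        # invalidates GPU 0's copy via the directory
third = umh.gpu_read(0, 0x10)    # miss again, sees the new value
```

A relaxed protocol of the kind the text mentions would skip the invalidation step for data that does not need to be kept coherent; this sketch always enforces it for simplicity.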
- Prof. David Kaeli (Advisor)
- Prof. Yunsi Fei
- Prof. Ajay Joshi