GPUs have emerged as highly efficient many-core platforms for executing applications with massive thread-level parallelism. A GPU achieves high throughput by running many threads concurrently and switching between them rapidly to hide memory latency. With the introduction of general-purpose programming models such as CUDA and OpenCL, many applications have been ported to GPUs to accelerate compute-intensive kernels, achieving impressive speedups. However, the trend toward using GPUs for a diverse range of applications (e.g., vision and scientific computing) raises new challenges for both algorithm and architecture designers.
From the architecture perspective, one of the main challenges is inter-warp conflict in shared resources, including the instruction cache (I$), data cache (D$), and compute units. These conflicts are mainly caused by inter-warp divergence, i.e., uneven execution progress across concurrent warps. Excessive inter-warp divergence can prevent a GPU from reaching its peak throughput, which motivates approaches that manage inter-warp divergence and avoid I$ conflicts for divergence-sensitive benchmarks. From the algorithm perspective, the challenge is identifying algorithm-specific optimizations that maximize GPU utilization and enhance overall performance.
This thesis primarily focuses on the architecture perspective. At the architecture level, it quantitatively studies the benefits of inter-warp-divergence-aware execution on GPUs. To that end, the thesis first proposes a novel approach to quantifying inter-warp divergence by measuring the temporal similarity in the execution progress of concurrent warps, which we call Warp Progression Similarity (WPS). Based on the WPS metric, the thesis then proposes a WPS-aware Scheduler (WPSaS) to optimize GPU throughput.
The aim is to manage inter-warp divergence to hide memory access latency and to minimize resource conflicts and temporal under-utilization in the compute units, allowing GPUs to approach their peak throughput. Our results demonstrate that WPSaS improves throughput by 10% with a pronounced reduction in resource conflicts and temporal under-utilization.
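To make the notion of warp progression similarity concrete, the sketch below shows one hypothetical way such a metric could be computed from the execution progress (e.g., dynamic instruction counts) of concurrent warps; the thesis's actual WPS definition is not reproduced here, and the normalization formula is an illustrative assumption.

```python
# Illustrative sketch of a warp-progress-similarity style metric
# (hypothetical formula, NOT the thesis's actual WPS definition).

def warp_progress_similarity(progress):
    """Score in (0, 1]: 1.0 means all warps are at identical progress;
    lower values indicate more inter-warp divergence."""
    mean = sum(progress) / len(progress)
    if mean == 0:
        return 1.0
    # Mean absolute deviation of warp progress, normalized by the mean.
    mad = sum(abs(p - mean) for p in progress) / len(progress)
    return 1.0 / (1.0 + mad / mean)

print(warp_progress_similarity([100, 100, 100, 100]))  # no divergence -> 1.0
print(warp_progress_similarity([100, 40, 160, 20]))    # divergent warps -> lower score
```

A scheduler built on such a metric could, for instance, prioritize lagging warps whenever the similarity score drops below a threshold, pulling the warps' progress back together.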
From the algorithm perspective, this thesis demonstrates a GPU implementation of the Mixture of Gaussians (MoG) background subtraction algorithm that exceeds real-time processing rates at full HD resolution. The implementation applies both general and algorithm-specific optimizations to achieve a 101x speedup over a sequential implementation without impacting output quality.
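For readers unfamiliar with MoG background subtraction, the following is a simplified single-pixel update in the style of the classic Stauffer-Grimson formulation, written as sequential Python for clarity; the thesis's CUDA kernel and its specific optimizations are not reproduced here, and the parameter values (learning rate, match threshold, initial variance) are illustrative assumptions.

```python
# Simplified per-pixel Mixture of Gaussians update (Stauffer-Grimson
# style). Sequential illustration only; parameter values are assumed.

ALPHA = 0.05  # learning rate (assumed value)

def update_pixel(models, x):
    """models: list of (weight, mean, variance) Gaussians for one pixel.
    x: observed grayscale intensity. Returns (models, is_foreground)."""
    matched = None
    for i, (w, mu, var) in enumerate(models):
        if (x - mu) ** 2 < 6.25 * var:  # match within 2.5 standard deviations
            matched = i
            break
    if matched is None:
        # No Gaussian matches: replace the least-weighted one with a new
        # Gaussian centered at x, and classify the pixel as foreground.
        j = min(range(len(models)), key=lambda i: models[i][0])
        models[j] = (ALPHA, float(x), 900.0)  # high initial variance (assumed)
        return models, True
    # A Gaussian matched: its weight grows, the others decay; the matched
    # Gaussian's mean and variance move toward the observation.
    for i, (w, mu, var) in enumerate(models):
        w = (1 - ALPHA) * w + (ALPHA if i == matched else 0.0)
        if i == matched:
            mu = mu + ALPHA * (x - mu)
            var = var + ALPHA * ((x - mu) ** 2 - var)
        models[i] = (w, mu, var)
    # Background if the matched Gaussian carries enough evidence (weight).
    return models, models[matched][0] < 0.25
```

Because every pixel carries its own independent mixture, the per-pixel updates are embarrassingly parallel, which is what makes the algorithm a natural fit for a one-thread-per-pixel GPU mapping.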
Advisor: Professor Gunar Schirner
Professor Gunar Schirner
Professor David Kaeli
Professor Rafael Ubal