Performance Evaluation and Optimization Mechanisms for Interoperable Graphics and Computation on GPUs
We evaluate the performance and efficiency of the OpenCL-OpenGL (CL-GL) interoperability (interop) mode and explore different methods to improve the execution performance of CL-GL interop-based applications. We propose a slot-based rendering mechanism for CL-GL interop to increase the efficiency of the application.
Support for Cache Non-Coherence in Many-Core Architectures
Design and implementation of a cache coherence protocol
that includes a non-coherent state for shared, modified blocks. The
protocol allows a seamless transition between memory consistency models
that require coherence and those that do not.
Advanced Ultrasound with OpenCL
Implementation of filtering, motion compensation, and
other image processing algorithms for a real-time ultrasound system.
Improving Synchronization on GPUs
This research focuses on improving synchronization on GPUs with a hardware-based synchronization mechanism, in order to extend the applicability of GPUs to a broader class of general-purpose applications. Hierarchical Queuing Locks (HQL) is a hardware-based synchronization mechanism that favors blocking for efficiency and uses hierarchical queuing locks for scalability.
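As a rough software analogy for the blocking and queuing ideas above, the following sketch shows a flat FIFO ticket lock in Python. It is not the HQL hardware design: it has a single level (no hierarchy), runs on CPU threads, and the names `TicketLock` and `run_demo` are hypothetical and chosen only for illustration.

```python
import threading


class TicketLock:
    """Flat FIFO queue lock: waiters block and are served in arrival order."""

    def __init__(self):
        self._cond = threading.Condition()
        self._next_ticket = 0   # next ticket to hand out
        self._now_serving = 0   # ticket currently allowed to hold the lock

    def acquire(self):
        with self._cond:
            ticket = self._next_ticket
            self._next_ticket += 1
            # Block (rather than spin) until this ticket is served.
            while self._now_serving != ticket:
                self._cond.wait()
            return ticket

    def release(self):
        with self._cond:
            self._now_serving += 1
            self._cond.notify_all()  # wake waiters; only the next ticket proceeds


def run_demo(num_threads=4, increments=1000):
    """Increment a shared counter under the lock and return its final value."""
    lock = TicketLock()
    counter = [0]

    def worker():
        for _ in range(increments):
            lock.acquire()
            counter[0] += 1
            lock.release()

    threads = [threading.Thread(target=worker) for _ in range(num_threads)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return counter[0]
```

Calling `run_demo(4, 1000)` should return 4000 if mutual exclusion holds. A hierarchical scheme would, presumably, keep one such queue per level of the machine so that most hand-offs stay local.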
Improving SIMD Efficiency on GPUs
Design micro-architectural techniques to improve the efficiency of SIMD execution on GPUs.
Computed Tomography Image Segmentation on GPGPU Architecture
A general detection system comprises several operations: sensing, segmentation, feature extraction, detection, and post-processing. One of its bottlenecks is the image segmentation phase, which is time-consuming and can determine the accuracy of the final result. Image segmentation therefore remains a widely open research topic. My approach is to consider different algorithms that can later be parallelized to take advantage of the GPGPU architecture. Currently, I am focusing on graph-based and connected-component labeling (CCL) approaches; both techniques can be sped up using the OpenCL or CUDA frameworks.
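To make the CCL approach concrete, here is a minimal sequential sketch of two-pass labeling with union-find on a binary image. It assumes 4-connectivity and a list-of-lists image, and the function name `label_components` is hypothetical. It is a CPU reference only, not the GPU implementation used in this project.

```python
def label_components(image):
    """Two-pass, 4-connected labeling of a binary image (list of lists of 0/1).

    Returns (labels, count), where labels uses 0 for background and
    1..count for the connected components.
    """
    rows = len(image)
    cols = len(image[0]) if rows else 0
    labels = [[0] * cols for _ in range(rows)]
    parent = [0]  # union-find forest; index 0 is reserved for background

    def find(x):
        while parent[x] != x:
            parent[x] = parent[parent[x]]  # path halving
            x = parent[x]
        return x

    def union(a, b):
        ra, rb = find(a), find(b)
        if ra != rb:
            parent[max(ra, rb)] = min(ra, rb)

    # First pass: assign provisional labels and record equivalences.
    for r in range(rows):
        for c in range(cols):
            if not image[r][c]:
                continue
            up = labels[r - 1][c] if r > 0 else 0
            left = labels[r][c - 1] if c > 0 else 0
            if up == 0 and left == 0:
                parent.append(len(parent))      # create a new label
                labels[r][c] = len(parent) - 1
            else:
                labels[r][c] = min(l for l in (up, left) if l)
                if up and left:
                    union(up, left)

    # Second pass: replace each label by its root, renumbered compactly.
    remap = {}
    for r in range(rows):
        for c in range(cols):
            if labels[r][c]:
                root = find(labels[r][c])
                labels[r][c] = remap.setdefault(root, len(remap) + 1)
    return labels, len(remap)
```

The first pass carries a dependence on the pixels above and to the left, which is the part a GPU version would need to restructure, for example by labeling tiles independently and merging labels across tile borders.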
Analysis of Power-Performance Efficiency of Optimizations Applied to Heterogeneous Applications
The power-performance efficiency of different optimization techniques, such as coalesced memory access, local memory usage, and loop unrolling, is evaluated on heterogeneous devices. A fast Fourier transform (FFT) is used as the test application for evaluating power consumption on GPUs, APUs, and embedded SoCs with OpenCL support. The AMD Southern Islands GPU, AMD Fusion APU, Nvidia Kepler GPU, Intel Ivy Bridge APU, and Qualcomm Snapdragon SoC are evaluated for their power-performance efficiency.
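For readers unfamiliar with the test workload, the sketch below shows a textbook recursive radix-2 Cooley-Tukey FFT in Python. It is only a reference for what the kernel computes. The OpenCL kernels evaluated in this project, and the optimizations applied to them, are not reproduced here, and the function name `fft` is an illustrative choice.

```python
import cmath


def fft(x):
    """Recursive radix-2 FFT; len(x) must be a power of two."""
    n = len(x)
    if n == 0 or n & (n - 1):
        raise ValueError("length must be a power of two")
    if n == 1:
        return [complex(x[0])]
    even = fft(x[0::2])  # transform of even-indexed samples
    odd = fft(x[1::2])   # transform of odd-indexed samples
    out = [0j] * n
    for k in range(n // 2):
        twiddle = cmath.exp(-2j * cmath.pi * k / n) * odd[k]
        out[k] = even[k] + twiddle           # butterfly, upper half
        out[k + n // 2] = even[k] - twiddle  # butterfly, lower half
    return out
```

Each butterfly stage reads pairs of elements at a stride that changes with the stage, which is why memory access patterns (coalescing, local memory staging) matter for this workload on GPUs.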
Accelerating Phase Field Simulation
The phase-field model is a mathematical model for solving interfacial problems in physics (such as the growth of snowflakes). Phase-field approaches have long been restricted to the microscopic scale because of the small mesh size necessary to retain reasonable accuracy. Porting a Fortran/C implementation of the phase-field algorithm to the GPU makes it possible to reach experimentally relevant length and time scales. Compared with the CPU implementation, the GPU-accelerated phase-field algorithm achieved a 30x speedup on a single GPU.
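The kind of stencil update that dominates a phase-field solver can be sketched as follows. This example assumes the one-dimensional Allen-Cahn equation, d(phi)/dt = eps^2 * d2(phi)/dx2 - (phi^3 - phi), with explicit Euler time stepping and periodic boundaries. The specific equation, discretization, and function name are illustrative assumptions and not the model ported in this project.

```python
def allen_cahn_step(phi, dt, dx, eps):
    """Advance a 1D Allen-Cahn field by one explicit Euler step.

    phi: list of floats (order parameter on a periodic grid)
    dt, dx: time step and grid spacing
    eps: interface width parameter
    """
    n = len(phi)
    new_phi = [0.0] * n
    for i in range(n):
        # Second-order central difference with periodic wrap-around.
        lap = (phi[(i - 1) % n] - 2.0 * phi[i] + phi[(i + 1) % n]) / dx**2
        # Diffusion term minus the derivative of the double-well potential.
        new_phi[i] = phi[i] + dt * (eps**2 * lap - (phi[i] ** 3 - phi[i]))
    return new_phi
```

Every grid point is updated independently from the previous time level, so the loop over `i` maps naturally to one GPU work-item per cell. For this explicit scheme the diffusion term limits the stable time step to roughly dx^2 / (2 eps^2), which is one reason small mesh sizes make long simulations expensive.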
Scalar-Vector Opportunities on GPU Architectures
This research looks at scalar opportunities and scalar-vector GPU
architectures. I take advantage of scalar opportunities observed in
GPGPU workloads to improve performance and power efficiency on a novel
scalar-vector GPU architecture.
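One way to picture a scalar opportunity is the sketch below. It assumes that a vector instruction counts as a scalar opportunity when every active lane of a wavefront supplies identical source operands, so the result could be computed once on a scalar unit. This working definition, the trace format, and the function names are assumptions made for illustration.

```python
def is_scalar_opportunity(lane_operands, active_mask):
    """Return True if all active lanes carry the same source operands.

    lane_operands: one tuple of operand values per lane
    active_mask: one boolean per lane (True if the lane executes)
    """
    active = [ops for ops, on in zip(lane_operands, active_mask) if on]
    return bool(active) and all(ops == active[0] for ops in active)


def scalar_fraction(trace):
    """Fraction of instructions in a trace that are scalar opportunities.

    trace: list of (lane_operands, active_mask) pairs, one per instruction
    """
    if not trace:
        return 0.0
    hits = sum(1 for lanes, mask in trace if is_scalar_opportunity(lanes, mask))
    return hits / len(trace)
```

For example, an instruction whose four lanes all read `(3, 7)` qualifies, while one whose lanes read their own thread index does not. Note that divergence can create opportunities: lanes with differing operands do not count if they are masked off.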
The Valar benchmark is a new benchmark suite consisting of real-world
applications that effectively leverage heterogeneous devices.
The main characteristics of these benchmarks are their data-dependent
behavior and their usefulness for studying the interaction between
computation and data movement on different heterogeneous devices.
Profiling and Performance Analysis Tools
HAPTIC is a software framework that implements
analysis devices as an extension to OpenCL to specialize algorithms
and utilize multiple devices. Analysis devices are used
for inserting optimization and profiling into a
compute pipeline. HAPTIC supports discrete and fused platforms,
and we have used it to study the performance of analysis devices
when integrated into applications where host-device behavior is coupled.
Finite Difference Time Domain (FDTD) Acceleration Using GPUs
Porting a three-dimensional finite-difference
time-domain (FDTD) algorithm written in Fortran to GPUs using OpenCL.
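The structure of an FDTD update can be sketched in one dimension, which keeps the leapfrog pattern of the three-dimensional algorithm while staying short. The example below assumes normalized units, a Gaussian soft source, and fixed (perfectly conducting) boundaries. The function name and all parameter values are illustrative, and this is not the Fortran code being ported.

```python
import math


def fdtd_1d(num_cells=200, num_steps=100, courant=0.5, source_cell=100):
    """Run a 1D Yee-style FDTD simulation and return the (ez, hy) fields.

    ez has num_cells samples; hy has num_cells - 1 samples located
    halfway between neighboring ez samples.
    """
    ez = [0.0] * num_cells
    hy = [0.0] * (num_cells - 1)
    for step in range(num_steps):
        # Update the magnetic field from the spatial difference of ez.
        for i in range(num_cells - 1):
            hy[i] += courant * (ez[i + 1] - ez[i])
        # Update the interior electric field from the difference of hy.
        # The end cells are never updated, so ez stays 0 at the boundaries.
        for i in range(1, num_cells - 1):
            ez[i] += courant * (hy[i] - hy[i - 1])
        # Soft source: add a Gaussian pulse at one cell.
        ez[source_cell] += math.exp(-(((step - 30) / 10.0) ** 2))
    return ez, hy
```

Within each half step, every cell depends only on values from the other field, so each of the two inner loops can be expressed as one OpenCL kernel with a work-item per cell. The three-dimensional case applies the same idea to six field components.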