Computing platforms for high performance and parallel applications have evolved from traditional Central Processing Units (CPUs) to hybrid systems that combine CPUs with accelerators such as Graphics Processing Units (GPUs) and the Intel Xeon Phi. These developments pose increasing challenges to application developers, especially in maintaining a high performance application across various platforms. Traditional parallel programming methods are usually low-level: they focus on data distribution and communication, aiming to produce high performance, scalable applications. The resulting programs are closely tied to the underlying platform; once the platform changes or a program is ported to a different environment, substantial effort is needed to modify or rewrite it.
To reduce development effort and improve application portability, we need a programming method that hides low-level platform hardware features, easing the programming of parallel applications while maintaining good performance. In this research, we propose a lightweight and flexible parallel programming framework, Unified Tasks and Conduits (UTC), for heterogeneous computing platforms. The framework provides high-level program components, tasks and conduits, with which a user can easily construct parallel applications. In a program, computational workloads are abstracted as task objects, and tasks communicate with one another through conduit objects. Multiple tasks can run in parallel on different devices, and each task can launch a group of threads for execution. In this way, we separate an application's high-level structure from its low-level task implementations. When porting a parallel application to different computing resources or platforms, the application's main structure remains unchanged; only appropriate task implementations need to be adopted, easing the development effort. The explicit task components also make task and pipeline parallelism easy to express, while the multiple threads within each task can efficiently implement data parallelism and overlap computation with communication.
We have implemented a runtime system prototype of the Tasks and Conduits framework on a cluster platform, supporting both multicore CPUs and GPUs for task execution. To facilitate multi-threaded tasks, we implement a task-based global shared data object that allows a task to create threads across multiple nodes and share data sets through a one-sided remote memory access mechanism. For GPU tasks, we provide concise interfaces that let users choose the proper type of memory for host/device data transfer. To demonstrate and analyze our framework, we have adapted a set of benchmark applications to it. Experiments on real clusters show that applications built with our framework achieve similar or better performance than traditional parallel implementations such as OpenMP or MPI. We are also able to exploit GPUs on the platform for acceleration through GPU tasks. Based on our high-level tasks and conduits design, we can maintain a well-organized program structure with improved portability and maintainability.
- Professor Miriam Leeser (Advisor)
- Professor Stefano Basagni
- Professor Ningfang Mi