Breakthrough streaming applications such as virtual reality, augmented reality, autonomous vehicles, and multimedia demand high-performance and power-efficient computing. In response to this ever-increasing demand, manufacturers look beyond the parallelism available in Chip Multi-Processors (CMPs) and toward application-specific designs. In this regard, ACCelerator (ACC)-based heterogeneous CMPs (ACMPs) have emerged as a promising platform.
An ACMP combines application-specific HW ACCelerators (ACCs) with General Purpose Processor(s) (GPPs) on a single chip. The ACCs are customized to provide high-performance and power-efficient computing for specific compute-intensive functions, while the GPP(s) run the remaining functions and control the whole system. In ACMP platforms, ACCs achieve their performance and power benefits at the expense of reduced flexibility and generality for running different workloads.
Therefore, manufacturers must employ several ACCs to target a diverse set of workloads within a given application domain.
However, our observations show that conventional ACMP architectures with many ACCs have scalability limitations. The ACCs' benefits in processing power can be overshadowed by bottlenecks on shared resources: the processor core(s), the communication fabric/DMA, and the on-chip memory. These resource bottlenecks stem primarily from the ACCs' data-access and orchestration load. Because the semantics for communicating with ACCs are only loosely defined, and because ACMPs rely on general-purpose platform architectures, these resource bottlenecks severely hamper performance.
This dissertation explores and alleviates the scalability limitations of ACMPs. To this end, the dissertation first proposes an analytical model to holistically explore how bottlenecks emerge on shared resources as the number of ACCs increases. Afterward, it proposes ACMPerf, an analytical model that captures the impact of the resource bottlenecks on the achievable ACC benefits. Then, to open a path toward more scalable integration of ACCs, the dissertation identifies and formalizes ACC communication semantics. The semantics describe four primary aspects: data access, synchronization, data granularity, and data marshalling.
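To make the shared-resource bottleneck concrete, the general effect can be sketched as a toy roofline-style model: each ACC adds compute throughput linearly, but all ACCs contend for one fixed-bandwidth fabric, so aggregate throughput saturates. This is only an illustration of the phenomenon, not the dissertation's ACMPerf model; the function name and parameters are hypothetical.

```python
# Toy illustration (hypothetical, NOT the ACMPerf model): aggregate
# throughput of N ACCs that share one communication fabric.

def aggregate_throughput(n_accs, acc_tput, fabric_bw):
    """Items/s delivered by n_accs ACCs behind a single shared fabric."""
    compute_bound = n_accs * acc_tput  # scales linearly with ACC count
    fabric_bound = fabric_bw           # fixed ceiling of the shared resource
    return min(compute_bound, fabric_bound)

# With acc_tput=100 and fabric_bw=400, scaling stops beyond 4 ACCs:
for n in (1, 2, 4, 8, 16):
    print(n, aggregate_throughput(n, acc_tput=100, fabric_bw=400))
```

Beyond the saturation point, adding ACCs yields no benefit; this is the scalability wall the dissertation's models quantify far more precisely.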
Building on the identified ACC communication semantics, and improving upon conventional ACMP architectures, the dissertation proposes a novel architecture of Transparent Self-Synchronizing ACCs (TSS). TSS efficiently realizes the identified communication semantics for the direct ACC-to-ACC connections that frequently occur in streaming applications.
By reducing the overhead of direct ACC-to-ACC connections, TSS delivers more of the ACCs' benefits than conventional ACMP architectures: up to 130x higher throughput and 209x lower energy, all resulting from up to 78x less load imposed on the shared resources.
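The core idea of self-synchronizing ACC-to-ACC connections can be sketched in software as a pipeline of stages that synchronize on data availability in their input queues, rather than waiting for a central processor to orchestrate every transfer. The stage function and queue wiring below are hypothetical illustrations of the concept, not the TSS hardware design.

```python
import queue
import threading

def acc_stage(fn, q_in, q_out):
    """Model of one 'ACC': consume from q_in, produce to q_out.

    Synchronization is implicit in the blocking queue operations,
    so no central orchestrator is needed per data item."""
    while True:
        item = q_in.get()
        if item is None:          # end-of-stream marker
            q_out.put(None)
            return
        q_out.put(fn(item))       # hand off directly to the next stage

# Two chained "ACCs": add-one, then double.
q0, q1, q2 = queue.Queue(), queue.Queue(), queue.Queue()
threads = [
    threading.Thread(target=acc_stage, args=(lambda x: x + 1, q0, q1)),
    threading.Thread(target=acc_stage, args=(lambda x: x * 2, q1, q2)),
]
for t in threads:
    t.start()

# The "GPP" only injects the stream once, instead of mediating each hop.
for item in [1, 2, 3, None]:
    q0.put(item)

results = []
while (item := q2.get()) is not None:
    results.append(item)
for t in threads:
    t.join()
print(results)  # [4, 6, 8]
```

The design point this models: once producers and consumers synchronize directly, the processor and shared fabric are removed from the per-item critical path, which is the source of TSS's reduction in shared-resource load.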
- Professor Gunar Schirner (Advisor)
- Professor David Kaeli
- Professor Yunsi Fei
- Professor Hamed Tabkhi