Dissertation Defense

Algorithm-Architecture Co-Design for Domain-Specific Accelerators

Yaoyu Tao

The exploding demand to transmit information efficiently and to learn useful knowledge from it has stimulated domain-specific breakthroughs in communication and AI algorithms. However, the path to high-efficiency domain-specific hardware systems is challenging. On one hand, new paradigms in communication and AI are often computationally expensive. On the other hand, hardware requirements (e.g., speed, power, and cost) are key factors hindering mass commercial deployment. Despite these obstacles, algorithmic advancements have revealed rich opportunities for algorithm-architecture co-design, driving new innovations in domain-specific accelerators.

This dissertation presents six algorithm-architecture co-design solutions for 1) channel coding, including polar and low-density parity-check (LDPC) decoders, and 2) neural networks, including the differentiable neural computer (DNC) and neural ordinary differential equations (NODE). The first work is a successive-cancellation list (SCL) polar decoder using a split-tree algorithm. The decoder chip achieves 3.25 Gb/s with 13.2 pJ/b energy efficiency, outperforming prior work by orders of magnitude. The second work is an LDPC post-processor inspired by the simulated annealing algorithm. The FPGA prototype achieves an error rate two orders of magnitude lower. The third work is a nonbinary LDPC decoder with early termination and fine-grained dynamic clock gating. The decoder chip delivers 1.22 Gb/s throughput and 3.03 nJ/b energy efficiency, a speedup of more than 5x over prior work. The fourth work is HiMA, a tiled, history-based memory access engine that incorporates a traffic-aware multi-mode network-on-chip (NoC), an optimal submatrix-based memory partition, a two-stage usage sort, and a distributed DNC (DNC-D) model. It achieves up to 39.1x higher speed than the state-of-the-art memory-augmented neural network (MANN) accelerator. The fifth work is a NODE training accelerator that uses adaptive neural activation sparsity to reduce computation by up to 80%. Its hardware efficiency is further enhanced by reconfigurable interconnects, hardware reuse, and hierarchical memory. The sixth work is a DNC-aided SCL flip algorithm that demonstrates up to 0.34 dB coding gain improvement or 54.2% latency reduction compared to prior work. A two-phase decoding flow is developed with new state/action encoding and training methods.

Chair: Professor Zhengya Zhang