DONNA: Distributed Optimized Neural Network Allocation on CIM-Based Heterogeneous Accelerators

Abstract

The continued development of neural network architectures drives ever-growing demand for computing power. While data center scaling continues, inference away from the cloud will increasingly rely on distributed inference across multiple devices. Most prior efforts have focused on optimizing single-device inference or partitioning models to improve throughput, while energy consumption is becoming an increasingly important consideration. This work proposes a framework that searches for optimal model splits and distributes the partitions across a combination of devices, jointly optimizing inference throughput and energy consumption. Participating devices are strategically grouped into homogeneous and heterogeneous clusters consisting of general-purpose CPU and GPU architectures, as well as emerging Compute-In-Memory (CIM) accelerators. The framework demonstrates up to 4× speedup with approximately 4× per-device energy reduction in a heterogeneous setup compared to single-GPU inference, and its search algorithm traces a smooth Pareto-like curve in the energy-throughput space for CIM devices.
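To make the search concrete, here is a minimal Python sketch (not the paper's implementation) of the underlying idea: enumerate device assignments for a fixed set of model partitions, estimate pipeline throughput and total energy for each, and keep the Pareto-optimal points in the energy-throughput space. The device cost tables, layer costs, and function names are invented for illustration; a real framework would draw these from profiling.

```python
from itertools import product

# Hypothetical per-partition cost models (latency in ms, energy in mJ)
# for each device class; real values would come from profiling.
DEVICES = {
    "cpu": {"latency": 12.0, "energy": 8.0},
    "gpu": {"latency": 3.0,  "energy": 20.0},
    "cim": {"latency": 5.0,  "energy": 2.0},
}

def evaluate(assignment, layer_costs):
    """Estimate pipeline throughput and total energy for one split.

    assignment: tuple of device names, one per partition.
    layer_costs: relative compute cost of each partition.
    """
    stage_latency = [c * DEVICES[d]["latency"] for d, c in zip(assignment, layer_costs)]
    stage_energy = [c * DEVICES[d]["energy"] for d, c in zip(assignment, layer_costs)]
    # Pipelined throughput is bounded by the slowest stage.
    throughput = 1.0 / max(stage_latency)
    return throughput, sum(stage_energy)

def pareto_front(candidates):
    """Keep points not dominated in (throughput up, energy down)."""
    front = []
    for a, (t, e) in candidates:
        dominated = any(t2 >= t and e2 <= e and (t2, e2) != (t, e)
                        for _, (t2, e2) in candidates)
        if not dominated:
            front.append((a, (t, e)))
    return front

layer_costs = [1.0, 2.0, 1.5]  # three partitions of the model
candidates = [(a, evaluate(a, layer_costs))
              for a in product(DEVICES, repeat=len(layer_costs))]
for assignment, (thr, energy) in pareto_front(candidates):
    print(assignment, f"throughput={thr:.3f}/ms energy={energy:.1f}mJ")
```

Under these illustrative costs, the surviving assignments trade throughput against energy, with CIM-heavy splits occupying the low-energy end of the front, mirroring the Pareto-like behavior reported in the abstract.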

Publication
In IEEE International Conference on Edge Computing and Communications (EDGE)
Suhaib A. Fahmy
Associate Professor of Computer Science

Suhaib is Principal Investigator of the Accelerated Connected Computing Lab (ACCL) at KAUST. His research explores hardware acceleration of complex algorithms and the integration of these accelerators within wider computing infrastructure.
