DONNA: Distributed Optimized Neural Network Allocation on CIM-Based Heterogeneous Accelerators

The development of neural network architectures continues to drive demand for computing power. While data center scaling continues, inference away from the cloud will increasingly rely on distributed inference on multiple devices. Most …

Split DNN Inference for Exploiting Near-Edge Accelerators

The deployment of increasingly complex deep learning models for inference in real-world settings requires dealing with the constrained computational capabilities of edge devices. Splitting inference between edge and cloud has been proposed to …

High Throughput Massive MIMO Signal Decoding Using Multi-Level Tree Search on FPGAs

Supporting the evolution of wireless communication beyond 5G using high-performance networks requires massive device connectivity. Massive Multiple-Input Multiple-Output (MIMO) systems have been used and proven to increase the data throughput of …

Exploring FPGA Acceleration for Distributed Serverless Computing

Serverless computing has become a popular cloud computing paradigm. However, its deployment abstraction entails significant performance overheads. We explore the potential for enabling serverless computing on FPGAs and present some early results that …

REFL: Resource-Efficient Federated Learning

Federated Learning (FL) enables distributed training by learners using local data, thereby enhancing privacy and reducing communication. However, it presents numerous challenges relating to the heterogeneity of the data distribution, device …

Signal Detection for Large MIMO Systems Using Sphere Decoding on FPGAs

Wireless communication systems rely on aggressive spatial multiplexing Multiple-Input Multiple-Output (MIMO) access points to enhance network throughput. A significant computational hurdle for large MIMO systems is signal detection and decoding, …

High Throughput Multidimensional Tridiagonal System Solvers on FPGAs

We present a high performance tridiagonal solver library for Xilinx FPGAs optimized for multiple multi-dimensional systems common in real-world applications. An analytical performance model is developed and used to explore the design space and obtain …

FPGA Acceleration of Structured-Mesh-Based Explicit and Implicit Numerical Solvers using SYCL

We explore the design and development of structured-mesh-based solvers on Intel FPGA hardware using the SYCL programming model. Two classes of applications are targeted: (1) stencil applications based on explicit numerical methods and (2) …

Heterogeneous Communication Virtualization for Distributed Embedded Applications

Distributed embedded applications (DEAs) are typically implemented on diverse embedded nodes interconnected through communication network(s) to exchange data and control information to achieve the desired functionality. Conventional approaches of …

StressBench: A Configurable Full System Network and I/O Benchmark Framework

We present StressBench, a network benchmarking framework written for testing MPI operations and file I/O concurrently. It is designed specifically to execute MPI communication and file access patterns that are representative of real-world scientific …