The evolution of wireless communication beyond 5G calls for high-performance networks that support massive device connectivity. Massive Multiple-Input Multiple-Output (MIMO) systems have been shown to increase the data throughput of wireless links. However, scaling such systems to a large number of antennas and a high modulation order incurs a significant computational cost for signal decoding with conventional non-linear decoders. Heuristic tree-search-based approaches have been proposed to address this challenge and to meet real-time decoding requirements for large MIMO configurations. FPGAs represent an ideal platform for accelerating MIMO signal decoding on account of their low latency, low power consumption, and potential for integration within the signal processing chain. This paper presents a software/hardware co-design for a multi-level tree search approach that integrates the computation of multiple tree levels. The proposed heuristic transforms the tree search process into a streaming operation well suited to the FPGA architecture. We present a series of hardware design and algorithmic optimizations that significantly improve scalability and decoding time, resulting in a design capable of decoding 64×64 64-QAM MIMO within the 10 ms real-time requirement.
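To make the idea of heuristic tree-search decoding concrete, the following is a minimal sketch of a generic breadth-first (K-best) search. It is not the multi-level heuristic proposed in this paper. It assumes a real-valued system model in which the channel matrix has already been reduced to an upper-triangular matrix `R` (for example by QR decomposition), so that each tree level corresponds to one transmitted symbol. The function name `k_best_decode` and its parameters are hypothetical, chosen only for illustration.

```python
def k_best_decode(R, y, constellation, k):
    """Approximately minimize ||y - R s||^2 for upper-triangular R.

    Illustrative K-best search (an assumption, not the paper's method):
    at each tree level, every surviving partial path is expanded with all
    constellation symbols, and only the k lowest-cost paths are kept.
    """
    n = len(y)
    # Each survivor is (partial_cost, symbols for layers level..n-1).
    survivors = [(0.0, [])]
    # Walk the tree from the last layer (root) to the first (leaves).
    for level in range(n - 1, -1, -1):
        expanded = []
        for cost, tail in survivors:
            # Contribution of already-decided layers level+1..n-1.
            interference = sum(
                R[level][level + 1 + j] * s for j, s in enumerate(tail)
            )
            for symbol in constellation:
                residual = y[level] - interference - R[level][level] * symbol
                expanded.append((cost + residual * residual, [symbol] + tail))
        # Prune: keep only the k best partial paths for the next level.
        expanded.sort(key=lambda candidate: candidate[0])
        survivors = expanded[:k]
    best_cost, best_symbols = survivors[0]
    return best_symbols, best_cost
```

Because every level performs the same expand, score, and prune steps on a bounded list of survivors, a search of this shape has a fixed per-level workload, which is the kind of regularity that lends itself to a streaming hardware pipeline. Under the real-valued assumption used here, a 64-QAM symbol would be represented as two 8-PAM symbols drawn from {-7, -5, ..., 7}.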