SISA: A Scale-In Systolic Array for GEMM Acceleration
arXiv:2603.29913v1 (announce type: cross)
Abstract: The currently dominant AI/ML workloads, such as Large Language Models (LLMs), rely on the efficient execution of General Matrix-Matrix Multiplication (GEMM) operations. Thus, most systems are equipped with dedicated matrix hardware accelerators based on square Systolic Arrays (SAs) of Processing Elements (PEs). While this organization was effective for traditional Deep Neural Networks (DNNs), LLMs introduce input-dependent and highly skewed matrices, leading to underutilized SA resources. To address this challenge, we propose SISA (Scale-In Systolic Array), a novel SA architecture that partitions the traditional square array into horizontal rectangular slabs. With minimal overhead, SISA exposes parallelism through independently scheduled slabs for efficient execution of small or skewed matrix shapes, while retaining full-array operation for large GEMMs. SISA achieves up to 8.52x speedup and 93% energy-delay-product (EDP) reduction for representative LLMs compared to a state-of-the-art monolithic SA with the same number of PEs.
Introduction
Large Language Models (LLMs) are now among the dominant AI/ML workloads, and their execution time is largely spent in General Matrix-Matrix Multiplication (GEMM). For that reason most systems include dedicated matrix hardware accelerators, and how well those accelerators handle the GEMM shapes that LLMs produce directly affects performance and energy use.
Systolic Arrays and Their Limitations
Matrix accelerators are typically built as square Systolic Arrays (SAs) of Processing Elements (PEs). This organization was effective for traditional Deep Neural Networks (DNNs), but LLMs introduce input-dependent and highly skewed matrices. When one matrix dimension is much smaller than the array, many PEs have no work to do, which lowers both performance and energy efficiency.
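To make the underutilization concrete, the sketch below estimates what fraction of a square array does useful work on a skewed GEMM. It assumes a simple output-stationary mapping in which each PE holds one output element and the M x N output matrix is tiled onto the array. The mapping, the function name `spatial_utilization`, and the 128 x 128 array size are illustrative assumptions, not details taken from the SISA paper.

```python
from math import ceil


def spatial_utilization(m: int, n: int, rows: int, cols: int) -> float:
    """Fraction of PE slots holding a real output element.

    Assumed model (not from the paper): the m x n output is cut into
    tiles of rows x cols, one tile per pass, one output element per PE.
    Partially filled tiles still occupy the whole array for that pass.
    """
    tiles = ceil(m / rows) * ceil(n / cols)
    return (m * n) / (tiles * rows * cols)


# A skewed output (16 rows, e.g. a small batch of tokens) on a 128 x 128 array.
skewed = spatial_utilization(16, 4096, 128, 128)
# A large square output on the same array.
square = spatial_utilization(4096, 4096, 128, 128)
print(f"skewed: {skewed:.3f}, square: {square:.3f}")
```

Under these assumptions the 16 x 4096 output occupies only 16 of the 128 rows in every pass, while the 4096 x 4096 output fills the array completely.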
Introducing SISA
The Scale-In Systolic Array (SISA) addresses this limitation by partitioning the traditional square array into horizontal rectangular slabs. The slabs can be scheduled independently, which exposes parallelism for small or skewed matrix shapes, while the array can still operate as a single unit for large GEMMs. The authors describe the added overhead as minimal.
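The effect of slab partitioning can be illustrated with a toy pass-counting model. The sketch below is not the paper's scheduler: it assumes an output-stationary mapping, equal-height slabs that each process one output tile per pass, and passes of roughly equal duration. The slab count of 8, the 128 x 128 array, and all function names are hypothetical choices made for the illustration.

```python
from math import ceil


def passes_full_array(m: int, n: int, rows: int, cols: int) -> int:
    """Passes needed when the whole rows x cols array handles one tile at a time."""
    return ceil(m / rows) * ceil(n / cols)


def passes_slabbed(m: int, n: int, rows: int, cols: int, num_slabs: int) -> int:
    """Passes needed when the array is split into num_slabs horizontal slabs.

    Each slab is (rows // num_slabs) x cols and works on its own tile,
    so num_slabs tiles are processed concurrently in each pass.
    """
    slab_rows = rows // num_slabs
    tiles = ceil(m / slab_rows) * ceil(n / cols)
    return ceil(tiles / num_slabs)


def choose_mode(m: int, n: int, rows: int, cols: int, num_slabs: int):
    """Pick slab mode only when it needs strictly fewer passes than full-array mode."""
    full = passes_full_array(m, n, rows, cols)
    slab = passes_slabbed(m, n, rows, cols, num_slabs)
    return ("slabs", slab) if slab < full else ("full", full)


print(choose_mode(16, 4096, 128, 128, 8))    # skewed output
print(choose_mode(4096, 4096, 128, 128, 8))  # large square output
```

In this model the skewed 16 x 4096 output benefits from slab mode because eight 16 x 128 slabs each process a different column tile in parallel, whereas the large square output gains nothing from splitting and stays in full-array mode. Real behavior also depends on factors the model ignores, such as operand bandwidth and pipeline fill and drain time.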
Performance Metrics
The abstract reports two headline results for representative LLMs, both measured against a state-of-the-art monolithic SA with the same number of PEs:
- Speedup: up to 8.52x.
- Energy-Delay Product (EDP) Reduction: 93%. Because EDP is the product of energy and execution time, a lower value reflects the combined effect of both.
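For readers unfamiliar with the metric, EDP is energy multiplied by delay, and a reduction is expressed relative to the baseline EDP. The sketch below shows the arithmetic with made-up numbers; the function names and the example values are hypothetical and are not measurements from the paper.

```python
def edp(energy_joules: float, delay_seconds: float) -> float:
    """Energy-delay product: lower is better."""
    return energy_joules * delay_seconds


def edp_reduction(e_base: float, d_base: float, e_new: float, d_new: float) -> float:
    """Fractional EDP reduction of a new design relative to a baseline."""
    return 1.0 - edp(e_new, d_new) / edp(e_base, d_base)


# Hypothetical example: same energy, half the execution time.
print(edp_reduction(e_base=10.0, d_base=2.0, e_new=10.0, d_new=1.0))
```

By this definition, a 93% reduction means the new design's EDP is 7% of the baseline's. The abstract does not break the figure down into its energy and delay components.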
Conclusion
In summary, SISA targets a specific weakness of square SAs: poor PE utilization on the small and skewed GEMMs that LLMs produce. By partitioning the array into independently scheduled horizontal slabs while retaining full-array operation for large GEMMs, it achieves up to 8.52x speedup and 93% EDP reduction over a state-of-the-art monolithic SA with the same number of PEs.
