CUDA Tile Performance on Hopper & Blackwell GPUs for AI

Evaluating CUDA Tile for AI Workloads on Hopper and Blackwell GPUs

NVIDIA’s CUDA Tile (CuTile) is a Python-based, tile-centric abstraction for GPU kernel development. It aims to simplify kernel programming while still exploiting the Tensor Cores and the Tensor Memory Accelerator (TMA) on modern GPUs. Recent research, detailed in arXiv:2604.23466v1, presents the first independent, cross-architecture evaluation of CuTile against established approaches (cuBLAS, Triton, WMMA, and raw SIMT) on three NVIDIA GPUs spanning the Hopper and Blackwell architectures: H100 NVL, B200, and RTX PRO 6000 Blackwell Server Edition.

The evaluation benchmarks representative AI workloads in BF16/FP16 precision, including General Matrix Multiply (GEMM), fused multi-head attention, and end-to-end large language model (LLM) inference. The goal is to assess both performance and portability across architectures.
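To make throughput figures such as TFLOP/s concrete, here is a minimal sketch of how they are commonly derived from a measured kernel time. The FLOP-counting conventions below (2·M·N·K for GEMM, 4·B·H·S²·D for non-causal attention forward) are the usual ones and are an assumption here; this summary does not state which convention the paper uses. The function names, the matrix shape, and the timing are hypothetical.

```python
def gemm_flops(m, n, k):
    """FLOPs for C[m, n] = A[m, k] @ B[k, n].

    Each of the m*n outputs needs k multiplies and k adds.
    """
    return 2 * m * n * k


def attention_flops(batch, heads, seq_len, head_dim):
    """Forward FLOPs for non-causal attention (softmax cost ignored).

    Q @ K^T and P @ V each cost 2 * seq_len^2 * head_dim per head.
    """
    return 4 * batch * heads * seq_len * seq_len * head_dim


def tflops(flops, seconds):
    """Convert a FLOP count and a wall-clock time into TFLOP/s."""
    return flops / seconds / 1e12


# Hypothetical example: an 8192^3 GEMM measured at 2.0 ms.
# These numbers are illustrative only and are not taken from the paper.
example = tflops(gemm_flops(8192, 8192, 8192), 2.0e-3)
print(f"{example:.1f} TFLOP/s")
```

In practice the time would come from GPU-side timers averaged over many iterations after a warm-up; the arithmetic stays the same.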

Key Findings

Workload and Architecture Dependence: CuTile’s effectiveness varies significantly with the workload and the GPU architecture.
Performance on Blackwell: On the datacenter-class Blackwell GPU (B200), CuTile reaches up to 1007 TFLOP/s for fused attention, 2.5x the throughput of FlashAttention-2, with only 60 lines of Python kernel code.
Comparison with cuBLAS: For GEMM, CuTile reaches 52-79% of cuBLAS performance with only 22 lines of code, compared to 123 lines for WMMA. This makes CuTile a practical alternative to hand-written CUDA kernels, although it does not yet match the optimization level of vendor libraries.
Cross-Architecture Optimization Gaps: The same CuTile attention kernel reaches only 53% of FlashAttention-2 throughput on the RTX PRO 6000 (sm_120), which highlights a significant cross-architecture optimization gap.
Portability of Triton: In contrast, Triton is more portable, achieving 62-101% of cuBLAS performance across all tested platforms without architecture-specific tuning.
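The relative figures above can be turned into a few derived numbers with simple arithmetic. The sketch below is illustrative only: the variable names are hypothetical, and because the 1007 TFLOP/s figure is an "up to" value, the implied FlashAttention-2 baseline is a rough estimate rather than a number reported in the study.

```python
# Figures quoted in the findings above.
CUTILE_ATTN_B200_TFLOPS = 1007.0   # peak fused-attention throughput on B200
SPEEDUP_OVER_FA2_B200 = 2.5        # CuTile vs FlashAttention-2 on B200
CUTILE_GEMM_LOC = 22               # lines of code, CuTile GEMM
WMMA_GEMM_LOC = 123                # lines of code, WMMA GEMM

# Implied FlashAttention-2 throughput on B200 (an estimate, not a reported value).
implied_fa2_b200 = CUTILE_ATTN_B200_TFLOPS / SPEEDUP_OVER_FA2_B200

# How many times longer the WMMA kernel is than the CuTile kernel.
loc_ratio = WMMA_GEMM_LOC / CUTILE_GEMM_LOC

print(f"Implied FlashAttention-2 on B200: ~{implied_fa2_b200:.0f} TFLOP/s")
print(f"WMMA GEMM is ~{loc_ratio:.1f}x longer than the CuTile GEMM")
```

Read this way, the code-size advantage over WMMA is between five and six times, while the attention result shows how far the same kernel can swing from one GPU to another.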

Conclusion

The evaluation suggests that NVIDIA’s CUDA Tile is a promising way to simplify AI kernel development on modern GPUs. CuTile shows potential, particularly on the Blackwell architecture, but its performance depends heavily on the workload and the GPU architecture in use. Future research may focus on closing the cross-architecture optimization gaps identified in the study and on making CuTile more competitive with established tools such as cuBLAS and Triton. As the AI landscape evolves, tools like CuTile could play an important role in streamlining GPU programming.

RichlyAI Blog AI Guide, Tutorials, Industrial Insights, & more!
