Multi-Tile Matrix Multiplication with Adder-Tree ==================================================== In the Simple Multi-Tile Matrix Multiplication Example (16x16x16), the entire inner dimension (K) of the matrix multiplication is kept intact and not split across tiles. Here, K represents the inner dimension in an NxKxM multiplication, where A is MxK and B is KxN. Keeping the inner dimension whole allows each kernel to perform matrix multiplication independently, without needing to sum results between kernels. This is efficient for small matrices, but for large K, splitting the inner dimension and accumulating results between kernels helps manage memory and compute resources. There are several ways to combine outputs between kernels. One approach is to use the cascade stream to pass data directly between accumulator registers of neighboring kernels. Here, we present a different approach. In this example, matrix multiplication tiles are placed next to a central addition kernel (``add_tree.cpp``), which reads output buffers from its neighbors and accumulates the partial results. This design is inspired by MaxEva, a framework which maximizes matrix multiplication across tiles. [1]_ .. figure:: image/add_tree.png :alt: Adder-tree tiling scheme :width: 600px :align: center This adder-tree method is demonstrated with an int16 4x128x128 matrix multiplication spread across 4 tiles. The matrix multiplication kernel (``mmul_skinny.cpp``) uses ``aie::mac`` and ``aie::mul`` intrinsics for flexibility. The K dimension must be a multiple of VEC. VEC is the vectorization factor, and VEC = 16 in our case because of length limits on the default 48-bit accumulator register for int16. File structure: :: . ├── graph.h # Declaration of the graph ├── graph.cpp # Initializes, runs, and ends the graph. ├── kernels │ └── mmul.cpp # matmul kernel implementation │ └── add_tree.cpp # addition kernel implementation └── kernels.h # declarations of kernel Kernel Code ************* The multiplication kernel is called ``mmul_skinny`` because it operates on 4x32x128 dimensions, where the A matrix is skinny with a large K. The global variables N = 4, M = 128, K = 128 set the multiplication size. The variable K_Tile represents the portion of the inner dimension handled by each of the 4 tiles, so K_Tile = K/4. When multiplying two int16 vectors with ``aie::mac``, results are stored in a 48-bit accumulator. These are cast back to int16 vectors using ``to_vector()``, which automatically saturates the data to prevent overflow. This approach uses kernel classes with attributes for configuration. The graph code instantiates and runs these kernel classes, passing the appropriate parameters. The ``a_block`` parameter in ``mmul_skinny`` acts like a thread ID, determining which part of the A matrix each kernel processes. The B matrix is split into 4x128 blocks, so no ``b_block`` parameter is needed. While previous designs used scalar port streaming for this, the class-based approach sets these parameters at compile time. ``kernels.h``: .. code-block :: cpp #include using namespace adf; // Matrix multiplication kernel for tiled processing class mmul_skinny { int K; // Inner dimension (cols of A / rows of B) int M; // Columns of output matrix (C) int T; // Number of tiles int a_block; // Tile/block ID for parallel execution public: mmul_skinny(int K_val, int M_val, int T_val, int id) : K(K_val), M(M_val), T(T_val), a_block(id) {} void run(adf::input_buffer& a_buf, adf::input_buffer& b_buf, adf::output_buffer& c_buf); // ADF framework registration (exposes parameters to toolchain) static void registerKernelClass() { REGISTER_FUNCTION(mmul_skinny::run); REGISTER_PARAMETER(K); REGISTER_PARAMETER(M); REGISTER_PARAMETER(T); REGISTER_PARAMETER(a_block); } }; // 4-input adder tree for output matrix columns (M) class add_tree_4 { int M; // Columns of output matrix public: add_tree_4(int M_val) : M(M_val) {} void run( adf::input_buffer& in0, adf::input_buffer& in1, adf::input_buffer& in2, adf::input_buffer& in3, adf::output_buffer& out); static void registerKernelClass() { REGISTER_FUNCTION(add_tree_4::run); REGISTER_PARAMETER(M); } }; ``mmul.cpp``: .. code-block:: cpp #include #include #include "include.h" #include "kernels.h" using namespace adf; void mmul_skinny::run( input_buffer& a_buf, input_buffer& b_buf, output_buffer& c_buf) { // Set hardware-accelerated rounding/saturation modes aie::set_rounding(aie::rounding_mode::symmetric_zero); aie::set_saturation(aie::saturation_mode::saturate); const int K_Tile = K/T; // Tiling factor for dimension K // Offset A-matrix pointer for parallel block processing auto a_iter = aie::begin_vector(a_buf) + K_Tile/VEC*a_block; auto c_iter = aie::begin(c_buf); // Outer loop over N - batch dimension for (int n = 0; n < N; ++n) { auto b_iter = aie::begin_vector(b_buf); // M dimension - rows of output matrix for (int m = 0; m < M; ++m) { aie::accum acc = aie::zeros(); aie::vector a_vec; aie::vector b_vec; // Inner K_Tile loop unrolled by VEC (vectorization factor) for (int k = 0; k < K_Tile/VEC; ++k) { a_vec = *a_iter++; // Vector load from A b_vec = *b_iter++; // Vector load from B // First multiply, subsequent MACs (fused multiply-accumulate) acc = (k == 0) ? aie::mul(a_vec, b_vec) : aie::mac(acc, a_vec, b_vec); } a_iter -= K_Tile/VEC; // Rewind A pointer for next M iteration // Convert 48b accumulator to 16b with saturation aie::vector res_vec16 = acc.to_vector(); // Horizontal sum of vector results int16 res = aie::reduce_add(res_vec16); *c_iter++ = res; // Store scalar result } a_iter += K/VEC; // Advance A pointer to next N block } } Here is the kernel for the adder tree. It reads 4 input buffers from neighboring tiles and sums them together. ``add_tree.cpp``: .. code-block :: cpp #include #include #include "include.h" #include "kernels.h" using namespace adf; // Assume N*M is multiple of VEC for vector alignment void add_tree_4::run( input_buffer& in0, // North neighbor input_buffer& in1, // East neighbor input_buffer& in2, // South neighbor input_buffer& in3, // West neighbor output_buffer& out ) { // Vector iterators for VEC-element parallel processing auto in0_iter = aie::begin_vector(in0); auto in1_iter = aie::begin_vector(in1); auto in2_iter = aie::begin_vector(in2); auto in3_iter = aie::begin_vector(in3); auto out_iter = aie::begin_vector(out); const int total_vectors = (N * M) / VEC; // Process all elements in vector chunks for(int i = 0; i < total_vectors; ++i) { aie::vector v0 = *in0_iter++; aie::vector v1 = *in1_iter++; aie::vector v2 = *in2_iter++; aie::vector v3 = *in3_iter++; // Vector addition with saturation aie::vector sum = aie::add(aie::add(v0, v1), aie::add(v2, v3)); *out_iter++ = sum; } } Graph Code *************** Each kernel is carefully mapped to a specific adjacent tile to ensure direct read/write buffer access between the addition tile and the multiplication tiles. Without direct buffering between adjacent tiles, bandwidth may be lowered when data is forced to be streamed through the 32 bit AXI4 interface. Note the change in syntax when calling the kernels. The kernels are called with the correct values for the class wrapper parameters. K = 128, M = 128, T = 4, and a_block is set to the tile ID (0-3). ``graph.hpp``: .. code-block :: cpp #include #include "kernels.h" #include #include "include.h" using namespace adf; // Graph for 4-tile, 128x128 matrix multiplication class mmul_4x128x128 : public adf::graph { private: const unsigned int K = 128; // Inner dimension const unsigned int M = 128; // Output columns const unsigned int T = 4; // Number of tiles public: kernel mmul[4]; // 4 parallel matrix multiplication kernels kernel add; // 4-input adder tree kernel input_plio in_A; // Input for matrix A input_plio in_B[4]; // 4 inputs for matrix B tiles output_plio out_C; // Output for matrix C mmul_128x128() { // Create input and output streams in_A = input_plio::create(plio_128_bits, "data/A_matrix.txt"); out_C = output_plio::create(plio_128_bits, "data/C_output.txt"); // Create 4-input adder tree kernel for final accumulation add = kernel::create_object(M); source(add) = "src/kernels/add_tree.cpp"; runtime(add) = 1.0; dimensions(add.out[0]) = {N*M}; // Output is full matrix dimensions(add.in[4]) = {M}; // Each input is a column block connect(add.out[0], out_C.in[0]); // Connect adder output to final output // Instantiate and connect 4 mmul kernels, each handling a tile for (unsigned int i = 0; i < N; ++i) { dimensions(add.in[i]) = {N*M}; in_B[i] = input_plio::create(plio_128_bits, "data/B_"+std::to_string(i)+ ".txt"); // Each mmul kernel gets its tile/block ID as 'i' mmul[i] = kernel::create_object(K, M, T, i); runtime(mmul[i]) = 1.0; // Set input/output buffer shapes for each kernel dimensions(mmul[i].in[0]) = {N*K}; // A: full row block dimensions(mmul[i].in[1]) = {M*(K/T)}; // B: tile (partitioned K) dimensions(mmul[i].out[0]) = {N*M}; // Output: full matrix block // Connect data streams connect(in_A.out[0], mmul[i].in[0]); // Broadcast A to all mmuls connect(in_B[i].out[0], mmul[i].in[1]); // Unique B for each tile connect(mmul[i].out[0], add.in[i]); // Each mmul feeds one adder input source(mmul[i]) = "src/kernels/mmul.cpp"; } // Map kernels to hardware tiles for parallel execution location(add) = tile(0, 1); location(mmul[0]) = tile(0, 0); location(mmul[1]) = tile(1, 1); location(mmul[2]) = tile(0, 2); location(mmul[3]) = tile(1, 0); } }; AIE Grid View **************** Through software simulation, the kernel layout is visualized. Note how output buffers of the mmul kernels are read directly into the adder tree kernel, bypassing the AXI4 stream. .. image:: image/4x128x128_array.svg :alt: Adder-tree Matmul Grid Layout :width: 600px :align: center .. [1] MaxEva: Maximizing the Efficiency of Matrix Multiplication on Versal AI Engine https://arxiv.org/pdf/2311.04980v2