avatarHao Zhuang, an engineer, Tesla AI, Ex-Googler, PhD

Summary

The provided web content discusses the implementation of prefetching and post-storing techniques in CUDA pipelines to optimize matrix multiplication kernel performance.

Abstract

The web content delves into the optimization strategies for CUDA pipelines, focusing on prefetching and post-storing data to enhance the performance of matrix multiplication kernels. It references a fun and informative read from Scott's wiki on SGEMM, which includes a code snippet demonstrating the prefetch technique. The code illustrates how to prefetch data for the next iteration while processing the current one, using a double buffer approach to avoid stalling the pipeline. Additionally, the content covers a two-way prefetch method, which includes a device-to-host (DtoH) post-store operation, and provides examples of asynchronous memory prefetching and synchronization using CUDA streams and events to ensure efficient data transfer without blocking the computation pipeline.

Opinions

  • The author finds Scott's explanation of SGEMM on his wiki to be a fun and informative read, implying that it is both accessible and insightful.
  • The use of prefetching is considered essential for maintaining high throughput in the CUDA pipeline, as it allows for overlapping computation with memory transfers.
  • The double buffer approach in the prefetch code is highlighted as a method to ensure continuous data supply for computation without waiting for memory accesses.
  • The author emphasizes the importance of non-blocking data transfers, suggesting that synchronization mechanisms like CUDA streams and events are critical for achieving high performance in CUDA applications.
  • The content suggests that careful orchestration of memory transfers and kernel executions using separate streams can lead to more efficient use of GPU resources.
  • The provided code snippets and explanations convey the author's expertise and practical experience in optimizing CUDA pipelines for matrix multiplication operations.

CUDA Pipeline — Prefetch and Poststore

Extract this part from https://readmedium.com/good-blogs-about-matmul-kernel-5e0de6413b57

This is a fun read from Scott’s https://github.com/NervanaSystems/maxas/wiki/SGEMM

The following is the prefetch code

readAs = ((tid >> 1) & 7) << 4;
readBs = (((tid & 0x30) >> 3) | (tid & 1)) << 4 + 2048;

while (track0 < end)
{
    // Process each of our 8 lines from shared
    for (j = 0; j < 8; j++)
    {
        // We fetch one line ahead while calculating the current line.
        // Wrap the last line around to the first.
        prefetch = (j + 1) % 8;
        
        // Use even/odd rows to implement our double buffer.
        if (j & 1)
        {
            ld.shared.v4.f32 j0Ax00, [readAs + 4*(prefetch*64 + 0)];
            ld.shared.v4.f32 j0By00, [readBs + 4*(prefetch*64 + 0)];
            ld.shared.v4.f32 j0Ax32, [readAs + 4*(prefetch*64 + 32)];
            ld.shared.v4.f32 j0By32, [readBs + 4*(prefetch*64 + 32)];
        }
        else
        {
            ld.shared.v4.f32 j1Ax00, [readAs + 4*(prefetch*64 + 0)];
            ld.shared.v4.f32 j1By00, [readBs + 4*(prefetch*64 + 0)];
            ld.shared.v4.f32 j1Ax32, [readAs + 4*(prefetch*64 + 32)];
            ld.shared.v4.f32 j1By32, [readBs + 4*(prefetch*64 + 32)];
        }
    }
    // swap our shared memory buffers after reading out 8 lines
    readAs ^= 4*16*64;
    readBs ^= 4*16*64;

    // Additional loop code omitted for clarity.
}

There are two-way prefetch, usually I call the DtoH part “post-store” (from accelerator standpoint). The following is the one-way prefetch from the blog

for (int i = 0; i < num_tiles; i++) { // offload previous tile to the cpu if (i > 0) 
    cudaMemPrefetchAsync(a + tile_size * (i-1), tile_size * sizeof(size_t), 
                         cudaCpuDeviceId, s1); 
  // run multiple kernels on current tile 
  for (int j = 0; j < num_kernels; j++) 
    kernel<<<1024, 1024, 0, s2>>>(tile_size, a + tile_size * i); 
  // prefetch next tile to the gpu 
  if (i < num_tiles) 
    cudaMemPrefetchAsync(a + tile_size * (i+1), tile_size * sizeof(size_t), 
                         0, s3); 
  // sync all streams 
  cudaDeviceSynchronize(); 
}

you see the blocking DtoH. to unlock this

for (int i = 0; i < num_tiles; i++) { 
  // make sure previous kernel and current tile copy both completed 
  cudaEventSynchronize(e1);  
  cudaEventSynchronize(e2);
  // run multiple kernels on current tile 
  for (int j = 0; j < num_kernels; j++)
    kernel<<<1024, 1024, 0, s1>>>(tile_size, a + tile_size * i); 
  cudaEventRecord(e1, s1); 
  // prefetch next tile to the gpu in a separate stream 
  if (i < num_tiles-1) {
    // make sure the stream is idle to force non-deferred HtoD prefetches first 
    cudaStreamSynchronize(s2);       
    cudaMemPrefetchAsync(a + tile_size * (i+1), 
                         tile_size * sizeof(size_t), 
                         0, s2); 
    cudaEventRecord(e2, s2); 
  } 
  // offload current tile to the cpu after the kernel is completed using the deferred path 
  cudaMemPrefetchAsync(a + tile_size * i, tile_size * sizeof(size_t), 
                       cudaCpuDeviceId, s1); 
  // rotate streams and swap events 
  st = s1; s1 = s2; s2 = st; 
  st = s2; s2 = s3; s3 = st; 
  et = e1; e1 = e2; e2 = et; 
}
Ai Accelerator
Gpu
Cuda
Pipeline
Prefetch
Recommended from ReadMedium