AI infrastructure

CUDA (Compute Unified Device Architecture)

In AI infrastructure, NVIDIA has long been dominant thanks to its CUDA ecosystem, widely regarded as the deepest software moat in AI computing.

NVIDIA Driver & GPU

  • Kernel-Mode Drivers

    • WDDM (Windows Display Driver Model) on Windows: defined by Microsoft and implemented by the NVIDIA driver. It kills any compute kernel that runs longer than ~2 seconds (the TDR timeout) to prevent the screen from freezing
    • MPS (Multi-Process Service) on Linux: a daemon that lets multiple processes share a single GPU concurrently
  • CUDA cores are also used for graphics rendering (shaders); the OS acts as referee when parallel computing contends with graphics rendering for GPU resources

  • A headless GPU is a card with no video output ports, so the OS does not kill compute tasks (kernels) in favor of a prioritized display task: 100% of VRAM and power goes to the AI program, with no WDDM timeouts and no conflicts.

CUDA Toolkit and CUDA-X

- A kernel is a function (in C/C++) to be executed on the GPU; when defining a kernel, the function is prefixed with a keyword
- CPU is the host, GPU is the device
- a __global__ kernel is callable from the host
- a __device__ kernel is callable only from the device

```cpp
__device__ void sub()   // callable only from device code
{
    // ...
}

__global__ void add()   // kernel: launched from the host, runs on the device
{
    sub();
}

int main()
{
    add<<<1, 1>>>();    // launch configuration: 1 block of 1 thread
    cudaDeviceSynchronize();
    return 0;
}
```

  • General parallel math (add, subtract, logic) is executed on CUDA cores

    • SM, SP, CUDA cores:
      • SM: a GPU consists of smaller components called Streaming Multiprocessors (SMs).
        • On one SM, one or more blocks can be executed.
      • SP: each SM consists of many Streaming Processors (SPs), on which the actual computation is done; each SP is also called a CUDA core.
        • On one SP, one or more threads can be executed.
    • Thread, Block, Grid
      • A thread is a single instance of execution;
        • one thread can only be executed on one SP.
      • A block is a group of threads;
        • one block can only be executed on one SM.
      • A group of blocks is called a grid. One grid is generated per kernel launch on one GPU; classically, only one kernel executes at a time.
    • A warp is the group of threads in a block that runs simultaneously on an SM.
      • For an instruction pipeline of 4 stages (say fetch, decode, execute, write-back) running on an SM with 8 SPs, 4*8 = 32 threads can execute on this SM simultaneously. These 32 threads form one warp.
      • Suppose a block of 128 threads runs on this SM with 8 SPs; the block will have 128/32 = 4 warps.
      • On this SM the first warp runs, then the second, third, fourth, and after that the first again, the second, …
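The warp and index arithmetic above can be checked in plain Python (a CPU-side sketch; the 32-thread warp size and the blockIdx * blockDim + threadIdx global-index formula are standard CUDA conventions, not taken from this document):

```python
WARP_SIZE = 32  # threads per warp on NVIDIA GPUs

def warps_per_block(threads_per_block):
    # A block is split into warps of 32 threads; a partial warp still
    # occupies a full warp slot, hence the ceiling division.
    return -(-threads_per_block // WARP_SIZE)

def global_thread_index(block_idx, block_dim, thread_idx):
    # The standard CUDA global index: blockIdx.x * blockDim.x + threadIdx.x
    return block_idx * block_dim + thread_idx

print(warps_per_block(128))            # 4 warps, matching the example above
print(global_thread_index(2, 128, 5))  # thread 5 of block 2 -> index 261
```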
  • Deep Learning math (mixed-precision matrix multiplication) runs on Tensor cores

    • TBD
  • The Compiler:

    • nvcc, static compiler
      • Separates code into host (CPU) and device (GPU) parts
      • Sends the CPU code to a standard compiler (MSVC on Windows or GCC on Linux)
      • Compiles the GPU code into PTX (Parallel Thread Execution, an intermediate GPU assembly)
      • CUDA language extension to C++
        • kernel<<<blocks, threads>>>(args)
        • This launch syntax is compiled down to the CUDA Runtime API call cudaLaunchKernel
    • nvrtc (NVIDIA Runtime Compilation), JIT compiler
      • pros
        • lightweight environment: only the GPU driver and nvrtc.so are needed
        • dynamic transpilation
        • extreme optimization for the exact GPU present at runtime
      • cons
        • only handles GPU code
        • no static type checking
        • launch overhead
        • no support for high-level libraries like Thrust or CUB
  • Debugger & Profiler

    • CUDA-GDB: finding bugs
    • Nsight: measuring speed
      • Standalone desktop apps with a GUI
      • Extension in VS Code

High-level Programming

  • NIM (NVIDIA Inference Microservices), built on CUDA-X: packaged as cloud APIs; easy to integrate, customize, and deploy

    • engages the GPU through a Docker container service
  • CUDA-X: over 400 libraries for building, optimizing, deploying, and scaling applications. Custom operators: writing a specific mathematical operation (like a new type of activation function) from scratch in CUDA C++. Hardware optimization: using shared memory, warp shuffles, and tiling to make an algorithm run at the maximum speed of the hardware.

    • Operator Libraries: an operator is a specific mathematical function such as a matrix multiplication, a convolution, or an activation function. Instead of writing these from scratch in CUDA C++, NVIDIA provides highly optimized operator libraries.

      • Domain-Specific (high level): ready-to-use operators for AI, vision, or video:
        • cuDNN (standalone): Deep Neural Network library (convolutions, attention, normalization)
        • TensorRT (standalone): inference optimizer and runtime
        • DeepStream (standalone): streaming video analytics
        • CV-CUDA (standalone): GPU-accelerated computer-vision operators
      • Math & Algorithms(Mid level)
        • cuBLAS: Basic linear algebra (matrix multiplication)
        • cuSPARSE: Math for “Sparse” matrices (matrices full of zeros)
        • cuFFT: Fast Fourier Transforms (Signal processing)
        • cuRAND: Random number generation
        • cuSOLVER: Direct solvers for dense/sparse linear systems
        • Thrust: Parallel Algorithms Library, high-level “STL-Like” sorting, searching, and reducing
      • Template & Primitive (Low level)
        • CUB: CUDA Unbound, a header-only library of collective primitives.
        • CUTLASS(standalone): CUDA Templates for Linear Algebra Subroutines and Software. It is a collection of C++ templates.
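To make "operator" concrete: what cuBLAS (and the CUTLASS templates) provide is a heavily tuned GPU implementation of something mathematically simple. A dependency-free Python sketch of the matrix-multiply operator itself, minus all of the tiling, shared-memory, and tensor-core tuning those libraries exist for:

```python
def matmul(a, b):
    # Naive O(n^3) matrix multiply over lists of lists.
    rows, inner, cols = len(a), len(b), len(b[0])
    assert len(a[0]) == inner, "inner dimensions must match"
    return [[sum(a[i][k] * b[k][j] for k in range(inner))
             for j in range(cols)]
            for i in range(rows)]

print(matmul([[1, 2], [3, 4]], [[5, 6], [7, 8]]))  # [[19, 22], [43, 50]]
```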
    • Host-level/system programming: CUDA operations that manage the flow of the program

      • cudart.dll or cudart.so: high-level API, the CUDA Runtime API

        • Memory management, manages the allocation and deallocation of VRAM
          • cudaMalloc()
          • cudaFree()
        • Graph/Stream capture, orchestrating how different kernels launch in sequence to minimize CPU overhead
          • cudaStreamCreate()
          • cudaStreamSynchronize()
          • cudaStreamWaitEvent()
          • cudaGraphCreate()
          • cudaGraphInstantiate()
          • cudaGraphLaunch()
        • Data transfer, handles moving data back and forth between RAM and VRAM
          • cudaMemcpy()
        • Kernel execution: manages the launching of kernels(parallel functions that run on GPU cores)
          • cudaLaunchKernel()
        • Device management: identifies which GPUs are available in the system and initializes the execution environment
      • libcuda.so or nvcuda.dll: low-level API, the CUDA Driver API

        • Offers much finer control but is significantly more complex to write.
  • PyTorch: the vast majority of AI developers use CUDA indirectly through PyTorch

    • most users never write a single line of __global__ void kernel(); instead:

```python
x = torch.randn(1024, 1024).cuda()  # moves data to the GPU via the CUDA Runtime
y = torch.matmul(x, x)              # runs the math via cuBLAS
```

- TBD

PyTorch

Industry standard for AI research and leader in production deployment

Eager Mode (Dynamic Computation Graph):

  • TBD

Three Pillars

  • torch.Tensor
    • Backend system: PyTorch is designed as a hardware-agnostic layer using a “backend” system. User code targets the torch.Tensor API; under the hood, PyTorch has a CUDA backend, a CPU backend, and an MPS (Apple Silicon) backend.
    • Relation to CUDA: PyTorch was developed with a “CUDA-first” mindset. Much of its C++ source code is highly optimized specifically for NVIDIA GPUs. While it can run on other hardware, its performance is most mature on CUDA.
  • torch.autograd
    • TBD
  • torch.nn
    • TBD
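Since torch.autograd is still marked TBD above, a toy scalar autograd (pure Python; an illustration of the idea, not PyTorch's actual implementation) shows what it does in eager mode: each operation computes its value immediately and records how to propagate gradients backward through the graph that Python execution builds:

```python
class Scalar:
    """Toy reverse-mode autograd value (illustration only, not torch)."""
    def __init__(self, value, parents=()):
        self.value = value
        self.grad = 0.0
        self._parents = parents  # (parent, local_gradient) pairs

    def __mul__(self, other):
        # d(a*b)/da = b, d(a*b)/db = a
        return Scalar(self.value * other.value,
                      parents=((self, other.value), (other, self.value)))

    def __add__(self, other):
        # d(a+b)/da = d(a+b)/db = 1
        return Scalar(self.value + other.value,
                      parents=((self, 1.0), (other, 1.0)))

    def backward(self, upstream=1.0):
        # Chain rule: accumulate upstream * local gradient into each parent.
        self.grad += upstream
        for parent, local in self._parents:
            parent.backward(upstream * local)

x = Scalar(3.0)
y = Scalar(4.0)
z = x * y + x          # graph is built eagerly, as Python executes
z.backward()
print(x.grad, y.grad)  # dz/dx = y + 1 = 5.0, dz/dy = x = 3.0
```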

Stack Layer

  • User Code

  • Core Libraries

    1. Data preprocessing: abstracting raw data into structured tensors; batching, shuffling, and multi-process memory pinning for GPU transfer.
      • torch.utils.data.Dataset, DataLoader
    2. Model architecture definition: constructing the computational graph by subclassing nn.Module and initializing the linear or non-linear layers and stateful parameters.
      • torch.nn.Module, nn.Parameter
    3. Forward propagation: executing the sequence of mathematical transformations on input tensors to produce model outputs or latent representations.
      • model.forward(), tensor operations
    4. Loss computation & gradient attribution: quantifying error via an objective function and using the autograd engine to compute the partial derivatives of the loss with respect to all trainable parameters.
      • torch.nn.modules.loss
      • torch.autograd.backward()
    5. Parameter optimization: applying optimization algorithms to update the model’s parameter state based on the calculated gradients, minimizing the objective function.
      • torch.optim.Optimizer (e.g., Adam, SGD)
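The five stages above can be traced end to end in a dependency-free sketch (plain Python, with a hand-written gradient standing in for autograd and a bare SGD step standing in for torch.optim; the dataset and learning rate are made up for illustration):

```python
# Step 1, data preparation: toy dataset for learning w in y = 2 * x
data = [(x, 2.0 * x) for x in [1.0, 2.0, 3.0, 4.0]]

w = 0.0    # step 2: the model's single trainable parameter
lr = 0.05  # learning rate for the optimizer step

for epoch in range(200):
    for x, target in data:
        pred = w * x                      # step 3: forward propagation
        loss = (pred - target) ** 2       # step 4a: loss computation (squared error)
        grad_w = 2 * (pred - target) * x  # step 4b: dloss/dw (autograd's job in torch)
        w -= lr * grad_w                  # step 5: SGD parameter update

print(round(w, 3))  # converges to ~2.0
```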
  • Dispatcher

  • Kernel

Domain Libraries

  • TorchVision (Computer Vision): pretrained models such as ResNet and Faster R-CNN
  • Hugging Face Transformers (Natural Language): standard for LLMs
  • PyTorch Lightning (Research): removes the boilerplate code
  • TorchScript, ExecuTorch (production): tools to turn Python code into a fast, standalone artifact that can run on a phone or a server without Python.
  • TBD

Extensions

  • Intel Extension for PyTorch(IPEX): allows PyTorch to work better on Intel hardware
  • PyTorch/XLA: an extension that allows PyTorch to run on Google’s TPUs.
  • TBD

ROCm (Radeon Open Compute)

TBD

TPU

Google has shown its ambition to upend LLM training and inference infrastructure with its TPU and the accompanying software stack.

TBD