CUDA (Compute Unified Device Architecture)
In AI infrastructure, NVIDIA is the dominant player thanks to its CUDA ecosystem, the world’s most formidable AI-computing moat.
NVIDIA Driver & GPU
Kernel-Mode Drivers
- WDDM (Windows Display Driver Model) on Windows: defined by Microsoft and implemented by the NVIDIA driver. It kills any compute kernel that runs longer than about 2 seconds (the TDR timeout) to prevent the screen from freezing
- MPS (Multi-Process Service) on Linux: a daemon that lets multiple CUDA processes share a single GPU
CUDA cores are also used by the OS for graphics rendering (shaders); the OS acts as referee when parallel computing contends with graphics rendering for resources
A headless GPU is a card with no video output ports. Because there is no prioritized display task, the OS will not kill compute tasks (kernels): 100% of VRAM and power goes to the AI program, with no WDDM timeouts and no conflicts.
CUDA Toolkit and CUDA-X
- A kernel is a function (in C/C++) to be executed on the GPU; when defining a kernel, the function is prefixed with a keyword:
- CPU is host, GPU is device
- __global__ kernels are callable from the host
- __device__ kernels are callable only from the device
```cpp
__device__ void sub()
{
    ...
}

__global__ void add()
{
    sub();
}

int main()
{
    add<<<1, 1>>>();   // a __global__ kernel must be launched from the host
}
```
General parallel math (add, subtract, logic) is executed on CUDA cores
- SM, SP, CUDA cores:
- SM: a GPU consists of smaller components called Streaming Multiprocessors (SMs).
- On one SM, one or more blocks can be executed.
- SP: each SM consists of many Streaming Processors (SPs), on which the actual computation is done; each SP is also called a CUDA core.
- On one SP, one or more threads can be executed.
- Thread, Block, Grid
- Thread is a single instance of execution,
- one thread can only be executed on one SP.
- Block is a group of threads,
- one block can only be executed on one SM.
- A group of blocks is called a grid. One grid is generated per kernel launch on one GPU; only one kernel executes at a given time instance.
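The hierarchy above gives every thread a unique global index. A minimal pure-Python sketch of the standard `blockIdx.x * blockDim.x + threadIdx.x` computation (the simulation and the function name are illustrative, not CUDA API):

```python
# Pure-Python simulation of CUDA's 1-D thread indexing scheme.
# block_dim = threads per block; grid_dim = blocks per grid.
def global_thread_ids(grid_dim: int, block_dim: int) -> list[int]:
    ids = []
    for block_idx in range(grid_dim):          # each block in the grid
        for thread_idx in range(block_dim):    # each thread in the block
            # same formula a kernel uses: blockIdx.x * blockDim.x + threadIdx.x
            ids.append(block_idx * block_dim + thread_idx)
    return ids

# A grid of 4 blocks x 256 threads covers 1024 unique indices.
ids = global_thread_ids(grid_dim=4, block_dim=256)
```

Every thread gets a distinct index, which is how one kernel maps its many threads onto many data elements.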
- A warp is the number of threads in a block that run simultaneously on an SM.
- For an instruction pipeline of 4 stages (say fetch, decode, execute, write-back) running on an SM with 8 SPs, 4 × 8 = 32 threads can execute on this SM simultaneously. These 32 threads form one warp.
- Suppose a block of 128 threads runs on this SM with 8 SPs: the block has 128 / 32 = 4 warps.
- On this SM, the first warp runs, then the second, third, fourth, and then the first again, and so on.
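The warp arithmetic in this example can be checked directly (a sketch reproducing the numbers above):

```python
# Reproduce the worked example: a 4-stage pipeline on an SM with 8 SPs.
pipeline_stages = 4
sps_per_sm = 8
warp_size = pipeline_stages * sps_per_sm      # 32 threads run simultaneously

block_threads = 128
warps_per_block = block_threads // warp_size  # the 128-thread block forms 4 warps
```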
Deep learning math (mixed-precision matrix multiplication) is executed on Tensor cores
- TBD
The Compiler:
- nvcc, static compiler
- Separates code into host(CPU) and device(GPU)
- Sends the CPU code to a standard compiler(MSVC on Windows or GCC on Linux)
- Compiles the GPU code into PTX(parallel thread execution, intermediate GPU assembly)
- CUDA language extension to C++
- kernel<<<blocks, threads>>>(args)
- This launch syntax compiles down to the CUDA Runtime API call cudaLaunchKernel
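Picking `blocks` for a given problem size is normally a ceiling division, so that `blocks * threads` covers every element. A pure-Python sketch of that arithmetic (`launch_config` is a hypothetical helper, not a CUDA API):

```python
def launch_config(n_elements: int, threads_per_block: int = 256) -> tuple[int, int]:
    # Ceiling division: enough blocks so that blocks * threads >= n_elements.
    blocks = (n_elements + threads_per_block - 1) // threads_per_block
    return blocks, threads_per_block

# 1000 elements with 256 threads/block -> 4 blocks (1024 threads, 24 idle).
blocks, threads = launch_config(1000)
```

Inside the kernel, threads whose global index exceeds `n_elements` simply return, which is why over-provisioning by a partial block is safe.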
- nvrtc (NVIDIA Runtime Compilation), JIT compiler
- pros
- lightweight environment: only the GPU driver and libnvrtc.so are needed
- dynamic transpilation
- extreme optimization
- cons
- only handle GPU code
- no static type checking
- launch overhead
- no support for high-level libraries like Thrust or CUB
Debugger & Profiler
- CUDA-GDB: finding bugs
- Nsight: measuring speed
- Standalone desktop apps with a GUI
- Extension in VS Code
High-level Programming
NIM (NVIDIA Inference Microservices) (CUDA-X): packaged as cloud APIs; easy to integrate, customize, and deploy.
- engages the GPU through a Docker container service
CUDA-X: over 400 libraries for building, optimizing, deploying, and scaling applications. Custom operators: writing a specific mathematical operation (like a new type of activation function) from scratch in CUDA C++. Hardware optimization: using shared memory, warp shuffles, and tiling to make an algorithm run at the maximum speed of the hardware.
Operator Libraries: an operator is a specific mathematical function like a matrix multiplication, a convolution, or an activation function. Instead of writing these from scratch in CUDA C++, NVIDIA provides highly optimized operator libraries.
- Domain-Specific (high level): ready-to-use operators for AI, vision, or video:
- cuDNN(standalone): Deep Neural Network library
- TensorRT(standalone):
- DeepStream(standalone):
- CV-CUDA(standalone):
- Math & Algorithms(Mid level)
- cuBLAS: Basic linear algebra (matrix multiplication)
- cuSPARSE: Math for “Sparse” matrices (matrices full of zeros)
- cuFFT: Fast Fourier Transforms (Signal processing)
- cuRAND: Random number generation
- cuSOLVER: Direct solvers for dense/sparse linear systems
- Thrust: Parallel Algorithms Library, high-level “STL-Like” sorting, searching, and reducing
- Template & Primitive (Low level)
- CUB: CUDA Unbound, a header-only library of collective primitives.
- CUTLASS(standalone): CUDA Templates for Linear Algebra Subroutines and Software. It is a collection of C++ templates.
Host-level/system programming: CUDA operations managing the flow of the program
cudart.dll or libcudart.so: high-level API, the CUDA Runtime API
- Memory management, manages the allocation and deallocation of VRAM
- cudaMalloc()
- cudaFree()
- Graph/Stream capture, orchestrating how different kernels launch in sequence to minimize CPU overhead
- cudaStreamCreate()
- cudaStreamSynchronize()
- cudaStreamWaitEvent()
- cudaGraphCreate()
- cudaGraphInstantiate()
- cudaGraphLaunch()
- Data transfer, handles moving data back and forth between RAM and VRAM
- cudaMemcpy()
- Kernel execution: manages the launching of kernels(parallel functions that run on GPU cores)
- cudaLaunchKernel()
- Device management: identifies which GPUs are available in the system and initializes the execution environment
libcuda.so or nvcuda.dll: low-level API, the CUDA Driver API
- offers much finer control but is significantly more complex to write
PyTorch: almost 99% of AI developers use CUDA through PyTorch
- Most users never write a single line of `__global__ void kernel()`; instead:

```python
x = torch.randn(1024, 1024).cuda()  # moves data via CUDART
y = torch.matmul(x, x)              # runs math via cuBLAS
```
NVLink
- TBD
PyTorch
Industry standard for AI research and leader in production deployment
Eager Mode (Dynamic Computation Graph):
- TBD
Three Pillars
- torch.Tensor
- Backend system: PyTorch is designed as a hardware-agnostic layer and uses a “backend” system. User code is written against the torch.Tensor API; under the hood, PyTorch has a CUDA backend, a CPU backend, and an MPS (Apple Silicon) backend.
- Relation to CUDA: PyTorch was developed with a “CUDA-first” mindset. Much of its C++ source code is highly optimized specifically for NVIDIA GPUs. While it can run on other hardware, its performance is most mature on CUDA.
- torch.autograd
- TBD
- torch.nn
- TBD
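The backend dispatch described under torch.Tensor is invisible in user code. A minimal sketch (assumes PyTorch is installed; falls back to the CPU backend when no GPU is present):

```python
import torch

# The same user code runs on whichever backend is available.
device = "cuda" if torch.cuda.is_available() else "cpu"

t = torch.ones(3, 3, device=device)   # dispatched to the CUDA or CPU backend
result = (t @ t).sum().item()         # matmul routed to a backend-specific kernel
```

Nothing in the math changes per backend: each entry of `t @ t` is 3, so `result` is 27.0 on any device.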
Stack Layer
User Code
Core Libraries
- Data preprocessing: abstracting raw data into structured tensors; batching, shuffling, and multi-process memory pinning for GPU transfer.
- torch.utils.data.Dataset, DataLoader
- Model architecture definition: constructing the computational graph by subclassing nn.Module and initializing the linear or non-linear layers and stateful parameters
- torch.nn.Module, nn.Parameter
- Forward propagation: executing the sequence of mathematical transformations on input tensors to produce model outputs or latent representations.
- model.forward(), tensor Operations
- Loss computation & gradient attribution: quantifying error via an objective function and using the Autograd engine to compute the partial derivatives of the loss with respect to all trainable parameters.
- torch.nn.modules.loss
- torch.autograd.backward()
- Parameter optimization: applying optimization algorithms to update the model’s parameter state based on the calculated gradients, minimizing the objective function.
- torch.optim.Optimizer (e.g., Adam, SGD)
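The five stages above can be sketched as one minimal training loop (assumes PyTorch is installed; the linear model and random data are toy stand-ins for a real Dataset/DataLoader pipeline):

```python
import torch
import torch.nn as nn

torch.manual_seed(0)

# Toy data standing in for the data-preprocessing stage.
x = torch.randn(64, 10)              # batch of 64 samples, 10 features
y = torch.randn(64, 1)               # regression targets

model = nn.Linear(10, 1)             # model architecture (nn.Module)
loss_fn = nn.MSELoss()               # loss computation
optimizer = torch.optim.SGD(model.parameters(), lr=0.1)

initial_loss = loss_fn(model(x), y).item()
for _ in range(20):
    optimizer.zero_grad()            # clear gradients from the previous step
    loss = loss_fn(model(x), y)      # forward propagation + loss computation
    loss.backward()                  # Autograd computes d(loss)/d(params)
    optimizer.step()                 # parameter optimization
final_loss = loss_fn(model(x), y).item()
```

Each loop iteration walks the pipeline in order: forward pass, loss, `backward()` for gradient attribution, and an optimizer step.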
Dispatcher
Kernel
Domain Libraries
- TorchVision (Computer Vision): models like ResNet, YOLO
- Hugging Face Transformers (Natural Language): standard for LLMs
- PyTorch Lightning (Research): removes the boilerplate code
- TorchScript, ExecuTorch (production): tools to turn python code into a fast, standalone file that can run on a phone or a server without Python.
- TBD
Extensions
- Intel Extension for PyTorch(IPEX): allows PyTorch to work better on Intel hardware
- XLA: an extension that allows PyTorch to run on Google’s TPU.
- TBD
ROCm (Radeon Open Compute)
TBD
TPU
Google has shown its ambition to overturn the LLM training and inference infrastructure landscape with its TPU and software stack.
TBD