CUDA (Compute Unified Device Architecture)
In AI infrastructure, NVIDIA is the dominant player thanks to its CUDA ecosystem, the world’s most formidable AI-computing moat.
NVIDIA Driver & GPU
Kernel-Mode Drivers
- WDDM (Windows Display Driver Model) on Windows: defined by Microsoft and implemented by the NVIDIA driver. It kills any compute kernel that runs longer than about 2 seconds (the TDR timeout) to prevent the screen from freezing
- MPS (Multi-Process Service) on Linux: a daemon that lets multiple CUDA processes share a single GPU
CUDA cores are also used by the OS for graphics rendering (shaders); the OS acts as referee when parallel computing contends with graphics rendering for resources
A headless GPU is a card with no video output ports. Because there is no prioritized display task, the OS will not kill compute tasks (kernels): 100% of VRAM and power goes to the AI program, with no WDDM timeouts and no conflicts.
CUDA Toolkit and CUDA-X
- A kernel is a function (in C/C++) to be executed on the GPU; when defining a kernel, the function is prefixed with a keyword:
- CPU is host, GPU is device
- __global__ kernels are callable from the host
- __device__ kernels are callable only from the device
```cpp
__device__ void sub()
{
    ...
}

__global__ void add()
{
    sub();
}

int main()
{
    add<<<1, 1>>>();   // a __global__ kernel must be launched from the host
}
```
General parallel math (add, subtract, logic) is executed on CUDA cores
- SM, SP, CUDA cores:
- SM: a GPU consists of smaller components called Streaming Multiprocessors (SMs).
- On one SM, one or more blocks can be executed.
- SP: each SM consists of many Streaming Processors (SPs), on which the actual computation is done; each SP is also called a CUDA core.
- On one SP, one or more threads can be executed.
- Thread, Block, Grid
- Thread is a single instance of execution,
- one thread can only be executed on one SP.
- Block is a group of threads,
- one block can only be executed on one SM.
- A group of blocks is called a grid. One grid is generated per kernel launch on one GPU; only one kernel executes at a given time instance.
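The hierarchy above gives every thread a unique global index. A minimal pure-Python sketch of the standard `blockIdx.x * blockDim.x + threadIdx.x` computation (the simulation and the function name are illustrative, not CUDA API):

```python
# Pure-Python simulation of CUDA's 1-D thread indexing scheme.
# block_dim = threads per block; grid_dim = blocks per grid.
def global_thread_ids(grid_dim: int, block_dim: int) -> list[int]:
    ids = []
    for block_idx in range(grid_dim):          # each block in the grid
        for thread_idx in range(block_dim):    # each thread in the block
            # same formula a kernel uses: blockIdx.x * blockDim.x + threadIdx.x
            ids.append(block_idx * block_dim + thread_idx)
    return ids

# A grid of 4 blocks x 256 threads covers 1024 unique indices.
ids = global_thread_ids(grid_dim=4, block_dim=256)
```

Every thread gets a distinct index, which is how one kernel maps its many threads onto many data elements.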
- A warp is the number of threads in a block that run simultaneously on an SM.
- For an instruction pipeline of 4 stages (say fetch, decode, execute, write-back) running on an SM with 8 SPs, 4 × 8 = 32 threads can execute on this SM simultaneously. These 32 threads form one warp.
- Suppose a block of 128 threads runs on this SM with 8 SPs: the block has 128 / 32 = 4 warps.
- On this SM, the first warp runs, then the second, third, fourth, and then the first again, and so on.
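The warp arithmetic in this example can be checked directly (a sketch reproducing the numbers above):

```python
# Reproduce the worked example: a 4-stage pipeline on an SM with 8 SPs.
pipeline_stages = 4
sps_per_sm = 8
warp_size = pipeline_stages * sps_per_sm      # 32 threads run simultaneously

block_threads = 128
warps_per_block = block_threads // warp_size  # the 128-thread block forms 4 warps
```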
Deep learning math (mixed-precision matrix multiplication) is executed on Tensor cores
- TBD
The Compiler:
- nvcc, static compiler
- Separates code into host(CPU) and device(GPU)
- Sends the CPU code to a standard compiler(MSVC on Windows or GCC on Linux)
- Compiles the GPU code into PTX(parallel thread execution, intermediate GPU assembly)
- CUDA language extension to C++
- kernel<<<blocks, threads>>>(args)
- This launch syntax compiles down to the CUDA Runtime API call cudaLaunchKernel
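Picking `blocks` for a given problem size is normally a ceiling division, so that `blocks * threads` covers every element. A pure-Python sketch of that arithmetic (`launch_config` is a hypothetical helper, not a CUDA API):

```python
def launch_config(n_elements: int, threads_per_block: int = 256) -> tuple[int, int]:
    # Ceiling division: enough blocks so that blocks * threads >= n_elements.
    blocks = (n_elements + threads_per_block - 1) // threads_per_block
    return blocks, threads_per_block

# 1000 elements with 256 threads/block -> 4 blocks (1024 threads, 24 idle).
blocks, threads = launch_config(1000)
```

Inside the kernel, threads whose global index exceeds `n_elements` simply return, which is why over-provisioning by a partial block is safe.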
- nvrtc (NVIDIA Runtime Compilation), JIT compiler
- pros
- lightweight environment: only the GPU driver and libnvrtc.so are needed
- dynamic transpilation
- extreme optimization
- cons
- only handle GPU code
- no static type checking
- launch overhead
- no support for high-level libraries like Thrust or CUB
Debugger & Profiler
- CUDA-GDB: finding bugs
- Nsight: measuring speed
- Standalone desktop apps with a GUI
- Extension in VS Code
High-level Programming
NIM (NVIDIA Inference Microservices) (CUDA-X): packaged as cloud APIs; easy to integrate, customize, and deploy.
- engages the GPU through a Docker container service
CUDA-X: over 400 libraries for building, optimizing, deploying, and scaling applications. Custom operators: writing a specific mathematical operation (like a new type of activation function) from scratch in CUDA C++. Hardware optimization: using shared memory, warp shuffles, and tiling to make an algorithm run at the maximum speed of the hardware.
Operator Libraries: an operator is a specific mathematical function like a matrix multiplication, a convolution, or an activation function. Instead of writing these from scratch in CUDA C++, NVIDIA provides highly optimized operator libraries.
- Domain-Specific (high level): ready-to-use operators for AI, vision, or video:
- cuDNN(standalone): Deep Neural Network library
- TensorRT(standalone):
- DeepStream(standalone):
- CV-CUDA(standalone):
- Math & Algorithms(Mid level)
- cuBLAS: Basic linear algebra (matrix multiplication)
- cuSPARSE: Math for “Sparse” matrices (matrices full of zeros)
- cuFFT: Fast Fourier Transforms (Signal processing)
- cuRAND: Random number generation
- cuSOLVER: Direct solvers for dense/sparse linear systems
- Thrust: Parallel Algorithms Library, high-level “STL-Like” sorting, searching, and reducing
- Template & Primitive (Low level)
- CUB: CUDA Unbound, a header-only library of collective primitives.
- CUTLASS(standalone): CUDA Templates for Linear Algebra Subroutines and Software. It is a collection of C++ templates.
Host-level/system programming: CUDA operations managing the flow of the program
cudart.dll or libcudart.so: high-level API, the CUDA Runtime API
- Memory management, manages the allocation and deallocation of VRAM
- cudaMalloc()
- cudaFree()
- Graph/Stream capture, orchestrating how different kernels launch in sequence to minimize CPU overhead
- cudaStreamCreate()
- cudaStreamSynchronize()
- cudaStreamWaitEvent()
- cudaGraphCreate()
- cudaGraphInstantiate()
- cudaGraphLaunch()
- Data transfer, handles moving data back and forth between RAM and VRAM
- cudaMemcpy()
- Kernel execution: manages the launching of kernels(parallel functions that run on GPU cores)
- cudaLaunchKernel()
- Device management: identifies which GPUs are available in the system and initializes the execution environment
libcuda.so or nvcuda.dll: low-level API, the CUDA Driver API
- offers much finer control but is significantly more complex to write
PyTorch: almost 99% of AI developers use CUDA through PyTorch
- Most users never write a single line of `__global__ void kernel()`; instead:

```python
x = torch.randn(1024, 1024).cuda()  # moves data via CUDART
y = torch.matmul(x, x)              # runs math via cuBLAS
```
NVLink
- TBD
PyTorch
Industry standard for AI research and leader in production deployment
Eager Mode (Dynamic Computation Graph):
- TBD
Three Pillars
- torch.Tensor
- Backend system: PyTorch is designed as a hardware-agnostic layer and uses a “backend” system. User code is written against the torch.Tensor API; under the hood, PyTorch has a CUDA backend, a CPU backend, and an MPS (Apple Silicon) backend.
- Relation to CUDA: PyTorch was developed with a “CUDA-first” mindset. Much of its C++ source code is highly optimized specifically for NVIDIA GPUs. While it can run on other hardware, its performance is most mature on CUDA.
- torch.autograd
- TBD
- torch.nn
- TBD
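The backend dispatch described under torch.Tensor is invisible in user code. A minimal sketch (assumes PyTorch is installed; falls back to the CPU backend when no GPU is present):

```python
import torch

# The same user code runs on whichever backend is available.
device = "cuda" if torch.cuda.is_available() else "cpu"

t = torch.ones(3, 3, device=device)   # dispatched to the CUDA or CPU backend
result = (t @ t).sum().item()         # matmul routed to a backend-specific kernel
```

Nothing in the math changes per backend: each entry of `t @ t` is 3, so `result` is 27.0 on any device.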
Stack Layer
User Code
Core Libraries
- Data preprocessing: abstracting raw data into structured tensors; batching, shuffling, and multi-process memory pinning for GPU transfer.
- torch.utils.data.Dataset, DataLoader
- Model architecture definition: constructing the computational graph by subclassing nn.Module and initializing the linear or non-linear layers and stateful parameters
- torch.nn.Module, nn.Parameter
- Forward propagation: executing the sequence of mathematical transformations on input tensors to produce model outputs or latent representations.
- model.forward(), tensor Operations
- Loss computation & gradient attribution: quantifying error via an objective function and using the Autograd engine to compute the partial derivatives of the loss with respect to all trainable parameters.
- torch.nn.modules.loss
- torch.autograd.backward()
- Parameter optimization: applying optimization algorithms to update the model’s parameter state based on the calculated gradients, minimizing the objective function.
- torch.optim.Optimizer (e.g., Adam, SGD)
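The five stages above can be sketched as one minimal training loop (assumes PyTorch is installed; the linear model and random data are toy stand-ins for a real Dataset/DataLoader pipeline):

```python
import torch
import torch.nn as nn

torch.manual_seed(0)

# Toy data standing in for the data-preprocessing stage.
x = torch.randn(64, 10)              # batch of 64 samples, 10 features
y = torch.randn(64, 1)               # regression targets

model = nn.Linear(10, 1)             # model architecture (nn.Module)
loss_fn = nn.MSELoss()               # loss computation
optimizer = torch.optim.SGD(model.parameters(), lr=0.1)

initial_loss = loss_fn(model(x), y).item()
for _ in range(20):
    optimizer.zero_grad()            # clear gradients from the previous step
    loss = loss_fn(model(x), y)      # forward propagation + loss computation
    loss.backward()                  # Autograd computes d(loss)/d(params)
    optimizer.step()                 # parameter optimization
final_loss = loss_fn(model(x), y).item()
```

Each loop iteration walks the pipeline in order: forward pass, loss, `backward()` for gradient attribution, and an optimizer step.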
Dispatcher
Kernel
Domain Libraries
- TorchVision (Computer Vision): models like ResNet, YOLO
- Hugging Face Transformers (Natural Language): standard for LLMs
- PyTorch Lightning (Research): removes the boilerplate code
- TorchScript, ExecuTorch (production): tools to turn python code into a fast, standalone file that can run on a phone or a server without Python.
- TBD
Extensions
- Intel Extension for PyTorch(IPEX): allows PyTorch to work better on Intel hardware
- XLA: an extension that allows PyTorch to run on Google’s TPU.
- TBD
ROCm (Radeon Open Compute)
TBD
TPU
Google has shown its ambition to overturn the LLM training and inference infrastructure landscape with its TPU and software stack.
TBD