How to Write CUDA GPU Kernels in Rust with NVIDIA's cuda-oxide Compiler

Introduction

NVIDIA's cuda-oxide is an experimental compiler that lets you write CUDA SIMT (Single Instruction, Multiple Threads) GPU kernels directly in standard Rust. Instead of switching to C++ or relying on Python-level abstractions, you can now compile Rust code straight to PTX (Parallel Thread Execution)—the intermediate representation used by CUDA for NVIDIA GPUs. This guide walks you through setting up and using cuda-oxide to create your first Rust-based GPU kernel, explaining the unique compilation pipeline and how it fits into the Rust GPU ecosystem.

Source: www.marktechpost.com

What You Need

  • Rust nightly toolchain – cuda-oxide depends on nightly Rust features and the Stable MIR API.
  • cuda-oxide source code – Clone the repository from NVIDIA's GitHub (link below).
  • NVIDIA GPU with CUDA drivers – Required to compile and run PTX on hardware.
  • Basic knowledge of Rust and CUDA – Familiarity with SIMT model and GPU programming concepts helps.

Step-by-Step Guide

Step 1: Set Up the Rust Nightly Toolchain

cuda-oxide uses unstable Rust features that are only available in nightly builds. Install nightly via rustup and set it as default for your project (or use a toolchain override).

  1. Install Rust if you haven't: curl --proto '=https' --tlsv1.2 -sSf https://sh.rustup.rs | sh
  2. Install nightly: rustup install nightly
  3. Set nightly as default: rustup default nightly (or use rustup override set nightly in your project directory).
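Instead of changing your global default, you can pin the toolchain per project with a `rust-toolchain.toml` file at the crate root (a standard rustup mechanism; rustup will then pick up nightly automatically for this directory):

```toml
[toolchain]
channel = "nightly"
```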

Step 2: Clone and Build cuda-oxide

Download the cuda-oxide repository. It includes the custom codegen backend and all necessary crates (like rustc-codegen-cuda and Pliron-based dialects).

  1. Clone the repo: git clone https://github.com/NVIDIA/cuda-oxide.git
  2. Navigate into the directory: cd cuda-oxide
  3. Build the project: cargo build --release
  4. Note the path to the compiled backend – you'll use it later to compile kernels.

Step 3: Create a New Rust Project for Your Kernel

cuda-oxide compiles entire #![no_std] crates into PTX. Create a library crate with special attributes.

  1. Create a new cargo project: cargo new my_gpu_kernel --lib
  2. Edit Cargo.toml: set edition = "2021" under [package], and add a [lib] section with crate-type = ["lib"].
  3. Add #![no_std] at the top of lib.rs, along with any nightly #![feature(...)] gates the backend currently requires (check the repository's examples for the exact list).
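Putting those settings together, a minimal Cargo.toml for the kernel crate might look like this (the crate name and edition come from the steps above; everything else is the cargo default):

```toml
[package]
name = "my_gpu_kernel"
version = "0.1.0"
edition = "2021"

[lib]
crate-type = ["lib"]
```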

Step 4: Write a SIMT Kernel in Rust

Define a function that will run on the GPU. Use the #[cuda_kernel] attribute (provided by cuda-oxide) and access thread indices via device intrinsics.

// lib.rs
#![no_std]
// Enable whichever nightly feature gates the backend currently requires;
// consult the cuda-oxide repository's examples for the up-to-date list.

extern crate cuda_oxide_intrinsics;
use cuda_oxide_intrinsics::{thread_idx_x, block_idx_x, block_dim_x};

/// Element-wise vector addition: c[i] = a[i] + b[i].
/// `#[no_mangle]` and `extern "C"` keep the symbol name stable so the
/// host program can look the kernel up by name in the PTX module.
#[no_mangle]
pub unsafe extern "C" fn vector_add(
    a: *const f32,
    b: *const f32,
    c: *mut f32,
    n: u32,
) {
    // Global thread index: each thread handles exactly one element.
    let idx = block_idx_x() * block_dim_x() + thread_idx_x();
    // Bounds check: the last block may have threads past the end of the data.
    if idx < n {
        *c.add(idx as usize) = *a.add(idx as usize) + *b.add(idx as usize);
    }
}
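Because the indexing math is the most error-prone part of a SIMT kernel, it is worth sanity-checking on the CPU before involving the GPU at all. The sketch below is plain Rust (no cuda-oxide required): it emulates a 1-D launch by iterating over every (block, thread) pair and applying the same global-index formula as vector_add.

```rust
/// CPU emulation of a 1-D SIMT launch: calls `body` once per thread,
/// passing the same global index that `vector_add` computes on the GPU.
fn emulate_launch_1d(grid_dim: u32, block_dim: u32, mut body: impl FnMut(u32)) {
    for block_idx in 0..grid_dim {
        for thread_idx in 0..block_dim {
            body(block_idx * block_dim + thread_idx);
        }
    }
}

fn main() {
    let n = 1000u32;
    let a: Vec<f32> = (0..n).map(|i| i as f32).collect();
    let b: Vec<f32> = (0..n).map(|i| (i * 2) as f32).collect();
    let mut c = vec![0.0f32; n as usize];

    // 256 threads per block, enough blocks to cover n elements
    // (the same launch geometry a host program would use).
    let block_dim = 256;
    let grid_dim = (n + block_dim - 1) / block_dim;

    emulate_launch_1d(grid_dim, block_dim, |idx| {
        // Same bounds check as the kernel: trailing threads do nothing.
        if idx < n {
            c[idx as usize] = a[idx as usize] + b[idx as usize];
        }
    });

    // c[i] should equal i + 2*i = 3*i for every element.
    assert!(c.iter().enumerate().all(|(i, &v)| v == (i * 3) as f32));
    println!("ok: {} elements verified", n);
}
```

This catches off-by-one bugs in the bounds check (for example, `idx <= n`) long before you have to debug them through PTX.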

Step 5: Compile Your Kernel with cuda-oxide

Invoke the cuda-oxide compiler (a custom rustc wrapper) to produce a PTX file.

  1. Build with the custom codegen backend by pointing rustc at the compiled shared library (the exact invocation may change as the project evolves): RUSTFLAGS="-Zcodegen-backend=/path/to/cuda-oxide/target/release/librustc_codegen_cuda.so" cargo +nightly build --release --target-dir ptx
  2. Find the generated PTX file in ptx/release/my_gpu_kernel.ptx.

Step 6: Integrate PTX into a CUDA Host Program (Optional)

To run the kernel on actual hardware, you need a C/C++ host program that loads the PTX and launches it. cuda-oxide focuses on the kernel compilation step; for loading and launching, you can use the standard CUDA driver (or runtime) APIs.

  1. Write a simple host program (e.g., in C) that calls cuModuleLoadData and cuLaunchKernel with the PTX.
  2. Compile with NVCC: nvcc host.cu -o host
  3. Run: ./host

Step 7: Test and Debug

Validate the kernel by comparing its output against a CPU implementation of the same computation. You can also run the generated PTX through ptxas to assemble it into a cubin as an extra sanity check, for example ptxas -arch=sm_80 ptx/release/my_gpu_kernel.ptx -o my_gpu_kernel.cubin (substitute the architecture matching your GPU). Note that cuda-oxide is experimental, so expect occasional compilation failures.

Tips for Success

  • Leverage Stable MIR: cuda-oxide uses rustc_public (Stable MIR) to stay compatible across nightly versions. Avoid raw internal MIR to prevent breakage.
  • Use Safe Rust Where Possible: The compiler aims to bring CUDA into safe Rust. Use unsafe blocks only for device intrinsics and pointer access.
  • Understand Pliron Dialects: The middle representation uses Pliron, a Rust-native IR similar to MLIR. You don't need to modify it, but understanding it helps with debugging.
  • Coordinate with rust-cuda: These projects are complementary. Use rust-cuda for async and higher-level abstractions; cuda-oxide for direct SIMT kernel writing.
  • Keep Kernel Simple: The compiler is experimental – complex control flow may fail. Start with small kernels like vector addition.
  • Monitor NVIDIA's Repository: Check for updates and example code in the official cuda-oxide repository for best practices.
