<h1>Meta's KernelEvolve: Autonomous Kernel Optimization for Scalable AI Infrastructure</h1>
<h2>Introduction</h2>
<p>Meta operates some of the world's largest AI services, from personalized recommendations to generative AI assistants, serving billions of users daily. Behind these systems lies a complex infrastructure of heterogeneous hardware, including NVIDIA GPUs, AMD GPUs, Meta's custom MTIA chips, and CPUs. To harness this diverse hardware efficiently, Meta developed <strong>KernelEvolve</strong>, an agentic kernel authoring system that autonomously optimizes the low-level software instructions—known as kernels—that translate high-level model operations into chip-specific code. This article explores how KernelEvolve accelerates AI infrastructure optimization, delivering significant performance gains and freeing engineers for higher-level tasks.</p><figure style="margin:20px 0"><img src="https://engineering.fb.com/wp-content/uploads/2026/04/Meta-KernalEvolve-REA-Hero.png" alt="Meta's KernelEvolve: Autonomous Kernel Optimization for Scalable AI Infrastructure" style="width:100%;height:auto;border-radius:8px" loading="lazy"><figcaption style="font-size:12px;color:#666;margin-top:5px">Source: engineering.fb.com</figcaption></figure>
<h2 id="challenge">The Challenge of Heterogeneous Hardware</h2>
<p>Efficiently using Meta's fleet of hardware requires writing optimized kernels for each chip generation and machine learning model architecture. While standard operations like general matrix multiplications (GEMMs) and convolutions are covered by vendor libraries, production ranking models demand many custom operators. As the number of models and hardware types grows, manual tuning by kernel experts becomes unsustainable. The combinatorial explosion of <em>models × hardware generations</em> creates a massive optimization workload that traditional approaches cannot scale.</p>
<h2 id="kernel-evolve">KernelEvolve: An Agentic Approach</h2>
<p>KernelEvolve treats kernel optimization as a <strong>search problem</strong>. It leverages a large language model (LLM) to generate candidate kernels and a purpose-built job harness that evaluates each candidate on real hardware. The system then feeds diagnostics back to the LLM, driving an iterative search over hundreds of alternatives. This approach compresses what used to take weeks of expert engineering effort into hours of automated exploration.</p>
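<p>The propose&#8211;measure&#8211;refine loop described above can be sketched in a few lines of Python. This is a minimal illustration, not Meta's implementation: <code>generate_candidate</code> and <code>evaluate_on_hardware</code> are hypothetical stand-ins for the LLM call and the job harness, and the mock evaluator returns a random score rather than a real hardware measurement.</p>

```python
import random

def generate_candidate(task, feedback):
    """Stand-in for the LLM call that proposes a kernel variant.

    In KernelEvolve this would query a large language model with the
    task description plus prior diagnostics; here we return a mock
    source string that records what feedback informed it.
    """
    return f"// kernel variant for '{task}', informed by: {feedback}"

def evaluate_on_hardware(kernel_src):
    """Stand-in for the job harness: compile, run, and measure.

    Returns (score, diagnostics). Real measurements would come from
    the target GPU/MTIA device; we fake a throughput multiplier.
    """
    score = random.uniform(0.5, 2.0)  # mock relative throughput
    return score, f"achieved {score:.2f}x baseline throughput"

def search_loop(task, budget=100):
    """Closed-loop search: propose, measure, feed diagnostics back."""
    best_src, best_score = None, float("-inf")
    feedback = "no prior runs"
    for _ in range(budget):
        candidate = generate_candidate(task, feedback)
        score, diagnostics = evaluate_on_hardware(candidate)
        if score > best_score:
            best_src, best_score = candidate, score
        feedback = diagnostics  # drives the next LLM proposal
    return best_src, best_score
```

<p>The key structural point the sketch captures is that the evaluator's diagnostics, not just a scalar score, flow back into the next generation step, which is what lets the search explore hundreds of alternatives productively.</p>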
<h3 id="how-it-works">How It Works</h3>
<p>The process begins with a high-level description of the target operation, hardware constraints, and performance goals. KernelEvolve's agent generates kernel code in high-level domain-specific languages (DSLs) such as Triton, CuTe DSL, and FlyDSL, or in low-level languages such as CUDA, HIP, and MTIA C++. Each candidate is compiled, run on the target hardware, and measured for throughput, latency, and resource utilization. The job harness returns these metrics to the LLM, which then proposes refinements. This closed-loop optimization continues until performance converges or a user-defined budget is reached.</p>
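<p>The two stopping conditions mentioned above, convergence and a user-defined budget, are easy to make concrete. The sketch below is an assumption-laden illustration rather than KernelEvolve's actual logic: <code>measure_latency</code> is a hypothetical helper that times a Python callable in place of a real compiled kernel, and the convergence rule (recent best within a relative tolerance of the prior best) is one plausible choice among many.</p>

```python
import time

def measure_latency(kernel, runs=50):
    """Hypothetical timing helper: average wall-clock time per run.

    A real harness would launch the compiled kernel on the target
    device and read hardware counters; here we time a Python callable.
    """
    start = time.perf_counter()
    for _ in range(runs):
        kernel()
    return (time.perf_counter() - start) / runs

def converged(history, window=5, tol=0.01):
    """Declare convergence when the best latency seen in the last
    `window` iterations improved on the prior best by less than `tol`
    (relative). `history` is a list of per-candidate latencies."""
    if len(history) < 2 * window:
        return False  # not enough data to compare two windows
    prior_best = min(history[:-window])
    recent_best = min(history[-window:])
    return (prior_best - recent_best) / prior_best < tol
```

<p>With a rule like this, the search terminates either when <code>converged(history)</code> fires or when the iteration count exhausts the budget, whichever comes first.</p>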
<h2 id="results">Results and Impact</h2>
<p>KernelEvolve has demonstrated remarkable improvements across Meta's infrastructure:</p><figure style="margin:20px 0"><img src="https://engineering.fb.com/wp-content/uploads/2026/04/Meta-KernalEvolve-REA-image1.png" alt="Meta's KernelEvolve: Autonomous Kernel Optimization for Scalable AI Infrastructure" style="width:100%;height:auto;border-radius:8px" loading="lazy"><figcaption style="font-size:12px;color:#666;margin-top:5px">Source: engineering.fb.com</figcaption></figure>
<ul>
<li><strong>Faster development:</strong> What previously required weeks of manual profiling, optimization, and cross-hardware debugging now completes in hours of automated search.</li>
<li><strong>Better performance:</strong> The Andromeda Ads model achieved over <strong>60% inference throughput improvement</strong> on NVIDIA GPUs, and an ads model saw over <strong>25% training throughput improvement</strong> on Meta's custom MTIA chips.</li>
<li><strong>Broad applicability:</strong> The system optimizes kernels for public and proprietary hardware, including NVIDIA GPUs, AMD GPUs, MTIA chips, and CPUs.</li>
</ul>
<h2 id="broader-applicability">Broader Applicability Beyond Ads</h2>
<p>While KernelEvolve was originally developed for Meta's <em>Ranking Engineer Agent</em> to optimize ad ranking models, its design is general-purpose. The system can be applied to any AI model requiring custom kernels, from computer vision to natural language processing. Its ability to target both high-level DSLs and low-level languages makes it suitable for a wide range of hardware accelerators, ensuring that Meta's entire AI ecosystem benefits from continuous kernel optimization.</p>
<h2 id="conclusion">Conclusion</h2>
<p>KernelEvolve represents a major step forward in automating AI infrastructure optimization. By treating kernel authoring as a search problem and leveraging LLMs for code generation and refinement, Meta has turned a manual, expert-driven task into a scalable, autonomous process. The results—dramatic throughput gains and reduced engineering time—demonstrate the power of agentic approaches to systems optimization. As AI workloads continue to grow, systems like KernelEvolve will be essential for keeping pace with demand. Detailed technical insights are available in the paper, <em>"KernelEvolve: Scaling Agentic Kernel Coding for Heterogeneous AI Accelerators at Meta"</em>, presented at ISCA 2026.</p>