Auto-vectorization in Rust: how to see it (and when it fails)


Auto-vectorization is a common optimization you get from rustc/LLVM: if your loop has the right shape, LLVM can turn scalar operations into SIMD instructions. This post shows two small loops, how to spot whether they vectorized, and the most common reasons they don’t.

What auto-vectorization is

Auto-vectorization is an optimization pass provided by LLVM that transforms scalar code into vector code when it’s safe and worth it. In practice, this often means emitting packed SIMD instructions like mulps/vmulps (x86) instead of scalar mulss.

Vectorization usually only happens when LLVM can prove:

  1. Correctness: the vector loop computes the same results as the scalar loop, meaning iterations are independent (or form a reduction it recognizes) and memory accesses don’t overlap.

  2. Profitability: its cost model expects the vector version to actually be faster on the target CPU.

How LLVM vectorizes Rust loops

rustc lowers Rust into LLVM IR, and LLVM runs its vectorizers. Two passes matter most:

  1. The Loop Vectorizer, which widens a loop so that each iteration of the new loop processes several elements of the original one.

  2. The SLP (superword-level parallelism) Vectorizer, which combines independent scalar operations in straight-line code into vector operations.

Vectorization depends on code shape, target CPU, and LLVM’s cost model. The easiest way to confirm what happened is to inspect the generated asm/IR.
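
On Compiler Explorer that’s just a matter of pasting the function in; locally, one option (a sketch of one workflow, not the only way) is to ask rustc for both outputs:

# Emit asm and LLVM IR for the crate (release build, native CPU)
RUSTFLAGS="-C target-cpu=native" cargo rustc --release -- --emit=asm,llvm-ir
# The .s and .ll files land in target/release/deps/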

Case 1: loop-carried dependency

This loop has a loop-carried dependency, so LLVM can’t safely vectorize it:

pub fn prefix_product(a: &[f32], b: &mut [f32]) {
    let mut acc = 1.0;
    for (x, y) in a.iter().zip(b.iter_mut()) {
        acc *= *x; // depends on the previous iteration
        *y = acc;
    }
}

If you paste this into Compiler Explorer (godbolt.org) and look at the asm, you’ll typically see scalar ops (mulss/movss) rather than packed SIMD (mulps/vmulps).

In LLVM IR, don’t rely on seeing shufflevector; many vectorized loops won’t need it. Better indicators are:

  1. Vector types in loads and stores, e.g. load <4 x float> or load <8 x float> (the width depends on the target features).

  2. Arithmetic on vector values, e.g. fmul <8 x float>.

  3. For reductions, calls to intrinsics like llvm.vector.reduce.fadd.
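
One caveat before moving on: a loop-carried value doesn’t always block vectorization, because LLVM recognizes common reduction patterns. As a sketch of my own (not from the original post): an integer sum like the one below typically does vectorize, since integer addition can be freely reassociated, while an f32 sum usually stays scalar because rustc doesn’t allow floating-point reassociation by default.

pub fn sum(a: &[i32]) -> i32 {
    let mut acc = 0;
    for x in a {
        acc += *x; // a reduction pattern LLVM can recognize and vectorize
    }
    acc
}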

Case 2: independent iterations

Now compare it with a loop where each iteration is independent:

pub fn mul_arrays(a: &mut [f32], b: &[f32], c: f32) {
    for (el_1, el_2) in a.iter_mut().zip(b.iter()) {
        *el_1 = el_2 * c;
    }
}

This loop is a much better candidate for vectorization: there are no loop-carried dependencies, and the computation is the same shape for every element.

LLVM will only vectorize when it can prove the vector loop preserves the original semantics.

If it vectorizes on x86, you’ll usually see packed SIMD instructions like mulps/vmulps plus vector loads/stores.

In LLVM IR, look for vector types and vector ops (examples above).
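
If you prefer a per-function view locally, the third-party cargo-show-asm tool can print exactly this (assuming you’re happy installing it; your_crate is a placeholder for your actual crate name):

# One-time install of the viewer
cargo install cargo-show-asm

# Asm for a single function; add --llvm to see the IR instead
cargo asm your_crate::mul_arrays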

Benchmarking it correctly

If you want to measure the effect of LLVM’s vectorizers, benchmark the same function with vectorization enabled vs disabled. Comparing different computations (even if they look similar) will give you meaningless numbers.

LLVM’s docs include a performance section with examples of auto-vectorization improving speed. However, as Linus Torvalds famously quipped, “Talk is cheap. Show me the code.”

Rust exposes flags to disable LLVM’s vectorizers:

  1. -C no-vectorize-loops turns off the loop vectorizer.

  2. -C no-vectorize-slp turns off the SLP vectorizer.

That lets you run the same benchmark twice:

# Vectorization on
RUSTFLAGS="-C opt-level=3 -C target-cpu=native" cargo bench

# Vectorization off (both loop + SLP)
RUSTFLAGS="-C opt-level=3 -C target-cpu=native -C no-vectorize-loops -C no-vectorize-slp" cargo bench

If you want a small playground project, I keep one here: vectorization_benchmark.

Practical checklist

To make vectorization more likely:

  1. Keep the loop shape simple: predictable bounds and straightforward indexing help LLVM prove safety.

  2. Avoid loop-carried dependencies: reductions like sum/product can vectorize with special handling, but general dependencies usually block vectorization.

  3. Help alias analysis: prefer separate input/output slices (like (&[T], &mut [T])) so LLVM can assume they don’t overlap.

  4. Compile for your CPU: -C target-cpu=native can enable wider SIMD and better codegen.

  5. Use explicit SIMD when needed: when LLVM can’t prove safety or the cost model refuses, reach for std::simd (portable SIMD, nightly-only at the time of writing); see the sketch below.
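
Here is a minimal std::simd sketch of the mul_arrays kernel, assuming a nightly toolchain with the portable_simd feature enabled; the name mul_arrays_simd and the 8-lane width are my choices, not the post’s:

// Requires a nightly toolchain and `#![feature(portable_simd)]` at the crate root.
use std::simd::f32x8;

pub fn mul_arrays_simd(a: &mut [f32], b: &[f32], c: f32) {
    let n = a.len().min(b.len());
    let factor = f32x8::splat(c);
    let mut chunks_a = a[..n].chunks_exact_mut(8);
    let mut chunks_b = b[..n].chunks_exact(8);
    for (ca, cb) in (&mut chunks_a).zip(&mut chunks_b) {
        let v = f32x8::from_slice(cb) * factor; // 8 multiplies at once
        ca.copy_from_slice(&v.to_array());
    }
    // Scalar fallback for the tail that doesn't fill a full 8-lane vector.
    for (x, y) in chunks_a.into_remainder().iter_mut().zip(chunks_b.remainder()) {
        *x = *y * c;
    }
}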

Wrap-up

Auto-vectorization is easiest to reason about when you treat it like a compiler optimization you can verify: write a loop, inspect asm/IR, then measure the same benchmark with vectorizers on vs off.