- Use `SIMD[DType.float32, simd_width]` values and `vectorize[fn, simd_width](size)` to process multiple elements per instruction instead of scalar loops — note the closure and width are compile-time parameters and only `size` is a runtime argument.
- Prefer `parallelize[fn](count)` from `algorithm` for embarrassingly parallel workloads — it distributes the `count` iterations across all CPU cores automatically (pass a second runtime argument, `parallelize[fn](count, num_workers)`, to cap the worker count).
- Use `UnsafePointer` and manual memory layout only when profiling shows that `DynamicVector` or `Tensor` allocation is the bottleneck.
- Use `from sys import simdwidthof` to query the target's native SIMD width at compile time and pass it to `vectorize` — avoid hardcoding `8` or `16`.
- Annotate hot inner-loop `fn` functions with `@always_inline` to eliminate call overhead after confirming the benefit in a benchmark.
- Prefer `Tensor[DType.float32]` with contiguous row-major layout for matrix math — cache-friendly access patterns can improve throughput by an order of magnitude.
- Use `benchmark.run[fn]()` to measure wall-clock time with warm-up iterations — never use `time.now()` diffs for microbenchmarks.
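The first and fourth tips combine naturally: query the native width once, then let `vectorize` tile the loop. A minimal sketch, assuming a recent Mojo build where `vectorize` takes the closure and width as parameters and `UnsafePointer` exposes `load[width=...]`/`store` for SIMD access; `scale_buffer`, `data`, and `factor` are hypothetical names:

```mojo
from algorithm import vectorize
from sys import simdwidthof
from memory import UnsafePointer

fn scale_buffer(data: UnsafePointer[Float32], size: Int, factor: Float32):
    # Native SIMD lane count for float32 on the compile target — not hardcoded.
    alias width = simdwidthof[DType.float32]()

    @parameter
    fn scale[simd_width: Int](i: Int):
        # Load `simd_width` contiguous floats, scale them, store them back.
        data.store(i, data.load[width=simd_width](i) * factor)

    # Runs `scale` in full-width chunks, then finishes any remainder with
    # a narrower tail — no manual cleanup loop needed.
    vectorize[scale, width](size)
```

`vectorize` handles the non-multiple-of-width tail itself, which is the usual source of bugs in hand-rolled SIMD loops.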
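For the `parallelize` tip, a per-row reduction is the canonical embarrassingly parallel shape: each worker writes a distinct output slot, so no synchronization is required. A sketch under the same recent-Mojo assumption; `sum_rows`, `matrix`, and `sums` are hypothetical names:

```mojo
from algorithm import parallelize
from memory import UnsafePointer

fn sum_rows(matrix: UnsafePointer[Float32], sums: UnsafePointer[Float32],
            rows: Int, cols: Int):
    @parameter
    fn one_row(r: Int):
        # Each invocation owns exactly one row and one output slot,
        # so workers never contend on shared state.
        var acc: Float32 = 0
        for c in range(cols):
            acc += matrix[r * cols + c]
        sums[r] = acc

    # Distributes the `rows` iterations across all available cores.
    parallelize[one_row](rows)
```

Row-major indexing (`r * cols + c`) keeps each worker's reads contiguous, which also serves the cache-friendliness point above.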
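The last tip can be sketched with the `benchmark` module; this assumes its `run`, `keep`, and `Report.print` APIs from a recent Mojo build, and `work` is a hypothetical function under test:

```mojo
import benchmark
from benchmark import keep

fn work():
    var acc: Float32 = 0
    for i in range(1_000):
        acc += Float32(i) * 0.5
    # `keep` stops the compiler from eliding the loop as dead code,
    # which would make the measurement meaningless.
    keep(acc)

fn main():
    # `run` performs warm-up and repeated timed iterations internally,
    # unlike a single start/stop timestamp pair.
    var report = benchmark.run[work]()
    report.print()
```

The warm-up iterations are what raw timestamp diffs miss: the first few calls pay one-off costs (instruction cache misses, page faults) that swamp a microbenchmark.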