- Use `SIMD[DType.float32, simd_width]` values and `vectorize[fn, simd_width](size)` to process multiple elements per instruction instead of scalar loops — note the closure and width are compile-time parameters and only `size` is a runtime argument.
- Prefer `parallelize[fn](count)` from `algorithm` for embarrassingly parallel workloads — it distributes the `count` iterations across all CPU cores automatically (pass a second runtime argument, `parallelize[fn](count, num_workers)`, to cap the worker count).
- Use `UnsafePointer` and manual memory layout only when profiling shows that `DynamicVector` or `Tensor` allocation is the bottleneck.
- Use `from sys import simdwidthof` to query the target's native SIMD width at compile time and pass it to `vectorize` — avoid hardcoding `8` or `16`.
- Annotate hot inner-loop `fn` functions with `@always_inline` to eliminate call overhead after confirming the benefit in a benchmark.
- Prefer `Tensor[DType.float32]` with contiguous row-major layout for matrix math — cache-friendly access patterns can improve throughput by an order of magnitude.
- Use `benchmark.run[fn]()` to measure wall-clock time with warm-up iterations — never use `time.now()` diffs for microbenchmarks.
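The first and fourth tips combine naturally: query the native width once, then let `vectorize` tile the loop. A minimal sketch, assuming a recent Mojo build where `vectorize` takes the closure and width as parameters and `UnsafePointer` exposes `load[width=...]`/`store` for SIMD access; `scale_buffer`, `data`, and `factor` are hypothetical names:

```mojo
from algorithm import vectorize
from sys import simdwidthof
from memory import UnsafePointer

fn scale_buffer(data: UnsafePointer[Float32], size: Int, factor: Float32):
    # Native SIMD lane count for float32 on the compile target — not hardcoded.
    alias width = simdwidthof[DType.float32]()

    @parameter
    fn scale[simd_width: Int](i: Int):
        # Load `simd_width` contiguous floats, scale them, store them back.
        data.store(i, data.load[width=simd_width](i) * factor)

    # Runs `scale` in full-width chunks, then finishes any remainder with
    # a narrower tail — no manual cleanup loop needed.
    vectorize[scale, width](size)
```

`vectorize` handles the non-multiple-of-width tail itself, which is the usual source of bugs in hand-rolled SIMD loops.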
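For the `parallelize` tip, a per-row reduction is the canonical embarrassingly parallel shape: each worker writes a distinct output slot, so no synchronization is required. A sketch under the same recent-Mojo assumption; `sum_rows`, `matrix`, and `sums` are hypothetical names:

```mojo
from algorithm import parallelize
from memory import UnsafePointer

fn sum_rows(matrix: UnsafePointer[Float32], sums: UnsafePointer[Float32],
            rows: Int, cols: Int):
    @parameter
    fn one_row(r: Int):
        # Each invocation owns exactly one row and one output slot,
        # so workers never contend on shared state.
        var acc: Float32 = 0
        for c in range(cols):
            acc += matrix[r * cols + c]
        sums[r] = acc

    # Distributes the `rows` iterations across all available cores.
    parallelize[one_row](rows)
```

Row-major indexing (`r * cols + c`) keeps each worker's reads contiguous, which also serves the cache-friendliness point above.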
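The last tip can be sketched with the `benchmark` module; this assumes its `run`, `keep`, and `Report.print` APIs from a recent Mojo build, and `work` is a hypothetical function under test:

```mojo
import benchmark
from benchmark import keep

fn work():
    var acc: Float32 = 0
    for i in range(1_000):
        acc += Float32(i) * 0.5
    # `keep` stops the compiler from eliding the loop as dead code,
    # which would make the measurement meaningless.
    keep(acc)

fn main():
    # `run` performs warm-up and repeated timed iterations internally,
    # unlike a single start/stop timestamp pair.
    var report = benchmark.run[work]()
    report.print()
```

The warm-up iterations are what raw timestamp diffs miss: the first few calls pay one-off costs (instruction cache misses, page faults) that swamp a microbenchmark.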