Rust code can be fast. Very fast, in fact. If you look at the Benchmarks Game, it goes head-to-head with C and C++.

But performance isn't effortless, even if Rust's LLVM backend sometimes makes it seem so. I'm going to go over the ways I improve performance in my Rust projects.

Rayon isn't a magic bullet

It's really not. Many people think just slapping par_iter on the smallest operation will magically fix their performance. It won't. With that mindset, synchronization overhead will eat you alive.

Rayon has more than just par_iter. For example, par_chunks is very useful - you can split your task into parallel chunks, with each thread processing a portion of the dataset at a time. This greatly reduces synchronization overhead, especially in situations where you have a large number of small tasks. However, plain par_iter may still be the better choice for large tasks that take a while per iteration.

use rayon::prelude::*;

// `data` is a slice; par_chunks is a slice method, so each Rayon task
// handles 4096 elements at once, amortizing the scheduling overhead
// across the whole chunk.
data.par_chunks(4096).for_each(|chunk| {
	for item in chunk {
		item.do_small_thing();
	}
});
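
For contrast, when each item already represents a decent chunk of work, plain par_iter is the right tool. A toy sketch (the hash loop is purely synthetic, standing in for any expensive per-item computation):

use rayon::prelude::*;

let inputs: Vec<u64> = (0..64).collect();

// Each item performs millions of operations, so Rayon's per-item
// scheduling cost is negligible here.
let sums: Vec<u64> = inputs
	.par_iter()
	.map(|&seed| (0..10_000_000u64).fold(seed, |acc, x| acc.wrapping_mul(31).wrapping_add(x)))
	.collect();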

Buffering matters!

This is simple. I/O involves syscalls. Syscalls are bad for performance. Therefore, you want to minimize syscalls and optimize I/O.

You should always wrap I/O (whether it be a File, TcpStream, et cetera) in a BufReader or BufWriter. These quite simply buffer I/O operations, replacing many small reads and writes with fewer large ones. This reduces your total syscall count and increases overall performance.

Remember!!: If you use a BufWriter, make sure to call flush (and sync_all on the underlying File, if you need the data on disk) before it's dropped! Dropping a BufWriter flushes it too, but any errors are silently discarded.

use std::fs::File;
use std::io::{BufWriter, Write};

let fd = File::create("example.bin").expect("Failed to create file!");
let mut writer = BufWriter::new(fd);
// `buffer` is any `Read` source, e.g. a `&[u8]` slice.
std::io::copy(&mut buffer, &mut writer).expect("Failed to copy buffer!");
writer.flush().expect("Failed to flush file!");
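
Reading works the same way. A minimal sketch ("example.txt" is just a placeholder path):

use std::fs::File;
use std::io::{BufRead, BufReader};

let file = File::open("example.txt").expect("Failed to open file!");
// `lines()` pulls from an in-memory buffer (8 KiB by default) rather
// than hitting the file descriptor for every line.
let reader = BufReader::new(file);
for line in reader.lines() {
	let line = line.expect("Failed to read line!");
	// process `line` here
}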

std isn't always the best.

The Rust standard library is great. I mean, it really is. But it doesn't always offer the best options. Some crates provide near-identical interfaces at greatly increased performance.

  • parking_lot - Offers faster Mutex and RwLock implementations than the standard library's. On top of performing better, they don't poison, so there's no extra match/unwrap on every lock (see the sketch after this list).
  • crossbeam-channel and flume - These provide alternative Sender/Receiver implementations to the ones in std::sync::mpsc. I personally prefer flume, as it's implemented in 100% safe code.
  • dashmap - A better solution than throwing Arc<RwLock<HashMap<K, V>>> everywhere: it shards the map internally, allows concurrent access, performs very well, and is easy to migrate to.
  • ryu and lexical - Highly performant conversions between floats and decimal strings: ryu turns 1.2345_f32 into "1.2345" extremely quickly, and lexical handles both directions.
    • Just prefer to avoid text processing when possible, truth be told.
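
Here's a quick sketch of the parking_lot and dashmap points (the names are arbitrary):

use dashmap::DashMap;
use parking_lot::Mutex;

// No poisoning: lock() hands back the guard directly, with no Result
// to match or unwrap.
let counter = Mutex::new(0u64);
*counter.lock() += 1;

// DashMap shards its contents internally, so threads can read and
// write concurrently without fighting over a single global lock.
let hits: DashMap<&str, u64> = DashMap::new();
hits.insert("page", 1);
if let Some(mut count) = hits.get_mut("page") {
	*count += 1;
}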

Allocating the path to hell

Many Rust developers take types such as String and Vec for granted, without understanding the downsides. These are dynamically allocated types. Allocations are not your friend when you're optimizing for performance.

  • In types that will be serialized/deserialized from another format, prefer Cow<str>. This lets you borrow the string where possible, and only allocate an owned copy when you actually need one (see the sketch after this list).
  • Look into crates such as tinyvec and smol_str. These give you stack-allocated, small-size-optimized structures with minimal effort.
  • Types that require an explicit clone typically allocate! Prefer Copy types where possible.
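
Here's what the Cow<str> point looks like in practice (a minimal sketch - the carriage-return stripping is just an arbitrary conditional transformation):

use std::borrow::Cow;

// Borrows on the fast path; only allocates when the input actually
// needs to change.
fn strip_carriage_returns(input: &str) -> Cow<'_, str> {
	if input.contains('\r') {
		Cow::Owned(input.replace('\r', ""))
	} else {
		Cow::Borrowed(input)
	}
}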

In addition, look into alternative allocators which may yield better performance for your project, such as jemallocator or mimalloc.
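
Swapping one in takes just a couple of lines. For example, with the mimalloc crate:

use mimalloc::MiMalloc;

// Every heap allocation in the program now goes through mimalloc.
#[global_allocator]
static GLOBAL: MiMalloc = MiMalloc;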

Advanced Magic Extensions

Modern processors have tons of extremely useful extensions, such as AVX and SSE. Even on non-x86 platforms, extensions with similar functionality are available, such as NEON on ARM, and the proposed P and V extensions for RISC-V.

While Rust allows you to directly interface with these extensions, and there are crates for higher-level interfacing, such as packed_simd and generic-simd, the LLVM optimizer is also capable of automatically vectorizing your code to use them.

You may need to pass -C target-cpu=native or -C target-features=+avx through RUSTFLAGS in order to take advantage of this (see rustc --print target-features for the features available on your target, and use something like lscpu to see what your CPU supports).
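
If you'd rather not export RUSTFLAGS in every shell, the same flags can live in your project's .cargo/config.toml:

[build]
rustflags = ["-C", "target-cpu=native"]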

  • Doing things in groups of 4 or 8 is good for vectorization, since it matches common SIMD lane widths.
    • Do note, branching will heavily reduce the chances of vectorization.

See the functions below. Together, they convert four f32s into four u8s.

/// Converts an [f32] to a [u8], clamping values above [u8::MAX].
///
/// # Safety
/// `f` must not be NaN or negative - `to_int_unchecked` is undefined
/// behavior for values that don't fit in the target type.
#[inline]
pub unsafe fn f32_to_u8(f: f32) -> u8 {
	if f > f32::from(u8::MAX) {
		u8::MAX
	} else {
		f32::to_int_unchecked(f)
	}
}

/// Converts an array of 4 [f32]s into a tuple of 4 [u8]s, truncating them in the process
#[must_use]
pub fn f32s4_to_u8(f: [f32; 4]) -> (u8, u8, u8, u8) {
	let f = &f[..4];
	unsafe {
		(
			f32_to_u8(f[0]),
			f32_to_u8(f[1]),
			f32_to_u8(f[2]),
			f32_to_u8(f[3]),
		)
	}
}

Now, we can throw this code into Compiler Explorer to see what assembly it generates. Don't forget the compiler flags!

example::f32s4_to_u8:
        vmovss  xmm0, dword ptr [rip + .LCPI0_0]
        vminss  xmm1, xmm0, dword ptr [rdi]
        vcvttss2si      eax, xmm1
        vminss  xmm0, xmm0, dword ptr [rdi + 4]
        vcvttss2si      ecx, xmm0
        vmovsd  xmm0, qword ptr [rdi + 8]
        vbroadcastss    xmm1, dword ptr [rip + .LCPI0_0]
        vcmpleps        xmm2, xmm1, xmm0
        vblendvps       xmm0, xmm0, xmm1, xmm2
        vcvttps2dq      xmm0, xmm0
        vpand   xmm0, xmm0, xmmword ptr [rip + .LCPI0_1]
        vpsllvd xmm0, xmm0, xmmword ptr [rip + .LCPI0_2]
        movzx   ecx, cl
        shl     ecx, 8
        movzx   eax, al
        or      eax, ecx
        vmovd   ecx, xmm0
        or      ecx, eax
        vpextrd eax, xmm0, 1
        or      eax, ecx
        ret

Success! It generates AVX instructions, such as VBROADCASTSS and VMOVSS!

Making the compiler brrrr harder

It is entirely possible to configure the compiler to optimize more aggressively! For example, in Cargo.toml (do note this will increase compile times!):

[profile.release]
lto = 'thin'
panic = 'abort'
codegen-units = 1

[profile.bench]
lto = 'thin'
codegen-units = 1

Each option explained:

  • lto = 'thin' - Quite simply enables Thin LTO, which lets LLVM optimize across codegen units at link time. You can also try lto = 'fat'; performance gains should be similar.
  • panic = 'abort' - Abort instead of unwinding on panic. You'll get a smaller, more performant binary, but you won't be able to catch panics anymore. See the Rust Guide for more info.
  • codegen-units = 1 - Ensures that the crate is compiled as a single code generation unit. This reduces the parallelization of the compilation, but allows LLVM to optimize it much better.

Edits

  • 9/30/2020, 3:40 PM EST - Re-phrased the Copy/Clone section (thanks /u/SkiFire13), mentioned sync_all in the buffering section (thanks /u/Freeky), and also mentioned lto = 'fat' (thanks /u/po8)