# Optimization - Making Rust Code Go Brrrr
Rust code can be fast. Very fast, in fact. If you look at the Benchmarks Game, it goes head-to-head with C and C++.
But performance isn't effortless, although Rust's LLVM backend makes it seem so. I'm going to go over the ways I improve performance in my Rust projects.
## Rayon isn't a magic bullet
It's really not. Many people think that just slapping `par_iter` on the smallest operation will magically fix their performance. It won't. With that mindset, synchronization overhead will eat you alive.

Rayon has more than just `par_iter`. For example, `par_chunks` is very useful - you can split your task into parallel chunks, each thread processing a portion of the entire dataset at a time. This greatly reduces synchronization overhead, especially in situations where you have a large number of small tasks. However, `par_iter` may still be the better choice for large tasks that take a while per iteration.
```rust
data.par_chunks(4096).for_each(|chunk| {
    for item in chunk {
        item.do_small_thing();
    }
});
```
## Buffering matters!
This is simple. I/O involves syscalls. Syscalls are bad for performance. Therefore, you want to minimize syscalls and optimize I/O.
You should always wrap I/O (whether it be a `File`, `TcpStream`, et cetera) in a `BufReader` or `BufWriter`. These quite simply buffer I/O operations, preferring a single large batch over many small ones. This reduces your total syscalls and increases overall performance.
**Remember!!:** If you use a `BufWriter`, make sure to call `flush` (and, for files, `sync_all` on the underlying `File`) before it's dropped! This will allow you to handle any errors.
```rust
let fd = File::create("example.bin").expect("Failed to create file!");
let mut writer = BufWriter::new(fd);
std::io::copy(&mut buffer, &mut writer).expect("Failed to copy buffer!");
writer.flush().expect("Failed to flush file!");
writer.get_ref().sync_all().expect("Failed to sync file!");
```
## std isn't always the best
The Rust standard library is great. I mean, it really is. But it doesn't always offer the best options. Some crates provide near-identical interfaces at greatly increased performance.
- `parking_lot` - Offers better `Mutex` and `RwLock` implementations than Rust's standard library. In addition to performing better, they don't poison (so there's no need for an additional match/unwrap).
- `crossbeam-channel` and `flume` - These provide alternative `Sender`/`Receiver` implementations to the ones in `std::sync::mpsc`. I personally prefer `flume`, as it's implemented in 100% safe code.
- `dashmap` - A better solution than throwing `Arc<RwLock<HashMap<K, V>>>` everywhere, as it's optimized with sharding, allows concurrent access, is highly performant, and is easy to use and convert to.
- `ryu` and `lexical` - Highly performant interfaces for converting to and from decimal strings. Quite simply, they turn `"1.2345"` into `1.2345_f32` (and vice versa), and do so fast.
- Just prefer to avoid text processing when possible, truth be told.
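For context, these crates accelerate conversions that std already does correctly via `parse` and `to_string`; a baseline sketch of the conversions in question:

```rust
fn main() {
    // std's `parse` does the same job as `lexical`, just slower.
    let parsed: f32 = "1.2345".parse().expect("valid float literal");
    assert_eq!(parsed, 1.2345_f32);

    // ...and `to_string` is the counterpart that `ryu` aims to beat.
    // std guarantees the printed form round-trips to the same value.
    let formatted = parsed.to_string();
    assert_eq!(formatted.parse::<f32>().unwrap(), parsed);
}
```

The crates are drop-in wins precisely because the interfaces line up this closely.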
## Allocating the path to hell
Many Rust developers take types such as `String` and `Vec` for granted, without understanding the downsides. These are dynamically allocated types, and allocations are not your friend when you're optimizing for performance.
- In types that will be serialized/deserialized from another format, prefer `Cow<str>`. This will allow you to borrow the string, and only convert it to an owned string if needed.
- Look into crates such as `tinyvec` and `smol_str`. These allow you to have stack-optimized structures, with minimal effort.
- Types that require an explicit `clone` typically allocate! Prefer `Copy` types where possible.
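As a sketch of the `Cow<str>` pattern: a function can return borrowed data on the common path and only allocate when it actually has to change something (the `sanitize` helper here is hypothetical, purely for illustration):

```rust
use std::borrow::Cow;

/// Replaces tabs with spaces, but only allocates when a tab is present.
/// (Hypothetical helper, illustrating the borrow-or-own pattern.)
fn sanitize(input: &str) -> Cow<'_, str> {
    if input.contains('\t') {
        Cow::Owned(input.replace('\t', " "))
    } else {
        Cow::Borrowed(input) // no allocation on the fast path
    }
}

fn main() {
    // Clean input: borrowed, zero allocations.
    let clean = sanitize("no tabs here");
    assert!(matches!(clean, Cow::Borrowed(_)));

    // Dirty input: the one case that pays for an allocation.
    let fixed = sanitize("has\ttabs");
    assert!(matches!(fixed, Cow::Owned(_)));
    assert_eq!(fixed, "has tabs");
}
```

With serde, `#[serde(borrow)]` on a `Cow<'a, str>` field gets you the same zero-copy behavior during deserialization.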
In addition, look into alternative allocators, such as `jemallocator` or `mimalloc`, which may yield better performance for your project.
## Advanced Magic Extensions
Modern processors have tons of extremely useful extensions, such as AVX and SSE. Even on non-x86 platforms, extensions with similar functionality are available, such as NEON on ARM, and the proposed P and V extensions for RISC-V.
While Rust allows you to directly interface with these extensions, and there are many crates for higher-level interfacing, such as `packed_simd` and `generic-simd`, the LLVM optimizer is also capable of automatically vectorizing code to use these extensions.
You may need to pass `-C target-cpu=native` or `-C target-feature=+avx` through `RUSTFLAGS` in order to take advantage of this (see `rustc --print target-features` for the features available on your target, and use something like `lscpu` to see what your CPU supports).
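For example, a one-off invocation tuning the build for the machine it's compiled on might look like this (shown as a plain cargo invocation; you can also set the flag in `.cargo/config.toml`):

```sh
RUSTFLAGS="-C target-cpu=native" cargo build --release
```

Note that a binary built this way may use instructions unavailable on older CPUs, so don't distribute it.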
- Doing things in groups of 4/8 is good for vectorization.
- Do note that branching will heavily reduce the chances of vectorization.
See these functions. Together, they convert four `f32`s into four `u8`s.
```rust
/// Converts an `f32` to a `u8`, saturating at `u8::MAX`.
///
/// # Safety
/// `f` must not be negative or NaN; `to_int_unchecked` is undefined
/// behavior for values that don't fit in the target type after truncation.
#[inline]
pub unsafe fn f32_to_u8(f: f32) -> u8 {
    if f > f32::from(u8::MAX) {
        u8::MAX
    } else {
        f32::to_int_unchecked(f)
    }
}
```
```rust
/// Converts an array of 4 [f32]s into a tuple of 4 [u8]s, truncating them in the process
#[must_use]
pub fn f32s4_to_u8(f: [f32; 4]) -> (u8, u8, u8, u8) {
    let f = &f[..4];
    // SAFETY: `f32_to_u8` requires non-negative, non-NaN inputs.
    unsafe {
        (
            f32_to_u8(f[0]),
            f32_to_u8(f[1]),
            f32_to_u8(f[2]),
            f32_to_u8(f[3]),
        )
    }
}
```
Now, we can throw this code into Compiler Explorer to see what assembly it generates. Don't forget the compiler flags!
```asm
example::f32s4_to_u8:
        vmovss        xmm0, dword ptr [rip + .LCPI0_0]
        vminss        xmm1, xmm0, dword ptr [rdi]
        vcvttss2si    eax, xmm1
        vminss        xmm0, xmm0, dword ptr [rdi + 4]
        vcvttss2si    ecx, xmm0
        vmovsd        xmm0, qword ptr [rdi + 8]
        vbroadcastss  xmm1, dword ptr [rip + .LCPI0_0]
        vcmpleps      xmm2, xmm1, xmm0
        vblendvps     xmm0, xmm0, xmm1, xmm2
        vcvttps2dq    xmm0, xmm0
        vpand         xmm0, xmm0, xmmword ptr [rip + .LCPI0_1]
        vpsllvd       xmm0, xmm0, xmmword ptr [rip + .LCPI0_2]
        movzx         ecx, cl
        shl           ecx, 8
        movzx         eax, al
        or            eax, ecx
        vmovd         ecx, xmm0
        or            ecx, eax
        vpextrd       eax, xmm0, 1
        or            eax, ecx
        ret
```
Success! It generates AVX instructions, such as `VBROADCASTSS` and `VMOVSS`!
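If you'd rather avoid the `unsafe` entirely: since Rust 1.45, the safe `as` cast from float to integer saturates (and maps NaN to 0), so a safe variant of the same conversion can be written directly and still tends to vectorize well:

```rust
/// Safe variant: `as` saturates out-of-range values and maps NaN to 0
/// (guaranteed behavior since Rust 1.45), so no `unsafe` is required.
#[must_use]
pub fn f32s4_to_u8_safe(f: [f32; 4]) -> (u8, u8, u8, u8) {
    (f[0] as u8, f[1] as u8, f[2] as u8, f[3] as u8)
}

fn main() {
    // Truncation and saturation at the top of the range...
    assert_eq!(f32s4_to_u8_safe([0.0, 1.9, 255.0, 300.0]), (0, 1, 255, 255));
    // ...and NaN/negative inputs clamp to 0 instead of causing UB.
    assert_eq!(f32s4_to_u8_safe([f32::NAN, -5.0, 64.0, 127.5]), (0, 0, 64, 127));
}
```

It's worth benchmarking this against the unchecked version; the saturating cast often compiles to the same min/cvt instruction pair shown above.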
## Making the compiler brrrr harder
It is entirely possible to configure the compiler to optimize more aggressively! For example, in `Cargo.toml` (do note this will increase compile times!!):

```toml
[profile.release]
lto = 'thin'
panic = 'abort'
codegen-units = 1

[profile.bench]
lto = 'thin'
codegen-units = 1
```
Each option explained:

- `lto = 'thin'` - Quite simply enables Thin LTO. You can also try `lto = 'fat'`; performance gains should be similar.
- `panic = 'abort'` - Abort instead of unwinding on panic. You'll get a smaller, more performant binary, but you won't be able to catch panics anymore. See the Rust Guide for more info.
- `codegen-units = 1` - Ensures that the crate is compiled as a single code generation unit. This reduces the parallelization of the compilation, but allows LLVM to optimize it much better.
## Edits

- 9/30/2020, 3:40 PM EST - Re-phrased the Copy/Clone section (thanks /u/SkiFire13), mentioned `sync_all` in the buffering section (thanks /u/Freeky), and also mentioned `lto = 'fat'` (thanks /u/po8)