Optimizing a Chess Move Generator and Rust Shenanigans
This post is part of my series on making Shiro, a chess engine written in Rust.
Some time ago, I reached the first milestone of my chess engine project---legal move generation validated by an extensive perft suite. (Perft is a test that counts how many positions are reachable after X moves from a given position.) The suite tested correct enumeration of 5 billion nodes (chess positions) and ran in 20 seconds.
I had mixed feelings about the 250 million nodes per second (nps) move generation. On one hand, it's objectively an impressive number. On the other, it fell way short of my reference, Gigantua's 1.6 billion nps (tested on a Ryzen 9900X).
Dissatisfied with that speed, I set out to optimize my move generator with the goal of at least doubling its performance. After dozens of hours of tedious micro-optimizations and a questionable amount of unsafe Rust, I achieved 700 million nps move gen.
What is counted as move generation?
There is a significant difference between counting up moves and actually generating moves. If we only cared about a single perft number, we could skip exploring the last depth of the move search and just count how many moves are in the move list, speeding up generation by orders of magnitude. This is called bulk counting, and it is what the numbers in this article refer to.
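To make that concrete, a bulk-counted perft looks roughly like this (legal_moves and the make_move shape are illustrative placeholders rather than Shiro's exact API, and depth is assumed to be at least 1):

fn perft(game: &BoardState, depth: u8) -> u64 {
    // Generate all legal moves for this position (illustrative helper).
    let moves = legal_moves(game);
    if depth == 1 {
        // Bulk counting: at the last ply, just count the moves instead of making them.
        return moves.len() as u64;
    }
    let mut nodes = 0;
    for mv in moves {
        // make_move here returns the resulting position by value (illustrative shape).
        let next = make_move(game, mv);
        nodes += perft(&next, depth - 1);
    }
    nodes
}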
This post documents the techniques used, the optimizations made, and some interesting details I've found about Rust. Enjoy~
Motivation
Move generation is almost never a bottleneck in chess engines; in fact, generating legal moves is actually suboptimal for standard alpha-beta engines. That said, optimizing Shiro's movegen was not so much for engine performance as for testing whether Rust---and I---could stand on the same podium as engines like Gigantua.
Methodology
To optimize the throughput of any program, it helps to know where that program is spending its time. Tools called profilers are made exactly for this purpose. I mainly used perf, a sampling profiler for Linux. perf basically works by asking the kernel a few thousand times a second about our program's running statistics. It accumulates these stats and can figure out which functions our program spent the most time in, how many cache misses and branch mispredictions there were, and a bunch of other useful stats.
Unfortunately, perf on its own is a bit unwieldy. Luckily (and this is probably my favorite thing about Rust), the Rust ecosystem has amazing tooling, including a tool called cargo flamegraph which takes the monstrous (seriously, it can be hundreds of megabytes) output of perf and puts it into an interactive and intuitive view.
Looking at the graph, it's immediately clear that a significant amount of time is spent on alloc-related functions. This comes as no surprise---heap indirection, alloc and free, and especially vector growth are relatively expensive.
Fortunately, we can very easily swap out the old Vec<Move> with static-sized arrays, removing pretty much all heap usage from our program. This will be our first optimization.
But before applying that change, as a benchmark, this is the perf stat of the unoptimized move gen:
Performance counter stats for 'target/release/game':
19,423.29 msec task-clock:u # 1.000 CPUs utilized
492,861,228,726 instructions:u # 4.63 insn per cycle
# 0.02 stalled cycles per insn
106,526,240,345 cycles:u # 5.484 GHz
11,899,588,129 stalled-cycles-frontend:u # 11.17% frontend cycles idle
79,585,107,489 branches:u # 4.097 G/sec
346,631,542 branch-misses:u # 0.44% of all branches
19.428567337 seconds time elapsed
Just over 250 mnps.
Heapless Movegen
We can use the fact that no position can ever have more than 218 moves. Therefore, we will use a static array of 255 moves (plus a size field, which makes the whole thing nicely aligned to 512 bytes) to replace Vec<Move>.
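Shiro's actual container isn't reproduced here, but a minimal sketch of the idea looks something like this (assuming Move is a small Copy + Default type):

/// Stack allocated container for `[Move; 255]`
pub struct MoveList {
    moves: [Move; 255],
    len: u16, // the "+size" field; with 2-byte moves the whole struct is 512 bytes
}

impl MoveList {
    pub fn new() -> Self {
        Self { moves: [Move::default(); 255], len: 0 }
    }

    pub fn push(&mut self, mv: Move) {
        self.moves[self.len as usize] = mv;
        self.len += 1;
    }

    pub fn len(&self) -> usize {
        self.len as usize
    }
}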
Swapping it in for the old Vec<Move> then looks like this (the call shape is illustrative):

let mut movelist = MoveList::new();
gen_moves(&game, &mut movelist);
This simple change had a surprising impact, increasing nps to 400 million. Here are the perf results after removing heap usage from the program:
Performance counter stats for 'target/release/game':
12,354.94 msec task-clock:u # 1.000 CPUs utilized
242,060,456,588 instructions:u # 3.53 insn per cycle
# 0.02 stalled cycles per insn
68,535,493,978 cycles:u # 5.547 GHz
5,408,830,538 stalled-cycles-frontend:u # 7.89% frontend cycles idle
27,211,774,270 branches:u # 2.203 G/sec
220,828,970 branch-misses:u # 0.81% of all branches
12.359881661 seconds time elapsed
Analyzing the perf results a little more, we see that our total instructions executed have been halved---there's a lot more library code responsible for heap allocation than I anticipated! Additionally, fewer frontend cycles (about 3 percentage points fewer) are spent idle; in other words, the CPU spends less time waiting for resources like memory.
The flamegraph without the alloc routines looks much better:
Precomputing Shared Values
Currently, many values, such as the bitboards for a specific piece type or lookup bitboards like unsafe_square_mask, are computed multiple times. This leads to an obvious optimization: simply pre-compute shared values and store them for later.
Our MoveGen handler now contains a few additional fields:
pub struct MoveGen<'a, const IS_WHITE: bool> {
us: Bitboard,
op: Bitboard,
+ us_king: Bitboard,
+ us_pawns: Bitboard,
+ us_knights: Bitboard,
+ us_sliders_hv: Bitboard,
+ us_sliders_dg: Bitboard,
+ op_king: Bitboard,
+ op_pawns: Bitboard,
+ op_knights: Bitboard,
+ op_sliders_hz: Bitboard,
+ op_sliders_dg: Bitboard,
checkmask: Bitboard,
pinmask_hv: Bitboard,
pinmask_dg: Bitboard,
seen_by_enemy: Bitboard,
not_pinned: Bitboard,
empty: Bitboard,
occupied: Bitboard,
movelist: &'a mut MoveList,
ep_sqr: u8,
checks: u8,
flags: Flags,
}
And the perf stats:
Performance counter stats for 'target/release/game':
225,550,027,475 instructions:u # 3.54 insn per cycle
63,684,115,357 cycles:u # 5.539 GHz
4,962,731,287 stalled-cycles-frontend:u # 7.79% frontend cycles idle
30,275,775,309 branches:u # 2.633 G/sec
216,672,355 branch-misses:u # 0.72% of all branches
11.499080487 seconds time elapsed
A few billion more branches, but slightly fewer branch misses, and a time improvement that could probably be attributed to jitter. Nice!
Unchecked Unsafe Rust
In Rust, whenever an array is accessed and the index's range is not known at compile time, the compiler will insert a runtime bounds check. The overhead of this check is practically zero, since the branch predictor will learn that all bounds checks are within bounds (if they weren't, the program would panic).
In my use case, I was certain that my indices would always be in bounds, especially those used to query the lookup maps. Some might argue that introducing unsafety in an attempt to micro-optimize is "bad practice" or not worth it, but I think this case is a good exception, since an out-of-bounds access would indicate something is inherently wrong with the move generation logic.
// Changing to unsafe indexing (the lookup-table name here is illustrative)
let attacks = unsafe { *SLIDER_ATTACKS.get_unchecked(idx) };
// a couple more...
In addition, I made pushes to the move list unchecked as well.
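A sketch of what an unchecked push can look like, building on the hypothetical MoveList above (Shiro's real method may differ):

impl MoveList {
    /// Push without a bounds check.
    ///
    /// # Safety
    /// The caller must guarantee `self.len < 255`, which holds during move
    /// generation because no legal position has more than 218 moves.
    pub unsafe fn push_unchecked(&mut self, mv: Move) {
        // SAFETY: the caller upholds `self.len < 255`.
        unsafe {
            *self.moves.get_unchecked_mut(self.len as usize) = mv;
        }
        self.len += 1;
    }
}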
Let's look at the perf results and flamegraph to see where we're at now.
Performance counter stats for 'target/release/game':
10,547.95 msec task-clock:u # 1.000 CPUs utilized
207,850,450,905 instructions:u # 3.57 insn per cycle
# 0.02 stalled cycles per insn
58,230,003,783 cycles:u # 5.521 GHz
4,664,269,369 stalled-cycles-frontend:u # 8.01% frontend cycles idle
23,572,488,287 branches:u # 2.235 G/sec
213,444,703 branch-misses:u # 0.91% of all branches
10.550542110 seconds time elapsed
Switching to some unsafe Rust saves billions of branches and instructions, bringing our execution time down by another second.
Pop Rightmost
From the latest flamegraph, we see significant time is spent on pop_leftmost(). pop_leftmost() is a subroutine that pops the leftmost (most significant) set bit of a number and returns the number of leading zeros. It is used to loop through the set bits of a bitboard (u64) in order to extract moves from a move mask.
The implementation looks something like this:

pub const fn pop_leftmost(bb: &mut u64) -> u32 {
    let lz = bb.leading_zeros();
    *bb ^= 1u64 << (63 - lz); // clear the most significant set bit
    lz
}
However, there is a better way to implement a bit loop; namely with pop_rightmost(). Popping the least significant bit is faster due to a CPU instruction called blsr, which is slightly faster than bsr on many microarchitectures.

pub const fn pop_rightmost(bb: &mut u64) -> u32 {
    let tz = bb.trailing_zeros();
    *bb &= *bb - 1; // clear the least significant set bit (this compiles to blsr)
    tz
}
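For context, the bit loop that consumes these helpers looks roughly like this (the Move constructor is an illustrative stand-in):

// Turn a destination bitboard for a piece on `from` into individual moves.
fn push_moves(from: u8, mut targets: u64, movelist: &mut MoveList) {
    while targets != 0 {
        let to = pop_rightmost(&mut targets) as u8;
        movelist.push(Move::new(from, to)); // Move::new(from, to) is illustrative
    }
}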
The original pop_leftmost() would take 4 instructions and the new pop_rightmost() takes just 3. In addition, there's room for improvement down to a mere 2 instructions. This might not sound like a large improvement, but when there are billions of moves to enumerate, a few billion cycles saved becomes a lot.
Performance counter stats for 'target/release/game':
9,163.40 msec task-clock:u # 1.000 CPUs utilized
180,988,491,319 instructions:u # 3.60 insn per cycle
# 0.03 stalled cycles per insn
50,320,036,864 cycles:u # 5.491 GHz
4,956,756,140 stalled-cycles-frontend:u # 9.85% frontend cycles idle
25,097,073,517 branches:u # 2.739 G/sec
219,628,599 branch-misses:u # 0.88% of all branches
9.164524983 seconds time elapsed
This brings us under 10 seconds, meaning we have surpassed the milestone of doubling performance, but there are still more optimizations to be discovered!
Inline Assembly
Looking closely at the flamegraphs, we see that there's a slice of time spent on core::core_arch::x86_64::bmi2::_pext_u64. It's not significant, but it's weird that it's there at all since I'm using intrinsics:
Investigating the assembly, we see something odd:
0x5555557fcd38 lea rax, [rip - 0x29fbbf]
0x5555557fcd3f add r14, rax
0x5555557fcd42 mov rdi, qword ptr [rsp + 8]
0x5555557fcd47 call core::core_arch::x86_64::bmi2::_pext_u64
↓
0x5555557f3fa0 pext rax, rdi, rsi
0x5555557f3fa5 ret
↓
0x5555557fcd4c mov rax, qword ptr [r14 + rax*8]
0x5555557fcd50 and rax, r13
...
_pext_u64 isn't inlined! As a result, we have 2 more instructions than necessary, and our code is jumping around for no reason. The fact that rustc didn't inline this simple function was very confusing, and all methods of trying to get it to inline failed. I even dug through LLVM documentation and tried combing through rustc passes to figure out why the inline didn't occur.
Eventually, I gave up and used inline assembly instead, which I probably should have done much earlier considering how nice Rust's support for inline assembly is.
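For reference, a pext wrapper using inline assembly can be as simple as the following (the wrapper name is mine; Shiro's version may differ):

use std::arch::asm;

/// `pext` via inline assembly; always inlined into the caller.
#[inline(always)]
fn pext_u64(value: u64, mask: u64) -> u64 {
    let result: u64;
    // SAFETY: requires a CPU with BMI2, which the whole program already assumes.
    unsafe {
        asm!(
            "pext {out}, {val}, {mask}",
            out = lateout(reg) result,
            val = in(reg) value,
            mask = in(reg) mask,
            options(pure, nomem, nostack),
        );
    }
    result
}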
After this change, I disassembled the program again and pext was nicely inlined. The speedup wasn't much (3 billion instructions saved and 0.4 seconds of time), but this particular change wasn't about the optimization so much as my building frustration at rustc for not inlining _pext_u64.
Performance counter stats for 'target/release/game':
8,758.45 msec task-clock:u # 1.000 CPUs utilized
177,933,940,135 instructions:u # 3.70 insn per cycle
# 0.02 stalled cycles per insn
48,053,645,660 cycles:u # 5.487 GHz
4,402,567,620 stalled-cycles-frontend:u # 9.16% frontend cycles idle
23,418,366,132 branches:u # 2.674 G/sec
210,124,185 branch-misses:u # 0.90% of all branches
8.762713925 seconds time elapsed
Why wasn't pext inlined?
I submitted an issue to rustc trying to find out why pext wasn't being inlined. Some time later, I got my reply. It turns out that pext, being part of the bmi2 extension set, requires a bmi2 feature annotation on any function that calls it---even if the entire program is compiled for a bmi2-capable microarchitecture. This allows a software fallback to be used, alongside runtime detection of the ISA.
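Concretely, that means annotating the caller as well, along these lines (the wrapper function here is illustrative):

use std::arch::x86_64::_pext_u64;

// The caller needs the same `bmi2` annotation as the intrinsic for the call to inline.
#[target_feature(enable = "bmi2")]
#[inline]
unsafe fn slider_index(occupied: u64, mask: u64) -> u64 {
    unsafe { _pext_u64(occupied, mask) }
}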
Optimized Return
You may have noticed the large block of time spent in libc.so.6. I never used any complex std functions in make_move(), so I was confused why time was being spent there. Pulling out a debugger and setting a breakpoint in make_move() reveals some AVX instructions around its return statement.
0x7ffff7efaf84 mov rax, rdi
0x7ffff7efaf87 cmp rdx, 0x40
0x7ffff7efaf8b jb 0x7ffff7efafd0
0x7ffff7efaf8d vmovdqu64 zmm16, zmmword ptr [rsi]
0x7ffff7efaf93 cmp rdx, 0x80
0x7ffff7efaf9a ja 0x7ffff7efb080
0x7ffff7efafa0 vmovdqu64 zmm17, zmmword ptr [rsi + rdx - 0x40]
0x7ffff7efafa8 vmovdqu64 zmmword ptr [rdi], zmm16
0x7ffff7efafae vmovdqu64 zmmword ptr [rdi + rdx - 0x40], zmm17
0x7ffff7efafb6 ret
These instructions are copying the BoardState in order to return it. This is odd, since I would expect rustc to have performed copy elision (RVO) and avoided the 144-byte copy. Rather than return by value, I refactored make_move() to modify a mutable instance of BoardState instead:
// modifies `newgame` in place and returns nothing (the signature shape is illustrative)
fn make_move(game: &BoardState, mv: Move, newgame: &mut BoardState) {
    // ... apply `mv` to `newgame`, which starts out as a copy of `game` ...
}

// usage
let mut next = game.clone();
make_move(&game, mv, &mut next);
Benchmarking the new version, I found a surprising speedup:
Performance counter stats for 'target/release/game':
7,239.67 msec task-clock:u # 1.000 CPUs utilized
167,822,127,457 instructions:u # 4.22 insn per cycle
# 0.03 stalled cycles per insn
39,754,720,670 cycles:u # 5.491 GHz
5,010,090,849 stalled-cycles-frontend:u # 12.60% frontend cycles idle
21,741,622,228 branches:u # 3.003 G/sec
206,750,434 branch-misses:u # 0.95% of all branches
7.241345391 seconds time elapsed
This brings us to just under 700 mnps.
Further Optimizations
This about covers it for the key optimizations applied. I have no doubt another second could be squeezed out with some considerable effort, but I have spent enough time on this chapter as is.
In addition, the bottleneck is moving away from move generation and toward move making---something that can't be easily optimized. Move counters like Gigantua can afford to do the bare minimum when computing a move, but real chess engines have to maintain information not strictly necessary for move generation, like Zobrist hashes and half-move counters.
There are three other big optimizations that I have thought of, but due to complexity or some other reason, I chose not to implement them. I will, however, document them here as a reminder to revisit this project.
Hashing
This one is a bit cheaty---as in, not quite in the spirit of move generation. The idea is to hash each position and memoize the size of its move tree, so that we can quickly add that subtree to our total whenever we come across the position again.
The overhead of hashing is not too bad, especially when using Zobrist hashing. However, such a technique would never be used in a real engine, since there is simply no point: chess engines care about generating actual moves, not counting them up.
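A rough sketch of how that memoization could look, keyed by Zobrist hash and remaining depth (legal_moves and zobrist_key are illustrative helpers, not Shiro's API):

use std::collections::HashMap;

// Memoized perft: cache subtree node counts keyed by (Zobrist key, remaining depth).
fn perft_memo(game: &BoardState, depth: u8, cache: &mut HashMap<(u64, u8), u64>) -> u64 {
    if depth == 0 {
        return 1;
    }
    let key = (game.zobrist_key(), depth); // zobrist_key() is an illustrative accessor
    if let Some(&nodes) = cache.get(&key) {
        return nodes; // this subtree has already been counted
    }
    let mut nodes = 0;
    for mv in legal_moves(game) { // legal_moves() is an illustrative helper
        let mut next = game.clone();
        make_move(game, mv, &mut next);
        nodes += perft_memo(&next, depth - 1, cache);
    }
    cache.insert(key, nodes);
    nodes
}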
Templated Make Move
This technique is one that Gigantua and some other engines use. By monomorphizing on which piece was moved (i.e. templating make_move), we can cut a lot of branching out of move making.
However, this carries a lot of complexity with it and blows up the binary size. I suspect the gains from it would also be lackluster relative to the work it requires.
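In Rust, the idea maps onto const generics, mirroring how MoveGen is already generic over IS_WHITE. A very rough sketch (the piece tags and function shape are illustrative, not Shiro's or Gigantua's code):

// Illustrative piece tags for the const parameter.
const PAWN: usize = 0;
const KNIGHT: usize = 1;
// ...one tag per piece type.

fn make_move_for<const PIECE: usize>(game: &BoardState, mv: Move, newgame: &mut BoardState) {
    *newgame = game.clone();
    // `PIECE` is a constant in each instantiation, so the optimizer keeps only its branch.
    if PIECE == PAWN {
        // pawn-only bookkeeping: en passant square, promotions, ...
    } else if PIECE == KNIGHT {
        // knights never touch castling rights or en passant, so this path stays lean.
    }
    // Placeholder: real code would decode and apply `mv` here.
    let _ = mv;
}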
Another interesting optimization would be to use a DenseMove struct that can store potentially many moves at once. The benefit is that all branching in the move generator can be postponed to move making, which makes the code path ever-so-slightly more predictable.
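One natural layout for such a struct pairs an origin square with a whole bitboard of destinations; the definition below is my illustration of the idea, not necessarily Shiro's:

/// A "dense" move: every destination of one piece packed into a single entry.
pub struct DenseMove {
    from: u8,     // origin square (0..64)
    targets: u64, // destination bitboard; each set bit is one individual move
}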
Reflections as a Whole
I've spent quite some time away from this project (this post is over 3 months in the making), so I've had a lot of time to think about it. Looking back, if I were to do a few things differently, I would make more organized commits (with reasonable commit messages) and test more rigorously. Digging for which commit contained which optimization was really hard---many optimizations were batched together and some had no effect. Regarding testing, I definitely should have pinned my CPU clock rate and put together a proper benchmark suite with graphs.
Finally, I want to leave some thoughts on Rust. I'm still new to the language, and many parts of this project saw me fighting it. By writing "unrusty" or "idiomatically poor" Rust, I gained something like 3 seconds---a huge improvement. This makes it difficult to fully believe in Rust's promise of "zero-cost abstractions." No doubt, much of this language-fighting and hacky low-level work would not have been present had I been using C++.
However, Rust has made organization as a whole much easier. I admit, its module system and strong type system are really growing on me. As Shiro grows, Rust's strengths become ever more apparent. And at the end of the day, though it took much effort and research, I was able to squeeze the performance I needed out of Rust---and I learned quite a lot along the way.
Thanks for making it this far; hopefully, this post was insightful! I'm not sure if there will be another part to this series since there isn't anything too interesting regarding minimax or board evaluation. But who knows, maybe something worth writing about will show up :)