Optimizing a Chess Move Generator and Rust Shenanigans

Jun 26, 2025
12 minute read

This post is part of my series on making Shiro, a chess engine written in Rust.

Some time ago, I reached the first milestone of my chess engine project---legal move generation validated by an extensive perft suite. (Perft is a test that counts how many positions are reachable after exactly X moves from a given starting position.) The suite tested correct enumeration of 5 billion nodes (chess positions) and ran in 20 seconds.

I had mixed feelings about the 250 million nodes per second (nps) move generation. On one hand, it's objectively an impressive number. On the other, it fell way short of my reference, Gigantua, which hits 1.6 billion nps (tested on a Ryzen 9900X).

Dissatisfied with that speed, I set out to optimize my move generator with the goal of at least doubling its performance. After dozens of hours of tedious micro-optimizations and a questionable amount of unsafe Rust, I achieved 700 million nps move gen.

What is counted as move generation?

There is a significant difference between counting up moves and actually generating moves. If we only cared about a single perft number, we could skip exploring the last depth of the move search and just count how many moves are in the move list, speeding up generation by orders of magnitude. This is called bulk counting, and is what the numbers refer to in this article.
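To make bulk counting concrete, here is a minimal perft sketch (illustrative only; gen_moves and a by-value make_move are assumptions that loosely match the pre-optimization code described later in this post):

fn perft(game: &BoardState, depth: u32) -> u64 {
    let mut moves: Vec<Move> = Vec::new();
    gen_moves(game, &mut moves);

    // Bulk counting: at the final depth, the answer is simply the length of
    // the move list; there is no need to actually make those moves.
    if depth == 1 {
        return moves.len() as u64;
    }

    let mut total = 0;
    for m in &moves {
        let next = make_move(game, *m); // assumed by-value make_move
        total += perft(&next, depth - 1);
    }
    total
}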

This post documents the techniques used and optimizations made and some interesting details I've found about Rust. Enjoy~

Motivation

Move generation is almost never a bottleneck in chess engines; in fact, generating legal moves is actually suboptimal for standard alpha-beta engines. That said, optimizing Shiro's movegen was not so much for engine performance as for testing whether Rust---and I---could stand on the same podium as engines like Gigantua.

Methodology

To optimize the throughput of any program, it helps to know where that program is spending its time. Tools called profilers are made exactly for this purpose. I mainly used perf, a sampling profiler for Linux. perf works by asking the kernel, a few thousand times a second, for a sample of what our program is doing. It accumulates these samples and can figure out which functions our program spent the most time in, how many cache misses and branch mispredictions there were, and a bunch of other useful stats.

Unfortunately, perf on its own is a bit unwieldy. Luckily (and this is probably my favorite thing about Rust), the Rust ecosystem has amazing tooling, including a tool called cargo flamegraph, which takes the monstrous (seriously, it can be hundreds of megabytes) output of perf and puts it into an interactive and intuitive view.

Flamegraphs show where a program spends its time. The vertical axis represents the depth of the stack trace, and the width of a segment is how long the CPU spends executing code in that function. Try clicking around!

Looking at the graph, it's immediately clear that a significant amount of time is spent on alloc related functions. This comes as no surprise---heap indirection, alloc and free, and especially vector growth are relatively expensive.

Fortunately, we can very easily swap out the old Vec<Move> for a fixed-size array, removing pretty much all heap usage from our program. This will be our first optimization.

But before applying that change, as a benchmark, this is the perf stat of the unoptimized move gen:

Performance counter stats for 'target/release/game':

         19,423.29 msec task-clock:u                     #    1.000 CPUs utilized             
   492,861,228,726      instructions:u                   #    4.63  insn per cycle            
                                                  #    0.02  stalled cycles per insn   
   106,526,240,345      cycles:u                         #    5.484 GHz                       
    11,899,588,129      stalled-cycles-frontend:u        #   11.17% frontend cycles idle      
    79,585,107,489      branches:u                       #    4.097 G/sec                     
       346,631,542      branch-misses:u                  #    0.44% of all branches           

      19.428567337 seconds time elapsed

Just over 250 mnps.

Heapless Movegen

We can use the fact that no position can ever have more than 218 legal moves. Therefore, we will replace Vec<Move> with a static array of 255 moves (together with the size field, this keeps the struct nicely aligned to 512 bytes).

src/engine/perft
/// Stack allocated container for `[Move; 255]`
pub struct MoveList {
    size: u32,
    list: [Move; 255],
}

let mut movelist = MoveList::new();
gen_moves(game, &mut movelist);
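
For reference, here is a hedged sketch of the MoveList plumbing used throughout the rest of this post (the actual implementation in Shiro may differ; Move is assumed to be a small Copy + Default type):

impl MoveList {
    pub fn new() -> Self {
        Self { size: 0, list: [Move::default(); 255] }
    }

    /// Bounds-checked push used by safe call sites.
    #[inline(always)]
    pub fn push(&mut self, m: Move) {
        self.list[self.size as usize] = m;
        self.size += 1;
    }

    pub fn len(&self) -> usize {
        self.size as usize
    }

    /// Iterate only over the moves actually written so far.
    pub fn iter(&self) -> impl Iterator<Item = &Move> {
        self.list[..self.size as usize].iter()
    }
}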

This simple change had a surprising impact, increasing nps to 400 million. Here are the perf results after removing heap usage from the program:

 Performance counter stats for 'target/release/game':

         12,354.94 msec task-clock:u                     #    1.000 CPUs utilized             
   242,060,456,588      instructions:u                   #    3.53  insn per cycle            
                                                  #    0.02  stalled cycles per insn   
    68,535,493,978      cycles:u                         #    5.547 GHz                       
     5,408,830,538      stalled-cycles-frontend:u        #    7.89% frontend cycles idle      
    27,211,774,270      branches:u                       #    2.203 G/sec                     
       220,828,970      branch-misses:u                  #    0.81% of all branches           

      12.359881661 seconds time elapsed

Analyzing the perf results a little more, we see that the total number of instructions executed has been halved---there's a lot more library code responsible for heap allocation than I anticipated! Additionally, about 3 percentage points fewer frontend cycles are spent idle; in other words, the CPU spends less time stalled waiting for instructions to be fetched and decoded.

The flamegraph without alloc routines looks much better:

Precomputing Shared Values

Currently, many values, such as the bitboards for a specific piece type or lookup bitboards like unsafe_square_mask, are computed multiple times. This leads to an obvious optimization: compute shared values once, up front, and store them for later.

Our MoveGen handler now contains a few additional fields:

pub struct MoveGen<'a, const IS_WHITE: bool> {
    us: Bitboard,
    op: Bitboard,

+   us_king: Bitboard,
+   us_pawns: Bitboard,
+   us_knights: Bitboard,
+   us_sliders_hv: Bitboard,
+   us_sliders_dg: Bitboard,

+   op_king: Bitboard,
+   op_pawns: Bitboard,
+   op_knights: Bitboard,
+   op_sliders_hv: Bitboard,
+   op_sliders_dg: Bitboard,

    checkmask: Bitboard,
    pinmask_hv: Bitboard,
    pinmask_dg: Bitboard,
    seen_by_enemy: Bitboard,
    not_pinned: Bitboard,

    empty: Bitboard,
    occupied: Bitboard,
    movelist: &'a mut MoveList,

    ep_sqr: u8,
    checks: u8,

    flags: Flags,
}
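
A hedged sketch of where that precomputation could live (the BoardState accessors below are hypothetical, and the bitwise operators assume Bitboard implements them; the point is that each shared bitboard is derived exactly once):

impl<'a, const IS_WHITE: bool> MoveGen<'a, IS_WHITE> {
    fn precompute_shared(&mut self, game: &BoardState) {
        // Hypothetical accessors; derive each shared bitboard once, up front.
        self.us_king = game.king(IS_WHITE);
        self.us_pawns = game.pawns(IS_WHITE);
        self.us_knights = game.knights(IS_WHITE);
        self.us_sliders_hv = game.rooks(IS_WHITE) | game.queens(IS_WHITE);
        self.us_sliders_dg = game.bishops(IS_WHITE) | game.queens(IS_WHITE);

        self.op_king = game.king(!IS_WHITE);
        self.op_pawns = game.pawns(!IS_WHITE);
        self.op_knights = game.knights(!IS_WHITE);
        self.op_sliders_hv = game.rooks(!IS_WHITE) | game.queens(!IS_WHITE);
        self.op_sliders_dg = game.bishops(!IS_WHITE) | game.queens(!IS_WHITE);

        self.occupied = self.us | self.op;
        self.empty = !self.occupied;
    }
}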

And the perf stats.

 Performance counter stats for 'target/release/game':

   225,550,027,475      instructions:u                   #    3.54  insn per cycle
    63,684,115,357      cycles:u                         #    5.539 GHz                       
     4,962,731,287      stalled-cycles-frontend:u        #    7.79% frontend cycles idle      
    30,275,775,309      branches:u                       #    2.633 G/sec                     
       216,672,355      branch-misses:u                  #    0.72% of all branches           

      11.499080487 seconds time elapsed

Roughly three billion more branches, but four million fewer branch misses, and almost a second shaved off the runtime. Nice!

Unchecked Unsafe Rust

In Rust, whenever an array is accessed and the index's range is not known at compile time, the compiler will insert a runtime bounds check. The branch itself is nearly free, since the branch predictor quickly learns that every check passes (if one didn't, the program would panic), but the checks still cost extra instructions and branches.

In my use case, I was certain that my indices would always be in bounds, especially those used to query the lookup maps. Some might argue that introducing unsafety in an attempt to micro-optimize is "bad practice" or not worth it, but I think this case is a good exception, as an out-of-bounds access would indicate something inherently wrong with the move generation logic.

// Changing to unsafe indexing
#[inline(always)]
pub fn knight_lookup(pos: usize) -> Bitboard {
    unsafe { *KNIGHT_LOOKUP.get_unchecked(pos) }
}
#[inline(always)]
pub fn king_lookup(pos: usize) -> Bitboard {
    unsafe { *KING_LOOKUP.get_unchecked(pos) }
}
#[inline(always)]
pub fn check_lookup(king_pos: usize, other_pos: usize) -> Bitboard {
    unsafe { *CHECK_LOOKUP.get_unchecked(king_pos).get_unchecked(other_pos) }
}
// a couple more...

I also made pushes to the move list unchecked.

#[inline(always)]
fn add_move(&mut self, v: Variant, start: u8, target: u8) {
    unsafe {
        self.movelist.push_unchecked(
            Move::new().with_variant(v).with_start(start).with_target(target)
        )
    }
}
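
For completeness, here is a plausible sketch of what push_unchecked might look like on the MoveList from earlier (assumed, since the actual method isn't shown in this post):

impl MoveList {
    /// # Safety
    /// The caller must ensure `self.size < 255`.
    #[inline(always)]
    pub unsafe fn push_unchecked(&mut self, m: Move) {
        // SAFETY: caller guarantees the index is in bounds, so no bounds check.
        unsafe { *self.list.get_unchecked_mut(self.size as usize) = m };
        self.size += 1;
    }
}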

Let's look at the perf and flamegraph to see where we're at now.

 Performance counter stats for 'target/release/game':

         10,547.95 msec task-clock:u                     #    1.000 CPUs utilized             
   207,850,450,905      instructions:u                   #    3.57  insn per cycle            
                                                  #    0.02  stalled cycles per insn   
    58,230,003,783      cycles:u                         #    5.521 GHz                       
     4,664,269,369      stalled-cycles-frontend:u        #    8.01% frontend cycles idle      
    23,572,488,287      branches:u                       #    2.235 G/sec                     
       213,444,703      branch-misses:u                  #    0.91% of all branches           

      10.550542110 seconds time elapsed

Switching to some unsafe Rust saves billions of branches and instructions, bringing our execution time down by another second.

Pop Rightmost

From the latest flamegraph, we see significant time is spent on pop_leftmost(). pop_leftmost() is a subroutine that pops the leftmost (most significant) set bit of a number and returns the number of leading zeroes. It is used to loop through the set bits of a bitboard (u64) in order to extract moves from a move mask.

The implementation looks like this:

#[inline(always)]
pub const fn pop_leftmost(&mut self) -> usize {
    let lz = self.0.leading_zeros();    // intrinsic compiled to BSR
    self.0 ^= (1 << 63) >> lz;
    lz as usize
}

However, there is a better way to implement a bit loop, namely with pop_rightmost(). Popping the least significant bit is faster thanks to a CPU instruction called blsr, which is slightly faster than bsr on many microarchitectures.

#[inline(always)]
pub const fn pop_rightmost(&mut self) -> usize {
    let tz = self.0.trailing_zeros();
    self.0 &= self.0 - 1;       // this gets compiled into blsr
    (63 - tz) as usize          // subtract from 63 to replicate the return value of leading_zeros
}

The original pop_leftmost() takes 4 instructions while the new pop_rightmost() takes just 3, and there's room for improvement down to a mere 2. This might not sound like much, but when there are billions of moves to enumerate, a few billion cycles saved add up quickly.
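
For context, here is a hedged sketch of the kind of bit loop this powers inside the generator: walk the set bits of a move mask and emit one move per bit (Variant::Quiet is an illustrative name; add_move is the helper shown earlier):

impl<'a, const IS_WHITE: bool> MoveGen<'a, IS_WHITE> {
    fn push_mask_moves(&mut self, start: u8, move_mask: Bitboard) {
        let mut targets = move_mask;
        while targets.0 != 0 {
            // Returns the popped square (same numbering as pop_leftmost) and clears the bit.
            let target = targets.pop_rightmost() as u8;
            self.add_move(Variant::Quiet, start, target);
        }
    }
}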

 Performance counter stats for 'target/release/game':

          9,163.40 msec task-clock:u                     #    1.000 CPUs utilized             
   180,988,491,319      instructions:u                   #    3.60  insn per cycle            
                                                  #    0.03  stalled cycles per insn   
    50,320,036,864      cycles:u                         #    5.491 GHz                       
     4,956,756,140      stalled-cycles-frontend:u        #    9.85% frontend cycles idle      
    25,097,073,517      branches:u                       #    2.739 G/sec                     
       219,628,599      branch-misses:u                  #    0.88% of all branches           

       9.164524983 seconds time elapsed

This brings us under 10 seconds, meaning we have surpassed the milestone of doubling performance, but there are still more optimizations to be discovered!

Inline Assembly

Looking closely at the flamegraphs, we see that there's a slice of time spent on core::core_arch::x86_64::bmi2::_pext_u64. It's not significant, but it's weird that it's there at all, since I'm using intrinsics:

pub fn pext_hv_lookup(occupied: u64, pos: usize) -> u64 {
  unsafe { 
    *PEXT_HV_LOOKUP
        .get_unchecked(pos)
        .get_unchecked(
           // intrinsic used here
            _pext_u64(occupied, HV_PEXT_MASK_LOOKUP[pos]) as usize
        ) 
  }
}

Investigating the assembly, we see something odd:

0x5555557fcd38   lea    rax, [rip - 0x29fbbf]
0x5555557fcd3f   add    r14, rax
0x5555557fcd42   mov    rdi, qword ptr [rsp + 8]
0x5555557fcd47   call   core::core_arch::x86_64::bmi2::_pext_u64
 
0x5555557f3fa0   pext   rax, rdi, rsi
0x5555557f3fa5   ret
 
0x5555557fcd4c   mov    rax, qword ptr [r14 + rax*8]
0x5555557fcd50   and    rax, r13
...

_pext_u64 isn't inlined! As a result, we have 2 more instructions than necessary, and our code is jumping around for no reason. The fact that rustc didn't inline this simple function was very confusing, and all my attempts to get it to inline failed. I even dug through LLVM documentation and tried combing through rustc passes to figure out why the inline didn't occur. Eventually, I gave up and used inline assembly instead, which I probably should have done much earlier considering how nice Rust's support for inline assembly is.

pub fn pext_hv_lookup(occupied: u64, pos: usize) -> u64 {
  unsafe {
    let mut result: u64;
    asm!(
        "pext {result}, {occupied}, {mask}",
        occupied = in(reg) occupied,
        // pass the mask *value* into a register, not a reference to it
        mask = in(reg) *HV_PEXT_MASK_LOOKUP.get_unchecked(pos),
        result = out(reg) result,
    );
    *PEXT_HV_LOOKUP.get_unchecked(pos).get_unchecked(result as usize)
  }
}

After this change, I disassembled the program again and pext was nicely inlined. The speedup wasn't much (3 billion instructions saved and 0.3 seconds of time), but this particular change wasn't about the optimization so much as my building frustration at Rustc for not inlining _pext_u64.

 Performance counter stats for 'target/release/game':

          8,758.45 msec task-clock:u                     #    1.000 CPUs utilized             
   177,933,940,135      instructions:u                   #    3.70  insn per cycle            
                                                  #    0.02  stalled cycles per insn   
    48,053,645,660      cycles:u                         #    5.487 GHz                       
     4,402,567,620      stalled-cycles-frontend:u        #    9.16% frontend cycles idle      
    23,418,366,132      branches:u                       #    2.674 G/sec                     
       210,124,185      branch-misses:u                  #    0.90% of all branches           

       8.762713925 seconds time elapsed
No more pext function!

Why wasn't pext inlined?

I submitted an issue to the rustc repository to find out why pext wasn't being inlined. Some time later, I got a reply. It turns out that pext, being part of the BMI2 extension set, requires a bmi2 target-feature annotation on any function that calls it---even if the entire program is compiled for a microarchitecture that has BMI2. This is what allows runtime detection of the ISA with a software fallback when the feature isn't available.
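
Based on that explanation, the non-asm fix would presumably be to give the caller the same target-feature annotation as the intrinsic (a hedged sketch; I haven't verified this against Shiro's actual call sites):

use std::arch::x86_64::_pext_u64;

// Marking the caller with the bmi2 feature lets LLVM inline the intrinsic.
// Safety: this function must only be called on CPUs that support BMI2.
#[target_feature(enable = "bmi2")]
#[inline]
pub unsafe fn pext_hv_lookup(occupied: u64, pos: usize) -> u64 {
    unsafe {
        *PEXT_HV_LOOKUP
            .get_unchecked(pos)
            .get_unchecked(_pext_u64(occupied, HV_PEXT_MASK_LOOKUP[pos]) as usize)
    }
}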

Optimized Return

You may have noticed the large block of time spent in libc.so.6. I never used any complex std functions in make_move(), so I was confused about why time was being spent there. Pulling out a debugger and setting a breakpoint in make_move() reveals that the return lands in libc's memcpy, with AVX-512 instructions copying data around the return statement of make_move().

0x7ffff7efaf84    mov    rax, rdi 
0x7ffff7efaf87    cmp    rdx, 0x40
0x7ffff7efaf8b    jb     0x7ffff7efafd0 
0x7ffff7efaf8d    vmovdqu64 zmm16, zmmword ptr [rsi]
0x7ffff7efaf93    cmp    rdx, 0x80
0x7ffff7efaf9a    ja     0x7ffff7efb080  
0x7ffff7efafa0    vmovdqu64 zmm17, zmmword ptr [rsi + rdx - 0x40]
0x7ffff7efafa8    vmovdqu64 zmmword ptr [rdi], zmm16
0x7ffff7efafae    vmovdqu64 zmmword ptr [rdi + rdx - 0x40], zmm17
0x7ffff7efafb6    ret

These instructions are copying the BoardState in order to return it. This is odd, since I would have expected rustc to perform copy elision (RVO) and avoid the 144-byte copy. Rather than return by value, I refactored make_move() to write into a mutable BoardState instead:

// modifies `newgame` and returns nothing
pub fn make_move(game: &BoardState, m: Move, newgame: &mut BoardState) {...}

// usage
let mut next = game.clone();
make_move(game, m, &mut next);

Benchmarking the new version, I found a surprising speedup:

 Performance counter stats for 'target/release/game':

          7,239.67 msec task-clock:u                     #    1.000 CPUs utilized
   167,822,127,457      instructions:u                   #    4.22  insn per cycle
                                                  #    0.03  stalled cycles per insn
    39,754,720,670      cycles:u                         #    5.491 GHz
     5,010,090,849      stalled-cycles-frontend:u        #   12.60% frontend cycles idle
    21,741,622,228      branches:u                       #    3.003 G/sec
       206,750,434      branch-misses:u                  #    0.95% of all branches

       7.241345391 seconds time elapsed

This brings us to just under 700 mnps.

Further Optimizations

That about covers the key optimizations. I have no doubt another second could be squeezed out with considerable effort, but I have spent enough time on this chapter as it is.

In addition, the bottleneck is moving away from move generation and toward move making---something that can't be easily optimized. Move counters like Gigantua can afford to do the bare minimum when computing a move, but real chess engines have to maintain information not strictly necessary for move generation, like Zobrist hashes and halfmove counters.

There are three other big optimizations that I have thought of, but due to complexity or some other reason, I chose not to implement them. I will, however, document them here as a reminder to revisit this project.

Hashing

This one is a bit cheaty---as in, not quite in the spirit of move generation. The idea is to hash each position and memoize the size of its move tree, so that whenever we come across the same position again, we can add the cached count to our total instead of re-searching it.

The overhead of hashing is not too bad, especially with incrementally updated Zobrist keys. However, such a technique would never be used in a real engine, since there is simply no point: chess engines care about generating actual moves, not counting them up.
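
As a rough illustration, a memoized perft might look like this (a hedged sketch; zobrist(), iter(), and the exact signatures are assumptions layered on the snippets from earlier):

use std::collections::HashMap;

fn perft_hashed(game: &BoardState, depth: u32, cache: &mut HashMap<(u64, u32), u64>) -> u64 {
    // If we've already counted this position at this depth, reuse the result.
    if let Some(&count) = cache.get(&(game.zobrist(), depth)) {
        return count;
    }

    let mut movelist = MoveList::new();
    gen_moves(game, &mut movelist);

    // Bulk counting at the final depth, as before.
    if depth == 1 {
        return movelist.len() as u64;
    }

    let mut total = 0;
    for m in movelist.iter() {
        let mut next = game.clone();
        make_move(game, *m, &mut next);
        total += perft_hashed(&next, depth - 1, cache);
    }
    cache.insert((game.zobrist(), depth), total);
    total
}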

Templated Make Move

This technique is one that Gigantua and some other engines use. By using monomorphization (templating move making on which piece was moved), we can cut a lot of branching out of move making.

However, this carries with it a lot of complexity and blows up the binary size. I suspect the gains from it would also be lackluster relative to the work it requires.
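
For a sense of what this looks like, here is a hedged Rust sketch of piece-templated move making using const generics (Gigantua itself does this with C++ templates; every name below is illustrative rather than Shiro's actual code):

// Illustrative piece tags for the const generic.
const PAWN: u8 = 0;
const KNIGHT: u8 = 1;

// Each instantiation keeps only the bookkeeping for its own piece, so the
// per-piece branches are resolved at compile time instead of at runtime.
#[inline(always)]
fn make_move_for<const PIECE: u8>(game: &BoardState, m: Move, newgame: &mut BoardState) {
    if PIECE == PAWN {
        // pawn-only logic (promotions, en passant) exists only in this copy
    } else if PIECE == KNIGHT {
        // knight-only logic
    }
    // ...updates common to all pieces would go here...
    let _ = (game, m, newgame);
}

// A single dispatch at the call site selects the monomorphized copy.
fn make_move_templated(game: &BoardState, m: Move, piece: u8, newgame: &mut BoardState) {
    match piece {
        PAWN => make_move_for::<PAWN>(game, m, newgame),
        KNIGHT => make_move_for::<KNIGHT>(game, m, newgame),
        _ => unreachable!("remaining piece types elided in this sketch"),
    }
}

In a full generator, even that dispatch can disappear, because each piece's move-generation routine already knows which piece it is handling and can call its instantiation directly.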

Another interesting optimization would be to use a DenseMove struct that looks like this:

struct DenseMove {
    start: Square,
    piece_moved: PieceType,
    targets: Bitboard,
}

A single DenseMove can store potentially many moves. The benefit is that all branching in the move generator can be postponed to move making, which makes the code path ever so slightly more predictable.
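
Here is a hedged sketch of how a DenseMove could be unpacked at move-making time, using the pop_rightmost() bit loop from earlier (illustrative only; Bitboard and Square are assumed to be small Copy types):

// Walk the target bitboard and hand each (start, target) pair to a callback;
// the per-move branching now happens here rather than inside the generator.
fn for_each_target(dense: &DenseMove, mut f: impl FnMut(Square, u8)) {
    let mut targets = dense.targets;
    while targets.0 != 0 {
        let target = targets.pop_rightmost() as u8;
        f(dense.start, target);
    }
}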

Reflections as a Whole

I've spent quite some time away from this project (this post has been over 3 months in the making), so I've had a lot of time to think about it. Looking back, if I were to do a few things differently, they would be making more organized commits (with reasonable commit messages) and testing more rigorously. Digging for which commit contained which optimization was really hard---many optimizations were batched together, and some had no effect. Regarding testing, I definitely should have pinned my CPU clock rate and set up a proper benchmark suite with graphs.

Finally, I want to leave some thoughts on Rust. I'm still new to the language, and many parts of this project saw me fighting it. By writing "unrusty" or "idiomatically poor" Rust, I gained something like 3 seconds---a huge improvement. This makes it difficult to fully believe in Rust's promise of "zero-cost abstractions." No doubt, much of this language-fighting and hacky low-level work would not have been necessary had I been using C++.

However, Rust has made organizing the project as a whole much easier. I admit, its module system and strong type system are really growing on me. As Shiro grows, Rust's strengths become ever more apparent. And at the end of the day, though it took much effort and research, I was able to squeeze the performance I needed out of Rust---and I learned quite a lot along the way.

Thanks for making it this far; hopefully, this post was insightful! I'm not sure if there will be another part to this series since there isn't anything too interesting regarding minimax or board evaluation. But who knows, maybe something worth writing about will show up :)
