Extracting Chess Positions from Screenshots

2025-01-26

Introduction

I play a lot of online chess.

Quite badly, according to my Elo.

Recently, I became interested in writing a computer program to automatically extract positions from screenshots of chess boards.

Why bother? I dunno, it seems interesting and it's outside my normal area of expertise, which is web programming.

In this post, I will describe how to do that.

This is a pet project, but feedback is welcome!

Ground Rules

I started with a fairly specific idea in mind:

  • The input should be a screenshot of a top-down, 2D board
  • The output should be a FEN string (the position as text)
  • It should work with different icon sets (pieces and boards)
  • It should be stateless (no move history)
  • It should use a neural net for piece classification

Basically, using python, we want this:

def extract_fen(screenshot: Image) -> str:
    ...  # do image stuff

The "stateless" criteria means we won't track castling, en-passant, or the turn-number. Those are key concerns for generating useful game analysis! But we'll ignore them for now.

The program will run a single step, capturing one board from one image. But of course, if you have lots of images, you could wrap the logic in a while-loop and capture the entire game.

The Approach

In pseudo-code, we want something like this:

board = array(64)
image = open("screenshot.png")
for i in range(64):
    square = crop(image, i)
    board[i] = classify(square)
result = generate_fen(board)

That is to say:

  • open the image file of the board
  • for each of the 64 squares..
    • crop, to slice out an individual square
    • classify its contents (either a specific piece, or "empty")
    • accumulate the results into an array
  • convert the array into a FEN string

We'll develop each part in detail, but basically:

  • cropping can be done using any image library
  • classification will use a neural net, and we'll need to train it
  • FEN strings can be generated from the python-chess library

Setting up the environment

I am currently using uv to manage python environments. But the setup will be similar with most package managers. We will need:

  • tensorflow
  • tensorflow-macos
  • numpy
  • pillow
  • albumentations
  • scikit-learn
  • chess
  • jupyter (optional)
  • cairosvg (optional)

I had to fiddle with the installation a bit to satisfy my older M1 MacBook. It wanted the ARM build of Python for TensorFlow, and numpy 1.x instead of 2.x (YMMV)

uv init -p cpython-3.10.16-macos-aarch64-none
uv add chess pillow scikit-learn jupyter cairosvg albumentations
uv pip install tensorflow tensorflow-macos
uv pip install 'numpy<2'

Once complete, you should be able to import tensorflow

import tensorflow as tf
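
An optional sanity check is to print the version and the devices TensorFlow can see:

print(tf.__version__)
print(tf.config.list_physical_devices())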

Cool.

Getting the images

The next thing we're gonna need is some images of chess boards.

Specifically, we need an image that looks like this:

The starting position

The graphical style of the board and icons isn't important. But the position of the pieces is critical. We'll use this image for our training data. We need:

  • a normal starting position, but..
  • 2 extra queens (d2 and d7)
  • 2 extra kings (e2 and e7)

Pawns on b2, c2, f2, g2, b7, c7, f7, g7 are optional. You can keep or remove them. The code below will work the same.

Why do we need the extra kings and queens? The neural net will learn better if we show it an example of every possible piece and square combination. The normal starting position lacks 4 possibilities:

  • white king on a light square
  • white queen on a dark square
  • black king on a dark square
  • black queen on a light square

By creating an image that includes those, we can supply complete training data.

There are lots of ways to create the image above. You can..

  • download it
  • use python to create a PNG file
  • use a website and capture a screenshot

To generate the image using the python-chess package, you need to have the optional cairosvg package installed.

Then run:

import os
import chess
import chess.svg
import cairosvg

fen = "rnbqkbnr/p2qk2p/8/8/8/8/P2QK2P/RNBQKBNR w KQ - 0 1"
board = chess.Board(fen)
svg_data = chess.svg.board(board, coordinates=False, size=256)
with open("board.svg", "w") as f:
    f.write(svg_data)
cairosvg.svg2png(url="board.svg", write_to="board.png")

Alternatively, you can generate the image by creating an "analysis board" on any of the popular chess websites like chess.com or lichess.org.

Then take a screenshot.

It needs to be a GOOD screenshot!

  • 256x256 or larger
  • square (or close, +/- 5px)
  • well-clipped (just the board, no borders)

If the screenshot is really bad, these image processing steps won't work well. Use an image editor if necessary to get a clean image.
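
If you want to catch a bad screenshot early, a small check like this helps (just a sketch; check_screenshot() is a hypothetical helper that mirrors the thresholds above):

from PIL import Image

def check_screenshot(path: str) -> None:
    img = Image.open(path)
    w, h = img.size
    # 256x256 or larger
    if min(w, h) < 256:
        print(f"warning: only {w}x{h} pixels, expected at least 256x256")
    # square, give or take a few pixels
    if abs(w - h) > 5:
        print(f"warning: not square ({w}x{h}); crop to just the board")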

Extracting individual squares

The next step is to start processing the image. We need the ability to extract each individual square from an 8x8 chess board.

There are many image manipulation libraries, but Pillow (PIL) is widely used and I like it.

Let's import libraries and declare the size of our board:

import math
import numpy as np
from typing import List
import albumentations as A
import tensorflow as tf
from PIL import Image, ImageDraw, ImageFont
from tensorflow.keras.layers import Input, Conv2D, MaxPooling2D, Flatten, Dense, Dropout
from tensorflow.keras.models import Model
from tensorflow.keras.utils import to_categorical
from tensorflow.keras.preprocessing.image import ImageDataGenerator, img_to_array
from tensorflow.keras.optimizers import Adam
from sklearn.model_selection import train_test_split

BOARD_SIZE = 256
SQUARE_SIZE = 32

Next, load the screenshot and resize it to the target size.

board = Image.open('path/to/my_screenshot.png')
board = board.resize((BOARD_SIZE, BOARD_SIZE))

Then, define a function to extract the contents of an individual square.

def get_square(board: Image, pos: str) -> Image:
    file, rank = pos[0].lower(), pos[1]
    file_num = ord(file) - ord('a')
    rank_num = 8 - int(rank)
    left = file_num * SQUARE_SIZE
    upper = rank_num * SQUARE_SIZE
    box = (left, upper, left + SQUARE_SIZE, upper + SQUARE_SIZE)
    return board.crop(box)

Our function will use chess rank and file notation to refer to the squares. For example, to see the contents of square "a1", the bottom-left corner, we would use:

a1_img = get_square(board, 'a1')
a1_img.show()

And we should see this:

The Rooook on a1

Now we can loop through the squares of the board, and process each square separately. By placing the ranks in an outer loop, we can enumerate the positions in a natural order, starting from the top-left corner.

def extract_board(board: Image) -> List[List[str]]:
    result = []
    for rank in '87654321':
        row = []
        for file in 'abcdefgh':
            pos = file + rank
            img = get_square(board, pos)
            piece = predict_piece(img)
            row.append(piece)
        result.append(row)
    return result

All that's left is to implement the predict_piece() function, which so far remains undefined.

Preparing the data

To write that function, we will train a neural net to look at a square and predict the most likely piece.

Let's start by defining our labels. We'll use the notation of the python-chess library, with uppercase letters for white, lowercase for black, and "." for an empty square.

LABELS = ['.', 'P', 'N', 'B', 'R', 'Q', 'K', 'p', 'n', 'b', 'r', 'q', 'k']

The task for our network will be to process an image, and decide which of those 13 labels the image represents. The neural network doesn't actually learn the labels themselves (P, N, B, etc). Rather, it learns the integer offsets of the labels. If you ask the neural net what piece is on e1, it will respond with "6" instead of "K", because "K" is the 7th element of the zero-indexed array.
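
Concretely, the mapping between a label and its class index is just list position:

print(LABELS.index('K'))   # 6  -- the integer class the network predicts
print(LABELS[6])           # 'K' -- mapping the prediction back to a piece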

We need one example for each type of piece. We can define that using a lookup table.

# (piece, light_sq, dark_sq)
LOOKUP = (
    ('.', 'b3', 'a3'),
    ('p', 'h7', 'a7'),
    ('n', 'g8', 'b8'),
    ('b', 'c8', 'f8'),
    ('r', 'a8', 'h8'),
    ('q', 'd7', 'd8'),
    ('k', 'e8', 'e7'),
    ('P', 'a2', 'h2'),
    ('N', 'b1', 'g1'),
    ('B', 'f1', 'c1'),
    ('R', 'h1', 'a1'),
    ('Q', 'd1', 'd2'),
    ('K', 'e2', 'e1'),
)

Each entry in this table states where the corresponding piece may be found in our screenshot.

For example, this line ('b', 'c8', 'f8') means:

  • one black bishop is on c8 (a light square)
  • one black bishop is on f8 (a dark square)

The dot "." indicates an empty square. We need to have both a light and dark square example for each piece, including the kings and queens.

So far we have 26 labeled images. But to train a network, ideally we need hundreds or thousands of examples. Capturing that many screenshots would be impractical. Instead, we can generate synthetic data using a library.

One such library is Albumentations, which can be used to make small adjustments to each image, for example by adjusting the contrast, shifting the pixels up or down, or rotating the whole image. Training the neural net with a diverse set of images prevents over-fitting (memorizing the data) and makes it more robust in real usage.

transform = A.Compose([
    A.Rotate(limit=10, p=0.8),
    A.RandomSizedCrop(min_max_height=(28, 34), size=(32, 32), p=0.8),
    A.RandomBrightnessContrast(brightness_limit=0.2, contrast_limit=0.2, p=0.8),
    A.ShiftScaleRotate(shift_limit=0.2, scale_limit=0.2, rotate_limit=0.0, p=0.8),
    A.HorizontalFlip(p=0.5),
])

After defining our transformation parameters, we convert each image to grayscale, and then repeatedly call transform() to apply random adjustments to the image. This gives us 5,000 samples per source square, or 10,000 per label.

num_images = 5000
label_list = []
image_list = []
for piece, light_sq, dark_sq in LOOKUP:
    class_num = LABELS.index(piece)
    for pos in (light_sq, dark_sq):
        img = get_square(board, pos)
        img = img.convert('L')
        img_array = img_to_array(img) / 255.0

        k = 0
        while k < num_images:
            augmented = transform(image=img_array)
            new_img = augmented['image']
            image_list.append(new_img)
            label_list.append(class_num)
            k = k + 1

Training the model

The next step is to convert the data into the required format, and then split it into training and testing sets. TensorFlow wants numpy arrays for its inputs.

image_arr = np.array(image_list)

label_arr = to_categorical(label_list, num_classes=len(LABELS))

X_train, X_test, y_train, y_test = train_test_split(image_arr, label_arr, test_size=0.2)

For the network itself, we will use a Convolutional Neural Network (CNN) model.

inputs = Input(shape=(32, 32, 1))
x = Conv2D(32, (3, 3), activation='relu', padding='same')(inputs)
x = MaxPooling2D((2, 2))(x)
x = Conv2D(64, (3, 3), activation='relu', padding='same')(x)
x = MaxPooling2D((2, 2))(x)
x = Flatten()(x)
x = Dense(32, activation='relu')(x)
x = Dropout(0.2)(x)
outputs = Dense(len(LABELS), activation='softmax')(x)
model = Model(inputs=inputs, outputs=outputs)

model.compile(optimizer=Adam(learning_rate=0.0001), loss='categorical_crossentropy', metrics=['accuracy'])

The most important parts of this are the input and the output (there's a quick way to double-check both after this list):

  • The input shape (32, 32, 1) describes our 32x32 pixel, 1-channel grayscale images
  • The output shape, Dense(len(LABELS)) will give us one output per label
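
A quick way to double-check those shapes is model.summary(), which prints each layer with its output shape and parameter count:

model.summary()
# The first line should show (None, 32, 32, 1) going in,
# and the final Dense layer (None, 13) coming out -- one probability per label.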

Then we will fit the model to the data. This is the famous "training" step.

model.fit(X_train, y_train, epochs=10, shuffle=True) 

You'll see a bunch of output. On my computer it takes 5-10 minutes to run.

So uh.. did it actually work?

Let's find out by checking the accuracy of the model on our "test data" which was intentionally omitted from the training.

test_loss, test_accuracy = model.evaluate(X_test, y_test)
print(f"Test Loss: {test_loss}")
print(f"Test Accuracy: {test_accuracy}")

# Test Loss: 0.05167254060506821
# Test Accuracy: 0.9879999756813049

98.8% accuracy is pretty good (maybe even "suspiciously good" after just a few epochs)

However, we are classifying 64 squares per position, so a 1.2% per-square error rate still means a mistake on roughly every other board.

In practice, this won't be a big problem. As we'll see in a moment, the transformations that we performed on the synthetic data are aggressive. A real game screenshot won't have the rooks rotated 20 degrees and the knights facing backwards. The end notes also suggest a few alternative ways to improve the accuracy beyond "more data".
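
To put a number on that, here is the back-of-the-envelope arithmetic (treating square-level errors as independent, which they aren't quite):

per_square_accuracy = 0.988
p_clean_board = per_square_accuracy ** 64     # ~0.46
print(1 - p_clean_board)                      # ~0.54, i.e. a mistake on roughly every other board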

Let's spot-check a few squares.

We want a function that accepts an image and returns the piece.

The predict_piece() function first converts an input image to the numpy format required by TensorFlow. Then it calls model.predict(), and the result is a list of class probabilities. The function then selects the index of the class with the highest probability, and finally returns the associated label (the name of the piece).

def predict_piece(model: Model, labels: List[str], image_sq: Image.Image) -> str:
    image_sq = image_sq.convert('L')
    image_arr = img_to_array(image_sq) / 255.0
    image_arr = np.expand_dims(image_arr, axis=0)
    predictions = model.predict(image_arr, verbose=False)
    offset = np.argmax(predictions)
    return labels[offset]

We can now check that our original screenshot has a rook on "a1"

image_sq = get_square(board, 'a1')
piece = predict_piece(model, LABELS, image_sq)
print(piece)

# "R"

And it does! Nice!

Putting it all together

Okay, prediction appears to work.

Performance can be tuned by adjusting model parameters such as the layer sizes, the quantity and variety of training data, and the number of training epochs.

Earlier, we sketched an extract_board() function. Now that we have also implemented predict_piece(), here is the full version, which we can run on the entire board.

def extract_board(model: Model, labels: List[str], board: Image.Image) -> List[List[str]]:
    result = []
    for rank in '87654321':
        row = []
        for file in 'abcdefgh':
            pos = file + rank
            img = get_square(board, pos)
            piece = predict_piece(model, labels, img)
            row.append(piece)
        result.append(row)
    return result

Cool. Are we done?

Let's try it on a brand new screenshot.

This was taken from the Opera Game after move 11... Nbd7

The Opera Game

To make it more fun, I edited that image in Photoshop to blur and shift some of the pieces, so that it wouldn't be too easy.

opera_board = Image.open('path/to/opera-game.png')
extract_board(model, LABELS, opera_board)

The output is a list of rows (a nested list), one row per rank from 8 down to 1. We can see that each piece matches the screenshot. So it works!

[['r', '.', '.', '.', 'k', 'b', '.', 'r'],
 ['p', '.', '.', 'n', 'q', 'p', 'p', 'p'],
 ['.', '.', '.', '.', '.', 'n', '.', '.'],
 ['.', 'B', '.', '.', 'p', '.', 'B', '.'],
 ['.', '.', '.', '.', 'P', '.', '.', '.'],
 ['.', 'Q', '.', '.', '.', '.', '.', '.'],
 ['P', 'P', 'P', '.', '.', 'P', 'P', 'P'],
 ['R', '.', '.', '.', 'K', '.', '.', 'R']]

Finally, we need a way to convert a board into a FEN string.

Below is a code snippet that builds a representation of the board using the popular python-chess module and then calls board.fen()

def to_fen(board_array):
    board = chess.Board(None)
    for row in range(8):
        for col in range(8):
            piece = board_array[row][col]
            if piece != '.':
                color = chess.WHITE if piece.isupper() else chess.BLACK
                piece_type = {
                    'p': chess.PAWN,
                    'r': chess.ROOK,
                    'n': chess.KNIGHT,
                    'b': chess.BISHOP,
                    'q': chess.QUEEN,
                    'k': chess.KING
                }[piece.lower()]
                board.set_piece_at(chess.square(col, 7 - row), chess.Piece(piece_type, color))
    return board.fen()

This FEN string is something that we'll need in later stages of the pipeline. It can also be copy/pasted into chess engines and analysis programs, to help double-check the accuracy of the piece classification.
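
For example, running it on the Opera Game board from above should print something like this (the trailing fields are python-chess defaults, since we don't track turn or castling):

opera_arr = extract_board(model, LABELS, opera_board)
print(to_fen(opera_arr))
# r3kb1r/p2nqppp/5n2/1B2p1B1/4P3/1Q6/PPP2PPP/R3K2R w - - 0 1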

Let's wrap this up into the single function promised at the beginning:

def extract_fen(screenshot: Image.Image) -> str:
    board = screenshot.resize((256, 256))
    board_arr = extract_board(model, LABELS, board)
    return to_fen(board_arr)

That's it! We can extract the position from any screenshot (well, any screenshot that uses the same set of icons as our original)

Debugging

It was very useful to visualize the pieces, labels and classification errors while writing this post.

The function below draws a grid of pieces. It can also optionally draw a red overlay to highlight classification errors.

def visualize(x, y, width=16, height=None, predict=False):
    height = height if height else math.ceil(len(x) / width)
    size = width * height
    y_orig = np.argmax(y, axis=1)
    labels = [LABELS[i] for i in y_orig]

    sheet = Image.new('RGBA', (SQUARE_SIZE * width, SQUARE_SIZE * height))
    mask = Image.new('RGBA', (SQUARE_SIZE * width, SQUARE_SIZE * height))
    text = Image.new('RGBA', (SQUARE_SIZE * width, SQUARE_SIZE * height))
    draw = ImageDraw.Draw(text)
    font = ImageFont.load_default()

    def draw_piece(x, i):
        img = np.squeeze(x[i])
        img = Image.fromarray(img * 255.0)
        pos = (SQUARE_SIZE * (i % width), SQUARE_SIZE * (i // width))
        sheet.paste(img, pos)

    if predict:
        y_pred = model.predict(x)
        y_pred_labels = np.argmax(y_pred, axis=1)
        pred_labels = [LABELS[i] for i in y_pred_labels]

        for i in range(size):
            draw_piece(x, i)
            pos = (SQUARE_SIZE * (i % width), SQUARE_SIZE * (i // width))
            pred_label = pred_labels[i]
            orig_label = labels[i]
            draw.text(pos, pred_label, font=font, fill=(0, 0, 0, 200))
            color = (255, 0, 0, 128) if orig_label != pred_label else (0, 0, 0, 0)
            mask.paste(Image.new('RGBA', (SQUARE_SIZE, SQUARE_SIZE), color), pos)
    else:
        for i in range(size):
            draw_piece(x, i)
            pos = (SQUARE_SIZE * (i % width), SQUARE_SIZE * (i // width))
            label = labels[i]
            draw.text(pos, label, font=font, fill=(0, 0, 0, 200))

    sheet = Image.alpha_composite(sheet, mask)
    sheet = Image.alpha_composite(sheet, text)
    return sheet 

I used this for several things while debugging the code.

It was worth it to spot-check the training images and labels. It's super-easy to make array indexing errors with numpy. On several occasions, I had messed up and the inputs were mislabeled (which makes training impossible).

It was also nice to see the classification errors visually.

Below is an example of some of the "synthetic" images from the testing data.

visualize(X_test, y_test, width=16, height=16, predict=True).show()

Pretty cool looking..! Right?

Also, you can see that two rooks are misclassified.

Piece classifications

A bit later, I realized that I could also visualize "ordered" results for specific transformations, such as translation or rotation. The next image shows each piece rotated from 0-50 degrees in both directions.

(These particular images were generated from a separate process and don't appear in the training or testing data)

We can see that the unrotated images are all classified correctly. But as the angle of rotation increases, we get more mistakes. A rotated pawn starts to look like a bishop, and rook classification drops off pretty quickly.

Rotated pieces

The overall "shape" of these errors has some explanatory power.

If there were zero errors, that might hint at memorization or pollution of training data into the "test" split. If we saw totally random errors, that could suggest insufficient data or model size. Mistakes like one specific piece being totally incorrect could mean we have a class imbalance or other issues in the data processing pipeline. I know because all of those problems happened while I was debugging!

Next Steps and What Abouts..

So this basically works and I'm pretty happy with it.

There's lots that could be done to make it more robust. Also, my own code is organized a bit differently than what I've shown here. I wanted to present the snippets in a logical order, but my actual code is more modularized and organized into class files, to make it easier to experiment with multiple boards.

My eventual goal is to develop a program that can automatically capture, analyze, and provide commentary for live, in-progress games. We'll see how far I get with it! Hopefully there will be more posts on that topic soon.

Below are some FAQ style "what about.." questions covering issues I ran into, and options that I didn't pursue yet. After playing around with this a bit, I've got a million new questions and things to try.

Some of the questions below mention improving the accuracy. At 99%, why bother? Well, the testing we did here was totally synthetic and omitted a lot of problems that can happen with live screen capture - we made it pretty easy for the classifier! I am considering some future ideas that are less easy, so just thinking ahead a bit..

What about cheating?

We extracted positions from screenshots and output FEN, which happens to be the language of Stockfish.

Could you extend this program to cheat with Stockfish? Yes, you could, but uh.. please don't do that..

Cheating is a real problem for online chess.

That said, there are MUCH easier ways to cheat, like.. using any engine.. opening an analysis board in a private window.. or just asking a stronger player to stand over your shoulder and tell you the best moves.

I don't think I'm enabling or encouraging cheating here.

I also don't think a lot of cheaters are gonna bother with building an image recognition model.

But if they do, then at least it's educational!

I never cheat. But I have played over 5000 games and I still suck. Sometimes I wish that Chess.com would ban me! It would be a tremendous time-saver if they just didn't let me play anymore.

What problems did you run into?

More than I would like to admit for a couple hundred lines of code..

It took me a bit to get comfortable with the inputs and outputs of different model layers. ChatGPT was super helpful here.

I repeatedly forgot to re-scale the pixel values to [0.0 - 1.0] for the model, and then back to [0 - 255] to preview the images. Every time, I was left scratching my head for 15-30 minutes before figuring it out.
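
For the record, the two conversions look like this (a trivial sketch; img here is any square image from get_square()):

# Going in: scale pixels to [0.0, 1.0] for the model
arr = img_to_array(img.convert('L')) / 255.0

# Coming out: scale back to [0, 255] integers before previewing with PIL
Image.fromarray((np.squeeze(arr) * 255).astype(np.uint8)).show()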

I ended up fiddling with the model architecture a lot. I wanted to get the smallest model that still performed well, but without memorizing the data. I also wanted to try different options and start building some intuition about what works, and why.

At one point, I seemed to be getting instant "perfect" classification with a single tiny layer. I knew that was impossible. Which led to discovering the next problem..

Before I started using Albumentations to generate the synthetic images, I was using a different library. The idea with data augmentation is to increase the quantity of training data by starting with a small number of labeled source images, and then introducing random minor transformations to get more images with the same labels.

Well, for some reason, that randomization was not working how I expected. I kept getting the same "random" transformations for each piece, over and over again. This was hard to spot visually, but I eventually got suspicious enough that I went back and used a hash table to count the instances and verify that every synthetic image was unique. Well, they weren't! My 5000 "unique" images per class were actually closer to 80. Switching libraries fixed this problem immediately, and other things snapped into place.
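
The check itself was nothing fancy. Something along these lines (using the image_list built in the augmentation loop above) is enough to expose the problem:

from collections import Counter

# Hash the raw bytes of every augmented array and count the unique ones
counts = Counter(arr.tobytes() for arr in image_list)
print(f"{len(counts)} unique images out of {len(image_list)}")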

Why did you use synthetic data in the test split?

True, you're not supposed to do that.

Typically you would only use synthetic data for training.

But.. I wasn't gonna take hundreds of screenshots to get "real" test data. This is just a tutorial, so I used synthetic data and just verified that it didn't overlap with the training data. Really though, you should be validating model performance against real, unseen data.

What about different icon sets?

Here, we trained using a single set of icons. Training on any other "normal" set of icons should work just fine too.

But what about multiple sets of icons? Can the model generalize if we train it on lots of different icons? Can it learn the concept of "this is what a knight looks like"?

My guess is "yes, up to a point" (even in the best case, you could imagine an adversarial set of icons)

I want to investigate this a lot more.

What about image pre-processing?

I think this might improve the results.

I wasted quite a few hours experimenting with blurring, sharpening, masking, normalizing and a variety of FFT-based image processing techniques. The idea was to modify the screenshots in some way that makes them easier to learn.

None of that produced reliable improvements across many different sets of icons and boards. My understanding is that pre-processing is less common these days, because the models just end up learning the same thing, with or without it.

That said, I still suspect that background removal could be useful. Some boards are styled to look like "wood" or "marble", which creates significant artifacts that may confuse the classifier. I experimented with U-Net architectures to automatically generate image masks for background separation as a pre-processing step. The masks I got were "fine", but not amazing.

Maybe user error! I might return to this idea in the future.

What about ChatGPT and LLMs?

I did a very quick check on whether ChatGPT would ace this.

Not yet! But also not bad!

For my quick test, I uploaded a screenshot of a chess puzzle with 18 pieces to ChatGPT (January 2025).

It placed 13 of the 18 pieces correctly. You could say that's 72% correct. You could also say it's 92% correct, since it got 59 of the 64 squares right. Either way, it seems possible that commercial AI models will perform near-perfectly on this task in the future.

Note that I was asking ChatGPT to extract the entire board at once. In the code above, we were splitting the board and performing 64 separate classifications. It's a different problem. I will test ChatGPT on the single-square version sometime soon.

What about visual artifacts, cursors, icons, etc?

Due to being screenshots, some input images contain visual artifacts. This can be tough for the classifier, which prefers "clean" images. Artifacts that I encountered were:

  • mouse cursors
  • highlighted squares
  • arrows and non-piece icons
  • rank/file labels on the squares

These artifacts don't appear in the training data, so the model doesn't know what to do with them. An arrow or large icon that obscures 25% of a piece will likely mess up the classification.

The easiest solution here is to include examples of those situations in the training data. Perhaps the data augmentation step could randomly add cursors or small icons to the training examples.
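
I haven't tried this yet, but a crude version could be as simple as stamping a random bright blob (a stand-in for a cursor or an arrowhead) onto a fraction of the training images. The add_fake_cursor() helper below is hypothetical, just to illustrate the idea:

import numpy as np

def add_fake_cursor(img_array: np.ndarray, p: float = 0.3) -> np.ndarray:
    """Occasionally paste a small bright square onto a (32, 32, 1) training image."""
    if np.random.rand() > p:
        return img_array
    out = img_array.copy()
    size = np.random.randint(4, 9)            # blob of 4-8 pixels
    x = np.random.randint(0, out.shape[1] - size)
    y = np.random.randint(0, out.shape[0] - size)
    out[y:y + size, x:x + size, :] = 1.0      # pixels are scaled to [0, 1]
    return out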

What about state-specific rules?

Chess is stateful. You can't determine the entire game status from one screenshot with just the squares. There are special rules for:

  • castling rights
  • en passant captures
  • the half-move clock and move counters

These rules require knowing aspects of the game history. Here we ignored that because we were only concerned with the image recognition task.

But technically, that means our final output, a FEN string, is partly wrong. We fail to set the special bits for castling, en passant, and the half-move number.
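
If you did track that history somewhere else, python-chess can carry the extra state, and board.fen() will include it. A rough sketch (the specific values here are made up for illustration):

board = chess.Board(None)
# ...place pieces with set_piece_at(), as in to_fen()...
board.turn = chess.BLACK              # side to move
board.set_castling_fen("KQkq")        # remaining castling rights
board.ep_square = chess.E3            # en passant target square, if any
board.halfmove_clock = 4              # for the fifty-move rule
board.fullmove_number = 12
print(board.fen())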

What about live-streaming, video, or 3d boards?

I have ideas to handle live-streaming/video for 2D boards in the near future.

I barely know where to start with 3D. You've got to worry about lighting, shadows, bad camera angles, and piece orientation. Even worse, people are constantly putting their hands into the scene!

Seems hard!

I did zero research because I wanted to do this project for fun, without looking at other people's work. I might do some research in the future to learn what the state of the art is. We have driverless cars (sort of?), so it must be possible to recognize pieces on a real-life, 3d chess board (right?)

What about "sliding windows" for classification?

Another way to detect/classify pieces without a neural net is to "slide" an image or mask across every square of the board. The best "match" will have the most overlapping pixels.

I didn't try that because I wanted to use neural nets.

My gut feeling is that sliding windows should work pretty well. But it might fail in the same difficult cases that confound the model described here.

Maybe this is better or faster! I dunno! Something to investigate.
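
If I do try it, the simplest variant skips the sliding entirely and just compares each square directly against one reference image per piece. A sketch (best_match() and templates are hypothetical, assuming 32x32 grayscale arrays in [0, 1]):

def best_match(square, templates):
    """Return the label of the reference image closest to this square.

    templates maps labels ('.', 'P', 'n', ...) to arrays of the same shape.
    """
    scores = {label: np.abs(square - tmpl).mean() for label, tmpl in templates.items()}
    return min(scores, key=scores.get)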

What about diff-ing the boards between moves?

Perhaps instead of continuously detecting the entire board state, we can subtract the previous board from the current board to find out "which two squares changed" and use that to build up the game state over time.

from_sq, to_sq = current_board - last_board

We don't need to do any processing on the CONTENTS of the squares. Just knowing WHICH two squares changed on a given turn is (almost) sufficient to fully determine the last move.
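
Here's a sketch of that diff, using the same nested-list representation that extract_board() returns (diff_boards() is a hypothetical helper):

def diff_boards(last_board, current_board):
    """Return the squares whose contents changed between two positions."""
    changed = []
    for row in range(8):
        for col in range(8):
            if last_board[row][col] != current_board[row][col]:
                file = 'abcdefgh'[col]
                rank = str(8 - row)      # row 0 is rank 8
                changed.append(file + rank)
    return changed

# An ordinary move returns two squares, e.g. ['e2', 'e4'];
# castling and en passant touch more than two.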

Is this better?

Speed and accuracy are definitely advantages. It requires minimal image processing to calculate the "from" and "to" squares, and the results are deterministic. This method also completely avoids problems with image artifacts and non-standard graphics for the pieces and board.

Pawn promotion is a major problem though. Without external information, there is no way to immediately determine the promotion type. Either image recognition on the promotion square, or user input, would still be required to fully handle promotions.

Another minor drawback is that this method is stateful. A neural net can detect the whole board at any moment of the game. This method requires that we see every move, in order, from the very beginning. Does that matter? Maybe or maybe not, depending on the specific application.

What about using legal moves to improve correctness?

An interesting possibility is to use the current legal moves to re-weight the neural net's class probability predictions for each square. For example, suppose the class probabilities for the square e5 are:

{R: 0.50, Q: 0.45, P: 0.05}

In this situation, our network predicts a white rook (R), because 50% is the highest probability. But suppose we possess external knowledge that a rook on e5 is impossible (maybe all 4 rooks have been captured).

If we discard the rook on e5 as impossible, then we can recalculate the class probabilities for that square and predict a queen as the most likely piece instead.

{Q: 0.90, P: 0.10}
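
The renormalization step itself is tiny: drop the impossible classes and divide the rest by what remains. A sketch (constrain() is a made-up helper):

def constrain(probs: dict, impossible: set) -> dict:
    """Drop impossible classes and renormalize the remaining probabilities."""
    allowed = {label: p for label, p in probs.items() if label not in impossible}
    total = sum(allowed.values())
    return {label: p / total for label, p in allowed.items()}

print(constrain({'R': 0.50, 'Q': 0.45, 'P': 0.05}, impossible={'R'}))
# {'Q': 0.9, 'P': 0.1}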

Using the last board position and legal moves to constrain the class probabilities seems powerful, especially in cases where the neural net is uncertain. Maybe worth exploring!