Augmented Reality Sudoku Solver: Reading Puzzle Values from Image

Shashank Goyal
9 min read · Oct 16, 2021


This article is a continuation of the Augmented Reality Sudoku Solver series, in which we build a GUI-based AR Sudoku Solver. In this article, we see how to extract numbers from the processed image of the puzzle. We implement a Convolutional Neural Network, trained on the Char74k dataset with PyTorch, to recognize the digit in each square.

We aim to solve any sudoku of dimension N, where N is a non-prime number. The project will be implemented in two forms -

  • An option to load an image saved on the system or use the webcam to feed the image to the program and then play the game on the system.
  • An option to use Augmented Reality and solve the sudoku by showing the puzzle to the webcam.

The project uses Dancing Links in the form of Algorithm X to find the solution of the Sudoku puzzle. Sudoku is a well-known NP-Complete problem, and Algorithm X is a backtracking depth-first search over an exact-cover formulation of the puzzle. The project will be blogged in 4 parts -

Part III: Recognizing Numerical Cell Values -

For detecting the numerical values in the cells, we will be using the PyTorch library. We will implement a Convolutional Neural Network (CNN) based model trained on the Char74k dataset. The architecture I use here gives an accuracy of 99.61% and is inspired by this model.

To start, let's first understand the basics of a Convolutional Neural Network -

In a simple feed-forward neural network, we typically flatten the image into a linear vector and take its dot product with a set of weights, plus a bias, to get one element of the output array. The output array follows the same pattern through a few hidden layers before converging to the final output layer.

The main limitation of this approach is its inability to capture spatial relationships between pixel positions: flattening discards the 2-D structure, even though a pattern spread over a local region of the image plane represents a feature that may appear anywhere else in that plane. For instance, an area of an image representing a nose can be found in the top-left corner or the bottom-right corner. This property is called spatial invariance, i.e. a small area of an image representing a feature can move around anywhere in the image plane.
Now let us understand what the convolution operation really means -

In simple terms, we slide a smaller matrix, called the kernel, over the plane of the image. The kernel is essentially a 2-D weight matrix. For each position of the kernel over the image, we take the dot product of the kernel with the underlying portion of the image, as evident from the above image.
Before moving forward, a few terms related to convolution -

  • Stride: The number of pixels by which the kernel moves. In the image above, on each iteration the kernel moves right by one pixel and, at the end of each row, down by one pixel; this step size is the stride.
  • Padding: In the above image, we can see that the kernel passes over the internal elements of the matrix more times than over the outermost elements. This reduces the output from 5x5 to 3x3. We can supplement the original matrix with a border of zero elements to avoid this size reduction.
  • Dilation: The intra-kernel point distance, i.e. the distance between the elements of the matrix that are sampled for the kernel multiplication.

To better understand the above terms, I recommend taking a look at this README which provides great visualization for these explanations.
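To make these terms concrete, here is a small PyTorch sketch (for illustration only) showing how stride, padding, and dilation change the output size of a 3x3 kernel on a 5x5 input -

import torch
import torch.nn as nn

# A batch of one single-channel 5x5 image.
x = torch.randn(1, 1, 5, 5)

# No padding: a 3x3 kernel fits in only 3x3 positions, so the output shrinks.
print(nn.Conv2d(1, 1, kernel_size=3, stride=1, padding=0)(x).shape)  # [1, 1, 3, 3]

# padding=1 adds a border of zeros, preserving the 5x5 size.
print(nn.Conv2d(1, 1, kernel_size=3, stride=1, padding=1)(x).shape)  # [1, 1, 5, 5]

# stride=2 moves the kernel two pixels at a time.
print(nn.Conv2d(1, 1, kernel_size=3, stride=2, padding=1)(x).shape)  # [1, 1, 3, 3]

# dilation=2 spaces the sampled elements apart (effective span 5x5).
print(nn.Conv2d(1, 1, kernel_size=3, dilation=2)(x).shape)           # [1, 1, 1, 1]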

Now let's take a look at the architecture of the neural network that we are going to use for our solution -

+----------+--------------------------------------------+----------+
| Input | Operation | Output |
| Size | | Size |
+----------+--------------------------------------------+----------+
| Feature Extraction |
+----------+--------------------------------------------+----------+
| 1x28x28 | Conv2d(1, 32, kernel_size=3, stride=1, | 32x28x28 |
| | padding=1), | |
| | BatchNorm2d(32), | |
| | ReLU(inplace=True) | |
+----------+--------------------------------------------+----------+
| 32x28x28 | Conv2d(32, 32, kernel_size=3, stride=1,    | 32x14x14 |
| | padding=1), | |
| | BatchNorm2d(32), | |
| | ReLU(inplace=True), | |
| | MaxPool2d(kernel_size=2, stride=2) | |
+----------+--------------------------------------------+----------+
| 32x14x14 | Conv2d(32, 64, kernel_size=3, padding=1), | 64x14x14 |
| | BatchNorm2d(64), | |
| | ReLU(inplace=True) | |
+----------+--------------------------------------------+----------+
| 64x14x14 | Conv2d(64, 64, kernel_size=3, padding=1), | 64x7x7 |
| | BatchNorm2d(64), | |
| | ReLU(inplace=True), | |
| | MaxPool2d(kernel_size=2, stride=2) | |
+----------+--------------------------------------------+----------+
| Classification |
+----------+--------------------------------------------+----------+
| 64x7x7 | Dropout(p=0.5), | 512 |
| | Linear(64 * 7 * 7, 512), | |
| | BatchNorm1d(512), | |
| | ReLU(inplace=True), | |
| | Dropout(p=0.5) | |
+----------+--------------------------------------------+----------+
| 512 | Linear(512, 512), | 512 |
| | BatchNorm1d(512), | |
| | ReLU(inplace=True) | |
+----------+--------------------------------------------+----------+
| 512 | Dropout(p=0.5), | 10 |
| | Linear(512, 10) | |
+----------+--------------------------------------------+----------+

Though the above architecture looks daunting, it is not hard to implement in code.
Without any further delay, let us have a look at the code -

PyTorch GPU Utility Functions -

The “get_default_device” method checks whether a CUDA-enabled GPU is available on the system. Accordingly, it returns “torch.device('cuda')” or “torch.device('cpu')”, an object representing the device on which tensor variables will be allocated for computation.
The “to_device” method is used to allocate an object on the available device.
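The original code embed is not reproduced here, so below is a minimal sketch of what these two helpers typically look like, following the description above -

import torch

def get_default_device():
    # Prefer a CUDA-enabled GPU when available, otherwise fall back to the CPU.
    return torch.device('cuda') if torch.cuda.is_available() else torch.device('cpu')

def to_device(data, device):
    # Recursively move tensors (or lists/tuples of tensors) to the target device.
    if isinstance(data, (list, tuple)):
        return [to_device(x, device) for x in data]
    return data.to(device, non_blocking=True)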

PyTorch Device DataLoader Class -

This class supplements the “torch.utils.data.DataLoader” class: it moves an entire DataLoader object to the available device, but the allocation happens lazily, only when an element is yielded.
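A minimal sketch of the class, consistent with the description above -

class DeviceDataLoader:
    # Wraps a DataLoader and moves each batch to the device as it is yielded.

    def __init__(self, dataloader, device):
        self.dataloader = dataloader
        self.device = device

    def __iter__(self):
        # Batches are transferred lazily, one at a time, not all at once.
        for batch in self.dataloader:
            yield to_device(batch, self.device)

    def __len__(self):
        return len(self.dataloader)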

Accuracy Method -

The “accuracy” method determines the fraction of predictions that are correct. Applied along dim=1, the “torch.max” method returns the highest value in each row of the output array together with its index. Example -

X = torch.tensor([
    [0.42, 0.72, 0.86, 0.87, 0.29, 0.19, 0.59, 0.99, 0.65, 0.94],
    [0.49, 0.76, 0.77, 0.66, 0.35, 0.41, 0.79, 0.72, 0.04, 0.31]
])
torch.max(X, dim=1).indices => tensor([7, 6])

We then divide the total number of correct predictions by the total number of predictions to get the accuracy fraction.
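A sketch of the method as described -

def accuracy(outputs, labels):
    # The index of the highest score in each row is the predicted class.
    _, preds = torch.max(outputs, dim=1)
    # Fraction of predictions that match the ground-truth labels.
    return torch.tensor(torch.sum(preds == labels).item() / len(preds))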

Char74k Model Initialization -

This is the initialization method of the class. Here we inherit our class from “torch.nn.Module” and define the feature-extraction and classifier sequences. We then initialize the weights for the feature-extraction network, as explained in the next function.
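The embedded gist is not shown here; the following sketch reconstructs the initialization directly from the architecture table above (the class name “Char74kModel” is a placeholder) -

import torch.nn as nn

class Char74kModel(nn.Module):
    def __init__(self):
        super().__init__()
        # Feature extraction: two conv blocks, each ending in a max-pool
        # that halves the spatial size (28 -> 14 -> 7).
        self.features = nn.Sequential(
            nn.Conv2d(1, 32, kernel_size=3, stride=1, padding=1),
            nn.BatchNorm2d(32),
            nn.ReLU(inplace=True),
            nn.Conv2d(32, 32, kernel_size=3, stride=1, padding=1),
            nn.BatchNorm2d(32),
            nn.ReLU(inplace=True),
            nn.MaxPool2d(kernel_size=2, stride=2),        # 32x28x28 -> 32x14x14
            nn.Conv2d(32, 64, kernel_size=3, padding=1),
            nn.BatchNorm2d(64),
            nn.ReLU(inplace=True),
            nn.Conv2d(64, 64, kernel_size=3, padding=1),
            nn.BatchNorm2d(64),
            nn.ReLU(inplace=True),
            nn.MaxPool2d(kernel_size=2, stride=2),        # 64x14x14 -> 64x7x7
        )
        # Classifier: three fully connected layers with dropout in between.
        self.classifier = nn.Sequential(
            nn.Dropout(p=0.5),
            nn.Linear(64 * 7 * 7, 512),
            nn.BatchNorm1d(512),
            nn.ReLU(inplace=True),
            nn.Dropout(p=0.5),
            nn.Linear(512, 512),
            nn.BatchNorm1d(512),
            nn.ReLU(inplace=True),
            nn.Dropout(p=0.5),
            nn.Linear(512, 10),
        )
        self._initialize_weights()  # defined in the next section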

Layer Weight Initialization -

Here we normalize the kernel values of the Convolution Layers and set the weights and biases to 1 and 0 respectively for the Batch Normalization Layers.
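A sketch of the weight-initialization method, as a method of the Char74kModel class above; the text only says the kernel values are normalized, so the exact scheme (Kaiming-normal) is an assumption -

    def _initialize_weights(self):
        for m in self.features.modules():
            if isinstance(m, nn.Conv2d):
                # Assumed scheme: Kaiming-normal initialization of the kernels.
                nn.init.kaiming_normal_(m.weight, mode='fan_out', nonlinearity='relu')
                if m.bias is not None:
                    nn.init.constant_(m.bias, 0)
            elif isinstance(m, nn.BatchNorm2d):
                # Batch-norm starts as the identity: weight 1, bias 0.
                nn.init.constant_(m.weight, 1)
                nn.init.constant_(m.bias, 0)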

Feed-Forward Computation -

The “forward” method is a special method of the “torch.nn.Module” class; it defines the forward-pass step from one layer (or, in our case, a sequence of layers) to another. Defining this method enables us to perform the necessary steps in the forward pass. Here, we input the “1x28x28” image, i.e. the 28x28 image matrix, into the feature-extraction layers. Note that the 1 here indicates a single-channel, i.e. grayscale, image. The “x.view” call flattens the feature maps into a two-dimensional tensor. Example -

a = torch.randint(low=0, high=10, size=(2, 5, 2))
a =>
tensor([[[9, 2],
[2, 8],
[2, 2],
[3, 4],
[5, 8]],

[[0, 3],
[3, 5],
[3, 1],
[8, 7],
[5, 6]]])
a.view(a.size(0), -1) =>
tensor([[9, 2, 2, 8, 2, 2, 3, 4, 5, 8],
[0, 3, 3, 5, 3, 1, 8, 7, 5, 6]])

We then feed this two-dimensional tensor to the classifier, which returns the prediction scores for the image: a one-dimensional vector of raw scores, one per class (in our case, the digits 0-9).
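A sketch of the forward pass, continuing the Char74kModel class from above -

    def forward(self, x):
        # x: a batch of grayscale images, shape (N, 1, 28, 28).
        x = self.features(x)         # -> (N, 64, 7, 7)
        x = x.view(x.size(0), -1)    # flatten to (N, 64 * 7 * 7)
        return self.classifier(x)    # -> (N, 10) raw class scores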

Training Step -

The “training_step” method defines the process for a single iteration of the training cycle. Each iteration or step accepts a batch of images along with the associated labels. The images are passed to the model, and we compute the cross-entropy loss for the batch predictions. Cross-Entropy Loss is a metric that combines Log-Softmax and Negative Log-Likelihood loss and is used for N-class classification problems. We calculate the loss after moving the loss-function object to the available computation device, and return it.
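A sketch of the step, again as a method of the Char74kModel class -

    def training_step(self, batch):
        images, labels = batch
        out = self(images)  # forward pass on the batch
        # Cross-entropy combines log-softmax and negative log-likelihood;
        # the loss object is moved to the available device, as described.
        loss_fn = to_device(nn.CrossEntropyLoss(), get_default_device())
        return loss_fn(out, labels)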

Validation Step -

The “validation_step” method is nearly identical to “training_step”, with the difference that it additionally calculates the accuracy and returns both the loss and the accuracy for display only, whereas “training_step” returns just the loss, which is backpropagated through the network.
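A corresponding sketch -

    def validation_step(self, batch):
        images, labels = batch
        out = self(images)
        loss_fn = to_device(nn.CrossEntropyLoss(), get_default_device())
        loss = loss_fn(out, labels)
        acc = accuracy(out, labels)   # extra metric, for display only
        return {'val_loss': loss.detach(), 'val_acc': acc}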

Validation Epoch End -

The “validation_epoch_end” method runs at the end of a training epoch; it combines the data from all the validation steps and calculates the mean validation accuracy and validation loss for that epoch.
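A sketch of the aggregation -

    def validation_epoch_end(self, outputs):
        # Average the per-batch losses and accuracies collected over the epoch.
        epoch_loss = torch.stack([x['val_loss'] for x in outputs]).mean()
        epoch_acc = torch.stack([x['val_acc'] for x in outputs]).mean()
        return {'val_loss': epoch_loss.item(), 'val_acc': epoch_acc.item()}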

Epoch End -

This method is used to print the mean validation loss and accuracy for an epoch.
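A sketch -

    def epoch_end(self, epoch, result):
        # Display the averaged metrics for this epoch.
        print(f"Epoch [{epoch}], val_loss: {result['val_loss']:.4f}, "
              f"val_acc: {result['val_acc']:.4f}")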

Model Training and Evaluation Methods -

Evaluate Model -

This method is used to run the model over a validation or test dataset (in the form of a DataLoader object), during and after training, to determine the model metrics.
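A sketch of the evaluation loop -

@torch.no_grad()
def evaluate(model, val_loader):
    model.eval()  # disable dropout; use running batch-norm statistics
    outputs = [model.validation_step(batch) for batch in val_loader]
    return model.validation_epoch_end(outputs)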

Fit Model on Training Data -

Here we fit the model on the training data for the specified number of epochs. We use the Adam optimizer for first-order gradient-based optimization, together with a learning-rate scheduler, specifically StepLR with step_size=7 and gamma=0.1, which means that every 7 epochs the learning rate is multiplied by 0.1.
We then start the model fitting by iterating over the batches of training data for each epoch.
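A sketch of the fit loop, using the optimizer and scheduler described above -

def fit(epochs, lr, model, train_loader, val_loader):
    optimizer = torch.optim.Adam(model.parameters(), lr)
    # Every 7 epochs, the learning rate is multiplied by gamma=0.1.
    scheduler = torch.optim.lr_scheduler.StepLR(optimizer, step_size=7, gamma=0.1)
    history = []
    for epoch in range(epochs):
        model.train()
        for batch in train_loader:
            loss = model.training_step(batch)
            loss.backward()        # backpropagate the batch loss
            optimizer.step()
            optimizer.zero_grad()
        scheduler.step()
        result = evaluate(model, val_loader)
        model.epoch_end(epoch, result)
        history.append(result)
    return history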

Training Model -

This method is the driver method for training the model. We first create a transformation sequence that converts the images to grayscale, resizes them to 28x28, and finally converts them to tensors. We load the data directory using the “torchvision.datasets.ImageFolder” method, which takes the directory path and the transformations to apply when loading the data. We then specify a batch size, split the dataset into an 80-20 training-validation ratio, and load the splits into respective “DeviceDataLoaders”.

We then move the model to the available computational device and fit the model for 20 epochs. We then evaluate the model and save the model weights to a file.
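A sketch of the driver; the data directory, batch size, learning rate, and output filename here are illustrative assumptions -

import torchvision.transforms as T
from torchvision.datasets import ImageFolder
from torch.utils.data import DataLoader, random_split

TRANSFORM = T.Compose([
    T.Grayscale(num_output_channels=1),  # single-channel input
    T.Resize((28, 28)),
    T.ToTensor(),
])

def train_model(data_dir, epochs=20, batch_size=128, lr=1e-3):
    dataset = ImageFolder(data_dir, transform=TRANSFORM)
    val_size = len(dataset) // 5         # 80-20 train/validation split
    train_ds, val_ds = random_split(dataset, [len(dataset) - val_size, val_size])

    device = get_default_device()
    train_loader = DeviceDataLoader(DataLoader(train_ds, batch_size, shuffle=True), device)
    val_loader = DeviceDataLoader(DataLoader(val_ds, batch_size * 2), device)

    model = to_device(Char74kModel(), device)
    fit(epochs, lr, model, train_loader, val_loader)
    print(evaluate(model, val_loader))
    torch.save(model.state_dict(), 'char74k_model.pth')  # hypothetical filename
    return model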

The final loss and accuracy that I achieved with this model are -
Model Result = val_loss: 0.0076, val_acc: 0.9961

Evaluate Loaded Model -

This method helps us load a saved model file and determine the accuracy and loss over the entire dataset.
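A sketch, reusing the load_model helper sketched in the next subsection -

def evaluate_loaded_model(weights_path, data_dir, batch_size=128):
    # Restore the model from saved weights, then run it over the full dataset.
    model = load_model(weights_path)     # sketched below
    dataset = ImageFolder(data_dir, transform=TRANSFORM)
    loader = DeviceDataLoader(DataLoader(dataset, batch_size), get_default_device())
    print(evaluate(model, loader))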

Load Model -

This method helps load the model from a saved file.
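A sketch -

def load_model(weights_path):
    device = get_default_device()
    model = Char74kModel()
    # Restore the trained parameters; map_location handles CPU-only machines.
    model.load_state_dict(torch.load(weights_path, map_location=device))
    model.eval()
    return to_device(model, device)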

You can find the complete Python implementation here. If you found this article helpful, please follow me on Medium and GitHub and star the Project Repository.

You can find the complete implementation explained in detail for this project here.


Shashank Goyal

I'm Shashank Goyal, a passionate Dual Master's student at Johns Hopkins University, pursuing degrees in Computer Science and Robotics.