Autoencoder Improvements

The first avenue for improving the performance of the Pokemon autoencoder was normalizing the input stats. Initially I had just divided them all by 200 to force them into the range \([0, 1]\). This was arbitrary, and neural networks often perform better when their inputs are approximately normally distributed.
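To make the comparison concrete, here is a minimal sketch of both scalings (the array layout and example Pokemon are illustrative, not taken from my actual code):

```python
import numpy as np

# Base stats as a (n_pokemon, 6) array: HP, Atk, Def, SpA, SpD, Spe.
stats = np.array([[45, 49, 49, 65, 65, 45],     # Bulbasaur
                  [78, 84, 78, 109, 85, 100]])  # Charizard

# Original approach: divide by 200 to force everything into [0, 1].
scaled = stats / 200.0

# New approach: standardize each stat to zero mean and unit variance.
normalized = (stats - stats.mean(axis=0)) / stats.std(axis=0)
```

So, I fit both models a number of times to see how they compared: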

Comparison of original method vs normalized stats

The losses are very different, which is to be expected since the scaling of the stats is not the same. What really matters is how well the stats and types are reconstructed:

| Method | Primary Type Accuracy | Secondary Type Accuracy | Both Types Accuracy | Stat MSE |
|--------|----------------------|-------------------------|---------------------|----------|
| Original | 69.4% | 62.9% | 39.4% | 597 |
| Normalized Stats | 39.7% | 49.1% | 22.5% | 544 |

The results here are mixed. The new normalization lowers the MSE of the reconstructed stats but hinders the prediction of the types. This is due to the larger range of values the stats take in the normalized model; originally they were all in the range \([0, 1]\), but now they can take on larger values, many of them above 2.5. This increases their weight in the loss function, so the optimizer focuses more on the stats than on the types. Ultimately the difference in MSE isn't large enough to definitively conclude that this normalization is useful, so I will keep the original scaling.
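To see why, consider the kind of combined loss this autoencoder uses; the sketch below is my assumption of its shape (PyTorch, an unweighted sum, and 18-way type classes are all assumptions):

```python
import torch.nn.functional as F

def reconstruction_loss(stat_pred, stat_true, type1_logits, type1, type2_logits, type2):
    # stat_*: (batch, 6) floats; type*_logits: (batch, 18);
    # type1/type2: (batch,) class indices. All shapes here are assumptions.
    stat_loss = F.mse_loss(stat_pred, stat_true)
    type_loss = F.cross_entropy(type1_logits, type1) + F.cross_entropy(type2_logits, type2)
    return stat_loss + type_loss
```

With stats squeezed into \([0, 1]\) the squared errors in the MSE term stay small; after standardization they can be several times larger, so the unweighted sum tilts the gradients toward the stats.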

The next optimization for the network is batch normalization. This is an extremely useful technique in deep learning, and since our network is more than a couple of layers deep it should give a noticeable improvement.
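As a sketch of what the change looks like (assuming PyTorch; the layer widths are placeholders, since I'm not reproducing the full architecture here):

```python
import torch.nn as nn

# Placeholder widths; 42 assumes 6 stats plus two 18-way one-hot types.
encoder = nn.Sequential(
    nn.Linear(42, 16),
    nn.BatchNorm1d(16),  # normalize activations across each batch
    nn.ReLU(),
    nn.Linear(16, 8),
    nn.BatchNorm1d(8),
    nn.ReLU(),
    nn.Linear(8, 3),     # latent code
)
```

Here, since the scale of the inputs is identical, the losses can be meaningfully compared: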

Loss with and without batch normalization

The effect of batch normalization here is clear. The loss decreases faster initially and approaches a lower value as training continues. However, training is inconsistent, with the loss fluctuating significantly. Lowering the learning rate helps this, while minimally affecting the lowest loss achieved.
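The fix is a one-line change to the optimizer. A sketch with hypothetical values, assuming Adam (I'm not quoting the actual rates here):

```python
import torch
import torch.nn as nn

model = nn.Linear(42, 3)  # stand-in for the full autoencoder

optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)  # original (hypothetical) rate
optimizer = torch.optim.Adam(model.parameters(), lr=5e-4)  # halved rate, smoother training
```

Comparing the two learning rates: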

Loss with batch normalization, at the original and halved learning rates

In fact, the model with the lower learning rate lowers its loss faster, achieving a loss of 0.02 before the 100th epoch, while the model with the higher learning rate takes almost 150 epochs. But again, the most important measure of performance is the median type prediction accuracy and stat MSE (computed roughly as sketched after the table):

| Method | Primary Type Accuracy | Secondary Type Accuracy | Both Types Accuracy | Stat MSE |
|--------|----------------------|-------------------------|---------------------|----------|
| Original | 68.3% | 61.3% | 41.7% | 655 |
| Batch Normalization | 85.0% | 58.0% | 50.7% | 829 |
| Batch Normalization (halved learning rate) | 87.7% | 59.5% | 52.9% | 753 |
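For reference, these metrics can be computed along these lines (a sketch; the tensor shapes and the rescaling of the stats back to their original range are assumptions):

```python
import torch

def evaluate(stat_pred, stat_true, type1_logits, type1, type2_logits, type2):
    # stat_pred/stat_true are assumed rescaled back to the original stat
    # range, which is why the MSE values land in the hundreds.
    p1 = type1_logits.argmax(dim=1) == type1
    p2 = type2_logits.argmax(dim=1) == type2
    return {
        "primary_acc": p1.float().mean().item(),
        "secondary_acc": p2.float().mean().item(),
        "both_acc": (p1 & p2).float().mean().item(),
        "stat_mse": ((stat_pred - stat_true) ** 2).mean().item(),
    }
```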

The model with the lower learning rate did perform better, though the improvement is narrow enough that it may just be noise. Overall this is a similar situation to earlier, where one area improved and the other suffered. This is not a loss for batch normalization though; it objectively achieved a lower loss than the original model. Rather, it is an indication of a bias in the loss function that gives more value to type accuracy than to stat accuracy.

So how deep can we go with batch normalization? I doubled the network depth (without really optimizing the new layer sizes or the other hyperparameters) to see how networks with and without batch normalization fared.
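A sketch of how that might look, building the encoder from a list of widths so batch normalization can be toggled (the widths are placeholders, not my actual ones):

```python
import torch.nn as nn

def make_encoder(widths, batch_norm=True):
    layers = []
    for i, (n_in, n_out) in enumerate(zip(widths[:-1], widths[1:])):
        layers.append(nn.Linear(n_in, n_out))
        if i < len(widths) - 2:  # no norm/activation after the latent layer
            if batch_norm:
                layers.append(nn.BatchNorm1d(n_out))
            layers.append(nn.ReLU())
    return nn.Sequential(*layers)

shallow = make_encoder([42, 16, 8, 3])
deep = make_encoder([42, 32, 16, 16, 8, 8, 3])  # doubled depth, untuned widths
```

Comparing the losses: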

Loss with and without batch normalization, for a deeper network

While the overall loss has increased due to the unoptimized network architecture, the gap between the two losses has also grown. This illustrates how batch normalization benefits deeper networks; less hyperparameter tuning would be required to match or beat the loss of the shallower network. Since this is ultimately a learning exercise I won't do that tuning for this network, but it will become important in the next phase of this project, when I create a network that will actually assist with competitive Pokemon battling.