Passing Pokemon Through an Autoencoder

In the hopes of becoming more familiar with the data and building something that may prove useful later, I decided to apply dimensionality reduction to Pokemon. While the goal is to use an autoencoder, I started with PCA to get a baseline of what I should be looking to beat.

For some background (skip this paragraph if you’re familiar with Pokemon), the creatures in Pokemon games are called ‘Pokemon’. There are around 1000 of them currently, and they each have their own values for a number of attributes. They have six numerical attributes called ‘stats’ (‘Hit Points’ or ‘hp’, ‘Attack’ or ‘atk’, ‘Defence’ or ‘def’, ‘Special Attack’ or ‘spa’, ‘Special Defence’ or ‘spd’, and ‘Speed’ or ‘spe’) and two categorical attributes (a primary type, and an optional secondary type). This is what a sample of the data looks like:

A slice of the Pokemon information dataframe

Since there is no inherent structure to the primary and secondary types, I chose to one-hot encode them. If the types were encoded as integers it would be possible to represent a single Pokemon with 8 numbers, but that is likely a less useful representation for future work than applying dimensionality reduction to the one-hot encoded types. Since each type slot has 18 possibilities, the feature space has 6 + 18 + 18 = 42 dimensions.
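For concreteness, here is a minimal sketch of that encoding, assuming a pandas dataframe with hypothetical column names (`type1`, `type2`, and the six stat abbreviations):

```python
import pandas as pd

stats = ["hp", "atk", "def", "spa", "spd", "spe"]
df = pd.read_csv("pokemon.csv")  # hypothetical file name

# One-hot encode both type columns; a missing secondary type (NaN)
# simply becomes all zeros in its 18-column block.
types = pd.get_dummies(df[["type1", "type2"]], prefix=["t1", "t2"])

# 6 stats + 18 primary-type columns + 18 secondary-type columns = 42 features
features = pd.concat([df[stats], types], axis=1)
```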

After some more data preparation (dividing the stats by 200 to bring their scale to roughly match the one-hot encoded types), I performed PCA on the data using 42 components. The following graph shows the cumulative fraction of the total variance explained by the top \(n\) components.

Fraction of Variance vs Number of Components
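The scaling and the variance curve can be reproduced roughly as follows (a sketch building on the `features` dataframe above, with the stat columns assumed to come first):

```python
import numpy as np
import matplotlib.pyplot as plt
from sklearn.decomposition import PCA

# Bring the stats onto roughly the same scale as the one-hot columns.
X = features.to_numpy(dtype=float)
X[:, :6] /= 200.0

pca = PCA(n_components=42).fit(X)
cumulative = np.cumsum(pca.explained_variance_ratio_)

plt.plot(range(1, 43), cumulative)
plt.xlabel("Number of components")
plt.ylabel("Cumulative fraction of variance explained")
plt.show()

# Smallest number of components capturing 90% of the variance
print(np.argmax(cumulative >= 0.9) + 1)
```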

PCA clearly has trouble with the large number of one-hot encoded features: 28 components are needed to capture 90% of the variance. While that is certainly fewer than the original 42 features, it isn’t a great compression. Two other useful metrics for this data are how well the original types can be recovered from the latent space, and the average error on the reconstructed stats.

Reconstructed type accuracy

MSE on reconstructed stats

The reason for the lack of early improvement in the secondary type reconstruction is that the secondary type may be NaN, combined with my method of reconstructing it. The predicted secondary type is only non-NaN if one of the reconstructed one-hot encodings is larger than 0.5. For PCA with very few components, all of these reconstructed values are very small, so the model generally predicts no secondary type. Approximately 49% of Pokemon do not have a secondary type, which matches the secondary type accuracy for very few components. Note that this would likely be different if these values were first run through a softmax layer.
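In code, that recovery rule looks something like this (a sketch; the column layout and the `type_names` list are assumptions):

```python
import numpy as np

def decode_types(X_rec, type_names):
    """Recover the two categorical types from reconstructed features.

    Assumes columns 6:24 hold the primary-type one-hot block and
    columns 24:42 the secondary-type block.
    """
    names = np.asarray(type_names, dtype=object)
    primary = names[X_rec[:, 6:24].argmax(axis=1)]

    # Only predict a secondary type when some reconstructed one-hot
    # value clears 0.5; otherwise predict "no secondary type" (NaN).
    sec = X_rec[:, 24:42]
    secondary = np.where(sec.max(axis=1) > 0.5, names[sec.argmax(axis=1)], None)
    return primary, secondary
```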

With the intent of using a 2-dimensional latent space for the autoencoder, here are the performance statistics for 2-component PCA:

Primary type accuracy:  [0.30562771]
Secondary type accuracy:  [0.48831169]
Combined type accuracy:  [0.17489177]
Stat MSE:  880.5201221373346

It has relatively high error on the stat reconstruction (an MSE of 880 corresponds to a root-mean-squared error of roughly 30, while most stats fall in the 80-120 range) and poor type accuracy. Let’s see what improvement an autoencoder will bring.

After a good deal of experimentation, the autoencoder architecture I chose was three encoding layers of sizes 30, 20, and 10, followed by the 2 nodes representing the encoded data, and then a symmetric setup for the decoder. All activation functions are ReLUs, except for the final encoding layer, which is a softsign. The loss function was the MSE across all features, although using a softmax layer and cross-entropy on the types would likely perform better (mixing loss functions and activation functions can be a project for another time). The train/validation losses for the trained model, as well as the MSE for the 2-component PCA encoding, are:

Train MSE:  0.018529806247487138
Validation MSE:  0.01875758472143913
PCA MSE:  0.03163098595634264
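For reference, the architecture described above might look like this in Keras (a sketch: the layer sizes and activations come from the text, while the optimizer, epochs, and data split are assumptions):

```python
from tensorflow import keras

encoder = keras.Sequential([
    keras.Input(shape=(42,)),
    keras.layers.Dense(30, activation="relu"),
    keras.layers.Dense(20, activation="relu"),
    keras.layers.Dense(10, activation="relu"),
    keras.layers.Dense(2, activation="softsign"),  # 2-d latent space in (-1, 1)
])
decoder = keras.Sequential([
    keras.Input(shape=(2,)),
    keras.layers.Dense(10, activation="relu"),
    keras.layers.Dense(20, activation="relu"),
    keras.layers.Dense(30, activation="relu"),
    keras.layers.Dense(42, activation="relu"),
])
autoencoder = keras.Sequential([encoder, decoder])
autoencoder.compile(optimizer="adam", loss="mse")

# X_train / X_val: scaled 42-feature matrices (an assumed train/val split)
autoencoder.fit(X_train, X_train, epochs=500, batch_size=32,
                validation_data=(X_val, X_val))
```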

The losses above are on the scaled data, where the stats still take values from roughly 0 to 1. The error of the autoencoder is approximately 40% lower than that of PCA, which is a nice improvement. However, what really matters is the stat MSE and how accurately the types can be reconstructed:

Primary type accuracy:  [0.82770563]
Secondary type accuracy:  [0.53593074]
Combined type accuracy:  [0.44848485]
Stat MSE:  544.1272305297197

Not bad! Again a roughly 40% decrease in the MSE on the stats, and more than double the accuracy when recovering both types. Another interesting observation is how differently the two methods cluster Pokemon in the latent space:

PCA encoded Pokemon

Autoencoder encoded Pokemon
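The two scatter plots can be generated along these lines (a sketch; `encoder` is the trained encoder from above, and points are coloured by primary type):

```python
import matplotlib.pyplot as plt
from sklearn.decomposition import PCA

Z_pca = PCA(n_components=2).fit_transform(X)
Z_ae = encoder.predict(X)

fig, axes = plt.subplots(1, 2, figsize=(12, 5))
for ax, Z, title in [(axes[0], Z_pca, "PCA"), (axes[1], Z_ae, "Autoencoder")]:
    for t in sorted(df["type1"].unique()):
        mask = (df["type1"] == t).to_numpy()
        ax.scatter(Z[mask, 0], Z[mask, 1], s=8, label=t)
    ax.set_title(f"{title} encoded Pokemon")
axes[1].legend(fontsize=6, ncol=2)
plt.show()
```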

The autoencoder separates Pokemon very efficiently by primary type; aside from one cluster, all primary types have been separated. The secondary type is recovered much less accurately, and in almost all cases where a cluster has a uniform secondary type, it also has a uniform primary type. For example, all five Pokemon with a Grass/Dragon typing (Appletun, Applin, Exeggutor-Alola, Flapple, and Sceptile-Mega) are clustered around \((-0.2, -0.7)\). PCA, on the other hand, seems to have picked up on how common Flying is as a secondary type: there are only three clusters of Pokemon with secondary type Flying in its latent space, and they are all uniformly Flying.

This first look at using an autoencoder to represent a Pokemon with fewer features has been promising, and I have a few more ideas for improving the performance. I may also increase the dimension of the latent space if two dimensions turn out to be insufficient to represent a Pokemon for my future needs.