Have you ever wondered how the app Prisma manages to turn your photos into impressionist paintings? It uses, among other things, an algorithm known as neural style.
Neural style combines the content of one image with the style of another using deep neural networks. It was first introduced in a famous paper by Leon A. Gatys and his team at the University of Tübingen, Germany. In it, they demonstrate how a class of deep neural networks can be used to extract features from any “style” image and subsequently “apply” them onto any “content” image.
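In the Gatys et al. formulation, the “style” of an image is captured by the Gram matrix of a layer’s feature maps, which records how strongly each pair of filters responds together. Here is a minimal NumPy sketch of that computation; the feature maps themselves are assumed to come from a network such as VGG, and the toy input below is random data for illustration only:

```python
import numpy as np

def gram_matrix(features):
    """Compute the Gram matrix of a feature map.

    features: array of shape (channels, height, width), as produced
    by one convolutional layer of the network.
    """
    c, h, w = features.shape
    flat = features.reshape(c, h * w)   # one row per channel
    return flat @ flat.T / (h * w)      # channel-wise correlations

# Toy stand-in for a feature map: 3 channels of 4x4 activations
feats = np.random.rand(3, 4, 4)
g = gram_matrix(feats)
print(g.shape)  # (3, 3): one entry per pair of channels
```

Because the Gram matrix discards spatial positions and keeps only filter correlations, it describes texture rather than layout, which is exactly why it works as a style representation.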
The class of deep neural networks most powerful for image processing tasks is the Convolutional Neural Network (CNN). A CNN consists of a series of layers that act as image filters: each filter extracts a feature from the input image. Together, these layers form a model, or network, describing the transformations from the input image to the output features. The model used for this particular exercise was the VGG network, a popular model for object recognition tasks.
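To illustrate the “layers as image filters” idea, here is a hand-rolled 2-D convolution in Python with a fixed vertical-edge kernel. In a real CNN the kernels are learned during training; this particular kernel is our own toy example, not one taken from VGG:

```python
import numpy as np

def convolve2d(image, kernel):
    """Valid-mode 2-D convolution: slide the kernel over the image."""
    kh, kw = kernel.shape
    ih, iw = image.shape
    out = np.zeros((ih - kh + 1, iw - kw + 1))
    for i in range(out.shape[0]):
        for j in range(out.shape[1]):
            out[i, j] = np.sum(image[i:i + kh, j:j + kw] * kernel)
    return out

# A vertical-edge filter, one of the simplest "features" a layer can detect
edge_kernel = np.array([[1., 0., -1.],
                        [1., 0., -1.],
                        [1., 0., -1.]])

# Synthetic image: dark on the left half, bright on the right half
image = np.hstack([np.zeros((5, 5)), np.ones((5, 5))])
response = convolve2d(image, edge_kernel)
# The response is nonzero only in the columns straddling the edge
print(response)
```

A CNN stacks many such filters per layer, so early layers respond to edges and textures while deeper layers respond to larger structures.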
As mentioned earlier, we have two images: a style image, from which we extract features, and a content image, on which these features are applied.
Here’s the content image, a photograph of the Taj Mahal in Agra, India.
Here’s the style image, one of the famous Water Lilies paintings by the French impressionist Claude Monet. We shall extract features from this particular image, and then apply them to our content image.
When we train our model on these two images, learning the styles from the style image and applying them to the content image, we obtain the following output – the Taj Mahal drawn in the style of the lilies.
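Concretely, “training” here means optimizing the pixels of the output image to minimize a weighted sum of a content loss and a style loss, as in Gatys et al. The sketch below shows those loss terms, assuming the feature maps and Gram matrices have already been computed by the network; the weights `alpha` and `beta` are illustrative values, not the ones used in our experiment:

```python
import numpy as np

def content_loss(content_feats, generated_feats):
    """Squared error between feature maps of content and generated images."""
    return 0.5 * np.sum((generated_feats - content_feats) ** 2)

def style_loss(style_gram, generated_gram):
    """Squared error between Gram matrices of style and generated images."""
    return np.sum((generated_gram - style_gram) ** 2)

def total_loss(c_feats, g_feats, s_gram, g_gram, alpha=1.0, beta=1e3):
    # alpha weights content fidelity, beta weights style texture;
    # gradient descent on the generated image's pixels minimizes this sum
    return alpha * content_loss(c_feats, g_feats) \
         + beta * style_loss(s_gram, g_gram)
```

Raising `beta` relative to `alpha` pushes the result toward pure texture; raising `alpha` preserves more of the building’s structure.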
Notice that the structural features of the content image (or in other words, the borders of the building) have been preserved. We find that a kind of texture has been extracted from the style image and applied to the content image.
We can also load pre-trained models. In fact, this is what most mobile apps do. The picture we take on our camera is our content image. The app then takes our content image and performs a single forward pass on the model, “applying” a texture onto it.
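That single forward pass can be sketched as follows. The `stylize` function and the toy layer list are hypothetical stand-ins for a real pretrained transform network with learned weights:

```python
import numpy as np

def stylize(image, model):
    """One forward pass through a (hypothetical) pretrained transform network.

    `model` is simply a list of layer functions applied in order;
    a real app would load learned weights from disk instead.
    """
    out = image
    for layer in model:
        out = layer(out)
    return out

# Toy "model": normalize, apply a fixed nonlinearity, rescale back to 0..255
toy_model = [
    lambda x: x / 255.0,
    lambda x: np.tanh(2 * x),                 # stand-in for the learned layers
    lambda x: (x * 255.0).clip(0, 255),
]

photo = np.random.rand(64, 64, 3) * 255       # stand-in for a camera photo
styled = stylize(photo, toy_model)
print(styled.shape)  # same shape as the input: one pass, no optimization loop
```

The key point is speed: no iterative optimization happens on the phone, only a single pass through layers whose weights were trained ahead of time.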
We can see this with another content image. This is an artist’s impression of Kvothe, one of my favorite fictional characters.
We have two pretrained models that are meant to provide two kinds of textures: fire and frost. We pass the content image into our pretrained models and obtain the following results, which show the original image with two different styles applied.
The code for the above exercise can be found here. We observe fast training times, thanks in large part to optimized convolution kernels on the GPU. With the MXNet deep learning library, Julia makes it very easy to perform these operations on a GPU. The following chart shows that the benefits of using the GPU are substantial!
We performed this exercise on an IBM PowerNV 8335-GCA server, which has 160 CPU cores and a Tesla K80 (dual) GPU accelerator.
It would be nice to make stylized videos, such as this. This would involve passing every frame of the video through a pretrained model, thereby generating a new, stylized video. We will write about that work in subsequent blog posts.
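The frame-by-frame approach can be sketched like this. Here `stylize_frame` is a placeholder for a real pretrained model’s forward pass, and actual video decoding and encoding are omitted; the frames are toy NumPy arrays:

```python
import numpy as np

def stylize_frame(frame):
    """Stand-in for a forward pass through a pretrained style model."""
    return 255 - frame  # invert colors as a placeholder "style"

def stylize_video(frames):
    """Apply the (assumed) pretrained model to every frame independently."""
    return [stylize_frame(f) for f in frames]

# Three toy 4x4 grayscale frames in place of a decoded video
video = [np.full((4, 4), v, dtype=np.uint8) for v in (0, 128, 255)]
styled = stylize_video(video)
```

One caveat worth noting: stylizing frames independently can flicker between frames, which is why per-frame processing is a starting point rather than the whole story.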