The training phase requires a lot of input data. It learns how to apply the style of a picture you provide to a large database of images, roughly 20GB in size.
First, let's install, as root, some packages that will be used to create the database file. The training script is written for Python 2, so we need the Python 2 versions.
# dnf install hdf5-devel
# dnf install python2-h5py
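Before going further, it's worth confirming that the Python bindings actually load. A one-off check like this is enough (run it with python2; the file name is up to you):
# Quick sanity check that python2-h5py and the HDF5 libraries are usable
import h5py

print("h5py version: " + h5py.version.version)
print("HDF5 library version: " + h5py.version.hdf5_version)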
Then as your ai user:
$ luarocks install totem
$ luarocks install https://raw.githubusercontent.com/deepmind/torch-hdf5/master/hdf5-0-0.rockspec
Now, we need the training and validation images. The fast-neural-style page says that all training was done using the 2014 COCO dataset. The project's home page is located here:
http://cocodataset.org
Visit it if you want to read about their work. The files are large: about 18GB to download and 20GB once uncompressed.
On to our work...
$ cd ~/fast-neural-style
$ mkdir -p data/coco/images/
$ cd data/coco/images
$ wget http://images.cocodataset.org/zips/train2014.zip
$ unzip train2014.zip
$ wget http://images.cocodataset.org/zips/val2014.zip
$ unzip val2014.zip
$ cd ~/fast-neural-style
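Before building the dataset file, it's worth a quick sanity check that both zips unpacked completely. This throwaway Python snippet (any file name will do; the directory layout matches what we just created) simply counts the JPEGs. Expect roughly 83,000 images in train2014 and roughly 40,500 in val2014:
import os

# Count the JPEG files in each of the freshly unpacked COCO directories
base = os.path.expanduser("~/fast-neural-style/data/coco/images")
for name in ("train2014", "val2014"):
    path = os.path.join(base, name)
    count = len([f for f in os.listdir(path) if f.lower().endswith(".jpg")])
    print(name + ": " + str(count) + " images")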
The next step is to make an HDF5 file out of the training and validation images.
$ python2 scripts/make_style_dataset.py \
--output_file data/coco/dataset.h5
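If you're curious what make_style_dataset.py actually produced, h5py can list the datasets inside the file. This is only an exploratory sketch; the dataset names and shapes it prints are whatever the script happened to write, so treat the output as informational:
import h5py

f = h5py.File("data/coco/dataset.h5", "r")

def show(name, obj):
    # Print every dataset in the file along with its shape and element type
    if isinstance(obj, h5py.Dataset):
        print(name + ": shape " + str(obj.shape) + ", dtype " + str(obj.dtype))

f.visititems(show)
f.close()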
If you're tight on disk space, you no longer need the unzipped JPEG files, since they are all packed into the HDF5 file; you can delete them if you want. The next step is to get a neural net model that we can train.
$ bash models/download_vgg16.sh
One more thing: we need some art. In a previous post about neural style transfer, I pointed the reader to the WikiArt web site, which has thousands of pieces of art. For our purposes, we do not need anything high resolution; anything bigger than 256 pixels in either direction is fine. The art that seems to work best has a very strong style. If you pick something like Impression, Sunrise, the training picks up the colors but can't find the style because it's too subtle. For this post we will use the following picture:
You can read a little about it here:
https://www.wikiart.org/en/kazuo-nakamura/inner-view-3-1955
Now let's grab it:
$ mkdir art
$ cd art
$ wget http://use2-uploads2.wikiart.org/images/kazuo-nakamura/inner-view-3-1955.jpg
$ cd ..
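Since the style image should be bigger than 256 pixels in either direction, a quick dimension check doesn't hurt. This uses Pillow, which none of the steps above install, so consider it an optional extra:
from PIL import Image

# Check that the downloaded style image is big enough for training
img = Image.open("art/inner-view-3-1955.jpg")
width, height = img.size
print("Style image is " + str(width) + "x" + str(height))
if width <= 256 or height <= 256:
    print("Warning: this image may be too small for training")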
Now we are ready for the main event.
$ th train.lua \
-style_image_size 384 \
-content_weights 1.0 \
-style_weights 5.0 \
-checkpoint_name checkpoint \
-gpu 0 \
-h5_file data/coco/dataset.h5 \
-style_image art/inner-view-3-1955.jpg
And...wait...for...it. This...will...take...some...time.
I have a beefy system with a GTX 1080 Ti. I used the time command to see how long the training took. Here are the results:
real 116m21.518s
user 140m40.911s
sys 41m21.145s
The nvidia-smi program said that this consumed 5.3 GiB of video memory. If you have a 4 GiB video card, it might run significantly slower. The 1080 Ti has about 3,600 CUDA cores, and nvidia-smi reported 99% GPU utilization. You can estimate the total training time by timing the first 1,000 iterations and multiplying that by 40.
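To make that rule of thumb concrete: time the first 1,000 iterations yourself and plug the number in below. The 40,000-iteration figure is my assumption about the full run length, which is where the multiply-by-40 factor comes from:
# Rough training-time estimate from the first 1,000 iterations
minutes_per_1000 = 3.0           # replace with your own measurement
total_iterations = 40000         # assumed full run length (hence "multiply by 40")
estimate = minutes_per_1000 * (total_iterations / 1000.0)
print("Estimated total training time: about " + str(estimate) + " minutes")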
OK. Let's take our new style for a spin.
$ mkdir models/my-styles
$ mv checkpoint.t7 models/my-styles/inner-view-3-1955.t7
$ rm checkpoint.json
$ qlua webcam_demo.lua -gpu 0 -models models/my-styles/inner-view-3-1955.t7
Pretty cool, eh?
One other tip: if you pass a height and width that are too big for your webcam, you will get a warning message and the program will crash. The warning tells you the maximum resolution your camera supports. For example, I get:
Warning: camera resolution changed to 720x960
When I re-run with that resolution, the image looks much smoother and less pixelated.
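If you'd rather query the camera up front instead of relying on the crash message, the OpenCV Python bindings (version 3 or later; not required for anything else in this post) can report what the device actually delivers. Asking for an oversized resolution and reading back what the driver settled on works with most webcams, though it isn't guaranteed for every driver:
import cv2

cap = cv2.VideoCapture(0)                  # first webcam
# Request something oversized; most drivers clamp to the camera's maximum
cap.set(cv2.CAP_PROP_FRAME_WIDTH, 4096)
cap.set(cv2.CAP_PROP_FRAME_HEIGHT, 4096)
width = int(cap.get(cv2.CAP_PROP_FRAME_WIDTH))
height = int(cap.get(cv2.CAP_PROP_FRAME_HEIGHT))
print("Camera reports " + str(width) + "x" + str(height))
cap.release()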
Training your own model is not hard; it just takes time. The original paper mentions that NVIDIA supported the work by providing the latest hardware of the time, which would have been back around 2016. The paper said it took them about 4 hours to train a model. Have fun playing with this now that you have the recipe.