Chris wrote earlier this year that we’ve been experimenting with DCNNs (deep convolutional neural networks) for machine vision, using our fruit-picking robot as a toy application. It’s time for an update.
To start with, some explanation. Machine image recognition is based on ‘features’. A feature is any measurable property of an image, for example the hue at a given pixel, or the presence of a diagonal line. A feature is useful for image classification if knowing about its presence or absence helps you identify what is in a picture. Some face detection algorithms, for instance, look for two dark rectangles (eyes) above two light ones (cheeks).
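To make the idea concrete, here is a toy hand-crafted feature of the kind described: the mean hue of an image, computed from raw pixels with nothing but the standard library (the sample pixel values are made up for illustration).

```python
import colorsys

# A toy 'feature': the mean hue of a small RGB image, in [0, 1).
# 'pixels' is a list of (r, g, b) tuples with channel values in [0, 255].
def mean_hue(pixels):
    hues = []
    for r, g, b in pixels:
        h, _, _ = colorsys.rgb_to_hsv(r / 255.0, g / 255.0, b / 255.0)
        hues.append(h)
    return sum(hues) / len(hues)

# Two invented 2x2 'images': one reddish, one greenish.
reddish = [(200, 40, 30), (210, 35, 25), (190, 50, 40), (205, 45, 35)]
greenish = [(60, 180, 70), (55, 190, 65), (70, 175, 80), (65, 185, 75)]

print(mean_hue(reddish))   # near 0 (the red end of the hue circle)
print(mean_hue(greenish))  # near 1/3 (green)
```

A classifier fed this single number could already tell red fruit from green fruit, which is exactly the sense in which a feature ‘helps you identify what is in a picture’.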
DCNNs detect features too, but unlike older machine vision strategies, the features are not explicitly specified by engineers. A DCNN has thousands of free parameters. The space of different features it could compute is enormous and almost all of them are garbage. ‘Training’ a DCNN for image recognition means using algorithms to search this enormous space for useful features. DCNNs are built of many layers, with each layer using the output of previous layers. The first few layers detect simple features like lines and corners; later layers, after training, might detect complex features like eyes, or ‘banananess’.
Training a DCNN can be expensive, and not everyone has access to millions of labelled images and racks of servers with custom ASICs. Luckily, however, some of those who do have made their trained DCNNs freely available for everyone to play with. We can customise a pretrained DCNN as follows: we take a simple classification algorithm (a support vector machine, say) that might have been used with some standard feature set, such as Haar-like features. Then, instead of using Haar-like features, we use the activations of neurons from a pretrained DCNN. In other words, we combine old-fashioned classification algorithms with new-fangled feature sets. The amount of training data and compute power required for this customisation is minuscule compared to what was used to train the network from scratch.
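The customisation step can be sketched in a few lines of scikit-learn. In practice the feature vectors would be activations read from a late layer of the pretrained network (one forward pass per image); here they are synthetic stand-ins so the sketch runs without Caffe, and the cluster structure is invented for illustration.

```python
import numpy as np
from sklearn.svm import LinearSVC

rng = np.random.RandomState(0)
DIM = 256  # real nets give longer vectors, e.g. 4096 from AlexNet's fc7 layer

# Pretend each fruit class produces activations clustered around its own centre.
centre_apple = rng.normal(size=DIM)
centre_orange = rng.normal(size=DIM)
apples = centre_apple + 0.3 * rng.normal(size=(40, DIM))
oranges = centre_orange + 0.3 * rng.normal(size=(40, DIM))

X = np.vstack([apples, oranges])
y = np.array([0] * 40 + [1] * 40)  # 0 = apple, 1 = orange

# The 'customisation': fit an old-fashioned linear SVM on the new-fangled features.
clf = LinearSVC(C=1.0).fit(X, y)
print(clf.score(X, y))  # well-separated synthetic clusters: accuracy 1.0
```

The point of the sketch is the division of labour: the expensive part (learning the features) was done once by someone else; the cheap part (fitting a linear classifier on a few dozen examples) is all we do.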
We tried this first with networks from the Caffe model zoo. So, what have we found?
1. Installing Caffe may try your patience. It is harder if, as we do, you want to use it with the Robot Operating System (ROS), since the two have incompatible dependencies. We ended up running Caffe inside a Docker container, while ROS ran in the host operating system and talked to Caffe via ZeroMQ. It works, but it took a while to get there. It is theoretically possible to install Caffe on Windows, and I once tried to do it.
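The host-to-container bridge can be sketched with a ZeroMQ request/reply pair. The message format here (raw image bytes in, a JSON label out) is our own illustrative choice rather than any ROS or Caffe convention, and classify() is a hypothetical stand-in for the real forward pass inside the container.

```python
import json
import threading
import zmq

def classify(image_bytes):
    # Placeholder for the real Caffe forward pass in the container.
    return "granny smith"

def serve_one(endpoint):
    # REP socket: the 'Caffe side', answering one classification request.
    ctx = zmq.Context.instance()
    rep = ctx.socket(zmq.REP)
    rep.bind(endpoint)
    image_bytes = rep.recv()
    rep.send_string(json.dumps({"label": classify(image_bytes)}))
    rep.close()

endpoint = "tcp://127.0.0.1:5555"
threading.Thread(target=serve_one, args=(endpoint,), daemon=True).start()

# REQ socket: the 'ROS side', sending an image and waiting for a label.
ctx = zmq.Context.instance()
req = ctx.socket(zmq.REQ)
req.connect(endpoint)
req.send(b"fake-image-bytes")  # stand-in for an encoded camera frame
reply = json.loads(req.recv_string())
print(reply["label"])  # -> granny smith
req.close()
```

ZeroMQ's REQ/REP pattern suits this setup because the client can connect before the server has bound, which spares you from ordering the container and host start-up carefully.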
2. Caffe is amazing. The first day I got it, I searched the list of things one pretrained classifier was meant to distinguish, and it was missing the word ‘apple’. Being congenitally inclined to break stuff, I went and took some pictures of green apples in our canteen. Out it came with ‘granny smith’! Having regained our composure, we trained classifiers of our own on a few hundred images of fruit, using DCNN activations as features. We found it easy to get high classification accuracy with impressively little training data.
So far this is a story of excellent new technology, which is not yet user-friendly. But now we get to the interesting and hard problem…
3. The features DCNNs detect are hard to describe succinctly, or sketch, or see, or understand, though people do try. This matters, because it makes it hard to work out what training data you need. In real-world, messy image recognition, usually your training data cannot cover all the possibilities, and the amount of data you can get is limited: you may have limited patience for photographing fruit alone in a lab; or you may want to find a missing cat, with only a poster to go on (see picture). Our DCNN-based fruit classifier may work well on one set of images, and on a second set of images that we reserved for cross-validation, but we have little intuition about how it will behave when presented with an orange that is yellower, wartier, or bigger than any it has seen before. By contrast, if I knock up a classifier that just assesses the hue of each object, I know straight away that my training images should span the range of hues of each fruit type, I don’t need to worry about size or texture, and I know that yellowish oranges are likely to be mixed up with lemons.
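To show why the hue-only classifier is so much easier to reason about, here is one knocked up in a dozen lines. The reference hues are rough guesses of ours, not measured values, but the failure mode is visible in the code itself: anything with a lemon-ish hue will be called a lemon, whatever its size or texture.

```python
import colorsys

# A toy hue-only classifier: each fruit type gets one reference hue in [0, 1)
# (values are rough guesses), and an object is labelled with the nearest one.
REFERENCE_HUES = {"orange": 0.08, "lemon": 0.15, "lime": 0.33}

def hue_distance(a, b):
    d = abs(a - b) % 1.0
    return min(d, 1.0 - d)  # hue is circular, so wrap around

def classify_by_hue(r, g, b):
    h, _, _ = colorsys.rgb_to_hsv(r / 255.0, g / 255.0, b / 255.0)
    return min(REFERENCE_HUES, key=lambda k: hue_distance(h, REFERENCE_HUES[k]))

print(classify_by_hue(255, 140, 0))   # a typical orange colour -> 'orange'
print(classify_by_hue(255, 250, 60))  # a yellowish fruit -> 'lemon'
```

Every assumption is on the surface: the training images need only span each fruit's range of hues, and the yellowish-orange-versus-lemon confusion can be read straight off the reference table. Nothing in a 4096-dimensional DCNN feature vector is legible in this way.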
I hear the words ‘big data’ a lot, but what would really amaze, excite, or frighten me is computer vision working with little data. DCNNs get us closer to that dream, but we have some way to go.