ICLR is a relatively new conference that is primarily concerned with deep learning and learned representations. The conference is now in its third year and had over 300 attendees, two of whom were from Lyst. In this post we’ll discuss a few of the interesting papers and themes presented this year.
Simplifying network topology
One of the difficulties of employing deep convolutional networks is that their complicated topology is often hand-tuned for each new application. Choosing the number of convolutional and pooling layers, their stride sizes, non-linearities used, and the initialisation method to suit a particular task resembles the burden of traditional feature engineering.
Two of the papers presented at ICLR suggest that this might not be entirely necessary.
The first is Striving for Simplicity: The All Convolutional Net. The authors argue that pooling (subsampling) layers in a network can be replaced by convolution layers without loss of accuracy.
The argument goes roughly as follows. The purpose of pooling layers is to perform dimensionality reduction to widen subsequent convolutional layers’ receptive fields. For example, instead of detecting a feature in the top left corner, a pooling layer allows the same feature to be detected across the entire top part of the image. However, the same effect can be achieved by using a convolutional layer: using a stride of 2 also reduces the dimensionality of the output and widens the receptive field of higher layers.
The resulting operation differs from a max-pooling layer in that (1) it cannot perform a true max operation, and (2) it allows pooling across input channels. The authors argue that the structure of cross-channel connections should be treated as something that is learned rather than imposed. If the absence of cross-channel interactions is indeed beneficial, the network should be able to discover that structure.
The approach seems to work in practice, achieving competitive results on a number of tasks.
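To make the comparison concrete, here is a minimal single-channel sketch in plain Python. It is not the authors' code: the helper names `max_pool2d` and `conv2d`, the 6 x 6 input and the averaging kernel are all assumptions chosen for illustration. It only shows that a stride-2 convolution reduces the spatial size in the same way as a 2 x 2 max-pooling layer.

```python
def max_pool2d(image, size=2, stride=2):
    """Max-pool a 2D list of numbers with a square window (no padding)."""
    rows = (len(image) - size) // stride + 1
    cols = (len(image[0]) - size) // stride + 1
    return [
        [
            max(
                image[r * stride + i][c * stride + j]
                for i in range(size)
                for j in range(size)
            )
            for c in range(cols)
        ]
        for r in range(rows)
    ]


def conv2d(image, kernel, stride=1):
    """2D cross-correlation with a square kernel (no padding)."""
    size = len(kernel)
    rows = (len(image) - size) // stride + 1
    cols = (len(image[0]) - size) // stride + 1
    return [
        [
            sum(
                image[r * stride + i][c * stride + j] * kernel[i][j]
                for i in range(size)
                for j in range(size)
            )
            for c in range(cols)
        ]
        for r in range(rows)
    ]


# Hypothetical 6 x 6 input with values 0..35.
image = [[float(r * 6 + c) for c in range(6)] for r in range(6)]

# Hypothetical fixed kernel; in a real network these weights would be learned.
kernel = [[0.25, 0.25], [0.25, 0.25]]

pooled = max_pool2d(image, size=2, stride=2)   # 3 x 3 output
strided = conv2d(image, kernel, stride=2)      # also 3 x 3 output
```

Both outputs are 3 x 3, so later layers see the same widened receptive field. The values differ because the convolution computes a weighted sum rather than a maximum. A multi-channel version would also mix input channels, which this single-channel sketch does not show.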
The second is Very Deep Convolutional Networks for Large-Scale Image Recognition. The core idea here is that hand-tuning layer kernel sizes to achieve optimal receptive fields (say, \(5 \times 5\) or \(7 \times 7\)) can be replaced by simply stacking homogeneous \(3 \times 3\) layers. The same effect of widening the receptive field is then achieved by layer composition rather than by increasing the kernel size: three stacked \(3 \times 3\) layers have a \(7 \times 7\) receptive field. At the same time, the number of parameters is reduced: for the same number of channels, a \(7 \times 7\) layer has 49 weights per pair of input and output channels against 27 for three stacked \(3 \times 3\) layers, or 81% more. The authors report that the resulting models perform very well compared to other state-of-the-art architectures.
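The receptive field and parameter arithmetic can be checked with a short sketch. The function names and the channel count of 64 are assumptions for illustration. The sketch ignores biases and assumes stride-1 layers with the same number of input and output channels.

```python
def stacked_receptive_field(kernel_size, num_layers):
    """Receptive field of num_layers stacked stride-1 convolutions."""
    return 1 + num_layers * (kernel_size - 1)


def conv_weights(kernel_size, channels, num_layers=1):
    """Weight count (no biases) with `channels` input and output channels."""
    return num_layers * kernel_size * kernel_size * channels * channels


channels = 64  # hypothetical channel count; the ratio does not depend on it

single_7x7 = conv_weights(7, channels)                 # 49 * 64 * 64
stacked_3x3 = conv_weights(3, channels, num_layers=3)  # 27 * 64 * 64
extra_fraction = single_7x7 / stacked_3x3 - 1          # about 0.81

print(stacked_receptive_field(3, 2))  # 5
print(stacked_receptive_field(3, 3))  # 7
print(round(extra_fraction * 100))    # 81
```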
Gaussian word embeddings
In a Gaussian embedding model, each word is represented by a Gaussian distribution (a mean and a variance) rather than by a single point vector. This allows the model to express uncertainty: a word’s meaning may be applicable in many contexts (and so uncertain) or very specific to a given context. It also allows for asymmetry in word relationships. One interesting application of this is picking words that are specific to a given context rather than used across many contexts.
For example, according to the Gaussian embedding model both ‘sense’ and ‘joviality’ are in the neighbourhood of ‘feeling’. However, the variance of ‘joviality’ is smaller than the variance of ‘sense’, reflecting the fact that ‘joviality’ is very specific while ‘sense’ can be applied across many contexts. This gives rise to the further idea of entailment: since the mass of the probability distribution of ‘joviality’ lies inside ‘sense’, ‘joviality’ entails ‘sense’.
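One way to see the asymmetry is to compare two Gaussians with the KL divergence, which is not symmetric in its arguments. The sketch below uses diagonal covariances. The two-dimensional means and variances for ‘sense’ and ‘joviality’ are made-up numbers for illustration, not values from the paper, and the choice of KL divergence as the score is an assumption of this sketch.

```python
import math


def kl_divergence(p, q):
    """KL(p || q) for diagonal Gaussians, each given as (means, variances)."""
    p_means, p_vars = p
    q_means, q_vars = q
    total = 0.0
    for pm, pv, qm, qv in zip(p_means, p_vars, q_means, q_vars):
        total += pv / qv + (qm - pm) ** 2 / qv - 1.0 + math.log(qv / pv)
    return 0.5 * total


# Hypothetical embeddings: a broad 'sense' and a narrow 'joviality' nearby.
sense = ([0.0, 0.0], [4.0, 4.0])
joviality = ([0.5, 0.5], [0.25, 0.25])

specific_to_general = kl_divergence(joviality, sense)  # roughly 1.90
general_to_specific = kl_divergence(sense, joviality)  # roughly 13.23
```

The narrow distribution sits comfortably inside the broad one, so the divergence from ‘joviality’ to ‘sense’ is small, while the reverse direction is heavily penalised. A symmetric measure such as the distance between point vectors cannot express this.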
Detecting different senses of the same word could be an interesting extension to this model. For example, the word ‘bank’ could denote either a financial institution (‘investment bank’) or a natural feature (‘river bank’).
Currently, highly polysemic words will be represented by diffuse distributions; it would be interesting to be able to treat these as mixtures of specific meanings. For example, it might be possible to treat each token with high variance as a mixture of two Gaussians, and use an expectation-maximisation algorithm to discriminate between the different meanings of incoming tokens during model training. Applying the mixture splitting approach recursively might allow exhaustive identification of individual word meanings.
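As a rough illustration of the mixture idea, the sketch below fits two one-dimensional Gaussians with expectation-maximisation. Everything here is an assumption: the function names, the initialisation, and the toy data, which stands in for context positions of a word with two senses. It is not part of the Gaussian embedding model itself.

```python
import math


def gaussian_pdf(x, mean, var):
    """Density of a one-dimensional Gaussian."""
    return math.exp(-((x - mean) ** 2) / (2 * var)) / math.sqrt(2 * math.pi * var)


def fit_two_gaussians(points, iterations=50, min_var=1e-3):
    """Fit a two-component 1D Gaussian mixture with EM."""
    means = [min(points), max(points)]  # simple initialisation at the extremes
    variances = [1.0, 1.0]
    weights = [0.5, 0.5]
    for _ in range(iterations):
        # E-step: responsibility of each component for each point.
        resp = []
        for x in points:
            scores = [
                weights[k] * gaussian_pdf(x, means[k], variances[k])
                for k in range(2)
            ]
            total = sum(scores)
            resp.append([s / total for s in scores])
        # M-step: re-estimate each component from its weighted points.
        for k in range(2):
            mass = sum(r[k] for r in resp)
            means[k] = sum(r[k] * x for r, x in zip(resp, points)) / mass
            spread = sum(r[k] * (x - means[k]) ** 2 for r, x in zip(resp, points))
            variances[k] = max(min_var, spread / mass)  # floor avoids collapse
            weights[k] = mass / len(points)
    return means, variances, weights


# Hypothetical 1D context positions for a word with two senses.
contexts = [-2.1, -1.9, -2.0, -1.8, -2.2, 2.0, 1.9, 2.1, 2.2, 1.8]
means, variances, weights = fit_two_gaussians(contexts)
```

On this toy data the two components settle near -2 and 2 with much smaller variances than a single Gaussian fitted to all points would have, which is the effect the splitting approach relies on.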
Adversarial training examples
Goodfellow et al. presented some work on explaining and harnessing adversarial training examples. Adversarial examples are inputs that a model misclassifies even though they differ only slightly from correctly classified examples.
Szegedy et al. found that tiny changes to an image (average distortion of 0.006508) resulted in consistent misclassification. The following images in the right column were all misclassified as ‘ostrich’ after systematic perturbations, despite appearing almost identical to the originals to a human observer.
A related concept is that of so-called rubbish examples. A rubbish example is one that a human would reject as belonging to a given class but that a model would assign to a class with high confidence.
Nguyen et al. trained a convolutional neural network (CNN) on ImageNet and then used an evolutionary algorithm to generate synthetic images. The fitness function of the evolution is the confidence with which the CNN assigns the synthetic image to a particular ImageNet class.
The images below show the result of this evolutionary process for a selection of classes. Each image can be considered the prototypical representation of what the CNN uses to predict a given class.
In their ICLR paper, Goodfellow et al. continued this work and found that adversarial examples exploit the linear behaviour of activation functions and the fact that, with high-dimensional input, we can make many infinitesimal perturbations, one along each dimension, which together result in a big perturbation in the final output.
The authors provide a method for constructing adversarial examples which can be used for training to provide regularisation beyond techniques such as dropout.
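The linearity argument can be sketched with a toy linear model and a perturbation in the style of the fast gradient sign method, where every input dimension moves by a small epsilon in the direction of the sign of the loss gradient. The weights, epsilon and helper names below are assumptions for illustration. The sketch assumes a positive example under a logistic loss, for which the gradient with respect to the input is a negative multiple of the weights.

```python
def sign(value):
    """Return -1, 0 or 1 according to the sign of value."""
    return (value > 0) - (value < 0)


def fast_gradient_sign(x, grad, epsilon):
    """Shift each input dimension by epsilon in the direction that raises the loss."""
    return [xi + epsilon * sign(gi) for xi, gi in zip(x, grad)]


def linear_score(weights, x):
    """Pre-activation of a linear model."""
    return sum(w * xi for w, xi in zip(weights, x))


def score_shift(dimensions, epsilon=0.01):
    """Change in the linear score caused by an epsilon-sized perturbation."""
    # Hypothetical weights of magnitude 0.5 with alternating signs.
    weights = [0.5 if i % 2 == 0 else -0.5 for i in range(dimensions)]
    x = [0.0] * dimensions
    # Assumed setting: positive example, logistic loss, so the loss
    # gradient with respect to x points along -weights.
    grad = [-w for w in weights]
    x_adv = fast_gradient_sign(x, grad, epsilon)
    return linear_score(weights, x_adv) - linear_score(weights, x)


small = score_shift(10)     # about -0.05
large = score_shift(1000)   # about -5.0
```

No single input value changes by more than 0.01 in either case, yet the shift in the score grows in proportion to the number of dimensions. That is the sense in which many tiny perturbations add up to a large change in the output.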
Synthetic data
Synthetic data has a rich history in machine learning. Two of the most prominent examples are Minsky’s use of the XOR problem and the circle, spiral and moon style datasets for clustering.
Facebook’s research lab presented a set of synthetic generative tests with the aim of developing AI systems capable of general natural language reasoning. Leon Bottou defines machine reasoning, in contrast to machine learning, as:
“manipulating previously acquired knowledge in order to answer a new question”
The authors advocate a move away from simple models trained on a lot of data in order to promote the development of new methodologies on harder datasets.
The proposed reasoning tests each have a very specific goal, somewhat like unit tests for code. For example, the following tests a system’s ability to understand coreference.
Daniel was in the kitchen. Then he went to the studio. Sandra was in the office. Where is Daniel?
The following, more complex test assesses a system’s ability to reason about time.
In the afternoon Julie went to the park. Yesterday Julie was at school. Julie went to the cinema this evening. Where did Julie go after the park?
The power of these artificial tests is that when a system fails a test, we know exactly what form of reasoning it cannot perform. This contrasts with the standard ‘aggregation’ evaluation methods such as accuracy.
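The unit-test analogy can be illustrated with a small per-task evaluation. This is a sketch, not the format of the actual test suite: the task names, the second story and the deliberately naive `naive_answer` baseline are all hypothetical.

```python
def naive_answer(story, person):
    """Hypothetical baseline: last word of the last sentence naming the person."""
    answer = None
    for sentence in story.split("."):
        words = sentence.split()
        if person in words:
            answer = words[-1]
    return answer


# Each example is (story, person asked about, expected location).
TASKS = {
    "single_fact": [
        ("Sandra was in the office. Daniel went to the garden.", "Daniel", "garden"),
    ],
    "coreference": [
        (
            "Daniel was in the kitchen. Then he went to the studio. "
            "Sandra was in the office.",
            "Daniel",
            "studio",
        ),
    ],
}


def accuracy_by_task(tasks, answer_fn):
    """Accuracy of answer_fn on each task separately."""
    return {
        name: sum(
            answer_fn(story, person) == expected
            for story, person, expected in examples
        ) / len(examples)
        for name, examples in tasks.items()
    }


per_task = accuracy_by_task(TASKS, naive_answer)
overall = sum(per_task.values()) / len(per_task)
```

The aggregate score of 0.5 says little on its own. The per-task breakdown shows that the baseline handles a directly stated fact but cannot resolve ‘he’ back to Daniel, so it answers ‘kitchen’ instead of ‘studio’.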
Jaderberg et al. also use synthetic data but in a different manner. They built an artificial dataset of noisy images containing distorted text. They then trained a model on this artificial dataset, using as labels the original text used to generate the images.
They then evaluated their model on real-world images of text and found that the artificial training data works well as a cheap proxy for real-world data.
Conclusion
The conference was a good mix of interesting papers and stimulating conversations. We also liked the call for submissions format: all submissions have to be posted on arXiv to be reviewed. Not only does this allow early discussion and dissemination of results, but it also makes all papers easily accessible, circumventing the issue of ridiculous academic publishing paywalls.
If you are interested in deep learning, we recommend the draft of the Deep Learning book by Bengio et al.