Originally published 21/08/2017 as part of the Bulletproof Technical Blog
In this blog and the next I’m covering some of the history and basics of Deep Learning. Earlier this year I went to the International Conference on Machine Learning here in Sydney and it occurred to me that without knowing about the past it’s hard to appreciate developments in the present.
Deep Learning is part of Machine Learning, which is itself a sub-field of Artificial Intelligence. It's generally accepted that there are three broad categories of Machine Learning – Supervised, Unsupervised and Reinforcement (the boundaries are fuzzy, other categories exist and definitions vary, but we'll stick with these for now).
In Unsupervised Learning we make no assumptions about the data we have, the processes that generated it, or any underlying structure or constraints. We effectively hand the data to well-known algorithms and ask the learning system to "find interesting patterns". Canonical examples include supermarket basket analysis and generating customer segmentations via clustering.
A common use case for Reinforcement Learning is learning optimal control policies in robotics. Here the robot (or agent) has sensors (light, temperature, magnetism, orientation etc) with which it can detect the state of its environment, and effectors (movement actuators, drills, collectors etc) it can use to take actions. The agent learns via feedback in the form of delayed rewards – either directly from its environment or from a human teacher – and in the process learns how to perform some task.
These are both active areas of Machine Learning research but as the focus of these articles is Deep Learning, which falls under the branch known as Supervised Learning, we will leave these aside for the present.
In Supervised Learning the algorithm uses a dataset that contains the questions (a set of attributes) as well as the answers (called the response or target variable). The learning process is therefore "supervised": it attempts to reproduce the supplied answers as accurately as possible without simply memorising them, in the hope that the model generalises – that we learn something about the data itself and can use the past to predict the future.
For example if we had a dataset containing past advertising budgets for various media (TV, Radio and Newspapers) as well as the resulting Sales figures we could train a model to use this information to predict expected Sales figures under various future advertising scenarios.
In the process, we would also learn things about the data itself – what effect these "predictors" have (if any) on the response variable (in this case Sales). Note that here the target variable is numeric. In many datasets it is categorical (e.g. Small, Medium, Large), in which case we call the process Classification rather than Prediction.
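As a minimal sketch of this prediction setting, the snippet below fits a simple linear model relating TV advertising spend to Sales by ordinary least squares. The figures are entirely made up for illustration, and a real analysis would use all three media as predictors.

```python
# Fit sales ≈ a + b * tv_budget by simple ordinary least squares.
# The budget and sales figures below are hypothetical.

def fit_simple_ols(x, y):
    """Return intercept a and slope b minimising squared error."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    b = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y)) / \
        sum((xi - mean_x) ** 2 for xi in x)
    a = mean_y - b * mean_x
    return a, b

tv_budget = [10.0, 20.0, 30.0, 40.0]   # hypothetical past TV spend
sales     = [25.0, 45.0, 65.0, 85.0]   # hypothetical resulting sales

a, b = fit_simple_ols(tv_budget, sales)
predicted = a + b * 50.0               # predicted sales for a new budget
```

The fitted slope `b` is exactly the kind of "what effect does this predictor have" information mentioned above, alongside the model's ability to predict future scenarios.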
Much of Machine Learning theory centres on data preparation, sampling techniques and algorithm tuning, as well as best practices for training to ensure good generalisation and statistical validity of results. In addition there are ensemble methods, in which many models are combined in a kind of committee structure that often outperforms stand-alone models.
Artificial Neural Networks
ANNs arose from early biologically inspired research in the 1950s and 60s into the structure of the human brain. The human brain has about 100 billion neurons, each connected on average to about 10,000 other neurons (so a fairly dense network) via links called synapses.
Each neuron receives signals from its neighbours and combines them together. If the combined signal is below a certain threshold, nothing happens. However, if it is strong enough, the neuron "fires" – sending its own signal to the next set of neurons downstream so the process can continue.
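This "combine, then fire or stay silent" behaviour can be sketched in a few lines. The weights and threshold below are arbitrary illustrative values, not anything from a real network.

```python
# A single artificial neuron: sum the weighted inputs and "fire"
# (output 1) only if the combined signal exceeds the threshold.

def neuron(inputs, weights, threshold):
    total = sum(i * w for i, w in zip(inputs, weights))
    return 1 if total > threshold else 0

# With this wiring, both inputs must be active for the neuron to fire
# (it happens to implement logical AND).
print(neuron([1, 1], [0.6, 0.6], 1.0))  # fires: 1.2 > 1.0, prints 1
print(neuron([1, 0], [0.6, 0.6], 1.0))  # silent: 0.6 <= 1.0, prints 0
```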
Synapses that fire often become stronger and those that don’t attenuate. This is the basis of the Hebbian Theory of human learning that the neural networks in our brains use to acquire, store and represent knowledge and experiences. The idea was to get computers to simulate this process to build a new kind of machine learning approach – Artificial Neural Networks.
A basic ANN consists of a number of layers of these artificial neurons where each layer is connected to the next layer in sequence. Attached to each connection is a small number representing the strength of that connection – these numbers are known as weights.
As each data sample is presented to the network, the weights are adjusted in such a way as to best reproduce the desired output. In this way the "knowledge" learned is encoded in the set of weights in the network.
However, no-one knew how to train such networks effectively, and it wasn't until the 1980s that the Backpropagation algorithm was popularised and progress could continue.
To train an ANN, each sample in the dataset is fed into the network where the data attributes are combined with the current weights and fed-forward through the network until an approximation of the correct target value is produced at the output layer.
Now since we know the actual target value (remember, it's Supervised Learning we're doing here), we can calculate how far off the network's approximation is for that sample (i.e. the error). This error is then propagated backwards through the network, making suitable weight adjustments so that the error is reduced – and it is from this backwards propagation of errors that the algorithm derives its name.
When this has been done for every sample in the dataset (referred to as a training epoch), the process repeats. Generally, ANNs are trained over many (often thousands of) epochs until some stopping criterion is met.
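The loop described above can be sketched as a toy implementation. This assumes a tiny 2-2-1 network with sigmoid activations, hand-picked starting weights, the simple OR function as training data and an illustrative learning rate – all choices made for clarity, not a serious implementation.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

# Hand-picked starting weights; the last entry in each row is a bias.
w_hidden = [[0.2, -0.4, 0.1], [-0.3, 0.5, -0.2]]  # 2 hidden neurons
w_out = [0.4, -0.1, 0.2]                          # 1 output neuron

def forward(x):
    """Feed a sample forward through the network."""
    h = [sigmoid(w[0] * x[0] + w[1] * x[1] + w[2]) for w in w_hidden]
    o = sigmoid(w_out[0] * h[0] + w_out[1] * h[1] + w_out[2])
    return h, o

def train_epoch(samples, lr=0.5):
    """One pass over the dataset: forward, measure error, backpropagate."""
    total_error = 0.0
    for x, target in samples:
        h, o = forward(x)                      # feed-forward pass
        err = target - o
        total_error += err * err
        d_o = err * o * (1 - o)                # output-layer delta
        d_h = [d_o * w_out[j] * h[j] * (1 - h[j]) for j in range(2)]
        for j in range(2):                     # adjust output weights
            w_out[j] += lr * d_o * h[j]
        w_out[2] += lr * d_o                   # ... and the output bias
        for j in range(2):                     # adjust hidden weights
            for i in range(2):
                w_hidden[j][i] += lr * d_h[j] * x[i]
            w_hidden[j][2] += lr * d_h[j]      # ... and the hidden biases
    return total_error

or_data = [((0, 0), 0), ((0, 1), 1), ((1, 0), 1), ((1, 1), 1)]
errors = [train_epoch(or_data) for _ in range(5000)]  # 5000 epochs
```

After training, the per-epoch error in `errors` has shrunk dramatically from its starting value – the stopping criterion mentioned above would typically watch exactly this quantity.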
You can think of the weight adjustment process as moving down an error surface looking for a minimum value. At each step you adjust the weights so as to move closer to that minimum, and to get there as quickly as possible you move in the direction of steepest descent. This is known as Gradient Descent. (In reality the error surface of a neural network is not convex, so in practice this process finds a good local minimum rather than a guaranteed global one.)
In practice, training isn't done in quite this manner. Computing the exact gradient over the entire dataset for every update is expensive, so weight adjustments are instead made after each individual sample or small random batch. This results in a noisier, more random path towards the minimum and is referred to as Stochastic Gradient Descent (or mini-batch Gradient Descent when batches are used).
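A minimal sketch of the idea, assuming a single weight to learn and made-up data following y = 2x: each update uses the gradient from a small random batch rather than the whole dataset.

```python
import random

random.seed(42)
data = [(x, 2.0 * x) for x in range(1, 21)]  # samples of y = 2x

w = 0.0       # single weight to learn; the true value is 2.0
lr = 0.001    # illustrative learning rate
for step in range(1000):
    batch = random.sample(data, 4)           # small random mini-batch
    # gradient of the mean squared error 0.5*(w*x - y)^2 w.r.t. w,
    # averaged over the batch only
    grad = sum((w * x - y) * x for x, y in batch) / len(batch)
    w -= lr * grad                           # noisy step downhill
```

Because each step sees only four samples, the path of `w` wanders compared with full-batch Gradient Descent, but it still settles very close to the true value of 2.0 at a fraction of the per-step cost.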
A Deep Winter
So now ANNs could be trained using Backpropagation, and early results were promising. Backed by some solid theory, early image classifiers were developed, and ANNs were found to excel at function approximation – especially for highly nonlinear functions lacking closed-form solutions.
However overall, ANNs failed to realise much of the industry hype at the time. Only relatively simple network structures could actually be reliably trained and there just wasn’t the compute power available to work out why. As a consequence, ANNs fell out of favour and research languished.
It would not be until the early 2000s that the birth of the cloud created a springboard that would catapult Artificial Neural Network research out of its winter and into the realm of Deep Learning.
Categories: Deep Learning