A Shallow Introduction to Deep Learning
by Rafael Espericueta

Traditional AI vs Deep Learning

Deep learning is one form of machine learning, which is in turn part of the field of artificial intelligence. Basically, deep learning refers to artificial neural networks, and what makes them deep is the presence of more than two layers: an input layer, one or more so-called hidden layers, and an output layer.

Traditional AI required a team of human experts as well as a team of expert programmers to create a program that was essentially hard-wired to deal with every conceivable option. Such systems have had success in many realms, but they are limited to the logic programmed into them. They tend to be brittle and buggy, and they don't adjust well to minor changes in their inputs. Machine learning in traditional AI works thanks to the programmer's cleverness in selecting features in the data that can be used by various learning algorithms.

The newer deep learning approach requires a far smaller team. One creates a neural net architecture capable of learning to do the desired task, and then has it learn how to do so, given lots of labeled examples (supervised learning). A more subtle version of this, where the feedback isn't so immediate (reinforcement learning), is used in cases where one doesn't have sufficient labeled data. You tell the system what output you want, given the input, and it figures out how best to accomplish that task. The features needed for the learning to take place are automatically selected in the early layers of the neural network, rather than needing to be hand-crafted by a clever monkey as in traditional AI. Any solution to the problem of general intelligence will require such an ability.

One great accomplishment of traditional AI was IBM's grandmaster-defeating chess program of the '90s, Deep Blue.
Many years of effort on the part of programmers and chess masters alike were required to program in all the "if this is true, then that; else if this, then that", ad absurdum. This brute-force approach, requiring teams of experts and programmers working in tandem, has not been a successful strategy when applied to more difficult problems, like vision and the other unsolved problems in AI.

One of these more challenging problems has been to create a go-playing program that can defeat top human players. Go is an ancient game of strategy originating in China thousands of years ago. In Trevanian's best-selling novel Shibumi, it was remarked that go is to chess as poetry is to double-entry accounting. In any case, despite the surprising simplicity of the game's rules, go has a game tree with a vastly higher branching factor than chess; the number of possible go games exceeds the number of possible chess games by an astronomically large factor. For decades this has been a holy grail of AI, and most experts in the field didn't believe we would attain the goal for at least another decade. A legendary $1,000,000 prize was offered for the first computer program that could defeat a human professional go player; that prize went unclaimed by its expiration year of 2000. Nonetheless, a $1,000,000 prize was finally claimed by the Google DeepMind team in 2016, for the success of their program AlphaGo. In a match watched by millions around the world, AlphaGo defeated the 9-dan go master Lee Sedol in 4 out of 5 games in a televised and Internet-streamed match. Almost all had expected the machine to lose to this top-ranked go master, as no other go-playing program had come close to the level of a professional human player.
As great an accomplishment as this was for the DeepMind team, more significant than the result itself was the way it was accomplished: using deep learning. Rather than the tedious traditional AI method described above (which was so successful in the case of chess), what the DeepMind team did was to create a deep neural network, with random initial weights, that was capable of learning how to play a strong game of go. It was trained on about 100,000 strong human amateur games, learning to predict where a human would play; then it played itself many millions of games to hone its skills. Interestingly, fairly early in its training it could already easily defeat all those who had programmed it. The same kind of neural network is capable of learning many other tasks besides playing go. The AlphaGo project took far fewer man-hours than did the Deep Blue (chess) project, and AlphaGo attained its mastery far faster than any human has ever attained a comparable skill level. Notably, AlphaGo has improved significantly since its landmark victory.

After AlphaGo's historic victory, many in the AI world began to wonder how many other hitherto unsolvable problems would yield to the power of deep learning. Indeed, the deep learning principles used in AlphaGo are generally applicable, and have already helped cross many problems off AI's unsolved-problems list. The rapidity with which this is happening is notable. Google realized the importance of DeepMind's work back in 2014, purchasing the company for about half a billion dollars. More recently, Google open-sourced its own internal deep learning framework, TensorFlow, now (by far) the most popular deep learning platform. Putting this tool into the hands of thousands of researchers and knowledge engineers around the world seemed a better strategy than trying to do it all internally. There are so many conceivable applications that the more people exploring the possibilities, the better.
And those who come up with an interesting application of deep learning may find their resulting start-up facing a buy-out offer from Google, along with lucrative job offers.

Another of deep learning's recent accomplishments was in the field of computational vision. A deep neural network attained slightly better than human performance at recognizing the 1000 object categories in the ImageNet dataset. The deep learning approach performed better than both humans and previous AI attempts that used more traditional techniques. Advances in computer vision are leading to a plethora of advances in applications of AI in diverse areas, including autonomous cars, drones, and, more generally, robotics. In addition to the ImageNet example (and many other advances in computer vision), there have been comparable advances in other subfields of AI, including speech recognition and synthesis. Even automatic language translation has made huge advances using deep learning. Applications abound in the medical field, and promise to revolutionize the practice of medicine. Recently Google's DeepMind used deep learning to cut the energy used for cooling its large data centers by 40%, and the company is now negotiating to apply this technology to the entire electrical grid of Great Britain. This one breakthrough alone holds great promise to significantly improve the efficiency of all the world's power grids. One begins to wonder if there's any area where deep learning techniques can't be fruitfully applied.

The History of Artificial Neural Networks

Artificial neural networks have been with us for about as long as digital computers. Many of the early pioneers of computer science were interested in this idea, since it is so suggestive of the way biological brains work. After all, our brains form a sort of existence proof that artificial neural networks might lead to a system capable of intelligent perception and cognition.
Despite researchers' early interest in neural networks, it is only recently that we've developed the techniques needed to make deep learning work. The main reason for this long delay concerns Moore's Law: computers have become roughly 1,000 times faster with each passing decade, and we simply had to wait until sufficient computing power was available. Once that tipping point was reached, the engineering of such networks underwent a rapid evolution. Thanks to computer gamers, GPUs (graphics processing units) were created, each containing thousands of compute cores. These allow certain computations, such as those needed for graphics processing, to be performed in parallel, and thus thousands of times faster than is possible on conventional CPUs. It turns out that GPUs can also be used to implement neural networks, and their general availability and low cost helped provide the computing power needed for successful neural network implementations.

In the 1990s, AI researchers believed that neural networks weren't practical (which was pretty much true, given what passed for computers back then), and as a result researchers in the neural network field had great difficulty publishing papers at all. The advances mentioned above, along with many others through the years, have now turned the tide; indeed, it's becoming difficult to obtain funding for AI research that doesn't involve deep learning.

GoogLeNet

The yearly ImageNet competition challenges programs to automatically identify 1000 objects in images. In earlier competitions, only tiny improvements over the previous year's winning entry were sufficient to win, but in 2014 Google's entry used deep learning to defeat all its rivals by a healthy margin. Their winning entry was a neural network architecture called GoogLeNet.

Figure 1: Schematic of the GoogLeNet artificial neural network

The diagram in Figure 1 is actually a simplification of the actual neural net.
Many of the rectangles in the diagram represent large collections of parallel node layers. This network is capable of discerning a thousand different common objects that may appear in an image, for example flowers. There is one particular layer within the GoogLeNet network that is maximally excited whenever it sees flowers. When any image is input to the network, this flower-detecting layer tries to see flowers in the image. If one outputs that layer's activations, one can see where the network was beginning to hallucinate flowers in the input image. I found that by feeding the image with the beginnings of flower hallucinations back into the network's input, and again outputting the flower detector's results, the hallucinations became more vivid. After about five such feedback loops, the hallucinations become quite vivid, and then there is little further change. In Figure 2 you can see a photo of my wife Julie (off the coast of New Zealand), along with 5 iterates of flower hallucinations.
Figure 2a: Original picture
Figure 2b: The hallucinations begin!
Figure 2c: Hallucinations deepen...
Figure 2d: And deepen...
Figure 2e: The changes become less noticeable.
Figure 2f: Further iterations change very little.

I created over a hundred animations of sequences such as the above (the above sequence, animated, is here), exploring the various inception layers of GoogLeNet. Not all these layers are as recognizable as the flower detector. Generally it's not individual layers that detect anything, but combinations of these layers; using these layers as inputs, subsequent layers are able to accomplish their object-recognition tasks. To see more of these animations, click on a thumbnail below (excepting the first):
Notice how the above thumbnail images, each the result of 5 hallucination iterations, nonetheless resemble the original image when viewed as thumbnails (if you squint!), which is a bit surprising. This shows that much information from the original image is preserved in each of the hallucinated versions of it.

The hallucinatory inception layers of GoogLeNet can be used for many other purposes besides the recognition of objects in images. If the last layers of the network are discarded, the earlier layers can serve as a starting point for other AI tasks. The vast amount of time Google spent training GoogLeNet on millions of images can thus be leveraged to solve more specialized tasks, for example recognizing the face of a particular person. One interesting application is termed style transfer. A neural network, grafted onto the end of GoogLeNet (minus its later layers), allows one to train the network to recognize the style of a particular artist. And what a network can recognize, it can also hallucinate. So with style transfer, one may input a photo and get an output that resembles a particular artist's rendition of that photo.

Deep learning has achieved comparable successes in the auditory realm as well as the visual, with the recognition of spoken speech using recurrent neural networks. It's now possible to automate the captioning of video at close to human-level performance. Music generation has also recently achieved surprising successes via deep learning.

Doing the Math

Consider the simple neural network depicted in Figure 3. We're going to slowly walk through this example to introduce the basic concepts.

Figure 3: A simple deep neural network.

To see what this neural network does, suppose the input values are $x_1$ and $x_2$. The blue paths connecting the circular nodes (the neurons) have numeric weights $w_{ij}$, which are applied to the input values as follows:

$z_1 = w_{11} x_1 + w_{12} x_2, \qquad z_2 = w_{21} x_1 + w_{22} x_2$

To get the values $h_1$ and $h_2$ of the hidden layer nodes, we need to put the above results through a simple nonlinear filter.
For this example, we'll use the so-called sigmoid function (there are other possible nonlinear functions we could use here as well):

$\sigma(t) = \dfrac{1}{1 + e^{-t}}$

Writing $z_1$ and $z_2$ for the two weighted sums computed above, the hidden layer neurons take on the values

$h_1 = \sigma(z_1), \qquad h_2 = \sigma(z_2)$

We form a weighted sum again to obtain the output value. The output node often also has a nonlinear function applied, but for this example we'll just output the weighted sum directly:

$y = v_1 h_1 + v_2 h_2$
The above process can be written more succinctly using matrix notation. If $\mathbf{x} = (x_1, x_2)^T$ is the vector of inputs and $W = (w_{ij})$ is the matrix of input-to-hidden weights, then $\mathbf{h} = \sigma(W\mathbf{x})$, applying $\sigma$ element-wise. Similarly, with $\mathbf{v} = (v_1, v_2)$ the hidden-to-output weights, we have $y = \mathbf{v} \cdot \mathbf{h}$. This process is called feed forward, and it is how the trained network makes its predictions.

Next we examine the learning part of the process. What exactly constitutes learning for such a network? The behavior of such a network on given inputs is entirely determined by the weights along the paths, which can be gathered into the weight matrices above. These weights are ordinarily initialized with small random values; for our network, they were picked arbitrarily. To obtain a network that's useful, the weights need to be learned rather than being given a priori.

In supervised learning we learn the weights using labeled training data. The training data consists of input pairs $(x_1, x_2)$, along with corresponding labels: the output values observed (or desired) for those inputs. In the above example, if the label for our input were in fact 0.128196834 (the output value we computed above), then our network would have correctly computed this value, and the weights would be right for this input/label pair. If the label were something else, then the weights would need to be modified in such a way that the output would be closer to the label for that input.

The actual learning takes place via a process called back-propagation, which allows us to propagate the observed error back through the network, adjusting all the weights in such a way that the network, given that input again, would compute a value closer to the label. In this way, given a number of labeled inputs, the network can iteratively modify its weights, learning the function underlying the input/output pairs. Memorizing the input/output pairs isn't the point, though; we want the network to make reasonable predictions for inputs it hasn't seen. We want our neural net to generalize from its training data set to predict the output for data it hasn't encountered.
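The feed-forward pass just described takes only a few lines of code. Here is a minimal NumPy sketch of a two-input, two-hidden-node, one-output network like the one in Figure 3; the particular weight and input values below are illustrative stand-ins, not the article's (which were arbitrary in any case):

```python
import numpy as np

def sigmoid(z):
    # The logistic sigmoid: squashes any real number into (0, 1).
    return 1.0 / (1.0 + np.exp(-z))

def feed_forward(x, W, v):
    # Hidden layer: weighted sums of the inputs, passed through the sigmoid.
    h = sigmoid(W @ x)
    # Output node: a plain weighted sum, with no nonlinearity (as in the text).
    return v @ h, h

# Illustrative weights and inputs (stand-ins for the article's values).
W = np.array([[0.8, 0.2],
              [0.4, 0.9]])   # input -> hidden weight matrix
v = np.array([0.3, 0.5])     # hidden -> output weights
x = np.array([1.0, 2.0])     # input values x1, x2

y, h = feed_forward(x, W, v)  # y is the network's prediction
```

Note that `W @ x` computes both hidden-node weighted sums at once, which is exactly the matrix form of the computation described above.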
The back-propagation process amounts to minimizing a multivariate error function by moving the weights in small steps in the direction of the error function's negative gradient. A function's gradient points in the direction of the function's maximum increase; since we seek the function's minimum, we need to move in the direction of its maximum decrease, the negative of its gradient.

Back-propagation starts with a training example. Suppose the initial training example were
the input above, with the label 3.0 and all the weights as above. Again, this means that we want our network to output 3.0 for that input. We've already calculated our network's output (with the given network weights) for this input to be 0.128196834. The error at the output node is then computed:

Error $= 3.0 - 0.128196834 = 2.871803166$

and we need to propagate this error backwards through the successive layers of the network, using the same weights as we used for the forward propagation process. A small learning-rate multiplier $\eta$ is used so that we take only a small step in the right direction (without it, one might well overshoot the optimal solution). Often this learning rate is decayed as the learning process proceeds, to help the iterates converge, but for this example we'll keep it constant. We need to use the chain rule (from calculus) to correctly compute the gradient of the composition of matrix products and our nonlinear sigmoid function $\sigma$.

So let's propagate this error back through the network, updating the path weights as we go. The weight $v_1$ on the path from the top hidden node to the output node is modified as follows, where $h_1$ is that node's output from the forward pass:

$v_1 \leftarrow v_1 + \eta \cdot \text{Error} \cdot h_1$

Similarly, we update the weight $v_2$ on the path from the bottom hidden node to the output node:

$v_2 \leftarrow v_2 + \eta \cdot \text{Error} \cdot h_2$

Next consider the sigmoid function. The chain rule requires us to multiply our back-propagated values by the derivative of $\sigma$. As it turns out (the proof of this is left as an exercise):

$\sigma'(t) = \sigma(t)\,(1 - \sigma(t))$

The values $\sigma(t)$ here are precisely the hidden-node outputs we obtained during the forward propagation process, so by saving those values we can now compute the derivative easily. Recall that the output from the top hidden node, just after $\sigma$ was applied, was 0.9677045353. The corresponding derivative is

$0.9677045353 \times (1 - 0.9677045353) \approx 0.03125$

and similarly for the lower hidden node. As we back-propagate the Error, it is first multiplied by the original path weight, and then by the derivative we just computed:

$\delta_1 = \text{Error} \cdot v_1 \cdot \sigma'(z_1)$
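The sigmoid-derivative identity used just above (the exercise the text leaves to the reader) can be verified directly from the definition:

```latex
\sigma(t) = \frac{1}{1 + e^{-t}}
\qquad\Longrightarrow\qquad
\sigma'(t) = \frac{e^{-t}}{\left(1 + e^{-t}\right)^{2}}
           = \frac{1}{1 + e^{-t}} \cdot \frac{e^{-t}}{1 + e^{-t}}
           = \sigma(t)\,\bigl(1 - \sigma(t)\bigr)
```

where the last step uses $1 - \sigma(t) = \dfrac{e^{-t}}{1 + e^{-t}}$.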
Similarly for the lower hidden node: $\delta_2 = \text{Error} \cdot v_2 \cdot \sigma'(z_2)$, where $v_2$ is the original weight from that node to the output. Now we adjust the weights from the input layer to the hidden layer using these back-propagated errors:

$w_{ij} \leftarrow w_{ij} + \eta\,\delta_i\,x_j$

It can be shown that with these new weights, our neural network will yield an output closer to the target value 3.0 than our first attempt, and by iterating this process many times we can get ever closer to the target. The real power of the process, though, is that the trained network can generalize: it can accurately estimate output values for inputs it has never seen before.
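The full training loop, combining the feed-forward and back-propagation steps above, can be sketched as follows. The network shape (two inputs, two sigmoid hidden nodes, one linear output) and the target label 3.0 match the worked example; the initial weights, input values, learning rate, and iteration count are illustrative assumptions:

```python
import numpy as np

def sigmoid(z):
    # The logistic sigmoid nonlinearity used for the hidden layer.
    return 1.0 / (1.0 + np.exp(-z))

rng = np.random.default_rng(0)
W = rng.normal(scale=0.5, size=(2, 2))  # input -> hidden weights (small random)
v = rng.normal(scale=0.5, size=2)       # hidden -> output weights (small random)

x = np.array([1.0, 2.0])  # a single training input (illustrative)
t = 3.0                   # its label: the output we want
eta = 0.1                 # learning rate: the small-step multiplier

history = []
for _ in range(5000):
    # Feed forward.
    h = sigmoid(W @ x)    # hidden-node outputs
    y = v @ h             # linear output node
    err = t - y           # error at the output node
    history.append(abs(err))

    # Back-propagate: error times original path weight times sigmoid derivative.
    delta = err * v * h * (1.0 - h)

    # Take small steps along the negative gradient of the squared error.
    v += eta * err * h
    W += eta * np.outer(delta, x)

final_y = v @ sigmoid(W @ x)  # prediction after training
```

On this single example the error shrinks steadily toward zero, which is the behavior the walkthrough describes; real training would loop over many labeled examples, so that the network generalizes rather than memorizes.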