CSE-E4810 Machine Learning and Neural Networks (5 cr)
Lecture 1: Introduction to Neural Networks
Prof. Juha Karhunen
https://mycourses.aalto.fi/course/view.php?id=13086
Aalto University School of Science, Espoo, Finland
Artificial neural networks
Consist of simple, adaptive processing units, often called neurons. The neurons are interconnected, forming a large network. Computation takes place in parallel, often layer by layer. Nonlinearities are typically used in the computations. An important property of neural networks is that they learn from input data. Artificial neural networks have their roots in many areas, including neuroscience and neurobiology, mathematics and statistics, artificial intelligence, statistical physics, engineering, and signal processing.
Example of an artificial neural network
The figure shows a fully connected feedforward network. There are three layers: input layer, hidden layer, and output layer. In such a network, computations proceed layer by layer from the input layer to the output layer. The input layer of 10 neurons only feeds the components of the data vector into the network; all the computations take place in the middle hidden layer of four neurons and in the output layer of two neurons. In this example, the input (data) vectors are 10-dimensional and the output vectors two-dimensional.
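As a minimal sketch (not code from the course material), the layer-by-layer computation of such a 10-4-2 network can be written as follows; the tanh nonlinearity and the random weight values are illustrative assumptions:

```python
import numpy as np

rng = np.random.default_rng(0)

# Layer sizes from the example: 10 inputs, 4 hidden neurons, 2 outputs.
W1 = rng.standard_normal((4, 10)) * 0.1   # hidden-layer weights
b1 = np.zeros(4)
W2 = rng.standard_normal((2, 4)) * 0.1    # output-layer weights
b2 = np.zeros(2)

def forward(x):
    """Layer-by-layer computation, with a nonlinearity in the hidden layer."""
    h = np.tanh(W1 @ x + b1)   # hidden layer: weighted sums through tanh
    y = W2 @ h + b2            # output layer (linear here)
    return y

x = rng.standard_normal(10)    # a 10-dimensional input (data) vector
y = forward(x)                 # a two-dimensional output vector
```

Each neuron thus computes a weighted sum of its inputs followed by a (typically nonlinear) activation, and the layers are evaluated in sequence.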
Neural computing was inspired by computing in human brains. Neural networks resemble the brain in two respects:
1. The network acquires knowledge from its environment using a learning process (algorithm).
2. Synaptic weights, which are interneuron connection strengths, are used to store the learned information.
This is very different from digital computing. However, artificial neural network methods are in practice realized using standard digital computers, because standard computers have a huge advantage in usability over neurocomputers (hardware realizations of neural networks).
Computational intelligence
Computational intelligence is a broader area, which includes:
- Neural networks
- Fuzzy systems
- Evolutionary computing (especially genetic algorithms)
- Artificial intelligence
- Other machine learning approaches, such as graphical modeling and Bayesian methods
Our three machine learning courses cover neural networks and machine learning.
Application areas of neural networks
Neural networks have applications in many branches of science and engineering, including:
- Modeling of nonlinear systems and mappings
- Time series processing
- Pattern recognition
- Signal processing
- Automatic control
- Engineering
- Business life and banking
- Many applied sciences
Figure 1: An example application of neural networks in business.
Benefits of neural networks

Nonlinearity
Allows modeling of nonlinear functions and processes. The nonlinearity is distributed through the network: each neuron typically has a nonlinear output. Using nonlinearities has drawbacks, too: local minima, difficult analysis, and no easy closed-form solutions as in the linear case.

Input-output mapping
In supervised learning, the input-output mapping is learned from training data, for example from known prototypes in classification. Typically, some statistical criterion is used, and the synaptic weights (free parameters) are modified to optimize the criterion. After the input-output mapping has been learned, it can be used for mapping new input vectors.

Adaptivity
Weights (parameters) can be retrained with new data, so the network can adapt to a nonstationary environment. However, the changes must be slow enough.

Fault tolerance and VLSI implementability
Neural networks are well suited for very-large-scale integration (VLSI) technology. If neurons are damaged, the performance degrades gradually; standard computers do not have this property. Some neurocomputers have been built, but their programming and use is difficult.

Neurobiological analogy
Human brains are fast, powerful, fault tolerant, and use massively parallel computing. Neurobiologists try to explain the operation of human brains using artificial neural networks, while engineers use neural computation principles for solving complex problems.
Learning types
Two major categories: supervised and unsupervised learning.

Supervised learning
Some amount of training data is available. The training data consist of known input-output pairs; the known outputs are sometimes called desired responses. The training data are used to learn the weights of the network. One can then use the input-output mapping learned in this way to map unseen new data vectors. The quality of learning is measured using a suitable criterion, such as the mean-square error between the outputs of the network and the corresponding desired responses.
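As a small illustration (the numbers below are made up, not from the course material), the mean-square error criterion over a set of training pairs is simply the average squared difference between network outputs and desired responses:

```python
import numpy as np

# Toy example: three training pairs with two-dimensional outputs.
outputs = np.array([[0.9, 0.1],   # network outputs
                    [0.2, 0.7],
                    [0.8, 0.3]])
desired = np.array([[1.0, 0.0],   # known desired responses
                    [0.0, 1.0],
                    [1.0, 0.0]])

# Mean-square error over all training pairs and output components.
mse = np.mean((outputs - desired) ** 2)
```

Supervised learning then amounts to adjusting the weights so that this criterion decreases.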
Unsupervised learning
There are no known input-output training pairs available, only data (input) vectors. Unsupervised learning methods typically fit a chosen type of model to the input data. The parameters of the model (the weights of the neural network) are learned from the input data. A suitable statistical criterion is used to measure the quality of learning.

Other types of learning
In semi-supervised learning, there is a small amount of labeled training data but lots of unlabeled data; both are used in learning. This is a common situation nowadays, as for example the internet provides lots of data but labeling it is costly and/or time-consuming. In reinforcement learning one knows the desired output only coarsely: a reward can be given for good performance and/or a punishment for poor performance. Humans and animals typically learn in this way. A more advanced mathematical form of reinforcement learning is dynamic programming, where optimization of the reward is based on the combined effect of several sequential decisions. We shall not discuss these learning types in our course.
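To make the unsupervised case concrete, here is a minimal sketch of fitting a model to unlabeled input vectors. K-means clustering is used purely as an illustrative choice of model; it is not mentioned in the slides:

```python
import numpy as np

def kmeans(X, k, n_iter=20, seed=0):
    """Minimal k-means: alternately assign each data vector to its nearest
    centroid, then move each centroid to the mean of its assigned vectors."""
    rng = np.random.default_rng(seed)
    centroids = X[rng.choice(len(X), size=k, replace=False)]
    for _ in range(n_iter):
        # Assignment step: nearest centroid for every data vector.
        d = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
        labels = d.argmin(axis=1)
        # Update step: recompute centroids from the assigned vectors.
        for j in range(k):
            if np.any(labels == j):
                centroids[j] = X[labels == j].mean(axis=0)
    return centroids, labels

# Two well-separated blobs of unlabeled 2-D data.
rng = np.random.default_rng(1)
X = np.vstack([rng.normal(0, 0.3, (50, 2)),
               rng.normal(5, 0.3, (50, 2))])
centroids, labels = kmeans(X, k=2)
```

Note that no desired responses appear anywhere: the model parameters (here, the centroids) are learned from the input vectors alone, with the within-cluster squared distance acting as the statistical criterion.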
A short history of neural networks
McCulloch and Pitts presented in 1943 the first simple mathematical model of a neuron, with no learning. In 1958 Rosenblatt introduced the perceptron, the first computational neural network with learning. In 1960 Widrow introduced the Widrow-Hoff learning rule and network structures associated with it. This learning rule for a single neuron has found widespread use in adaptive signal processing under the name LMS (least mean squares) algorithm. In 1969 Minsky and Papert criticized the perceptron in their book for its limited capability, which led to a slowdown of neural network research in the 1970s.
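The Widrow-Hoff (LMS) rule mentioned above adapts a single neuron's weights in proportion to the output error and the input. A minimal sketch with made-up data; the true weights and the learning rate are illustrative assumptions:

```python
import numpy as np

rng = np.random.default_rng(0)
w_true = np.array([2.0, -1.0, 0.5])    # unknown weights to be identified
w = np.zeros(3)                        # adaptive weights, start at zero
eta = 0.05                             # learning rate (step size)

for _ in range(2000):
    x = rng.standard_normal(3)         # input vector at this time step
    d = w_true @ x                     # desired response (noise-free here)
    e = d - w @ x                      # error of the neuron's output
    w += eta * e * x                   # Widrow-Hoff / LMS update
```

With a small enough step size the weights converge toward the true values, which is why the rule is a workhorse of adaptive filtering.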
A boom attracting lots of researchers to study neural networks began around 1985, with several new promising approaches: Hopfield's network, multilayer perceptrons using backpropagation learning, and the self-organizing map. This strong research activity continued largely into the 1990s. During the last decade many researchers have moved from neural networks to study other machine learning methods and data mining. However, neural networks have many real-world applications in engineering, science, and business, with many conferences and journals still covering their recent developments.
Emerging research topics
Recently neural networks have again become popular, mainly due to deep learning, where one uses neural networks with many layers. We shall discuss it only briefly on the last lecture because it is a difficult topic. By training deep neural networks wisely, world records have been achieved in many benchmark classification problems. Another new research topic is cognitive computing, whose ultimate goal is to build brain-like cognitive computing chips. The SyNAPSE project tries to combine neuroscience, supercomputing, and nanotechnology to achieve that goal.
Examples of applications with real-world data

Classification of handwritten digits
Deep belief networks (DBNs) are advanced neural network methods for nonlinear mapping and classification. They use a stack of restricted Boltzmann machines. We shall discuss these topics briefly in the last lecture (lecture 13). Data: handwritten digits (0, 1, 2, ..., 9) from the widely used MNIST benchmark database. The MNIST data are often used for testing the performance of different mapping and/or classification methods. By mapping the high-dimensional handwritten digit data to two dimensions, one can assess visually the quality of the mapping.
One can also compare the classification errors of different methods on the MNIST data. Figure 2 shows that the DBN provides a nonlinear mapping which separates the digits pretty well even in two dimensions. The classification error of the deep belief network is only 1.0%. This is smaller than for multilayer perceptrons (1.6%) and for support vector machines (1.4%). These widely used neural network methods are discussed in more detail later in this course. Principal component analysis is a widely used linear mapping method, but it provides a much worse mapping than the DBN in this example; see Figure 3.
Figure 2: Mapping of MNIST data using a deep belief network.
Figure 3: Mapping of MNIST data using principal component analysis.
Web mining using self-organizing maps
The self-organizing map (SOM) is a useful tool for visualizing and arranging data, with many real-world applications. It was developed by Prof. Teuvo Kohonen in our laboratory. The next figure shows an application of the SOM to web mining of a huge patent dataset of some 6.8 million patents. The self-organizing map computed had about one million neurons. From the map, one can search by keywords for patents closely related by their contents. The map insets in the figure show the results of a coarse, medium, and fine search.
Figure 4: WEBSOM document map of 6.8 million patents.
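A minimal sketch of the SOM training loop underlying such maps; the grid size, learning-rate schedule, and neighborhood schedule below are illustrative assumptions, on a far smaller scale than the million-neuron WEBSOM:

```python
import numpy as np

def train_som(X, grid=(10, 10), n_iter=2000, seed=0):
    """Minimal 2-D self-organizing map: find the best-matching unit (BMU)
    for a sample, then pull the BMU and its grid neighbors toward it."""
    rng = np.random.default_rng(seed)
    h, w = grid
    weights = rng.standard_normal((h, w, X.shape[1]))
    coords = np.dstack(np.meshgrid(np.arange(h), np.arange(w), indexing="ij"))
    for t in range(n_iter):
        frac = t / n_iter
        lr = 0.5 * (1.0 - frac)                    # decaying learning rate
        sigma = max(1.0, (h / 2) * (1.0 - frac))   # shrinking neighborhood
        x = X[rng.integers(len(X))]                # pick a random sample
        # BMU: the neuron whose weight vector is closest to the sample.
        d = np.linalg.norm(weights - x, axis=2)
        bmu = np.unravel_index(d.argmin(), d.shape)
        # Gaussian neighborhood on the grid around the BMU.
        g = np.exp(-np.sum((coords - np.array(bmu)) ** 2, axis=2)
                   / (2 * sigma ** 2))
        weights += lr * g[..., None] * (x - weights)
    return weights

rng = np.random.default_rng(1)
X = rng.standard_normal((500, 3))   # toy 3-D data; patent documents would be
                                    # high-dimensional term vectors instead
W = train_som(X, grid=(5, 5), n_iter=1000)
```

Because neighboring grid units are updated together, nearby neurons end up with similar weight vectors, which is what makes the map useful for visualizing and arranging documents by content.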
Analysis of world climate data
In this course, we discuss independent component analysis (ICA). ICA can often find more meaningful components from vector-valued input data than, for example, PCA. Denoising source separation (DSS) is an ICA-related technique which can utilize prior information. DSS and ICA methods have been developed in our laboratory. In this example, we consider the application of DSS techniques to world climate data: a huge data set of daily weather measurements over 56 years at 10,000 locations over the globe. Quantities such as surface temperature, precipitation, air pressure, and cloudiness were measured.
Figure 5: A satellite image of earth.
Figure 6: The component describing global warming separated by DSS.
DSS with suitable prior information can extract a component which clearly corresponds to global warming. The previous figure shows it both with respect to time (upper curve) and location (world map). The uppermost curve in the next figure depicts the component extracted by DSS with the largest spatial interannual variability. It describes the El Niño phenomenon quite well; cf. the third curve, which is the climatological El Niño index. The two other curves are derivatives of the El Niño phenomenon:
- Separated by DSS (component 2);
- Computed from the climatology index (component 4).
The red curves show the mean value of the component.
Figure 7: The two uppermost components separated by DSS have the largest interannual variability. The third curve is the El Niño index used in climatology and the fourth one is its derivative.
The last image shows the spatial patterns corresponding to the El Niño component found by denoising source separation:
- Surface temperature (top subfigure);
- Sea level pressure (middle subfigure);
- Precipitation (bottom subfigure).
Red color in all the spatial images shows values larger than normal; respectively, blue color depicts values smaller than normal.
Figure 8: Surface temperature (top), sea level pressure (middle), and precipitation (bottom) corresponding to the first component found by DSS shown in the previous figure.
More on neural networks and machine learning

Useful books
1. E. Alpaydin, Introduction to Machine Learning, 3rd ed., The MIT Press, 2014. Used as the textbook in our course T-61.3050 Machine Learning: Basic Principles. This undergraduate-level book deals mainly with machine learning methods other than neural networks.
2. C. Bishop, Pattern Recognition and Machine Learning, Springer, 2006. A graduate-level textbook which is a useful reference, especially on probabilistic methods. It is too difficult, and deals too little with neural networks, for the purposes of our course.
Some examples from this book are presented in our course.
3. S. Haykin, Neural Networks: A Comprehensive Foundation, 2nd ed., Prentice-Hall, 1998. This book was used earlier when we had two courses on neural networks. It is too extensive (800 pages) for the purposes of our course. However, we use its Chapter 6 on support vector machines, and the matters of lecture 11 (processing of temporal information) are taken from its Chapter 13.
4. S. Haykin, Neural Networks and Learning Machines, 3rd ed., Pearson Int. Ed., 2009. This new 3rd edition of the previous book is not markedly better than the 2nd edition, but now has more than 900 pages. Chapters have been restructured, some new material has been added, and some has been left out. The main problem of this book is that matters are discussed throughout too extensively and in too much detail.
5. K. Murphy, Machine Learning: A Probabilistic Perspective, The MIT Press, 2012. An excellent and extensive (over 1000 pages) new book on probabilistic machine learning methods, but it hardly discusses neural networks at all.

Journals publishing new research results
Journals on neural network research: Neural Computation, IEEE Transactions on Neural Networks, Neural Networks, Neurocomputing, Neural Processing Letters, International Journal of Neural Systems. Many of these also publish articles on other machine learning methods.
Journals on machine learning research: Machine Learning, Journal of Machine Learning Research.

International conferences
IJCNN, the IEEE Int. Joint Conf. on Neural Networks, is the largest neural network conference in the world. ICANN, the Int. Conf. on Artificial Neural Networks, is the premier European conference on neural networks and now also machine learning. NIPS, Neural Information Processing Systems, is a high-quality conference on machine learning and neural networks. ICML, the Int. Conf. on Machine Learning, is a high-quality machine learning conference. ECML, the European Conf. on Machine Learning, is the corresponding good-quality European conference. There are many other smaller and/or lower-quality conferences. Usually new research results are first published in conferences, and valuable enough ones later on, in expanded form, in journals.